A collection of background reading we’ve found particularly helpful or interesting. Unfortunately some of it is locked up behind paywalls. If you don’t have access to an academic library, we encourage you to contact the authors directly to request a copy — they are usually more than happy to provide one.
Why Liberate Electricity Data?
- The importance of open data and software: Is energy research lagging behind? (Energy Policy, 2017). Open community modeling frameworks have become common in many scientific disciplines, but not yet in energy. Why is that, and what are the consequences?
- Opening the black box of energy modeling: Strategies and lessons learned (Energy Strategy Reviews, 2018). A closer look at the benefits available from using shared, open energy system models, including less duplicated effort, more transparency, and better research reproducibility.
- Open Power System Data: Frictionless data for electricity system modeling (Applied Energy, 2019). An explanation of the motivation and process behind the European OPSD project, which is analogous to our PUDL project, also making use of Frictionless Data Packages.
- Open Data for Electricity Modeling (German Federal Ministry for Economic Affairs and Energy, 2018). A white paper exploring the legal and technical issues surrounding the use of public data for academic energy system modeling. Primarily focused on the EU, but more generally applicable. Based on a BMWi-hosted workshop Catalyst took part in during September 2018.
- Electric Power Systems: A Conceptual Introduction (Alexandra von Meier, 2006). A book by the UC Berkeley professor offering a great conceptual overview of how the grid works — and has worked for pretty much the past century. It’s not an electrical engineering textbook, but it does assume the reader has an undergraduate-level understanding of math and physics.
Research Computing & Data Hygiene:
- Good enough practices in scientific computing (PLOS Computational Biology, 2017). A whitepaper from the organizers of Software Carpentry on good habits to ensure your work is reproducible and reusable — both by yourself and others!
- Best practices for scientific computing (PLOS Biology, 2014). An earlier version of the above whitepaper aimed at a more technical, data-oriented set of scientific users.
- Tidy Data (The Journal of Statistical Software, 2014). A more complete exploration of the reasons for and benefits of organizing data into single-variable, homogeneously typed columns and complete single-observation records, as suggested by the best/good-enough practices papers above.
- A Simple Guide to Five Normal Forms: A 1983-vintage rundown of database normalization. Short and informal, but understandable, with a few good illustrative examples. Bonus points for using ASCII art.
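As a tiny illustration of the tidy-data idea discussed above — a sketch using made-up generation data (the column names are hypothetical, not from any real dataset):

```python
import pandas as pd

# A "messy" wide table: the year variable is smeared across column headers.
messy = pd.DataFrame({
    "plant": ["A", "B"],
    "2019": [100, 200],
    "2020": [110, 190],
})

# Tidy form: every column is a single variable, every row a single observation.
tidy = messy.melt(id_vars="plant", var_name="year", value_name="net_gen_mwh")
print(tidy)
```

The tidy version has one row per (plant, year) observation, which is the shape most pandas grouping and plotting operations expect.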
Software Engineering and Python Tools:
- Structuring Your Python Project: How to organize a Python project into modules and packages that are easy for users to understand and install. It includes a sample package available via GitHub.
- Python Testing with PyTest: a 2017 book by Brian Okken on how to use PyTest effectively. It’s very focused on PyTest in particular, rather than more general testing methodologies, but if you’re working in Python, and want to start testing your code, it’s a good reference.
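To give a flavor of what PyTest-style testing looks like, here is a minimal sketch — the function and its domain are invented for illustration, and you would save this as a `test_*.py` file and run `pytest` to collect it:

```python
def heat_rate(mmbtu, mwh):
    """Fuel heat input per unit of net generation (mmBtu/MWh). Hypothetical helper."""
    if mwh == 0:
        raise ValueError("no generation")
    return mmbtu / mwh


# PyTest discovers any function whose name starts with "test_" and
# treats a plain assert as the test condition -- no boilerplate needed.
def test_heat_rate():
    assert heat_rate(10.0, 1.0) == 10.0
    assert heat_rate(10.0, 2.0) == 5.0
```

The lack of ceremony (no classes, no special assertion methods) is a big part of why PyTest is pleasant to adopt incrementally.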
Pandas is one of the main libraries used for data analysis in Python. It makes powerful manipulation of heterogeneous tabular data relatively easy. We use Pandas so much that it deserves its own section, separate from the more general software engineering and Python tools.
- Method Chaining: a readable, compact way to pass pandas dataframes through many different transformations all at once.
- Indexes: Pandas dataframe indices can be complex, but they’re also a powerful way to access and manipulate your data.
- Fast Pandas: If you have a dataframe with millions of rows, you need to know what is fast and what is slow.
- Tidy Data: Tips on how and why to re-shape data appropriately for the analysis at hand. Kind of a simplified Pandas version of database normalization guidelines.
- Visualization: How to get a visual window into the data you’re manipulating on the fly.
- Time Series: Pandas offers a lot of rich methods for manipulating and analyzing time series… but they can be a real pain in the butt to get familiar with up front.
- Scaling: How do you work with data that’s too big to hold in your computer’s memory all at once? How can you distribute computationally intensive tasks across a high performance computing cluster? Dask marries pandas to a task graph generator to scale dataframes up to medium sized data science applications (100s of GB), which is perfect for the larger datasets that PUDL is targeting.
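To show the method-chaining style mentioned above in miniature — a sketch over a made-up dataframe (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "plant": ["A", "A", "B", "B"],
    "fuel": ["coal", "gas", "coal", "gas"],
    "mwh": [100.0, 50.0, 80.0, 120.0],
})

# Each step reads top to bottom, with no throwaway intermediate variables.
summary = (
    df
    .query("mwh > 60")                       # filter rows
    .assign(gwh=lambda d: d.mwh / 1000)      # derive a new column
    .groupby("fuel", as_index=False)
    .agg(total_gwh=("gwh", "sum"))           # named aggregation
)
print(summary)
```

Wrapping the chain in parentheses lets each method call sit on its own line, which makes long transformations much easier to review in a diff.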
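The time series machinery is easier to appreciate with a tiny example — here a synthetic hourly series is downsampled to daily means (the data is invented for illustration):

```python
import pandas as pd

# Two days of hourly observations on a DatetimeIndex.
idx = pd.date_range("2020-01-01", periods=48, freq="h")
load = pd.Series(range(48), index=idx, name="load_mw")

# resample() groups by calendar period; here "D" buckets by day.
daily = load.resample("D").mean()
print(daily)
```

The same pattern handles upsampling, rolling windows, and timezone-aware data once the index is a proper `DatetimeIndex`.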
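Dask’s dataframe API mirrors pandas, but the underlying idea — operate on one piece of the data at a time and combine partial results — can be sketched with plain pandas chunked reading (the CSV here is synthetic, standing in for a file too large to load at once):

```python
import io

import pandas as pd

# Synthetic in-memory "file"; in practice this would be a path to a huge CSV.
csv = io.StringIO("plant,mwh\nA,100\nB,200\nA,50\nB,25\n")

# chunksize makes read_csv yield dataframes of at most N rows, so only one
# chunk is ever held in memory while we accumulate a running aggregate.
total = 0.0
for chunk in pd.read_csv(csv, chunksize=2):
    total += chunk["mwh"].sum()
print(total)  # 375.0
```

Dask automates this chunk-and-combine pattern across many files and many cores by building a task graph, which is what lets it scale the familiar dataframe interface to the 100s-of-GB datasets mentioned above.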