Background Reading - Catalyst Cooperative

Background reading we’ve found particularly helpful or interesting. Unfortunately, some of it is locked up behind paywalls. If you don’t have access to an academic library, you can always contact the authors directly and ask for a copy. You might also want to check out a project called Sci-Hub.

Energy Systems & Regulation

Why liberate energy data?

The importance of open data and software: Is energy research lagging behind? (Energy Policy, 2017). Open community modeling frameworks have become common in many scientific disciplines, but not yet in energy. Why is that, and what are the consequences?
Opening the black box of energy modeling: Strategies and lessons learned (Energy Strategy Reviews, 2018). A closer look at the benefits available from using shared, open energy system models, including less duplicated effort, more transparency, and better research reproducibility.
Open Power System Data: Frictionless data for electricity system modeling (Applied Energy, 2019). An explanation of the motivation and process behind the European OPSD project, which is analogous to our PUDL project, also making use of Frictionless Data Packages.
Open Data for Electricity Modeling (German Ministry for Economic Affairs and Energy, 2018). A white paper exploring the legal and technical issues surrounding the use of public data for academic energy system modeling. Primarily focused on the EU, but more generally applicable. Based on a BMWi-hosted workshop that Catalyst took part in during September 2018.

How does electricity work in the US?

Short Circuiting Policy (2020) by UC Santa Barbara professor Leah Stokes explores how electric utilities engage in state-level political systems in the US, and in particular how in some cases they have successfully resisted the transition to clean energy systems.
Managing the Utility Financial Transition is a collection of white papers compiled by Energy Innovation in 2018, exploring financial tools and strategies to help electric utilities retire fossil generation assets early. It focuses particularly on using ratepayer backed bonds to securitize power plants that cost more to operate than building new wind and solar from scratch.
Making Climate Policy Work by Danny Cullenward and David G. Victor looks at carbon pricing and emissions offsets as climate policies, and makes the case that however good they might be from an economist’s point of view, they face enormous structural challenges politically, and are unlikely to play more than a niche role in decarbonization.
Solar Power Finance Without the Jargon by Jenny Chase is a personable, accessible overview of the whole solar PV industry and supply chain, exploring how it works from manufacturing, to finance and deployment, and the markets solar plays in now. Updated in late 2023.
Electric Power Systems: A Conceptual Introduction. A book by UC Berkeley professor Alexandra von Meier, published in 2006. A great conceptual overview of how the grid works, and has worked for pretty much the past century. It’s not an electrical engineering textbook, but it does assume the reader has an undergraduate level understanding of math and physics.

Software & Data Management

Research Computing & Data Hygiene

Research Software Engineering with Python (2020) is an online textbook for people who want to specialize in writing maintainable software that enables reproducible, open science. It’s a great longer-form resource for self-guided learning, or use in a semester-long course. Written by some of the same people as the scientific computing practices papers below.
Good enough practices in scientific computing (PLOS Computational Biology, 2017). A whitepaper from the organizers of Software Carpentry on good habits to ensure your work is reproducible and reusable — both by yourself and others!
Best practices for scientific computing (PLOS Biology, 2014). An earlier version of the above whitepaper aimed at a more technical, data-oriented set of scientific users.
Tidy Data (The Journal of Statistical Software, 2014). A more complete exploration of the reasons behind and benefits of organizing data into single variable, homogeneously typed columns, and complete single observation records, as suggested by the best/good enough practices papers above.
A Simple Guide to Five Normal Forms: A 1983 vintage rundown of data normalization. Short and informal but understandable, with a few good illustrative examples. Bonus points for using ASCII art.
Structuring your Python Project: How to organize a Python project into modules that are packaged for easy interpretation by users. It includes a dummy package available via GitHub.
Python Testing with pytest: a 2017 book by Brian Okken on how to use pytest effectively. It’s very focused on pytest in particular, rather than more general testing methodologies, but if you’re working in Python, and want to start testing your code, it’s a good reference.
Everyone wants to do the model work, not the data work: A paper from Google Research looking at the consequences of undervaluing data curation and preparation in high stakes ML/AI applications. They especially focus on “data cascades” with negative downstream impacts in humanitarian applications of AI, and find a need for dramatically better application of best practices in data preparation, documentation, and curation, if AI is to live up to its promise.
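
For readers who haven't met the "tidy data" idea mentioned above, here is a minimal sketch of what it can look like in pandas. The table, the column names (`plant_id`, `net_generation_mwh`), and the numbers are all made up for illustration and do not come from any of the readings or from PUDL.

```python
import pandas as pd

# Hypothetical "wide" table: one row per plant, one column per year.
# The year columns mix a variable (year) into the column headers.
wide = pd.DataFrame(
    {
        "plant_id": [1, 2],
        "2019": [100.0, 250.0],
        "2020": [110.0, 240.0],
    }
)

# Reshape so each row is a single observation (one plant in one year)
# and each column holds a single, homogeneously typed variable.
tidy = (
    wide.melt(id_vars="plant_id", var_name="year", value_name="net_generation_mwh")
    .astype({"year": "int64"})
    .sort_values(["plant_id", "year"])
    .reset_index(drop=True)
)

print(tidy)
```

The tidy version is longer, but filtering, grouping, and joining on `year` become ordinary column operations rather than special cases.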

Analysis with Pandas & Dask Dataframes

Method Chaining: a readable, compact way to pass pandas dataframes through many different transformations all at once.
Indexes: Pandas dataframe indices can be complex, but they’re also a powerful way to access and manipulate your data.
Fast Pandas: If you have a dataframe with millions of rows, you need to know what is fast and what is slow.
Tidying Data with Pandas: Tips on how and why to re-shape data appropriately for the analysis at hand. Kind of a simplified Pandas version of database normalization guidelines.
Visualization: How to get a visual window into the data you’re manipulating on the fly.
Time Series: Pandas offers a lot of rich methods for manipulating and analyzing time series… but they can be challenging to get familiar with up front.
Scaling: How do you work with data that’s too big to hold in your computer’s memory all at once? How can you distribute computationally intensive tasks across a high performance computing cluster? Dask marries pandas to a task graph generator to scale dataframes up to medium sized data science applications (100s of GB), which is perfect for the larger datasets that PUDL is targeting.
How to learn Dask in 2020 is a collection of articles and self-guided tutorials that combine videos with interactive Jupyter notebooks. It works through many of the core concepts of Dask, and was presented at SciPy 2020. If you’re already familiar with NumPy, SciPy, and Pandas, this is still easily a full day of material.
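
As a rough taste of the method chaining and time series topics listed above, the sketch below passes a small dataframe through several transformations in one chained expression. The data and column names (`plant_id`, `datetime_utc`, `gross_load_mw`) are hypothetical, and this is only one of many ways to write such a pipeline.

```python
import pandas as pd

# Hypothetical hourly records for two made-up plants, with one missing value.
hourly = pd.DataFrame(
    {
        "plant_id": [1, 1, 1, 2, 2, 2],
        "datetime_utc": pd.to_datetime(
            ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-02 00:00"] * 2
        ),
        "gross_load_mw": [10.0, 20.0, 30.0, 5.0, None, 15.0],
    }
)

# Each step returns a new dataframe, so the steps read top to bottom
# without intermediate variables.
daily = (
    hourly
    .dropna(subset=["gross_load_mw"])  # drop hours with no reported load
    .assign(date=lambda df: df["datetime_utc"].dt.floor("D"))  # truncate to the day
    .groupby(["plant_id", "date"], as_index=False)
    .agg(mean_load_mw=("gross_load_mw", "mean"))  # daily mean per plant
    .sort_values(["plant_id", "date"])
    .reset_index(drop=True)
)

print(daily)
```

Much of this style carries over to Dask dataframes, which implement a subset of the pandas API, though not every method used here is guaranteed to behave identically there.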