Background Reading

Background reading we’ve found particularly helpful or interesting. Unfortunately, some of it is locked up behind paywalls. If you don’t have access to an academic library, you can always contact the authors directly and ask for a copy. You might also want to check out a project called Sci-Hub.

Energy Systems & Regulation

How does electricity work in the US?

  • Electric Power Systems: A Conceptual Introduction (2006), a book by UC Berkeley professor Alexandra von Meier. A great conceptual overview of how the grid works, and has worked for pretty much the past century. It’s not an electrical engineering textbook, but it does assume the reader has an undergraduate-level understanding of math and physics.
  • Short Circuiting Policy (2020), by UC Santa Barbara professor Leah Stokes, explores how electric utilities engage in state-level political systems in the US, and in particular how in some cases they have successfully resisted the transition to clean energy.
  • Managing the Utility Financial Transition is a collection of white papers compiled by Energy Innovation in 2018, exploring financial tools and strategies to help electric utilities retire fossil generation assets early. It focuses particularly on using ratepayer-backed bonds to securitize power plants that cost more to operate than it would cost to build new wind and solar from scratch.

Software & Data Management

Research Computing & Data Hygiene

  • Research Software Engineering with Python (2020) is an online textbook for people who want to specialize in writing maintainable software that enables reproducible, open science. It’s a great longer-form resource for self-guided learning, or for use in a semester-long course. Written by some of the same people as the scientific computing practices papers below.
  • Good enough practices in scientific computing (PLOS Computational Biology, 2017). A whitepaper from the organizers of Software Carpentry on good habits to ensure your work is reproducible and reusable — both by yourself and others!
  • Best practices for scientific computing (PLOS Biology, 2014). An earlier version of the above whitepaper aimed at a more technical, data-oriented set of scientific users.
  • Tidy Data (Journal of Statistical Software, 2014). A more complete exploration of the reasons for, and benefits of, organizing data into columns that each contain a single, homogeneously typed variable and rows that each contain a single complete observation, as suggested by the best/good enough practices papers above.
  • A Simple Guide to Five Normal Forms: A 1983-vintage rundown of data normalization. Short and informal, but understandable, with a few good illustrative examples. Bonus points for using ASCII art.
  • Structuring your Python Project: How to organize a Python project into modules and package it so that it’s easy for users to understand and work with. It includes a dummy package available via GitHub.
  • Python Testing with pytest: a 2017 book by Brian Okken on how to use pytest effectively. It’s very focused on pytest in particular, rather than on more general testing methodologies, but if you’re working in Python and want to start testing your code, it’s a good reference. A minimal sketch of what a pytest test looks like follows this list.
  • Everyone wants to do the model work, not the data work: A paper from Google Research looking at the consequences of undervaluing data curation and preparation in high-stakes ML/AI applications. It focuses especially on “data cascades” with negative downstream impacts in humanitarian applications of AI, and finds a need for dramatically better application of best practices in data preparation, documentation, and curation if AI is to live up to its promise.
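
For a taste of what pytest looks like in practice, here’s a minimal sketch; the module, function, and test names are all made up for illustration:

    # contents of test_fuel.py -- pytest collects files and functions
    # whose names begin with "test_"
    import pytest

    def mmbtu_per_ton(heat_content_mmbtu, fuel_tons):
        """Heat content per ton of fuel. A hypothetical example function."""
        if fuel_tons <= 0:
            raise ValueError("fuel_tons must be positive")
        return heat_content_mmbtu / fuel_tons

    def test_mmbtu_per_ton():
        # pytest.approx avoids brittle exact floating point comparisons
        assert mmbtu_per_ton(37.5, 1.5) == pytest.approx(25.0)

    def test_mmbtu_per_ton_rejects_zero_tons():
        with pytest.raises(ValueError):
            mmbtu_per_ton(37.5, 0.0)

Running pytest from the same directory will discover and run both tests without any additional configuration.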

Analysis with Pandas & Dask Dataframes

  • Method Chaining: a readable, compact way to pass pandas dataframes through many different transformations all at once. A short sketch of the pattern follows this list.
  • Indexes: Pandas dataframe indices can be complex, but they’re also a powerful way to access and manipulate your data.
  • Fast Pandas: If you have a dataframe with millions of rows, you need to know what is fast and what is slow.
  • Tidying Data with Pandas: Tips on how and why to reshape data appropriately for the analysis at hand. Kind of a simplified Pandas version of database normalization guidelines. See the reshaping sketch after this list.
  • Visualization: How to get a visual window into the data you’re manipulating on the fly.
  • Time Series: Pandas offers a lot of rich methods for manipulating and analyzing time series… but they can be challenging to get familiar with up front.
  • Scaling: How do you work with data that’s too big to hold in your computer’s memory all at once? How can you distribute computationally intensive tasks across a high performance computing cluster? Dask marries pandas to a task graph generator to scale dataframes up to medium-sized data science applications (100s of GB), which is perfect for the larger datasets that PUDL is targeting. See the Dask sketch after this list.
  • How to learn Dask in 2020 is a collection of articles and self-guided tutorials that combine videos with interactive Jupyter notebooks. It works through many of the core concepts of Dask, and was presented at SciPy 2020. If you’re already familiar with NumPy, SciPy, and Pandas, this is still easily a full day of material.
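
To make the method chaining pattern concrete, here’s a minimal sketch; the dataframe and column names are invented for illustration:

    import pandas as pd

    # (Column names invented for illustration.)
    raw = pd.DataFrame({
        "plant": ["a", "a", "b", "b"],
        "year": [2018, 2019, 2019, 2020],
        "net_gen_mwh": [90.0, 100.0, None, 300.0],
    })

    # Each method returns a new dataframe, so the whole transformation
    # reads top to bottom as a single expression, with no throwaway
    # intermediate variables.
    annual = (
        raw
        .dropna(subset=["net_gen_mwh"])
        .query("year >= 2019")
        .assign(net_gen_gwh=lambda df: df.net_gen_mwh / 1000)
        .groupby("plant", as_index=False)
        .agg(total_gwh=("net_gen_gwh", "sum"))
    )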
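
The tidy data reshaping described above often amounts to melting a wide, one-column-per-year layout into one row per observation. A small sketch, again with made-up column names:

    import pandas as pd

    # (Hypothetical wide-format data: one column per year.)
    wide = pd.DataFrame({
        "plant": ["a", "b"],
        "2019": [100.0, 250.0],
        "2020": [150.0, 300.0],
    })

    # One row per (plant, year) observation, one column per variable.
    tidy = wide.melt(id_vars="plant", var_name="year", value_name="net_gen_mwh")
    tidy["year"] = tidy["year"].astype(int)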
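
And a sketch of the Dask dataframe pattern mentioned under Scaling: the API mirrors pandas, but operations lazily build a task graph over many partitions and only run when you call .compute(). The file glob here is hypothetical:

    import dask.dataframe as dd

    # Looks like pandas.read_csv, but lazily plans work across many
    # files that together may not fit in memory. (Hypothetical paths.)
    ddf = dd.read_csv("epacems/epacems-*.csv")

    # Still lazy: this just extends the task graph.
    avg_load = ddf.groupby("plant_id")["gross_load_mw"].mean()

    # Nothing is read or computed until .compute() is called; the
    # result comes back as an ordinary in-memory pandas Series.
    result = avg_load.compute()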