Reading Material

A collection of background reading we’ve found particularly helpful or interesting. Unfortunately some of it is locked up behind paywalls. If you don’t have access to an academic library, we encourage you to contact the authors directly to request a copy — they are usually more than happy to provide one. (You might also want to read this Wikipedia article.)

Why Liberate Electricity Data?

Electricity Systems

  • Electric Power Systems: A Conceptual Introduction: A book by UC Berkeley professor Alexandra von Meier, published in 2006. A great conceptual overview of how the grid works, and has worked for pretty much the past century. It’s not an electrical engineering textbook, but it does assume the reader has an undergraduate-level understanding of math and physics.

Research Computing & Data Hygiene:

  • Good enough practices in scientific computing (PLOS Computational Biology, 2017). A whitepaper from the organizers of Software Carpentry on good habits to ensure your work is reproducible and reusable — both by yourself and others!
  • Best practices for scientific computing (PLOS Biology, 2014). An earlier version of the above whitepaper aimed at a more technical, data-oriented set of scientific users.
  • Tidy Data (The Journal of Statistical Software, 2014). A more complete exploration of the reasons behind and benefits of organizing data into single-variable, homogeneously typed columns and complete single-observation records, as suggested by the best/good enough practices papers above.
  • A Simple Guide to Five Normal Forms: A 1983 vintage rundown of data normalization. Short and informal, but understandable, with a few good illustrative examples. Bonus points for using ASCII art.
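
As a rough illustration of the tidy data idea described above (one variable per column, one observation per row), here is a minimal sketch using pandas. The table contents and the column names plant_id and net_generation_mwh are hypothetical, chosen only for this example.

```python
import pandas as pd

# Hypothetical "wide" table: one row per plant, one column per year.
# The year values are stored in the column headers, not in a column.
wide = pd.DataFrame(
    {
        "plant_id": [1, 2],
        "2015": [100.0, 250.0],
        "2016": [110.0, 240.0],
    }
)

# Reshape so that each row is a single (plant, year) observation
# and each column holds exactly one variable with one data type.
tidy = (
    wide.melt(id_vars="plant_id", var_name="year", value_name="net_generation_mwh")
    .astype({"year": "int64"})
    .sort_values(["plant_id", "year"])
    .reset_index(drop=True)
)

print(tidy)
```

The wide layout is often easier to read by eye, while the long layout tends to be easier to filter, group, and join.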

Software Engineering and Python Tools:

  • Structuring your Python Project: How to organize a Python project into modules that are packaged for easy interpretation by users. It includes a dummy package available via GitHub.
  • Python Testing with pytest: A 2017 book by Brian Okken on how to use pytest effectively. It’s very focused on pytest in particular, rather than more general testing methodologies, but if you’re working in Python and want to start testing your code, it’s a good reference.

Pandas:

Pandas is one of the main libraries used for data analysis in Python. It makes powerful manipulations of heterogeneous tabular data relatively easy. We use Pandas so much that it deserves its own section, separate from the more general software engineering and Python tools.

  • Method Chaining: a readable, compact way to pass pandas dataframes through many different transformations all at once.
  • Indexes: Pandas dataframe indices can be complex, but they’re also a powerful way to access and manipulate your data.
  • Fast Pandas: If you have a dataframe with millions of rows, you need to know what is fast and what is slow.
  • Tidy Data: Tips on how and why to re-shape data appropriately for the analysis at hand. Kind of a simplified Pandas version of database normalization guidelines.
  • Visualization: How to get a visual window into the data you’re manipulating on the fly.
  • Time Series: Pandas offers a lot of rich methods for manipulating and analyzing time series… but they can be a real pain in the butt to get familiar with up front.
  • Scaling: How do you work with data that’s too big to hold in your computer’s memory all at once? How can you distribute computationally intensive tasks across a high-performance computing cluster? Dask marries pandas to a task graph generator to scale dataframes up to medium-sized data science applications (100s of GB), which is perfect for the larger datasets that PUDL is targeting.
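
To give a flavor of the method chaining and time series topics listed above, here is a minimal sketch that passes a small dataframe through several transformations in one expression. The data and the column names timestamp, demand_mw, and mean_demand_mw are hypothetical and are not taken from any of the linked articles.

```python
import pandas as pd

# Hypothetical hourly readings covering two full days.
raw = pd.DataFrame(
    {
        "timestamp": pd.date_range("2020-01-01", periods=48, freq="60min"),
        "demand_mw": [float(i % 24) for i in range(48)],
    }
)

# Each step returns a new object, so the steps can be chained
# without naming any intermediate dataframes.
daily = (
    raw.set_index("timestamp")        # time series methods need a datetime index
    .resample("D")["demand_mw"]       # group the hourly rows into calendar days
    .mean()                           # one mean value per day
    .rename("mean_demand_mw")         # name the resulting series
    .reset_index()                    # back to an ordinary dataframe
    .assign(above_11=lambda df: df["mean_demand_mw"] > 11.0)
)

print(daily)
```

Wrapping the chain in parentheses lets each step sit on its own line, which makes it easy to comment out one step while debugging.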