Educational Resources

A collection of open source software and data science background reading we’ve found particularly insightful or educational.

Why Liberate Electricity Data?

General Analytical Computing:

  • Software Carpentry: A set of tutorials and resources aimed at introducing folks from scientific or analytical backgrounds to good standard computing practices: data hygiene, database usage, source code version control, etc.
  • Data Carpentry: A sister project to Software Carpentry, focused more specifically on spreading best practices in data collection, archiving, organization, and analysis.
  • Good enough practices in scientific computing (PLOS Computational Biology, 2017). A whitepaper from the organizers of Software Carpentry on good habits to ensure your work is reproducible and reusable — both by yourself and others!
  • Best practices for scientific computing (PLOS Biology, 2014). An earlier version of the above whitepaper aimed at a more technical, data-oriented set of scientific users.
  • Tidy Data (The Journal of Statistical Software, 2014). A more complete exploration of the reasoning behind, and the benefits of, organizing data so that each column holds a single, homogeneously typed variable and each row is one complete observation, as suggested by the best/good enough practices papers above. (A small reshaping sketch follows this list.)
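
To make the tidy data idea concrete, here is a minimal sketch of a “wide” table melted into one-variable-per-column form with Pandas. The table, its column names, and its values are all hypothetical, invented for illustration:

    import pandas as pd

    # A "wide" table: one row per plant, one column per year (made-up data).
    wide = pd.DataFrame({
        "plant_id": [1, 2],
        "2020": [120.5, 88.0],
        "2021": [115.2, 90.3],
    })

    # Tidy form: each column holds a single variable (plant_id, year,
    # net_generation_mwh) and each row is one complete observation.
    tidy = wide.melt(
        id_vars="plant_id",
        var_name="year",
        value_name="net_generation_mwh",
    )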

Pandas:

  • Modern Pandas: the introduction to an intermediate-level series of tutorials on getting the most out of Pandas for data analysis, from Tom Augspurger, one of the developers behind Pandas & Dask.
  • Method Chaining: a readable, compact way to pass a dataframe through many different transformations in a single statement (see the first sketch after this list).
  • Indexes: Pandas dataframe indices can be complex, but they’re also a powerful way to access and manipulate your data.
  • Fast Pandas: If you have a dataframe with millions of rows, you need to know what is fast and what is slow.
  • Tidy Data: Tips on how and why to reshape data appropriately for the analysis at hand. Something like a simplified Pandas take on database normalization guidelines.
  • Visualization: How to get a visual window into the data you’re manipulating on the fly.
  • Time Series: Pandas offers a lot of rich methods for manipulating and analyzing time series… but they can be a real pain in the butt to get familiar with up front (a small resampling sketch follows this list).
  • Scaling: How do you work with data that’s too big to hold in your computer’s memory all at once? How can you distribute computationally intensive tasks across a high performance computing cluster? Dask marries Pandas to a task graph generator, scaling dataframes up to medium-sized data science applications (100s of GB), which is perfect for the larger datasets that PUDL is targeting (see the Dask sketch after this list).
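
To make the Method Chaining item above concrete, here is a minimal sketch of the style. The input file and all column names are hypothetical:

    import pandas as pd

    fuel = pd.read_csv("fuel_receipts.csv")  # hypothetical input file

    # Each method returns a new dataframe, so the steps read top to bottom
    # as a single pipeline, with no throwaway intermediate variables.
    monthly_coal_cost = (
        fuel
        .rename(columns=str.lower)
        .query("fuel_type == 'coal'")
        .assign(total_cost=lambda df: df.quantity * df.unit_price)
        .groupby("month")["total_cost"]
        .sum()
    )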
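
For the Time Series item, a small taste of resampling and rolling windows, two of the methods that take some getting used to. The data here is synthetic:

    import numpy as np
    import pandas as pd

    # One week of hourly observations (synthetic load data).
    idx = pd.date_range("2021-01-01", periods=24 * 7, freq="h")
    load = pd.Series(np.random.rand(len(idx)), index=idx, name="load_mw")

    daily_avg = load.resample("D").mean()      # downsample hourly to daily means
    smoothed = load.rolling(window=24).mean()  # 24-hour rolling average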
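
And for the Scaling item: Dask’s dataframe API deliberately mirrors Pandas, so scaling up is often just a change of import plus deferred computation. A minimal sketch, in which the file glob and column names are hypothetical:

    import dask.dataframe as dd

    # Lazily treats many CSVs as one logical dataframe, split into
    # partitions that fit in memory; nothing runs until .compute().
    df = dd.read_csv("epacems/epacems-*.csv")
    avg_load = df.groupby("plant_id")["gross_load_mw"].mean().compute()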

Databases:

Basic Python Tools:

  • Structuring your Python Project: How to organize a Python project into modules and package it so that it’s easy for users to understand and install. It includes a dummy package available via GitHub. (A minimal layout sketch follows.)
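
As a rough sketch of the kind of layout that guide advocates (all of the names here are hypothetical):

    mypackage/                # project root
    ├── mypackage/            # the importable package itself
    │   ├── __init__.py
    │   └── core.py
    ├── tests/
    │   └── test_core.py
    ├── setup.py              # packaging / installation metadata
    ├── README.md
    └── LICENSE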