A collection of open systems and data science background reading we’ve found particularly insightful or educational.
Why Liberate Electricity Data?
- The importance of open data and software: Is energy research lagging behind? (Energy Policy, 2017)
- Opening the black box of energy modeling: Strategies and lessons learned (Energy Strategy Reviews, 2018)
General Analytical Computing:
- Software Carpentry: A set of tutorials and resources aimed at introducing folks with a scientific or analytical background to good standard computing practices, data hygiene, database usage, source code revision control systems, etc.
- Data Carpentry: A sister project to Software Carpentry, focused more specifically on spreading best practices in data collection, archiving, organization, and analysis.
- Good enough practices in scientific computing (PLOS Computational Biology, 2017). A whitepaper from the organizers of Software Carpentry on good habits to ensure your work is reproducible and reusable — both by yourself and others!
- Best practices for scientific computing (PLOS Biology, 2014). An earlier version of the above whitepaper aimed at a more technical, data-oriented set of scientific users.
- Tidy Data (The Journal of Statistical Software, 2014). A fuller exploration of why and how to organize data so that each column holds a single, homogeneously typed variable and each row is one complete observation, as suggested by the best/good enough practices papers above.
- Modern Pandas: An introduction to this intermediate-level series of tutorials on getting the most out of Pandas for data analysis, from Tom Augspurger, one of the developers behind Pandas & Dask.
- Method Chaining: a readable, compact way to pass dataframes through many different transformations all at once.
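To make the idea concrete, here is a minimal sketch of method chaining on an invented dataframe (the `plant`/`year`/`mwh` columns are hypothetical, not from any real dataset):

```python
import pandas as pd

# A small example frame; the column names here are invented for illustration.
df = pd.DataFrame({
    "plant": ["a", "a", "b", "b"],
    "year": [2019, 2020, 2019, 2020],
    "mwh": [100.0, 110.0, 50.0, None],
})

# Each method returns a new dataframe, so the steps read top to bottom
# as one pipeline, with no intermediate temporary variables.
result = (
    df
    .dropna(subset=["mwh"])              # drop records with missing generation
    .query("year == 2019")               # keep a single year
    .assign(gwh=lambda d: d.mwh / 1000)  # derive a new column
    .sort_values("plant")
    .reset_index(drop=True)
)
```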
- Indexes: Pandas dataframe indices can be complex, but they’re also a powerful way to access and manipulate your data.
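As a quick illustration of what an index buys you, here is a sketch using a hypothetical two-level (plant, year) index:

```python
import pandas as pd

# Hypothetical generation data indexed by (plant, year).
df = pd.DataFrame({
    "plant": ["a", "a", "b", "b"],
    "year": [2019, 2020, 2019, 2020],
    "mwh": [100.0, 110.0, 50.0, 60.0],
}).set_index(["plant", "year"])

# .loc on a MultiIndex selects by label at each level:
one_cell = df.loc[("a", 2020), "mwh"]   # a single value

# .xs slices out all rows matching one level's value:
year_2019 = df.xs(2019, level="year")

# Resetting the index turns the labels back into ordinary columns:
flat = df.reset_index()
```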
- Fast Pandas: If you have a dataframe with millions of rows, you need to know what is fast and what is slow.
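The core fast-vs-slow distinction is vectorized column operations versus Python-level row loops; a minimal sketch (with made-up columns):

```python
import numpy as np
import pandas as pd

# A frame with enough rows that per-row Python loops get noticeably slow.
rng = np.random.default_rng(0)
df = pd.DataFrame({"mw": rng.random(100_000), "hours": rng.random(100_000) * 8760})

# Slow: a Python-level loop over rows, one interpreter iteration per record.
# mwh = [row.mw * row.hours for row in df.itertuples()]

# Fast: a single vectorized operation over whole columns, executed in C.
df["mwh"] = df["mw"] * df["hours"]
```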
- Tidy Data: Tips on how and why to re-shape data appropriately for the analysis at hand. Kind of a simplified Pandas version of database normalization guidelines.
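The main reshaping moves are `melt()` (wide to tidy) and `pivot()` (tidy to wide); a small sketch with invented columns:

```python
import pandas as pd

# "Wide" data: one column per year, which hides a variable (year)
# inside the column labels.
wide = pd.DataFrame({
    "plant": ["a", "b"],
    "2019": [100.0, 50.0],
    "2020": [110.0, 60.0],
})

# melt() reshapes it into tidy form: one row per (plant, year) observation.
tidy = wide.melt(id_vars="plant", var_name="year", value_name="mwh")
tidy["year"] = tidy["year"].astype(int)

# pivot() goes the other way when a wide layout suits the analysis:
back = tidy.pivot(index="plant", columns="year", values="mwh")
```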
- Visualization: How to get a visual window into the data you’re manipulating on the fly.
- Time Series: Pandas offers a lot of rich methods for manipulating and analyzing time series… but it takes some up-front effort to get familiar with them.
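Two of the workhorses are `resample()` and `rolling()`; a minimal sketch on an invented hourly series:

```python
import pandas as pd

# An hourly series spanning two days; the values are made up for illustration.
idx = pd.date_range("2020-01-01", periods=48, freq="h")
load = pd.Series(range(48), index=idx, dtype=float)

# resample() groups by calendar period, here summing hourly values to days:
daily = load.resample("D").sum()

# rolling() computes windowed statistics, e.g. a 24-hour moving average:
smooth = load.rolling(24).mean()
```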
- Scaling: How do you work with data that’s too big to hold in your computer’s memory all at once? How can you distribute computationally intensive tasks across a high performance computing cluster? Dask marries pandas to a task graph generator to scale dataframes up to medium-sized data science applications (100s of GB), which is perfect for the larger datasets that PUDL is targeting.
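Dask automates this pattern across many partitions (and machines), but the core idea can be sketched with plain pandas: process the data in bounded chunks and combine partial results, so only one chunk is ever in memory. The file and columns here are stand-ins, not real PUDL data:

```python
import io
import pandas as pd

# Stand-in for a file too big to load at once (here just an in-memory CSV).
big_csv = io.StringIO(
    "plant,mwh\n" + "\n".join(f"p{i % 3},{i}" for i in range(10_000))
)

# read_csv(chunksize=...) yields dataframes of bounded size; we aggregate
# each chunk and merge the partial results as we go.
totals = pd.Series(dtype=float)
for chunk in pd.read_csv(big_csv, chunksize=1000):
    totals = totals.add(chunk.groupby("plant")["mwh"].sum(), fill_value=0)
```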
- Databases and SQL from Software Carpentry: An introduction to using databases and making SQL queries for programming novices with a quantitative background.
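A taste of what those lessons cover, using Python's built-in `sqlite3` module and a made-up table:

```python
import sqlite3

# An in-memory SQLite database with a single hypothetical table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE generation (plant TEXT, year INTEGER, mwh REAL)")
con.executemany(
    "INSERT INTO generation VALUES (?, ?, ?)",
    [("a", 2019, 100.0), ("a", 2020, 110.0), ("b", 2019, 50.0)],
)

# A SQL query filters and aggregates inside the database, rather than
# pulling all the raw records into Python first:
rows = con.execute(
    "SELECT plant, SUM(mwh) FROM generation GROUP BY plant ORDER BY plant"
).fetchall()
```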
- A Simple Guide to Five Normal Forms: A 1983-vintage rundown of data normalization. Short and informal, but understandable, with a few good illustrative examples rendered in ASCII art.
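The gist of normalization, sketched in pandas terms with invented tables: repeated attributes get factored into their own table, keyed by an ID, and a join reconstructs the flat view on demand:

```python
import pandas as pd

# A denormalized table: each plant's name and state are repeated on
# every generation record, inviting inconsistencies.
flat = pd.DataFrame({
    "plant_id": [1, 1, 2],
    "plant_name": ["Alpha", "Alpha", "Beta"],
    "state": ["CO", "CO", "TX"],
    "year": [2019, 2020, 2019],
    "mwh": [100.0, 110.0, 50.0],
})

# Normalized: plant attributes live once in their own table, keyed by ID...
plants = flat[["plant_id", "plant_name", "state"]].drop_duplicates()
# ...and the observations reference plants only through that key.
generation = flat[["plant_id", "year", "mwh"]]

# A join (merge) reconstructs the original flat view on demand:
rebuilt = generation.merge(plants, on="plant_id")
```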
Basic Python Tools:
- Structuring your Python Project: How to organize a Python project into modules and package it so that users can easily understand and install it. It includes a dummy package available via GitHub.