Tools We Use
We are committed to open source software and open data, and use tools that are freely available to everyone whenever possible.
Python has become one of the most popular languages for scientific calculations and data analysis, and we love it too! We’re currently using the miniconda3 distribution, which is packaged up by the kind folks at Anaconda. All our conda environments pull from the community-managed conda-forge channel, which is where we publish our packages as well.
Pandas provides easy-to-use, column-oriented dataframes for interactive data manipulation and visualization. It’s the workhorse we use day in and day out to manipulate millions of rows of energy system data.
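As a minimal sketch of the kind of column-oriented manipulation described above, the snippet below aggregates and reshapes a tiny table. The column names and values are made up for illustration and are not taken from our actual data.

```python
import pandas as pd

# Hypothetical generation records; IDs, years, and values are invented.
gen = pd.DataFrame(
    {
        "plant_id": [1, 1, 2, 2],
        "report_year": [2019, 2020, 2019, 2020],
        "net_generation_mwh": [1200.0, 1350.0, 800.0, 760.0],
    }
)

# Total generation per plant across all reported years.
total_by_plant = gen.groupby("plant_id")["net_generation_mwh"].sum()

# Reshape to a wide table: one row per plant, one column per year.
by_year = gen.pivot(
    index="plant_id", columns="report_year", values="net_generation_mwh"
)
```

The same two calls work unchanged on tables with millions of rows, which is a large part of why dataframes are convenient for exploratory work.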
High-level machine learning and clustering techniques have become very accessible and powerful. The scikit-learn (sklearn) suite of Python tools makes it much easier to automatically associate records in different datasets that share no common IDs, and to extract interesting patterns from datasets that might otherwise be too large or complex to tackle.
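One simple way to associate records that share no IDs is to compare their names as text. The sketch below is only an illustration of that idea, not a description of our actual matching process: it assumes character n-gram TF-IDF features and cosine similarity, and the plant names are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented plant names as they might appear in two different datasets.
left = [
    "Blue River Generating Station",
    "Cedar Ridge Power Plant",
    "Pine Hollow Energy Station",
]
right = [
    "PINE HOLLOW ENERGY STN",
    "BLUE RIVER GEN STATION",
    "CEDAR RIDGE PLANT",
]

# Character n-grams tolerate abbreviations and small spelling differences.
# TfidfVectorizer lowercases its input by default.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectorizer.fit(left + right)

# Rows correspond to `left`, columns to `right`.
similarity = cosine_similarity(
    vectorizer.transform(left), vectorizer.transform(right)
)

# For each left-hand record, the index of the most similar right-hand record.
best_match = similarity.argmax(axis=1)
matches = {left[i]: right[j] for i, j in enumerate(best_match)}
```

In practice a similarity threshold and some manual review are usually needed, since the highest-scoring candidate is not always a true match.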
Testing may not be exciting… but it does make our code more likely to work! We use PyTest to ensure that PUDL stays functional, and continuously test the full data processing pipeline from the original government sources to our live database.
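For a sense of what a data-oriented test looks like, here is a minimal sketch in the style PyTest expects: plain functions whose names start with `test_` and which use bare `assert` statements. The `find_bad_capacity` helper and the record layout are hypothetical and do not come from the PUDL test suite.

```python
def find_bad_capacity(records):
    """Return the records whose capacity is missing or negative."""
    return [
        r for r in records
        if r["capacity_mw"] is None or r["capacity_mw"] < 0
    ]


def test_clean_records_pass():
    records = [{"plant_id": 1, "capacity_mw": 150.0}]
    assert find_bad_capacity(records) == []


def test_bad_records_are_flagged():
    records = [
        {"plant_id": 1, "capacity_mw": -5.0},
        {"plant_id": 2, "capacity_mw": None},
        {"plant_id": 3, "capacity_mw": 10.0},
    ]
    flagged = find_bad_capacity(records)
    assert [r["plant_id"] for r in flagged] == [1, 2]
```

Saved in a file whose name starts with `test_`, these functions are collected and run automatically by the `pytest` command.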
SQLAlchemy provides a (mostly) database-agnostic Python programming environment that lets us build complex data structures linked directly to an underlying database through an advanced Object Relational Mapper (ORM).
Most of the energy system data we integrate is neither well curated nor accessible via a clean API, so we typically use the Scrapy web-scraping framework to pull it from government agency websites.
GeoPandas extends the Pandas dataframe to work with geospatial data by adding a special geometry column which can contain points, lines, polygons, and other more complex geospatial types. It also manages map projections, spatial indexing, and operations like spatial joins, intersections, and unions.
We use Docker containers to provide a uniform, reproducible environment for running our ETL pipeline and data validation regardless of the user’s underlying platform. We’re also experimenting with packaging up the processed data inside containers to provide local access via microservices with minimal user setup or configuration.
We use the Frictionless Data standards for publishing static, platform-independent, tabular data. The data is stored in CSV files, with an accompanying JSON metadata wrapper that specifies a relational database schema and provides additional information about the columns, tables, their formats, licensing, sources, etc.
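To make the CSV-plus-JSON pairing concrete, the sketch below builds a tiny table and a matching descriptor using only the Python standard library. The keys (`resources`, `schema`, `fields`, `primaryKey`, `licenses`) follow the Frictionless Data Package and Table Schema specifications, but the package name, table, columns, and license shown here are invented for illustration and are much simpler than our published metadata.

```python
import csv
import io
import json

# Invented table contents.
fieldnames = ["plant_id", "plant_name", "capacity_mw"]
rows = [
    {"plant_id": 1, "plant_name": "Example Plant A", "capacity_mw": 150.0},
    {"plant_id": 2, "plant_name": "Example Plant B", "capacity_mw": 42.5},
]

# The data itself is plain CSV (written to memory here instead of a file).
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# The JSON descriptor records where the CSV lives and what its columns mean.
descriptor = {
    "name": "example-plants",
    "licenses": [{"name": "CC-BY-4.0"}],
    "resources": [
        {
            "name": "plants",
            "path": "plants.csv",
            "schema": {
                "fields": [
                    {"name": "plant_id", "type": "integer"},
                    {"name": "plant_name", "type": "string"},
                    {"name": "capacity_mw", "type": "number"},
                ],
                "primaryKey": ["plant_id"],
            },
        }
    ],
}
descriptor_json = json.dumps(descriptor, indent=2)
```

Because the descriptor declares types and keys, a consumer can load the CSV into a relational database without having to guess at the schema.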
Our day-to-day data analysis and code prototyping is mostly done within JupyterLab interactive notebooks. The broader Project Jupyter ecosystem grew out of IPython, which was born right here in Boulder, Colorado at CU, in the hands of a physics grad student named Fernando Perez.
For flexible, static two-dimensional visualizations, we use Matplotlib, which is tightly integrated with all of the other tools here. The Seaborn library builds on Matplotlib and provides a great interface for clean statistical visualizations.
Dask builds on Pandas by adding a task graph generator and scheduler, and provides a convenient way of working with data that is significantly larger than the memory available on an individual computer. This allows a user to analyze the bigger datasets within PUDL, conveniently making full use of a multi-core processor, or a cloud computing platform like Pangeo.
Under the hood, NumPy and SciPy are libraries that really make Python hum for data analysis and scientific computation. They build upon decades’ worth of high performance computing libraries, and make that functionality much more accessible than, say… Fortran 77. Not that any of us would know anything about that. No sirree.
Datasette is a platform for exploring and sharing collections of data using the popular file-based SQLite database as a container. SQLite support is built directly into the Python standard library, so this is especially convenient. You can explore our published databases at: https://data.catalyst.coop
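Because the `sqlite3` module ships with Python, querying a SQLite database needs no extra installation. The sketch below uses an in-memory database with an invented `plants` table; with a downloaded database file, you would pass its path to `sqlite3.connect()` instead of `":memory:"`.

```python
import sqlite3

# An in-memory database stands in for a downloaded .sqlite file.
conn = sqlite3.connect(":memory:")

# Invented table and rows, for illustration only.
conn.execute(
    "CREATE TABLE plants ("
    "plant_id INTEGER PRIMARY KEY, plant_name TEXT, capacity_mw REAL)"
)
conn.executemany(
    "INSERT INTO plants VALUES (?, ?, ?)",
    [(1, "Example Plant A", 150.0), (2, "Example Plant B", 42.5)],
)
conn.commit()

# Parameterized query: plants larger than a given capacity.
large_plants = conn.execute(
    "SELECT plant_name FROM plants WHERE capacity_mw > ? ORDER BY plant_name",
    (100,),
).fetchall()

conn.close()
```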
Reproducible analyses depend on stable long-term access to their original input data and software environment. The Zenodo research archiving project run by CERN issues DOIs to datasets, publications, software, and other research artifacts, and keeps them safe on a multi-decadal timescale. They also provide access via a REST API. We use Zenodo to archive our raw inputs, as well as our processed outputs and software packages.
Prefect builds on Dask’s task graph and distributed computation methods to provide an intuitive framework for managing dependencies between different steps in complex ETL pipelines.
Some of the datasets we’re publishing are too large to conveniently load and access via SQLite databases, and for them we use partitioned Apache Parquet datasets. This compressed, column-oriented file format allows very fast querying of billions of rows of data, and is well integrated with Dask.
All of our development happens in the open on GitHub, and we use GitHub Actions to manage our automated testing and data deployment. GitHub issues are our main mechanism for bug tracking, project management, and user support.