A couple of weeks ago I attended TWEEDS 2020 virtually (like everything this year) and talked about Catalyst’s ongoing Public Utility Data Liberation (PUDL) project, and especially the challenges of getting a big pile of data into the hands of different kinds of users, using different tools for different purposes. It ended up sketching out a bit of a PUDL infrastructure roadmap for the next year, and so we thought it would be a good idea to write it up here too.
We’ll have a separate post looking at our 2021 data roadmap.
The US Energy Information Asymmetry
PUDL is all about addressing a big information asymmetry in the regulatory and legislative processes that affect the US energy system. Utilities have much more information about their own systems than policymakers and advocates typically do. As a result, regulators often defer to the utilities on technical & analytical points. Commercial data exists, but it’s expensive. We want to get enough data into the hands of other kinds of stakeholders that they can make credible quantitative arguments to regulators, and challenge unfounded assertions put forward by utilities.
A huge amount of detailed energy system data is freely available in the US. It describes the physical and economic aspects of the system, which are each important in their own ways for system planning and clean energy advocacy. But the data is also poorly curated by the various federal agencies. It’s dirty, incomplete, and difficult to integrate into a coherent whole. It’s also way too large and complex to be fully utilized with spreadsheet based analysis. Historically this has meant lots of people do their own small slice of analysis. They focus on the particular state or utility of interest to them, without trying to apply the same analyses across the entire universe of similar data. There’s no standardization or coordination between these efforts, and so we haven’t been able to build on or learn from each other’s work.
We’re trying to serve people with lots of different goals. Our users include academics researching low carbon energy systems as well as clean energy advocates working at the state level. We’ve also worked with smaller renewable energy developers, and would love to support data journalists interested in climate and energy issues. Some of these users are focused on the physical energy system and its operations. Others are interested primarily in the financial and economic aspects of utilities and their assets.
These folks have a wide range of technical backgrounds — anything from college students on Google Sheets, to postdoctoral researchers with access to high performance computing resources. Ideally, we want the data to be accessible to people regardless of the tools that they’re using: be it Excel, Python, R, or pure unadulterated SQL.
At the same time, we only have so many resources to throw at the problem. It’s just been 3 of us, learning as we go, and having to juggle fundraising, client work, and the underlying infrastructure. So we have to focus on a small number of low-maintenance solutions that can work for lots of different use cases, users, and tools.
Making a Database
So what do you do when your data is too big for spreadsheets, and you want to capture lots of internal relationships? The obvious answer is… make a database! The first version of PUDL digested EIA spreadsheets and FERC’s FoxPro database files and populated a local PostgreSQL database. This was fine for our own work, but when we started trying to share the outputs with others, we ran into issues.
The live database wasn’t well suited to archiving or otherwise sharing directly, and we didn’t want to take on the responsibility for hosting it and maintaining it for remote access. It also turned out that cloning a GitHub repository, running the full ETL pipeline, and firing up a PostgreSQL server was beyond the comfort zone of a lot of potential users. It’s also not something that would scale well for larger datasets, since processing 100 GB of data on a laptop is no fun, and it doesn’t really need to be done many times by many people.
Static Versioned Releases
So instead, we decided to do periodic combined software + data releases, including both a known version of the raw inputs, and all the processed outputs, stored in a static, archive-friendly, platform-independent format (see: PUDL Data Release v1.1.0). We chose the CSV + JSON based tabular data packages, as specified by the Frictionless Data project, and published them via CERN’s Zenodo data archiving service. The data packages can be loaded into a database automatically (we provide a script that puts them in SQLite) and they’re relatively easy to parse as a human or a machine. This also meant we didn’t have to worry about users trying to run the data processing pipeline with settings we’d never tested, and which might not work.
But this still wasn’t ideal. Anyone using the CSVs/JSON directly would miss out on a whole other layer of analysis that we’d built on top of the database (in Python). It also took a fair amount of effort to compile a release, validate the data, and ensure that it was reproducible. And that meant we didn’t do it very often. Especially in an advocacy or business context, it would be very useful to have the most recent year-to-date data integrated, but doing a monthly release with this system is just too much overhead for us right now.
Reproducible Data Acquisition
Another problem we had was the lack of reproducible data access — it turns out the EIA, FERC and other federal agencies frequently go back and edit supposedly “final” data releases — sometimes years after the fact — and change both the contents and the structure of the data they’ve published. There’s no version control, no change logs, and no ability for a user to know for sure that they are getting the same data today that they might have gotten last week, or last year. So we were constantly having to interrupt other work and update the software to cater to the current version of publicly available data.
To deal with this we decided to create our own versioned archives of the raw inputs on Zenodo. Each version of the PUDL software now has a collection of DOIs (digital object identifiers) baked into it, which refer to a data package on Zenodo containing the original files published for e.g. EIA-861, or FERC Form 1. Typically they are organized into a single ZIP archive for each year of data. The PUDL datastore module downloads the metadata for each data source, and then grabs each year of data as needed and caches it locally.
This arrangement means we don’t have to archive the inputs as part of a data release. They’re already independently archived, and won’t be duplicated if they haven’t changed between releases. It also makes those raw inputs available programmatically for others to use. In the future, other people could archive their own datasets with the same kind of metadata.
The hard work of parsing and cleaning up the contents of these data packages still has to be done by whoever is using them, but at least this setup lets you say “Give me version 3 of the EIA-861 data from 2009” and know that you’ll always get exactly the same thing.
Containerizing the ETL
To make the whole ETL pipeline easier to run in different contexts, and to minimize the different platforms we have to test on, we’re also packaging it up inside a Docker container. Future data releases will archive the container that was used to process the data, instead of just a conda environment.yml file and a collection of scripts.
Having the software run inside a container also opens up the possibility of using a real database like Postgres and just exposing a local port for the user, or running a Jupyter notebook server from within the container for the user to connect to, or other web applications like Datasette that expose the processed PUDL data in other ways.
ETL in Parallel
With the integration of larger datasets, and more years of data from FERC and EIA, the ETL pipeline is taking longer to run. The EPA CEMS hourly emissions dataset has almost a billion rows, and takes more than 12 hours on a laptop. One consequence of this inconvenience has been that we rarely run the full ETL process in development. That means there are some kinds of data integration issues and bugs that we don’t catch until long after they were introduced, making debugging much harder.
Thankfully, there’s a lot of opportunity to parallelize the ETL. The FERC Form 1 data can be processed independently of the EIA 860/861/923. EPA CEMS can be broken up into hundreds of smaller work units. Most of the individual tables within FERC or EIA don’t depend on each other when they’re being transformed.
To take advantage of this, we’re breaking the various ETL steps out into individual tasks that are related to each other with explicitly declared dependencies using the Prefect orchestration framework and Dask. These tools are designed to work together, and can fully utilize multiple CPU cores on a laptop or workstation, or a big multi-CPU cloud computing instance. In a cloud environment, Prefect + Dask can also use a Kubernetes executor to spread tasks across hundreds of nodes. With these tools we’re hoping to get the full ETL process to where it can run in a matter of minutes, rather than hours.
Nightly Data Builds & Continuous Deployment
Once the ETL process containerized, and refactored to run in parallel, we’ll set up a GitHub Action to kick off a nightly data build in the cloud whenever there are new commits to our GitHub repository. The data build will also run all the data validation routines, and if everything passes, produce an archive-ready data release candidate. This will make it much easier for us to keep the released data and software up to date, and ensure that we catch any data integration errors as soon as they appear.
With these outputs generated automatically every night, we can also start populating live, versioned resources — databases for the smaller and more densely interconnected datasets, and cloud storage buckets full of Parquet files for the long tables. Resources associated with a tagged release will be retained for some lifecycle, and used as backend data for other applications or visualizations. Each commit to the repository could be used to recreate a given version of the output data automatically.
This arrangement does commit us to maintaining infrastructure, but the hope is that we can do it in a way that makes it relatively easy for someone else to replicate — we’re not trying to create a closed system. The newly formed 2i2c (International Interactive Computing Collaboration) is working with this kind of open and relatively platform agnostic infrastructure stack as a way to guarantee users & clients The Right to Replicate.
Real Live Applications
The goal of all that stuff above is easy, timely, reproducible access to the data in a variety of forms, serving a variety of users and use cases, for example:
- A lightweight, user-friendly front-end like the one provided by Datasette, allowing people to browse and download smaller selections of the overall data for use in spreadsheet-based analysis.
- Monthly data releases on Zenodo so that the most recent year-to-date data is always available as soon as it’s been integrated. These archives serve users that need stable long-term access to a known set of reproducible resources, like academic researchers whose work needs to be well referenced and replicable.
- A hosted JupyterHub with direct access to the most recently released data for users that are comfortable with notebook based interactive computing, but who don’t want to bother with setting up the software locally themselves. I.e. an energy systems focused Pangeo instance. We could also let users launch their own notebooks in this environment using BinderHub.
- Programmatic access to the processed PUDL data for users that are using other platforms like Tableau, or building their own applications. This could either via direct connections to the cloud resources, or via a thin wrapper like FastAPI.
- Locally running container …
This refactoring will also allow us to handle more, larger datasets like the FERC EQR and other time series without grinding everything to a halt.
Some of these improvements will be worked into our next data release, which will happen before the end of 2020, and include final annual data through 2019.
We’ve also got a slate of additional datasets — including several focusing on natural gas explicitly — that we’re hoping we can find foundation support to integrate.
If there are useful tools or design patterns or use cases that you think we should be looking at, but that didn’t get mentioned here, please let us know in the comments! Or reach out directly via email: firstname.lastname@example.org.