New PUDL Software & Data Release: v0.4.0

In August we put out a new PUDL software and data release for the first time in 18 months. We had a lot of client work, and kept putting off doing the release, so a whole lot of changes accumulated. Some highlights, mostly based on the continuously updated release notes in our documentation:

New Data Coverage

EIA Form 860 added coverage for 2004-2008, as well as 2019.
EIA Form 860m has been integrated (through Nov 2020). Note that it only adds up-to-date information about generators (especially their operational status).
EIA Form 923 added the 2001-2008 data, as well as 2019.
EPA CEMS Hourly Emissions covering 2019-2020.
FERC Form 714 covering 2006-2019, but only the table of hourly electricity demand by planning area. This data is still in beta and hasn’t been integrated into the core SQLite database, but you can process it on the fly if you want to work with it in Pandas.
EIA Form 861 for 2001-2019. Similar to the FERC Form 714, this ETL runs on the fly and the outputs aren’t integrated into the database yet, but it’s available for experimental use.
US Census Demographic Profile 1 (DP1) for 2010. This is a separate SQLite database, generated from a US Census Geodatabase, which includes census tract, county, and state level demographic information, as well as spatial boundaries of those jurisdictions.

Documentation & Data Accessibility

We’ve come to the conclusion that most users will never want to run the data processing pipeline (who can blame them!) and instead will prefer to just grab and run with the processed outputs. With that in mind we’ve re-oriented our documentation and available data access modes. The detailed information about how to run the ETL is still there in the development docs.

The primary modes of data (and software) access we focused on in this release are:

Processed data archives on Zenodo that include a Docker container preserving the required software environment for working with the data.
A repository of PUDL example notebooks, which are packaged up inside the Docker container with the data release and software environment. They can also be used with the current pre-release version of our Docker container, as explained in the README.
A JupyterHub instance hosted in collaboration with 2i2c.
A browsable interface to our live databases, deployed with Datasette.

New Analyses

A lot of our time last year went into client work that has resulted in some new analyses being available in the open source codebase.

Hourly Electricity Demand and Historical Utility Territories

With support from GridLab and in collaboration with researchers at Berkeley’s Center for Environmental Public Policy, we did a bunch of work on spatially attributing hourly historical electricity demand. @ezwelty and @yashkumar1803 did most of this work including:

Semi-programmatic compilation of historical utility and balancing authority service territory geometries based on the counties associated with utilities, and the utilities associated with balancing authorities in the EIA 861 (2001-2019). See e.g. #670 but also many others.
A method for spatially allocating hourly electricity demand from FERC 714 to US states based on the overlapping historical utility service territories described above. See #741.
A fast timeseries outlier detection routine for cleaning up the FERC 714 hourly data using correlations between the time series reported by all of the different entities. See #871.

This whole analysis can be replicated using the pudl.analysis.state_demand module, but the outputs have also been separately archived on Zenodo for easy re-use independent of our software, and for citation in other works.
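
To make the spatial allocation idea concrete, here is a minimal sketch of splitting each reporting entity's hourly demand across states in proportion to a per-state overlap weight. It is not the code in pudl.analysis.state_demand: the function name, the entity and state labels, and the choice of weights are all assumptions made for illustration, and the numbers are made up.

```python
def allocate_demand_to_states(demand_mwh, overlap_weight):
    """Split each entity's hourly demand across states by overlap weight.

    demand_mwh: {entity_id: [MWh for hour 0, MWh for hour 1, ...]}
    overlap_weight: {entity_id: {state: weight of the territory in that state}}
    All series are assumed to cover the same hours.
    Returns {state: [MWh for hour 0, MWh for hour 1, ...]}.
    """
    state_demand = {}
    for entity, series in demand_mwh.items():
        weights = overlap_weight.get(entity, {})
        total_weight = sum(weights.values())
        if total_weight <= 0:
            continue  # no known territory, so nothing can be allocated
        for state, weight in weights.items():
            share = weight / total_weight
            totals = state_demand.setdefault(state, [0.0] * len(series))
            for hour, mwh in enumerate(series):
                totals[hour] += share * mwh
    return state_demand


# Hypothetical inputs: two entities, two hours of demand each.
demand = {"entity_a": [100.0, 200.0], "entity_b": [50.0, 50.0]}
weights = {"entity_a": {"CO": 3.0, "WY": 1.0}, "entity_b": {"CO": 2.0}}
by_state = allocate_demand_to_states(demand, weights)
```

Because the shares for each entity sum to one, the total demand across states in each hour equals the total reported by the entities that have a known territory.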

Net Generation and Fuel Consumption for All Generators

We have developed an experimental methodology to produce net generation and fuel consumption for all generators. It still has known issues and we’re actively working on the process. See #989.

EIA reports net electricity generation and fuel consumption in multiple ways in the Form 923. The generation_fuel_eia923 table reports both generation and fuel consumption, and breaks them down by plant, prime mover, and fuel. In parallel, the generation_eia923 table reports generation by generator, and the boiler_fuel_eia923 table reports fuel consumption by boiler.

The generation_fuel_eia923 table is more complete, but the generation_eia923 + boiler_fuel_eia923 tables are more granular. The generation_eia923 table includes only ~55% of the total MWhs reported in the generation_fuel_eia923 table.

The pudl.analysis.allocate_net_gen module estimates the net electricity generation and fuel consumption attributable to individual generators based on the more expansive reporting of the data in the generation_fuel_eia923 table.
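
As a rough illustration of what allocating an aggregate total down to generators can look like, the sketch below splits one reported net generation figure across generators in proportion to their capacity. This is a deliberately simplified stand-in, not a description of what pudl.analysis.allocate_net_gen does: the function name, the use of capacity as the allocation key, and the example values are all assumptions.

```python
def allocate_net_generation(total_mwh, generator_capacity_mw):
    """Split an aggregate net generation total across generators.

    total_mwh: net generation reported for a group of generators
    generator_capacity_mw: {generator_id: capacity in MW}
    Returns {generator_id: allocated MWh}, proportional to capacity.
    """
    total_capacity = sum(generator_capacity_mw.values())
    if total_capacity <= 0:
        raise ValueError("total capacity must be positive to allocate")
    return {
        gen_id: total_mwh * capacity / total_capacity
        for gen_id, capacity in generator_capacity_mw.items()
    }


# Hypothetical group: 1000 MWh reported in aggregate, two generators.
allocated = allocate_net_generation(1000.0, {"GEN1": 300.0, "GEN2": 100.0})
```

Whatever allocation key is used, a useful sanity check is that the allocated values sum back to the aggregate total that was reported.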

Reproducible Data Access

A major issue with the v0.3.x release series was that we lacked stable access to the original input data. The federal agencies would constantly update supposedly “final” data publications, which we were pulling directly from their websites. These changes would break our ETL pipeline, forcing us to re-integrate data without any warning and making replication of any analyses very difficult. This was very frustrating and meant that we had to archive a full set of the raw inputs alongside every data release or analysis.

Now we use a series of web scrapers to collect snapshots of the raw input data. We archive the original data as Frictionless Data Packages on Zenodo, so that we can access them reproducibly and programmatically via a REST API. You can find the scrapers and Zenodo archiving scripts in our pudl-scrapers and pudl-zenodo-storage repositories. You can download the raw archives from the Catalyst Cooperative community on Zenodo.

There’s also an experimental caching system, implemented by @rousik, that allows these Zenodo archives to work as long-term “cold storage” for citation and reproducibility, with cloud object storage acting as a much faster way to access the same data for day-to-day non-local use.
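
The general pattern behind this kind of setup is a read-through cache: look in the fast layer first, and only go to the slow archive on a miss. The sketch below shows that pattern with in-memory stand-ins. It is not the PUDL caching code; the class name, the resource key, and the fetch function are hypothetical, and a plain dict plays the role of cloud object storage.

```python
class ReadThroughStore:
    """Serve resources from a fast cache, falling back to a slow archive."""

    def __init__(self, fast_cache, fetch_from_archive):
        self._cache = fast_cache          # dict-like stand-in for object storage
        self._fetch = fetch_from_archive  # callable stand-in for an archive download

    def get(self, key):
        if key in self._cache:
            return self._cache[key]       # fast path: already cached
        data = self._fetch(key)           # slow path: go to cold storage
        self._cache[key] = data           # populate the cache for next time
        return data


archive_calls = []

def fake_archive_fetch(key):
    """Hypothetical archive download that records each call it receives."""
    archive_calls.append(key)
    return f"raw bytes for {key}".encode()


store = ReadThroughStore({}, fake_archive_fetch)
first = store.get("example-resource.zip")
second = store.get("example-resource.zip")  # served from the cache
```

In this arrangement the archive stays the authoritative, citable copy, while the cache only affects how quickly the same bytes are retrieved.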

What’s Next?

Thanks to generous funding from the Sloan Foundation Energy & Environment Program we have the resources to update and maintain our existing datasets, and to improve the underlying infrastructure over the next couple of years. Between now and the end of the year we’re focusing on the following projects.

New Data

As of late September 2021, a complete version of the 2020 data is available for all of our core datasets, including FERC 1, FERC 714, and EIA 860, 861, and 923. We’re working on integrating this data over the next couple of weeks, and hope to do another software and data release by the end of October.

We’ve also got the 2001-2003 EIA 860 data ready to go, so we’ll be able to offer 20+ years of continuous data coverage in a uniform format for the FERC Form 1, EIA 860/861/923, and EPA CEMS Hourly Emissions datasets.

We’re also overhauling our entity resolution and database normalization process to be able to accommodate a wider variety of inputs. This will let us finally get the EIA 861 data into the core database, and make it easier to bring in the other datasets we’re going after.

Direct SQLite / Parquet Output

We’ve decided to shift to producing relational databases (SQLite files) and columnar data stores (Apache Parquet files) as the primary outputs of PUDL. Tabular Data Packages didn’t end up serving either database or spreadsheet users very well. The CSV files were often too large to access via spreadsheets, and users missed out on the relationships between data tables. Needing to separately load the data packages into SQLite and Parquet was a hassle and generated a lot of overly complicated and fragile code. The direct to SQLite + Parquet system is already faster, simpler, and more robust.
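
To illustrate what keeping the relationships between tables buys you, here is a small self-contained sketch that uses Python's built-in sqlite3 module and an in-memory database. The table and column names are invented for the example and are not the PUDL schema.

```python
import sqlite3

# In-memory database with two related tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE plants (
        plant_id INTEGER PRIMARY KEY,
        plant_name TEXT
    );
    CREATE TABLE generators (
        plant_id INTEGER REFERENCES plants(plant_id),
        generator_id TEXT,
        capacity_mw REAL,
        PRIMARY KEY (plant_id, generator_id)
    );
    INSERT INTO plants VALUES (1, 'Example Plant');
    INSERT INTO generators VALUES (1, 'GEN1', 300.0);
    INSERT INTO generators VALUES (1, 'GEN2', 100.0);
    """
)

# The foreign key lets us roll generator capacity up to the plant level
# with a single join, rather than stitching separate CSV files together.
rows = conn.execute(
    """
    SELECT p.plant_name, SUM(g.capacity_mw)
    FROM plants AS p
    JOIN generators AS g ON p.plant_id = g.plant_id
    GROUP BY p.plant_id
    """
).fetchall()
conn.close()
```

The same query against a real SQLite file only needs the file path passed to sqlite3.connect in place of ":memory:".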

New Metadata

With the deprecation of tabular data package outputs, we’ve adopted a more modular metadata management system that uses Pydantic. This setup will allow us to easily validate the metadata schema and export to a variety of formats to support data distribution via Datasette and Intake catalogs, and automatic generation of data dictionaries and documentation. See #806 and the pudl.metadata subpackage. Many thanks to @ezwelty for most of this work.

Nightly Data Builds

We’re focusing on getting nightly builds of all our data outputs up and running so that we can exhaustively test all of the data processing and analyses on a continuous basis. This system will help us catch bugs and data irregularities much earlier. It should make it possible for us to do data and software releases much more frequently, to the point of automating the deployment of our data products to Datasette, Intake catalogs, and other access modes, in the same way that our documentation automatically updates whenever a new PR is merged into the dev or main branches.

The nightly data builds will use Prefect and Dask to parallelize our data processing pipeline, and will run on Google Cloud Platform. This infrastructure will let us scale the pipeline up to processing more and larger datasets, including the FERC Electric Quarterly Report, and on the natural gas side the EIA Form 176, FERC Form 2, and PHMSA’s annual report on gas infrastructure.