The Public Utility Data Liberation (PUDL) Project

Electric utilities report a huge amount of information to the US government and other public agencies. This includes yearly, monthly, and even hourly data about fuel burned, electricity generated, operating expenses, power plant usage patterns, and emissions. Unfortunately, much of this data is not released in well-documented, ready-to-use, machine-readable formats. Data from different agencies tends not to be standardized or easily used in tandem. Several commercial data services clean, package, and resell this data, but at prices that are too high for many smaller stakeholders.

The Public Utility Data Liberation (PUDL) project takes information that’s already publicly available and makes it publicly usable by cleaning, standardizing, and cross-linking utility data from different sources in a single database. Thus far our primary focus has been on fuel use, generation, operating costs, and operation history. As of November 2022, PUDL integrates data from:

EIA Form 860: 2001-2023
EIA Form 860m: 2023-06
EIA Form 861: 2001-2022
EIA Form 923: 2001-2022
EPA Continuous Emissions Monitoring System (CEMS): 1995-2022
FERC Form 1: 1994-2022
FERC Form 714: 2006-2020
US Census Demographic Profile 1 Geodatabase: 2010

See a high-level review of these datasets here and a review of how they’re integrated into PUDL here.

The information from these sources allows users to explore the operating costs of individual power plants and see how fuel costs impact the viability of different types of generation. It can highlight the competitiveness of renewable electricity in the market today; it can show how the generation mix of different utilities has evolved over time; and it can indicate how fuel prices and more renewable generation have changed the usage of individual power plants.

By making this database and associated software available under liberal open data and open source licenses, we hope to enable a broader variety of stakeholders to participate quantitatively in electricity regulation and climate policy discussions at the local, state, and federal levels. We want to see it used by data journalists, grassroots renewable energy activists, climate change activists, small renewable energy and demand-side management companies, and non-profit organizations. You can help us keep this resource free and open for anyone to use by making a monthly contribution to the PUDL project. Any amount is appreciated!

Check out some of the projects we’ve been supporting!

If you or your organization have other data you would like to see integrated into the database, or suggestions for how it could be made to serve your purposes more effectively, please get in touch with us. If you have questions you’d like answered, but don’t have the skills or time required to use the data yourself, we are interested in performing those analyses for other organizations at very reasonable rates. You can reach us at: hello@catalyst.coop.

PUDL Processing Pipeline

PUDL pulls data from various public sources and processes it to make it cleaner, more standardized, and better connected. Here is a schematic of the PUDL processing pipeline:

For a closer look at the database table creation, see the following schematic. Each PUDL data set gets extracted from our archived original data sets. Then each data set gets cleaned and normalized through a transform process before getting loaded into frictionless data packages. We have kept the processing pipelines for each data set as separate as possible, although some data sets are highly inter-related and thus rely on each other during the transform step. Catalyst has also developed or integrated “glue”, which generally connects multiple data sets together with shared IDs.

Diagram of PUDL Extract/Transform/Load steps.
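
The extract, transform, glue, and load steps described above can be sketched in a few lines of Python. This is a minimal illustration, not PUDL's actual code: every function name, field name, and record below is hypothetical, the "glue" here matches plants by normalized name (an assumption made only to keep the example short), and the load step builds an in-memory dictionary rather than a frictionless data package.

```python
# Hypothetical raw records standing in for two archived sources.
# All field names and values are invented for illustration.
RAW_EIA = [
    {"Plant Name": " Big Creek ", "Net Gen (MWh)": "1,200"},
    {"Plant Name": "Salt Flat", "Net Gen (MWh)": "850"},
]
RAW_FERC = [
    {"plant_name": "BIG CREEK", "opex_usd": "30000"},
    {"plant_name": "salt flat", "opex_usd": "17000"},
]


def extract(source):
    """Extract: copy records out of an archived source, unchanged."""
    return [dict(row) for row in source]


def normalize_name(name):
    """Collapse whitespace and lowercase so names compare consistently."""
    return " ".join(name.split()).lower()


def transform_eia(rows):
    """Transform: standardize column names, clean strings, parse numbers."""
    return [
        {
            "plant_name": normalize_name(row["Plant Name"]),
            "net_generation_mwh": float(row["Net Gen (MWh)"].replace(",", "")),
        }
        for row in rows
    ]


def transform_ferc(rows):
    """Transform the second source independently of the first."""
    return [
        {
            "plant_name": normalize_name(row["plant_name"]),
            "opex_usd": float(row["opex_usd"]),
        }
        for row in rows
    ]


def build_glue(*tables):
    """Glue: assign one shared ID to each plant seen in any table."""
    names = sorted({row["plant_name"] for table in tables for row in table})
    return {name: plant_id for plant_id, name in enumerate(names, start=1)}


def load(tables, glue):
    """Load: merge cleaned tables into one output keyed by the shared ID."""
    output = {}
    for table in tables:
        for row in table:
            plant_id = glue[row["plant_name"]]
            output.setdefault(plant_id, {"plant_id": plant_id}).update(row)
    return output


def run_pipeline():
    eia = transform_eia(extract(RAW_EIA))
    ferc = transform_ferc(extract(RAW_FERC))
    glue = build_glue(eia, ferc)
    return load([eia, ferc], glue)


if __name__ == "__main__":
    for record in run_pipeline().values():
        print(record)
```

Each source has its own transform function, mirroring the idea of keeping per-data-set pipelines separate, while the shared ID is the only point where the two sources meet.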

For more information on PUDL, how to use it, and what processing has been done to the data sets, see the PUDL documentation page or the PUDL GitHub repository.