Summer 2023 Goals

We’ve been working on our goal-setting process at Catalyst, and want to share our high-level goals for the summer – these take us through September 2023.

Publish all data products as SQL tables

In the past, we’ve published data products in two ways: a large portion of our data was published in SQLite/Parquet files; the rest, including many of our analysis outputs, was calculated directly in the PudlTabl Python class. You could interact with the SQLite and Parquet data any way you wanted. However, to access the PudlTabl outputs, you’d need to install the latest version of PUDL and all its dependencies. Maintaining that environment and managing the dependencies was an unnecessary barrier to data analysis.

You may have noticed, from our nightly builds, that more and more of the outputs from PudlTabl are stored directly in pudl.sqlite. We’ve been working on this transition for a few months, since the Dagster migration, and finally have just a few data products remaining: the MCOE outputs (heat_rate_by_unit, heat_rate_by_generator, fuel_cost_by_generator, capacity_factor_by_generator, and mcoe) and the plant parts list (mega_generators, plant_parts_eia). Soon, you’ll be able to access all of our data without installing the PUDL Python package!

This also means PudlTabl will soon be deprecated, and the preferred way to access our data will be through conventional SQL and Parquet tooling such as Datasette, SQLAlchemy, or RSQLite.
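As a rough sketch of what package-free access could look like, the example below queries a table using only Python’s built-in sqlite3 module. It builds a tiny in-memory stand-in database so that it runs anywhere; the column names and values in the stand-in mcoe table are invented for illustration and are not the real PUDL schema, and the fetch_rows helper is hypothetical. Against a real download, you would connect to the path of your own copy of pudl.sqlite instead.

```python
import sqlite3


def fetch_rows(conn, table, limit=5):
    """Return up to `limit` rows from `table` as a list of dicts."""
    # Table names cannot be bound as SQL parameters, so validate the name
    # against the database catalog before interpolating it into the query.
    known = {
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    cur = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
    columns = [desc[0] for desc in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


# Stand-in for pudl.sqlite: the columns and values here are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mcoe ("
    "plant_id INTEGER, generator_id TEXT, heat_rate REAL)"
)
conn.executemany(
    "INSERT INTO mcoe VALUES (?, ?, ?)",
    [(1, "A", 10.5), (1, "B", 9.8), (2, "1", 7.2)],
)

rows = fetch_rows(conn, "mcoe", limit=2)
print(rows)
```

The same pattern carries over to other SQLite clients, since the only thing the code depends on is a database file and a table name.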

Integrate new datasets into PUDL

We also plan to integrate some shiny new datasets, starting with PHMSA data. This contains operational data about methane gas gathering, transmission, and distribution in the US. After a stretch of infrastructure investment, we’re excited to focus on the “integrate new datasets” part of our partnership with Sloan! We’re doubly excited to expand into the methane gas aspect of US energy system data.

Integrate 2022 data for existing datasets

We’re working with RMI to integrate the 2022 data from our existing datasets, such as FERC forms 1/2/6/60/714 and EIA forms 860/860m/861/923. Each year, new data brings new challenges, but this quarter we plan to build automation tooling to help us detect issues as they arise and reduce the manual work required each year. This will be especially important because the annual data reconciliation workload will grow as we integrate new datasets. This year, we’re especially interested to see how the FERC XBRL data has changed since its debut in 2021.

Support RMI’s financial modeling efforts

We are also pleased to provide development and architectural support for RMI’s Optimus financial modeling tool. Optimus can show utilities how IRA incentives make cleaner portfolios better long-term investments, aid commercial partners in quantifying the distributional impact of their electrification plans, and support advocates by showing how ratemaking can evolve to minimize the burden of the transition on LMI customers. We’re helping RMI revamp the engineering side of their system to support faster, more confident development of the model.

Apply automated entity matching techniques

We’ve been working with CCAI on entity-matching problems in the energy data space. So far, we’ve been experimenting with using Splink to match EIA and FERC plant IDs. This summer, we’re hoping to bring that process into PUDL and generalize it to other problems such as inter-year FERC to FERC plant ID matching.
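To give a feel for the shape of the problem, here is a deliberately simplified sketch of matching plant records across two sources. It is not Splink and not the approach used in PUDL: Splink fits a probabilistic linkage model, while this sketch swaps in a plain string-similarity score from Python’s difflib. The record layout, the IDs, the plant names, the 0.8 threshold, and the helper names are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def best_matches(left, right, threshold=0.8):
    """Map each left ID to its most similar right ID, or None.

    Records are (name, state) tuples. Candidates are only compared
    within the same state, a simple form of blocking that keeps the
    number of comparisons down.
    """
    matches = {}
    for left_id, (left_name, left_state) in left.items():
        best_id, best_score = None, 0.0
        for right_id, (right_name, right_state) in right.items():
            if left_state != right_state:
                continue
            score = SequenceMatcher(
                None, normalize(left_name), normalize(right_name)
            ).ratio()
            if score > best_score:
                best_id, best_score = right_id, score
        # Only accept the best candidate if it clears the threshold.
        matches[left_id] = best_id if best_score >= threshold else None
    return matches


# Made-up records: these are not real FERC or EIA plant IDs.
ferc_plants = {
    "f1": ("Comanche Station", "CO"),
    "f2": ("Riverside Plant", "MN"),
    "f3": ("Unnamed Peaker", "TX"),
}
eia_plants = {
    10: ("Comanche Stn.", "CO"),
    20: ("Riverside Plant", "MN"),
    30: ("Riverside Plant", "CA"),
    40: ("Lakeview", "TX"),
}

matches = best_matches(ferc_plants, eia_plants)
print(matches)
```

A single similarity score on one field is brittle, which is part of why a probabilistic tool that weighs evidence from several fields is attractive for this kind of matching.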

Meet new people and organizations!

Of course, we’re also looking to connect with exciting new people! We’re looking for new contributors, grant funders that are interested in PUDL development and maintenance, and organizations that could benefit from our blend of energy policy domain knowledge and data engineering/data science expertise. If that sparks any connections in your mind, please drop us a line at


Rescuing Historical FERC Data

UPDATE 2022-01-19: We have received word from FERC that access to the historical data discussed below will be restored this week. As it becomes available we will also archive it on Zenodo just in case. Thank you to everyone who reached out and helped bring this issue to FERC’s attention!

This week we discovered that decades worth of energy system data collected by the Federal Energy Regulatory Commission (FERC) had been removed from the agency’s website. They apparently have no plan to archive it or migrate it to another platform. We are attempting to obtain a bulk download of all this data so we can archive it alongside our other raw data sources on Zenodo.

This data records many financial, operational, and economic aspects of the US energy system. It is a unique and valuable resource for anyone trying to understand how public policy and market conditions have shaped our energy system over time. Simply deleting this data with no warning and no plan to archive it or migrate it to another platform is completely unacceptable.

If you know someone within FERC who can help get us a copy of this data to archive publicly, please put us in touch:


Weeknotes 2021-03-19

What We’re Doing

The Census DP1 GeoDatabase has been integrated into PUDL as a standalone SQLite DB for use with the EIA 861 to compile historical utility and balancing authority service territories, and with FERC 714 data in estimating state-level historical hourly electricity demand. Previously it had an ad hoc, non-standard ETL process.

Our documentation now has an index of all the PUDL DB tables, including the names of the columns, their data types, and descriptions of the contents, thanks to some work with Jinja templates by Austen. This is just one small part of a bigger docs overhaul as we try to get PUDL 0.4.0 out by the end of March.

PUDL is finally compatible with Python 3.9, using both pip and conda. The last dependency to make the transition was Numba, which as of v0.53.0 works with Python 3.9. Our CI is now running tests on both Python 3.8 and 3.9.