duckdb Archives - Catalyst Cooperative

The Federal Energy Regulatory Commission (FERC) collects a lot of data in its role as the regulator of US electricity markets. One dataset that we’ve had our eyes on for years, but never had the resources to integrate into PUDL (until now!), is the Electric Quarterly Report, or EQR.

The EQR is the reporting mechanism FERC uses for public utilities to fulfill their responsibility under section 205(c) of the Federal Power Act (FPA) to have their rates and charges on file in a convenient form and place.

Under this section of the FPA:

Where two or more public utilities are parties to the same rate schedule or tariff, each public utility transmitting or selling electric energy subject to the jurisdiction of this Commission shall post and file such rate schedule

In modern US electricity markets, there are more than a million of these transactions every day. This makes posting them in a “convenient form and place” non-trivial. FERC provides an online EQR report viewer where you can scroll through a list of thousands of sellers and download their filing for a particular quarter as a summary PDF. But this is no way to do analysis.

They also offer bulk downloads of the raw data as quarterly zipfiles containing CSVs or XML documents, but with more than 4 billion transactions recorded since the modern reporting era started in 2013, these formats are also not very ergonomic for bulk analysis. And even as compressed zipfiles, the full FERC EQR dataset is 100 GB. It’s far too big to analyze in bulk locally with tools like pandas, let alone spreadsheets. However, with the development of tools like DuckDB and Polars over the last few years, working with this scale of tabular data has become much easier, even on a laptop!

The data locked up in the FERC EQR is potentially valuable for understanding how renewable energy power purchase agreements and battery storage prices have evolved over time, and what’s really driving rising energy costs. The EQR data is particularly valuable for less organized markets like the US Southeast and Intermountain West, where there is no ISO/RTO market data. We’re hopeful that this data can also make it easier to programmatically identify anti-competitive utility affiliate transactions and uneconomic self-dispatch by regulated monopolies.

Democratizing FERC EQR Access

With additional support from Gigawatt-tier PUDL Sustainer GridLab, Catalyst is going to create an open access, cloud-native version of the FERC EQR dataset, designed for bulk analysis.

Timeframe: We expect to be working on the project from now through the first quarter of 2026.
Data Coverage: We are initially targeting the 2013Q3 to present data, since it is all in a very similar format, while older EQR data was submitted under a different reporting regime, in a different format, and is also not as rich in detail (though we have already archived the raw 2002-2013 data).
Outputs: We’re planning to mirror the structure of the current data, with four main tables: transactions, contracts, filer identities, and index publications, along with a new table that records all FERC company IDs and tracks how their reported name and other attributes change over time.
Format: The data will be distributed as a collection of Apache Parquet files, probably partitioned by quarter. All of the data from 2013Q3 to the present will use the same well-defined schema, informed by all of the historical EQR data dictionaries published by FERC.
Processing: Our first priority is making the bulk data freely accessible in a format that’s appropriate for analysis. We are not planning on doing any major data processing, beyond enforcing a uniform schema, converting the data to Parquet, and dealing with character encoding and CSV formatting issues that affect ~1% of all records.
Versioning: Due to the size of the data and the limited free cloud storage we have as part of the AWS Open Data Registry, we won’t be able to provide multiple historical versions of the outputs like we do with other PUDL data. Whenever we update the outputs, the older version will be replaced.
Updates: We plan to update the outputs at least once a quarter, and will capture fresh snapshots of the raw EQR data using our automated data archiving scripts.
Access: We plan to publish the FERC EQR alongside the rest of our PUDL data in a freely accessible S3 bucket as part of the AWS Open Data Registry. We also intend to make the EQR outputs available for preview and querying through the PUDL Data Viewer, which will allow users to download smaller subsets of the data as CSVs for spreadsheet-based analysis if that’s what they need.
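
To make the Processing item above a little more concrete, here is a minimal sketch of what "dealing with character encoding issues" and "enforcing a uniform schema" could look like for a single CSV. It is an illustration under stated assumptions, not the actual PUDL implementation: the column names in SCHEMA are hypothetical rather than real EQR field names, and the fallback to Windows-1252 is a guess at one common kind of encoding problem. It uses only the Python standard library and stops short of writing Parquet.

```python
import csv
import io

# Hypothetical target schema: column name -> converter.
# These are illustrative names, not the real EQR columns.
SCHEMA = {
    "seller_name": str,
    "quantity": float,
    "price": float,
}


def decode_bytes(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Windows-1252 (an assumed fallback)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


def read_rows(raw: bytes) -> list:
    """Parse CSV bytes and coerce every row to SCHEMA.

    Columns outside SCHEMA are dropped; missing or blank values become None.
    """
    reader = csv.DictReader(io.StringIO(decode_bytes(raw), newline=""))
    rows = []
    for record in reader:
        row = {}
        for column, convert in SCHEMA.items():
            value = (record.get(column) or "").strip()
            row[column] = convert(value) if value else None
        rows.append(row)
    return rows


# A tiny made-up file that is NOT valid UTF-8 and has an extra column.
sample = (
    "seller_name,quantity,price,extra\n"
    "Caf\xe9 Power,10,25.5,x\n"
    "Acme,,30,y\n"
).encode("cp1252")
rows = read_rows(sample)
```

Every file that passes through a step like this comes out with the same columns and types, which is the property that lets the quarterly outputs be read together as one table.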

Get Involved!

If you’re already familiar with the FERC EQR and want to help test out the new data, we’d love to hear from you: [email protected]
We’d like to better understand how this data is used in any context, whether it’s research, policy, journalistic, or commercial, so if you’ve got a use case in mind for historical electricity transaction data, let us know what it is. Maybe sign up for office hours so we can chat more, or help you get familiar with Parquet files.
What other data does the FERC EQR need to be connected to? One of the things we’re hoping this analysis-ready, cloud-native version of EQR will enable is robust record linkage with other datasets. Can we identify the LMP nodes where power is being delivered? Can we identify electricity buyers in other FERC or EIA data, even though all we have is their name? What else should we be thinking about?
Stay up to date with the project: Subscribe to our ~monthly newsletter, keep track of the FERC EQR issues in the main PUDL repo, or follow the EQR scoping repo where we’ll be prototyping the new system.
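
As a rough illustration of the name-only record linkage question raised above, the sketch below normalizes company names and picks the closest candidate by string similarity. This is a deliberately simple stand-in, not a method anyone has committed to: the suffix list, the 0.85 threshold, and the example names are all assumptions made up for this example, and robust linkage would likely need more context than names alone.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Illustrative legal suffixes to ignore; a real list would be longer.
SUFFIXES = {"inc", "llc", "lp", "co", "corp", "company", "corporation"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def best_match(
    name: str, candidates: list, threshold: float = 0.85
) -> Optional[str]:
    """Return the most similar candidate, or None if nothing clears threshold."""
    target = normalize(name)
    scored = [
        (SequenceMatcher(None, target, normalize(c)).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None


# Hypothetical names, not taken from real EQR or EIA records.
eia_names = ["Example Power Company", "Sample Energy LLC"]
match = best_match("EXAMPLE POWER CO., INC.", eia_names)
```

Here the buyer name and the first candidate both normalize to "example power", so they are treated as the same entity, while a name with no close candidate returns None instead of a forced match.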

After 5 years of ogling the EQR from afar, and projecting our (data) hopes and dreams into its depths, we’re excited to finally tackle it for real, and hope it’ll be a valuable resource for understanding US energy markets and accelerating the ever more economical decarbonization of the US electricity system.