Categories
updates

Entity matching with Splink to connect FERC to EIA

Linking power plant financial data to energy system operational data with help from Climate Change AI

At the end of 2023, Catalyst wrapped up work funded by a Climate Change AI (CCAI) Innovation Grant using entity matching (record linkage) to connect the energy system financial data reported to the US Federal Energy Regulatory Commission (FERC) and physical energy system data reported to the US Energy Information Administration (EIA). While the data published in FERC Form 1 refers to the same utilities, power plants, and generators that are reported by EIA, these entities lack common IDs to link them. This connection between datasets is necessary to show that retiring certain fossil fuel power plants in favor of renewable energy sources is economically beneficial and technically feasible while still meeting the physical demands of today’s grid. Conducting entity matching to model this connection eliminates the extremely laborious process of sifting through these datasets and performing a manual connection. In collaboration with and support of RMI’s Utility Transition Hub, Catalyst created a small validation dataset of manually linked records, and thus know first hand the tedium of conducting this linkage manually.

Over the course of the grant period we developed the connection of FERC Form 1 plants to EIA data from a one-off module to an integrated analysis maintained and deployed with our nightly PUDL builds. Along the way, we updated our FERC-FERC plant connection (the plant_id_ferc1 column in out_ferc1__yearly_all_plants in the PUDL database), providing a unique plant ID to link FERC plants across all years of reporting. We believe our published output table of connections (out_pudl__yearly_assn_eia_ferc1_plant_parts in the PUDL database) is the only regularly updated, free and open-source connection between the FERC and EIA datasets. 

We hope the result enables advocates working to decarbonize our electricity system to more easily bring defensible and data-driven analyses to state-level legislative and regulatory processes. Additionally, we hope that the published matching framework can serve as an open-source example of record linkage for energy datasets and be a model for attempting similar connections with other energy datasets.

Inputs

The data published in FERC Form 1 is messy; reported records correspond to an assortment of generator aggregations (e.g. prime mover, primary fuel source, technology type, plants, or generator units). To create an EIA input that could match the diversity of records reported in FERC Form 1, we created the EIA “plant parts table”. This table contains aggregations of all EIA “plant parts” corresponding to the various granularities appearing in the FERC data.

FERC Input: out_ferc1__yearly_all_plants

EIA Input: out_eia__yearly_plant_parts

Model

After experimenting with several machine learning packages, we decided to use the open-source Python package Splink as it provided helpful transparency into the effects of changing model parameters and produced results better than our existing baseline. Splink is an entity matching and deduplication interface based on the Fellegi-Sunter algorithm for record linkage. Its main advantages are its speed working with data locally, its interface for users to define fuzzy matching logic between attributes in the input datasets, and its features for doing an unsupervised match (with no training data). Splink provides interactive charts of the model weights that make it easier for downstream users to provide feedback without advanced understanding of the underlying model mechanics.

Results

We used the manually matched dataset to evaluate the model results by a metric of precision and recall. Consider the set of FERC records in this manual validation dataset that the model predicted a matching EIA record for. Precision is the percentage of these matches that are correct. It represents the model’s accuracy when making a prediction. Now, consider all of the FERC records in the manual validation dataset. Recall is the percentage of these FERC records that the model predicted an EIA match for. It represents the model’s coverage of the FERC dataset. The table below displays the precision and recall of the Splink model alongside a baseline linear regression model that was previously integrated into PUDL. The “match probability threshold” is the threshold at which pairs with a lower probability of matching are labeled as a non-match. As the match threshold decreases, more record pairs are labeled as a match and the recall increases. However, precision decreases as the match threshold decreases because the match quality is lower and more FERC records are matched to an incorrect EIA record. Considering the needs of downstream users, we prioritized publishing match results with high precision and thus chose a match threshold of .9 for use in our deployed model.

Match Probability ThresholdPrecisionRecall
.95.944.833
.9.943.843
.75.940.862
.5.939.875
.25.938.887
baseline.90.73

Challenges and Limitations

One of the initial challenges we encountered during the project was the high percentage of null values in the input datasets. This significantly impacted the quality of our entity matching results. Additionally, our manually compiled training/validation dataset was relatively small and inherently introduced some unknown biases within the small sample size. Recognizing the dynamic nature of data over time and the potential shifts in representation as more data is published, we additionally experimented with the unsupervised training features for the Splink model. Results were similar to those of the supervised model, and we anticipate using the unsupervised model in the event that the existing training data becomes too outdated or fails to represent evolving patterns in the data. This forward-looking approach ensures adaptability to new data trends and optimizes for scenarios not adequately represented in the initial training dataset.

What’s Next?

With the development of this framework for entity matching, Catalyst is capable of greater flexibility and efficiency in data-driven model development. In 2024, we are building on this framework using funding from the Mozilla Foundation to link Security Exchange Commission utility ownership data to EIA utility operational data. We hope to leverage these models to address analogous issues in natural gas data in the future.

Catalyst is making exciting progress in providing open data to electricity resource planning models like the GridPath RA Toolkit with support from GridLab. Our initial work on these inputs has revealed that there is a need for entity matching in almost all of the datasets under consideration. For example, the Western Electricity Coordinating Council’s Reliability Modeling Anchor Data Set (WECC ADS) has transmission node IDs, generator IDs, and utility IDs that do not match other data sets referring to the same entities. We are excited to utilize the resource efficiency, usability, and transparency of Splink in building entity matching models for these datasets.

Please reach out to us with questions about the modeling process or resulting connection table, and let us know how you are utilizing the FERC to EIA connection!

Categories
updates

Summer 2023 Goals

We’ve been working on our goal-setting process at Catalyst, and want to share our high-level goals for the summer – these take us through September 2023.

Publish all data products as SQL tables

In the past, we’ve published data products in two ways: a large portion of our data was published in SQLite/Parquet files; the rest, including many of our analysis outputs, were calculated directly in the PudlTabl Python class. You could interact with the SQLite and Parquet data any way you wanted. However, to access the latter, you’d need to install the latest version of PUDL and all its dependencies. Maintaining that environment and managing the dependencies was an unnecessary barrier to data analysis.

You may have noticed, from our nightly builds, that more and more of the outputs from PudlTabl are stored directly in pudl.sqlite. We’ve been working on this transition for a few months, since the Dagster migration, and finally have just a few data products remaining: the MCOE outputs (heat_rate_by_unit, heat_rate_by-generator, fuel_cost_by_generator, capacity_factor_by_generator, and mcoe) and the plant parts list (mega_generators, plant_parts_eia). Soon, you’ll be able to access all of our data without installing the PUDL Python package!

This also means PudlTabl will soon be deprecated, and the preferred way to access our data will be through conventional SQL and Parquet tooling such as Datasette, SQLAlchemy, or RSQLite.

Integrate new datasets into PUDL

We also plan to integrate some shiny new datasets, starting with PHMSA data. This contains operational data about methane gas gathering, transmission, and distribution in the US. After a stretch of infrastructure investment, we’re excited to focus on the “integrate new datasets” part of our partnership with Sloan! We’re doubly excited to expand into the methane gas aspect of US energy system data.

Integrate 2022 data for existing datasets

We’re working with RMI to integrate the 2022 data from our existing datasets, such as FERC forms 1/2/6/60/714 and EIA forms 860/860m/861/923. Each year, new data brings new challenges, but this quarter we plan to build automation tooling to help us detect issues as they arise and reduce the manual work required each year. This will be especially important as the annual data reconciliation requirements will increase when we integrate new datasets. This year, we’re especially interested to see how the FERC XBRL data has changed since its debut in 2021. 

Support RMI’s financial modeling efforts

We are also pleased to provide development and architectural support for RMI’s Optimus financial modeling tool. Optimus can show utilities how IRA incentives make cleaner portfolios better long-term investments, aid commercial partners in quantifying the distributional impact of their electrification plans, and support advocates by showing how ratemaking can evolve to minimize the burden of the transition on LMI customers. We’re helping RMI revamp the engineering side of their system to support faster, more confident development of the model.

Apply automated entity matching techniques

We’ve been working with CCAI on entity-matching problems in the energy data space. So far, we’ve been experimenting with using Splink to match EIA and FERC plant IDs. This summer,  we’re hoping to bring that process into PUDL and generalize it to other problems such as inter-year FERC to FERC plant ID matching.

Meet new people and organizations!

Of course, we’re also looking to connect with exciting new people! We’re looking for new contributors, grant funders that are interested in PUDL development and maintenance, and organizations that could benefit from our blend of energy policy domain knowledge and data engineering/data science expertise. If that sparks any connections in your mind, please drop us a line at hello@catalyst.coop.

Categories
updates

Rescuing Historical FERC Data

UPDATE 2022-01-19: We have received word from FERC that access to the historical data discussed below will be restored this week. As it becomes available we will also archive it on Zenodo just in case. Thank you to everyone who reached out and helped bring this issue to FERC’s attention!

This week we discovered that decades worth of energy system data collected by the Federal Energy Regulatory Commission (FERC) had been removed from the agency’s website. They apparently have no plan to archive it or migrate it to another platform. We are attempting to obtain a bulk download of all this data so we can archive it alongside our other raw data sources on Zenodo.

This data records many financial, operational, and economic aspects of the US energy system. It is a unique and valuable resource for anyone trying to understand how public policy and market conditions have shaped our energy system over time. Simply deleting this data with no warning, no plan to archive it, or migrate it to another platform is completely unacceptable.

If you know someone within FERC who can help get us a copy of this data to archive publicly, please put us in touch: hello@catalyst.coop

Categories
updates

PUDL v0.5.0: 2020 and Beyond

It’s been almost a month since we pushed out our first actual quarterly software and data release: PUDL v0.5.0! The main impetus for this release was to get the final annual 2020 data integrated for the FERC and EIA datasets we process. We also pulled in the EIA 860 data for 2001-2003, which is only available as DBF files, rather than Excel spreadsheets. This means we’ve got coverage going back to 2001 for all of our data now! Twenty years! We don’t have 100% coverage of all of the data contained in those datasets yet, but we’re getting closer.

Beyond simply updating the data, we’ve also been making some significant changes to how our ETL pipeline works under the hood. This includes how we store metadata, how we generate the database schema, and what outputs we’re generating. The release notes contain more details on the code changes, so here I want to talk a little bit more about why, and where we are hopefully headed.

If you just want to download the new data release and start working with it, it’s up here on Zenodo. The same data for FERC 1 and EIA 860/923 can also be found in our Datasette instance at https://data.catalyst.coop

Categories
updates

New PUDL Software & Data Release: v0.4.0

In August we put out a new PUDL software and data release for the first time in 18 months. We had a lot of client work, and kept putting off doing the release, so a whole lot of changes accumulated. Some highlights, mostly based on the continuously updated release notes in our documentation:

New Data Coverage

  • EIA Form 860 added coverage for 2004-2008, as well as 2019.
  • EIA Form 860m has been integrated (through Nov 2020). Note that it only adds up-to-date information about generators (especially their operational status).
  • EIA Form 923 added the 2001-2008 data, as well as 2019.
  • EPA CEMS Hourly Emissions covering 2019-2020.
  • FERC Form 714 covering 2006-2019, but only the table of hourly electricity demand by planning area. This data is still in beta and the data hasn’t been integrated into the core SQLite database, but you can process it on the fly if you want to work with it in Pandas.
  • EIA Form 861 for 2001-2019. Similar to the FERC Form 714, this ETL runs on the fly and the outputs aren’t integrated into the database yet, but it’s available for experimental use.
  • US Census Demographic Profile 1 (DP1) for 2010. This is a separate SQLite database, generated from a US Census Geodatabase, which includes census tract, county, and state level demographic information, as well as spatial boundaries of those jurisdictions.
Categories
linkstream weeknotes

SQL for data analysis, DGP, and pair programming

Some good technical long reads from the last couple of weeks:

(Postgre)SQL for Data Analysis

Before the Tidyverse and Pandas, there was SQL. There’s still SQL, and as Vicki Boykis often points out: every data-centric framework that hangs around long enough tends toward SQL. It’s got almost half a century of careful thinking and optimization behind it. It seems entirely possible that it’ll still be around after another half century.

In this extensive post Haki Benita explores a bunch of data analysis that can be done directly with PostgreSQL in particular. It can be used either as an efficient preprocessing step before handing off to other tools, or to generate final products. It covers basic data selection, random selection, sampling, splitting data into training & testing sets, descriptive statistics, aggregations, regressions, interpolation, binning and much more. It’s almost more of a pocket guide to data analysis in SQL than a blog post.

Data (Error) Generation Processes

In this post Emily Riederer explores how conceptualizing data (and error!) generation processes can help you do better data validation. What does the data represent in the real world? How is it being collected? How does it move from where it’s collected to where it’s processed? What kinds of transformations operate on it before you look at the outputs? Understanding these steps and their contexts makes it easier to imagine how things can go wrong along the way and what errors to check for. It also makes it easier to debug errors when you find them.

On Pair Programming

A guide to pair programming from Birgitta Böckeler and Nina Siessegger. They look at both how and why to do it, and some of the challenges that it brings up. I had no idea that this has been a practice going back as far as the women who programmed ENIAC.

The authors explore several different styles of pair programming and the logistical planning required to make it work. They touch on the extra challenges of doing remote pairing which seems extra relevant these days. They cover productive and destructive social dynamics that come up, and a whole lot more. The article is long, but it’s definitely worth a read if you’ve thought about trying pair programming and been reluctant, or have tried it and been dissatisfied.

Categories
weeknotes

What we’re reading for the week of March 1st, 2021

A roundup of interesting posts related to data, code, energy, or climate that we came across in the first week of March, 2021.

Energy & Climate

  • Xcel Energy’s Comanche 3 coal plant in Colorado continues to be an expensive boodoggle. It’s spent 2 of its first 10 years of operation shut down for maintenance, forcing Xcel to buy electricity on the open market to fill the gap. But actually this was good for their customers, since the electricity the plant produces when it is running is so expensive ($66.25/MWh) that this actually saved customers money! Unfortunately they’ll still be on the hook for remaining capital costs far into the future. Xcel thinks they’ll use the plant as a seasonal or load following resource after 2030… at which point Xcel will still have $460M left in the plant.
  • Market Design for the Clean Energy Transition: Advancing Long-Term Approaches. Materials from a workshop put on by WRI and Resources for the Future, exploring how electricity markets need to adapt to accommodate lots of zero carbon, very low marginal cost generation, that’s also not entirely dispatchable.
  • A post from the Energy and Policy Institute looking at political “miscellaneous” spending by utilities, as reported in the FERC Form 1 — using our data!
  • Securitization in Action: How US States are Shaping an Equitable Coal Transition: a post from some of our collaborators at RMI, looking at some of the work our data liberation has helped enable — namely getting uneconomic fossil generation offline as cheaply as possible. Well, as cheaply as possible without forcing the utilities to absorb the costs anyway.
  • CMIP6: the next generation of climate models explained. A look at how climate scientists compare their models in a standardized way, so that they can understand why they get different answers sometimes. This is something we really need more of in the energy modeling space — otherwise every conversation eventually devolves into criticizing the inputs and assumptions.

Data & Code

  • Command Line Interface Guidelines: a collection of best practices for designing modern command line tools that are relatively user friendly, and take advantage of many features of modern Unix terminals.
  • Column Names as Contracts: an interesting post by Emily Reiderer about the potential benefits of storing metadata in column names using a controlled vocabulary, allowing them to be programmatically parsed.
  • Embedding column-name contracts in data pipelines with dbt builds on that last post, and looks at how Jinja templates and tools like dbt let you do more interesting dynamic data work if your columns have consistent and controlled names.
  • What is dbt anyway? It stands for “data build tool” and it can be used to specify, store, and version control complex data transformation instructions as text files. A lot of the data we’re working with from FERC and EIA are too messy for this to be helpful in our initial ETL process, but once we’ve got the databases being populated in the cloud automatically, this could be a good way to create new derived data products. Thanks to our friend Brittany Bennett at Sunrise Movement for telling us about dbt.
  • I helped build ByteDance’s censorship machine. A story about what it’s like to work inside a tech company actively implementing censorship measures. ByteDance is the Chinese owner of TikTok.
  • Documentation for pydantic. We’re trying to make all of our metadata programmatically accessible, and remove duplication wherever possible, and using pydantic to parse and validate the metadata we compile by hand so we know it’s at least structurally sound.
  • Python Packages is an online book about how to package and distribute… Python packages. We wish we’d had this a couple of years ago when we were figuring it out for the first time! Focuses on modern rather than legacy frameworks, going straight for pyproject.toml, poetry, and CI/CD using GitHub actions. There’s also a cookie cutter repo on GitHub that templates many of the practices from the book. Via Tiffany Timbers.
  • EPA has released a crosswalk table that connects their CEMS data to the EIA boilers and generators. Thank goodness we won’t have to compile it now. More info in their GitHub repo.
  • Nice preprint from Ryan Abernathey et al. on cloud-native scientific data repositories. This is very much in line with our plans for the PUDL data — even though our data is several orders of magnitude smaller than a lot of what he’s talking about.
  • Eliminating Toil is a short essay from some Googlers on the nature of a particular kind of work that shows up in many data wrangling (and software) contexts. A lot of our mission here is saving others from data toil.
  • Great Expectations and Pandas Profiling: a blog post on how to use these two tools together to automatically draft data validation test cases. Vaguely along the same lines as Pandera, though that library has more of a statistical bent.
Categories
updates

PUDL Infrastructure Roadmap for 2021

A couple of weeks ago I attended TWEEDS 2020 virtually (like everything this year) and talked about Catalyst’s ongoing Public Utility Data Liberation (PUDL) project, and especially the challenges of getting a big pile of data into the hands of different kinds of users, using different tools for different purposes. It ended up sketching out a bit of a PUDL infrastructure roadmap for the next year, and so we thought it would be a good idea to write it up here too.

We’ll have a separate post looking at our 2021 data roadmap.

The US Energy Information Asymmetry

PUDL is all about addressing a big information asymmetry in the regulatory and legislative processes that affect the US energy system. Utilities have much more information about their own systems than policymakers and advocates typically do. As a result, regulators often defer to the utilities on technical & analytical points. Commercial data exists, but it’s expensive. We want to get enough data into the hands of other kinds of stakeholders that they can make credible quantitative arguments to regulators, and challenge unfounded assertions put forward by utilities.

Federal Agencies and Their Favorite File Formats
Categories
analysis

Heat Rate Calculation for EIA Generators

Catalyst is pulling together an estimate of the marginal cost of electricity (MCOE) for every natural gas and coal fired power plant in the US whose data we can get our hands on. We’re using data from the EIA 923, EIA 860, and FERC Form 1 to do it.  Getting the heat rate right for each generator is an important part of this calculation, but a lot of the required data is… not perfect. Here’s how we’re working through it.