Categories
updates

Entity matching with Splink to connect FERC to EIA

Linking power plant financial data to energy system operational data with help from Climate Change AI

At the end of 2023, Catalyst wrapped up work funded by a Climate Change AI (CCAI) Innovation Grant using entity matching (record linkage) to connect the energy system financial data reported to the US Federal Energy Regulatory Commission (FERC) and physical energy system data reported to the US Energy Information Administration (EIA). While the data published in FERC Form 1 refers to the same utilities, power plants, and generators that are reported by EIA, these entities lack common IDs to link them. This connection between datasets is necessary to show that retiring certain fossil fuel power plants in favor of renewable energy sources is economically beneficial and technically feasible while still meeting the physical demands of today’s grid. Conducting entity matching to model this connection eliminates the extremely laborious process of sifting through these datasets and performing a manual connection. In collaboration with and support of RMI’s Utility Transition Hub, Catalyst created a small validation dataset of manually linked records, and thus know first hand the tedium of conducting this linkage manually.

Over the course of the grant period we developed the connection of FERC Form 1 plants to EIA data from a one-off module to an integrated analysis maintained and deployed with our nightly PUDL builds. Along the way, we updated our FERC-FERC plant connection (the plant_id_ferc1 column in out_ferc1__yearly_all_plants in the PUDL database), providing a unique plant ID to link FERC plants across all years of reporting. We believe our published output table of connections (out_pudl__yearly_assn_eia_ferc1_plant_parts in the PUDL database) is the only regularly updated, free and open-source connection between the FERC and EIA datasets. 

We hope the result enables advocates working to decarbonize our electricity system to more easily bring defensible and data-driven analyses to state-level legislative and regulatory processes. Additionally, we hope that the published matching framework can serve as an open-source example of record linkage for energy datasets and be a model for attempting similar connections with other energy datasets.

Inputs

The data published in FERC Form 1 is messy; reported records correspond to an assortment of generator aggregations (e.g. prime mover, primary fuel source, technology type, plants, or generator units). To create an EIA input that could match the diversity of records reported in FERC Form 1, we created the EIA “plant parts table”. This table contains aggregations of all EIA “plant parts” corresponding to the various granularities appearing in the FERC data.

FERC Input: out_ferc1__yearly_all_plants

EIA Input: out_eia__yearly_plant_parts

Model

After experimenting with several machine learning packages, we decided to use the open-source Python package Splink as it provided helpful transparency into the effects of changing model parameters and produced results better than our existing baseline. Splink is an entity matching and deduplication interface based on the Fellegi-Sunter algorithm for record linkage. Its main advantages are its speed working with data locally, its interface for users to define fuzzy matching logic between attributes in the input datasets, and its features for doing an unsupervised match (with no training data). Splink provides interactive charts of the model weights that make it easier for downstream users to provide feedback without advanced understanding of the underlying model mechanics.

Results

We used the manually matched dataset to evaluate the model results by a metric of precision and recall. Consider the set of FERC records in this manual validation dataset that the model predicted a matching EIA record for. Precision is the percentage of these matches that are correct. It represents the model’s accuracy when making a prediction. Now, consider all of the FERC records in the manual validation dataset. Recall is the percentage of these FERC records that the model predicted an EIA match for. It represents the model’s coverage of the FERC dataset. The table below displays the precision and recall of the Splink model alongside a baseline linear regression model that was previously integrated into PUDL. The “match probability threshold” is the threshold at which pairs with a lower probability of matching are labeled as a non-match. As the match threshold decreases, more record pairs are labeled as a match and the recall increases. However, precision decreases as the match threshold decreases because the match quality is lower and more FERC records are matched to an incorrect EIA record. Considering the needs of downstream users, we prioritized publishing match results with high precision and thus chose a match threshold of .9 for use in our deployed model.

Match Probability ThresholdPrecisionRecall
.95.944.833
.9.943.843
.75.940.862
.5.939.875
.25.938.887
baseline.90.73

Challenges and Limitations

One of the initial challenges we encountered during the project was the high percentage of null values in the input datasets. This significantly impacted the quality of our entity matching results. Additionally, our manually compiled training/validation dataset was relatively small and inherently introduced some unknown biases within the small sample size. Recognizing the dynamic nature of data over time and the potential shifts in representation as more data is published, we additionally experimented with the unsupervised training features for the Splink model. Results were similar to those of the supervised model, and we anticipate using the unsupervised model in the event that the existing training data becomes too outdated or fails to represent evolving patterns in the data. This forward-looking approach ensures adaptability to new data trends and optimizes for scenarios not adequately represented in the initial training dataset.

What’s Next?

With the development of this framework for entity matching, Catalyst is capable of greater flexibility and efficiency in data-driven model development. In 2024, we are building on this framework using funding from the Mozilla Foundation to link Security Exchange Commission utility ownership data to EIA utility operational data. We hope to leverage these models to address analogous issues in natural gas data in the future.

Catalyst is making exciting progress in providing open data to electricity resource planning models like the GridPath RA Toolkit with support from GridLab. Our initial work on these inputs has revealed that there is a need for entity matching in almost all of the datasets under consideration. For example, the Western Electricity Coordinating Council’s Reliability Modeling Anchor Data Set (WECC ADS) has transmission node IDs, generator IDs, and utility IDs that do not match other data sets referring to the same entities. We are excited to utilize the resource efficiency, usability, and transparency of Splink in building entity matching models for these datasets.

Please reach out to us with questions about the modeling process or resulting connection table, and let us know how you are utilizing the FERC to EIA connection!

Categories
updates

You Don’t Have to Install PUDL Anymore

We’re excited to announce that you no longer have to install the PUDL Python library to access electric generation data linked across FERC and EIA such as capacity factor, heat rate, and fuel cost. These, and many others, are now available directly in the PUDL database, which you can download from Zenodo here. You can find more details on how to access the data here.

We were able to complete this large infrastructural overhaul with the help of generous funding from the Sloan foundation.

Now that you can use any tools you want to analyze the data, here are some ideas:

  • Use the same type of Python code you have been using, but freed from our tangled web of dependencies!
  • Use another language you like better: R, Rust, Ruby, or even other languages that don’t start with R (Julia?)
  • Use Kaggle to check out our data without installing any programming environments at all!
  • Hook up a BI tool to quickly generate low/no-code dashboards and visualizations!

Since we’re moving away from downstream use of the library, we are also deprecating the PudlTabl class. It will still work, for now, but it’s now just a shell around accessing the database tables and will be removed in a future release.

One further change we made during all of this was to rename a bunch of tables to make them a little easier to find and understand. Tables now have standardized prefixes, the nuances of which are explained in the docs. The short version is:

  • When in doubt, start with tables with the out_* prefix. These have been cleaned and connected into wide tables with lots of metadata and are designed to be easy to use for downstream analysis.
  • When you need to dig deeper, look at the core_* tables. These are the cleaned up building blocks of the out_* tables. You may need to join several core_* tables to get the metadata you want.
  • The tables starting with an underscore are intermediate assets. They’re not stable, so please don’t rely on the data in them.

We hope these changes make it easier for a wider variety of users to use our data! Now that we’ve wrapped up this infrastructural work, we’ll shift our focus back to integrating new datasets like PHMSA and EIA 176.

If you want help getting started with our data, or have any datasets you’d like us to integrate, we’d love to talk: drop by our office hours and we’ll walk you through any questions you might have.

Categories
updates

OpenMod USA Takeaways

We had a great time attending the OpenMod USA conference at Stanford last month. Thanks to Open Energy Transition for organizing, and for inviting us to moderate a panel on open data! Thanks also to Greg Miller, Greg Schivley, Ted Nace, and our very own Christina Gosnell for speaking on our panel.

We got to meet a whole bunch of smart, friendly folks who are working on using their energy system modeling skills to facilitate the global energy transition. We learned a lot about how we can better support their work, including these high level takeaways:

  1. We’re still missing useful datasets! There wasn’t a strong front-runner for most-requested dataset, but we clearly heard a need for transmission, gas, and hourly demand, among others.
  2. Our users are interested in making their own technical systems more robust and easier to work with.

It’ll be a continuous process of improvement, of course, but we’ve started working on some projects as a result!

We do have to pick and choose which datasets to integrate first. Right now we’re focusing on natural gas data, integrating EIA 176 with the help of davidmudrauskas, and our own e-belfer is extracting transmission and distribution data from PHMSA.

One way to integrate more data more quickly is to mobilize our community to help integrate new data sources! That means we need to make contributing to PUDL much easier.

The first, most important phase of integrating a new dataset is the exploratory one. You can spend countless hours learning the specific quirks and pain points of the data. Because many of our users are already familiar with these datasets, we encourage “knowledge contributions” in the form of plain-language documentation or useful scripts that handle part of the data wrangling process. We’ve updated our contributing docs to highlight those cases, and have made a new repository to hold the teeming masses of dataset-specific knowledge.

We are also improving our Kaggle environment so that anyone can use PUDL without setting up a whole Python environment. This will make it easier for users to explore PUDL data, especially data that we have archived and/or extracted but not completely cleaned, validated, or connected. 

Apart from the dataset integrations and contribution improvements, we’re following up with folks from the conference to see how we can help them with software architecture, engineering, and infrastructure guidance – we’re looking forward to growing those relationships. If you are curious about how we can help you in this area, don’t hesitate to reach out at hello@catalyst.coop!

In closing, OpenMod was a great experience! We’re excited to build a community that can do amazing things with complete, connected, granular, and accessible energy data. We’re pursuing a bit of funding to support our community efforts, so keep your fingers crossed for us and stay tuned for more updates next year!

Categories
updates

Automated Data Wrangling

An illustration from the Frog and Toad children's books, where Frog and Toat are eating cookies. The caption has been altered to say "We must stop data cleaning!" cried Toad as he continued to clean the data.
Frog and Toad are Data Wranglers

We work with a lot of messy public data. In theory it’s already “structured” and published in machine readable forms like Microsoft Excel spreadsheets, poorly designed databases, and CSV files with no associated schema. In practice it ranges from almost unstructured to… almost structured. Someone working on one of our take-home questions for the data wrangler & analyst position recently noted of the FERC Form 1: “This database is not really a database – more like a bespoke digitization of a paper form that happened to be built using a database.” And I mean, yeah. Pretty much. The more messy datasets I look at, the more I’ve started to question Hadley Wickham’s famous Tolstoy quip about the uniqueness of messy data. There’s a taxonomy of different kinds of messes that go well beyond what you can easily fix with a few nifty dataframe manipulations. It seems like we should be able to develop higher level, more general tools for doing automated data wrangling. Given how much time highly skilled people pour into this kind of computational toil, it seems like it would be very worthwhile.

Like families, tidy datasets are all alike but every messy dataset is messy in its own way.

Hadley Wickham, paraphrasing Leo Tolstoy in Tidy Data
Categories
updates

PUDL Infrastructure Roadmap for 2021

A couple of weeks ago I attended TWEEDS 2020 virtually (like everything this year) and talked about Catalyst’s ongoing Public Utility Data Liberation (PUDL) project, and especially the challenges of getting a big pile of data into the hands of different kinds of users, using different tools for different purposes. It ended up sketching out a bit of a PUDL infrastructure roadmap for the next year, and so we thought it would be a good idea to write it up here too.

We’ll have a separate post looking at our 2021 data roadmap.

The US Energy Information Asymmetry

PUDL is all about addressing a big information asymmetry in the regulatory and legislative processes that affect the US energy system. Utilities have much more information about their own systems than policymakers and advocates typically do. As a result, regulators often defer to the utilities on technical & analytical points. Commercial data exists, but it’s expensive. We want to get enough data into the hands of other kinds of stakeholders that they can make credible quantitative arguments to regulators, and challenge unfounded assertions put forward by utilities.

Federal Agencies and Their Favorite File Formats