A free training series for energy graduate students
In the course of our work at Catalyst, we’ve been lucky enough to work with energy researchers at many institutions. So often, we see people run into the same set of problems – handling data that’s too big for Excel, reproducibly connecting datasets without shared IDs, and writing code that can be easily re-run and updated when a paper is reviewed or when a new year of data comes in.
Now, thanks to generous funding from the Alfred P. Sloan Foundation Energy & Environment Program, we’re pleased to announce “Open Energy Data for All,” an initiative to support energy graduate students. We’ll offer training in foundational data and software skills that enable faster, more open, and more reproducible energy data analysis.
Here’s what we have planned:
We’re hosting a monthly online seminar series addressing key challenges in energy data analysis
We’re working to develop a hands-on energy data curriculum with support from The Carpentries
We’ll give open energy data tutorials at several conferences around the country
Finally, we’ll host a two-day, in-person data lab, bringing together graduate students from across the U.S. to collectively tackle real-world energy data problems.
Kicking off our monthly seminar series
Whether wrestling with APIs or just trying to find the right dataset for your research, it can be hard to know where to start. That’s why, for our first monthly skill-building webinar, we’re starting with an introduction to the US open energy data landscape. Join us!
Intro to the US Open Energy Data Landscape Oct. 30th, 4:30 – 5:30 PM Eastern (20:30 – 21:30 UTC)
What free and public energy data exists for the U.S., and how can I access it?
How have people used that open data in the energy transition?
What common challenges are there with using this data?
How can I evaluate open datasets I find online?
How may data availability affect my research topics?
We’ll be running more webinars monthly – they’re designed to be drop-in, and will be recorded, so don’t worry too much about missing one or another.
Want to learn more?
If you’re interested in hearing more about these projects, sign up for our workshop newsletter and subscribe to our calendar of events! If you’re a faculty member interested in learning more, or in hosting us for a talk or workshop, we’d love to chat – drop us a line at hello@catalyst.coop.
To kick off our NSF POSE grant work, over 4 weeks in July and August we interviewed more than 60 energy data users as part of NSF’s Innovation Corps program (I-Corps). I-Corps helps POSE awardees better understand their users and contributors, and the potential for fostering a sustainable open source ecosystem.
Some of our interviewees were already PUDL users, and many of them weren’t. A fair number of the PUDL users were at organizations we’d never encountered before! We talked to academic researchers and advocates working at non-profits, but also people at for-profit companies, and folks working in the public sector. We even had the chance to talk to some utilities. Interviewee technical and energy domain backgrounds were diverse: from spreadsheet-only NGOs to startups working with cloud-based data pipelines and orchestration frameworks, and everything in between. There were software engineers and lawyers that argue at FERC, grass roots advocates and regional electricity planning organizations too.
It was an intense month for our sometimes introverted team, but overall it was a good experience and we learned a lot. So we thought we’d share some of our high-level takeaways, and see if they resonate the broader energy data community.
Catalyst is a fully-remote organization with members living all over North America. Our floating cyborg heads are besties, but we’re privy to the power of an analog hang. As such, we host an annual member retreat. This year we chose to meet in Mexico City because it seemed like fun and because it overlapped well with this year’s CSV conf (see Zane’s talk here!)
We booked an Airbnb with a few buffer days so folks could explore the city before getting down to business hashing out Catalyst’s future. We perspired, ate popsicles, and pinched ourselves a few times to make sure we were actually there. Even though we had a lot of work planned for the next few days, we agreed that the retreat would be a success even if all we did was get together.
This year at csv,conf,v8 in Puebla, Mexico I gave a talk on our experience as a democratic worker cooperative creating digital public goods, and why we think co-ops are potentially a good fit for creating public-interest technology. You can watch the recorded talk on YouTube, or read on for a bloggified version of the talk below.
We recently found out that Kamran Tehranchi, one of two primary maintainers of the PyPSA-USA open source power system model, was working on adapting it to use open data that we publish through our Public Utility Data Liberation Project (PUDL), so we interviewed him over email to find out more about his experience making the switch.
Can you tell us a little bit about yourself? What problems are you working on? Where are you at?
Sure! I’m currently a PhD Student at Stanford University working in the Interdisciplinary Energy Systems (INES) Lab. By way of my research, I am also an energy system modeler and open-source software developer. My work focuses on electricity system planning, specifically on the impact of electricity transmission resolution within planning models. I primarily work with engineering-economic simulation and optimization models, mainly production cost simulations and capacity expansion models. I use these models to design and simulate future energy systems to understand the impacts of emerging technologies, policies, and climate-energy system interactions. One of the main projects I’ve been working on this past year is the PyPSA-USA planning model which in-part leverages PUDL to develop the electricity system data model.
We are excited to share that the Public Utility Data Liberation Project (PUDL) and Catalyst Cooperative have been awarded a Pathways to Open Source Ecosystems (POSE) Phase I grant by the National Science Foundation (NSF)! This grant will fund a slate of community building and infrastructure projects to expand the PUDL community and facilitate contributions.
We’ve spent time on community-building activities like developing relationships with open energy modelers, presenting at conferences, hosting office hours, and responding to questions on Github Discussions. We applied for the NSF POSE grant so that we can spend more time fostering the PUDL community and improving people’s experience working with public energy data.
Getting to know our community
Are you a researcher or analyst working with energy data or models? An environmental non-profit, clean energy advocate or data journalist working on the U.S. energy transition? A data engineer or open-source expert interested in contributing to the energy transition?
If so, we would love to talk to you! For the first step of our POSE grant, we’re conducting a series of half-hour interviews over the next month to better understand how people find, prepare, and work with energy data, the different contexts they’re working in, and what their biggest data pain points and challenges are. You can sign up using this link. Please spread the word and forward this link to anyone you think might be interested!
Our Focus Areas
With POSE funding, we’ll be working to get PUDL data into more hands and creating new opportunities to contribute back to the PUDL ecosystem. Here’s a glimpse into what’s in the works:
Exploring new front-end tools to make PUDL data easier to access: We’re busy prototyping an alternative to our existing UI tool. Stay tuned, we’ll be looking for users to give us feedback on our beta tool!
Creating new resources for PUDL users: We’ll be hosting a webinar aimed at nonprofits and developing new data access tutorials to make accessing our data easier than ever before.
Supporting PUDL’s contributors: We’ll be developing new resources and coordination practices for external contributors, and creating a contributor onboarding workshop.
Addressing technical barriers to contribution: Whether refactoring memory-intensive tests, or improving our data validation framework using Pandera, Pydantic, and Dagster asset checks, we’re excited to implement some long-awaited improvements to support more distributed development.
Coming to a town near you!: We’ll be traveling to academic conferences, university brown-bags, FOSS meetups and more in order to present on the PUDL project and connect with other clean energy advocates.
Developing organizational models and governance practices to sustain our growing ecosystem: In conversation with our downstream users, we’ll be developing strategies to keep PUDL free, accessible and maintained in the long-term.
We’ll be sharing updates on POSE-funded projects on our socials, blog and newsletter over the coming months. If you want to learn more about any of these projects, get in touch via hello@catalyst.coop or drop by our office hours.
Catalyst is very excited to announce that we have hired Nancy Amandi as a technical writer for PUDL’s Google Season of Docs project (full proposal here). The project will run from June through October, during which time Nancy will work on improving our documentation to make it easier for PUDL users to navigate and find the data they need.
Currently, it’s difficult for new (and long-time) PUDL users and contributors to quickly jump in and start using PUDL because our documentation is extensive and spread out between multiple repositories and websites. We have a data dictionary page in our docs, a Datasette deployment for exploring the data, and a set of example notebooks hosted on Kaggle, but none do a particularly good job of shepherding users to the data they want. The goal of this project is to create a better, more nested, system of table/column documentation so users aren’t overwhelmed by PUDL and know where to find the latest versions of the tables that are most relevant to them!
If you’ve ever struggled to navigate the PUDL docs and have feedback, please send an email to hello@catalyst.coop, and we will incorporate suggestions into our plan for the project.
About Nancy
Nancy is a data engineer and technical writer living in Nigeria. She’s passionate about helping data-driven businesses write clear, concise documentation to convey complex technical concepts to a diverse range of audiences. In addition to her writing, Nancy has extensive experience in creating scalable data pipelines, exhaustive data mining, explanatory datasets, analytical models, and business reporting solutions with structured, semi-structured, and unstructured data. In 2023, Nancy and her team members won the Nigeria Energy Forum Tertiary Institutions Energy Pitch Challenge for their work on OneGrid Energies, a clean tech startup working towards closing the energy affordability gap in Nigeria.
Linking power plant financial data to energy system operational data with help from Climate Change AI
At the end of 2023, Catalyst wrapped up work funded by a Climate Change AI (CCAI) Innovation Grant using entity matching (record linkage) to connect the energy system financial data reported to the US Federal Energy Regulatory Commission (FERC) and physical energy system data reported to the US Energy Information Administration (EIA). While the data published in FERC Form 1 refers to the same utilities, power plants, and generators that are reported by EIA, these entities lack common IDs to link them. This connection between datasets is necessary to show that retiring certain fossil fuel power plants in favor of renewable energy sources is economically beneficial and technically feasible while still meeting the physical demands of today’s grid. Conducting entity matching to model this connection eliminates the extremely laborious process of sifting through these datasets and performing a manual connection. In collaboration with and support of RMI’s Utility Transition Hub, Catalyst created a small validation dataset of manually linked records, and thus know first hand the tedium of conducting this linkage manually.
Over the course of the grant period we developed the connection of FERC Form 1 plants to EIA data from a one-off module to an integrated analysis maintained and deployed with our nightly PUDL builds. Along the way, we updated our FERC-FERC plant connection (the plant_id_ferc1 column in out_ferc1__yearly_all_plants in the PUDL database), providing a unique plant ID to link FERC plants across all years of reporting. We believe our published output table of connections (out_pudl__yearly_assn_eia_ferc1_plant_parts in the PUDL database) is the only regularly updated, free and open-source connection between the FERC and EIA datasets.
We hope the result enables advocates working to decarbonize our electricity system to more easily bring defensible and data-driven analyses to state-level legislative and regulatory processes. Additionally, we hope that the published matching framework can serve as an open-source example of record linkage for energy datasets and be a model for attempting similar connections with other energy datasets.
Inputs
The data published in FERC Form 1 is messy; reported records correspond to an assortment of generator aggregations (e.g. prime mover, primary fuel source, technology type, plants, or generator units). To create an EIA input that could match the diversity of records reported in FERC Form 1, we created the EIA “plant parts table”. This table contains aggregations of all EIA “plant parts” corresponding to the various granularities appearing in the FERC data.
After experimenting with several machine learning packages, we decided to use the open-source Python package Splinkas it provided helpful transparency into the effects of changing model parameters and produced results better than our existing baseline. Splink is an entity matching and deduplication interface based on the Fellegi-Sunter algorithm for record linkage. Its main advantages are its speed working with data locally, its interface for users to define fuzzy matching logic between attributes in the input datasets, and its features for doing an unsupervised match (with no training data). Splink provides interactive charts of the model weights that make it easier for downstream users to provide feedback without advanced understanding of the underlying model mechanics.
Results
We used the manually matched dataset to evaluate the model results by a metric of precision and recall. Consider the set of FERC records in this manual validation dataset that the model predicted a matching EIA record for. Precision is the percentage of these matches that are correct. It represents the model’s accuracy when making a prediction. Now, consider all of the FERC records in the manual validation dataset. Recall is the percentage of these FERC records that the model predicted an EIA match for. It represents the model’s coverage of the FERC dataset. The table below displays the precision and recall of the Splink model alongside a baseline linear regression model that was previously integrated into PUDL. The “match probability threshold” is the threshold at which pairs with a lower probability of matching are labeled as a non-match. As the match threshold decreases, more record pairs are labeled as a match and the recall increases. However, precision decreases as the match threshold decreases because the match quality is lower and more FERC records are matched to an incorrect EIA record. Considering the needs of downstream users, we prioritized publishing match results with high precision and thus chose a match threshold of .9 for use in our deployed model.
Match Probability Threshold
Precision
Recall
.95
.944
.833
.9
.943
.843
.75
.940
.862
.5
.939
.875
.25
.938
.887
baseline
.90
.73
Challenges and Limitations
One of the initial challenges we encountered during the project was the high percentage of null values in the input datasets. This significantly impacted the quality of our entity matching results. Additionally, our manually compiled training/validation dataset was relatively small and inherently introduced some unknown biases within the small sample size. Recognizing the dynamic nature of data over time and the potential shifts in representation as more data is published, we additionally experimented with the unsupervised training features for the Splink model. Results were similar to those of the supervised model, and we anticipate using the unsupervised model in the event that the existing training data becomes too outdated or fails to represent evolving patterns in the data. This forward-looking approach ensures adaptability to new data trends and optimizes for scenarios not adequately represented in the initial training dataset.
What’s Next?
With the development of this framework for entity matching, Catalyst is capable of greater flexibility and efficiency in data-driven model development. In 2024, we are building on this framework using funding from the Mozilla Foundation to link Security Exchange Commission utility ownership data to EIA utility operational data. We hope to leverage these models to address analogous issues in natural gas data in the future.
Catalyst is making exciting progress in providing open data to electricity resource planning models like the GridPath RA Toolkit with support from GridLab. Our initial work on these inputs has revealed that there is a need for entity matching in almost all of the datasets under consideration. For example, the Western Electricity Coordinating Council’s Reliability Modeling Anchor Data Set (WECC ADS) has transmission node IDs, generator IDs, and utility IDs that do not match other data sets referring to the same entities. We are excited to utilize the resource efficiency, usability, and transparency of Splink in building entity matching models for these datasets.
Please reach out to us with questions about the modeling process or resulting connection table, and let us know how you are utilizing the FERC to EIA connection!
We’re going to use Mozilla’s support to link US Securities and Exchange Commission data about utility ownership to financial and operational information in the EIA forms 860/861/923, and through our previous record linkage work involving the EIA data, to FERC Form 1 respondents and the EPA’s continuous emissions monitoring system data.
The SEC Form 10-K is published through EDGAR as structured XBRL data, but the Exhibit 21 attachment that describes which companies own and are owned by other companies is unfortunately just a PDF blob that gets stapled to the XBRL, and so ownership relationships end up being unstructured, or at best, semi-structured data.
We’re going to apply document modeling tools that we’ve developed in some of our client work (to extract structured data from PUC and other regulatory filings) to extract the ownership information from Exhibit 21. This will hopefully include the ownership percentages when they are reported.
Then we’re going to use the generalized entity matching / record linkage tooling that we developed under our previous Climate Change AI Innovation Grant to connect the parent / subsidiary companies named in the SEC data to the financial and operational data reported by the same utility companies in FERC Form 1, as well as EIA and EPA data.
Why is this work important? Being able to make effective energy policy often requires an understanding of the political economy of utilities, and utilities are often composed of Russian doll-like nested holding companies. It can be hard to see where one utility ends and another begins. Understanding which entities share ownership and thus political and economic interests is key to being able to grapple with and influence them.
We’ll be learning from prior work on this problem done by the folks at CorpWatch, and we hope to make the outputs of our work easy to visualize and explore through the Oligrapher interface that LittleSis has developed.
If this work is interesting or useful to you, we’d love to hear more about your use case! You can track our work through this GitHub repository. Also, while we are explicitly focused on and familiar with utilities, the SEC’s Form 10-K covers all publicly traded companies, so we may be producing additional data outputs that aren’t useful to us but which could be useful to others. If that’s you, please let us know.