This year at csv,conf,v8 in Puebla, Mexico I gave a talk on our experience as a democratic worker cooperative creating digital public goods, and why we think co-ops are potentially a good fit for creating public-interest technology. You can watch the recorded talk on YouTube, or read on for a bloggified version of the talk below.
Author: Zane Selvans
A former space explorer, now marooned on a beautiful, dying world.
Integrating PUDL with PyPSA-USA
We recently found out that Kamran Tehranchi, one of two primary maintainers of the PyPSA-USA open source power system model, was working on adapting it to use open data that we publish through our Public Utility Data Liberation Project (PUDL), so we interviewed him over email to find out more about his experience making the switch.
Can you tell us a little bit about yourself? What problems are you working on? Where are you at?
Sure! I’m currently a PhD Student at Stanford University working in the Interdisciplinary Energy Systems (INES) Lab. By way of my research, I am also an energy system modeler and open-source software developer. My work focuses on electricity system planning, specifically on the impact of electricity transmission resolution within planning models. I primarily work with engineering-economic simulation and optimization models, mainly production cost simulations and capacity expansion models. I use these models to design and simulate future energy systems to understand the impacts of emerging technologies, policies, and climate-energy system interactions. One of the main projects I’ve been working on this past year is the PyPSA-USA planning model which in-part leverages PUDL to develop the electricity system data model.
We’re excited to be part of the Mozilla Technology Fund’s 2024 cohort, which is focusing on open source AI for environmental justice!
We’re going to use Mozilla’s support to link US Securities and Exchange Commission data about utility ownership to financial and operational information in the EIA forms 860/861/923, and through our previous record linkage work involving the EIA data, to FERC Form 1 respondents and the EPA’s continuous emissions monitoring system data.
The SEC Form 10-K is published through EDGAR as structured XBRL data, but the Exhibit 21 attachment that describes which companies own and are owned by other companies is unfortunately just a PDF blob that gets stapled to the XBRL, and so ownership relationships end up being unstructured, or at best, semi-structured data.
We’re going to apply document modeling tools that we’ve developed in some of our client work (to extract structured data from PUC and other regulatory filings) to extract the ownership information from Exhibit 21. This will hopefully include the ownership percentages when they are reported.
Then we’re going to use the generalized entity matching / record linkage tooling that we developed under our previous Climate Change AI Innovation Grant to connect the parent / subsidiary companies named in the SEC data to the financial and operational data reported by the same utility companies in FERC Form 1, as well as EIA and EPA data.
The record linkage / entity matching system that we’ve ultimately settled on is based on the excellent (and publicly funded!) Splink library, which relies on DuckDB to enable local linkages on datasets of up to tens of millions of records. Robin Linacre (one of the Splink maintainers) has a tutorial explaining the probabilistic model of record linkage used by Splink, if you’re interested in the internals.
Why is this work important? Being able to make effective energy policy often requires an understanding of the political economy of utilities, and utilities are often composed of Russian doll-like nested holding companies. It can be hard to see where one utility ends and another begins. Understanding which entities share ownership and thus political and economic interests is key to being able to grapple with and influence them.
We’ll be learning from prior work on this problem done by the folks at CorpWatch, and we hope to make the outputs of our work easy to visualize and explore through the Oligrapher interface that LittleSis has developed.
If this work is interesting or useful to you, we’d love to hear more about your use case! You can track our work through this GitHub repository. Also, while we are explicitly focused on and familiar with utilities, the SEC’s Form 10-K covers all publicly traded companies, so we may be producing additional data outputs that aren’t useful to us but which could be useful to others. If that’s you, please let us know.
Rescuing Historical FERC Data
UPDATE 2022-01-19: We have received word from FERC that access to the historical data discussed below will be restored this week. As it becomes available we will also archive it on Zenodo just in case. Thank you to everyone who reached out and helped bring this issue to FERC’s attention!
This week we discovered that decades worth of energy system data collected by the Federal Energy Regulatory Commission (FERC) had been removed from the agency’s website. They apparently have no plan to archive it or migrate it to another platform. We are attempting to obtain a bulk download of all this data so we can archive it alongside our other raw data sources on Zenodo.
This data records many financial, operational, and economic aspects of the US energy system. It is a unique and valuable resource for anyone trying to understand how public policy and market conditions have shaped our energy system over time. Simply deleting this data with no warning, no plan to archive it, or migrate it to another platform is completely unacceptable.
If you know someone within FERC who can help get us a copy of this data to archive publicly, please put us in touch: hello@catalyst.coop
- Chapter 4 of the 2018 National Climate Assessment looks at the potential climate impacts on the US energy system.
- Flow of Flows — Orchestrating ELT with Prefect and dbt. More exploration of how to build data processing pipelines using open source tooling.
- Orchestrating Airbyte data connection tasks with Prefect. Official integrations for Airbyte connectors as Prefect tasks.
- Data cleaning IS analysis, not grunt work. A longish post exploring what we really get out of doing data cleaning, and why it’s more valuable and complex than it often gets credit for.
- Peer learnings about what it means to become an open data steward, from the 2021 ODI Open Data Summit. Videos and responses from participants on many facets of stewarding open data, especially as a business / organization.
PUDL v0.5.0: 2020 and Beyond
It’s been almost a month since we pushed out our first actual quarterly software and data release: PUDL v0.5.0! The main impetus for this release was to get the final annual 2020 data integrated for the FERC and EIA datasets we process. We also pulled in the EIA 860 data for 2001-2003, which is only available as DBF files, rather than Excel spreadsheets. This means we’ve got coverage going back to 2001 for all of our data now! Twenty years! We don’t have 100% coverage of all of the data contained in those datasets yet, but we’re getting closer.
Beyond simply updating the data, we’ve also been making some significant changes to how our ETL pipeline works under the hood. This includes how we store metadata, how we generate the database schema, and what outputs we’re generating. The release notes contain more details on the code changes, so here I want to talk a little bit more about why, and where we are hopefully headed.
If you just want to download the new data release and start working with it, it’s up here on Zenodo. The same data for FERC 1 and EIA 860/923 can also be found in our Datasette instance at https://data.catalyst.coop
Oil and gas companies operating in the arctic and other areas impacted by climate change have been adapting their operations and infrastructure planning to the melting permafrost and other long-term impacts of their pyromania for decades, even while spreading disinformation about the same processes publicly. But are electric utilities doing the same kind of planning?
We’ve been thinking a bit about the ways in which the energy system in the US West is exposed to potential climate risks, in the context of long term utility resource adequacy and operational planning. We posted a short thread on Twitter and got some references from the #EnergyTwitter hive mind.
Modern Data Stack, Etc.
- ETL vs. ELT — a comparison of two data pipeline architectures from the folks at Fivetran.
- What is the Modern Data Stack? another post from Fivetran, attempting to define the different components of data engineering pipelines as they appear to be coming together in the last few years.
- A good interactive introduction to SQL from Mode. You can even use your own data as you work through it, if it’s in a database online. Broken down into beginner, intermediate, and advanced sections.
- Hex is another platform that seems similar to Mode, for collaborating on data analyses using notebooks and a combination of straight SQL and python. Again, you load your own data directly via an online DB connection. I admit that after seeing it mentioned for months I only clicked through after realizing it was named after the magic / science hybrid technology depicted in Arcane.
- Cookiecutter Data Science is a cookie-cutter repo and a set of guidelines for standardizing data science projects to be more easily replicated and parsed by other people.
- Thou Shalt Scale Sustainably: some thoughts (commandments…) on how to scale social enterprises (especially when dependent on foundation funding) from the Shuttleworth Foundation. Not related to the so-called Modern Data Stack.
New PUDL Software & Data Release: v0.4.0
In August we put out a new PUDL software and data release for the first time in 18 months. We had a lot of client work, and kept putting off doing the release, so a whole lot of changes accumulated. Some highlights, mostly based on the continuously updated release notes in our documentation:
New Data Coverage
- EIA Form 860 added coverage for 2004-2008, as well as 2019.
- EIA Form 860m has been integrated (through Nov 2020). Note that it only adds up-to-date information about generators (especially their operational status).
- EIA Form 923 added the 2001-2008 data, as well as 2019.
- EPA CEMS Hourly Emissions covering 2019-2020.
- FERC Form 714 covering 2006-2019, but only the table of hourly electricity demand by planning area. This data is still in beta and the data hasn’t been integrated into the core SQLite database, but you can process it on the fly if you want to work with it in Pandas.
- EIA Form 861 for 2001-2019. Similar to the FERC Form 714, this ETL runs on the fly and the outputs aren’t integrated into the database yet, but it’s available for experimental use.
- US Census Demographic Profile 1 (DP1) for 2010. This is a separate SQLite database, generated from a US Census Geodatabase, which includes census tract, county, and state level demographic information, as well as spatial boundaries of those jurisdictions.
Environmental Justice Data Liberation
We’ve come across a few allied projects looking at environmental justice data specifically, and thought it would be nice to share!
Environmental Enforcement Watch
In May, Christina and I gave a talk at CSV,Conf,v6 about things we’ve learned liberating US energy system data. We focused a lot on the challenge of making data accessible to advocates. The following talk was analogous, but focused on environmental justice data. The speaker was Kelsey Breseman (@ifoundtheme) from the Environmental Data and Governance Initiative (EDGI) and their project Environmental Enforcement Watch (EEW). EEW is trying to hold polluters accountable using federally reported data, by making that data more accessible to and understandable by the people who are affected. They’re scraping the data from the web and creating a database that folks can query using Google CoLab notebooks. At the same time they’re trying to get EPA the full underlying database accessible to the public.
You can watch her excellent talk here:
I was struck by how many parallels there were between our work. We’re both trying to mitigate the poor curation of government data, and make it more accessible way to the public. EDGI also seems very open and GitHub centered and is trying to operate as a horizontal organization. They support themselves through foundation grants and volunteer labor. Nobody works on EDGI full time. They have a fiscal sponsorship agreement through Earth Science Information Partners (ESIP).
If you’re interested in public data and environmental justice they seem like a great organization! Maybe we can collaborate at some point.