It’s been almost a month since we pushed out our first actual quarterly software and data release: PUDL v0.5.0! The main impetus for this release was to get the final annual 2020 data integrated for the FERC and EIA datasets we process. We also pulled in the EIA 860 data for 2001-2003, which is only available as DBF files, rather than Excel spreadsheets. This means we’ve got coverage going back to 2001 for all of our data now! Twenty years! We don’t have 100% coverage of all of the data contained in those datasets yet, but we’re getting closer.
Beyond simply updating the data, we’ve also been making some significant changes to how our ETL pipeline works under the hood. This includes how we store metadata, how we generate the database schema, and what outputs we’re generating. The release notes contain more details on the code changes, so here I want to talk a little bit more about why, and where we are hopefully headed.
If you just want to download the new data release and start working with it, it’s up here on Zenodo. The same data for FERC 1 and EIA 860/923 can also be found in our Datasette instance at https://data.catalyst.coop
New Direct DB Output
In 2019 when we started archiving our processed data on Zenodo, we began producing tabular data packages made up of CSVs (the data) and JSON (the metadata). The hope was that this output could be a single source of truth that was accessible both to folks who wanted to load the data into a relational database, and to those more comfortable with the text-based formats. We hoped that having One Output To Rule Them All would keep us from having to deal with discrepancies between different formats, data typing issues, and being unsure which of those outputs additional analyses other people were building on.
In practice, we always loaded the data packages into SQLite and ended up building additional infrastructure on top of the SQLite DB that was only accessible to folks using our Python package and able to load the database. Spreadsheet users couldn’t load million-line CSVs, and were often unable to parse and work with the JSON metadata, so a lot of the relationships between different well normalized tables weren’t seeing the full light of day. Introducing the extra step of loading the CSVs into a database or (in the case of the 800,000,000 row EPA CEMS dataset) into Parquet files was a hassle, and confused a lot of people.
As a result, we decided to accept the fact that what we’re producing is fundamentally structured, tabular, relational data, and the right container for that kind of data is a relational database. They’ve endured for half a century for a reason. SQLite is a good format in many contexts because it’s a standalone file that doesn’t require users to do any server setup or authentication, but it still retains the full relational structure of the data. It’s easy to archive and distribute, directly supported by many different languages, and relatively easy to load into other relational databases if need be. We can also easily define output routines that load the data directly into other databases like Postgres or BigQuery.
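To illustrate just how low the barrier is, here's a minimal sketch of exploring a PUDL SQLite database with nothing but the Python standard library. The filename is an assumption for illustration; check the archive for the actual database name.

```python
import sqlite3

# Hypothetical filename -- the actual Zenodo archive may name it differently.
conn = sqlite3.connect("pudl.sqlite")

# List the tables the database contains: no server setup, no authentication.
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
print(tables)
conn.close()
```

The same file opens directly in R, Julia, the `sqlite3` command line tool, or GUI browsers like DB Browser for SQLite.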
We’ve also acknowledged that the set of users and use cases we are trying to serve are too diverse to rely on a single data product. Academic researchers, journalists, climate activists, NGOs, small businesses, and policymakers vary widely in their technical backgrounds, and the types of interfaces they want to work with. We need to be able to expose the data and analyses we’re producing in many different ways.
The new direct SQLite and Parquet outputs simplify the overall data processing pipeline, and make it easier for us to integrate the data with other analysis and visualization tools. They will also allow us to start running the pipeline on distributed resources using Dask and Prefect, which will be important as the scale of the data and the number of integrated analyses increase.
New Metadata System

When our primary output was a collection of tabular data packages, we stored our metadata in a single huge datapackage.json file. A bunch of the column-level metadata was duplicated in that file, and it was a hassle to update. It was also difficult to validate, and to use dynamically in our code.
With this most recent release, we’ve mostly moved to storing our metadata in a collection of Pydantic data models representing individual columns, tables, and the database as a whole. We dynamically infer many of the foreign key relationships based on naming conventions, easily validate the entire metadata structure to make sure it’s self consistent, ensure that there’s little to no duplication of metadata, and easily implement new metadata export methods as needed. This will help us keep our data dictionaries up to date, let us generate schemas for different database back ends with SQLAlchemy, and export machine readable metadata for use by applications like Datasette and Intake data catalogs.
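The flavor of this approach can be sketched in a few lines. To be clear, these are NOT PUDL's actual metadata classes; they're a simplified illustration of column and table metadata as small, self-validating Pydantic models rather than entries in one giant JSON blob.

```python
from typing import List, Literal
from pydantic import BaseModel, ValidationError

# Illustrative only -- PUDL's real models carry much richer metadata.
class Column(BaseModel):
    name: str
    dtype: Literal["integer", "number", "string", "boolean", "datetime"]
    description: str = ""

class Table(BaseModel):
    name: str
    columns: List[Column]
    primary_key: List[str]

gen_fuel = Table(
    name="generation_fuel_eia923",
    columns=[
        Column(name="plant_id_eia", dtype="integer"),
        Column(name="report_date", dtype="datetime"),
    ],
    primary_key=["plant_id_eia", "report_date"],
)

# Bad metadata fails loudly at definition time, not deep inside the ETL.
try:
    Column(name="oops", dtype="blob")
except ValidationError:
    print("invalid dtype rejected")
```

Because each model is a plain Python object, exporting the same metadata as a SQLAlchemy schema, a data dictionary, or a Datasette config becomes a method call rather than a parallel hand-maintained file.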
More Structured Data
The new metadata system has made it much easier for us to impose additional structure on the data. We can use natural primary keys for almost every table, and infer the full set of foreign key relationships based on naming conventions. This has allowed us to find and fix many more inconsistencies within the original data, to impose explicit value constraints on more columns, and to re-code columns en masse when they all use the same limited set of categorical values. It has also highlighted some poor normalization in the previous database structure, like the mixture of nuclear units (which report at the generator level) and other thermal units (which report by a combination of plant, prime mover, and fuel type) in the EIA 923 Generation and Fuel table. That table has now been split in two, with different natural primary keys.
This might seem a little neurotic, but in addition to highlighting duplicated and inconsistent data, imposing these structures will make it easier for us to automate and generalize some aspects of the data processing, and to make sure that the same cleaning and standardization routines are being applied across the board. It should also allow more effective use of our data in the ever-growing ecosystem of SQL-based analysis and transformation tools, without requiring users to have as much implicit knowledge about the structure of the data.
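The convention-based foreign key inference described above can be sketched as a toy: if a table contains a column that is the primary key of some entity table, assume a foreign key relationship. The table and column names here are illustrative, and the real logic is more involved.

```python
# Toy illustration of convention-based foreign key inference; not PUDL's
# actual code. Table and column names are loosely modeled on real ones.
tables = {
    "plants_entity_eia": ["plant_id_eia", "plant_name_eia"],
    "generators_entity_eia": ["plant_id_eia", "generator_id"],
    "generation_eia923": [
        "plant_id_eia", "generator_id", "report_date", "net_generation_mwh"
    ],
}

# Map each key column to the table whose primary key it is.
pk_owner = {
    "plant_id_eia": "plants_entity_eia",
    "generator_id": "generators_entity_eia",
}

# Any table containing someone else's key column gets a foreign key to it.
foreign_keys = {
    table: sorted(
        {pk_owner[c] for c in cols if c in pk_owner and pk_owner[c] != table}
    )
    for table, cols in tables.items()
}
print(foreign_keys["generation_eia923"])
# → ['generators_entity_eia', 'plants_entity_eia']
```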
Our Unique Niche(s)
Open source data extraction, transformation, and analysis tools like Airbyte, dbt, and Superset are growing in popularity, and solving some of the problems we’ve been solving with our own custom solutions for open energy system data. We’re trying to figure out what parts of our system could be replaced by or adapted to these tools, and what parts of our system provide unique value, so we can focus on those. Maintaining our own bespoke solutions in parallel with bigger and probably far better engineered community efforts doesn’t make sense in the long run, but in the short term transitioning from one system to another takes a lot of effort.
Right now our uniquely valuable technical contributions seem to be:
- Energy system data extraction
- Domain specific data cleaning
- Integrations between and within energy system datasets
Open Energy Data Extraction
Commercial data extraction platforms like FiveTran and Stitch seem to do a great job of pulling in data from sources that are used by lots of different companies, and that are designed for programmatic use. Unfortunately, public energy system data doesn’t meet either of those criteria. No commercial provider in their right mind is going to build a general data extractor for the annual FERC Form 1 Visual FoxPro databases (or even their new XBRL format). The market is too small and the data is too messy. Only sector specific data providers like S&P Global Market Intelligence or Ventyx / ABB are going to invest in extracting this data, and then they’re going to hoard it and try to extract platform monopoly rents.
But it might be worthwhile to build open source connectors for these datasets using Airbyte or something similar. It's a system that others would be familiar with, and we might be able to avoid some of the background infrastructural work and just focus on the dataset-specific details. It would also be more likely to integrate directly with other tools designed to ease later steps in the data processing pipeline. For example, there's a new AirbyteConnectionTask class defined within Prefect under their new curated partnership program.
Domain Specific Data Cleaning
Until the federal government hires some data engineers, there's going to be some unavoidable toil in getting energy system data ready for analysis, and some of it requires domain specific knowledge. It also requires familiarity with the particular quirks of the datasets, which we've definitely built up over the last few years! As Randy Au put it in this post from last year, data cleaning IS analysis, and an important way to get to know your data.
We continue trying to generalize and modularize our data cleaning processes as we see the same kinds of issues showing up in different contexts, so that we don't end up cutting and pasting solutions with the same general structure across lots of different table- or dataset-specific modules.
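A generic cleaning step of this kind might look like the sketch below: the same whitespace stripping, NA normalization, and categorical re-coding show up in table after table, so they're worth factoring out. The fuel codes and NA strings here are illustrative examples, not PUDL's actual cleaning tables.

```python
import pandas as pd

# Illustrative recoding map and NA sentinels -- real tables are much longer.
FUEL_CODE_FIXES = {"BIT": "coal", "SUB": "coal", "NG": "gas", "DFO": "oil"}
NA_STRINGS = {"", "NA", "N/A", "none", "None"}

def clean_strings(col: pd.Series, recode=None) -> pd.Series:
    """Strip whitespace, normalize NA-like strings, optionally recode values."""
    col = col.str.strip()
    col = col.mask(col.isin(NA_STRINGS))  # replace NA-ish strings with NaN
    if recode:
        # Recode known values; leave unrecognized values untouched.
        col = col.map(recode).fillna(col)
    return col

raw = pd.Series(["  BIT", "NG ", "N/A", "DFO"])
print(clean_strings(raw, FUEL_CODE_FIXES).tolist())
```

One generic function plus per-table parameter dictionaries replaces many near-identical copies of the same logic.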
Messy Data Integrations
One major source of value that we add is connections between disparate datasets. For example the FERC Form 1 and the EIA 860/861/923 all refer to the same utilities and power plants. But as published, there’s no way to join the unique data the two agencies provide about these entities. If you want to know about internal capital allocations and non-fuel operating costs as well as generator operational characteristics and fuel costs… you have to find a way to get them together.
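In miniature, the linkage works like the sketch below: because the two agencies share no common ID, a separately maintained crosswalk table bridges them. All of the IDs, column names, and values here are made up for illustration; the real association is far messier to construct.

```python
import pandas as pd

# Hypothetical miniature of the FERC 1 <-> EIA linkage. All values invented.
ferc1 = pd.DataFrame({
    "respondent_id": [1, 2],            # FERC's own utility identifier
    "opex_nonfuel_usd": [5.2e6, 1.1e7],  # non-fuel operating costs
})
eia = pd.DataFrame({
    "utility_id_eia": [195, 963],        # EIA's unrelated utility identifier
    "capacity_mw": [842.0, 1310.0],
})
crosswalk = pd.DataFrame({
    "respondent_id": [1, 2],
    "utility_id_eia": [963, 195],
})

# Two joins through the crosswalk put FERC costs next to EIA characteristics.
combined = (
    ferc1.merge(crosswalk, on="respondent_id")
         .merge(eia, on="utility_id_eia")
)
print(combined[["respondent_id", "utility_id_eia", "capacity_mw"]])
```

The hard, valuable work is in building and verifying the crosswalk itself, since neither agency publishes one.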
As we wrote in the spring, over the last few years there have been some major advances in the automation of data cleaning and integration. As we’ve worked with more and more datasets, we’ve seen lots of recurring patterns that are common to anybody working with messy tabular / relational data — which is almost everyone working with data!
We’re looking for grant funding, other open climate & energy data producers, and researchers who are working on ML for data cleaning to collaborate with in developing these tools and techniques in the open. Alongside improved infrastructure, this should allow us to bring in more datasets more quickly, and create robust connections between them without requiring an exponentially increasing amount of tedious manual work.
Hopefully we’ll have more to say about this soon!