New PUDL Software & Data Release: v0.4.0

In August we put out a new PUDL software and data release for the first time in 18 months. We had a lot of client work, and kept putting off doing the release, so a whole lot of changes accumulated. Some highlights, mostly based on the continuously updated release notes in our documentation:

New Data Coverage

  • EIA Form 860 added coverage for 2004-2008, as well as 2019.
  • EIA Form 860m has been integrated (through Nov 2020). Note that it only adds up-to-date information about generators (especially their operational status).
  • EIA Form 923 added the 2001-2008 data, as well as 2019.
  • EPA CEMS Hourly Emissions covering 2019-2020.
  • FERC Form 714 covering 2006-2019, but only the table of hourly electricity demand by planning area. This data is still in beta and hasn’t been integrated into the core SQLite database, but you can process it on the fly if you want to work with it in Pandas.
  • EIA Form 861 for 2001-2019. Similar to the FERC Form 714, this ETL runs on the fly and the outputs aren’t integrated into the database yet, but it’s available for experimental use.
  • US Census Demographic Profile 1 (DP1) for 2010. This is a separate SQLite database, generated from a US Census Geodatabase, which includes census tract, county, and state level demographic information, as well as spatial boundaries of those jurisdictions.
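
Since several of these datasets end up in SQLite databases, here is a minimal sketch of pulling records from SQLite into Pandas. It uses only the standard `sqlite3` module and `pandas.read_sql_query`. The in-memory database, the `generators` table, and its columns are hypothetical stand-ins so the example runs anywhere; they are not the actual PUDL schema.

```python
import sqlite3

import pandas as pd

# Build a tiny in-memory database so the sketch runs anywhere.
# The table and column names are hypothetical, not the PUDL schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE generators (plant_id INTEGER, generator_id TEXT, "
    "report_year INTEGER, operational_status TEXT)"
)
conn.executemany(
    "INSERT INTO generators VALUES (?, ?, ?, ?)",
    [
        (1, "A", 2018, "existing"),
        (1, "A", 2019, "retired"),
        (2, "B", 2019, "existing"),
    ],
)
conn.commit()

# Pull one year of records into a DataFrame with a parameterized query.
gens_2019 = pd.read_sql_query(
    "SELECT * FROM generators WHERE report_year = ? ORDER BY plant_id",
    conn,
    params=(2019,),
)
conn.close()
print(gens_2019)
```

With a real database file you would pass its path to `sqlite3.connect()` instead of `":memory:"` and skip the table creation step.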

Weeknotes for 2021-03-12

What We’re Doing

We created a new EIA 861 archive on Zenodo. It includes data from 1990-2019, to support work Karl Dunkle Werner is doing (he’s the one who updated our scrapers, and he is working on integrating the older years). We noticed that there were changes to the 2012 and 2019 EIA 861 data too. But who knows what those changes entail! There aren’t even any formulae in these spreadsheets. Imagine if they published them as CSVs directly to GitHub, and we could see the diffs. Or we could use tools like daff to understand how the data is changing over time.
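
To make the idea of diffing tabular data concrete, here is a rough sketch of a row-level comparison between two versions of a CSV using Pandas. This is not how daff works, and the column names and values are invented for illustration; it only shows that once data is published as plain CSV, finding the rows that changed takes a few lines.

```python
import io

import pandas as pd

# Two hypothetical versions of the same small table, inlined as CSV text.
old_csv = "utility_id,year,sales_mwh\n1,2012,100\n2,2012,250\n3,2012,75\n"
new_csv = "utility_id,year,sales_mwh\n1,2012,100\n2,2012,260\n4,2012,90\n"

old = pd.read_csv(io.StringIO(old_csv))
new = pd.read_csv(io.StringIO(new_csv))

# An outer merge on every shared column, with indicator=True, labels each
# row as present in both versions, only the old one, or only the new one.
diff = old.merge(new, how="outer", indicator=True)
removed = diff[diff["_merge"] == "left_only"].drop(columns="_merge")
added = diff[diff["_merge"] == "right_only"].drop(columns="_merge")

print("Rows only in the old version:")
print(removed)
print("Rows only in the new version:")
print(added)
```

A row whose value was edited (utility 2 here) shows up once in each output, since this approach compares whole rows rather than individual cells.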

Christina and Zane gave a keynote presentation to the 2021 Research Data Access & Preservation (RDAP) Association Summit entitled Distributing Power with Open Data. You can see our slides here.

Zane nominated himself to speak at FERC’s April 16th workshop about the scope and mission of the new Office of Public Participation, focusing on issues of public data accessibility.

Zane finished overhauling our Tox / pytest setup to make it more intuitive and flexible (in the process of documenting the development setup). All the tests that we typically run can now be run without any command line arguments, and the tests have been split into distinct unit tests, software integration tests, and data validations. The software dependencies have also been simplified, with only those packages requiring or benefiting from precompiled binaries being specified in both and environment.yml specifications. This was inspired by our documentation update for the v0.4.0 release, since it turns out documenting an overly complicated thing is kind of a lot of work, and that work is probably better put toward simplifying the thing instead.

What We’re Reading

  • How to build a community: starting with why? The beginning of what will hopefully be a series of posts from Claire Carroll, the community manager for dbt, on how to cultivate real, supportive online communities of practice around a given project or technology. This is something we’d really like Catalyst to learn how to do in the context of PUDL, and open energy data and modeling more generally.
  • $1.3M in grants were just awarded to open source projects, with support from the Ford Foundation, Sloan Foundation, Open Society Foundations, and others, facilitated by the Open Collective Foundation. They’re trying to develop a sustainable funding / support system for open source infrastructure. More information on the program over on the Open Collective and also at the Ford Foundation. Here are some of their research findings so far.
  • RS21 looks like an interesting data consultancy, doing a lot of work with the public sector and NGOs.
  • VSCode Atom is Dead! Long live Atom! With Microsoft’s purchase of GitHub, development on the Atom editor has waned, and it’s started to feel a bit like abandonware. And given that Microsoft also maintains VS Code, it seems like it’s only a matter of time before we have no choice but to switch… to something. So we started playing around with VS Code, and it’s great!
  • A curated list of open technology projects to sustain a stable climate, energy supply, and vital natural resources. Gotta get PUDL in their dataset section!
  • Cloud native repositories for big scientific data, a paper by Ryan Abernathey et al., talking about the benefits of “Analysis Ready Cloud Optimized” (ARCO) datasets and developments in the field. Lots of parallels with where we’re going with PUDL, but with several orders of magnitude difference in the scale of the data.
  • Eliminating Toil: a definition of Toil from Google, and some musings on why less of it is better. Definitely resonated with the work we’re doing in PUDL.
  • Another look at utility political spending from the Energy & Policy Institute. This time they’re looking at FERC Account 426.4, again based on data we’ve liberated from the FERC Form 1, in combination with their own mapping between utilities and their parent holding companies. Here’s the original query from our data.