In August we put out a new PUDL software and data release for the first time in 18 months. We had a lot of client work, and kept putting off doing the release, so a whole lot of changes accumulated. Some highlights, mostly based on the continuously updated release notes in our documentation:
New Data Coverage
EIA Form 860 added coverage for 2004-2008, as well as 2019.
EIA Form 860m has been integrated (through Nov 2020). Note that it only adds up-to-date information about generators (especially their operational status).
EIA Form 923 added the 2001-2008 data, as well as 2019.
EPA CEMS Hourly Emissions covering 2019-2020.
FERC Form 714 covering 2006-2019, but only the table of hourly electricity demand by planning area. This data is still in beta and the data hasn’t been integrated into the core SQLite database, but you can process it on the fly if you want to work with it in Pandas.
EIA Form 861 for 2001-2019. Similar to the FERC Form 714, this ETL runs on the fly and the outputs aren’t integrated into the database yet, but it’s available for experimental use.
US Census Demographic Profile 1 (DP1) for 2010. This is a separate SQLite database, generated from a US Census Geodatabase, which includes census tract, county, and state level demographic information, as well as spatial boundaries of those jurisdictions.
We’ve come across a few allied projects looking at environmental justice data specifically, and thought it would be nice to share!
Environmental Enforcement Watch
In May, Christina and I gave a talk at CSV,Conf,v6 about things we’ve learned liberating US energy system data. We focused a lot on the challenge of making data accessible to advocates. The following talk was analogous, but focused on environmental justice data. The speaker was Kelsey Breseman (@ifoundtheme) from the Environmental Data and Governance Initiative (EDGI) and their project Environmental Enforcement Watch (EEW). EEW is trying to hold polluters accountable using federally reported data, by making that data more accessible to and understandable by the people who are affected. They’re scraping the data from the web and creating a database that folks can query using Google CoLab notebooks. At the same time they’re trying to get EPA the full underlying database accessible to the public.
You can watch her excellent talk here:
I was struck by how many parallels there were between our work. We’re both trying to mitigate the poor curation of government data, and make it more accessible way to the public. EDGI also seems very open and GitHub centered and is trying to operate as a horizontal organization. They support themselves through foundation grants and volunteer labor. Nobody works on EDGI full time. They have a fiscal sponsorship agreement through Earth Science Information Partners (ESIP).
If you’re interested in public data and environmental justice they seem like a great organization! Maybe we can collaborate at some point.
Before the Tidyverse and Pandas, there was SQL. There’s still SQL, and as Vicki Boykis often points out: every data-centric framework that hangs around long enough tends toward SQL. It’s got almost half a century of careful thinking and optimization behind it. It seems entirely possible that it’ll still be around after another half century.
In this extensive postHaki Benita explores a bunch of data analysis that can be done directly with PostgreSQL in particular. It can be used either as an efficient preprocessing step before handing off to other tools, or to generate final products. It covers basic data selection, random selection, sampling, splitting data into training & testing sets, descriptive statistics, aggregations, regressions, interpolation, binning and much more. It’s almost more of a pocket guide to data analysis in SQL than a blog post.
In this post Emily Riederer explores how conceptualizing data (and error!) generation processes can help you do better data validation. What does the data represent in the real world? How is it being collected? How does it move from where it’s collected to where it’s processed? What kinds of transformations operate on it before you look at the outputs? Understanding these steps and their contexts makes it easier to imagine how things can go wrong along the way and what errors to check for. It also makes it easier to debug errors when you find them.
The authors explore several different styles of pair programming and the logistical planning required to make it work. They touch on the extra challenges of doing remote pairing which seems extra relevant these days. They cover productive and destructive social dynamics that come up, and a whole lot more. The article is long, but it’s definitely worth a read if you’ve thought about trying pair programming and been reluctant, or have tried it and been dissatisfied.
We work with a lot of messy public data. In theory it’s already “structured” and published in machine readable forms like Microsoft Excel spreadsheets, poorly designed databases, and CSV files with no associated schema. In practice it ranges from almost unstructured to… almost structured. Someone working on one of our take-home questions for the data wrangler & analyst position recently noted of the FERC Form 1: “This database is not really a database – more like a bespoke digitization of a paper form that happened to be built using a database.” And I mean, yeah. Pretty much. The more messy datasets I look at, the more I’ve started to question Hadley Wickham’s famous Tolstoy quip about the uniqueness of messy data. There’s a taxonomy of different kinds of messes that go well beyond what you can easily fix with a few nifty dataframe manipulations. It seems like we should be able to develop higher level, more general tools for doing automated data wrangling. Given how much time highly skilled people pour into this kind of computational toil, it seems like it would be very worthwhile.
On creating a version-controlled SQL Query library, from Caitlin Hudon. Queries are little hand-crafted jewels that encode knowledge about the contents and structure of data. They should be saved and shared and improved collaboratively over time! As we move our PUDL data access toward using SQL directly, it seems like we’ll want to do this too, with some of the most common queries baked into the database as views directly. I bet it would be easy to set up automated testing of the queries too, whenever a new version of the DB comes online, to see which old queries are broken, and whether they yield the expected results.
Building Data Dictionaries, also from Caitlin Hudon, talking about the importance of continuous, incremental documentation of institutional knowledge related to your data — where it came from, what it should look like, how it can (and can’t!) be used. We finally made some progress in this direction by starting to publish our table and column level metadata directly in our documentation.
CarbonPlan has published a new review of carbon removal projects, based on responses to Microsoft’s recent RFP. The general takeaway: most projects are low volume, impermanent, and didn’t provide enough detail to be evaluated critically. The one and only 5-star project they reviewed, Climeworks in Iceland, which does permanent mineralized geologic sequestration (injecting CO2 into fresh volcanic basalt) powered by geothermal energy, currently costs $1,100/ton of CO2, or $17,000/yr to sequester the emissions of a typical US person.
The censusviz package provides a straightforward Python interface to US census map and population data, which integrates directly with GeoPandas. Looks like it’s pulling data on the fly from the Census API.
A nice PyData talk from Julia Signell that stitches together data from Intake catalogs and a variety of interactive data visualization & dashboarding tools. Code from the talk can be found on GitHub.
Ghost looks like an interesting open source business, it’s organized as a non-profit foundation that has to reinvest all of its profits in itself, and can’t be bought out. They have developers all over the world, and produce a permissively licensed open source content management system.
Our proposal to the Sloan Foundation was funded! This means we’ve got 2 years of support to work on integrating a bunch of new data, automation of our ETL pipeline, and improving access to the results, some of which we outlined in our 2021 infrastructure roadmap. In combination with the client work we’ve been engaged in, this means we’ll need to recruit a new member to the co-op. We’re trying to figure out what skills and experience we are actually short on, relative to what we need to get done.
We continue to update our documentation, as we prepare for a new software and data release. A lot of it is related to the development environment, contributing, and how to run the ETL as it is now, while the user-facing side of the docs will hopefully be much simpler, directing folks to use preprocessed data with a Docker image that’s available on Zenodo, or our JupyterHub, or Datasette. We’ve set a deadline for the release of April 19th, which is also our next Advisory Board meeting.
We decided to simplify our branch management on GitHub. Instead of having separate ephemeral sprint branches, we’re just branching off of and making PRs against dev, and merging back into main after each 2 week long sprint, which seems to be far more common among open source projects, and is a lot less to try and keep track of. It should also mean we don’t have to re-direct lingering PRs to different base branches.
Data Wrangling Galore: Austen has been cleaning up the FERC Form 1 fuel table to capture previously discarded records that need information from other rows to make sense. Christina continues to work on estimating plant operating costs based on FERC+EIA linkages we’ve developed for RMI, and still need to merge into the main PUDL repo. Zane is trying to make our unit and prime-mover level heat rate estimates more robust and complete.
Our documentation now has an index all of the PUDL DB tables, including the names of the columns, their data types, and descriptions of the contents, thanks to some work with Jinja templates by Austen. This is just one small part of a bigger docs overhaul as we try and get PUDL 0.4.0 out by the end of March.
PUDL is finally compatible with Python 3.9, using both pip and conda. The last dependency to make the transition was Numba, which as of v0.53.0 works with Python 3.9. Our CI is now running tests on both Python 3.8 and 3.9.
We created a new EIA 861 archive on Zenodo. It includes data from 1990-2019, to support work Karl Dunkle Werner is doing (and he’s the one that updated our scrapers, and is working on integrating the older years). We noticed that there were changes to the 2012 and 2019 EIA 861 data too. But who knows what those changes entail! There aren’t even any formulae in these spreadsheets. Imagine if they published them as CSVs directly to GitHub, and we could see the diffs. Or use tools like daff to understand how the data is changing over time?
Zane nominated himself to speak at FERC’s April 16th workshop about the scope and mission of the new Office of Public Participation, focusing on issues of public data accessibility.
Zane finished overhauling our Tox / pytest setup to make it more intuitive and flexible (in the process of documenting the development setup) All the tests that we typically run can be run without needing any command line arguments, and the tests have been split into into distinct unit tests, software integration tests, and data validations. The software dependencies have also been simplified, with only those packages requiring or benefiting from precompiled binaries being specified in both setup.py and environment.yml specifications. This was inspired by our documentation update for the v0.4.0 release since it turns out documenting an overly complicated thing is kind of a lot of work, and that work is probably better put toward simplifying the thing instead.
What We’re Reading
How to build a community: starting with why? The beginning of what will hopefully be a series of posts from Claire Carroll, the community manager for dbt, on how to cultivate real, supportive online communities of practice around a given project or technology. This is something we’d really like Catalyst to learn how to do in the context of PUDL, and open energy data and modeling more generally.
RS21 looks like an interesting data consultancy, doing a lot of work with the public sector and NGOs.
VSCode Atom is Dead! Long live Atom! With Microsoft’s purchase of GitHub, development on the Atom editor has waned, and it’s started to feel a bit like abandonware. And given that Microsoft also maintains VS Code, it seems like it’s only a matter of time before we have no choice but to switch… to something. So we started playing around with VS Code, and it’s great!
OpenSustain.tech: A curated list of open technology projects to sustain a stable climate, energy supply, and vital natural resources. Gotta get PUDL in their dataset section!
Cloud native repositories for big scientific data, a paper by Ryan Abernathey et al., talking about the benefits of “Analysis Ready Cloud Optimized” (ARCO) datasets and developments in the field. Lots of parallels with where we’re going with PUDL, but with several orders of magnitude difference in the scale of the data.
Eliminating Toil: a definition of Toil from Google, and some musings on why less of it is better. Definitely resonated with the word we’re doing in PUDL.
A roundup of interesting posts related to data, code, energy, or climate that we came across in the first week of March, 2021.
Energy & Climate
Xcel Energy’s Comanche 3 coal plant in Colorado continues to be an expensive boodoggle. It’s spent 2 of its first 10 years of operation shut down for maintenance, forcing Xcel to buy electricity on the open market to fill the gap. But actually this was good for their customers, since the electricity the plant produces when it is running is so expensive ($66.25/MWh) that this actually saved customers money! Unfortunately they’ll still be on the hook for remaining capital costs far into the future. Xcel thinks they’ll use the plant as a seasonal or load following resource after 2030… at which point Xcel will still have $460M left in the plant.
Market Design for the Clean Energy Transition: Advancing Long-Term Approaches. Materials from a workshop put on by WRI and Resources for the Future, exploring how electricity markets need to adapt to accommodate lots of zero carbon, very low marginal cost generation, that’s also not entirely dispatchable.
Securitization in Action: How US States are Shaping an Equitable Coal Transition: a post from some of our collaborators at RMI, looking at some of the work our data liberation has helped enable — namely getting uneconomic fossil generation offline as cheaply as possible. Well, as cheaply as possible without forcing the utilities to absorb the costs anyway.
CMIP6: the next generation of climate models explained. A look at how climate scientists compare their models in a standardized way, so that they can understand why they get different answers sometimes. This is something we really need more of in the energy modeling space — otherwise every conversation eventually devolves into criticizing the inputs and assumptions.
Data & Code
Command Line Interface Guidelines: a collection of best practices for designing modern command line tools that are relatively user friendly, and take advantage of many features of modern Unix terminals.
Column Names as Contracts: an interesting post by Emily Reiderer about the potential benefits of storing metadata in column names using a controlled vocabulary, allowing them to be programmatically parsed.
Embedding column-name contracts in data pipelines with dbt builds on that last post, and looks at how Jinja templates and tools like dbt let you do more interesting dynamic data work if your columns have consistent and controlled names.
What is dbt anyway? It stands for “data build tool” and it can be used to specify, store, and version control complex data transformation instructions as text files. A lot of the data we’re working with from FERC and EIA are too messy for this to be helpful in our initial ETL process, but once we’ve got the databases being populated in the cloud automatically, this could be a good way to create new derived data products. Thanks to our friend Brittany Bennett at Sunrise Movement for telling us about dbt.
I helped build ByteDance’s censorship machine. A story about what it’s like to work inside a tech company actively implementing censorship measures. ByteDance is the Chinese owner of TikTok.
Documentation for pydantic. We’re trying to make all of our metadata programmatically accessible, and remove duplication wherever possible, and using pydantic to parse and validate the metadata we compile by hand so we know it’s at least structurally sound.
Python Packages is an online book about how to package and distribute… Python packages. We wish we’d had this a couple of years ago when we were figuring it out for the first time! Focuses on modern rather than legacy frameworks, going straight for pyproject.toml, poetry, and CI/CD using GitHub actions. There’s also a cookie cutter repo on GitHub that templates many of the practices from the book. Via Tiffany Timbers.
Eliminating Toil is a short essay from some Googlers on the nature of a particular kind of work that shows up in many data wrangling (and software) contexts. A lot of our mission here is saving others from data toil.
Great Expectations and Pandas Profiling: a blog post on how to use these two tools together to automatically draft data validation test cases. Vaguely along the same lines as Pandera, though that library has more of a statistical bent.
Users have been asking for live access to our data forever, either via a PUDL API or a web interface, but we didn’t feel like we had the resources to maintain that kind of service and ensure it was reliable. Then a few weeks ago we came across an awesome open source project called Datasette that takes SQLite databases, wraps them in a Docker container, and lets users explore the data with their web browser.
It’s perfect for publishing read-only, infrequently updated data. That’s exactly what we’re doing with PUDL, and we’re already storing the data in SQLite, so it only took an afternoon to get the development version of our databases published. This goes a long way toward satisfying some of our data access goals for less technical users, which we touched on a few weeks ago in this post.
Our Datasette instance can be found at https://data.catalyst.coop and it contains both the raw FERC Form 1 DB, with all of the Form 1 data from 1994-2019, and our PUDL DB, which includes the EIA 860 and EIA 923 data from 2009-2019, and the subset of the (113!) FERC Form 1 tables that we’ve taken the time to clean up so far.
The system has already made it easier for us to collaborate and share the huge pile of data we’ve compiled over the last four years. We’re looking forward to using this system to get our data into the hands of more users.
Just a few examples of custom SQL queries or whole tables:
Please give it a spin, and let us know what you think! This is still experimental, and the interface will probably evolve. If you find problems, feel free to create an issue on GitHub, or drop us a line at pudl@catalyst.coop. Also, we’re still hoping to get the EIA 861 and FERC 714 integrated by the end of the year. See our Data We Wrangle page for additional datasets of interest. And if you’ve got other favorite tools for publishing live, open data, let us know in the comments.