Categories
updates

Rescuing Historical FERC Data

This week we discovered that decades’ worth of energy system data collected by the Federal Energy Regulatory Commission (FERC) had been removed from the agency’s website. The agency apparently has no plan to archive it or migrate it to another platform. We are attempting to obtain a bulk download of all this data so we can archive it alongside our other raw data sources on Zenodo.

This data records many financial, operational, and economic aspects of the US energy system. It is a unique and valuable resource for anyone trying to understand how public policy and market conditions have shaped our energy system over time. Simply deleting this data with no warning, and with no plan to archive it or migrate it to another platform, is completely unacceptable.

If you know someone within FERC who can help get us a copy of this data to archive publicly, please put us in touch: hello@catalyst.coop

Categories
linkstream

Data pipelines, stewardship and cleaning

Categories
updates

PUDL v0.5.0: 2020 and Beyond

It’s been almost a month since we pushed out our first actual quarterly software and data release: PUDL v0.5.0! The main impetus for this release was to get the final annual 2020 data integrated for the FERC and EIA datasets we process. We also pulled in the EIA 860 data for 2001-2003, which is only available as DBF files, rather than Excel spreadsheets. This means we’ve got coverage going back to 2001 for all of our data now! Twenty years! We don’t have 100% coverage of all of the data contained in those datasets yet, but we’re getting closer.

Beyond simply updating the data, we’ve also been making some significant changes to how our ETL pipeline works under the hood. This includes how we store metadata, how we generate the database schema, and what outputs we’re generating. The release notes contain more details on the code changes, so here I want to talk a little bit more about why, and where we are hopefully headed.

If you just want to download the new data release and start working with it, it’s up here on Zenodo. The same data for FERC 1 and EIA 860/923 can also be found in our Datasette instance at https://data.catalyst.coop

Categories
linkstream

Are electric utilities planning for climate change?

Oil and gas companies operating in the Arctic and other areas impacted by climate change have been adapting their operations and infrastructure planning to the melting permafrost and other long-term impacts of their pyromania for decades, even while spreading disinformation about the same processes publicly. But are electric utilities doing the same kind of planning?

We’ve been thinking a bit about the ways in which the energy system in the US West is exposed to potential climate risks, in the context of long-term utility resource adequacy and operational planning. We posted a short thread on Twitter and got some references from the #EnergyTwitter hive mind.

Categories
linkstream

Modern Data Stack, Etc.

  • ETL vs. ELT — a comparison of two data pipeline architectures from the folks at Fivetran.
  • What is the Modern Data Stack? another post from Fivetran, attempting to define the different components of data engineering pipelines as they appear to be coming together in the last few years.
  • A good interactive introduction to SQL from Mode. You can even use your own data as you work through it, if it’s in a database online. Broken down into beginner, intermediate, and advanced sections.
  • Hex is another platform that seems similar to Mode, for collaborating on data analyses using notebooks and a combination of straight SQL and Python. Again, you load your own data directly via an online DB connection. I admit that after seeing it mentioned for months I only clicked through after realizing it was named after the magic / science hybrid technology depicted in Arcane.
  • Cookiecutter Data Science is a cookie-cutter repo and a set of guidelines for standardizing data science projects to be more easily replicated and parsed by other people.
  • Thou Shalt Scale Sustainably: some thoughts (commandments…) on how to scale social enterprises (especially when dependent on foundation funding) from the Shuttleworth Foundation. Not related to the so-called Modern Data Stack.

Categories
updates

New PUDL Software & Data Release: v0.4.0

In August we put out a new PUDL software and data release for the first time in 18 months. We had a lot of client work, and kept putting off doing the release, so a whole lot of changes accumulated. Some highlights, mostly based on the continuously updated release notes in our documentation:

New Data Coverage

  • EIA Form 860 added coverage for 2004-2008, as well as 2019.
  • EIA Form 860m has been integrated (through Nov 2020). Note that it only adds up-to-date information about generators (especially their operational status).
  • EIA Form 923 added the 2001-2008 data, as well as 2019.
  • EPA CEMS Hourly Emissions covering 2019-2020.
  • FERC Form 714 covering 2006-2019, but only the table of hourly electricity demand by planning area. This data is still in beta and hasn’t been integrated into the core SQLite database, but you can process it on the fly if you want to work with it in Pandas.
  • EIA Form 861 for 2001-2019. Similar to the FERC Form 714, this ETL runs on the fly and the outputs aren’t integrated into the database yet, but it’s available for experimental use.
  • US Census Demographic Profile 1 (DP1) for 2010. This is a separate SQLite database, generated from a US Census Geodatabase, which includes census tract, county, and state level demographic information, as well as spatial boundaries of those jurisdictions.
Categories
updates

Environmental Justice Data Liberation

We’ve come across a few allied projects looking at environmental justice data specifically, and thought it would be nice to share!

Environmental Enforcement Watch

In May, Christina and I gave a talk at CSV,Conf,v6 about things we’ve learned liberating US energy system data. We focused a lot on the challenge of making data accessible to advocates. The following talk was analogous, but focused on environmental justice data. The speaker was Kelsey Breseman (@ifoundtheme) from the Environmental Data and Governance Initiative (EDGI) and their project Environmental Enforcement Watch (EEW). EEW is trying to hold polluters accountable using federally reported data, by making that data more accessible to and understandable by the people who are affected. They’re scraping the data from the web and creating a database that folks can query using Google CoLab notebooks. At the same time they’re trying to get EPA to make the full underlying database accessible to the public.

You can watch her excellent talk here:

I was struck by how many parallels there were between our work. We’re both trying to mitigate the poor curation of government data, and make it more accessible to the public. EDGI also seems very open and GitHub centered and is trying to operate as a horizontal organization. They support themselves through foundation grants and volunteer labor. Nobody works on EDGI full time. They have a fiscal sponsorship agreement through Earth Science Information Partners (ESIP).

If you’re interested in public data and environmental justice they seem like a great organization! Maybe we can collaborate at some point.

Categories
linkstream weeknotes

SQL for data analysis, DGP, and pair programming

Some good technical long reads from the last couple of weeks:

(Postgre)SQL for Data Analysis

Before the Tidyverse and Pandas, there was SQL. There’s still SQL, and as Vicki Boykis often points out: every data-centric framework that hangs around long enough tends toward SQL. It’s got almost half a century of careful thinking and optimization behind it. It seems entirely possible that it’ll still be around after another half century.

In this extensive post Haki Benita explores a bunch of data analysis that can be done directly with PostgreSQL in particular. It can be used either as an efficient preprocessing step before handing off to other tools, or to generate final products. It covers basic data selection, random selection, sampling, splitting data into training & testing sets, descriptive statistics, aggregations, regressions, interpolation, binning and much more. It’s almost more of a pocket guide to data analysis in SQL than a blog post.
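As a rough illustration of the kind of in-database aggregation and binning described above, here is a minimal sketch. It is not taken from the post: it uses Python's built-in sqlite3 module and SQLite rather than PostgreSQL so that it runs without a database server, and the `generation` table, its columns, and its values are all made up for the example.

```python
import sqlite3

# Hypothetical records of (fuel, year, net_mwh). The values are invented
# purely for illustration and do not come from any real dataset.
ROWS = [
    ("coal", 2019, 120.0),
    ("coal", 2020, 95.0),
    ("gas", 2019, 80.0),
    ("gas", 2020, 110.0),
    ("wind", 2019, 30.0),
    ("wind", 2020, 45.0),
]


def _load(rows):
    """Create an in-memory SQLite database holding the example table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE generation (fuel TEXT, year INTEGER, net_mwh REAL)")
    conn.executemany("INSERT INTO generation VALUES (?, ?, ?)", rows)
    return conn


def summarize_generation(rows):
    """Descriptive statistics per fuel, computed entirely in SQL."""
    conn = _load(rows)
    try:
        query = """
            SELECT fuel, COUNT(*), SUM(net_mwh), AVG(net_mwh)
            FROM generation
            GROUP BY fuel
            ORDER BY fuel
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


def bin_generation(rows, bin_width=50):
    """Count records per fixed-width bin of net_mwh, computed in SQL."""
    conn = _load(rows)
    try:
        query = """
            SELECT CAST(net_mwh / ? AS INTEGER) * ? AS bin_start, COUNT(*)
            FROM generation
            GROUP BY bin_start
            ORDER BY bin_start
        """
        return conn.execute(query, (bin_width, bin_width)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(summarize_generation(ROWS))
    print(bin_generation(ROWS))
```

The results come back as small lists of tuples, which can either be the final product or be handed off to another tool for further analysis.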

Data (Error) Generation Processes

In this post Emily Riederer explores how conceptualizing data (and error!) generation processes can help you do better data validation. What does the data represent in the real world? How is it being collected? How does it move from where it’s collected to where it’s processed? What kinds of transformations operate on it before you look at the outputs? Understanding these steps and their contexts makes it easier to imagine how things can go wrong along the way and what errors to check for. It also makes it easier to debug errors when you find them.
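To make that idea a bit more concrete, here is a small sketch of validation checks that each correspond to a guess about how something could go wrong on the way from collection to analysis. It is not from the post. The hourly demand records, the check names, and the `max_plausible_mw` threshold are all hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta


def validate_hourly_demand(records, max_plausible_mw=100_000):
    """Return a list of (check_name, timestamp) problems in hourly records.

    records is a list of (datetime, demand_mw) tuples, where demand_mw may
    be None. Each check maps to an imagined failure in the data generation
    process:
      - duplicate_hour: the same hour was reported or loaded twice
      - missing_value: a reading was never recorded
      - negative_demand: a sign error or a net vs. gross mix-up
      - implausibly_large: a unit error, e.g. kW reported as MW
      - missing_hour: a gap in collection or transmission
    """
    problems = []
    seen = set()
    for ts, demand in records:
        if ts in seen:
            problems.append(("duplicate_hour", ts))
        seen.add(ts)
        if demand is None:
            problems.append(("missing_value", ts))
        elif demand < 0:
            problems.append(("negative_demand", ts))
        elif demand > max_plausible_mw:
            problems.append(("implausibly_large", ts))

    # Walk the full expected hourly range and report any hours never seen.
    if seen:
        ts, end = min(seen), max(seen)
        while ts <= end:
            if ts not in seen:
                problems.append(("missing_hour", ts))
            ts += timedelta(hours=1)
    return problems


if __name__ == "__main__":
    start = datetime(2020, 1, 1)
    hour = timedelta(hours=1)
    example = [
        (start, 500.0),
        (start + hour, -20.0),
        (start + hour, 510.0),
        (start + 3 * hour, 505_000.0),
        (start + 4 * hour, None),
    ]
    for name, ts in validate_hourly_demand(example):
        print(name, ts.isoformat())
```

Naming each check after the failure it is meant to catch also helps with debugging: when a check fires, it points back at a specific step in the process rather than just at a bad number.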

On Pair Programming

A guide to pair programming from Birgitta Böckeler and Nina Siessegger. They look at both how and why to do it, and some of the challenges that it brings up. I had no idea that this has been a practice going back as far as the women who programmed ENIAC.

The authors explore several different styles of pair programming and the logistical planning required to make it work. They touch on the extra challenges of doing remote pairing, which seems extra relevant these days. They cover productive and destructive social dynamics that come up, and a whole lot more. The article is long, but it’s definitely worth a read if you’ve thought about trying pair programming and been reluctant, or have tried it and been dissatisfied.

Categories
updates

Automated Data Wrangling

An illustration from the Frog and Toad children's books, where Frog and Toad are eating cookies. The caption has been altered to say "We must stop data cleaning!" cried Toad as he continued to clean the data.
Frog and Toad are Data Wranglers

We work with a lot of messy public data. In theory it’s already “structured” and published in machine readable forms like Microsoft Excel spreadsheets, poorly designed databases, and CSV files with no associated schema. In practice it ranges from almost unstructured to… almost structured. Someone working on one of our take-home questions for the data wrangler & analyst position recently noted of the FERC Form 1: “This database is not really a database – more like a bespoke digitization of a paper form that happened to be built using a database.” And I mean, yeah. Pretty much. The more messy datasets I look at, the more I’ve started to question Hadley Wickham’s famous Tolstoy quip about the uniqueness of messy data. There’s a taxonomy of different kinds of messes that go well beyond what you can easily fix with a few nifty dataframe manipulations. It seems like we should be able to develop higher level, more general tools for doing automated data wrangling. Given how much time highly skilled people pour into this kind of computational toil, it seems like it would be very worthwhile.

Like families, tidy datasets are all alike but every messy dataset is messy in its own way.

Hadley Wickham, paraphrasing Leo Tolstoy in Tidy Data
Categories
updates

Catalyst partners release energy transition resources

In the past two weeks, Catalyst partners Energy Innovation and the Rocky Mountain Institute have released two major resources based on open data to help stakeholders better understand the energy transition in the US electricity sector. We’re excited to say that Catalyst team members prepared data from Catalyst’s Public Utility Data Liberation project and provided analytical support for both resources.

Energy Innovation’s Coal Cost Crossover 2.0 provided an update to their 2018 report, which projected that by 2025 three quarters of the nation’s coal power plants would be uneconomic. The 2.0 update shows that the economics of coal power in the US have deteriorated more rapidly than expected. The report finds that 80% of existing coal plants are either uneconomic or slated to retire before 2025. Economic viability is assessed by comparing coal plant operating costs with estimates of the cost of building new renewable facilities nearby, using the levelized cost of wind and solar energy estimates from the National Renewable Energy Laboratory’s Regional Energy Deployment System (ReEDS) model. Coal operating costs are derived from fuel and operations/maintenance data from FERC and EIA, or from estimates from the National Energy Modeling System where FERC and EIA data was unavailable.
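The core comparison can be sketched in a few lines. This is only an illustration of the idea, not the report's actual method: the plant names and dollar figures are invented, and comparing against the cheaper of the local wind and solar estimates is an assumption made here for simplicity.

```python
# Hypothetical plants with costs in $/MWh. These numbers are made up for
# illustration and are not drawn from FERC, EIA, NEMS, or ReEDS.
PLANTS = [
    {"name": "Plant A", "coal_cost": 35.0, "wind_lcoe": 30.0, "solar_lcoe": 38.0},
    {"name": "Plant B", "coal_cost": 25.0, "wind_lcoe": 32.0, "solar_lcoe": 29.0},
    {"name": "Plant C", "coal_cost": 41.0, "wind_lcoe": 45.0, "solar_lcoe": 36.0},
    {"name": "Plant D", "coal_cost": 28.0, "wind_lcoe": 28.5, "solar_lcoe": 33.0},
]


def is_uneconomic(plant):
    """True if running the coal plant costs more per MWh than new renewables.

    Assumption: the benchmark is the cheaper of nearby wind and solar.
    """
    cheapest_renewable = min(plant["wind_lcoe"], plant["solar_lcoe"])
    return plant["coal_cost"] > cheapest_renewable


def share_uneconomic(plants):
    """Fraction of plants (by simple count) that fail the comparison."""
    if not plants:
        return 0.0
    return sum(is_uneconomic(p) for p in plants) / len(plants)


if __name__ == "__main__":
    for plant in PLANTS:
        print(plant["name"], "uneconomic" if is_uneconomic(plant) else "economic")
    print(share_uneconomic(PLANTS))
```

A real analysis would also need to decide how to weight plants (for example by capacity or generation) and how to treat plants already slated for retirement; the sketch above simply counts plants.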

The Rocky Mountain Institute recently released the first version of their Utility Transition Hub, an interactive data portal that allows users to track, quantify, and understand how investments, operations, policies, and regulations shape outcomes in the electricity sector. Stakeholders can explore the energy transition in the power sector as a whole, group subsidiary utilities by their parent company, or make comparisons between utilities. Cleaned data from FERC and EIA underlie Tableau visualizations which help users to evaluate historical performance on emissions reductions and investments in renewables, and to assess the alignment of resource planning and climate commitments with a 1.5 degree C trajectory.

Comparing attributes of Duke Energy Corporation’s operating subsidiaries, segmented by plant type, on RMI’s Utility Transition Hub.