Categories
updates

PUDL v0.5.0: 2020 and Beyond

It’s been almost a month since we pushed out our first actual quarterly software and data release: PUDL v0.5.0! The main impetus for this release was to get the final annual 2020 data integrated for the FERC and EIA datasets we process. We also pulled in the EIA 860 data for 2001-2003, which is only available as DBF files, rather than Excel spreadsheets. This means we’ve got coverage going back to 2001 for all of our data now! Twenty years! We don’t have 100% coverage of all of the data contained in those datasets yet, but we’re getting closer.

Beyond simply updating the data, we’ve also been making some significant changes to how our ETL pipeline works under the hood. This includes how we store metadata, how we generate the database schema, and what outputs we’re generating. The release notes contain more details on the code changes, so here I want to talk a little bit more about why, and where we are hopefully headed.

If you just want to download the new data release and start working with it, it’s up here on Zenodo. The same data for FERC 1 and EIA 860/923 can also be found in our Datasette instance at https://data.catalyst.coop.

Categories
updates

New PUDL Software & Data Release: v0.4.0

In August we put out a new PUDL software and data release for the first time in 18 months. We had a lot of client work, and kept putting off doing the release, so a whole lot of changes accumulated. Some highlights, mostly based on the continuously updated release notes in our documentation:

New Data Coverage

  • EIA Form 860 added coverage for 2004-2008, as well as 2019.
  • EIA Form 860m has been integrated (through Nov 2020). Note that it only adds up-to-date information about generators (especially their operational status).
  • EIA Form 923 added the 2001-2008 data, as well as 2019.
  • EPA CEMS Hourly Emissions covering 2019-2020.
  • FERC Form 714 covering 2006-2019, but only the table of hourly electricity demand by planning area. This data is still in beta and hasn’t been integrated into the core SQLite database, but you can process it on the fly if you want to work with it in Pandas.
  • EIA Form 861 for 2001-2019. Similar to the FERC Form 714, this ETL runs on the fly and the outputs aren’t integrated into the database yet, but it’s available for experimental use.
  • US Census Demographic Profile 1 (DP1) for 2010. This is a separate SQLite database, generated from a US Census Geodatabase, which includes census tract, county, and state level demographic information, as well as spatial boundaries of those jurisdictions.
Categories
updates

Publishing PUDL with Datasette

Users have been asking for live access to our data forever, either via a PUDL API or a web interface, but we didn’t feel like we had the resources to maintain that kind of service and ensure it was reliable. Then a few weeks ago we came across an awesome open source project called Datasette that takes SQLite databases, wraps them in a Docker container, and lets users explore the data with their web browser.

It’s perfect for publishing read-only, infrequently updated data. That’s exactly what we’re doing with PUDL, and we’re already storing the data in SQLite, so it only took an afternoon to get the development version of our databases published. This goes a long way toward satisfying some of our data access goals for less technical users, which we touched on a few weeks ago in this post.

Our Datasette instance can be found at https://data.catalyst.coop and it contains both the raw FERC Form 1 DB, with all of the Form 1 data from 1994-2019, and our PUDL DB, which includes the EIA 860 and EIA 923 data from 2009-2019, and the subset of the (113!) FERC Form 1 tables that we’ve taken the time to clean up so far.
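Because Datasette exposes each SQLite database over HTTP, a custom SQL query can be expressed as a plain URL. Here is a minimal sketch of building one with the Python standard library. It relies on Datasette’s general `/<database>.json?sql=...` URL pattern; the database name `pudl` and the table name `plants_eia860` are assumptions used for illustration, so check the instance for the actual names before relying on them.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.catalyst.coop"


def datasette_query_url(database: str, sql: str, shape: str = "array") -> str:
    """Build a Datasette JSON query URL for a read-only SQL statement.

    `database` is the name Datasette publishes the SQLite file under.
    `shape` maps to Datasette's `_shape` option ("array" returns a list of rows).
    """
    params = urlencode({"sql": sql, "_shape": shape})
    return f"{BASE_URL}/{database}.json?{params}"


# Assumed database and table names, for illustration only.
url = datasette_query_url("pudl", "select * from plants_eia860 limit 5")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client. Nothing here modifies the published data, since the service is read-only.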

The system has already made it easier for us to collaborate and share the huge pile of data we’ve compiled over the last four years. We’re looking forward to using this system to get our data into the hands of more users.

Just a few examples of custom SQL queries or whole tables:

Please give it a spin, and let us know what you think! This is still experimental, and the interface will probably evolve. If you find problems, feel free to create an issue on GitHub, or drop us a line at pudl@catalyst.coop. Also, we’re still hoping to get the EIA 861 and FERC 714 integrated by the end of the year. See our Data We Wrangle page for additional datasets of interest. And if you’ve got other favorite tools for publishing live, open data, let us know in the comments.

Categories
updates

PUDL Infrastructure Roadmap for 2021

A couple of weeks ago I attended TWEEDS 2020 virtually (like everything this year) and talked about Catalyst’s ongoing Public Utility Data Liberation (PUDL) project, and especially the challenges of getting a big pile of data into the hands of different kinds of users, using different tools for different purposes. The talk ended up sketching out a bit of a PUDL infrastructure roadmap for the next year, so we thought it would be a good idea to write it up here too.

We’ll have a separate post looking at our 2021 data roadmap.

The US Energy Information Asymmetry

PUDL is all about addressing a big information asymmetry in the regulatory and legislative processes that affect the US energy system. Utilities have much more information about their own systems than policymakers and advocates typically do. As a result, regulators often defer to the utilities on technical & analytical points. Commercial data exists, but it’s expensive. We want to get enough data into the hands of other kinds of stakeholders that they can make credible quantitative arguments to regulators, and challenge unfounded assertions put forward by utilities.

Federal Agencies and Their Favorite File Formats
Categories
analysis

Boiler Generator Associations from EIA 923 and 860

In working to calculate the marginal cost of electricity for all of the generating units across the country, we first had to calculate the heat rate (MMBtu per MWh) for each generating unit. The heat rate allows us to attribute the fuel costs, reported at the plant level, to the electric generation, reported at the generating unit level. The heat rate is derived from fuel consumption (MMBtu), reported at the boiler level, and electricity generation (MWh), reported at the generating unit level. To calculate the heat rate, one must link up all the boilers with the generators in a given generating unit. Our work to this end uncovered a hole in the boiler generator associations reported in EIA 860. We filled this hole through a series of matching cartwheels and network analysis.
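To make the idea concrete, here is a minimal sketch of the network approach, not our actual implementation. It treats boilers and generators as nodes in a graph, treats each reported association as an edge, and calls every connected component a generating unit. The heat rate for a unit is then its total boiler fuel consumption divided by its total generation. The IDs and numbers below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (boiler_id, generator_id) associations for one plant.
associations = [("B1", "G1"), ("B2", "G1"), ("B2", "G2"), ("B3", "G3")]
# Hypothetical annual fuel consumption by boiler (MMBtu).
fuel_mmbtu = {"B1": 1000.0, "B2": 3000.0, "B3": 550.0}
# Hypothetical annual net generation by generator (MWh).
net_gen_mwh = {"G1": 250.0, "G2": 150.0, "G3": 50.0}


def find_units(edges):
    """Group boilers and generators into connected components."""
    graph = defaultdict(set)
    for boiler, gen in edges:
        # Tag each node with its kind so boiler and generator IDs can't collide.
        b, g = ("boiler", boiler), ("generator", gen)
        graph[b].add(g)
        graph[g].add(b)

    seen = set()
    units = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first search to collect everything reachable from `start`.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        units.append(component)
    return units


def unit_heat_rates(edges, fuel, gen):
    """Return (boilers, generators, heat rate in MMBtu/MWh) for each unit."""
    rates = []
    for unit in find_units(edges):
        boilers = sorted(name for kind, name in unit if kind == "boiler")
        gens = sorted(name for kind, name in unit if kind == "generator")
        total_fuel = sum(fuel[b] for b in boilers)
        total_gen = sum(gen[g] for g in gens)
        rates.append((boilers, gens, total_fuel / total_gen))
    return rates


heat_rates = unit_heat_rates(associations, fuel_mmbtu, net_gen_mwh)
for boilers, gens, rate in heat_rates:
    print(boilers, gens, rate)
```

In this toy plant, boilers B1 and B2 end up in one unit with generators G1 and G2 because B2 feeds both generators, while B3 and G3 form a unit of their own. Real data would also need handling for missing fuel or generation values and for units with zero generation.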

We’ve recently reconfigured our database ingest process to move the new and improved boiler generator associations into their own table in PUDL. You can also read through this process as a GitHub Gist.

Categories
analysis

Heat Rate Calculation for EIA Generators

Catalyst is pulling together an estimate of the marginal cost of electricity (MCOE) for every natural gas and coal-fired power plant in the US whose data we can get our hands on. We’re using data from the EIA 923, EIA 860, and FERC Form 1 to do it. Getting the heat rate right for each generator is an important part of this calculation, but a lot of the required data is… not perfect. Here’s how we’re working through it.
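As a rough illustration of why the heat rate matters, the fuel component of the cost per MWh is the heat rate (MMBtu per MWh) multiplied by the fuel price (dollars per MMBtu). The sketch below is a simplified example under that assumption, not the full MCOE calculation. The function names and numbers are hypothetical, and it returns `None` when the generation data is missing or zero rather than guessing.

```python
from typing import Optional


def heat_rate(fuel_mmbtu: Optional[float], net_gen_mwh: Optional[float]) -> Optional[float]:
    """Heat rate in MMBtu per MWh, or None if the inputs can't support one."""
    if fuel_mmbtu is None or net_gen_mwh is None or net_gen_mwh <= 0:
        return None
    return fuel_mmbtu / net_gen_mwh


def fuel_cost_per_mwh(
    rate_mmbtu_per_mwh: Optional[float], fuel_price_per_mmbtu: float
) -> Optional[float]:
    """Fuel cost in dollars per MWh, given a heat rate and a fuel price."""
    if rate_mmbtu_per_mwh is None:
        return None
    return rate_mmbtu_per_mwh * fuel_price_per_mmbtu


# Hypothetical generator: 10,500 MMBtu burned to produce 1,000 MWh.
rate = heat_rate(10500.0, 1000.0)
# Hypothetical fuel price of $2.00 per MMBtu.
cost = fuel_cost_per_mwh(rate, 2.0)
print(rate, cost)

# A generator with no reported generation yields no heat rate and no cost.
missing = fuel_cost_per_mwh(heat_rate(10500.0, None), 2.0)
```

Any error in the heat rate carries straight through to the cost estimate, which is why imperfect fuel and generation data needs careful treatment.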