
What we’re reading for the week of March 1st, 2021

A roundup of interesting posts related to data, code, energy, or climate that we came across in the first week of March, 2021.

Energy & Climate

  • Xcel Energy’s Comanche 3 coal plant in Colorado continues to be an expensive boondoggle. It has spent 2 of its first 10 years of operation shut down for maintenance, forcing Xcel to buy electricity on the open market to fill the gap. Oddly, that was good for customers: the electricity the plant produces when it is running is so expensive ($66.25/MWh) that buying replacement power actually saved them money! Unfortunately, customers will still be on the hook for the remaining capital costs far into the future. Xcel thinks it will use the plant as a seasonal or load-following resource after 2030… at which point Xcel will still have $460M sunk in the plant.
  • Market Design for the Clean Energy Transition: Advancing Long-Term Approaches. Materials from a workshop put on by WRI and Resources for the Future, exploring how electricity markets need to adapt to accommodate large amounts of zero-carbon, very low marginal cost generation that is not entirely dispatchable.
  • A post from the Energy and Policy Institute looking at political “miscellaneous” spending by utilities, as reported in the FERC Form 1 — using our data!
  • Securitization in Action: How US States are Shaping an Equitable Coal Transition: a post from some of our collaborators at RMI, looking at some of the work our data liberation has helped enable — namely getting uneconomic fossil generation offline as cheaply as possible. Well, as cheaply as possible without forcing the utilities to absorb the costs anyway.
  • CMIP6: the next generation of climate models explained. A look at how climate scientists compare their models in a standardized way, so that they can understand why they get different answers sometimes. This is something we really need more of in the energy modeling space — otherwise every conversation eventually devolves into criticizing the inputs and assumptions.

Data & Code

  • Command Line Interface Guidelines: a collection of best practices for designing command line tools that are relatively user friendly and take advantage of the many features of modern Unix terminals.
  • Column Names as Contracts: an interesting post by Emily Riederer about the potential benefits of storing metadata in column names using a controlled vocabulary, allowing them to be programmatically parsed (see the first sketch after this list).
  • Embedding column-name contracts in data pipelines with dbt builds on that post, and looks at how Jinja templates and tools like dbt let you do more interesting dynamic data work if your columns have consistent and controlled names.
  • What is dbt anyway? It stands for “data build tool” and it can be used to specify, store, and version control complex data transformation instructions as text files. A lot of the data we’re working with from FERC and EIA are too messy for this to be helpful in our initial ETL process, but once we’ve got the databases being populated in the cloud automatically, this could be a good way to create new derived data products. Thanks to our friend Brittany Bennett at Sunrise Movement for telling us about dbt.
  • I helped build ByteDance’s censorship machine. A story about what it’s like to work inside a tech company actively implementing censorship measures. ByteDance is the Chinese owner of TikTok.
  • Documentation for pydantic. We’re trying to make all of our metadata programmatically accessible and remove duplication wherever possible, using pydantic to parse and validate the metadata we compile by hand so we know it’s at least structurally sound (see the second sketch after this list).
  • Python Packages is an online book about how to package and distribute… Python packages. We wish we’d had it a couple of years ago when we were figuring this out for the first time! It focuses on modern rather than legacy frameworks, going straight for pyproject.toml, poetry, and CI/CD using GitHub Actions. There’s also a cookiecutter repo on GitHub that templates many of the practices from the book. Via Tiffany Timbers.
  • EPA has released a crosswalk table that connects their CEMS data to the EIA boilers and generators. Thank goodness we won’t have to compile it now. More info in their GitHub repo.
  • Nice preprint from Ryan Abernathey et al. on cloud-native scientific data repositories. This is very much in line with our plans for the PUDL data — even though our data is several orders of magnitude smaller than a lot of what the paper discusses.
  • Eliminating Toil is a short essay from some Googlers on the nature of a particular kind of work that shows up in many data wrangling (and software) contexts. A lot of our mission here is saving others from data toil.
  • Great Expectations and Pandas Profiling: a blog post on how to use these two tools together to automatically draft data validation test cases. Vaguely along the same lines as Pandera, though that library has more of a statistical bent (see the last sketch below).
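
To make the column-name contract idea concrete, here is a minimal sketch of parsing a controlled vocabulary out of column names. The prefix vocabulary here is hypothetical, loosely inspired by the post; Riederer’s actual scheme is richer.

```python
# A minimal, hypothetical controlled vocabulary: each column name starts
# with a type prefix, e.g. "n_capacity_mw" or "dt_report_date".
TYPE_PREFIXES = {"n": "numeric", "dt": "datetime", "id": "identifier", "cat": "categorical"}

def parse_column(name: str) -> dict:
    """Split a contract-style column name into its metadata components."""
    prefix, _, descriptor = name.partition("_")
    if prefix not in TYPE_PREFIXES:
        raise ValueError(f"{name!r} does not follow the naming contract")
    return {"column": name, "dtype": TYPE_PREFIXES[prefix], "descriptor": descriptor}

# Because the vocabulary is machine readable, columns can be selected in bulk:
columns = ["id_plant", "n_capacity_mw", "dt_report_date"]
numeric = [c for c in columns if parse_column(c)["dtype"] == "numeric"]
print(numeric)  # ['n_capacity_mw']
```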
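And here is a minimal sketch of the kind of structural validation we mean for hand-compiled metadata, written against pydantic v1. The field names are simplified stand-ins, not our actual metadata schema.

```python
from typing import List
from pydantic import BaseModel, validator

# Hypothetical, simplified stand-in for a hand-compiled metadata record.
class ColumnMeta(BaseModel):
    name: str
    dtype: str
    description: str

class TableMeta(BaseModel):
    name: str
    columns: List[ColumnMeta]

    @validator("columns")
    def column_names_must_be_unique(cls, cols):
        names = [c.name for c in cols]
        if len(names) != len(set(names)):
            raise ValueError("duplicate column names")
        return cols

# pydantic raises a ValidationError if the hand-edited metadata is malformed:
meta = TableMeta(
    name="plants_eia860",
    columns=[{"name": "plant_id_eia", "dtype": "integer", "description": "EIA plant ID"}],
)
```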
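Finally, since Pandera came up: a tiny sketch of the declarative checks that library encourages. The column names and bounds here are made up for illustration.

```python
import pandas as pd
import pandera as pa

# A hand-written schema of the sort these tools can help draft automatically.
schema = pa.DataFrameSchema({
    "heat_rate_mmbtu_mwh": pa.Column(float, pa.Check.in_range(5.0, 20.0)),
    "capacity_mw": pa.Column(float, pa.Check.gt(0)),
})

df = pd.DataFrame({"heat_rate_mmbtu_mwh": [9.5, 10.2], "capacity_mw": [500.0, 750.0]})
schema.validate(df)  # raises a SchemaError if any check fails
```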

Publishing PUDL with Datasette

Users have been asking for live access to our data forever, either via a PUDL API or a web interface, but we didn’t feel like we had the resources to maintain that kind of service and ensure it was reliable. Then a few weeks ago we came across an awesome open source project called Datasette that wraps SQLite databases in a web application (packaged in a Docker container for deployment) and lets users explore the data with their web browser.

It’s perfect for publishing read-only, infrequently updated data. That’s exactly what we’re doing with PUDL, and we’re already storing the data in SQLite, so it only took an afternoon to get the development version of our databases published. This goes a long way toward satisfying some of our data access goals for less technical users, which we touched on a few weeks ago in this post.

Our Datasette instance can be found at https://data.catalyst.coop and it contains both the raw FERC Form 1 DB, with all of the Form 1 data from 1994-2019, and our PUDL DB, which includes the EIA 860 and EIA 923 data from 2009-2019, and the subset of the (113!) FERC Form 1 tables that we’ve taken the time to clean up so far.

The system has already made it easier for us to collaborate and share the huge pile of data we’ve compiled over the last four years. We’re looking forward to using this system to get our data into the hands of more users.

You can browse whole tables or run custom SQL queries right in the browser.
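Datasette also exposes everything over a JSON API, so you can script against the live instance. Here is a minimal sketch; the database path and table name are assumptions for illustration, so browse https://data.catalyst.coop to confirm the actual schema.

```python
import requests

# The database path ("pudl") and table name below are assumptions;
# browse https://data.catalyst.coop to see the real schema.
url = "https://data.catalyst.coop/pudl.json"
params = {
    "sql": "SELECT plant_id_eia, plant_name_eia FROM plants_entity_eia LIMIT 5",
    "_shape": "array",  # return a plain JSON list of row objects
}
rows = requests.get(url, params=params, timeout=30).json()
for row in rows:
    print(row)
```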

Please give it a spin, and let us know what you think! This is still experimental, and the interface will probably evolve. If you find problems, feel free to create an issue on GitHub, or drop us a line at pudl@catalyst.coop. Also, we’re still hoping to get the EIA 861 and FERC 714 integrated by the end of the year. See our Data We Wrangle page for additional datasets of interest. And if you’ve got other favorite tools for publishing live, open data, let us know in the comments.


PUDL Infrastructure Roadmap for 2021

A couple of weeks ago I attended TWEEDS 2020 virtually (like everything this year) and talked about Catalyst’s ongoing Public Utility Data Liberation (PUDL) project, and especially the challenges of getting a big pile of data into the hands of different kinds of users, who rely on different tools for different purposes. The talk ended up sketching out a bit of a PUDL infrastructure roadmap for the next year, so we thought it would be a good idea to write it up here too.

We’ll have a separate post looking at our 2021 data roadmap.

The US Energy Information Asymmetry

PUDL is all about addressing a big information asymmetry in the regulatory and legislative processes that affect the US energy system. Utilities have much more information about their own systems than policymakers and advocates typically do. As a result, regulators often defer to the utilities on technical & analytical points. Commercial data exists, but it’s expensive. We want to get enough data into the hands of other kinds of stakeholders that they can make credible quantitative arguments to regulators, and challenge unfounded assertions put forward by utilities.

Federal Agencies and Their Favorite File Formats

Boiler Generator Associations from EIA 923 and 860

In working to calculate the marginal cost of electricity for all of the generating units across the country, we first had to calculate the heat rate (MMBtu per MWh) for each generating unit. The heat rate allows us to attribute fuel costs, reported at the plant level, to electric generation, reported at the generating unit level. It is derived from fuel consumption (MMBtu), reported at the boiler level, and electricity generation (MWh), reported at the generating unit level. To calculate it, one must link up all the boilers and generators in a given generating unit. Our work to this end uncovered a hole in the boiler generator associations reported in EIA 860, which we filled through a series of matching cartwheels and network analysis (a simplified sketch follows).
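
The network analysis step amounts to finding connected components in a graph of boiler-generator links. Here is a simplified sketch using networkx; the IDs are illustrative, and the real matching passes are considerably messier.

```python
import networkx as nx

# Edges are (boiler, generator) pairs gleaned from the EIA 860 associations
# plus our extra matching passes. The IDs here are illustrative.
edges = [
    ("b1", "g1"), ("b1", "g2"),  # one boiler feeding two generators
    ("b2", "g3"),                # a simple one-to-one pairing
]

G = nx.Graph()
# Prefix node names so a boiler and a generator with the same ID don't collide.
G.add_edges_from(("boiler_" + b, "gen_" + g) for b, g in edges)

# Each connected component is one generating unit: all the boilers and
# generators whose fuel use and output must be aggregated together before
# a meaningful heat rate can be computed.
for unit in nx.connected_components(G):
    print(sorted(unit))
```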

We’ve recently reconfigured our database ingest process to move the new and improved boiler generator associations into their own table in PUDL. You can also read through this process as a GitHub Gist.


Heat Rate Calculation for EIA Generators

Catalyst is pulling together an estimate of the marginal cost of electricity (MCOE) for every natural gas and coal fired power plant in the US whose data we can get our hands on. We’re using data from the EIA 923, EIA 860, and FERC Form 1 to do it. Getting the heat rate right for each generator is an important part of this calculation, but a lot of the required data is… not perfect. Here’s how we’re working through it.
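
At its core, the heat rate calculation is just fuel consumed divided by net generation, aggregated to the generating unit level. A toy sketch with made-up numbers (the column names are ours for illustration, not EIA’s):

```python
import pandas as pd

# Toy inputs: fuel consumption reported by boiler and net generation by
# generator, both already mapped to generating units via the boiler
# generator associations. Numbers and column names are made up.
fuel = pd.DataFrame({
    "unit_id": ["u1", "u1", "u2"],
    "fuel_mmbtu": [50_000.0, 30_000.0, 90_000.0],
})
gen = pd.DataFrame({"unit_id": ["u1", "u2"], "net_gen_mwh": [8_000.0, 9_500.0]})

heat_rate = (
    fuel.groupby("unit_id", as_index=False)["fuel_mmbtu"].sum()
    .merge(gen, on="unit_id")
    .assign(heat_rate_mmbtu_mwh=lambda df: df.fuel_mmbtu / df.net_gen_mwh)
)
print(heat_rate)
```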