Author: dazhongxia

You Don’t Have to Install PUDL Anymore

Post author By dazhongxia
Post date 2024-02-07
No Comments on You Don’t Have to Install PUDL Anymore

We’re excited to announce that you no longer have to install the PUDL Python library to access electric generation data linked across FERC and EIA such as capacity factor, heat rate, and fuel cost. These, and many others, are now available directly in the PUDL database, which you can download from Zenodo here. You can find more details on how to access the data here.

We were able to complete this large infrastructural overhaul with the help of generous funding from the Sloan foundation.

Now that you can use any tools you want to analyze the data, here are some ideas:

Use the same type of Python code you have been using, but freed from our tangled web of dependencies!
Use another language you like better: R, Rust, Ruby, or even other languages that don’t start with R (Julia?)
Use Kaggle to check out our data without installing any programming environments at all!
Hook up a BI tool to quickly generate low/no-code dashboards and visualizations!

Since we’re moving away from downstream use of the library, we are also deprecating the PudlTabl class. It will still work, for now, but it’s now just a shell around accessing the database tables and will be removed in a future release.

One further change we made during all of this was to rename a bunch of tables to make them a little easier to find and understand. Tables now have standardized prefixes, the nuances of which are explained in the docs. The short version is:

When in doubt, start with tables with the out_* prefix. These have been cleaned and connected into wide tables with lots of metadata and are designed to be easy to use for downstream analysis.
When you need to dig deeper, look at the core_* tables. These are the cleaned up building blocks of the out_* tables. You may need to join several core_* tables to get the metadata you want.
The tables starting with an underscore are intermediate assets. They’re not stable, so please don’t rely on the data in them.

We hope these changes make it easier for a wider variety of users to use our data! Now that we’ve wrapped up this infrastructural work, we’ll shift our focus back to integrating new datasets like PHMSA and EIA 176.

If you want help getting started with our data, or have any datasets you’d like us to integrate, we’d love to talk: drop by our office hours and we’ll walk you through any questions you might have.

Tags database, open data, open source, pudl, release, SQLite

updates

OpenMod USA Takeaways

We had a great time attending the OpenMod USA conference at Stanford last month. Thanks to Open Energy Transition for organizing, and for inviting us to moderate a panel on open data! Thanks also to Greg Miller, Greg Schivley, Ted Nace, and our very own Christina Gosnell for speaking on our panel.

We got to meet a whole bunch of smart, friendly folks who are working on using their energy system modeling skills to facilitate the global energy transition. We learned a lot about how we can better support their work, including these high level takeaways:

We’re still missing useful datasets! There wasn’t a strong front-runner for most-requested dataset, but we clearly heard a need for transmission, gas, and hourly demand, among others.
Our users are interested in making their own technical systems more robust and easier to work with.

It’ll be a continuous process of improvement, of course, but we’ve started working on some projects as a result!

We do have to pick and choose which datasets to integrate first. Right now we’re focusing on natural gas data, integrating EIA 176 with the help of davidmudrauskas, and our own e-belfer is extracting transmission and distribution data from PHMSA.

One way to integrate more data more quickly is to mobilize our community to help integrate new data sources! That means we need to make contributing to PUDL much easier.

The first, most important phase of integrating a new dataset is the exploratory one. You can spend countless hours learning the specific quirks and pain points of the data. Because many of our users are already familiar with these datasets, we encourage “knowledge contributions” in the form of plain-language documentation or useful scripts that handle part of the data wrangling process. We’ve updated our contributing docs to highlight those cases, and have made a new repository to hold the teeming masses of dataset-specific knowledge.

We are also improving our Kaggle environment so that anyone can use PUDL without setting up a whole Python environment. This will make it easier for users to explore PUDL data, especially data that we have archived and/or extracted but not completely cleaned, validated, or connected.

Apart from the dataset integrations and contribution improvements, we’re following up with folks from the conference to see how we can help them with software architecture, engineering, and infrastructure guidance – we’re looking forward to growing those relationships. If you are curious about how we can help you in this area, don’t hesitate to reach out at hello@catalyst.coop!

In closing, OpenMod was a great experience! We’re excited to build a community that can do amazing things with complete, connected, granular, and accessible energy data. We’re pursuing a bit of funding to support our community efforts, so keep your fingers crossed for us and stay tuned for more updates next year!

Tags conference, modeling, open data, open source, openmod, stanford, usa

updates

Summer 2023 Goals

We’ve been working on our goal-setting process at Catalyst, and want to share our high-level goals for the summer – these take us through September 2023.

Publish all data products as SQL tables

In the past, we’ve published data products in two ways: a large portion of our data was published in SQLite/Parquet files; the rest, including many of our analysis outputs, were calculated directly in the PudlTabl Python class. You could interact with the SQLite and Parquet data any way you wanted. However, to access the latter, you’d need to install the latest version of PUDL and all its dependencies. Maintaining that environment and managing the dependencies was an unnecessary barrier to data analysis.

You may have noticed, from our nightly builds, that more and more of the outputs from PudlTabl are stored directly in pudl.sqlite. We’ve been working on this transition for a few months, since the Dagster migration, and finally have just a few data products remaining: the MCOE outputs (heat_rate_by_unit, heat_rate_by-generator, fuel_cost_by_generator, capacity_factor_by_generator, and mcoe) and the plant parts list (mega_generators, plant_parts_eia). Soon, you’ll be able to access all of our data without installing the PUDL Python package!

This also means PudlTabl will soon be deprecated, and the preferred way to access our data will be through conventional SQL and Parquet tooling such as Datasette, SQLAlchemy, or RSQLite.

Integrate new datasets into PUDL

We also plan to integrate some shiny new datasets, starting with PHMSA data. This contains operational data about methane gas gathering, transmission, and distribution in the US. After a stretch of infrastructure investment, we’re excited to focus on the “integrate new datasets” part of our partnership with Sloan! We’re doubly excited to expand into the methane gas aspect of US energy system data.

Integrate 2022 data for existing datasets

We’re working with RMI to integrate the 2022 data from our existing datasets, such as FERC forms 1/2/6/60/714 and EIA forms 860/860m/861/923. Each year, new data brings new challenges, but this quarter we plan to build automation tooling to help us detect issues as they arise and reduce the manual work required each year. This will be especially important as the annual data reconciliation requirements will increase when we integrate new datasets. This year, we’re especially interested to see how the FERC XBRL data has changed since its debut in 2021.

Support RMI’s financial modeling efforts

We are also pleased to provide development and architectural support for RMI’s Optimus financial modeling tool. Optimus can show utilities how IRA incentives make cleaner portfolios better long-term investments, aid commercial partners in quantifying the distributional impact of their electrification plans, and support advocates by showing how ratemaking can evolve to minimize the burden of the transition on LMI customers. We’re helping RMI revamp the engineering side of their system to support faster, more confident development of the model.

Apply automated entity matching techniques

We’ve been working with CCAI on entity-matching problems in the energy data space. So far, we’ve been experimenting with using Splink to match EIA and FERC plant IDs. This summer, we’re hoping to bring that process into PUDL and generalize it to other problems such as inter-year FERC to FERC plant ID matching.

Meet new people and organizations!

Of course, we’re also looking to connect with exciting new people! We’re looking for new contributors, grant funders that are interested in PUDL development and maintenance, and organizations that could benefit from our blend of energy policy domain knowledge and data engineering/data science expertise. If that sparks any connections in your mind, please drop us a line at hello@catalyst.coop.

Tags dagster, data, eia, FERC, goals, open data, optimus, phmsa, pudl, rmi, SQLite