We’ve been working on our goal-setting process at Catalyst, and want to share our high-level goals for the summer – these take us through September 2023.
Publish all data products as SQL tables
In the past, we’ve published data products in two ways: a large portion of our data was published in SQLite/Parquet files; the rest, including many of our analysis outputs, were calculated directly in the
PudlTabl Python class. You could interact with the SQLite and Parquet data any way you wanted. However, to access the latter, you’d need to install the latest version of PUDL and all its dependencies. Maintaining that environment and managing the dependencies was an unnecessary barrier to data analysis.
You may have noticed, from our nightly builds, that more and more of the outputs from
PudlTabl are stored directly in
pudl.sqlite. We’ve been working on this transition for a few months, since the Dagster migration, and finally have just a few data products remaining: the MCOE outputs (
mcoe) and the plant parts list (
plant_parts_eia). Soon, you’ll be able to access all of our data without installing the PUDL Python package!
This also means
PudlTabl will soon be deprecated, and the preferred way to access our data will be through conventional SQL and Parquet tooling such as Datasette, SQLAlchemy, or RSQLite.
Integrate new datasets into PUDL
We also plan to integrate some shiny new datasets, starting with PHMSA data. This contains operational data about methane gas gathering, transmission, and distribution in the US. After a stretch of infrastructure investment, we’re excited to focus on the “integrate new datasets” part of our partnership with Sloan! We’re doubly excited to expand into the methane gas aspect of US energy system data.
Integrate 2022 data for existing datasets
We’re working with RMI to integrate the 2022 data from our existing datasets, such as FERC forms 1/2/6/60/714 and EIA forms 860/860m/861/923. Each year, new data brings new challenges, but this quarter we plan to build automation tooling to help us detect issues as they arise and reduce the manual work required each year. This will be especially important as the annual data reconciliation requirements will increase when we integrate new datasets. This year, we’re especially interested to see how the FERC XBRL data has changed since its debut in 2021.
Support RMI’s financial modeling efforts
We are also pleased to provide development and architectural support for RMI’s Optimus financial modeling tool. Optimus can show utilities how IRA incentives make cleaner portfolios better long-term investments, aid commercial partners in quantifying the distributional impact of their electrification plans, and support advocates by showing how ratemaking can evolve to minimize the burden of the transition on LMI customers. We’re helping RMI revamp the engineering side of their system to support faster, more confident development of the model.
Apply automated entity matching techniques
We’ve been working with CCAI on entity-matching problems in the energy data space. So far, we’ve been experimenting with using Splink to match EIA and FERC plant IDs. This summer, we’re hoping to bring that process into PUDL and generalize it to other problems such as inter-year FERC to FERC plant ID matching.
Meet new people and organizations!
Of course, we’re also looking to connect with exciting new people! We’re looking for new contributors, grant funders that are interested in PUDL development and maintenance, and organizations that could benefit from our blend of energy policy domain knowledge and data engineering/data science expertise. If that sparks any connections in your mind, please drop us a line at email@example.com.