Categories
updates

PUDL v0.5.0: 2020 and Beyond

It’s been almost a month since we pushed out our first actual quarterly software and data release: PUDL v0.5.0! The main impetus for this release was to get the final annual 2020 data integrated for the FERC and EIA datasets we process. We also pulled in the EIA 860 data for 2001-2003, which is only available as DBF files, rather than Excel spreadsheets. This means our coverage now goes back to 2001 for every dataset we process. Twenty years! We don’t yet cover 100% of the data contained in those datasets, but we’re getting closer.

Beyond simply updating the data, we’ve also been making some significant changes to how our ETL pipeline works under the hood. This includes how we store metadata, how we generate the database schema, and what outputs we’re generating. The release notes contain more details on the code changes, so here I want to talk a little bit more about why, and where we are hopefully headed.

If you just want to download the new data release and start working with it, it’s up on Zenodo. The same data for FERC 1 and EIA 860/923 can also be found in our Datasette instance at https://data.catalyst.coop.
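For readers who would rather query the Datasette instance from a script than browse it, here is a minimal sketch. It relies on Datasette's general JSON API (a `.json` endpoint per database that accepts a `sql` parameter, with `_shape=array` returning a plain list of rows). The database name `pudl` and the table name `plants_eia860` are assumptions for illustration only, so check the instance's front page for the real names before using this.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the post; everything below it is an assumption.
BASE_URL = "https://data.catalyst.coop"


def datasette_query_url(database: str, sql: str, base_url: str = BASE_URL) -> str:
    """Build a Datasette JSON API URL that runs a read-only SQL query."""
    # _shape=array asks Datasette for a bare JSON list of row objects.
    params = urlencode({"sql": sql, "_shape": "array"})
    return f"{base_url}/{database}.json?{params}"


def fetch_rows(url: str) -> list:
    """Fetch and decode the rows behind a query URL (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


# Hypothetical database and table names, not confirmed by the post.
example_url = datasette_query_url("pudl", "select * from plants_eia860 limit 5")
```

Calling `fetch_rows(example_url)` would return a list of dictionaries, one per row, provided the assumed database and table actually exist on the instance.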

Categories
linkstream

Modern Data Stack, Etc.

  • ETL vs. ELT — a comparison of two data pipeline architectures from the folks at Fivetran.
  • What is the Modern Data Stack? Another post from Fivetran, attempting to define the different components of data engineering pipelines as they have come together over the last few years.
  • A good interactive introduction to SQL from Mode. You can even use your own data as you work through it, if it’s in a database online. Broken down into beginner, intermediate, and advanced sections.
  • Hex is another platform that seems similar to Mode, for collaborating on data analyses using notebooks and a combination of straight SQL and Python. Again, you load your own data directly via an online DB connection. I admit that after seeing it mentioned for months I only clicked through after realizing it was named after the magic / science hybrid technology depicted in Arcane.
  • Cookiecutter Data Science is a cookie-cutter repo and a set of guidelines for standardizing data science projects to be more easily replicated and parsed by other people.
  • Thou Shalt Scale Sustainably: some thoughts (commandments…) on how to scale social enterprises (especially when dependent on foundation funding) from the Shuttleworth Foundation. Not related to the so-called Modern Data Stack.

Categories
updates

PUDL Infrastructure Roadmap for 2021

A couple of weeks ago I attended TWEEDS 2020 virtually (like everything this year) and talked about Catalyst’s ongoing Public Utility Data Liberation (PUDL) project, and especially the challenges of getting a big pile of data into the hands of different kinds of users, using different tools for different purposes. The talk ended up sketching out a PUDL infrastructure roadmap for the next year, so we thought it would be a good idea to write it up here too.

We’ll have a separate post looking at our 2021 data roadmap.

The US Energy Information Asymmetry

PUDL is all about addressing a big information asymmetry in the regulatory and legislative processes that affect the US energy system. Utilities have much more information about their own systems than policymakers and advocates typically do. As a result, regulators often defer to the utilities on technical and analytical points. Commercial data exists, but it’s expensive. We want to get enough data into the hands of other kinds of stakeholders that they can make credible quantitative arguments to regulators, and challenge unfounded assertions put forward by utilities.

Federal Agencies and Their Favorite File Formats