Tag: SQL

PUDL v0.5.0: 2020 and Beyond

Post author By Zane Selvans
Post date 2021-12-10
No Comments on PUDL v0.5.0: 2020 and Beyond

It’s been almost a month since we pushed out our first actual quarterly software and data release: PUDL v0.5.0! The main impetus for this release was to get the final annual 2020 data integrated for the FERC and EIA datasets we process. We also pulled in the EIA 860 data for 2001-2003, which is only available as DBF files, rather than Excel spreadsheets. This means we’ve got coverage going back to 2001 for all of our data now! Twenty years! We don’t have 100% coverage of all of the data contained in those datasets yet, but we’re getting closer.

Beyond simply updating the data, we’ve also been making some significant changes to how our ETL pipeline works under the hood. This includes how we store metadata, how we generate the database schema, and what outputs we’re generating. The release notes contain more details on the code changes, so here I want to talk a little bit more about why, and where we are hopefully headed.

If you just want to download the new data release and start working with it, it’s up here on Zenodo. The same data for FERC 1 and EIA 860/923 can also be found in our Datasette instance at https://data.catalyst.coop

Tags airbyte, apache superset, data, dbt, eia860, eia923, elt, ETL, ferc1, fivetran, pudl, pydantic, rdbms, release, software, SQL, SQLite

weeknotes

Weeknotes for 2021-04-02

What We’re Doing

PUDL Data Release 0.4.0 Beta is up on the Zenodo Sandbox server for anyone to play with. Feedback on the instructions and overall setup much appreciated.

What We’re Reading

On creating a version-controlled SQL Query library, from Caitlin Hudon. Queries are little hand-crafted jewels that encode knowledge about the contents and structure of data. They should be saved and shared and improved collaboratively over time! As we move our PUDL data access toward using SQL directly, it seems like we’ll want to do this too, with some of the most common queries baked into the database as views directly. I bet it would be easy to set up automated testing of the queries too, whenever a new version of the DB comes online, to see which old queries are broken, and whether they yield the expected results.

Building Data Dictionaries, also from Caitlin Hudon, talking about the importance of continuous, incremental documentation of institutional knowledge related to your data — where it came from, what it should look like, how it can (and can’t!) be used. We finally made some progress in this direction by starting to publish our table and column level metadata directly in our documentation.

CarbonPlan has published a new review of carbon removal projects, based on responses to Microsoft’s recent RFP. The general takeaway: most projects are low volume, impermanent, and didn’t provide enough detail to be evaluated critically. The one and only 5-star project they reviewed, Climeworks in Iceland, which does permanent mineralized geologic sequestration (injecting CO2 into fresh volcanic basalt) powered by geothermal energy, currently costs $1,100/ton of CO2, or $17,000/yr to sequester the emissions of a typical US person.

The censusviz package provides a straightforward Python interface to US census map and population data, which integrates directly with GeoPandas. Looks like it’s pulling data on the fly from the Census API.

A nice PyData talk from Julia Signell that stitches together data from Intake catalogs and a variety of interactive data visualization & dashboarding tools. Code from the talk can be found on GitHub.

Ghost looks like an interesting open source business, it’s organized as a non-profit foundation that has to reinvest all of its profits in itself, and can’t be bought out. They have developers all over the world, and produce a permissively licensed open source content management system.

Internal investigation calls carbon offsets sold by the Nature Conservancy into question. Emitters want lots of credits, and they want them cheap. Is it going to be possible to come up with offset program standards that are strict enough to be meaningful in climate policy?

Tags carbon offsets, carbon sequestration, data dictionary, PyData, SQL