linkstream weeknotes

SQL for data analysis, DGP, and pair programming

Some good technical long reads from the last couple of weeks:

(Postgre)SQL for Data Analysis

Before the Tidyverse and Pandas, there was SQL. There’s still SQL, and as Vicki Boykis often points out: every data-centric framework that hangs around long enough tends toward SQL. It’s got almost half a century of careful thinking and optimization behind it. It seems entirely possible that it’ll still be around after another half century.

In this extensive post Haki Benita explores a bunch of data analysis that can be done directly with PostgreSQL in particular. It can be used either as an efficient preprocessing step before handing off to other tools, or to generate final products. It covers basic data selection, random selection, sampling, splitting data into training & testing sets, descriptive statistics, aggregations, regressions, interpolation, binning and much more. It’s almost more of a pocket guide to data analysis in SQL than a blog post.

Data (Error) Generation Processes

In this post Emily Riederer explores how conceptualizing data (and error!) generation processes can help you do better data validation. What does the data represent in the real world? How is it being collected? How does it move from where it’s collected to where it’s processed? What kinds of transformations operate on it before you look at the outputs? Understanding these steps and their contexts makes it easier to imagine how things can go wrong along the way and what errors to check for. It also makes it easier to debug errors when you find them.

On Pair Programming

A guide to pair programming from Birgitta Böckeler and Nina Siessegger. They look at both how and why to do it, and some of the challenges that it brings up. I had no idea that this has been a practice going back as far as the women who programmed ENIAC.

The authors explore several different styles of pair programming and the logistical planning required to make it work. They touch on the extra challenges of doing remote pairing which seems extra relevant these days. They cover productive and destructive social dynamics that come up, and a whole lot more. The article is long, but it’s definitely worth a read if you’ve thought about trying pair programming and been reluctant, or have tried it and been dissatisfied.


Weeknotes for 2021-04-02

What We’re Doing

PUDL Data Release 0.4.0 Beta is up on the Zenodo Sandbox server for anyone to play with. Feedback on the instructions and overall setup much appreciated.

What We’re Reading

On creating a version-controlled SQL Query library, from Caitlin Hudon. Queries are little hand-crafted jewels that encode knowledge about the contents and structure of data. They should be saved and shared and improved collaboratively over time! As we move our PUDL data access toward using SQL directly, it seems like we’ll want to do this too, with some of the most common queries baked into the database as views directly. I bet it would be easy to set up automated testing of the queries too, whenever a new version of the DB comes online, to see which old queries are broken, and whether they yield the expected results.

Building Data Dictionaries, also from Caitlin Hudon, talking about the importance of continuous, incremental documentation of institutional knowledge related to your data — where it came from, what it should look like, how it can (and can’t!) be used. We finally made some progress in this direction by starting to publish our table and column level metadata directly in our documentation.

CarbonPlan has published a new review of carbon removal projects, based on responses to Microsoft’s recent RFP. The general takeaway: most projects are low volume, impermanent, and didn’t provide enough detail to be evaluated critically. The one and only 5-star project they reviewed, Climeworks in Iceland, which does permanent mineralized geologic sequestration (injecting CO2 into fresh volcanic basalt) powered by geothermal energy, currently costs $1,100/ton of CO2, or $17,000/yr to sequester the emissions of a typical US person.

The censusviz package provides a straightforward Python interface to US census map and population data, which integrates directly with GeoPandas. Looks like it’s pulling data on the fly from the Census API.

A nice PyData talk from Julia Signell that stitches together data from Intake catalogs and a variety of interactive data visualization & dashboarding tools. Code from the talk can be found on GitHub.

Ghost looks like an interesting open source business, it’s organized as a non-profit foundation that has to reinvest all of its profits in itself, and can’t be bought out. They have developers all over the world, and produce a permissively licensed open source content management system.

Internal investigation calls carbon offsets sold by the Nature Conservancy into question. Emitters want lots of credits, and they want them cheap. Is it going to be possible to come up with offset program standards that are strict enough to be meaningful in climate policy?


Weeknotes for 2021-03-26

What We’re Doing

Our proposal to the Sloan Foundation was funded! This means we’ve got 2 years of support to work on integrating a bunch of new data, automation of our ETL pipeline, and improving access to the results, some of which we outlined in our 2021 infrastructure roadmap. In combination with the client work we’ve been engaged in, this means we’ll need to recruit a new member to the co-op. We’re trying to figure out what skills and experience we are actually short on, relative to what we need to get done.

We continue to update our documentation, as we prepare for a new software and data release. A lot of it is related to the development environment, contributing, and how to run the ETL as it is now, while the user-facing side of the docs will hopefully be much simpler, directing folks to use preprocessed data with a Docker image that’s available on Zenodo, or our JupyterHub, or Datasette. We’ve set a deadline for the release of April 19th, which is also our next Advisory Board meeting.

We decided to simplify our branch management on GitHub. Instead of having separate ephemeral sprint branches, we’re just branching off of and making PRs against dev, and merging back into main after each 2 week long sprint, which seems to be far more common among open source projects, and is a lot less to try and keep track of. It should also mean we don’t have to re-direct lingering PRs to different base branches.

Data Wrangling Galore: Austen has been cleaning up the FERC Form 1 fuel table to capture previously discarded records that need information from other rows to make sense. Christina continues to work on estimating plant operating costs based on FERC+EIA linkages we’ve developed for RMI, and still need to merge into the main PUDL repo. Zane is trying to make our unit and prime-mover level heat rate estimates more robust and complete.


Weeknotes 2021-03-19

What We’re Doing

The Census DP1 GeoDatabase has been integrated into PUDL as a standalone SQLite DB for use with the EIA 861 to compile historical utility and balancing authority service territories, and with FERC 714 data in estimating state level historical hourly electricity demand. Previously it was had an ad-hoc non-standard ETL process.

Our documentation now has an index all of the PUDL DB tables, including the names of the columns, their data types, and descriptions of the contents, thanks to some work with Jinja templates by Austen. This is just one small part of a bigger docs overhaul as we try and get PUDL 0.4.0 out by the end of March.

PUDL is finally compatible with Python 3.9, using both pip and conda. The last dependency to make the transition was Numba, which as of v0.53.0 works with Python 3.9. Our CI is now running tests on both Python 3.8 and 3.9.


Weeknotes for 2021-03-12

What We’re Doing

We created a new EIA 861 archive on Zenodo. It includes data from 1990-2019, to support work Karl Dunkle Werner is doing (and he’s the one that updated our scrapers, and is working on integrating the older years). We noticed that there were changes to the 2012 and 2019 EIA 861 data too. But who knows what those changes entail! There aren’t even any formulae in these spreadsheets. Imagine if they published them as CSVs directly to GitHub, and we could see the diffs. Or use tools like daff to understand how the data is changing over time?

Christina and Zane gave a keynote presentation to the 2021 Research Data Access & Preservation (RDAP) Association Summit entitled Distributing Power with Open Data. You can see our slides here.

Zane nominated himself to speak at FERC’s April 16th workshop about the scope and mission of the new Office of Public Participation, focusing on issues of public data accessibility.

Zane finished overhauling our Tox / pytest setup to make it more intuitive and flexible (in the process of documenting the development setup) All the tests that we typically run can be run without needing any command line arguments, and the tests have been split into into distinct unit tests, software integration tests, and data validations. The software dependencies have also been simplified, with only those packages requiring or benefiting from precompiled binaries being specified in both and environment.yml specifications. This was inspired by our documentation update for the v0.4.0 release since it turns out documenting an overly complicated thing is kind of a lot of work, and that work is probably better put toward simplifying the thing instead.

What We’re Reading

  • How to build a community: starting with why? The beginning of what will hopefully be a series of posts from Claire Carroll, the community manager for dbt, on how to cultivate real, supportive online communities of practice around a given project or technology. This is something we’d really like Catalyst to learn how to do in the context of PUDL, and open energy data and modeling more generally.
  • $1.3M in grants were just awarded to open source projects, with support from the Ford Foundation, Sloan Foundation, Open Society Foundations, and others, facilitated by the Open Collective Foundation. They’re trying to develop sustainable funding / support system for open source infrastructure. More information on the program over on the Open Collective and also at the Ford Foundation. Some of their research findings so far.
  • RS21 looks like an interesting data consultancy, doing a lot of work with the public sector and NGOs.
  • VSCode Atom is Dead! Long live Atom! With Microsoft’s purchase of GitHub, development on the Atom editor has waned, and it’s started to feel a bit like abandonware. And given that Microsoft also maintains VS Code, it seems like it’s only a matter of time before we have no choice but to switch… to something. So we started playing around with VS Code, and it’s great!
  • A curated list of open technology projects to sustain a stable climate, energy supply, and vital natural resources. Gotta get PUDL in their dataset section!
  • Cloud native repositories for big scientific data, a paper by Ryan Abernathey et al., talking about the benefits of “Analysis Ready Cloud Optimized” (ARCO) datasets and developments in the field. Lots of parallels with where we’re going with PUDL, but with several orders of magnitude difference in the scale of the data.
  • Eliminating Toil: a definition of Toil from Google, and some musings on why less of it is better. Definitely resonated with the word we’re doing in PUDL.
  • Another look at utility political spending from the Energy & Policy Institute. This time they’re looking FERC Account 426.4, again based on data we’ve liberated from the FERC Form 1, in combination with their own mapping between utilities and their parent holding companies. Here’s the original query from our data.


What we’re reading for the week of March 1st, 2021

A roundup of interesting posts related to data, code, energy, or climate that we came across in the first week of March, 2021.

Energy & Climate

  • Xcel Energy’s Comanche 3 coal plant in Colorado continues to be an expensive boodoggle. It’s spent 2 of its first 10 years of operation shut down for maintenance, forcing Xcel to buy electricity on the open market to fill the gap. But actually this was good for their customers, since the electricity the plant produces when it is running is so expensive ($66.25/MWh) that this actually saved customers money! Unfortunately they’ll still be on the hook for remaining capital costs far into the future. Xcel thinks they’ll use the plant as a seasonal or load following resource after 2030… at which point Xcel will still have $460M left in the plant.
  • Market Design for the Clean Energy Transition: Advancing Long-Term Approaches. Materials from a workshop put on by WRI and Resources for the Future, exploring how electricity markets need to adapt to accommodate lots of zero carbon, very low marginal cost generation, that’s also not entirely dispatchable.
  • A post from the Energy and Policy Institute looking at political “miscellaneous” spending by utilities, as reported in the FERC Form 1 — using our data!
  • Securitization in Action: How US States are Shaping an Equitable Coal Transition: a post from some of our collaborators at RMI, looking at some of the work our data liberation has helped enable — namely getting uneconomic fossil generation offline as cheaply as possible. Well, as cheaply as possible without forcing the utilities to absorb the costs anyway.
  • CMIP6: the next generation of climate models explained. A look at how climate scientists compare their models in a standardized way, so that they can understand why they get different answers sometimes. This is something we really need more of in the energy modeling space — otherwise every conversation eventually devolves into criticizing the inputs and assumptions.

Data & Code

  • Command Line Interface Guidelines: a collection of best practices for designing modern command line tools that are relatively user friendly, and take advantage of many features of modern Unix terminals.
  • Column Names as Contracts: an interesting post by Emily Reiderer about the potential benefits of storing metadata in column names using a controlled vocabulary, allowing them to be programmatically parsed.
  • Embedding column-name contracts in data pipelines with dbt builds on that last post, and looks at how Jinja templates and tools like dbt let you do more interesting dynamic data work if your columns have consistent and controlled names.
  • What is dbt anyway? It stands for “data build tool” and it can be used to specify, store, and version control complex data transformation instructions as text files. A lot of the data we’re working with from FERC and EIA are too messy for this to be helpful in our initial ETL process, but once we’ve got the databases being populated in the cloud automatically, this could be a good way to create new derived data products. Thanks to our friend Brittany Bennett at Sunrise Movement for telling us about dbt.
  • I helped build ByteDance’s censorship machine. A story about what it’s like to work inside a tech company actively implementing censorship measures. ByteDance is the Chinese owner of TikTok.
  • Documentation for pydantic. We’re trying to make all of our metadata programmatically accessible, and remove duplication wherever possible, and using pydantic to parse and validate the metadata we compile by hand so we know it’s at least structurally sound.
  • Python Packages is an online book about how to package and distribute… Python packages. We wish we’d had this a couple of years ago when we were figuring it out for the first time! Focuses on modern rather than legacy frameworks, going straight for pyproject.toml, poetry, and CI/CD using GitHub actions. There’s also a cookie cutter repo on GitHub that templates many of the practices from the book. Via Tiffany Timbers.
  • EPA has released a crosswalk table that connects their CEMS data to the EIA boilers and generators. Thank goodness we won’t have to compile it now. More info in their GitHub repo.
  • Nice preprint from Ryan Abernathey et al. on cloud-native scientific data repositories. This is very much in line with our plans for the PUDL data — even though our data is several orders of magnitude smaller than a lot of what he’s talking about.
  • Eliminating Toil is a short essay from some Googlers on the nature of a particular kind of work that shows up in many data wrangling (and software) contexts. A lot of our mission here is saving others from data toil.
  • Great Expectations and Pandas Profiling: a blog post on how to use these two tools together to automatically draft data validation test cases. Vaguely along the same lines as Pandera, though that library has more of a statistical bent.