Categories
updates

Publishing PUDL with Datasette

Users have been asking for live access to our data forever, either via a PUDL API or a web interface, but we didn’t feel like we had the resources to maintain that kind of service and ensure it was reliable. Then a few weeks ago we came across an awesome open source project called Datasette that takes SQLite databases, wraps them in a Docker container, and lets users explore the data with their web browser.

It’s perfect for publishing read-only, infrequently updated data. That’s exactly what we’re doing with PUDL, and we’re already storing the data in SQLite, so it only took an afternoon to get the development version of our databases published. This goes a long way toward satisfying some of our data access goals for less technical users, which we touched on a few weeks ago in this post.

Our Datasette instance can be found at https://data.catalyst.coop and it contains both the raw FERC Form 1 DB, with all of the Form 1 data from 1994-2019, and our PUDL DB, which includes the EIA 860 and EIA 923 data from 2009-2019, and the subset of the (113!) FERC Form 1 tables that we’ve taken the time to clean up so far.

The system has already made it easier for us to collaborate and share the huge pile of data we’ve compiled over the last four years. We’re looking forward to using this system to get our data into the hands of more users.

Just a few examples of custom SQL queries or whole tables:

Please give it a spin, and let us know what you think! This is still experimental, and the interface will probably evolve. If you find problems, feel free to create an issue on GitHub, or drop us a line at pudl@catalyst.coop. Also, we’re still hoping to get the EIA 861 and FERC 714 integrated by the end of the year. See our Data We Wrangle page for additional datasets of interest. And if you’ve got other favorite tools for publishing live, open data, let us know in the comments.

More on Datasette

On top of interactive browsing and custom queries Datasette provides an API that can return either JSON or CSV formatted responses. These can be used directly with e.g. pandas.read_csv(). A user can also download a CSV for local analysis using spreadsheets. It has a modular plugin system that allows new functionality to be added over time. There are already plugins for doing GraphQL queries, interactive plotting with Vega-Lite, interactive maps, and 3rd party authentication, as well as deeper integration with the SpatiaLite and full-text Search extensions to SQLite.

One plugin we’re keen to help develop would display per-column descriptions or other metadata that we’ve been maintaining internally, but haven’t yet published in a super readable form.

For more background, check out the talk below by the main Datasette developer Simon Willison, from PyCon in 2019:

And follow @simonw on Twitter!

By Zane Selvans

A former space explorer, now trapped on Earth. Just trying to make sure we don't blow up our one and only spaceship.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.