PUDL Season of Docs Project Proposal

Reorganize PUDL Documentation For Clarity

About PUDL

A rapid and equitable energy transition needs a diversity of participants who are empowered to intervene with high-quality, accessible energy data. In 2016, Catalyst Cooperative created the Public Utility Data Liberation project (PUDL) to bridge the gap between available and accessible U.S. energy data – publishing free and open-source energy data and collaborating with non-profits, scholars, and local advocates working to decarbonize the United States. Catalyst has developed PUDL into a reliable data and software foundation for multiple open-source energy and environmental projects with tangible policy and regulatory impact. As a worker-owned, mission-driven tech cooperative, we’re proud to support data-driven policy-making, enable rigorous and reproducible energy research, and reduce inequities in high-quality energy data access. PUDL software is currently released under the MIT License, and PUDL data and documentation are published under the Creative Commons Attribution License v4.0.

PUDL users span a wide range of both technical and energy system domain expertise. Many current PUDL users are energy domain experts who work at non-profit and advocacy organizations focused on energy and environmental policy issues, such as RMI and GridLab. Other common PUDL users are graduate students, early-career researchers, and research analysts for non-profits and small businesses. Catalyst solicits feedback from institutional partners (e.g., Princeton’s ZEROLab, Singularity Energy) to identify key user needs and governance concerns. PUDL contributors range from highly skilled data and software engineers looking to leverage their skillset in support of a clean energy transition to experts from the energy community seeking to house their data cleaning efforts in a reproducible and maintained software infrastructure with extensive integration and validation testing.

The problem

We are at a pivotal moment for decarbonization policy research and energy infrastructure investment, and it is crucial for PUDL to onboard additional end users and contributors. Unfortunately, it is difficult for new users and contributors to jump in and start using PUDL because documentation is spread across multiple repositories and websites. Information may be duplicated between the main PUDL and PUDL Archiver GitHub repositories, PUDL’s ReadTheDocs, and the Catalyst website. It is also difficult, even for long-time users, to understand what data is available in PUDL. New users often attend our Office Hours sessions, or give up on using PUDL entirely, because the documentation is disorganized and it’s not clear which tables are most useful for certain analyses. We have a data dictionary page in our docs, a Datasette deployment for exploring the data, and a set of example notebooks hosted on Kaggle, but none does a particularly good job of shepherding users to the data they want. PUDL needs a better, perhaps more nested, system of table and column documentation so that users aren’t overwhelmed by the data and can find the correct tables.

Project Scope

Work that is in scope for this project:

  • Audit existing documentation and walk through the existing process for answering three of our most common user questions: how to download a table with operational and financial characteristics of plant generators, how best to access the latest release of the entire PUDL database, and how to set up the development environment and create a pull request to contribute to PUDL.
  • Make a plan for what information should be conveyed in each location (PUDL GitHub repository, PUDL ReadTheDocs, Catalyst Cooperative website), and identify current duplication of information.
  • Work with the Catalyst team to reorganize documentation according to the plan from the audit and eliminate duplicative information, pointing to other information sources where necessary (e.g., add a link to the PUDL ReadTheDocs on the Catalyst Cooperative website as appropriate).
  • Create a landing page identifying the ten most useful tables in the PUDL database, and write descriptions that point new users to these tables based on the information they’re looking for (e.g., for information on utility finances, head to this page).
  • Work with Catalyst to develop guidelines for where to place new documentation to keep documentation organized and clear moving forward.

Work that is out of scope for this project:

  • This project will not attempt to rewrite or clarify the descriptions of the transformation process that each table goes through from raw data to clean data.

We estimate that this work will take one technical writer six months to complete. We have heard from five technical writer candidates who are interested in this project and have not yet finished reviewing the materials they’ve sent over. Ella Belfer and Austen Sharpe from Catalyst Cooperative have volunteered to onboard the hired writer and support the project.

Measuring Project Success

We would like the reorganization of PUDL documentation to result in an increase in the number of users and make it easier for contributors to make pull requests. About half of our existing PUDL Office Hours visits are from new users who are looking for the data contained in one of the ten most useful tables. We think that creating a landing page for the ten most useful tables will decrease the need to attend PUDL Office Hours with questions about how to find these tables and increase the number of PUDL users and downloads of these tables.

In addition to tracking the number of PUDL Office Hours visits and the types of questions asked, we will track the total number of PUDL users and the number of pull requests made by community contributors. We will begin tracking these metrics after the documentation is published.

We would consider the project successful if, after reorganizing and publishing the new documentation:

  • Catalyst has a document with clear guidelines for what documentation should live in the PUDL ReadTheDocs, GitHub, and website.
  • No more than 25% of our PUDL Office Hours visits and GitHub discussion questions are from new users who cannot find one of the ten most useful tables.
  • The total number of PUDL users increases by 20%.
  • The number of pull requests from community contributors increases by 10%.

Timeline

The project will take approximately six months to complete and the timeline breaks down in the following way:

May: Orient the writer to PUDL documentation across the various locations and point out data access methods and useful tables to explore.

June-July: Audit the existing documentation and document pain points when trying to answer the three questions: accessing a single table, accessing the latest version of the entire PUDL database, and setting up the development environment to make a pull request. Create a document describing what documentation should be in each location.

August-October: Reorganize the documentation and create a landing page for the ten most useful tables.

November: Project completion

Project Budget

Budget Item | Amount | Running Total | Notes
Technical writer audits, reorganizes, and tests PUDL documentation, creating a landing page highlighting the ten most useful tables and setting guidelines for what location documentation belongs in. | 7500 | 7500 |
Volunteer Stipends | 500 | 8500 | 2 volunteer stipends (Ella Belfer & Austen Sharpe) @ 500 each
Total | | 8500 |