Categories
updates

Keeping PUDL on the Pulse: Insights from our Annual Ecosystem Survey

Last July we conducted more than 60 interviews with energy data users to kick-off our work for the  NSF POSE grant! While that process was very informative, it was also a huge amount of work, and we also want to have more sustainable ways to understand the needs of energy data users (PUDL or otherwise).

So this winter, we put out our first annual Energy Data Ecosystem Survey—a short, mostly multiple-choice questionnaire built on insights from our conversations. 47 of you responded, and we learned a lot!

Before we dig into the results, we’ve made a few other changes in the wake of talking to you all:

One of our core takeaways from the interviews was how organizations struggle with technical skill building and capacity. This is something Catalyst can address with our consulting services and educational content creation. Thanks to funding from the Sloan Foundation, we’ve developed OpenEnergyData4All, a training series to help future and current energy data users gain more confidence in their technical skills. We’re excited to continue creating content like this to help people engage more readily with the data!

Another important theme was users’ desire for data that is easy to explore. A large portion of technical and non technical users relied heavily on our PUDL browser viewing tool, Datasette. While Datasette was simple for us to set up and enabled users to explore PUDL data on the web and download CSV’s, it isn’t really designed to handle the scale of data that we’re publishing now. We developed a Beta PUDL Viewer for technical and non technical users to interact with and search through the data with greater ease. We’re continuing to improve the tool, so be sure to leave any feedback or ideas here.

Now, the Ecosystem Survey. The TL;DR:

  • It’s difficult to find the right data to answer your questions, and PUDL is no exception. We need to make it easier for folks to find what they’re looking for and parse through our extensive data catalog. We’re excited about the new PUDL data viewer, but there’s more we can do to make PUDL more useful and less overwhelming.
  • We did a good job picking the most important datasets to pull into PUDL! Generally speaking the data that are used most often are the ones we’ve integrated (pats self on the back), though there was undoubtedly some bias in who took the survey.
  • Energy data users have more Python and open source contribution experience than we thought! This combined with the fact that most energy data users have at some point written down quirks about the data tells us that we need to make contributing to PUDL easier and more desirable.

Now here’s a deeper dive on the state of energy data according to you:

Who are you?

You’re mostly academics and educators, but you represent a wide range of organization types including NGOs, for-profits, and hobbyists. A smaller subset of you represent folks working directly in policy, utility regulation, or for utilities themselves. Notably, none of you are journalists.

The primary outputs of your work are peer reviewed publications, data visualizations, and reports.

What are you using energy data for?

You primarily use this data for utility or facility level financial and economic modeling, but you’re also quite involved in developing new policy proposals, energy system models, regulatory interventions, and spreading awareness about the climate and its impact on the grid.

What data are you using the most?

The most sought after datasets tend to be those already integrated into PUDL: EIA 860, EIA 923, EIA 861, NREL ATB, and FERC Form 1. Many of you also rely on data that is not yet (fully) integrated into PUDL: Weather data, EIA AEO, NREL EFS, NREL ResStock, ISO/RTO data, Census, and EPA eGrid data.

How do you choose and obtain data?

72% of you access data by downloading it directly from federal and state agencies.  Unsurprisingly (it’s our survey after all) 60% of you are PUDL users too.  A smaller, but not insubstantial portion (33%) of you endeavor to get data directly from utilities and/or the various RTOs/ISOs.

When choosing a dataset, 72% of you also indicated that clear data processing methodology is important. Other considerations include data sharability, format, and update cadence. These factors were cited more frequently than things like data resolution or completeness. 

How do you work with data? 

The majority of you rely heavily on Python (81%) to process your data. Excel and R are tied, followed by SQL and others. 

CSV, Parquet, and JSON are your most desired data formats, in that order. PDF and XML/XBRL shared a spot at the bottom with 0 votes. 

Most of you have written down quirks about the data you are working with or made visualizations to accompany your work with PUDL data. As noted above, a decent chunk of you have also had to contend with data not yet available through PUDL, thus extracting and transforming it on your own. Very few folks have tried to incorporate this work into PUDL. However, nearly half of you have experience creating GitHub issues or even writing pull requests to change source code.

What are your pain points when using energy data?

Nearly half of you struggle to find the right data to answer your questions, extract data from difficult formats (like PDFs), and connect data from disparate datasets with no common identifiers.

With regard to PUDL specifically, you aptly noted that we publish multiple versions of the same data without enough documentation to let users identify which version is right for their use case. For the amount of information in PUDL, you also noted that our docs are confusing and sometimes insufficient. 

What are we going to do about it?

After reading everyone’s responses and processing the information we collected from the 60 informational interviews, we came up with a list of projects that would improve both PUDL and people’s experience working with energy data in general. The core takeaway is a need for more and better documentation. We’re going to make a concerted effort to put ourselves in your shoes and answer some of the longstanding questions about what’s in PUDL and how to navigate it. 

In a similar vein, we’re going to try and improve the process of contributing to PUDL, whether through code or useful information about the data. Perhaps an energy data wiki? 

Lastly, for those of you that offered, we might reach out for support along the way! We want PUDL to be useful for you, and the best way to ensure that is to involve you in the process. 

Stay tuned on our progress by following along on GitHub, and, as always, reach out if there is something you’d like to see us do.

Live, laugh, clean data,

Austen & the Catalyst Team

Categories
updates

Inside the 2024 Catalyst Member Retreat

Feat. our ongoing quest to find a logo for PUDL.

Catalyst is a fully-remote organization with members living all over North America. Our floating cyborg heads are besties, but we’re privy to the power of an analog hang. As such, we host an annual member retreat. This year we chose to meet in Mexico City because it seemed like fun and because it overlapped well with this year’s CSV conf (see Zane’s talk here!)

We booked an Airbnb with a few buffer days so folks could explore the city before getting down to business hashing out Catalyst’s future. We perspired, ate popsicles, and pinched ourselves a few times to make sure we were actually there. Even though we had a lot of work planned for the next few days, we agreed that the retreat would be a success even if all we did was get together.