Austen Sharpe, Author at Catalyst Cooperative

Last July we conducted more than 60 interviews with energy data users to kick off our work for the NSF POSE grant! While that process was very informative, it was also a huge amount of work, and we want more sustainable ways to understand the needs of energy data users (PUDL or otherwise).

So this winter, we put out our first annual Energy Data Ecosystem Survey—a short, mostly multiple-choice questionnaire built on insights from our conversations. 47 of you responded, and we learned a lot!

Before we dig into the results, here are a few changes we’ve already made since talking to you all:

One of our core takeaways from the interviews was how organizations struggle with technical skill building and capacity. This is something Catalyst can address with our consulting services and educational content creation. Thanks to funding from the Sloan Foundation, we’ve developed OpenEnergyData4All, a training series to help future and current energy data users gain more confidence in their technical skills. We’re excited to continue creating content like this to help people engage more readily with the data!

Another important theme was users’ desire for data that is easy to explore. A large portion of technical and non-technical users relied heavily on our PUDL browser viewing tool, Datasette. While Datasette was simple for us to set up and enabled users to explore PUDL data on the web and download CSVs, it isn’t really designed to handle the scale of data that we’re publishing now. We developed a Beta PUDL Viewer so that technical and non-technical users can interact with and search through the data with greater ease. We’re continuing to improve the tool, so be sure to leave any feedback or ideas here.

Now, the Ecosystem Survey. The TL;DR:

It’s difficult to find the right data to answer your questions, and PUDL is no exception. We need to make it easier for folks to find what they’re looking for and parse through our extensive data catalog. We’re excited about the new PUDL data viewer, but there’s more we can do to make PUDL more useful and less overwhelming.
We did a good job picking the most important datasets to pull into PUDL! Generally speaking, the datasets that are used most often are the ones we’ve integrated (pats self on the back), though there was undoubtedly some bias in who took the survey.
Energy data users have more Python and open source contribution experience than we thought! This, combined with the fact that most energy data users have at some point written down quirks about the data, tells us that we need to make contributing to PUDL easier and more desirable.

Now here’s a deeper dive on the state of energy data according to you:

Who are you?

You’re mostly academics and educators, but you represent a wide range of organization types including NGOs, for-profits, and hobbyists. A smaller subset of you represent folks working directly in policy, utility regulation, or for utilities themselves. Notably, none of you are journalists.

The primary outputs of your work are peer-reviewed publications, data visualizations, and reports.

This is not entirely surprising, as most of the grants we’ve received prioritize building better teaching and research tools for the academic community. Many of our datasets also contain long historical time series, which are better suited to academic research than to private-sector uses that depend on something like real-time market data. We’re excited to continue serving these communities and hope to engage more readily with university labs moving forward. If that’s you, let’s get in touch!

Knowing that researchers make up the bulk of energy data users also helps us prioritize things like clear methodologies that we know are important to them.

Utilities likely rank low because they use their own data and don’t rely as much on public or commercial sources of information. National Labs, on the other hand, are definitely using energy data; their lower ranking might be an artifact of our network or outreach strategy.

All org types in the bottom half of the ranking are ones whose data use we want to better understand.

What are you using energy data for?

You primarily use this data for utility- or facility-level financial and economic modeling, but you’re also quite involved in developing new policy proposals, energy system models, and regulatory interventions, and in spreading awareness about the climate and its impact on the grid.

It makes sense that the primary use for energy data is modeling. Luckily we already sniffed this one out and are working closely with GridLab to make our outputs as model-ready as possible. We’re considering things like making a set of standard inputs for the most common models and removing any usage barriers that modelers have with PUDL. If you’re a modeler with ideas about how to make PUDL data better, reach out!

While policymakers may not be the ones using the data, the data is affecting policy in the venues that count! We are interested in understanding the full diversity of uses for this data, so whether you think you are doing something groundbreaking, niche, or utterly everyday, write and tell us about it! We want to know the full data-to-impact story.

What data are you using the most?

The most sought-after datasets tend to be those already integrated into PUDL: EIA 860, EIA 923, EIA 861, NREL ATB, and FERC Form 1. Many of you also rely on data that is not yet (fully) integrated into PUDL: weather data, EIA AEO, NREL EFS, NREL ResStock, ISO/RTO data, Census, and EPA eGRID data.

Our survey included many PUDL users, so it’s not shocking that the datasets they value most are already available in PUDL. Still, it’s encouraging to see respondents’ priorities align with our core offerings. This suggests that we’ve chosen high-impact datasets and can focus more time and energy on improving them.

Identifying which non-PUDL datasets matter is equally important. It helps us prioritize future integration efforts and make a stronger case for funding them. Some of these datasets are already archived on our Zenodo and would be relatively straightforward to incorporate (e.g., EIA AEO, eGRID, NREL EFS). Others are more complex. Here’s how we’re thinking about the trickier ones:

Generally speaking, weather data from sources like NOAA, NASA, and NWS is a lot cleaner and more readily accessible than some of the other data we work with, so the need for a clean copy is less urgent. It’s also much larger and typically structured as a gridded multi-dimensional array, rather than a database table, making it more complicated to integrate into PUDL. Moreover, weather data is a bit outside our purview: we’re not as confident cleaning, transforming, and explaining it as we are energy system data.
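To make the "gridded array versus database table" point concrete, here is a minimal sketch using only the Python standard library. The coordinates, timestamps, and temperatures are made up for illustration, and this is not how PUDL or any weather provider actually stores data. It simply shows that flattening a grid produces one table row per (time, latitude, longitude) cell.

```python
from itertools import product

# Hypothetical axes and values, invented purely for illustration.
times = ["2024-01-01T00:00", "2024-01-01T01:00"]
lats = [39.0, 39.5]
lons = [-105.5, -105.0, -104.5]

# grid[t][i][j] is the temperature (deg C) at times[t], lats[i], lons[j].
grid = [
    [[1.0, 1.5, 2.0], [0.5, 1.0, 1.5]],
    [[0.0, 0.5, 1.0], [-0.5, 0.0, 0.5]],
]

# Flatten to one row per grid cell, the "long" shape a database table expects.
rows = [
    {"time": times[t], "lat": lats[i], "lon": lons[j], "temp_c": grid[t][i][j]}
    for t, i, j in product(range(len(times)), range(len(lats)), range(len(lons)))
]

# The row count is the product of the axis lengths: 2 * 2 * 3.
print(len(rows))
```

Even this toy grid turns into 12 rows. Real gridded products have far more time steps and grid cells, which is part of why a tabular copy gets unwieldy quickly.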

It might make sense for us to collaborate with modelers or an organization with domain expertise to help build a bridge between big weather data and energy system model inputs. However, there’s been a fair amount of research in this vein, and gaps or granularity issues in the underlying data seem to be more of a barrier than data access itself. Check out atlite and renewables.ninja for relevant work.

ISO/RTO data is not a product of the federal government, so it’s subject to a set of unique licensing and usage constraints that we’d have to examine more closely before republishing. Extracting and synchronizing data from ISOs/RTOs is also non-trivial given the real-time nature of the data and the fact that each ISO/RTO is an independent organization that handles data differently. Grid Status has this covered.

NREL ResStock is interesting but for now too large to archive on Zenodo.

If you’re using any of these non-PUDL datasets, let us know how we could help make them more accessible. What sort of format or tables are most useful to you?

How do you choose and obtain data?

72% of you access data by downloading it directly from federal and state agencies. Unsurprisingly (it’s our survey, after all), 60% of you are PUDL users too. A smaller but not insubstantial portion of you (33%) endeavor to get data directly from utilities and/or the various ISOs/RTOs.

When choosing a dataset, 72% of you also indicated that clear data processing methodology is important. Other considerations include data shareability, format, and update cadence. These factors were cited more frequently than things like data resolution or completeness.

The big takeaway here is that users are more likely to get data straight from the source than curated, often costly platforms like S&P, Hitachi, or Yes Energy, even if that means doing a little (or a lot) of cleaning themselves. We can harness all this knowledge by giving people more opportunities to share their experiences working with these raw datasets and incorporating that information into PUDL.

This also serves as an important reminder that just because our code is open source does not mean that our methodologies are clear and transparent. In order to build trust with users, we need to make sure that PUDL does an adequate job of explaining how the data are processed. This is something we care deeply about and have discussed at length. There are many different ways to share data transformation methodologies, and we want to make sure our approach is both intelligible and readily maintained. This topic could be a whole new blog post (and hopefully will be eventually). For right now, we’re aiming to bolster our documentation so it’s easier to navigate and understand what is going on!

How do you work with data?

The majority of you (81%) rely heavily on Python to process your data. Excel and R are tied, followed by SQL and others.

CSV, Parquet, and JSON are your most desired data formats, in that order. PDF and XML/XBRL shared a spot at the bottom with 0 votes.

Most of you have written down quirks about the data you are working with or made visualizations to accompany your work with PUDL data. As noted above, a decent chunk of you have also had to contend with data not yet available through PUDL, extracting and transforming it on your own. Very few folks have tried to incorporate this work into PUDL. However, nearly half of you have experience creating GitHub issues or even writing pull requests to change source code.

This further confirms our desire to improve the PUDL contribution process! We want to provide people with venues to share their knowledge about the data, not just write code. Our current idea is to build an energy data wiki that anyone could contribute to. This would help lower the barrier to entry for compiling qualitative information about the data.

What are your pain points when using energy data?

Nearly half of you struggle to find the right data to answer your questions, extract data from difficult formats (like PDFs), and connect data from disparate datasets with no common identifiers.
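For anyone who hasn’t run into the "no common identifiers" problem, here is a rough sketch of what it looks like in practice. It uses Python’s standard-library `difflib` to match names across two sources. The plant names, the `normalize` and `best_match` helpers, and the 0.85 threshold are all hypothetical, chosen for illustration; this is not PUDL’s record linkage method.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def best_match(name, candidates, threshold=0.85):
    """Return the candidate most similar to `name`, or None if below `threshold`."""
    target = normalize(name)
    scored = [(SequenceMatcher(None, target, normalize(c)).ratio(), c) for c in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None


# Hypothetical plant names as they might appear in two unrelated sources.
source_a = ["Cedar Ridge Power Station", "Lakeview"]
source_b = ["CEDAR RIDGE POWER STATION.", "Lakeview Energy Center", "Pine Hollow"]

# Map each name in source_a to its best match in source_b (or None).
links = {name: best_match(name, source_b) for name in source_a}
print(links)
```

The first name links up once case and punctuation are normalized. The second falls below the threshold and comes back as `None`, even though a human would probably pair it with "Lakeview Energy Center". That gap between what string similarity catches and what a person knows is where a lot of the manual work goes.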

With regard to PUDL specifically, you aptly noted that we publish multiple versions of the same data without enough documentation to let users identify which version is right for their use case. For the amount of information in PUDL, you also noted that our docs are confusing and sometimes insufficient.

What we’re hearing is that people are more concerned with accessing the data than with the quality of the data itself. This makes sense! If you can’t use it, who cares how clean it is? We know how you feel. In fact, we built a whole company around it… It also tells us that we should focus on integrating datasets that are particularly difficult to access in their raw published state, like those stuck in PDF format or with major record-linking challenges. These datasets present a greater logistical challenge but result in greater positive impact for users.

It also sounds like we need to spend some time helping people navigate different available datasets, starting with those in PUDL, and perhaps extending to the vast array of data available elsewhere. It’s time for us to knuckle down and spiff up our documentation!

What are we going to do about it?

After reading everyone’s responses and processing the information we collected from the 60+ informational interviews, we came up with a list of projects that would improve both PUDL and people’s experience working with energy data in general. The core takeaway is a need for more and better documentation. We’re going to make a concerted effort to put ourselves in your shoes and answer some of the longstanding questions about what’s in PUDL and how to navigate it.

In a similar vein, we’re going to try to improve the process of contributing to PUDL, whether through code or useful information about the data. Perhaps an energy data wiki?

Lastly, for those of you who offered, we might reach out for support along the way! We want PUDL to be useful for you, and the best way to ensure that is to involve you in the process.

Keep up with our progress by following along on GitHub, and, as always, reach out if there is something you’d like to see us do.

Live, laugh, clean data,

– Austen & the Catalyst Team