To kick off our NSF POSE grant work, over 4 weeks in July and August we interviewed more than 60 energy data users as part of NSF’s Innovation Corps program (I-Corps). I-Corps helps POSE awardees better understand their users and contributors, and the potential for fostering a sustainable open source ecosystem.
Some of our interviewees were already PUDL users, and many of them weren’t. A fair number of the PUDL users were at organizations we’d never encountered before! We talked to academic researchers and advocates working at non-profits, but also people at for-profit companies, and folks working in the public sector. We even had the chance to talk to some utilities. Interviewees’ technical and energy domain backgrounds were diverse: from spreadsheet-only NGOs to startups working with cloud-based data pipelines and orchestration frameworks, and everything in between. There were software engineers and lawyers who argue before FERC, grassroots advocates, and regional electricity planning organizations too.
It was an intense month for our sometimes introverted team, but overall it was a good experience and we learned a lot. So we thought we’d share some of our high-level takeaways, and see if they resonate with the broader energy data community.
Things are changing, fast.
The energy system is starting to change rapidly, and many existing tools and organizations are struggling to keep up. The utilities doing energy modeling are risk-averse and wary of changing how they model the system for planning purposes. Existing modeling frameworks and common workflows don’t typically explore many future scenarios probabilistically to find the set of resources that best preserves future optionality. This is a problem, given that we don’t yet know exactly what the energy system is going to end up looking like. Some modelers believe that an open source and open data ecosystem will be more amenable to innovation, and able to adapt rapidly as modeling needs evolve over the coming decade.
There’s widespread awareness that more people will be working with energy data going forward, and that there will be more of it. At the same time, many people are struggling to manage the volume and complexity of the data they already need to work with.
Open models and open data have gotten good enough that they are being used in some commercial applications. Or conversely, proprietary models have been slow to adapt to the new requirements of the energy transition. This is resulting in a flow of resources and knowledge between academia, which has produced most of the open models, and the commercial sector: people who have been developing open models are starting to use them in commercial contexts.
Organizations that work with energy data or do energy modeling are often growing rapidly, and facing internal scaling issues. They may have data & analysis development practices that made sense with 25 people, but no longer work well with 125. Staffing may not be sufficiently specialized to provide effective data services to internal users.
Updating data infrastructure is hard.
Many established organizations with deep energy and policy expertise struggle to invest in and maintain their internal data infrastructure. This can be due to a lack of in-house technical expertise, a lack of resources/prioritization, a large accumulation of technical debt, or just institutional inertia. For the individuals tasked with keeping data and analysis up to date, this often becomes a pain point.
Moving to a “modern data stack” is a huge investment for an existing organization coming from another set of tools. Setting up your own data warehouse and dedicated internal data team is expensive and a big change in analytics culture. It’s also well beyond the scope of what most medium-sized and/or not particularly technically oriented organizations can do.
Outsourcing much of that work to specialized companies whose APIs can be used to access the data is much easier, but less flexible & scalable. Many data vendors do a poor job of supporting bulk analytical usage. It’s harder to switch to a new set of tools than to start from scratch because of the need to migrate accumulated code and analysis. This kind of change can take “a generation.”
Updating technical skills is also hard.
Upskilling is rare, difficult, disruptive, and expensive, both individually and organizationally. There are generational divides. Ongoing learning seems like it will be required given the needs of the energy transition, but how that will happen at a systemic level is unclear.
Technical training is often available, but underutilized, because people don’t have the time to do the training, or they don’t have an immediate application for what they would learn.
Some users are willing to learn new things, but only if they have high confidence that they are learning the right new things. There are a lot of different options, and practitioners feel unable to decide which ones are worth the trouble: choosing between Snakemake, Dagster, Airflow, and Prefect for workflow orchestration, or between CSV, Parquet, SQLite, Postgres, and DuckDB for tabular data storage. Decision fatigue or paralysis often results in people continuing to use whatever is already familiar, even if they know it’s not ideal.
PUDL has commercial users!
Newer cleantech startups are often comfortable with bulk data, contemporary data processing infrastructure, machine learning, and programmatic analysis. Unlike some more established organizations, they are also starting from a relatively clean slate, so they don’t have the same technical debt / legacy systems to contend with. Some of these companies are already using PUDL and find it relatively straightforward to access.
Even organizations that have access to commercial data like S&P are not always satisfied with it, or are not able to access it with the frequency they’d like. Some commercial users also find open data easier to work with because access is simpler, and there’s more accountability and transparency. Proprietary data often feels like a black box.
Commonly available datasets are considered a commodity. Commercial users do not have a proprietary interest in the kind of data we’re publishing. It’s not part of their competitive advantage. Our value proposition is that they don’t have to think about it any more.
Multiple interviewees in the private sector made enthusiastic, unprompted statements in favor of open products, and expressed a desire to support their ongoing development.
Commercial users of open products value having a B2B relationship with these “vendors.” It gives them confidence that there will be ongoing maintenance & support, and that if they have particular issues, those issues will be addressed.
More technical users are often frustrated when the only way to access data is through a high-level interface, like a GUI or Web dashboard, which aggregates or otherwise obscures the nature of the underlying data.
Easy data exploration & evaluation is key.
Examples that work out-of-the-box with no setup are vital for users evaluating a new tool or dataset. Before they know whether the model or input data is worth getting familiar with, even a small technical hurdle or error can cause a new user to walk away. Once they’ve been convinced that there’s value, they may be willing to put in more effort, but you have to earn that effort. If they walk away and end up creating their own DIY solution, it will often be difficult for them to maintain and can become an infrastructure pain point.
Social media posts – especially with data visualizations – are a useful way to introduce folks to what data is available. We spoke to multiple people who saw us post something interesting, and then felt motivated to dig deeper and ended up using PUDL data.
Energy data discovery and finding documentation is a challenge for many people. It’s done via search engines, word-of-mouth, academic citations, even WhatsApp groups. There’s no usable centralized repository or data catalog.
With 200+ tables in the PUDL DB alone, our Data Dictionary page is pretty unwieldy. Users have a hard time searching for the data they need, understanding what lives where, and figuring out which of several similar options is appropriate for their use case.
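As a rough illustration of the scale of the problem, here’s a minimal sketch of what programmatically browsing those table names looks like today, assuming a locally downloaded copy of the pudl.sqlite database (the “ferc” search term is just an example):

```python
# Minimal sketch: browsing PUDL table names programmatically, assuming a
# locally downloaded copy of the pudl.sqlite database.
import sqlite3

conn = sqlite3.connect("pudl.sqlite")
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
print(f"{len(tables)} tables in total")

# Crude substring search -- roughly what "finding the right table" looks
# like without a better catalog or search interface.
for name in tables:
    if "ferc" in name.lower():
        print(name)

conn.close()
```

That kind of grepping works for the already-technical, but it doesn’t tell you which of several similar tables you actually want.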
Long Live Spreadsheets!
Energy system domain knowledge and technical skills are largely independent of each other. Lots of people with deep domain knowledge work primarily or entirely in spreadsheets. This is true across multiple disciplines and age groups. Spreadsheets are going to be with us for the long haul, and we need to think of these users as part of our core audience, even if it’s challenging to serve them.
A significant number of PUDL users currently rely on Datasette to download data for use locally. This includes folks who want to stream all of the data, and those who just want to grab a subset. Many users are not finding our bulk download options at all, and even when they do, they still find Datasette → CSV easier to use.
Less technical users are often overwhelmed by the scale and complexity of bulk data, as well as the formats that it often comes in (e.g. SQL databases or Parquet files).
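To make that gap concrete, here’s a minimal sketch of the programmatic path to bulk data, assuming a locally downloaded PUDL Parquet file (the filename and column names below are illustrative). Even “just grab a subset as a CSV” presumes a working Python environment with pandas and a Parquet engine installed:

```python
# Minimal sketch of the "bulk data" path, assuming a locally downloaded PUDL
# Parquet file. The filename and column names below are illustrative.
import pandas as pd  # read_parquet() requires pyarrow or fastparquet

gens = pd.read_parquet("core_eia860__scd_generators.parquet")

# Filter down to a manageable subset and hand it off as a CSV -- the format
# many less technical users actually want to work with.
recent = gens[gens["report_date"] >= "2020-01-01"]
recent.to_csv("generators_2020_onward.csv", index=False)
```

For someone who lives in spreadsheets, each of those steps is a real hurdle, which is part of why the Datasette → CSV route is so popular.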
Many users want curated example dashboards or visualizations highlighting themes in the data and possible use cases. Users across a wide range of technical and energy domain backgrounds suggested this as a data discovery aid.
Highly technical users sometimes operate in time-sensitive contexts and need quick-and-easy access in high-impact use cases. There are times when you just can’t be bothered to learn something new, no matter how expert you are. For example, legislative support work can be extremely time constrained. Doing rough calculations quickly lets you understand the magnitude and direction of an effect. Combined with intuition built on deep background knowledge, those rough numbers let an expert provide advice immediately. This is also true when speaking to the public or press about current events.
Dashboarding and BI tools that serve spreadsheet users can also serve more technical users working under time constraints, e.g. a web interface where you can quickly select and download a CSV of data or make some charts through a GUI. Many folks cited GridStatus as a positive example.
Consulting & advocacy work is often organized around one-off projects or reports. Many analyses are never revisited. Projects may also rely on bespoke, proprietary client data. In these contexts, reusability and reproducibility of outputs are not highly valued, and users may be perfectly happy to download the specific subset of the data they need from a web UI.
Trust & credibility come from many places.
For many users, trust in data is primarily social/institutional rather than empirical. If the data is coming from a government agency or a well-respected source that many other people are using, it’s deemed trustworthy. Actual cross-checking of data by users against other similar or related data sources is less common, but still important.
Data from the national labs is almost universally trusted, but many users are frustrated that the labs do not regularly update and maintain their datasets. Instead, over time the data goes stale and becomes less relevant, and when a new source appears it may not be comparable or have the same structure.
Clear data provenance for processed data helps build trust. Many users want the cleaned, processed data to start with, but they’d like to be able to easily compare it with the raw data and have explanations of what processing was performed.
The worse the raw data is, the more willing folks are to use pre-processed data, and the more valuable it is to have done the work of cleaning it up and making it available. Lots of people who would never have touched the original FERC Form 1 data use our processed version of it. Some people even use our hard-won FERC Form 1 to EIA record linkage! People will often walk away from a messy dataset rather than spend the time required to clean it up. Having a clean version of the data prevents that!
Clearly disclosing uncertainty / known issues with open data builds trust. People who work with data understand that it’s never perfect. Listing known issues and limitations tells users that the data has actually been used by others and that there is active curation happening. A lack of errata and caveats doesn’t suggest the data is perfect, it suggests the problems are unknown or not being publicized.
Seeing an active open source community gives users confidence that the project will be maintained and support will be available if needed.
What data people want.
Many users highly value having frequently updated input data. It minimizes the chances that they’ll end up going around a processed data product to the upstream source to get fresher data.
Record linkages: many organizations find that connecting datasets is a big time sink, but that linkage is necessary for the analyses they are trying to perform. The lack of linkages can make data functionally unusable, even if the information is theoretically available.
Geospatial data is especially valuable. We spoke with multiple people who highly prioritize the spatial aspect of data. It’s an extremely powerful way to do entity matching between different datasets. Many important permitting and policy issues vary by location / jurisdiction. Being able to look up land ownership associated with or impacted by potential projects is important.
The lack of easy access to regulatory filings puts advocates at a disadvantage in front of PUCs. Utilities typically subscribe to expensive data scraping services (one interviewee quoted $15K/month), can easily search and find relevant regulatory documents, and are able to keep abreast of all the different things that are happening. Trying to do the same thing manually is not reasonable. This is in addition to the inability to find and use appropriate data on timescales that are relevant to a proceeding. For people in regulatory proceedings, data sources like the EIA are out of date because everything they’re reporting has by definition already happened, or at least already been decided. Tracking IRPs is about knowing what might happen in 5-10 years, and having a chance to shape that outcome. Advocates need resources that give them this kind of foresight and situational awareness of the regulatory environment.
The time it takes to find data and do your own data cleaning & preparation from scratch is a major impediment in a research context. Working in an environment with rich, plentiful, standardized data can mean the difference between a research publication taking 3 months or 3 years. What data is available and ready to use often determines what research questions are even asked.
Data Wish Lists
Some specific datasets came up multiple times:
Structured Data
- Plant-specific O&M costs.
- Transmission system costs & existing infrastructure details including spatial information.
- Renewable energy resource quality / generation potential.
- PPA prices and contract terms (FERC EQR!).
- Detailed power purchase transactions between utilities (FERC EQR!).
- Any kind of information about natural gas distribution systems.
- Regional hydro generation and storage capacity. Hourly hydro generation.
- Weather data for predicting electricity demand and variable renewable generation.
- NEEDS data from EPA.
Unstructured Data
- A single interface to search & download all regulatory filings.
- Extracted structured data from unstructured regulatory filings.
- Local permitting rules and renewable siting restrictions.
- Localized public sentiment toward new projects (renewable energy, transmission, etc.)
- Up to date and programmatically usable state energy policies & incentives.
Projections
- Projected EV & heat pump adoption scenarios (a continuously updated version of NREL’s Electrification Futures Study).
- Load forecasts (resolved by sector, location, and end-use, at hourly resolution).
- Planned projects, including their actual current status, even if they haven’t been approved yet.
What contributors want.
Open source contributors are often trying to maintain or develop technical skills that they don’t get to use in their day jobs. Projects that use technologies they’re interested in learning about are attractive (interviewees mentioned Dagster, DuckDB).
Open source contributors are often trying to learn about the energy domain in order to switch careers and do something more aligned with their values and interests. Those who are motivated by the mission often want to see that the data is actually being used in real-world situations to make change. Part of that confirmation can come from links and press coverage online, but recurring synchronous interaction with others on the project and real-time updates about what’s happening make the community feel more “sticky.”
Potential contributors often land on large, well-known Slack or Discord servers initially, but these can be too big or diffuse to provide a real sense of community or engagement. However, they can be good “watering holes” for finding interested folks with the right skills (Interviewees mentioned Work on Climate, DER Task Force, Climate Change AI).
One-on-one synchronous communication is important for making people feel like… people – like members of a community that you can be connected to and a part of. Ideally in person, but regular calls and chats can do it too.
Having a (real) vibe / character online makes it easier for potential volunteers to know if they’ll connect with you. Blog posts, shared resources (like a reading list), hot takes, controversial but informed opinions, staking out positions. You don’t have to be all biz/tech all the time. Not everybody will be a perfect match, and that’s okay.
Implications for PUDL
Some of what we heard confirmed things we already thought we knew, but there were definitely surprises! We clearly need to do a better job of serving users who don’t write code and never will. It seems like there are opportunities for us to help organizations maintain and migrate their internal data infrastructure — or maybe provide it as a service?
There’s more work that needs doing than we can do on our own, so we’ll have to prioritize. We’re still talking internally about what we want to do in light of what we learned, and we’re going to have to save sharing that process for another post!