We work with a lot of messy public data. In theory it’s already “structured” and published in machine readable forms like Microsoft Excel spreadsheets, poorly designed databases, and CSV files with no associated schema. In practice it ranges from almost unstructured to… almost structured. Someone working on one of our take-home questions for the data wrangler & analyst position recently noted of the FERC Form 1: “This database is not really a database – more like a bespoke digitization of a paper form that happened to be built using a database.” And I mean, yeah. Pretty much.

The more messy datasets I look at, the more I’ve started to question Hadley Wickham’s famous Tolstoy quip about every messy dataset being messy in its own way. There’s a whole taxonomy of recurring messes that go well beyond what you can easily fix with a few nifty dataframe manipulations. It seems like we should be able to develop higher-level, more general tools for automated data wrangling, and given how much time highly skilled people pour into this kind of computational toil, it would be very worthwhile.
We’re all data janitors
There are multiple open source machine learning frameworks with big user communities. Just in Python there’s at least scikit-learn, PyTorch, and Keras + TensorFlow. There’s also a seemingly endless supply of online courses, blog posts, and tutorials about working with these tools. But to do any good, they all need data. Lots and lots of nice, clean data, and people seem way less excited to talk about how you get that data. So much so that a popular O’Reilly book on the topic is entitled “Bad Data Handbook: Cleaning Up the Data So You Can Get Back to Work” as if cleaning up the data isn’t even part of the work! So much so that earlier this year researchers at Google published a paper entitled: Everyone wants to do the model work, not the data work. In particular, the Google researchers looked at how data issues propagate downstream in “high stakes AI” applications, where machines are potentially making life-and-death decisions, instead of just trying to get you to click on an ad for a useless gadget. And judging from the horror stories in Cathy O’Neil’s book Weapons of Math Destruction and the recent film Coded Bias, there are an awful lot of “high stakes AI” applications quietly running in the background affecting all of us, mostly without our knowledge…
So I started wondering, why aren’t there similar high level open source data cleaning frameworks? Is data cleaning really that bespoke of a process? Do these tools exist, and we just aren’t finding them? After a few years of doing this kind of work it sure seems like there’s a set of general problems that ought to be susceptible to automated data wrangling solutions.
A few of the things we keep running into include:
- Inferring normalized data structures including the entities, relations, and integrity constraints that are implied by a larger poorly structured dataset (e.g. figuring out that the EIA 923 contains information about utilities, plants, generators, boilers, coal mines, and fuel suppliers).
- Record linkage within time series and between tables, where there are no explicit identifiers, or no shared identifiers (e.g. assigning plant IDs in the FERC Form 1).
- Deduplicating information that’s reported inconsistently across multiple poorly normalized tables, and selecting a single canonical value for each field associated with each entity (e.g. coming up with a single record describing each utility, plant, and generator that reports in the EIA Forms 860, 861, and 923).
- Reconciling multiple datasets that refer to the same entities in different ways with different identifiers at different granularities (e.g. linking the utility and plant data reported in the FERC Form 1 and EIA 860/861/923).
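To make the deduplication and canonical-value problem a bit more concrete, here’s a minimal sketch (using made-up toy records, not real EIA data) of “record fusion” by majority vote over inconsistently reported fields:

```python
from collections import Counter

def fuse_records(records, entity_key):
    """Group duplicate records by entity ID and pick a single canonical
    value for each field: here, the most frequently reported non-null value."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec[entity_key], []).append(rec)

    fused = {}
    for entity_id, recs in grouped.items():
        canonical = {entity_key: entity_id}
        fields = {f for r in recs for f in r if f != entity_key}
        for field in fields:
            values = [r[field] for r in recs if r.get(field) is not None]
            if values:
                # Majority vote; real record fusion might weight by source quality.
                canonical[field] = Counter(values).most_common(1)[0][0]
        fused[entity_id] = canonical
    return fused

# Hypothetical plant records reported inconsistently across tables:
reports = [
    {"plant_id": 3, "fuel": "coal", "capacity_mw": 600},
    {"plant_id": 3, "fuel": "coal", "capacity_mw": None},
    {"plant_id": 3, "fuel": "gas",  "capacity_mw": 600},
]
print(fuse_records(reports, "plant_id"))
```

Real fusion logic also has to decide what to do about ties and about fields where “most common” isn’t the right heuristic (e.g. preferring the most recent report), but the shape of the problem is the same.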
Rather than just muddling through and reinventing the wheel every time with a few tweaks, I decided to look for more formalized knowledge related to data cleaning, and found a textbook entitled… Data Cleaning by Ihab Ilyas and Xu Chu, published in 2019. It’s a great survey of the different classes of data cleaning problems and some of their solutions, with tons of references to the deeper academic literature. So, the knowledge exists! This isn’t surprising — people have been working on relational databases for almost half a century, and versions of these problems have probably existed for almost as long.
But in a way, this makes it even weirder that we have an explosion of powerful, widely used open source machine learning frameworks, and very little in the way of high-level data cleaning tools: the problems of creating and maintaining clean structured data have been around much longer.
Automated Data Cleaning
I started poking around some of the literature referenced in the Data Cleaning textbook, and it seems like Ihab Ilyas, who is a professor at the University of Waterloo in Canada, has carved out a niche for himself by focusing on data cleaning as a generalized set of problems. Along with Theodoros Rekatsinas, Xu Chu, and Christopher Ré, he developed a framework for repairing structured data called HoloClean. Perhaps ironically, it uses machine learning algorithms to clean up the structured data, which is then presumably fed into still other machine learning algorithms. Here’s a talk about it from February, 2020:
- A post associated with the talk: AI should not leave structured data behind!
- Data Cleaning as a Large-Scale Machine Learning Problem a post by Ihab Ilyas from 2018 digging into the nature of automated data cleaning more deeply.
- Paper: HoloClean: Holistic Data Repairs with Probabilistic Inference by Rekatsinas et al. (2017).
- The HoloClean GitHub Repo, last updated in 2019.
Automated cleaning of dirty datasets relies on the fact that the data is usually mostly good: you apply a battery of different outlier detection methods, and assume that relationships between columns and tables which almost always hold true really should always hold true.
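A toy illustration of that idea, assuming a hypothetical plant table where the dependency plant_id → state ought to always hold: wherever it almost holds, the minority values are flagged as probable errors, with the majority value as the suggested repair.

```python
from collections import Counter, defaultdict

def flag_fd_violations(rows, lhs, rhs, min_support=0.8):
    """Assume the dependency lhs -> rhs *should* always hold. Where it
    almost holds (the majority value has support >= min_support), flag
    the minority rows as likely errors, suggesting the majority value."""
    values = defaultdict(list)
    for i, row in enumerate(rows):
        values[row[lhs]].append((i, row[rhs]))

    suspects = []
    for key, pairs in values.items():
        counts = Counter(v for _, v in pairs)
        majority, n = counts.most_common(1)[0]
        if n / len(pairs) >= min_support:
            suspects.extend(
                (i, key, v, majority) for i, v in pairs if v != majority
            )
    return suspects  # (row index, key, suspect value, suggested repair)

rows = [
    {"plant_id": 1, "state": "CO"},
    {"plant_id": 1, "state": "CO"},
    {"plant_id": 1, "state": "CO"},
    {"plant_id": 1, "state": "CO"},
    {"plant_id": 1, "state": "C0"},  # typo: zero instead of the letter O
    {"plant_id": 2, "state": "TX"},
]
print(flag_fd_violations(rows, "plant_id", "state"))
# → [(4, 1, 'C0', 'CO')]
```

Systems like HoloClean generalize far beyond this, combining many such signals (denial constraints, outlier detectors, external reference data) in a single probabilistic inference step rather than repairing one rule at a time.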
Ilyas founded a company called Inductiv around the HoloClean technology (which was supported by funding from DARPA, the Office of Naval Research, and other public agencies). Then Inductiv was almost immediately acquired by Apple. The original HoloClean GitHub repo is still available, but it hasn’t really been touched since 2019.
Searching for additional high level tools that address other data wrangling problems, I found what seems like a similar pattern: a fairly rich academic literature, with some proof-of-concept tools that either clearly went commercial, or seemed to languish with a modest user base and few contributors, rather than becoming a standout community workhorse like Pandas. In addition to data cleaning, these tools seemed to fall into two other big categories: more structural data curation, and the automated labeling of training data for use in machine learning.
The automation of these three domains (cleaning, structural curation, and data labeling) using machine learning and statistical inference seems like it has the potential to radically transform the usefulness of messy data, which is basically all data in the real world.
A more recent talk from Theodoros Rekatsinas on some of the same ML based data cleaning issues:
Automated Data Curation
We’ve made some use of the Python Record Linkage Toolkit to join FERC and EIA datasets together. It’s built on scikit-learn, and it works well, but it could be more generalized and flexible, and it’s really the project of a single developer.
An adjacent academic effort is AnHai Doan’s Magellan project, which focuses on string-based entity matching. Their GitHub repos have been a bit more active recently, but still have only dozens to a couple of hundred stars. For example:
- deepmatcher: a Python package for performing entity and text matching using deep learning
- py_entitymatching: software package to match entities between two tables using supervised learning.
- py_stringmatching: a set of string tokenizers and similarity measures.
- py_stringsimjoin: Scalable string similarity joins.
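To get a feel for what these entity matching tools automate, here’s a deliberately naive sketch using only the standard library’s difflib, with made-up plant names. Real toolkits like the ones above add blocking (to avoid comparing every pair), a wide menu of tokenizers and similarity measures, and supervised matchers.

```python
from difflib import SequenceMatcher

def match_entities(left, right, key="name", threshold=0.85):
    """Naive string-similarity join: compare every left/right pair and
    keep those whose name similarity clears a threshold. This is O(n*m);
    blocking on a cheap key (state, fuel type) is what makes it scale."""
    matches = []
    for lid, lrec in left.items():
        for rid, rrec in right.items():
            score = SequenceMatcher(
                None, lrec[key].lower(), rrec[key].lower()
            ).ratio()
            if score >= threshold:
                matches.append((lid, rid, round(score, 2)))
    return matches

# Hypothetical records with no shared IDs, only similar names:
ferc = {101: {"name": "Comanche Generating Station"}}
eia = {7: {"name": "Comanche Generating Stn"}, 8: {"name": "Cherokee"}}
print(match_entities(ferc, eia))
```

Even this crude version captures the basic workflow shared by the Record Linkage Toolkit and Magellan: generate candidate pairs, score them with similarity measures, and keep the pairs above a (learned or hand-tuned) decision boundary.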
Another common data curation task is taking one or more poorly normalized tabular datasets, full of duplication and potentially inconsistent values, and inferring a corresponding well-normalized database structure. That means identifying several distinct relations with well-defined primary keys and foreign key constraints, and deduplicating records that have been “fused” using some kind of clustering algorithm, plus a process for choosing among inconsistent values when they arise.
As with the topics above, there’s definitely research on this (e.g. Data-driven Schema Normalization by Papenbrock & Naumann, 2017), and there are some software packages that try to get at the problem, like Autonormalize (GitHub repo), which is pretty explicitly aimed at preparing data for use in machine learning applications, as opposed to direct analytical consumption. And as with the projects above, Autonormalize seems to have gone dormant.
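To make the schema inference idea concrete, here’s a naive sketch of functional dependency discovery over toy records. Discovered dependencies like plant_id → plant_name are what tell you which columns belong together in a normalized relation; real discovery algorithms (like those benchmarked by Papenbrock & Naumann) handle multi-column determinants and scale to millions of rows.

```python
from itertools import permutations

def discover_fds(rows, columns):
    """Exhaustively test single-column functional dependencies A -> B:
    A determines B if no value of A ever maps to two different B values."""
    fds = []
    for a, b in permutations(columns, 2):
        mapping = {}
        holds = True
        for row in rows:
            # setdefault records the first B seen for this A; any later
            # disagreement falsifies the dependency.
            if mapping.setdefault(row[a], row[b]) != row[b]:
                holds = False
                break
        if holds:
            fds.append((a, b))
    return fds

# A denormalized toy table: plant_id determines plant_name and utility_id,
# and utility_id determines utility_name, but not vice versa.
rows = [
    {"plant_id": 1, "plant_name": "Comanche", "utility_id": 10, "utility_name": "Xcel"},
    {"plant_id": 1, "plant_name": "Comanche", "utility_id": 10, "utility_name": "Xcel"},
    {"plant_id": 2, "plant_name": "Cherokee", "utility_id": 10, "utility_name": "Xcel"},
]
print(discover_fds(rows, ["plant_id", "plant_name", "utility_id", "utility_name"]))
```

From a set of dependencies like this, a normalizer can split the table into plant and utility relations with plant_id and utility_id as primary keys, which is exactly the decomposition Autonormalize tries to automate.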
A more expansive project led by MIT’s Michael Stonebraker (one of the original creators of the Postgres database) and others including Ihab Ilyas, is the “Data Tamer System” (PDF), which attempts to do schema integration, normalization, and record deduplication. It was quickly commercialized with the founding of the company Tamr. The system automates as much of the data integration as possible, while making it easy for domain experts to help with ambiguous cases, rolling that new information into the underlying training data. In 2018 they wrote up some of their experiences with large scale data integration in this paper: Data Integration: The Current Status and the Way Forward (PDF).
This talk contrasts the approach Tamr takes with more traditional manual ETL and schema mapping approaches:
Automated Labeling of Training Data
Yet another example of this dynamic: labeling training data for supervised machine learning applications can be incredibly tedious and labor intensive. Enter Snorkel:
The Snorkel project started at Stanford in 2016 with a simple technical bet: that it would increasingly be the training data, not the models, algorithms, or infrastructure, that decided whether a machine learning project succeeded or failed. Given this premise, we set out to explore the radical idea that you could bring mathematical and systems structure to the messy and often entirely manual process of training data creation and management, starting by empowering users to programmatically label, build, and manage training data.
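The core idea of programmatic labeling can be sketched in a few lines, with made-up labeling heuristics for a toy spam task. Note that Snorkel itself combines the votes with a learned generative model that estimates each function’s accuracy and correlations; the simple majority vote here is just a stand-in.

```python
ABSTAIN, SPAM, HAM = -1, 1, 0

# Labeling functions: cheap, noisy heuristics that vote a label or abstain.
def lf_has_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

def majority_label(text, lfs):
    """Combine noisy labeling-function votes by simple majority,
    ignoring abstentions. Snorkel replaces this with probabilistic
    inference over the (unknown) true labels."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_has_link, lf_mentions_prize, lf_short_message]
print(majority_label("Claim your prize at http://example.com now", lfs))
print(majority_label("ok thanks", lfs))
```

The payoff is that a handful of heuristics like these can generate training labels for millions of unlabeled examples, which is what makes the approach so much cheaper than hand labeling.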
The project was wildly successful, and as a result, it has been enveloped by the warm embrace of Silicon Valley VCs with the founding of Snorkel AI.
The Snorkel team is now focusing their efforts on Snorkel Flow, an end-to-end AI application development platform based on the core ideas behind Snorkel.
A pre-commercialization talk about how Snorkel works:
The Contrast with Open Source Machine Learning
Interestingly, the history of the open source machine learning frameworks is almost the opposite of what seems to have been happening recently with powerful, high-level data preparation tools. TensorFlow was originally developed and then open sourced by Google in 2015. PyTorch was similarly an internal toolkit developed by Facebook that got open sourced in early 2017. Why did they do this?
The general consensus seems to be that they wanted to make sure that the people doing basic research in machine learning would be able to work with their tools directly, so that integrating new techniques would be easy, and their enterprises would be able to benefit immediately. At the same time, they knew better than anyone that without lots of good data and powerful computational infrastructure, fancy machine learning algorithms don’t really do you much good. And especially in the case of neural networks, there are a whole lot of parameters to tune in setting up a particular model. So by giving away their frameworks they got a lot of mindshare, without really giving up much secret sauce.
This means that neither of these frameworks had to go through the difficult process of cultivating an organic open source community from the ground up. They sprang fully formed into the world, and were able to immediately garner huge user bases, because they were already powerful and full-featured.
The history of Scikit-learn seems to be pretty different. It grew more directly out of the research community, especially the French Institute for Research in Computer Science and Automation (INRIA), as an independently developed extension of the SciPy library, and it continues to be a more community-driven project.
Automated (Open) Data Wrangling
This all feels a little troubling in the context of wrangling open data in the public interest.
In our wildest dreams, Catalyst is trying to address the problem of enterprise data integration as laid out by Stonebraker and Ilyas, but with respect to the energy and climate adjacent wings of the US Federal Government, the regional grid operators, and occasionally state public utility commissions:
Most large businesses are decomposed into independent business units to facilitate agility. Such units are typically free to “do their own thing”, without being hindered by corporate-wide issues. For example, adopting a specific global schema for a specific entity type (e.g., customers) across all units is often impossible as the needs of these units are different. Waiting for consensus across all the business units means it would take forever to get anything done. This leads to data silos (one for each unit), where similar data are stored with different granularity, schema, and even contradicting and inconsistent details. A typical enterprise has many such silos, and a major goal of many enterprises is after-the-fact integration of silos for business gain.

(Data Integration: The Current Status and the Way Forward, Stonebraker & Ilyas, 2018)
Except in our case, the “business units” are agencies like FERC, EIA, and EPA; the “customers” are the general public, as represented by policymakers, researchers, journalists, and advocacy organizations; and “business gains” are things like clean air, a stable climate, and a cost-effective, reliable energy system.
Most public data is a mess. The knowledge required to clean it up exists. Cloud based computational infrastructure is pretty easily available and cost effective. But currently there seems to be a gap in the open source tooling. We can keep hacking away at it with custom rule-based processes informed by our modest domain expertise, and we’ll make progress, but as the leading researchers in the field point out, this doesn’t scale very well.
If these kinds of powerful automated data wrangling tools are only really available for commercial purposes, I’m afraid that the current gap in data accessibility will not only persist, but grow over time. More commercial data producers and consumers will learn how to make use of them, and will dedicate financial resources to doing so, knowing that they’ll be able to reap financial rewards, while folks working in the public interest, trying to create universal public goods with public data and open source software, are left behind, struggling with messy data forever.
So, how can we work from the published literature, and existing open source research projects to build actively maintained community tools that fill this niche? What kinds of organizations and stakeholders would benefit from these tools? What kind of organizations would be interested in supporting this kind of work? What other major open datasets beyond energy and climate need this kind of treatment? How can we follow in the footsteps of NumPy, SciPy, Pandas, Scikit-learn, Airflow, Dask, Project Jupyter, and all the other community tools that make up the amazing PyData ecosystem?