Data Churn and Data Versioning

Jay Ulfelder has a very nice post about the problems that applied researchers face when working with data that changes rapidly in its availability and production. I agree with the suggestions he proposes (which were, very roughly, 1. modularity in applied uses, 2. transparency in generating data, and 3. awareness of the larger data ecosystem), and wanted to add my own. This is a lightly edited version of a comment that I left on his post.

The point that hit closest to my experience was about the contingent nature of a lot of this data. As the marginal cost and time needed to produce data fall (which is what I think is at the root of the acceleration Jay describes), the infrastructure surrounding data production gets smaller and smaller. That, in turn, means that there aren’t always solid institutional homes for the data in the way that there were for the older “artisanal, hand-crafted” datasets (COW and ICPSR, for instance). While these light startup costs are great for spurring innovation, they also mean that data projects can fall apart just as quickly as they come together. That leaves a lot of people who were counting on the data in the lurch, as I, and I suspect many data and political science people, experienced in January.

The churn in the data itself over time is thankfully something I haven’t had to deal with yet, but as innovation accelerates and anyone can contribute improvements to coding software on GitHub, it’s going to be something we have to figure out how to accommodate. It’s great to improve accuracy and add new features, but for production uses of the data, including forecasts, improvements can’t break the old models.
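One way to make "can't break the old models" concrete is to check each new release against the columns an existing model reads. The following is only a minimal sketch under stated assumptions: the column names, the CSV layout, and the `missing_columns` helper are all hypothetical and not taken from any real event dataset.

```python
import csv
import io

# Hypothetical columns that an existing forecasting model reads.
REQUIRED_COLUMNS = {"date", "source", "target", "code"}


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    return required - header


# Made-up releases of the same dataset, for illustration only.
old_release = "date,source,target,code\n2014-01-01,USA,RUS,042\n"
# Adding a column leaves the old model's inputs intact.
additive_release = "date,source,target,code,confidence\n2014-01-01,USA,RUS,042,0.9\n"
# Renaming a column removes an input the old model depends on.
renamed_release = "date,actor1,target,code\n2014-01-01,USA,RUS,042\n"

for name, text in [("additive", additive_release), ("renamed", renamed_release)]:
    missing = missing_columns(text, REQUIRED_COLUMNS)
    status = "compatible" if not missing else "breaks old models, missing: %s" % sorted(missing)
    print(name, "->", status)
```

A check like this only covers the shape of the data. A change in how events are coded can still shift a model's results while leaving every column in place, which is where visible code changes and versioned releases come in.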

I think there are a few overlapping solutions to these two problems.

One step is to make everything about the process as open as possible. This does a few things: it distributes the code so anyone can pick things back up if the coding project fails completely, it makes any changes to the code visible so you can figure out exactly how the new version is different from the old, and more broadly it lets users peer into what would otherwise be a black box and see how coding decisions are made. I think people’s default position should be skepticism toward data generating systems they can’t inspect openly.

A second step is to build institutional frameworks around data-generating systems. One approach that Jay alludes to is the Open Event Data Alliance (note: I’m involved in setting it up but I don’t speak on its behalf). Having an institution run the system, rather than an individual, makes the whole process more open and less about personalities, and helps solve the indelicately put “bus problem” (i.e., if all of the methods for generating a dataset are only in one person’s head, what happens if they get hit by a bus?). OEDA also plans to provide gold-standard codings and guidance on how to maintain and compare event datasets.

Finally, I think producers and users of datasets should get comfortable versioning their datasets in the same way that software developers version their releases. Bleeding-edge features can be incorporated into newer versions so that people who want them get quick access, but people using the data for forecasting or for comparison over time should have access to stable releases and be confident that those releases will remain available for years. Maintaining, hosting, and explaining those versions is another role an organization like OEDA should play.
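To illustrate what software-style versioning could look like for a dataset, here is a minimal sketch, assuming releases are tagged with major.minor.patch version strings and published alongside a checksum. The `Release` record, the `latest_stable` and `verify` helpers, and the sample data are all hypothetical; this is not a description of how OEDA or any existing project manages its releases.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Release:
    version: str   # "major.minor.patch", e.g. "1.1.0"
    stable: bool   # True for releases meant for long-term use
    sha256: str    # checksum of the released data file


def checksum(data: bytes) -> str:
    """Hex SHA-256 digest of a release's contents."""
    return hashlib.sha256(data).hexdigest()


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn "1.10.0" into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_stable(releases: List[Release], major: int) -> Release:
    """Pick the newest stable release within one major version."""
    candidates = [
        r for r in releases
        if r.stable and parse_version(r.version)[0] == major
    ]
    if not candidates:
        raise LookupError("no stable release with major version %d" % major)
    return max(candidates, key=lambda r: parse_version(r.version))


def verify(release: Release, data: bytes) -> bool:
    """Check that downloaded data matches the published checksum."""
    return checksum(data) == release.sha256


# Made-up release contents, for illustration only.
old_data = b"date,source,target,code\n2014-01-01,USA,RUS,042\n"
new_data = old_data + b"2014-01-02,FRA,DEU,036\n"

releases = [
    Release("1.0.0", True, checksum(old_data)),
    Release("1.1.0", True, checksum(new_data)),
    Release("2.0.0", False, checksum(b"experimental")),  # bleeding edge
]

# A forecasting pipeline pins itself to major version 1.
pinned = latest_stable(releases, major=1)
print(pinned.version, verify(pinned, new_data))
```

Pinning to a major version lets a forecasting pipeline pick up compatible updates while ignoring the experimental 2.0.0 line, and the checksum lets a user confirm that the file they downloaded is the release they think it is.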
