By John Beieler and Andy Halterman
The two of us, along with Phil Schrodt, Patrick Brandt, Erin Simpson, and Muhammed Idris, have been working on several interrelated projects that we believe will improve the availability and quality of event data. We’ve discussed these projects formally at ISA and informally at MPSA and elsewhere, but we think these issues are important enough to bear repeating here. The four projects are PETRARCH, Phoenix, EL:DIABLO, and the Open Event Data Alliance (OEDA). (Fun game: which of these are acronyms, backronyms, and regular words?)
PETRARCH is a new Python-based coder designed to replace TABARI. PETRARCH uses Stanford’s CoreNLP to do deep parsing of news text and should be much more accurate and extensible than TABARI. The software is running but under active refinement, and we hope to have publicly available data sometime in the fall. The three big advantages of PETRARCH are the deep parsing of sentences (as Phil has said before, parsing should be done by computational linguists, not political scientists), the ability to easily add or swap out modules, and the much greater ease of reading the code.
Phoenix: The new dataset we’re creating with PETRARCH we’re calling Phoenix. Because it’s difficult to get legal access to historical news articles, Phoenix currently includes only data from 2014. One of our top priorities is securing access to historical data so we can run Phoenix backward, too. We’re happy to hear any ideas regarding sources for historical material. As part of the process of developing new event data, and because of large controversies in the wider event data community, we’ve thought a lot about the importance of making the entire data-generating process transparent. This transparency extends both to the specifics of how this particular dataset is created and to greater openness about the origins, accuracy, and objectives of the dataset.
Technical Transparency and EL:DIABLO
As we’ve been building a new coding system from the ground up, we’ve had to grapple with the inordinate impact that many small decisions can have on the resulting data. Given this, we have tried to exhaustively document our decisions. Which sources are used? Just the first sentence or the whole story? Are pronouns coreferenced? Does the software use a shallow or deep parse? What about syntactic dependency parsing? What are the underlying dictionaries? How are sentences segmented? (How) are events deduplicated? And that’s not even mentioning geolocation.
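One way to make decisions like these auditable is to record them in a single machine-readable object and publish a fingerprint of it alongside the data. The sketch below is only an illustration of that idea, not how PETRARCH or Phoenix actually store their settings. The class name `CodingDecisions`, its fields, and the example values are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class CodingDecisions:
    """Hypothetical record of the choices that shape an event dataset."""
    sources: Tuple[str, ...]        # which news sources are scraped
    sentences_coded: str            # "first" or "all"
    coreference: bool               # are pronouns coreferenced?
    parse_type: str                 # "shallow" or "deep"
    dictionaries: Tuple[str, ...]   # names of the underlying dictionaries
    deduplication: str              # how duplicate events are handled


def fingerprint(decisions: CodingDecisions) -> str:
    """Return a SHA-256 hash of the decisions, stable across key order."""
    payload = json.dumps(asdict(decisions), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Placeholder values for illustration only, not real Phoenix settings.
    example = CodingDecisions(
        sources=("example-wire-service",),
        sentences_coded="first",
        coreference=False,
        parse_type="deep",
        dictionaries=("example-verbs", "example-actors"),
        deduplication="one event per source-target-code per day",
    )
    print(json.dumps(asdict(example), indent=2, sort_keys=True))
    print(fingerprint(example))
```

Two datasets built with the same recorded decisions share a fingerprint, and changing any single answer changes it, which gives users a quick check on whether two runs are comparable.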
We’ve come to the conclusion that event data generation should be 100% transparent so that users can accurately assess any biases or issues in the data. As such, we’re releasing our scraping and coding environment as EL:DIABLO (“Event/Location Data In A Box, Linux Option”), an attempt to share our event-data generating process with the world. Unfortunately, it seems to be a settled matter that most of the source texts used to generate event data cannot be shared. What we can do, however, is lay out the process used to 1) obtain source data, 2) format the source data, 3) run the data through a coder, and 4) perform post-processing (geolocation, feature extraction).
At its core, EL:DIABLO is a shell script that sets up a virtual machine (a computer-within-a-computer) with the exact software and settings that we use to generate event data. The main pieces of software included are a web scraper, the processing pipeline (formatting, loading into a database), and the event coder (PETRARCH or TABARI).
This environment allows two things. First, if a person runs the process on the same day we do, the output should be exactly the same. This enables complete transparency of the data-generating process and lets people audit how our data is generated. Second, this setup allows others to swap various components of the processing pipeline. Want to use a different set of dictionaries? That’s a matter of changing three files. Have a different coder you want to try out? As long as it can read text from a file or, in the future, accept input as JSON, it only requires changing two or three lines of code in a Python script.
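To make the idea of swappable components concrete, here is a minimal sketch of a pipeline in which the coder is just a function passed in as an argument. This is not the actual EL:DIABLO code. The function names, the story and event dictionaries, and the toy keyword coder are all assumptions made for illustration, and the scraping and geolocation steps are left out.

```python
import json
from typing import Callable, Dict, List

# Hypothetical shapes: a story has "id" and "text"; an event is a flat dict.
Story = Dict[str, str]
Event = Dict[str, str]
Coder = Callable[[Story], List[Event]]


def format_stories(raw_stories: List[Story]) -> List[Story]:
    """Formatting step: collapse whitespace and drop stories with no text."""
    formatted = []
    for story in raw_stories:
        text = " ".join(story.get("text", "").split())
        if text:
            formatted.append({"id": story["id"], "text": text})
    return formatted


def keyword_coder(story: Story) -> List[Event]:
    """Toy stand-in for a real coder; it only looks for the word 'protest'."""
    if "protest" in story["text"].lower():
        return [{"story_id": story["id"], "event": "PROTEST"}]
    return []


def null_coder(story: Story) -> List[Event]:
    """A second toy coder that never emits events, to show swapping."""
    return []


def postprocess(events: List[Event]) -> List[Event]:
    """Post-processing step: drop exact duplicates, keeping first occurrences."""
    seen = set()
    unique = []
    for event in events:
        key = json.dumps(event, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


def run_pipeline(raw_stories: List[Story], coder: Coder) -> List[Event]:
    """Format the stories, code each one, then post-process the events."""
    events: List[Event] = []
    for story in format_stories(raw_stories):
        events.extend(coder(story))
    return postprocess(events)


if __name__ == "__main__":
    stories = [
        {"id": "a", "text": "Students  protest in the capital."},
        {"id": "b", "text": "Markets were calm."},
    ]
    # Swapping coders is a one-argument change.
    print(json.dumps(run_pipeline(stories, keyword_coder), indent=2))
```

Because `run_pipeline` only requires that a coder accept a story and return a list of events, trying a different coder means changing the single argument passed in, which is the spirit of the "two or three lines" claim above.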
We believe that this has benefits not only for helping people understand our system, but also for creating a new generation of custom-built event datasets with common components that are modified for specific regions, periods, actors, and behavior. While TABARI is open source software, very few people have used it to generate their own data (Javier Osorio and Erin Baggott are two examples of people who have), in part because C++ is pretty dense code to read. We hope that PETRARCH+EL:DIABLO will drastically lower the bar for people to create their own datasets. It is important to note, however, that while the bar is lowered, there is still some technical knowledge required.
Organizational Transparency: The Open Event Data Alliance
Beyond technical transparency, we believe there’s also a need in the event data community for an organizational framework to help establish standards for creating and judging event data, and for making shared resources like gold standard codings and coding dictionaries available to newcomers. Our work on this front has centered around the Open Event Data Alliance (OEDA), a fledgling not-for-profit organization to host event datasets and the tools to produce and assess them.
Our hope is that OEDA will be a place where resources and ideas can be shared and standards developed. For example, it’s widely accepted that a “gold-standard” set of event data needs to exist for verification and assessment purposes. We can see OEDA serving that purpose. Other possibilities include licensing fairly expensive corpora such as the LDC Gigaword corpus, or even holding raw text and letting others run their coding engines on top of it, to comply with restrictions on transmitting source text [maybe…IANAL]. All of these activities will become increasingly important as more people generate their own datasets and we move away from the older paradigm of “one dataset to rule them all.”
Generating event data will never be perfect. We will always get things wrong, and there will always be places where performance gains can be had. We can allow others to explore the potential places where we go wrong, though. We will wrap up with an invitation. All of our work is free and available on Github. If you feel like you have something to contribute, we welcome your help. If you use EL:DIABLO but don’t understand what’s going on, let us know on Github or send us an email. If you would like to see something in event data that you haven’t seen before, or if you have thoughts on what role OEDA should play, please get in touch. If you think we’re missing an important source (https://github.com/openeventdata/scraper/blob/master/whitelist_urls.csv) for Phoenix, fork us on Github or send us a note. If something is broken we will work to make it better, and we welcome your help.