CLIFF-up: Easy, Automated Geocoding of Text Documents

One of the most important trends in political science is the growth in subnational, geographically disaggregated quantitative research. The prerequisite for this research, of course, is having plentiful and high-quality georeferenced data. But the software for generating georeferenced data is often scarce, difficult to build, or hard to use. As part of my work with the Open Event Data Alliance to generate high-quality, freely available political event data, I’ve taken what I think is perhaps the best open source geocoding system for news text, MIT’s CLIFF, and packaged it into a virtual machine in the hope that anyone can set it up for their own use in a matter of minutes.

If you want to skip straight to setting up the VM, it’s hosted on GitHub here.

CLIFF takes news articles or other documents as input and uses Stanford’s CoreNLP to extract the people, organizations, and locations mentioned in the text. (We’re big fans of CoreNLP at OEDA: our event data coding pipeline relies heavily on it.) CLIFF then passes the documents through a modified version of Berico Technologies’ CLAVIN geoparser, which performs place name lookup and disambiguation based on the content of the text. CLAVIN uses the freely available GeoNames gazetteer. CLIFF then does a little additional work to establish which of the place names CLAVIN returns are the locations the text is “about.” (I confess that I don’t really understand the process behind this step.) Finally, CLIFF runs inside a Tomcat server, a Java web server, meaning that once it’s set up it can be accessed as a web service from other programs or from the browser. Here’s what it looks like:

Figure: CLIFF geolocation service, running in a virtual machine and accessed through the browser.
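
Because CLIFF runs as a web service, other programs can query it over HTTP in the same way the browser does. Here is a minimal Python sketch of what that might look like. The base URL, the `/parse/text` path, and the `q` parameter are assumptions about how the server is exposed on your machine, and `build_parse_url` and `geoparse` are hypothetical helper names, so check them against the CLIFF documentation for your version before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumption: where the CLIFF web service is reachable from the host machine.
# Adjust the host, port, and path to match your own install.
CLIFF_BASE = "http://localhost:8080/CLIFF-2.0.0"


def build_parse_url(text, base=CLIFF_BASE):
    """Build the request URL that asks the service to geoparse `text`.

    Assumes the service accepts the document in a `q` query parameter.
    """
    query = urllib.parse.urlencode({"q": text})
    return f"{base}/parse/text?{query}"


def geoparse(text, base=CLIFF_BASE, timeout=30):
    """Send `text` to a running CLIFF server and return the decoded JSON reply."""
    url = build_parse_url(text, base)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the virtual machine running and its port forwarded to the host, calling `geoparse("Protesters marched in Cairo on Friday.")` should return whatever JSON the service produces for that sentence. The sketch deliberately does not assume anything about the structure of that reply.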

CLIFF has many excellent features, but ease of setup is not one of them, at least for people like me who are not Java and Tomcat bosses. John Beieler and I each spent many hours over the summer trying, without success, to get it configured. I’ve finally got it running inside a barebones Ubuntu virtual machine. I’ve put the steps I used to set it up into a set of instructions that Vagrant can execute inside a virtual machine to configure it exactly as I have mine configured. Vagrant is a program that manages virtual machines, letting you set up virtual boxes easily from a provided configuration file. Everything you need to set up CLIFF is in my CLIFF-up repo here.

Now that we have CLIFF running reliably, we at OEDA will begin the process of evaluating it for accuracy and scalability for use in our event data pipeline. I hope you find it useful, and I welcome any comments you have about the installation process. Here are the instructions for setting it up, and you can report issues here.
