I’ve been working a lot on automated geocoding of text over the last six months, and I’ve found myself repeatedly describing the same set of tasks for extracting location information from text. Here are some quick thoughts on how to schematize these geolocation tasks, how they relate to each other, and where I think the future of the research lies.
I’ll have another post later about how I and other people at Caerus developed a new approach to completing one of these tasks.
From what I’ve seen, there are three discrete tasks in automatically geocoding news articles and news-like text:
- named entity recognition
- geographic name resolution
- extracting meaning and locating events
Step one is to figure out, given a sentence, which of the words are place names. Consider the following fictitious sentence: “German Chancellor Angela Merkel was in France yesterday, meeting with French officials to discuss the civil war in Syria, especially the fighting around the city of Aleppo.” A good named entity recognition (NER) system would recognize that “France”, “Syria”, and “Aleppo” are place names, while “German”, “Angela Merkel”, and “French” are not. From my experience, this is close enough to a solved problem. Several generic NER systems do this well in most contexts (Stanford’s CoreNLP and MIT Lincoln Lab’s MITIE are the two I’m most familiar with). I’ve also had success training custom NER models for place names on corpora where the out-of-the-box models fail (for example, on text with lots of transliterated Arabic place names or on text that doesn’t look like the news text they were trained on). The effort involved in making a custom-trained MITIE NER model is much less than it would seem, since MITIE learns quickly and the browser-based MITIE trainer tool makes tagging entities in text relatively painless.
The second step in document geocoding is, given a list of place names, figuring out which location on earth each one refers to. Using the example from above, should “Aleppo” be understood as the city in Syria, the governorate in Syria, or the township in Pennsylvania? The best existing approach to this problem that I’ve come across is Berico Technologies’ CLAVIN, although the exact mechanism it uses remains hazy to me, since my comprehension of Java is poor. Although it’s the best I’ve found, CLAVIN’s accuracy is not always great, and it can be difficult to change things under the hood. The place name resolution problem is not easy and there’s no obvious approach to it. Here are a few of the approaches I’ve tried:
- Default to the place with the greatest population. This is basically a “bet the base rate” approach, which would correctly discard Aleppo Township but would make the system default to Aleppo Governorate rather than Aleppo City. And geographic gazetteers, such as the geonames.org dataset I use, don’t consistently include population data for all locations.
- Use the locations that minimize the distance between places in an attempt to make it as internally consistent as possible. Coding “Syria” to the country and “Aleppo” to the city in Syria would be a good guess here, but could be thrown off by other place names mentioned in the text (France) and alternative names that could be closer to each other than the correct ones. This approach can bias the system toward obscure towns that happen to be located near the rest of the places, and can get exponentially computationally expensive as the number of possible places increases.
- Resolve place names to locations by using the context of the story. The approach I’ve settled on is to mimic the way a human would locate place names as much as possible. I extract all of the place names from the article and use word2vec to calculate the semantically closest country to the bag of place names. That country is used to filter the second stage search for matching place names. This solves the Aleppo, Syria vs. Aleppo, Pennsylvania problem, but doesn’t on its own solve the city vs. governorate problem. More on all of this in a future post.
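As a rough illustration of the three approaches above, here is a minimal Python sketch over a toy gazetteer. Everything in it is an assumption made for illustration: the `Candidate` class, the `GAZETTEER` entries, the rough coordinates, and the placeholder populations are made up rather than taken from geonames.org, and the country code passed to `by_country` stands in for the output of the word2vec step, which isn’t shown.

```python
from dataclasses import dataclass
from itertools import product
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Candidate:
    name: str
    kind: str
    country: str
    lat: float
    lon: float
    population: int


# Toy gazetteer: coordinates are rough and populations are placeholder
# round numbers chosen for illustration, not real figures.
GAZETTEER = {
    "Aleppo": [
        Candidate("Aleppo", "city", "SY", 36.20, 37.16, 2_000_000),
        Candidate("Aleppo Governorate", "governorate", "SY", 36.25, 37.50, 4_000_000),
        Candidate("Aleppo Township", "township", "US", 40.53, -80.14, 2_000),
    ],
    "Syria": [
        Candidate("Syria", "country", "SY", 35.0, 38.0, 20_000_000),
    ],
}


def haversine_km(a, b):
    """Great-circle distance between two candidates, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def by_population(name):
    """Approach 1: bet the base rate and take the most populous candidate."""
    return max(GAZETTEER[name], key=lambda c: c.population)


def by_min_distance(names):
    """Approach 2: try every combination of candidates and keep the one with
    the smallest total pairwise distance. The number of combinations is the
    product of the candidate counts, which is why this gets expensive fast."""
    best, best_cost = None, float("inf")
    for combo in product(*(GAZETTEER[n] for n in names)):
        cost = sum(haversine_km(a, b) for i, a in enumerate(combo) for b in combo[i + 1:])
        if cost < best_cost:
            best, best_cost = combo, cost
    return dict(zip(names, best))


def by_country(name, country):
    """Approach 3, second stage only: keep candidates in the country that the
    story context points to (the country code is assumed to be given)."""
    return [c for c in GAZETTEER[name] if c.country == country]
```

On this toy data, `by_population("Aleppo")` returns the governorate, `by_min_distance(["Aleppo", "Syria"])` picks a Syrian candidate rather than the Pennsylvania township, and `by_country("Aleppo", "SY")` still leaves both the city and the governorate, which is the ambiguity the country filter doesn’t resolve on its own.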
The third step consists of extracting some sort of human-like meaning from the correctly located geographic points, and it’s one that’s still wide open for research. This task can go one of several ways. One way of extracting meaning from geographic points is the one taken by MIT’s Center for Civic Media with their CLIFF software, which I’ve written about before. CLIFF tries to figure out, given a list of correctly resolved and disambiguated place names (which it gets from CLAVIN), which place or places a news story is “about.” The process for doing this isn’t entirely clear to me (since my Java skills are poor), but it involves looking at the co-occurrence of place names and which ones occur earlier in the story. CLIFF is better than a bag of locations, but it isn’t perfect at what it does, and it doesn’t attempt the second thing you could do at this third level of geocoding, namely, locating events.
Determining where an event described in a news report occurred is obviously interesting for political scientists, especially those interested in event data. This task would require determining, from a list of correctly resolved place names and a list of automatically coded political events, where those events took place. The question of where events described in text took place is very different from the question of what places the text is “about”. In the example sentence at the top, the sentence is equally “about” Syria and France, but the real event described is a meeting that took place in France (on the subject of Syria).
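To make the “aboutness” idea a little more concrete, here is a toy scoring sketch that rewards countries mentioned often and early. This is not CLIFF’s algorithm, which I haven’t read closely; the `focus_scores` function, its weighting, and the character offsets are all hypothetical choices for illustration.

```python
from collections import defaultdict


def focus_scores(mentions, text_length):
    """Score each country by how often and how early it is mentioned.

    mentions: list of (country_code, character_offset) pairs for place
    names that have already been resolved to a country.
    """
    scores = defaultdict(float)
    for country, offset in mentions:
        # 1.0 for a mention at the very start, approaching 0.0 at the end.
        earliness = 1.0 - offset / text_length
        # One point per mention plus an arbitrary bonus for earliness.
        scores[country] += 1.0 + earliness
    return dict(scores)


# Hypothetical offsets for "France", "Syria", and "Aleppo" in the example
# sentence, with Aleppo already rolled up to Syria.
mentions = [("FR", 40), ("SY", 120), ("SY", 160)]
scores = focus_scores(mentions, text_length=170)
focus = max(scores, key=scores.get)
```

With these made-up offsets the two Syrian mentions outweigh the single early mention of France. Note that a score like this says nothing about where the meeting itself happened, which is exactly the gap between “aboutness” and event location.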
So there you have it: three different kinds of tasks for geocoding documents, each building on the last. I think we’re doing well on the first and getting better on the second, but there’s still much to be done on the third (which is why we mostly just put dots on maps). Finally, let me note that this just doesn’t seem like a project that political scientists (even computationally inclined ones) are really best equipped to handle. If there’s anyone out there in computer science, geography, computational linguistics, or media studies who already has a good approach to these tasks, I’d love to hear about it.