CAMEO Dictionary Coverage

Aug 10, 2016

I was going through the Petrarch2 dictionary code, working on live updating of dictionaries for the human coding interface, and ended up taking a look at the dictionaries’ coverage of different CAMEO event codes. This quick exercise revealed many codes with few or no dictionary entries and could be an example for how we monitor dictionary development in the future.

In practical terms, it highlights the limitations of using the lowest-level codes in dictionary-coded CAMEO-type event data, such as Phoenix, ICEWS, or GDELT. It’s possible that ICEWS and GDELT have more extensive dictionary entries, but their verb dictionaries are not public, so it’s difficult to know. Given the expense of generating verb dictionary updates, it’s unlikely that their coverage is much better than this.

Three caveats:

The correspondence between the number of dictionary entries and the number of events that actually occur can be weak. Event codes with many entries are likely to have them because of synsets, meaning that the events themselves might not actually be very common.
The number of dictionary entries may also not correspond to the number of coded events. Single entries (e.g. “SAID”) can be responsible for coding hundreds of events, while the numerous entries for other event codes may be so specific or obscure that they never result in a coded event.
Petrarch2, through its internal ontology, can combine codings to produce more complicated codings, especially in the categories dealing with statements, expressing intent, etc. See the Petrarch2 documentation for more information.

I processed the dictionary using Petrarch2’s read_verb_dictionary function, which returns a dictionary of dictionaries. The top-level keys are the event codes, and the second-level keys are the dictionary entries. The code below recursively traverses the dictionary and counts the number of entries for each code.

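from collections import Counter

# read_verb_dictionary comes from Petrarch2; here it was modified to return VerbDict
p = read_verb_dictionary("CAMEO.2.0.txt")

def recursive_dict(d):
    # Walk the nested dictionary, collecting the event code of every entry
    for key, value in d.items():  # .items() works on both Python 2 and 3
        try:
            codes.append(value['code'])
        except KeyError:
            # No 'code' key at this level, so descend one level further
            recursive_dict(value)

codes = []
recursive_dict(p['phrases'])
Counter(codes)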
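The snippet above needs the modified Petrarch2 reader to run. As a minimal, self-contained sketch of the same traversal, the version below walks a small toy dictionary instead. The layout of `toy_phrases` and the name `iter_codes` are assumptions made for illustration; the real VerbDict structure may differ.

```python
from collections import Counter

def iter_codes(node):
    """Yield every 'code' value found anywhere in a nested dict."""
    for value in node.values():
        if not isinstance(value, dict):
            continue  # ignore non-dict leaves
        if 'code' in value:
            yield value['code']
        else:
            # No code at this level, so keep descending
            for code in iter_codes(value):
                yield code

# Hypothetical toy structure, NOT the actual Petrarch2 layout
toy_phrases = {
    'SAY': {'#': {'code': '010'}},
    'PROMISE': {
        'AID': {'#': {'code': '0331'}},
        'TROOPS': {'#': {'code': '0332'}},
    },
    'PLEDGE': {'AID': {'#': {'code': '0331'}}},
}

code_counts = Counter(iter_codes(toy_phrases))
```

Using a generator rather than appending to a global list keeps the traversal reusable, at the cost of being slightly longer.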

In total, there are 73,667 verb dictionary entries. We can then plot the number of entries for each low-level CAMEO code, grouped by CAMEO root code.
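One way to prepare the counts for plotting is to bucket them by root code. The sketch below assumes that codes are stored as plain CAMEO strings such as '0331', so that the first two characters give the root code; if Petrarch2 stores them in another form they would need converting first. The function name `group_by_root` and the counts in `toy_counts` are made up for illustration.

```python
from collections import Counter, defaultdict

def group_by_root(code_counts):
    """Group per-code entry counts under their two-digit CAMEO root code."""
    grouped = defaultdict(dict)
    for code, n in code_counts.items():
        grouped[code[:2]][code] = n
    return dict(grouped)

# Made-up counts, for illustration only
toy_counts = Counter({'030': 120, '0331': 250, '0343': 0, '130': 80, '1311': 0})

by_root = group_by_root(toy_counts)
for root in sorted(by_root):
    total = sum(by_root[root].values())
    print(root, total, sorted(by_root[root].items()))
```

Each value in `by_root` can then be passed to a plotting library as one panel per root code.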

CAMEO’s 03 category, for expressions of intent, is a fairly typical example of verb dictionary coverage. Several categories, including expressing intent to provide economic aid, intent to settle a dispute, and intent to cooperate militarily, have hundreds of unique patterns. The mid-level “not specified below” categories have fairly good coverage. However, several of the lowest-level codes, such as intent to release people, intent to provide rights, intent to de-escalate military tensions, and intent to change policy, have no dictionary entries. It’s possible that these are simply rare events in politics, but either way they won’t be picked up by the dictionary-based coder.

Express Intent

THREATEN events also show large gaps in coverage. It’s likely that many threaten events will be picked up by the highest level category (Threaten, not specified below), but the more detailed categories, especially those related to domestic politics, have no entries.

Threaten

The study of protest is one of the main uses of machine-coded event data, but again, many of the lowest-level categories have no dictionary entries and will be missed by the coder. Most of the missing low-level categories are the ones that distinguish the “why,” or purpose, of the protest.

Protest

Finally, military conflict and “assault” events are a major focus of users of event data. Six sub-events immediately stand out as having few to no dictionary entries. “Torture” and “assassinate” events are usually described as such, so a small number of dictionary entries probably covers most of these events in practice. “Use as a human shield” is likely to be a rare event in practice, so it’s understandable that it has few entries. “Kill by physical assault” is likely to be covered by a small number of patterns as well, and will overlap with “physically assault, not specified below.”

The two gaps that seem most troubling are “sexually assault” and “carry out roadside bombing”, each of which has no entries. These events are of substantive interest, and they are likely to be events that applied researchers would want to pull out of the event data. However, the lack of dictionary entries means that they will never be coded in the data.

Assault

This analysis reveals specific limitations of Phoenix, and potentially ICEWS and GDELT as well. While higher-level aggregations of events are likely to capture the broad patterns of events, researchers should be wary of using the lowest-level codes in the CAMEO ontology without first verifying that the verb dictionary can actually code them. More broadly, it reveals the limitations of any dictionary-based event coder. The coder can only produce events that (exactly) match an entry in the verb dictionary, and adding new entries to the dictionary is slow and difficult, which makes the approach inherently low recall.
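The verification step can be a simple lookup against the entry counts. The sketch below is one hedged way to do it: the helper `uncovered_codes`, the `min_entries` threshold, and the counts in `toy_counts` are all hypothetical, and the codes are assumed to be plain CAMEO strings.

```python
def uncovered_codes(codes_of_interest, code_counts, min_entries=1):
    """Return the codes whose dictionary entry count is below min_entries."""
    return [c for c in codes_of_interest
            if code_counts.get(c, 0) < min_entries]

# Made-up entry counts, for illustration only
toy_counts = {'180': 400, '1823': 35, '1822': 2}

wanted = ['180', '1821', '1822', '1823']

# Codes with no entries at all
missing = uncovered_codes(wanted, toy_counts)

# Codes with fewer than five entries, which may also deserve a closer look
thin = uncovered_codes(wanted, toy_counts, min_entries=5)
```

A low entry count is only a warning sign. As noted in the caveats, a single entry can account for many coded events, so thinly covered codes are worth checking by hand rather than discarding automatically.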

Andy Halterman

Assistant Professor, MSU Political Science

My research interests include natural language processing, text as data, and subnational armed conflict
