Much of my work involves improving large-scale systems to extract political events from text (see code from our NSF project on the subject here). These systems are designed for full production use over many hundreds of sources, covering both daily updates and historical text, in many dozens of event categories, including protests, armed conflict, statements, arrests, and humanitarian aid. Sometimes, though, you just need to find a very specific type of event quickly. In my case, I needed to find reports of military offensives in Syria for a paper on geolocating events in text, applied to the causes of civilian victimization. I thought I would try my favorite NLP library, spaCy, to see how quickly I could build a custom event extractor for announcements of new military operations. I’m fortunate that the text I’m working with is relatively homogeneous, but I was still surprised at how quickly it came together.
I first imported spaCy and read in some text I’d scraped from a pro-government Syrian newspaper that is completely slanted but does report battlefield information in great detail:
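    import json

    import spacy

    # Load the large English model (includes the dependency parser)
    nlp = spacy.load("en_core_web_lg")

    # Read the scraped articles and keep only the body text of each
    with open("scraped.json", "r") as f:
        news = json.load(f)
    news = [i['body'] for i in news]

    # Parse all articles in a batch
    processed_docs = list(nlp.pipe(news))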
I then wrote a function that traverses the dependency parse that spaCy returns, looking for an “event” defined by two hand-made lists of verbs and direct objects. It checks each root's lemma against the verb list, and then checks the lemma of that root's direct object against the list of direct objects. If both match, it returns the root word of the sentence.
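    verb_list = ["launch", "begin", "initiate", "start"]
    dobj_list = ["attack", "offensive", "operation", "assault"]

    def detect_event(doc, verb_list, dobj_list):
        """Return the first root verb whose direct object marks a new operation, else None."""
        for word in doc:
            # The sentence root must be one of the "launch"-type verbs
            if word.dep_ == "ROOT" and word.lemma_ in verb_list:
                for subword in word.children:
                    # Its direct object must be one of the "offensive"-type nouns
                    if subword.dep_ == "dobj" and subword.lemma_ in dobj_list:
                        return word
        return None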
A second function finds the subject of the root word and returns its subtree. If the dependency parse is correct, this is the “actor” of the sentence:
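    def actor_extractor(root):
        """Return the full text of the root's nominal subject, or None if it has none."""
        for child in root.children:
            if child.dep_ == "nsubj":
                # Join the whole subject phrase, keeping the original spacing
                return ''.join(w.text_with_ws for w in child.subtree).strip()
        return None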
Putting it all together:
    for doc in processed_docs:
        root = detect_event(doc, verb_list, dobj_list)
        if root:
            actor = actor_extractor(root)
            if actor:
                print("actor: ", actor, "root: ", root)
        else:
            print("No event detected")
The results are quite good, correctly identifying events and actors…
    Dara'a, Syria (6:06 P.M.) - The Free Syrian Army's Southern Front alliance launched a new offensive in the relatively calm Dara'a province aimed at recapturing the abandoned base, Abu Kaser checkpoints, Abu Madi Farms, Tawil Farms, Atisah Farms, and cutting the Dara'a-Damascus highway.
    event detected!
    actor: The Free Syrian Army's Southern Front alliance

    Short after Jaish Al-Fateh militants advanced in the Ramoussah Artillery Base, jihadist militants from Fateh Halab coalition launched a massive attack on the Ramoussah neighborhood.
    event detected!
    actor: jihadist militants from Fateh Halab coalition
…and not matching on sentences that don’t report new offensives, even if they mention the term:
    Following this development, many units from the 1st, 4th, and 7th Divisions can now be relocated to more pressing fronts like the nearby Beit Jein pocket where a massive offensive is anticipated to begin soon.
    No event detected
I did find one false positive, but I’d take this accuracy any day:
    The terrorist group began their offensive by launching two VBIEDs (Vehicle Borne Improvised Explosive Device) at the Syrian Arab Army's defenses, causing a massive explosion that killed a dozen soldiers in the process.
    event detected!
    actor: The terrorist group
The next step would be to classify the actors into useful groups (government, FSA, ISIS, Jabhat al-Nusra, etc.), and in this case a simple string matching step should work pretty well.
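As a rough illustration of what that string matching step might look like, here is a minimal sketch. The `ACTOR_KEYWORDS` dictionary and the `classify_actor` function are hypothetical names, and the keyword lists are illustrative guesses rather than a vetted coding scheme; a real version would need terms drawn from the actual corpus.

```python
import re

# Hypothetical keyword lists (assumptions for illustration only).
# Groups are checked in order, so put more specific groups first.
ACTOR_KEYWORDS = {
    "government": ["syrian arab army", "government forces"],
    "FSA": ["free syrian army", "fsa"],
    "ISIS": ["isis", "islamic state"],
    "Jabhat al-Nusra": ["jabhat al-nusra", "al-nusra"],
}

def classify_actor(actor_text, keywords=ACTOR_KEYWORDS):
    """Map an extracted actor string to a coarse group, or "unknown"."""
    text = actor_text.lower()
    for group, terms in keywords.items():
        for term in terms:
            # Word boundaries stop "isis" from matching inside "crisis"
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                return group
    return "unknown"

print(classify_actor("The Free Syrian Army's Southern Front alliance"))
print(classify_actor("jihadist militants from Fateh Halab coalition"))
```

Actors that match none of the lists fall through to "unknown", which gives a quick way to see which names still need to be added.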
This approach is not going to give you the range of event types that existing tools, with their very laboriously developed dictionaries, can produce, but very focused event extraction systems may actually be a better fit for many researchers.
You can find a Gist of the full code here.