Making Event Data From Scratch: A Step-By-Step Guide
This tutorial covers how to create event data from a new set of text using existing Open Event Data Alliance tools. After going through it, you should be able to use the OEDA event data pipeline for your own projects with your own text. An upcoming second post will cover how to create new ontologies and implement new dictionaries to code them, for researchers asking questions not answerable with standard CAMEO event and actor categories. This tutorial assumes you have some exposure to event data and some familiarity with the command line. It also assumes that you are on a Linux machine using Python 2.7, but the instructions are relatively portable to Mac. If things go well, it should take about 3 hours to go from nothing to finalized event data.
What is event data?
Event data records political and social events in directed dyadic form, i.e., actor one did event type x to some other actor or target. The actors, events, and targets are represented in a defined ontology that maps text to a structured coding, which is what makes the events possible to work with as data. For example, in the dominant CAMEO ontology, the (made up) sentence “Ukrainian pensioners demonstrated outside parliament today, demanding cost-of-living increases” would be represented as source actor = UKR CIV, event type = 14 (protest), target actor = UKR LEG. Virtually every existing system for producing event data from text is automated. More news articles are published every day than could feasibly be hand coded, and automated systems allow data to be regenerated as they improve.
Conceptually, automated event data systems have two components: an ontology defining the events and actors of interest and their representation (“pensioner” = CIV), and a coder or classifier for mapping real input sentences to their representation in the ontology. The most prevalent ontology today is CAMEO and its various flavors, though many users of CAMEO are switching to PLOVER, a new ontology with coverage of some new event types, vastly simplified coding of other event types, and a more flexible system for extensions and modifications.
Most automated event coders now are rule-based: they use rules to decide which noun phrases are actors and which verb phrases are events, and then compare these chunks of text against lists of hand-defined rules for coding events and actors. The advantage of these systems is that the technology is mature, existing English-language dictionaries are fairly comprehensive, modifications are easy, and their behavior is somewhat predictable. (Also, they were possible 20 years ago when automated event data research began.)
Other systems in development use machine learning to create classifiers that learn rules themselves, rather than having them hand-coded (see, e.g., mjolnir). Machine learning based methods are probably the future for most applications, though dictionary-based coders have the advantage of being very easy to update and modify and quick to apply to new ontologies without having to hand label thousands of example sentences.
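To make the "ontology plus coder" idea concrete, here is a toy sketch of a dictionary-based coder applied to the example sentence above. The two dictionaries and the code_event helper are made up for illustration and are far simpler than the real CAMEO dictionaries or anything Petrarch2 does. In particular, a real coder would work out that the parliament is Ukrainian from context rather than from a hard-wired entry.

```python
from collections import namedtuple

# A directed dyadic event: who did what to whom.
Event = namedtuple("Event", ["source", "event_code", "target"])

# Toy dictionaries (hypothetical, tiny stand-ins for the real ones).
ACTOR_DICT = {
    "ukrainian pensioners": "UKR CIV",
    "parliament": "UKR LEG",  # simplification: country is hard-wired here
}
VERB_DICT = {
    "demonstrated": "14",  # CAMEO root code 14 = protest
}


def code_event(source_text, verb_text, target_text):
    """Look up each chunk in its dictionary; return None if any lookup fails."""
    source = ACTOR_DICT.get(source_text.lower())
    code = VERB_DICT.get(verb_text.lower())
    target = ACTOR_DICT.get(target_text.lower())
    if None in (source, code, target):
        return None
    return Event(source, code, target)


event = code_event("Ukrainian pensioners", "demonstrated", "parliament")
print(event)  # Event(source='UKR CIV', event_code='14', target='UKR LEG')
```

Updating a coder like this means editing a dictionary entry, which is the ease-of-modification advantage described above.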
Overview of tools
This tutorial covers the standard Open Event Data Alliance pipeline that has been producing the Phoenix event dataset daily for over three years. At the end of this tutorial, you should be able to go from a new, custom built news scraper to the dataset of events coded from it. The steps in the pipeline consist of:
- database (Mongo)
- scraper
- CoreNLP pipeline
- geolocation with Mordecai
- coding pipeline with Petrarch2
- output and analysis
Each section describes how to set up a component of the pipeline and includes an “under the hood” section explaining the step in more detail than is needed to just make it run.
Database
The first step in the process of going from scraped news text to event data is setting up the Mongo database the pipeline uses to store scraped articles. A good guide for Ubuntu 14.04 is the one from DigitalOcean, and quick googling will reveal many more OS-specific installation instructions. Once MongoDB is installed, you can make sure it is running by typing mongo at the console.
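If you would rather check from a script, a minimal sketch is to test whether anything is accepting connections on MongoDB's default port, 27017. This only shows that something is listening there, not that it is a healthy Mongo server, and the function name is my own.

```python
import socket


def mongo_is_listening(host="localhost", port=27017, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Connection refused, timed out, or host unreachable.
        return False
    conn.close()
    return True


if __name__ == "__main__":
    print("MongoDB reachable: %s" % mongo_is_listening())
```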
Database: Under the hood
Using a database gives us several advantages over storing everything as text files. Databases are much faster, especially when we get into the millions of articles. Later in the pipeline, when we need to process each article to prepare it for coding, it becomes much easier to know what we’ve processed and what we haven’t. Finally, the database lets us pull out just specific articles for coding (from source a in date range t1 to t2) or call up a specific article to read, to compare the text with the event extracted from it. We use Mongo partially out of path dependence and partially because it’s acceptably good at what we need: a way to store JSON objects using a schema that changes often for different use cases and as the pipeline evolves. In the Phoenix dataset production system, the Mongo database lives on a second server. Mongo has a tendency to hog all available memory, and this was causing problems for the other parts of the pipeline.
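As a rough sketch of what "source a in date range t1 to t2" and "processed or not" look like as a Mongo query, the helper below builds a filter document. The field names (source, date_added, stanford) and the "dw" source label are assumptions on my part, so check them against a real document from your collection before relying on them.

```python
from datetime import datetime


def articles_query(source=None, start=None, end=None, parsed=None):
    """Build a MongoDB filter document for scraped articles.

    The field names used here are assumed, not taken from the pipeline code.
    """
    query = {}
    if source is not None:
        query["source"] = source
    date_range = {}
    if start is not None:
        date_range["$gte"] = start  # on or after t1
    if end is not None:
        date_range["$lt"] = end     # strictly before t2
    if date_range:
        query["date_added"] = date_range
    if parsed is not None:
        # Assumed convention: 1 = already parsed by CoreNLP, 0 = not yet.
        query["stanford"] = 1 if parsed else 0
    return query


q = articles_query("dw", datetime(2017, 1, 1), datetime(2017, 2, 1), parsed=False)
print(q)
```

The resulting dictionary can be passed to a pymongo collection's find() method to iterate over the matching articles.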
Scraper
The next step in the pipeline is populating the database with scraped news articles. For this tutorial, I’ve written a custom scraper that downloads a small number of articles from Deutsche Welle with certain article tags. Download the scraper from here and save it as dw_test.py somewhere. Make sure you have the requirements installed (pip install BeautifulSoup requests), then run the scraper under the filename you saved it as:
python dw_test.py
This stores the articles in a database called event_scrape in a collection called dw_test. To verify that they’ve been correctly downloaded (the scraper should tell you how many have been), you can go to the console, enter the command mongo, and then from within the Mongo shell type
use event_scrape
db.dw_test.count()
to confirm the number of downloaded articles. To see what each article looks like, from the Mongo shell, run db.dw_test.findOne().
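The same check can be done from Python. This is a sketch that assumes pymongo 3.7 or newer (for count_documents) and a MongoDB server on the default local port. The helper and main function names are mine.

```python
def summarize_collection(collection):
    """Return (number of documents, sorted field names of one sample document)."""
    n_docs = collection.count_documents({})
    sample = collection.find_one() or {}
    return n_docs, sorted(sample.keys())


def main():
    # Imported here so the helper above can be used without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    collection = client["event_scrape"]["dw_test"]
    n_docs, fields = summarize_collection(collection)
    print("%d articles; fields: %s" % (n_docs, ", ".join(fields)))

# Call main() once MongoDB is running and the scraper has finished.
```

Listing the field names of one document is a quick way to see what the scraper actually stored before moving on to parsing.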
Scraper: Under the hood
This demo uses a custom one-off scraper to pull articles from a single site’s archive. Many researchers will want to use the pipeline in this way, especially if they want to produce historical data from a news archive. (For example, the Times of India archive is about as easy to scrape as it gets…). Other users, especially those producing event data for forecasting or daily monitoring, will want to use a more general scraper. An hourly general scraper is also useful for researchers building up a backfile, as many sites do not make their articles available in an archive. For large-scale scrapers, users have two good options:
- the OEDA scraper can download articles from the RSS feeds of a set of news sources. This scraper is very easy to set up but is not stable and not recommended for permanent, daily deployment.
- atlas, John Beieler’s much more robust news scraper, which currently powers the Phoenix dataset. It functions similarly to the old scraper, but is extremely robust and can stay up for months. The technical requirements are slightly higher, but this is the recommended solution for daily production scrapers.
The unifying component of all of these tools is code that formats the articles in a defined way and inserts them into the database. You can see this format in the DW scraper or here. Understanding that component in the code will allow you to write other custom scrapers or to load articles in text format from a disk into the database (e.g. VSS, which loads articles from a LexisNexis dump into Mongo).
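Here is a rough sketch of that formatting step for the "text files on disk" case. The field names are my assumption about what the pipeline expects, so compare them against the DW scraper before using this, and treat both function names as hypothetical.

```python
import datetime
import io
import os


def make_article_doc(text, title, url, source, date, language="english"):
    """Shape one article as a dictionary ready to insert into Mongo.

    Field names are assumed; verify them against the DW scraper.
    """
    return {
        "url": url,
        "title": title,
        "source": source,
        "date": date,
        "date_added": datetime.datetime.now(),
        "content": text,
        "stanford": 0,  # assumed flag meaning "not yet parsed by CoreNLP"
        "language": language,
    }


def docs_from_directory(path, source):
    """Yield one document per .txt file in path, using the filename as title."""
    for name in sorted(os.listdir(path)):
        if not name.endswith(".txt"):
            continue
        full_path = os.path.join(path, name)
        with io.open(full_path, encoding="utf-8") as handle:
            text = handle.read()
        # No publication date is known for a bare text file, so it is left empty.
        yield make_article_doc(text, title=name[:-4], url=full_path,
                               source=source, date="")
```

With pymongo, the generated documents could then be stored with something like collection.insert_many(list(docs_from_directory("articles/", "my_archive"))), where the directory and source label are placeholders.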
Stanford CoreNLP pipeline
The next step in the process is to run all of the downloaded articles through Stanford’s CoreNLP natural language processing tools. This set of tools labels each sentence with its grammatical syntax, which our event coder, Petrarch2, uses in the next step. Here, we use the easiest and most stable code for processing the articles with CoreNLP. Users with more than a million articles should consider distributed options like the Docker and RabbitMQ-based biryani, Databricks’ Spark-based CoreNLP, or UTD’s SPEC.
1. download CoreNLP
Run the following commands to download the version of CoreNLP used by the pipeline. (These steps are reproduced from the stanford_pipeline repo.)
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2014-06-16.zip
unzip stanford-corenlp-full-2014-06-16.zip
mv stanford-corenlp-full-2014-06-16 stanford-corenlp
cd stanford-corenlp
wget http://nlp.stanford.edu/software/stanford-srparser-2014-07-01-models.jar
2. download stanford_pipeline
Next, download the stanford_pipeline code by running
git clone https://github.com/openeventdata/stanford_pipeline
To install the Python packages needed for the pipeline, run pip install -r requirements.txt on the requirements file that’s in the stanford_pipeline repo.
3. customize
The stanford_pipeline directory has a default_config.ini file that needs to be customized before use. First, change the CoreNLP directory location from the default ~/stanford-corenlp to the full path to where you downloaded it. For me, this would be /home/ahalterman/stanford-corenlp. Next, change the collection to the name of the collection where our scraper put our stories, which is dw_test. The database name can stay the same. Finally, the range option controls whether the pipeline processes all the unparsed stories from the database or just the ones that were added in the last 24 hours. If you’re doing this tutorial all in one shot it shouldn’t make a difference, but if you took a day off after scraping, comment out range = today and uncomment range = all.
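Put together, the three edits might leave the file looking something like the sample embedded below. The section names and the placement of the range option are guesses for illustration, not copied from the repo's default_config.ini, so follow the layout of the real file. The snippet is written for Python 3 (on Python 2.7 the module is called ConfigParser and has no read_string) and simply parses the sample to confirm the values read back as intended.

```python
import configparser

# Hypothetical layout: section names and option placement are assumptions.
SAMPLE_CONFIG = u"""
[StanfordNLP]
stanford_dir = /home/ahalterman/stanford-corenlp

[Mongo]
db = event_scrape
collection = dw_test
# range = today
range = all
"""


def load_settings(ini_text):
    """Parse the INI text and return the settings this tutorial changes."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {
        "stanford_dir": parser.get("StanfordNLP", "stanford_dir"),
        "db": parser.get("Mongo", "db"),
        "collection": parser.get("Mongo", "collection"),
        "range": parser.get("Mongo", "range"),
    }


settings = load_settings(SAMPLE_CONFIG)
print(settings)
```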
4. run!
Finally, from inside stanford_pipeline/, run python process.py to parse all the articles in the database. You may see up to a minute of startup errors ([111]), but these can be disregarded as long as it gets there eventually. You should then see it begin to process stories. Once the parsing process is complete, you can once again start the mongo shell and enter db.dw_test.findOne() to see what the parse info looks like.
Parsing: Under the hood
OEDA’s event data tools make heavy use of advances in natural language processing technology and computational linguistics. Rather than comparing dictionary entries to all parts of the sentence, we can ensure we are comparing the verb dictionary to verb phrases and the actor dictionary to noun phrases. To parse the sentence and get the information on sentence structure, we use Stanford’s CoreNLP natural language processing tool.
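To see why the parse helps, here is a small sketch that reads a bracketed constituency parse and pulls out the noun phrases and verb phrases. The parse string is hand-written for a shortened version of the example sentence, not real CoreNLP output, and the three functions are illustrative rather than anything taken from Petrarch2.

```python
def parse_tree(text):
    """Parse a Penn Treebank style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    tokens.reverse()  # so that pop() takes tokens from the front

    def read_node():
        tokens.pop()  # opening "("
        label = tokens.pop()
        children = []
        while tokens[-1] != ")":
            if tokens[-1] == "(":
                children.append(read_node())
            else:
                children.append(tokens.pop())  # a leaf word
        tokens.pop()  # closing ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words under a node, in sentence order."""
    words = []
    for child in node[1]:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


def phrases(node, wanted):
    """Collect the text of every subtree labeled `wanted`, outermost first."""
    found = []
    if node[0] == wanted:
        found.append(" ".join(leaves(node)))
    for child in node[1]:
        if isinstance(child, tuple):
            found.extend(phrases(child, wanted))
    return found


PARSE = ("(ROOT (S (NP (JJ Ukrainian) (NNS pensioners)) "
         "(VP (VBD demonstrated) (PP (IN outside) (NP (NN parliament))))))")
tree = parse_tree(PARSE)
print(phrases(tree, "NP"))  # ['Ukrainian pensioners', 'parliament']
print(phrases(tree, "VP"))  # ['demonstrated outside parliament']
```

The actor dictionary only needs to be checked against the two noun phrases, and the verb dictionary against the verb phrase.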
Geolocation with Mordecai
UPDATE (August 2018): The instructions below reference an outdated version of Mordecai. The setup they reference is still available here, but this is no longer the most up-to-date and accurate model. Advanced users could write a REST wrapper around the new Mordecai 2.0 and incorporate it into the pipeline.
The pipeline requires a running geolocation service in order to infer the location of events in text. Follow the “Simple Docker Installation” instructions at https://github.com/openeventdata/mordecai/tree/legacy-docker to install and begin running Mordecai. (The installation instructions may change in the next couple of months, so I won’t reproduce them here.) This step requires installing Docker, a lightweight container system. Instructions on installing Docker on your system can be found here. Note that Mordecai is quite large, so you may want to run it on a hefty computer. Downloading the models and starting the service are both slow, though the actual geocoding should be fast once it’s up and running.
Phoenix Pipeline
The final step in producing the data is to run the phoenix_pipeline, which ties together the database, geolocation, the Petrarch2 event coder, and postprocessing. First, download the phoenix_pipeline repo:
git clone https://github.com/openeventdata/phoenix_pipeline
To install the requirements needed for the pipeline, from inside the directory run
pip install -r requirements.txt
Next, we need to change the configuration file to tell the pipeline where to pull stories from. In the [Mongo] section of PHOX_config.ini, leave db as event_scrape but change collection to dw_test. Once you start the pipeline, if everything has gone well, you should see stuff streaming across the terminal and some helpful messages.
Petrarch2: Under the hood
Petrarch2 has two fundamental tasks: to identify which parts of an input sentence are the source actor, event/action description, and target actor; and to match the extracted text against a set of dictionaries to determine the correct event coding. The CoreNLP output is integral to the first task. Petrarch2 has a set of rules that use the constituency parse information about noun and verb phrases to identify actors and event descriptions. The dictionaries are hand-created mappings of text to their representation in the CAMEO ontology. The dictionaries can be seen here, and part 2 will cover them in more detail. For more information about how Petrarch2 works, see the pdf in the Petrarch2 repo.
Final steps
If you’ve made it here, good job! After the pipeline finishes running, you should see a CSV output file. You can read in the file using phoxy, a package for reading in and working with the Phoenix dataset, or just as you would work with any other CSV.
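As an example of the "any other CSV" route, the sketch below tallies events by CAMEO root code, which is the first two digits of the event code. The column names and sample rows are invented for illustration. Check the header and delimiter of your actual output file, which may not have a header row at all. This version is written for Python 3.

```python
import csv
import io
from collections import Counter

# Made-up sample standing in for the pipeline's output file.
SAMPLE = u"""date,source,event_code,target
20170101,UKRCIV,141,UKRLEG
20170101,USAGOV,010,RUSGOV
20170102,UKRCIV,145,UKRGOV
"""


def count_by_root_code(handle, code_column="event_code"):
    """Count rows per CAMEO root code (the first two digits of the event code)."""
    reader = csv.DictReader(handle)
    return Counter(row[code_column][:2] for row in reader)


counts = count_by_root_code(io.StringIO(SAMPLE))
for root_code, n in sorted(counts.items()):
    print(root_code, n)
```

For a real file, replace the StringIO object with an open file handle and set code_column to whatever the event code column is called in your output.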
Changes made
I made some changes to the code referenced in this tutorial to make it a little easier to use. You can see the changes at: