text data | Andy Halterman

text data

Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks

A bag of tricks for efficient, custom event data production, using transformer-based classifiers, question-answering models, Wikipedia entity linking, and active learning.

PLOVER and POLECAT: A New Political Event Ontology and Dataset

POLECAT is a new global event dataset for social science research, coded in the PLOVER event ontology.

Synthetically generated text for supervised text analysis

Synthetically generated text can help researchers address common issues in supervised text analysis.

Bootstrapping your way to active learning

Active learning is great, but what if you don’t already have a model? You can bootstrap your way to a machine learning model with majority-vote deterministic rules. Human-labeled data is often the primary bottleneck in building good machine learning models.
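As a rough illustration of what majority-vote deterministic rules could look like, here is a minimal sketch. The rule functions, keywords, and label names are hypothetical and chosen only for illustration; they are not taken from the post itself. Each rule either votes for a label or abstains, and a document gets a provisional label only when one label wins a strict majority of the votes cast.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion


# Hypothetical keyword rules; real rules would be tuned to the corpus.
def rule_protest_word(text):
    return "protest" if "protest" in text.lower() else ABSTAIN


def rule_demonstrators(text):
    return "protest" if "demonstrat" in text.lower() else ABSTAIN


def rule_earnings(text):
    lowered = text.lower()
    if "earnings" in lowered or "quarterly" in lowered:
        return "other"
    return ABSTAIN


RULES = [rule_protest_word, rule_demonstrators, rule_earnings]


def majority_vote(text, rules=RULES):
    """Return the label most rules agree on, or ABSTAIN on no votes or a tie."""
    votes = [label for label in (rule(text) for rule in rules) if label is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    # A tie between the top two labels is treated as "no label".
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


docs = [
    "Thousands of demonstrators joined the protest downtown.",
    "The company reported quarterly earnings.",
    "Shareholders protest quarterly earnings report.",
    "The weather was sunny.",
]
weak_labels = [majority_vote(doc) for doc in docs]
```

Documents that receive a label can seed a first model, while the abstentions and ties are natural candidates to send to a human annotator.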

Event Data in 30 Lines of Python

Much of my work involves improving large-scale systems to extract political events from text (see code from our NSF project on the subject here). These systems are designed for full production use across many hundreds of sources, covering both daily updates and historical text, in many dozens of event categories, including protests, armed conflict, statements, arrests, and humanitarian aid.

Making Event Data From Scratch: A Step-By-Step Guide

This tutorial covers how to create event data from a new set of text using existing Open Event Data Alliance tools. After going through it, you should be able to use the OEDA event data pipeline for your own projects with your own text.

CAMEO Dictionary Coverage

I was going through the Petrarch2 dictionary code, working on live updating of dictionaries for the human coding interface, and ended up taking a look at the dictionaries’ coverage of different CAMEO event codes.