Word Order-Aware Text Processing: A Third Generation of Text-as-Data in Political Science


Text has always been a key source of data for political scientists. In the past five to ten years, new techniques have automated some components of text analysis, allowing researchers to classify, categorize, and find meaning in large collections of text. I argue that text as data in political science has moved through two distinct approaches and is on the cusp of a third. The first generation consisted of simple text matching operations, where documents were searched for keywords or phrases, or checked against hand-built dictionaries. The second generation introduced machine learning methods operating on documents represented as “bags-of-words”, where word order is discarded and documents are treated as counts of the words they use. The second generation has given us the staple techniques of text analysis in political science, including supervised document classification and topic modeling through latent Dirichlet allocation (Blei, Ng, and Jordan 2003) or the structural topic model (Roberts et al. 2013). The third generation will be characterized by word order-aware machine learning models, informed by research in natural language processing. These models may perform better than second-generation methods on some existing tasks, but more importantly, this “linguistic turn” in political text analysis will enable wholly new methods of extracting meaning from text.
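To make the bag-of-words limitation concrete, here is a minimal Python sketch. The two example sentences and the helper names `bag_of_words` and `bigrams` are hypothetical, chosen for illustration only, and bigrams stand in as the simplest possible order-sensitive representation; they are not the word order-aware models the abstract refers to.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercase whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


def bigrams(text: str) -> list[tuple[str, str]]:
    """Return adjacent token pairs, which retain local word order."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


# Hypothetical sentences: same words, opposite meaning.
doc_a = "the rebels attacked the army"
doc_b = "the army attacked the rebels"

# The two documents have identical word counts...
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)

# ...but differ once adjacent word pairs are taken into account.
same_bigrams = bigrams(doc_a) == bigrams(doc_b)

print(same_bag, same_bigrams)
```

Under a bag-of-words representation the two sentences are indistinguishable, even though they reverse who attacked whom. Any representation that keeps word order, however simple, can tell them apart.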

Andy Halterman
PhD Candidate

My research interests include natural language processing, text as data, and subnational armed conflict.