| Andy Halterman

The 9th Annual New Directions in Text as Data Conference starts today in Seattle. This is the main conference for work at the intersection of political science and computer science on text. This seems like a good time to recap (belatedly) what I took from the text research at PolMeth this year.

  1. Text has arrived

Approximately 1/3 of the work at PolMeth this year was on text analysis. Of the three sessions running concurrently, one was almost always a text analysis session. As someone who works on text, this was very exciting.

  2. There were remarkably few topic model papers

Judging by the way text analysis is sometimes taught in political science (nice collection of syllabi), you’d think that text analysis in political science was synonymous with topic models. This was far from the case at PolMeth this year. Of the papers presented at PolMeth, only one (Phil Schrodt’s) was primarily about topic modeling, and it was firmly situated in the new application-focused track at PolMeth. While topic models are clearly useful and very effective for summarizing lots of text without needing to define categories ahead of time, they’re often not the right tool.

(Spirling here)

  3. Word embeddings have arrived

After radically remaking natural language processing in 2013,1 word embedding approaches have fully arrived in political science. Word embeddings represent words as vectors in a low-dimensional (50–300) space such that words used in similar contexts have similar positions in that space. What was interesting to see was how differently political scientists are using word embeddings from how computer scientists have used them. In computer science, after some early attempts to measure the performance of word embeddings on analogy or synonym tasks, much of the current research has become very pragmatic (good word embeddings are the ones that perform well as input features for another task). In contrast, political science researchers, perhaps informed by our field’s focus on topic models, are developing techniques for getting topics out of embeddings.
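The core idea — words as vectors whose proximity tracks similarity of context — can be sketched with cosine similarity. The vectors below are made-up toy numbers, not output from any trained model (real embeddings are 50–300 dimensional and learned from a corpus):

```python
import math

# Toy 4-dimensional "embeddings": illustrative made-up numbers only.
embeddings = {
    "president": [0.9, 0.1, 0.3, 0.2],
    "senator":   [0.8, 0.2, 0.4, 0.1],
    "banana":    [0.1, 0.9, 0.0, 0.7],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Words used in similar contexts end up close together in the space:
sim_political = cosine_similarity(embeddings["president"], embeddings["senator"])
sim_unrelated = cosine_similarity(embeddings["president"], embeddings["banana"])
assert sim_political > sim_unrelated
```

In trained embeddings this same similarity measure is what makes both uses of the technique work: computer scientists feed the vectors into downstream models, while the political science work described above clusters them to recover topic-like structure.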

  • Laila (measuring jihadi influence)
  • Ludovic (basically doc2vec with metadata: clever)
  • Burt and Mitchell (multiple languages)
  • Sarah (??)
  • maaaany posters
  4. The paradigm remains largely the same

Despite nice innovations in word embeddings and other techniques (e.g. Spirling’s), the dominant paradigm is still:

  • unsupervised
  • corpus summarization
  • bag of words

What it’s not:

  • supervised
  • syntactically informed
  • IE focused

The paradigm is still bag of words and document summarization: there’s still a wide world of information extraction out there.
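To make the contrast concrete, here is a minimal bag-of-words sketch using only the standard library. The example sentences are made up for illustration; the point is that word counts discard exactly the syntactic structure (who did what to whom) that information extraction cares about:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as word counts, discarding word order entirely."""
    return Counter(text.lower().split())

doc_a = "the senate passed the bill"
doc_b = "the bill passed the senate"

# Under bag of words these two sentences are indistinguishable,
# even though they describe different events:
assert bag_of_words(doc_a) == bag_of_words(doc_b)
```

An information extraction approach would instead parse the sentence to recover subject–verb–object relations, which is precisely what a count vector throws away.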

(People aren’t moving away from topic models because the task is changing; they’re just moving on to the new hotness.)

  • Maybe because it’s solved-ish? No more methodological innovation for political scientists? That doesn’t seem right. We have beliefs about which features matter.

  1. Mikolov et al. (2013), “Distributed Representations of Words and Phrases and their Compositionality.” ↩︎