Do the answers you get depend on the news you read? Protests and violence in Syria


Machine-produced event data from news text is a cheap, accurate, and useful source of empirical data for researchers in political science. Many quantitative analyses rely on English-language text rather than text in the local language. We investigate the relationship between protests in Syria and subsequent violence there, and demonstrate that the results differ substantially depending on whether the data are coded from English or Arabic news sources. Using a gold-standard, hand-coded dataset (Mazur 2018), we find a significant effect of protests on subsequent violence in the same locality. The result holds when using data coded from Arabic sources, but becomes insignificant when using English-language text. These results suggest that researchers should include local-language text when using automated text analysis to study subnational outcomes. We also offer guidance on which statistical classifiers perform best at detecting protest events in short text. While a neural network classifier using state-of-the-art BERT embeddings slightly outperforms other models and feature representations in Arabic, a simple random forest on a bag-of-words representation performs best in English.
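For readers unfamiliar with the simpler of the two classifier families mentioned above, the following is a minimal sketch of a random forest trained on bag-of-words counts using scikit-learn. It is not the pipeline used in the paper: the example sentences, labels, and hyperparameters are all hypothetical and chosen only to show the general shape of the approach.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy sentences invented for illustration (not the paper's data).
# Label 1 = describes a protest event, 0 = does not.
texts = [
    "Thousands of demonstrators marched through the city demanding reform",
    "Protesters gathered in the square after Friday prayers",
    "Students staged a sit-in protest at the university",
    "Crowds chanted slogans during a demonstration in the town",
    "The ministry announced new fuel prices on Monday",
    "Officials met to discuss the national budget",
    "The football team won its match on Saturday",
    "Heavy rain is expected across the region this week",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# CountVectorizer builds the bag-of-words matrix (one column per token);
# the random forest is then fit on those token counts.
# n_estimators and random_state are illustrative assumptions.
model = make_pipeline(
    CountVectorizer(lowercase=True),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(texts, labels)

new_texts = [
    "Protesters marched in the city square",
    "The government announced a new budget",
]
predictions = model.predict(new_texts)                # array of 0/1 labels
probabilities = model.predict_proba(new_texts)[:, 1]  # estimated P(protest)
```

On real data the labels would come from hand coding, and performance would be assessed on held-out text rather than on a handful of invented sentences.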

Supplementary materials

Andy Halterman
PhD Candidate

My research interests include natural language processing, text as data, and subnational armed conflict.