Campaign analysis, a technique for understanding the outcomes of military operations, is an important tool for scholars of security studies. As Rachel Tecott and I argue in our International Security article, campaign analysis involves specifying a model of a particular military operation, and a set of inputs that the model uses to generate outputs.
This is Part 1 of a two-part series. Stay tuned for Part 2, which will cover numpy, pandas, and scikit-learn.
R is an extremely powerful language for data analysis, and probably the best language for working with tabular data, running regressions, and making visualizations.
As political scientists, we are often interested in using text to understand the actions of political actors. Thankfully, we have a growing set of tools for identifying political actors in text, including named entity recognition and dependency parsing, custom event models, and hand labeling of events in text.
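As a minimal sketch of the named entity recognition piece (assuming spaCy and its small English model, en_core_web_sm, are installed; the example sentence is invented):

```python
import spacy

# Assumes the small English model has been downloaded:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Invented example sentence
doc = nlp("Secretary of State Kerry met with opposition leaders in Geneva on Tuesday.")

# Named entity recognition: pull out the people, organizations, and places
for ent in doc.ents:
    print(ent.text, ent.label_)  # labels like PERSON, ORG, GPE, DATE
```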
Researchers working with text data are often faced with the problem of identifying place names in text and linking them to their geographic coordinates. In social science, we might want to measure news coverage of specific locations, track discussions of specific places in government documents, or geolocate events such as protests to the locations where they occur.
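As a rough sketch of what that pipeline can look like (using spaCy for place-name detection and geopy's Nominatim geocoder for the coordinate lookup, both of which are illustrative choices rather than anything prescribed here; a production geoparser would typically resolve names against a gazetteer and handle disambiguation):

```python
import spacy
from geopy.geocoders import Nominatim

nlp = spacy.load("en_core_web_sm")
geolocator = Nominatim(user_agent="geoparsing-sketch")  # hypothetical user agent string

doc = nlp("Protesters gathered in Homs and Aleppo on Friday.")  # invented example

# Step 1: detect candidate place names (GPE = geopolitical entity, LOC = location)
place_names = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]

# Step 2: link each place name to coordinates with a geocoding service
for name in place_names:
    match = geolocator.geocode(name)
    if match is not None:
        print(name, match.latitude, match.longitude)
```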
This workshop provides an interactive introduction to information extraction for social science: techniques for identifying specific words, phrases, or pieces of information contained within documents. It focuses on two common techniques, named entity recognition and dependency parsing with the spaCy library, and shows how they can provide useful descriptive data about the civil war in Syria.
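As a small illustration of the dependency parse side (the sentence is invented and the subject/object extraction is deliberately simplified):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Government forces shelled rebel positions near Aleppo.")  # invented example

# Walk the dependency parse to recover who did what to whom:
# the nominal subject (nsubj) and direct object (dobj) attached to each verb.
for token in doc:
    if token.pos_ == "VERB":
        subjects = [child.text for child in token.children if child.dep_ == "nsubj"]
        objects = [child.text for child in token.children if child.dep_ in ("dobj", "obj")]
        print(subjects, token.lemma_, objects)
```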
Human-labeled data is often the primary bottleneck in building good machine learning models. Active learning is great, but what if you don’t already have a model? You can bootstrap your way to a machine learning model with majority-vote deterministic rules.
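As a toy sketch of the idea (the rules, label names, and example texts are all invented; a real system would use many more rules and feed the resulting labels into a classifier):

```python
from collections import Counter

# Each rule is a deterministic function that votes for a label
# or abstains by returning None.
def keyword_rule(text):
    return "protest" if any(w in text.lower() for w in ("protest", "march", "rally")) else None

def crowd_rule(text):
    return "protest" if "demonstrators" in text.lower() else None

def sports_rule(text):
    return "not_protest" if "football" in text.lower() else None

RULES = [keyword_rule, crowd_rule, sports_rule]

def majority_vote(text):
    votes = [v for v in (rule(text) for rule in RULES) if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None

# The bootstrapped labels can then be used to train a first-pass classifier.
docs = ["Demonstrators marched through downtown.", "The football match ended 2-1."]
print([(d, majority_vote(d)) for d in docs])
```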
Much of my work involves improving large-scale systems to extract political events from text (see code from our NSF project on the subject here). These systems are designed for full production use: they run over many hundreds of sources, both daily and retrospectively, and cover many dozens of event categories, including protests, armed conflict, statements, arrests, and humanitarian aid.
Reproducibility tools like knitr and version control with git are on their way to becoming standard for academic code, even in social science disciplines such as political science. knitr, Rmarkdown, and Jupyter notebooks make it easy to verify that your findings and figures come from the most recent version of your code and that it runs without errors.
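For the Jupyter side, one way to check that a notebook still runs top to bottom from a clean state (this uses nbformat and nbclient, which is an assumption about tooling rather than anything prescribed here):

```python
import nbformat
from nbclient import NotebookClient

# Re-execute a notebook from a clean state; any failing cell raises an error,
# so a CI job or pre-commit hook can catch stale or broken analyses.
nb = nbformat.read("analysis.ipynb", as_version=4)  # hypothetical notebook name
NotebookClient(nb, timeout=600).execute()
nbformat.write(nb, "analysis.ipynb")
```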
This tutorial covers how to create event data from a new set of text using existing Open Event Data Alliance tools. After going through it, you should be able to use the OEDA event data pipeline for your own projects with your own text.
Science has a special issue this month on forecasting political behavior, which includes an essay by Cederman and Weidmann discussing the limitations of current conflict forecasting models, as well as the areas where those models are better than many people, including political science scholars, think they are.