Synthetically generated text for supervised text analysis

April 2024


Abstract

Large language models (LLMs) are a powerful tool for conducting text analysis in political science, but using them to annotate text has several drawbacks, including high cost, limited reproducibility, and poor explainability. Traditional supervised text classifiers are fast and reproducible, but they require expensive hand annotation, which is especially difficult to obtain for rare classes. This article proposes using LLMs to generate synthetic training data for smaller, traditional supervised text models. Synthetic data can augment limited hand-annotated data or be used on its own to train a classifier that performs well at greatly reduced cost. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, a simple technique for improving the quality of synthetic text, and an illustration of its limitations. I demonstrate the usefulness of synthetic training data through three applications: synthetic news articles describing police responses to communal violence in India for training an event detection system, a multilingual corpus of synthetic populist manifesto statements for training a sentence-level populism classifier, and synthetic tweets describing the fighting in Ukraine for improving a named entity system.
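As a rough illustration of the augmentation idea described in the abstract, the sketch below pools a few hand-labeled examples with synthetic ones and trains a small supervised classifier on the combined set. It is not the paper's implementation: the naive Bayes model, the function names (`train_naive_bayes`, `predict`), the label names, and every example sentence are assumptions made up for illustration, and the synthetic strings merely stand in for text that an LLM would generate.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Fit a multinomial naive Bayes model on (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior under the model."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        # Laplace (add-one) smoothing over the training vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens unseen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled examples (invented for this sketch).
hand_labeled = [
    ("police fired tear gas at the crowd", "police_action"),
    ("the festival drew large crowds downtown", "other"),
]

# Hypothetical stand-ins for LLM-generated synthetic examples.
synthetic = [
    ("officers dispersed rioters with batons", "police_action"),
    ("police imposed a curfew after clashes", "police_action"),
    ("the minister opened a new school", "other"),
    ("farmers reported a good harvest this year", "other"),
]

# Augmentation: train on the union of hand-labeled and synthetic data.
model = train_naive_bayes(hand_labeled + synthetic)
print(predict(model, "police imposed a curfew in the town"))
```

Training on `synthetic` alone corresponds to the abstract's second use case, in which synthetic data is used on its own; in practice the classifier would be evaluated on held-out hand-labeled text rather than on toy sentences like these.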

Type

Preprint

Publication

PolMeth 2022 and Text as Data 2022

political science text data

Andy Halterman

Assistant Professor, MSU Political Science

My research interests include natural language processing, text as data, and subnational armed conflict.
