Synthetically generated text for supervised text analysis


UPDATED VERSION IN PROGRESS.

Supervised text models are often the best tool for categorizing documents into known classes or for extracting information from within documents. However, they are often difficult to employ because of the expense of hand-labeling documents, the difficulty of retrieving relevant documents for annotating rare classes, and the copyright and privacy concerns involved in sharing annotated documents. This paper proposes a partial solution to these three issues in the form of controlled generation of synthetic text. Recent advances in text generation make it possible to create synthetic documents that carry desired class labels and that can be broadly shared without copyright or licensing concerns. I demonstrate the usefulness of text generation techniques with three applications: (1) using an off-the-shelf language model prompted with article headlines to generate synthetic news articles describing specified political events for training an event detection system; (2) using a fine-tuned language model to generate synthetic tweets describing the fighting in Ukraine for named entity recognition labeling; and (3) using a task description approach to generate a multilingual corpus of populist manifesto statements for training a sentence-level populism classifier. The article includes a discussion of the ethical concerns inherent in this work, along with proposed guidelines for researchers.

PolMeth 2022 and Text as Data 2022
Andy Halterman
Assistant Professor, MSU Political Science

My research interests include natural language processing, text as data, and subnational armed conflict.