
Using Synthetic Text Data to Train Better Classifiers

I’m excited to share my latest paper, now out in Political Analysis, which introduces a new approach to training supervised text classifiers. The core idea is simple: instead of relying solely on expensive hand-labeled data, we can use generative large language models (LLMs) to generate synthetic training examples, then fit a classifier on the synthetic text (and any real training data we have). We can then use the trained classifier to label our real text.

Most of my methods papers start with a methods problem (“Wouldn’t it be cool if we could do X? Here’s a way to do it.”). This one starts with a tool I found myself using over and over in applied text analysis projects over the past few years.

It grew out of the overlap of two things I’d been thinking about:

  1. How do we make the process of collecting annotations for supervised text analysis less tedious and expensive? Annotating text is always a pain, but it’s especially hard when you’re trying to collect annotations for rare classes or events. For example, an actor threatening to cut off water, electricity, or internet is substantively very interesting, but (thankfully) occurs in so few news stories that it is difficult to get enough examples to train a classifier. As we expand the set of classes we’re interested in, the problem gets much worse.

  2. What’s the appropriate way to use generative large language models in political science? LLMs are helpful, powerful tools, and a growing body of research (along with my own anecdotal experience) suggests that LLMs can classify and annotate text as well as many human annotators in practice (though they may struggle with complex classification tasks). But LLMs come with drawbacks. They can be expensive to use (in API fees or hardware), commercial LLMs change rapidly and are not reproducible, and it’s difficult to integrate hand-coded data with LLMs without technically demanding fine-tuning. More broadly, LLMs are overkill for many text analysis tasks–identifying protests in news text shouldn’t require a trillion parameters, especially given that many of those parameters encode things like how to explain the EM algorithm in the style of Cardi B lyrics.


Why Synthetic Data?

Most of us who work with supervised text analysis face a familiar challenge: we need lots of labeled examples to train our models. The annotation problem gets worse when the events we care about are rare—if only 0.5% of sentences describe police in India making arrests during communal violence, we might need to label thousands of documents to get enough positive examples to train a good classifier.

Synthetic data has exploded as a training technique for LLMs. Researchers are using it to generate synthetic “textbooks” for LLM pretraining and to create instruction-tuning data for fine-tuning models. But much of this work focuses on training new language models; the potential for using synthetic data to train traditional supervised classifiers remains relatively unexplored.

A Different Way to Use LLMs

While many researchers are using LLMs directly for classification, I’ve found this approach has real drawbacks—it’s expensive, hard to reproduce, and difficult to combine with existing hand-labeled data. Instead, I propose using LLMs for what they do best: generating text.

The basic workflow is straightforward:

  1. Generate synthetic examples using an LLM
  2. Train a traditional supervised classifier on this synthetic data
  3. Apply the classifier to your real documents
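To make the workflow concrete, here is a minimal sketch in Python. The `generate_synthetic()` helper and the example prompts are hypothetical placeholders standing in for whatever LLM you use, and the scikit-learn pipeline stands in for whatever supervised classifier you prefer; this is not the paper's exact code.

```python
# Minimal sketch of the generate -> train -> apply workflow.
# `generate_synthetic()` is a hypothetical helper wrapping whatever LLM you use;
# `real_docs` is a placeholder for your unlabeled real documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Generate synthetic examples for each class with an LLM.
pos_docs = generate_synthetic(
    "Write a short news story about police using force against a protest.", n=500)
neg_docs = generate_synthetic(
    "Write a short news story about a routine city council meeting.", n=500)

# 2. Train a traditional supervised classifier on the synthetic data.
texts = pos_docs + neg_docs
labels = [1] * len(pos_docs) + [0] * len(neg_docs)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# 3. Apply the trained classifier to the real documents you care about.
predictions = clf.predict(real_docs)
```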

How to make synthetic data

To be useful as training data, synthetic text needs to contain the desired content. For instance, if we’re generating synthetic text to train a populism classifier, we need to know that when we ask for a populist statement, the result will actually be populist, and that when we ask for a negative, non-populist statement, it will indeed not contain populist rhetoric.

The paper describes three techniques for controlling the content and style of the synthetic text.

  • Prompting. This is the easiest and most familiar technique for guiding the content of the text. In early versions of the paper, before instruction-tuned models were widely available, this consisted of writing the initial part of a document and having the LLM complete it. This works especially well for news stories, which have a convenient headline–body format. By writing a headline like “Police break up protest in capital with teargas (AFP) – TEHRAN”, the LLM pretty reliably generates a news story that describes police using force against a protest. With instruction-tuned models, we can instead ask for the exact content we want. Using the populism validation from the paper, a prompt like the one below reliably generates examples of populist rhetoric.

    “Populist rhetoric sees politics as a conflict with good, common, or “real” people on one side, and out-of-touch, evil, or self-serving elites on the other. Write ten statements that a populist party in {country} might make (in {language}):”

    Prompting is the easiest approach, especially as LLMs’ instruction-following ability improves.

  • “Adapting” LLMs. In some situations, we can control the text the LLM produces by further training the LLM on the text we want it to emulate.[1] Briefly, we continue the pretraining step on a specific corpus of text. In the paper, I do this with a corpus of tweets about the post-2022 war in Ukraine, which postdates GPT-2’s training data. This technique yields a model that can generate documents that are a very close match for the training documents. It is more technically demanding than prompting, but it is really useful if you need to produce an expanded, copyright-free synthetic corpus that looks similar to your original corpus.

  • Sampling hyperparameters. Finally, we can control the style of the synthetic text by varying the sampling hyperparameters: temperature, top-K, repetition penalty, and so on. Varying these affects whether the text is repetitive and simple (“high probability”, from the model’s perspective), or less predictable and creative.
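As an illustration of that last knob, here is a minimal sketch using the Hugging Face transformers library and GPT-2 (the model used in early drafts of the paper). The headline prompt and the specific hyperparameter values are just examples to adjust.

```python
# Minimal sketch: generate a synthetic news story from a headline prompt,
# with explicit sampling hyperparameters. Assumes the `transformers` package.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Police break up protest in capital with teargas (AFP) - TEHRAN"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,          # sample tokens rather than taking the argmax
    temperature=0.9,         # higher values give less predictable text
    top_k=50,                # sample only from the 50 most likely tokens
    repetition_penalty=1.2,  # discourage repeated phrases
    max_new_tokens=200,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```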

One of the keys to producing successful synthetic data is to ensure enough diversity in the text. One simple way to increase the diversity of synthetic text is to inject other information into the prompt, for example, varying the country a story is about, or which news source to emulate. (Anecdotally, from a different project, asking for a story in the style of the New Yorker yields very complex and useful training text.)
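A minimal sketch of that kind of prompt variation is below; the template, the specific countries and styles, and the `generate_text()` call are illustrative placeholders, not the paper's prompts.

```python
# Minimal sketch: vary the country and the outlet style across prompts to
# diversify the synthetic corpus. `generate_text()` is a hypothetical LLM call.
import itertools

countries = ["India", "Kenya", "Brazil", "Indonesia"]
styles = ["a wire service report", "a local newspaper story",
          "a long-form magazine piece"]

prompts = [
    f"Write {style}, set in {country}, about police dispersing a protest."
    for country, style in itertools.product(countries, styles)
]

# synthetic_docs = [generate_text(p) for p in prompts]
```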

Does it Actually Work?

The three validations in the paper show that using synthetic text, sometimes in conjunction with real text, does meaningfully improve text classifiers. But ensuring the quality of the synthetic text is crucial. The paper discusses three strategies for checking the quality of the synthetic text.

1. Check actual performance on an eval set

Using synthetic text doesn’t remove the need for hand-annotated (real) validation text; it’s mostly useful for reducing the amount of training data you need to collect. The best test of whether synthetic text helps is to check whether adding it to your training set improves performance on your (hand-labeled, real-text) validation set.
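A minimal sketch of that comparison with scikit-learn; `real_texts`/`real_labels`, `synth_texts`/`synth_labels`, and `val_texts`/`val_labels` are placeholders for your hand-labeled training set, your synthetic set, and your hand-labeled validation set.

```python
# Minimal sketch: does adding synthetic text improve performance on a
# hand-labeled, real-text validation set?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def fit_and_score(train_texts, train_labels):
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    return f1_score(val_labels, clf.predict(val_texts))

f1_real_only = fit_and_score(real_texts, real_labels)
f1_augmented = fit_and_score(real_texts + synth_texts, real_labels + synth_labels)
print(f"real only: {f1_real_only:.3f}   real + synthetic: {f1_augmented:.3f}")
```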

2. Check the Semantic Coverage

One simple but effective approach for evaluating synthetic text quality is to visualize how well your synthetic text covers the semantic space of your real documents. If you have real documents that are semantically dissimilar from your synthetic text, a classifier trained on synthetic text alone will have difficulty classifying them. Here’s an example from the validation on police violence in India:

Both real and synthetic documents are embedded with a sentence-transformers model, and the 2D PCA of their embeddings is plotted. The gaps in overlap revealed that my initial synthetic examples were missing discussions of party politics and temple disputes. I could then write some extra headlines on those subjects to elicit more targeted training examples.
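A minimal sketch of that coverage check, assuming the sentence-transformers, scikit-learn, and matplotlib packages (the particular embedding model below is just an example):

```python
# Minimal sketch: embed real and synthetic documents, project to 2D with PCA,
# and look for regions of real documents with no nearby synthetic documents.
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(real_docs + synthetic_docs)

coords = PCA(n_components=2).fit_transform(embeddings)
n_real = len(real_docs)
plt.scatter(coords[:n_real, 0], coords[:n_real, 1], alpha=0.4, label="real")
plt.scatter(coords[n_real:, 0], coords[n_real:, 1], alpha=0.4, label="synthetic")
plt.legend()
plt.show()
```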

3. Use an Adversarial Evaluation

We can also measure synthetic text quality by seeing how well a classifier can distinguish between real and synthetic examples. The lower the performance of a real-vs-synthetic classifier, the harder synthetic text is to distinguish from real text, and therefore the higher quality it is. This technique is especially helpful to tune the generation hyperparameters that affect how tokens are sampled from the LLM. Here’s how different generation parameters affected quality in one of my validations:

Ideally, synthetic text would be indistinguishable from real text, with a real-vs-synthetic accuracy of 50%. But some hyperparameters get much closer to that ideal than others, and those are the ones to use.
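A minimal sketch of the adversarial check with scikit-learn; `real_docs` and `synthetic_docs` are again placeholder lists of strings, and the bag-of-words classifier is just one reasonable choice of discriminator.

```python
# Minimal sketch: how well can a simple classifier tell real from synthetic?
# Cross-validated accuracy near 0.5 means the synthetic text is hard to
# distinguish; compare this score across generation hyperparameter settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = real_docs + synthetic_docs
labels = [0] * len(real_docs) + [1] * len(synthetic_docs)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
accuracy = cross_val_score(clf, texts, labels, cv=5, scoring="accuracy").mean()
print(f"real-vs-synthetic accuracy: {accuracy:.3f}")
```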

What do you do with the synthetic text?

After establishing that the synthetic text is high quality, researchers then have a few options for how to use it.

One approach is to augment a limited set of hand-labeled training data with synthetic data. The India police events validation in the paper shows that using some hand-labeled data alongside the synthetic data yields (unsurprisingly) a better classifier than one trained on synthetic data alone. However, in some cases, including the populism validation, a model trained on purely synthetic text can produce accurate predictions.

Finally, in some limited cases, researchers might hand-annotate synthetic text. This isn’t ideal: if you’re labeling text at all, why not label real text? But in some situations, copyright or privacy concerns might make it impossible to release any real text or show it to annotators. By annotating synthetic text, a researcher can trade some accuracy for greater transparency and reproducibility.

The Results

In my validations, this approach showed real promise:

Using synthetic data alongside just 100-500 hand-labeled examples performed better than using 1,000 hand-labeled examples alone in classifying documents about actions taken by police in India in the context of communal violence:

I was able to train a multilingual populism classifier with no hand-labeled training data. A classifier trained on purely synthetic manifestos did not perform as well as a model trained on real hand-labeled text, but the synthetic-only model still performs remarkably well.

Finally, the third validation shows that an “obsolete” model like GPT-2 can, after adaptation, produce tweets about the war in Ukraine that are difficult to distinguish from real tweets. It also shows the importance of identifying the best generation hyperparameters. A named entity recognition model trained on hand-labeled real tweets and hand-labeled synthetic tweets shows that using synthetic tweets carries an accuracy cost, but the loss is much smaller when higher-quality synthetic tweets are used.

Why does it work?

As much as I like engineering, this is a political science paper, so I try to discuss the theoretical reasons why synthetic text should improve our models. One reason is that synthetic text can be used to address a rare class/class imbalance problem. Many of the classifiers that social scientists train are for rare events. By producing more examples of the positive class, we can address the mechanical class imbalance problem in our loss function, but also create more diverse positive samples. This reduces the classifier’s reliance on any specific positive example.

Second, and I think more compellingly, the way to think about synthetic text is as a form of model distillation. Very large and capable LLMs encode much of the information we want our classifiers to contain. We can transfer that knowledge from the large LLM to our smaller classifiers through synthetic text. This allows us to distill specific parts of the large model into a smaller, more efficient, and reproducible classifier. By inspecting the synthetic text, we can see if we’re actually distilling the right concepts.

Getting Started

If you want to try this approach, here are my recommendations:

  1. You can use a relatively small model. The paper uses GPT-2 extensively (that’s all there was for the first draft!), but an ~8B-parameter model like Llama 3.1 or Mistral will perform excellently for most tasks. One advantage of these models, which didn’t make it into the paper, is that you can provide more instructions on what you want (“Make sure the story describes police using teargas.”).
  2. The paper focuses on document classification, but I think the technique is potentially even more useful for information extraction models. For example, an LLM can generate essentially unlimited news stories with annotations on the named entities in the document.
  3. Validate your synthetic text using the visualization and adversarial techniques described above.
  4. Consider starting with data augmentation (combining synthetic with hand-labeled examples) before trying pure synthetic training.

Code and Data

All code and synthetic data from the paper are available on the Harvard Dataverse.

The paper, “Synthetically generated text for supervised text analysis” goes into much more detail on implementation, validation approaches, and ethical considerations. I hope these techniques will be useful for other researchers working with supervised text analysis.


  1. The term “fine-tuning” is kind of cursed in the literature. Does it mean supervised fine-tuning, like we did with BERT models? Or instruction fine-tuning? “Additional, domain-specific pretraining” is the most accurate term for what we’re doing, but it is extremely clunky. The reviewers and I eventually hashed out “adaptation” as a compromise.
