Active learning is great, but what if you don’t already have a model? You can bootstrap your way to a machine learning model with majority-vote deterministic rules.
Human-labeled data is often the primary bottleneck in building good machine learning models. One of the most important recent developments in machine learning is the arrival of new annotation frameworks that simplify data collection and put the model in the loop to reduce the time needed to train it. Specifically, Prodigy is a new browser-based tool for rapidly annotating data for use in machine learning models using active learning and binary decisions (see “Introducing Prodigy” and “Supervised learning is great – it’s data collection that’s broken”). Prodigy promises to let you create labeled training and test data much more quickly than was possible before, in a nice interface.
The good stuff
Two key features underlie Prodigy’s speed. The first feature is the use of “active learning,” a technique where the uncertainty of an existing model is used to select which objects a human annotator will label next. By picking points for which the model is least certain (a predicted probability near 0.5 for a binary classifier), the expected informational value of each label is maximized. Rather than labeling a fixed number of observations and hoping the model will perform well after training, the number of observations needed in each category varies with how hard they are for the model to learn.
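To make the selection step concrete, here is a minimal sketch of uncertainty sampling for a binary classifier. It is not Prodigy's own implementation; the function name `select_most_uncertain` and the example probabilities are made up for illustration.

```python
from typing import List, Sequence


def select_most_uncertain(probs: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k examples whose predicted probability
    is closest to 0.5, i.e. where a binary classifier is least certain."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


# Hypothetical model scores for five unlabeled examples.
probs = [0.97, 0.52, 0.10, 0.45, 0.80]
print(select_most_uncertain(probs, 2))  # -> [1, 3]
```

The examples scored 0.52 and 0.45 are shown to the annotator first, because a label there is expected to teach the model more than a label on the example it already scores at 0.97.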
The second speed innovation in Prodigy is to present human labelers with binary tasks. Rather than asking an annotator to select one label out of many, the interface presents a best guess at the label and asks the human to accept or reject it. Anecdotal experience and some research show that this approach leads to much faster and more accurate coding than requiring human annotators to select a label by hand.
The catch is that both of these speed improvements, active learning and binary decisions, require an existing model to generate labels (albeit potentially bad labels) before training begins. For many tasks, off-the-shelf models exist and just need to be tweaked or updated. For other tasks, no model whatsoever exists. How can you use Prodigy when you’re building a classifier wholly from scratch? Prodigy has ways of doing this: seed terms for classification, rule matching for NER. But what if you have something more complicated, such as richer information extraction or a learning-to-rank problem?
This was the task I faced in developing training data for Mordecai, a text geoparser that resolves place names in text to their geographic coordinates. A key component of the geoparser is inferring the correct country of a place name mentioned in a sentence (e.g., “Aleppo” in the sentence “government forces surrounded Aleppo” refers to Syria, not Aleppo Township, Pennsylvania). This task also becomes extremely tedious if each country label needs to be selected from a long list of possible countries. I knew that I needed a very large amount of human-verified labeled data, so I needed a way of generating soft labels without any model at all and then displaying them in Prodigy as a binary decision.
The solution I hit upon was to write a number of deterministic rules that individually may be only modestly accurate, but when taken as a plurality vote, generate a reasonably accurate predicted country label. This approach was inspired by Snorkel, which uses aggregations of deterministic rules and a measurement model to automatically generate silver-standard labels, and by the large literature in machine learning on weak learners and ensemble methods, which both generate good predictions from a set of poor models.
In the case of inferring the correct country, I came up with several simple rules that would each greatly outperform chance, but none of which was itself very good. Those rules were to pick the country by:
- the most similar word embedding to the place name
- the closest embedding to the sentence’s embedding
- the country with the largest population for a place of that name
- the country of the place with the most alternative spellings
- the first country back from a Geonames search of that name
- any country explicitly referenced in the sentence
These rules capture the general importance or prominence of a place and, using word vectors, how well it matches the context sentence.
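The voting step itself is small. Below is a minimal sketch of a plurality vote over rule outputs, not the code used in Mordecai. The `Rule` signature, the `soft_label` helper, the three-letter country codes, and the toy rules are all assumptions for illustration; real rules would query word embeddings or Geonames.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

# A rule takes (place_name, sentence) and returns a country code,
# or None when it has no opinion.
Rule = Callable[[str, str], Optional[str]]


def plurality_vote(votes: Sequence[Optional[str]]) -> Tuple[Optional[str], float]:
    """Return the most common non-abstaining vote and its share of cast votes.
    Ties go to the label that was seen first."""
    cast = [v for v in votes if v is not None]
    if not cast:
        return None, 0.0
    winner, count = Counter(cast).most_common(1)[0]
    return winner, count / len(cast)


def soft_label(rules: Sequence[Rule], place: str, sentence: str) -> Tuple[Optional[str], float]:
    """Run every rule on the example and combine the answers by plurality."""
    return plurality_vote([rule(place, sentence) for rule in rules])


# Toy stand-ins for the real rules (hypothetical, hard-coded answers).
toy_rules: List[Rule] = [
    lambda place, sent: "SYR",  # e.g. nearest country word embedding
    lambda place, sent: "SYR",  # e.g. largest population for that name
    lambda place, sent: "USA",  # e.g. first Geonames search result
    lambda place, sent: None,   # e.g. no country named in the sentence
    lambda place, sent: "SYR",  # e.g. most alternative spellings
]

print(soft_label(toy_rules, "Aleppo", "government forces surrounded Aleppo"))
# -> ('SYR', 0.75)
```

Letting a rule abstain by returning `None` keeps a rule that only sometimes applies, such as an explicit country mention, from diluting the vote. The vote share is not needed for the accept/reject decision, but it can be handy for sorting examples before review.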
Together, the plurality vote of these rules gave a soft label for a country that was generally, but not always, correct. I could then export the text, highlighted place name, and inferred country to a JSON file in Prodigy’s format, load it in, and very quickly (~1 second per sentence) accept or reject the vote. Getting the data into this format allowed me to label around 8,000 sentences in a matter of hours. I then batch trained a model, which gave me around 0.9 precision and recall. To improve on this, I used the model itself to generate soft labels, leaving the old plurality vote method behind and launching into ML orbit.
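For the export step, a sketch along the following lines should work. It assumes Prodigy's usual task layout of one JSON object per line with `text`, `spans`, `label`, and `meta` keys; check the Prodigy documentation for the exact fields your recipe expects. The helper names `make_task` and `to_jsonl` and the `PLACE` span label are hypothetical.

```python
import json
from typing import Dict, Iterable, List, Tuple


def make_task(sentence: str, place: str, country: str) -> Dict:
    """Build one annotation task: the sentence, the highlighted place name,
    and the voted country as the label to accept or reject."""
    start = sentence.find(place)
    if start == -1:
        raise ValueError(f"{place!r} not found in sentence")
    return {
        "text": sentence,
        "spans": [{"start": start, "end": start + len(place), "label": "PLACE"}],
        "label": country,
        "meta": {"source": "plurality vote"},
    }


def to_jsonl(examples: Iterable[Tuple[str, str, str]]) -> str:
    """Serialize (sentence, place, country) triples as newline-delimited JSON."""
    lines: List[str] = [json.dumps(make_task(s, p, c)) for s, p, c in examples]
    return "\n".join(lines)


examples = [("government forces surrounded Aleppo", "Aleppo", "SYR")]
print(to_jsonl(examples))
```

Writing the returned string to a `.jsonl` file gives one task per line, each carrying a single candidate label for the annotator to accept or reject.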
This approach is easy to implement and can draw on the domain-specific knowledge that many people have about their own problems. If you can write several simple rules that each contain some useful signal, you can combine them into a vote that you can accept or reject. By casting the problem into a binary decision task, you can quickly amass enough labeled data to switch over to a machine learning model, which can then itself generate more soft labels for human review.