The future of forecasting: the good and the bad

Feb 18, 2017

Science has a special issue this month on forecasting political behavior. It includes an essay by Cederman and Weidmann that discusses the limitations of current conflict forecasting models, as well as the areas where those models are better than many people, including political science scholars, think they are. I agree with their overview: conflict forecasting has come a long way since the State Failure (now Political Instability) Task Force was founded in 1994. As someone closer to Ward’s optimistic camp than to Ulfelder’s more cautious one, I find a lot to agree with.

The forecasting literature has resoundingly rejected p-values as a tool for building forecasting models and has explored the outer limits of what can be done to forecast conflict onset with structural variables like GDP, infant mortality, and lagged dependent variables. Current research on incorporating faster-moving data, especially data derived from text, is showing some promise (Chadefaux 2014; Chiba and Gleditsch 2017).

I thought that the piece missed two important points about the future of forecasting: one that bodes well and one that presents a challenge. The good, that forecasting represents perhaps the only “shared task” in political science, with all the tremendous potential that has to spur improvements, I’ll discuss in a separate post. The bad is that the extreme rarity of the conflict events that we forecast makes dramatic improvements in forecasting accuracy impossible. That rarity is clearly good for the world, but it means that we simply do not have enough cases to improve our models more than on the margins. Moreover, new cases do not occur often enough for us to assess our forecasting models on new events (again, good for the world, but bad for improving accuracy).

Take three of the most common targets of forecasting: coups, civil war onsets, and mass killings. Fewer than 300 of each have occurred since 1945, the earliest date most models use. If we limit the time period to the years since 1980, when most large news text corpora become available, or since 1989, if we believe that data from the Cold War is not useful for forecasting 2018, we have only around 250-300 of all three events combined. Correctly forecasting a single additional case in a validation set can lead to more than a 1% change in accuracy, which is a recipe for overfitting. The beauty of forecasting is that overfit models are eventually discarded or modified, but with 1–5 additional data points per year, we risk chasing the noise of new cases, exacerbating the problem and giving consumers of forecasts bad information in the meantime.
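To make that arithmetic concrete, here is a minimal sketch of how far one additional correctly forecast onset moves a model's score. It reads "accuracy" as recall (the share of true onsets the model catches), which is my assumption, and the event count and held-out share are illustrative numbers rather than figures from any real dataset.

```python
def recall_shift_per_case(n_validation_onsets: int) -> float:
    """Percentage-point change in recall from catching one more onset."""
    if n_validation_onsets <= 0:
        raise ValueError("need at least one onset in the validation set")
    return 100.0 / n_validation_onsets


# Hypothetical setup: roughly 300 onsets in total, 25% held out for validation.
total_onsets = 300
validation_share = 0.25
validation_onsets = int(total_onsets * validation_share)

shift = recall_shift_per_case(validation_onsets)
print(f"{validation_onsets} held-out onsets: "
      f"one extra hit moves recall by {shift:.2f} points")
```

Under these assumptions, any modeling choice that happens to flip one or two held-out cases looks like a real improvement, which is exactly the overfitting risk described above.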

A lack of onset cases is the primary limitation on forecasters’ ability to answer several important open questions, including:

how can we use newer machine learning methods to improve our forecasts?
what is the best way to handle missing data? interpolate? drop? use the missingness itself as a feature?
how often should we re-fit models, both in testing and production? Every year? Once?
how much should models be customized to particular countries, regions, and time periods?
what’s the best way to merge structural data like GDP with text or event data?

These improvements in modeling and input data are limited by our ability to measure their effect on predicting a small number of cases.
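A rough way to see the measurement problem is to ask whether a given gap between two models is distinguishable from noise at different numbers of onsets. The sketch below uses a simple binomial approximation for the standard error of recall and treats the two models' estimates as independent. Both are simplifying assumptions (a paired test would suit a shared test set better), and the recall values and case counts are hypothetical.

```python
import math


def recall_standard_error(recall: float, n_onsets: int) -> float:
    """Binomial standard error of a recall estimate based on n_onsets cases."""
    return math.sqrt(recall * (1.0 - recall) / n_onsets)


def difference_z_score(recall_a: float, recall_b: float, n_onsets: int) -> float:
    """Approximate z-score for the gap between two recall estimates.

    Assumes the two estimates are independent, which is a simplification.
    """
    se_a = recall_standard_error(recall_a, n_onsets)
    se_b = recall_standard_error(recall_b, n_onsets)
    return (recall_b - recall_a) / math.sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical comparison: a baseline at 0.70 recall vs. a new model at 0.75.
z_by_n = {n: difference_z_score(0.70, 0.75, n) for n in (75, 300, 3000)}
for n, z in z_by_n.items():
    verdict = "distinguishable" if abs(z) > 1.96 else "within noise"
    print(f"{n:>5} onsets: z = {z:.2f} ({verdict})")
```

With the 75 or 300 onsets that a conflict forecaster might realistically have, a five-point improvement falls short of the conventional 1.96 threshold. It only stands out once the number of cases reaches the thousands.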

What’s to be done?

One approach that does not help is the one that the Integrated Conflict Early Warning System and the Uppsala Conflict Data Program have taken: measuring their outcome variables at the month or day level (coup datasets, of course, have always done this). This approach is very useful on the input side: data measured at the day or month level becomes useful in a way it was not when outcomes were measured annually. But more finely chopping the same number of events does not give us more data points. The solution is to redirect efforts to forecasting events that happen much more often than coups, wars, and mass killings. Two obstacles stand in the way of this: more frequent outcomes are harder and more expensive to measure, and coups, wars, and mass killings are probably the highest priority for forecasting.
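The point about finer time units can be checked with back-of-the-envelope arithmetic. In the sketch below, the number of countries, years, and onsets are made-up round numbers. Moving from country-years to country-months multiplies the rows by twelve but leaves the count of positive cases untouched, so the only thing that changes is how rare the positives are.

```python
def base_rate(n_events: int, n_units: int, n_periods: int) -> float:
    """Share of unit-period rows that contain an event."""
    return n_events / (n_units * n_periods)


# Hypothetical panel: 150 countries observed for 30 years, with 100 onsets.
n_countries = 150
n_years = 30
n_onsets = 100

yearly_rate = base_rate(n_onsets, n_countries, n_years)
monthly_rate = base_rate(n_onsets, n_countries, n_years * 12)

print(f"country-years:  {n_countries * n_years:>6} rows, "
      f"{n_onsets} onsets, base rate {yearly_rate:.4%}")
print(f"country-months: {n_countries * n_years * 12:>6} rows, "
      f"{n_onsets} onsets, base rate {monthly_rate:.4%}")
```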

Data on more frequent events is difficult and expensive to obtain. Partly because current forecasting targets are rare, it is feasible to have one person tally them up at the end of the year or as they happen. And while civil wars and mass killings are conceptually slippery, they have fairly well-established definitions that can be consistently coded. This is not the case for more frequent events, which may require major work to define with any kind of consistency. Updating frequently also requires much greater resources, either in the form of human coders or skilled work to develop and validate machine-coded measures (more on this later). Several examples of these datasets exist, however. IARPA’s EMBERS forecasting project developed a large, machine-generated dataset of protests in Latin America to use as a forecasting target, and PITF has a large, handmade dataset of smaller atrocities.
PROBABLY MORE HERE

(Another example could go here. It has to be amenable to quantitative forecasting, so it can’t be things like cabinet appointments or resignations. We’ll leave those to the prediction market people.)

The second problem is that policymakers probably care more about what we already forecast than other kinds of events. This might just be a case of rigor vs. relevance.

That said, I see several possibilities for forecasting targets that are still relevant to researchers and policymakers and happen often enough to hold out real promise for improving forecasts and forecasting. In addition to protests and acts of mass killing, these events include