Collective intelligence is a wide-open concept. Just take a look at the program of the last seminar organized by the MIT Center for Collective Intelligence this summer to see how diverse the topics were: queuing behaviour, crowdsourcing translations, and collective forecasting.
In fact, it is within this last topic that we find the Good Judgment Project (GJP), an experiment funded by IARPA (Intelligence Advanced Research Projects Activity), which is clearly recognized as the benchmark in the field. Among the names involved in this project is Philip Tetlock, well known for his best seller «Expert Political Judgment», and we have had the pleasure of interviewing two of its researchers, Michael Horowitz and Eric Stone.
GJP has set up an international prediction tournament, now in its 4th season, where they look for the best forecasters in order to learn from their ability to make accurate estimates on questions of international affairs.
At Futura Markets we run the first prediction market platform in Spain, and as enthusiasts of this technology we are delighted to talk directly with Eric and Michael.
Michael C. Horowitz is an associate professor of political science at the University of Pennsylvania and an investigator on the Good Judgment Project. You can follow him on Twitter at @mchorowitz.
Eric Stone is a data scientist, programmer, statistician, and researcher on the Good Judgment Project. You can follow him on Twitter at @theericstone.
You can find the Good Judgment Project at: https://www.goodjudgmentproject.com/
We read about “collective intelligence” all the time, but when can we objectively say that a group is intelligent?
Horowitz: Determining when a group becomes intelligent is a difficult challenge. From my perspective, working on the Good Judgment Project, a group becomes intelligent in one of two ways. First, a group becomes intelligent when, as a team, a group makes decisions that are more accurate than those the individuals alone would make. Second, a group becomes at least passively intelligent when, even if unknowingly, such as with the wisdom of crowds, the aggregated judgment of a group proves more accurate than that of individuals.
Stone: This is a tricky question that changes based on context. We tend to talk about group accuracy rather than about intelligence, but intelligence plays an important role. We also consider intelligence as a scale, rather than a binary state.
From an empirical perspective, when we observe that our group’s aggregate judgments are more accurate than those of the individuals that compose it, we can reasonably conclude the group’s intelligence surpasses that of its individual members. We also note that because our aggregate judgments continually perform objectively well (GJP’s aggregate judgments were right on more than 90% of days across all the questions we forecasted last year), our group of forecasters is, relatively speaking, an intelligent one.
Do we understand today why and when groups perform better than individuals alone? Could we say that this is one of your key research questions in the Good Judgment Project?
Horowitz: One of the things we have discovered, in fact, is that groups working together have the capacity to outperform individuals when it comes to geopolitical and economic forecasting questions. Rather than leading to groupthink in a way that distorts forecasting accuracy, the specific context in which our teams operate, where people work together online, anonymously, and with accuracy as the only thing that drives “status” in the group, promotes forecasting accuracy.
You also found some super-predictors, that is, people with forecasting track records significantly above the average. How did you choose them? How could you be sure that you were not just picking the ones who got lucky?
Horowitz: We have found that there are, indeed, a select group of elite forecasters that we now call superforecasters. This is a group of people with the ability to consistently outperform the crowd. We select them by taking the top forecasters from each of our forecasting seasons, and then putting them together into teams for the following forecasting season. If they were just getting lucky, you would expect them to regress to the mean and become less accurate. Instead, the opposite occurs – they often become even more accurate.
Stone: We minimized the chance of picking merely lucky participants by selecting the top performers who had also surpassed a minimum threshold of participation. Additionally, we have analyzed whether performance during the 1st half of the tournament year reliably predicts performance in the 2nd half, and it does. Given that, we would also expect performance in year 1 of the tournament to predict performance in year 2. Beyond the selection of superforecasters, this is important because it provides evidence that forecasting is a skill.
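To illustrate the split-half check Stone describes, here is a minimal Python sketch on simulated data. It is not GJP's analysis: the pool size, the skill range, the number of questions, and the helper names (`half_score`, `split_half_correlation`) are all assumptions made for illustration. The idea is that if accuracy reflects skill, a forecaster's score in the first half of a season should correlate with the score in the second half; if it reflects luck alone, the correlation should be close to zero.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def half_score(p_correct, n_questions, rng):
    """Fraction of questions a forecaster gets right in one half-season."""
    hits = sum(rng.random() < p_correct for _ in range(n_questions))
    return hits / n_questions


def split_half_correlation(hit_rates, n_questions, rng):
    """Correlate first-half scores with second-half scores across forecasters."""
    first = [half_score(p, n_questions, rng) for p in hit_rates]
    second = [half_score(p, n_questions, rng) for p in hit_rates]
    return pearson(first, second)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical pool 1: forecasters differ in true hit rate (skill exists).
skilled_pool = [rng.uniform(0.5, 0.9) for _ in range(500)]
# Hypothetical pool 2: everyone has the same hit rate (differences are luck).
luck_pool = [0.7] * 500

r_skill = split_half_correlation(skilled_pool, 100, rng)
r_luck = split_half_correlation(luck_pool, 100, rng)

print(f"split-half correlation when skill varies:  {r_skill:.2f}")
print(f"split-half correlation with luck only:     {r_luck:.2f}")
```

In this simulation the pool with real skill differences shows a high split-half correlation, while the luck-only pool shows a correlation near zero. That contrast is what allows persistent performance to be read as evidence of skill.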
What do those super-predictors have in common? Do they share any distinguishable traits? Should all of us try to copy their strategy?
Horowitz: Our superforecasters are actually diverse in many ways. Among their common attributes, our superforecasters tend to enjoy thinking through puzzles, be highly analytical thinkers, and demonstrate a great deal of curiosity about the world. They also work very hard to maintain their status, suggesting that becoming an elite forecaster is about both nature and nurture.
In the end, the aggregate predictions from all the participants showed very good forecasting capabilities. What is the main mechanism behind this phenomenon?
Stone: There are many reasons for this. It is well-established that the mean of many estimates is superior to any individual estimate from among the group (there is an oft-told example of 30 students guessing how many jellybeans are in a large jar). Beyond that, when we combine predictions from all of our participants, we are effectively giving ourselves access to thousands of perspectives, backgrounds, interpretations, and information sources. This allows us to systematically increase the certainty of the aggregate, making it both more likely to be correct (on the right side of ‘yes’ or ‘no’), and closer to 0 or 100%. Our performance is judged by how close we are to correct, not just whether we are right or not, so this is important to our forecasting capabilities.
I like the example of 2 people making a prediction about which of 2 cars is faster. Let’s say both individuals are given the same single fact about the cars, horsepower, and asked to estimate the probability that car 1 is faster than car 2. The best aggregate estimate we can make is a simple average of their two estimates. But this changes if we give each individual a different fact about the cars; engine horsepower to individual 1 and displacement to individual 2. Now, if individual 1 says car 1 is faster with probability .8, and individual 2 says car 1 is faster with probability .7, our aggregate can be pushed higher than .75, and could be .9 or even higher. This is an over-simplification of course, but it reflects how our aggregations can reliably provide more accurate, more confident forecasts than the many individuals therein.
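The two-car example can be made concrete with a short Python sketch. This is an illustration under stated assumptions, not GJP's aggregation algorithm: it assumes the two forecasters start from a shared 50/50 prior and hold independent pieces of evidence, in which case their estimates can be combined by adding log-odds. The function names (`simple_average`, `pool_independent`, `quadratic_score`) are hypothetical, and the quadratic score is a simplified Brier-style measure of how close a forecast is to the outcome.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1 - p))


def sigmoid(x):
    """Convert log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))


def simple_average(probs):
    """Aggregate suited to forecasters who all saw the same fact."""
    return sum(probs) / len(probs)


def pool_independent(probs, prior=0.5):
    """Combine forecasts assumed to rest on independent evidence.

    Each forecaster's shift away from the shared prior is added up in
    log-odds space (an assumption of this sketch, not a GJP method).
    """
    log_odds = logit(prior) + sum(logit(p) - logit(prior) for p in probs)
    return sigmoid(log_odds)


def quadratic_score(p, outcome):
    """Squared distance between forecast and outcome (0 is perfect)."""
    return (p - outcome) ** 2


forecasts = [0.8, 0.7]  # P(car 1 is faster) from individuals 1 and 2

averaged = simple_average(forecasts)
pooled = pool_independent(forecasts)

print(f"simple average:        {averaged:.3f}")
print(f"independent pooling:   {pooled:.3f}")

# Score both aggregates under each possible outcome.
for outcome in (1, 0):
    print(
        f"outcome={outcome}: average scores {quadratic_score(averaged, outcome):.4f}, "
        f"pooled scores {quadratic_score(pooled, outcome):.4f}"
    )
```

Under these assumptions the pooled forecast lands at about 0.90 instead of 0.75. It scores better when car 1 really is faster and worse when it is not, which is why pushing an aggregate toward 0 or 100% only pays off when the underlying judgments are reliable.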
If we can measure when a group becomes intelligent, do you think crowd-based organizations could even work better than well-designed hierarchies?
Horowitz: The challenge with crowd-based organizations is how you aggregate judgments and make decisions. At the end of the day, even if an organization is designed more as a crowd, someone has to make a decision about what to implement, and someone has to implement the decision (and it is often a group for both the decision and the implementation).
Does your research challenge the reputation of traditional experts? It seems that most pundits cannot beat a simple extrapolation. Why are experts still all over the media?
Horowitz: One important thing to remember is that accuracy is only one of many reasons why the mainstream media consults experts. Other reasons include their ability to explain events that have just occurred and whether they have interesting insights that draw readers or viewers, depending on the medium, regardless of accuracy.
However, our research does not suggest that experts are useless. Far from it. What it suggests is that, when you combine the wisdom of dozens or hundreds or even thousands of educated people, with the right system, you can generate insights that are more accurate than those of a single individual, no matter how expert, in many situations. But is this really that surprising?
Stone: Sure, pundits are frequently wrong when held to making predictions about future political events, and Phil Tetlock has shown that to be the case. If that is all one cares about in a pundit, then it would be surprising to see them everywhere. But, as Mike points out, ability to frame a story is a valuable asset as well. I do think our research shows that if all the pundits pooled their predictions, they would be much better. Maybe it would make for a good CNN segment.
Do you think political punditry will be affected by the new “data driven” approaches to journalism, such as the success of FiveThirtyEight? This kind of journalism can at least be evaluated more systematically. Do we care about evaluating the accuracy of our favorite commentator?
Horowitz: While accuracy in predicting the future is not the only way that we should evaluate experts, it is something we should be tracking. If you do not keep score, there is no way to improve – or even know how you are doing. Data-driven journalism could offer the promise of more systematically tracking how our beliefs about the world match up with reality, which is a good thing.
Stone: I’m hopeful that well-presented demonstrations of data-driven journalism and event forecasting have already shifted, and will continue to shift, the dialogue to one more cognizant of statistics and data. I find that FiveThirtyEight has struck a good balance between abuse of statistics and getting bogged down in minutiae. It is my wish that more stories about sports, the economy, and politics will adopt a similar approach.
How do prediction markets fit within the research field of collective intelligence?
Stone: The Good Judgment Project has run several prediction markets as a way to obtain group estimates. These tools function much like a stock market, where participants buy and sell shares of events, rather than equities – with fake money of course. Conventional wisdom says that a well-defined market should be the most efficient tool for eliciting the likelihood of an event. However, over the 3 years of Team Good Judgment’s work, our opinion pooling statistical aggregation methods have proved to be more accurate, particularly early on in questions – arguably the time when we most want to know the answer.
In what sense are prediction markets different from poll aggregation? What is the role of people with no meaningful information? Is their opinion useful?
Horowitz: Even people who do not think they have meaningful information might have insights they do not realize are applicable, or a general understanding that helps them infer what might happen in the world. Thus, it often makes sense to aggregate their judgments as well. Prediction markets are also a form of poll aggregation, in a way. Prediction markets, through the buying and selling of beliefs about the probability that an event may occur, aggregate the beliefs of individuals into a crowd.
Stone: I view the prediction markets essentially as aggregation methods in and of themselves. There are many questions where the prediction markets out-perform the poll aggregations, and vice versa. The best methods combine several markets and poll aggregation algorithms, so I would say they are certainly useful, even if it turns out poll-based aggregations are more accurate in general.
What are the most effective information aggregation mechanisms in your own experience? What are your biggest concerns regarding the usefulness of prediction markets?
Horowitz: Our most accurate forecasting methods involve teams made up of our best forecasters. We have found that these super teams are so accurate that our aggregation methods provide limited, if any, benefit over just the raw average of the forecasts of those teams.
Stone: Currently, our best aggregation methods improve the accuracy of the aggregate, though this is less so among teams working together, and among our super teams in particular. Our approach gives us a lot of flexibility in how we combine forecasts, and lets us be more conservative or more aggressive depending on what we observe in our forecasting pool.
While the markets tend to be accurate in general, we have yet to see parity with the best aggregation methods. We continue to improve upon our prediction markets, and we are running several in the 4th year of the tournament, including one populated with superforecasters, so we’ll see how much better they can get.
Your project has the support of a public entity, IARPA, an intelligence research agency. What would you say is the main field of application of your findings in real business life?
Horowitz: Nearly any business or government agency that makes decisions based on what it thinks will happen in the future could benefit from the application of the findings of the Good Judgment Project. The application of our methods could provide anyone from the US government to multinational corporations to financial services companies significant enhancements in strategic foresight.
Last question … What are your next steps in the Good Judgment Project?
Stone: We are currently beginning a 4th year of research under IARPA, after which we will determine if subsequent years are appropriate, likely with narrower scope, and to answer more specific questions left unanswered by the larger project.
Horowitz and Stone: We hope to continue our research as we try to take key lessons learned from the Good Judgment Project and use them to continue advancing our strategic foresight through the use of innovative methodologies.
THANKS FOR YOUR TIME!