Judgmental forecasting is a fairly recent and (in my humble opinion)
under-researched & under-appreciated human endeavour and field of
research, with some low-hanging fruit (which is getting picked almost
as fast as I can write it up).
In general, judgmental forecasting methods operate best in areas with
fast feedback loops, large existing datasets (or at least good reference
classes for base rates) and continuous historical trends.
We can therefore identify the five horsemen of hard forecasting:
Long time horizons: Because most forecasters and traders discount the
future (either because rewards further in the future are less
certain, or because whatever investment is bound up in a bet
could be used in the meantime, or because they actually weight
the future less), and because long-term thinking activates far mode
from construal level theory,
the incentives to perform well on long-term questions are weaker
than on short-term questions. Additionally, forecasters receive
much more & better feedback on short-term questions. One would
expect long-term questions to receive less accurate forecasts
because of this, and the evidence points to this being the case (Dillon
2021, Niplav 2022).
But we're often especially interested in long-term questions: How can
we incentivize or create good forecasts on those questions?
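To make the discounting argument concrete, here is a minimal sketch of how the present value of a fixed payout shrinks as the resolution date moves out. It assumes plain exponential discounting, which is only one of the possible models; the payout and the 5% annual rate are hypothetical numbers chosen for illustration.

```python
def present_value(payout: float, annual_rate: float, years: float) -> float:
    """Value today of a payout received after `years`, under exponential discounting."""
    return payout / (1.0 + annual_rate) ** years


PAYOUT = 100.0  # hypothetical winnings when the question resolves
RATE = 0.05     # hypothetical annual discount rate (an assumption)

# The same nominal payout is worth less the longer the bet stays locked up.
for years in (0.25, 1, 5, 20):
    value = present_value(PAYOUT, RATE, years)
    print(f"resolves in {years:>5} years: worth {value:6.2f} today")
```

Under these assumptions the reward for an equally good forecast falls steadily with the horizon, which is one way to read the weaker incentives on long-term questions.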
Reward-correlated predictions: The clearest examples of this problem
are questions on extinction events: If you forecast doom, you're never
going to get rewarded for it, because the resolution happens only in
worlds where the bad outcome didn't occur. Forecasters are embedded
agents in the world they are predicting on, and there is no Cartesian
boundary. This can
happen with prediction markets as well: when making predictions on the
outcome of a decision, with the payout of the prediction market being
in a currency that is affected by the decision (for example devaluing
it relative to other currencies), the market might choose the "worse"
decision (according to the metric used for scoring it) because that
decision prevents the currency from being devalued as much.
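The extinction case can be sketched numerically. The snippet below assumes a Brier-style reward (one minus squared error) and a hypothetical true doom probability of 0.3; both are illustrative choices, not anything the sources prescribe. It compares the report that maximizes expected reward when both outcomes pay out with the report that maximizes it when nothing is paid in doom worlds.

```python
def brier_reward(p: float, outcome: int) -> float:
    """Reward in [0, 1]: one minus the squared error of forecast p."""
    return 1.0 - (p - outcome) ** 2


def expected_reward(p: float, q: float, paid_if_doom: bool) -> float:
    """Expected reward for reporting p when the true doom probability is q."""
    reward_no_doom = (1.0 - q) * brier_reward(p, 0)
    # If doom happens and nobody is around to pay out, that branch is worth 0.
    reward_doom = q * brier_reward(p, 1) if paid_if_doom else 0.0
    return reward_no_doom + reward_doom


def best_report(q: float, paid_if_doom: bool, steps: int = 100) -> float:
    """Grid-search the report p in [0, 1] that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: expected_reward(p, q, paid_if_doom))


TRUE_Q = 0.3  # hypothetical true probability of doom
honest = best_report(TRUE_Q, paid_if_doom=True)
distorted = best_report(TRUE_Q, paid_if_doom=False)
print(f"both outcomes pay: report {honest}; only survival pays: report {distorted}")
```

When both outcomes pay, the best report is the true probability; when only survival worlds pay, the best report drops to zero whatever the true probability is, so the scoring rule stops being proper in practice.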
Low probability events: Some events are very important, but
have a low probability (extreme stock market crashes, extinction
events, rare diseases, encounters with aliens etc.). But low
probability events are maybe even harder to forecast than long
time horizon events: they often don't have good reference classes,
while long time horizon questions do (that's why we have history and
time series data!), and forecasters very rarely encounter them. We
might just round all probabilities <1% to 0%, lest we get Pascal's
mugged, but in doing so we close our eyes to possible dangers (and
prizes) out there. Conversely, the Talebian approach
of erring on the side of caution by "rounding them up" condemns us to
eternal overcaution and conservatism. So as a first step we definitely
want our probabilities to be as accurate as possible.
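As a rough illustration of why both kinds of rounding cost something, the sketch below computes expected loss under two standard proper scoring rules (Brier and logarithmic) for an event whose true probability is assumed to be 0.5%. The three reports and the true probability are hypothetical values picked for the example.

```python
import math


def expected_brier_loss(p: float, q: float) -> float:
    """Expected squared error of report p when the event has true probability q."""
    return q * (1 - p) ** 2 + (1 - q) * p ** 2


def expected_log_loss(p: float, q: float) -> float:
    """Expected negative log-likelihood; infinite if p rules out a possible outcome."""
    loss = 0.0
    if q > 0:
        loss += math.inf if p == 0 else -q * math.log(p)
    if q < 1:
        loss += math.inf if p == 1 else -(1 - q) * math.log(1 - p)
    return loss


TRUE_Q = 0.005  # hypothetical true probability of the rare event
REPORTS = {"rounded down": 0.0, "accurate": 0.005, "rounded up": 0.05}

for label, p in REPORTS.items():
    brier = expected_brier_loss(p, TRUE_Q)
    log = expected_log_loss(p, TRUE_Q)
    print(f"{label:>12} (p={p}): Brier {brier:.6f}, log {log:.6f}")
```

Under both rules the accurate report has the lowest expected loss. The Brier rule barely separates it from rounding down, while the log rule assigns rounding down an infinite expected loss, so the choice of scoring rule matters a lot for how strongly low-probability accuracy is rewarded.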
Out-of-distribution situations: Whenever things with no
clear existing reference class occur, such as novel technologies
(social media, the internet in general, nuclear weapons,
international shipping logistics, and in the future potentially
genetic engineering or self-driving cars), forecasters struggle to
anticipate the consequences (or foresee those shifts). This isn't
limited to forecasters and prediction markets: if regular people,
pundits and domain experts on average do worse than top forecasters
(though for a counterpoint to the claim that forecasters outperform
experts, see Leech & Yagudin 2022),
then we wouldn't expect them to do much better specifically in very
novel & unforeseen situations (reasons why this could still happen:
experts might have detailed causal models that are outperformed by
simple heuristics in the modal case, but as we go outside of the normal
course of events, those causal & theoretical models break down much more
gracefully than simple surface heuristics).
Hard-to-specify events: Maybe we are slicing up forecasting
the wrong way: as the old adage goes, the hard part is not coming
up with the answer, it is coming up with the right question to
ask. Similarly, for forecasting, we often run into the problem of
specifying exactly what we want to know about: Too broad and
you drive away forecasters and traders who don't want to waste
their time on predicting the whims of whoever resolves the market in the
end; too narrow and you miss what you actually care about or invite
Goodharting. An
additional layer of complexity is added when hobbyists do your
forecasting, in which case narrow questions just aren't very interesting
to predict on. This could be seen with the Metaculus clean meat tournament:
many questions were just different combinatorial variations on
each other, with maybe five being interesting to predict on,
but not all fourteen, leading to many questions receiving fewer
than 100 predictions during the tournament. But "interestingness"
and "specifiability" appear to be tugging in opposite directions:
hobbyists are probably most interested in making broad claims that flow
from their worldview, instead of finding minutiae for very specific
questions. Finding ways to create more specific questions on events
(or avoid doing so with clever tricks while still receiving accurate
forecasts) is important and difficult. Latent variable prediction markets
offer one approach, but how easy are they to implement with acceptable UX?
We can use these categories as guideposts: How bad are these as
problems? What approaches have been proposed/tried/implemented so far? If
we can improve one of them without harming our ability to perform well
on the others, we have made progress; if we improve several in tandem,
that's even better.
How quickly does our forecasting ability decrease with increasing range of the question/forecast?
Does it decrease at all, or just oscillate wildly?
How quickly does performance degrade in different categories of questions (finance, meteorology, global economics, technological development) and by different forecasters (prediction markets, superforecasters & teams)?
Are there people who are better long-term forecasters and people who are better short-term forecasters?
Are better short-term forecasters also better long-term forecasters?
Do forecasters become better at forecasting over time?
How quickly, measured over elapsed time or over the number of forecasts made?
How much does forecaster quantity affect forecast quality on continuous questions? (i.e., extend Dillon 2021 to continuous data)
How much does forecasting time affect forecast quality? That is, what is the relation of accuracy of prediction to the time spent on refining that prediction?
Generally, scaling laws for forecasting would be interesting/cool to see.
How much do number of resolutions/forecasts matter for forecast quality/learning?
Do laypeople/pundits/domain experts perform better than forecasters/superforecasters/forecasting teams/prediction markets specifically under novel & unforeseen situations?
Are more extreme views or more conservative views more accurate?
How well does forecasting expertise in one domain transfer to another?
That is, if a forecaster starts by forecasting in some domain A, and after a while switches to domain B, how much better is the forecaster than if he'd started out in B without any other experience?
This would be even more interesting when also having a metric for the difference between A and B.
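The first question in this list, how accuracy changes with the range of a forecast, could be approached roughly as follows. This is only a sketch: the records are made-up placeholder data (not results from any dataset), the record layout `(range_days, probability, outcome)` is an assumption, and a Pearson correlation between range and Brier score is just one simple summary among several that could be used.

```python
import math


def brier_score(p: float, outcome: int) -> float:
    """Squared error of a probability forecast (lower is better)."""
    return (p - outcome) ** 2


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical records: (days between forecast and resolution, probability, outcome).
RECORDS = [
    (7, 0.9, 1),
    (30, 0.8, 1),
    (90, 0.7, 0),
    (365, 0.6, 0),
    (730, 0.5, 1),
    (1460, 0.4, 1),
]

ranges = [range_days for range_days, _, _ in RECORDS]
scores = [brier_score(p, outcome) for _, p, outcome in RECORDS]
r = pearson(ranges, scores)
print(f"correlation between range and Brier score: {r:.3f}")
```

A positive coefficient would mean that Brier scores get worse as range grows. The placeholder data says nothing about real forecasters; it only shows the shape of the computation.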