author: niplav, created: 2022-04-04, modified: 2024-04-06, language: english, status: notes, importance: 6, confidence: highly likely
Beginnings of a research agenda about judgmental
forecasting.
Judgmental forecasting is a fairly recent and (in my humble opinion)
under-researched & under-appreciated human endeavour and field of
research, with some low-hanging fruit (which are getting picked almost
as fast as I can write them up).
In general, judgmental forecasting methods operate best in areas with
fast feedback loops, large existing datasets (or at least good reference
classes for base rates) and continuous historical trends.
We can therefore identify the five horsemen of hard forecasting:
- Long time horizons: Because most forecasters and traders discount the
future (either because rewards further in the future are less
certain, or because whatever investment is bound up in a bet
could be used in the meantime, or because they actually weigh
the future lower), and because long-term thinking activates far mode
from construal level theory,
the incentives to perform well on long-term questions are weaker
than on short-term questions. Additionally, forecasters receive
much more & better feedback on short-term questions. One would
expect long-term questions to receive less accurate forecasts
because of this, and the evidence points to this being the case (Dillon
2021, Niplav 2022).
But we're often especially interested in long-term questions: How can
we incentivize or create good forecasts on those questions?
- Reward-correlated predictions: The clearest examples of this problem
are questions on extinction events: If you forecast doom, you're never
going to get rewarded for it, because the resolution happens only in
worlds where the bad outcome didn't occur. Forecasters are embedded
agents in the world they are predicting on, and there is no Cartesian
boundary. This can
happen with prediction markets as well: when making predictions on the
outcome of a decision, with the payout of the prediction market being
in a currency that is affected by the decision (for example by devaluing
it relative to other currencies), the market might choose the "worse"
decision (according to the metric used for scoring it) because it prevents
the currency from being devalued as much.
- Low probability events: Some events are very important, but
have a low probability (extreme stock market crashes, extinction
events, rare diseases, encounters with aliens etc.). But low
probability events are maybe even harder to forecast than long
time horizon events: they often don't have good reference classes,
while long time horizon questions do (that's why we have history and
time series data!), and forecasters very rarely encounter them. We
might just round all probabilities <1% to 0%, lest we get Pascal's
mugged, but in doing so we close our eyes to possible dangers (and
prizes) out there. The Talebian approach
of erring on the side of caution by "rounding them up" condemns us to
eternal overcaution and conservatism, so as a first step we definitely
want our probabilities to be as accurate as possible.
- Out-of-distribution situations: Whenever things with no
clear existing reference class occur, such as novel technologies
(social media, the internet in general, nuclear weapons,
international shipping logistics, and in the future potentially
genetic engineering or self-driving cars), forecasters struggle to
anticipate the consequences (or foresee those shifts). This isn't
limited to forecasters and prediction markets: if regular people,
pundits and domain experts on average do worse than top forecasters
(though as a counterpoint to forecasters>experts see Leech & Yagudin 2022),
then we wouldn't expect them to do much better specifically in very
novel & unforeseen situations (reasons why this could still happen:
experts might have detailed causal models that are outperformed by
simple heuristics in the modal case, but as we go outside of the normal
course of events, those causal & theoretical models break down much more
gracefully than simple surface heuristics).
- Hard-to-specify events: Maybe we are slicing up forecasting
the wrong way: as the old adage goes, the hard part is not coming
up with the answer, it is coming up with the right question to
ask. Similarly, for forecasting, we often run into the problem of
specifying exactly what we want to know about: Too broad and
you drive away forecasters and traders who don't want to waste
their time on predicting the whims of whoever resolves the market in the
end, too narrow and you miss what you actually care about or invite
Goodharting. An additional layer of complexity is added when hobbyists
do your forecasting, in which case narrow questions just aren't very
interesting to do predictions on. This could be seen with the Metaculus
clean meat tournament:
many questions were just different combinatorial variations on
each other, with maybe five being interesting to predict on,
but not all fourteen, leading to many questions receiving less
than 100 predictions during the tournament. But "interestingness"
and "specifiability" appear to be tugging in opposite directions:
hobbyists are probably most interested in making broad claims that flow
from their worldview, instead of finding minutiae for very specific
questions. Finding ways to create more specific questions on events
(or avoid doing so with clever tricks while still receiving accurate
forecasts) is important and difficult. Latent variable prediction
markets offer one approach. How easy are they to implement with acceptable UX?
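
The reward-correlation problem from the list above can be made concrete with a toy calculation. This is a minimal sketch, assuming a forecaster who is scored with a Brier penalty on the question "will doom occur?" and who never experiences the penalty in worlds where doom did occur. The function names, the grid search and the 20% belief are illustrative assumptions, not taken from any existing platform.

```python
def expected_penalty(report, belief, paid_if_doom):
    """Expected Brier penalty for reporting `report` as P(doom),
    given the forecaster's true belief `belief`.

    If paid_if_doom is False, the penalty in doom worlds is never
    collected, so only the no-doom branch contributes."""
    penalty_if_doom = (1.0 - report) ** 2   # outcome = 1
    penalty_if_no_doom = report ** 2        # outcome = 0
    doom_branch = belief * penalty_if_doom if paid_if_doom else 0.0
    return doom_branch + (1.0 - belief) * penalty_if_no_doom


def best_report(belief, paid_if_doom, steps=100):
    """Grid-search the report in [0, 1] that minimises the expected penalty."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda r: expected_penalty(r, belief, paid_if_doom))


belief = 0.2  # hypothetical true belief in doom
honest = best_report(belief, paid_if_doom=True)
distorted = best_report(belief, paid_if_doom=False)
print(honest, distorted)
```

Under these assumptions the penalty-minimising report equals the true belief when both branches are scored, and drops to 0 as soon as doom worlds never pay out, whatever the forecaster actually believes.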
We can use these categories as guideposts: How bad are these as
problems? What approaches have been proposed/tried/implemented so far? If
we can improve one of them without harming our ability to perform well
on the others, we have made progress; if we improve several in tandem,
that's even better.
- How good are long-term forecasts?
- How quickly does our forecasting ability decrease with increasing range of the question/forecast?
- Does it decrease at all, or just oscillate wildly?
- How quickly does performance degrade in different categories of questions (finance, meteorology, global economics, technological development) and by different forecasters (prediction markets, superforecasters & teams)?
- Are there people who are better long-term forecasters and people who are better short-term forecasters?
- How good are our forecasts on low-probability events?
- How good are our forecasts on extinction events?
- How good are our forecasts in situations where we have historical discontinuities?
- How quickly/slowly do our forecasts converge to the final answer?
- When don't they converge?
- Can we classify convergence/divergence/oscillation behaviors?
- How do prediction markets, professional forecasting teams, internet enthusiasts and large language models compare?
- What is a good formalization of the idea of a forecaster being accurate at a level of n%?
- See Precision of Sets of Forecasts
- Are better short-term forecasters also better long-term forecasters?
- Do forecasters become better at forecasting over time?
- How quickly?
- Over time/over more forecasts
- How much does forecaster quantity affect forecast quality on continuous questions? (i.e., extend Dillon 2021 to continuous data)
- How much does forecasting time affect forecast quality? That is, what is the relation of accuracy of prediction to the time spent on refining that prediction?
- Generally, scaling laws for forecasting would be interesting/cool to see.
- How much does the number of resolutions/forecasts matter for forecast quality/learning?
- Do laypeople/pundits/domain experts perform better than forecasters/superforecasters/forecasting teams/prediction markets specifically under novel & unforeseen situations?
- Are more extreme views or more conservative views more accurate?
- Question originally asked in Hanson 2007
- How well does forecasting expertise in one domain transfer to another?
- That is, if a forecaster starts by forecasting in some domain $D$, and after a while switches to domain $D'$, how much better is the forecaster than if he'd started out in $D'$ without any other experience?
- This would be even more interesting when also having a metric for the difference between $D$ and $D'$.
- What possible forecasting scoring rules could we develop?
- Taking into account:
- Accuracy compared to others
- Importance of question
- That incentivize collaboration and positive-sum interactions instead of information-hiding
- The literature on information elicitation could be useful here
- How can we compare the skill and reliability of forecasters to one another?
- Metaculus at the moment does this by "who writes good comments". That seems inadequate.
- Taking into account:
- Number of questions each forecaster predicted on
- Calibration
- Resolution
- Importance of questions
- Two boundary methods:
- Compare using a scoring rule on any question the forecasters predicted on
- Compare using a scoring rule on the intersection of the questions the forecasters predicted on
- Two functions of scoring rules: Rewarding or comparing forecasters
- Related field: honest reporting and information elicitation
- See also: Section 27.4.2 from Algorithmic Game Theory (Nisan et al. 2007)
- How can we deal with questions with unclear resolution criteria?
- Collect Metaculus experiments on this
- How do we incentivise good predictions on long-term questions?
- Ideas:
- chained temporal forecasts
- How do we incentivise good predictions on low-probability events?
- Ideas:
- chained conditional forecasts
- Is there any conceivable way of incentivizing good predictions on extinction events?
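
The two boundary methods for comparing forecasters mentioned in the list above can be sketched as follows. This is a minimal illustration, assuming binary questions that have all resolved and the Brier score as the scoring rule. The forecasters `alice` and `bob`, the question ids and all numbers are made up for the example.

```python
def brier(probability, outcome):
    """Brier score of one binary forecast (lower is better)."""
    return (probability - outcome) ** 2


def mean_brier(forecasts, outcomes, questions):
    """Average Brier score of `forecasts` over the given question ids."""
    questions = list(questions)
    if not questions:
        raise ValueError("no questions to score")
    return sum(brier(forecasts[q], outcomes[q]) for q in questions) / len(questions)


def compare(forecasts_a, forecasts_b, outcomes):
    """Score two forecasters (a) on every question each one predicted on
    and (b) only on the questions both predicted on."""
    shared = set(forecasts_a) & set(forecasts_b)
    return {
        "any_question": (mean_brier(forecasts_a, outcomes, forecasts_a),
                         mean_brier(forecasts_b, outcomes, forecasts_b)),
        "intersection": (mean_brier(forecasts_a, outcomes, shared),
                         mean_brier(forecasts_b, outcomes, shared)),
    }


# Hypothetical data: q3 is a hard question that only alice attempted.
outcomes = {"q1": 1, "q2": 0, "q3": 1}
alice = {"q1": 0.9, "q2": 0.1, "q3": 0.4}
bob = {"q1": 0.8, "q2": 0.2}

result = compare(alice, bob, outcomes)
print(result)
```

In this made-up example the ranking flips between the two methods: scored on everything she predicted, `alice` looks worse because she attempted the hard question, while on the intersection she is the better forecaster. Neither method accounts for calibration, resolution or question importance separately.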
Moved here.
- What are methods of scoring/defining how good a question was?
- How many questions resolve due to technicalities in the resolution criteria?
- Are the ratios here different across different question categories?
- How does this ratio develop as one puts more effort into specifying resolution criteria?
- This might be studied qualitatively/semi-quantitatively.
- Where are the big datasets of past judgmental forecasts?
- What is the rate of positive resolution by range?
- How good a predictor of intra-individual cognitive performance is forecasting performance?
- How difficult is it to manipulate real existing prediction platforms?
- How can we develop better forecast aggregation methods?
- Use momentum of past forecasts
- Use the generalized mean with changing $p$ as the time to question resolution shrinks
- Should $p$ be increasing/decreasing/following a more complicated pattern?
- Can we do something cool with the quasi-arithmetic mean?
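
The generalized-mean idea from the list above could look roughly like this. It is a minimal sketch, assuming that individual forecasts are probabilities strictly greater than 0 and that the time left until resolution is given as a fraction between 0 and 1. The `linear_schedule` function and its endpoints `p_far` and `p_near` are purely hypothetical placeholders, since whether $p$ should increase or decrease is exactly the open question.

```python
import math


def generalized_mean(values, p):
    """Power mean of positive values; p = 0 gives the geometric mean."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("values must be positive")
    n = len(values)
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


def linear_schedule(fraction_remaining, p_far=1.0, p_near=-1.0):
    """Hypothetical schedule: p moves linearly from p_far (question just
    opened, fraction_remaining = 1) to p_near (at resolution, 0)."""
    return p_near + (p_far - p_near) * fraction_remaining


def aggregate(probabilities, fraction_remaining, schedule=linear_schedule):
    """Aggregate forecasts with a generalized mean whose exponent
    depends on how much time is left until resolution."""
    return generalized_mean(probabilities, schedule(fraction_remaining))


forecasts = [0.2, 0.8]  # made-up individual forecasts
for fraction in (1.0, 0.5, 0.0):
    print(fraction, aggregate(forecasts, fraction))
```

With this placeholder schedule the aggregate moves from the arithmetic mean through the geometric mean to the harmonic mean as resolution approaches. The generalized mean is itself the quasi-arithmetic mean with $f(x) = x^p$, so swapping in a different $f$ would be one way to explore the last question. Note that a generalized mean of the probabilities for "yes" and one for "no" do not in general sum to 1, which a real aggregation method would have to handle.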