author: niplav, created: 2022-07-15, modified: 2023-06-12, language: english, status: in progress, importance: 6, confidence: certain
A library for handling forecasting datasets is documented.
[…] there is a great gap between a tool existing and everyone using it, and good documentation is as underestimated as open datasets.
—Gwern Branwen, “2019 News”, 2019
Iqisa is a library for handling and comparing different forecasting datasets, focused on taking the burden of dealing with differently organised datasets off the user and presenting them with a unified interface.
On the margin it prioritises correctness over speed, and simplicity over providing the user with every function they could need.
Note that there is not yet a package for iqisa, and you need to be in the directory with the datasets to load them. Sorry about that, I intend to fix it.
The minimal steps for getting started with the library are quite simple. Here's the code for loading the data from the Good Judgment Project prediction markets:
$ python3
>>> import gjp
>>> import iqisa as iqs
>>> market_fcasts=gjp.load_markets()
Similarly, one can also load the data from the Good Judgment Project surveys:
>>> survey_fcasts=gjp.load_surveys()
Now market_fcasts contains the forecasts from all prediction markets from the Good Judgment Project as a pandas DataFrame (and survey_fcasts contains all forecasts from the surveys):
>>> market_fcasts
question_id user_id team_id probability ... n_opts options q_status q_type
0 1040.0 6203 0 0.4000 ... 2 (a) Yes, (b) No closed 0
1 1040.0 6203 0 0.4500 ... 2 (a) Yes, (b) No closed 0
... ... ... ... ... ... ... ... ... ...
793499 1542.0 21975 9 0.0108 ... 2 (a) Yes, (b) No closed 0
793500 1542.0 13854 28 0.0049 ... 2 (a) Yes, (b) No closed 0
[793501 rows x 15 columns]
The load functions are the central piece of the library, as they give you, the user, the data in a format that can be compared across datasets. The other functions are merely suggestions and can be ignored if they don't fit your use-case (iqisa wants to provide you with the data, and not be opinionated about what you do with that data in the end, and how you do it).
Suppose the user now simply wants to know how good the forecasters were at forecasting across all questions:
>>> import numpy as np
>>> def brier_score(probabilities, outcomes):
... return np.mean((probabilities-outcomes)**2)
>>> scores=iqs.score(market_fcasts, brier_score)
>>> scores
score
question_id
1017.0 0.147917
1038.0 0.177000
... ...
5005.0 0.140392
6413.0 0.109608
[411 rows x 1 columns]
>>> np.mean(scores)
score 0.137272
dtype: float64
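For context: a forecaster who always answers 0.5 achieves a Brier score of exactly 0.25 regardless of the outcomes, so this is a real improvement over guessing. A quick check on toy data (not from the library):

>>> brier_score(np.repeat(0.5, 10), np.random.randint(0, 2, 10))
0.25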
Next, the user might define an aggregation function:
>>> import statistics
>>> import numpy as np
>>> def geom_odds_aggr(forecasts):
... probabilities=forecasts['probability']
... probabilities=probabilities/(1-probabilities)
... aggregated=statistics.geometric_mean(probabilities)
... aggregated=aggregated/(1+aggregated)
... return np.array([aggregated])
and pass it to the aggregate function:
>>> aggregations=iqs.aggregate(market_fcasts, geom_odds_aggr)
>>> aggregations
question_id probability outcome answer_option
0 1017.0 0.370863 b a
0 1038.0 0.580189 a a
.. ... ... ... ...
0 5005.0 0.194700 a c
0 6413.0 0.291428 b a
[713 rows x 4 columns]
Now, after aggregating the forecasts, is the Brier score better?
>>> aggr_scores=iqs.score(aggregations, brier_score)
>>> aggr_scores
score
question_id
1017.0 0.137540
1038.0 0.176242
... ...
5005.0 0.334230
6413.0 0.058682
[411 rows x 1 columns]
>>> np.mean(aggr_scores)
score 0.083357
dtype: float64
Yes, it is.
Unlike for scoring by question, there is no library-internal abstraction for scoring users, but this is easy to implement:
def brier_score_user(user_forecasts):
    # 1 if the user chose the correct option, 0 otherwise
    user_right=(user_forecasts['outcome']==user_forecasts['answer_option'])
    probabilities=user_forecasts['probability']
    return np.mean((probabilities-user_right)**2)

trader_scores=iqs.score(market_fcasts, brier_score, on=['user_id'])
However, we might want to exclude traders who have made fewer than, let's say, 100 trades:
filtered_trader_scores=iqs.score(market_fcasts.groupby(['user_id']).filter(lambda x: len(x)>100), brier_score, on=['user_id'])
Surprisingly, the mean score of the traders with >100 trades is not better than the score of all traders:
>>> np.mean(trader_scores)
score 0.159125
dtype: float64
>>> np.mean(filtered_trader_scores)
score 0.159525
dtype: float64
However, filtering removes outliers (both positive and negative):
>>> filtered_trader_scores.min()
score 0.02433
dtype: float64
>>> filtered_trader_scores.max()
score 0.685084
dtype: float64
>>> trader_scores.min()
score 0.0001
dtype: float64
>>> trader_scores.max()
score 0.7921
dtype: float64
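For comparison, the manually defined brier_score_user from above can also be applied without going through iqs.score, using plain pandas (a sketch; this yields one Brier score per trader):

>>> trader_scores_direct=market_fcasts.groupby(['user_id']).apply(brier_score_user)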
Iqisa is intended to make forecasting data and forecasting question data from different datasets available in the same data format, which is described below.
Some functions (gjp.load_markets(), gjp.load_surveys(), metaculus.load_private_binary()) return data in a common format that is intended to be comparable across forecasting datasets. That format is a pandas DataFrame with the following columns:
question_id: The unique ID of the question, type float64.
user_id: The unique ID of the user who made the forecast, type float64.
team_id: The ID of the team the user was in, type float64.
probability: The probability assigned in the forecast, type float64. Probabilities (or probabilities implied by market prices) ≥1 are changed to 1-prob_margin (by default 0.995), and ≤0 to prob_margin (by default 0.005).
answer_option: The answer option selected by the user, type str.
timestamp: The time at which the forecast/trade was made, type datetime64[ns].
outcome: The outcome of the question, type str.
open_time: The time at which the question was opened, i.e. at which forecasts could start, type datetime64[ns].
close_time: The time at which the question was closed, i.e. at which the last possible forecast could be made, type datetime64[ns].
resolve_time: The time at which the resolution of the question was available, type datetime64[ns].
time_open: The days for which the question was open, type timedelta64[ns].
n_opts: The number of options the question had, type int64.
options: A string containing a description of the different possible options, type str.
q_status: The status of the question the forecast was made on, type str.
q_type: The type of the question, type int64.
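Since every loader returns these same columns, code written against this format works on any of the datasets; for example (a sketch), keeping only forecasts on closed questions:

>>> closed=market_fcasts[market_fcasts['q_status']=='closed']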
The question-specific data in a dataset is described by a separate pandas DataFrame, set either manually or by calling the respective load_questions() function.
Its columns are:

question_id, q_status, open_time, close_time, resolve_time, close_date, outcome, time_open, n_opts, options: As in the description of forecasts above.
q_title: The title of the question, as a str.

The following functions can be used to load the forecasting data.
gjp.load_surveys(files=None, processed=True, complete=False) and gjp.load_markets(files=None, processed=True, complete=False)
gjp.load_surveys() loads forecasting data from GJP surveys, and gjp.load_markets() loads forecasting data from GJP prediction markets. They have the same arguments.
files: If None, the data is loaded from the default files (depending on the value of processed). Expects a list of strings of the filenames.
    If processed is True, files is by default gjp.processed_survey_files (for gjp.load_surveys()) or gjp.processed_market_files (for gjp.load_markets()).
    If processed is False, files is by default gjp.survey_files (for gjp.load_surveys()) or gjp.market_files (for gjp.load_markets()).
processed: Whether to load the data from a pre-processed file (if True) or from the original files (if False). The main difference is in speed: loading from the pre-processed file is much faster.
complete: Whether to load all columns present in the dataset (if True) or only the columns described above (if False). Loading all columns returns a bigger and more confusing DataFrame; loading the comparable subset always returns a subset of the columns of the "complete" DataFrame.

Returns a DataFrame in the format described above, loaded from files, potentially with additional columns.
complete=True

Setting complete=True loads the following additional fields for gjp.load_surveys():
forecast_id
fcast_type
fcast_date
expertise
viewtime
year
q_title
q_desc
short_title
Setting complete=True loads the following additional fields for gjp.load_markets():
islong
by_agent
op_type
spent
min_qty
trade_type
with_mm
divest_only
prob_after_trade
matching_order_id
high_fuse
stock_name
low_fuse
created_at
filled_at
trade_qty
isbuy
prob_est
market_name
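For example, to load the market data together with all of the additional fields listed above (a sketch; the resulting DataFrame is larger and slower to load):

>>> full_markets=gjp.load_markets(complete=True)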
gjp.load_questions(files=None)
Returns a pandas DataFrame with the question columns described above, loaded from files, by default from the files listed in gjp.questions_files (value ['./data/gjp/ifps.csv']).
The field resolve_time is the same as close_time, as the GJOpen data doesn't distinguish the two times.
Additionally, this questions data contains the columns:

q_desc: The description of the question, including resolution criteria, type str.
short_title: The shortened title of the question, type str.
.metaculus.load_private_binary(data_file)
Load private binary Metaculus forecasting data in the format the Metaculus developers give to researchers.
data_file is the path to the file holding the private binary data. Returns a DataFrame in this format. If the Metaculus questions file in the iqisa repository is outdated, this might only load a subset of the forecasts in data_file.
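A usage sketch (the file path here is hypothetical; substitute the path of the file you received):

>>> import metaculus
>>> private_fcasts=metaculus.load_private_binary('./private_data.csv')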
metaculus.load_questions(files=None)
Returns a pandas DataFrame with the question columns described above, loaded from files, by default from the files listed in metaculus.questions_files (value ["./data/metaculus/questions.csv"]).
metaculus.load_public_binary(files=None, processed=True)
Note: This data is not the data for individual forecasters, but timeseries data for each question (capped at 101 interpolated datapoints per question).
Returns a pandas DataFrame with forecasting data from the public Metaculus API. The columns of the data are described above, and the data is loaded from files, by default from the files in metaculus.public_files (value ["./data/metaculus/public.csv.zip"]).

files: Specify a different file to load the data from.
processed: If True, load the data from a pre-processed CSV; if False, load it from the original JSON. Currently the only difference is that loading from the original data is slower.

predictionbook.load(files=None, processed=True)
Returns a pandas DataFrame with forecasts from PredictionBook (columns of the data described above). The data is loaded from files, by default predictionbook.public_files (["./data/predictionbook/public.csv.zip"]).

files: Specify a different file to load the data from.
processed: If True, load the data from a pre-processed CSV; if False, load it from the original HTML files. Currently the only difference is that loading from the original data is far, far slower.

predictionbook.load_questions(data_file=None, processed=True)
Returns a pandas DataFrame with the question columns described above, loaded from data_file, by default from the files listed in predictionbook.questions_file (value ["./data/metaculus/questions.csv.zip"]).
Setting processed=False makes the loading much slower, and currently has no other effect.
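A minimal usage sketch with the default arguments:

>>> import predictionbook
>>> pb_fcasts=predictionbook.load()
>>> pb_questions=predictionbook.load_questions()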
aggregate(forecasts, aggregation_function, on=['question_id', 'answer_option'], *args, **kwargs)
Combine multiple forecasts on questions into a single probability by running aggregation_function, an aggregation method provided by the user (e.g. the geometric mean of odds), over the forecasts.
The type signature of the function is
aggregate: DataFrame × (DataFrame × Optional(arguments) -> [0,1]) × list × Optional(arguments) -> DataFrame
To elaborate a bit further:
forecasts: A DataFrame of the format described above, which needs the following columns: question_id, timestamp, probability, answer_option, outcome.
aggregation_function: The user-defined aggregation function, which aggregate calls on each set of forecasts made on the same question (all with the same question_id) for the same answer option.
on: What columns of forecasts to group by/aggregate over. By default the function groups by the question ID and the answer option, so we receive one probability for every answer option on every question.
*args are the variable arguments, and **kwargs are the variable keyword arguments.

Returns a DataFrame with columns probability, outcome, and whatever columns were specified in the argument on (by default question_id and answer_option). probability is the aggregated probability for the answer option on the question.
Define an aggregation method:
>>> import statistics
>>> import numpy as np
>>> def geom_odds_aggr(forecasts):
... probabilities=forecasts['probability']
... probabilities=probabilities/(1-probabilities)
... aggregated=statistics.geometric_mean(probabilities)
... aggregated=aggregated/(1+aggregated)
... return np.array([aggregated])
and pass it to the aggregate function:
>>> aggregations=iqs.aggregate(market_fcasts, geom_odds_aggr)
>>> aggregations
question_id probability outcome answer_option
0 1017.0 0.370863 b a
0 1038.0 0.580189 a a
.. ... ... ... ...
0 5005.0 0.194700 a c
0 6413.0 0.291428 b a
[713 rows x 4 columns]
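As a quick sanity check, geom_odds_aggr can be run on a toy DataFrame (not part of the library): forecasts of 0.5 and 0.8 have odds 1 and 4, whose geometric mean is 2, which maps back to a probability of 2/3.

>>> import pandas as pd
>>> toy=pd.DataFrame({'probability': [0.5, 0.8]})
>>> geom_odds_aggr(toy)
array([0.66666667])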
score(forecasts, scoring_rule, on=['question_id'], *args, **kwargs)
Score predictions or aggregated predictions on questions; the scoring method can be given by the user.
Throws an exception if there are no forecasts loaded/aggregations computed (i.e. the number of rows of forecasts/aggregations is zero).
The type signature of the function is
score: DataFrame × ([0,1]ⁿ × {0,1}ⁿ × Optional(arguments) -> float) × list × Optional(arguments) -> DataFrame
To elaborate a bit further:
forecasts: A DataFrame of the format described above, which needs the following columns: question_id, probability, outcome, answer_option.
scoring_rule: The scoring rule for the forecasts.
on: What columns of forecasts to group by/score on. By default the function groups by the question ID, so we receive one score for every question.
*args are the variable arguments, and **kwargs are the variable keyword arguments.

Returns a new DataFrame with the scores for each group (as defined by on), by default a DataFrame where the index contains the question_ids, and the rows contain the scores.
We aggregate by calculating the arithmetic mean of all forecasts made on a question & option, and score with the Brier score:
def arith_aggr(forecasts):
    # the arithmetic mean of all probabilities in the group
    return np.array([np.mean(forecasts['probability'])])

def brier_score(probabilities, outcomes):
    # mean squared difference between probabilities and 0/1 outcomes
    return np.mean((probabilities-outcomes)**2)
Using these in the repl:
>>> import gjp
>>> import iqisa as iqs
>>> import numpy as np
>>> m=gjp.load_markets()
>>> aggregations=iqs.aggregate(m, arith_aggr)
>>> aggregations.columns
Index(['question_id', 'probability', 'outcome', 'answer_option'], dtype='object')
>>> scores=iqs.score(aggregations, brier_score)
>>> scores
score
question_id
1017.0 0.140625
1038.0 0.176400
... ...
5005.0 0.332759
6413.0 0.081349
[411 rows x 1 columns]
We can now look at summary statistics, including the mean Brier score, over all questions:
>>> scores.describe()
score
count 411.000000
mean 0.102582
std 0.100136
min 0.000574
25% 0.032574
50% 0.067686
75% 0.140791
max 0.661671
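Any function with the same signature can serve as a scoring rule. For example, a logarithmic scoring rule could be sketched as follows (user-defined, not part of the library; it assumes, like brier_score above, arrays of probabilities and 0/1 outcomes):

def log_score(probabilities, outcomes):
    # mean negative log-likelihood of the outcomes; lower is better
    eps=1e-10 # clip to avoid log(0)
    p=np.clip(probabilities, eps, 1-eps)
    return np.mean(-(outcomes*np.log(p)+(1-outcomes)*np.log(1-p)))

It would then be used exactly like brier_score, e.g. iqs.score(aggregations, log_score).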
add_cumul_user_score(forecasts, scoring_rule, *args, **kwargs)
Returns a new DataFrame that contains an additional field cumul_score. The field contains the past performance of the user making that forecast, before the time of prediction.
The type signature of the function is
add_cumul_user_score: DataFrame × ([0,1]ⁿ × {0,1}ⁿ × Optional(arguments) -> float) × Optional(arguments) -> DataFrame
forecasts: A DataFrame with the fields question_id, user_id, probability, timestamp and date_suspend.
scoring_rule: The scoring rule by which the performance will be judged.

Returns a new DataFrame that is a copy of forecasts, with an additional column cumul_score: the score of the user making the forecast on all questions that have resolved before the current prediction (that is, before timestamp), as judged by scoring_rule.
add_cumul_user_perc(forecasts, lower_better=True)
Based on cumulative past scores, add the percentile of forecaster performance the forecaster finds themselves in at the time of forecasting.
Takes a DataFrame with at least the columns timestamp, date_suspend, user_id and cumul_score (e.g. as added by add_cumul_user_score), and a named argument lower_better that, if True, assumes that lower values in cumul_score indicate better performance, and, if False, assumes that higher values in the same field are better.
Returns the same DataFrame it has received as its argument, with an additional column cumul_perc. cumul_perc is the percentile of forecaster performance the forecaster finds themselves in at the time they are making the forecast.
The function is currently very slow (several hours for a dataset of 500k forecasts on my laptop).
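A usage sketch chaining both functions (assuming the forecasts carry the required columns, including date_suspend, and reusing brier_score from above; since lower Brier scores are better, lower_better is left at True):

>>> with_scores=iqs.add_cumul_user_score(market_fcasts, brier_score)
>>> with_percentiles=iqs.add_cumul_user_perc(with_scores, lower_better=True)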
frontfill(forecasts)
Warning: This function makes the dataset given to it ~100 times bigger, which might lead to running out of RAM.
Return a new DataFrame with a set of forecasts so that forecasts by individual forecasters are repeated daily until they make a new forecast or the question is closed.
Takes a DataFrame of the format described above; necessary columns are question_id, user_id, answer_option, timestamp and time_close.
Returns a new DataFrame in which forecasts by individual forecasters are repeated daily until they make a new forecast or the question closes.
$ python3
>>> import gjp
>>> import iqisa as iqs
>>> survey_files=['./data/gjp/survey_fcasts_mini.yr1.csv']
>>> s=gjp.load_surveys(survey_files)
>>> len(s)
9999
>>> s=iqs.frontfill(s)
>>> len(s)
940598
generic_aggregate(group, summ='arith', format='probs', decay=1, extremize='noextr', extrfactor=3, fill=False)
A generic method for combining multiple forecasts into a single number, intended to be plugged as the second argument into aggregate.
group: A DataFrame containing a set of forecasts.
summ: Which method to use to combine forecasts together. Options are:
    arith (default): the arithmetic mean
    geom: the geometric mean
    median: the median
format: Which format to convert the given probabilities to before aggregating. Options are:
    probs (default): keep the raw probabilities
    odds: convert the probabilities to odds
    logodds: the logarithm of the odds ratios
decay: Parameter that describes how much forecasts that were made longer before resolution time should be discounted. If it is 1 (default), no such discounting is done. Otherwise the discount factor is decay to the power of the number of days between the timestamp of the prediction and the closing time of the question. Discounting is only applied if summ is 'arith'.
extremize: Whether and how to extremise forecasts. Options are:
    noextr (default): don't extremise, leave the probabilities as they are
    gjpextr: use the extremising method described in Ungar et al. 2012: given the already aggregated probability $p$ and extremisation factor $a$ (function argument extrfactor, default 3), set the new probability to $\frac{p^a}{p^a+(1-p)^a}$
    postextr: given the already aggregated probability $p$ and extremisation factor $a$ (function argument extrfactor, default 3), extremise the probability to $p^a$
    neyextr: use the extremising method developed in Neyman & Roughgarden 2022: given $n$ forecasts, already aggregated to a probability $p$, extremise with the extremisation factor $n \cdot \frac{\sqrt{3 \cdot n^2-3n+1}-2}{n^2-n-1}$
fill: Change the forecasts so that each forecast is repeated daily until a new forecast is made.

Returns a single number in a numpy array, which is the aggregated probability.
>>> def weird_aggr(group):
... return iqs.generic_aggregate(group, summ="arith", format="logodds", extremize='neyextr', decay=0.995)
>>> iqs.aggregate(market_fcasts, weird_aggr)
question_id probability outcome answer_option
0 1017.0 0.212827 b a
0 1038.0 0.435006 a a
0 1039.0 0.457726 a a
0 1040.0 0.607709 a a
0 1047.0 0.008727 b a
.. ... ... ... ...
0 5002.0 0.156100 c d
0 5005.0 0.166393 a a
0 5005.0 0.638400 a b
0 5005.0 0.188500 a c
0 6413.0 0.047023 b a
[713 rows x 4 columns]
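To get a feel for the extremisation options, the gjpextr formula can be evaluated by hand (a toy calculation in plain Python, not a library call); with extremisation factor a=3, a probability of 0.7 is pushed out towards certainty:

>>> p, a=0.7, 3
>>> round(p**a/(p**a+(1-p)**a), 4)
0.927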
normalise(forecasts, on=['question_id'])
Changes the DataFrame forecasts so that the values in the column probability assigned to different options on the same question sum to 1.
forecasts: A DataFrame with predictions, which should have at least the columns ['question_id', 'probability'].
on: The "scope" under which the values should sum to 1, by default ['question_id'].

Returns a DataFrame with the normalised probabilities.
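A toy sketch (hypothetical two-row DataFrame, assuming the function is exposed as iqs.normalise like aggregate and score above); the two probabilities of 0.6 on the same question are scaled down so that they sum to 1:

>>> import pandas as pd
>>> toy=pd.DataFrame({'question_id': [1.0, 1.0], 'probability': [0.6, 0.6]})
>>> normalised=iqs.normalise(toy)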
gjp.survey_files

A list containing the names of all files in the dataset that contain data from surveys.
gjp.market_files

A list containing the names of all files in the dataset that contain trades on prediction markets.
gjp.processed_survey_files and gjp.processed_market_files

Preprocessed files that contain all survey data (./data/gjp/surveys.csv.zip) and all market data (./data/gjp/markets.csv.zip).
The GJOpen forecast data has some peculiarities, which are described here:
question_id: Follows the format [0-9]{4}.
team_id: The team "DEFAULT" is given the ID 0.
answer_option: One of 'a', 'b', 'c', 'd' or 'e' (or rarely numpy.nan for market data).
outcome: One of 'a', 'b', 'c', 'd' or 'e' (or rarely numpy.nan, in the case of voided questions).
q_status: One of 'closed', 'voided' or 'open'.
q_type: Integer between 0 and 6 (inclusive), encoding the type of the question (e.g. q_type 2: 2nd conditional question).