
author: niplav, created: 2022-07-15, modified: 2023-06-12, language: english, status: on hold, importance: 6, confidence: certain

A library for handling forecasting datasets is documented.


Iqisa Documentation

[…] there is a great gap between a tool existing and everyone using it, and good documentation is as underestimated as open datasets.

Gwern Branwen, “2019 News”, 2019

Iqisa is a library for handling and comparing different forecasting datasets. It aims to take the burden of dealing with differently organised datasets off the user by presenting the data through a unified interface.

On the margin it prioritises correctness over speed, and simplicity over providing the user with every function they could need.


Minimal Example

Note that there is not yet a package for iqisa, and you need to be in the directory with the datasets to load them. Sorry about that, I intend to fix it.

The minimal steps for getting started with the library are quite simple. Here's the code for loading the data from the Good Judgment Project prediction markets:

$ python3
>>> import gjp
>>> import iqisa as iqs
>>> market_fcasts=gjp.load_markets()

Similarly, one can also load the data from the Good Judgment Project surveys:

>>> survey_fcasts=gjp.load_surveys()

Now market_fcasts contains the forecasts from all prediction markets from the Good Judgment Project as a pandas DataFrame (and survey_fcasts all from the surveys):

>>> market_fcasts
        question_id  user_id  team_id  probability  ... n_opts          options q_status q_type
0            1040.0     6203        0       0.4000  ...      2  (a) Yes, (b) No   closed      0
1            1040.0     6203        0       0.4500  ...      2  (a) Yes, (b) No   closed      0
...             ...      ...      ...          ...  ...    ...              ...      ...    ...
793499       1542.0    21975        9       0.0108  ...      2  (a) Yes, (b) No   closed      0
793500       1542.0    13854       28       0.0049  ...      2  (a) Yes, (b) No   closed      0

[793501 rows x 15 columns]

The load functions are the central piece of the library, as they give you, the user, the data in a format that can be compared across datasets. The other functions are merely suggestions and can be ignored if they don't fit your use-case (iqisa wants to provide you with the data, and not be opinionated with what you do with that data in the end, and how you do it).

Aggregating and Scoring

The user might now simply want to know how good the forecasters were across all questions:

>>> import numpy as np
>>> def brier_score(probabilities, outcomes):
...     return np.mean((probabilities-outcomes)**2)
>>> scores=iqs.score(market_fcasts, brier_score)
>>> scores
1017.0       0.147917
1038.0       0.177000
...               ...
5005.0       0.140392
6413.0       0.109608

[411 rows x 1 columns]
>>> np.mean(scores)
score    0.137272
dtype: float64
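
To make the scoring rule concrete, here is a small worked example that runs without iqisa. The three forecasts and outcomes are made up for illustration; only brier_score is taken from the text above.

```python
import numpy as np

def brier_score(probabilities, outcomes):
    # Mean squared difference between the forecast and the 0/1 outcome
    return np.mean((probabilities - outcomes) ** 2)

# Hypothetical toy data: three forecasts and whether the event happened
probabilities = np.array([0.9, 0.2, 0.6])
outcomes = np.array([1, 0, 0])

# Squared errors are roughly 0.01, 0.04 and 0.36, so the mean is 0.41/3
toy_score = brier_score(probabilities, outcomes)
print(toy_score)
```

A forecaster who always answers 0.5 gets a Brier score of 0.25, so values below that indicate at least some skill.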

Next, the user might define an aggregation function:

>>> import statistics
>>> import numpy as np
>>> def geom_odds_aggr(forecasts):
...    probabilities=forecasts['probability']
...    probabilities=probabilities/(1-probabilities)
...    aggregated=statistics.geometric_mean(probabilities)
...    aggregated=aggregated/(1+aggregated)
...    return np.array([aggregated])

and pass it to the aggregate method:

>>> aggregations=iqs.aggregate(market_fcasts, geom_odds_aggr)
>>> aggregations
    question_id  probability outcome answer_option
0        1017.0     0.370863       b             a
0        1038.0     0.580189       a             a
..          ...          ...     ...           ...
0        5005.0     0.194700       a             c
0        6413.0     0.291428       b             a

[713 rows x 4 columns]
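
As a sanity check on what geom_odds_aggr computes, the same arithmetic can be followed by hand on three made-up probabilities. This sketch uses only the standard library and does not touch iqisa.

```python
import statistics

# Hypothetical forecasts on one answer option
probabilities = [0.2, 0.5, 0.8]

# Convert probabilities to odds: about 0.25, 1.0 and 4.0
odds = [p / (1 - p) for p in probabilities]

# Geometric mean of the odds: (0.25 * 1.0 * 4.0) ** (1/3), about 1.0
pooled_odds = statistics.geometric_mean(odds)

# Convert the pooled odds back to a probability
pooled_probability = pooled_odds / (1 + pooled_odds)
print(pooled_probability)
```

Forecasts of exactly 0 or 1 correspond to odds of 0 or infinity, so they would need to be clipped before pooling this way.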

Now, after aggregating the forecasts, is the Brier score better?

>>> aggr_scores=iqs.score(aggregations, brier_score)
>>> aggr_scores
1017.0       0.137540
1038.0       0.176242
...               ...
5005.0       0.334230
6413.0       0.058682

[411 rows x 1 columns]
>>> np.mean(aggr_scores)
score    0.083357
dtype: float64

Yes, it is: the mean Brier score drops from 0.137272 to 0.083357 (lower is better).

Scoring Users

Unlike for scoring by question, there is no library-internal abstraction for scoring users, but this is easy to implement:

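def brier_score_user(user_forecasts):
    # Assumption: a forecast is "right" when its answer_option matches the outcome
    user_right=(user_forecasts['answer_option']==user_forecasts['outcome'])
    return np.mean((user_forecasts['probability']-user_right)**2)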

trader_scores=iqs.score(market_fcasts, brier_score, on=['user_id'])

However, we might want to exclude traders who have made, let's say, 100 or fewer trades:

filtered_trader_scores=iqs.score(market_fcasts.groupby(['user_id']).filter(lambda x: len(x)>100), brier_score, on=['user_id'])
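
The groupby/filter idiom used here is plain pandas. A minimal sketch on a made-up DataFrame (with a threshold of 2 instead of 100) shows what it keeps and what it drops:

```python
import pandas as pd

# Hypothetical toy data: user 1 made three trades, user 2 made one
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "probability": [0.3, 0.4, 0.6, 0.9],
})

# Keep only the rows of users who have more than 2 rows in total
active = toy.groupby(["user_id"]).filter(lambda x: len(x) > 2)
print(active["user_id"].unique())
```

The comparison is strict, so len(x)>100 keeps only traders with at least 101 trades.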

Surprisingly, the mean score of the traders with >100 trades is not better than the score of all traders:

>>> np.mean(trader_scores)
score    0.159125
dtype: float64
>>> np.mean(filtered_trader_scores)
score    0.159525
dtype: float64

However, filtering removes outliers (both positive and negative):

>>> filtered_trader_scores.min()
score    0.02433
dtype: float64
>>> filtered_trader_scores.max()
score    0.685084
dtype: float64
>>> trader_scores.min()
score    0.0001
dtype: float64
>>> trader_scores.max()
score    0.7921
dtype: float64

Forecasts & Questions Format

Iqisa is intended to make forecasting and forecasting question data from different datasets available in the same data format, which is described here.


Some functions (gjp.load_markets(), gjp.load_surveys(), metaculus.load_private_binary()) return data in a common format that is intended to be comparable across forecasting datasets. That format is a pandas DataFrame with the following columns:


This field is a pandas DataFrame describing the question-specific data in the dataset. It is set either manually or by calling load_questions() in a subclass.

Its columns are

Loading Functions

The following functions can be used to load the forecasting data.

gjp.load_surveys(files=None, processed=True, complete=False) and gjp.load_markets(files=None, processed=True, complete=False)

gjp.load_surveys() loads forecasting data from GJP surveys, and gjp.load_markets() loads forecasting data from GJP prediction markets. They have the same arguments.



A DataFrame in the format described here loaded from files, potentially with additional columns.

Additional Fields when complete=True

Setting complete=True loads the following additional fields for gjp.load_surveys():

Setting complete=True loads the following additional fields for gjp.load_markets():


Returns a pandas DataFrame with the columns described here loaded from files, by default from the files listed in gjp.questions_files (value ["./data/gjp/ifps.csv"]).

The field resolve_time is the same as close_time, as the GJOpen data doesn't distinguish the two times.

Additionally, this questions data contains the columns


Load private binary Metaculus forecasting data in the format the Metaculus developers give to researchers.

data_file is the path to the file holding the private binary data.

Returns a DataFrame in this format. If the Metaculus questions file in the iqisa repository is outdated this might only load a subset of the forecasts in data_file.


Returns a pandas DataFrame with the columns described here loaded from files, by default from the files listed in metaculus.questions_files (value ["./data/metaculus/questions.csv"]).

metaculus.load_public_binary(files=None, processed=True)

Note: This data is not the data for individual forecasters, but timeseries data for each question (capped at 101 interpolated datapoints per question).

Returns a pandas DataFrame with forecasting data from the public Metaculus API. The columns of the data are described here, and the data is loaded from files, by default from the files in metaculus.public_files (value ["./data/metaculus/public.csv.zip"]).


predictionbook.load(files=None, processed=True)

Returns a pandas DataFrame with forecasts from PredictionBook (columns of the data described here).

The data is loaded from files, by default public_files (["./data/predictionbook/public.csv.zip"]).


predictionbook.load_questions(data_file=None, processed=True)

Returns a pandas DataFrame with the columns described here loaded from data_file, by default from the files listed in predictionbook.questions_file (value ["./data/metaculus/questions.csv.zip"]).

Setting processed=False makes the loading much slower, and currently has no other effect.

General Functions

aggregate(forecasts, aggregation_function, on=['question_id', 'answer_option'], *args, **kwargs)

Combine multiple forecasts on questions into a single probability by running aggregation_function over the forecasts. The aggregation method is provided by the user (e.g. the geometric mean of odds).


The type signature of the function is

aggregate: DataFrame × (DataFrame × Optional(arguments) -> [0,1]) × list × Optional(arguments) -> DataFrame

To elaborate a bit further:


A DataFrame with columns probability, outcome, and whatever columns were specified in the argument on (by default question_id and answer_option). probability is the aggregated probability over the answer option on the question.
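
The contract described above can be imitated with plain pandas. The following is only a sketch of the input/output shape, not iqisa's implementation: aggregate_sketch and the toy data are hypothetical, and taking the outcome from the first row of each group is an assumption.

```python
import numpy as np
import pandas as pd

def arith_aggr(forecasts):
    # Aggregation function in the shape the text describes: DataFrame -> array
    return np.array([np.mean(forecasts["probability"])])

def aggregate_sketch(forecasts, aggregation_function,
                     on=("question_id", "answer_option")):
    records = []
    # One output row per combination of the grouping columns
    for keys, group in forecasts.groupby(list(on)):
        record = dict(zip(on, keys))
        record["probability"] = aggregation_function(group)[0]
        # Assumption: the outcome is constant within a group
        record["outcome"] = group["outcome"].iloc[0]
        records.append(record)
    return pd.DataFrame(records)

# Hypothetical toy forecasts on two questions
toy = pd.DataFrame({
    "question_id": [1, 1, 2],
    "answer_option": ["a", "a", "a"],
    "probability": [0.2, 0.4, 0.9],
    "outcome": ["b", "b", "a"],
})
aggregated = aggregate_sketch(toy, arith_aggr)
print(aggregated)
```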


Define an aggregation method:

>>> import statistics
>>> import numpy as np
>>> def geom_odds_aggr(forecasts):
...    probabilities=forecasts['probability']
...    probabilities=probabilities/(1-probabilities)
...    aggregated=statistics.geometric_mean(probabilities)
...    aggregated=aggregated/(1+aggregated)
...    return np.array([aggregated])

and pass it to the aggregate function:

>>> aggregations=iqs.aggregate(market_fcasts, geom_odds_aggr)
>>> aggregations
    question_id  probability outcome answer_option
0        1017.0     0.370863       b             a
0        1038.0     0.580189       a             a
..          ...          ...     ...           ...
0        5005.0     0.194700       a             c
0        6413.0     0.291428       b             a

[713 rows x 4 columns]

score(forecasts, scoring_rule, on=['question_id'], *args, **kwargs)

Score predictions or aggregated predictions on questions. The scoring method can be given by the user.


Throws an exception if there are no forecasts loaded/aggregations computed (i.e. the number of rows of forecasts/aggregations is zero).

The type signature of the function is

score: DataFrame × ([0,1]ⁿ × {0,1}ⁿ × Optional(arguments) -> float) × list × Optional(arguments) -> DataFrame

To elaborate a bit further:


A new DataFrame with the scores for each group (as defined by on), by default a DataFrame where the index contains the question_ids, and the rows contain the score.
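
Any function matching the scoring-rule signature above (probabilities and 0/1 outcomes in, a float out) should fit the same slot as the Brier score. As an illustration, here is a logarithmic scoring rule; log_score is a hypothetical name and has not been tested against iqs.score.

```python
import numpy as np

def log_score(probabilities, outcomes):
    # Mean log-probability assigned to what actually happened
    probabilities = np.asarray(probabilities, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean(outcomes * np.log(probabilities)
                   + (1 - outcomes) * np.log(1 - probabilities))

# Two maximally uncertain forecasts score log(0.5) each
toy_log_score = log_score([0.5, 0.5], [1, 0])
print(toy_log_score)
```

Unlike the Brier score, higher values (closer to 0) are better here, and forecasts of exactly 0 or 1 lead to log(0), so they would need clipping first.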


We aggregate by calculating the arithmetic mean of all forecasts made on a question & option, and score with the Brier score:

def arith_aggr(forecasts):
    return np.array([np.mean(forecasts['probability'])])

def brier_score(probabilities, outcomes):
    return np.mean((probabilities-outcomes)**2)

Using these in the repl:

>>> import gjp
>>> import iqisa as iqs
>>> import numpy as np
>>> m=gjp.load_markets()
>>> aggregations=iqs.aggregate(m, arith_aggr)
>>> aggregations.columns
Index(['question_id', 'probability', 'outcome', 'answer_option'], dtype='object')
>>> scores=iqs.score(aggregations, brier_score)
>>> scores
1017.0       0.140625
1038.0       0.176400
...               ...
5005.0       0.332759
6413.0       0.081349

[411 rows x 1 columns]

We can now calculate the average Brier score on all questions:

>>> scores.describe()
count  411.000000
mean     0.102582
std      0.100136
min      0.000574
25%      0.032574
50%      0.067686
75%      0.140791
max      0.661671

add_cumul_user_score(forecasts, scoring_rule, *args, **kwargs)

Return a new DataFrame that contains an additional field cumul_score. The field contains the past performance of the user making that forecast, before the time of prediction.


The type signature of the function is

add_cumul_user_score: DataFrame × ([0,1]ⁿ × {0,1}ⁿ × Optional(arguments) -> float) × Optional(arguments) -> DataFrame


A new DataFrame that is a copy of forecasts, with an additional column cumul_score: the score of the user making the forecast for all questions that have resolved before the current prediction (that is, before timestamp), as judged by scoring_rule.
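
The idea of a cumulative past score can be sketched with plain pandas. This is not iqisa's implementation: cumul_score_sketch is a hypothetical helper, the column names resolve_time and a 0/1 outcome column are assumptions, and the quadratic loop is only meant to make the definition readable.

```python
import numpy as np
import pandas as pd

def brier_score(probabilities, outcomes):
    return np.mean((probabilities - outcomes) ** 2)

# Hypothetical toy data; column names and the 0/1 outcome are assumptions
toy = pd.DataFrame({
    "user_id": [1, 1, 1],
    "question_id": [10, 11, 12],
    "probability": [0.8, 0.3, 0.5],
    "outcome": [1, 0, 1],
    "timestamp": pd.to_datetime(["2022-01-01", "2022-01-15", "2022-02-05"]),
    "resolve_time": pd.to_datetime(["2022-01-10", "2022-02-01", "2022-03-01"]),
})

def cumul_score_sketch(forecasts, scoring_rule):
    result = forecasts.copy()
    cumul = []
    for _, row in forecasts.iterrows():
        # Forecasts by the same user on questions resolved before this forecast
        same_user = forecasts["user_id"] == row["user_id"]
        resolved = forecasts["resolve_time"] < row["timestamp"]
        past = forecasts[same_user & resolved]
        if len(past) == 0:
            cumul.append(np.nan)  # no track record yet
        else:
            cumul.append(scoring_rule(past["probability"].to_numpy(),
                                      past["outcome"].to_numpy()))
    result["cumul_score"] = cumul
    return result

scored = cumul_score_sketch(toy, brier_score)
print(scored[["timestamp", "cumul_score"]])
```

In this toy data the first forecast has no track record, the second is preceded by one resolved question, and the third by two.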

add_cumul_user_perc(forecasts, lower_better=True)

Based on cumulative past scores, add the percentile of forecaster performance the forecaster finds themselves in at the time of forecasting.


Takes a DataFrame with at least the columns

and a named argument lower_better that, if True, assumes that lower values in cumul_score indicate better performance, and if False, assumes that higher values in the same field are better.


The same DataFrame it has received as its argument, with an additional column cumul_perc. cumul_perc is the percentile of forecaster performance the forecaster finds themselves in at the time they are making the forecast.


The function is currently very slow (several hours for a dataset of 500k forecasts on my laptop).


Warning: This function makes the dataset given to it ~100 times bigger, which might lead to running out of RAM.

Return a new DataFrame with a set of forecasts so that forecasts by individual forecasters are repeated daily until they make a new forecast or the question is closed.


A DataFrame of the format described here; the necessary columns are question_id, user_id, answer_option, timestamp, time_close.


A new DataFrame with a set of forecasts so that forecasts by individual forecasters are repeated daily until they make a new forecast or the question is closed.
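
What "repeated daily" means can be shown on a tiny example. The following pandas sketch is an illustration under assumptions, not the library's code: frontfill_sketch is a hypothetical name, and how partial days and the closing day itself are handled is a guess.

```python
import pandas as pd

# Hypothetical toy data: one user updates a forecast once before the close
toy = pd.DataFrame({
    "question_id": [1, 1],
    "user_id": [7, 7],
    "answer_option": ["a", "a"],
    "probability": [0.3, 0.6],
    "timestamp": pd.to_datetime(["2022-01-01", "2022-01-04"]),
    "time_close": pd.to_datetime(["2022-01-06", "2022-01-06"]),
})

def frontfill_sketch(forecasts):
    rows = []
    keys = ["question_id", "user_id", "answer_option"]
    for _, group in forecasts.groupby(keys):
        group = group.sort_values("timestamp")
        # A forecast stays valid until the next one, or until the question closes
        ends = group["timestamp"].shift(-1).fillna(group["time_close"])
        for (_, row), end in zip(group.iterrows(), ends):
            # Assumption: count whole days, and always keep the original forecast
            n_days = max((end - row["timestamp"]).days, 1)
            for day in pd.date_range(row["timestamp"], periods=n_days, freq="D"):
                repeated = row.copy()
                repeated["timestamp"] = day
                rows.append(repeated)
    return pd.DataFrame(rows).reset_index(drop=True)

filled = frontfill_sketch(toy)
print(filled[["timestamp", "probability"]])
```

The two original rows become five: 0.3 on January 1 to 3, then 0.6 on January 4 and 5. This growth per forecast is why the real function inflates the dataset so much.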


$ python3
>>> import gjp
>>> import iqisa as iqs
>>> survey_files=['./data/gjp/survey_fcasts_mini.yr1.csv']
>>> s=gjp.load_surveys(survey_files)
>>> len(s)
>>> s=iqs.frontfill(s)
>>> len(s)

generic_aggregate(group, summ='arith', format='probs', decay=1, extremize='noextr', extrfactor=3, fill=False)

A generic method for combining multiple forecasts into a single number, intended to be plugged as a second argument into aggregate.



A single number in a numpy array, which is the aggregated probability.
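
The following does not describe the internals of generic_aggregate. It only illustrates one mathematical fact that is useful when choosing options: assuming summ='arith' with format='logodds' means averaging log-odds, the result coincides with the geometric mean of odds used earlier. The probabilities are made up.

```python
import math
import statistics

probabilities = [0.1, 0.4, 0.7]  # hypothetical forecasts

# Route 1: arithmetic mean of log-odds, mapped back with the logistic function
logodds = [math.log(p / (1 - p)) for p in probabilities]
mean_logodds = statistics.fmean(logodds)
via_logodds = 1 / (1 + math.exp(-mean_logodds))

# Route 2: geometric mean of odds, converted back to a probability
odds = [p / (1 - p) for p in probabilities]
pooled_odds = statistics.geometric_mean(odds)
via_odds = pooled_odds / (1 + pooled_odds)

# Both routes agree up to floating-point error
print(via_logodds, via_odds)
```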


>>> def weird_aggr(group):
...     return iqs.generic_aggregate(group, summ="arith", format="logodds", extremize='neyextr', decay=0.995)
>>> iqs.aggregate(market_fcasts, weird_aggr)
    question_id  probability outcome answer_option
0        1017.0     0.212827       b             a
0        1038.0     0.435006       a             a
0        1039.0     0.457726       a             a
0        1040.0     0.607709       a             a
0        1047.0     0.008727       b             a
..          ...          ...     ...           ...
0        5002.0     0.156100       c             d
0        5005.0     0.166393       a             a
0        5005.0     0.638400       a             b
0        5005.0     0.188500       a             c
0        6413.0     0.047023       b             a

[713 rows x 4 columns]

normalise(forecasts, on=['question_id'])

Changes forecasts so that the probabilities assigned to different options on the same question sum to 1.



A DataFrame with the normalised probabilities.
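
The normalisation itself is a short pandas operation. The sketch below is not the library's code: normalise_sketch and the toy data are hypothetical, and it returns a copy rather than changing its argument.

```python
import pandas as pd

# Hypothetical toy data: probabilities per question do not sum to 1
toy = pd.DataFrame({
    "question_id": [1, 1, 2, 2, 2],
    "answer_option": ["a", "b", "a", "b", "c"],
    "probability": [0.6, 0.6, 0.1, 0.2, 0.1],
})

def normalise_sketch(forecasts, on=("question_id",)):
    result = forecasts.copy()
    # Sum of probabilities within each group, aligned to the original rows
    totals = result.groupby(list(on))["probability"].transform("sum")
    result["probability"] = result["probability"] / totals
    return result

normalised = normalise_sketch(toy)
print(normalised)
```

On real forecast data one would probably also group by forecaster and time, since probabilities from different forecasters on the same question are not supposed to sum to 1 together.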

Appendix A: Internal Lists of Files


A list containing the names of all files in the dataset that contain data from surveys:


A list containing the names of all files in the dataset that contain trades on prediction markets:

gjp.processed_survey_files and gjp.processed_market_files

Preprocessed files that contain all survey data (./data/gjp/surveys.csv.zip) and all market data (./data/gjp/markets.csv.zip).

Appendix B: Data Peculiarities

The GJOpen forecast data has some peculiarities, which are described here: