Dataloader
This section presents the dataloader object in detail. The available dataloaders are the following:
DummySoccerDataLoader
: Soccer data loader with dummy data, just for testing.
SoccerDataLoader
: Soccer data loader.
In the future, we aim to include more dataloaders for various sports and betting markets:
- Basketball
- NFL
- Hockey
Initialization
A dataloader is initialized with the parameter param_grid that selects the training data to extract. Indirectly, this parameter also affects the extracted fixtures data, since dataloaders ensure that the two are in correspondence, i.e. the input and odds matrices of the training and fixtures data have the same columns.
Available parameters
The available parameters and their values are returned by the class method get_all_params:
from sportsbet.datasets import DummySoccerDataLoader
assert DummySoccerDataLoader.get_all_params() == [
{'division': 1, 'year': 1998},
{'division': 1, 'league': 'France', 'year': 2000},
{'division': 1, 'league': 'France', 'year': 2001},
{'division': 1, 'league': 'Greece', 'year': 2017},
{'division': 1, 'league': 'Greece', 'year': 2019},
{'division': 1, 'league': 'Spain', 'year': 1997},
{'division': 2, 'league': 'England', 'year': 1997},
{'division': 2, 'league': 'Spain', 'year': 1999},
{'division': 3, 'league': 'England', 'year': 1998}
]
Similarly, for SoccerDataLoader:
from sportsbet.datasets import SoccerDataLoader
assert SoccerDataLoader.get_all_params() == [
{'division': 1, 'league': 'Argentina', 'year': 2018},
{'division': 1, 'league': 'Argentina', 'year': 2019},
{'division': 1, 'league': 'Argentina', 'year': 2020},
...,
{'division': 5, 'league': 'England', 'year': 2022},
{'division': 5, 'league': 'England', 'year': 2023},
{'division': 5, 'league': 'England', 'year': 2024}
]
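When the list returned by get_all_params is long, it can help to summarise the distinct values of each key before writing a param_grid. The following is a minimal sketch that works on the list shown above for DummySoccerDataLoader, copied here as plain data so that it runs without the library. The helper available_values is hypothetical and not part of sportsbet.

```python
from collections import defaultdict

# Output of DummySoccerDataLoader.get_all_params(), copied from the example above.
all_params = [
    {'division': 1, 'year': 1998},
    {'division': 1, 'league': 'France', 'year': 2000},
    {'division': 1, 'league': 'France', 'year': 2001},
    {'division': 1, 'league': 'Greece', 'year': 2017},
    {'division': 1, 'league': 'Greece', 'year': 2019},
    {'division': 1, 'league': 'Spain', 'year': 1997},
    {'division': 2, 'league': 'England', 'year': 1997},
    {'division': 2, 'league': 'Spain', 'year': 1999},
    {'division': 3, 'league': 'England', 'year': 1998},
]


def available_values(params):
    """Collect the sorted distinct values of every key (hypothetical helper)."""
    values = defaultdict(set)
    for combination in params:
        for key, value in combination.items():
            values[key].add(value)
    return {key: sorted(vals) for key, vals in values.items()}


summary = available_values(all_params)
for key, vals in summary.items():
    print(key, vals)
```

Any subset of these values can then be used as the lists of a param_grid dictionary.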
Selection of parameters
The parameter param_grid has the same usage as the initialization parameter of scikit-learn's ParameterGrid. The default value of param_grid is None and corresponds to the selection of all training data, i.e. all leagues, years and divisions. If param_grid is provided, it should be a list of dictionaries whose keys are the parameter names shown above and whose values are lists of the values to select.
For example, to use the SoccerDataLoader and include data for the first division of the Spanish and Italian leagues for the years 2018-2020, as well as data for all divisions of the French league for the years 2020-2021, we use the following param_grid and dataloader:
from sportsbet.datasets import SoccerDataLoader
param_grid = [
{'division': [1], 'league': ['Spain', 'Italy'], 'year': [2018, 2019, 2020]},
{'league': ['France'], 'year': [2020, 2021]}
]
dataloader = SoccerDataLoader(param_grid=param_grid)
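To make the meaning of such a grid concrete, the sketch below expands the param_grid into single-valued combinations, in the same Cartesian-product spirit as scikit-learn's ParameterGrid, using only the standard library. The functions expand_param_grid and is_selected are hypothetical helpers. The matching rule for omitted keys (an omitted key places no constraint) is an assumption based on the description above, not the library's confirmed implementation.

```python
from itertools import product


def expand_param_grid(param_grid):
    """Expand a list of dicts of lists into single-valued combinations."""
    combinations = []
    for grid in param_grid:
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            combinations.append(dict(zip(keys, values)))
    return combinations


def is_selected(params, combinations):
    """Assumed rule: params is selected when it agrees with at least one
    combination on every key that the combination specifies."""
    return any(
        all(params.get(key) == value for key, value in combination.items())
        for combination in combinations
    )


param_grid = [
    {'division': [1], 'league': ['Spain', 'Italy'], 'year': [2018, 2019, 2020]},
    {'league': ['France'], 'year': [2020, 2021]},
]
combinations = expand_param_grid(param_grid)
print(len(combinations))  # 2 leagues x 3 years + 1 league x 2 years

# The second dictionary omits 'division', so any French division matches.
print(is_selected({'division': 2, 'league': 'France', 'year': 2020}, combinations))
# The first dictionary restricts Spain to division 1.
print(is_selected({'division': 2, 'league': 'Spain', 'year': 2018}, combinations))
```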
Once the dataloader is initialized, the training and fixtures data can be extracted.
Training data
The training data is a tuple of the input matrix X_train, the multi-output targets Y_train and the odds matrix O_train. You can extract the training data using the method extract_train_data that accepts the parameters drop_na_thres and odds_type.
Both parameters are important, so we briefly discuss their usage. We will use the following dataloader as an example:
from sportsbet.datasets import SoccerDataLoader
param_grid = [
{'division': [1], 'league': ['Greece'], 'year': [2019, 2020]}
]
dataloader = SoccerDataLoader(param_grid=param_grid)
- Parameter drop_na_thres adjusts the threshold for removing a column with missing values from the input matrix X_train. It takes values in the range [0.0, 1.0]. This parameter is included for convenience, since historical data often come with columns that have many missing values and whose presence therefore does not enhance the predictive power of models.
  If we set drop_na_thres=0.0 then all columns are kept:
  X_train, *_ = dataloader.extract_train_data(drop_na_thres=0.0)
  assert len(X_train.columns) == 39
  Similarly, if we set drop_na_thres=1.0 then only columns without missing values are kept:
  X_train, *_ = dataloader.extract_train_data(drop_na_thres=1.0)
  assert len(X_train.columns) == 5
- Parameter odds_type selects the type of odds that will be used for the odds matrix O_train. It also affects the columns of the multi-output targets Y_train, since the Y_train and O_train columns match, as explained below. You can get the available odds types from the method get_odds_types:
  assert dataloader.get_odds_types() == ['market_average', 'market_maximum']
  When odds_type is not provided, its default value is None and O_train is None:
  *_, O_train = dataloader.extract_train_data(drop_na_thres=0.5)
  assert O_train is None
  Selecting one of the above odds types returns the corresponding data:
  X_train, _, O_train = dataloader.extract_train_data(drop_na_thres=0.5, odds_type='market_average')
  assert O_train.columns.tolist() == [
      'odds__market_average__home_win__full_time_goals',
      'odds__market_average__draw__full_time_goals',
      'odds__market_average__away_win__full_time_goals',
      'odds__market_average__over_2.5__full_time_goals',
      'odds__market_average__under_2.5__full_time_goals'
  ]
  Notice that the odds_type parameter does not affect the input matrix. X_train still contains information from all available odds types:
  assert [col for col in X_train.columns if col.startswith('odds')] == [
      'odds__market_maximum__home_win__full_time_goals',
      'odds__market_maximum__draw__full_time_goals',
      'odds__market_maximum__away_win__full_time_goals',
      'odds__market_maximum__over_2.5__full_time_goals',
      'odds__market_maximum__under_2.5__full_time_goals',
      'odds__market_average__home_win__full_time_goals',
      'odds__market_average__draw__full_time_goals',
      'odds__market_average__away_win__full_time_goals',
      'odds__market_average__over_2.5__full_time_goals',
      'odds__market_average__under_2.5__full_time_goals'
  ]
  This is because we use O_train for backtesting, where bets are placed against a specific bookmaker, but the information from other bookmakers may still be useful for the predictive model and is therefore included in X_train.
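The effect of drop_na_thres can be illustrated on toy data without the library. The sketch below assumes that a column is kept when its share of non-missing values is at least drop_na_thres, which is consistent with the two endpoints described above (0.0 keeps everything, 1.0 keeps only complete columns). The exact comparison used by the dataloader is not stated in this section, and drop_sparse_columns is a hypothetical helper.

```python
def drop_sparse_columns(columns, drop_na_thres=0.0):
    """Keep columns whose share of non-missing values is >= drop_na_thres.

    `columns` maps a column name to its list of values; None marks a
    missing value. This keep rule is an assumption, not the library's code.
    """
    if not 0.0 <= drop_na_thres <= 1.0:
        raise ValueError('drop_na_thres must lie in the range [0.0, 1.0]')
    kept = {}
    for name, values in columns.items():
        non_missing = sum(value is not None for value in values)
        if non_missing >= drop_na_thres * len(values):
            kept[name] = values
    return kept


# Toy input matrix with four rows and increasingly sparse columns.
toy = {
    'home_team': ['A', 'B', 'C', 'D'],
    'home__points__avg': [1.5, None, 2.0, 1.0],
    'odds__market_average__draw__full_time_goals': [None, None, None, 3.2],
}

print(list(drop_sparse_columns(toy, drop_na_thres=0.0)))  # all three columns
print(list(drop_sparse_columns(toy, drop_na_thres=0.5)))  # the sparsest column is dropped
print(list(drop_sparse_columns(toy, drop_na_thres=1.0)))  # only the complete column remains
```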
Fixtures data
Once the training data are extracted, it is straightforward to extract the corresponding fixtures data using the method extract_fixtures_data:
from sportsbet.datasets import SoccerDataLoader
param_grid = [
{'division': [1], 'league': ['Spain', 'Italy'], 'year': [2018, 2019, 2020]},
]
dataloader = SoccerDataLoader(param_grid=param_grid)
X_train, Y_train, O_train = dataloader.extract_train_data(odds_type='market_average')
X_fix, Y_fix, O_fix = dataloader.extract_fixtures_data()
The method accepts no parameters and the extracted fixtures input and odds matrices have the same columns as the latest extracted input and odds matrices for the training data:
assert X_train.columns.tolist() == X_fix.columns.tolist()
assert O_train.columns.tolist() == O_fix.columns.tolist()
Since we are extracting the fixtures data, there is no target matrix:
assert Y_fix is None
Description of data
As we have seen above, the extracted data are the following:
- Training:
(X_train, Y_train, O_train)
- Fixtures:
(X_fix, None, O_fix)
As an example we use the following data:
from sportsbet.datasets import SoccerDataLoader
param_grid = {'league': ['England'], 'year': [2021]}
dataloader = SoccerDataLoader(param_grid=param_grid)
X_train, Y_train, O_train = dataloader.extract_train_data(odds_type='market_maximum')
X_fix, Y_fix, O_fix = dataloader.extract_fixtures_data()
A detailed description of the above tuples of data is provided below.
X_train
X_train is the first component of the training data tuple. X_train is a pandas DataFrame that contains information known before the start of the betting event, like the date, the names of the opponents, features related to the past performance of the opponents and any other information useful for predictive modelling:
assert X_train.columns.tolist() == [
'league',
'division',
'year',
'home_team',
'away_team',
'odds__market_maximum__home_win__full_time_goals',
'odds__market_maximum__draw__full_time_goals',
'odds__market_maximum__away_win__full_time_goals',
'odds__market_maximum__over_2.5__full_time_goals',
'odds__market_maximum__under_2.5__full_time_goals',
'odds__market_average__home_win__full_time_goals',
'odds__market_average__draw__full_time_goals',
'odds__market_average__away_win__full_time_goals',
'odds__market_average__over_2.5__full_time_goals',
'odds__market_average__under_2.5__full_time_goals',
'home__points__avg',
'home__adj_points__avg',
'home__goals_for__avg',
'home__goals_against__avg',
'home__adj_goals_for__avg',
'home__adj_goals_against__avg',
'home__points__latest_avg',
'home__adj_points__latest_avg',
'home__goals_for__latest_avg',
'home__goals_against__latest_avg',
'home__adj_goals_for__latest_avg',
'home__adj_goals_against__latest_avg',
'away__points__avg',
'away__adj_points__avg',
'away__goals_for__avg',
'away__goals_against__avg',
'away__adj_goals_for__avg',
'away__adj_goals_against__avg',
'away__points__latest_avg',
'away__adj_points__latest_avg',
'away__goals_for__latest_avg',
'away__goals_against__latest_avg',
'away__adj_goals_for__latest_avg',
'away__adj_goals_against__latest_avg'
]
It may also include odds data as shown above. The index of X_train is a pandas DatetimeIndex and the data are always sorted by date:
assert X_train.index.tolist() == [
Timestamp('2020-09-11 00:00:00'),
Timestamp('2020-09-12 00:00:00'),
...,
Timestamp('2021-05-23 00:00:00'),
Timestamp('2021-05-23 00:00:00')
]
Y_train
Y_train
is the second component of the training data tuple:
assert Y_train.columns.tolist() == [
'output__home_win__full_time_goals',
'output__draw__full_time_goals',
'output__away_win__full_time_goals',
'output__over_2.5__full_time_goals',
'output__under_2.5__full_time_goals'
]
Y_train is a pandas DataFrame that contains information known after the end of the betting event, like goals or points scored, fouls committed etc. Column names follow a naming convention of the form f'output__{betting_market}__{target}':
betting_market
: Any supported betting market like home win, over 2.5, draw, home points etc.
target
: The outcome that was used to extract the targets like 'full_time_goals', 'half_time_goals', 'full_time_points' etc.
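Because the parts of a column name are separated by double underscores, the naming convention can be taken apart with a plain string split. The following sketch uses column names from the Y_train example above. The function parse_output_column is a hypothetical helper, not part of sportsbet.

```python
def parse_output_column(column):
    """Split 'output__{betting_market}__{target}' into its two named parts."""
    parts = column.split('__')
    if len(parts) != 3 or parts[0] != 'output':
        raise ValueError(f'Not an output column: {column!r}')
    _, betting_market, target = parts
    return betting_market, target


print(parse_output_column('output__home_win__full_time_goals'))
print(parse_output_column('output__over_2.5__full_time_goals'))
```

Single underscores inside a part, as in home_win or full_time_goals, are left untouched because only the double underscore acts as a separator.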
The entries of Y_train show whether an outcome of a betting event is True or False. In order to make the data suitable for modelling, Y_train does not contain any missing values, i.e. rows of raw data that contain any missing values are removed. This last step also applies to X_train and O_train: their corresponding rows are removed to match Y_train.
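The row removal described above can be sketched on toy rows with the standard library. The sketch assumes that a row is dropped when any of its target values is missing and that the same rows are then dropped from the input and odds data. The helper drop_rows_with_missing_targets and the toy values are hypothetical; the dataloader's actual implementation is not shown in this section.

```python
def drop_rows_with_missing_targets(X_rows, Y_rows, O_rows=None):
    """Drop rows whose targets contain None from all three tables.

    Each table is a list of dicts, one dict per betting event, and the
    tables are aligned by position. O_rows may be None, which mirrors
    the case where no odds type is selected.
    """
    keep = [all(value is not None for value in y_row.values()) for y_row in Y_rows]
    X_kept = [row for row, flag in zip(X_rows, keep) if flag]
    Y_kept = [row for row, flag in zip(Y_rows, keep) if flag]
    O_kept = None if O_rows is None else [row for row, flag in zip(O_rows, keep) if flag]
    return X_kept, Y_kept, O_kept


X_rows = [{'home_team': 'A'}, {'home_team': 'B'}, {'home_team': 'C'}]
Y_rows = [
    {'output__home_win__full_time_goals': True},
    {'output__home_win__full_time_goals': None},   # unknown outcome, row is dropped
    {'output__home_win__full_time_goals': False},
]
O_rows = [
    {'odds__market_average__home_win__full_time_goals': 1.8},
    {'odds__market_average__home_win__full_time_goals': 2.1},
    {'odds__market_average__home_win__full_time_goals': None},  # missing odds are kept
]

X_kept, Y_kept, O_kept = drop_rows_with_missing_targets(X_rows, Y_rows, O_rows)
print([row['home_team'] for row in X_kept])
```

In this sketch only the targets decide which rows survive, so the odds rows that remain may still contain missing values.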
O_train
O_train
is the last component of the training data tuple:
assert O_train.columns.tolist() == [
'odds__market_maximum__home_win__full_time_goals',
'odds__market_maximum__draw__full_time_goals',
'odds__market_maximum__away_win__full_time_goals',
'odds__market_maximum__over_2.5__full_time_goals',
'odds__market_maximum__under_2.5__full_time_goals'
]
O_train is a pandas DataFrame that contains information related to the odds for various betting markets. Column names follow a naming convention of the form f'odds__{bookmaker}__{betting_market}__{target}':
bookmaker
: Any supported bookmaker or aggregation of bookmakers returned by the method get_odds_types.
betting_market
: Similar to Y_train.
target
: Similar to Y_train.
The entries of O_train are the odds values of betting events and, depending on the data source, may contain missing values. Y_train and O_train columns match, i.e. Y_train and O_train have the same shape and the f'output__{betting_market}__{target}' column of Y_train is at the same position as the f'odds__{bookmaker}__{betting_market}__{target}' column of O_train. The correspondence is clear in the examples above.
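The positional correspondence between the two sets of columns can be checked mechanically from the names alone. The sketch below uses the Y_train and O_train column lists from the examples above, copied as plain lists so that it runs without the library. The function columns_match is a hypothetical helper.

```python
def columns_match(y_columns, o_columns):
    """Check that output and odds columns agree position by position
    on the betting market and the target."""
    if len(y_columns) != len(o_columns):
        return False
    for y_column, o_column in zip(y_columns, o_columns):
        y_parts = y_column.split('__')
        o_parts = o_column.split('__')
        if len(y_parts) != 3 or len(o_parts) != 4:
            return False
        if y_parts[0] != 'output' or o_parts[0] != 'odds':
            return False
        # Compare (betting_market, target) and ignore the bookmaker part.
        if y_parts[1:] != o_parts[2:]:
            return False
    return True


y_columns = [
    'output__home_win__full_time_goals',
    'output__draw__full_time_goals',
    'output__away_win__full_time_goals',
    'output__over_2.5__full_time_goals',
    'output__under_2.5__full_time_goals',
]
o_columns = [
    'odds__market_maximum__home_win__full_time_goals',
    'odds__market_maximum__draw__full_time_goals',
    'odds__market_maximum__away_win__full_time_goals',
    'odds__market_maximum__over_2.5__full_time_goals',
    'odds__market_maximum__under_2.5__full_time_goals',
]

print(columns_match(y_columns, o_columns))
print(columns_match(y_columns, list(reversed(o_columns))))
```

A check of this kind could be run before backtesting, where each target column is paired with the odds column at the same position.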
X_fix
X_fix is the first component of the fixtures data tuple. It is a pandas DataFrame that contains information known before the start of the betting event. The features of X_fix are identical to the features of X_train:
assert X_train.columns.tolist() == X_fix.columns.tolist()
X_fix is not affected by the initialization parameter param_grid of the dataloader, i.e. it contains the latest fixtures for every league, division or any other parameter, even if they are not included in the training data.
Y_fix
Y_fix is always equal to None since the output of betting events for fixtures data is not known:
assert Y_fix is None
O_fix
O_fix is the last component of the fixtures data tuple. It is a pandas DataFrame that contains information related to the odds for various betting markets. The features of O_fix are identical to the features of O_train:
assert O_train.columns.tolist() == O_fix.columns.tolist()