Datasets

It provides the tools to extract sports betting data.

DummySoccerDataLoader(param_grid=None)

Bases: _BaseDataLoader

Dataloader for soccer dummy data.

The data are provided only for convenience, since they require no downloading, and to familiarize the user with the methods of the dataloader objects.

Read more in the user guide.

Parameters:

Name Type Description Default
param_grid ParamGrid | None

It selects the type of information that the data include. The keys of the dictionaries are parameters such as 'league' or 'division', while the values are sequences of allowed values. It works similarly to the param_grid parameter of scikit-learn's ParameterGrid class. The default value None selects all parameters.

None
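As a rough illustration of how a grid of allowed values turns into individual parameter combinations, the following standard-library sketch mimics the expansion described above. The helper name `expand_param_grid` is hypothetical and is not part of sportsbet or scikit-learn; the ordering of the results is a choice made for this sketch only.

```python
from itertools import product


def expand_param_grid(param_grid):
    """Expand a dict (or list of dicts) of allowed values into single combinations."""
    grids = [param_grid] if isinstance(param_grid, dict) else list(param_grid)
    combinations = []
    for grid in grids:
        # Sort the keys so that the output order is deterministic.
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            combinations.append(dict(zip(keys, values)))
    return combinations


# Hypothetical selection: two leagues, one year.
selected = expand_param_grid({'league': ['England', 'Spain'], 'year': [2020]})
print(selected)
# [{'league': 'England', 'year': 2020}, {'league': 'Spain', 'year': 2020}]
```

Each resulting dictionary describes one slice of data that the selection allows; a list of dictionaries is expanded grid by grid.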

Attributes:

Name Type Description
param_grid_ ParameterGrid

The checked value of the parameters grid. It includes all possible parameters if param_grid is None.

dropped_na_cols_ Index

The columns with missing values that are dropped.

drop_na_thres_ float

The checked value of drop_na_thres.

odds_type_ str | None

The checked value of odds_type.

input_cols_ Index

The columns of X_train and X_fix.

output_cols_ Index

The columns of Y_train and Y_fix.

odds_cols_ Index

The columns of O_train and O_fix.

target_cols_ Index

The columns used for the extraction of output and odds columns.

train_data_ TrainData

The tuple (X, Y, O) that represents the training data as extracted from the method extract_train_data.

fixtures_data_ FixturesData

The tuple (X, Y, O) that represents the fixtures data as extracted from the method extract_fixtures_data.

Examples:

>>> from sportsbet.datasets import DummySoccerDataLoader
>>> import pandas as pd
>>> # Get all available parameters to select the training data
>>> DummySoccerDataLoader.get_all_params()
[{'division': 1, 'year': 1998}, ...
>>> # Select only the training data for the Spanish league
>>> dataloader = DummySoccerDataLoader(param_grid={'league': ['Spain']})
>>> # Get available odds types
>>> dataloader.get_odds_types()
['interwetten', 'williamhill']
>>> # Select the odds of Interwetten bookmaker
>>> X_train, Y_train, O_train = dataloader.extract_train_data(
... odds_type='interwetten')
>>> # Training input data
>>> print(X_train)
            division league  year ... odds__williamhill__draw__full_time_goals
date
1997-05-04         1  Spain  1997 ...                                      2.5
1999-03-04         2  Spain  1999 ...                                      NaN
>>> # Training output data
>>> print(Y_train)
output__home_win__full_time_goals ... output__away_win__full_time_goals
0                               True ...                             False
1                              False ...                             False
>>> # Training odds data
>>> print(O_train)
odds__interwetten__home_win__full_time_goals ...
0                                           1.5 ...
1                                           2.5 ...
>>> # Extract the corresponding fixtures data
>>> X_fix, Y_fix, O_fix = dataloader.extract_fixtures_data()
>>> # Training and fixtures input and odds data have the same column names
>>> pd.testing.assert_index_equal(X_train.columns, X_fix.columns)
>>> pd.testing.assert_index_equal(O_train.columns, O_fix.columns)
>>> # Fixtures data never include output
>>> Y_fix is None
True
Source code in src/sportsbet/datasets/_dummy.py
def __init__(self: Self, param_grid: ParamGrid | None = None) -> None:
    super().__init__(param_grid)

extract_fixtures_data()

Extract the fixtures data.

Read more in the user guide.

It returns fixtures data that can be used to make predictions for upcoming matches based on a betting strategy.

Before calling the extract_fixtures_data method for the first time, the extract_train_data method should be called, in order to match the columns of the input, output and odds data.

The data contain information about the matches known before the start of the match, i.e. the input data X and the odds data O. The multi-output targets Y are always equal to None and are only included for consistency with the method extract_train_data.

The param_grid parameter of the initialization method has no effect on the fixtures data.

Returns:

Type Description
(X, None, O)

The components represent the fixtures input data X, the multi-output targets Y equal to None and the corresponding odds O, respectively.
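One way to picture why the training extraction has to come first: the fixtures rows are laid out using the columns that were fixed when the training data were extracted. The sketch below is an assumption made for illustration, not the library's implementation; `align_to_columns` and the sample column names are hypothetical.

```python
def align_to_columns(record, columns, fill_value=None):
    """Return the record's values ordered like `columns`, filling any gaps."""
    return [record.get(col, fill_value) for col in columns]


# Hypothetical input columns, fixed when the training data were extracted.
train_input_cols = ['division', 'league', 'year']

# A hypothetical fixture with one extra field and one missing field.
fixture = {'league': 'Spain', 'division': 1, 'extra': 'ignored'}

row = align_to_columns(fixture, train_input_cols)
print(row)  # [1, 'Spain', None]
```

Without the training columns there would be no reference layout to align against, which is the ordering constraint the description above refers to.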

Source code in src/sportsbet/datasets/_dummy.py
def extract_fixtures_data(self: Self) -> FixturesData:
    """Extract the fixtures data.

    Read more in the [user guide][dataloader].

    It returns fixtures data that can be used to make predictions for
    upcoming matches based on a betting strategy.

    Before calling the `extract_fixtures_data` method for
    the first time, the `extract_train_data` method should be called, in
    order to match the columns of the input, output and odds data.

    The data contain information about the matches known before the
    start of the match, i.e. the input data `X` and the odds
    data `O`. The multi-output targets `Y` are always equal to `None`
    and are only included for consistency with the method `extract_train_data`.

    The `param_grid` parameter of the initialization method has no effect
    on the fixtures data.

    Returns:
        (X, None, O):
            The components represent the fixtures input data `X`, the
            multi-output targets `Y` equal to `None` and the
            corresponding odds `O`, respectively.
    """
    return super().extract_fixtures_data()

extract_train_data(drop_na_thres=0.0, odds_type=None)

Extract the training data.

Read more in the user guide.

It returns historical data that can be used to create a betting strategy based on heuristics or machine learning models.

The data contain two categories of information about the matches. The first category includes any information known before the start of the match, i.e. the input data X and the odds data O. The second category includes the outcomes of matches, i.e. the multi-output targets Y.

The method selects only the data allowed by the param_grid parameter of the initialization method. Additionally, columns with missing values are dropped through the drop_na_thres parameter, while the type of odds returned is defined by the odds_type parameter.

Parameters:

Name Type Description Default
drop_na_thres float

The threshold that specifies the input columns to drop. It is a float in the [0.0, 1.0] range. Higher values result in dropping more columns. The default value drop_na_thres=0.0 keeps all columns while the maximum value drop_na_thres=1.0 keeps only columns with no missing values.

0.0
odds_type str | None

The selected odds type. It should be one of the available odds column prefixes returned by the method get_odds_types. If odds_type=None then no odds are returned.

None
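The effect of the drop_na_thres threshold can be sketched with plain Python. The exact rule the library applies is not shown on this page; the sketch assumes a column is kept when its share of non-missing values is at least the threshold, which agrees with the two documented endpoints (0.0 keeps everything, 1.0 keeps only complete columns). The function `select_columns` and the sample data are hypothetical.

```python
def select_columns(columns, drop_na_thres):
    """Keep columns whose share of non-missing values is at least the threshold."""
    if not 0.0 <= drop_na_thres <= 1.0:
        raise ValueError('drop_na_thres must be in the [0.0, 1.0] range.')
    kept = []
    for name, values in columns.items():
        # None stands in for a missing value in this sketch.
        non_missing = sum(value is not None for value in values)
        if non_missing / len(values) >= drop_na_thres:
            kept.append(name)
    return kept


# Hypothetical data: one complete column and two with missing values.
data = {
    'year': [1997, 1999, 2000, 2001],
    'odds_a': [2.5, None, 3.0, 1.8],
    'odds_b': [None, None, None, 2.0],
}

print(select_columns(data, 0.0))  # ['year', 'odds_a', 'odds_b']
print(select_columns(data, 0.5))  # ['year', 'odds_a']
print(select_columns(data, 1.0))  # ['year']
```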

Returns:

Type Description
(X, Y, O)

The components represent the training input data X, the multi-output targets Y and the corresponding odds O, respectively.

Source code in src/sportsbet/datasets/_dummy.py
def extract_train_data(
    self: Self,
    drop_na_thres: float = 0.0,
    odds_type: str | None = None,
) -> TrainData:
    """Extract the training data.

    Read more in the [user guide][dataloader].

    It returns historical data that can be used to create a betting
    strategy based on heuristics or machine learning models.

    The data contain two categories of information about the
    matches. The first category includes any information
    known before the start of the match, i.e. the input data `X`
    and the odds data `O`. The second category includes the outcomes of
    matches, i.e. the multi-output targets `Y`.

    The method selects only the data allowed by the `param_grid`
    parameter of the initialization method. Additionally, columns with missing
    values are dropped through the `drop_na_thres` parameter, while the
    type of odds returned is defined by the `odds_type` parameter.

    Args:
        drop_na_thres:
            The threshold that specifies the input columns to drop. It is a float in
            the `[0.0, 1.0]` range. Higher values result in dropping more columns.
            The default value `drop_na_thres=0.0` keeps all columns while the
            maximum value `drop_na_thres=1.0` keeps only columns with no
            missing values.

        odds_type:
            The selected odds type. It should be one of the available odds column
            prefixes returned by the method `get_odds_types`. If `odds_type=None`
            then no odds are returned.

    Returns:
        (X, Y, O):
            The components represent the training input data `X`, the
            multi-output targets `Y` and the corresponding odds `O`, respectively.
    """
    return super().extract_train_data(drop_na_thres, odds_type)

SoccerDataLoader(param_grid=None)

Bases: _BaseDataLoader

Dataloader for soccer data.

It downloads historical and fixtures data for various leagues, years and divisions.

Read more in the user guide.

Parameters:

Name Type Description Default
param_grid ParamGrid | None

It selects the type of information that the data include. The keys of the dictionaries are parameters such as 'league' or 'division', while the values are sequences of allowed values. It works similarly to the param_grid parameter of scikit-learn's ParameterGrid class. The default value None selects all parameters.

None

Attributes:

Name Type Description
param_grid_ ParameterGrid

The checked value of the parameters grid. It includes all possible parameters if param_grid is None.

dropped_na_cols_ Index

The columns with missing values that are dropped.

drop_na_thres_ float

The checked value of drop_na_thres.

odds_type_ str | None

The checked value of odds_type.

input_cols_ Index

The columns of X_train and X_fix.

output_cols_ Index

The columns of Y_train and Y_fix.

odds_cols_ Index

The columns of O_train and O_fix.

target_cols_ Index

The columns used for the extraction of output and odds columns.

train_data_ TrainData

The tuple (X, Y, O) that represents the training data as extracted from the method extract_train_data.

fixtures_data_ FixturesData

The tuple (X, Y, O) that represents the fixtures data as extracted from the method extract_fixtures_data.

Examples:

>>> from sportsbet.datasets import SoccerDataLoader
>>> import pandas as pd
>>> # Get all available parameters to select the training data
>>> SoccerDataLoader.get_all_params()
[{'division': 1, 'league': 'Argentina', ...
>>> # Select only the training data for the English and Spanish leagues of the year 2020
>>> dataloader = SoccerDataLoader(
... param_grid={'league': ['England', 'Spain'], 'year':[2020]})
>>> # Get available odds types
>>> dataloader.get_odds_types()
['market_average', 'market_maximum']
>>> # Select the market average odds
>>> X_train, Y_train, O_train = dataloader.extract_train_data(
... odds_type='market_average')
>>> # Odds data include the selected market average odds
>>> O_train.columns
Index(['odds__market_average__home_win__full_time_goals', ...
>>> # Extract the corresponding fixtures data
>>> X_fix, Y_fix, O_fix = dataloader.extract_fixtures_data()
>>> # Training and fixtures input and odds data have the same column names
>>> pd.testing.assert_index_equal(X_train.columns, X_fix.columns)
>>> pd.testing.assert_index_equal(O_train.columns, O_fix.columns)
>>> # Fixtures data never include output
>>> Y_fix is None
True
Source code in src/sportsbet/datasets/_soccer/_data.py
def __init__(self: Self, param_grid: ParamGrid | None = None) -> None:
    super().__init__(param_grid)

extract_fixtures_data()

Extract the fixtures data.

Read more in the user guide.

It returns fixtures data that can be used to make predictions for upcoming matches based on a betting strategy.

Before calling the extract_fixtures_data method for the first time, the extract_train_data method should be called, in order to match the columns of the input, output and odds data.

The data contain information about the matches known before the start of the match, i.e. the input data X and the odds data O. The multi-output targets Y are always equal to None and are only included for consistency with the method extract_train_data.

The param_grid parameter of the initialization method has no effect on the fixtures data.

Returns:

Type Description
(X, None, O)

The components represent the fixtures input data X, the multi-output targets Y equal to None and the corresponding odds O, respectively.

Source code in src/sportsbet/datasets/_soccer/_data.py
def extract_fixtures_data(self: Self) -> FixturesData:
    """Extract the fixtures data.

    Read more in the [user guide][dataloader].

    It returns fixtures data that can be used to make predictions for
    upcoming matches based on a betting strategy.

    Before calling the `extract_fixtures_data` method for
    the first time, the `extract_train_data` method should be called, in
    order to match the columns of the input, output and odds data.

    The data contain information about the matches known before the
    start of the match, i.e. the input data `X` and the odds
    data `O`. The multi-output targets `Y` are always equal to `None`
    and are only included for consistency with the method `extract_train_data`.

    The `param_grid` parameter of the initialization method has no effect
    on the fixtures data.

    Returns:
        (X, None, O):
            The components represent the fixtures input data `X`, the
            multi-output targets `Y` equal to `None` and the
            corresponding odds `O`, respectively.
    """
    return super().extract_fixtures_data()

extract_train_data(drop_na_thres=0.0, odds_type=None)

Extract the training data.

Read more in the user guide.

It returns historical data that can be used to create a betting strategy based on heuristics or machine learning models.

The data contain two categories of information about the matches. The first category includes any information known before the start of the match, i.e. the input data X and the odds data O. The second category includes the outcomes of matches, i.e. the multi-output targets Y.

The method selects only the data allowed by the param_grid parameter of the initialization method. Additionally, columns with missing values are dropped through the drop_na_thres parameter, while the type of odds returned is defined by the odds_type parameter.

Parameters:

Name Type Description Default
drop_na_thres float

The threshold that specifies the input columns to drop. It is a float in the [0.0, 1.0] range. Higher values result in dropping more columns. The default value drop_na_thres=0.0 keeps all columns while the maximum value drop_na_thres=1.0 keeps only columns with no missing values.

0.0
odds_type str | None

The selected odds type. It should be one of the available odds column prefixes returned by the method get_odds_types. If odds_type=None then no odds are returned.

None

Returns:

Type Description
(X, Y, O)

The components represent the training input data X, the multi-output targets Y and the corresponding odds O, respectively.
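The column names in the examples on this page follow a double-underscore pattern such as odds__market_average__home_win__full_time_goals. Assuming that pattern holds in general, the relation between odds_type and the odds columns can be sketched as below. Both helper functions are hypothetical, and the market_maximum column name is inferred by analogy rather than taken from actual output.

```python
def odds_types(columns):
    """Collect the distinct odds types found in 'odds__<type>__...' column names."""
    types = {col.split('__')[1] for col in columns if col.startswith('odds__')}
    return sorted(types)


def odds_columns(columns, odds_type):
    """Return the odds columns of the selected type, or none if odds_type is None."""
    if odds_type is None:
        return []
    prefix = f'odds__{odds_type}__'
    return [col for col in columns if col.startswith(prefix)]


# Hypothetical column names modelled on the examples in this page.
cols = [
    'division',
    'odds__market_average__home_win__full_time_goals',
    'odds__market_maximum__home_win__full_time_goals',
    'output__home_win__full_time_goals',
]

print(odds_types(cols))  # ['market_average', 'market_maximum']
print(odds_columns(cols, 'market_average'))
# ['odds__market_average__home_win__full_time_goals']
```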

Source code in src/sportsbet/datasets/_soccer/_data.py
def extract_train_data(
    self: Self,
    drop_na_thres: float = 0.0,
    odds_type: str | None = None,
) -> TrainData:
    """Extract the training data.

    Read more in the [user guide][dataloader].

    It returns historical data that can be used to create a betting
    strategy based on heuristics or machine learning models.

    The data contain two categories of information about the
    matches. The first category includes any information
    known before the start of the match, i.e. the input data `X`
    and the odds data `O`. The second category includes the outcomes of
    matches, i.e. the multi-output targets `Y`.

    The method selects only the data allowed by the `param_grid`
    parameter of the initialization method. Additionally, columns with missing
    values are dropped through the `drop_na_thres` parameter, while the
    type of odds returned is defined by the `odds_type` parameter.

    Args:
        drop_na_thres:
            The threshold that specifies the input columns to drop. It is a float in
            the `[0.0, 1.0]` range. Higher values result in dropping more columns.
            The default value `drop_na_thres=0.0` keeps all columns while the
            maximum value `drop_na_thres=1.0` keeps only columns with no
            missing values.

        odds_type:
            The selected odds type. It should be one of the available odds column
            prefixes returned by the method `get_odds_types`. If `odds_type=None`
            then no odds are returned.

    Returns:
        (X, Y, O):
            The components represent the training input data `X`, the
            multi-output targets `Y` and the corresponding odds `O`, respectively.
    """
    return super().extract_train_data(drop_na_thres=drop_na_thres, odds_type=odds_type)

load_dataloader(path)

Load the dataloader object.

Parameters:

Name Type Description Default
path str

The path of the dataloader pickled file.

required

Returns:

Name Type Description
dataloader _BaseDataLoader

The dataloader object.
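A minimal round-trip sketch of the loading pattern, using only the standard library. It assumes the file was written with a pickle-compatible serializer; a plain dictionary stands in for a dataloader object, and `load_pickled` is a hypothetical stand-in that uses `pickle` in place of `cloudpickle`.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickled(path):
    """Open the file in binary mode and unpickle its contents."""
    with Path(path).open('rb') as file:
        return pickle.load(file)


# A plain dict stands in for a dataloader object in this sketch.
stand_in = {'param_grid': {'league': ['Spain']}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'dataloader.pkl'
    # Write the object first so that there is something to load.
    with path.open('wb') as file:
        pickle.dump(stand_in, file)
    restored = load_pickled(path)

print(restored == stand_in)  # True
```

Unpickling can execute arbitrary code, so only files from a trusted source should be loaded this way.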

Source code in src/sportsbet/datasets/_base.py
def load_dataloader(path: str) -> _BaseDataLoader:
    """Load the dataloader object.

    Args:
        path:
            The path of the dataloader pickled file.

    Returns:
        dataloader:
            The dataloader object.
    """
    with Path(path).open('rb') as file:
        dataloader = cloudpickle.load(file)
    return dataloader