imbalanced-learn-extra: A Python package for novel oversampling algorithms

Artificial Intelligence
Machine Learning
Imbalanced Data
Publication
Authors
Affiliation

Georgios Douzas

NOVA IMS

Fernando Bacao

NOVA IMS

Abstract

Learning from imbalanced data is a common and challenging problem in supervised learning, as standard classifiers are typically designed for balanced class distributions. Among the various strategies to address this issue, oversampling algorithms, which generate artificial data to balance class distributions, offer greater flexibility than modifying classification algorithms. In this paper, we present the imbalanced-learn-extra library, describe its implementation in detail, and make it freely available to the machine learning community. The library integrates seamlessly with the Scikit-Learn ecosystem, enabling researchers and practitioners to incorporate it into their existing workflows with ease. The imbalanced-learn-extra Python library implements novel oversampling methods that tackle both between-class and within-class imbalances. Specifically, Geometric SMOTE enhances the traditional SMOTE algorithm by expanding the data generation area beyond the line segments connecting minority class instances, allowing for greater diversity in synthetic samples and effectively addressing between-class imbalances. Clustering-based oversampling, on the other hand, addresses within-class imbalances by partitioning the input space into clusters and applying oversampling within each cluster using appropriate resampling ratios. These methods have demonstrated superior performance compared to standard oversampling techniques across a variety of datasets.

Keywords

Machine Learning Classification, Imbalanced Learning, Oversampling, Clustering, Geometric SMOTE, Scikit-Learn

Introduction

The imbalanced learning problem describes the case where, in a machine learning classification task using datasets with binary or multi-class targets, one of the classes, called the majority class, has a significantly higher number of samples than the remaining classes, called the minority class(es) (Nitesh V. Chawla et al. 2003). Learning from imbalanced data is a non-trivial problem for both academic researchers and industry practitioners that is frequently found in multiple domains such as chemical and biochemical engineering, financial management, information technology, security, business, agriculture or emergency management (Haixiang et al. 2017).

A bias towards the majority class is induced when imbalanced data are used to train standard machine learning algorithms. This results in low classification accuracy, especially for the minority class(es), when the classifier is evaluated on unseen data. An important measure for the degree of data imbalance is the Imbalance Ratio (\(IR\)), defined as the ratio between the number of samples of the majority class and each of the minority classes. Using a rare disease detection task as an example, with 1% of positive cases corresponding to an \(IR=\frac{0.99}{0.01}=99\), a trivial classifier that always labels a person as healthy will score a classification accuracy of 99%. However, in this case, all positive cases remain undetected. The observed values of \(IR\) are often between 100 and 100,000 (N. V. Chawla et al. 2002; Barua et al. 2014). Figure 1 presents an example of imbalanced data in two dimensions as well as the decision boundary identified by a typical classifier that uses them as training data.

Figure 1: Imbalanced data in two dimensions. The decision boundary of a typical classifier shows a bias towards the majority class.
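
As a concrete illustration, the snippet below computes the \(IR\) and the accuracy of the trivial majority-class classifier for the rare disease example; the variable names are ours and serve only as an example.

# Compute the Imbalance Ratio (IR) and the accuracy of a trivial
# majority-class classifier for a 1% positive-class dataset.
from collections import Counter

import numpy as np

# 1% positive cases, as in the rare disease example.
y = np.array([0] * 9900 + [1] * 100)

counts = Counter(y)
ir = counts[0] / counts[1]
trivial_accuracy = counts[0] / len(y)

print(f'IR: {ir}.')                              # IR: 99.0.
print(f'Trivial accuracy: {trivial_accuracy}.')  # Trivial accuracy: 0.99.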

In this paper, we present imbalanced-learn-extra, a Python library that implements novel oversampling algorithms, including clustering-based oversampling and Geometric SMOTE. The clustering-based approach allows for any combination of a Scikit-Learn (Pedregosa et al. 2012) compatible clustering algorithm and an Imbalanced-Learn (Lemaitre, Nogueira, and Aridas 2016) compatible oversampler. This approach identifies clusters within the input space and applies oversampling individually to each cluster. Additionally, Geometric SMOTE serves as a direct replacement for SMOTE, expanding the data generation mechanism to provide greater flexibility and improved performance.

The Theoretical background section presents various concepts related to oversampling, while the Implementation and architecture section describes the software's implementation and architecture.

Theoretical background

Various approaches have been proposed to improve classification results when the training data are imbalanced, a case also known as a between-class imbalance. The most general approach, called oversampling, is the generation of artificial data for the minority class(es) (Fernández et al. 2013). The Synthetic Minority Oversampling Technique (SMOTE) (N. V. Chawla et al. 2002) was the first non-trivial oversampler proposed and remains the most popular one. Although SMOTE is effective for generating artificial data, it also has some drawbacks (Haibo He and Garcia 2009). To improve the quality of the artificial data, many variants of SMOTE have been proposed. Nevertheless, they utilize the SMOTE data generation mechanism, which consists of a linear interpolation between minority class samples to generate synthetic instances, as shown in Figure 2.

Figure 2: Visual representation of the SMOTE data generation mechanism.
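
The following minimal sketch, written for illustration only, reproduces this interpolation step: a synthetic sample is placed at a random point on the line segment between a minority instance and one of its minority-class nearest neighbors.

# Minimal sketch of SMOTE's linear interpolation (illustration only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))  # minority class samples

# Find the nearest minority-class neighbors of each minority sample.
nn = NearestNeighbors(n_neighbors=6).fit(X_min)
_, indices = nn.kneighbors(X_min)

# Pick a random minority sample and one of its neighbors (excluding itself).
i = rng.integers(len(X_min))
j = rng.choice(indices[i][1:])

# Interpolate: x_new = x_i + u * (x_j - x_i), with u ~ U(0, 1).
u = rng.uniform()
x_new = X_min[i] + u * (X_min[j] - X_min[i])
print(x_new)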

Geometric SMOTE

The Geometric SMOTE (G-SMOTE) oversampling algorithm (Douzas and Bacao 2019) uses a different approach compared to existing SMOTE variants. More specifically, G-SMOTE substitutes the data generation mechanism of SMOTE by defining a flexible geometric region around each minority class instance and generating synthetic instances inside the boundaries of this region. The algorithm requires the selection of the hyperparameters truncation_factor, deformation_factor, selection_strategy and k_neighbors. The first three, called geometric hyperparameters, control the shape of the geometric region, while the latter adjusts its size. Figure 3 presents a visual comparison between the data generation mechanisms of SMOTE and G-SMOTE.

Figure 3: Comparison between the data generation mechanisms of SMOTE and G-SMOTE. SMOTE uses linear interpolation, while G-SMOTE defines a circle as the permissible data generation area.
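
To make the mechanism concrete, the following simplified sketch generates a single synthetic point inside a truncated and deformed hypersphere around a minority instance, following the description in (Douzas and Bacao 2019); it illustrates the geometric idea and is not the library's internal code.

# Simplified sketch of G-SMOTE's geometric data generation (illustration only).
import numpy as np

def make_geometric_sample(center, surface_point, truncation_factor, deformation_factor, rng):
    radius = np.linalg.norm(surface_point - center)
    unit_parallel = (surface_point - center) / radius

    # Sample a point uniformly inside the unit hypersphere.
    normal = rng.normal(size=center.size)
    point = (rng.uniform() ** (1 / center.size)) * normal / np.linalg.norm(normal)

    # Truncation: reflect points that fall inside the truncated part of the sphere.
    projection = point @ unit_parallel
    if (truncation_factor > 0 and projection < truncation_factor - 1) or (
        truncation_factor < 0 and projection > truncation_factor + 1
    ):
        point -= 2 * projection * unit_parallel

    # Deformation: shrink the component perpendicular to the parallel direction.
    parallel = (point @ unit_parallel) * unit_parallel
    point = parallel + (1 - deformation_factor) * (point - parallel)

    # Translate and rescale to the geometric region around the center.
    return center + radius * point

rng = np.random.default_rng(0)
x_new = make_geometric_sample(np.zeros(2), np.array([1.0, 0.0]), 1.0, 0.0, rng)
print(x_new)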

Clustering-based oversampling

In addition to between-class imbalance, within-class imbalance refers to the case where areas of sparse and dense minority class instances exist. As the first step of generating synthetic samples, the SMOTE data generation mechanism selects minority class instances randomly, with uniform probability. Consequently, dense minority class areas have a high probability of being inflated further, while sparsely populated areas are likely to remain sparse. This combats between-class imbalance, while the issue of within-class imbalance is ignored (Prati, Batista, and Monard 2004).

On the other hand, clustering-based oversampling, as presented in (Douzas and Bacao 2017) and (Douzas, Bacao, and Last 2018), aims to deal with both the between-class and within-class imbalance problems. Initially, a clustering algorithm is applied to the input space. The resulting clusters allow the identification of sparse and dense minority class areas. A cluster's IR being small relative to a threshold is used as an indicator that the cluster can be safely selected as a data generation area, i.e. noise generation is avoided. Furthermore, sparse minority clusters are assigned more synthetic samples, which alleviates within-class imbalance.
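
The following schematic sketch illustrates this logic with made-up helper names and a simplified density definition: clusters whose IR exceeds a threshold are excluded from data generation, and the remaining clusters receive synthetic samples in inverse proportion to their density.

# Schematic sketch of cluster filtering and sample allocation (illustration
# only; the helper names and the density definition are simplifications).
import numpy as np

def distribute_samples(clusters, filtering_threshold, n_samples_total):
    """clusters: list of dicts with keys 'X_min' (minority samples) and 'ir'."""
    # Keep only clusters whose IR is below the threshold (avoids noisy areas).
    filtered = [c for c in clusters if c['ir'] < filtering_threshold]

    # Density proxy: number of minority samples per unit of mean pairwise distance.
    densities = []
    for c in filtered:
        X = c['X_min']
        dists = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
        densities.append(len(X) / (dists.mean() + 1e-12))

    # Sparse clusters (low density) receive proportionally more samples.
    weights = 1 / np.array(densities)
    weights /= weights.sum()
    return (weights * n_samples_total).round().astype(int)

clusters = [
    {'X_min': np.random.default_rng(0).normal(size=(10, 2)), 'ir': 0.5},
    {'X_min': np.random.default_rng(1).normal(scale=3.0, size=(5, 2)), 'ir': 0.8},
    {'X_min': np.random.default_rng(2).normal(size=(4, 2)), 'ir': 5.0},  # filtered out
]
print(distribute_samples(clusters, filtering_threshold=1.0, n_samples_total=100))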

Specific realizations of the above approach are SOMO (Douzas and Bacao 2017), KMeans-SMOTE (Douzas, Bacao, and Last 2018) and G-SOMO (Douzas, Rauch, and Bacao 2021) algorithms. Empirical studies have shown that the three algorithms outperform SMOTE and its variants across multiple imbalanced datasets, classifiers and evaluation metrics.

Implementation and architecture

A Python implementation of SMOTE and several of its variants is available in the Imbalanced-Learn library (Lemaitre, Nogueira, and Aridas 2016), which is fully compatible with the popular machine learning toolbox Scikit-Learn (Pedregosa et al. 2012). In this paper, we present imbalanced-learn-extra. The imbalanced-learn-extra software project is compatible with Python 3.10 or greater. It contains an object-oriented implementation of G-SMOTE and the clustering-based oversampling procedure, as well as detailed online documentation. The implementation provides an API that is compatible with Imbalanced-Learn and Scikit-Learn libraries. Therefore, standard machine learning functionalities are supported. The imbalanced-learn-extra project contains the Python package imblearn_extra with the main modules gsmote and clover.

Geometric SMOTE

The imblearn_extra/gsmote directory contains the file geometric_smote.py, which includes the class GeometricSMOTE, an implementation of the G-SMOTE algorithm. The initialization of a GeometricSMOTE instance requires the selection of G-SMOTE's hyperparameters that control the generation of synthetic data. Additionally, GeometricSMOTE inherits from the BaseOverSampler class of the Imbalanced-Learn library. Therefore, an instance of the GeometricSMOTE class provides the fit and fit_resample methods, the two main methods for resampling. This is achieved by implementing the _fit_resample abstract method of the parent class BaseOverSampler. More specifically, the function _make_geometric_sample implements the data generation mechanism of G-SMOTE. This function is called in the _make_geometric_samples method of the GeometricSMOTE class in order to generate the appropriate number of synthetic samples for a particular minority class. Finally, the method _make_geometric_samples is called in the _fit_resample method to generate synthetic data for all minority classes. Figure 4 provides a visual representation of the hierarchy of the above classes and functions.

Figure 4: UML class diagrams and callgraphs of main classes and methods.
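
As a quick check of this hierarchy, the following snippet verifies that GeometricSMOTE is an oversampler in the Imbalanced-Learn sense; the import path of BaseOverSampler shown here is an assumption based on the Imbalanced-Learn codebase.

# Check that GeometricSMOTE follows the Imbalanced-Learn oversampler API
# (the import path of BaseOverSampler is an assumption).
from imblearn.over_sampling.base import BaseOverSampler
from imblearn_extra.gsmote import GeometricSMOTE

print(issubclass(GeometricSMOTE, BaseOverSampler))  # True
print(hasattr(GeometricSMOTE, 'fit_resample'))      # True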

Clustering-based oversampling

The imblearn_extra/clover directory contains the distribution and over_sampling directories. The imblearn_extra.clover.distribution module implements the functionality related to the distribution of the generated samples to the identified clusters, while imblearn_extra.clover.over_sampling implements the functionality related to the generation of artificial samples. Both of them are presented in detail below.

Distribution

The imblearn_extra/clover/distribution directory contains the files base.py and _density.py. The former provides the implementation of the BaseDistributor class, the base class for distributors, while the latter includes the DensityDistributor class, a generalization of the density-based distributor presented in (Douzas and Bacao 2017) and (Douzas, Bacao, and Last 2018), that inherits from BaseDistributor. Following the Scikit-Learn API, BaseDistributor includes the public method fit. The fit_distribute method is also implemented as the main method of the class.

The fit_distribute method calls the fit method and returns two Python dictionaries that describe the distribution of generated samples inside each cluster and between clusters, respectively. Specifically, the fit method calculates various statistics related to the distribution process, while it calls the _fit method to calculate the actual intra-cluster and inter-cluster distributions. This is achieved by invoking the _intra_distribute and _inter_distribute methods. The BaseDistributor class provides a trivial implementation of them, which should be overridden when a concrete distributor class is implemented. Therefore, DensityDistributor overrides both methods as well as the _fit method. The latter calls the methods _identify_filtered_clusters and _calculate_clusters_density, which identify the clusters used for data generation and calculate their density, respectively. Figure 5 shows a visual representation of the hierarchy of the above classes and functions.

Figure 5: UML BaseDistributor and DensityDistributor class diagrams and callgraphs of main classes and methods.

Oversampling

The imblearn_extra/clover/over_sampling directory contains the files _cluster.py, _kmeans_smote.py, _somo.py and _gsomo.py. The first provides the ClusterOverSampler class, an extension of Imbalanced-Learn's BaseOverSampler class, and implements the functionality required by clustering-based oversampling. The remaining files _kmeans_smote.py, _somo.py and _gsomo.py utilize the ClusterOverSampler class to provide implementations of the KMeans-SMOTE, SOMO and Geometric SOMO algorithms, respectively. ClusterOverSampler inherits from BaseOverSampler, the base class of oversamplers implemented in Imbalanced-Learn, and its initializer includes the extra parameters clusterer and distributor. Also following the Imbalanced-Learn API, ClusterOverSampler includes the public methods fit and fit_resample.

The fit method calculates various statistics related to the resampling process, while the fit_resample method returns an enhanced version of the input data by appending the artificially generated samples. Specifically, fit_resample calls the _fit_resample method, which in turn calls the _intra_sample and _inter_sample methods to generate the intra-cluster and inter-cluster artificial samples, respectively. This is achieved by invoking the _fit_resample_cluster method, which implements the data generation mechanism. Therefore, every oversampler that inherits from the ClusterOverSampler class should override _fit_resample_cluster, providing a concrete implementation of the oversampling process. Figure 6 shows a visual representation of the hierarchy of the above classes and functions.

Figure 6: UML BaseOverSampler and BaseClusterOversampler class diagrams and callgraphs of main classes and methods.
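
To make the intra-cluster step concrete, the following self-contained sketch mimics what a concrete _fit_resample_cluster implementation does, applying Imbalanced-Learn's SMOTE inside each KMeans cluster; it is a simplification that ignores the distributor logic and inter-cluster generation.

# Self-contained sketch of intra-cluster oversampling (illustration only;
# the actual ClusterOverSampler logic also handles the distribution of
# samples and inter-cluster generation).
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0, weights=[0.9, 0.1], n_samples=600)
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X)

X_parts, y_parts = [X], [y]
for label in np.unique(labels):
    X_c, y_c = X[labels == label], y[labels == label]
    # Resample only clusters that contain both classes and enough minority
    # samples for SMOTE's nearest-neighbors step (default k_neighbors=5).
    if len(np.unique(y_c)) == 2 and np.bincount(y_c).min() > 5:
        X_res, y_res = SMOTE(random_state=0).fit_resample(X_c, y_c)
        X_parts.append(X_res[len(X_c):])  # keep only the synthetic samples
        y_parts.append(y_res[len(y_c):])

X_new, y_new = np.vstack(X_parts), np.hstack(y_parts)
print(np.bincount(y), np.bincount(y_new))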

Software functionalities

Geometric SMOTE

As mentioned in the Implementation and architecture section, the class GeometricSMOTE represents the G-SMOTE oversampler. The initializer of GeometricSMOTE includes the following G-SMOTE hyperparameters, explained in the Theoretical background section: truncation_factor, deformation_factor, selection_strategy and k_neighbors. Once the GeometricSMOTE object is initialized with a specific parametrization, it can be used to resample the imbalanced data represented by the input matrix X and the target labels y. Following the Scikit-Learn API, both X and y are array-like objects of appropriate shape.

Resampling is achieved by using the two main methods, fit and fit_resample, of the GeometricSMOTE object. More specifically, both take X and y as input parameters. The first method computes various statistics used to resample X, while the second method does the same but additionally returns a resampled version of X and y.
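
For instance, the following snippet initializes a GeometricSMOTE object with non-default values of the geometric hyperparameters; the specific values are arbitrary and serve only to illustrate the initializer, with 'minority' being one of the selection strategies described in (Douzas and Bacao 2019).

# Initialize G-SMOTE with arbitrary non-default hyperparameters
# (illustration only).
from imblearn_extra.gsmote import GeometricSMOTE

gsmote = GeometricSMOTE(
    truncation_factor=0.5,
    deformation_factor=0.3,
    selection_strategy='minority',
    k_neighbors=3,
    random_state=5,
)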

Clustering-based oversampling

As mentioned in the Theoretical background section, clustering-based oversampling initially applies a clustering algorithm to the input space before oversampling is applied to each cluster. This is achieved through the implementation of the ClusterOverSampler class, an extension of Imbalanced-Learn's BaseOverSampler class. Oversamplers that inherit from ClusterOverSampler, compared to oversamplers inheriting from BaseOverSampler, require two additional initialization parameters: clusterer and distributor. The default value of both parameters is None, a case that corresponds to the usual oversampling procedure, i.e. no clustering applied to the input space. On the other hand, if the parameter clusterer is set to any Scikit-Learn compatible clustering algorithm, then clustering of the input space is initially applied, followed by oversampling in each cluster, with the distribution of generated samples determined by the distributor parameter. The default distributor value is an instance of the DensityDistributor class, as described in the Distribution subsection.
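
As a minimal sketch of this composition, the following snippet, with arbitrary parameter values, combines Scikit-Learn's KMeans with Imbalanced-Learn's SMOTE; any other Scikit-Learn compatible clusterer could be used in its place.

# Oversample with SMOTE applied per KMeans cluster; omitting `clusterer`
# would fall back to plain SMOTE on the whole input space.
from imblearn.over_sampling import SMOTE
from imblearn_extra.clover.over_sampling import ClusterOverSampler
from sklearn.cluster import KMeans

smote_kmeans = ClusterOverSampler(
    oversampler=SMOTE(random_state=1),
    clusterer=KMeans(n_clusters=10, random_state=1),
)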

The initializer of DensityDistributor includes the following parameters: filtering_threshold, distances_exponent, sparsity_based and distribution_factor. The first parameter is used to identify the filtered clusters, i.e. clusters of samples that are included in the data generation process. The second parameter modifies the density calculation of the filtered clusters by increasing the effect of the Euclidean distances between samples. The third parameter selects whether generated samples are assigned to filtered clusters in inverse proportion to their density. Finally, the last parameter adjusts the intra-cluster to inter-cluster proportion of generated samples, and applies only to clusterers that support a neighborhood structure. Once the DensityDistributor object is initialized with a specific parametrization, it can be used to distribute the generated samples to the clusters identified by any clustering algorithm.
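
For illustration, the snippet below configures a DensityDistributor explicitly and passes it to a ClusterOverSampler; the parameter values are arbitrary, and distribution_factor is omitted since KMeans does not provide a neighborhood structure.

# Configure a density-based distributor explicitly (arbitrary values).
from imblearn.over_sampling import SMOTE
from imblearn_extra.clover.distribution import DensityDistributor
from imblearn_extra.clover.over_sampling import ClusterOverSampler
from sklearn.cluster import KMeans

distributor = DensityDistributor(
    filtering_threshold=1.0,
    distances_exponent=2,
    sparsity_based=True,
)
smote_kmeans = ClusterOverSampler(
    oversampler=SMOTE(random_state=1),
    clusterer=KMeans(n_clusters=10, random_state=1),
    distributor=distributor,
)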

Resampling is achieved by using the two main methods, fit and fit_resample, of any oversampler inheriting from ClusterOverSampler. More specifically, both take as input parameters the input matrix X and the target labels y. Following the Scikit-Learn API, both X and y are array-like objects of appropriate shape. The first method computes various statistics used to resample X, while the second method does the same but additionally returns a resampled version of X and y.

Integration

The imbalanced-learn-extra project has been designed to integrate with the Imbalanced-Learn toolbox and the Scikit-Learn ecosystem. Therefore, the GeometricSMOTE and ClusterOverSampler objects can be used in machine learning pipelines through Imbalanced-Learn's Pipeline class, which automatically combines samplers, transformers and estimators. The next section provides examples of the above functionalities.

Usage examples

Examples of imbalanced-learn-extra usage are given below, including basic examples and machine learning pipelines.

Geometric SMOTE

Basic example

An example of resampling multi-class imbalanced data using the fit_resample method is presented in the next listing. Initially, a 3-class imbalanced dataset is generated. Next, a GeometricSMOTE object is initialized with default values for the hyperparameters, i.e. truncation_factor=1.0, deformation_factor=0.0, selection_strategy='combined'. Finally, the object's fit_resample method is used to resample the data. Printing the class distribution before and after resampling confirms that the resampled data X_res, y_res are perfectly balanced. X_res, y_res can be used as training data for any classifier in place of X, y.

# Import classes and functions.
from collections import Counter
from imblearn_extra.gsmote import GeometricSMOTE
from sklearn.datasets import make_classification

# Generate an imbalanced 3-class dataset.
X, y = make_classification(
  random_state=23, 
  n_classes=3, 
  n_informative=5,
  n_samples=500, 
  weights=[0.8, 0.15, 0.05]
)

# Create a GeometricSMOTE object with default hyperparameters.
gsmote = GeometricSMOTE(random_state=10)

# Resample the imbalanced dataset.
X_res, y_res = gsmote.fit_resample(X, y)

# Print number of samples per class for initial and resampled data.
init_count = list(Counter(y).values()) 
resampled_count = list(Counter(y_res).values())
print(f'Initial class distribution: {init_count}.')
print(f'Resampled class distribution: {resampled_count}.')
Initial class distribution: [400, 75, 25].
Resampled class distribution: [400, 400, 400].

Machine learning pipeline

As mentioned before, the GeometricSMOTE object can be used as part of a machine learning pipeline. The next listing presents a pipeline composed of a G-SMOTE oversampler, a PCA transformation and a decision tree classifier. The pipeline is trained on imbalanced binary-class data and evaluated on a hold-out set. The user applies the process in a simple way, while the internal details of the calculations are hidden.

# Import classes and functions.
from imblearn_extra.gsmote import GeometricSMOTE 
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from imblearn.pipeline import make_pipeline

# Generate an imbalanced binary-class dataset.
X, y = make_classification(
  random_state=23,
  n_classes=2,
  n_samples=500,
  weights=[0.8, 0.2],
)

# Split the data to training and hold-out sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create the pipeline's objects with default hyperparameters.
gsmote = GeometricSMOTE(random_state=11)
pca = PCA()
clf = DecisionTreeClassifier(random_state=3)

# Create the pipeline.
pip = make_pipeline(gsmote, pca, clf)

# Fit the pipeline to the training set.
pip.fit(X_train, y_train)

# Evaluate the pipeline on the hold-out set using the F-score.
test_score = f1_score(y_test, pip.predict(X_test))

print(f'F-score on hold-out set: {test_score}.')
F-score on hold-out set: 0.7.

Clustering-based oversampling

Basic example

An example of resampling an imbalanced dataset using the fit_resample method is presented. Initially, a binary-class imbalanced dataset is generated. Next, a KMeansSMOTE oversampler is initialized with the default parameters. This corresponds to the KMeans-SMOTE algorithm as presented in (Douzas, Bacao, and Last 2018). Finally, the oversampler's fit_resample method is used to resample the data. Printing the class distribution before and after resampling confirms that the resampled data X_res, y_res are perfectly balanced. X_res, y_res can be used as training data for any classifier in place of X, y.

# Import classes and functions.
from collections import Counter
from imblearn_extra.clover.over_sampling import KMeansSMOTE
from sklearn.datasets import make_classification

# Generate an imbalanced binary class dataset.
X, y = make_classification(
    random_state=23, 
    n_classes=2,
    n_features=5,
    n_samples=1000,
    weights=[0.8, 0.2]
)

# Create KMeans-SMOTE object with default hyperparameters.
kmeans_smote = KMeansSMOTE(random_state=10)

# Resample the imbalanced dataset.
X_res, y_res = kmeans_smote.fit_resample(X, y) 

# Print number of samples per class for initial and resampled data. 
init_count = list(Counter(y).values())
resampled_count = list(Counter(y_res).values())

print(f'Initial class distribution: {init_count}.') 
print(f'Resampled class distribution: {resampled_count}.')
Initial class distribution: [792, 208].
Resampled class distribution: [792, 792].

Machine learning pipeline

As mentioned before, any clustering-based oversampler can be used as part of a machine learning pipeline. A pipeline is presented, composed of the combination of a Borderline SMOTE oversampler with hierarchical clustering, a PCA transformation and a decision tree classifier. The pipeline is trained on multi-class imbalanced data and evaluated on a hold-out set. The user applies the process in a simple way, while the internal details of the calculations are hidden.

# Import classes and functions.
from imblearn_extra.clover.over_sampling import ClusterOverSampler
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.cluster import AgglomerativeClustering
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.pipeline import make_pipeline

# Generate an imbalanced multi-class dataset.
X, y = make_classification(
    random_state=23, 
    n_classes=3, 
    n_informative=10,
    n_samples=500,
    weights=[0.8, 0.1, 0.1]
)

# Split the data to training and hold-out sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

# Create the pipeline's objects with default hyperparameters.
hclusterer_bsmote = ClusterOverSampler(
    oversampler=BorderlineSMOTE(random_state=5),
    clusterer=AgglomerativeClustering(),
    random_state=19,
)
pca = PCA()
clf = DecisionTreeClassifier(random_state=3)

# Create the pipeline.
pip = make_pipeline(hclusterer_bsmote, pca, clf)

# Fit the pipeline to the training set.
pip.fit(X_train, y_train)

# Evaluate the pipeline on the hold-out set using the F-score.
test_score = f1_score(y_test, pip.predict(X_test), average='micro')

print(f'F-score on hold-out set: {test_score:.2f}.')
F-score on hold-out set: 0.75.

Quality control

All functions and classes have been tested for functionality and usability. These tests are integrated into the GitHub Actions continuous integration (CI) service and are automatically run each time new commits are pushed to GitHub, using all supported operating systems and Python versions. Checks on code quality, vulnerabilities in dependencies and type annotations are applied through external libraries. Various development scripts that automate the above tasks are provided and described in detail in the Contributing section of the online documentation and GitHub.

Availability

Operating system

Any system (GNU/Linux, macOS, Windows) capable of running Python ≥ 3.10.

Programming language

Python 3.10 or higher.

Dependencies

  • scipy >= 1.7.2
  • numpy >= 1.22
  • imbalanced-learn >= 0.13.0

List of contributors

The software was created by Georgios Douzas.

Software location

Zenodo

  • Name: imbalanced-learn-extra
  • Persistent identifier: https://doi.org/10.5281/zenodo.15561408
  • Licence: MIT License
  • Publisher: Zenodo
  • Version published: 0.2.9
  • Date published: 01/10/2021

GitHub

  • Name: imbalanced-learn-extra
  • Persistent identifier: https://github.com/georgedouzas/imbalanced-learn-extra
  • Licence: MIT
  • Date published: 01/10/2021

Language

English

Reuse potential

To the best of our knowledge, the imbalanced-learn-extra project provides the only Python implementation that offers a generic way to construct any clustering-based oversampler. A significant advantage of this implementation is that it is built on top of the Scikit-Learn ecosystem and can therefore be easily used in typical machine learning workflows. Also, the public API of any clustering-based oversampler is an extension of the one provided in Imbalanced-Learn. This means that users of Imbalanced-Learn and Scikit-Learn who apply oversampling to imbalanced data can integrate imbalanced-learn-extra into their existing work in a straightforward manner.

Users can request support by opening an issue on GitHub. Additionally, users may open pull requests and contribute to the development of imbalanced-learn-extra. The project's documentation describes the API in detail and provides various complete examples.

Funding statement

Funding: This research was supported by a grant from the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), DSAIPA/DS/0116/2019.

Competing interests

The authors declare that they have no competing interests.

References

Barua, Sukarna, Md. Monirul Islam, Xin Yao, and Kazuyuki Murase. 2014. “MWMOTE–Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning.” IEEE Transactions on Knowledge and Data Engineering 26 (2): 405–25. https://doi.org/10.1109/TKDE.2012.232.
Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. 2002. “SMOTE: Synthetic Minority Over-Sampling Technique.” Journal of Artificial Intelligence Research 16 (June): 321–57. https://doi.org/10.1613/jair.953.
Chawla, Nitesh V., Aleksandar Lazarevic, Lawrence O. Hall, and Kevin W. Bowyer. 2003. “SMOTEBoost: Improving Prediction of the Minority Class in Boosting.” In Knowledge Discovery in Databases: PKDD 2003, edited by Gerhard Goos, Juris Hartmanis, Jan van Leeuwen, Nada Lavrač, Dragan Gamberger, Ljupčo Todorovski, and Hendrik Blockeel, 2838:107–19. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-39804-2_12.
Douzas, Georgios, and Fernando Bacao. 2017. “Self-Organizing Map Oversampling (SOMO) for Imbalanced Data Set Learning.” Expert Systems with Applications 82 (October): 40–52. https://doi.org/10.1016/j.eswa.2017.03.073.
———. 2019. “Geometric SMOTE a Geometrically Enhanced Drop-in Replacement for SMOTE.” Information Sciences 501 (October): 118–35. https://doi.org/10.1016/j.ins.2019.06.007.
Douzas, Georgios, Fernando Bacao, and Felix Last. 2018. “Improving Imbalanced Learning Through a Heuristic Oversampling Method Based on k-Means and SMOTE.” Information Sciences 465 (October): 1–20. https://doi.org/10.1016/j.ins.2018.06.056.
Douzas, Georgios, Rene Rauch, and Fernando Bacao. 2021. “G-SOMO: An Oversampling Approach Based on Self-Organized Maps and Geometric SMOTE.” Expert Systems with Applications 183 (November): 115230. https://doi.org/10.1016/j.eswa.2021.115230.
Fernández, Alberto, Victoria López, Mikel Galar, María José Del Jesus, and Francisco Herrera. 2013. “Analysing the Classification of Imbalanced Data-Sets with Multiple Classes: Binarization Techniques and Ad-Hoc Approaches.” Knowledge-Based Systems 42 (April): 97–110. https://doi.org/10.1016/j.knosys.2013.01.018.
He, Haibo, and E. A. Garcia. 2009. “Learning from Imbalanced Data.” IEEE Transactions on Knowledge and Data Engineering 21 (9): 1263–84. https://doi.org/10.1109/TKDE.2008.239.
Haixiang, Guo, Li Yijing, Jennifer Shang, Gu Mingyun, Huang Yuanyue, and Gong Bing. 2017. “Learning from Class-Imbalanced Data: Review of Methods and Applications.” Expert Systems with Applications 73 (May): 220–39. https://doi.org/10.1016/j.eswa.2016.12.035.
Lemaitre, Guillaume, Fernando Nogueira, and Christos K. Aridas. 2016. “Imbalanced-Learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning.” https://doi.org/10.48550/ARXIV.1609.06570.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, et al. 2012. “Scikit-Learn: Machine Learning in Python.” https://doi.org/10.48550/ARXIV.1201.0490.
Prati, Ronaldo C., Gustavo E. A. P. A. Batista, and Maria Carolina Monard. 2004. “Learning with Class Skews and Small Disjuncts.” In Advances in Artificial Intelligence – SBIA 2004, edited by David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, et al., 3171:296–306. Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-28645-5_30.