User guide
imbalanced-learn-extra is a Python package that extends imbalanced-learn. It implements algorithms that are not included in
imbalanced-learn due to their novelty or lower citation numbers. The current version includes the following:
- A general interface for clustering-based oversampling algorithms that introduces the ClusterOverSampler class, while the
KMeansSMOTE, SOMO and GeometricSOMO classes are provided for convenience. The distribution of the generated samples to the
clusters is controlled by the distributor parameter, with DensityDistributor being an example of a distribution based on the
density of the clusters.
- The Geometric SMOTE algorithm as a geometrically enhanced drop-in replacement for SMOTE that handles numerical as well as
categorical features.
Clustering-based oversampling
Initially, we generate multi-class imbalanced data represented by the input data X and targets y:
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_classes=3, weights=[0.10, 0.10, 0.80], random_state=0, n_informative=10)
>>> print(sorted(Counter(y).items()))
[(np.int64(0), 10), (np.int64(1), 10), (np.int64(2), 80)]
Below we provide some examples of the imblearn_extra.clover functionality.
KMeans-SMOTE algorithm
The KMeans-SMOTE[^2] algorithm combines the KMeans clusterer and the SMOTE oversampler and is implemented by the
KMeansSMOTE class. We initialize it with the default parameters and use it to resample the data:
>>> from imblearn_extra.clover.over_sampling import KMeansSMOTE
>>> kmeans_smote = KMeansSMOTE(random_state=5)
>>> X_resampled, y_resampled = kmeans_smote.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(np.int64(0), 80), (np.int64(1), 80), (np.int64(2), 80)]
The augmented data set can be used instead of the original data set to train a classifier:
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier()
>>> clf.fit(X_resampled, y_resampled)
DecisionTreeClassifier()
Combining clusterers and oversamplers
The ClusterOverSampler class allows imbalanced-learn oversamplers to be combined with scikit-learn clusterers. This is achieved
through the oversampler and clusterer parameters. For example, we can select BorderlineSMOTE as the oversampler
and DBSCAN as the clustering algorithm:
>>> from sklearn.cluster import DBSCAN
>>> from imblearn.over_sampling import BorderlineSMOTE
>>> from imblearn_extra.clover.over_sampling import ClusterOverSampler
>>> dbscan_bsmote = ClusterOverSampler(oversampler=BorderlineSMOTE(random_state=5), clusterer=DBSCAN())
>>> X_resampled, y_resampled = dbscan_bsmote.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(np.int64(0), 80), (np.int64(1), 80), (np.int64(2), 80)]
Additionally, if the clusterer supports a neighboring structure for the clusters through a neighbors_ attribute, then it can
be used to generate inter-cluster artificial data, similarly to the SOMO[^1] and G-SOMO[^3] algorithms.
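The idea behind inter-cluster generation can be sketched with a small self-contained helper (hypothetical, not part of the imblearn_extra API) that creates synthetic points on line segments joining points of two neighboring clusters:

```python
import random

def inter_cluster_samples(points_a, points_b, n_samples, seed=0):
    """Generate synthetic points on line segments joining points of two
    neighboring clusters (hypothetical helper, for illustration only)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        a = rng.choice(points_a)  # point from the first cluster
        b = rng.choice(points_b)  # point from a neighboring cluster
        t = rng.random()  # interpolation factor in [0, 1]
        samples.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return samples

cluster_a = [(0.0, 0.0), (0.2, 0.1)]
cluster_b = [(1.0, 1.0), (0.9, 1.1)]
new_points = inter_cluster_samples(cluster_a, cluster_b, 3)
```

Each generated point lies between a member of one cluster and a member of its neighbor, which is what lets these algorithms populate the regions connecting neighboring clusters.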
Adjusting the distribution of generated samples
The distributor parameter of ClusterOverSampler defines the distribution of the generated samples to the
clusters. The DensityDistributor class implements a density-based distribution and is the default distributor for all
objects of the ClusterOverSampler class:
>>> from sklearn.cluster import AgglomerativeClustering
>>> from imblearn.over_sampling import SMOTE
>>> agg_smote = ClusterOverSampler(oversampler=SMOTE(random_state=5), clusterer=AgglomerativeClustering())
>>> agg_smote.fit(X, y)
>>> agg_smote.distributor_
DensityDistributor()
DensityDistributor objects can be parametrized:
>>> from imblearn_extra.clover.distribution import DensityDistributor
>>> distributor = DensityDistributor(distances_exponent=0)
In order to distribute the samples, a labels parameter is required, while neighbors is optional:
>>> from sklearn.cluster import KMeans
>>> clusterer = KMeans(n_clusters=4, random_state=1).fit(X, y)
>>> labels = clusterer.labels_
The distribution of the samples is computed by the fit_distribute method and is described by the intra_distribution
and inter_distribution dictionaries:
>>> intra_distribution, inter_distribution = distributor.fit_distribute(X, y, labels, neighbors=None)
>>> print(distributor.filtered_clusters_)
[(np.int32(3), np.int64(1)), (np.int32(1), np.int64(0)), (np.int32(1), np.int64(1))]
>>> print(distributor.clusters_density_)
{(np.int32(3), np.int64(1)): np.float64(3.0), (np.int32(1), np.int64(0)): np.float64(7.0), (np.int32(1), np.int64(1)): np.float64(7.0)}
>>> print(intra_distribution)
{(np.int32(3), np.int64(1)): np.float64(0.7), (np.int32(1), np.int64(0)): np.float64(1.0), (np.int32(1), np.int64(1)): np.float64(0.3)}
>>> print(inter_distribution)
{}
The keys of the above dictionaries are tuples of the form (cluster_label, class_label), while their values are proportions of the
total generated samples for the particular class. For example, (0, 1): 0.7 means that 70% of the samples of class 1 will be
generated in cluster 0. Any other distributor can be defined by extending the BaseDistributor class.
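As an illustration of how such proportions translate into numbers of generated samples, the following sketch converts an intra-cluster distribution into integer counts; samples_per_cluster is a hypothetical helper, and clover performs this bookkeeping internally:

```python
def samples_per_cluster(intra_distribution, n_samples_per_class):
    """Convert per-cluster proportions into integer sample counts
    (illustrative only; clover performs this internally)."""
    return {
        (cluster, label): round(proportion * n_samples_per_class[label])
        for (cluster, label), proportion in intra_distribution.items()
    }

# 70% of the synthetic samples of class 1 go to cluster 0, 30% to cluster 2.
intra = {(0, 1): 0.7, (2, 1): 0.3}
counts = samples_per_cluster(intra, {1: 70})  # {(0, 1): 49, (2, 1): 21}
```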
Geometric SMOTE algorithm
SMOTE algorithm, as well as any other over-sampling method based on the SMOTE mechanism, generates synthetic samples along line
segments that join minority class instances. Geometric SMOTE[^4] (G-SMOTE) is an enhancement of the SMOTE data generation
mechanism. G-SMOTE generates synthetic samples in a geometric region of the input space, around each selected minority instance.
The GeometricSMOTE class can be used for binary as well as multi-class classification. It uses a one-vs-rest
approach, selecting each targeted class and computing the necessary statistics against the rest of the data set, which is
grouped into a single class.
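As a rough sketch of the data generation idea, the standalone function below draws a point uniformly from a hypersphere around a minority instance. The actual G-SMOTE region is additionally truncated and deformed by the algorithm's hyperparameters, so this is only a simplified illustration, not the library's implementation:

```python
import math
import random

def sample_in_hypersphere(center, radius, rng):
    """Draw a point uniformly from the hypersphere of the given radius
    around `center` (a simplified stand-in for the geometric region)."""
    dim = len(center)
    # A normalized Gaussian vector is uniformly distributed on the sphere.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    # Scaling the radius by u**(1/dim) yields a uniform density in the ball.
    r = radius * rng.random() ** (1.0 / dim)
    return [c + r * d / norm for c, d in zip(center, direction)]

rng = random.Random(0)
point = sample_in_hypersphere([1.0, 2.0], radius=0.5, rng=rng)
```

In contrast to SMOTE, which places synthetic samples only on line segments between minority instances, sampling from such a region covers a full neighborhood around each selected instance.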
Initially, we generate multi-class imbalanced data represented by the input data X
and targets y
:
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_classes=3, weights=[0.10, 0.10, 0.80], random_state=0, n_informative=10)
>>> print(sorted(Counter(y).items()))
[(0, 10), (1, 10), (2, 80)]
We can use GeometricSMOTE
to resample the data:
>>> from imblearn_extra.gsmote import GeometricSMOTE
>>> geometric_smote = GeometricSMOTE()
>>> X_resampled, y_resampled = geometric_smote.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(np.int64(0), 80), (np.int64(1), 80), (np.int64(2), 80)]
The augmented data set can be used instead of the original data set to train a classifier:
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier()
>>> clf.fit(X_resampled, y_resampled)
DecisionTreeClassifier()
GeometricSMOTE can also be used in a machine learning pipeline:
>>> from imblearn.pipeline import make_pipeline
>>> pipeline = make_pipeline(GeometricSMOTE(), DecisionTreeClassifier())
>>> pipeline.fit(X, y)
Pipeline(steps=[('geometricsmote', GeometricSMOTE()),
                ('decisiontreeclassifier', DecisionTreeClassifier())])
Compatibility
The API of imblearn_extra is fully compatible with imbalanced-learn. In particular, for clustering-based oversampling, any
oversampler from imblearn_extra.clover that does not use clustering, i.e. when clusterer=None, is equivalent to the
corresponding imbalanced-learn oversampler:
>>> import numpy as np
>>> X_res_im, y_res_im = SMOTE(random_state=5).fit_resample(X, y)
>>> X_res_cl, y_res_cl = ClusterOverSampler(SMOTE(random_state=5), clusterer=None).fit_resample(X, y)
>>> np.testing.assert_equal(X_res_im, X_res_cl)
>>> np.testing.assert_equal(y_res_im, y_res_cl)
References
[^1]: G. Douzas, F. Bacao, "Self-Organizing Map Oversampling (SOMO) for imbalanced data set learning", Expert Systems with
Applications, vol. 82, pp. 40-52, 2017.
[^2]: G. Douzas, F. Bacao, F. Last, "Improving imbalanced learning through a heuristic oversampling method based on k-means and
SMOTE", Information Sciences, vol. 465, pp. 1-20, 2018.
[^3]: G. Douzas, F. Bacao, F. Last, "G-SOMO: An oversampling approach based on self-organized maps and geometric SMOTE", Expert
Systems with Applications, vol. 183, 115230, 2021.
[^4]: G. Douzas, F. Bacao, F. Last, "Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE", Information
Sciences, vol. 501, 2019.