Over sampling
This module includes classes for clustering-based oversampling.
It provides a general class for clustering-based oversampling as well as specific clustering-based oversamplers.
ClusterOverSampler(oversampler, clusterer=None, distributor=None, raise_error=True, random_state=None, n_jobs=None)
Bases: BaseOverSampler
A class that handles clustering-based oversampling.
Any combination of oversampler, clusterer and distributor can be used.
Read more in the [user_guide].
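The combination of a clusterer, a distributor and an oversampler can be pictured with the following minimal sketch. It is not the library's implementation: the helper names `allocate_per_cluster` and `oversample_cluster` are hypothetical, the cluster labels are given by hand instead of coming from a clusterer, the proportional allocation stands in for a distributor, and random pairwise interpolation stands in for an oversampler.

```python
import random
from collections import Counter, defaultdict


def allocate_per_cluster(cluster_labels, y, class_label, n_samples):
    """Split n_samples across clusters in proportion to their class_label counts."""
    counts = Counter(c for c, t in zip(cluster_labels, y) if t == class_label)
    total = sum(counts.values())
    raw = {c: n_samples * n / total for c, n in counts.items()}
    alloc = {c: int(v) for c, v in raw.items()}
    # Largest-remainder rounding so that the allocations sum to n_samples.
    leftover = n_samples - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda c: raw[c] - alloc[c], reverse=True)
    for c in by_remainder[:leftover]:
        alloc[c] += 1
    return alloc


def oversample_cluster(points, n_new, rng):
    """Create n_new points by interpolating between random pairs of points."""
    new_points = []
    for _ in range(n_new):
        a, b = rng.choice(points), rng.choice(points)
        gap = rng.random()
        new_points.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return new_points


# Toy data: cluster labels are assumed to come from some clusterer.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.1, 4.9),
     (5.2, 5.3), (9.0, 9.0), (9.1, 8.8), (8.9, 9.2)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0]
cluster_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

rng = random.Random(0)
# Distribute 4 new samples of class 1 (an arbitrary number for illustration).
allocation = allocate_per_cluster(cluster_labels, y, class_label=1, n_samples=4)

points_by_cluster = defaultdict(list)
for point, target, cluster in zip(X, y, cluster_labels):
    if target == 1:
        points_by_cluster[cluster].append(point)

# Oversample inside each selected cluster separately.
X_new = [
    point
    for cluster, n_new in allocation.items()
    for point in oversample_cluster(points_by_cluster[cluster], n_new, rng)
]
print(allocation)  # {0: 2, 1: 2}
```

Cluster 2 contains no samples of class 1, so it receives no generated samples in this sketch.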
Parameters:
Name | Type | Description | Default |
---|---|---|---|
oversampler | BaseOverSampler | Oversampler to apply to each selected cluster. | required |
clusterer | ClusterMixin \| None | Clusterer to apply to input space before oversampling. | None |
distributor | BaseDistributor \| None | Distributor to distribute the generated samples per cluster label. | None |
raise_error | bool | Raise an error when no samples are generated. | True |
random_state | RandomState \| int \| None | Control the randomization of the algorithm. | None |
n_jobs | int \| None | Number of CPU cores used. | None |
Attributes:
Name | Type | Description |
---|---|---|
oversampler_ | BaseOverSampler | A fitted clone of the |
clusterer_ | ClusterMixin | A fitted clone of the |
distributor_ | BaseDistributor | A fitted clone of the |
labels_ | Labels | Cluster labels of each sample. |
neighbors_ | Neighbors | An array that contains all neighboring pairs with each row being a unique neighboring pair. It is |
random_state_ | RandomState | An instance of |
sampling_strategy_ | dict[int, int] | Actual sampling strategy. |
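As a rough illustration of what a resolved sampling strategy of type `dict[int, int]` can look like, the sketch below maps each non-majority class to the number of samples needed to match the majority class. The helper `auto_sampling_strategy` is hypothetical and only assumes the usual "balance every class against the majority" behaviour; it is not taken from the library.

```python
from collections import Counter


def auto_sampling_strategy(y):
    """Return {class_label: n_samples_to_generate} for every non-majority class."""
    counts = Counter(y)
    n_majority = max(counts.values())
    return {
        label: n_majority - count
        for label, count in sorted(counts.items())
        if count < n_majority
    }


# Same class distribution as a 90/10 binary problem.
y = [0] * 90 + [1] * 10
strategy = auto_sampling_strategy(y)
print(strategy)  # {1: 80}
```

With 90 samples of class 0 and 10 of class 1, generating 80 samples of class 1 yields a 90/90 split.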
Examples:
>>> import numpy as np
>>> from collections import Counter
>>> from imblearn_extra.clover.over_sampling import ClusterOverSampler
>>> from sklearn.datasets import make_classification
>>> from sklearn.cluster import KMeans
>>> from imblearn.over_sampling import SMOTE
>>> np.set_printoptions(legacy='1.25')
>>> X, y = make_classification(random_state=0, n_classes=2, weights=[0.9, 0.1])
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({0: 90, 1: 10})
>>> cluster_oversampler = ClusterOverSampler(
...     oversampler=SMOTE(random_state=5),
...     clusterer=KMeans(random_state=10, n_init='auto'))
>>> X_res, y_res = cluster_oversampler.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_res))
Resampled dataset shape Counter({0: 90, 1: 90})
Source code in src/imblearn_extra/clover/over_sampling/_cluster.py
fit(X, y)
Check inputs and statistics of the sampler.
You should use fit_resample to generate the synthetic data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | InputData | Data array. | required |
y | Targets | Target array. | required |
Returns:
Name | Type | Description |
---|---|---|
self | Self | Return the instance itself. |
Source code in src/imblearn_extra/clover/over_sampling/_cluster.py
fit_resample(X, y, **fit_params)
Resample the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | InputData | Matrix containing the data to be resampled. | required |
y | Targets | Corresponding label for each sample in X. | required |
fit_params | dict[str, str] | Parameters passed to the fit method of the clusterer. | {} |
Returns:
Name | Type | Description |
---|---|---|
X_resampled | InputData | The array containing the resampled data. |
y_resampled | Targets | The corresponding labels of the resampled data. |
Source code in src/imblearn_extra/clover/over_sampling/_cluster.py
GeometricSOMO(sampling_strategy='auto', random_state=None, k_neighbors=5, truncation_factor=1.0, deformation_factor=0.0, selection_strategy='combined', som_estimator=None, imbalance_ratio_threshold='auto', distances_exponent='auto', distribution_ratio=0.8, raise_error=True, n_jobs=None)
Bases: ClusterOverSampler
Geometric SOMO algorithm.
Applies the SOM algorithm to the input space before applying Geometric SMOTE. Read more in the [user_guide].
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sampling_strategy | dict[int, int] \| str \| float \| Callable[[Targets], dict[int, int]] | Sampling information to resample the data set. | 'auto' |
random_state | RandomState \| int \| None | Control the randomization of the algorithm. | None |
k_neighbors | NearestNeighbors \| int | Defines the number of nearest neighbors to be used by SMOTE. | 5 |
truncation_factor | float | The type of truncation. The values should be in the | 1.0 |
deformation_factor | float | The type of geometry. The values should be in the | 0.0 |
selection_strategy | str | The type of Geometric SMOTE algorithm with the following options: | 'combined' |
som_estimator | SOM \| None | Defines the SOM clusterer applied to the input space. | None |
imbalance_ratio_threshold | float \| str | The threshold of a filtered cluster. It can be any non-negative number or Any cluster that has an imbalance ratio smaller than the filtering threshold is identified as a filtered cluster and can be potentially used to generate minority class instances. Higher values increase the number of filtered clusters. | 'auto' |
distances_exponent | float \| str | The exponent of the mean distance in the density calculation. It can be any non-negative number or | 'auto' |
distribution_ratio | float | The ratio of intra-cluster to inter-cluster generated samples. It is a number in the | 0.8 |
raise_error | bool | Raise an error when no samples are generated. | True |
n_jobs | int \| None | Number of CPU cores used. | None |
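The role of `distances_exponent` can be illustrated with a small density calculation. The sketch below is an assumption-laden illustration, not the library's code: it assumes the density of a cluster is its minority count divided by the mean pairwise distance raised to the exponent, and that sparser clusters receive a larger share of the generated samples. The helper names `cluster_density` and `sparsity_weights` are hypothetical.

```python
from itertools import combinations
from math import dist


def cluster_density(minority_points, distances_exponent):
    """Assumed density: minority count / (mean pairwise distance ** exponent)."""
    pairs = list(combinations(minority_points, 2))
    mean_distance = sum(dist(a, b) for a, b in pairs) / len(pairs)
    return len(minority_points) / mean_distance ** distances_exponent


def sparsity_weights(clusters, distances_exponent):
    """Normalize the inverse densities so that the weights sum to one."""
    sparsities = [1 / cluster_density(points, distances_exponent) for points in clusters]
    total = sum(sparsities)
    return [s / total for s in sparsities]


tight = [(0.0, 0.0), (1.0, 0.0)]   # mean pairwise distance 1
spread = [(0.0, 0.0), (2.0, 0.0)]  # mean pairwise distance 2

weights = sparsity_weights([tight, spread], distances_exponent=2)
print(weights)  # [0.2, 0.8]
```

Under these assumptions a larger exponent makes the allocation more sensitive to how spread out the minority samples of a cluster are.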
Attributes:
Name | Type | Description |
---|---|---|
oversampler_ | GeometricSMOTE | A fitted |
clusterer_ | SOM | A fitted |
distributor_ | DensityDistributor | A fitted |
labels_ | Labels | Labels of each sample. |
neighbors_ | Neighbors | An array that contains all neighboring pairs with each row being a unique neighboring pair. |
random_state_ | RandomState | An instance of |
sampling_strategy_ | dict[int, int] | Actual sampling strategy. |
Examples:
>>> import numpy as np
>>> from imblearn_extra.clover.over_sampling import GeometricSOMO
>>> from sklearn.datasets import make_blobs
>>> blobs = [100, 800, 100]
>>> X, y = make_blobs(blobs, centers=[(-10, 0), (0,0), (10, 0)])
>>> # Add a single 0 sample in the middle blob
>>> X = np.concatenate([X, [[0, 0]]])
>>> y = np.append(y, 0)
>>> # Make this a binary classification problem
>>> y = y == 1
>>> gsomo = GeometricSOMO(random_state=42)
>>> X_res, y_res = gsomo.fit_resample(X, y)
>>> # Find the number of new samples in the middle blob
>>> right, left = X_res[:, 0] > -5, X_res[:, 0] < 5
>>> n_res_in_middle = (right & left).sum()
>>> print("Samples in the middle blob: %s" % n_res_in_middle)
Samples in the middle blob: 801
>>> unchanged = n_res_in_middle == blobs[1] + 1
>>> print("Middle blob unchanged: %s" % unchanged)
Middle blob unchanged: True
>>> more_zero_samples = (y_res == 0).sum() > (y == 0).sum()
>>> print("More 0 samples: %s" % more_zero_samples)
More 0 samples: True
Source code in src/imblearn_extra/clover/over_sampling/_gsomo.py
KMeansSMOTE(sampling_strategy='auto', random_state=None, k_neighbors=5, kmeans_estimator=None, imbalance_ratio_threshold='auto', distances_exponent='auto', raise_error=True, n_jobs=None)
Bases: ClusterOverSampler
KMeans-SMOTE algorithm.
Applies KMeans clustering to the input space before applying SMOTE. Read more in the [user_guide].
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sampling_strategy | dict[int, int] \| str | Sampling information to resample the data set. | 'auto' |
random_state | RandomState \| int \| None | Control the randomization of the algorithm. | None |
k_neighbors | NearestNeighbors \| int | Defines the number of nearest neighbors to be used by SMOTE. | 5 |
kmeans_estimator | KMeans \| None | Defines the KMeans clusterer applied to the input space. | None |
imbalance_ratio_threshold | float \| str | The threshold of a filtered cluster. It can be any non-negative number or Any cluster that has an imbalance ratio smaller than the filtering threshold is identified as a filtered cluster and can be potentially used to generate minority class instances. Higher values increase the number of filtered clusters. | 'auto' |
distances_exponent | float \| str | The exponent of the mean distance in the density calculation. It can be any non-negative number or | 'auto' |
raise_error | bool | Raise an error when no samples are generated. | True |
n_jobs | int \| None | Number of CPU cores used. | None |
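The filtering rule described for `imbalance_ratio_threshold` can be sketched as follows. This is only an illustration: the helper `filtered_clusters` is hypothetical, and the imbalance ratio is assumed to be `(majority + 1) / (minority + 1)` per cluster, which avoids division by zero; the library may define it differently.

```python
from collections import Counter


def filtered_clusters(cluster_labels, y, minority_label, threshold=1.0):
    """Return the cluster labels whose imbalance ratio is below the threshold."""
    minority, majority = Counter(), Counter()
    for cluster, target in zip(cluster_labels, y):
        if target == minority_label:
            minority[cluster] += 1
        else:
            majority[cluster] += 1
    selected = []
    for cluster in sorted(set(cluster_labels)):
        # Assumed definition of the imbalance ratio of a cluster.
        ratio = (majority[cluster] + 1) / (minority[cluster] + 1)
        if ratio < threshold:
            selected.append(cluster)
    return selected


cluster_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y = [1, 1, 0, 0, 0, 0, 1, 0, 0]

print(filtered_clusters(cluster_labels, y, minority_label=1, threshold=1.0))  # [0]
print(filtered_clusters(cluster_labels, y, minority_label=1, threshold=2.0))  # [0, 2]
```

Raising the threshold from 1.0 to 2.0 adds cluster 2 to the selection, which matches the statement that higher values increase the number of filtered clusters.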
Attributes:
Name | Type | Description |
---|---|---|
oversampler_ | SMOTE | A fitted |
clusterer_ | KMeans \| MiniBatchKMeans | A fitted |
distributor_ | DensityDistributor | A fitted |
labels_ | Labels | Cluster labels of each sample. |
neighbors_ | None | It is |
random_state_ | RandomState | An instance of |
sampling_strategy_ | dict[int, int] | Actual sampling strategy. |
Examples:
>>> import numpy as np
>>> from imblearn_extra.clover.over_sampling import KMeansSMOTE
>>> from sklearn.datasets import make_blobs
>>> blobs = [100, 800, 100]
>>> X, y = make_blobs(blobs, centers=[(-10, 0), (0,0), (10, 0)])
>>> # Add a single 0 sample in the middle blob
>>> X = np.concatenate([X, [[0, 0]]])
>>> y = np.append(y, 0)
>>> # Make this a binary classification problem
>>> y = y == 1
>>> kmeans_smote = KMeansSMOTE(random_state=42)
>>> X_res, y_res = kmeans_smote.fit_resample(X, y)
>>> # Find the number of new samples in the middle blob
>>> n_res_in_middle = ((X_res[:, 0] > -5) & (X_res[:, 0] < 5)).sum()
>>> print("Samples in the middle blob: %s" % n_res_in_middle)
Samples in the middle blob: 801
>>> print("Middle blob unchanged: %s" % (n_res_in_middle == blobs[1] + 1))
Middle blob unchanged: True
>>> print("More 0 samples: %s" % ((y_res == 0).sum() > (y == 0).sum()))
More 0 samples: True
Source code in src/imblearn_extra/clover/over_sampling/_kmeans_smote.py
SOMO(sampling_strategy='auto', random_state=None, k_neighbors=5, som_estimator=None, distribution_ratio=0.8, raise_error=True, n_jobs=None)
Bases: ClusterOverSampler
SOMO algorithm.
Applies the SOM algorithm to the input space before applying SMOTE. Read more in the [user_guide].
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sampling_strategy | dict[int, int] \| str | Sampling information to resample the data set. | 'auto' |
random_state | RandomState \| int \| None | Control the randomization of the algorithm. | None |
k_neighbors | NearestNeighbors \| int | Defines the number of nearest neighbors to be used by SMOTE. | 5 |
som_estimator | SOM \| None | Defines the SOM clusterer applied to the input space. | None |
distribution_ratio | float | The ratio of intra-cluster to inter-cluster generated samples. It is a number in the | 0.8 |
raise_error | bool | Raise an error when no samples are generated. | True |
n_jobs | int \| None | Number of CPU cores used. | None |
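One way to read `distribution_ratio` is as the share of generated samples created inside clusters, with the remainder created between neighboring clusters. The sketch below makes exactly that assumption; the helper `split_samples` is hypothetical and the rounding rule is chosen only for illustration.

```python
def split_samples(n_samples, distribution_ratio=0.8):
    """Split n_samples into (intra-cluster, inter-cluster) counts."""
    if not 0.0 <= distribution_ratio <= 1.0:
        raise ValueError("distribution_ratio is assumed to lie between 0 and 1.")
    n_intra = round(n_samples * distribution_ratio)
    return n_intra, n_samples - n_intra


print(split_samples(100))       # (80, 20)
print(split_samples(10, 1.0))   # (10, 0)
```

With the default value of 0.8, 80 of 100 generated samples would be intra-cluster under this reading.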
Attributes:
Name | Type | Description |
---|---|---|
oversampler_ | SMOTE | A fitted |
clusterer_ | SOM | A fitted |
distributor_ | DensityDistributor | A fitted |
labels_ | Labels | Cluster labels of each sample. |
neighbors_ | Neighbors | An array that contains all neighboring pairs with each row being a unique neighboring pair. |
random_state_ | RandomState | An instance of |
sampling_strategy_ | dict[int, int] | Actual sampling strategy. |
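To make the shape of a "neighboring pairs" array concrete, the sketch below enumerates unique pairs of adjacent cells on a rectangular grid. It is purely illustrative: the integer labelling `row * n_columns + column`, the 4-neighborhood and the helper `grid_neighbor_pairs` are assumptions, and the SOM used by the library may encode labels and topology differently.

```python
def grid_neighbor_pairs(n_rows, n_columns):
    """List each pair of horizontally or vertically adjacent cells exactly once."""
    pairs = []
    for row in range(n_rows):
        for column in range(n_columns):
            label = row * n_columns + column
            if column + 1 < n_columns:
                pairs.append((label, label + 1))          # right neighbor
            if row + 1 < n_rows:
                pairs.append((label, label + n_columns))  # neighbor below
    return pairs


pairs = grid_neighbor_pairs(2, 2)
print(pairs)  # [(0, 1), (0, 2), (1, 3), (2, 3)]
```

Each row of the result is one unique pair, so a pair such as `(1, 0)` never appears alongside `(0, 1)`.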
Examples:
>>> import numpy as np
>>> from imblearn_extra.clover.over_sampling import SOMO
>>> from sklearn.datasets import make_blobs
>>> blobs = [100, 800, 100]
>>> X, y = make_blobs(blobs, centers=[(-10, 0), (0,0), (10, 0)])
>>> # Add a single 0 sample in the middle blob
>>> X = np.concatenate([X, [[0, 0]]])
>>> y = np.append(y, 0)
>>> # Make this a binary classification problem
>>> y = y == 1
>>> somo = SOMO(random_state=42)
>>> X_res, y_res = somo.fit_resample(X, y)
>>> # Find the number of new samples in the middle blob
>>> right, left = X_res[:, 0] > -5, X_res[:, 0] < 5
>>> n_res_in_middle = (right & left).sum()
>>> print("Samples in the middle blob: %s" % n_res_in_middle)
Samples in the middle blob: 801
>>> unchanged = n_res_in_middle == blobs[1] + 1
>>> print("Middle blob unchanged: %s" % unchanged)
Middle blob unchanged: True
>>> more_zero_samples = (y_res == 0).sum() > (y == 0).sum()
>>> print("More 0 samples: %s" % more_zero_samples)
More 0 samples: True
Source code in src/imblearn_extra/clover/over_sampling/_somo.py
clone_modify(oversampler, class_label, y_in_cluster)
Clone and modify attributes of oversampler for corner cases.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
oversampler | BaseOverSampler | The oversampler whose attributes are modified. | required |
class_label | int | The class label. | required |
y_in_cluster | Targets | The target values in the cluster. | required |
Returns:
Type | Description |
---|---|
BaseOverSampler | A cloned oversampler with modified number of nearest neighbors. |
Source code in src/imblearn_extra/clover/over_sampling/_cluster.py
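The idea of returning a modified clone instead of mutating the original oversampler can be sketched with a toy object. Everything here is hypothetical: `ToyOversampler` is a stand-in with a single `k_neighbors` setting, and capping the neighbor count at one less than the number of class samples in the cluster is an assumed corner-case rule, not the library's documented behaviour.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyOversampler:
    """Hypothetical stand-in for an oversampler with a k_neighbors setting."""

    k_neighbors: int = 5


def clone_with_capped_neighbors(oversampler, class_label, y_in_cluster):
    """Return a copy whose k_neighbors fits the class samples in the cluster."""
    n_class_samples = sum(1 for target in y_in_cluster if target == class_label)
    # Assumed rule: a sample can have at most n_class_samples - 1 neighbors.
    k_neighbors = min(oversampler.k_neighbors, max(n_class_samples - 1, 0))
    return replace(oversampler, k_neighbors=k_neighbors)


original = ToyOversampler(k_neighbors=5)
modified = clone_with_capped_neighbors(original, class_label=1, y_in_cluster=[1, 1, 1, 0, 0])
print(original.k_neighbors, modified.k_neighbors)  # 5 2
```

The original object keeps its setting, so the same oversampler can be reused for clusters of different sizes.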
extract_inter_data(X, y, cluster_labels, inter_distribution, sampling_strategy, random_state)
Extract data between filtered clusters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | InputData | The input data. | required |
y | Targets | The targets. | required |
cluster_labels | Labels | The cluster labels. | required |
inter_distribution | InterDistribution | The inter-clusters distributions. | required |
sampling_strategy | OrderedDict[int, int] | The sampling strategy to follow. | required |
random_state | RandomState | Control the randomization of the algorithm. | required |
Returns:
Type | Description |
---|---|
list[tuple[dict[int, int], InputData, Targets]] | The inter-clusters data. |
Source code in src/imblearn_extra/clover/over_sampling/_cluster.py
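A simplified picture of extracting data between pairs of clusters is sketched below. It is not the library's implementation: the helper `inter_cluster_data` is hypothetical, `inter_distribution` is assumed to map `((cluster_a, cluster_b), class_label)` to a proportion of the class's samples, and the sketch ignores the random state and any balancing between the two clusters.

```python
def inter_cluster_data(X, y, cluster_labels, inter_distribution, sampling_strategy):
    """Pool the class samples of each cluster pair with their sampling strategy."""
    results = []
    for ((label_a, label_b), class_label), proportion in inter_distribution.items():
        n_samples = round(proportion * sampling_strategy[class_label])
        if n_samples == 0:
            continue
        # Keep the samples of the class that fall in either cluster of the pair.
        keep = [
            cluster in (label_a, label_b) and target == class_label
            for cluster, target in zip(cluster_labels, y)
        ]
        X_pair = [x for x, flag in zip(X, keep) if flag]
        y_pair = [t for t, flag in zip(y, keep) if flag]
        results.append(({class_label: n_samples}, X_pair, y_pair))
    return results


X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (2.0, 2.0)]
y = [1, 0, 1, 1, 0]
cluster_labels = [0, 0, 1, 1, 2]
inter_distribution = {((0, 1), 1): 1.0}  # assumed structure
sampling_strategy = {1: 10}

data = inter_cluster_data(X, y, cluster_labels, inter_distribution, sampling_strategy)
print(data)  # [({1: 10}, [(0.0, 0.0), (1.0, 1.0), (1.1, 1.0)], [1, 1, 1])]
```

Each element of the result has the same three-part shape as the documented return type: a per-pair sampling strategy, the pooled inputs and the pooled targets.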
extract_intra_data(X, y, cluster_labels, intra_distribution, sampling_strategy)
Extract data for each filtered cluster.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | InputData | The input data. | required |
y | Targets | The targets. | required |
cluster_labels | Labels | The cluster labels. | required |
intra_distribution | IntraDistribution | The intra-clusters distributions. | required |
sampling_strategy | OrderedDict[int, int] | The sampling strategy to follow. | required |
Returns:
Type | Description |
---|---|
list[tuple[dict[int, int], InputData, Targets]] | The intra-clusters data. |
Source code in src/imblearn_extra/clover/over_sampling/_cluster.py
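How a global sampling strategy might be turned into per-cluster strategies is sketched below. The helper `intra_cluster_strategies` is hypothetical, and `intra_distribution` is assumed to map `(cluster_label, class_label)` to the proportion of that class's generated samples assigned to the cluster; the library's actual data structures may differ.

```python
def intra_cluster_strategies(intra_distribution, sampling_strategy):
    """Map each cluster label to its own {class_label: n_samples} strategy."""
    strategies = {}
    for (cluster_label, class_label), proportion in intra_distribution.items():
        n_samples = round(proportion * sampling_strategy[class_label])
        # Clusters that receive no samples are skipped entirely.
        if n_samples > 0:
            strategies.setdefault(cluster_label, {})[class_label] = n_samples
    return strategies


intra_distribution = {(0, 1): 0.75, (1, 1): 0.25, (2, 1): 0.0}  # assumed structure
sampling_strategy = {1: 80}

strategies = intra_cluster_strategies(intra_distribution, sampling_strategy)
print(strategies)  # {0: {1: 60}, 1: {1: 20}}
```

The per-cluster counts add up to the 80 samples requested for class 1, and cluster 2 is left out because its proportion is zero.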
generate_in_cluster(oversampler, transformer, cluster_sampling_strategy, X_in_cluster, y_in_cluster)
Generate intra-cluster or inter-cluster new samples.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
oversampler | BaseOverSampler | Oversampler to apply to each selected cluster. | required |
transformer | TransformerMixin | Transformer to apply before oversampling. | required |
cluster_sampling_strategy | dict[int, int] | The sampling strategy in the cluster. | required |
X_in_cluster | InputData | The input data in the cluster. | required |
y_in_cluster | Targets | The targets in the cluster. | required |
Returns:
Name | Type | Description |
---|---|---|
X_new | InputData | The generated samples. |
y_new | Targets | The corresponding labels of the generated samples. |
Source code in src/imblearn_extra/clover/over_sampling/_cluster.py
modify_nn(n_neighbors, n_samples)
Modify the nearest neighbors object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_neighbors | NearestNeighbors \| int | The | required |
n_samples | int | The number of samples. | required |
Returns:
Type | Description |
---|---|
NearestNeighbors \| int | The modified |
Source code in src/imblearn_extra/clover/over_sampling/_cluster.py
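For the integer case, the adjustment can be pictured as capping the neighbor count so that it never reaches the number of available samples. The sketch below is an assumption about that behaviour, not a copy of the library's code; the helper `cap_neighbors` is hypothetical and it does not cover `NearestNeighbors` objects.

```python
def cap_neighbors(n_neighbors, n_samples):
    """Return a neighbor count that is valid for n_samples points."""
    # A point cannot have more neighbors than there are other points.
    if n_neighbors >= n_samples:
        return n_samples - 1
    return n_neighbors


print(cap_neighbors(5, 3))   # 2
print(cap_neighbors(5, 10))  # 5
```

A small cluster with three samples would therefore use two neighbors instead of the default five under this rule.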