
Gsmote

Implementation of the Geometric SMOTE algorithm.

A geometrically enhanced drop-in replacement for SMOTE. It is compatible with scikit-learn and imbalanced-learn.

GeometricSMOTE(sampling_strategy='auto', k_neighbors=5, truncation_factor=1.0, deformation_factor=0.0, selection_strategy='combined', categorical_features=None, random_state=None)

Bases: BaseOverSampler

Class to perform over-sampling using Geometric SMOTE.

This algorithm is an implementation of Geometric SMOTE, a geometrically enhanced drop-in replacement for SMOTE. Read more in the [user_guide].

Parameters:

Name Type Description Default
categorical_features ArrayLike | None

Specifies which features are categorical. Can either be:

- array of indices specifying the categorical features.

- mask array of shape (n_features,) and `bool` dtype for which `True` indicates the categorical features.
None
sampling_strategy dict[int, int] | str | float | Callable

Sampling information to resample the data set.

  • When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. It is only available for binary classification.

  • When str, specify the class targeted by the resampling. The number of samples in the different classes will be equalized. Possible choices are:

    • 'minority': resample only the minority class.
    • 'not minority': resample all classes but the minority class.
    • 'not majority': resample all classes but the majority class.
    • 'all': resample all classes.
    • 'auto': equivalent to 'not majority'.
  • When dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples for each targeted class.

  • When callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.

'auto'
random_state RandomState | int | None

Control the randomization of the algorithm.

  • If int, it is the seed used by the random number generator.
  • If np.random.RandomState instance, it is the random number generator.
  • If None, the random number generator is the RandomState instance used by np.random.
None
truncation_factor float

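Controls how the hypersphere is truncated along the direction from the center to the surface point; 0.0 applies no truncation. The values should be in the [-1.0, 1.0] range.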

1.0
deformation_factor float

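The degree of deformation applied to the hypersphere; 0.0 keeps it spherical and 1.0 removes the component perpendicular to the direction from the center to the surface point. The values should be in the [0.0, 1.0] range.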

0.0
selection_strategy str

The type of Geometric SMOTE algorithm with the following options: 'combined', 'majority', 'minority'.

'combined'
k_neighbors NearestNeighbors | int

If int, number of nearest neighbours to use when synthetic samples are constructed for the minority method. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin class that will be used to find the k_neighbors.

5
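
As a rough illustration of the `sampling_strategy` options described above, the sketch below computes how many synthetic samples each option would request for a given label vector. The helper `samples_to_generate` is hypothetical and not part of the library; it only mirrors the documented semantics (a float as the minority-to-majority ratio after resampling, `'auto'` equalizing every class except the majority, a dict as desired final counts).

```python
from collections import Counter


def samples_to_generate(y, sampling_strategy="auto"):
    """Hypothetical helper: number of new samples per class for over-sampling."""
    counts = Counter(y)
    majority_class, n_majority = counts.most_common(1)[0]
    if isinstance(sampling_strategy, float):
        # Binary case only: ratio = n_minority_after / n_majority
        if len(counts) != 2:
            raise ValueError("A float sampling_strategy requires exactly two classes.")
        return {
            label: int(n_majority * sampling_strategy) - n
            for label, n in counts.items()
            if label != majority_class
        }
    if isinstance(sampling_strategy, dict):
        # Values are the desired final counts for each targeted class
        return {label: target - counts[label] for label, target in sampling_strategy.items()}
    if sampling_strategy in ("auto", "not majority"):
        # Bring every class except the majority up to the majority count
        return {label: n_majority - n for label, n in counts.items() if label != majority_class}
    raise ValueError(f"Unsupported sampling_strategy in this sketch: {sampling_strategy!r}")


# Same class balance as the example dataset: 900 majority, 100 minority
y = [1] * 900 + [0] * 100
auto_plan = samples_to_generate(y)
ratio_plan = samples_to_generate(y, 0.5)
dict_plan = samples_to_generate(y, {0: 300})
```

With these inputs the minority class would need 800 new samples under `'auto'`, 350 for a ratio of 0.5, and 200 for a target count of 300.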

Attributes:

Name Type Description
n_features_in_ int

Number of features in the input dataset.

nns_pos_ estimator object

Validated k-nearest neighbours created from the k_neighbors parameter. It is used to find the nearest neighbors of the same class of a selected observation.

nn_neg_ estimator object

Validated k-nearest neighbours created from the k_neighbors parameter. It is used to find the nearest neighbor of the remaining classes (k=1) of a selected observation.

random_state_ RandomState

An instance of np.random.RandomState class.

sampling_strategy_ dict[int, int]

Actual sampling strategy.

Examples:

>>> import numpy as np
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn_extra.gsmote import GeometricSMOTE
>>> np.set_printoptions(legacy='1.25')
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape %s' % Counter(y))
Original dataset shape Counter({1: 900, 0: 100})
>>> gsmote = GeometricSMOTE(random_state=1)
>>> X_resampled, y_resampled = gsmote.fit_resample(X, y)
>>> print('Resampled dataset shape %s' % Counter(y_resampled))
Resampled dataset shape Counter({0: 900, 1: 900})
Source code in src/imblearn_extra/gsmote/geometric_smote.py
def __init__(
    self: Self,
    sampling_strategy: dict[int, int] | str | float | Callable = 'auto',
    k_neighbors: NearestNeighbors | int = 5,
    truncation_factor: float = 1.0,
    deformation_factor: float = 0.0,
    selection_strategy: str = 'combined',
    categorical_features: ArrayLike | None = None,
    random_state: np.random.RandomState | int | None = None,
) -> None:
    """Initialize oversampler."""
    super().__init__(sampling_strategy=sampling_strategy)
    self.k_neighbors = k_neighbors
    self.truncation_factor = truncation_factor
    self.deformation_factor = deformation_factor
    self.selection_strategy = selection_strategy
    self.categorical_features = categorical_features
    self.random_state = random_state

_check_X_y(X, y)

Check input and output data.

Source code in src/imblearn_extra/gsmote/geometric_smote.py
def _check_X_y(  # noqa: N802
    self: Self,
    X: ArrayLike | sparse.csc_matrix | sparse.csr_matrix,
    y: ArrayLike,
) -> tuple[NDArray, NDArray, bool]:
    """Check input and output data."""
    y, binarize_y = check_target_type(y, indicate_one_vs_all=True)
    validated_data: tuple[NDArray | sparse.csc_matrix | sparse.csr_matrix, NDArray] = validate_data(
        self,
        X,
        y,
        reset=True,
        dtype=None,
        accept_sparse=['csr', 'csc'],
    )
    X, y = validated_data
    if not isinstance(X, np.ndarray):
        X = X.toarray()
    return X, y, binarize_y

_decode_categorical(X_init, X_resampled)

Reverses the encoding of the categorical features.

Source code in src/imblearn_extra/gsmote/geometric_smote.py
def _decode_categorical(self: Self, X_init: NDArray, X_resampled: NDArray) -> NDArray:
    """Reverses the encoding of the categorical features."""

    if self.categorical_features is None:
        return X_resampled.astype(X_init.dtype)

    if math.isclose(self.median_std_, 0):
        X_resampled[: X_init.shape[0], self.continuous_features_.size :] = self.ohe_.transform(
            X_init[:, self.categorical_features_],
        ).toarray()

    indices_reordered = np.argsort(
        np.hstack((self.continuous_features_, self.categorical_features_)),
    )
    X_resampled = np.hstack(
        (
            X_resampled[:, : self.continuous_features_.size],
            self.ohe_.inverse_transform(X_resampled[:, self.continuous_features_.size :]),
        ),
    )[:, indices_reordered].astype(X_init.dtype)
    return X_resampled
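
The column reordering at the end of this method can be hard to follow, so here is a minimal sketch of just that step. The feature layout and the data are made up for illustration; only the `np.hstack` plus `np.argsort` pattern is taken from the code above.

```python
import numpy as np

# Hypothetical layout: feature 0 is categorical, features 1 and 2 are continuous
continuous_features = np.array([1, 2])
categorical_features = np.array([0])

# Original data with columns in their natural order
X_original = np.array([[7.0, 0.1, 0.2], [8.0, 0.3, 0.4]])

# Internal layout places continuous columns first, then categorical ones
X_internal = np.hstack(
    (X_original[:, continuous_features], X_original[:, categorical_features]),
)

# argsort of the stacked indices yields the permutation back to the original order
indices_reordered = np.argsort(np.hstack((continuous_features, categorical_features)))
X_restored = X_internal[:, indices_reordered]
```

Sorting the stacked index vector tells each original column where it currently sits in the internal layout, which is why indexing with `indices_reordered` restores the input order.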

_encode_categorical(X, y)

Encode categorical features.

Source code in src/imblearn_extra/gsmote/geometric_smote.py
def _encode_categorical(self: Self, X: NDArray, y: NDArray) -> NDArray:
    """Encode categorical features."""

    if self.categorical_features is None:
        return X

    # Compute the median of the standard deviation of the minority class
    class_minority = Counter(y).most_common()[-1][0]

    # Calculate variance
    X_continuous = check_array(X[:, self.continuous_features_], dtype=np.float64)
    X_minority_continuous = X_continuous[np.flatnonzero(y == class_minority)]
    var = X_minority_continuous.var(axis=0)
    self.median_std_ = np.median(np.sqrt(var))

    # OneHotEncoder
    X_categorical = X[:, self.categorical_features_]
    X_ohe_categorical = self.ohe_.transform(X_categorical)
    X_ohe_categorical.data = (
        np.ones_like(X_ohe_categorical.data, dtype=X_ohe_categorical.dtype) * self.median_std_ / 2
    )
    X_encoded = np.hstack([X_continuous, X_ohe_categorical.toarray()])

    return X_encoded
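
To make the effect of the `median_std_ / 2` scaling concrete, the sketch below reproduces it on a tiny made-up dataset without the `OneHotEncoder`. The arrays and variable names are illustrative assumptions; the computation of the median standard deviation follows the code above.

```python
import numpy as np

# Hypothetical minority-class data: two continuous features
X_minority_continuous = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])

# Median of the per-feature standard deviations, as in the method above
median_std = np.median(np.sqrt(X_minority_continuous.var(axis=0)))

# Plain one-hot rows for two different categories of a single categorical feature
one_hot_a = np.array([1.0, 0.0])
one_hot_b = np.array([0.0, 1.0])

# Non-zero entries are rescaled to median_std / 2
scaled_a = one_hot_a * median_std / 2
scaled_b = one_hot_b * median_std / 2

# Euclidean distance contributed by a mismatch in this categorical feature
mismatch_distance = np.linalg.norm(scaled_a - scaled_b)
```

Under this scaling, two samples that differ in one categorical feature are separated by `median_std / sqrt(2)` in the encoded space, so category mismatches are measured on roughly the same scale as the continuous features.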

_make_geometric_samples(X_init, X, y, pos_class_label, n_samples)

A support function that returns artificial samples.

Source code in src/imblearn_extra/gsmote/geometric_smote.py
def _make_geometric_samples(  # noqa: C901
    self: Self,
    X_init: NDArray,
    X: NDArray,
    y: NDArray,
    pos_class_label: str | int,
    n_samples: int,
) -> tuple[NDArray, NDArray, list[NDArray]]:
    """A support function that returns artificial samples."""

    # Return zero new samples
    if n_samples == 0:
        return (
            np.array([], dtype=X.dtype).reshape(0, X.shape[1]),
            np.array([], dtype=y.dtype),
            [],
        )

    # Select positive class samples
    X_pos = X[y == pos_class_label]

    # Force minority strategy if no negative class samples are present
    self.selection_strategy_ = 'minority' if X.shape[0] == X_pos.shape[0] else self.selection_strategy

    # Minority or combined strategy
    if self.selection_strategy_ in ('minority', 'combined'):
        self.nns_pos_.fit(X_pos)
        points_pos = self.nns_pos_.kneighbors(X_pos)[1][:, 1:]
        samples_indices = self.random_state_.randint(
            low=0,
            high=len(points_pos.flatten()),
            size=n_samples,
        )
        rows = np.floor_divide(samples_indices, points_pos.shape[1])
        cols = np.mod(samples_indices, points_pos.shape[1])

    # Majority or combined strategy
    if self.selection_strategy_ in ('majority', 'combined'):
        X_neg = X[y != pos_class_label]
        self.nn_neg_.fit(X_neg)
        points_neg = self.nn_neg_.kneighbors(X_pos)[1]
        if self.selection_strategy_ == 'majority':
            samples_indices = self.random_state_.randint(
                low=0,
                high=len(points_neg.flatten()),
                size=n_samples,
            )
            rows = np.floor_divide(samples_indices, points_neg.shape[1])
            cols = np.mod(samples_indices, points_neg.shape[1])

    # Case where the median std equals zero
    if self.categorical_features is not None and math.isclose(self.median_std_, 0):
        X[:, self.continuous_features_.size :] = self.ohe_.transform(
            X_init[:, self.categorical_features_],
        ).toarray()
        X_pos = X[y == pos_class_label]
        if self.selection_strategy_ in ('majority', 'combined'):
            X_neg = X[y != pos_class_label]

    # Generate new samples
    X_new = np.zeros((n_samples, X.shape[1]))
    all_neighbors = []
    for ind, (row, col) in enumerate(zip(rows, cols, strict=False)):
        # Define center point
        center = X_pos[row]

        # Minority strategy
        if self.selection_strategy_ == 'minority':
            surface_point = X_pos[points_pos[row, col]]
            neighbors = X_pos[points_pos[row]]

        # Majority strategy
        elif self.selection_strategy_ == 'majority':
            surface_point = X_neg[points_neg[row, col]]
            neighbors = X_neg[points_neg[row]]

        # Combined strategy
        else:
            surface_point_pos = X_pos[points_pos[row, col]]
            surface_point_neg = X_neg[points_neg[row, 0]]
            radius_pos = norm(center - surface_point_pos)
            radius_neg = norm(center - surface_point_neg)
            surface_point = surface_point_neg if radius_pos > radius_neg else surface_point_pos
            neighbors = np.vstack([X_pos[points_pos[row]], X_neg[points_neg[row]]])

        if self.categorical_features is not None:
            all_neighbors.append(neighbors)

        # Append new sample - no categorical features
        X_new[ind] = make_geometric_sample(
            center,
            surface_point,
            self.truncation_factor,
            self.deformation_factor,
            self.random_state_,
        )

    # Create new samples for target variable
    y_new = np.array([pos_class_label] * len(samples_indices))

    return X_new, y_new, all_neighbors
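
Two steps in the method above may benefit from a small standalone sketch: drawing flat indices and splitting them into a (row, column) pair, and choosing the surface point under the combined strategy. The neighbor matrix, the points, and the helper `combined_surface_point` are hypothetical; the index arithmetic and the radius comparison follow the code above.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical neighbor-index matrix: 4 minority samples, 3 neighbors each
points_pos = np.array([[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]])
n_samples = 6

# Draw flat indices, then split each into (row, col):
# row selects the center sample, col selects one of its neighbors
samples_indices = rng.randint(low=0, high=points_pos.size, size=n_samples)
rows = np.floor_divide(samples_indices, points_pos.shape[1])
cols = np.mod(samples_indices, points_pos.shape[1])


def combined_surface_point(center, surface_point_pos, surface_point_neg):
    """Hypothetical helper: keep whichever candidate is closer to the center."""
    radius_pos = np.linalg.norm(center - surface_point_pos)
    radius_neg = np.linalg.norm(center - surface_point_neg)
    return surface_point_neg if radius_pos > radius_neg else surface_point_pos


center = np.array([0.0, 0.0])
chosen = combined_surface_point(center, np.array([3.0, 0.0]), np.array([0.0, 1.0]))
```

Here the nearest point of another class is closer to the center than the selected same-class neighbor, so it becomes the surface point and bounds the radius of the hypersphere.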

_populate_categorical_features(X_new, y_new, all_neighbors)

A support function that populates categorical features.

Source code in src/imblearn_extra/gsmote/geometric_smote.py
def _populate_categorical_features(
    self: Self,
    X_new: NDArray,
    y_new: NDArray,
    all_neighbors: list[NDArray],
) -> tuple[NDArray, NDArray]:
    """A support function that populates categorical features."""
    categories_size = (
        [self.continuous_features_.size] + [cat.size for cat in self.ohe_.categories_]
        if self.categorical_features is not None
        else None
    )
    for ind, neighbors in enumerate(all_neighbors):
        X_new[ind] = populate_categorical_features(
            X_new[ind],
            neighbors,
            categories_size,
            self.random_state_,
        )
    return X_new, y_new

_validate_categorical_features()

Validate categorical features.

Source code in src/imblearn_extra/gsmote/geometric_smote.py
def _validate_categorical_features(self: Self) -> Self:
    """Validate categorical features."""

    if self.categorical_features is None:
        self.categorical_features_ = np.flatnonzero([])
        self.continuous_features_: NDArray = np.arange(self.n_features_in_)
        return self

    categorical_features = np.asarray(self.categorical_features)
    if categorical_features.dtype.name == 'bool':
        self.categorical_features_ = np.flatnonzero(categorical_features)
    else:
        if any(index not in np.arange(self.n_features_in_) for index in categorical_features):
            error_msg = (
                'Some of the categorical indices are out of range. Indices'
                f' should be between 0 and {self.n_features_in_ - 1}.'
            )
            raise ValueError(error_msg)
        self.categorical_features_ = np.sort(categorical_features)
    self.continuous_features_ = np.setdiff1d(
        np.arange(self.n_features_in_),
        self.categorical_features_,
    )

    if self.categorical_features_.size == self.n_features_in_:
        error_msg = (
            'GeometricSMOTE is not designed to work only with categorical '
            'features. It requires some numerical features.'
        )
        raise ValueError(error_msg)

    return self
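
As a small illustration of the two accepted forms of `categorical_features`, the sketch below normalizes an index array and a boolean mask to the same sorted index array, then derives the continuous features. The feature count and values are made up; the NumPy calls are the ones used in the method above.

```python
import numpy as np

n_features = 5

# The same two categorical columns given as indices and as a boolean mask
as_indices = np.asarray([3, 1])
as_mask = np.asarray([False, True, False, True, False])

# Indices are sorted; a mask is converted to the positions of its True entries
from_indices = np.sort(as_indices)
from_mask = np.flatnonzero(as_mask)

# Continuous features are whatever remains
continuous = np.setdiff1d(np.arange(n_features), from_mask)
```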

_validate_estimators(X)

Validate nearest neighbors estimators.

Source code in src/imblearn_extra/gsmote/geometric_smote.py
def _validate_estimators(self: Self, X: NDArray) -> Self:
    """Validate nearest neighbors estimators."""

    # Check random state
    self.random_state_ = check_random_state(self.random_state)

    # Validate strategy
    if self.selection_strategy not in SELECTION_STRATEGIES:
        error_msg = (
            'Unknown selection_strategy for Geometric SMOTE algorithm. '
            f'Choices are {SELECTION_STRATEGIES}. Got {self.selection_strategy} instead.'
        )
        raise ValueError(error_msg)

    # Number of jobs
    n_jobs = None
    if isinstance(self.k_neighbors, NearestNeighbors):
        n_jobs = self.k_neighbors.n_jobs

    # Create nearest neighbors object for positive class
    if self.selection_strategy in ('minority', 'combined'):
        self.nns_pos_ = check_neighbors_object(
            'nns_positive',
            self.k_neighbors,
            additional_neighbor=1,
        )
        self.nns_pos_.set_params(n_jobs=n_jobs)

    # Create nearest neighbors object for negative class
    if self.selection_strategy in ('majority', 'combined'):
        self.nn_neg_ = check_neighbors_object('nn_negative', nn_object=1)
        self.nn_neg_.set_params(n_jobs=n_jobs)

    # Create one hot encoder object
    if self.categorical_features is not None:
        self.ohe_ = OneHotEncoder(
            sparse_output=True,
            handle_unknown='ignore',
            dtype=X.dtype if X.dtype.name != 'object' else np.float64,
        )
        self.ohe_.fit(X[:, self.categorical_features_])

    return self
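
The positive-class estimator is created with `additional_neighbor=1`, and `_make_geometric_samples` later drops the first neighbor column with `[:, 1:]`. The sketch below suggests why, using a brute-force distance matrix as a stand-in for a fitted `NearestNeighbors` estimator; the data and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical minority-class points on a line
X_pos = np.array([[0.0], [1.0], [3.0], [7.0]])
k_neighbors = 2

# Brute-force pairwise distances (stand-in for a fitted NearestNeighbors estimator)
distances = np.abs(X_pos - X_pos.T)

# Ask for k + 1 neighbors: column 0 is each point itself (distance 0)
nearest = np.argsort(distances, axis=1, kind="stable")[:, : k_neighbors + 1]

# Drop the self-match, leaving k genuine neighbors per point
neighbors = nearest[:, 1:]
```

When the query points are the same points the estimator was fitted on, each point is its own nearest neighbor, so one extra neighbor has to be requested to end up with `k_neighbors` distinct candidates.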

make_geometric_sample(center, surface_point, truncation_factor, deformation_factor, random_state)

A support function that returns an artificial point.

Parameters:

Name Type Description Default
center NDArray

The center point.

required
surface_point NDArray

The point on the surface of the hypersphere.

required
truncation_factor float

The truncation factor of the algorithm.

required
deformation_factor float

The deformation factor of the algorithm.

required
random_state RandomState

The random state of the process.

required

Returns:

Name Type Description
geometric_sample NDArray

The generated geometric sample.

Source code in src/imblearn_extra/gsmote/geometric_smote.py
def make_geometric_sample(
    center: NDArray,
    surface_point: NDArray,
    truncation_factor: float,
    deformation_factor: float,
    random_state: np.random.RandomState,
) -> NDArray:
    """A support function that returns an artificial point.

    Args:
        center:
            The center point.

        surface_point:
            The point on the surface of the hypersphere.

        truncation_factor:
            The truncation factor of the algorithm.

        deformation_factor:
            The deformation factor of the algorithm.

        random_state:
            The random state of the process.

    Returns:
        geometric_sample:
            The generated geometric sample.
    """

    # Zero radius case
    if np.array_equal(center, surface_point):
        return center

    # Generate a point on the surface of a unit hyper-sphere
    radius = norm(center - surface_point)
    normal_samples = random_state.normal(size=center.size)
    point_on_unit_sphere = normal_samples / norm(normal_samples)
    point: NDArray = (random_state.uniform(size=1) ** (1 / center.size)) * point_on_unit_sphere

    # Parallel unit vector
    parallel_unit_vector = (surface_point - center) / norm(surface_point - center)

    # Truncation
    close_to_opposite_boundary = truncation_factor > 0 and np.dot(point, parallel_unit_vector) < truncation_factor - 1
    close_to_boundary = truncation_factor < 0 and np.dot(point, parallel_unit_vector) > truncation_factor + 1
    if close_to_opposite_boundary or close_to_boundary:
        point -= 2 * np.dot(point, parallel_unit_vector) * parallel_unit_vector

    # Deformation
    parallel_point_position = np.dot(point, parallel_unit_vector) * parallel_unit_vector
    perpendicular_point_position = point - parallel_point_position
    point = parallel_point_position + (1 - deformation_factor) * perpendicular_point_position

    # Translation
    point = center + radius * point

    return point
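
To see what the truncation and deformation steps do in practice, the sketch below restates the steps of the function above in a local helper and checks two extreme settings. `sketch_geometric_sample` is a hypothetical local restatement written for this example, not an import from the package, and the points are made up.

```python
import numpy as np
from numpy.linalg import norm


def sketch_geometric_sample(center, surface_point, truncation_factor, deformation_factor, rng):
    """Local restatement of the steps above (not imported from the package)."""
    if np.array_equal(center, surface_point):
        return center
    radius = norm(center - surface_point)
    unit = (surface_point - center) / radius
    # Uniform point inside the unit hyper-sphere
    normal = rng.normal(size=center.size)
    point = rng.uniform() ** (1 / center.size) * normal / norm(normal)
    # Truncation: reflect points that fall in the excluded region
    proj = np.dot(point, unit)
    if (truncation_factor > 0 and proj < truncation_factor - 1) or (
        truncation_factor < 0 and proj > truncation_factor + 1
    ):
        point = point - 2 * proj * unit
    # Deformation: shrink the component perpendicular to the direction
    parallel = np.dot(point, unit) * unit
    point = parallel + (1 - deformation_factor) * (point - parallel)
    # Translation and scaling
    return center + radius * point


rng = np.random.RandomState(0)
center = np.array([1.0, 1.0, 1.0])
surface_point = np.array([3.0, 1.0, 1.0])
radius = norm(surface_point - center)
direction = (surface_point - center) / radius

# truncation_factor=1.0, deformation_factor=0.0: half hyper-sphere facing the surface point
half_sphere = [sketch_geometric_sample(center, surface_point, 1.0, 0.0, rng) for _ in range(200)]

# truncation_factor=1.0, deformation_factor=1.0: samples collapse onto the line segment
on_segment = [sketch_geometric_sample(center, surface_point, 1.0, 1.0, rng) for _ in range(200)]
fractions = [np.dot(s - center, direction) / radius for s in on_segment]
```

With both factors set to 1.0 every generated point lies on the segment between the center and the surface point, which is the kind of linear interpolation SMOTE performs; with the defaults (1.0 and 0.0) the points fill the half of the hypersphere on the surface point's side.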