Geometric SMOTE
์†์ง€์šฐ

Douzas, G., & Bacao, F. (2019). Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE. Information Sciences, 501, 118-135.

In Short

SMOTE์˜ data generation ํŒŒํŠธ๋ฅผ geometrically ํ™•์žฅํ•œ oversampling ์•Œ๊ณ ๋ฆฌ์ฆ˜

1. Introduction

๋ฐ์ดํ„ฐ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๋Š” ์–ธ์ œ๋‚˜ ์ค‘์š”ํ•œ ๋ฌธ์ œ์ด๋‹ค.

2. Related Works

2-1. Modifications of the selection phase

๋ฐ์ดํ„ฐ ๋ถˆ๊ท ํ˜• ๋ฌธ์ œ๋Š” ๊ฒŒ ๋‘ ๊ฐ€์ง€๋กœ ๋‚˜๋ˆ ์„œ ์ƒ๊ฐํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. ํ•˜๋‚˜๋Š” between-class ๋ถˆ๊ท ํ˜•, ๋‚˜๋จธ์ง€ ํ•˜๋‚˜๋Š” within-class ๋ถˆ๊ท ํ˜•์ด๋‹ค. ์—ฌ๊ธฐ์„œ between-class ๋ถˆ๊ท ํ˜•์€ ๊ธฐ์กด์— ์šฐ๋ฆฌ๊ฐ€ ์•Œ๊ณ  ์žˆ๋˜ majority์™€ minority์˜ ๊ทน๋ช…ํ•œ ๋นˆ๋„์ˆ˜ ์ฐจ์ด๋ฅผ ๋œปํ•˜๋ฉฐ, within-class ๋ถˆ๊ท ํ˜•์€ ๊ฐ™์€ ํด๋ž˜์Šค ์•ˆ์—์„œ๋„ ์„ธ๋ถ€ ํด๋ž˜์Šค๋กœ ๋‚˜๋‰  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฐ€๋Šฅ์„ฑ์— ๋Œ€ํ•ด ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ๋‹ค.

  1. Between-class imbalance
    SMOTE+ENN, which combines SMOTE with ENN (Edited Nearest Neighbor), is a representative between-class method that modifies the selection phase: it first runs SMOTE and then uses ENN to remove misclassified samples. Beyond this, Borderline-SMOTE, MWMOTE (Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning), ADASYN, and KernelADASYN all work from the borderline instances between the majority and minority classes in order to prevent the generation of noisy samples.

  2. Within-class imbalance
    Methods that address within-class imbalance are generally based on clustering. Cluster-SMOTE applies the k-means algorithm and then runs SMOTE. DBSMOTE uses DBSCAN to identify clusters and then generates new samples using each cluster's centroid and the minority samples closest to it. A-SUWO clusters the minority class at a size determined via cross-validation and then generates new samples. SOMO builds a two-dimensional representation of the input space (a U-matrix) and preserves the manifold structure by generating intra-cluster and inter-cluster samples with SMOTE. Similarly, there is a method that combines k-means with SMOTE (SMOTE+KMeans) and re-balances the class distribution based on the density of the identified clusters. Finally, there are also approaches that combine oversampling with ensemble methods, such as SMOTEBoost and DataBoost-IM.

2-2. Modifications of the data generation mechanism

์œ„์˜ Selection ํŒŒํŠธ์— ๋น„ํ•ด, Data generation ํŒŒํŠธ๋Š” ์ƒ๋Œ€์ ์œผ๋กœ ๋œ ์—ฐ๊ตฌ๊ฐ€ ๋œ ๋ถ€๋ถ„์ด๋‹ค. Safe-Level SMOTE๋Š” weight degree๋ผ๋Š” safe level์ด๋ผ๋Š” ๊ฐœ๋…์„ ์ œ์•ˆํ•˜์˜€๋‹ค. safe level์„ ํ†ตํ•ด์„œ safe level ratio๊ฐ€ ๊ณ„์‚ฐ๋˜๋Š”๋ฐ, line segment๋ฅผ truncateํ•˜๋Š” ํšจ๊ณผ๋ฅผ ์ง€๋‹Œ๋‹ค. Data Generation์—์„œ ์•„์˜ˆ SMOTE๊ฐ€ ์•„๋‹Œ ๋ฐฉ๋ฒ•๋„ ์žˆ๋Š”๋ฐ, ๋Œ€ํ‘œ์ ์œผ๋กœ๋Š” CGAN(Conditional GAN)์ด ์žˆ๋‹ค. CGAN์€ input space์˜ local information๋ณด๋‹ค๋Š” true data distribution์„ ์ง์ ‘์ ์œผ๋กœ ๊ทผ์‚ฌํ•˜๋Š” ๋ฐ์— ์ดˆ์ ์„ ๋‘” ๋ฐฉ๋ฒ•์ด๋‹ค.

3. Motivation

  1. Generation of noisy instances due to the selection of k-nearest neighbors

    figure1

  2. Generation of noisy examples due to the selection of an initial observation

    figure2

  3. Generation of nearly duplicated instances

    figure3

  4. Generation of noisy instances due to the use of observations from two different minority class clusters.

    figure4

4. Proposed Method

G-SMOTE๋Š” SMOTE์—์„œ data generation phase๋ฅผ ์ˆ˜์ •ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค.

  1. To define a safe area around each selected minority class instance such that the generated artificial minority instances inside the area are not noisy.
  2. To increase the variety of generated samples by expanding the minority class area.
  3. To parameterize the above characteristics based on a small number of transformations with a geometrical interpretation.

4-1. G-SMOTE Algorithm

gsmote

4-2. Functions

  1. Surface
    i) if \(\alpha_{sel} = \text{minority}\), \(\boldsymbol{x}_{surface} \in S_{min,k}\)

    figure5


    ii) if \(\alpha_{sel} = \text{majority}\), \(\boldsymbol{x}_{surface} \in S_{maj,1}\)

    figure6


    iii) if \(\alpha_{sel} = \text{combined}\), \(\boldsymbol{x}_{surface} = \arg\min_{\boldsymbol{x} \in (\boldsymbol{x}_{min}, \boldsymbol{x}_{maj})}(||\boldsymbol{x}_{center} - \boldsymbol{x}||)\) where \(\boldsymbol{x}_{min} \in S_{min,k}\) and \(\boldsymbol{x}_{maj} \in S_{maj,1}\)

    figure7


    figure8

  2. Hyperball
    $$\boldsymbol{x}_{gen} \leftarrow r^{1/p} \boldsymbol{e}_{sphere} \\ \text{where } \boldsymbol{e}_{sphere} \leftarrow \frac{\boldsymbol{v}_{normal}}{||\boldsymbol{v}_{normal}||} \\ \boldsymbol{v}_{normal} \leftarrow (v_1, ..., v_p), \; v_i \sim N(0,1) \\ r \sim U(0,1)$$

    figure9

  3. Vectors
    $$\boldsymbol{x}_{//} \leftarrow x_{//}\boldsymbol{e}_{//} \\ \boldsymbol{x}_{\perp} \leftarrow \boldsymbol{x}_{gen} - \boldsymbol{x}_{//} \\ \text{where } \boldsymbol{e}_{//} \leftarrow \frac{\boldsymbol{x}_{surface} - \boldsymbol{x}_{center}}{||\boldsymbol{x}_{surface} - \boldsymbol{x}_{center}||} \\ x_{//} \leftarrow \boldsymbol{x}_{gen} \cdot \boldsymbol{e}_{//}$$

  4. Truncate
    $$\boldsymbol{x}_{gen} \leftarrow \boldsymbol{x}_{gen} - 2\boldsymbol{x}_{//} \\ \text{if } |\alpha_{trunc} - x_{//}| > 1$$

    figure10


    figure11

  5. Deform
    $$\boldsymbol{x}_{gen} \leftarrow \boldsymbol{x}_{gen} - \alpha_{def}\boldsymbol{x}_{\perp}$$

    figure12


    figure13

  6. Translate
    $$\boldsymbol{x}_{gen} \leftarrow \boldsymbol{x}_{center} + R\boldsymbol{x}_{gen}$$

    figure14


    figure15
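The six functions above can be combined into a single generation step. Below is a NumPy sketch that follows the formulas literally for one point (the function name is mine; selecting \(\boldsymbol{x}_{surface}\) according to \(\alpha_{sel}\) is assumed to have already happened):

```python
import numpy as np

def gsmote_generate_point(x_center, x_surface, alpha_trunc, alpha_def, rng=None):
    """One G-SMOTE generation step: sample inside the unit hyperball,
    then truncate, deform, and translate into the data space."""
    rng = np.random.default_rng(rng)
    p = len(x_center)
    radius = np.linalg.norm(x_surface - x_center)

    # Hyperball: uniform sample inside the unit hyperball
    v_normal = rng.standard_normal(p)
    e_sphere = v_normal / np.linalg.norm(v_normal)
    x_gen = rng.random() ** (1.0 / p) * e_sphere

    # Vectors: parallel / perpendicular decomposition w.r.t. x_surface - x_center
    e_par = (x_surface - x_center) / radius
    proj = x_gen @ e_par          # scalar x_//
    x_par = proj * e_par          # vector x_//
    x_perp = x_gen - x_par

    # Truncate: reflect the sample if it falls in the disallowed half
    if abs(alpha_trunc - proj) > 1:
        x_gen = x_gen - 2 * x_par  # perpendicular component is unchanged

    # Deform: shrink the perpendicular component toward the axis
    x_gen = x_gen - alpha_def * x_perp

    # Translate: scale by the radius and move to the minority instance
    return x_center + radius * x_gen
```

Setting \(\alpha_{trunc}=1, \alpha_{def}=1\) collapses the hyperball onto the segment from \(\boldsymbol{x}_{center}\) toward \(\boldsymbol{x}_{surface}\), recovering plain SMOTE, which is the taxonomy point made in section 6-3.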

4-3. Justification of the Algorithm

G-SMOTE extends the linear interpolation mechanism by introducing a geometric region where the data generation process occurs.

  1. \(S_{gen}\) is initialized as the empty set.
  2. \(S_{min}\) is shuffled.
  3. \(\boldsymbol{x}_{center}\) is selected.
  4. This step generalizes SMOTE's selection process. In Surface, three cases arise depending on \(\alpha_{sel}\); see the details above.
  5. This corresponds to Vectors.
    \(\boldsymbol{x}_{//}\): projection of \(\boldsymbol{x}_{gen}\) onto the unit vector \(\boldsymbol{e}_{//}\)
    \(\boldsymbol{x}_{\perp}\): component perpendicular to the same vector, also belonging to the hyperplane defined by \(\boldsymbol{x}_{gen}\) and \(\boldsymbol{e}_{//}\)
  6. Data generation starts here. Following Hyperball, \(\boldsymbol{e}_{sphere}\) and \(\boldsymbol{x}_{gen}\) are created.
  7. Truncate
  8. Deform
  9. Translate
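The loop above can be sketched end to end for the simplest configuration. The sketch below assumes \(\alpha_{sel}=\text{minority}\) and omits Truncate and Deform (i.e. \(\alpha_{trunc}=0, \alpha_{def}=0\)), so each synthetic point is drawn uniformly from a hyperball around a minority sample; all names are mine:

```python
import numpy as np

def gsmote_oversample_minority(X_min, n_to_generate, k=3, rng=None):
    """Simplified G-SMOTE loop (alpha_sel='minority', no truncation or
    deformation): each synthetic point is drawn uniformly from the
    hyperball centered at a minority sample, with radius equal to the
    distance to one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n, p = X_min.shape
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]  # k nearest minority neighbors
    S_gen = []                              # step 1: S_gen starts empty
    order = rng.permutation(n)              # step 2: shuffle S_min
    for t in range(n_to_generate):
        i = order[t % n]                    # step 3: pick x_center
        x_center = X_min[i]
        x_surface = X_min[nn[i, rng.integers(k)]]  # step 4 (minority case)
        radius = np.linalg.norm(x_surface - x_center)
        v = rng.standard_normal(p)          # step 6: hyperball sample
        e_sphere = v / np.linalg.norm(v)
        x_gen = rng.random() ** (1.0 / p) * e_sphere
        S_gen.append(x_center + radius * x_gen)    # step 9: translate
    return np.array(S_gen)
```

Cycling through a shuffled \(S_{min}\) rather than sampling it mirrors steps 2–3 and keeps the generated points spread over the whole minority class.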

5. Research Methodology

5-1. Experimental Data

์ด 69๊ฐœ datasets

  • UCI Machine Learning Repository: 13 datasets
  • KEEL repository: 13 datasets
  • Simulated data based on variations of the “MANDELION” dataset: 2 datasets
  • additional datasets with higher imbalance ratios

5-2. Evaluation Measures

i) Accuracy
ii) AUC
iii) F-score
iv) G-mean

5-3. Machine Learning Algorithms

๋น„๊ต๋Œ€์ƒ: SMOTE, Random Oversampling, NO oversampling
๋ถ„๋ฅ˜๊ธฐ: Logistic Regression, K-Nearest Neighbors, Decision Tree, Gradient Boosting Classifier

5-4. Experimental Procedure

5-fold cross validation
\(k \in \{3, 5\}\)
\(\alpha_{trunc} \in \{-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0\}\)
\(\alpha_{def} \in \{0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0\}\)
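Enumerating this grid directly shows its size. A small sketch (assuming the three \(\alpha_{sel}\) strategies are also searched, which the grid above does not list explicitly; variable names are mine):

```python
from itertools import product

k_values = [3, 5]
alpha_trunc = [-1.0, -0.5, 0.0, 0.25, 0.5, 0.75, 1.0]
alpha_def = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
alpha_sel = ["minority", "majority", "combined"]

# full Cartesian grid, each cell evaluated with 5-fold cross-validation
grid = list(product(k_values, alpha_trunc, alpha_def, alpha_sel))
print(len(grid))  # 2 * 7 * 7 * 3 = 294 configurations
```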

ํ†ต๊ณ„์ ์œผ๋กœ ์œ ์˜ํ•œ ์ฐจ์ด๊ฐ€ ์žˆ๋Š”์ง€ ๋ณด๊ธฐ ์œ„ํ•ด์„œ Friedman Test์™€ Holms Test๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ ์•„๋ž˜ 6-2. Statistical Analysis๋ฅผ ์ฐธ๊ณ ํ•˜๋ฉด ๋œ๋‹ค.

5-5. Software Implementation

python์—์„œ ํ•ด๋‹น ํŒจํ‚ค์ง€๊ฐ€ ๊ตฌ์ถ•๋˜์–ด์žˆ๋‹ค.

6. Results and Discussion

6-1. Comparative Presentation

table2


table3


table4

6-2. Statistical Analysis

table5


table6

[Table 5] Friedman test: checks whether there are statistically significant differences across oversampling methods

  • Conclusion: for every classifier and every evaluation metric, the mean ranks differ across oversampling methods.

[Table 6] Holm's test: checks whether G-SMOTE, used as the control method, outperforms the other methods

  • Conclusion: G-SMOTE performs better than the other oversampling methods.

6-3. G-SMOTE taxonomy

G-SMOTE์˜ geometric hyperparameter: \(\alpha_{trunc}, \alpha_{def}, \alpha_{sel}\)

  1. SMOTE
    With \(\alpha_{trunc}=1.0, \alpha_{def}=1.0, \alpha_{sel}=\text{minority}\), G-SMOTE reduces to ordinary SMOTE.

  2. Modified SMOTE
    Even with \(\alpha_{def}=1.0\) fixed, varying \(\alpha_{trunc}\) and \(\alpha_{sel}\) yields modified variants of SMOTE. New synthetic examples are still generated on a line segment as in SMOTE, but depending on the combination of \(\alpha_{trunc}\) and \(\alpha_{sel}\), the segment can be truncated, expanded, or rotated.

  3. Pure G-SMOTE
    When \(\alpha_{def}\) is also set freely, in addition to \(\alpha_{trunc}\) and \(\alpha_{sel}\), the data generation area changes from a line segment to a hyper-spheroid.

table7

Table 7์„ ํ†ตํ•ด์„œ ์•Œ ์ˆ˜ ์žˆ๋“ฏ์ด, ์ด 26,391๋ฒˆ์˜ ์‹คํ—˜์—์„œ Pure G-SMOTE๊ฐ€ ์••๋„์ ์œผ๋กœ ๋งŽ์€ ๋นˆ๋„์ˆ˜๋กœ ์„ฑ๋Šฅ์ด ์ข‹์•˜๋‹ค.

6-4. Analysis and Tuning of optimal hyper-parameters

  1. Meaning of \(\alpha_{trunc}, \alpha_{def}, \alpha_{sel}\)
    These hyperparameters affect the data generation process. In particular, \(\alpha_{sel}=\text{majority}\) expands the minority class area aggressively, and the smaller the absolute values of \(\alpha_{trunc}\) and \(\alpha_{def}\), the stronger this effect becomes.

  2. IR ๋˜๋Š” R๊ณผ geometric hyperparameter ๊ฐ„์˜ ๊ด€๊ณ„
    ์—ฌ๊ธฐ์„œ IR์€ Imbalance Ratio, R์€ ๋ณ€์ˆ˜ ์ˆ˜ ๋Œ€๋น„ ์ƒ˜ํ”Œ ์ˆ˜๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

i) High IR or Low R
majority ๋˜๋Š” combined, ๊ทธ๋ฆฌ๊ณ  ๋‚ฎ์€ ์ ˆ๋Œ“๊ฐ’์˜ truncation, deformation hyperparameter๊ฐ€ ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค. ๋ถˆ๊ท ํ˜•๋„๊ฐ€ ๋†’์€ ๊ฒฝ์šฐ์—๋Š” ์ผ๋ฐ˜ SMOTE๋Š” ๊ธฐ์กด์˜ ๋ฐ์ดํ„ฐ์™€ ๊ฑฐ์˜ ์œ ์‚ฌํ•œ ๋˜๋Š” noisy ์ƒ˜ํ”Œ๋“ค์„ ๋งŒ๋“ค์–ด๋‚ธ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค. ๋˜ํ•œ, R์ด ๋‚ฎ์€ ๊ฒฝ์šฐ(sparse input space)์—๋Š” ์ผ๋ฐ˜ SMOTE์˜ ๊ธฐ๋ณธ linear interpolation ๊ณผ์ •์ด ํŠน์ • ๋ฐฉํ–ฅ์˜ input space์—์„œ๋งŒ ์ƒ˜ํ”Œ๋“ค ๋งŒ๋“ค์–ด๋‚ด์–ด ๊ธฐ์กด ๋ฐ์ดํ„ฐ์™€ ์œ ์‚ฌํ•˜๊ฑฐ๋‚˜ noisyํ•œ ์ƒ˜ํ”Œ๋“ค์„ ๋งŒ๋“ค์–ด๋‚ธ ๊ฒƒ์œผ๋กœ ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋‹ค.

ii) Low IR or High R
minority selection, together with large absolute values of the truncation and deformation hyperparameters, performed relatively well. When the imbalance is mild or R is large, the input space already carries enough information, so SMOTE's weaknesses can apparently be overcome.
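The heuristics in i) and ii) can be written down as a rule of thumb. The sketch below computes IR and R for a labeled dataset and suggests a tuning regime; the function name and threshold values are illustrative placeholders of mine, not from the paper:

```python
import numpy as np

def suggest_regime(y, n_features, ir_threshold=9.0, r_threshold=10.0):
    """Compute IR (majority/minority count ratio) and R (samples per
    feature), then suggest a G-SMOTE tuning regime per the discussion
    above.  Thresholds are illustrative, not from the paper."""
    counts = np.bincount(y)
    ir = counts.max() / counts.min()   # imbalance ratio
    r = len(y) / n_features            # samples per feature
    if ir > ir_threshold or r < r_threshold:
        # high IR or low R: expand the minority area aggressively
        return {"alpha_sel": "majority or combined",
                "alpha_trunc": "small |value|",
                "alpha_def": "small |value|"}
    # low IR or high R: stay close to plain SMOTE behaviour
    return {"alpha_sel": "minority",
            "alpha_trunc": "large |value|",
            "alpha_def": "large |value|"}
```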

7. Conclusions

์ •๋ฆฌํ•˜์ž๋ฉด, G-SMOTE๋Š” minority class area ๊ทผ์ฒ˜์—์„œ safe radius๋ฅผ ์ •ํ•˜๊ณ  ์•ˆ์ „ํ•œ ์ดˆ-ํšŒ์ „ํƒ€์›์ฒด ๋‚ด์—์„œ ์ถ”๊ฐ€์ ์ธ ์ƒ˜ํ”Œ๋“ค์„ ๋งŒ๋“ค์–ด๋‚ด๋Š” ๋ฐฉ์‹์ด๋‹ค. ์ ์€ ์ˆ˜์˜ hyperparameter๋ฅผ ์กฐ์ •ํ•ด์ฃผ๊ธฐ๋งŒ ํ•ด๋„ ํ€„๋ฆฌํ‹ฐ ์ข‹์€ ์ƒ˜ํ”Œ๋“ค์„ ๋งŒ๋“ค์–ด๋‚ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ธก๋ฉด์—์„œ G-SMOTE๋Š” ์ด์ „๋ณด๋‹ค ๋ฐœ์ „ํ–ˆ๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค.

MY OWN OPINION

  1. This paper feels refreshing in that it tackles the data generation phase of oversampling, which has received relatively little research attention compared with the selection phase.

  2. Judging from the fact that a Ph.D. researcher in our lab has also worked on AR-SMOTE (Angle-Rotated SMOTE), this seems like a very promising research direction.

  3. There is also a Geometric SMOTE for Regression paper, which I should read soon. Unlike classification, generating the y values seems to be a crucial part of regression oversampling, so I will pay particular attention to that.

  4. I learned that post-hoc analysis should consider not only IR (the imbalance ratio) but also R (the ratio of samples to features).

  5. Apart from G-SMOTE itself, reading the related works introduced me to SOMO, a paper that focuses on within-class imbalance and aims to preserve manifold structure.