SMOTE for Regression
์†์ง€์šฐ

Torgo, L., Ribeiro, R.P., Pfahringer, B., Branco, P. (2013). SMOTE for Regression. In: Correia, L., Reis, L.P., Cascalho, J. (eds) Progress in Artificial Intelligence. EPIA 2013. Lecture Notes in Computer Science, vol 8154. Springer, Berlin, Heidelberg.

In Short

SMOTER is an algorithm that adapts SMOTE to regression problems.

As an aside, the same authors published a paper titled Utility-based Regression in 2007, six years before this one, and SMOTER builds on the concept of relevance they proposed there.

1. Introduction

Regression ์ƒํ™ฉ์—์„œ๋„ ๋ถˆ๊ท ํ˜•๋ฐ์ดํ„ฐ์— ํ•ด๊ฒฐ์ฑ…์ด ํ•„์š”ํ•˜๋‹ค.

2. Problem Formulation

2-1. Utility-based Regression

$$\begin{aligned} U^{P}_{\phi}(\hat{y},y) &= B_{\phi}(\hat{y},y) - C_{\phi}(\hat{y},y) \\ &= \phi(y) \cdot (1-\Gamma_B(\hat{y},y)) - \phi^P(\hat{y},y) \cdot \Gamma_C(\hat{y},y) \end{aligned}$$

\(B_{\phi}(\hat{y},y), C_{\phi}(\hat{y},y), \Gamma_B(\hat{y},y), \Gamma_C(\hat{y},y)\) are functions related to the benefit and the cost of a prediction, respectively.
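The relevance function \(\phi: \mathcal{Y} \rightarrow [0,1]\) used throughout these formulas assigns each target value an importance score. A minimal sketch follows: the authors' implementation uses a smooth interpolation over user-supplied control points, while this stand-in is just piecewise linear (the control points below are illustrative, not from the paper).

```python
import numpy as np

def relevance(y, control_points):
    """Piecewise-linear relevance function phi: Y -> [0, 1].

    `control_points` is a list of (target_value, relevance) pairs;
    a simplified stand-in for the smooth interpolation the authors use.
    """
    ys, phis = zip(*sorted(control_points))
    return np.interp(y, ys, phis)

# Example: extreme targets are fully relevant, common mid-range ones are not.
phi = lambda y: relevance(y, [(0.0, 1.0), (50.0, 0.0), (100.0, 1.0)])
```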

2-2. Precision and Recall for Regression

$$\text{recall} = \frac{\sum_{i:\hat{z}_i=1, z_i=1}(1+u_i)}{\sum_{i:z_i=1}(1+\phi(y_i))}$$
$$\text{precision} = \frac{\sum_{i:\hat{z}_i=1, z_i=1}(1+u_i)}{\sum_{i:\hat{z}_i=1,z_i=1} \big(1+\phi(y_i)\big) + \sum_{i:\hat{z}_i=1,z_i=0}\big(2-p\big(\phi(y_i)\big)\big)}$$
where \(p\) is a weight differentiating the types of errors.

$$F_{\beta} = \frac{(\beta^2+1) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$
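A minimal sketch of these measures, assuming for simplicity that the utility \(u_i\) of each prediction is supplied by the caller, with \(z_i, \hat{z}_i\) the 0/1 indicators of whether the true and the predicted value fall in the rare region:

```python
import numpy as np

def regression_recall(phi_y, u, z_true, z_pred):
    """Recall for regression, following the formula above.

    phi_y  : relevance phi(y_i) of each true target
    u      : utility u_i of each prediction
    z_true : 1 if the true value is in the rare (relevant) region
    z_pred : 1 if the predicted value is in the rare region
    """
    # numerator: utility accumulated on correctly flagged rare cases
    hit = (z_pred == 1) & (z_true == 1)
    # denominator: total achievable relevance over all truly rare cases
    return np.sum(1 + u[hit]) / np.sum(1 + phi_y[z_true == 1])

def f_measure(precision, recall, beta=1.0):
    # the F_beta combination of precision and recall
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
```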

3. Sampling Approaches

3-1. Under-sampling common values

Rather than simply oversampling the rare target values, this is a hybrid approach that additionally undersamples the normal cases. The reason is that oversampling the rare values does not bring them to a 1:1 ratio with the normal ones, so some imbalance can remain.
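The hybrid idea can be sketched as follows (a simplification: `t_rel` and `perc` are hypothetical parameter names, and every rare case is kept while a random fraction of the normal cases survives):

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample_normal(X, y, phi, t_rel=0.5, perc=0.5):
    """Keep all rare cases (phi(y) >= t_rel) and a random
    fraction `perc` of the normal cases."""
    rare = phi(y) >= t_rel
    normal_idx = np.flatnonzero(~rare)
    # randomly retain only `perc` of the normal cases
    kept = rng.choice(normal_idx, size=int(len(normal_idx) * perc),
                      replace=False)
    idx = np.concatenate([np.flatnonzero(rare), kept])
    return X[idx], y[idx]
```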

3-2. SMOTE for regression

Point 1. how to define which are the relevant observations and the normal cases
Point 2. how to create new synthetic examples (i.e. oversampling)
Point 3. how to decide the target variable value of these new synthetic examples

(Figure: Algorithm 1 in the paper, the main SmoteR procedure)


(Figure: Algorithm 2 in the paper, generation of synthetic cases)

  • There is an error in Algorithm 2: the line should be new[a] <- x[a] + RANDOM(0,1) X diff, not new[a] <- case[a] + RANDOM(0,1) X diff.

4. Experimental Evaluation

(Figure: the regression datasets used in the evaluation)


(Figure: the learning algorithms compared)


(Figure: summary of the experimental results)


(Figure: the best-scoring sampling combinations)

A combination of 200% oversampling and 200% undersampling was the most effective.

5. Conclusions

The method contributes to predicting rare extreme values.

Critical Point (MY OWN OPINION)

  1. The main significance of the paper seems to be proposing SMOTE for regression in the first place.
  2. Because oversampling involves computing distances, feature scaling should matter a great deal. The distance must also account for categorical and numerical variables at the same time, which makes its choice important; it seems to be computed in a way similar to SMOTE-NC.
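One common concrete choice for such a mixed-type metric (an assumption on my part, in the spirit of HEOM / SMOTE-NC, not taken verbatim from the paper) is a range-normalized difference for numeric attributes and a 0/1 mismatch for categorical ones:

```python
import numpy as np

def mixed_distance(a, b, numeric_mask, ranges):
    """HEOM-style distance over mixed attributes.

    numeric_mask[j] is True for numeric attributes; ranges[j] is the
    observed range of attribute j, used to normalize numeric differences.
    """
    d = np.empty(len(a))
    for j in range(len(a)):
        if numeric_mask[j]:
            # normalized absolute difference for numeric attributes
            d[j] = abs(float(a[j]) - float(b[j])) / ranges[j]
        else:
            # simple 0/1 overlap for categorical attributes
            d[j] = 0.0 if a[j] == b[j] else 1.0
    return np.sqrt(np.sum(d ** 2))
```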

ETC

  • R์—์„œ๋Š” uba๋ผ๋Š” ํŒจํ‚ค์ง€๋กœ relevance function์ด ๊ตฌํ˜„๋˜์–ด ์žˆ๋‹ค.
  • ๋˜ํ•œ, SMOTER์˜ R ์ฝ”๋“œ๋„ ๊ณต๊ฐœ๋˜์–ด์žˆ๋‹ค. (http://www.dcc.fc.up.
    pt/~ltorgo/EPIA2013)