Torgo, L., Ribeiro, R.P., Pfahringer, B., Branco, P. (2013). SMOTE for Regression. In: Correia, L., Reis, L.P., Cascalho, J. (eds) Progress in Artificial Intelligence. EPIA 2013. Lecture Notes in Computer Science, vol 8154. Springer, Berlin, Heidelberg.
In Short
SMOTER is an adaptation of SMOTE to regression tasks.
For reference, six years before this paper, in 2007, the same authors published a paper called Utility-based Regression; SMOTER is proposed using the concept of relevance they introduced there.
1. Introduction
Regression settings also need techniques for handling imbalanced data.
2. Problem Formulation
2-1. Utility-based Regression
$$\begin{aligned} U^{P}_{\phi}(\hat{y},y) &= B_{\phi}(\hat{y},y) - C_{\phi}(\hat{y},y) \\ &= \phi(y) \cdot (1-\Gamma_B(\hat{y},y)) - \phi^P(\hat{y},y) \cdot \Gamma_C(\hat{y},y) \end{aligned}$$
\(B_{\phi}(\hat{y},y)\), \(C_{\phi}(\hat{y},y)\), \(\Gamma_B(\hat{y},y)\), \(\Gamma_C(\hat{y},y)\) are functions related to the benefit and the cost of a prediction, respectively.
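The relevance function \(\phi(y) \in [0,1]\) drives both the utility above and the sampling strategies later. The paper uses Ribeiro's piecewise-cubic relevance; the sigmoid below is only a stand-in with the same qualitative shape (high relevance for extreme targets), and the parameter names are mine.

```python
import numpy as np

def relevance(y, center, scale):
    """Stand-in relevance phi(y) in [0, 1]: targets far from the
    distribution center (rare extremes) get relevance close to 1.
    The paper uses a piecewise-cubic relevance instead; this sigmoid
    only mimics its shape."""
    return 1.0 / (1.0 + np.exp(-(np.abs(y - center) - scale)))
```

A case is then treated as rare when \(\phi(y)\) exceeds a user-chosen threshold.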
2-2. Precision and Recall for Regression
$$\text{recall} = \frac{\sum_{i:\hat{z_i}=1, z_i=1}(1+u_i)}{\sum_{i:z_i=1}(1+\phi(y_i))}$$
$$\text{precision} = \frac{\sum_{i:\hat{z_i}=1, z_i=1}(1+u_i)}{\sum_{i:\hat{z_i}=1,z_i=1} \Big(1+\phi(y_i)\Big) + \sum_{i:\hat{z_i}=1,z_i=0}\Big(2-p\big(\phi(y_i)\big)\Big)} \\ \text{where p is a weight differentiating the types of errors}$$
$$F_{\beta} = \frac{(\beta^2+1) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

(Setting \(\beta = 1\) gives the usual F1-score.)
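A minimal sketch of these utility-based metrics, assuming `u` holds the per-case utility \(u_i\), `phi_y` the relevance \(\phi(y_i)\), and `z_true`/`z_pred` the event flags \(z_i, \hat{z}_i\); I read \(p(\phi(y_i))\) as the product \(p \cdot \phi(y_i)\), which is an assumption:

```python
import numpy as np

def reg_precision_recall(u, phi_y, z_true, z_pred, p=0.5):
    """Utility-based precision/recall for regression.
    u: per-case utility u_i; phi_y: relevance phi(y_i);
    z_true, z_pred: boolean event flags z_i and z_hat_i.
    p weights the false-alarm term; p * phi is my reading of p(phi)."""
    tp = z_pred & z_true           # correctly signalled events
    fp = z_pred & ~z_true          # false alarms
    recall = np.sum(1 + u[tp]) / np.sum(1 + phi_y[z_true])
    precision = np.sum(1 + u[tp]) / (
        np.sum(1 + phi_y[tp]) + np.sum(2 - p * phi_y[fp]))
    return precision, recall

def f_beta(precision, recall, beta=1.0):
    """F-measure over the two scores above (beta = 1 gives F1)."""
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
```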
3. Sampling Approaches
3-1. Under-sampling common values
Rather than simply over-sampling the rare target values, this is a hybrid strategy that additionally under-samples the normal cases. The reason is that over-sampling alone does not bring the rare-to-normal ratio to 1:1, so a substantial imbalance can remain.
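A rough sketch of the under-sampling half, assuming the number of retained normal cases is expressed as a percentage of the rare cases (my reading of the paper's under-sampling percentage; parameter names are my own):

```python
import numpy as np

def undersample_common(X, y, phi, thr=0.5, percent=200, seed=0):
    """Randomly drop 'normal' cases (relevance phi(y) < thr), keeping
    `percent`% as many normal cases as there are rare ones -- an
    assumption about how the percentage is defined."""
    rng = np.random.default_rng(seed)
    rare = phi(y) >= thr
    normal_idx = np.flatnonzero(~rare)
    n_keep = min(len(normal_idx), int(rare.sum() * percent / 100))
    keep = rng.choice(normal_idx, size=n_keep, replace=False)
    idx = np.concatenate([np.flatnonzero(rare), keep])
    return X[idx], y[idx]
```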
3-2. SMOTE for regression
Point 1. how to define which are the relevant observations and the normal cases
Point 2. how to create new synthetic examples (i.e. oversampling)
Point 3. how to decide the target variable value of these new synthetic examples
- Algorithm 2 contains an error: the interpolation line should be new[a] <- x[a] + RANDOM(0,1) X diff, not new[a] <- case[a] + RANDOM(0,1) X diff.
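The generation step plus the target-assignment rule of Point 3 can be sketched as follows. This is my reading, not the authors' R code: it handles numeric features only, and `diff = case - x` is an assumed sign convention.

```python
import numpy as np

def smoter_synthetic(case, x, y_case, y_x, rng):
    """One synthetic example in the style of SMOTER. `case` is a rare
    seed and `x` one of its k nearest neighbours."""
    diff = case - x
    new = x + rng.random() * diff      # new[a] <- x[a] + RANDOM(0,1) * diff
    # Point 3: the new target is a distance-weighted average of the two
    # seed targets, the closer seed getting the larger weight.
    d_case = np.linalg.norm(new - case)
    d_x = np.linalg.norm(new - x)
    if d_case + d_x == 0.0:            # identical seeds: plain average
        return new, 0.5 * (y_case + y_x)
    w_case = d_x / (d_case + d_x)
    return new, w_case * y_case + (1.0 - w_case) * y_x
```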
4. Experimental Evaluation
Among the tested configurations, 200% over-sampling combined with 200% under-sampling was the most effective combination.
5. Conclusions
The method contributes to predicting rare extreme values of the target variable.
—
Critical Point (MY OWN OPINION)
- Proposing a SMOTE variant for regression is, in itself, a significant contribution.
- Because over-sampling relies on distance computations, feature scaling seems crucial. It also matters that the distance must handle categorical and numerical variables together; this appears to be done in the same way as SMOTE-NC.
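For illustration, a HEOM-style mixed distance in the spirit of SMOTE-NC might look like this (a sketch; the names and the penalty scheme are my own, not the paper's):

```python
import numpy as np

def mixed_distance(a_num, a_cat, b_num, b_cat, num_ranges, cat_penalty=1.0):
    """Distance over mixed features: numeric differences are normalized
    by each feature's range (so scaling is handled explicitly), and every
    categorical mismatch contributes a fixed penalty."""
    num_part = np.sum(((a_num - b_num) / num_ranges) ** 2)
    cat_part = np.sum((a_cat != b_cat) * cat_penalty ** 2)
    return float(np.sqrt(num_part + cat_part))
```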
—
ETC
- In R, the relevance function is implemented in the uba package.
- The R code for SMOTER is also publicly available (http://www.dcc.fc.up.pt/~ltorgo/EPIA2013).