A Random Forests Quantile Classifier for Class Imbalanced Data
   11 min read    ์†์ง€์šฐ

Oโ€™Brien, R., & Ishwaran, H. (2019). A random forests quantile classifier for class imbalanced data. Pattern recognition, 90, 232-249.

In Short

๋ถˆ๊ท ํ˜•๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด, quantile classifier์„ ์‚ฌ์šฉํ•œ Random Forest

1. Introduction

1-1. Definition of Imbalanced Data

์ผ๋ฐ˜์ ์œผ๋กœ ๋‘ ๊ฐœ์˜ ํด๋ž˜์Šค๊ฐ€ ์žˆ๋Š” ์ƒํ™ฉ์—์„œ, ํ•œ ํด๋ž˜์Šค์— ์†ํ•œ ์›์†Œ๊ฐ€ ๋‚˜๋จธ์ง€ ํด๋ž˜์Šค์— ์†ํ•œ ์›์†Œ์— ๋น„ํ•ด ์›”๋“ฑํ•˜๊ฒŒ ๋งŽ์€ ๊ฒฝ์šฐ๋ฅผ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ถˆ๊ท ํ˜•ํ•œ ์ƒํ™ฉ์ด๋ผ๊ณ  ์ •์˜ํ•œ๋‹ค. (์—ฌ๊ธฐ์„œ๋Š” $Y=1$์ด Minoritiy, $Y=0$์ด Majority๋ผ๊ณ  ์ƒ๊ฐํ•˜์ž.)

5๊ฐœ์˜ ๊ทผ์ ‘์›์†Œ๋“ค์— ๋Œ€ํ•ด์„œ Majority ํด๋ž˜์Šค์— ์†ํ•˜๋Š” ์›์†Œ๊ฐ€ 0~1๊ฐœ์ธ ์›์†Œ๋ฅผ Safe, 2~3๊ฐœ๋Š” Borderline, 4~5๊ฐœ๋Š” Rare๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.

1-2. IR (Imbalance Ratio)

$$IR = \frac{\text{# of Majority class}}{\text{# of Minority class}}$$
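A quick sketch of computing the IR from binary labels; the class counts are borrowed from the breast-data run shown in section 7:

```python
def imbalance_ratio(y):
    """IR = (# majority) / (# minority) for 0/1 labels."""
    n1 = sum(y)            # minority class, Y = 1
    n0 = len(y) - n1       # majority class, Y = 0
    return n0 / n1

# Class counts from the breast data used in section 7: 148 vs 46.
y = [0] * 148 + [1] * 46
print(round(imbalance_ratio(y), 2))  # -> 3.22
```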

1-3. Marginally imbalanced

Definition: $p(x) \ll \frac{1}{2} \text{ for all } x \in X \text{ where } p(x) = P(Y=1|X=x)$
The key point is the "for all $x$": when the probability of belonging to the minority class is extremely small for every data point, the data are called marginally imbalanced.

1-4. Conditionally imbalanced

Definition: $\text{there exists a set } A \subset X \text{ with nonzero probability, } P(X \in A) >0, \text{ such that } P(Y=1|X \in A) \approx 1 \text{ and } p(x) \ll \frac{1}{2} \text{ for } x \notin A$
When the probability of belonging to the minority class is close to 1 on some region of the data but close to 0 almost everywhere else, the data are called conditionally imbalanced. Personally, I think conditional imbalance reflects the characteristics of real-world data better than the marginal imbalance of 1-3.

1-5. Notation

Below is Table 1 of the paper.

Notation

2. Existing Approaches to Class Imbalance

2-1. Data Level Methods

๋ฐ์ดํ„ฐ ์ž์ฒด๋ฅผ ๊ฑด๋“œ๋ ค์„œ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐฉ์‹์„ Data Level Method๋ผ๊ณ  ์นญํ•œ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ ์ด์•ผ๊ธฐํ•˜๋Š” ๋Œ€ํ‘œ์ ์ธ ์˜ˆ์‹œ๋กœ๋Š” Balanced Random Forest(BRF)๊ฐ€ ์žˆ๋‹ค. ์ด๋Š” ๋‹ค์ˆ˜ ํด๋ž˜์Šค์— ์†ํ•˜๋Š” ๊ฒƒ๋“ค์„ ์ ๊ฒŒ ๋ฝ‘๋Š”(undersampling) ๋ฐฉ์‹์ด๋‹ค. ์ด์™ธ์— SMOTE์™€ ๊ฐ™์€ oversampling ๊ธฐ๋ฒ•๋“ค๋„ ์žˆ๊ณ , undersampling๊ณผ oversampling์ด ๊ฒฐํ•ฉ๋œ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ฐฉ์‹๋„ ์žˆ๋‹ค. ์•„๋ž˜๋Š” ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ ์ถ”๊ฐ€์ ์œผ๋กœ ์–ธ๊ธ‰๋œ ๋ฐฉ๋ฒ•๋ก ๋“ค์ด๋‹ค.

  • One-sided Sampling: Tomek Links
  • Neighborhood Balanced Bagging
  • SMOTEBoost, RUSBoost, EUSBoost: combine boosting with sampling data at each boosting iteration

2-2. Algorithmic Level Methods

์œ„์ฒ˜๋Ÿผ ๋ฐ์ดํ„ฐ์˜ ๊ท ํ˜•์„ ์ง์ ‘์ ์œผ๋กœ ์กฐ์ ˆํ•˜๋Š” ๋ฐฉ์‹์ด ์žˆ๋Š”๊ฐ€ํ•˜๋ฉด, ์•Œ๊ณ ๋ฆฌ์ฆ˜์ ์œผ๋กœ ๋ถ„๋ฅ˜ ์„ฑ๋Šฅ์„ ๋†’์ด๊ณ ์ž ํ•˜๋Š” ๋…ธ๋ ฅ๋“ค๋„ ์žˆ์—ˆ๋‹ค. ์•„๋ž˜๋Š” ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•๋ก ๋“ค ์˜ˆ์‹œ์ด๋‹ค.

  • SHRINK
  • Hellinger Distance Decision Trees (HDDT)
  • Near-Bayesian Support Vector Machines (NBSVM)
  • Class Switching according to Nearest Enemy Distance

2-3. Bayes Decision Rule

$$\delta_B(x) = I\big( p(x) \geq 1/2 \big)$$
Note that here $p(x) = P(Y=1 | X=x)$. This rule becomes problematic as the IR grows. When $p(x)$ is close to 0 the classifier predicts the majority class, and under imbalance $p(x)$ is close to 0 for most inputs, so nearly everything is assigned to the majority class. The Bayes error then comes out close to 0, as below, so the rule can be mistaken for a near-perfect classifier.
$$r(\delta_B) = E[\min\{p(X), 1-p(X)\}] = E[p(X)] \approx 0$$
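A small numeric illustration (my own toy setup) of this failure mode: under marginal imbalance the Bayes rule attains a tiny error rate while never detecting a single minority case.

```python
import random
random.seed(1)

# Toy marginally imbalanced setup (assumed): p(x) = 0.02 for every x.
p = lambda x: 0.02
bayes = lambda x: 1 if p(x) >= 0.5 else 0   # delta_B(x) = I(p(x) >= 1/2)

xs = [random.random() for _ in range(10_000)]
ys = [1 if random.random() < p(x) else 0 for x in xs]
preds = [bayes(x) for x in xs]

error = sum(pred != y for pred, y in zip(preds, ys)) / len(ys)
n_minority = sum(ys)
tpr = sum(1 for pred, y in zip(preds, ys) if y == 1 and pred == 1) / n_minority

print(f"error rate ~ {error:.3f}")  # near 0: looks like a great classifier
print(f"TPR = {tpr}")               # 0.0: no minority case is ever detected
```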

2-4. Balanced Random Forests (BRF)

Random forests in which the majority class is undersampled so that each bootstrap sample is balanced between the two classes.

2-5. Algorithm Procedure of Random Forest Classification

algorithm_RF

3. Q*-Classifier

3-1. Quantile classifier

$$\delta_q(x) = I\big( p(x) \geq q \big), \ 0<q<1$$
Understanding the quantile classifier is the key to understanding the q*-classifier, the central contribution of this paper.
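A minimal sketch of the quantile classifier as a thresholding rule; the estimated $\hat{p}$ function and the numbers here are hypothetical:

```python
def quantile_classifier(p_hat, q):
    """delta_q(x) = I(p_hat(x) >= q). q = 1/2 recovers the Bayes rule;
    q = pi (the minority prevalence) gives the q*-classifier."""
    return lambda x: 1 if p_hat(x) >= q else 0

p_hat = lambda x: 0.10 if x > 0 else 0.01   # hypothetical estimated p(x)
pi = 0.05                                    # assumed minority prevalence

bayes = quantile_classifier(p_hat, 0.5)
qstar = quantile_classifier(p_hat, pi)
print(bayes(1.0), qstar(1.0))  # 0 1 -- only the q*-rule flags x = 1.0
```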

ํ•ด๋‹น ๋ฐฉ๋ฒ•๋ก ์€ ํฌ๊ฒŒ ๋‘ ๊ฐ€์ง€ ์žฅ์ ์ด ์žˆ๋‹ค. ์ฒซ๋ฒˆ์งธ๋Š” TPR๊ณผ TNR์„ ์ตœ๋Œ€ํ™”ํ•œ๋‹ค๋Š” ์ ์ด๋‹ค. ๋‘๋ฒˆ์งธ๋Š” cost-weighted Bayes classifier๊ณผ ๊ฐ™์ด ์ž‘๋™ํ•จ์œผ๋กœ์จ weighted risk๋ฅผ ์ตœ์†Œํ™”ํ•ด์ค€๋‹ค.

$$r(\hat{\delta}, \ell_0, \ell_1) = E\Big[\ell_{0}1_{(\hat{\delta}(X)=1, Y=0)} + \ell_{1}1_{(\hat{\delta}(X)=0, Y=1)}\Big]$$

Here $\ell_0$ and $\ell_1$ are the costs of misclassifying a majority and a minority example, respectively; both are positive.

cost-weighted risk์˜ ๊ด€์ ์—์„œ ๋ณด๋ฉด, ์ตœ์ ์˜ classifier๋Š” cost-weighted Bayes rule์„ ํ™œ์šฉํ•˜๋Š” ๊ฒƒ์ธ๋ฐ, ์ด๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋‹ค.
$$\delta_{WB}(x) = 1_{\big(p(x) \geq \frac{\ell_0}{\ell_0 + \ell_1}\big)}$$

์ด๊ฒƒ์ด ์ตœ์ ์ธ ์ด์œ ๋Š” ๋ชจ๋“  ๋ถ„๋ฅ˜๊ธฐ์— ๋Œ€ํ•ด์„œ $r(\delta_{WB}, \ell_0, \ell_1) \leq r(\hat{\delta}, \ell_0, \ell_1)$๋ฅผ ๋งŒ์กฑํ•˜๋ฉฐ, ๊ทธ ๋ฆฌ์Šคํฌ๊ฐ€ ์•„๋ž˜๋ฅผ ๋งŒ์กฑํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.
$$r(\delta_{WB}, \ell_0, \ell_1) = E\Big[min\Big(\ell_1p(X), \ell_0(1-p(X))\Big)\Big]$$

์œ„์— ๋Œ€ํ•œ ์ฆ๋ช…์€ ๋…ผ๋ฌธ Appendix1์— ์ •๋ฆฌ๋˜์–ด์žˆ์œผ๋ฉฐ, ์ถ”ํ›„ ์ถ”๊ฐ€ ์„œ์ˆ ํ•˜๋„๋ก ํ•˜๊ฒ ๋‹ค.

3-2. TNR+TPR optimal

A classifier that maximizes the sum of the TNR (True Negative Rate) and the TPR (True Positive Rate) is called TNR+TPR optimal.
$$TPR = \frac{TP}{TP+FN}, \ TNR = \frac{TN}{TN+FP}$$
Note that the classifier based on the plain Bayes rule has the limitation that its TNR is close to 1 but its TPR is close to 0 under imbalance.
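For concreteness, TPR and TNR can be read off a confusion matrix; the counts below are taken from the RFQ run on the breast data shown later in section 7:

```python
def tpr_tnr(tp, fn, tn, fp):
    """TPR = TP/(TP+FN), TNR = TN/(TN+FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Counts from the breast-data RFQ confusion matrix in section 7:
# observed R (minority): 27 predicted R, 19 predicted N
# observed N (majority): 73 predicted N, 75 predicted R
tpr, tnr = tpr_tnr(tp=27, fn=19, tn=73, fp=75)
print(f"TPR = {tpr:.3f}, TNR = {tnr:.3f}")  # matches class errors 0.4130, 0.5068
```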

3-3. q*-classifier

$$\delta_D(x) = 1_{\big(\Delta_D(x) \geq 1\big)} \text{, where } \Delta_D(x) = \frac{f_{X|Y}(x|1)}{f_{X|Y}(x|0)} = \frac{p(x)(1-\pi)}{(1-p(x))\pi} \qquad (4)$$

Here $\delta_{q*}(x) = I\big(p(x) \geq \pi \big) = \delta_D(x)$ is called the q*-classifier. It belongs to the algorithm-level approaches to the imbalance problem and can be viewed as a density-based approach, since it classifies using the class-conditional densities of the data.

cf. Density-based approach
$$\delta_D(x) = 1_{\big(f_{X|Y}(x|1) \geq f_{X|Y}(x|0)\big)}$$
The point to note here is that it uses the conditional density of the features ($f_{X|Y}$) rather than the conditional probability of the response ($p(x)$), which removes the effect of the minority class's prevalence. (My take: in Bayesian terms, its significance is that classification is driven by the likelihood rather than by a uniform prior as in earlier work.)

The q*-classifier is TNR+TPR optimal (details below). Moreover, it is the cost-weighted Bayes rule with $\ell_0 = \pi$ and $\ell_1 = 1-\pi$; with this choice the weighted risk comes out close to 0 under both marginal and conditional imbalance, as expressed below. Note that the bound $\pi$ on the right-hand side is close to 0 in both the marginally and the conditionally imbalanced settings.

\begin{align} r(\delta_{q*}, \pi, 1-\pi) &= E[\min\{(1-\pi)p(X), \pi(1-p(X))\}] \\ &\leq E[\pi(1-p(X))] \\ &\leq \pi \end{align}

Theorem2 ์ค‘์—์„œ TNR+TPR optimal์— ๋Œ€ํ•œ ์ฆ๋ช…์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค. ์ฐธ๊ณ ๋กœ, FPR = 1-TNR, FNR = 1-TPR์ด๋ฏ€๋กœ, TNR์™€ TPR์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๊ฒƒ์€ FPR๊ณผ FNR์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.

\begin{align} FPR(\hat{\delta}) &+ FNR(\hat{\delta})\\ &= P\{\hat{\delta}(X)=1|Y=0\} + P\{\hat{\delta}(X)=0|Y=1\} \\ &= \frac{P\{\hat{\delta}(X)=1, Y=0\}}{P(Y=0)} + \frac{P\{\hat{\delta}(X)=0, Y=1\}}{P(Y=1)} \\ &= E\Big[\frac{1\{\hat{\delta}(X)=1, Y=0\}}{\ell_1} + \frac{1\{\hat{\delta}(X)=0, Y=1\}}{\ell_0}\Big] \end{align}

์œ„์— $\ell_0\ell_1$์„ ๊ณฑํ•ด์ฃผ๋ฉด ์•„๋ž˜์˜ ์‹์„ ์ตœ์†Œํ™”ํ•ด์ฃผ๋Š” ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.
$$E\Big[\ell_{0}1_{(\hat{\delta}(X)=1, Y=0)} + \ell_{1}1_{(\hat{\delta}(X)=0, Y=1)}\Big]$$
๊ทธ๋ฆฌ๊ณ  ์ด๋Š” 3-1.์—์„œ ๋ณด์•˜๋“ฏ์ด weighted risk์™€ ์™„์ „ํ•˜๊ฒŒ ๊ฐ™์€ ํ˜•ํƒœ์ด๋‹ค. ์ฆ‰, weighted risk๋ฅผ ์ตœ์†Œํ™”ํ•œ๋‹ค๋ฉด TNR+TPR optimal ์กฐ๊ฑด๋„ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋งŒ์กฑ์ด ๋  ๊ฒƒ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

3-4. Response-based sampling: Balancing the data

cf) Response-based sampling: data values are selected with probabilities that depend only on the value of Y and not on X.

Under balanced subsampling, the subsampled Bayes rule $\delta^{S}_{B}$ is TNR+TPR optimal.

$$\begin{equation} P(S=1 |Y) = \begin{cases} \pi_S(1), &\mbox{if } Y=1 \\ \pi_S(0), &\mbox{otherwise} \end{cases} \end{equation} \quad (5)\\ \pi^S := P(Y=1|S=1) = \frac{P(S=1|Y=1)P(Y=1)}{P(S=1)} = \frac{\pi_S(1)\pi}{P(S=1)} \quad (6.1)$$

$$1-\pi^S = P(Y=0|S=1) = \frac{\pi_S(0)(1-\pi)}{P(S=1)} \quad (6.2) $$

balanced subsample๋œ ๊ฒƒ๋“ค์ด๋ฏ€๋กœ \(\pi^S=1/2\)์ด๊ณ , ์ด๋Š” ๊ณง \(\pi^S = 1-\pi^S\)์ด๋ฏ€๋กœ ์•„๋ž˜ (7)์ด ์„ฑ๋ฆฝํ•œ๋‹ค.

$$\therefore \frac{\pi_S(1)}{\pi_S(0)} = \frac{1-\pi}{\pi} \quad (7)$$

subsampled๋œ ๋ฐ์ดํ„ฐ๋“ค๋กœ ๋ถ„๋ฅ˜๊ธฐ๋ฅผ ํ•™์Šต์‹œํ‚จ ๊ฒƒ์„ $\delta_{B}^{S}$๋ผ๊ณ  ํ•˜์ž. ์ด๋ฅผ ์ดํ›„์—๋Š” subsampled Bayes rule์ด๋ผ๊ณ  ๋ถ€๋ฅด๊ฒ ๋‹ค. (์ค‘๊ฐ„ ์ •๋ฆฌ๋ฅผ ํ•˜์ž๋ฉด, (5)๋Š” response-based sampling์ด๊ณ , (7)๋Š” ๊ทธ์ค‘์—์„œ๋„ balanced sampling์ด๋‹ค.)

$$\delta_{B}^{S}(x) = 1 \mbox{, if } \frac{p^S(x)}{1-p^S(x)}\geq 1 \\ \mbox{where } p^S(x) = \frac{f^S_{X,Y}(x,1)}{f^S_X(x)}, \ 1-p^S(x) = \frac{f^S_{X,Y}(x,0)}{f^S_X(x)} \\ \therefore \delta_{B}^{S}(x) = 1 \mbox{, if } \frac{f^S_{X,Y}(x,1)}{f^S_{X,Y}(x,0)} \geq 1$$

$$\begin{align} \mbox{where } f^S_{X,Y}(x,1) &= P(X=x, Y=1 |S=1) \\ &= \frac{P(X=x, Y=1, S=1)}{P(S=1)} \\ &= \frac{P(S=1|X=x, Y=1)P(X=x,Y=1)}{P(S=1)} \\ &= \frac{P(S=1|Y=1)f_{X,Y}(x,1)}{P(S=1)} \\ &= \frac{\pi_S(1)p(x)f_X(x)}{P(S=1)} \end{align}\\$$

$$\therefore \frac{p^S(x)}{1-p^S(x)} = \frac{p(x)\pi_s(1)}{(1-p(x))\pi_S(0)} \qquad (8)\\ \therefore \delta_{B}^{S}(x) = 1 \mbox{, if } \frac{p(x)}{1-p(x)} \geq \frac{\pi_S(0)}{\pi_S(1)} = \frac{\pi}{1-\pi} \quad \mbox{by (7)} \\ \therefore \delta_B^S(x) = \delta_D(x)$$

3-5. q*-classifier is invariant to response-based sampling

The \(q^*\)-classifier retains TNR+TPR optimality regardless of the target balance ratio. The proof is as follows.

$$\text{By definition,} \quad \delta^S_{q^*}(x) = \textbf{1}_{\{p^S(x)\ge\pi^S\}} \\ \text{Equivalently,} \quad \delta^{S}_{q^*}(x)=1 \quad \text{if} \quad \frac{p^S(x)(1-\pi^S)}{(1-p^S(x))\pi^S} \ge 1 \\ \text{By (8),} \quad \delta^{S}_{q^*}(x)=1 \quad \text{if} \quad \frac{p(x)\pi_S(1)(1-\pi^S)}{\big(1-p(x)\big)\pi_S(0)\pi^S} = \frac{p(x)/\pi}{\big(1-p(x)\big)/(1-\pi)} \ge 1 \qquad (9)$$

(9) follows from (6) and (8). Note that (9) is identical to (4), so \(\delta^S_{q^*} = \delta_{q^*}\). Hence, as stated above, the \(q^*\)-classifier retains TNR+TPR optimality regardless of the target balance ratio.

Since $\delta^{S}_{q*} = \delta_{q*}$ under the response-based sampling form (5), $\delta^{S}_{q*}$ is TNR+TPR optimal.
Furthermore, under balanced sampling (7), $\delta^S_B = \delta^{S}_{q*} = \delta_{q*}$, so all three rules are TNR+TPR optimal.

4. Application to Random Forests

  1. RFQ์˜ q*-classifier์—์„œ $q* = \pi$๋กœ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, empirical relative frequency๋กœ์จ $\hat{\pi} = \frac{N_1}{N_0 + N_1}$์„ ์‚ฌ์šฉํ•œ๋‹ค.

5. Comparison to BRF

5-1. Algorithmic Differences

  1. BRF differs from RFQ in that its bootstrap step uses a sample size of $2N_1$ rather than $N$, with the sampling probabilities set to $\pi_S(1) = \frac{N_0}{N_1}\pi_S(0)$. (Standard RF, like RFQ, uses sample size $N$.)

  2. Compared with standard RF, the RFQ proposed in this paper uses \(\delta_{RFQ}(x) = \textbf{1}_{\{\hat{p}_{RF}(x) \geq \hat{\pi}\}}\) in place of \(\delta_{RF}(x) = \textbf{1}_{\{\hat{p}_{RF}(x) \geq \frac{1}{2}\}}\).
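The difference is thus a one-line change at prediction time. A sketch with hypothetical forest probabilities, using the empirical $\hat{\pi}$ from the breast-data class counts in section 7:

```python
# The only change from RF to RFQ is the threshold applied to the forest's
# estimated probability: 1/2 becomes pi_hat. The probabilities below are
# hypothetical; pi_hat uses the breast-data class counts from section 7.
p_hat = [0.05, 0.18, 0.31, 0.49, 0.72]
pi_hat = 46 / (148 + 46)                  # ~0.237

delta_rf = [int(p >= 0.5) for p in p_hat]
delta_rfq = [int(p >= pi_hat) for p in p_hat]
print(delta_rf)   # [0, 0, 0, 0, 1]
print(delta_rfq)  # [0, 0, 1, 1, 1] -- two more points flagged as minority
```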

5-2. Why RFQ is better

์šฐ์„  ๊ธฐ๋ณธ์ ์œผ๋กœ BRF์™€ RFQ ๋ชจ๋‘ TNR+TPR property๋ฅผ ๊ฐ–๊ณ  ์žˆ๊ธฐ๋Š” ํ•˜๋‹ค. BRF์˜ ๊ฒฝ์šฐ๋Š” Theorem 3์—์„œ balancing condition (7)์— ์˜ํ•ด, RFQ์˜ ๊ฒฝ์šฐ๋Š” Theorem 2์—์„œ q*-classification์„ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ์ ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ ์‹ค์ œ ํ™•๋ฅ  ํ•จ์ˆ˜์ธ $p(x)$๊ฐ€ ์˜ˆ์ธก์‹œ ํ™œ์šฉ์ด ๋˜๋Š”๋ฐ, ์‹ค์ „์—์„œ๋Š” ์ด๋ฅผ estimateํ•˜์—ฌ ํ™œ์šฉํ•˜์—ฌ์•ผ ํ•œ๋‹ค๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. BRF์— ๋น„ํ•ด์„œ RFQ๊ฐ€ ํ›จ์”ฌ ๋งŽ์€ ์ˆซ์ž์˜ ์ƒ˜ํ”Œ์„ ํ™œ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ์ผ๋ฐ˜์ ์œผ๋กœ BRF์— ๋น„ํ•ด RFQ๊ฐ€ $p(x)$์„ estimateํ•˜๋Š” ๋ฐ์— ์œ ๋ฆฌํ•˜๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค. ํŠนํžˆ IR์ด ์ปค์ง€๋ฉด ์ปค์งˆ์ˆ˜๋ก $2N_1$์€ $N$์— ๋น„ํ•ด์„œ ํ›จ์”ฌ ์ž‘์•„์ง€๊ธฐ ๋•Œ๋ฌธ์—, IR์ด ์ปค์ง€๋ฉด ์ปค์งˆ์ˆ˜๋ก BRF๋ณด๋‹ค RFQ๊ฐ€ ๋”์šฑ ์œ ๋ฆฌํ•˜๋‹ค. ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ฐจ์›์ด ์ปค์งˆ์ˆ˜๋ก estimation์ด ์–ด๋ ค์›Œ์ง€๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋Ÿฌํ•œ ์ƒํ™ฉ์—์„œ๋„ RFQ๊ฐ€ ์œ ๋ฆฌํ•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

6. Performance

6-1. G-mean

$$\mbox{G-mean} = (TNR \times TPR)^{1/2}$$
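Computed on the confusion matrix of the breast-data RFQ run in section 7, this reproduces the reported G-mean of 0.54:

```python
from math import sqrt

def g_mean(tpr, tnr):
    """Geometric mean of TPR and TNR."""
    return sqrt(tpr * tnr)

# TPR and TNR from the breast-data RFQ confusion matrix in section 7:
gm = g_mean(tpr=27 / 46, tnr=73 / 148)
print(round(gm, 2))  # -> 0.54, matching the printed G-mean
```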

q๊ฐ€ ๊ทผ์‚ฌ์ ์œผ๋กœ $\hat{\pi}$์— ๊ฐ€๊นŒ์›Œ์กŒ์„ ๋•Œ, RFQ์— ์˜ํ•œ G-mean์ด ์ตœ๋Œ€์น˜์— ๊ฐ€๊น๋‹ค๋Š” ๊ฒƒ์„ 143๊ฐœ์˜ ๋ฒค์น˜๋งˆํฌ ๋ฐ์ดํ„ฐ์…‹์„ ํ†ตํ•ด์„œ ํ™•์ธํ–ˆ๋‹ค.(10-fold CV๋ฅผ 250๋ฒˆ์”ฉ ์‹œํ–‰ํ•˜์˜€๋‹ค.) ์ด๋Š” ๋ถ„๋ฅ˜๊ธฐ์— ์žˆ์–ด์„œ TNR+TPR optimality๊ฐ€ ์ค‘์š”ํ•œ ํŠน์ง•์ด๋ผ๋Š” ๊ฒƒ์„ ์‹œ์‚ฌํ•œ๋‹ค.

Figure2

(The Hellinger distance was also tried in place of the Gini index as the splitting criterion, but the difference was not meaningful.)

6-2. ex1) Simulated data

epoch: 250, trees: 5000, nodesize=1, mtry=d/3

Table2


Table3

์œ„ Table์€ complex imbalanced data in high dimensional settings์—์„œ RFQ๊ฐ€ ํšจ๊ณผ์ ์ž„์„ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋‹ค.

6-3. ex2) Cognitive impairment data

Alzheimer's Disease CSF data from AppliedPredictiveModeling (N=333, d=130, with $N_0=242$, $N_1=91$ and IR=2.66)

epoch: 250, trees: 5000, nodesize=1, mtry=d/3

Table4

BRF์˜ ๊ฒฝ์šฐ์—๋Š” high dimensional์ด ๋ ์ˆ˜๋ก ์„ฑ๋Šฅ์ด ๋‚ฎ์•„์ง์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

6-4. ex3) Customer churn data

N=3333 with $N_1=483$ and IR=5.90
epoch: 250, trees: 5000, nodesize=1, mtry=d/3

Table5

6-3์™€ ๊ฐ™์ด, BRF๋Š” high dimension์ผ ๋•Œ ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์•„์ง์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

6-5. Multiclass Imbalanced Data

Binary๊ฐ€ ์•„๋‹ˆ๋ผ Multiclass์˜ ๊ฒฝ์šฐ์—๋„ RFQ๊ฐ€ ์ž˜ ์ž‘๋™ํ•˜๋Š”์ง€ ํ™•์ธํ•ด๋ณด์•˜๋‹ค.

6-5-1. ex1) Waveform simulations

$$\mbox{weighted G-mean} = \Big(TPR_1^{\beta_1} \times TPR_2^{\beta_2} \times TPR_3^{\beta_3}\Big)^{1/(\beta_1+\beta_2+\beta_3)}$$

2๊ฐœ ์•„๋‹ˆ๋ผ, 3๊ฐœ์˜ ํด๋ž˜์Šค๋กœ ๋‚˜๋ˆ„์–ด์ ธ์žˆ๋Š” ๊ฒฝ์šฐ์— G-mean์„ ํ†ตํ•ด ์„ธ ๋ชจ๋ธ์„ ๋ถ„๋ฅ˜ํ•˜์˜€๋‹ค. \(\binom{3}{2} = 3\)์ด๋ฏ€๋กœ, ์ด ์„ธ ๊ฒฝ์šฐ์˜ ์ˆ˜์— ์žˆ์–ด์„œ TPR๊ณผ TNR์„ ๊ณ„์‚ฐํ•œ ํ›„ weighted G-mean์„ ๊ณ„์‚ฐํ•˜์˜€๋‹ค. ์•„๋ž˜์˜ ๋‘ ํ…Œ์ด๋ธ”์˜ ์ฐจ์ด๋Š” ๊ฐ ๊ทธ๋ฃน๋ณ„ TPR์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์–ด๋–ป๊ฒŒ ๋‘๊ณ  G-mean์„ ๊ณ„์‚ฐํ–ˆ๋Š”์ง€์— ๋”ฐ๋ผ ๋‹ค๋ฅด๋‹ค. ์ฐธ๊ณ ๋กœ unweighted G-mean์€ multiclass ์ƒํ™ฉ์—์„œ ์ ์ ˆํ•˜์ง€๋Š” ์•Š๋‹ค. ํŠนํžˆ ์‹ฌ๊ฐํ•œ ๋ถˆ๊ท ํ˜•์ด ์กด์žฌํ•  ๊ฒฝ์šฐ ๋”์šฑ ๊ทธ๋Ÿฌํ•˜๋‹ค. ์•„๋ž˜์˜ ๊ฒฝ์šฐ์—๋Š” \(\beta_1, \beta_2, \beta_3\)๋ฅผ ๊ฐ๊ฐ ๋Ÿฌํ”„ํ•˜๊ฒŒ 1/2, 1, 1๋กœ ๋„ฃ์—ˆ์ง€๋งŒ, ์ด๋Š” ์ €์ž๊ฐ€ ์˜๋„ํ•˜๋Š” ๋ฐ”๋ฅผ ๋‹ด๊ธฐ์—๋Š” ์ถฉ๋ถ„ํ•œ ์ฐจ์ด๋ฅผ ๋ณด์ด๊ธด ํ–ˆ๋‹ค. ์•„๋ž˜์˜ ํ‘œ๋ฅผ ํ†ตํ•ด์„œ ๊ตฌ์ฒด์ ์ธ ์ˆ˜์น˜๋ฅผ ํ™•์ธํ•ด๋ณด๋„๋ก ํ•˜์ž.

TableB1


TableB2

6-5-2. ex2) Cassini simulations

TableB3


TableB4

์œ„์˜ ์˜ˆ์‹œ์™€ ์‹œ์‚ฌํ•˜๋Š” ๋ฐ”๋Š” ๋™์ผํ•˜๋‹ค.

7. Variable Importance

  • Breiman-Cutler importance (tree-based): not suitable
    Since most nodes will predict the majority class, this criterion is not appropriate for measuring VIMP on imbalanced data.

  • G-mean with Ishwaran-Kogalur importance (ensemble-based): suitable
    Computed via the prediction error of the blocked ensemble.

library(randomForestSRC)
## Warning: package 'randomForestSRC' was built under R version 3.6.3
## 
##  randomForestSRC 2.11.0 
##  
##  Type rfsrc.news() to see new features, changes, and bug fixes. 
## 
data(breast)
breast <- na.omit(breast)
o.rfq <- imbalanced(status ~ ., breast, importance = TRUE)
print(o.rfq)
##                          Sample size: 194
##            Frequency of class labels: 148, 46
##                      Number of trees: 3000
##            Forest terminal node size: 1
##        Average no. of terminal nodes: 27.20167
## No. of variables tried at each split: 6
##               Total no. of variables: 32
##        Resampling used to grow trees: swor
##     Resample size used to grow trees: 123
##                             Analysis: RFQ
##                               Family: class
##                       Splitting rule: gini *random*
##        Number of random split points: 10
##               Normalized brier score: 73.24 
##                                  AUC: 55.29 
##                               G-mean: 0.54
##                     Imbalanced ratio: 3.22
##                           Error rate: 0.46
## 
## Confusion matrix:
## 
##           predicted
##   observed  N  R class.error
##          N 73 75      0.5068
##          R 19 27      0.4130
## 
## 	Overall error rate: 46.19%
plot(o.rfq, plots.one.page = FALSE)

## 
##                       Importance   Relative Imp
## SE_perimeter              0.0384         1.0000
## worst_fractaldim          0.0343         0.8932
## mean_perimeter            0.0322         0.8401
## mean_symmetry             0.0311         0.8098
## SE_texture                0.0239         0.6223
## pnodes                    0.0210         0.5480
## mean_texture              0.0203         0.5295
## worst_area                0.0188         0.4889
## worst_radius              0.0188         0.4889
## worst_concavity           0.0173         0.4521
## worst_perimeter           0.0149         0.3897
## SE_area                   0.0149         0.3897
## mean_compactness          0.0149         0.3897
## mean_area                 0.0149         0.3897
## mean_radius               0.0149         0.3897
## worst_concavepoints       0.0137         0.3568
## worst_compactness         0.0137         0.3568
## mean_fractaldim           0.0137         0.3568
## mean_concavepoints        0.0137         0.3568
## tsize                     0.0133         0.3459
## SE_smoothness             0.0101         0.2622
## worst_symmetry            0.0074         0.1935
## SE_compactness            0.0074         0.1935
## SE_radius                 0.0074         0.1935
## worst_texture             0.0065         0.1683
## SE_concavity              0.0065         0.1683
## mean_smoothness           0.0065         0.1683
## SE_concavepoints          0.0037         0.0964
## worst_smoothness         -0.0037        -0.0958
## SE_fractaldim            -0.0037        -0.0958
## mean_concavity           -0.0073        -0.1909
## SE_symmetry              -0.0181        -0.4724
# get.imbalanced.performance(o.rfq) # seems to have been removed at some point

8. Comparison to Boosting

Figure

Figure 6 and 7 are the cases of low or high dimensional task, respectively.

  • Spline Boost: boosted parametric splines using binomial loss
  • Tree Boost: boosted trees using binomial loss (nonparametric boosting)
  • Tree HBoost: boosted trees using Huber loss (nonparametric boosting)
  • RFQ: Random Forest with the q*-classifier
  • RFQvsel: RFQ with variable selection

9. Discussion

RFQ was effective under high complexity, high imbalance, and high dimensionality.
BRF is still faster to compute, but the difference is not large. Moreover, by Theorem 4, subsampling can reduce the computational load without giving up TNR+TPR optimality.

10. Further Reference

๋ถˆ๊ท ํ˜•๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ ์•Œ๊ณ  ์‹ถ๋‹ค๋ฉด ์•„๋ž˜์˜ ์„ธ ๋…ผ๋ฌธ์„ ์ถ”๊ฐ€ ์ฐธ๊ณ ํ•ด๋ณด๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค.

  1. Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4), 221-232.
  2. Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220-239.
  3. Das, S., Datta, S., & Chaudhuri, B. B. (2018). Handling data irregularities in classification: Foundations, trends, and future challenges. Pattern Recognition, 81, 674-693.

์ด ๋…ผ๋ฌธ์— ๋Œ€ํ•ด์„œ ์˜์–ด๋กœ ์ •๋ฆฌ๋œ ๊นƒํ—™ ํŽ˜์ด์ง€๊ฐ€ ์žˆ๋‹ค.

Critical Point (MY OWN OPINION)

  1. ์ค‘๊ฐ„์—๋„ ์–ธ๊ธ‰ํ–ˆ์ง€๋งŒ, Bayesian์˜ ์šฉ์–ด๋กœ ํ•ด์„ํ•œ๋‹ค๋ฉด, uniform prior๊ฐ€ ์•„๋‹ˆ๋ผ likelihood๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ถ„๋ฅ˜๋ฅผ ํ–ˆ๋‹ค๋Š” ๋ฐ์— ์˜์˜๊ฐ€ ์žˆ๋Š” ๊ฒƒ ๊ฐ™๋‹ค. ๋‹ฌ๋ฆฌ ๋งํ•˜๋ฉด, ๋‹จ์ˆœํžˆ classifier์˜ threshold๋ฅผ 1/2์ด ์•„๋‹Œ \(\pi\)๋กœ ํ–ˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์ง€๋งŒ, ๊ทธ ์ด๋ฉด์˜ ์ˆ˜ํ•™์  ์˜๋ฏธ๋ฅผ ์ž˜ ์ฆ๋ช…ํ•ด๋‚ธ ๋…ผ๋ฌธ์ด๋ผ๊ณ  ์ƒ๊ฐํ•œ๋‹ค.

  2. G-mean์„ performance metrics๋กœ์จ ํ™œ์šฉํ•  ๋•Œ, ๊ฐ€์ค‘ํ‰๊ท ์„ ์‚ฌ์šฉํ•˜๋ฉด ์กฐ๊ธˆ ๋” ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ์–ป๊ฒŒ ๋˜์ง€ ์•Š์„๊นŒ? ์˜ˆ๋ฅผ ๋“ค์–ด, TPR์— ์กฐ๊ธˆ ๋” ๊ฐ€์ค‘์น˜๋ฅผ ๋‘์–ด์„œ \(\mbox{weighted G-mean} = TNR^{0.2} \times TPR^{0.8}\)์ฒ˜๋Ÿผ?

  3. How could this idea be applied to regression problems?