Utility-Based Regression
์†์ง€์šฐ

Torgo, L., & Ribeiro, R. (2007, September). Utility-based regression. In European conference on principles of data mining and knowledge discovery (pp. 597-604). Springer, Berlin, Heidelberg.

In Short

This paper proposes evaluation metrics that are useful when the goal is to predict rare extreme values of a continuous target variable. It is a form of cost-sensitive learning.

1. Introduction

Regression ์ƒํ™ฉ์—์„œ benefit๊ณผ cost๋ฅผ ์ •์˜ํ•˜๊ณ  ์ด ๋‘˜์˜ ์ฐจ๋ฅผ utility๋ผ๊ณ  ์ •์˜ํ•œ๋‹ค.

2. Problem Formulation

  1. Regression์—์„œ Uniform cost assumption์€ ๋น„ํ˜„์‹ค์ ์ด๋‹ค. ํ˜„์‹ค์—์„œ๋Š” cost-sensitive learning์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋” ๋งŽ๋‹ค.
  2. ๊ทธ๋ฆฌ๊ณ  Classification์—์„œ๋Š” class์— ๋”ฐ๋ผ์„œ cost๋ฅผ ๋‹ค๋ฅด๊ฒŒ ์ ์šฉํ•˜๋Š” ๋ฐ์— ๋ฐ˜ํ•ด, Regression์—์„œ๋Š” ๊ทธ๋Ÿฌํ•˜์ง€ ์•Š๋‹ค. (๋ญ‰์ณ์žˆ๋Š” ๋ฐ์ดํ„ฐ์—์„œ error๊ฐ€ ํฌ๊ฒŒ ๋ฐœ์ƒํ•œ ๊ฒƒ๊ณผ, ๋”ฐ๋กœ ๋–จ์–ด์ ธ์žˆ๋Š” ๋ฐ์ดํ„ฐ์—์„œ error๊ฐ€ ํฌ๊ฒŒ ๋ฐœ์ƒํ•œ ๊ฒƒ์„ ๊ตฌ๋ถ„ํ•ด์•ผ ํ•œ๋‹ค๋Š” ์ด์•ผ๊ธฐ์ด๋‹ค.)
  3. ๊ธฐ์กด์˜ regression์—์„œ non-uniform cost assumption๋ฅผ ์—ฐ๊ตฌํ•œ ๊ฒƒ์€ ๋Œ€์ฒด๋กœ under-prediction๊ณผ over-prediction์„ ๊ตฌ๋ถ„ํ•˜๋Š” ๋ฐ์— ๊ทธ์ณค๋‹ค.
  4. cost๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, benefit์— ๋Œ€ํ•ด์„œ๋„ ๊ณ ๋ คํ•ด์•ผ ํ•œ๋‹ค.
  5. ๊ฒฐ๋ก : Relevance(Importance)๋ฅผ ์ •์˜ํ•˜์—ฌ cost๋ฅผ ์ƒˆ๋กญ๊ฒŒ ์ •์˜ํ•  ํ•„์š”์„ฑ์ด ์žˆ๋‹ค.

3. Utility-Based Regression

3-1. Relevance Functions

$$\phi(Y): \mathscr{R} \rightarrow [0, 1]$$

Utility-based regression is independent of the shape of the \(\phi()\) function. Ideally, the user would define the relevance function, but for most real-world data no such definition is available. Relevance is, in most cases, related to rarity. By default, one can set relevance to be inversely proportional to the pdf of the target variable, but estimating that pdf is difficult.
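The default choice above can be sketched in Python. The histogram-based density estimate and the min-max rescaling to [0, 1] below are my own implementation choices, not the paper's prescription:

```python
import numpy as np

def pdf_based_relevance(y, bins=30):
    """Relevance values in [0, 1], inversely related to an estimated pdf.

    Assumptions (not from the paper): the pdf is estimated with a simple
    histogram, and relevance is rescaled so the most common values get 0
    and the rarest observed values get 1.
    """
    hist, edges = np.histogram(y, bins=bins, density=True)
    # Map each observation to the density of the bin that contains it.
    idx = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, bins - 1)
    density = hist[idx]
    phi = density.max() - density          # rare values -> high relevance
    return phi / phi.max()                 # rescale to exactly [0, 1]

rng = np.random.default_rng(0)
y = rng.normal(size=500)
phi = pdf_based_relevance(y)               # high in the tails, low near the mode
```

A kernel density estimate could replace the histogram for a smoother \(\phi\); the sketch only needs some monotone-decreasing transform of the estimated density.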

3-2. Cost and Benefit Surfaces

3-2-1. Utility of Prediction

$$U = TB - TC \\ \text{where TB and TC are total benefit and total cost, respectively}$$

3-2-2. Cost of Prediction

$$TC = \sum^{n}_{i=1}c(\hat{y}_i,y_i) \\ \text{where } c(\hat{Y},Y) = \Phi(\hat{Y},Y) \times C_{\max} \times L(\hat{Y},Y)$$

The cost of a prediction depends on three components: the relevance of the test case value, the relevance of the predicted value, and the precision of the prediction. Here, the relevance of the test case value is \(\phi(Y)\) and the relevance of the predicted value is \(\phi(\hat{Y})\). Three kinds of mistakes deserve attention:

  1. False Alarm: predicting a relevant value for an irrelevant test case
  2. Opportunity Cost: predicting an irrelevant value for a relevant test case
  3. Confusing Relevant Events (the most serious mistake): predicting a relevant but very different value for a relevant test case

(i) Bivariate Relevance Function \(\Phi(\hat{Y},Y)\)
To account for (1) and (2), the bivariate relevance function is defined as below. The hyperparameter m takes a value between 0 and 1; to weight the Opportunity Cost (2) more heavily than the False Alarm (1), set m above 0.5.

$$\Phi(\hat{Y},Y) = (1-m) \cdot \phi(\hat{Y}) + m \cdot \phi(Y)$$
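A direct transcription of this definition, assuming some univariate relevance function \(\phi\) is given (the step-function `phi` below is only a toy placeholder, not from the paper):

```python
def bivariate_relevance(phi, y_hat, y, m=0.5):
    """Phi(y_hat, y) = (1 - m) * phi(y_hat) + m * phi(y), with m in [0, 1].

    m > 0.5 weights the opportunity cost (missing a relevant test case)
    more heavily than a false alarm.
    """
    return (1 - m) * phi(y_hat) + m * phi(y)

# Toy relevance (assumption): only values above 2 are relevant.
phi = lambda v: 1.0 if v > 2 else 0.0

fa = bivariate_relevance(phi, y_hat=3.0, y=0.0, m=0.8)  # false alarm, ~0.2
oc = bivariate_relevance(phi, y_hat=0.0, y=3.0, m=0.8)  # opportunity cost, ~0.8
```

With m = 0.8, the opportunity cost case yields a higher bivariate relevance than the false alarm, as intended.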

(ii) Maximum Cost \(C_{\max}\)
The maximum cost, assigned only when the relevance of the prediction is maximal (i.e. \(\Phi(\hat{Y},Y)=1\)). Usually, \(C_{\max}\) is provided as a constant by the user.

(iii) Loss Function \(L(\hat{Y},Y)\)
์•„๋ฌด metric์ด๋‚˜ ์จ๋„ ๋˜๊ธด ํ•˜์ง€๋งŒ, ํ•ด์„์„ ์šฉ์ดํ•˜๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด์„œ 0๋ถ€ํ„ฐ 1์‚ฌ์ด ๊ฐ’์„ ๊ถŒ์žฅํ•œ๋‹ค. \(\Phi(\hat{Y},Y) \times C_{\max}\)์€ ๋ฐœ์ƒ๊ฐ€๋Šฅํ•œ ์ตœ์•…์˜ ์ƒํ™ฉ์—์„œ์˜ ์ตœ๋Œ€ ํŽ˜๋„ํ‹ฐ ๊ฐ’์„ ์˜๋ฏธํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. (the maximum penalty we get if \(\hat{Y}\) is the worst possible prediction.) ์ฆ‰, \(L(\hat{Y},Y)=1\)์ด๋ฉด ์ตœ๋Œ€ cost๊ฐ€ ๋˜๋Š” ๊ฒƒ์ด๋‹ค. ๋ฐ˜๋ฉด, \(L(\hat{Y},Y)=0\)์ด๋ฉด, relevance์™€ ์ƒ๊ด€์—†์ด cost๋Š” 0์ด๋‹ค. ์•„๋ž˜๋Š” ๋…ผ๋ฌธ ์ €์ž๊ฐ€ ์ œ์•ˆํ•˜๋Š” Loss function์ด๋‹ค.

$$L(\hat{Y},Y) = \Big| \max_{i \in [\hat{Y},Y]} \phi(i) - \min_{i \in [\hat{Y},Y]} \phi(i)\Big|$$
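Since this loss ranges over every value between \(\hat{Y}\) and \(Y\), a grid approximation is one way to sketch it (the grid resolution and the toy \(\phi\) are my own choices):

```python
import numpy as np

def relevance_loss(phi, y_hat, y, n_grid=200):
    """L(y_hat, y): spread of relevance over the interval between the
    prediction and the true value, approximated on an even grid.
    """
    grid = np.linspace(min(y_hat, y), max(y_hat, y), n_grid)
    vals = np.array([phi(v) for v in grid])
    return float(abs(vals.max() - vals.min()))

# Toy relevance (assumption): grows linearly with |v|, capped at 1.
phi = lambda v: min(abs(v) / 5.0, 1.0)

relevance_loss(phi, 0.0, 5.0)   # worst case: spans relevance 0 to 1
relevance_loss(phi, 4.0, 5.0)   # close prediction in a relevant region
```

For a monotone \(\phi\) the grid is unnecessary (the extrema sit at the endpoints), but the grid also handles non-monotone relevance functions.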

3-2-3. Benefit of Prediction

$$TB = \sum^{n}_{i=1}b(\hat{y}_i,y_i) \\ \text{where } b(\hat{Y},Y) = \phi(Y) \times B_{\max} \times \Big(1-L(\hat{Y},Y)\Big)$$

This definition of benefits associates higher rewards with higher relevance. Think of \(\phi(Y) \times B_{\max}\) as the maximum benefit given the relevance, and \(\Big(1-L(\hat{Y},Y)\Big)\) as the proportion of it that is actually earned.
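Putting the definitions together, total utility over a test set could be sketched as follows. The values of \(C_{\max}\), \(B_{\max}\), \(m\), and the toy \(\phi\) are illustrative assumptions, and \(L\) is evaluated only at the endpoints, which matches the paper's loss when \(\phi\) is monotone between \(\hat{Y}\) and \(Y\):

```python
def total_utility(phi, y_true, y_pred, c_max=10.0, b_max=10.0, m=0.5):
    """U = TB - TC, with per-case
    c(yh, y) = Phi(yh, y) * C_max * L(yh, y)
    b(yh, y) = phi(y)     * B_max * (1 - L(yh, y)).
    """
    tb = tc = 0.0
    for yh, y in zip(y_pred, y_true):
        loss = abs(phi(yh) - phi(y))          # endpoint version of L
        biv = (1 - m) * phi(yh) + m * phi(y)  # bivariate relevance Phi
        tc += biv * c_max * loss
        tb += phi(y) * b_max * (1 - loss)
    return tb - tc

# Toy relevance on [0, 1] (assumption): the value itself, clipped.
phi = lambda v: min(max(v, 0.0), 1.0)

perfect = total_utility(phi, y_true=[0.9, 0.1], y_pred=[0.9, 0.1])
swapped = total_utility(phi, y_true=[0.9, 0.1], y_pred=[0.1, 0.9])
```

A perfect prediction collects the full benefit at zero cost, while swapping the two predictions drives the utility negative.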

(i) Maximum Benefit \(B_{\max}\)
Like \(C_{\max}\), \(B_{\max}\) is a user-defined constant specifying the maximum reward.

4. An Illustrative Application

(Figure: the relevance function derived for the application)

Using the information in a boxplot is a simpler strategy to derive a relevance function for a class of applications where relevance is associated with rarity: the prediction of rare extreme values of a numeric variable.
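One simplified reading of that boxplot strategy is sketched below. The piecewise-linear shape is my own simplification; the paper derives its relevance function from the boxplot's statistics rather than prescribing this exact form:

```python
import numpy as np

def boxplot_relevance(y):
    """Build a univariate relevance function from boxplot statistics.

    Sketch: relevance is 0 at the median and grows linearly to 1 at the
    whisker limits (Q1 - 1.5*IQR, Q3 + 1.5*IQR), staying at 1 beyond
    them, where the outliers (rare extremes) live.
    """
    q1, med, q3 = np.percentile(y, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    def phi(v):
        if v >= med:
            return float(min((v - med) / (hi - med), 1.0))
        return float(min((med - v) / (med - lo), 1.0))

    return phi

phi = boxplot_relevance(np.arange(100.0))
phi(49.5)    # the median: relevance 0
phi(1000.0)  # far outlier: relevance 1
```

The appeal of this strategy is that it needs only order statistics, avoiding the pdf estimation discussed in Section 3-1.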

(Table 1: TB, TC, and U of the compared models)

Table1์„ ๋ณด๋ฉด, SVM์ด TB๋Š” ๋†’์ง€๋งŒ TC๋„ ๋†’์•„์„œ U๋Š” ๋‚ฎ์€ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ง€๋งŒ, ๋ณธ ์—ฐ๊ตฌ์˜ ํ•ต์‹ฌ์€ U๋ฅผ ๋†’์ด๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, rare extreme value๋ฅผ ์ž˜ ์˜ˆ์ธกํ–ˆ๋Š”์ง€์ด๋‹ค.

5. Conclusion

A new evaluation framework for regression tasks with non-uniform costs and benefits of predictions.

Critical Point (MY OWN OPINION)

  1. boxplot์„ ํ™œ์šฉํ•˜๋ฉด, ๋†“์ง€๋Š” ๋ถ€๋ถ„์ด ๋งŽ์„ ๊ฒƒ ๊ฐ™๋‹ค. ํŠนํžˆ, gaussian mixture model์„ ๋”ฐ๋ฅผ ๊ฒฝ์šฐ, ์ค‘์•™๊ฐ’์ด ์ด๋“ค์„ ์ถฉ๋ถ„ํžˆ ๋Œ€๋ณ€ํ•˜์ง€ ๋ชปํ•  ์ˆ˜ ์žˆ๋‹ค. KDE๋ฅผ ํ•˜๋ฉด ๋” ์ข‹์ง€ ์•Š์„๊นŒ?

  2. When this is later used in SMOTE for Regression, what problems could arise from using the relevance function to binarize the task into a 0/1 classification problem?

  3. The hyperparameters include \(C_{\max}\), \(B_{\max}\), and \(m\); in particular, when there is no prior knowledge about \(C_{\max}\) and \(B_{\max}\), the ratio at which these two are set also seems important.
