Bayesian Linear Regression

Chapter 09. Linear Regression

๋ณธ ํฌ์ŠคํŒ…์€ First Course in Bayesian Statistical Methods๋ฅผ ์ฐธ๊ณ ํ•˜์˜€๋‹ค.

1. Linear Regression Model

Linear Regression Model: a particular type of smoothly changing model for $p(y|\boldsymbol{x})$ that specifies that the conditional expectation $E[Y|\boldsymbol{x}]$ has a form that is linear in a set of parameters

\[\int yp(y|\boldsymbol{x})dy = E[Y|\boldsymbol{x}] = \beta_1x_1 + \cdots + \beta_px_p = \boldsymbol{\beta}^T\boldsymbol{x}\]

Normal Linear Regression Model: the linear regression model above, with the following additional assumptions \[\epsilon_1, ..., \epsilon_n \stackrel{i.i.d}\sim N(0,\sigma^2) \\ Y_i = \boldsymbol{\beta}^T\boldsymbol{x}_i + \epsilon_i \\ \boldsymbol{y|X,\beta},\sigma^2 \sim MVN(\boldsymbol{X\beta}, \sigma^2\boldsymbol{I})\]

1-1. Estimating $\beta$ (OLS) \[\begin{align} \hat{\beta}_{ols} &= \arg \min_\beta \sum e_i^2 = \arg \min_\beta(Y-X\beta)^T(Y-X\beta)\\ \frac{d}{d\beta}(Y^TY -\beta^TX^TY-Y^TX\beta+\beta^TX^TX\beta) &= -2Y^TX+2\beta^TX^TX \stackrel{let}= 0 \\ \rightarrow X^TX\beta &= X^TY\\ \rightarrow \hat{\beta} &= (X^TX)^{-1}X^TY \\ \\ E(\hat{\beta}) &= E((X^TX)^{-1}X^TY) \\ &= (X^TX)^{-1}X^TE(Y) \\ &= (X^TX)^{-1}X^TX\beta \\ &= \beta \\ Cov(\hat{\beta}) &= Cov((X^TX)^{-1}X^TY) \\ &= (X^TX)^{-1}X^TCov(Y)\Big((X^TX)^{-1}X^T\Big)^T \\ &= \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1} \\ &= \sigma^2 (X^TX)^{-1} \\ \therefore \hat{\beta} &\sim N(\beta, \sigma^2(X^TX)^{-1}) \end{align}\]

1-2. Estimating $\sigma^2$ (OLS) Here $\sigma^2$ is estimated through the following steps.

\[\hat{\sigma}^2_{ols} = \frac{SSR}{n-p} = \frac{\sum (y_i-X_i\hat{\beta}_{ols})^2}{n-p} \\ \text{proof) }\frac{SSR}{\sigma^2} \sim \chi^2(n-p) \\ E(SSR) = (n-p)\sigma^2 \\ \therefore \hat{\sigma}^2_{ols} = \frac{SSR}{n-p}\]

Combining 1-1) and 1-2): the covariance of $\hat{\beta}_{ols}$ is estimated by plugging $\hat{\sigma}^2_{ols}$ in place of $\sigma^2$ in $\sigma^2 (X^TX)^{-1}$, and the square roots of its diagonal elements give the standard error of each coefficient. These standard errors let us assess the significance of each coefficient.
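As a sanity check on 1-1) and 1-2), here is a minimal numpy sketch; the simulated design, true coefficients, and noise level are made-up illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative values): intercept + 2 covariates
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# 1-1: beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# 1-2: sigma2_hat = SSR / (n - p)
ssr = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = ssr / (n - p)

# Standard errors: square roots of the diagonal of sigma2_hat * (X^T X)^{-1}
se = np.sqrt(np.diag(sigma2_hat * XtX_inv))
print(beta_hat, sigma2_hat, se)
```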

1-3. Review of Conjugacy

1-3-1. Matching Trick \[\begin{align} p(y|\mu,\Sigma) &\propto exp(-\frac{1}{2}(y-\mu)^T\Sigma^{-1}(y-\mu)) \\ &\propto exp(-\frac{1}{2}y^T\Sigma^{-1}y + y^T\Sigma^{-1}\mu) \end{align}\]
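Read in reverse, this is what makes the trick useful: whenever a density is proportional to a quadratic exponential in $y$, the normal parameters can be matched off directly. This is exactly how the posterior parameters are identified in Section 2-1 below.

\[p(y) \propto exp(-\frac{1}{2}y^TAy + y^Tb) \implies y \sim N(A^{-1}b, A^{-1})\]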

1-3-2. Scaled inverse chi-square distribution \[X \sim \chi^{-2}(\nu, \tau^2) = \Gamma^{-1}(\nu/2, \nu\tau^2/2) \\ \rightarrow f(x) = \frac{(\nu\tau^2/2)^{\nu/2}}{\Gamma(\nu/2)}\left(\frac{1}{x}\right)^{\nu/2+1}exp(-\frac{\nu\tau^2}{2x}) \] If we place the prior $\sigma^2 \sim \chi^{-2}(\nu_0, \sigma^2_0)$, then $\nu_0$ plays the role of a prior sample size and $\sigma^2_0$ that of a prior variance.
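Since $\chi^{-2}(\nu_0, \sigma^2_0) = \Gamma^{-1}(\nu_0/2, \nu_0\sigma^2_0/2)$, it can be handled with scipy's invgamma by setting the shape to $\nu_0/2$ and the scale to $\nu_0\sigma^2_0/2$. A quick sketch with illustrative hyperparameter values:

```python
import numpy as np
from scipy import stats

nu0, sigma2_0 = 4.0, 1.0  # illustrative prior sample size and prior variance

# chi^{-2}(nu0, sigma2_0) == InvGamma(shape=nu0/2, scale=nu0*sigma2_0/2)
prior = stats.invgamma(a=nu0 / 2, scale=nu0 * sigma2_0 / 2)

draws = prior.rvs(size=100_000, random_state=0)
# For nu0 > 2 the mean is nu0*sigma2_0/(nu0 - 2); here 4*1/(4-2) = 2
print(draws.mean())
```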

2. Bayesian estimation for a regression model

2-1. Semi-conjugate prior (independent prior)

semi-conjugate: the prior and the full conditional posterior belong to the same family of distributions
- Slopes $\beta$
likelihood: $\boldsymbol{y|X,\beta},\sigma^2 \sim MVN(\boldsymbol{X\beta}, \sigma^2\boldsymbol{I})$
prior: $\beta \sim N(\beta_0, \Sigma_0)$
posterior: $\beta|y,X,\sigma^2 \sim N(\beta_n, \Sigma_n)$
where $\beta_n = \Sigma_n(\Sigma_0^{-1}\beta_0 + X^Ty/\sigma^2), \Sigma_n = (\Sigma_0^{-1} + X^TX/\sigma^2)^{-1}$
(posterior ์œ ๋„ ์ฆ๋ช…) \[\begin{align} p(\beta|y,\sigma^2) &\propto p(y|\beta,\sigma^2)p(\beta) \\ &\propto exp(-\frac{1}{2\sigma^2}(y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta) - \frac{1}{2}(\beta^T\Sigma_0^{-1}\beta - 2\beta^T\Sigma_0^{-1}\beta_0)) \\ &\propto exp\Big(-\frac{1}{2}\beta^T(X^TX/\sigma^2 + \Sigma_0^{-1})\beta + \beta^T(X^Ty/\sigma^2 + \Sigma_0^{-1}\beta_0)\Big) \\ \therefore \Sigma_n^{-1} &= X^TX/\sigma^2 + \Sigma_0^{-1} \\ \beta_n &= \Sigma_n(X^Ty/\sigma^2 + \Sigma_0^{-1}\beta_0) \end{align}\]


- Error Variance $\sigma^2$
likelihood: $\boldsymbol{y|X,\beta},\sigma^2 \sim MVN(\boldsymbol{X\beta}, \sigma^2\boldsymbol{I})$
prior: $\sigma^2 \sim \Gamma^{-1}(\frac{\nu_0}{2}, \frac{\nu_0\sigma^2_0}{2})$
posterior: $\sigma^2|y,X,\beta \sim \Gamma^{-1}(\frac{\nu_n}{2}, \frac{\nu_n\sigma^2_n}{2})$
where $\nu_n = \nu_0+n, \sigma^2_n = \frac{1}{\nu_n}(\nu_0\sigma^2_0 + SSR(\beta))$
(posterior ์œ ๋„ ์ฆ๋ช…) \[\begin{align} p(\sigma^2|y, \beta) &\propto p(\sigma^2)p(y|\beta,\sigma^2)\\ &\propto (\frac{1}{\sigma^2})^{\nu_0/2+1}exp(-\frac{\nu_0\sigma^2_0}{2\sigma^2})(\frac{1}{\sigma^2})^{n/2}exp(-\frac{SSR(\beta)}{2\sigma^2}) \\ &= (\frac{1}{\sigma^2})^{(\nu_0+n)/2+1}exp(-\frac{\nu_0\sigma^2_0+SSR(\beta)}{2\sigma^2}) \\ \therefore \nu_n &= \nu_0+n \\ \sigma^2_n &= (\nu_0\sigma^2_0+SSR(\beta))/\nu_n \end{align}\]

2-2. Full-conjugate prior (dependent prior)

\[\begin{align} p(\beta,\sigma^2|y) &= p(\beta|\sigma^2,y) \ p(\sigma^2|y) \\ &\propto p(\beta|\sigma^2,y)p(\sigma^2)p(y|\sigma^2) \\ \end{align}\]

์œ„์—์„œ ๊ฐ๊ฐ์— ๋Œ€ํ•œ ๋ถ„ํฌ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค. \[\begin{align} \beta|y,\sigma^2 &\sim N(\frac{g}{g+1}\hat{\beta}_{mle},\frac{g}{g+1}Var(\hat{\beta}_{mle})) \\ \sigma^2 &\sim \chi^{-2}(\nu_0, \sigma^2_0) \\ p(y|\sigma^2) &= \int p(y|\beta,\sigma^2)p(\beta|\sigma^2)d\beta \end{align}\]

Examining the integrand more closely: \[\begin{align} p(y|\beta,\sigma^2)p(\beta|\sigma^2) &= \Big(\frac{1}{2\pi\sigma^2}\Big)^{n/2}exp(-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)) \times \Big(\frac{1}{2\pi}\Big)^{p/2}|g\sigma^2(X^TX)^{-1}|^{-\frac{1}{2}}exp(-\frac{1}{2g\sigma^2}\beta^TX^TX\beta) \\ &\because \beta|\sigma^2 \sim N(\beta_0=0, \Sigma_0 = g\sigma^2(X^TX)^{-1}) \\ \end{align}\]

Returning to the marginal likelihood with this in hand: \[\begin{align} p(y|\sigma^2) &= \int p(y|\beta,\sigma^2)p(\beta|\sigma^2)d\beta \\ &\propto \Big(\frac{1}{\sigma^2}\Big)^{n/2}|g\sigma^2(X^TX)^{-1}|^{-\frac{1}{2}}exp(-\frac{1}{2\sigma^2}y^Ty)\int exp(\frac{1}{\sigma^2}\beta^TX^Ty - \frac{1}{2\sigma^2}(1+\frac{1}{g})\beta^TX^TX\beta)d\beta \end{align}\]

์ด์ค‘์—์„œ ๋งจ ๋งˆ์ง€๋ง‰์— ์œ„์น˜ํ•œ $(1+\frac{1}{g})\beta^TX^TX\beta$๋ฅผ ์ž์„ธํžˆ ๋ณด์ž.
์—ฌ๊ธฐ์„œ $m, V$๋Š” ๊ฐ๊ฐ $\beta|y,\sigma^2$(posterior)์˜ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์ด๋ผ๊ณ  ํ•˜์ž. ์ด๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํ•ด๋‹น ๋ถ€๋ถ„์„ ์กฐ์ž‘ํ•ด๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ๋‚˜์˜จ๋‹ค.
\[\begin{align} (1+\frac{1}{g})\beta^TX^TX\beta &= \frac{g+1}{g} \times \Bigg((\beta-m)^TX^TX(\beta-m) + 2m^TX^TX\beta -m^TX^TXm \Bigg) \\ (1) \frac{g+1}{g}(\beta-m)^TX^TX(\beta-m) &= \sigma^2(\beta-m)^TV^{-1}(\beta-m)\\ (2) \frac{g+1}{g}2m^TX^TX\beta &= 2\beta^TX^Ty \\ (3) \frac{g+1}{g}m^TX^TXm &= \sigma^2m^TV^{-1}m \\ \therefore (1+\frac{1}{g})\beta^TX^TX\beta &= \sigma^2(\beta-m)^TV^{-1}(\beta-m) + 2\beta^TX^Ty - \sigma^2m^TV^{-1}m \end{align}\]

\[\begin{align} p(y|\sigma^2) &= \int p(y|\beta,\sigma^2)p(\beta|\sigma^2)d\beta \\ &\propto \Big(\frac{1}{\sigma^2}\Big)^{n/2}|g\sigma^2(X^TX)^{-1}|^{-\frac{1}{2}}exp(-\frac{1}{2\sigma^2}y^Ty + \frac{1}{2}m^TV^{-1}m)\int exp\Big(-\frac{1}{2}(\beta-m)^TV^{-1}(\beta-m)\Big)d\beta \\ &= \Big(\frac{1}{\sigma^2}\Big)^{n/2}|g\sigma^2(X^TX)^{-1}|^{-\frac{1}{2}} exp\big(-\frac{1}{2\sigma^2}(y^Ty - \sigma^2m^TV^{-1}m)\big) \ (2\pi)^{p/2}|\frac{g}{g+1}\sigma^2(X^TX)^{-1}|^{\frac{1}{2}} \\ &\propto \Big(\frac{1}{\sigma^2}\Big)^{n/2}(1+g)^{-p/2}exp\big(-\frac{1}{2\sigma^2}SSR(g)\big) \\ &\text{where } SSR(g) = y^Ty - \sigma^2m^TV^{-1}m = y^T\Big(I - \frac{g}{g+1}X(X^TX)^{-1}X^T\Big)y \end{align}\]

As before, a larger $g$ means a weaker prior. Hence, letting $g \rightarrow \infty$, $SSR(g) \rightarrow SSR(\hat{\beta}_{mle})$, the ordinary residual sum of squares.

\[\begin{align} p(\sigma^2|y) &\propto p(\sigma^2)p(y|\sigma^2) \\ &\propto \Big(\frac{1}{\sigma^2}\Big)^{(\nu_0+n)/2+1}exp(-\frac{\nu_0\sigma^2_0 + SSR(g)}{2\sigma^2}) \\ \therefore \sigma^2|y &\sim \chi^{-2}(\nu_0+n, \frac{\nu_0\sigma^2_0 + SSR(g)}{\nu_0+n}) \end{align}\]

Summarizing the posterior obtained from the full conjugate prior: \[ \sigma^2|y \sim \chi^{-2}(\nu_0+n, \frac{\nu_0\sigma^2_0 + SSR(g)}{\nu_0+n}) \\ \beta|y,\sigma^2 \sim N(\frac{g}{g+1}\hat{\beta}_{mle},\frac{g}{g+1}Var(\hat{\beta}_{mle})) \\ \text{where } Var(\hat{\beta}_{mle}) = \sigma^2(X^TX)^{-1} \] Because $\sigma^2$ can be drawn from its marginal posterior and $\beta$ from its conditional posterior in turn, the joint posterior can be obtained efficiently by direct Monte Carlo approximation rather than Gibbs sampling. This is the key difference from the semi-conjugate prior.
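A minimal numpy sketch of this direct Monte Carlo sampler under the g-prior; the function name and the default draw count are my own choices:

```python
import numpy as np

def mc_gprior(y, X, g, nu0, sigma2_0, n_draws=5000, seed=0):
    """Direct Monte Carlo draws from the joint posterior under the g-prior."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T                            # hat matrix
    ssr_g = y @ (np.eye(n) - (g / (g + 1)) * H) @ y  # SSR(g)
    beta_mle = XtX_inv @ X.T @ y
    shrink = g / (g + 1)
    # sigma^2 | y ~ InvGamma((nu0+n)/2, (nu0*sigma2_0 + SSR(g))/2)
    sigma2s = 1.0 / rng.gamma((nu0 + n) / 2,
                              2.0 / (nu0 * sigma2_0 + ssr_g), size=n_draws)
    # beta | y, sigma^2 ~ N(shrink * beta_mle, shrink * sigma^2 * (X^T X)^{-1})
    betas = np.array([rng.multivariate_normal(shrink * beta_mle,
                                              shrink * s2 * XtX_inv)
                      for s2 in sigma2s])
    return betas, sigma2s
```

Each $(\beta, \sigma^2)$ pair here is an exact draw from the joint posterior, which is why no burn-in or convergence checks are needed, unlike the Gibbs sampler above.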

3. Model Selection
