Chapter 09.
Linear Regression
This post is based on First Course in Bayesian Statistical Methods.
1. Linear Regression Model
Linear Regression Model: a particular type of smoothly changing model for $p(y|\boldsymbol{x})$ that specifies that the conditional expectation $E[Y|\boldsymbol{x}]$ has a form that is linear in a set of parameters
\[\int yp(y|\boldsymbol{x})dy = E[Y|\boldsymbol{x}] = \beta_1x_1 + \cdots + \beta_px_p = \boldsymbol{\beta}^T\boldsymbol{x}\]
Normal Linear Regression Model: the Linear Regression Model with the following additional assumptions on the errors \[\epsilon_1, ..., \epsilon_n \stackrel{i.i.d}\sim N(0,\sigma^2) \\ Y_i = \boldsymbol{\beta}^T\boldsymbol{x}_i + \epsilon_i \\ \boldsymbol{y|X,\beta},\sigma^2 \sim MVN(\boldsymbol{X\beta}, \sigma^2\boldsymbol{I})\]
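As a quick sanity check, this model is easy to simulate. Below is a minimal sketch (assuming NumPy; the dimensions, coefficients, and noise level are made-up illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 100, 3                            # sample size and number of predictors (illustrative)
beta_true = np.array([1.0, -2.0, 0.5])   # hypothetical true coefficients
sigma = 1.5                              # hypothetical error standard deviation

X = rng.normal(size=(n, p))              # design matrix
eps = rng.normal(0.0, sigma, size=n)     # epsilon_i ~ iid N(0, sigma^2)
y = X @ beta_true + eps                  # Y_i = beta^T x_i + epsilon_i
```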
1-1. Estimating $\beta$ (OLS)
\[\begin{align} \hat{\beta}_{ols} &= \arg \min_\beta \sum e_i^2 = \arg \min_\beta(Y-X\beta)^T(Y-X\beta)\\
\frac{d}{d\beta}(Y^TY -\beta^TX^TY-Y^TX\beta+\beta^TX^TX\beta) &= -2Y^TX+2\beta^TX^TX \stackrel{let}= 0 \\
\rightarrow X^TX\beta &= X^TY\\
\rightarrow \hat{\beta} &= (X^TX)^{-1}X^TY \\
\\
E(\hat{\beta}) &= E((X^TX)^{-1}X^TY) \\
&= (X^TX)^{-1}X^TE(Y) \\
&= (X^TX)^{-1}X^TX\beta \\
&= \beta \\
Cov(\hat{\beta}) &= Cov((X^TX)^{-1}X^TY) \\
&= (X^TX)^{-1}X^TCov(Y)\Big((X^TX)^{-1}X^T\Big)^T \\
&= \sigma^2 (X^TX)^{-1}X^TX(X^TX)^{-1} \\
&= \sigma^2 (X^TX)^{-1} \\
\therefore \hat{\beta} &\sim N(\beta, \sigma^2(X^TX)^{-1}) \end{align}\]
1-2. Estimating $\sigma^2$ (OLS)
Here, $\sigma^2$ is estimated as follows.
\[\hat{\sigma}^2_{ols} = \frac{SSR}{n-p} = \frac{\sum (y_i-x_i^T\hat{\beta}_{ols})^2}{n-p} \\ \text{proof) }\frac{SSR}{\sigma^2} \sim \chi^2(n-p) \\ \rightarrow E(SSR) = (n-p)\sigma^2 \\ \therefore E(\hat{\sigma}^2_{ols}) = E\Big(\frac{SSR}{n-p}\Big) = \sigma^2\]
Putting 1-1) and 1-2) together: the variance of $\hat{\beta}_{ols}$ is estimated by plugging $\hat{\sigma}^2_{ols}$ in for $\sigma^2$ in $\sigma^2(X^TX)^{-1}$. Taking the square root of its diagonal entries gives the standard error of each coefficient, which lets us assess the significance of each coefficient.
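The estimation steps in 1-1) and 1-2) translate directly into a few lines of linear algebra. A minimal sketch, continuing from the simulated X, y, n, p above:

```python
# beta_hat = (X^T X)^{-1} X^T y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# sigma2_hat = SSR / (n - p)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)

# Cov(beta_hat) = sigma^2 (X^T X)^{-1} with sigma^2 replaced by its estimate;
# the square roots of the diagonal are the standard errors of the coefficients
se = np.sqrt(np.diag(sigma2_hat * XtX_inv))
t_stat = beta_hat / se                   # rough check of each coefficient's significance
```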
1-3. Conjugacy Review
1-3-1. Matching Trick \[\begin{align} p(y|\mu,\Sigma) &\propto exp(-\frac{1}{2}(y-\mu)^T\Sigma^{-1}(y-\mu)) \\ &\propto exp(-\frac{1}{2}y^T\Sigma^{-1}y + y^T\Sigma^{-1}\mu) \end{align}\]
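The payoff of the trick is that once a density is written in this quadratic-plus-linear form, the mean and covariance can be read off directly: \[\text{if } p(\theta) \propto exp(-\frac{1}{2}\theta^TA\theta + \theta^Tb), \text{ then } \theta \sim N(A^{-1}b, A^{-1})\] Matching with the expression above gives $A = \Sigma^{-1}$ and $b = \Sigma^{-1}\mu$; the same matching is what identifies $\beta_n$ and $\Sigma_n$ in Section 2-1 below.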
1-3-2. Scaled Inverse Chi-squared distribution
\[X \sim \chi^{-2}(\nu, \tau^2) = \Gamma^{-1}(\nu/2, \nu\tau^2/2) \\
\rightarrow f(x) = \frac{(\nu\tau^2/2)^{\nu/2}}{\Gamma(\nu/2)}\left(\frac{1}{x}\right)^{\nu/2+1}exp(-\frac{\nu\tau^2}{2x}) \]
If we place the prior $\sigma^2 \sim \chi^{-2}(\nu_0, \sigma^2_0)$, then $\nu_0$ plays the role of a prior sample size and $\sigma^2_0$ that of a prior variance.
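Since $\chi^{-2}(\nu, \tau^2)$ is just a reparametrized inverse gamma, it can be sampled with any inverse-gamma routine. A minimal sketch using scipy.stats.invgamma (the prior values are illustrative):

```python
from scipy import stats

nu0, sigma2_0 = 2.0, 1.0    # illustrative prior sample size and prior variance

# chi^{-2}(nu0, sigma2_0) = InvGamma(nu0/2, nu0*sigma2_0/2); in scipy the
# second (rate-like) parameter of the inverse gamma is passed as `scale`
draws = stats.invgamma(a=nu0 / 2, scale=nu0 * sigma2_0 / 2).rvs(size=10_000)
```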
2. Bayesian estimation for a regression model
2-1. Semi-conjugate prior (independent prior)
semi-conjugate: a prior whose full conditional posterior belongs to the same family of distributions as the prior
- Slopes $\beta$
likelihood: $\boldsymbol{y|X,\beta},\sigma^2 \sim MVN(\boldsymbol{X\beta}, \sigma^2\boldsymbol{I})$
prior: $\beta \sim N(\beta_0, \Sigma_0)$
posterior: $\beta|y,X,\sigma^2 \sim N(\beta_n, \Sigma_n)$
where $\beta_n = \Sigma_n(\Sigma_0^{-1}\beta_0 + X^Ty/\sigma^2), \Sigma_n = (\Sigma_0^{-1} + X^TX/\sigma^2)^{-1}$
(Derivation of the posterior)
\[\begin{align}
p(\beta|y,\sigma^2) &\propto p(y|\beta,\sigma^2)p(\beta) \\
&\propto exp(-\frac{1}{2\sigma^2}(y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta) - \frac{1}{2}(\beta^T\Sigma_0^{-1}\beta - 2\beta^T\Sigma_0^{-1}\beta_0)) \\
&\propto exp\Big(-\frac{1}{2}\beta^T(X^TX/\sigma^2 + \Sigma_0^{-1})\beta + \beta^T(X^Ty/\sigma^2 + \Sigma_0^{-1}\beta_0)\Big) \\
\therefore \Sigma_n^{-1} &= X^TX/\sigma^2 + \Sigma_0^{-1} \\
\beta_n &= \Sigma_n(X^Ty/\sigma^2 + \Sigma_0^{-1}\beta_0)
\end{align}\]
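In code, this full conditional is a direct transcription of $\Sigma_n$ and $\beta_n$. A minimal sketch (assuming NumPy; the function and argument names are mine):

```python
import numpy as np

def sample_beta(y, X, sigma2, beta0, Sigma0_inv, rng):
    """One draw from beta | y, X, sigma^2 ~ N(beta_n, Sigma_n)."""
    Sigma_n = np.linalg.inv(Sigma0_inv + X.T @ X / sigma2)
    beta_n = Sigma_n @ (Sigma0_inv @ beta0 + X.T @ y / sigma2)
    return rng.multivariate_normal(beta_n, Sigma_n)
```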
- Error Variance $\sigma^2$
likelihood: $\boldsymbol{y|X,\beta},\sigma^2 \sim MVN(\boldsymbol{X\beta}, \sigma^2\boldsymbol{I})$
prior: $\sigma^2 \sim \Gamma^{-1}(\frac{\nu_0}{2}, \frac{\nu_0\sigma^2_0}{2})$
posterior: $\sigma^2|y,X,\beta \sim \Gamma^{-1}(\frac{\nu_n}{2}, \frac{\nu_n\sigma^2_n}{2})$
where $\nu_n = \nu_0+n, \sigma^2_n = \frac{1}{\nu_n}(\nu_0\sigma^2_0 + SSR(\beta))$
(Derivation of the posterior)
\[\begin{align}
p(\sigma^2|y, \beta) &\propto p(\sigma^2)p(y|\beta,\sigma^2)\\
&\propto (\frac{1}{\sigma^2})^{\nu_0/2+1}exp(-\frac{\nu_0\sigma^2_0}{2\sigma^2})(\frac{1}{\sigma^2})^{n/2}exp(-\frac{SSR(\beta)}{2\sigma^2}) \\
&= (\frac{1}{\sigma^2})^{(\nu_0+n)/2+1}exp(-\frac{\nu_0\sigma^2_0+SSR(\beta)}{2\sigma^2}) \\
\therefore \nu_n &= \nu_0+n \\
\sigma^2_n &= (\nu_0\sigma^2_0+SSR(\beta))/\nu_n
\end{align}\]
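Since the semi-conjugate priors only yield the two full conditionals, the joint posterior $p(\beta,\sigma^2|y)$ is approximated by Gibbs sampling, alternating the two draws. A minimal sketch reusing sample_beta from above (prior settings and starting values are illustrative):

```python
def sample_sigma2(y, X, beta, nu0, sigma2_0, rng):
    """One draw from sigma^2 | y, X, beta ~ InvGamma(nu_n/2, nu_n*sigma2_n/2)."""
    n = len(y)
    ssr = np.sum((y - X @ beta) ** 2)
    # an inverse-gamma draw is 1 / Gamma(shape, rate); NumPy takes scale = 1/rate
    return 1.0 / rng.gamma((nu0 + n) / 2, 2.0 / (nu0 * sigma2_0 + ssr))

def gibbs(y, X, beta0, Sigma0_inv, nu0, sigma2_0, n_iter=5000, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])          # arbitrary starting values
    sigma2 = 1.0
    betas, sigma2s = [], []
    for _ in range(n_iter):
        beta = sample_beta(y, X, sigma2, beta0, Sigma0_inv, rng)
        sigma2 = sample_sigma2(y, X, beta, nu0, sigma2_0, rng)
        betas.append(beta)
        sigma2s.append(sigma2)
    return np.array(betas), np.array(sigma2s)
```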
2-2. Full-conjugate prior (dependent prior)
\[\begin{align} p(\beta,\sigma^2|y) &= p(\beta|\sigma^2,y) \ p(\sigma^2|y) \\ &\propto p(\beta|\sigma^2,y)p(\sigma^2)p(y|\sigma^2) \\ \end{align}\]
The distribution of each factor on the right-hand side is as follows. \[\begin{align} \beta|y,\sigma^2 &\sim N(\frac{g}{g+1}\hat{\beta}_{mle},\frac{g}{g+1}Var(\hat{\beta}_{mle})) \\ \sigma^2 &\sim \chi^{-2}(\nu_0, \sigma^2_0) \\ p(y|\sigma^2) &= \int p(y|\beta,\sigma^2)p(\beta|\sigma^2)d\beta \end{align}\]
Looking closely at the integrand: \[\begin{align} p(y|\beta,\sigma^2)p(\beta|\sigma^2) &= \Big(\frac{1}{2\pi\sigma^2}\Big)^{n/2}exp(-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)) \times \Big(\frac{1}{2\pi}\Big)^{p/2}|g\sigma^2(X^TX)^{-1}|^{-\frac{1}{2}}exp(-\frac{1}{2g\sigma^2}\beta^TX^TX\beta) \\ &\because \beta|\sigma^2 \sim N(\beta_0=0, \Sigma_0 = g\sigma^2(X^TX)^{-1}) \\ \end{align}\]
Substituting this back in and collecting terms: \[\begin{align} p(y|\sigma^2) &= \int p(y|\beta,\sigma^2)p(\beta|\sigma^2)d\beta \\ &\propto \Big(\frac{1}{\sigma^2}\Big)^{n/2}|g\sigma^2(X^TX)^{-1}|^{-\frac{1}{2}}exp(-\frac{1}{2\sigma^2}y^Ty)\int exp(\frac{1}{\sigma^2}\beta^TX^Ty - \frac{1}{2\sigma^2}(1+\frac{1}{g})\beta^TX^TX\beta)d\beta \end{align}\]
Now look closely at the last term inside the integral, $(1+\frac{1}{g})\beta^TX^TX\beta$. Let $m$ and $V$ denote the mean and variance of the posterior $\beta|y,\sigma^2$. Rewriting that term in terms of $m$ and $V$ gives the following.
\[\begin{align}
(1+\frac{1}{g})\beta^TX^TX\beta &= \frac{g+1}{g} \times \Bigg((\beta-m)^TX^TX(\beta-m) + 2m^TX^TX\beta -m^TX^TXm \Bigg) \\
(1) \frac{g+1}{g}(\beta-m)^TX^TX(\beta-m) &= \sigma^2(\beta-m)^TV^{-1}(\beta-m)\\
(2) \frac{g+1}{g}2m^TX^TX\beta &= 2\beta^TX^Ty \\
(3) \frac{g+1}{g}m^TX^TXm &= \sigma^2m^TV^{-1}m \\
\therefore (1+\frac{1}{g})\beta^TX^TX\beta &= \sigma^2(\beta-m)^TV^{-1}(\beta-m) + 2\beta^TX^Ty - \sigma^2m^TV^{-1}m
\end{align}\]
\[\begin{align} p(y|\sigma^2) &= \int p(y|\beta,\sigma^2)p(\beta|\sigma^2)d\beta \\ &\propto \Big(\frac{1}{\sigma^2}\Big)^{n/2}|g\sigma^2(X^TX)^{-1}|^{-\frac{1}{2}}exp(-\frac{1}{2\sigma^2}y^Ty + \frac{1}{2}m^TV^{-1}m)\int exp\Big(-\frac{1}{2}(\beta-m)^TV^{-1}(\beta-m)\Big)d\beta \\ &= \Big(\frac{1}{\sigma^2}\Big)^{n/2}|g\sigma^2(X^TX)^{-1}|^{-\frac{1}{2}} exp\big(-\frac{1}{2\sigma^2}(y^Ty - \sigma^2m^TV^{-1}m)\big) \ (2\pi)^{p/2}|\frac{g}{g+1}\sigma^2(X^TX)^{-1}|^{\frac{1}{2}} \\ &\propto \Big(\frac{1}{\sigma^2}\Big)^{n/2}(1+g)^{-p/2}exp\big(-\frac{1}{2\sigma^2}SSR(g)\big) \\ &\text{where } SSR(g) = y^Ty - \sigma^2m^TV^{-1}m = y^T\Big(I - \frac{g}{g+1}X(X^TX)^{-1}X^T\Big)y \end{align}\]
Here, as before, a larger g means a weaker prior. Accordingly, as $g \rightarrow \infty$, $SSR(g) \rightarrow SSR(\hat{\beta}_{mle})$.
\[\begin{align} p(\sigma^2|y) &\propto p(\sigma^2)p(y|\sigma^2) \\ &\propto \Big(\frac{1}{\sigma^2}\Big)^{(\nu_0+n)/2+1}exp(-\frac{\nu_0\sigma^2_0 + SSR(g)}{2\sigma^2}) \\ \therefore \sigma^2|y &\sim \chi^{-2}(\nu_0+n, \frac{\nu_0\sigma^2_0 + SSR(g)}{\nu_0+n}) \end{align}\]
Summarizing the posterior obtained with the full-conjugate prior: \[ \sigma^2|y \sim \chi^{-2}(\nu_0+n, \frac{\nu_0\sigma^2_0 + SSR(g)}{\nu_0+n}) \\ \beta|y,\sigma^2 \sim N(\frac{g}{g+1}\hat{\beta}_{mle},\frac{g}{g+1}Var(\hat{\beta}_{mle})) \\ \text{where } Var(\hat{\beta}_{mle}) = \sigma^2(X^TX)^{-1} \] Since we can draw $\sigma^2$ from $p(\sigma^2|y)$ and then $\beta$ from $p(\beta|y,\sigma^2)$, the joint posterior distribution can be obtained efficiently by plain Monte Carlo approximation rather than Gibbs sampling. This is the key difference from the semi-conjugate prior; a sketch follows below.
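A minimal sketch of that direct Monte Carlo scheme (assuming NumPy; g and the prior settings are illustrative). Each iteration first draws $\sigma^2$ from $p(\sigma^2|y)$ and then $\beta$ from $p(\beta|y,\sigma^2)$, so the pairs are exact joint posterior draws with no burn-in or convergence diagnostics needed:

```python
import numpy as np

def g_prior_posterior(y, X, g=100.0, nu0=2.0, sigma2_0=1.0, n_draws=5000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    H = X @ XtX_inv @ X.T                      # hat matrix
    shrink = g / (g + 1.0)
    ssr_g = y @ (np.eye(n) - shrink * H) @ y   # SSR(g)
    beta_mle = XtX_inv @ X.T @ y

    # sigma^2 | y ~ InvGamma((nu0 + n)/2, (nu0*sigma2_0 + SSR(g))/2)
    sigma2 = 1.0 / rng.gamma((nu0 + n) / 2, 2.0 / (nu0 * sigma2_0 + ssr_g), size=n_draws)

    # beta | y, sigma^2 ~ N(shrink * beta_mle, shrink * sigma^2 (X^T X)^{-1})
    betas = np.array([rng.multivariate_normal(shrink * beta_mle, shrink * s2 * XtX_inv)
                      for s2 in sigma2])
    return betas, sigma2
```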