Entropy, KL-Divergence

Entropy

์ •๋ณด๋Ÿ‰ = ๋ถˆํ™•์‹ค์„ฑ \[\begin{align} H(p) &= \sum_{i=1}p_i log\frac{1}{p_i} \\ &= -\sum_{i=1}p_i log(p_i) \end{align}\]

์—ฌ๊ธฐ์„œ $\frac{1}{p_i}$๋Š” ๋ฐœ์ƒํ™•๋ฅ ์˜ ์—ญ์ˆ˜๋กœ, ๋‹ค๋ฅด๊ฒŒ ๋ณด๋ฉด ๊ฐ€๋Šฅํ•œ ๊ฒฐ๊ณผ์˜ ์ˆ˜๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— $log\frac{1}{p_i}$๋Š” ํ•„์š”ํ•œ ์งˆ๋ฌธ์˜ ์ˆ˜๋ผ๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค.
ํ•ฉ์ณ์„œ ์ƒ๊ฐํ•ด๋ณด๋ฉด, ์ •๋ณด๋Ÿ‰์ด๋ผ๊ณ  ํ•˜๋Š” ๊ฒƒ์€ ํ•„์š”ํ•œ ์งˆ๋ฌธ์˜ ์ˆ˜ x ํ™•๋ฅ ์˜ ์ดํ•ฉ์ด๋ผ๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค.

Cross Entropy

p์— ๋Œ€ํ•ด, ์ „๋žต Q๋ฅผ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ์˜ ๋ถˆํ™•์‹ค์„ฑ ์ฆ‰, ํŠน์ • ์ „๋žต์„ ์“ธ ๋•Œ, ์˜ˆ์ƒ๋˜๋Š” ์งˆ๋ฌธ๊ฐœ์ˆ˜์— ๋Œ€ํ•œ ๊ธฐ๋Œ“๊ฐ’

๊ทธ๋ƒฅ Entropy์™€์˜ ์ฐจ์ด์ ์€ log ์•ˆ์˜ $p_i$๊ฐ€ $q_i$๋กœ ๋ฐ”๋€Œ์—ˆ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด๊ฒƒ์˜ ์˜๋ฏธ๋ฅผ ์ž˜ ํŒŒ์•…ํ•ด์•ผ ํ•œ๋‹ค.

\[\begin{align} H(p,q) &= \sum_{i} p_i \log\frac{1}{q_i} \\ &= -\sum_{i} p_i \log q_i \end{align}\]

Cross Entropy๋Š” Log Loss ๋˜๋Š” Negative Log Likelihood๋ผ๊ณ  ๋ถˆ๋ฆฌ๊ธฐ๋„ ํ•œ๋‹ค. ์ฆ‰, Cross Entropy๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์€ log likelihood๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.

KL-Divergence

์ฟจ๋ฐฑ-๋ผ์ด๋ธ”๋Ÿฌ ๋ฐœ์‚ฐ(Kullback-Leibler Divergence)๋Š” ๋‘ ํ™•๋ฅ ๋ถ„ํฌ์˜ ์ฐจ์ด์—์„œ ๊ณ„์‚ฐ๋œ ์—”ํŠธ๋กœํ”ผ ์ฐจ์ด๋ฅผ ๋œปํ•œ๋‹ค. ์ฐธ๊ณ ๋กœ, H(p)๋Š” ์ƒ์ˆ˜๊ฐ’์ด๊ธฐ ๋•Œ๋ฌธ์— Cross Entropy๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์€ KLD๋ฅผ ์ตœ์†Œํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™์€ task์ด๋‹ค. ์ฆ‰, KLD๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์€ log likelihood๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™์•„์ง„๋‹ค.

\[\begin{align} KL(p\|q) &= H(p,q) - H(p) \\ &= \sum_{i} p_i \log\frac{p_i}{q_i} \\ &= -\sum_{i} p_i \log\frac{q_i}{p_i} \end{align}\]

KL-Divergence๋Š” ํ•ญ์ƒ 0 ์ด์ƒ์ด๋‹ค. ์ง๊ด€์ ์œผ๋กœ๋Š” $H(p,q)$์˜ lower bound๊ฐ€ $H(p)$(์ƒ์ˆ˜๊ฐ’)์ด๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋ฅผ ์ฆ๋ช…ํ•˜๊ณ ์ž ํ•œ๋‹ค๋ฉด, convex function์ธ -log๋ฅผ f(x)๋กœ ์ƒ๊ฐํ•˜๊ณ  Jensenโ€™s inequality๋กœ ์ฆ๋ช…ํ•  ์ˆ˜ ์žˆ๋‹ค.

Jensen-Shannon Divergence

KL-Divergence๋Š” ๋Œ€์นญ์ด ์•„๋‹ˆ๋‹ค. ์ฆ‰, p์™€ q์˜ ์œ„์น˜๋ฅผ ๋ฐ”๊ฟ”์“ธ ์ˆ˜ ์—†๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ๊ฑฐ๋ฆฌ ๊ฐœ๋…์œผ๋กœ ํ˜ผ๋™ํ•˜๋ฉด ์•ˆ๋œ๋‹ค.
์ง๊ด€์ ์œผ๋กœ ์ดํ•ดํ•  ๋•Œ, KL-Divergence๋Š” ๋‘ ํ™•๋ฅ ๋ถ„ํฌ ๊ฐ„์˜ ๊ฑฐ๋ฆฌ๋ผ๊ณ  ์„ค๋ช…ํ•˜๊ณค ํ•˜์ง€๋งŒ, ๊ทธ๊ฒƒ์ด ์˜ณ์ง€ ์•Š๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

๊ทธ๋ž˜๋„ ๊ฑฐ๋ฆฌ ๊ฐœ๋…์œผ๋กœ ํ™œ์šฉํ•˜๊ณ  ์‹ถ๋‹ค๋ฉด, Jensen-Shannon Divergence๋ฅผ ํ™œ์šฉํ•˜๋ฉด ๋œ๋‹ค.

\[JSD(p\|q) = \frac{1}{2}KL(p\|M) + \frac{1}{2}KL(q\|M), \quad \text{where } M = \frac{1}{2}(p+q)\]
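The symmetry of JSD, in contrast to KL, can be verified directly (a minimal sketch; helper names are mine):

```python
import math

def kl(p, q):
    # KL(p||q) = sum_i p_i * log2(p_i / q_i)
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """JSD(p||q) = (1/2) KL(p||M) + (1/2) KL(q||M), M = (p+q)/2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.8, 0.2]
q = [0.5, 0.5]
print(kl(p, q), kl(q, p))    # different: KL is asymmetric
print(jsd(p, q), jsd(q, p))  # equal: JSD is symmetric
```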

๋ชฉ์ฐจ