
์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ์ˆ˜ํ•™์  ๊ธฐ์ดˆ 3๊ฐ• (9์›” 9์ผ), 4๊ฐ• (9์›” 14์ผ) ์— ๊ธฐ๋ฐ˜ํ•ฉ๋‹ˆ๋‹ค.

This document was produced by converting $\LaTeX$ with pandoc, so the layout may not be entirely clean. If a PDF version of these notes is ever published, that version will be easier to read.

Shallow Neural Network: Introduction

๋ฐ์ดํ„ฐ $X_1, \dots X_n \in \mathcal{X}$์ด ์žˆ๊ณ , ์ด์— ๋Œ€ํ•œ ์ •๋‹ต ๋ผ๋ฒจ $Y_1, \dots Y_n \in \mathcal{Y}$์ด ์ฃผ์–ด์ง„ ๊ฒฝ์šฐ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์ž. ์ด๋•Œ, ์–ด๋–ค True Unknown Function $f_\star : \mathcal{X} \to \mathcal{Y}$ ๊ฐ€ ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋ฉด, $Y_i = f_\star(X_i)$ ๋ฅผ ๋งŒ์กฑํ•œ๋‹ค.

From the pairs $(X_i, Y_i)$ we want to find a function $f$ that is close to $f_\star$. Because the labels $Y_i$ for the inputs $X_i$ are data collected by humans, this setting is called supervised learning.

Before starting anything, we first need to make precise what "an $f$ close to $f_\star$" actually means. We would like to cast it as a minimization problem; the most obvious attempt is to introduce some loss function $\ell$ and write \(\underset{f \in \mathcal{F}}{\minimize}\ \sup_{x \in \mathcal{X}} \ell(f(x), f_\star(x))\) This problem is unsatisfactory because (1) optimizing over the space of all possible functions makes no algorithmic sense, and (2) the solution of this optimization problem is simply $f_\star$, so it is not really an optimization problem to begin with: if we already knew $f_\star(x)$ for every $x$, there would be no reason to optimize anything.

๋Œ€์‹ ์—, ํ•จ์ˆ˜๋“ค์˜ ๊ณต๊ฐ„์„ ์ œ์•ฝํ•˜์ž. ์–ด๋–ค ํŒŒ๋ผ๋ฏธํ„ฐ $\theta$๋ฅผ ์ด์šฉํ•˜์—ฌ, ์šฐ๋ฆฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ตœ์ ํ™” ๋ฌธ์ œ๋กœ ๋ฐ”๊พธ๊ณ  ์‹ถ๋‹ค. \(\underset{\theta \in \Theta}{\minimize}\ \sup_{x \in \mathcal{X}} \ell(f_\theta(x), f_\star(x))\)

์—ฌ์ „ํžˆ, ์ผ๋‹จ ์šฐ๋ฆฌ๋Š” ๋ชจ๋“  $x$์— ๋Œ€ํ•ด $f_\star$๋ฅผ ์•Œ๊ณ  ์žˆ์ง€ ์•Š๋‹ค. ์šฐ๋ฆฌ๊ฐ€ ์•Œ๊ณ  ์žˆ๋Š” $x_1, x_2, \dots$ ์— ๋Œ€ํ•œ ๋‹ต $y_1, y_2 \dots$ ๋“ค์„ ๋งž์ถฐ๋‚ผ ์ˆ˜ ์žˆ๋Š” ํ•จ์ˆ˜๋ฅผ ์ผ๋‹จ ๋งŒ๋“œ๋Š” ์ •๋„๊ฐ€ ์ตœ์„ ์ด ์•„๋‹๊นŒ? ๊ทธ๋ฆฌ๊ณ , ์ตœ์•…์˜ ๊ฒฝ์šฐ๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋Œ€์‹ , ํ‰๊ท ์„ ์ตœ์ ํ™”ํ•˜๋Š”๊ฒŒ ๋ญ”๊ฐ€ โ€˜์ผ๋ฐ˜์ ์œผ๋กœโ€™ ์ข‹์€ ์†”๋ฃจ์…˜์„ ์ œ๊ณตํ•  ๊ฒƒ ๊ฐ™๋‹ค. supremum์„ ์ตœ์†Œํ™”ํ•œ๋‹ค๋Š” ๊ฒƒ์€ ๋„ˆ๋ฌด ์ง€๋‚˜์นœ ๋ชฉํ‘œ์ด๋‹ค. \(\underset{\theta \in \Theta}{\minimize}\ \frac{1}{N}\sum_{i = 1}^{N} \ell(f_\theta(x_i), f_\star(x_i))\) ์šฐ๋ฆฌ๋Š” $f_\star(x_i) = y_i$ ์ž„์„ ์•Œ๊ณ  ์žˆ์œผ๋ฏ€๋กœ, ์ด์ œ ๋ญ”๊ฐ€๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.

The function $f_\theta$ expressed through the parameter $\theta$ will be called a model or a neural network, and solving this optimization problem will be called training. That is, we train the parametrized model $f_\theta$ with an algorithm such as SGD; nearly all methods in use today are based on SGD.
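
To make this concrete, here is a minimal sketch in Python/NumPy (the linear model, squared loss, and synthetic data are my own illustrative choices, not part of the lecture): each SGD step draws a small batch of data and moves $\theta$ against the gradient of the batch-averaged loss.

```python
import numpy as np

# Illustrative setup: f_theta(x) = theta^T x with squared loss, and labels generated
# by an "unknown" theta_star standing in for f_star.
rng = np.random.default_rng(0)
theta_star = rng.normal(size=3)                # plays the role of the unknown f_star
X = rng.normal(size=(200, 3))                  # data points x_1, ..., x_N
Y = X @ theta_star                             # labels y_i = f_star(x_i)

theta = np.zeros(3)                            # parameter to be trained
lr, batch = 0.1, 16
for step in range(500):
    idx = rng.integers(0, len(X), size=batch)  # sample a mini-batch
    residual = X[idx] @ theta - Y[idx]         # f_theta(x_i) - y_i on the batch
    grad = X[idx].T @ residual / batch         # gradient of the batch-averaged (1/2)(.)^2 loss
    theta -= lr * grad                         # SGD update

print(np.round(theta, 3), np.round(theta_star, 3))  # theta should approach theta_star
```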

Example: Least squares regression

The problem with $\mathcal{X} = \R^p$, $\mathcal{Y} = \R$, $\Theta = \R^p$, model $f_\theta(x) = x^T \theta$, and loss $\ell(y_1, y_2) = \frac{1}{2}(y_1 - y_2)^2$ is called least squares. In other words, we fit a linear function that approximately reproduces the given data.
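
As a tiny worked example (NumPy, synthetic data, purely for illustration): with the squared loss, the empirical problem $\underset{\theta}{\minimize}\ \frac{1}{N}\sum_{i} \frac{1}{2}(x_i^T\theta - y_i)^2$ even has a closed-form solution, which `np.linalg.lstsq` computes.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))                     # rows are x_i in R^p, here p = 4
theta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ theta_true + 0.01 * rng.normal(size=100)  # nearly linear labels

# Minimizer of sum_i (x_i^T theta - y_i)^2, i.e. the least squares fit
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta_hat, 3))                     # close to theta_true
```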

KL-Divergence

As a mathematical tool, when $p, q \in \R^n$ are probability mass vectors, that is, $p_i, q_i \geq 0$ and $\sum p_i = \sum q_i = 1$, we would like a way to quantify the difference between the two distributions.

The Kullback-Leibler divergence (KL-divergence) is defined as follows. \(\DKL{p}{q} = \sum_{i = 1}^{n} p_i \log\frac{p_i}{q_i} = -\sum_{i = 1}^{n} p_i \log q_i + \sum_{i = 1}^{n} p_i \log p_i\)

  • In the language of information theory, this is the difference of the cross entropy $H(p, q)$ and the entropy $H(p)$, i.e. $\DKL{p}{q} = H(p, q) - H(p)$.

  • ํŽธ์˜๋ฅผ ์œ„ํ•ด (์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ), $0 \log (0 / 0) = 0$ ์œผ๋กœ, $0 \log 0 = 0$ ์œผ๋กœ, $x > 0$์ด๋ฉด $x \log (x / 0) = \infty$ ์œผ๋กœ ๋‘”๋‹ค.

๋ช‡๊ฐ€์ง€ ์„ฑ์งˆ๋“ค์„ ์‚ดํŽด๋ณด๋ฉด...

  • $\DKL{p}{q}$ is in general not equal to $\DKL{q}{p}$ (so it is not a metric).
  • $\DKL{p}{q} \geq 0$, and if $p \neq q$ then $\DKL{p}{q} > 0$. (Exercise)

  • $\DKL{p}{q} = \infty$ is also possible.

Writing the KL-divergence in probabilistic notation: if the random variable $I$ is distributed according to $p$, i.e. $I = i$ with probability $p_i$, then \(\DKL{p}{q} = \expectwith{I}{\log\left(\frac{p_I}{q_I}\right)}\) so it can also be expressed as an expectation.
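
Read literally, the expectation form also gives a simple Monte Carlo estimator: sample indices $I \sim p$ and average $\log(p_I / q_I)$. A quick check (NumPy; the two small distributions are arbitrary choices of my own):

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

exact = np.sum(p * np.log(p / q))          # sum_i p_i log(p_i / q_i)

I = rng.choice(len(p), size=100_000, p=p)  # draw I ~ p
estimate = np.mean(np.log(p[I] / q[I]))    # average of log(p_I / q_I)

print(exact, estimate)                     # the two numbers should be close
```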

What if we consider a symmetrized version such as $(\DKL{p}{q} + \DKL{q}{p}) / 2$?
$\Rightarrow$ This fixes the asymmetry, but the infinity problem remains, so it still does not give a metric. (The Jensen-Shannon divergence instead compares $p$ and $q$ to the mixture $m = (p + q)/2$, as $(\DKL{p}{m} + \DKL{q}{m}) / 2$, which is always finite.)
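
A quick numerical illustration of this point (a sketch; the two point-mass distributions are my own example):

```python
import numpy as np

def kl(p, q):
    # KL-divergence with the 0 / infinity conventions from above
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
m = (p + q) / 2

print((kl(p, q) + kl(q, p)) / 2)  # inf: the symmetrized KL can still blow up
print((kl(p, m) + kl(q, m)) / 2)  # log 2 ~ 0.693: the Jensen-Shannon divergence stays finite
```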