
์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ์ˆ˜ํ•™์  ๊ธฐ์ดˆ 3๊ฐ• (9์›” 9์ผ), 4๊ฐ• (9์›” 14์ผ) ์— ๊ธฐ๋ฐ˜ํ•ฉ๋‹ˆ๋‹ค.

์ด ๋ฌธ์„œ๋Š” $\LaTeX$๋ฅผ pandoc์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ ์ž‘์„ฑํ•˜์˜€๊ธฐ ๋•Œ๋ฌธ์—, ๋ ˆ์ด์•„์›ƒ ๋“ฑ์ด ๊น”๋”ํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์–ธ์  ๊ฐ€ pdf ๋ฒ„์ „์˜ ๋…ธํŠธ๋ฅผ ๊ณต๊ฐœํ•œ๋‹ค๋ฉด ๊ทธ์ชฝ์„ ์ฐธ๊ณ ํ•˜๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Binary Classification

์ž ์‹œ ์•ž์„œ์˜ ์ •์˜๋ฅผ ๋Œ์•„๋ณด์ž.

๋ฐ์ดํ„ฐ $X_1, \dots X_n \in \mathcal{X}$์ด ์žˆ๊ณ , ์ด์— ๋Œ€ํ•œ ์ •๋‹ต ๋ผ๋ฒจ $Y_1, \dots Y_n \in \mathcal{Y}$์ด ์ฃผ์–ด์ง„ ๊ฒฝ์šฐ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์ž. ์ด๋•Œ, ์–ด๋–ค True Unknown Function $f_\star : \mathcal{X} \to \mathcal{Y}$ ๊ฐ€ ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋ฉด, $Y_i = f_\star(X_i)$ ๋ฅผ ๋งŒ์กฑํ•œ๋‹ค.

์šฐ๋ฆฌ๋Š”, $X_i, Y_i$๋กœ๋ถ€ํ„ฐ, $f_\star$๊ณผ ๊ฐ€๊นŒ์šด ์–ด๋–ค ํ•จ์ˆ˜ $f$๋ฅผ ์ฐพ์•„๋‚ด๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ์‹ถ๋‹ค. $X_i$๋“ค์— ๋Œ€ํ•ด $Y_i$๋Š” ์‚ฌ๋žŒ์ด ์ˆ˜์ง‘ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์“ฐ๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋ฅผ Supervised Learning์ด๋ผ๊ณ  ๋ถ€๋ฅธ๋‹ค.

Supervised Learning์„ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ์ƒ๊ฐํ•  ๊ฒƒ์ด๋‹ค. \(\underset{\theta \in \Theta}{\minimize}\ \frac{1}{N}\sum_{i = 1}^{N} \ell(f_\theta(x_i), f_\star(x_i))\)

ํŠนํžˆ, ์ด๋ฒˆ์—๋Š” $\mathcal{X} = \R^p$, $\mathcal{Y} = \Set{-1, +1}$ ์ธ ๋ฌธ์ œ๋ฅผ ์ƒ๊ฐํ•˜์ž. ์ฆ‰, ๋ฐ์ดํ„ฐ๋ฅผ ๋‘ ํด๋ž˜์Šค๋กœ ๋ถ„๋ฆฌํ•ด๋‚ด๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋•Œ, ํŠน๋ณ„ํžˆ ์ด ๋ฐ์ดํ„ฐ๊ฐ€ linearly seperableํ•œ์ง€๋ฅผ ์ƒ๊ฐํ•œ๋‹ค. ์–ด๋–ค ์ดˆํ‰๋ฉด $a^T x + b$ ๊ฐ€ ์กด์žฌํ•˜์—ฌ, $y$๊ฐ’์„ $a^T x + b$์˜ ๋ถ€ํ˜ธ์— ๋”ฐ๋ผ ์ฐ์–ด๋‚ผ ์ˆ˜ ์žˆ์œผ๋ฉด linearly seperableํ•˜๋‹ค๊ณ  ์ •์˜ํ•œ๋‹ค.

Linear Classification

To solve binary classification, in particular linear classification, we consider the following affine model. \(f_{a, b}(x) = \sgn(a^T x + b)\) As the loss function, counting the number of wrong labels is very natural, and it can be written compactly as \(\ell(y_1, y_2) = \frac{1}{2}\abs{1 - y_1 y_2}\)
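A quick sanity check that the compact formula really is the 0-1 loss for labels in $\Set{-1, +1}$:

```python
def zero_one_loss(y1, y2):
    # (1/2)|1 - y1*y2|: 0 when the labels agree, 1 when they disagree,
    # for y1, y2 in {-1, +1}.
    return 0.5 * abs(1 - y1 * y2)

print(zero_one_loss(+1, +1))  # 0.0
print(zero_one_loss(-1, -1))  # 0.0
print(zero_one_loss(+1, -1))  # 1.0
```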

์ด์ œ, ๋‹ค์Œ์˜ ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ํ’€๊ณ  ์‹ถ๋‹ค. \(\underset{a \in \R^p, b \in \R}{\minimize}\ \frac{1}{N}\sum_{i = 1}^{N} \ell(f_{a, b}(x_i), y_i)\) ๊ทธ๋Ÿฌ๋ฉด Linearly seperableํ•œ์ง€๋Š” ์ด ์ตœ์ ํ™” ๋ฌธ์ œ์˜ ์ตœ์ ํ•ด๊ฐ€ 0์ธ์ง€์™€ ๋™์น˜์ด๋‹ค. ๊ทธ๋Ÿฐ๋ฐ, ์ด ํ•จ์ˆ˜๋Š” ์—ฐ์†ํ•จ์ˆ˜๊ฐ€ ์•„๋‹ˆ๊ธฐ ๋•Œ๋ฌธ์— (์ •ํ™•ํžˆ๋Š” ๋Œ€์ถฉ ๋ฏธ๋ถ„๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ์กฐ๊ฑด์„ ์š”๊ตฌํ•œ๋‹ค) SGD๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋Œ๋ฆด์ˆ˜๊ฐ€ ์—†๋‹ค.

Support Vector Machine

๋”ฐ๋ผ์„œ, ์ด ๋ฌธ์ œ๋ฅผ continuousํ•˜๊ฒŒ relaxationํ•˜๊ณ ์ž ํ•œ๋‹ค. ๊ด€์ ์„ ๋ฐ”๊พธ๋ฉด, ์ด ๋ผ๋ฒจ์ด 1์ผ / -1์ผ โ€˜Confidenceโ€™๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋„๋ก ๋ชจ๋ธ์„ ์ข€ ์ž˜ ํ™•์žฅํ•˜๊ณ ์ž ํ•œ๋‹ค. 0.5์ด๋ฉด โ€˜์•„๋งˆ๋„ 1์ผ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹คโ€™ ๊ฐ™์€ ๋Š๋‚Œ์œผ๋กœ.

์ด๋ฅผ ์œ„ํ•ด์„œ๋Š” $y_i f_{a, b}(x_i) > 0$ ์„ ๋งŒ์กฑํ•ด์•ผ ํ•œ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ, ์‹ค์ œ๋กœ๋Š” ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด $f$๊ฐ’์ด 0 ๊ทผ์ฒ˜์—์„œ๋งŒ ์™”๋‹ค๊ฐ”๋‹คํ•˜๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ๊ณ , ์ด๋Š” numericalํ•œ ๋ฉด์—์„œ๋‚˜ neural network์˜ confidence๋ผ๋Š” ํ•ด์„์œผ๋กœ๋‚˜ ์ ์ ˆํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์ ๋‹นํžˆ margin์„ ์ฃผ๋Š” ๊ฒƒ์ด ๋ฐ”๋žŒ์งํ•˜๋‹ค.

์ ๋‹นํžˆ margin์„ 1๋งŒํผ ์ค˜์„œ, $y_i f_{a, b}(x_i) \geq 1$ ์„ ๋งŒ์กฑํ•˜๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค. ์—ฌ๊ธฐ์„œ โ€˜์ข‹์„ ๊ฒƒ ๊ฐ™๋‹คโ€™ ๋Š” ๋ง์€ ๋ฐ˜๋Œ€๋กœ ์ € ์„ฑ์งˆ์„ ๋งŒ์กฑํ•˜์ง€ ์•Š์œผ๋ฉด ํŽ˜๋„ํ‹ฐ๋ฅผ ๋ถ€๊ณผํ•˜๊ฒ ๋‹ค๋Š” ๋ฐœ์ƒ์œผ๋กœ๋„ ํ•ด์„๋  ์ˆ˜ ์žˆ๊ณ โ€ฆ ์ด ํŽ˜๋„ํ‹ฐ ํ•จ์ˆ˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฌธ์ œ๋กœ ์“ฐ๋ฉด, \(\underset{a \in \R^p, b \in \R}{\minimize}\ \frac{1}{N}\sum_{i = 1}^{N} \max(0, 1 - y_i f_{a, b}(x_i)) = \frac{1}{N}\sum_{i = 1}^{N} \max(0, 1 - y_i (a^T x_i + b))\)

๋ฐ์ดํ„ฐ๊ฐ€ linearly seperableํ•˜๋ฉด, ์ด ์‹๋„ optimal value๊ฐ€ 0์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์„ Support Vector Machine ์ด๋ผ๊ณ  ๋ถ€๋ฅด๋ฉฐ, ํ”ํžˆ regularizer๋ฅผ ์ถ”๊ฐ€ํ•œ ์•„๋ž˜ ์‹์œผ๋กœ ์“ด๋‹ค.\(\underset{a \in \R^p, b \in \R}{\minimize}\ \frac{1}{N}\sum_{i = 1}^{N} \max(0, 1 - y_i (a^T x_i + b)) + \frac{\lambda}{2}\norm{a}^2\)

์ด ์ตœ์ ํ™” ๋ฌธ์ œ (Relaxation ๋„ฃ๊ธฐ ์ „!)๊ฐ€ ์›๋ณธ ๋ฌธ์ œ์˜ relaxation์ด๋ผ๋Š” ์‚ฌ์‹ค์„ ๋ณด์ด๋Š” ๊ฒƒ์€ ์–ด๋ ต์ง€ ์•Š๋‹ค. ์›๋ž˜ ๋ฌธ์ œ์˜ ์ตœ์ ํ•ด๋ฅผ $p_1^\star$ ๋ผ ํ•˜๊ณ , SVM์˜ ์ตœ์ ํ•ด๋ฅผ $p_2^\star$ ๋ผ ํ•˜๋ฉด, $p_1^\star = 0 \iff p_2^\star = 0$ ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

๊ฒฐ๊ตญ, relaxed supervised learning์€ point prediction์„ relaxation ํ•ด์„œ label value ๋Œ€์‹  ๊ทธ label์˜ probability๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ์ƒ๊ฐํ•˜๋Š” ๊ฒƒ. Single prediction๋ณด๋‹ค ํ›จ์”ฌ realisticํ•œ ์„ธํŒ…์œผ๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค.

Logistic Regression

Another approach to linear binary classification. We still want to find the decision boundary $a^T x + b = 0$. First...

Binary classification์—์„œ, ์šฐ๋ฆฌ๊ฐ€ ํ™•์ธํ•œ ๋ฐ์ดํ„ฐ์˜ Label์„ ํ™•๋ฅ ๋ฒกํ„ฐ๋กœ ๋งŒ๋“ค์–ด์„œ (๋งŒ์•ฝ ์™„์ „ํžˆ label์ด ํ•˜๋‚˜๋ผ๋ฉด, (1, 0) ๊ณผ (0, 1) ์ฒ˜๋Ÿผ) ํ‘œํ˜„ํ•œ ๊ฒƒ์„ empirical distribution $\mathcal{P}(y)$ ๋ผ๊ณ  ์ •์˜ํ•˜๊ธฐ๋กœ ํ•œ๋‹ค.

๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ์ตœ์ ํ™”ํ•˜๋Š” supervised learning์„ Logistic Regression์ด๋ผ ํ•œ๋‹ค. \(f_{a, b}(x) = \begin{bmatrix} \frac{1}{1 + e^{a^T x + b}} \\ \frac{1}{1 + e^{-(a^Tx + b)}} \end{bmatrix}\)

์ด ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ•œ๋‹ค. \(\underset{a \in \R^p, b \in \R}{\minimize}\ \sum_{i = 1}^{N} \DKL{\mathcal{P}(Y_i)}{f_{a, b}(X_i)}\) ์ฆ‰, ์šฐ๋ฆฌ๋Š” empirical distribution๊ณผ์˜ KL-Divergence๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ณ  ์‹ถ๋‹ค. ์ด ์‹์„ ์ •๋ฆฌํ•˜๋ฉด... \(\underset{a \in \R^p, b \in \R}{\minimize}\ \sum_{i = 1}^{N} H(\mathcal{P}(Y_i), f_{a, b}(X_i)) + \text{ Terms independent of } a, b\) ์ •ํ™•ํžˆ Cross entropy $H$๋ฅผ ์ „๊ฐœํ•˜๊ณ , ์˜ค๋ฅธ์ชฝ term๋“ค์„ ๋‹ค ๋ฒ„๋ฆฌ๋ฉด... \(\underset{a \in \R^p, b \in \R}{\minimize}\ - \frac{1}{N}\sum_{i = 1}^{N} \P(y_i = -1) \log\left(\frac{1}{1 + e^{a^Tx_i + b}}\right) + \P(y_i = 1)\log\left(\frac{1}{1 + e^{-a^Tx_i - b}}\right)\) ์ด๋Š” ๋‹ค์‹œ, $\P(y_i = 1)$ ๊ณผ $\P(y_i = -1)$ ์ด one-hot์ด๋ฏ€๋กœ, ๋‘˜์ค‘์— ์–ด๋Š์ชฝ์ด 1์ธ์ง€๋ฅผ ๊น”๋”ํ•˜๊ฒŒ ์ •๋ฆฌํ•˜์—ฌ, \(\underset{a \in \R^p, b \in \R}{\minimize}\ - \frac{1}{N}\sum_{i = 1}^{N} \log\left(\frac{1}{1 + e^{-y_i(a^Tx_i + b)}}\right)\) ๋‹จ์กฐ๊ฐ์†Œํ•จ์ˆ˜์ธ Loss function $\ell(z) = \log(1 + e^{-z})$๋ฅผ ๋„์ž…ํ•˜์—ฌ ๋ถ€ํ˜ธ๋ฅผ ๋–ผ๊ณ  ๊น”๋”ํ•˜๊ฒŒ ์ •๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค. \(\underset{a \in \R^p, b \in \R}{\minimize}\ \frac{1}{N}\sum_{i = 1}^{N}\ell(y_i(a^T x_i + b))\) ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•œ ํ›„, $a^T x + b$ ์˜ ๋ถ€ํ˜ธ์— ๋”ฐ๋ผ predictionํ•œ๋‹ค.

SVM๊ณผ ๋น„๊ตํ•˜๋ฉด, ์ถœ๋ฐœ์ ์ด ๋‹ฌ๋ž์ง€๋งŒ ๊ฒฐ๊ตญ์€ ๊ฐ™์€ ๋ฌธ์ œ๊ฐ€ ๋˜๋Š”๋ฐ, $\ell(z)$ ๋ฅผ ์–ด๋–ป๊ฒŒ ์ •์˜ํ•˜๋Š๋ƒ์˜ ๋ฌธ์ œ๊ฐ€ ๋œ๋‹ค. SVM์€ $\max(0, 1-z)$์ด๊ณ , Logistic regression์€ $\log(1 + e^{-z})$ ๋ฅผ ์“ฐ๋Š” ๊ฒฝ์šฐ๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค. ์ขŒํ‘œ์— ๊ทธ๋ ค๋ณด๋ฉด ๋‘ ํ•จ์ˆ˜๊ฐ€ ์‚ฌ์‹ค ๊ต‰์žฅํžˆ ๋น„์Šทํ•˜๊ฒŒ ์ƒ๊ฒผ๋‹ค.

SVM๊ณผ LR์€ ๋‘˜๋‹ค (Decision boundary๊ฐ€ hyperplane์ด๋ผ๋Š” ๊ด€์ ์—์„œ) Linear classifier์ด์ง€๋งŒ, LR์ด ์ข€๋” ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ multiclass classification์œผ๋กœ ํ™•์žฅ๋œ๋‹ค. (Softmax Regression)