
์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ์ˆ˜ํ•™์  ๊ธฐ์ดˆ 6๊ฐ• (9์›” 23์ผ) ์— ๊ธฐ๋ฐ˜ํ•ฉ๋‹ˆ๋‹ค.

์ด ๊ธ€์€ SVM๊ณผ Logistic Regression ๋งํฌ ์— ์ด์–ด์ง€๋Š” ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค.

๋‚˜์ค‘์— ์„ค๋ช…์„ ๋ณด๊ฐ•ํ•ด์„œ ๋‹ค์‹œ ์ž‘์„ฑ๋  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค.


๋ฐ์ดํ„ฐ $X_1, \dots X_n \in \mathcal{X}$์ด ์žˆ๊ณ , ์ด์— ๋Œ€ํ•œ ์ •๋‹ต ๋ผ๋ฒจ $Y_1, \dots Y_n \in \mathcal{Y}$์ด ์ฃผ์–ด์ง„ ๊ฒฝ์šฐ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์ž. ์ด๋ฒˆ์—๋Š” ๊ทธ๋Ÿฐ๋ฐ, $Y_i$ ๊ฐ€ $-1$ ๊ณผ $1$ ์ค‘์—์„œ ๊ณ ๋ฅด๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ $\Set{1, 2, \dots k}$ ์ค‘ ํ•˜๋‚˜์ด๋‹ค.

Softmax Regression

Logistic Regression์˜ ํ™•์žฅ๋œ ๋ฒ„์ „์œผ๋กœ, multi-class classification์„ ํ•˜๊ณ  ์‹ถ๋‹ค. ์—ฌ์ „ํžˆ empirical distribution์˜ ๊ฐœ๋…์„ ์‚ฌ์šฉํ•œ๋‹ค. $\mathcal{P}(y)$ ๋Š” ํฌ๊ธฐ $k$์˜ ๋ฒกํ„ฐ๋กœ, one-hot encoding ๋œ ๊ฒƒ์œผ๋กœ ๋ณด์ž.

Using the softmax function, we smooth the $\argmax$ in some sense. Define $\mu : \R^k \to \R^k$ as \(\mu(z)_i = \frac{e^{z_i}}{\sum_{j = 1}^{k} e^{z_j}}\)

์ด ํ•จ์ˆ˜๊ฐ’์˜ ๋ชจ๋“  index $i$์— ๋Œ€ํ•œ ํ•ฉ์ด 1์ด๊ธฐ ๋•Œ๋ฌธ์—, $\mu(z)_i$ ๋ฅผ ์ผ์ข…์˜ confidence ํ™•๋ฅ ๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค.

์ด์ œ, ๋ชจ๋ธ $\mu(f_{A, b}(x)) = \mu(Ax + b)$ ๋ฅผ ํƒํ•˜์ž. ์ด๋•Œ, $x \in \R^n$ ์— ๋Œ€ํ•ด, $A$์˜ ๊ฐ row vector ๋ฅผ $a_i^T$ ๋ผ ํ•˜๋ฉด, $f_{A, b}(x)$ ๋Š” ๋‹ค์Œ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด $(f_{A, b}(x))_i = (a_i^Tx + b_i)$ ์ธ ํฌ๊ธฐ $k$์˜ ๋ฒกํ„ฐ๊ฐ€ ๋˜๊ณ , $\mu$ ๋ฅผ ๋ถ™์ด๋ฉด ๊ฐ index์— softmax๋ฅผ ์“ด ๊ฒฐ๊ณผ๊ฐ€ ๋œ๋‹ค. ๊ฒฐ๊ตญ์€ ์–ด๋–ค ํ–‰๋ ฌ๊ณฑ์„ ํ•ด์„œ ๋ฒกํ„ฐ๋ฅผ ์–ป์€ ๋‹ค์Œ, ๊ทธ ๋ฒกํ„ฐ์—๋‹ค๊ฐ€ softmax๋ฅผ ๋ถ™์ธ ์…ˆ.

์šฐ๋ฆฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ•œ๋‹ค. \(\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ \sum_{i = 1}^{N} \DKL{\mathcal{P}(Y_i)}{\mu(f_{a, b}(X_i))}\) ์ด ์‹์„ ์ •๋ฆฌํ•˜๋ฉด, Logistic regression ๋•Œ์ฒ˜๋Ÿผ ๋‹ค์Œ ๋ฌธ์ œ์™€ ๋™์น˜์ž„์„ ์•ˆ๋‹ค. \(\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ \sum_{i = 1}^{N} H(\mathcal{P}(Y), \mu(f_{a, b}(X)))\) ์ด์ œ, ๋‹ค์‹œ cross entropy ํ•ญ์„ ์ „๊ฐœํ•˜์—ฌ ์ •๋ฆฌํ•œ๋‹ค. \(\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ -\sum_{i = 1}^{N} \sum_{j = 1}^{k} \mathcal{P}(Y_i)_j \log (\mu(f_{A, b}(X_i))_j)\) ์—ฌ๊ธฐ์„œ $\mathcal{P}(Y_i)_j$ ๋Š” $j = Y_i$ ์ผ ๋•Œ 1์ด๊ณ  ๋‚˜๋จธ์ง€๋Š” 0์ด๋ฏ€๋กœ, \(\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ -\sum_{i = 1}^{N} \log \mu(f_{A, b}(X_{i}))_{Y_i} = \underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ -\sum_{i = 1}^{N} \log \left( \frac{e^{a_{Y_i}^T X_i + b_{Y_i}}}{\sum_{j = 1}^{k} e^{a_j^TX_i + b_j}}\right)\) ์ด ์‹์„ ์ •๋ฆฌํ•˜์—ฌ, \(\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ \sum_{i = 1}^{N} \left(-(a_{Y_i}^T X_i + b_{Y_i}) + \log\left(\sum_{j = 1}^{k} e^{a_j^TX_i + b_j}\right)\right)\)

Interesting fact: looking closely, the resulting objective of Softmax regression is in fact convex. Moreover, when $k = 2$, this problem is equivalent to Logistic regression.
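The convexity claim can be sanity-checked numerically (this is not a proof): for random parameter pairs, the loss at the midpoint should not exceed the average of the losses. Data, shapes, and seeds below are hypothetical toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, N = 3, 4, 20
X = rng.normal(size=(N, n))
Y = rng.integers(0, k, size=N)

def loss(A, b):
    """sum_i of -(a_{Y_i}^T X_i + b_{Y_i}) + log sum_j exp(a_j^T X_i + b_j)."""
    total = 0.0
    for x, y in zip(X, Y):
        f = A @ x + b
        m = f.max()                                  # log-sum-exp trick
        total += -f[y] + m + np.log(np.exp(f - m).sum())
    return total

# midpoint inequality: loss((p + q)/2) <= (loss(p) + loss(q))/2
for _ in range(50):
    A1, A2 = rng.normal(size=(k, n)), rng.normal(size=(k, n))
    b1, b2 = rng.normal(size=k), rng.normal(size=k)
    mid = loss((A1 + A2) / 2, (b1 + b2) / 2)
    assert mid <= (loss(A1, b1) + loss(A2, b2)) / 2 + 1e-9
```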

To write this conveniently, define a function called the Cross Entropy Loss. \(\ell^{\text{CE}} (f, y) = - \log\left(\frac{e^{f_y}}{\sum_{j = 1}^{k} e^{f_j}}\right)\) With this function, Softmax Regression can be stated simply. \(\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ \frac{1}{N}\sum_{i = 1}^{N} \ell^{\text{CE}}(f_{A, b}(X_i), Y_i)\)
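Putting the pieces together, here is a minimal sketch of solving this problem by plain gradient descent, using the standard gradient of $\ell^{\text{CE}}$ with respect to the scores, $\mu(f) - \text{onehot}(y)$. The toy data (Gaussian clusters around $k$ random class means), shapes, step size, and iteration count are all assumptions for illustration.

```python
import numpy as np

def ell_ce(f, y):
    """Cross Entropy Loss: -log(exp(f_y) / sum_j exp(f_j)), via log-sum-exp."""
    m = f.max()
    return -f[y] + m + np.log(np.exp(f - m).sum())

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# hypothetical toy data: N points in R^n, each drawn around one of k class means
rng = np.random.default_rng(2)
N, n, k = 50, 2, 3
means = rng.normal(scale=3.0, size=(k, n))
Y = rng.integers(0, k, size=N)
X = means[Y] + rng.normal(size=(N, n))

# gradient descent on (1/N) sum_i ell_ce(f_{A,b}(X_i), Y_i)
A, b = np.zeros((k, n)), np.zeros(k)
lr = 0.05
for _ in range(300):
    gA, gb = np.zeros_like(A), np.zeros_like(b)
    for x, y in zip(X, Y):
        p = softmax(A @ x + b)
        p[y] -= 1.0               # grad of ell_ce w.r.t. scores: mu(f) - onehot(y)
        gA += np.outer(p, x)
        gb += p
    A -= lr * gA / N
    b -= lr * gb / N

avg_loss = np.mean([ell_ce(A @ x + b, y) for x, y in zip(X, Y)])
```

At $A = 0$, $b = 0$ the average loss is $\log k$, so any value below that means the model has learned something.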

That is, defining Softmax Regression amounts to either…

  • optimizing a plain Linear model with the Cross Entropy Loss, or
  • optimizing a Softmax-ed Linear model with the KL Divergence.

The two say the same thing in the end (the CE Loss, by construction, accounts for the softmax-ed probability distribution), but the former formulation is easier to generalize.

์ „์ž์˜ ํ‘œํ˜„์„ ์ด์šฉํ•˜์—ฌ SR์„ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ํ™•์žฅํ•˜๋ฉด, linear model $f_{A, b}$ ๋Œ€์‹  ์–ด๋–ค ์ž„์˜์˜ model $f_\theta(X_i)$ ์™€์˜ cross entropy loss๋ฅผ minimizeํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ์ƒ๊ฐํ•ด ๋ณผ ์ˆ˜๋„ ์žˆ๋‹ค. \(\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ \frac{1}{N}\sum_{i = 1}^{N} \ell^{\text{CE}}(f_{\theta}(X_i), Y_i)\)

This works because the cross entropy loss is, at bottom, making an arg-max-like (by softmax) choice and minimizing the KL-Divergence between the result and the empirical distribution.