
์ฌ์ธต ์ ๊ฒฝ๋ง์ ์ํ์  ๊ธฐ์ด 6๊ฐ (9์ 23์ผ) ์ ๊ธฐ๋ฐํฉ๋๋ค.

์ด ๊ธ์ SVM๊ณผ Logistic Regression ๋งํฌ ์ ์ด์ด์ง๋ ๋ด์ฉ์๋๋ค.

๋์ค์ ์ค๋ช์ ๋ณด๊ฐํด์ ๋ค์ ์์ฑ๋  ์์ ์๋๋ค.

๋ฐ์ดํฐ $X_1, \dots X_n \in \mathcal{X}$์ด ์๊ณ , ์ด์ ๋ํ ์ ๋ต ๋ผ๋ฒจ $Y_1, \dots Y_n \in \mathcal{Y}$์ด ์ฃผ์ด์ง ๊ฒฝ์ฐ๋ฅผ ์๊ฐํด ๋ณด์. ์ด๋ฒ์๋ ๊ทธ๋ฐ๋ฐ, $Y_i$ ๊ฐ $-1$ ๊ณผ $1$ ์ค์์ ๊ณ ๋ฅด๋ ๊ฒ์ด ์๋๋ผ $\Set{1, 2, \dots k}$ ์ค ํ๋์ด๋ค.

Softmax Regression

We extend Logistic Regression to perform multi-class classification. We still use the notion of an empirical distribution: view $\mathcal{P}(y)$ as a one-hot encoded vector of size $k$.

Softmax ํจ์๋ฅผ ์ด์ฉํ์ฌ, $\argmax$ ๋ฅผ in some sense smooth- ํ๋ค. Define $\mu : \R^k \to \R^k$ as $$\mu(z)_i = \frac{e^{z_i}}{\sum_{j = 1}^{k} e^{z_j}}$$

์ด ํจ์๊ฐ์ ๋ชจ๋  index $i$์ ๋ํ ํฉ์ด 1์ด๊ธฐ ๋๋ฌธ์, $\mu(z)_i$ ๋ฅผ ์ผ์ข์ confidence ํ๋ฅ ๋ก ์๊ฐํ  ์ ์๋ค.

์ด์ , ๋ชจ๋ธ $\mu(f_{A, b}(x)) = \mu(Ax + b)$ ๋ฅผ ํํ์. ์ด๋, $x \in \R^n$ ์ ๋ํด, $A$์ ๊ฐ row vector ๋ฅผ $a_i^T$ ๋ผ ํ๋ฉด, $f_{A, b}(x)$ ๋ ๋ค์ ๊ทธ๋ฆผ๊ณผ ๊ฐ์ด $(f_{A, b}(x))_i = (a_i^Tx + b_i)$ ์ธ ํฌ๊ธฐ $k$์ ๋ฒกํฐ๊ฐ ๋๊ณ , $\mu$ ๋ฅผ ๋ถ์ด๋ฉด ๊ฐ index์ softmax๋ฅผ ์ด ๊ฒฐ๊ณผ๊ฐ ๋๋ค. ๊ฒฐ๊ตญ์ ์ด๋ค ํ๋ ฌ๊ณฑ์ ํด์ ๋ฒกํฐ๋ฅผ ์ป์ ๋ค์, ๊ทธ ๋ฒกํฐ์๋ค๊ฐ softmax๋ฅผ ๋ถ์ธ ์.

We want to solve the following optimization problem. $$\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ \sum_{i = 1}^{N} \DKL{\mathcal{P}(Y_i)}{\mu(f_{A, b}(X_i))}$$ Rearranging this expression, as with Logistic regression, it is equivalent to the following problem. $$\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ \sum_{i = 1}^{N} H(\mathcal{P}(Y_i), \mu(f_{A, b}(X_i)))$$ Now, expand the cross entropy term. $$\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ -\sum_{i = 1}^{N} \sum_{j = 1}^{k} \mathcal{P}(Y_i)_j \log (\mu(f_{A, b}(X_i))_j)$$ Here $\mathcal{P}(Y_i)_j$ is 1 when $j = Y_i$ and 0 otherwise, so $$\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ -\sum_{i = 1}^{N} \log \mu(f_{A, b}(X_{i}))_{Y_i} = \underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ -\sum_{i = 1}^{N} \log \left( \frac{e^{a_{Y_i}^T X_i + b_{Y_i}}}{\sum_{j = 1}^{k} e^{a_j^TX_i + b_j}}\right)$$ Rearranging this expression, $$\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ \sum_{i = 1}^{N} \left(-(a_{Y_i}^T X_i + b_{Y_i}) + \log\left(\sum_{j = 1}^{k} e^{a_j^TX_i + b_j}\right)\right)$$
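The last rewriting step, pulling the numerator out of the logarithm, can be checked numerically on arbitrary values; here `f` plays the role of the logit vector $f_{A,b}(X_i)$ and `y` the label $Y_i$ (both made up):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=5)   # stand-in for A X_i + b, with k = 5
y = 3                    # a made-up true label index

# -log(softmax(f)_y)  vs  -f_y + log(sum_j exp(f_j))
lhs = -np.log(np.exp(f[y]) / np.exp(f).sum())
rhs = -f[y] + np.log(np.exp(f).sum())
```

The two values agree by the algebra $\log(e^{f_y}/\sum_j e^{f_j}) = f_y - \log\sum_j e^{f_j}$.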

Interesting fact: looking closely at the final expression, Softmax regression is in fact convex. Moreover, when $k = 2$, this problem is equivalent to Logistic regression.

์ด๋ฅผ ํธํ๊ฒ ์ฐ๊ธฐ ์ํด, Cross Entropy Loss ๋ผ๋ ํจ์๋ฅผ ์ ์ํ๋ค. $$\ell^{\text{CE}} (f, y) = - \log\left(\frac{e^{f_y}}{\sum_{j = 1}^{k} e^{f_j}}\right)$$ ์ด์ , ์ด ํจ์๋ฅผ ์ด์ฉํ์ฌ ์ฝ๊ฒ Softmax Regression์ ์ ์ํ  ์ ์๋ค. $$\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ \frac{1}{N}\sum_{i = 1}^{N} \ell^{\text{CE}}(f_{A, b}(X_i), Y_i)$$

์ด๋ ์ฆ, Softmax Regression์ ์ ์ํ๋ ๋ฐ ์์ด์โฆ

• ๋จ์ํ Linear model์ Cross Entropy Loss๋ก ์ต์ ํํ๊ธฐ
• Softmax-ed Linear model์ KL Divergence๋ก ์ต์ ํํ๊ธฐ

๊ฒฐ๊ตญ์ ๋์ด ๊ฐ์ ๋ง์ด์ง๋ง (CE Loss๊ฐ ๊ฒฐ๊ตญ softmax์ฒ๋ฆฌํ ํ๋ฅ ๋ถํฌ๋ฅผ ๊ณ ๋ คํ๊ฒ ๋ค๋ ์๋ฏธ์ด๋ฏ๋ก), ์ ์์ ํํ์ด ์ข๋ ์ผ๋ฐํ๊ฐ ์ฝ๋ค.
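The "two say the same thing" claim is exact for one-hot targets: when $\mathcal{P}(Y_i)$ is one-hot, only the $j = Y_i$ term of the KL divergence survives (the $0 \log 0$ terms vanish), leaving $-\log \mu(f)_{Y_i} = \ell^{\text{CE}}(f, Y_i)$. A small numeric check with made-up logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def ce_loss(f, y):
    return -np.log(softmax(f)[y])

def kl_onehot(y, q, k):
    """D_KL(P(y) || q) for one-hot P(y): only the j = y term survives."""
    p = np.eye(k)[y]          # one-hot encoding of y
    mask = p > 0              # drop the 0 * log 0 terms
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

f = np.array([0.5, -1.2, 2.0])   # made-up logits, k = 3
y = 1
```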

์ ์์ ํํ์ ์ด์ฉํ์ฌ SR์ ์์ฐ์ค๋ฝ๊ฒ ํ์ฅํ๋ฉด, linear model $f_{A, b}$ ๋์  ์ด๋ค ์์์ model $f_\theta(X_i)$ ์์ cross entropy loss๋ฅผ minimizeํ๋ ๊ฒ์ฒ๋ผ ์๊ฐํด ๋ณผ ์๋ ์๋ค. $$\underset{A \in \R^{k \x n}, b \in \R^k}{\minimize}\ \frac{1}{N}\sum_{i = 1}^{N} \ell^{\text{CE}}(f_{\theta}(X_i), Y_i)$$

์ด๋ cross entropy loss๊ฐ ๊ธฐ๋ณธ์ ์ผ๋ก๋ ์ด๋ค arg-max ์ค๋ฌ์ด (by softmax) choice๋ฅผ ํด์ ๊ทธ ๊ฒฐ๊ณผ๊ฐ์ empirical distribution๊ณผ์ KL-Divergence๋ฅผ minimizeํ๋ ๊ฐ๋์ผ๋ก ์ ์ฉ๋๊ธฐ ๋๋ฌธ.