
Based on Lecture 6 (September 23) of Mathematical Foundations of Deep Neural Networks.

์ด ๊ธ€์€ SVM๊ณผ Logistic Regression ๋งํฌ, Softmax Regression ๋งํฌ ์— ์ด์–ด์ง€๋Š” ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค.

The explanations here will be expanded and the post rewritten later.

A model of the form $f_\theta(x) = a^T x + b$, as in logistic regression, can be viewed as a 1-layer (linear layer) neural network.

As at the end of the Softmax Regression post, introduce a suitable loss function $\ell$ and consider optimizing $\ell(f_\theta(x), y)$. Logistic regression is the special case in which $\ell$ is the logistic loss and $f_\theta$ is a linear model. To make this more precise, consider the linear layer.

Linear Layer

A linear layer takes an input $X \in \R^{B \x n}$, where $B$ is the batch size and $n$ is the input size, and outputs a tensor $Y \in \R^{B \x m}$ given by \(Y_{k, i} = \sum_{j = 1}^{n} A_{i, j} X_{k, j} + b_i\) It is called a linear layer because, for each vector $x_k$ in the batch, it acts as the linear (affine) map $y_k = A x_k + b$. The matrix $A \in \R^{m \x n}$ is called the weight and the vector $b \in \R^{m}$ the bias.
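The index formula for $Y_{k, i}$ can be sketched directly in plain Python. This is only an illustration: the function name `linear_layer` and the example values are assumptions, not part of the lecture, and nested lists stand in for tensors.

```python
def linear_layer(X, A, b):
    """Apply y_k = A x_k + b to every row x_k of the batch X.

    X: B rows of length n, A: m rows of length n, b: length m.
    Returns Y with B rows of length m.
    """
    return [
        [sum(A_i[j] * x_k[j] for j in range(len(x_k))) + b_i
         for A_i, b_i in zip(A, b)]
        for x_k in X
    ]


# Hypothetical example values: B = 2, n = 3, m = 2.
X = [[1.0, 2.0, 3.0],
     [0.0, -1.0, 1.0]]
A = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, 1.0]

Y = linear_layer(X, A, b)
print(Y)  # [[-2.0, 4.0], [-1.0, 1.0]]
```

In PyTorch the corresponding building block is `torch.nn.Linear`, which stores the weight and bias as learnable parameters.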

๋”ฐ๋ผ์„œ, Logistic Regression์ด๋ž€, ํ•˜๋‚˜์˜ Linear layer๋ฅผ ์ด์šฉํ•˜๊ณ , loss function์œผ๋กœ logistic loss (KL-divergence with logistic probability) ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” Shallow neural network ๋ผ๊ณ  ๋‹ค์‹œ ์ •์˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

Multi Layer Perceptron

For a multi-layer (deep) network, stacking linear layers alone is pointless, because a composition of linear functions is still linear.
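As a quick check of this claim, take two linear layers with weights $W_1, W_2$ and biases $b_1, b_2$ (notation assumed here for illustration) and compose them:

\(\begin{align*} W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2) \end{align*}\)

The result is a single linear layer with weight $W_2 W_1$ and bias $W_2 b_1 + b_2$. Repeating the argument collapses any number of stacked linear layers in the same way.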

However, introducing a suitable non-linear activation function $\sigma$ and building the layers as follows makes the depth meaningful.

picture 1

์ฆ‰, ์ด๋ฅผ ์‹์œผ๋กœ ์“ฐ๋ฉดโ€ฆ \(\begin{align*} y_L &= W_L y_{L-1} + b_L \\ y_{L - 1} &= \sigma(W_{L-1} y_{L - 2} + b_{L - 1}) \\ \cdots & \cdots \\ y_2 &= \sigma (W_2 y_1 + b_2) \\ y_1 &= \sigma (W_1 x + b_1) \end{align*}\) where $x \in \R^{n_0}, W_l \in \R^{n_l \x n_{l-1}}, n_L = 1$. (Binary classification๋งŒ ์ž ๊น ์ƒ๊ฐํ•˜๊ธฐ๋กœ ํ•˜์ž)

  • Common choices for $\sigma$ are ReLU $\max(z, 0)$, sigmoid $\frac{1}{1 + e^{-z}}$, and hyperbolic tangent $\frac{1 - e^{-2z}}{1 + e^{-2z}}$.
  • By convention, $\sigma$ is usually not applied to the last layer.

์ด ๋ชจ๋ธ์„ MultiLayer Perceptron (MLP) ๋˜๋Š” Fully connected neural network ๋ผ ํ•œ๋‹ค.

Weight Initialization

In SGD, $\theta^{k + 1} = \theta^k - \alpha g^k$, the choice of $\theta^0$ does not matter for convex optimization, since the iteration converges to a global solution from any starting point. In deep learning, however, choosing $\theta^0$ well is an important problem.

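Naively setting $\theta^0 = 0$ causes a vanishing gradient problem: when every weight is zero, the gradient signal passed back to the earlier layers is multiplied by zero weight matrices, so those layers receive no update. PyTorch has its own way of handling this.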
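A small numerical experiment makes the zero-initialization issue concrete. The sketch below assumes a scalar two-layer network with a tanh activation and a squared-error loss, both picked only for illustration, and estimates gradients by finite differences rather than by any library routine.

```python
import math


def loss(params, x, target):
    """Squared error of the scalar network y = w2 * tanh(w1 * x + b1) + b2."""
    w1, b1, w2, b2 = params
    y = w2 * math.tanh(w1 * x + b1) + b2
    return (y - target) ** 2


def numerical_grad(params, x, target, eps=1e-6):
    """Central finite-difference gradient of the loss w.r.t. each parameter."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up, x, target) - loss(down, x, target)) / (2 * eps))
    return grad


x, target = 1.5, 1.0  # hypothetical data point

zero_grad = numerical_grad([0.0, 0.0, 0.0, 0.0], x, target)
nonzero_grad = numerical_grad([0.5, -0.3, 0.8, 0.1], x, target)

print("zero init   :", zero_grad)     # entries for w1, b1, w2 are exactly 0.0
print("nonzero init:", nonzero_grad)  # every entry is nonzero
```

With the zero start, only the last bias `b2` receives a gradient in this example, so a gradient step cannot move the weights away from zero.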

Gradient Computation : Back propagation

Recall that in logistic regression, after setting up the loss function, the plan was to optimize it with a method such as stochastic gradient descent. This means we need some way to compute the gradient of the loss function. That is, for each element $A_{i, j, k}$ of the weights and biases of each layer, we must be able to compute $\pdv{y_L}{A_{i, j, k}}$.

For an MLP, this gradient is very hard to compute by hand, so PyTorch provides it through its autograd functionality. The underlying principle is the chain rule of vector calculus, which will be covered separately later.
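To give a feel for the chain rule at work, here is a hand-derived gradient for a scalar two-layer network, checked against a finite difference. The network, the function names, and the parameter values are all assumptions made for this sketch; it is not how autograd is implemented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w1, b1, w2, b2, x):
    """Scalar two-layer network: y_1 = sigmoid(w1 x + b1), y_2 = w2 y_1 + b2."""
    y1 = sigmoid(w1 * x + b1)
    y2 = w2 * y1 + b2
    return y1, y2


def backward(w1, b1, w2, b2, x):
    """Derivatives of y_2 w.r.t. every parameter, via the chain rule."""
    y1, _ = forward(w1, b1, w2, b2, x)
    dy2_dy1 = w2              # from y_2 = w2 y_1 + b2
    dy1_dz = y1 * (1.0 - y1)  # sigmoid'(z) = sigmoid(z) (1 - sigmoid(z))
    return {
        "w1": dy2_dy1 * dy1_dz * x,
        "b1": dy2_dy1 * dy1_dz,
        "w2": y1,
        "b2": 1.0,
    }


# Hypothetical parameter values and input.
w1, b1, w2, b2, x = 0.7, -0.2, 1.3, 0.4, 2.0
grads = backward(w1, b1, w2, b2, x)

# Check the w1 entry against a central finite difference.
eps = 1e-6
y_up = forward(w1 + eps, b1, w2, b2, x)[1]
y_down = forward(w1 - eps, b1, w2, b2, x)[1]
fd_w1 = (y_up - y_down) / (2 * eps)
print(grads["w1"], fd_w1)  # the two values agree closely
```

Each entry of `grads` is a product of local derivatives along the path from the parameter to the output, which is the pattern back propagation applies layer by layer.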