
Based on lectures 2 (Sep 7) and 3 (Sep 9) of Mathematical Foundations of Deep Neural Networks.

This document was produced by converting $\LaTeX$ with pandoc, so the layout may not be entirely clean. If I ever publish the PDF version of these notes, that version will be the better reference.

ML์๋ ๋ง์ Finite sum optimization์ด ์๋ค. ์ฐ๋ฆฌ๋ $F(x) = \frac{1}{N} \sum_{i = 1}^{N} f_i(x)$ ๋ฅผ ์ต์ ํํ๊ณ  ์ถ๋ค. ๋ํ์ ์ผ๋ก, Gradient Descent๋ฅผ ์ธ ์ ์๋ค. But, $N$์ด ๋งค์ฐ ํฌ๋ฉด ์ด ํจ์๋ฅผ ํ๋ฒ ๊ณ์ฐํ๋ ์๊ฐ์ด ๋งค์ฐ ์ค๋ ๊ฑธ๋ฆฐ๋ค.

์ ์์, ์ด ํจ์์ ๊ธฐ๋๊ฐ ์ผ๋ก ์ดํดํ  ์ ์๋ค. $$\underset{x \in \R^p}{\minimize}\ \E_I[f_I(x)], \ I \sim \uniform{1}{N}$$

์ด๋ฐ motivation์ ํตํด, Stochastic (Random) Gradient Descent๋ฅผ ์๊ฐํ๋ค.

์์์ ์์์  $x^0 \in \R^p$ ๋ฅผ ์ก๊ณ , ์ ์ ํ $\alpha_k > 0$ ์ ๋ํด ๋ค์์ ๋ฐ๋ณตํ๋ค. $$i(k) \sim \uniform{1}{N},\quad x^{k+1} = x^k - \alpha_k \nabla{f_{i(k)}(x^k)}$$

Rough idea:
As with GD, Taylor-expand, then take expectations to account for the stochasticity.

Taylor-expanding $F(x) = \frac{1}{N} \sum_{i = 1}^{N} f_i(x)$ around $x^k$ and substituting $x^{k+1}$ gives $$F(x^{k+1}) = F(x^k) - \alpha_k \nabla F(x^k)^T \nabla f_{i(k)}(x^k) + \order{\alpha_k^2}$$ Now take $\E$ of both sides: $$\expect{F(x^{k+1})} = \expect{F(x^k)} - \alpha_k \expect{\nabla F(x^k)^T \nabla f_{i(k)}(x^k)} + \order{\alpha_k^2}$$ Conditioned on $x^k$, the factor $\nabla F(x^k)^T$ is deterministic, and the conditional expectation of $\nabla f_{i(k)}(x^k)$ is exactly $\nabla F(x^k)$, so $$\expect{F(x^{k+1})} = \expect{F(x^k)} - \alpha_k \expect{\norm{\nabla F(x^k)}^2} + \order{\alpha_k^2}$$ If we take $\alpha_k$ suitably small, the expected objective should decrease.

• One can state a convergence theorem for Stochastic Gradient Descent, but the proof is quite tedious and not that important in the ML setting.

• ์ง์  ๊ตฌํํด์ ํ์คํธํด ๋ณด๋ฉด, ์ด๋ฐ์ SGD๊ฐ GD๋ณด๋ค ์๋ ด์๋๊ฐ ๋น ๋ฅด์ง๋ง, Eventually GD์๊ฒ ๋ฐ๋ผ์กํ๋ค.
• ๊ทธ๋ฌ๋, ์ฐ๋ฆฌ๋ ML์ ๊ณต๋ถํ๋๋ฐ ์์ด SGD๋ฅผ Subroutine์ผ๋ก ์ธ ๊ฒ์ด๊ณ , ์งง์ Training ์๊ฐ์ ํ๊ฒฝ์์ SGD๊ฐ in practice GD๋ณด๋ค ์ ์๋ ดํ๋ค.

• Note that Stochastic Gradient Descent made no use of any particular property of $i(k)$ itself.
• Indeed, for SGD's convergence proof, the following more general framework suffices.

์์์ ์์์  $x^0 \in \R^p$ ๋ฅผ ์ก๊ณ , ์ ์ ํ $\alpha_k > 0$ ์ ๋ํด ๋ค์์ ๋ฐ๋ณตํ๋ค. $$i(k) \sim \uniform{1}{N},\quad x^{k+1} = x^k - \alpha_k g^k$$ ์ด๋, $g^k$ ๋ Stochastic gradient๋ก, $\nabla F(x^k)$ ์ Unbiased Estimator ์ด๋ฉด - ์ฆ, ๊ธฐ๋๊ฐ์ด $\nabla F(x^k)$ ์ด๋ฉด ์ถฉ๋ถํ๋ค.

Batch SGD / Cyclic SGD

• For example, as a way of choosing $g^k$, consider batch sampling with/without replacement.
• That is, repeatedly draw a random subset of $B$ of the $N$ indices and use $\frac{1}{B}\sum_{b = 1}^{B} \nabla f_{i(k, b)}(x^k)$, the average of the gradients over the batch of size $B$.
• Whether the batch is drawn with or without replacement does not matter (both give unbiased estimators). To allow repeats, draw $B$ independent random indices; to forbid them, take the first $B$ entries of a random permutation.

ํนํ, Batch ๋ฐฉ๋ฒ์ ๊ฒฝ์ฐ, GPU๋ฅผ ์ด์ฉํ Parallel ์ฐ์ฐ์ ์ ๋ฆฌํ๋ค๋ ์ถ๊ฐ์ ์ธ ์ฅ์ ์ด ์๋ค. GPU์ ๋ณ๋ ฌ์ฒ๋ฆฌ๋ฅผ ์ต๋ํ ํ์ฉํ๊ธฐ ์ํด, GPU memory์ ๋ค์ด๊ฐ๋ ์ต๋ $B$๋ฅผ ์ด์ฉํ๋ ๊ฒ์ด ๊ฐ์ฅ ์ ๋ฆฌํ๋ค. Batch size๋ noise ์ ๋์ ๋ฐ๋ผ ์ฑ๋ฅ์ด ๋ฌ๋ผ์ง๋๋ฐ, noise๊ฐ ํด์๋ก large batch๊ฐ ์ ๋ฆฌํ๋ค.

๊ทธ๋ฐ๋ฐโฆ์ด ์๊ณ ๋ฆฌ์ฆ์์, ์๋ ๋๋คํ์ง ์์ ๊ฒ์ ์ต์ง๋ก ๋๋คํ๊ฒ ๋ง๋ค์ด์ ํ๊ณ  ์๋ ๊ฒ ์๋๊ฐ? Stochasticํ๊ฒ ๋ฝ๋ ๋์ , ๊ทธ๋ฅ ์์๋๋ก ๋๋ฆฌ๋ฉด์ ์ฐ๋ฉด ์ ๋๋?

Algorithm : Cyclic SGD
์์์ ์์์  $x^0 \in \R^p$ ๋ฅผ ์ก๊ณ , ์ ์ ํ $\alpha > 0$ ์ ๋ํด ๋ค์์ ๋ฐ๋ณตํ๋ค. $$x^{k+1} = x^k - \alpha_k \nabla{f_{(k \text{ mod } N + 1)}(x^k)}$$

This method in fact has no stochastic component at all. Its pros and cons:

• (+) It is guaranteed to use each of the $N$ data points exactly once every $N$ iterations.
• (-) The convergence theorem for SGD no longer applies.
• (-) Theoretically and empirically, it fails to work well in some cases where plain SGD does.
• (-) In deep learning, the neural network tends to memorize the cyclic order.

ํนํ ๊ธฐ์ตํ๋ ๊ฒฝํฅ์ ์ํ overfitting์ด ํฐ ์ด์์ด๊ธฐ ๋๋ฌธ์, ์ด๋ฅผ ๋ฐฉ์งํด์ผ ํ๋ค. ์ ๋นํ ์์ด์ฃผ๋ฉด ์ด๋จ๊น?

Algorithm : Shuffled Cyclic SGD
์์์ ์์์  $x^0 \in \R^p$ ๋ฅผ ์ก๊ณ , ์ ์ ํ $\alpha > 0$ ์ ๋ํด ๋ค์์ ๋ฐ๋ณตํ๋ค. $$x^{k+1} = x^k - \alpha \nabla{f_{\sigma^{(k/N)}(k \text{ mod } N + 1)}(x^k)}$$

์ฆ, $N$๋ฒ์ ํ ๋ฒ์ฉ, ์ธ๋ฑ์ค๋ค์ ๋๋คํ๊ฒ permutationํด๋ฒ๋ฆฐ ๋ค์, ๊ทธ ์์๋ก ๋ค์ $N$๋ฒ์ iteration์ Cyclicํ๊ฒ ๋๋ฆฐ๋ค. ์ด๋ ๊ฒ ํ๋ฉด, ์ ํํ๊ฒ $N$๊ฐ์ ๋ฐ์ดํฐ๋ฅผ ํ๋ฒ์ฉ ์ด๋ค๋ ์ฅ์ ์ ์ฑ๊ธฐ๋ฉด์๋, neural network๊ฐ ํ์ตํ๋ ์ผ์ ๋ง์ ์ ์๋ค. ๊ธฐ์กด์๋ ๊ฐํ theory๊ฐ ๋ณ๋ก ์์์ง๋ง, recent breakthrough๋ค์ด ์ด๋ฅผ ๊ฐ์ ํ๊ณ  ์๋ค.

๊ทธ๋ฅ ์ผ๋ฐ์ ์ธ ์ธํ์์๋, Shuffled cyclic minibatch SGD without replacement ๋ฅผ ์ฐ๋ฉด ๋๊ณ , GPU๊ฐ ํ๋ฝํ๋ ์ต๋ํ ํฐ Batch size๋ฅผ ์ก์ผ๋ฉด ๋๋ค. Deep Learning์ ๋ง์ ๊ฒฝ์ฐ, ์ํ์ ์ธ ๋ถ์์ด ์ค์  performance๋ฅผ ์ ํํ๊ฒ ์์ธกํ์ง ๋ชปํ๋ ๊ฒฝํฅ์ด ์๋๋ฐ, empirically this is best.

์ผ๋ฐ์ ์ธ expectation์ผ๋ก ํํ๋ ์ต์ ํ ๋ฌธ์ , ์๋ฅผ ๋ค์ด ํ๋ฅ ๋ณ์ $\omega$์ ๋ํด ์ด๋ฐ ๋ฌธ์ ๋ค $$\underset{x \in \R^p}{\minimize}\ \E_\omega[f_\omega(x)]$$ ์ ๊ฒฝ์ฐ, ๋๊ฐ์ด SGD๋ก ํ ์ ์๋ค. GD๋ก๋ ํ  ์๋ ์์ง๋ง, ์ผ๋ฐ์ ์ผ๋ก โgradient์ ๊ธฐ๋๊ฐโ ์ ๊ตฌํ๊ธฐ๊ฐ ์ด๋ ต๊ธฐ ๋๋ฌธ์...

Note: in optimization / ML, one full pass over the data ("one lap") is called an epoch. For gradient descent, one step = 1 epoch; for SGD, $N$ steps; for batched SGD, about $N / B$ steps.
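The bookkeeping in one line of arithmetic each; the $N$ and $B$ values are assumptions for illustration:

```python
# One epoch = one full pass over the N data points.
N, B = 50_000, 128                      # assumed dataset / batch sizes
gd_steps_per_epoch    = 1               # GD: one full-gradient step covers all N
sgd_steps_per_epoch   = N               # SGD: N single-sample steps
batch_steps_per_epoch = -(-N // B)      # minibatch SGD: ceil(N / B) steps
```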

SGD Convergence Theorem

This is not something we need to treat in detail, but a similar proof can be given using the Lipschitz Gradient Lemma studied earlier.

Suppose $F : \R^n \to \R$ is differentiable, $\nabla F$ is $L$-Lipschitz continuous, $F$ attains a minimum $F^* > -\infty$, and $g^k$ satisfies $$\E[g^k \di x^k] = \nabla F(x^k), \quad \quad \expect{\norm{g^k - \nabla F(x^k)}^2 \di x^k} \leq \sigma^2$$ — that is, $g^k$ is an unbiased estimator of the gradient with variance at most $\sigma^2$. Then, with the step size $\alpha = \frac{1}{L\sqrt{M}}$, $$\frac{1}{M}\sum_{k = 0}^{M-1} \expect{\norm{\nabla F(x^k)}^2} \leq \frac{1}{\sqrt{M}}\left(2L(F(x^0) - F^*) + \sigma^2\right)$$

์ฆ, Gradient์ ํฌ๊ธฐ์ ํ๊ท ์ด $M$๋ฒ์ iteration์ ์ํด $\order{\frac{1}{\sqrt{M}}}$๋ก ๊ฐ์ํ๋ค.

Proof. First, apply the Lipschitz Gradient Lemma with $x = x^k, \delta = -\alpha g^k$: $$F(x^{k+1}) \leq F(x^k) -\alpha \nabla F(x^k)^T g^k + \frac{\alpha^2L}{2}\norm{g^k}^2$$ Take the conditional expectation given $x^k$, using unbiasedness and the variance bound: $$\expect{F(x^{k+1}) \di x^k} \leq F(x^k) - \alpha \norm{\nabla F(x^k)}^2 + \frac{\alpha^2 L}{2}\left(\norm{\nabla F(x^k)}^2 + \sigma^2\right)$$ Now take the total expectation: $$\expect{F(x^{k+1})} \leq \expect{F(x^k)} - \alpha\left(1 - \frac{\alpha L}{2}\right) \expect{\norm{\nabla F(x^k)}^2} + \frac{\alpha^2 \sigma^2 L}{2}$$ Summing over $k = 0, \dots, M-1$, the left-hand side telescopes, and since $\expect{F(x^M)} \geq F^*$, $$\alpha\left(1 - \frac{\alpha L}{2}\right) \sum_{k = 0}^{M-1}\expect{\norm{\nabla F(x^k)}^2} \leq F(x^0) - F^* + \frac{M\alpha^2 \sigma^2 L}{2}$$ Finally, substituting $\alpha = \frac{1}{L \sqrt{M}}$ and using $1 - \frac{\alpha L}{2} \geq \frac{1}{2}$ yields the claimed bound. $\blacksquare$