
์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ์ˆ˜ํ•™์  ๊ธฐ์ดˆ 2๊ฐ• (9์›” 7์ผ), 3๊ฐ• (9์›” 9์ผ) ์— ๊ธฐ๋ฐ˜ํ•ฉ๋‹ˆ๋‹ค.

This document was produced by converting $\LaTeX$ with pandoc, so the layout may be rough in places. If a PDF version of these notes is ever published, it would be better to refer to that instead.

Stochastic Gradient Descent

ML์—๋Š” ๋งŽ์€ Finite sum optimization์ด ์žˆ๋‹ค. ์šฐ๋ฆฌ๋Š” $F(x) = \frac{1}{N} \sum_{i = 1}^{N} f_i(x)$ ๋ฅผ ์ตœ์ ํ™”ํ•˜๊ณ  ์‹ถ๋‹ค. ๋Œ€ํ‘œ์ ์œผ๋กœ, Gradient Descent๋ฅผ ์“ธ ์ˆ˜ ์žˆ๋‹ค. But, $N$์ด ๋งค์šฐ ํฌ๋ฉด ์ด ํ•จ์ˆ˜๋ฅผ ํ•œ๋ฒˆ ๊ณ„์‚ฐํ•˜๋Š” ์‹œ๊ฐ„์ด ๋งค์šฐ ์˜ค๋ž˜ ๊ฑธ๋ฆฐ๋‹ค.

์œ„ ์‹์„, ์ด ํ•จ์ˆ˜์˜ ๊ธฐ๋Œ“๊ฐ’ ์œผ๋กœ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋‹ค. \(\underset{x \in \R^p}{\minimize}\ \E_I[f_I(x)], \ I \sim \uniform{1}{N}\)

This motivates Stochastic (Random) Gradient Descent.

Algorithm (Stochastic Gradient Descent)
Pick an arbitrary starting point $x^0 \in \R^p$ and, for suitable step sizes $\alpha_k > 0$, repeat \(i(k) \sim \uniform{1}{N},\quad x^{k+1} = x^k - \alpha_k \nabla{f_{i(k)}(x^k)}\)

๋Œ€๋žต์˜ ์•„์ด๋””์–ด :
GD์ฒ˜๋Ÿผ, Taylor expansionํ•˜๊ณ  Stochastic์„ ๊ณ ๋ คํ•˜์—ฌ Expectation์„ ์”Œ์šด๋‹ค.

$x^k$ ๊ทผ์ฒ˜์—์„œ $F(x) = \frac{1}{N} \sum_{i = 1}^{N} f_i(x)$๋ฅผ ํ…Œ์ผ๋Ÿฌ ์ „๊ฐœํ•˜๊ณ  $x^{k+1}$ ๋Œ€์ž…ํ•˜๋ฉด, \(F(x^{k+1}) = F(x^k) - \alpha_k \nabla F(x^k)^T \nabla f_{i(k)}(x^k) + \order{\alpha_k^2}\) ์ด์ œ, ์–‘์ชฝ์— $\E$ ๋ฅผ ์”Œ์šด๋‹ค. \(\expect{F(x^{k+1})} = \expect{F(x^k)} - \alpha_k \expect{\nabla F(x^k)^T \nabla f_{i(k)}(x^k)} + \order{\alpha_k^2}\) $\nabla F(x^k)^T$ ๋Š” ๊ธฐ๋Œ“๊ฐ’์— ์˜ํ–ฅ์ด ์—†๊ณ , $\nabla f_{i(k)}(x^k)$ ์˜ ๊ธฐ๋Œ“๊ฐ’์€ $\nabla F(x^k)$ ์ด๋ฏ€๋กœ, \(\expect{F(x^{k+1})} = \expect{F(x^k)} - \alpha_k \norm{\nabla F(x^k)}^2 + \order{\alpha_k^2}\) ์ ๋‹นํžˆ $\alpha_k$๋ฅผ ์ถฉ๋ถ„ํžˆ ์ž‘๊ฒŒ ์žก์œผ๋ฉด, ๊ธฐ๋Œ“๊ฐ’์ด ๊ฐ์†Œํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™๋‹ค.


  • A convergence theorem can also be stated for Stochastic Gradient Descent, but the proof is quite tedious, and in ML settings it is not that important.

  • ์ง์ ‘ ๊ตฌํ˜„ํ•ด์„œ ํ…Œ์ŠคํŠธํ•ด ๋ณด๋ฉด, ์ดˆ๋ฐ˜์— SGD๊ฐ€ GD๋ณด๋‹ค ์ˆ˜๋ ด์†๋„๊ฐ€ ๋น ๋ฅด์ง€๋งŒ, Eventually GD์—๊ฒŒ ๋”ฐ๋ผ์žกํžŒ๋‹ค.
  • ๊ทธ๋Ÿฌ๋‚˜, ์šฐ๋ฆฌ๋Š” ML์„ ๊ณต๋ถ€ํ•˜๋Š”๋ฐ ์žˆ์–ด SGD๋ฅผ Subroutine์œผ๋กœ ์“ธ ๊ฒƒ์ด๊ณ , ์งง์€ Training ์‹œ๊ฐ„์˜ ํ™˜๊ฒฝ์—์„œ SGD๊ฐ€ in practice GD๋ณด๋‹ค ์ž˜ ์ˆ˜๋ ดํ•œ๋‹ค.

  • In Stochastic Gradient Descent above, we never used any property of $i(k)$ itself.
  • Indeed, the convergence proof of SGD only needs the following more general framework.

Algorithm (Stochastic Gradient Descent)
Pick an arbitrary starting point $x^0 \in \R^p$ and, for suitable step sizes $\alpha_k > 0$, repeat \(x^{k+1} = x^k - \alpha_k g^k\) Here $g^k$ is a stochastic gradient: it suffices that $g^k$ is an unbiased estimator of $\nabla F(x^k)$, i.e. that its expectation is $\nabla F(x^k)$.

Batch SGD / Cyclic SGD

  • For example, one way to choose $g^k$ is batch sampling with or without replacement.
  • That is, repeatedly draw a random subset of $B$ of the $N$ indices and use $\frac{1}{B}\sum_{b = 1}^{B} \nabla f_{i(k, b)}(x^k)$, the average gradient over a batch of size $B$.
  • Whether the batch is drawn with or without replacement does not matter (both give unbiased estimators). To allow duplicates, draw $B$ independent random indices; to forbid them, take the first $B$ entries of a random permutation (both are sketched below).
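
Both sampling schemes, as a sketch (the batch size and least-squares setup are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, B = 1000, 10, 32
A = rng.normal(size=(N, p))
b = rng.normal(size=N)
x = np.zeros(p)

batch = rng.integers(N, size=B)      # with replacement: B i.i.d. uniform indices
# batch = rng.permutation(N)[:B]     # without replacement: first B of a permutation

# minibatch stochastic gradient (1/B) sum_b grad f_{i(k,b)}(x^k);
# either sampling scheme makes it an unbiased estimator of grad F(x^k)
g = np.mean(A[batch] * (A[batch] @ x - b[batch])[:, None], axis=0)
```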

The batch approach has the additional advantage of being well suited to parallel computation on GPUs. To exploit the GPU and parallelism as much as possible, it is best to use the largest $B$ that fits in GPU memory. Performance also depends on the noise level of the gradients: the noisier they are, the more a large batch helps.

๊ทธ๋Ÿฐ๋ฐโ€ฆ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ, ์›๋ž˜ ๋žœ๋คํ•˜์ง€ ์•Š์€ ๊ฒƒ์„ ์–ต์ง€๋กœ ๋žœ๋คํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์„œ ํ’€๊ณ  ์žˆ๋Š” ๊ฒƒ ์•„๋‹Œ๊ฐ€? Stochasticํ•˜๊ฒŒ ๋ฝ‘๋Š” ๋Œ€์‹ , ๊ทธ๋ƒฅ ์ˆœ์„œ๋Œ€๋กœ ๋Œ๋ฆฌ๋ฉด์„œ ์“ฐ๋ฉด ์•ˆ ๋˜๋‚˜?

Algorithm : Cyclic SGD
Pick an arbitrary starting point $x^0 \in \R^p$ and, for suitable step sizes $\alpha_k > 0$, repeat \(x^{k+1} = x^k - \alpha_k \nabla{f_{(k \text{ mod } N + 1)}(x^k)}\)

์ด ๋ฐฉ๋ฒ•์€ stochasticํ•œ ๋ถ€๋ถ„์ด ์‚ฌ์‹ค ์—†๋‹ค. ์žฅ๋‹จ์ ์€...

  • (+) Each of the $N$ data points is used exactly once every $N$ iterations, deterministically.
  • (-) The convergence theorem for SGD no longer applies.
  • (-) Compared to plain SGD, it performs worse in some cases, both theoretically and empirically.
  • (-) In deep learning, the neural network tends to memorize the cyclic order.

The overfitting caused by this memorization is a serious issue, so it must be prevented. What if we shuffle once in a while?

Algorithm : Shuffled Cyclic SGD
Pick an arbitrary starting point $x^0 \in \R^p$ and, for a suitable $\alpha > 0$, repeat \(x^{k+1} = x^k - \alpha \nabla{f_{\sigma^{(\lfloor k/N \rfloor)}(k \text{ mod } N + 1)}(x^k)}\)

That is, once every $N$ iterations, randomly permute the indices, then run the next $N$ iterations cyclically in that order. This keeps the advantage of using each of the $N$ data points exactly once, while preventing the neural network from memorizing the order. There used to be little strong theory for this scheme, but recent breakthroughs are improving the situation.
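
An epoch-structured sketch of this scheme, reusing the synthetic least-squares setup from the earlier blocks (all constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 1000, 10
A = rng.normal(size=(N, p))
b = rng.normal(size=N)

x = np.zeros(p)
alpha = 1e-2
for epoch in range(20):
    sigma = rng.permutation(N)          # fresh permutation sigma^{(floor(k/N))}
    for i in sigma:                     # one cyclic pass in the shuffled order
        x = x - alpha * A[i] * (A[i] @ x - b[i])
```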

๊ทธ๋ƒฅ ์ผ๋ฐ˜์ ์ธ ์„ธํŒ…์—์„œ๋Š”, Shuffled cyclic minibatch SGD without replacement ๋ฅผ ์“ฐ๋ฉด ๋˜๊ณ , GPU๊ฐ€ ํ—ˆ๋ฝํ•˜๋Š” ์ตœ๋Œ€ํ•œ ํฐ Batch size๋ฅผ ์žก์œผ๋ฉด ๋œ๋‹ค. Deep Learning์˜ ๋งŽ์€ ๊ฒฝ์šฐ, ์ˆ˜ํ•™์ ์ธ ๋ถ„์„์ด ์‹ค์ œ performance๋ฅผ ์ •ํ™•ํ•˜๊ฒŒ ์˜ˆ์ธกํ•˜์ง€ ๋ชปํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋Š”๋ฐ, empirically this is best.

์ผ๋ฐ˜์ ์ธ expectation์œผ๋กœ ํ‘œํ˜„๋œ ์ตœ์ ํ™” ๋ฌธ์ œ, ์˜ˆ๋ฅผ ๋“ค์–ด ํ™•๋ฅ ๋ณ€์ˆ˜ $\omega$์— ๋Œ€ํ•ด ์ด๋Ÿฐ ๋ฌธ์ œ๋“ค \(\underset{x \in \R^p}{\minimize}\ \E_\omega[f_\omega(x)]\) ์˜ ๊ฒฝ์šฐ, ๋˜‘๊ฐ™์ด SGD๋กœ ํ’€ ์ˆ˜ ์žˆ๋‹ค. GD๋กœ๋„ ํ•  ์ˆ˜๋Š” ์žˆ์ง€๋งŒ, ์ผ๋ฐ˜์ ์œผ๋กœ โ€˜gradient์˜ ๊ธฐ๋Œ“๊ฐ’โ€™ ์„ ๊ตฌํ•˜๊ธฐ๊ฐ€ ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์—...

Note: in optimization / ML, one rough 'pass' over the data is called an epoch. One step of gradient descent = 1 epoch; for SGD, $N$ steps per epoch; for batched SGD, about $N / B$ steps. For instance, with $N = 50{,}000$ and $B = 128$, one epoch is roughly 391 iterations.

SGD Convergence Theorem

์ƒ์„ธํ•˜๊ฒŒ ๋‹ค๋ฃจ์–ด์•ผ ํ•  ๋‚ด์šฉ์€ ์•„๋‹ˆ์ง€๋งŒ, ์•ž์„œ ๊ณต๋ถ€ํ•œ Lipschitz Gradient Lemma ๋“ฑ์„ ์ด์šฉํ•ด์„œ ๋น„์Šทํ•œ ์ฆ๋ช…์„ ์“ธ ์ˆ˜ ์žˆ๋‹ค.

Suppose $F : \R^p \to \R$ is differentiable, $\nabla F$ is $L$-Lipschitz continuous, $F$ has a finite infimum $F^* > -\infty$, and $g^k$ satisfies \(\E[g^k \di x^k] = \nabla F(x^k), \quad \quad \expect{\norm{g^k - \nabla F(x^k)}^2 \di x^k} \leq \sigma^2\) i.e. $g^k$ is an unbiased estimator of the gradient with variance at most $\sigma^2$. Then \(\frac{1}{M}\sum_{k = 0}^{M-1} \expect{\norm{\nabla F(x^k)}^2} \leq \frac{1}{\sqrt{M}}\left(2L(F(x^0) - F^*) + \sigma^2\right)\)

That is, the average of the squared gradient norms over $M$ iterations decreases as $\order{\frac{1}{\sqrt{M}}}$.

Proof. First, applying the Lipschitz Gradient Lemma with $x = x^k, \delta = -\alpha g^k$, \(F(x^{k+1}) \leq F(x^k) -\alpha \nabla F(x^k)^T g^k + \frac{\alpha^2L}{2}\norm{g^k}^2\) Take the conditional expectation given $x^k$, using $\expect{\norm{g^k}^2 \di x^k} \leq \norm{\nabla F(x^k)}^2 + \sigma^2$: \(\expect{F(x^{k+1}) \di x^k} \leq F(x^k) - \alpha \norm{\nabla F(x^k)}^2 + \frac{\alpha^2 L}{2}\left(\norm{\nabla F(x^k)}^2 + \sigma^2\right)\) Taking total expectations, \(\expect{F(x^{k+1})} \leq \expect{F(x^k)} - \alpha\left(1 - \frac{\alpha L}{2}\right) \expect{\norm{\nabla F(x^k)}^2} + \frac{\alpha^2 \sigma^2 L}{2}\) Summing over $k = 0, \dots, M-1$ and telescoping, \(\alpha\left(1 - \frac{\alpha L}{2}\right) \sum_{k = 0}^{M-1}\expect{\norm{\nabla F(x^k)}^2} \leq F(x^0) - \expect{F(x^M)} + \frac{M\alpha^2 \sigma^2 L}{2} \leq (F(x^0) - F^*) + \frac{M\alpha^2 \sigma^2 L}{2}\) Finally, take $\alpha = \frac{1}{L \sqrt{M}}$, so that $1 - \frac{\alpha L}{2} \geq \frac{1}{2}$, and divide both sides by $\alpha M / 2$ to obtain the claimed bound. ◻
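
As an informal sanity check of the rate (my addition, not from the lecture), one can run SGD with the proof's step size $\alpha = \frac{1}{L\sqrt{M}}$ on the least-squares problem from the earlier blocks and watch the averaged squared gradient norm shrink as $M$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 1000, 10
A = rng.normal(size=(N, p))
b = rng.normal(size=N)
L = np.linalg.eigvalsh(A.T @ A / N).max()   # Lipschitz constant of grad F

for M in [100, 1_000, 10_000]:
    alpha = 1.0 / (L * np.sqrt(M))          # step size from the proof
    x, avg = np.zeros(p), 0.0
    for k in range(M):
        gF = A.T @ (A @ x - b) / N          # grad F(x^k), for monitoring only
        avg += (gF @ gF) / M                # running (1/M) sum ||grad F(x^k)||^2
        i = rng.integers(N)
        x = x - alpha * A[i] * (A[i] @ x - b[i])
    print(M, avg)                           # decays roughly like 1/sqrt(M)
```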