Back to : deep-learning-study

์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ์ˆ˜ํ•™์  ๊ธฐ์ดˆ 1๊ฐ• (9์›” 2์ผ) ์— ๊ธฐ๋ฐ˜ํ•ฉ๋‹ˆ๋‹ค.

This document was produced by converting $\LaTeX$ with pandoc, so the layout may not be clean. If a PDF version of these notes is ever published, refer to that instead.

Intro to Optimization

์šฐ๋ฆฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ์ƒ๊ฐํ•œ๋‹ค.

์ตœ์ ํ™” ๋ฌธ์ œ๋ž€, ์–ด๋–ค Decision variable $x$๋ฅผ ์กฐ์ ˆํ•˜์—ฌ Objective function $f$๋ฅผ ์ตœ์†Œํ™” / ์ตœ๋Œ€ํ™” ํ•˜๋Š” ๋ฌธ์ œ๋ฅผ ๋งํ•œ๋‹ค. ์ด๋•Œ, equality ๋˜๋Š” inequality constraint (์ œ์•ฝ์กฐ๊ฑด) ๋“ค์ด ์ฃผ์–ด์งˆ ์ˆ˜ ์žˆ๋‹ค. ์ฆ‰ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ˜•ํƒœ์˜ ๋ฌธ์ œ๋“ค. \(\underset{x \in \R^p}{\minimize} \ f(x) \ \subto \ h(x) = 0,\ g(x) \leq 0\)

Since minimization and maximization are equivalent up to a sign flip, from now on we consider only minimization problems.

A point satisfying all constraints is called a feasible point. If no feasible point exists at all, the problem is infeasible.

์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ Constraint ์œ ๋ฌด์— ๋”ฐ๋ผ Unconstrained / Constrained๋กœ ๋‚˜๋ˆˆ๋‹ค.

์ตœ์ ํ™” ๋ฌธ์ œ์˜ ๋‹ต์€ Optimal value $p^\star$ ์™€ solution $x^\star$๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด๋‹ค. ์ฆ‰โ€ฆ \(p^* = \inf \Setcond{f(x)}{x \in \R^n, x \text{ feasible }}, \quad f(x^*) = p^*\) ML ์„ธํŒ…์—์„œ๋Š”, $0 \leq p^* < \infty$ ์ธ ๊ฒฝ์šฐ๋ฅผ ์ฃผ๋กœ ์ƒ๊ฐํ•œ๋‹ค.

Example: curve fitting, i.e., given input data $x_1 \dots x_n$ and their labels $y_1 \dots y_n$, find the relationship between inputs and labels. A representative instance is the least-squares problem: find the linear function that most closely approximates the given inputs and outputs.
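As a concrete sketch of the least-squares problem (synthetic data and NumPy, my own illustration rather than anything from the lecture), fitting a line $y \approx ax + b$ amounts to minimizing $\norm{Aw - y}^2$ over the coefficient vector $w = (a, b)$:

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.01 * rng.standard_normal(50)

# Least squares: minimize ||A w - y||^2 over w = (a, b),
# where each row of A is [x_i, 1].
A = np.column_stack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
a, b = w  # recovered slope and intercept
```

Because the noise is small, the recovered slope and intercept land close to the true values 2 and 1.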

์ตœ์ ํ™” ๋ฌธ์ œ์—์„œ๋Š” ๊ทน์†Œ (Local minima) ์™€ ์ตœ์†Œ (Global minima) ๋ฅผ ์ƒ๊ฐํ•ด์•ผ ํ•œ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ, non-convex ํ•จ์ˆ˜์˜ global minima๋ฅผ ์ฐพ๋Š” ๊ฒƒ์€ ์–ด๋ ต๋‹ค.

Gradient Descent

For a differentiable function $f$, we want to solve the simplest unconstrained optimization problem: \(\underset{x \in \R^p}{\minimize} \ f(x)\) In ML settings, functions that are differentiable almost everywhere are often loosely treated as differentiable.

Algorithm (Gradient Descent)

  • ์ž„์˜์˜ ์‹œ์ž‘์  $x^0 \in \R^p$ ๋ฅผ ์žก๊ณ , ์ ์ ˆํ•œ $\alpha_k > 0$ ์— ๋Œ€ํ•ด ๋‹ค์Œ์„ ๋ฐ˜๋ณตํ•œ๋‹ค. \(x^{k+1} = x^k - \alpha_k \nabla{f(x^k)}\)

๋Œ€๋žต์˜ ์•„์ด๋””์–ด : $x^k$ ๊ทผ์ฒ˜์—์„œ $f$๋ฅผ ํ…Œ์ผ๋Ÿฌ ์ „๊ฐœํ•˜๋ฉด, \(f(x) = f(x^k) + \nabla f (x^k)^T (x - x^k) + \order{\norm{x - x^k}^2}\) ์ฆ‰, $x = x^{k+1} = x^k - \alpha_k \nabla{f(x^k)}$ ๋Œ€์ž…ํ•˜๋ฉด, \(f(x^{k+1}) = f(x^k) - \alpha_k \norm{\nabla{f(x^k)}}^2 + \order{\alpha_k^2}\) ์ ๋‹นํžˆ $\alpha_k$๋ฅผ ์ถฉ๋ถ„ํžˆ ์ž‘๊ฒŒ ์žก์œผ๋ฉด, $f(x^{k+1})$ ์ด $f(x^k)$๋ณด๋‹ค ์ž‘๊ฒŒ ํ•  ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™๋‹ค.

์ผ๋ฐ˜์ ์œผ๋กœ, Gradient Descent๋Š” Global ํ•œ ์ตœ์ ํ•ด๋ฅผ ๋ณด์žฅํ•˜์ง€ ์•Š๋Š”๋‹ค. Localํ•œ ์ตœ์ ํ•ด๋ฅผ ์ฐพ์•„๊ฐ„๋‹ค๋Š” ๊ฒƒ๋„ ์ผ๋ฐ˜์ ์ธ ์กฐ๊ฑด์—์„œ๋Š” ์•ˆ ๋˜๊ณ โ€ฆ ๋Œ€์‹ ์—, ์กฐ๊ฑด์ด ์ถฉ๋ถ„ํžˆ ์ข‹์œผ๋ฉด ๊ฑฐ์˜ ๋น„์Šทํ•œ ๋ช…์ œ, $\nabla f(x^k) \to 0$ ์„ ๋ณด์ผ ์ˆ˜ ์žˆ๋‹ค.

Theorem (Convergence of Gradient Descent)
If $f : \R^p \to \R$ is differentiable, $\nabla f$ is $L$-Lipschitz continuous, and $f$ attains a minimum value that is not $-\infty$, then gradient descent \(x^{k+1} = x^k - \alpha \nabla{f(x^k)}\) guarantees $\nabla f(x^k) \to 0$ for any $\alpha \in \left(0, \frac{2}{L}\right)$.
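A quick numerical check of the theorem (a toy example I set up, not from the lecture): for $f(x) = \frac{1}{2} x^T Q x$ the gradient $\nabla f(x) = Qx$ is $L$-Lipschitz with $L$ the largest eigenvalue of $Q$, and any step size in $(0, 2/L)$ sends the gradient norm to zero, even near the upper end of the range:

```python
import numpy as np

# f(x) = 0.5 x^T Q x, grad f(x) = Q x; grad f is L-Lipschitz with L = lambda_max(Q).
Q = np.diag([1.0, 3.0])
L = 3.0

def grad_norm_after(alpha, iters=500):
    """Run fixed-step gradient descent and return ||grad f(x^k)|| at the end."""
    x = np.array([1.0, 1.0])
    for _ in range(iters):
        x = x - alpha * (Q @ x)
    return float(np.linalg.norm(Q @ x))

final_norm = grad_norm_after(alpha=1.9 / L)  # step size just below 2/L
```

With $\alpha = 1.9/L$, each eigen-component of $x^k$ is multiplied by $|1 - \alpha \lambda_i| < 1$ per step, so the gradient norm decays to zero; at $\alpha = 2/L$ the worst component would oscillate without shrinking.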

Remark The $L$-Lipschitz condition is needed because $\nabla f$ must be suitably smooth for the Taylor approximation to be accurate.

์ด ์ƒ๊ฐ์„ ๋ณด์กฐ์ •๋ฆฌ๋กœ ํ™œ์šฉํ•œ๋‹ค.

Lemma (Lipschitz Gradient Lemma)

  • If $f : \R^p \to \R$ is differentiable and $\nabla f$ is $L$-Lipschitz continuous, then \(f(x + \delta) \leq f(x) + \nabla f(x)^T \delta + \frac{L}{2}\norm{\delta}^2\)
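The inequality can be sanity-checked numerically; a sketch on $f(x) = \frac{1}{2}\norm{x}^2$ (my own choice of test function), where $\nabla f(x) = x$ is $1$-Lipschitz and the bound in fact holds with equality:

```python
import numpy as np

# Check f(x + d) <= f(x) + grad_f(x)^T d + (L/2) ||d||^2
# for f(x) = 0.5 ||x||^2, grad f(x) = x, L = 1.
rng = np.random.default_rng(1)
L = 1.0

def f(x):
    return 0.5 * float(np.dot(x, x))

holds = True
for _ in range(100):
    x = rng.standard_normal(3)
    d = rng.standard_normal(3)
    lhs = f(x + d)
    rhs = f(x) + float(np.dot(x, d)) + 0.5 * L * float(np.dot(d, d))
    holds = holds and (lhs <= rhs + 1e-12)
```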

Lemma์˜ ์ฆ๋ช…(์‚ด์ง Roughํ•˜๊ฒŒ)

  • Let $g(t) = f(x + t\delta)$; then $g'(t) = \nabla f(x + t\delta)^T \delta$. A direct computation shows that $g'$ is $L\norm{\delta}^2$-Lipschitz.
  • Now compute $f(x + \delta) = g(1)$ as follows. \(f(x + \delta) = g(1) = g(0) + \int_{0}^{1} g'(t) \dd{t} \leq f(x) + \int_{0}^{1} (g'(0) + L\norm{\delta}^2 t) \dd{t} = f(x) + \nabla f(x)^T \delta + \frac{L}{2}\norm{\delta}^2\)
  • Hence the stated inequality holds. ◻

์ˆ˜ํ•™์ ์œผ๋กœ ๊น”๋”ํ•˜๊ฒŒ ์ฆ๋ช…ํ•˜๊ธฐ ์œ„ํ•ด, ๋ณด์กฐ์ •๋ฆฌ๋ฅผ ํ•˜๋‚˜ ๋” ์“ฐ์ž.
Lemma (Summability Lemma)

  • If non-negative sequences $V_k$, $S_k$ satisfy $V_{k+1} \leq V_k - S_k$, then $S_k \to 0$.

Lemma์˜ ์ฆ๋ช…

  • Telescoping gives $V_{k+1} + \sum_{i = 0}^{k} S_i \leq V_0$; letting $k \to \infty$, we get $\sum_{i = 0}^{\infty} S_i \leq V_0 < \infty$. Since the terms of a convergent series tend to 0, the claim follows. ◻

Finally, we prove the theorem above: the convergence of gradient descent, in the sense $\nabla f(x^k) \to 0$.

  • By the Lipschitz Gradient Lemma with $\delta = -\alpha \nabla f(x^k)$, we have $f(x^{k+1}) \leq f(x^k) - \alpha\left(1 - \frac{\alpha L}{2}\right)\norm{\nabla f(x^k)}^2$.
  • Setting $V_k = f(x^k) - f(x^\star)$ and $S_k = \alpha\left(1 - \frac{\alpha L}{2}\right)\norm{\nabla f(x^k)}^2$, both sequences are non-negative for $\alpha \in \left(0, \frac{2}{L}\right)$, so they satisfy the hypotheses of the Summability Lemma. Therefore $S_k \to 0$, i.e. $\norm{\nabla f(x^k)}^2 \to 0$, and hence $\nabla f(x^k) \to 0$. ◻