
Based on Lecture 1 (September 2) of Mathematical Foundations of Deep Neural Networks.

This document was produced by converting $\LaTeX$ with pandoc, so the layout may not be clean. If a PDF version of these notes is ever published, it would be better to refer to that instead.

## Intro to Optimization

We consider the following kind of optimization problem.

์ต์ ํ ๋ฌธ์ ๋, ์ด๋ค Decision variable $x$๋ฅผ ์กฐ์ ํ์ฌ Objective function $f$๋ฅผ ์ต์ํ / ์ต๋ํ ํ๋ ๋ฌธ์ ๋ฅผ ๋งํ๋ค. ์ด๋, equality ๋๋ inequality constraint (์ ์ฝ์กฐ๊ฑด) ๋ค์ด ์ฃผ์ด์ง ์ ์๋ค. ์ฆ ๋ค์๊ณผ ๊ฐ์ ํํ์ ๋ฌธ์ ๋ค. $$\underset{x \in \R^p}{\minimize} \ f(x) \ \subto \ h(x) = 0,\ g(x) \leq 0$$

์ด์ฐจํผ ์ต์ํ์ ์ต๋ํ๋ ๋ถํธ๋ฅผ ๋ฐ๊พธ๋ฉด ๋์น์ด๋ฏ๋ก, ์์ผ๋ก๋ ์ต์ํ ๋ฌธ์ ๋ง ์๊ฐํ๋ค.

๋ชจ๋  ์ ์ฝ ์กฐ๊ฑด์ ๋ง์กฑํ๋ ์ ์ Feasible point๋ผ ํ๋ค. Feasible point๊ฐ ์์ ์๋ ๊ฒฝ์ฐ infeasibleํ๋ค.

์ต์ ํ ๋ฌธ์ ๋ฅผ Constraint ์ ๋ฌด์ ๋ฐ๋ผ Unconstrained / Constrained๋ก ๋๋๋ค.

์ต์ ํ ๋ฌธ์ ์ ๋ต์ Optimal value $p^\star$ ์ solution $x^\star$๋ฅผ ์ฐพ๋ ๊ฒ์ด๋ค. ์ฆโฆ $$p^* = \inf \Setcond{f(x)}{x \in \R^n, x \text{ feasible }}, \quad f(x^*) = p^*$$ ML ์ธํ์์๋, $0 \leq p^* < \infty$ ์ธ ๊ฒฝ์ฐ๋ฅผ ์ฃผ๋ก ์๊ฐํ๋ค.

์์ : Curve-Fitting, ์ฆ ์๋ ฅ ๋ฐ์ดํฐ $x_1 \dots x_n$ ๊ณผ ๊ทธ label $y_1 \dots y_n$์ ๋ํด, ์๋ ฅ๊ฐ๊ณผ label์ ๊ด๊ณ๋ฅผ ์ฐพ๋ ๋ฌธ์ . ๋ํ์ ์ผ๋ก, Least square ๋ฌธ์ ๋ฅผ ์๊ฐํ  ์ ์๋ค. ์ฃผ์ด์ง ์๋ ฅ๊ณผ ๊ฒฐ๊ณผ๋ฅผ ๊ฐ์ฅ ๊ฐ๊น๊ฒ ๊ทผ์ฌํ๋ ์ผ์ฐจํจ์ ์ฐพ๊ธฐ.

์ต์ ํ ๋ฌธ์ ์์๋ ๊ทน์ (Local minima) ์ ์ต์ (Global minima) ๋ฅผ ์๊ฐํด์ผ ํ๋ค. ์ผ๋ฐ์ ์ผ๋ก, non-convex ํจ์์ global minima๋ฅผ ์ฐพ๋ ๊ฒ์ ์ด๋ ต๋ค.

๋ฏธ๋ถ๊ฐ๋ฅํ ํจ์ $f$์ ๋ํด, ๊ฐ์ฅ ๊ฐ๋จํ unconstrained optimization problem์ ํด๊ฒฐํ๊ณ ์ ํ๋ค. $$\underset{x \in \R^p}{\minimize} \ f(x)$$ ML ์ธํ์์๋ almost everywhere differentiable์ด๋ฉด ๋์ถฉ ๋ฏธ๋ถ๊ฐ๋ฅํ๋ค๊ณ  ๋งํ๋ ๊ฒฝ์ฐ ๋ง์.

• ์์์ ์์์  $x^0 \in \R^p$ ๋ฅผ ์ก๊ณ , ์ ์ ํ $\alpha_k > 0$ ์ ๋ํด ๋ค์์ ๋ฐ๋ณตํ๋ค. $$x^{k+1} = x^k - \alpha_k \nabla{f(x^k)}$$

Rough idea: Taylor-expanding $f$ near $x^k$, $$f(x) = f(x^k) + \nabla f(x^k)^T (x - x^k) + O\!\left(\|x - x^k\|^2\right)$$ Substituting $x = x^{k+1} = x^k - \alpha_k \nabla f(x^k)$ gives $$f(x^{k+1}) = f(x^k) - \alpha_k \|\nabla f(x^k)\|^2 + O(\alpha_k^2)$$ so if $\alpha_k$ is chosen sufficiently small, we should be able to make $f(x^{k+1})$ smaller than $f(x^k)$.
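The iteration above can be sketched on a toy quadratic (a made-up example with a known minimizer, not from the lecture), checking that each step indeed decreases $f$:

```python
import numpy as np

# Gradient descent on f(x) = 1/2 ||x - c||^2 (hypothetical example);
# grad f(x) = x - c, so any fixed step size alpha in (0, 2) works here.
c = np.array([1.0, -2.0])

def f(x):
    return 0.5 * np.sum((x - c) ** 2)

def grad_f(x):
    return x - c

x = np.zeros(2)      # arbitrary starting point x^0
alpha = 0.1          # fixed step size alpha_k = alpha
values = [f(x)]
for _ in range(100):
    x = x - alpha * grad_f(x)   # x^{k+1} = x^k - alpha * grad f(x^k)
    values.append(f(x))

# f(x^{k+1}) < f(x^k) at every step, and x^k approaches the minimizer c.
print(values[-1], x)
```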

์ผ๋ฐ์ ์ผ๋ก, Gradient Descent๋ Global ํ ์ต์ ํด๋ฅผ ๋ณด์ฅํ์ง ์๋๋ค. Localํ ์ต์ ํด๋ฅผ ์ฐพ์๊ฐ๋ค๋ ๊ฒ๋ ์ผ๋ฐ์ ์ธ ์กฐ๊ฑด์์๋ ์ ๋๊ณ โฆ ๋์ ์, ์กฐ๊ฑด์ด ์ถฉ๋ถํ ์ข์ผ๋ฉด ๊ฑฐ์ ๋น์ทํ ๋ช์ , $\nabla f(x^k) \to 0$ ์ ๋ณด์ผ ์ ์๋ค.

Theorem (Convergence of Gradient Descent)
If $f : \mathbb{R}^p \to \mathbb{R}$ is differentiable, $\nabla f$ is $L$-Lipschitz continuous, and $f$ has a finite (not $-\infty$) infimum, then gradient descent $$x^{k+1} = x^k - \alpha \nabla f(x^k)$$ with $\alpha \in \left(0, \frac{2}{L}\right)$ guarantees $\nabla f(x^k) \to 0$.
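A minimal numerical illustration of the theorem, on a quadratic where the Lipschitz constant is simply the largest eigenvalue of the Hessian (a made-up example, not from the lecture):

```python
import numpy as np

# For f(x) = 1/2 x^T A x with A symmetric positive definite, grad f(x) = A x
# is L-Lipschitz with L = lambda_max(A), and f is bounded below by 0.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
L = np.linalg.eigvalsh(A).max()
alpha = 1.9 / L                  # any alpha in (0, 2/L) is allowed

x = np.array([5.0, -7.0])        # arbitrary starting point
for _ in range(500):
    x = x - alpha * (A @ x)      # gradient step

grad_norm = np.linalg.norm(A @ x)
print(grad_norm)                 # the gradient norm has been driven to ~0
```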

Remark: the $L$-Lipschitz condition is needed because $\nabla f$ must be reasonably smooth for the Taylor approximation to be accurate.

์ด ์๊ฐ์ ๋ณด์กฐ์ ๋ฆฌ๋ก ํ์ฉํ๋ค.

• If $f : \mathbb{R}^p \to \mathbb{R}$ is differentiable and $\nabla f$ is $L$-Lipschitz continuous, then the following holds: $$f(x + \delta) \leq f(x) + \nabla f(x)^T \delta + \frac{L}{2}\|\delta\|^2$$

Proof of the lemma (somewhat rough)

• Let $g(t) = f(x + t\delta)$; then $g'(t) = \nabla f(x + t\delta)^T \delta$. A direct computation shows that $g'$ is $L\|\delta\|^2$-Lipschitz.
• Now, since $f(x + \delta) = g(1)$, we compute $$f(x + \delta) = g(1) = g(0) + \int_{0}^{1} g'(t)\,dt \leq f(x) + \int_{0}^{1} \left(g'(0) + L\|\delta\|^2 t\right) dt = f(x) + \nabla f(x)^T \delta + \frac{L}{2}\|\delta\|^2$$
• Hence the stated inequality holds. ∎

์ํ์ ์ผ๋ก ๊น๋ํ๊ฒ ์ฆ๋ชํ๊ธฐ ์ํด, ๋ณด์กฐ์ ๋ฆฌ๋ฅผ ํ๋ ๋ ์ฐ์.
Lemma (Summability Lemma)

• If non-negative sequences $V_k$, $S_k$ satisfy $V_{k+1} \leq V_k - S_k$, then $S_k \to 0$.

Proof of the lemma

• Summing the inequality gives $V_{k+1} + \sum_{i = 0}^{k} S_i \leq V_0$; letting $k \to \infty$, we get $\sum_{i = 0}^{\infty} S_i \leq V_0 < \infty$. Since the general term of a convergent series tends to 0, the claim holds. ∎

์ด์  ๋ง์ง๋ง์ผ๋ก ์์  ์ ๋ฆฌ - Gradient Descent์ ์๋ ด์ฑ ๋น์ทํ ์ ๋ฆฌ - ๋ฅผ ์ฆ๋ชํ๋ค.

• By the Lipschitz gradient lemma with $\delta = -\alpha \nabla f(x^k)$, we have $f(x^{k+1}) \leq f(x^k) - \alpha\left(1 - \frac{\alpha L}{2}\right)\|\nabla f(x^k)\|^2$.
• Now let $V_k = f(x^k) - f(x^\star)$ and $S_k = \alpha\left(1 - \frac{\alpha L}{2}\right)\|\nabla f(x^k)\|^2$. Since $\alpha \in \left(0, \frac{2}{L}\right)$, both sequences are non-negative, and they satisfy the hypothesis of the Summability Lemma. Therefore $S_k \to 0$, i.e., $\|\nabla f(x^k)\|^2 \to 0$, so $\nabla f(x^k) \to 0$. ∎