
์ด ๋งํฌ ์˜ supervised learning setup์— ๊ธฐ๋ฐ˜ํ•ฉ๋‹ˆ๋‹ค.

Support Vector Machines

๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ Linear Classification model ์ค‘ ํ•˜๋‚˜์ธ support vector machine (SVM) ์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

์–ด๋–ค ๋ฐ์ดํ„ฐ $x_1, \dots x_n$ ๊ณผ ์ด๋“ค์˜ โ€œboolean labelโ€ ์ด ์ฃผ์–ด์ ธ ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•˜๊ณ , ์ด๋“ค์„ ๋ถ„๋ฅ˜ํ•˜๋Š” ๋ฌธ์ œ๋ฅผ ์ƒ๊ฐํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์ฆ‰, $y_i \in \Set{1, -1}$ ์ด๊ณ , ์—ฌ๊ธฐ์„œ $f : \R^n \to \Set{1, -1}$์„ ์ถ”์ธกํ•˜๊ณ  ์‹ถ์€๋ฐ, $f(x_i) = y_i$ ๋ฅผ ์•Œ๊ณ  ์žˆ๋Š” ์ƒํ™ฉ์ž…๋‹ˆ๋‹ค.

The simplest model one can think of is to draw a single hyperplane in space and declare everything above it to be 1 and everything below it to be -1. That is, for parameters $\theta = (w, b)$, we take the model \(g_{w, b}(x) = \sgn(w^T x + b)\) (where $\sgn$ is the function assigning 1 to positive numbers and -1 to negative numbers). We expect this model to approximate $f$ well. Since the decision boundary (the surface separating 1 from -1) determined by $w$ and $b$ is a hyperplane, this model is called a "linear classifier".
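
To make the model concrete, here is a minimal numpy sketch of $g_{w,b}$ (the values of $w$ and $b$ below are arbitrary illustrative numbers, not trained parameters):

```python
import numpy as np

def g(x, w, b):
    """Linear classifier g_{w,b}(x) = sgn(w^T x + b)."""
    # np.sign returns 0 exactly on the boundary; we arbitrarily map that to +1
    s = np.sign(w @ x + b)
    return 1 if s == 0 else int(s)

w = np.array([1.0, -2.0])
b = 0.5
print(g(np.array([3.0, 1.0]), w, b))   # sgn(3.0 - 2.0 + 0.5) = 1
```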

Now, we look for the $w, b$ minimizing \(\underset{w \in \R^n, b \in \R}{\minimize}\ \sum_i \abs{g_{w, b}(x_i) - y_i}^2\) Since $\abs{g_{w, b}(x_i) - y_i}$ is either 0 (when $g$ is correct) or 2 (when $g$ is wrong), it makes little difference whether we square it or not.

Large Margin Classification

Now suppose there are several "correct answers". That is, in the problem \(\underset{w \in \R^n, b \in \R}{\minimize}\ \sum_i \abs{g_{w, b}(x_i) - y_i}^2\) there are multiple $w, b$ achieving objective value 0. Illustrated:

drawing
(Image source: Wikipedia)

์ด ๊ทธ๋ฆผ์—์„œ, ์ดˆ๋ก์ƒ‰ ์„  $H_3$์€ ๋ถ„๋ฅ˜๊ฐ€ ์˜ฌ๋ฐ”๋ฅด์ง€ ์•Š์œผ๋ฏ€๋กœ ์น˜์šฐ๊ณ , $H_1$ ๊ณผ $H_2$๋ฅผ ๋น„๊ตํ•˜๋Š” ์ƒํ™ฉ์ž…๋‹ˆ๋‹ค. $H_1$๋ณด๋‹ค $H_2$๊ฐ€ ๋ญ”๊ฐ€ ์‚ฌ์ด์˜ ๋นˆ ๊ณต๊ฐ„์„ ๊น”๋”ํ•˜๊ฒŒ ์ž˜๋ผ์ฃผ๊ณ  ์žˆ๋Š” ๋А๋‚Œ์„ ๋ฐ›๋Š”๋ฐ, ์ด๋Š” ์„ ์—์„œ ๋ฐ์ดํ„ฐ๊ฐ€ ๋–จ์–ด์ง„ ์ •๋„ - margin ์ด ๋ณด๋‹ค ํฌ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. margin์ด ํฌ๋‹ค๋Š” ๊ฒƒ์€, ์ง€๊ธˆ ๋ณธ ๋ฐ์ดํ„ฐ์™€ โ€œ๋น„์Šทํ•œโ€ ๋ฐ์ดํ„ฐ๊ฐ€ ๋” ๋“ค์–ด์™”์„๋•Œ - ์ฆ‰, ์ € ๊ทธ๋ฆผ์—์„œ, ํฐ์ƒ‰ ๋˜๋Š” ๊ฒ€์€์ƒ‰ ์ ๋“ค์ด ๋ชฐ๋ ค์žˆ๋Š” ๋ถ€๋ถ„ ์ฃผ์œ„์— ์ƒˆ๋กœ์šด ์ ์ด ๋˜ ์ฐํ˜”์„ ๋•Œ - ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋ถ„๋ฅ˜ํ•  ๊ฒƒ์œผ๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค.

๋”ฐ๋ผ์„œ, ๊ฐ€๋Šฅํ•˜๋‹ค๋ฉด margin์ด ํฐ ์ดˆํ‰๋ฉด์„ ๊ณ ๋ฅด๋Š” ๊ฒƒ์ด ๋ฐ”๋žŒ์งํ•ฉ๋‹ˆ๋‹ค.

์ด๋•Œ, Margin์ด ์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐ๋˜๋Š”์ง€๋ฅผ ์ƒ๊ฐํ•ด ๋ด…์‹œ๋‹ค. margin์ด๋ผ๋Š” ๊ฒƒ์€ ์ดˆํ‰๋ฉด์—์„œ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋ฐ์ดํ„ฐ ์ ๋“ค์— ์˜ํ•ด ์ •์˜๋˜๋ฉฐ, $w$์™€ $b$๋ฅผ ์ ๋‹นํžˆ ์ƒ์ˆ˜๋ฐฐํ•˜๋ฉด, ์ด ์ดˆํ‰๋ฉด์˜ ๋ฒ•์„ ๋ฒกํ„ฐ $w$์™€ ๊ฐ™์€ ๋ฒ•์„ ๋ฒกํ„ฐ๋ฅผ ๊ฐ€์ง€๋ฉด์„œ (ํ‰ํ–‰ํ•˜๋ฉด์„œ) $y_i = 1$์ธ ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋ฐ์ดํ„ฐ์ ์„ ์ง€๋‚˜๋Š” ํ‰๋ฉด์ด ์ •ํ™•ํžˆ $w^T x + b = 1$ ์ด ๋˜๊ฒŒ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฆ‰, $y_i = 1$ ์ธ ๋ชจ๋“  ์ ๋“ค์— ๋Œ€ํ•ด, $w^T x + b \geq 1$ ์ด ๋จ์„ ๊ฐ•์ œํ•œ๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค. ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ $y_i = -1$์ธ ์ ๋“ค์— ๋Œ€ํ•ด $w^T x + b \leq -1$ ์ด ๋˜๊ฒŒ ํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ, ์šฐ๋ฆฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ œ์•ฝ ์กฐ๊ฑด ์œผ๋กœ ๋ฐ”๊ฟ€ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. \(y_i (w^T x_i + b) \geq 1\) ์ด์ œ, ์ด ์กฐ๊ฑด์„ ๋งŒ์กฑํ•˜๋Š” ๋ชจ๋“  ์ดˆํ‰๋ฉด $(w, b)$ ๋“ค์€ ์˜ฌ๋ฐ”๋ฅธ classifier๋“ค์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ, ์ด์ œ margin์€ ๋‘ ํ‰๋ฉด $w^T x + b = 1$ ๊ณผ $w^T x + b = -1$ ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๊ฐ€ ๋˜๋ฉฐ, ์ด๋Š” ๊ธฐํ•˜์ ์œผ๋กœ ${2}/{\norm{w}}$ ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, margin์ด ์ตœ๋Œ€์ธ ์˜ฌ๋ฐ”๋ฅธ classifier๋ฅผ ์ฐพ๋Š” ๋ฌธ์ œ๋Š” ๋ณธ์งˆ์ ์œผ๋กœ ๋‹ค์Œ์˜ ์ œ์•ฝ ์žˆ๋Š” ์ตœ์ ํ™” ๋ฌธ์ œ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. \(\underset{w \in \R^n, b \in \R}{\minimize}\ \norm{w}^2 \quad \subto \ \ y_i (w^T x_i + b) \geq 1\) ์—ฌ๊ธฐ์„œ๋„ $\norm{\cdot}$์„ ์ตœ์ ํ™”ํ•˜๋“  ๊ทธ ์ œ๊ณฑ์„ ์ตœ์ ํ™”ํ•˜๋“  ์ˆ˜ํ•™์ ์œผ๋กœ๋Š” ๊ฐ™์€ ๋ฌธ์ œ์ง€๋งŒ, computationalํ•˜๊ฒŒ๋Š” ์ œ๊ณฑ์„ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฌธ์ œ๋ฅผ Quadratic Program ์œผ๋กœ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ์•Œ๊ณ ๋ฆฌ์ฆ˜์ ์ธ ์ธก๋ฉด์—์„œ ์ œ๊ณฑ์„ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค.

Soft Margin SVM

The problem above was solved under the assumption that the given data can be perfectly separated by a hyperplane, i.e., that it is linearly separable. Naturally, we would like this argument to extend to problems that are only "almost" linearly separable, and there are broadly two ways to do so (which turn out to be equivalent).

Continuing the constrained-optimization viewpoint, we modify the problem as follows: \(\underset{w \in \R^n, b \in \R}{\minimize}\ \norm{w}^2 + C \sum h_i \quad \subto \ \ y_i (w^T x_i + b) \geq 1 - h_i, \ \ h_i \geq 0\) That is, we allow $y_i (w^T x_i + b)$ to drop slightly below 1, but penalize the corresponding slack in the objective.
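
This slack formulation carries over almost verbatim to a solver; a minimal cvxpy sketch, assuming `X`, `y` as in the previous example and treating `C` as a free hyperparameter:

```python
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    n, d = X.shape
    w, b, h = cp.Variable(d), cp.Variable(), cp.Variable(n)
    # minimize ||w||^2 + C * sum(h)  s.t.  y_i (w^T x_i + b) >= 1 - h_i, h_i >= 0
    problem = cp.Problem(
        cp.Minimize(cp.sum_squares(w) + C * cp.sum(h)),
        [cp.multiply(y, X @ w + b) >= 1 - h, h >= 0],
    )
    problem.solve()
    return w.value, b.value
```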

Alternatively, we can introduce the Hinge Loss, turning this into an entirely unconstrained optimization problem: \(\underset{w \in \R^n, b \in \R}{\minimize}\ \norm{w}^2 + C \sum \max(0, 1 - y_i (w^T x_i + b))\)
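
Being unconstrained, this form can also be attacked directly with (sub)gradient descent instead of a QP solver; a rough numpy sketch (the step size and iteration count are arbitrary, untuned choices):

```python
import numpy as np

def hinge_svm(X, y, C=1.0, lr=1e-3, steps=5000):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1                   # points with nonzero hinge loss
        # subgradient of ||w||^2 + C * sum(max(0, 1 - y_i (w^T x_i + b)))
        grad_w = 2 * w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```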

We will proceed with the constrained-optimization viewpoint, which is somewhat more convenient computationally. This problem can now be solved using the method of Lagrange multipliers (primal-dual). Also, to simplify the later computation, we will optimize $\frac{1}{2}\norm{w}^2$ instead of $\norm{w}^2$. Concretely, \(\mathcal{L}(w, b, h, \alpha, \eta) = \frac{1}{2}\norm{w}^2 + C \sum h_i + \sum \alpha_i(1 - h_i - y_i (w^T x_i + b)) - \sum \eta_i h_i\) To form the Lagrangian dual of this function, we compute the partial derivatives: \(\begin{align} \pdv{\mathcal{L}}{w} &= w - \sum \alpha_i y_i x_i = 0\\ \pdv{\mathcal{L}}{b} &= - \sum \alpha_i y_i = 0 \\ \pdv{\mathcal{L}}{h_i} &= C - \alpha_i - \eta_i = 0 \end{align}\)

Substituting these conditions back into $\mathcal{L}$ (the first gives $w = \sum \alpha_i y_i x_i$, the second kills the term in $b$, and the third, together with $\eta_i \geq 0$, caps each $\alpha_i$ at $C$ and cancels the terms in $h$), we obtain \(\minimize \ \frac{1}{2}\alpha^T Q \alpha - \alpha^T 1 \quad \subto \ \ \alpha^T y = 0, \ \alpha_i \in [0, C]\) where $Q$ is the matrix with $Q_{ij} = y_i y_j \inner{x_i}{x_j}$. This is in the end a quadratic optimization problem with linear constraints, so it can be solved by quadratic programming. (Note: Wikipedia organizes this as a maximization problem instead, so the optimization direction there is reversed; the two formulations are mathematically equivalent and computationally essentially the same.)
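
A minimal cvxpy sketch of this dual QP, again assuming the hypothetical `X`, `y` from before (the small diagonal jitter is only there because `quad_form` insists on a numerically PSD matrix):

```python
import cvxpy as cp
import numpy as np

def svm_dual(X, y, C=1.0):
    n = X.shape[0]
    Yx = y[:, None] * X
    Q = Yx @ Yx.T                      # Q_ij = y_i y_j <x_i, x_j>
    alpha = cp.Variable(n)
    problem = cp.Problem(
        cp.Minimize(0.5 * cp.quad_form(alpha, Q + 1e-9 * np.eye(n)) - cp.sum(alpha)),
        [alpha @ y == 0, alpha >= 0, alpha <= C],
    )
    problem.solve()
    # recover the primal solution from the stationarity condition
    w = Yx.T @ alpha.value             # w = sum_i alpha_i y_i x_i
    return alpha.value, w
```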

Kernel Methods

SVM์€ ๊ธฐ๋ณธ์ ์œผ๋กœ linear ํ•˜์ง€๋งŒ, kernel method๋ฅผ ์ด์šฉํ•˜๋ฉด nonlinearํ•œ decision boundary๋ฅผ ์žก๋Š” ๊ฒƒ๋„ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

์ž์„ธํ•œ ๊ฒƒ์€ ๋ณ„๋„ ํฌ์ŠคํŒ… (๋‘์„ธ๋ฒˆ์— ๊ฑธ์ณ์„œ) ํ•  ์˜ˆ์ •์ด์ง€๋งŒ, ๊ฐ„๋‹จํžˆ ๋…ผ์ฆํ•˜์ž๋ฉด ์ด๋ ‡์Šต๋‹ˆ๋‹ค.

drawing
(Image source: Benyamin Ghojogh et al., Reproducing Kernel Hilbert Space, Mercer's Theorem, Eigenfunctions, Nyström Method, and Use of Kernels in Machine Learning: Tutorial and Survey)

์ด์™€ ๊ฐ™์ด, ๊ณ ์ฐจ์›์—์„œ๋Š” ์„ ํ˜•์œผ๋กœ ๋ถ„๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•จ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ์ €์ฐจ์›์œผ๋กœ ๋ˆŒ๋Ÿฌ๋†“๋‹ค ๋ณด๋‹ˆ ์„ ํ˜•์œผ๋กœ ๋ถ„๋ฆฌ๊ฐ€ ๋ถˆ๊ฐ€๋Šฅํ•ด์ง„ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. ์ ์ ˆํ•œ mapping $\phi$ ๋ฅผ ๋งŒ๋“ค์–ด์„œ, $\phi(x_i)$ ๋“ค์„ ๋ถ„๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด ๋” ๊ฐ•ํ•œ classification์ด ๊ฐ€๋Šฅํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ Kernel method ๋ผ ํ•˜๋ฉฐ, ์ด๋•Œ ๋ฐ์ดํ„ฐ๋“ค์ด ์˜ฌ๋ผ๊ฐ€ ์•‰๋Š” ์ƒˆ๋กœ์šด ๊ณต๊ฐ„์„ RKHS (Reproducing Kernel Hilbert Space) ๋ผ ํ•ฉ๋‹ˆ๋‹ค.