Back to : ml-study

Multivariate Linear Regression

  • ๋‹ค์ˆ˜์˜ Feature $(x_1, x_2, x_3, \dots)$ ๋กœ๋ถ€ํ„ฐ, output variable $y$ ๋ฅผ predictํ•˜๊ณ  ์‹ถ๋‹ค.
  • Notation ์ •๋ฆฌ: Data $i$๋ฒˆ์งธ์˜ $j$๋ฒˆ์งธ feature๋ฅผ $x^{(i)}_j$ ๋กœ ์“ฐ๊ธฐ๋กœ ํ•˜์ž. ์ด์ œ, $x^{(i)}$ ๋Š” ํ•˜๋‚˜์˜ training data์ธ๋ฐ, ์ด๋Š” n-vector ํ˜•ํƒœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ๋œ๋‹ค.
  • Linear Regression์˜ ์›๋ฆฌ๋Š” ๋˜‘๊ฐ™๋‹ค. ๋‹ค๋ณ€์ˆ˜ ์„ ํ˜•ํ•จ์ˆ˜๋ฅผ Hypothesis๋กœ ์‚ผ๊ณ โ€ฆ \(h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots\)
  • 0๋ฒˆ์งธ feature $x_0 = 1$ ์„ ํ•ญ์ƒ ๊ณ ์ •ํ•˜๋ฉด, $x$์™€ $\theta$๋ฅผ $n+1$๊ฐœ์งœ๋ฆฌ vector๋กœ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๊ณ , ๋ฒกํ„ฐ์˜ ๋‚ด์ ์œผ๋กœ ์œ„ Hypothesis๋ฅผ ์‰ฝ๊ฒŒ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ๋‹ค. \(h_\theta = \mathbf{\theta}^T \mathbf{x}\) (์›๋ž˜ ๋ฒกํ„ฐ๋Š” \mathbf ๋กœ ์“ฐ๊ณ  ์‹ถ์ง€๋งŒ, ์•„๋งˆ ๊ท€์ฐฎ์œผ๋‹ˆ ๋Œ€์ถฉ ์“ธ ๋“ฏโ€ฆ ๋งฅ๋ฝ์ƒ ์œ ์ถ”ํ•  ์ˆ˜ ์žˆ๋‹ค)
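
The inner-product form can be sketched in a few lines of NumPy (the feature values and parameters below are made up purely for illustration):

```python
import numpy as np

# One training example with n = 3 features (hypothetical numbers).
x_raw = np.array([2104.0, 5.0, 45.0])      # x_1, x_2, x_3
x = np.concatenate(([1.0], x_raw))         # prepend the fixed x_0 = 1
theta = np.array([10.0, 0.1, 3.0, -2.0])   # theta_0 .. theta_3, (n+1)-vector

# h_theta(x) = theta^T x, written as an inner product
h = theta @ x
```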

Multivariate Gradient Descent

  • ํ‘ธ๋Š” ๋ฐฉ๋ฒ•๋„ Single variable ๊ฒฝ์šฐ์™€ ๋ณ„๋กœ ๋‹ค๋ฅด์ง€ ์•Š๋‹ค. Gradient descent๋ฅผ ์ƒ๊ฐํ•˜์ž.
  • Parameter๊ฐ€ $n+1$ vector, Hypothesis๋Š” ์œ„ ๋‚ด์  ํ•จ์ˆ˜์ด๋ฏ€๋กœ, cost function๋งŒ ์ž˜ ์ •์˜ํ•˜๋ฉด ๋.
  • Cost๋„ ๋น„์Šทํ•˜๊ฒŒ.. \(J(\theta) = \frac{1}{2m} \sum_{i = 1}^{m} (\theta^T x^{(i)} - y^{(i)})^2\)
  • $n+1$๊ฐœ์˜ parameter์— ๋Œ€ํ•œ, ํŽธ๋ฏธ๋ถ„์„ ์ด์šฉํ•œ Simultaneous update๋ฅผ ํ•œ๋‹ค.

Feature Scaling

  • Feature๋“ค์ด ๋น„์Šทํ•œ Scale์— ์žˆ์œผ๋ฉด, gradient descent๊ฐ€ ๋” ๋น ๋ฅด๊ฒŒ ์ˆ˜๋ ดํ•˜๊ฒŒ ํ•  ์ˆ˜ ์žˆ๋‹ค.
  • ์˜ˆ๋ฅผ ๋“ค์–ด, ์ฃผํƒ ๊ฐ€๊ฒฉ์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐ ์žˆ์–ด์„œโ€ฆ
    • size๊ฐ€ $[0, 2000]$ ๋ฒ”์œ„์— ์žˆ๊ณ 
    • number of bedrooms ๊ฐ€ $[1, 5]$ ๋ฒ”์œ„์— ์žˆ๋‹ค๋ฉด?
  • Cost function์˜ contour๊ฐ€ ๋งค์šฐ ๊ธธ๊ณ  ์–‡์€ ํƒ€์›ํ˜•์œผ๋กœ ๋‚˜ํƒ€๋‚˜๊ฒŒ ๋˜๊ณ ,
  • ์ด ํƒ€์›์˜ ์žฅ์ถ•์„ ๋”ฐ๋ผ ๋‚ด๋ ค๊ฐ€๊ธฐ์—๋Š” ๋„ˆ๋ฌด ์˜ค๋žœ ์‹œ๊ฐ„์ด ๊ฑธ๋ฆด ์ˆ˜๋„ ์žˆ๋‹ค!
  • ๋ฐ˜๋Œ€๋กœ, ๋„ˆ๋ฌด ์งง์€ ๋‹จ์ถ•์„ ๋”ฐ๋ผ๊ฐ€๋‹ค ๋ณด๋ฉด ์ˆ˜๋ ดํ•˜์ง€ ๋ชปํ•  ์ˆ˜๋„ ์žˆ๋‹ค.
  • ๊ฐ Feature๋ฅผ, ๊ทธ feature๊ฐ€ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” ์ตœ๋Œ“๊ฐ’์œผ๋กœ ๋‚˜๋ˆ„์–ด $[0, 1]$ ์Šค์ผ€์ผ ์œ„์— ์˜ฌ๋ฆฌ๋ฉด ์ด ๋ฌธ์ œ๋ฅผ ํ”ผํ•  ์ˆ˜ ์žˆ๋‹ค.
  • ๋” ์ผ๋ฐ˜์ ์œผ๋กœ๋Š” $[-1, 1]$์— ์˜ฌ๋ฆฐ๋‹ค. ์ตœ์†Ÿ๊ฐ’์ด $-1$, ์ตœ๋Œ“๊ฐ’์ด $1$์ด ๋˜๊ฒŒ.
  • ๊ผญ ๊ทธ ๋ฒ”์œ„์ธ ๊ฒƒ์€ ์•„๋‹ˆ๊ณ โ€ฆ ๋Œ€์ถฉ ๋น„์Šทํ•œ ๋ฒ”์œ„ ๋‚ด๋กœ feature๋“ค์ด ๋“ค์–ด์˜ค๊ฒŒ ํ•˜๋Š” ๊ฒƒ.
  • Mean Normalization : $x_i$ ๋ฅผ $x_i - \mu_i$ ๋กœ ๋Œ€์ฒดํ•˜์—ฌ, approximately zero mean์„ ์œ ์ง€ํ•˜๊ฒŒ ํ•˜๊ธฐ.
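
A minimal sketch of both rescaling schemes, applied column-wise to a matrix holding only the raw features (the constant $x_0 = 1$ column should not be scaled). Note that the division by the feature's range in `mean_normalize` is a common companion to the mean subtraction, an assumption added here rather than something the note above specifies:

```python
import numpy as np

def max_scale(X):
    """Divide each feature column by its maximum: roughly onto [0, 1]."""
    return X / X.max(axis=0)

def mean_normalize(X):
    """Subtract the column mean (approximately zero mean), then divide by
    the column range so values land near [-1, 1]."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s
```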

Learning Rate

\(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\)

  • How to choose $\alpha$?
  • Practical tips.
    • Keep plotting the value of $J(\theta)$ against the number of gradient descent iterations ($t$).
    • We want this plot to be a decreasing function, since our goal is to minimize $J$.
    • A well-designed convergence test also works… e.g., stop when $J$ fails to drop by at least $k$ over some number of further iterations.
  • Not working?
    • $J$ increases: $\alpha$ is probably too large.
    • Overshooting: $\alpha$ is probably too large.
    • Too slow: $\alpha$ is probably too small.
  • Mathematically proven result: if $\alpha$ is sufficiently small, $J$ decreases on every iteration of gradient descent (when $J$ is a convex function…).
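
The monitoring tips above can be sketched as follows: record $J(\theta)$ every iteration and stop once the per-iteration drop falls below a threshold (the function names and the threshold `eps` are illustrative):

```python
import numpy as np

def cost(X, y, theta):
    """J(theta) = (1/2m) * sum of squared residuals."""
    m = len(y)
    return ((X @ theta - y) ** 2).sum() / (2 * m)

def run_with_history(X, y, alpha=0.1, max_iters=10000, eps=1e-9):
    """Gradient descent that records J each iteration for plotting,
    and applies a simple convergence test on the drop in J."""
    theta = np.zeros(X.shape[1])
    history = [cost(X, y, theta)]
    for _ in range(max_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(X, y, theta))
        if history[-2] - history[-1] < eps:   # J stopped decreasing: declare convergence
            break
    return theta, history
```

With a sufficiently small `alpha`, the recorded `history` should be a decreasing sequence, matching the proven result above.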

More on features / Regression

  • ์ •๋ง ์šฐ๋ฆฌ๊ฐ€ ๊ฐ€์ง„ data $x_1, x_2, \dots$์— $y$ ๊ฐ’์ด ์„ ํ˜•์ ์ธ ๊ด€๊ณ„๋ฅผ ๊ฐ–๋Š”๊ฐ€?
  • $x_1x_2$ (๊ณฑ)์— ๋Œ€ํ•œ ์„ ํ˜•ํ•จ์ˆ˜๋Š” ์•„๋‹Œ๊ฐ€? (New feature defining)
  • $x_1$์— ๋Œ€ํ•œ ์ด์ฐจํ•จ์ˆ˜, ์‚ผ์ฐจํ•จ์ˆ˜ โ€ฆ ๋Š” ์•„๋‹Œ๊ฐ€? (Polynomial Regression)
    • Hypothesis $h$๋ฅผ ๋‹คํ•ญ์‹์œผ๋กœ ์žก์œผ๋ฉด ์‰ฝ๊ฒŒ ๊ฐ€๋Šฅ. \(h_\theta(x) = \theta_0 + \theta_1 x^2 + \theta_2 x^2 + \dots\)
    • Linear regression๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, $J$๋ฅผ ์ž˜ ๊ณ„์‚ฐ, ํŽธ๋ฏธ๋ถ„ํ•ด์„œ ๋Œ๋ฆฐ๋‹ค.
  • ์ด์ฐจ ์ด์ƒ์˜ ๊ฒฝ์šฐ Feature scaling์ด ๋” ์ค‘์š”ํ•ด์ง„๋‹ค.
    • $[1, 100]$์ด ์ด์ฐจ๊ฐ€ ๋˜๋ฉด $[1, 10000]$์ด ๋˜๊ธฐ ๋•Œ๋ฌธโ€ฆ
  • ๊ผญ ๋‹คํ•ญ์‹์ผ ํ•„์š”๋„ ์—†๋‹ค. $h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 \sqrt{x}$ ๊ฐ™์€ ๊ฒƒ๋„ ๊ฐ€๋Šฅ.
  • ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ Insight๊ฐ€ ์ด๋Ÿฐ ์ƒํ™ฉ์—์„œ hypothesis๋ฅผ ๊ฒฐ์ •ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋œ๋‹ค.