
์ด ๊ธ€์€ ICML 2015์—์„œ ๋ฐœํ‘œ๋œ ICML 2015์—์„œ ๋ฐœํ‘œ๋œ Ioffe, Szegedy์˜ ๋…ผ๋ฌธ ๊ณผ ์ œ๊ฐ€ ์ˆ˜๊ฐ•ํ–ˆ๋˜ ์‹ฌ์ธต์‹ ๊ฒฝ๋ง ๊ฐ•์˜์— ๊ธฐ๋ฐ˜ํ•ฉ๋‹ˆ๋‹ค.

ICML 2015 ๋…ผ๋ฌธ์—์„œ๋Š” ๋”ฅ๋Ÿฌ๋‹์—์„œ training์„ ๊ฐ€์†ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ, Batch normalization์ด๋ผ๋Š” ๊ธฐ๋ฒ•์„ ์ œ์‹œํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ธ€์—์„œ๋Š” ๊ทธ ๊ธฐ๋ฒ•์— ๋Œ€ํ•ด ์‚ดํŽด๋ด…๋‹ˆ๋‹ค.

Internal Covariate Shift

As always, for a new technique to be worth proposing, it has to improve on the previous method (or on not having such a technique at all), and to see how, we first need to understand what problem exists. One of the most common problems in deep neural networks is the vanishing gradient.

A vanishing gradient occurs when a function like the sigmoid is used as the activation function: if the magnitude of the input is very large, the gradient becomes so small that training no longer progresses properly. Because of this, if a value ever blows up so that the absolute value of $x$ becomes excessively large, then no matter how we compute the gradient at that point, it is too small for the network to escape, and training gets stuck there. Two main remedies have been proposed for this problem.
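As a quick sanity check (my own sketch, not from the paper), we can evaluate the sigmoid's derivative, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, and watch how fast it vanishes as $|x|$ grows:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at 0.25 (at x = 0) and decays exponentially in |x|.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}, gradient = {sigmoid_grad(x):.2e}")
```

At $x = 10$ the gradient is already on the order of $10^{-5}$, so a neuron saturated there receives essentially no learning signal.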

  • Sigmoid๊ฐ™์€ ํ•จ์ˆ˜๋ฅผ ์“ฐ์ง€ ๋ง๊ณ , ReLU๋ฅผ ์“ฐ์ž. ReLU๋Š” $x$๊ฐ€ ์ปค์ ธ๋„ ๊ธฐ์šธ๊ธฐ๊ฐ€ 0์ด ๋˜์ง€ ์•Š๊ณ , ๋Œ€์‹ ์— ์Œ์ˆ˜๊ฐ€ ๋˜๋ฉด 0์ด ๋œ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ๋ณด์™„ํ•  Leaky ReLU๊ฐ™์€ activation function์„ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋งŒ์•ฝ ์ž…๋ ฅ๊ฐ’์ด ๊ณ„์† -1~1, ๋˜๋Š” 0~1 ์ •๋„ ๋ฒ”์œ„ ์‚ฌ์ด๋ฅผ ์™”๋‹ค๊ฐ”๋‹ค ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด, ์‚ฌ์‹ค sigmoid๋ฅผ ์จ๋„ ๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, ์–ด๋–ค ์‹์œผ๋กœ๋“  ์ž…๋ ฅ์„ stabilizeํ•  ์ˆ˜ ์žˆ์œผ๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.
    • ๊ธ€๋กœ ๋‹ค๋ฃฌ ์ ์€ ์—†์ง€๋งŒ, ์ด ๋ฌธ์ œ ๋•Œ๋ฌธ์— ๋”ฅ๋Ÿฌ๋‹์€ (Convexํ•œ ํ•จ์ˆ˜์˜ ์ตœ์ ํ™”์™€๋Š” ๋‹ค๋ฅด๊ฒŒ) initialization์„ ์ž˜ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. Initialization์„ ์–ด๋–ป๊ฒŒ ํ• ์ง€์— ๋Œ€ํ•ด์„œ๋„ ๋งŽ์€ ๋…ผ๋ฌธ๋“ค์ด ์žˆ์Šต๋‹ˆ๋‹ค.
    • ์˜ค๋Š˜ ๋‹ค๋ฃฐ ๋ฐฉ๋ฒ•๋„ ์ด ๊ด€์ ์—์„œ ๋ฌธ์ œ๋ฅผ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค.

Ioffe and Szegedy (hereafter, the authors) observed that even when the inputs start out in a good state (roughly within the -1 to 1 range), their distribution changes as they pass through successive neurons; they called this Internal Covariate Shift. That is, even if we start with nicely distributed data of mean 0 and variance 1, the mean and variance drift along the way, and we would like to correct for this.
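A toy illustration of this drift (my own sketch, not the authors' experiment): start with standardized inputs and push them through a few random linear layers with sigmoid activations, printing the activation statistics after each layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Start with "nice" inputs: mean 0, variance 1.
x = rng.standard_normal((1000, 64))

# Pass through a few random linear layers + sigmoid;
# the activation statistics drift away from (mean 0, std 1).
for layer in range(4):
    W = rng.standard_normal((64, 64)) * 0.5
    x = 1.0 / (1.0 + np.exp(-(x @ W)))  # sigmoid
    print(f"layer {layer}: mean = {x.mean():.3f}, std = {x.std():.3f}")
```

After a single sigmoid layer the activations live in $(0, 1)$, so the mean is already near 0.5 rather than 0; this is the kind of shift the authors want to correct at every layer.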

Batch Normalization

ํ†ต๊ณ„ํ•™๊ณผ ๊ธฐ์กด์˜ ๋จธ์‹ ๋Ÿฌ๋‹์—์„œ๋Š” ์ด๋ฏธ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฒ”์œ„์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃฐ ๋•Œ, ์ •๊ทœํ™”๋ผ๋Š” ๊ธฐ๋ฒ•์„ ์ด์šฉํ•ด ์™”์Šต๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ๋“ค์–ด ์–ด๋–ค ๋™๋ฌผ์ข… A์™€ B์˜ ๊ฐœ์ฒด์ˆ˜์— ๋”ฐ๋ฅธ ํšŒ๊ท€๋ชจํ˜•์„ ๋งŒ๋“ค๊ณ ์ž ํ•  ๋•Œ, A๋Š” 1~100 ์ •๋„ ๋ถ„ํฌํ•˜๊ณ  B๋Š” 1~10,000 ์ •๋„ ๋ถ„ํฌํ•œ๋‹ค๋ฉด B์˜ ์ฐจ์ด์— ๋น„ํ•ด A์˜ ์ฐจ์ด๊ฐ€ ๋„ˆ๋ฌด ์ž‘์•„์ ธ์„œ A์— ๋”ฐ๋ฅธ ๋ณ€ํ™”๊ฐ€ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋ฐ˜์˜๋˜์ง€ ๋ชปํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ๋ง‰๊ธฐ ์œ„ํ•ด, ๋ณดํ†ต์€ ํ‰๊ท ์ด 0์ด๊ณ  ๋ถ„์‚ฐ์ด 1์ด ๋˜๋„๋ก ์ „์ฒด ๊ฐ’์„ ๋งž์ถฐ์ค€๋‹ค๊ฑฐ๋‚˜ ํ•˜๋Š” ์ •๊ทœํ™”๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ด๋•Œ, ๋ฐ์ดํ„ฐ $x_1, \dots x_n$์— ๋Œ€ํ•ด ์ •๊ทœํ™”๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์‰ฝ๊ฒŒ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. \(y_i = \frac{x_i - \expect{x}}{\sqrt{\var[x]}}\) ์ด ๋ฐฉ๋ฒ•์€ ์‰ฌ์šด ์ •๊ทœํ™”์ง€๋งŒ, ์•ฝ๊ฐ„ ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์ •๊ทœํ™” ๋•Œ๋ฌธ์— ํ˜น์‹œ ๋„คํŠธ์›Œํฌ๊ฐ€ ํ‘œํ˜„๊ฐ€๋Šฅํ•œ ํ•จ์ˆ˜์˜ ์ง‘ํ•ฉ (์ด๋ฅผ Representation power๋ผ ํ•ฉ๋‹ˆ๋‹ค) ์ด ์ค„์–ด๋“ค์ง€ ์•Š๋Š๋ƒ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด sigmoid๊ฐ€ -1, 1 ๊ตฌ๊ฐ„์—์„œ ๊ฑฐ์˜ linearํ•˜๊ณ  ์–‘์ชฝ๋ถ€๋ถ„์—์„œ nonlinearํ•œ๋ฐ ๋ชจ๋“  ์ค‘๊ฐ„ ๋ ˆ์ด์–ด๊ฐ’์„ ์–ด๊ฑฐ์ง€๋กœ ์ด๋ ‡๊ฒŒ ์ค‘๊ฐ„์— ๋ฐ€์–ด๋„ฃ๋Š”๊ฒŒ ์˜ฌ๋ฐ”๋ฅด๋ƒ? ๋Š” ๋ง์— ์„ ๋œป ๋‹ตํ•˜๊ธฐ๊ฐ€ ์–ด๋ ต์Šต๋‹ˆ๋‹ค.

๊ทธ๋ž˜์„œ ์ €์ž๋“ค์€ ์ƒˆ๋กœ์šด ๋ฐฉ๋ฒ•์œผ๋กœ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ normalization์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. \(y_i = \gamma_i \frac{x_i - \expect{x}}{\sqrt{\var[x]}} + \beta_i\) ์ด๋•Œ, $\gamma_i$ ์™€ $\beta_i$๋Š” trainable parameter์ž…๋‹ˆ๋‹ค.

์ด์ œ ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•˜๊ณ  ๋‚˜๋ฉด, ๋˜๋‹ค๋ฅธ ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. Stochastic optimization์„ ํ•˜๋Š” ์šฐ๋ฆฌ์˜ ํŠน์„ฑ์ƒ, batch ํ•œ๋ฒˆ์„ ์ด์šฉํ•ด์„œ gradient update๋ฅผ ํ•˜๊ณ ๋‚˜๋ฉด ํ‰๊ท ๊ณผ ํ‘œ์ค€ํŽธ์ฐจ๊ฐ€ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ์ „์ฒด ๋ฐ์ดํ„ฐ๊ฐ€ 1๋งŒ๊ฐœ๊ณ  batch๊ฐ€ 100๊ฐœ์”ฉ์ด๋ผ๊ณ  ํ•˜๋ฉด, 100๊ฐœ๋ฅผ ์ด์šฉํ•ด์„œ ๋ ˆ์ด์–ด์˜ ๊ฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•œ๋ฒˆ ๋ฐ”๊พธ๊ณ  ๋‚˜๋ฉด 1๋งŒ๊ฐœ์˜ ์ž…๋ ฅ์„ ๋„ฃ๊ณ  ๋Œ๋ ค์„œ ์ƒˆ๋กœ์šด ํ‰๊ท ์„ ๊ตฌํ•ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿด๊ฑฐ๋ฉด ์• ์ดˆ์— batch๋ฅผ ์žก์•„์„œ stochasticํ•˜๊ฒŒ ๋ญ”๊ฐ€๋ฅผ ํ•˜๋Š” ์˜๋ฏธ๊ฐ€ ์—†์–ด์ ธ ๋ฒ„๋ฆฝ๋‹ˆ๋‹ค.

๊ทธ๋ž˜์„œ ์ €์ž๋“ค์€ ์‹ค์ œ ํ‰๊ท ๊ณผ ํ‘œ์ค€ํŽธ์ฐจ๋ฅผ ๊ตฌํ•ด์„œ ์“ฐ๋Š”๊ฒŒ ์•„๋‹ˆ๋ผ, batch๋ณ„๋กœ ํ‰๊ท ๊ณผ ํ‘œ์ค€ํŽธ์ฐจ๋ฅผ ๊ตฌํ•ด์„œ ๊ทธ ๊ฐ’๋“ค๋งŒ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰โ€ฆ Tensor $X$๋ฅผ $B \times N$ ๋กœ ๋ณผ ๋•Œ, \(\begin{align*} \mu[:] &= \frac{1}{B} \sum_{b = 1}^{B} X[b, :] \\ \sigma^2[:] &= \frac{1}{B} \sum_{b = 1}^{B} (X[b, :] - \mu[:])^2 + \epsilon \\ \text{BN}_{\beta, \gamma}(X)[b, :] &= \gamma[:] \odot \frac{X[b, :] - \mu[:]}{\sigma[:]} + \beta[:] \end{align*}\) ์ด๋ ‡๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. $\epsilon$์€ floating point error๋ฅผ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ์ง‘์–ด๋„ฃ์€ ์ ๋‹นํžˆ ์ž‘์€ ์ˆ˜์ด๋ฏ€๋กœ ์ˆ˜ํ•™์ ์œผ๋กœ๋Š” ๊ณ ๋ คํ•˜์ง€ ์•Š์•„๋„ ๋ฉ๋‹ˆ๋‹ค.

Convolution ์—ฐ์‚ฐ์—์„œ๋„ ๊ฑฐ์˜ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋‹ค๋งŒ, Convolution์˜ ๊ฒฝ์šฐ ์–ด๋–ค spatial locality๋ฅผ ์œ ์ง€ํ•œ๋‹ค๋Š” ์„ฑ์งˆ์„ ์œ ์ง€ํ•˜๊ณ  ์‹ถ๊ธฐ ๋•Œ๋ฌธ์— (์ด ์ •๋ณด๊ฐ€ ์–ด๋””์„œ ์™”๋Š”์ง€๋ฅผ ์–ด๋Š์ •๋„๋Š” ๋ณด์กดํ•˜๋ฉด์„œ ๊ฐ€๊ณ  ์‹ถ๊ธฐ ๋•Œ๋ฌธ์—) H, W ๋ฐฉํ–ฅ์œผ๋กœ๋Š” ์ •๊ทœํ™”ํ•˜์ง€ ์•Š๊ณ , C๋ฐฉํ–ฅ๋งŒ ์ด์šฉํ•ด์„œ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, tensor $X$๋ฅผ $B \times C \times H \times W$ ๋กœ ๋ณผ ๋•Œ, \(\begin{align*} \mu[:] &= \frac{1}{BHW} \sum_{b = 1}^{B} \sum_{i = 1}^{H} \sum_{j = 1}^{W} X[b, :, i, j] \\ \sigma^2[:] &= \frac{1}{BHW} \sum_{b = 1}^{B} \sum_{i = 1}^{H} \sum_{j = 1}^{W} (X[b, :, i, j] - \mu[:])^2 + \epsilon^2 \\ \text{BN}_{\beta, \gamma}(X)[b, :, i, j] &= \gamma[:] \odot \frac{X[b, :, i, j] - \mu[:]}{\sigma[:]} + \beta[:] \end{align*}\)

Experimental Result

์ €์ž๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์‚ฌํ•ญ๋“ค์„ ์‹คํ—˜์ ์œผ๋กœ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค.

  • There is no need to use both Dropout and Batch Norm; BN itself reportedly provides some regularization effect.
  • With BN, the learning rate can be set somewhat larger. By the same principle, momentum can be increased or learning rate decay reduced.
  • Functions like tanh and sigmoid, not only ReLU, can be used as activations.
  • In the AlexNet post I mentioned a method called Local Response Normalization and said it is no longer used; here is the reason. With BN, Local Response Normalization is reportedly unnecessary.

Also, the VGGNet post mentioned that "because VGGNet is deep and hard to train, you train 11 layers, then stack 2 more on top, and so on"; with BN, this is no longer necessary, and all 16 layers can simply be trained at once.

Follow-up Research: Real Effect of BN?

MIT์˜ ์—ฐ๊ตฌํŒ€ (Santurkar et al) ์€ 2018๋…„ NeurlPS์— ๋ฐœํ‘œ๋œ ์—ฐ๊ตฌ์—์„œ, BN์˜ ์ €์ž๋“ค์ด ์ฃผ์žฅํ•œ Internal Covariate Shift (์ดํ•˜ ICS)์— ๋Œ€ํ•œ ๋ถ€๋ถ„์„ ๋ฐ˜๋ฐ•ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ณด๋‹ค ์ •ํ™•ํžˆ๋Š”, ์ด ๋…ผ๋ฌธ์˜ ์š”์ ์„ ์ •๋ฆฌํ•˜์ž๋ฉดโ€ฆ

  1. There is little evidence that correcting ICS actually improves performance. The BN authors claimed that (BN corrects ICS) and (therefore performance improves), but there is no concrete evidence for this.
  2. In fact, it is not even true that BN corrects ICS. The experimental results suggest BN has little to do with correcting ICS.
  3. Nevertheless, it is true that BN improves performance.
  4. The real effect of BN is that it smooths the surface of the loss function. Moreover, mathematical analysis shows that other regularization methods have this effect as well.

์ด์ •๋„๋กœ ์š”์•ฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฆ‰, Loss function์ด ์ข€๋” ์ข‹์€ ํ˜•ํƒœ๋กœ ์žกํžˆ๊ธฐ ๋•Œ๋ฌธ์— BN์„ ์จ์•ผ ํ•˜๊ธด ํ•˜์ง€๋งŒ, ๊ทธ๊ฒŒ ICS ๋•Œ๋ฌธ์€ ์•„๋‹ˆ๋ผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด ๋…ผ๋ฌธ์€ ๋‹ค์–‘ํ•œ ์‹คํ—˜๊ฒฐ๊ณผ์™€ ํ•จ๊ป˜ ์ด๋ก ์ ์œผ๋กœ๋„ ๋ถ„์„ํ•˜๊ณ  ์žˆ๋Š”๋ฐ, ์ด๋ก ์  ๊ธฐ๋ฐ˜์„ ๋จผ์ € ๊ฐ–์ถ˜์ƒํƒœ๋กœ ์ •ํ™•ํžˆ ํ•„์š”ํ•œ ๊ฒƒ์ด ๋ฌด์—‡์ธ์ง€๋ฅผ ์žก์•„๋‚ด์„œ ์‹คํ—˜์„ ์„ค๊ณ„ํ•˜๋ฉด ๋ณด๋‹ค ์›ํ•˜๋Š” ๊ฒฐ๊ณผ๋ฅผ ์‰ฝ๊ณ  ์ •ํ™•ํ•˜๊ฒŒ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ๊ณผ, ์šฐ๋ฆฌ๊ฐ€ ์ž˜ ์•Œ๋ ค์ง„ ๊ธฐ๋ฒ•๋“ค์— ๋Œ€ํ•ด์„œ๋„ ์ˆ˜ํ•™์ /ํ†ต๊ณ„ํ•™์ ์œผ๋กœ ๋ช…ํ™•ํ•˜๊ฒŒ ์ดํ•ดํ•˜๊ณ  ์žˆ๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋ผ๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๋Š” ์˜ˆ์‹œ๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๊ฒ ์Šต๋‹ˆ๋‹ค.

๊ทธ๋Ÿผ์—๋„, ์ดํ›„์˜ ๋งŽ์€ ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ๋“ค - ํŠนํžˆ CNN๋“ค์— ๋Œ€ํ•ด์„œ, Batch Normalization์€ ๊ฑฐ์˜ ํ•„์ˆ˜์ ์ธ ๊ฒƒ์œผ๋กœ ๋ฐ›์•„๋“ค์—ฌ์ง€๊ณ  ์žˆ์„ ๋งŒํผ ์„ฑ๋Šฅ ๊ฐœ์„  ํšจ๊ณผ๊ฐ€ ๋šœ๋ ทํ•˜๊ธฐ์— ์› ์ €์ž๋“ค์˜ ์—ฐ๊ตฌ๊ฐ€ ๋น›์ด ๋ฐ”๋ž˜๋Š” ๊ฒƒ์€ ์•„๋‹™๋‹ˆ๋‹ค. 2014๋…„ ์ดํ›„์˜ Deep CNN ๋ชจ๋ธ๋“ค์€ BN์˜ ํšจ๊ณผ ๋•๋ถ„์— ํ›ˆ๋ จ์ด ๊ฐ€๋Šฅํ•ด์กŒ๋‹ค๊ณ  ํ•ด๋„ ๊ณผ์–ธ์ด ์•„๋‹ˆ๋‹ˆ๊นŒ์š”. ์•ž์œผ๋กœ๋Š” ์ด๋Ÿฐ ๋ฐฉ๋ฒ•๋“ค์„ ์ ์šฉํ•œ, ๋” ๊นŠ์€ network๋“ค์— ๋Œ€ํ•ด ์ข€๋” ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.