
AlexNet์€ 2012๋…„ Imagenet challenge๋ฅผ ํฐ ๊ฒฉ์ฐจ๋กœ ์šฐ์Šนํ•˜๋ฉด์„œ, image classification task๋ฅผ ํ†ตํ•ด Deep neural network & GPU-computing์˜ ์‹œ๋Œ€๋ฅผ ์—ฐ ๋ชจ๋ธ์ด๋ผ๋Š” ํ‰๊ฐ€๋ฅผ ๋ฐ›๋Š” ๊ทธ๋Ÿฐ ์•„ํ‚คํ…์ณ์ž…๋‹ˆ๋‹ค. ์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ๋Š” ๊ทธ๋Ÿฐ AlexNet์˜ ์›๋ณธ ๋…ผ๋ฌธ์„ ๋”ฐ๋ผ๊ฐ€๋ฉด์„œ, ๋ฉ”์ธ ์•„์ด๋””์–ด๋“ค์— ๋Œ€ํ•ด ์‚ดํŽด๋ด…๋‹ˆ๋‹ค.

Architecture

(Figure 1: the AlexNet architecture diagram from the original paper, showing the network split across two GPUs)

๊ธฐ๋ณธ์ ์œผ๋กœ AlexNet์˜ ๊ตฌ์กฐ๋Š” LeNet๊ณผ ๋งŽ์ด ๋‹ค๋ฅด์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋ณด๋‹ค ํฐ ์ด๋ฏธ์ง€๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ข€๋” ๊นŠ์–ด์ง„ ๊ตฌ์กฐ๋ผ๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ์š”. Convolution layer 5๊ฐœ์™€ Fully connected layer 2๊ฐœ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. LeNet์—์„œ ๋ณธ๊ฒƒ์ฒ˜๋Ÿผ Convolution layer๋“ค์ด ์ฃผ๋ณ€์„ ๋ณด๋ฉด์„œ feature๋ฅผ ์ถ”์ถœํ•˜๊ณ , ๊ทธ ์ถ”์ถœํ•œ ํŠน์ง•๋“ค์„ ๋งˆ์ง€๋ง‰ linear layer์—์„œ ์–ด๋–ป๊ฒŒ ํ•ฉ์น ์ง€๋ฅผ ๊ณ ๋ฏผํ•ด์„œ ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ๋ฅผ ๋‚ธ๋‹ค๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๊ฒ ์Šต๋‹ˆ๋‹ค.

2-GPU Training

AlexNet์€ ๋ช‡๊ฐ€์ง€ ํŠน์ดํ•œ ์ ์ด ์žˆ๋Š”๋ฐ, ๋ˆˆ์— ๋ณด์ด๋Š” ๊ฐ€์žฅ ํฐ ํŠน์ง• ์ค‘ ํ•˜๋‚˜๋Š” ์œ„ ๊ทธ๋ฆผ์—์„œ ๋ณด๋“ฏ ๋„คํŠธ์›Œํฌ ์ „์ฒด๋ฅผ ๋‘๊ฐœ๋กœ ๋‚˜๋ˆ ์„œ ๊ตฌํ˜„ํ•ด๋†จ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ์ด๋Š” ๋‹น์‹œ (2012) GPU ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ 3GB์ •๋„๋กœ ํ˜„์žฌ์— ๋น„ํ•ด ๋ถ€์กฑํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— GPU 2๊ฐœ์— ๋„คํŠธ์›Œํฌ๋ฅผ ๋ฐ˜์”ฉ ๋‚˜๋ˆ ์„œ ๋Œ๋ฆฌ๋ฉด์„œ, ํ•„์š”ํ•œ ๋•Œ๋งŒ ์„œ๋กœ๊ฐ„์— communicateํ•˜๋„๋ก ํ•œ ๊ฒƒ์ธ๋ฐ์š”. ์ง€๊ธˆ์— ์™€์„œ๋Š” GPU์˜ ์„ฑ๋Šฅ์ด ๋น„์•ฝ์ ์œผ๋กœ ํ–ฅ์ƒ๋จ์— ๋”ฐ๋ผ ๊ตณ์ด ์ด๋ ‡๊ฒŒ ๋‘๊ฐœ๋กœ ๋‚˜๋ˆ„์ง€ ์•Š์•„๋„ ์ถฉ๋ถ„ํžˆ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Usage of ReLU

LeNet์—์„œ๋„ ๊ทธ๋ ‡๊ณ , ์ด์ „๊นŒ์ง€์˜ ๋งŽ์€ Neural network๋“ค์€ activation function์œผ๋กœ $\tanh$ ๋‚˜ sigmoid (์–ด์ฐจํ”ผ ๊ฑฐ์˜ ๋น„์Šทํ•˜๋ฏ€๋กœ sigmoid๋กœ ํ†ต์นญํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค) ๊ฐ™์€ smoothํ•œ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ AlexNet์—์„œ๋Š” ReLU๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ํ›จ์”ฌ ๋น ๋ฅด๊ฒŒ training์ด ๊ฐ€๋Šฅํ•จ์„ ์ฃผ์žฅํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์ด๋ฅผ ์‹คํ—˜์„ ํ†ตํ•ด ํ™•์ธํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด๋ถ€๋ถ„์— ๋Œ€ํ•ด์„œ๋Š” ๋‹ค์–‘ํ•œ ์ด์•ผ๊ธฐ์™€ ์ƒ๊ฐํ•ด๋ณผ ์ด์Šˆ๋“ค์ด ์žˆ๋Š”๋ฐ, ReLU๋Š” ๊ณ„์‚ฐ ์ž์ฒด๊ฐ€ ๋น ๋ฅด๊ฒŒ ๊ฐ€๋Šฅํ•œ๋ฐ๋‹ค ์–‘์ชฝ์—์„œ vanishing gradient ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” sigmoid์— ๋น„ํ•ด ์ด๋Ÿฐ ๋ฌธ์ œ๋“ค์ด ๋œํ•˜๋‹ค๋Š” ์žฅ์ ์„ ์ง๊ด€์ ์œผ๋กœ ์ƒ๊ฐํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

AlexNet ๋…ผ๋ฌธ์—์„œ๋Š” ์ด๋ฅผ non-saturating์ด๋ผ๊ณ  ๋ถ€๋ฅด๊ณ  ์žˆ์œผ๋ฉฐ, ์ดํ›„ ๋งŽ์€ ๋…ผ๋ฌธ์—์„œ๋„ ReLU๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋Ÿฌํ•œ ์ด์ ์„ ์–ป๊ณ ์ž ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

Local Response Normalization

Sigmoid์˜ ๊ฒฝ์šฐ ๊ฐ ๋‰ด๋Ÿฐ์˜ ์ž…๋ ฅ์ด 0 ์ฃผ์œ„๋กœ ๋ชจ์•„์ ธ์•ผ learning์˜ ํšจ์œจ์ด ๋ฐœํœ˜๋˜๊ธฐ ๋•Œ๋ฌธ์— (๋์ชฝ์œผ๋กœ ๊ฐˆ์ˆ˜๋ก ๋ฏธ๋ถ„๊ณ„์ˆ˜๊ฐ€ 0์— ๊ฐ€๊นŒ์›Œ์„œ ์•„๋ฌด ์ผ๋„ ์ผ์–ด๋‚˜์ง€ ์•Š์Œ), ๋งŽ์€ ๋„คํŠธ์›Œํฌ๋“ค์ด input normalization์„ ํ†ตํ•ด ์ด๋ฅผ ๋งž์ถฐ์ฃผ๋ ค๊ณ  ํ–ˆ์Šต๋‹ˆ๋‹ค. ReLU๋Š” 0 ์ดํ•˜๋งŒ ์•„๋‹ˆ๋ผ๋ฉด ์ž…๋ ฅ๊ฐ’์— ๋”ฐ๋ผ ๋ฏธ๋ถ„๊ณ„์ˆ˜๊ฐ€ ์ค„์–ด๋“ค๊ฑฐ๋‚˜ ํ•˜์ง€๋Š” ์•Š์œผ๋ฏ€๋กœ ์ด๊ฒŒ ๊ผญ ํ•„์š”ํ•˜์ง€๋Š” ์•Š์ง€๋งŒ, AlexNet ๋…ผ๋ฌธ์—์„œ๋Š” Local normalization์ด๋ผ๋Š” ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•  ๋•Œ ํšจ์œจ์ด ์ข‹์•˜๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋งŒ, ์ด ๋ฐฉ๋ฒ•์€ ์ดํ›„ Batch normalization ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•๋“ค์ด ์ œ์‹œ๋˜๊ณ  ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•๋“ค์˜ ํšจ์œจ์ด ๋”์šฑ ์šฐ์ˆ˜ํ•จ์ด ๋ฐํ˜€์ง์— ๋”ฐ๋ผ ์ดํ›„์˜ ์—ฐ๊ตฌ์—์„œ ๋”์ด์ƒ ๊ณ„์Šน๋˜์ง€ ์•Š์•˜๊ธฐ ๋•Œ๋ฌธ์— ์ž์„ธํžˆ ๋‹ค๋ฃจ์ง€๋Š” ์•Š๊ฒ ์Šต๋‹ˆ๋‹ค.

๊ฐ„๋‹จํ•˜๊ฒŒ๋งŒ ๋งํ•˜์ž๋ฉด, convolution layer ํ•œ๋ฒˆ์ด ํ•„ํ„ฐ๋ฅผ ์—ฌ๋Ÿฌ๊ฐœ ์“ฐ๋Š” ์ƒํ™ฉ์—์„œ ์ ์šฉํ•˜๋Š” normalization์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ๋ณด์ž๋ฉด ์ด 96๊ฐœ์˜ ํ•„ํ„ฐ (์ปค๋„) ์„ ์“ฐ๋Š” ์ƒํ™ฉ์—์„œ ๋ญ 17๋ฒˆ ํ•„ํ„ฐ์˜ ๊ฒฐ๊ณผ๊ฐ’์ด ์žˆ์„ํ…๋ฐ, ์ด ๊ฐ’์„ 13๋ฒˆ, 14๋ฒˆ, โ€ฆ, 21๋ฒˆ๊นŒ์ง€์˜ ํ•„ํ„ฐ์˜ ๊ฒฐ๊ณผ๊ฐ’์„ ์ด์šฉํ•˜์—ฌ normalizeํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค (์ขŒ์šฐ๋กœ 4๊ฐœ์”ฉ ์“ฐ๋Š”๊ฑด ๊ทธ๋ƒฅ ์ž„์˜๋กœ ์ •ํ•œ ๊ฐ’์ž…๋‹ˆ๋‹ค) ์ด ๋ฐฉ๋ฒ•์€ ์‹ค์ œ ๋‡Œ์—์„œ์˜ ์‹ ๊ฒฝ์ƒ๋ฆฌํ•™์— ์žˆ์–ด์„œ ์ธก๋ฉด ์–ต์ œ (lateral inhibition) ๋กœ๋ถ€ํ„ฐ motivation์„ ์–ป์€ ๋ฐฉ๋ฒ•์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

Overlapping Pooling

AlexNet์—์„œ๋Š” ์ผ๋ถ€ pooling์„ ์„œ๋กœ ์‚ด์ง ๊ฒน์น˜๊ฒŒ ์ˆ˜ํ–‰ํ•˜๋Š”๋ฐ, ์ด ๋ฐฉ๋ฒ•์„ ํ†ตํ•ด overfitting์„ ํฌ๊ฒŒ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

Training

Regularization Techniques

AlexNet์€ ์ด์ „ LeNet์— ๋น„ํ•ด ํ›จ์”ฌ ๋” ํฐ ๋ชจ๋ธ๋กœ, ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๊ฐœ์ˆ˜๊ฐ€ ํ›จ์”ฌ ๋” ๋งŽ๊ธฐ ๋•Œ๋ฌธ์— overfitting์˜ ์šฐ๋ ค๊ฐ€ ํฝ๋‹ˆ๋‹ค. ์ด์— ๋Œ€์‘ํ•˜๊ธฐ ์œ„ํ•ด, LeNet๊ณผ ๋น„๊ตํ•ด ๋ณด๋ฉด ํ›จ์”ฌ Regularization์— ๊ณต์„ ๋“ค์ด๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • 2012๋…„ ๋‹น์‹œ์—๋Š” ์ตœ์‹  ํ…Œํฌ๋‹‰์ด์—ˆ๋˜ (๊ทธ๋Ÿฌ๋‚˜, ์ดํ›„์—๋Š” Batch normalization๋“ฑ์˜ ํ™œ์šฉ์œผ๋กœ ์ธํ•ด ๊ทธ ํšจ์šฉ์ด ๋งŽ์ด ์ค„์–ด๋“ ) dropout์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ fully-connected layer ์ค‘ ์•ž ๋‘ ์นธ์— $p = 0.5$์ž…๋‹ˆ๋‹ค.
  • SGD์— Weight decay 0.0005๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋…ผ๋ฌธ์—์„œ๋Š” ๊ทธ ์ด์œ ๋ฅผ ๋ช…ํ™•ํ•˜๊ฒŒ ๋ฐํžˆ๊ณ  ์žˆ์ง€๋Š” ์•Š์œผ๋‚˜, ๋‹จ์ˆœํžˆ regularization์ผ ๋ฟ ์•„๋‹ˆ๋ผ ์‹ค์ œ๋กœ training์— ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•˜๋‹ค๊ณ  ์ฃผ์žฅํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

Optimization

  • SGD with Momentum 0.9, weight decay 0.0005.
  • Learning rate๋Š” 0.01๋กœ ์‹œ์ž‘ํ•ด์„œ, loss๊ฐ€ ์ค„์–ด๋“ค์ง€ ์•Š๋Š” ๊ฒƒ ๊ฐ™์•„ ๋ณด์ผ๋•Œ๋งˆ๋‹ค 1/10์œผ๋กœ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
  • โ€œAdjusted Manuallyโ€โ€ฆ

Code

AlexNet์œผ๋กœ CIFAR10 ํ’€์–ด๋ณด๊ธฐ ๋กœ ์ด์–ด์ง‘๋‹ˆ๋‹ค