
์ด ๊ธ€์˜ ์ƒ๋‹น ๋ถ€๋ถ„์€, ์„œ๋ฒ ์ด ๋…ผ๋ฌธ์ธ Minaee, S., Boykov, Y., Porikli, F., Plaza, A., Kehtarnavaz, N., & Terzopoulos, D. (2020). Image Segmentation Using Deep Learning: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1โ€“22. https://doi.org/10.1109/TPAMI.2021.3059968 ์„ ์ •๋ฆฌํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค.

Problem Introduction

Semantic Segmentation์ด๋ž€, Computer Vision ๋ถ„์•ผ์˜ ๋Œ€ํ‘œ์ ์ธ task์ค‘ ํ•˜๋‚˜๋กœ, ๊ฐ„๋‹จํžˆ ์š”์•ฝํ•˜์ž๋ฉด

์ด๋ฏธ์ง€๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ๊ทธ ์ด๋ฏธ์ง€๋ฅผ ํ”ฝ์…€๋‹จ์œ„๋กœ ์–ด๋–ค ๋Œ€์ƒ์ธ์ง€ ๋ฅผ ๋ถ„๋ฅ˜ํ•ด๋‚ด๋Š” ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค.

picture 1
Source: Stanford cs231n slides

์ด ์‚ฌ์ง„์€ ๋Œ€ํ‘œ์ ์ธ ๋„ค ๊ฐ€์ง€์˜ task๋ฅผ ๋น„๊ตํ•œ ๊ฒƒ์ธ๋ฐ, ๊ต‰์žฅํžˆ ์ง๊ด€์ ์œผ๋กœ ๋ฌด์Šจ ์˜๋ฏธ์ธ์ง€ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Applications

Semantic segmentation์€ ๋”ฑ ๋А๋‚Œ์—๋„ ๋งค์šฐ ์œ ์šฉํ•  ๊ฒƒ ๊ฐ™์€๋ฐ, ๋Œ€ํ‘œ์ ์ธ ํ™œ์šฉ์ฒ˜ ๋ช‡๊ฐœ๋ฅผ ์ƒ๊ฐํ•ด๋ณด๋ฉดโ€ฆ

  • ์ž์œจ ์ฃผํ–‰ : ์ž์œจ์ฃผํ–‰์—์„œ ์ง€๊ธˆ ๋ˆˆ์•ž์— ๋ณด์ด๋Š” ๊ฒƒ์ด ๋„๋กœ์ธ์ง€, ํ™๋ฐ”๋‹ฅ์ธ์ง€, ๋ฌผ์›…๋ฉ์ด์ธ์ง€๋ฅผ ํŒ๋‹จํ•˜๋Š” ์ž‘์—…์€ ๊ต‰์žฅํžˆ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.
  • ์˜๋ฃŒ ์ด๋ฏธ์ง€ : CT ์‚ฌ์ง„์—์„œ, ๊ฐ ๊ณ ํ˜•์žฅ๊ธฐ๋ฅผ ๋ถ„๋ฅ˜ํ•œ๋‹ค๊ฑฐ๋‚˜, ์ •์ƒ์กฐ์ง๊ณผ ๋น„์ •์ƒ์กฐ์ง์„ ๊ตฌ๋ถ„ํ•˜๋Š” ์ž‘์—…๋„ ๊ฒฐ๊ตญ์€ segmentation์— ๊ธฐ๋ฐ˜ํ•ฉ๋‹ˆ๋‹ค.

Overview

Here I only summarize the rough idea behind each approach in a line or two, and I plan to fill in the full picture through separate posts on individual network architectures.

์ด ์ด๋ฏธ์ง€๊ฐ€ ๊ฐœ์ธ์ง€, ๊ณ ์–‘์ด์ธ์ง€๋ฅผ ํŒ๋‹จํ•˜๋Š” Classification์˜ ๊ฒฝ์šฐ, ํ†ต์ƒ์ ์œผ๋กœ convolutional neural network (CNN) ์— ๊ธฐ๋ฐ˜ํ•œ ๋ฐฉ๋ฒ•๋“ค์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ์ง„ ์ „์ฒด์˜ ์ •๋ณด๋ฅผ ์ธ์ฝ”๋”ฉํ•œ $3 \times W \times H$ ํ…์„œ๋ฅผ ๊ฐ€์ง€๊ณ  ์‹œ์ž‘ํ•ด์„œ, Convolution layer๋ฅผ ๊ฑฐ์น˜๋ฉด์„œ ์ •๋ณด๋“ค์„ ์ถ”์ถœํ•˜๊ณ , ๋งˆ์ง€๋ง‰์— fully connected layer๋ฅผ ๋ถ™์—ฌ์„œ ์‹ค์ œ๋กœ ํด๋ž˜์Šค๋ฅผ ๊ตฌ๋ถ„ํ•ด๋‚ด๋Š” ์‹์œผ๋กœ ์ง„ํ–‰ํ•˜๊ฒŒ ๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ด ๋ฐฉ๋ฒ•์„ semantic segmentation๊ฐ™์€ ๋ฌธ์ œ์—์„œ ์ ์šฉํ•˜๊ธฐ ์–ด๋ ค์šด ์ด์œ ๋Š”, ๋งˆ์ง€๋ง‰ fully connected layer๊ฐ€ ๊ธฐํ•˜์ ์ธ ์œ„์น˜์ •๋ณด๋ฅผ ๋‹ค ๋‚ ๋ ค๋ฒ„๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ์ฆ‰, โ€˜๊ณ ์–‘์ด๊ฐ€โ€™ ์žˆ๋‹ค๋Š” ์ •๋ณด๋Š” ์–ด๋–ป๊ฒŒ ๋ถ„๋ฅ˜ํ•ด๋ณผ ์ˆ˜ ์žˆ์„์ง€๋ผ๋„, ๊ณ ์–‘์ด๊ฐ€ โ€˜์–ด๋””์—โ€™ ์žˆ๋Š”์ง€์— ๋Œ€ํ•ด์„œ๋Š” ์ „ํ˜€ ์•Œ ์ˆ˜๊ฐ€ ์—†๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— fully connected layer๋ฅผ ์“ธ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

One idea for fixing this is to process the final features with a $1 \times 1$ convolution instead of a fully connected layer. The idea comes from the fact that convolution preserves location information to some extent: its output is still a spatial map, so each output position corresponds to a location in the input. This was published at CVPR 2015 under the name Fully Convolutional Network (FCN). Applying additional models such as Conditional Random Fields or Markov Random Fields on top of it is reported to give even higher accuracy.
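A $1 \times 1$ convolution is simply the same linear map applied independently at every pixel, so it produces a score map instead of a single score vector. The NumPy sketch below illustrates this; the sizes `C`, `h`, `w`, `K` and the `conv1x1` helper are hypothetical, and the features are random stand-ins.

```python
import numpy as np

def conv1x1(features, weight, bias):
    """1x1 convolution: apply the same (K, C) linear map at every pixel.
    features: (C, h, w) -> returns (K, h, w) per-pixel class scores."""
    return np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]

rng = np.random.default_rng(0)
C, h, w, K = 16, 6, 8, 3            # hypothetical channel / size / class counts
features = rng.standard_normal((C, h, w))
weight = rng.standard_normal((K, C))
bias = rng.standard_normal(K)

score_map = conv1x1(features, weight, bias)
print(score_map.shape)               # (3, 6, 8): one score vector per location

# The scores at pixel (i, j) equal a fully connected layer applied to that pixel's channel vector.
i, j = 2, 5
assert np.allclose(score_map[:, i, j], weight @ features[:, i, j] + bias)

# Per-pixel prediction: argmax over the class axis gives an (h, w) label map.
label_map = score_map.argmax(axis=0)
print(label_map.shape)               # (6, 8)
```

Because the weights do not depend on `h` or `w`, the same layer also works on feature maps of any spatial size, unlike the flattened fully connected head.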

picture 1

CNN์—์„œ๋Š” Convolution๊ณผ ํ•จ๊ป˜ Pooling์„ ๋ฐ˜๋ณตํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๊ฐˆ์ˆ˜๋ก feature map์˜ ํฌ๊ธฐ๊ฐ€ ์ค„์–ด๋“ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ์šฐ๋ฆฌ๋Š” ๊ฐ ํ”ฝ์…€๋‹จ์œ„๋กœ ์–ด๋–ค ํด๋ž˜์Šค์ธ์ง€๋ฅผ ์ฐพ์•„๋‚ด๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ด๊ธฐ ๋•Œ๋ฌธ์—, ๋‹ค์‹œ feature map์˜ ํฌ๊ธฐ๋ฅผ ์›๋ณธ ์ด๋ฏธ์ง€ ํฌ๊ธฐ๋งŒํผ ํ‚ค์›Œ์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•๋“ค์ด ์žˆ๋Š”๋ฐ, FCN์—์„œ๋Š” skip connection / upsampling ์ด๋ผ๊ณ  ํ•ด์„œ, ๋„คํŠธ์›Œํฌ๋ฅผ ํƒ€๊ณ  ํ๋ฅด๋Š” ์ค‘๊ฐ„์˜ ์ •๋ณด๋ฅผ ๋ฝ‘์•„๋‹ค๊ฐ€ ์ตœ์ข… ์ •๋ณด์™€ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

picture 2

CNN์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฐœ๋… ์ค‘, feature map์˜ ํฌ๊ธฐ๋ฅผ ๊ฑฐ๊พธ๋กœ ๋Š˜๋ฆฌ๊ฒŒ ๋˜๋Š” deconvolution์ด๋ผ๋Š” ์—ฐ์‚ฐ์ด ์žˆ์Šต๋‹ˆ๋‹ค (convolution์˜ ๊ณ„์‚ฐ์ ์ธ inverse์ด๊ธด ํ•œ๋ฐ, ์ •ํ™•ํ•œ inverse๋Š” ์•„๋‹™๋‹ˆ๋‹ค. transposed convolution์ด ์ข€๋” ์ •ํ™•ํ•œ ๋ง์ธ๋ฐ, ์ด ๋ถ€๋ถ„๋„ ๋‚˜์ค‘์— ๋‹ค๋ฃจ๊ฒ ์Šต๋‹ˆ๋‹ค). CNN์„ ํƒ€๊ณ  feature map์˜ ๊ฐœ์ˆ˜๊ฐ€ ์ค„์–ด๋“ค์—ˆ๋‹ค๋Š” ๊ฒƒ์ด ๋ฌธ์ œ๊ฐ€ ๋œ๋‹ค๋ฉด, ๋‹ค์‹œ deconvolution์„ ๊ทธ๋งŒํผ ๊ฑฐ๊พธ๋กœ ๋Œ๋ ค์„œ feature map์„ ๋Œ์ด์ผœ์ฃผ๋ฉด ๋˜์ง€ ์•Š์„๊นŒ์š”? ์ด๋ฅผ Encoder-Decoder ํ˜•ํƒœ์˜ ๊ตฌ์กฐ๋ผ๊ณ  ๋ถ€๋ฅด๋ฉฐ, U-Net ์„ ํ•„๋‘๋กœ (์ด ์ด๋ฆ„์€, ๋ง๊ทธ๋Œ€๋กœ convolution๊ณผ deconvolution์„ U์žํ˜•์œผ๋กœ ์Œ“์•„์„œ ๋ถ™์—ฌ์ง„ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค) ์—ฌ๋Ÿฌ ๋ชจ๋ธ๋“ค์ด ์„ฑ๊ณต์ ์ธ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.

picture 3

Convolution์—์„œ pooling์„ ํ†ตํ•ด ์ •๋ณด๋ฅผ ์žƒ๋‹ค๊ฐ€ ๋‹ค์‹œ ์ด๊ฑธ ๋งŒ๋“ค์–ด์ฃผ๋Š” ๊ฒƒ ๋Œ€์‹ , ์ฒ˜์Œ๋ถ€ํ„ฐ feature map์˜ ํฌ๊ธฐ๋ฅผ ์ค„์ด์ง€ ์•Š์œผ๋ฉด ์–ด๋–จ๊นŒ์š”? ๊ทธ๋ ‡๋‹ค๊ณ  pooling์„ ์•„์˜ˆ ํ•˜์ง€ ์•Š์„ ์ˆ˜๋Š” ์—†๋Š”๋ฐ, ํ•„ํ„ฐ์˜ ํฌ๊ธฐ๊ฐ€ ํฌ๋ฉด ์—ฐ์‚ฐ์ด ๋„ˆ๋ฌด ๋งŽ์•„์ง€๋Š” ๋ฐ๋‹ค๊ฐ€ learning capacity๊ฐ€ ๋„ˆ๋ฌด ์ปค์ง€๋Š” ํ˜„์ƒ๋“ค์ด ๋ฐœ์ƒํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๋ฌธ์ œ๋ฅผ ์–ด๋А์ •๋„ ํ•ด๊ฒฐํ•˜๋Š” Dilated convolution์€ Convolution์„ ํ•  ๋•Œ๋ถ€ํ„ฐ ์ ๋‹นํžˆ ํ•„ํ„ฐ์— ์ œ๋กœ ํŒจ๋”ฉ์„ ๋ถ™์—ฌ์คŒ์œผ๋กœ์จ, ๊ณต๊ฐ„์ ์ธ ์ •๋ณด๋ฅผ ์žƒ์ง€ ์•Š๊ณ  feature map์˜ ํฌ๊ธฐ๋ฅผ ์œ ์ง€ํ•ด ์ค๋‹ˆ๋‹ค. ์ด ์—ฐ์‚ฐ์„ ์ด์šฉํ•˜์—ฌ (์›๋ž˜์˜ upsampling๊ณผ ํ•จ๊ป˜ ์“ฐ๊ธด ํ•ฉ๋‹ˆ๋‹ค) Dilated Convolutional Model ๋“ค์ด ๊ฐœ๋ฐœ๋˜์—ˆ๋Š”๋ฐ, ๋Œ€ํ‘œ์ ์œผ๋กœ Google์˜ DeepLab ๋ฅผ ์˜ˆ์‹œ๋กœ ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

picture 4

์šฐ์„ ์€ FCN, U-Net, DeepLab์„ ํ•„๋‘๋กœ ์ •๋ฆฌ๋ฅผ ์‹œ์ž‘ํ•ด ๋ณด๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.