FCN: Fully Convolutional Networks for Semantic Segmentation, a summary

cottonlove 2022. 10. 14. 14:32

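A while ago, for a team project course at school, I worked on Auto White Balance under multiple illuminants by having a segmentation model predict per-pixel u, v values. The semantic segmentation papers I read back then (U-Net, ..., Swin + UperNet) kept citing FCN as related prior work, and now I have finally read the FCN paper properly!

The abstract does not look like it says much, but reading the paper cold without any Korean reviews turned out to be hard, so I leaned on YouTube and other people's reviews to put this summary together!!!

There are not that many big ideas, but understanding them precisely was a bit tricky! Some of the background knowledge that pops up throughout the paper was stuff I had learned in my computer vision class, so that was nice to see..!

Anyway... on to the review!!! But first, a quick intro to a few concepts that are good to know before reading the paper.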

Background

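✅ End-to-end
You run into this phrase all the time when studying deep learning: end-to-end learning, end-to-end fashion, etc.
It means learning where a single neural network handles everything from input to output in one go, without a pipeline of separately built sub-networks. In other words, the feature extractor, representation learning, and classifier are not trained separately; there is a path along which all of the model's parameters can be trained simultaneously against a single loss function.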

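✅ Receptive field
Each output value of a convolution with a filter is the value of one neuron. The region of the input image that a single neuron covers is called its receptive field.

In the figure above, the receptive field of the green neuron in layer 2 is the green 3x3 region of layer 1, and the receptive field of the yellow neuron in layer 3 is a 5x5 region of layer 1.

In other words, the deeper the layer, the larger the receptive field.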
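
To make the 3x3 and 5x5 numbers concrete, here is a small sketch of how the receptive field grows with depth. The helper name `receptive_field` and the (kernel size, stride) layer description are my own, not from the paper, and padding is ignored because it does not change the receptive field size.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one neuron after `layers`.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    rf = 1    # a single input pixel only "sees" itself
    jump = 1  # distance in input pixels between neighbouring neurons
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                  # strides spread neurons further apart
    return rf


print(receptive_field([(3, 1)]))          # one 3x3 conv   -> 3
print(receptive_field([(3, 1), (3, 1)]))  # two 3x3 convs  -> 5
```

Adding a stride-2 pooling layer between the convolutions makes the receptive field grow even faster, which is part of why pooling comes up again later in this post.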

✅ mean IU = mean IoU = mean Intersection over Union

An evaluation metric. IoU is common in object detection, and mean IU is the standard metric for semantic segmentation.
IU score for each class = true positive / (true positive + false positive + false negative)

So mean IU is simply the average over all classes.
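
The formula is easy to check by hand on a tiny example. This is a minimal sketch on flattened per-pixel labels; the function name `mean_iu` is mine, and skipping classes that appear in neither the labels nor the predictions is an assumption (evaluation toolkits differ on this detail).

```python
def mean_iu(y_true, y_pred, num_classes):
    """Average of per-class IU = TP / (TP + FP + FN) over flattened pixels."""
    ious = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = tp + fp + fn
        if denom > 0:  # assumption: ignore classes absent from both inputs
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


y_true = [0, 0, 1, 1, 2, 2]  # ground-truth class of six pixels
y_pred = [0, 1, 1, 1, 2, 0]  # predicted class of the same pixels
# Per-class IU: class 0 = 1/3, class 1 = 2/3, class 2 = 1/2
print(mean_iu(y_true, y_pred, num_classes=3))
```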

✅ DAG

DAG: Directed Acyclic Graph. The edges have a direction and there are no cycles, so you can never follow the edges back to the node you started from.

✅ dense prediction task

In CV (computer vision), tasks that make pixel-wise predictions, such as semantic segmentation and instance segmentation, are called dense prediction tasks, in contrast to image classification.

Now the actual review!!!

First, FCN in one sentence!

It is a model that adapts existing CNN-based models, which performed well on image classification, to the semantic segmentation task.

In this paper, classification nets such as AlexNet, VGG16, GoogLeNet are re-architected and
fine-tuned to direct, dense prediction of semantic segmentation.

Put like this it looks really simple.

Converting an image classification model into an FCN comes down to the following three steps.

1) Convolutionalization

2) Deconvolution (Upsampling)

3) Skip architecture

Understanding how each of these three steps works and why it is needed is the core of this paper.

1) Convolutionalization

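To understand convolutionalization, we first need to look at the existing image classification models.

AlexNet, GoogLeNet, and VGG, the models used in the paper, are all image classifiers whose output stage consists of fully connected layers (FC layers).

So a simple diagram of the model structure looks like the figure below.

From the input up to the middle of the network, a ConvNet extracts features from the image, and the FC layers at the output use those features for classification.

Unlike image classification, though, semantic segmentation runs into limitations of the FC layer.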

FC layers take fixed-size inputs and produce non-spatial outputs.

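1) The spatial information of the image is lost.

Unlike a Conv layer, where a filter slides over the input and computes a convolution, an FC layer flattens its 2D input feature map into a 1D vector, so the spatial information of the image is lost. (The notion of a local receptive field disappears.)
As an aside, an FC layer can be viewed as a Conv layer whose filters take the entire input as their receptive field.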

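2) The input image size is fixed.

Because the number of weights in an FC layer is fixed, the size of the feature map output by the preceding layer is fixed, and in a chain reaction so are the sizes of the earlier feature maps and of the input image.

Semantic segmentation is pixel-wise prediction (dense prediction) on the input image, that is, assigning a class to every pixel to separate objects from the background, so spatial information is extremely important.

So, to preserve the spatial information of the input image, the paper replaces every FC layer with a Conv layer!!!

This is convolutionalization, and it addresses both of the limitations above.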

doing so, can take input of any size and output spatial classification maps.

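Taking VGG16, the backbone that performed best, as an example,
all of the FC layers at the output end are changed into Conv layers.

In more detail, here is what the paper says, step by step.

- Discard the final classifier layer,
- convert all FC layers to convolution layers,
- append a 1x1 convolution with channel dimension 21 (the number of class labels: 20 PASCAL VOC classes plus background) where the coarse feature map is output, // convolutionalization ends here
- and add a layer that upsamples this coarse output to the original image size. (covered later)

My first thought after reading this part was 'OK, but how exactly do you turn an FC layer into a Conv layer?' So I went straight to Google.

As always, Google has everything.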

https://cs231n.github.io/convolutional-networks/#convert

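The site above explains very kindly how to convert an FC layer into a Conv layer. It is in English, though, so here is my equally kind summary of what I understood.

"Any FC layer can be converted to a Conv layer." That is because the only differences between a Conv layer and an FC layer are that the neurons (perceptrons) of a Conv layer are connected only to a local region of the input (put differently, the receptive field of an FC neuron is its entire input) and that many neurons in a conv volume share parameters. The neurons in both FC and Conv layers compute dot products, so the functional form is identical, and that is why the conversion works!!!

Anyway, now that we know an FC layer can be converted into a Conv layer, let's see how.

The bottom line: the filter size of the Conv layer has to equal the size of the input volume for it to give the same result as the original FC layer.

Let me explain with a figure to make it easier.

Suppose the first FC layer has 4096 output units and its input feature map is [7x7x512]. Its output is then a vector of 4096 values. For a Conv layer to replace this FC layer the results have to match, so the filter size is 7, the same as the input volume, with padding = 0 and stride = 1, and because the output has 4096 channels there must be 4096 filters. (The number of filters equals the number of output channels, and the number of channels in each filter equals the number of input channels! The output volume size follows the usual formula (W - F + 2P) / S + 1 for input size W, filter size F, padding P, and stride S.)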
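
To convince myself that the two really give the same numbers, here is a tiny pure-Python check. It is only a sketch: the sizes (3 channels, 2x2 input, 4 outputs) are toy stand-ins for 512, 7x7, and 4096, and `fc_forward` / `conv_forward` are helper names I made up, not any library's API.

```python
import random


def fc_forward(x_flat, weights, biases):
    """Fully connected layer: one dot product per output unit."""
    return [sum(w * v for w, v in zip(row, x_flat)) + b
            for row, b in zip(weights, biases)]


def conv_forward(x, filters, biases, stride=1):
    """Convolution with padding = 0. x: [C][H][W], filters: [K][C][F][F]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    F = len(filters[0][0])
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    return [[[b + sum(filt[c][i][j] * x[c][oy * stride + i][ox * stride + j]
                      for c in range(C) for i in range(F) for j in range(F))
              for ox in range(out_w)] for oy in range(out_h)]
            for filt, b in zip(filters, biases)]  # [K][out_h][out_w]


random.seed(0)
C, S, K = 3, 2, 4  # toy channels, spatial size, number of output units
x = [[[random.random() for _ in range(S)] for _ in range(S)] for _ in range(C)]
fc_w = [[random.random() for _ in range(C * S * S)] for _ in range(K)]
fc_b = [random.random() for _ in range(K)]

# The FC layer sees the flattened input (channel, row, column order).
x_flat = [v for channel in x for row in channel for v in row]
# Reshape each FC weight row into one C x S x S filter, same order.
filters = [[[row[(c * S + i) * S:(c * S + i) * S + S] for i in range(S)]
            for c in range(C)] for row in fc_w]

fc_out = fc_forward(x_flat, fc_w, fc_b)    # K values
conv_out = conv_forward(x, filters, fc_b)  # K maps of size 1x1
print(all(abs(a - m[0][0]) < 1e-9 for a, m in zip(fc_out, conv_out)))
```

The nice part is that `conv_forward` does not care about the input size: feeding it a 3x3 input with the same filters gives a 2x2 map per output unit instead of a single value, which is exactly the "spatial output" that the FC version cannot produce.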

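Now that we know how to replace an FC layer with a Conv layer, let me walk through convolutionalization on the example network from CS231n. (CS231n describes it as an AlexNet-style architecture, though the [7x7x512] volume is what VGG16 produces.)

Again, I'll explain with the figure below.

The input image [224x224x3] goes through several rounds of conv & pool to give a [7x7x512] activation map, which becomes the input of the first FC layer. After FC1, FC2, and the final FC3 it becomes a vector of 1000 elements.

When converting these three FC layers into Conv layers, the same rules apply: the filter size must equal the input volume size and the number of filters must equal the number of output channels, which gives the conversion shown in the figure above. So FC1 becomes a 7x7 conv with 4096 filters, and FC2 and FC3, whose inputs are [1x1x4096], become 1x1 convs. For the last FC layer, the number of filters is the number of classes to predict. (If this part is hard to follow, I recommend looking up an explanation of convolution layers.)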
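
Here is a rough sketch of what the converted head outputs for different input sizes, using the output-size formula (W - F + 2P) / S + 1. The function names are mine, and I am assuming the only downsampling comes from five 2x poolings (a total factor of 32), as in the 224 -> 7 example.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (in_size - kernel + 2 * padding) // stride + 1


def head_output_size(image_size, downsample=32, fc1_kernel=7):
    """Spatial size of the class-score map after the converted FC layers."""
    feat = image_size // downsample           # e.g. 224 -> 7
    fc1 = conv_output_size(feat, fc1_kernel)  # FC1 as a 7x7 conv
    fc2 = conv_output_size(fc1, 1)            # FC2 as a 1x1 conv
    fc3 = conv_output_size(fc2, 1)            # FC3 as a 1x1 conv
    return fc3


print(head_output_size(224))  # 1: a single score vector, like the original net
print(head_output_size(384))  # 6: a 6x6 map of score vectors
```

So on a 224x224 image the converted network behaves like the original classifier, while on a larger image it produces a small grid of class scores in a single forward pass.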

Through convolutionalization the classification model is turned into a semantic segmentation model, and the feature map it outputs retains the spatial information of the input image,
but it is too coarse for per-pixel prediction.

While our reinterpretation of classification nets as fully convolutional yields output maps
for inputs of any size, the output dimensions are typically reduced by subsampling.

So, to turn the coarse feature map into a dense map matching the input image size, the paper uses Deconvolution (Upsampling).

2) Deconvolution(Upsampling)

Some ways to go from a coarse map to a dense map:

- Interpolation
- Deconvolution
- Unpooling
- Shift and stitch

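The paper uses interpolation and deconvolution.

First of all, you could keep the feature map from shrinking in the first place by not using pooling, or by reducing the pooling stride.

But there are reasons for using pooling (I learned these in my CV class):

- To gain robustness to the exact spatial location of features (= invariance to shift) -> the enlarged receptive field helps the network understand image context
- To shrink the feature maps, which cuts the computation (and the parameters of any later FC layers) -> shorter training time

So rather than dropping pooling, we need to think about how to turn the coarse feature map into a dense map.

The authors also experimented with tricks such as shift-and-stitch, but upsampling worked better, especially in combination with the skip architecture (covered later), so they did not use it in the model.

Now let me go through the Interpolation and Deconvolution methods used in the paper in turn.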

1) Bilinear Interpolation

This extends 1D linear interpolation to 2D, since we are dealing with images.

The formula is shown in the figure below, and it lets us estimate the empty positions of the feature map in order to upsample it.
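
In code, bilinear interpolation is just two linear interpolations along x followed by one along y. This is a minimal sketch, not the paper's implementation; the function names are mine, and mapping the corner pixels of the input onto the corner pixels of the output is one of several conventions (an assumption here).

```python
def bilinear_sample(img, y, x):
    """Interpolate img (a list of rows) at the fractional position (y, x)."""
    h, w = len(img), len(img[0])
    y0, x0 = min(int(y), h - 1), min(int(x), w - 1)  # top-left neighbour
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)  # bottom-right neighbour
    dy, dx = y - y0, x - x0
    top = (1 - dx) * img[y0][x0] + dx * img[y0][x1]     # along x, upper row
    bottom = (1 - dx) * img[y1][x0] + dx * img[y1][x1]  # along x, lower row
    return (1 - dy) * top + dy * bottom                 # along y


def bilinear_upsample(img, out_h, out_w):
    """Resize img to out_h x out_w, aligning the corner pixels (assumption)."""
    h, w = len(img), len(img[0])
    sy = (h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (w - 1) / (out_w - 1) if out_w > 1 else 0.0
    return [[bilinear_sample(img, i * sy, j * sx) for j in range(out_w)]
            for i in range(out_h)]


coarse = [[0, 2],
          [4, 6]]
for row in bilinear_upsample(coarse, 3, 3):
    print(row)
# [0.0, 1.0, 2.0]
# [2.0, 3.0, 4.0]
# [4.0, 5.0, 6.0]
```

The new values in between are weighted averages of the four nearest original values, and nothing here is learned.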

2) Deconvolution (Backwards Strided Convolution)

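Unlike bilinear interpolation above, which is fixed, this method is learnable.

As the name suggests, running the convolution operation in reverse naturally gives upsampling. The filter weights used here are trainable parameters that get learned.

That may sound cryptic, so here is what the paper says.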

Upsampling with factor f is convolution with a fractional input stride of 1/f. So long as f is integral, a natural way to upsample is therefore backwards convolution(=deconvolution) with an output stride of f. Such operation is trivial to implement, since it simply reverses the forward and backward passes of convolution.

“Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.”

“The deconvolution filter need not be fixed, but can be learned.”
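
A 1D toy version made this click for me. In a backwards (transposed) convolution with output stride f, every input value scatters a scaled copy of the kernel into the output, f positions apart, and overlapping copies are summed. The sketch below uses hand-picked kernels; in FCN these weights would be learned, and the function name is mine.

```python
def transposed_conv1d(x, kernel, stride):
    """Backwards strided convolution in 1D: scatter-add scaled kernels."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


x = [1.0, 2.0, 3.0]
# A box kernel with stride 2 just repeats every value twice.
print(transposed_conv1d(x, [1.0, 1.0], stride=2))
# [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]

# A triangle kernel with stride 2 gives linear interpolation in the interior.
print(transposed_conv1d(x, [0.5, 1.0, 0.5], stride=2))
# [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 1.5]
```

So a fixed interpolation is just one particular choice of kernel, and letting the kernel be trainable is what makes the upsampling learnable.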

So the paper uses two methods, bilinear interpolation and backwards convolution, to get a dense prediction map from the coarse feature map.

The very first FCN (FCN-32s) obtained its dense map by applying bilinear interpolation to the convolutionalized VGG16.

But the coarse feature map is so small compared with the input image that a dense map obtained by interpolation alone loses a lot of detail and is not precise.

So extra work is needed for a more precise segmentation, and the Skip architecture was introduced to meet that need.

3) Skip architecture

For a more precise segmentation, the paper devises a skip architecture that combines semantic information from deep, coarse layers with appearance information from shallow, fine layers. (U-Net has skip connections too...)

We build a novel skip architecture that combines coarse, semantic and local, appearance information to refine prediction.
A new fully convolutional net that combines layers of the feature hierarchy and refines the spatial precision of the output

This intuition can be seen in the Visualizing and Understanding Convolutional Networks work.

Shallow layers pick up local features (low-level features such as lines, curves, and colors), while deep layers pick up global features (more complex, more holistic object information).

Based on this intuition, the paper improves segmentation quality by combining the dense map obtained above with information from shallow (lower) layers.

To explain the skip architecture with the figure above: for a 32x32 input image (ignoring channels for now), the feature maps after each pooling layer are 16x16, 8x8, 4x4, 2x2, and 1x1. FCN-32s upsamples the 1x1 map after pool5 by 32x to the input image size. FCN-16s takes the 2x2 map after pool4, adds to it element-wise the 1x1 pool5 map upsampled by 2x, and upsamples the sum by 16x. FCN-8s is built the same way, bringing in the 4x4 map after pool3 and upsampling by 8x; going further brought no real improvement, so the authors stopped there.
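
The bookkeeping for the 32x32 example can be sketched in a few lines. Note the swap: `upsample` here is plain nearest-neighbour repetition standing in for the learned upsampling layers, the score values are made up, and only a single class channel is shown, so this illustrates the shapes and the fusion order, not the actual model.

```python
def upsample(fmap, factor):
    """Nearest-neighbour stand-in for the learned upsampling layer."""
    return [[v for v in row for _ in range(factor)]
            for row in fmap for _ in range(factor)]


def add(a, b):
    """Element-wise sum of two equally sized maps."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def shape(fmap):
    return (len(fmap), len(fmap[0]))


# Hypothetical score maps for one class, for a 32x32 input image.
pool3 = [[0.1] * 4 for _ in range(4)]  # 4x4
pool4 = [[0.2] * 2 for _ in range(2)]  # 2x2
pool5 = [[1.0]]                        # 1x1

fcn32 = upsample(pool5, 32)                # 1x1 -> 32x32

fused16 = add(pool4, upsample(pool5, 2))   # 2x2 + (1x1 -> 2x2)
fcn16 = upsample(fused16, 16)              # 2x2 -> 32x32

fused8 = add(pool3, upsample(fused16, 2))  # 4x4 + (2x2 -> 4x4)
fcn8 = upsample(fused8, 8)                 # 4x4 -> 32x32

print(shape(fcn32), shape(fcn16), shape(fcn8))  # (32, 32) three times
```

All three variants end at the input resolution; the difference is how fine the map is before the last, largest upsampling step.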

Here is another figure showing how FCN-16s and FCN-8s are obtained.

The filters of the 1x1 Conv layers added after the pooling layers for prediction are initialized to zero, and the filters of the upsampling layers, i.e. the trainable backwards convolutions, are initialized to bilinear interpolation before training.
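
"Initialized to bilinear interpolation" means the backwards convolution starts out with a fixed triangle-shaped kernel and is then free to change during training. Below is a sketch of a commonly used formula for such a kernel. The function name is mine, and pairing a kernel of size 4 with a 2x upsampling layer is an assumption on my part, not something stated in this post.

```python
def bilinear_kernel(size):
    """2D bilinear interpolation weights for a size x size upsampling filter."""
    factor = (size + 1) // 2
    center = factor - 1 if size % 2 == 1 else factor - 0.5
    one_d = [1 - abs(i - center) / factor for i in range(size)]
    # The 2D kernel is the outer product of the 1D triangle with itself.
    return [[a * b for b in one_d] for a in one_d]


for row in bilinear_kernel(4):
    print(row)
# [0.0625, 0.1875, 0.1875, 0.0625]
# [0.1875, 0.5625, 0.5625, 0.1875]
# [0.1875, 0.5625, 0.5625, 0.1875]
# [0.0625, 0.1875, 0.1875, 0.0625]
```

Starting from these weights, the network begins with plain bilinear upsampling and only moves away from it if the loss says that helps.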

This skip architecture improves the segmentation, as the results below show.