[week 5] Object Detection: 1-stage-detector

November 2, 2022 9 분 소요

Semantic Segmentation

의미론적 분할이라는 것은 이미지를 넣었을 때, 모든 픽셀 이미지의 카테고리를 분류해 주는 것을 말한다.

download1

하지만 붙어있는 객체라면 하나의 객체로 판별을 하게 된다.

idea 1: sliding Window

download2

이미지가 들어온다면 그 이미지를 매우 작게 쪼갠다. 그 다음 작은 이미지가 어떤 카테고리에 들어갈지 일종의 분류 문제로 취급하게 된다.

이렇게 분류를 한다면 분류는 되겠지만 그렇게 좋은 아이디어는 아니다.

엄청난 계산 비용(픽셀들을 순방향, 역방향으로 계산해서 카테고리 분류를 하기 때문)
같은 객체지만 붙어있는 경우 처리할 수 없다.

그래서 아무도 사용은 하지 않음. 하지만 Semantic Segmentation이 무엇인지는 알 수 있게 한다.

idea 2: Fully Convolutional

download1

이미지가 들어가게 된다면 깊은 Conv 레이더들을 거치면서 한 번에 픽셀들을 분류를 할 수 있다. 그래서 각 층별로 특정 카테고리에 대한 점수를 통해서 카테고리를 분류할 수 있다. 따라서, 마지막 output의 깊이는 카테고리의 개수와 동일해야 된다.

sliding Window 보다는 나은 점

이미지를 작게 쪼개서 하나하나 독립적으로 계산을 하는게 아니라 한 번에 네트워크를 통해서 픽셀 별 카테고리를 분류하는 것이므로 sliding Window보다는 계산이 간단하다.

문제점

엄청난 계산 비용이 든다.

따라서 위 사진의 있는 네트워크는 사용하지 않지만, 아래 사진에 있는 네트워크는 사용하는 경우가 있다.

download1

네트워크 안에서 downsampling, upsampling을 통해서 사용하는 구조도 있다. downsampling의 방법에서는 CNN에 대해서 배울 때 많이 배웠다. Pooling, strided convolution(stride를 늘려서 크기를 줄이는 conv)를 통해서 downsampling이 가능하다.

UpSampling

downsampling하는 방법에 대해서는 배웠지만 upsampling 하는 방법에 대해서는 배우지 않았으므로 upsampling 하는 과정에 대해서 배운다.

Unpooling

풀링의 반대 과정, 맥스 풀링과 평균 풀링이 있었던 것 처럼 unpooling도 크게 2가지 방식이 있다.

여기서 Bed of Nails 방식에서 좀더 나아가 Max Unpooling 방식이 있다.

이 네트워크의 경우는 절반은 donwsampling을 하고 대칭적으로 upsampling을 하기 때문에 pair를 맞춰서 해당 맥스 풀링에 기여했던 레이어의 위치에 non-zero 값을 넣는 것이다.
Transpose Convolution

왜 Transpose Conv일까?

원래 합성곱 연산 ⏬

download1

Transpose Convolution 연산 ⏬

download2

여기에서 필터를 transpose시켜서 transpose 된 C X output을 곱해서 input이었던 크기로 복구를 시킬 수 있다. 이 과정에서 필터와 output의 곱이 overlap된다면 그 값은 더해준다.

Transpose Convolution
= Deconvolution
:: 논문에서는 볼 수 있지만 그닥 좋지 않은 의미이다. 신호 처리 관점에서는 deconvolution이라는 것 자체가 사실은 convolution의 역연산의 의미이이긴 하다.
= Upconvolution
= Fractionally strided convolution
= Backward strided convolution

Classification + Localization

이미지 안의 객체를 분류해야되고 이와 함계 위치도 판별해야된다.

download1

첫 번째 FC: Classification -> output: set of Class Score -> output과 Correct Label의 Loss -> softmax Loss
두 번째 FC: Localization -> output: four numbers(bounding box’s x, y, w, h) -> output과 Correct output의 Loss -> L2(L1, smooth L1) Loss

softmax Loss + L2 Loss = Loss를 사용한다.

download2

적용 예시 ⏫

사람의 joints를 기준으로 14개로 나누어서 14개의 x, y 예측 지점에 대해서 14개의 regression Loss(ex. L2 Euclidean loss, L1 loss 등)을 적용하고 back propagation을 통해서 다시 훈련을 한다.

자세 예측을 통해서 각각 14개의 x, y가 나왔는데 이를 통해서 classification을 여러 개 하더라도 이에 맞는 bounding box를 여러 개 output할 수 있어서 다양한 문제에 적용될 수 있다. (Object Detection)

Object Detection

Classification + Localization와의 차이점은 각 이미지에 대해서 object가 다를 수 있기 때문에 예측 해야되는 개수가 다르다. 이미지마다 객체의 개수가 다르고 이를 예측하기 어려우므로

Object Detection으로 regression을 이용하기에는 어렵다.

Sliding Window

download1

Semantic Segmentation와 비슷하게 모든 crop들을 분석하여 카테고리 별로 분류하는 형식이다.

문제점

이미지의 객체가 모든 위치에 나타날 수 있고, 크기도 다 다르고, w, h도 다 다르기 때문에 이렇게 무차별적인 sliding window를 하려면 너무 많은 게산이 소모된다.

Selective Search

blobby regions를 찾은 다음 give some set of candidate proposal regions(object가 있을 수 있는)

selective search의 경우는 약 2000 개의 proposals regions를 제공한다. 따라서 어디에 있을 모르는 object를 분류하는 것 대신에 Region Proposal Networks를 사용해서 object가 있을 만한 영역을 통해서 convolution network를 통해서 classification을 한다.

이 아이디어는 R-CNN을 통해서 왔다.

R-CNN

download2

Regions of Interest(ROI)라고 부른다. 이 ROI에 대해서 다시 selective search를 통해서 2000개의 ROI를 받는다.
Convolution Network에 넣기 전에 크기를 조정해서 width와 height를 동일하게 한다. (warp)
Convolution Network -> feature 추출
분류하기 위해서 SVM을 사용(binary classification) + bounding box의 4가지 값도 예측(regression)

문제점

Selective Search는 CPU 기반으로 진행 -> 많은 시간 소요
CNN과 SVM, Regressor 모듈이 분리되어 있다.
- CNN은 고정되므로 SVM, Bounding Box Regression 결과로 CNN을 업데이트 할 수 없다.
- end-to-end 방식으로 학습할 수는 없다.
모든 RoI는 개별적으로 CNN이 들어가므로 시간이 모르걸린다.

Fast R-CNN

R-CNN과 유사하지만 ROI를 개별적으로 처리하는 것이 아니라 전체 이미지에 해당하는 것을 처리한다.

download1

전체 이미지 -> ConvNet :: to give this high resolution convolutional feature map correspoding entire image
Regions Proposal method :: projecting thos region proposals -> taking crops
RoI Pooling layer :: 하나의 RoI 영역에 대하여 최대한 비슷한 크기의 영역으로 나누어서 거기에 대해서 Max Pooling을 적용하는 방식이다.
FC -> output: classification softmax, bbox regression
back propagation learn jointly

download2

다른 사진 Layer

문제점

CPU 상에서 Region Proposal을 사용하기 때문에 여전히 bottleneck 발생

Faster R-CNN

Fast R-CNN의 문제점을 네트 워크 자체가 own region proposals를 예측하도록 하여 해결하였다.

구성: input image -> CNN을 통해서 Feature Map 추출 -> Region Proposal Network -> Fast R-CNN

download1

전체 이미지를 CNN을 통해서 나온 feature map 에서 RPN을 통해서 개별적으로 RoI를 결정한다. 그 다음부터는 Fast R-CNN과 동일.

Region Proposal Network needs to do two things -> object인지 아닌지 여부 파악(binary classification), regress bbox

Final Network needs to do two things -> 클래스 점수에 대한 분류, regress bbox to again correct any errors

장점:

Region Proposal 작업을 GPU 장치에서 수행(RPN) -> 전체 아키텍처를 end-to-end로 학습 가능하다.

❓end-to-end 방식?

2-Stage 방식 정리

download1

R-CNN

특징: Selective Search를 통해서 나온 2000개의 Crop을 각각에 대해서 피쳐 맵을 뽑기 뒤해서 CNN를 거쳐 bbox와 category를 예측한다.

문제점: 약 2000개 정도의 Crop이 각각 CNN을 통과해야되다보니 시간이 너무 오래걸린다.
Fast R-CNN

특징: Selective Search를 통해서 나온 2000개의 Crop에 대해서 피쳐 맵을 뽑기 위해서 CNN을 한 번만 통과하여 사용한다.

문제점: R-CNN과 마찬가지로 Region Proposal을 CPU를 통해서 진행한다. -> 속도가 느리다.
Faster R-CNN

특징: Region Propoal Network(RPN)을 GPU 상에서 진행한다. 이미지를 보고 어느 곳에 object가 있을지를 예측한다.

download2

download1

bbox와 실제 bbox의 차이를 통해서 모델 감지에 대한 정답을 확인하는데, 여기서 bbox가 정답이라고 판단되는 기준을 IoU라고 한다.

IoU

두 바운딩 박스가 겹치는 비율을 의미한다. 성능 평가를 위해서도 사용되지만 NMS에서도 많이 사용된다.

NMS(Non Maximum Supperession)

객체 검출에서는 하나의 인스턴스에 하나의 bounding box가 적용되어야 할 때, 가장 많이 겹치는 것을 제외하고 나머지는 없애는 방식이다.

Detection without Proposals: YOLO / SSD

download1

예측하고자 하는 이미지를 7x7 그리드 셀로 나누고 각 cell 마다 하나의 객체를 예측
boundary boxes를 통해서 객체의 위치와 크기 파악
- B개의 boundary boxes를 예측하고 각 box는 하나의 box confidence score를 가지고 있다. box confidence score는 box가 객체를 포함하고 있을 가능성(objectness)과 boundary box가 얼마나 정확한지를 반영한다.
- C개의 conditional class probabilities를 예측한다. Conditional class probabilities는 탐지된 객체가 어느 특정 클래스에 속하는지에 대한 확률이다.
confidence score순으로 예측을 정렬
제일 높은 score에서 시작해서, 이전의 예측과 클래스가 같고 IOU > 0.5인 것이 있으면 현재의 예측 무시
모든 예측을 확인할 때까지 4번 과정 반복

Instance Segmentation

object detection + detect whole segmentation mask

download1

whole input image -> CNN -> learned RPN -> warp -> two branches

faster R-CNN -> category, bbox
object 여부 상관없이 픽셀에 대해서 분류 -> sementation

download2

Faster R-CNN 코드 실습

데이터셋 불러오기

!git clone https://github.com/Pseudo-Lab/Tutorial-Book-Utils
!python Tutorial-Book-Utils/PL_data_loader.py --data FaceMaskDetection
!unzip -q Face\ Mask\ Detection.zip

train set / test set 분리

import os
import random
import numpy as np
import shutil

# 불러온 total data 개수 : 853
print(len(os.listdir('annotations')))
print(len(os.listdir('images')))

# test_images, test_annotations 폴더 생성
!mkdir test_images
!mkdir test_annotations

# 랜덤으로 170개를 가져와서 idx에 넣어놓음.
random.seed(1234)
idx = random.sample(range(853), 170)

# 170개의 이미지를 test_images 폴더로 이동
for img in np.array(sorted(os.listdir('images')))[idx]:
    shutil.move('images/'+img, 'test_images/'+img)

# 170개의 이미지를 test_annotations 폴더로 이동
for annot in np.array(sorted(os.listdir('annotations')))[idx]:
    shutil.move('annotations/'+annot, 'test_annotations/'+annot)

print(len(os.listdir('annotations')))
print(len(os.listdir('images')))
print(len(os.listdir('test_annotations')))
print(len(os.listdir('test_images')))

>>>
853
853
683
683
170
170

클래스 정의

torchvision은 이미지 처리를 하기 위해 사용되며 데이터셋에 관한 패키지와 모델에 관한 패키지가 내장되어 있습니다.

import os
import numpy as np
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
from PIL import Image
import torchvision
from torchvision import transforms, datasets, models
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
import time

# bounding box를 위한 함수
def generate_box(obj):
  xmin = float(obj.find('xmin').text)
  ymin = float(obj.find('ymin').text)
  xmax = float(obj.find('xmax').text)
  ymax = float(obj.find('ymax').text)

  return [xmin, ymin, xmax, ymax]

adjust_label = 1

# category를 위한 함수
def generate_label(obj):
  if obj.find('name').text == 'with_mask':
    return 1 + adjust_label
  elif obj.find('name').text == 'mask_weared_incorrect':
    return 2 + adjust_label
  return 0 + adjust_label

# target 생성
def generate_target(file):
  with open(file) as f:
    data = f.read()
    soup = BeautifulSoup(data, "html.parser")
    objects = soup.find_all("object")

    num_objs = len(objects)

    boxes = []
    labels = []
    for i in objects:
      boxes.append(generate_box(i))
      labels.append(generate_label(i))

    boxes = torch.as_tensor(boxes, dtype=torch.float32)
    labels = torch.as_tensor(labels, dtype=torch.int64)

    target = {}
    target['boxes'] = boxes
    target['labels'] = labels

    return target

# target 이미지에 bbox 그리기
  def plot_image_from_output(img, annotation):

    img = img.cpu().permute(1,2,0)

    fig,ax = plt.subplots(1)
    ax.imshow(img)

    for idx in range(len(annotation["boxes"])):
        xmin, ymin, xmax, ymax = annotation["boxes"][idx]

        if annotation['labels'][idx] == 1 :
            rect = patches.Rectangle((xmin,ymin),(xmax-xmin),(ymax-ymin),linewidth=1,edgecolor='r',facecolor='none')

        elif annotation['labels'][idx] == 2 :

            rect = patches.Rectangle((xmin,ymin),(xmax-xmin),(ymax-ymin),linewidth=1,edgecolor='g',facecolor='none')

        else :

            rect = patches.Rectangle((xmin,ymin),(xmax-xmin),(ymax-ymin),linewidth=1,edgecolor='orange',facecolor='none')

        ax.add_patch(rect)

    plt.show()

데이터셋 클래스와 데이터 로더를 정의해준다. torch.utils.data.DataLoader 함수를 통해 배치 사이즈는 4로 지정.

class MaskDataset(object):
    def __init__(self, transforms, path):
        '''
        path: path to train folder or test folder
        '''
        # transform module과 img path 경로를 정의
        self.transforms = transforms
        self.path = path
        self.imgs = list(sorted(os.listdir(self.path)))

    # loads and returns a sample from the dataset at the given index idx.
    def __getitem__(self, idx): #special method
        # load images ad masks
        file_image = self.imgs[idx]
        file_label = self.imgs[idx][:-3] + 'xml'
        img_path = os.path.join(self.path, file_image)

        if 'test' in self.path:
            label_path = os.path.join("test_annotations/", file_label)
        else:
            label_path = os.path.join("annotations/", file_label)

        img = Image.open(img_path).convert("RGB")
        #Generate Label
        target = generate_target(label_path)

        if self.transforms is not None:
            img = self.transforms(img)

        return img, target

    def __len__(self):
        return len(self.imgs)

data_transform = transforms.Compose([  # transforms.Compose : list 내의 작업을 연달아 할 수 있게 호출하는 클래스
        transforms.ToTensor() # ToTensor : numpy 이미지에서 torch 이미지로 변경
    ])

def collate_fn(batch):
    return tuple(zip(*batch))

dataset = MaskDataset(data_transform, 'images/')
test_dataset = MaskDataset(data_transform, 'test_images/')

data_loader = torch.utils.data.DataLoader(dataset, batch_size=4, collate_fn=collate_fn)
test_data_loader = torch.utils.data.DataLoader(test_dataset, batch_size=2, collate_fn=collate_fn)

모델 불러오기

torchvision.models.detection.fasterrcnn_resnet50_fpn : torchvision에서 제공하는 Faster R-CNN API
pretrained=True/False : pretrained 설정 여부
num_classes : 클래스 개수 설정

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_model_instance_segmentation(num_classes):

    # 모델 불러오기
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    # get number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    return model

전이 학습

기존 클래스

adjust_label = 1 : 마스크를 쓰지 않음
adjust_label = 2 : 마스크를 씀
adjust_label = 3 : 제대로 마스크를 쓰지 않음

+ background 클래스가 포함하여 num_classes는 4로 설정

model = get_model_instance_segmentation(4)

# GPU 사용(torch.cuda.is_available() = True)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

모델 학습

# 반복 횟수
num_epochs = 10
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                                momentum=0.9, weight_decay=0.0005)

print('----------------------train start--------------------------')
for epoch in range(num_epochs):
    start = time.time()
    model.train()
    i = 0
    epoch_loss = 0
    # train set(data_loader)에서 이미지를 하나씩 불러와서 로스 계산을 통해 최적화를 수행
    for imgs, annotations in data_loader:
        i += 1
        # img 리스트
        imgs = list(img.to(device) for img in imgs)
        # k: boxes, v: box pos, labels(category)
        annotations = [{k: v.to(device) for k, v in t.items()} for t in annotations]
        loss_dict = model(imgs, annotations)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        epoch_loss += losses
    print(f'epoch : {epoch+1}, Loss : {epoch_loss}, time : {time.time() - start}')

>>>
----------------------train start--------------------------
epoch : 1, Loss : 74.38842010498047, time : 258.89981985092163
epoch : 2, Loss : 49.511959075927734, time : 256.5860526561737
epoch : 3, Loss : 42.1615104675293, time : 256.39848041534424
epoch : 4, Loss : 36.24540328979492, time : 256.0167520046234
epoch : 5, Loss : 32.7506217956543, time : 256.156777381897
epoch : 6, Loss : 31.926315307617188, time : 256.31392216682434
epoch : 7, Loss : 30.4024600982666, time : 256.148469209671
epoch : 8, Loss : 30.105791091918945, time : 256.33737349510193
epoch : 9, Loss : 25.10750961303711, time : 256.2675304412842
epoch : 10, Loss : 23.202547073364258, time : 256.12270879745483

모델 예측

예측결과에는 바운딩 박스의 좌표(boxes)와 클래스(labels), 점수(scores)가 포함된다. 점수(scores)에는 해당 클래스의 신뢰도 값이 저장되는데 threshold(옳은 cate를 고른 것인지 아닌지에 대한 기준)로 0.5 이상인 것만 추출하도록 함수make_prediction를 정의하겠다. 그리고 test_data_loader의 첫번째 배치에 대해서만 결과를 출력해보았다.

def make_prediction(model, img, threshold):
    model.eval()
    preds = model(img)
    for id in range(len(preds)) :
        idx_list = []

        for idx, score in enumerate(preds[id]['scores']) :
            if score > threshold :
                idx_list.append(idx)

        preds[id]['boxes'] = preds[id]['boxes'][idx_list]
        preds[id]['labels'] = preds[id]['labels'][idx_list]
        preds[id]['scores'] = preds[id]['scores'][idx_list]

    return preds

with torch.no_grad():
    # 테스트셋 배치사이즈= 2
    for imgs, annotations in test_data_loader:
        imgs = list(img.to(device) for img in imgs)

        pred = make_prediction(model, imgs, 0.5)
        print(pred)
        break

Twitter Facebook LinkedIn

Sohn SooKyoung

[week 5] Object Detection: 1-stage-detector

Semantic Segmentation

idea 1: sliding Window

idea 2: Fully Convolutional

UpSampling

Classification + Localization

Object Detection

Sliding Window

Selective Search

R-CNN

Fast R-CNN

Faster R-CNN

2-Stage 방식 정리

Detection without Proposals: YOLO / SSD

Instance Segmentation

Faster R-CNN 코드 실습

데이터셋 불러오기

train set / test set 분리

클래스 정의

모델 불러오기

전이 학습

모델 학습

모델 예측

공유하기

댓글남기기

참고

Mujoco Tutorial 02/28

[졸업 프로젝트] 시각장애인을 위한 LaTeX 수식 음성 변환 기능 구현

[통계분석실습] 빅데이터 기반 프로야구 인기도 지표 분석 및 구단별 인기 기여 정도 파악

[Chap 5] EM 알고리즘