MMDetection
In this article, we investigate state-of-the-art object detection algorithms and their implementations in MMDetection. We write an up-to-date guide for setting up and running MMDetection on a structured COCO-2017 dataset in Google Colab based on our experience. We analyse and classify the key component structure (detection head/neck/backbone) of MMDetection models. We then reproduce the object detection task for popular models on mainstream benchmarks, including the COCO-2017 dataset, and construct a detailed interactive table. In the last part of our work, we apply signature object detection algorithms to real-life photos and videos taken on the UCLA campus and provide error analysis. Finally, we perform basic adversarial attacks on our models using representative photos and evaluate their effects.
Video 1. Spotlight Overview.
- 1. Introduction
- 2. MMdetection Setup and Data Preparation
- 3. Model Structure and Classification
- 4. Model Evaluation
- 5. Interactive Framework/Backbone Results Table
- 6. Photo/Video Demo and Error Analysis
- 7. Adversarial Attacks and Error Analysis
- 8. Demo Links
- Reference
1. Introduction
Stemming from image classification and object localization, object detection is one of the core upstream discriminative tasks in computer vision. Object detection forms the basis of many downstream tasks such as semantic segmentation, instance segmentation, and image captioning, making it one of the fundamental building blocks of modern applications including face detection and autonomous driving. Due to its high level of generality and wide range of applications, object detection has become one of the most well-studied topics in computer vision over the past 20 years. To assist researchers in evaluating and applying state-of-the-art object detection algorithms, the MMDetection library was created as a general toolbox that contains a rich set of object detection and instance segmentation methods as well as related components and modules. Thanks to its flexible, modular implementation in PyTorch, it is considered a great improvement over the previous Caffe-based Facebook Detectron library, both in the size of its model zoo and in overall performance.
Example results: object detection and adversarial attack.
2. MMdetection Setup and Data Preparation
2.1 Environment Setup
Environment building and version control is usually the most time-consuming step if not done carefully. Here we write an up-to-date Colab setup guide and share our experience dealing with problems in the official MMDetection setup docs.
- In your Google Drive, create a new .ipynb file for environment setup; here I name it “SETUP.ipynb”. In SETUP.ipynb, do the following environment setup.
-
First, mount your Google Drive:
import os
from google.colab import drive
drive.mount('/content/drive')
-
Here I create a new directory named “MMDet1” and cd into it:
%cd /content/drive/My Drive
!mkdir MMDet1
%cd MMDet1
-
Now we can use pip to install the mmdetection package and other dependencies
I made some adjustments to the official colab tutorial at:
https://github.com/open-mmlab/mmdetection/blob/master/demo/MMDet_Tutorial.ipynb
# install dependencies: (use cu101 because colab has CUDA 10.1)
!pip install -U torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
# install mmcv-full thus we could use CUDA operators
!pip install mmcv-full
# Install mmdetection
!rm -rf mmdetection
!git clone https://github.com/open-mmlab/mmdetection.git
%cd mmdetection
!pip install -e .
# install Pillow 7.0.0 back in order to avoid bug in colab
!pip install Pillow==7.0.0
The above installation command does not specify the MMCV version, which can lead to confusing error messages later on. It also results in an unusually long download time. So here I change the original implementation to specify the version details:
# install dependencies (torch 1.8.1 and torchvision 0.9.1 built for CUDA 11.1)
!pip install -U torch==1.8.1+cu111 torchvision==0.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
# install mmcv-full thus we could use CUDA operators
# !pip install mmcv-full
# install a pre-built mmcv-full matching the CUDA 11.1 / PyTorch 1.8 combination above
!pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html
# Install mmdetection
!rm -rf mmdetection
!git clone https://github.com/open-mmlab/mmdetection.git
%cd mmdetection
!pip install -e .
# install Pillow 7.0.0 back in order to avoid bug in colab
!pip install Pillow==7.0.0
Executing the above block takes about 5 minutes for me.
-
Next we import the installed packages (remember to set your runtime type to GPU) and print the package versions to verify that everything installed correctly:
# Check Pytorch installation
import torch, torchvision
print(torch.__version__, torch.cuda.is_available())
# Check MMDetection installation
import mmdet
print(mmdet.__version__)
# Check mmcv installation
from mmcv.ops import get_compiling_cuda_version, get_compiler_version
print(get_compiling_cuda_version())
print(get_compiler_version())
If everything is correct, this block should output:
1.8.1+cu111 True
2.21.0
11.1
GCC 7.3
2.2 Dataset Preparation
Next we prepare the PASCAL VOC and COCO-2017 datasets inside our mmdetection folder. We need to set the datasets up in the expected directory structure so that MMDetection knows how to read them.
I wrote a colab tutorial on this part at: Google Colaboratory
-
First set up the Drive environment by installing mmdetection and going into the folder we created last time:
%load_ext autoreload
%autoreload 2
import os
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
# install dependencies (torch 1.8.1 and torchvision 0.9.1 built for CUDA 11.1)
# !pip install -U torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install -U torch==1.8.1+cu111 torchvision==0.9.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
# install mmcv-full thus we could use CUDA operators
# !pip install mmcv-full
# remember to install an mmcv-full build matching the CUDA/PyTorch versions above
!pip install mmcv-full -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html
# Install mmdetection
!rm -rf mmdetection
!git clone https://github.com/open-mmlab/mmdetection.git
%cd mmdetection
!pip install -e .
# install Pillow 7.0.0 back in order to avoid bug in colab
!pip install Pillow==7.0.0
-
Dataset download: COCO-2017
# here we download the dataset inside our mmdetection folder
!python3 tools/misc/download_dataset.py --dataset-name coco2017
# remember that the command only downloads the zip files containing the coco dataset
# to actually use the dataset, we need to unzip first
-
Unzip the COCO dataset. This may take approximately 2 hours; do not close the browser during any part of this process.
!unzip "data/coco/annotations_trainval2017.zip" -d "data/coco/" !unzip "data/coco/test2017.zip" -d "data/coco/" !unzip "data/coco/train2017.zip" -d "data/coco/" !unzip "data/coco/val2017.zip" -d "data/coco/"
-
Dataset download: Pascal_VOC 2007
# here we download the dataset inside our mmdetection folder
!python3 tools/misc/download_dataset.py --dataset-name voc2007
# remember that the command only downloads the archives containing the VOC dataset
# to actually use the dataset, we need to extract them first
-
Extract the PASCAL VOC dataset. This is quicker than COCO due to its smaller size and its use of tar instead of zip; again, remember not to close the browser during any part of this process.
# voc is faster to load because of its smaller size and the fact that it uses tar instead of zip
!tar -xvf "data/voc/VOCdevkit_08-Jun-2007.tar" -C "data/voc/"
!tar -xvf "data/voc/VOCtest_06-Nov-2007.tar" -C "data/voc/"
!tar -xvf "data/voc/VOCtrainval_06-Nov-2007.tar" -C "data/voc/"
-
Now just wait a few hours for Google Drive to sync all the images and you are good to go! You can also run detection right away using the images stored in the cache (Drive's temporary local memory).
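Once the archives are extracted, a quick sanity check confirms that the annotations can be read. Below is a minimal sketch assuming the data/coco/annotations layout produced by the steps above (pycocotools is installed as an mmdet dependency):

# verify that the COCO val2017 annotation file loads correctly
from pycocotools.coco import COCO
coco = COCO('data/coco/annotations/instances_val2017.json')
print(len(coco.imgs), 'images;', len(coco.cats), 'categories')  # expect 5000 images and 80 categories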
3. Model Structure and Classification
3.1 Model Structure
There is a popular paradigm for deep learning-based object detectors: the backbone network (typically designed for classification and pre-trained on ImageNet) extracts basic features from the input image, then the neck (e.g., a feature pyramid network) enhances the multi-scale features from the backbone, after which the detection head predicts the object bounding boxes with position and classification information. Based on their detection heads, cutting-edge methods for generic object detection can be briefly categorized into two major branches. The first branch contains one-stage detectors such as YOLO, SSD, RetinaNet, NAS-FPN, and EfficientDet. The other branch contains two-stage methods such as Faster R-CNN, FPN, Mask R-CNN, Cascade R-CNN, and Libra R-CNN. Recently, academic attention has been geared toward anchor-free detectors, due partly to the emergence of FPN and Focal Loss, and more elegant end-to-end detectors have been proposed. On the one hand, FSAF, FCOS, ATSS, and GFL improve RetinaNet with center-based anchor-free methods. On the other hand, CornerNet and CenterNet detect object bounding boxes with a keypoint-based method.
reference: https://arxiv.org/pdf/2107.00420.pdf
3.2 Model Components in MMdetection
Based on the information above, we can better understand how MMDetection constructs its models. In MMDetection, a model is assembled from the following three main building blocks (a config sketch showing how they compose follows the component examples in Section 3.3):
- Backbone: First, MMDetection uses an FCN backbone network to extract feature maps from the image. This part has the largest influence on prediction results.
- Neck: The neck then enhances the multi-scale features from the backbone. This is the part between the backbone and the heads; most models simply use FPN as the default.
- Detection Head: After the input image is processed by the backbone and neck, MMDetection passes the results to detection heads that perform specific tasks such as bounding box prediction and mask prediction.
In addition, the following two components are customizable within each model:
- ROI Extractor: the part that extracts features from the feature maps (e.g., for each region of interest)
- Loss: the component in the head that calculates losses
3.3 Examples for Each Component
Main Components:
- Backbone: ResNet, VGG, ResNeXt, SWIN, PVT
- Neck: FPN, CARAFE, ASPP
- Detection Head: bbox prediction, mask prediction
Sub-Components:
- ROI Extractor: RoI Pooling, RoI Align
- Loss: L1Loss, L2Loss, GHMLoss, FocalLoss
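To see how these components come together in practice, here is an abbreviated sketch in the MMDetection config style. The field values follow the Faster R-CNN R-50 FPN config, but many required keys (anchor settings, train/test configs, etc.) are omitted for brevity, so treat it as an illustration rather than a runnable config:

model = dict(
    type='FasterRCNN',                        # detection framework
    backbone=dict(type='ResNet', depth=50),   # feature-extraction backbone
    neck=dict(                                # multi-scale feature fusion
        type='FPN',
        in_channels=[256, 512, 1024, 2048],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(type='RPNHead'),            # proposal head (two-stage models only)
    roi_head=dict(                            # RoI head with its extractor and losses
        type='StandardRoIHead',
        bbox_roi_extractor=dict(type='SingleRoIExtractor'),
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            loss_cls=dict(type='CrossEntropyLoss'),
            loss_bbox=dict(type='L1Loss'))))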
3.4 Model Classification:
3.4.1 Detection Frameworks:
We categorize MMDetection frameworks into two categories based on their detection heads: 1-stage detection models and 2-stage detection models.
Popular frameworks that we hope to investigate are listed below:
- 1-stage detection models:
- YOLO-X
- YOLOv2
- YOLOv3
- SSD
- DSSD
- RetinaNet
- NAS-FPN
- EfficientDet
- 2-stage detection models:
- R-CNN
- Fast-R-CNN
- Faster-R-CNN
- FPN
- Mask-R-CNN
- Cascade-R-CNN
- Libra R-CNN
- SPP-Net
3.4.2 Backbones:
MMDetection backbones can also be split into two categories based on whether they predominantly use convolutional neural networks or the Transformer self-attention structure to perform feature extraction.
Popular backbones that we hope to investigate are listed below:
- CNN-Based Backbones:
- VGG (ICLR’2015)
- ResNet (CVPR’2016)
- ResNeXt (CVPR’2017)
- MobileNetV2 (CVPR’2018)
- HRNet (CVPR’2019)
- Generalized Attention (ICCV’2019)
- GCNet (ICCVW’2019)
- Res2Net (TPAMI’2020)
- RegNet (CVPR’2020)
- ResNeSt (ArXiv’2020)
- Transformer-Based Backbones:
- Swin Transformer (ICCV’2021)
- PVT (ICCV’2021)
- PVTv2 (ArXiv’2021)
3.4.3 Necks:
Finally, here is a list of necks commonly used in MMDetection. Because the neck has a limited influence on detection results, we do not focus on the effects of different necks; we only use the default FPN in our investigations of model backbones and detection frameworks.
Popular necks in MMDetection include:
- FPN
- PAFPN
- NAS-FPN
- CARAFE
4. Model Evaluation
4.1 Config Name Style
When using or contributing to the MMDetection library, the most important file that the user will call or customize is the model config file. It is important to have a good understanding of the config file naming convention before performing actual tasks with MMDetection, so that we can configure the model to be exactly what we have in mind. First, let's take a look at the universal config file name pattern:
{model}_[model setting]_{backbone}_{neck}_[norm setting]_[misc]_[gpu x batch_per_gpu]_{schedule}_{dataset}
Now that we understand the model structure in MMDetection described above, we can better understand the various config naming components[8] listed below (a short sketch of inspecting a config programmatically follows the list):
- `{xxx}` is a required field and `[yyy]` is optional.
- `{model}`: model type like `faster_rcnn`, `mask_rcnn`, etc.
- `[model setting]`: specific setting for some models, like `without_semantic` for `htc`, `moment` for `reppoints`, etc.
- `{backbone}`: backbone type like `r50` (ResNet-50), `x101` (ResNeXt-101).
- `{neck}`: neck type like `fpn`, `pafpn`, `nasfpn`, `c4`.
- `[norm_setting]`: `bn` (Batch Normalization) is used unless specified; other norm layer types could be `gn` (Group Normalization) or `syncbn` (Synchronized Batch Normalization). `gn-head`/`gn-neck` indicates GN is applied in the head/neck only, while `gn-all` means GN is applied in the entire model, e.g. backbone, neck, head.
- `[misc]`: miscellaneous settings/plugins of the model, e.g. `dconv`, `gcb`, `attention`, `albu`, `mstrain`.
- `[gpu x batch_per_gpu]`: GPUs and samples per GPU; `8x2` is used by default.
- `{schedule}`: training schedule, options are `1x`, `2x`, `20e`, etc. `1x` and `2x` mean 12 epochs and 24 epochs respectively. `20e` is adopted in cascade models and denotes 20 epochs. For `1x`/`2x`, the initial learning rate decays by a factor of 10 at the 8th/16th and 11th/22nd epochs. For `20e`, the initial learning rate decays by a factor of 10 at the 16th and 19th epochs.
- `{dataset}`: dataset like `coco`, `cityscapes`, `voc_0712`, `wider_face`.
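These naming components map directly onto the config contents. As a quick sketch, we can load the VFNet config used in Section 4.2 and inspect its parts (expected values shown as comments):

from mmcv import Config

cfg = Config.fromfile('configs/vfnet/vfnet_r50_fpn_1x_coco.py')
print(cfg.model.type)           # expected: 'VFNet'
print(cfg.model.backbone.type)  # expected: 'ResNet' (with depth 50)
print(cfg.model.neck.type)      # expected: 'FPN'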
4.2 Evaluate VarifocalNet with a ResNet-50 Backbone on COCO-2017
Now that we understand the model config files, we can use them to customize and evaluate the models and backbones of our choice. Here is an example of evaluating the VarifocalNet (VFNet) framework with a ResNet-50 backbone (FPN neck) on the COCO-2017 dataset. We provide the model setting file routes and the complete execution results below (execution takes approximately 20 minutes on one Nvidia T-100 GPU):
4.2.1 Model settings:
!python3 tools/test.py \
configs/vfnet/vfnet_r50_fpn_1x_coco.py \
checkpoints/vfnet_r50_fpn_1x_coco_20201027-38db6f58.pth \
--eval bbox proposal \
--eval-options "classwise=True"
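Note that tools/test.py expects the checkpoint to already exist under checkpoints/; a sketch of fetching it first, using the model_url we collected in our results table (Section 5):

!mkdir -p checkpoints
!wget -c https://download.openmmlab.com/mmdetection/v2.0/vfnet/vfnet_r50_fpn_1x_coco/vfnet_r50_fpn_1x_coco_20201027-38db6f58.pth \
      -O checkpoints/vfnet_r50_fpn_1x_coco_20201027-38db6f58.pth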
4.2.2 Execution Results:
/content/mmdetection/mmdet/utils/setup_env.py:33: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
f'Setting OMP_NUM_THREADS environment variable for each process '
/content/mmdetection/mmdet/utils/setup_env.py:43: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
f'Setting MKL_NUM_THREADS environment variable for each process '
loading annotations into memory...
Done (t=2.28s)
creating index...
index created!
load checkpoint from local path: checkpoints/vfnet_r50_fpn_1x_coco_20201027-38db6f58.pth
[>>] 5000/5000, 4.0 task/s, elapsed: 1239s, ETA: 0s
Evaluating bbox...
Loading and preparing results...
DONE (t=1.29s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=46.34s).
Accumulating evaluation results...
DONE (t=13.19s).
| Metric | IoU | Area | maxDets | Score |
|---|---|---|---|---|
| Average Precision (AP) | 0.50:0.95 | all | 100 | 0.408 |
| Average Precision (AP) | 0.50 | all | 1000 | 0.587 |
| Average Precision (AP) | 0.75 | all | 1000 | 0.441 |
| Average Precision (AP) | 0.50:0.95 | small | 1000 | 0.244 |
| Average Precision (AP) | 0.50:0.95 | medium | 1000 | 0.450 |
| Average Precision (AP) | 0.50:0.95 | large | 1000 | 0.526 |
| Average Recall (AR) | 0.50:0.95 | all | 100 | 0.598 |
| Average Recall (AR) | 0.50:0.95 | all | 300 | 0.598 |
| Average Recall (AR) | 0.50:0.95 | all | 1000 | 0.619 |
| Average Recall (AR) | 0.50:0.95 | small | 1000 | 0.401 |
| Average Recall (AR) | 0.50:0.95 | medium | 1000 | 0.650 |
| Average Recall (AR) | 0.50:0.95 | large | 1000 | 0.757 |
| category | AP | category | AP | category | AP |
|---------------|-------|--------------|-------|----------------|-------|
| person | 0.557 | bicycle | 0.305 | car | 0.445 |
| motorcycle | 0.407 | airplane | 0.654 | bus | 0.644 |
| train | 0.599 | truck | 0.375 | boat | 0.274 |
| traffic light | 0.275 | fire hydrant | 0.682 | stop sign | 0.632 |
| parking meter | 0.461 | bench | 0.219 | bird | 0.352 |
| cat | 0.654 | dog | 0.620 | horse | 0.583 |
| sheep | 0.498 | cow | 0.585 | elephant | 0.637 |
| bear | 0.704 | zebra | 0.679 | giraffe | 0.676 |
| backpack | 0.169 | umbrella | 0.403 | handbag | 0.123 |
| tie | 0.325 | suitcase | 0.389 | frisbee | 0.655 |
| skis | 0.239 | snowboard | 0.274 | sports ball | 0.462 |
| kite | 0.428 | baseball bat | 0.273 | baseball glove | 0.349 |
| skateboard | 0.509 | surfboard | 0.335 | tennis racket | 0.485 |
| bottle | 0.394 | wine glass | 0.386 | cup | 0.434 |
| fork | 0.310 | knife | 0.162 | spoon | 0.136 |
| bowl | 0.413 | banana | 0.239 | apple | 0.209 |
| sandwich | 0.358 | orange | 0.335 | broccoli | 0.237 |
| carrot | 0.225 | hot dog | 0.307 | pizza | 0.495 |
| donut | 0.461 | cake | 0.354 | chair | 0.270 |
| couch | 0.405 | potted plant | 0.288 | bed | 0.326 |
| dining table | 0.178 | toilet | 0.585 | tv | 0.558 |
| laptop | 0.585 | mouse | 0.613 | remote | 0.284 |
| keyboard | 0.534 | cell phone | 0.344 | microwave | 0.576 |
| oven | 0.316 | toaster | 0.372 | sink | 0.360 |
| refrigerator | 0.533 | book | 0.148 | clock | 0.511 |
| vase | 0.386 | scissors | 0.261 | teddy bear | 0.469 |
| hair drier | 0.093 | toothbrush | 0.234 | None | None |
Evaluating proposal...
Loading and preparing results...
DONE (t=0.56s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=75.46s).
Accumulating evaluation results...
DONE (t=12.70s).
| Metric | IoU | Area | maxDets | Score |
|---|---|---|---|---|
| Average Precision (AP) | 0.50:0.95 | all | 100 | 0.453 |
| Average Precision (AP) | 0.50 | all | 1000 | 0.678 |
| Average Precision (AP) | 0.75 | all | 1000 | 0.486 |
| Average Precision (AP) | 0.50:0.95 | small | 1000 | 0.296 |
| Average Precision (AP) | 0.50:0.95 | medium | 1000 | 0.522 |
| Average Precision (AP) | 0.50:0.95 | large | 1000 | 0.633 |
| Average Recall (AR) | 0.50:0.95 | all | 100 | 0.619 |
| Average Recall (AR) | 0.50:0.95 | all | 300 | 0.619 |
| Average Recall (AR) | 0.50:0.95 | all | 1000 | 0.619 |
| Average Recall (AR) | 0.50:0.95 | small | 1000 | 0.448 |
| Average Recall (AR) | 0.50:0.95 | medium | 1000 | 0.691 |
| Average Recall (AR) | 0.50:0.95 | large | 1000 | 0.820 |
Note that the single most important statistic we use for ranking the algorithms is the “Average Precision (AP) @[ IoU=0.50:0.95 | area=all | maxDets=100 ]”. Here, the mean average precision of VarifocalNet with a ResNet-50 backbone on COCO-2017 is 40.8%. Other statistics of importance are “Average Recall (AR) @[ IoU=0.50:0.95 | area=all | maxDets=100 ] = 0.598”, “Average Precision (AP) @[ IoU=0.50:0.95 | area=small | maxDets=1000 ] = 0.244”, “Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=1000 ] = 0.450”, and “Average Precision (AP) @[ IoU=0.50:0.95 | area=large | maxDets=1000 ] = 0.526”.
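To make the IoU criterion behind these AP numbers concrete, here is a minimal sketch of the box-overlap computation (our own illustration; the actual evaluation is handled by pycocotools inside tools/test.py). A predicted box counts as a true positive only if its IoU with a ground-truth box of the same class exceeds the threshold, and AP@[0.50:0.95] averages over the thresholds 0.50, 0.55, ..., 0.95:

def box_iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.33, so this pair would not count as a match at IoU=0.50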
5. Interactive Framework/Backbone Results Table
With the above methods of construction and evaluation, we build the following interactive framework/backbone results table, which compares the performance and memory usage of different detection frameworks and feature-extraction backbones on the COCO-2017 dataset. Due to limitations in Markdown iframe embedding, here we can only display the non-interactive version of our results table. To view a color-coded and interactive version of this table, where you can sort and group by different labels, please refer to our Interactive Table built with the Notion API.
5.1 Non-Interactive Version for Markdown View
This table contains our reproduced results of popular MMDetection frameworks and backbones on the COCO-2017 dataset:
Model | # Stage | Backbone | Base | Neck | Schedule | Memory | Status | box mAP | box mAR | Small Objects | Medium Objects | Large Objects | MyCode/Results | config | model_url | Scale |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Fast R-CNN | 2-Stage | ResNet-50 | CNN | FPN | 1x | N/A | Deprecated | N/A | N/A | N/A | N/A | N/A | N/A | configs/fast_rcnn/fast_rcnn_r50_fpn_1x_coco.py | N/A | |
Fast R-CNN | 2-Stage | ResNet-101 | CNN | FPN | 1x | N/A | Deprecated | N/A | N/A | N/A | N/A | N/A | N/A | configs/fast_rcnn/fast_rcnn_r101_fpn_1x_coco.py | N/A | |
Faster R-CNN | 2-Stage | ResNet-50 | CNN | FPN | 1x | 4.0GB | Executable | 0.374 | 0.517 | 0.212 | 0.41 | 0.481 | https://colab.research.google.com/drive/14YV0cGt2tOMpZeNS6V-mN1eNGaXWoNSf?usp=sharing | configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth | |
Faster R-CNN | 2-Stage | ResNet-101 | CNN | FPN | 1x | 6.0GB | Executable | 0.394 | 0.533 | 0.224 | 0.437 | 0.511 | https://colab.research.google.com/drive/14YV0cGt2tOMpZeNS6V-mN1eNGaXWoNSf?usp=sharing | configs/faster_rcnn/faster_rcnn_r101_fpn_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r101_fpn_1x_coco/faster_rcnn_r101_fpn_1x_coco_20200130-f513f705.pth | |
VarifocalNet | 1-Stage | ResNet-50 | CNN | FPN | 1x | N/A | Executable | 0.408 | 0.598 | 0.244 | 0.45 | 0.526 | https://colab.research.google.com/drive/1dj94bQ_8u3qzoChjg3KrOcJMshtzNmc_?usp=sharing | configs/vfnet/vfnet_r50_fpn_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/vfnet/vfnet_r50_fpn_1x_coco/vfnet_r50_fpn_1x_coco_20201027-38db6f58.pth | |
VarifocalNet | 1-Stage | ResNeXt-101 | CNN | FPN | 2x | N/A | Has Bugs | 0.496 | 0.656 | 0.314 | 0.548 | 0.64 | https://colab.research.google.com/drive/1dj94bQ_8u3qzoChjg3KrOcJMshtzNmc_?usp=sharing | configs/vfnet/vfnet_x101_64x4d_fpn_mdconv_c3-c5_mstrain_2x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/vfnet/vfnet_x101_64x4d_fpn_mdconv_c3-c5_mstrain_2x_coco/vfnet_x101_64x4d_fpn_mdconv_c3-c5_mstrain_2x_coco_20201027pth-b5f6da5e.pth | |
Mask R-CNN | 2-Stage | Swin-T | Transformer | FPN | 1x | 7.6GB | Executable | 0.427 | 0.559 | 0.265 | 0.459 | 0.566 | https://colab.research.google.com/drive/1aggq_JyvM4JpVp_EvsPkikYBwki_gxUe?usp=sharing | configs/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/swin/mask_rcnn_swin-t-p4-w7_fpn_1x_coco/mask_rcnn_swin-t-p4-w7_fpn_1x_coco_20210902_120937-9d6b7cfa.pth | |
Mask R-CNN | 2-Stage | Swin-T | Transformer | FPN | 3x | 7.8GB | Executable | 0.46 | 0.595 | 0.314 | 0.493 | 0.593 | https://colab.research.google.com/drive/1aggq_JyvM4JpVp_EvsPkikYBwki_gxUe?usp=sharing | configs/swin/mask_rcnn_swin-t-p4-w7_fpn_fp16_ms-crop-3x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/swin/mask_rcnn_swin-t-p4-w7_fpn_fp16_ms-crop-3x_coco/mask_rcnn_swin-t-p4-w7_fpn_fp16_ms-crop-3x_coco_20210908_165006-90a4008c.pth | |
Mask R-CNN | 2-Stage | Swin-S | Transformer | FPN | 3x | 11.9GB | Executable | 0.482 | 0.604 | 0.321 | 0.518 | 0.627 | https://colab.research.google.com/drive/1aggq_JyvM4JpVp_EvsPkikYBwki_gxUe?usp=sharing | configs/swin/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/swin/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco_20210903_104808-b92c91f1.pth | |
YOLOv3 | 1-Stage | MobileNetV2 | CNN | None | 300e | 5.3GB | Executable | 0.239 | 0.374 | 0.106 | 0.251 | 0.349 | https://colab.research.google.com/drive/178sU_2FkqzRww05knIigsLRrA_ed7P82?usp=sharing | configs/yolo/yolov3_mobilenetv2_mstrain-416_300e_coco.py | https://download.openmmlab.com/mmdetection/v2.0/yolo/yolov3_mobilenetv2_mstrain-416_300e_coco/yolov3_mobilenetv2_mstrain-416_300e_coco_20210718_010823-f68a07b3.pth | 416 |
YOLOv3 | 1-Stage | DarkNet-53 | CNN | None | 273e | 2.7GB | Executable | 0.279 | 0.395 | 0.105 | 0.301 | 0.438 | https://colab.research.google.com/drive/178sU_2FkqzRww05knIigsLRrA_ed7P82?usp=sharing | configs/yolo/yolov3_d53_mstrain-608_273e_coco.py | https://download.openmmlab.com/mmdetection/v2.0/yolo/yolov3_d53_mstrain-608_273e_coco/yolov3_d53_mstrain-608_273e_coco_20210518_115020-a2c3acb8.pth | 608 |
YOLOX | 1-Stage | YOLOX-x | CNN | None | 300e | 28.1GB | Executable | 0.506 | 0.636 | 0.32 | 0.556 | 0.667 | https://colab.research.google.com/drive/178sU_2FkqzRww05knIigsLRrA_ed7P82?usp=sharing | configs/yolox/yolox_x_8x8_300e_coco.py | https://download.openmmlab.com/mmdetection/v2.0/yolox/yolox_x_8x8_300e_coco/yolox_x_8x8_300e_coco_20211126_140254-1ef88d67.pth | 640 |
RetinaNet | 1-Stage | ResNet-50 | CNN | FPN | 1x | 3.8GB | Executable | 0.365 | 0.54 | 0.204 | 0.403 | 0.481 | https://colab.research.google.com/drive/1XmgSOg9U7UhK5rc3bYfQVBVtvBupDIk5?usp=sharing | configs/retinanet/retinanet_r50_fpn_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/retinanet/retinanet_r50_fpn_1x_coco/retinanet_r50_fpn_1x_coco_20200130-c2398f9e.pth | |
RetinaNet | 1-Stage | ResNet-101 | CNN | FPN | 1x | 5.7GB | Executable | 0.385 | 0.554 | 0.217 | 0.428 | 0.505 | https://colab.research.google.com/drive/1XmgSOg9U7UhK5rc3bYfQVBVtvBupDIk5?usp=sharing | configs/retinanet/retinanet_r101_fpn_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/retinanet/retinanet_r101_fpn_1x_coco/retinanet_r101_fpn_1x_coco_20200130-7a93545f.pth | |
RetinaNet | 1-Stage | ResNeXt-101 | CNN | FPN | 1x | 10.0GB | Executable | 0.41 | 0.569 | 0.239 | 0.452 | 0.54 | https://colab.research.google.com/drive/1XmgSOg9U7UhK5rc3bYfQVBVtvBupDIk5?usp=sharing | configs/retinanet/retinanet_x101_64x4d_fpn_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/retinanet/retinanet_x101_64x4d_fpn_1x_coco/retinanet_x101_64x4d_fpn_1x_coco_20200130-366f5af1.pth | |
RetinaNet | 1-Stage | PVT-Small | Transformer | FPN | 12e | 14.5GB | Executable | 0.404 | 0.563 | 0.248 | 0.432 | 0.548 | https://colab.research.google.com/drive/1XmgSOg9U7UhK5rc3bYfQVBVtvBupDIk5?usp=sharing | configs/pvt/retinanet_pvt-s_fpn_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/pvt/retinanet_pvt-s_fpn_1x_coco/retinanet_pvt-s_fpn_1x_coco_20210906_142921-b6c94a5b.pth | |
RetinaNet | 1-Stage | PVT-V2-b0 | Transformer | FPN | 12e | 7.4GB | Executable | 0.371 | 0.544 | 0.234 | 0.404 | 0.492 | https://colab.research.google.com/drive/1Novfkkgq5qQQM3-NRmqEpyWEcxc3b7_p?usp=sharing | configs/pvt/retinanet_pvtv2-b0_fpn_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/pvt/retinanet_pvtv2-b0_fpn_1x_coco/retinanet_pvtv2-b0_fpn_1x_coco_20210831_103157-13e9aabe.pth | |
RetinaNet | 1-Stage | PVT-V2-b4 | Transformer | FPN | 12e | 17.0GB | Executable | 0.463 | 0.607 | 0.29 | 0.501 | 0.627 | https://colab.research.google.com/drive/1Novfkkgq5qQQM3-NRmqEpyWEcxc3b7_p?usp=sharing | configs/pvt/retinanet_pvtv2-b4_fpn_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/pvt/retinanet_pvtv2-b4_fpn_1x_coco/retinanet_pvtv2-b4_fpn_1x_coco_20210901_170151-83795c86.pth | |
Faster R-CNN | 2-Stage | R-50 (RSB) | CNN | FPN | 1x | 3.9GB | Executable | 0.408 | 0.542 | 0.254 | 0.449 | 0.532 | https://colab.research.google.com/drive/1W9MUoQI212ctwwrAn6HeIVKyG6pj03Il?usp=sharing | configs/resnet_strikes_back/faster_rcnn_r50_fpn_rsb-pretrain_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/resnet_strikes_back/faster_rcnn_r50_fpn_rsb-pretrain_1x_coco/faster_rcnn_r50_fpn_rsb-pretrain_1x_coco_20220113_162229-32ae82a9.pth | |
RetinaNet | 1-Stage | R-50 (RSB) | CNN | FPN | 1x | 3.8GB | Executable | 0.39 | 0.554 | 0.234 | 0.427 | 0.522 | https://colab.research.google.com/drive/1W9MUoQI212ctwwrAn6HeIVKyG6pj03Il?usp=sharing | configs/resnet_strikes_back/retinanet_r50_fpn_rsb-pretrain_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/resnet_strikes_back/retinanet_r50_fpn_rsb-pretrain_1x_coco/retinanet_r50_fpn_rsb-pretrain_1x_coco_20220113_175432-bd24aae9.pth | |
Mask R-CNN | 2-Stage | R-50 (RSB) | CNN | FPN | 1x | 4.5GB | Executable | 0.412 | 0.549 | 0.248 | 0.453 | 0.539 | https://colab.research.google.com/drive/1W9MUoQI212ctwwrAn6HeIVKyG6pj03Il?usp=sharing | configs/resnet_strikes_back/mask_rcnn_r50_fpn_rsb-pretrain_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/resnet_strikes_back/mask_rcnn_r50_fpn_rsb-pretrain_1x_coco/mask_rcnn_r50_fpn_rsb-pretrain_1x_coco_20220113_174054-06ce8ba0.pth | |
Cascade Mask R-CNN | 2-Stage | R-50 (RSB) | CNN | FPN | 1x | 6.2GB | Executable | 0.448 | 0.575 | 0.267 | 0.483 | 0.598 | https://colab.research.google.com/drive/1W9MUoQI212ctwwrAn6HeIVKyG6pj03Il?usp=sharing | configs/resnet_strikes_back/cascade_mask_rcnn_r50_fpn_rsb-pretrain_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/resnet_strikes_back/cascade_mask_rcnn_r50_fpn_rsb-pretrain_1x_coco/cascade_mask_rcnn_r50_fpn_rsb-pretrain_1x_coco_20220113_193636-8b9ad50f.pth | |
Faster R-CNN | 2-Stage | ResNeSt-50 | CNN | FPN | 1x | 4.8GB | Executable | 0.42 | 0.557 | 0.267 | 0.459 | 0.535 | https://colab.research.google.com/drive/1hUmImLvvKOe4e3odyqs2Q52bgbksDQ5p?usp=sharing | configs/resnest/faster_rcnn_s50_fpn_syncbn-backbone+head_mstrain-range_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/resnest/faster_rcnn_s50_fpn_syncbn-backbone%2Bhead_mstrain-range_1x_coco/faster_rcnn_s50_fpn_syncbn-backbone%2Bhead_mstrain-range_1x_coco_20200926_125502-20289c16.pth | |
Faster R-CNN | 2-Stage | ResNeSt-101 | CNN | FPN | 1x | 7.1GB | Executable | 0.445 | 0.573 | 0.287 | 0.486 | 0.573 | https://colab.research.google.com/drive/1hUmImLvvKOe4e3odyqs2Q52bgbksDQ5p?usp=sharing | configs/resnest/faster_rcnn_s101_fpn_syncbn-backbone+head_mstrain-range_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/resnest/faster_rcnn_s101_fpn_syncbn-backbone%2Bhead_mstrain-range_1x_coco/faster_rcnn_s101_fpn_syncbn-backbone%2Bhead_mstrain-range_1x_coco_20201006_021058-421517f1.pth | |
Mask R-CNN | 2-Stage | ResNeSt-101 | CNN | FPN | 1x | 7.8GB | Executable | 0.452 | 0.582 | 0.29 | 0.489 | 0.584 | https://colab.research.google.com/drive/1hUmImLvvKOe4e3odyqs2Q52bgbksDQ5p?usp=sharing | configs/resnest/mask_rcnn_s101_fpn_syncbn-backbone+head_mstrain_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/resnest/mask_rcnn_s101_fpn_syncbn-backbone%2Bhead_mstrain_1x_coco/mask_rcnn_s101_fpn_syncbn-backbone%2Bhead_mstrain_1x_coco_20201005_215831-af60cdf9.pth | |
Cascade Mask R-CNN | 2-Stage | ResNeSt-101 | CNN | FPN | 1x | 10.5GB | Executable | 0.477 | 0.603 | 0.301 | 0.518 | 0.614 | https://colab.research.google.com/drive/1hUmImLvvKOe4e3odyqs2Q52bgbksDQ5p?usp=sharing | configs/resnest/cascade_mask_rcnn_s101_fpn_syncbn-backbone+head_mstrain_1x_coco.py | https://download.openmmlab.com/mmdetection/v2.0/resnest/cascade_mask_rcnn_s101_fpn_syncbn-backbone%2Bhead_mstrain_1x_coco/cascade_mask_rcnn_s101_fpn_syncbn-backbone%2Bhead_mstrain_1x_coco_20201005_113243-42607475.pth | |
*Our code for each row can be found via the links in the “MyCode/Results” column.
*For performance breakdowns on specific object categories, please also refer to the links in the “MyCode/Results” column.
*In the “config” and “model_url” columns, we have collected the respective paths for each of the models we tested, so that readers can reproduce our results more easily.
*FPN for Feature Pyramid Network
*PVT for Pyramid Vision Transformers
*RSB for ResNet Strikes Back
5.2 Results Analysis
Among all the small to medium sized prediction models (memory usage under 20GB), the highest performing model uses the SWIN-Small backbone and the Mask-RCNN detection head. This is consistent with the current COCO-2017 leaderboard on https://paperswithcode.com/sota/object-detection-on-coco where 9 out of the top 10 models utilize some form of the SWIN-Large backbone.
Fig 0. COCO-2017 object detection task leaderboard (updated 03/05/2022).
We observe that our current detection models obtain much lower mAP on detecting Small Objects compared to Large and Medium Objects. While most modern detection models can reach 50-60 percent accuracy on Large Objects, their accuracy on Small Objects stagnates at around 20-30 percent.
Object Groups by Size | Small Objects | Medium Objects | Large Objects |
---|---|---|---|
Average mAP over all Models | 0.25021 | 0.45029 | 0.54533 |
Taking the above-mentioned SWIN-S model (0.482 mAP) as an example, while it predicts large objects with 0.627 mAP, it only achieves 0.321 mAP on small objects. Looking more closely at the performance breakdown by object category, detection performance varies hugely between large objects like “airplane” (0.728 AP) and “bus” (0.716 AP) and small objects like “hair drier” (0.110 AP) and “traffic light” (0.311 AP).
category | AP | category | AP | category | AP |
---|---|---|---|---|---|
person | 0.583 | bicycle | 0.380 | car | 0.497 |
motorcycle | 0.501 | airplane | 0.728 | bus | 0.716 |
train | 0.708 | truck | 0.423 | boat | 0.333 |
traffic light | 0.311 | fire hydrant | 0.734 | stop sign | 0.679 |
parking meter | 0.532 | bench | 0.322 | bird | 0.409 |
cat | 0.726 | dog | 0.692 | horse | 0.646 |
sheep | 0.574 | cow | 0.623 | elephant | 0.677 |
bear | 0.764 | zebra | 0.691 | giraffe | 0.697 |
backpack | 0.232 | umbrella | 0.456 | handbag | 0.241 |
tie | 0.382 | suitcase | 0.478 | frisbee | 0.710 |
skis | 0.313 | snowboard | 0.453 | sports ball | 0.478 |
kite | 0.464 | baseball bat | 0.394 | baseball glove | 0.431 |
skateboard | 0.608 | surfboard | 0.458 | tennis racket | 0.569 |
bottle | 0.444 | wine glass | 0.410 | cup | 0.490 |
fork | 0.459 | knife | 0.307 | spoon | 0.302 |
bowl | 0.466 | banana | 0.315 | apple | 0.260 |
sandwich | 0.445 | orange | 0.365 | broccoli | 0.272 |
carrot | 0.255 | hot dog | 0.436 | pizza | 0.558 |
donut | 0.539 | cake | 0.453 | chair | 0.364 |
couch | 0.509 | potted plant | 0.364 | bed | 0.495 |
dining table | 0.307 | toilet | 0.668 | tv | 0.630 |
laptop | 0.678 | mouse | 0.652 | remote | 0.432 |
keyboard | 0.543 | cell phone | 0.415 | microwave | 0.629 |
oven | 0.392 | toaster | 0.402 | sink | 0.418 |
refrigerator | 0.645 | book | 0.194 | clock | 0.540 |
vase | 0.408 | scissors | 0.453 | teddy bear | 0.528 |
hair drier | 0.110 | toothbrush | 0.384 | None | None |
Similar results can be observed for a state-of-the-art CNN-based ResNeSt-101 model with comparable overall performance (0.477 mAP): 0.614 mAP on large objects versus 0.301 on small objects, with “airplane” (0.725 AP) and “bus” (0.710 AP) versus “hair drier” (0.065 AP) and “traffic light” (0.314 AP). Thus we have reason to believe that both CNN and Transformer architectures struggle with small object detection.
category | AP | category | AP | category | AP |
---|---|---|---|---|---|
person | 0.602 | bicycle | 0.364 | car | 0.510 |
motorcycle | 0.491 | airplane | 0.725 | bus | 0.710 |
train | 0.701 | truck | 0.403 | boat | 0.330 |
traffic light | 0.314 | fire hydrant | 0.727 | stop sign | 0.700 |
parking meter | 0.502 | bench | 0.297 | bird | 0.409 |
cat | 0.754 | dog | 0.694 | horse | 0.625 |
sheep | 0.588 | cow | 0.615 | elephant | 0.697 |
bear | 0.763 | zebra | 0.725 | giraffe | 0.725 |
backpack | 0.200 | umbrella | 0.478 | handbag | 0.203 |
tie | 0.419 | suitcase | 0.458 | frisbee | 0.731 |
skis | 0.304 | snowboard | 0.486 | sports ball | 0.508 |
kite | 0.477 | baseball bat | 0.390 | baseball glove | 0.457 |
skateboard | 0.620 | surfboard | 0.464 | tennis racket | 0.577 |
bottle | 0.456 | wine glass | 0.421 | cup | 0.486 |
fork | 0.436 | knife | 0.288 | spoon | 0.249 |
bowl | 0.466 | banana | 0.292 | apple | 0.253 |
sandwich | 0.404 | orange | 0.313 | broccoli | 0.267 |
carrot | 0.270 | hot dog | 0.423 | pizza | 0.570 |
donut | 0.550 | cake | 0.423 | chair | 0.347 |
couch | 0.476 | potted plant | 0.321 | bed | 0.493 |
dining table | 0.324 | toilet | 0.649 | tv | 0.616 |
laptop | 0.662 | mouse | 0.647 | remote | 0.439 |
keyboard | 0.566 | cell phone | 0.419 | microwave | 0.606 |
oven | 0.396 | toaster | 0.340 | sink | 0.412 |
refrigerator | 0.662 | book | 0.203 | clock | 0.557 |
vase | 0.414 | scissors | 0.384 | teddy bear | 0.543 |
hair drier | 0.065 | toothbrush | 0.349 | None | None |
5.3 Current SOTA – SWIN Transformers
Based on our experiments, the SWIN Transformer currently outperforms CNN models on the object detection task. Drawing on our understanding of the SWIN Transformer paper[1], we provide a short analysis of SWIN Transformers and why they can outperform CNNs in object detection.
Different from traditional CNN-based architectures, the SWIN backbone brings the self-attention Transformer structure from natural language processing into computer vision tasks. Using a shifted-window architecture, SWIN Transformers compute self-attention efficiently while still allowing cross-window connections. As a result, the computational complexity grows linearly with image size rather than quadratically, as it would under global attention. The shifted windows ensure that local self-attention computations still exchange information between different parts of the image, similar in effect to traditional convolutional layers.
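To make the efficiency argument concrete, the complexity comparison reported in the Swin Transformer paper[1] (for an h × w grid of patch tokens with channel dimension C and a fixed window size M) is:

$$\Omega(\mathrm{MSA}) = 4hwC^2 + 2(hw)^2C, \qquad \Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$$

The global multi-head self-attention (MSA) term is quadratic in the number of tokens hw, while window-based self-attention (W-MSA) stays linear in hw because M is fixed (e.g. 7).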
6. Photo/Video Demo and Error Analysis
6.1 Building a MMDetection image/video detector
Here’s our code for building a Mask-RCNN + ResNet-50 image detector with MMDetection:
!mkdir checkpoints
!wget -c https://download.openmmlab.com/mmdetection/v2.0/mask_rcnn/mask_rcnn_r50_caffe_fpn_mstrain-poly_3x_coco/mask_rcnn_r50_caffe_fpn_mstrain-poly_3x_coco_bbox_mAP-0.408__segm_mAP-0.37_20200504_163245-42aa3d00.pth \
-O checkpoints/mask_rcnn_r50_caffe_fpn_mstrain-poly_3x_coco_bbox_mAP-0.408__segm_mAP-0.37_20200504_163245-42aa3d00.pth
from mmdet.apis import inference_detector, init_detector, show_result_pyplot
# Choose to use a config and initialize the detector
config = 'configs/mask_rcnn/mask_rcnn_r50_caffe_fpn_mstrain-poly_3x_coco.py'
# Setup a checkpoint file to load
checkpoint = 'checkpoints/mask_rcnn_r50_caffe_fpn_mstrain-poly_3x_coco_bbox_mAP-0.408__segm_mAP-0.37_20200504_163245-42aa3d00.pth'
# initialize the detector
model = init_detector(config, checkpoint, device='cuda:0')
import matplotlib.pyplot as plt
import mmcv
import cv2
import numpy as np
from google.colab.patches import cv2_imshow
# Use the detector to do inference
img = mmcv.imread('demo/demo.jpg')
plt.figure(figsize=(15, 10))
plt.imshow(mmcv.bgr2rgb(img))
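The block above only loads the model and the demo image. To produce the detections shown in the experiments below, we run inference and plot the result with the same inference_detector/show_result_pyplot calls that we also use in Section 7:

# run the detector on the loaded image
result = inference_detector(model, img)
# visualize predictions with scores above 0.3
show_result_pyplot(model, img, result, score_thr=0.3)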
6.2 Image/Video detection experiments and analysis
Experiment 1: NBA Game image
Fig 1. Experiment 1: NBA Game image.
This NBA image is very crowded with people. Many persons are detected with high scores, but many others are detected with very low scores, indicating low confidence, and quite a few persons are not detected at all.
Experiment 2: Ackerman Union image
Fig 2. Experiment 2: Ackerman Union image.
This result is very good: the detector correctly identifies almost every instance, such as “person” and “TV”.
Experiment 3: Social Event Image
Fig 3. Experiment 3: Social Event Image.
Very good results: “cup”, “table”, “fork”, “knife”, “cell phone”, “wine glass”, “bottle”, and “couch” are all identified with scores greater than 0.5.
Experiment 4: UCLA Health image
Fig 4. Experiment 4: UCLA Health image.
“Traffic light”, cars, and persons are correctly identified, which is pretty good. Other objects such as trees and street lights are not identified.
Experiment 5: Concert image
Fig 5. Experiment 5: Concert image.
Closer persons are identified, but persons far away, blended with the colorful lights, are not. A human observer could identify them as people from the semantic context of a concert, but this algorithm cannot do that; it still relies mainly on pixel-wise image patterns. Also, something on the ceiling is detected as bicycles, which is a mistake, again suggesting that the algorithm is based on shapes and patterns rather than meaning and semantics.
Experiment 6: Kitchen Video
We also tested the detector on our own video examples. The algorithm takes data in .mp4 format; the videos are converted to GIFs for presentation purposes.
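For reference, the mp4-to-GIF conversion can be done with a single command; a sketch assuming ffmpeg is available in the Colab environment (the file name result_kitchen.mp4 is an illustrative placeholder):

!ffmpeg -i result_kitchen.mp4 -vf "fps=10,scale=640:-1" result_kitchen.gif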
Fig 6. Experiment 6: Kitchen table GIF.
Experiment 7: Street View Video
Fig 7. Experiment 7: result_streets.gif.
Experiment 8: Bruinwalk Video
Fig 8. Experiment 8: bruinwalk1.gif.
Videos Error Analysis
On Bruinwalk, the algorithm has very high accuracy: it is able to recognize persons, trucks, umbrellas, and fire hydrants. Interestingly, it mistakes the Bruin Bear statue for an elephant. It also mistakes a banner on the ground for a parking meter, and an advertising image on the banner for a book, probably because it looks like a book cover. Another interesting discovery is that the algorithm recognizes a tree's reflection in a building's window as a potted plant; it can even detect mirror images! For an instant it mistakes a gap in the tree's leaves for a car, it mistakes a four-footed banner for a chair, and it mistakes a stair handrail for an airplane. In the Rochester-Midvale street-view video, it correctly identifies many potted plants and cars, but it frequently labels gaps in the tree leaves as elephants or horses. In the cooking video, it correctly identifies many bowls, a sink, and carrots, but it mistakes some bowls for clocks or vases, probably because the shapes of these bowls are somewhat artistic. It mistakes an orange-colored plate for an orange, a plate for a frisbee, romaine hearts for broccoli, and, quite strangely, a pot for a bird. It also mistakes bottled soy sauce and bottled olive oil for wine; it is indeed a hard problem to teach an algorithm to distinguish whether a bottle contains soy sauce, wine, oil, or vinegar, since it cannot taste! Finally, it mistakes a pot with a glass lid for a wine glass.
In conclusion, the mistakes are all reasonable given the similarity in shape between the true object and the wrongly predicted label, especially with only a 2D image or video to work from. The reaction speed of the algorithm is also much faster than a human's. Given the variety and the large number of objects in the video and such rapid changes of perspective, a human watching the video cannot catch all the details, but the algorithm is able to detect almost every tiny object, even at the very edge of the frame, that would not draw a human's attention.
7. Adversarial Attacks and Error Analysis
7.1 Adversarial Attacks on the Mask_RCNN + ResNet-50 Image Detector
# Draw a filled black rectangle (the adversarial patch) directly on the demo image
image = np.zeros((512,512,3), np.uint8)  # blank canvas (unused below; kept from the original notebook)
cv2.rectangle(img, (310,220), (360,270), (0,0,0), -1)
cv2_imshow(img)
cv2.waitKey(0)
cv2.destroyAllWindows()
plt.show()
result = inference_detector(model, img)
# Let's plot the result
show_result_pyplot(model, img, result, score_thr=0.3)
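Experiments 1 through 5 below differ only in the location and size of the rectangle, so a small helper (a hypothetical convenience wrapper, not part of our original notebook) makes the sweep easier to reproduce:

def run_patch_experiment(model, img_path, pt1, pt2, color=(0, 0, 0), score_thr=0.3):
    """Draw a filled rectangular patch on a copy of the image, then run detection."""
    img = mmcv.imread(img_path).copy()
    cv2.rectangle(img, pt1, pt2, color, -1)  # the adversarial patch
    result = inference_detector(model, img)
    show_result_pyplot(model, img, result, score_thr=score_thr)
    return result

# Experiment 1, for example:
# run_patch_experiment(model, 'demo/demo.jpg', (310, 220), (360, 270))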
Experiment 1:
cv2.rectangle(img, (310,220), (360,270), (0,0,0), -1)
Fig 9. Black patches Experiment 1.
Experiment 2:
cv2.rectangle(img, (100,100), (300,250), (127,50,127), -1)
Fig 10. Black patches Experiment 2.
Experiment 3:
cv2.rectangle(img, (300,220), (380,270), (0,0,0), -1)
Fig 11. Black patches Experiment 3.
Experiment 4:
cv2.rectangle(img, (280,200), (400,300), (0,0,0), -1)
Fig 12. Black patches Experiment 4.
Experiment 5:
cv2.rectangle(img, (260,190), (410,320), (0,0,0), -1)
Fig 13. Black patches Experiment 5.
Experiment 6:
This experiment uses a picture of our living room; the little black patch at the top left corner does not catch the attention of the algorithm.
Fig 14. Black patches Experiment 6.
Experiment 7:
cv2.rectangle(img, (700,800), (1000,1000), (0,0,0), -1)
Fig 15. Black patches Experiment 7.
The network is fooled into recognizing our black patch as a “TV” with score = 0.63.
Experiment 8: bruinwalk image
Fig 16. Black patches Experiment 8: bruinwalk image.
The bear, persons, and bags are identified correctly.
Experiment 9: bruinwalk images with a black patch
Fig 17. Black patches Experiment 9: bruinwalk image.
The persons and bags are still identified correctly, but the black patch makes the algorithm think the bear is an elephant. The algorithm is clearly influenced by the black patch.
7.2 Mask_RCNN + ResNet-50 Experiments Analysis
Static images
The detection algorithm works well on the demo image provided by MMDetection itself, with high accuracy and low error rates. If we set the score threshold to 0.5, i.e. only display objects whose category score is larger than 0.5, we get almost every object with correct bounding boxes and class labels.
For the customized image, which is provided not by MMDetection but by us, i.e. the living room image, it makes more mistakes. For example, it misrecognizes an orange as an apple, and the vinyl CDs are misrecognized as books. This is probably because the side view of the CDs is indeed very similar to books, the orange from such a faraway viewpoint looks indistinguishable from an apple, and both are very likely to be placed in plates on a living room table. The mistakes are understandable, since a human might even make the same misrecognitions given this specific image. The black patch on our customized image, when small, does not even draw attention from the algorithm: it does not give a bounding box, which means it does not consider it an object. When the black patch becomes large, the algorithm misrecognizes it as a TV, which is very interesting. This recognition makes a lot of sense, since a black square-shaped object inside a living room is indeed very likely to be a TV. Although the algorithm is fooled, this TV misrecognition, on the contrary, demonstrates the sensibility of the model in a sense.
Images with black patch attacks
For the black patch interference experiments, if the black patch is small compared to the size of the object, the detection algorithm is barely affected. It still gives correct bounding boxes and class labels with scores as high as 1.00. However, when the black patch grows to about half the size of the object and blurs significant edges and shapes of the object, as in Experiments 4 and 5, the algorithm is fooled.
7.3 Comparison with Mask_RCNN + SWIN Transformers
In Section 5.2 we observed that, among all the small to medium sized prediction models (memory usage under 20GB), the highest performing model uses the SWIN-S backbone combined with the Mask R-CNN detection head, making it the current SOTA in our table. We therefore replicate our previous experiments, this time equipping the detector with the SWIN-S backbone, hoping to see some observable improvements:
!mkdir checkpoints
!wget -c https://download.openmmlab.com/mmdetection/v2.0/swin/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco_20210903_104808-b92c91f1.pth \
-O checkpoints/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco_20210903_104808-b92c91f1.pth
from mmdet.apis import inference_detector, init_detector, show_result_pyplot
# Choose to use a config and initialize the detector
config = 'configs/swin/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco.py'
# Setup a checkpoint file to load
checkpoint = 'checkpoints/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco_20210903_104808-b92c91f1.pth'
# initialize the detector
model = init_detector(config, checkpoint, device='cuda:0')
import matplotlib.pyplot as plt
import mmcv
import cv2
import numpy as np
from google.colab.patches import cv2_imshow
# Use the detector to do inference
img = mmcv.imread('demo/demo.jpg')
plt.figure(figsize=(15, 10))
plt.imshow(mmcv.bgr2rgb(img))
# print(img.shape) #(427, 640, 3)
# Draw a filled black rectangle (the adversarial patch) on the image
image = np.zeros((512,512,3), np.uint8)
# cv2.rectangle(image, (100,100), (300,250), (127,50,127), -1)
cv2.rectangle(img, (310,220), (360,270), (0,0,0), -1)
cv2_imshow(img)
cv2.waitKey(0)
cv2.destroyAllWindows()
plt.show()
result = inference_detector(model, img)
# Let's plot the result
show_result_pyplot(model, img, result, score_thr=0.3)
To perform object detection on videos with SWIN-S, we specify the following in Colab:
!python demo/video_demo.py demo/cooking.mp4 \
configs/swin/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco.py \
checkpoints/mask_rcnn_swin-s-p4-w7_fpn_fp16_ms-crop-3x_coco_20210903_104808-b92c91f1.pth \
--out swin_result_cooking.mp4
Experiment 10: Ackerman Union Image SWIN-S
Fig 18. Experiment 10: ackerman_swin.png.
Experiment 11: Social Event Image SWIN-S
Fig 19. Experiment 11: birthday_swin.png.
Experiment 12: Concert Image SWIN-S
Fig 20. Experiment 12: concert_swin.png.
Experiment 13: Restaurant Table Image SWIN-S
Fig 21. Experiment 13: eating_swin.png.
Experiment 14: NBA Game Image SWIN-S
Fig 22. Experiment 14: nba_swin.png.
Experiment 15: UCLA Health Image SWIN-S
Fig 23. Experiment 15: uclahealth_swin.png.
The static image results are almost the same as the outputs of the Mask_RCNN + ResNet-50 detector. Next, we look at video results detected by our Mask R-CNN framework with a SWIN-S backbone:
Experiment 16: Streets Video SWIN-S
Video 2. Experiment 16: Streets Video SWIN-S.
Experiment 17: Cooking Video SWIN-S
Video 3. Experiment 17: Cooking Video SWIN-S.
Experiment 18: Bruinwalk Video_1 SWIN-S
Video 4. Experiment 18: Bruinwalk Video_1 SWIN-S.
Experiment 19: Bruinwalk Video_2 SWIN-S
Video 5. Experiment 19: Bruinwalk Video_2 SWIN-S.
Experiment 20: Bruinwalk Video_3 SWIN-S
Video 6. Experiment 20: Bruinwalk Video_3 SWIN-S.
For the street-view video, the Mask-RCNN + SWIN-S detector finds an additional stop sign that is not detected by Mask-RCNN + ResNet-50, but it also wrongly labels a non-existent umbrella at the end of the video. For the other video results, we find no significant difference from Mask-RCNN + ResNet-50.
Although the Swin Transformer is recognized as the state-of-the-art model for many computer vision tasks, the results of our experiments show little observable difference from the outputs of ResNet-50, the former state of the art that came out a few years earlier. This may be evidence that, in terms of mAP increases, progress on the object detection task is indeed stagnating.
Furthermore, in terms of robustness against adversarial attacks, both models suffer dramatically from simple black-patch attacks on the real-world images and videos we took in common day-to-day settings. This result demonstrates the vulnerability of current object detection models to potential disturbances and human sabotage, highlighting the need to increase the adversarial robustness of such algorithms in real-world applications.
8. Demo Links
For links to our code and Colab demo, please refer to here and here.
For link to the Colored Interactive Framework/Backbone Results Table where you can sort and group_by different labels, please refer to here.
For our recorded spotlight video, please refer to here.
Reference
[1] Liu, Ze, et al. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows.” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2021.
[2] Wang, Wenhai, et al. “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions.” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2021.
[3] He et al. “Deep Residual Learning for Image Recognition.” IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
[4] Wang et al. “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions.” Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2021.
[5] Wang et al. “Pvtv2: Improved baselines with pyramid vision transformer.” arXiv preprint arXiv:2106.13797. 2021.
[6] Zhang, Hang, et al. “ResNeSt: Split-Attention Networks.” arXiv preprint arXiv:2004.08955. 2021.
[7] Gao, Shang-Hua, et al. “Res2Net: A New Multi-Scale Backbone Architecture.” Proceedings of the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). 2021.
[8] Chen, Kai, et al. “MMDetection: Open MMLab Detection Toolbox and Benchmark. “ arXiv preprint arXiv:1906.07155. 2019.