Explore wider usage of CLIP, a large-scale self-supervised model. We believe CLIP has more uses than those shown in the original paper, because of its strong feature extraction ability.

Motivation

Recently, there has been a trend in industry toward training large-scale self-supervised models. Such models use huge amounts of unlabeled data to learn intrinsic features and, in this way, produce more general and robust predictions. CLIP is one such model and has gained a lot of attention in recent years. It consists of an image encoder and a text encoder. In the forward pass, it takes an image/text-description pair as input, computes a loss based on the difference between the resulting image and text feature vectors, and optimizes the parameters of both encoders simultaneously. Typical uses of CLIP include fine-tuning the image encoder as a pretrained model, and formulating the prediction task as text queries that are fed into CLIP together with the image to obtain a score vector. However, in the CLIP paper OpenAI mainly focused on zero-shot performance, and training a CLIP model remains expensive. In recent years, new methods have been proposed and shown to be powerful on downstream tasks. We believe CLIP and other recent models have more uses than those shown in their original papers, as they have strong feature extraction ability, but when and how to use them remains to be explored.
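
To make the training objective described above concrete, here is a minimal sketch of the symmetric contrastive loss in plain PyTorch. It is written from the description in the paper, not taken from OpenAI's code; the tensor names and batching are our own assumptions.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, logit_scale):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs.

    image_features, text_features: (batch, dim) outputs of the two encoders.
    logit_scale: learnable temperature parameter (CLIP stores its log).
    """
    # L2-normalize so the dot product becomes a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs
    logits_per_image = logit_scale.exp() * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    targets = torch.arange(image_features.size(0), device=image_features.device)
    loss_i = F.cross_entropy(logits_per_image, targets)  # match each image to its text
    loss_t = F.cross_entropy(logits_per_text, targets)   # match each text to its image
    return (loss_i + loss_t) / 2
```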

First, we compare models trained with different amounts of data per label on datasets that differ from ImageNet. We also extract features from each model and train simple ML algorithms on them. We want to see which model reaches higher accuracy with fewer labeled samples.

Second, we compare the generality of CLIP with that of other models. For this we chose two datasets that exhibit distribution drift between training and test data. We run experiments both with fine-tuning and with frozen features plus simple ML, and see which approach is less sensitive to distribution drift.

Third, we further explore which features are better suited for clustering and dimensionality reduction. In real applications, dimensionality reduction and clustering help us better understand data during the preprocessing stage. We examine whether the clusters reflect the actual labels for selected fine-grained datasets.

Related work

Tested models:

The experiments are run on the following models, which fall into three types.

Type 1: language-supervised model

  • CLIP model (ViT-32) [1]
    Fig 1. CLIP model

We introduced the CLIP model in the motivation section.

Type 2: models trained on an open-source dataset (ImageNet)

  • Vision Transformer: ViT-32 (base and large versions), ViT-16 (base and large versions) [2]
    Fig 2. ViT model

Vision Transformer (ViT) is the first attempt to use the transformer architecture on CV tasks and reach SoTA performance. It divides the image horizontally and vertically into several patches (for example, dividing a 224*224 image into 14*14 patches of size 16*16), then linearly transforms every patch and adds a positional embedding to each. (A special learnable token is prepended at position 0 with the special meaning 'cls'; it is used to capture the overall relation among patches.) The resulting 196 patch vectors (plus the [cls] token) are few enough to be used as input tokens for a standard transformer encoder. After several encoder layers, the output at position 0, which captures the overall relation, is connected to a head network to generate the result.
Experiments show that, with the help of a massive amount of supervised training data, ViT can beat state-of-the-art convolution-based models, and its performance scales well with the number of parameters and the amount of training data.
For ViT-16 and ViT-32, the number indicates the patch size: e.g. for ViT-16, the image is divided into patches of size 16*16 pixels.
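
To illustrate the patch-embedding step described above, here is a rough sketch in PyTorch. The module and parameter names are ours, not those of any reference implementation, and the default sizes simply match the 224x224 / 16x16 example.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and linearly project each one."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2  # 14*14 = 196 for ViT-16
        # A strided convolution is equivalent to flattening every patch and
        # applying the same linear map to each of them.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))         # special [cls] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images):                             # (B, 3, 224, 224)
        x = self.proj(images).flatten(2).transpose(1, 2)   # (B, 196, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)     # (B, 1, dim)
        x = torch.cat([cls, x], dim=1) + self.pos_embed    # prepend [cls], add positions
        return x                                           # tokens fed to the transformer encoder
```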

  • New CNN: convnext_small, convnext_base [3]
    Fig 3. ConvNeXt block

Since the invention of the vision transformer, researchers have found many ways to replace the core attention mechanism with simpler modules while still attaining comparable performance, so it is debatable whether the success of ViT comes from the attention mechanism or from other careful designs of the transformer architecture. In ConvNeXt, researchers modify the ResNet architecture to adopt several modern designs from ViT. It uses a different stage compute ratio and a different stem cell structure, and uses grouped convolution to reduce computation. ConvNeXt also follows ViT in using an inverted bottleneck layer and larger kernel sizes.
Besides these macro changes, ConvNeXt also makes some micro changes, such as replacing the ReLU activation with GeLU and reducing the number of activation layers. Overall, ConvNeXt uses a purely convolutional architecture to reach top performance once again.

  • Traditional CNN: efficientnet_b4, efficientnet_b6 [4]
    Fig 4. EfficientNet

EfficientNet is a family of convolutional networks found by NAS (neural architecture search). The researchers use MBConv as the basic block and search the same exploration space as MnasNet to obtain the baseline EfficientNet-B0. They then do a grid search over three scaling factors: network depth (number of layers), width (number of kernels per layer), and resolution (size of the image/feature map). Finally, for different accuracy/efficiency requirements, they scale these three factors together using a single compound coefficient to get the network configuration.
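
A small sketch of the compound scaling rule described above; the coefficient values are approximately the ones reported in the EfficientNet paper, and the mapping from model index to phi is only indicative.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Compound scaling rule: grow depth, width and resolution together.

    phi controls how much extra compute to spend relative to the B0 baseline;
    alpha/beta/gamma are roughly the coefficients reported in the paper.
    """
    depth_mult = alpha ** phi        # more layers
    width_mult = beta ** phi         # more channels per layer
    resolution_mult = gamma ** phi   # larger input images
    return depth_mult, width_mult, resolution_mult

# e.g. EfficientNet-B4 corresponds approximately to phi = 4
print(compound_scale(4))
```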

Type 3: models with self-supervised pretraining

  • BEiT [5]
    Fig 5. BEiT

BEiT is a self-supervised pretrained ViT model. It has two parts.
The first part is a dVAE (discrete variational autoencoder). The dVAE is trained to compress the original image into a series of tokens (integer values), with each token representing a patch of the original image. The tokens are then mapped to a dictionary that stores their latent vectors, which are used to reconstruct the original image. After dVAE training, the token-generating part of the dVAE produces good representations of image patches.
The second part is a ViT. In every training step, several patches of the original image are masked and the image is fed into the vision transformer, which generates a value for each patch. At the same time, the original unmasked image is fed into the dVAE to obtain a token representation per patch. The dVAE-generated tokens are used as labels to train the vision transformer to predict the tokens of the masked patches. In this process, it learns a representation of images.

  • MAE [6]
    Fig 6. ViTMAE

MAE (masked autoencoder) is another self-supervised learning approach for vision transformers. It masks patches of the original image, uses the linearly transformed representations of the unmasked patches plus their positional embeddings as input tokens for the vision transformer, and feeds the output into another network that reconstructs the original image. It differs from the basic self-supervised method proposed in the ViT paper in two ways. First, the authors observe that image data contains much more redundant information than word sequences, so a low masking ratio is not sufficient for learning; for example, an image with every other patch masked can easily be restored by linear interpolation with little loss of semantic information. Consequently, in MAE most patches are masked and only the unmasked patches take part in the computation. Second, images contain many details that an MLP head (the original head network of ViT) cannot reconstruct well, so MAE uses another transformer as the decoder instead.
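
The high-ratio random masking step can be sketched as follows; the 75% default ratio follows the MAE paper, while the function and variable names are our own.

```python
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """Keep only a random subset of patch tokens, as in MAE.

    patch_tokens: (batch, num_patches, dim) linearly embedded patches.
    Returns the visible tokens plus the indices needed to restore their order.
    """
    B, N, D = patch_tokens.shape
    num_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patch_tokens.device)   # one random score per patch
    shuffle = noise.argsort(dim=1)                          # random permutation of patches
    keep_idx = shuffle[:, :num_keep]                        # patches the encoder will see

    visible = torch.gather(
        patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D)
    )
    return visible, keep_idx  # only `visible` goes through the heavy encoder
```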

  • DINO [7]
    Fig 7. DINO

DINO learns through a process called self-distillation. It consists of two transformer networks with the same architecture, a student network and a teacher network. The teacher is a momentum teacher, meaning its parameters are an exponentially weighted moving average of the student's parameters. In every training step, the original image passes through two different data augmentation transformations, and the two augmented views are fed into the student and teacher networks respectively (the teacher output additionally goes through centering and sharpening steps). The two results then pass through a softmax layer to produce two distribution vectors. Training uses a cross-entropy loss with the teacher's output as the label to optimize the student's parameters, encouraging the student to produce the same output as the teacher; in other words, the output becomes invariant to the image deformation, which indicates that the transformer has learned a high-level representation of the image.
Without the centering and sharpening steps applied to the teacher's output, the learning process can collapse to either a flat distribution or a spiked distribution, neither of which teaches the transformer useful parameters. To deal with this, the centering step subtracts a running mean from every output, preventing any single dimension from dominating and forming a spike; the sharpening step prevents flat outputs by exaggerating small differences between high and low values, as sketched below.
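
A minimal sketch of the centering, sharpening and cross-entropy steps described above; the temperature values are typical defaults rather than the exact training schedule, and the EMA updates of the teacher and of the center are assumed to happen elsewhere.

```python
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between sharpened teacher and student distributions.

    center: running mean of teacher outputs (updated by EMA elsewhere);
    subtracting it prevents one dimension from dominating (collapse to a spike),
    while the low teacher temperature sharpens the target and prevents a
    collapse to the uniform distribution.
    """
    teacher_probs = F.softmax((teacher_out - center) / t_teacher, dim=-1).detach()
    student_logprobs = F.log_softmax(student_out / t_student, dim=-1)
    return -(teacher_probs * student_logprobs).sum(dim=-1).mean()
```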

  • Data2vec [8]
    Fig 8. data2vec

Data2vec also uses a student ViT network to imitate the output of a teacher network (again an exponentially weighted moving average of the student over training). In contrast with DINO, it does not apply different augmentation transformations to the original image; instead it feeds a masked image into the student network and the unmasked image into the teacher network, encouraging the student to produce the same output as the teacher. Data2vec thus combines the ideas of self-distillation and masked modeling.

Summary of tested models

| model | year | key technology | fine-tune acc. on ImageNet | good for representation learning | pretrain dataset |
| --- | --- | --- | --- | --- | --- |
| CLIP | 2021 | contrastive learning, vision-language supervision | | Y | 400 million (image, text) pairs from OpenAI |
| Data2vec | 2022 | teacher-student, mask | 84.2 (B) | | ImageNet-1k |
| DINO | 2021 | teacher-student | 82.8 (B) | Y | ImageNet |
| Beit | 2021 | dVAE, mask | 83.4 (B) | | training set of ImageNet-1k (1.2 M) |
| MAE | 2021 | mask | 83.6 (B) | | ImageNet-1K |
| vit_b_16 & vit_b_32 | 2021 | ViT, pretrain | 84.15 (16 B, JFT300M), 80.73 (32 B, JFT300M) | | JFT300M |
| convnext | 2022 | CNN modification | 85.1 (B) | | ImageNet-22K |
| EfficientNet | 2020 | NAS | 80.1 (B6), 84.0 (B6) | | |

Datasets for general classification

There are two types of datasets we test:

Datasets similar to the CLIP pretraining data

  • Flowers102: a dataset consisting of 102 flower categories. The flowers chosen are ones commonly occurring in the United Kingdom. Some examples are shown below.
Fig 9. Dataset_Flowers102
  • FGVCAircraft: the dataset contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which are airplanes.

Fig 10. Dataset_FGVCAircraft

  • DTD: the DTD split we use consists of 1,880 images organized according to a list of 47 terms (categories) inspired by human perception. Image sizes range between 300x300 and 640x640, and at least 90% of each image's surface represents the category attribute.

Fig 11. Dataset_DTD

Datasets different from the CLIP pretraining data

  • PatternNet (torchgeo): a dataset from the torchgeo package [9]. PatternNet is a dataset for remote sensing scene classification and image retrieval. It has 38 scene classes, with 800 images per class.

Fig 12. PatternNet_Dataset

  • UCMerced (torchgeo): a dataset from the torchgeo package [9]. The UC Merced dataset is a land use classification dataset of 2.1k 256x256 1-ft-resolution RGB images of urban locations around the U.S., extracted from the USGS National Map Urban Area Imagery collection, with 21 land use classes (100 images per class).

Fig 13. UCMerced_Dataset

  • ISIC [10]: data from the SIIM-ISIC Melanoma Classification 2020 competition. The task is to predict a binary target for each image, where 0 denotes benign and 1 denotes malignant.

Fig 14. ISIC

Datasets for data drift

  • iwildcam (wilds): a dataset from WILDS [11] for animal classification. The input x is a photo from a camera trap, the label y is one of 182 animal species, and the domain d specifies the identity of the camera trap. The training, validation and in-distribution test data contain images from one set of cameras, while the out-of-distribution test data come from cameras not used for the in-distribution data.

Fig 15. iwildcam_dataset

  • fmow (wilds): a dataset from WILDS [11] for satellite image classification. Because satellite data constantly changes due to human activity and environmental processes, models must be robust to distribution shifts over time. The input x is an RGB satellite image and the label y is one of 62 building or land use categories. The in-distribution data comprises images from before 2013, while the out-of-distribution test set comprises images from 2016 and after.

Fig 16. fmow_dataset

Experiment:

Classification experiments on general datasets

This experiment tests the performance of different models on the general datasets mentioned above. Two types of experiments are conducted:

Classification with features and a simple ML algorithm:

  • This experiment aims to test the representation quality of each model. For each dataset, we extract features from the last layer of the network and train KNN and logistic regression classifiers on them; a minimal sketch is given below.
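
A minimal sketch of this pipeline, assuming a frozen backbone whose classification head has been removed and standard PyTorch dataloaders (`backbone`, `train_loader` and `test_loader` are placeholders); the KNN hyperparameters are illustrative, not the exact ones used in our runs.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def extract_features(model, loader, device="cuda"):
    """Run the frozen backbone over a dataloader and collect feature vectors."""
    model.eval().to(device)
    feats, labels = [], []
    for images, targets in loader:
        out = model(images.to(device))      # assumes the head is removed and the
        feats.append(out.cpu().numpy())     # model returns a feature vector per image
        labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# fit simple classifiers on frozen features
X_train, y_train = extract_features(backbone, train_loader)
X_test, y_test = extract_features(backbone, test_loader)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("knn acc:", knn.score(X_test, y_test), "logreg acc:", logreg.score(X_test, y_test))
```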

Fine-tuning on each dataset:

  • This aims to test each model's performance after fine-tuning. Each network is fine-tuned on each dataset.

The evaluation metric for this experiment is accuracy. We want to measure the performance of each model on each general dataset for different amounts of training and validation data. The amount of training and validation data increases by a factor of 10, from 10 samples per class to 1000 samples per class. However, not all general datasets have enough data, so we designed the training, validation and test splits for each dataset as listed in the following table. The entries under (10) train:val:test, (100) train:val:test and (1000) train:val:test give the number of samples used for training, validation and testing. For the simple ML algorithms, we only use the (10) and (100) splits for training and testing.

| dataset | total data | number of labels | (10) train:val:test | (100) train:val:test | (1000) train:val:test |
| --- | --- | --- | --- | --- | --- |
| ISIC | 1150 | 2 | 14:4:1132 | 148:50:952 | 688:230:232 |
| PatternNet_Dataset | 30400 | 38 | 266:76:30058 | 2812:950:26638 | 18202:6080:6118 |
| UCMerced_Dataset | 1260 | 21 | 147:42:1071 | 567:189:504 | |
| Dataset_FGVCAircraft | 3334 | 100 | 700:200:2434 | 1900:600:834 | |
| Dataset_DTD | 1880 | 47 | 329:94:1457 | 1081:376:423 | |
| Dataset_Flowers102 | 1020 | 102 | 510:204:306 | | |

As shown in the table, Dataset_Flowers102 only has one split, while UCMerced_Dataset, Dataset_FGVCAircraft and Dataset_DTD have two. This is because these datasets do not have enough data per label (e.g. Dataset_Flowers102 only has 10 samples per label, and Dataset_DTD has 40 samples per label). For the configurations that lack data, we simply repeat the largest available split for Dataset_Flowers102, UCMerced_Dataset, Dataset_FGVCAircraft and Dataset_DTD. In the results section, final results are averaged.

Classification experiments on the data drift datasets

As before, this experiment tests the performance of different models, here on the data drift datasets. The models are trained on the in-distribution data and tested on both in-distribution and out-of-distribution data.

Classification with features and a simple ML algorithm:

  • This experiment compares the models' representation quality under data drift. For each dataset, we extract features from the last layer of the network and train KNN and logistic regression classifiers.

Fine-tuning on each dataset:

  • This tests each model's sensitivity to data drift after fine-tuning. Each network is fine-tuned on each dataset.

The evaluation metric is again accuracy, measured on both the in-distribution and out-of-distribution test sets. As in the previous experiment, we scale the training data by factors of 10 starting from 10 samples per class; the detailed splits are listed below. For the simple ML algorithms, we only use the (10) and (100) splits for training and testing.

| dataset | total training data | total val data | total ID test data | total OOD test data | number of labels | (10) train:val | (100) train:val | (1000) train:val |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FMoWDataset | 76863 | 19915 | 11327 | 22108 | 62 | 434:124 | 4517:1527 | 40150:12721 |
| IWildCamDataset | 129809 | 14961 | 8154 | 42791 | 87 | 1090:149 | 6694:1394 | 33628:7079 |

Clustering experiment

In the clustering experiment, we first extract features from each model, using the same feature extraction method as in the logistic regression and KNN experiments. To improve clustering performance, we first reduce dimensionality with UMAP using the euclidean metric, and then cluster the data with k-means. We perform clustering on all general datasets and on the test data of the data drift datasets. The evaluation metric is the adjusted Rand score, which can be thought of as an accuracy measure for clustering; a sketch of the pipeline is shown below.
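
A minimal sketch of the clustering pipeline, using umap-learn and scikit-learn; the number of UMAP components and the random seeds are assumptions, since our setup only fixes the euclidean metric.

```python
import umap
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def cluster_features(features, labels, n_clusters, n_components=2):
    """Reduce frozen features with UMAP, cluster with k-means, score with ARI."""
    reducer = umap.UMAP(n_components=n_components, metric="euclidean", random_state=0)
    embedded = reducer.fit_transform(features)          # (n_samples, n_components)

    cluster_ids = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(embedded)
    return adjusted_rand_score(labels, cluster_ids)     # 1.0 = clusters match the labels
```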

Training and preprocessing

The training pipeline is built with PyTorch and pytorch-lightning. Each training sample is cropped to 224x224 and normalized with mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225].

Fine-tuning is done on an NVIDIA GeForce RTX 3070 laptop GPU. The batch size is 16 and the optimizer is Adam with a learning rate of 1e-5. The number of epochs is determined by early stopping: training stops when the validation loss has not decreased for 30 epochs. A sketch of this setup is given below.
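
A simplified sketch of the fine-tuning setup with pytorch-lightning; the resize-then-crop preprocessing and the placeholder `backbone_with_head` model are assumptions, while the optimizer, learning rate and early-stopping patience follow the settings above.

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from pytorch_lightning.callbacks import EarlyStopping
from torchvision import transforms

# assumed preprocessing: crop to 224x224 and normalize with ImageNet statistics
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

class FineTuner(pl.LightningModule):
    def __init__(self, backbone_with_head, lr=1e-5):
        super().__init__()
        self.model, self.lr = backbone_with_head, lr

    def training_step(self, batch, batch_idx):
        images, targets = batch
        return F.cross_entropy(self.model(images), targets)

    def validation_step(self, batch, batch_idx):
        images, targets = batch
        self.log("val_loss", F.cross_entropy(self.model(images), targets))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)

trainer = pl.Trainer(
    accelerator="gpu", devices=1,
    callbacks=[EarlyStopping(monitor="val_loss", patience=30)],
)
# dataloaders are assumed to be built with batch_size=16
# trainer.fit(FineTuner(backbone_with_head), train_loader, val_loader)
```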

Results

Results for the classification experiments on general datasets

The results of training KNN and logistic regression on the extracted representations are shown in the following tables. Rows are ordered by the logistic regression accuracy with 100 samples per class.

Dataset_FGVCAircraft

model knn_10 knn_100 lr_10 lr_100
DINO 0.202958 0.28777 0.38373 0.516787
vit_l_16 0.122432 0.220624 0.312654 0.461631
vit_b_16 0.133936 0.207434 0.314708 0.447242
convnext_tiny 0.11175 0.148681 0.280608 0.43765
vit_l_32 0.131471 0.20024 0.309778 0.434053
CLIP 0.235826 0.31295 0.345933 0.420863
vit_b_32 0.124897 0.178657 0.275267 0.377698
convnext_base 0.0784717 0.106715 0.222268 0.364508
ViTMAE 0.0246508 0.0383693 0.0969597 0.14988
Beit 0.0209532 0.0311751 0.0439606 0.0611511
data2vec 0.0147905 0.0335731 0.0299918 0.0371703
efficientnet_b6 0.0110929 0.00959233 0.0168447 0.029976
efficientnet_b2 0.00862777 0.0179856 0.0115037 0.0203837

ISIC

model knn_10 knn_100 lr_10 lr_100
DINO 0.650528 0.538732 0.715669 0.71919
vit_b_32 0.598592 0.591549 0.623239 0.676056
ViTMAE 0.720951 0.723592 0.701585 0.670775
vit_l_16 0.570423 0.590669 0.632042 0.65757
convnext_base 0.534331 0.509683 0.665493 0.654049
vit_b_16 0.605634 0.551937 0.636444 0.638204
convnext_tiny 0.571303 0.512324 0.62412 0.627641
vit_l_32 0.553697 0.535211 0.6875 0.615317
data2vec 0.691021 0.701585 0.65493 0.612676
Beit 0.683099 0.623239 0.735035 0.575704
efficientnet_b2 0.49912 0.502641 0.518486 0.536092
CLIP 0.582746 0.637324 0.662852 0.52993
efficientnet_b6 0.5 0.5 0.506162 0.525528

Dataset_DTD

model knn_10 knn_100 lr_10 lr_100
DINO 0.523533 0.661939 0.602192 0.742317
CLIP 0.493875 0.63357 0.600903 0.737589
convnext_base 0.355899 0.520095 0.573179 0.718676
vit_b_32 0.411348 0.550827 0.555126 0.690307
vit_b_16 0.387492 0.550827 0.548034 0.685579
convnext_tiny 0.350097 0.529551 0.586718 0.683215
vit_l_16 0.323017 0.522459 0.544165 0.680851
vit_l_32 0.310123 0.484634 0.548034 0.652482
ViTMAE 0.117988 0.219858 0.406834 0.63357
Beit 0.0870406 0.160757 0.137975 0.267139
data2vec 0.0444874 0.0661939 0.0754352 0.113475
efficientnet_b6 0.030303 0.0236407 0.0257898 0.0330969
efficientnet_b2 0.0199871 0.035461 0.0238556 0.0283688

Dataset_Flowers102

model knn_10 lr_10
CLIP 0.803922 0.905229
DINO 0.826797 0.937908
vit_b_16 0.568627 0.856209
vit_l_16 0.513072 0.869281
convnext_base 0.447712 0.830065
vit_l_32 0.486928 0.810458
convnext_tiny 0.526144 0.830065
vit_b_32 0.53268 0.810458
ViTMAE 0.179739 0.542484
Beit 0.140523 0.320261
data2vec 0.0784314 0.183007
efficientnet_b6 0.0130719 0.0130719
efficientnet_b2 0.0130719 0.0359477

UCMerced_Dataset

model knn_10 knn_100 lr_10 lr_100
DINO 0.781475 0.904573 0.885791 0.970179
vit_l_32 0.646583 0.831014 0.83723 0.94831
vit_b_32 0.67536 0.852883 0.843525 0.944334
vit_l_16 0.58723 0.791252 0.851619 0.942346
CLIP 0.78777 0.89662 0.878597 0.934394
convnext_base 0.651978 0.813121 0.821043 0.934394
convnext_tiny 0.613309 0.769384 0.818345 0.932406
vit_b_16 0.714928 0.83499 0.807554 0.932406
ViTMAE 0.348022 0.594433 0.607014 0.888668
Beit 0.383094 0.624254 0.531475 0.787276
data2vec 0.198741 0.310139 0.283273 0.429423
efficientnet_b6 0.0530576 0.0516899 0.0809353 0.149105
efficientnet_b2 0.0620504 0.0755467 0.0557554 0.0815109

PatternNet_Dataset

model knn_10 knn_100 lr_10 lr_100
CLIP 0.864434 0.942952 0.944148 0.979638
DINO 0.863704 0.947846 0.955066 0.97844
vit_l_16 0.729499 0.866705 0.923207 0.972143
convnext_tiny 0.787542 0.889839 0.914612 0.968174
vit_b_16 0.800451 0.888265 0.910596 0.964409
vit_l_32 0.794046 0.881216 0.901736 0.964272
vit_b_32 0.786049 0.878957 0.887997 0.9615
convnext_base 0.709587 0.854591 0.897355 0.961466
ViTMAE 0.474629 0.67164 0.817044 0.925533
Beit 0.478777 0.649841 0.683238 0.829985
data2vec 0.246175 0.394374 0.393987 0.477978
efficientnet_b6 0.0297348 0.0370966 0.050078 0.0851442
efficientnet_b2 0.029934 0.0318264 0.0427106 0.0660142

From the results above, in terms of the ML algorithms, performance in most cases follows: logistic regression with 100 samples > logistic regression with 10 samples > KNN with 100 samples > KNN with 10 samples. However, when the samples within a dataset are not too similar to each other (as in DTD, UCMerced and PatternNet), KNN with 100 samples can sometimes match or beat logistic regression with 10 samples for the best models. In industry, KNN-like information retrieval is much cheaper and faster than logistic regression, so if the samples are not too similar and enough samples are available, KNN may be the better option.

As for the model representations, DINO is surprisingly the best model in most cases, followed by CLIP or the ViTs trained on ImageNet. From this point of view DINO looks more promising, because it uses less pretraining data yet still produces good results. In most cases the models with masked self-supervised pretraining do not perform well; they only achieve relatively better results on the ISIC dataset. The samples in ISIC are more similar to each other than in the other datasets, so these models may have an advantage on this kind of data, but due to limited time we leave this question for future exploration.

The results of fine-tuning, compared with simple ML on extracted features, are shown in the tables below. Models are ordered by their best performance on the corresponding dataset.

PatternNet_Dataset

model 10 100 1000 knn_10 knn_100 lr_10 lr_100 performance order Best performance
convnext_base 0.969792 0.991441 0.996404 0.709587 0.854591 0.897355 0.961466 1000>100>10>lr_100>lr_10>knn_100>knn_10 0.996404
DINO 0.871781 0.986673 0.996241 0.863704 0.947846 0.955066 0.97844 1000>100>lr_100>lr_10>knn_100>10>knn_10 0.996241
convnext_tiny 0.957815 0.989414 0.996077 0.787542 0.889839 0.914612 0.968174 1000>100>lr_100>10>lr_10>knn_100>knn_10 0.996077
efficientnet_b2 0.904551 0.984083 0.996077 0.029934 0.0318264 0.0427106 0.0660142 1000>100>10>lr_100>lr_10>knn_100>knn_10 0.996077
vit_b_16 0.929669 0.991028 0.995914 0.800451 0.888265 0.910596 0.964409 1000>100>lr_100>10>lr_10>knn_100>knn_10 0.995914
efficientnet_b6 0.893473 0.982919 0.995096 0.0297348 0.0370966 0.050078 0.0851442 1000>100>10>lr_100>lr_10>knn_100>knn_10 0.995096
Beit 0.765354 0.972857 0.994933 0.478777 0.649841 0.683238 0.829985 1000>100>lr_100>10>lr_10>knn_100>knn_10 0.994933
CLIP 0.925378 0.991028 0.994606 0.864434 0.942952 0.944148 0.979638 1000>100>lr_100>lr_10>knn_100>10>knn_10 0.994606
ViTMAE 0.839876 0.979352 0.990847 0.474629 0.67164 0.817044 0.925533 1000>100>lr_100>10>lr_10>knn_100>knn_10 0.990847
vit_b_32 0.930834 0.985246 0.990193 0.786049 0.878957 0.887997 0.9615 1000>100>lr_100>10>lr_10>knn_100>knn_10 0.990193
data2vec 0.500133 0.814168 0.943609 0.246175 0.394374 0.393987 0.477978 1000>100>10>lr_100>knn_100>lr_10>knn_10 0.943609

Dataset_Flowers102

model 10 knn_10 lr_10 performance order Best performance
convnext_base 0.938998 0.428105 0.834967 10>lr_10>knn_10 0.938998
DINO 0.738562 0.794118 0.929739 lr_10>knn_10>10 0.929739
CLIP 0.831264 0.816993 0.913399 lr_10>10>knn_10 0.913399
convnext_tiny 0.897603 0.517974 0.828431 10>lr_10>knn_10 0.897603
vit_b_16 0.839446 0.576797 0.854575 lr_10>10>knn_10 0.854575
vit_b_32 0.771242 0.517974 0.803922 lr_10>10>knn_10 0.803922
efficientnet_b2 0.78976 0.0179739 0.0294118 10>lr_10>knn_10 0.78976
efficientnet_b6 0.775599 0.0130719 0.0196078 10>lr_10>knn_10 0.775599
ViTMAE 0.545685 0.166667 0.550654 lr_10>10>knn_10 0.550654
Beit 0.274743 0.138889 0.313725 lr_10>10>knn_10 0.313725
data2vec 0.262474 0.0882353 0.161765 10>lr_10>knn_10 0.262474

Dataset_DTD

model 10 100 knn_10 knn_100 lr_10 lr_100 performance order Best performance
DINO 0.350721 0.591017 0.523533 0.661939 0.602192 0.742317 lr_100>knn_100>lr_10>100>knn_10>10 0.742317
CLIP 0.545642 0.678487 0.493875 0.63357 0.600903 0.737589 lr_100>100>knn_100>lr_10>10>knn_10 0.737589
convnext_base 0.624571 0.686761 0.355899 0.520095 0.573179 0.718676 lr_100>100>10>lr_10>knn_100>knn_10 0.718676
convnext_tiny 0.592999 0.717494 0.350097 0.529551 0.586718 0.683215 100>lr_100>10>lr_10>knn_100>knn_10 0.717494
vit_b_32 0.486616 0.605201 0.411348 0.550827 0.555126 0.690307 lr_100>100>lr_10>knn_100>10>knn_10 0.690307
vit_b_16 0.51407 0.641844 0.387492 0.550827 0.548034 0.685579 lr_100>100>knn_100>lr_10>10>knn_10 0.685579
ViTMAE 0.385038 0.531915 0.117988 0.219858 0.406834 0.63357 lr_100>100>lr_10>10>knn_100>knn_10 0.63357
efficientnet_b2 0.40906 0.573286 0.0199871 0.035461 0.0238556 0.0283688 100>10>knn_100>lr_100>lr_10>knn_10 0.573286
efficientnet_b6 0.396706 0.542553 0.030303 0.0236407 0.0257898 0.0330969 100>10>lr_100>knn_10>lr_10>knn_100 0.542553
Beit 0.185999 0.364066 0.0870406 0.160757 0.137975 0.267139 100>lr_100>10>knn_100>lr_10>knn_10 0.364066
data2vec 0.103638 0.176123 0.0444874 0.0661939 0.0754352 0.113475 100>lr_100>10>lr_10>knn_100>knn_10 0.176123

UCMerced_Dataset

model 10 100 knn_10 knn_100 lr_10 lr_100 performance order Best performance
DINO 0.776844 0.943452 0.781475 0.904573 0.885791 0.970179 lr_100>100>knn_100>lr_10>knn_10>10 0.970179
convnext_base 0.907563 0.96627 0.651978 0.813121 0.821043 0.934394 100>lr_100>10>lr_10>knn_100>knn_10 0.96627
CLIP 0.852474 0.963264 0.78777 0.89662 0.878597 0.934394 100>lr_100>knn_100>lr_10>10>knn_10 0.963264
vit_b_16 0.867414 0.962276 0.714928 0.83499 0.807554 0.932406 100>lr_100>10>knn_100>lr_10>knn_10 0.962276
convnext_tiny 0.845938 0.957341 0.613309 0.769384 0.818345 0.932406 100>lr_100>10>lr_10>knn_100>knn_10 0.957341
vit_b_32 0.8338 0.948357 0.67536 0.852883 0.843525 0.944334 100>lr_100>knn_100>lr_10>10>knn_10 0.948357
efficientnet_b2 0.789916 0.927579 0.0620504 0.0755467 0.0557554 0.0815109 100>10>lr_100>knn_100>knn_10>lr_10 0.927579
efficientnet_b6 0.779645 0.924603 0.0530576 0.0516899 0.0809353 0.149105 100>10>lr_100>lr_10>knn_10>knn_100 0.924603
ViTMAE 0.647993 0.896743 0.348022 0.594433 0.607014 0.888668 100>lr_100>10>lr_10>knn_100>knn_10 0.896743
Beit 0.53408 0.835148 0.383094 0.624254 0.531475 0.787276 100>lr_100>knn_100>10>lr_10>knn_10 0.835148
data2vec 0.409897 0.59085 0.198741 0.310139 0.283273 0.429423 100>lr_100>10>knn_100>lr_10>knn_10 0.59085

Dataset_FGVCAircraft

model 10 100 knn_10 knn_100 lr_10 lr_100 performance order Best performance
convnext_base 0.493837 0.673261 0.0784717 0.106715 0.222268 0.364508 100>10>lr_100>lr_10>knn_100>knn_10 0.673261
convnext_tiny 0.396878 0.617506 0.11175 0.148681 0.280608 0.43765 100>lr_100>10>lr_10>knn_100>knn_10 0.617506
CLIP 0.318817 0.539887 0.235826 0.31295 0.345933 0.420863 100>lr_100>lr_10>10>knn_100>knn_10 0.539887
DINO 0.162284 0.390288 0.202958 0.28777 0.38373 0.516787 lr_100>100>lr_10>knn_100>knn_10>10 0.516787
vit_b_16 0.288825 0.491295 0.133936 0.207434 0.314708 0.447242 100>lr_100>lr_10>10>knn_100>knn_10 0.491295
efficientnet_b2 0.268694 0.450839 0.00862777 0.0179856 0.0115037 0.0203837 100>10>lr_100>knn_100>lr_10>knn_10 0.450839
efficientnet_b6 0.277732 0.440647 0.0110929 0.00959233 0.0168447 0.029976 100>10>lr_100>lr_10>knn_10>knn_100 0.440647
vit_b_32 0.224322 0.385492 0.124897 0.178657 0.275267 0.377698 100>lr_100>lr_10>10>knn_100>knn_10 0.385492
ViTMAE 0.0772391 0.216557 0.0246508 0.0383693 0.0969597 0.14988 100>lr_100>lr_10>10>knn_100>knn_10 0.216557
Beit 0.0419063 0.0983796 0.0209532 0.0311751 0.0439606 0.0611511 100>lr_100>lr_10>10>knn_100>knn_10 0.0983796
data2vec 0.0353328 0.0401938 0.0147905 0.0335731 0.0299918 0.0371703 100>lr_100>10>knn_100>lr_10>knn_10 0.0401938

ISIC

model 10 100 1000 knn_10 knn_100 lr_10 lr_100 performance order Best performance
convnext_tiny 0.693463 0.77521 0.875 0.571303 0.512324 0.62412 0.627641 1000>100>10>lr_100>lr_10>knn_10>knn_100 0.875
CLIP 0.770318 0.788866 0.849138 0.582746 0.637324 0.662852 0.52993 1000>100>10>lr_10>knn_100>knn_10>lr_100 0.849138
DINO 0.758834 0.777311 0.844828 0.650528 0.538732 0.715669 0.71919 1000>100>10>lr_100>lr_10>knn_10>knn_100 0.844828
efficientnet_b2 0.499117 0.735294 0.840517 0.49912 0.502641 0.518486 0.536092 1000>100>lr_100>lr_10>knn_100>knn_10>10 0.840517
ViTMAE 0.704947 0.75 0.831897 0.720951 0.723592 0.701585 0.670775 1000>100>knn_100>knn_10>10>lr_10>lr_100 0.831897
vit_b_16 0.583922 0.771008 0.814655 0.605634 0.551937 0.636444 0.638204 1000>100>lr_100>lr_10>knn_10>10>knn_100 0.814655
Beit 0.69523 0.736345 0.806035 0.683099 0.623239 0.735035 0.575704 1000>100>lr_10>10>knn_10>knn_100>lr_100 0.806035
vit_b_32 0.65371 0.778361 0.771552 0.598592 0.591549 0.623239 0.676056 100>1000>lr_100>10>lr_10>knn_10>knn_100 0.778361
convnext_base 0.708481 0.765756 0.758621 0.534331 0.509683 0.665493 0.654049 100>1000>10>lr_10>lr_100>knn_10>knn_100 0.765756
efficientnet_b6 0.539753 0.734244 0.711207 0.5 0.5 0.506162 0.525528 100>1000>10>lr_100>lr_10>knn_100>knn_10 0.734244
data2vec 0.673145 0.733193 0.698276 0.691021 0.701585 0.65493 0.612676 100>knn_100>1000>knn_10>10>lr_10>lr_100 0.733193

From the results, it is not surprising that the more data we use, the higher the accuracy. For most models, if enough data is available (e.g. 500 or more samples per class), fine-tuning is the best option in terms of performance.

Also, as the number of samples increases, the pretraining method matters less. Here the convnext models perform best on PatternNet, ISIC and FGVCAircraft, especially after fine-tuning, and if we compare only the fine-tuning results, convnext_base is best on most datasets. This could be because CNNs converge more easily on small datasets. Note that all transformer models in this project use the standard ViT architecture; modified versions such as the Swin Transformer have appeared recently, and it is possible that they would perform better.

Apart from the convnext results and the fine-tuning results with more than 100 samples, it is interesting to see that DINO and CLIP reach similar or even better performance with logistic regression than with fine-tuning, especially on datasets whose samples are quite different from each other (e.g. DTD and UCMerced). With a relatively small dataset (e.g. fewer than 100 samples per class) of easy samples, simply using DINO or CLIP representations with traditional ML algorithms may be a good choice.

Results for the classification experiments on the data drift datasets

The results of the KNN and logistic regression classification experiments on the data drift datasets are shown below. The difference between the in-distribution and out-of-distribution results is shown in the last four columns.

IWildCamDataset

model knn_10_id knn_10_ood lr_10_id lr_10_ood knn_100_id knn_100_ood lr_100_id lr_100_ood knn_acc_diff_10 knn_acc_diff_100 lr_acc_diff_10 lr_acc_diff_100
DINO 0.132205 0.186979 0.143488 0.177724 0.244665 0.186651 0.214619 0.279171 -0.0547735 0.0580138 -0.0342364 -0.0645523
convnext_base 0.14128 0.22458 0.112583 0.188685 0.221609 0.23776 0.200883 0.22916 -0.0832996 -0.0161513 -0.0761018 -0.0282773
convnext_tiny 0.140545 0.219369 0.116017 0.158631 0.243439 0.225304 0.195242 0.203594 -0.078824 0.0181344 -0.0426148 -0.00835261
vit_b_16 0.114668 0.0812087 0.110375 0.0957444 0.191072 0.151901 0.182487 0.16639 0.033459 0.0391708 0.0146308 0.016097
vit_b_32 0.0954133 0.0899722 0.0804513 0.0831483 0.142261 0.133626 0.151705 0.142273 0.0054411 0.00863524 -0.00269701 0.00943178
CLIP 0.0927152 0.105326 0.0864606 0.0596153 0.190213 0.175224 0.166299 0.137482 -0.0126107 0.0149896 0.0268453 0.0288166
Beit 0.0334805 0.0180645 0.0309051 0.020308 0.0630365 0.0236966 0.0558008 0.0342362 0.015416 0.03934 0.0105971 0.0215647
ViTMAE 0.0348295 0.0256129 0.0294334 0.0241406 0.0469708 0.0193265 0.0479519 0.0318992 0.00921667 0.0276443 0.00529282 0.0160527
data2vec 0.0383861 0.0132037 0.0279617 0.0185319 0.0470935 0.0133907 0.0448859 0.0231591 0.0251824 0.0337028 0.0094298 0.0217269

FMoWDataset

model knn_10_id knn_10_ood lr_10_id lr_10_ood knn_100_id knn_100_ood lr_100_id lr_100_ood knn_acc_diff_10 knn_acc_diff_100 lr_acc_diff_10 lr_acc_diff_100
CLIP 0.194403 0.188981 0.12404 0.132034 0.314646 0.305862 0.254083 0.242401 0.00542139 0.00878429 -0.00799375 0.0116822
DINO 0.172773 0.16926 0.105412 0.0972951 0.310674 0.29763 0.217357 0.195223 0.00351302 0.0130438 0.00811675 0.0221333
convnext_tiny 0.152732 0.139361 0.0839587 0.0714221 0.261676 0.24905 0.16624 0.141442 0.0133711 0.0126255 0.0125366 0.0247979
vit_b_16 0.129337 0.124525 0.0798093 0.0657228 0.22645 0.219649 0.157853 0.133617 0.00481192 0.00680108 0.0140865 0.0242361
vit_b_32 0.110885 0.111905 0.0703628 0.0673512 0.206498 0.200878 0.14823 0.124661 -0.0010197 0.00562024 0.00301166 0.0235691
convnext_base 0.133575 0.130089 0.0626821 0.0531482 0.249846 0.239913 0.130926 0.111227 0.00348599 0.00993235 0.00953391 0.0196994
Beit 0.0664783 0.058757 0.0300168 0.0294916 0.12766 0.111543 0.0583561 0.0508413 0.00772132 0.0161162 0.000525187 0.00751482
ViTMAE 0.0379624 0.0377691 0.0227774 0.0241994 0.0684206 0.0621494 0.0320473 0.0328388 0.000193257 0.00627114 -0.00142195 -0.000791471
data2vec 0.0404344 0.0425638 0.0188929 0.0215759 0.0548248 0.0519721 0.0342544 0.0306224 -0.00212942 0.00285262 -0.00268299 0.00363204

The overall performance is much lower than on the general datasets, which means these datasets are more difficult.

From the two tables, even though DINO still performs well on both datasets, CLIP does not perform as well on IWildCamDataset as it did in the general dataset experiments; it is even worse than the convnext and ViT models trained on ImageNet. Since the key difference of IWildCamDataset is that its class labels and samples are similar to ImageNet, networks trained specifically on ImageNet with various augmentation techniques may gain an advantage in representation quality. One piece of evidence for this claim is that for DINO and convnext the difference between in-distribution and out-of-distribution performance is not significant, and some models even score better out of distribution; this is not the case for FMoWDataset. Thus, a model trained with enough augmentation appears to generalize better.

For FMoWDataset, CLIP achieves the best performance in terms of both accuracy and robustness to distribution drift, with the DINO representation second. This suggests that CLIP and DINO representations may be better not only in raw performance but also in sensitivity to distribution drift; however, because the overall accuracy is low, KNN and logistic regression may not be a good choice for this task. Finally, the masked self-supervised models still perform poorly, which may mean that this type of model cannot directly output a good representation.

The fine-tuning results are shown below; because the full table is too large, it is split into three parts.

FMoWDataset

model 10 id 10 ood 100 id 100 ood 1000 id 1000 ood knn_10_id knn_10_ood lr_10_id lr_10_ood knn_100_id knn_100_ood lr_100_id lr_100_ood
convnext_base 0.189459 0.166998 0.397016 0.360955 0.578706 0.519676 0.133575 0.130089 0.0626821 0.0531482 0.249846 0.239913 0.130926 0.111227
convnext_tiny 0.158471 0.147503 0.373179 0.339515 0.556899 0.507463 0.152732 0.139361 0.0839587 0.0714221 0.261676 0.24905 0.16624 0.141442
vit_b_16 0.116801 0.103809 0.329478 0.301656 0.52962 0.464809 0.129337 0.124525 0.0798093 0.0657228 0.22645 0.219649 0.157853 0.133617
DINO 0.09367 0.0889723 0.337424 0.307355 0.514876 0.461779 0.172773 0.16926 0.105412 0.0972951 0.310674 0.29763 0.217357 0.195223
CLIP 0.0983491 0.089877 0.388099 0.365524 0.482211 0.438348 0.194403 0.188981 0.12404 0.132034 0.314646 0.305862 0.254083 0.242401
Beit 0.0605633 0.0559978 0.186281 0.1499 0.48892 0.424236 0.0664783 0.058757 0.0300168 0.0294916 0.12766 0.111543 0.0583561 0.0508413
vit_b_32 0.133751 0.11471 0.286395 0.257328 0.449016 0.402162 0.110885 0.111905 0.0703628 0.0673512 0.206498 0.200878 0.14823 0.124661
ViTMAE 0.104529 0.0837253 0.249051 0.204948 0.449722 0.377013 0.0379624 0.0377691 0.0227774 0.0241994 0.0684206 0.0621494 0.0320473 0.0328388
data2vec 0.0410524 0.0382667 0.0863424 0.0692509 0.147435 0.119142 0.0404344 0.0425638 0.0188929 0.0215759 0.0548248 0.0519721 0.0342544 0.0306224
model Best OOD performance performance order
convnext_base 0.519676 1000_ood>100_ood>knn_100_ood>10_ood>knn_10_ood>lr_100_ood>lr_10_ood
convnext_tiny 0.507463 1000_ood>100_ood>knn_100_ood>10_ood>lr_100_ood>knn_10_ood>lr_10_ood
vit_b_16 0.464809 1000_ood>100_ood>knn_100_ood>lr_100_ood>knn_10_ood>10_ood>lr_10_ood
DINO 0.461779 1000_ood>100_ood>knn_100_ood>lr_100_ood>knn_10_ood>lr_10_ood>10_ood
CLIP 0.438348 1000_ood>100_ood>knn_100_ood>lr_100_ood>knn_10_ood>lr_10_ood>10_ood
Beit 0.424236 1000_ood>100_ood>knn_100_ood>knn_10_ood>10_ood>lr_100_ood>lr_10_ood
vit_b_32 0.402162 1000_ood>100_ood>knn_100_ood>lr_100_ood>10_ood>knn_10_ood>lr_10_ood
ViTMAE 0.377013 1000_ood>100_ood>10_ood>knn_100_ood>knn_10_ood>lr_100_ood>lr_10_ood
data2vec 0.119142 1000_ood>100_ood>knn_100_ood>knn_10_ood>10_ood>lr_100_ood>lr_10_ood
model 10 diff 100 diff 1000 diff knn_acc_diff_10 knn_acc_diff_100 lr_acc_diff_10 lr_acc_diff_100
convnext_base 0.0224604 0.0360607 0.0590296 0.00348599 0.00993235 0.00953391 0.0196994
convnext_tiny 0.0109677 0.033664 0.0494361 0.0133711 0.0126255 0.0125366 0.0247979
vit_b_16 0.012992 0.0278227 0.0648104 0.00481192 0.00680108 0.0140865 0.0242361
DINO 0.00469768 0.0300691 0.0530974 0.00351302 0.0130438 0.00811675 0.0221333
CLIP 0.00847211 0.0225754 0.0438625 0.00542139 0.00878429 -0.00799375 0.0116822
Beit 0.00456543 0.0363801 0.0646847 0.00772132 0.0161162 0.000525187 0.00751482
vit_b_32 0.0190416 0.0290677 0.0468535 -0.0010197 0.00562024 0.00301166 0.0235691
ViTMAE 0.0208037 0.0441025 0.0727091 0.000193257 0.00627114 -0.00142195 -0.000791471
data2vec 0.00278566 0.0170914 0.0282929 -0.00212942 0.00285262 -0.00268299 0.00363204

IWildCamDataset

model 10 id 10 ood 100 id 100 ood 1000 id 1000 ood knn_10_id knn_10_ood lr_10_id lr_10_ood knn_100_id knn_100_ood lr_100_id lr_100_ood
convnext_base 0.221977 0.242831 0.57015 0.611157 0.660412 0.693861 0.14128 0.22458 0.112583 0.188685 0.221609 0.23776 0.200883 0.22916
convnext_tiny 0.262448 0.290435 0.52833 0.537122 0.653912 0.666425 0.140545 0.219369 0.116017 0.158631 0.243439 0.225304 0.195242 0.203594
vit_b_16 0.160657 0.117244 0.514349 0.530462 0.654403 0.622514 0.114668 0.0812087 0.110375 0.0957444 0.191072 0.151901 0.182487 0.16639
DINO 0.165072 0.170877 0.418813 0.366713 0.626073 0.549812 0.132205 0.186979 0.143488 0.177724 0.244665 0.186651 0.214619 0.279171
vit_b_32 0.161761 0.0841065 0.424822 0.443084 0.586829 0.528499 0.0954133 0.0899722 0.0804513 0.0831483 0.142261 0.133626 0.151705 0.142273
CLIP 0.174148 0.12068 0.425803 0.484658 0.572602 0.518824 0.0927152 0.105326 0.0864606 0.0596153 0.190213 0.175224 0.166299 0.137482
Beit 0.108168 0.0603865 0.281212 0.270384 0.536914 0.456498 0.0334805 0.0180645 0.0309051 0.020308 0.0630365 0.0236966 0.0558008 0.0342362
ViTMAE 0.150356 0.138487 0.33456 0.291229 0.498038 0.409736 0.0348295 0.0256129 0.0294334 0.0241406 0.0469708 0.0193265 0.0479519 0.0318992
data2vec 0.0430464 0.0372041 0.157346 0.112711 0.366569 0.223622 0.0383861 0.0132037 0.0279617 0.0185319 0.0470935 0.0133907 0.0448859 0.0231591
model Best OOD performance performance order
convnext_base 0.693861 1000_ood>100_ood>10_ood>knn_100_ood>lr_100_ood>knn_10_ood>lr_10_ood
convnext_tiny 0.666425 1000_ood>100_ood>10_ood>knn_100_ood>knn_10_ood>lr_100_ood>lr_10_ood
vit_b_16 0.622514 1000_ood>100_ood>lr_100_ood>knn_100_ood>10_ood>lr_10_ood>knn_10_ood
DINO 0.549812 1000_ood>100_ood>lr_100_ood>knn_10_ood>knn_100_ood>lr_10_ood>10_ood
vit_b_32 0.528499 1000_ood>100_ood>lr_100_ood>knn_100_ood>knn_10_ood>10_ood>lr_10_ood
CLIP 0.518824 1000_ood>100_ood>knn_100_ood>lr_100_ood>10_ood>knn_10_ood>lr_10_ood
Beit 0.456498 1000_ood>100_ood>10_ood>lr_100_ood>knn_100_ood>lr_10_ood>knn_10_ood
ViTMAE 0.409736 1000_ood>100_ood>10_ood>lr_100_ood>knn_10_ood>lr_10_ood>knn_100_ood
data2vec 0.223622 1000_ood>100_ood>10_ood>lr_100_ood>lr_10_ood>knn_100_ood>knn_10_ood
model 10 diff 100 diff 1000 diff knn_acc_diff_10 knn_acc_diff_100 lr_acc_diff_10 lr_acc_diff_100
convnext_base -0.0208545 -0.0410069 -0.0334488 -0.0832996 -0.0161513 -0.0761018 -0.0282773
convnext_tiny -0.027987 -0.00879264 -0.012513 -0.078824 0.0181344 -0.0426148 -0.00835261
vit_b_16 0.0434131 -0.0161132 0.0318887 0.033459 0.0391708 0.0146308 0.016097
DINO -0.0058047 0.0521002 0.0762612 -0.0547735 0.0580138 -0.0342364 -0.0645523
vit_b_32 0.0776546 -0.0182616 0.0583295 0.0054411 0.00863524 -0.00269701 0.00943178
CLIP 0.0534681 -0.0588547 0.0537784 -0.0126107 0.0149896 0.0268453 0.0288166
Beit 0.0477812 0.0108277 0.0804166 0.015416 0.03934 0.0105971 0.0215647
ViTMAE 0.0118686 0.0433303 0.0883021 0.00921667 0.0276443 0.00529282 0.0160527
data2vec 0.00584227 0.0446355 0.142947 0.0251824 0.0337028 0.0094298 0.0217269

From the results, and similar to the fine-tuning in the previous experiments, the outcome does not appear to be very sensitive to the pretraining method, even when the test data is out of the training distribution.

From the performance table for each dataset, fine-tuning with 100 samples is always better than using a simple ML algorithm with 100 samples; however, with very few examples per class (e.g. 10 samples), using CLIP or DINO features may beat fine-tuning.

From the table comparing in-distribution and out-of-distribution results, there is no significant difference for FMoWDataset. Even though the performance drop for CLIP is slightly smaller than for convnext, its absolute performance is not as good. For IWildCamDataset, the models pretrained on ImageNet with various augmentation methods clearly have a smaller performance drop, and some even perform better out of distribution than in distribution.

In our experiments we did not apply any augmentation during training. However, the distribution drift results clearly suggest that data augmentation plays a big role in handling distribution drift. The effect of data augmentation during training would be an interesting topic for future experiments.

Also, on the ImageNet benchmark ConvNeXt is the best model, and this is consistent with its results on most of the datasets in this project. It is possible that the best fine-tuning performance is due to the ConvNeXt architecture, and it would be interesting to see what happens if we combine CLIP pretraining with the ConvNeXt architecture.

Results for the clustering experiment

The results of the clustering experiment are shown below. Within each dataset column, models are listed in order of performance (rank 1 is best), and the number next to each model name is its adjusted Rand score.

| rank | Dataset_DTD | Dataset_FGVCAircraft | Dataset_Flowers102 | ISIC | FMoWDataset ID | FMoWDataset OOD | IWildCamDataset ID | IWildCamDataset OOD | PatternNet_Dataset | UCMerced_Dataset |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | CLIP 0.364 | CLIP 0.147 | CLIP 0.654 | data2vec 0.177 | CLIP 0.086 | CLIP 0.102 | DINO 0.213 | vit_b_32 0.182 | vit_b_32 0.846 | CLIP 0.684 |
| 2 | DINO 0.343 | DINO 0.106 | DINO 0.623 | Beit 0.164 | DINO 0.084 | DINO 0.079 | vit_b_32 0.173 | DINO 0.165 | convnext_base 0.841 | vit_b_16 0.649 |
| 3 | convnext_tiny 0.303 | vit_b_16 0.072 | convnext_tiny 0.380 | ViTMAE 0.120 | vit_b_32 0.055 | convnext_tiny 0.057 | CLIP 0.166 | vit_b_16 0.148 | convnext_tiny 0.834 | DINO 0.645 |
| 4 | convnext_base 0.291 | vit_b_32 0.062 | vit_b_16 0.363 | CLIP 0.087 | vit_b_16 0.049 | vit_b_16 0.057 | vit_b_16 0.161 | convnext_tiny 0.127 | vit_b_16 0.824 | vit_b_32 0.616 |
| 5 | vit_b_16 0.284 | convnext_tiny 0.046 | vit_b_32 0.344 | vit_b_32 0.029 | convnext_tiny 0.047 | vit_b_32 0.052 | convnext_tiny 0.155 | CLIP 0.126 | DINO 0.796 | convnext_tiny 0.538 |
| 6 | vit_b_32 0.275 | convnext_base 0.031 | convnext_base 0.257 | vit_b_16 0.020 | convnext_base 0.038 | convnext_base 0.048 | convnext_base 0.135 | convnext_base 0.097 | CLIP 0.744 | convnext_base 0.508 |
| 7 | ViTMAE 0.030 | ViTMAE 0.006 | Beit 0.058 | DINO 0.020 | Beit 0.009 | Beit 0.009 | Beit 0.117 | Beit 0.089 | ViTMAE 0.394 | Beit 0.172 |
| 8 | Beit 0.022 | Beit 0.006 | ViTMAE 0.053 | convnext_base 0.002 | ViTMAE 0.004 | ViTMAE 0.005 | data2vec 0.072 | data2vec 0.058 | Beit 0.365 | ViTMAE 0.148 |
| 9 | data2vec 0.013 | data2vec 0.005 | data2vec 0.047 | convnext_tiny -0.00 | data2vec 0.004 | data2vec 0.004 | ViTMAE 0.058 | ViTMAE 0.040 | data2vec 0.085 | data2vec 0.064 |

From the table, the CLIP model achieves the best performance, followed by DINO. Also, except on the ISIC dataset, the mask-based self-supervised models (ViTMAE, Beit, data2vec) perform poorly in most cases.

The models trained on ImageNet in a supervised way (ViT, convnext) perform well on IWildCamDataset, possibly because its class labels and samples are very similar to ImageNet. The DINO model, self-supervised pretrained on ImageNet, even outperforms CLIP there. An interesting question is what would happen if DINO were trained on CLIP's dataset: would its representation then beat CLIP's for clustering?

For the ISIC dataset, mask-based self-supervised models like data2vec, which do not perform well on the other datasets, perform very well this time. Because the differences between classes in ISIC are very small compared with the other datasets, masked self-supervised models may tend to preserve that fine-grained information. However, ISIC is the only dataset of that kind we tested, so this could just be a coincidence.

Discussion of possible issues and explanations

In this project, the DINO representation shows a surprising result: DINO uses less data and a self-supervised objective to learn its representation, yet surpasses CLIP, which uses far more data. Why this happens remains unclear. According to the DINO paper, its learned representation is able to reflect the semantics of the image.

Fig 17. DINO representation

However, since DINO and CLIP perform similarly on many datasets, directly visualizing the attention maps and assessing them manually may not allow a fair comparison. Due to limited time and compute, we have not found a way to explain why DINO performs so well despite its smaller pretraining set, nor tested what would happen if the CLIP method, with some modifications, were trained only on ImageNet (the same data as DINO); for example, one could apply various kinds of augmentation and, for the language supervision, combine both the label and the augmentation description into one sentence. CLIP might then achieve similar or even better performance, but this kind of experiment needs far more time and compute than we have.

Also, for MAE, data2vec and BEiT, the implementations used in this project follow the Hugging Face versions. In the Hugging Face ViT implementations, the output feature sequence has the same length as the input, so to aggregate the per-token features into one vector we directly apply mean pooling over them. According to the original MAE and BEiT papers, this should not greatly affect the final result. However, considering that the DINO, ViT and CLIP models use a specific token such as [cls] to obtain their feature, direct pooling may not be the best option (see the sketch below). Due to limited time and compute, we did not test how the aggregation of feature vectors affects the final result.
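
For clarity, the two aggregation strategies discussed here can be sketched as follows; this is a simplified placeholder function, not the exact Hugging Face interface.

```python
def pool_token_features(last_hidden_state, use_cls_token=False):
    """Aggregate a (batch, seq_len, dim) sequence of token features into one vector.

    In this project the MAE / BEiT / data2vec features were mean-pooled over all
    tokens; models such as ViT, DINO and CLIP instead expose a single [cls]
    token at position 0. Which choice is better for the masked models is left open.
    """
    if use_cls_token:
        return last_hidden_state[:, 0]          # take the [cls] token only
    return last_hidden_state.mean(dim=1)        # mean pooling over all tokens
```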

Future work and Conclusion

From the results of KNN and logistic regression on model representations, even though CLIP gives very good results on many tasks, DINO is another self-supervised pretraining method that deserves attention. In most cases the masked self-supervised pretraining methods perform poorly; this could be because such methods cannot directly produce a high-level representation, or because directly pooling over the last layer is not a good way to extract their representation. Also, if enough data is available and inference speed and cost matter, KNN can be a good choice. In terms of data drift, if a model is trained on a dataset similar to the target dataset with plenty of augmentation, its representation may not be very sensitive to distribution drift; for a genuinely new dataset, representations from CLIP and DINO may still give good performance. One limitation of the data drift experiments is that we tested only two datasets, so the interpretation may not be conclusive.

For the fine-tuning experiments, we show that on the tested datasets with limited training samples, the pretraining method may not have a huge impact on final performance when the samples are not too similar to each other. For some datasets, with fewer than 100 samples per class, simple ML algorithms on DINO or CLIP features may achieve better performance than fine-tuning. In the data drift experiments, models pretrained with lots of augmentation perform better, and when the dataset labels are similar to the downstream task, such models are even less sensitive to data drift; when the dataset is not similar to the downstream task, there is no clear sign of which model is better. Additionally, we did not add any data augmentation during fine-tuning; its effect would be interesting future work.

The representation performance in the clustering experiment is quite similar to that in the KNN and logistic regression experiments. CLIP and DINO perform well in most cases, and the models pretrained on ImageNet (ViT, convnext) perform well on IWildCamDataset. One notable finding for clustering is that the masked self-supervised models perform well on the ISIC dataset, where they also did relatively better with KNN and logistic regression. This suggests that representations from masked self-supervised pretraining may have an advantage in certain tasks, which is a question for future experiments.

References

[1] Radford, Alec, et al. "Learning Transferable Visual Models From Natural Language Supervision." arXiv, 2021.

[2] Dosovitskiy, Alexey, et al. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale."

[3] Liu, Zhuang, Hanzi Mao, et al. "A ConvNet for the 2020s."

[4] Tan, Mingxing, and Quoc V. Le. "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks."

[5] Bao, Hangbo, Li Dong, and Furu Wei. "BEiT: BERT Pre-Training of Image Transformers."

[6] He, Kaiming, Xinlei Chen, et al. "Masked Autoencoders Are Scalable Vision Learners."

[7] Caron, Mathilde, Hugo Touvron, et al. "Emerging Properties in Self-Supervised Vision Transformers."

[8] Baevski, Alexei, Wei-Ning Hsu, et al. "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language."

[9] Stewart, Adam J., Caleb Robinson, et al. "TorchGeo: Deep Learning with Geospatial Data." https://arxiv.org/abs/2111.08872

[10] SIIM-ISIC Melanoma Classification. https://www.kaggle.com/competitions/siim-isic-melanoma-classification/overview/siim-ai-conference

[11] Koh, Pang Wei, Shiori Sagawa, Henrik Marklund, et al. "WILDS: A Benchmark of in-the-Wild Distribution Shifts." ICML 2021.

Appendix