Module 6: Weak Supervision and Self-Supervision: Representation Learning
Weak supervision and self-supervision algorithms have seen tremendous successes recently. We briefly discuss key papers for both these topics to give an overview of recent progress in the field.
Introduction
We discuss techniques for learning high-quality features for computer vision models without large supervised datasets. The motivation is to reduce the huge amount of manual effort that goes into creating large supervised datasets; such a manual approach is also not scalable. Existing work on reducing supervision can be divided into two broad categories: weak supervision and self-supervision. Weak supervision uses lower-quality labels that are easier to obtain, generally from non-experts [1]. Self-supervision algorithms, on the other hand, discard label information entirely and create a supervisory signal from the input itself, often leveraging the underlying structure in the data [2]. Recent works using weak supervision or self-supervision have shown strong performance on downstream tasks. We next discuss some important works in both categories in detail.
Weak Supervision
The success of deep learning is attributed to large, high-quality datasets such as ImageNet and to improvements in computing resources, which allow much deeper models to be trained. A common belief is that improving the model architecture (depth, novel components, etc.) and training on larger datasets helps to learn better representations (as evaluated by performance on downstream tasks). Many works following the success of AlexNet focused on designing novel architectures, whereas comparatively few explored the other direction: how to increase the dataset size for representation learning. Since it is not feasible to employ human labor to create a massive, carefully annotated dataset, the idea is to use potentially noisy labels, for example hashtags or image captions, which can easily be collected from the internet. Such labels provide weak signals compared to the high-quality labels provided by domain experts, and training with large weakly supervised datasets brings its own challenges. We discuss below four works providing insights into different aspects of weakly supervised pre-training. The first work [3] shows that such pre-training indeed improves performance on downstream tasks. The second work [4] uses an even larger dataset with a different type of weak supervision and provides further analysis. The third [5] and fourth [6] works are follow-ups to the first and second, respectively. Because the experimental setups vary, we discuss results from these works only qualitatively.
Revisiting the unreasonable effectiveness of data in deep learning era
The authors use the JFT-300M dataset to pre-train a ResNet-101 model and evaluate it on multiple downstream tasks such as object detection and semantic segmentation. JFT-300M contains 300 million images derived from the data that powers image search. There are 375 million labels drawn from 18291 categories, meaning an image can belong to multiple categories. The label categories form a hierarchical relationship; for example, an image labeled apple also belongs to the fruit category. Unfortunately, few other details about the creation of this dataset are known, except that the process was automated and used a complex mixture of raw web signals, connections between web pages, and user feedback. Approximately 20% of the labels are noisy; the noise can manifest as label confusion or as random images labeled with one of the 18291 classes. Another challenge with collecting data in this manner is the skewness of the resulting data distribution. JFT-300M has a heavily long-tailed distribution: around 5K categories have fewer than 100 images each, whereas there are more than 2M images for flowers. Both factors can destabilize pre-training and bias the model towards frequent classes, hurting performance on classes in the tail of the distribution. The experimental results show that pre-training improves performance on ImageNet classification, COCO object detection, PASCAL VOC semantic segmentation, and CMU pose estimation. This indicates that weakly supervised pre-training works despite significant label noise and a long-tailed data distribution. Interested readers can go through the published work for more details.
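As a rough illustration of training with such multi-label data, the sketch below uses per-label sigmoid outputs with a binary logistic loss over a multi-hot target vector, the kind of objective the follow-up work attributes to this paper; the tiny backbone, batch, and random targets are placeholders, not the actual ResNet-101 pipeline.

```python
import torch
import torch.nn as nn

# Sketch: multi-label pre-training with per-label sigmoid outputs and a binary
# logistic loss, as used when each image can carry several (possibly hierarchical) labels.
num_labels = 18291                       # number of JFT-300M categories
backbone = nn.Sequential(                # toy stand-in for a ResNet-101 trunk
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
classifier = nn.Linear(64, num_labels)
criterion = nn.BCEWithLogitsLoss()       # per-label binary logistic loss

images = torch.randn(8, 3, 224, 224)     # dummy batch
targets = torch.zeros(8, num_labels)     # multi-hot targets (one random label here for brevity)
targets[torch.arange(8), torch.randint(0, num_labels, (8,))] = 1.0

logits = classifier(backbone(images))
loss = criterion(logits, targets)        # averaged over labels and batch
loss.backward()
```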
Exploring the limits of weakly supervised pretraining
This work scales the pre-training dataset up to 3.5 billion Instagram images. Different from the previous work, it uses the hashtags attached to the collected images as labels for learning features. The authors note that their dataset construction process is simpler and more transparent than that of the JFT-300M dataset discussed above. The hashtags associated with images are very noisy in the sense that they may not explicitly describe the visual content of the corresponding image; similarly, relevant hashtags may have been left out. WordNet is utilized to merge similar hashtags, which also reduces some noise. To characterize the effect of such label noise, the authors artificially inject noise by replacing a fraction of the hashtags with random ones and find that weakly supervised pre-training is resilient to a significant amount of label noise. We discussed earlier that large datasets have a long-tailed distribution, so how does this impact performance? The work compares three sampling strategies: sampling images from the original distribution, sampling according to the square root of hashtag frequency (which down-weights the head of the distribution), and uniformly sampling a hashtag followed by sampling an image for that hashtag. Experiments show that square-root sampling performs best, with much better accuracy than sampling from the original distribution. Additional insights and some results (in contrast with the previous work) are as follows: 1. The authors minimize a cross-entropy loss where the target vector has k non-zero entries, each set to 1/k, corresponding to the k ≥ 1 hashtags of an image; per-hashtag sigmoid outputs with a binary logistic loss, used in the previous work, give significantly worse results on this dataset. 2. Selecting a label space that matches the downstream task is as important as increasing the size of the pre-training dataset. 3. The pre-training may help classification tasks but can actually hurt localization tasks. Finally, in agreement with the previous work, the authors conclude that model architectures need to be improved, as current ones underfit such a large dataset.
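The two ingredients above, the 1/k soft-target cross-entropy over an image's hashtags and square-root resampling of the hashtag distribution, can be sketched as follows; the vocabulary size, batch, and hashtag counts are illustrative placeholders rather than the paper's actual setup.

```python
import torch
import torch.nn.functional as F

# Sketch of the multi-hashtag softmax loss: the target vector has k non-zero
# entries, each 1/k, for an image's k hashtags.
num_hashtags = 17000
logits = torch.randn(4, num_hashtags)            # model outputs for a batch of 4 images

hashtags_per_image = [[12, 55, 910], [3], [77, 8], [501, 502, 9, 4000]]
targets = torch.zeros(4, num_hashtags)
for i, tags in enumerate(hashtags_per_image):
    targets[i, tags] = 1.0 / len(tags)           # k non-zero entries, each 1/k

loss = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Square-root sampling sketch: sample hashtags with probability proportional to
# sqrt(frequency), which down-weights the head of the long-tailed distribution.
hashtag_freq = torch.randint(1, 10000, (num_hashtags,)).float()  # dummy counts
sampling_prob = hashtag_freq.sqrt()
sampling_prob /= sampling_prob.sum()
sampled = torch.multinomial(sampling_prob, num_samples=16, replacement=True)
```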
Scaling vision transformers
Both works above find that model capacity, rather than dataset size, is the bottleneck for weakly supervised pre-training. One can therefore ask how crucial the modeling choices are, and how the model behaves when both model capacity and dataset size are increased. This work investigates these questions by training a vision transformer (ViT) with 2B parameters on the JFT-3B dataset. JFT-3B is an extension of the previously discussed JFT-300M dataset and contains 3 billion images. There are 30K labels obtained via a semi-automatic pipeline, indicating some manual labour in the process. Interestingly, this work ignores the hierarchical nature of the labels, which was deemed crucial for pre-training with the JFT-300M dataset [3]. The use of a different model architecture is also notable. Ideally, we want the model to learn inductive biases from the data itself rather than baking them manually into the architecture, as is done in CNNs. In this regard, ViT may be preferable to CNNs for large-scale datasets, as it achieves comparable or better performance with much less built-in inductive bias. Further, the model architecture should be easy to scale to large datasets; it is not entirely clear how to scale CNN models, although there have been recent efforts in this direction (EfficientNet, BiT, ResNet-RS), whereas scaling is relatively well studied for transformers. The ViT results presented in this paper are state of the art and outperform all other model architectures. Also, simply switching the pre-training data from JFT-300M to JFT-3B, without scaling the model, brings consistent and significant improvements for both small and large model capacities. The authors discuss various small improvements that are important for training models at scale, but we leave them out as they are less relevant for this survey.
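To make the "less inductive bias" point concrete, here is a minimal sketch of ViT-style patch embedding: the image is cut into fixed-size patches and linearly projected into tokens, after which a generic transformer takes over; sizes and layer counts are illustrative and far smaller than the 2B-parameter model discussed above.

```python
import torch
import torch.nn as nn

# Sketch: patch embedding is essentially the only image-specific inductive bias in a ViT.
patch, dim = 16, 256
images = torch.randn(2, 3, 224, 224)

proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)    # patchify + linear projection
tokens = proj(images).flatten(2).transpose(1, 2)             # (B, 196, dim) patch tokens

cls = nn.Parameter(torch.zeros(1, 1, dim)).expand(2, -1, -1) # [CLS] token
pos = nn.Parameter(torch.zeros(1, tokens.size(1) + 1, dim))  # learned position embeddings
x = torch.cat([cls, tokens], dim=1) + pos

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2
)
features = encoder(x)                                         # a generic transformer does the rest
```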
Revisiting weakly supervised pretraining of visual perception models
This paper revisits pre-training on Instagram images and hashtags. As in the previous work, one focus is increasing model capacity, and here too one of the best performing models is a ViT. The dataset size is similar, but the resampling scheme is modified from square-root sampling to sampling according to the inverse square root of hashtag frequency; additionally, images containing at least one infrequent hashtag are upsampled by 100x. The authors provide a system-level comparison between supervised, self-supervised and weakly supervised pre-training. Interestingly, they treat JFT-300M and JFT-3B as supervised rather than weakly supervised datasets, since the work using JFT-3B described the dataset construction as semi-automatic. The results show that the weakly supervised ViT is competitive with, and on some tasks surpasses, the other models in classification accuracy; the comparative results are presented only for classification tasks. We also note that pre-training is done at a lower resolution (224x224) while fine-tuning uses a higher resolution (e.g. 518x518), presumably to reduce pre-training time. However, this hurts model performance in settings where the pre-training resolution is used (as in the next two comparisons); in those cases the best performing model is based on RegNetY rather than ViT. The comparison with self-supervised models shows that weak supervision outperforms self-supervision by a significant margin, although it is difficult to draw firm conclusions since the experimental settings vary: for example, the self-supervised models considered were trained on smaller datasets with different architectures. The final comparison is done in a zero-shot setting against weakly supervised models that use the text modality for pre-training, such as CLIP [7] and ALIGN [8]. Image-text pre-training has become extremely popular recently, but we do not cover these models in this survey. CLIP and ALIGN outperform the RegNetY models but, as mentioned earlier, it is difficult to attribute the gap due to differences in experimental settings. Nonetheless, weak supervision, including CLIP-style pre-training, shows very strong results. All four works find that performance does not saturate and may increase further with proper scaling of dataset and model size.
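A sketch of the modified resampling is given below: hashtags are weighted by the inverse square root of their frequency, and images with at least one infrequent hashtag are upsampled by 100x. How the per-hashtag weights are aggregated into a per-image weight, and the rarity threshold, are assumptions made purely for illustration.

```python
import torch

# Sketch: per-hashtag weight proportional to 1/sqrt(frequency); images with a rare
# hashtag get a further 100x boost. Counts and threshold are dummy values.
hashtag_freq = torch.tensor([2_000_000., 900_000., 120., 45_000.])
rare_threshold = 5_000                                   # assumed cutoff for "infrequent"

inv_sqrt_weight = hashtag_freq.rsqrt()                   # 1 / sqrt(frequency)

image_hashtags = [[0, 1], [2, 3], [1]]                   # hashtag ids per image
image_weights = []
for tags in image_hashtags:
    w = inv_sqrt_weight[tags].mean()                     # assumed aggregation: mean over tags
    if (hashtag_freq[tags] < rare_threshold).any():      # contains an infrequent hashtag
        w = w * 100.0                                    # 100x upsampling
    image_weights.append(w)
image_weights = torch.stack(image_weights)

# Sample a training batch of image indices according to these weights.
batch_idx = torch.multinomial(image_weights / image_weights.sum(), 2, replacement=True)
```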
Self-Supervision
Self-supervised learning refers to the paradigm of learning feature representations in an unsupervised manner by defining an annotation-free pretext task that derives its supervision from the input itself. The success of this paradigm has been demonstrated in natural language processing by well-known works such as BERT [9] and GPT [10], following the emergence of the Transformer model. In computer vision, by contrast, early works in the self-supervised regime mainly used convolutional architectures such as residual networks. Recent research on self-supervised representation learning in computer vision follows several directions. One major direction is the design of pretext tasks, with the rationale that solving such self-supervised tasks forces the model to learn semantic image features useful for other vision tasks. For example, Zhang et al. [11] use image colorization, colorizing grayscale images, to train ConvNets to learn features; Doersch et al. [12] predict the relative position of image patches; and [13] propose to learn features by predicting image rotations. Another line of research derives from the emergence of contrastive learning, which models image similarity and dissimilarity between two or more views generated with data augmentation. While contrastive learning is a general framework that can be combined with different pretext tasks, works in this direction typically use the instance discrimination task [14]. This survey includes MoCo [15] as a representative work in this area. In contrast to contrastive pre-training, a recent line of work leverages only the similarity between multiple views of the same data and has demonstrated even better linear-probing performance on ImageNet; this direction also depends heavily on data augmentation to create different views. We present DINO [16] as a representative paper. The last direction in self-supervised pre-training included in this survey is reconstruction-based autoencoders. Recent works such as BEiT [17] and MAE [18] are inspired by the success of autoencoders in NLP and apply the same or a similar approach in computer vision with vision transformers [19].
Unsupervised Representation Learning by Predicting Image Rotations
In this paper the authors propose a new pretext task: predicting the rotation applied to an image. The work learns image representations by training ConvNets to recognize the geometric transformation applied to the input image. In detail, it first defines a set of geometric transformations and a set of training images; each geometric transformation is then applied to every image, and the transformation serves as the label of the transformed image. Under this setup, the pretext task becomes a classification problem with the number of classes equal to the number of transformations. In particular, the authors define the geometric transformations as image rotations by 0, 90, 180, and 270 degrees. As in a standard classification problem, the usual negative log-likelihood loss is used to maximize the likelihood of the correct class. The intuition behind this task is that, to successfully predict the rotation of an image, the model must learn to localize the main objects in the image, recognize their orientation based on the object type, and then relate the object orientation to the dominant orientation in which each type of object tends to be depicted in the available images. The problem is well-posed because most people tend to take photos with the main objects upright, and using only four coarse rotation categories reduces the ambiguity that may exist when objects are not perfectly upright. The method was evaluated on CIFAR-10, ImageNet, PASCAL VOC [20], and Places [21]; the authors used a ConvNet backbone on CIFAR-10 and AlexNet for the other datasets. With these architectures, the proposed method consistently outperforms previous self-supervised methods on all datasets except Places (we refer readers to the original paper for the detailed evaluation settings). However, the gap to supervised training remains large: on ImageNet, the top-1 accuracy of a linear classifier fit to the self-supervised features is only 38.7 percent, much lower than training with labels, which reaches 50.5 percent.
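A minimal sketch of the rotation pretext task is shown below: each image in a batch is rotated by 0, 90, 180, and 270 degrees, the rotation index serves as the class label, and a standard cross-entropy (negative log-likelihood) loss is minimized; the toy backbone stands in for the ConvNet/AlexNet models used in the paper.

```python
import torch
import torch.nn as nn

# Sketch of the rotation-prediction pretext task: classify which of four rotations
# was applied to the input image.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 4)   # 4 rotation classes
)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 32, 32)                            # dummy batch
rotated, labels = [], []
for k in range(4):                                            # 0, 90, 180, 270 degrees
    rotated.append(torch.rot90(images, k, dims=(2, 3)))
    labels.append(torch.full((images.size(0),), k, dtype=torch.long))
rotated, labels = torch.cat(rotated), torch.cat(labels)

loss = criterion(backbone(rotated), labels)                   # standard cross-entropy
loss.backward()
```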
Momentum Contrast for Unsupervised Visual Representation Learning
Contrastive learning can be thought of as building a dynamic dictionary where the keys are encoded samples from the data. Learning then trains encoders to perform dictionary look-up for a newly sampled "query", such that the encoded query is similar to its matching key and dissimilar to the others. The definition of a matching pair depends on the specific pretext task; for example, with instance discrimination, a matching pair can be two different views of the same image generated by random augmentation. In general, the encoders are trained with a contrastive (InfoNCE) loss of the form L_q = -log( exp(q·k_+/τ) / Σ_{i=0..K} exp(q·k_i/τ) ), where q is the encoded query, k_+ is its matching key, the k_i are the keys in the dictionary, and τ is a temperature. Momentum Contrast (MoCo) [15] builds on this general framework and proposes a momentum encoder and a queue to make training more stable. In detail, a fixed-size queue stores samples encoded by the key encoder, and the earliest samples are removed when capacity is reached; this design decouples the dictionary size from the minibatch size. However, since the queue stores encodings rather than samples, early encodings become inconsistent with later ones if the encoder is updated continuously. To mitigate this inconsistency, MoCo updates only the query encoder with gradients at each iteration and updates the key encoder as a moving average of the query encoder with a momentum term close to 1, so that the encoded samples in the queue remain more consistent. In contrast, many earlier contrastive methods are end-to-end: they use the samples in the current minibatch as the dictionary, so the keys are consistently encoded, but the dictionary size is coupled with the minibatch size and no momentum key encoder is used. MoCo is a general learning framework that does not require a specific pretext task; for the particular implementation in the paper, the authors use instance discrimination. In each iteration, a new minibatch is sampled from the training set and two views are generated with random augmentation; the two views are treated as queries and keys, encoded by the query and key encoder respectively, and the newly encoded keys are added to the queue. Interestingly, in the follow-up work MoCo v3 [22], where the authors investigate momentum contrast with vision transformers, the queue is removed from the design, with the claim that a large minibatch size is sufficient for effective training.
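The core MoCo mechanics, the momentum update of the key encoder, the queue of negative keys, and the InfoNCE loss, can be sketched as follows; the encoders, sizes, and "augmented" views are toy placeholders rather than the paper's ResNet setup.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of MoCo: momentum-updated key encoder, queue of negatives, InfoNCE loss.
dim, queue_size, momentum, tau = 128, 4096, 0.999, 0.07
encoder_q = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
encoder_k = copy.deepcopy(encoder_q)                # key encoder starts as a copy
for p in encoder_k.parameters():
    p.requires_grad = False                         # updated by momentum, not gradients

queue = F.normalize(torch.randn(queue_size, dim), dim=1)

view_q = torch.randn(16, 3, 32, 32)                 # two "augmented" views of the same batch
view_k = torch.randn(16, 3, 32, 32)

q = F.normalize(encoder_q(view_q), dim=1)
with torch.no_grad():
    # momentum update: key encoder <- m * key encoder + (1 - m) * query encoder
    for pk, pq in zip(encoder_k.parameters(), encoder_q.parameters()):
        pk.mul_(momentum).add_(pq, alpha=1 - momentum)
    k = F.normalize(encoder_k(view_k), dim=1)

l_pos = (q * k).sum(dim=1, keepdim=True)            # positive logits: q . k+
l_neg = q @ queue.t()                               # negative logits against the queue
logits = torch.cat([l_pos, l_neg], dim=1) / tau
labels = torch.zeros(q.size(0), dtype=torch.long)   # the positive key is at index 0
loss = F.cross_entropy(logits, labels)              # InfoNCE
loss.backward()

queue = torch.cat([k.detach(), queue])[:queue_size] # enqueue new keys, dequeue oldest
```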
Emerging Properties in Self-Supervised Vision Transformers
While contrastive pre-training has demonstrated success in representation learning, recent work showed that features can be learned without discriminating between different images, by matching features to representations obtained from a momentum encoder. In this paper, DINO, the authors pursue this direction and interpret it as a form of knowledge distillation, where a student network is trained to match the output of a teacher network. In particular, DINO formulates its teacher as a momentum encoder that is initially a copy of the student and slowly updates itself with the student's weights, without gradient propagation. Each encoder outputs a distribution over K dimensions given an input image, and the student is trained with a cross-entropy loss of the form H(P_t(x), P_s(x)) = -Σ_k P_t(x)_k log P_s(x)_k, where P_s and P_t are the temperature-scaled softmax outputs of the student and teacher, and the teacher output is additionally centered to avoid collapse. Interestingly, with this design the main difference between DINO and MoCo v3 is the loss function, while the architecture and the general update rule are the same. In MoCo v3, both the momentum encoder and the gradient-updated encoder receive a random view of an image; the gradient-updated encoder tries to maximize the similarity between its output and the momentum encoder's output for different views of the same image, and to minimize the similarity for views of different images. In DINO, the momentum encoder receives global views and the gradient-updated encoder receives local views, and the gradient-updated encoder only tries to maximize the similarity between its output and the momentum encoder's output for different views of the same image. DINO outperforms previous self-supervised methods on linear-probing top-1 classification accuracy on ImageNet with both ResNet-50 and ViT backbones.
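A sketch of a DINO-style update is given below: the student matches the centered, temperature-sharpened output of a momentum teacher across views, with no negative pairs; the networks, view generation, and hyperparameter values are simplified placeholders.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a DINO-style step: cross-entropy between teacher and student distributions,
# EMA updates of the teacher weights and of the centering term.
K, m, tau_s, tau_t = 256, 0.996, 0.1, 0.04
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, K))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False                        # teacher is never updated by gradients
center = torch.zeros(K)

global_view = torch.randn(8, 3, 32, 32)            # teacher sees global views
local_view = torch.randn(8, 3, 32, 32)             # student sees (also) local views

with torch.no_grad():
    t_out = teacher(global_view)
    p_t = F.softmax((t_out - center) / tau_t, dim=1)   # center + sharpen teacher output
p_s = F.log_softmax(student(local_view) / tau_s, dim=1)

loss = -(p_t * p_s).sum(dim=1).mean()              # cross-entropy H(P_t, P_s)
loss.backward()

with torch.no_grad():                              # EMA updates of teacher and center
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)
    center = 0.9 * center + 0.1 * t_out.mean(dim=0)
```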
Masked Autoencoders Are Scalable Vision Learners
Reconstruction-based autoencoders were studied in computer vision before the rise of BERT and GPT in NLP. Pioneering works in this area include denoising autoencoders (2008), which corrupt an input signal and learn to reconstruct the original, uncorrupted signal, and stacked denoising autoencoders (2010), which present masking as a type of noise and apply it to images. However, these frameworks, implemented with convolutional neural networks, did not reach the impressive performance that masked autoencoders have shown in NLP. The authors of this paper (MAE) hypothesize that this may be caused by the different information density of images, which are natural signals with redundant local information, and language, where each word may carry stand-alone information, as well as by the difference in architectures between the two fields: ConvNets versus transformers. In this work the authors propose a simple framework to address these issues: masked autoencoders built upon vision transformers that mask a large portion of the input image. In detail, MAE consists of an encoder and a decoder. During training, each input image is split into patches that are randomly masked. The unmasked patches are combined with positional encodings and fed to the encoder. The encoded patches are then appended with a learnable mask token for each masked patch; after adding positional encodings, all tokens are fed into the decoder, which directly predicts pixel values for the original image. A mean-squared-error reconstruction loss between predicted and actual pixel values is applied only to the masked patches and backpropagated through the decoder and encoder. The authors show that this framework performs better when a large fraction of patches is masked; the optimal masking ratio for ImageNet classification pre-trained on ImageNet is around 75%. The high masking ratio also enables an asymmetric design between the encoder and decoder: since only a small number of patches is processed by the encoder, it can be twice the size of the decoder and still reduce computation cost compared to encoding all the patches.
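The recipe described above can be sketched as follows, with simple linear layers standing in for the transformer encoder and decoder blocks: roughly 75% of patches are masked, only visible patches are encoded, mask tokens are inserted before decoding, and the MSE loss is computed on masked patches only.

```python
import torch
import torch.nn as nn

# Sketch of the MAE recipe: random masking, encode visible patches only, decode all
# positions (mask tokens included), regress pixels of masked patches.
patch, dim, mask_ratio = 16, 128, 0.75
images = torch.randn(2, 3, 224, 224)

# Patchify into (B, N, patch*patch*3)
patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, -1, 3 * patch * patch)
B, N, D = patches.shape

num_keep = int(N * (1 - mask_ratio))
shuffle = torch.rand(B, N).argsort(dim=1)           # random masking via a random permutation
keep_idx, mask_idx = shuffle[:, :num_keep], shuffle[:, num_keep:]

embed = nn.Linear(D, dim)
encoder = nn.Linear(dim, dim)                       # stand-in for the ViT encoder
decoder = nn.Linear(dim, D)                         # stand-in for the lightweight decoder
mask_token = nn.Parameter(torch.zeros(dim))
pos = nn.Parameter(torch.zeros(N, dim))             # positional encodings

visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
enc = encoder(embed(visible) + pos[keep_idx])       # encode visible patches only

tokens = mask_token.expand(B, N, dim).clone()       # fill all positions with mask tokens...
tokens.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, dim), enc)  # ...then place encodings
pred = decoder(tokens + pos)                        # decode and predict pixel values

target = torch.gather(patches, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
pred_masked = torch.gather(pred, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
loss = ((pred_masked - target) ** 2).mean()         # MSE only on masked patches
loss.backward()
```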
References
[1] http://ai.stanford.edu/blog/weak-supervision
[2] https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence
[3] Sun, Chen et al. “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era.” 2017 IEEE International Conference on Computer Vision (ICCV) (2017)
[4] Mahajan, Dhruv Kumar et al. “Exploring the Limits of Weakly Supervised Pretraining.” ArXiv abs/1805.00932 (2018)
[5] Zhai, Xiaohua et al. “Scaling Vision Transformers.” ArXiv abs/2106.04560 (2021)
[6] Singh, Mannat et al. “Revisiting Weakly Supervised Pre-Training of Visual Perception Models.” ArXiv abs/2201.08371 (2022)
[7] Radford, Alec et al. “Learning Transferable Visual Models From Natural Language Supervision.” ICML (2021)
[8] Jia, Chao et al. “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision.” ICML (2021)
[9] Devlin, Chang et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL (2019)
[10] Radford, Narasimhan et al. “Improving Language Understanding by Generative Pre-Training.” (2018)
[11] Zhang, Isola, Efros. “Colorful image colorization.” In European Conference on Computer Vision. pp. 649–666. Springer, 2016a.
[12] Doersch, Gupta, Efros. “Unsupervised visual representation learning by context prediction.” In Proceedings of the IEEE International Conference on Computer Vision (2015)
[13] Gidaris, Singh, Komodakis. “Unsupervised Representation Learning by Predicting Image Rotations.” ICLR (2018)
[14] Wu, Xiong, et al. “Unsupervised feature learning via non-parametric instance discrimination”. CVPR (2018)
[15] He, Fan, et al. “Momentum contrast for unsupervised visual representation learning”. CVPR (2020)
[16] Caron, Touvron et al. “Emerging Properties in Self-Supervised Vision Transformers.” ICCV (2021)
[17] Bao, Dong, Wei. “BEiT: BERT pre-training of image transformers.” arXiv:2106.08254 (2021)
[18] He, Chen, et al. “Masked Autoencoders Are Scalable Vision Learners.” CVPR (2022)
[19] Dosovitskiy, Beyer, et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” ICLR (2021)
[20] Everingham, Gool, et al. “The pascal visual object classes (voc) challenge.” IJCV (2010)
[21] Zhou, Lapedriza, et al. “Learning deep features for scene recognition using places database.” NIPS (2014)
[22] Chen, Xie, He. “An Empirical Study of Training Self-Supervised Vision Transformers.” ICCV (2021)