Skip to content

Self-Supervised Learning

Some of the following contents are adapted from the paper entitled A Cookbook of Self-Supervised Learning1.

Self-Supervised Learning (SSL), dubbed “the dark matter of intelligence”, is a promising path to advance machine learning. As opposed to supervised learning, which is limited by the availability of labeled data, self-supervised approaches can learn from vast unlabeled data.

SSL underpins deep learning’s success in natural language processing leading to advances from automated machine translation to large language models trained on web-scale corpora of unlabeled text. SSL methods for computer vision have been able to match or in some cases surpass models trained on labeled data, even on highly competitive benchmarks like ImageNet. SSL has also been successfully applied across other modalities such as video, audio, and time series.

Self-supervised learning defines a pretext task based on unlabeled inputs to produce descriptive and intelligible representations. No labelled data is required for SSL, although the models used are actually supervised.

In natural language, a common SSL objective is to mask a word in the text and predict the surrounding words. This objective encourages the model to capture relationships among words in the text without the need for any labels. The same SSL model representations can be used across a range of downstream tasks such as translating text across languages, summarizing, or even generating text, along with many others.

In computer vision, analogous pretext tasks exist with models learning to predict masked patches of an image or representation. Other SSL objectives encourage two views of the same image, formed by say adding color or cropping, to be mapped to similar representations.

In self-supervised learning, one trains a model to solve a so-called pretext task on a dataset without the need for human annotation by creating positive (and sometimes, negative pairs) from unlabeled data. The pretext tasks must be carefully designed to encourage the model to capture meaningful features and similarities in the data.

The main objective, however, is to transfer this model to a target (downstream) task. The downstream task is the knowledge transfer process of the pretext model to a specific task. For this, one of the most effective transfer strategies is fine-tuning. After training the pretext task, the learned embeddings can easily be used for another task such as image classification or text summarization.

Vanilla SSL approach. Source: https://openaccess.thecvf.com/content_cvpr_2018/papers/Noroozi_Boosting_Self-Supervised_Learning_CVPR_2018_paper.pdf

With the power to train on vast unlabeled data comes many benefits. While traditional supervised learning methods are trained on a specific task often known a priori based on the available labeled data, SSL learns generic representations useful across many tasks. SSL can be especially useful in domains such as medicine where labels are costly or the specific task can not be known a priori. There’s also evidence SSL models can learn representations that are more robust to adversarial examples, label corruption, and input perturbations—and are more fair—compared to their supervised counterparts. Consequently, SSL is a field garnering growing interest.

For this course, we are following the categorization in 1, in which methods are divided into three families:

  • Deep Metric Learning
  • Self-Distillation
  • Canonical Correlation Analysis

A single model of each family is selected: SimCLR2, BYOL3, and VicReg4.

To understand these recent methods, you should first know the basics of data augmentation and the origins of SSL.

Data Augmentation

SSL methods often begin with data augmentation, which involves applying various transformations or perturbations to unlabeled data to create diverse instances (also called augmented views).

The goal of data augmentation is to increase the variability of the data and expose the model to different perspectives of the same instance. Common data augmentation techniques include cropping, flipping, rotation, random crop, and color transformations. By generating diverse instances, contrastive learning ensures that the model learns to capture relevant information regardless of variations in the input data.

You can also chain the different techniques together:

Data augmentation examples. Source: https://encord.com/blog/guide-to-contrastive-learning/

Data augmentation examples. Source: https://encord.com/blog/guide-to-contrastive-learning/

💡 If you want to learn more about data augmentation, please read this link.

Origins of SSL

Contemporary SSL methods build upon the knowledge from early experiments, which form the foundation for many of the modern methods.

Learning spatial context methods such as RotNet5 train a model to understand the relative positions and orientations of objects within a scene. In particular, RotNet applies a random rotation and then asks the model to predict the rotation.

The RotNet model is trained with random augmentations of images at 0, 90, 180 and 270 degrees. A standard classification network is trained to predict the rotation. Then, the learned features can be used to initialize the weights of a network for another downstream task.

Despite the simplicity of this self-supervised formulation, as can be seen in the experimental section of the paper, the features learned achieve dramatic improvements on the unsupervised feature learning benchmarks.

The Deep Metric Learning Family

The Deep Metric Learning (DML) family of methods is based on the principle of encouraging similarity between semantically transformed versions of an input. It includes methods such as SimCLR2, NNCLR, MeanSHIFT or SCL, among others. DML originated with the idea of contrastive loss, which transforms this principle into a learning objective.

Contrastive learning leverages the assumption that similar instances should be closer together in a learned embedding space, while dissimilar instances should be farther apart. By framing learning as a discrimination task, contrastive learning allows models to capture relevant features and similarities in the data.

Contrastive Learning

Contrastive Learning can be used in a Supervised or a Self-Supervised manner.

In the Few-Shot Learning block, we already used Contrastive Learning methods in a supervised way, using labelled data for training models explicitly to differentiate between similar and dissimilar instances according to their class. Therefore, the loss in this case tries to put together embeddings from the same class. The objective is to learn a representation space where instances from the same class are clustered closer together, while instances from other classes are pushed apart.

Self-supervised contrastive learning takes a different approach by learning representations from unlabeled data without relying on explicit labels. Since no labels are available, to identify similar inputs, methods often form variants of a single input using known semantic preserving transformations via data augmentation. The augmented samples are called views. The variants of the inputs are called positive pairs or examples; the samples we wish to make dissimilar are called negatives. So here, the loss tries to put together embeddings from the same view since we don't have information of the labels.

SimCLR

SimCLR2, a Simple framework for Contrastive Learning of visual Representations, is a technique introduced by Prof. Hinton’s Group.

SIMCLR

SimCLR learns representations by maximizing agreement between differently augmented views of the same data example via a contrastive loss in the latent space, as shown below.

SIMCLR

SimCLR results were impressive, showing that unsupervised learning benefits more from bigger models than its supervised counterpart.

Homework

  • Read this post to understand the basics of SimCLR. Estimated time: 🕑 30 min.

The Self-Distillation Family

Self-distillation methods such as BYOL6, SimSIAM, DINO, along with their variants rely on a simple mechanism: feeding two different views to two encoders, and mapping one to the other by means of a predictor. To prevent the encoders from collapsing by predicting a constant for any input, various techniques are employed. A common approach to prevent collapse is to update one of the two encoder weights with a running average of the other encoder’s weights.

BYOL

BYOL (bootstrap your own latent)6 introduced self-distillation as a means to avoid collapse.

BYOL

BYOL uses two networks along with a predictor to map the outputs of one network to the other. The network predicting the output is called the online or student network while the network producing the target is called the target or teacher network. Each network receives a different view of the same image formed by image transformations including random resizing, cropping, color jittering, and brightness alterations.

The online (student) network is updated throughout training using gradient descent. The target (teacher) network is updated with an exponential moving average (EMA) updates of the weights of the online network. The slow updates induced by exponential moving average creates an asymmetry that is crucial to BYOL’s success.

BYOL achieves higher performance than state-of-the-art contrastive methods without using negative pairs at all. Instead, it uses two networks that learn from each other to iteratively bootstrap the representations, by forcing one network to use an augmented view of an image to predict the output of the other network for a different augmented view of the same image.

BYOL almost matches the best supervised baseline on top-1 accuracy on ImageNet and beats out the self-supervised baselines.

Homework

  • Read this post to understand the basics of BYOL. Estimated time: 🕑 45 min.

The Canonical Correlation Analysis Family

The SSL canonical correlation analysis family originates with the Canonical Correlation Framework7. The high-level goal of CCA is to infer the relationship between two variables by analyzing their cross-covariance matrices. It includes methods such as VicReg4, BarlowTwins, SWAV or W-MSE, among others.

VicReg

VicReg2

VICReg4 has the same basic architecture as its predecessors; augmented positive pairs are fed into Siamese encoders that produce representations, which are then passed into Siamese projectors that return projections.

However, unlike previous models, VicReg requires none of the following: negative examples, momentum encoders, asymmetric mechanisms in the architecture, stop-gradients, predictors, or even normalization of the projector outputs. Instead, the heavy lifting is done by VICReg’s objective function, which contains three main terms: a variance term, an invariance term, and a covariance term.

VICReg balances three objectives based on co-variance matrices of representations from two views: variance, invariance, co- variance. Regularizing the variance along each dimension of the representation prevents collapse, the invariance ensures two views are encoded similarly, and the co-variance encourages different dimensions of the representation to capture different features.

Homework

  • Read this post to understand the basics of VicReg. Estimated time: 🕑 30 min.

Using this code you can train SimCLR, BYOL or VicReg on ImageNet, STL-10, and CIFAR-10 and compare their results.


  1. Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, Avi Schwarzschild, Andrew Gordon Wilson, Jonas Geiping, Quentin Garrido, Pierre Fernandez, Amir Bar, Hamed Pirsiavash, Yann LeCun, and Micah Goldblum. A cookbook of self-supervised learning. 2023. arXiv:2304.12210

  2. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. CoRR, 2020. URL: https://arxiv.org/abs/2002.05709, arXiv:2002.05709

  3. Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, koray kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, 21271–21284. Curran Associates, Inc., 2020. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf

  4. Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: variance-invariance-covariance regularization for self-supervised learning. 2022. arXiv:2105.04906

  5. Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. CoRR, 2018. URL: http://arxiv.org/abs/1803.07728, arXiv:1803.07728

  6. Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, koray kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, 21271–21284. Curran Associates, Inc., 2020. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf

  7. Harold Hotelling. Relations Between Two Sets of Variates, pages 162–190. Springer New York, New York, NY, 1992. URL: https://doi.org/10.1007/978-1-4612-4380-9_14, doi:10.1007/978-1-4612-4380-9_14