
Barlow Twins: Self-Supervised Learning via Redundancy Reduction

Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny

Abstract

Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly, it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with the current state of the art for ImageNet classification with a linear classifier head, as well as for transfer tasks of classification and object detection. Code and pre-trained models (in PyTorch) are available at https://github.com/facebookresearch/barlowtwins

Introduction

Self-supervised learning aims to learn useful representations of the input data without relying on human annotations. Recent advances in self-supervised learning for visual data (Caron et al., 2020; Chen et al., 2020a; Grill et al., 2020; He et al., 2019; Misra & van der Maaten, 2019) show that it is possible to learn self-supervised representations that are competitive with supervised representations. A common underlying theme that unites these methods is that they all aim to learn representations that are invariant under different distortions (also referred to as 'data augmentations'). This is typically achieved by maximizing similarity of representations obtained from different distorted versions of a sample using a variant of Siamese networks (Hadsell et al., 2006). As there are trivial solutions to this problem, like a constant representation, these methods rely on different mechanisms to learn useful representations.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Contrastive methods like SIMCLR (Chen et al., 2020a) define 'positive' and 'negative' sample pairs which are treated differently in the loss function. Additionally, they can also use asymmetric learning updates wherein momentum encoders (He et al., 2019) are updated separately from the main network. Clustering methods use one distorted sample to compute 'targets' for the loss, and another distorted version of the sample to predict these targets, followed by an alternate optimization scheme like k-means in DEEPCLUSTER (Caron et al., 2018) or non-differentiable operators in SWAV (Caron et al., 2020) and SELA (Asano et al., 2020). In another recent line of work, BYOL (Grill et al., 2020) and SIMSIAM (Chen & He, 2020), both the network architecture and parameter updates are modified to introduce asymmetry. The network architecture is modified to be asymmetric using a special 'predictor' network and the parameter updates are asymmetric such that the model parameters are only updated using one distorted version of the input, while the representations from another distorted version are used as a fixed target. (Chen & He, 2020) conclude that the asymmetry of the learning update, 'stop-gradient', is critical to preventing trivial solutions.

In this paper, we propose a new method, BARLOW TWINS, which applies redundancy reduction, a principle first proposed in neuroscience, to self-supervised learning. In his influential article Possible Principles Underlying the Transformation of Sensory Messages (Barlow, 1961), neuroscientist H. Barlow hypothesized that the goal of sensory processing is to recode highly redundant sensory inputs into a factorial code (a code with statistically independent components). This principle has been fruitful in explaining the organization of the visual system, from the retina to cortical areas (see (Barlow, 2001) for a review and (Lindsey et al., 2020; Ocko et al., 2018; Schwartz & Simoncelli, 2001) for recent efforts), and has led to a number of algorithms for supervised and unsupervised learning (Ballé et al., 2017; Deco & Parra, 1997; Földiák, 1990; Linsker, 1988; Redlich, 1993a;b; Schmidhuber et al., 1996). Based on this principle, we propose an objective function which tries to make the cross-correlation matrix computed from twin embeddings as close to the identity matrix as possible. BARLOW TWINS is conceptually simple, easy to implement, and learns useful representations as opposed to trivial solutions. Compared to other methods, it does not require large batches (Chen et al., 2020a), nor does it require any asymmetric mechanisms like prediction networks (Grill et al., 2020), momentum encoders (He et al., 2019), non-differentiable operators (Caron et al., 2020), or stop-gradients (Chen & He, 2020). Intriguingly, BARLOW TWINS strongly benefits from the use of very high-dimensional embeddings. BARLOW TWINS outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime (55% top-1 accuracy for 1% labels), and is on par with the current state of the art for ImageNet classification with a linear classifier head, as well as for a number of transfer tasks of classification and object detection.
Negative Examples to Avoid Collapse
Asymmetric Methods to Avoid Collapse
Clustering Methods to Avoid Collapse
Whitening Methods to Avoid Collapse
Redundancy Reduction is a Fundamental Organizational Principle of Neuroscience

Method

Description of Barlow Twins

Like other methods for SSL (Caron et al., 2020; Chen et al., 2020a; Grill et al., 2020; He et al., 2019; Misra & van der Maaten, 2019), BARLOW TWINS operates on a joint embedding of distorted images (Fig. 1). More specifically, it produces two distorted views for all images of a batch X sampled from a dataset. The distorted views are obtained via a distribution of data augmentations T. The two batches of distorted views Y^A and Y^B are then fed to a function f_θ, typically a deep network with trainable parameters θ, producing batches of embeddings Z^A and Z^B respectively. To simplify notation, Z^A and Z^B are assumed to be mean-centered along the batch dimension, such that each unit has mean output 0 over the batch.

BARLOW TWINS distinguishes itself from other methods by its innovative loss function L BT :

L_BT = Σ_i (1 - C_ii)^2 + λ Σ_i Σ_{j≠i} C_ij^2        (1)

where λ is a positive constant trading off the importance of the first and second terms of the loss, and where C is the cross-correlation matrix computed between the outputs of the two identical networks along the batch dimension:

C_ij = Σ_b z^A_{b,i} z^B_{b,j} / ( √(Σ_b (z^A_{b,i})^2) √(Σ_b (z^B_{b,j})^2) )        (2)

where b indexes batch samples and i, j index the vector dimension of the networks' outputs. C is a square matrix with size equal to the dimensionality of the network's output, with values between -1 (i.e., perfect anti-correlation) and 1 (i.e., perfect correlation).
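As a quick sanity check of eqn. 2, the cross-correlation matrix can be computed directly in NumPy (an illustrative sketch, not the authors' code; the batch size and dimensionality are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
z_a = rng.standard_normal((512, 8))   # batch of 512 embeddings, 8 units (view A)
z_b = z_a.copy()                      # identical second view for this check

# mean-center along the batch dimension, as assumed in the text
z_a = z_a - z_a.mean(axis=0)
z_b = z_b - z_b.mean(axis=0)

# eqn. 2: normalized cross-correlation between unit i of view A and unit j of view B
num = z_a.T @ z_b
den = np.sqrt((z_a ** 2).sum(axis=0))[:, None] * np.sqrt((z_b ** 2).sum(axis=0))[None, :]
c = num / den

assert np.allclose(np.diag(c), 1.0)        # identical views: units perfectly correlated
assert np.all(np.abs(c) <= 1.0 + 1e-9)     # entries are correlations, so bounded by 1
```

With two identical views the diagonal is exactly 1, and by Cauchy-Schwarz every entry lies in [-1, 1], matching the description above.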

Intuitively, the invariance term of the objective, by trying to equate the diagonal elements of the cross-correlation matrix to 1, makes the embedding invariant to the distortions applied. The redundancy reduction term, by trying to equate the off-diagonal elements of the cross-correlation matrix to 0, decorrelates the different vector components of the embedding. This decorrelation reduces the redundancy between output units, so that the output units contain non-redundant information about the sample.

More formally, BARLOW TWINS's objective function can be understood through the lens of information theory, and specifically as an instantiation of the Information Bottleneck (IB) objective (Tishby & Zaslavsky, 2015; Tishby et al., 2000). Applied to self-supervised learning, the IB objective consists in finding a representation that conserves as much information about the sample as possible while being the least possible informative about the specific distortions applied to that sample. The mathematical connection between BARLOW TWINS's objective function and the IB principle is explored in Appendix A.

BARLOW TWINS' objective function has similarities with existing objective functions for SSL. For example, the redundancy reduction term plays a role similar to the contrastive term in the INFONCE objective (Oord et al., 2018), as discussed in detail in Section 5. However, important conceptual differences in these objective functions result in practical advantages of our method compared to INFONCE-based methods, namely that (1) our method does not require a large number of negative samples and can thus operate on small batches, and (2) our method benefits from very high-dimensional embeddings. Alternatively, the redundancy reduction term can be viewed as a soft-whitening constraint on the embeddings, connecting our method to a recently proposed method performing a hard-whitening operation on the embeddings (Ermolov et al., 2020), as discussed in Section 5. However, our method performs better than current hard-whitening methods.

The pseudocode for BARLOW TWINS is shown as Algorithm 1.
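Since Algorithm 1 is not reproduced here, the following NumPy sketch shows the structure of the loss (an illustrative re-implementation, not the authors' PyTorch code; λ = 5e-3 is a placeholder value, and the per-unit standardization mirrors the cross-correlation normalization of eqn. 2):

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambd=5e-3, eps=1e-12):
    """Illustrative version of the L_BT objective (eqns. 1 and 2).

    z_a, z_b: (N, D) batches of embeddings from the two distorted views.
    """
    n = z_a.shape[0]
    # standardize each unit along the batch dimension
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / n                                   # (D, D) cross-correlation matrix
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()             # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()   # redundancy reduction term
    return on_diag + lambd * off_diag

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 32))
healthy = barlow_twins_loss(z, z)                          # identical views, diverse units
collapsed = barlow_twins_loss(np.ones((256, 32)), np.ones((256, 32)))
assert healthy < collapsed  # a constant (collapsed) embedding is heavily penalized
```

The check at the end illustrates why collapse is avoided by construction: a constant embedding has zero variance per unit, so its cross-correlation diagonal cannot reach 1 and the invariance term stays large.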

Implementation Details

Image augmentations Each input image is transformed twice to produce the two distorted views shown in Figure 1. The image augmentation pipeline consists of the following transformations: random cropping, resizing to 224 × 224 , horizontal flipping, color jittering, converting to grayscale, Gaussian blurring, and solarization. The first two transformations (cropping and resizing) are always applied, while the last five are applied randomly, with some probability. This probability is different for the two distorted views in the last two transformations (blurring and solarization). We use the same augmentation parameters as BYOL (Grill et al., 2020).

Architecture The encoder consists of a ResNet-50 network (He et al., 2016) (without the final classification layer, hence a 2048-dimensional output) followed by a projector network. The projector network has three linear layers, each with 8192 output units; the first two layers of the projector are followed by a batch normalization layer and rectified linear units.

Optimization We use the LARS optimizer, with learning rates selected by grid search for each batch size (see Ablations); the model evaluated in the Results section is pretrained for 1000 epochs.

Results

We follow standard practice (Goyal et al., 2019) and evaluate our representations by transfer learning to different datasets and tasks in computer vision. Our network is pretrained using self-supervised learning on the training set of the ImageNet ILSVRC-2012 dataset (Deng et al., 2009) (without labels). We evaluate our model on a variety of tasks such as image classification and object detection, and using fixed representations from the network or finetuning it. We provide the hyperparameters for all the transfer learning experiments in the Appendix.

Linear and Semi-Supervised Evaluations on ImageNet

Linear evaluation on ImageNet We train a linear classifier on ImageNet on top of the fixed representations of a ResNet-50 pretrained with our method. The top-1 and top-5 accuracies obtained on the ImageNet validation set are reported in Table 1. Our method obtains a top-1 accuracy of 73.2%, which is comparable to state-of-the-art methods.

Table 1. Top-1 and top-5 accuracies (in %) under linear evaluation on ImageNet . All models use a ResNet-50 encoder. Top-3 best self-supervised methods are underlined.

Semi-supervised training on ImageNet We fine-tune a ResNet-50 pretrained with our method on a subset of ImageNet. We use subsets of size 1% and 10% using the same split as SIMCLR. The semi-supervised results obtained on the ImageNet validation set are reported in Table 2. Our method is either on par (when using 10% of the data) or slightly better (when using 1% of the data) than competing methods.

Linear evaluation on ImageNet

The linear classifier is trained for 100 epochs with a learning rate of 0.3 and a cosine learning rate schedule. We minimize the cross-entropy loss with the SGD optimizer with momentum and a weight decay of 10^-6. We use a batch size of 256. At training time we augment an input image by taking a random crop, resizing it to 224 × 224, and optionally flipping the image horizontally. At test time we resize the image to 256 × 256 and center-crop it to a size of 224 × 224.
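For concreteness, the cosine schedule can be sketched as follows (assuming the common half-cosine decay to zero; the text does not state whether warmup is used, so none is modeled):

```python
import math

def cosine_lr(epoch, total_epochs, base_lr=0.3):
    """Half-cosine decay from base_lr down to 0 over total_epochs (no warmup assumed)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))

assert abs(cosine_lr(0, 100) - 0.3) < 1e-12    # starts at the base learning rate
assert abs(cosine_lr(50, 100) - 0.15) < 1e-12  # halfway through, half the base rate
assert abs(cosine_lr(100, 100)) < 1e-12        # decays to ~0 at the end of training
```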

Semi-supervised training on ImageNet

We train for 20 epochs with a learning rate of 0.002 for the ResNet-50 and 0.5 for the final classification layer. The learning rate is multiplied by a factor of 0.2 after the 12th and 16th epochs. We minimize the cross-entropy loss with the SGD optimizer with momentum and do not use weight decay. We use a batch size of 256. The image augmentations are the same as in the linear evaluation setting.
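A sketch of this two-group step schedule (reading "after the 12th and 16th epochs" as taking effect from epochs 12 and 16 onward, which is an assumption about the exact boundary):

```python
def finetune_lr(epoch, base_lr, milestones=(12, 16), gamma=0.2):
    """Step schedule: multiply the learning rate by gamma at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:  # assumption: the drop applies from the milestone epoch onward
            lr *= gamma
    return lr

# two parameter groups, as in the text: backbone vs. final classification layer
backbone = [finetune_lr(e, 0.002) for e in range(20)]
head = [finetune_lr(e, 0.5) for e in range(20)]

assert backbone[0] == 0.002 and head[0] == 0.5
assert abs(head[12] - 0.1) < 1e-12   # 0.5 * 0.2 after the first milestone
assert abs(head[16] - 0.02) < 1e-12  # 0.5 * 0.2 * 0.2 after the second
```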

Transfer to other datasets and tasks

Image classification with fixed features We follow the setup from (Misra & van der Maaten, 2019) and train a linear classifier on fixed image representations, i.e., the parameters of the ConvNet remain unchanged. We use a diverse set of datasets for this evaluation: Places-205 (Zhou et al., 2014) for scene classification, VOC07 (Everingham et al., 2010) for multi-label image classification, and iNaturalist2018 (Van Horn et al., 2018) for fine-grained image classification. We report our results in Table 3. BARLOW TWINS performs competitively against prior work, and outperforms SimCLR and MoCo-v2 on most datasets.

Table 2. Semi-supervised learning on ImageNet using 1% and 10% training examples. Results for the supervised method are from (Zhai et al., 2019). Best results are in bold.

Table 3. Transfer learning: image classification. We benchmark learned representations on the image classification task by training linear classifiers on fixed features. We report top-1 accuracy on the Places-205 and iNat18 datasets, and classification mAP on VOC07. Top-3 best self-supervised methods are underlined.

Object Detection and Instance Segmentation We evaluate our representations for the localization based tasks of object detection and instance segmentation. We use the VOC07+12 (Everingham et al., 2010) and COCO (Lin et al., 2014) datasets following the setup in (He et al., 2019) which finetunes the ConvNet parameters. Our results in Table 4 indicate that BARLOW TWINS performs comparably or better than state-of-the-art representation learning methods for these localization tasks.

Table 4. Transfer learning: object detection and instance segmentation. We benchmark learned representations on the object detection task on VOC07+12 using Faster R-CNN (Ren et al., 2015) and on the detection and instance segmentation task on COCO using Mask R-CNN (He et al., 2017). All methods use the C4 backbone variant (Wu et al., 2019) and models on COCO are finetuned using the 1× schedule. Best results are in bold.

Ablations

For all ablation studies, BARLOW TWINS was trained for 300 epochs instead of the 1000 epochs used in the previous section. A linear evaluation on ImageNet of this baseline model yielded a 71.4% top-1 accuracy and a 90.2% top-5 accuracy. For all the ablations presented we report the top-1 and top-5 accuracy of training linear classifiers on the 2048-dimensional res5 features using the ImageNet train set.

Loss Function Ablations We alter our loss function (eqn. 1) in several ways to test the necessity of each term of the loss function, and to experiment with practices popular in other loss functions for SSL, such as INFONCE. Table 5 recapitulates the different loss functions tested, along with their results on a linear evaluation benchmark on ImageNet. First, we find that removing the invariance term (on-diagonal term) or the redundancy reduction term (off-diagonal term) of our loss function leads to worse or collapsed solutions, as expected. We then study the effect of different normalization strategies. We first try to normalize the embeddings along the feature dimension so that they lie on the unit sphere, as is common practice for losses measuring a cosine similarity (Chen et al., 2020a; Grill et al., 2020; Wang & Isola, 2020). Specifically, we first normalize the embeddings along the batch dimension (with mean subtraction), then normalize the embeddings along the feature dimension (without mean subtraction), and finally measure the (unnormalized) covariance matrix instead of the (normalized) cross-correlation matrix in eqn. 2. The performance is slightly reduced. Second, we try removing the batch-normalization operations in the two hidden layers of the projector network MLP. The performance is barely affected. Third, in addition to removing the batch normalization in the hidden layers, we replace the cross-correlation matrix in eqn. 2 by the cross-covariance matrix (which means the features are no longer normalized along the batch dimension). The performance is substantially reduced. We finally try a cross-entropy loss with temperature, in which the on-diagonal and off-diagonal terms are controlled by a temperature hyperparameter τ and a coefficient λ:

L = -log Σ_i exp(C_ii / τ) + λ log Σ_i Σ_{j≠i} exp(max(C_ij, 0) / τ)

The performance is reduced.
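This cross-entropy variant can be sketched in NumPy as follows (illustrative only; the τ and λ values are arbitrary placeholders, not the settings used in the ablation):

```python
import numpy as np

def xent_variant_loss(c, tau=0.1, lambd=1.0):
    """Cross-entropy alternative to eqn. 1:
    -log sum_i exp(C_ii / tau) + lambd * log sum_{i != j} exp(max(C_ij, 0) / tau)."""
    d = c.shape[0]
    on = -np.log(np.exp(np.diag(c) / tau).sum())
    off_mask = ~np.eye(d, dtype=bool)          # select off-diagonal entries
    off = np.log(np.exp(np.maximum(c[off_mask], 0.0) / tau).sum())
    return on + lambd * off

ident = xent_variant_loss(np.eye(4))        # the target cross-correlation matrix
redundant = xent_variant_loss(np.ones((4, 4)))  # fully redundant (correlated) units
assert ident < redundant  # identity cross-correlation is preferred, as intended
```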

Table 5. Loss function explorations . We ablate the invariance and redundancy terms in our proposed loss and observe that both terms are necessary for good performance. We also experiment with different normalization schemes and a cross-entropy loss and observe reduced performance.

Robustness to Batch Size INFONCE losses that draw negative examples from the minibatch suffer performance drops when the batch size is reduced (e.g. SIMCLR (Chen et al., 2020a)). We thus sought to test the robustness of BARLOW TWINS to small batch sizes. In order to adapt our model to different batch sizes, we performed a grid search on LARS learning rates for each batch size. We find that, unlike SIMCLR, our model is robust to small batch sizes (Fig. 2), with performance almost unaffected for a batch as small as 256. In comparison, the accuracy of SIMCLR drops by about 4 percentage points at batch size 256. This robustness to small batch sizes, also found in non-contrastive methods such as BYOL, further demonstrates that our method is not only conceptually (see Discussion) but also empirically different from the INFONCE objective.

Effect of Removing Augmentations We find that, like SIMCLR but unlike BYOL, our model is not robust to the removal of some types of data augmentations (Fig. 3). While this can be seen as a disadvantage of our method compared to BYOL, it can also be argued that the representations learned by our method are better controlled by the specific set of distortions used, whereas the invariances learned by BYOL seem generic and intriguingly independent of the specific distortions used.

Projector Network Depth & Width For other SSL methods, such as BYOL and SIMCLR, the projector network drastically reduces the dimensionality of the ResNet output.

Figure 2. Effect of batch size. To compare the effect of the batch size across methods, for each method we report the difference between the top-1 accuracy at a given batch size and the best accuracy obtained among all batch sizes tested. BYOL: best accuracy is 72.5% for a batch size of 4096 (data from (Grill et al., 2020), fig. 3A). SIMCLR: best accuracy is 67.1% for a batch size of 4096 (data from (Chen et al., 2020a), fig. 9, model trained for 300 epochs). BARLOW TWINS: best accuracy is 71.7% for a batch size of 1024.

Figure 3. Effect of progressively removing data augmentations. Data for BYOL and SIMCLR (repro) is from (Grill et al., 2020), fig. 3b.

In stark contrast, we find that BARLOW TWINS performs better when the dimensionality of the projector network output is very large. Other methods rapidly saturate as the dimensionality of the output increases, but our method keeps improving for all output dimensionalities tested (Fig. 4). This result is quite surprising because the output of the ResNet is kept fixed at 2048, which acts as a dimensionality bottleneck in our model and sets the limit on the intrinsic dimensionality of the representation. In addition, similarly to other methods, we find that our model performs better when the projector network has more layers, with performance saturating at 3 layers.

Figure 4. Effect of the dimensionality of the last layer of the projector network on performance. The parameter λ is kept fixed for all dimensionalities tested. Data for SIMCLR is from (Chen et al., 2020a), fig. 8; data for BYOL is from (Grill et al., 2020), Table 14b.

Breaking Symmetry Many SSL methods (e.g. BYOL, SIMSIAM, SWAV) rely on different symmetry-breaking mechanisms to avoid trivial solutions. Our loss function avoids these trivial solutions by construction, even in the case of symmetric networks. It is nonetheless interesting to ask whether breaking symmetry can further improve the performance of our network. Following SIMSIAM and BYOL, we experiment with adding a predictor network composed of 2 fully connected layers of size 8192 to one of the networks (with batch normalization followed by a ReLU nonlinearity in the hidden layer) and/or a stop-gradient mechanism on the other network. We find that these asymmetries slightly decrease the performance of our network (see Table 6).

Table 6. Effect of asymmetric settings

BYOL with a larger projector/predictor/embedding For a fair comparison with BYOL, we also evaluated BYOL with a wider and/or deeper projector head (3-layer MLP), a wider and/or deeper predictor head, and a larger dimensionality of the embedding. BYOL did not improve under these conditions (see Table 7).

Sensitivity to λ. We also explored the sensitivity of BARLOW TWINS to the hyperparameter λ, which trades off the desiderata of invariance and informativeness of the embeddings. We find that BARLOW TWINS is not very sensitive to this hyperparameter (Fig. 5).

Figure 5. Sensitivity of BARLOW TWINS to the hyperparameter λ

Loss Function Ablations

For all ablation studies, BARLOW TWINS was trained for 300 epochs instead of 1000 epochs in the previous section. A linear evaluation on ImageNet of this baseline model yielded a 71 . 4% top-1 accuracy and a 90 . 2% top-5 accuracy. For all the ablations presented we report the top-1 and top-5 accuracy of training linear classifiers on the 2048 dimensional res5 features using the ImageNet train set.

Loss Function Ablations We alter our loss function (eqn. 1) in several ways to test the necessity of each term of the loss function, and to experiment with different practices popular in other loss functions for SSL, such as INFONCE. Table 5 recapitulates the different loss functions tested along with their results on a linear evaluation benchmark of Imagenet. First we find that removing the invariance term (on-diagonal term) or the redundancy reduction term (off-diagonal term) of our loss function leads to worse/collapsed solutions, as expected. We then study the effect of different normalization strategies. We first try to normalize the embeddings along the feature dimension so that they lie on the unit sphere, as it is common practice for losses measuring a cosine similarity (Chen et al., 2020a; Grill et al., 2020; Wang & Isola, 2020). Specifically, we first normalize the embeddings along the batch dimension (with mean subtraction), then normalize the embeddings along the feature dimension (without mean subtraction), and finally we measure the (unnormalized) covariance matrix instead of the (normalized) cross-correlation matrix in eqn. 2. The performance is slightly reduced. Second, we try to remove batch-normalization operations in the two hidden layers of the projector network MLP. The performance is barely affected. Third, in addition to removing the batch-normalization in the hidden layers, we replace the cross-correlation matrix in eqn. 2 by the crosscovariance matrix (which means the features are no longer normalized along the batch dimension). The performance glyph[negationslash]

is substantially reduced. We finally try a cross-entropy loss with temperature, for which the on-diagonal term and off-diagonal term is controlled by a temperature hyperparameter τ and coefficient λ : L = -log ∑ i exp( C ii /τ ) + λ log ∑ i ∑ j = i exp(max( C ij , 0) /τ ) . The performance is reduced.

Table 5. Loss function explorations . We ablate the invariance and redundancy terms in our proposed loss and observe that both terms are necessary for good performance. We also experiment with different normalization schemes and a cross-entropy loss and observe reduced performance.

Robustness to Batch Size The INFONCE loss that draws negative examples from the minibatch suffer performance drops when the batch size is reduced (e.g. SIMCLR (Chen et al., 2020a)). We thus sought to test the robustness of BARLOW TWINS to small batch sizes. In order to adapt our model to different batch sizes, we performed a grid search on LARS learning rates for each batch size. We find that, unlike SIMCLR, our model is robust to small batch sizes (Fig. 2), with a performance almost unaffected for a batch as small as 256. In comparison the accuracy for SimCLR drops about 4 p.p. for batch size 256. This robustness to small batch size, also found in non-contrastive methods such as BYOL, further demonstrates that our method is not only conceptually (see Discussion) but also empirically different than the INFONCE objective.

Effect of Removing Augmentations We find that our model is not robust to removing some types of data augmentations, like SIMCLR but unlike BYOL (Fig. 3). While this can be seen as a disadvantage of our method compared to BYOL, it can also be argued that the representations learned by our method are better controlled by the specific set of distortions used, as opposed to BYOL for which the invariances learned seem generic and intriguingly independent of the specific distortions used.

Projector Network Depth & Width For other SSL methods, such as BYOL and SIMCLR, the projector network drastically reduces the dimensionality of the ResNet output.

Figure 2. Effect of batch size. To compare the effect of the batch size across methods, for each method we report the difference between the top-1 accuracy at a given batch size and the best obtained accuracy among all batch size tested. BYOL: best accuracy is 72.5% for a batch size of 4096 (data from (Grill et al., 2020) fig. 3A). SIMCLR: best accuracy is 67.1% for a batch size of 4096 (data from (Chen et al., 2020a) fig. 9, model trained for 300 epochs). BARLOW TWINS: best accuracy is 71.7% for a batch size of 1024.

Figure 3. Effect of progressively removing data augmentations. Data for BYOL and SIMCLR (repro) is from (Grill et al., 2020) fig 3b.

In stark contrast, we find that BARLOW TWINS performs better when the dimensionality of the projector network output is very large. Other methods rapidly saturate when the dimensionality of the output increases, but our method keeps improving with all output dimensionality tested (Fig. 4). This result is quite surprising because the output of the ResNet is kept fixed to 2048, which acts as a dimensionality bottleneck in our model and sets the limit of the intrinsic dimensionality of the representation. In addition, similarly to other methods, we find that our model performs better when the projector network has more layers, with a saturation of the performance for 3 layers.

Figure 4. Effect of the dimensionality of the last layer of the projector network on performance. The parameter λ is kept fix for all dimensionalities tested. Data for SIMCLR is from (Chen et al., 2020a) fig 8; Data for BYOL is from (Grill et al., 2020) Table 14b.

Breaking Symmetry Many SSL methods (e.g. BYOL, SIMSIAM, SWAV) rely on different symmetry-breaking mechanisms to avoid trivial solutions. Our loss function avoids these trivial solutions by construction, even in the case of symmetric networks. It is however interesting to ask whether breaking symmetry can further improve the performance of our network. Following SIMSIAM and BYOL, we experiment with adding a predictor network composed of 2 fully connected layers of size 8192 to one of the network (with batch normalization followed by a ReLU nonlinearity in the hidden layer) and/or a stop-gradient mechanism on the other network. We find that these asymmetries slightly decrease the performance of our network (see Table 6).

Table 6. Effect of asymmetric settings

BYOL with a larger projector/predictor/embedding For a fair comparison with BYOL, we also evaluated BYOL with a wider and/or deeper projector head (3-layer MLP), a wider and/or deeper predictor head, and a larger dimensionality of the embedding. BYOL did not improve under these conditions (see Table 7).

Sensitivity to λ . We also explored the sensitivity of BARLOW TWINS to the hyperparameter λ , which trades off the desiderata of invariance and informativeness of the embeddings. We find that BARLOW TWINS is not very sensitive to this hyperparameter (Fig. 5).

Figure 5. Sensitivity of BARLOW TWINS to the hyperparameter λ


Linear evaluation on ImageNet (top-1 / top-5 accuracy in %):

Method                   Top-1   Top-5
Supervised               76.5    -
MoCo                     60.6    -
PIRL                     63.6    -
SimCLR                   69.3    89.0
MoCo v2                  71.1    90.1
SimSiam                  71.3    -
SwAV (w/o multi-crop)    71.8    -
BYOL                     74.3    91.6
SwAV                     75.3    -
Barlow Twins (ours)      73.2    91.0
Semi-supervised learning on ImageNet with 1% and 10% of the labels (accuracy in %):

Method                 Top-1 (1%)   Top-1 (10%)   Top-5 (1%)   Top-5 (10%)
Supervised             25.4         56.4          48.4         80.4
PIRL                   -            -             57.2         83.8
SimCLR                 48.3         65.6          75.5         87.8
BYOL                   53.2         68.8          78.4         89.0
SwAV                   53.9         70.2          78.5         89.9
Barlow Twins (ours)    55.0         69.7          79.2         89.3
Transfer learning: linear classification accuracy (%) on other datasets:

Method                 Places-205   VOC07   iNat18
Supervised             53.2         87.5    46.7
SimCLR                 52.5         85.5    37.2
MoCo-v2                51.8         86.4    38.6
SwAV (w/o multi-crop)  52.8         86.4    39.5
SwAV                   56.7         88.9    48.6
BYOL                   54.0         86.6    47.6
Barlow Twins (ours)    54.1         86.2    46.5
Transfer learning: object detection on VOC07+12, and object detection and instance segmentation on COCO:

            VOC07+12 det            COCO det                  COCO instance seg
Method      AP_all  AP_50  AP_75    AP_bb  AP_bb50  AP_bb75   AP_mk  AP_mk50  AP_mk75
Sup.        53.5    81.3   58.8     38.2   58.2     41.2      33.3   54.7     35.2
MoCo-v2     57.4    82.5   64.0     39.3   58.9     42.5      34.4   55.8     36.5
SwAV        56.1    82.6   62.7     38.4   58.6     41.3      33.8   55.2     35.9
SimSiam     57.0    82.4   63.7     39.2   59.3     42.1      34.4   56.0     36.7
BT (ours)   56.8    82.6   63.4     39.2   59.0     42.5      34.3   56.0     36.5
Ablations of the loss function (accuracy in %):

Loss function                          Top-1   Top-5
Baseline                               71.4    90.2
Only invariance term (on-diag term)    57.3    80.5
Only red. red. term (off-diag term)    0.1     0.5
Normalization along feature dim.       69.8    88.8
No BN in MLP                           71.2    89.7
No BN in MLP + no normalization        53.4    76.7
Cross-entropy with temp.               63.3    85.7
Case       Stop-gradient   Predictor   Top-1   Top-5
Baseline   -               -           71.4    90.2
(a)        ✓               -           70.5    89.0
(b)        -               ✓           70.2    89.0
(c)        ✓               ✓           61.3    83.5
Projector         Predictor        Top-1 acc.   Description
4096-256          4096-256         74.1%        baseline
4096-4096-256     4096-256         74.0%        3-layer proj., 2-layer pred., 256-d repr.
4096-4096-256     4096-4096-256    73.2%        3-layer proj., 3-layer pred., 256-d repr.
4096-4096-512     4096-512         73.7%        3-layer proj., 2-layer pred., 512-d repr.
4096-4096-512     4096-4096-512    73.2%        3-layer proj., 3-layer pred., 512-d repr.
8192-8192-8192    8192-8192        72.3%        same proj. as BT, 2-layer pred., 8192-d repr.

Discussion

BARLOW TWINS learns self-supervised representations through a joint embedding of distorted images, with an objective function that maximizes similarity between the embedding vectors while reducing redundancy between their components. Our method does not require large batches of samples, nor does it require any particular asymmetry in the twin network structure. We discuss next the similarities and differences between our method and prior art, both from a conceptual and an empirical standpoint. For ease of comparison, all objective functions are recast with a common set of notations. The discussion ends with future directions.

Comparison with Prior Art

infoNCE The INFONCE loss, where NCE stands for Noise-Contrastive Estimation (Gutmann & Hyvärinen, 2010), is a popular type of contrastive loss function used for self-supervised learning (e.g. (Chen et al., 2020a; He et al., 2019; Hénaff et al., 2019; Oord et al., 2018)). It can be instantiated as:

$$\mathcal{L}_{\text{infoNCE}} \triangleq - \sum_b \frac{\langle z^A_b, z^B_b \rangle}{\tau \, \|z^A_b\| \|z^B_b\|} + \sum_b \log \sum_{b' \neq b} \exp\!\left( \frac{\langle z^A_b, z^B_{b'} \rangle}{\tau \, \|z^A_b\| \|z^B_{b'}\|} \right)$$

where $z^A$ and $z^B$ are the twin network outputs, $b$ indexes the sample in a batch, $i$ indexes the vector component of the output, and $\tau$ is a positive constant called temperature in analogy to statistical physics.
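As a concrete illustration, the contrastive structure described here can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation of any cited paper; the function name and batch layout are assumptions:

```python
import numpy as np

def info_nce_loss(za, zb, tau=0.5):
    """Sketch of an infoNCE-style loss: za, zb are (batch, dim) twin outputs,
    tau is the temperature. Positives are matching rows; all other rows in
    the batch act as negatives."""
    # Normalize along the feature dimension (cosine similarity).
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    sim = za @ zb.T / tau                       # pairwise similarities
    n = sim.shape[0]
    pos = np.diag(sim)                          # matching pairs b == b'
    mask = ~np.eye(n, dtype=bool)               # negatives: b' != b
    neg = np.log(np.exp(sim)[mask].reshape(n, n - 1).sum(axis=1))
    return float(np.sum(neg - pos))
```

Note how the loss couples all samples in the batch through the log-sum-exp over negatives, which is the source of the large-batch requirement discussed below.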

For ready comparison, we rewrite BARLOW TWINS' loss function with the same notations:

$$\mathcal{L}_{\mathcal{BT}} \triangleq \sum_i \left( 1 - \mathcal{C}_{ii} \right)^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2, \qquad \text{with} \quad \mathcal{C}_{ij} \triangleq \frac{\sum_b z^A_{b,i} \, z^B_{b,j}}{\sqrt{\sum_b (z^A_{b,i})^2} \sqrt{\sum_b (z^B_{b,j})^2}}$$

Both BARLOW TWINS' and INFONCE's objective functions have two terms, the first aiming at making the embeddings invariant to the distortions fed to the twin networks, the second aiming at maximizing the variability of the embedding learned. Another common point between the two losses is that they both rely on batch statistics to measure this variability. However, the INFONCE objective maximizes the variability of the embeddings by maximizing the pairwise distance between all pairs of samples, whereas our method does so by decorrelating the components of the embeddings vectors.

The contrastive term in INFONCE can be interpreted as a non-parametric estimation of the entropy of the distribution of embeddings (Wang & Isola, 2020). An issue that arises with non-parametric entropy estimators is that they are prone to the curse of dimensionality: they can only be estimated reliably in a low-dimensional setting, and they typically require a large number of samples.

In contrast, our loss can be interpreted as a proxy entropy estimator of the distribution of embeddings under a Gaussian parametrization (see Appendix A). Thanks to this simplified parametrization, the variability of the embedding can be estimated from much fewer samples, and on very high-dimensional embeddings. Indeed, in our ablation studies, we find that (1) our method is robust to small batches, unlike the popular INFONCE-based method SIMCLR, and (2) our method benefits from using very high-dimensional embeddings, unlike INFONCE-based methods, which do not see a benefit from increasing the dimensionality of the output.

Our loss presents several other interesting differences with infoNCE:

· In INFONCE, the embeddings are typically normalized along the feature dimension to compute a cosine similarity between embedded samples. We normalize the embeddings along the batch dimension instead.

· In our method, a parameter λ trades off how much emphasis is put on the invariance term vs. the redundancy reduction term. This parameter can be interpreted as the trade-off parameter in the Information Bottleneck framework (see Appendix A). This parameter is not present in INFONCE.

· INFONCE also has a hyperparameter, the temperature, which can be interpreted as the width of the kernel in a non-parametric kernel density estimation of entropy, and which practically weighs the relative importance of the hardest negative samples present in the batch (Chen et al., 2020a).

A number of alternative methods have been proposed to alleviate the INFONCE loss's reliance on large batches. For example, MoCo (Chen et al., 2020b; He et al., 2019) builds a dynamic dictionary of negative samples with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo typically needs to store more than 60,000 sample embeddings. In contrast, our method does not require such a large dictionary, since it works well with a relatively small batch size (e.g. 256).
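The two mechanisms described here (a FIFO dictionary of negatives plus a momentum-averaged encoder) can be sketched in a toy form. The class and function names, the queue length, and the momentum value are illustrative assumptions, not MoCo's actual code:

```python
from collections import deque
import numpy as np

class NegativeQueue:
    """Toy FIFO dictionary of negative embeddings: the oldest batch is
    evicted as new embeddings are enqueued."""
    def __init__(self, maxlen=65536):
        self.queue = deque(maxlen=maxlen)

    def enqueue(self, batch):
        for z in batch:
            self.queue.append(np.asarray(z))

    def negatives(self):
        return np.stack(list(self.queue))

def momentum_update(target, online, m=0.999):
    """Exponential moving average of encoder parameters (momentum encoder)."""
    return m * target + (1 - m) * online
```

The queue decouples the number of negatives from the batch size, which is precisely the dependency that Barlow Twins removes by not using negatives at all.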

Asymmetric Twins BOOTSTRAP-YOUR-OWN-LATENT (aka BYOL) (Grill et al., 2020) and SIMSIAM (Chen & He, 2020) are two recent methods which use a simple cosine similarity between twin embeddings as an objective function, without any contrastive term:

$$\mathcal{L}_{\text{cosine}} \triangleq - \sum_b \frac{\langle z^A_b, z^B_b \rangle}{\|z^A_b\| \, \|z^B_b\|}$$

Surprisingly, these methods successfully avoid trivial solutions by introducing some asymmetry in the architecture and learning procedure of the twin networks. For example, BYOL uses a predictor network which breaks the symmetry between the two networks, and also enforces an exponential moving average on the target network weights to slow down the progression of the weights of the target network. Combined, these two mechanisms surprisingly avoid trivial solutions. The reasons behind this success are the subject of recent theoretical and empirical studies (Chen & He, 2020; Fetterman & Albrecht, 2020; Richemond et al., 2020; Tian et al., 2020). In particular, the ablation study of (Chen & He, 2020) shows that the moving average is not necessary, but that a stop-gradient on one of the branches and the presence of the predictor network are two crucial elements to avoid collapse. Other works show that batch normalization (Fetterman & Albrecht, 2020; Tian et al., 2020) or alternatively group normalization (Richemond et al., 2020) could play an important role in avoiding collapse.

Like our method, these asymmetric methods do not require large batches, since in their case there is no interaction between batch samples in the objective function.

It should be noted however that these asymmetric methods cannot be described as the optimization of an overall learning objective. Instead, there exist trivial solutions to the learning objective that these methods avoid via particular implementation choices and/or the result of non-trivial learning dynamics. In contrast, our method avoids trivial solutions by construction, making our method conceptually simpler and more principled than these alternatives (until their principle is discovered, see (Tian et al., 2021) for an early attempt).

Whitening In a concurrent work, (Ermolov et al., 2020) propose W-MSE. Acting on the embeddings from identical twin networks, this method performs a differentiable whitening operation (via Cholesky decomposition) of each batch of embeddings before computing a simple cosine similarity between the whitened embeddings of the twin networks. In contrast, the redundancy reduction term in our loss encourages the whitening of the batch embeddings as a soft constraint. The current W-MSE model achieves 66.3% top-1 accuracy on the ImageNet linear evaluation benchmark. It is an interesting direction for future studies to determine whether improved versions of this hard-whitening strategy could also lead to state-of-the-art results on these large-scale computer vision benchmarks.
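The hard-whitening operation mentioned here can be illustrated in NumPy: decorrelate a batch of embeddings by applying the inverse of a Cholesky factor of the batch covariance. This is a sketch of the general technique, not W-MSE's actual code; the `eps` regularizer is an assumption for numerical stability:

```python
import numpy as np

def whiten(z, eps=1e-8):
    """Whiten a (batch, dim) matrix of embeddings so that the whitened
    batch covariance is (approximately) the identity."""
    z = z - z.mean(axis=0)                        # center along the batch
    cov = z.T @ z / (len(z) - 1)                  # (dim, dim) covariance
    L = np.linalg.cholesky(cov + eps * np.eye(cov.shape[0]))  # cov = L L^T
    return z @ np.linalg.inv(L).T                 # cov of output = identity
```

Where W-MSE enforces this identity covariance exactly at every step, Barlow Twins only pushes the cross-correlation matrix toward the identity through the loss, as a soft constraint.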

Clustering These methods, such as DEEPCLUSTER (Caron et al., 2018), SWAV (Caron et al., 2020), and SELA (Asano et al., 2020), perform contrastive-like comparisons without the requirement to compute all pairwise distances. Specifically, these methods simultaneously cluster the data while enforcing consistency between cluster assignments produced for different distortions of the same image, instead of comparing features directly as in contrastive learning. Clustering methods are also prone to collapse (e.g., empty clusters in k-means), and avoiding it relies on careful implementation details. Online clustering methods like SWAV can be trained with large and small batches, but require storing features when the number of clusters is much larger than the batch size. Clustering methods can also be combined with contrastive learning (Li et al., 2021) to prevent collapse.

Noise As Targets This method (Bojanowski & Joulin, 2017) learns to map samples to fixed random targets on the unit sphere, which can be interpreted as a form of whitening. This objective uses a single network, and hence does not leverage the distortions induced by twin networks. Predefining random targets might limit the flexibility of the representation that can be learned.

IMAX In the early days of SSL, (Becker & Hinton, 1992; Zemel & Hinton, 1990) proposed a loss function between twin networks given by:

$$\mathcal{L}_{\mathcal{IM}} \triangleq \log |C(Z^A - Z^B)| - \log |C(Z^A + Z^B)|$$

where $|\cdot|$ denotes the determinant of a matrix, $C(Z^A - Z^B)$ is the covariance matrix of the difference of the outputs of the twin networks, and $C(Z^A + Z^B)$ the covariance matrix of the sum of these outputs. It can be shown that this objective maximizes the information between the twin network representations under the assumptions that the two representations are noisy versions of the same underlying Gaussian signal, and that the noise is independent, additive and Gaussian. This objective is similar to ours in the sense that one term encourages the two representations to be similar while another encourages the units to be decorrelated. However, unlike IMAX, our objective is not directly an information quantity, and we have an extra parameter $\lambda$ that trades off the two terms of our loss. The IMAX objective was only explored in early, small-scale work, so it is not clear whether it can scale to large computer vision tasks. Our attempts to make it work on ImageNet were not successful.
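To make the behavior of this objective concrete, here is a numerical sketch of the IMAX loss (a toy NumPy version under the notation above, not the original implementation):

```python
import numpy as np

def logdet_cov(z):
    """Log-determinant of the sample covariance of a (batch, dim) matrix."""
    z = z - z.mean(axis=0)
    cov = z.T @ z / (len(z) - 1)
    return np.linalg.slogdet(cov)[1]

def imax_loss(za, zb):
    """log |C(Z^A - Z^B)| - log |C(Z^A + Z^B)|: low when the twin outputs
    agree (small difference covariance) while the sum stays high-variance."""
    return logdet_cov(za - zb) - logdet_cov(za + zb)
```

When the two outputs are noisy views of the same signal, the difference covariance shrinks and the loss becomes strongly negative; for independent outputs, the two terms roughly cancel.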


Asano, Y. M., Rupprecht, C., and Vedaldi, A. Self-labelling via simultaneous clustering and representation learning. International Conference on Learning Representations (ICLR), 2020.

Barlow, H. Redundancy reduction revisited. Network: Computation in Neural Systems, 12(3):241-253, August 2001.

Barlow, H. B. Possible Principles Underlying the Transformations of Sensory Messages. In Sensory Communication, The MIT Press, 1961.

Becker, S. and Hinton, G. E. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161-163, January 1992.

Cai, T. T., Liang, T., and Zhou, H. H. Law of Log Determinant of Sample Covariance Matrix and Optimal Estimation of Differential Entropy for High-Dimensional Gaussian Distributions. Journal of Multivariate Analysis, 137, 2015.

Caron, M., Bojanowski, P., Joulin, A., and Douze, M. Deep clustering for unsupervised learning of visual features. In European Conference on Computer Vision (ECCV), 2018.


Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called BARLOW TWINS, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. BARLOW TWINS does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly, it benefits from very high-dimensional output vectors. BARLOW TWINS outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with the current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection. Code and pre-trained models (in PyTorch) are available at https://github.com/facebookresearch/barlowtwins.

Figure 1. BARLOW TWINS's objective function measures the cross-correlation matrix between the embeddings of two identical networks fed with distorted versions of a batch of samples, and tries to make this matrix close to the identity. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. BARLOW TWINS is competitive with state-of-the-art methods for self-supervised learning while being conceptually simpler, naturally avoiding trivial constant (i.e. collapsed) embeddings, and being robust to the training batch size.


Future Directions

We observe a steady improvement of the performance of our method as we increase the dimensionality of the embeddings (i.e. of the last layer of the projector network). This intriguing result is in stark contrast with other popular methods for SSL, such as SIMCLR (Chen et al., 2020a) and BYOL (Grill et al., 2020), for which increasing the dimensionality of the embeddings rapidly saturates performance. It is a promising avenue to continue this exploration for even higher-dimensional embeddings (> 16,000), but this would require the development of new methods or alternative hardware to accommodate the memory requirements of operating on such large embeddings.

Our method is just one possible instantiation of the Information Bottleneck principle applied to SSL. We believe that further refinements of the proposed loss function and algorithm could lead to more efficient solutions and even better performances. For example, the redundancy reduction term is currently computed from the off-diagonal terms of the cross-correlation matrix between the twin network embeddings, but alternatively it could be computed from the off-diagonal terms of the auto-correlation matrix of a single network's embedding. Our preliminary analyses seem to indicate that this alternative leads to similar performances (not shown). A modified loss could also be applied to the (unnormalized) cross-covariance matrix instead of the (normalized) cross-correlation matrix (see Ablations for preliminary analyses).


Acknowledgements

We thank Pascal Vincent, Yubei Chen and Samuel Ocko for helpful insights on the mathematical connection to the infoNCE loss, Robert Geirhos and Adrien Bardes for extra analyses not included in the manuscript and Xinlei Chen, Mathilde Caron, Armand Joulin, Reuben Feinman and Ulisse Ferrari for useful comments on the manuscript.

Implementation Details

Image augmentations Each input image is transformed twice to produce the two distorted views shown in Figure 1. The image augmentation pipeline consists of the following transformations: random cropping, resizing to 224 × 224 , horizontal flipping, color jittering, converting to grayscale, Gaussian blurring, and solarization. The first two transformations (cropping and resizing) are always applied, while the last five are applied randomly, with some probability. This probability is different for the two distorted views in the last two transformations (blurring and solarization). We use the same augmentation parameters as BYOL (Grill et al., 2020).
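The asymmetric structure of this pipeline (the last two transformations having different probabilities for the two views) can be sketched as follows. The probability values mirror BYOL's published settings as described in the text, but the transform names are placeholders for real image operations, and the function is only an illustration of the sampling logic:

```python
import random

# Per-view application probabilities; blur and solarization differ across views.
VIEW_PARAMS = (
    dict(flip=0.5, jitter=0.8, grayscale=0.2, blur=1.0, solarize=0.0),  # view A
    dict(flip=0.5, jitter=0.8, grayscale=0.2, blur=0.1, solarize=0.2),  # view B
)

def sample_ops(params, rng=random):
    """Return the list of transformations to apply for one distorted view."""
    ops = ["crop", "resize"]                    # always applied
    ops += [name for name, p in params.items() if rng.random() < p]
    return ops
```

Feeding the same image through `sample_ops(VIEW_PARAMS[0])` and `sample_ops(VIEW_PARAMS[1])` yields the two distorted views used by the twin networks.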

Architecture The encoder consists of a ResNet-50 network (He et al., 2016) (without the final classification layer, hence a 2048-dimensional output) followed by a projector network. The projector network has three linear layers, each with 8192 output units; the first two layers are each followed by batch normalization and a ReLU nonlinearity.

Connection between BARLOW TWINS and the Information Bottleneck

Figure 6. The information bottleneck principle applied to self-supervised learning (SSL) posits that the objective of SSL is to learn a representation Z θ which is informative about the image sample, but invariant (i.e. uninformative) to the specific distortions that are applied to this sample. BARLOW TWINS can be viewed as a specific instantiation of the information bottleneck objective.

We explore in this appendix the connection between BARLOW TWINS' loss function and the Information Bottleneck (IB) principle (Tishby & Zaslavsky, 2015; Tishby et al., 2000).

As a reminder, BARLOW TWINS' loss function is given by:

$$\mathcal{L}_{\mathcal{BT}} \triangleq \sum_i \left( 1 - \mathcal{C}_{ii} \right)^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2$$

where $\lambda$ is a positive constant trading off the importance of the first and second terms of the loss, and where $\mathcal{C}$ is the cross-correlation matrix computed between the outputs of the two identical networks along the batch dimension:

$$\mathcal{C}_{ij} \triangleq \frac{\sum_b z^A_{b,i} \, z^B_{b,j}}{\sqrt{\sum_b (z^A_{b,i})^2} \sqrt{\sum_b (z^B_{b,j})^2}}$$

where $b$ indexes batch samples and $i, j$ index the vector dimension of the networks' outputs. $\mathcal{C}$ is a square matrix whose size is the dimensionality of the network's output, with values between -1 (i.e. perfect anti-correlation) and 1 (i.e. perfect correlation).

Applied to self-supervised learning, the IB principle posits that a desirable representation should be as informative as possible about the sample represented while being as invariant (i.e. non-informative) as possible to distortions of that sample (here the data augmentations used) (Fig. 6). This trade-off is captured by the following loss function:

$$\mathcal{IB}_\theta \triangleq I(Z_\theta, Y) - \beta \, I(Z_\theta, X)$$

where $I(\cdot, \cdot)$ denotes mutual information and $\beta$ is a positive scalar trading off the desiderata of preserving information and being invariant to distortions.

Using a classical identity for mutual information, we can rewrite equation 5 as:

$$\mathcal{IB}_\theta = \left[ H(Z_\theta) - H(Z_\theta | Y) \right] - \beta \left[ H(Z_\theta) - H(Z_\theta | X) \right]$$

where $H(\cdot)$ denotes entropy. The conditional entropy $H(Z_\theta | Y)$ (the entropy of the representation conditioned on a specific distorted sample) cancels to 0 because the function $f_\theta$ is deterministic: the representation $Z_\theta$ conditioned on the input sample $Y$ is perfectly known and has zero entropy. Since the overall scaling factor of the loss function is not important, we can rearrange equation 6 as:

$$\mathcal{IB}_\theta = H(Z_\theta | X) + \frac{1-\beta}{\beta} H(Z_\theta)$$

Measuring the entropy of a high-dimensional signal generally requires vast amounts of data, much larger than the size of a single batch. In order to circumvent this difficulty, we make the simplifying assumption that the representation $Z$ is distributed as a Gaussian. The entropy of a Gaussian distribution is simply given by the logarithm of the determinant of its covariance matrix (up to a constant corresponding to the assumed discretization level that we ignore here) (Cai et al., 2015). The loss function becomes:

$$\mathcal{IB}_\theta = \mathbb{E}_X \log |C_{Z_\theta|X}| + \frac{1-\beta}{\beta} \log |C_{Z_\theta}|$$

This equation is still not exactly the one we optimize for in practice (see eqn. 3 and 4). Indeed, our loss function is only connected to the IB loss given by eqn. 8 through the following simplifications and approximations:
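The Gaussian entropy proxy invoked in this derivation is easy to illustrate numerically: up to an additive constant, the entropy of a Gaussian-distributed embedding is half the log-determinant of its covariance, so decorrelated, high-variance embeddings score higher than collapsed ones. A NumPy sketch (illustrative only; the function name is an assumption):

```python
import numpy as np

def gaussian_entropy_proxy(z):
    """Entropy of a (batch, dim) embedding under a Gaussian parametrization,
    up to an additive constant: 0.5 * log |C_Z|."""
    z = z - z.mean(axis=0)
    cov = z.T @ z / (len(z) - 1)            # sample covariance matrix
    return 0.5 * np.linalg.slogdet(cov)[1]  # log-determinant term only
```

A collapsed (nearly constant) embedding has a near-singular covariance and hence a very negative proxy entropy, which is what the redundancy reduction term penalizes.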

Hyperparameters

Evaluations on ImageNet

Linear evaluation on ImageNet

The linear classifier is trained for 100 epochs with a learning rate of 0.3 and a cosine learning rate schedule. We minimize the cross-entropy loss with the SGD optimizer with momentum and a weight decay of 10^-6. We use a batch size of 256. At training time we augment an input image by taking a random crop, resizing it to 224 × 224, and optionally flipping the image horizontally. At test time we resize the image to 256 × 256 and center-crop it to a size of 224 × 224.
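The cosine learning rate schedule mentioned here can be sketched in one line. The base rate and epoch horizon come from the text; the absence of warmup and of a floor value are assumptions of this sketch:

```python
import math

def cosine_lr(epoch, total_epochs=100, base_lr=0.3):
    """Cosine-decayed learning rate: base_lr at epoch 0, 0 at the end."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
```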

Semi-supervised training on ImageNet

We train for 20 epochs with a learning rate of 0.002 for the ResNet-50 and 0.5 for the final classification layer. The learning rate is multiplied by a factor of 0.2 after the 12th and 16th epochs. We minimize the cross-entropy loss with the SGD optimizer with momentum and do not use weight decay. We use a batch size of 256. The image augmentations are the same as in the linear evaluation setting.

Transfer Learning

Linear evaluation

We follow the exact settings from PIRL (Misra & van der Maaten, 2019) for evaluating linear classifiers on the Places-205, VOC07 and iNaturalist2018 datasets. For Places-205 and iNaturalist2018 we train a linear classifier with SGD (14 epochs on Places-205, 84 epochs on iNaturalist2018) with a learning rate of 0.01 reduced by a factor of 10 at two equally spaced intervals, a weight decay of 5 × 10^-4 and an SGD momentum of 0.9. We train SVM classifiers on the VOC07 dataset, where the C values are computed using cross-validation.

Object Detection and Instance Segmentation

We use the detectron2 library (Wu et al., 2019) for training the detection models and closely follow the evaluation settings from (He et al., 2019). The backbone ResNet50 network for Faster R-CNN (Ren et al., 2015) and Mask R-CNN (He et al., 2017) is initialized using our BARLOW TWINS pretrained model.

VOC07+12 We use the VOC07+12 trainval set of 16K images for training a Faster R-CNN (Ren et al., 2015) C-4 backbone for 24K iterations using a batch size of 16 across 8 GPUs with SyncBatchNorm. The initial learning rate for the model is 0.1, which is reduced by a factor of 10 after 18K and 22K iterations. We use linear warmup (Goyal et al., 2017) with a slope of 0.333 for 1000 iterations.

COCO We train a Mask R-CNN (He et al., 2017) C-4 backbone on the COCO 2017 train split and report results on the val split. We use a learning rate of 0.03 and keep the other parameters the same as in the 1× schedule in detectron2.



Self-supervised learning aims to learn useful representations of the input data without relying on human annotations. Recent advances in self-supervised learning for visual data (Caron et al., 2020; Grill et al., 2020; Chen et al., 2020a; He et al., 2019; Misra & van der Maaten, 2019) show that it is possible to learn self-supervised representations that are competitive with supervised representations. A common underlying theme that unites these methods is that they all aim to learn representations that are invariant under different distortions (also referred to as ‘data augmentations’). This is typically achieved by maximizing similarity of representations obtained from different distorted versions of a sample using a variant of Siamese networks (Hadsell et al., 2006). As there are trivial solutions to this problem, like a constant representation, these methods rely on different mechanisms to learn useful representations.

Contrastive methods like SimCLR (Chen et al., 2020a) define ‘positive’ and ‘negative’ sample pairs which are treated differently in the loss function. They can additionally use asymmetric learning updates wherein momentum encoders (He et al., 2019) are updated separately from the main network. Clustering methods use one distorted sample to compute ‘targets’ for the loss, and another distorted version of the sample to predict these targets, followed by an alternate optimization scheme like k-means in DeepCluster (Caron et al., 2018) or non-differentiable operators in SwAV (Caron et al., 2020) and SeLa (Asano et al., 2020). In another recent line of work, BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2020), both the network architecture and parameter updates are modified to introduce asymmetry. The network architecture is made asymmetric using a special ‘predictor’ network, and the parameter updates are asymmetric such that the model parameters are only updated using one distorted version of the input, while the representations from another distorted version are used as a fixed target. (Chen & He, 2020) conclude that the asymmetry of the learning update, ‘stop-gradient’, is critical to preventing trivial solutions.

In this paper, we propose a new method, Barlow Twins, which applies redundancy-reduction — a principle first proposed in neuroscience — to self-supervised learning. In his influential article Possible Principles Underlying the Transformation of Sensory Messages (Barlow, 1961), neuroscientist H. Barlow hypothesized that the goal of sensory processing is to recode highly redundant sensory inputs into a factorial code (a code with statistically independent components). This principle has been fruitful in explaining the organization of the visual system, from the retina to cortical areas (see (Barlow, 2001) for a review and (Ocko et al., 2018; Lindsey et al., 2020; Schwartz & Simoncelli, 2001) for recent efforts), and has led to a number of algorithms for supervised and unsupervised learning (Redlich, 1993a, b; Deco & Parra, 1997; Földiák, 1990; Linsker, 1988; Schmidhuber et al., 1996; Ballé et al., 2017). Based on this principle, we propose an objective function which tries to make the cross-correlation matrix computed from twin embeddings as close to the identity matrix as possible. Barlow Twins is conceptually simple, easy to implement and learns useful representations as opposed to trivial solutions. Compared to other methods, it does not require large batches (Chen et al., 2020a), nor does it require any asymmetric mechanisms like prediction networks (Grill et al., 2020), momentum encoders (He et al., 2019), non-differentiable operators (Caron et al., 2020) or stop-gradients (Chen & He, 2020). Intriguingly, Barlow Twins strongly benefits from the use of very high-dimensional embeddings. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime (55% top-1 accuracy for 1% labels), and is on par with current state of the art for ImageNet classification with a linear classifier head, as well as for a number of transfer tasks of classification and object detection.

Like other methods for SSL (Caron et al., 2020; Grill et al., 2020; Chen et al., 2020a; He et al., 2019; Misra & van der Maaten, 2019), Barlow Twins operates on a joint embedding of distorted images (Fig. 1). More specifically, it produces two distorted views for all images of a batch $X$ sampled from a dataset. The distorted views are obtained via a distribution of data augmentations $\mathcal{T}$. The two batches of distorted views $Y^A$ and $Y^B$ are then fed to a function $f_\theta$, typically a deep network with trainable parameters $\theta$, producing batches of embeddings $Z^A$ and $Z^B$ respectively. To simplify notations, $Z^A$ and $Z^B$ are assumed to be mean-centered along the batch dimension, such that each unit has mean output 0 over the batch.

Barlow Twins distinguishes itself from other methods by its loss function ℒ_BT:

ℒ_BT = Σ_i (1 − C_ii)² + λ Σ_i Σ_{j≠i} (C_ij)²     (1)

where λ is a positive constant trading off the importance of the first and second terms of the loss, and where C is the cross-correlation matrix computed between the outputs of the two identical networks along the batch dimension:

C_ij = Σ_b z^A_{b,i} z^B_{b,j} / ( √(Σ_b (z^A_{b,i})²) √(Σ_b (z^B_{b,j})²) )     (2)

where b indexes batch samples and i, j index the vector dimension of the networks' outputs. C is a square matrix with size equal to the dimensionality of the network's output, with values comprised between -1 (i.e. perfect anti-correlation) and 1 (i.e. perfect correlation).

Intuitively, the invariance term of the objective, by trying to equate the diagonal elements of the cross-correlation matrix to 1, makes the embedding invariant to the distortions applied. The redundancy reduction term, by trying to equate the off-diagonal elements of the cross-correlation matrix to 0, decorrelates the different vector components of the embedding. This decorrelation reduces the redundancy between output units, so that the output units contain non-redundant information about the sample.

More formally, Barlow Twins’s objective function can be understood through the lens of information theory, and specifically as an instantiation of the Information Bottleneck (IB) objective (Tishby & Zaslavsky, 2015; Tishby et al., 2000). Applied to self-supervised learning, the IB objective consists in finding a representation that conserves as much information about the sample as possible while being as little informative as possible about the specific distortions applied to that sample. The mathematical connection between Barlow Twins’s objective function and the IB principle is explored in Appendix A.

Barlow Twins’ objective function has similarities with existing objective functions for SSL. For example, the redundancy reduction term plays a role similar to the contrastive term in the infoNCE objective (Oord et al., 2018), as discussed in detail in Section 5. However, important conceptual differences between these objective functions result in practical advantages of our method over infoNCE-based methods, namely that (1) our method does not require a large number of negative samples and can thus operate on small batches, and (2) our method benefits from very high-dimensional embeddings. Alternatively, the redundancy reduction term can be viewed as a soft-whitening constraint on the embeddings, connecting our method to a recently proposed method performing a hard-whitening operation on the embeddings (Ermolov et al., 2020), as discussed in Section 5. However, our method performs better than current hard-whitening methods.

The pseudocode for Barlow Twins is shown as Algorithm 1.
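Algorithm 1 (PyTorch-style pseudocode in the paper) is not reproduced here. As a rough illustration, the following is a minimal NumPy sketch of the loss of eqn. 1 computed from two batches of embeddings; the function and variable names are ours, and the released PyTorch code should be treated as the reference implementation.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    """Sketch of the Barlow Twins loss (eqn. 1).

    z_a, z_b: arrays of shape (N, D), twin embeddings of the same batch.
    lambd: trade-off between invariance and redundancy reduction terms.
    """
    n, d = z_a.shape
    # normalize each feature along the batch dimension (zero mean, unit std)
    z_a = (z_a - z_a.mean(axis=0)) / z_a.std(axis=0)
    z_b = (z_b - z_b.mean(axis=0)) / z_b.std(axis=0)
    # cross-correlation matrix (eqn. 2), shape D x D
    c = z_a.T @ z_b / n
    # invariance term: push diagonal elements towards 1
    on_diag = np.sum((np.diagonal(c) - 1.0) ** 2)
    # redundancy reduction term: push off-diagonal elements towards 0
    off_diag = np.sum(c ** 2) - np.sum(np.diagonal(c) ** 2)
    return on_diag + lambd * off_diag
```

When the two embeddings are identical, the invariance term vanishes exactly (each diagonal element of C equals 1 after batch normalization), leaving only the redundancy penalty.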

Each input image is transformed twice to produce the two distorted views shown in Figure 1. The image augmentation pipeline consists of the following transformations: random cropping, resizing to 224×224, horizontal flipping, color jittering, converting to grayscale, Gaussian blurring, and solarization. The first two transformations (cropping and resizing) are always applied, while the last five are applied randomly, with some probability. This probability is different for the two distorted views in the last two transformations (blurring and solarization). We use the same augmentation parameters as BYOL (Grill et al., 2020).

The encoder consists of a ResNet-50 network (He et al., 2016) (without the final classification layer, 2048 output units) followed by a projector network. The projector network has three linear layers, each with 8192 output units. The first two layers of the projector are followed by a batch normalization layer and rectified linear units. We call the output of the encoder the ’representations’ and the output of the projector the ’embeddings’. The representations are used for downstream tasks and the embeddings are fed to the loss function of Barlow Twins.
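To make the shapes concrete, here is a hypothetical NumPy trace of the projector described above, with random untrained weights and scaled-down widths (the paper uses 2048-dimensional representations and 8192 output units per projector layer); the helper names are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, d_out):
    # hypothetical random-weight linear layer, used only to trace shapes
    w = rng.normal(size=(x.shape[1], d_out)) / np.sqrt(x.shape[1])
    return x @ w

def batchnorm_relu(x, eps=1e-5):
    # batch normalization along the batch dimension, then ReLU
    x = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
    return np.maximum(x, 0.0)

def projector(reps, width):
    """Three linear layers; BN + ReLU after the first two only."""
    h = batchnorm_relu(linear(reps, width))
    h = batchnorm_relu(linear(h, width))
    return linear(h, width)  # embeddings fed to the loss

reps = rng.normal(size=(8, 2048))    # 'representations' (ResNet-50 output)
emb = projector(reps, width=256)     # 'embeddings' (paper: width=8192)
```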

We follow the optimization protocol described in BYOL (Grill et al., 2020). We use the LARS optimizer (You et al., 2017) and train for 1000 epochs with a batch size of 2048. We emphasize, however, that our model works well with batches as small as 256 (see Ablations). We use a base learning rate of 0.2 for the weights and 0.0048 for the biases and batch normalization parameters, multiplied by the batch size and divided by 256. We use a learning rate warm-up period of 10 epochs, after which we reduce the learning rate by a factor of 1000 using a cosine decay schedule (Loshchilov & Hutter, 2016). We ran a search for the trade-off parameter λ of the loss function and found the best results for λ = 5×10⁻³. We use a weight decay parameter of 1.5×10⁻⁶. The biases and batch normalization parameters are excluded from LARS adaptation and weight decay. Training is distributed across 32 V100 GPUs and takes approximately 124 hours. For comparison, our reimplementation of BYOL trained with a batch size of 4096 takes 113 hours on the same hardware.
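As an illustration of the schedule described above, the following sketch combines the linear batch-size scaling, the 10-epoch warm-up and the factor-1000 cosine decay; the step granularity and the function signature are our assumptions, not the paper's code.

```python
import math

def learning_rate(step, total_steps, batch_size, base_lr=0.2,
                  warmup_steps=None):
    """Sketch of the schedule: linear scaling of the base learning rate by
    batch_size/256, linear warm-up (10 of 1000 epochs), then cosine decay
    down to 1/1000 of the scaled learning rate."""
    scaled = base_lr * batch_size / 256
    if warmup_steps is None:
        warmup_steps = max(1, total_steps // 100)  # 10 out of 1000 epochs
    if step < warmup_steps:
        return scaled * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    end_lr = scaled / 1000
    return end_lr + (scaled - end_lr) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For a batch size of 2048, the scaled peak learning rate is 0.2 × 2048 / 256 = 1.6.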

We follow standard practice (Goyal et al., 2019) and evaluate our representations by transfer learning to different datasets and tasks in computer vision. Our network is pretrained using self-supervised learning on the training set of the ImageNet ILSVRC-2012 dataset (Deng et al., 2009) (without labels). We evaluate our model on a variety of tasks such as image classification and object detection, and using fixed representations from the network or finetuning it. We provide the hyperparameters for all the transfer learning experiments in the Appendix.

We train a linear classifier on ImageNet on top of fixed representations of a ResNet-50 pretrained with our method. The top-1 and top-5 accuracies obtained on the ImageNet validation set are reported in Table 1. Our method obtains a top-1 accuracy of 73.2%, which is comparable to the state-of-the-art methods.

We fine-tune a ResNet-50 pretrained with our method on a subset of ImageNet. We use subsets of size 1% and 10%, using the same splits as SimCLR. The semi-supervised results obtained on the ImageNet validation set are reported in Table 2. Our method is either on par (when using 10% of the data) or slightly better (when using 1% of the data) than competing methods.

Image classification with fixed features We follow the setup from (Misra & van der Maaten, 2019) and train a linear classifier on fixed image representations, i.e., the parameters of the ConvNet remain unchanged. We use a diverse set of datasets for this evaluation - Places-205 (Zhou et al., 2014) for scene classification, VOC07 (Everingham et al., 2010) for multi-label image classification, and iNaturalist2018 (Van Horn et al., 2018) for fine-grained image classification. We report our results in Table 3. Barlow Twins performs competitively against prior work, and outperforms SimCLR and MoCo-v2 on most datasets.

Object Detection and Instance Segmentation We evaluate our representations for the localization based tasks of object detection and instance segmentation. We use the VOC07+12 (Everingham et al., 2010) and COCO (Lin et al., 2014) datasets following the setup in (He et al., 2019) which finetunes the ConvNet parameters. Our results in Table 4 indicate that Barlow Twins performs comparably or better than state-of-the-art representation learning methods for these localization tasks.

For all ablation studies, Barlow Twins was trained for 300 epochs instead of the 1000 epochs in the previous section. A linear evaluation on ImageNet of this baseline model yielded a 71.4% top-1 accuracy and a 90.2% top-5 accuracy. For all the ablations presented, we report the top-1 and top-5 accuracy of training linear classifiers on the 2048-dimensional res5 features using the ImageNet train set.

We alter our loss function (eqn. 1) in several ways to test the necessity of each term of the loss function, and to experiment with practices popular in other SSL loss functions, such as infoNCE. Table 5 recapitulates the different loss functions tested along with their results on a linear evaluation benchmark of ImageNet. First, we find that removing the invariance term (on-diagonal term) or the redundancy reduction term (off-diagonal term) of our loss function leads to worse or collapsed solutions, as expected. We then study the effect of different normalization strategies. We first try to normalize the embeddings along the feature dimension so that they lie on the unit sphere, as is common practice for losses measuring a cosine similarity (Chen et al., 2020a; Grill et al., 2020; Wang & Isola, 2020). Specifically, we first normalize the embeddings along the batch dimension (with mean subtraction), then normalize the embeddings along the feature dimension (without mean subtraction), and finally measure the (unnormalized) covariance matrix instead of the (normalized) cross-correlation matrix in eqn. 2. The performance is slightly reduced. Second, we try removing the batch-normalization operations in the two hidden layers of the projector network MLP. The performance is barely affected. Third, in addition to removing the batch normalization in the hidden layers, we replace the cross-correlation matrix in eqn. 2 by the cross-covariance matrix (which means the features are no longer normalized along the batch dimension). The performance is substantially reduced.
We finally try a cross-entropy loss with temperature, in which the on-diagonal and off-diagonal terms are controlled by a temperature hyperparameter τ and a coefficient λ:

ℒ = −log Σ_i exp(C_ii / τ) + λ log Σ_i Σ_{j≠i} exp(max(C_ij, 0) / τ)

The performance is reduced.
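Assuming the cross-correlation matrix C of eqn. 2 has already been computed, this ablation loss can be sketched as follows (our naming):

```python
import numpy as np

def cross_entropy_variant(c, tau=0.1, lambd=1.0):
    """Cross-entropy ablation loss: the on-diagonal term rewards large
    C_ii, the off-diagonal term penalizes positive C_ij for j != i."""
    diag = np.diagonal(c)
    on = -np.log(np.sum(np.exp(diag / tau)))
    off_all = np.exp(np.maximum(c, 0.0) / tau)  # negative C_ij are clamped to 0
    off = np.log(np.sum(off_all) - np.sum(np.diagonal(off_all)))
    return on + lambd * off
```

An identity cross-correlation matrix (perfect invariance, no redundancy) yields a lower loss than an all-ones matrix (fully redundant embedding), as expected.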

InfoNCE losses that draw negative examples from the minibatch suffer performance drops when the batch size is reduced (e.g. SimCLR (Chen et al., 2020a)). We thus sought to test the robustness of Barlow Twins to small batch sizes. To adapt our model to different batch sizes, we performed a grid search on LARS learning rates for each batch size. We find that, unlike SimCLR, our model is robust to small batch sizes (Fig. 2), with performance almost unaffected for a batch as small as 256. In comparison, the accuracy of SimCLR drops by about 4 p.p. for batch size 256. This robustness to small batch sizes, also found in non-contrastive methods such as BYOL, further demonstrates that our method is not only conceptually (see Discussion) but also empirically different from the infoNCE objective.

We find that our model is not robust to removing some types of data augmentations, like SimCLR but unlike BYOL (Fig. 3). While this can be seen as a disadvantage of our method compared to BYOL, it can also be argued that the representations learned by our method are better controlled by the specific set of distortions used, as opposed to BYOL for which the invariances learned seem generic and intriguingly independent of the specific distortions used.

For other SSL methods, such as BYOL and SimCLR, the projector network drastically reduces the dimensionality of the ResNet output. In stark contrast, we find that Barlow Twins performs better when the dimensionality of the projector network output is very large. Other methods rapidly saturate when the dimensionality of the output increases, but our method keeps improving for all output dimensionalities tested (Fig. 4). This result is quite surprising because the output of the ResNet is kept fixed at 2048, which acts as a dimensionality bottleneck in our model and sets the limit of the intrinsic dimensionality of the representation. In addition, similarly to other methods, we find that our model performs better when the projector network has more layers, with a saturation of the performance at 3 layers.

Many SSL methods (e.g. BYOL, SimSiam, SwAV) rely on different symmetry-breaking mechanisms to avoid trivial solutions. Our loss function avoids these trivial solutions by construction, even in the case of symmetric networks. It is however interesting to ask whether breaking symmetry can further improve the performance of our network. Following SimSiam and BYOL, we experiment with adding a predictor network composed of 2 fully-connected layers of size 8192 to one of the networks (with batch normalization followed by a ReLU nonlinearity in the hidden layer) and/or a stop-gradient mechanism on the other network. We find that these asymmetries slightly decrease the performance of our network (see Table 6).

BYOL with a larger projector/predictor/embedding For a fair comparison with BYOL, we also evaluated BYOL with a wider and/or deeper projector head (3-layer MLP), a wider and/or deeper predictor head, and a larger dimensionality of the embedding. BYOL did not improve under these conditions (see Table 7).

Sensitivity to λ. We also explored the sensitivity of Barlow Twins to the hyperparameter λ, which trades off the desiderata of invariance and informativeness of the embeddings. We find that Barlow Twins is not very sensitive to this hyperparameter (Fig. 5).

Barlow Twins learns self-supervised representations through a joint embedding of distorted images, with an objective function that maximizes similarity between the embedding vectors while reducing redundancy between their components. Our method does not require large batches of samples, nor does it require any particular asymmetry in the twin network structure. We discuss next the similarities and differences between our method and prior art, both from a conceptual and an empirical standpoint. For ease of comparison, all objective functions are recast with a common set of notations. The discussion ends with future directions.

The InfoNCE loss, where NCE stands for Noise-Contrastive Estimation (Gutmann & Hyvärinen, 2010), is a popular type of contrastive loss function used for self-supervised learning (e.g. (Oord et al., 2018; Chen et al., 2020a; He et al., 2019; Hénaff et al., 2019)). It can be instantiated as:

ℒ_infoNCE = − Σ_b (z^A_b · z^B_b) / (τ ‖z^A_b‖₂ ‖z^B_b‖₂) + Σ_b log Σ_{b′≠b} exp( (z^A_b · z^B_{b′}) / (τ ‖z^A_b‖₂ ‖z^B_{b′}‖₂) )     (3)

where z^A and z^B are the twin network outputs, b indexes the sample in a batch, i indexes the vector component of the output, and τ is a positive constant called temperature, in analogy to statistical physics.
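For concreteness, here is a minimal NumPy sketch of an infoNCE-style loss with this structure (cosine similarities, temperature τ, negatives drawn from the batch); details such as whether both views serve as anchors vary between methods and are simplified here.

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """infoNCE-style sketch (eqn. 3): each positive pair (z_a[b], z_b[b])
    is contrasted against all other samples b' != b in the batch."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)  # feature-dim norm
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    sim = z_a @ z_b.T / tau                 # pairwise cosine similarities
    pos = np.diagonal(sim)                  # alignment (invariance) term
    # contrastive term: log-sum-exp over the negatives (b' != b)
    neg = np.log(np.sum(np.exp(sim), axis=1) - np.exp(pos))
    return np.mean(neg - pos)
```

Embeddings whose two views are aligned yield a much lower loss than embeddings paired with views of other samples.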

For ready comparison, we rewrite Barlow Twins’ loss function with the same notations:

ℒ_BT = Σ_i ( 1 − (Σ_b z^A_{b,i} z^B_{b,i}) / (√(Σ_b (z^A_{b,i})²) √(Σ_b (z^B_{b,i})²)) )² + λ Σ_i Σ_{j≠i} ( (Σ_b z^A_{b,i} z^B_{b,j}) / (√(Σ_b (z^A_{b,i})²) √(Σ_b (z^B_{b,j})²)) )²     (4)

Both Barlow Twins’ and InfoNCE’s objective functions have two terms, the first aiming at making the embeddings invariant to the distortions fed to the twin networks, the second aiming at maximizing the variability of the embedding learned. Another common point between the two losses is that they both rely on batch statistics to measure this variability. However, the InfoNCE objective maximizes the variability of the embeddings by maximizing the pairwise distance between all pairs of samples, whereas our method does so by decorrelating the components of the embeddings vectors.

The contrastive term in InfoNCE can be interpreted as a non-parametric estimation of the entropy of the distribution of embeddings (Wang & Isola, 2020). An issue that arises with non-parametric entropy estimators is that they are prone to the curse of dimensionality: they can only be estimated reliably in a low-dimensional setting, and they typically require a large number of samples.

In contrast, our loss can be interpreted as a proxy entropy estimator of the distribution of embeddings under a Gaussian parametrization (see Appendix A). Thanks to this simplified parametrization, the variability of the embedding can be estimated from far fewer samples, and on very high-dimensional embeddings. Indeed, in the ablation studies that we perform, we find that (1) our method is robust to small batches, unlike the popular InfoNCE-based method SimCLR, and (2) our method benefits from using very high-dimensional embeddings, unlike InfoNCE-based methods, which do not see a benefit from increasing the dimensionality of the output.
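A small numerical illustration of this point (our construction, not from the paper): under a Gaussian fit, the entropy of a batch of embeddings reduces to the log-determinant of their covariance, and redundant (correlated) units yield much lower entropy than decorrelated ones, even when estimated from a modest number of samples.

```python
import numpy as np

def gaussian_entropy(z):
    """Differential entropy of a Gaussian fitted to embeddings z of shape
    (N, d): 0.5 * log det(2*pi*e * Sigma)."""
    sigma = np.cov(z, rowvar=False)
    _, logdet = np.linalg.slogdet(2 * np.pi * np.e * sigma)
    return 0.5 * logdet

rng = np.random.default_rng(0)
# decorrelated units: covariance close to the identity
decorrelated = rng.normal(size=(512, 8))
# redundant units: one unit repeated 8 times (plus tiny noise)
redundant = np.tile(decorrelated[:, :1], (1, 8)) + 0.01 * rng.normal(size=(512, 8))
```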

Our loss presents several other interesting differences with infoNCE:

In infoNCE, the embeddings are typically normalized along the feature dimension to compute a cosine similarity between embedded samples. We normalize the embeddings along the batch dimension instead.

In our method, there is a parameter λ that trades off how much emphasis is put on the invariance term vs. the redundancy reduction term. This parameter can be interpreted as the trade-off parameter in the Information Bottleneck framework (see Appendix A). This parameter is not present in infoNCE.

infoNCE also has a hyperparameter, the temperature, which can be interpreted as the width of the kernel in a non-parametric kernel density estimation of entropy, and practically weighs the relative importance of the hardest negative samples present in the batch (Chen et al., 2020a).

A number of alternative methods to ours have been proposed to alleviate the reliance on large batches of the infoNCE loss. For example, MoCo (He et al., 2019; Chen et al., 2020b) builds a dynamic dictionary of negative samples with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo typically needs to store more than 60,000 sample embeddings. In contrast, our method does not require such a large dictionary, since it works well with a relatively small batch size (e.g. 256).

Bootstrap-Your-Own-Latent (aka BYOL) (Grill et al., 2020) and SimSiam (Chen & He, 2020) are two recent methods which use a simple cosine similarity between twin embeddings as an objective function, without any contrastive term:

ℒ_cosine = − Σ_b (z^A_b · z^B_b) / (‖z^A_b‖₂ ‖z^B_b‖₂)

Surprisingly, these methods successfully avoid trivial solutions by introducing some asymmetry in the architecture and learning procedure of the twin networks. For example, BYOL uses a predictor network which breaks the symmetry between the two networks, and also enforces an exponential moving average on the target network weights to slow down their progression. Combined, these two mechanisms surprisingly avoid trivial solutions. The reasons behind this success are the subject of recent theoretical and empirical studies (Tian et al., 2020; Chen & He, 2020; Fetterman & Albrecht, 2020; Richemond et al., 2020). In particular, the ablation study of (Chen & He, 2020) shows that the moving average is not necessary, but that the stop-gradient on one of the branches and the presence of the predictor network are two crucial elements to avoid collapse. Other works show that batch normalization (Tian et al., 2020; Fetterman & Albrecht, 2020), or alternatively group normalization (Richemond et al., 2020), could play an important role in avoiding collapse.

Like our method, these asymmetric methods do not require large batches, since in their case there is no interaction between batch samples in the objective function.

It should be noted, however, that these asymmetric methods cannot be described as the optimization of an overall learning objective. Instead, there exist trivial solutions to the learning objective that these methods avoid via particular implementation choices and/or non-trivial learning dynamics. In contrast, our method avoids trivial solutions by construction, making it conceptually simpler and more principled than these alternatives (until their underlying principle is discovered; see (Tian et al., 2021) for an early attempt).

In a concurrent work, Ermolov et al. (2020) propose W-MSE. Acting on the embeddings from identical twin networks, this method performs a differentiable whitening operation (via Cholesky decomposition) of each batch of embeddings before computing a simple cosine similarity between the whitened embeddings of the twin networks. In contrast, the redundancy reduction term in our loss encourages the whitening of the batch embeddings as a soft constraint. The current W-MSE model achieves 66.3% top-1 accuracy on the ImageNet linear evaluation benchmark. It is an interesting direction for future studies to determine whether improved versions of this hard-whitening strategy could also lead to state-of-the-art results on these large-scale computer vision benchmarks.

These methods, such as DeepCluster (Caron et al., 2018), SwAV (Caron et al., 2020), and SeLa (Asano et al., 2020), perform contrastive-like comparisons without the need to compute all pairwise distances. Specifically, these methods simultaneously cluster the data while enforcing consistency between cluster assignments produced for different distortions of the same image, instead of comparing features directly as in contrastive learning. Clustering methods are also prone to collapse (e.g., empty clusters in k-means), and avoiding it relies on careful implementation details. Online clustering methods like SwAV can be trained with large and small batches but require storing features when the number of clusters is much larger than the batch size. Clustering methods can also be combined with contrastive learning (Li et al., 2021) to prevent collapse.

This method (Bojanowski & Joulin, 2017) learns to map samples to fixed random targets on the unit sphere, which can be interpreted as a form of whitening. This objective uses a single network, and hence does not leverage the distortions induced by twin networks. Predefining random targets might limit the flexibility of the representation that can be learned.

In the early days of SSL, (Becker & Hinton, 1992; Zemel & Hinton, 1990) proposed the IMAX loss function between twin networks, given by:

ℒ_IMAX = log |C_(Z^A − Z^B)| − log |C_(Z^A + Z^B)|

where |·| denotes the determinant of a matrix, C_(Z^A − Z^B) is the covariance matrix of the difference of the outputs of the twin networks, and C_(Z^A + Z^B) the covariance matrix of their sum. It can be shown that this objective maximizes the information between the twin network representations under the assumptions that the two representations are noisy versions of the same underlying Gaussian signal, and that the noise is independent, additive and Gaussian. This objective is similar to ours in the sense that one term encourages the two representations to be similar while another encourages the units to be decorrelated. However, unlike IMAX, our objective is not directly an information quantity, and we have an extra parameter λ that trades off the two terms of our loss. The IMAX objective was only used in early, small-scale work, so it is not clear whether it can scale to large computer vision tasks. Our attempts to make it work on ImageNet were not successful.

We observe a steady improvement of the performance of our method as we increase the dimensionality of the embeddings (i.e. of the last layer of the projector network). This intriguing result is in stark contrast with other popular methods for SSL, such as SimCLR (Chen et al., 2020a) and BYOL (Grill et al., 2020), for which increasing the dimensionality of the embeddings rapidly saturates performance. It is a promising avenue to continue this exploration for even higher-dimensional embeddings (>16,000), but this would require the development of new methods or alternative hardware to accommodate the memory requirements of operating on such large embeddings.

Our method is just one possible instantiation of the Information Bottleneck principle applied to SSL. We believe that further refinements of the proposed loss function and algorithm could lead to more efficient solutions and even better performance. For example, the redundancy reduction term is currently computed from the off-diagonal terms of the cross-correlation matrix between the twin network embeddings, but it could alternatively be computed from the off-diagonal terms of the auto-correlation matrix of a single network's embedding. Our preliminary analyses indicate that this alternative leads to similar performance (not shown). A modified loss could also be applied to the (unnormalized) cross-covariance matrix instead of the (normalized) cross-correlation matrix (see Ablations for preliminary analyses).

We thank Pascal Vincent, Yubei Chen and Samuel Ocko for helpful insights on the mathematical connection to the infoNCE loss, Robert Geirhos and Adrien Bardes for extra analyses not included in the manuscript and Xinlei Chen, Mathilde Caron, Armand Joulin, Reuben Feinman and Ulisse Ferrari for useful comments on the manuscript.

We explore in this appendix the connection between Barlow Twins’ loss function and the Information Bottleneck (IB) principle (Tishby & Zaslavsky, 2015; Tishby et al., 2000).

Applied to self-supervised learning, the IB principle posits that a desirable representation should be as informative as possible about the sample represented while being as invariant (i.e. non-informative) as possible to distortions of that sample (here, the data augmentations used) (Fig. 6). This trade-off is captured by the following loss function:

IB_θ = I(Z_θ, Y) − β I(Z_θ, X)     (5)

where I(·,·) denotes mutual information between the representation Z_θ and, respectively, the distorted samples Y and the original samples X, and β is a positive scalar trading off the desiderata of preserving information and being invariant to distortions.

Using a classical identity for mutual information, I(A, B) = H(A) − H(A|B), we can rewrite equation 5 as:

IB_θ = [H(Z_θ) − H(Z_θ|Y)] − β [H(Z_θ) − H(Z_θ|X)]     (6)

where H(·) denotes entropy. The conditional entropy H(Z_θ|Y) (the entropy of the representation conditioned on a specific distorted sample) cancels to 0 because the function f_θ is deterministic, so the representation Z_θ conditioned on the input sample Y is perfectly known and has zero entropy. Since the overall scaling factor of the loss function is not important, we can rearrange equation 6 as:

IB_θ ∝ H(Z_θ|X) + (1−β)/β · H(Z_θ)     (7)

Measuring the entropy of a high-dimensional signal generally requires vast amounts of data, much larger than the size of a single batch. To circumvent this difficulty, we make the simplifying assumption that the representation Z_θ is distributed as a Gaussian. The entropy of a Gaussian distribution is given by the logarithm of the determinant of its covariance matrix (up to a constant corresponding to the assumed discretization level, which we ignore here) (Cai et al., 2015). The loss function becomes:

IB_θ = 𝔼_X log |C_(Z_θ|X)| + (1−β)/β · log |C_(Z_θ)|     (8)

This equation is still not exactly the one we optimize for in practice (see eqn. 3 and 4). Indeed, our loss function is only connected to the IB loss given by eqn. 8 through the following simplifications and approximations:

In the case where β ≤ 1, it is easy to see from eqn. 8 that the best solution to the IB trade-off is to set the representation to a constant that does not depend on the input. This regime thus does not lead to interesting representations and can be ignored. When β > 1, we note that the second term of eqn. 8 is preceded by a negative constant. We can thus simply replace (1−β)/β by a new positive constant λ, preceded by a negative sign.

In practice, we find that directly optimizing the determinant of the covariance matrices does not lead to SoTA solutions. Instead, we replace the second term of the loss in eqn. 8 (maximizing the information about samples) by a proxy: minimizing the Frobenius norm of the cross-correlation matrix. If the representations are assumed to be re-scaled to 1 along the batch dimension before entering the loss (an assumption we are free to make since the cross-correlation matrix is invariant to this re-scaling), this minimization only affects the off-diagonal terms of the covariance matrix (the diagonal terms being fixed to 1 by the re-scaling) and encourages them to be as close to 0 as possible. This surrogate objective, which consists in decorrelating all output units, has the same global optimum as the original information maximization objective.

For consistency with eqn. 8, the second term in Barlow Twins’ loss should be computed from the auto-correlation matrix of one of the twin networks, instead of the cross-correlation matrix between twin networks. In practice, we do not see a strong difference in performance between these alternatives.

Similarly, it can easily be shown that the first term of eqn. 8 (minimizing the information the representation contains about the distortions) has the same global optimum as the first term of eqn. 3, which maximizes the alignment between representations of pairs of distorted samples.
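A small numerical check of the surrogate objective above (our construction): for a correlation matrix, whose diagonal is fixed to 1, the log-determinant is maximal (equal to zero) exactly when all off-diagonal entries vanish, so driving the off-diagonal Frobenius norm to zero also maximizes the Gaussian entropy proxy.

```python
import numpy as np

def logdet(m):
    # log-determinant, the entropy proxy under the Gaussian assumption
    return np.linalg.slogdet(m)[1]

def off_diag_frobenius(m):
    # squared Frobenius norm of the off-diagonal entries (surrogate objective)
    return np.sum(m ** 2) - np.sum(np.diagonal(m) ** 2)

identity = np.eye(3)
correlated = np.array([[1.0, 0.5, 0.2],
                       [0.5, 1.0, 0.3],
                       [0.2, 0.3, 1.0]])
```

By Hadamard's inequality, the determinant of any correlation matrix is at most 1, with equality only at the identity.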

The linear classifier is trained for 100 epochs with a learning rate of 0.3 and a cosine learning rate schedule. We minimize the cross-entropy loss with the SGD optimizer with momentum and a weight decay of 10⁻⁶. We use a batch size of 256. At training time we augment an input image by taking a random crop, resizing it to 224×224, and optionally flipping the image horizontally. At test time we resize the image to 256×256 and center-crop it to 224×224.

We train for 20 epochs with a learning rate of 0.002 for the ResNet-50 and 0.5 for the final classification layer. The learning rate is multiplied by a factor of 0.2 after the 12th and 16th epochs. We minimize the cross-entropy loss with the SGD optimizer with momentum and do not use weight decay. We use a batch size of 256. The image augmentations are the same as in the linear evaluation setting.

We follow the exact settings from PIRL (Misra & van der Maaten, 2019) for evaluating linear classifiers on the Places-205, VOC07 and iNaturalist2018 datasets. For Places-205 and iNaturalist2018 we train a linear classifier with SGD (14 epochs on Places-205, 84 epochs on iNaturalist2018) with a learning rate of 0.01 reduced by a factor of 10 at two equally spaced intervals, a weight decay of 5×10⁻⁴ and SGD momentum of 0.9. We train SVM classifiers on the VOC07 dataset, where the C values are computed using cross-validation.

Table 1: Top-1 and top-5 accuracies (in %) under linear evaluation on ImageNet. All models use a ResNet-50 encoder. Top-3 best self-supervised methods are underlined.

Method                 Top-1  Top-5
Supervised             76.5   -
MoCo                   60.6   -
PIRL                   63.6   -
SimCLR                 69.3   89.0
MoCo v2                71.1   90.1
SimSiam                71.3   -
SwAV (w/o multi-crop)  71.8   -
BYOL                   74.3   91.6
SwAV                   75.3   -
Barlow Twins (ours)    73.2   91.0

Table 2: Semi-supervised learning on ImageNet using 1% and 10% training examples. Results for the supervised method are from (Zhai et al., 2019). Best results are in bold.

Method               Top-1        Top-5
                     1%    10%    1%    10%
Supervised           25.4  56.4   48.4  80.4
PIRL                 -     -      57.2  83.8
SimCLR               48.3  65.6   75.5  87.8
BYOL                 53.2  68.8   78.4  89.0
SwAV                 53.9  70.2   78.5  89.9
Barlow Twins (ours)  55.0  69.7   79.2  89.3

Table 3: Transfer learning: image classification. We benchmark learned representations on the image classification task by training linear classifiers on fixed features. We report top-1 accuracy on the Places-205 and iNat18 datasets, and classification mAP on VOC07. Top-3 best self-supervised methods are underlined.

Method                 Places-205  VOC07  iNat18
Supervised             53.2        87.5   46.7
SimCLR                 52.5        85.5   37.2
MoCo-v2                51.8        86.4   38.6
SwAV (w/o multi-crop)  52.8        86.4   39.5
SwAV                   56.7        88.9   48.6
BYOL                   54.0        86.6   47.6
Barlow Twins (ours)    54.1        86.2   46.5

Table 4: Transfer learning: object detection and instance segmentation. We benchmark learned representations on the object detection task on VOC07+12 using Faster R-CNN (Ren et al., 2015) and on the detection and instance segmentation task on COCO using Mask R-CNN (He et al., 2017). All methods use the C4 backbone variant (Wu et al., 2019) and models on COCO are finetuned using the 1× schedule. Best results are in bold.

MethodVOC07+12 detCOCO detCOCO instance seg
APallAP50AP75APbbAP50bbsubscriptsuperscriptabsentbb50{}^{\mathrm{bb}}_{50}AP75bbsubscriptsuperscriptabsentbb75{}^{\mathrm{bb}}_{75}APmkAP50mksubscriptsuperscriptabsentmk50{}^{\mathrm{mk}}_{50}AP75mksubscriptsuperscriptabsentmk75{}^{\mathrm{mk}}_{75}
Sup.53.581.358.838.258.241.233.354.735.2
MoCo-v257.482.564.039.358.942.534.455.836.5
SwAV56.182.662.738.458.641.333.855.235.9
SimSiam5782.463.739.259.342.134.456.036.7
BT (ours)56.882.663.439.259.042.534.356.036.5

Table 5: Loss function explorations. We ablate the invariance and redundancy terms in our proposed loss and observe that both terms are necessary for good performance. We also experiment with different normalization schemes and a cross-entropy loss and observe reduced performance.

| Loss function | Top-1 | Top-5 |
| --- | --- | --- |
| Baseline | 71.4 | 90.2 |
| Only invariance term (on-diag term) | 57.3 | 80.5 |
| Only red. red. term (off-diag term) | 0.1 | 0.5 |
| Normalization along feature dim. | 69.8 | 88.8 |
| No BN in MLP | 71.2 | 89.7 |
| No BN in MLP + no normalization | 53.4 | 76.7 |
| Cross-entropy with temp. | 63.3 | 85.7 |

Table 6: Effect of asymmetric settings.

| Case | Stop-gradient | Predictor | Top-1 | Top-5 |
| --- | --- | --- | --- | --- |
| Baseline | - | - | 71.4 | 90.2 |
| (a) | ✓ | - | 70.5 | 89.0 |
| (b) | - | ✓ | 70.2 | 89.0 |
| (c) | ✓ | ✓ | 61.3 | 83.5 |

Table 7: Wider and/or deeper projector and predictor heads and larger dimensionality of the embedding did not improve the performance of BYOL.

| Projector | Predictor | Top-1 Acc. | Description |
| --- | --- | --- | --- |
| 4096-256 | 4096-256 | 74.1% | baseline |
| 4096-4096-256 | 4096-256 | 74.0% | 3-layer proj., 2-layer pred., 256-d repr. |
| 4096-4096-256 | 4096-4096-256 | 73.2% | 3-layer proj., 3-layer pred., 256-d repr. |
| 4096-4096-512 | 4096-512 | 73.7% | 3-layer proj., 2-layer pred., 512-d repr. |
| 4096-4096-512 | 4096-4096-512 | 73.2% | 3-layer proj., 3-layer pred., 512-d repr. |
| 8192-8192-8192 | 8192-8192 | 72.3% | same proj. as BT, 2-layer pred., 8192-d repr. |

Figure 1. Barlow Twins's objective function measures the cross-correlation matrix between the embeddings of two identical networks fed with distorted versions of a batch of samples, and tries to make this matrix close to the identity. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. Barlow Twins is competitive with state-of-the-art methods for self-supervised learning while being conceptually simpler, naturally avoiding trivial constant (i.e. collapsed) embeddings, and being robust to the training batch size.

Figure 2. Effect of batch size. To compare the effect of the batch size across methods, for each method we report the difference between the top-1 accuracy at a given batch size and the best obtained accuracy among all batch sizes tested. BYOL: best accuracy is 72.5% for a batch size of 4096 (data from (Grill et al., 2020), fig. 3A). SimCLR: best accuracy is 67.1% for a batch size of 4096 (data from (Chen et al., 2020a), fig. 9, model trained for 300 epochs). Barlow Twins: best accuracy is 71.7% for a batch size of 1024.

Figure 3. Effect of progressively removing data augmentations. Data for BYOL and SimCLR (repro) is from (Grill et al., 2020), fig. 3b.

Figure 4. Effect of the dimensionality of the last layer of the projector network on performance. The parameter λ is kept fixed for all dimensionalities tested. Data for SimCLR is from (Chen et al., 2020a), fig. 8; data for BYOL is from (Grill et al., 2020), Table 14b.

Figure 5. Sensitivity of Barlow Twins to the hyperparameter λ.

The information bottleneck principle applied to self-supervised learning (SSL) posits that the objective of SSL is to learn a representation $Z_\theta$ which is informative about the image sample, but invariant (i.e. uninformative) to the specific distortions that are applied to this sample. Barlow Twins can be viewed as a specific instantiation of the information bottleneck objective.

$$ \mathcal{L}_{BT} \triangleq \underbrace{\sum_{i}(1-\mathcal{C}_{ii})^{2}}_{\text{invariance term}} + \lambda \underbrace{\sum_{i}\sum_{j\neq i}\mathcal{C}_{ij}^{2}}_{\text{redundancy reduction term}} \tag{1} $$

$$ \mathcal{C}_{ij} \triangleq \frac{\sum_{b} z^{A}_{b,i}\, z^{B}_{b,j}}{\sqrt{\sum_{b}\left(z^{A}_{b,i}\right)^{2}}\,\sqrt{\sum_{b}\left(z^{B}_{b,j}\right)^{2}}} \tag{2} $$
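Concretely, eqns. 1 and 2 can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' released code (which is in PyTorch); the function name and the default value of λ here are ours:

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lambd=5e-3):
    """L_BT of eqn. 1, with C the cross-correlation matrix of eqn. 2.

    z_a, z_b: (N, D) batches of embeddings of two distorted views.
    Each feature is standardized over the batch, so C_ij lies in [-1, 1].
    """
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(axis=0)) / z_a.std(axis=0)
    z_b = (z_b - z_b.mean(axis=0)) / z_b.std(axis=0)
    c = z_a.T @ z_b / n  # (D, D) cross-correlation matrix
    invariance = np.sum((1.0 - np.diag(c)) ** 2)            # on-diagonal term
    redundancy = np.sum(c ** 2) - np.sum(np.diag(c) ** 2)   # off-diagonal term
    return invariance + lambd * redundancy
```

With two identical views the diagonal of C is 1, so only the redundancy term contributes; with a single feature (D = 1) the loss vanishes entirely.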

$$ \mathcal{IB}_{\theta} \triangleq I(Z_{\theta}, Y) - \beta\, I(Z_{\theta}, X) \tag{5} $$

$$ \mathcal{L}_{\text{infoNCE}} \triangleq -\underbrace{\sum_{b}\frac{\langle z^{A}_{b}, z^{B}_{b}\rangle_{i}}{\tau \left\lVert z^{A}_{b}\right\rVert_{2} \left\lVert z^{B}_{b}\right\rVert_{2}}}_{\text{similarity term}} + \underbrace{\sum_{b}\log\left(\sum_{b'\neq b}\exp\left(\frac{\langle z^{A}_{b}, z^{B}_{b'}\rangle_{i}}{\tau \left\lVert z^{A}_{b}\right\rVert_{2} \left\lVert z^{B}_{b'}\right\rVert_{2}}\right)\right)}_{\text{contrastive term}} $$

$$ \mathcal{L}_{\text{IMAX}} \triangleq \log\left|\mathcal{C}_{(Z^{A}-Z^{B})}\right| - \log\left|\mathcal{C}_{(Z^{A}+Z^{B})}\right| $$


BARLOW TWINS learns self-supervised representations through a joint embedding of distorted images, with an objective function that maximizes similarity between the embedding vectors while reducing redundancy between their components. Our method does not require large batches of samples, nor does it require any particular asymmetry in the twin network structure. We discuss next the similarities and differences between our method and prior art, both from a conceptual and an empirical standpoint. For ease of comparison, all objective functions are recast with a common set of notations. The discussion ends with future directions.


For all ablation studies, BARLOW TWINS was trained for 300 epochs, instead of the 1000 epochs used in the previous section. A linear evaluation on ImageNet of this baseline model yielded a 71.4% top-1 accuracy and a 90.2% top-5 accuracy. For all the ablations presented we report the top-1 and top-5 accuracy of training linear classifiers on the 2048-dimensional res5 features using the ImageNet train set.

Loss Function Ablations We alter our loss function (eqn. 1) in several ways to test the necessity of each of its terms, and to experiment with practices popular in other SSL loss functions, such as INFONCE. Table 5 recapitulates the different loss functions tested along with their results on a linear evaluation benchmark of ImageNet. First, we find that removing the invariance term (on-diagonal term) or the redundancy reduction term (off-diagonal term) of our loss function leads to worse or collapsed solutions, as expected. We then study the effect of different normalization strategies. We first try to normalize the embeddings along the feature dimension so that they lie on the unit sphere, as is common practice for losses measuring a cosine similarity (Chen et al., 2020a; Grill et al., 2020; Wang & Isola, 2020). Specifically, we first normalize the embeddings along the batch dimension (with mean subtraction), then normalize the embeddings along the feature dimension (without mean subtraction), and finally we measure the (unnormalized) covariance matrix instead of the (normalized) cross-correlation matrix in eqn. 2. The performance is slightly reduced. Second, we try to remove the batch-normalization operations in the two hidden layers of the projector network MLP. The performance is barely affected. Third, in addition to removing the batch normalization in the hidden layers, we replace the cross-correlation matrix in eqn. 2 by the cross-covariance matrix (which means the features are no longer normalized along the batch dimension). The performance is substantially reduced. We finally try a cross-entropy loss with temperature, in which the on-diagonal and off-diagonal terms are controlled by a temperature hyperparameter τ and a coefficient λ: $\mathcal{L} = -\log\sum_{i}\exp(\mathcal{C}_{ii}/\tau) + \lambda \log\sum_{i}\sum_{j\neq i}\exp(\max(\mathcal{C}_{ij}, 0)/\tau)$. The performance is reduced.
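Written out from a given cross-correlation matrix, the cross-entropy-with-temperature variant is a one-liner per term; the sketch below is illustrative (function name and default hyperparameters are ours, not values from the paper):

```python
import numpy as np

def xent_loss(c, tau=0.1, lambd=1.0):
    """Cross-entropy-with-temperature variant of the Barlow Twins loss.

    c: (D, D) cross-correlation matrix. On-diagonal entries are pulled up
    toward 1; off-diagonal entries (clipped at 0) are pushed down.
    """
    d = c.shape[0]
    diag = np.diag(c)
    off = c[~np.eye(d, dtype=bool)]  # flattened off-diagonal entries
    on_term = -np.log(np.sum(np.exp(diag / tau)))
    off_term = np.log(np.sum(np.exp(np.maximum(off, 0.0) / tau)))
    return on_term + lambd * off_term
```

For instance, at the ideal point c = I with τ = λ = 1 and D = 4, the loss evaluates to log 3 − 1.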


Robustness to Batch Size The INFONCE loss, which draws negative examples from the minibatch, suffers performance drops when the batch size is reduced (e.g. SIMCLR (Chen et al., 2020a)). We thus sought to test the robustness of BARLOW TWINS to small batch sizes. To adapt our model to different batch sizes, we performed a grid search over LARS learning rates for each batch size. We find that, unlike SIMCLR, our model is robust to small batch sizes (Fig. 2), with performance almost unaffected for a batch size as small as 256. In comparison, the accuracy of SIMCLR drops by about 4 p.p. at batch size 256. This robustness to small batch sizes, also found in non-contrastive methods such as BYOL, further demonstrates that our method is not only conceptually (see Discussion) but also empirically different from the INFONCE objective.

Effect of Removing Augmentations We find that, like SIMCLR but unlike BYOL, our model is not robust to the removal of some types of data augmentations (Fig. 3). While this can be seen as a disadvantage of our method compared to BYOL, it can also be argued that the representations learned by our method are better controlled by the specific set of distortions used, as opposed to BYOL, for which the invariances learned seem generic and intriguingly independent of the specific distortions used.

Projector Network Depth & Width For other SSL methods, such as BYOL and SIMCLR, the projector network drastically reduces the dimensionality of the ResNet output.


In stark contrast, we find that BARLOW TWINS performs better when the dimensionality of the projector network output is very large. Other methods rapidly saturate as the dimensionality of the output increases, but our method keeps improving for all output dimensionalities tested (Fig. 4). This result is quite surprising because the output of the ResNet is kept fixed at 2048, which acts as a dimensionality bottleneck in our model and sets the limit of the intrinsic dimensionality of the representation. In addition, similarly to other methods, we find that our model performs better when the projector network has more layers, with a saturation of the performance at 3 layers.


Breaking Symmetry Many SSL methods (e.g. BYOL, SIMSIAM, SWAV) rely on different symmetry-breaking mechanisms to avoid trivial solutions. Our loss function avoids these trivial solutions by construction, even in the case of symmetric networks. It is however interesting to ask whether breaking symmetry can further improve the performance of our network. Following SIMSIAM and BYOL, we experiment with adding a predictor network composed of 2 fully connected layers of size 8192 to one of the networks (with batch normalization followed by a ReLU nonlinearity in the hidden layer) and/or a stop-gradient mechanism on the other network. We find that these asymmetries slightly decrease the performance of our network (see Table 6).


BYOL with a larger projector/predictor/embedding For a fair comparison with BYOL, we also evaluated BYOL with a wider and/or deeper projector head (3-layer MLP), a wider and/or deeper predictor head, and a larger dimensionality of the embedding. BYOL did not improve under these conditions (see Table 7).

Sensitivity to λ We also explored the sensitivity of BARLOW TWINS to the hyperparameter λ, which trades off the desiderata of invariance and informativeness of the embeddings. We find that BARLOW TWINS is not very sensitive to this hyperparameter (Fig. 5).



Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called BARLOW TWINS, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. BARLOW TWINS does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. BARLOW TWINS outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.



We use the detectron2 library (Wu et al., 2019) for training the detection models and closely follow the evaluation settings from (He et al., 2019). The backbone ResNet50 network for Faster R-CNN (Ren et al., 2015) and Mask R-CNN (He et al., 2017) is initialized using our BARLOW TWINS pretrained model.

VOC07+12 We use the VOC07+12 trainval set of 16K images for training a Faster R-CNN (Ren et al., 2015) C-4 backbone for 24K iterations using a batch size of 16 across 8 GPUs using SyncBatchNorm. The initial learning rate for the model is 0.1, which is reduced by a factor of 10 after 18K and 22K iterations. We use linear warmup (Goyal et al., 2017) with a slope of 0.333 for 1000 iterations.
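The warmup-then-step schedule described above can be sketched as a plain function; in practice detectron2's built-in scheduler handles this, so the function and parameter names below are ours, and the reading of the warmup slope (start at base_lr × 0.333, ramp linearly to base_lr) follows the convention of Goyal et al., 2017:

```python
def detection_lr(it, base_lr=0.1, warmup_iters=1000, warmup_factor=0.333,
                 decay_steps=(18_000, 22_000), gamma=0.1):
    """Learning rate at iteration `it`: linear warmup from
    base_lr * warmup_factor to base_lr over the first 1000 iterations,
    then a 10x drop after 18K and again after 22K iterations."""
    if it < warmup_iters:
        alpha = it / warmup_iters
        return base_lr * (warmup_factor * (1 - alpha) + alpha)
    return base_lr * gamma ** sum(it >= s for s in decay_steps)
```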

COCO We train a Mask R-CNN (He et al., 2017) C-4 backbone on the COCO 2017 train split and report results on the val split. We use a learning rate of 0.03 and keep the other parameters the same as in the 1× schedule in detectron2.



Asano, Y. M., Rupprecht, C., and Vedaldi, A. Self-labelling via simultaneous clustering and representation learning. International Conference on Learning Representations (ICLR) , 2020.

Barlow, H. Redundancy reduction revisited. Network (Bristol, England) , 12(3):241-253, August 2001. ISSN 0954898X.

Barlow, H. B. Possible Principles Underlying the Transformations of Sensory Messages. In Sensory Communication, The MIT Press, 1961.

Becker, S. and Hinton, G. E. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature , 355(6356):161-163, January 1992.

Cai, T. T., Liang, T., and Zhou, H. H. Law of Log Determinant of Sample Covariance Matrix and Optimal Estimation of Differential Entropy for High-Dimensional Gaussian Distributions. Journal of Multivariate Analysis , 137, 2015. doi: 10.1016/j.jmva.2015.02.003.

Caron, M., Bojanowski, P., Joulin, A., and Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.


Acknowledgements

We thank Pascal Vincent, Yubei Chen and Samuel Ocko for helpful insights on the mathematical connection to the infoNCE loss, Robert Geirhos and Adrien Bardes for extra analyses not included in the manuscript and Xinlei Chen, Mathilde Caron, Armand Joulin, Reuben Feinman and Ulisse Ferrari for useful comments on the manuscript.






$$ \mathcal{L}_{BT} = \underbrace{\sum_{i}\left(1-\frac{\langle z^{A}_{\cdot,i}, z^{B}_{\cdot,i}\rangle_{b}}{\left\lVert z^{A}_{\cdot,i}\right\rVert_{2} \left\lVert z^{B}_{\cdot,i}\right\rVert_{2}}\right)^{2}}_{\text{invariance term}} + \lambda \underbrace{\sum_{i}\sum_{j\neq i}\left(\frac{\langle z^{A}_{\cdot,i}, z^{B}_{\cdot,j}\rangle_{b}}{\left\lVert z^{A}_{\cdot,i}\right\rVert_{2} \left\lVert z^{B}_{\cdot,j}\right\rVert_{2}}\right)^{2}}_{\text{redundancy reduction term}} $$


Algorithm 1. PyTorch-style pseudocode for Barlow Twins.

```python
# f: encoder network
# lambda: weight on the off-diagonal terms
# N: batch size
# D: dimensionality of the embeddings
#
# mm: matrix-matrix multiplication
# off_diagonal: off-diagonal elements of a matrix
# eye: identity matrix

for x in loader: # load a batch with N samples
    # two randomly augmented versions of x
    y_a, y_b = augment(x)

    # compute embeddings
    z_a = f(y_a) # NxD
    z_b = f(y_b) # NxD

    # normalize repr. along the batch dimension
    z_a_norm = (z_a - z_a.mean(0)) / z_a.std(0) # NxD
    z_b_norm = (z_b - z_b.mean(0)) / z_b.std(0) # NxD

    # cross-correlation matrix
    c = mm(z_a_norm.T, z_b_norm) / N # DxD

    # loss
    c_diff = (c - eye(D)).pow(2) # DxD
    # multiply off-diagonal elems of c_diff by lambda
    off_diagonal(c_diff).mul_(lambda)
    loss = c_diff.sum()

    # optimization step
    loss.backward()
    optimizer.step()
```
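The pseudocode assumes an `off_diagonal` helper. One way to write it, shown here as an illustrative NumPy sketch (the released PyTorch code implements its own variant, and note that unlike the in-place `.mul_` in the pseudocode, boolean-mask indexing in NumPy returns a copy):

```python
import numpy as np

def off_diagonal(c):
    """Return the off-diagonal elements of a square matrix as a flat array,
    in row-major order."""
    d, d2 = c.shape
    assert d == d2, "expected a square matrix"
    return c[~np.eye(d, dtype=bool)]
```

For a 3×3 matrix containing 0..8 row by row, this returns the six entries 1, 2, 3, 5, 6, 7.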

References

[redlich_supervised_1993] Redlich, A. Norman. (1993). Supervised factorial learning. Neural Computation. doi:10.1162/neco.1993.5.5.750.

[redlich_redundancy_1993] Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Computation. doi:10.1162/neco.1993.5.2.289.

[deco_non-linear_1997] Deco, G., Parra, L. (1997). Non-linear feature extraction by redundancy reduction in an unsupervised stochastic neural network. Neural Networks. doi:10.1016/S0893-6080(96)00110-4.

[schmidhuber_semilinear_1996] Schmidhuber, J., Eldracher, M., Foltin, B. (1996). Semilinear predictability minimization produces well-known feature detectors. Neural Computation. doi:10.1162/neco.1996.8.4.773.

[foldiak_forming_1990] Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics. doi:10.1007/BF02331346.

[linsker_self-organization_1988] Linsker, R. (1988). Self-organization in a perceptual network. Computer. doi:10.1109/2.36.

[schwartz_natural_2001] Schwartz, Odelia, Simoncelli, Eero P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience. doi:10.1038/90526.

[balle_end--end_2017] Ballé, Johannes, Laparra, Valero, Simoncelli, Eero P. (2017). End-to-end optimized image compression. ICLR.

[li2020contrastive] Li, Yunfan, Hu, Peng, Liu, Zitao, Peng, Dezhong, Zhou, Joey Tianyi, Peng, Xi. (2021). Contrastive Clustering. AAAI.

[tian_understanding_2021] Tian, Yuandong, Chen, Xinlei, Ganguli, Surya. (2021). Understanding self-supervised learning dynamics without contrastive pairs. arXiv:2102.06810 [cs].

[gutmann_noise-contrastive_2010] Gutmann, Michael, Hyvärinen, Aapo. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.

[barlow_possible_nodate] Barlow, H. B. (1961). Possible principles underlying the transformations of sensory messages. In Sensory Communication, The MIT Press.

[barlow_redundancy_2001] Barlow, H. (2001). Redundancy reduction revisited. Network (Bristol, England).

[ocko_emergence_2018] Ocko, Samuel A., Lindsey, Jack, Ganguli, Surya, Deny, Stephane. (2018). The emergence of multiple retinal cell types through efficient coding of natural movies. NeurIPS.

[lindsey_unified_2020] Lindsey, Jack, Ocko, Samuel A., Ganguli, Surya, Deny, Stephane. (2020). A unified theory of early visual representations from retina to cortex through anatomically constrained deep CNNs. ICLR.

[tishby_information_2000] Tishby, Naftali, Pereira, Fernando C., Bialek, William. (2000). The information bottleneck method. arXiv preprint arXiv:physics/0004057.

[tishby_deep_2015] Tishby, Naftali, Zaslavsky, Noga. (2015). Deep learning and the information bottleneck principle. arXiv preprint arXiv:1503.02406.

[ermolov_whitening_2020] Ermolov, Aleksandr, Siarohin, Aliaksandr, Sangineto, Enver, Sebe, Nicu. (2020). Whitening for self-supervised representation learning. arXiv preprint arXiv:2007.06346.

[chen_big_2020] Chen, Ting, Kornblith, Simon, Swersky, Kevin, Norouzi, Mohammad, Hinton, Geoffrey. (2020). Big self-supervised models are strong semi-supervised learners.

[wang_understanding_2020] Wang, Tongzhou, Isola, Phillip. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. arXiv preprint arXiv:2005.10242.

[becker_self-organizing_1992] Becker, Suzanna, Hinton, Geoffrey E.. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature.

[zemel_discovering_1990] Zemel, Richard, Hinton, Geoffrey E. (1990). Discovering Viewpoint-Invariant Relationships That Characterize Objects. NeurIPS.

[tian_understanding_2020] Tian, Yuandong, Yu, Lantao, Chen, Xinlei, Ganguli, Surya. (2020). Understanding self-supervised learning with dual deep networks. arXiv preprint arXiv:2010.00578.

[fetterman_understanding_2020] Fetterman, Abe, Albrecht, Josh. (2020). Understanding self-supervised and contrastive learning with "Bootstrap Your Own Latent" (BYOL). Untitled AI.

[richemond_byol_2020] Richemond, Pierre H., Grill, Jean-Bastien, Altché, Florent, Tallec, Corentin, Strub, Florian, Brock, Andrew, Smith, Samuel, De, Soham, Pascanu, Razvan, Piot, Bilal, Valko, Michal. (2020). BYOL works even without batch statistics. arXiv preprint arXiv:2010.10241.

[he2019rethinking] He, Kaiming, Girshick, Ross, Dollár, Piotr. (2019). Rethinking ImageNet pre-training.

[lin2014microsoft] Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, Zitnick, C. Lawrence. (2014). Microsoft COCO: Common objects in context.

[donahue2019large] Donahue, Jeff, Simonyan, Karen. (2019). Large scale adversarial representation learning.

[fan2008liblinear] Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, Lin, Chih-Jen. (2008). LIBLINEAR: A library for large linear classification. Journal of machine learning research.

[carion2020end] Carion, Nicolas, Massa, Francisco, Synnaeve, Gabriel, Usunier, Nicolas, Kirillov, Alexander, Zagoruyko, Sergey. (2020). End-to-End Object Detection with Transformers. arXiv preprint arXiv:2005.12872.

[cubuk2019randaugment] Cubuk, Ekin D, Zoph, Barret, Shlens, Jonathon, Le, Quoc V. (2019). RandAugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719.

[xie2020unsupervised] Xie, Qizhe, Dai, Zihang, Hovy, Eduard, Luong, Minh-Thang, Le, Quoc V. (2020). Unsupervised Data Augmentation for Consistency Training. arXiv preprint arXiv:1904.12848.

[sohn2020fixmatch] Sohn, Kihyuk, Berthelot, David, Li, Chun-Liang, Zhang, Zizhao, Carlini, Nicholas, Cubuk, Ekin D, Kurakin, Alex, Zhang, Han, Raffel, Colin. (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.

[li2020prototypical] Li, Junnan, Zhou, Pan, Xiong, Caiming, Socher, Richard, Hoi, Steven CH. (2020). Prototypical Contrastive Learning of Unsupervised Representations. arXiv preprint arXiv:2005.04966.

[gidaris2020learning] Gidaris, Spyros, Bursuc, Andrei, Komodakis, Nikos, Pérez, Patrick, Cord, Matthieu. (2020). Learning Representations by Predicting Bags of Visual Words. arXiv preprint arXiv:2002.12247.

[van2018inaturalist] Van Horn, Grant, Mac Aodha, Oisin, Song, Yang, Cui, Yin, Sun, Chen, Shepard, Alex, Adam, Hartwig, Perona, Pietro, Belongie, Serge. (2018). The inaturalist species classification and detection dataset.

[everingham2010pascal] Everingham, Mark, Van Gool, Luc, Williams, Christopher KI, Winn, John, Zisserman, Andrew. (2010). The pascal visual object classes (voc) challenge. International journal of computer vision.

[zhou2014learning] Zhou, Bolei, Lapedriza, Agata, Xiao, Jianxiong, Torralba, Antonio, Oliva, Aude. (2014). Learning deep features for scene recognition using places database.

[micikevicius2017mixed] Micikevicius, Paulius, Narang, Sharan, Alben, Jonah, Diamos, Gregory, Elsen, Erich, Garcia, David, Ginsburg, Boris, Houston, Michael, Kuchaiev, Oleksii, Venkatesh, Ganesh, others. (2017). Mixed precision training. arXiv preprint arXiv:1710.03740.

[loshchilov2016sgdr] Loshchilov, Ilya, Hutter, Frank. (2016). Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

[you2017large] You, Yang, Gitman, Igor, Ginsburg, Boris. (2017). Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.

[genevay2017learning] Genevay, Aude, Peyré, Gabriel, Cuturi, Marco. (2017). Learning generative models with Sinkhorn divergences. arXiv preprint arXiv:1706.00292.

[cai_law_2015] Cai, T. Tony, Liang, Tengyuan, Zhou, Harrison H. (2015). Law of log determinant of sample covariance matrix and optimal estimation of differential entropy for high-dimensional Gaussian distributions. Journal of Multivariate Analysis. doi:10.1016/j.jmva.2015.02.003.

[cuturi2013sinkhorn] Cuturi, Marco. (2013). Sinkhorn distances: Lightspeed computation of optimal transport.

[gutmann2010noise] Gutmann, Michael, Hyvärinen, Aapo. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.

[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.

[chen2020improved] Chen, Xinlei, Fan, Haoqi, Girshick, Ross, He, Kaiming. (2020). Improved Baselines with Momentum Contrastive Learning. arXiv preprint arXiv:2003.04297.

[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[bachman2019learning] Bachman, Philip, Hjelm, R Devon, Buchwalter, William. (2019). Learning representations by maximizing mutual information across views.

[tan2019efficientnet] Tan, Mingxing, Le, Quoc V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.

[hjelm2018learning] Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil, Trischler, Adam, Bengio, Yoshua. (2019). Learning deep representations by mutual information estimation and maximization.

[noroozi2018boosting] Noroozi, Mehdi, Vinjimoor, Ananth, Favaro, Paolo, Pirsiavash, Hamed. (2018). Boosting self-supervised learning via knowledge transfer.

[wang2017transitive] Wang, Xiaolong, He, Kaiming, Gupta, Abhinav. (2017). Transitive invariance for self-supervised visual representation learning.

[mahendran2018cross] Mahendran, Aravindh, Thewlis, James, Vedaldi, Andrea. (2018). Cross Pixel Optical Flow Similarity for Self-Supervised Learning. arXiv preprint arXiv:1807.05636.

[misra2016shuffle] Misra, Ishan, Zitnick, C Lawrence, Hebert, Martial. (2016). Shuffle and learn: unsupervised learning using temporal order verification.

[jenni2018self] Jenni, Simon, Favaro, Paolo. (2018). Self-supervised feature learning by learning to spot artifacts.

[pathak2016context] Pathak, Deepak, Krahenbuhl, Philipp, Donahue, Jeff, Darrell, Trevor, Efros, Alexei A. (2016). Context encoders: Feature learning by inpainting.

[agrawal2015learning] Agrawal, Pulkit, Carreira, Joao, Malik, Jitendra. (2015). Learning to see by moving.

[wang2015unsupervised] Wang, Xiaolong, Gupta, Abhinav. (2015). Unsupervised learning of visual representations using videos.

[pathak2017learning] Pathak, Deepak, Girshick, Ross, Dollár, Piotr, Darrell, Trevor, Hariharan, Bharath. (2017). Learning features by watching objects move.

[zhang2017split] Zhang, Richard, Isola, Phillip, Efros, Alexei A. (2017). Split-brain autoencoders: Unsupervised learning by cross-channel prediction.

[tian2019contrastive] Tian, Yonglong, Krishnan, Dilip, Isola, Phillip. (2019). Contrastive multiview coding. arXiv preprint arXiv:1906.05849.

[larsson2016learning] Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory. (2016). Learning representations for automatic colorization.

[zhang2016colorful] Zhang, Richard, Isola, Phillip, Efros, Alexei A. (2016). Colorful image colorization.

[kim2018learning] Kim, Dahun, Cho, Donghyeon, Yoo, Donggeun, Kweon, In So. (2018). Learning image representations by completing damaged jigsaw puzzles.

[noroozi2016unsupervised] Noroozi, Mehdi, Favaro, Paolo. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles.

[larsson2017colorization] Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory. (2017). Colorization as a proxy task for visual understanding.

[bach2008diffrac] Bach, Francis R, Harchaoui, Zaïd. (2008). DIFFRAC: A discriminative and flexible framework for clustering.

[hadsell2006dimensionality] Hadsell, Raia, Chopra, Sumit, LeCun, Yann. (2006). Dimensionality reduction by learning an invariant mapping.

[jing2019self] Jing, Longlong, Tian, Yingli. (2019). Self-supervised visual feature learning with deep neural networks: A survey. arXiv preprint arXiv:1902.06162.

[xie2016unsupervised] Xie, Junyuan, Girshick, Ross, Farhadi, Ali. (2016). Unsupervised deep embedding for clustering analysis.

[bautista2016cliquecnn] Bautista, Miguel A, Sanakoyeu, Artsiom, Tikhoncheva, Ekaterina, Ommer, Bjorn. (2016). CliqueCNN: Deep unsupervised exemplar learning.

[courtiol2018classification] Pierre Courtiol, Eric W. Tramel, Marc Sanselme, Gilles Wainrib. (2018). Classification and Disease Localization in Histopathology Using Only Global Labels: A Weakly-Supervised Approach.

[huang2019unsupervised] Huang, Jiabo, Dong, Qi, Gong, Shaogang. (2019). Unsupervised deep learning by neighbourhood discovery.

[caron2019unsupervised] Caron, Mathilde, Bojanowski, Piotr, Mairal, Julien, Joulin, Armand. (2019). Unsupervised pre-training of image features on non-curated data.

[yan2020cluster] Yan, Xueting, Misra, Ishan, Gupta, Abhinav, Ghadiyaram, Deepti, Mahajan, Dhruv. (2020). ClusterFit: Improving generalization of visual representations.

[dosovitskiy2016discriminative] Dosovitskiy, Alexey, Fischer, Philipp, Springenberg, Jost Tobias, Riedmiller, Martin, Brox, Thomas. (2016). Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence.

[goyal2017accurate] Goyal, Priya, Dollár, Piotr, Girshick, Ross, Noordhuis, Pieter, Wesolowski, Lukasz, Kyrola, Aapo, Tulloch, Andrew, Jia, Yangqing, He, Kaiming. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.

[deng2009imagenet] Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. (2009). ImageNet: A large-scale hierarchical image database.

[kolesnikov2019revisiting] Kolesnikov, Alexander, Zhai, Xiaohua, Beyer, Lucas. (2019). Revisiting self-supervised visual representation learning.

[zhuang2019local] Zhuang, Chengxu, Zhai, Alex Lin, Yamins, Daniel. (2019). Local aggregation for unsupervised learning of visual embeddings.

[misra2019self] Misra, Ishan, van der Maaten, Laurens. (2019). Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991.

[mahajan2018exploring] Mahajan, Dhruv, Girshick, Ross, Ramanathan, Vignesh, He, Kaiming, Paluri, Manohar, Li, Yixuan, Bharambe, Ashwin, van der Maaten, Laurens. (2018). Exploring the limits of weakly supervised pretraining.

[gidaris2018unsupervised] Gidaris, Spyros, Singh, Praveer, Komodakis, Nikos. (2018). Unsupervised representation learning by predicting image rotations.

[goyal2019scaling] Goyal, Priya, Mahajan, Dhruv, Gupta, Abhinav, Misra, Ishan. (2019). Scaling and benchmarking self-supervised visual representation learning.

[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, Tallec, Corentin, Richemond, Pierre H., Buchatskaya, Elena, Doersch, Carl, Pires, Bernardo Avila, Guo, Zhaohan Daniel, Azar, Mohammad Gheshlaghi, et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS.

[caron2020swav] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin. (2020). Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. NeurIPS.

[chen2020exploring] Chen, Xinlei, He, Kaiming. (2020). Exploring Simple Siamese Representation Learning. arXiv preprint arXiv:2011.10566.

[he2019momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2019). Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722.

[wu2018unsupervised] Wu, Zhirong, Xiong, Yuanjun, Yu, Stella X, Lin, Dahua. (2018). Unsupervised feature learning via non-parametric instance discrimination.

[asano2019self] Asano, Yuki Markus, Rupprecht, Christian, Vedaldi, Andrea. (2020). Self-labelling via simultaneous clustering and representation learning.

[yang2016joint] Yang, Jianwei, Parikh, Devi, Batra, Dhruv. (2016). Joint unsupervised learning of deep representations and image clusters.

[doersch2015unsupervised] Doersch, Carl, Gupta, Abhinav, Efros, Alexei A. (2015). Unsupervised visual representation learning by context prediction.

[bojanowski2017unsupervised] Bojanowski, Piotr, Joulin, Armand. (2017). Unsupervised learning by predicting noise.

[caron2018deep] Caron, Mathilde, Bojanowski, Piotr, Joulin, Armand, Douze, Matthijs. (2018). Deep clustering for unsupervised learning of visual features.

[ji2019invariant] Ji, Xu, Henriques, João F., Vedaldi, Andrea. (2019). Invariant information clustering for unsupervised image classification and segmentation.

[touvron2019fixing] Touvron, Hugo, Vedaldi, Andrea, Douze, Matthijs, Jégou, Hervé. (2019). Fixing the train-test resolution discrepancy.

[hoffer2019mix] Hoffer, Elad, Weinstein, Berry, Hubara, Itay, Ben-Nun, Tal, Hoefler, Torsten, Soudry, Daniel. (2019). Mix & Match: training convnets with mixed image sizes for improved accuracy, speed and scale resiliency. arXiv preprint arXiv:1908.08986.

[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition.

[henaff2019data] Hénaff, Olivier J., Srinivas, Aravind, De Fauw, Jeffrey, Razavi, Ali, Doersch, Carl, Eslami, S. M. Ali, Oord, Aaron van den. (2019). Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272.

[xie2017aggregated] Xie, Saining, Girshick, Ross, Dollár, Piotr, Tu, Zhuowen, He, Kaiming. (2017). Aggregated residual transformations for deep neural networks.

[ren2015faster] Ren, Shaoqing, He, Kaiming, Girshick, Ross, Sun, Jian. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks.

[he2017mask] He, Kaiming, Gkioxari, Georgia, Dollár, Piotr, Girshick, Ross. (2017). Mask R-CNN.

[wu2019detectron2] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, Ross Girshick. (2019). Detectron2.

[liu2020labels] Liu, Chenxi, Dollár, Piotr, He, Kaiming, Girshick, Ross, Yuille, Alan, Xie, Saining. (2020). Are Labels Necessary for Neural Architecture Search?. arXiv preprint arXiv:2003.12056.

[caron2020pruning] Caron, Mathilde, Morcos, Ari, Bojanowski, Piotr, Mairal, Julien, Joulin, Armand. (2020). Pruning Convolutional Neural Networks with Self-Supervision. arXiv preprint arXiv:2001.03554.

[olshausen1996] B. A. Olshausen, D. J. Field. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature.

[donahue2017adversarial] J. Donahue, P. Krähenbühl, T. Darrell. (2016). Adversarial feature learning.

[mescheder2017unifying] L. Mescheder, S. Nowozin, A. Geiger. (2017). Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks. ICML.

[vincent2008extracting] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol. (2008). Extracting and composing robust features with denoising autoencoders. ICML.

[ranzato2007unsupervised] Marc’Aurelio Ranzato, Fu-Jie Huang, Y-Lan Boureau, Yann LeCun. (2007). Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition. CVPR.

[salakhutdinov2009deep] R. Salakhutdinov, G. Hinton. (2009). Deep Boltzmann machines. AISTATS.

[masci2011stacked] J. Masci, U. Meier, D. Cireșan, J. Schmidhuber. (2011). Stacked convolutional auto-encoders for hierarchical feature extraction. ICANN.

[zhai2019s4l] Zhai, Xiaohua, Oliver, Avital, Kolesnikov, Alexander, Beyer, Lucas. (2019). S4l: Self-supervised semi-supervised learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[bib4] Barlow, H. B. (1961). Possible principles underlying the transformations of sensory messages. In Sensory Communication, The MIT Press.

[bib13] Deco, G., Parra, L. (1997). Non-linear feature extraction by redundancy reduction in an unsupervised stochastic neural network. Neural Networks, 10(4):683–691. doi:10.1016/S0893-6080(96)00110-4.

[bib15] Ermolov, A., Siarohin, A., Sangineto, E., Sebe, N. (2020). Whitening for self-supervised representation learning. arXiv preprint arXiv:2007.06346.

[bib31] Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3):105–117. doi:10.1109/2.36.

[bib36] Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Computation, 5(2):289–304. doi:10.1162/neco.1993.5.2.289.

[bib41] Schwartz, O., Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825. doi:10.1038/90526.

[bib43] Tian, Y., Chen, X., Ganguli, S. (2021). Understanding self-supervised learning dynamics without contrastive pairs. arXiv preprint arXiv:2102.06810.
