

Decouple Negative and Positive Samples in Contrastive Learning

Chun-Hsiao Yeh 1,2, Cheng-Yao Hong 1, Yen-Chi Hsu 1,3, Tyng-Luh Liu ⋆1, Yubei Chen 4, and Yann LeCun 4,5

1 IIS, Academia Sinica, Taiwan
2 UC Berkeley

3 National Taiwan University

4 Meta AI Research

{sensible,yenchi,liutyng}@iis.sinica.edu.tw , daniel_yeh@berkeley.edu , {yubeic,yann}@fb.com

Abstract. Contrastive learning (CL) is one of the most successful paradigms for self-supervised learning (SSL). In a principled way, it considers two augmented 'views' of the same image as positives to be pulled closer, and all other images as negatives to be pushed further apart. However, behind the impressive success of CL-based techniques, their formulation often relies on heavy-computation settings, including large sample batches, extensive training epochs, etc. We are thus motivated to tackle these issues and establish a simple, efficient, yet competitive baseline of contrastive learning. Specifically, we identify, from theoretical and empirical studies, a noticeable negative-positive-coupling (NPC) effect in the widely used InfoNCE loss, which leads to poor learning efficiency at small batch sizes. By removing the NPC effect, we propose the decoupled contrastive learning (DCL) loss, which removes the positive term from the denominator and significantly improves the learning efficiency. DCL achieves competitive performance with less sensitivity to sub-optimal hyperparameters, requiring neither the large batches of SimCLR, the momentum encoding of MoCo, nor long training epochs. We demonstrate its effectiveness on various benchmarks and its robustness to suboptimal hyperparameters. Notably, SimCLR with DCL achieves 68.2% ImageNet-1K top-1 accuracy using batch size 256 within 200 epochs of pre-training, outperforming its SimCLR baseline by 6.4%. Further, DCL can be combined with the SOTA contrastive learning method, NNCLR, to achieve 72.3% ImageNet-1K top-1 accuracy with batch size 512 in 400 epochs, which represents a new SOTA in contrastive learning. We believe DCL provides a valuable baseline for future contrastive SSL studies.

Keywords: Contrastive learning, self-supervised learning

⋆ Corresponding author. E-mail: liutyng@iis.sinica.edu.tw

Fig. 1. An overview of the batch-size issue: general contrastive approaches need large batch sizes to perform well. (a) shows the NPC multiplier $q_B$ at different batch sizes. As the batch size increases, $q_B$ approaches 1 with a small coefficient of variation ($C_v = \sigma/\mu$); (b) illustrates the distribution of $q_B$ at various batch sizes and indicates that the mode of $q_B$ shifts towards 1 as the batch size increases. Note that $\sigma$ and $\mu$ are the standard deviation and mean of $q_B$, respectively. The coefficient of variation $C_v$ measures the dispersion of a frequency distribution.



Introduction

As a fundamental task in machine learning, representation learning aims to extract useful information from raw data for downstream tasks, and it has been a long-standing goal over the past decades. Recent progress in representation learning has achieved a significant milestone through self-supervised learning (SSL), which facilitates feature learning by exploiting massive raw data without any annotated supervision. In the early stage of SSL, representation learning focused on pretext tasks, which assign pseudo-labels to unlabeled data through different transformations, such as solving jigsaw puzzles [24], colorization [41], and rotation prediction [12]. Though these approaches succeed in computer vision, a large gap remains between them and supervised learning. Recently, there has been significant advancement in using contrastive learning [36,25,30,15,7] for self-supervised pre-training, which substantially closes the gap between SSL methods and supervised learning. Contrastive SSL methods, e.g., SimCLR [7], in general try to pull different views of the same instance close and push different instances far apart in the representation space.

Despite the evident progress of state-of-the-art contrastive SSL methods, several challenges remain for future development in this direction: 1) SOTA models, e.g., [15], may require specific structures such as a momentum encoder and large memory queues, which can complicate the underlying representation learning. 2) Contrastive SSL models, e.g., [7], often depend on large batch sizes and many training epochs to achieve competitive performance, posing a computational challenge for academia to explore this direction. 3) They tend to be sensitive to hyperparameters and optimizers, making it difficult to reproduce results on various benchmarks.

Through the analysis of the widely adopted InfoNCE loss in contrastive learning, we identify a negative-positive-coupling (NPC) multiplier $q_B$ in the gradient, as shown in Proposition 1. The NPC multiplier modulates the gradient of each sample, and it reduces the learning efficiency when the SSL classification task is easy: 1) when a positive sample is very close to the anchor; 2) when negative samples are far away from the anchor; and 3) when there is only a small number of negative samples (i.e., a small batch size). A less-informative (nearby) positive view would reduce the gradient from a batch of informative negative samples, or vice versa. Such coupling is exacerbated when smaller batch sizes are used.

Meanwhile, we also investigate the relationship between $q_B$ and batch size through the baseline, SimCLR. As can be seen in Figure 1, the distribution of $q_B$ has a strong positive correlation with the batch size. Figure 1(a) shows that as the batch size gradually increases, $q_B$ not only approaches 1 but also reduces its coefficient of variation $C_v$. A distribution with a larger $C_v$ has higher statistical dispersion, and vice versa. Figure 1(b) indicates that the mode of $q_B$ also shifts from 0 towards 1 as the batch size becomes larger. Hence, it is reasonable to fix the value of $q_B$, alleviating the influence of batch size.
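The trend in Figure 1 can be reproduced with a small self-contained simulation (a sketch, not the paper's code; the random unit-vector embeddings, the assumed positive similarity of 0.9, and $\tau = 0.1$ are illustrative assumptions):

```python
import math
import random

def q_npc(pos_sim, neg_sims, tau=0.1):
    """NPC multiplier q_B = U / (exp(pos/tau) + U), cf. Equation 4."""
    u = sum(math.exp(s / tau) for s in neg_sims)
    return u / (math.exp(pos_sim / tau) + u)

def random_unit(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def simulate_q(batch_size, dim=32, trials=100, tau=0.1, seed=0):
    """Mean and coefficient of variation of q_B for random embeddings."""
    rng = random.Random(seed)
    qs = []
    for _ in range(trials):
        anchor = random_unit(dim, rng)
        negs = [sum(a * b for a, b in zip(anchor, random_unit(dim, rng)))
                for _ in range(2 * (batch_size - 1))]  # 2 views per negative
        qs.append(q_npc(0.9, negs, tau))  # assume a well-aligned positive
    mean = sum(qs) / len(qs)
    std = math.sqrt(sum((q - mean) ** 2 for q in qs) / len(qs))
    return mean, std / mean

small = simulate_q(batch_size=8)
large = simulate_q(batch_size=256)
# Larger batches push the mean of q_B towards 1, matching Figure 1(a).
assert 0.0 < small[0] < large[0] < 1.0
```

Under these toy assumptions, the mean of $q_B$ grows with batch size, consistent with the shift of the distribution towards 1 in Figure 1.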

By removing the coupling term from the InfoNCE loss, we reach a new formulation, decoupled contrastive learning (DCL). The new objective function significantly improves training efficiency with less sensitivity to sub-optimal hyper-parameters, and requires neither large batches, nor momentum encoding, nor long training to achieve competitive performance on various benchmarks. The main contributions of the proposed DCL can be characterized as follows:

  1. We provide both theoretical analysis and empirical evidence to show the NPC effect in the InfoNCE-based contrastive learning;
  2. We introduce the DCL objective, which removes the NPC effect, significantly improves training efficiency, and is less sensitive to suboptimal hyper-parameters;
  3. Extensive experiments show the effectiveness of the proposed method: DCL achieves competitive performance without large batch sizes, long training, momentum encoding, or additional tricks such as stop-gradient and multi-cropping. This makes it a plug-and-play improvement to the widely adopted InfoNCE-based contrastive learning;
  4. We show that DCL can be easily combined with the SOTA contrastive methods, e.g. NNCLR [10], to achieve further improvements.

Related Work

Contrastive Learning. Contrastive learning (CL) constructs positive and negative sample pairs to extract information from the data itself. In CL, each anchor image in a batch has only one positive sample with which to construct a positive sample



Fig. 2. Contrastive learning and negative-positive coupling (NPC). (a) In SimCLR, each sample $x_i$ has two augmented views $\{x_i^{(1)}, x_i^{(2)}\}$. They are encoded by the same encoder $f$ and further projected to $\{z_i^{(1)}, z_i^{(2)}\}$ by a normalized MLP. (b) According to Equation 3, for the view $x_i^{(1)}$, the cross-entropy loss $L_i^{(1)}$ leads to a positive force from $z_i^{(2)}$, which comes from the other view $x_i^{(2)}$ of $x_i$, and a negative force, which is a weighted average of all the negative samples, i.e., $\{z_j^{(l)} \mid l \in \{1,2\}, j \neq i\}$. However, the gradient $-\nabla_{z_i^{(2)}} L_i^{(1)}$ is proportional to the NPC multiplier. (c) We show two cases where the NPC term affects learning efficiency. On the top, the positive sample is close to the anchor and less informative; however, the gradient from the negative samples is also reduced. On the bottom, when the negative samples are far away and less informative, the learning signal from the positive sample is mistakenly reduced. In general, the NPC multiplier from the InfoNCE loss makes the SSL task simpler to solve, leading to reduced learning efficiency.

pair [14,7,15]. CPC [25] predicts the future output of sequential data by using the current output as prior knowledge, which improves the representational ability of the model. Instance discrimination [36] proposes a non-parametric cross-entropy loss to optimize the model at the instance level. Inv. spread [37] makes use of data-augmentation invariance and the spread-out property of instances to learn features. MoCo [15] proposes a dictionary to maintain a negative sample set, thus increasing the number of negative sample pairs. Different from the aforementioned self-supervised CL approaches, [20] proposes a supervised CL that considers all samples of the same category as positive pairs to increase the utility of images.

Collapsing Issue on the Number of Negatives. In CL, the objective is to maximize the mutual information between the positive pairs. However, to avoid a 'collapsing output', vast quantities of negative samples are needed so that the learned representation attains maximum similarity with positive pairs and minimum similarity with negative samples. For instance, SimCLR [7] requires many negative samples for training, leading to a large batch size (i.e., 4096); furthermore, to optimize such a huge batch, the specially designed LARS optimizer [38] is used. Similarly, MoCo [15] needs a vast queue (i.e., 65536) to achieve competitive performance. BYOL [13] avoids collapsed outputs without using any negative samples by treating all views as positive and maximizing the similarity between the 'projection' and 'prediction' features. On the other hand, SimSiam [9] leverages the Siamese network to introduce inductive biases for modeling invariance. With a small batch size (i.e., 256), SimSiam rivals BYOL (i.e., 4096). Unlike both approaches, which achieved their success through empirical studies, this paper tackles the problem from a theoretical perspective, showing that an intertwined multiplier $q_B$ coupling the positive and negative terms is a main issue in contrastive learning.

Batch Size Sensitivity on InfoNCE. Several works focus on batch-size sensitivity concerning the InfoNCE objective function. [32] proposes an objective based on relative predictive coding that balances training stability and batch-size sensitivity. [17] follows [3] and extends the idea to local and global features. [26] proposes a Wasserstein distance to prevent the encoder from learning any other differences between unpaired samples. [19] and [29] learn better representations by sampling hard negatives, particularly for small batches. Other recent works [42,11] aim to mitigate the small-batch issue of the InfoNCE loss. Although both these works and DCL derive from the InfoNCE objective, we provide a novel perspective showing that decoupling the positive and negative terms in the InfoNCE loss is essential: simply removing the positive term from the denominator can drastically improve performance and make the objective much less sensitive to batch size.

Decouple Negative and Positive Samples in Contrastive Learning


We choose to start from SimCLR because of its conceptual simplicity. Given a batch of $N$ samples (e.g., images) $\{x_1, \ldots, x_N\}$, let $x_i^{(1)}, x_i^{(2)}$ be two augmented views of the sample $x_i$ and $\mathcal{B}$ be the set of all of the augmented views in the batch, i.e., $\mathcal{B} = \{x_i^{(k)} \mid k \in \{1,2\}, i \in [1,N]\}$. As shown in Figure 2(a), each view $x_i^{(k)}$ is sent into the same encoder network $f$, and the output $h_i^{(k)} = f(x_i^{(k)})$ is then projected by a normalized MLP projector such that $z_i^{(k)} = g(h_i^{(k)}) / \lVert g(h_i^{(k)}) \rVert$. For each augmented view $x_i^{(k)}$, SimCLR solves a classification problem using the rest of the views in $\mathcal{B}$ as targets, assigning the only positive label to $x_i^{(u)}$, where $u \neq k$. SimCLR thus creates a cross-entropy loss function $L_i^{(k)}$ for each view $x_i^{(k)}$, and the overall loss function is $L = \sum_{k \in \{1,2\}, i \in [1,N]} L_i^{(k)}$.

$$
L_i^{(k)} = -\log \frac{\exp(\langle z_i^{(1)}, z_i^{(2)} \rangle / \tau)}{\exp(\langle z_i^{(1)}, z_i^{(2)} \rangle / \tau) + U_{i,k}} \tag{1}
$$

where

$$
U_{i,k} = \sum_{l \in \{1,2\}} \sum_{j \in [1,N], j \neq i} \exp(\langle z_i^{(k)}, z_j^{(l)} \rangle / \tau) \tag{2}
$$

means the summation of negative terms for the view $k$ of the sample $i$.
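In code, the per-view InfoNCE loss of Equations 1-2 can be sketched as follows (a minimal scalar version operating directly on precomputed cosine similarities; the toy similarity values are illustrative):

```python
import math

def u_neg(neg_sims, tau):
    """U_{i,k}: summed exp-similarities to the 2(N-1) negative views (Eq. 2)."""
    return sum(math.exp(s / tau) for s in neg_sims)

def infonce_loss(pos_sim, neg_sims, tau=0.1):
    """Per-view InfoNCE loss L_i^{(k)} (Eq. 1): note that the positive term
    appears in both the numerator and the denominator."""
    pos = math.exp(pos_sim / tau)
    return -math.log(pos / (pos + u_neg(neg_sims, tau)))

# Toy example: anchor vs. its positive at cosine 0.8, plus the
# four views of two other samples as negatives.
loss = infonce_loss(0.8, [0.1, -0.2, 0.05, 0.3])
assert loss > 0.0
# An easier task (closer positive) gives a smaller loss.
assert infonce_loss(0.95, [0.1, -0.2, 0.05, 0.3]) < loss
```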

Proposition 1: There exists a negative-positive-coupling (NPC) multiplier $q_{B,i}^{(1)}$ in the gradient of $L_i^{(1)}$:

$$
-\frac{\partial L_i^{(1)}}{\partial z_i^{(1)}} = q_{B,i}^{(1)} \Big( \frac{z_i^{(2)}}{\tau} - \sum_{l \in \{1,2\}} \sum_{j \neq i} \frac{\exp(\langle z_i^{(1)}, z_j^{(l)} \rangle / \tau)}{U_{i,1}} \cdot \frac{z_j^{(l)}}{\tau} \Big), \qquad -\frac{\partial L_i^{(1)}}{\partial z_i^{(2)}} = q_{B,i}^{(1)} \frac{z_i^{(1)}}{\tau} \tag{3}
$$

where

$$
q_{B,i}^{(1)} = 1 - \frac{\exp(\langle z_i^{(1)}, z_i^{(2)} \rangle / \tau)}{\exp(\langle z_i^{(1)}, z_i^{(2)} \rangle / \tau) + U_{i,1}} \tag{4}
$$

and $U_{i,1} = \sum_{l \in \{1,2\}} \sum_{j \in [1,N], j \neq i} \exp(\langle z_i^{(1)}, z_j^{(l)} \rangle / \tau)$. Due to symmetry, a similar NPC multiplier $q_{B,i}^{(k)}$ exists in the gradient of $L_i^{(k)}$, $k \in \{1,2\}$, $i \in [1,N]$.

As we can see, all of the partial gradients in Equation 3 are modulated by the common NPC multiplier $q_{B,i}^{(k)}$ defined in Equation 4. Equation 4 makes intuitive sense: when the SSL classification task is easy, the gradient is reduced by the NPC term. However, the positive and negative samples are strongly coupled. When the negative samples are far away and less informative (easy negatives), the gradient from an informative positive sample is reduced by the NPC multiplier $q_{B,i}^{(1)}$. On the other hand, when the positive sample is close (easy positive) and less informative, the gradient from a batch of informative negative samples is also reduced by the NPC multiplier. When the batch size is smaller, the SSL classification problem becomes significantly simpler to solve. As a result, learning efficiency can be significantly reduced in a small-batch setting.

Figure 1(b) shows the shift of the NPC multiplier $q_B$ distribution w.r.t. different batch sizes for a pre-trained SimCLR baseline model. While all of the shown distributions fluctuate considerably, smaller batch sizes make $q_B$ cluster towards 0, while larger batch sizes push the distribution towards $\delta(1)$. Figure 1(a) shows how the averaged NPC multiplier $\langle q_B \rangle$ changes w.r.t. the batch size, along with the relative fluctuation. Small batch sizes introduce significant NPC fluctuation. Based on this observation, we propose to remove the NPC multipliers from the gradients, which corresponds to the limiting case $q_B \to 1$ as $N \to \infty$. This leads to the decoupled contrastive learning formulation. [34] also proposes an alignment-and-uniformity loss that does not have the NPC effect. However, a similar analysis reveals a negative-negative coupling across different positive samples. In other words, [34] considers all the negative samples in the batch together, which may cause the gradient to be dominated by a specific negative pair. In Appendix 5, we provide a thorough discussion and demonstrate the advantage of the DCL loss over [34].
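Proposition 1's coupling can be checked numerically by treating the similarities as free scalars: the finite-difference gradient of InfoNCE with respect to the positive similarity equals the decoupled gradient scaled by the NPC multiplier (a sketch with arbitrary toy values, not the paper's code):

```python
import math

TAU = 0.1

def infonce(p, negs):
    """InfoNCE as a function of the positive similarity p (Eq. 1)."""
    e = math.exp(p / TAU)
    return -math.log(e / (e + sum(math.exp(s / TAU) for s in negs)))

def dcl(p, negs):
    """Decoupled loss: positive term removed from the denominator."""
    return -p / TAU + math.log(sum(math.exp(s / TAU) for s in negs))

def q_npc(p, negs):
    """NPC multiplier of Equation 4."""
    u = sum(math.exp(s / TAU) for s in negs)
    return u / (math.exp(p / TAU) + u)

def grad(f, p, negs, eps=1e-6):
    """Central finite-difference gradient w.r.t. p."""
    return (f(p + eps, negs) - f(p - eps, negs)) / (2 * eps)

p, negs = 0.6, [0.1, -0.3, 0.2]
g_info = grad(infonce, p, negs)
g_dcl = grad(dcl, p, negs)
# The InfoNCE gradient is the DCL gradient scaled by the NPC multiplier.
assert abs(g_info - q_npc(p, negs) * g_dcl) < 1e-4
```

This mirrors the modulation in Equation 3: removing $q_{B,i}^{(k)}$ from the InfoNCE gradient recovers exactly the DCL gradient.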

Proposition 2 (the DCL loss): Removing the positive pair from the denominator of Equation 1 leads to the decoupled contrastive learning loss. Equivalently, removing the NPC multiplier $q_{B,i}^{(k)}$ from Equation 3 yields the decoupled contrastive learning loss $L_{DC} = \sum_{k \in \{1,2\}, i \in [1,N]} L_{DC,i}^{(k)}$, where $L_{DC,i}^{(k)}$ is:

$$
L_{DC,i}^{(k)} = -\frac{\langle z_i^{(1)}, z_i^{(2)} \rangle}{\tau} + \log U_{i,k} \tag{5}
$$

i.e.,

$$
L_{DC,i}^{(k)} = -\frac{\langle z_i^{(1)}, z_i^{(2)} \rangle}{\tau} + \log \sum_{l \in \{1,2\}} \sum_{j \in [1,N], j \neq i} \exp(\langle z_i^{(k)}, z_j^{(l)} \rangle / \tau) \tag{6}
$$

The proofs of Propositions 1 and 2 are given in the Appendix. Further, we can generalize the loss function $L_{DC}$ to $L_{DCW}$ by introducing a weighting function for the positive pairs, i.e., $L_{DCW} = \sum_{k \in \{1,2\}, i \in [1,N]} L_{DCW,i}^{(k)}$:

$$
L_{DCW,i}^{(k)} = -w(z_i^{(1)}, z_i^{(2)}) \frac{\langle z_i^{(1)}, z_i^{(2)} \rangle}{\tau} + \log \sum_{l \in \{1,2\}} \sum_{j \in [1,N], j \neq i} \exp(\langle z_i^{(k)}, z_j^{(l)} \rangle / \tau) \tag{7}
$$

where we can intuitively choose $w$ to be a negative von Mises-Fisher weighting function, $w(z_i^{(1)}, z_i^{(2)}) = 2 - \exp(\langle z_i^{(1)}, z_i^{(2)} \rangle / \sigma) \,/\, \mathbb{E}_i[\exp(\langle z_i^{(1)}, z_i^{(2)} \rangle / \sigma)]$, so that $\mathbb{E}[w] = 1$. $L_{DC}$ is a special case of $L_{DCW}$, and we can see that $\lim_{\sigma \to \infty} L_{DCW} = L_{DC}$. The intuition behind $w(z_i^{(1)}, z_i^{(2)})$ is that there is more learning signal when a positive pair of samples are far from each other, while $\mathbb{E}[w(z_i^{(1)}, z_i^{(2)}) \langle z_i^{(1)}, z_i^{(2)} \rangle] \approx \mathbb{E}[\langle z_i^{(1)}, z_i^{(2)} \rangle]$. Other similar weighting functions provide similar results. In general, we find that such a weighting function, which gives larger weight to hard positives, tends to increase the representation quality.

Experiments

This section empirically evaluates the proposed decoupled contrastive learning (DCL) and compares it to general contrastive learning methods. We summarize the experiments and analysis as follows: (1) the proposed work significantly outperforms general InfoNCE-based contrastive learning on both large-scale and small-scale vision benchmarks; (2) we show that the enhanced version of DCL, DCLW, further improves the representation quality; and (3) we further analyze DCL with ablation studies on ImageNet-1K, hyperparameters, and few learning epochs, which show the fast convergence of the proposed DCL. Note that all the experiments are conducted with 8 Nvidia V100 GPUs on a single machine.

Implementation Details

ImageNet. For a fair comparison on ImageNet data, we implement the proposed decoupled structure, DCL, by following SimCLR [7] with ResNet-50 [16] as the encoder backbone, and use a cosine annealing schedule with an SGD optimizer. We set the temperature τ to 0.1 and the latent vector dimension to 128. Following

Fig. 3. Comparisons on ImageNet-1K with/without DCL under different numbers of (a): batch sizes for SimCLR and (b): queues for MoCo. Without DCL, the top-1 accuracy significantly drops when the batch size (SimCLR) or queue (MoCo) becomes very small. Note that the temperature τ is 0.1 for SimCLR and 0.07 for MoCo in the comparison.


the OpenSelfSup benchmark [40], we evaluate the pre-trained models by training a linear classifier on the frozen learned embeddings on ImageNet data. We further evaluate DCL on ImageNet-100, a selected subset of 100 classes of ImageNet-1K. Note that all models on ImageNet are trained for 200 epochs.

CIFAR and STL10. For CIFAR10, CIFAR100, and STL10, ResNet-18 [16] is used as the encoder architecture. Following the small-scale benchmark [35], we set the temperature τ to 0.07. All models are trained for 200 epochs with an SGD optimizer and a base lr = 0.03 × batch size / 256, and evaluated by a k-nearest-neighbor (kNN) classifier. Note that on STL10, we include both the train and unlabeled sets for model pre-training. We further use ResNet-50 as a stronger backbone by following the implementation of [28], with the same backbone and hyperparameters.
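The kNN evaluation protocol can be sketched as a cosine-similarity majority vote over a frozen feature bank (a toy illustration; the benchmark's actual protocol and hyperparameters, e.g. the number of neighbours and vote weighting, may differ):

```python
import math
from collections import Counter

def knn_predict(query, bank_feats, bank_labels, k=5):
    """Cosine-similarity kNN over a frozen feature bank: a minimal sketch
    of the evaluation step (real benchmarks typically use a weighted vote
    over many more neighbours)."""
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)
    # Rank bank entries by similarity to the query embedding.
    order = sorted(range(len(bank_feats)),
                   key=lambda i: cos(query, bank_feats[i]), reverse=True)
    votes = Counter(bank_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D feature bank with two classes.
feats = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
assert knn_predict([0.95, 0.1], feats, labels, k=3) == 0
```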

Experiments and Analysis

DCL on ImageNet. This section illustrates the effect of DCL against InfoNCE-based approaches under different batch sizes and queue sizes. The initial setup uses a batch size of 1024 (SimCLR) and a queue of 65536 (MoCo [15]), then gradually reduces the batch size (SimCLR) and queue (MoCo) to show the corresponding top-1 accuracy under linear evaluation. Figure 3 indicates that without DCL, the top-1 accuracy drastically drops when the batch size (SimCLR) or queue (MoCo) becomes very small, while with DCL, the performance stays steadier than the baselines (SimCLR: -4.1% vs. -8.3%; MoCo: -0.4% vs. -5.9%).

Specifically, Figure 3 further shows that in SimCLR, the performance with DCL improves from 61.8% to 65.9% under batch size 256; MoCo with DCL

Table 1. Comparisons with/without DCL under different batch sizes from 32 to 512. Results show the effectiveness of DCL on five widely used benchmarks. The performance of DCL stays steadier than the SimCLR baseline as the batch size varies.

improves from 54.7% to 60.8% under a queue size of 256. The comparison fully demonstrates the necessity of DCL, especially when the number of negatives is small. Even when the batch size increases to 1024, DCL (66.1%) still improves over the SimCLR baseline (65.1%).

We further observe the same phenomenon on ImageNet-100 data. Table 1 shows that, with DCL, the top-1 linear performance drops only 2.3% as the batch size varies, compared to 7.1% for the InfoNCE baseline (SimCLR).

In summary, it is worth noting that when the batch size is small, the strength of $q_{B,i}$, which pushes the negative samples away from the positive sample, is also relatively weak. This tends to reduce the efficiency of representation learning, while taking advantage of DCL alleviates the performance gap between small and large batch sizes. Hence, through this analysis, we find that DCL directly tackles the batch-size issue in contrastive learning. With this considerable advantage, general SSL approaches can be implemented with fewer computational resources or on more modest platforms. Compared to InfoNCE, DCL is more applicable across large-scale SSL applications.

DCL on CIFAR and STL10. For STL10, CIFAR10, and CIFAR100, we implement DCL with ResNet-18 as the encoder backbone. In Table 1, it can be observed that DCL also demonstrates strong effectiveness on small-scale benchmarks. In the evaluation (kNN / linear) summary, DCL outperforms its baseline by 4.8% / 5.3% (CIFAR10) and 1.7% / 4.4% (CIFAR100) under a small batch size of 32.

Table 2. Comparisons between the SimCLR baseline, DCL, and DCLW. The linear and kNN top-1 (%) results indicate that DCL improves baseline performance, and DCLW provides a further boost. Note that results are under batch size 256 and 200 epochs. All models are both trained and evaluated with the same experimental settings. The backbones are ResNet-18 and ResNet-50 for CIFAR and ImageNet, respectively.

Table 3. Improving the DCL model performance on ImageNet-1K with tuned hyperparameters (temperature and learning rate) and stronger image augmentation. Note that models are trained with batch size 256 and 200 epochs.

The accuracy (kNN / linear) of the SimCLR baseline on STL10 is also improved significantly, by 7.9% / 9.0%.

Decoupled Objective with Re-Weighting (DCLW). We simply replace $L_{DC}$ with $L_{DCW}$, with no possible advantage from additional tricks. Both DCL and the baselines follow the same training protocol of the OpenSelfSup benchmark for fairness. Note that we empirically choose σ = 0.5 in the experiments. Results in Table 2 indicate that DCLW achieves extra gains of 5.1% (ImageNet-1K) and 3.5% (ImageNet-100) compared to the baseline. On CIFAR data, an extra 3.4% (CIFAR10) and 3.2% (CIFAR100) is gained with DCLW. It is worth noting that, trained with 200 epochs, DCLW reaches 66.9% with batch size 256, surpassing the SimCLR baseline of 66.2% with batch size 8192.

Ablations

We perform extensive ablations on the hyperparameters of DCL on both ImageNet data and other small-scale data, i.e., CIFAR and STL10. By seeking better configurations empirically, we see that DCL gives consistent gains over the standard InfoNCE baselines (SimCLR and MoCo-v2). In other ablations, we see that DCL achieves further gains over both SimCLR and MoCo-v2, i.e., the InfoNCE-based baselines, also when training for only 100 epochs.

DCL Ablations on ImageNet. In Table 3, we slightly improve the DCL model performance on ImageNet-1K via: 1) tuned hyperparameters (temperature τ and learning rate); 2) asymmetric image augmentation (e.g., as in BYOL). To obtain a stronger baseline, we conduct an empirical hyperparameter search with batch size 256 and 200 epochs. This improves DCL from 65.9% to 67.8% top-1 accuracy

Table 4. The comparisons with/without DCL under various batch sizes from 32 to 512 on ResNet-50.

Table 5. Linear top-1 accuracy ( % ) comparison with MoCo-V2 on ImageNet-1K and ImageNet-100.

on ImageNet-1K. We further adopt the asymmetric augmentation policy from BYOL and improve DCL from 67.8% to 68.2% top-1 accuracy on ImageNet-1K.

DCL Ablations on CIFAR. Further experiments are conducted with the ResNet-50 backbone and longer training (i.e., 500 epochs). The DCL model with kNN evaluation, batch size 32, and 500 epochs of training reaches 86.1%, compared to 82.2% for the baseline. In Table 4, we show DCL ResNet-50 performance on CIFAR10 and CIFAR100, varying the batch size to show the effectiveness of DCL.

MoCo-v2 with DCL. We are aware that it is more convincing to compare the proposed DCL against a more compelling version, MoCo-v2. Comparisons on both ImageNet-1K and ImageNet-100 in Table 5 indicate that DCL becomes significantly more effective than MoCo-v2 when the queue size gets smaller.

Few Learning Epochs. DCL alleviates a shortcoming of the traditional contrastive learning framework, which needs a large batch size and long training to achieve high performance. The previous state of the art, SimCLR, relies heavily on many training epochs to obtain high top-1 accuracy (e.g., 69.3% with up to 1000 epochs). DCL aims to achieve higher learning efficiency with few learning epochs. We demonstrate the effectiveness of DCL in the InfoNCE-based frameworks SimCLR and MoCo-v2 [8]. We choose a batch size of 256 (queue of 65536) as the baseline and train the models for only 100 epochs, keeping all other settings the same for a fair comparison. Table 6 shows the results on ImageNet-1K using linear evaluation. With DCL, SimCLR achieves 64.6% top-1 accuracy with only 100 epochs, compared to the SimCLR baseline of 57.5%; MoCo-v2 with DCL reaches 64.4%, compared to the MoCo-v2 baseline of 63.6%, with 100 epochs of pre-training.

Table 6. ImageNet-1K top-1 accuracy (%) of SimCLR and MoCo-v2 with/without DCL under few training epochs. We further list results under 200 epochs for clear comparison. With DCL, the performance of SimCLR trained for 100 epochs nearly reaches its performance at 200 epochs. MoCo-v2 with DCL also reaches higher accuracy than the baseline at 100 epochs.

Fig. 4. Comparisons between DCL and InfoNCE-based baseline (SimCLR) on (a) CIFAR10 and (b) STL10 data. DCL speeds up the model convergence during the SSL pre-training and provides better performance than the baseline on CIFAR and STL10 data. (c) t-SNE visualization of CIFAR10 with 32 batch size. DCL shows a stronger separation force between the features than SimCLR.


We further demonstrate that, with DCL, representation learning becomes faster during the early stage of training compared to the InfoNCE-based learning scheme. The reason is that DCL removes the coupling between positive and negative pairs. Figure 4 on (a) CIFAR10 and (b) STL10 shows that DCL improves the speed of convergence and reaches higher performance than the baseline on CIFAR and STL10 data. The t-SNE visualization in Figure 4(c) also supports the proposed theoretical derivation that removing the batch-size-dependent impact (i.e., the NPC multiplier) improves representation learning over the InfoNCE-based learning scheme.

Table 7. Linear top-1 accuracy (%) comparison of SSL approaches on ImageNet-1K. Given a lower computational budget, DCL models are better than recent SOTA approaches. Their effectiveness does not rely on large batch sizes and epochs (SimCLR [7], NNCLR [10]), momentum encoding (BYOL [13], MoCo-v2 [8]), or other tricks such as stop-gradient (SimSiam [9]) and multi-cropping (SwAV [5]).

Discussion

Comparison with other SOTA SSL Approaches. The primary goal of this work is to provide an efficient and effective improvement to the widely used InfoNCE-based contrastive learning, where we decouple the positive and negative terms to achieve better representation quality. DCL is less sensitive to suboptimal hyperparameters and achieves competitive results with minimal requirements. Its effectiveness does not rely on large batch sizes and learning epochs, momentum encoding, negative sample queues, or additional tactics (e.g., stop-gradient and multi-cropping). Overall, DCL provides a more robust baseline for contrastive-based SSL approaches. Though this work does not aim to provide a SOTA SSL approach, DCL can be combined with SOTA contrastive learning methods, such as NNCLR [10], to achieve better performance without large batch sizes and learning epochs. In Table 7, we provide extensive comparisons to SOTA SSL approaches on ImageNet-1K to validate the effectiveness of DCL. In Table 8, we further show that DCL achieves competitive results compared to VICReg [2], Barlow Twins [39], SimSiam [9], SwAV [4], and DINO [6] on ImageNet-100 and CIFAR10.

Generalization of DCL to Different Domains. DCL can be easily adapted to different domains (e.g., speech and language models) to achieve competitive performance. We demonstrate that DCL can be combined with SOTA SSL speech models, e.g., wav2vec 2.0 [1], which uses a transformer backbone and requires enormous computation resources. We evaluate wav2vec 2.0 on its downstream tasks and observe improved performance when applying the DCL loss. Detailed results and discussion can be found in the Appendix. DCL can also potentially be combined with a transformer-based language model, CLIP [27], which uses a very large batch size of 32768. With DCL, CLIP should maintain its task complexity and gain substantial learning efficiency when the batch size becomes smaller. Note that this has been implemented by [33].

DCL Convergence for Large Batch Sizes. The performance gain of DCL over the InfoNCE-based baseline appears smaller when the batch size is large. According to Figure 1 and the theoretical analysis, the reason is that the NPC multiplier $q_B \to 1$ when the batch size is large (e.g., 1024). As shown in the analysis, the InfoNCE loss converges to the DCL loss as the batch size approaches

Table 8. kNN & linear top-1 accuracy ( % ) comparison of SSL approaches on CIFAR10 and ImageNet-100.

Table 9. Results of DCL and SimCLR with large batch size and learning epochs.

infinity. With 400 training epochs, the ImageNet-1K top-1 accuracy slightly increases from 69.5% to 69.9% when the batch size increases from 256 to 1024. Please refer to Table 9.

Conclusion

This paper identifies the negative-positive-coupling (NPC) effect in the widely used InfoNCE loss, which makes the SSL task significantly easier as the batch size decreases. By removing the NPC effect, we reach a new objective function, decoupled contrastive learning (DCL). The proposed DCL loss requires minimal modification to the SimCLR baseline and provides efficient, reliable, and nontrivial performance improvements on various benchmarks. DCL is conceptually simple and requires neither momentum encoding, a large batch size, nor long training epochs to reach competitive performance. Notably, DCL can be combined with the SOTA contrastive learning method, NNCLR, to achieve 72.3% ImageNet-1K top-1 accuracy with a batch size of 512 in 400 epochs. We hope DCL can serve as a strong baseline for contrastive-based SSL methods. Further, an important lesson from the DCL loss is that a more efficient SSL task should maintain its complexity when the batch size becomes smaller.

Acknowledgements . This work was supported in part by the MOST grants 110-2634-F-007-027, 110-2221-E-001-017 and 111-2221-E-001-015 of Taiwan. We are grateful to National Center for High-performance Computing and Meta AI Research for providing computational resources and facilities.

Appendix

Proof of Proposition 1

Proposition 1. There exists a negative-positive coupling (NPC) multiplier $q_{B,i}^{(1)}$ in the gradient of $\mathcal{L}_i^{(1)}$.

Proof. Expanding the gradient of $\mathcal{L}_i^{(1)}$ term by term, we can easily see that $\sum_{q\in\{1,2\},\, j\in[\![1,N]\!],\, j\neq i} \exp\big(\langle z_i^{(1)}, z_j^{(q)}\rangle/\tau\big)\, U_{i,1} = 1$.

Proof of Proposition 2

Proposition 2. Removing the positive pair from the denominator of Equation 3 leads to a decoupled contrastive learning loss. If we remove the NPC multiplier $q_{B,i}^{(k)}$ from Equation 3, we reach a decoupled contrastive learning loss $\mathcal{L}_{DC} = \sum_{k\in\{1,2\},\, i\in[\![1,N]\!]} \mathcal{L}_{DC,i}^{(k)}$, where $\mathcal{L}_{DC,i}^{(k)}$ is:

$$
\mathcal{L}_{DC,i}^{(k)} = -\big\langle z_i^{(1)}, z_i^{(2)}\big\rangle/\tau \;+\; \log \sum_{l\in\{1,2\},\, j\in[\![1,N]\!],\, j\neq i} \exp\big(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\big)
$$

Proof. By removing the positive term in the denominator of Equation 1, we can repeat the procedure in the proof of Proposition 1 and see that the coupling term disappears.
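To make the decoupled objective concrete, the following is a minimal, dependency-free Python sketch of the per-batch DCL loss described above. The function and variable names are ours, and real training code would operate on GPU tensors; this only illustrates the structure of the loss, in which the positive similarity is excluded from the denominator.

```python
import math

def dcl_loss(z1, z2, tau=0.1):
    """Decoupled contrastive loss (sketch): the positive similarity is
    removed from the denominator of the InfoNCE softmax.

    z1, z2: lists of L2-normalized embedding vectors (two views per sample).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(z1)
    total = 0.0
    # Sum over both views (k = 1, 2) and all samples in the batch.
    for anchors, positives in ((z1, z2), (z2, z1)):
        for i in range(n):
            pos = dot(anchors[i], positives[i]) / tau
            # Negatives: both views of every other sample; the positive
            # pair is excluded, which removes the NPC multiplier.
            neg = sum(
                math.exp(dot(anchors[i], z[j]) / tau)
                for z in (z1, z2)
                for j in range(n)
                if j != i
            )
            total += -pos + math.log(neg)
    return total
```

As a sanity check, a batch whose positive views are perfectly aligned yields a lower loss than the same batch with misaligned positives.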

Linear classification on ImageNet-1K

In Table 10, we report top-1 linear-evaluation accuracies and compare with state-of-the-art SSL approaches on ImageNet-1K. For fairness, we list each approach's batch size and number of learning epochs as reported in the original paper. During pre-training, DCL is based on a ResNet-50 backbone with two 224 × 224 views. DCL relies on its simplicity to reach competitive performance without relatively large batch sizes and epochs or other pre-training schemes, i.e., a momentum encoder, clustering, or a prediction head. We report a 400-epoch version of DCL combined with NNCLR [10]. It achieves 71.1% under a batch size of 256 and 400-epoch pre-training, which is better than NNCLR [10] in its optimal case, 68.7% with a batch size of 256 and 1000 epochs. Note that SwAV [4], BYOL [13], SimCLR, and PIRL [22] need a huge batch size of 4096, and SwAV further applies multi-cropping of extra views to reach optimal performance. The results of SwAV are taken from SimSiam, where multi-cropping is not included.

Implementation Details

ImageNet. For a fair comparison on ImageNet data, we implement the proposed decoupled structure, DCL, by following SimCLR [7] with ResNet-50 [16] as the encoder backbone, and we use a cosine annealing schedule with an SGD optimizer. We set the temperature τ to 0.1 and the latent vector dimension to 128. Following

Fig. 3. Comparisons on ImageNet-1K with/without DCL under different numbers of (a) batch sizes for SimCLR and (b) queue sizes for MoCo. Without DCL, the top-1 accuracy drops significantly when the batch size (SimCLR) or queue (MoCo) becomes very small. Note that the temperature τ is 0.1 for SimCLR and 0.07 for MoCo in this comparison.


the OpenSelfSup benchmark [40], we evaluate the pre-trained models by training a linear classifier with frozen learned embedding on ImageNet data. We further consider evaluating DCL on ImageNet-100, a selected subset of 100 classes of ImageNet-1K. Note that all models on ImageNet are trained for 200 epochs.

CIFAR and STL10. For CIFAR10, CIFAR100, and STL10, ResNet-18 [16] is used as the encoder architecture. Following the small-scale benchmark [35], we set the temperature τ to 0.07. All models are trained for 200 epochs with an SGD optimizer and a base lr = 0.03 · batchsize/256, and evaluated by a k-nearest-neighbor (kNN) classifier. Note that on STL10, we include both the train and unlabeled sets for model pre-training. We further use ResNet-50 as a stronger backbone by following the implementation [28], using the same backbone and hyperparameters.
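The linear learning-rate scaling rule above can be written out explicitly (a trivial helper; the function name is ours):

```python
def scaled_base_lr(batch_size, base_lr=0.03, ref_batch=256):
    """Linear scaling rule used for CIFAR/STL10 pre-training:
    lr = 0.03 * batch_size / 256."""
    return base_lr * batch_size / ref_batch
```

For example, a batch size of 512 gives a base learning rate of 0.06.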

Default DCL augmentations.


Strong DCL augmentations.

As a fundamental task in machine learning, representation learning aims to extract useful information from raw data for downstream tasks and has been regarded as a long-standing goal over the past decades. Recent progress in representation learning has reached a significant milestone with self-supervised learning (SSL), which facilitates feature learning by exploiting massive raw data without any annotated supervision. In the early stage of SSL, representation learning focused on pretext tasks, which generate pseudo-labels for the unlabeled data through different transformations, such as solving jigsaw puzzles [24], colorization [41], and rotation prediction [12]. Though these approaches succeed in computer vision, there is a large gap between them and supervised learning. Recently, there has been significant advancement in using contrastive learning [36,25,30,15,7] for self-supervised pre-training, which substantially closes the gap between SSL methods and supervised learning. Contrastive SSL methods, e.g., SimCLR [7], in general try to pull different views of the same instance close and push different instances far apart in the representation space.

Despite the evident progress of state-of-the-art contrastive SSL methods, several challenges remain for future development in this direction: 1) SOTA models, e.g., [15], may require specific structures such as a momentum encoder and large memory queues, which may complicate the underlying representation learning. 2) Contrastive SSL models, e.g., [7], often depend on a large batch size and huge epoch numbers to achieve competitive

performance, posing a computational challenge for academia to explore this direction. 3) They tend to be sensitive to hyperparameters and optimizers, introducing additional difficulty reproducing the results on various benchmarks.

Through the analysis of the widely adopted InfoNCE loss in contrastive learning, we identified a negative-positive-coupling (NPC) multiplier $q_B$ in the gradient, as shown in Proposition 1. The NPC multiplier modulates the gradient of each sample, and it reduces the learning efficiency for easy SSL classification tasks: 1) when a positive sample is very close to the anchor; 2) when negative samples are far away from the anchor; and 3) when there is only a small number of negative samples (i.e., a small batch size). A less-informative (nearby) positive view reduces the gradient from a batch of informative negative samples, and vice versa. Such coupling is exacerbated when smaller batch sizes are used.

Meanwhile, we also investigate the relationship between $q_B$ and batch size through the baseline, SimCLR. As can be seen in Figure 1, the distribution of $q_B$ has a strong positive correlation with the batch size. Figure 1(a) shows that as the batch size increases, $q_B$ not only approaches 1 but also has a reduced coefficient of variation $C_v$; a distribution with a larger $C_v$ has higher statistical dispersion, and vice versa. Figure 1(b) indicates that the mode of $q_B$ also shifts from 0 to 1 as the batch size grows. Hence, it is reasonable to fix the value of $q_B$, alleviating the influence of the batch size.
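The batch-size dependence of $q_B$ can be reproduced in a few lines. Below is a small simulation of our own construction, assuming $q_B$ is the softmax probability mass an anchor assigns to its negatives, with random unit embeddings and noisy-copy positive views; the helper names and noise scale are illustrative choices, not the paper's setup:

```python
import math
import random

def _unit(d, rng):
    # Random unit vector in d dimensions.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]

def mean_npc_multiplier(batch_size, tau=0.1, d=16, seed=0):
    """Average NPC multiplier q_B over one random batch: the softmax mass
    on the 2(N-1) negatives, with positives as noisy copies of anchors."""
    rng = random.Random(seed)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    z1 = [_unit(d, rng) for _ in range(batch_size)]
    # Positive views: anchor plus small noise, re-normalized.
    z2 = []
    for z in z1:
        noisy = [x + 0.1 * rng.gauss(0.0, 1.0) for x in z]
        s = math.sqrt(sum(x * x for x in noisy))
        z2.append([x / s for x in noisy])

    qs = []
    for i in range(batch_size):
        pos = math.exp(dot(z1[i], z2[i]) / tau)
        neg = sum(math.exp(dot(z1[i], z[j]) / tau)
                  for z in (z1, z2) for j in range(batch_size) if j != i)
        qs.append(neg / (pos + neg))
    return sum(qs) / batch_size
```

With these settings, the average $q_B$ grows toward 1 as the batch size increases, mirroring the trend in Figure 1.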

By removing the coupling term from the InfoNCE loss, we reach a new formulation, decoupled contrastive learning (DCL). The new objective function significantly improves training efficiency with less sensitivity to sub-optimal hyper-parameters; it requires neither large batches, momentum encoding, nor large numbers of epochs to achieve competitive performance on various benchmarks. The main contributions of the proposed DCL can be characterized as follows:

  1. We provide both theoretical analysis and empirical evidence to show the NPC effect in the InfoNCE-based contrastive learning;
  2. We introduce the DCL objective, which casts off the NPC phenomenon, significantly improves training efficiency, and is less sensitive to suboptimal hyper-parameters;
  3. Extensive experiments show the effectiveness of the proposed method: DCL achieves competitive performance without large batch sizes, long training epochs, momentum encoding, or additional tricks such as stop-gradient and multi-cropping. This leads to a plug-and-play improvement to the widely adopted InfoNCE-based contrastive learning;
  4. We show that DCL can be easily combined with the SOTA contrastive methods, e.g. NNCLR [10], to achieve further improvements.
Linear evaluation.


Relation to alignment and uniformity

In this section, we provide a thorough discussion of the connection and difference between DCL and Hypersphere [34], which does not have negative-positive coupling either. However, there is a critical difference between DCL and Hypersphere: the order of the expectation and the exponential is swapped. Let us assume the latent embedding vectors z are normalized for analytical convenience. When $z_i$, $z_j$ are normalized, $\exp(\langle z_i^{(k)}, z_i^{(l)}\rangle/\tau)$ and $\exp(-\|z_i^{(k)} - z_i^{(l)}\|^2/\tau)$ are the same, except for a trivial scale difference. Thus we can write $\mathcal{L}_{DCL}$ and $\mathcal{L}_{align\text{-}uni}$ in a similar fashion:

$$
\mathcal{L}_{DCL} = \mathcal{L}_{DCL,pos} + \mathcal{L}_{DCL,neg}, \qquad \mathcal{L}_{align\text{-}uni} = \mathcal{L}_{align} + \lambda\,\mathcal{L}_{uniform},
$$

where

$$
\mathcal{L}_{DCL,pos} = -\frac{1}{\tau}\sum_{i}\big\langle z_i^{(1)}, z_i^{(2)}\big\rangle, \qquad \mathcal{L}_{align} = \mathbb{E}_i\,\big\| z_i^{(1)} - z_i^{(2)} \big\|^2.
$$

With the right weight factor, $\mathcal{L}_{align}$ can be made exactly the same as $\mathcal{L}_{DCL,pos}$. So let us focus on $\mathcal{L}_{DCL,neg}$ and $\mathcal{L}_{uniform}$:

$$
\mathcal{L}_{DCL,neg} = \sum_{i}\log \sum_{l\in\{1,2\},\, j\neq i} \exp\big(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\big),
$$

$$
\mathcal{L}_{uniform} = \log \sum_{i}\sum_{l\in\{1,2\},\, j\neq i} \exp\big(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\big).
$$

Similar to the earlier analysis in the manuscript, the latter L uniform introduces a negative-negative coupling between the negative samples of different positive samples. If two negative samples of z i are close to each other, the gradient for z i would also be attenuated. This behaves similarly to the negative-positive coupling. That being said, while Hypersphere does not have a negative-positive coupling, it has a similarly problematic negative-negative coupling.

A simple case can demonstrate the negative-negative coupling in [34]. Let us assume the model has a batch size of 3 and temperature $\tau = 1$. Then $\mathcal{L}_{DCL,neg}$ and $\mathcal{L}_{uniform}$ can be written as follows:

$$
\mathcal{L}_{DCL,neg} = \sum_{i=1}^{3}\log \sum_{l\in\{1,2\},\, j\neq i} \exp\big(\langle z_i^{(k)}, z_j^{(l)}\rangle\big),
$$

$$
\mathcal{L}_{uniform} = \log \sum_{i=1}^{3}\sum_{l\in\{1,2\},\, j\neq i} \exp\big(\langle z_i^{(k)}, z_j^{(l)}\rangle\big).
$$

Table 11. STL10 comparisons of Hypersphere and DCL under the same experiment setting.

If the value of $\exp(\langle z_1^{(k)}, z_3^{(l)}\rangle)$ is much larger than the other terms (e.g., for hard negatives), there is a huge difference between $\mathcal{L}_{DCL,neg}$ and $\mathcal{L}_{uniform}$. Since $\mathcal{L}_{uniform}$ first sums all the negative pairs in the batch together, the loss may be dominated by a specific negative pair. Thus, in the DCL loss, the negative samples of different positives are not coupled, in contrast to the uniformity loss in [34].
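This coupling can be checked numerically. In the toy sketch below (scalar similarities chosen by us for illustration), inflating one hard-negative similarity collapses the gradient weight that the uniformity-style loss assigns to an unrelated negative pair, while the per-anchor normalization of the DCL loss leaves it untouched:

```python
import math

def weight_uniform(sims, i, j):
    # Softmax weight of pair (i, j) under a single log-sum over ALL pairs.
    total = sum(math.exp(s) for row in sims.values() for s in row.values())
    return math.exp(sims[i][j]) / total

def weight_dcl(sims, i, j):
    # Softmax weight of pair (i, j) normalized only over anchor i's negatives.
    return math.exp(sims[i][j]) / sum(math.exp(s) for s in sims[i].values())

# sims[i][j] plays the role of <z_i, z_j>/tau for three anchors.
easy = {1: {2: 0.5, 3: 0.2}, 2: {1: 0.5, 3: 0.1}, 3: {1: 0.2, 2: 0.1}}
# Same batch, except that pair (1, 3) is now a hard negative.
hard = {1: {2: 0.5, 3: 8.0}, 2: {1: 0.5, 3: 0.1}, 3: {1: 8.0, 2: 0.1}}
```

Under the uniformity-style weighting, the weight of anchor 2's negative pair (2, 3) shrinks drastically once pair (1, 3) becomes hard, because the hard pair dominates the shared denominator; under the DCL weighting, anchor 2's weights are unchanged.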

Next, we provide a comprehensive empirical comparison. The empirical experiments match the analytical prediction: DCL outperforms Hypersphere by a larger margin under smaller batch sizes.

The comparisons of DCL to Hypersphere are evaluated on STL10, ImageNet-100, and ImageNet-1K under various settings. For STL10 data, we implement DCL based on the official code of Hypersphere. The encoder and the hyperparameters are the same as for Hypersphere and have not been optimized for DCL in any way. We found that Hypersphere performed a fairly thorough hyperparameter search, so we believe the default hyperparameters are relatively well optimized for Hypersphere.

In Table 11, DCL reaches 84.4% (fc7+Linear) compared to 83.2% (fc7+Linear) reported by Hypersphere on STL10. In Table 12 and Table 13, DCL achieves better performance than Hypersphere under the same settings (MoCo & MoCo-v2) on ImageNet-100 data. DCL further shows strong results compared with Hypersphere on ImageNet-1K in Table 14. We also provide STL10 comparisons of DCL and Hypersphere under different batch sizes in Table 15. The experiment shows that the advantage of DCL becomes larger with smaller batch sizes. Please note that we did not tune the parameters for DCL at all, so this should be a more than fair comparison.

In every single one of the experiments, DCL outperforms Hypersphere. Although the difference between DCL and Hypersphere is slight, it makes it easier for DCL to avoid domination by a specific negative pair within a batch. We hope these results show the unique value of DCL compared to Hypersphere.

Table 16. Results of DCL on wav2vec 2.0 evaluated on two downstream tasks.

† In the downstream training process, the pre-trained representations are first mean-pooled and then passed through a fully connected layer trained with a cross-entropy loss on VoxCeleb1 [23].

DCL on speech models

The SOTA SSL speech models, e.g., wav2vec 2.0 [1], still use a contrastive loss in the objective function. In Table 16, we show the effectiveness of DCL with wav2vec 2.0 [1]. We replace the InfoNCE loss with the DCL loss and train a wav2vec 2.0 base model (i.e., 7-Conv + 24-Transformer) from scratch. 6 After pre-training the model, we evaluate the representation on two downstream tasks, speaker identification and intent classification. Table 16 shows the representation improvement brought by DCL.

Supervised classifier: DCL vs Cross-Entropy

The idea of DCL, removing the positive term from the denominator, can also be applied to the learning objective of a supervised classifier. Following [18], we implement the DCL idea on the cross-entropy loss by removing the positive logit from the denominator of the softmax function. In Table 17, we observe that our supervised DCL achieves slightly lower performance than cross-entropy on CIFAR data. One possible reason for the weaker performance of DCL in supervised learning is the different feature interaction between supervised and unsupervised classifiers, which are referred to as parametric and non-parametric classifiers in [36].

Under the parametric formulation in [36], the logits equal $w^{\top} z$, where $w$ is a weight vector for each class and $z$ is the output embedding of the neural network. In contrastive learning (i.e., a non-parametric classifier), the logits equal $\langle z^{(1)}, z^{(2)}\rangle$, where $z^{(1)}$ and $z^{(2)}$ are two augmented views of the same sample. In the embedding space of the early training stage, $w$ is relatively far away from $z$ compared to the relation between $z^{(1)}$ and $z^{(2)}$. Considering the effect of the NPC multiplier $q_B$ on the parametric and non-parametric classifiers, $q_B \to 1$ in the parametric classifier, which might diminish the effectiveness of the DCL idea since the coupling effect is already tiny.
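A minimal sketch of the supervised variant described above (our own scalar-logit illustration; the actual implementation following [18] operates on batched logits): the only change to standard cross-entropy is that the target-class logit is dropped from the softmax denominator.

```python
import math

def cross_entropy(logits, target):
    # Standard softmax cross-entropy: the positive (target) logit
    # appears in both the numerator and the denominator.
    return -logits[target] + math.log(sum(math.exp(l) for l in logits))

def dcl_cross_entropy(logits, target):
    # Supervised DCL variant: the positive logit is removed from
    # the denominator, decoupling it from the negative logits.
    return -logits[target] + math.log(
        sum(math.exp(l) for i, l in enumerate(logits) if i != target))
```

Because the denominator shrinks, the DCL variant always yields a lower loss value than cross-entropy for the same logits.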

Ablations of DCLW

Based on the weighting function for positive pairs in Section 3 of the manuscript, we provide another weighting function for DCLW:

$$
\mathcal{L}_{DCW,i}^{(k)} = -\, w\big(z_i^{(1)}, z_i^{(2)}\big)\,\big\langle z_i^{(1)}, z_i^{(2)}\big\rangle/\tau + \log \sum_{l\in\{1,2\},\, j\neq i} \exp\big(\langle z_i^{(k)}, z_j^{(l)}\rangle/\tau\big),
$$

where $w(z_i^{(1)}, z_i^{(2)}) = \delta \cdot \exp(-\sigma \cdot \langle z_i^{(1)}, z_i^{(2)}\rangle)$. The goal is similar to DCLW: we assign larger weights to hard positives (e.g., a positive pair whose samples are far away from each other).

The results indicate that δ = 3 and σ = 0.5 achieve 85.4% kNN top-1 accuracy on CIFAR10 and outperform the InfoNCE baseline (SimCLR) by 4%.
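The ablated weighting function can be sketched directly (the helper name is ours):

```python
import math

def dclw_weight(pos_sim, delta=3.0, sigma=0.5):
    """Ablated DCLW weight w = delta * exp(-sigma * <z1, z2>):
    hard positives (low inner-product similarity) get larger weights."""
    return delta * math.exp(-sigma * pos_sim)
```

With δ = 3 and σ = 0.5, a hard positive with similarity −0.5 receives a larger weight than an easy positive with similarity 0.9.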

6 The experiment is downscaled to 8 V100 GPUs rather than 64.

Table 18. The ablation study of various temperatures τ on CIFAR10.

Fig. 5. (a) The coefficient of variation ($C_v = \sigma/\mu$) of the gradient and (b) the mean gradient norm with its std for the baseline (InfoNCE) and the proposed method (DCL) under different batch sizes.


Additional Discussion


Contrastive Learning. Contrastive learning (CL) constructs positive and negative sample pairs to extract information from the data itself. In CL, each anchor image in a batch has only one positive sample with which to construct a positive pair [14, 7, 15]. CPC [25] predicts the future output of sequential data by using the current output as prior knowledge, which improves the feature-representing ability of the model. Instance discrimination [36] proposes a non-parametric cross-entropy loss to optimize the model at the instance level. Inv. Spread [37] makes use of data-augmentation invariance and the spread-out property of instances to learn features. MoCo [15] proposes a dictionary to maintain a negative sample set, thus increasing the number of negative pairs. Different from the aforementioned self-supervised CL approaches, [20] proposes a supervised CL method that considers all samples of the same category as positive pairs to increase the utility of images.

Collapsing Issue and the Number of Negatives. In CL, the objective is to maximize the mutual information between the positive pairs. However, to avoid a "collapsing output", vast quantities of negative samples are needed so that the learned representation attains maximum similarity within positive pairs and minimum similarity with negative samples. For instance, SimCLR [7] requires many negative samples, leading to a large batch size (i.e., 4096); to optimize such a huge batch, a specially designed optimizer, LARS [38], is used. Similarly, MoCo [15] needs a vast queue (i.e., 65536) to achieve competitive performance. BYOL [13] avoids collapsed outputs without using any negative samples by treating the two views as positive and maximizing the similarity between "projection" and "prediction" features. SimSiam [9], on the other hand, leverages a Siamese network to introduce inductive biases for modeling invariance; with a small batch size (i.e., 256), SimSiam rivals BYOL (which uses 4096). Unlike both approaches, which achieved their success through empirical studies, this paper takes a theoretical perspective, showing that an intertwined multiplier $q_B$ coupling the positive and negative terms is the main issue in contrastive learning.

Batch Size Sensitivity of InfoNCE. Several works focus on batch size sensitivity with respect to the InfoNCE objective. [32] proposes an objective based on relative predictive coding that balances training stability and batch size sensitivity. [17] follows [3] and extends the idea to local and global features. [26] proposes a Wasserstein distance to prevent the encoder from learning any other differences between unpaired samples. [19] and [29] learn better representations by sampling hard negatives, particularly for small batches. Other recent works [42, 11] aim to mitigate the small-batch issue of the InfoNCE loss. Although both these recent works and DCL derive from the InfoNCE objective, we provide a novel perspective showing that decoupling the positive and negative terms in the InfoNCE loss is essential: simply removing the positive term from the denominator drastically improves performance and makes the objective far less sensitive to batch size.

We choose to start from SimCLR because of its conceptual simplicity. Given a batch of $N$ samples (e.g., images) $\{\mathbf{x}_1,\dots,\mathbf{x}_N\}$, let $\mathbf{x}_i^{(1)},\mathbf{x}_i^{(2)}$ be two augmented views of the sample $\mathbf{x}_i$, and let $B$ be the set of all augmented views in the batch, i.e., $B=\{\mathbf{x}_i^{(k)} \mid k\in\{1,2\},\, i\in[\![1,N]\!]\}$. As shown in Figure 2(a), each view $\mathbf{x}_i^{(k)}$ is sent into the same encoder network $f$, and the output $\mathbf{h}_i^{(k)}=f(\mathbf{x}_i^{(k)})$ is then projected by an MLP projector and normalized: $\mathbf{z}_i^{(k)}=g(\mathbf{h}_i^{(k)})/\|g(\mathbf{h}_i^{(k)})\|$. For each augmented view $\mathbf{x}_i^{(k)}$, SimCLR solves a classification problem by using the rest of the views in $B$ as targets, assigning the only positive label to $\mathbf{x}_i^{(u)}$ with $u\neq k$. SimCLR thus creates a cross-entropy loss $L_i^{(k)}$ for each view $\mathbf{x}_i^{(k)}$, and the overall loss is $L=\sum_{k\in\{1,2\},\, i\in[\![1,N]\!]} L_i^{(k)}$.
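As a concrete reference for this setup, the following is a minimal NumPy sketch (an illustration, not the paper's code) of the per-view cross-entropy loss $L_i^{(k)}$ and the summed SimCLR objective; the array layout and helper names are our own:

```python
import numpy as np

def info_nce_per_view(z, i, k, tau=0.1):
    """Cross-entropy loss L_i^{(k)} for view k of sample i.

    z: array of shape (N, 2, d) of L2-normalized projections, so
    z[i, k] is view k+1 of sample i. The positive target for z[i, k]
    is the other view z[i, 1-k]; the remaining 2N-2 views are negatives.
    """
    N = z.shape[0]
    anchor = z[i, k]
    flat = z.reshape(2 * N, -1)          # row 2*i + k is z[i, k]
    logits = flat @ anchor / tau         # similarities to all views
    mask = np.ones(2 * N, dtype=bool)
    mask[2 * i + k] = False              # exclude the anchor itself
    pos_logit = anchor @ z[i, 1 - k] / tau
    # softmax cross-entropy with the positive view as the correct class
    return -pos_logit + np.log(np.exp(logits[mask]).sum())

def simclr_loss(z, tau=0.1):
    """Overall loss L: sum of L_i^{(k)} over all 2N views."""
    N = z.shape[0]
    return sum(info_nce_per_view(z, i, k, tau)
               for i in range(N) for k in (0, 1))
```

The positive logit stays in the denominator here; removing it is exactly the change DCL makes later in the paper.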

where $U_{i,k}$ denotes the summation of the negative terms for view $k$ of sample $i$.

Proposition 1: There exists a negative-positive coupling (NPC) multiplier $q_{B,i}^{(1)}$ in the gradient of $L_i^{(1)}$:

and $U_{i,1}=\sum_{l\in\{1,2\},\, j\in[\![1,N]\!],\, j\neq i}\exp\big(\langle\mathbf{z}_i^{(1)},\mathbf{z}_j^{(l)}\rangle/\tau\big)$. Due to the symmetry, a similar NPC multiplier $q_{B,i}^{(k)}$ exists in the gradient of $L_i^{(k)}$ for $k\in\{1,2\}$, $i\in[\![1,N]\!]$.

As we can see, all of the partial gradients in Equation 7 are modified by the common NPC multiplier $q_{B,i}^{(k)}$ in Equation 8. Equation 8 makes intuitive sense: when the SSL classification task is easy, the gradient is reduced by the NPC term. However, the positive and negative samples are strongly coupled. When the negative samples are far away and less informative (easy negatives), the gradient from an informative positive sample is reduced by the NPC multiplier $q_{B,i}^{(1)}$; conversely, when the positive sample is close and less informative (easy positive), the gradient from a batch of informative negative samples is also reduced. With a smaller batch size, the SSL classification problem becomes significantly easier to solve, so the learning efficiency can be significantly reduced in small-batch settings.

Figure 1(b) shows how the distribution of the NPC multiplier $q_B$ shifts with the batch size for a pre-trained SimCLR baseline model. While all of the shown distributions fluctuate noticeably, a smaller batch size makes $q_B$ cluster towards $0$, while a larger batch size pushes the distribution towards $\delta(1)$. Figure 1(a) shows how the averaged NPC multiplier $\langle q_B\rangle$ and its relative fluctuation change with the batch size: small batch sizes introduce significant NPC fluctuation. Based on this observation, we propose to remove the NPC multipliers from the gradients, which corresponds to the limiting case $q_{B,N\to\infty}$. This leads to the decoupled contrastive learning formulation. [34] also proposes an alignment & uniformity loss that does not exhibit the NPC effect. However, a similar analysis shows that it introduces a negative-negative coupling between the negatives of different positive samples: [34] sums all the negative samples in the batch together, which may cause the gradient to be dominated by a specific negative pair. In Appendix 5, we provide a thorough discussion and demonstrate the advantage of the DCL loss over [34].
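The batch-size dependence of $q_B$ can be checked with a small Monte-Carlo simulation on random unit embeddings, using the definition $q_{B,i} = U_i/(\exp(\langle\mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)}\rangle/\tau)+U_i)$; this is a toy illustration, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def npc_multiplier(z, tau=0.1):
    """Mean NPC multiplier over anchors: q_{B,i} = U_i / (pos_i + U_i),
    where pos_i = exp(<z_i^(1), z_i^(2)>/tau) and U_i sums the
    exp-similarities of the anchor to all 2(N-1) negative views."""
    N = z.shape[0]
    flat = z.reshape(2 * N, -1)          # row 2*i + k is view k of sample i
    sim = np.exp(flat @ flat.T / tau)
    qs = []
    for i in range(N):
        row = sim[2 * i]                 # anchor: first view of sample i
        pos = row[2 * i + 1]             # its positive (the second view)
        neg = row.sum() - row[2 * i] - pos   # drop self-term and positive
        qs.append(neg / (pos + neg))
    return float(np.mean(qs))

def random_views(N, d=16):
    z = rng.normal(size=(N, 2, d))
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# For random embeddings, <q_B> climbs toward 1 as the batch grows,
# matching the distribution shift described in the text.
```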

The DCL Loss: Removing the positive pair from the denominator of Equation 1 leads to a decoupled contrastive learning loss. Equivalently, if we remove the NPC multiplier $q_{B,i}^{(k)}$ from Equation 7, we reach the decoupled contrastive learning loss $L_{DC}=\sum_{k\in\{1,2\},\, i\in[\![1,N]\!]} L_{DC,i}^{(k)}$, where $L_{DC,i}^{(k)}$ is:
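Concretely, the per-view DCL loss is the InfoNCE cross-entropy with the positive term dropped from the log-sum denominator, $L_{DC,i}^{(k)}=-\langle\mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)}\rangle/\tau+\log U_{i,k}$. A minimal batched NumPy sketch (ours, not the official implementation):

```python
import numpy as np

def dcl_loss(z1, z2, tau=0.1):
    """Decoupled contrastive loss for a batch of paired views.

    z1, z2: (N, d) L2-normalized projections of the two views.
    Returns the mean over both views of
    L_{DC,i}^{(k)} = -<z_i^(1), z_i^(2)>/tau + log U_{i,k},
    where U_{i,k} excludes the positive pair, removing the NPC multiplier.
    """
    N = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)              # (2N, d)
    sim = z @ z.T / tau                               # pairwise logits
    pos = np.tile(np.sum(z1 * z2, axis=1) / tau, 2)   # positive logit per anchor
    mask = np.ones((2 * N, 2 * N), dtype=bool)
    idx = np.arange(N)
    mask[np.arange(2 * N), np.arange(2 * N)] = False  # drop self-similarity
    mask[idx, idx + N] = False                        # drop the positive pair
    mask[idx + N, idx] = False
    neg = np.where(mask, np.exp(sim), 0.0).sum(axis=1)  # U_{i,k}
    return float(np.mean(-pos + np.log(neg)))
```

Since the positive term is absent from `neg`, the gradient of each anchor no longer carries the $q_{B,i}^{(k)}$ factor.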

The proofs of Propositions 1 and 2 are given in the Appendix. Further, we can generalize the loss function $L_{DC}$ to $L_{DCW}$ by introducing a weighting function for the positive pairs, i.e., $L_{DCW}=\sum_{k\in\{1,2\},\, i\in[\![1,N]\!]} L_{DCW,i}^{(k)}$.

where we can intuitively choose $w$ to be a negative von Mises-Fisher weighting function, $w(\mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)})=2-\frac{\exp(\langle\mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)}\rangle/\sigma)}{\mathrm{E}_i\left[\exp(\langle\mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)}\rangle/\sigma)\right]}$, so that $\mathrm{E}[w]=1$. $L_{DC}$ is a special case of $L_{DCW}$: $\lim_{\sigma\to\infty}L_{DCW}=L_{DC}$. The intuition behind $w(\mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)})$ is that there is more learning signal when a positive pair of samples are far from each other, while $\mathrm{E}\left[w(\mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)})\langle\mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)}\rangle\right]\approx \mathrm{E}\left[\langle\mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)}\rangle\right]$. Other similar weighting functions provide similar results. In general, we find that such a weighting function, which gives a larger weight to hard positives, tends to increase representation quality.
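A sketch of this weighting, estimating the expectation $\mathrm{E}_i[\cdot]$ by the batch mean (that estimator is an assumption of this illustration):

```python
import numpy as np

def vmf_weights(z1, z2, sigma=0.5):
    """Negative von Mises-Fisher weights for the positive pairs:
    w_i = 2 - exp(<z_i^(1), z_i^(2)>/sigma) / E_i[exp(...)],
    with the expectation estimated by the batch mean, so the weights
    average to 1 and hard positives (low similarity) approach 2."""
    s = np.exp(np.sum(z1 * z2, axis=1) / sigma)
    return 2.0 - s / s.mean()
```

Each per-pair DCL term is then multiplied by its weight before summing, leaving easy positives down-weighted and hard positives emphasized.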

This section empirically evaluates the proposed decoupled contrastive learning (DCL) and compares it to general contrastive learning methods. We summarize the experiments and analysis as follows: (1) the proposed method significantly outperforms general InfoNCE-based contrastive learning on both large-scale and small-scale vision benchmarks; (2) the enhanced version of DCL, DCLW, further improves representation quality; and (3) we analyze DCL with ablation studies on ImageNet-1K, hyperparameters, and short training schedules, which show the fast convergence of the proposed DCL. Note that all the experiments are conducted with 8 Nvidia V100 GPUs on a single machine.

ImageNet. For a fair comparison on ImageNet data, we implement the proposed decoupled objective, DCL, by following SimCLR [7] with ResNet-50 [16] as the encoder backbone, using a cosine annealing schedule with the SGD optimizer. We set the temperature $\tau$ to 0.1 and the latent vector dimension to 128. Following the OpenSelfSup benchmark [40], we evaluate the pre-trained models by training a linear classifier on top of the frozen learned embedding on ImageNet data. We further evaluate DCL on ImageNet-100, a selected subset of 100 classes of ImageNet-1K. Note that all models on ImageNet are trained for 200 epochs.

CIFAR and STL10. For CIFAR10, CIFAR100, and STL10, ResNet-18 [16] is used as the encoder architecture. Following the small-scale benchmark [35], we set the temperature $\tau$ to 0.07. All models are trained for 200 epochs with the SGD optimizer and a base $lr = 0.03\times \mathrm{batch\ size}/256$, and evaluated by a k-nearest-neighbor (kNN) classifier. Note that on STL10, we include both the train and unlabeled sets for model pre-training. We further use ResNet-50 as a stronger backbone by following the implementation of [28], using the same backbone and hyperparameters.

DCL on ImageNet. This section illustrates the effect of DCL against InfoNCE-based approaches under different batch and queue sizes. The initial setup uses a batch size of 1024 (SimCLR) and a queue of 65536 (MoCo [15]); we then gradually reduce the batch size (SimCLR) and queue (MoCo) and report the corresponding top-1 accuracy under linear evaluation. Figure 3 indicates that without DCL, the top-1 accuracy drops drastically when the batch size (SimCLR) or queue (MoCo) becomes very small, while with DCL, the performance stays much steadier than the baselines (SimCLR: $-4.1\%$ vs. $-8.3\%$; MoCo: $-0.4\%$ vs. $-5.9\%$).

Specifically, Figure 3 further shows that with DCL, SimCLR improves from $61.8\%$ to $65.9\%$ at batch size 256, and MoCo improves from $54.7\%$ to $60.8\%$ with a queue of 256. The comparison fully demonstrates the necessity of DCL, especially when the number of negatives is small. Even when the batch size increases to 1024, DCL ($66.1\%$) still improves over the SimCLR baseline ($65.1\%$).

We further observe the same phenomenon on ImageNet-100 data. Table 1 shows that, with DCL, the top-1 linear performance drops only $2.3\%$ when the batch size is varied, compared to $7.1\%$ for the InfoNCE baseline (SimCLR).

In summary, it is worth noting that when the batch size is small, the strength of $q_{B,i}$, which pushes the negative samples away from the positive sample, is also relatively weak. This tends to reduce the efficiency of representation learning, whereas DCL alleviates the performance gap between small and large batch sizes. Through this analysis, we find that DCL simply tackles the batch-size issue in contrastive learning. With this considerable advantage, general SSL approaches can be implemented with fewer computational resources or on lower-end platforms. Compared to InfoNCE, DCL is thus more broadly applicable across large-scale SSL applications.

DCL on CIFAR and STL10. For STL10, CIFAR10, and CIFAR100, we implement DCL with ResNet-18 as the encoder backbone. Table 1 shows that DCL also demonstrates strong effectiveness on small-scale benchmarks. In the evaluation (kNN / Linear) summary, DCL outperforms its baseline by $4.8\%$ / $5.3\%$ (CIFAR10) and $1.7\%$ / $4.4\%$ (CIFAR100) under a small batch size of 32. The accuracy (kNN / Linear) of the SimCLR baseline on STL10 is also improved significantly, by $7.9\%$ / $9.0\%$.

Decoupled Objective with Re-Weighting (DCLW). We only replace $L_{DC}$ with $L_{DCW}$, with no possible advantage from additional tricks; both DCL and the baselines follow the same training protocol of the OpenSelfSup benchmark for fairness. Note that we empirically choose $\sigma=0.5$ in the experiments. The results in Table 2 indicate that DCLW achieves extra gains of $5.1\%$ (ImageNet-1K) and $3.5\%$ (ImageNet-100) over the baseline. For CIFAR data, DCLW gains an extra $3.4\%$ (CIFAR10) and $3.2\%$ (CIFAR100). It is worth noting that, trained for 200 epochs, DCLW reaches $66.9\%$ with batch size 256, surpassing the SimCLR baseline at $66.2\%$ with batch size 8192.

We perform extensive ablations on the hyperparameters of DCL on both ImageNet data and smaller-scale data, i.e., CIFAR and STL10. By empirically seeking better configurations, we observe that DCL gives consistent gains over the standard InfoNCE baselines (SimCLR and MoCo-v2). Further ablations show that DCL achieves gains over both InfoNCE-based baselines even when training for only 100 epochs.

DCL Ablations on ImageNet. In Table 3, we slightly improve the DCL model performance on ImageNet-1K by: 1) tuning the hyperparameters, temperature $\tau$ and learning rate; 2) adopting asymmetric image augmentation (as in BYOL). To obtain a stronger baseline, we conduct an empirical hyperparameter search with batch size 256 and 200 epochs, which improves DCL from 65.9% to 67.8% top-1 accuracy on ImageNet-1K. Further adopting the asymmetric augmentation policy from BYOL improves DCL from 67.8% to 68.2%.

DCL Ablations on CIFAR. Further experiments are conducted with the ResNet-50 backbone and longer training (i.e., 500 epochs). The DCL model with kNN evaluation, batch size 32, and 500 epochs of training reaches 86.1%, compared to 82.2% for the baseline. In Table 4, we show DCL ResNet-50 performance on CIFAR10 and CIFAR100, varying the batch size to demonstrate the effectiveness of DCL.

MoCo-v2 with DCL. It is more convincing to compare the proposed DCL against a more compelling baseline, MoCo-v2. Comparisons on both ImageNet-1K and ImageNet-100 in Table 5 indicate that DCL becomes significantly more effective than MoCo-v2 as the queue size gets smaller.

Few Learning Epochs. DCL alleviates a shortcoming of the traditional contrastive learning framework, which needs a large batch size and long training to achieve high performance. The previous state of the art, SimCLR, relies heavily on long training to obtain high top-1 accuracy (e.g., $69.3\%$ with up to 1000 epochs). DCL aims at higher learning efficiency with few training epochs. We demonstrate the effectiveness of DCL in the InfoNCE-based frameworks SimCLR and MoCo-v2 [8]. We choose a batch size of 256 (queue of 65536) as the baseline and train the model for only 100 epochs, keeping all other settings the same for a fair comparison. Table 6 shows the results on ImageNet-1K under linear evaluation. With DCL, SimCLR achieves $64.6\%$ top-1 accuracy with only 100 epochs, compared to the SimCLR baseline's $57.5\%$; MoCo-v2 with DCL reaches $64.4\%$, compared to the MoCo-v2 baseline's $63.6\%$, with 100 epochs of pre-training.

We further demonstrate that, with DCL, representation learning is faster during the early stage of training compared to the InfoNCE-based learning scheme, because DCL decouples the positive and negative pairs. Figure 4 shows on (a) CIFAR10 and (b) STL10 that DCL improves the speed of convergence and reaches higher performance than the baseline on CIFAR and STL10 data. The t-SNE visualization in Figure 4(c) also supports the theoretical derivation that removing the batch-size-dependent effect (i.e., the NPC multiplier) improves representation learning over the InfoNCE-based scheme.

Comparison with other SOTA SSL Approaches. The primary goal of this work is to provide an efficient and effective improvement to widely used InfoNCE-based contrastive learning by decoupling the positive and negative terms to achieve better representation quality. DCL is less sensitive to sub-optimal hyperparameters and achieves competitive results with minimal requirements: its effectiveness relies on neither large batch sizes and long training, momentum encoding, negative sample queues, nor additional tricks (e.g., stop-gradient and multi-cropping). Overall, DCL provides a more robust baseline for contrastive-based SSL approaches. Though this work does not aim to propose a new SOTA SSL approach, DCL can be combined with SOTA contrastive learning methods, such as NNCLR [10], to achieve better performance without large batch sizes and long training. In Table 7, we provide extensive comparisons to SOTA SSL approaches on ImageNet-1K to validate the effectiveness of DCL. In Table 8, we further show that DCL achieves competitive results compared to VICReg [2], Barlow Twins [39], SimSiam [9], SwAV [4], and DINO [6] on ImageNet-100 and CIFAR10.

Generalization of DCL to Different Domains. DCL can be easily adapted to different domains (e.g., speech and language models) to achieve competitive performance. We demonstrate that DCL can be combined with SOTA SSL speech models, e.g., wav2vec 2.0 [1], which uses a transformer backbone and requires enormous computational resources. We evaluate wav2vec 2.0 on its downstream tasks and obtain better performance by applying the DCL method; detailed results and discussion can be found in the Appendix. DCL can also potentially be combined with the transformer-based language-image model CLIP [27], which uses a very large batch size of 32768; with DCL, CLIP could maintain its learning efficiency when the batch size becomes smaller. Note that this has been implemented by [33].

DCL Convergence at Large Batch Sizes. DCL shows less gain over the InfoNCE-based baseline when the batch size is large. According to Figure 1 and the theoretical analysis, the reason is that the NPC multiplier $q_B \to 1$ when the batch size is large (e.g., 1024): the InfoNCE loss converges to the DCL loss as the batch size approaches infinity. With 400 training epochs, the ImageNet-1K top-1 accuracy increases only slightly, from 69.5% to 69.9%, when the batch size increases from 256 to 1024. Please refer to Table 9.

This paper identifies the negative-positive coupling (NPC) effect in the widely used InfoNCE loss, which makes the SSL task significantly easier to solve at smaller batch sizes. By removing the NPC effect, we reach a new objective function, decoupled contrastive learning (DCL). The proposed DCL loss requires minimal modification to the SimCLR baseline and provides an efficient, reliable, and nontrivial performance improvement on various benchmarks, while requiring neither momentum encoding, large batch sizes, nor long training schedules to reach competitive performance. Notably, DCL can be combined with the SOTA contrastive learning method NNCLR to achieve $72.3\%$ ImageNet-1K top-1 accuracy with a batch size of $512$ in $400$ epochs. We hope that DCL can serve as a strong baseline for contrastive-based SSL methods. Further, an important lesson from the DCL loss is that a more efficient SSL task should maintain its complexity when the batch size becomes smaller.

Acknowledgements. This work was supported in part by the MOST grants 110-2634-F-007-027, 110-2221-E-001-017 and 111-2221-E-001-015 of Taiwan. We are grateful to National Center for High-performance Computing and Meta AI Research for providing computational resources and facilities.

Let $Y_{i,1}=\exp\big(\langle\mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)}\rangle/\tau\big)+\sum_{q\in\{1,2\},\, j\in[\![1,N]\!],\, j\neq i}\exp\big(\langle\mathbf{z}_i^{(1)},\mathbf{z}_j^{(q)}\rangle/\tau\big)$ and $U_{i,1}=\sum_{q\in\{1,2\},\, j\in[\![1,N]\!],\, j\neq i}\exp\big(\langle\mathbf{z}_i^{(1)},\mathbf{z}_j^{(q)}\rangle/\tau\big)$. Then $q_{B,i}^{(1)}=\frac{U_{i,1}}{Y_{i,1}}$.

where we can easily see that $\sum_{q\in\{1,2\},\, j\in[\![1,N]\!],\, j\neq i}\frac{\exp\big(\langle\mathbf{z}_i^{(1)},\mathbf{z}_j^{(q)}\rangle/\tau\big)}{U_{i,1}}=1$.

By removing the positive term in the denominator of Equation 1, we can repeat the procedure in the proof of Proposition 1 and see that the coupling term disappears.

Table 10 compares top-1 linear-evaluation accuracies against state-of-the-art SSL approaches on ImageNet-1K. For fairness, we list each approach's batch size and training epochs as reported in the original paper. During pre-training, DCL uses a ResNet-50 backbone with two views of size 224 $\times$ 224. DCL relies on its simplicity to reach competitive performance without huge batch sizes and epochs or other pre-training schemes, i.e., momentum encoders, clustering, or prediction heads. We report a 400-epoch version of DCL combined with NNCLR [10]: it achieves $71.1\%$ with a batch size of 256 and 400-epoch pre-training, better than NNCLR [10] in its optimal setting ($68.7\%$ with a batch size of 256 and 1000 epochs). Note that SwAV [4], BYOL [13], SimCLR, and PIRL [22] need a huge batch size of 4096, and SwAV further applies multi-crop extra views to reach optimal performance. The SwAV results are taken from SimSiam, where multi-cropping is not included.

We follow the settings of SimCLR to set up the data augmentations. We use RandomResizedCrop with scale in [0.08, 1.0], followed by RandomHorizontalFlip; then ColorJitter with strength [0.8, 0.8, 0.8, 0.2] applied with probability 0.8, and RandomGrayscale with probability 0.2. GaussianBlur uses a Gaussian kernel with standard deviation in [0.1, 2.0].

We follow the asymmetric image augmentation of BYOL, replacing the default DCL augmentation in the ablations. Table 3 shows that the ImageNet-1K top-1 performance increases from 67.8% to 68.2% by applying asymmetric augmentations.

Following the OpenSelfSup benchmark [40], we first train the linear classifier with batch size 256 for 100 epochs, using the SGD optimizer with momentum 0.9 and weight decay 0. The base $lr$ is set to 30.0 and decayed by 0.1 at epochs 60 and 80. We also evaluate with the linear evaluation protocol of SimSiam [9], which raises the batch size to 4096 for 90 epochs and switches to the LARS optimizer with base $lr = 1.2$ and a cosine decay schedule; the momentum and weight decay remain unchanged. We found the second protocol to slightly improve performance.
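The first schedule can be sketched in PyTorch as follows; the linear head's input dimension (2048, a frozen ResNet-50 feature) and the class count are assumptions for illustration:

```python
import torch

# First linear-evaluation schedule described above: SGD, base lr 30.0,
# momentum 0.9, no weight decay, decayed by 0.1 at epochs 60 and 80
# of a 100-epoch run. Head dimensions (2048 -> 1000) are assumed.
head = torch.nn.Linear(2048, 1000)
optimizer = torch.optim.SGD(head.parameters(), lr=30.0,
                            momentum=0.9, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 80], gamma=0.1)
```

Calling `scheduler.step()` once per epoch yields lr = 30.0 for epochs 0-59, 3.0 for 60-79, and 0.3 thereafter.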

In this section, we provide a thorough discussion of the connection and difference between DCL and Hypersphere [34], which does not have negative-positive coupling either. However, there is a critical difference between DCL and Hypersphere: the order of the expectation and the exponential is swapped. Assume for analytical convenience that the latent embedding vectors $\mathbf{z}$ are normalized. When $\mathbf{z}_i,\mathbf{z}_j$ are normalized, $\exp\big(\langle\mathbf{z}_i^{(k)},\mathbf{z}_i^{(l)}\rangle/\tau\big)$ and $\exp\big(-\|\mathbf{z}_i^{(k)}-\mathbf{z}_i^{(l)}\|^{2}/\tau\big)$ are the same up to a trivial scale. Thus we can write $L_{DCL}$ and $L_{align\text{-}uni}$ in a similar fashion:

With the right weight factor, $L_{align}$ can be made exactly the same as $L_{DCL,pos}$, so let us focus on $L_{DCL,neg}$ and $L_{uniform}$:

Similar to the earlier analysis in the manuscript, $L_{uniform}$ introduces a negative-negative coupling between the negative samples of different positive samples: if two negative samples of $\mathbf{z}_i$ are close to each other, the gradient for $\mathbf{z}_i$ is attenuated, which behaves similarly to the negative-positive coupling. That is, while Hypersphere does not have a negative-positive coupling, it has a similarly problematic negative-negative coupling.

A simple case demonstrates the negative-negative coupling in [34]. Assume a batch size of 3 and temperature $\tau=1$. Then $L_{DCL,neg}$ and $L_{uniform}$ can be written as follows:

If the value of $\exp\big(\langle\mathbf{z}_1^{(k)},\mathbf{z}_3^{(l)}\rangle\big)$ is much larger than the other terms (e.g., a hard negative), there is a huge difference between $L_{DCL,neg}$ and $L_{uniform}$. Since $L_{uniform}$ first sums all the negative pairs in the batch together, the loss may be dominated by a specific negative pair. In the DCL loss, by contrast, the negative samples from different positives are not coupled, unlike the uniformity loss in [34].
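A toy numerical check of this coupling difference; the per-anchor exp-similarity values (including the dominant hard-negative term $10^4$) are illustrative assumptions:

```python
import numpy as np

# Batch of 3 anchors, tau = 1. Each row holds the exp-similarities of
# one anchor to its negatives; anchor 1 has one hard-negative term 1e4.
neg_exp = np.array([
    [1e4, 1.0],   # anchor 1: dominated by a hard negative pair
    [1.0, 1.0],   # anchor 2
    [1.0, 1.0],   # anchor 3
])
l_dcl_neg = np.sum(np.log(neg_exp.sum(axis=1)))  # per-anchor log-sums (DCL)
l_uniform = np.log(neg_exp.sum())                # one global log-sum ([34])
# In l_uniform the hard pair swamps the whole objective, attenuating the
# gradient signal for anchors 2 and 3; in l_dcl_neg it only affects
# anchor 1's summand.
```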

Next, we provide a comprehensive empirical comparison. The experiments match the analytical prediction: DCL outperforms Hypersphere by a larger margin at smaller batch sizes.

We compare DCL to Hypersphere on STL10, ImageNet-100, and ImageNet-1K under various settings. For STL10, we implement DCL based on the official code of Hypersphere; the encoder and the hyperparameters are the same as in Hypersphere and have not been optimized for DCL in any way. Hypersphere performed a thorough hyperparameter search, so we believe the default hyperparameters are relatively well optimized for Hypersphere.

In Table 11, DCL reaches 84.4% (fc7+Linear) on STL10, compared to the 83.2% (fc7+Linear) reported by Hypersphere. In Tables 12 and 13, DCL achieves better performance than Hypersphere under the same settings (MoCo & MoCo-v2) on ImageNet-100. DCL further shows strong results against Hypersphere on ImageNet-1K in Table 14. We also compare DCL and Hypersphere on STL10 under different batch sizes in Table 15; the advantage of DCL grows as the batch size shrinks. Please note that we did not tune the parameters for DCL at all, making this a more than fair comparison.

In every single one of these experiments, DCL outperforms Hypersphere. Although the difference between DCL and Hypersphere is subtle, it allows DCL to better avoid the domination of a specific negative pair within a batch. We hope these results show the unique value of DCL compared to Hypersphere.

In the downstream training process, the pre-trained representations are first mean-pooled and then forwarded through a fully connected layer trained with a cross-entropy loss on VoxCeleb1 [23].

SOTA SSL speech models such as wav2vec 2.0 [1] still use a contrastive loss in the objective function. In Table 16, we show the effectiveness of DCL with wav2vec 2.0 [1]: we replace the InfoNCE loss with the DCL loss and train a wav2vec 2.0 base model (i.e., 7-Conv + 24-Transformer) from scratch. (This experiment is downscaled to 8 V100 GPUs rather than 64.) After pre-training, we evaluate the representation on two downstream tasks, speaker identification and intent classification; Table 16 shows the representation improvement from DCL.

The idea of DCL, removing the positive term from the denominator, can also be applied to the learning objective of a supervised classifier. Following [18], we implement the DCL idea on the cross-entropy loss by removing the positive logits from the denominator of the softmax function. In Table 17, we observe that our supervised DCL achieves slightly lower performance than cross-entropy on CIFAR data. One possible reason for the weaker performance of DCL in supervised learning is the different feature interaction between supervised and unsupervised classifiers, referred to as parametric and non-parametric classifiers in [36].

Under the parametric formulation in [36], the logits equal $w^{T}z$, where $w$ is a weight vector for each class and $z$ is the output embedding of the neural network. In contrastive learning (i.e., a non-parametric classifier), the logits equal $\langle z^{(1)}, z^{(2)}\rangle$, where $z^{(1)}$ and $z^{(2)}$ are two augmented views of the same sample. In the embedding space at the early training stage, $w$ is relatively far from $z$ compared with the distance between $z^{(1)}$ and $z^{(2)}$. Considering the effect of the NPC multiplier $q_B$ on the two classifiers, $q_B\rightarrow 1$ in the parametric classifier may diminish the effectiveness of the DCL idea, as the coupling effect is already tiny.

Based on the weighting function for the positive pairs in Section 3 of the manuscript, we provide another weighting function for DCLW:

where $w(\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)})=\delta\cdot\exp(-\sigma\cdot\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\rangle)$. The goal is similar to DCLW: we assign a larger weight to hard positives (e.g., a positive pair whose samples are far away from each other).

The results indicate that $\delta=3$ and $\sigma=0.5$ achieve $85.4\%$ kNN top-1 accuracy on CIFAR10 and outperform the InfoNCE baseline (SimCLR) by $4\%$.
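As a sketch, this alternative weighting can be written directly from the formula (a minimal illustration; `delta` and `sigma` stand for the $\delta$ and $\sigma$ above, and unit-norm views are assumed):

```python
import numpy as np

def positive_weight(z1, z2, delta=3.0, sigma=0.5):
    """w(z1, z2) = delta * exp(-sigma * <z1, z2>): positives whose views
    are far apart (small inner product) receive a larger weight."""
    return delta * np.exp(-sigma * np.dot(z1, z2))

# Unit-norm views: nearly aligned (easy positive) vs. nearly opposite (hard).
easy = positive_weight(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
hard = positive_weight(np.array([1.0, 0.0]), np.array([-1.0, 0.0]))
```

Since the inner product of normalized views lies in $[-1,1]$, the weight varies smoothly between $\delta e^{-\sigma}$ (easy) and $\delta e^{\sigma}$ (hard).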

Analysis of Temperature τ. In Table 18, we provide an extensive analysis of the temperature τ in the objective function to support that the DCL method is less sensitive to hyperparameters than InfoNCE-based baselines. Specifically, we search the temperature τ over {0.07, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0} for both DCL and SimCLR on CIFAR10, pre-training the network and reporting results with kNN evaluation, batch size 512, and 500 epochs. As shown in Table 18, DCL is less sensitive to hyperparameters such as the temperature τ than SimCLR.

Analysis of Gradient. To further analyze the behavior of DCL, we visualize the mean gradient norm, with its standard deviation, of the last convolutional layers from the last two residual blocks of a ResNet-18 trained on CIFAR100 under different batch sizes. The results in Figure 5 show that DCL consistently achieves larger gradients than the baseline (InfoNCE) loss, especially under small batch sizes.

Table: S4.T1: Comparisons with/without DCL under different batch sizes from 32 to 512. Results show the effectiveness of DCL on five widely used benchmarks. The performance of DCL remains steadier than the SimCLR baseline as the batch size varies.

| Dataset (kNN / Linear) | Method | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|
| ImageNet-1K | Baseline (ResNet-50) | 40.2/56.8 | 42.9/58.9 | 45.1/60.6 | 46.3/61.8 | 49.4/64.0 |
| ImageNet-1K | w/ DCL (ResNet-50) | 43.7/61.5 | 46.3/63.4 | 48.5/64.3 | 49.8/65.9 | 50.1/65.8 |
| ImageNet-100 | Baseline (ResNet-50) | 67.8/74.2 | 71.9/77.6 | 73.2/79.3 | 74.6/80.7 | 75.4/81.3 |
| ImageNet-100 | w/ DCL (ResNet-50) | 74.9/80.8 | 76.3/82.0 | 76.5/81.9 | 76.9/83.1 | 76.8/82.8 |
| CIFAR10 | Baseline (ResNet-18) | 78.9/79.8 | 80.4/81.3 | 81.1/82.8 | 81.4/83.0 | 81.3/83.3 |
| CIFAR10 | w/ DCL (ResNet-18) | 83.7/85.1 | 84.4/85.9 | 84.4/85.7 | 84.2/85.3 | 83.5/84.7 |
| CIFAR100 | Baseline (ResNet-18) | 49.4/51.3 | 50.3/53.8 | 51.8/55.3 | 52.0/56.3 | 52.4/56.8 |
| CIFAR100 | w/ DCL (ResNet-18) | 51.1/55.4 | 54.3/58.3 | 54.6/58.9 | 54.9/58.5 | 55.0/58.4 |
| STL10 | Baseline (ResNet-18) | 74.1/76.2 | 77.6/77.8 | 79.3/80.0 | 80.7/81.3 | 81.3/81.5 |
| STL10 | w/ DCL (ResNet-18) | 82.0/85.2 | 82.8/86.3 | 81.8/86.1 | 81.2/85.7 | 81.0/85.6 |

Table: S4.T2: Comparisons between the SimCLR baseline, DCL, and DCLW. The linear and kNN top-1 (%) results indicate that DCL improves baseline performance, and DCLW provides a further boost. Results use batch size 256 and 200 epochs. All models are trained and evaluated with the same experimental settings. The backbones are ResNet-18 and ResNet-50 for CIFAR and ImageNet, respectively.

| Method | CIFAR10 (kNN) | CIFAR100 (kNN) | ImageNet-100 (linear) | ImageNet-1K (linear) |
|---|---|---|---|---|
| SimCLR | 81.4 | 52.0 | 80.7 | 61.8 |
| DCL | 84.2 (+2.8) | 54.9 (+2.9) | 83.1 (+2.4) | 65.9 (+4.1) |
| DCLW | 84.8 (+3.4) | 55.2 (+3.2) | 84.2 (+3.5) | 66.9 (+5.1) |

Table: S4.T3: Improving DCL performance on ImageNet-1K with tuned hyperparameters (temperature and learning rate) and stronger image augmentation. Models are trained with batch size 256 for 200 epochs.

| ImageNet-1K (256 batch size; 200 epochs) | Linear Top-1 Accuracy (%) |
|---|---|
| DCL | 65.9 |
| + optimal $(\tau, l_r)$ = (0.2, 0.07) | 67.8 (+1.9) |
| + asymmetric augmentation [13] | 68.2 (+0.4) |

Table: S4.T4: The comparisons with/without DCL under various batch sizes from 32 to 512 on ResNet-50.

Architecture: ResNet-50 @ 500 epochs.

CIFAR10 (kNN):

| Batch Size | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| SimCLR | 82.2 | 85.9 | 88.5 | 88.9 | 89.1 |
| SimCLR w/ DCL | 86.1 | 88.3 | 89.9 | 90.1 | 90.3 |

CIFAR100 (kNN):

| Batch Size | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| SimCLR | 49.8 | 55.3 | 59.9 | 60.6 | 61.1 |
| SimCLR w/ DCL | 54.3 | 58.4 | 61.6 | 62.0 | 62.2 |

Table: S4.T5: Linear top-1 accuracy (%) comparison with MoCo-v2 on ImageNet-1K and ImageNet-100.

ImageNet-100 (Linear):

| Memory Queue Size | 32 | 64 | 128 | 256 | 8192 |
|---|---|---|---|---|---|
| MoCo-v2 Baseline (ResNet-50) | 73.7 | 76.4 | 78.7 | 78.7 | 79.8 |
| MoCo-v2 w/ DCL (ResNet-50) | 76.2 | 78.3 | 79.6 | 79.6 | 80.5 |

ImageNet-1K (Linear):

| Memory Queue Size | 64 | 256 | 65536 |
|---|---|---|---|
| MoCo-v2 Baseline (ResNet-50) | 63.9 | 67.1 | 67.5 |
| MoCo-v2 w/ DCL (ResNet-50) | 65.8 | 67.6 | 67.7 |

Table: S4.T6: ImageNet-1K top-1 accuracy (%) of SimCLR and MoCo-v2 with/without DCL under a small number of training epochs. We also list results under 200 epochs for a clear comparison. With DCL, the performance of SimCLR trained for 100 epochs nearly reaches its performance at 200 epochs. MoCo-v2 with DCL also reaches higher accuracy than the baseline at 100 epochs.

| Epochs | SimCLR | SimCLR w/ DCL | MoCo-v2 | MoCo-v2 w/ DCL |
|---|---|---|---|---|
| 100 | 57.5 | 64.6 | 63.6 | 64.4 |
| 200 | 61.8 | 65.9 | 67.5 | 67.7 |

Table: S5.T7: Linear top-1 accuracy (%) comparison of SSL approaches on ImageNet-1K. Given a lower computational budget, the DCL model is better than recent SOTA approaches. Its effectiveness does not rely on large batch sizes and epochs (SimCLR [7], NNCLR [10]), momentum encoding (BYOL [13], MoCo-v2 [8]), or other tricks such as stop-gradient (SimSiam [9]) and multi-cropping (SwAV [5]).

| ResNet-50 w/ | Epochs | Batch Size | ImageNet-1K (Linear) |
|---|---|---|---|
| SimCLR | 400 | 4096 | 69.8 |
| BYOL | 400 | 4096 | 73.2 |
| SwAV | 400 | 4096 | 70.7 |
| MoCo-v2 | 400 | 256 | 71.0 |
| SimSiam | 400 | 256 | 70.8 |
| Barlow Twins | 300 | 256 | 70.7 |
| NNCLR | 1000 | 256 / 512 | 68.7 / 71.7 |
| NNCLR + DCL | 400 | 256 / 512 | 71.1 / 72.3 |

Table: S5.T9: Results of DCL and SimCLR with large batch size and learning epochs.

| ImageNet-1K (ResNet-50) | Batch Size | Epochs | Top-1 Accuracy (%) |
|---|---|---|---|
| SimCLR | 256 | 200 | 61.8 |
| SimCLR | 256 | 400 | 64.8 |
| SimCLR | 1024 | 400 | 67.3 |
| SimCLR w/ DCL | 256 | 200 | 67.8 (+6.0) |
| SimCLR w/ DCL | 256 | 400 | 69.5 (+4.7) |
| SimCLR w/ DCL | 1024 | 400 | 69.9 (+2.6) |

Table: Pt0.A1.T12: ImageNet-100 comparisons of Hypersphere and DCL under the same setting (MoCo).

| ImageNet-100 | Epoch | Memory Queue Size | Linear Top-1 Accuracy (%) |
|---|---|---|---|
| Hypersphere | 240 | 16384 | 75.6 |
| DCL | 240 | 16384 | 76.8 (+1.2) |

Table: Pt0.A1.T16: Results of DCL on wav2vec 2.0, evaluated on two downstream tasks.

| Downstream task (Accuracy) | Speaker Identification† (%) | Intent Classification‡ (%) |
|---|---|---|
| wav2vec 2.0 Base Baseline | 74.9 | 92.3 |
| wav2vec 2.0 Base w/ DCL | 75.2 | 92.5 |

Table: Pt0.A1.T18: Ablation study of various temperatures $\tau$ on CIFAR10.

| Temperature $\tau$ | 0.07 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 | Std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SimCLR | 83.6 | 87.5 | 89.5 | 89.2 | 88.7 | 89.1 | 88.5 | 87.6 | 86.8 | 85.9 | 85.3 | 1.44 |
| SimCLR w/ DCL | 88.3 | 89.4 | 90.8 | 89.9 | 89.6 | 90.3 | 89.6 | 89.0 | 88.5 | 88.0 | 87.7 | 0.98 |

Figure: An overview of the batch-size issue: general contrastive approaches need large batch sizes to perform well. (a) shows the NPC multiplier $q_B$ for different batch sizes. As the batch size increases, $q_B$ approaches $1$ with a small coefficient of variation ($C_v=\sigma/\mu$); (b) illustrates the distribution of $q_B$ for various batch sizes and indicates that the mode of $q_B$ shifts towards $1$ as the batch size increases. Here $\sigma$ and $\mu$ are the standard deviation and mean of $q_B$, respectively, and the coefficient of variation $C_v$ measures the dispersion of a frequency distribution.

Figure: Contrastive learning and negative-positive coupling (NPC). (a) In SimCLR, each sample $\mathbf{x}_{i}$ has two augmented views $\{\mathbf{x}_{i}^{(1)},\mathbf{x}_{i}^{(2)}\}$. They are encoded by the same encoder $f$ and further projected to $\{\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\}$ by a normalized MLP. (b) According to Equation 8, for the view $\mathbf{x}_{i}^{(1)}$, the cross-entropy loss $L_{i}^{(1)}$ yields a positive force from $\mathbf{z}_{i}^{(2)}$, which comes from the other view $\mathbf{x}_{i}^{(2)}$ of $\mathbf{x}_{i}$, and a negative force, which is a weighted average of all the negative samples, i.e., $\{\mathbf{z}_{j}^{(l)}\mid l\in\{1,2\}, j\neq i\}$. The gradient $-\nabla_{\mathbf{z}_{i}^{(2)}}L_{i}^{(1)}$ is proportional to the NPC multiplier. (c) Two cases where the NPC term affects learning efficiency. On the top, the positive sample is close to the anchor and less informative, but the gradient from the negative samples is also reduced. On the bottom, when the negative samples are far away and less informative, the learning rate from the positive sample is mistakenly reduced. In general, the NPC multiplier from the InfoNCE loss makes the SSL task simpler to solve, leading to reduced learning efficiency.

Figure: Comparisons on ImageNet-1K with/without DCL under different (a) batch sizes for SimCLR and (b) queue sizes for MoCo. Without DCL, the top-1 accuracy drops significantly when the batch size (SimCLR) or queue (MoCo) becomes very small. The temperature $\tau$ is $0.1$ for SimCLR and $0.07$ for MoCo in the comparison.

Figure: (a) CIFAR10; (c) t-SNE visualization.

Figure: (a) The coefficient of variation ($C_v=\sigma/\mu$) of the gradient and (b) the mean gradient norm with its standard deviation for the baseline (InfoNCE) and the proposed method (DCL) under different batch sizes.

$$ L_i^{(k)} = -\log\frac{\exp(\langle \mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)} \rangle/\tau)}{\exp(\langle \mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)} \rangle/\tau) + \sum_{l\in\{1,2\},\,j\in[\![1,N]\!],\,j\neq i}\exp(\langle \mathbf{z}_i^{(k)},\mathbf{z}_j^{(l)} \rangle/\tau)} \label{eq:simclr_loss_re} $$ \tag{eq:simclr_loss_re}

$$ \left\{\begin{array}{l} -\nabla_{\mathbf{z}_{i}^{(1)}}L_{i}^{(1)} = \frac{q_{B,i}^{(1)}}{\tau}\left(\mathbf{z}_{i}^{(2)}-\sum_{l\in\{1,2\},\,j\in[\![1,N]\!],\,j\neq i}\frac{\exp(\langle \mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(l)} \rangle/\tau)}{U_{i,1}}\cdot\mathbf{z}_{j}^{(l)}\right)\\[4pt] -\nabla_{\mathbf{z}_{i}^{(2)}}L_{i}^{(1)} = \frac{q_{B,i}^{(1)}}{\tau}\cdot\mathbf{z}_{i}^{(1)}\\[4pt] -\nabla_{\mathbf{z}_{j}^{(l)}}L_{i}^{(1)} = -\frac{q_{B,i}^{(1)}}{\tau}\frac{\exp(\langle \mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(l)} \rangle/\tau)}{U_{i,1}}\cdot\mathbf{z}_{i}^{(1)} \end{array}\right. \label{eq:gradient_Li} $$ \tag{eq:gradient_Li}


$$ L_{DCL,neg} = \sum_{i}\log(\sum_{j\neq i} \exp(\langle \mathbf{z}_i^{(k)},\mathbf{z}_j^{(l)} \rangle/\tau)) $$

Prop. Proposition 1: There exists a negative-positive coupling (NPC) multiplier $q_{B,i}^{(1)}$ in the gradient of $L_{i}^{(1)}$:

$$ \left\{\begin{array}{l} -\nabla_{\mathbf{z}_{i}^{(1)}}L_{i}^{(1)} = \frac{q_{B,i}^{(1)}}{\tau}\left(\mathbf{z}_{i}^{(2)}-\sum_{l\in\{1,2\},\,j\in[\![1,N]\!],\,j\neq i}\frac{\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(l)}\rangle/\tau)}{U_{i,1}}\cdot\mathbf{z}_{j}^{(l)}\right)\\[4pt] -\nabla_{\mathbf{z}_{i}^{(2)}}L_{i}^{(1)} = \frac{q_{B,i}^{(1)}}{\tau}\cdot\mathbf{z}_{i}^{(1)}\\[4pt] -\nabla_{\mathbf{z}_{j}^{(l)}}L_{i}^{(1)} = -\frac{q_{B,i}^{(1)}}{\tau}\frac{\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(l)}\rangle/\tau)}{U_{i,1}}\cdot\mathbf{z}_{i}^{(1)} \end{array}\right. \tag{7} $$

where the NPC multiplier $q_{B,i}^{(1)}$ is

$$ q_{B,i}^{(1)} = 1-\frac{\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\rangle/\tau)}{\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\rangle/\tau)+U_{i,1}} \tag{8} $$

and $U_{i,1}=\sum_{l\in\{1,2\},\,j\in[\![1,N]\!],\,j\neq i}\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(l)}\rangle/\tau)$. Due to the symmetry, a similar NPC multiplier $q_{B,i}^{(k)}$ exists in the gradient of $L_{i}^{(k)}$, $k\in\{1,2\}$, $i\in[\![1,N]\!]$.
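The limiting behavior of the NPC multiplier can be checked numerically. The sketch below is our illustration only: random unit vectors stand in for the embeddings, so the absolute values are not those of a trained model, but the trend of the average $q_{B,i}^{(1)}$ of Equation 8 as the batch grows is visible:

```python
import numpy as np

rng = np.random.default_rng(0)
TAU, DIM = 0.1, 128

def mean_npc_multiplier(n):
    """Average q_B over a batch of n samples with random unit embeddings."""
    z1 = rng.normal(size=(n, DIM)); z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = rng.normal(size=(n, DIM)); z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
    pos = np.exp(np.sum(z1 * z2, axis=1) / TAU)       # exp(<z_i^(1), z_i^(2)>/tau)
    all_sims = np.exp(z1 @ np.concatenate([z1, z2]).T / TAU)
    # U_{i,1}: every term except self-similarity and the positive pair.
    U = all_sims.sum(axis=1) - np.exp(1.0 / TAU) - pos
    return float(np.mean(1.0 - pos / (pos + U)))

q = {n: mean_npc_multiplier(n) for n in (8, 64, 512)}
# As the batch grows, the negatives dominate the denominator and q_B -> 1.
```

This mirrors the behavior in the figure above: for small batches $q_B$ stays well below $1$, which is exactly where the coupling hurts learning efficiency.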

Prop. Proposition 2 (the DCL loss): Removing the positive pair from the denominator of Equation 1 leads to a decoupled contrastive learning loss. If we remove the NPC multiplier $q_{B,i}^{(k)}$ from Equation 7, we arrive at the decoupled contrastive learning loss $L_{DC}=\sum_{k\in\{1,2\},\,i\in[\![1,N]\!]}L_{DC,i}^{(k)}$, where $L_{DC,i}^{(k)}$ is:

$$ \begin{aligned} L_{DC,i}^{(k)} &= -\log\frac{\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\rangle/\tau)}{\bcancel{\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\rangle/\tau)}+U_{i,k}} &&\text{(9)}\\ &= -\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\rangle/\tau+\log U_{i,k} &&\text{(10)} \end{aligned} $$

Proof. Let $Y_{i,1}=\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\rangle/\tau)+\sum_{q\in\{1,2\},\,j\in[\![1,N]\!],\,j\neq i}\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(q)}\rangle/\tau)$ and $U_{i,1}=\sum_{q\in\{1,2\},\,j\in[\![1,N]\!],\,j\neq i}\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(q)}\rangle/\tau)$, so that $q_{B,i}^{(1)}=\frac{U_{i,1}}{Y_{i,1}}$. Writing $\sum_{q,\,j\neq i}$ for $\sum_{q\in\{1,2\},\,j\in[\![1,N]\!],\,j\neq i}$:

$$ \begin{aligned} -\nabla_{\mathbf{z}_{i}^{(1)}}L_{i}^{(1)} &= \frac{\mathbf{z}_{i}^{(2)}}{\tau}-\frac{1}{Y_{i,1}}\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\rangle/\tau)\frac{\mathbf{z}_{i}^{(2)}}{\tau}-\frac{1}{Y_{i,1}}\sum_{q,\,j\neq i}\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(q)}\rangle/\tau)\frac{\mathbf{z}_{j}^{(q)}}{\tau}\\ &= \Big(1-\frac{1}{Y_{i,1}}\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\rangle/\tau)\Big)\frac{\mathbf{z}_{i}^{(2)}}{\tau}-\frac{1}{Y_{i,1}}\sum_{q,\,j\neq i}\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(q)}\rangle/\tau)\frac{\mathbf{z}_{j}^{(q)}}{\tau}\\ &= \frac{U_{i,1}}{Y_{i,1}}\frac{\mathbf{z}_{i}^{(2)}}{\tau}-\frac{U_{i,1}}{Y_{i,1}}\cdot\frac{1}{U_{i,1}}\sum_{q,\,j\neq i}\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(q)}\rangle/\tau)\frac{\mathbf{z}_{j}^{(q)}}{\tau}\\ &= \frac{q_{B,i}^{(1)}}{\tau}\left[\mathbf{z}_{i}^{(2)}-\sum_{q,\,j\neq i}\frac{\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(q)}\rangle/\tau)}{U_{i,1}}\cdot\mathbf{z}_{j}^{(q)}\right] \end{aligned} $$

$$ \begin{aligned} -\nabla_{\mathbf{z}_{i}^{(2)}}L_{i}^{(1)} &= \frac{\mathbf{z}_{i}^{(1)}}{\tau}-\frac{1}{Y_{i,1}}\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{i}^{(2)}\rangle/\tau)\frac{\mathbf{z}_{i}^{(1)}}{\tau} = \frac{1}{\tau}\frac{U_{i,1}}{Y_{i,1}}\cdot\mathbf{z}_{i}^{(1)} = \frac{q_{B,i}^{(1)}}{\tau}\cdot\mathbf{z}_{i}^{(1)} \end{aligned} $$

$$ \begin{aligned} -\nabla_{\mathbf{z}_{j}^{(q)}}L_{i}^{(1)} &= -\frac{1}{Y_{i,1}}\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(q)}\rangle/\tau)\frac{\mathbf{z}_{i}^{(1)}}{\tau} = -\frac{q_{B,i}^{(1)}}{\tau}\cdot\frac{\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(q)}\rangle/\tau)}{U_{i,1}}\mathbf{z}_{i}^{(1)} \end{aligned} $$

where we can easily see that $\sum_{q,\,j\neq i}\frac{\exp(\langle\mathbf{z}_{i}^{(1)},\mathbf{z}_{j}^{(q)}\rangle/\tau)}{U_{i,1}}=1$.

Proof. By removing the positive term in the denominator of Equation 1, we can repeat the procedure in the proof of Proposition 1 and see that the coupling term disappears.
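For concreteness, a minimal NumPy sketch (our simplified illustration, not a training implementation) of the per-sample InfoNCE and DCL losses from Equations 1 and 10; the two quantities differ exactly by the coupling term $\log q_{B,i}^{(1)}$:

```python
import numpy as np

def per_sample_losses(z1, z2, i, tau=0.1):
    """Return (InfoNCE L_i, DCL L_{DC,i}) for view 1 of sample i.
    z1, z2: (N, d) arrays of L2-normalized embeddings of the two views."""
    n = z1.shape[0]
    pos = np.exp(z1[i] @ z2[i] / tau)                # positive-pair term
    sims = np.exp(np.concatenate([z1[i] @ z1.T, z1[i] @ z2.T]) / tau)
    mask = np.ones(2 * n, dtype=bool)
    mask[i], mask[n + i] = False, False              # drop self-similarity and the positive
    U = sims[mask].sum()                             # U_{i,1}: all negatives
    infonce = -np.log(pos / (pos + U))               # Equation 1
    dcl = -np.log(pos) + np.log(U)                   # positive removed: Equation 10
    return infonce, dcl

rng = np.random.default_rng(1)
z1 = rng.normal(size=(8, 32)); z1 /= np.linalg.norm(z1, axis=1, keepdims=True)
z2 = rng.normal(size=(8, 32)); z2 /= np.linalg.norm(z2, axis=1, keepdims=True)
linfo, ldcl = per_sample_losses(z1, z2, i=0)
# Since L_DC = L_InfoNCE + log(q_B) and q_B < 1, the decoupled
# loss is always the smaller of the two.
```

The point of the decoupling is not the loss value itself but that the gradient of `dcl` no longer carries the $q_{B,i}$ factor, so the learning signal is not attenuated when the positive pair is already close.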

| ResNet-18 @ 256 Batch Size | DINO | SwAV | SimSiam | VICReg | Barlow Twins | NNCLR | NNCLR+DCL |
|---|---|---|---|---|---|---|---|
| CIFAR10, 1000 Epochs (kNN) | 89.5 | 89.2 | 90.5 | 92.1 | 92.1 | 91.8 | 92.3 |
| ImageNet-100, 400 Epochs (Linear) | 74.9 | 74.0 | 74.5 | 79.2 | 80.2 | 79.8 | 80.6 |
| Method | Param. (M) | Batch Size | Epochs | Linear Top-1 (%) |
|---|---|---|---|---|
| NPID [36] | 24 | 256 | 200 | 56.5 |
| MoCo [15] | 24 | 256 | 200 | 60.6 |
| CMC [30] | 47 | 256 | 280 | 64.1 |
| MoCo-v2 [8] | 28 | 256 | 200 | 67.5 |
| SwAV [5] | 28 | 4096 | 200 | 69.1 |
| SimSiam [9] | 28 | 256 | 200 | 70.0 |
| InfoMin [31] | 28 | 256 | 200 | 70.1 |
| BYOL [13] | 28 | 4096 | 200 | 70.6 |
| SiMo [42] | 28 | 256 | 200 | 68.0 |
| Hypersphere [34] | 28 | 256 | 200 | 67.7 |
| SimCLR [7] | 28 | 256 | 200 | 61.8 |
| SimCLR+DCL | 28 | 256 | 200 | 67.8 |
| SimCLR+DCL (w/ BYOL aug.) | 28 | 256 | 200 | 68.2 |
| PIRL [22] | 24 | 256 | 800 | 63.6 |
| BYOL [13] | 28 | 4096 | 400 | 73.2 |
| SwAV [5] | 28 | 4096 | 400 | 70.7 |
| MoCo-v2 [8] | 28 | 256 | 400 | 71.0 |
| SimSiam [9] | 28 | 256 | 400 | 70.8 |
| Barlow Twins [39] | 28 | 256 | 300 | 70.7 |
| SimCLR [7] | 28 | 4096 | 1000 | 69.3 |
| SimCLR+DCL | 28 | 256 | 400 | 69.5 |
| NNCLR [10] | 28 | 256 | 1000 | 68.7 |
| NNCLR+DCL | 28 | 256 | 400 | 71.1 |
| NNCLR [10] | 28 | 512 | 1000 | 71.7 |
| NNCLR+DCL | 28 | 512 | 400 | 72.3 |
| STL10 | fc7 + Linear | fc7 + 5-NN | Output + Linear | Output + 5-NN |
|---|---|---|---|---|
| Hypersphere | 83.2 | 76.2 | 80.1 | 79.2 |
| DCL | 84.4 (+1.2) | 77.3 (+1.1) | 81.5 (+1.4) | 80.5 (+1.3) |
| ImageNet-100 | Epoch | Memory Queue Size | Linear Top-1 Accuracy (%) |
|---|---|---|---|
| Hypersphere | 200 | 16384 | 77.7 |
| DCL | 200 | 8192 | 80.5 (+2.7) |
| ImageNet-1K | Epoch | Batch Size | Linear Top-1 Accuracy (%) |
|---|---|---|---|
| MoCo-v2 Baseline | 200 | 256 (memory queue = 65536) | 67.5 |
| Hypersphere | 200 | 256 (memory queue = 65536) | 67.7 (+0.2) |
| DCL | 200 | 256 | 68.2 (+0.7) |
| Batch Size | 32 | 64 | 128 | 256 | 768 |
|---|---|---|---|---|---|
| Hypersphere | 78.9 | 81.0 | 81.9 | 82.6 | 83.2 |
| DCL | 81.0 (+2.1) | 82.9 (+1.9) | 83.7 (+1.8) | 84.2 (+1.6) | 84.4 (+1.2) |
Architecture: ResNet-20 @ 200 epochs.

CIFAR10 (top-1):

| Batch Size | 32 | 128 | 256 |
|---|---|---|---|
| Cross entropy | 91.5 | 92.3 | 91.0 |
| DCL | 89.2 | 91.4 | 91.2 |

CIFAR100 (top-1):

| Batch Size | 32 | 128 | 256 |
|---|---|---|---|
| Cross entropy | 61.9 | 62.7 | 61.8 |
| DCL | 60.2 | 61.8 | 61.4 |

$$ q_{B,i}^{(1)} = 1 - \frac{\exp(\langle \mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)} \rangle/\tau)}{\exp(\langle \mathbf{z}_i^{(1)},\mathbf{z}_i^{(2)}\rangle/\tau) + U_{i,1}} \label{eq:NPC} $$ \tag{eq:NPC}


$$ L_{DCL} = L_{DCL,pos} + L_{DCL,neg} $$

$$ \begin{aligned} L_{DCL,neg} = {}& \log(\exp(\langle \mathbf{z}_1^{(k)},\mathbf{z}_2^{(l)} \rangle) + \exp(\langle \mathbf{z}_1^{(k)},\mathbf{z}_3^{(l)} \rangle))\\ &+ \log(\exp(\langle \mathbf{z}_2^{(k)},\mathbf{z}_1^{(l)} \rangle) + \exp(\langle \mathbf{z}_2^{(k)},\mathbf{z}_3^{(l)} \rangle))\\ &+ \log(\exp(\langle \mathbf{z}_3^{(k)},\mathbf{z}_1^{(l)} \rangle) + \exp(\langle \mathbf{z}_3^{(k)},\mathbf{z}_2^{(l)} \rangle)) \end{aligned} $$

$$ \begin{aligned} L_{uniform} = \log(&\exp(\langle \mathbf{z}_1^{(k)},\mathbf{z}_2^{(l)} \rangle) + \exp(\langle \mathbf{z}_1^{(k)},\mathbf{z}_3^{(l)} \rangle) + \exp(\langle \mathbf{z}_2^{(k)},\mathbf{z}_1^{(l)} \rangle)\\ &+ \exp(\langle \mathbf{z}_2^{(k)},\mathbf{z}_3^{(l)} \rangle) + \exp(\langle \mathbf{z}_3^{(k)},\mathbf{z}_1^{(l)} \rangle) + \exp(\langle \mathbf{z}_3^{(k)},\mathbf{z}_2^{(l)} \rangle)) \end{aligned} $$

$$ L_{DCLW} = \sum_{k\in\{1,2\},\, i\in[\![1,N]\!]}{L_{DCLW,i}^{(k)}} $$


| Batch Size | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| **ImageNet-1K (kNN / Linear)** | | | | | |
| Baseline (ResNet-50) | 40.2 / 56.8 | 42.9 / 58.9 | 45.1 / 60.6 | 46.3 / 61.8 | 49.4 / 64.0 |
| w/ DCL (ResNet-50) | 43.7 / 61.5 | 46.3 / 63.4 | 48.5 / 64.3 | 49.8 / 65.9 | 50.1 / 65.8 |
| **ImageNet-100 (kNN / Linear)** | | | | | |
| Baseline (ResNet-50) | 67.8 / 74.2 | 71.9 / 77.6 | 73.2 / 79.3 | 74.6 / 80.7 | 75.4 / 81.3 |
| w/ DCL (ResNet-50) | 74.9 / 80.8 | 76.3 / 82.0 | 76.5 / 81.9 | 76.9 / 83.1 | 76.8 / 82.8 |
| **CIFAR10 (kNN / Linear)** | | | | | |
| Baseline (ResNet-18) | 78.9 / 79.8 | 80.4 / 81.3 | 81.1 / 82.8 | 81.4 / 83.0 | 81.3 / 83.3 |
| w/ DCL (ResNet-18) | 83.7 / 85.1 | 84.4 / 85.9 | 84.4 / 85.7 | 84.2 / 85.3 | 83.5 / 84.7 |
| **CIFAR100 (kNN / Linear)** | | | | | |
| Baseline (ResNet-18) | 49.4 / 51.3 | 50.3 / 53.8 | 51.8 / 55.3 | 52.0 / 56.3 | 52.4 / 56.8 |
| w/ DCL (ResNet-18) | 51.1 / 55.4 | 54.3 / 58.3 | 54.6 / 58.9 | 54.9 / 58.5 | 55.0 / 58.4 |
| **STL10 (kNN / Linear)** | | | | | |
| Baseline (ResNet-18) | 74.1 / 76.2 | 77.6 / 77.8 | 79.3 / 80.0 | 80.7 / 81.3 | 81.3 / 81.5 |
| w/ DCL (ResNet-18) | 82.0 / 85.2 | 82.8 / 86.3 | 81.8 / 86.1 | 81.2 / 85.7 | 81.0 / 85.6 |
| Dataset | CIFAR10 (kNN) | CIFAR100 (kNN) | ImageNet-100 (Linear) | ImageNet-1K (Linear) |
|---|---|---|---|---|
| SimCLR | 81.4 | 52.0 | 80.7 | 61.8 |
| DCL | 84.2 (+2.8) | 54.9 (+2.9) | 83.1 (+2.4) | 65.9 (+4.1) |
| DCLW | 84.8 (+3.4) | 55.2 (+3.2) | 84.2 (+3.5) | 66.9 (+5.1) |
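The DCL objective itself is a one-line change to InfoNCE: the positive pair is removed from the denominator, decoupling the positive and negative gradients. A minimal NumPy sketch of the one-direction loss (our own function name and batch layout, not the paper's released code; it assumes row $i$ of the two views forms a positive pair and all other embeddings from both views serve as negatives):

```python
import numpy as np

def dcl_loss(z1, z2, tau=0.1):
    """Decoupled contrastive loss over view-1 anchors: InfoNCE with the
    positive term dropped from the denominator. z1, z2: (N, d) embeddings
    of the two augmented views."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    N = z1.shape[0]
    pos = np.sum(z1 * z2, axis=1) / tau                  # <z_i^(1), z_i^(2)> / tau
    sim = np.concatenate([z1 @ z1.T, z1 @ z2.T], axis=1) / tau
    mask = np.tile(np.eye(N, dtype=bool), (1, 2))        # self-similarity + positive pair
    neg = np.log(np.where(mask, 0.0, np.exp(sim)).sum(axis=1))
    return float(np.mean(-pos + neg))                    # -pos + log sum over negatives only
```

Symmetrizing, `0.5 * (dcl_loss(z1, z2) + dcl_loss(z2, z1))`, gives the full objective; because the positive no longer appears in the denominator, the gradients lose the batch-size-dependent coupling factor $q_{B,i}$ derived in the proof. DCLW additionally reweights the positive term per sample, which this sketch omits.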
| ImageNet-1K (256 batch size; 200 epochs) | Linear Top-1 Accuracy (%) |
|---|---|
| DCL | 65.9 |
| + optimal (τ, lr) = (0.2, 0.07) | 67.8 (+1.9) |
| + asymmetric augmentation [13] | 68.2 (+0.4) |
ResNet-50 @ 500 epochs.

**CIFAR10 (kNN)**

| Batch Size | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| SimCLR | 82.2 | 85.9 | 88.5 | 88.9 | 89.1 |
| SimCLR w/ DCL | 86.1 | 88.3 | 89.9 | 90.1 | 90.3 |

**CIFAR100 (kNN)**

| Batch Size | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| SimCLR | 49.8 | 55.3 | 59.9 | 60.6 | 61.1 |
| SimCLR w/ DCL | 54.3 | 58.4 | 61.6 | 62.0 | 62.2 |
**ImageNet-100 (Linear)**

| Queue Size | 32 | 64 | 128 | 256 | 8192 |
|---|---|---|---|---|---|
| MoCo-v2 Baseline (ResNet-50) | 73.7 | 76.4 | 78.7 | 78.7 | 79.8 |
| MoCo-v2 w/ DCL (ResNet-50) | 76.2 | 78.3 | 79.6 | 79.6 | 80.5 |

**ImageNet-1K (Linear)**

| Queue Size | 64 | 256 | 65536 |
|---|---|---|---|
| MoCo-v2 Baseline (ResNet-50) | 63.9 | 67.1 | 67.5 |
| MoCo-v2 w/ DCL (ResNet-50) | 65.8 | 67.6 | 67.7 |
| | SimCLR | SimCLR w/ DCL | MoCo-v2 | MoCo-v2 w/ DCL |
|---|---|---|---|---|
| 100 Epoch | 57.5 | 64.6 | 63.6 | 64.4 |
| 200 Epoch | 61.8 | 65.9 | 67.5 | 67.7 |
| ResNet-50 w/ | Epoch | Batch Size | ImageNet-1K (Linear) |
|---|---|---|---|
| SimCLR | 1000 | 4096 | 69.8 |
| BYOL | 400 | 4096 | 73.2 |
| SwAV | 400 | 4096 | 70.7 |
| MoCo-v2 | 400 | 256 | 71.0 |
| NNCLR | 1000 | 256 / 512 | 68.7 / 71.7 |
| NNCLR + DCL | 400 | 256 / 512 | 71.1 / 72.3 |
| ResNet-18 @ 256 Batch Size | DINO | SwAV | SimSiam | VICReg | Barlow Twins | NNCLR | NNCLR + DCL |
|---|---|---|---|---|---|---|---|
| CIFAR10, 1000 Epoch (kNN) | 89.5 | 89.2 | 90.5 | 92.1 | 92.1 | 91.8 | 92.3 |
| ImageNet-100, 400 Epoch (Linear) | 74.9 | 74.0 | 74.5 | 79.2 | 80.2 | 79.8 | 80.6 |
| ImageNet-1K (ResNet-50) | Batch Size | Epoch | Top-1 Accuracy (%) |
|---|---|---|---|
| SimCLR | 256 | 200 | 61.8 |
| SimCLR | 256 | 400 | 64.8 |
| SimCLR | 1024 | 400 | 67.3 |
| SimCLR w/ DCL | 256 | 200 | 67.8 (+6.0) |
| SimCLR w/ DCL | 256 | 400 | 69.5 (+4.7) |
| SimCLR w/ DCL | 1024 | 400 | 69.9 (+2.6) |
| Method | Param. (M) | Batch Size | Epochs | Linear Top-1 (%) |
|---|---|---|---|---|
| NPID [36] | 24 | 256 | 200 | 56.5 |
| MoCo [15] | 24 | 256 | 200 | 60.6 |
| CMC [30] | 47 | 256 | 280 | 64.1 |
| MoCo-v2 [8] | 28 | 256 | 200 | 67.5 |
| SwAV [5] | 28 | 4096 | 200 | 69.1 |
| SimSiam [9] | 28 | 256 | 200 | 70.0 |
| InfoMin [31] | 28 | 256 | 200 | 70.1 |
| BYOL [13] | 28 | 4096 | 200 | 70.6 |
| SiMo [42] | 28 | 256 | 200 | 68.0 |
| Hypersphere [34] | 28 | 256 | 200 | 67.7 |
| SimCLR [7] | 28 | 256 | 200 | 61.8 |
| SimCLR+DCL | 28 | 256 | 200 | 67.8 |
| SimCLR+DCL (w/ BYOL aug.) | 28 | 256 | 200 | 68.2 |
| PIRL [22] | 24 | 256 | 800 | 63.6 |
| BYOL [13] | 28 | 4096 | 400 | 73.2 |
| SwAV [5] | 28 | 4096 | 400 | 70.7 |
| MoCo-v2 [8] | 28 | 256 | 400 | 71.0 |
| SimSiam [9] | 28 | 256 | 400 | 70.8 |
| Barlow Twins [39] | 28 | 256 | 300 | 70.7 |
| SimCLR [7] | 28 | 4096 | 1000 | 69.3 |
| SimCLR+DCL | 28 | 256 | 400 | 69.5 |
| NNCLR [10] | 28 | 256 | 1000 | 68.7 |
| NNCLR+DCL | 28 | 256 | 400 | 71.1 |
| NNCLR [10] | 28 | 512 | 1000 | 71.7 |
| NNCLR+DCL | 28 | 512 | 400 | 72.3 |
| STL10 | fc7 + Linear | fc7 + 5-NN | Output + Linear | Output + 5-NN |
|---|---|---|---|---|
| Hypersphere | 83.2 | 76.2 | 80.1 | 79.2 |
| DCL | 84.4 (+1.2) | 77.3 (+1.1) | 81.5 (+1.4) | 80.5 (+1.3) |
| ImageNet-100 | Epoch | Memory Queue Size | Linear Top-1 Accuracy (%) |
|---|---|---|---|
| Hypersphere | 240 | 16384 | 75.6 |
| DCL | 240 | 16384 | 76.8 (+1.2) |

| ImageNet-100 | Epoch | Memory Queue Size | Linear Top-1 Accuracy (%) |
|---|---|---|---|
| Hypersphere | 200 | 16384 | 77.7 |
| DCL | 200 | 8192 | 80.5 (+2.7) |
| ImageNet-1K | Epoch | Batch Size | Linear Top-1 Accuracy (%) |
|---|---|---|---|
| MoCo-v2 Baseline | 200 | 256 (memory queue = 65536) | 67.5 |
| Hypersphere | 200 | 256 (memory queue = 65536) | 67.7 (+0.2) |
| DCL | 200 | 256 | 68.2 (+0.7) |
| Batch Size | 32 | 64 | 128 | 256 | 768 |
|---|---|---|---|---|---|
| Hypersphere | 78.9 | 81.0 | 81.9 | 82.6 | 83.2 |
| DCL | 81.0 (+2.1) | 82.9 (+1.9) | 83.7 (+1.8) | 84.2 (+1.6) | 84.4 (+1.2) |
| Downstream Task (Accuracy) | Speaker Identification† (%) | Intent Classification‡ (%) |
|---|---|---|
| wav2vec 2.0 Base Baseline | 74.9 | 92.3 |
| wav2vec 2.0 Base w/ DCL | 75.2 | 92.5 |
ResNet-20 @ 200 epochs.

**CIFAR10 (top-1)**

| Batch Size | 32 | 128 | 256 |
|---|---|---|---|
| Cross entropy | 91.5 | 92.3 | 91.0 |
| DCL | 89.2 | 91.4 | 91.2 |

**CIFAR100 (top-1)**

| Batch Size | 32 | 128 | 256 |
|---|---|---|---|
| Cross entropy | 61.9 | 62.7 | 61.8 |
| DCL | 60.2 | 61.8 | 61.4 |
| Temperature τ | 0.07 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 | Std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SimCLR | 83.6 | 87.5 | 89.5 | 89.2 | 88.7 | 89.1 | 88.5 | 87.6 | 86.8 | 85.9 | 85.3 | 1.44 |
| SimCLR w/ DCL | 88.3 | 89.4 | 90.8 | 89.9 | 89.6 | 90.3 | 89.6 | 89.0 | 88.5 | 88.0 | 87.7 | 0.98 |


References

[bib1] Baevski, A., Zhou, Y., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

[bib2] Bardes, A., Ponce, J., LeCun, Y.: Vicreg: Variance-invariance-covariance regularization for self-supervised learning. CoRR abs/2105.04906 (2021)

[bib3] Belghazi, M.I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Hjelm, R.D., Courville, A.C.: Mutual information neural estimation. In: Proceedings of the International Conference on Machine Learning (ICML) (2018)

[bib4] Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: European Conference on Computer Vision (ECCV) (2018)

[bib5] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

[bib6] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. CoRR abs/2104.14294 (2021)

[bib7] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. In: Proceedings of the International Conference on Machine Learning (ICML) (2020)

[bib8] Chen, X., Fan, H., Girshick, R.B., He, K.: Improved baselines with momentum contrastive learning. CoRR abs/2003.04297 (2020)

[bib9] Chen, X., He, K.: Exploring simple siamese representation learning. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

[bib10] Dwibedi, D., Aytar, Y., Tompson, J., Sermanet, P., Zisserman, A.: With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9588–9597 (2021)

[bib11] Ermolov, A., Siarohin, A., Sangineto, E., Sebe, N.: Whitening for self-supervised representation learning. In: International Conference on Machine Learning (ICML) (2021)

[bib12] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. In: International Conference on Learning Representations (ICLR) (2018)

[bib13] Grill, J., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.Á., Guo, Z., Azar, M.G., Piot, B., Kavukcuoglu, K., Munos, R., Valko, M.: Bootstrap your own latent - A new approach to self-supervised learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

[bib14] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2006)

[bib15] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

[bib16] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

[bib17] Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., Bengio, Y.: Learning deep representations by mutual information estimation and maximization. In: International Conference on Learning Representations (ICLR) (2019)

[bib18] Idelbayev, Y.: Proper ResNet implementation for CIFAR10/CIFAR100 in PyTorch. https://github.com/akamaster/pytorch_resnet_cifar10, accessed: 20xx-xx-xx

[bib19] Kalantidis, Y., Sariyildiz, M.B., Pion, N., Weinzaepfel, P., Larlus, D.: Hard negative mixing for contrastive learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

[bib20] Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., Krishnan, D.: Supervised contrastive learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

[bib21] Lugosch, L., Ravanelli, M., Ignoto, P., Tomar, V.S., Bengio, Y.: Speech model pre-training for end-to-end spoken language understanding. In: the Annual Conference of the International Speech Communication Association (InterSpeech) (2019)

[bib22] Misra, I., Maaten, L.v.d.: Self-supervised learning of pretext-invariant representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2020)

[bib23] Nagrani, A., Chung, J.S., Xie, W., Zisserman, A.: Voxceleb: Large-scale speaker verification in the wild. Comput. Speech Lang. 60 (2020)

[bib24] Noroozi, M., Favaro, P.: Unsupervised learning of visual representations by solving jigsaw puzzles. In: European Conference on Computer Vision (ECCV) (2016)

[bib25] van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. CoRR abs/1807.03748 (2018)

[bib26] Ozair, S., Lynch, C., Bengio, Y., van den Oord, A., Levine, S., Sermanet, P.: Wasserstein dependency measure for representation learning. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)

[bib27] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. Proceedings of Machine Learning Research, vol. 139, pp. 8748–8763. PMLR (2021)

[bib28] Ren, H.: A pytorch implementation of simclr. https://github.com/leftthomas/SimCLR (2020)

[bib29] Robinson, J.D., Chuang, C., Sra, S., Jegelka, S.: Contrastive learning with hard negative samples. In: International Conference on Learning Representations (ICLR) (2021)

[bib30] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: European Conference on Computer Vision (ECCV) (2020)

[bib31] Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., Isola, P.: What makes for good views for contrastive learning? In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

[bib32] Tsai, Y.H., Ma, M.Q., Yang, M., Zhao, H., Morency, L., Salakhutdinov, R.: Self-supervised representation learning with relative predictive coding. In: International Conference on Learning Representations (ICLR) (2021)

[bib33] Wang, P.: x-clip. https://github.com/lucidrains/x-clip (2021)

[bib34] Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: International Conference on Machine Learning (ICML) (2020)

[bib35] Wang, X., Liu, Z., Yu, S.X.: Unsupervised feature learning by cross-level instance-group discrimination. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

[bib36] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

[bib37] Ye, M., Zhang, X., Yuen, P.C., Chang, S.F.: Unsupervised embedding learning via invariant and spreading instance feature. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

[bib38] You, Y., Gitman, I., Ginsburg, B.: Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888 (2017)

[bib39] Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: Self-supervised learning via redundancy reduction. In: International Conference on Machine Learning. pp. 12310–12320. PMLR (2021)

[bib40] Zhan, X., Xie, J., Liu, Z., Lin, D., Change Loy, C.: OpenSelfSup: Open mmlab self-supervised learning toolbox and benchmark. https://github.com/open-mmlab/openselfsup (2020)

[bib41] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: European Conference on Computer Vision (ECCV) (2016)

[bib42] Zhu, B., Huang, J., Li, Z., Zhang, X., Sun, J.: Eqco: Equivalent rules for self-supervised contrastive learning. arXiv preprint arXiv:2010.01929 (2020)