Understanding Dimensional Collapse in Contrastive Self-supervised Learning
Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian, Facebook AI Research
Abstract
Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. Joint embedding approaches are based on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem, where all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that lead to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on an explicit trainable projector. Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.
Introduction
Self-supervised learning aims to learn useful representations of the input data without relying on human annotations. Recent advances in self-supervised visual representation learning based on joint embedding methods (Misra & Maaten, 2020b; He et al., 2020; Chen et al., 2020a; Chen & He, 2020; Grill et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Chen et al., 2020b; Dwibedi et al., 2021; Li et al., 2021; Misra & Maaten, 2020a; HaoChen et al., 2021; Assran et al., 2021; Caron et al., 2021) show that self-supervised representations have competitive performance compared with supervised ones. These methods generally aim to learn representations invariant to data augmentations by maximizing the agreement between embedding vectors from different distortions of the same images.
As there are trivial solutions where the model maps all inputs to the same constant vector, known as the collapsing problem, various methods relying on different mechanisms have been proposed to solve it. Contrastive methods like Chen et al. (2020a) and He et al. (2016) define 'positive' and 'negative' sample pairs which are treated differently in the loss function. Non-contrastive methods like Grill et al. (2020) and Chen & He (2020) use stop-gradient and an extra predictor to prevent collapse without negative pairs; Caron et al. (2018; 2020) use an additional clustering step; and Zbontar et al. (2021) minimize the redundant information between two branches.
These self-supervised learning methods are successful in preventing complete collapse, whereby all representation vectors shrink into a single point. However, it has been observed empirically in non-contrastive learning methods (Hua et al., 2021; Tian et al., 2021) that, while the embedding vectors do not completely collapse, they collapse along certain dimensions. This is known as dimensional collapse (Hua et al., 2021), whereby the embedding vectors only span a lower-dimensional subspace.
In contrastive methods that explicitly use positive and negative pairs in the loss function, it seems intuitive to speculate that the repulsive effect of negative examples should prevent this kind of dimensional collapse and make full use of all dimensions. However, contrary to this intuition, contrastive learning methods still suffer from dimensional collapse (see Fig. 7). In this work, we theoretically study the dynamics behind this phenomenon. We show that there are two different mechanisms that cause collapse: (1) along feature directions where the variance caused by the data augmentation is larger than the variance caused by the data distribution, the weight collapses; and (2) even if the covariance of the data augmentation has a smaller magnitude than the data variance along all dimensions, the weights still collapse due to the interplay of weight matrices at different layers, known as implicit regularization. This kind of collapse happens only in networks with more than one layer.
Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR , which directly optimizes the encoder (i.e., representation space) without relying on an explicit trainable projector. DirectCLR outperforms SimCLR with a linear trainable projector on ImageNet.
We summarize our contributions as follows:
· We empirically show that contrastive self-supervised learning suffers from dimensional collapse, whereby the embedding vectors fall into a lower-dimensional subspace instead of spanning the entire available embedding space.
· We show that there are two mechanisms causing dimensional collapse in contrastive learning: (1) strong augmentation along feature dimensions and (2) implicit regularization driving models toward low-rank solutions.
· We propose DirectCLR, a novel contrastive learning method that directly optimizes the representation space without relying on an explicit trainable projector. DirectCLR outperforms SimCLR with a linear trainable projector.
Related Works
Self-supervised Learning Methods Joint embedding methods are a promising approach in self-supervised learning, whose principle is to match the embedding vectors of augmented views of a training instance. Contrastive methods (Chen et al., 2020a; He et al., 2016) directly compare training samples by effectively viewing each sample as its own class, typically based on the InfoNCE contrastive loss (van den Oord et al., 2018), which encourages representations from positive pairs of examples to be close in the embedding space while representations from negative pairs are pushed away from each other. In practice, contrastive methods are known to require a large number of negative samples. Non-contrastive methods do not directly rely on explicit negative samples. These include clustering-based methods (Caron et al., 2018; 2020), redundancy reduction methods (Zbontar et al., 2021; Bardes et al., 2021), and methods using special architecture designs (Grill et al., 2020; Chen & He, 2020).
Theoretical Understanding of Self-supervised Learning Although self-supervised learning models have shown success in learning useful representations and have outperformed their supervised counterpart in several downstream transfer learning benchmarks (Chen et al., 2020a), the underlying dynamics of these methods remains somewhat mysterious and poorly understood. Several theoretical works have attempted to understand it. Arora et al. (2019b); Lee et al. (2020); Tosh et al. (2021) theoretically proved that the learned representations via contrastive learning are useful for downstream tasks. Tian et al. (2021) explained why non-contrastive learning methods like BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2020) work: the dynamics of the alignment of eigenspaces between the predictor and its input correlation matrix play a key role in preventing complete collapse.
Implicit Regularization It has been shown theoretically that gradient descent drives adjacent matrices to align in the linear neural network setting (Ji & Telgarsky, 2019). Under the aligned-matrix assumption, Gunasekar et al. (2018) prove that gradient descent finds the minimal nuclear norm solution. Arora et al. (2019a) extend this concept to the deep linear network case by theoretically and empirically demonstrating that a deep linear network finds low-rank solutions. In general, over-parametrized neural networks tend to find flatter local minima (Saxe et al., 2019; Neyshabur et al., 2019; Soudry et al., 2018; Barrett & Dherin, 2021).
Dimensional Collapse
Self-supervised learning methods learn useful representation by minimizing the distances between embedding vectors from augmented images (Figure 1a). On its own, this would result in a collapsed

Figure 1: Illustration of the collapsing problem. Under complete collapse, the embedding vectors collapse to the same point. Under dimensional collapse, the embedding vectors only span a lower-dimensional subspace.
solution where the produced representation becomes constant (Figure 1b). Contrastive methods prevent complete collapse via the negative term that pushes embedding vectors of different input images away from each other. In this section, we show that while they prevent complete collapse, contrastive methods still experience dimensional collapse, in which the embedding vectors only span a lower-dimensional subspace of the full embedding space (Figure 1c).
We train a SimCLR model (Chen et al., 2020a) with a two-layer MLP projector. We follow the standard recipe and train the model on ImageNet for 100 epochs. We evaluate the dimensionality by collecting the embedding vectors on the validation set. Each embedding vector has size d = 128. We compute the covariance matrix C ∈ R^{d×d} of the embedding layer (here z̄ := ∑_{i=1}^N z_i / N and N is the total number of samples):

$$
C = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})(z_i - \bar{z})^T \tag{1}
$$
Figure 2 shows the singular value decomposition of this matrix (C = U S V^T, S = diag(σ^k)), with the singular values plotted in sorted order and logarithmic scale ({log(σ^k)}). We observe that a number of singular values collapse to zero, representing collapsed dimensions.
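The spectrum computation above can be sketched in a few lines of numpy; the synthetic low-rank embeddings here are only an illustrative stand-in for the real SimCLR outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for embeddings collected on a validation set:
# N vectors of dimension d that only span an r-dimensional subspace.
N, d, r = 1024, 128, 40                    # r = intrinsic rank (hypothetical)
basis = rng.standard_normal((r, d))
z = rng.standard_normal((N, r)) @ basis    # embeddings confined to an r-dim subspace

# Covariance matrix C = (1/N) * sum_i (z_i - z_bar)(z_i - z_bar)^T
z_bar = z.mean(axis=0)
C = (z - z_bar).T @ (z - z_bar) / N        # shape (d, d)

# Singular value spectrum, sorted in decreasing order by np.linalg.svd
sigma = np.linalg.svd(C, compute_uv=False)
log_spectrum = np.log(sigma + 1e-20)       # epsilon avoids log(0)

# Collapsed dimensions show up as singular values near zero
print("effective rank:", np.sum(sigma > 1e-6 * sigma[0]), "out of", d)
```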

Figure 2: Singular value spectrum of the embedding space. The embedding vectors are computed from a pretrained SimCLR model on the validation set of ImageNet. Each embedding vector has a dimension of 128. The spectrum contains the singular values of the covariance matrix of these embedding vectors in sorted order and logarithmic scale. A number of singular values drop to zero, indicating collapsed dimensions.
Dimensional Collapse caused by Strong Augmentation
Linear Model
In this section, we explain one scenario in which contrastive learning develops collapsed embedding dimensions: the augmentation overwhelms the information in the input. We focus on a simple linear network setting. We denote the input vector by x, and the augmentation is additive noise. The network is a single linear layer with weight matrix W; hence, the embedding vector is z = W x. We focus on a typical contrastive loss, InfoNCE (van den Oord et al., 2018):
$$
\mathcal{L} = -\sum_i \log \frac{\exp\left(-|z_i - z'_i|^2 / 2\right)}{\sum_{j \neq i} \exp\left(-|z_i - z_j|^2 / 2\right) + \exp\left(-|z_i - z'_i|^2 / 2\right)} \tag{2}
$$
where z_i and z'_i are a pair of embedding vectors from the two branches, and z_j denotes the negative samples within the minibatch. When all z_i and z'_i are normalized to unit vectors, the negative squared distance −|z_i − z'_i|² / 2 can be replaced by the inner product z_i^T z'_i. The model is trained with basic stochastic gradient descent without momentum or weight decay.
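For concreteness, a minimal numpy sketch of this InfoNCE loss with Gaussian-kernel similarities; the batch shapes are illustrative assumptions:

```python
import numpy as np

def info_nce(z, z_prime):
    """InfoNCE loss on a batch of embedding pairs (numpy sketch).

    z, z_prime: (N, d) arrays of embeddings from the two branches.
    Uses negative squared distances -|z_i - z_j|^2 / 2 as similarities,
    matching the form used in the linear-model analysis."""
    N = z.shape[0]
    loss = 0.0
    for i in range(N):
        # similarity with the positive pair
        pos = -np.sum((z[i] - z_prime[i]) ** 2) / 2
        # similarities with in-batch negatives z_j, j != i
        negs = [-np.sum((z[i] - z[j]) ** 2) / 2 for j in range(N) if j != i]
        logits = np.array(negs + [pos])
        # -log( exp(pos) / (sum_j exp(neg_j) + exp(pos)) )
        loss += np.log(np.sum(np.exp(logits))) - pos
    return loss / N
```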
Gradient Flow Dynamics
We study the dynamics via gradient flow, i.e., gradient descent with an infinitesimally small learning rate.
Lemma 1. The weight matrix in a linear contrastive self-supervised learning model evolves by:

$$
\dot{W} = -\frac{\partial \mathcal{L}}{\partial W} = -\sum_i \left( g_{z_i} x_i^T + g_{z'_i} {x'_i}^T \right) \tag{3}
$$

where g_{z_i} := ∂L/∂z_i and g_{z'_i} := ∂L/∂z'_i are the gradients with respect to the embedding vectors of the two branches.
This can be easily proven based on the chain rule. See proof in Appendix B.1. For the InfoNCE loss defined in Eqn 2, the gradient of the embedding vector for each branch can be written as

$$
g_{z_i} = (1 - \alpha_{ii})(z_i - z'_i) - \sum_{j \neq i} (\alpha_{ij} + \alpha_{ji})(z_i - z_j), \qquad g_{z'_i} = -(1 - \alpha_{ii})(z_i - z'_i) \tag{4}
$$
where {α_ij} are the softmax weights of the similarities between z_i and {z_j}, defined by α_ij = exp(−|z_i − z_j|² / 2) / Z_i, α_ii = exp(−|z_i − z'_i|² / 2) / Z_i, and Z_i = ∑_{j ≠ i} exp(−|z_i − z_j|² / 2) + exp(−|z_i − z'_i|² / 2). Hence, ∑_j α_ij = 1. Since z_i = W x_i, we have
$$
\dot{W} = W X \tag{5}
$$

where

$$
X = \sum_i \left[ \sum_{j \neq i} (\alpha_{ij} + \alpha_{ji})(x_i - x_j) x_i^T - (1 - \alpha_{ii})(x'_i - x_i)(x'_i - x_i)^T \right] \tag{6}
$$
Lemma 2. X is a difference of two PSD matrices:

$$
X = \hat{\Sigma}_0 - \hat{\Sigma}_1 \tag{7}
$$
Here Σ̂_0 = ∑_{i,j} α_ij (x_i − x_j)(x_i − x_j)^T is a weighted data distribution covariance matrix and Σ̂_1 = ∑_i (1 − α_ii)(x'_i − x_i)(x'_i − x_i)^T is a weighted augmentation distribution covariance matrix.
See proof in Appendix B.2. Therefore, the amplitude of the augmentation determines whether X is a positive definite matrix. Similar to Theorems 3-4 in Tian et al. (2020), Lemma 2 also models the time derivative of the weight W as a product of W and a symmetric (and possibly PSD) matrix. However, Lemma 2 is much more general: it applies to InfoNCE with multiple negative contrastive terms, remains true when α_ij varies with the sample pair (i, j), and holds with finite batch size N. In contrast, Theorem 4 in Tian et al. (2020) only works for one negative term in InfoNCE, holds only in the population sense (i.e., N → +∞), and its formulation has residual terms if α_ij are not constants.
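To make the quantities in Lemma 2 concrete, the weighted covariance matrices can be formed directly from a batch; this sketch assumes uniform softmax weights α_ij = 1/N purely for simplicity:

```python
import numpy as np

# Building the weighted covariance matrices of Lemma 2 from a batch
# (sketch; uniform alpha weights are a simplifying assumption).
rng = np.random.default_rng(0)
N, d = 32, 8
x  = rng.standard_normal((N, d))              # inputs
xp = x + 0.3 * rng.standard_normal((N, d))    # augmented views (additive noise)

alpha = np.full((N, N), 1.0 / N)              # assumed uniform weights, rows sum to 1

# Weighted data distribution covariance
Sigma0 = sum(alpha[i, j] * np.outer(x[i] - x[j], x[i] - x[j])
             for i in range(N) for j in range(N))
# Weighted augmentation distribution covariance
Sigma1 = sum((1 - alpha[i, i]) * np.outer(xp[i] - x[i], xp[i] - x[i])
             for i in range(N))

X = Sigma0 - Sigma1
# With weak augmentation (small additive noise), X stays positive definite.
eigs = np.linalg.eigvalsh(X)
print("min eigenvalue of X:", eigs.min())
```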
Next, we look into the dynamics of weight matrix W given property of X .
Theorem 1. With fixed matrix X (defined in Eqn 6) and strong augmentation such that X has negative eigenvalues, the weight matrix W has vanishing singular values.
See proof in Appendix B.3.
Corollary 1 (Dimensional Collapse Caused by Strong Augmentation) . With strong augmentation, the embedding space covariance matrix becomes low-rank.
The embedding space is characterized by the singular value spectrum of the covariance matrix of the embeddings (Eqn. 1): C = ∑_i (z_i − z̄)(z_i − z̄)^T / N = ∑_i W (x_i − x̄)(x_i − x̄)^T W^T / N. Since W has vanishing singular values, C is also low-rank, indicating collapsed dimensions.
Numerical simulation verifies our theory. We choose the input data to be isotropic Gaussian with covariance matrix ∑_{i,j} (x_i − x_j)(x_i − x_j)^T / N = I. We set the augmentation to be additive Gaussian with covariance matrix ∑_i (x'_i − x_i)(x'_i − x_i)^T / N = blockdiag(0, k · I), where the nonzero block has size 8x8. We plot the weight matrix singular value spectrum in Figure 3 for various augmentation amplitudes k. This confirms that, in the linear network setting, strong augmentation leads to dimensional collapse in the embedding space.
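This simulation can be reproduced with a simple Euler integration of the gradient flow, written here as Ẇ = WX with X = Σ̂_0 − Σ̂_1 as in Lemma 2; the step size, number of steps, and initialization scale are arbitrary choices of this sketch:

```python
import numpy as np

# Gradient-flow sketch: W_dot = W X with X = Sigma0 - Sigma1 (Lemma 2).
# Strong augmentation makes X indefinite; W's singular values vanish
# along X's negative eigendirections. (Euler integration, toy scales.)
d, k, steps, lr = 16, 4.0, 2000, 1e-3
Sigma0 = np.eye(d)                      # isotropic data covariance
Sigma1 = np.zeros((d, d))
Sigma1[8:, 8:] = k * np.eye(8)          # strong augmentation on the last 8 dims
X = Sigma0 - Sigma1                     # negative eigenvalues on those dims

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((d, d))
for _ in range(steps):
    W = W + lr * (W @ X)                # Euler step of W_dot = W X

sigma = np.linalg.svd(W, compute_uv=False)
print(sigma)   # the smallest singular values shrink toward zero
```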
Our theory in this section is limited to linear network settings. For more complex nonlinear networks, the collapsing condition will still depend on 'strong augmentation' but interpreted differently. A strong augmentation will be determined by more complicated properties of the augmentation (higher-order statistics of augmentation, manifold property of augmentation vs. data distribution) conditioned on the capacity of the networks.
Dimensional Collapse caused by Implicit Regularization
Two-layer linear model
With strong augmentation, a linear model under the InfoNCE loss suffers dimensional collapse. However, this scenario relies on the network having limited capacity, which may not hold in real cases. On the other hand, when there is no

Figure 3: Weight matrix singular value spectrum with different augmentation amplitudes k. The setting is a single-layer linear toy model with a weight matrix of size 16x16; the augmented block has size 8x8. Strong augmentation results in vanishing singular values in the weight matrix.
strong augmentation (Σ̂_1 ≺ Σ̂_0), and thus the X matrix remains PSD, a single linear model will not suffer dimensional collapse. Interestingly, however, dimensional collapse still happens in practice for deep networks. In the following, we show that it stems from a different cause: implicit regularization, whereby over-parametrized linear networks tend to find low-rank solutions.
To understand this counter-intuitive phenomenon, we start with the simplest over-parametrized setting, choosing the network to be a two-layer linear MLP without bias. The weight matrices of the two layers are denoted W_1 ∈ R^{d×d} and W_2 ∈ R^{d×d}. As in the setting of Sec 4, the input vector is denoted x and the augmentation is additive noise. The embedding vector from each branch is z = W_2 W_1 x, hence z ∈ R^d. We do not normalize z. See Figure 4. We use the InfoNCE loss defined in Eqn 2. The model is trained with basic stochastic gradient descent without momentum or weight decay.
Gradient Flow Dynamics

As in the single-layer case, we study the dynamics via gradient flow. Applying the chain rule to z_i = W_2 W_1 x_i, with the embedding gradients and the matrix X defined exactly as before (Eqn 6), the two weight matrices evolve by

$$
\dot{W}_1 = W_2^T W_2 W_1 X, \qquad \dot{W}_2 = W_2 W_1 X W_1^T \tag{8}
$$

With small augmentation, X remains positive definite, so neither weight matrix is directly driven toward low rank. The collapse instead emerges from the interplay between W_1 and W_2, which we analyze next.
Weight Alignment
Since we have two matrices W_1 and W_2, the first question is how they interact with each other. We apply the singular value decomposition to both matrices, i.e., W_1 = U_1 S_1 V_1^T and W_2 = U_2 S_2 V_2^T, with S_1 = diag([σ_1^k]) and S_2 = diag([σ_2^k]). The alignment is now governed by the interaction

Figure 4: Two-layer Linear Model
between the adjacent orthonormal matrices V_2 := [v_2^k] and U_1 = [u_1^k]. This can be characterized by the alignment matrix A = V_2^T U_1, whose (k, k′)-entry represents the alignment between the k-th right singular vector v_2^k of W_2 and the k′-th left singular vector u_1^{k′} of W_1. The following shows that W_1 and W_2 indeed align.
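Computing the alignment matrix from two weight matrices is a one-liner on top of numpy's SVD. In this sketch the matrices are random, so |A| will not be close to the identity; the point is only the construction:

```python
import numpy as np

# How the alignment matrix A = V2^T U1 is computed (names mirror the text).
rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 16))
W2 = rng.standard_normal((16, 16))

U1, S1, V1t = np.linalg.svd(W1)   # W1 = U1 @ diag(S1) @ V1t
U2, S2, V2t = np.linalg.svd(W2)   # rows of V2t are right singular vectors of W2

A = V2t @ U1                      # (k, k') entry: <v2_k, u1_k'>
# A is orthogonal; perfect alignment would mean |A| ≈ I.
print(np.round(np.abs(A), 2))
```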
Theorem 2 (Weight matrices align). If for all t, W_2(t) W_1(t) ≠ 0, X(t) is positive-definite, and W_1(+∞), W_2(+∞) have distinct singular values, then the alignment matrix A = V_2^T U_1 → I.
See proof in Appendix B.5. Here, we also empirically demonstrate that under InfoNCE loss, the absolute value of the alignment matrix A converges to an identity matrix. See Figure 5.
The alignment effect has been studied in other scenarios (Ji & Telgarsky, 2019; Radhakrishnan et al., 2020). In practice, when some of our assumptions are not satisfied, e.g., when there are degenerate singular values in the weight matrices, we will not observe perfect alignment. This is easily understood from the fact that the singular value decomposition is no longer unique given degenerate singular values. In our toy experiment, we specifically initialize the weight matrices to have non-degenerate singular values. In realistic scenarios, when the weight matrices are randomly initialized, the alignment matrix only converges to a block-diagonal matrix, with each block corresponding to a group of degenerate singular values.
Given that singular vectors corresponding to matched singular values align, we can now study the dynamics of the singular values of each weight matrix W_1 and W_2:

$$
\dot{\sigma}_1^k = \sigma_1^k \left(\sigma_2^k\right)^2 \left( v_1^{kT} X v_1^k \right), \qquad \dot{\sigma}_2^k = \sigma_2^k \left(\sigma_1^k\right)^2 \left( v_1^{kT} X v_1^k \right) \tag{9}
$$

which implies the conservation law

$$
\frac{d}{dt}\left[ \left(\sigma_2^k\right)^2 - \left(\sigma_1^k\right)^2 \right] = 0 \tag{10}
$$
See proof in Appendix B.6. According to Eqn. 10, (σ_2^k)² = (σ_1^k)² + C for some constant C. We can then solve the singular value dynamics analytically: σ̇_1^k = σ_1^k ((σ_1^k)² + C)(v_1^{kT} X v_1^k). This shows that paired singular values (those with the same ranking in the other matrix) have gradients proportional to themselves. Since X is a positive definite matrix, the term v_1^{kT} X v_1^k is always non-negative. This explains why we observe that the smallest group of singular values grows significantly more slowly. See the demonstrative experimental results in Figures 6a and 6b.

Figure 5: Visualization of the alignment matrix A = V_2^T U_1 after training. The setting is a 2-layer linear toy model with each weight matrix of size 16x16. The alignment matrix converges to an identity matrix.

Figure 6: Evolution of the singular values of the weight matrices and the embedding space covariance matrix. The setting is a 2-layer linear toy model with each weight matrix of the size of 16x16. The lowest few singular values of each weight matrix remain significantly smaller.
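The conservation law behind Eqn 10 can be checked numerically by integrating the two-layer gradient flow: the balanced quantity W_1 W_1^T − W_2^T W_2 should stay (approximately) constant. The flow form, toy scales, and step sizes below are assumptions of this sketch:

```python
import numpy as np

# Numerical check of the conservation law: under the two-layer gradient flow
#   W1_dot = W2^T W2 W1 X,   W2_dot = W2 W1 X W1^T
# the quantity W1 @ W1.T - W2.T @ W2 is conserved (up to Euler error).
d, steps, lr = 8, 800, 1e-3
rng = np.random.default_rng(1)
B = rng.standard_normal((d, d))
X = B @ B.T
X /= np.linalg.eigvalsh(X).max()        # positive semi-definite, spectral norm 1

W1 = 0.1 * rng.standard_normal((d, d))
W2 = 0.1 * rng.standard_normal((d, d))
invariant_init = W1 @ W1.T - W2.T @ W2

for _ in range(steps):
    dW1 = W2.T @ W2 @ W1 @ X            # W1_dot
    dW2 = W2 @ W1 @ X @ W1.T            # W2_dot
    W1 = W1 + lr * dW1
    W2 = W2 + lr * dW2

drift = np.linalg.norm(W1 @ W1.T - W2.T @ W2 - invariant_init)
print(drift)   # small: balancedness is conserved along the flow
```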
Corollary 2 (Dimensional Collapse Caused by Implicit Regularization) . With small augmentation and over-parametrized linear networks, the embedding space covariance matrix becomes low-rank.
The embedding space is characterized by the singular value spectrum of the covariance matrix of the embedding vectors: C = ∑ (z − z̄)(z − z̄)^T / N = ∑ W_2 W_1 (x − x̄)(x − x̄)^T W_1^T W_2^T / N. As W_2 W_1 evolves to be low-rank, C is low-rank, indicating collapsed dimensions. See Figure 6c for experimental verification.
Our theory can also be extended to multilayer networks and the nonlinear setting; please see Appendix C.
DirectCLR
Motivation
We now leverage our theoretical finding to design novel algorithms. Here we are targeting the projector component in contrastive learning.
Empirically, adding a projector substantially improves the quality of the learned representation and downstream performance (Chen et al., 2020a). Checking the spectrum of the representation layer also reveals a difference with and without a projector. To see this, we train two SimCLR models, with and without a projector. The representation space spectra are shown in Figure 7b. Dimensional collapse in the representation space happens when the model is trained without a projector. Thus, the projector prevents collapse in the representation space.

Figure 7: (a) Definition of the representation and the embedding space; (b) singular value spectra of the representation space of pretrained contrastive learning models (pretrained with or without a projector). The representation vectors are the output of the ResNet50 encoder and are directly used for downstream tasks. Each representation vector has a dimension of 2048. Without a projector, SimCLR suffers from dimensional collapse in the representation space.
The projector in contrastive learning is essential to prevent dimensional collapse in the representation space. We claim the following propositions regarding a linear projector in contrastive learning models.
Proposition 1. A linear projector weight matrix only needs to be diagonal.
Based on our theory of the implicit regularization dynamics, we expect adjacent layers W_1 (= U_1 S_1 V_1^T) and W_2 (= U_2 S_2 V_2^T) to align, such that the overall dynamics is governed only by their singular values S_1 and S_2. The orthogonal matrices V_2^T and U_1 are redundant, as they evolve to satisfy V_2^T U_1 = I.
Now, consider a SimCLR model with a linear projector, focusing only on the channel dimension. W_1 is the last layer in the encoder, and W_2 is the projector weight matrix. We claim that for this projector matrix W_2, the orthogonal component V_2 can be omitted: because the previous layer W_1 is fully trainable, its orthogonal component U_1 will always evolve to satisfy V_2^T U_1 = I. Therefore, the final behavior of the projector is determined only by the singular values S_2 of the projector weight matrix. This motivates Proposition 1: the orthogonal component of the weight matrix does not matter, so we can set the projector matrix to be diagonal.
Also, according to our theory, the weight matrix always converges toward a low-rank solution. The diagonal matrix of singular values naturally becomes low-rank, so we may as well set it to be low-rank directly. This is the motivation of Proposition 2.

Proposition 2. A linear projector weight matrix only needs to be low-rank.
These propositions are verified via ablation studies in Sec 6.3. Given these two propositions, we propose DirectCLR, which effectively uses a fixed low-rank diagonal projector.
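The equivalence that DirectCLR exploits is easy to verify concretely: a fixed rank-d_0 diagonal projector produces exactly the leading subvector of the representation. The sizes below are arbitrary illustrative choices:

```python
import numpy as np

# Illustration of Propositions 1-2: a fixed low-rank diagonal projector is
# equivalent to slicing a leading subvector. Sizes are arbitrary choices.
D, d0 = 2048, 360
r = np.random.default_rng(0).standard_normal(D)   # a representation vector

proj = np.zeros((d0, D))
np.fill_diagonal(proj, 1.0)    # diagonal entries 1, rank d0, not trainable

z_projected = proj @ r         # low-rank diagonal projector applied to r
z_sliced = r[:d0]              # DirectCLR: take the leading subvector
```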
.
.
Main Idea
We propose to remove the projector in contrastive learning by directly sending a sub-vector of the representation vector to the loss function. We call our method DirectCLR. In contrast to recent state-of-the-art self-supervised learning methods, our method directly optimizes the representation space. As shown in Figure 8, DirectCLR picks a subvector of the representation, z = r[0 : d_0], where d_0 is a hyperparameter. It then applies a standard InfoNCE loss on this normalized subvector ẑ = z / |z|:

$$
\mathcal{L} = -\sum_i \log \frac{\exp(\hat{z}_i \cdot \hat{z}'_i)}{\sum_j \exp(\hat{z}_i \cdot \hat{z}_j)}
$$
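A batched numpy sketch of this loss; computing cosine similarities on the normalized subvector and drawing negatives from the other branch are implementation assumptions of this sketch:

```python
import numpy as np

def directclr_loss(r1, r2, d0):
    """DirectCLR-style loss sketch: InfoNCE on a fixed subvector of the
    representations, with no trainable projector.

    r1, r2: (N, D) representation batches from the two branches.
    d0: number of leading dimensions fed to the loss (hyperparameter)."""
    z1 = r1[:, :d0]
    z2 = r2[:, :d0]
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T                               # cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    # InfoNCE: positives are the diagonal entries
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```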
We train DirectCLR with a standard recipe of SimCLR for 100 epochs on ImageNet. The backbone encoder is a ResNet50. More implementation details

Figure 8: DirectCLR: no explicit trainable projector; we simply apply the InfoNCE loss on a fixed sub-vector of the representations.
can be found in Appendix D. DirectCLR demonstrates better performance than SimCLR with a trainable linear projector on ImageNet. The linear probe accuracies for each model are listed in Table 1.
Table 1: Linear probe accuracy on ImageNet. Each model is trained on ImageNet for 100 epochs with standard training recipe. The backbone encoder is a ResNet50. DirectCLR outperforms SimCLR with 1-layer linear projector.
We visualize the learnt representation space spectrum in Figure 9. DirectCLR prevents dimensional collapse in the representation space similar to the functionality of a trainable projector in SimCLR.

Figure 9: Representation space spectrum of DirectCLR compared to SimCLR (a) with a 2-layer nonlinear projector, (b) with a 1-layer linear projector, (c) without a projector. The spectra are computed on the output of the backbone, using the ImageNet validation set. Similar to SimCLR with projectors, DirectCLR is able to prevent dimensional collapse in the representation space.

Figure 10: Why is the whole representation vector r meaningful in DirectCLR while only part of it receives a gradient? DirectCLR takes advantage of the residual connection in the backbone. The gradient passing through the representation vector is low-rank: only the first $d_0$ channel dimensions are non-zero. When the gradient enters the ResNet backbone and passes through the last nonlinear conv block, it becomes full rank. Therefore, the hidden layer h receives gradients on all channels. During the forward pass, h is fed directly to the representation vector via the residual connection. Therefore, the entire representation vector r is meaningful.
One may suspect that, since the contrastive loss in DirectCLR applies no gradient to the remaining part of the representation vector $r[d_0:]$, these dimensions should not contain useful information.
Here, we show that the entire representation vector r contains useful information (see Figure 10). First, the gradient backpropagating through the representation vector is low-rank: only the first $d_0$ channel dimensions are non-zero. When the gradient enters the ResNet backbone and passes through the last nonlinear conv block, it becomes full rank. Therefore, the hidden layer h receives gradients on all channels. Note that h and r have the same channel dimension of 2048. Next, consider the forward pass: the hidden layer h is fed directly to the representation vector via the residual connection. As a result, the remaining part of the representation vector $r[d_0:]$ is not trivial. In addition, we run an ablation study in Sec F to test the linear probe accuracy based only on the 'directly' optimized sub-vector. This verifies that the whole representation vector is meaningful.
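This argument can be illustrated with a linearized toy model of the last residual stage (the matrix B standing in for the final conv block is our simplification, not the actual ResNet computation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d0 = 16, 4

# Toy model of the last residual stage: r = h + B h, where B stands in
# for the (here linearized) final conv block.
B = rng.normal(size=(d, d)) / np.sqrt(d)

# The loss only touches r[:d0], so the gradient arriving at r is zero
# outside the first d0 channels (low-rank gradient).
g_r = np.zeros(d)
g_r[:d0] = rng.normal(size=d0)

# Backprop through r = (I + B) h :  g_h = (I + B)^T g_r
g_h = (np.eye(d) + B).T @ g_r

assert np.count_nonzero(g_r) == d0                   # sparse gradient at r
assert np.count_nonzero(np.abs(g_h) > 1e-12) > d0    # dense gradient at h
```

The gradient at h spreads over all channels, and h flows back into r through the residual connection during the forward pass, so the untouched coordinates of r still carry trained features.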
Disclaimer: DirectCLR replaces the linear projector and verifies the two propositions on the dynamics of a linear projector. However, our theory cannot fully explain why a nonlinear projector is able to prevent dimensional collapse. DirectCLR also still relies on the mechanism of a nonlinear projector to prevent dimensional collapse, which is effectively performed by the last block of the backbone, as explained above.
Ablation Study
Table 2: Ablation study: top-1 accuracies on ImageNet by SimCLR model with different projector settings.
To further verify our hypothesis, we performed ablation studies.
Proposition 1 matches the following facts: (a) an orthogonality-constrained projector performs the same as the no-projector setting; (b) a fixed low-rank projector performs the same as a fixed diagonal projector; (c) a trainable linear projector performs the same as a trainable diagonal projector.
Proposition 2 matches the observation that a low-rank projector has the highest accuracy.
Please see a more detailed discussion of the ablation study and additional ablation experiments in Appendix F.
Conclusions
In this work, we showed that contrastive self-supervised learning suffers from dimensional collapse, where the embedding vectors only span a lower-dimensional subspace. We provided the theoretical understanding of this phenomenon and showed that there are two mechanisms causing dimensional collapse: strong augmentation and implicit regularization. Inspired by our theory, we proposed a novel contrastive self-supervised learning method DirectCLR that directly optimizes the representation space without relying on a trainable projector. DirectCLR outperforms SimCLR with a linear projector on ImageNet.
Acknowledgement
We thank Yubei Chen, Jiachen Zhu, Adrien Bardes, Nicolas Ballas, Randall Balestriero, and Quentin Garrido for useful discussions. We thank Wieland Brendel for the insightful discussion on understanding the role of the projector.
Reproducibility Statement
We provide detailed proof for all the lemmas and theorems in the Appendices. Code (in PyTorch) is available at https://github.com/facebookresearch/directclr
Useful Lemmas
We adapt two useful lemmas from Arora et al. (2019a).
Lemma 4. Given a matrix W and the dynamics that W evolves by $\dot{W}$, the singular values of this matrix evolve by:
$$
\dot{\sigma}_k = u_k^T \dot{W} v_k
$$
where $u_k$ and $v_k$ are the left and right singular vectors corresponding to the singular value $\sigma_k$, i.e., the k-th columns of the matrices U and V respectively.
Proof. Writing $W = USV^T$ and differentiating, we have
$$
\dot{W} = \dot{U}SV^T + U\dot{S}V^T + US\dot{V}^T
$$
Multiplying by $U^T$ from the left and by $V$ from the right, and using that U and V are orthogonal matrices, we have
$$
U^T\dot{W}V = U^T\dot{U}S + \dot{S} + S\dot{V}^TV
$$
Since $S = \mathrm{diag}(\sigma_k)$ is a diagonal matrix, taking the k-th diagonal entry gives
$$
u_k^T\dot{W}v_k = \sigma_k\, u_k^T\dot{u}_k + \dot{\sigma}_k + \sigma_k\, \dot{v}_k^Tv_k
$$
Again, since $u_k$ and $v_k$ have unit norm, we have $u_k^T\dot{u}_k = 0$ and $\dot{v}_k^Tv_k = 0$. Therefore, we derive
$$
\dot{\sigma}_k = u_k^T\dot{W}v_k
$$
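Lemma 4 is easy to verify numerically with a finite-difference check (a sanity check we add here for illustration; it assumes the generic case of distinct singular values):

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 6, 1e-7

W  = rng.normal(size=(n, n))
Wd = rng.normal(size=(n, n))          # a candidate velocity  W-dot

U, S, Vt = np.linalg.svd(W)           # W = U diag(S) Vt
S_plus = np.linalg.svd(W + eps * Wd, compute_uv=False)

# Predicted singular-value velocities:  sigma_k-dot = u_k^T  W-dot  v_k
pred = np.array([U[:, k] @ Wd @ Vt[k, :] for k in range(n)])
fd = (S_plus - S) / eps               # finite-difference estimate

assert np.allclose(fd, pred, atol=1e-4)
```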
Lemma 5. Given a matrix W with non-degenerate singular values and the dynamics that W evolves by $\dot{W}$, the singular vectors evolve by:
$$
\dot{U} = U\left(H \odot \left(U^T\dot{W}VS + SV^T\dot{W}^TU\right)\right)
$$
$$
\dot{V} = V\left(H \odot \left(SU^T\dot{W}V + V^T\dot{W}^TUS\right)\right)
$$
where $\odot$ represents Hadamard element-wise multiplication and H is a skew-symmetric matrix defined by
$$
H_{kl} = \frac{1}{\sigma_l^2 - \sigma_k^2} \;\; (k \neq l), \qquad H_{kk} = 0
$$
Proof. As in the proof of Lemma 4, we start from the following equation
$$
U^T\dot{W}V = U^T\dot{U}S + \dot{S} + S\dot{V}^TV
$$
Considering the fact that $U^T\dot{U}$ and $\dot{V}^TV$ are skew-symmetric matrices, whose diagonal terms are all zero, we Hadamard-multiply both sides of the equation by $\bar{I}$, where $\bar{I}$ has all diagonal values equal to zero and all off-diagonal values equal to one. This removes $\dot{S}$ and gives
$$
\bar{I} \odot (U^T\dot{W}V) = U^T\dot{U}S + S\dot{V}^TV
$$
Applying the same identity to $\dot{W}^T$ and transposing yields
$$
\bar{I} \odot (V^T\dot{W}^TU) = -SU^T\dot{U} - \dot{V}^TVS
$$
Multiplying the first equation by S from the right, the second by S from the left, and adding them, the $\dot{V}^TV$ terms cancel:
$$
\left(\bar{I} \odot (U^T\dot{W}V)\right)S + S\left(\bar{I} \odot (V^T\dot{W}^TU)\right) = U^T\dot{U}S^2 - S^2U^T\dot{U}
$$
Entry $(k,l)$ of the right-hand side equals $(\sigma_l^2 - \sigma_k^2)\,[U^T\dot{U}]_{kl}$. Therefore, we have
$$
\dot{U} = U\left(H \odot \left(U^T\dot{W}VS + SV^T\dot{W}^TU\right)\right)
$$
where $H_{kl} = 1/(\sigma_l^2 - \sigma_k^2)$ for $k \neq l$ and $H_{kk} = 0$. A similar proof applies to Eqn 14.
Lemma 6 (Alignment matrix dynamics) . The alignment matrix A , defined by A = V T 2 U 1 , evolves by:
$$
$$
$$
$$
$$
$$
$$
$$
$$
$$
where
$$
$$
$$
$$
$$
$$
$$
$$
Plugging in Eqn 8, we have
$$
$$
Gradient Flow Dynamics
We study the dynamics via gradient flow, i.e., gradient descent with an infinitesimally small learning rate.
Lemma 1. The weight matrix in a linear contrastive self-supervised learning model evolves by:
$$
\dot{W} = -G
$$
$$
G = \sum_i \left( g_{z_i} x_i^T + g_{z_i'} x_i'^T \right)
$$
where $g_{z_i}$ is the gradient on the embedding vector $z_i$ (and similarly $g_{z_i'}$).
This can be easily proven based on the chain rule; see the proof in Appendix B.1. For the InfoNCE loss defined in Eqn 2, the gradient of the embedding vector for each branch can be written as
$$
g_{z_i} = \sum_{j \neq i}\left[\alpha_{ij}(z_j - z_i') + \alpha_{ji}(z_j - z_i)\right], \qquad g_{z_i'} = (1 - \alpha_{ii})(z_i' - z_i)
$$
where $\{\alpha_{ij}\}$ are the softmax of the similarity between $z_i$ and $\{z_j\}$, defined by $\alpha_{ij} = \exp(-|z_i - z_j|^2/2)/Z_i$, $\alpha_{ii} = \exp(-|z_i - z_i'|^2/2)/Z_i$, and $Z_i = \sum_{j \neq i}\exp(-|z_i - z_j|^2/2) + \exp(-|z_i - z_i'|^2/2)$. Hence, $\sum_j \alpha_{ij} = 1$. Since $z_i = W x_i$, we have
$$
\dot{W} = -G = WX
$$
where
$$
X = -\sum_i\Big[\sum_{j \neq i}\big(\alpha_{ij}(x_j - x_i') + \alpha_{ji}(x_j - x_i)\big)x_i^T + (1 - \alpha_{ii})(x_i' - x_i)x_i'^T\Big] \quad (6)
$$
Lemma 2. X is a difference of two PSD matrices:
$$
X = \hat{\Sigma}_0 - \hat{\Sigma}_1
$$
Here $\hat{\Sigma}_0 = \sum_{i,j}\alpha_{ij}(x_i - x_j)(x_i - x_j)^T$ is a weighted data distribution covariance matrix and $\hat{\Sigma}_1 = \sum_i(1 - \alpha_{ii})(x_i' - x_i)(x_i' - x_i)^T$ is a weighted augmentation distribution covariance matrix.
See the proof in Appendix B.2. Therefore, the amplitude of the augmentation determines whether X is a positive definite matrix. Similar to Theorems 3-4 in Tian et al. (2020), Lemma 2 also models the time derivative of the weight W as a product of W and a symmetric and/or PSD matrix. However, Lemma 2 is much more general: it applies to InfoNCE with multiple negative contrastive terms, remains true when $\alpha_{ij}$ varies with the sample pair $(i, j)$, and holds with finite batch size N. In contrast, Theorem 4 in Tian et al. (2020) only works for one negative term in InfoNCE, holds only in the population sense (i.e., $N \to +\infty$), and its formulation has residual terms if the $\alpha_{ij}$ are not constants.
Next, we look into the dynamics of the weight matrix W given the properties of X.
Theorem 1. With fixed matrix X (defined in Eqn 6) and strong augmentation such that X has negative eigenvalues, the weight matrix W has vanishing singular values.
See proof in Appendix B.3.
Corollary 1 (Dimensional Collapse Caused by Strong Augmentation) . With strong augmentation, the embedding space covariance matrix becomes low-rank.
The embedding space is identified by the singular value spectrum of the covariance matrix of the embedding (Eqn. 1), $C = \sum_i(z_i - \bar{z})(z_i - \bar{z})^T/N = \sum_i W(x_i - \bar{x})(x_i - \bar{x})^TW^T/N$. Since W has vanishing singular values, C is also low-rank, indicating collapsed dimensions.
Numerical simulation verifies our theory. We choose the input data to be isotropic Gaussian with covariance matrix $\sum_{i,j}(x_i - x_j)(x_i - x_j)^T/N = I$. We set the augmentation to be additive Gaussian noise with covariance matrix $\sum_i(x_i' - x_i)(x_i' - x_i)^T/N = \mathrm{blockdiag}(\mathbf{0}, kI)$, where the nonzero block has size 8×8. We plot the weight matrix singular value spectrum in Figure 3 for various augmentation amplitudes k. This shows that, in the linear network setting, strong augmentation leads to dimensional collapse in the embedding space.
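The described simulation can be reproduced in a few lines of NumPy, using the closed-form solution $W(t) = W(0)\exp(Xt)$ from the proof of Theorem 1 (the constants here are illustrative, not the paper's exact settings):

```python
import numpy as np

n, k, t = 16, 2.0, 10.0
# X = data covariance minus a strong additive-noise augmentation acting
# on the last 8 coordinates: those directions get eigenvalue 1 - k < 0.
X = np.eye(n)
X[n - 8:, n - 8:] -= k * np.eye(8)

# Closed form W(t) = W(0) exp(X t), via the eigendecomposition of X.
lam, Q = np.linalg.eigh(X)
expXt = Q @ np.diag(np.exp(lam * t)) @ Q.T

rng = np.random.default_rng(0)
W0 = 0.1 * rng.normal(size=(n, n))
Wt = W0 @ expXt

sv = np.sort(np.linalg.svd(Wt, compute_uv=False))[::-1]
# 8 directions grow like e^t, 8 decay like e^{(1-k)t}: the spectrum collapses.
assert sv[7] / sv[0] > 1e-3    # surviving directions stay comparable
assert sv[8] / sv[0] < 1e-6    # collapsed directions vanish
```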
Our theory in this section is limited to the linear network setting. For more complex nonlinear networks, the collapsing condition will still depend on 'strong augmentation', but interpreted differently: whether an augmentation is 'strong' will be determined by more complicated properties of the augmentation (higher-order statistics, the manifold of the augmentation vs. the data distribution) conditioned on the capacity of the network.
Delayed Proofs
Proof of Lemma 1
Proof. The gradient on the matrix W is
$$
\frac{\partial L}{\partial W} = \sum_i\left(\frac{\partial L}{\partial z_i}\frac{\partial z_i}{\partial W} + \frac{\partial L}{\partial z_i'}\frac{\partial z_i'}{\partial W}\right)
$$
We denote the gradients on $z_i$ and $z_i'$ as $g_{z_i}$ and $g_{z_i'}$, respectively. Since $\partial z_i/\partial W = x_i$ and $\partial z_i'/\partial W = x_i'$, we get
$$
\dot{W} = -\frac{\partial L}{\partial W} = -\sum_i\left(g_{z_i}x_i^T + g_{z_i'}x_i'^T\right) = -G
$$
Proof of Lemma 2
Proof. X is defined in Eqn 6:
$$
X = -\sum_i\left[\sum_{j \neq i}\big(\alpha_{ij}(x_j - x_i') + \alpha_{ji}(x_j - x_i)\big)x_i^T + (1 - \alpha_{ii})(x_i' - x_i)x_i'^T\right]
$$
Given the fact that $\sum_{j \neq i}\alpha_{ij} = 1 - \alpha_{ii}$, we have $\sum_i\sum_{j \neq i}\alpha_{ij}x_i'x_i^T = \sum_i(1 - \alpha_{ii})x_i'x_i^T$. Also, since $\sum_i\sum_{j \neq i}$ iterates over all pairs of i and j, we can swap the indices: $\sum_i\sum_{j \neq i}\alpha_{ij}x_jx_i^T = \sum_i\sum_{j \neq i}\alpha_{ji}x_ix_j^T$. Expanding X and regrouping with these identities, the augmentation terms combine into $-\sum_i(1 - \alpha_{ii})(x_i' - x_i)(x_i' - x_i)^T$ and the data terms combine into $\sum_{i,j}\alpha_{ij}(x_i - x_j)(x_i - x_j)^T$. Therefore
$$
X = \hat{\Sigma}_0 - \hat{\Sigma}_1
$$
Proof of Theorem 1
Proof. According to Lemma 1, we have
$$
\dot{W} = WX
$$
For a fixed X, we solve this equation analytically:
$$
W(t) = W(0)\exp(Xt)
$$
Applying the eigendecomposition $X = U\Lambda U^T$, we have $\exp(Xt) = U\exp(\Lambda t)U^T$. Therefore,
$$
W(t) = W(0)\,U\exp(\Lambda t)\,U^T
$$
Because X has negative eigenvalues, i.e., $\Lambda$ has negative entries, $\exp(\Lambda t)$ becomes rank-deficient as $t \to \infty$. Therefore, $W(\infty)$ is also rank-deficient, i.e., the weight matrix W has vanishing singular values.
Proof of Lemma 3
Proof. The gradient on the matrix $W_2$ is
$$
\frac{\partial L}{\partial W_2} = \sum_i\left(\frac{\partial L}{\partial z_i}\frac{\partial z_i}{\partial W_2} + \frac{\partial L}{\partial z_i'}\frac{\partial z_i'}{\partial W_2}\right)
$$
We denote the gradients on $z_i$ and $z_i'$ as $g_{z_i}$ and $g_{z_i'}$, respectively. Since $\partial z_i/\partial W_2 = W_1x_i$ and $\partial z_i'/\partial W_2 = W_1x_i'$, we get
$$
\dot{W}_2 = -\sum_i\left(g_{z_i}(W_1x_i)^T + g_{z_i'}(W_1x_i')^T\right)
$$
Similar proof applies to W 1 .
Proof of Theorem 2
Here, we prove that, under the assumption that the singular values are non-degenerate, the alignment matrix $A = V_2^TU_1$ converges to the identity matrix.
Proof. According to Lemma 3, we have
$$
\dot{W}_1 = W_2^TW_2W_1X
$$
$$
\dot{W}_2 = W_2W_1XW_1^T
$$
Since X is symmetric, this implies
$$
\frac{d}{dt}\left(W_1W_1^T\right) = W_2^TW_2W_1XW_1^T + W_1XW_1^TW_2^TW_2 = \frac{d}{dt}\left(W_2^TW_2\right)
$$
so the difference $W_1W_1^T - W_2^TW_2$ remains constant during training.
Next, we show that the Frobenius norm of each weight matrix grows to infinity.
$$
\frac{d}{dt}\|W_1\|_F^2 = 2\,\mathrm{tr}\left(W_1^T\dot{W}_1\right) = 2\,\mathrm{tr}\left(W_1^TW_2^TW_2W_1X\right)
$$
Because X is a positive definite matrix and $W_2(t)W_1(t) \neq 0$ for all t, we know $B := W_2W_1XW_1^TW_2^T$ is positive semi-definite and $B \neq 0$. Therefore, $\mathrm{tr}(B) = \sum_k \lambda_k(B) > 0$ since not all eigenvalues of B are zero. By the cyclic property of the trace,
$$
\frac{d}{dt}\|W_1\|_F^2 = 2\,\mathrm{tr}(B) > 0
$$
The same holds for $\|W_2\|_F^2$, so both norms grow without bound while $W_1W_1^T - W_2^TW_2$ stays constant; hence asymptotically $W_1W_1^T = W_2^TW_2$.
Plugging in the singular value decompositions of $W_1$ and $W_2$, we have $U_1S_1^2U_1^T = V_2S_2^2V_2^T$. Assuming $W_1$ and $W_2$ have non-degenerate singular values, the uniqueness of the eigendecomposition gives
$$
U_1 = V_2, \qquad S_1 = S_2
$$
Therefore,
$$
A = V_2^TU_1 = I
$$
Remark . Note that when the non-degenerate singular value assumption does not hold, the corresponding singular vectors are not unique and we will not observe the corresponding dimensions becoming aligned.
Proof of Theorem 3
Proof. According to Theorem 2, for $\sigma_1^k$ and $\sigma_2^k$ with the same index, the corresponding singular vectors $v_2^k$ and $u_1^k$ become aligned, i.e., $v_2^{k'T}u_1^k \to \delta_{k,k'}$. Therefore, Eqn 21 and Eqn 22 can be simplified to
$$
$$
Inserting Eqn 9 and considering the alignment, we derive
$$
$$
from which the theorem follows.
Effect of More Layers and Nonlinearity
In our toy model, we focused on a two-layer linear MLP setting. Here, we empirically show that our theory extends to multilayer and nonlinear cases, as shown in Figure 11a.
Stronger over-parametrization leads to a stronger collapsing effect, which has been shown theoretically (Arora et al., 2019a; Barrett & Dherin, 2021) and empirically (Jing et al., 2020). This can be explained by the fact that more adjacent matrices become aligned, and the collapse in the product matrix is amplified. Note that in the single-layer case, L = 1, there is no dimensional collapse in the embedding space, which is consistent with our analysis.

Figure 11: Embedding space singular value spectrum with different numbers of layers on (a) linear and (b) nonlinear networks. All models use weight matrices of size 16×16. Adding more layers to the network leads to more collapsed dimensions. Adding nonlinearity leads to a similar collapsing effect.
We empirically show that the collapsing effect also applies to the nonlinear scenario. We insert ReLU between linear layers and observe a similar singular value collapse compared to the linear case. See Figure 11b.
Implementation Detail
Augmentations
Each input image is transformed twice to produce the two distorted views for the contrastive loss. The image augmentation pipeline includes random cropping, resizing to 224×224, random horizontal flipping, color jittering, grayscale conversion, Gaussian blurring, and solarization.
Network
Throughout the ImageNet experiments in this paper, we use a ResNet-50 (He et al., 2016) as the encoder. This network outputs a 2048-dimensional vector, which we call the representation vector.
Optimization
We use the LARS optimizer and train all models for 100 epochs. The batch size is 4096, which fits into 32 GPUs during training. The learning rate is 4.8, as in SimCLR (Chen et al., 2020a), with 10 epochs of warm-up followed by a cosine decay schedule.
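The schedule can be sketched as follows (the exact warm-up/decay formula is our assumption of the standard recipe, not taken from the released code):

```python
import math

def lr_schedule(epoch, base_lr=4.8, warmup=10, total=100):
    """Linear warm-up for `warmup` epochs, then cosine decay to zero."""
    if epoch < warmup:
        return base_lr * epoch / warmup
    progress = (epoch - warmup) / (total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

assert lr_schedule(0) == 0.0      # start of warm-up
assert lr_schedule(10) == 4.8     # peak learning rate
assert abs(lr_schedule(100)) < 1e-12   # decayed to zero
```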
Hyperparameter tuning on $d_0$
Here, we list the ImageNet accuracy for various values of $d_0$ in Figure 12. When $d_0 \to 0$, too little gradient information comes from the loss, and the performance drops. When $d_0 \to 2048$, the model converges to standard SimCLR without a projector, which we know suffers from dimensional collapse in the representation space.

Figure 12: Hyperparameter tuning of $d_0$, based on ImageNet linear probe top-1 accuracy.
Ablation Study Detail
Self-supervised learning aims to learn useful representations of the input data without relying on human annotations. Recent advances in self-supervised visual representation learning based on joint embedding methods (Misra & Maaten, 2020b; He et al., 2020; Chen et al., 2020a; Chen & He, 2020; Grill et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Chen et al., 2020b; Dwibedi et al., 2021; Li et al., 2021; Misra & Maaten, 2020a; HaoChen et al., 2021; Assran et al., 2021; Caron et al., 2021) show that self-supervised representations achieve competitive performance compared with supervised ones. These methods generally aim to learn representations that are invariant to data augmentations by maximizing the agreement between embedding vectors from different distortions of the same images.
As there are trivial solutions where the model maps all inputs to the same constant vector, known as the collapsing problem, various methods relying on different mechanisms have been proposed to solve this problem. Contrastive methods like Chen et al. (2020a) and He et al. (2020) define 'positive' and 'negative' sample pairs which are treated differently in the loss function. Non-contrastive methods like Grill et al. (2020) and Chen & He (2020) use a stop-gradient and an extra predictor to prevent collapse without negative pairs; Caron et al. (2018; 2020) use an additional clustering step; and Zbontar et al. (2021) minimize the redundant information between the two branches.
These self-supervised learning methods are successful in preventing complete collapse, whereby all representation vectors shrink into a single point. However, it has been observed empirically in non-contrastive learning methods (Hua et al., 2021; Tian et al., 2021) that while the embedding vectors do not completely collapse, they collapse along certain dimensions. This is known as dimensional collapse (Hua et al., 2021), whereby the embedding vectors only span a lower-dimensional subspace.
In contrastive methods that explicitly use positive and negative pairs in the loss function, it seems intuitive to speculate that the repulsive effect of negative examples should prevent this kind of dimensional collapse and make full use of all dimensions. However, contrary to intuition, contrastive learning methods still suffer from dimensional collapse (see Fig. 7). In this work, we theoretically study the dynamics behind this phenomenon. We show that there are two different mechanisms that cause collapsing: (1) along feature directions where the variance caused by the data augmentation is larger than the variance caused by the data distribution, the weight collapses; and (2) even if the covariance of the data augmentation has a smaller magnitude than the data variance along all dimensions, the weight will still collapse due to the interplay of weight matrices at different layers, known as implicit regularization. This kind of collapse happens only in networks with more than one layer.
Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the encoder (i.e., representation space) without relying on an explicit trainable projector. DirectCLR outperforms SimCLR with a linear trainable projector on ImageNet.
We summarize our contributions as follows:
We empirically show that contrastive self-supervised learning suffers from dimensional collapse whereby all the embedding vectors fall into a lower-dimensional subspace instead of the entire available embedding space.
We show that there are two mechanisms causing dimensional collapse in contrastive learning: (1) strong augmentation along feature dimensions and (2) implicit regularization driving models toward low-rank solutions.
We propose DirectCLR, a novel contrastive learning method that directly optimizes the representation space without relying on an explicit trainable projector. DirectCLR outperforms SimCLR with a linear trainable projector.
Self-supervised Learning Methods Joint embedding methods are a promising approach in self-supervised learning, whose principle is to match the embedding vectors of augmented views of a training instance. Contrastive methods (Chen et al., 2020a; He et al., 2020) directly compare training samples by effectively viewing each sample as its own class, typically based on the InfoNCE contrastive loss (van den Oord et al., 2018), which encourages representations from positive pairs of examples to be close in the embedding space while representations from negative pairs are pushed away from each other. In practice, contrastive methods are known to require a large number of negative samples. Non-contrastive methods do not directly rely on explicit negative samples. These include clustering-based methods (Caron et al., 2018; 2020), redundancy reduction methods (Zbontar et al., 2021; Bardes et al., 2021), and methods using special architecture design (Grill et al., 2020; Chen & He, 2020).
Theoretical Understanding of Self-supervised Learning Although self-supervised learning models have shown success in learning useful representations and have outperformed their supervised counterpart in several downstream transfer learning benchmarks (Chen et al., 2020a), the underlying dynamics of these methods remains somewhat mysterious and poorly understood. Several theoretical works have attempted to understand it. Arora et al. (2019b); Lee et al. (2020); Tosh et al. (2021) theoretically proved that the learned representations via contrastive learning are useful for downstream tasks. Tian et al. (2021) explained why non-contrastive learning methods like BYOL (Grill et al., 2020) and SimSiam (Chen & He, 2020) work: the dynamics of the alignment of eigenspaces between the predictor and its input correlation matrix play a key role in preventing complete collapse.
Implicit Regularization It has been theoretically explained that gradient descent will drive adjacent matrices aligned in a linear neural network setting (Ji & Telgarsky, 2019). Under the aligned matrix assumption, Gunasekar et al. (2018) prove that gradient descent can derive minimal nuclear norm solution. Arora et al. (2019a) extend this concept to the deep linear network case by theoretically and empirically demonstrating that a deep linear network can derive low-rank solutions. In general, over-parametrized neural networks tend to find flatter local minima (Saxe et al., 2019; Neyshabur et al., 2019; Soudry et al., 2018; Barrett & Dherin, 2021).
Self-supervised learning methods learn useful representations by minimizing the distances between embedding vectors from augmented images (Figure 1(a)). On its own, this would result in a collapsed solution where the produced representation becomes constant (Figure 1(b)). Contrastive methods prevent complete collapse via the negative term that pushes embedding vectors of different input images away from each other. In this section, we show that while they prevent complete collapse, contrastive methods still experience dimensional collapse, in which the embedding vectors span a subspace of lower dimension than the full embedding space (Figure 1(c)).
We train a SimCLR model (Chen et al., 2020a) with a two-layer MLP projector. We follow the standard recipe and train the model on ImageNet for 100 epochs. We evaluate the dimensionality by collecting the embedding vectors on the validation set. Each embedding vector has a size of $d = 128$. We compute the covariance matrix $C \in \mathbb{R}^{d \times d}$ of the embedding layer (here $\bar{z} := \sum_{i=1}^N z_i/N$ and N is the total number of samples):
$$
C = \frac{1}{N}\sum_{i=1}^N (z_i - \bar{z})(z_i - \bar{z})^T \quad (1)
$$
Figure 2 shows the singular value decomposition of this matrix ($C = USV^T$, $S = \mathrm{diag}(\sigma_k)$), with the singular values plotted in sorted order and on a logarithmic scale ($\log \sigma_k$). We observe that a number of singular values collapse to zero, representing collapsed dimensions.
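Concretely, the spectrum computation amounts to the following (NumPy sketch; the embeddings here are random stand-ins for the SimCLR outputs, constructed to span only a 32-dimensional subspace so that the collapse is visible):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1024, 128

# Stand-in embeddings spanning only a 32-dimensional subspace,
# mimicking dimensional collapse.
Z = rng.normal(size=(N, 32)) @ rng.normal(size=(32, d))

Zc = Z - Z.mean(axis=0)               # subtract z-bar
C = Zc.T @ Zc / N                     # covariance matrix, Eqn 1
sigma = np.linalg.svd(C, compute_uv=False)   # sorted singular values

log_spec = np.log(np.maximum(sigma, 1e-300))  # log scale, as in Figure 2
# The collapsed dimensions show up as (numerically) zero singular values.
assert sigma[31] / sigma[0] > 1e-8
assert sigma[32] / sigma[0] < 1e-10
```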
In this section, we explain one scenario in which contrastive learning develops collapsed embedding dimensions: the augmentation surpasses the input information. We focus on a simple linear network setting. We denote the input vector as x, and the augmentation is an additive noise. The network is a single linear layer with weight matrix W, so the embedding vector is $z = Wx$. We focus on a typical contrastive loss, InfoNCE (van den Oord et al., 2018):
$$
L = -\sum_i \log \frac{\exp(-|z_i - z_i'|^2/2)}{\sum_{j \neq i}\exp(-|z_i - z_j|^2/2) + \exp(-|z_i - z_i'|^2/2)} \quad (2)
$$
where $z_i$ and $z_i'$ are a pair of embedding vectors from the two branches, and $z_j$ indicates the negative samples within the minibatch. When all $z_i$ and $z_i'$ are normalized to be unit vectors, the negative distance $-|z_i - z_i'|^2/2$ can be replaced by the inner product $z_i^Tz_i'$. The model is trained with basic stochastic gradient descent, without momentum or weight decay.
We study the dynamics via gradient flow, i.e., gradient descent with an infinitesimally small learning rate.
The weight matrix in a linear contrastive self-supervised learning model evolves by:
where G=∑i(g𝐳ixiT+g𝐳i′xi′T)𝐺subscript𝑖subscriptgsubscript𝐳𝑖superscriptsubscriptx𝑖𝑇subscriptgsuperscriptsubscript𝐳𝑖′superscriptsubscriptx𝑖′𝑇G=\sum_{i}(\textbf{g}{{\bm{z}}{i}}\textbf{x}{i}^{T}+\textbf{g}{{\bm{z}}{i}^{\prime}}\textbf{x}{i}^{\prime T}), and gzisubscriptgsubscriptz𝑖\textbf{g}{\textbf{z}{i}} is the gradient on the embedding vector zisubscriptz𝑖\textbf{z}{i} (similarly gzi′subscriptgsuperscriptsubscriptz𝑖′\textbf{g}{\textbf{z}_{i}^{\prime}}).
This can be easily proven based on the chain rule. See proof in Appendix B.1. For InfoNCE loss defined in Eqn 2, the gradient of the embedding vector for each branch can be written as
where {αij}subscript𝛼𝑖𝑗{\alpha_{ij}} are the softmax of similarity of between 𝒛isubscript𝒛𝑖{\bm{z}}{i} and {𝒛j}subscript𝒛𝑗{{\bm{z}}{j}}, defined by αij=exp(−|zi−zj|2/2)/Zisubscript𝛼𝑖𝑗superscriptsubscriptz𝑖subscriptz𝑗22subscript𝑍𝑖\alpha_{ij}=\exp(-|\textbf{z}{i}-\textbf{z}{j}|^{2}/2)/Z_{i}, αii=exp(−|zi−zi′|2/2)/Zisubscript𝛼𝑖𝑖superscriptsubscriptz𝑖superscriptsubscriptz𝑖′22subscript𝑍𝑖\alpha_{ii}=\exp(-|\textbf{z}{i}-\textbf{z}{i}^{\prime}|^{2}/2)/Z_{i}, and Zi=∑j≠iexp(−|zi−zj|2/2)+exp(−|zi−zi′|2/2)subscript𝑍𝑖subscript𝑗𝑖superscriptsubscriptz𝑖subscriptz𝑗22superscriptsubscriptz𝑖superscriptsubscriptz𝑖′22Z_{i}=\sum_{j\neq i}\exp(-|\textbf{z}{i}-\textbf{z}{j}|^{2}/2)+\exp(-|\textbf{z}{i}-\textbf{z}{i}^{\prime}|^{2}/2). Hence, ∑jαij=1subscript𝑗subscript𝛼𝑖𝑗1\sum_{j}\alpha_{ij}=1. Since 𝒛i=Wxisubscript𝒛𝑖𝑊subscriptx𝑖{\bm{z}}{i}=W\textbf{x}{i}, we have
X𝑋X is a difference of two PSD matrices:
Here Σ^0=∑i,jαij(xi−xj)(xi−xj)Tsubscript^Σ0subscript𝑖𝑗subscript𝛼𝑖𝑗subscriptx𝑖subscriptx𝑗superscriptsubscriptx𝑖subscriptx𝑗𝑇\hat{\Sigma}{0}=\sum{i,j}\alpha_{ij}(\textbf{x}{i}-\textbf{x}{j})(\textbf{x}{i}-\textbf{x}{j})^{T} is a weighted data distribution covariance matrix and Σ^1=∑i(1−αii)(xi′−xi)(xi′−xi)Tsubscript^Σ1subscript𝑖1subscript𝛼𝑖𝑖superscriptsubscriptx𝑖′subscriptx𝑖superscriptsuperscriptsubscriptx𝑖′subscriptx𝑖𝑇\hat{\Sigma}{1}=\sum{i}(1-\alpha_{ii})(\textbf{x}{i}^{\prime}-\textbf{x}{i})(\textbf{x}{i}^{\prime}-\textbf{x}{i})^{T} is a weighted augmentation distribution covariance matrix.
See proof in Appendix B.2. Therefore, the amplitude of augmentation determines whether X𝑋X is a positive definite matrix. Similar to Theorem 3-4 in Tian et al. (2020), Lemma 2 also models the time derivative of weight W𝑊W as a product of W𝑊W and a symmetric and/or PSD matrices. However, Lemma 2 is much more general: it applies to InfoNCE with multiple negative contrastive terms, remains true when αijsubscript𝛼𝑖𝑗\alpha_{ij} varies with sample pair (i,j)𝑖𝑗(i,j), and holds with finite batch size N𝑁N. In contrast, Theorem 4 in Tian et al. (2020) only works for one negative term in InfoNCE, holds only in the population sense (i.e., N→+∞→𝑁N\rightarrow+\infty), and the formulation has residual terms, if αijsubscript𝛼𝑖𝑗\alpha_{ij} are not constants.
Next, we look into the dynamics of the weight matrix $W$ given the properties of $X$.
With fixed matrix $X$ (defined in Eqn 6) and strong augmentation such that $X$ has negative eigenvalues, the weight matrix $W$ has vanishing singular values.
See proof in Appendix B.3.
With strong augmentation, the embedding space covariance matrix becomes low-rank.
The embedding space is characterized by the singular value spectrum of the covariance matrix of the embeddings (Eqn. 1), $C=\sum_{i}(\textbf{z}_i-\bar{\textbf{z}})(\textbf{z}_i-\bar{\textbf{z}})^{T}/N=\sum_{i}W(\textbf{x}_i-\bar{\textbf{x}})(\textbf{x}_i-\bar{\textbf{x}})^{T}W^{T}/N$. Since $W$ has vanishing singular values, $C$ is also low-rank, indicating collapsed dimensions.
Numerical simulation verifies our theory. We choose the input data to be isotropic Gaussian with covariance matrix $\sum_{i,j}(\textbf{x}_i-\textbf{x}_j)(\textbf{x}_i-\textbf{x}_j)^{T}/N=I$. We set the augmentation to be additive Gaussian noise with covariance matrix $\sum_{i}(\textbf{x}_i'-\textbf{x}_i)(\textbf{x}_i'-\textbf{x}_i)^{T}/N=\mathrm{block\_diagonal}(\textbf{0},k*I)$, where the nonzero block has size 8x8. We plot the weight matrix singular value spectrum in Figure 3 for various augmentation amplitudes $k$. This confirms that, in the linear network setting, strong augmentation leads to dimensional collapse in the embedding space.
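This simulation can be reproduced in closed form. Under gradient flow $\dot W = WX$ with fixed $X$, the solution is $W(t)=W(0)\exp(Xt)$ (Appendix B.3). The numpy sketch below (dimensions and flow time are arbitrary choices) builds the $X$ of the setup above for $k>1$, so its last 8 eigenvalues are $1-k<0$, and shows the corresponding singular values of $W(t)$ vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, t = 16, 3.0, 4.0          # dimension, augmentation amplitude, flow time

# X = Sigma0_hat - Sigma1_hat = I - block_diagonal(0, k*I):
# the last 8 eigenvalues equal 1 - k < 0 for k > 1.
X = np.eye(d)
X[8:, 8:] -= k * np.eye(8)

# Closed-form gradient flow W(t) = W(0) exp(Xt), via eigendecomposition of
# the symmetric matrix X.
lam, U = np.linalg.eigh(X)
W0 = 0.1 * rng.standard_normal((d, d))
Wt = W0 @ U @ np.diag(np.exp(lam * t)) @ U.T

sv = np.linalg.svd(Wt, compute_uv=False)  # sorted in descending order
print(sv)                                 # the trailing 8 values are near zero
```

Directions with negative eigenvalues of $X$ are damped by $e^{(1-k)t}$, which produces the vanishing tail of the singular value spectrum seen in Figure 3.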
Our theory in this section is limited to the linear network setting. For more complex nonlinear networks, the collapsing condition will still depend on “strong augmentation,” but the notion must be interpreted differently: what counts as strong augmentation will be determined by more complicated properties of the augmentation (higher-order statistics, the manifold structure of the augmentation vs. the data distribution) conditioned on the capacity of the network.
With strong augmentation, a linear model under the InfoNCE loss exhibits dimensional collapse. However, this scenario relies on the network having limited capacity, which may not hold in real cases. On the other hand, when there is no strong augmentation ($\hat{\Sigma}_1\prec\hat{\Sigma}_0$) and thus the matrix $X$ remains PSD, a single linear model will not suffer dimensional collapse. Interestingly, however, for deep networks, dimensional collapse still happens in practice. In the following, we show that it stems from a different mechanism: implicit regularization, where over-parametrized linear networks tend to find low-rank solutions.
To understand this counter-intuitive phenomenon, we start with the simplest over-parametrized setting by choosing the network to be a two-layer linear MLP without bias. The weight matrices of the two layers are denoted by $W_1\in\mathbb{R}^{d\times d}$ and $W_2\in\mathbb{R}^{d\times d}$. Similar to the setting in Sec 4, the input vector is denoted as $\textbf{x}$ and the augmentation is additive noise. The embedding vector from each branch is $\textbf{z}=W_2W_1\textbf{x}$, hence $\textbf{z}\in\mathbb{R}^{d}$. We do not normalize $\textbf{z}$. See Figure 4. We use the InfoNCE loss defined in Eqn 2. The model is trained with basic stochastic gradient descent, without momentum or weight decay.
Similar to Lemma 1, we derive the gradient flow on the two weight matrices $W_1$ and $W_2$.
The weight matrices of the two-layer linear contrastive self-supervised learning model evolve by ($G=\sum_{i}(\textbf{g}_{\textbf{z}_i}\textbf{x}_i^{T}+\textbf{g}_{\textbf{z}_i'}\textbf{x}_i'^{T})$ is defined in Lemma 1):
where $X$ is defined in Eqn 6. According to Lemma 2, we know that with small augmentation, $X=\hat{\Sigma}_0-\hat{\Sigma}_1\succ 0$ is a positive-definite matrix.
Since we have two matrices $W_1$ and $W_2$, the first question is how they interact with each other. We apply singular value decomposition to both matrices, i.e., $W_1=U_1S_1V_1^{T}$, $W_2=U_2S_2V_2^{T}$, with $S_1=\mathrm{diag}([\sigma_1^{k}])$ and $S_2=\mathrm{diag}([\sigma_2^{k}])$. The alignment is governed by the interaction between the adjacent orthonormal matrices $V_2:=[\mathbf{v}_2^{k}]$ and $U_1:=[\mathbf{u}_1^{k}]$. This can be characterized by the alignment matrix $A=V_2^{T}U_1$, whose $(k,k')$-entry represents the alignment between the $k$-th right singular vector $\mathbf{v}_2^{k}$ of $W_2$ and the $k'$-th left singular vector $\mathbf{u}_1^{k'}$ of $W_1$. The following shows that $W_1$ and $W_2$ indeed align.
If for all $t$, $W_2(t)W_1(t)\neq 0$, $X(t)$ is positive-definite and $W_1(+\infty)$, $W_2(+\infty)$ have distinctive singular values, then the alignment matrix $A=V_2^{T}U_1\rightarrow I$.
See proof in Appendix B.5. Here, we also empirically demonstrate that under the InfoNCE loss, the absolute value of the alignment matrix $A$ converges to an identity matrix. See Figure 5.
The alignment effect has been studied in other scenarios (Ji & Telgarsky, 2019; Radhakrishnan et al., 2020). In real cases, when some of our assumptions are not satisfied, e.g., when there are degenerate singular values in the weight matrices, we will not observe perfect alignment. This is easily understood from the fact that the singular value decomposition is no longer unique given degenerate singular values. In our toy experiment, we specifically initialize the weight matrices to have non-degenerate singular values. In real scenarios, where weight matrices are randomly initialized, we only observe the alignment matrix converging to a block-diagonal matrix, with each block corresponding to a group of degenerate singular values.
Given that singular vectors of the same rank align, we can now study the dynamics of the singular values of the weight matrices $W_1$ and $W_2$.
If $W_2$ and $W_1$ are aligned (i.e., $V_2=U_1$), then the singular values of the weight matrices $W_1$ and $W_2$ under the InfoNCE loss evolve by:
See proof in Appendix B.6. According to Eqns. 10-11, $(\sigma_1^{k})^{2}=(\sigma_2^{k})^{2}+C$ for a constant $C$. We can then solve the singular value dynamics analytically: $\dot{\sigma}_1^{k}=\sigma_1^{k}((\sigma_1^{k})^{2}-C)({\textbf{v}_1^{k}}^{T}X\textbf{v}_1^{k})$. This shows that paired singular values (singular values with the same rank in the two matrices) have gradients proportional to themselves. Since $X$ is a positive definite matrix, the term ${\textbf{v}_1^{k}}^{T}X\textbf{v}_1^{k}$ is always non-negative. This explains why we observe that the smallest group of singular values grows significantly slower. See demonstrative experiment results in Figures 6(a) and 6(b).
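The constancy of $(\sigma_1^k)^2-(\sigma_2^k)^2$ reflects a matrix invariant of the gradient flow, $\frac{d}{dt}(W_1W_1^T-W_2^TW_2)=0$ (Appendix B.6). A minimal numpy sketch (random matrices; the dimensions, scales, and step size are arbitrary assumptions) checks this numerically: after one Euler step of the flow $\dot W_1=-W_2^TG$, $\dot W_2=-GW_1^T$, the invariant drifts only at second order in the step size, far below the first-order change of the weights themselves:

```python
import numpy as np

rng = np.random.default_rng(1)
d, lr = 6, 1e-4

W1 = 0.5 * rng.standard_normal((d, d))
W2 = 0.5 * rng.standard_normal((d, d))
M = rng.standard_normal((d, d))
X = M @ M.T / d + np.eye(d)        # a positive-definite X (weak-augmentation regime)

G = -W2 @ W1 @ X                   # gradient on the product matrix (Eqn 9)
inv0 = W1 @ W1.T - W2.T @ W2       # conserved quantity of the gradient flow
W1e = W1 - lr * W2.T @ G           # one explicit Euler step: W1_dot = -W2^T G
W2e = W2 - lr * G @ W1.T           #                          W2_dot = -G W1^T
inv1 = W1e @ W1e.T - W2e.T @ W2e

drift = np.linalg.norm(inv1 - inv0) / np.linalg.norm(inv0)
step = np.linalg.norm(W1e - W1) / np.linalg.norm(W1)
print(drift, step)                 # drift is O(lr^2), far below the O(lr) weight change
```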
Our theory can also be extended to multilayer networks and the nonlinear setting. Please see Appendix C.
We now leverage our theoretical findings to design a novel algorithm, targeting the projector component in contrastive learning.
Empirically, adding a projector substantially improves the quality of the learned representation and downstream performance (Chen et al., 2020a). Checking the spectrum of the representation layer also reveals a difference with and without a projector. To see this, we train two SimCLR models, with and without a projector. The representation space spectra are shown in Figure 7(b). Dimensional collapse in the representation space happens when the model is trained without a projector. Thus, the projector prevents collapse in the representation space.
The projector in contrastive learning is essential to prevent dimensional collapse in the representation space. We claim the following propositions regarding a linear projector in contrastive learning models.
A linear projector weight matrix only needs to be diagonal.
Based on our theory of implicit regularization dynamics, we expect adjacent layers $W_1(=U_1S_1V_1^{T})$ and $W_2(=U_2S_2V_2^{T})$ to become aligned, such that the overall dynamics is governed only by their singular values $S_1$ and $S_2$. The orthogonal matrices $V_2^{T}$ and $U_1$ are redundant, as they will evolve to satisfy $V_2^{T}U_1=I$.
Now, let's consider the SimCLR model with a linear projector and focus only on the channel dimension. $W_1$ is the last layer in the encoder, and $W_2$ is the projector weight matrix. Our propositions claim that for this projector matrix $W_2$, the orthogonal component $V_2$ can be omitted: because the previous layer $W_1$ is fully trainable, its orthogonal component $U_1$ will always evolve to satisfy $V_2^{T}U_1=I$. Therefore, the final behavior of the projector is determined only by the singular values $S_2$ of the projector weight matrix. This motivates Proposition 1: since the orthogonal component of the weight matrix does not matter, we can set the projector matrix to be diagonal.
Also, according to our theory, the weight matrix always converges to a low-rank solution. The diagonal matrix of singular values naturally becomes low-rank, so why not set it to be low-rank directly? This is the motivation for Proposition 2.
These propositions are verified via ablation studies in Sec 6.3. Given these two propositions, we propose DirectCLR, which effectively uses a low-rank diagonal projector.
We propose to remove the projector in contrastive learning by directly sending a sub-vector of the representation vector to the loss function. We call our method DirectCLR. In contrast to recent state-of-the-art self-supervised learning methods, our method directly optimizes the representation space. As shown in Figure 8, DirectCLR picks a subvector of the representation, $\textbf{z}=\textbf{r}[0:d_0]$, where $d_0$ is a hyperparameter. It then applies a standard InfoNCE loss on the normalized subvector $\hat{\textbf{z}}=\textbf{z}/|\textbf{z}|$: $L=-\sum_{i}\log\frac{\exp(\hat{\textbf{z}}_i\cdot\hat{\textbf{z}}_i')}{\sum_{j}\exp(\hat{\textbf{z}}_i\cdot\hat{\textbf{z}}_j)}$.
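As an illustration, here is a minimal numpy sketch of this objective. It is a simplified variant, not the official PyTorch implementation (linked in the reproducibility statement): negatives here come from the cross-view similarity matrix, with the diagonal entries as positives, and the batch size and dimensions are arbitrary:

```python
import numpy as np

def directclr_loss(r1, r2, d0):
    """Sketch of the DirectCLR objective: InfoNCE on the first d0 channels.

    r1, r2: (N, D) representations of the two augmented views.
    d0: size of the directly optimized subvector (hyperparameter).
    """
    z1 = r1[:, :d0]                     # pick the subvector z = r[0:d0]
    z2 = r2[:, :d0]
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # normalize
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T                  # (N, N) cosine similarities z_i . z'_j
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))  # -log p(positive pair)

rng = np.random.default_rng(0)
r1 = rng.standard_normal((8, 32))
r2 = r1 + 0.1 * rng.standard_normal((8, 32))   # correlated "views"
loss = directclr_loss(r1, r2, d0=16)
print(loss)
```

For correlated views the loss falls below the chance level $\log N$, since the positive similarities dominate the negatives.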
We train DirectCLR with a standard recipe of SimCLR for 100 epochs on ImageNet. The backbone encoder is a ResNet50. More implementation details can be found in the Appendix D. DirectCLR demonstrates better performance compared to SimCLR with a trainable linear projector on ImageNet. The linear probe accuracies for each model are listed in Table 1.
We visualize the learnt representation space spectrum in Figure 10. DirectCLR prevents dimensional collapse in the representation space similar to the functionality of a trainable projector in SimCLR.
One may wonder: since the contrastive loss in DirectCLR applies no gradient to the remaining part of the representation vector $\textbf{r}[d_0:]$, why would these dimensions contain useful information?
Here, we show that the entire representation vector r contains useful information. See Figure 10. First, the gradient backpropagating through the representation vector is low-rank: only the first $d_0$ channel dimensions are non-zero. When the gradient enters the ResNet backbone and passes through the last nonlinear conv block, it becomes full rank. Therefore, the hidden layer h receives gradients on all channels. Note that h and r have the same channel dimension of 2048. Next, consider the forward pass. The hidden layer h is directly fed into the representation vector via the residual connection. As a result, the remaining part of the representation vector $\textbf{r}[d_0:]$ is not trivial. In addition, we run an ablation study in Sec F testing the linear probe accuracy based only on the “directly” optimized subvector. This verifies that the whole representation vector is meaningful.
Disclaimer: DirectCLR can replace the linear projector and verifies the two propositions on the dynamics of a linear projector. But our theory cannot fully explain why a nonlinear projector is able to prevent dimensional collapse. DirectCLR still relies on the mechanism of a nonlinear projector to prevent dimensional collapse, which is effectively performed by the last block of the backbone, as explained above.
To further verify our hypotheses, we performed ablation studies.
Proposition 1 matches the facts that: (a) an orthogonality-constrained projector performs the same as the no-projector setting; (b) a fixed low-rank projector performs the same as a fixed low-rank diagonal projector; (c) a trainable linear projector performs the same as a trainable diagonal projector.
Proposition 2 matches the observation that a low-rank projector has the highest accuracy.
Please see more detailed ablation study discussion and additional ablation experiments in Appendix F.
In this work, we showed that contrastive self-supervised learning suffers from dimensional collapse, where the embedding vectors only span a lower-dimensional subspace. We provided the theoretical understanding of this phenomenon and showed that there are two mechanisms causing dimensional collapse: strong augmentation and implicit regularization. Inspired by our theory, we proposed a novel contrastive self-supervised learning method DirectCLR that directly optimizes the representation space without relying on a trainable projector. DirectCLR outperforms SimCLR with a linear projector on ImageNet.
We thank Yubei Chen, Jiachen Zhu, Adrien Bardes, Nicolas Ballas, Randall Balestriero, and Quentin Garrido for useful discussions. We thank Wieland Brendel for the insightful discussion on understanding the role of the projector.
We provide detailed proof for all the lemmas and theorems in the Appendices. Code (in PyTorch) is available at https://github.com/facebookresearch/directclr
We adapt two useful lemmas from Arora et al. (2019a).
Given a matrix $W$ and the dynamics that $W$ evolves by $\dot{W}$, the singular values of this matrix evolve by:
where $\textbf{u}^{k}$ and $\textbf{v}^{k}$ are the left and right singular vectors corresponding to singular value $\sigma^{k}$, i.e., the $k$-th columns of matrices $U$ and $V$ respectively.
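A quick numerical sanity check of this lemma (a numpy sketch with an arbitrary random matrix, not part of the paper's experiments): for a small finite perturbation $\delta W$ standing in for $\dot W\,dt$, the change of each singular value should match ${\textbf{u}^k}^T\,\delta W\,\textbf{v}^k$ to first order.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.standard_normal((n, n))
dW = 1e-6 * rng.standard_normal((n, n))   # small perturbation playing the role of W_dot * dt

U, S, Vt = np.linalg.svd(W)
S_pert = np.linalg.svd(W + dW, compute_uv=False)

# Lemma 4 prediction: d(sigma^k) = u_k^T dW v_k
predicted = np.array([U[:, k] @ dW @ Vt[k, :] for k in range(n)])
actual = S_pert - S
print(np.max(np.abs(actual - predicted)))  # residual is second order in ||dW||
```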
Multiplying by $U^{T}$ from the left and by $V$ from the right, and noting that $U$ and $V$ are orthogonal matrices, we have
Since $S=\mathrm{diag}(\sigma^{k})$ is a diagonal matrix, we have
Again, since $\textbf{u}^{k}$ and $\textbf{v}^{k}$ have unit norm, we have ${\textbf{u}^{k}}^{T}\dot{\textbf{u}^{k}}=0$ and $\dot{\textbf{v}^{k}}^{T}\textbf{v}^{k}=0$. Therefore, we derive
where $\odot$ represents Hadamard element-wise multiplication and $H$ is a skew-symmetric matrix
Same as in the proof of Lemma 4, we start from the following equation
Considering that $U^{T}\dot{U}$ and $\dot{V}^{T}V$ are skew-symmetric matrices, whose diagonal terms are all zero, we Hadamard-multiply both sides of the equation by $\bar{I}$, where $\bar{I}$ has all diagonal values equal to zero and all off-diagonal values equal to one. We have
Taking the transpose, we have
The alignment matrix $A$, defined by $A=V_2^{T}U_1$, evolves by:
and $F$ is defined by
According to Lemma 4,
The gradient on matrix W𝑊W is
We denote the gradients on $\textbf{z}_i$ and $\textbf{z}_i'$ as $\textbf{g}_{\textbf{z}_i}$ and $\textbf{g}_{\textbf{z}_i'}$, respectively. Since $\frac{\partial\textbf{z}_i}{\partial W}=\textbf{x}_i$ and $\frac{\partial\textbf{z}_i'}{\partial W}=\textbf{x}_i'$, we get
Given the fact that $\sum_{j\neq i}\alpha_{ij}=1-\alpha_{ii}$, we have $\sum_{i}\sum_{j\neq i}\alpha_{ij}\textbf{x}_i'\textbf{x}_i^{T}=\sum_{i}(1-\alpha_{ii})\textbf{x}_i'\textbf{x}_i^{T}$. Also, since $\sum_{i}\sum_{j\neq i}$ iterates over all pairs $(i,j)$, we can swap the indices $i$ and $j$: $\sum_{i}\sum_{j\neq i}\alpha_{ij}\textbf{x}_j\textbf{x}_i^{T}=\sum_{i}\sum_{j\neq i}\alpha_{ji}\textbf{x}_i\textbf{x}_j^{T}$.
Applying the eigendecomposition $X=U\Lambda U^{T}$, we have $\exp(Xt)=U\exp(\Lambda t)U^{T}$. Therefore,
Because $X$ has negative eigenvalues, i.e., $\Lambda$ has negative entries, $\exp(\Lambda t)$ becomes rank deficient as $t\rightarrow\infty$. Therefore, $W(\infty)$ is also rank deficient: the weight matrix $W$ has vanishing singular values.
A similar proof applies to $W_1$.
therefore,
According to Eqn 9, $G=-W_2W_1X$, so we have
Because $X$ is a positive definite matrix and for all $t$, $W_2(t)W_1(t)\neq 0$, we know $B:=W_2W_1XW_1^{T}W_2^{T}$ is positive semi-definite and $B\neq 0$. Therefore, $tr(B)=\sum_{k}\lambda_{k}(B)>0$, since not all eigenvalues of $B$ are zero.
Therefore, we know $||W_1||_{F}^{2}\rightarrow+\infty$ (and similarly $||W_2||_{F}^{2}\rightarrow+\infty$). In the limit $t\rightarrow+\infty$, we have
Plugging in the singular value decompositions of $W_1$ and $W_2$, we have $U_1S_1^{2}U_1^{T}=V_2S_2^{2}V_2^{T}$. Assuming $W_1$ and $W_2$ have non-degenerate singular values, by the uniqueness of the eigendecomposition, we have
Remark. Note that when the non-degenerate singular value assumption does not hold, the corresponding singular vectors are not unique and we will not observe the corresponding dimensions becoming aligned.
According to Theorem 2, for $\sigma_1^{k}$ and $\sigma_2^{k}$ with the same index, the corresponding singular vector pairs $\textbf{v}_2^{k}$ and $\textbf{u}_1^{k}$ become aligned, i.e., ${\textbf{v}_2^{k'}}^{T}\textbf{u}_1^{k}\rightarrow\delta_{k,k'}$. Therefore, Eqn 21 and Eqn 22 can be simplified to
Inserting Eqn 9 and considering the alignment, we derive
In our toy model, we focused on a two-layer linear MLP setting. Here, we empirically show that our theory extends to multilayer and nonlinear cases, as shown in Figure 11(a).
Stronger over-parametrization leads to a stronger collapsing effect, which has been shown theoretically (Arora et al., 2019a; Barrett & Dherin, 2021) and empirically (Jing et al., 2020). This can be explained by the fact that more adjacent matrices get aligned, so the collapsing in the product matrix gets amplified. Note that in the single-layer case, $L=1$, there is no dimensional collapse in the embedding space, which is consistent with our analysis.
We empirically show that the collapsing effect also applies to the nonlinear scenario. We insert ReLU between linear layers and observe a similar singular value collapse compared to the linear case. See Figure 11(b).
Each input image is transformed twice to produce the two distorted views for contrastive loss. The image augmentation pipeline includes random cropping, resizing to 224x224, random horizontal flipping, color jittering, grayscale, Gaussian blurring, and solarization.
Throughout the ImageNet experiments in this paper, we use a ResNet-50 (He et al., 2016) as an encoder. This network has an output of dimension 2048, which is called a representation vector.
We use a LARS optimizer and train all models for 100 epochs. The batch size is 4096, which fits into 32 GPUs during training. The learning rate is 4.8, as in SimCLR (Chen et al., 2020a), with 10 epochs of warm-up followed by a cosine decay schedule.
Here, we list the ImageNet accuracy for various values of $d_0$ in Figure 12. When $d_0\rightarrow 0$, too little gradient information comes from the loss and the performance drops. When $d_0\rightarrow 2048$, the model converges to standard SimCLR without a projector, which we know suffers from dimensional collapse in the representation space.
Fixed low-rank projector vs. fixed low-rank diagonal projector: DirectCLR is equivalent to SimCLR with a fixed low-rank diagonal projector. It performs the same as SimCLR with a fixed low-rank projector, which achieves 62.3% linear probe accuracy. Specifically, the singular values of this low-rank matrix are set to $d_0$ ones, with zeros for the rest, and the matrix is then left- and right-multiplied by fixed orthogonal matrices. Therefore, the only difference is that this fixed projector has an extra fixed orthogonal matrix in between.
Trainable projector vs. trainable diagonal projector: We trained a SimCLR model with a trainable projector constrained to be diagonal. The model achieves 60.2% linear probe accuracy on ImageNet, which is close to SimCLR with a 1-layer linear projector.
Orthogonal projector vs. no projector: We train a SimCLR model with a single-layer projector under an orthogonality constraint using the ExpM parametrization (Casado & Martínez-Rubio, 2019), so the projector weight matrix has all singular values fixed to 1. This model reaches 52.2% accuracy on ImageNet, which is close to SimCLR without a projector.
These ablation studies verify Proposition 1: the SimCLR projector only needs to be diagonal. Also, according to Table 2, we find that the low-rank projector settings consistently improve performance, which verifies Proposition 2.
Linear probe on subvector instead of the entire vector: For DirectCLR, we perform a linear probe only on the sub-vector z and get 47.9% accuracy on ImageNet. This shows that the rest of r still contains useful information even though it does not receive gradients directly from the loss function.
Random dropout instead of fixed subvector: Since DirectCLR drops a number of dimensions from the loss function, it is natural to ask whether random dropout can reach the same performance. We train a SimCLR model without a projector and randomly feed $d_0$ features to the InfoNCE loss at every iteration. This model reaches only 43.0% accuracy on ImageNet. This demonstrates the importance of using a fixed subvector, which allows the alignment effect to happen.
Table: S6.T1: Linear probe accuracy on ImageNet. Each model is trained on ImageNet for 100 epochs with standard training recipe. The backbone encoder is a ResNet50. DirectCLR outperforms SimCLR with 1-layer linear projector.
| Loss function | Projector | Accuracy |
|---|---|---|
| SimCLR | 2-layer nonlinear projector | 66.5 |
| SimCLR | 1-layer linear projector | 61.1 |
| SimCLR | no projector | 51.5 |
| DirectCLR | no projector | 62.7 |
Table: S6.T2: Ablation study: top-1 accuracies on ImageNet by SimCLR model with different projector settings.
| Projector | diagonal | low-rank | Top-1 Accuracy |
|---|---|---|---|
| no projector | | | 51.5 |
| orthogonal projector | | | 52.2 |
| trainable projector | | | 61.1 |
| trainable diagonal projector | ✓ | | 60.2 |
| fixed low-rank projector | | ✓ | 62.3 |
| fixed low-rank diagonal projector | ✓ | ✓ | 62.7 |
(a) embedding space; (b) complete collapse
Singular value spectrum of the embedding space. The embedding vectors are computed from a pretrained SimCLR model on the validation set of ImageNet. Each embedding vector has a dimension of 128. The spectrum contains the singular values of the covariance matrix of these embedding vectors in sorted order and logarithmic scale. A number of singular values drop to zero, indicating collapsed dimensions.
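For reference, such a spectrum takes only a few lines of numpy to compute. The sketch below uses synthetic embeddings with an artificially collapsed subspace standing in for real SimCLR outputs (the sample count, dimension, and number of collapsed directions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, collapsed = 1000, 128, 30          # synthetic stand-in for SimCLR embeddings

# Embeddings that span only the first d - collapsed dimensions.
z = rng.standard_normal((N, d))
z[:, d - collapsed:] = 0.0

C = np.cov(z, rowvar=False)              # covariance matrix of the embedding vectors
spectrum = np.linalg.svd(C, compute_uv=False)   # singular values, sorted descending
log_spectrum = np.log10(spectrum + 1e-12)       # logarithmic scale, as in the figure
print(log_spectrum[:5], log_spectrum[-5:])      # the tail drops to the floor
```

The trailing singular values dropping to (numerical) zero are exactly the "collapsed dimensions" the figure refers to.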
Weight matrix singular value spectrum with different augmentation amplitude k𝑘k. The setting is a single layer linear toy model with each weight matrix of the size of 16x16, where the block has the size of 8x8. Strong augmentation results in vanishing singular values in weight matrices.
Two-layer Linear Model
Visualization of the alignment matrix $A=V_2^{T}U_1$ after training. The setting is a 2-layer linear toy model with weight matrices of size 16x16. The alignment matrix converges to an identity matrix.
(a) $W_1$
DirectCLR: no explicit trainable projector; simply apply the InfoNCE loss on a fixed sub-vector of the representations.
Representation space spectrum of DirectCLR compared to SimCLR (a) with a 2-layer nonlinear projector, (b) with a 1-layer linear projector, (c) without a projector. The spectra are computed from the output of the backbone, using the ImageNet validation set. Similar to SimCLR with projectors, DirectCLR is able to prevent dimensional collapse in the representation space.
Why is the whole representation vector r meaningful in DirectCLR while only part of it receives gradient? It takes advantage of the residual connection in the backbone. The gradient passing through the representation vector is low-rank: only the first $d_0$ channel dimensions are non-zero. When the gradient enters the ResNet backbone and passes through the last nonlinear conv block, it becomes full rank. Therefore, the hidden layer h receives gradients on all channels. During the forward pass, h is directly fed into the representation vector via the residual connection. Therefore, the entire representation vector r is meaningful.
Hyperparameter tuning on d0subscript𝑑0d_{0} based on ImageNet linear probe Top-1 accuracy.
$$ C = \frac{1}{N}\sum_{i=1}^N (\textbf{z}_i-\bar{\textbf{z}})(\textbf{z}_i-\bar{\textbf{z}})^T \label{eq:cov-embedding-layer} $$ \tag{eq:cov-embedding-layer}
$$ \label{eqn:infoNCE} L = -\sum_{i=1}^N\log\frac{\exp(-|\textbf{z}_i -\textbf{z}_i'|^2/2)}{\sum_{j\neq i}\exp(-|\textbf{z}_i -\textbf{z}_j|^2/2) + \exp(-|\textbf{z}_i -\textbf{z}_i'|^2/2)} $$ \tag{eqn:infoNCE}
$$ \textbf{g}_{\textbf{z}_i} = \sum_{j\neq i} \alpha_{ij}(\textbf{z}_j-\textbf{z}_i') + \sum_{j\neq i}\alpha_{ji}(\textbf{z}_j-\textbf{z}_i), \quad\quad\quad \textbf{g}_{\textbf{z}_i'} = \sum_{j\neq i}\alpha_{ij}(\textbf{z}_i' - \textbf{z}_i) $$
$$ \label{eqn:G2X} G= - WX $$ \tag{eqn:G2X}
$$ X = \hat{\Sigma}_0 - \hat{\Sigma}_1 $$
$$ \dot{W} = \dot{U}SV^T + U\dot{S}V^T + US\dot{V}^T $$
$$ U^T\dot{W}V = U^T\dot{U}S+ \dot{S}+ S\dot{V}^TV $$
$$ \dot{\sigma^k} = {\textbf{u}^k}^T\dot{W}\textbf{v}^k - {\textbf{u}^k}^T\dot{\textbf{u}^k}\sigma^k - \sigma^k \dot{\textbf{v}^k}^T\textbf{v}^k $$
$$ \dot{\sigma^k} = {\textbf{u}^k}^T\dot{W}\textbf{v}^k $$
$$ \label{eqn:H} H^{k,k'} = \begin{cases} 1/({\sigma^k}^2 - {\sigma^{k'}}^2) & \text{if $k\neq k'$} \\ 0 & \text{if $k=k'$} \end{cases} $$ \tag{eqn:H}
$$ U^T\dot{U}S^2 - S^2U^T\dot{U} = \bar{I} \odot (U^T\dot{W}VS + SV^T\dot{W}U) $$
$$ \dot{A} = - A (H_1 \odot (A^TF + F^TA)) + (H_2 \odot (AF^T + FA^T)) A $$
$$ F = S_2 U_2^T G V_1 S_1 $$
$$ \frac{dL}{dW} = \sum_i(\frac{\partial L}{\partial \textbf{z}_i}\frac{\partial \textbf{z}_i}{\partial W} + \frac{\partial L}{\partial \textbf{z}_i'}\frac{\partial \textbf{z}_i'}{\partial W}) $$
$$ W(t) = W(0)\exp(Xt) $$
$$ \dot{W} = -(\frac{dL}{dW})^T = -\sum_i (\textbf{g}_{\textbf{z}_i}\textbf{x}_i^T + \textbf{g}_{\textbf{z}_i'}{\textbf{x}_i'}^T) $$
$$ \frac{d}{dt}(W_1W_1^T - W_2^TW_2) = 0 $$
$$ U_1 = V_2 $$
$$ \displaystyle\dot{W}=-G $$
$$ \displaystyle\dot{U_{1}}=U_{1}(H_{1}\odot(U_{1}^{T}\dot{W_{1}}V_{1}S_{1}+S_{1}V_{1}^{T}\dot{W_{1}}^{T}U_{1})) $$
$$ \displaystyle\dot{\sigma_{1}^{k}}=-\sum_{k'}({\textbf{v}_{2}^{k'}}^{T}\textbf{u}_{1}^{k})\,\sigma_{2}^{k'}\,({\textbf{u}_{2}^{k'}}^{T}G\textbf{v}_{1}^{k}) $$
Lemma. Lemma 1. The weight matrix in a linear contrastive self-supervised learning model evolves by: W˙=−G˙𝑊𝐺\displaystyle\dot{W}=-G (3) where G=∑i(g𝐳ixiT+g𝐳i′xi′T)𝐺subscript𝑖subscriptgsubscript𝐳𝑖superscriptsubscriptx𝑖𝑇subscriptgsuperscriptsubscript𝐳𝑖′superscriptsubscriptx𝑖′𝑇G=\sum_{i}(\textbf{g}{{\bm{z}}{i}}\textbf{x}{i}^{T}+\textbf{g}{{\bm{z}}{i}^{\prime}}\textbf{x}{i}^{\prime T}), and gzisubscriptgsubscriptz𝑖\textbf{g}{\textbf{z}{i}} is the gradient on the embedding vector zisubscriptz𝑖\textbf{z}{i} (similarly gzi′subscriptgsuperscriptsubscriptz𝑖′\textbf{g}{\textbf{z}_{i}^{\prime}}).
Lemma. Lemma 2. X𝑋X is a difference of two PSD matrices: X=Σ^0−Σ^1𝑋subscript^Σ0subscript^Σ1X=\hat{\Sigma}{0}-\hat{\Sigma}{1} (7) Here Σ^0=∑i,jαij(xi−xj)(xi−xj)Tsubscript^Σ0subscript𝑖𝑗subscript𝛼𝑖𝑗subscriptx𝑖subscriptx𝑗superscriptsubscriptx𝑖subscriptx𝑗𝑇\hat{\Sigma}{0}=\sum{i,j}\alpha_{ij}(\textbf{x}{i}-\textbf{x}{j})(\textbf{x}{i}-\textbf{x}{j})^{T} is a weighted data distribution covariance matrix and Σ^1=∑i(1−αii)(xi′−xi)(xi′−xi)Tsubscript^Σ1subscript𝑖1subscript𝛼𝑖𝑖superscriptsubscriptx𝑖′subscriptx𝑖superscriptsuperscriptsubscriptx𝑖′subscriptx𝑖𝑇\hat{\Sigma}{1}=\sum{i}(1-\alpha_{ii})(\textbf{x}{i}^{\prime}-\textbf{x}{i})(\textbf{x}{i}^{\prime}-\textbf{x}{i})^{T} is a weighted augmentation distribution covariance matrix.
Theorem. With a fixed matrix $X$ (defined in Lemma 2) and augmentation strong enough that $X$ has negative eigenvalues, the weight matrix $W$ has vanishing singular values.
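The theorem above can be checked numerically: integrating the linear dynamics $\dot{W} = WX$ with an $X$ whose spectrum contains negative eigenvalues drives the corresponding singular values of $W$ toward zero. Below is a minimal sketch; the synthetic $X$ with a hand-picked mixed-sign spectrum is an illustrative assumption, not the paper's actual covariance difference.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Synthetic X with a mixed-sign spectrum, standing in for
# \hat{Sigma}_0 - \hat{Sigma}_1 under strong augmentation.
eigvals = np.linspace(1.0, -1.0, d)            # some eigenvalues are negative
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
X = Q @ np.diag(eigvals) @ Q.T

# Forward-Euler integration of dW/dt = W X.
W = 0.1 * rng.standard_normal((d, d))
dt, steps = 1e-2, 2000
for _ in range(steps):
    W = W + dt * (W @ X)

s = np.linalg.svd(W, compute_uv=False)
print(s.max(), s.min())   # directions with negative eigenvalues decay toward zero
```

The singular values along the negative-eigenvalue directions of $X$ shrink exponentially, so the embedding spans an increasingly low-dimensional subspace, which is the dimensional collapse the theorem describes.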
Corollary. [Dimensional Collapse Caused by Strong Augmentation] With strong augmentation, the embedding space covariance matrix becomes low-rank.
Theorem. [Weight matrices align] If for all $t$, $W_2(t)W_1(t)\neq 0$, $X(t)$ is positive definite, and $W_1(+\infty)$, $W_2(+\infty)$ have distinct singular values, then the alignment matrix $A = V_2^T U_1\rightarrow I$.
Theorem 3. If $W_2$ and $W_1$ are aligned (i.e., $V_2^TU_1 = I$), then the singular values of the weight matrices $W_1$ and $W_2$ under the InfoNCE loss evolve by:
$$ \dot{\sigma}_1^k = \sigma_1^k(\sigma_2^k)^2({\textbf{v}_1^k}^TX\textbf{v}_1^k) $$
$$ \dot{\sigma}_2^k = \sigma_2^k(\sigma_1^k)^2({\textbf{v}_1^k}^TX\textbf{v}_1^k) $$
Proposition. A linear projector weight matrix only needs to be diagonal.
Lemma 4. Given a matrix $W$ whose dynamics are given by $\dot{W}$, the singular values of this matrix evolve by:
$$ \dot{\sigma}^k = {\textbf{u}^k}^T\dot{W}\textbf{v}^k $$
where $\textbf{u}^k$ and $\textbf{v}^k$ are the left and right singular vectors corresponding to the singular value $\sigma^k$, i.e., the $k$-th columns of the matrices $U$ and $V$ respectively.
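Lemma 4 can be sanity-checked with finite differences: perturbing $W$ by $\epsilon\dot{W}$ should change each singular value by approximately $\epsilon\,{\textbf{u}^k}^T\dot{W}\textbf{v}^k$. A minimal numerical check, assuming random matrices with well-separated singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
W = rng.standard_normal((d, d))
Wdot = rng.standard_normal((d, d))      # an arbitrary direction of change

U, S, Vt = np.linalg.svd(W)
eps = 1e-6
S_pert = np.linalg.svd(W + eps * Wdot, compute_uv=False)

# Lemma 4 predicts d(sigma^k)/dt = u_k^T Wdot v_k for each singular value.
predicted = np.array([U[:, k] @ Wdot @ Vt[k, :] for k in range(d)])
numerical = (S_pert - S) / eps
print(np.max(np.abs(predicted - numerical)))
```

Note that numpy returns singular values sorted in descending order, so the small perturbation preserves the pairing between `predicted` and `numerical` as long as the singular values are distinct.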
Lemma 5. Given a matrix $W$ whose dynamics are given by $\dot{W}$, the singular vectors of this matrix evolve by:
$$ \dot{U} = U(H\odot(U^T\dot{W}VS + SV^T\dot{W}^TU)) $$
$$ \dot{V} = V(H\odot(V^T\dot{W}^TUS + SU^T\dot{W}V)) $$
where $\odot$ denotes Hadamard (element-wise) multiplication and $H$ is the skew-symmetric matrix
$$ H^{k,k'} = \begin{cases} 1/({\sigma^k}^2 - {\sigma^{k'}}^2) & \text{if } k\neq k' \\ 0 & \text{if } k=k' \end{cases} $$
Lemma 6 (Alignment matrix dynamics). The alignment matrix $A$, defined by $A = V_2^TU_1$, evolves by:
$$ \dot{A} = -A(H_1\odot(A^TF + F^TA)) + (H_2\odot(AF^T + FA^T))A $$
where $\odot$ denotes Hadamard (element-wise) multiplication, $H_l$ is the skew-symmetric matrix whose $(k,k')$-entry is
$$ H_l^{k,k'} = \begin{cases} 1/({\sigma_l^k}^2 - {\sigma_l^{k'}}^2) & \text{if } k\neq k' \\ 0 & \text{if } k=k' \end{cases} $$
and $F$ is defined by
$$ F = S_2U_2^TGV_1S_1 $$
Lemma 7 (Singular value dynamics). The singular values of the weight matrices $W_1$ and $W_2$ evolve by:
$$ \dot{\sigma}_1^k = -\sum_{k'}({\textbf{v}_2^{k'}}^T\textbf{u}_1^k)\,\sigma_2^{k'}\,({\textbf{u}_2^{k'}}^TG\textbf{v}_1^k) $$
$$ \dot{\sigma}_2^k = -\sum_{k'}({\textbf{u}_1^{k'}}^T\textbf{v}_2^k)\,\sigma_1^{k'}\,({\textbf{u}_2^k}^TG\textbf{v}_1^{k'}) $$
Linear probe on the subvector instead of the entire vector: For DirectCLR, we perform a linear probe only on the subvector $z$ and obtain 47.9% accuracy on ImageNet. This shows that the rest of $r$ still contains useful information even though it does not receive gradients directly from the loss function.
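The core idea of DirectCLR, applying the InfoNCE loss directly to a subvector of the backbone representation instead of a projector output, can be sketched as follows. The batch size, subvector dimension `d0`, and temperature below are illustrative assumptions, not the paper's training configuration:

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE between two batches of embeddings (one row per sample)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                  # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # positives lie on the diagonal

def direct_clr_loss(r1, r2, d0):
    """DirectCLR: apply InfoNCE to the first d0 components of the
    representation r; the remaining components receive no direct gradient."""
    return info_nce(r1[:, :d0], r2[:, :d0])

rng = np.random.default_rng(2)
r1 = rng.standard_normal((16, 64))                    # representations of view 1
r2 = r1 + 0.1 * rng.standard_normal((16, 64))         # representations of view 2
loss = direct_clr_loss(r1, r2, d0=32)
print(loss)
```

In this sketch only `r[:, :d0]` sees the loss gradient, which matches the observation above that the rest of $r$ is trained only indirectly, through the backbone.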
| Loss function | Projector | Accuracy |
|---|---|---|
| SimCLR | 2-layer nonlinear projector | 66.5 |
| SimCLR | 1-layer linear projector | 61.1 |
| SimCLR | no projector | 51.5 |
| DirectCLR | no projector | 62.7 |
| Projector | Diagonal | Low-rank | Top-1 Accuracy |
|---|---|---|---|
| no projector |  |  | 51.5 |
| orthogonal projector |  |  | 52.2 |
| trainable projector |  |  | 61.1 |
| trainable diagonal projector | ✓ |  | 60.2 |
| fixed low-rank projector |  | ✓ | 62.3 |
| fixed low-rank diagonal projector | ✓ | ✓ | 62.7 |
$$ G = -W_2W_1X $$
$$ \bar{I} \odot V^T\dot{W}^TU = -SU^T\dot{U} - \dot{V}^TVS $$
$$ V_2^T U_1 = I $$
Theorem. If $W_2$ and $W_1$ are aligned (i.e., $V_2^TU_1 = I$), then the singular values of the weight matrices $W_1$ and $W_2$ under the InfoNCE loss evolve by:
$$ \dot{\sigma}_1^k = \sigma_1^k(\sigma_2^k)^2({v_1^k}^TXv_1^k) $$
$$ \dot{\sigma}_2^k = \sigma_2^k(\sigma_1^k)^2({v_1^k}^TXv_1^k) $$
Lemma. The weight matrix in a linear contrastive self-supervised learning model evolves by:
$$ \dot{W} = -G $$
where $G = \sum_i (g_{z_i}x_i^T + g_{z_i'}{x_i'}^T)$, and $g_{z_i}$ is the gradient on the embedding vector $z_i$ (similarly $g_{z_i'}$).
Lemma. $X$ is a difference of two PSD matrices:
$$ X = \hat{\Sigma}_0 - \hat{\Sigma}_1 $$
Here $\hat{\Sigma}_0 = \sum_{i,j}\alpha_{ij}(x_i-x_j)(x_i-x_j)^T$ is a weighted data distribution covariance matrix and $\hat{\Sigma}_1 = \sum_i(1-\alpha_{ii})(x_i'-x_i)(x_i'-x_i)^T$ is a weighted augmentation distribution covariance matrix.
Lemma. The weight matrices of the two-layer linear contrastive self-supervised learning model evolve by (with $G = \sum_i (g_{z_i}x_i^T + g_{z_i'}{x_i'}^T)$ as defined above):
$$ \dot{W}_1 = -W_2^TG, \qquad \dot{W}_2 = -GW_1^T $$
Lemma. Given a matrix $W$ whose dynamics are given by $\dot{W}$, the singular values of this matrix evolve by:
$$ \dot{\sigma}^k = {u^k}^T\dot{W}v^k $$
where $u^k$ and $v^k$ are the left and right singular vectors corresponding to the singular value $\sigma^k$, i.e., the $k$-th columns of the matrices $U$ and $V$ respectively.
Lemma. Given a matrix $W$ whose dynamics are given by $\dot{W}$, the singular vectors of this matrix evolve by:
$$ \dot{U} = U(H\odot(U^T\dot{W}VS + SV^T\dot{W}^TU)) $$
$$ \dot{V} = V(H\odot(V^T\dot{W}^TUS + SU^T\dot{W}V)) $$
where $\odot$ denotes Hadamard (element-wise) multiplication and $H$ is the skew-symmetric matrix
$$ H^{k,k'} = \begin{cases} 1/({\sigma^k}^2 - {\sigma^{k'}}^2) & \text{if } k\neq k' \\ 0 & \text{if } k=k' \end{cases} $$
Lemma. [Alignment matrix dynamics] The alignment matrix $A$, defined by $A = V_2^TU_1$, evolves by:
$$ \dot{A} = -A(H_1\odot(A^TF + F^TA)) + (H_2\odot(AF^T + FA^T))A $$
where $\odot$ denotes Hadamard (element-wise) multiplication, $H_l$ is the skew-symmetric matrix whose $(k,k')$-entry is
$$ H_l^{k,k'} = \begin{cases} 1/({\sigma_l^k}^2 - {\sigma_l^{k'}}^2) & \text{if } k\neq k' \\ 0 & \text{if } k=k' \end{cases} $$
and $F$ is defined by
$$ F = S_2U_2^TGV_1S_1 $$
Proof. Given a matrix $W$ and its singular value decomposition $W = USV^T$, the dynamics of the matrix are
$$ \dot{W} = \dot{U}SV^T + U\dot{S}V^T + US\dot{V}^T $$
Multiplying by $U^T$ from the left and by $V$ from the right, and using the fact that $U$ and $V$ are orthogonal matrices, we have
$$ U^T\dot{W}V = U^T\dot{U}S + \dot{S} + S\dot{V}^TV $$
Since $S = \mathrm{diag}(\sigma^k)$ is a diagonal matrix, the $k$-th diagonal entry reads
$$ \dot{\sigma}^k = {u^k}^T\dot{W}v^k - {u^k}^T\dot{u}^k\sigma^k - \sigma^k{\dot{v}^k}{}^Tv^k $$
Again, since $u^k$ and $v^k$ have unit norm, we have ${u^k}^T\dot{u}^k = 0$ and ${v^k}^T\dot{v}^k = 0$. Therefore, we derive
$$ \dot{\sigma}^k = {u^k}^T\dot{W}v^k $$
Proof. As in the proof of the singular value dynamics lemma, we start from the equation
$$ U^T\dot{W}V = U^T\dot{U}S + \dot{S} + S\dot{V}^TV $$
Since $U^T\dot{U}$ and $\dot{V}^TV$ are skew-symmetric matrices whose diagonal entries are all zero, we Hadamard-multiply both sides by $\bar{I}$, the matrix with all diagonal entries equal to zero and all off-diagonal entries equal to one, to obtain
$$ \bar{I}\odot U^T\dot{W}V = U^T\dot{U}S + S\dot{V}^TV $$
Taking the transpose, we have
$$ \bar{I}\odot V^T\dot{W}^TU = -SU^T\dot{U} - \dot{V}^TVS $$
Right-multiplying the first of these two equations by $S$, left-multiplying the second by $S$, and adding them up, we have
$$ U^T\dot{U}S^2 - S^2U^T\dot{U} = \bar{I}\odot(U^T\dot{W}VS + SV^T\dot{W}^TU) $$
Therefore, we have
$$ \dot{U} = U(H\odot(U^T\dot{W}VS + SV^T\dot{W}^TU)) $$
where
$$ H^{k,k'} = \begin{cases} 1/({\sigma^k}^2 - {\sigma^{k'}}^2) & \text{if } k\neq k' \\ 0 & \text{if } k=k' \end{cases} $$
A similar proof applies to the equation for $\dot{V}$.
Proof. According to the singular vector dynamics lemma, we have
$$ \dot{U}_1 = U_1(H_1\odot(U_1^T\dot{W}_1V_1S_1 + S_1V_1^T\dot{W}_1^TU_1)) $$
$$ \dot{V}_2 = V_2(H_2\odot(V_2^T\dot{W}_2^TU_2S_2 + S_2U_2^T\dot{W}_2V_2)) $$
Plugging in these two equations together with the weight dynamics $\dot{W}_1 = -W_2^TG$ and $\dot{W}_2 = -GW_1^T$, the dynamics of the alignment matrix $A = V_2^TU_1$ can be written as
$$
\begin{aligned}
\dot{A} &= \dot{V}_2^TU_1 + V_2^T\dot{U}_1 \\
&= V_2^TU_1(H_1\odot(U_1^T\dot{W}_1V_1S_1 + S_1V_1^T\dot{W}_1^TU_1)) + (H_2\odot(V_2^T\dot{W}_2^TU_2S_2 + S_2U_2^T\dot{W}_2V_2))^TV_2^TU_1 \\
&= -A(H_1\odot(U_1^TW_2^TGV_1S_1 + S_1V_1^TG^TW_2U_1)) + (H_2\odot(S_2U_2^TGW_1^TV_2 + V_2^TW_1G^TU_2S_2))A \\
&= -A(H_1\odot(U_1^TV_2S_2U_2^TGV_1S_1 + S_1V_1^TG^TU_2S_2V_2^TU_1)) \\
&\quad + (H_2\odot(S_2U_2^TGV_1S_1U_1^TV_2 + V_2^TU_1S_1V_1^TG^TU_2S_2))A \\
&= -A(H_1\odot(A^TS_2U_2^TGV_1S_1 + S_1V_1^TG^TU_2S_2A)) + (H_2\odot(S_2U_2^TGV_1S_1A^T + AS_1V_1^TG^TU_2S_2))A \\
&= -A(H_1\odot(A^TF + F^TA)) + (H_2\odot(AF^T + FA^T))A
\end{aligned}
$$
where
$$ F = S_2U_2^TGV_1S_1 $$
Proof. According to the singular value dynamics lemma,
$$ \dot{\sigma}_1^k = {u_1^k}^T\dot{W}_1v_1^k $$
Plugging in $\dot{W}_1 = -W_2^TG$, we have
$$
\begin{aligned}
\dot{\sigma}_1^k &= -{u_1^k}^TW_2^TGv_1^k \\
&= -{u_1^k}^TV_2S_2U_2^TGv_1^k \\
&= -\sum_{k'}({v_2^{k'}}^Tu_1^k)\,\sigma_2^{k'}\,({u_2^{k'}}^TGv_1^k)
\end{aligned}
$$
A similar derivation applies to $\dot{\sigma}_2^k$.
Proof. Starting from the definition of $X$,
$$
\begin{aligned}
X &= \sum_i\Big(\sum_{j\neq i}\alpha_{ij}(x_i'-x_j) + \sum_{j\neq i}\alpha_{ji}(x_i-x_j)\Big)x_i^T - \sum_i(1-\alpha_{ii})(x_i'-x_i){x_i'}^T \\
&= \sum_i\sum_{j\neq i}\alpha_{ij}x_i'x_i^T - \sum_i\sum_{j\neq i}\alpha_{ij}x_jx_i^T + \sum_i\sum_{j\neq i}\alpha_{ji}(x_i-x_j)(x_i-x_j)^T \\
&\quad + \sum_i\sum_{j\neq i}\alpha_{ji}(x_i-x_j)x_j^T - \sum_i(1-\alpha_{ii})(x_i'-x_i)(x_i'-x_i)^T - \sum_i(1-\alpha_{ii})(x_i'-x_i)x_i^T
\end{aligned}
$$
Given that $\sum_{j\neq i}\alpha_{ij} = 1-\alpha_{ii}$, we have $\sum_i\sum_{j\neq i}\alpha_{ij}x_i'x_i^T = \sum_i(1-\alpha_{ii})x_i'x_i^T$. Also, since $\sum_i\sum_{j\neq i}$ iterates over all pairs $(i,j)$, we can swap the indices $i$ and $j$, giving $\sum_i\sum_{j\neq i}\alpha_{ij}x_jx_i^T = \sum_i\sum_{j\neq i}\alpha_{ji}x_ix_j^T$. Therefore
$$ X = \sum_i\sum_{j\neq i}\alpha_{ji}(x_i-x_j)(x_i-x_j)^T - \sum_i(1-\alpha_{ii})(x_i'-x_i)(x_i'-x_i)^T $$
Proof. According to the weight dynamics lemma, we have
$$ \frac{d}{dt}W = WX $$
For a fixed $X$, this equation can be solved analytically:
$$ W(t) = W(0)\exp(Xt) $$
Applying the eigendecomposition $X = U\Lambda U^T$, we have $\exp(Xt) = U\exp(\Lambda t)U^T$. Therefore
$$ W(t) = W(0)U\exp(\Lambda t)U^T $$
Because $X$ has negative eigenvalues, i.e., $\Lambda$ has negative entries, $\exp(\Lambda t)$ becomes rank deficient as $t\rightarrow\infty$. Therefore $W(\infty)$ is also rank deficient, i.e., the weight matrix $W$ has vanishing singular values.
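The closed-form solution in this proof can be verified against direct numerical integration: $W(0)\exp(Xt)$, computed via the eigendecomposition of a symmetric $X$, should match forward-Euler integration of $\dot{W} = WX$. A sketch with a random symmetric $X$ (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 6
M = rng.standard_normal((d, d))
X = (M + M.T) / 2                     # symmetric, with a mixed-sign spectrum
W0 = rng.standard_normal((d, d))

# Closed form via the eigendecomposition used in the proof: W(t) = W(0) U exp(L t) U^T.
lam, U = np.linalg.eigh(X)
t = 0.5
W_closed = W0 @ (U @ np.diag(np.exp(lam * t)) @ U.T)

# Forward-Euler integration of dW/dt = W X over the same horizon.
W = W0.copy()
dt = 1e-4
for _ in range(int(t / dt)):
    W = W + dt * (W @ X)

print(np.max(np.abs(W - W_closed)))   # small discretization error
```

Since $X$ is symmetric here, `np.linalg.eigh` gives an orthogonal $U$, exactly matching the $X = U\Lambda U^T$ form used in the proof.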
Proof. The gradient on the matrix $W_2$ is
$$ \frac{dL}{dW_2} = \sum_i\left(\frac{\partial L}{\partial z_i}\frac{\partial z_i}{\partial W_2} + \frac{\partial L}{\partial z_i'}\frac{\partial z_i'}{\partial W_2}\right) $$
We denote the gradients on $z_i$ and $z_i'$ by $g_{z_i}$ and $g_{z_i'}$, respectively. Since $\partial z_i/\partial W_2 = W_1x_i$ and $\partial z_i'/\partial W_2 = W_1x_i'$, we get
$$ \dot{W}_2 = -\left(\frac{dL}{dW_2}\right)^T = -\sum_i (g_{z_i}x_i^T + g_{z_i'}{x_i'}^T)W_1^T $$
A similar proof applies to $W_1$.
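The gradient formula above can be checked numerically for a concrete choice of loss. Here we use the stand-in loss $L = \frac{1}{2}\sum_i\|z_i\|^2$ with $z_i = W_2W_1x_i$, so that $g_{z_i} = z_i$ (this simple quadratic is an assumption for illustration, not the InfoNCE loss), and compare the analytic gradient $\sum_i g_{z_i}(W_1x_i)^T$ against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d_in, d_hid, d_out = 4, 7, 6, 5
W1 = rng.standard_normal((d_hid, d_in))
W2 = rng.standard_normal((d_out, d_hid))
xs = rng.standard_normal((n, d_in))

# Stand-in loss L = 1/2 sum_i ||z_i||^2 with z_i = W2 W1 x_i, so g_{z_i} = z_i.
def loss(W2):
    Z = (W2 @ W1 @ xs.T).T
    return 0.5 * np.sum(Z ** 2)

# Analytic gradient per the proof: dL/dW2 = sum_i g_{z_i} (W1 x_i)^T.
Z = (W2 @ W1 @ xs.T).T
analytic = sum(np.outer(Z[i], W1 @ xs[i]) for i in range(n))

# Numerical gradient via central finite differences, entry by entry.
numerical = np.zeros_like(W2)
eps = 1e-6
for a in range(d_out):
    for b in range(d_hid):
        E = np.zeros_like(W2)
        E[a, b] = eps
        numerical[a, b] = (loss(W2 + E) - loss(W2 - E)) / (2 * eps)

print(np.max(np.abs(analytic - numerical)))
```

Because the stand-in loss is quadratic in $W_2$, the central difference is exact up to floating-point error, so the two gradients agree tightly.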
Proof. According to the two-layer weight dynamics $\dot{W}_1 = -W_2^TG$ and $\dot{W}_2 = -GW_1^T$, we have
$$ \frac{d}{dt}(W_1W_1^T) = -W_1G^TW_2 - W_2^TGW_1^T, \qquad \frac{d}{dt}(W_2^TW_2) = -W_2^TGW_1^T - W_1G^TW_2 $$
therefore
$$ \frac{d}{dt}(W_1W_1^T - W_2^TW_2) = 0 \quad\text{or}\quad W_1W_1^T - W_2^TW_2 = C $$
Next, we show that the Frobenius norm of each weight matrix grows to infinity:
$$ \frac{d}{dt}\|W_1\|_F^2 = \frac{d}{dt}\mathrm{tr}(W_1W_1^T) = -\mathrm{tr}(W_2^TGW_1^T) - \mathrm{tr}(W_1G^TW_2) $$
Since $G = -W_2W_1X$, we have
$$ -\mathrm{tr}(W_2^TGW_1^T) = \mathrm{tr}(W_2^TW_2W_1XW_1^T) = \mathrm{tr}(W_2W_1XW_1^TW_2^T) $$
Because $X$ is a positive definite matrix and $W_2(t)W_1(t)\neq 0$ for all $t$, the matrix $B := W_2W_1XW_1^TW_2^T$ is positive semi-definite and $B\neq 0$. Therefore $\mathrm{tr}(B) = \sum_k\lambda_k(B) > 0$, since not all eigenvalues of $B$ are zero. Hence $\|W_1\|_F^2\rightarrow+\infty$ (and similarly $\|W_2\|_F^2\rightarrow+\infty$). In the limit $t\rightarrow+\infty$, the constant $C$ becomes negligible and we have
$$ W_1W_1^T = W_2^TW_2 $$
Plugging in the singular value decompositions of $W_1$ and $W_2$, we have $U_1S_1^2U_1^T = V_2S_2^2V_2^T$. Assuming $W_1$ and $W_2$ have non-degenerate singular values, the uniqueness of the eigendecomposition gives
$$ U_1 = V_2 $$
therefore
$$ V_2^TU_1 = I $$
Proof. According to the weight-alignment theorem, for $\sigma_1^k$ and $\sigma_2^k$ with the same index, the corresponding singular vector pairs $v_2^k$ and $u_1^k$ become aligned, i.e., ${v_2^{k'}}^Tu_1^k\rightarrow\delta_{k,k'}$. Therefore the singular value dynamics simplify to
$$ \dot{\sigma}_1^k \rightarrow -\sigma_2^k({u_2^k}^TGv_1^k), \qquad \dot{\sigma}_2^k \rightarrow -\sigma_1^k({u_2^k}^TGv_1^k) $$
Inserting $G = -W_2W_1X$ and using the alignment, we derive
$$ \dot{\sigma}_1^k \rightarrow \sigma_1^k(\sigma_2^k)^2({v_1^k}^TXv_1^k), \qquad \dot{\sigma}_2^k \rightarrow \sigma_2^k(\sigma_1^k)^2({v_1^k}^TXv_1^k) $$



References
[he2016resnet] Adrien Bardes, J. Ponce, Y. LeCun. (2021). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ArXiv.
[Caron2018DeepCF] Mathilde Caron, Piotr Bojanowski, Armand Joulin, M. Douze. (2018). Deep Clustering for Unsupervised Learning of Visual Features. ECCV.
[Hua2021OnFD] Tianyu Hua, Wenxiao Wang, Zihui Xue, Yue Wang, Sucheng Ren, Hang Zhao. (2021). On Feature Decorrelation in Self-Supervised Learning. ArXiv.
[Ermolov2020WhiteningFS] Aleksandr Ermolov, Aliaksandr Siarohin, E. Sangineto, N. Sebe. (2020). Whitening for Self-Supervised Representation Learning. ArXiv.
[Oord2018RepresentationLW] A{. (2018). Representation Learning with Contrastive Predictive Coding. ArXiv.
[Tian2021UnderstandingSL] Yuandong Tian, Xinlei Chen, S. Ganguli. (2021). Understanding self-supervised Learning Dynamics without Contrastive Pairs. ArXiv.
[Chuang2020DebiasedCL] Ching-Yao Chuang, J. Robinson, Yen-Chen Lin, A. Torralba, S. Jegelka. (2020). Debiased Contrastive Learning. ArXiv.
[Arora2019ATA] Sanjeev Arora, H. Khandeparkar, M. Khodak, Orestis Plevrakis, Nikunj Saunshi. (2019). A Theoretical Analysis of Contrastive Unsupervised Representation Learning. ICML.
[Lee2020PredictingWY] J. Lee, Qi Lei, Nikunj Saunshi, Jiacheng Zhuo. (2020). Predicting What You Already Know Helps: Provable Self-Supervised Learning. ArXiv.
[Tosh2021ContrastiveLM] Christopher Tosh, A. Krishnamurthy, Daniel J. Hsu. (2021). Contrastive learning, multi-view redundancy, and linear models. ArXiv.
[Barrett2021ImplicitGR] D. Barrett, B. Dherin. (2021). Implicit Gradient Regularization. ArXiv.
[Jing2020ImplicitRA] L. Jing, J. Zbontar, Y. LeCun. (2020). Implicit Rank-Minimizing Autoencoder. ArXiv.
[Saxe2019AMT] Andrew M. Saxe, James L. McClelland, S. Ganguli. (2019). A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences.
[Soudry2018TheIB] Daniel Soudry, E. Hoffer, Suriya Gunasekar, Nathan Srebro. (2018). The Implicit Bias of Gradient Descent on Separable Data. ArXiv.
[Gidel2019ImplicitRO] Gauthier Gidel, F. Bach, S. Lacoste-Julien. (2019). Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks. NeurIPS.
[Arora2019ImplicitRI] Sanjeev Arora, Nadav Cohen, W. Hu, Yuping Luo. (2019). Implicit Regularization in Deep Matrix Factorization. NeurIPS.
[Gunasekar2018ImplicitRI] Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, Nathan Srebro. (2018). Implicit Regularization in Matrix Factorization. 2018 Information Theory and Applications Workshop (ITA).
[Gunasekar2018ImplicitBO] Suriya Gunasekar, Jason D. Lee, Daniel Soudry, Nathan Srebro. (2018). Implicit Bias of Gradient Descent on Linear Convolutional Networks. NeurIPS.
[Ji2019GradientDA] Ziwei Ji, Matus Telgarsky. (2019). Gradient descent aligns the layers of deep linear networks. ArXiv.
[Ji2018RiskAP] Ziwei Ji, Matus Telgarsky. (2018). Risk and parameter convergence of logistic regression. ArXiv.
[tian2020understanding] Tian, Yuandong, Yu, Lantao, Chen, Xinlei, Ganguli, Surya. (2020). Understanding self-supervised learning with dual deep networks. arXiv preprint arXiv:2010.00578.
[Dwibedi2021WithAL] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman. (2021). With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations. ArXiv.
[He2016DeepRL] Kaiming He, X. Zhang, Shaoqing Ren, Jian Sun. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Chen2020ImprovedBW] Xinlei Chen, Haoqi Fan, Ross B. Girshick, Kaiming He. (2020). Improved Baselines with Momentum Contrastive Learning. ArXiv.
[He2020MomentumCF] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross B. Girshick. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[Li2021PrototypicalCL] Junnan Li, Pan Zhou, Caiming Xiong, R. Socher, S. Hoi. (2021). Prototypical Contrastive Learning of Unsupervised Representations. ArXiv.
[Misra2020SelfSupervisedLO] Ishan Misra, L. V. D. Maaten. (2020). Self-Supervised Learning of Pretext-Invariant Representations. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[HaoChen2021ProvableGF] Jeff Z. HaoChen, Colin Wei, Adrien Gaidon, Tengyu Ma. (2021). Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss. ArXiv.
[Neyshabur2019TowardsUT] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Y. LeCun, Nathan Srebro. (2019). Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks. ArXiv.
[AllenZhu2019LearningAG] Zeyuan Allen-Zhu, Yuanzhi Li, Yingyu Liang. (2019). Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. ArXiv.
[Assran2021SemiSupervisedLO] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, Michael G. Rabbat. (2021). Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples. ArXiv.
[Caron2021EmergingPI] Mathilde Caron, Hugo Touvron, Ishan Misra, Herv'e J'egou, J. Mairal, Piotr Bojanowski, Armand Joulin. (2021). Emerging Properties in Self-Supervised Vision Transformers. ArXiv.
[Frankle2019TheLT] Jonathan Frankle, Michael Carbin. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. arXiv: Learning.
[Radhakrishnan2020OnAI] Adityanarayanan Radhakrishnan, Eshaan Nichani, D. Bernstein, Caroline Uhler. (2020). On Alignment in Deep Linear Neural Networks. arXiv: Learning.
[Casado2019CheapOC] Mario Lezcano Casado, David Mart{'i. (2019). Cheap Orthogonal Constraints in Neural Networks: A Simple Parametrization of the Orthogonal and Unitary Group. ArXiv.
[bib1] Arora et al. (2019a) Sanjeev Arora, Nadav Cohen, W. Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. In NeurIPS, 2019a.
[bib2] Arora et al. (2019b) Sanjeev Arora, H. Khandeparkar, M. Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019b.
[bib3] Assran et al. (2021) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Armand Joulin, Nicolas Ballas, and Michael G. Rabbat. Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. ArXiv, abs/2104.13963, 2021.
[bib4] Bardes et al. (2021) Adrien Bardes, J. Ponce, and Y. LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. ArXiv, abs/2105.04906, 2021.
[bib5] D. Barrett and B. Dherin. Implicit gradient regularization. ArXiv, abs/2009.11162, 2021.
[bib6] Caron et al. (2018) Mathilde Caron, Piotr Bojanowski, Armand Joulin, and M. Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
[bib7] Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[bib8] Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Herv’e J’egou, J. Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. ArXiv, abs/2104.14294, 2021.
[bib9] Mario Lezcano Casado and David Martínez-Rubio. Cheap orthogonal constraints in neural networks: A simple parametrization of the orthogonal and unitary group. ArXiv, abs/1901.08428, 2019.
[bib10] Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. 2020a.
[bib11] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2020.
[bib12] Chen et al. (2020b) Xinlei Chen, Haoqi Fan, Ross B. Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. ArXiv, abs/2003.04297, 2020b.
[bib13] Dwibedi et al. (2021) Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. ArXiv, abs/2104.14548, 2021.
[bib14] Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
[bib15] Gunasekar et al. (2018) Suriya Gunasekar, Blake E. Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Implicit regularization in matrix factorization. 2018 Information Theory and Applications Workshop (ITA), pp. 1–10, 2018.
[bib16] HaoChen et al. (2021) Jeff Z. HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. ArXiv, abs/2106.04156, 2021.
[bib17] He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[bib18] He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726–9735, 2020.
[bib19] Hua et al. (2021) Tianyu Hua, Wenxiao Wang, Zihui Xue, Yue Wang, Sucheng Ren, and Hang Zhao. On feature decorrelation in self-supervised learning. ArXiv, abs/2105.00470, 2021.
[bib20] Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. ArXiv, abs/1810.02032, 2019.
[bib21] Jing et al. (2020) L. Jing, J. Zbontar, and Y. LeCun. Implicit rank-minimizing autoencoder. ArXiv, abs/2010.00679, 2020.
[bib22] Lee et al. (2020) J. Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. ArXiv, abs/2008.01064, 2020.
[bib23] Li et al. (2021) Junnan Li, Pan Zhou, Caiming Xiong, Richard Socher, and Steven C. H. Hoi. Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966, 2021.
[bib24] Misra & van der Maaten (2020) Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020.
[bib26] Neyshabur et al. (2019) Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2019.
[bib27] Radhakrishnan et al. (2020) Adityanarayanan Radhakrishnan, Eshaan Nichani, D. Bernstein, and Caroline Uhler. On alignment in deep linear neural networks. arXiv preprint, 2020.
[bib28] Saxe et al. (2019) Andrew M. Saxe, James L. McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116:11537–11546, 2019.
[bib29] Soudry et al. (2018) Daniel Soudry, Elad Hoffer, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2018.
[bib30] Tian et al. (2020) Yuandong Tian, Lantao Yu, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning with dual deep networks. arXiv preprint arXiv:2010.00578, 2020.
[bib31] Tian et al. (2021) Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. arXiv preprint arXiv:2102.06810, 2021.
[bib32] Tosh et al. (2021) Christopher Tosh, Akshay Krishnamurthy, and Daniel J. Hsu. Contrastive learning, multi-view redundancy, and linear models. arXiv preprint arXiv:2008.10150, 2021.
[bib33] van den Oord et al. (2018) Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[bib34] Zbontar et al. (2021) Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.