An Information-Theoretic Perspective on Variance-Invariance-Covariance Regularization

% \name Randall Balestriero \addr Meta AI, FAIR \email, % \name Kenji Kawaguchi \addr National University of Singapore \email, % \name Tim G. J. Rudner \addr New York University \email, % \name Yann LeCun \addr New York University & Meta AI, FAIR \email %

Abstract

In this paper, we provide an information-theoretic perspective on Variance-Invariance-Covariance Regularization (VICReg) for self-supervised learning. To do so, we first demonstrate how information-theoretic quantities can be obtained for deterministic networks as an alternative to the commonly used unrealistic stochastic networks assumption. Next, we relate the VICReg objective to mutual information maximization and use it to highlight the underlying assumptions of the objective. Based on this relationship, we derive a generalization bound for VICReg, providing generalization guarantees for downstream supervised learning tasks and present new self-supervised learning methods, derived from a mutual information maximization objective, that outperform existing methods in terms of performance. This work provides a new information-theoretic perspective on self-supervised learning and Variance-Invariance-Covariance Regularization in particular and guides the way for improved transfer learning via information-theoretic self-supervised learning objectives.

Introduction

Self-supervised learning (SSL) is a promising approach to extracting meaningful representations by optimizing a surrogate objective between inputs and self-generated signals. For example, VarianceInvariance-Covariance Regularization (VICReg) [7], a widely-used SSL algorithm employing a de-correlation mechanism, circumvents learning trivial solutions by applying variance and covariance regularization.

Once the surrogate objective is optimized, the pre-trained model can be used as a feature extractor for a variety of downstream supervised tasks such as image classification, object detection, instance segmentation, or pose estimation [15, 16, 49, 68]. Despite the promising results demonstrated by SSL methods, the theoretical underpinnings explaining their efficacy continue to be the subject of investigation [5, 43].

Information theory has proved a useful tool for improving our understanding of deep neural networks (DNNs), having a significant impact on both applications in representation learning [3] and theoretical explorations [61, 73]. However, applications of information-theoretic principles to SSL have made unrealistic assumptions, making many existing information-theoretic approaches to SSL of limited use. One such assumption is to assume that the DNN to be optimized is stochastic-an assumption that is violated for the vast majority DNNs used in practice. For a comprehensive review on this topic, refer to the work by Shwartz-Ziv and LeCun [63].

∗ Correspondence to: ravid.shwartz.ziv@nyu.edu .

In this paper, we examine Variance-Invariance-Covariance Regularization (VICReg), an SSL method developed for deterministic DNNs, from an information-theoretic perspective. We propose an approach that addresses the challenge of mutual information estimation in deterministic networks by transitioning the randomness from the networks to the input data-a more plausible assumption. This shift allows us to apply an information-theoretic analysis to deterministic networks. To establish a connection between the VICReg objective and information maximization, we identify and empirically validate the necessary assumptions. Building on this analysis, we describe differences between different SSL algorithms from an information-theoretic perspective and propose a new family of plug-in methods for SSL. This new family of methods leverages existing information estimators and achieves state-of-the-art predictive performance across several benchmarking tasks. Finally, we derive a generalization bound that links information optimization and the VICReg objective to downstream task performance, underscoring the advantages of VICReg.

Our key contributions are summarized as follows:

We introduce a novel approach for studying deterministic deep neural networks from an information-theoretic perspective by shifting the stochasticity from the networks to the inputs using the Data Distribution Hypothesis (Section 3)
We establish a connection between the VICReg objective and information-theoretic optimization, using this relationship to elucidate the underlying assumptions of the objective and compare it to other SSL methods (Section 4).
We propose a family of information-theoretic SSL methods, grounded in our analysis, that achieve state-of-the-art performance (Section 5).
We derive a generalization bound that directly links VICReg to downstream task generalization, further emphasizing its practical advantages over other SSL methods (Section 6).

Background & Preliminaries

We first introduce the necessary technical background for our analysis.

Continuous Piecewise Affine (CPA) Mappings.

A rich class of functions emerges from piecewise polynomials: spline operators. In short, given a partition Ω of a domain R D , a spline of order k is a mapping defined by a polynomial of order k on each region ω ∈ Ω with continuity constraints on the entire domain for the derivatives of order 0 ,. . . , k -1 . As we will focus on affine splines ( k = 1 ), we only define this case for clarity. A K -dimensional affine spline f produces its output via f ( z ) = ∑ ω ∈ Ω ( A ω z + b ω ) ✶ { z ∈ ω } , with input z ∈ R D and A ω ∈ R K × D , b ω ∈ R K , ∀ ω ∈ Ω the per-region slope and offset parameters respectively, with the key constraint that the entire mapping is continuous over the domain f ∈ C 0 ( R D ) .

Deep Neural Networks as CPA Mappings. A deep neural network (DNN) is a (non-linear) operator f Θ with parameters Θ that map a input x ∈ R D to a prediction y ∈ R K . The precise definitions of DNN operators can be found in Goodfellow et al. [27]. To avoid cluttering notation, we will omit Θ unless needed for clarity. For our analysis, we only assume that the non-linearities in the DNN are CPA mappings-as is the case with (leaky-) ReLU, absolute value, and max-pooling operators. The entire input-output mapping then becomes a CPA spline with an implicit partition Ω , the function of the weights and architecture of the network [52, 6]. For smooth nonlinearities, our results hold using a first-order Taylor approximation argument.

Continuous Piecewise Affine (CPA) Mappings.

Deep Neural Networks as CPA Mappings.

Recently, information-theoretic methods have played an essential role in advancing deep learning by developing and applying information-theoretic estimators and learning principles to DNN training [3, 9, 34, 57, 65, 66, 70, 73]. However, information-theoretic objectives for deterministic DNNs often face a common issue: the mutual information between the input and the DNN representation is infinite. This leads to ill-posed optimization problems.

Several strategies have been suggested to address this challenge. One involves using stochastic DNNs with variational bounds, where the output of the deterministic network is used as the parameters of the conditional distribution [44, 62]. Another approach, as suggested by Dubois et al. [22], assumes that the randomness of data augmentation among the two views is the primary source of stochasticity in the network. However, these methods assume that randomness comes from the DNN, contrary to common practice. Other research has presumed a random input but has made no assumptions about the network's representation distribution properties. Instead, it relies on general lower bounds to analyze the objective [72, 76].

Self-Supervised Learning.

SSL is a set of techniques that learn representation functions from unlabeled data, which can then be adapted to various downstream tasks. While supervised learning relies on labeled data, SSL formulates a proxy objective using self-generated signals. The challenge in SSL is to learn useful representations without labels. It aims to avoid trivial solutions where the model maps all inputs to a constant output [11, 36]. To address this, SSL utilizes several strategies. Contrastive methods like SimCLR and its InfoNCE criterion learn representations by distinguishing positive and negative examples [16, 55]. In contrast, non-contrastive methods apply regularization techniques to prevent the collapse [14, 17, 28].

Variance-Invariance-Covariance Regularization (VICReg). A widely used SSL method for training joint embedding architectures [7]. Its loss objective is composed of three terms: the invariance loss, the variance loss, and the covariance loss:

· Invariance loss: The invariance loss is given by the mean-squared Euclidean distance between pairs of embedding and ensures consistency between the representation of the original and augmented inputs. · Regularization: The regularization term consists of two loss terms: the variance loss -a hinge loss to maintain the standard deviation (over a batch) of each variable of the embedding-and the covariance loss , which penalizes off-diagonal coefficients of the covariance matrix of the embeddings to foster decorrelation among features.

VICReg generates two batches of embeddings, Z = [ f ( x 1 ) , . . . , f ( x B )] and Z ′ = [ f ( x ′ 1 ) , . . . , f ( x ′ B )] , each of size ( B × K ) . Here, x i and x ′ i are two distinct random augmentations of a sample I i . The covariance matrix C ∈ R K × K is obtained from [ Z , Z ′ ] . The VICReg loss can thus be expressed as follows:

Variance-Invariance-Covariance Regularization (VICReg).

Next, we connect the VICReg to our information-theoretic-based objective. The 'invariance term' in Equation (9), which pushes augmentations from the same image closer together, is the same term used in the VICReg objective. However, the computation of the regularization term poses a significant challenge. Entropy estimation is a well-established problem within information theory, with Gaussian mixture densities often used for representation. Yet, the differential entropy of Gaussian mixtures lacks a closed-form solution [56].

A straightforward method for approximating entropy involves capturing the distribution's first two moments, which provides an upper bound on the entropy. However, minimizing an upper bound doesn't necessarily optimize the original objective. Despite reported success from minimizing an upper bound [47, 54], this approach may induce instability during the training process.

Let Σ Z denote the covariance matrix of Z . We utilize the first two moments to approximate the entropy we aim to maximize. Because the invariance term appears in the same form as the original VICReg objective, we will look only at the regularizer. Consequently, we get the approximation

Theorem 2 . Assuming that the eigenvalues of Σ( x i ) and Σ( x ′ i ) , along with the differences between the Gaussian means µ ( x i ) and µ ( x ′ i ) , are bounded, the solution to the maximization problem

involves setting Σ Z to a diagonal matrix.

According to Theorem 2, we can maximize Equation (10) by diagonalizing the covariance matrix and increasing its diagonal elements. This goal can be achieved by minimizing the off-diagonal elements of Σ Z -the covariance criterion of VICReg-and by maximizing the sum of the log of its diagonal elements. While this approach is straightforward and efficient, it does have a drawback: the diagonal values could tend towards zero, potentially causing instability during logarithm computations. A solution to this issue is to use an upper bound and directly compute the sum of the diagonal elements, resulting in the variance term of VICReg. This establishes the link between our information-theoretic objective and the three key components of VICReg.

Deep Neural Networks and Information Theory

Self-Supervised Learning in DNNs: An Information-Theoretic Perspective

To analyze information within deterministic networks, we first need to establish an informationtheoretic perspective on SSL (Section 3.1). Subsequently, we utilize the Data Distribution Hypothesis (Section 3.2) to demonstrate its applicability to deterministic SSL networks.

Self-Supervised Learning from an Information-Theoretic Viewpoint

Our discussion begins with the MultiView InfoMax principle , which aims to maximize the mutual information I ( Z ; X ′ ) between a view X ′ and the second representation Z . As demonstrated in Federici et al. [24], we can optimize this information by employing the following lower bound:

Here, H ( Z ) represents the entropy of Z . In supervised learning, the labels Y remain fixed, making the entropy term H ( Y ) a constant. Consequently, the optimization is solely focused on the log-loss, E x ′ [log q ( z | x ′ )] , which could be either cross-entropy or square loss.

However, for joint embedding networks, a degenerate solution can emerge, where all outputs 'collapse' into an undesired value [16]. Upon examining Equation (2), we observe that the entropies are not fixed and can be optimized. As a result, minimizing the log loss alone can lead the representations to collapse into a trivial solution and must be regularized.

Understanding the Data Distribution Hypothesis

Previously, we mentioned that a naive analysis might suggest that the information in deterministic DNNs is infinite. To address this point, we investigate whether assuming a dataset is a mixture of Gaussians with non-overlapping support can provide a manageable distribution over the neural network outputs. This assumption is less restrictive compared to assuming that the neural network itself is stochastic, as it concerns the generative process of the data, not the model and training process. For a detailed discussion about the limitations of assuming stochastic networks and a comparison between stochastic networks vs stochastic input, see Appendix N. In Section 4.2, we verify that this assumption holds for real-world datasets.

The so-called manifold hypothesis allows us to treat any point as a Gaussian random variable with a low-rank covariance matrix aligned with the data's manifold tangent space [25], which enables us to examine the conditioning of a latent representation with respect to the mean of the observation, i.e., X | x ∗ ∼ N ( x ; x ∗ , Σ x ∗ ) . Here, the eigenvectors of Σ x ∗ align with the tangent space of the data manifold at x ∗ , which varies with the position of x ∗ in space. In this setting, a dataset is considered a collection of distinct points x ∗ n , n = 1 , ..., N , and the full data distribution is expressed as a sum of Gaussian densities with low-rank covariance, defined as:

Here, T is a uniform Categorical random variable. For simplicity, we assume that the effective support of N ( x ∗ i , Σ x ∗ i ) do not overlap (for empirical validation of this assumption see Section 4.2). The effective support is defined as { x ∈ R D : p ( x ) > ϵ } . We can then approximate the density function as follows:

where N ( x ; ., . ) is the Gaussian density at x and n ( x ) = arg min n ( x -x ∗ n ) T Σ x ∗ n ( x -x ∗ n ) .

Data Distribution Under the Deep Neural Network Transformation

Let us consider an affine spline operator f , as illustrated in Section 2.1, which maps a space of dimension D to a space of dimension K , where K ≥ D . The image or the span of this mapping is expressed as follows:

In this equation, Aff ( ω ; A ω , b ω ) = { A ω x + b ω : x ∈ ω } denotes the affine transformation of region ω by the per-region parameters A ω , b ω . Ω denotes the partition of the input space where x resides. To practically compute the per-region affine mapping, we set A ω to the Jacobian matrix of the network at the corresponding input x , and b to be defined as f ( x ) -A ω x . Therefore, the DNN mapping composed of affine transformations on each input space partition region ω ∈ Ω based on the coordinate change induced by A ω and the shift induced by b ω .

When the input space is associated with a density distribution, this density is transformed by the mapping f . Calculating the density of f ( X ) is generally intractable. However, under the disjoint support assumption from Section 3.2, we can arbitrarily increase the density's representation power by raising the number of prototypes N . As a result, each Gaussian's support is contained within the region ω where its means lie, leading to the following theorem:

Theorem 1 . Given the setting of Equation (4), the unconditional DNN output density, Z , can be approximated as a mixture of the affinely transformed distributions x | x ∗ n ( x ) :

where ω ( x ∗ n ) = ω ∈ Ω ⇐⇒ x ∗ n ∈ ω is the partition region in which the prototype x ∗ n lives in. Proof. See Appendix B.

In other words, Theorem 1 implies that when the input noise is small, we can simplify the conditional output density to a single Gaussian: ( Z ′ | X ′ = x n ) ∼ N ( µ ( x n ) , Σ( x n )) , where µ ( x n ) = A ω ( x n ) x n + b ω ( x n ) and Σ( x n ) = A T ω ( x n ) Σ x n A ω ( x n ) .

Figure 1: Left: The network output for SSL training is more Gaussian for small input noise . The p -value of the normality test for different SSL models trained on ImageNet for different input noise levels. The dashed line represents the point at which the null hypothesis (Gaussian distribution) can be rejected with 99% confidence. Right: The Gaussians around each point are not overlapping. The plots show the l 2 distances between raw images for different datasets. As can be seen, the distances are largest for more complex real-world datasets.

Information Optimization and the VICReg Optimization Objective

Building on our earlier discussion, we used the Data Distribution Hypothesis to model the conditional output in deterministic networks as a Gaussian mixture. This allowed us to frame the SSL training objective as maximizing the mutual information, I ( Z ; X ′ ) and I ( Z ′ ; X ) .

However, in general, this mutual information is intractable. Therefore, we will use our derivation for the network's representation to obtain a tractable variational approximation using the expected loss, which we can optimize.

The computation of expected loss requires us to marginalize the stochasticity in the output. We can employ maximum likelihood estimation with a Gaussian observation model. For computing the expected loss over x ′ samples, we must marginalize the stochasticity in Z ′ . This procedure implies that the conditional decoder adheres to a Gaussian distribution: ( Z | X ′ = x n ) ∼ N ( µ ( x n ) , I +Σ( x n )) .

However, calculating the expected log loss over samples of Z is challenging. We thus focus on a lower bound - the expected log loss over Z ′ samples. Utilizing Jensen's inequality, we derive the following lower bound:

Taking the expectation over Z , we get

The full derivations are presented in Appendix A. To optimize this objective in practice, we approximate p ( x, x ′ ) using the empirical data distribution

Variance-Invariance-Covariance Regularization: An Information-Theoretic Perspective

involves setting Σ Z to a diagonal matrix.

Empirical Validation of Assumptions About Data Distributions

Validating our theory, we tested if the conditional output density P ( Z | X ) becomes a Gaussian as input noise lessens. We used a ResNet-50 model trained with SimCLR or VICReg objectives on CIFAR-10, CIFAR-100, and ImageNet datasets. We sampled 512 Gaussian samples for each image from the test dataset, examining whether each sample remained Gaussian in the DNN's penultimate layer. We applied the D'Agostino and Pearson's test to ascertain the validity of this assumption [19].

Figure 1 (left) displays the p -value as a function of the normalized std. For low noise levels, we reject that the network's conditional output density is non-Gaussian with an 85% probability when using VICReg. However, the network output deviates from Gaussian as the input noise increases.

Next, we verified our assumption of non-overlapping effective support in the model's data distribution. We calculate the distribution of pairwise l 2 distances between images across seven datasets: MNIST [42], CIFAR10, CIFAR100 [41], Flowers102 [53], Food101 [12], and FGVAircaft [46]. Figure 1 (right) reveals that the pairwise distances are far from zero, even for raw pixels. This implies that we can use a small Gaussian around each point without overlap, validating our assumption as realistic.

Self-Supervised Learning Models through Information Maximization

The practical application of Equation (8) involves several key 'design choices'. We begin by comparing how existing SSL models have implemented it, investigating the estimators used, and discussing the implications of their assumptions. Subsequently, we introduce new methods for SSL that incorporate sophisticated estimators from the field of information theory, which outperform current approaches.

VICReg vs. SimCLR

In order to evaluate their underlying assumptions and strategies for information maximization, we compare VICReg to contrastive SSL methods such as SimCLR along with non-contrastive methods like BYOL and SimSiam.

Contrastive Learning with SimCLR. In their work, Lee et al. [44] drew a connection between the SimCLR objective and the variational bound on information regarding representations by employing the von Mises-Fisher distribution. By applying our analysis for information in deterministic networks with their work, we compare the main differences between SimCLR and VICReg:

(i) Conditional distribution: SimCLR assumes a von Mises-Fisher distribution for the encoder, while VICReg assumes a Gaussian distribution. (ii) Entropy estimation: SimCLR approximated it based on the finite sum of the input samples. In contrast, VICReg estimates the entropy of Z solely based on the second moment. Developing SSL methods that integrate these two distinctions form an intriguing direction for future research.

Empirical comparison. We trained ResNet18 on CIFAR-10 for VICReg, SimCLR, and BYOL and compared their entropies directly using the pairwise distances entropy estimator. (For more details, see Appendix K.) This estimator was not directly optimized by any method and was an independent validation. The results (Figure 2), showed that entropy increased for all methods during training, with SimCLR having the lowest and VICReg the highest entropy.

Contrastive Learning with SimCLR.

Empirical comparison.

Family of alternative Entropy Estimators

Next, we suggest integrating the invariance term of current SSL methods with plug-in methods that optimize entropy.

Entropy estimators. The VICReg objective seeks to approximate the log determinant of the empirical covariance matrix through its diagonal terms. As discussed in Section 4.1, this approach has its drawbacks. An alternative is to employ different entropy estimators. The LogDet Entropy Estimator [75] is one such option, offering a tighter upper bound. This estimator employs the differential entropy of α order with scaled noise and has been previously shown to be a tight estimator for high-dimensional features, proving robust to random noise. However, since this estimator provides an upper bound on entropy, maximizing this bound doesn't guarantee optimization of the original objective. To counteract this, we also introduce a lower bound estimator based on the pairwise distances of individual mixture components [39]. In this family, a function determining pairwise distances between component densities is designated for each member. These estimators are computationally efficient and typically straightforward to optimize. For additional entropy estimators, see Appendix F. Beyond VICReg, these methods can serve as plug-in estimators for numerous SSL algorithms. Apart from VICReg, we also conducted experiments integrating these estimators with the BYOL algorithm.

Figure 2: VICReg has higher Entropy during training. The entropy along the training for different SSL methods. Experiments were conducted with ResNet-18 on CIFAR-10. Error bars represent one standard error over 5 trials.

Table 1: The proposed entropy estimators outperform previous methods. CIFAR-10, CIFAR-100, and Tiny-ImageNet top-1 accuracy under linear evaluation using ResNet-18, ConvNetX and VIT as backbones. Error bars correspond to one standard error over three trials.

Setup. Experiments were conducted on three image datasets: CIFAR-10, CIFAR-100 [40], and Tiny-ImageNet [20]. For CIFAR-10, ResNet-18 [31] was used. In contrast, both ConvNeXt [45] and Vision Transformer [21] were used for CIFAR-100 and Tiny-ImageNet. For comparison, we examined the following SSL methods: VICReg, SimCLR, BYOL, SwAV [14], Barlow Twins [74], and MoCo [33]. The quality of representation was assessed through linear evaluation. A detailed description of different methods can be found in Appendix H.

Results. As evidenced by Table 1, the proposed entropy estimators surpass the original SSL methods. Using a more precise entropy estimator enhances the performance of both VICReg and BYOL, compared to their initial implementations. Notably, the pairwise distance estimator, being a lower bound, achieves superior results, resonating with the theoretical preference for maximizing a true entropy's lower bound. Our findings suggest that the astute choice of entropy estimators, guided by our framework, paves the way for enhanced performance.

Entropy estimators.

Entropy estimation is one of the classical problems in information theory, where Gaussian mixture density is one of the most popular representations. With a sufficient number of components, they can approximate any smooth function with arbitrary accuracy. For Gaussian mixtures, there is, however, no closed-form solution to differential entropy. There exist several approximations in the literature, including loose upper and lower bounds [35]. Monte Carlo (MC) sampling is one way to approximate Gaussian mixture entropy. With sufficient MC samples, an unbiased estimate of entropy with an arbitrarily accurate can be obtained. Unfortunately, MC sampling is very computationally expensive and typically requires a large number of samples, especially in high dimensions [13]. Using the first two moments of the empirical distribution, VIGCreg used one of the most straightforward approaches for approximating the entropy. Despite this, previous studies have found that this method is a poor approximation of the entropy in many cases [35]. Another option is to use the LogDet function. Several estimators have been proposed to implement it, including uniformly minimum variance unbiased (UMVU) [2], and Bayesian methods [50]. These methods, however, often require complex optimizations. The LogDet estimator presented in [75] used the differential entropy α order entropy using scaled noise. They demonstrated that it can be applied to high-dimensional features and is robust to random noise. Based on Taylor-series expansions, [35] presented a lower bound for the entropy of Gaussian mixture random vectors. They use Taylor-series expansions of the logarithm of each Gaussian mixture component to get an analytical evaluation of the entropy measure. In addition, they present a technique for splitting Gaussian densities to avoid components with high variance, which would require computationally expensive calculations. Kolchinsky and Tracey [39] introduce a novel family of estimators for the mixture entropy. For this family, a pairwise-distance function between component densities is defined for each member. These estimators are computationally efficient as long as the pairwise-distance function and the entropy of each component distribution are easy to compute. Moreover, the estimator is continuous and smooth and is therefore useful for optimization problems. In addition, they presented both a lower bound (using Chernoff distance) and an upper bound (using the KL divergence) on the entropy, which are exact when the component distributions are grouped into well-separated clusters,

Setup.

Results.

A Generalization Bound for Downstream Tasks

In earlier sections, we linked information theory principles with the VICReg objective. Now, we aim to extend this link to downstream generalization via a generalization bound. This connection further aligns VICReg's generalization with information maximization and implicit regularization.

Notation. Consider input points x , outputs y ∈ R r , labeled training data S = (( x i , y i )) n i =1 of size n and unlabeled training data ¯ S = (( x + i , x ++ i )) i m =1 of size m , where x + i and x ++ i share the same (unknown) label. With the unlabeled training data, we define the invariance loss

where f θ is the trained representation on the unlabeled data ¯ S . We define a labeled loss ℓ x,y ( w ) = ∥ Wf θ ( x ) -y ∥ where w = vec[ W ] ∈ R dr is the vectorization of the matrix W ∈ R r × d . Let w S = vec[ W S ] be the minimum norm solution as W S = minimize W ′ ∥ W ′ ∥ F such that

We also define the representation matrices

We define the label matrix Y S = [ y 1 , . . . , y n ] ⊤ ∈ R n × r and the unknown label matrix Y ¯ S = [ y + 1 , . . . , y + m ] ⊤ ∈ R m × r , where y + i is the unknown label of x + i . Let F be a hypothesis space of f θ . For a given hypothesis space F , we define the normalized Rademacher complexity

where ξ 1 , . . . , ξ m are independent uniform random variables in {-1 , 1 } . It is normalized such that ˜ R m ( F ) = O (1) as m →∞ .

Notation.

In our paper, we proposed novel methods for SSL premised on information maximization. Although our methods demonstrated superior performance on some datasets, computational constraints precluded us from testing them on larger datasets. Furthermore, our study hinges on certain assumptions that, despite rigorous validation efforts, may not hold universally. While we strive for meticulous testing and validation, it's crucial to note that some assumptions might not be applicable in all scenarios or conditions. These limitations should be taken into account when interpreting our study's results.

A *{-1pt

Comparison of Generalization Bounds

The SimCLR generalization bound [59] requires the number of labeled classes to go infinity to close the generalization gap, whereas the VICReg bound in Theorem 3 does not require the number of label classes to approach infinity for the generalization gap to go to zero. This reflects that, unlike SimCLR, VICReg does not use negative pairs and thus does not use a loss function based on the implicit expectation that the labels of a negative pair are different. Another difference is that our VICReg bound improves as n increases, while the previous bound of SimCLR [59] does not depend on n . This is because Saunshi et al. [59] assumes partial access to the true distribution per class for setting, which removes the importance of labeled data size n and is not assumed in our study.

Consequently, the generalization bound in Theorem 3 provides a new insight for VICReg regarding the ratio of the effects of m v.s. n through G √ ln(1 /δ ) /m + √ ln(1 /δ ) /n . Finally, Theorem 3 also illuminates the advantages of VICReg over standard supervised training. That is, with standard training, the generalization bound via the Rademacher complexity requires the complexities of hypothesis spaces, ˜ R n ( W ) / √ n and ˜ R n ( F ) / √ n , with respect to the size of labeled data n , instead of the size of unlabeled data m . Thus, Theorem 3 shows that using SSL, we can replace the complexities of hypothesis spaces in terms of n with those in terms of m . Since the number of unlabeled data points is typically much larger than the number of labeled data points, this illuminates the benefit of SSL. Our bound is different from the recent information bottleneck bound [38] in that both our proof and bound do not rely on information bottleneck.

Understanding Theorem 2 via Mutual Information Maximization

Theorem 3, together with the result of the previous section, shows that, for generalization in the downstream task, it is helpful to maximize the mutual information I ( Z ; X ′ ) in SSL via minimizing the invariance loss I ¯ S ( f θ ) while controlling the covariance Z ¯ S Z ¯ S ⊤ . The term 2 ˜ R m ( F ) / √ m captures the importance of controlling the complexity of the representations f θ . To understand this term further, let us consider a discretization of the parameter space of F to have finite |F| < ∞ . Then, by Massart's Finite Class Lemma, we have that ˜ R m ( F ) ≤ C √ ln |F| for some constant C > 0 . Moreover, Shwartz-Ziv [61] shows that we can approximate ln |F| by 2 I ( Z ; X ) . Thus, in Theorem 3, the term I ¯ S ( f θ ) + 2 √ m ∥ P Z ¯ S Y ¯ S ∥ F + 1 √ n ∥ P Z S Y S ∥ F corresponds to I ( Z ; X ′ ) which we want to maximize while compressing the term of 2 ˜ R m ( F ) / √ m which corresponds to I ( Z ; X ) [23, 64, 67].

Although we can explicitly add regularization on the information to control 2 ˜ R m ( F ) / √ m , it is possible that I ( Z ; X | X ′ ) and 2 ˜ R m ( F ) / √ m are implicitly regularized via implicit bias through design choises [29, 69, 30]. Thus, Theorem 3 connects the information-theoretic understanding of VICReg with the probabilistic guarantee on downstream generalization.

Limitations

Conclusions

We analyzed the Variance-Invariance-Covariance Regularization for self-supervised learning through an information-theoretic lens. By transferring the stochasticity required for an information-theoretic analysis to the input distribution, we showed how the VICReg objective can be derived from information-theoretic principles, used this perspective to highlight assumptions implicit in the VICReg objective, derived a VICReg generalization bound for downstream tasks, and related it to information maximization.

Building on these findings, we introduced a new VICReg-inspired SSL objective. Our probabilistic guarantee suggests that VICReg can be further improved for the settings of partial label information by aligning the covariance matrix with the partially observable label matrix, which opens up several avenues for future work, including the design of improved estimators for information-theoretic quantities and investigations into the suitability of different SSL methods for specific data characteristics.

This appendix is organized as follows:

texorpdfstring{Lower bounds on $ EE_{x^ prime

Data Distribution after Deep Network Transformation

Theorem 4 . Given the setting of Equation (4), the unconditional DNN output density denoted as Z approximates (given the truncation of the Gaussian on its effective support that is included within a single region ω of the DN's input space partition) a mixture of the affinely transformed distributions x | x ∗ n ( x ) e.g. for the Gaussian case

where ω ( x ∗ n ) = ω ∈ Ω ⇐⇒ x ∗ n ∈ ω is the partition region in which the prototype x ∗ n lives in.

Proof. We know that If ∫ ω p ( x | x ∗ n ( x ) ) d x ≈ 1 then f is linear within the effective support of p . Therefore, any sample from p will almost surely lie within a single region ω ∈ Ω , and therefore the entire mapping can be considered linear with respect to p . Thus, the output distribution is a linear transformation of the input distribution based on the per-region affine mapping.

Additional Empirical Validation

To validate empirically Theorem 2, we checke empirically if the optimal solution for

is a diagonal matrix. We trained VICReg on ResNet18 on CIFAR-10 and did random perturbations (with different scales) for Σ Z . Then, for each perturbation, we calculated the average distance of this perturbed matrix from a diagonal matrix and the actual value of the term

. In Figure 3, we plot the difference from the optimal value of this term as a function of the distance from the diagonal matrix. As we can see, we get an optimal solution where we are close to the diagonal matrix. This observation gives us an empirical validation of Theorem 2.

Figure 3: The optimal solution for the optimization problem is a diagonal matrix. The average distance from a diagonal matrix for different perturbation scales. Experiments were conducted on CIFAR-10 with the ResNet-18 network.

Figure 4: Evolution of GMM training when enforcing a one-to-one mapping between the data and centroids akin to K-means i.e. using a small and fixed covariance matrix. We see that collapse does not occur. Left - In the presence of fixed input samples, we observe that there is no collapsing and that the entropy of the centers is high. Right - when we make the input samples trainable and optimize their location, all the points collapse into a single point, resulting in a sharp decrease in entropy.

EM and GMM

Let us examine a toy dataset on the pattern of two intertwining moons to illustrate the collapse phenomenon under GMM (Figure 1, right). We begin by training a classical GMM with maximum likelihood, where the means are initialized based on random samples, and the covariance is used as the identity matrix. A red dot represents the Gaussian's mean after training, while a blue dot represents the data points. In the presence of fixed input samples, we observe that there is no collapsing and that the entropy of the centers is high (Figure 4, left). However, when we make the input samples trainable and optimize their location, all the points collapse into a single point, resulting in a sharp decrease in entropy (Figure 4, right).

SimCLR

Entropy Estimators

Empirical validation of our assumption

Known Lemmas

We use the following well-known theorems as lemmas in our proofs. We put these below for completeness. These are classical results and not our results.

Lemma G.1. (Hoeffding's inequality) Let X 1 , ..., X n be independent random variables such that a ≤ X i ≤ b almost surely. Consider the average of these random variables, S n = 1 n ( X 1 + · · · + X n ) . Then, for all t > 0 ,

Proof. By using Hoeffding's inequality, we have that for all t > 0 ,

and and

Setting δ = exp ( -2 nt 2 ( b -a ) 2 ) and solving for t > 0 ,

It has been shown that generalization bounds can be obtained via Rademacher complexity [8, 51, 60]. The following is a trivial modification of [51, Theorem 3.1] for a one-sided bound on the nonnegative general loss functions:

Lemma G.2. Let G be a set of functions with the codomain [0 , M ] . Then, for any δ > 0 , with probability at least 1 -δ over an i.i.d. draw of m samples S = ( q i ) i m =1 , the following holds for all ψ ∈ G :

where R m ( G ) := E S,ξ [sup ψ ∈G 1 m ∑ m i =1 ξ i ψ ( q i )] and ξ 1 , . . . , ξ m are independent uniform random variables taking values in {-1 , 1 } .

Proof. Let S = ( q i ) i m =1 and S ′ = ( q ′ i ) i m =1 . Define

To apply McDiarmid's inequality to φ ( S ) , we compute an upper bound on | φ ( S ) -φ ( S ′ ) | where S and S ′ be two test datasets differing by exactly one point of an arbitrary index i 0 ; i.e., S i = S ′ i for all i = i 0 and S i 0 = S ′ i 0 . Then,

Similarly, φ ( S ) -φ ( S ′ ) ≤ M m . Thus, by McDiarmid's inequality, for any δ > 0 , with probability at least 1 -δ ,

where the first line follows the definitions of each term, the second line uses Jensen's inequality and the convexity of the supremum, and the third line follows that for each ξ i ∈ {-1 , +1 } , the distribution of each term ξ i ( ℓ ( f ( x ′ i ) , y ′ i ) -ℓ ( f ( x i ) , y i )) is the distribution of ( ℓ ( f ( x ′ i ) , y ′ i ) -ℓ ( f ( x i ) , y i )) since S and S ′ are drawn iid with the same distribution. The fourth line uses the subadditivity of supremum.

Moreover,

Implentation Details for Maximizing Entropy Estimators

In this section, we will provide more details on the implantation of the experiments conducted in Section 5.2.

Setup Our experiments are conducted on CIFAR-10 [41]. We use ResNet-18 [32] as our backbone.

Training Procedure : The experimental process is organized into two sequential stages: unsupervised pretraining followed by linear evaluation. Initially, the unsupervised pretraining phase is executed, during which the encoder network is trained. Upon its completion, we transition to the linear evaluation phase, which serves as an assessment tool for the quality of the representation produced by the pretrained encoder.

Once the pretraining phase is concluded, we adhere to the fine-tuning procedures used in established baseline methods, as described by [14].

During the linear evaluation stage, we start by performing supervised training of the linear classifier. This is achieved by using the representations derived from the encoder network while keeping the network's coefficients frozen, and applying the same training dataset. Subsequently, we measure the test accuracy of the trained linear classifier using a separate validation dataset. This approach allows us to evaluate the performance of our model in a robust and systematic manner.

The training process for each model unfolds over 800 epochs, employing a batch size of 512. We utilize the Stochastic Gradient Descent (SGD) optimizer, characterized by a momentum of 0.9 and a weight decay of 1 e -4 . The learning rate is initiated at 0.5 and is adjusted according to a cosine decay schedule complemented by a linear warmup phase.

During the data augmentation process, two enhanced versions of every input image are generated. This involves cropping each image randomly and resizing it back to the original resolution. The images are then subject to random horizontal flipping, color jittering, grayscale conversion, Gaussian blurring, and polarization for further augmentation.

For the linear evaluation phase, the linear classifier is trained for 100 epochs with a batch size of 256. The SGD optimizer is again employed, this time with a momentum of 0.9 and no weight decay. The learning rate is managed using a cosine decay schedule, starting at 0.2 and reaching a minimum of 2 e -4 .

Expectation Maximization and Collapsing

Assumption 1 . The eigenvalues of Σ( x j ) are in some range a ≤ λ (Σ( x j )) ≤ b

Assumption 2 . The differences between the means of the Gaussians are bounded

Proof. The term µ ( X j ) µ ( X j ) T is an outer product of the mean vector µ ( X j ) , which is a symmetric matrix. The eigenvalues of a symmetric matrix are equal to the squares of the singular values of the original matrix. Since the singular values of a vector are equal to its absolute values, the maximum eigenvalue of µ ( X j ) µ ( X j ) T is equal to the square of the maximum absolute value of µ ( X j ) . By the second assumption, this is at most M .

Lemma J.2. The maximum eigenvalue of -µ Z µ T Z is non-positive and its absolute value is at most M .

Proof. The term -µ Z µ T Z is a negative outer product of the overall mean vector µ Z , which is a symmetric matrix. Its eigenvalues are non-positive and equal to the negative squares of the singular values of µ Z . Since the singular values of a vector are equal to its absolute values, the absolute value of the maximum eigenvalue of -µ Z µ T Z is equal to the square of the maximum absolute value of µ Z , which is also bounded by M by the second assumption.

Proof. Given a Gaussian mixture model where each component Z | x j has mean µ ( X j ) and covariance matrix Σ( x j ) , the mixture can be written as:

where p j are the mixing coefficients. The covariance matrix of the mixture, Σ Z , is then given by:

By Lemmas 1.1, 1.2, and assumptions 1 and 2, the maximum eigenvalues of (Σ( x j ) , µ ( X j ) µ ( X j ) T and µ Z µ T Z . are at most b , M , and M , respectively. Therefore, by Weyl's inequality for the sum of two symmetric matrices, the maximum eigenvalue of Σ Z is at most b + M .

It means that we can bound the sum of the eigenvalues of Σ Z with

Lemma J.4. Let Σ Z be a positive semidefinite matrix of size N × N . Consider the optimization problem given by:

where λ i (Σ Z ) denotes the i -th eigenvalue of Σ Z and c is a constant. The solution to this problem is a diagonal matrix with equal diagonal elements.

such that:

Proof. The determinant of a matrix is the product of its eigenvalues, so the objective function log det(Σ Z ) can be rewritten as ∑ N i =1 log( λ i (Σ Z )) . Our problem is then to maximize this sum under the constraints that the sum of the eigenvalues does not exceed c and that Σ Z is positive semi-definite.

Applying Jensen's inequality to the concave function log( x ) with weights 1 /N , we find that 1 N ∑ N i =1 log( λ i (Σ Z )) ≤ log( 1 N ∑ N i =1 λ i (Σ Z )) . Equality holds if and only if all λ i (Σ Z ) are equal.

Setting λ i (Σ Z ) = x for all i , we see that the constraint ∑ N i =1 λ i (Σ Z ) ≤ c becomes Nx ≤ c , leading to the optimal eigenvalue x = c/N under the constraint.

Since Σ Z is positive semi-definite, it can be diagonalized via an orthogonal transformation without changing the sum of its eigenvalues or its determinant. Therefore, the solution to the problem is a diagonal matrix with all diagonal entries equal to c/N .

This completes the proof.

Proof. The objective function can be decomposed as follows:

In this optimization problem, we are optimizing over Σ Z . The term ∑ i log | Σ( X i ) | is constant with respect to Σ Z , therefore we can focus on maximizing K log | Σ Z | .

As the determinant of a matrix is the product of its eigenvalues, log | Σ Z | is the sum of the logs of the eigenvalues of Σ Z . Thus, maximizing log | Σ Z | corresponds to maximizing the sum of the logarithms of the eigenvalues of Σ Z .

According to Lemma 1.4, when we have a constraint on the sum of the eigenvalues, the solution to the problem of maximizing the sum of the logarithms of the eigenvalues of a positive semidefinite matrix Σ Z is a diagonal matrix with equal diagonal elements.

From Lemma 1.3, we know that the sum of the eigenvalues of Σ Z is bounded by ( b + M ) × K . Therefore, when we maximize K log | Σ Z | under these constraints, the solution will be a diagonal matrix with equal diagonal elements. This completes the proof of the theorem.

A Generalization Bound for Downstream Tasks

We also define the representation matrices

where ξ 1 , . . . , ξ m are independent uniform random variables in {-1 , 1 } . It is normalized such that ˜ R m ( F ) = O (1) as m →∞ .

Additional Notation and details

We start to introduce additional notation and details. We use the notation of x ∈ X for an input and y ∈ Y ⊆ R r for an output. Define p ( y ) = P ( Y = y ) to be the probability of getting label y and ˆ p ( y ) = 1 n ∑ n i =1 ✶ { y i = y } to be the empirical estimate of p ( y ) . Let ζ be an upper bound on the norm of the label as ∥ y ∥ 2 ≤ ζ for all y ∈ Y . Define the minimum norm solution W ¯ S of the unlabeled data as W ¯ S = minimize W ′ ∥ W ′ ∥ F s.t. W ′ ∈ arg min W 1 m ∑ m i =1 ∥ Wf θ ( x + i ) -g ∗ ( x + i ) ∥ 2 . Let κ S be a data-dependent upper bound on the per-sample Euclidian norm loss with the trained model as ∥ W S f θ ( x ) -y ∥ ≤ κ S for all ( x, y ) ∈ X × Y . Similarly, let κ ¯ S be a data-dependent upper bound on the per-sample Euclidian norm loss as ∥ W ¯ S f θ ( x ) -y ∥ ≤ κ ¯ S for all ( x, y ) ∈ X × Y . Define the difference between W S and W ¯ S by c = ∥ W S -W ¯ S ∥ 2 . Let W be a hypothesis space of W such that W ¯ S ∈ W . We denote by ˜ R m ( W◦F ) = 1 √ m E ¯ S,ξ [sup W ∈W ,f ∈F ∑ m i =1 ξ i ∥ g ∗ ( x + i ) -Wf ( x + i ) ∥ ] the normalized Rademacher complexity of the set { x + ↦→ ∥ g ∗ ( x + ) -Wf ( x + ) ∥ : W ∈ W , f ∈ F} . we denote by κ a upper bound on the per-sample Euclidian norm loss as ∥ Wf ( x ) -y ∥ ≤ κ for all ( x, y, W, f ) ∈ X × Y × W × F .

We adopt the following data-generating process model that was used in a previous paper on analyzing contrastive learning [59, 10]. For the labeled data, first, y is drawn from the distribution ρ on Y , and then x is drawn from the conditional distribution D y conditioned on the label y . That is, we have the join distribution D ( x, y ) = D y ( x ) ρ ( y ) with (( x i , y i )) n i =1 ∼ D n . For the unlabeled data,

first, each of the unknown labels y + and y -is drawn from the distritbuion ρ , and then each of the positive examples x + and x ++ is drawn from the conditional distribution D y + while the negative example x -is drawn from the D y -. Unlike the analysis of contrastive learning, we do not require negative samples. Let τ ¯ S be a data-dependent upper bound on the invariance loss with the trained representation as ∥ f θ (¯ x ) -f θ ( x ) ∥ ≤ τ ¯ S for all (¯ x, x ) ∼ D 2 y and y ∈ Y . Let τ be a data-independent upper bound on the invariance loss with the trained representation as ∥ f (¯ x ) -f ( x ) ∥ ≤ τ for all (¯ x, x ) ∼ D 2 y , y ∈ Y , and f ∈ F . For simplicity, we assume that there exists a function g ∗ such that y = g ∗ ( x ) ∈ R r for all ( x, y ) ∈ X × Y . Discarding this assumption adds the average of label noises to the final result, which goes to zero as the sample sizes n and m increase, assuming that the mean of the label noise is zero.

In this paper, we provide an information-theoretic perspective on Variance-Invariance-Covariance Regularization (VICReg) for self-supervised learning. To do so, we first demonstrate how information-theoretic quantities can be obtained for deterministic networks as an alternative to the commonly used unrealistic stochastic networks assumption. Next, we relate the VICReg objective to mutual information maximization and use it to highlight the underlying assumptions of the objective. Based on this relationship, we derive a generalization bound for VICReg, providing generalization guarantees for downstream supervised learning tasks and present new self-supervised learning methods, derived from a mutual information maximization objective, that outperform existing methods in terms of performance. This work provides a new information-theoretic perspective on self-supervised learning and Variance-Invariance-Covariance Regularization in particular and guides the way for improved transfer learning via information-theoretic self-supervised learning objectives.

Self-Supervised Learning (SSL) methods learn representations by optimizing a surrogate objective between inputs and self-defined signals. For example, in SimCLR (Chen et al., 2020), a contrastive loss is used to make representations for different versions of the same image similar while making representations for different images different. These pre-trained representations are then used as a feature extractor for downstream supervised tasks such as image classification, object detection, and transfer learning (Caron et al., 2021; Chen et al., 2020; Misra and Maaten, 2020; Shwartz-Ziv et al., 2022). Despite their success in practice, only a few works (Arora et al., 2019; Lee et al., 2021a) have sought to provide theoretical insights about the effectiveness of SSL.

Recently, information-theoretic methods have played a key role in several advances in deep learning—from practical applications in representation learning (Alemi et al., 2016) to theoretical investigations (Xu and Raginsky, 2017; Steinke and Zakynthinou, 2020; Shwartz-Ziv, 2022). Some works have attempted to use information theory for SSL, such as the InfoMax principle (Linsker, 1988) in SSL (Bachman et al., 2019). However, these works often present objective functions without rigorous justification, make implicit assumptions (Kahana and Hoshen, 2022; Wang et al., 2022; Lee et al., 2021b), and explicitly assume that the deep neural network mappings are stochastic—which is rarely the case for modern neural networks. See Shwartz-Ziv and LeCun (2023) for a detailed review.

This paper presents an information-theoretic perspective on Variance-Invariance-Covariance Regularization (VICReg). We show that the VICReg objective is closely related to approximate mutual information maximization, derive a generalization bound for VICReg, and relate the generalization bound to information maximization. We show that under a series of assumptions about the data, which we validate empirically, our results apply to deterministic deep neural network training and do not require further stochasticity assumptions about the network. To summarize, our key contributions are as follows:

We shift the stochasticity assumption to the deep neural network inputs to study deterministic deep neural networks from an information-theoretic perspective.

We relate the VICReg objective to information-theoretic quantities and use this relationship to highlight the underlying assumptions of the objective.

We study the relationship between the optimization of information-theoretic quantities and predictive performance in downstream tasks by introducing a generalization bound that connects VICReg, information theory, and downstream generalization.

We present information-theoretic SSL methods based on our analysis and empirically validate their performance.

We first introduce the technical background for our analysis.

A rich class of functions emerges from piecewise polynomials: spline operators. In short, given a partition ΩΩ\Omega of a domain ℝDsuperscriptℝ𝐷\mathbb{R}^{D}, a spline of order k𝑘k is a mapping defined by a polynomial of order k𝑘k on each region ω∈Ω𝜔Ω\omega\in\Omega with continuity constraints on the entire domain for the derivatives of order 00,…,k−1𝑘1k-1. As we will focus on affine splines (k=1𝑘1k=1), we define this case only for concreteness. A K𝐾K-dimensional affine spline f𝑓f produces its output via

with input 𝒛∈ℝD𝒛superscriptℝ𝐷\bm{z}\in\mathbb{R}^{D} and 𝑨ω∈ℝK×D,𝒃ω∈ℝK,∀ω∈Ωformulae-sequencesubscript𝑨𝜔superscriptℝ𝐾𝐷formulae-sequencesubscript𝒃𝜔superscriptℝ𝐾for-all𝜔Ω\bm{A}{\omega}\in\mathbb{R}^{K\times D},\bm{b}{\omega}\in\mathbb{R}^{K},\forall\omega\in\Omega the per-region slope and offset parameters respectively, with the key constraint that the entire mapping is continuous over the domain f∈𝒞0(ℝD)𝑓superscript𝒞0superscriptℝ𝐷f\in\mathcal{C}^{0}(\mathbb{R}^{D}). Spline operators and especially affine spline operators have been widely used in function approximation theory (Cheney and Light, 2009), optimal control (Egerstedt and Martin, 2009), statistics (Fantuzzi et al., 2002), and related fields.

A deep neural network (DNN) is a (non-linear) operator fΘsubscript𝑓Θf_{\Theta} with parameters ΘΘ\Theta that map a input 𝒙∈ℝD𝒙superscriptℝ𝐷\bm{x}\in\mathbb{R}^{D} to a prediction 𝒚∈ℝK𝒚superscriptℝ𝐾\bm{y}\in\mathbb{R}^{K}. The precise definitions of DNN operators can be found in Goodfellow et al. (2016). To avoid cluttering notation, we will omit ΘΘ\Theta unless needed for clarity. The only assumption we require for our analysis is that the non-linearities present in the DNN are CPA mappings—as is the case with (leaky-) ReLU, absolute value, and max-pooling operators. The entire input–output mapping then becomes a CPA spline with an implicit partition ΩΩ\Omega, the function of the weights and architecture of the network (Montufar et al., 2014; Balestriero and Baraniuk, 2018). For smooth nonlinearities, our results hold using a first-order Taylor approximation argument.

Joint embedding methods learn DNN parameters ΘΘ\Theta without supervision and input reconstruction. The difficulty of self-supervised learning (SSL) is generating a good representation for downstream tasks whose labels are unavailable during self-supervised training while avoiding trivial solutions where the model maps all inputs to constant output. Many methods have been proposed to solve this problem (see Balestriero and LeCun (2022) for a summary and connections between methods). Contrastive methods, such as SimCLR (Chen et al., 2020) and its InfoNCE criterion (Oord et al., 2018), learn representations by contrasting positive and negative examples. In contrast, non-contrastive methods employ different regularization methods to prevent collapsing of the representation and do not explicitly rely on negative samples. Some methods use stop-gradients and extra predictors to avoid collapse (Chen and He, 2021; Grill et al., 2020) while Caron et al. (2020) use an additional clustering step. Of particular interest to us is the Variance-Invariance-Covariance Regularization method(VICReg; Bardes et al. (2021)) that considers two embedding batches 𝒁=[f(𝒙1),…,f(𝒙N)]𝒁𝑓subscript𝒙1…𝑓subscript𝒙𝑁\bm{Z}=\left[f(\bm{x}{1}),\dots,f(\bm{x}{N})\right] and 𝒁′=[f(𝒙1′),…,f(𝒙N′)]superscript𝒁′𝑓subscriptsuperscript𝒙′1…𝑓subscriptsuperscript𝒙′𝑁\bm{Z}^{\prime}=\left[f(\bm{x}^{\prime}{1}),\dots,f(\bm{x}^{\prime}{N})\right] each of size (N×K)𝑁𝐾(N\times K). Denoting by 𝑪𝑪\bm{C} the (K×K)𝐾𝐾(K\times K) covariance matrix obtained from [𝒁,𝒁′]𝒁superscript𝒁′[\bm{Z},\bm{Z}^{\prime}], the VICReg triplet loss is given by

Recently, information-theoretic methods have played an essential role in advancing deep learning (Alemi et al., 2016; Xu and Raginsky, 2017; Steinke and Zakynthinou, 2020; Shwartz-Ziv and Tishby, 2017b) by developing and applying information-theoretic estimators and learning principles to DNN training (Hjelm et al., 2018; Belghazi et al., 2018; Piran et al., 2020; Shwartz-Ziv et al., 2018). However, information-theoretic objectives for deterministic DNNs often exhibit a common pitfall: They assume that DNN mappings are stochastic- an assumption that is usually violated. As a result, the mutual information between the input and the DNN representation in such objectives would be infinite, resulting in ill-posed optimization problems. To avoid this problem, stochastic DNNs with variational bounds could be used, where the output of the deterministic network is used as the parameters of the conditional distribution (Lee et al., 2021b; Shwartz-Ziv and Alemi, 2020). Dubois et al. (2021) assumed that the randomness of data augmentation among the two views is the source of stochasticity in the network. Other work assumed a random input, but without making any assumptions about the properties of the distribution of the network’s output, to analyze the objective and relied on general lower bounds (Wang and Isola, 2020; Zimmermann et al., 2021). For supervised learning, Goldfeld et al. (2018) introduced an auxiliary (noisy) DNN by injecting additive noise into the model and demonstrated that the resulting model is a good proxy for the original (deterministic) DNN in terms of both performance and representation. Finally, Achille and Soatto (2018) found that minimizing a stochastic network with a regularizer is equivalent to minimizing the cross-entropy over deterministic DNNs with multiplicative noise. All of these methods assume that the source of randomness comes from the DNN, contradicting common practice.

This section provides an information-theoretic perspective on SSL in deterministic deep neural networks. We begin by introducing assumptions about the information-theoretic challenges in SSL (Section 3.1) and about the data distribution (Section 3.2). More specifically, we assume throughout that any training sample 𝒙𝒙\bm{x} can be seen as coming from a single Gaussian distribution, 𝒙∼𝒩(μ𝒙,Σ𝒙)similar-to𝒙𝒩subscript𝜇𝒙subscriptΣ𝒙\bm{x}\sim\mathcal{N}(\mu_{\bm{x}},\Sigma_{\bm{x}}). From this, we show that the output of any DNN f(𝒙)𝑓𝒙f(\bm{x}) corresponds to a mixture of truncated Gaussian distributions (in Section 3.3). This will enable information measures to be applied to deterministic DNNs. Using these assumptions, we then show that an approximation to the VICReg objective can be recovered from information-theoretic principles.

To better understand the difference between key SSL methods and suggest new ones, we first formulate the general SSL goal from an information-theoretical perspective. This formulation allows us to analyze and compare different SSL methods based on their ability to maximize the mutual information between the representations. Furthermore, it opens up the possibility for new SSL methods that may improve upon existing ones by finding new ways to maximize this information. We start with the MultiView InfoMax principle, which aims to maximize the mutual information between the representations of two different views, X𝑋X and X′superscript𝑋′X^{\prime}, and their corresponding representations, Z𝑍Z and Z′superscript𝑍′Z^{\prime}. As shown in Federici et al. (2020), to maximize their information, we maximize I(Z;X′)𝐼𝑍superscript𝑋′I(Z;X^{\prime}) and I(Z′;X)𝐼superscript𝑍′𝑋I(Z^{\prime};X) using the lower bound

where H(Z)𝐻𝑍H(Z) is the entropy of Z𝑍Z. In supervised learning, where we need to maximize I(Z;Y)𝐼𝑍𝑌I(Z;Y), the labels (Y𝑌Y) are fixed, the entropy term H(Y)𝐻𝑌H(Y) is constant, and you only need to optimize the log-loss 𝔼x′[log⁡q(z|x)]subscript𝔼superscript𝑥′delimited-[]𝑞conditional𝑧𝑥\mathbb{E}_{x^{\prime}}[\log q(z|x)] (cross-entropy or square loss). However, it is well known that for Siamese networks, a degenerate solution exists in which all outputs “collapse” into an undesired value (Chen et al., 2020). By looking at Equation 6, we can see that the entropies are not constant and can be optimized throughout the learning process. Therefore, only minimizing the log loss will cause the representations to collapse to the trivial solution of making them constant (where the entropy goes to zero). To regularize these entropies, that is, to prevent collapse, different methods utilize different approaches to implicitly regularizing information. To better understand these methods, we will introduce results about the data distribution (in Section 3.2) and about the pushforward measure of the data under the neural network transformation (in Section 3.3).

First, we examine the way the output random variables of the network are represented and assume a distribution over the data. Under the manifold hypothesis, any point can be seen as a Gaussian random variable with a low-rank covariance matrix in the direction of the manifold tangent space of the data (Fefferman et al., 2016). Therefore, throughout this study, we will consider the conditioning of a latent representation with respect to the mean of the observation, i.e., X|𝒙∗∼𝒩(𝒙∗,Σ𝒙∗)similar-toconditional𝑋superscript𝒙𝒩superscript𝒙subscriptΣsuperscript𝒙X|\bm{x}^{}\sim\mathcal{N}(\bm{x}^{},\Sigma_{\bm{x}^{}}), where the eigenvectors of Σ𝒙∗subscriptΣsuperscript𝒙\Sigma_{\bm{x}^{}} are in the same linear subspace as the tangent space of the data manifold at 𝒙∗superscript𝒙\bm{x}^{} which varies with the position of 𝒙∗superscript𝒙\bm{x}^{} in space. Hence a dataset is considered to be a collection of {𝒙n∗,n=1,…,N}formulae-sequencesubscriptsuperscript𝒙𝑛𝑛1…𝑁{\bm{x}^{*}_{n},n=1,\dots,N} and the full data distribution to be a sum of low-rank covariance Gaussian densities, as in

with T𝑇T the uniform Categorical random variable. For simplicity, we consider that the effective support of 𝒩(𝒙i∗,Σ𝒙i∗)𝒩subscriptsuperscript𝒙𝑖subscriptΣsubscriptsuperscript𝒙𝑖\mathcal{N}(\bm{x}^{}{i},\Sigma{\bm{x}^{}{i}}) and 𝒩(𝒙j∗,Σ𝒙j∗)𝒩subscriptsuperscript𝒙𝑗subscriptΣsubscriptsuperscript𝒙𝑗\mathcal{N}(\bm{x}^{*}{j},\Sigma_{\bm{x}^{*}_{j}}) do not overlap, where the effective support is defined as {x∈ℝD:p(x)>ϵ}conditional-set𝑥superscriptℝ𝐷𝑝𝑥italic-ϵ{x\in\mathbb{R}^{D}:p(x)>{\epsilon}}. Therefore, we have that.

where 𝒩(𝒙;.,.)\mathcal{N}\left(\bm{x};.,.\right) is the Gaussian density at 𝒙𝒙\bm{x} and with n(𝒙)=argminn(𝒙−𝒙n∗)TΣ𝒙n∗(𝒙−𝒙n∗)n(\bm{x})=\operatorname*{arg,min}{n}(\bm{x}-\bm{x}^{*}{n})^{T}\Sigma_{\bm{x}^{}_{n}}(\bm{x}-\bm{x}^{}_{n}). This assumption, that a dataset is a mixture of Gaussians with non-overlapping support, will simplify our derivations below and could be extended to the general case if needed.

Consider an affine spline operator f𝑓f (Equation 1) that goes from a space of dimension D𝐷D to a space of dimension K𝐾K with K≥D𝐾𝐷K\geq D. The span, which we denote as an image, of this mapping is given by

with Aff(ω;𝑨ω,𝒃ω)={𝑨ω𝒙+𝒃ω:𝒙∈ω}Aff𝜔subscript𝑨𝜔subscript𝒃𝜔conditional-setsubscript𝑨𝜔𝒙subscript𝒃𝜔𝒙𝜔\text{Aff}(\omega;\bm{A}{\omega},\bm{b}{\omega})={\bm{A}{\omega}\bm{x}+\bm{b}{\omega}:\bm{x}\in\omega} the affine transformation of region ω𝜔\omega by the per-region parameters 𝑨ω,𝒃ωsubscript𝑨𝜔subscript𝒃𝜔\bm{A}{\omega},\bm{b}{\omega}, and with ΩΩ\Omega the partition of the input space in which 𝒙𝒙\bm{x} lives in. The practical computation of the per-region affine mapping can be obtained by setting 𝑨ωsubscript𝑨𝜔\bm{A}{\omega} to the Jacobian matrix of the network at the corresponding input x𝑥x, and b𝑏b to be defined as f(x)−𝑨ωx𝑓𝑥subscript𝑨𝜔𝑥f(x)-\bm{A}{\omega}x. Therefore, the DNN mapping consists of affine transformations on each input space partition region ω∈Ω𝜔Ω\omega\in\Omega based on the coordinate change induced by 𝑨ωsubscript𝑨𝜔\bm{A}{\omega} and the shift induced by 𝒃ωsubscript𝒃𝜔\bm{b}{\omega}.

When the input space is equipped with a density distribution, this density is transformed by the mapping f𝑓f. In general, the density of f(X)𝑓𝑋f(X) is intractable. However, given the disjoint support assumption from Section 3.2, we can arbitrarily increase the representation power of the density by increasing the number of prototypes N𝑁N. By doing so, the support of each Gaussian is included within the region ω𝜔\omega in which its means lie, leading to the following result:

Given the setting of Equation 8, the unconditional DNN output density, Z𝑍Z, is approximately a mixture of the affinely transformed distributions 𝐱|𝐱n(𝐱)∗conditional𝐱subscriptsuperscript𝐱𝑛𝐱\bm{x}|\bm{x}^{*}_{n(\bm{x})}:

where ω(𝐱n∗)=ω∈Ω⇔𝐱n∗∈ωiff𝜔subscriptsuperscript𝐱𝑛𝜔Ωsubscriptsuperscript𝐱𝑛𝜔\omega(\bm{x}^{}_{n})=\omega\in\Omega\iff\bm{x}^{}{n}\in\omega is the partition region in which the prototype 𝐱n∗subscriptsuperscript𝐱𝑛\bm{x}^{*}{n} lives in.

Proof See Appendix B.

Next, we will show how SSL algorithms for deterministic networks can be derived from information-theoretic principles. According to Section 3.1, we want to maximize I(Z;X′)𝐼𝑍superscript𝑋′I(Z;X^{\prime}) and I(Z′;X)𝐼superscript𝑍′𝑋I(Z^{\prime};X). Although this mutual information is intractable in general, we can obtain a tractable variational approximation using the expected loss. First, when the input noise is small, namely that the effective support of the Gaussian centered at x𝑥x is contained within the region w𝑤w of the DNN’s input space partition, we can reduce the conditional output density to a single Gaussian: (Z′|X′=xn)∼𝒩(μ(xn),Σ(xn)),similar-toconditionalsuperscript𝑍′superscript𝑋′subscript𝑥𝑛𝒩𝜇subscript𝑥𝑛Σsubscript𝑥𝑛(Z^{\prime}|X^{\prime}=x_{n})\sim\mathcal{N}\left(\mu(x_{n}),\Sigma(x_{n})\right), where μ(xn)=𝑨ω(𝒙n)𝒙n+𝒃ω(𝒙n)𝜇subscript𝑥𝑛subscript𝑨𝜔subscript𝒙𝑛subscript𝒙𝑛subscript𝒃𝜔subscript𝒙𝑛\mu(x_{n})=\bm{A}{\omega(\bm{x}{n})}\bm{x}{n}+\bm{b}{\omega(\bm{x}{n})} and Σ(xn)=𝑨ω(𝒙n)TΣ𝒙n𝑨ω(𝒙n)Σsubscript𝑥𝑛subscriptsuperscript𝑨𝑇𝜔subscript𝒙𝑛subscriptΣsubscript𝒙𝑛subscript𝑨𝜔subscript𝒙𝑛\Sigma(x{n})=\bm{A}^{T}{\omega(\bm{x}{n})}\Sigma_{\bm{x}{n}}\bm{A}{\omega(\bm{x}{n})}. Second, to compute the expected loss, we need to marginalize out the stochasticity in the output of the network. In general, training with squared loss is equivalent to maximum likelihood estimation in a Gaussian observation model, p(z|z′)∼𝒩(z′,Σr)similar-to𝑝conditional𝑧superscript𝑧′𝒩superscript𝑧′subscriptΣ𝑟p(z|z^{\prime})\sim\mathcal{N}(z^{\prime},\Sigma{r}), where Σr=IsubscriptΣ𝑟𝐼\Sigma_{r}=I. To compute the expected loss over samples of x′superscript𝑥′x^{\prime}, we need to marginalize out the stochasticity in Z′superscript𝑍′Z^{\prime}: which means that the conditional decoder is a Gaussian: (Z|X′=xn)∼𝒩(μ(xn),Σr+Σ(xn))similar-toconditional𝑍superscript𝑋′subscript𝑥𝑛𝒩𝜇subscript𝑥𝑛subscriptΣ𝑟Σsubscript𝑥𝑛\left(Z|X^{\prime}=x_{n}\right)\sim\mathcal{N}(\mu(x_{n}),\Sigma_{r}+\Sigma(x_{n})). However, the expected log loss over samples of Z𝑍Z is hard to compute. We instead focus on a lower bound; the expected log loss over samples of Z′superscript𝑍′Z^{\prime}. For simplicity, let Σr=IsubscriptΣ𝑟𝐼\Sigma_{r}=I. By Jensen’s inequality, we then obtain the following lower bound on 𝔼x′[log⁡q(z|x′)]subscript𝔼superscript𝑥′delimited-[]𝑞conditional𝑧superscript𝑥′\mathbb{E}_{x^{\prime}}\left[\log q(z|x^{\prime})\right]:

Now, taking the expectation over Z𝑍Z, we get

Full derivations of Equations (10) and (11) are given in Appendix A. Combining all of the above then yields

To optimize this objective in practice, we can approximate p(x,x′)𝑝𝑥superscript𝑥′p(x,x^{\prime}) using the empirical data distribution:

Next, we will discuss estimating the intractable entropy H(Z)𝐻𝑍H(Z) changes the objective.

In the previous section, we derived an objective function based on information-theoretical principles. The “invariance term” in LABEL:eq:obj is similar to the invariance loss of VICReg. However, computing the regularization term—and H(Z)𝐻𝑍H(Z) in particular—is challenging. Estimating the entropy of random variables is a classic problem in information theory, with the Gaussian mixture density being a popular representation. However, there is no closed-form solution to the differential entropy of Gaussian mixtures. Approximations, including loose upper and lower bounds (Huber et al., 2008) and Monte Carlo sampling, exist in the literature. Unfortunately, Monte Carlo sampling is computationally expensive and requires many samples in high dimensions (Brewer, 2017).

One of the simplest and straightforward approaches to approximating the entropy is to capture the first two moments of the distribution, which provides an upper bound on the entropy. However, minimizing an upper bound means that there is no guarantee that the original objective is being optimized. In practice, there have been cases where successful results have been achieved by minimizing an upper bound (Martinez et al., 2021; Nowozin et al., 2016). However, this may cause instability in the training process. For a detailed discussion and results on different entropy estimators, see Section 5. Letting ΣZsubscriptΣ𝑍\Sigma_{Z} be the covariance matrix of Z𝑍Z, we will use the first two moments to approximate the entropy we wish to maximize. This way, we obtain the approximation

A standard fact in linear algebra is that the determinant of a matrix is the product of its eigenvalues. Therefore, maximizing the sum of the log eigenvalues implies maximizing the log determinant of Z𝑍Z. Many works have considered this problem (Giles, 2008; Ionescu et al., 2015; Dang et al., 2018). One approach is to find the solutions using the eigendecomposition, which leads to numerical instability (Dang et al., 2018). An alternative approach is to diagonalize the covariance matrix and increase its diagonal elements. Because the eigenvalues of a diagonal matrix are the diagonal entries, increasing the sum of the log-diagonal terms is equivalent to increasing the sum of the log eigenvalues. One way to do this is to push the off-diagonal terms of ΣZsubscriptΣ𝑍\Sigma_{Z} to be zero and maximize the sum of its log diagonal. This can be done using the covariance term of VICReg. Even though this approach is simple and efficient, the values on the diagonal may become close to zero, which may cause instability when we calculate the logarithm. Therefore, we use an upper bound and calculate the sum of the diagonal elements directly, which is the variance term of VICReg. In conclusion, we see the connection between the information-theoretic objective and the three terms of CIVReg. An exciting research direction is to maximize the eigenvalues of Z𝑍Z using more sophisticated methods, such as using a differential expression for eigendecomposition.

Based on the theory outlined in Section 3.3, the conditional output density p𝒛|x=isubscript𝑝conditional𝒛𝑥𝑖p_{\bm{z}|x=i} can be reduced to a single Gaussian with decreasing input noise. To validate this, we used a ResNet-18 model trained with either SimCLR or VICReg objectives on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). We sampled 512 Gaussian samples for each image from the test dataset, and analyzed whether each sample remained Gaussian in the penultimate layer of the DNN. We then used the D’Agostino and Pearson’s test (D’Agostino, 1971) to determine the validity of this assumption. Figure 1 (left) shows the p𝑝p-value as a function of the normalized standard deviation. For small noise, we can reject the hypothesis that the conditional output density of the network is not Gaussian with a probability of 85%percent8585% for VICReg. However, as the input noise increases, the network’s output becomes less Gaussian and even for the small noise regime, there is a 15%percent1515% chance of a Type I error.

Next, to confirm our assumption that the model of the data distribution has non-overlapping effective support, we calculated the distribution of pairwise l2subscript𝑙2l_{2} distances between images for seven datasets: MNIST, CIFAR10, CIFAR100, Flowers102, Food101, FGVAircaft. Figure 1 (right) shows that even for raw pixels, the pairwise distances are far from zero, which means that we can use a small Gaussian around each point without overlapping. Therefore, the effective support of these datasets is non-overlapping, and our assumption is realistic.

Implementing Equation 12 in practice requires various “design choices.” As demonstrated in Section 5, VICReg uses an approximation of the entropy that is based on certain assumptions. Next, we examine different ways to implement the information-based objective by comparing VICReg to other methods such as contrastive learning SSL methods such as SimCLR and non-contrastive methods like BYOL and SimSiam. We will analyze their assumptions and the differences in their approaches to implementing the information maximization objective. Based on our analysis, we will suggest new objective functions that incorporate more recent information and entropy estimators from the information theory literature. This will allow us to further improve the performance of SSL and to better understand their underlying learning mechanisms.

Lee et al. (2021b) connect the SimCLR objective (Chen et al., 2020) to the variational bound on the information between representations by using the von Mises-Fisher distribution as the conditional variational family. By applying our analysis for information in deterministic networks with their work, we can compare the differences between SimCLR and VICReg, and identify two main differences: (i) Conditional distribution: SimCLR assumes a von Mises-Fisher distribution for the encoder, while VICReg assumes a Gaussian distribution. (ii) Entropy estimation: The entropy term in SimCLR is approximate and based on the finite sum of the input samples. In contrast, VICReg estimates the entropy of Z𝑍Z solely based on the second moment. Creating self-supervised methods that combine these two differences would be an interesting future research direction.

As we saw in previous sections, the different methods use different objective functions to optimize the entropy of their representation. Next, we compare the SSL methods and check directly their entropy. To do so, we trained ResNet-18 architecture (He et al., 2016) on CIFAR-10 (Krizhevsky et al., 2009) for VICReg, SimCLR and BYOL. we used the pairwise distances entropy estimator based on the distances of the individual mixture component (Kolchinsky and Tracey, 2017). Even though this quantity is just an estimator for the entropy, it is shown as a tight estimator and is directly optimized by neither one of the methods. Therefore, we can treat it as a outsource validation of the entropy for the different methods. For more details on this and other entropy estimators see Section 5.2. In Figure 2, we see that, as expected from our analysis before, all the entropy decreased during the training for all the methods. Additionally, we see that SimCLR has the lowest entropy during the training, while VICReg has the highest one.

In Section 5.1, we discussed existing methods that use an approximation to the entropy. Next, we suggest combining the invariance term of these methods with plug-in methods for optimizing the entropy.

The VICReg objective aims to approximate the log determinate of the empirical covariance matrix by using diagonal terms. However, as discussed in Section 4.1, this estimator can be problematic. Instead, we can plug in different entropy estimators. One such option is the LogDet Entropy Estimator (Zhouyin and Liu, 2021), which provides a tighter upper bound. This estimator uses the differential entropy α𝛼\alpha order entropy with scaled noise and was previously demonstrated as a tight estimator for high-dimensional features and robust to random noise. However, since the estimator is an upper bound on the entropy, we are not guaranteed to optimize the original objective when maximizing this upper bound. To address this problem, we also use a lower bound estimator, which is based on the pairwise distances of the individual mixture component (Kolchinsky and Tracey, 2017). For this family, a pairwise-distance function between component densities is defined for each member. These estimators are computationally efficient, as long as the pairwise-distance function and the entropy of each component distribution are easy to compute. They are continuous and smooth (and therefore useful for optimization problems) and converge to the exact solution when the component distributions are grouped into well-separated clusters. These proposed methods are compared with VICReg, SimCLR (Chen et al., 2020), and Barlow Twin (Zbontar et al., 2021).

Our experiments are conducted on CIFAR-10 (Krizhevsky et al., 2009). A ResNet-18 architecture (He et al., 2016) is used as the backbone. We use linear evaluation for the quality of the representation. For full details, see Appendix H.

It can be seen from LABEL:tab:results that the proposed estimators outperform both the original VICReg and SimCLR as well as Barlow Twin. By estimating the entropy with a more accurate estimator, we can improve the results of VICReg, and the pairwise distance estimator, which is a lower bound, achieves the best results. This aligns with the theory that we want to maximize a lower bound on true entropy. The results of our study suggest that a smart selection of entropy estimators, inspired by our framework, leads to better results.

In the previous sections, we showed the connection between information-theoretic principles and the VICReg objective. Next, we will connect this objective and the information-theoretic principles to the downstream generalization of VICReg by deriving a downstream generalization bound. Together with the results in the previous sections, this relates generalization in VICReg to information maximization and implicit regularization.

Consider input points x𝑥x, outputs y∈ℝr𝑦superscriptℝ𝑟y\in\mathbb{R}^{r}, labeled training data S=((xi,yi))i=1n𝑆superscriptsubscriptsubscript𝑥𝑖subscript𝑦𝑖𝑖1𝑛S=((x_{i},y_{i})){i=1}^{n} of size n𝑛n and unlabeled training data S¯=((xi+,xi++))i=1m¯𝑆superscriptsubscriptsubscriptsuperscript𝑥𝑖subscriptsuperscript𝑥absent𝑖𝑖1𝑚{\bar{S}}=((x^{+}{i},x^{++}{i})){i=1}^{m} of size m𝑚m, where xi+subscriptsuperscript𝑥𝑖x^{+}{i} and xi++subscriptsuperscript𝑥absent𝑖x^{++}{i} share the same (unknown) label. With the unlabeled training data, we define the invariance loss

where fθsubscript𝑓𝜃f_{\theta} is the trained representation on the unlabeled data S¯¯𝑆{\bar{S}}. We define a labeled loss ℓx,y(w)=‖Wfθ(x)−y‖subscriptℓ𝑥𝑦𝑤norm𝑊subscript𝑓𝜃𝑥𝑦\ell_{x,y}(w)=|Wf_{\theta}(x)-y| where w=vec⁡[W]∈ℝdr𝑤vec𝑊superscriptℝ𝑑𝑟w=\operatorname{vec}[W]\in\mathbb{R}^{dr} is the vectorization of the matrix W∈ℝr×d𝑊superscriptℝ𝑟𝑑W\in\mathbb{R}^{r\times d}. Let wS=vec⁡[WS]subscript𝑤𝑆vecsubscript𝑊𝑆w_{S}=\operatorname{vec}[W_{S}] be the minimum norm solution as WS=minimizeW′‖W′‖Fsubscript𝑊𝑆subscriptminimizesuperscript𝑊′subscriptnormsuperscript𝑊′𝐹W_{S}=\mathop{\mathrm{minimize}}{W^{\prime}}|W^{\prime}|{F} such that

and the projection matrices

We define the label matrix YS=[y1,…,yn]⊤∈ℝn×rsubscript𝑌𝑆superscriptsubscript𝑦1…subscript𝑦𝑛topsuperscriptℝ𝑛𝑟Y_{S}=[y_{1},\dots,y_{n}]^{\top}\in\mathbb{R}^{n\times r} and the unknown label matrix YS¯=[y1+,…,ym+]⊤∈ℝm×rsubscript𝑌¯𝑆superscriptsubscriptsuperscript𝑦1…subscriptsuperscript𝑦𝑚topsuperscriptℝ𝑚𝑟Y_{{\bar{S}}}=[y^{+}{1},\dots,y^{+}{m}]^{\top}\in\mathbb{R}^{m\times r}, where yi+subscriptsuperscript𝑦𝑖y^{+}{i} is the unknown label of xi+subscriptsuperscript𝑥𝑖x^{+}{i}. Let ℱℱ\mathcal{F} be a hypothesis space of fθsubscript𝑓𝜃f_{\theta}. For a given hypothesis space ℱℱ\mathcal{F}, we define the normalized Rademacher complexity

where ξ1,…,ξmsubscript𝜉1…subscript𝜉𝑚\xi_{1},\dots,\xi_{m} are independent uniform random variables taking values in {−1,1}11{-1,1}. It is normalized such that ℛ~~m(ℱ)=O(1)subscript~~ℛ𝑚ℱ𝑂1\tilde{\mathcal{R}}_{m}(\mathcal{F})=O(1) as m→∞→𝑚m\rightarrow\infty for typical choices of hypothesis spaces ℱℱ\mathcal{F}, including deep neural networks (Bartlett et al., 2017; Kawaguchi et al., 2018).

Theorem 2 shows that VICReg improves generalization on supervised downstream tasks. More cpecifically, minimizing the unlabeled invariance loss while controlling the covariance ZS¯ZS¯⊤subscript𝑍¯𝑆superscriptsubscript𝑍¯𝑆top{Z_{\bar{S}}}{Z_{\bar{S}}}^{\top} and the complexity of representations ℛ~~m(ℱ)subscript~~ℛ𝑚ℱ\tilde{\mathcal{R}}_{m}(\mathcal{F}) minimizes the expected labeled loss:

(Informal version). For any δ>0𝛿0\delta>0, with probability at least 1−δ1𝛿1-\delta,

where 𝒬m,n=O(Gln⁡(1/δ)/m+ln⁡(1/δ)/n)→0subscript𝒬𝑚𝑛𝑂𝐺1𝛿𝑚1𝛿𝑛→0\mathcal{Q}{m,n}=O(G\sqrt{\ln(1/\delta)/m}+\sqrt{\ln(1/\delta)/n})\rightarrow 0 as m,n→∞→𝑚𝑛m,n\rightarrow\infty. In 𝒬m,nsubscript𝒬𝑚𝑛\mathcal{Q}{m,n}, the value of G𝐺G for the term decaying at the rate 1/m1𝑚1/\sqrt{m} depends on the hypothesis space of fθsubscript𝑓𝜃f_{\theta} and w𝑤w whereas the term decaying at the rate 1/n1𝑛1/\sqrt{n} is independent of any hypothesis space.

Proof The complete version of Theorem 2 and its proof are presented in Appendix I.

The term ‖𝐏ZS¯YS¯‖Fsubscriptnormsubscript𝐏subscript𝑍¯𝑆subscript𝑌¯𝑆𝐹|\mathbf{P}{Z{\bar{S}}}Y_{\bar{S}}|{F} in Theorem 2 contains the unobservable label matrix YS¯subscript𝑌¯𝑆Y{\bar{S}}. However, we can minimize this term by using ‖𝐏ZS¯YS¯‖F≤‖𝐏ZS¯‖F‖YS¯‖Fsubscriptnormsubscript𝐏subscript𝑍¯𝑆subscript𝑌¯𝑆𝐹subscriptnormsubscript𝐏subscript𝑍¯𝑆𝐹subscriptnormsubscript𝑌¯𝑆𝐹|\mathbf{P}{Z{\bar{S}}}Y_{\bar{S}}|{F}\leq|\mathbf{P}{Z_{\bar{S}}}|{F}|Y{\bar{S}}|{F} and by minimizing ‖𝐏ZS¯‖Fsubscriptnormsubscript𝐏subscript𝑍¯𝑆𝐹|\mathbf{P}{Z_{\bar{S}}}|{F}. The factor ‖𝐏ZS¯‖Fsubscriptnormsubscript𝐏subscript𝑍¯𝑆𝐹|\mathbf{P}{Z_{\bar{S}}}|{F} is minimized when the rank of the covariance ZS¯ZS¯⊤subscript𝑍¯𝑆superscriptsubscript𝑍¯𝑆top{Z{\bar{S}}}{Z_{\bar{S}}}^{\top} is maximized. Since a strictly diagonally dominant matrix is non-singular, this can be enforced by maximizing the diagonal entries while minimizing the off-diagonal entries, as is done in VICReg. For example, if d≥n𝑑𝑛d\geq n, then ‖𝐏ZS¯‖F=0subscriptnormsubscript𝐏subscript𝑍¯𝑆𝐹0|\mathbf{P}{Z{\bar{S}}}|{F}=0 when the covariance ZS¯ZS¯⊤subscript𝑍¯𝑆superscriptsubscript𝑍¯𝑆top{Z{\bar{S}}}{Z_{\bar{S}}}^{\top} is of full rank.

The term ‖𝐏ZSYS‖Fsubscriptnormsubscript𝐏subscript𝑍𝑆subscript𝑌𝑆𝐹|\mathbf{P}{Z{S}}Y_{S}|{F} contains only observable variables, and we can directly measure the value of this term using training data. In addition, the term ‖𝐏ZSYS‖Fsubscriptnormsubscript𝐏subscript𝑍𝑆subscript𝑌𝑆𝐹|\mathbf{P}{Z_{S}}Y_{S}|{F} is also minimized when the rank of the covariance ZSZS⊤subscript𝑍𝑆superscriptsubscript𝑍𝑆top{Z{S}}{Z_{S}}^{\top} is maximized. Since the covariances ZSZS⊤subscript𝑍𝑆superscriptsubscript𝑍𝑆top{Z_{S}}{Z_{S}}^{\top} and ZS¯ZS¯⊤subscript𝑍¯𝑆superscriptsubscript𝑍¯𝑆top{Z_{\bar{S}}}{Z_{\bar{S}}}^{\top} concentrate to each other via concentration inequalities with the error in the order of O((ln⁡(1/δ))/n+ℛ~~m(ℱ)(ln⁡(1/δ))/m)𝑂1𝛿𝑛subscript~~ℛ𝑚ℱ1𝛿𝑚O(\sqrt{(\ln(1/\delta))/n}+\tilde{\mathcal{R}}{m}(\mathcal{F})\sqrt{(\ln(1/\delta))/m}), we can also minimize the upper bound on ‖𝐏ZSYS‖Fsubscriptnormsubscript𝐏subscript𝑍𝑆subscript𝑌𝑆𝐹|\mathbf{P}{Z_{S}}Y_{S}|{F} by maximizing the diagonal entries of ZS¯ZS¯⊤subscript𝑍¯𝑆superscriptsubscript𝑍¯𝑆top{Z{\bar{S}}}{Z_{\bar{S}}}^{\top} while minimizing its off-diagonal entries, as is done in VICReg.

Thus, VICReg can be understood as a method to minimize the generalization bound in Theorem 2 by minimizing the invariance loss while controlling the covariance ZS¯ZS¯⊤subscript𝑍¯𝑆superscriptsubscript𝑍¯𝑆top{Z_{\bar{S}}}{Z_{\bar{S}}}^{\top} to minimize the label-agnostic upper bounds on ‖𝐏ZS¯YS¯‖Fsubscriptnormsubscript𝐏subscript𝑍¯𝑆subscript𝑌¯𝑆𝐹|\mathbf{P}{Z{\bar{S}}}Y_{\bar{S}}|{F} and ‖𝐏ZSYS‖Fsubscriptnormsubscript𝐏subscript𝑍𝑆subscript𝑌𝑆𝐹|\mathbf{P}{Z_{S}}Y_{S}|{F}. If we know partial information about the label YS¯subscript𝑌¯𝑆Y{\bar{S}} of the unlabeled data, we can use it to minimize ‖𝐏ZS¯YS¯‖Fsubscriptnormsubscript𝐏subscript𝑍¯𝑆subscript𝑌¯𝑆𝐹|\mathbf{P}{Z{\bar{S}}}Y_{\bar{S}}|{F} and ‖𝐏ZSYS‖Fsubscriptnormsubscript𝐏subscript𝑍𝑆subscript𝑌𝑆𝐹|\mathbf{P}{Z_{S}}Y_{S}|_{F} directly. This direction can be used to improve VICReg in future work for the partially observable setting.

The SimCLR generalization bound (Saunshi et al., 2019) requires the number of label classes to go infinity to close the generalization gap, whereas the VICReg bound in Theorem 2 does not require the number of label classes to approach infinity for the generalization gap to go to zero. This reflects the fact that, unlike SimCLR, VICReg does not use negative pairs and thus does not use a loss function that is based on the implicit expectation that the labels of a negative pair (y+,y−)superscript𝑦superscript𝑦(y^{+},y^{-}) are different. Another difference is that our VICReg bound improves as n𝑛n increases, while the previous bound of SimCLR (Saunshi et al., 2019) does not depend on n𝑛n. This is because Saunshi et al. (2019) assume partial access to the true distribution p(x∣y)𝑝conditional𝑥𝑦p(x\mid y) per class for setting W𝑊W, which removes the importance of labeled data size n𝑛n and is not assumed in our study.

Consequently, the generalization bound in Theorem 2 provides a new insight for VICReg regarding the ratio of the effects of m𝑚m v.s. n𝑛n through Gln⁡(1/δ)/m+ln⁡(1/δ)/n𝐺1𝛿𝑚1𝛿𝑛G\sqrt{\ln(1/\delta)/m}+\sqrt{\ln(1/\delta)/n}. Finally, Theorem 2 also illuminates the advantages of VICReg over standard supervised training. That is, with standard training, the generalization bound via the Rademacher complexity requires the complexities of hypothesis spaces, ℛ~~n(𝒲)/nsubscript~~ℛ𝑛𝒲𝑛\tilde{\mathcal{R}}{n}(\mathcal{W})/\sqrt{n} and ℛ~~n(ℱ)/nsubscript~~ℛ𝑛ℱ𝑛\tilde{\mathcal{R}}{n}(\mathcal{F})/\sqrt{n}, with respect to the size of labeled data n𝑛n, instead of the size of unlabeled data m𝑚m.

Thus, Theorem 2 shows that using self-supervised learning, we can replace all the complexities of hypothesis spaces in terms of n𝑛n with those in terms of m𝑚m. Since the number of unlabeled data points is typically much larger than the number of labeled data points, this illuminates the benefit of self-supervised learning.

Theorem 2 together with the result of the previous section shows that, for generalization in the downstream task, it is helpful to maximize the mutual information I(Z;X′)𝐼𝑍superscript𝑋′I(Z;X^{\prime}) in SSL via minimizing the invariance loss IS¯(fθ)subscript𝐼¯𝑆subscript𝑓𝜃I_{{\bar{S}}}(f_{\theta}) while controlling the covariance ZS¯ZS¯⊤subscript𝑍¯𝑆superscriptsubscript𝑍¯𝑆top{Z_{\bar{S}}}{Z_{\bar{S}}}^{\top}. The term 2ℛ~~m(ℱ)/m2subscript~~ℛ𝑚ℱ𝑚2\tilde{\mathcal{R}}{m}(\mathcal{F})/\sqrt{m} captures the importance of controlling the complexity of the representations fθsubscript𝑓𝜃f{\theta}. To understand this term further in terms of mutual information, let us consider a discretization of the parameter space of ℱℱ\mathcal{F} to have finite |ℱ|<∞ℱ|\mathcal{F}|<\infty (indeed, a computer always implements some discretization of continuous variables). Then, by Massart’s Finite Class Lemma, we have that ℛ~~m(ℱ)≤Cln⁡|ℱ|subscript~~ℛ𝑚ℱ𝐶ℱ\tilde{\mathcal{R}}{m}(\mathcal{F})\leq C\sqrt{\ln|\mathcal{F}|} for some constant C>0𝐶0C>0. Moreover, Shwartz-Ziv (2022) shows that we can approximate ln⁡|ℱ|ℱ\ln|\mathcal{F}| by 2I(Z;X)superscript2𝐼𝑍𝑋2^{I(Z;X)}. Thus, in Theorem 2, the term IS¯(fθ)+2m‖𝐏ZS¯YS¯‖F+1n‖𝐏ZSYS‖Fsubscript𝐼¯𝑆subscript𝑓𝜃2𝑚subscriptnormsubscript𝐏subscript𝑍¯𝑆subscript𝑌¯𝑆𝐹1𝑛subscriptnormsubscript𝐏subscript𝑍𝑆subscript𝑌𝑆𝐹I{{\bar{S}}}(f_{\theta})+\frac{2}{\sqrt{m}}|\mathbf{P}{Z{\bar{S}}}Y_{\bar{S}}|{F}+\frac{1}{\sqrt{n}}|\mathbf{P}{Z_{S}}Y_{S}|{F} corresponds to I(Z;X′)𝐼𝑍superscript𝑋′I(Z;X^{\prime}) while the term of 2ℛ~~m(ℱ)/m2subscript~~ℛ𝑚ℱ𝑚2\tilde{\mathcal{R}}{m}(\mathcal{F})/\sqrt{m} corresponds to I(Z;X)𝐼𝑍𝑋I(Z;X). Recall that the information can be decomposed as

where we want to maximize the predictive information I(Z;X′)𝐼𝑍superscript𝑋′I(Z;X^{\prime}), while minimizing I(Z;X)𝐼𝑍𝑋I(Z;X) (Federici et al., 2019; Shwartz-Ziv and Tishby, 2017a). Thus, to improve generalization, we also need to control 2ℛ~~m(ℱ)/m2subscript~~ℛ𝑚ℱ𝑚2\tilde{\mathcal{R}}{m}(\mathcal{F})/\sqrt{m} to restrict the superfluous information I(Z;X|X′)𝐼𝑍conditional𝑋superscript𝑋′I(Z;X|X^{\prime}), in addition to minimizing IS¯(fθ)+2m‖𝐏ZS¯YS¯‖F+1n‖𝐏ZSYS‖Fsubscript𝐼¯𝑆subscript𝑓𝜃2𝑚subscriptnormsubscript𝐏subscript𝑍¯𝑆subscript𝑌¯𝑆𝐹1𝑛subscriptnormsubscript𝐏subscript𝑍𝑆subscript𝑌𝑆𝐹I{{\bar{S}}}(f_{\theta})+\frac{2}{\sqrt{m}}|\mathbf{P}{Z{\bar{S}}}Y_{\bar{S}}|{F}+\frac{1}{\sqrt{n}}|\mathbf{P}{Z_{S}}Y_{S}|{F} that corresponded to maximize the predictive information I(Z;X′)𝐼𝑍superscript𝑋′I(Z;X^{\prime}). Although we can explicitly add regularization on I(Z;X|X′)𝐼𝑍conditional𝑋superscript𝑋′I(Z;X|X^{\prime}) to control 2ℛ~~m(ℱ)/m2subscript~~ℛ𝑚ℱ𝑚2\tilde{\mathcal{R}}{m}(\mathcal{F})/\sqrt{m}, it is possible that I(Z;X|X′)𝐼𝑍conditional𝑋superscript𝑋′I(Z;X|X^{\prime}) and 2ℛ~~m(ℱ)/m2subscript~~ℛ𝑚ℱ𝑚2\tilde{\mathcal{R}}_{m}(\mathcal{F})/\sqrt{m} are implicitly regularized via implicit bias through e design choises (Gunasekar et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018). Thus, Theorem 2 connects the information-theoretic understanding of VICReg with the probabilistic guarantee on downstream generalization.

In this study, we examined Variance-Invariance-Covariance Regularization for self-supervised learning from an information-theoretic perspective. By transferring the required stochasticity required for an information-theoretic analysis to the input distribution, we showed how the VICReg objective can be derived from information-theoretic principles, used this perspective to highlight assumptions implicit in the VICReg objective, derived a VICReg generalization bound for downstream tasks, and related it to information maximization.

Finally, we built on the insights from our analysis to propose a new VICReg-style SSL objective. Our probabilistic guarantee suggests that VICReg can be further improved for the settings of partial label information by aligning the covariance matrix with the partially observable label matrix, which opens up several avenues for future work, including the design of improved estimators for information-theoretic quantities and investigations into the suitability of different SSL methods for specific data characteristics.

Tim G. J. Rudner is funded by a Qualcomm Innovation Fellowship.

In this section of the supplementary material, we present the full derivation of the lower bound on 𝔼x′[log⁡q(z|x′)]subscript𝔼superscript𝑥′delimited-[]𝑞conditional𝑧superscript𝑥′\mathbb{E}{x^{\prime}}\left[\log q(z|x^{\prime})\right]. Because Z′|X′conditionalsuperscript𝑍′superscript𝑋′Z^{\prime}|X^{\prime} is a Gaussian, we can write it as Z′=μ(x′)+L(x′)ϵsuperscript𝑍′𝜇superscript𝑥′𝐿superscript𝑥′italic-ϵZ^{\prime}=\mu(x^{\prime})+L(x^{\prime})\epsilon where ϵ∼𝒩(0,1)similar-toitalic-ϵ𝒩01\epsilon\sim\mathcal{N}(0,1) and L(x′)TL(x′)=Σ(x′)𝐿superscriptsuperscript𝑥′𝑇𝐿superscript𝑥′Σsuperscript𝑥′L(x^{\prime})^{T}L(x^{\prime})=\Sigma(x^{\prime}). Now, setting Σr=IsubscriptΣ𝑟𝐼\Sigma{r}=I, will give us:

where 𝔼x′[log⁡q(z|x′)]=𝔼x′[log⁡𝔼z′|x′[q(z|z′)]]≥𝔼z′[log⁡q(z|z′)]subscript𝔼superscript𝑥′delimited-[]𝑞conditional𝑧superscript𝑥′subscript𝔼superscript𝑥′delimited-[]subscript𝔼conditionalsuperscript𝑧′superscript𝑥′delimited-[]𝑞conditional𝑧superscript𝑧′subscript𝔼superscript𝑧′delimited-[]𝑞conditional𝑧superscript𝑧′\mathbb{E}{x^{\prime}}\left[\log q(z|x^{\prime})\right]=\mathbb{E}{x^{\prime}}\left[\log\mathbb{E}{z^{\prime}|x^{\prime}}\left[q(z|z^{\prime})\right]\right]\geq\mathbb{E}{z^{\prime}}\left[\log q(z|z^{\prime})\right] by Jensen’s inequality, 𝔼ϵ[ϵ]=0subscript𝔼italic-ϵdelimited-[]italic-ϵ0\mathbb{E}{\epsilon}[\epsilon]=0 and 𝔼ϵ[ϵ(L(x′)TL(x′)ϵ]=TrlogΣ(x′)\mathbb{E}{\epsilon}\left[\epsilon\left(L(x^{\prime})^{T}L(x^{\prime}\right)\epsilon\right]=Tr\log\Sigma(x^{\prime}) by the Hutchinson’s estimator.

Proof We know that If ∫ωp(𝒙|𝒙n(𝒙)∗)𝑑𝒙≈1subscript𝜔𝑝conditional𝒙subscriptsuperscript𝒙𝑛𝒙differential-d𝒙1\int_{\omega}p(\bm{x}|\bm{x}^{*}_{n(\bm{x})})d\bm{x}\approx 1 then f𝑓f is linear within the effective support of p𝑝p. Therefore, any sample from p𝑝p will almost surely lie within a single region ω∈Ω𝜔Ω\omega\in\Omega and therefore the entire mapping can be considered linear with respect to p𝑝p. Thus, the output distribution is a linear transformation of the input distribution based on the per-region affine mapping.

The bound in the complete version of Theorem 4 is better than the one in the informal version of Theorem 2, because of the factor c𝑐c. The factor c𝑐c measures the difference between the minimum norm solution WSsubscript𝑊𝑆W_{S} of the labeled training data and the minimum norm solution WS¯subscript𝑊¯𝑆W_{\bar{S}} of the unlabeled training data. Thus, the factor c𝑐c also decreases towards zero as n𝑛n and m𝑚m increase. Moreover, if the labeled and unlabeled training data are similar, the value of c𝑐c is small, decreasing the generalization bound further, which makes sense. Thus, we can view the factor c𝑐c as a measure on the distance between the labeled training data and the unlabeled training data.

We obtain the informal version from the complete version of Theorem 2 by the following reasoning to simplify the notation in the main text. We have that cIS¯(fθ)+c2ℛ~~m(ℱ)m=IS¯(fθ)+2ℛ~~m(ℱ)m+Q𝑐subscript𝐼¯𝑆subscript𝑓𝜃𝑐2subscript~~ℛ𝑚ℱ𝑚subscript𝐼¯𝑆subscript𝑓𝜃2subscript~~ℛ𝑚ℱ𝑚𝑄cI_{{\bar{S}}}(f_{\theta})+c\frac{2\tilde{\mathcal{R}}{m}(\mathcal{F})}{\sqrt{m}}=I{{\bar{S}}}(f_{\theta})+\frac{2\tilde{\mathcal{R}}{m}(\mathcal{F})}{\sqrt{m}}+Q, where Q=(c−1)(IS¯(fθ)+2ℛ~~m(ℱ)m)≤ς→0𝑄𝑐1subscript𝐼¯𝑆subscript𝑓𝜃2subscript~~ℛ𝑚ℱ𝑚𝜍→0Q=(c-1)(I{{\bar{S}}}(f_{\theta})+\frac{2\tilde{\mathcal{R}}_{m}(\mathcal{F})}{\sqrt{m}})\leq\varsigma\rightarrow 0 as as m,n→∞→𝑚𝑛m,n\rightarrow\infty, since c→0→𝑐0c\rightarrow 0 as m,n→∞→𝑚𝑛m,n\rightarrow\infty. However, this reasoning is used only to simplify the notation in the main text. The bound in the complete version of Theorem 2 is more accurate and indeed tighter than the one in the informal version.

Proof [Proof of Theorem 2] Let W=WS𝑊subscript𝑊𝑆W=W_{S} where WSsubscript𝑊𝑆W_{S} is the the minimum norm solution as WS=minimizeW′‖W′‖Fsubscript𝑊𝑆subscriptminimizesuperscript𝑊′subscriptnormsuperscript𝑊′𝐹W_{S}=\mathop{\mathrm{minimize}}{W^{\prime}}|W^{\prime}|{F} s.t. W′∈argminW⁡1n∑i=1n‖Wfθ(xi)−yi‖2superscript𝑊′subscriptargmin𝑊1𝑛superscriptsubscript𝑖1𝑛superscriptnorm𝑊subscript𝑓𝜃subscript𝑥𝑖subscript𝑦𝑖2W^{\prime}\in\operatorname*{arg,min}{W}\frac{1}{n}\sum{i=1}^{n}|Wf_{\theta}(x_{i})-y_{i}|^{2}. Let W∗=WS¯superscript𝑊subscript𝑊¯𝑆W^{}=W_{\bar{S}} where WS¯subscript𝑊¯𝑆W_{\bar{S}} is the minimum norm solution as W∗=WS¯=minimizeW′‖W′‖Fsuperscript𝑊subscript𝑊¯𝑆subscriptminimizesuperscript𝑊′subscriptnormsuperscript𝑊′𝐹W^{}=W_{{\bar{S}}}=\mathop{\mathrm{minimize}}{W^{\prime}}|W^{\prime}|{F} s.t. W′∈argminW⁡1m∑i=1m‖Wfθ(xi+)−g∗(xi+)‖2superscript𝑊′subscriptargmin𝑊1𝑚superscriptsubscript𝑖1𝑚superscriptnorm𝑊subscript𝑓𝜃subscriptsuperscript𝑥𝑖superscript𝑔subscriptsuperscript𝑥𝑖2W^{\prime}\in\operatorname*{arg,min}{W}\frac{1}{m}\sum{i=1}^{m}|Wf_{\theta}(x^{+}{i})-g^{*}(x^{+}{i})|^{2}. Since y=g∗(x)𝑦superscript𝑔𝑥y=g^{*}(x),

where φ(x)=g∗(x)−W∗fθ(x)𝜑𝑥superscript𝑔𝑥superscript𝑊subscript𝑓𝜃𝑥\varphi(x)=g^{}(x)-W^{}f_{\theta}(x). Define LS(w)=1n∑i=1n‖Wfθ(xi)−yi‖subscript𝐿𝑆𝑤1𝑛superscriptsubscript𝑖1𝑛norm𝑊subscript𝑓𝜃subscript𝑥𝑖subscript𝑦𝑖L_{S}(w)=\frac{1}{n}\sum_{i=1}^{n}|Wf_{\theta}(x_{i})-y_{i}|. Using these,

where W~=W−W∗~𝑊𝑊superscript𝑊{\tilde{W}}=W-W^{*}. We now consider new fresh samples x¯i∼𝒟yisimilar-tosubscript¯𝑥𝑖subscript𝒟subscript𝑦𝑖{\bar{x}}{i}\sim\mathcal{D}{y_{i}} for i=1,…,n𝑖1…𝑛i=1,\dots,n to rewrite the above further as:

This implies that

Furthermore, since y=W∗fθ(x)+φ(x)𝑦superscript𝑊subscript𝑓𝜃𝑥𝜑𝑥y=W^{}f_{\theta}(x)+\varphi(x), by writing y¯i=W∗fθ(x¯i)+φ(x¯i)subscript¯𝑦𝑖superscript𝑊subscript𝑓𝜃subscript¯𝑥𝑖𝜑subscript¯𝑥𝑖{\bar{y}}_{i}=W^{}f_{\theta}({\bar{x}}{i})+\varphi({\bar{x}}{i}) (where y¯i=yisubscript¯𝑦𝑖subscript𝑦𝑖{\bar{y}}{i}=y{i} since x¯i∼𝒟yisimilar-tosubscript¯𝑥𝑖subscript𝒟subscript𝑦𝑖{\bar{x}}{i}\sim\mathcal{D}{y_{i}} for i=1,…,n𝑖1…𝑛i=1,\dots,n),

Combining these, we have that

To bound the left-hand side of equation C.15, we now analyze the following random variable:

where y¯i=yisubscript¯𝑦𝑖subscript𝑦𝑖{\bar{y}}{i}=y{i} since x¯i∼𝒟yisimilar-tosubscript¯𝑥𝑖subscript𝒟subscript𝑦𝑖{\bar{x}}{i}\sim\mathcal{D}{y_{i}} for i=1,…,n𝑖1…𝑛i=1,\dots,n. Importantly, this means that as WSsubscript𝑊𝑆W_{S} depends on yisubscript𝑦𝑖y_{i}, WSsubscript𝑊𝑆W_{S} depends on y¯isubscript¯𝑦𝑖{\bar{y}}{i}. Thus, the collection of random variables ‖WSfθ(x¯1)−y¯1‖,…,‖WSfθ(nn)−y¯n‖normsubscript𝑊𝑆subscript𝑓𝜃subscript¯𝑥1subscript¯𝑦1…normsubscript𝑊𝑆subscript𝑓𝜃subscript𝑛𝑛subscript¯𝑦𝑛|W{S}f_{\theta}({\bar{x}}{1})-{\bar{y}}{1}|,\dots,|W_{S}f_{\theta}(n_{n})-{\bar{y}}{n}| is not independent. Accordingly, we cannot apply standard concentration inequality to bound equation C.16. A standard approach in learning theory is to first bound equation C.16 by 𝔼x,y‖WSfθ(x)−y‖−1n∑i=1n‖WSfθ(x¯i)−y¯i‖≤supW∈𝒲𝔼x,y‖Wfθ(x)−y‖−1n∑i=1n‖Wfθ(x¯i)−y¯i‖subscript𝔼𝑥𝑦normsubscript𝑊𝑆subscript𝑓𝜃𝑥𝑦1𝑛superscriptsubscript𝑖1𝑛normsubscript𝑊𝑆subscript𝑓𝜃subscript¯𝑥𝑖subscript¯𝑦𝑖subscriptsupremum𝑊𝒲subscript𝔼𝑥𝑦norm𝑊subscript𝑓𝜃𝑥𝑦1𝑛superscriptsubscript𝑖1𝑛norm𝑊subscript𝑓𝜃subscript¯𝑥𝑖subscript¯𝑦𝑖\mathbb{E}{x,y}|W_{S}f_{\theta}(x)-y|-\frac{1}{n}\sum_{i=1}^{n}|W_{S}f_{\theta}({\bar{x}}{i})-{\bar{y}}{i}|\leq\sup_{W\in\mathcal{W}}\mathbb{E}{x,y}|Wf{\theta}(x)-y|-\frac{1}{n}\sum_{i=1}^{n}|Wf_{\theta}({\bar{x}}{i})-{\bar{y}}{i}| for some hypothesis space 𝒲𝒲\mathcal{W} (that is independent of S𝑆S) and realize that the right-hand side now contains the collection of independent random variables ‖Wfθ(x¯1)−y¯1‖,…,‖Wfθ(nn)−y¯n‖norm𝑊subscript𝑓𝜃subscript¯𝑥1subscript¯𝑦1…norm𝑊subscript𝑓𝜃subscript𝑛𝑛subscript¯𝑦𝑛|Wf_{\theta}({\bar{x}}{1})-{\bar{y}}{1}|,\dots,|Wf_{\theta}(n_{n})-{\bar{y}}{n}| , for which we can utilize standard concentration inequalities. This reasoning leads to the Rademacher complexity of the hypothesis space 𝒲𝒲\mathcal{W}. However, the complexity of the hypothesis space 𝒲𝒲\mathcal{W} can be very large, resulting into a loose bound. In this proof, we show that we can avoid the dependency on hypothesis space 𝒲𝒲\mathcal{W} by using a very different approach with conditional expectations to take care the dependent random variables ‖WSfθ(x¯1)−y¯1‖,…,‖WSfθ(nn)−y¯n‖normsubscript𝑊𝑆subscript𝑓𝜃subscript¯𝑥1subscript¯𝑦1…normsubscript𝑊𝑆subscript𝑓𝜃subscript𝑛𝑛subscript¯𝑦𝑛|W{S}f_{\theta}({\bar{x}}{1})-{\bar{y}}{1}|,\dots,|W_{S}f_{\theta}(n_{n})-{\bar{y}}_{n}|. Intuitively, we utilize the fact that for these dependent random variables, there are a structure of conditional independence, conditioned on each y∈𝒴𝑦𝒴y\in\mathcal{Y}.

We first write the expected loss as the sum of the conditional expected loss:

where Xysubscript𝑋𝑦X_{y} is the random variable for the conditional with Y=y𝑌𝑦Y=y. Using this, we decompose equation C.16 into two terms:

where 𝒴~={y∈𝒴:|ℐy|≠0}~𝒴conditional-set𝑦𝒴subscriptℐ𝑦0{\tilde{\mathcal{Y}}}={y\in\mathcal{Y}:|\mathcal{I}_{y}|\neq 0}. Substituting these into equation equation C.17 yields

Importantly, while ‖WSfθ(x¯1)−y¯1‖,…,‖WSfθ(x¯n)−y¯n‖normsubscript𝑊𝑆subscript𝑓𝜃subscript¯𝑥1subscript¯𝑦1…normsubscript𝑊𝑆subscript𝑓𝜃subscript¯𝑥𝑛subscript¯𝑦𝑛|W_{S}f_{\theta}({\bar{x}}{1})-{\bar{y}}{1}|,\dots,|W_{S}f_{\theta}({\bar{x}}{n})-{\bar{y}}{n}| on the right-hand side of equation C.18 are dependent random variables, ‖WSfθ(x¯1)−y‖,…,‖WSfθ(x¯n)−y‖normsubscript𝑊𝑆subscript𝑓𝜃subscript¯𝑥1𝑦…normsubscript𝑊𝑆subscript𝑓𝜃subscript¯𝑥𝑛𝑦|W_{S}f_{\theta}({\bar{x}}{1})-y|,\dots,|W{S}f_{\theta}({\bar{x}}{n})-y| are independent random variables since WSsubscript𝑊𝑆W{S} and x¯isubscript¯𝑥𝑖{\bar{x}}_{i} are independent and y𝑦y is fixed here. Thus, by using Hoeffding’s inequality (Lemma 1), and taking union bounds over y∈𝒴𝑦𝒴y\in{\tilde{\mathcal{Y}}}, we have that with probability at least 1−δ1𝛿1-\delta, the following holds for all y∈𝒴𝑦𝒴y\in{\tilde{\mathcal{Y}}}:

We will now analyze the term 1n∑i=1n‖φ(xi)‖+1n∑i=1n‖φ(x¯i)‖1𝑛superscriptsubscript𝑖1𝑛norm𝜑subscript𝑥𝑖1𝑛superscriptsubscript𝑖1𝑛norm𝜑subscript¯𝑥𝑖\frac{1}{n}\sum_{i=1}^{n}|\varphi(x_{i})|+\frac{1}{n}\sum_{i=1}^{n}|\varphi({\bar{x}}{i})| on the right-hand side of equation C.21. Since W∗=WS¯superscript𝑊subscript𝑊¯𝑆W^{*}=W{\bar{S}},

Moreover, by using (Mohri et al., 2012, Theorem 3.1) with the loss function x+↦‖g∗(x+)−Wf(x+)‖maps-tosuperscript𝑥normsuperscript𝑔superscript𝑥𝑊𝑓superscript𝑥x^{+}\mapsto|g^{*}(x^{+})-Wf(x^{+})| (i.e., Lemma 2), we have that for any δ>0𝛿0\delta>0, with probability at least 1−δ1𝛿1-\delta,

where ℛ~~m(𝒲∘ℱ)=1m𝔼S¯,ξ[supW∈𝒲,f∈ℱ∑i=1mξi‖g∗(xi+)−Wf(xi+)‖]subscript~~ℛ𝑚𝒲ℱ1𝑚subscript𝔼¯𝑆𝜉delimited-[]subscriptsupremumformulae-sequence𝑊𝒲𝑓ℱsuperscriptsubscript𝑖1𝑚subscript𝜉𝑖normsuperscript𝑔subscriptsuperscript𝑥𝑖𝑊𝑓subscriptsuperscript𝑥𝑖\tilde{\mathcal{R}}{m}(\mathcal{W}\circ\mathcal{F})=\frac{1}{\sqrt{m}}\mathbb{E}{{\bar{S}},\xi}[\sup_{W\in\mathcal{W},f\in\mathcal{F}}\sum_{i=1}^{m}\xi_{i}|g^{}(x^{+}{i})-Wf(x^{+}{i})|] is the normalized Rademacher complexity of the set {x+↦‖g∗(x+)−Wf(x+)‖:W∈𝒲,f∈ℱ}:maps-tosuperscript𝑥normsuperscript𝑔superscript𝑥𝑊𝑓superscript𝑥formulae-sequence𝑊𝒲𝑓ℱ{x^{+}\mapsto|g^{}(x^{+})-Wf(x^{+})|:W\in\mathcal{W},f\in\mathcal{F}} (it is normalized such that ℛ~~m(ℱ)=O(1)subscript~~ℛ𝑚ℱ𝑂1\tilde{\mathcal{R}}{m}(\mathcal{F})=O(1) as m→∞→𝑚m\rightarrow\infty for typical choices of ℱℱ\mathcal{F}), and ξ1,…,ξmsubscript𝜉1…subscript𝜉𝑚\xi{1},\dots,\xi_{m} are independent uniform random variables taking values in {−1,1}11{-1,1}. Takinng union bounds, we have that for any δ>0𝛿0\delta>0, with probability at least 1−δ1𝛿1-\delta,

Here, since Wfθ(xi+)∈ℝr𝑊subscript𝑓𝜃subscriptsuperscript𝑥𝑖superscriptℝ𝑟Wf_{\theta}(x^{+}_{i})\in\mathbb{R}^{r}, we have that

where Ir∈ℝr×rsubscript𝐼𝑟superscriptℝ𝑟𝑟I_{r}\in\mathbb{R}^{r\times r} is the identity matrix, and [fθ(xi+)⊤⊗Ir]∈ℝr×drdelimited-[]tensor-productsubscript𝑓𝜃superscriptsubscriptsuperscript𝑥𝑖topsubscript𝐼𝑟superscriptℝ𝑟𝑑𝑟[f_{\theta}(x^{+}{i})^{\top}\otimes I{r}]\in\mathbb{R}^{r\times dr} is the Kronecker product of the two matrices, and vec⁡[W]∈ℝdrvec𝑊superscriptℝ𝑑𝑟\operatorname{vec}[W]\in\mathbb{R}^{dr} is the vectorization of the matrix W∈ℝr×d𝑊superscriptℝ𝑟𝑑W\in\mathbb{R}^{r\times d}. Thus, by defining Ai=[fθ(xi+)⊤⊗Ir]∈ℝr×drsubscript𝐴𝑖delimited-[]tensor-productsubscript𝑓𝜃superscriptsubscriptsuperscript𝑥𝑖topsubscript𝐼𝑟superscriptℝ𝑟𝑑𝑟A_{i}=[f_{\theta}(x^{+}{i})^{\top}\otimes I{r}]\in\mathbb{R}^{r\times dr} and using the notation of w=vec⁡[W]𝑤vec𝑊w=\operatorname{vec}[W] and its inverse W=vec−1⁡[w]𝑊superscriptvec1𝑤W=\operatorname{vec}^{-1}[w] (i.e., the inverse of the vectorization from ℝr×dsuperscriptℝ𝑟𝑑\mathbb{R}^{r\times d} to ℝdrsuperscriptℝ𝑑𝑟\mathbb{R}^{dr} with a fixed ordering), we can rewrite equation C.24 by

with gi=g∗(xi+)∈ℝrsubscript𝑔𝑖superscript𝑔subscriptsuperscript𝑥𝑖superscriptℝ𝑟g_{i}=g^{*}(x^{+}{i})\in\mathbb{R}^{r}. Since the function w↦∑i=1m‖gi−Aiw‖2maps-to𝑤subscriptsuperscript𝑚𝑖1superscriptnormsubscript𝑔𝑖subscript𝐴𝑖𝑤2w\mapsto\sum^{m}{i=1}|g_{i}-A_{i}w|^{2} is convex, a necessary and sufficient condition of the minimizer of this function is obtained by

In other words,

where (A⊤A)†superscriptsuperscript𝐴top𝐴†(A^{\top}A)^{\dagger} is the Moore–Penrose inverse of the matrix A⊤Asuperscript𝐴top𝐴A^{\top}A and Null⁡(A)Null𝐴\operatorname{Null}(A) is the null space of the matrix A𝐴A. Thus, the minimum norm solution is obtained by

where the inequality follows from the Jensen’s inequality and the concavity of the square root function. Thus, we have that

where W~=WS−W∗~𝑊subscript𝑊𝑆superscript𝑊{\tilde{W}}=W_{S}-W^{*} and 𝐏A=I−A(A⊤A)†A⊤subscript𝐏𝐴𝐼𝐴superscriptsuperscript𝐴top𝐴†superscript𝐴top\mathbf{P}_{A}=I-A(A^{\top}A)^{\dagger}A^{\top}.

where ‖W~‖2subscriptnorm~𝑊2|{\tilde{W}}|{2} is the spectral norm of W~~𝑊{\tilde{W}}. Since x¯isubscript¯𝑥𝑖{\bar{x}}{i} shares the same label with xisubscript𝑥𝑖x_{i} as x¯i∼𝒟yisimilar-tosubscript¯𝑥𝑖subscript𝒟subscript𝑦𝑖{\bar{x}}{i}\sim\mathcal{D}{y_{i}} (and xi∼𝒟yisimilar-tosubscript𝑥𝑖subscript𝒟subscript𝑦𝑖x_{i}\sim\mathcal{D}{y{i}}), and because fθsubscript𝑓𝜃f_{\theta} is trained with the unlabeled data S¯¯𝑆{\bar{S}}, using Hoeffding’s inequality (Lemma 1) implies that with probability at least 1−δ1𝛿1-\delta,

Define ZS¯=[f(x1+),…,f(xm+)]∈ℝd×msubscript𝑍¯𝑆𝑓subscriptsuperscript𝑥1…𝑓subscriptsuperscript𝑥𝑚superscriptℝ𝑑𝑚{Z_{\bar{S}}}=[f(x^{+}{1}),\dots,f(x^{+}{m})]\in\mathbb{R}^{d\times m}. Then, we have A=[ZS¯⊤⊗Ir]𝐴delimited-[]tensor-productsuperscriptsubscript𝑍¯𝑆topsubscript𝐼𝑟A=[{Z_{\bar{S}}}^{\top}\otimes I_{r}]. Thus,

where 𝐏ZS¯=Im−ZS¯⊤(ZS¯ZS¯⊤)†ZS¯∈ℝm×msubscript𝐏subscript𝑍¯𝑆subscript𝐼𝑚superscriptsubscript𝑍¯𝑆topsuperscriptsubscript𝑍¯𝑆superscriptsubscript𝑍¯𝑆top†subscript𝑍¯𝑆superscriptℝ𝑚𝑚\mathbf{P}{Z{\bar{S}}}=I_{m}-{Z_{\bar{S}}}^{\top}({Z_{\bar{S}}}{Z_{\bar{S}}}^{\top})^{\dagger}{Z_{\bar{S}}}\in\mathbb{R}^{m\times m}. By defining YS¯=[g∗(x1+),…,g∗(xm+)]⊤∈ℝm×rsubscript𝑌¯𝑆superscriptsuperscript𝑔subscriptsuperscript𝑥1…superscript𝑔subscriptsuperscript𝑥𝑚topsuperscriptℝ𝑚𝑟Y_{\bar{S}}=[g^{}(x^{+}_{1}),\dots,g^{}(x^{+}{m})]^{\top}\in\mathbb{R}^{m\times r}, since g=vec⁡[YS¯⊤]𝑔vecsuperscriptsubscript𝑌¯𝑆topg=\operatorname{vec}[Y{\bar{S}}^{\top}],

On the other hand, recall that WSsubscript𝑊𝑆W_{S} is the minimum norm solution as

where ZS=[f(x1),…,f(xn)]∈ℝd×nsubscript𝑍𝑆𝑓subscript𝑥1…𝑓subscript𝑥𝑛superscriptℝ𝑑𝑛{Z_{S}}=[f(x_{1}),\dots,f(x_{n})]\in\mathbb{R}^{d\times n} and YS=[y1,…,yn]⊤∈ℝn×rsubscript𝑌𝑆superscriptsubscript𝑦1…subscript𝑦𝑛topsuperscriptℝ𝑛𝑟Y_{S}=[y_{1},\dots,y_{n}]^{\top}\in\mathbb{R}^{n\times r}. Then,

We use the following well-known theorems as lemmas in our proof. We put these below for completeness. These are classical results and not our results.

(Hoeffding’s inequality) Let X1,…,Xnsubscript𝑋1…subscript𝑋𝑛X_{1},...,X_{n} be independent random variables such that a≤Xi≤b𝑎subscript𝑋𝑖𝑏{\displaystyle a\leq X_{i}\leq b} almost surely. Consider the average of these random variables, Sn=1n(X1+⋯+Xn).subscript𝑆𝑛1𝑛subscript𝑋1⋯subscript𝑋𝑛{\displaystyle S_{n}=\frac{1}{n}(X_{1}+\cdots+X_{n}).} Then, for all t>0𝑡0t>0,

Proof By using Hoeffding’s inequality, we have that for all t>0𝑡0t>0,

Setting δ=exp⁡(−2nt2(b−a)2)𝛿2𝑛superscript𝑡2superscript𝑏𝑎2\delta=\exp\left(-{\frac{2nt^{2}}{(b-a)^{2}}}\right) and solving for t>0𝑡0t>0,

It has been shown that generalization bounds can be obtained via Rademacher complexity (Bartlett and Mendelson, 2002; Mohri et al., 2012; Shalev-Shwartz and Ben-David, 2014). The following is a trivial modification of (Mohri et al., 2012, Theorem 3.1) for a one-sided bound on the nonnegative general loss functions:

Let 𝒢𝒢\mathcal{G} be a set of functions with the codomain [0,M]0𝑀[0,M]. Then, for any δ>0𝛿0\delta>0, with probability at least 1−δ1𝛿1-\delta over an i.i.d. draw of m𝑚m samples S=(qi)i=1m𝑆superscriptsubscriptsubscript𝑞𝑖𝑖1𝑚S=(q_{i})_{i=1}^{m}, the following holds for all ψ∈𝒢𝜓𝒢\psi\in\mathcal{G}:

where ℛm(𝒢):=𝔼S,ξ[supψ∈𝒢1m∑i=1mξiψ(qi)]assignsubscriptℛ𝑚𝒢subscript𝔼𝑆𝜉delimited-[]subscriptsupremum𝜓𝒢1𝑚superscriptsubscript𝑖1𝑚subscript𝜉𝑖𝜓subscript𝑞𝑖\mathcal{R}{m}(\mathcal{G}):=\mathbb{E}{S,\xi}[\sup_{\psi\in\mathcal{G}}\frac{1}{m}\sum_{i=1}^{m}\xi_{i}\psi(q_{i})] and ξ1,…,ξmsubscript𝜉1…subscript𝜉𝑚\xi_{1},\dots,\xi_{m} are independent uniform random variables taking values in {−1,1}11{-1,1}.

Proof Let S=(qi)i=1m𝑆superscriptsubscriptsubscript𝑞𝑖𝑖1𝑚S=(q_{i}){i=1}^{m} and S′=(qi′)i=1msuperscript𝑆′superscriptsubscriptsuperscriptsubscript𝑞𝑖′𝑖1𝑚S^{\prime}=(q{i}^{\prime})_{i=1}^{m}. Define

To apply McDiarmid’s inequality to φ(S)𝜑𝑆\varphi(S), we compute an upper bound on |φ(S)−φ(S′)|𝜑𝑆𝜑superscript𝑆′|\varphi(S)-\varphi(S^{\prime})| where S𝑆S and S′superscript𝑆′S^{\prime} be two test datasets differing by exactly one point of an arbitrary index i0subscript𝑖0i_{0}; i.e., Si=Si′subscript𝑆𝑖subscriptsuperscript𝑆′𝑖S_{i}=S^{\prime}{i} for all i≠i0𝑖subscript𝑖0i\neq i{0} and Si0≠Si0′subscript𝑆subscript𝑖0subscriptsuperscript𝑆′subscript𝑖0S_{i_{0}}\neq S^{\prime}{i{0}}. Then,

where the first line follows the definitions of each term, the second line uses Jensen’s inequality and the convexity of the supremum, and the third line follows that for each ξi∈{−1,+1}subscript𝜉𝑖11\xi_{i}\in{-1,+1}, the distribution of each term ξi(ℓ(f(xi′),yi′)−ℓ(f(xi),yi))subscript𝜉𝑖ℓ𝑓subscriptsuperscript𝑥′𝑖subscriptsuperscript𝑦′𝑖ℓ𝑓subscript𝑥𝑖subscript𝑦𝑖\xi_{i}(\ell(f(x^{\prime}{i}),y^{\prime}{i})-\ell(f(x_{i}),y_{i})) is the distribution of (ℓ(f(xi′),yi′)−ℓ(f(xi),yi))ℓ𝑓subscriptsuperscript𝑥′𝑖subscriptsuperscript𝑦′𝑖ℓ𝑓subscript𝑥𝑖subscript𝑦𝑖(\ell(f(x^{\prime}{i}),y^{\prime}{i})-\ell(f(x_{i}),y_{i})) since S𝑆S and S′superscript𝑆′S^{\prime} are drawn iid with the same distribution. The fourth line uses the subadditivity of supremum.

In contrastive learning, different augmented views of the same image are attracted (positive pairs), while different augmented views are repelled (negative pairs). MoCo (He et al., 2020) and SimCLR (Chen et al., 2020) are recent examples of self-supervised visual representation learning that reduce the gap between self-supervised and fully-supervised learning. SimCLR applies randomized augmentations to an image to create two different views, x𝑥x and y𝑦y, and encodes both of them with a shared encoder, producing representations rxsubscript𝑟𝑥r_{x} and rysubscript𝑟𝑦r_{y}. Both rxsubscript𝑟𝑥r_{x} and rysubscript𝑟𝑦r_{y} are l2𝑙2l2-normalized. The SimCLR version of the InfoNCE objective is:

where η𝜂\eta is a temperature term and K𝐾K is the number of views in a minibatch.

Entropy estimation is one of the classical problems in information theory, where Gaussian mixture density is one of the most popular representations. With a sufficient number of components, they can approximate any smooth function with arbitrary accuracy. For Gaussian mixtures, there is, however, no closed-form solution to differential entropy. There exist several approximations in the literature, including loose upper and lower bounds (Huber et al., 2008). Monte Carlo (MC) sampling is one way to approximate Gaussian mixture entropy. With sufficient MC samples, an unbiased estimate of entropy with an arbitrarily accurate can be obtained. Unfortunately, MC sampling is a very computationally expensive and typically requires a large number of samples, especially in high dimensions (Brewer, 2017). Using the first two moments of the empirical distribution, VIGCreg used one of the most straightforward approaches for approximating the entropy. Despite this, previous studies have found that this method is a poor approximation of the entropy in many cases Huber et al. (2008). Another options is to use the LogDet function. Several estimators have been proposed to implement it, including uniformly minimum variance unbiased (UMVU) (Ahmed and Gokhale, 1989), and bayesian methods Misra et al. (2005). These methods, however, often require complex optimizations. The LogDet estimator presented in Zhouyin and Liu (2021) used the differential entropy α𝛼\alpha order entropy using scaled noise. They demonstrated that it can be applied to high-dimensional features and is robust to random noise. Based on Taylor-series expansions, Huber et al. (2008) presented a lower bound for the entropy of Gaussian mixture random vectors. They use Taylor-series expansions of the logarithm of each Gaussian mixture component to get an analytical evaluation of the entropy measure. In addition, they present a technique for splitting Gaussian densities to avoid components with high variance, which would require computationally expensive calculations. Kolchinsky and Tracey (2017) introduce a novel family of estimators for the mixture entropy. For this family, a pairwise-distance function between component densities defined for each member. These estimators are computationally efficient, as long as the pairwise-distance function and the entropy of each component distribution are easy to compute. Moreover, the estimator is continuous and smooth and is therefore useful for optimization problems. In addition, they presented both lower bound (using Chernoff distance) and an upper bound (using the KL divergence) on the entropy, which are are exact when the component distributions are grouped into well-separated clusters,

Let us examine a toy dataset on the pattern of two intertwining moons to illustrate the collapse phenomenon under GMM (Figure 1 - right). We begin by training a classical GMM with maximum likelihood, where the means are initialized based on random samples, and the covariance is used as the identity matrix. A red dot represents the Gaussian’s mean after training, while a blue dot represents the data points. In the presence of fixed input samples, we observe that there is no collapsing and that the entropy of the centers is high (Figure 4 - left, in the Appendix). However, when we make the input samples trainable and optimize their location, all the points collapse into a single point, resulting in a sharp decrease in entropy (Figure 4 - right, in the Appendix).

We adopt the following data-generating process model that is used in the previous paper on analyzing contrastive learning (Saunshi et al., 2019; Ben-Ari and Shwartz-Ziv, 2018). For the labeled data, first, y𝑦y is drawn from the distritbuion ρ𝜌\rho on 𝒴𝒴\mathcal{Y}, and then x𝑥x is drawn from the conditional distribution 𝒟ysubscript𝒟𝑦\mathcal{D}{y} conditioned on the label y𝑦y. That is, we have the join distribution 𝒟(x,y)=𝒟y(x)ρ(y)𝒟𝑥𝑦subscript𝒟𝑦𝑥𝜌𝑦\mathcal{D}(x,y)=\mathcal{D}{y}(x)\rho(y) with ((xi,yi))i=1n∼𝒟nsimilar-tosuperscriptsubscriptsubscript𝑥𝑖subscript𝑦𝑖𝑖1𝑛superscript𝒟𝑛((x_{i},y_{i})){i=1}^{n}\sim\mathcal{D}^{n}. For the unlabeled data, first, each of the unknown labels y+superscript𝑦y^{+} and y−superscript𝑦y^{-} is drawn from the distritbuion ρ𝜌\rho, and then each of the positive examples x+superscript𝑥x^{+} and x++superscript𝑥absentx^{++} is drawn from the conditional distribution 𝒟y+subscript𝒟superscript𝑦\mathcal{D}{y^{+}} while the negative example x−superscript𝑥x^{-} is drawn from the 𝒟y−subscript𝒟superscript𝑦\mathcal{D}{y^{-}}. Unlike the analysis of contrastive learning, we do not require the negative samples. Let τS¯subscript𝜏¯𝑆\tau{{\bar{S}}} be a data-dependent upper bound on the invariance loss with the trained representation as ‖fθ(x¯)−fθ(x)‖≤τS¯normsubscript𝑓𝜃¯𝑥subscript𝑓𝜃𝑥subscript𝜏¯𝑆|f_{\theta}({\bar{x}})-f_{\theta}(x)|\leq\tau_{{\bar{S}}} for all (x¯,x)∼𝒟y2similar-to¯𝑥𝑥superscriptsubscript𝒟𝑦2({\bar{x}},x)\sim\mathcal{D}{y}^{2} and y∈𝒴𝑦𝒴y\in\mathcal{Y}. Let τ𝜏\tau be a data-independent upper bound on the invariance loss with the trained representation as‖f(x¯)−f(x)‖≤τnorm𝑓¯𝑥𝑓𝑥𝜏|f({\bar{x}})-f(x)|\leq\tau for all (x¯,x)∼𝒟y2similar-to¯𝑥𝑥superscriptsubscript𝒟𝑦2({\bar{x}},x)\sim\mathcal{D}{y}^{2}, y∈𝒴𝑦𝒴y\in\mathcal{Y}, and f∈ℱ𝑓ℱf\in\mathcal{F}. For the simplicity, we assume that there exists a function g∗superscript𝑔g^{} such that y=g∗(x)∈ℝr𝑦superscript𝑔𝑥superscriptℝ𝑟y=g^{}(x)\in\mathbb{R}^{r} for all (x,y)∈𝒳×𝒴𝑥𝑦𝒳𝒴(x,y)\in\mathcal{X}\times\mathcal{Y}. Discarding this assumption adds the average of label noises to the final result, which goes to zero as the sample sizes n𝑛n and m𝑚m increase, assuming that the mean of the label noise is zero.

Refer to caption Left: The network output SSL training is more Gaussian for small input noise. The p𝑝p-value of the normality test for different SSL models trained on CIFAR-10 for different input noise levels. The dashed line represents the point at which the null hypothesis (Gaussian distribution) can be rejected with 99%percent9999% confidence. Right: The Gaussians around each point are not overlapping The plots show the l2𝑙2l2 distances between raw images for different datasets. As can be seen, the distances are largest for more complex real-world datasets.

Refer to caption The entropy for the SSL models VICReg decreased during the training. The entropy (measured by the LogDet Entropy estimator) as a function of the number of steps during training for VICReg and SimCLR and BYOL. Additionally, SimCLR entropy estimation is tighter compared to the others.

Refer to caption Evolution of the entropy for each of the learning rate configurations showing that the impact of picking the incorrect learning rate for the data and/or centroids lead to a collapse of the samples.

Refer to caption Evolution of GMM training when enforcing a one-to-one mapping between the data and centroids akin to K-means i.e. using a small and fixed covariance matrix. We see that collapse does not occur. Left - In the presence of fixed input samples, we observe that there is no collapsing and that the entropy of the centers is high. Right - when we make the input samples trainable and optimize their location, all the points collapse into a single point, resulting in a sharp decrease in entropy.

$$ Z\sim\sum_{n=1}^{N}\mathcal{N}\left(\bm{A}{\omega(\bm{x}^{*}{n})}\bm{x}^{}{n}+\bm{b}{\omega(\bm{x}^{}{n})},\bm{A}^{T}{\omega(\bm{x}^{}{n})}\Sigma{\bm{x}^{}{n}}\bm{A}{\omega(\bm{x}^{*}_{n})}\right)^{T=n}, $$ \tag{A2.Ex5}

$$ \frac{1}{n}\sum_{i=1}^{n}|{\tilde{W}}f_{\theta}({\bar{x}}{i})|\leq L{S}(w)+\frac{1}{n}\sum_{i=1}^{n}|{\tilde{W}}(f_{\theta}({\bar{x}}{i})-f{\theta}(x_{i}))|+\frac{1}{n}\sum_{i=1}^{n}|\varphi(x_{i})|. $$ \tag{A3.Ex18}

$$ \mathcal{I}{y}={i\in[n]:y{i}=y}. $$ \tag{A3.Ex27}

$$ \mathbb{E}{X{y}}[|W_{S}f_{\theta}(X_{y})-y|]-\frac{1}{|\mathcal{I}{y}|}\sum{i\in\mathcal{I}{y}}|W{S}f_{\theta}({\bar{x}}{i})-y|\leq\kappa{S}\sqrt{\frac{\ln(|{\tilde{\mathcal{Y}}}|/\delta)}{2|\mathcal{I}_{y}|}}. $$ \tag{A3.Ex33}

$$ {\hat{p}}(y)=\frac{|\mathcal{I}_{y}|}{n}. $$ \tag{A3.Ex38}

$$ \frac{1}{n}\sum_{i=1}^{n}|\varphi(x_{i})|\leq\frac{1}{m}\sum_{i=1}^{m}|g^{*}(x^{+}{i})-W{{\bar{S}}}f_{\theta}(x^{+}{i})|+\frac{2\tilde{\mathcal{R}}{m}(\mathcal{W}\circ\mathcal{F})}{\sqrt{m}}+\kappa\sqrt{\frac{\ln(2/\delta)}{2m}}+\kappa_{{\bar{S}}}\sqrt{\frac{\ln(2/\delta)}{2n}} $$ \tag{A3.Ex50}

$$ W_{\bar{S}}=\operatorname{vec}^{-1}[w_{\bar{S}}]\quad\text{where }\quad w_{\bar{S}}=\mathop{\mathrm{minimize}}{w^{\prime}}|w^{\prime}|{F}\text{ s.t. }w^{\prime}\in\operatorname*{arg,min}{w}\sum{i=1}^{m}|g_{i}-A_{i}w|^{2}, $$ \tag{A3.Ex54}

$$ 0=\nabla_{w}\sum^{m}{i=1}|g{i}-A_{i}w|^{2}=2\sum^{m}{i=1}A{i}^{\top}(g_{i}-A_{i}w)\in\mathbb{R}^{dr} $$ \tag{A3.Ex55}

$$ \sum^{m}{i=1}A{i}^{\top}A_{i}w=\sum^{m}{i=1}A{i}^{\top}g_{i}. $$ \tag{A3.Ex56}

$$ \operatorname{vec}[W_{\bar{S}}]=w_{\bar{S}}=(A^{\top}A)^{\dagger}A^{\top}g. $$ \tag{A3.Ex59}

$$ \mathbf{P}{A}=I-[{Z{\bar{S}}}^{\top}\otimes I_{r}][{Z_{\bar{S}}}{Z_{\bar{S}}}^{\top}\otimes I_{r}]^{\dagger}[{Z_{\bar{S}}}\otimes I_{r}]=I-[{Z_{\bar{S}}}^{\top}({Z_{\bar{S}}}{Z_{\bar{S}}}^{\top})^{\dagger}{Z_{\bar{S}}}\otimes I_{r}]=[\mathbf{P}{Z{\bar{S}}}\otimes I_{r}] $$ \tag{A3.Ex77}

$$ \mathbb{P}{S}\left(\mathrm{E}\left[S{n}\right]-S_{n}\geq(b-a)\sqrt{\frac{\ln(1/\delta)}{2n}}\right)\leq\delta, $$ \tag{A4.Ex88}

$$ \displaystyle f(\bm{z})=\sum\nolimits_{\omega\in\Omega}(\bm{A}{\omega}\bm{z}+\bm{b}{\omega})\mathbbm{1}_{{\bm{z}\in\omega}}, $$

$$ \displaystyle\SwapAboveDisplaySkip\mathcal{L}=\frac{1}{K}\sum_{k=1}^{K}(\alpha\text{Var}(Z_{k})+\beta\text{Cov}(Z_{k},Z_{k^{\prime}}))+\gamma\text{Inv}(Z_{k},Z_{k^{\prime}}), $$

$$ \displaystyle\text{Cov}(Z_{k},Z_{k^{\prime}}) $$

$$ \displaystyle X\sim\sum_{n=1}^{N}\mathcal{N}(\bm{x}^{}{n},\Sigma{\bm{x}^{}{n}})^{1{{T=n}}},T\sim{\rm Cat}(N), $$

$$ \displaystyle\begin{split}\mathbb{E}{x^{\prime}}\left[\log q(z|x^{\prime})\right]&\geq\mathbb{E}{z^{\prime}|x^{\prime}}\left[\log q(z|z^{\prime})\right]=\frac{1}{2}(d\log 2\pi-\left(z-\mu(x^{\prime})\right)^{2}-\text{Tr}\log\Sigma(x^{\prime})).\end{split} $$

$$ \displaystyle L\approx\sum_{n=1}^{N}{\log\frac{|\Sigma_{Z}|}{|\Sigma(x_{i})|\cdot|\Sigma(x_{i}^{\prime})|}}-{\frac{1}{2}(\mu(x)-\mu(x^{\prime}))^{2}}. $$

$$ \displaystyle{Z_{S}}=[f(x_{1}),\dots,f(x_{n})]\in\mathbb{R}^{d\times n}\qquad\text{and}\qquad{Z_{\bar{S}}}=[f(x^{+}{1}),\dots,f(x^{+}{m})]\in\mathbb{R}^{d\times m}, $$

$$ \displaystyle\mathbf{P}{Z{S}}=I-{Z_{S}}^{\top}({Z_{S}}{Z_{S}}^{\top})^{\dagger}{Z_{S}}\qquad\text{and}\qquad\mathbf{P}{Z{\bar{S}}}=I-{Z_{\bar{S}}}^{\top}({Z_{\bar{S}}}{Z_{\bar{S}}}^{\top})^{\dagger}{Z_{\bar{S}}}. $$

$$ \displaystyle\tilde{\mathcal{R}}{m}(\mathcal{F})=\frac{1}{\sqrt{m}}\mathbb{E}{{\bar{S}},\xi}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{m}\xi_{i}|f(x^{+}{i})-f(x^{++}{i})|\right], $$

$$ \displaystyle\begin{split}&\mathbb{E}{x,y}[\ell{x,y}(w_{S})]\leq I_{{\bar{S}}}(f_{\theta})+\frac{2}{\sqrt{m}}|\mathbf{P}{Z{\bar{S}}}Y_{\bar{S}}|{F}+\frac{1}{\sqrt{n}}|\mathbf{P}{Z_{S}}Y_{S}|{F}+\frac{2\tilde{\mathcal{R}}{m}(\mathcal{F})}{\sqrt{m}}+\mathcal{Q}_{m,n},\end{split} $$

$$ \displaystyle\begin{split}=&\frac{d}{2}\log 2\pi-\frac{1}{2}\mathbb{E}{\epsilon}\left[\left(\mu(x)-\mu(x^{\prime})\right)^{2}\right]+\mathbb{E}{\epsilon}\left[\left(\mu(x)-\mu(x^{\prime})\right)L(x)\epsilon\right]\ &-\frac{1}{2}\mathbb{E}_{\epsilon}\left[\epsilon^{T}L(x)^{T}L(x)\epsilon\right]-\frac{1}{2}Tr\log\Sigma(x^{\prime})\end{split} $$

$$ \displaystyle\quad+\kappa_{S}\sqrt{\frac{2\ln(6|\mathcal{Y}|/\delta)}{2n}}\sum_{y\in\mathcal{Y}}\left(\sqrt{{\hat{p}}(y)}+\sqrt{p(y)}\right) $$

$$ \displaystyle\mathbb{E}{X,Y}[|W{S}f_{\theta}(X)-Y|] $$

$$ \displaystyle\leq\kappa_{S}\left(\sum_{y\in{\tilde{\mathcal{Y}}}}\sqrt{{\hat{p}}(y)}\right)\sqrt{\frac{\ln(2|{\tilde{\mathcal{Y}}}|/\delta)}{2n}}+\kappa_{S}\left(\sum_{y\in\mathcal{Y}}\sqrt{p(y)}\right)\sqrt{\frac{2\ln(2|\mathcal{Y}|/\delta)}{2n}} $$

$$ \displaystyle A^{\top}Aw=A^{\top}g\quad\text{ where }A=\begin{bmatrix}A_{1}\ A_{2}\ \vdots\ A_{m}\ \end{bmatrix}\in\mathbb{R}^{mr\times dr}\text{ and }g=\begin{bmatrix}g_{1}\ g_{2}\ \vdots\ g_{m}\ \end{bmatrix}\in\mathbb{R}^{mr} $$

$$ \displaystyle\mathbb{E}{q}[\psi(q)]\leq\frac{1}{m}\sum{i=1}^{m}\psi(q_{i})+2\mathcal{R}_{m}(\mathcal{G})+M\sqrt{\frac{\ln(1/\delta)}{2m}}, $$

$$ \displaystyle\varphi(S^{\prime})-\varphi(S)\leq\sup_{\psi\in\mathcal{G}}\frac{\psi(q_{i_{0}})-\psi(q^{\prime}{i{0}})}{m}\leq\frac{M}{m}. $$

$$ \displaystyle\mathbb{E}{x,y}\left[-\log\left(\frac{e^{\frac{1}{\eta}r{y}^{T}r_{x}}}{\sum_{k=1}^{K}{e^{\frac{1}{\eta}r_{y_{k}}^{T}r_{x}}}}\right)\right], $$

Thm. Theorem 1 Given the setting of Equation 8, the unconditional DNN output density, Z𝑍Z, is approximately a mixture of the affinely transformed distributions 𝐱|𝐱n(𝐱)∗conditional𝐱subscriptsuperscript𝐱𝑛𝐱\bm{x}|\bm{x}^{}{n(\bm{x})}: Z∼∑n=1N𝒩(𝑨ω(𝒙n∗)𝒙n∗+𝒃ω(𝒙n∗),𝑨ω(𝒙n∗)TΣ𝒙n∗𝑨ω(𝒙n∗))1{T=n},similar-to𝑍superscriptsubscript𝑛1𝑁𝒩superscriptsubscript𝑨𝜔subscriptsuperscript𝒙𝑛subscriptsuperscript𝒙𝑛subscript𝒃𝜔subscriptsuperscript𝒙𝑛subscriptsuperscript𝑨𝑇𝜔subscriptsuperscript𝒙𝑛subscriptΣsubscriptsuperscript𝒙𝑛subscript𝑨𝜔subscriptsuperscript𝒙𝑛subscript1𝑇𝑛\displaystyle Z\sim\sum{n=1}^{N}\mathcal{N}\left(\bm{A}_{\omega(\bm{x}^{}{n})}\bm{x}^{*}{n}+\bm{b}{\omega(\bm{x}^{*}{n})},\bm{A}^{T}{\omega(\bm{x}^{*}{n})}\Sigma_{\bm{x}^{}{n}}\bm{A}{\omega(\bm{x}^{}{n})}\right)^{1{{T=n}}}, where ω(𝐱n∗)=ω∈Ω⇔𝐱n∗∈ωiff𝜔subscriptsuperscript𝐱𝑛𝜔Ωsubscriptsuperscript𝐱𝑛𝜔\omega(\bm{x}^{}_{n})=\omega\in\Omega\iff\bm{x}^{}{n}\in\omega is the partition region in which the prototype 𝐱n∗subscriptsuperscript𝐱𝑛\bm{x}^{*}{n} lives in.

Thm. Theorem 2 (Informal version). For any δ>0𝛿0\delta>0, with probability at least 1−δ1𝛿1-\delta, 𝔼x,y[ℓx,y(wS)]≤IS¯(fθ)+2m‖𝐏ZS¯YS¯‖F+1n‖𝐏ZSYS‖F+2ℛm(ℱ)m+𝒬m,n,subscript𝔼𝑥𝑦delimited-[]subscriptℓ𝑥𝑦subscript𝑤𝑆subscript𝐼¯𝑆subscript𝑓𝜃2𝑚subscriptdelimited-∥∥subscript𝐏subscript𝑍¯𝑆subscript𝑌¯𝑆𝐹1𝑛subscriptdelimited-∥∥subscript𝐏subscript𝑍𝑆subscript𝑌𝑆𝐹2subscriptℛ𝑚ℱ𝑚subscript𝒬𝑚𝑛\displaystyle\begin{split}&\mathbb{E}{x,y}[\ell{x,y}(w_{S})]\leq I_{{\bar{S}}}(f_{\theta})+\frac{2}{\sqrt{m}}|\mathbf{P}{Z{\bar{S}}}Y_{\bar{S}}|{F}+\frac{1}{\sqrt{n}}|\mathbf{P}{Z_{S}}Y_{S}|{F}+\frac{2\tilde{\mathcal{R}}{m}(\mathcal{F})}{\sqrt{m}}+\mathcal{Q}{m,n},\end{split} (18) where 𝒬m,n=O(Gln⁡(1/δ)/m+ln⁡(1/δ)/n)→0subscript𝒬𝑚𝑛𝑂𝐺1𝛿𝑚1𝛿𝑛→0\mathcal{Q}{m,n}=O(G\sqrt{\ln(1/\delta)/m}+\sqrt{\ln(1/\delta)/n})\rightarrow 0 as m,n→∞→𝑚𝑛m,n\rightarrow\infty. In 𝒬m,nsubscript𝒬𝑚𝑛\mathcal{Q}{m,n}, the value of G𝐺G for the term decaying at the rate 1/m1𝑚1/\sqrt{m} depends on the hypothesis space of fθsubscript𝑓𝜃f{\theta} and w𝑤w whereas the term decaying at the rate 1/n1𝑛1/\sqrt{n} is independent of any hypothesis space.

Lemma. Lemma 1 (Hoeffding’s inequality) Let X1,…,Xnsubscript𝑋1…subscript𝑋𝑛X_{1},...,X_{n} be independent random variables such that a≤Xi≤b𝑎subscript𝑋𝑖𝑏{\displaystyle a\leq X_{i}\leq b} almost surely. Consider the average of these random variables, Sn=1n(X1+⋯+Xn).subscript𝑆𝑛1𝑛subscript𝑋1⋯subscript𝑋𝑛{\displaystyle S_{n}=\frac{1}{n}(X_{1}+\cdots+X_{n}).} Then, for all t>0𝑡0t>0, ℙS(E[Sn]−Sn≥(b−a)ln⁡(1/δ)2n)≤δ,subscriptℙ𝑆Edelimited-[]subscript𝑆𝑛subscript𝑆𝑛𝑏𝑎1𝛿2𝑛𝛿\mathbb{P}{S}\left(\mathrm{E}\left[S{n}\right]-S_{n}\geq(b-a)\sqrt{\frac{\ln(1/\delta)}{2n}}\right)\leq\delta, and ℙS(Sn−E[Sn]≥(b−a)ln⁡(1/δ)2n)≤δ.subscriptℙ𝑆subscript𝑆𝑛Edelimited-[]subscript𝑆𝑛𝑏𝑎1𝛿2𝑛𝛿\mathbb{P}{S}\left(S{n}-\mathrm{E}\left[S_{n}\right]\geq(b-a)\sqrt{\frac{\ln(1/\delta)}{2n}}\right)\leq\delta.

Lemma. Lemma 2 Let 𝒢𝒢\mathcal{G} be a set of functions with the codomain [0,M]0𝑀[0,M]. Then, for any δ>0𝛿0\delta>0, with probability at least 1−δ1𝛿1-\delta over an i.i.d. draw of m𝑚m samples S=(qi)i=1m𝑆superscriptsubscriptsubscript𝑞𝑖𝑖1𝑚S=(q_{i}){i=1}^{m}, the following holds for all ψ∈𝒢𝜓𝒢\psi\in\mathcal{G}: 𝔼q[ψ(q)]≤1m∑i=1mψ(qi)+2ℛm(𝒢)+Mln⁡(1/δ)2m,subscript𝔼𝑞delimited-[]𝜓𝑞1𝑚superscriptsubscript𝑖1𝑚𝜓subscript𝑞𝑖2subscriptℛ𝑚𝒢𝑀1𝛿2𝑚\displaystyle\mathbb{E}{q}[\psi(q)]\leq\frac{1}{m}\sum_{i=1}^{m}\psi(q_{i})+2\mathcal{R}{m}(\mathcal{G})+M\sqrt{\frac{\ln(1/\delta)}{2m}}, (D.35) where ℛm(𝒢):=𝔼S,ξ[supψ∈𝒢1m∑i=1mξiψ(qi)]assignsubscriptℛ𝑚𝒢subscript𝔼𝑆𝜉delimited-[]subscriptsupremum𝜓𝒢1𝑚superscriptsubscript𝑖1𝑚subscript𝜉𝑖𝜓subscript𝑞𝑖\mathcal{R}{m}(\mathcal{G}):=\mathbb{E}{S,\xi}[\sup{\psi\in\mathcal{G}}\frac{1}{m}\sum_{i=1}^{m}\xi_{i}\psi(q_{i})] and ξ1,…,ξmsubscript𝜉1…subscript𝜉𝑚\xi_{1},\dots,\xi_{m} are independent uniform random variables taking values in {−1,1}11{-1,1}.

LARGE Supplementary Material

Proof of Theorem ref{thm:1

Proof of Theorem 3. Let W = W S where W S is the the minimum norm solution as W S = minimize W ′ ∥ W ′ ∥ F s.t. W ′ ∈ arg min W 1 n ∑ n i =1 ∥ Wf θ ( x i ) -y i ∥ 2 . Let W ∗ = W ¯ S where W ¯ S is the minimum norm solution as W ∗ = W ¯ S = minimize W ′ ∥ W ′ ∥ F s.t. W ′ ∈ arg min W 1 m ∑ m i =1 ∥ Wf θ ( x + i ) -g ∗ ( x + i ) ∥ 2 . Since y = g ∗ ( x ) ,

where φ ( x ) = g ∗ ( x ) -W ∗ f θ ( x ) . Define L S ( w ) = 1 n ∑ n i =1 ∥ Wf θ ( x i ) -y i ∥ . Using these,

where ˜ W = W -W ∗ . We now consider new fresh samples ¯ x i ∼ D y i for i = 1 , . . . , n to rewrite the above further as:

This implies that

Combining these, we have that

To bound the left-hand side of equation 34, we now analyze the following random variable:

where ¯ y i = y i since ¯ x i ∼ D y i for i = 1 , . . . , n . Importantly, this means that as W S depends on y i , W S depends on ¯ y i . Thus, the collection of random variables ∥ W S f θ (¯ x 1 ) -¯ y 1 ∥ , . . . , ∥ W S f θ ( n n ) -¯ y n ∥ is not independent. Accordingly, we cannot apply standard concentration inequality to bound equation 35. A standard approach in learning theory is to first bound equation 35 by E x,y ∥ W S f θ ( x ) -y ∥ -1 n ∑ n i =1 ∥ W S f θ (¯ x i ) -¯ y i ∥ ≤ sup W ∈W E x,y ∥ Wf θ ( x ) -y ∥ -1 n ∑ n i =1 ∥ Wf θ (¯ x i ) -¯ y i ∥ for some hypothesis space W (that is independent of S ) and realize that the right-hand side now contains the collection of independent random variables ∥ Wf θ (¯ x 1 ) -¯ y 1 ∥ , . . . , ∥ Wf θ ( n n ) -¯ y n ∥ , for which we can utilize standard concentration inequalities. This reasoning leads to the Rademacher complexity of the hypothesis space W . However, the complexity of the hypothesis space W can be very large, resulting in a loose bound. In this proof, we show that we can avoid the dependency on hypothesis space W by using a very different approach with conditional expectations to take care the dependent random variables ∥ W S f θ (¯ x 1 ) -¯ y 1 ∥ , . . . , ∥ W S f θ ( n n ) -¯ y n ∥ . Intuitively, we utilize the fact that for these dependent random variables, there is a structure of conditional independence, conditioned on each y ∈ Y .

We first write the expected loss as the sum of the conditional expected loss:

where X y is the random variable for the conditional with Y = y . Using this, we decompose equation 35 into two terms:

where

where ˜ Y = { y ∈ Y : |I y | = 0 } . Substituting these into equation equation 36 yields

Importantly, while ∥ W S f θ (¯ x 1 ) -¯ y 1 ∥ , . . . , ∥ W S f θ (¯ x n ) -¯ y n ∥ on the right-hand side of equation 37 are dependent random variables, ∥ W S f θ (¯ x 1 ) -y ∥ , . . . , ∥ W S f θ (¯ x n ) -y ∥ are independent random variables since W S and ¯ x i are independent and y is fixed here. Thus, by using Hoeffding's inequality (Lemma G.1), and taking union bounds over y ∈ ˜ Y , we have that with probability at least 1 -δ , the following holds for all y ∈ ˜ Y :

Combining equation 34 and equation 39 implies that with probability at least 1 -δ ,

We will now analyze the term 1 n ∑ n i =1 ∥ φ ( x i ) ∥ + 1 n ∑ n i =1 ∥ φ (¯ x i ) ∥ on the right-hand side of equation 40. Since W ∗ = W ¯ S ,

Moreover, by using [51, Theorem 3.1] with the loss function x + ↦→ ∥ g ∗ ( x + ) -Wf ( x + ) ∥ (i.e., Lemma G.2), we have that for any δ > 0 , with probability at least 1 -δ ,

where ˜ R m ( W ◦ F ) = 1 √ m E ¯ S,ξ [sup W ∈W ,f ∈F ∑ m i =1 ξ i ∥ g ∗ ( x + i ) -Wf ( x + i ) ∥ ] is the normalized Rademacher complexity of the set { x + ↦→∥ g ∗ ( x + ) -Wf ( x + ) ∥ : W ∈ W , f ∈ F} (it is normalized such that ˜ R m ( F ) = O (1) as m → ∞ for typical choices of F ), and ξ 1 , . . . , ξ m are independent uniform random variables taking values in {-1 , 1 } . Takinng union bounds, we have that for any δ > 0 , with probability at least 1 -δ ,

where I r ∈ R r × r is the identity matrix, and [ f θ ( x + i ) ⊤ ⊗ I r ] ∈ R r × dr is the Kronecker product of the two matrices, and vec[ W ] ∈ R dr is the vectorization of the matrix W ∈ R r × d . Thus, by defining A i = [ f θ ( x + i ) ⊤ ⊗ I r ] ∈ R r × dr and using the notation of w = vec[ W ] and its inverse W = vec -1 [ w ] (i.e., the inverse of the vectorization from R r × d to R dr with a fixed ordering), we can rewrite equation 43 by

with g i = g ∗ ( x + i ) ∈ R r . Since the function w ↦→ ∑ m i =1 ∥ g i -A i w ∥ 2 is convex, a necessary and sufficient condition of the minimizer of this function is obtained by

In other words,

Thus,

where ( A ⊤ A ) † is the Moore-Penrose inverse of the matrix A ⊤ A and Null( A ) is the null space of the matrix A . Thus, the minimum norm solution is obtained by

where the inequality follows from the Jensen's inequality and the concavity of the square root function. Thus, we have that

where ˜ W = W S -W ∗ and P A = I -A ( A ⊤ A ) † A ⊤ .

where ∥ ˜ W ∥ 2 is the spectral norm of ˜ W . Since ¯ x i shares the same label with x i as ¯ x i ∼ D y i (and x i ∼ D y i ), and because f θ is trained with the unlabeled data ¯ S , using Hoeffding's inequality (Lemma G.1) implies that with probability at least 1 -δ ,

Define Z ¯ S = [ f ( x + 1 ) , . . . , f ( x + m )] ∈ R d × m . Then, we have A = [ Z ¯ S ⊤ ⊗ I r ] . Thus, P A = I -[ Z ¯ S ⊤ ⊗ I r ][ Z ¯ S Z ¯ S ⊤ ⊗ I r ] † [ Z ¯ S ⊗ I r ] = I -[ Z ¯ S ⊤ ( Z ¯ S Z ¯ S ⊤ ) † Z ¯ S ⊗ I r ] = [ P Z ¯ S ⊗ I r ] where P Z ¯ S = I m -Z ¯ S ⊤ ( Z ¯ S Z ¯ S ⊤ ) † Z ¯ S ∈ R m × m . By defining Y ¯ S = [ g ∗ ( x + 1 ) , . . . , g ∗ ( x + m )] ⊤ ∈ R m × r , since g = vec[ Y ⊤ ¯ S ] ,

On the other hand, recall that W S is the minimum norm solution as

Now,

Information Optimization and the VICReg Objective