
An Information-Theoretic Perspective on Variance-Invariance-Covariance Regularization

Randall Balestriero (Meta AI, FAIR) · Kenji Kawaguchi (National University of Singapore) · Tim G. J. Rudner (New York University) · Yann LeCun (New York University & Meta AI, FAIR)

Abstract

In this paper, we provide an information-theoretic perspective on Variance-Invariance-Covariance Regularization (VICReg) for self-supervised learning. To do so, we first demonstrate how information-theoretic quantities can be obtained for deterministic networks as an alternative to the commonly used, unrealistic stochastic-networks assumption. Next, we relate the VICReg objective to mutual information maximization and use this relationship to highlight the objective's underlying assumptions. Building on this relationship, we derive a generalization bound for VICReg that provides generalization guarantees for downstream supervised learning tasks, and we present new self-supervised learning methods, derived from a mutual information maximization objective, that outperform existing methods. This work provides a new information-theoretic perspective on self-supervised learning in general and Variance-Invariance-Covariance Regularization in particular, and paves the way for improved transfer learning via information-theoretic self-supervised learning objectives.

Introduction

Self-supervised learning (SSL) is a promising approach to extracting meaningful representations by optimizing a surrogate objective between inputs and self-generated signals. For example, Variance-Invariance-Covariance Regularization (VICReg) [7], a widely used SSL algorithm employing a de-correlation mechanism, circumvents learning trivial solutions by applying variance and covariance regularization.

Once the surrogate objective is optimized, the pre-trained model can be used as a feature extractor for a variety of downstream supervised tasks such as image classification, object detection, instance segmentation, or pose estimation [15, 16, 49, 68]. Despite the promising results demonstrated by SSL methods, the theoretical underpinnings explaining their efficacy continue to be the subject of investigation [5, 43].

Information theory has proved a useful tool for improving our understanding of deep neural networks (DNNs), having a significant impact on both applications in representation learning [3] and theoretical explorations [61, 73]. However, applications of information-theoretic principles to SSL have made unrealistic assumptions, rendering many existing information-theoretic approaches to SSL of limited use. One such assumption is that the DNN to be optimized is stochastic, an assumption that is violated for the vast majority of DNNs used in practice. For a comprehensive review of this topic, refer to the work by Shwartz-Ziv and LeCun [63].

∗ Correspondence to: ravid.shwartz.ziv@nyu.edu .

In this paper, we examine Variance-Invariance-Covariance Regularization (VICReg), an SSL method developed for deterministic DNNs, from an information-theoretic perspective. We propose an approach that addresses the challenge of mutual information estimation in deterministic networks by transitioning the randomness from the networks to the input data, a more plausible assumption. This shift allows us to apply an information-theoretic analysis to deterministic networks. To establish a connection between the VICReg objective and information maximization, we identify and empirically validate the necessary assumptions. Building on this analysis, we characterize the differences between SSL algorithms from an information-theoretic perspective and propose a new family of plug-in methods for SSL. This new family of methods leverages existing information estimators and achieves state-of-the-art predictive performance across several benchmark tasks. Finally, we derive a generalization bound that links information optimization and the VICReg objective to downstream task performance, underscoring the advantages of VICReg.

Our key contributions are summarized as follows:

  1. We introduce a novel approach for studying deterministic deep neural networks from an information-theoretic perspective by shifting the stochasticity from the networks to the inputs using the Data Distribution Hypothesis (Section 3).
  2. We establish a connection between the VICReg objective and information-theoretic optimization, using this relationship to elucidate the underlying assumptions of the objective and compare it to other SSL methods (Section 4).
  3. We propose a family of information-theoretic SSL methods, grounded in our analysis, that achieve state-of-the-art performance (Section 5).
  4. We derive a generalization bound that directly links VICReg to downstream task generalization, further emphasizing its practical advantages over other SSL methods (Section 6).

Background & Preliminaries

We first introduce the necessary technical background for our analysis.

Continuous Piecewise Affine (CPA) Mappings.

A rich class of functions emerges from piecewise polynomials: spline operators. In short, given a partition Ω of a domain ℝ^D, a spline of order k is a mapping defined by a polynomial of order k on each region ω ∈ Ω, with continuity constraints on the entire domain for the derivatives of order 0, ..., k − 1. As we focus on affine splines (k = 1), we only define this case for clarity. A K-dimensional affine spline f produces its output via
$$
f(z) \;=\; \sum_{\omega \in \Omega} \big( A_{\omega} z + b_{\omega} \big)\, \mathbb{1}\{ z \in \omega \},
$$
with input z ∈ ℝ^D and per-region slope and offset parameters A_ω ∈ ℝ^{K×D}, b_ω ∈ ℝ^K for all ω ∈ Ω, under the key constraint that the entire mapping is continuous over the domain, i.e., f ∈ C⁰(ℝ^D).
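As a minimal illustration of this definition (our sketch; the slopes, offsets, and regions below are arbitrary), a 1-D affine spline with two regions evaluates a per-region affine map selected by the indicator, and the continuity constraint forces the two pieces to agree at the region boundary:

```python
import numpy as np

# A minimal 1-D affine spline (k = 1) with two regions,
# omega_1 = (-inf, 0) and omega_2 = [0, inf). Continuity at the knot z = 0
# holds because A_1 * 0 + b_1 = A_2 * 0 + b_2.
A = {1: np.array([[2.0]]), 2: np.array([[0.5]])}   # per-region slopes A_w
b = {1: np.array([1.0]), 2: np.array([1.0])}       # per-region offsets b_w

def region(z):
    # The indicator 1{z in w}: exactly one region is active per input.
    return 1 if z[0] < 0.0 else 2

def affine_spline(z):
    w = region(z)
    return A[w] @ z + b[w]

# The mapping is continuous across the region boundary:
left = affine_spline(np.array([-1e-9]))
right = affine_spline(np.array([+1e-9]))
```

Evaluating just left and right of the knot returns (numerically) the same value, which is exactly the C⁰ constraint above.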

Deep Neural Networks as CPA Mappings. A deep neural network (DNN) is a (non-linear) operator f_Θ with parameters Θ that maps an input x ∈ ℝ^D to a prediction y ∈ ℝ^K. Precise definitions of DNN operators can be found in Goodfellow et al. [27]. To avoid cluttering notation, we omit Θ unless needed for clarity. For our analysis, we only assume that the non-linearities in the DNN are CPA mappings, as is the case for (leaky-)ReLU, absolute value, and max-pooling operators. The entire input-output mapping then becomes a CPA spline with an implicit partition Ω that is a function of the weights and architecture of the network [52, 6]. For smooth nonlinearities, our results hold via a first-order Taylor approximation argument.



Self-Supervised Learning.

SSL is a set of techniques that learn representation functions from unlabeled data, which can then be adapted to various downstream tasks. While supervised learning relies on labeled data, SSL formulates a proxy objective using self-generated signals. The challenge in SSL is to learn useful representations without labels. It aims to avoid trivial solutions where the model maps all inputs to a constant output [11, 36]. To address this, SSL utilizes several strategies. Contrastive methods like SimCLR and its InfoNCE criterion learn representations by distinguishing positive and negative examples [16, 55]. In contrast, non-contrastive methods apply regularization techniques to prevent the collapse [14, 17, 28].

Variance-Invariance-Covariance Regularization (VICReg). VICReg [7] is a widely used SSL method for training joint-embedding architectures. Its loss objective is composed of three terms: the invariance loss, the variance loss, and the covariance loss:

· Invariance loss: the mean-squared Euclidean distance between pairs of embeddings; it ensures consistency between the representations of the original and augmented inputs.

· Regularization: the regularization term consists of two losses: the variance loss, a hinge loss that maintains the standard deviation (over a batch) of each embedding variable, and the covariance loss, which penalizes the off-diagonal coefficients of the embeddings' covariance matrix to foster decorrelation among features.

VICReg generates two batches of embeddings, Z = [ f ( x 1 ) , . . . , f ( x B )] and Z ′ = [ f ( x ′ 1 ) , . . . , f ( x ′ B )] , each of size ( B × K ) . Here, x i and x ′ i are two distinct random augmentations of a sample I i . The covariance matrix C ∈ R K × K is obtained from [ Z , Z ′ ] . The VICReg loss can thus be expressed as follows:

$$
\mathcal{L}(Z, Z') \;=\; \frac{\lambda}{B} \sum_{i=1}^{B} \big\| z_i - z'_i \big\|_2^2 \;+\; \frac{\mu}{K} \sum_{k=1}^{K} \max\!\big( 0,\, \gamma - \sqrt{C_{k,k} + \epsilon} \,\big) \;+\; \frac{\nu}{K} \sum_{k \neq k'} C_{k,k'}^2,
$$

where λ, μ, and ν are trade-off coefficients, γ is the target standard deviation, and ε is a small constant for numerical stability.
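The three terms above can be sketched in a few lines of numpy (a minimal illustration, not the reference implementation; the function name, default coefficients λ = μ = 25, ν = 1, and the per-branch covariance computation are our assumptions):

```python
import numpy as np

def vicreg_loss(Z, Zp, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """Sketch of the three VICReg terms for embeddings Z, Zp of shape (B, K)."""
    B, K = Z.shape
    # Invariance: mean-squared Euclidean distance between paired embeddings.
    inv = np.mean(np.sum((Z - Zp) ** 2, axis=1))
    # Variance: hinge on the per-dimension standard deviation over the batch.
    def v(E):
        std = np.sqrt(E.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))
    # Covariance: squared off-diagonal entries of the batch covariance matrix.
    def c(E):
        Ec = E - E.mean(axis=0)
        C = (Ec.T @ Ec) / (B - 1)
        off = C - np.diag(np.diag(C))
        return np.sum(off ** 2) / K
    return lam * inv + mu * (v(Z) + v(Zp)) + nu * (c(Z) + c(Zp))

rng = np.random.default_rng(0)
Z = rng.normal(size=(8, 4))
loss_same = vicreg_loss(Z, Z)   # invariance term vanishes for identical views
```

Feeding the same batch as both views zeroes the invariance term, so any remaining loss comes purely from the variance and covariance regularizers.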


Deep Neural Networks and Information Theory

Recently, information-theoretic methods have played an essential role in advancing deep learning by developing and applying information-theoretic estimators and learning principles to DNN training [3, 9, 34, 57, 65, 66, 70, 73]. However, information-theoretic objectives for deterministic DNNs often face a common issue: the mutual information between the input and the DNN representation is infinite. This leads to ill-posed optimization problems.

Several strategies have been suggested to address this challenge. One involves using stochastic DNNs with variational bounds, where the output of the deterministic network is used as the parameters of the conditional distribution [44, 62]. Another approach, as suggested by Dubois et al. [22], assumes that the randomness of data augmentation among the two views is the primary source of stochasticity in the network. However, these methods assume that randomness comes from the DNN, contrary to common practice. Other research has presumed a random input but has made no assumptions about the network's representation distribution properties. Instead, it relies on general lower bounds to analyze the objective [72, 76].

Self-Supervised Learning in DNNs: An Information-Theoretic Perspective

To analyze information within deterministic networks, we first establish an information-theoretic perspective on SSL (Section 3.1). Subsequently, we utilize the Data Distribution Hypothesis (Section 3.2) to demonstrate its applicability to deterministic SSL networks.

Self-Supervised Learning from an Information-Theoretic Viewpoint

Our discussion begins with the MultiView InfoMax principle , which aims to maximize the mutual information I ( Z ; X ′ ) between a view X ′ and the second representation Z . As demonstrated in Federici et al. [24], we can optimize this information by employing the following lower bound:

$$
I(Z; X') \;=\; H(Z) - H(Z \mid X') \;\geq\; H(Z) + \mathbb{E}_{x'}\big[ \log q(z \mid x') \big]. \tag{2}
$$

Here, H(Z) represents the entropy of Z. In supervised learning, the labels Y are fixed, so the corresponding entropy term H(Y) is constant. Consequently, the optimization focuses solely on the log loss, E_{x'}[log q(z | x')], which could be either a cross-entropy or a squared loss.

However, for joint embedding networks, a degenerate solution can emerge, where all outputs 'collapse' into an undesired value [16]. Upon examining Equation (2), we observe that the entropies are not fixed and can be optimized. As a result, minimizing the log loss alone can lead the representations to collapse into a trivial solution and must be regularized.

Understanding the Data Distribution Hypothesis

Previously, we mentioned that a naive analysis might suggest that the information in deterministic DNNs is infinite. To address this point, we investigate whether assuming that a dataset is a mixture of Gaussians with non-overlapping support yields a manageable distribution over the network outputs. This assumption is less restrictive than assuming that the neural network itself is stochastic, as it concerns the generative process of the data rather than the model and training process. For a detailed discussion of the limitations of assuming stochastic networks and a comparison between stochastic networks and stochastic inputs, see Appendix N. In Section 4.2, we verify that this assumption holds for real-world datasets.

The so-called manifold hypothesis allows us to treat any point as a Gaussian random variable with a low-rank covariance matrix aligned with the data's manifold tangent space [25], which enables us to examine the conditioning of a latent representation with respect to the mean of the observation, i.e., X | x ∗ ∼ N ( x ; x ∗ , Σ x ∗ ) . Here, the eigenvectors of Σ x ∗ align with the tangent space of the data manifold at x ∗ , which varies with the position of x ∗ in space. In this setting, a dataset is considered a collection of distinct points x ∗ n , n = 1 , ..., N , and the full data distribution is expressed as a sum of Gaussian densities with low-rank covariance, defined as:

$$
p(x) \;=\; \sum_{n=1}^{N} p(x \mid T = n)\, P(T = n) \;=\; \frac{1}{N} \sum_{n=1}^{N} \mathcal{N}\big( x;\, x^*_n, \Sigma_{x^*_n} \big). \tag{3}
$$

Here, T is a uniform Categorical random variable. For simplicity, we assume that the effective supports of N(x*_i, Σ_{x*_i}) do not overlap (for empirical validation of this assumption, see Section 4.2). The effective support is defined as { x ∈ ℝ^D : p(x) > ϵ }. We can then approximate the density function as follows:

$$
p(x) \;\approx\; \frac{1}{N}\, \mathcal{N}\big( x;\, x^*_{n(x)}, \Sigma_{x^*_{n(x)}} \big), \tag{4}
$$

where N(x; ·, ·) is the Gaussian density at x and n(x) = arg min_n (x − x*_n)^T Σ_{x*_n}^{-1} (x − x*_n).
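The nearest-prototype approximation can be checked numerically. The sketch below (our construction; the prototype locations, shared covariance, and separation scale are arbitrary choices) compares the full mixture density against the single nearest component when the prototypes are well separated relative to the noise:

```python
import numpy as np

# Sketch of the non-overlapping-support approximation: when prototypes are far
# apart relative to the Gaussian scale, the mixture density at x is dominated
# by its nearest prototype under the Mahalanobis distance.
rng = np.random.default_rng(0)
D, N = 2, 5
protos = rng.normal(scale=10.0, size=(N, D))   # well-separated prototypes x*_n
Sigma = 0.01 * np.eye(D)                       # small shared covariance
Sigma_inv = np.linalg.inv(Sigma)

def gauss(x, m, S):
    d = x - m
    return np.exp(-0.5 * d @ np.linalg.inv(S) @ d) / np.sqrt(
        (2 * np.pi) ** D * np.linalg.det(S))

def p_full(x):                                 # exact mixture density
    return np.mean([gauss(x, m, Sigma) for m in protos])

def n_of_x(x):                                 # nearest-prototype index n(x)
    return int(np.argmin([(x - m) @ Sigma_inv @ (x - m) for m in protos]))

def p_approx(x):                               # single-component approximation
    return gauss(x, protos[n_of_x(x)], Sigma) / N

x = protos[2] + 0.05 * rng.normal(size=D)      # a point near prototype 2
```

For this point, the relative gap between p_full and p_approx is negligible, which is exactly the regime the disjoint-support assumption targets.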

Data Distribution Under the Deep Neural Network Transformation

Let us consider an affine spline operator f , as illustrated in Section 2.1, which maps a space of dimension D to a space of dimension K , where K ≥ D . The image or the span of this mapping is expressed as follows:

$$
\operatorname{Im}(f) \;=\; \bigcup_{\omega \in \Omega} \operatorname{Aff}(\omega; A_{\omega}, b_{\omega}).
$$

In this equation, Aff(ω; A_ω, b_ω) = { A_ω x + b_ω : x ∈ ω } denotes the affine transformation of region ω by the per-region parameters A_ω, b_ω, and Ω denotes the partition of the input space in which x resides. To compute the per-region affine mapping in practice, we set A_ω to the Jacobian matrix of the network at the corresponding input x and define b_ω as f(x) − A_ω x. The DNN mapping is therefore composed of affine transformations on each input-space partition region ω ∈ Ω, based on the coordinate change induced by A_ω and the shift induced by b_ω.
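This per-region recipe can be verified directly on a small ReLU network (a sketch with arbitrary random weights, our illustration): the Jacobian at x follows from the activation pattern, the offset is f(x) − A_ω x, and any nearby point in the same region satisfies the affine identity exactly.

```python
import numpy as np

# For a ReLU network, the per-region slope A_w is the Jacobian at x and the
# offset is b_w = f(x) - A_w x; points in the same region w satisfy
# f(x') = A_w x' + b_w exactly. Weights below are arbitrary illustrations.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(6, 3)), rng.normal(size=6)
W2, b2 = rng.normal(size=(2, 6)), rng.normal(size=2)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def local_affine(x):
    mask = (W1 @ x + b1 > 0.0).astype(float)   # activation pattern = region
    A = W2 @ (np.diag(mask) @ W1)              # Jacobian of f at x
    return A, f(x) - A @ x                     # slope A_w and offset b_w

x = rng.normal(size=3)
A, b = local_affine(x)
x2 = x + 1e-6 * rng.normal(size=3)             # small enough to stay in region
```

A perturbation this small leaves the activation pattern, and hence the region ω, unchanged, so the affine identity holds to machine precision at x2 as well.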

When the input space is associated with a density distribution, this density is transformed by the mapping f . Calculating the density of f ( X ) is generally intractable. However, under the disjoint support assumption from Section 3.2, we can arbitrarily increase the density's representation power by raising the number of prototypes N . As a result, each Gaussian's support is contained within the region ω where its means lie, leading to the following theorem:

Theorem 1 . Given the setting of Equation (4), the unconditional DNN output density, Z , can be approximated as a mixture of the affinely transformed distributions x | x ∗ n ( x ) :

$$
Z \;\sim\; \frac{1}{N} \sum_{n=1}^{N} \mathcal{N}\Big( A_{\omega(x^*_n)} x^*_n + b_{\omega(x^*_n)},\;\; A^{T}_{\omega(x^*_n)} \Sigma_{x^*_n} A_{\omega(x^*_n)} \Big),
$$

where ω(x*_n) = ω ∈ Ω ⟺ x*_n ∈ ω is the partition region in which the prototype x*_n lies. Proof. See Appendix B.

In other words, Theorem 1 implies that when the input noise is small, we can simplify the conditional output density to a single Gaussian: ( Z ′ | X ′ = x n ) ∼ N ( µ ( x n ) , Σ( x n )) , where µ ( x n ) = A ω ( x n ) x n + b ω ( x n ) and Σ( x n ) = A T ω ( x n ) Σ x n A ω ( x n ) .

Figure 1: Left: The network output for SSL training is more Gaussian for small input noise . The p -value of the normality test for different SSL models trained on ImageNet for different input noise levels. The dashed line represents the point at which the null hypothesis (Gaussian distribution) can be rejected with 99% confidence. Right: The Gaussians around each point are not overlapping. The plots show the l 2 distances between raw images for different datasets. As can be seen, the distances are largest for more complex real-world datasets.


Information Optimization and the VICReg Optimization Objective

Building on our earlier discussion, we used the Data Distribution Hypothesis to model the conditional output in deterministic networks as a Gaussian mixture. This allowed us to frame the SSL training objective as maximizing the mutual information, I ( Z ; X ′ ) and I ( Z ′ ; X ) .

However, in general, this mutual information is intractable. Therefore, we will use our derivation for the network's representation to obtain a tractable variational approximation using the expected loss, which we can optimize.

The computation of expected loss requires us to marginalize the stochasticity in the output. We can employ maximum likelihood estimation with a Gaussian observation model. For computing the expected loss over x ′ samples, we must marginalize the stochasticity in Z ′ . This procedure implies that the conditional decoder adheres to a Gaussian distribution: ( Z | X ′ = x n ) ∼ N ( µ ( x n ) , I +Σ( x n )) .
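As a sanity check on this marginalization (our construction; the dimensions, seed, and sample count are arbitrary), the sketch below verifies by Monte Carlo the Gaussian identity that makes the expected squared-error log loss tractable: for Z ~ N(μ₁, Σ), E‖Z − μ₂‖² = ‖μ₁ − μ₂‖² + tr Σ.

```python
import numpy as np

# Marginalizing a Gaussian Z ~ N(mu1, S) through a squared-error log loss
# centered at mu2 has the closed form E||Z - mu2||^2 = ||mu1 - mu2||^2 + tr(S).
# The Monte Carlo average below should agree with it.
rng = np.random.default_rng(0)
K = 4
mu1, mu2 = rng.normal(size=K), rng.normal(size=K)
A = rng.normal(size=(K, K))
S = A @ A.T / K                                # a valid covariance matrix

z = rng.multivariate_normal(mu1, S, size=200_000)
mc = np.mean(np.sum((z - mu2) ** 2, axis=1))   # Monte Carlo estimate
closed = np.sum((mu1 - mu2) ** 2) + np.trace(S)
```

The agreement of the two quantities is what lets the expected log loss be written in terms of the means μ(x), μ(x') and covariances alone.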

However, calculating the expected log loss over samples of Z is challenging. We thus focus on a lower bound: the expected log loss over Z′ samples. Utilizing Jensen's inequality, we derive the following lower bound:

$$
\log q(z \mid x') \;=\; \log \mathbb{E}_{z' \sim p(z' \mid x')}\big[ q(z \mid z') \big] \;\geq\; \mathbb{E}_{z' \sim p(z' \mid x')}\big[ \log q(z \mid z') \big].
$$

Taking the expectation over Z , we get

$$
\mathbb{E}_{z, z'}\big[ \log q(z \mid z') \big] \;=\; -\tfrac{1}{2} \Big( \big\| \mu(x) - \mu(x') \big\|_2^2 + \operatorname{tr} \Sigma(x) + \operatorname{tr} \Sigma(x') \Big) + \mathrm{const},
$$

$$
I(Z; X') \;\geq\; -\tfrac{1}{2}\, \mathbb{E}_{x, x'}\Big[ \big\| \mu(x) - \mu(x') \big\|_2^2 \Big] + H(Z) + \mathrm{const}.
$$

The full derivations are presented in Appendix A. To optimize this objective in practice, we approximate p ( x, x ′ ) using the empirical data distribution

$$
\max \; \frac{1}{N} \sum_{n=1}^{N} \Big( -\big\| \mu(x_n) - \mu(x'_n) \big\|_2^2 \Big) + H(Z) + H(Z'). \tag{9}
$$

Variance-Invariance-Covariance Regularization: An Information-Theoretic Perspective

Next, we connect VICReg to our information-theoretic objective. The 'invariance term' in Equation (9), which pushes augmentations of the same image closer together, is the same term used in the VICReg objective. However, the computation of the regularization term poses a significant challenge. Entropy estimation is a well-established problem in information theory, with Gaussian mixture densities often used as the representation. Yet the differential entropy of Gaussian mixtures lacks a closed-form solution [56].

A straightforward method for approximating entropy involves capturing the distribution's first two moments, which provides an upper bound on the entropy. However, minimizing an upper bound doesn't necessarily optimize the original objective. Despite reported success from minimizing an upper bound [47, 54], this approach may induce instability during the training process.

Let Σ_Z denote the covariance matrix of Z. We use the first two moments to approximate the entropy we aim to maximize. Because the invariance term appears in the same form as in the original VICReg objective, we examine only the regularizer. Consequently, we obtain the approximation

$$
H(Z) + H(Z') \;\approx\; \tfrac{1}{2} \log\det \Sigma_Z + \tfrac{1}{2} \log\det \Sigma_{Z'} + K \log(2 \pi e). \tag{10}
$$

Theorem 2 . Assuming that the eigenvalues of Σ( x i ) and Σ( x ′ i ) , along with the differences between the Gaussian means µ ( x i ) and µ ( x ′ i ) , are bounded, the solution to the maximization problem

$$
\max_{\Sigma_Z, \Sigma_{Z'}} \;\; \log\det \Sigma_Z + \log\det \Sigma_{Z'}
$$

involves setting Σ Z to a diagonal matrix.

$$
\log\det \Sigma_Z \;=\; \sum_{k=1}^{K} \log\, [\Sigma_Z]_{k,k} \qquad \text{for diagonal } \Sigma_Z.
$$

According to Theorem 2, we can maximize Equation (10) by diagonalizing the covariance matrix and increasing its diagonal elements. This goal can be achieved by minimizing the off-diagonal elements of Σ_Z (the covariance criterion of VICReg) and by maximizing the sum of the logs of its diagonal elements. While this approach is straightforward and efficient, it has a drawback: the diagonal values can tend toward zero, causing instability in the logarithm computation. A solution to this issue is to use an upper bound and directly compute the sum of the diagonal elements, which yields the variance term of VICReg. This establishes the link between our information-theoretic objective and the three key components of VICReg.
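The diagonalization argument can be checked numerically (a sketch with an arbitrary random covariance, our illustration). By Hadamard's inequality, log det Σ_Z is at most the sum of the logs of the diagonal entries, with equality when Σ_Z is diagonal; shrinking the off-diagonals (VICReg's covariance term) while maintaining the diagonal (its variance term) therefore closes the gap.

```python
import numpy as np

# Hadamard's inequality: for a positive-definite Sigma,
# log det(Sigma) <= sum_k log(Sigma_kk), with equality iff Sigma is diagonal.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
Sigma = A @ A.T + 0.1 * np.eye(5)              # random positive-definite matrix

logdet_full = np.linalg.slogdet(Sigma)[1]      # entropy surrogate, correlated Z
logdet_diag = np.sum(np.log(np.diag(Sigma)))   # Hadamard upper bound
Sigma_d = np.diag(np.diag(Sigma))              # fully decorrelated covariance
logdet_after = np.linalg.slogdet(Sigma_d)[1]   # bound is attained here
```

Decorrelating the features raises the log-determinant to the diagonal bound, which is exactly why minimizing off-diagonal covariance while preserving per-feature variance maximizes the entropy approximation.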

Empirical Validation of Assumptions About Data Distributions

To validate our theory, we tested whether the conditional output density P(Z | X) becomes Gaussian as the input noise decreases. We used a ResNet-50 model trained with the SimCLR or VICReg objectives on the CIFAR-10, CIFAR-100, and ImageNet datasets. For each image in the test dataset, we drew 512 Gaussian samples and examined whether the corresponding samples in the DNN's penultimate layer remained Gaussian. We applied D'Agostino and Pearson's test to assess the validity of this assumption [19].

Figure 1 (left) displays the p-value as a function of the normalized standard deviation of the input noise. For low noise levels, the hypothesis that the network's conditional output density is Gaussian cannot be rejected for 85% of the samples when using VICReg. However, the network output deviates from Gaussianity as the input noise increases.

Next, we verified our assumption of non-overlapping effective support in the model's data distribution. We calculated the distribution of pairwise l2 distances between images across several datasets: MNIST [42], CIFAR-10 and CIFAR-100 [41], Flowers102 [53], Food101 [12], and FGVC-Aircraft [46]. Figure 1 (right) reveals that the pairwise distances are far from zero, even for raw pixels. This implies that we can place a small Gaussian around each point without overlap, validating our assumption as realistic.
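The check behind Figure 1 (right) amounts to comparing the smallest pairwise l2 distance against the noise scale. A minimal sketch (our illustration on synthetic vectors standing in for flattened images; the shapes and noise scale are arbitrary assumptions):

```python
import numpy as np

# Non-overlap check: if the smallest pairwise l2 distance between raw samples
# greatly exceeds the per-point Gaussian scale, the Gaussians centered at each
# point have effectively disjoint support.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))            # stand-in for flattened raw images

diffs = X[:, None, :] - X[None, :, :]     # all pairwise difference vectors
dists = np.sqrt(np.sum(diffs ** 2, axis=-1))
min_dist = np.min(dists[np.triu_indices(100, k=1)])  # smallest off-diagonal
noise_std = 0.01                          # assumed Gaussian scale per point
```

When min_dist is many multiples of noise_std, as here, the effective supports of the per-point Gaussians do not intersect.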

Self-Supervised Learning Models through Information Maximization

The practical application of Equation (8) involves several key 'design choices'. We begin by comparing how existing SSL models have implemented it, investigating the estimators used, and discussing the implications of their assumptions. Subsequently, we introduce new methods for SSL that incorporate sophisticated estimators from the field of information theory, which outperform current approaches.

VICReg vs. SimCLR

In order to evaluate their underlying assumptions and strategies for information maximization, we compare VICReg to contrastive SSL methods such as SimCLR along with non-contrastive methods like BYOL and SimSiam.

Contrastive Learning with SimCLR. Lee et al. [44] drew a connection between the SimCLR objective and a variational bound on the information in the representations by employing the von Mises-Fisher distribution. Combining our analysis of information in deterministic networks with their work, we identify the main differences between SimCLR and VICReg:

(i) Conditional distribution: SimCLR assumes a von Mises-Fisher distribution for the encoder, while VICReg assumes a Gaussian distribution. (ii) Entropy estimation: SimCLR approximates the entropy based on a finite sum over the input samples, whereas VICReg estimates the entropy of Z solely from its second moment. Developing SSL methods that integrate these two distinctions forms an intriguing direction for future research.

Empirical comparison. We trained ResNet-18 on CIFAR-10 with VICReg, SimCLR, and BYOL and compared their entropies directly using the pairwise-distances entropy estimator (for details, see Appendix K). This estimator was not directly optimized by any of the methods and thus serves as an independent validation. The results (Figure 2) show that entropy increased for all methods during training, with SimCLR having the lowest and VICReg the highest entropy.
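A pairwise-distances estimator of this kind can be sketched as follows (our simplified construction in the spirit of the Kolchinsky-Tracey mixture bounds [39], not the exact estimator used in the experiments): treating the batch as a mixture of identical isotropic Gaussians around each embedding, the mixture entropy combines the component entropy with a log-sum of pairwise similarities, here derived from the Bhattacharyya distance ‖μᵢ − μⱼ‖²/(8s²) between equal-covariance Gaussians.

```python
import numpy as np

# Pairwise-distance entropy sketch for a mixture of N identical isotropic
# Gaussians N(mu_i, s^2 I) with uniform weights: component entropy minus the
# average log of the mean pairwise similarity exp(-||mu_i - mu_j||^2 / (8 s^2)).
def pairwise_entropy(mu, s):
    N, K = mu.shape
    h_comp = 0.5 * K * np.log(2 * np.pi * np.e * s ** 2)  # per-component entropy
    d2 = np.sum((mu[:, None, :] - mu[None, :, :]) ** 2, axis=-1)
    sim = np.exp(-d2 / (8 * s ** 2))
    return h_comp - np.mean(np.log(np.mean(sim, axis=1)))

rng = np.random.default_rng(0)
mu = rng.normal(size=(64, 8))
tight = pairwise_entropy(0.1 * mu, s=0.5)    # clustered (near-collapsed) batch
spread = pairwise_entropy(10.0 * mu, s=0.5)  # well-separated embeddings
```

The estimator rises as embeddings spread apart and falls as they cluster, which is the behavior tracked in Figure 2.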


Family of alternative Entropy Estimators

Next, we suggest integrating the invariance term of current SSL methods with plug-in methods that optimize entropy.

Entropy estimators. The VICReg objective approximates the log-determinant of the empirical covariance matrix through its diagonal terms. As discussed in Section 4.1, this approach has drawbacks. An alternative is to employ different entropy estimators. The LogDet Entropy Estimator [75] is one such option, offering a tighter upper bound. This estimator uses the order-α differential entropy with scaled noise and has previously been shown to be a tight estimator for high-dimensional features that is robust to random noise. However, since it provides an upper bound on the entropy, maximizing it does not guarantee optimization of the original objective. To counteract this, we also introduce a lower-bound estimator based on the pairwise distances between individual mixture components [39]. Each member of this family is specified by a function that measures pairwise distances between component densities. These estimators are computationally efficient and typically straightforward to optimize. For additional entropy estimators, see Appendix F. Beyond VICReg, these methods can serve as plug-in estimators for numerous SSL algorithms; apart from VICReg, we also conducted experiments integrating them with the BYOL algorithm.
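A LogDet-style surrogate can be sketched in a few lines (our simplified version; the exact scaling convention of [75] may differ, and the noise parameter sigma2 here is our assumption). The key practical property is that regularizing the covariance with a scaled identity keeps the log-determinant finite even for rank-deficient batch covariances, where the plain log det would diverge:

```python
import numpy as np

# LogDet-style entropy surrogate: regularize the embedding covariance with
# scaled identity noise before taking log det, so the estimate stays finite
# even when the batch covariance is rank-deficient (B < K).
def logdet_entropy(Z, sigma2=1.0):
    B, K = Z.shape
    Zc = Z - Z.mean(axis=0)
    Sigma = Zc.T @ Zc / (B - 1)
    return 0.5 * np.linalg.slogdet(np.eye(K) + Sigma / sigma2)[1]

rng = np.random.default_rng(0)
Z = rng.normal(size=(32, 128))     # more dims than samples: rank-deficient cov.
h_small = logdet_entropy(Z)
h_large = logdet_entropy(3.0 * Z)  # spreading embeddings raises the surrogate
```

The surrogate remains finite despite the rank deficiency and grows monotonically as the embeddings spread, which is what makes it usable as a plug-in maximization target.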

Figure 2: VICReg has higher Entropy during training. The entropy along the training for different SSL methods. Experiments were conducted with ResNet-18 on CIFAR-10. Error bars represent one standard error over 5 trials.


Table 1: The proposed entropy estimators outperform previous methods. CIFAR-10, CIFAR-100, and Tiny-ImageNet top-1 accuracy under linear evaluation using ResNet-18, ConvNeXt, and ViT as backbones. Error bars correspond to one standard error over three trials.

Setup. Experiments were conducted on three image datasets: CIFAR-10, CIFAR-100 [40], and Tiny-ImageNet [20]. For CIFAR-10, ResNet-18 [31] was used, while both ConvNeXt [45] and Vision Transformer [21] were used for CIFAR-100 and Tiny-ImageNet. For comparison, we examined the following SSL methods: VICReg, SimCLR, BYOL, SwAV [14], Barlow Twins [74], and MoCo [33]. The quality of the representations was assessed through linear evaluation. A detailed description of the different methods can be found in Appendix H.

Results. As evidenced by Table 1, the proposed entropy estimators surpass the original SSL methods. Using a more precise entropy estimator enhances the performance of both VICReg and BYOL, compared to their initial implementations. Notably, the pairwise distance estimator, being a lower bound, achieves superior results, resonating with the theoretical preference for maximizing a true entropy's lower bound. Our findings suggest that the astute choice of entropy estimators, guided by our framework, paves the way for enhanced performance.


A Generalization Bound for Downstream Tasks

In earlier sections, we linked information theory principles with the VICReg objective. Now, we aim to extend this link to downstream generalization via a generalization bound. This connection further aligns VICReg's generalization with information maximization and implicit regularization.

Notation. Consider input points x, outputs y ∈ R^r, labeled training data S = ((x_i, y_i))_{i=1}^n of size n, and unlabeled training data S̄ = ((x_i^+, x_i^{++}))_{i=1}^m of size m, where x_i^+ and x_i^{++} share the same (unknown) label. With the unlabeled training data, we define the invariance loss

$$
I_{\bar S}(f_\theta) = \frac{1}{m}\sum_{i=1}^{m}\left\| f_\theta(x_i^{+}) - f_\theta(x_i^{++}) \right\|,
$$

where f_θ is the representation trained on the unlabeled data S̄. We define a labeled loss ℓ_{x,y}(w) = ‖W f_θ(x) − y‖, where w = vec[W] ∈ R^{dr} is the vectorization of the matrix W ∈ R^{r×d}. Let w_S = vec[W_S] be the minimum-norm solution, W_S = minimize_{W′} ‖W′‖_F such that

$$
W' \in \operatorname*{arg\,min}_{W} \ \frac{1}{n}\sum_{i=1}^{n} \left\| W f_\theta(x_i) - y_i \right\|^2.
$$

We also define the representation matrices

$$
Z_S = \left[f_\theta(x_1), \ldots, f_\theta(x_n)\right]^\top \in \mathbb{R}^{n \times d},
$$

$$
Z_{\bar S} = \left[f_\theta(x_1^{+}), \ldots, f_\theta(x_m^{+})\right]^\top \in \mathbb{R}^{m \times d}.
$$

We define the label matrix Y_S = [y_1, …, y_n]^⊤ ∈ R^{n×r} and the unknown label matrix Y_{S̄} = [y_1^+, …, y_m^+]^⊤ ∈ R^{m×r}, where y_i^+ is the unknown label of x_i^+. Let F be a hypothesis space of f_θ. For a given hypothesis space F, we define the normalized Rademacher complexity

$$
\tilde{\mathcal{R}}_m(\mathcal{F}) = \frac{1}{\sqrt{m}}\,\mathbb{E}_{\bar S,\xi}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{m}\xi_i\, \left\|f(x_i^{+}) - f(x_i^{++})\right\|\right],
$$

where ξ 1 , . . . , ξ m are independent uniform random variables in {-1 , 1 } . It is normalized such that ˜ R m ( F ) = O (1) as m →∞ .
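The minimum-norm solution W_S above has the closed form W_S^⊤ = Z_S^+ Y_S via the Moore–Penrose pseudoinverse. A minimal numpy sketch (the variable names are ours, with dimensions chosen so the least-squares problem is underdetermined and the minimum-norm property is visible):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5, 8, 3                      # underdetermined: d > n
Z = rng.normal(size=(n, d))            # rows are the features f_theta(x_i)
Y = rng.normal(size=(n, r))            # labels

# Minimum-norm least-squares solution: W_S^T = pinv(Z_S) @ Y_S.
W = (np.linalg.pinv(Z) @ Y).T          # W in R^{r x d}

# Any other least-squares solution adds a component from the null space
# of Z, which leaves the fit unchanged but enlarges the Frobenius norm.
null_proj = np.eye(d) - np.linalg.pinv(Z) @ Z
W_alt = W + (null_proj @ rng.normal(size=(d, r))).T
```

`np.linalg.lstsq` returns the same minimum-norm solution, which is a convenient cross-check.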




Comparison of Generalization Bounds

The SimCLR generalization bound [59] requires the number of labeled classes to go to infinity to close the generalization gap, whereas the VICReg bound in Theorem 3 does not require the number of label classes to approach infinity for the generalization gap to go to zero. This reflects the fact that, unlike SimCLR, VICReg does not use negative pairs and thus does not rely on a loss function built on the implicit expectation that the labels of a negative pair differ. Another difference is that our VICReg bound improves as n increases, while the previous SimCLR bound [59] does not depend on n. This is because Saunshi et al. [59] assume partial access to the true distribution of each class in their setting, which removes the importance of the labeled data size n and is not assumed in our study.

Consequently, the generalization bound in Theorem 3 provides a new insight for VICReg regarding the relative effects of m vs. n through the terms G√(ln(1/δ)/m) + √(ln(1/δ)/n). Finally, Theorem 3 also illuminates the advantage of VICReg over standard supervised training. With standard training, the generalization bound via Rademacher complexity requires the complexities of the hypothesis spaces, R̃_n(W)/√n and R̃_n(F)/√n, in terms of the size of the labeled data n, instead of the size of the unlabeled data m. Thus, Theorem 3 shows that with SSL we can replace the complexities of the hypothesis spaces in terms of n with those in terms of m. Since the number of unlabeled data points is typically much larger than the number of labeled data points, this illuminates the benefit of SSL. Our bound differs from the recent information-bottleneck bound [38] in that neither our proof nor our bound relies on the information bottleneck.

Understanding Theorem 3 via Mutual Information Maximization

Theorem 3, together with the result of the previous section, shows that, for generalization in the downstream task, it is helpful to maximize the mutual information I(Z; X′) in SSL by minimizing the invariance loss I_S̄(f_θ) while controlling the covariance Z_S̄ Z_S̄^⊤. The term 2R̃_m(F)/√m captures the importance of controlling the complexity of the representations f_θ. To understand this term further, consider a discretization of the parameter space of F so that |F| < ∞. Then, by Massart's Finite Class Lemma, we have R̃_m(F) ≤ C√(ln|F|) for some constant C > 0. Moreover, Shwartz-Ziv [61] shows that we can approximate ln|F| by 2I(Z; X). Thus, in Theorem 3, the term I_S̄(f_θ) + (2/√m)‖P_{Z_S̄} Y_S̄‖_F + (1/√n)‖P_{Z_S} Y_S‖_F corresponds to I(Z; X′), which we want to maximize, while compressing the term 2R̃_m(F)/√m, which corresponds to I(Z; X) [23, 64, 67].

Although we can explicitly add regularization on the information to control 2R̃_m(F)/√m, it is possible that I(Z; X | X′) and 2R̃_m(F)/√m are implicitly regularized via implicit bias through design choices [29, 69, 30]. Thus, Theorem 3 connects the information-theoretic understanding of VICReg with the probabilistic guarantee on downstream generalization.

Limitations

In our paper, we proposed novel methods for SSL premised on information maximization. Although our methods demonstrated superior performance on some datasets, computational constraints precluded us from testing them on larger datasets. Furthermore, our study hinges on certain assumptions that, despite rigorous validation efforts, may not hold universally across all scenarios or conditions. These limitations should be taken into account when interpreting our results.

Conclusions

We analyzed the Variance-Invariance-Covariance Regularization for self-supervised learning through an information-theoretic lens. By transferring the stochasticity required for an information-theoretic analysis to the input distribution, we showed how the VICReg objective can be derived from information-theoretic principles, used this perspective to highlight assumptions implicit in the VICReg objective, derived a VICReg generalization bound for downstream tasks, and related it to information maximization.

Building on these findings, we introduced a new VICReg-inspired SSL objective. Our probabilistic guarantee suggests that VICReg can be further improved for the settings of partial label information by aligning the covariance matrix with the partially observable label matrix, which opens up several avenues for future work, including the design of improved estimators for information-theoretic quantities and investigations into the suitability of different SSL methods for specific data characteristics.


Data Distribution after Deep Network Transformation

Theorem 4. Given the setting of Equation (4), the unconditional DNN output density, denoted Z, approximates (given the truncation of the Gaussian to its effective support, which is included within a single region ω of the DN's input-space partition) a mixture of the affinely transformed distributions x | x*_n(x), e.g., for the Gaussian case

$$
z \sim \sum_{n=1}^{N} \frac{1}{N}\, \mathcal{N}\!\left( A_{\omega(x^*_n)}\, x^*_n + b_{\omega(x^*_n)},\ A_{\omega(x^*_n)}\, \Sigma\, A_{\omega(x^*_n)}^{\top} \right),
$$

where ω(x*_n) = ω ∈ Ω ⟺ x*_n ∈ ω is the partition region in which the prototype x*_n lives.

Proof. If ∫_ω p(x | x*_n(x)) dx ≈ 1, then f is linear within the effective support of p. Therefore, any sample from p will almost surely lie within a single region ω ∈ Ω, and the entire mapping can be considered linear with respect to p. Thus, the output distribution is a linear transformation of the input distribution based on the per-region affine mapping.
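The locally-linear argument can be checked numerically on a toy piecewise-affine network: within the region ω that fixes the activation pattern at a prototype, the network coincides with an affine map. A minimal sketch (the toy network and all names are ours, not the architecture used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy two-layer ReLU network f(x) = W2 relu(W1 x + b1): a continuous
# piecewise-affine map whose regions are fixed activation patterns.
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)
W2 = rng.normal(size=(3, 16))
f = lambda x: W2 @ np.maximum(W1 @ x + b1, 0.0)

x_star = rng.normal(size=4)
pre = W1 @ x_star + b1
mask = (pre > 0).astype(float)
# Affine map of the region omega containing x_star.
A = W2 @ (mask[:, None] * W1)
c = W2 @ (mask * b1)

# Any perturbation small enough to preserve the activation pattern stays
# inside omega, where the network coincides with x -> A x + c.
radius = np.min(np.abs(pre)) / (4 * np.linalg.norm(W1, ord=2))
ok = all(
    np.allclose(f(x_star + radius * (v / np.linalg.norm(v))),
                A @ (x_star + radius * (v / np.linalg.norm(v))) + c,
                atol=1e-9)
    for v in rng.normal(size=(100, 4))
)
```

A Gaussian around x_star with standard deviation well below `radius` therefore effectively sees a single affine map, which is exactly the premise of the theorem.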

Additional Empirical Validation

To empirically validate Theorem 2, we checked whether the optimal solution for

$$
\max_{\Sigma_Z \succeq 0}\ \log\lvert \Sigma_Z \rvert \quad \text{s.t.} \quad \sum_{i} \lambda_i(\Sigma_Z) \leq c
$$

is a diagonal matrix. We trained VICReg with ResNet-18 on CIFAR-10 and applied random perturbations (at different scales) to Σ_Z. Then, for each perturbation, we calculated the average distance of the perturbed matrix from a diagonal matrix, together with the corresponding value of the log-determinant term. In Figure 3, we plot the difference from the optimal value of this term as a function of the distance from the diagonal matrix. As we can see, the optimum is attained close to the diagonal matrix. This observation provides empirical validation of Theorem 2.
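This perturbation experiment can be sketched in a few lines: among symmetric positive definite matrices with a fixed trace, the equal-diagonal matrix maximizes the log determinant, so trace-preserving perturbations can only decrease it. A minimal numpy version of this check (the sizes and scales are ours, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
c = float(N)                                  # trace budget
best = (c / N) * np.eye(N)                    # equal-diagonal optimum
best_val = np.linalg.slogdet(best)[1]

results = []
for scale in (1e-3, 1e-2, 1e-1):
    for _ in range(100):
        P = scale * rng.normal(size=(N, N))
        S = best + 0.5 * (P + P.T)            # symmetric perturbation
        S *= c / np.trace(S)                  # renormalize to the same trace
        off = np.linalg.norm(S - np.diag(np.diag(S)))  # distance from diagonal
        gap = best_val - np.linalg.slogdet(S)[1]       # loss in log det
        results.append((off, gap))
```

Every perturbed matrix has a non-negative gap, mirroring the shape of the curve in Figure 3.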


Figure 3: The optimal solution for the optimization problem is a diagonal matrix. The average distance from a diagonal matrix for different perturbation scales. Experiments were conducted on CIFAR-10 with the ResNet-18 network.


Figure 4: Evolution of GMM training when enforcing a one-to-one mapping between the data and centroids, akin to K-means, i.e., using a small, fixed covariance matrix. Left: with fixed input samples, there is no collapse and the entropy of the centers remains high. Right: when the input samples are made trainable and their locations are optimized, all points collapse into a single point, resulting in a sharp decrease in entropy.


EM and GMM

Let us examine a toy dataset in the pattern of two intertwining moons to illustrate the collapse phenomenon under GMM (Figure 1, right). We begin by training a classical GMM with maximum likelihood, where the means are initialized from random samples and the covariance is initialized to the identity matrix. Red dots represent the Gaussians' means after training, while blue dots represent the data points. With fixed input samples, we observe no collapse, and the entropy of the centers is high (Figure 4, left). However, when we make the input samples trainable and optimize their locations, all the points collapse into a single point, resulting in a sharp decrease in entropy (Figure 4, right).

To prevent collapse, we follow the K-means algorithm in enforcing sparse posteriors, i.e., using small initial standard deviations and learning only the means. This forces a one-to-one mapping, which leads every point to be closest to its mean without collapsing, resulting in high entropy (Figure 4 middle, in the Appendix). Another option to prevent collapse is to use different learning rates for the inputs and the parameters; in this setting, collapsing the parameters does not maximize the likelihood. Figure 1 (right) shows the results of GMM training with different learning rates for the learned inputs and the parameters. When the parameter learning rate is sufficiently high compared to the input learning rate, the entropy decreases much more slowly and no collapse occurs.
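A simplified numpy sketch of the collapse mechanism (means held fixed, inputs trainable; all settings are ours and far smaller than the figure's experiment): gradient ascent on the likelihood with respect to the inputs drives every sample onto a mode, sharply reducing the spread of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([-2.0, 2.0])          # fixed component means, unit variance
x = 3.0 * rng.normal(size=50)       # "trainable" input samples
var0 = x.var()

def dlogp_dx(x, mu):
    # Gradient of log p(x) for a uniform 2-component GMM with unit variance.
    d = x[:, None] - mu[None, :]                     # (n, 2)
    logw = -0.5 * d ** 2
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                # responsibilities
    return -(w * d).sum(axis=1)

# Gradient ascent on the likelihood with respect to the *inputs*:
# the samples drift onto the modes and their spread collapses.
for _ in range(2000):
    x += 0.05 * dlogp_dx(x, mu)
```

After training, every point sits (numerically) on one of the two modes, so the dataset's variance, a proxy for its entropy, drops sharply, which is the mechanism behind the collapse in Figure 4 (right).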

SimCLR

In order to evaluate their underlying assumptions and strategies for information maximization, we compare VICReg to contrastive SSL methods such as SimCLR along with non-contrastive methods like BYOL and SimSiam.

Contrastive Learning with SimCLR. In their work, Lee et al. [44] drew a connection between the SimCLR objective and a variational bound on the information contained in the representations by employing the von Mises-Fisher distribution. Combining their result with our analysis of information in deterministic networks, we identify the main differences between SimCLR and VICReg:

(i) Conditional distribution: SimCLR assumes a von Mises-Fisher distribution for the encoder, while VICReg assumes a Gaussian distribution. (ii) Entropy estimation: SimCLR approximates the entropy based on a finite sum over the input samples, whereas VICReg estimates the entropy of Z solely from its second moment. Developing SSL methods that integrate these two distinctions forms an intriguing direction for future research.

Empirical comparison. We trained ResNet-18 on CIFAR-10 with VICReg, SimCLR, and BYOL and compared their entropies directly using the pairwise-distances entropy estimator (for details, see Appendix K). This estimator is not directly optimized by any of the methods and thus provides an independent validation. The results (Figure 2) show that entropy increased for all methods during training, with SimCLR exhibiting the lowest and VICReg the highest entropy.

Entropy Estimators

Entropy estimation is one of the classical problems in information theory, and the Gaussian mixture density is one of the most popular representations: with a sufficient number of components, it can approximate any smooth density with arbitrary accuracy. For Gaussian mixtures, however, there is no closed-form expression for the differential entropy. Several approximations exist in the literature, including loose upper and lower bounds [35]. Monte Carlo (MC) sampling is one way to approximate the Gaussian mixture entropy; with sufficiently many MC samples, an arbitrarily accurate unbiased estimate of the entropy can be obtained. Unfortunately, MC sampling is computationally expensive and typically requires a large number of samples, especially in high dimensions [13]. Using the first two moments of the empirical distribution, VICReg employs one of the most straightforward approaches for approximating the entropy; however, previous studies have found that this method is a poor approximation of the entropy in many cases [35]. Another option is to use the LogDet function. Several estimators have been proposed to implement it, including the uniformly minimum variance unbiased (UMVU) estimator [2] and Bayesian methods [50]; these methods, however, often require complex optimization. The LogDet estimator presented in [75] uses the α-order differential entropy with scaled noise; it was demonstrated to apply to high-dimensional features and to be robust to random noise. Based on Taylor-series expansions, [35] presented a lower bound for the entropy of Gaussian mixture random vectors: Taylor-series expansions of the logarithm of each Gaussian mixture component yield an analytical evaluation of the entropy measure. In addition, they present a technique for splitting Gaussian densities to avoid components with high variance, which would otherwise require computationally expensive calculations.
Kolchinsky and Tracey [39] introduce a novel family of estimators for the mixture entropy. Each member of this family is defined by a pairwise-distance function between component densities. These estimators are computationally efficient as long as the pairwise-distance function and the entropy of each component distribution are easy to compute, and the estimator is continuous and smooth, making it useful for optimization problems. In addition, they present both a lower bound (using the Chernoff distance) and an upper bound (using the KL divergence) on the entropy, which are exact when the component distributions are grouped into well-separated clusters.
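For reference, the MC estimator mentioned above takes only a few lines for a uniform Gaussian mixture with shared isotropic covariance (a minimal sketch under our own simplifying assumptions; the function name is ours):

```python
import numpy as np
from scipy.special import logsumexp

def mc_mixture_entropy(means, sigma, n_samples=50000, seed=0):
    """Monte Carlo estimate H = -E_z[log p(z)] for a uniform Gaussian
    mixture with component means `means` (N x d) and shared isotropic
    covariance sigma^2 I."""
    rng = np.random.default_rng(seed)
    N, d = means.shape
    # Draw samples from the mixture: pick a component, then add noise.
    comp = rng.integers(N, size=n_samples)
    z = means[comp] + sigma * rng.normal(size=(n_samples, d))
    # Evaluate the mixture log-density at each sample.
    sq = np.sum((z[:, None, :] - means[None, :, :]) ** 2, axis=-1)
    log_pdf = -0.5 * sq / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)
    return -np.mean(logsumexp(log_pdf, axis=1) - np.log(N))
```

The estimate is unbiased but, as noted above, the per-sample cost grows with the number of components and the variance grows with the dimension, which is what motivates the cheaper bounds.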

Empirical validation of our assumption

To validate our theory, we tested whether the conditional output density P(Z | X) becomes Gaussian as the input noise decreases. We used a ResNet-50 model trained with the SimCLR or VICReg objectives on the CIFAR-10, CIFAR-100, and ImageNet datasets. For each image in the test dataset, we drew 512 Gaussian samples around it and examined whether the samples remained Gaussian at the DNN's penultimate layer, applying D'Agostino and Pearson's normality test [19].

Figure 1 (left) displays the p-value as a function of the normalized standard deviation. For low noise levels, we cannot reject the hypothesis that the network's conditional output density is Gaussian for 85% of the samples when using VICReg. However, the network output deviates from Gaussian as the input noise increases.

Next, we verified our assumption of non-overlapping effective supports under the data distribution. We computed the distribution of pairwise ℓ2 distances between images for several datasets: MNIST [42], CIFAR-10, CIFAR-100 [41], Flowers102 [53], Food101 [12], and FGVC-Aircraft [46]. Figure 1 (right) reveals that the pairwise distances are bounded away from zero, even for raw pixels. This implies that we can place a small Gaussian around each point without overlap, validating our assumption as realistic.
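The distance computation itself is elementary; the sketch below uses uniform random vectors as a stand-in for flattened images (the actual experiment uses the datasets listed above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for 200 flattened 32x32x3 images with pixel values in [0, 1].
X = rng.random(size=(200, 3072))

# All pairwise l2 distances via ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v.
sq = np.sum(X ** 2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
iu = np.triu_indices(len(X), k=1)
dists = np.sqrt(np.maximum(d2[iu], 0.0))

# If the minimum distance is bounded away from zero, Gaussians with a
# standard deviation much smaller than dists.min() / 2 have essentially
# disjoint effective supports, as assumed in the analysis.
```

In high dimension these distances concentrate sharply around their mean, which is why even the minimum pairwise distance stays far from zero.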

Known Lemmas

We use the following well-known results as lemmas in our proofs and state them below for completeness; they are classical results, not ours.

Lemma G.1 (Hoeffding's inequality). Let X_1, …, X_n be independent random variables such that a ≤ X_i ≤ b almost surely. Consider the average S_n = (1/n)(X_1 + ⋯ + X_n). Then, for all t > 0,

$$
P\left(S_n - \mathbb{E}[S_n] \geq t\right) \leq \exp\left(-\frac{2nt^2}{(b-a)^2}\right)
$$

and

$$
P\left(\mathbb{E}[S_n] - S_n \geq t\right) \leq \exp\left(-\frac{2nt^2}{(b-a)^2}\right).
$$

Proof. By Hoeffding's inequality, we have that for all t > 0,

$$
P\left(S_n - \mathbb{E}[S_n] \geq t\right) \leq \exp\left(-\frac{2nt^2}{(b-a)^2}\right)
$$

and

$$
P\left(\mathbb{E}[S_n] - S_n \geq t\right) \leq \exp\left(-\frac{2nt^2}{(b-a)^2}\right).
$$

Setting δ = exp(−2nt²/(b − a)²) and solving for t > 0, we obtain that each of the following holds with probability at least 1 − δ:

$$
S_n - \mathbb{E}[S_n] \leq (b-a)\sqrt{\frac{\ln(1/\delta)}{2n}}
$$

and

$$
\mathbb{E}[S_n] - S_n \leq (b-a)\sqrt{\frac{\ln(1/\delta)}{2n}}.
$$
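The two-sided bound and the δ-inversion above can be verified numerically (a minimal simulation with Bernoulli variables; the parameters are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, a, b = 100, 0.1, 0.0, 1.0
trials = 20000

# X_i ~ Bernoulli(1/2) takes values in [a, b], so E[S_n] = 1/2.
S_n = rng.integers(0, 2, size=(trials, n)).mean(axis=1)
emp_tail = np.mean(S_n - 0.5 >= t)                    # empirical tail probability
bound = np.exp(-2 * n * t ** 2 / (b - a) ** 2)        # Hoeffding bound

# Inverting delta = exp(-2 n t^2 / (b - a)^2) recovers the deviation
# t = (b - a) sqrt(ln(1/delta) / (2 n)) used in the proof.
t_from_delta = (b - a) * np.sqrt(np.log(1.0 / bound) / (2 * n))
```

Here the empirical tail (roughly the Gaussian tail beyond two standard deviations) sits well below the exp(−2) ≈ 0.135 Hoeffding bound, as expected since the bound is not tight for Bernoulli variables.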

It has been shown that generalization bounds can be obtained via Rademacher complexity [8, 51, 60]. The following is a trivial modification of [51, Theorem 3.1] for a one-sided bound on the nonnegative general loss functions:

Lemma G.2. Let G be a set of functions with codomain [0, M]. Then, for any δ > 0, with probability at least 1 − δ over an i.i.d. draw of m samples S = (q_i)_{i=1}^m, the following holds for all ψ ∈ G:

$$
\mathbb{E}_{q}\left[\psi(q)\right] \leq \frac{1}{m}\sum_{i=1}^{m}\psi(q_i) + 2\mathcal{R}_m(\mathcal{G}) + M\sqrt{\frac{\ln(1/\delta)}{2m}},
$$

where R_m(G) := E_{S,ξ}[sup_{ψ∈G} (1/m) Σ_{i=1}^m ξ_i ψ(q_i)] and ξ_1, …, ξ_m are independent uniform random variables taking values in {−1, 1}.

Proof. Let S = (q_i)_{i=1}^m and S′ = (q′_i)_{i=1}^m. Define

$$
\varphi(S) = \sup_{\psi \in \mathcal{G}} \left( \mathbb{E}_{q}\left[\psi(q)\right] - \frac{1}{m}\sum_{i=1}^{m}\psi(q_i) \right).
$$

To apply McDiarmid's inequality to φ(S), we compute an upper bound on |φ(S) − φ(S′)|, where S and S′ are two datasets differing in exactly one point at an arbitrary index i_0; i.e., S_i = S′_i for all i ≠ i_0 and S_{i_0} ≠ S′_{i_0}. Then,

$$
\varphi(S) - \varphi(S') \leq \sup_{\psi \in \mathcal{G}} \frac{\psi(q'_{i_0}) - \psi(q_{i_0})}{m} \leq \frac{M}{m}.
$$

Similarly, φ(S′) − φ(S) ≤ M/m. Thus, by McDiarmid's inequality, for any δ > 0, with probability at least 1 − δ,

$$
\varphi(S) \leq \mathbb{E}_{S}\left[\varphi(S)\right] + M\sqrt{\frac{\ln(1/\delta)}{2m}}.
$$

$$
\begin{aligned}
\mathbb{E}_{S}\left[\varphi(S)\right] &= \mathbb{E}_{S}\left[\sup_{\psi \in \mathcal{G}} \mathbb{E}_{S'}\left[\frac{1}{m}\sum_{i=1}^{m}\left(\psi(q'_i) - \psi(q_i)\right)\right]\right] \\
&\leq \mathbb{E}_{S,S'}\left[\sup_{\psi \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m}\left(\psi(q'_i) - \psi(q_i)\right)\right] \\
&= \mathbb{E}_{S,S',\xi}\left[\sup_{\psi \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m}\xi_i\left(\psi(q'_i) - \psi(q_i)\right)\right] \\
&\leq 2\,\mathbb{E}_{S,\xi}\left[\sup_{\psi \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m}\xi_i\,\psi(q_i)\right] = 2\mathcal{R}_m(\mathcal{G}),
\end{aligned}
$$

where the first line follows from the definitions of each term, the second line uses Jensen's inequality and the convexity of the supremum, the third line follows because, for each ξ_i ∈ {−1, +1}, the distribution of ξ_i(ψ(q′_i) − ψ(q_i)) equals the distribution of ψ(q′_i) − ψ(q_i), since S and S′ are drawn i.i.d. from the same distribution, and the fourth line uses the subadditivity of the supremum. Combining the above inequalities yields the statement.


Implementation Details for Maximizing Entropy Estimators

In this section, we provide more details on the implementation of the experiments conducted in Section 5.2.

Setup. Our experiments are conducted on CIFAR-10 [41]. We use ResNet-18 [32] as our backbone.

Training procedure. The experimental process is organized into two sequential stages: unsupervised pretraining followed by linear evaluation. First, during the unsupervised pretraining phase, the encoder network is trained. Upon its completion, we transition to the linear evaluation phase, which assesses the quality of the representation produced by the pretrained encoder.

Once the pretraining phase is concluded, we adhere to the fine-tuning procedures used in established baseline methods, as described by [14].

During the linear evaluation stage, we start by performing supervised training of the linear classifier. This is achieved by using the representations derived from the encoder network while keeping the network's coefficients frozen, and applying the same training dataset. Subsequently, we measure the test accuracy of the trained linear classifier using a separate validation dataset. This approach allows us to evaluate the performance of our model in a robust and systematic manner.

The training process for each model unfolds over 800 epochs, employing a batch size of 512. We utilize the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 1e-4. The learning rate is initiated at 0.5 and is adjusted according to a cosine decay schedule complemented by a linear warmup phase.

During the data augmentation process, two enhanced versions of every input image are generated. This involves cropping each image randomly and resizing it back to the original resolution. The images are then subjected to random horizontal flipping, color jittering, grayscale conversion, Gaussian blurring, and solarization for further augmentation.

For the linear evaluation phase, the linear classifier is trained for 100 epochs with a batch size of 256. The SGD optimizer is again employed, this time with a momentum of 0.9 and no weight decay. The learning rate follows a cosine decay schedule, starting at 0.2 and decaying to a minimum of 2e-4.
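The frozen-encoder linear evaluation can be sketched end-to-end in a few lines of numpy; here a random projection stands in for the pretrained backbone and a closed-form ridge regression replaces the SGD-trained linear head (all of this is a simplified stand-in for the protocol above, not the actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
# Frozen "encoder": a fixed random projection + ReLU, standing in for a
# pretrained backbone whose coefficients stay frozen during evaluation.
W_enc = rng.normal(size=(20, 10))
encode = lambda X: np.maximum(X @ W_enc.T, 0.0)

# Synthetic 3-class data (a stand-in for the image datasets).
centers = 3.0 * rng.normal(size=(3, 10))
y = rng.integers(0, 3, size=600)
X = centers[y] + rng.normal(size=(600, 10))
Xtr, ytr, Xte, yte = X[:400], y[:400], X[400:], y[400:]

# Linear evaluation: fit only a linear head on the frozen features,
# here in closed form via ridge regression onto one-hot labels.
Z = encode(Xtr)
Y = np.eye(3)[ytr]
W_head = np.linalg.solve(Z.T @ Z + 1e-3 * np.eye(Z.shape[1]), Z.T @ Y)
acc = np.mean(np.argmax(encode(Xte) @ W_head, axis=1) == yte)
```

The point of the protocol is that only `W_head` is fit on labels; the encoder never sees them, so test accuracy measures the linear separability of the frozen representation.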

Expectation Maximization and Collapsing

Assumption 1. The eigenvalues of Σ(x_j) lie in the range a ≤ λ(Σ(x_j)) ≤ b.

Assumption 2. The means of the Gaussians are bounded:

$$
\max_{j}\ \|\mu(X_j)\|_2^2 \leq M.
$$

Lemma J.1. The maximum eigenvalue of µ(X_j)µ(X_j)^⊤ is non-negative and at most M.

Proof. The term µ(X_j)µ(X_j)^⊤ is the outer product of the mean vector µ(X_j) and hence a symmetric, rank-one, positive semi-definite matrix. Its only nonzero eigenvalue equals the squared Euclidean norm ‖µ(X_j)‖², since the singular value of a vector, viewed as a rank-one matrix, is its Euclidean norm. By the second assumption, this is at most M.

Lemma J.2. The maximum eigenvalue of -µ Z µ T Z is non-positive and its absolute value is at most M .

Proof. The term −µ_Z µ_Z^⊤ is the negative outer product of the overall mean vector µ_Z and hence a symmetric negative semi-definite matrix. Its eigenvalues are non-positive, and its only nonzero eigenvalue equals −‖µ_Z‖². Since µ_Z is a convex combination of the component means, ‖µ_Z‖² ≤ max_j ‖µ(X_j)‖², which is at most M by the second assumption.
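Both lemmas reduce to the fact that a rank-one outer product has a single nonzero eigenvalue equal to the squared norm of the vector, which is easy to confirm numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=6)
outer = np.outer(mu, mu)
eigs = np.linalg.eigvalsh(outer)

# A rank-one outer product mu mu^T has exactly one nonzero eigenvalue,
# equal to the squared Euclidean norm ||mu||^2; negating the matrix
# flips the sign of that eigenvalue, as used in Lemma J.2.
```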

Lemma J.3. Under Assumptions 1 and 2, the maximum eigenvalue of Σ_Z is at most b + M.

Proof. Given a Gaussian mixture model where each component Z | x j has mean µ ( X j ) and covariance matrix Σ( x j ) , the mixture can be written as:

$$
p(z) = \sum_{j} p_j\, \mathcal{N}\!\left(z;\ \mu(X_j),\ \Sigma(x_j)\right),
$$

where p j are the mixing coefficients. The covariance matrix of the mixture, Σ Z , is then given by:

$$
\Sigma_Z = \sum_{j} p_j \left( \Sigma(x_j) + \mu(X_j)\mu(X_j)^{\top} \right) - \mu_Z \mu_Z^{\top}.
$$

By Lemmas J.1 and J.2 and Assumptions 1 and 2, the maximum eigenvalues of Σ(x_j), µ(X_j)µ(X_j)^⊤, and −µ_Zµ_Z^⊤ are at most b, M, and 0, respectively. Therefore, by Weyl's inequality for sums of symmetric matrices, the maximum eigenvalue of Σ_Z is at most b + M.
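The mixture-covariance decomposition used here (the law of total covariance) can be verified by sampling; the numpy check below uses our own toy parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 3, 4, 200000
p = np.array([0.2, 0.3, 0.5])
mus = rng.normal(size=(K, d))
A = rng.normal(size=(K, d, d))
covs = A @ A.transpose(0, 2, 1) + 0.1 * np.eye(d)  # per-component SPD covariances

# Law of total covariance for the mixture.
mu_Z = p @ mus
Sigma_Z = sum(p[j] * (covs[j] + np.outer(mus[j], mus[j])) for j in range(K))
Sigma_Z = Sigma_Z - np.outer(mu_Z, mu_Z)

# Empirical check by sampling from the mixture.
comp = rng.choice(K, size=n, p=p)
L = np.linalg.cholesky(covs)                       # batched Cholesky factors
z = mus[comp] + np.einsum('nij,nj->ni', L[comp], rng.normal(size=(n, d)))
emp = np.cov(z.T)
```

The empirical covariance matches the closed form up to sampling error, confirming the three-term decomposition whose eigenvalues the lemmas bound.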

This means that we can bound the sum of the eigenvalues of Σ_Z by

$$
\sum_{i} \lambda_i(\Sigma_Z) \leq (b + M)\, K.
$$

Lemma J.4. Let Σ Z be a positive semidefinite matrix of size N × N . Consider the optimization problem given by:

$$
\max_{\Sigma_Z \succeq 0}\ \log\det(\Sigma_Z) \quad \text{such that} \quad \sum_{i=1}^{N} \lambda_i(\Sigma_Z) \leq c,
$$

where λ_i(Σ_Z) denotes the i-th eigenvalue of Σ_Z and c is a constant. The solution to this problem is a diagonal matrix with equal diagonal elements.

Proof. The determinant of a matrix is the product of its eigenvalues, so the objective log det(Σ_Z) can be rewritten as Σ_{i=1}^N log(λ_i(Σ_Z)). Our problem is then to maximize this sum under the constraints that the sum of the eigenvalues does not exceed c and that Σ_Z is positive semi-definite.

Applying Jensen's inequality to the concave function log(x) with weights 1/N, we find that (1/N) Σ_{i=1}^N log(λ_i(Σ_Z)) ≤ log((1/N) Σ_{i=1}^N λ_i(Σ_Z)), with equality if and only if all λ_i(Σ_Z) are equal.

Setting λ i (Σ Z ) = x for all i , we see that the constraint ∑ N i =1 λ i (Σ Z ) ≤ c becomes Nx ≤ c , leading to the optimal eigenvalue x = c/N under the constraint.

Since Σ Z is positive semi-definite, it can be diagonalized via an orthogonal transformation without changing the sum of its eigenvalues or its determinant. Therefore, the solution to the problem is a diagonal matrix with all diagonal entries equal to c/N .

This completes the proof.
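The Jensen step is easy to confirm numerically: among positive eigenvalues with a fixed sum, the equal allocation maximizes the sum of logarithms (the sizes below are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 6, 6.0
# log det = sum of log-eigenvalues; with the trace fixed at c, Jensen's
# inequality says the equal allocation c/N maximizes the sum.
best = N * np.log(c / N)

worse = []
for _ in range(200):
    lam = rng.random(N)
    lam = lam * (c / lam.sum())      # positive eigenvalues summing to c
    worse.append(np.sum(np.log(lam)))
```

Every randomly drawn eigenvalue profile with the same trace attains a strictly smaller log-determinant than the equal allocation, as the lemma asserts.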


Proof. The objective function can be decomposed as follows:

$$
\sum_{i} \log\lvert \Sigma(X_i)\rvert + K \log\lvert \Sigma_Z \rvert.
$$

In this optimization problem, we are optimizing over Σ Z . The term ∑ i log | Σ( X i ) | is constant with respect to Σ Z , therefore we can focus on maximizing K log | Σ Z | .

As the determinant of a matrix is the product of its eigenvalues, log | Σ Z | is the sum of the logs of the eigenvalues of Σ Z . Thus, maximizing log | Σ Z | corresponds to maximizing the sum of the logarithms of the eigenvalues of Σ Z .

According to Lemma J.4, under a constraint on the sum of the eigenvalues, the solution to the problem of maximizing the sum of the logarithms of the eigenvalues of a positive semidefinite matrix Σ_Z is a diagonal matrix with equal diagonal elements.

From Lemma J.3, we know that the sum of the eigenvalues of Σ_Z is bounded by (b + M) · K. Therefore, when we maximize K log|Σ_Z| under these constraints, the solution is a diagonal matrix with equal diagonal elements. This completes the proof of the theorem.


Additional Notation and Details

We begin by introducing additional notation and details. We write x ∈ X for an input and y ∈ Y ⊆ R^r for an output. Define p(y) = P(Y = y) to be the probability of label y and p̂(y) = (1/n) Σ_{i=1}^n 1{y_i = y} to be the empirical estimate of p(y). Let ζ be an upper bound on the norm of the label: ‖y‖₂ ≤ ζ for all y ∈ Y. Define the minimum-norm solution W_S̄ for the unlabeled data as W_S̄ = minimize_{W′} ‖W′‖_F s.t. W′ ∈ arg min_W (1/m) Σ_{i=1}^m ‖W f_θ(x_i^+) − g*(x_i^+)‖². Let κ_S be a data-dependent upper bound on the per-sample Euclidean norm loss with the trained model: ‖W_S f_θ(x) − y‖ ≤ κ_S for all (x, y) ∈ X × Y. Similarly, let κ_S̄ be a data-dependent upper bound on the per-sample Euclidean norm loss: ‖W_S̄ f_θ(x) − y‖ ≤ κ_S̄ for all (x, y) ∈ X × Y. Define the difference between W_S and W_S̄ by c = ‖W_S − W_S̄‖₂. Let W be a hypothesis space of W such that W_S̄ ∈ W. We denote by R̃_m(W ∘ F) = (1/√m) E_{S̄,ξ}[sup_{W∈W, f∈F} Σ_{i=1}^m ξ_i ‖g*(x_i^+) − W f(x_i^+)‖] the normalized Rademacher complexity of the set {x^+ ↦ ‖g*(x^+) − W f(x^+)‖ : W ∈ W, f ∈ F}. We denote by κ an upper bound on the per-sample Euclidean norm loss: ‖W f(x) − y‖ ≤ κ for all (x, y, W, f) ∈ X × Y × W × F.

We adopt the data-generating process model used in previous papers analyzing contrastive learning [59, 10]. For the labeled data, y is first drawn from the distribution ρ on Y, and then x is drawn from the conditional distribution D_y conditioned on the label y. That is, we have the joint distribution D(x, y) = D_y(x)ρ(y) with ((x_i, y_i))_{i=1}^n ∼ D^n. For the unlabeled data,

first, each of the unknown labels y + and y -is drawn from the distritbuion ρ , and then each of the positive examples x + and x ++ is drawn from the conditional distribution D y + while the negative example x -is drawn from the D y -. Unlike the analysis of contrastive learning, we do not require negative samples. Let τ ¯ S be a data-dependent upper bound on the invariance loss with the trained representation as ∥ f θ (¯ x ) -f θ ( x ) ∥ ≤ τ ¯ S for all (¯ x, x ) ∼ D 2 y and y ∈ Y . Let τ be a data-independent upper bound on the invariance loss with the trained representation as ∥ f (¯ x ) -f ( x ) ∥ ≤ τ for all (¯ x, x ) ∼ D 2 y , y ∈ Y , and f ∈ F . For simplicity, we assume that there exists a function g ∗ such that y = g ∗ ( x ) ∈ R r for all ( x, y ) ∈ X × Y . Discarding this assumption adds the average of label noises to the final result, which goes to zero as the sample sizes n and m increase, assuming that the mean of the label noise is zero.


Self-Supervised Learning (SSL) methods learn representations by optimizing a surrogate objective between inputs and self-defined signals. For example, SimCLR (Chen et al., 2020) uses a contrastive loss to make the representations of different views of the same image similar while pushing apart the representations of different images. These pre-trained representations are then used as feature extractors for downstream supervised tasks such as image classification, object detection, and transfer learning (Caron et al., 2021; Chen et al., 2020; Misra and Maaten, 2020; Shwartz-Ziv et al., 2022). Despite their success in practice, only a few works (Arora et al., 2019; Lee et al., 2021a) have sought to provide theoretical insight into the effectiveness of SSL.

Recently, information-theoretic methods have played a key role in several advances in deep learning, from practical applications in representation learning (Alemi et al., 2016) to theoretical investigations (Xu and Raginsky, 2017; Steinke and Zakynthinou, 2020; Shwartz-Ziv, 2022). Some works have attempted to apply information theory to SSL, for example the InfoMax principle (Linsker, 1988) in SSL (Bachman et al., 2019). However, these works often present objective functions without rigorous justification, make implicit assumptions (Kahana and Hoshen, 2022; Wang et al., 2022; Lee et al., 2021b), and explicitly assume that the deep neural network mappings are stochastic, which is rarely the case for modern neural networks. See Shwartz-Ziv and LeCun (2023) for a detailed review.

This paper presents an information-theoretic perspective on Variance-Invariance-Covariance Regularization (VICReg). We show that the VICReg objective is closely related to approximate mutual information maximization, derive a generalization bound for VICReg, and relate the generalization bound to information maximization. We show that under a series of assumptions about the data, which we validate empirically, our results apply to deterministic deep neural network training and do not require further stochasticity assumptions about the network. To summarize, our key contributions are as follows:

We shift the stochasticity assumption to the deep neural network inputs to study deterministic deep neural networks from an information-theoretic perspective.

We relate the VICReg objective to information-theoretic quantities and use this relationship to highlight the underlying assumptions of the objective.

We study the relationship between the optimization of information-theoretic quantities and predictive performance in downstream tasks by introducing a generalization bound that connects VICReg, information theory, and downstream generalization.

We present information-theoretic SSL methods based on our analysis and empirically validate their performance.

We first introduce the technical background for our analysis.

A rich class of functions emerges from piecewise polynomials: spline operators. In short, given a partition $\Omega$ of a domain $\mathbb{R}^D$, a spline of order $k$ is a mapping defined by a polynomial of order $k$ on each region $\omega \in \Omega$, with continuity constraints on the entire domain for the derivatives of order $0, \dots, k-1$. As we will focus on affine splines ($k = 1$), we define this case only for concreteness. A $K$-dimensional affine spline $f$ produces its output via

$$
f(\bm{z}) = \sum_{\omega \in \Omega} \left(\bm{A}_\omega \bm{z} + \bm{b}_\omega\right) \mathbb{1}_{\{\bm{z} \in \omega\}}, \qquad (1)
$$

with input $\bm{z} \in \mathbb{R}^D$ and $\bm{A}_\omega \in \mathbb{R}^{K \times D}$, $\bm{b}_\omega \in \mathbb{R}^K$, $\forall \omega \in \Omega$ the per-region slope and offset parameters, respectively, with the key constraint that the entire mapping is continuous over the domain, i.e., $f \in \mathcal{C}^0(\mathbb{R}^D)$. Spline operators, and especially affine spline operators, have been widely used in function approximation theory (Cheney and Light, 2009), optimal control (Egerstedt and Martin, 2009), statistics (Fantuzzi et al., 2002), and related fields.

A deep neural network (DNN) is a (non-linear) operator $f_\Theta$ with parameters $\Theta$ that maps an input $\bm{x} \in \mathbb{R}^D$ to a prediction $\bm{y} \in \mathbb{R}^K$. Precise definitions of DNN operators can be found in Goodfellow et al. (2016). To avoid cluttering notation, we will omit $\Theta$ unless needed for clarity. The only assumption we require for our analysis is that the non-linearities present in the DNN are continuous piecewise affine (CPA) mappings, as is the case for (leaky-)ReLU, absolute value, and max-pooling operators. The entire input–output mapping then becomes a CPA spline with an implicit partition $\Omega$, a function of the weights and architecture of the network (Montufar et al., 2014; Balestriero and Baraniuk, 2018). For smooth nonlinearities, our results hold via a first-order Taylor approximation argument.
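As a sketch of the CPA view, the per-region parameters $\bm{A}_\omega, \bm{b}_\omega$ of a small ReLU multilayer perceptron can be extracted in closed form by propagating the active ReLU mask at a given input. The two-layer network below is a stand-in for illustration, not an architecture from the paper.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def region_affine_params(Ws, cs, x):
    # Track the affine map z -> A z + b that a ReLU MLP computes on the
    # partition region containing x. A equals the network Jacobian at x,
    # and b = f(x) - A x, as described in the text.
    A = np.eye(x.size)
    b = np.zeros(x.size)
    h = x.copy()
    for i, (W, c) in enumerate(zip(Ws, cs)):
        A, b = W @ A, W @ b + c          # compose the linear layer
        h = W @ h + c
        if i < len(Ws) - 1:              # ReLU on all but the last layer
            mask = (h > 0).astype(float)  # active units in this region
            A, b, h = mask[:, None] * A, mask * b, relu(h)
    return A, b
```

For any input inside the same partition region, the returned pair $(A, b)$ is identical; crossing a region boundary flips some ReLU masks and hence changes the affine parameters.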

Joint embedding methods learn DNN parameters $\Theta$ without supervision and without input reconstruction. The difficulty of self-supervised learning (SSL) is generating a representation that is useful for downstream tasks whose labels are unavailable during self-supervised training, while avoiding trivial solutions in which the model maps all inputs to a constant output. Many methods have been proposed to solve this problem (see Balestriero and LeCun (2022) for a summary and connections between methods). Contrastive methods, such as SimCLR (Chen et al., 2020) and its InfoNCE criterion (Oord et al., 2018), learn representations by contrasting positive and negative examples. In contrast, non-contrastive methods employ different regularization techniques to prevent representation collapse and do not explicitly rely on negative samples. Some methods use stop-gradients and extra predictors to avoid collapse (Chen and He, 2021; Grill et al., 2020), while Caron et al. (2020) use an additional clustering step. Of particular interest to us is the Variance-Invariance-Covariance Regularization method (VICReg; Bardes et al. (2021)), which considers two embedding batches $\bm{Z} = [f(\bm{x}_1), \dots, f(\bm{x}_N)]$ and $\bm{Z}' = [f(\bm{x}'_1), \dots, f(\bm{x}'_N)]$, each of size $N \times K$. Denoting by $\bm{C}$ the $K \times K$ covariance matrix obtained from $[\bm{Z}, \bm{Z}']$, the VICReg triplet loss is given by
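The three VICReg terms just referenced can be sketched in numpy as follows. The coefficients $\lambda, \mu, \nu$, the hinge target $\gamma$, and the per-branch covariance form follow the standard VICReg formulation (Bardes et al., 2021); the specific default values below are illustrative.

```python
import numpy as np

def vicreg_loss(Z, Zp, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    # Invariance: mean squared distance between the two embedding batches.
    inv = np.mean(np.sum((Z - Zp) ** 2, axis=1))

    # Variance: hinge keeping the std of every embedding dimension above gamma.
    def var_term(B):
        std = np.sqrt(B.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, gamma - std))

    # Covariance: squared off-diagonal entries of the batch covariance matrix.
    def cov_term(B):
        Bc = B - B.mean(axis=0)
        C = (Bc.T @ Bc) / (len(B) - 1)
        off = C - np.diag(np.diag(C))
        return np.sum(off ** 2) / B.shape[1]

    return (lam * inv
            + mu * (var_term(Z) + var_term(Zp))
            + nu * (cov_term(Z) + cov_term(Zp)))
```

Note that collapsed (constant) embeddings are heavily penalized by the variance hinge even though their invariance term is zero, which is exactly the anti-collapse mechanism discussed above.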

Recently, information-theoretic methods have played an essential role in advancing deep learning (Alemi et al., 2016; Xu and Raginsky, 2017; Steinke and Zakynthinou, 2020; Shwartz-Ziv and Tishby, 2017b) by developing and applying information-theoretic estimators and learning principles to DNN training (Hjelm et al., 2018; Belghazi et al., 2018; Piran et al., 2020; Shwartz-Ziv et al., 2018). However, information-theoretic objectives for deterministic DNNs often exhibit a common pitfall: they assume that DNN mappings are stochastic, an assumption that is usually violated. As a result, the mutual information between the input and the DNN representation in such objectives is infinite, resulting in ill-posed optimization problems. To avoid this problem, stochastic DNNs with variational bounds can be used, where the output of the deterministic network serves as the parameters of a conditional distribution (Lee et al., 2021b; Shwartz-Ziv and Alemi, 2020). Dubois et al. (2021) assumed that the randomness of data augmentation between the two views is the source of stochasticity in the network. Other work assumed a random input, but without making any assumptions about the distribution of the network's output, and relied on general lower bounds to analyze the objective (Wang and Isola, 2020; Zimmermann et al., 2021). For supervised learning, Goldfeld et al. (2018) introduced an auxiliary (noisy) DNN by injecting additive noise into the model and demonstrated that the resulting model is a good proxy for the original (deterministic) DNN in terms of both performance and representation. Finally, Achille and Soatto (2018) found that minimizing a stochastic network with a regularizer is equivalent to minimizing the cross-entropy over deterministic DNNs with multiplicative noise. All of these methods assume that the source of randomness lies in the DNN, contradicting common practice.

This section provides an information-theoretic perspective on SSL in deterministic deep neural networks. We begin by introducing assumptions about the information-theoretic challenges in SSL (Section 3.1) and about the data distribution (Section 3.2). More specifically, we assume throughout that any training sample $\bm{x}$ can be seen as coming from a single Gaussian distribution, $\bm{x} \sim \mathcal{N}(\mu_{\bm{x}}, \Sigma_{\bm{x}})$. From this, we show that the output of any DNN, $f(\bm{x})$, corresponds to a mixture of truncated Gaussian distributions (Section 3.3). This enables information measures to be applied to deterministic DNNs. Using these assumptions, we then show that an approximation to the VICReg objective can be recovered from information-theoretic principles.

To better understand the differences between key SSL methods and to suggest new ones, we first formulate the general SSL goal from an information-theoretic perspective. This formulation allows us to analyze and compare different SSL methods based on their ability to maximize the mutual information between the representations. Furthermore, it opens up the possibility of new SSL methods that improve upon existing ones by finding new ways to maximize this information. We start with the MultiView InfoMax principle, which aims to maximize the mutual information between each view, $X$ or $X'$, and the representation of the other view, $Z'$ or $Z$. As shown in Federici et al. (2020), to maximize this information, we maximize $I(Z; X')$ and $I(Z'; X)$ using the lower bound

$$
I(Z; X') \ge H(Z) + \mathbb{E}_{x'}\!\left[\mathbb{E}_{z \mid x'}\left[\log q(z \mid x')\right]\right], \qquad (6)
$$

where $H(Z)$ is the entropy of $Z$. In supervised learning, where we need to maximize $I(Z; Y)$, the labels $Y$ are fixed, the entropy term $H(Y)$ is constant, and we only need to optimize the log-loss $\mathbb{E}\left[\log q(y \mid z)\right]$ (cross-entropy or squared loss). However, it is well known that for Siamese networks a degenerate solution exists in which all outputs "collapse" to an undesired constant value (Chen et al., 2020). Looking at Equation 6, we can see that the entropies are not constant and are optimized throughout the learning process. Therefore, minimizing only the log loss will cause the representations to collapse to the trivial solution of being constant (where the entropy goes to zero). To regularize these entropies, that is, to prevent collapse, different methods take different approaches to implicitly regularizing information. To better understand these methods, we introduce results about the data distribution (in Section 3.2) and about the pushforward measure of the data under the neural network transformation (in Section 3.3).

First, we examine how the output random variables of the network are represented and assume a distribution over the data. Under the manifold hypothesis, any point can be seen as a Gaussian random variable with a low-rank covariance matrix in the direction of the manifold tangent space of the data (Fefferman et al., 2016). Therefore, throughout this study, we consider the conditioning of a latent representation with respect to the mean of the observation, i.e., $X \mid \bm{x}^* \sim \mathcal{N}(\bm{x}^*, \Sigma_{\bm{x}^*})$, where the eigenvectors of $\Sigma_{\bm{x}^*}$ lie in the same linear subspace as the tangent space of the data manifold at $\bm{x}^*$, which varies with the position of $\bm{x}^*$ in space. Hence, a dataset is a collection $\{\bm{x}^*_n, n = 1, \dots, N\}$ and the full data distribution is a sum of low-rank-covariance Gaussian densities, as in

$$
p(\bm{x}) = \sum_{n=1}^{N} \mathcal{N}\left(\bm{x}; \bm{x}^*_n, \Sigma_{\bm{x}^*_n}\right) P(T = n),
$$

with $T$ the uniform Categorical random variable. For simplicity, we assume that the effective supports of $\mathcal{N}(\bm{x}^*_i, \Sigma_{\bm{x}^*_i})$ and $\mathcal{N}(\bm{x}^*_j, \Sigma_{\bm{x}^*_j})$ do not overlap, where the effective support is defined as $\{x \in \mathbb{R}^D : p(x) > \epsilon\}$. Therefore, we have

$$
p(\bm{x}) \approx \frac{1}{N} \mathcal{N}\left(\bm{x}; \bm{x}^*_{n(\bm{x})}, \Sigma_{\bm{x}^*_{n(\bm{x})}}\right),
$$

where $\mathcal{N}(\bm{x}; \cdot, \cdot)$ is the Gaussian density at $\bm{x}$ and $n(\bm{x}) = \arg\min_{n} (\bm{x} - \bm{x}^*_n)^T \Sigma_{\bm{x}^*_n}^{-1} (\bm{x} - \bm{x}^*_n)$. This assumption, that a dataset is a mixture of Gaussians with non-overlapping supports, will simplify our derivations below and could be extended to the general case if needed.

Consider an affine spline operator $f$ (Equation 1) mapping from a space of dimension $D$ to a space of dimension $K$ with $K \ge D$. The span of this mapping, which we denote as its image, is given by

$$
\mathrm{Im}(f) = \bigcup_{\omega \in \Omega} \text{Aff}(\omega; \bm{A}_\omega, \bm{b}_\omega),
$$

with $\text{Aff}(\omega; \bm{A}_\omega, \bm{b}_\omega) = \{\bm{A}_\omega \bm{x} + \bm{b}_\omega : \bm{x} \in \omega\}$ the affine transformation of region $\omega$ by the per-region parameters $\bm{A}_\omega, \bm{b}_\omega$, and with $\Omega$ the partition of the input space in which $\bm{x}$ lives. In practice, the per-region affine mapping can be obtained by setting $\bm{A}_\omega$ to the Jacobian matrix of the network at the corresponding input $\bm{x}$, and $\bm{b}_\omega$ to $f(\bm{x}) - \bm{A}_\omega \bm{x}$. Therefore, the DNN mapping consists of affine transformations on each input-space partition region $\omega \in \Omega$, based on the coordinate change induced by $\bm{A}_\omega$ and the shift induced by $\bm{b}_\omega$.

When the input space is equipped with a density distribution, this density is transformed by the mapping $f$. In general, the density of $f(X)$ is intractable. However, given the disjoint-support assumption from Section 3.2, we can arbitrarily increase the representation power of the density model by increasing the number of prototypes $N$. By doing so, the support of each Gaussian becomes included within the region $\omega$ in which its mean lies, leading to the following result:

Given the setting of Equation 8, the unconditional DNN output density of $Z$ is approximately a mixture of the affinely transformed distributions $\bm{x} \mid \bm{x}^*_{n(\bm{x})}$:

$$
p(\bm{z}) \approx \sum_{n=1}^{N} P(T = n)\, \mathcal{N}\!\left(\bm{z};\, \bm{A}_{\omega(\bm{x}^*_n)} \bm{x}^*_n + \bm{b}_{\omega(\bm{x}^*_n)},\; \bm{A}_{\omega(\bm{x}^*_n)} \Sigma_{\bm{x}^*_n} \bm{A}_{\omega(\bm{x}^*_n)}^{T}\right),
$$

where $\omega(\bm{x}^*_n) = \omega \in \Omega \iff \bm{x}^*_n \in \omega$ is the partition region in which the prototype $\bm{x}^*_n$ lives.

Proof See Appendix B.
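A quick simulation illustrates the result on a single region: a Gaussian input pushed through the region's affine map $\bm{z} \mapsto \bm{A}\bm{z} + \bm{b}$ is again Gaussian, with mean $\bm{A}\mu + \bm{b}$ and covariance $\bm{A}\Sigma\bm{A}^T$. The dimensions and sample size below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 2))          # per-region slope (K x D), K >= D
b = rng.normal(size=3)               # per-region offset
mu_x = rng.normal(size=2)
L = rng.normal(size=(2, 2)) * 0.1
Sigma_x = L @ L.T                    # small input covariance (stays in one region)

X = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
Z = X @ A.T + b                      # affine pushforward on the region

# Empirical moments match the affinely transformed moments.
assert np.allclose(Z.mean(axis=0), A @ mu_x + b, atol=1e-2)
assert np.allclose(np.cov(Z.T), A @ Sigma_x @ A.T, atol=1e-2)
```

With small input noise the samples never leave the region, so the transformation is exactly affine and no truncation effects appear.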

Next, we show how SSL algorithms for deterministic networks can be derived from information-theoretic principles. According to Section 3.1, we want to maximize $I(Z; X')$ and $I(Z'; X)$. Although this mutual information is intractable in general, we can obtain a tractable variational approximation using the expected loss. First, when the input noise is small, namely when the effective support of the Gaussian centered at $\bm{x}$ is contained within the region $\omega$ of the DNN's input-space partition, the conditional output density reduces to a single Gaussian: $(Z' \mid X' = x_n) \sim \mathcal{N}\left(\mu(x_n), \Sigma(x_n)\right)$, where $\mu(x_n) = \bm{A}_{\omega(\bm{x}_n)} \bm{x}_n + \bm{b}_{\omega(\bm{x}_n)}$ and $\Sigma(x_n) = \bm{A}_{\omega(\bm{x}_n)} \Sigma_{\bm{x}_n} \bm{A}_{\omega(\bm{x}_n)}^{T}$. Second, to compute the expected loss, we need to marginalize out the stochasticity in the output of the network. In general, training with the squared loss is equivalent to maximum likelihood estimation in a Gaussian observation model, $p(z \mid z') \sim \mathcal{N}(z', \Sigma_r)$ with $\Sigma_r = I$. To compute the expected loss over samples of $x'$, we marginalize out the stochasticity in $Z'$, which means that the conditional decoder is also Gaussian: $(Z \mid X' = x_n) \sim \mathcal{N}\left(\mu(x_n), \Sigma_r + \Sigma(x_n)\right)$. However, the expected log loss over samples of $Z$ is hard to compute. We instead focus on a lower bound: the expected log loss over samples of $Z'$. For simplicity, let $\Sigma_r = I$. By Jensen's inequality, we then obtain the following lower bound on $\mathbb{E}_{x'}\left[\log q(z \mid x')\right]$:

Now, taking the expectation over $Z$, we get

Full derivations of Equations (10) and (11) are given in Appendix A. Combining all of the above then yields

To optimize this objective in practice, we can approximate $p(x, x')$ using the empirical data distribution:

Next, we discuss how estimating the intractable entropy $H(Z)$ changes the objective.

In the previous section, we derived an objective function based on information-theoretic principles. The "invariance term" in LABEL:eq:obj is similar to the invariance loss of VICReg. However, computing the regularization term, and $H(Z)$ in particular, is challenging. Estimating the entropy of random variables is a classic problem in information theory, with the Gaussian mixture density being a popular representation. However, there is no closed-form expression for the differential entropy of Gaussian mixtures. Approximations, including loose upper and lower bounds (Huber et al., 2008) and Monte Carlo sampling, exist in the literature. Unfortunately, Monte Carlo sampling is computationally expensive and requires many samples in high dimensions (Brewer, 2017).

One of the simplest approaches to approximating the entropy is to capture the first two moments of the distribution, which yields an upper bound on the entropy. Optimizing an upper bound in this direction offers no guarantee that the original objective improves; in practice, successful results have nonetheless been achieved this way (Martinez et al., 2021; Nowozin et al., 2016), although it may cause instability in the training process. For a detailed discussion and results on different entropy estimators, see Section 5. Letting $\Sigma_Z$ be the covariance matrix of $Z$, we use the first two moments to approximate the entropy we wish to maximize. This way, we obtain the approximation

$$
H(Z) \approx \frac{1}{2} \log \left( (2 \pi e)^K \det \Sigma_Z \right).
$$
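This second-moment approximation is the Gaussian (maximum-entropy) upper bound on $H(Z)$. A minimal numpy estimator is sketched below; the small ridge `eps` is an illustrative regularizer of ours, added only to keep the log-determinant finite on rank-deficient batches.

```python
import numpy as np

def gaussian_entropy_upper_bound(Z, eps=1e-6):
    # H(Z) <= 0.5 * logdet(2*pi*e * Sigma_Z): among all distributions with
    # the first two moments of Z, the Gaussian has maximal entropy.
    d = Z.shape[1]
    Zc = Z - Z.mean(axis=0)
    Sigma = Zc.T @ Zc / (len(Z) - 1) + eps * np.eye(d)
    _, logdet = np.linalg.slogdet(2 * np.pi * np.e * Sigma)  # stable log-det
    return 0.5 * logdet
```

Because this is an upper bound, maximizing it does not by itself guarantee that the true entropy increases, which is precisely the caveat discussed above.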

A standard fact in linear algebra is that the determinant of a matrix is the product of its eigenvalues. Therefore, maximizing the sum of the log-eigenvalues of $\Sigma_Z$ is equivalent to maximizing its log-determinant. Many works have considered this problem (Giles, 2008; Ionescu et al., 2015; Dang et al., 2018). One approach is to compute the solution via an eigendecomposition, which leads to numerical instability (Dang et al., 2018). An alternative is to diagonalize the covariance matrix and increase its diagonal elements. Because the eigenvalues of a diagonal matrix are its diagonal entries, increasing the sum of the log-diagonal terms is then equivalent to increasing the sum of the log-eigenvalues. One way to do this is to push the off-diagonal terms of $\Sigma_Z$ to zero and maximize the sum of its log-diagonal; this can be done with the covariance term of VICReg. Even though this approach is simple and efficient, the values on the diagonal may come close to zero, which can cause instability when the logarithm is computed. Therefore, we use an upper bound and maximize the sum of the diagonal elements directly, which is the variance term of VICReg. In conclusion, we see the connection between the information-theoretic objective and the three terms of VICReg. An exciting research direction is to maximize the eigenvalues of $\Sigma_Z$ using more sophisticated methods, such as a differentiable expression for the eigendecomposition.

Based on the theory outlined in Section 3.3, the conditional output density $p_{\bm{z} \mid \bm{x}}$ reduces to a single Gaussian as the input noise decreases. To validate this, we used a ResNet-18 model trained with either the SimCLR or the VICReg objective on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). For each image in the test dataset, we drew 512 Gaussian samples and analyzed whether each sample remained Gaussian in the penultimate layer of the DNN, using D'Agostino and Pearson's test (D'Agostino, 1971). Figure 1 (left) shows the $p$-value as a function of the normalized standard deviation. For small noise, we can reject the hypothesis that the conditional output density of the network is not Gaussian with a probability of 85% for VICReg. However, as the input noise increases, the network's output becomes less Gaussian, and even in the small-noise regime there is a 15% chance of a Type I error.
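A sketch of this normality check, assuming SciPy is available: `scipy.stats.normaltest` implements D'Agostino and Pearson's $K^2$ test. Applying it to a random one-dimensional projection of the multivariate penultimate-layer samples is our simplification for illustration; the paper's exact protocol may differ.

```python
import numpy as np
from scipy import stats

def gaussianity_pvalue(samples_2d, direction=None, seed=0):
    # Project multivariate samples onto one direction and run the
    # D'Agostino-Pearson K^2 normality test on the 1-D projection.
    # A large p-value means we cannot reject Gaussianity.
    rng = np.random.default_rng(seed)
    d = samples_2d.shape[1]
    if direction is None:
        direction = rng.normal(size=d)
    direction = direction / np.linalg.norm(direction)
    proj = samples_2d @ direction
    _, p = stats.normaltest(proj)
    return p
```

Gaussian samples yield large p-values, while heavy-tailed or skewed samples (e.g., exponential) are rejected decisively.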

Next, to confirm our assumption that the data distribution has non-overlapping effective supports, we calculated the distribution of pairwise $\ell_2$ distances between images for six datasets: MNIST, CIFAR-10, CIFAR-100, Flowers102, Food101, and FGVCAircraft. Figure 1 (right) shows that even for raw pixels, the pairwise distances are far from zero, which means that we can place a small Gaussian around each point without overlap. Therefore, the effective supports of these datasets are non-overlapping, and our assumption is realistic.
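This non-overlap check amounts to verifying that the smallest pairwise $\ell_2$ distance in a batch of flattened images is bounded away from zero, so that Gaussians of sufficiently small radius around each point do not overlap. A numpy sketch:

```python
import numpy as np

def min_pairwise_l2(X):
    # Smallest pairwise l2 distance in a batch of flattened images.
    # If it is bounded away from zero, Gaussians of radius < min/2 centred
    # on the points have (effectively) non-overlapping supports.
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared distances
    np.fill_diagonal(D2, np.inf)                  # ignore self-distances
    return float(np.sqrt(np.maximum(D2.min(), 0.0)))
```

On raw image pixels this minimum is typically large relative to plausible per-point noise scales, which is the empirical observation reported in Figure 1 (right).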

Implementing Equation 12 in practice requires various "design choices." As demonstrated in Section 5, VICReg uses an approximation of the entropy that rests on certain assumptions. Next, we examine different ways to implement the information-based objective by comparing VICReg to contrastive SSL methods such as SimCLR and to non-contrastive methods such as BYOL and SimSiam. We analyze their assumptions and the differences in how they implement the information-maximization objective. Based on our analysis, we then suggest new objective functions that incorporate more recent information and entropy estimators from the information theory literature. This allows us to further improve the performance of SSL and to better understand the underlying learning mechanisms.

Lee et al. (2021b) connect the SimCLR objective (Chen et al., 2020) to the variational bound on the information between representations by using the von Mises-Fisher distribution as the conditional variational family. Combining our analysis of information in deterministic networks with their work, we can identify two main differences between SimCLR and VICReg: (i) conditional distribution: SimCLR assumes a von Mises-Fisher distribution for the encoder, while VICReg assumes a Gaussian distribution; and (ii) entropy estimation: the entropy term in SimCLR is approximated by a finite sum over the input samples, whereas VICReg estimates the entropy of $Z$ solely from the second moment. Creating self-supervised methods that combine these two components is an interesting direction for future research.

As we saw in the previous sections, the different methods use different objective functions to optimize the entropy of their representations. Next, we compare the SSL methods by measuring their entropy directly. To do so, we trained a ResNet-18 architecture (He et al., 2016) on CIFAR-10 (Krizhevsky et al., 2009) with VICReg, SimCLR, and BYOL, and used the pairwise-distance entropy estimator, which is based on the distances between the individual mixture components (Kolchinsky and Tracey, 2017). Even though this quantity is only an estimator of the entropy, it is known to be tight, and it is not directly optimized by any of the methods. Therefore, we can treat it as an external validation of the entropy of the different methods. For more details on this and other entropy estimators, see Section 5.2. In Figure 2, we see that, as expected from our earlier analysis, the entropy decreased during training for all methods. Additionally, SimCLR has the lowest entropy during training, while VICReg has the highest.
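A minimal version of this pairwise-distance estimator, specialized (as an illustrative assumption of this sketch) to a mixture of equal-weight isotropic Gaussians with means at the embeddings and variance $\sigma^2$: with the Bhattacharyya pairwise distance, which for equal covariances is $\|z_i - z_j\|^2 / (8\sigma^2)$, the Kolchinsky-Tracey bound is a lower bound on the mixture entropy.

```python
import numpy as np

def kt_entropy_lower_bound(Z, sigma=0.1):
    # Kolchinsky-Tracey pairwise-distance lower bound on the entropy of a
    # Gaussian mixture with equal weights, means Z[i], covariance sigma^2 I:
    #   H >= H(component) - (1/N) sum_i log( (1/N) sum_j exp(-BD_ij) ),
    # with BD_ij the Bhattacharyya distance between components i and j.
    N, d = Z.shape
    sq = np.sum(Z ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)
    bd = D2 / (8 * sigma ** 2)
    h_comp = 0.5 * d * np.log(2 * np.pi * np.e * sigma ** 2)  # component entropy
    inner = np.log(np.mean(np.exp(-bd), axis=1))  # exp(-bd) in (0, 1], no overflow
    return h_comp - np.mean(inner)
```

When the components are well separated, the bound converges to its maximum, component entropy plus $\log N$, matching the claimed tightness in the clustered regime.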

In Section 5.1, we discussed existing methods that use an approximation to the entropy. Next, we suggest combining the invariance term of these methods with plug-in methods for optimizing the entropy.

The VICReg objective approximates the log-determinant of the empirical covariance matrix using its diagonal terms. However, as discussed in Section 4.1, this estimator can be problematic. Instead, we can plug in different entropy estimators. One option is the LogDet entropy estimator (Zhouyin and Liu, 2021), which provides a tighter upper bound. This estimator uses the differential $\alpha$-order entropy with scaled noise and was previously demonstrated to be a tight estimator for high-dimensional features and robust to random noise. However, since the estimator is an upper bound on the entropy, we are not guaranteed to optimize the original objective when maximizing it. To address this problem, we also use a lower-bound estimator based on the pairwise distances between the individual mixture components (Kolchinsky and Tracey, 2017). For this family, a pairwise-distance function between component densities is defined for each member. These estimators are computationally efficient as long as the pairwise-distance function and the entropy of each component distribution are easy to compute. They are continuous and smooth (and therefore useful for optimization) and converge to the exact solution when the component distributions are grouped into well-separated clusters. We compare these proposed methods with VICReg, SimCLR (Chen et al., 2020), and Barlow Twins (Zbontar et al., 2021).

Our experiments are conducted on CIFAR-10 (Krizhevsky et al., 2009) with a ResNet-18 architecture (He et al., 2016) as the backbone. We use linear evaluation to assess the quality of the representations. For full details, see Appendix H.

It can be seen from LABEL:tab:results that the proposed estimators outperform the original VICReg and SimCLR as well as Barlow Twins. By estimating the entropy with a more accurate estimator, we improve the results of VICReg, and the pairwise-distance estimator, which is a lower bound, achieves the best results. This aligns with the theory: we want to maximize a lower bound on the true entropy. These results suggest that a careful selection of entropy estimators, guided by our framework, leads to better performance.

In the previous sections, we showed the connection between information-theoretic principles and the VICReg objective. Next, we will connect this objective and the information-theoretic principles to the downstream generalization of VICReg by deriving a downstream generalization bound. Together with the results in the previous sections, this relates generalization in VICReg to information maximization and implicit regularization.

Consider input points $x$, outputs $y \in \mathbb{R}^r$, labeled training data $S = ((x_i, y_i))_{i=1}^{n}$ of size $n$, and unlabeled training data $\bar{S} = ((x^+_i, x^{++}_i))_{i=1}^{m}$ of size $m$, where $x^+_i$ and $x^{++}_i$ share the same (unknown) label. With the unlabeled training data, we define the invariance loss

$$
\mathcal{L}_{\mathrm{inv}}(f_\theta) = \frac{1}{m} \sum_{i=1}^{m} \left\| f_\theta(x^+_i) - f_\theta(x^{++}_i) \right\|,
$$

where $f_\theta$ is the representation trained on the unlabeled data $\bar{S}$. We define a labeled loss $\ell_{x,y}(w) = \|W f_\theta(x) - y\|$, where $w = \operatorname{vec}[W] \in \mathbb{R}^{dr}$ is the vectorization of the matrix $W \in \mathbb{R}^{r \times d}$. Let $w_S = \operatorname{vec}[W_S]$ be the minimum norm solution, i.e., $W_S = \mathop{\mathrm{minimize}}_{W'} \|W'\|_F$ such that

$$
W' \in \arg\min_{W} \frac{1}{n} \sum_{i=1}^{n} \left\| W f_\theta(x_i) - y_i \right\|^2,
$$
and, writing $Z_S = [f_\theta(x_1), \dots, f_\theta(x_n)] \in \mathbb{R}^{d \times n}$ and $Z_{\bar{S}} = [f_\theta(x^+_1), \dots, f_\theta(x^+_m)] \in \mathbb{R}^{d \times m}$, the projection matrices

$$
\mathbf{P}_{Z_S} = I_n - Z_S^\top \left(Z_S Z_S^\top\right)^\dagger Z_S, \qquad \mathbf{P}_{Z_{\bar{S}}} = I_m - Z_{\bar{S}}^\top \left(Z_{\bar{S}} Z_{\bar{S}}^\top\right)^\dagger Z_{\bar{S}}.
$$

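The minimum norm solution $W_S$ can be computed with the Moore-Penrose pseudoinverse, which selects the minimum Frobenius-norm minimizer of the empirical squared loss. A numpy sketch with stand-in features and labels:

```python
import numpy as np

def min_norm_linear_head(feats, Y):
    # Minimum Frobenius-norm W among minimizers of (1/n) sum_i ||W f(x_i) - y_i||^2.
    # With feats of shape (n, d) and Y of shape (n, r), the pseudoinverse
    # yields the min-norm least-squares solution: W = (feats^+ Y)^T.
    return (np.linalg.pinv(feats) @ Y).T   # shape (r, d)
```

When the system is overdetermined and consistent the unique minimizer is recovered; when it is underdetermined, the pseudoinverse picks the solution whose rows lie in the row space of the features, i.e., the minimum-norm one.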
We define the label matrix YS=[y1,…,yn]⊤∈ℝn×rsubscript𝑌𝑆superscriptsubscript𝑦1…subscript𝑦𝑛topsuperscriptℝ𝑛𝑟Y_{S}=[y_{1},\dots,y_{n}]^{\top}\in\mathbb{R}^{n\times r} and the unknown label matrix YS¯=[y1+,…,ym+]⊤∈ℝm×rsubscript𝑌¯𝑆superscriptsubscriptsuperscript𝑦1…subscriptsuperscript𝑦𝑚topsuperscriptℝ𝑚𝑟Y_{{\bar{S}}}=[y^{+}{1},\dots,y^{+}{m}]^{\top}\in\mathbb{R}^{m\times r}, where yi+subscriptsuperscript𝑦𝑖y^{+}{i} is the unknown label of xi+subscriptsuperscript𝑥𝑖x^{+}{i}. Let ℱℱ\mathcal{F} be a hypothesis space of fθsubscript𝑓𝜃f_{\theta}. For a given hypothesis space ℱℱ\mathcal{F}, we define the normalized Rademacher complexity

where $\xi_{1},\dots,\xi_{m}$ are independent uniform random variables taking values in $\{-1,1\}$. It is normalized such that $\tilde{\mathcal{R}}_{m}(\mathcal{F})=O(1)$ as $m\rightarrow\infty$ for typical choices of hypothesis spaces $\mathcal{F}$, including deep neural networks (Bartlett et al., 2017; Kawaguchi et al., 2018).
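As an illustration of this definition, the sketch below estimates the empirical Rademacher complexity of a simple class by Monte Carlo. The class of norm-bounded linear predictors is our own illustrative choice (not one used in the paper), picked because its supremum has a closed form $B\,\|\sum_i \xi_i x_i\|/m$; the function name is hypothetical.

```python
import numpy as np

def empirical_rademacher_linear(X, B, n_trials=2000, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w|| <= B}. For this class the supremum over w has
    the closed form B * ||sum_i xi_i x_i|| / m."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    vals = []
    for _ in range(n_trials):
        xi = rng.choice([-1.0, 1.0], size=m)  # Rademacher signs
        vals.append(B * np.linalg.norm(xi @ X) / m)
    return float(np.mean(vals))

rng = np.random.default_rng(1)
r200 = empirical_rademacher_linear(rng.normal(size=(200, 16)), B=1.0)
r800 = empirical_rademacher_linear(np.random.default_rng(1).normal(size=(800, 16)), B=1.0)
```

The unnormalized complexity decays like $1/\sqrt{m}$ (here `r800 < r200`); multiplying by $\sqrt{m}$ gives the normalized quantity that stays $O(1)$, matching the convention above.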

Theorem 2 shows that VICReg improves generalization on supervised downstream tasks. More specifically, minimizing the unlabeled invariance loss while controlling the covariance ${Z_{\bar{S}}}{Z_{\bar{S}}}^{\top}$ and the complexity of representations $\tilde{\mathcal{R}}_{m}(\mathcal{F})$ minimizes the expected labeled loss:

(Informal version). For any $\delta>0$, with probability at least $1-\delta$,

where $\mathcal{Q}_{m,n}=O(G\sqrt{\ln(1/\delta)/m}+\sqrt{\ln(1/\delta)/n})\rightarrow 0$ as $m,n\rightarrow\infty$. In $\mathcal{Q}_{m,n}$, the value of $G$ for the term decaying at the rate $1/\sqrt{m}$ depends on the hypothesis spaces of $f_{\theta}$ and $w$, whereas the term decaying at the rate $1/\sqrt{n}$ is independent of any hypothesis space.

Proof The complete version of Theorem 2 and its proof are presented in Appendix I.

The term $\|\mathbf{P}_{Z_{\bar{S}}}Y_{\bar{S}}\|_{F}$ in Theorem 2 contains the unobservable label matrix $Y_{\bar{S}}$. However, we can minimize this term by using $\|\mathbf{P}_{Z_{\bar{S}}}Y_{\bar{S}}\|_{F}\leq\|\mathbf{P}_{Z_{\bar{S}}}\|_{F}\|Y_{\bar{S}}\|_{F}$ and by minimizing $\|\mathbf{P}_{Z_{\bar{S}}}\|_{F}$. The factor $\|\mathbf{P}_{Z_{\bar{S}}}\|_{F}$ is minimized when the rank of the covariance ${Z_{\bar{S}}}{Z_{\bar{S}}}^{\top}$ is maximized. Since a strictly diagonally dominant matrix is non-singular, this can be enforced by maximizing the diagonal entries while minimizing the off-diagonal entries, as is done in VICReg. For example, if $d\geq m$, then $\|\mathbf{P}_{Z_{\bar{S}}}\|_{F}=0$ when the covariance ${Z_{\bar{S}}}{Z_{\bar{S}}}^{\top}$ is of full rank.
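The rank argument above can be checked numerically. The following minimal sketch (the function name is our own) forms the projector $\mathbf{P}_{Z}=I_{m}-Z^{\top}(ZZ^{\top})^{\dagger}Z$ and shows that its Frobenius norm vanishes exactly when the feature matrix has full column rank, and is nonzero for a rank-deficient one.

```python
import numpy as np

def proj_complement(Z):
    """P_Z = I_m - Z^T (Z Z^T)^dagger Z for Z in R^{d x m} (one feature per column)."""
    d, m = Z.shape
    return np.eye(m) - Z.T @ np.linalg.pinv(Z @ Z.T) @ Z

rng = np.random.default_rng(0)
# Generic Z with d >= m: Z has rank m, so P_Z = 0 and ||P_Z||_F = 0.
Z_full = rng.normal(size=(8, 5))
# Rank-deficient Z (rank 2 < m = 5): the projector is nonzero,
# with ||P_Z||_F^2 = m - rank(Z).
Z_low = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 5))

full_norm = np.linalg.norm(proj_complement(Z_full), "fro")
low_norm = np.linalg.norm(proj_complement(Z_low), "fro")
```

Here `full_norm` is zero up to machine precision while `low_norm` is $\sqrt{m-\operatorname{rank}(Z)}$, which is why de-correlating the features (pushing the covariance toward full rank) shrinks the label-agnostic bound.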

The term $\|\mathbf{P}_{Z_{S}}Y_{S}\|_{F}$ contains only observable variables, and we can directly measure its value using the training data. In addition, the term $\|\mathbf{P}_{Z_{S}}Y_{S}\|_{F}$ is also minimized when the rank of the covariance ${Z_{S}}{Z_{S}}^{\top}$ is maximized. Since the covariances ${Z_{S}}{Z_{S}}^{\top}$ and ${Z_{\bar{S}}}{Z_{\bar{S}}}^{\top}$ concentrate to each other via concentration inequalities with error of order $O(\sqrt{\ln(1/\delta)/n}+\tilde{\mathcal{R}}_{m}(\mathcal{F})\sqrt{\ln(1/\delta)/m})$, we can also minimize the upper bound on $\|\mathbf{P}_{Z_{S}}Y_{S}\|_{F}$ by maximizing the diagonal entries of ${Z_{\bar{S}}}{Z_{\bar{S}}}^{\top}$ while minimizing its off-diagonal entries, as is done in VICReg.

Thus, VICReg can be understood as a method to minimize the generalization bound in Theorem 2 by minimizing the invariance loss while controlling the covariance ${Z_{\bar{S}}}{Z_{\bar{S}}}^{\top}$ to minimize the label-agnostic upper bounds on $\|\mathbf{P}_{Z_{\bar{S}}}Y_{\bar{S}}\|_{F}$ and $\|\mathbf{P}_{Z_{S}}Y_{S}\|_{F}$. If partial information about the labels $Y_{\bar{S}}$ of the unlabeled data is known, we can use it to minimize $\|\mathbf{P}_{Z_{\bar{S}}}Y_{\bar{S}}\|_{F}$ and $\|\mathbf{P}_{Z_{S}}Y_{S}\|_{F}$ directly. This direction can be used to improve VICReg in future work for the partially observable setting.

The SimCLR generalization bound (Saunshi et al., 2019) requires the number of label classes to go to infinity to close the generalization gap, whereas the VICReg bound in Theorem 2 does not require the number of label classes to approach infinity for the generalization gap to vanish. This reflects the fact that, unlike SimCLR, VICReg does not use negative pairs and thus does not use a loss function based on the implicit expectation that the labels of a negative pair $(y^{+},y^{-})$ are different. Another difference is that our VICReg bound improves as $n$ increases, while the previous SimCLR bound (Saunshi et al., 2019) does not depend on $n$. This is because Saunshi et al. (2019) assume partial access to the true distribution $p(x\mid y)$ per class for setting $W$, which removes the importance of the labeled data size $n$ and is not assumed in our study.

Consequently, the generalization bound in Theorem 2 provides new insight for VICReg regarding the relative effects of $m$ vs. $n$ through $G\sqrt{\ln(1/\delta)/m}+\sqrt{\ln(1/\delta)/n}$. Finally, Theorem 2 also illuminates the advantages of VICReg over standard supervised training. With standard training, the generalization bound via Rademacher complexity requires the complexities of hypothesis spaces, $\tilde{\mathcal{R}}_{n}(\mathcal{W})/\sqrt{n}$ and $\tilde{\mathcal{R}}_{n}(\mathcal{F})/\sqrt{n}$, with respect to the size of the labeled data $n$, instead of the size of the unlabeled data $m$.

Thus, Theorem 2 shows that with self-supervised learning, we can replace all hypothesis-space complexities in terms of $n$ with complexities in terms of $m$. Since the number of unlabeled data points is typically much larger than the number of labeled data points, this illuminates the benefit of self-supervised learning.

Theorem 2, together with the result of the previous section, shows that, for generalization in the downstream task, it is helpful to maximize the mutual information $I(Z;X^{\prime})$ in SSL by minimizing the invariance loss $I_{{\bar{S}}}(f_{\theta})$ while controlling the covariance ${Z_{\bar{S}}}{Z_{\bar{S}}}^{\top}$. The term $2\tilde{\mathcal{R}}_{m}(\mathcal{F})/\sqrt{m}$ captures the importance of controlling the complexity of the representations $f_{\theta}$. To understand this term further in terms of mutual information, consider a discretization of the parameter space of $\mathcal{F}$ such that $|\mathcal{F}|<\infty$ (indeed, a computer always implements some discretization of continuous variables). Then, by Massart's finite class lemma, we have that $\tilde{\mathcal{R}}_{m}(\mathcal{F})\leq C\sqrt{\ln|\mathcal{F}|}$ for some constant $C>0$. Moreover, Shwartz-Ziv (2022) shows that we can approximate $\ln|\mathcal{F}|$ by $2^{I(Z;X)}$. Thus, in Theorem 2, the term $I_{{\bar{S}}}(f_{\theta})+\frac{2}{\sqrt{m}}\|\mathbf{P}_{Z_{\bar{S}}}Y_{\bar{S}}\|_{F}+\frac{1}{\sqrt{n}}\|\mathbf{P}_{Z_{S}}Y_{S}\|_{F}$ corresponds to $I(Z;X^{\prime})$, while the term $2\tilde{\mathcal{R}}_{m}(\mathcal{F})/\sqrt{m}$ corresponds to $I(Z;X)$. Recall that the information can be decomposed as

where we want to maximize the predictive information $I(Z;X^{\prime})$ while minimizing $I(Z;X)$ (Federici et al., 2019; Shwartz-Ziv and Tishby, 2017a). Thus, to improve generalization, we also need to control $2\tilde{\mathcal{R}}_{m}(\mathcal{F})/\sqrt{m}$ to restrict the superfluous information $I(Z;X|X^{\prime})$, in addition to minimizing $I_{{\bar{S}}}(f_{\theta})+\frac{2}{\sqrt{m}}\|\mathbf{P}_{Z_{\bar{S}}}Y_{\bar{S}}\|_{F}+\frac{1}{\sqrt{n}}\|\mathbf{P}_{Z_{S}}Y_{S}\|_{F}$, which corresponds to maximizing the predictive information $I(Z;X^{\prime})$. Although we can explicitly add regularization on $I(Z;X|X^{\prime})$ to control $2\tilde{\mathcal{R}}_{m}(\mathcal{F})/\sqrt{m}$, it is possible that $I(Z;X|X^{\prime})$ and $2\tilde{\mathcal{R}}_{m}(\mathcal{F})/\sqrt{m}$ are implicitly regularized via the implicit bias of design choices (Gunasekar et al., 2017; Soudry et al., 2018; Gunasekar et al., 2018). Thus, Theorem 2 connects the information-theoretic understanding of VICReg with the probabilistic guarantee on downstream generalization.
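The decomposition referred to above can be written out explicitly. A sketch, assuming the standard multi-view Markov structure $Z\leftarrow X\leftrightarrow X^{\prime}$ (i.e., $Z$ is computed from $X$ alone, so $I(Z;X^{\prime}\mid X)=0$), as in Federici et al. (2019):

```latex
I(Z;X)
\;=\;
\underbrace{I(Z;X')}_{\text{predictive information}}
\;+\;
\underbrace{I(Z;X \mid X')}_{\text{superfluous information}} .
```

Under this split, minimizing the invariance-related terms targets the first summand, while controlling the representation complexity targets the second.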

In this study, we examined Variance-Invariance-Covariance Regularization for self-supervised learning from an information-theoretic perspective. By transferring the stochasticity required for an information-theoretic analysis to the input distribution, we showed how the VICReg objective can be derived from information-theoretic principles, used this perspective to highlight assumptions implicit in the VICReg objective, derived a VICReg generalization bound for downstream tasks, and related it to information maximization.

Finally, we built on the insights from our analysis to propose a new VICReg-style SSL objective. Our probabilistic guarantee suggests that VICReg can be further improved in settings with partial label information by aligning the covariance matrix with the partially observable label matrix, which opens up several avenues for future work, including the design of improved estimators for information-theoretic quantities and investigations into the suitability of different SSL methods for specific data characteristics.

Tim G. J. Rudner is funded by a Qualcomm Innovation Fellowship.

In this section of the supplementary material, we present the full derivation of the lower bound on $\mathbb{E}_{x^{\prime}}\left[\log q(z|x^{\prime})\right]$. Because $Z^{\prime}|X^{\prime}$ is Gaussian, we can write it as $Z^{\prime}=\mu(x^{\prime})+L(x^{\prime})\epsilon$, where $\epsilon\sim\mathcal{N}(0,I)$ and $L(x^{\prime})^{T}L(x^{\prime})=\Sigma(x^{\prime})$. Now, setting $\Sigma_{r}=I$ gives us:

where $\mathbb{E}_{x^{\prime}}\left[\log q(z|x^{\prime})\right]=\mathbb{E}_{x^{\prime}}\left[\log\mathbb{E}_{z^{\prime}|x^{\prime}}\left[q(z|z^{\prime})\right]\right]\geq\mathbb{E}_{z^{\prime}}\left[\log q(z|z^{\prime})\right]$ by Jensen's inequality, $\mathbb{E}_{\epsilon}[\epsilon]=0$, and $\mathbb{E}_{\epsilon}\left[\epsilon^{\top}\left(L(x^{\prime})^{T}L(x^{\prime})\right)\epsilon\right]=\operatorname{Tr}\log\Sigma(x^{\prime})$ by Hutchinson's estimator.
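The Hutchinson identity invoked here, $\mathbb{E}_{\epsilon}[\epsilon^{\top}A\,\epsilon]=\operatorname{Tr}(A)$ for $\mathbb{E}[\epsilon]=0$, $\operatorname{Cov}[\epsilon]=I$, is easy to verify numerically. The sketch below (our own illustration, with an arbitrary symmetric $A$ standing in for $L(x^{\prime})^{T}L(x^{\prime})$) checks it by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
L = rng.normal(size=(d, d))
A = L.T @ L  # plays the role of L(x')^T L(x')

# Hutchinson's estimator: E_eps[eps^T A eps] = Tr(A) when E[eps] = 0, Cov[eps] = I.
n_samples = 200_000
eps = rng.normal(size=(n_samples, d))
estimate = np.mean(np.einsum("ni,ij,nj->n", eps, A, eps))
exact = np.trace(A)
```

With $2\times 10^{5}$ probes the relative error is well below one percent; in practice, far fewer probes are used per step since the estimator is unbiased.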

Proof  We know that if $\int_{\omega}p(\bm{x}|\bm{x}^{*}_{n(\bm{x})})d\bm{x}\approx 1$, then $f$ is linear within the effective support of $p$. Therefore, any sample from $p$ will almost surely lie within a single region $\omega\in\Omega$, and the entire mapping can therefore be considered linear with respect to $p$. Thus, the output distribution is a linear transformation of the input distribution based on the per-region affine mapping.

The bound in the complete version of Theorem 4 is better than the one in the informal version of Theorem 2 because of the factor $c$. The factor $c$ measures the difference between the minimum-norm solution $W_{S}$ of the labeled training data and the minimum-norm solution $W_{\bar{S}}$ of the unlabeled training data. Thus, the factor $c$ also decreases towards zero as $n$ and $m$ increase. Moreover, if the labeled and unlabeled training data are similar, the value of $c$ is small, which further decreases the generalization bound, as expected. Thus, we can view the factor $c$ as a measure of the distance between the labeled and the unlabeled training data.

We obtain the informal version from the complete version of Theorem 2 by the following reasoning, which simplifies the notation in the main text. We have that $cI_{{\bar{S}}}(f_{\theta})+c\frac{2\tilde{\mathcal{R}}_{m}(\mathcal{F})}{\sqrt{m}}=I_{{\bar{S}}}(f_{\theta})+\frac{2\tilde{\mathcal{R}}_{m}(\mathcal{F})}{\sqrt{m}}+Q$, where $Q=(c-1)(I_{{\bar{S}}}(f_{\theta})+\frac{2\tilde{\mathcal{R}}_{m}(\mathcal{F})}{\sqrt{m}})\leq\varsigma\rightarrow 0$ as $m,n\rightarrow\infty$, since $c\rightarrow 0$ as $m,n\rightarrow\infty$. However, this reasoning is used only to simplify the notation in the main text. The bound in the complete version of Theorem 2 is more accurate and indeed tighter than the one in the informal version.

Proof [Proof of Theorem 2]  Let $W=W_{S}$, where $W_{S}$ is the minimum-norm solution $W_{S}=\mathop{\mathrm{minimize}}_{W^{\prime}}\|W^{\prime}\|_{F}$ s.t. $W^{\prime}\in\operatorname*{arg\,min}_{W}\frac{1}{n}\sum_{i=1}^{n}\|Wf_{\theta}(x_{i})-y_{i}\|^{2}$. Let $W^{*}=W_{\bar{S}}$, where $W_{\bar{S}}$ is the minimum-norm solution $W^{*}=W_{{\bar{S}}}=\mathop{\mathrm{minimize}}_{W^{\prime}}\|W^{\prime}\|_{F}$ s.t. $W^{\prime}\in\operatorname*{arg\,min}_{W}\frac{1}{m}\sum_{i=1}^{m}\|Wf_{\theta}(x^{+}_{i})-g^{*}(x^{+}_{i})\|^{2}$. Since $y=g^{*}(x)$,

where $\varphi(x)=g^{*}(x)-W^{*}f_{\theta}(x)$. Define $L_{S}(w)=\frac{1}{n}\sum_{i=1}^{n}\|Wf_{\theta}(x_{i})-y_{i}\|$. Using these,

where ${\tilde{W}}=W-W^{*}$. We now consider fresh samples ${\bar{x}}_{i}\sim\mathcal{D}_{y_{i}}$ for $i=1,\dots,n$ to rewrite the above further as:

This implies that

Furthermore, since $y=W^{*}f_{\theta}(x)+\varphi(x)$, by writing ${\bar{y}}_{i}=W^{*}f_{\theta}({\bar{x}}_{i})+\varphi({\bar{x}}_{i})$ (where ${\bar{y}}_{i}=y_{i}$ since ${\bar{x}}_{i}\sim\mathcal{D}_{y_{i}}$ for $i=1,\dots,n$),

Combining these, we have that

To bound the left-hand side of equation C.15, we now analyze the following random variable:

where ${\bar{y}}_{i}=y_{i}$ since ${\bar{x}}_{i}\sim\mathcal{D}_{y_{i}}$ for $i=1,\dots,n$. Importantly, this means that since $W_{S}$ depends on $y_{i}$, $W_{S}$ depends on ${\bar{y}}_{i}$. Thus, the collection of random variables $\|W_{S}f_{\theta}({\bar{x}}_{1})-{\bar{y}}_{1}\|,\dots,\|W_{S}f_{\theta}({\bar{x}}_{n})-{\bar{y}}_{n}\|$ is not independent. Accordingly, we cannot apply a standard concentration inequality to bound equation C.16. A standard approach in learning theory is to first bound equation C.16 by $\mathbb{E}_{x,y}\|W_{S}f_{\theta}(x)-y\|-\frac{1}{n}\sum_{i=1}^{n}\|W_{S}f_{\theta}({\bar{x}}_{i})-{\bar{y}}_{i}\|\leq\sup_{W\in\mathcal{W}}\mathbb{E}_{x,y}\|Wf_{\theta}(x)-y\|-\frac{1}{n}\sum_{i=1}^{n}\|Wf_{\theta}({\bar{x}}_{i})-{\bar{y}}_{i}\|$ for some hypothesis space $\mathcal{W}$ (that is independent of $S$) and observe that the right-hand side now contains the collection of independent random variables $\|Wf_{\theta}({\bar{x}}_{1})-{\bar{y}}_{1}\|,\dots,\|Wf_{\theta}({\bar{x}}_{n})-{\bar{y}}_{n}\|$, for which we can utilize standard concentration inequalities. This reasoning leads to the Rademacher complexity of the hypothesis space $\mathcal{W}$. However, the complexity of the hypothesis space $\mathcal{W}$ can be very large, resulting in a loose bound.
In this proof, we show that we can avoid the dependency on the hypothesis space $\mathcal{W}$ by using a very different approach based on conditional expectations to handle the dependent random variables $\|W_{S}f_{\theta}({\bar{x}}_{1})-{\bar{y}}_{1}\|,\dots,\|W_{S}f_{\theta}({\bar{x}}_{n})-{\bar{y}}_{n}\|$. Intuitively, we utilize the fact that these dependent random variables have a structure of conditional independence, conditioned on each $y\in\mathcal{Y}$.

We first write the expected loss as the sum of the conditional expected loss:

where $X_{y}$ is the random variable for the conditional distribution given $Y=y$. Using this, we decompose equation C.16 into two terms:

where ${\tilde{\mathcal{Y}}}=\{y\in\mathcal{Y}:|\mathcal{I}_{y}|\neq 0\}$. Substituting these into equation C.17 yields

Importantly, while $\|W_{S}f_{\theta}({\bar{x}}_{1})-{\bar{y}}_{1}\|,\dots,\|W_{S}f_{\theta}({\bar{x}}_{n})-{\bar{y}}_{n}\|$ on the right-hand side of equation C.18 are dependent random variables, $\|W_{S}f_{\theta}({\bar{x}}_{1})-y\|,\dots,\|W_{S}f_{\theta}({\bar{x}}_{n})-y\|$ are independent random variables, since $W_{S}$ and ${\bar{x}}_{i}$ are independent and $y$ is fixed here. Thus, by using Hoeffding's inequality (Lemma 1) and taking a union bound over $y\in{\tilde{\mathcal{Y}}}$, we have that with probability at least $1-\delta$, the following holds for all $y\in{\tilde{\mathcal{Y}}}$:

We will now analyze the term $\frac{1}{n}\sum_{i=1}^{n}\|\varphi(x_{i})\|+\frac{1}{n}\sum_{i=1}^{n}\|\varphi({\bar{x}}_{i})\|$ on the right-hand side of equation C.21. Since $W^{*}=W_{\bar{S}}$,

Moreover, by using (Mohri et al., 2012, Theorem 3.1) with the loss function $x^{+}\mapsto\|g^{*}(x^{+})-Wf(x^{+})\|$ (i.e., Lemma 2), we have that for any $\delta>0$, with probability at least $1-\delta$,

where $\tilde{\mathcal{R}}_{m}(\mathcal{W}\circ\mathcal{F})=\frac{1}{\sqrt{m}}\mathbb{E}_{{\bar{S}},\xi}[\sup_{W\in\mathcal{W},f\in\mathcal{F}}\sum_{i=1}^{m}\xi_{i}\|g^{*}(x^{+}_{i})-Wf(x^{+}_{i})\|]$ is the normalized Rademacher complexity of the set $\{x^{+}\mapsto\|g^{*}(x^{+})-Wf(x^{+})\|:W\in\mathcal{W},f\in\mathcal{F}\}$ (it is normalized such that $\tilde{\mathcal{R}}_{m}(\mathcal{F})=O(1)$ as $m\rightarrow\infty$ for typical choices of $\mathcal{F}$), and $\xi_{1},\dots,\xi_{m}$ are independent uniform random variables taking values in $\{-1,1\}$. Taking a union bound, we have that for any $\delta>0$, with probability at least $1-\delta$,

Here, since $Wf_{\theta}(x^{+}_{i})\in\mathbb{R}^{r}$, we have that

where $I_{r}\in\mathbb{R}^{r\times r}$ is the identity matrix, $[f_{\theta}(x^{+}_{i})^{\top}\otimes I_{r}]\in\mathbb{R}^{r\times dr}$ is the Kronecker product of the two matrices, and $\operatorname{vec}[W]\in\mathbb{R}^{dr}$ is the vectorization of the matrix $W\in\mathbb{R}^{r\times d}$. Thus, by defining $A_{i}=[f_{\theta}(x^{+}_{i})^{\top}\otimes I_{r}]\in\mathbb{R}^{r\times dr}$ and using the notation $w=\operatorname{vec}[W]$ and its inverse $W=\operatorname{vec}^{-1}[w]$ (i.e., the inverse of the vectorization from $\mathbb{R}^{r\times d}$ to $\mathbb{R}^{dr}$ with a fixed ordering), we can rewrite equation C.24 as
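The vectorization identity used in this step, $Wf=[f^{\top}\otimes I_{r}]\operatorname{vec}[W]$ (a special case of $\operatorname{vec}(AXB)=(B^{\top}\otimes A)\operatorname{vec}(X)$ with column-major ordering), can be verified directly; the sketch below uses arbitrary random matrices of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d = 3, 5
W = rng.normal(size=(r, d))
f = rng.normal(size=d)  # plays the role of f_theta(x_i^+)

# With column-major (Fortran-order) vectorization, W f = [f^T kron I_r] vec[W].
A_i = np.kron(f[None, :], np.eye(r))  # shape (r, d*r)
w = W.flatten(order="F")              # vec[W] in R^{d r}, columns stacked
lhs = W @ f
rhs = A_i @ w
```

Block $j$ of $A_i$ is $f_{j}I_{r}$, so $A_i w=\sum_{j}f_{j}W_{:,j}=Wf$; this is exactly why the labeled loss becomes linear least squares in $w$.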

with $g_{i}=g^{*}(x^{+}_{i})\in\mathbb{R}^{r}$. Since the function $w\mapsto\sum^{m}_{i=1}\|g_{i}-A_{i}w\|^{2}$ is convex, a necessary and sufficient condition for the minimizer of this function is obtained by

In other words,

where $(A^{\top}A)^{\dagger}$ is the Moore–Penrose inverse of the matrix $A^{\top}A$ and $\operatorname{Null}(A)$ is the null space of the matrix $A$. Thus, the minimum-norm solution is obtained by
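The minimum-norm characterization above is easy to sanity-check numerically: in an underdetermined least-squares problem, $(A^{\top}A)^{\dagger}A^{\top}g$ picks out the minimizer of smallest norm, and adding any null-space component keeps the residual unchanged but increases the norm. A minimal sketch with an arbitrary random system of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
# Underdetermined system: many least-squares minimizers exist.
A = rng.normal(size=(10, 25))
g = rng.normal(size=10)

w_pinv = np.linalg.pinv(A.T @ A) @ A.T @ g       # (A^T A)^dagger A^T g
w_lstsq = np.linalg.lstsq(A, g, rcond=None)[0]   # NumPy's minimum-norm solution

# Any other minimizer w_pinv + v with v in Null(A) achieves the same residual
# but has strictly larger norm (w_pinv lies in Row(A), orthogonal to Null(A)).
v = rng.normal(size=25)
v_null = v - np.linalg.pinv(A) @ (A @ v)          # project v onto Null(A)
w_other = w_pinv + v_null
```

Both formulas agree, and `w_other` has the same fit with a larger norm, matching the proof's use of the pseudoinverse.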

where the inequality follows from Jensen's inequality and the concavity of the square-root function. Thus, we have that

where ${\tilde{W}}=W_{S}-W^{*}$ and $\mathbf{P}_{A}=I-A(A^{\top}A)^{\dagger}A^{\top}$.

where $\|{\tilde{W}}\|_{2}$ is the spectral norm of ${\tilde{W}}$. Since ${\bar{x}}_{i}$ shares the same label as $x_{i}$ because ${\bar{x}}_{i}\sim\mathcal{D}_{y_{i}}$ (and $x_{i}\sim\mathcal{D}_{y_{i}}$), and because $f_{\theta}$ is trained with the unlabeled data ${\bar{S}}$, using Hoeffding's inequality (Lemma 1) implies that with probability at least $1-\delta$,

Define ${Z_{\bar{S}}}=[f(x^{+}_{1}),\dots,f(x^{+}_{m})]\in\mathbb{R}^{d\times m}$. Then, we have $A=[{Z_{\bar{S}}}^{\top}\otimes I_{r}]$. Thus,

where $\mathbf{P}_{Z_{\bar{S}}}=I_{m}-{Z_{\bar{S}}}^{\top}({Z_{\bar{S}}}{Z_{\bar{S}}}^{\top})^{\dagger}{Z_{\bar{S}}}\in\mathbb{R}^{m\times m}$. By defining $Y_{\bar{S}}=[g^{*}(x^{+}_{1}),\dots,g^{*}(x^{+}_{m})]^{\top}\in\mathbb{R}^{m\times r}$, since $g=\operatorname{vec}[Y_{\bar{S}}^{\top}]$,

On the other hand, recall that $W_{S}$ is the minimum-norm solution as

where ${Z_{S}}=[f(x_{1}),\dots,f(x_{n})]\in\mathbb{R}^{d\times n}$ and $Y_{S}=[y_{1},\dots,y_{n}]^{\top}\in\mathbb{R}^{n\times r}$. Then,

We use the following well-known theorems as lemmas in our proof and include them below for completeness; they are classical results, not our own.

(Hoeffding's inequality)  Let $X_{1},\dots,X_{n}$ be independent random variables such that $a\leq X_{i}\leq b$ almost surely. Consider the average of these random variables, $S_{n}=\frac{1}{n}(X_{1}+\cdots+X_{n})$. Then, for all $t>0$,
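Hoeffding's tail bound $\Pr(S_{n}-\mathbb{E}[S_{n}]\geq t)\leq\exp(-2nt^{2}/(b-a)^{2})$ can be checked empirically. The sketch below, with illustrative parameter choices of our own (Bernoulli samples, $n=50$, $t=0.15$), compares the simulated tail probability against the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 50, 0.15
a, b = 0.0, 1.0  # X_i in [0, 1]

# Empirical tail probability P(S_n - E[S_n] >= t) for Bernoulli(1/2) samples.
trials = 100_000
X = rng.integers(0, 2, size=(trials, n)).astype(float)
S_n = X.mean(axis=1)
empirical_tail = np.mean(S_n - 0.5 >= t)

# Hoeffding's bound: exp(-2 n t^2 / (b - a)^2).
bound = np.exp(-2 * n * t**2 / (b - a) ** 2)
```

The empirical tail sits well below the bound, as expected: Hoeffding is distribution-free and therefore conservative for any particular distribution.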

Proof  By using Hoeffding's inequality, we have that for all $t>0$,

Setting $\delta=\exp\left(-\frac{2nt^{2}}{(b-a)^{2}}\right)$ and solving for $t>0$,

It has been shown that generalization bounds can be obtained via Rademacher complexity (Bartlett and Mendelson, 2002; Mohri et al., 2012; Shalev-Shwartz and Ben-David, 2014). The following is a trivial modification of (Mohri et al., 2012, Theorem 3.1) for a one-sided bound on the nonnegative general loss functions:

Let $\mathcal{G}$ be a set of functions with codomain $[0,M]$. Then, for any $\delta>0$, with probability at least $1-\delta$ over an i.i.d. draw of $m$ samples $S=(q_{i})_{i=1}^{m}$, the following holds for all $\psi\in\mathcal{G}$:

where $\mathcal{R}_{m}(\mathcal{G}):=\mathbb{E}_{S,\xi}[\sup_{\psi\in\mathcal{G}}\frac{1}{m}\sum_{i=1}^{m}\xi_{i}\psi(q_{i})]$ and $\xi_{1},\dots,\xi_{m}$ are independent uniform random variables taking values in $\{-1,1\}$.

Proof  Let $S=(q_{i})_{i=1}^{m}$ and $S^{\prime}=(q_{i}^{\prime})_{i=1}^{m}$. Define

To apply McDiarmid's inequality to $\varphi(S)$, we compute an upper bound on $|\varphi(S)-\varphi(S^{\prime})|$, where $S$ and $S^{\prime}$ are two test datasets differing in exactly one point of an arbitrary index $i_{0}$; i.e., $S_{i}=S^{\prime}_{i}$ for all $i\neq i_{0}$ and $S_{i_{0}}\neq S^{\prime}_{i_{0}}$. Then,

where the first line follows from the definitions of each term, the second line uses Jensen's inequality and the convexity of the supremum, and the third line follows from the fact that for each $\xi_{i}\in\{-1,+1\}$, the distribution of each term $\xi_{i}(\ell(f(x^{\prime}_{i}),y^{\prime}_{i})-\ell(f(x_{i}),y_{i}))$ is the distribution of $(\ell(f(x^{\prime}_{i}),y^{\prime}_{i})-\ell(f(x_{i}),y_{i}))$, since $S$ and $S^{\prime}$ are drawn i.i.d. from the same distribution. The fourth line uses the subadditivity of the supremum.

In contrastive learning, different augmented views of the same image are attracted (positive pairs), while augmented views of different images are repelled (negative pairs). MoCo (He et al., 2020) and SimCLR (Chen et al., 2020) are recent examples of self-supervised visual representation learning that reduce the gap between self-supervised and fully supervised learning. SimCLR applies randomized augmentations to an image to create two different views, $x$ and $y$, and encodes both of them with a shared encoder, producing representations $r_{x}$ and $r_{y}$. Both $r_{x}$ and $r_{y}$ are $\ell_{2}$-normalized. The SimCLR version of the InfoNCE objective is:

where $\eta$ is a temperature term and $K$ is the number of views in a minibatch.
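A minimal sketch of such an InfoNCE loss is given below. It is a simplified one-directional variant of our own construction (full SimCLR symmetrizes over both views and all $2K$ augmentations); the function name and the toy embeddings are illustrative assumptions.

```python
import numpy as np

def simclr_infonce(r_x, r_y, eta=0.5):
    """Simplified InfoNCE on paired embeddings.
    r_x, r_y: (K, d) arrays of paired views; matching rows are positives,
    all other rows in the batch act as negatives. Embeddings are
    l2-normalized, and eta is the temperature."""
    r_x = r_x / np.linalg.norm(r_x, axis=1, keepdims=True)
    r_y = r_y / np.linalg.norm(r_y, axis=1, keepdims=True)
    logits = (r_x @ r_y.T) / eta  # (K, K) cosine similarities / temperature
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))  # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
aligned = simclr_infonce(z, z + 0.01 * rng.normal(size=z.shape))  # near-identical views
random_pairs = simclr_infonce(z, rng.normal(size=(8, 32)))        # unrelated "views"
```

Well-aligned positive pairs yield a much smaller loss than random pairings, which for $K$ pairs hovers near $\ln K$.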

Entropy estimation is one of the classical problems in information theory, and Gaussian mixture densities are among the most popular representations: with a sufficient number of components, they can approximate any smooth density with arbitrary accuracy. For Gaussian mixtures, however, there is no closed-form expression for the differential entropy. Several approximations exist in the literature, including loose upper and lower bounds (Huber et al., 2008). Monte Carlo (MC) sampling is one way to approximate the Gaussian mixture entropy: with enough MC samples, an arbitrarily accurate unbiased estimate of the entropy can be obtained. Unfortunately, MC sampling is computationally expensive and typically requires a large number of samples, especially in high dimensions (Brewer, 2017). Using the first two moments of the empirical distribution, VICReg uses one of the most straightforward approaches for approximating the entropy; previous studies have found, however, that this method is a poor approximation of the entropy in many cases (Huber et al., 2008). Another option is to use the LogDet function. Several estimators have been proposed to implement it, including the uniformly minimum variance unbiased (UMVU) estimator (Ahmed and Gokhale, 1989) and Bayesian methods (Misra et al., 2005); these methods, however, often require complex optimization. The LogDet estimator presented in Zhouyin and Liu (2021) estimates the $\alpha$-order differential entropy using scaled noise; they demonstrated that it can be applied to high-dimensional features and is robust to random noise. Based on Taylor-series expansions, Huber et al. (2008) presented a lower bound for the entropy of Gaussian mixture random vectors: they use Taylor-series expansions of the logarithm of each Gaussian mixture component to obtain an analytical evaluation of the entropy measure.
In addition, they present a technique for splitting Gaussian densities to avoid components with high variance, which would require computationally expensive calculations. Kolchinsky and Tracey (2017) introduce a novel family of estimators for the mixture entropy, where each member of the family is defined by a pairwise distance function between component densities. These estimators are computationally efficient as long as the pairwise distance function and the entropy of each component distribution are easy to compute. Moreover, the estimator is continuous and smooth and is therefore useful for optimization problems. In addition, they present both a lower bound (using the Chernoff distance) and an upper bound (using the KL divergence) on the entropy, which are exact when the component distributions are grouped into well-separated clusters.
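The Monte Carlo approach discussed above can be sketched in a few lines for the one-dimensional case; the function and the single-component sanity check are our own illustration, not an estimator used by any of the cited works.

```python
import numpy as np

def mc_entropy_gmm(means, stds, weights, n_samples=200_000, seed=0):
    """Monte-Carlo estimate of the differential entropy of a 1-D Gaussian
    mixture: H = -E_p[log p(X)], estimated by sampling from the mixture."""
    rng = np.random.default_rng(seed)
    comp = rng.choice(len(weights), size=n_samples, p=weights)
    x = rng.normal(means[comp], stds[comp])
    # log p(x) under the full mixture, via log-sum-exp over components.
    log_comp = (
        -0.5 * ((x[:, None] - means[None, :]) / stds[None, :]) ** 2
        - np.log(stds[None, :] * np.sqrt(2 * np.pi))
        + np.log(weights[None, :])
    )
    mx = log_comp.max(axis=1, keepdims=True)
    log_p = (mx + np.log(np.exp(log_comp - mx).sum(axis=1, keepdims=True))).ravel()
    return -log_p.mean()

# Sanity check: a single component recovers the Gaussian closed form
# H = 0.5 * log(2 * pi * e * sigma^2).
h_mc = mc_entropy_gmm(np.array([0.0]), np.array([2.0]), np.array([1.0]))
h_exact = 0.5 * np.log(2 * np.pi * np.e * 4.0)
```

The estimate is unbiased but sampling-hungry, which is exactly the computational drawback noted above, and it worsens in high dimensions.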

Let us examine a toy dataset with the pattern of two intertwining moons to illustrate the collapse phenomenon under a GMM (Figure 1, right). We begin by training a classical GMM with maximum likelihood, where the means are initialized from random samples and the covariance is initialized as the identity matrix. Red dots represent the Gaussians' means after training, while blue dots represent the data points. With fixed input samples, we observe no collapse, and the entropy of the centers is high (Figure 4, left, in the Appendix). However, when we make the input samples trainable and optimize their locations, all the points collapse into a single point, resulting in a sharp decrease in entropy (Figure 4, right, in the Appendix).

To prevent collapse, we follow the K-means algorithm in enforcing sparse posteriors, i.e., using small initial standard deviations and learning only the means. This forces a one-to-one mapping in which every point is closest to a mean without collapsing, resulting in high entropy (Figure 4, middle, in the Appendix). Another option to prevent collapse is to use different learning rates for the inputs and the parameters. In this setting, collapsing the parameters does not maximize the likelihood. Figure 1 (right) shows the results of a GMM with different learning rates for the learned inputs and parameters. When the parameter learning rate is sufficiently high compared to the input learning rate, the entropy decreases much more slowly and no collapse occurs.
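A minimal numerical sketch of this toy study (illustrative, not the paper's exact setup: the arc construction, number of centroids, step counts, and learning rates are assumptions) optimizes both the inputs and the means of an isotropic GMM by gradient descent and tracks a simple dispersion proxy for the entropy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two noisy interleaved arcs, standing in for the two-moons dataset.
t = rng.uniform(0.0, np.pi, 200)
x = np.concatenate([np.c_[np.cos(t), np.sin(t)],
                    np.c_[1.0 - np.cos(t), 0.5 - np.sin(t)]])
x += rng.normal(scale=0.05, size=x.shape)

def nll_grads(xi, mu):
    """Gradients of the negative log-likelihood of an isotropic,
    equal-weight, unit-variance GMM w.r.t. inputs and means (the mean
    gradient is averaged over points for step-size stability)."""
    diff = xi[:, None, :] - mu[None, :, :]
    d2 = (diff ** 2).sum(-1)
    r = np.exp(-0.5 * (d2 - d2.min(1, keepdims=True)))
    r /= r.sum(1, keepdims=True)           # responsibilities, rows sum to 1
    g_mu = -(r[..., None] * diff).mean(0)
    g_x = (r[..., None] * diff).sum(1)
    return g_x, g_mu

def spread(z):
    return z.var(0).sum()                  # dispersion proxy for the entropy

spreads = {}
for lr_x, lr_mu in [(0.5, 0.5), (0.05, 0.5)]:   # input lr vs. parameter lr
    xi, mu = x.copy(), x[rng.choice(len(x), 8)].copy()
    for _ in range(300):
        g_x, g_mu = nll_grads(xi, mu)
        xi -= lr_x * g_x                   # trainable inputs drive the collapse
        mu -= lr_mu * g_mu
    spreads[(lr_x, lr_mu)] = spread(xi)
```

Comparing `spreads` across the two configurations mirrors the qualitative effect described above: a small input learning rate relative to the parameter learning rate keeps the samples dispersed for longer.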

We adopt the following data-generating process, used in previous papers analyzing contrastive learning (Saunshi et al., 2019; Ben-Ari and Shwartz-Ziv, 2018). For the labeled data, $y$ is first drawn from a distribution $\rho$ on $\mathcal{Y}$, and then $x$ is drawn from the conditional distribution $\mathcal{D}_{y}$ given the label $y$. That is, we have the joint distribution $\mathcal{D}(x,y)=\mathcal{D}_{y}(x)\rho(y)$ with $((x_{i},y_{i}))_{i=1}^{n}\sim\mathcal{D}^{n}$. For the unlabeled data, each of the unknown labels $y^{+}$ and $y^{-}$ is first drawn from $\rho$, and then each of the positive examples $x^{+}$ and $x^{++}$ is drawn from the conditional distribution $\mathcal{D}_{y^{+}}$, while the negative example $x^{-}$ is drawn from $\mathcal{D}_{y^{-}}$. Unlike the analysis of contrastive learning, we do not require the negative samples. Let $\tau_{\bar{S}}$ be a data-dependent upper bound on the invariance loss of the trained representation, i.e., $\|f_{\theta}(\bar{x})-f_{\theta}(x)\|\leq\tau_{\bar{S}}$ for all $(\bar{x},x)\sim\mathcal{D}_{y}^{2}$ and $y\in\mathcal{Y}$. Let $\tau$ be a data-independent upper bound on the invariance loss, i.e., $\|f(\bar{x})-f(x)\|\leq\tau$ for all $(\bar{x},x)\sim\mathcal{D}_{y}^{2}$, $y\in\mathcal{Y}$, and $f\in\mathcal{F}$.
For simplicity, we assume that there exists a function $g^{*}$ such that $y=g^{*}(x)\in\mathbb{R}^{r}$ for all $(x,y)\in\mathcal{X}\times\mathcal{Y}$. Discarding this assumption adds the average of the label noise to the final result, which goes to zero as the sample sizes $n$ and $m$ increase, assuming the label noise has zero mean.
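The data-generating process above can be sketched numerically; in this sketch the choice of $\rho$ (uniform over three classes) and of $\mathcal{D}_{y}$ (isotropic Gaussians around class means) are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative conditional distributions D_y: Gaussians around class means.
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])

def sample_labeled(n):
    y = rng.integers(0, 3, size=n)            # y ~ rho (uniform here)
    x = means[y] + rng.normal(size=(n, 2))    # x ~ D_y
    return x, y

def sample_positive_pair():
    y_pos = rng.integers(0, 3)                # unknown label y+
    # Both positives x+ and x++ come from the same conditional D_{y+}.
    return (means[y_pos] + rng.normal(size=2),
            means[y_pos] + rng.normal(size=2))

x, y = sample_labeled(1000)
```

Note that, as in the setting above, the positive pair shares a label but the label itself is never observed by the learner.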

[Figure] Left: The network output after SSL training is more Gaussian for small input noise. The $p$-value of the normality test for different SSL models trained on CIFAR-10 at different input noise levels; the dashed line marks the point at which the null hypothesis (Gaussian distribution) can be rejected with 99% confidence. Right: The Gaussians around each point do not overlap. The plots show the $\ell_{2}$ distances between raw images for different datasets; the distances are largest for the more complex real-world datasets.

[Figure] The entropy of the SSL models decreased during training. The entropy (measured with the LogDet entropy estimator) as a function of the number of training steps for VICReg, SimCLR, and BYOL. Additionally, the SimCLR entropy estimate is tighter than the others.

[Figure] Evolution of the entropy for each learning-rate configuration, showing that picking an incorrect learning rate for the data and/or centroids leads to a collapse of the samples.

[Figure] Evolution of GMM training when enforcing a one-to-one mapping between the data and centroids, akin to K-means, i.e., using a small and fixed covariance matrix; collapse does not occur. Left: with fixed input samples, there is no collapse and the entropy of the centers is high. Right: when the input samples are trainable and their locations are optimized, all the points collapse into a single point, resulting in a sharp decrease in entropy.

$$ Z\sim\sum_{n=1}^{N}\mathcal{N}\left(\bm{A}_{\omega(\bm{x}^{*}_{n})}\bm{x}^{*}_{n}+\bm{b}_{\omega(\bm{x}^{*}_{n})},\ \bm{A}^{T}_{\omega(\bm{x}^{*}_{n})}\Sigma_{\bm{x}^{*}_{n}}\bm{A}_{\omega(\bm{x}^{*}_{n})}\right)^{1_{\{T=n\}}}, $$ \tag{A2.Ex5}

$$ \frac{1}{n}\sum_{i=1}^{n}\|\tilde{W}f_{\theta}(\bar{x}_{i})\|\leq L_{S}(w)+\frac{1}{n}\sum_{i=1}^{n}\|\tilde{W}(f_{\theta}(\bar{x}_{i})-f_{\theta}(x_{i}))\|+\frac{1}{n}\sum_{i=1}^{n}\|\varphi(x_{i})\|. $$ \tag{A3.Ex18}

$$ \mathcal{I}_{y}=\{i\in[n]:y_{i}=y\}. $$ \tag{A3.Ex27}

$$ \mathbb{E}_{X_{y}}[\|W_{S}f_{\theta}(X_{y})-y\|]-\frac{1}{|\mathcal{I}_{y}|}\sum_{i\in\mathcal{I}_{y}}\|W_{S}f_{\theta}(\bar{x}_{i})-y\|\leq\kappa_{S}\sqrt{\frac{\ln(|\tilde{\mathcal{Y}}|/\delta)}{2|\mathcal{I}_{y}|}}. $$ \tag{A3.Ex33}

$$ {\hat{p}}(y)=\frac{|\mathcal{I}_{y}|}{n}. $$ \tag{A3.Ex38}

$$ \frac{1}{n}\sum_{i=1}^{n}\|\varphi(x_{i})\|\leq\frac{1}{m}\sum_{i=1}^{m}\|g^{*}(x^{+}_{i})-W_{\bar{S}}f_{\theta}(x^{+}_{i})\|+\frac{2\tilde{\mathcal{R}}_{m}(\mathcal{W}\circ\mathcal{F})}{\sqrt{m}}+\kappa\sqrt{\frac{\ln(2/\delta)}{2m}}+\kappa_{\bar{S}}\sqrt{\frac{\ln(2/\delta)}{2n}} $$ \tag{A3.Ex50}

$$ W_{\bar{S}}=\operatorname{vec}^{-1}[w_{\bar{S}}]\quad\text{where}\quad w_{\bar{S}}=\mathop{\mathrm{minimize}}_{w^{\prime}}\|w^{\prime}\|_{F}\ \text{ s.t. }w^{\prime}\in\operatorname*{arg\,min}_{w}\sum_{i=1}^{m}\|g_{i}-A_{i}w\|^{2}, $$ \tag{A3.Ex54}

$$ 0=\nabla_{w}\sum_{i=1}^{m}\|g_{i}-A_{i}w\|^{2}=-2\sum_{i=1}^{m}A_{i}^{\top}(g_{i}-A_{i}w)\in\mathbb{R}^{dr} $$ \tag{A3.Ex55}

$$ \sum_{i=1}^{m}A_{i}^{\top}A_{i}w=\sum_{i=1}^{m}A_{i}^{\top}g_{i}. $$ \tag{A3.Ex56}

$$ \operatorname{vec}[W_{\bar{S}}]=w_{\bar{S}}=(A^{\top}A)^{\dagger}A^{\top}g. $$ \tag{A3.Ex59}
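The minimum-norm solution above can be checked numerically; this sketch (the dimensions are illustrative) compares $(A^{\top}A)^{\dagger}A^{\top}g$ against NumPy's least-squares routine, which also returns the minimum-norm solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined system: more unknowns than equations, so the set of
# least-squares minimizers is an affine subspace and the minimum-norm
# element is singled out by the pseudoinverse.
A = rng.normal(size=(6, 10))
g = rng.normal(size=6)

w = np.linalg.pinv(A.T @ A) @ A.T @ g          # (A^T A)^dagger A^T g
w_ref = np.linalg.lstsq(A, g, rcond=None)[0]   # lstsq returns the min-norm solution
```

Both routes agree, and the solution satisfies the normal equations $A^{\top}Aw=A^{\top}g$ stated above.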

$$ \mathbf{P}_{A}=I-[{Z_{\bar{S}}}^{\top}\otimes I_{r}][{Z_{\bar{S}}}{Z_{\bar{S}}}^{\top}\otimes I_{r}]^{\dagger}[{Z_{\bar{S}}}\otimes I_{r}]=I-[{Z_{\bar{S}}}^{\top}({Z_{\bar{S}}}{Z_{\bar{S}}}^{\top})^{\dagger}{Z_{\bar{S}}}\otimes I_{r}]=[\mathbf{P}_{Z_{\bar{S}}}\otimes I_{r}] $$ \tag{A3.Ex77}

$$ \mathbb{P}_{S}\left(\mathrm{E}\left[S_{n}\right]-S_{n}\geq(b-a)\sqrt{\frac{\ln(1/\delta)}{2n}}\right)\leq\delta, $$ \tag{A4.Ex88}

$$ \displaystyle f(\bm{z})=\sum\nolimits_{\omega\in\Omega}(\bm{A}_{\omega}\bm{z}+\bm{b}_{\omega})\mathbbm{1}_{\{\bm{z}\in\omega\}}, $$

$$ \displaystyle\mathcal{L}=\frac{1}{K}\sum_{k=1}^{K}\left(\alpha\,\text{Var}(Z_{k})+\beta\,\text{Cov}(Z_{k},Z_{k^{\prime}})\right)+\gamma\,\text{Inv}(Z_{k},Z_{k^{\prime}}), $$


$$ \displaystyle X\sim\sum_{n=1}^{N}\mathcal{N}(\bm{x}^{*}_{n},\Sigma_{\bm{x}^{*}_{n}})^{1_{\{T=n\}}},\quad T\sim{\rm Cat}(N), $$

$$ \displaystyle\begin{split}\mathbb{E}_{x^{\prime}}\left[\log q(z|x^{\prime})\right]&\geq\mathbb{E}_{z^{\prime}|x^{\prime}}\left[\log q(z|z^{\prime})\right]=\frac{1}{2}\left(d\log 2\pi-\left(z-\mu(x^{\prime})\right)^{2}-\text{Tr}\log\Sigma(x^{\prime})\right).\end{split} $$

$$ \displaystyle L\approx\sum_{i=1}^{N}{\log\frac{|\Sigma_{Z}|}{|\Sigma(x_{i})|\cdot|\Sigma(x_{i}^{\prime})|}}-{\frac{1}{2}\left(\mu(x_{i})-\mu(x_{i}^{\prime})\right)^{2}}. $$

$$ \displaystyle{Z_{S}}=[f(x_{1}),\dots,f(x_{n})]\in\mathbb{R}^{d\times n}\qquad\text{and}\qquad{Z_{\bar{S}}}=[f(x^{+}_{1}),\dots,f(x^{+}_{m})]\in\mathbb{R}^{d\times m}, $$

$$ \displaystyle\mathbf{P}_{Z_{S}}=I-{Z_{S}}^{\top}({Z_{S}}{Z_{S}}^{\top})^{\dagger}{Z_{S}}\qquad\text{and}\qquad\mathbf{P}_{Z_{\bar{S}}}=I-{Z_{\bar{S}}}^{\top}({Z_{\bar{S}}}{Z_{\bar{S}}}^{\top})^{\dagger}{Z_{\bar{S}}}. $$

$$ \displaystyle\tilde{\mathcal{R}}_{m}(\mathcal{F})=\frac{1}{\sqrt{m}}\mathbb{E}_{\bar{S},\xi}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{m}\xi_{i}\|f(x^{+}_{i})-f(x^{++}_{i})\|\right], $$

$$ \displaystyle\begin{split}&\mathbb{E}_{x,y}[\ell_{x,y}(w_{S})]\leq I_{\bar{S}}(f_{\theta})+\frac{2}{\sqrt{m}}\|\mathbf{P}_{Z_{\bar{S}}}Y_{\bar{S}}\|_{F}+\frac{1}{\sqrt{n}}\|\mathbf{P}_{Z_{S}}Y_{S}\|_{F}+\frac{2\tilde{\mathcal{R}}_{m}(\mathcal{F})}{\sqrt{m}}+\mathcal{Q}_{m,n},\end{split} $$

$$ \displaystyle\begin{split}=&\frac{d}{2}\log 2\pi-\frac{1}{2}\mathbb{E}_{\epsilon}\left[\left(\mu(x)-\mu(x^{\prime})\right)^{2}\right]+\mathbb{E}_{\epsilon}\left[\left(\mu(x)-\mu(x^{\prime})\right)L(x)\epsilon\right]\\&-\frac{1}{2}\mathbb{E}_{\epsilon}\left[\epsilon^{T}L(x)^{T}L(x)\epsilon\right]-\frac{1}{2}\mathrm{Tr}\log\Sigma(x^{\prime})\end{split} $$

$$ \displaystyle\quad+\kappa_{S}\sqrt{\frac{2\ln(6|\mathcal{Y}|/\delta)}{2n}}\sum_{y\in\mathcal{Y}}\left(\sqrt{{\hat{p}}(y)}+\sqrt{p(y)}\right) $$

$$ \displaystyle\mathbb{E}_{X,Y}[\|W_{S}f_{\theta}(X)-Y\|] $$

$$ \displaystyle\leq\kappa_{S}\left(\sum_{y\in{\tilde{\mathcal{Y}}}}\sqrt{{\hat{p}}(y)}\right)\sqrt{\frac{\ln(2|{\tilde{\mathcal{Y}}}|/\delta)}{2n}}+\kappa_{S}\left(\sum_{y\in\mathcal{Y}}\sqrt{p(y)}\right)\sqrt{\frac{2\ln(2|\mathcal{Y}|/\delta)}{2n}} $$

$$ \displaystyle A^{\top}Aw=A^{\top}g\quad\text{ where }A=\begin{bmatrix}A_{1}\\ A_{2}\\ \vdots\\ A_{m}\end{bmatrix}\in\mathbb{R}^{mr\times dr}\text{ and }g=\begin{bmatrix}g_{1}\\ g_{2}\\ \vdots\\ g_{m}\end{bmatrix}\in\mathbb{R}^{mr} $$

$$ \displaystyle\mathbb{E}_{q}[\psi(q)]\leq\frac{1}{m}\sum_{i=1}^{m}\psi(q_{i})+2\mathcal{R}_{m}(\mathcal{G})+M\sqrt{\frac{\ln(1/\delta)}{2m}}, $$

$$ \displaystyle\varphi(S^{\prime})-\varphi(S)\leq\sup_{\psi\in\mathcal{G}}\frac{\psi(q_{i_{0}})-\psi(q^{\prime}_{i_{0}})}{m}\leq\frac{M}{m}. $$

$$ \displaystyle\mathbb{E}_{x,y}\left[-\log\left(\frac{e^{\frac{1}{\eta}r_{y}^{T}r_{x}}}{\sum_{k=1}^{K}e^{\frac{1}{\eta}r_{y_{k}}^{T}r_{x}}}\right)\right], $$

Thm. Theorem 1. Given the setting of Equation 8, the unconditional DNN output density $Z$ is approximately a mixture of the affinely transformed distributions $\bm{x}|\bm{x}^{*}_{n(\bm{x})}$:
$$ Z\sim\sum_{n=1}^{N}\mathcal{N}\left(\bm{A}_{\omega(\bm{x}^{*}_{n})}\bm{x}^{*}_{n}+\bm{b}_{\omega(\bm{x}^{*}_{n})},\ \bm{A}^{T}_{\omega(\bm{x}^{*}_{n})}\Sigma_{\bm{x}^{*}_{n}}\bm{A}_{\omega(\bm{x}^{*}_{n})}\right)^{1_{\{T=n\}}}, $$
where $\omega(\bm{x}^{*}_{n})=\omega\in\Omega\iff\bm{x}^{*}_{n}\in\omega$ is the partition region in which the prototype $\bm{x}^{*}_{n}$ lives.

Thm. Theorem 2 (Informal version). For any $\delta>0$, with probability at least $1-\delta$,
$$ \mathbb{E}_{x,y}[\ell_{x,y}(w_{S})]\leq I_{\bar{S}}(f_{\theta})+\frac{2}{\sqrt{m}}\|\mathbf{P}_{Z_{\bar{S}}}Y_{\bar{S}}\|_{F}+\frac{1}{\sqrt{n}}\|\mathbf{P}_{Z_{S}}Y_{S}\|_{F}+\frac{2\tilde{\mathcal{R}}_{m}(\mathcal{F})}{\sqrt{m}}+\mathcal{Q}_{m,n}, \tag{18} $$
where $\mathcal{Q}_{m,n}=O(G\sqrt{\ln(1/\delta)/m}+\sqrt{\ln(1/\delta)/n})\rightarrow 0$ as $m,n\rightarrow\infty$. In $\mathcal{Q}_{m,n}$, the value of $G$ for the term decaying at the rate $1/\sqrt{m}$ depends on the hypothesis space of $f_{\theta}$ and $w$, whereas the term decaying at the rate $1/\sqrt{n}$ is independent of any hypothesis space.

Lemma. Lemma 1 (Hoeffding's inequality). Let $X_{1},\dots,X_{n}$ be independent random variables such that $a\leq X_{i}\leq b$ almost surely, and consider the average $S_{n}=\frac{1}{n}(X_{1}+\cdots+X_{n})$. Then, for any $\delta>0$,
$$ \mathbb{P}_{S}\left(\mathrm{E}\left[S_{n}\right]-S_{n}\geq(b-a)\sqrt{\frac{\ln(1/\delta)}{2n}}\right)\leq\delta \quad\text{and}\quad \mathbb{P}_{S}\left(S_{n}-\mathrm{E}\left[S_{n}\right]\geq(b-a)\sqrt{\frac{\ln(1/\delta)}{2n}}\right)\leq\delta. $$
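Hoeffding's inequality can be checked empirically; a small simulation (the uniform distribution, $n$, $\delta$, and trial count are illustrative choices) verifies that the deviation event occurs with frequency at most $\delta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# X_i ~ Uniform[0, 1], so a = 0, b = 1 and E[S_n] = 0.5.
n, delta, trials = 100, 0.05, 20_000
t = (1 - 0) * np.sqrt(np.log(1 / delta) / (2 * n))  # deviation at confidence 1 - delta

s = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)
freq = np.mean(0.5 - s >= t)   # empirical frequency of the lower-tail event
```

For this sub-Gaussian example the empirical frequency is far below $\delta$, illustrating that Hoeffding is a worst-case bound over all distributions supported on $[a,b]$.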

Lemma. Lemma 2. Let $\mathcal{G}$ be a set of functions with codomain $[0,M]$. Then, for any $\delta>0$, with probability at least $1-\delta$ over an i.i.d. draw of $m$ samples $S=(q_{i})_{i=1}^{m}$, the following holds for all $\psi\in\mathcal{G}$:
$$ \mathbb{E}_{q}[\psi(q)]\leq\frac{1}{m}\sum_{i=1}^{m}\psi(q_{i})+2\mathcal{R}_{m}(\mathcal{G})+M\sqrt{\frac{\ln(1/\delta)}{2m}}, \tag{D.35} $$
where $\mathcal{R}_{m}(\mathcal{G}):=\mathbb{E}_{S,\xi}\left[\sup_{\psi\in\mathcal{G}}\frac{1}{m}\sum_{i=1}^{m}\xi_{i}\psi(q_{i})\right]$ and $\xi_{1},\dots,\xi_{m}$ are independent uniform random variables taking values in $\{-1,1\}$.

Supplementary Material


Proof of Theorem 1

Proof of Theorem 3. Let $W=W_{S}$, where $W_{S}$ is the minimum-norm solution $W_{S}=\mathop{\mathrm{minimize}}_{W^{\prime}}\|W^{\prime}\|_{F}$ s.t. $W^{\prime}\in\operatorname*{arg\,min}_{W}\frac{1}{n}\sum_{i=1}^{n}\|Wf_{\theta}(x_{i})-y_{i}\|^{2}$. Let $W^{*}=W_{\bar{S}}$, where $W_{\bar{S}}$ is the minimum-norm solution $W^{*}=W_{\bar{S}}=\mathop{\mathrm{minimize}}_{W^{\prime}}\|W^{\prime}\|_{F}$ s.t. $W^{\prime}\in\operatorname*{arg\,min}_{W}\frac{1}{m}\sum_{i=1}^{m}\|Wf_{\theta}(x^{+}_{i})-g^{*}(x^{+}_{i})\|^{2}$. Since $y=g^{*}(x)$,

$$

$$

where $\varphi(x)=g^{*}(x)-W^{*}f_{\theta}(x)$. Define $L_{S}(w)=\frac{1}{n}\sum_{i=1}^{n}\|Wf_{\theta}(x_{i})-y_{i}\|$. Using these,

$$

$$

where $\tilde{W}=W-W^{*}$. We now consider fresh samples $\bar{x}_{i}\sim\mathcal{D}_{y_{i}}$ for $i=1,\dots,n$ to rewrite the above further as:

$$

$$

This implies that

$$

$$

$$

$$

Combining these, we have that

$$

$$

To bound the left-hand side of equation 34, we now analyze the following random variable:

$$

$$

where $\bar{y}_{i}=y_{i}$ since $\bar{x}_{i}\sim\mathcal{D}_{y_{i}}$ for $i=1,\dots,n$. Importantly, this means that as $W_{S}$ depends on $y_{i}$, $W_{S}$ depends on $\bar{y}_{i}$. Thus, the collection of random variables $\|W_{S}f_{\theta}(\bar{x}_{1})-\bar{y}_{1}\|,\dots,\|W_{S}f_{\theta}(\bar{x}_{n})-\bar{y}_{n}\|$ is not independent. Accordingly, we cannot apply a standard concentration inequality to bound equation 35. A standard approach in learning theory is to first bound equation 35 by
$$ \mathbb{E}_{x,y}\|W_{S}f_{\theta}(x)-y\|-\frac{1}{n}\sum_{i=1}^{n}\|W_{S}f_{\theta}(\bar{x}_{i})-\bar{y}_{i}\|\leq\sup_{W\in\mathcal{W}}\mathbb{E}_{x,y}\|Wf_{\theta}(x)-y\|-\frac{1}{n}\sum_{i=1}^{n}\|Wf_{\theta}(\bar{x}_{i})-\bar{y}_{i}\| $$
for some hypothesis space $\mathcal{W}$ (that is independent of $S$) and to realize that the right-hand side now contains the collection of independent random variables $\|Wf_{\theta}(\bar{x}_{1})-\bar{y}_{1}\|,\dots,\|Wf_{\theta}(\bar{x}_{n})-\bar{y}_{n}\|$, for which we can utilize standard concentration inequalities. This reasoning leads to the Rademacher complexity of the hypothesis space $\mathcal{W}$. However, the complexity of $\mathcal{W}$ can be very large, resulting in a loose bound. In this proof, we show that we can avoid the dependency on the hypothesis space $\mathcal{W}$ by using a very different approach with conditional expectations to take care of the dependent random variables $\|W_{S}f_{\theta}(\bar{x}_{1})-\bar{y}_{1}\|,\dots,\|W_{S}f_{\theta}(\bar{x}_{n})-\bar{y}_{n}\|$. Intuitively, we utilize the fact that for these dependent random variables there is a structure of conditional independence, conditioned on each $y\in\mathcal{Y}$.

We first write the expected loss as the sum of the conditional expected loss:

$$

$$

where $X_{y}$ is the random variable for the conditional distribution given $Y=y$. Using this, we decompose equation 35 into two terms:

$$

$$

where

$$

$$

$$

$$

as

$$

$$

where $\tilde{\mathcal{Y}}=\{y\in\mathcal{Y}:|\mathcal{I}_{y}|\neq 0\}$. Substituting these into equation 36 yields

$$

$$

Importantly, while $\|W_{S}f_{\theta}(\bar{x}_{1})-\bar{y}_{1}\|,\dots,\|W_{S}f_{\theta}(\bar{x}_{n})-\bar{y}_{n}\|$ on the right-hand side of equation 37 are dependent random variables, $\|W_{S}f_{\theta}(\bar{x}_{1})-y\|,\dots,\|W_{S}f_{\theta}(\bar{x}_{n})-y\|$ are independent random variables, since $W_{S}$ and $\bar{x}_{i}$ are independent and $y$ is fixed here. Thus, by using Hoeffding's inequality (Lemma G.1) and taking union bounds over $y\in\tilde{\mathcal{Y}}$, we have that with probability at least $1-\delta$, the following holds for all $y\in\tilde{\mathcal{Y}}$:

$$

$$

$$

$$

$$

$$

$$

$$

$$

$$

$$

$$

Combining equation 34 and equation 39 implies that with probability at least $1-\delta$,

$$

$$

We now analyze the term $\frac{1}{n}\sum_{i=1}^{n}\|\varphi(x_{i})\|+\frac{1}{n}\sum_{i=1}^{n}\|\varphi(\bar{x}_{i})\|$ on the right-hand side of equation 40. Since $W^{*}=W_{\bar{S}}$,

$$

$$

$$

$$

Moreover, by using [51, Theorem 3.1] with the loss function $x^{+}\mapsto\|g^{*}(x^{+})-Wf(x^{+})\|$ (i.e., Lemma G.2), we have that for any $\delta>0$, with probability at least $1-\delta$,

$$

$$

where $\tilde{\mathcal{R}}_{m}(\mathcal{W}\circ\mathcal{F})=\frac{1}{\sqrt{m}}\mathbb{E}_{\bar{S},\xi}\left[\sup_{W\in\mathcal{W},f\in\mathcal{F}}\sum_{i=1}^{m}\xi_{i}\|g^{*}(x^{+}_{i})-Wf(x^{+}_{i})\|\right]$ is the normalized Rademacher complexity of the set $\{x^{+}\mapsto\|g^{*}(x^{+})-Wf(x^{+})\|:W\in\mathcal{W},f\in\mathcal{F}\}$ (normalized such that $\tilde{\mathcal{R}}_{m}(\mathcal{F})=O(1)$ as $m\rightarrow\infty$ for typical choices of $\mathcal{F}$), and $\xi_{1},\dots,\xi_{m}$ are independent uniform random variables taking values in $\{-1,1\}$. Taking union bounds, we have that for any $\delta>0$, with probability at least $1-\delta$,

$$

$$

$$

$$

$$

$$

$$

$$

$$

$$

where $I_{r}\in\mathbb{R}^{r\times r}$ is the identity matrix, $[f_{\theta}(x^{+}_{i})^{\top}\otimes I_{r}]\in\mathbb{R}^{r\times dr}$ is the Kronecker product of the two matrices, and $\operatorname{vec}[W]\in\mathbb{R}^{dr}$ is the vectorization of the matrix $W\in\mathbb{R}^{r\times d}$. Thus, by defining $A_{i}=[f_{\theta}(x^{+}_{i})^{\top}\otimes I_{r}]\in\mathbb{R}^{r\times dr}$ and using the notation $w=\operatorname{vec}[W]$ and its inverse $W=\operatorname{vec}^{-1}[w]$ (i.e., the inverse of the vectorization from $\mathbb{R}^{r\times d}$ to $\mathbb{R}^{dr}$ with a fixed ordering), we can rewrite equation 43 as

$$

$$

with $g_{i}=g^{*}(x^{+}_{i})\in\mathbb{R}^{r}$. Since the function $w\mapsto\sum_{i=1}^{m}\|g_{i}-A_{i}w\|^{2}$ is convex, a necessary and sufficient condition for the minimizer of this function is

$$

$$

$$

$$

In other words,

$$

$$

Thus,

$$

$$

where $(A^{\top}A)^{\dagger}$ is the Moore-Penrose inverse of the matrix $A^{\top}A$ and $\operatorname{Null}(A)$ is the null space of the matrix $A$. Thus, the minimum-norm solution is obtained by

$$

$$

$$

$$

where the inequality follows from Jensen's inequality and the concavity of the square-root function. Thus, we have that

$$

$$

$$

$$

where $\tilde{W}=W_{S}-W^{*}$ and $\mathbf{P}_{A}=I-A(A^{\top}A)^{\dagger}A^{\top}$.

$$

$$

where $\|\tilde{W}\|_{2}$ is the spectral norm of $\tilde{W}$. Since $\bar{x}_{i}$ shares the same label with $x_{i}$ as $\bar{x}_{i}\sim\mathcal{D}_{y_{i}}$ (and $x_{i}\sim\mathcal{D}_{y_{i}}$), and because $f_{\theta}$ is trained with the unlabeled data $\bar{S}$, using Hoeffding's inequality (Lemma G.1) implies that with probability at least $1-\delta$,

$$

$$

$$

$$

$$

$$

$$

$$

$$

$$

Define $Z_{\bar{S}}=[f(x^{+}_{1}),\dots,f(x^{+}_{m})]\in\mathbb{R}^{d\times m}$. Then we have $A=[Z_{\bar{S}}^{\top}\otimes I_{r}]$. Thus, $\mathbf{P}_{A}=I-[Z_{\bar{S}}^{\top}\otimes I_{r}][Z_{\bar{S}}Z_{\bar{S}}^{\top}\otimes I_{r}]^{\dagger}[Z_{\bar{S}}\otimes I_{r}]=I-[Z_{\bar{S}}^{\top}(Z_{\bar{S}}Z_{\bar{S}}^{\top})^{\dagger}Z_{\bar{S}}\otimes I_{r}]=[\mathbf{P}_{Z_{\bar{S}}}\otimes I_{r}]$, where $\mathbf{P}_{Z_{\bar{S}}}=I_{m}-Z_{\bar{S}}^{\top}(Z_{\bar{S}}Z_{\bar{S}}^{\top})^{\dagger}Z_{\bar{S}}\in\mathbb{R}^{m\times m}$. By defining $Y_{\bar{S}}=[g^{*}(x^{+}_{1}),\dots,g^{*}(x^{+}_{m})]^{\top}\in\mathbb{R}^{m\times r}$, since $g=\operatorname{vec}[Y_{\bar{S}}^{\top}]$,

$$

$$

On the other hand, recall that $W_{S}$ is the minimum-norm solution:

$$

$$

$$

$$

$$

$$

$$

$$

$$

$$

Now,

Information Optimization and the VICReg Objective

Assumption 1. The eigenvalues of $\Sigma(x_{j})$ lie in some range: $a\leq\lambda(\Sigma(x_{j}))\leq b$.

Assumption 2 . The differences between the means of the Gaussians are bounded

$$

$$

$$

$$

Proof. The term $\mu(X_{j})\mu(X_{j})^{T}$ is an outer product of the mean vector $\mu(X_{j})$ and is a symmetric matrix. The eigenvalues of a symmetric matrix equal the squares of the singular values of the original matrix. Since the singular values of a vector are equal to its absolute values, the maximum eigenvalue of $\mu(X_{j})\mu(X_{j})^{T}$ equals the square of the maximum absolute value of $\mu(X_{j})$. By the second assumption, this is at most $M$.

Lemma J.2. The maximum eigenvalue of $-\mu_{Z}\mu_{Z}^{T}$ is non-positive, and its absolute value is at most $M$.

Proof. The term $-\mu_{Z}\mu_{Z}^{T}$ is the negative outer product of the overall mean vector $\mu_{Z}$ and is a symmetric matrix. Its eigenvalues are non-positive and equal the negative squares of the singular values of $\mu_{Z}$. Since the singular values of a vector are equal to its absolute values, the absolute value of the maximum eigenvalue of $-\mu_{Z}\mu_{Z}^{T}$ equals the square of the maximum absolute value of $\mu_{Z}$, which is also bounded by $M$ by the second assumption.

$$

$$

Proof. Given a Gaussian mixture model in which each component $Z|x_{j}$ has mean $\mu(X_{j})$ and covariance matrix $\Sigma(x_{j})$, the mixture can be written as:

$$

$$

where $p_{j}$ are the mixing coefficients. The covariance matrix of the mixture, $\Sigma_{Z}$, is then given by:

$$

$$

By Lemmas J.1 and J.2 and Assumptions 1 and 2, the maximum eigenvalues of $\Sigma(x_{j})$, $\mu(X_{j})\mu(X_{j})^{T}$, and $-\mu_{Z}\mu_{Z}^{T}$ are at most $b$, $M$, and $M$, respectively. Therefore, by Weyl's inequality for the sum of symmetric matrices, the maximum eigenvalue of $\Sigma_{Z}$ is at most $b+M$.

$$

$$

This means that we can bound the sum of the eigenvalues of $\Sigma_{Z}$ by

$$

$$

Lemma J.4. Let $\Sigma_{Z}$ be a positive semidefinite matrix of size $N\times N$. Consider the optimization problem:

$$

$$

$$

$$

where $\lambda_{i}(\Sigma_{Z})$ denotes the $i$-th eigenvalue of $\Sigma_{Z}$ and $c$ is a constant. The solution to this problem is a diagonal matrix with equal diagonal elements.

Proof. The determinant of a matrix is the product of its eigenvalues, so the objective function $\log\det(\Sigma_{Z})$ can be rewritten as $\sum_{i=1}^{N}\log(\lambda_{i}(\Sigma_{Z}))$. Our problem is then to maximize this sum under the constraints that the sum of the eigenvalues does not exceed $c$ and that $\Sigma_{Z}$ is positive semidefinite.

Applying Jensen's inequality to the concave function $\log(x)$ with weights $1/N$, we find that $\frac{1}{N}\sum_{i=1}^{N}\log(\lambda_{i}(\Sigma_{Z}))\leq\log\left(\frac{1}{N}\sum_{i=1}^{N}\lambda_{i}(\Sigma_{Z})\right)$. Equality holds if and only if all $\lambda_{i}(\Sigma_{Z})$ are equal.

Setting $\lambda_{i}(\Sigma_{Z})=x$ for all $i$, the constraint $\sum_{i=1}^{N}\lambda_{i}(\Sigma_{Z})\leq c$ becomes $Nx\leq c$, leading to the optimal eigenvalue $x=c/N$ under the constraint.

Since $\Sigma_{Z}$ is positive semidefinite, it can be diagonalized via an orthogonal transformation without changing the sum of its eigenvalues or its determinant. Therefore, the solution to the problem is a diagonal matrix with all diagonal entries equal to $c/N$.

This completes the proof.
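Lemma J.4 can also be verified numerically; a small sketch (the dimension $N$, the budget $c$, and the number of random trials are illustrative) checks that no random PSD matrix with eigenvalue sum $c$ beats the equal-eigenvalue solution:

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 5, 10.0

# log det of the equal-eigenvalue solution, lambda_i = c / N.
best = N * np.log(c / N)

# log det of random positive definite matrices rescaled so that the
# trace (= sum of eigenvalues) equals c.
logdets = []
for _ in range(100):
    a = rng.normal(size=(N, N))
    s = a @ a.T + 1e-6 * np.eye(N)   # random positive definite matrix
    s *= c / np.trace(s)             # enforce the eigenvalue-sum constraint
    logdets.append(np.linalg.slogdet(s)[1])
```

Every sampled matrix attains a strictly smaller log-determinant, consistent with the Jensen argument in the proof.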

$$

$$

Proof. The objective function can be decomposed as follows:

$$

$$

In this optimization problem, we optimize over $\Sigma_{Z}$. The term $\sum_{i}\log|\Sigma(X_{i})|$ is constant with respect to $\Sigma_{Z}$; we can therefore focus on maximizing $K\log|\Sigma_{Z}|$.

As the determinant of a matrix is the product of its eigenvalues, $\log|\Sigma_{Z}|$ is the sum of the logarithms of the eigenvalues of $\Sigma_{Z}$. Thus, maximizing $\log|\Sigma_{Z}|$ corresponds to maximizing the sum of the logarithms of its eigenvalues.

According to Lemma J.4, under a constraint on the sum of the eigenvalues, the solution to the problem of maximizing the sum of the logarithms of the eigenvalues of a positive semidefinite matrix $\Sigma_{Z}$ is a diagonal matrix with equal diagonal elements.

From Lemma J.3, we know that the sum of the eigenvalues of $\Sigma_{Z}$ is bounded by $(b+M)\times K$. Therefore, when we maximize $K\log|\Sigma_{Z}|$ under these constraints, the solution is a diagonal matrix with equal diagonal elements. This completes the proof of the theorem.

Entropy Comparison - Experimental Details

We use ResNet-18 [32] as our backbone. Each model is trained with a batch size of 512 for 800 epochs. We use the SGD optimizer with a momentum of 0.9 and a weight decay of $1\times 10^{-4}$. The initial learning rate is 0.5 and follows a cosine decay schedule with a linear warmup. For augmentation, two augmented versions of each input image are generated: each image is cropped to a random size and resized to the original resolution, followed by random applications of horizontal mirroring, color jittering, grayscale conversion, Gaussian blurring, and solarization. For the entropy estimation, we use the same method as in [39], which lower-bounds the entropy using the distances between the representations, under the assumption of a mixture of Gaussians with constant variance centered at the representations.
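For reference, a minimal sketch of that pairwise-distance entropy bound (illustrative: the value of `sigma`, the dimensions, and the use of the Bhattacharyya distance for equal-covariance Gaussians are assumptions following Kolchinsky and Tracey [39]):

```python
import numpy as np

def pairwise_entropy_lower_bound(z, sigma=0.1):
    """Kolchinsky-Tracey style lower bound on the entropy of a Gaussian
    mixture with one component N(z_i, sigma^2 I) per representation z_i.
    Uses the Bhattacharyya distance ||z_i - z_j||^2 / (8 sigma^2)."""
    n, d = z.shape
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)      # pairwise squared distances
    h_comp = 0.5 * d * np.log(2 * np.pi * np.e * sigma ** 2) # per-component entropy
    return h_comp - np.mean(np.log(np.mean(np.exp(-d2 / (8 * sigma ** 2)), axis=1)))

rng = np.random.default_rng(0)
z_spread = rng.normal(size=(256, 8))     # dispersed representations
z_collapsed = np.zeros((256, 8))         # fully collapsed representations
```

As expected, the bound is much higher for dispersed representations than for collapsed ones, which is what makes it a usable training-time entropy monitor.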

$$

$$

Figure 5: Our generalization bound predicts the generalization gap in the loss more accurately. (left) Our SSL VICReg generalization bound outperforms state-of-the-art supervised generalization bounds. (right) Strong correlation between the generalization gap and our generalization bound for VICReg (Pearson correlation: 0.9633). Conducted on CIFAR-10.


where $p_{i}$ and $p_{j}$ are the distributions of the representations of the $i$-th and $j$-th examples, and $D(p_{i}\,\|\,p_{j})$ represents the divergence between these distributions. Also, $c_{i}$ indicates the weight of component $i$ ($c_{i}\geq 0$, $\sum_{i}c_{i}=1$), and $C$ is a discrete random variable with $P(C=i)=c_{i}$.

Reproducibility Statement

All of the methods in our study are based on existing methods and their open-source implementations. We provide a detailed implementation setup for both the pre-training and downstream experiments.

Experiments on the Generalization Bound

$$ f(\bm{z})=\sum\nolimits_{\omega\in\Omega}(\bm{A}_{\omega}\bm{z}+\bm{b}_{\omega})\mathbbm{1}_{\{\bm{z}\in\omega\}}, $$ \tag{eq:CPA}

$$ \mathcal{L}=\underbrace{\frac{1}{K}\sum_{k=1}^{K}\left(\alpha\max\left(0,\gamma-\sqrt{\bm{C}_{k,k}+\epsilon}\right)+\beta\sum_{k'\neq k}\left(\bm{C}_{k,k'}\right)^{2}\right)}_{\text{Regularization}}+\underbrace{\eta\|\bm{Z}-\bm{Z}'\|_{F}^{2}/N}_{\text{Invariance}}. $$
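A compact numerical sketch of this loss (the coefficient values, batch size, and embedding dimension are illustrative, not VICReg's published defaults):

```python
import numpy as np

def vicreg_loss(z, z2, alpha=25.0, beta=1.0, eta=25.0, gamma=1.0, eps=1e-4):
    """Sketch of the objective above: variance hinge + covariance penalty
    on each branch, plus an invariance (mean-squared-error) term."""
    n, k = z.shape
    loss = 0.0
    for branch in (z, z2):
        c = np.cov(branch, rowvar=False)                       # K x K covariance C
        var_term = np.maximum(0.0, gamma - np.sqrt(np.diag(c) + eps)).mean()
        off = c - np.diag(np.diag(c))                          # off-diagonal C_{k,k'}
        cov_term = (off ** 2).sum() / k
        loss += alpha * var_term + beta * cov_term
    loss += eta * ((z - z2) ** 2).sum() / n                    # invariance term
    return loss

rng = np.random.default_rng(0)
z = rng.normal(size=(128, 16))
z2 = z + 0.1 * rng.normal(size=(128, 16))   # a slightly perturbed "second view"
```

Identical views incur only the regularization terms, while perturbed views additionally pay the invariance penalty, matching the roles of the two braces in the equation.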

$$ I(Z,X^{\prime})=H(Z)-H(Z|X^{\prime})\geq H(Z)+\mathbb{E}_{x^{\prime}}[\log q(z|x^{\prime})]. $$ \tag{eq:lower_bound3}

$$ X\sim\sum_{n=1}^{N}\mathcal{N}(\bm{x}^{*}_{n},\Sigma_{\bm{x}^{*}_{n}})^{1_{\{T=n\}}}\quad\text{with}\quad T\sim{\rm Cat}(N). $$

$$ p(\bm{x})\approx\mathcal{N}\left(\bm{x};\bm{x}^{*}_{n(\bm{x})},\Sigma_{\bm{x}^{*}_{n(\bm{x})}}\right)/N, $$ \tag{eq:x_density}

$$ Z\sim\sum_{n=1}^{N}\mathcal{N}\left(\bm{A}_{\omega(\bm{x}^{*}_{n})}\bm{x}^{*}_{n}+\bm{b}_{\omega(\bm{x}^{*}_{n})},\ \bm{A}^{T}_{\omega(\bm{x}^{*}_{n})}\Sigma_{\bm{x}^{*}_{n}}\bm{A}_{\omega(\bm{x}^{*}_{n})}\right)^{1_{\{T=n\}}}, $$

$$ \begin{split}\mathbb{E}_{x^{\prime}}\left[\log q(z|x^{\prime})\right]&\geq\mathbb{E}_{z^{\prime}|x^{\prime}}\left[\log q(z|z^{\prime})\right]=\frac{1}{2}\left(d\log 2\pi-\left(z-\mu(x^{\prime})\right)^{2}-\text{Tr}\log\Sigma(x^{\prime})\right).\end{split} $$ \tag{eq:logzz}

$$ L(x_{1}\dots x_{N},x^{\prime}_{1}\dots x^{\prime}_{N})\approx\frac{1}{N}\sum_{i=1}^{N}\underbrace{H(Z)-\log\left(|\Sigma(x_{i})|\cdot|\Sigma(x_{i}^{\prime})|\right)}_{\text{Regularizer}}-\underbrace{\frac{1}{2}\left(\mu(x_{i})-\mu(x_{i}^{\prime})\right)^{2}}_{\text{Invariance}}. $$ \tag{eq:obj}

$$ \max_{\Sigma_{Z}}\left\{\sum_{i=1}^{N}\log\frac{|\Sigma_{Z}(x_{1}\dots x_{N})|}{|\Sigma(x_{i})|\cdot|\Sigma(x_{i}^{\prime})|}\right\} $$

$$ I_{\bar{S}}(f_{\theta})=\frac{1}{m}\sum_{i=1}^{m}\|f_{\theta}(x^{+}_{i})-f_{\theta}(x^{++}_{i})\|, $$

$$ Z_{S}=[f(x_{1}),\dots,f(x_{n})]\in\mathbb{R}^{d\times n}\quad\text{and}\quad Z_{\bar{S}}=[f(x^{+}_{1}),\dots,f(x^{+}_{m})]\in\mathbb{R}^{d\times m}, $$

$$ \mathbf{P}_{Z_{S}}=I-Z_{S}^{\top}(Z_{S}Z_{S}^{\top})^{\dagger}Z_{S}\quad\text{and}\quad\mathbf{P}_{Z_{\bar{S}}}=I-Z_{\bar{S}}^{\top}(Z_{\bar{S}}Z_{\bar{S}}^{\top})^{\dagger}Z_{\bar{S}}. $$

$$ \begin{split}&\mathbb{E}_{x,y}[\ell_{x,y}(w_{S})]\leq I_{\bar{S}}(f_{\theta})+\frac{2}{\sqrt{m}}\|\mathbf{P}_{Z_{\bar{S}}}Y_{\bar{S}}\|_{F}+\frac{1}{\sqrt{n}}\|\mathbf{P}_{Z_{S}}Y_{S}\|_{F}+\frac{2\tilde{\mathcal{R}}_{m}(\mathcal{F})}{\sqrt{m}}+\mathcal{Q}_{m,n},\end{split} $$




$$ \begin{split} & 1/\delta=\exp \left({\frac {2nt^{2}}{(b-a)^{2}}}\right) \\ & \Longrightarrow \ln(1/\delta)= {\frac {2nt^{2}}{(b-a)^{2}}} \\ & \Longrightarrow \frac{(b-a)^{2}\ln(1/\delta)}{2n}= t^2 \\ & \Longrightarrow t =(b-a) \sqrt{\frac{\ln(1/\delta)}{2n} } \end{split} $$

$$ \varphi(S)= \sup_{\psi \in \Gcal} \EE_{x,y}[\psi(q)]-\frac{1}{m}\sum_{i=1}^{m}\psi(q_{i}). $$

$$ \SwapAboveDisplaySkip \varphi(S) \le \EE_{S}[\varphi(S)] + M \sqrt{\frac{\ln(1/\delta)}{2m}}. $$

$$ \begin{split} \EE_{S}[\varphi(S)] &= \EE_{S}\left[\sup_{\psi \in \Gcal} \EE_{S'}\left[\frac{1}{m}\sum_{i=1}^{m}\psi(q_i')\right]-\frac{1}{m}\sum_{i=1}^{m}\psi(q_i)\right] \\ & \le\EE_{S,S'}\left[\sup_{\psi \in \Gcal} \frac{1}{m}\sum_{i=1}^m (\psi(q'_i)-\psi(q_i))\right] \\ & \le \EE_{\xi, S, S'}\left[\sup_{\psi \in \Gcal} \frac{1}{m}\sum_{i=1}^m \xi_i(\psi(q'_{i})-\psi(q_i))\right] \\ & \le2\EE_{\xi, S}\left[\sup_{\psi \in \Gcal} \frac{1}{m}\sum_{i=1}^m \xi_i\psi(q_i)\right] =2\Rcal_{m}(\Gcal), $$

$$ y=g^{*}(x) \pm W^{*} f_\theta(x) =W^{*} f_\theta(x) +(g^{*}(x)-W^{*} f_\theta(x))=W^{*} f_\theta(x) +\varphi(x) $$

$$ \begin{split} L_{S}(w) &= \frac{1}{n} \sum_{i=1}^n |W f_\theta(x_i)-y_i| \\ & =\frac{1}{n} \sum_{i=1}^n |W f_\theta(x_i)-W^* f_\theta(x_{i}) -\varphi(x_{i})| \\ & \ge\frac{1}{n} \sum_{i=1}^n |W f_\theta(x_i)-W^* f_\theta(x_{i})| -\frac{1}{n} \sum_{i=1}^n|\varphi(x_{i})| \\ & =\frac{1}{n} \sum_{i=1}^n |\tW f_\theta(x_i)| - \frac{1}{n} \sum_{i=1}^n |\varphi(x_{i})| \end{split} $$

$$ \begin{split} L_{S}(w) &\ge \frac{1}{n} \sum_{i=1}^n |\tW f_\theta(x_i)\pm\tW f_\theta(\bbx_i)| - \frac{1}{n} \sum_{i=1}^n |\varphi(x_{i})| \\ & =\frac{1}{n} \sum_{i=1}^n |\tW f_\theta(\bbx_i)-(\tW f_\theta(\bbx_i)-\tW f_\theta(x_i))| - \frac{1}{n} \sum_{i=1}^n |\varphi(x_{i})| \\ & \ge\frac{1}{n} \sum_{i=1}^n |\tW f_\theta(\bbx_i)| -\frac{1}{n} \sum_{i=1}^{n}|\tW f_\theta(\bbx_i)-\tW f_\theta(x_i)| - \frac{1}{n} \sum_{i=1}^n |\varphi(x_{i})| \\ & =\frac{1}{n} \sum_{i=1}^n |\tW f_\theta(\bbx_i)| -\frac{1}{n} \sum_{i=1}^{n}|\tW (f_\theta(\bbx_i)-f_\theta(x_i))| - \frac{1}{n} \sum_{i=1}^n |\varphi(x_{i})| \end{split} $$

$$ \begin{split} \frac{1}{n} \sum_{i=1}^n |\tW f_\theta(\bbx_i)| &=\frac{1}{n} \sum_{i=1}^n |Wf_\theta(\bbx_i)-W^* f_\theta(\bbx_i)| \\ & =\frac{1}{n} \sum_{i=1}^n |Wf_\theta(\bbx_i)-\bby_i+\varphi(\bbx_i)| \\ & \ge\frac{1}{n} \sum_{i=1}^n |Wf_\theta(\bbx_i)-\bby_i|-\frac{1}{n} \sum_{i=1}^n|\varphi(\bbx_i) | \end{split} $$

$$ \label{eq:5} \EE_{X,Y}[|W_{S}f_\theta(X)-Y|] - \frac{1}{n} \sum_{i=1}^n |W_{S}f_\theta(\bbx_i)-\bby_i|, $$ \tag{eq:5}

$$ \label{eq:6} \begin{split} &\EE_{X,Y}[|W_{S}f_\theta(X)-Y|] - \frac{1}{n} \sum_{i=1}^n |W_{S}f_\theta(\bbx_i)-\bby_i| \\ & = \left(\sum_{y\in \Ycal} \EE_{X_{y}}[|W_{S}f_\theta(X_{y})-y |]\frac{|\Ical_{y}|}{n}- \frac{1}{n} \sum_{i=1}^n |W_{S}f_\theta(\bbx_i)-\bby_i|\right) \\ & \quad+\sum_{y\in \Ycal} \EE_{X_{y}}[|W_{S}f_\theta(X_{y})-y |]\left(\Pr(Y = y)- \frac{|\Ical_{y}|}{n}\right), \end{split} $$ \tag{eq:6}

$$ \begin{split} &\sum_{y\in \Ycal} \EE_{X_{y}}[|W_{S}f_\theta(X_{y})-y |]\frac{|\Ical_{y}|}{n}- \frac{1}{n} \sum_{i=1}^n |W_{S}f_\theta(\bbx_i)-\bby_i| \\ & =\frac{1}{n}\sum_{y \in \tYcal} |\Ical_{y}|\left(\EE_{X_{y}}[|W_{S}f_\theta(X_{y})-y |]-\frac{1}{|\Ical_{y}|}\sum_{i \in \Ical_{y}}|W_{S}f_\theta(\bbx_i)-y| \right), \end{split} $$

$$ \begin{split} &\frac{1}{n}\sum_{y \in \tYcal} |\Ical_{y}|\left(\EE_{X_{y}}[|W_{S}f_\theta(X_{y})-y |]-\frac{1}{|\Ical_{y}|}\sum_{i \in \Ical_{y}}|W_{S}f_\theta(\bbx_i)-y| \right) \\ & \le\frac{\kappa_{S}}{n}\sum_{y \in \tYcal} |\Ical_{y}| \sqrt{\frac{\ln(|\tYcal|/\delta)}{2|\Ical_{y}|}} \\ &=\kappa_{S} \left(\sum_{y \in \tYcal} \sqrt{\frac{|\Ical_{y}|}{n}}\right) \sqrt{\frac{\ln(|\tYcal|/\delta)}{2n}}. \end{split} $$

$$ \label{eq:8} \begin{split} &\EE_{X,Y}[|W_{S}f_\theta(X)-Y|] - \frac{1}{n} \sum_{i=1}^n |W_{S}f_\theta(\bbx_i)-\bby_i| \\ & \le \kappa_{S} \left(\sum_{y \in \tYcal} \sqrt{\hp(y)}\right) \sqrt{\frac{\ln(|\tYcal|/\delta)}{2n}} + \sum_{y\in \Ycal} \EE_{X_{y}}[|W_{S}f_\theta(X_{y})-y |]\left(\Pr(Y = y)- \frac{|\Ical_{y}|}{n}\right) \end{split} $$ \tag{eq:8}

$$ \begin{split} & \sum_{y\in \Ycal} \EE_{X_{y}}[|W_{S}f_\theta(X_{y})-y |]\left(\Pr(Y = y)- \frac{|\Ical_{y}|}{n}\right) \\ &\le \left(\sum_{y\in \Ycal} \sqrt{p(y)}\,\EE_{X_{y}}[|W_{S}f_\theta(X_{y})-y |] \right) \sqrt{\frac{2\ln(|\Ycal|/\delta)}{2n}} \\ & \le\kappa_{S} \left(\sum_{y\in \Ycal} \sqrt{p(y)} \right) \sqrt{\frac{2\ln(|\Ycal|/\delta)}{2n}} \end{split} $$

$$ \label{eq:10} \begin{split} &\EE_{X,Y}[|W_{S}f_\theta(X)-Y|] \\ &\le \frac{1}{n} \sum_{i=1}^n |W_{S}f_\theta(\bbx_i)-\bby_i|+\kappa_{S} \sqrt{\frac{2\ln(2|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right) \\ & \le L_{S}(w_{S})+\frac{1}{n} \sum_{i=1}^{n}|\tW (f_\theta(\bbx_i)-f_\theta(x_i))| \\ & \quad + \frac{1}{n} \sum_{i=1}^n |\varphi(x_{i})|+\frac{1}{n} \sum_{i=1}^n|\varphi(\bbx_i) |+\kappa_{S} \sqrt{\frac{2\ln(2|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right). \end{split} $$ \tag{eq:10}

$$ \frac{1}{n} \sum_{i=1}^n |\varphi(x_{i})|=\frac{1}{n} \sum_{i=1}^n |g^{*}(x_{i})-W_{\bS}f_\theta(x_{i})|. $$

$$ A\T A w=A\T g \quad \text{ where } A=\begin{bmatrix}A_{1} \\ A_{2} \\ \vdots \\ A_{m} \end{bmatrix} \in \RR^{mr \times dr} \text{ and } g=\begin{bmatrix}g_{1} \\ g_{2} \\ \vdots \\ g_{m} \end{bmatrix} \in \RR^{mr} $$

$$ \begin{split} \frac{1}{m}\sum_{i=1}^{m} |g^{*}(\xp_{i})-W_{\bS}f_\theta(\xp_{i})| &= \frac{1}{m}\sum_{i=1}^{m} \sqrt{\sum_{k=1}^r ((g_{i}-A_iw_\bS)_k)^2} \\ & \le \sqrt{\frac{1}{m}\sum_{i=1}^{m} \sum_{k=1}^r ((g_{i}-A_iw_\bS)_k)^2} \\ & = \frac{1}{\sqrt{m}} \sqrt{\sum_{i=1}^{m} \sum_{k=1}^r ((g_{i}-A_iw_\bS)_k)^2} \\ & =\frac{1}{\sqrt{m}} |g-Aw_\bS|_{2} \\ & = \frac{1}{\sqrt{m}} |g-A(A\T A)^\dagger A\T g|_{2} =\frac{1}{\sqrt{m}}|(I-A(A\T A)^\dagger A\T )g|_{2} \end{split} $$

$$ \label{eq:3} \begin{split} &\frac{1}{n} \sum_{i=1}^n |\varphi(x_{i})| +\frac{1}{n} \sum_{i=1}^n |\varphi(\bbx_{i})| \\ & \le\frac{2}{\sqrt{m}}|(I-A(A\T A)^\dagger A\T )g|_{2}+\frac{4\Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+2\kappa \sqrt{\frac{\ln(4/\delta)}{2m}} + 2\kappa_{\bS} \sqrt{\frac{\ln(4/\delta)}{2n} } \end{split} $$ \tag{eq:3}

$$ \label{eq:4} \begin{split} &\EE_{X,Y}[|W_{S}f_\theta(X)-Y|] \\ & \le L_{S}(w_{S})+\frac{1}{n} \sum_{i=1}^{n}|\tW (f_\theta(\bbx_i)-f_\theta(x_i))|+\frac{2}{\sqrt{m}}|\Pb_{A}g|_{2} \\ & \quad + \frac{4\Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+2\kappa \sqrt{\frac{\ln(8/\delta)}{2m}} + 2\kappa_{\bS} \sqrt{\frac{\ln(8/\delta)}{2n} } \\ &\quad +\kappa_{S} \sqrt{\frac{2\ln(4|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right). \end{split} $$ \tag{eq:4}

$$ \label{eq:12} \frac{1}{n} \sum_{i=1}^{n}|f_\theta(\bbx_i)-f_\theta(x_i)| \le \EE_{y \sim \rho}\EE_{\bbx,x \sim \Dcal_y^2}[|f_\theta(\bbx)-f_\theta(x)|]+\tau_{\bS} \sqrt{\frac{\ln(1/\delta)}{2n}}. $$ \tag{eq:12}

$$ \label{eq:14} \begin{split} &\EE_{X,Y}[|W_{S}f_\theta(X)-Y|] \\ &\le L_{S}(w_{S}) + |\tW|_{2} \left( \frac{1}{m}\sum_{i=1}^{m} |f_\theta(\xp_{i})-f_\theta(\xpp_{i})|+\frac{2\tilde \Rcal_{m}(\Fcal)}{\sqrt{m}}+\tau \sqrt{\frac{\ln(4/\delta)}{2m}}+\tau_{\bS} \sqrt{\frac{\ln(4/\delta)}{2n}} \right) \\ & \quad +\frac{2}{\sqrt{m}}|\Pb_{A}g|_{2}+ \frac{4\Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+2\kappa \sqrt{\frac{\ln(16/\delta)}{2m}} + 2\kappa_{\bS} \sqrt{\frac{\ln(16/\delta)}{2n} } \\ & \quad +\kappa_{S} \sqrt{\frac{2\ln(8|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right) \\ & =L_{S}(w_{S}) +|\tW|_{2} \left(\frac{1}{m}\sum_{i=1}^{m} |f_\theta(\xp_{i})-f_\theta(\xpp_{i})|\right)+\frac{2}{\sqrt{m}}|\Pb_{A}g|_{2}+Q_{m,n} \end{split} $$ \tag{eq:14}

$$ \begin{split} Q_{m,n} &= |\tW|_{2} \left(\frac{2\tilde \Rcal_{m}(\Fcal)}{\sqrt{m}}+\tau \sqrt{\frac{\ln(3/\delta)}{2m}}+\tau_{\bS} \sqrt{\frac{\ln(3/\delta)}{2n}}\right) \\ & \quad +\kappa_{S} \sqrt{\frac{2\ln(6|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right) \\ & \quad + \frac{4\Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+2\kappa \sqrt{\frac{\ln(4/\delta)}{2m}} + 2\kappa_{\bS} \sqrt{\frac{\ln(4/\delta)}{2n} }. \end{split} $$

$$ \label{eq:15} | \Pb_{A}g|_{2} =|[\Pb_\hZ \otimes I_r]\vect[Y_\bS\T]|_{2} =|\vect[Y_\bS\T\Pb_\hZ ]|_{2} =|\Pb_\hZ Y_\bS|_{F} $$ \tag{eq:15}

$$ \begin{split} L_{S}(w_{S})=\frac{1}{n} \sum_{i=1}^n |W_{S} f_\theta(x_i)-y_i| &= \frac{1}{n}\sum_{i=1}^{n} \sqrt{\sum_{k=1}^r ((W_{S} f_\theta(x_i)-y_i)_k)^2} \\ & \le \sqrt{\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^r ((W_{S} f_\theta(x_i)-y_i)_k)^2} \\ & = \frac{1}{\sqrt{n}} |W_{S} \tZ -Y\T|_F \\ & =\frac{1}{\sqrt{n}} |Y\T (\tZ \T (\tZ \tZ\T )^\dagger \tZ -I)|_F \\ & =\frac{1}{\sqrt{n}} |(I-\tZ \T (\tZ \tZ\T )^\dagger \tZ )Y|_F \end{split} $$

$$ \label{eq:17} L_{S}(w_{S})=\frac{1}{\sqrt{n}} |\Pb_\tZ Y|_F $$ \tag{eq:17}

$$ \begin{split} \text{maximize } & \log\det(\Sigma_Z) \\ \text{such that: } & \sum_{i=1}^{N} \lambda_i(\Sigma_Z) \leq c \\ & \Sigma_Z \succeq 0 \end{split} $$

$$ \text{maximize } \sum_i \log \frac{\left| \Sigma_Z \right|}{\left| \Sigma({X_i}) \right|} $$

$$ \begin{split} \sum_i \log \frac{\left| \Sigma_Z \right|}{\left| \Sigma({X_i}) \right|} &= \sum_i \left( \log \left| \Sigma_Z \right| - \log \left| \Sigma({X_i}) \right| \right) \\ &= K \log \left| \Sigma_Z \right| - \sum_i \log \left| \Sigma({X_i}) \right|, \end{split} $$

$$ H(Z) := H(Z|C) - \sum_{i} c_i \ln \sum_{j} c_j e^{-D(p_i || p_j)}. $$

We define the invariance loss
\begin{align*}
I_{\bS}(f_\theta) = \frac{1}{m}\sum_{i=1}^{m} |f_\theta(\xp_{i})-f_\theta(\xpp_{i})| ,
\end{align*}
where $f_\theta$ is the trained representation on the unlabeled data $\bS$.
We define a labeled loss $\ell_{x,y}(w)=|W f_\theta(x)-y|$, where $w=\vect[W]\in \RR^{dr}$ is the vectorization of the matrix $W \in \RR^{r \times d}$. Let $w_S=\vect[W_{S}]$ be the minimum-norm solution $W_S =\mini_{W'} |W'|_{F}$ such that \begin{align*} W'\in \argmin_{W} \frac{1}{n} \sum_{i=1}^n |W_{} f_\theta(x_i)-y_i|^2. \end{align*} We also define the representation matrices \begin{align*} \tZ = [f(x_1),\dots, f(x_{n})] \in \RR^{d\times n} \quad \text{and} \quad \hZ = [f(\xp_1),\dots, f(\xp_{m})]\in \RR^{d\times m} , \end{align*} and the projection matrices \begin{align*} \Pb_\tZ = I -\tZ \T (\tZ \tZ\T)^\dagger \tZ \quad \text{and} \quad \Pb_\hZ = I -\hZ\T (\hZ\hZ\T)^\dagger \hZ . \end{align*} We define the label matrix $Y_{S}=[y_{1},\dots, y_{n}]\T \in \RR^{n\times r}$ and the unknown label matrix $Y_{\bS}=[\yp_1,\dots, \yp_m]\T \in \RR^{m\times r}$, where $\yp_i$ is the unknown label of $\xp_i$. Let $\Fcal$ be a hypothesis space of $f_{\theta}$. For a given hypothesis space $\Fcal$, we define the normalized Rademacher complexity \begin{align*} \tilde \Rcal_{m}(\Fcal) = \frac{1}{\sqrt{m}}\EE_{\bS,\xi} \left[\sup_{f\in\Fcal} \sum_{i=1}^m \xi_i |f(\xp_{i})-f(\xpp_{i})|\right] , \end{align*} where $\xi_1,\dots,\xi_m$ are independent uniform random variables taking values in $\{-1,1\}$. It is normalized such that $\tilde \Rcal_{m}(\Fcal)=O(1)$ as $m\rightarrow \infty$.
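These definitions can be checked numerically. The NumPy sketch below (toy shapes of our own choosing) builds the minimum-norm least-squares solution with the pseudoinverse and verifies that its training residual equals the projected-label norm $|\Pb_\tZ Y|_F$ that appears in the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, r = 8, 32, 3                        # representation dim, samples, label dim

Z = rng.normal(size=(d, n))               # \tilde{Z}: columns are f(x_i)
Y = rng.normal(size=(n, r))               # label matrix Y_S

# Minimum-norm solution of min_W ||W Z - Y^T||_F via the pseudoinverse.
W_S = Y.T @ np.linalg.pinv(Z)

# Projection onto the orthogonal complement of the row space of Z.
P_Z = np.eye(n) - Z.T @ np.linalg.pinv(Z @ Z.T) @ Z

# The Frobenius residual of the fit equals ||P_Z Y||_F.
res = np.linalg.norm(W_S @ Z - Y.T)
print(res, np.linalg.norm(P_Z @ Y))
```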

\subsection{A Generalization Bound for Variance-Invariance-Covariance Regularization}

Now we will show that the VICReg objective improves generalization on supervised downstream tasks. More specifically, minimizing the unlabeled invariance loss while controlling the covariance $\hZ \hZ \T$ and the complexity of representations $\tilde \Rcal_{m}(\Fcal)$ minimizes the expected \textit{labeled loss}: \begin{thm} \label{thm:1}{ (Informal version). For any $\delta>0$, with probability at least $1-\delta$, } \begin{align} \begin{split} & \EE_{x,y}[\ell_{x,y}(w_{S})] \le I_{\bS}(f_\theta) +\frac{2}{\sqrt{m}}|\Pb_\hZ Y_\bS|_{F}+\frac{1}{\sqrt{n}} |\Pb_\tZ Y_{S}|_F + \frac{2\tilde \Rcal_{m}(\Fcal)}{\sqrt{m}}+\Qcal_{m,n} , \end{split} \end{align} where $\Qcal_{m,n} =O(G\sqrt{\ln (1/\delta) / m }+ \sqrt{ \ln (1/\delta) / n})\rightarrow 0$ as \mbox{$m,n\rightarrow \infty$}. In $\Qcal_{m,n}$, the value of $G$ for the term decaying at the rate $1/\sqrt{m}$ depends on the hypothesis space of $f_\theta$ and $w$, whereas the term decaying at the rate $1/\sqrt{n}$ is independent of any hypothesis space. \end{thm} \begin{proof} The complete version of Theorem \ref{thm:1} and its proof are presented in \Cref{app:1}. \end{proof} The term $|\Pb_\hZ Y_\bS|_{F}$ in Theorem \ref{thm:1} contains the unobservable label matrix $Y_\bS$. However, we can minimize this term by using $|\Pb_\hZ Y_\bS|_{F} \le|\Pb_\hZ|_F |Y_\bS|_{F} $ and by minimizing $|\Pb_\hZ|_F$. The factor $|\Pb_\hZ|_F$ is minimized when the rank of the covariance $\hZ \hZ \T$ is maximized. This can be enforced by maximizing the diagonal entries while minimizing the off-diagonal entries, as is done in VICReg.

The term $|\Pb_\tZ Y_{S}|_{F}$ contains only observable variables, and we can directly measure the value of this term using training data. In addition, the term $|\Pb_\tZ Y_{S}|_{F}$ is also minimized when the rank of the covariance $\tZ \tZ \T$ is maximized. Since the covariances $\tZ \tZ \T$ and $\hZ \hZ \T$ concentrate to each other via concentration inequalities with the error in the order of $O(\sqrt{(\ln (1/\delta))/n}+\tilde\Rcal_{m}(\Fcal) \sqrt{(\ln (1/\delta))/m})$, we can also minimize the upper bound on $|\Pb_\tZ Y_{S}|_{F}$ by maximizing the diagonal entries of $\hZ \hZ \T$ while minimizing its off-diagonal entries, as is done in VICReg.
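The rank argument can be made concrete: since $\Pb_\hZ$ is an orthogonal projection onto an $(m-\operatorname{rank}(\hZ))$-dimensional subspace, $|\Pb_\hZ|_F = \sqrt{m - \operatorname{rank}(\hZ)}$, so increasing the rank of the covariance shrinks the projection factor. A quick numerical illustration with toy shapes of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 16, 64

def proj_norm(rank):
    """||I - Z^T (Z Z^T)^+ Z||_F for an embedding matrix Z of the given rank."""
    B = rng.normal(size=(d, rank))
    C = rng.normal(size=(rank, m))
    Z = B @ C                                   # d x m matrix of rank `rank`
    P = np.eye(m) - Z.T @ np.linalg.pinv(Z @ Z.T) @ Z
    return np.linalg.norm(P)

norms = [proj_norm(r) for r in (2, 8, 16)]
print(norms)   # decreasing as the covariance rank grows
```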

Thus, VICReg can be understood as a method to minimize the generalization bound in Theorem \ref{thm:1} by minimizing the invariance loss while controlling the covariance to minimize the \textit{label-agnostic} upper bounds on $|\Pb_\hZ Y_\bS|_{F}$ and $|\Pb_\tZ Y_{S}|_{F}$. If we know \textit{partial} information about the label $Y_\bS$ of the unlabeled data, we can use it to minimize $|\Pb_\hZ Y_\bS|_{F}$ and $|\Pb_\tZ Y_{S}|_F$ directly.

\subsection{Comparison of Generalization Bounds}

The SimCLR generalization bound~\citep{saunshi2019theoretical} requires the number of labeled classes to go to infinity to close the generalization gap, whereas the VICReg bound in Theorem \ref{thm:1} does \textit{not} require the number of label classes to approach infinity for the generalization gap to go to zero. This reflects that, unlike SimCLR, VICReg does not use negative pairs and thus does not rely on a loss function built on the implicit expectation that the labels of a negative pair differ. Another difference is that our VICReg bound improves as $n$ increases, while the previous SimCLR bound~\citep{saunshi2019theoretical} does not depend on $n$. This is because \citet{saunshi2019theoretical} assume partial access to the true distribution of each class, which removes the dependence on the labeled data size $n$; we make no such assumption in our study.

Consequently, the generalization bound in Theorem~\ref{thm:1} provides new insight into VICReg regarding the relative effects of $m$ and $n$ through $G\sqrt{\ln (1/\delta)/m}+ \sqrt{\ln (1/\delta)/n}$. Finally, Theorem \ref{thm:1} also illuminates the advantages of VICReg over standard supervised training. With standard training, the generalization bound via the Rademacher complexity requires the complexities of hypothesis spaces, $\tilde \Rcal_{n}(\Wcal)/\sqrt{n}$ and $\tilde \Rcal_{n}(\Fcal)/\sqrt{n}$, with respect to the size of the labeled data $n$, instead of the size of the unlabeled data $m$. Thus, Theorem \ref{thm:1} shows that with SSL we can replace the hypothesis-space complexities in terms of $n$ with those in terms of $m$. Since the number of unlabeled data points is typically much larger than the number of labeled data points, this highlights the benefit of SSL. Our bound differs from the recent information bottleneck bound \citep{icml2023kzxinfodl} in that neither our proof nor our bound relies on the information bottleneck.

\subsection{Understanding Theorem 2 via Mutual Information Maximization}

Theorem \ref{thm:1}, together with the result of the previous section, shows that, for generalization in the downstream task, it is helpful to maximize the mutual information $I(Z; X')$ in SSL via minimizing the invariance loss $I_{\bS}(f_\theta)$ while controlling the covariance $\hZ \hZ \T$. The term $2\tilde \Rcal_{m}(\Fcal) / \sqrt{m}$ captures the importance of controlling the complexity of the representations $f_\theta$. To understand this term further, let us consider a discretization of the parameter space of $\Fcal$ to have finite $|\Fcal| < \infty$. Then, by Massart's Finite Class Lemma, we have that $\tilde \Rcal_{m}(\Fcal) \le C\sqrt{\ln |\Fcal| } $ for some constant $C>0$. Moreover, \citet{shwartz2022information} show that we can approximate $\ln |\Fcal|$ by $2^{I(Z; X)}$. Thus, in Theorem \ref{thm:1}, the term $I_{\bS}(f_\theta) +\frac{2}{\sqrt{m}}|\Pb_\hZ Y_\bS|_{F}+\frac{1}{\sqrt{n}} |\Pb_\tZ Y_{S}|_F$ corresponds to $I(Z;X')$, which we want to maximize, while compressing the term $2\tilde \Rcal_{m}(\Fcal) / \sqrt{m}$, which corresponds to $I(Z; X)$~\citep{federici2019learning, shwartz2017compression,shwartz2022we}.

Although we can explicitly add regularization on the information to control $2\tilde \Rcal_{m}(\Fcal) / \sqrt{m}$, it is possible that $I(Z;X|X')$ and $2\tilde \Rcal_{m}(\Fcal) / \sqrt{m}$ are implicitly regularized via implicit bias through design choices~\citep{gunasekar2017implicit,soudry2018implicit,gunasekar2018implicit}. Thus, Theorem \ref{thm:1} connects the information-theoretic understanding of VICReg with the probabilistic guarantee on downstream generalization.

\section{Limitations} \label{app:limitations} In our paper, we proposed novel methods for SSL premised on information maximization. Although our methods demonstrated superior performance on some datasets, computational constraints precluded us from testing them on larger datasets. Furthermore, our study hinges on certain assumptions that, despite rigorous validation efforts, may not hold universally in all scenarios or conditions. These limitations should be taken into account when interpreting our results.

\section{Conclusions} \label{sec:conclusion}

We analyzed the Variance-Invariance-Covariance Regularization for self-supervised learning through an information-theoretic lens. By transferring the stochasticity required for an information-theoretic analysis to the input distribution, we showed how the VICReg objective can be derived from information-theoretic principles, used this perspective to highlight assumptions implicit in the VICReg objective, derived a VICReg generalization bound for downstream tasks, and related it to information maximization.

Building on these findings, we introduced a new VICReg-inspired SSL objective. Our probabilistic guarantee suggests that VICReg can be further improved for the settings of partial label information by aligning the covariance matrix with the partially observable label matrix, which opens up several avenues for future work, including the design of improved estimators for information-theoretic quantities and investigations into the suitability of different SSL methods for specific data characteristics.

\clearpage

\bibliographystyle{plainnat} \bibliography{references}

\newpage \pagebreak

\appendix \renewcommand{\thesection}{\Alph{section}}

\begin{appendices}

\vbox{% \hsize\textwidth \linewidth\hsize \vskip 0.1in \hrule height 4pt%\p@ \vskip 0.25in \vskip -\parskip% \centering {\LARGE\bf Appendix \par} \vskip 0.29in \vskip -\parskip \hrule height 1pt \vskip 0.09in% }

\vspace*{5pt}

\section*{Table of Contents}

This appendix is organized as follows:\vspace*{-3pt} \begin{itemize}[leftmargin=15pt] \setlength\itemsep{2pt}

\item In \Cref{app:33}, we provide a detailed derivation of the lower bound in \Cref{eq:izz_bound}.

\item In \Cref{app:2}, we provide a full proof of Theorem \ref{cor:mixture} on the network's representation distribution.

\item In \Cref{app:empirical_validation}, we provide additional empirical validations. Specifically, we empirically check whether the optimal solution to the information maximization problem in \Cref{sec:vicreg} is a diagonal matrix.

\item In \Cref{app:em}, we show the collapse phenomenon under Gaussian Mixture Model (GMM) using Expectation Maximization (EM) and demonstrate how it is related to SSL and how we can prevent it.

\item \Cref{app:simclr} provides additional details on the SimCLR method.

\item \Cref{app:estimators} provides a detailed review of entropy estimators, their implications, assumptions, and limitations.

\item In \Cref{app:lemmas}, we provide proofs for known lemmas that we are using throughout our paper.

\item In \Cref{app:ver}, we provide detailed information on the hyperparameters, datasets, and architectures used in our experiments in \Cref{sec:estimator}.

\item In \Cref{app:1}, we provide full proof of our generalization bound for downstream tasks from \Cref{sec:gen}.

\item In \Cref{app:info_vicgreg}, we provide full proof of the theorems for \Cref{sec:vicreg} on the connection between information optimization and the VICReg objective.

\item \Cref{app:training_detalies} provides experimental details on experiments conducted in \Cref{sec:othermethods} for entropy comparison between different SSL methods.

\item In \Cref{app:repr}, we provide detailed information on the reproducibility of our study.

\item In \Cref{app:broader_impact}, we discuss the broader impact of our work. This section explores the implications, significance, and potential applications of our findings beyond the scope of the immediate study.

\end{itemize}

\clearpage

\section{\texorpdfstring{Lower bounds on $\EE_{x^\prime}\left[\log q(z|x^\prime)\right]$}{Lower bounds on E[log q(z|x')]}}

\label{app:33}

In this section of the supplementary material, we present the full derivation of the lower bound on $\EE_{x^\prime}\left[\log q(z|x^\prime)\right]$. Because $Z^\prime|X^\prime$ is Gaussian, we can write $Z^\prime = \mu(x^\prime) + L(x^\prime)\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $L(x^\prime)^TL(x^\prime) = \Sigma(x^\prime)$. Now, setting $\Sigma_r = I$ gives us:
\begin{align} \label{eq:logzz_e}
\EE_{x^\prime}\left[\log q(z|x^\prime)\right] \geq{}& \EE_{z^\prime|x^\prime}\left[\log q(z|z^\prime)\right] \nonumber\\
={}& \EE_{z^\prime|x^\prime}\left[\frac{d}{2}\log 2\pi - \frac12 \left(z-z^\prime\right)^T I^{-1}\left(z-z^\prime\right)\right] \nonumber\\
={}& \frac{d}{2}\log 2\pi - \frac12 \EE_{z^\prime \mid x^\prime}\left[\left(z-z^\prime\right)^2\right] \nonumber\\
={}& \frac{d}{2}\log 2\pi - \frac12 \EE_{\epsilon}\left[\left(z-\mu(x^\prime) - L(x^\prime)\epsilon\right)^2\right] \nonumber\\
={}& \frac{d}{2}\log 2\pi - \frac12 \EE_{\epsilon}\left[\left(z-\mu(x^\prime)\right)^2 - 2\left(z - \mu(x^\prime)\right)^T L(x^\prime) \epsilon +\left(L(x^\prime)\epsilon\right)^T\left(L(x^\prime) \epsilon\right)\right] \nonumber\\
={}& \frac{d}{2}\log 2\pi - \frac12 \EE_{\epsilon}\left[\left(z-\mu(x^\prime)\right)^2\right] + \left(z-\mu(x^\prime)\right)^T L(x^\prime)\EE_{\epsilon}\left[\epsilon\right]-\frac{1}{2}\EE_{\epsilon}\left[\epsilon^T L(x^\prime)^T L(x^\prime) \epsilon\right] \nonumber\\
={}& \frac{d}{2}\log 2\pi - \frac12 \left(z-\mu(x^\prime)\right)^2 -\frac{1}{2}\text{Tr}\log\Sigma(x^\prime),
\end{align}
where $\EE_{x^\prime}\left[\log q(z|x^\prime)\right] = \EE_{x^\prime}\left[\log \EE_{z^\prime|x^\prime} \left[q(z|z^\prime)\right]\right] \geq \EE_{x^\prime}\EE_{z^\prime|x^\prime}\left[\log q(z|z^\prime)\right] $ by Jensen's inequality, $\EE_{\epsilon}[\epsilon]=0$, and $\EE_{\epsilon}\left[\epsilon^T L(x^\prime)^T L(x^\prime)\epsilon\right] = \text{Tr}\log\Sigma(x^\prime)$ by Hutchinson's estimator.
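The final step uses the Hutchinson-style identity $\EE_{\epsilon}[\epsilon^T M \epsilon] = \operatorname{Tr}(M)$ for $\epsilon \sim \mathcal{N}(0, I)$, which a quick Monte Carlo check confirms (illustrative only, with a matrix of our own choosing standing in for $L(x^\prime)^T L(x^\prime)$):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
L = rng.normal(size=(d, d))
M = L.T @ L                                   # stands in for L(x')^T L(x')

eps = rng.normal(size=(200_000, d))           # epsilon ~ N(0, I)
# Quadratic form eps^T M eps for each sample, then average.
estimate = np.einsum('ni,ij,nj->n', eps, M, eps).mean()
print(estimate, np.trace(M))                  # the two should be close
```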
Taking in addition the expectation over $z|x$, with $z = \mu(x)+L(x)\epsilon$, yields:
\begin{align}
\EE_{z|x}\left[\EE_{z^\prime|x^\prime}\left[\log q(z|z^\prime)\right]\right] ={}& \EE_{z|x}\left[\frac{d}{2}\log 2\pi - \frac12 \left(z-\mu(x^\prime)\right)^2 -\frac{1}{2}\text{Tr} \log \Sigma(x^\prime)\right] \nonumber\\
={}& \frac{d}{2}\log 2\pi - \frac12 \EE_{z|x}\left[ \left(z-\mu(x^\prime)\right)^2\right] -\frac{1}{2}\text{Tr} \log \Sigma(x^\prime) \nonumber\\
={}& \frac{d}{2}\log 2\pi - \frac12 \EE_{\epsilon}\left[ \left(\mu(x) +L(x)\epsilon-\mu(x^\prime)\right)^2\right] -\frac{1}{2}\text{Tr} \log \Sigma(x^\prime) \nonumber\\
={}& \frac{d}{2}\log 2\pi - \frac12 \EE_{\epsilon}\left[ \left(\mu(x) -\mu(x^\prime)\right)^2\right] -\EE_{\epsilon}\left[\left(\mu(x) - \mu(x^\prime)\right)^T L(x)\epsilon\right] \nonumber\\
&-\frac 12\EE_{\epsilon}\left[\epsilon^TL(x)^TL(x)\epsilon\right] -\frac{1}{2}\text{Tr}\log \Sigma(x^\prime) \nonumber\\
={}& \frac{d}{2}\log 2\pi - \frac12 \left(\mu(x) -\mu(x^\prime)\right)^2 -\frac 12\text{Tr}\log\Sigma(x) -\frac{1}{2}\text{Tr}\log\Sigma(x^\prime) \nonumber\\
={}& \frac{d}{2}\log 2\pi - \frac12 \left(\mu(x) -\mu(x^\prime)\right)^2 -\frac 12 \log \left( |\Sigma(x)| \cdot |\Sigma(x^\prime)|\right)
\end{align}

\clearpage

\section{Data Distribution after Deep Network Transformation} \label{app:2}

\begin{thm} Given the setting of \Cref{eq:x_density}, the unconditional DNN output density $Z$ approximates (given the truncation of the Gaussian to its effective support, which is included within a single region $\omega$ of the DN's input space partition) a mixture of the affinely transformed distributions $\bx|\bx^*_{n(\bx)}$, e.g., for the Gaussian case $$Z \sim \sum_{n=1}^{N}\mathcal{N}\left(\bA_{\omega(\bx^*_{n})}\bx^*_{n}+\bb_{\omega(\bx^*_{n})},\bA^T_{\omega(\bx^*_{n})}\Sigma_{\bx^*_{n}}\bA_{\omega(\bx^*_{n})}\right)^{\mathbb{I}_{\{T=n\}}},$$ where $\omega(\bx^*_{n})=\omega \in \Omega \iff \bx^*_{n} \in \omega$ is the partition region in which the prototype $\bx^*_{n}$ lives. \end{thm} \begin{proof} We know that if $\int_{\omega}p(\bx|\bx^*_{n(\bx)})d\bx \approx 1$, then $f$ is linear within the effective support of $p$. Therefore, any sample from $p$ will almost surely lie within a single region $\omega \in \Omega$, and the entire mapping can be considered linear with respect to $p$. Thus, the output distribution is an affine transformation of the input distribution based on the per-region affine mapping. \end{proof}
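The proof's premise, that a ReLU network is exactly affine on each activation region, can be illustrated with a tiny network of hypothetical weights (not the paper's model): we read off the region's affine map from the activation pattern at a point and check that nearby inputs in the same region are mapped by exactly that affine function.

```python
import numpy as np

rng = np.random.default_rng(4)
W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)   # hypothetical tiny ReLU net
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x0 = rng.normal(size=3)
pre = W1 @ x0 + b1
mask = (pre > 0).astype(float)                # activation pattern at x0
A = W2 @ (W1 * mask[:, None])                 # affine map A x + b of this region
b_aff = W2 @ (b1 * mask) + b2

# Keep perturbations below the smallest pre-activation margin so that the
# activation pattern (and hence the region) cannot change.
step = 0.5 * np.abs(pre).min() / np.linalg.norm(W1, axis=1).max()
deltas = rng.normal(size=(5, 3))
deltas = step * deltas / np.linalg.norm(deltas, axis=1, keepdims=True)
err = max(np.abs(f(x0 + d) - (A @ (x0 + d) + b_aff)).max() for d in deltas)
print(err)   # numerically zero: the network is affine on this region
```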

\section{Additional Empirical Validation} \label{app:empirical_validation} To validate Theorem \ref{th:maxvicre} empirically, we checked whether the optimal solution for $$\sum_i \log \frac {\lvert\Sigma_Z\rvert} {\lvert\Sigma_{Z|X_i}\rvert \lvert\Sigma_{Z^\prime|X^\prime_i}\rvert }$$ is a diagonal matrix. We trained VICReg with ResNet-18 on CIFAR-10 and applied random perturbations (with different scales) to $\Sigma_Z$. Then, for each perturbation, we calculated the average distance of the perturbed matrix from a diagonal matrix and the actual value of the term $$\sum_i \log \frac {\lvert\Sigma_Z\rvert} {\lvert\Sigma_{Z|X_i}\rvert \lvert\Sigma_{Z^\prime|X^\prime_i}\rvert }.$$ In \Cref{fig:my_label}, we plot the difference from the optimal value of this term as a function of the distance from the diagonal matrix. As we can see, the optimum is attained close to the diagonal matrix. This observation provides empirical validation of Theorem \ref{th:maxvicre}. \begin{figure}[h!] \centering \includegraphics[width=0.6\linewidth]{figures/Vbdpg3Pat1il.png} \caption{\textbf{The optimal solution for the optimization problem is a diagonal matrix.} The average distance from a diagonal matrix for different perturbation scales. Experiments were conducted on CIFAR-10 with the ResNet-18 network.} \label{fig:my_label} \end{figure} \section{EM and GMM} \label{app:em}

Let us examine a toy dataset on the pattern of two intertwining moons to illustrate the collapse phenomenon under GMM (\Cref{fig:gaussian}, right). We begin by training a classical GMM with maximum likelihood, where the means are initialized based on random samples, and the covariance is used as the identity matrix. A red dot represents the Gaussian's mean after training, while a blue dot represents the data points. In the presence of fixed input samples, we observe that there is no collapsing and that the entropy of the centers is high (\Cref{fig:oneone}, left). However, when we make the input samples trainable and optimize their location, all the points collapse into a single point, resulting in a sharp decrease in entropy (\Cref{fig:oneone}, right).

To prevent collapse, we follow the K-means algorithm in enforcing sparse posteriors, i.e., using small initial standard deviations and learning only the means. This enforces a one-to-one mapping that drives each point toward its closest mean without collapsing, resulting in high entropy (\Cref{fig:oneone}, middle, in the Appendix). Another option to prevent collapse is to use different learning rates for the inputs and the parameters. In this setting, collapsing the parameters does not maximize the likelihood. \Cref{fig:gaussian} (right) shows the results of a GMM with different learning rates for the learned inputs and parameters. When the parameter learning rate is sufficiently high in comparison to the input learning rate, the entropy decreases much more slowly and no collapse occurs. \iffalse \label{tab:em} \begin{figure}[t] \centering \includegraphics[width=\linewidth]{figures/GMM_entropies.png} \caption{Evolution of the entropy for each of the learning rate configurations showing that the impact of picking the incorrect learning rate for the data and/or centroids leads to a collapse of the samples.} \label{fig:entropies} \end{figure} \fi
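The K-means-like mitigation described above can be sketched in a few lines. This toy version (data, covariance, and iteration count are our own choices, not the paper's experiment) runs EM with a small fixed isotropic covariance and a means-only M-step, and the centroids stay well separated rather than collapsing.

```python
import numpy as np

rng = np.random.default_rng(5)
# Two well-separated clusters standing in for the two-moons data.
X = np.concatenate([rng.normal(-3.0, 0.3, size=(200, 2)),
                    rng.normal(+3.0, 0.3, size=(200, 2))])
K, sigma2 = 2, 0.05                            # small, fixed covariance
mu = X[[0, 200]].copy()                        # init means from the samples

for _ in range(50):                            # EM with a means-only M-step
    d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
    logp = -0.5 * d2 / sigma2
    resp = np.exp(logp - logp.max(1, keepdims=True))
    resp /= resp.sum(1, keepdims=True)         # E-step: sharp posteriors
    mu = (resp.T @ X) / resp.sum(0)[:, None]   # M-step: update means only

print(np.linalg.norm(mu[0] - mu[1]))           # centroids stay well separated
```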

\begin{figure}[t] \centering \includegraphics[width=0.9\linewidth]{figures/GMM_one_one.pdf} \caption{\textbf{Evolution of GMM training when enforcing a one-to-one mapping between the data and centroids akin to K-means i.e. using a small and fixed covariance matrix. We see that collapse does not occur.} Left - In the presence of fixed input samples, we observe that there is no collapsing and that the entropy of the centers is high. Right - when we make the input samples trainable and optimize their location, all the points collapse into a single point, resulting in a sharp decrease in entropy.} \label{fig:oneone} \end{figure}

\section{SimCLR} \label{app:simclr} In contrastive learning, different augmented views of the same image are attracted (positive pairs), while augmented views of different images are repelled (negative pairs). MoCo \citep{he2020momentum} and SimCLR \citep{chen2020simple} are recent examples of self-supervised visual representation learning that reduce the gap between self-supervised and fully-supervised learning. SimCLR applies randomized augmentations to an image to create two different views, $x$ and $y$, and encodes both of them with a shared encoder, producing representations $r_x$ and $r_y$. Both $r_x$ and $r_y$ are $\ell_2$-normalized. The SimCLR version of the InfoNCE objective is: \begin{align*} \mathbb{E}_{x,y}\left[-\log \left(\frac{e^{\frac1\eta r_y^Tr_x }}{\sum_{k=1}^K{e^{\frac 1\eta r_{y_k}^Tr_x}}}\right)\right] , \end{align*} where $\eta$ is a temperature term and $K$ is the number of views in a minibatch.
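A minimal NumPy version of this objective for one batch (illustrative only; the temperature and batch shapes are placeholders):

```python
import numpy as np

def info_nce(r_x, r_y, eta=0.1):
    """SimCLR-style InfoNCE: r_x, r_y are (K, d) batches of paired views,
    where the positive pair of r_x[i] is r_y[i]."""
    r_x = r_x / np.linalg.norm(r_x, axis=1, keepdims=True)   # l2-normalize
    r_y = r_y / np.linalg.norm(r_y, axis=1, keepdims=True)
    logits = (r_y @ r_x.T) / eta          # logits[k, i] = r_{y_k}^T r_{x_i} / eta
    logits = logits - logits.max(axis=0, keepdims=True)      # numerical stability
    # Log-softmax over the K views in the denominator, then take positives.
    log_prob = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Correctly paired views yield a lower loss than mismatched ones, as expected from the objective.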

\section{Entropy Estimators} \label{app:estimators} Entropy estimation is one of the classical problems in information theory, and Gaussian mixture densities are one of the most popular representations: with a sufficient number of components, they can approximate any smooth density with arbitrary accuracy. For Gaussian mixtures, however, there is no closed-form expression for the differential entropy. Several approximations exist in the literature, including loose upper and lower bounds \citep{entropyapprox2008}. Monte Carlo (MC) sampling is one way to approximate the Gaussian mixture entropy. With sufficiently many MC samples, an arbitrarily accurate unbiased estimate of the entropy can be obtained. Unfortunately, MC sampling is computationally expensive and typically requires a large number of samples, especially in high dimensions \citep{brewer2017computing}. VICReg uses one of the most straightforward approaches to approximating the entropy, based on the first two moments of the empirical distribution. Despite its simplicity, previous studies have found that this method is a poor approximation of the entropy in many cases \citep{entropyapprox2008}. Another option is to use the LogDet function. Several estimators have been proposed to implement it, including the uniformly minimum variance unbiased (UMVU) estimator \citep{30996} and Bayesian methods \citep{MISRA2005324}. These methods, however, often require complex optimizations. The LogDet estimator presented in \citep{zhouyin2021understanding} approximates the differential entropy with an $\alpha$-order entropy computed on features perturbed by scaled noise. They demonstrated that it can be applied to high-dimensional features and is robust to random noise. Based on Taylor-series expansions, \citet{entropyapprox2008} presented a lower bound for the entropy of Gaussian mixture random vectors; they use Taylor-series expansions of the logarithm of each Gaussian mixture component to obtain an analytical evaluation of the entropy measure.
In addition, they present a technique for splitting Gaussian densities to avoid components with high variance, which would require computationally expensive calculations. \citet{kolchinsky2017estimating} introduce a novel family of estimators for the mixture entropy, in which each member is defined by a pairwise-distance function between component densities. These estimators are computationally efficient as long as the pairwise-distance function and the entropy of each component distribution are easy to compute. Moreover, the estimator is continuous and smooth and is therefore useful for optimization problems. They also present both a lower bound (using the Chernoff distance) and an upper bound (using the KL divergence) on the entropy, which are exact when the component distributions are grouped into well-separated clusters. \label{app:methods}
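To make the trade-off concrete, the following sketch (illustrative only, not one of the estimators cited above) compares a Monte Carlo estimate of a Gaussian mixture's entropy with the moment-matched Gaussian approximation, which upper-bounds the true entropy by the maximum-entropy property of the Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 1-D two-component Gaussian mixture: 0.5*N(-3, 1) + 0.5*N(+3, 1)
means, std, weights = np.array([-3.0, 3.0]), 1.0, np.array([0.5, 0.5])

def mixture_logpdf(x):
    comp = np.exp(-0.5 * ((x[:, None] - means) / std) ** 2) / (std * np.sqrt(2 * np.pi))
    return np.log(comp @ weights)

# Monte Carlo estimate: H(p) ~= -mean(log p(x_i)) with x_i ~ p
n = 200_000
ks = rng.choice(2, size=n, p=weights)
samples = rng.normal(means[ks], std)
h_mc = -mixture_logpdf(samples).mean()

# Moment-matched Gaussian (first two moments) gives an upper bound on H(p)
var = samples.var()
h_gauss = 0.5 * np.log(2 * np.pi * np.e * var)
print(h_mc <= h_gauss)  # True: the Gaussian maximizes entropy for fixed variance
```

For well-separated components the gap is large (here roughly $0.45$ nats), which is precisely the regime in which the two-moment approximation is poor.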

\section{Known Lemmas} \label{app:lemmas} We use the following well-known theorems as lemmas in our proofs and state them below for completeness. These are classical results and \textit{not} our results. \begin{lemma} \label{lemma:trivial:2} \emph{(Hoeffding's inequality)} Let $X_1, \dots, X_n$ be independent random variables such that $a \leq X_{i} \leq b$ almost surely. Consider the average of these random variables, $S_{n}=\frac{1}{n}(X_{1}+\cdots +X_{n})$. Then, for any $\delta > 0$, $$ \PP_S \left( \mathrm{E}\left[S_{n}\right]-S_{n} \ge (b-a) \sqrt{\frac{\ln(1/\delta)}{2n}}\right) \leq \delta, $$ and $$ \PP_S \left( S_{n} -\mathrm{E}\left[S_{n}\right]\ge (b-a) \sqrt{\frac{\ln(1/\delta)}{2n}}\right) \leq \delta. $$ \end{lemma} \begin{proof} By Hoeffding's inequality, we have that for all $t>0$, $$ \PP_S \left( \mathrm{E}\left[S_{n}\right]-S_{n} \ge t\right)\leq \exp \left(-{\frac {2nt^{2}}{(b-a)^{2}}}\right), $$ and $$ \PP_S \left(S_{n} - \mathrm{E}\left[S_{n}\right]\ge t\right)\leq \exp \left(-{\frac {2nt^{2}}{(b-a)^{2}}}\right). $$ Setting $\delta=\exp \left(-{\frac {2nt^{2}}{(b-a)^{2}}}\right)$ and solving for $t>0$, \begin{align*} 1/\delta=\exp \left({\frac {2nt^{2}}{(b-a)^{2}}}\right) & \Longrightarrow \ln(1/\delta)= {\frac {2nt^{2}}{(b-a)^{2}}} \\ & \Longrightarrow \frac{(b-a)^{2}\ln(1/\delta)}{2n}= t^2 \\ & \Longrightarrow t =(b-a) \sqrt{\frac{\ln(1/\delta)}{2n}}. \end{align*} \vspace*{-5pt} \end{proof} \vspace*{-5pt} It has been shown that generalization bounds can be obtained via Rademacher complexity \citep{bartlett2002rademacher,mohri2012foundations,shalev2014understanding}. The following is a trivial modification of \citep[Theorem 3.1]{mohri2012foundations} for a one-sided bound on nonnegative general loss functions: \begin{lemma} \label{lemma:trivial:1} Let $\Gcal$ be a set of functions with codomain $[0, M]$. Then, for any $\delta>0$, with probability at least $1-\delta$ over an i.i.d. draw of $m$ samples $S=(q_{i})_{i=1}^m$, the following holds for all $\psi \in \Gcal$: \begin{align} \SwapAboveDisplaySkip \EE_{q}[\psi(q)] \le \frac{1}{m}\sum_{i=1}^{m} \psi(q_{i})+2\Rcal_{m}(\Gcal)+M \sqrt{\frac{\ln(1/\delta)}{2m}}, \end{align} where $\Rcal_{m}(\Gcal):=\EE_{S,\xi}[\sup_{\psi \in \Gcal}\frac{1}{m} \sum_{i=1}^m \xi_i \psi(q_{i})]$ and $\xi_1,\dots,\xi_m$ are independent uniform random variables taking values in $\{-1,1\}$. \end{lemma} \begin{proof} Let $S=(q_{i})_{i=1}^m$ and $S'=(q_{i}')_{i=1}^m$. Define \begin{align} \varphi(S)= \sup_{\psi \in \Gcal} \EE_{q}[\psi(q)]-\frac{1}{m}\sum_{i=1}^{m}\psi(q_{i}). \end{align} To apply McDiarmid's inequality to $\varphi(S)$, we compute an upper bound on $|\varphi(S)-\varphi(S')|$, where $S$ and $S'$ are two datasets differing by exactly one point of an arbitrary index $i_{0}$; i.e., $S_i= S'_i$ for all $i\neq i_{0}$ and $S_{i_{0}} \neq S'_{i_{0}}$. Then, \begin{align} \varphi(S')-\varphi(S) \le\sup_{\psi \in \Gcal}\frac{\psi(q_{i_0})-\psi(q'_{i_0})}{m} \le \frac{M}{m}. \end{align} Similarly, $\varphi(S)-\varphi(S')\le \frac{M}{m}$. Thus, by McDiarmid's inequality, for any $\delta>0$, with probability at least $1-\delta$, \begin{align} \SwapAboveDisplaySkip \varphi(S) \le \EE_{S}[\varphi(S)] + M \sqrt{\frac{\ln(1/\delta)}{2m}}. \end{align} Moreover, \begin{align} \EE_{S}[\varphi(S)] & = \EE_{S}\left[\sup_{\psi \in \Gcal} \EE_{S'}\left[\frac{1}{m}\sum_{i=1}^{m}\psi(q_i')\right]-\frac{1}{m}\sum_{i=1}^{m}\psi(q_i)\right] \\ & \le\EE_{S,S'}\left[\sup_{\psi \in \Gcal} \frac{1}{m}\sum_{i=1}^m (\psi(q'_i)-\psi(q_i))\right] \\ & \le \EE_{\xi, S, S'}\left[\sup_{\psi \in \Gcal} \frac{1}{m}\sum_{i=1}^m \xi_i(\psi(q'_{i})-\psi(q_i))\right] \\ & \le2\EE_{\xi, S}\left[\sup_{\psi \in \Gcal} \frac{1}{m}\sum_{i=1}^m \xi_i\psi(q_i)\right] =2\Rcal_{m}(\Gcal), \end{align} where the first line follows from the definitions of each term, the second line uses Jensen's inequality and the convexity of the supremum, the third line follows since, for each $\xi_i \in \{-1,+1\}$, the distribution of each term $\xi_i (\psi(q'_{i})-\psi(q_i))$ is the distribution of $(\psi(q'_{i})-\psi(q_i))$ because $S$ and $S'$ are drawn i.i.d. from the same distribution, and the fourth line uses the subadditivity of the supremum.
\end{proof}
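The $\delta$-form of Lemma \ref{lemma:trivial:2} can be sanity-checked empirically (a small simulation with illustrative choices of $n$ and $\delta$; for $X_i$ uniform on $[0,1]$ we have $a=0$, $b=1$):

```python
import numpy as np

# Check: P( E[S_n] - S_n >= (b - a) * sqrt(ln(1/delta) / (2n)) ) <= delta
rng = np.random.default_rng(0)
n, delta, trials = 100, 0.1, 20_000
t = np.sqrt(np.log(1 / delta) / (2 * n))  # (b - a) = 1 for Uniform[0, 1]

means = rng.uniform(0, 1, size=(trials, n)).mean(axis=1)  # S_n, many trials
violation_rate = np.mean(0.5 - means >= t)                # E[S_n] = 0.5
print(violation_rate <= delta)  # True
```

The observed violation rate is far below $\delta$, as expected: Hoeffding's inequality is not tight for light-tailed averages.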

\clearpage

\section{Implementation Details for Maximizing Entropy Estimators} \label{app:ver} In this section, we provide more details on the implementation of the experiments conducted in \Cref{sec:estimator}.

\textbf{Setup} Our experiments are conducted on CIFAR-10 \cite{krizhevsky2009learning}. We use ResNet-18 \citep{he2016deep} as our backbone.

\textbf{Training Procedure}: The experimental process is organized into two sequential stages: unsupervised pretraining followed by linear evaluation. Initially, the unsupervised pretraining phase is executed, during which the encoder network is trained. Upon its completion, we transition to the linear evaluation phase, which serves as an assessment tool for the quality of the representation produced by the pretrained encoder.

Once the pretraining phase is concluded, we adhere to the fine-tuning procedures used in established baseline methods, as described by \cite{caron2020unsupervised}.

During the linear evaluation stage, we start by performing supervised training of the linear classifier. This is achieved by using the representations derived from the encoder network, with the encoder's weights kept frozen, on the same training dataset. Subsequently, we measure the test accuracy of the trained linear classifier on a separate validation dataset. This approach allows us to evaluate the performance of our model in a robust and systematic manner.

The training process for each model unfolds over 800 epochs, employing a batch size of 512. We utilize the Stochastic Gradient Descent (SGD) optimizer, characterized by a momentum of 0.9 and a weight decay of $10^{-4}$. The learning rate is initiated at 0.5 and is adjusted according to a cosine decay schedule complemented by a linear warmup phase.
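The schedule can be sketched as follows (a minimal sketch; the warmup length and minimum learning rate are assumed values for illustration, not taken from the text):

```python
import math

def lr_at_step(step, total_steps, base_lr=0.5, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps      # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The learning rate rises linearly to its peak of 0.5 during warmup, then follows a half-cosine down to the minimum by the final step.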

During the data augmentation process, two enhanced versions of every input image are generated. This involves cropping each image randomly and resizing it back to the original resolution. The images are then subject to random horizontal flipping, color jittering, grayscale conversion, Gaussian blurring, and solarization for further augmentation.

For the linear evaluation phase, the linear classifier is trained for 100 epochs with a batch size of 256. The SGD optimizer is again employed, this time with a momentum of 0.9 and no weight decay. The learning rate is managed using a cosine decay schedule, starting at 0.2 and reaching a minimum of $2\times 10^{-4}$.

\section{A Generalization Bound for Downstream Tasks} \label{app:1} In this Appendix, we present the complete version of Theorem \ref{thm:1} along with its proof and additional discussions.

\subsection{Additional Notation and Details}

We start by introducing additional notation and details. We use the notation $x \in \Xcal$ for an input and $y \in \Ycal \subseteq \RR^r$ for an output. Define $p(y)=\Pr(Y=y)$ to be the probability of getting label $y$ and $\hp(y)=\frac{1}{n}\sum_{i=1}^n \one{y_i=y}$ to be the empirical estimate of $p(y)$. Let $\zeta$ be an upper bound on the norm of the label: $\|y\|_{2} \le \zeta$ for all $y \in \Ycal$. Define the minimum norm solution $W_{\bS}$ of the unlabeled data as $W_{\bS}=\mini_{W'} \|W'\|_{F}$ s.t. $W'\in \argmin_{W} \frac{1}{m} \sum_{i=1}^{m} \|W f_\theta(\xp_{i})-g^{*}(\xp_i)\|^2$. Let $\kappa_{S}$ be a data-dependent upper bound on the per-sample Euclidean norm loss with the trained model: $\|W_{S}f_\theta(x)-y\| \le \kappa_{S}$ for all $(x,y) \in \Xcal \times \Ycal$. Similarly, let $\kappa_{\bS}$ be a data-dependent upper bound on the per-sample Euclidean norm loss: $\|W_{\bS}f_\theta(x)-y\| \le \kappa_{\bS}$ for all $(x,y) \in \Xcal \times \Ycal$. Define the difference between $W_S$ and $W_{\bS}$ by $c=\|W_S-W_{\bS}\|_{2}$. Let $\Wcal$ be a hypothesis space of $W$ such that $W_{\bS} \in \Wcal$. We denote by $\tilde \Rcal_{m}(\Wcal \circ \Fcal)=\frac{1}{\sqrt{m}}\EE_{\bS,\xi}[\sup_{W\in \Wcal, f\in\Fcal} \sum_{i=1}^m \xi_i\|g^{*}(\xp_{i})-Wf(\xp_{i})\|]$ the normalized Rademacher complexity of the set $\{\xp \mapsto\|g^{*}(\xp)-Wf(\xp)\|:W \in \Wcal, f \in \Fcal\}$. We denote by $\kappa$ an upper bound on the per-sample Euclidean norm loss: $\|Wf(x)-y\| \le \kappa$ for all $(x,y,W,f) \in \Xcal \times \Ycal \times \Wcal\times \Fcal$.

We adopt the following data-generating process model, which was used in previous papers analyzing contrastive learning \citep{saunshi2019theoretical, ben2018attentioned}. For the labeled data, first $y$ is drawn from the distribution $\rho$ on $\Ycal$, and then $x$ is drawn from the conditional distribution $\Dcal_{y}$ conditioned on the label $y$. That is, we have the joint distribution $\Dcal(x, y)=\Dcal_{y}(x)\rho(y)$ with $((x_i, y_i))_{i=1}^n \sim\Dcal^{n}$. For the unlabeled data, first each of the \textit{unknown} labels $y^{+}$ and $y^-$ is drawn from the distribution $\rho$, and then each of the positive examples $\xp$ and $\xpp$ is drawn from the conditional distribution $\Dcal_{y^{+}}$, while the negative example $\xn$ is drawn from $\Dcal_{y^-}$. Unlike the analysis of contrastive learning, we do not require negative samples. Let $\tau_{\bS}$ be a data-dependent upper bound on the invariance loss with the trained representation: $\|f_\theta(\bbx)-f_\theta(x)\| \le \tau_{\bS}$ for all $(\bbx,x) \sim \Dcal_{y}^2$ and $y \in \Ycal$. Let $\tau$ be a data-independent upper bound on the invariance loss: $\|f(\bbx)-f(x)\| \le \tau$ for all $(\bbx,x) \sim \Dcal_{y}^2$, $y \in \Ycal$, and $f \in \Fcal$. For simplicity, we assume that there exists a function $g^{*}$ such that $y=g^{*}(x)\in \RR^r$ for all $(x,y) \in \Xcal \times \Ycal$. Discarding this assumption adds the average of the label noise to the final result, which goes to zero as the sample sizes $n$ and $m$ increase, assuming that the mean of the label noise is zero.

\renewcommand{\thesection}{\arabic{section}} \setcounter{section}{0}

\subsection{Proof of Theorem \ref{thm:1}} \label{app:1:1} \begin{proof}[Proof of Theorem \ref{thm:1}] Let $W=W_{S}$, where $W_S$ is the minimum norm solution $W_S =\mini_{W'} \|W'\|_{F}$ s.t. $W'\in \argmin_{W} \frac{1}{n} \sum_{i=1}^n \|W f_\theta(x_i)-y_i\|^2$. Let $W^{*}=W_{\bS}$, where $W_{\bS}$ is the minimum norm solution $W^{*}=W_{\bS}=\mini_{W'} \|W'\|_{F}$ s.t. $W'\in \argmin_{W} \frac{1}{m} \sum_{i=1}^{m} \|W f_\theta(\xp_{i})-g^{*}(\xp_i)\|^2$. Since $y=g^{*}(x)$, \begin{align*} y=g^{*}(x) =W^* f_\theta(x) +(g^{*}(x)-W^* f_\theta(x))=W^* f_\theta(x) +\varphi(x), \end{align*} where $\varphi(x)=g^{*}(x)-W^* f_\theta(x)$. Define $L_{S}(w)= \frac{1}{n} \sum_{i=1}^n \|W f_\theta(x_i)-y_i\|$. Using these, \begin{align*} L_{S}(w) &= \frac{1}{n} \sum_{i=1}^n \|W f_\theta(x_i)-y_i\| \\ & =\frac{1}{n} \sum_{i=1}^n \|W f_\theta(x_i)-W^* f_\theta(x_{i}) -\varphi(x_{i})\| \\ & \ge\frac{1}{n} \sum_{i=1}^n \|W f_\theta(x_i)-W^* f_\theta(x_{i})\| -\frac{1}{n} \sum_{i=1}^n\|\varphi(x_{i})\| \\ & =\frac{1}{n} \sum_{i=1}^n \|\tW f_\theta(x_i)\| - \frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\|, \end{align*} where $\tW =W-W^{*}$. We now consider new fresh samples $\bbx_{i} \sim\Dcal_{y_{i}}$ for $i=1,\dots, n$ to rewrite the above further as: \begin{align*} L_{S}(w) &\ge \frac{1}{n} \sum_{i=1}^n \|\tW f_\theta(x_i)\pm\tW f_\theta(\bbx_i)\| - \frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\| \\ & =\frac{1}{n} \sum_{i=1}^n \|\tW f_\theta(\bbx_i)-(\tW f_\theta(\bbx_i)-\tW f_\theta(x_i))\| - \frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\| \\ & \ge\frac{1}{n} \sum_{i=1}^n \|\tW f_\theta(\bbx_i)\| -\frac{1}{n} \sum_{i=1}^{n}\|\tW f_\theta(\bbx_i)-\tW f_\theta(x_i)\| - \frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\| \\ & =\frac{1}{n} \sum_{i=1}^n \|\tW f_\theta(\bbx_i)\| -\frac{1}{n} \sum_{i=1}^{n}\|\tW (f_\theta(\bbx_i)-f_\theta(x_i))\| - \frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\|. \end{align*}
This implies that $$ \frac{1}{n} \sum_{i=1}^n \|\tW f_\theta(\bbx_i)\| \le L_{S}(w)+\frac{1}{n} \sum_{i=1}^{n}\|\tW (f_\theta(\bbx_i)-f_\theta(x_i))\| + \frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\|. $$ Furthermore, since $y=W^* f_\theta(x) +\varphi(x)$, by writing $\bby_{i}=W^* f_\theta(\bbx_i) +\varphi(\bbx_i)$ (where $\bby_i = y_i$ since $\bbx_{i} \sim\Dcal_{y_{i}}$ for $i=1,\dots, n$), \begin{align*} \frac{1}{n} \sum_{i=1}^n \|\tW f_\theta(\bbx_i)\| &=\frac{1}{n} \sum_{i=1}^n \|Wf_\theta(\bbx_i)-W^* f_\theta(\bbx_i)\| \\ & =\frac{1}{n} \sum_{i=1}^n \|Wf_\theta(\bbx_i)-\bby_i+\varphi(\bbx_i)\| \\ & \ge\frac{1}{n} \sum_{i=1}^n \|Wf_\theta(\bbx_i)-\bby_i\|-\frac{1}{n} \sum_{i=1}^n\|\varphi(\bbx_i)\|.
\end{align*} Combining these, we have that \begin{align} \label{eq:1} \frac{1}{n} \sum_{i=1}^n \|Wf_\theta(\bbx_i)-\bby_i\| &\le L_{S}(w)+\frac{1}{n} \sum_{i=1}^{n}\|\tW (f_\theta(\bbx_i)-f_\theta(x_i))\| \nonumber \\ & \quad + \frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\|+\frac{1}{n} \sum_{i=1}^n\|\varphi(\bbx_i)\|. \end{align} To bound the left-hand side of \eqref{eq:1}, we now analyze the following random variable: \begin{align} \label{eq:5} \EE_{X,Y}[\|W_{S}f_\theta(X)-Y\|] - \frac{1}{n} \sum_{i=1}^n \|W_{S}f_\theta(\bbx_i)-\bby_i\|, \end{align} where $\bby_i = y_i$ since $\bbx_{i} \sim\Dcal_{y_{i}}$ for $i=1,\dots, n$. Importantly, this means that as $W_{S}$ depends on $y_i$, $W_{S}$ depends on $\bby_i$. Thus, the collection of random variables $\|W_{S}f_\theta(\bbx_1)-\bby_1\|,\dots,\|W_{S}f_\theta(\bbx_n)-\bby_n\|$ is \textit{not} independent. Accordingly, we cannot apply standard concentration inequalities to bound \eqref{eq:5}. A standard approach in learning theory is to first bound \eqref{eq:5} by $\EE_{x,y}\|W_{S}f_\theta(x)-y\| - \frac{1}{n} \sum_{i=1}^n \|W_{S}f_\theta(\bbx_i)-\bby_i\| \le\sup_{W \in \Wcal}\EE_{x,y}\|Wf_\theta(x)-y\| - \frac{1}{n} \sum_{i=1}^n \|Wf_\theta(\bbx_i)-\bby_i\|$ for some hypothesis space $\Wcal$ (that is independent of $S$) and realize that the right-hand side now contains the collection of independent random variables $\|Wf_\theta(\bbx_1)-\bby_1\|,\dots,\|Wf_\theta(\bbx_n)-\bby_n\|$, for which we can utilize standard concentration inequalities. This reasoning leads to the Rademacher complexity of the hypothesis space $\Wcal$. However, the complexity of the hypothesis space $\Wcal$ can be very large, resulting in a loose bound. In this proof, we show that we can avoid the dependency on the hypothesis space $\Wcal$ by using a very different approach with conditional expectations to take care of the dependent random variables $\|W_{S}f_\theta(\bbx_1)-\bby_1\|,\dots,\|W_{S}f_\theta(\bbx_n)-\bby_n\|$.
Intuitively, we utilize the fact that for these dependent random variables, there is a structure of conditional independence, conditioned on each $y \in \Ycal$.

We first write the expected loss as the sum of the conditional expected losses: \begin{align*} \EE_{X,Y}[\|W_{S}f_\theta(X)-Y\|]&=\sum_{y\in \Ycal} \EE_{X,Y}[\|W_{S}f_\theta(X)-Y\| \mid Y = y]\Pr(Y = y) \\ & =\sum_{y\in \Ycal}\EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]\Pr(Y = y), \end{align*} where $X_{y}$ is the random variable $X$ conditioned on $Y=y$. Using this, we decompose \eqref{eq:5} into two terms: \begin{align} \label{eq:6} &\EE_{X,Y}[\|W_{S}f_\theta(X)-Y\|] - \frac{1}{n} \sum_{i=1}^n \|W_{S}f_\theta(\bbx_i)-\bby_i\| \nonumber \\ & = \left(\sum_{y\in \Ycal} \EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]\frac{|\Ical_{y}|}{n}- \frac{1}{n} \sum_{i=1}^n \|W_{S}f_\theta(\bbx_i)-\bby_i\|\right) \nonumber \\ & \quad+\sum_{y\in \Ycal} \EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]\left(\Pr(Y = y)- \frac{|\Ical_{y}|}{n}\right), \end{align} where $$ \Ical_{y}=\{i\in[n]: y_{i}=y\}. $$ The first term on the right-hand side of \eqref{eq:6} is further simplified by using $$ \frac{1}{n} \sum_{i=1}^n \|W_{S}f_\theta(\bbx_i)-\bby_i\|=\frac{1}{n}\sum_{y \in \Ycal} \sum_{i \in \Ical_{y}} \|W_{S}f_\theta(\bbx_i)-y\|, $$ as \begin{align*} &\sum_{y\in \Ycal} \EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]\frac{|\Ical_{y}|}{n}- \frac{1}{n} \sum_{i=1}^n \|W_{S}f_\theta(\bbx_i)-\bby_i\| \\ & =\frac{1}{n}\sum_{y \in \tYcal} |\Ical_{y}|\left(\EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]-\frac{1}{|\Ical_{y}|}\sum_{i \in \Ical_{y}}\|W_{S}f_\theta(\bbx_i)-y\| \right), \end{align*} where $\tYcal=\{y \in \Ycal : |\Ical_{y}| \neq 0\}$. Substituting these into equation \eqref{eq:6} yields \begin{align} \label{eq:7} &\EE_{X,Y}[\|W_{S}f_\theta(X)-Y\|] - \frac{1}{n} \sum_{i=1}^n \|W_{S}f_\theta(\bbx_i)-\bby_i\| \nonumber \\ & = \frac{1}{n}\sum_{y \in \tYcal} |\Ical_{y}|\left(\EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]-\frac{1}{|\Ical_{y}|}\sum_{i \in \Ical_{y}}\|W_{S}f_\theta(\bbx_i)-y\| \right) \nonumber \\ & \quad + \sum_{y\in \Ycal} \EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]\left(\Pr(Y = y)- \frac{|\Ical_{y}|}{n}\right).
\end{align}
Importantly, while $\|W_{S}f_\theta(\bbx_1)-\bby_1\|,\dots, \|W_{S}f_\theta(\bbx_n)-\bby_n\|$ on the right-hand side of \eqref{eq:7} are dependent random variables, $\|W_{S}f_\theta(\bbx_1)-y\|,\dots,\|W_{S}f_\theta(\bbx_n)-y\|$ are independent random variables since $W_S$ and $\bbx_i$ are independent and $y$ is fixed here. Thus, by using Hoeffding's inequality (Lemma \ref{lemma:trivial:2}) and taking union bounds over $y \in \tYcal$, we have that with probability at least $1-\delta$, the following holds for all $y \in \tYcal$: $$ \EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]-\frac{1}{|\Ical_{y}|}\sum_{i \in \Ical_{y}}\|W_{S}f_\theta(\bbx_i)-y\| \le \kappa_{S} \sqrt{\frac{\ln(|\tYcal|/\delta)}{2|\Ical_{y}|}}. $$ This implies that with probability at least $1-\delta$, \begin{align*} &\frac{1}{n}\sum_{y \in \tYcal} |\Ical_{y}|\left(\EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]-\frac{1}{|\Ical_{y}|}\sum_{i \in \Ical_{y}}\|W_{S}f_\theta(\bbx_i)-y\| \right) \\ & \le\frac{\kappa_{S}}{n}\sum_{y \in \tYcal} |\Ical_{y}| \sqrt{\frac{\ln(|\tYcal|/\delta)}{2|\Ical_{y}|}} \\ &=\kappa_{S} \left(\sum_{y \in \tYcal} \sqrt{\frac{|\Ical_{y}|}{n}}\right) \sqrt{\frac{\ln(|\tYcal|/\delta)}{2n}}. \end{align*} Substituting this bound into \eqref{eq:7}, we have that with probability at least $1-\delta$, \begin{align} \label{eq:8} &\EE_{X,Y}[\|W_{S}f_\theta(X)-Y\|] - \frac{1}{n} \sum_{i=1}^n \|W_{S}f_\theta(\bbx_i)-\bby_i\| \nonumber \\ & \le \kappa_{S} \left(\sum_{y \in \tYcal} \sqrt{\hp(y)}\right) \sqrt{\frac{\ln(|\tYcal|/\delta)}{2n}} \nonumber \\ & \quad + \sum_{y\in \Ycal} \EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]\left(\Pr(Y = y)- \frac{|\Ical_{y}|}{n}\right), \end{align} where $$ \hp(y)= \frac{|\Ical_{y}|}{n}. $$ Moreover, for the second term on the right-hand side of \eqref{eq:8}, by using Lemma 1 of \citep{kawaguchi2022robust}, we have that with probability at least $1-\delta$, \begin{align*} & \sum_{y\in \Ycal} \EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|]\left(\Pr(Y = y)- \frac{|\Ical_{y}|}{n}\right) \\ &\le \left(\sum_{y\in \Ycal} \sqrt{p(y)}\,\EE_{X_{y}}[\|W_{S}f_\theta(X_{y})-y \|] \right) \sqrt{\frac{2\ln(|\Ycal|/\delta)}{2n}} \\ & \le\kappa_{S} \left(\sum_{y\in \Ycal} \sqrt{p(y)} \right) \sqrt{\frac{2\ln(|\Ycal|/\delta)}{2n}}, \end{align*} where $p(y)=\Pr(Y = y)$. Substituting this bound into \eqref{eq:8} with the union bound, we have that with probability at least $1-\delta$, \begin{align} \label{eq:9} &\EE_{X,Y}[\|W_{S}f_\theta(X)-Y\|] - \frac{1}{n} \sum_{i=1}^n \|W_{S}f_\theta(\bbx_i)-\bby_i\| \nonumber \\ & \le \kappa_{S} \left(\sum_{y \in \tYcal} \sqrt{\hp(y)}\right) \sqrt{\frac{\ln(2|\tYcal|/\delta)}{2n}} +\kappa_{S} \left(\sum_{y\in \Ycal} \sqrt{p(y)} \right) \sqrt{\frac{2\ln(2|\Ycal|/\delta)}{2n}} \nonumber \\ & \le \left(\sum_{y\in \Ycal} \sqrt{\hp(y)}\right)\kappa_{S} \sqrt{\frac{2\ln(2|\Ycal|/\delta)}{2n}} + \left(\sum_{y\in \Ycal} \sqrt{p(y)} \right) \kappa_{S} \sqrt{\frac{2\ln(2|\Ycal|/\delta)}{2n}} \nonumber \\ & \le\kappa_{S} \sqrt{\frac{2\ln(2|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right). \end{align} Combining \eqref{eq:1} and \eqref{eq:9} implies that with probability at least $1-\delta$, \begin{align} \label{eq:10} &\EE_{X,Y}[\|W_{S}f_\theta(X)-Y\|] \nonumber \\ &\le \frac{1}{n} \sum_{i=1}^n \|W_{S}f_\theta(\bbx_i)-\bby_i\|+\kappa_{S} \sqrt{\frac{2\ln(2|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right) \nonumber \\ & \le L_{S}(w_{S})+\frac{1}{n} \sum_{i=1}^{n}\|\tW (f_\theta(\bbx_i)-f_\theta(x_i))\| \nonumber \\ & \quad + \frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\|+\frac{1}{n} \sum_{i=1}^n\|\varphi(\bbx_i)\|+\kappa_{S} \sqrt{\frac{2\ln(2|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right). \end{align}

We will now analyze the term $\frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\|+\frac{1}{n} \sum_{i=1}^n\|\varphi(\bbx_i)\|$ on the right-hand side of \eqref{eq:10}. Since $W^{*}=W_{\bS}$, \begin{align} &\frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\|=\frac{1}{n} \sum_{i=1}^n \|g^{*}(x_{i})-W_{\bS}f_\theta(x_{i})\|. \end{align} By using Hoeffding's inequality (Lemma \ref{lemma:trivial:2}), we have that for any $\delta>0$, with probability at least $1-\delta$, \begin{align*} &\frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\| = \frac{1}{n} \sum_{i=1}^n \|g^{*}(x_{i})-W_{\bS}f_\theta(x_{i})\| \le \EE_{\xp}[\|g^{*}(\xp)-W_{\bS}f_\theta(\xp)\| ]+ \kappa_{\bS} \sqrt{\frac{\ln(1/\delta)}{2n}}. \end{align*} Moreover, by using \citep[Theorem 3.1]{mohri2012foundations} with the loss function $\xp \mapsto \|g^{*}(\xp)-Wf(\xp)\|$ (i.e., Lemma \ref{lemma:trivial:1}), we have that for any $\delta>0$, with probability at least $1-\delta$, \begin{align} \EE_{\xp}[\|g^{*}(\xp)-W_{\bS}f_\theta(\xp)\| ]\le\frac{1}{m}\sum_{i=1}^{m} \|g^{*}(\xp_{i})-W_{\bS}f_\theta(\xp_{i})\|+\frac{2\tilde \Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+\kappa \sqrt{\frac{\ln(1/\delta)}{2m}}, \end{align} where $\tilde \Rcal_{m}(\Wcal \circ \Fcal)=\frac{1}{\sqrt{m}}\EE_{\bS,\xi}[\sup_{W\in \Wcal, f\in\Fcal} \sum_{i=1}^m \xi_i\|g^{*}(\xp_{i})-Wf(\xp_{i})\|]$ is the normalized Rademacher complexity of the set $\{\xp \mapsto\|g^{*}(\xp)-Wf(\xp)\|:W \in \Wcal, f \in \Fcal\}$ (normalized such that $\tilde \Rcal_{m}(\Wcal \circ \Fcal)=O(1)$ as $m\rightarrow \infty$ for typical choices of $\Fcal$), and $\xi_1,\dots,\xi_m$ are independent uniform random variables taking values in $\{-1,1\}$.
Taking union bounds, we have that for any $\delta>0$, with probability at least $1-\delta$, $$ \frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\| \le\frac{1}{m}\sum_{i=1}^{m} \|g^{*}(\xp_{i})-W_{\bS}f_\theta(\xp_{i})\|+\frac{2\tilde \Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+\kappa \sqrt{\frac{\ln(2/\delta)}{2m}} + \kappa_{\bS} \sqrt{\frac{\ln(2/\delta)}{2n}}. $$ Similarly, for any $\delta>0$, with probability at least $1-\delta$, $$ \frac{1}{n} \sum_{i=1}^n \|\varphi(\bbx_{i})\| \le\frac{1}{m}\sum_{i=1}^{m} \|g^{*}(\xp_{i})-W_{\bS}f_\theta(\xp_{i})\|+\frac{2\tilde \Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+\kappa \sqrt{\frac{\ln(2/\delta)}{2m}} + \kappa_{\bS} \sqrt{\frac{\ln(2/\delta)}{2n}}. $$ Thus, by taking union bounds, we have that for any $\delta>0$, with probability at least $1-\delta$, \begin{align} \label{eq:18} &\frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\| +\frac{1}{n} \sum_{i=1}^n \|\varphi(\bbx_{i})\| \nonumber \\ & \le\frac{2}{m}\sum_{i=1}^{m} \|g^{*}(\xp_{i})-W_{\bS}f_\theta(\xp_{i})\|+\frac{4\tilde\Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+2\kappa \sqrt{\frac{\ln(4/\delta)}{2m}} + 2\kappa_{\bS} \sqrt{\frac{\ln(4/\delta)}{2n}}.
\end{align} To analyze the first term on the right-hand side of \eqref{eq:18}, recall that \begin{align} \label{eq:2} W_{\bS} = \mini_{W'} \|W'\|_{F} \text{ s.t. } W'\in \argmin_{W} \frac{1}{m} \sum_{i=1}^{m} \|W f_\theta(\xp_{i})-g^{*}(\xp_i)\|^2 . \end{align} Here, since $W f_\theta(\xp_{i})\in\RR^r$, we have that $$ W f_\theta(\xp_{i}) = \vect[W f_\theta(\xp_{i})]=[f_\theta(\xp_{i})\T \otimes I_r]\vect[W]\in\RR^r, $$ where $I_r \in \RR^{r \times r} $ is the identity matrix, $[f_\theta(\xp_{i})\T \otimes I_r]\in \RR^{r \times dr}$ is the Kronecker product of the two matrices, and $\vect[W] \in \RR^{dr}$ is the vectorization of the matrix $W \in \RR^{r \times d}$. Thus, by defining $A_i=[f_\theta(\xp_{i})\T \otimes I_r] \in \RR^{r \times dr}$ and using the notation $w=\vect[W]$ and its inverse $W=\vect^{-1}[w]$ (i.e., the inverse of the vectorization from $\RR^{r \times d}$ to $\RR^{dr}$ with a fixed ordering), we can rewrite \eqref{eq:2} as $$ W_{\bS}=\vect^{-1}[w_{\bS}] \quad \text{where } \quad w_{\bS}= \mini_{w'} \|w'\|_{2} \text{ s.t. } w'\in \argmin_{w} \sum_{i=1}^{m} \|g_{i}-A_iw\|^{2}, $$ with $g_i = g^{*}(\xp_{i}) \in \RR^r$. Since the function $w \mapsto \sum_{i=1}^{m} \|g_{i}-A_iw\|^{2}$ is convex, a necessary and sufficient condition for a minimizer of this function is $$ 0 = \nabla_w \sum_{i=1}^{m} \|g_{i}-A_iw\|^{2}=-2 \sum_{i=1}^{m} A_i\T (g_{i}-A_iw)\in \RR^{dr}. $$
This implies that $$ \sum_{i=1}^{m}A_i\T A_iw= \sum_{i=1}^{m}A_i\T g_{i}. $$ In other words, \begin{align*} A\T A w=A\T g \quad \text{ where } A=\begin{bmatrix}A_{1} \\ A_{2} \\ \vdots \\ A_{m} \end{bmatrix} \in \RR^{mr \times dr} \text{ and } g=\begin{bmatrix}g_{1} \\ g_{2} \\ \vdots \\ g_{m} \end{bmatrix} \in \RR^{mr}. \end{align*} Thus, $$ \argmin_{w} \sum_{i=1}^{m} \|g_{i}-A_iw\|^{2}= \{(A\T A)^\dagger A\T g+v: v \in \Null(A)\}, $$ where $(A\T A)^\dagger$ is the Moore--Penrose inverse of the matrix $A\T A$ and $\Null(A)$ is the null space of the matrix $A$. Thus, the minimum norm solution is obtained by $$ \vect[W_{\bS}]=w_{\bS}=(A\T A)^\dagger A\T g. $$ Thus, by using this $W_{\bS}$, we have that \begin{align*} \frac{1}{m}\sum_{i=1}^{m} \|g^{*}(\xp_{i})-W_{\bS}f_\theta(\xp_{i})\| &= \frac{1}{m}\sum_{i=1}^{m} \sqrt{\sum_{k=1}^r ((g_{i}-A_iw_{\bS})_k)^2} \\ & \le \sqrt{\frac{1}{m}\sum_{i=1}^{m} \sum_{k=1}^r ((g_{i}-A_iw_{\bS})_k)^2} \\ & = \frac{1}{\sqrt{m}} \sqrt{\sum_{i=1}^{m} \sum_{k=1}^r ((g_{i}-A_iw_{\bS})_k)^2} \\ & =\frac{1}{\sqrt{m}} \|g-Aw_{\bS}\|_{2} \\ & = \frac{1}{\sqrt{m}} \|g-A(A\T A)^\dagger A\T g\|_{2} =\frac{1}{\sqrt{m}}\|(I-A(A\T A)^\dagger A\T )g\|_{2}, \end{align*} where the inequality follows from Jensen's inequality and the concavity of the square root function. Thus, we have that \begin{align} \label{eq:3} &\frac{1}{n} \sum_{i=1}^n \|\varphi(x_{i})\| +\frac{1}{n} \sum_{i=1}^n \|\varphi(\bbx_{i})\| \nonumber \\ & \le\frac{2}{\sqrt{m}}\|(I-A(A\T A)^\dagger A\T )g\|_{2}+\frac{4\tilde\Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+2\kappa \sqrt{\frac{\ln(4/\delta)}{2m}} + 2\kappa_{\bS} \sqrt{\frac{\ln(4/\delta)}{2n}}.
\end{align} By combining \eqref{eq:10} and \eqref{eq:3} with the union bound, we have that \begin{align} \label{eq:4} &\EE_{X,Y}[\|W_{S}f_\theta(X)-Y\|] \nonumber \\ & \le L_{S}(w_{S})+\frac{1}{n} \sum_{i=1}^{n}\|\tW (f_\theta(\bbx_i)-f_\theta(x_i))\|+\frac{2}{\sqrt{m}}\|\Pb_{A}g\|_{2} \nonumber \\ & \quad + \frac{4\tilde\Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+2\kappa \sqrt{\frac{\ln(8/\delta)}{2m}} + 2\kappa_{\bS} \sqrt{\frac{\ln(8/\delta)}{2n}} \nonumber \\ &\quad +\kappa_{S} \sqrt{\frac{2\ln(4|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right), \end{align} where $\tW=W_S-W^*$ and $ \Pb_{A}=I-A(A\T A)^\dagger A\T$.

We will now analyze the second term on the right-hand side of \eqref{eq:4}: \begin{align} \label{eq:11} \frac{1}{n} \sum_{i=1}^{n}\|\tW (f_\theta(\bbx_i)-f_\theta(x_i))\| \le \|\tW\|_{2}\left(\frac{1}{n} \sum_{i=1}^{n}\|f_\theta(\bbx_i)-f_\theta(x_i)\| \right), \end{align} where $\|\tW\|_{2}$ is the spectral norm of $\tW$. Since $\bbx_i$ shares the same label as $x_i$ because $\bbx_i \sim\Dcal_{y_i}$ (and $x_i \sim\Dcal_{y_i}$), and because $f_\theta$ is trained with the unlabeled data $\bS$, using Hoeffding's inequality (Lemma \ref{lemma:trivial:2}) implies that with probability at least $1-\delta$, \begin{align} \label{eq:12} \frac{1}{n} \sum_{i=1}^{n}\|f_\theta(\bbx_i)-f_\theta(x_i)\| \le \EE_{y \sim \rho}\EE_{\bbx,x \sim \Dcal_y^2}[\|f_\theta(\bbx)-f_\theta(x)\|]+\tau_{\bS} \sqrt{\frac{\ln(1/\delta)}{2n}}. \end{align} Moreover, by using \citep[Theorem 3.1]{mohri2012foundations} with the loss function $(x,\bbx) \mapsto \|f_\theta(\bbx)-f_\theta(x)\|$ (i.e., Lemma \ref{lemma:trivial:1}), we have that with probability at least $1-\delta$, \begin{align} \EE_{y \sim \rho}\EE_{\bbx,x \sim \Dcal_y^2}[\|f_\theta(\bbx)-f_\theta(x)\|] \le\frac{1}{m}\sum_{i=1}^{m} \|f_\theta(\xp_{i})-f_\theta(\xpp_{i})\|+\frac{2\tilde \Rcal_{m}(\Fcal)}{\sqrt{m}}+\tau \sqrt{\frac{\ln(1/\delta)}{2m}}, \end{align} where $\tilde \Rcal_{m}(\Fcal)=\frac{1}{\sqrt{m}}\EE_{\bS,\xi}[\sup_{f\in\Fcal} \sum_{i=1}^m \xi_i \|f(\xp_{i})-f(\xpp_{i})\|]$ is the normalized Rademacher complexity of the set $\{(\xp,\xpp) \mapsto\|f(\xp)-f(\xpp)\|: f \in \Fcal\}$ (normalized such that $\tilde \Rcal_{m}(\Fcal)=O(1)$ as $m\rightarrow \infty$ for typical choices of $\Fcal$), and $\xi_1,\dots,\xi_m$ are independent uniform random variables taking values in $\{-1,1\}$.
Thus, taking the union bound, we have that for any $\delta>0$, with probability at least $1-\delta$, \begin{align} \label{eq:13} &\frac{1}{n} \sum_{i=1}^{n}\|\tW (f_\theta(\bbx_i)-f_\theta(x_i))\| \nonumber \\ & \le\|\tW\|_{2}\left(\frac{1}{m}\sum_{i=1}^{m} \|f_\theta(\xp_{i})-f_\theta(\xpp_{i})\|+\frac{2\tilde \Rcal_{m}(\Fcal)}{\sqrt{m}}+\tau \sqrt{\frac{\ln(2/\delta)}{2m}}+\tau_{\bS} \sqrt{\frac{\ln(2/\delta)}{2n}}\right). \end{align}

By combining \eqref{eq:4} and \eqref{eq:13} using the union bound, we have that with probability at least $1-\delta$, \begin{align} \label{eq:14} &\EE_{X,Y}[\|W_{S}f_\theta(X)-Y\|] \nonumber \\ &\le L_{S}(w_{S}) + \|\tW\|_{2} \left( \frac{1}{m}\sum_{i=1}^{m} \|f_\theta(\xp_{i})-f_\theta(\xpp_{i})\|+\frac{2\tilde \Rcal_{m}(\Fcal)}{\sqrt{m}}+\tau \sqrt{\frac{\ln(4/\delta)}{2m}}+\tau_{\bS} \sqrt{\frac{\ln(4/\delta)}{2n}} \right) \nonumber \\ & \quad +\frac{2}{\sqrt{m}}\|\Pb_{A}g\|_{2}+ \frac{4\tilde\Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+2\kappa \sqrt{\frac{\ln(16/\delta)}{2m}} + 2\kappa_{\bS} \sqrt{\frac{\ln(16/\delta)}{2n}} \nonumber \\ & \quad +\kappa_{S} \sqrt{\frac{2\ln(8|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right) \nonumber \\ & =L_{S}(w_{S}) +\|\tW\|_{2} \left(\frac{1}{m}\sum_{i=1}^{m} \|f_\theta(\xp_{i})-f_\theta(\xpp_{i})\|\right)+\frac{2}{\sqrt{m}}\|\Pb_{A}g\|_{2}+Q_{m,n}, \end{align} where \begin{align*} Q_{m,n} &= \|\tW\|_{2} \left(\frac{2\tilde \Rcal_{m}(\Fcal)}{\sqrt{m}}+\tau \sqrt{\frac{\ln(3/\delta)}{2m}}+\tau_{\bS} \sqrt{\frac{\ln(3/\delta)}{2n}}\right) \\ & \quad +\kappa_{S} \sqrt{\frac{2\ln(6|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right) \\ & \quad + \frac{4\tilde\Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+2\kappa \sqrt{\frac{\ln(4/\delta)}{2m}} + 2\kappa_{\bS} \sqrt{\frac{\ln(4/\delta)}{2n}}. \end{align*} Define $\hZ =[f(\xp_1),\dots, f(\xp_{m})] \in \RR^{d\times m}$. Then, we have $A=[\hZ \T \otimes I_r]$. Thus, $$ \Pb_{A}=I-[\hZ \T \otimes I_r][\hZ \hZ \T \otimes I_r]^\dagger[\hZ \otimes I_r]=I-[\hZ \T (\hZ \hZ \T)^\dagger \hZ \otimes I_r]=[\Pb_{\hZ} \otimes I_r], $$ where $\Pb_{\hZ} = I_{m}-\hZ \T (\hZ \hZ \T)^\dagger \hZ \in \RR^{m \times m}$. By defining $Y_{\bS}=[g^{*}(\xp_1),\dots, g^{*}(\xp_{m})]\T \in \RR^{m\times r}$, since $g=\vect[Y_{\bS}\T]$, \begin{align} \label{eq:15} \| \Pb_{A}g\|_{2} =\|[\Pb_{\hZ} \otimes I_r]\vect[Y_{\bS}\T]\|_{2} =\|\vect[Y_{\bS}\T\Pb_{\hZ} ]\|_{2} =\|\Pb_{\hZ} Y_{\bS}\|_{F}.
\end{align} On the other hand, recall that $W_{S}$ is the minimum norm solution $$ W_S =\mini_{W'} \|W'\|_{F} \text{ s.t. } W' \in \argmin_{W} \frac{1}{n} \sum_{i=1}^n \|W f_\theta(x_i)-y_i\|^2. $$ By solving this, we have
$$ W_S =Y_{S}\T \tZ \T (\tZ \tZ\T )^\dagger,
$$ where $\tZ =[f(x_1),\dots, f(x_{n})] \in \RR^{d\times n}$ and $Y_{S}=[y_{1},\dots, y_{n}]\T \in \RR^{n\times r}$. Then, \begin{align*} L_{S}(w_{S})=\frac{1}{n} \sum_{i=1}^n \|W_{S} f_\theta(x_i)-y_i\| &= \frac{1}{n}\sum_{i=1}^{n} \sqrt{\sum_{k=1}^r ((W_{S} f_\theta(x_i)-y_i)_k)^2} \\ & \le \sqrt{\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^r ((W_{S} f_\theta(x_i)-y_i)_k)^2} \\ & = \frac{1}{\sqrt{n}} \|W_{S} \tZ -Y_{S}\T\|_F \\ & =\frac{1}{\sqrt{n}} \|Y_{S}\T (\tZ \T (\tZ \tZ\T )^\dagger \tZ -I)\|_F \\ & =\frac{1}{\sqrt{n}} \|(I-\tZ \T (\tZ \tZ\T )^\dagger \tZ )Y_{S}\|_F. \end{align*} Thus, \begin{align} \label{eq:17} L_{S}(w_{S})\le\frac{1}{\sqrt{n}} \|\Pb_{\tZ} Y_{S}\|_F, \end{align} where $\Pb_{\tZ} = I-\tZ \T (\tZ \tZ\T )^\dagger \tZ$.

By combining \eqref{eq:14}--\eqref{eq:17} and using $1\le \sqrt{2}$, we have that with probability at least $1-\delta$,
\begin{align}
\EE_{X,Y}[\|W_{S}f_\theta(X)-Y\|] \le c I_{\bS}(f_\theta)+\frac{2}{\sqrt{m}}\|\Pb_\hZ Y_\bS\|_{F}+\frac{1}{\sqrt{n}} \|\Pb_\tZ Y_{S}\|_F +Q_{m,n},
\end{align}
where
\begin{align*}
Q_{m,n} &= c \left(\frac{2\tilde \Rcal_{m}(\Fcal)}{\sqrt{m}}+\tau \sqrt{\frac{\ln(4/\delta)}{2m}}+\tau_{\bS} \sqrt{\frac{\ln(4/\delta)}{2n}}\right) \\
& \quad +\kappa_{S} \sqrt{\frac{2\ln(8|\Ycal|/\delta)}{2n}} \sum_{y\in \Ycal} \left(\sqrt{\hp(y)}+\sqrt{p(y)}\right) \\
& \quad + \frac{4\tilde\Rcal_{m}(\Wcal \circ \Fcal)}{\sqrt{m}}+2\kappa \sqrt{\frac{\ln(16/\delta)}{2m}} + 2\kappa_{\bS} \sqrt{\frac{\ln(16/\delta)}{2n}}.
\end{align*}
\end{proof}
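The closed-form minimum-norm solution and the projection-residual identity used above can be sanity-checked numerically. The sketch below (with arbitrary illustrative dimensions, not the paper's experimental setup) verifies that $W_S = Y_{S}^\top \tilde Z^\top (\tilde Z \tilde Z^\top)^\dagger$ agrees with NumPy's minimum-norm least-squares solver, and that the Frobenius-norm training residual equals $\|\Pb_{\tilde Z} Y_S\|_F$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 5, 3, 8   # hypothetical feature dim, output dim, sample count

Z = rng.standard_normal((d, n))   # columns play the role of features f_theta(x_i)
Y = rng.standard_normal((n, r))   # rows play the role of labels y_i

# Closed form from the proof: W_S = Y_S^T Z^T (Z Z^T)^dagger
W_closed = Y.T @ Z.T @ np.linalg.pinv(Z @ Z.T)

# Same minimum-norm least-squares solution via lstsq on Z^T W^T = Y
W_lstsq = np.linalg.lstsq(Z.T, Y, rcond=None)[0].T
assert np.allclose(W_closed, W_lstsq)

# Frobenius residual ||W_S Z - Y^T||_F equals the projection residual ||P_Z Y||_F
P = np.eye(n) - Z.T @ np.linalg.pinv(Z @ Z.T) @ Z
loss_fro = np.linalg.norm(W_closed @ Z - Y.T, "fro")
resid = np.linalg.norm(P @ Y, "fro")
assert np.isclose(loss_fro, resid)
```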

\renewcommand{\thesection}{\Alph{section}}

\setcounter{section}{9} % the next \section is lettered J

\section{Information Optimization and the VICReg Objective} \label{app:info_vicgreg}

\begin{assumption} The eigenvalues of each covariance matrix $\Sigma(X_j)$ lie in a bounded range: $a\leq \lambda(\Sigma(X_j))\leq b$ for every eigenvalue $\lambda(\Sigma(X_j))$. \end{assumption}

\begin{assumption} The squared distances between the means of the Gaussians are bounded: $$M=\max_{i,j} \left\| \mu(X_i) - \mu(X_j) \right\|^2.$$ \end{assumption}

\begin{lemma} The maximum eigenvalue of each $\mu(X_j) \mu(X_j)^T$ is at most $M$. \end{lemma}

\begin{proof} The matrix $\mu(X_j) \mu(X_j)^T$ is the outer product of the mean vector $\mu(X_j)$ with itself, and is therefore a rank-one symmetric positive semidefinite matrix. Its only nonzero eigenvalue is $\|\mu(X_j)\|^2$, attained at the eigenvector $\mu(X_j)$. By the second assumption (taking the means to be centered, so that $\|\mu(X_j)\|^2 \le M$), the maximum eigenvalue of $\mu(X_j) \mu(X_j)^T$ is at most $M$. \end{proof}

\begin{lemma} The maximum eigenvalue of $-\mu_Z \mu_Z^T$ is non-positive, and its eigenvalues are bounded in absolute value by $M$. \end{lemma}

\begin{proof} The matrix $-\mu_Z \mu_Z^T$ is the negative outer product of the overall mean vector $\mu_Z$ with itself, and is therefore a rank-one symmetric negative semidefinite matrix. Its eigenvalues are $0$ and $-\|\mu_Z\|^2$, so they are non-positive and bounded in absolute value by $\|\mu_Z\|^2$, which is at most $M$ by the second assumption. \end{proof}
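A quick numerical check of the two rank-one lemmas above, using NumPy with an arbitrary illustrative mean vector:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.standard_normal(4)      # a hypothetical mean vector mu(X_j)

eigs = np.linalg.eigvalsh(np.outer(mu, mu))
assert np.isclose(eigs.max(), mu @ mu)        # only nonzero eigenvalue is ||mu||^2
assert np.allclose(np.sort(eigs)[:-1], 0.0)   # the rest vanish (rank one)

neg_eigs = np.linalg.eigvalsh(-np.outer(mu, mu))
assert neg_eigs.max() <= 1e-9                 # non-positive spectrum
assert np.isclose(-neg_eigs.min(), mu @ mu)   # largest magnitude is ||mu||^2
```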

\begin{lemma} The sum of the eigenvalues of $\Sigma_Z$ is bounded:
$$\sum_i \lambda_i(\Sigma_Z) \leq (b + M) K.$$
\end{lemma}
\begin{proof} Given a Gaussian mixture model where each component $Z|x_j$ has mean $\mu(X_j)$ and covariance matrix $\Sigma(X_j)$, the mixture can be written as
$$ Z = \sum_j p_j \, Z|x_j, $$
where $p_j$ are the mixing coefficients. The covariance matrix of the mixture is then
$$ \Sigma_Z = \sum_j p_j \left( \Sigma(X_j) + \mu(X_j) \mu(X_j)^T \right) - \mu_Z \mu_Z^T, $$
where $\mu_Z$ is the mean of the mixture distribution. By the two preceding lemmas and the two assumptions, the maximum eigenvalues of $\Sigma(X_j)$, $\mu(X_j)\mu(X_j)^T$, and $\mu_Z\mu_Z^T$ are at most $b$, $M$, and $M$, respectively, so by Weyl's inequality for sums of symmetric matrices, $\lambda_{\max}(\Sigma_Z) \le b+M$. Summing over the eigenvalues yields
$$ \sum_i\lambda_i(\Sigma_Z) \leq (b+M) K. $$
\end{proof}
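The mixture-covariance identity and the eigenvalue-sum bound can be illustrated numerically. The sketch below builds a small Gaussian mixture with hypothetical parameters (isotropic component covariances; $M$ is taken here as a bound on the squared mean norms) and checks the identity by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
K, dim, N = 3, 2, 200_000
p = np.array([0.2, 0.5, 0.3])                   # mixing coefficients p_j
mus = rng.standard_normal((K, dim))             # component means mu(X_j)
variances = 0.5 + np.arange(K)                  # isotropic covariances (0.5 + j) * I

mu_Z = p @ mus                                  # mixture mean
# Sigma_Z = sum_j p_j (Sigma_j + mu_j mu_j^T) - mu_Z mu_Z^T
Sigma_Z = (
    sum(p[j] * (variances[j] * np.eye(dim) + np.outer(mus[j], mus[j])) for j in range(K))
    - np.outer(mu_Z, mu_Z)
)

# Monte-Carlo check of the identity (law of total covariance)
comp = rng.choice(K, size=N, p=p)
samples = mus[comp] + np.sqrt(variances[comp])[:, None] * rng.standard_normal((N, dim))
emp_cov = np.cov(samples, rowvar=False)
assert np.allclose(emp_cov, Sigma_Z, atol=0.1)

# Eigenvalue-sum bound, with b and M computed for this toy mixture
b = variances.max()                              # largest covariance eigenvalue
M = max(mus[i] @ mus[i] for i in range(K))       # bound on ||mu(X_j)||^2 here
assert np.trace(Sigma_Z) <= (b + M) * K
```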

\begin{theorem} Given a Gaussian mixture model where each component $Z|X_i$ has covariance matrix $\Sigma(X_i)$, under the assumptions above, the solution to the optimization problem
\begin{align*}
& \text{maximize} \quad \sum_i \log \frac{\left| \Sigma_Z \right|}{\left| \Sigma(X_i) \right|}
\end{align*}
is a diagonal matrix $\Sigma_Z$ with equal diagonal elements.
\end{theorem}

\begin{lemma}[Hoeffding's inequality] Let $X_1, \dots, X_n$ be independent random variables such that $a\leq X_{i}\leq b$ almost surely, and let $S_{n}=\frac{1}{n}(X_{1}+\cdots +X_{n})$ be their average. Then, for any $\delta>0$,
$$ \PP_S \left( \EE\left[S_{n}\right]-S_{n} \ge (b-a) \sqrt{\frac{\ln(1/\delta)}{2n}} \right) \leq \delta, $$
and
$$ \PP_S \left( S_{n} -\EE\left[S_{n}\right]\ge (b-a) \sqrt{\frac{\ln(1/\delta)}{2n}} \right) \leq \delta. $$
\end{lemma}
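An empirical illustration of the lemma, as a sketch with arbitrary sample sizes and uniform variables on $[0,1]$: the observed probability that the empirical mean deviates by the stated amount should not exceed $\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b, delta = 100, 0.0, 1.0, 0.05
t = (b - a) * np.sqrt(np.log(1 / delta) / (2 * n))  # deviation level from the lemma

trials = 20_000
means = rng.uniform(a, b, size=(trials, n)).mean(axis=1)
failures = np.mean(means - 0.5 >= t)   # E[S_n] = 0.5 for Uniform(0, 1)
assert failures <= delta               # Hoeffding guarantees failure prob <= delta
```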

\begin{lemma} Let $\Gcal$ be a set of functions with codomain $[0, M]$. Then, for any $\delta>0$, with probability at least $1-\delta$ over an i.i.d.\ draw of $m$ samples $S=(q_{i})_{i=1}^m$, the following holds for all $\psi \in \Gcal$:
\begin{align}
\EE_{q}[\psi(q)] \le \frac{1}{m}\sum_{i=1}^{m} \psi(q_{i})+2\Rcal_{m}(\Gcal)+M \sqrt{\frac{\ln(1/\delta)}{2m}},
\end{align}
where $\Rcal_{m}(\Gcal):=\EE_{S,\xi}[\sup_{\psi \in \Gcal}\frac{1}{m} \sum_{i=1}^m \xi_i \psi(q_{i})]$ and $\xi_1,\dots,\xi_m$ are independent uniform random variables taking values in $\{-1,1\}$.
\end{lemma}
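For intuition, $\Rcal_m(\Gcal)$ can be computed exactly for a toy two-function class by enumerating all $2^m$ sign vectors (an illustrative example, not one of the function classes used in the paper):

```python
import numpy as np
from itertools import product

# Toy class G = {psi_0: q -> 0, psi_1: q -> q} on four fixed samples in [0, 1],
# so the supremum inside the expectation is max(0, (1/m) sum_i xi_i q_i).
q = np.array([0.2, 0.9, 0.5, 0.7])
m = len(q)

# Exact expectation over xi by enumerating all 2^m sign vectors
rad = np.mean([max(0.0, np.dot(xi, q) / m) for xi in product([-1, 1], repeat=m)])

assert 0.0 <= rad <= q.mean()   # bounded by psi_1's best attainable value
```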



\begin{lemma} Let $\Sigma_Z$ be a positive semidefinite matrix of size $N \times N$. Consider the optimization problem
\begin{align*}
\text{maximize} \quad & \log\det(\Sigma_Z) \\
\text{subject to} \quad & \sum_{i=1}^{N} \lambda_i(\Sigma_Z) \leq c, \\
& \Sigma_Z \succeq 0,
\end{align*}
where $\lambda_i(\Sigma_Z)$ denotes the $i$-th eigenvalue of $\Sigma_Z$ and $c$ is a constant. The solution to this problem is a diagonal matrix with equal diagonal elements.
\end{lemma}
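A numerical illustration of the lemma: among positive semidefinite matrices with a fixed eigenvalue sum (trace), the scaled identity $\frac{c}{N} I$ attains the largest log-determinant. The sketch below compares it against random trace-normalized competitors:

```python
import numpy as np

rng = np.random.default_rng(0)
N, c = 4, 8.0

def logdet(mat):
    return np.linalg.slogdet(mat)[1]

opt = (c / N) * np.eye(N)   # claimed maximizer: equal diagonal entries c/N

# No random PSD matrix with the same eigenvalue sum (trace = c) does better
for _ in range(100):
    A = rng.standard_normal((N, N))
    psd = A @ A.T
    psd *= c / np.trace(psd)   # rescale so the eigenvalue sum equals c
    assert logdet(psd) <= logdet(opt) + 1e-9
```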

\begin{proof} See Appendix~\ref{app:2}. \end{proof}

\begin{proof} The complete version of Theorem~\ref{thm:1} and its proof are presented in Appendix~\ref{app:1}. \end{proof}

\begin{proof} If $\int_{\omega}p(\bx|\bx^*_{n(\bx)})d\bx \approx 1$, then $f$ is linear within the effective support of $p$. Therefore, any sample from $p$ will almost surely lie within a single region $\omega \in \Omega$, and the entire mapping can be treated as linear with respect to $p$. Thus, the output distribution is a linear transformation of the input distribution, given by the per-region affine mapping. \end{proof}
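The per-region affine behavior of a deterministic piecewise-linear network can be illustrated directly: for a small randomly initialized ReLU network (a hypothetical toy model, not the paper's architecture), all points sharing a ReLU activation pattern are mapped by one fixed affine map.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 2)), rng.standard_normal(8)
W2, b2 = rng.standard_normal((3, 8)), rng.standard_normal(3)

def net(x):                      # one-hidden-layer ReLU network
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def pattern(x):                  # ReLU activation pattern, constant on a region omega
    return W1 @ x + b1 > 0

x0 = rng.standard_normal(2)
D = np.diag(pattern(x0).astype(float))
A, cvec = W2 @ D @ W1, W2 @ D @ b1 + b2   # the per-region affine map x -> A x + cvec

for _ in range(20):
    x = x0 + 1e-4 * rng.standard_normal(2)    # tiny perturbations around x0
    if (pattern(x) == pattern(x0)).all():     # guard: same region only
        assert np.allclose(net(x), A @ x + cvec)
```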

\begin{proof} By Hoeffding's inequality, we have that for all $t>0$,
$$ \PP_S \left( \EE\left[S_{n}\right]-S_{n} \ge t\right)\leq \exp \left(-\frac {2nt^{2}}{(b-a)^{2}}\right), $$
and
$$ \PP_S \left(S_{n} - \EE\left[S_{n}\right]\ge t\right)\leq \exp \left(-\frac {2nt^{2}}{(b-a)^{2}}\right). $$
Setting $\delta=\exp \left(-\frac {2nt^{2}}{(b-a)^{2}}\right)$ and solving for $t>0$,
\begin{align*}
& 1/\delta=\exp \left(\frac {2nt^{2}}{(b-a)^{2}}\right) \\
& \Longrightarrow \ln(1/\delta)= \frac {2nt^{2}}{(b-a)^{2}} \\
& \Longrightarrow \frac{(b-a)^{2}\ln(1/\delta)}{2n}= t^2 \\
& \Longrightarrow t =(b-a) \sqrt{\frac{\ln(1/\delta)}{2n}}.
\end{align*}
\end{proof}

\begin{proof} Let $S=(q_{i})_{i=1}^m$ and $S'=(q_{i}')_{i=1}^m$. Define
\begin{align}
\varphi(S)= \sup_{\psi \in \Gcal} \EE_{q}[\psi(q)]-\frac{1}{m}\sum_{i=1}^{m}\psi(q_{i}).
\end{align}
To apply McDiarmid's inequality to $\varphi(S)$, we compute an upper bound on $|\varphi(S)-\varphi(S')|$, where $S$ and $S'$ are two datasets differing by exactly one point of an arbitrary index $i_{0}$; i.e., $S_i= S'_i$ for all $i\neq i_{0}$ and $S_{i_{0}} \neq S'_{i_{0}}$. Then,
\begin{align}
\varphi(S')-\varphi(S) \le\sup_{\psi \in \Gcal}\frac{\psi(q_{i_0})-\psi(q'_{i_0})}{m} \le \frac{M}{m}.
\end{align}
Similarly, $\varphi(S)-\varphi(S')\le \frac{M}{m}$. Thus, by McDiarmid's inequality, for any $\delta>0$, with probability at least $1-\delta$,
\begin{align}
\varphi(S) \le \EE_{S}[\varphi(S)] + M \sqrt{\frac{\ln(1/\delta)}{2m}}.
\end{align}
Moreover,
\begin{align*}
&\EE_{S}[\varphi(S)] = \EE_{S}\left[\sup_{\psi \in \Gcal} \EE_{S'}\left[\frac{1}{m}\sum_{i=1}^{m}\psi(q_i')\right]-\frac{1}{m}\sum_{i=1}^{m}\psi(q_i)\right] \\
& \le\EE_{S,S'}\left[\sup_{\psi \in \Gcal} \frac{1}{m}\sum_{i=1}^m (\psi(q'_i)-\psi(q_i))\right] \\
& \le \EE_{\xi, S, S'}\left[\sup_{\psi \in \Gcal} \frac{1}{m}\sum_{i=1}^m \xi_i(\psi(q'_{i})-\psi(q_i))\right] \\
& \le2\EE_{\xi, S}\left[\sup_{\psi \in \Gcal} \frac{1}{m}\sum_{i=1}^m \xi_i\psi(q_i)\right] =2\Rcal_{m}(\Gcal),
\end{align*}
where the first line follows from the definitions of each term, the second line uses Jensen's inequality and the convexity of the supremum, the third line follows since, for each $\xi_i \in \{-1,+1\}$, the distribution of $\xi_i (\psi(q'_i)-\psi(q_i))$ is the same as that of $\psi(q'_i)-\psi(q_i)$ because $S$ and $S'$ are drawn i.i.d.\ from the same distribution, and the fourth line uses the subadditivity of the supremum.
\end{proof}





\begin{proof} The determinant of a matrix is the product of its eigenvalues, so the objective function $\log\det(\Sigma_Z)$ can be rewritten as $\sum_{i=1}^N \log(\lambda_i(\Sigma_Z))$. Our problem is then to maximize this sum under the constraints that the sum of the eigenvalues does not exceed $c$ and that $\Sigma_Z$ is positive semidefinite. Applying Jensen's inequality to the concave function $\log(x)$ with weights $1/N$, we find that $\frac{1}{N}\sum_{i=1}^N \log(\lambda_i(\Sigma_Z)) \leq \log\left(\frac{1}{N}\sum_{i=1}^N \lambda_i(\Sigma_Z)\right)$. Equality holds if and only if all $\lambda_i(\Sigma_Z)$ are equal. Setting $\lambda_i(\Sigma_Z) = x$ for all $i$, the constraint $\sum_{i=1}^{N} \lambda_i(\Sigma_Z) \leq c$ becomes $Nx \leq c$, leading to the optimal eigenvalue $x = c/N$ under the constraint. Since $\Sigma_Z$ is positive semidefinite, it can be diagonalized via an orthogonal transformation without changing the sum of its eigenvalues or its determinant. Therefore, the solution to the problem is a diagonal matrix with all diagonal entries equal to $c/N$. This completes the proof. \end{proof}

\begin{proof} The objective function can be decomposed as follows:
\begin{align*}
\sum_i \log \frac{\left| \Sigma_Z \right|}{\left| \Sigma(X_i) \right|} &= \sum_i \left( \log \left| \Sigma_Z \right| - \log \left| \Sigma(X_i) \right| \right) \\
&= K \log \left| \Sigma_Z \right| - \sum_i \log \left| \Sigma(X_i) \right|,
\end{align*}
where $K$ is the number of components in the Gaussian mixture model. In this optimization problem, we are optimizing over $\Sigma_Z$. The term $\sum_i \log \left| \Sigma(X_i)\right|$ is constant with respect to $\Sigma_Z$, so we can focus on maximizing $K \log \left| \Sigma_Z \right|$. As the determinant of a matrix is the product of its eigenvalues, $\log \left| \Sigma_Z \right|$ is the sum of the logarithms of the eigenvalues of $\Sigma_Z$; maximizing $\log \left| \Sigma_Z \right|$ therefore corresponds to maximizing the sum of the logarithms of the eigenvalues of $\Sigma_Z$. According to Lemma 1.4, when there is a constraint on the sum of the eigenvalues, the solution to the problem of maximizing the sum of the logarithms of the eigenvalues of a positive semidefinite matrix $\Sigma_Z$ is a diagonal matrix with equal diagonal elements. From Lemma 1.3, we know that the sum of the eigenvalues of $\Sigma_Z$ is bounded by $(b + M) K$. Therefore, when we maximize $K \log \left| \Sigma_Z \right|$ under these constraints, the solution is a diagonal matrix with equal diagonal elements. This completes the proof of the theorem. \end{proof}

In this section, we present additional empirical results on the connection between our generalization bound and the generalization gap.

| Method | CIFAR-10 (ResNet-18) | Tiny-ImageNet (ConvNeXt) | Tiny-ImageNet (ViT) | CIFAR-100 (ConvNeXt) | CIFAR-100 (ViT) |
|---|---|---|---|---|---|
| SimCLR | 89.72 ± 0.05 | 50.86 ± 0.13 | 51.16 ± 0.13 | 67.21 ± 0.24 | 67.31 ± 0.18 |
| Barlow Twins | 88.81 ± 0.10 | 51.34 ± 0.10 | 51.40 ± 0.16 | 68.54 ± 0.15 | 68.02 ± 0.12 |
| SwAV | 89.12 ± 0.13 | 50.76 ± 0.14 | 51.54 ± 0.20 | 68.93 ± 0.14 | 67.89 ± 0.21 |
| MoCo | 89.46 ± 0.08 | 52.36 ± 0.21 | 53.06 ± 0.21 | 70.32 ± 0.15 | 69.89 ± 0.14 |
| VICReg | 89.32 ± 0.09 | 51.02 ± 0.26 | 52.12 ± 0.25 | 70.09 ± 0.20 | 70.12 ± 0.17 |
| BYOL | 89.21 ± 0.11 | 52.24 ± 0.17 | 53.44 ± 0.20 | 70.01 ± 0.27 | 69.59 ± 0.22 |
| VICReg + PairDist (ours) | 90.37 ± 0.09 | 52.61 ± 0.15 | 53.70 ± 0.13 | 71.10 ± 0.16 | 70.50 ± 0.19 |
| VICReg + LogDet (ours) | 90.27 ± 0.08 | 52.91 ± 0.17 | 54.89 ± 0.20 | 71.23 ± 0.18 | 70.61 ± 0.17 |
| BYOL + PairDist (ours) | 90.19 ± 0.14 | 53.47 ± 0.22 | 54.33 ± 0.21 | 71.39 ± 0.25 | 71.09 ± 0.24 |
| BYOL + LogDet (ours) | 90.11 ± 0.16 | 53.19 ± 0.25 | 54.67 ± 0.27 | 71.20 ± 0.21 | 70.79 ± 0.26 |
| β (noise level) | Tiny-ImageNet: Noisy Network | Tiny-ImageNet: Noisy Input (ours) | CIFAR-100: Noisy Network | CIFAR-100: Noisy Input (ours) |
|---|---|---|---|---|
| β = 0 (no noise) | 53.1 | 53.1 | 70.1 | 70.1 |
| β = 0.05 | 51.7 | 53.0 | 69.7 | 70.0 |
| β = 0.1 | 50.2 | 52.8 | 68.8 | 69.6 |
| β = 0.2 | 48.1 | 52.3 | 67.1 | 68.9 |
| Noise level | Deterministic Network | Noisy Network | Noisy Input (our method) |
|---|---|---|---|
| β = 0.0 | 0.97 | 0.82 | 0.93 |
| β = 0.1 | 0.97 | 0.69 | 0.85 |
| β = 0.2 | 0.97 | 0.54 | 0.77 |
| β = 0.3 | 0.97 | 0.32 | 0.69 |

References

[dsprites17] Loic Matthey, Irina Higgins, Demis Hassabis, Alexander Lerchner. (2017). dSprites: Disentanglement testing Sprites dataset.

[icml2023kzxinfodl] Kenji Kawaguchi, Zhun Deng, Xu Ji, Jiaoyang Huang. (2023). How Does Information Bottleneck Help Deep Learning?. International Conference on Machine Learning (ICML).

[rudin2006real] Rudin, Walter. (2006). Real and Complex Analysis.

[belghazi2018mine] Belghazi, Mohamed Ishmael, Baratin, Aristide, Rajeswar, Sai, Ozair, Sherjil, Bengio, Yoshua, Courville, Aaron, Hjelm, R Devon. (2018). Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062.

[deng2009imagenet] Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. (2009). Imagenet: A large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition.

[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, Mairal, Julien, Bojanowski, Piotr, Joulin, Armand. (2021). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[https://doi.org/10.48550/arxiv.2205.11508] Balestriero, Randall, LeCun, Yann. (2022). Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods. doi:10.48550/ARXIV.2205.11508.

[shwartz-ziv2023what] Ravid Shwartz-Ziv, Randall Balestriero, Kenji Kawaguchi, Yann LeCun. (2023). What Do We Maximize in Self-Supervised Learning And Why Does Generalization Emerge?.

[IM2003] David Barber, Felix V. Agakov. (2003). The IM Algorithm: A Variational Approach to Information Maximization. NIPS.

[misra2020self] Misra, Ishan, Maaten, Laurens van der. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. International conference on machine learning.

[bromley1993signature] Bromley, Jane, Guyon, Isabelle, LeCun, Yann, Säckinger, Eduard, Shah, Roopak. (1993). Signature verification using a "Siamese" time delay neural network. Advances in neural information processing systems.

[shwartz2022we] Shwartz-Ziv, Ravid, Balestriero, Randall, LeCun, Yann. (2022). What Do We Maximize in Self-Supervised Learning?. arXiv preprint arXiv:2207.10081.

[zhouyin2021understanding] Zhouyin, Zhanghao, Liu, Ding. (2021). Understanding neural networks with logarithm determinant entropy estimator. arXiv preprint arXiv:2105.03705.

[MISRA2005324] Neeraj Misra, Harshinder Singh, Eugene Demchuk. (2005). Estimation of the entropy of a multivariate normal distribution. Journal of Multivariate Analysis. doi:https://doi.org/10.1016/j.jmva.2003.10.003.

[30996] Brewer, Brendon J. (2017). Computing entropies with nested sampling. Entropy.

[entropyapprox2008] Huber, Marco, Bailey, Tim, Durrant-Whyte, Hugh, Hanebeck, Uwe. (2008). On Entropy Approximation for Gaussian Mixture Random Vectors. IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. doi:10.1109/MFI.2008.4648062.

[koltchinskii2000random] Koltchinskii, Vladimir, Giné, Evarist. (2000). Random matrix approximation of spectra of integral operators. Bernoulli.

[ben2018attentioned] Ben-Ari, Itamar, Shwartz-Ziv, Ravid. (2018). Attentioned convolutional lstm inpaintingnetwork for anomaly detection in videos. arXiv preprint arXiv:1811.10228.

[wang2022rethinking] Wang, Haoqing, Guo, Xun, Deng, Zhi-Hong, Lu, Yan. (2022). Rethinking minimal sufficient representation in contrastive learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[7410696] Giles, Mike B. (2008). Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation. Advances in Automatic Differentiation.

[dang2018eigendecomposition] Dang, Zheng, Yi, Kwang Moo, Hu, Yinlin, Wang, Fei, Fua, Pascal, Salzmann, Mathieu. (2018). Eigendecomposition-free training of deep networks with zero eigenvalue-based losses. Proceedings of the European Conference on Computer Vision (ECCV).

[shi2009data] Shi, Tao, Belkin, Mikhail, Yu, Bin. (2009). Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics.

[boundsgmmentropy] Kolchinsky, Artemy, Tracey, Brendan D. (2017). Estimating mixture entropy with pairwise distances. Entropy.

[balestriero2020mad] Balestriero, Randall, Baraniuk, Richard. (2020). Mad max: Affine spline insights into deep learning. Proceedings of the IEEE.

[heusel2017gans] Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Hochreiter, Sepp. (2017). GANs trained by a two time-scale update rule converge to a local nash equilibrium. Proc. NeurIPS.

[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, others. (2020). Bootstrap your own latent: a new approach to self-supervised learning. Advances in neural information processing systems.

[che2020your] Che, Tong, Zhang, Ruixiang, Sohl-Dickstein, Jascha, Larochelle, Hugo, Paull, Liam, Cao, Yuan, Bengio, Yoshua. (2020). Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling. arXiv preprint arXiv:2003.06060.

[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[piran2020dual] Piran, Zoe, Shwartz-Ziv, Ravid, Tishby, Naftali. (2020). The dual information bottleneck. arXiv preprint arXiv:2006.04641.

[8437679] Russo, Daniel, Zou, James. (2019). How much does your data exploration overfit? Controlling bias via information usage. IEEE Transactions on Information Theory. doi:10.1109/ISIT.2018.8437679.

[caron2020unsupervised] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems.

[chen2021exploring] Chen, Xinlei, He, Kaiming. (2021). Exploring simple siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[jing2022understanding] Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian. (2022). Understanding Dimensional Collapse in Contrastive Self-supervised Learning. International Conference on Learning Representations.

[tanaka2019discriminator] Tanaka, Akinori. (2019). Discriminator optimal transport. arXiv preprint arXiv:1910.06832.

[metz2016unrolled] Metz, Luke, Poole, Ben, Pfau, David, Sohl-Dickstein, Jascha. (2016). Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163.

[peyre2009manifold] Peyré, Gabriel. (2009). Manifold models for signals and images. Computer vision and image understanding.

[faceapi] Microsoft Cognitive Services. Face API.

[wood1996estimation] Wood, GR, Zhang, BP. (1996). Estimation of the Lipschitz constant of a function. J. Global Optim..

[cheney2009course] Cheney, Elliott Ward, Light, William Allan. (2009). A course in approximation theory.

[baggenstoss2017uniform] Baggenstoss, Paul M. (2017). Uniform manifold sampling (UMS): Sampling the maximum entropy pdf. IEEE Trans. Signal Processing.

[gulrajani2017improved] Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, Courville, Aaron. (2017). Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028.

[scaman2018lipschitz] Scaman, Kevin, Virmaux, Aladin. (2018). Lipschitz regularity of deep neural networks: analysis and efficient estimation. arXiv preprint arXiv:1805.10965.

[shwartzziv2023] Shwartz-Ziv, Ravid, LeCun, Yann. (2023). To Compress or Not to Compress--Self-Supervised Learning and Information Theory: A Review. arXiv preprint arXiv:2304.09355.

[thirumuruganathan2020approximate] Thirumuruganathan, Saravanan, Hasan, Shohedul, Koudas, Nick, Das, Gautam. (2020). Approximate query processing for data exploration using deep generative models. Proc. ICDE.

[karras2017progressive] Karras, Tero, Aila, Timo, Laine, Samuli, Lehtinen, Jaakko. (2017). Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

[vahdat2020nvae] Vahdat, Arash, Kautz, Jan. (2020). Nvae: A deep hierarchical variational autoencoder. Proc. NeurIPS.

[tan2020fairgen] Tan, Shuhan, Shen, Yujun, Zhou, Bolei. (2020). Improving the Fairness of Deep Generative Models without Retraining. arXiv preprint arXiv:2012.04842.

[hwang2020fairfacegan] Hwang, Sunhee, Park, Sungho, Kim, Dohyung, Do, Mirae, Byun, Hyeran. (2020). FairfaceGAN: Fairness-aware facial image-to-image translation. Proc. BMVC.

[karras2020analyzing] Karras, Tero, Laine, Samuli, Aittala, Miika, Hellsten, Janne, Lehtinen, Jaakko, Aila, Timo. (2020). Analyzing and improving the image quality of stylegan. Proc. CVPR.

[brock2018large] Brock, Andrew, Donahue, Jeff, Simonyan, Karen. (2019). Large scale GAN training for high fidelity natural image synthesis. Proc. ICLR.

[thanh2019improving] Thanh-Tung, Hoang, Tran, Truyen, Venkatesh, Svetha. (2019). Improving generalization and stability of generative adversarial networks. arXiv preprint arXiv:1902.03984.

[sandfort2019data] Sandfort, Veit, Yan, Ke, Pickhardt, Perry J, Summers, Ronald M. (2019). Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Scientific reports.

[zhao2018bias] Zhao, Shengjia, Ren, Hongyu, Yuan, Arianna, Song, Jiaming, Goodman, Noah, Ermon, Stefano. (2018). Bias and generalization in deep generative models: An empirical study. arXiv preprint arXiv:1811.03259.

[wu2019generalization] Wu, Bingzhe, Zhao, Shiwan, Chen, ChaoChao, Xu, Haoyang, Wang, Li, Zhang, Xiaolu, Sun, Guangyu, Zhou, Jun. (2019). Generalization in generative adversarial networks: A novel perspective from privacy protection. arXiv preprint arXiv:1908.07882.

[fantuzzi2002identification] Fantuzzi, Cesare, Simani, Silvio, Beghelli, Sergio, Rovatti, Riccardo. (2002). Identification of piecewise affine models in noisy environment. International Journal of Control.

[egerstedt2009control] Egerstedt, Magnus, Martin, Clyde. (2009). Control theoretic splines: optimal control, statistics, and path planning.

[levina2004maximum] Levina, Elizaveta, Bickel, Peter. (2004). Maximum likelihood estimation of intrinsic dimension. Advances in neural information processing systems.

[dempster1977maximum] Dempster, Arthur P, Laird, Nan M, Rubin, Donald B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological).

[xu2018spherical] Xu, Jiacheng, Durrett, Greg. (2018). Spherical latent spaces for stable variational autoencoders. arXiv preprint arXiv:1808.10805.

[chen2018isolating] Chen, Ricky TQ, Li, Xuechen, Grosse, Roger B, Duvenaud, David K. (2018). Isolating sources of disentanglement in variational autoencoders. Proc. NeurIPS.

[miyato2018spectral] Miyato, Takeru, Kataoka, Toshiki, Koyama, Masanori, Yoshida, Yuichi. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.

[mao2017least] Mao, Xudong, Li, Qing, Xie, Haoran, Lau, Raymond YK, Wang, Zhen, Paul Smolley, Stephen. (2017). Least squares generative adversarial networks. Proc. ICCV.

[spivak2018calculus] Spivak, Michael. (2018). Calculus on manifolds: a modern approach to classical theorems of advanced calculus.

[ansuini2019intrinsic] Ansuini, Alessio, Laio, Alessandro, Macke, Jakob H, Zoccolan, Davide. (2019). Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems.

[facco2017estimating] Facco, Elena, d’Errico, Maria, Rodriguez, Alex, Laio, Alessandro. (2017). Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports.

[balestriero2020analytical] Balestriero, Randall, Paris, Sébastien, Baraniuk, Richard. (2020). Analytical Probability Distributions and Exact Expectation-Maximization for Deep Generative Networks. Proc. NeurIPS.

[hara2016analysis] Hara, Kazuyuki, Saitoh, Daisuke, Shouno, Hayaru. (2016). Analysis of dropout learning regarded as ensemble learning. International Conference on Artificial Neural Networks.

[ketchen1996application] Ketchen, David J, Shook, Christopher L. (1996). The application of cluster analysis in strategic management research: an analysis and critique. Strategic management journal.

[thorndike1953belongs] Thorndike, Robert L. (1953). Who belongs in the family?. Psychometrika.

[baldi2013understanding] Baldi, Pierre, Sadowski, Peter J. (2013). Understanding dropout. Advances in neural information processing systems.

[bachman2014learning] Bachman, Philip, Alsharif, Ouais, Precup, Doina. (2014). Learning with pseudo-ensembles. Advances in neural information processing systems.

[bojanowski2017optimizing] Bojanowski, Piotr, Joulin, Armand, Lopez-Paz, David, Szlam, Arthur. (2017). Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776.

[warde2013empirical] Warde-Farley, David, Goodfellow, Ian J, Courville, Aaron, Bengio, Yoshua. (2013). An empirical analysis of dropout in piecewise linear networks. arXiv preprint arXiv:1312.6197.

[glorot2011deep] Glorot, Xavier, Bordes, Antoine, Bengio, Yoshua. (2011). Deep sparse rectifier neural networks. Proceedings of the fourteenth international conference on artificial intelligence and statistics.

[maas2013rectifier] Maas, Andrew L, Hannun, Awni Y, Ng, Andrew Y. (2013). Rectifier nonlinearities improve neural network acoustic models. Proc. icml.

[bruna2013invariant] Bruna, Joan, Mallat, Stéphane. (2013). Invariant scattering convolution networks. IEEE Trans. PAMI.

[zhang2018tropical] Zhang, Liwen, Naitzat, Gregory, Lim, Lek-Heng. (2018). Tropical geometry of deep neural networks. arXiv preprint arXiv:1805.07091.

[he2015delving] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proceedings of the IEEE international conference on computer vision.

[tran2017disentangled] Tran, Luan, Yin, Xi, Liu, Xiaoming. (2017). Disentangled representation learning gan for pose-invariant face recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[lautensack_zuyev_2008] Banerjee, Sudipto, Roy, Anindya. (2014). Linear Algebra and Matrix Analysis for Statistics.

[yim2015rotating] Yim, Junho, Jung, Heechul, Yoo, ByungIn, Choi, Changkyu, Park, Dusik, Kim, Junmo. (2015). Rotating your face using multi-task deep neural network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[wang2005generalization] Wang, Shuning, Sun, Xusheng. (2005). Generalization of hinging hyperplanes. IEEE Trans. Information Theory.

[bjorck2018understanding] Bjorck, Nils, Gomes, Carla P, Selman, Bart, Weinberger, Kilian Q. (2018). Understanding batch normalization. Advances in Neural Information Processing Systems.

[zhao2017towards] Zhao, Shengjia, Song, Jiaming, Ermon, Stefano. (2017). Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658.

[huang2018introvae] Huang, Huaibo, He, Ran, Sun, Zhenan, Tan, Tieniu, others. (2018). Introvae: Introspective variational autoencoders for photographic image synthesis. Advances in Neural Information Processing systems.

[cover2012elements] Cover, Thomas M, Thomas, Joy A. (2012). Elements of Information Theory.

[higgins2017beta] Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, Lerchner, Alexander. (2017). Beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.. Proc. ICLR.

[radford2015unsupervised] Radford, Alec, Metz, Luke, Chintala, Soumith. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

[tomczak2017vae] Tomczak, Jakub M, Welling, Max. (2017). VAE with a VampPrior. arXiv preprint arXiv:1705.07120.

[berg2018sylvester] Berg, Rianne van den, Hasenclever, Leonard, Tomczak, Jakub M, Welling, Max. (2018). Sylvester normalizing flows for variational inference. arXiv preprint arXiv:1803.05649.

[tomczak2016improving] Tomczak, Jakub M, Welling, Max. (2016). Improving variational auto-encoders using householder flow. arXiv preprint arXiv:1611.09630.

[davidson2018hyperspherical] Davidson, Tim R, Falorsi, Luca, De Cao, Nicola, Kipf, Thomas, Tomczak, Jakub M. (2018). Hyperspherical variational auto-encoders. arXiv preprint arXiv:1804.00891.

[li2018learning] Li, Yuanzhi, Liang, Yingyu. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems.

[bryant1995principal] Bryant, Fred B, Yarnold, Paul R. (1995). Principal-components analysis and exploratory and confirmatory factor analysis..

[harman1960modern] Harman, Harry H. (1960). Modern factor analysis..

[kim2018disentangling] Kim, Hyunjik, Mnih, Andriy. (2018). Disentangling by factorising. arXiv preprint arXiv:1802.05983.

[isola2017image] Isola, Phillip, Zhu, Jun-Yan, Zhou, Tinghui, Efros, Alexei A. (2017). Image-to-image translation with conditional adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[donoho1994ideal] Donoho, David L, Johnstone, Iain M, others. (1994). Ideal denoising in an orthonormal basis chosen from a library of bases. Comptes rendus de l'Académie des Sciences.

[breiman1977variable] Breiman, Leo, Meisel, William, Purcell, Edward. (1977). Variable kernel estimates of multivariate densities. Technometrics.

[ben2018gaussian] Ben-Yosef, Matan, Weinshall, Daphna. (2018). Gaussian mixture generative adversarial networks for diverse datasets, and the unsupervised clustering of images. arXiv preprint arXiv:1808.10356.

[yang2019mean] Yang, Greg, Pennington, Jeffrey, Rao, Vinay, Sohl-Dickstein, Jascha, Schoenholz, Samuel S. (2019). A mean field theory of batch normalization. arXiv preprint arXiv:1902.08129.

[wan2013regularization] Wan, Li, Zeiler, Matthew, Zhang, Sixin, Le Cun, Yann, Fergus, Rob. (2013). Regularization of neural networks using dropconnect. International Conference on Machine Learning.

[ulyanov2016instance] Ulyanov, Dmitry, Vedaldi, Andrea, Lempitsky, Victor. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.

[liao2016importance] Liao, Zhibin, Carneiro, Gustavo. (2016). On the importance of normalisation layers in deep learning with piecewise linear activation units. 2016 IEEE Winter Conference on Applications of Computer Vision (WACV).

[cun1998efficient] LeCun, Yann, Bottou, Léon, Orr, Genevieve, Müller, Klaus-Robert. (1998). Efficient BackProp. Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science.

[jin2019auto] Jin, Haifeng, Song, Qingquan, Hu, Xia. (2019). Auto-keras: An efficient neural architecture search system. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

[bender2018understanding] Bender, Gabriel, Kindermans, Pieter-Jan, Zoph, Barret, Vasudevan, Vijay, Le, Quoc. (2018). Understanding and simplifying one-shot architecture search. International Conference on Machine Learning.

[zagoruyko2016wide] Zagoruyko, Sergey, Komodakis, Nikos. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.

[liu2017learning] Liu, Zhuang, Li, Jianguo, Shen, Zhiqiang, Huang, Gao, Yan, Shoumeng, Zhang, Changshui. (2017). Learning efficient convolutional networks through network slimming. Proceedings of the IEEE International Conference on Computer Vision.

[ye2018rethinking] Ye, Jianbo, Lu, Xin, Lin, Zhe, Wang, James Z. (2018). Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124.

[zhang2018shufflenet] Zhang, Xiangyu, Zhou, Xinyu, Lin, Mengxiao, Sun, Jian. (2018). Shufflenet: An extremely efficient convolutional neural network for mobile devices. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[huang2017densely] Huang, Gao, Liu, Zhuang, Van Der Maaten, Laurens, Weinberger, Kilian Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[dabov2009bm3d] Dabov, Kostadin, Foi, Alessandro, Katkovnik, Vladimir, Egiazarian, Karen. (2009). BM3D image denoising with shape-adaptive principal component analysis.

[du2007hyperspectral] Du, Qian, Fowler, James E. (2007). Hyperspectral image compression using JPEG2000 and principal component analysis. IEEE Geoscience and Remote sensing letters.

[pearson1901liii] Pearson, Karl. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science.

[ioffe2017batch] Ioffe, Sergey. (2017). Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. Advances in Neural Information Processing systems.

[nam2018batch] Nam, Hyeonseob, Kim, Hyo-Eun. (2018). Batch-instance normalization for adaptively style-invariant neural networks. Advances in Neural Information Processing Systems.

[box1978statistics] Box, George EP, Hunter, William Gordon, Hunter, J Stuart, others. (1978). Statistics for experimenters.

[wu2018group] Wu, Yuxin, He, Kaiming. (2018). Group normalization. Proceedings of the European Conference on Computer Vision (ECCV).

[balestriero2019geometry] Balestriero, Randall, Cosentino, Romain, Aazhang, Behnaam, Baraniuk, Richard. (2019). The Geometry of Deep Networks: Power Diagram Subdivision. Proc. NeurIPS.

[kohler2019exponential] Kohler, Jonas, Daneshmand, Hadi, Lucchi, Aurelien, Hofmann, Thomas, Zhou, Ming, Neymeyr, Klaus. (2019). Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization. The 22nd International Conference on Artificial Intelligence and Statistics.

[salimans2016weight] Salimans, Tim, Kingma, Durk P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems.

[luo2017learning] Luo, Ping. (2017). Learning deep architectures via generalized whitened neural networks. Proceedings of the 34th International Conference on Machine Learning-Volume 70.

[huang2018decorrelated] Huang, Lei, Yang, Dawei, Lang, Bo, Deng, Jia. (2018). Decorrelated batch normalization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[feldman2006coresets] Feldman, Dan, Fiat, Amos, Sharir, Micha. (2006). Coresets forweighted facilities and their applications. 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[megiddo1982complexity] Megiddo, Nimrod, Tamir, Arie. (1982). On the complexity of locating linear facilities in the plane. Operations research letters.

[feldman2013turning] Feldman, Dan, Schmidt, Melanie, Sohler, Christian. (2013). Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms.

[huang2018condensenet] Huang, Gao, Liu, Shichen, Van der Maaten, Laurens, Weinberger, Kilian Q. (2018). Condensenet: An efficient densenet using learned group convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[tan2019efficientnet] Tan, Mingxing, Le, Quoc V. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv preprint arXiv:1905.11946.

[szegedy2016rethinking] Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, Wojna, Zbigniew. (2016). Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[he2016identity] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Identity mappings in deep residual networks. European conference on computer vision.

[clevert2015fast] Clevert, Djork-Arné, Unterthiner, Thomas, Hochreiter, Sepp. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.

[klambauer2017self] Klambauer, Günter, Unterthiner, Thomas, Mayr, Andreas, Hochreiter, Sepp. (2017). Self-normalizing neural networks. Advances in Neural Information Processing systems.

[lei2016layer] Lei Ba, Jimmy, Kiros, Jamie Ryan, Hinton, Geoffrey E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

[santurkar2018does] Santurkar, Shibani, Tsipras, Dimitris, Ilyas, Andrew, Madry, Aleksander. (2018). How does batch normalization help optimization?. Advances in Neural Information Processing Systems.

[resnet-he] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. (2015). Deep Residual Learning for Image Recognition. CoRR.

[katagiri2002performance] Katagiri, Takahiro. (2002). Performance evaluation of parallel Gram-Schmidt re-orthogonalization methods. International Conference on High Performance Computing for Computational Science.

[dosovitskiy2020image] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[liu2022convnet] Liu, Zhuang, Mao, Hanzi, Wu, Chao-Yuan, Feichtenhofer, Christoph, Darrell, Trevor, Xie, Saining. (2022). A convnet for the 2020s. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[agarwal2004k] Agarwal, Pankaj K, Mustafa, Nabil H. (2004). K-means projective clustering. Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems.

[agarwal2000covering] Agarwal, Pankaj K, Procopiuc, Cecilia M. (2000). Covering points by strips in the plane. Proc. 11th Annual ACM-SIAM Symposium on Discrete Algorithms.

[han2011data] Han, Jiawei, Pei, Jian, Kamber, Micheline. (2011). Data mining: concepts and techniques.

[kaufman1987clustering] Kaufman, Leonard, Rousseeuw, Peter J. (1987). Clustering by means of medoids. Statistical Data Analysis based on the L1 Norm. Y. Dodge, Ed.

[steinhaus1956division] Steinhaus, Hugo. (1956). Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci.

[kanungo2002efficient] Kanungo, Tapas, Mount, David M, Netanyahu, Nathan S, Piatko, Christine D, Silverman, Ruth, Wu, Angela Y. (2002). An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. PAMI.

[tan2006cluster] Tan, Pang-Ning, Steinbach, Michael, Kumar, Vipin, others. (2006). Cluster analysis: basic concepts and algorithms. Introduction to data mining.

[knuth1992two] Knuth, Donald E. (1992). Two notes on notation. The American Mathematical Monthly.

[bell1934exponential] Bell, Eric Temple. (1934). Exponential polynomials. Annals of Mathematics.

[halmos1960naive] Halmos, Paul R. (1960). Naive set theory.

[georgescu2003mean] Georgescu, Bogdan, Shimshoni, Ilan, Meer, Peter. (2003). Mean Shift Based Clustering in High Dimensions: A Texture Classification Example.. ICCV.

[muja2009fast] Muja, Marius, Lowe, David G. (2009). Fast approximate nearest neighbors with automatic algorithm configuration.. VISAPP (1).

[balestriero2018semi] Balestriero, Randall, Glotin, Hervé. (2018). Semi-Supervised Learning Enabled by Multiscale Deep Neural Network Inversion. arXiv preprint arXiv:1802.10172.

[arya1998optimal] Arya, Sunil, Mount, David M, Netanyahu, Nathan S, Silverman, Ruth, Wu, Angela Y. (1998). An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM (JACM).

[hanin2019complexity] Hanin, Boris, Rolnick, David. (2019). Complexity of Linear Regions in Deep Networks. arXiv preprint arXiv:1901.09021.

[konda2014zero] Konda, Kishore, Memisevic, Roland, Krueger, David. (2014). Zero-bias autoencoders and the benefits of co-adapting features. arXiv preprint arXiv:1402.3337.

[wang2018a] Zichao Wang, Randall Balestriero, Richard Baraniuk. (2019). A Max-Affine Spline Perspective of Recurrent Neural Networks. International Conference on Learning Representations.

[balestriero2018from] Randall Balestriero, Richard Baraniuk. (2019). From Hard to Soft: Understanding Deep Network Nonlinearities via Vector Quantization and Statistical Inference. International Conference on Learning Representations.

[chan2015pcanet] Chan, Tsung-Han, Jia, Kui, Gao, Shenghua, Lu, Jiwen, Zeng, Zinan, Ma, Yi. (2015). PCANet: A simple deep learning baseline for image classification?. IEEE Trans. Image Processing.

[lin2015far] Lin, Zhouhan, Memisevic, Roland, Konda, Kishore. (2015). How far can we go without convolution: Improving fully-connected networks. arXiv preprint arXiv:1511.02580.

[johnson1960advanced] Johnson, Roger A. (1960). Advanced Euclidean Geometry: An Elementary Treatise on the Geometry of the Triangle and the Circle: Under the Editorship of John Wesley Young.

[sommerville1958elements] Sommerville, Duncan Mclaren Young. (1958). The elements of non-Euclidean geometry.

[banerjee2005clustering] Banerjee, Arindam, Dhillon, Inderjit S, Ghosh, Joydeep, Sra, Suvrit. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research.

[imai1985voronoi] Imai, Hiroshi, Iri, Masao, Murota, Kazuo. (1985). Voronoi diagram in the Laguerre geometry and its applications. SIAM Journal on Computing.

[candes2015phase] Candes, Emmanuel J, Li, Xiaodong, Soltanolkotabi, Mahdi. (2015). Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Trans. Information Theory.

[komodakis2007approximate] Komodakis, Nikos, Tziritas, Georgios. (2007). Approximate labeling via graph cuts based on linear programming. IEEE Trans. PAMI.

[boykov2001fast] Boykov, Yuri, Veksler, Olga, Zabih, Ramin. (2001). Fast approximate energy minimization via graph cuts. IEEE Trans. PAMI.

[zaslavskiy2009path] Zaslavskiy, Mikhail, Bach, Francis, Vert, Jean-Philippe. (2009). A path following algorithm for the graph matching problem. IEEE Trans. PAMI.

[he2016joint] He, Lifang, Lu, Chun-Ta, Ma, Jiaqi, Cao, Jianping, Shen, Linlin, Yu, Philip S. (2016). Joint community and structural hole spanner detection via harmonic modularity. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[chan2011convex] Chan, Emprise YK, Yeung, Dit-Yan. (2011). A convex formulation of modularity maximization for community detection. Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), Barcelona, Spain.

[joulin2010discriminative] Joulin, Armand, Bach, Francis, Ponce, Jean. (2010). Discriminative clustering for image co-segmentation. Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.

[yuan2017exact] Yuan, Ganzhao, Ghanem, Bernard. (2017). An Exact Penalty Method for Binary Optimization Based on MPEC Formulation.. AAAI.

[simonyan2013deep] Simonyan, Karen, Vedaldi, Andrea, Zisserman, Andrew. (2013). Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

[zintgraf2016new] Zintgraf, Luisa M, Cohen, Taco S, Welling, Max. (2016). A new method to visualize deep neural networks. arXiv preprint arXiv:1603.02518.

[yosinski2015understanding] Yosinski, Jason, Clune, Jeff, Nguyen, Anh, Fuchs, Thomas, Lipson, Hod. (2015). Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579.

[zeiler2014visualizing] Zeiler, Matthew D, Fergus, Rob. (2014). Visualizing and understanding convolutional networks. European conference on computer vision.

[erhan2009visualizing] Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, Vincent, Pascal. (2009). Visualizing higher-layer features of a deep network. University of Montreal.

[comp] Srivastava, R. K., Masci, J., Gomez, F., Schmidhuber, J.. (2014). Understanding locally competitive networks. arXiv preprint arXiv:1410.1165.

[trottier2017parametric] Trottier, L., Giguère, P., Chaib-draa, B. (2017). Parametric exponential linear unit for deep convolutional neural networks. 16th IEEE Int. Conf. Mach. Learn. Appl.

[eldar2003optimal] Eldar, Yonina C, Chan, Albert M. (2003). An optimal whitening approach to linear multiuser detection. IEEE Trans. Information Theory.

[krizhevsky2012imagenet] Krizhevsky, Alex, Sutskever, Ilya, Hinton, Geoffrey E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems.

[nasrabadi2007pattern] Nasrabadi, Nasser M. (2007). Pattern recognition and machine learning. Journal of electronic imaging.

[allen1977unified] Allen, Jont B, Rabiner, Lawrence R. (1977). A unified approach to short-time Fourier analysis and synthesis. Proceedings of the IEEE.

[daniel1976reorthogonalization] Daniel, J. W., Gragg, W. B., Kaufman, L., Stewart, G. W. (1976). Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization. Math. Comput.

[weisstein2002crc] Weisstein, E. W. (2002). CRC Concise Encyclopedia of Mathematics.

[van2016wavenet] Van Den Oord, Aaron, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alex, Kalchbrenner, Nal, Senior, Andrew, Kavukcuoglu, Koray. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[pal1992multilayer] Pal, Sankar K, Mitra, Sushmita. (1992). Multilayer perceptron, fuzzy sets, and classification. IEEE Trans. Neural Networks.

[lecun1995convolutional] LeCun, Yann, Bengio, Yoshua, others. (1995). Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks.

[boureau2010theoretical] Boureau, Y., Ponce, J., LeCun, Y.. (2010). A theoretical analysis of feature pooling in visual recognition. Proc. Int. Conf. Mach. Learn..

[xu2015empirical] Xu, Bing, Wang, Naiyan, Chen, Tianqi, Li, Mu. (2015). Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.

[silver2016mastering] Silver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, others. (2016). Mastering the game of Go with deep neural networks and tree search. nature.

[rabiner1975theory] Rabiner, Lawrence R, Gold, Bernard. (1975). Theory and application of digital signal processing. Englewood Cliffs, NJ: Prentice-Hall.

[neal1998view] Neal, Radford M, Hinton, Geoffrey E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in graphical models.

[hastie2001elements] Hastie, Trevor, Tibshirani, Robert, Friedman, Jerome. (2001). The Elements of Statistical Learning.

[elfwing2018sigmoid] Elfwing, S., Uchibe, E., Doya, K.. (2018). Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw..

[baraniuk1999optimal] Baraniuk, Richard G. (1999). Optimal tree approximation with wavelets. Wavelet Applications in Signal and Image Processing VII.

[goodfellow2016deep] Goodfellow, I., Bengio, Y., Courville, A.. (2016). Deep Learning.

[anden2015joint] Andén, Joakim, Lostanlen, Vincent, Mallat, Stéphane. (2015). Joint time-frequency scattering for audio classification. Machine Learning for Signal Processing (MLSP), 2015 IEEE 25th International Workshop on.

[boutilier2002active] Boutilier, Craig, Zemel, Richard S, Marlin, Benjamin. (2002). Active collaborative filtering. Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence.

[ghahramani1995factorial] Ghahramani, Zoubin. (1995). Factorial learning and the EM algorithm. Advances in Neural Information Processing systems.

[ross2003multiple] Ross, David A, Zemel, Richard S. (2003). Multiple cause vector quantization. Advances in Neural Information Processing Systems.

[montufar2014number] Montufar, Guido F, Pascanu, Razvan, Cho, Kyunghyun, Bengio, Yoshua. (2014). On the number of linear regions of deep neural networks. Proc. NeurIPS.

[gulcehre2016mollifying] Gulcehre, Caglar, Moczulski, Marcin, Visin, Francesco, Bengio, Yoshua. (2016). Mollifying networks. arXiv preprint arXiv:1608.04980.

[xu2013block] Xu, Yangyang, Yin, Wotao. (2013). A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences.

[amos2016input] Amos, Brandon, Xu, Lei, Kolter, J Zico. (2016). Input convex neural networks. arXiv preprint arXiv:1609.07152.

[cohen2001tree] Cohen, Albert, Dahmen, Wolfgang, Daubechies, Ingrid, DeVore, Ronald. (2001). Tree approximation and optimal encoding. Applied and Computational Harmonic Analysis.

[nam2014local] Nam, Woonhyun, Dollár, Piotr, Han, Joon Hee. (2014). Local decorrelation for improved pedestrian detection. Advances in Neural Information Processing Systems.

[vgg] Simonyan, K., Zisserman, A.. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR.

[shannon1959mathematical] Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal.

[nakhmani2013new] Nakhmani, Arie, Tannenbaum, Allen. (2013). A new distance measure based on generalized image normalized cross-correlation for robust video tracking and image recognition. Pattern recognition letters.

[eldar2001orthogonal] Eldar, Yonina C, Oppenheim, Alan V. (2001). Orthogonal matched filter detection. Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP'01). 2001 IEEE International Conference on.

[aschwanden1992experimental] Aschwanden, P, Guggenbuhl, W. (1992). Experimental results from a comparative study on correlation-type registration algorithms. Robust computer vision.

[eldar2002orthogonal] Eldar, Yonina C, Oppenheim, Alan V. (2002). Orthogonal multiuser detection. Signal Processing.

[eldar2004orthogonal] Eldar, Yonina C, Oppenheim, Alan V, Egnor, Dianne. (2004). Orthogonal and projected orthogonal matched filter detection. Signal Processing.

[bishop1995neural] Bishop, C. M.. (1995). Neural networks for pattern recognition.

[krishnavisualgenome] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L., Shamma, D. A., Bernstein, M., Fei-Fei, L. (2016). Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.

[szegedy2013intriguing] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

[roberts1993convex] Roberts, Arthur Wayne. (1993). Convex functions. Handbook of Convex Geometry, Part B.

[goodfellow2014explaining] Goodfellow, Ian J, Shlens, Jonathon, Szegedy, Christian. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

[bennett1985structural] Bennett, JA, Botkin, ME. (1985). Structural shape optimization with geometric description and adaptive mesh refinement. AIAA journal.

[kurakin2016adversarial] Kurakin, Alexey, Goodfellow, Ian, Bengio, Samy. (2016). Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.

[papernot2017practical] Papernot, Nicolas, McDaniel, Patrick, Goodfellow, Ian, Jha, Somesh, Celik, Z Berkay, Swami, Ananthram. (2017). Practical black-box attacks against machine learning. Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security.

[medard2000effect] Feder, M., Lapidoth, A.. (1998). Universal decoding for channels with memory. IEEE Trans. Info. Theory. doi:10.1109/18.841172.

[tuncel2009capacity] Tuncel, E.. (2009). Capacity/storage tradeoff in high-dimensional identification systems. IEEE Trans. Info. Theory.

[dasarathy2011reliability] Dasarathy, G., Draper, S. C.. (2011). On reliability of content identification from databases based on noisy queries. Proc. IEEE Intl. Symp. Info. Theory (ISIT'11).

[giles1987learning] Giles, C. L., Maxwell, T.. (1987). Learning, invariance, and generalization in high-order neural networks. Appl. Opt..

[cohen2016group] Cohen, T. S., Welling, M.. (2016). Group Equivariant Convolutional Networks. arXiv preprint arXiv:1602.07576.

[Karpathy-viz-rnn:2015wu] Karpathy, A., Johnson, J., Fei-Fei, L. (2015). Visualizing and Understanding Recurrent Networks. arXiv preprint.

[hyvarinen2004independent] Hyvärinen, Aapo, Karhunen, Juha, Oja, Erkki. (2004). Independent component analysis.

[dltutorial] LeCun, Yann, Ranzato, Marc' Aurelio. (2013). Deep Learning Tutorial.

[cappe2007onlineEM] Cappé, Olivier, Moulines, Eric. (2007). Online EM Algorithm for Latent Data Models. ArXiv e-prints.

[hegde2012convex] Hegde, Chinmay, Sankaranarayanan, Aswin, Yin, Wotao, Baraniuk, Richard. (2012). A convex approach for learning near-isometric linear embeddings. preparation, August.

[bengio2013deep] Bengio, Yoshua. (2013). Deep learning of representations: Looking forward. Statistical language and speech processing.

[mallat2012group] Mallat, S.. (2012). Group invariant scattering. Comm. Pure Appl. Math..

[nasrabadi1988image] Nasrabadi, N. M., King, R. A. (1988). Image coding using vector quantization: A review. IEEE Trans. Commun.

[rister2017piecewise] Rister, Blaine, Rubin, Daniel L. (2017). Piecewise convexity of artificial neural networks. Neural Networks.

[specht1990probabilistic] Specht, Donald F. (1990). Probabilistic neural networks. Neural networks.

[variani2015gaussian] Variani, Ehsan, McDermott, Erik, Heigold, Georg. (2015). A Gaussian mixture model layer jointly optimized with discriminative features within a deep neural network architecture. Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on.

[tang2012deep] Tang, Yichuan, Salakhutdinov, Ruslan, Hinton, Geoffrey. (2012). Deep mixtures of factor analysers. arXiv preprint arXiv:1206.4635.

[chen2013deep] Chen, Bo, Polatkan, Gungor, Sapiro, Guillermo, Blei, David, Dunson, David, Carin, Lawrence. (2013). Deep learning with hierarchical convolutional factor analysis. IEEE Trans. PAMI.

[jordan1998learning] Jordan, M.I.. (1998). Learning in Graphical Models.

[wei2000fast] Wei, Li-Yi, Levoy, Marc. (2000). Fast texture synthesis using tree-structured vector quantization. Proceedings of the 27th annual conference on Computer graphics and interactive techniques.

[gersho2012vector] Gersho, A., Gray, R. M. (2012). Vector Quantization and Signal Compression.

[weinberger2009distance] Weinberger, Kilian Q, Saul, Lawrence K. (2009). Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research.

[salakhutdinov2007learning] Salakhutdinov, Ruslan, Hinton, Geoffrey E. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. International Conference on Artificial Intelligence and Statistics.

[mehta2014exact] Mehta, Pankaj, Schwab, David J. (2014). An exact mapping between the Variational Renormalization Group and Deep Learning. arXiv preprint arXiv:1410.3831.

[PoggioOnInvariance] Anselmi, F., Rosasco, L., Poggio, T.. (2015). On Invariance and Selectivity in Representation Learning. arXiv preprint arXiv:1503.05938.

[arora2013provable] Arora, S., Bhaskara, A., Ge, R., Ma, T.. (2013). Provable bounds for learning some deep representations. arXiv preprint arXiv:1310.6343.

[schroff2015facenet] Schroff, Florian, Kalenichenko, Dmitry, Philbin, James. (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering. arXiv preprint arXiv:1503.03832.

[salakhutdinov2010one] Salakhutdinov, Ruslan, Tenenbaum, Josh, Torralba, Antonio. (2010). One-shot learning with a hierarchical nonparametric bayesian model.

[breiman2001random] Breiman, Leo. (2001). Random forests. Machine learning.

[altland2010condensed] Altland, A., Simons, B.D.. (2010). Condensed Matter Field Theory.

[criminisi2013decision] Criminisi, A., Shotton, J.. (2013). Decision Forests for Computer Vision and Medical Image Analysis.

[bengio2013representation] Bengio, Y., Courville, A., Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell.

[goodfellow2013maxout] Goodfellow, Ian J, Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, Bengio, Yoshua. (2013). Maxout networks. arXiv preprint arXiv:1302.4389.

[anselmi2013unsupervised] Anselmi, Fabio, Leibo, Joel Z, Rosasco, Lorenzo, Mutch, Jim, Tacchetti, Andrea, Poggio, Tomaso. (2013). Unsupervised learning of invariant representations in hierarchical architectures. arXiv preprint arXiv:1311.4158.

[yamins2014performance] Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., DiCarlo, J. J.. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci..

[szegedy2014going] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.. (2014). Going deeper with convolutions. arXiv preprint arXiv:1409.4842.

[dahl2013improving] Dahl, George E, Sainath, Tara N, Hinton, Geoffrey E. (2013). Improving deep neural networks for LVCSR using rectified linear units and dropout. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.

[kemp2007learning] Kemp, Charles, Perfors, Amy, Tenenbaum, Joshua B. (2007). Learning overhypotheses with hierarchical Bayesian models. Developmental science.

[tenenbaum2011grow] Tenenbaum, Joshua B, Kemp, Charles, Griffiths, Thomas L, Goodman, Noah D. (2011). How to grow a mind: Statistics, structure, and abstraction. science.

[mnih2015human] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, others. (2015). Human-level control through deep reinforcement learning. Nature.

[ghahramani1996algorithm] Ghahramani, Z., Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers.

[van2014factoring] Van Den Oord, A., Schrauwen, B. (2014). Factoring Variations in Natural Images with Deep Gaussian Mixture Models. Proc. Adv. Neural Inf. Process. Syst. (NIPS'14).

[soatto2016visual] Soatto, S., Chiuso, A.. (2016). Visual Representations: Defining Properties and Deep Approximations. Proc. Int. Conf. Learn. Rep. (ICLR'16).

[pmlr-v49-cohen16] Nadav Cohen, Or Sharir, Amnon Shashua. (2016). On the Expressive Power of Deep Learning: A Tensor Analysis. 29th Annual Conference on Learning Theory.

[lu2017depth] Lu, Haihao, Kawaguchi, Kenji. (2017). Depth Creates No Bad Local Minima. arXiv preprint arXiv:1702.08580.

[soudry2017exponentially] Soudry, Daniel, Hoffer, Elad. (2017). Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777.

[zhang2016understanding] Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, Vinyals, Oriol. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

[shwartz2018representation] Shwartz-Ziv, Ravid, Painsky, Amichai, Tishby, Naftali. (2018). Representation compression and generalization in deep neural networks.

[shwartz2020information] Shwartz-Ziv, Ravid, Alemi, Alexander A. (2020). Information in infinite ensembles of infinitely-wide neural networks. Symposium on Advances in Approximate Bayesian Inference.

[shwartz2022information] Shwartz-Ziv, Ravid. (2022). Information Flow in Deep Neural Networks. arXiv preprint arXiv:2202.06749.

[rasmus2015semi] Rasmus, A., Berglund, M., Honkala, M., Valpola, H., Raiko, T.. (2015). Semi-Supervised Learning with Ladder Networks. Proc. Adv. Neural Inf. Process. Syst (NIPS'15).

[zhao2015swwae] Zhao, J., Mathieu, M., Goroshin, R., LeCun, Y.. (2016). Stacked What-Where Autoencoders. arXiv preprint arXiv:1506.02351.

[roweis2001learning] Roweis, Sam, Ghahramani, Zoubin. (2001). Learning nonlinear dynamical systems using the expectation--maximization algorithm. Kalman filtering and neural networks.

[ioffe2015batch] Ioffe, S., Szegedy, C.. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167.

[jordan2002discriminative] Ng, A., Jordan, M.. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. Advances in Neural Information Processing systems.

[rister2016piecewise] Rister, Blaine. (2016). Piecewise convexity of artificial neural networks. arXiv preprint arXiv:1607.04917.

[murphy2012machine] Murphy, Kevin P. (2012). Machine learning: a probabilistic perspective.

[luo2018cosine] Luo, Chunjie, Zhan, Jianfeng, Xue, Xiaohe, Wang, Lei, Ren, Rui, Yang, Qiang. (2018). Cosine normalization: Using cosine similarity instead of dot product in neural networks. International Conference on Artificial Neural Networks.

[harman2010decompositional] Harman, Radoslav, Lacko, Vladimír. (2010). On decompositional algorithms for uniform sampling from n-spheres and n-balls. Journal of Multivariate Analysis.

[voelker2017efficiently] Voelker, Aaron R, Gosmann, Jan, Stewart, Terrence C. (2017). Efficiently sampling vectors and coordinates from the n-sphere and n-ball.

[anton2013elementary] Anton, Howard, Rorres, Chris. (2013). Elementary Linear Algebra, Binder Ready Version: Applications Version.

[nielsen2016guaranteed] Nielsen, Frank, Sun, Ke. (2016). Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities. arXiv preprint arXiv:1606.05850.

[balestriero2018spline] Balestriero, Randall, Baraniuk, Richard. (2018). A Spline Theory of Deep Networks. Proc. ICML.

[boyd2004convex] Boyd, Stephen, Vandenberghe, Lieven. (2004). Convex optimization.

[bishop2007generative] Bishop, Christopher M, Lasserre, Julia, others. (2007). Generative or discriminative? getting the best of both worlds. Bayesian Statistics.

[sohl2010unsupervised] Sohl-Dickstein, Jascha, Wang, Jimmy C, Olshausen, Bruno A. (2010). An unsupervised algorithm for learning lie group transformations. arXiv preprint arXiv:1001.1027.

[michalski2014modeling] Michalski, Vincent, Memisevic, Roland, Konda, Kishore. (2014). Modeling sequential data using higher-order relational features and predictive training. arXiv preprint arXiv:1402.2333.

[miao2007learning] Miao, Xu, Rao, Rajesh PN. (2007). Learning the lie groups of visual invariance. Neural computation.

[pearl1988probabilistic] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kauffman Pub.

[aurenhammer1988geometric] Aurenhammer, Franz, Imai, Hiroshi. (1988). Geometric relations among Voronoi diagrams. Geometriae Dedicata.

[aurenhammer1991voronoi] Aurenhammer, Franz. (1991). Voronoi diagrams—a survey of a fundamental geometric data structure. ACM Computing Surveys (CSUR).

[preparata2012computational] Preparata, Franco P, Shamos, Michael I. (2012). Computational geometry: an introduction.

[rudin1976principles] Rudin, Walter, others. (1976). Principles of mathematical analysis.

[pach2011combinatorial] Pach, János, Agarwal, Pankaj K. (2011). Combinatorial geometry.

[quinlan1986induction] Quinlan, J. Ross. (1986). Induction of decision trees. Machine learning.

[kumar2009fast] Kumar, NSLP, Satoor, Sanjiv, Buck, Ian. (2009). Fast parallel expectation maximization for Gaussian mixture models on GPUs using CUDA. High Performance Computing and Communications, 2009. HPCC'09. 11th IEEE International Conference on.

[jordan2001graphical] Jordan, Michael Irwin, Sejnowski, Terrence Joseph. (2001). Graphical models: Foundations of neural computation.

[hintonMITVideo] Geoffrey Hinton. What's wrong with convolutional nets?.

[dong2017deep] Dong, Xiao, Wu, Jiasong, Zhou, Ling. (2017). How deep learning works--The geometry of deep learning. arXiv preprint arXiv:1710.10784.

[raghu2017expressive] Raghu, Maithra, Poole, Ben, Kleinberg, Jon, Ganguli, Surya, Sohl-Dickstein, Jascha. (2017). On the expressive power of deep neural networks. Proceedings of the 34th International Conference on Machine Learning-Volume 70.

[tropical] Liwen Zhang, Gregory Naitzat, Lek-Heng Lim. (2018). Tropical Geometry of Deep Neural Networks. CoRR.

[hintonVideo] Geoffrey Hinton. (2014). What's wrong with convolutional nets?.

[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, Jürgen. (1997). Long short-term memory. Neural computation.

[goodfellow2012large] Goodfellow, Ian, Courville, Aaron, Bengio, Yoshua. (2012). Large-scale feature learning with spike-and-slab sparse coding. arXiv preprint arXiv:1206.6407.

[hannun2014deepspeech] Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., others. (2014). DeepSpeech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.

[schmidhuber2015deep] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks.

[tikhonov2013numerical] Tikhonov, Andrey Nikolaevich, Goncharsky, AV, Stepanov, VV, Yagola, Anatoly G. (2013). Numerical methods for the solution of ill-posed problems.

[wolfdeepface] Wolf, Lior. DeepFace: Closing the Gap to Human-Level Performance in Face Verification.

[griffiths2004hierarchical] Blei, David M, Griffiths, Thomas L, Jordan, Michael I, Tenenbaum, Joshua B. (2004). Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing systems.

[lucke2012closed] Lücke, Jörg. (2012). Closed-form EM for sparse coding and its application to source separation. Latent Variable Analysis and Signal Separation.

[Saxe-Ganguli-dyn-lin-nn:2013tq] Saxe, A. M., McClelland, J. L., Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint.

[Saxe-Ganguli-hier-cat-dnn:2013vq] Saxe, A. M., McClelland, J. L., Ganguli, S. (2013). Learning hierarchical category structure in deep neural networks. Proc. Annu. Cog. Sci. Soc.

[kschischang2001factor] F. R. Kschischang, B. J. Frey, H. A. Loeliger. (2001). Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory.

[wilamowski2001algorithm] Wilamowski, Bogdan M, Iplikci, Serdar, Kaynak, Okyay, Efe, M. Önder. (2001). An algorithm for fast convergence in training neural networks. Proceedings of the international joint conference on neural networks.

[karklin2005hierarchical] Karklin, Yan, Lewicki, Michael S. (2005). A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural computation.

[pham2015study] Pham, Ngoc-Quan, Le, Hai-Son, Nguyen, Duc-Dung, Ngo, Truong-Giang. (2015). A Study of Feature Combination in Gesture Recognition with Kinect. Knowledge and Systems Engineering.

[hartley2003multiple] Hartley, Richard, Zisserman, Andrew. (2003). Multiple view geometry in computer vision.

[bishop2006pattern] Bishop, C. M. (2006). Pattern Recognition and Machine Learning.

[corduneanu2001variational] Corduneanu, Adrian, Bishop, Christopher M. (2001). Variational Bayesian model selection for mixture distributions. Artificial intelligence and Statistics.

[amari1993backpropagation] Amari, Shun-ichi. (1993). Backpropagation and stochastic gradient descent method. Neurocomputing.

[wainwright2008graphical] Wainwright, M. J., Jordan, M. I.. (2008). Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn..

[Schmid:1994:PTN:991886.991915] Schmid, H.. (1994). Part-of-speech Tagging with Neural Networks. Proc. Conf. Comput. Linguistics. doi:10.3115/991886.991915.

[salakhutdinov2013learning] Jin, Chi, Ge, Rong, Netrapalli, Praneeth, Kakade, Sham M, Jordan, Michael I. (2017). How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887.

[russakovsky2012attribute] Russakovsky, O., Fei-Fei, L.. (2012). Attribute learning in large-scale datasets. Trends and Topics in Computer Vision.

[russakovsky2015imagenet] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., others. (2015). Imagenet large scale visual recognition challenge. Int. J. Comput. Vision.

[ramachandran2017searching] Ramachandran, P., Zoph, B., Le, Q.. (2017). Searching for activation functions. ArXiv e-prints.

[Yuste:2004jm] Yuste, Rafael, Urban, Rochelle. (2004). Dendritic spines and linear networks. Journal of Physiology-Paris.

[Patel:un] Patel, Ankit B. Modeling and Inferring Cleavage Patterns in Proliferating Epithelia.

[Anonymous:cPTrEePs] DESYNC: Self-Organizing Desynchronization and TDMA on Wireless Sensor Networks. (2007).

[Patel:2007wn] Patel, Ankit B, Degesys, Julius, Nagpal, Radhika. (2007). Desynchronization: The Theory of Self-Organizing Algorithms for Round-Robin Scheduling.

[Charles:2013tp] Charles, Adam, Rozell, Christopher. (2013). Short Term Memory Capacity in Networks via the Restricted Isometry Property.

[Anonymous:2012wr] Dynamic Filtering of Sparse Signals using Reweighted ℓ1. (2012).

[Anonymous:2013uy] Visual Nonclassical Receptive Field Effects Emerge from Sparse Coding in a Dynamical System. (2013).

[Packer:2013gt] Packer, Adam M, Roska, Botond, Häusser, Michael. (2013). Targeting neurons and photons for optogenetics. Nature Publishing Group.

[krizhevsky_learning_2009] Alex Krizhevsky. (2009). Learning Multiple Layers of Features from Tiny Images.

[Dyer:2013ua] Dyer, Eva. (2013). Greedy Feature Selection for Subspace Clustering. Journal of Machine Learning Research.

[Yoon:2013hv] Yoon, KiJung, Buice, Michael A, Barry, Caswell, Hayman, Robin, Burgess, Neil, Fiete, Ila R. (2013). Specific evidence of low-dimensional continuous attractor dynamics in grid cells. Nature Publishing Group.

[Ramirez:2013bl] Ramirez, S, Liu, X, Lin, P A, Suh, J, Pignatelli, M, Redondo, R L, Ryan, T J, Tonegawa, S. (2013). Creating a False Memory in the Hippocampus. Science.

[Izhikevich:2003ul] Izhikevich, Eugene M. (2003). Which Model to Use for Cortical Spiking Neurons?. IEEE Trans. Neural Networks.

[Maglione:2013ia] Maglione, Marta, Sigrist, Stephan J. (2013). Seeing the forest tree by tree: super-resolution light microscopy meets the neurosciences. Nature Publishing Group.

[Sutherland:1998wn] Sutherland, Ivan. (1998). Technology and Courage.

[Rozell:2008wr] Rozell, Christopher, Johnson, Don, Baraniuk, Rich, Olshausen, Bruno. (2008). Sparse Coding via Thresholding and Local Competition in Neural Circuits. Neural Computation.

[Gordon:2012td] Gordon, Geoff, Tibshirani, Ryan. (2012). Generalized gradient descent.

[OLSHAUSEN:2004fw] Olshausen, B, Field, D. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology.

[Anselmi:2007ke] Anselmi, F., Mutch, J., Poggio, T. (2007). Magic Materials. Proc. Natl. Acad. Sci.

[Cadieu:2013wa] Cadieu, Charles, Yamins, Dan, DiCarlo, James. (2013). The Neural Representation Benchmark and its Evaluation on Brain and Machine. arXiv.

[Anonymous:mLLJA3aZ] High Frequency Stimulation of the Subthalamic Nucleus Eliminates Pathological Thalamic Rhythmicity in a Computational Model. (2004).

[DiCarlo:2012em] DiCarlo, James J, Zoccolan, Davide, Rust, Nicole C. (2012). How does the brain solve visual object recognition?. Neuron.

[Humphries:2012ju] Humphries, Mark D, Gurney, Kevin. (2012). Network effects of subthalamic deep brain stimulation drive a unique mixture of responses in basal ganglia output. European Journal of Neuroscience.

[Johnson:2005ha] Johnson, Jeffrey S, Olshausen, Bruno A. (2005). The recognition of partially visible natural objects in the presence and absence of their occluders. Vision Research.

[Buckner:2013fu] Buckner, Randy L, Krienen, Fenna M, Yeo, B T Thomas. (2013). Opportunities and limitations of intrinsic functional connectivity MRI. Nature Publishing Group.

[Keck:2012cb] Keck, C., Savin, C., Lücke, J. (2012). Feedforward Inhibition and Synaptic Scaling -- Two Sides of the Same Coin?. PLoS Computational Biology.

[Anonymous:21M5ylQ8] Unsupervised Learning of Translation Invariant Occlusive Components. (2012).

[Rozell:2013tv] Rozell, Christopher. (2013). Stable Manifold Embeddings with Structured Random Matrices.

[Carandini:2013dv] Carandini, Matteo, Churchland, Anne K. (2013). Probing perceptual decisions in rodents. Nature Publishing Group.

[Anonymous:oVbxcaph] Specular Surface Reconstruction from Sparse Reflection Correspondences. (2010).

[Sandoe:2013il] Sandoe, Jackson, Eggan, Kevin. (2013). Opportunities and challenges of pluripotent stem cell neurodegenerative disease models. Nature Publishing Group.

[Anonymous:2013cg] Focus on neurotechniques. Nature Publishing Group (2013).

[Otero:2013hh] Rey-Otero, Ives, Delbracio, Mauricio. (2013). The Anatomy of the SIFT Method.

[Anonymous:wJ0z1pAS] Learning Feature Representations with K-means. (2012).

[Raphael:2012ug] Raphael, Robert. (2012). IGERT: Neuroengineering: From Cells to Systems.

[Berens:2012fi] Berens, P, Ecker, A S, Cotton, R J, Ma, W J, Bethge, M, Tolias, A S. (2012). A Fast and Simple Population Code for Orientation in Primate V1. Journal of Neuroscience.

[Ma:2006bh] Ma, Wei Ji, Beck, Jeffrey M, Latham, Peter E, Pouget, Alexandre. (2006). Bayesian inference with probabilistic population codes. Nature Neuroscience.

[Ma:2013uk] Ma, Wei Ji. (2013). Population Vector Coding.

[Anonymous:S7HycmMg] Parallelized Stochastic Gradient Descent. (2010).

[Anonymous:FVKVV-yP] On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. (2001).

[Anonymous:OYKu-7Li] Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. (2007).

[Anonymous:9wEHQ3-F] Random Feature Maps for Dot Product Kernels. (2013).

[Rahimi:2007vq] Rahimi, Ali, Recht, Ben. (2007). Random Features for Large-Scale Kernel Machines.

[Anonymous:2011de] Perceptual and neural consequences of rapid motion adaptation. (2011).

[Boyd:2011bw] Boyd, Stephen. (2011). Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning.

[Dubois:2011dy] Dubois, Julien, VanRullen, Rufin. (2011). Visual Trails: Do the Doors of Perception Open Periodically?. PLoS Biology.

[Boyd:2011tq] Boyd, Stephen. (2011). Alternating Direction Method of Multipliers.

[Sokoliuk:2013hu] Sokoliuk, R, VanRullen, R. (2013). The Flickering Wheel Illusion: When Rhythms Make a Static Wheel Flicker. Journal of Neuroscience.

[Adibi:2013hq] Adibi, M, Clifford, C W G, Arabzadeh, E. (2013). Informational Basis of Sensory Adaptation: Entropy and Single-Spike Efficiency in Rat Barrel Cortex. Journal of Neuroscience.

[Saxe:2013up] Saxe, Andrew, McClelland, James, Ganguli, Surya. (2013). A Mathematical Theory of Semantic Development.

[Hinton:2010un] Hinton, Geoff. (2010). A Practical Guide to Training Restricted Boltzmann Machines.

[Anonymous:QF6Em5B4] Complete Discrete 2-D Gabor Transforms by Neural Networks for Image Analysis and Compression. (2004).

[Anonymous:J448A51u] Tutorial on Gabor Filters. (2008).

[Anonymous:puO477jp] Learning hierarchical category structure in deep neural networks. (2013).

[Zoran-Weiss:2013pr] Zoran, D., Weiss, Y.. (2012). Natural Images, Gaussian Mixtures and Dead Leaves. Proc. Adv. Neural Inf. Process. Syst. (NIPS'12).

[Helmstaedter:2014iv] Helmstaedter, Moritz, Briggman, Kevin L, Turaga, Srinivas C, Jain, Viren, Seung, H Sebastian, Denk, Winfried. (2014). {Connectomic reconstruction of the innerplexiform layer in the mouse retina. Nature.

[Anonymous:sBTrRq3Q] . {Controllable single photon stimulation of retinal rod cells. (2013).

[Weiss:2002id] Weiss, Yair, Simoncelli, Eero P, Adelson, Edward H. (2002). {Motion illusions as optimal percepts. Nature Neuroscience.

[Yamins:2013tp] Yamins, Dan, Hong, Ha, DiCarlo, James. (2013). {Key Features of Higher Visual Cortex Emerge in Behaviorally Optimized Neural Networks.

[wjma:2013ts] {wjma. (2013). {Relating back to behavior.

[Anonymous:E_1bFc4h] . {Kanizsa triangle. (2013).

[wjma:2013wj] {wjma. (2013). {Lecture 11 -- Probability and inference with neurons.

[wjma:2013tp] {wjma. (2013). {Complications.

[Krizhevsky:2012wl] Krizhevsky, A., Sutskever, I., Hinton, G.. (2012). {ImageNet Classification with Deep Convolutional Neural Networks. Proc. Adv. Neural Inf. Process. Syst (NIPS'12).

[wiskott2006does] Wiskott, Laurenz. (2006). How does our visual system achieve shift and size invariance. JL van Hemmen and TJ Sejnowski, editors.

[lecun1998gradient] LeCun, Yann, Bottou, L{'e. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.

[Carandini:2011fm] Carandini, Matteo, Heeger, David J. (2011). {Normalization as a canonical neural computation. Nature Reviews Neuroscience.

[Adibi:2013dd] Adibi, M, McDonald, J S, Clifford, C W G, Arabzadeh, E. (2013). {Adaptation Improves Neural Coding Efficiency Despite Increasing Correlations in Variability. Journal of Neuroscience.

[Anonymous:ly3rlGJy] . {Sparse Filtering. (2011).

[Cafaro:2011im] Cafaro, Jon, Rieke, Fred. (2011). {Noise correlations improve response fidelity and stimulus encoding. Nature.

[Ibbotson:2011jh] Ibbotson, Michael, Krekelberg, Bart. (2011). {Visual perception and saccadic eye movements. Current Opinion in Neurobiology.

[Kandel:2013cf] Kandel, Eric R, Markram, Henry, Matthews, Paul M, Yuste, Rafael, Koch, Christof. (2013). {Neuroscience thinks big (and collaboratively). Nature Reviews Neuroscience.

[Lacy:2013km] Lacy, Joyce W, Stark, Craig E L. (2013). {The neuroscience of memory: implications for the courtroom. Nature Reviews Neuroscience.

[BurgosArtizzu:2012ul] Burgos-Artizzu, Xavier. (2012). {Social behavior recognition in continuous video. Computer Vision and Pattern Recognition.

[Averbeck:2006ew] Averbeck, Bruno B, Latham, Peter E, Pouget, Alexandre. (2006). {Neural correlations, population coding and computation. Nature Reviews Neuroscience.

[Le:2011ts] Le, Quoc, Ng, Andrew. (2011). {Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. Computer Vision and Pattern Recognition.

[Adams:2006ti] Adams, Ryan, MacKay, David. (2006). {Bayesian Online Changepoint Detection.

[Salvator:2004wq] Salvator, Dave. (2004). {ExtremeTech 3D Pipeline Tutorial. PCMag.

[Deneve:2007by] Deneve, S, Duhamel, J R, Pouget, A. (2007). {Optimal Sensorimotor Integration in Recurrent Cortical Networks: A Neural Implementation of Kalman Filters. Journal of Neuroscience.

[Jordan:1999ti] Jordan, Michael, Ghahramani, Zoubin, Jaakkola, Tommi, Saul, Lawrence. (1999). {An Introduction to Variational Methods for Graphical Models. Machine Learning.

[Anonymous:OEEDCGDt] . {343263a0. (2002).

[Poggio:2013ju] Poggio, Tomaso, Ullman, Shimon. (2013). {Vision: are models of object recognition catching up with the brain?. Annals of the New York Academy of Sciences.

[Pinto:2009gu] Pinto, Nicolas, Doukhan, David, DiCarlo, James J, Cox, David D. (2009). {A High-Throughput Screening Approach to Discovering Good Forms of Biologically Inspired Visual Representation. PLoS Computational Biology.

[Dayan:2012kb] Dayan, Peter. (2012). {Twenty-Five Lessonsfrom Computational Neuromodulation. Neuron.

[Anonymous:MaG0r2vx] . {Beyond Simple Features: A Large-Scale Feature Search Approach to Unconstrained Face Recognition. (2011).

[Pinto:2008bo] Pinto, Nicolas, Cox, David D, DiCarlo, James J. (2008). {Why is Real-World Visual Object Recognition Hard?. PLoS Computational Biology.

[Zhu:2004ur] Zhu, Mengchen, Durand, Fredo, Rozell, Christopher. (2004). {MIT 6.837 - Ray Tracing.

[Pouget:2013gi] Pouget, Alexandre, Beck, Jeffrey M, Ma, Wei Ji, Latham, Peter E. (2013). {Probabilistic brains: knowns and unknowns. Nature Publishing Group.

[Thibodeau:2011je] Thibodeau, Paul, Boroditsky, Lera. (2011). {Metaphors We Think With: The Role of Metaphor in Reasoning. PLoS One.

[LaCamera:2008do] La Camera, Giancarlo, Richmond, Barry J. (2008). {Modeling the Violation of Reward Maximization and Invariance in Reinforcement Schedules. PLoS Computational Biology.

[Fetsch:2013ks] Fetsch, Christopher R, DeAngelis, Gregory C, Angelaki, Dora E. (2013). {Bridging the gap between theoriesof sensory cue integration and thephysiology of multisensory neurons.

[Hosoya:2005fu] Hosoya, Toshihiko, Baccus, Stephen A, Meister, Markus. (2005). {Dynamic predictive coding by the retina. Nature.

[Anonymous:uoHvrsjc] . {Perceptual filling in of artificially induced scotomas in human vision. (2001).

[Pitkow:2012dh] Pitkow, Xaq, Meister, Markus. (2012). {Decorrelation and efficient coding by retinal ganglion cells. Nature Publishing Group.

[Meister:2013tw] Meister, Markus. (2013). {Neural computation in sensory systems.

[n:2009ws] 000n, 376 377 000M 000a 000r 000t 000i. (2009). {Understanding the Rotating Snakes illusion.

[wjma:2013we] {wjma. (2013). {1/21/2013Bayesian modeling.

[Laurens:2013fy] Laurens, Jean, Meng, Hui, Angelaki, Dora E. (2013). {Computation of linear acceleration through an internal model in the macaque cerebellum. Nature Publishing Group.

[Watson:2003td] Watson, Andrew. (2003). {Real-world illumination and the perception of surface reflectance properties.

[Brainard:2011dr] Brainard, D H, Maloney, L T. (2011). {Surface color perception and equivalent illumination models. Journal of Vision.

[Fleming:2013jy] Fleming, R W, Wiebel, C, Gegenfurtner, K. (2013). {Perceptual qualities and material classes. Journal of Vision.

[vanderKooij:2011fa] van der Kooij, Katinka. (2011). {Perception of 3D slant out of the box.

[Ecker:2011bx] Ecker, A S, Berens, P, Tolias, A S, Bethge, M. (2011). {The Effect of Noise Correlations in Populations of Diversely Tuned Neurons. Journal of Neuroscience.

[wjma:2013wea] {wjma. (2013). {1/21/2013Bayesian modeling.

[Anonymous:W77SX1oQ] . {Homography Estimation. (2009).

[Anonymous:jScaT-4D] . {At Least at the Level of Inferior Temporal Cortex, the Stereo Correspondence Problem Is Solved. (2003).

[Murphy:2013eq] Murphy, A P, Ban, H, Welchman, A E. (2013). {Integration of texture and disparity cues to surface slant in dorsal visual cortex. Journal of Neurophysiology.

[Tsutsui:2002kr] Tsutsui, K I. (2002). {Neural Correlates for Perception of 3D Surface Orientation from Texture Gradient. Science.

[Anonymous:Le2AY_hs] . {A Bayesian Treatment of the Stereo Correspondence Problem Using Half-Occluded Regions. (2004).

[Savarese:2008us] Savarese, Silvio. (2008). {EECS 442 -- Computer visionStereo systems.

[Savarese:2008usa] Savarese, Silvio. (2008). {EECS 442 -- Computer visionStereo systems.

[Savarese:2008uq] Savarese, Silvio. (2008). {EECS 442 -- Computer visionEpipolar Geometry.

[Savarese:2008vc] Savarese, Silvio. (2008). {EECS 442 -- Computer visionSingle view metrology.

[Savarese:2008vw] Savarese, Silvio. (2008). {EECS 442 -- Computer visionCameras.

[Customer:2008vf] Customer, Preferred. (2008). {Course overview.

[Anonymous:bR8HbOTu] . {EECS 442 -- Computer Vision. (2008).

[Savarese:2009va] Savarese, Silvio. (2009). {EECS 442 -- Computer visionVolumetric stereo.

[Savarese:2008ur] Savarese, Silvio. (2008). {EECS 442 -- Computer visionShape from reflections.

[Savarese:2008up] Savarese, Silvio. (2008). {EECS 442 -- Computer vision Multiple view geometryAffine structure from Motion.

[Savarese:2008tu] Savarese, Silvio. (2008). {EECS 442 -- Computer vision Multiple view geometry.

[Savarese:2008tx] Savarese, Silvio. (2008). {EECS 442 -- Computer visionFitting methods.

[Savarese:2008wc] Savarese, Silvio. (2008). {EECS 442 -- Computer vision Radiometry.

[Savarese:2008wca] Savarese, Silvio. (2008). {EECS 442 -- Computer vision Radiometry.

[Li:2008ub] Li, Fei-Fei. (2008). {Natural Scene Classification inNatural Scene Classification in.

[FeiFie:2008tu] {Fei-Fie. (2008). {EECS 442 -- Computer vision.

[Anonymous:Dx4Xe0J_] . {3. The Junction Tree Algorithms. (2003).

[Anonymous:YzDnGnl9] . {Distances and affinities between measures. (2000).

[manfred:2006up] {manfred. (2006). {taipei4.

[Koolen:2012wk] Koolen, Wouter, Warmuth, Manfred. (2012). {Putting Bayes to sleep.

[FeiFie:2008tua] {Fei-Fie. (2008). {EECS 442 -- Computer vision.

[Savarese:2008wn] Savarese, Silvio. (2008). {Segmentation {&.

[Anonymous:Sn-7BTe2] . {20 years of learning about vision: Questions answered, questions unanswered, and questions not yet asked. (2012).

[Anonymous:Vwe7RZoh] . {Shape perception reduces activity in human primary visual cortex. (2002).

[Anonymous:QbT06TIM] . {Principles of Image Representation in Visual Cortex. (2005).

[Savarese:2008wu] Savarese, Silvio. (2008). {Recognition.

[Savarese:2008wua] Savarese, Silvio. (2008). {Recognition.

[Savarese:2008wub] Savarese, Silvio. (2008). {Recognition.

[Savarese:2009tu] Savarese, Silvio. (2009). {EECS 442 -- Computer visionOptical flow and tracking.

[Savarese:2008ut] Savarese, Silvio. (2008). {EECS 442 -- Computer vision Face Recognition.

[Anonymous:uKw8bset] . {Computer Vision: Algorithms and Applications. (2010).

[Anonymous:2012hj] . {Relative luminance and binocular disparity preferencesare correlated in macaque primary visual cortex,matching natural scene statistics. (2012).

[Sanada:2012hq] Sanada, T M, Nguyenkim, J D, DeAngelis, G C. (2012). {Representation of 3-D surface orientation by velocity and disparity gradient cues in area MT. Journal of Neurophysiology.

[Srivastava:2009ch] Srivastava, S, Orban, G A, De Maziere, P A, Janssen, P. (2009). {A Distinct Representation of Three-Dimensional Shape in Macaque Anterior Intraparietal Area: Fast, Metric, and Coarse. Journal of Neuroscience.

[Anonymous:9uZVlpuI] . {Stereopsis Activates V3A and Caudal Intraparietal Areas in Macaques and Humans. (2003).

[Nieder:2003kv] Nieder, Andreas. (2003). {Stereoscopic Vision: Solving the Correspondence Problem. Current Biology.

[Orban:2006fp] Orban, Guy A, Janssen, Peter, Vogels, Rufin. (2006). {Extracting 3D structure from disparity. Trends in Neurosciences.

[Kruger:gc] Kruger, Norbert, Janssen, Peter, Kalkan, Sinan, Lappe, Markus, Leonardis, Ales, Piater, Justus, Rodriguez-Sanchez, Antonio J, Wiskott, Laurenz. {Deep Hierarchies in the Primate Visual Cortex: What Can We Learn for Computer Vision?. IEEE Trans. PAMI.

[Anonymous:XBCH2ycA] . {10 Neuronal interactions and their role in solving the stereo correspondence problem. (2010).

[Tanabe:2011dx] Tanabe, S, Haefner, R M, Cumming, B G. (2011). {Suppressive Mechanisms in Monkey V1 Help to Solve the Stereo Correspondence Problem. Journal of Neuroscience.

[Howe:2005jb] Howe, P D L. (2005). {V1 Partially Solves the Stereo Aperture Problem. Cerebral Cortex.

[Read:2007gn] Read, Jenny C A, Cumming, Bruce G. (2007). {Sensors for impossible stimuli may solve the stereo correspondence problem. Nature Neuroscience.

[Jeyabalaratnam:2013fz] Jeyabalaratnam, Jeyadarshan, Bharmauria, Vishal, Bachatene, Lyes, Cattan, Sarah, Angers, Annie, Molotchnikoff, St{'e. (2013). {Adaptation Shifts Preferred Orientation of Tuning Curve in the Mouse Visual Cortex. PLoS One.

[Anonymous:L7ZAZoJb] . {gcp_stereo_cvpr11. (2013).

[Anonymous:Parcc-uC] . {Introduction -- a Tour of Multiple View Geometry. (2004).

[Anonymous:iIqqh1eh] . {MULTIPLE VIEW GEOMETRY. (2004).

[Searcy:1996vt] Searcy, J H, Bartlett, J C. (1996). {Inversion and processing of component and spatial-relational information in faces.. Journal of experimental psychology. Human perception and performance.

[Graves:2013wt] Graves, Alex, Mohamed, Abdel-rahman, Hinton, Geoffrey. (2013). {Speech recognition with deep recurrent neural networks.

[Hinton:em] Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, Kingsbury, Brian. {Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine.

[Zeiler:2013ux] Zeiler, Matthew D, Fergus, Rob. (2013). {Visualizing and Understanding Convolutional Neural Networks. arXiv preprint arXiv:1311.2901.

[Anonymous:GT9cUL3p] . {exact feature probabilities in images with occlusion. (2010).

[Anonymous:xLsGm21g] . {Compressive neural representation of sparse, high-dimensional probabilities. (2013).

[Anonymous:6l_wwPr_] . {Modeling image patches with a directed hierarchy of Markov random fields. (2008).

[Anonymous:a7uhOohM] . {RecurrentSamplingHelmholtz_Dayan. (1999).

[IEEE:2013wx] {IEEE. (2013). {A Pencil Balancing Robotusing a Pair of AER Dynamic Vision Sensors.

[kolchinsky2019nonlinear] Kolchinsky, Artemy, Tracey, Brendan D, Wolpert, David H. (2019). Nonlinear information bottleneck. Entropy.

[krizhevsky2009learning] Krizhevsky, Alex, Hinton, Geoffrey. (2009). Learning multiple layers of features from tiny images.

[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, St{'e. (2021). Barlow twins: Self-supervised learning via redundancy reduction. International Conference on Machine Learning.

[Zaidi:2012ff] Zaidi, Qasim, Ennis, Robert, Cao, Dingcai, Lee, Barry. (2012). {Neural Locus of Color Afterimages. Current Biology.

[Anonymous:rrYjDhO3] . {Integration. (2002).

[Anonymous:2012pp] . {Depth and Deblurring from a Spectrally-varying Depth-of-Field. (2012).

[Anonymous:2013bu] . {scatterometer. (2013).

[Levin:2013bd] Levin, Anat, Glasner, Daniel, Xiong, Ying, Durand, Fredo, Freeman, William, Matusik, Wojciech, Zickler, Todd. (2013). {Fabricating BRDFs at high spatial resolution using wave optics. ACM Trans. Graphics.

[Anonymous:8qQPTOkW] . {arXiv:1206.1428v1 [cs.GR] 7 Jun 2012. (2012).

[Anonymous:2013gf] . {Synthesizing cognition in neuromorphic electronic systems. (2013).

[Jones:2012fy] Jones, P W, Gabbiani, F. (2012). {Impact of neural noise on a sensory-motor pathway signaling impending collision. Journal of Neurophysiology.

[Benosman:2012dh] Benosman, Ryad, Ieng, Sio-Hoi, Clercq, Charles, Bartolozzi, Chiara, Srinivasan, Mandyam. (2012). {Neural Networks. Neural Networks.

[Roska:2006fj] Roska, B. (2006). {Parallel Processing in Retinal Ganglion Cells: How Integration of Space-Time Patterns of Excitation and Inhibition Form the Spiking Output. Journal of Neurophysiology.

[Lichtsteiner:bm] Lichtsteiner, Patrick, Posch, Christoph, Delbruck, Tobi. {A 128$\times$ 128 120 dB 15 $\mu$s Latency Asynchronous Temporal Contrast Vision Sensor. IEEE Journal of Solid-State Circuits.

[Bialek:1990ce] Bialek, W, Owen, W G. (1990). {Temporalfiltering. Biophysical Journal.

[Anonymous:GXtE_twh] . {Local Illumination. (2004).

[Anonymous:SMGtXmKz] . {The Graphics Pipeline: Projective Transformations. (2004).

[jovan:2004vg] {jovan. (2004). {Conventional Animation.

[jovan:2004uj] {jovan. (2004). {Computer Animation II.

[jovan:2004wd] {jovan. (2004). {Computer Animation III.

[Anonymous:9iTr4Vho] . {projective. (1998).

[Abbott:2000wh] Abbott, Larry. (2000). {Theoretical Neuroscience Computational and Mathematical Modeling of Neural Systems - Peter Dayan, L. F. Abbott.

[Anonymous:5F9KVaoE] . {Kogo{&. (2013).

[Anonymous:WiFH6Vnp] . {Coding of Border Ownership in Monkey Visual Cortex. (2000).

[mdf:2011wq] {mdf. (2011). {THECOLOR CURIOSITY SHOP.

[Anonymous:leK42DDc] . {COLOR IS NOT A METRIC SPACE. (2013).

[Anonymous:Jy1FKFoA] . {Deriving Appearance Scales. (2012).

[mdf:2011wh] {mdf. (2011). {Brightness, Lightness, and Specifying Color in High-Dynamic-Range Scenes and Images.

[Anonymous:ieDds7qq] . {Number of discernible object colors is a conundrum. (2013).

[felzenszwalb2006efficient] Felzenszwalb, Pedro F, Huttenlocher, Daniel P. (2006). Efficient belief propagation for early vision. International journal of computer vision.

[Hartley2004] Hartley, R.~I., Zisserman, A.. (2004). Multiple View Geometry in Computer Vision.

[Anonymous:GxRPIp0i] . {2101911. (2010).

[tomg-admm] Taylor, Gavin, Burmeister, Ryan, Xu, Zheng, Singh, Bharat, Patel, Ankit, Goldstein, Tom. (2016). Training Neural Networks Without Gradients: A Scalable ADMM Approach. arXiv preprint arXiv:1605.02026.

[Anonymous:XCFYGa7M] . {Statistical Estimation, Optimization and Computation-Risk Tradeoffsin Data Analysis. (2013).

[vapnik1998statistical] Vapnik, Vladimir Naumovich, Vapnik, Vlamimir. (1998). Statistical learning theory.

[rifai2011contractive] Rifai, Salah, Vincent, Pascal, Muller, Xavier, Glorot, Xavier, Bengio, Yoshua. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. Proceedings of the 28th international conference on machine learning (ICML-11).

[rifai2011manifold] Rifai, Salah, Dauphin, Yann N, Vincent, Pascal, Bengio, Yoshua, Muller, Xavier. (2011). The manifold tangent classifier. Advances in Neural Information Processing Systems.

[lee2013pseudo] Lee, Dong-Hyun. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on Challenges in Representation Learning, ICML.

[makhzani2015winner] Makhzani, Alireza, Frey, Brendan J. (2015). Winner-Take-All Autoencoders. Advances in Neural Information Processing Systems.

[kingma2014semi] Kingma, Diederik P, Mohamed, Shakir, Rezende, Danilo Jimenez, Welling, Max. (2014). Semi-supervised learning with deep generative models. Advances in Neural Information Processing Systems.

[wakin2005multiscale] Wakin, M. B., Donoho, D. L., Choi, H., Baraniuk, R. G.. (2005). The multiscale structure of non-differentiable image manifolds. Proc. Int. Soc. Optical Eng..

[goodfellow2014generative] Goodfellow, I. J, Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.. (2014). Generative adversarial nets. Proc. NIPS.

[papyan2017convolutional] Papyan, Vardan, Romano, Yaniv, Elad, Michael. (2017). Convolutional Neural Networks Analyzed via Convolutional Sparse Coding. Journal of Machine Learning Research.

[srivastava2015training] Srivastava, Rupesh K, Greff, Klaus, Schmidhuber, J{. (2015). Training very deep networks. Advances in Neural Information Processing systems.

[chen2016infogan] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.. (2016). Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems.

[kingma2013auto] Kingma, Diederik P, Welling, Max. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[poole2016exponential] Poole, Ben, Lahiri, Subhaneil, Raghu, Maithreyi, Sohl-Dickstein, Jascha, Ganguli, Surya. (2016). Exponential expressivity in deep neural networks through transient chaos. Advances In Neural Information Processing Systems.

[chen2011multiscale] Chen, G., Maggioni, M.. (2011). Multiscale geometric dictionaries for point-cloud data. Proc. Sampling Theory and Applications (SampTA).

[donoho2005image] Donoho, D. L., Grimes, C.. (2005). Image manifolds which are isometric to Euclidean space. J. Math. Imaging Vision.

[ziv2013long] Wiatowski, Thomas, B{. (2015). A mathematical theory of deep convolutional neural networks for feature extraction. arXiv preprint arXiv:1512.06293.

[rubin2010theory] Xiong, H. Y., Alipanahi, B., Lee, L. J., Bretschneider, H., Merico, D., Yuen, R. K. C., Hua, Y., Gueroussov, S., Najafabadi, H. S., Hughes, T. R., Morris, Q., Barash, Y., Krainer, A. R., Jojic, N., Scherer, S. W., Blencowe, B. J., Frey, B. J.. (2015). The human splicing code reveals new insights into the genetic determinants of disease. Science. doi:10.1126/science.1254806.

[serre2007feedforward] M. Pilanci, M. J. Wainwright. (2015). Randomized sketches of convex programs with sharp guarantees. IEEE Trans. Info. Theory.

[PilWai16a] M. Pilanci, M. J. Wainwright. Iterative {H. J. Mach. Learn. Res..

[WaiJor08] M. J. Wainwright, M. I. Jordan. (2008). Graphical models, exponential families and variational inference. Found. Tren. Mach. Learn..

[HasTibWai15] T. Hastie, R. Tibshirani, M. J. Wainwright. (2015). Statistical {L.

[LohWai15] P. Loh, M. J. Wainwright. Regularized {M. J. Mach. Learn. Res..

[Wai14a] M. J. Wainwright. Structured regularizers: Statistical and computational issues. Annu. Rev. Stat. Appl..

[PilWaiElg15] M. Pilanci, M. J. Wainwright, L. {E. Sparse learning via {B. Math. Program.. doi:10.1007/s10107-015-0894-1.

[SchWaiYu15] G. Schiebinger, M. J. Wainwright, B. Yu. (2015). The geometry of kernelized spectral clustering. Ann. Stat..

[alpha-go] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., others. (2016). Mastering the game of Go with deep neural networks and tree search. Nature.

[shashua-cvpr-keynote] A. Shashua. (2016). Autonomous Driving, Computer Vision and Machine Learning.

[godeepnips] Patel, A., Nguyen, T., Baraniuk, R.. (2016). A Probabilistic Framework for Deep Learning. Proc. Adv. Neural Inf. Process. Syst. (NIPS'16).

[lensfree16] V. Boominathan, J. K. Adams, M. S. Asif, B. W. Avants, J. T. Robinson, R. G. Baraniuk, A. C. Sankaranarayanan, A. Veeraraghavan. (2016). Lensless Imaging: A computational renaissance. IEEE Signal Process. Mag.. doi:10.1109/MSP.2016.2581921.

[lensfree17] Szeliski, R.. (2006). Locally adapted hierarchical basis preconditioning. IEEE Trans. Comput. Imag.. doi:10.1109/TCI.2016.2593662.

[huang1999statistics] Huang, J., Mumford, D.. (1999). Statistics of natural images and models. Proc. IEEE Conf. Comp. Vision Pat. Recog. (CVPR'99).

[lee2003nonlinear] Lee, A.~B., Pedersen, K.~S., Mumford, David. (2003). The nonlinear statistics of high-contrast patches in natural images. Intl. J. Comp. Vision.

[li2009towards] Li, L.J., Socher, R., Fei-Fei, L.. (2009). Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. Proc. IEEE Conf. Comp. Vision Pattern Recog. (CVPR'09).

[li2010object] Li, L.J., Su, H., Fei-Fei, L., Xing, E.P.. (2010). Object bank: A high-level image representation for scene classification & semantic feature sparsification. Proc. Adv. Neural Info. Process. Syst. (NIPS'10).

[yao2012codebook] Yao, B., Bradski, G., Fei-Fei, L.. (2012). A codebook-free and annotation-free approach for fine-grained image categorization. Proc. IEEE Conf. Com. Vision and Pattern Recog. (CVPR'12).

[carin1] Chen, M., Silva, J., Paisley, J., Wang, C., Dunson, D., Carin, L.. (2010). Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Trans. Signal Process..

[gregor2013deep] Gregor, Karol, Danihelka, Ivo, Mnih, Andriy, Blundell, Charles, Wierstra, Daan. (2013). Deep autoregressive networks. arXiv preprint arXiv:1310.8499.

[patel2016probabilistic] Patel, Ankit B, Nguyen, Tan, Baraniuk, Richard G. (2016). A Probabilistic Framework for Deep Learning. NIPS.

[salimans2016improved] Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, Chen, Xi. (2016). Improved techniques for training gans. arXiv preprint arXiv:1606.03498.

[springenberg2015unsupervised] Springenberg, Jost Tobias. (2015). Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks. arXiv preprint arXiv:1511.06390.

[miyato2015distributional] Miyato, Takeru, Maeda, Shin-ichi, Koyama, Masanori, Nakae, Ken, Ishii, Shin. (2015). Distributional smoothing by virtual adversarial examples. arXiv preprint arXiv:1507.00677.

[maaloe2016auxiliary] Maal{\o. (2016). Auxiliary Deep Generative Models. arXiv preprint arXiv:1602.05473.

[springenberg2014striving] Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, Riedmiller, Martin. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.

[wei2017early] Wei, Yuting, Yang, Fanny, Wainwright, Martin J. (2017). Early stopping for kernel boosting algorithms: A general analysis with localized complexities. arXiv preprint arXiv:1707.01543.

[achille2017emergence] Achille, Alessandro, Soatto, Stefano. (2017). Emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350.

[Wai17book] M. J. Wainwright. (2017). High-dimensional statistics: A non-asymptotic view.

[nishikawa1998accurate] Nishikawa, Hiroaki. (1998). Accurate Piecewise Linear Continuous Approximations to One-Dimensional Curves: Error Estimates and Algorithms.

[Yedidia01] J. S. Yedidia, W. T. Freeman, Y. Weiss. (2001). Generalized belief propagation. NIPS 13.

[SonJaa07a] D. Sontag, T. Jaakkola. (2007). New outer bounds on the marginal polytope. Neural Information Processing Systems.

[MelGloWei09] T. Meltzer, A. Globerson, Y. Weiss. (2009). Convergent message-passing algorithms: {A. Uncertainty in Artificial Intelligence.

[KolTik59] A. N. Kolmogorov, B. Tikhomirov. (1959). $\epsilon$-entropy and $\epsilon$-capacity of sets in functional spaces. Uspekhi Mat. Nauk..

[YanBar99] Y. Yang, A. Barron. (1999). Information-theoretic determination of minimax rates of convergence. annstat.

[Yu] B. Yu. (1996). Assouad, {F. Research Papers in Probability and Statistics: Festschrift in Honor of Lucien Le Cam.

[zhang2016convexified] Zhang, Yuchen, Liang, Percy, Wainwright, Martin J. (2016). Convexified convolutional neural networks. arXiv preprint arXiv:1609.01000.

[tishby2015deep] Tishby, Naftali, Zaslavsky, Noga. (2015). Deep learning and the information bottleneck principle. Information Theory Workshop (ITW), 2015 IEEE.

[hinton1997modeling] Hinton, Geoffrey E, Dayan, Peter, Revow, Michael. (1997). Modeling the manifolds of images of handwritten digits. IEEE transactions on Neural Networks.

[simard1993efficient] Simard, Patrice, LeCun, Yann, Denker, John S. (1993). Efficient pattern recognition using a new transformation distance. Advances in Neural Information Processing systems.

[belkin2003laplacian] Belkin, Mikhail, Niyogi, Partha. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation.

[zhou2009hierarchical] Zhou, Xi, Cui, Na, Li, Zhen, Liang, Feng, Huang, Thomas S. (2009). Hierarchical gaussianization for image classification. Computer Vision, 2009 IEEE 12th International Conference on.

[simonyan2014very] Simonyan, Karen, Zisserman, Andrew. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[gatys2015neural] Gatys, Leon A, Ecker, Alexander S, Bethge, Matthias. (2015). A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.

[tian2017deeptest] Tian, Yuchi, Pei, Kexin, Jana, Suman, Ray, Baishakhi. (2017). DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars. arXiv preprint arXiv:1708.08559.

[edX] . Discrete Time Signals and Systems. ().

[goodman2016european] Goodman, Bryce, Flaxman, Seth. (2016). European Union regulations on algorithmic decision-making and a. arXiv preprint arXiv:1606.08813.

[rust2010selectivity] Rust, Nicole C, DiCarlo, James J. (2010). Selectivity and tolerance both increase as visual information propagates from cortical area V4 to IT. Journal of Neuroscience.

[coifman1992entropy] Coifman, Ronald R, Wickerhauser, M Victor. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on information theory.

[tropp2004greed] Tropp, Joel A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information theory.

[hopfield1985neural] Hopfield, John J, Tank, David W. (1985). “Neural” computation of decisions in optimization problems. Biological cybernetics.

[hannah2013multivariate] Hannah, L.~A., Dunson, D.~B.. (2013). Multivariate convex regression with adaptive partitioning. J. Mach. Learn. Res..

[breiman1993hinging] Breiman, Leo. (1993). Hinging hyperplanes for regression, classification, and function approximation. IEEE Transactions on Information Theory.

[magnani2009convex] Magnani, Alessandro, Boyd, Stephen P. (2009). Convex piecewise-linear fitting. Optim. Eng..

[cybenko1989approximation] Cybenko, George. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS).

[meyer1993algorithms] Meyer, Yves. (1993). Algorithms and applications. SIAM, philadelphia.

[hornik1989multilayer] Hornik, Kurt, Stinchcombe, Maxwell, White, Halbert. (1989). Multilayer feedforward networks are universal approximators. Neural networks.

[raj2016local] Raj, Anant, Kumar, Abhishek, Mroueh, Youssef, Fletcher, P Thomas, others. (2016). Local Group Invariant Representations via Orbit Embeddings. arXiv preprint arXiv:1612.01988.

[marcos2016rotation] Marcos, Diego, Volpi, Michele, Komodakis, Nikos, Tuia, Devis. (2016). Rotation equivariant vector field networks. arXiv preprint arXiv:1612.09346.

[cooijmans2016recurrent] Cooijmans, Tim, Ballas, Nicolas, Laurent, C{'e. (2016). Recurrent batch normalization. arXiv preprint arXiv:1603.09025.

[glorot2010understanding] Glorot, X., Bengio, Y.. (2010). Understanding the difficulty of training deep feedforward neural networks. Proc. 13th Int. Conf. AI Statist..

[anden2014deep] And{'e. (2014). Deep scattering spectrum. IEEE Transactions on Signal Processing.

[sifre2013rotation] Sifre, Laurent, Mallat, St{'e. (2013). Rotation, scaling and deformation invariant scattering for texture discrimination. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[li2005perceptron] Li, Ling. (2005). Perceptron learning with random coordinate descent.

[nesterov2012efficiency] Nesterov, Yu. (2012). Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization.

[garnett2007image] Garnett, John B, Le, Triet M, Meyer, Yves, Vese, Luminita A. (2007). Image decompositions using bounded variation and generalized homogeneous Besov spaces. Applied and Computational Harmonic Analysis.

[choi2004multiple] Choi, Hyeokho, Baraniuk, Richard G. (2004). Multiple wavelet basis image denoising using Besov ball projections. IEEE Signal Processing Letters.

[hecht1988theory] Hecht-Nielsen, Robert, others. (1988). Theory of the backpropagation neural network.. Neural Networks.

[balle2014learning] Ball{'e. (2014). Learning sparse filter bank transforms with convolutional ICA. Image Processing (ICIP), 2014 IEEE International Conference on.

[mallat1999wavelet] Mallat, St{'e. (1999). A wavelet tour of signal processing.

[bastien2012theano] Bastien, Fr{'e. (2012). Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590.

[puthawala2020globally] Puthawala, Michael, Kothari, Konik, Lassas, Matti, Dokmani{'c. (2020). Globally Injective ReLU Networks. arXiv preprint arXiv:2006.08464.

[lucas2018using] Lucas, Alice, Iliadis, Michael, Molina, Rafael, Katsaggelos, Aggelos K. (2018). Using deep neural networks for inverse problems in imaging: beyond analytical methods. IEEE Signal Processing Magazine.

[rudin1964principles] Rudin, Walter, others. (1964). Principles of mathematical analysis.

[schumaker2007spline] Schumaker, Larry. (2007). Spline functions: basic theory.

[choromanska2015loss] Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Ben Arous, Gérard, LeCun, Yann. (2015). The Loss Surfaces of Multilayer Networks. AISTATS.

[donoho1995noising] Donoho, David L. (1995). De-noising by soft-thresholding. IEEE transactions on information theory.

[zhang2014entropy] Zhang, Lin. (2014). Entropy, stochastic matrices, and quantum operations. Linear and Multilinear Algebra.

[guggenheimer1977applicable] Guggenheimer, Heinrich Walter. (1977). Applicable geometry: global and local convexity.

[lloyd1982least] Lloyd, Stuart. (1982). Least squares quantization in PCM. IEEE transactions on information theory.

[kuurkova1992kolmogorov] Kůrková, Věra. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks.

[jayaraman2009digital] Jayaraman, S, Esakkirajan, S, Veerakumar, T. (2009). Digital Image Processing. Tata McGraw-Hill.

[srivastava2014understanding] Srivastava, R.~K., Masci, J., Gomez, F., Schmidhuber, J.. (2014). Understanding locally competitive networks. arXiv preprint arXiv:1410.1165.

[henaff2014local] Hénaff, Olivier J., Simoncelli, Eero P. (2014). The local low-dimensionality of natural images. arXiv preprint arXiv:1412.6626.

[mathieu2016disentangling] Mathieu, Michael F, Zhao, Junbo Jake, Zhao, Junbo, Ramesh, Aditya, Sprechmann, Pablo, LeCun, Yann. (2016). Disentangling factors of variation in deep representation using adversarial training. Advances in Neural Information Processing Systems.

[larsson2016fractalnet] Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory. (2016). Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648.

[lee2015generalizing] Lee, Chen-Yu, Gallagher, Patrick W, Tu, Zhuowen. (2015). Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. arXiv e-prints.

[ding2005equivalence] Ding, Chris, He, Xiaofeng, Simon, Horst D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. Proceedings of the 2005 SIAM International Conference on Data Mining.

[tieleman2012lecture] Tieleman, Tijmen, Hinton, Geoffrey. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning.

[mairal2009online] Mairal, Julien, Bach, Francis, Ponce, Jean, Sapiro, Guillermo. (2009). Online dictionary learning for sparse coding. Proceedings of the 26th annual international conference on machine learning.

[jiang2011learning] Jiang, Zhuolin, Lin, Zhe, Davis, Larry S. (2011). Learning a discriminative dictionary for sparse coding via label consistent K-SVD. Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.

[balestriero2017multiscale] Balestriero, Randall. (2017). Multiscale Residual Mixture of PCA: Dynamic Dictionaries for Optimal Basis Learning. arXiv preprint arXiv:1707.05840.

[lecun1995learning] LeCun, Yann, Jackel, LD, Bottou, Léon, others. (1995). Learning algorithms for classification: A comparison on handwritten digit recognition. Neural networks: the statistical mechanics perspective.

[lecun2015lenet] LeCun, Yann, others. (2015). LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet.

[rumelhart1988learning] Rumelhart, David E, Hinton, Geoffrey E, Williams, Ronald J, others. (1988). Learning representations by back-propagating errors. Cognitive modeling.

[bengio2013advances] Bengio, Yoshua, Boulanger-Lewandowski, Nicolas, Pascanu, Razvan. (2013). Advances in optimizing recurrent networks. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.

[zeiler2012adadelta] Zeiler, Matthew D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

[kingma2014adam] Kingma, Diederik P, Ba, Jimmy. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[reed2014learning] Reed, Scott, Sohn, Kihyuk, Zhang, Yuting, Lee, Honglak. (2014). Learning to disentangle factors of variation with manifold interaction. Proceedings of the 31st International Conference on Machine Learning (ICML-14).

[rennie2014deep] Rennie, Steven J, Goel, Vaibhava, Thomas, Samuel. (2014). Deep order statistic networks. Spoken Language Technology Workshop (SLT), 2014 IEEE.

[lee2015deeply] Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick W, Zhang, Zhengyou, Tu, Zhuowen. (2015). Deeply-Supervised Nets. AISTATS.

[li2019understanding] Li, Xiang, Chen, Shuo, Hu, Xiaolin, Yang, Jian. (2019). Understanding the disharmony between dropout and batch normalization by variance shift. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[bakir2004learning] Bakır, Gökhan H., Weston, Jason, Schölkopf, Bernhard. (2004). Learning to find pre-images. Advances in Neural Information Processing Systems.

[comon1994independent] Comon, Pierre. (1994). Independent component analysis, a new concept?. Signal Processing.

[hyvarinen2016unsupervised] Hyvarinen, Aapo, Morioka, Hiroshi. (2016). Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in Neural Information Processing Systems.

[schmidhuber1992learning] Schmidhuber, Jürgen. (1992). Learning factorial codes by predictability minimization. Neural Computation.

[rosenblatt1956remarks] Rosenblatt, Murray. (1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics.

[sajjadi2018assessing] Sajjadi, Mehdi SM, Bachem, Olivier, Lucic, Mario, Bousquet, Olivier, Gelly, Sylvain. (2018). Assessing generative models via precision and recall. arXiv preprint arXiv:1806.00035.

[munkres2014topology] Munkres, James. (2014). Topology.

[karras2019style] Karras, Tero, Laine, Samuli, Aila, Timo. (2019). A style-based generator architecture for generative adversarial networks. Proc. CVPR.

[gong2019autogan] Gong, Xinyu, Chang, Shiyu, Jiang, Yifan, Wang, Zhangyang. (2019). Autogan: Neural architecture search for generative adversarial networks. Proceedings of the IEEE International Conference on Computer Vision.

[stewart1973error] Stewart, Gilbert W. (1973). Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM review.

[locatello2018challenging] Locatello, Francesco, Bauer, Stefan, Lucic, Mario, Rätsch, Gunnar, Gelly, Sylvain, Schölkopf, Bernhard, Bachem, Olivier. (2018). Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359.

[tompson2015efficient] Tompson, Jonathan, Goroshin, Ross, Jain, Arjun, LeCun, Yann, Bregler, Christoph. (2015). Efficient object localization using convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[lin2013network] Lin, Min, Chen, Qiang, Yan, Shuicheng. (2013). Network in network. arXiv preprint arXiv:1312.4400.

[blot2016max] Blot, Michael, Cord, Matthieu, Thome, Nicolas. (2016). Max-min convolutional neural networks for image classification. Image Processing (ICIP), 2016 IEEE International Conference on.

[shang2016understanding] Shang, Wenling, Sohn, Kihyuk, Almeida, Diogo, Lee, Honglak. (2016). Understanding and improving convolutional neural networks via concatenated rectified linear units. Proceedings of the International Conference on Machine Learning (ICML).

[targ2016resnet] Targ, Sasha, Almeida, Diogo, Lyman, Kevin. (2016). Resnet in Resnet: generalizing residual architectures. arXiv preprint arXiv:1603.08029.

[szegedy2016inception] Szegedy, Christian, Ioffe, Sergey, Vanhoucke, Vincent, Alemi, Alex. (2016). Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261.

[graham2014fractional] Graham, Benjamin. (2014). Fractional max-pooling. arXiv preprint arXiv:1412.6071.

[masnadi2009design] Masnadi-Shirazi, Hamed, Vasconcelos, Nuno. (2009). On the design of loss functions for classification: theory, robustness to outliers, and savageboost. Advances in Neural Information Processing systems.

[zeiler2013stochastic] Zeiler, Matthew D, Fergus, Rob. (2013). Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557.

[malinowski2013learnable] Malinowski, Mateusz, Fritz, Mario. (2013). Learnable pooling regions for image classification. arXiv preprint arXiv:1301.3516.

[chung2014empirical] Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, Bengio, Yoshua. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

[cho2014learning] Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, Bengio, Yoshua. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[unser2018representer] Unser, Michael. (2018). A representer theorem for deep neural networks. arXiv preprint arXiv:1802.09210.

[jones1992simple] Jones, Lee K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. The annals of Statistics.

[szegedy2017inception] Szegedy, Christian, Ioffe, Sergey, Vanhoucke, Vincent, Alemi, Alexander A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning.. AAAI.

[barron1993universal] Barron, Andrew R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory.

[rosasco2004loss] Rosasco, Lorenzo, De Vito, Ernesto, Caponnetto, Andrea, Piana, Michele, Verri, Alessandro. (2004). Are loss functions all the same?. Neural Computation.

[mallat2008wavelet] Mallat, Stephane. (2008). A wavelet tour of signal processing: the sparse way.

[berger1994removing] Berger, Jonathan, Coifman, Ronald R, Goldberg, Maxim J. (1994). Removing noise from music using local trigonometric bases and wavelet packets. Journal of the Audio Engineering Society.

[tikk2003survey] Tikk, Domonkos, Kóczy, László T., Gedeon, Tamás D. (2003). A survey on universal approximation and its limits in soft computing techniques. International Journal of Approximate Reasoning.

[srivastava2014dropout] Srivastava, Nitish, Hinton, Geoffrey E, Krizhevsky, Alex, Sutskever, Ilya, Salakhutdinov, Ruslan. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.

[tikhomirov1991representation] Tikhomirov, VM. (1991). On the Representation of Continuous Functions of Several Variables as Superpositions of Continuous Functions of a Smaller Number of Variables. Selected Works of AN Kolmogorov.

[duchi2011adaptive] Duchi, John, Hazan, Elad, Singer, Yoram. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.

[matsuoka1992noise] Matsuoka, Kiyotoshi. (1992). Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics.

[bishop2008training] Bishop, Chris M. (2008). Training with noise is equivalent to Tikhonov regularization. Training.

[wager2013dropout] Wager, Stefan, Wang, Sida, Liang, Percy S. (2013). Dropout training as adaptive regularization. Advances in Neural Information Processing systems.

[bajcsy1989multiresolution] Bajcsy, Ruzena, Kovačič, Stane. (1989). Multiresolution elastic matching. Computer Vision, Graphics, and Image Processing.

[zhang1997face] Zhang, Jun, Yan, Yong, Lades, Martin. (1997). Face recognition: eigenface, elastic matching, and neural nets. Proceedings of the IEEE.

[dieleman2015rotation] Dieleman, Sander, Willett, Kyle W, Dambre, Joni. (2015). Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly notices of the royal astronomical society.

[bastani2016measuring] Bastani, Osbert, Ioannou, Yani, Lampropoulos, Leonidas, Vytiniotis, Dimitrios, Nori, Aditya, Criminisi, Antonio. (2016). Measuring neural net robustness with constraints. Advances In Neural Information Processing Systems.

[blumer1987occam] Blumer, Anselm, Ehrenfeucht, Andrzej, Haussler, David, Warmuth, Manfred K. (1987). Occam's razor. Information processing letters.

[gal2016dropout] Gal, Yarin, Ghahramani, Zoubin. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. international conference on machine learning.

[li2016whiteout] Li, Yinan, Xu, Ruoyi, Liu, Fang. (2016). Whiteout: Gaussian Adaptive Regularization Noise in Deep Neural Networks. arXiv preprint arXiv:1612.01490.

[schrijver1998theory] Schrijver, Alexander. (1998). Theory of linear and integer programming.

[de1978practical] De Boor, Carl. (1978). A practical guide to splines.

[green1993nonparametric] Green, Peter J, Silverman, Bernard W. (1993). Nonparametric regression and generalized linear models: a roughness penalty approach.

[balestriero2018hard] Balestriero, Randall, Baraniuk, Richard G. (2018). From Hard to Soft: Understanding Deep Network Nonlinearities via Vector Quantization and Statistical Inference. arXiv preprint arXiv:1810.09274.

[gu2013smoothing] Gu, Chong. (2013). Smoothing spline ANOVA models.

[wang2011smoothing] Wang, Yuedong. (2011). Smoothing splines: methods and applications.

[yin2008noisy] Yin, Junsong, Hu, Dewen, Zhou, Zongtan. (2008). Noisy manifold learning using neighborhood smoothing embedding. Pattern Recognition Letters.

[park2004local] Park, JinHyeong, Zhang, Zhenyue, Zha, Hongyuan, Kasturi, Rangachar. (2004). Local smoothing for manifold learning. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[tjeng2017evaluating] Tjeng, Vincent, Xiao, Kai, Tedrake, Russ. (2017). Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356.

[nalisnick2015scale] Nalisnick, Eric, Anandkumar, Anima, Smyth, Padhraic. (2015). A scale mixture perspective of multiplicative noise in neural networks. arXiv preprint arXiv:1506.03208.

[devries2017dataset] DeVries, Terrance, Taylor, Graham W. (2017). Dataset Augmentation in Feature Space. arXiv preprint arXiv:1702.05538.

[bengio2011deep] Bengio, Yoshua, Bergeron, Arnaud, Boulanger-Lewandowski, Nicolas, Breuel, Thomas, Chherawala, Youssouf, Cisse, Moustapha, Erhan, Dumitru, Eustache, Jeremy, Glorot, Xavier, Muller, Xavier, others. (2011). Deep learners benefit more from out-of-distribution examples. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.

[vapnik1992principles] Vapnik, Vladimir. (1992). Principles of risk minimization for learning theory. Advances in Neural Information Processing systems.

[guyon1992structural] Guyon, Isabelle, Vapnik, Vladimir, Boser, Bernhard, Bottou, Leon, Solla, Sara A. (1992). Structural risk minimization for character recognition. Advances in Neural Information Processing systems.

[moody1994architecture] Moody, John, Utans, Joachim. (1994). Architecture selection strategies for neural networks: Application to corporate bond rating prediction. Neural networks in the capital markets.

[wolpert1994bayesian] Wolpert, David H. (1994). Bayesian backpropagation over io functions rather than weights. Advances in Neural Information Processing systems.

[williams1995bayesian] Williams, Peter M. (1995). Bayesian regularization and pruning using a Laplace prior. Neural computation.

[hochreiter1995simplifying] Hochreiter, Sepp, Schmidhuber, Jürgen. (1995). Simplifying neural nets by discovering flat minima. Advances in Neural Information Processing systems.

[schmidhuber1994discovering] Schmidhuber, Jürgen. (1994). Discovering problem solutions with low Kolmogorov complexity and high generalization capability. Machine Learning: Proceedings of the Twelfth International Conference.

[plaut1986experiments] Plaut, David C, others. (1986). Experiments on Learning by Back Propagation..

[hinton1987learning] Hinton, Geoffrey E. (1987). Learning translation invariant recognition in a massively parallel networks. International Conference on Parallel Architectures and Languages Europe.

[mackay1996bayesian] MacKay, David JC. (1996). Bayesian methods for backpropagation networks. Models of neural networks III.

[hinton1986learning] Hinton, Geoffrey E. (1986). Learning distributed representations of concepts. Proceedings of the eighth annual conference of the cognitive science society.

[weigend1990predicting] Weigend, Andreas S, Huberman, Bernardo A, Rumelhart, David E. (1990). Predicting the future: A connectionist approach. International journal of neural systems.

[morgan1990generalization] Morgan, Nelson, Bourlard, Hervé. (1990). Generalization and parameter estimation in feedforward nets: Some experiments. Advances in Neural Information Processing systems.

[yann1987modeles] LeCun, Yann. (1987). Modèles connexionnistes de l'apprentissage (PhD thesis).

[lecun1989generalization] LeCun, Yann, others. (1989). Generalization and network design strategies. Connectionism in perspective.

[lang1990time] Lang, Kevin J, Waibel, Alex H, Hinton, Geoffrey E. (1990). A time-delay neural network architecture for isolated word recognition. Neural networks.

[rumelhart1986parallel] Rumelhart, David E, Mcclelland, James L. (1986). Parallel distributed processing: Explorations in the microstructure of cognition: Foundations (Parallel distributed processing).

[nowlan1992simplifying] Nowlan, Steven J, Hinton, Geoffrey E. (1992). Simplifying neural networks by soft weight-sharing. Neural computation.

[hinton93keeping] Hinton, Geoffrey E, van Camp, Drew. (1993). Keeping neural networks simple by minimising the description length of weights. Proceedings of COLT-93.

[memisevic2014zero] Memisevic, Roland, Krueger, David. (2014). Zero-bias autoencoders and the benefits of co-adapting features. stat.

[murray1993synaptic] Murray, Alan F, Edwards, Peter J. (1993). Synaptic weight noise during MLP learning enhances fault-tolerance, generalization and learning trajectory. Advances in Neural Information Processing systems.

[valiant1984theory] Valiant, Leslie G. (1984). A theory of the learnable. Communications of the ACM.

[zeiler2010deconvolutional] Zeiler, Matthew D, Krishnan, Dilip, Taylor, Graham W, Fergus, Rob. (2010). Deconvolutional networks. Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.

[mallat2016understanding] Mallat, Stéphane. (2016). Understanding deep convolutional networks. Phil. Trans. R. Soc. A.

[jaderberg2015spatial] Jaderberg, Max, Simonyan, Karen, Zisserman, Andrew, others. (2015). Spatial transformer networks. Advances in Neural Information Processing Systems.

[biernacki2000assessing] Biernacki, C., Celeux, G., Govaert, G.. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell..

[graves2013speech] Graves, Alex, Mohamed, Abdel-rahman, Hinton, Geoffrey. (2013). Speech recognition with deep recurrent neural networks. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.

[burr1981elastic] Burr, David J. (1981). Elastic matching of line drawings. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[uchida2005survey] Uchida, Seiichi, Sakoe, Hiroaki. (2005). A survey of elastic matching techniques for handwritten character recognition. IEICE transactions on information and systems.

[korman2013fast] Korman, Simon, Reichman, Daniel, Tsur, Gilad, Avidan, Shai. (2013). Fast-match: Fast affine template matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[kim2007grayscale] Kim, Hae Yong, de Araújo, Sidnei Alves. (2007). Grayscale template-matching invariant to rotation, scale, translation, brightness and contrast. Pacific-Rim Symposium on Image and Video Technology.

[murthy1994system] Murthy, Sreerama K., Kasif, Simon, Salzberg, Steven. (1994). A system for induction of oblique decision trees. Journal of artificial intelligence research.

[rao1999learning] Rao, Rajesh PN, Ruderman, Daniel L. (1999). Learning Lie groups for invariant visual perception. Advances in Neural Information Processing systems.

[hubel1962receptive] Hubel, David H, Wiesel, Torsten N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of physiology.

[feng2015learning] Feng, Jiashi, Darrell, Trevor. (2015). Learning the structure of deep convolutional networks. Proceedings of the IEEE International Conference on Computer Vision.

[spitzer1985complex] Spitzer, HEDVA, Hochstein, SHAUL. (1985). A complex-cell receptive-field model. Journal of Neurophysiology.

[grimes2005bilinear] Grimes, David B, Rao, Rajesh PN. (2005). Bilinear sparse coding for invariant vision. Neural computation.

[foldiak1991learning] Földiák, Peter. (1991). Learning invariance from transformation sequences. Neural Computation.

[kaudererquantifying] Kauderer-Abrams, Eric. Quantifying Translation-Invariance in Convolutional Neural Networks.

[xu2014scale] Xu, Yichong, Xiao, Tianjun, Zhang, Jiaxing, Yang, Kuiyuan, Zhang, Zheng. (2014). Scale-Invariant Convolutional Neural Networks. arXiv preprint arXiv:1411.6369.

[marcos2016learning] Marcos, Diego, Volpi, Michele, Tuia, Devis. (2016). Learning rotation invariant convolutional filters for texture classification. arXiv preprint arXiv:1604.06720.

[2016arXiv160407143B] Biau, Gérard, Scornet, Erwan, Welbl, Johannes. (2016). Neural Random Forests. arXiv preprint arXiv:1604.07143.

[verma2009spatial] Verma, Nakul, Kpotufe, Samory, Dasgupta, Sanjoy. (2009). Which spatial partition trees are adaptive to intrinsic dimension?. Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence.

[sproull1991refinements] Sproull, Robert F. (1991). Refinements to nearest-neighbor searching ink-dimensional trees. Algorithmica.

[schneidman2002analyzing] Schneidman, Elad, Slonim, Noam, Tishby, Naftali, deRuyter van Steveninck, R, Bialek, William. (2002). Analyzing neural codes using the information bottleneck method. Advances in Neural Information Processing systems.

[barlow2001exploitation] Barlow, Horace. (2001). The exploitation of regularities in the environment by the brain. Behavioral and Brain Sciences.

[chunjie2017cosine] Chunjie, Luo, Qiang, Yang, others. (2017). Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks. arXiv preprint arXiv:1702.05870.

[powell1981approximation] Powell, Michael James David. (1981). Approximation theory and methods.

[grimes2003probabilistic] Grimes, David B, Shon, Aaron P, Rao, Rajesh PN. (2003). Probabilistic bilinear models for appearance-based vision.

[grimes2003bilinear] Grimes, David B, Rao, Rajesh PN. (2003). A bilinear model for sparse coding. Advances in Neural Information Processing systems.

[tenenbaum1997separating] Tenenbaum, Joshua B, Freeman, William T. (1997). Separating style and content. Advances in Neural Information Processing systems.

[agostinelli2014learning] Agostinelli, Forest, Hoffman, Matthew, Sadowski, Peter, Baldi, Pierre. (2014). Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830.

[friedman1991multivariate] Friedman, Jerome H. (1991). Multivariate adaptive regression splines. The annals of statistics.

[barlow1981ferrier] Barlow, Horace B. (1981). The ferrier lecture, 1980: Critical limiting factors in the design of the eye and visual cortex. Proceedings of the Royal Society of London B: Biological Sciences.

[strouse2016deterministic] Strouse, DJ, Schwab, David J. (2016). The deterministic information bottleneck. arXiv preprint arXiv:1604.00268.

[fukushima1980neocognitron] Fukushima, Kunihiko. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics.

[raghu2016expressive] Raghu, Maithra, Poole, Ben, Kleinberg, Jon, Ganguli, Surya, Sohl-Dickstein, Jascha. (2016). On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336.

[keskar2016large] Keskar, Nitish Shirish, Mudigere, Dheevatsa, Nocedal, Jorge, Smelyanskiy, Mikhail, Tang, Ping Tak Peter. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.

[hoffer2015deep] Hoffer, Elad, Ailon, Nir. (2015). Deep metric learning using triplet network. International Workshop on Similarity-Based Pattern Recognition.

[taigman2014deepface] Taigman, Yaniv, Yang, Ming, Ranzato, Marc'Aurelio, Wolf, Lior. (2014). Deepface: Closing the gap to human-level performance in face verification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[krishnan1999extracting] Krishnan, R, Sivakumar, G, Bhattacharya, P. (1999). Extracting decision trees from trained neural networks. Pattern Recognition.

[craven1996extracting] Craven, Mark W. (1996). Extracting comprehensible models from trained neural networks.

[craven1994using] Craven, Mark, Shavlik, Jude W. (1994). Using sampling and queries to extract rules from trained neural networks.. ICML.

[kamruzzaman2010rule] Kamruzzaman, SM, Hasan, Ahmed Ryadh. (2010). Rule Extraction using Artificial Neural Networks. arXiv preprint arXiv:1009.4984.

[towell1993extracting] Towell, Geoffrey G, Shavlik, Jude W. (1993). Extracting refined rules from knowledge-based neural networks. Machine learning.

[quinlan1994comparing] Quinlan, John Ross. (1994). Comparing connectionist and symbolic learning methods. Computational Learning Theory and Natural Learning Systems: Constraints and Prospects.

[fu1994rule] Fu, LiMin. (1994). Rule generation from neural networks. IEEE Transactions on Systems, Man, and Cybernetics.

[bengio2007greedy] Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, Larochelle, Hugo. (2007). Greedy layer-wise training of deep networks. Advances in neural information processing systems.

[lecun1998mnist] LeCun, Yann. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[netzer2011reading] Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, Ng, Andrew Y. (2011). Reading digits in natural images with unsupervised feature learning. NIPS workshop on deep learning and unsupervised feature learning.

[weston2012deep] Weston, Jason, Ratle, Frédéric, Mobahi, Hossein, Collobert, Ronan. (2012). Deep learning via semi-supervised embedding. Neural Networks: Tricks of the Trade.

[abadi2016tensorflow] Abadi, Martín, others. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

[agarap2018deep] Agarap, A. F. (2018). Deep Learning using Rectified Linear Units (ReLU). arXiv preprint arXiv:1803.08375.

[graves2005framewise] Graves, Alex, Schmidhuber, Jürgen. (2005). Framewise phoneme classification with bidirectional LSTM networks. Neural Networks, 2005. IJCNN'05. Proceedings. 2005 IEEE International Joint Conference on.

[boyd1992defeating] Boyd, John P. (1992). Defeating the Runge phenomenon for equispaced polynomial interpolation via Tikhonov regularization. Applied Mathematics Letters.

[boyd2009divergence] Boyd, John P, Xu, Fei. (2009). Divergence (Runge phenomenon) for least-squares polynomial approximation on an equispaced grid and Mock--Chebyshev subset interpolation. Applied Mathematics and Computation.

[pena2000multivariate] Peña, J. M., Sauer, Thomas. (2000). On the multivariate Horner scheme. SIAM Journal on Numerical Analysis.

[de2015exploration] de Brébisson, Alexandre, Vincent, Pascal. (2015). An exploration of softmax alternatives belonging to the spherical loss family. arXiv preprint arXiv:1511.05042.

[veit2016residual] Veit, Andreas, Wilber, Michael J, Belongie, Serge. (2016). Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems.

[de1983approximation] de Boor, Carl, DeVore, Ron. (1983). Approximation by smooth multivariate splines. Transactions of the American Mathematical Society.

[nowozin2016f] Nowozin, Sebastian, Cseke, Botond, Tomioka, Ryota. (2016). f-gan: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing systems.

[dziugaite2015training] Dziugaite, Gintare Karolina, Roy, Daniel M, Ghahramani, Zoubin. (2015). Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906.

[arjovsky2017wasserstein] Arjovsky, Martin, Chintala, Soumith, Bottou, Léon. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.

[gan2017triangle] Gan, Zhe, Chen, Liqun, Wang, Weiyao, Pu, Yuchen, Zhang, Yizhe, Liu, Hao, Li, Chunyuan, Carin, Lawrence. (2017). Triangle generative adversarial networks. Advances in Neural Information Processing Systems.

[angles2018generative] Angles, Tomás, Mallat, Stéphane. (2018). Generative networks as inverse problems with scattering transforms. arXiv preprint arXiv:1805.06621.

[zhao2016energy] Zhao, Junbo, Mathieu, Michael, LeCun, Yann. (2016). Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126.

[roth2017stabilizing] Roth, Kevin, Lucchi, Aurelien, Nowozin, Sebastian, Hofmann, Thomas. (2017). Stabilizing training of generative adversarial networks through regularization. Advances in Neural Information Processing systems.

[li2017towards] Li, Jerry, Madry, Aleksander, Peebles, John, Schmidt, Ludwig. (2017). Towards understanding the dynamics of generative adversarial networks. arXiv preprint arXiv:1706.09884.

[liu2017approximation] Liu, Shuang, Bousquet, Olivier, Chaudhuri, Kamalika. (2017). Approximation and convergence properties of generative adversarial learning. Proc. NeurIPS.

[zhang2017discrimination] Zhang, Pengchuan, Liu, Qiang, Zhou, Dengyong, Xu, Tao, He, Xiaodong. (2017). On the discrimination-generalization tradeoff in GANs. arXiv preprint arXiv:1711.02771.

[arjovsky1701towards] Arjovsky, Martin, Bottou, Léon. (2017). Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.

[rifai2011higher] Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., Glorot, X.. (2011). Higher order contractive auto-encoder. Joint European Conference on Machine Learning and Knowledge Discovery in Databases.

[miao1992principal] Miao, Jianming, Ben-Israel, Adi. (1992). On principal angles between subspaces in Rn. Linear Algebra Appl.

[deng2020low] Deng, Tingquan, Ye, Dongsheng, Ma, Rong, Fujita, Hamido, Xiong, Lvnan. (2020). Low-rank local tangent space embedding for subspace clustering. Information Sciences.

[ma2010local] Ma, Li, Crawford, Melba M, Tian, Jinwen. (2010). Local manifold learning-based $ k $-nearest-neighbor for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing.

[vincent2008extracting] Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P. A.. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th international conference on Machine learning.

[teng2019invertible] Teng, Y., Choromanska, A.. (2019). Invertible Autoencoder for Domain Adaptation. Computation.

[chongxuan2017triple] Chongxuan, LI, Xu, Taufik, Zhu, Jun, Zhang, Bo. (2017). Triple generative adversarial nets. Advances in Neural Information Processing systems.

[khayatkhoei2018disconnected] Khayatkhoei, Mahyar, Singh, Maneesh K, Elgammal, Ahmed. (2018). Disconnected manifold learning for generative adversarial networks. Advances in Neural Information Processing Systems.

[tanielian2020learning] Tanielian, Ugo, Issenhuth, Thibaut, Dohmatob, Elvis, Mary, Jeremie. (2020). Learning disconnected manifolds: a no GANs land. arXiv preprint arXiv:2006.04596.

[durugkar2016generative] Durugkar, Ishan, Gemp, Ian, Mahadevan, Sridhar. (2017). Generative multi-adversarial networks. Proc. ICLR.

[ghosh2018multi] Ghosh, Arnab, Kulharia, Viveka, Namboodiri, Vinay P, Torr, Philip HS, Dokania, Puneet K. (2018). Multi-agent diverse generative adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[yang2019diversitysensitive] Dingdong Yang, Seunghoon Hong, Yunseok Jang, Tianchen Zhao, Honglak Lee. (2019). Diversity-Sensitive Conditional Generative Adversarial Networks. arXiv preprint arXiv:1901.09024.

[kodali2017convergence] Kodali, Naveen, Abernethy, Jacob, Hays, James, Kira, Zsolt. (2017). On convergence and stability of gans. arXiv preprint arXiv:1705.07215.

[fabius2014variational] Fabius, Otto, van Amersfoort, Joost R. (2014). Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.

[van2017neural] van den Oord, Aaron, Vinyals, Oriol, others. (2017). Neural discrete representation learning. Proc. NeurIPS.

[roy2018theory] Roy, Aurko, Vaswani, Ashish, Neelakantan, Arvind, Parmar, Niki. (2018). Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063.

[rezende2015variational] Rezende, Danilo Jimenez, Mohamed, Shakir. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

[dinh2019rad] Dinh, Laurent, Sohl-Dickstein, Jascha, Pascanu, Razvan, Larochelle, Hugo. (2019). A RAD approach to deep mixture models. arXiv preprint arXiv:1903.07714.

[grathwohl2018ffjord] Grathwohl, Will, Chen, Ricky TQ, Betterncourt, Jesse, Sutskever, Ilya, Duvenaud, David. (2018). Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367.

[dinh2014nice] Dinh, Laurent, Krueger, David, Bengio, Yoshua. (2014). Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.

[kingma2018glow] Kingma, Diederik P, Dhariwal, Prafulla. (2018). Glow: Generative flow with invertible 1x1 convolutions. Proc. NeurIPS.

[meyer2000matrix] Meyer, Carl D. (2000). Matrix analysis and applied linear algebra.

[dinh2016density] Dinh, Laurent, Sohl-Dickstein, Jascha, Bengio, Samy. (2017). Density estimation using real NVP. Proc. ICLR.

[andrsterr2019perturbation] Helena Andrés-Terré, Pietro Lió. (2019). Perturbation theory approach to study the latent space degeneracy of Variational Autoencoders. arXiv preprint arXiv:1907.05267.

[srivastava2017veegan] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U. Gutmann, Charles Sutton. (2017). VEEGAN: Reducing Mode Collapse in GANs using Implicit Variational Learning.

[dieng2019prescribed] Adji B. Dieng, Francisco J. R. Ruiz, David M. Blei, Michalis K. Titsias. (2019). Prescribed Generative Adversarial Networks.

[biau2018some] Biau, Gérard, Cadre, Benoît, Sangnier, Maxime, Tanielian, Ugo. (2018). Some theoretical properties of GANs. arXiv preprint arXiv:1803.07819.

[boyd2010six] Boyd, John P. (2010). Six strategies for defeating the Runge Phenomenon in Gaussian radial basis functions on a finite interval. Computers & Mathematics with Applications.

[gorski2007biconvex] Gorski, Jochen, Pfeuffer, Frank, Klamroth, Kathrin. (2007). Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research.

[gu2006manifold] Gu, Xianfeng, He, Ying, Qin, Hong. (2006). Manifold splines. Graphical Models.

[xu2015block] Xu, Yangyang, Yin, Wotao. (2015). Block stochastic gradient iteration for convex and nonconvex optimization. SIAM Journal on Optimization.

[bezhaev1988splines] Bezhaev, A Yu. (1988). Splines on manifolds. Russian Journal of Numerical Analysis and Mathematical Modelling.

[gu2014towards] Gu, Shixiang, Rigazio, Luca. (2014). Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068.

[lyu2015unified] Lyu, Chunchuan, Huang, Kaizhu, Liang, Hai-Ning. (2015). A unified gradient regularization family for adversarial examples. Data Mining (ICDM), 2015 IEEE International Conference on.

[shaham2015understanding] Shaham, Uri, Yamada, Yutaro, Negahban, Sahand. (2015). Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization. arXiv preprint arXiv:1511.05432.

[fawzi2015analysis] Fawzi, Alhussein, Fawzi, Omar, Frossard, Pascal. (2015). Analysis of classifiers' robustness to adversarial perturbations. arXiv preprint arXiv:1502.02590.

[carlini2016defensive] Carlini, Nicholas, Wagner, David. (2016). Defensive distillation is not robust to adversarial examples. arXiv preprint.

[papernot2016distillation] Papernot, Nicolas, McDaniel, Patrick, Wu, Xi, Jha, Somesh, Swami, Ananthram. (2016). Distillation as a defense to adversarial perturbations against deep neural networks. Security and Privacy (SP), 2016 IEEE Symposium on.

[tang2013deep] Tang, Yichuan. (2013). Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239.

[shen2017disciplined] Shen, Xinyue, Diamond, Steven, Udell, Madeleine, Gu, Yuantao, Boyd, Stephen. (2017). Disciplined multi-convex programming. Control And Decision Conference (CCDC), 2017 29th Chinese.

[atteia1989spline] Atteia, M, Benbourhim, MN. (1989). Spline elastic manifolds. Mathematical methods in computer aided geometric design.

[savel1995splines] Savel'ev, Il'ya Vasil'evich. (1995). Splines and manifolds. Russian Mathematical Surveys.

[hofer2004energy] Hofer, Michael, Pottmann, Helmut. (2004). Energy-minimizing splines in manifolds. ACM Transactions on Graphics (TOG).

[chui1988multivariate] Chui, Charles K. (1988). Multivariate splines.

[bergstra2010theano] Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, Bengio, Yoshua. (2010). Theano: A CPU and GPU math compiler in Python. Proc. 9th Python in Science Conf.

[afriat1957orthogonal] Afriat, Sidney N. (1957). Orthogonal and oblique projectors and the characteristics of pairs of vector spaces. Mathematical Proceedings of the Cambridge Philosophical Society.

[bjorck1973numerical] Björck, Åke, Golub, Gene H. (1973). Numerical Methods for Computing Angles Between Linear Subspaces. Mathematics of Computation.

[streubel2013representation] Streubel, Tom, Griewank, Andreas, Radons, Manuel, Bernt, Jens-Uwe. (2013). Representation and analysis of piecewise linear functions in abs-normal form. IFIP Conference on System Modeling and Optimization.

[qi1993nonsmooth] Qi, Liqun, Sun, Jie. (1993). A nonsmooth version of Newton's method. Mathematical programming.

[qi1998nonsmooth] Qi, Liqun, Sun, Defeng. (1998). Nonsmooth equations and smoothing Newton methods. Applied Mathematics Report AMR.

[courant1937differential] Courant, Richard, McShane, Edward James. (1937). Differential and integral calculus.

[absil2006largest] Absil, P-A, Edelman, Alan, Koev, Plamen. (2006). On the largest principal angle between random subspaces. Linear Algebra and its Applications.

[weinstein2000almost] Weinstein, Alan. (2000). Almost invariant submanifolds for compact group actions. Journal of the European Mathematical Society.

[cheney2009linear] Cheney, Ward, Kincaid, David. (2009). Linear algebra: Theory and applications. The Australian Mathematical Society.

[schoenberg1964interpolation] Schoenberg, Isaac J. (1964). On interpolation by spline functions and its minimal properties. On Approximation Theory.

[reinsch1967smoothing] Reinsch, Christian H. (1967). Smoothing by spline functions. Numerische mathematik.

[bloor1990representing] Bloor, Malcolm IG, Wilson, Michael J. (1990). Representing PDE surfaces in terms of B-splines. Computer-Aided Design.

[smith1985numerical] Smith, Gordon D. (1985). Numerical solution of partial differential equations: finite difference methods.

[cheney1980approximation] Cheney, Elliott Ward. (1980). Approximation theory III.

[graves2013generating] Graves, Alex. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

[wang1999inverse] Wang, Genyuan, Bao, Zheng. (1999). Inverse synthetic aperture radar imaging of maneuvering targets based on chirplet decomposition. Optical Engineering.

[brock2016neural] Brock, A., Lim, T., Ritchie, J.~M., Weston, N.. (2016). Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093.

[huang2017orthogonal] Huang, L., Liu, X., Lang, B., Yu, A. W., Wang, Y., Li, B.. (2017). Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. arXiv preprint arXiv:1709.06079.

[martinez2021permute] Martinez, Julieta, Shewakramani, Jashan, Liu, Ting Wei, Bârsan, Ioan Andrei, Zeng, Wenyuan, Urtasun, Raquel. (2021). Permute, quantize, and fine-tune: Efficient compression of neural networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[shwartz2017compression] Shwartz-Ziv, Ravid, Tishby, Naftali. (2017). Compression of Deep Neural Networks via Information. arXiv preprint arXiv:1703.00810.

[shwartz2022pre] Shwartz-Ziv, Ravid, Goldblum, Micah, Souri, Hossein, Kapoor, Sanyam, Zhu, Chen, LeCun, Yann, Wilson, Andrew Gordon. (2022). Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors. arXiv preprint arXiv:2205.10279.

[flandrin2001time] Flandrin, Patrick. (2001). Time frequency and chirps. Aerospace/Defense Sensing, Simulation, and Controls.

[fan2001generalized] Fan, Jianqing, Zhang, Chunming, Zhang, Jian. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. Annals of statistics.

[zeitouni1992generalized] Zeitouni, Ofer, Ziv, Jacob, Merhav, Neri. (1992). When is the generalized likelihood ratio test optimal?. IEEE Transactions on Information Theory.

[boissonnat2006curved] Boissonnat, Jean-Daniel, Wormser, Camille, Yvinec, Mariette. (2006). Curved voronoi diagrams. Effective Computational Geometry for Curves and Surfaces.

[edelsbrunner2012algorithms] Edelsbrunner, Herbert. (2012). Algorithms in combinatorial geometry.

[aurenhammer1987power] Aurenhammer, Franz. (1987). Power diagrams: properties, algorithms and applications. SIAM Journal on Computing.

[Reference1] Achille, Alessandro, Rovere, Matteo, Soatto, Stefano. (2017). Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856.

[largemarginib] Tsai, Yao-Hung Hubert, Wu, Yue, Salakhutdinov, Ruslan, Morency, Louis-Philippe. (2020). Self-supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576.

[dubois2021lossy] Dubois, Yann, Bloem-Reddy, Benjamin, Ullrich, Karen, Maddison, Chris J. (2021). Lossy compression for lossless prediction. Advances in Neural Information Processing Systems.

[wu2018unsupervised] Wu, Zhirong, Xiong, Yuanjun, Yu, Stella, Lin, Dahua. (2018). Unsupervised feature learning via non-parametric instance-level discrimination. arXiv preprint arXiv:1805.01978.

[he2020momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[kahana2022contrastive] Kahana, Jonathan, Hoshen, Yedid. (2022). A Contrastive Objective for Learning Disentangled Representations. arXiv preprint arXiv:2203.11284.

[tian2020makes] Tian, Yonglong, Sun, Chen, Poole, Ben, Krishnan, Dilip, Schmid, Cordelia, Isola, Phillip. (2020). What makes for good views for contrastive learning?. Advances in Neural Information Processing Systems.

[zimmermann2021contrastive] Zimmermann, Roland S, Sharma, Yash, Schneider, Steffen, Bethge, Matthias, Brendel, Wieland. (2021). Contrastive learning inverts the data generating process. International Conference on Machine Learning.

[lee2021compressive] Lee, Kuang-Huei, Arnab, Anurag, Guadarrama, Sergio, Canny, John, Fischer, Ian. (2021). Compressive visual representations. Advances in Neural Information Processing Systems.

[wang2020understanding] Wang, Tongzhou, Isola, Phillip. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. International Conference on Machine Learning.

[fefferman2016testing] Fefferman, Charles, Mitter, Sanjoy, Narayanan, Hariharan. (2016). Testing the manifold hypothesis. Journal of the American Mathematical Society.

[fischer2020conditional] Fischer, Ian. (2020). The conditional entropy bottleneck. Entropy.

[lee2021predicting] Lee, Jason D, Lei, Qi, Saunshi, Nikunj, Zhuo, Jiacheng. (2021). Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems.

[arora2019theoretical] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.

[Reference3] Li, Yingming, Yang, Ming, Zhang, Zhongfei. (2018). A survey of multi-view representation learning. IEEE Transactions on Knowledge and Data Engineering.

[donahue2015long] Donahue, Jeffrey, Anne Hendricks, Lisa, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, Darrell, Trevor. (2015). Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE conference on computer vision and pattern recognition.

[mao2014deep] Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, Huang, Zhiheng, Yuille, Alan. (2014). Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632.

[bachman2019learning] Bachman, Philip, Hjelm, R Devon, Buchwalter, William. (2019). Learning representations by maximizing mutual information across views. Advances in neural information processing systems.

[federici2020learning] Federici, Marco, Dutta, Anjan, Forré, Patrick, Kushman, Nate, Akata, Zeynep. (2020). Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017.

[tian2020contrastive] Tian, Yonglong, Krishnan, Dilip, Isola, Phillip. (2020). Contrastive multiview coding. European conference on computer vision.

[tschannen2019mutual] Tschannen, Michael, Djolonga, Josip, Rubenstein, Paul K, Gelly, Sylvain, Lucic, Mario. (2019). On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625.

[darlow2020information] Darlow, Luke Nicholas, Storkey, Amos. (2020). What Information Does a ResNet Compress?. arXiv preprint arXiv:2003.06254.

[deepmultiview2019] Qi Wang, Claire Boudreau, Qixing Luo, Pang-Ning Tan, Jiayu Zhou. (2019). Deep Multi-view Information Bottleneck. Proceedings of the 2019 SIAM International Conference on Data Mining (SDM). doi:10.1137/1.9781611975673.5.

[hang2018kernel] Hang, Hanyuan, Steinwart, Ingo, Feng, Yunlong, Suykens, Johan AK. (2018). Kernel density estimation for dynamical systems. The Journal of Machine Learning Research.

[kozachenko1987sample] Kozachenko, Lyudmyla F, Leonenko, Nikolai N. (1987). Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii.

[linsker88] Henaff, Olivier. (2020). Data-efficient image recognition with contrastive predictive coding. International Conference on Machine Learning.

[hjelm2018learning] Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil, Trischler, Adam, Bengio, Yoshua. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

[karpathy2015deep] Karpathy, Andrej, Fei-Fei, Li. (2015). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition.

[deepmultiview2015] Wang, Weiran, Arora, Raman, Livescu, Karen, Bilmes, Jeff. (2015). On Deep Multi-View Representation Learning. Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37.

[multimodel2011] Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, Ng, Andrew Y.. (2011). Multimodal Deep Learning. Proceedings of the 28th International Conference on International Conference on Machine Learning.

[srivastava14b] Nitish Srivastava, Ruslan Salakhutdinov. (2014). Multimodal Learning with Deep Boltzmann Machines. Journal of Machine Learning Research.

[chen2010] Chen, Ning, Zhu, Jun, Xing, Eric. (2010). Predictive Subspace Learning for Multi-view Data: a Large Margin Approach. Advances in Neural Information Processing Systems.

[xing2012mining] Xing, Eric P, Yan, Rong, Hauptmann, Alexander G. (2012). Mining associated text and images with dual-wing harmoniums. arXiv preprint arXiv:1207.1423.

[multi2014] Weifeng Liu, Dacheng Tao, Jun Cheng, Yuanyan Tang. (2014). Multiview Hessian discriminative sparse coding for image annotation. Computer Vision and Image Understanding. doi:10.1016/j.cviu.2013.03.007.

[article2008] Sridharan, Karthik, Kakade, Sham. (2008). An Information Theoretic Framework for Multi-View Learning. SO.

[Tian2013] Cao, Tian, Jojic, Vladimir, Modla, Shannon, Powell, Debbie, Czymmek, Kirk, Niethammer, Marc. (2013). Robust Multimodal Dictionary Learning. Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2013.

[factorized2010] Jia, Yangqing, Salzmann, Mathieu, Darrell, Trevor. (2010). Factorized Latent Spaces with Structured Sparsity. Advances in Neural Information Processing Systems.

[matching2003] Barnard, Kobus, Duygulu, Pinar, Forsyth, David, de Freitas, Nando, Blei, David M., Jordan, Michael I.. (2003). Matching Words and Pictures. J. Mach. Learn. Res..

[miss2000] Cohn, David, Hofmann, Thomas. (2000). The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. Advances in Neural Information Processing Systems.

[Sun2013ASO] Shiliang Sun. (2013). A survey of multi-view machine learning. Neural Computing and Applications.

[hardoon2004] Bach, Francis R., Jordan, Michael I.. (2003). Kernel Independent Component Analysis. J. Mach. Learn. Res.. doi:10.1162/153244303768966085.

[cca1396] Harold Hotelling. (1936). Relations Between Two Sets of Variates. Biometrika.

[Darbellay99] Vapnik, Vladimir N, Chervonenkis, A Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications.

[cover1999elements] Cover, Thomas M, Thomas, Joy A. (1999). Elements of information theory.

[koopman1936distributions] Koopman, Bernard Osgood. (1936). On distributions admitting a sufficient statistic. Transactions of the American Mathematical society.

[gilad2003information] Gilad-Bachrach, Ran, Navot, Amir, Tishby, Naftali. (2003). An information theoretic tradeoff between complexity and accuracy. Learning Theory and Kernel Machines.

[kinney2014equitability] Kinney, Justin B, Atwal, Gurinder S. (2014). Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences.

[rosenblatt1958perceptron] Rosenblatt, Frank. (1958). The perceptron: a probabilistic model for information storage and organization in the brain.. Psychological review.

[hinton2006fast] Hinton, Geoffrey E, Osindero, Simon, Teh, Yee-Whye. (2006). A fast learning algorithm for deep belief nets. Neural computation.

[ren2015faster] Ren, Shaoqing, He, Kaiming, Girshick, Ross, Sun, Jian. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems.

[steinke2020reasoning] Steinke, Thomas, Zakynthinou, Lydia. (2020). Reasoning about generalization via conditional mutual information. Conference on Learning Theory.

[alemi2016deep] Alemi, Alexander A, Fischer, Ian, Dillon, Joshua V, Murphy, Kevin. (2016). Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.

[lee2019wide] Lee, Jaehoon, Xiao, Lechao, Schoenholz, Samuel, Bahri, Yasaman, Novak, Roman, Sohl-Dickstein, Jascha, Pennington, Jeffrey. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems.

[strouse2017deterministic] Strouse, DJ, Schwab, David J. (2017). The deterministic information bottleneck. Neural computation.

[elad2019direct] Elad, Adar, Haviv, Doron, Blau, Yochai, Michaeli, Tomer. (2019). Direct validation of the information bottleneck principle for deep nets. Proceedings of the IEEE International Conference on Computer Vision Workshops.

[fischer2020ceb] Fischer, Ian, Alemi, Alexander A. (2020). CEB Improves Model Robustness. arXiv preprint arXiv:2002.05380.

[paninski2003estimation] Paninski, Liam. (2003). Estimation of entropy and mutual information. Neural computation.

[mcallester2020formal] McAllester, David, Stratos, Karl. (2020). Formal limitations on the measurement of mutual information. International Conference on Artificial Intelligence and Statistics.

[shannon1948mathematical] Shannon, Claude E. (1948). A mathematical theory of communication. The Bell system technical journal.

[SHAMIR20102696] Ohad Shamir, Sivan Sabato, Naftali Tishby. (2010). Learning and generalization with the information bottleneck. Theoretical Computer Science. doi:10.1016/j.tcs.2010.04.006.

[painsky2018bregman] Painsky, Amichai, Wornell, Gregory W. (2018). Bregman Divergence Bounds and the Universality of the Logarithmic Loss. arXiv preprint arXiv:1810.07014.

[painsky2018information] Painsky, Amichai, Feder, Meir, Tishby, Naftali. (2018). An Information-Theoretic Framework for Non-linear Canonical Correlation Analysis. arXiv preprint arXiv:1810.13259.

[DBLP:journals/corr/abs-1801-02254] Chiyuan Zhang, Qianli Liao, Alexander Rakhlin, Brando Miranda, Noah Golowich, Tomaso A. Poggio. (2018). Theory of Deep Learning IIb: Optimization Properties of SGD. CoRR.

[entropy2019] Cheng, H., Lian, D., Gao, S., Geng, Y. (2019). Utilizing Information Bottleneck to Evaluate the Capability of Deep Neural Networks for Image Classification. Entropy.

[gabrie2018entropy] Gabrié, Marylou, Manoel, Andre, Luneau, Clément, Barbier, Jean, Macris, Nicolas, Krzakala, Florent, Zdeborová, Lenka. (2018). Entropy and mutual information in models of deep neural networks. arXiv preprint arXiv:1805.09785.

[DBLP:journals/corr/abs-1710-11029] Pratik Chaudhari, Stefano Soatto. (2017). Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. CoRR.

[2016arXiv161101353A] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[chechik2005information] Linsker, Ralph. (1988). Self-organization in a perceptual network. Computer.

[painsky2016generalized] Painsky, Amichai, Rosset, Saharon, Feder, Meir. (2016). Generalized independent component analysis over finite alphabets. IEEE Transactions on Information Theory.

[rissanen1978modeling] Rissanen, Jorma. (1978). Modeling by shortest data description. Automatica.

[vapnik1968uniform] Vapnik, Vladimir N, Chervonenkis, Aleksei Yakovlevich. (1968). The uniform convergence of frequencies of the appearance of events to their probabilities. Doklady Akademii Nauk.

[sauer1972density] Sauer, Norbert. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A.

[shelah1972combinatorial] Shelah, Saharon. (1972). A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics.

[hoeffding1963probability] Hoeffding, Wassily. (1963). Probability inequalities for sums of bounded random variables. Journal of the American statistical association.

[chigirev2004optimal] Chigirev, Denis V, Bialek, William. (2004). Optimal manifold representation of data: an information theoretic approach. Advances in Neural Information Processing Systems.

[bell1995information] Bell, Anthony J, Sejnowski, Terrence J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural computation.

[deco2012information] Deco, Gustavo, Obradovic, Dragan. (2012). An information-theoretic approach to neural computing.

[achille2018emergence] Achille, Alessandro, Soatto, Stefano. (2018). Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research.

[saxe2019information] Saxe, Andrew M, Bansal, Yamini, Dapello, Joel, Advani, Madhu, Kolchinsky, Artemy, Tracey, Brendan D, Cox, David D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment.

[yu2020understanding] Yu, Shujian, Wickstrøm, Kristoffer, Jenssen, Robert, Principe, Jose C. (2020). Understanding convolutional neural networks with information theory: An initial exploration. IEEE Transactions on Neural Networks and Learning Systems.

[cheng2018evaluating] Cheng, Hao, Lian, Dongze, Gao, Shenghua, Geng, Yanlin. (2018). Evaluating capability of deep neural networks for image classification via information plane. Proceedings of the European Conference on Computer Vision (ECCV).

[goldfeld2018estimating] Goldfeld, Ziv, Berg, Ewout van den, Greenewald, Kristjan, Melnyk, Igor, Nguyen, Nam, Kingsbury, Brian, Polyanskiy, Yury. (2018). Estimating information flow in deep neural networks. arXiv preprint arXiv:1810.05728.

[wickstrom2019information] Wickstrøm, Kristoffer, Løkse, Sigurd, Kampffmeyer, Michael, Yu, Shujian, Principe, Jose, Jenssen, Robert. (2019). Information Plane Analysis of Deep Neural Networks via Matrix-Based Renyi's Entropy and Tensor Kernels. arXiv preprint arXiv:1909.11396.

[cortes2012algorithms] Cortes, Corinna, Mohri, Mehryar, Rostamizadeh, Afshin. (2012). Algorithms for learning kernels based on centered alignment. The Journal of Machine Learning Research.

[amjad2019learning] Amjad, Rana Ali, Geiger, Bernhard Claus. (2019). Learning representations for neural network-based classification using the information bottleneck principle. IEEE transactions on pattern analysis and machine intelligence.

[ben2023reverse] Ben-Shaul, Ido, Shwartz-Ziv, Ravid, Galanti, Tomer, Dekel, Shai, LeCun, Yann. (2023). Reverse Engineering Self-Supervised Learning. arXiv preprint arXiv:2305.15614.

[goldfeld2020convergence] Goldfeld, Ziv, Greenewald, Kristjan, Niles-Weed, Jonathan, Polyanskiy, Yury. (2020). Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Transactions on Information Theory.

[cvitkovic2019minimal] Cvitkovic, Milan, Koliander, Günther. (2019). Minimal achievable sufficient statistic learning. arXiv preprint arXiv:1905.07822.

[geiger2020information] Geiger, Bernhard C. (2020). On Information Plane Analyses of Neural Network Classifiers--A Review. arXiv preprint arXiv:2003.09671.

[van2020survey] Van Engelen, Jesper E, Hoos, Holger H. (2020). A survey on semi-supervised learning. Machine Learning.

[pogodin2020kernelized] Pogodin, Roman, Latham, Peter E. (2020). Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks. arXiv preprint arXiv:2006.07123.

[chelombiev2019adaptive] Chelombiev, Ivan, Houghton, Conor, O'Donnell, Cian. (2019). Adaptive estimators show information compression in deep neural networks. ICLR.

[song2021train] Song, Yang, Kingma, Diederik P. (2021). How to train your energy-based models. arXiv preprint arXiv:2101.03288.

[huembeli2022physics] Huembeli, Patrick, Arrazola, Juan Miguel, Killoran, Nathan, Mohseni, Masoud, Wittek, Peter. (2022). The physics of energy-based models. Quantum Machine Intelligence.

[noshad2018scalable] Noshad, Morteza, Hero III, Alfred O. (2018). Scalable Mutual Information Estimation using Dependence Graphs. arXiv preprint arXiv:1801.09125.

[achille2018critical] Achille, Alessandro, Rovere, Matteo, Soatto, Stefano. (2018). Critical learning periods in deep networks. International Conference on Learning Representations.

[achille2018information] Achille, Alessandro, Soatto, Stefano. (2018). Information dropout: Learning optimal representations through noisy computation. IEEE transactions on pattern analysis and machine intelligence.

[kirsch2020unpacking] Kirsch, Andreas, Lyle, Clare, Gal, Yarin. (2020). Unpacking Information Bottlenecks: Unifying Information-Theoretic Objectives in Deep Learning. arXiv preprint arXiv:2003.12537.

[pensia2018generalization] Pensia, Ankit, Jog, Varun, Loh, Po-Ling. (2018). Generalization error bounds for noisy, iterative algorithms. 2018 IEEE International Symposium on Information Theory (ISIT).

[NIPS2019_9282] Negrea, Jeffrey, Haghifam, Mahdi, Dziugaite, Gintare Karolina, Khisti, Ashish, Roy, Daniel M. (2019). Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates. Advances in Neural Information Processing Systems 32.

[NIPS2018_7954] Asadi, Amir, Abbe, Emmanuel, Verdu, Sergio. (2018). Chaining Mutual Information and Tightening Generalization Bounds. Advances in Neural Information Processing Systems 31.

[russo2016controlling] Russo, Daniel, Zou, James. (2016). Controlling bias in adaptive data analysis using information theory. Artificial Intelligence and Statistics.

[vera2018role] Vera, Matías, Piantanida, Pablo, Vega, Leonardo Rey. (2018). The role of information complexity and randomization in representation learning. arXiv preprint arXiv:1802.05355.

[boucheron2005theory] Boucheron, Stéphane, Bousquet, Olivier, Lugosi, Gábor. (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics.

[neyshabur2014search] Neyshabur, Behnam, Tomioka, Ryota, Srebro, Nathan. (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614.

[neyshabur2015norm] Neyshabur, Behnam, Tomioka, Ryota, Srebro, Nathan. (2015). Norm-based capacity control in neural networks. Conference on Learning Theory.

[10.2307/2334522] Ralph B. D'Agostino. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika.

[Krizhevsky09learningmultiple] Alex Krizhevsky. (2009). Learning multiple layers of features from tiny images.

[bartlett2002rademacher] Bartlett, Peter L, Mendelson, Shahar. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research.

[bousquet2002stability] Bousquet, Olivier, Elisseeff, Andre. (2002). Stability and generalization. Journal of machine learning research.

[stavac] Achille, Alessandro, Paolini, Giovanni, Soatto, Stefano. (2019). Where is the information in a deep neural network?. arXiv preprint arXiv:1905.12213.

[nash2018inverting] Nash, Charlie, Kushman, Nate, Williams, Christopher KI. (2018). Inverting Supervised Representations with Autoregressive Neural Density Models. arXiv preprint arXiv:1806.00400.

[csiszar1987conditional] Csiszár, Imre. (1987). Conditional limit theorems under Markov conditioning. IEEE Transactions on Information Theory.



[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.

[berglund2013measuring] Kraskov, Alexander, Stögbauer, Harald, Grassberger, Peter. (2004). Estimating mutual information. Physical Review E. doi:10.1103/PhysRevE.69.066138.


[2014arXiv1412.6615S] Anonymous. (2019). Representation Compression and Generalization in Deep Neural Networks. Journal of Machine Learning Research.

[turner2007maximum] Turner, Richard, Sahani, Maneesh. (2007). A maximum-likelihood interpretation for slow feature analysis. Neural computation.

[hecht2009speaker] Hecht, Ron M, Noor, Elad, Tishby, Naftali. (2009). Speaker recognition by Gaussian information bottleneck. Tenth Annual Conference of the International Speech Communication Association.

[palmer2015predictive] Palmer, Stephanie E, Marre, Olivier, Berry, Michael J, Bialek, William. (2015). Predictive information in a sensory population. Proceedings of the National Academy of Sciences.

[buesing2010spiking] Buesing, Lars, Maass, Wolfgang. (2010). A spiking neuron as information bottleneck. Neural computation.

[saxe2018information] Saxe, Andrew M, Bansal, Yamini, Dapello, Joel, Advani, Madhu, Kolchinsky, Artemy, Tracey, Brendan D, Cox, David D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment.

[amjad2018not] Amjad, Rana Ali, Geiger, Bernhard C. (2018). How (Not) To Train Your Neural Network Using the Information Bottleneck Principle. arXiv preprint arXiv:1802.09766.

[elad2018effectiveness] Elad, Adar, Haviv, Doron, Blau, Yochai, Michaeli, Tomer. (2018). The effectiveness of layer-by-layer training using the information bottleneck principle.

[xu2017information] Xu, Aolin, Raginsky, Maxim. (2017). Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems.

[tian2021understanding] Tian, Yuandong, Chen, Xinlei, Ganguli, Surya. (2021). Understanding self-supervised learning dynamics without contrastive pairs. International Conference on Machine Learning.

[hua2021feature] Hua, Tianyu, Wang, Wenxiao, Xue, Zihui, Ren, Sucheng, Wang, Yue, Zhao, Hang. (2021). On feature decorrelation in self-supervised learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[zhang2022how] Chaoning Zhang, Kang Zhang, Chenshuang Zhang, Trung X. Pham, Chang D. Yoo, In So Kweon. (2022). How Does SimSiam Avoid Collapse Without Negative Samples? A Unified Understanding with Self-supervised Contrastive Learning. International Conference on Learning Representations.

[Arora2019theory] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.

[pu2020multimodal] Pu, Shi, He, Yijiang, Li, Zheng, Zheng, Mao. (2020). Multimodal Topic Learning for Video Recommendation. arXiv preprint arXiv:2010.13373.

[voloshynovskiy2019information] Voloshynovskiy, Slava, Taran, Olga, Kondah, Mouad, Holotyak, Taras, Rezende, Danilo. (2020). Variational Information Bottleneck for Semi-Supervised Classification. Entropy. doi:10.3390/e22090943.

[gao2015efficient] Gao, Shuyang, Ver Steeg, Greg, Galstyan, Aram. (2015). Efficient estimation of mutual information for strongly dependent variables. Artificial Intelligence and Statistics.

[Belghazi2018MutualIN] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, R. Devon Hjelm, Aaron C. Courville. (2018). Mutual Information Neural Estimation. ICML.

[donsker1975asymptotic] Donsker, Monroe D, Varadhan, SR Srinivasa. (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics.

[2018Estimating] {Goldfeld. {Estimating Information Flow in Neural Networks. ArXiv e-prints.

[jacobsen2018irevnet] Jörn-Henrik Jacobsen, Arnold W.M. Smeulders, Edouard Oyallon. (2018). i-RevNet: Deep Invertible Networks. International Conference on Learning Representations.

[bertsekas2011incremental] Bertsekas, Dimitri P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning.

[li2017convergence] Li, Yuanzhi, Yuan, Yang. (2017). Convergence analysis of two-layer neural networks with relu activation. Advances in Neural Information Processing Systems.

[dieuleveut2017bridging] Dieuleveut, Aymeric, Durmus, Alain, Bach, Francis. (2017). Bridging the gap between constant step size stochastic gradient descent and markov chains. arXiv preprint arXiv:1707.06386.

[rumelhart1986learning] Rumelhart, David E, Hinton, Geoffrey E, Williams, Ronald J. (1986). Learning representations by back-propagating errors. nature.

[oord2016wavenet] Oord, Aaron van den, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alex, Kalchbrenner, Nal, Senior, Andrew, Kavukcuoglu, Koray. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[matias2018role] Bell, Anthony J, Sejnowski, Terrence J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural computation.

[DBLP:journals/corr/WellingHinton2005] Alemi, Alexander A, Fischer, Ian, Dillon, Joshua V, Murphy, Kevin. (2016). Deep Variational Information Bottleneck. arXiv:1612.00410.

[skincat] Alemi, Alexander A, Poole, Ben, Dillon, Joshua V, Saurous, Rif A, Murphy, Kevin. (2017). An Information-Theoretic Analysis of Deep Latent-Variable Models. arXiv:1711.00464.

[brokenelbo] Alemi, Alexander A, Poole, Ben, Dillon, Joshua V, Saurous, Rif A, Murphy, Kevin. (2018). Fixing a Broken {ELBO. ICML 2018.

[infoautoencoding] Anonymous. (2018). The Information-Autoencoding Family: A Lagrangian Perspective on Latent Variable Generative Modeling. International Conference on Learning Representations.

[rationalignorance] Mattingly, Henry H, Transtrum, Mark K, Abbott, Michael C, Machta, Benjamin B. (2017). Rational ignorance: simpler models learn more from finite data. arXiv:1705.01166.

[infoscaling] Abbott, Michael C, Machta, Benjamin B. (2018). An Information Scaling Law: $\zeta = 3/4$. arXiv:1710.09351.

[thermoinfo] Parrondo, Juan MR, Horowitz, Jordan M, Sagawa, Takahiro. (2015). Thermodynamics of information. Nature physics.

[costbenefitdata] {Still. {Thermodynamic cost and benefit of data representations. arXiv: 1705.00612.

[marginalent] {Crooks. {Marginal and Conditional Second Laws of Thermodynamics. arXiv: 1611.04628.

[thermoprediction] {Still. {Thermodynamics of Prediction. Physical Review Letters. doi:10.1103/PhysRevLett.109.120604.

[interactive] {Still. {Information-theoretic approach to interactive learning. EPL (Europhysics Letters). doi:10.1209/0295-5075/85/28005.

[optimalcausal] {Still. {Optimal Causal Inference: Estimating Stored Information and Approximating Causal Architecture. arXiv: 0708.1580.

[structurenoise] {Still. {Structure or Noise?. arXiv: 0708.0654.

[clusters] {Still. {How many clusters? An information theoretic perspective. ArXiv Physics e-prints.

[jaynes] Jaynes, Edwin T. (1957). Information theory and statistical mechanics. Physical review.

[sethna] Sethna, James. (2006). Statistical mechanics: entropy, order parameters, and complexity.

[coverthomas] Cover, Thomas M, Thomas, Joy A. (2012). Elements of information theory.

[reversible] Maclaurin, Dougal, Duvenaud, David, Adams, Ryan P.. (2015). Gradient-based Hyperparameter Optimization Through Reversible Learning. Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37.

[mib] Friedman, Nir, Mosenzon, Ori, Slonim, Noam, Tishby, Naftali. (2001). Multivariate information bottleneck. Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence.

[predictive] Bialek, William, Nemenman, Ilya, Tishby, Naftali. (2001). Predictability, complexity, and learning. Neural computation.

[vae] Kingma, Diederik P, Welling, Max. {Auto-encoding variational Bayes.

[betavae] Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, Lerchner, Alexander. {$\beta$-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.

[emergence] {Achille. {Emergence of Invariance and Disentangling in Deep Representations. Proceedings of the ICML Workshop on Principled Approaches to Deep Learning.

[ib] N. Tishby, F.C. Pereira, W. Biale. The Information Bottleneck method. The 37th annual Allerton Conf. on Communication, Control, and Computing.

[bbb] {Blundell. {Weight Uncertainty in Neural Networks. arXiv: 1505.05424.

[semi] {Kingma. {Semi-Supervised Learning with Deep Generative Models. arXiv: 1406.5298.

[sgdasbayes] {Mandt. {Stochastic Gradient Descent as Approximate Bayesian Inference. arXiv: 1704.04289.

[sgr] {Ma. {A Complete Recipe for Stochastic Gradient MCMC. arXiv:1506.04696.

[sgld] Welling, Max, Teh, Yee W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th international conference on machine learning (ICML-11).

[bayessgd] {Smith. {A Bayesian Perspective on Generalization and Stochastic Gradient Descent. arXiv:1710.06451.

[sghmc] {Chen. {Stochastic Gradient Hamiltonian Monte Carlo. arXiv:1402.4102.

[snapshot] {Huang. {Snapshot Ensembles: Train 1, get M for free. arXiv: 1704.00109.

[poppar] {Machta. {Monte Carlo Methods for Rough Free Energy Landscapes: Population Annealing and Parallel Tempering. Journal of Statistical Physics. doi:10.1007/s10955-011-0249-0.

[finn] Finn, Colin BP. (1993). Thermal physics.

[energyentropy] {Zhang. {Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning. arXiv: 1803.01927.

[pacbayes] {McAllester. {A PAC-Bayesian Tutorial with A Dropout Bound. arXiv: 1307.2118.

[pacbayesbayes] Germain, Pascal, Bach, Francis, Lacoste, Alexandre, Lacoste-Julien, Simon. (2016). PAC-Bayesian Theory Meets Bayesian Inference. Advances in Neural Information Processing Systems 29.

[marsh] Marsh, Charles. (2013). Introduction to continuous entropy.

[box] Box, George EP, Draper, Norman R. (1987). Empirical model-building and response surfaces..

[infoprojection] Csisz{'a. (2003). Information projections revisited. IEEE Transactions on Information Theory.

[lecturenotes] Ariel Caticha. (2008). Lectures on Probability, Entropy, and Statistical Physics.

[correspondence] Colin H. LaMont, Paul A. Wiggins. (2017). A correspondence between thermodynamics and inference.

[watanabegrey] Watanabe, Sumio. (2009). Algebraic geometry and statistical learning theory.

[watanabegreen] Watanabe, Sumio. (2018). Mathematical theory of Bayesian statistics.

[whereinfo] Alessandro Achille, Stefano Soatto. (2019). Where is the Information in a Deep Neural Network?.

[ffjord] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, David Duvenaud. (2018). FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models.

[widelinear] {Lee. (2019). {Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. arXiv e-prints.

[fisherRao] Liang, Tengyuan, Poggio, Tomaso, Rakhlin, Alexander, Stokes, James. (2017). Fisher-rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530.

[AIC] Akaike, Hirotugu. (1974). A new look at the statistical model identification. Selected Papers of Hirotugu Akaike.

[TIC] {Thomas. (2019). {Information matrices and generalization. arXiv e-prints.

[generalization_dnn] Neyshabur, Behnam, Bhojanapalli, Srinadh, McAllester, David, Srebro, Nati. (2017). Exploring generalization in deep learning. Advances in Neural Information Processing Systems.

[vmibounds] Ben Poole, Sherjil Ozair, A{. (2019). On Variational Bounds of Mutual Information. CoRR.

[gaussib] Chechik, Gal, Globerson, Amir, Tishby, Naftali, Weiss, Yair. (2005). Information bottleneck for Gaussian variables. Journal of machine learning research.

[halko] Halko, Nathan, Martinsson, Per-Gunnar, Tropp, Joel A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review.

[blackbox] Shwartz-Ziv, Ravid, Tishby, Naftali. (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.

[tishbydeep] Tishby, Naftali, Zaslavsky, Noga. (2015). Deep learning and the information bottleneck principle. 2015 IEEE Information Theory Workshop (ITW).

[saxe] Saxe, Andrew M, Bansal, Yamini, Dapello, Joel, Advani, Madhu, Kolchinsky, Artemy, Tracey, Brendan D, Cox, David D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment.

[hownot] Amjad, Rana Ali, Geiger, Bernhard C. (2018). How (not) to train your neural network using the information bottleneck principle. arXiv preprint arXiv:1802.09766.

[brendan] Kolchinsky, Artemy, Tracey, Brendan D, Van Kuyk, Steven. (2018). Caveats for information bottleneck in deterministic scenarios. arXiv preprint arXiv:1808.07593.

[mnist] LeCun, Yann, Cortes, Corinna, Burges, CJ. (2010). MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist.

[ntk] Jacot, Arthur, Gabriel, Franck, Hongler, Cl{'e. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems.

[neuraltangents] Novak, Roman, Xiao, Lechao, Hron, Jiri, Lee, Jaehoon, Alemi, Alexander A, Sohl-Dickstein, Jascha, Schoenholz, Samuel S. (2019). Neural tangents: Fast and easy infinite neural networks in python. arXiv preprint arXiv:1912.02803.

[fisher] Frederik Kunstner, Lukas Balles, Philipp Hennig. (2019). Limitations of the Empirical Fisher Approximation.

[littlebits] Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, Amir Yehudayoff. (2017). Learners that Use Little Information.

[neuralode] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud. (2018). Neural Ordinary Differential Equations.

[bayesianbounds] Banerjee, Arindam. (2006). On bayesian bounds. Proceedings of the 23rd international conference on Machine learning.

[invertible] Anonymous. (2020). On the Invertibility of Invertible Neural Networks. Submitted to International Conference on Learning Representations.

[cando] Sanjeev Arora, Simon S. Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, Dingli Yu. (2019). Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks.

[liang2019fisher] Liang, Tengyuan, Poggio, Tomaso, Rakhlin, Alexander, Stokes, James. (2019). Fisher-rao metric, geometry, and complexity of neural networks. The 22nd International Conference on Artificial Intelligence and Statistics.

[neyshabur2017exploring] Neyshabur, Behnam, Bhojanapalli, Srinadh, McAllester, David, Srebro, Nati. (2017). Exploring generalization in deep learning. Advances in neural information processing systems.

[hardt2016train] Hardt, Moritz, Recht, Ben, Singer, Yoram. (2016). Train faster, generalize better: Stability of stochastic gradient descent. International Conference on Machine Learning.

[watanabe2010asymptotic] Watanabe, Sumio, Opper, Manfred. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory.. Journal of machine learning research.

[russo2019much] Russo, Daniel, Zou, James. (2019). How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory.

[slonim2002information] Slonim, Noam. (2002). The information bottleneck: Theory and applications.

[Tishby1999] Steinbach, Michael, Ert{. (2004). The challenges of clustering high dimensional data. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing.

[Gilad-bachrach] Ran Gilad-bachrach, Amir Navot, Naftali Tishby. (2003). An information theoretic tradeoff between complexity and accuracy. In Proceedings of the COLT.

[CriticalSlowingDown:2004] Tredicce, Jorge R, Lippi, Gian Luca, Mandel, Paul, Charasse, Basile, Chevalier, Aude, Picqu{'e. (2004). Critical slowing down at a bifurcation. American Journal of Physics.

[shwartz2017] {Shwartz-Ziv. (2017). {Opening the Black Box of Deep Neural Networks via Information. arXiv e-prints.

[tishby99information] Tishby, Naftali, Pereira, Fernando C., Bialek, William. The information bottleneck method. Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing.

[Csiszar] Csisz'{a. (2004). Information Theory and Statistics: A Tutorial. Commun. Inf. Theory. doi:10.1561/0100000004.

[Cover:2006:EIT:1146355] Cover, Thomas M., Thomas, Joy A.. (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing).

[DBLP:conf/alt/ShamirST08] Ohad Shamir, Sivan Sabato, Naftali Tishby. (2010). Learning and generalization with the information bottleneck. Theor. Comput. Sci..

[DBLP:conf/alt/2008] . Algorithmic Learning Theory, 19th International Conference, {ALT. (2008).

[Exp_forms] Lawrence D. Brown. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Lecture Notes-Monograph Series.

[Painsky2019] {Painsky. (2018). {Bregman Divergence Bounds and the Universality of the Logarithmic Loss. arXiv e-prints.

[Csiszar:2004:ITS:1166379.1166380] Csisz'{a. (2004). Information Theory and Statistics: A Tutorial. Commun. Inf. Theory. doi:10.1561/0100000004.

[CIS-58533] Tusnady, G., Csiszar, I.. (1984). Information geometry and alternating minimization procedures. Statistics & Decisions: Supplement Issues.

[slonim_MIB] Slonim, Noam, Friedman, Nir, Tishby, Naftali. (2006). Multivariate Information Bottleneck. Neural Computation. doi:10.1162/neco.2006.18.8.1739.

[Ay2019] Domenico Felice, Nihat Ay. (2019). Divergence Functions in Information Geometry. Geometric Science of Information - 4th International Conference, {GSI. doi:10.1007/978-3-030-26980-7_45.

[DBLP:conf/gsi/2019] . Geometric Science of Information - 4th International Conference, {GSI. (2019).

[parker] Albert E. Parker, Tom'{a. (2003). Annealing and the Rate Distortion Problem. Advances in Neural Information Processing Systems 15.

[Jaynes58] Jaynes, E. T.. (1957). Information Theory and Statistical Mechanics. Phys. Rev.. doi:10.1103/PhysRev.106.620.

[ZaslavskyTishby:2019] Zaslavsky, Noga, Tishby, Naftali. (2019). Deterministic Annealing and the Evolution of Optimal Information Bottleneck Representations. Preprint.

[Kullback58] S. Kullback. (1959). Information Theory and Statistics.

[GaussianIB] Chechik, Gal, Globerson, Amir, Tishby, Naftali, Weiss, Yair. (2005). Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res..

[globerson2003sufficient] Globerson, Amir, Tishby, Naftali. (2003). Sufficient dimensionality reduction. Journal of Machine Learning Research.

[ma2019unpaired] Ma, Shuang, McDuff, Daniel, Song, Yale. (2019). Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck. Proceedings of the IEEE International Conference on Computer Vision.

[schneidman2001analyzing] Schneidman, Elad, Slonim, Noam, Tishby, Naftali, van Steveninck, R deRuyter, Bialek, William. (2001). Analyzing neural codes using the information bottleneck method. Advances in Neural Information Processing Systems, NIPS.

[Alemi2016DeepVI] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, Kevin Murphy. (2016). Deep Variational Information Bottleneck. ArXiv.

[Parbhoo2018CausalDI] Sonali Parbhoo, Mario Wieser, Volker Roth. (2018). Causal Deep Information Bottleneck. ArXiv.

[westover2008asymptotic] Westover, M Brandon. (2008). Asymptotic geometry of multiple hypothesis testing. IEEE transactions on information theory.

[nielsen2011chernoff] Nielsen, Frank. (2011). Chernoff information of exponential families. arXiv preprint arXiv:1102.2684.

[wieczorek2020difference] Wieczorek, Aleksander, Roth, Volker. (2020). On the Difference between the Information Bottleneck and the Deep Information Bottleneck. Entropy.

[wu2020phase] Wu, Tailin, Fischer, Ian. (2020). Phase Transitions for the Information Bottleneck in Representation Learning. arXiv preprint arXiv:2001.01878.

[fischer2018conditional] Fischer, Ian. (2018). The conditional entropy bottleneck. URL openreview. net/forum.

[lecun-mnisthandwrittendigit-2010] LeCun, Yann, Cortes, Corinna. {MNIST.

[raman2017illum] Raman, Ravi Kiran, Yu, Haizi, Varshney, Lav R. (2017). Illum information. 2017 Information Theory and Applications Workshop (ITA).

[palomar2008lautum] Palomar, Daniel P, Verd{'u. (2008). Lautum information. IEEE transactions on information theory.

[poole2019variational] Poole, Ben, Ozair, Sherjil, Oord, Aaron van den, Alemi, Alexander A, Tucker, George. (2019). On variational bounds of mutual information. arXiv preprint arXiv:1905.06922.

[hsu2018generalizing] Hsu, Hsiang, Asoodeh, Shahab, Salamatian, Salman, Calmon, Flavio P. (2018). Generalizing bottleneck problems. 2018 IEEE International Symposium on Information Theory (ISIT).

[dusenberry2020efficient] Dusenberry, Michael W, Jerfel, Ghassen, Wen, Yeming, Ma, Yi-an, Snoek, Jasper, Heller, Katherine, Lakshminarayanan, Balaji, Tran, Dustin. (2020). Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors. arXiv preprint arXiv:2005.07186.

[muller2019does] M{. (2019). When does label smoothing help?. Advances in Neural Information Processing Systems.

[zagoruyko2017diracnets] Zagoruyko, Sergey, Komodakis, Nikos. (2017). Diracnets: Training very deep neural networks without skip-connections. arXiv preprint arXiv:1706.00388.

[shamir2008learning] Shamir, Ohad, Sabato, Sivan, Tishby, Naftali. (2008). Learning and generalization with the information bottleneck. International Conference on Algorithmic Learning Theory.

[li-eisner-2019] Li, Xiang Lisa, Eisner, Jason. (2019). Specializing Word Embeddings (for Parsing) by Information Bottleneck. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

[soudry2018implicit] Soudry, Daniel, Hoffer, Elad, Nacson, Mor Shpigel, Gunasekar, Suriya, Srebro, Nathan. (2018). The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research.

[gunasekar2018implicit] Gunasekar, Suriya, Lee, Jason D, Soudry, Daniel, Srebro, Nati. (2018). Implicit bias of gradient descent on linear convolutional networks. Advances in Neural Information Processing Systems.

[gunasekar2017implicit] Gunasekar, Suriya, Woodworth, Blake E, Bhojanapalli, Srinadh, Neyshabur, Behnam, Srebro, Nati. (2017). Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems.

[moroshko2020implicit] Moroshko, Edward, Gunasekar, Suriya, Woodworth, Blake, Lee, Jason D, Srebro, Nathan, Soudry, Daniel. (2020). Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy. arXiv preprint arXiv:2007.06738.

[woodworth2020kernel] Woodworth, Blake, Gunasekar, Suriya, Lee, Jason D, Moroshko, Edward, Savarese, Pedro, Golan, Itay, Soudry, Daniel, Srebro, Nathan. (2020). Kernel and rich regimes in overparametrized models. arXiv preprint arXiv:2002.09277.

[kawaguchi2021theory] Kawaguchi, Kenji. (2021). On the Theory of Implicit Deep Learning: Global Convergence with Implicit Layers. International Conference on Learning Representations (ICLR).

[shalev2014understanding] Shalev-Shwartz, Shai, Ben-David, Shai. (2014). Understanding machine learning: From theory to algorithms.

[nilsback2008automated] Nilsback, Maria-Elena, Zisserman, Andrew. (2008). Automated flower classification over a large number of classes. 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[maji2013fine] Maji, Subhransu, Rahtu, Esa, Kannala, Juho, Blaschko, Matthew, Vedaldi, Andrea. (2013). Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.

[bossard2014food] Bossard, Lukas, Guillaumin, Matthieu, Van Gool, Luc. (2014). Food-101--mining discriminative components with random forests. Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13.

[federici2019learning] Federici, Marco, Dutta, Anjan, Forr{'e. (2019). Learning Robust Representations via Multi-View Information Bottleneck. International Conference on Learning Representations.

[kawaguchi2018generalization] Kawaguchi, Kenji, Kaelbling, Leslie Pack, Bengio, Yoshua. (2018). Generalization in deep learning. MIT-CSAIL-TR-2018-014, Massachusetts Institute of Technology.

[bartlett2017spectrally] Bartlett, Peter L, Foster, Dylan J, Telgarsky, Matus J. (2017). Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems.

[kawaguchi2022robust] Kawaguchi, Kenji, Deng, Zhun, Luh, Kyle, Huang, Jiaoyang. {Robustness Implies Generalization via Data-Dependent Generalization Bounds. International Conference on Machine Learning (ICML).

[golowich2018size] Golowich, Noah, Rakhlin, Alexander, Shamir, Ohad. (2018). Size-independent sample complexity of neural networks. Conference On Learning Theory.

[mohri2012foundations] Mohri, Mehryar, Rostamizadeh, Afshin, Talwalkar, Ameet. (2012). Foundations of machine learning.

[saunshi2019theoretical] Saunshi, Nikunj, Plevrakis, Orestis, Arora, Sanjeev, Khodak, Mikhail, Khandeparkar, Hrishikesh. (2019). A theoretical analysis of contrastive unsupervised representation learning. International Conference on Machine Learning.

[bib1] Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE transactions on pattern analysis and machine intelligence, 40(12):2897–2905, 2018.

[bib2] N.A. Ahmed and D.V. Gokhale. Entropy expressions and their estimators for multivariate distributions. IEEE Transactions on Information Theory, 35(3):688–692, 1989. doi: 10.1109/18.30996.

[bib3] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

[bib4] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.

[bib5] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32, 2019.

[bib6] Randall Balestriero and Richard Baraniuk. A spline theory of deep networks. In Proc. ICML, volume 80, pages 374–383, Jul. 2018.

[bib7] Randall Balestriero and Yann LeCun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods, 2022. URL https://arxiv.org/abs/2205.11508.

[bib8] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.

[bib9] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

[bib10] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems, 30, 2017.

[bib11] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.

[bib12] Itamar Ben-Ari and Ravid Shwartz-Ziv. Attentioned convolutional LSTM inpainting network for anomaly detection in videos. arXiv preprint arXiv:1811.10228, 2018.

[bib13] Brendon J Brewer. Computing entropies with nested sampling. Entropy, 19(8):422, 2017.

[bib14] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.

[bib15] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.

[bib16] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.

[bib17] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.

[bib18] Elliott Ward Cheney and William Allan Light. A course in approximation theory, volume 101. American Mathematical Soc., 2009.

[bib19] Ralph B. D’Agostino. An omnibus test of normality for moderate and large size samples. Biometrika, 58(2):341–348, 1971. ISSN 00063444. URL http://www.jstor.org/stable/2334522.

[bib20] Zheng Dang, Kwang Moo Yi, Yinlin Hu, Fei Wang, Pascal Fua, and Mathieu Salzmann. Eigendecomposition-free training of deep networks with zero eigenvalue-based losses. In Proceedings of the European Conference on Computer Vision (ECCV), pages 768–783, 2018.

[bib21] Yann Dubois, Benjamin Bloem-Reddy, Karen Ullrich, and Chris J Maddison. Lossy compression for lossless prediction. Advances in Neural Information Processing Systems, 34, 2021.

[bib22] Magnus Egerstedt and Clyde Martin. Control theoretic splines: optimal control, statistics, and path planning. Princeton University Press, 2009.

[bib23] Cesare Fantuzzi, Silvio Simani, Sergio Beghelli, and Riccardo Rovatti. Identification of piecewise affine models in noisy environment. International Journal of Control, 75(18):1472–1485, 2002.

[bib24] Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. In International Conference on Learning Representations, 2019.

[bib25] Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017, 2020.

[bib26] Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.

[bib27] Mike B. Giles. Collected matrix derivative results for forward and reverse mode algorithmic differentiation. In Christian H. Bischof, H. Martin Bücker, Paul Hovland, Uwe Naumann, and Jean Utke, editors, Advances in Automatic Differentiation, pages 35–44, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.

[bib28] Z. Goldfeld, E. van den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy. Estimating information flow in neural networks. ArXiv e-prints, 2018.

[bib29] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299. PMLR, 2018.

[bib30] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning, volume 1. MIT Press, 2016.

[bib31] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.

[bib32] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159, 2017.

[bib33] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018.

[bib34] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[bib35] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.

[bib36] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

[bib37] Marco Huber, Tim Bailey, Hugh Durrant-Whyte, and Uwe Hanebeck. On entropy approximation for Gaussian mixture random vectors. In IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pages 181–188, 2008. doi: 10.1109/MFI.2008.4648062.

[bib38] Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix backpropagation for deep networks with structured layers. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 2965–2973, 2015. doi: 10.1109/ICCV.2015.339.

[bib39] Jonathan Kahana and Yedid Hoshen. A contrastive objective for learning disentangled representations. arXiv preprint arXiv:2203.11284, 2022.

[bib40] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. MIT-CSAIL-TR-2018-014, Massachusetts Institute of Technology, 2018.

[bib41] Kenji Kawaguchi, Zhun Deng, Kyle Luh, and Jiaoyang Huang. Robustness implies generalization via data-dependent generalization bounds. In International Conference on Machine Learning (ICML), 2022.

[bib42] Artemy Kolchinsky and Brendan D Tracey. Estimating mixture entropy with pairwise distances. Entropy, 19(7):361, 2017.

[bib43] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[bib44] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, 2009.

[bib45] Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34, 2021a.

[bib46] Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama, John Canny, and Ian Fischer. Compressive visual representations. Advances in Neural Information Processing Systems, 34, 2021b.

[bib47] Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.

[bib48] Julieta Martinez, Jashan Shewakramani, Ting Wei Liu, Ioan Andrei Bârsan, Wenyuan Zeng, and Raquel Urtasun. Permute, quantize, and fine-tune: Efficient compression of neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15699–15708, 2021.

[bib49] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020.

[bib50] Neeraj Misra, Harshinder Singh, and Eugene Demchuk. Estimation of the entropy of a multivariate normal distribution. Journal of Multivariate Analysis, 92(2):324–342, 2005. ISSN 0047-259X. doi: 10.1016/j.jmva.2003.10.003.

[bib51] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[bib52] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

[bib53] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[bib54] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[bib55] Zoe Piran, Ravid Shwartz-Ziv, and Naftali Tishby. The dual information bottleneck. arXiv preprint arXiv:2006.04641, 2020.

[bib56] Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, pages 5628–5637. PMLR, 2019.

[bib57] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[bib58] Ravid Shwartz-Ziv. Information flow in deep neural networks. arXiv preprint arXiv:2202.06749, 2022.

[bib59] Ravid Shwartz-Ziv and Alexander A Alemi. Information in infinite ensembles of infinitely-wide neural networks. In Symposium on Advances in Approximate Bayesian Inference, pages 1–17. PMLR, 2020.

[bib60] Ravid Shwartz-Ziv and Yann LeCun. To Compress or Not to Compress: Self-Supervised Learning and Information Theory, A Review. 2023.

[bib61] Ravid Shwartz-Ziv and Naftali Tishby. Compression of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017a.

[bib62] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017b.

[bib63] Ravid Shwartz-Ziv, Amichai Painsky, and Naftali Tishby. Representation compression and generalization in deep neural networks, 2018.

[bib64] Ravid Shwartz-Ziv, Micah Goldblum, Hossein Souri, Sanyam Kapoor, Chen Zhu, Yann LeCun, and Andrew Gordon Wilson. Pre-train your loss: Easy Bayesian transfer learning with informative priors. arXiv preprint arXiv:2205.10279, 2022.

[bib65] Ravid Shwartz-Ziv, Randall Balestriero, Kenji Kawaguchi, and Yann LeCun. What do we maximize in self-supervised learning and why does generalization emerge?, 2023. URL https://openreview.net/forum?id=tuE-MnjN7DV.

[bib66] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

[bib67] Thomas Steinke and Lydia Zakynthinou. Reasoning about generalization via conditional mutual information. In Conference on Learning Theory, pages 3437–3452. PMLR, 2020.

[bib68] Haoqing Wang, Xun Guo, Zhi-Hong Deng, and Yan Lu. Rethinking minimal sufficient representation in contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16041–16050, 2022.

[bib69] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.

[bib70] Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems, 30, 2017.

[bib71] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310–12320. PMLR, 2021.

[bib72] Zhanghao Zhouyin and Ding Liu. Understanding neural networks with logarithm determinant entropy estimator. arXiv preprint arXiv:2105.03705, 2021.

[bib73] Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pages 12979–12990. PMLR, 2021.