
What Do We Maximize in Self-Supervised Learning?

Ravid Shwartz-Ziv, Randall Balestriero, Yann LeCun

Abstract

In this paper, we examine self-supervised learning methods, particularly VICReg, to provide an information-theoretical understanding of their construction. As a first step, we demonstrate how information-theoretic quantities can be obtained for a deterministic network, offering a possible alternative to prior work that relies on stochastic models. This enables us to demonstrate how VICReg can be (re)discovered from first principles and its assumptions about data distribution. Furthermore, we empirically demonstrate the validity of our assumptions, confirming our novel understanding of VICReg. Finally, we believe that the derivation and insights we obtain can be generalized to many other SSL methods, opening new avenues for theoretical and practical understanding of SSL and transfer learning.


Introduction

Self-Supervised Learning (SSL) algorithms (Bromley et al., 1993) learn representations using a proxy objective (i.e., the SSL objective) between inputs and self-defined signals. The results indicate that the learned representations can generalize well to a wide range of downstream tasks (Chen et al., 2020; Misra & Maaten, 2020), even when the SSL objective does not use downstream supervision during training. In SimCLR (Chen et al., 2020), for example, a contrastive loss is defined between images with different augmentations (i.e., one as input and the other as a self-supervised signal). The pretrained model is then used as a feature extractor, and its features are applied to various applications, including image classification, object detection, instance segmentation, and pose estimation (Caron et al., 2021). However, despite this success in practice, only a few works (Arora et al., 2019; Lee et al., 2021a) provide theoretical insights into the learning efficacy of SSL.

In recent years, information-theoretic methods have played a key role in several notable deep learning achievements, from practical applications in representation learning such as the variational information bottleneck (Alemi et al., 2016), to theoretical investigations (e.g., the generalization bounds induced by mutual information (Xu & Raginsky, 2017; Steinke & Zakynthinou, 2020; Shwartz-Ziv, 2022)). Moreover, different deep learning problems have been successfully approached by developing and applying novel estimators and learning principles derived from information-theoretic quantities, such as mutual information estimation. Many works have attempted to analyze SSL from an information-theoretic perspective. An example is the use of the mutual information neural estimator (MINE) (Belghazi et al., 2018) in representation learning (Hjelm et al., 2018), in conjunction with the renowned information maximization (InfoMax) principle (Linsker, 1988). However, taken together, these works can be confusing: numerous objective functions are presented, some contradicting each other, along with many implicit assumptions. Moreover, these works rely on a crucial assumption: a stochastic (often Gaussian) DN mapping, which is rarely the case nowadays.

  • Equal contribution. 1 New York University 2 Meta AI Research. Correspondence to: Ravid Shwartz-Ziv <ravidziv@gmail.com>.

This paper presents a unified framework for SSL methods from an information theory perspective that can be applied to deterministic DN training. We summarize our contributions in two points: (i) first, in order to study deterministic DNs from an information theory perspective, we shift the stochasticity to the DN input, which is a much more faithful assumption for current training techniques; (ii) second, based on this formulation, we analyze how current SSL methods that use deterministic networks optimize information-theoretic quantities.

Background

Continuous Piecewise Affine (CPA) Mappings. A rich class of functions emerges from piecewise polynomials: spline operators. In short, given a partition Ω of a domain ℝ^D, a spline of order k is a mapping defined by a polynomial of order k on each region ω ∈ Ω, with continuity constraints on the entire domain for the derivatives of order 0, ..., k−1. As we will focus on affine splines (k = 1), we define only this case for concreteness. A K-dimensional affine spline f produces its output via

$$
f(z)=\sum_{\omega\in\Omega}\left(A_{\omega}z+b_{\omega}\right)\mathbb{1}_{\{z\in\omega\}},
$$

with input z ∈ ℝ^D and A_ω ∈ ℝ^{K×D}, b_ω ∈ ℝ^K, ∀ω ∈ Ω the per-region slope and offset parameters respectively, with the key constraint that the entire mapping is continuous over the domain (f ∈ C^0(ℝ^D)). Spline operators, and especially affine spline operators, have been widely used in function approximation theory (Cheney & Light, 2009), optimal control (Egerstedt & Martin, 2009), statistics (Fantuzzi et al., 2002), and related fields.

Deep Networks. A deep network (DN) is a (non-linear) operator f_Θ with parameters Θ that maps an input x ∈ ℝ^D to a prediction y ∈ ℝ^K. Precise definitions of DN operators can be found in Goodfellow et al. (2016). We will omit the Θ notation for clarity unless needed. The only assumption we require for our study is that the non-linearities present in the DN are CPA, as is the case with (leaky-)ReLU, absolute value, and max-pooling. In that case, the entire input-output mapping becomes a CPA spline with an implicit partition Ω, a function of the weights and architecture of the network (Montufar et al., 2014; Balestriero & Baraniuk, 2018). For smooth nonlinearities, our results hold from a first-order Taylor approximation argument.
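To make the CPA view concrete, here is a minimal sketch (a toy two-layer ReLU network with arbitrary weights, not from the paper) showing how the per-region slope A_ω and offset b_ω follow directly from the activation pattern of the region containing a given input:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer ReLU network f(x) = W2 @ relu(W1 @ x + c1) + c2.
# The weights are arbitrary; any CPA network admits the same decomposition.
W1, c1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, c2 = rng.normal(size=(2, 5)), rng.normal(size=2)

def f(x):
    return W2 @ np.maximum(W1 @ x + c1, 0.0) + c2

def region_affine_params(x):
    """Return (A_w, b_w) of the affine map f restricted to the region of x."""
    q = (W1 @ x + c1 > 0).astype(float)   # activation pattern = region code
    A = W2 @ (q[:, None] * W1)            # per-region slope A_w
    b = W2 @ (q * c1) + c2                # per-region offset b_w
    return A, b

x = rng.normal(size=3)
A, b = region_affine_params(x)
# Within the region, f coincides with its affine map: f(x) = A_w x + b_w,
# and the same affine map stays exact for small perturbations inside the region.
assert np.allclose(f(x), A @ x + b)
eps = 1e-6 * rng.normal(size=3)
assert np.allclose(f(x + eps), A @ (x + eps) + b)
```

The region code q is exactly the implicit partition Ω at work: two inputs with the same activation pattern share the same (A_ω, b_ω).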

Self-Supervised Learning. Joint embedding methods learn the DN parameters Θ without supervision and without input reconstruction. Due to this formulation, the difficulty of SSL is to produce a good representation for downstream tasks whose labels are not available during training, while avoiding the trivial solution where the model maps all inputs to a constant output. Many methods have been proposed to solve this problem. Contrastive methods learn representations by contrasting positive and negative examples, e.g., SimCLR (Chen et al., 2020) and its InfoNCE criterion (Oord et al., 2018). Other recent work introduced non-contrastive methods that employ different regularization schemes to prevent collapse of the representation. Several papers use stop-gradients and extra predictors to avoid collapse (Chen & He, 2021; Grill et al., 2020), while Caron et al. (2020) use an additional clustering step. As opposed to contrastive methods, non-contrastive methods do not explicitly rely on negative samples. Of particular interest to us is the VICReg method (Bardes et al., 2021), which considers two embedding batches Z = [f(x_1), ..., f(x_N)] and Z′ = [f(x′_1), ..., f(x′_N)], each of size (N × K). Denoting by C the (K × K) covariance matrix obtained from [Z, Z′], we obtain the VICReg triplet loss

$$
\mathcal{L}=\frac{1}{K}\sum_{k=1}^{K}\left(\alpha\max\left(0,\gamma-\sqrt{C_{k,k}+\epsilon}\right)+\beta\sum_{k'\neq k}\left(C_{k,k'}\right)^{2}\right)+\gamma\,\|Z-Z'\|_{F}^{2}/N.
$$
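For concreteness, here is a simplified numerical sketch of this triplet loss (NumPy, following the formulation above rather than the official VICReg implementation; note that γ plays a double role here, as both the variance margin and the invariance weight, matching the display):

```python
import numpy as np

def vicreg_loss(Z, Zp, alpha=25.0, beta=1.0, gamma=1.0, eps=1e-4):
    """Sketch of the triplet loss above for two (N, K) embedding batches."""
    N, K = Z.shape
    # (K, K) covariance matrix of the concatenated batches [Z, Z'].
    C = np.cov(np.concatenate([Z, Zp], axis=0), rowvar=False)
    # Variance term: hinge on the per-dimension standard deviation.
    var_term = alpha * np.maximum(0.0, gamma - np.sqrt(np.diag(C) + eps)).sum() / K
    # Covariance term: squared off-diagonal entries of C.
    off = C ** 2
    off -= np.diag(np.diag(off))
    cov_term = beta * off.sum() / K
    # Invariance term: ||Z - Z'||_F^2 / N.
    inv_term = gamma * np.sum((Z - Zp) ** 2) / N
    return var_term + cov_term + inv_term

rng = np.random.default_rng(0)
Z = rng.normal(size=(64, 8))
# A collapsed (constant) representation pays a large variance penalty,
# so it scores worse than a well-spread one even with zero invariance cost.
assert vicreg_loss(np.zeros((64, 8)), np.zeros((64, 8))) > vicreg_loss(Z, Z)
```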

Our goal will now be to formulate SSL as an information-theoretic problem from which we can precisely relate VICReg to known methods, even with a deterministic network.

Deep Networks and Information Theory. Recently, information-theoretic methods have played a key role in several remarkable deep learning achievements (Alemi et al., 2016; Xu & Raginsky, 2017; Steinke & Zakynthinou, 2020; Shwartz-Ziv & Tishby, 2017). Moreover, different deep learning problems have been successfully approached by developing and applying information-theoretic estimators and learning principles (Hjelm et al., 2018; Belghazi et al., 2018; Piran et al., 2020; Shwartz-Ziv et al., 2018). There is, however, a major problem when it comes to analyzing information-theoretic objectives in deterministic deep neural networks: the source of randomness. The mutual information between the input and the representation in such networks is either infinite, resulting in an ill-posed optimization problem, or piecewise constant, making gradient-based optimization ineffective (Amjad & Geiger, 2019). To solve these problems, researchers have proposed several solutions. For SSL, stochastic deep networks with variational bounds can be used, where the output of the deterministic network serves as the parameters of a conditional distribution (Lee et al., 2021b; Shwartz-Ziv & Alemi, 2020). Dubois et al. (2021) suggested another option, which assumes that the randomness of data augmentation among the two views is the source of stochasticity in the network. For supervised learning, Goldfeld et al. (2018) introduced an auxiliary (noisy) DN framework by injecting additive noise into the model, and demonstrated that it is a good proxy for the original (deterministic) DN in terms of both performance and representation. Finally, Achille & Soatto (2018) found that minimizing a stochastic network with a regularizer is equivalent to minimizing cross-entropy over deterministic DNs with multiplicative noise. However, all of these methods assume that the noise comes from the model itself, which contradicts current training methods. In this work, we explicitly assume that the stochasticity comes from the data, which is a less restrictive assumption and does not require changing current algorithms.

Information Maximization of Deep Networks Outputs

This section first sets up notation and assumptions on the information-theoretic challenges in self-supervised learning (Section 3.1) and on the data distribution (Section 3.2), so that any training sample x can be seen as coming from a single Gaussian distribution, x ∼ N(µ_x, Σ_x). From this we obtain that the output of any deep network f(x) corresponds to a mixture of truncated Gaussians (Section 3.3). In particular, it falls back to a single Gaussian under a small-noise assumption (det(Σ) → ε). These results enable information measures to be applied to deterministic DNs. We then recover known SSL methods (Bardes et al., 2021) by making different assumptions about the data distribution and estimating their information.

SSL as an Information-Theoretic Problem

To better grasp the difference between key SSL methods, we first formulate the general SSL goal from an information-theoretical perspective.

We start with the MultiView InfoMax principle, i.e., maximizing the mutual information between the representations of the two views. To do so, as shown in Federici et al. (2020), we need to maximize I(Z; X′) and I(Z′; X). We can do so through the lower bound

$$
I(Z;X')\geq H(Z)+\mathbb{E}\left[\log q(z\mid x')\right],
$$

where H(Z) is the entropy of Z. In supervised learning, where we need to maximize I(Z; Y), the labels (Y) are fixed, the entropy term H(Y) is constant, and one only needs to optimize the log-loss E[log q(z | x)] (cross-entropy or square loss). However, in SSL, the entropies H(Z) and H(Z′) are not constant and can change throughout the learning process. Therefore, maximizing only E[log q(z | x′)] will cause a collapse to the trivial solution of constant representations (where the entropy goes to zero). To regularize these entropies, i.e., prevent collapse, different methods take different approaches to implicitly regularizing this information. To recover them in Section 4, we must first introduce the notation and results around the data distribution (Section 3.2) and how a DN transforms that distribution (Section 3.3).
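The collapse argument can be made concrete with a small numerical illustration (a Gaussian representation is assumed purely for simplicity): the differential entropy H(Z) diverges to −∞ as the representation's variance shrinks, so maximizing E[log q(z | x′)] alone, which a constant map does trivially, must be balanced against H(Z).

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy (in nats) of a Gaussian with covariance Sigma."""
    d = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

# As the representation collapses toward a constant (variance -> 0),
# the entropy term H(Z) goes to -infinity and penalizes the trivial solution.
entropies = [gaussian_entropy(s * np.eye(4)) for s in (1.0, 1e-2, 1e-4)]
assert entropies[0] > entropies[1] > entropies[2]
```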

Data Distribution Hypothesis

Our first step is to assess how the output random variables of the network are distributed, assuming a distribution on the data itself. Under the manifold hypothesis, any point can be seen as a Gaussian random variable with a low-rank covariance matrix aligned with the tangent space of the data manifold. Therefore, we will consider throughout this study the conditioning of a latent representation with respect to the mean of the observation, i.e., X | x* ∼ N(x*, Σ_{x*}), where the eigenvectors of Σ_{x*} lie in the same linear subspace as the tangent space of the data manifold at x*, which varies with the position of x* in space.

Hence a dataset is considered to be a collection of {x*_n, n = 1, ..., N}, and the full data distribution to be a sum of low-rank-covariance Gaussian densities, as in

$$
X\sim\sum_{n=1}^{N}\mathcal{N}\left(x^{*}_{n},\Sigma_{x^{*}_{n}}\right)^{1\{T=n\}},\qquad T\sim\mathrm{Cat}(N),
$$

with T the uniform Categorical random variable. To keep things simple and without loss of generality, we consider that the effective supports of N(x*_i, Σ_{x*_i}) and N(x*_j, Σ_{x*_j}), i ≠ j, do not overlap. This remains general, as it is enough to cover the domain of the data manifold overall without overlap between the different Gaussians. Hence, in general, we have

$$
p(x)=\frac{1}{N}\sum_{n=1}^{N}\mathcal{N}\left(x;x^{*}_{n},\Sigma_{x^{*}_{n}}\right)\approx\frac{1}{N}\mathcal{N}\left(x;x^{*}_{n(x)},\Sigma_{x^{*}_{n(x)}}\right),
$$

where N(x; ·, ·) is the Gaussian density at x and n(x) = arg min_n (x − x*_n)^T Σ_{x*_n}^{-1} (x − x*_n). This assumption that a dataset is a mixture of Gaussians with non-overlapping supports will simplify our derivations below, and could be extended to the general case if needed. Note that this is not restrictive since, given a sufficiently large N, the above can represent any manifold to an arbitrarily good approximation.
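A minimal sketch of this data model (prototypes and covariances are hypothetical, illustrative values), including the Mahalanobis rule defining n(x):

```python
import numpy as np

rng = np.random.default_rng(1)

# N prototypes x*_n with small covariances (illustrative, not from the paper).
N, D = 4, 2
protos = rng.normal(scale=10.0, size=(N, D))   # well-separated means
covs = np.stack([0.01 * np.eye(D)] * N)        # small, effectively disjoint supports

def sample(n_samples):
    """Draw from X ~ sum_n N(x*_n, Sigma_{x*_n})^{T=n}, T ~ Cat(N)."""
    T = rng.integers(N, size=n_samples)
    X = np.stack([rng.multivariate_normal(protos[t], covs[t]) for t in T])
    return X, T

def n_of_x(x):
    """n(x) = argmin_n (x - x*_n)^T Sigma_{x*_n}^{-1} (x - x*_n)."""
    d = x - protos
    inv = np.linalg.inv(covs)
    return int(np.argmin(np.einsum('nd,nde,ne->n', d, inv, d)))

X, T = sample(200)
recovered = np.array([n_of_x(x) for x in X])
# With non-overlapping effective supports, n(x) recovers the mixture index.
assert (recovered == T).mean() > 0.95
```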

Data Distribution After Deep Network Transformation

Consider an affine spline operator f (Eq. 1) that maps a space of dimension D to a space of dimension K with K ≥ D. The span, which we denote as the image, of this mapping is given by

$$
\mathrm{Im}(f)\triangleq\{f(x):x\in\mathbb{R}^{D}\}=\bigcup_{\omega\in\Omega}\mathrm{Aff}(\omega;A_{\omega},b_{\omega}),
$$

with Aff(ω; A_ω, b_ω) = {A_ω x + b_ω : x ∈ ω} the affine transformation of region ω by the per-region parameters A_ω, b_ω, and with Ω the partition of the input space in which x lives. We also provide an analytical form of the per-region affine mappings in Section 2. Hence, the DN mapping consists of affine transformations on each input-space partition region ω ∈ Ω, based on the coordinate change induced by A_ω and the shift induced by b_ω.

When the input space is equipped with a density distribution, this density is transformed by the mapping f. In general, finding the density of f(X) is an intractable task. However, given our disjoint-support assumption from Section 3.2, we can arbitrarily increase the representation power of the density by increasing the number of prototypes N. In doing so, the support of each Gaussian is included within the region ω in which its mean lies, leading to the following result.

Theorem 1. Given the setting of Equation (4), the unconditional DN output density, denoted as Z, is a mixture of the affinely transformed distributions x | x*_{n(x)}, e.g., for the Gaussian case,

$$
Z\sim\sum_{n=1}^{N}\mathcal{N}\left(A_{\omega(x^{*}_{n})}x^{*}_{n}+b_{\omega(x^{*}_{n})},\;A_{\omega(x^{*}_{n})}\Sigma_{x^{*}_{n}}A_{\omega(x^{*}_{n})}^{T}\right)^{1\{T=n\}},
$$

where ω(x*_n) = ω ∈ Ω ⟺ x*_n ∈ ω is the partition region in which the prototype x*_n lives.

The proof of the above involves the fact that if ∫_ω p(x | x*_{n(x)}) dx ≈ 1, then f is linear within the effective support of p. Therefore, any sample from p will almost surely lie within a single region ω ∈ Ω, and the entire mapping can be considered linear with respect to p. Thus, the output distribution is a linear transformation of the input distribution based on the per-region affine mapping.
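Theorem 1 can be checked numerically on a toy CPA network (random, hypothetical weights): for small input noise around a prototype, the empirical output mean and covariance match the affine prediction of the single region containing the mean.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy two-layer ReLU network (arbitrary weights, for illustration only).
W1, c1 = rng.normal(size=(6, 3)), rng.normal(size=6)
W2, c2 = rng.normal(size=(2, 6)), rng.normal(size=2)

def f(x):
    return np.maximum(x @ W1.T + c1, 0.0) @ W2.T + c2

x_star = rng.normal(size=3)
Sigma = 1e-6 * np.eye(3)                      # small input noise
X = rng.multivariate_normal(x_star, Sigma, size=20000)
Z = f(X)

# Per-region slope at x*: the affine map of the region of x*.
q = (W1 @ x_star + c1 > 0).astype(float)
A = W2 @ (q[:, None] * W1)

# Output moments predicted by that single region's affine map.
mu_pred = f(x_star[None])[0]
cov_pred = A @ Sigma @ A.T

assert np.allclose(Z.mean(axis=0), mu_pred, atol=1e-4)
assert np.allclose(np.cov(Z, rowvar=False), cov_pred, atol=1e-7, rtol=0.1)
```

With noise this small, all samples fall in one region ω, so the output is (up to sampling error) a single affinely transformed Gaussian, exactly as the theorem predicts.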


Information Optimization and Optimality

Based on our analysis, we now show how specific SSL algorithms can be derived. According to Section 3.1, we want to maximize I(Z; X′) and I(Z′; X). When the input noise is small, we can reduce the conditional output density p(z | x*) to a single Gaussian:

$$
p(z\mid x^{*}_{n})=\mathcal{N}\left(z;\mu_{n},\Sigma_{n}\right),\quad \mu_{n}=A_{\omega(x^{*}_{n})}x^{*}_{n}+b_{\omega(x^{*}_{n})},\;\ \Sigma_{n}=A_{\omega(x^{*}_{n})}\Sigma_{x^{*}_{n}}A_{\omega(x^{*}_{n})}^{T},
$$

where we abbreviated the parameters. Using this, together with the result from Section 3.1, we see that one should optimize both H(Z | X′) and H(Z). As in a standard regression task, we assume a Gaussian observation model, i.e., p(z | z′) = N(z′, Σ_r). Using the mean squared error as a loss function in regression tasks is a particular application of this assumption, with Σ_r = I. To compute the expected loss, we need to marginalize out the stochasticity in Z′, which makes the conditional decoding map a Gaussian:

$$
q(z\mid X'=x^{*}_{n})=\int q(z\mid z')\,p(z'\mid x^{*}_{n})\,dz',
$$

which gives the distribution N(µ_n, Σ_r + Σ_n), meaning that we can lower bound the mutual information with

$$
I(Z;X')\geq H(Z)+\mathbb{E}\left[\log q(z\mid x')\right]=H(Z)-\frac{d}{2}\log 2\pi\Sigma_{r}-\sum_{n=1}^{N}\frac{1}{2}\left(z_{n}-z'_{n}\right)^{T}\Sigma_{r}^{-1}\left(z_{n}-z'_{n}\right)-\log\left|\Sigma_{n}\right|.
$$

What happens if we attempt to optimize this objective? The only intractable component is the entropy of Z. We begin by examining Z itself. It is natural to ask why the entropy of Z does not increase to infinity. Intuitively, the answer is that H(Z) and H(Z | X′) are tied together, and one cannot increase without the other. Now, recalling that under our distribution assumptions Z is a mixture of Gaussians (recall Thm. 1), we can use existing upper and lower bounds for this case; for example, the ones in Moshksar & Khandani (2016).
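To illustrate how such bounds behave, consider a toy two-component Gaussian mixture (illustrative parameters, not from the paper): a moment-matched Gaussian upper-bounds the mixture entropy, while the weighted component entropies lower-bound it, and a Monte-Carlo estimate sits in between.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)

# Two well-separated, equally weighted components in 2-D (illustrative).
mus = np.array([[0.0, 0.0], [6.0, 0.0]])
covs = np.array([np.eye(2), np.eye(2)])
w = np.array([0.5, 0.5])

def mixture_logpdf(x):
    dens = sum(wi * multivariate_normal.pdf(x, m, c)
               for wi, m, c in zip(w, mus, covs))
    return np.log(dens)

# Monte-Carlo estimate of H(Z) = -E[log p(Z)].
T = rng.choice(2, size=50000, p=w)
X = mus[T] + rng.standard_normal((50000, 2))   # unit-covariance components
h_mc = -mixture_logpdf(X).mean()

# Moment-matched Gaussian: Sigma_Z = sum_i w_i (Sigma_i + mu_i mu_i^T) - mu mu^T,
# and H(Z) <= 0.5 * log det(2 pi e Sigma_Z) since the Gaussian maximizes entropy.
mu = w @ mus
Sigma_Z = sum(wi * (c + np.outer(m, m)) for wi, m, c in zip(w, mus, covs)) - np.outer(mu, mu)
h_upper = 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * Sigma_Z))

# Weighted component entropies lower-bound the mixture entropy.
h_lower = sum(wi * 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * c))
              for wi, c in zip(w, covs))

assert h_lower <= h_mc <= h_upper
```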

Deriving VICReg From First Principles

We now propose to recover VICReg from first principles using the information-theoretic formulation above.

Recall that our goal is to estimate the entropy H(Z) in Equation (7), where Z is a Gaussian mixture. This quantity has no closed form for a mixture of Gaussians, due to the logarithm of a sum of exponential functions, except for the special case of a

Figure 1. The network output with VICReg training is more Gaussian for small input noise. The p-value of the normality test for different SSL models trained on CIFAR-10 at different input noise levels. The x-axis is the coefficient multiplying the data distribution's standard deviation to obtain the standard deviation of the Gaussian that samples around each image. The dashed line marks the point at which the null hypothesis (Gaussian distribution of the network output) can be rejected with 99% confidence.


single Gaussian density. There are, however, several approximations in the literature, including both upper and lower bounds. Among these methods, some use the logarithmic sum of the probabilities (Kolchinsky & Tracey, 2017), and some use entropy-adjusted logarithmic probabilities (Huber et al., 2008).

An even simpler solution is to approximate the entire mixture as a single Gaussian capturing only the first two moments of the mixture distribution. Since the Gaussian distribution maximizes the entropy for a given covariance matrix, this method provides an upper bound on our entropy of interest, H(Z). In this case, denoting by Σ_Z the covariance matrix of Z, we find that we should maximize the following objective:

$$
\max\;\frac{1}{2}\log\left|\Sigma_{Z}+\Sigma_{r}\right|-\sum_{n=1}^{N}\frac{1}{2}\left(z_{n}-z'_{n}\right)^{T}\Sigma_{r}^{-1}\left(z_{n}-z'_{n}\right),
$$

where Σ_r is constant with respect to our optimization process, and the second term is the prediction performance of one representation from the other. A key result from Shi et al. (2009) connects the eigenvectors and eigenvalues of Σ_Z to those of each component Σ_i, and shows that when the separation (µ_i − µ_j) Σ_i^{-1} (µ_i − µ_j)^T between the different components is large enough, which holds true in our case as per our data distribution model, the eigenfunctions of each Σ_i are approximately the eigenfunctions of Σ_Z. Therefore, in our case, since all those eigenvalues are tied, we only need to find the most efficient way to maximize |Σ_Z|.

We know that the determinant of a matrix is the product of its eigenvalues, and for every positive semi-definite matrix the maximum eigenvalue is greater than or equal to each diagonal element. Therefore, under a constraint on the eigenvalues of the matrix, the most efficient way to increase the determinant is to decrease the off-diagonal terms and increase the diagonal terms. By setting Σ_r = I, we therefore fully recover the VICReg objective.
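The determinant argument is essentially Hadamard's inequality: for a positive semi-definite matrix, det Σ ≤ ∏_k Σ_{k,k}, with equality iff Σ is diagonal, so shrinking the off-diagonal terms while growing the diagonal is the efficient way to grow |Σ_Z|. A quick numerical check (random positive-definite matrix, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)

# Random positive-definite covariance matrix.
B = rng.normal(size=(5, 5))
C = B @ B.T + 1e-3 * np.eye(5)

# Hadamard's inequality: det(C) <= product of diagonal entries,
# with equality when the off-diagonal terms vanish.
assert np.linalg.det(C) <= np.prod(np.diag(C))
assert np.isclose(np.linalg.det(np.diag(np.diag(C))), np.prod(np.diag(C)))
```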

SimCLR vs VICReg

Lee et al. (2021b) connected the SimCLR objective (Chen et al., 2020) to the variational bound on the information between representations (Poole et al., 2019) by using the von Mises-Fisher distribution as the conditional variational family. Based on our analysis in Section 4.1, we can identify two main differences between SimCLR and VICReg: (i) the conditional distribution p(z | x′): SimCLR assumes a von Mises-Fisher distribution for the encoder, while VICReg assumes a Gaussian distribution; (ii) entropy estimation: the entropy term in SimCLR, H(Z) = ∫ p(z | x′) p(x′) dx′, is approximated based on a finite sum over the input samples, whereas VICReg estimates the entropy of Z from its first two moments only. Creating self-supervised methods that combine these two differences would be an interesting future research direction. In theory, neither of these assumptions is more valid than the other; the choice depends on the specific task and the computational constraints.

Empirical Evaluation

The next step is to verify the validity of our assumptions. Based on the theory presented in Section 3.3, the conditional output density p(z | x*) reduces to a single Gaussian with decreasing input noise. We validate this using a ResNet-18 model trained with SimCLR or VICReg on the CIFAR-10 dataset (Krizhevsky, 2009). From the test dataset, we draw 512 Gaussian samples around each image and analyze whether these samples remain Gaussian (for each image) at the penultimate layer of the DN, that is, before the linear classification layer, independently for each output dimension. We then employ D'Agostino and Pearson's test (D'Agostino, 1971) to compute the p-value under the null hypothesis that the samples come from a normal distribution; the test measures deviation from normality using a combination of kurtosis and skewness statistics. The process is repeated for different noise standard deviations. Figure 1 shows the p-value as a function of the normalized standard deviation. We observe that the network's output is indeed Gaussian with high probability for small input noise. As we increase the input noise, the output becomes less Gaussian, until the normal hypothesis can be rejected with 99% confidence. Moreover, VICReg is, interestingly, more 'Gaussian' than SimCLR, which may be because it optimizes only the second moments of the output density to regularize H(Z).
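The normality test used here is available as `scipy.stats.normaltest` (D'Agostino and Pearson's K² test). A minimal, self-contained sketch of the per-image procedure, with a toy one-dimensional CPA map standing in for the trained network (the actual experiment uses a trained ResNet-18):

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(5)

def penultimate(x):
    """Toy CPA map standing in for one output dimension of the trained DN."""
    return np.maximum(x, 0.0) - 0.5 * np.maximum(-x - 2.0, 0.0)

x_star = 1.0                              # stands in for one test "image"
p_values = {}
for sigma in (0.01, 5.0):                 # small vs. large input noise
    samples = penultimate(x_star + sigma * rng.standard_normal(512))
    _, p = normaltest(samples)            # D'Agostino & Pearson's K^2 test
    p_values[sigma] = p

# Small noise keeps the samples inside one affine region, so the output stays
# Gaussian (large p-value); large noise straddles the kinks of the CPA map,
# and normality is rejected.
assert p_values[5.0] < 1e-4
assert p_values[0.01] > p_values[5.0]
```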

Conclusions

In this study, we examined SSL objective functions from an information-theoretic perspective. Our analysis, based on transferring the required stochasticity to the input distribution, shows how SSL objectives can be derived even when using deterministic DNs. In the second part, we rediscovered VICReg's loss function from first principles and exposed its implicit assumptions. In short, VICReg performs a crude upper-bound estimate of the output density's entropy by approximating that distribution with a Gaussian matching the first two moments. Finally, we empirically validated that our assumptions hold in practice, confirming our novel understanding of VICReg. Our work opens many paths for future research: better estimators of information-theoretic quantities that fit our assumptions, and identifying which SSL method is preferable given the properties of the data.


References

In this paper, we examine self-supervised learning methods, particularly VICReg, to provide an information-theoretical understanding of their construction. As a first step, we demonstrate how information-theoretic quantities can be obtained for a deterministic network, offering a possible alternative to prior work that relies on stochastic models. This enables us to demonstrate how VICReg can be (re)discovered from first principles and its assumptions about data distribution. Furthermore, we empirically demonstrate the validity of our assumptions, confirming our novel understanding of VICReg. Finally, we believe that the derivation and insights we obtain can be generalized to many other SSL methods, opening new avenues for theoretical and practical understanding of SSL and transfer learning.

Self-Supervised Learning (SSL) algorithms (Bromley et al., 1993) learn representations using a proxy objective (i.e., SSL objective) between inputs and self-defined signals. The results indicate that the learned representations can generalize well to a wide range of downstream tasks (Chen et al., 2020; Misra & Maaten, 2020), even when the SSL objective does not use downstream supervision during training. In SimCLR (Chen et al., 2020), for example, a contrastive loss is defined between images with different augmentations (i.e., one as input and the other as a self-supervised signal). Then, we take our pre-learned model as a feature extractor and adopt the features to various applications, including image classification, object detection, instance segmentation, and pose estimation (Caron et al., 2021). However, despite the success in practice, only a few works (Arora et al., 2019; Lee et al., 2021a) provide theoretical insights into the learning efficacy of SSL.

In recent years, information theory methods have played a key role in several notable deep learning achievements, from practical applications in representation learning as the variational information bottleneck (Alemi et al., 2016), to theoretical investigations (e.g., the generalization bound induced by mutual information (Xu & Raginsky, 2017; Steinke & Zakynthinou, 2020; Shwartz-Ziv, 2022). Moreover, different deep learning problems have been successfully approached by developing and applying novel estimators and learning principles derived from information-theoretic quantities, such as mutual information estimation. Many works have attempted to analyze SSL from an information theory perspective. An example is the use of the mutual information neural estimator (MINE) (Belghazi et al., 2018) in representation learning (Hjelm et al., 2018) in conjunction with the renowned information maximization (InfoMax) principle (Linsker, 1988). However, looking at these works may be confusing. Numerous objective functions are presented, some contradicting each other, as well as many implicit assumptions. Moreover, these works rely on a crucial assumption: a stochastic (often Gaussian) DN mapping, which is rarely the case nowadays.

This paper presents a unified framework for SSL methods from an information theory perspective which can be applied to deterministic DN training. We summarize our contributions into two points: (i) Firdt, in order to study deterministic DNs from an information theory perspective, we shift stochasticity to the DN input, which is a much more faithful assumption for current training techniques. (ii) Second, based on this formulation, we analyze how current SSL methods that use deterministic networks optimize information-theoretic quantities.

Continuous Piecewise Affine (CPA) Mappings. A rich class of functions emerges from piecewise polynomials: spline operators. In short, given a partition ΩΩ\Omega of a domain ℝDsuperscriptℝ𝐷\mathbb{R}^{D}, a spline of order k𝑘k is a mapping defined by a polynomial of order k𝑘k on each region ω∈Ω𝜔Ω\omega\in\Omega with continuity constraints on the entire domain for the derivatives of order 00,…,k−1𝑘1k-1. As we will focus on affine splines (k=1𝑘1k=1), we define this case only for concreteness. An K𝐾K-dimensional affine spline f𝑓f produces its output via

with input 𝒛∈ℝD𝒛superscriptℝ𝐷\bm{z}\in\mathbb{R}^{D} and 𝑨ω∈ℝK×D,𝒃ω∈ℝK,∀ω∈Ωformulae-sequencesubscript𝑨𝜔superscriptℝ𝐾𝐷formulae-sequencesubscript𝒃𝜔superscriptℝ𝐾for-all𝜔Ω\bm{A}{\omega}\in\mathbb{R}^{K\times D},\bm{b}{\omega}\in\mathbb{R}^{K},\forall\omega\in\Omega the per-region slope and offset parameters respectively, with the key constraint that the entire mapping is continuous over the domain f∈𝒞0​(ℝD)𝑓superscript𝒞0superscriptℝ𝐷f\in\mathcal{C}^{0}(\mathbb{R}^{D}). Spline operators and especially affine spline operators have been widely used in function approximation theory (Cheney & Light, 2009), optimal control (Egerstedt & Martin, 2009), statistics (Fantuzzi et al., 2002), and related fields. Deep Networks. A deep network (DN) is a (non-linear) operator fΘsubscript𝑓Θf_{\Theta} with parameters ΘΘ\Theta that map a input 𝒙∈ℝD𝒙superscriptℝ𝐷\bm{x}\in{\mathbb{R}}^{D} to a prediction 𝒚∈ℝK𝒚superscriptℝ𝐾\bm{y}\in{\mathbb{R}}^{K}. The precise definitions of DNs operators can be found in Goodfellow et al. (2016). We will omit the ΘΘ\Theta notation for clarity unless needed. The only assumption we require for our study is that the non-linearities present in the DN are CPA, as is the case with (leaky-) ReLU, absolute value, and max-pooling. In that case, the entire input-output mapping becomes a CPA spline with an implicit partition ΩΩ\Omega, the function of the weights and architecture of the network (Montufar et al., 2014; Balestriero & Baraniuk, 2018). For smooth nonlinearities, our results hold from a first-order Taylor approximation argument. Self-Supervised Learning. Joint embedding methods learn the DN parameters ΘΘ\Theta without supervision and input reconstruction. Due to this formulation, the difficulty of SSL is to produce a good representation for downstream tasks whose labels are not available during training —while avoiding a trivially simple solution where the model maps all inputs to constant output. Many methods have been proposed to solve this problem. 
Contrastive methods learn representations by contrasting positive and negative examples, e.g. SimCLR (Chen et al., 2020) and its InfoNCE criterion (Oord et al., 2018). Other recent work introduced non-contrastive methods that employ different regularization methods to prevent collapsing of the representation. Several papers used stop-gradients and extra predictors to avoid collapse (Chen & He, 2021; Grill et al., 2020) while Caron et al. (2020) uses an additional clustering step. As opposed to contrastive methods, noncontrastive methods do not explicitly rely on negative samples. Of particular interest to us is the VICReg method (Bardes et al., 2021) that considers two embedding batches 𝒁=[f​(𝒙1),…,f​(𝒙N)]𝒁𝑓subscript𝒙1…𝑓subscript𝒙𝑁\bm{Z}=\left[f(\bm{x}{1}),\dots,f(\bm{x}{N})\right] and 𝒁′=[f​(𝒙1′),…,f​(𝒙N′)]superscript𝒁′𝑓subscriptsuperscript𝒙′1…𝑓subscriptsuperscript𝒙′𝑁\bm{Z}^{\prime}=\left[f(\bm{x}^{\prime}{1}),\dots,f(\bm{x}^{\prime}{N})\right] each of size (N×K)𝑁𝐾(N\times K). Denoting by 𝑪𝑪\bm{C} the (K×K)𝐾𝐾(K\times K) covariance matrix obtained from [𝒁,𝒁′]𝒁superscript𝒁′[\bm{Z},\bm{Z}^{\prime}] we obtain the VICReg triplet loss

Our goal will now be to formulate SSL as an information-theoretic problem from which we can precisely relate VICReg to known methods even with a deterministic network.

This section first sets up notation and assumptions on the information-theoretic challenges in self-supervised learning (Section 3.1) and on our assumptions regarding the data distribution (Section 3.2), so that any training sample $\bm{x}$ can be seen as coming from a single Gaussian distribution, $\bm{x}\sim\mathcal{N}(\mu_{\bm{x}},\Sigma_{\bm{x}})$. From this we obtain that the output of any deep network $f(\bm{x})$ corresponds to a mixture of truncated Gaussians (Section 3.3). In particular, it falls back to a single Gaussian under a small-noise assumption ($\det(\Sigma)\rightarrow\epsilon$). These results enable information measures to be applied to deterministic DNs. We then recover known SSL methods (Bardes et al., 2021) by making different assumptions about the data distribution and estimating their information.

To better grasp the difference between key SSL methods, we first formulate the general SSL goal from an information-theoretical perspective.

We start with the MultiView InfoMax principle, i.e., maximizing the mutual information between the representations of the two views. To do so, as shown in Federici et al. (2020), we need to maximize $I(Z;X^{\prime})$ and $I(Z^{\prime};X)$. We can do so via the lower bound

where $H(Z)$ is the entropy of $Z$. In supervised learning, where we need to maximize $I(Z;Y)$, the labels $Y$ are fixed, so the entropy term $H(Y)$ is constant and one only needs to optimize the log-loss $\mathbb{E}[\log q(y|z)]$ (cross-entropy or square loss). However, in SSL, the entropies $H(Z)$ and $H(Z^{\prime})$ are not constant and change throughout the learning process. Therefore, maximizing only $\mathbb{E}[\log q(z|x^{\prime})]$ drives the representations to the trivial constant solution (where the entropy goes to zero). To prevent this collapse, different methods regularize these entropies, each implicitly regularizing information in its own way. To recover them in Section 4, we must first introduce the notation and results around the data distribution (Section 3.2) and how a DN transforms that distribution (Section 3.3).
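To make the bound $I(Z;X^{\prime})\geq H(Z)+\mathbb{E}[\log q(z|x^{\prime})]$ concrete, here is a small closed-form sketch (our own illustration, not from the paper) for jointly Gaussian $(Z,X^{\prime})$ with a hypothetical Gaussian decoder $q$: the bound is tight exactly when $q$ matches the true conditional, and strictly looser for any other decoder variance.

```python
import numpy as np

# Jointly Gaussian (Z, X') with unit variances and correlation rho.
rho = 0.8
I_true = -0.5 * np.log(1 - rho**2)       # exact mutual information (nats)
H_Z = 0.5 * np.log(2 * np.pi * np.e)     # differential entropy of Z

def bound(v):
    """H(Z) + E[log q(z|x')] for the Gaussian decoder q(z|x') = N(rho*x', v)."""
    resid = 1 - rho**2                   # E[(z - rho*x')^2], the optimal residual
    return H_Z - 0.5 * np.log(2 * np.pi * v) - resid / (2 * v)

print(I_true - bound(1 - rho**2))   # ~0: tight for the true conditional variance
print(I_true - bound(0.9))          # > 0: any other q loosens the bound
```

The gap between $I$ and the bound is exactly the KL divergence between the true conditional and the chosen $q$, which is why optimizing the bound over $q$ is sound.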

Our first step is to assess the distribution of the network's output random variables, assuming a distribution on the data itself. Under the manifold hypothesis, any point can be seen as a Gaussian random variable with a low-rank covariance matrix aligned with the tangent space of the data manifold. Therefore, we consider throughout this study the conditioning of a latent representation with respect to the mean of the observation, i.e., $X|\bm{x}^{*}\sim\mathcal{N}(\bm{x}^{*},\Sigma_{\bm{x}^{*}})$, where the eigenvectors of $\Sigma_{\bm{x}^{*}}$ span the same linear subspace as the tangent space of the data manifold at $\bm{x}^{*}$, which varies with the position of $\bm{x}^{*}$ in space.

Hence, a dataset is considered to be a collection $\{\bm{x}^{*}_{n},n=1,\dots,N\}$ and the full data distribution to be a sum of low-rank-covariance Gaussian densities, as in

with $T$ a uniform Categorical random variable. To keep things simple and without loss of generality, we consider that the effective supports of $\mathcal{N}(\bm{x}^{*}_{i},\Sigma_{\bm{x}^{*}_{i}})$ and $\mathcal{N}(\bm{x}^{*}_{j},\Sigma_{\bm{x}^{*}_{j}})$ do not overlap for $i\neq j$. This remains general, as it suffices to cover the domain of the data manifold overall without overlap between the different Gaussians. Hence, in general, we have

$$ p(\bm{x})\approx\frac{1}{N}\,\mathcal{N}\left(\bm{x};\bm{x}^{*}_{n(\bm{x})},\Sigma_{\bm{x}^{*}_{n(\bm{x})}}\right), \tag{4} $$

where $\mathcal{N}(\bm{x};\cdot,\cdot)$ denotes the Gaussian density evaluated at $\bm{x}$ and $n(\bm{x})=\operatorname*{arg\,min}_{n}(\bm{x}-\bm{x}^{*}_{n})^{T}\Sigma_{\bm{x}^{*}_{n}}(\bm{x}-\bm{x}^{*}_{n})$ selects the nearest prototype. This assumption that the dataset is a mixture of Gaussians with non-overlapping supports will simplify our derivations below and could be extended to the general case if needed. It is not restrictive since, for sufficiently large $N$, the above can approximate any manifold arbitrarily well.
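As an illustration of this non-overlapping mixture model (a toy construction of our own, not an experiment from the paper), prototypes placed along a one-dimensional manifold with rank-one, tangent-aligned covariances generate samples that stay on the manifold:

```python
import numpy as np

rng = np.random.default_rng(0)

# Prototypes x*_n on a 1D manifold (the unit circle in R^2), each with a
# rank-one covariance aligned with the local tangent direction.
N = 64
theta = 2 * np.pi * np.arange(N) / N
protos = np.stack([np.cos(theta), np.sin(theta)], axis=1)
tangents = np.stack([-np.sin(theta), np.cos(theta)], axis=1)
scale = 0.5 * (2 * np.pi / N)  # small enough that effective supports barely overlap

# X ~ sum_n N(x*_n, Sigma_n)^{T=n}, T ~ Cat(N): pick a prototype uniformly,
# then sample along its tangent direction.
T = rng.integers(N, size=10_000)
X = protos[T] + scale * rng.normal(size=(10_000, 1)) * tangents[T]

# All samples stay close to the unit circle the prototypes approximate.
radii = np.linalg.norm(X, axis=1)
print(radii.min(), radii.max())
```

Increasing $N$ while shrinking the per-prototype scale tightens the approximation of the manifold, matching the "sufficiently large $N$" remark above.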

Consider an affine spline operator $f$ (Eq. 1) mapping a space of dimension $D$ to a space of dimension $K$, with $K\geq D$. The span of this mapping, which we denote its image, is given by

with $\text{Aff}(\omega;\bm{A}_{\omega},\bm{b}_{\omega})=\{\bm{A}_{\omega}\bm{x}+\bm{b}_{\omega}:\bm{x}\in\omega\}$ the affine transformation of region $\omega$ by the per-region parameters $\bm{A}_{\omega},\bm{b}_{\omega}$, and with $\Omega$ the partition of the input space in which $\bm{x}$ lives. An analytical form of the per-region affine mappings is provided in Section 2. Hence, the DN mapping consists of affine transformations on each input-space partition region $\omega\in\Omega$, based on the coordinate change induced by $\bm{A}_{\omega}$ and the shift induced by $\bm{b}_{\omega}$.

When the input space is equipped with a density distribution, this density is transformed by the mapping $f$. In general, finding the density of $f(X)$ is intractable. However, given the disjoint-support assumption of Section 3.2, we can arbitrarily increase the representation power of the density by increasing the number of prototypes $N$. In doing so, the support of each Gaussian becomes included within the region $\omega$ in which its mean lies, leading to the following result.

Theorem 1. Given the setting of Equation 4, the unconditional DN output density, denoted $Z$, is a mixture of the affinely transformed distributions $\bm{x}|\bm{x}^{*}_{n(\bm{x})}$; e.g., for the Gaussian case,

$$ Z\sim\sum_{n=1}^{N}\mathcal{N}\left(\bm{A}_{\omega(\bm{x}^{*}_{n})}\bm{x}^{*}_{n}+\bm{b}_{\omega(\bm{x}^{*}_{n})},\;\bm{A}_{\omega(\bm{x}^{*}_{n})}\Sigma_{\bm{x}^{*}_{n}}\bm{A}_{\omega(\bm{x}^{*}_{n})}^{T}\right)^{T=n}, $$

where $\omega(\bm{x}^{*}_{n})=\omega\in\Omega\iff\bm{x}^{*}_{n}\in\omega$ is the partition region in which the prototype $\bm{x}^{*}_{n}$ lies.

The proof of the above relies on the fact that if $\int_{\omega}p(\bm{x}|\bm{x}^{*}_{n(\bm{x})})d\bm{x}\approx 1$, then $f$ is linear within the effective support of $p$. Therefore, any sample from $p$ will almost surely lie within a single region $\omega\in\Omega$, and the entire mapping can be considered linear with respect to $p$. Thus, the output distribution is a linear transformation of the input distribution given by the per-region affine mapping.
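This linearization argument can be checked numerically. The sketch below (our own, using a hypothetical two-layer ReLU net rather than the paper's architecture) compares the empirical output moments under small input noise with the per-region affine prediction of Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, H = 4, 6, 8

# A toy CPA network: f(x) = W2 @ relu(W1 @ x + b1) + b2.
W1, b1 = rng.normal(size=(H, D)), rng.normal(size=H)
W2, b2 = rng.normal(size=(K, H)), rng.normal(size=K)
f = lambda X: np.maximum(X @ W1.T + b1, 0.0) @ W2.T + b2

# A prototype safely inside one linear region (no pre-activation near zero).
x_star = rng.normal(size=D)
while np.min(np.abs(W1 @ x_star + b1)) < 0.1:
    x_star = rng.normal(size=D)

# Per-region slope A_w at x*: the ReLU mask is constant over the region.
mask = (W1 @ x_star + b1 > 0).astype(float)
A = W2 @ (W1 * mask[:, None])

# Small input noise: the Gaussian's effective support stays inside the region.
sigma = 1e-3
X = x_star + sigma * rng.normal(size=(100_000, D))
Z = f(X)

# Theorem 1, single-component case: Z ~ N(A x* + b, A (sigma^2 I) A^T).
print(np.allclose(Z.mean(0), f(x_star[None])[0], atol=1e-3))   # -> True
print(np.allclose(np.cov(Z.T), sigma**2 * A @ A.T, atol=1e-4))  # -> True
```

With larger noise the samples straddle several regions and the single-Gaussian description breaks down, which is exactly the regime probed empirically in the experiments below.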

Based on our analysis, we now show how specific SSL algorithms can be derived. According to Section 3.1, we want to maximize $I(Z;X^{\prime})$ and $I(Z^{\prime};X)$. When the input noise is small, we can reduce the conditional output density $p_{\bm{z}|\bm{x}^{*}}$ to a single Gaussian.

where we abbreviate the parameters. Using this, together with the result from Section 3.1, we see that one should optimize both $H(Z|X^{\prime})$ and $H(Z)$. As in a standard regression task, we assume a Gaussian observation model, $p(z|z^{\prime})\sim\mathcal{N}(z^{\prime},\Sigma_{r})$. Using the mean squared error as the loss function in regression is a particular case of this assumption, with $\Sigma_{r}=I$. To compute the expected loss, we marginalize out the stochasticity in $Z^{\prime}$, which means that the conditional decoding map is a Gaussian:

which gives the distribution $\mathcal{N}(\mu_{n},\Sigma_{r}+\Sigma_{n})$, meaning that we can lower bound the mutual information with

What happens if we attempt to optimize this objective? The only intractable component is the entropy of $Z$. We begin by examining $Z$ itself. It is natural to ask why the entropy of $Z$ does not increase to infinity. Intuitively, the answer is that $H(Z)$ and $H(Z|X^{\prime})$ are tied together, and one cannot increase without the other. Now, recalling that under our distribution assumption $Z$ is a mixture of Gaussians (recall Thm. 1), existing upper and lower bounds can be used for this case, for example those in Moshksar & Khandani (2016).
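Before turning to the entropy term, the Gaussian marginalization step above, integrating $z^{\prime}$ out of $q(z|X^{\prime}=x_{n})$ to obtain $\mathcal{N}(\mu_{n},\Sigma_{r}+\Sigma_{n})$, is easy to verify by simulation (a standalone sketch of ours, with arbitrary illustrative covariances):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
mu_n = rng.normal(size=K)           # mean of Z' | X' = x_n
Sigma_n = np.diag([0.5, 1.0, 2.0])  # its covariance
Sigma_r = np.eye(K)                 # observation model q(z | z') = N(z', Sigma_r)

# Sample z' ~ N(mu_n, Sigma_n), then z | z' ~ N(z', Sigma_r).
Zp = rng.multivariate_normal(mu_n, Sigma_n, size=400_000)
Z = Zp + rng.multivariate_normal(np.zeros(K), Sigma_r, size=400_000)

# Marginalizing z' convolves the two Gaussians: z ~ N(mu_n, Sigma_r + Sigma_n).
print(np.allclose(Z.mean(0), mu_n, atol=0.02))                  # -> True
print(np.allclose(np.cov(Z.T), Sigma_r + Sigma_n, atol=0.05))   # -> True
```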

Recall that our goal is to estimate the entropy $H(Z)$ in Section 4, where $Z$ is a Gaussian mixture. This quantity has no closed form for a mixture of Gaussians, due to the logarithm of a sum of exponential functions, except in the special case of a single Gaussian density. There are, however, several approximations in the literature, including both upper and lower bounds: some use the logarithmic sum of the component probabilities (Kolchinsky & Tracey, 2017), and some use entropy-adjusted logarithmic probabilities (Huber et al., 2008).

An even simpler solution is to approximate the entire mixture by a single Gaussian that captures only the first two moments of the mixture distribution. Since the Gaussian distribution maximizes entropy for a given covariance matrix, this method provides an upper bound on our entropy of interest $H(Z)$. In this case, denoting by $\Sigma_{Z}$ the covariance matrix of $Z$, we find that we should maximize the following objective:

where $\Sigma_{r}$ is constant with respect to our optimization process, and the second term is the prediction performance of one representation from the other. A key result from Shi et al. (2009) connects the eigenvectors and eigenvalues of $\Sigma_{Z}$ to those of each component $\Sigma_{i},\forall i$: under the assumption that the separation $(\mu_{i}-\mu_{j})\Sigma_{i}^{-1}(\mu_{i}-\mu_{j})^{T}$ between the different components is large enough, which holds in our case per our data distribution model, the eigenfunctions of $\Sigma_{i},\forall i$ are approximately the eigenfunctions of $\Sigma_{Z}$. Therefore, since all those eigenvalues are tied, we only need to find the most efficient way to maximize $|\Sigma_{Z}|$.
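That the moment-matched Gaussian over-estimates the mixture entropy can be checked numerically; a small one-dimensional sketch of our own (two arbitrary components standing in for $p(z)$):

```python
import numpy as np

# A two-component 1D Gaussian mixture standing in for p(z).
mus, sigmas, w = np.array([-2.0, 2.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

xs = np.linspace(-12.0, 12.0, 200_001)
dx = xs[1] - xs[0]
pdf = sum(wi * np.exp(-(xs - m) ** 2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
          for wi, m, s in zip(w, mus, sigmas))
H_true = -(pdf * np.log(pdf + 1e-300)).sum() * dx   # numerical entropy (nats)

# Moment matching: a single Gaussian with the mixture's mean and variance.
mean = (w * mus).sum()
var = (w * (sigmas**2 + mus**2)).sum() - mean**2
H_upper = 0.5 * np.log(2 * np.pi * np.e * var)      # max-entropy upper bound

print(H_true < H_upper)  # the moment-matched Gaussian always over-estimates H(Z)
```

As the components separate further, the gap approaches the weight entropy $\log 2$, so the bound stays informative up to a constant.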

We know that the determinant of a matrix is the product of its eigenvalues, and that for any positive semi-definite matrix the maximum eigenvalue is greater than or equal to each diagonal element. Therefore, under the constraint on the eigenvalues of the matrix, the most efficient strategy is to decrease the off-diagonal terms and increase the diagonal terms. By setting $\Sigma_{r}=I$, we therefore fully recover the VICReg objective.
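The recovered objective is exactly the triplet loss displayed earlier; a minimal sketch of it (our own, with $\gamma$ serving both as the std target and the invariance weight as in the displayed formula, and illustrative coefficient values):

```python
import numpy as np

def vicreg_loss(Z, Zp, alpha=25.0, beta=1.0, gamma=1.0, eps=1e-4):
    """VICReg triplet loss for two embedding batches Z, Zp of shape (N, K)."""
    N, K = Z.shape
    C = np.cov(np.concatenate([Z, Zp], axis=0).T)  # (K, K) covariance of [Z, Z']
    # Variance term: hinge pushing each embedding dimension's std above gamma.
    var_term = np.maximum(0.0, gamma - np.sqrt(np.diag(C) + eps))
    # Covariance term: penalize off-diagonal entries (decorrelation).
    off = C - np.diag(np.diag(C))
    per_dim = alpha * var_term + beta * (off ** 2).sum(axis=1)
    # Invariance term: ||Z - Z'||_F^2 / N.
    return per_dim.mean() + gamma * ((Z - Zp) ** 2).sum() / N

rng = np.random.default_rng(1)
Z = 2.0 * rng.normal(size=(64, 8))
collapsed = np.zeros((64, 8))

# Well-spread, decorrelated embeddings score far lower than a collapsed one.
print(vicreg_loss(Z, Z) < vicreg_loss(collapsed, collapsed))  # -> True
```

The variance and covariance terms implement the diagonal/off-diagonal strategy described above, while the invariance term implements the prediction (conditional entropy) part of the bound.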

Lee et al. (2021b) connected the SimCLR objective (Chen et al., 2020) to the variational bound on the information between representations (Poole et al., 2019) by using the von Mises-Fisher distribution as the conditional variational family. Based on our analysis in Section 4.1, we can identify two main differences between SimCLR and VICReg: (i) the conditional distribution $p(\bm{z}|\bm{x}^{\prime})$: SimCLR assumes a von Mises-Fisher distribution for the encoder, while VICReg assumes a Gaussian distribution; (ii) entropy estimation: the entropy term in SimCLR, i.e., the entropy of the marginal $p(z)=\int p(z|x^{\prime})p(x^{\prime})dx^{\prime}$, is approximated by a finite sum over the input samples, whereas VICReg estimates the entropy of $Z$ from its first two moments only. Creating self-supervised methods that combine these two choices would be an interesting direction for future research. In principle, neither assumption is more valid than the other; the choice depends on the specific task and on computational constraints.

The next step is to verify the validity of our assumptions. Based on the theory presented in Section 3.3, the conditional output density $p_{\bm{z}|\bm{x}}$ reduces to a single Gaussian as the input noise decreases. We validated this using a ResNet-18 model trained with SimCLR or VICReg on the CIFAR-10 dataset (Krizhevsky, 2009). For each image of the test dataset, we drew 512 Gaussian samples and analyzed whether these samples remain Gaussian (for each image) at the penultimate layer of the DN, i.e., before the linear classification layer, independently for each output dimension. We then employed D'Agostino and Pearson's test (D'Agostino, 1971) to compute the p-value under the null hypothesis that the sample comes from a normal distribution; the test combines kurtosis and skewness transformations to measure the deviation from normality. The process was repeated for different noise standard deviations. Figure 1 shows the p-value as a function of the normalized standard deviation. We observe that the network's output is indeed Gaussian with high probability for small input noise. As we increase the input noise, the network's output becomes less Gaussian, until the Gaussian hypothesis can be rejected with 99% confidence. Moreover, VICReg is, interestingly, more "Gaussian" than SimCLR, which may be due to the fact that it regularizes $H(Z)$ using only the second moments of the density.
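The test itself is easy to reproduce on a toy stand-in. The sketch below (our own scalar caricature using `scipy.stats.normaltest`, which implements the D'Agostino-Pearson statistic named in the text; not the paper's ResNet-18 setup) shows the same qualitative effect: small noise keeps the ReLU output in one linear region and hence Gaussian, while large noise makes normality firmly rejectable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x0 = 1.0  # a point well inside the ReLU's positive region

# 512 noisy copies of x0 pushed through a ReLU (one "image", one output unit).
small = np.maximum(x0 + 0.01 * rng.normal(size=512), 0.0)  # stays in one linear region
large = np.maximum(x0 + 3.0 * rng.normal(size=512), 0.0)   # ~37% of mass clipped to zero

# D'Agostino-Pearson normality test (kurtosis + skewness combined).
p_small = stats.normaltest(small).pvalue
p_large = stats.normaltest(large).pvalue
print(p_small, p_large)  # p_large is essentially zero: normality rejected
```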

In this study, we examined SSL objective functions from an information-theoretic perspective. Our analysis, based on transferring the required stochasticity to the input distribution, shows how SSL objectives can be derived; it is therefore possible to obtain an information-theoretic analysis even with deterministic DNs. In the second part, we rediscovered VICReg's loss function from first principles and exposed its implicit assumptions. In short, VICReg performs a crude upper-bound estimate of the output density entropy by approximating this distribution with a Gaussian matching the first two moments. Finally, we empirically validated that our assumptions hold in practice, confirming the validity of our novel understanding of VICReg. Our work opens many new paths for future research: better estimators of information-theoretic quantities that fit our assumptions, and identifying which SSL method is preferable based on data properties.

Figure 1: The network output with VICReg training is more Gaussian for small input noise. P-values of the normality test for different SSL models trained on CIFAR-10 at different input noise levels. The x-axis is the coefficient that multiplies the data distribution's standard deviation to obtain the standard deviation of the Gaussian sampled around each image. The dashed line marks the point at which the null hypothesis (Gaussian distribution of the network output) can be rejected with 99% confidence.

$$ X\sim\sum_{n=1}^{N}\mathcal{N}\left(\bm{x}^{*}_{n},\Sigma_{\bm{x}^{*}_{n}}\right)^{T=n},\qquad T\sim\mathrm{Cat}(N), \tag{3} $$

$$ \mathrm{Im}(f)\triangleq\{f(\bm{x}):\bm{x}\in\mathbb{R}^{D}\}=\bigcup_{\omega\in\Omega}\text{Aff}(\omega;\bm{A}_{\omega},\bm{b}_{\omega}) \tag{5} $$

$$ (Z^{\prime}|X^{\prime}=x_{n})\sim\mathcal{N}\left(\mu_{n},\Sigma_{n}\right), $$

$$ q(z|X^{\prime}=x^{*}_{n})=\int q(z|z^{\prime})\,p(z^{\prime}|x^{*}_{n})\,dz^{\prime}, \tag{6} $$

$$ \mathcal{L}=\frac{1}{K}\sum_{k=1}^{K}\left(\alpha\max\left(0,\gamma-\sqrt{\bm{C}_{k,k}+\epsilon}\right)+\beta\sum_{k^{\prime}\neq k}\left(\bm{C}_{k,k^{\prime}}\right)^{2}\right)+\gamma\,\|\bm{Z}-\bm{Z}^{\prime}\|_{F}^{2}/N. $$

$$ I(Z;X^{\prime})=H(Z)-H(Z|X^{\prime})\geq H(Z)+\mathbb{E}[\log q(z|x^{\prime})] $$

Theorem 1. Given the setting of Equation 4, the unconditional DN output density, denoted $Z$, is a mixture of the affinely transformed distributions $\bm{x}|\bm{x}^{*}_{n(\bm{x})}$; e.g., for the Gaussian case, $$ Z\sim\sum_{n=1}^{N}\mathcal{N}\left(\bm{A}_{\omega(\bm{x}^{*}_{n})}\bm{x}^{*}_{n}+\bm{b}_{\omega(\bm{x}^{*}_{n})},\;\bm{A}_{\omega(\bm{x}^{*}_{n})}\Sigma_{\bm{x}^{*}_{n}}\bm{A}_{\omega(\bm{x}^{*}_{n})}^{T}\right)^{T=n}, $$ where $\omega(\bm{x}^{*}_{n})=\omega\in\Omega\iff\bm{x}^{*}_{n}\in\omega$ is the partition region in which the prototype $\bm{x}^{*}_{n}$ lies.

Achille, A. and Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE transactions on pattern analysis and machine intelligence , 40(12):2897-2905, 2018.

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410 , 2016.

Amjad, R. A. and Geiger, B. C. Learning representations for neural network-based classification using the information bottleneck principle. IEEE transactions on pattern analysis and machine intelligence , 2019.

Balestriero, R. and Baraniuk, R. A spline theory of deep networks. In Proc. ICML , volume 80, pp. 374-383, Jul. 2018.

Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. Mutual information neural estimation. In Proc. ICML, 2018.

References

[dsprites17] Loic Matthey, Irina Higgins, Demis Hassabis, Alexander Lerchner. (2017). dSprites: Disentanglement testing Sprites dataset.

[rudin2006real] Rudin, Walter. (2006). Real and Complex Analysis.

[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, Mairal, Julien, Bojanowski, Piotr, Joulin, Armand. (2021). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[misra2020self] Misra, Ishan, Maaten, Laurens van der. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. International conference on machine learning.

[bromley1993signature] Bromley, Jane, Guyon, Isabelle, LeCun, Yann, Säckinger, Eduard, Shah, Roopak. (1993). Signature verification using a "Siamese" time delay neural network. Advances in Neural Information Processing Systems.

[entropyapprox2008] Huber, Marco, Bailey, Tim, Durrant-Whyte, Hugh, Hanebeck, Uwe. (2008). On Entropy Approximation for Gaussian Mixture Random Vectors. IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. doi:10.1109/MFI.2008.4648062.

[koltchinskii2000random] Koltchinskii, Vladimir, Giné, Evarist. (2000). Random matrix approximation of spectra of integral operators. Bernoulli.

[shi2009data] Shi, Tao, Belkin, Mikhail, Yu, Bin. (2009). Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics.

[boundsgmmentropy] Kolchinsky, Artemy, Tracey, Brendan D. (2017). Estimating mixture entropy with pairwise distances. Entropy.

[balestriero2020mad] Balestriero, Randall, Baraniuk, Richard. (2020). Mad max: Affine spline insights into deep learning. Proceedings of the IEEE.

[heusel2017gans] Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Hochreiter, Sepp. (2017). GANs trained by a two time-scale update rule converge to a local nash equilibrium. Proc. NeurIPS.

[che2020your] Che, Tong, Zhang, Ruixiang, Sohl-Dickstein, Jascha, Larochelle, Hugo, Paull, Liam, Cao, Yuan, Bengio, Yoshua. (2020). Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling. arXiv preprint arXiv:2003.06060.

[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[piran2020dual] Piran, Zoe, Shwartz-Ziv, Ravid, Tishby, Naftali. (2020). The dual information bottleneck. arXiv preprint arXiv:2006.04641.

[8437679] Russo, Daniel, Zou, James. (2019). How much does your data exploration overfit? Controlling bias via information usage. IEEE Transactions on Information Theory. doi:10.1109/ISIT.2018.8437679.

[caron2020unsupervised] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems.

[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, et al. (2020). Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems.

[chen2021exploring] Chen, Xinlei, He, Kaiming. (2021). Exploring simple siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[jing2022understanding] Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian. (2022). Understanding Dimensional Collapse in Contrastive Self-supervised Learning. International Conference on Learning Representations.

[tanaka2019discriminator] Tanaka, Akinori. (2019). Discriminator optimal transport. arXiv preprint arXiv:1910.06832.

[metz2016unrolled] Metz, Luke, Poole, Ben, Pfau, David, Sohl-Dickstein, Jascha. (2016). Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163.

[peyre2009manifold] Peyré, Gabriel. (2009). Manifold models for signals and images. Computer Vision and Image Understanding.

[faceapi] Microsoft Cognitive Services. Face API.

[wood1996estimation] Wood, GR, Zhang, BP. (1996). Estimation of the Lipschitz constant of a function. J. Global Optim..

[cheney2009course] Cheney, Elliott Ward, Light, William Allan. (2009). A course in approximation theory.

[baggenstoss2017uniform] Baggenstoss, Paul M. (2017). Uniform manifold sampling (UMS): Sampling the maximum entropy pdf. IEEE Trans. Signal Processing.

[gulrajani2017improved] Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, Courville, Aaron. (2017). Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028.

[scaman2018lipschitz] Scaman, Kevin, Virmaux, Aladin. (2018). Lipschitz regularity of deep neural networks: analysis and efficient estimation. arXiv preprint arXiv:1805.10965.

[thirumuruganathan2020approximate] Thirumuruganathan, Saravanan, Hasan, Shohedul, Koudas, Nick, Das, Gautam. (2020). Approximate query processing for data exploration using deep generative models. Proc. ICDE.

[karras2017progressive] Karras, Tero, Aila, Timo, Laine, Samuli, Lehtinen, Jaakko. (2017). Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

[vahdat2020nvae] Vahdat, Arash, Kautz, Jan. (2020). Nvae: A deep hierarchical variational autoencoder. Proc. NeurIPS.

[tan2020fairgen] Tan, Shuhan, Shen, Yujun, Zhou, Bolei. (2020). Improving the Fairness of Deep Generative Models without Retraining. arXiv preprint arXiv:2012.04842.

[hwang2020fairfacegan] Hwang, Sunhee, Park, Sungho, Kim, Dohyung, Do, Mirae, Byun, Hyeran. (2020). FairfaceGAN: Fairness-aware facial image-to-image translation. Proc. BMVC.

[karras2020analyzing] Karras, Tero, Laine, Samuli, Aittala, Miika, Hellsten, Janne, Lehtinen, Jaakko, Aila, Timo. (2020). Analyzing and improving the image quality of stylegan. Proc. CVPR.

[brock2018large] Brock, Andrew, Donahue, Jeff, Simonyan, Karen. (2019). Large scale GAN training for high fidelity natural image synthesis. Proc. ICLR.

[thanh2019improving] Thanh-Tung, Hoang, Tran, Truyen, Venkatesh, Svetha. (2019). Improving generalization and stability of generative adversarial networks. arXiv preprint arXiv:1902.03984.

[sandfort2019data] Sandfort, Veit, Yan, Ke, Pickhardt, Perry J, Summers, Ronald M. (2019). Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks. Scientific reports.

[zhao2018bias] Zhao, Shengjia, Ren, Hongyu, Yuan, Arianna, Song, Jiaming, Goodman, Noah, Ermon, Stefano. (2018). Bias and generalization in deep generative models: An empirical study. arXiv preprint arXiv:1811.03259.

[wu2019generalization] Wu, Bingzhe, Zhao, Shiwan, Chen, ChaoChao, Xu, Haoyang, Wang, Li, Zhang, Xiaolu, Sun, Guangyu, Zhou, Jun. (2019). Generalization in generative adversarial networks: A novel perspective from privacy protection. arXiv preprint arXiv:1908.07882.

[fantuzzi2002identification] Fantuzzi, Cesare, Simani, Silvio, Beghelli, Sergio, Rovatti, Riccardo. (2002). Identification of piecewise affine models in noisy environment. International Journal of Control.

[egerstedt2009control] Egerstedt, Magnus, Martin, Clyde. (2009). Control theoretic splines: optimal control, statistics, and path planning.

[levina2004maximum] Levina, Elizaveta, Bickel, Peter. (2004). Maximum likelihood estimation of intrinsic dimension. Advances in neural information processing systems.

[xu2018spherical] Xu, Jiacheng, Durrett, Greg. (2018). Spherical latent spaces for stable variational autoencoders. arXiv preprint arXiv:1808.10805.

[chen2018isolating] Chen, Ricky TQ, Li, Xuechen, Grosse, Roger B, Duvenaud, David K. (2018). Isolating sources of disentanglement in variational autoencoders. Proc. NeurIPS.

[miyato2018spectral] Miyato, Takeru, Kataoka, Toshiki, Koyama, Masanori, Yoshida, Yuichi. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.

[mao2017least] Mao, Xudong, Li, Qing, Xie, Haoran, Lau, Raymond YK, Wang, Zhen, Paul Smolley, Stephen. (2017). Least squares generative adversarial networks. Proc. ICCV.

[spivak2018calculus] Spivak, Michael. (2018). Calculus on manifolds: a modern approach to classical theorems of advanced calculus.

[ansuini2019intrinsic] Ansuini, Alessio, Laio, Alessandro, Macke, Jakob H, Zoccolan, Davide. (2019). Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems.

[facco2017estimating] Facco, Elena, d’Errico, Maria, Rodriguez, Alex, Laio, Alessandro. (2017). Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific reports.

[balestriero2020analytical] Balestriero, Randall, Paris, Sébastien, Baraniuk, Richard. (2020). Analytical Probability Distributions and Exact Expectation-Maximization for Deep Generative Networks. Proc. NeurIPS.

[hara2016analysis] Hara, Kazuyuki, Saitoh, Daisuke, Shouno, Hayaru. (2016). Analysis of dropout learning regarded as ensemble learning. International Conference on Artificial Neural Networks.

[ketchen1996application] Ketchen, David J, Shook, Christopher L. (1996). The application of cluster analysis in strategic management research: an analysis and critique. Strategic management journal.

[thorndike1953belongs] Thorndike, Robert L. (1953). Who belongs in the family?. Psychometrika.

[baldi2013understanding] Baldi, Pierre, Sadowski, Peter J. (2013). Understanding dropout. Advances in neural information processing systems.

[bachman2014learning] Bachman, Philip, Alsharif, Ouais, Precup, Doina. (2014). Learning with pseudo-ensembles. Advances in neural information processing systems.

[bojanowski2017optimizing] Bojanowski, Piotr, Joulin, Armand, Lopez-Paz, David, Szlam, Arthur. (2017). Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776.

[warde2013empirical] Warde-Farley, David, Goodfellow, Ian J, Courville, Aaron, Bengio, Yoshua. (2013). An empirical analysis of dropout in piecewise linear networks. arXiv preprint arXiv:1312.6197.

[glorot2011deep] Glorot, Xavier, Bordes, Antoine, Bengio, Yoshua. (2011). Deep sparse rectifier neural networks. Proceedings of the fourteenth international conference on artificial intelligence and statistics.

[maas2013rectifier] Maas, Andrew L, Hannun, Awni Y, Ng, Andrew Y. (2013). Rectifier nonlinearities improve neural network acoustic models. Proc. icml.

[bruna2013invariant] Bruna, Joan, Mallat, Stéphane. (2013). Invariant scattering convolution networks. IEEE Trans. PAMI.

[zhang2018tropical] Zhang, Liwen, Naitzat, Gregory, Lim, Lek-Heng. (2018). Tropical geometry of deep neural networks. arXiv preprint arXiv:1805.07091.

[he2015delving] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. Proceedings of the IEEE international conference on computer vision.

[tran2017disentangled] Tran, Luan, Yin, Xi, Liu, Xiaoming. (2017). Disentangled representation learning gan for pose-invariant face recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[banerjee2014linear] Banerjee, Sudipto, Roy, Anindya. (2014). Linear Algebra and Matrix Analysis for Statistics.


[jordan2001graphical] Jordan, Michael Irwin, Sejnowski, Terrence Joseph. (2001). Graphical models: Foundations of neural computation.

[hintonMITVideo] Geoffrey Hinton. What's wrong with convolutional nets?.

[dong2017deep] Dong, Xiao, Wu, Jiasong, Zhou, Ling. (2017). How deep learning works--The geometry of deep learning. arXiv preprint arXiv:1710.10784.

[raghu2017expressive] Raghu, Maithra, Poole, Ben, Kleinberg, Jon, Ganguli, Surya, Dickstein, Jascha Sohl. (2017). On the expressive power of deep neural networks. Proceedings of the 34th International Conference on Machine Learning-Volume 70.

[tropical] Liwen Zhang, Gregory Naitzat, Lek{-. (2018). Tropical Geometry of Deep Neural Networks. CoRR.

[hintonVideo] Geoffrey Hinton. (2014). What's wrong with convolutional nets?.

[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, J{. (1997). Long short-term memory. Neural computation.

[goodfellow2012large] Goodfellow, Ian, Courville, Aaron, Bengio, Yoshua. (2012). Large-scale feature learning with spike-and-slab sparse coding. arXiv preprint arXiv:1206.6407.

[hannun2014deepspeech] Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., others. (2014). DeepSpeech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.

[schmidhuber2015deep] Schmidhuber, J.. (2015). Deep learning in neural networks: {A. Neural Net..

[tikhonov2013numerical] Tikhonov, Andreui Nikolaevich, Goncharsky, AV, Stepanov, VV, Yagola, Anatoly G. (2013). Numerical methods for the solution of ill-posed problems.

[wolfdeepface] Wolf, Lior. DeepFace: Closing the Gap to Human-Level Performance in Face Verification.

[griffiths2004hierarchical] Griffiths, DMBTL, Tenenbaum, MIJJB. (2004). Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing systems.

[lucke2012closed] L{. (2012). Closed-form EM for sparse coding and its application to source separation. Latent Variable Analysis and Signal Separation.

[Saxe-Ganguli-dyn-lin-nn:2013tq] Saxe, A.~M., McClelland, J.~L., Ganguli, S.. (2013). {Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv.org.

[Saxe-Ganguli-hier-cat-dnn:2013vq] McClelland, J. L., Ganguli, S.. (2013). {Learning hierarchical category structure in deep neural networks. Proc. Annu. Cog. Sci. Soc..

[kschischang2001factor] F. R. Kschischang, B. J. Frey, H. A. Loeliger. (2001). Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory.

[wilamowski2001algorithm] Wilamowski, Bogdan M, Iplikci, Serdar, Kaynak, Okyay, Efe, M {. (2001). An algorithm for fast convergence in training neural networks. Proceedings of the international joint conference on neural networks.

[karklin2005hierarchical] Karklin, Yan, Lewicki, Michael S. (2005). A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural computation.

[pham2015study] Pham, Ngoc-Quan, Le, Hai-Son, Nguyen, Duc-Dung, Ngo, Truong-Giang. (2015). A Study of Feature Combination in Gesture Recognition with Kinect. Knowledge and Systems Engineering.

[hartley2003multiple] Hartley, Richard, Zisserman, Andrew. (2003). Multiple view geometry in computer vision.

[bishop2006pattern] Bishop, C.~M.. (2006). Pattern Recognition and Machine Learning.

[corduneanu2001variational] Corduneanu, Adrian, Bishop, Christopher M. (2001). Variational Bayesian model selection for mixture distributions. Artificial intelligence and Statistics.

[amari1993backpropagation] Amari, Shun-ichi. (1993). Backpropagation and stochastic gradient descent method. Neurocomputing.

[wainwright2008graphical] Wainwright, M. J., Jordan, M. I.. (2008). Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn..

[Schmid:1994:PTN:991886.991915] Schmid, H.. (1994). Part-of-speech Tagging with Neural Networks. Proc. Conf. Comput. Linguistics. doi:10.3115/991886.991915.

[salakhutdinov2013learning] Jin, Chi, Ge, Rong, Netrapalli, Praneeth, Kakade, Sham M, Jordan, Michael I. (2017). How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887. doi:10.1109/TIT.1972.1054753.

[russakovsky2012attribute] Russakovsky, O., Fei-Fei, L.. (2012). Attribute learning in large-scale datasets. Trends and Topics in Computer Vision.

[russakovsky2015imagenet] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., others. (2015). Imagenet large scale visual recognition challenge. Int. J. Comput. Vision.

[ramachandran2017searching] Ramachandran, P., Zoph, B., Le, Q.. (2017). Searching for activation functions. ArXiv e-prints.

[Yuste:2004jm] Yuste, Rafael, Urban, Rochelle. (2004). {Dendritic spines and linear networks. Journal of Physiology-Paris.

[Patel:un] Patel, Ankit B. {Modeling and Inferring Cleavage Patterns in Proliferating Epithelia.

[Anonymous:TT8NVr95] . {bioessays08-nagpal.pdf. ().

[Anonymous:U7RotLPL] . {nature06-ssr.pdf. ().

[Anonymous:cPTrEePs] . {DESYNC: Self-Organizing Desynchronization and TDMA on Wireless Sensor Networks. (2007).

[Patel:2007wn] Patel, Ankit B, Degesys, Julius, Nagpal, Radhika. (2007). {Desynchronization:The Theory of Self-Organizing Algorithms for Round-Robin Scheduling.

[Charles:2013tp] Charles, Adam, Rozell, Christopher. (2013). {Short Term Memory Capacity in Networks via the Restricted Isometry Property.

[Anonymous:2012wr] . {Dynamic Filtering of Sparse Signals using Reweighted. (2012).

[Anonymous:2013uy] . {Visual Nonclassical Receptive Field E↵ects Emerge from Sparse Coding in a Dynamical System. (2013).

[Packer:2013gt] Packer, Adam M, Roska, Botond, H{. (2013). {Targeting neurons and photons for optogenetics. Nature Publishing Group.

[Dyer:2013ua] Dyer, Eva. (2013). {Greedy Feature Selection for Subspace Clustering. Journal of Machine Learning Research.

[Yoon:2013hv] Yoon, KiJung, Buice, Michael A, Barry, Caswell, Hayman, Robin, Burgess, Neil, Fiete, Ila R. (2013). {Specific evidence of low-dimensional continuous attractor dynamics in grid cells. Nature Publishing Group.

[Ramirez:2013bl] Ramirez, S, Liu, X, Lin, P A, Suh, J, Pignatelli, M, Redondo, R L, Ryan, T J, Tonegawa, S. (2013). {Creating a False Memory in the Hippocampus. Science.

[Izhikevich:2003ul] Izhikevich, Eugene M. (2003). {Which Model to Use for Cortical Spiking Neurons?. IEEE Trans. Neural Networks.

[Maglione:2013ia] Maglione, Marta, Sigrist, Stephan J. (2013). {Seeing the forest tree by tree: super-resolution light microscopy meets the neurosciences. Nature Publishing Group.

[Sutherland:1998wn] Sutherland, Ivan. (1998). {Technology and Courage.

[Rozell:2008wr] Rozell, Christopher, Johnson, Don, Baraniuk, Rich, Olshausen, Bruno. (2008). {Sparse Coding via Thresholding and Local Competition in Neural Circuits. Neural Computation.

[Gordon:2012td] Gordon, Geoff, Tibshirani, Ryan. (2012). {Generalized gradient descent.

[OLSHAUSEN:2004fw] OLSHAUSEN, B, FIELD, D. (2004). {Sparse coding of sensory inputs. Current Opinion in Neurobiology.

[Anonymous:JVLVJtUI] . {Cog_Neurosci2011_98. (2011).

[Anselmi:2007ke] Anselmi, F., Mutch, J., Poggio, T.. (2007). {Magic Materials. Proc. Natl. Acad. Sci..

[Cadieu:2013wa] Cadieu, Charles, Yamins, Dan, DiCarlo, James. (2013). {The Neural Representation Benchmark and its Evaluation on Brain and Machine. ArXiV.

[Anonymous:mLLJA3aZ] . {High Frequency Stimulation of the Subthalamic Nucleus Eliminates Pathological Thalamic Rhythmicity in a Computational Model. (2004).

[DiCarlo:2012em] DiCarlo, James J, Zoccolan, Davide, Rust, Nicole C. (2012). {Perspective. Neuron.

[Humphries:2012ju] Humphries, Mark D, Gurney, Kevin. (2012). {Network effects of subthalamic deep brain stimulation drive a unique mixture of responses in basal ganglia output. European Journal of Neuroscience.

[Johnson:2005ha] Johnson, Jeffrey S, Olshausen, Bruno A. (2005). {The recognition of partially visible natural objects in the presence and absence of their occluders. Vision Research.

[Buckner:2013fu] Buckner, Randy L, Krienen, Fenna M, Yeo, B T Thomas. (2013). {Opportunities and limitations of intrinsic functional connectivity MRI. Nature Publishing Group.

[Keck:2012cb] Keck, C., Savin, C., L{. (2012). {Feedforward Inhibition and Synaptic Scaling -- Two Sides of the Same Coin?. PLoS Computational Biology.

[Anonymous:21M5ylQ8] . {Unsupervised Learning of Translation Invariant Occlusive Components. (2012).

[Rozell:2013tv] Rozell, Christopher. (2013). {Stable Manifold Embeddings with Structured Random Matrices.

[Carandini:2013dv] Carandini, Matteo, Churchland, Anne K. (2013). {Probing perceptual decisions in rodents. Nature Publishing Group.

[Anonymous:oVbxcaph] . {Specular Surface Reconstruction from Sparse Reflection Correspondences. (2010).

[Sandoe:2013il] Sandoe, Jackson, Eggan, Kevin. (2013). {Opportunities and challenges of pluripotent stem cell neurodegenerative disease models. Nature Publishing Group.

[Anonymous:2013cg] . {Focus on neurotechniques. Nature Publishing Group (2013).

[Patel:2013tv] Patel, A.~B., Kukreja, R.~S.. (2013). {Final Contract for 1515 Hyde Park {#.

[Otero:2013hh] Otero, Ives, Delbracio, Mauricio. (2013). {The Anatomy of the SIFT Method.

[Anonymous:wJ0z1pAS] . {Learning Feature Representations with K-means. (2012).

[Raphael:2012ug] Raphael, Robert. (2012). {IGERT: Neuroengineering: From Cells to Systems.

[Berens:2012fi] Berens, P, Ecker, A S, Cotton, R J, Ma, W J, Bethge, M, Tolias, A S. (2012). {A Fast and Simple Population Code for Orientation in Primate V1. Journal of Neuroscience.

[Ma:2006bh] Ma, Wei Ji, Beck, Jeffrey M, Latham, Peter E, Pouget, Alexandre. (2006). {Bayesian inference with probabilistic population codes. Nature Neuroscience.

[Ma:2013uk] Ma, Wei Ji. (2013). {Population Vector COding.

[Anonymous:S7HycmMg] . {Parallelized Stochastic Gradient Descent. (2010).

[Anonymous:FVKVV-yP] . {On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. (2001).

[Anonymous:OYKu-7Li] . {Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. (2007).

[Anonymous:9wEHQ3-F] . {Random Feature Maps for Dot Product Kernels. (2013).

[Rahimi:2007vq] Rahimi, Ali, Recht, Ben. (2007). {Random Features for Large-Scale Kernel Machines.

[Anonymous:2011de] . {Perceptual and neural consequences of rapid motion adaptation. (2011).

[Boyd:2011bw] Boyd, Stephen. (2011). {Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends{\textregistered.

[Dubois:2011dy] Dubois, Julien, VanRullen, Rufin. (2011). {Visual Trails: Do the Doors of Perception Open Periodically?. PLoS Biology.

[Boyd:2011tq] Boyd, Stephen. (2011). {Alternating Direction Method of Multipliers.

[Sokoliuk:2013hu] Sokoliuk, R, VanRullen, R. (2013). {The Flickering Wheel Illusion: When Rhythms Make a Static Wheel Flicker. Journal of Neuroscience.

[Adibi:2013hq] Adibi, M, Clifford, C W G, Arabzadeh, E. (2013). {Informational Basis of Sensory Adaptation: Entropy and Single-Spike Efficiency in Rat Barrel Cortex. Journal of Neuroscience.

[Saxe:2013up] Saxe, Andrew, McClelland, James, Ganguli, Surya. (2013). {A Mathematical Theory of Semantic Development.

[Hinton:2010un] Hinton, Geoff. (2010). {A Practical Guide to Training Restricted Boltzmann Machines.

[Anonymous:QF6Em5B4] . {Complete Discrete 2-D Gabor Transforms by Neural Networks for Image Analysis and Compression. (2004).

[Anonymous:J448A51u] . {Tutorial on Gabor Filters. (2008).

[Anonymous:puO477jp] . {Learning hierarchical category structure in deep neural networks. (2013).

[Zoran-Weiss:2013pr] Zoran, D., Weiss, Y.. (2012). Natural Images, Gaussian Mixtures and Dead Leaves. Proc. Adv. Neural Inf. Process. Syst. (NIPS'12).

[Helmstaedter:2014iv] Helmstaedter, Moritz, Briggman, Kevin L, Turaga, Srinivas C, Jain, Viren, Seung, H Sebastian, Denk, Winfried. (2014). {Connectomic reconstruction of the innerplexiform layer in the mouse retina. Nature.

[Anonymous:sBTrRq3Q] . {Controllable single photon stimulation of retinal rod cells. (2013).

[Weiss:2002id] Weiss, Yair, Simoncelli, Eero P, Adelson, Edward H. (2002). {Motion illusions as optimal percepts. Nature Neuroscience.

[Yamins:2013tp] Yamins, Dan, Hong, Ha, DiCarlo, James. (2013). {Key Features of Higher Visual Cortex Emerge in Behaviorally Optimized Neural Networks.

[wjma:2013ts] {wjma. (2013). {Relating back to behavior.

[Anonymous:E_1bFc4h] . {Kanizsa triangle. (2013).

[wjma:2013wj] {wjma. (2013). {Lecture 11 -- Probability and inference with neurons.

[wjma:2013tp] {wjma. (2013). {Complications.

[Krizhevsky:2012wl] Krizhevsky, A., Sutskever, I., Hinton, G.. (2012). {ImageNet Classification with Deep Convolutional Neural Networks. Proc. Adv. Neural Inf. Process. Syst (NIPS'12).

[wiskott2006does] Wiskott, Laurenz. (2006). How does our visual system achieve shift and size invariance. JL van Hemmen and TJ Sejnowski, editors.

[lecun1998gradient] LeCun, Yann, Bottou, L{'e. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.

[Carandini:2011fm] Carandini, Matteo, Heeger, David J. (2011). {Normalization as a canonical neural computation. Nature Reviews Neuroscience.

[Adibi:2013dd] Adibi, M, McDonald, J S, Clifford, C W G, Arabzadeh, E. (2013). {Adaptation Improves Neural Coding Efficiency Despite Increasing Correlations in Variability. Journal of Neuroscience.

[Anonymous:ly3rlGJy] . {Sparse Filtering. (2011).

[Cafaro:2011im] Cafaro, Jon, Rieke, Fred. (2011). {Noise correlations improve response fidelity and stimulus encoding. Nature.

[Ibbotson:2011jh] Ibbotson, Michael, Krekelberg, Bart. (2011). {Visual perception and saccadic eye movements. Current Opinion in Neurobiology.

[Kandel:2013cf] Kandel, Eric R, Markram, Henry, Matthews, Paul M, Yuste, Rafael, Koch, Christof. (2013). {Neuroscience thinks big (and collaboratively). Nature Reviews Neuroscience.

[Lacy:2013km] Lacy, Joyce W, Stark, Craig E L. (2013). {The neuroscience of memory: implications for the courtroom. Nature Reviews Neuroscience.

[BurgosArtizzu:2012ul] Burgos-Artizzu, Xavier. (2012). {Social behavior recognition in continuous video. Computer Vision and Pattern Recognition.

[Averbeck:2006ew] Averbeck, Bruno B, Latham, Peter E, Pouget, Alexandre. (2006). {Neural correlations, population coding and computation. Nature Reviews Neuroscience.

[Le:2011ts] Le, Quoc, Ng, Andrew. (2011). {Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. Computer Vision and Pattern Recognition.

[Adams:2006ti] Adams, Ryan, MacKay, David. (2006). {Bayesian Online Changepoint Detection.

[Salvator:2004wq] Salvator, Dave. (2004). {ExtremeTech 3D Pipeline Tutorial. PCMag.

[Deneve:2007by] Deneve, S, Duhamel, J R, Pouget, A. (2007). {Optimal Sensorimotor Integration in Recurrent Cortical Networks: A Neural Implementation of Kalman Filters. Journal of Neuroscience.

[Jordan:1999ti] Jordan, Michael, Ghahramani, Zoubin, Jaakkola, Tommi, Saul, Lawrence. (1999). {An Introduction to Variational Methods for Graphical Models. Machine Learning.

[Anonymous:OEEDCGDt] . {343263a0. (2002).

[Poggio:2013ju] Poggio, Tomaso, Ullman, Shimon. (2013). {Vision: are models of object recognition catching up with the brain?. Annals of the New York Academy of Sciences.

[Pinto:2009gu] Pinto, Nicolas, Doukhan, David, DiCarlo, James J, Cox, David D. (2009). {A High-Throughput Screening Approach to Discovering Good Forms of Biologically Inspired Visual Representation. PLoS Computational Biology.

[Dayan:2012kb] Dayan, Peter. (2012). {Twenty-Five Lessonsfrom Computational Neuromodulation. Neuron.

[Anonymous:MaG0r2vx] . {Beyond Simple Features: A Large-Scale Feature Search Approach to Unconstrained Face Recognition. (2011).

[Pinto:2008bo] Pinto, Nicolas, Cox, David D, DiCarlo, James J. (2008). {Why is Real-World Visual Object Recognition Hard?. PLoS Computational Biology.

[Zhu:2004ur] Zhu, Mengchen, Durand, Fredo, Rozell, Christopher. (2004). {MIT 6.837 - Ray Tracing.

[Pouget:2013gi] Pouget, Alexandre, Beck, Jeffrey M, Ma, Wei Ji, Latham, Peter E. (2013). {Probabilistic brains: knowns and unknowns. Nature Publishing Group.

[Thibodeau:2011je] Thibodeau, Paul, Boroditsky, Lera. (2011). {Metaphors We Think With: The Role of Metaphor in Reasoning. PLoS One.

[LaCamera:2008do] La Camera, Giancarlo, Richmond, Barry J. (2008). {Modeling the Violation of Reward Maximization and Invariance in Reinforcement Schedules. PLoS Computational Biology.

[Fetsch:2013ks] Fetsch, Christopher R, DeAngelis, Gregory C, Angelaki, Dora E. (2013). {Bridging the gap between theoriesof sensory cue integration and thephysiology of multisensory neurons.

[Hosoya:2005fu] Hosoya, Toshihiko, Baccus, Stephen A, Meister, Markus. (2005). {Dynamic predictive coding by the retina. Nature.

[Anonymous:uoHvrsjc] . {Perceptual filling in of artificially induced scotomas in human vision. (2001).

[Pitkow:2012dh] Pitkow, Xaq, Meister, Markus. (2012). {Decorrelation and efficient coding by retinal ganglion cells. Nature Publishing Group.

[Meister:2013tw] Meister, Markus. (2013). {Neural computation in sensory systems.

[n:2009ws] 000n, 376 377 000M 000a 000r 000t 000i. (2009). {Understanding the Rotating Snakes illusion.

[wjma:2013we] {wjma. (2013). {1/21/2013Bayesian modeling.

[Laurens:2013fy] Laurens, Jean, Meng, Hui, Angelaki, Dora E. (2013). {Computation of linear acceleration through an internal model in the macaque cerebellum. Nature Publishing Group.

[Watson:2003td] Watson, Andrew. (2003). {Real-world illumination and the perception of surface reflectance properties.

[Brainard:2011dr] Brainard, D H, Maloney, L T. (2011). {Surface color perception and equivalent illumination models. Journal of Vision.

[Fleming:2013jy] Fleming, R W, Wiebel, C, Gegenfurtner, K. (2013). {Perceptual qualities and material classes. Journal of Vision.

[vanderKooij:2011fa] van der Kooij, Katinka. (2011). {Perception of 3D slant out of the box.

[Ecker:2011bx] Ecker, A S, Berens, P, Tolias, A S, Bethge, M. (2011). {The Effect of Noise Correlations in Populations of Diversely Tuned Neurons. Journal of Neuroscience.

[wjma:2013wea] {wjma. (2013). {1/21/2013Bayesian modeling.

[Anonymous:W77SX1oQ] . {Homography Estimation. (2009).

[Anonymous:jScaT-4D] . {At Least at the Level of Inferior Temporal Cortex, the Stereo Correspondence Problem Is Solved. (2003).

[Murphy:2013eq] Murphy, A P, Ban, H, Welchman, A E. (2013). {Integration of texture and disparity cues to surface slant in dorsal visual cortex. Journal of Neurophysiology.

[Tsutsui:2002kr] Tsutsui, K I. (2002). {Neural Correlates for Perception of 3D Surface Orientation from Texture Gradient. Science.

[Anonymous:Le2AY_hs] . {A Bayesian Treatment of the Stereo Correspondence Problem Using Half-Occluded Regions. (2004).

[Savarese:2008us] Savarese, Silvio. (2008). {EECS 442 -- Computer visionStereo systems.

[Savarese:2008usa] Savarese, Silvio. (2008). {EECS 442 -- Computer visionStereo systems.

[Savarese:2008uq] Savarese, Silvio. (2008). {EECS 442 -- Computer visionEpipolar Geometry.

[Savarese:2008vc] Savarese, Silvio. (2008). {EECS 442 -- Computer visionSingle view metrology.

[Savarese:2008vw] Savarese, Silvio. (2008). {EECS 442 -- Computer visionCameras.

[Customer:2008vf] Customer, Preferred. (2008). {Course overview.

[Anonymous:bR8HbOTu] . {EECS 442 -- Computer Vision. (2008).

[Savarese:2009va] Savarese, Silvio. (2009). {EECS 442 -- Computer visionVolumetric stereo.

[Savarese:2008ur] Savarese, Silvio. (2008). {EECS 442 -- Computer visionShape from reflections.

[Savarese:2008up] Savarese, Silvio. (2008). {EECS 442 -- Computer vision Multiple view geometryAffine structure from Motion.

[Savarese:2008tu] Savarese, Silvio. (2008). {EECS 442 -- Computer vision Multiple view geometry.

[Savarese:2008tx] Savarese, Silvio. (2008). {EECS 442 -- Computer visionFitting methods.

[Savarese:2008wc] Savarese, Silvio. (2008). {EECS 442 -- Computer vision Radiometry.

[Savarese:2008wca] Savarese, Silvio. (2008). {EECS 442 -- Computer vision Radiometry.

[Li:2008ub] Li, Fei-Fei. (2008). {Natural Scene Classification inNatural Scene Classification in.

[FeiFie:2008tu] {Fei-Fie. (2008). {EECS 442 -- Computer vision.

[Anonymous:Dx4Xe0J_] . {3. The Junction Tree Algorithms. (2003).

[Anonymous:YzDnGnl9] . {Distances and affinities between measures. (2000).

[manfred:2006up] {manfred. (2006). {taipei4.

[Koolen:2012wk] Koolen, Wouter, Warmuth, Manfred. (2012). {Putting Bayes to sleep.

[FeiFie:2008tua] {Fei-Fie. (2008). {EECS 442 -- Computer vision.

[Savarese:2008wn] Savarese, Silvio. (2008). {Segmentation {&.

[Anonymous:Sn-7BTe2] . {20 years of learning about vision: Questions answered, questions unanswered, and questions not yet asked. (2012).

[Anonymous:Vwe7RZoh] . {Shape perception reduces activity in human primary visual cortex. (2002).

[Anonymous:QbT06TIM] . {Principles of Image Representation in Visual Cortex. (2005).

[Savarese:2008wu] Savarese, Silvio. (2008). {Recognition.

[Savarese:2008wua] Savarese, Silvio. (2008). {Recognition.

[Savarese:2008wub] Savarese, Silvio. (2008). {Recognition.

[Savarese:2009tu] Savarese, Silvio. (2009). {EECS 442 -- Computer visionOptical flow and tracking.

[Savarese:2008ut] Savarese, Silvio. (2008). {EECS 442 -- Computer vision Face Recognition.

[Anonymous:uKw8bset] . {Computer Vision: Algorithms and Applications. (2010).

[Anonymous:2012hj] . {Relative luminance and binocular disparity preferencesare correlated in macaque primary visual cortex,matching natural scene statistics. (2012).

[Sanada:2012hq] Sanada, T M, Nguyenkim, J D, DeAngelis, G C. (2012). {Representation of 3-D surface orientation by velocity and disparity gradient cues in area MT. Journal of Neurophysiology.

[Srivastava:2009ch] Srivastava, S, Orban, G A, De Maziere, P A, Janssen, P. (2009). {A Distinct Representation of Three-Dimensional Shape in Macaque Anterior Intraparietal Area: Fast, Metric, and Coarse. Journal of Neuroscience.

[Anonymous:9uZVlpuI] . {Stereopsis Activates V3A and Caudal Intraparietal Areas in Macaques and Humans. (2003).

[Nieder:2003kv] Nieder, Andreas. (2003). {Stereoscopic Vision: Solving the Correspondence Problem. Current Biology.

[Orban:2006fp] Orban, Guy A, Janssen, Peter, Vogels, Rufin. (2006). {Extracting 3D structure from disparity. Trends in Neurosciences.

[Kruger:gc] Kruger, Norbert, Janssen, Peter, Kalkan, Sinan, Lappe, Markus, Leonardis, Ales, Piater, Justus, Rodriguez-Sanchez, Antonio J, Wiskott, Laurenz. {Deep Hierarchies in the Primate Visual Cortex: What Can We Learn for Computer Vision?. IEEE Trans. PAMI.

[Anonymous:XBCH2ycA] . {10 Neuronal interactions and their role in solving the stereo correspondence problem. (2010).

[Tanabe:2011dx] Tanabe, S, Haefner, R M, Cumming, B G. (2011). {Suppressive Mechanisms in Monkey V1 Help to Solve the Stereo Correspondence Problem. Journal of Neuroscience.

[Howe:2005jb] Howe, P D L. (2005). {V1 Partially Solves the Stereo Aperture Problem. Cerebral Cortex.

[Read:2007gn] Read, Jenny C A, Cumming, Bruce G. (2007). {Sensors for impossible stimuli may solve the stereo correspondence problem. Nature Neuroscience.

[Jeyabalaratnam:2013fz] Jeyabalaratnam, Jeyadarshan, Bharmauria, Vishal, Bachatene, Lyes, Cattan, Sarah, Angers, Annie, Molotchnikoff, St{'e. (2013). {Adaptation Shifts Preferred Orientation of Tuning Curve in the Mouse Visual Cortex. PLoS One.

[Anonymous:L7ZAZoJb] . {gcp_stereo_cvpr11. (2013).

[Anonymous:Parcc-uC] . {Introduction -- a Tour of Multiple View Geometry. (2004).

[Anonymous:iIqqh1eh] . {MULTIPLE VIEW GEOMETRY. (2004).

[Searcy:1996vt] Searcy, J H, Bartlett, J C. (1996). {Inversion and processing of component and spatial-relational information in faces.. Journal of experimental psychology. Human perception and performance.

[Graves:2013wt] Graves, Alex, Mohamed, Abdel-rahman, Hinton, Geoffrey. (2013). {Speech recognition with deep recurrent neural networks.

[Hinton:em] Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara, Kingsbury, Brian. {Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Processing Magazine.

[Zeiler:2013ux] Zeiler, Matthew D, Fergus, Rob. (2013). {Visualizing and Understanding Convolutional Neural Networks. arXiv preprint arXiv:1311.2901.

[Anonymous:GT9cUL3p] . {exact feature probabilities in images with occlusion. (2010).

[Anonymous:xLsGm21g] . {Compressive neural representation of sparse, high-dimensional probabilities. (2013).

[Anonymous:6l_wwPr_] . {Modeling image patches with a directed hierarchy of Markov random fields. (2008).

[Anonymous:a7uhOohM] . {RecurrentSamplingHelmholtz_Dayan. (1999).

[IEEE:2013wx] {IEEE. (2013). {A Pencil Balancing Robotusing a Pair of AER Dynamic Vision Sensors.

[Zaidi:2012ff] Zaidi, Qasim, Ennis, Robert, Cao, Dingcai, Lee, Barry. (2012). {Neural Locus of Color Afterimages. Current Biology.

[Anonymous:rrYjDhO3] . {Integration. (2002).

[Anonymous:2012pp] . {Depth and Deblurring from a Spectrally-varying Depth-of-Field. (2012).

[Anonymous:2013bu] . {scatterometer. (2013).

[Levin:2013bd] Levin, Anat, Glasner, Daniel, Xiong, Ying, Durand, Fredo, Freeman, William, Matusik, Wojciech, Zickler, Todd. (2013). {Fabricating BRDFs at high spatial resolution using wave optics. ACM Trans. Graphics.

[Anonymous:8qQPTOkW] . {arXiv:1206.1428v1 [cs.GR] 7 Jun 2012. (2012).

[Anonymous:2013gf] . {Synthesizing cognition in neuromorphic electronic systems. (2013).

[Jones:2012fy] Jones, P W, Gabbiani, F. (2012). {Impact of neural noise on a sensory-motor pathway signaling impending collision. Journal of Neurophysiology.

[Benosman:2012dh] Benosman, Ryad, Ieng, Sio-Hoi, Clercq, Charles, Bartolozzi, Chiara, Srinivasan, Mandyam. (2012). {Neural Networks. Neural Networks.

[Roska:2006fj] Roska, B. (2006). {Parallel Processing in Retinal Ganglion Cells: How Integration of Space-Time Patterns of Excitation and Inhibition Form the Spiking Output. Journal of Neurophysiology.

[Lichtsteiner:bm] Lichtsteiner, Patrick, Posch, Christoph, Delbruck, Tobi. {A 128$\times$ 128 120 dB 15 $\mu$s Latency Asynchronous Temporal Contrast Vision Sensor. IEEE Journal of Solid-State Circuits.

[Bialek:1990ce] Bialek, W, Owen, W G. (1990). {Temporalfiltering. Biophysical Journal.

[Anonymous:GXtE_twh] . {Local Illumination. (2004).

[Anonymous:SMGtXmKz] . {The Graphics Pipeline: Projective Transformations. (2004).

[jovan:2004vg] {jovan. (2004). {Conventional Animation.

[jovan:2004uj] {jovan. (2004). {Computer Animation II.

[jovan:2004wd] {jovan. (2004). {Computer Animation III.

[Anonymous:9iTr4Vho] . {projective. (1998).

[Abbott:2000wh] Abbott, Larry. (2000). {Theoretical Neuroscience Computational and Mathematical Modeling of Neural Systems - Peter Dayan, L. F. Abbott.

[Anonymous:5F9KVaoE] . {Kogo{&. (2013).

[Anonymous:WiFH6Vnp] . {Coding of Border Ownership in Monkey Visual Cortex. (2000).

[mdf:2011wq] {mdf. (2011). {THECOLOR CURIOSITY SHOP.

[Anonymous:leK42DDc] . {COLOR IS NOT A METRIC SPACE. (2013).

[Anonymous:Jy1FKFoA] . {Deriving Appearance Scales. (2012).

[mdf:2011wh] {mdf. (2011). {Brightness, Lightness, and Specifying Color in High-Dynamic-Range Scenes and Images.

[Anonymous:ieDds7qq] . {Number of discernible object colors is a conundrum. (2013).

[felzenszwalb2006efficient] Felzenszwalb, Pedro F, Huttenlocher, Daniel P. (2006). Efficient belief propagation for early vision. International journal of computer vision.

[Hartley2004] Hartley, R.~I., Zisserman, A.. (2004). Multiple View Geometry in Computer Vision.

[Anonymous:GxRPIp0i] . {2101911. (2010).

[tomg-admm] Taylor, Gavin, Burmeister, Ryan, Xu, Zheng, Singh, Bharat, Patel, Ankit, Goldstein, Tom. (2016). Training Neural Networks Without Gradients: A Scalable ADMM Approach. arXiv preprint arXiv:1605.02026.

[Anonymous:XCFYGa7M] . {Statistical Estimation, Optimization and Computation-Risk Tradeoffsin Data Analysis. (2013).

[vapnik1998statistical] Vapnik, Vladimir Naumovich, Vapnik, Vlamimir. (1998). Statistical learning theory.

[rifai2011contractive] Rifai, Salah, Vincent, Pascal, Muller, Xavier, Glorot, Xavier, Bengio, Yoshua. (2011). Contractive auto-encoders: Explicit invariance during feature extraction. Proceedings of the 28th international conference on machine learning (ICML-11).

[rifai2011manifold] Rifai, Salah, Dauphin, Yann N, Vincent, Pascal, Bengio, Yoshua, Muller, Xavier. (2011). The manifold tangent classifier. Advances in Neural Information Processing Systems.

[lee2013pseudo] Lee, Dong-Hyun. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on Challenges in Representation Learning, ICML.

[makhzani2015winner] Makhzani, Alireza, Frey, Brendan J. (2015). Winner-Take-All Autoencoders. Advances in Neural Information Processing Systems.

[kingma2014semi] Kingma, Diederik P, Mohamed, Shakir, Rezende, Danilo Jimenez, Welling, Max. (2014). Semi-supervised learning with deep generative models. Advances in Neural Information Processing Systems.

[wakin2005multiscale] Wakin, M. B., Donoho, D. L., Choi, H., Baraniuk, R. G.. (2005). The multiscale structure of non-differentiable image manifolds. Proc. Int. Soc. Optical Eng..

[goodfellow2014generative] Goodfellow, I. J, Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.. (2014). Generative adversarial nets. Proc. NIPS.

[papyan2017convolutional] Papyan, Vardan, Romano, Yaniv, Elad, Michael. (2017). Convolutional Neural Networks Analyzed via Convolutional Sparse Coding. Journal of Machine Learning Research.

[srivastava2015training] Srivastava, Rupesh K, Greff, Klaus, Schmidhuber, J{. (2015). Training very deep networks. Advances in Neural Information Processing systems.

[chen2016infogan] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.. (2016). Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Advances in Neural Information Processing Systems.

[kingma2013auto] Kingma, Diederik P, Welling, Max. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[poole2016exponential] Poole, Ben, Lahiri, Subhaneil, Raghu, Maithreyi, Sohl-Dickstein, Jascha, Ganguli, Surya. (2016). Exponential expressivity in deep neural networks through transient chaos. Advances In Neural Information Processing Systems.

[chen2011multiscale] Chen, G., Maggioni, M.. (2011). Multiscale geometric dictionaries for point-cloud data. Proc. Sampling Theory and Applications (SampTA).

[donoho2005image] Donoho, D. L., Grimes, C.. (2005). Image manifolds which are isometric to Euclidean space. J. Math. Imaging Vision.

[ziv2013long] Wiatowski, Thomas, B{. (2015). A mathematical theory of deep convolutional neural networks for feature extraction. arXiv preprint arXiv:1512.06293.

[rubin2010theory] Xiong, H. Y., Alipanahi, B., Lee, L. J., Bretschneider, H., Merico, D., Yuen, R. K. C., Hua, Y., Gueroussov, S., Najafabadi, H. S., Hughes, T. R., Morris, Q., Barash, Y., Krainer, A. R., Jojic, N., Scherer, S. W., Blencowe, B. J., Frey, B. J.. (2015). The human splicing code reveals new insights into the genetic determinants of disease. Science. doi:10.1126/science.1254806.

[serre2007feedforward] M. Pilanci, M. J. Wainwright. (2015). Randomized sketches of convex programs with sharp guarantees. IEEE Trans. Info. Theory.

[PilWai16a] M. Pilanci, M. J. Wainwright. Iterative {H. J. Mach. Learn. Res..

[WaiJor08] M. J. Wainwright, M. I. Jordan. (2008). Graphical models, exponential families and variational inference. Found. Tren. Mach. Learn..

[HasTibWai15] T. Hastie, R. Tibshirani, M. J. Wainwright. (2015). Statistical {L.

[LohWai15] P. Loh, M. J. Wainwright. Regularized {M. J. Mach. Learn. Res..

[Wai14a] M. J. Wainwright. Structured regularizers: Statistical and computational issues. Annu. Rev. Stat. Appl..

[PilWaiElg15] M. Pilanci, M. J. Wainwright, L. {E. Sparse learning via {B. Math. Program.. doi:10.1007/s10107-015-0894-1.

[SchWaiYu15] G. Schiebinger, M. J. Wainwright, B. Yu. (2015). The geometry of kernelized spectral clustering. Ann. Stat..

[alpha-go] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., others. (2016). Mastering the game of Go with deep neural networks and tree search. Nature.

[shashua-cvpr-keynote] A. Shashua. (2016). Autonomous Driving, Computer Vision and Machine Learning.

[godeepnips] Patel, A., Nguyen, T., Baraniuk, R.. (2016). A Probabilistic Framework for Deep Learning. Proc. Adv. Neural Inf. Process. Syst. (NIPS'16).

[lensfree16] V. Boominathan, J. K. Adams, M. S. Asif, B. W. Avants, J. T. Robinson, R. G. Baraniuk, A. C. Sankaranarayanan, A. Veeraraghavan. (2016). Lensless Imaging: A computational renaissance. IEEE Signal Process. Mag.. doi:10.1109/MSP.2016.2581921.

[lensfree17] Szeliski, R.. (2006). Locally adapted hierarchical basis preconditioning. IEEE Trans. Comput. Imag.. doi:10.1109/TCI.2016.2593662.

[huang1999statistics] Huang, J., Mumford, D.. (1999). Statistics of natural images and models. Proc. IEEE Conf. Comp. Vision Pat. Recog. (CVPR'99).

[lee2003nonlinear] Lee, A.~B., Pedersen, K.~S., Mumford, David. (2003). The nonlinear statistics of high-contrast patches in natural images. Intl. J. Comp. Vision.

[li2009towards] Li, L.J., Socher, R., Fei-Fei, L.. (2009). Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. Proc. IEEE Conf. Comp. Vision Pattern Recog. (CVPR'09).

[deng2009imagenet] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.. (2009). Imagenet: A large-scale hierarchical image database. Proc. IEEE Conf. Com. Vision and Pattern Recog. (CVPR'09).

[li2010object] Li, L.J., Su, H., Fei-Fei, L., Xing, E.P.. (2010). Object bank: A high-level image representation for scene classification & semantic feature sparsification. Proc. Adv. Neural Info. Process. Syst. (NIPS'10).

[yao2012codebook] Yao, B., Bradski, G., Fei-Fei, L.. (2012). A codebook-free and annotation-free approach for fine-grained image categorization. Proc. IEEE Conf. Com. Vision and Pattern Recog. (CVPR'12).

[carin1] Chen, M., Silva, J., Paisley, J., Wang, C., Dunson, D., Carin, L.. (2010). Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Trans. Signal Process..

[gregor2013deep] Gregor, Karol, Danihelka, Ivo, Mnih, Andriy, Blundell, Charles, Wierstra, Daan. (2013). Deep autoregressive networks. arXiv preprint arXiv:1310.8499.

[patel2016probabilistic] Patel, Ankit B, Nguyen, Tan, Baraniuk, Richard G. (2016). A Probabilistic Framework for Deep Learning. NIPS.

[salimans2016improved] Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, Chen, Xi. (2016). Improved techniques for training gans. arXiv preprint arXiv:1606.03498.

[springenberg2015unsupervised] Springenberg, Jost Tobias. (2015). Unsupervised and Semi-supervised Learning with Categorical Generative Adversarial Networks. arXiv preprint arXiv:1511.06390.

[miyato2015distributional] Miyato, Takeru, Maeda, Shin-ichi, Koyama, Masanori, Nakae, Ken, Ishii, Shin. (2015). Distributional smoothing by virtual adversarial examples. arXiv preprint arXiv:1507.00677.

[maaloe2016auxiliary] Maal{\o. (2016). Auxiliary Deep Generative Models. arXiv preprint arXiv:1602.05473.

[springenberg2014striving] Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, Riedmiller, Martin. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.

[wei2017early] Wei, Yuting, Yang, Fanny, Wainwright, Martin J. (2017). Early stopping for kernel boosting algorithms: A general analysis with localized complexities. arXiv preprint arXiv:1707.01543.

[achille2017emergence] Achille, Alessandro, Soatto, Stefano. (2017). Emergence of invariance and disentangling in deep representations. arXiv preprint arXiv:1706.01350.

[Wai17book] M. J. Wainwright. (2017). High-dimensional statistics: A non-asymptotic view.

[nishikawa1998accurate] Nishikawa, Hiroaki. (1998). Accurate Piecewise Linear Continuous Approximations to One-Dimensional Curves: Error Estimates and Algorithms.

[Yedidia01] J. S. Yedidia, W. T. Freeman, Y. Weiss. (2001). Generalized belief propagation. NIPS 13.

[SonJaa07a] D. Sontag, T. Jaakkola. (2007). New outer bounds on the marginal polytope. Neural Information Processing Systems.

[MelGloWei09] T. Meltzer, A. Globerson, Y. Weiss. (2009). Convergent message-passing algorithms: {A. Uncertainty in Artificial Intelligence.

[KolTik59] A. N. Kolmogorov, B. Tikhomirov. (1959). $\epsilon$-entropy and $\epsilon$-capacity of sets in functional spaces. Uspekhi Mat. Nauk..

[YanBar99] Y. Yang, A. Barron. (1999). Information-theoretic determination of minimax rates of convergence. annstat.

[Yu] B. Yu. (1996). Assouad, {F. Research Papers in Probability and Statistics: Festschrift in Honor of Lucien Le Cam.

[zhang2016convexified] Zhang, Yuchen, Liang, Percy, Wainwright, Martin J. (2016). Convexified convolutional neural networks. arXiv preprint arXiv:1609.01000.

[tishby2015deep] Tishby, Naftali, Zaslavsky, Noga. (2015). Deep learning and the information bottleneck principle. Information Theory Workshop (ITW), 2015 IEEE.

[hinton1997modeling] Hinton, Geoffrey E, Dayan, Peter, Revow, Michael. (1997). Modeling the manifolds of images of handwritten digits. IEEE transactions on Neural Networks.

[simard1993efficient] Simard, Patrice, LeCun, Yann, Denker, John S. (1993). Efficient pattern recognition using a new transformation distance. Advances in Neural Information Processing systems.

[belkin2003laplacian] Belkin, Mikhail, Niyogi, Partha. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation.

[zhou2009hierarchical] Zhou, Xi, Cui, Na, Li, Zhen, Liang, Feng, Huang, Thomas S. (2009). Hierarchical gaussianization for image classification. Computer Vision, 2009 IEEE 12th International Conference on.

[simonyan2014very] Simonyan, Karen, Zisserman, Andrew. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[gatys2015neural] Gatys, Leon A, Ecker, Alexander S, Bethge, Matthias. (2015). A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.

[tian2017deeptest] Tian, Yuchi, Pei, Kexin, Jana, Suman, Ray, Baishakhi. (2017). DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars. arXiv preprint arXiv:1708.08559.

[edX] . Discrete Time Signals and Systems. ().

[goodman2016european] Goodman, Bryce, Flaxman, Seth. (2016). European Union regulations on algorithmic decision-making and a. arXiv preprint arXiv:1606.08813.

[rust2010selectivity] Rust, Nicole C, DiCarlo, James J. (2010). Selectivity and tolerance both increase as visual information propagates from cortical area V4 to IT. Journal of Neuroscience.

[coifman1992entropy] Coifman, Ronald R, Wickerhauser, M Victor. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on information theory.

[tropp2004greed] Tropp, Joel A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information theory.

[hopfield1985neural] Hopfield, John J, Tank, David W. (1985). “Neural” computation of decisions in optimization problems. Biological cybernetics.

[hannah2013multivariate] Hannah, L.~A., Dunson, D.~B.. (2013). Multivariate convex regression with adaptive partitioning. J. Mach. Learn. Res..

[breiman1993hinging] Breiman, Leo. (1993). Hinging hyperplanes for regression, classification, and function approximation. IEEE Transactions on Information Theory.

[magnani2009convex] Magnani, Alessandro, Boyd, Stephen P. (2009). Convex piecewise-linear fitting. Optim. Eng..

[cybenko1989approximation] Cybenko, George. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS).

[meyer1993algorithms] Meyer, Yves. (1993). Algorithms and applications. SIAM, philadelphia.

[hornik1989multilayer] Hornik, Kurt, Stinchcombe, Maxwell, White, Halbert. (1989). Multilayer feedforward networks are universal approximators. Neural networks.

[raj2016local] Raj, Anant, Kumar, Abhishek, Mroueh, Youssef, Fletcher, P Thomas, others. (2016). Local Group Invariant Representations via Orbit Embeddings. arXiv preprint arXiv:1612.01988.

[marcos2016rotation] Marcos, Diego, Volpi, Michele, Komodakis, Nikos, Tuia, Devis. (2016). Rotation equivariant vector field networks. arXiv preprint arXiv:1612.09346.

[cooijmans2016recurrent] Cooijmans, Tim, Ballas, Nicolas, Laurent, C{'e. (2016). Recurrent batch normalization. arXiv preprint arXiv:1603.09025.

[glorot2010understanding] Glorot, X., Bengio, Y.. (2010). Understanding the difficulty of training deep feedforward neural networks. Proc. 13th Int. Conf. AI Statist..

[anden2014deep] And{'e. (2014). Deep scattering spectrum. IEEE Transactions on Signal Processing.

[sifre2013rotation] Sifre, Laurent, Mallat, St{'e. (2013). Rotation, scaling and deformation invariant scattering for texture discrimination. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[li2005perceptron] Li, Ling. (2005). Perceptron learning with random coordinate descent.

[nesterov2012efficiency] Nesterov, Yu. (2012). Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization.

[garnett2007image] Garnett, John B, Le, Triet M, Meyer, Yves, Vese, Luminita A. (2007). Image decompositions using bounded variation and generalized homogeneous Besov spaces. Applied and Computational Harmonic Analysis.

[choi2004multiple] Choi, Hyeokho, Baraniuk, Richard G. (2004). Multiple wavelet basis image denoising using Besov ball projections. IEEE Signal Processing Letters.

[hecht1988theory] Hecht-Nielsen, Robert, others. (1988). Theory of the backpropagation neural network.. Neural Networks.

[balle2014learning] Ball{'e. (2014). Learning sparse filter bank transforms with convolutional ICA. Image Processing (ICIP), 2014 IEEE International Conference on.

[mallat1999wavelet] Mallat, St{'e. (1999). A wavelet tour of signal processing.

[bastien2012theano] Bastien, Fr{'e. (2012). Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590.

[puthawala2020globally] Puthawala, Michael, Kothari, Konik, Lassas, Matti, Dokmani{'c. (2020). Globally Injective ReLU Networks. arXiv preprint arXiv:2006.08464.

[lucas2018using] Lucas, Alice, Iliadis, Michael, Molina, Rafael, Katsaggelos, Aggelos K. (2018). Using deep neural networks for inverse problems in imaging: beyond analytical methods. IEEE Signal Processing Magazine.

[rudin1964principles] Rudin, Walter, others. (1964). Principles of mathematical analysis.

[schumaker2007spline] Schumaker, Larry. (2007). Spline functions: basic theory.

[choromanska2015loss] Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, G{'e. (2015). The Loss Surfaces of Multilayer Networks.. AISTATS.

[donoho1995noising] Donoho, David L. (1995). De-noising by soft-thresholding. IEEE transactions on information theory.

[zhang2014entropy] Zhang, Lin. (2014). Entropy, stochastic matrices, and quantum operations. Linear and Multilinear Algebra.

[guggenheimer1977applicable] Guggenheimer, Heinrich Walter. (1977). Applicable geometry: global and local convexity.

[lloyd1982least] Lloyd, Stuart. (1982). Least squares quantization in PCM. IEEE transactions on information theory.

[kuurkova1992kolmogorov] K{\uu. (1992). Kolmogorov's theorem and multilayer neural networks. Neural networks.

[jayaraman2009digital] Jayaraman, S, Esakkirajan, S, Veerakumar, T. (2009). Digital Image Processing TMH Publication. Year of Publication.

[srivastava2014understanding] Srivastava, R.~K., Masci, J., Gomez, F., Schmidhuber, J.. (2014). Understanding locally competitive networks. arXiv preprint arXiv:1410.1165.

[henaff2014local] H{'e. (2014). The local low-dimensionality of natural images. arXiv preprint arXiv:1412.6626.

[mathieu2016disentangling] Mathieu, Michael F, Zhao, Junbo Jake, Zhao, Junbo, Ramesh, Aditya, Sprechmann, Pablo, LeCun, Yann. (2016). Disentangling factors of variation in deep representation using adversarial training. Advances in Neural Information Processing Systems.

[larsson2016fractalnet] Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory. (2016). Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648.

[lee2015generalizing] Lee, Chen-Yu, Gallagher, Patrick W, Tu, Zhuowen. (2015). Generalizing pooling functions in convolutional neural networks: Mixed. Gated, and Tree, arXiv e-print sarXiv.

[ding2005equivalence] Ding, Chris, He, Xiaofeng, Simon, Horst D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. Proceedings of the 2005 SIAM International Conference on Data Mining.

[tieleman2012lecture] Tieleman, Tijmen, Hinton, Geoffrey. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning.

[mairal2009online] Mairal, Julien, Bach, Francis, Ponce, Jean, Sapiro, Guillermo. (2009). Online dictionary learning for sparse coding. Proceedings of the 26th annual international conference on machine learning.

[jiang2011learning] Jiang, Zhuolin, Lin, Zhe, Davis, Larry S. (2011). Learning a discriminative dictionary for sparse coding via label consistent K-SVD. Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.

[balestriero2017multiscale] Balestriero, Randall. (2017). Multiscale Residual Mixture of PCA: Dynamic Dictionaries for Optimal Basis Learning. arXiv preprint arXiv:1707.05840.

[lecun1995learning] LeCun, Yann, Jackel, LD, Bottou, L{'e. (1995). Learning algorithms for classification: A comparison on handwritten digit recognition. Neural networks: the statistical mechanics perspective.

[lecun2015lenet] LeCun, Yann, others. (2015). LeNet-5, convolutional neural networks. URL: http://yann. lecun. com/exdb/lenet.

[rumelhart1988learning] Rumelhart, David E, Hinton, Geoffrey E, Williams, Ronald J, others. (1988). Learning representations by back-propagating errors. Cognitive modeling.

[bengio2013advances] Bengio, Yoshua, Boulanger-Lewandowski, Nicolas, Pascanu, Razvan. (2013). Advances in optimizing recurrent networks. Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on.

[zeiler2012adadelta] Zeiler, Matthew D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

[kingma2014adam] Kingma, Diederik P, Ba, Jimmy. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[reed2014learning] Reed, Scott, Sohn, Kihyuk, Zhang, Yuting, Lee, Honglak. (2014). Learning to disentangle factors of variation with manifold interaction. Proceedings of the 31st International Conference on Machine Learning (ICML-14).

[rennie2014deep] Rennie, Steven J, Goel, Vaibhava, Thomas, Samuel. (2014). Deep order statistic networks. Spoken Language Technology Workshop (SLT), 2014 IEEE.

[lee2015deeply] Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick W, Zhang, Zhengyou, Tu, Zhuowen. (2015). Deeply-Supervised Nets.. AISTATS.

[li2019understanding] Li, Xiang, Chen, Shuo, Hu, Xiaolin, Yang, Jian. (2019). Understanding the disharmony between dropout and batch normalization by variance shift. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[bakir2004learning] Bak{\i. (2004). Learning to find pre-images. Advances in Neural Information Processing systems.

[comon1994independent] Comon, Pierre. (1994). Independent {C. Signal Processing.

[hyvarinen2016unsupervised] Hyvarinen, Aapo, Morioka, Hiroshi. (2016). Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in Neural Information Processing Systems.

[schmidhuber1992learning] Schmidhuber, J{. (1992). Learning factorial codes by predictability minimization. Neural Computation.

[rosenblatt1956remarks] Rosenblatt, Murray. (1956). Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics.

[sajjadi2018assessing] Sajjadi, Mehdi SM, Bachem, Olivier, Lucic, Mario, Bousquet, Olivier, Gelly, Sylvain. (2018). Assessing generative models via precision and recall. arXiv preprint arXiv:1806.00035.

[munkres2014topology] Munkres, James. (2014). Topology.

[karras2019style] Karras, Tero, Laine, Samuli, Aila, Timo. (2019). A style-based generator architecture for generative adversarial networks. Proc. CVPR.

[gong2019autogan] Gong, Xinyu, Chang, Shiyu, Jiang, Yifan, Wang, Zhangyang. (2019). Autogan: Neural architecture search for generative adversarial networks. Proceedings of the IEEE International Conference on Computer Vision.

[stewart1973error] Stewart, Gilbert W. (1973). Error and perturbation bounds for subspaces associated with certain eigenvalue problems. SIAM review.

[locatello2018challenging] Locatello, Francesco, Bauer, Stefan, Lucic, Mario, R{. (2018). Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359.

[tompson2015efficient] Tompson, Jonathan, Goroshin, Ross, Jain, Arjun, LeCun, Yann, Bregler, Christoph. (2015). Efficient object localization using convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[lin2013network] Lin, Min, Chen, Qiang, Yan, Shuicheng. (2013). Network in network. arXiv preprint arXiv:1312.4400.

[blot2016max] Blot, Michael, Cord, Matthieu, Thome, Nicolas. (2016). Max-min convolutional neural networks for image classification. Image Processing (ICIP), 2016 IEEE International Conference on.

[shang2016understanding] Shang, Wenling, Sohn, Kihyuk, Almeida, Diogo, Lee, Honglak. (2016). Understanding and improving convolutional neural networks via concatenated rectified linear units. Proceedings of the International Conference on Machine Learning (ICML).

[targ2016resnet] Targ, Sasha, Almeida, Diogo, Lyman, Kevin. (2016). Resnet in Resnet: generalizing residual architectures. arXiv preprint arXiv:1603.08029.

[szegedy2016inception] Szegedy, Christian, Ioffe, Sergey, Vanhoucke, Vincent, Alemi, Alex. (2016). Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261.

[graham2014fractional] Graham, Benjamin. (2014). Fractional max-pooling. arXiv preprint arXiv:1412.6071.

[masnadi2009design] Masnadi-Shirazi, Hamed, Vasconcelos, Nuno. (2009). On the design of loss functions for classification: theory, robustness to outliers, and savageboost. Advances in Neural Information Processing systems.

[zeiler2013stochastic] Zeiler, Matthew D, Fergus, Rob. (2013). Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557.

[malinowski2013learnable] Malinowski, Mateusz, Fritz, Mario. (2013). Learnable pooling regions for image classification. arXiv preprint arXiv:1301.3516.

[chung2014empirical] Chung, Junyoung, Gulcehre, Caglar, Cho, KyungHyun, Bengio, Yoshua. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

[cho2014learning] Cho, Kyunghyun, Van Merri{. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[unser2018representer] Unser, Michael. (2018). A representer theorem for deep neural networks. arXiv preprint arXiv:1802.09210.

[jones1992simple] Jones, Lee K. (1992). A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. The annals of Statistics.

[szegedy2017inception] Szegedy, Christian, Ioffe, Sergey, Vanhoucke, Vincent, Alemi, Alexander A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning.. AAAI.

[barron1993universal] Barron, Andrew R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory.

[rosasco2004loss] Rosasco, Lorenzo, De Vito, Ernesto, Caponnetto, Andrea, Piana, Michele, Verri, Alessandro. (2004). Are loss functions all the same?. Neural Computation.

[mallat2008wavelet] Mallat, Stephane. (2008). A wavelet tour of signal processing: the sparse way.

[berger1994removing] Berger, Jonathan, Coifman, Ronald R, Goldberg, Maxim J. (1994). Removing noise from music using local trigonometric bases and wavelet packets. Journal of the Audio Engineering Society.

[tikk2003survey] Tikk, Domonkos, K{'o. (2003). A survey on universal approximation and its limits in soft computing techniques. International Journal of Approximate Reasoning.

[srivastava2014dropout] Srivastava, Nitish, Hinton, Geoffrey E, Krizhevsky, Alex, Sutskever, Ilya, Salakhutdinov, Ruslan. (2014). Dropout: a simple way to prevent neural networks from overfitting.. Journal of Machine Learning Research.

[tikhomirov1991representation] Tikhomirov, VM. (1991). On the Representation of Continuous Functions of Several Variables as Superpositions of Continuous Functions of a Smaller Number of Variables. Selected Works of AN Kolmogorov.

[duchi2011adaptive] Duchi, John, Hazan, Elad, Singer, Yoram. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.

[matsuoka1992noise] Matsuoka, Kiyotoshi. (1992). Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics.

[bishop2008training] Bishop, Chris M. (2008). Training with noise is equivalent to Tikhonov regularization. Training.

[wager2013dropout] Wager, Stefan, Wang, Sida, Liang, Percy S. (2013). Dropout training as adaptive regularization. Advances in Neural Information Processing systems.

[bajcsy1989multiresolution] Bajcsy, Ruzena, Kova{\v{c. (1989). Multiresolution elastic matching. Computer vision, graphics, and image processing.

[zhang1997face] Zhang, Jun, Yan, Yong, Lades, Martin. (1997). Face recognition: eigenface, elastic matching, and neural nets. Proceedings of the IEEE.

[dieleman2015rotation] Dieleman, Sander, Willett, Kyle W, Dambre, Joni. (2015). Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly notices of the royal astronomical society.

[bastani2016measuring] Bastani, Osbert, Ioannou, Yani, Lampropoulos, Leonidas, Vytiniotis, Dimitrios, Nori, Aditya, Criminisi, Antonio. (2016). Measuring neural net robustness with constraints. Advances In Neural Information Processing Systems.

[blumer1987occam] Blumer, Anselm, Ehrenfeucht, Andrzej, Haussler, David, Warmuth, Manfred K. (1987). Occam's razor. Information processing letters.

[gal2016dropout] Gal, Yarin, Ghahramani, Zoubin. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. international conference on machine learning.

[li2016whiteout] Li, Yinan, Xu, Ruoyi, Liu, Fang. (2016). Whiteout: Gaussian Adaptive Regularization Noise in Deep Neural Networks. arXiv preprint arXiv:1612.01490.

[schrijver1998theory] Schrijver, Alexander. (1998). Theory of linear and integer programming.

[de1978practical] De Boor, Carl, De Boor, Carl, Math{'e. (1978). A practical guide to splines.

[green1993nonparametric] Green, Peter J, Silverman, Bernard W. (1993). Nonparametric regression and generalized linear models: a roughness penalty approach.

[balestriero2018hard] Balestriero, Randall, Baraniuk, Richard G. (2018). From Hard to Soft: Understanding Deep Network Nonlinearities via Vector Quantization and Statistical Inference. arXiv preprint arXiv:1810.09274.

[gu2013smoothing] Gu, Chong. (2013). Smoothing spline ANOVA models.

[wang2011smoothing] Wang, Yuedong. (2011). Smoothing splines: methods and applications.

[yin2008noisy] Yin, Junsong, Hu, Dewen, Zhou, Zongtan. (2008). Noisy manifold learning using neighborhood smoothing embedding. Pattern Recognition Letters.

[park2004local] Park, JinHyeong, Zhang, Zhenyue, Zha, Hongyuan, Kasturi, Rangachar. (2004). Local smoothing for manifold learning. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[tjeng2017evaluating] Tjeng, Vincent, Xiao, Kai, Tedrake, Russ. (2017). Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356.

[nalisnick2015scale] Nalisnick, Eric, Anandkumar, Anima, Smyth, Padhraic. (2015). A scale mixture perspective of multiplicative noise in neural networks. arXiv preprint arXiv:1506.03208.

[devries2017dataset] DeVries, Terrance, Taylor, Graham W. (2017). Dataset Augmentation in Feature Space. arXiv preprint arXiv:1702.05538.

[bengio2011deep] Bengio, Yoshua, Bergeron, Arnaud, Boulanger--Lewandowski, Nicolas, Breuel, Thomas, Chherawala, Youssouf, Cisse, Moustapha, Erhan, Dumitru, Eustache, Jeremy, Glorot, Xavier, Muller, Xavier, others. (2011). Deep learners benefit more from out-of-distribution examples. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.

[vapnik1992principles] Vapnik, Vladimir. (1992). Principles of risk minimization for learning theory. Advances in Neural Information Processing systems.

[guyon1992structural] Guyon, Isabelle, Vapnik, Vladimir, Boser, Bernhard, Bottou, Leon, Solla, Sara A. (1992). Structural risk minimization for character recognition. Advances in Neural Information Processing systems.

[moody1994architecture] Moody, John, Utans, Joachim. (1994). Architecture selection strategies for neural networks: Application to corporate bond rating prediction. Neural networks in the capital markets.

[wolpert1994bayesian] Wolpert, David H. (1994). Bayesian backpropagation over io functions rather than weights. Advances in Neural Information Processing systems.

[williams1995bayesian] Williams, Peter M. (1995). Bayesian regularization and pruning using a Laplace prior. Neural computation.

[hochreiter1995simplifying] Hochreiter, Sepp, Schmidhuber, J{. (1995). Simplifying neural nets by discovering flat minima. Advances in Neural Information Processing systems.

[schmidhuber1994discovering] Schmidhuber, J{. (1994). Discovering problem solutions with low Kolmogorov complexity and high generalization capability. Machine Learning: Proceedings of the Twelfth International Conference.

[plaut1986experiments] Plaut, David C, others. (1986). Experiments on Learning by Back Propagation..

[hinton1987learning] Hinton, Geoffrey E. (1987). Learning translation invariant recognition in a massively parallel networks. International Conference on Parallel Architectures and Languages Europe.

[mackay1996bayesian] MacKay, David JC. (1996). Bayesian methods for backpropagation networks. Models of neural networks III.

[hinton1986learning] Hinton, Geoffrey E. (1986). Learning distributed representations of concepts. Proceedings of the eighth annual conference of the cognitive science society.

[weigend1990predicting] Weigend, Andreas S, Huberman, Bernardo A, Rumelhart, David E. (1990). Predicting the future: A connectionist approach. International journal of neural systems.

[morgan1990generalization] Morgan, Nelson, Bourlard, Herv{'e. (1990). Generalization and parameter estimation in feedforward nets: Some experiments. Advances in Neural Information Processing systems.

[yann1987modeles] Yann, LE. (1987). Mod{`e.

[lecun1989generalization] LeCun, Yann, others. (1989). Generalization and network design strategies. Connectionism in perspective.

[lang1990time] Lang, Kevin J, Waibel, Alex H, Hinton, Geoffrey E. (1990). A time-delay neural network architecture for isolated word recognition. Neural networks.

[rumelhart1986parallel] Rumelhart, David E, Mcclelland, James L. (1986). Parallel distributed processing: Explorations in the microstructure of cognition: Foundations (Parallel distributed processing).

[nowlan1992simplifying] Nowlan, Steven J, Hinton, Geoffrey E. (1992). Simplifying neural networks by soft weight-sharing. Neural computation.

[hinton93keeping] Hinton, GE, van Camp, Drew. Keeping neural networks simple by minimising the description length of weights. 1993. Proceedings of COLT-93.

[memisevic2014zero] Memisevic, Roland, Krueger, David. (2014). Zero-bias autoencoders and the benefits of co-adapting features. stat.

[murray1993synaptic] Murray, Alan F, Edwards, Peter J. (1993). Synaptic weight noise during MLP learning enhances fault-tolerance, generalization and learning trajectory. Advances in Neural Information Processing systems.

[valiant1984theory] Valiant, Leslie G. (1984). A theory of the learnable. Communications of the ACM.

[zeiler2010deconvolutional] Zeiler, Matthew D, Krishnan, Dilip, Taylor, Graham W, Fergus, Rob. (2010). Deconvolutional networks. Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.

[mallat2016understanding] Mallat, St{'e. (2016). Understanding deep convolutional networks. Phil. Trans. R. Soc. A.

[jaderberg2015spatial] Jaderberg, Max, Simonyan, Karen, Zisserman, Andrew, others. (2015). Spatial transformer networks. Advances in Neural Information Processing Systems.

[biernacki2000assessing] Biernacki, C., Celeux, G., Govaert, G.. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell..

[graves2013speech] Graves, Alex, Mohamed, Abdel-rahman, Hinton, Geoffrey. (2013). Speech recognition with deep recurrent neural networks. Acoustics, speech and signal processing (icassp), 2013 ieee international conference on.

[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[burr1981elastic] Burr, David J. (1981). Elastic matching of line drawings.. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[uchida2005survey] Uchida, Seiichi, Sakoe, Hiroaki. (2005). A survey of elastic matching techniques for handwritten character recognition. IEICE transactions on information and systems.

[korman2013fast] Korman, Simon, Reichman, Daniel, Tsur, Gilad, Avidan, Shai. (2013). Fast-match: Fast affine template matching. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[kim2007grayscale] Kim, Hae Yong, de Araújo, Sidnei Alves. (2007). Grayscale template-matching invariant to rotation, scale, translation, brightness and contrast. Pacific-Rim Symposium on Image and Video Technology.

[murthy1994system] Murthy, Sreerama K., Kasif, Simon, Salzberg, Steven. (1994). A system for induction of oblique decision trees. Journal of artificial intelligence research.

[rao1999learning] Rao, Rajesh PN, Ruderman, Daniel L. (1999). Learning Lie groups for invariant visual perception. Advances in Neural Information Processing systems.

[hubel1962receptive] Hubel, David H, Wiesel, Torsten N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of physiology.

[feng2015learning] Feng, Jiashi, Darrell, Trevor. (2015). Learning the structure of deep convolutional networks. Proceedings of the IEEE International Conference on Computer Vision.

[spitzer1985complex] Spitzer, Hedva, Hochstein, Shaul. (1985). A complex-cell receptive-field model. Journal of Neurophysiology.

[grimes2005bilinear] Grimes, David B, Rao, Rajesh PN. (2005). Bilinear sparse coding for invariant vision. Neural computation.

[foldiak1991learning] Földiák, Peter. (1991). Learning invariance from transformation sequences. Neural Computation.

[kaudererquantifying] Kauderer-Abrams, Eric. Quantifying Translation-Invariance in Convolutional Neural Networks.

[xu2014scale] Xu, Yichong, Xiao, Tianjun, Zhang, Jiaxing, Yang, Kuiyuan, Zhang, Zheng. (2014). Scale-Invariant Convolutional Neural Networks. arXiv preprint arXiv:1411.6369.

[marcos2016learning] Marcos, Diego, Volpi, Michele, Tuia, Devis. (2016). Learning rotation invariant convolutional filters for texture classification. arXiv preprint arXiv:1604.06720.

[2016arXiv160407143B] Biau, Gérard, Scornet, Erwan, Welbl, Johannes. (2016). Neural Random Forests. arXiv preprint arXiv:1604.07143.

[verma2009spatial] Verma, Nakul, Kpotufe, Samory, Dasgupta, Sanjoy. (2009). Which spatial partition trees are adaptive to intrinsic dimension?. Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence.

[sproull1991refinements] Sproull, Robert F. (1991). Refinements to nearest-neighbor searching ink-dimensional trees. Algorithmica.

[schneidman2002analyzing] Schneidman, Elad, Slonim, Noam, Tishby, Naftali, deRuyter van Steveninck, R, Bialek, William. (2002). Analyzing neural codes using the information bottleneck method. Advances in Neural Information Processing systems.

[barlow2001exploitation] Barlow, Horace. (2001). The exploitation of regularities in the environment by the brain. Behavioral and Brain Sciences.

[chunjie2017cosine] Luo, Chunjie, Yang, Qiang, others. (2017). Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks. arXiv preprint arXiv:1702.05870.

[powell1981approximation] Powell, Michael James David. (1981). Approximation theory and methods.

[grimes2003probabilistic] Grimes, David B, Shon, Aaron P, Rao, Rajesh PN. (2003). Probabilistic bilinear models for appearance-based vision.

[grimes2003bilinear] Grimes, David B, Rao, Rajesh PN. (2003). A bilinear model for sparse coding. Advances in Neural Information Processing systems.

[tenenbaum1997separating] Tenenbaum, Joshua B, Freeman, William T. (1997). Separating style and content. Advances in Neural Information Processing systems.

[agostinelli2014learning] Agostinelli, Forest, Hoffman, Matthew, Sadowski, Peter, Baldi, Pierre. (2014). Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830.

[friedman1991multivariate] Friedman, Jerome H. (1991). Multivariate adaptive regression splines. The annals of statistics.

[barlow1981ferrier] Barlow, Horace B. (1981). The ferrier lecture, 1980: Critical limiting factors in the design of the eye and visual cortex. Proceedings of the Royal Society of London B: Biological Sciences.

[strouse2016deterministic] Strouse, DJ, Schwab, David J. (2016). The deterministic information bottleneck. arXiv preprint arXiv:1604.00268.

[fukushima1980neocognitron] Fukushima, Kunihiko. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological cybernetics.

[raghu2016expressive] Raghu, Maithra, Poole, Ben, Kleinberg, Jon, Ganguli, Surya, Sohl-Dickstein, Jascha. (2016). On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336.

[keskar2016large] Keskar, Nitish Shirish, Mudigere, Dheevatsa, Nocedal, Jorge, Smelyanskiy, Mikhail, Tang, Ping Tak Peter. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836.

[hoffer2015deep] Hoffer, Elad, Ailon, Nir. (2015). Deep metric learning using triplet network. International Workshop on Similarity-Based Pattern Recognition.

[taigman2014deepface] Taigman, Yaniv, Yang, Ming, Ranzato, Marc'Aurelio, Wolf, Lior. (2014). Deepface: Closing the gap to human-level performance in face verification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[krishnan1999extracting] Krishnan, R, Sivakumar, G, Bhattacharya, P. (1999). Extracting decision trees from trained neural networks. Pattern Recognition.

[craven1996extracting] Craven, Mark W. (1996). Extracting comprehensible models from trained neural networks.

[craven1994using] Craven, Mark, Shavlik, Jude W. (1994). Using sampling and queries to extract rules from trained neural networks.. ICML.

[kamruzzaman2010rule] Kamruzzaman, SM, Hasan, Ahmed Ryadh. (2010). Rule Extraction using Artificial Neural Networks. arXiv preprint arXiv:1009.4984.

[towell1993extracting] Towell, Geoffrey G, Shavlik, Jude W. (1993). Extracting refined rules from knowledge-based neural networks. Machine learning.

[quinlan1994comparing] Quinlan, John Ross. (1994). Comparing connectionist and symbolic learning methods. Computational Learning Theory and Natural Learning Systems: Constraints and Prospects.

[fu1994rule] Fu, LiMin. (1994). Rule generation from neural networks. IEEE Transactions on Systems, Man, and Cybernetics.

[bengio2007greedy] Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, Larochelle, Hugo. (2007). Greedy layer-wise training of deep networks. Advances in neural information processing systems.

[lecun1998mnist] LeCun, Yann. (1998). The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[netzer2011reading] Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, Ng, Andrew Y. (2011). Reading digits in natural images with unsupervised feature learning. NIPS workshop on deep learning and unsupervised feature learning.

[weston2012deep] Weston, Jason, Ratle, Frédéric, Mobahi, Hossein, Collobert, Ronan. (2012). Deep learning via semi-supervised embedding. Neural Networks: Tricks of the Trade.

[abadi2016tensorflow] Abadi, Martín, others. (2016). TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.

[agarap2018deep] Agarap, A. F. (2018). Deep Learning using Rectified Linear Units (ReLU). arXiv preprint arXiv:1803.08375.

[graves2005framewise] Graves, Alex, Schmidhuber, Jürgen. (2005). Framewise phoneme classification with bidirectional LSTM networks. Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN'05).

[boyd1992defeating] Boyd, John P. (1992). Defeating the Runge phenomenon for equispaced polynomial interpolation via Tikhonov regularization. Applied Mathematics Letters.

[boyd2009divergence] Boyd, John P, Xu, Fei. (2009). Divergence (Runge phenomenon) for least-squares polynomial approximation on an equispaced grid and Mock--Chebyshev subset interpolation. Applied Mathematics and Computation.

[pena2000multivariate] Peña, Juan Manuel, Sauer, Thomas. (2000). On the multivariate Horner scheme. SIAM Journal on Numerical Analysis.

[de2015exploration] de Brébisson, Alexandre, Vincent, Pascal. (2015). An exploration of softmax alternatives belonging to the spherical loss family. arXiv preprint arXiv:1511.05042.

[veit2016residual] Veit, Andreas, Wilber, Michael J, Belongie, Serge. (2016). Residual networks behave like ensembles of relatively shallow networks. Advances in Neural Information Processing Systems.

[de1983approximation] de Boor, Carl, DeVore, Ron. (1983). Approximation by smooth multivariate splines. Transactions of the American Mathematical Society.

[nowozin2016f] Nowozin, Sebastian, Cseke, Botond, Tomioka, Ryota. (2016). f-gan: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing systems.

[dziugaite2015training] Dziugaite, Gintare Karolina, Roy, Daniel M, Ghahramani, Zoubin. (2015). Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906.

[arjovsky2017wasserstein] Arjovsky, Martin, Chintala, Soumith, Bottou, Léon. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.

[gan2017triangle] Gan, Zhe, Chen, Liqun, Wang, Weiyao, Pu, Yuchen, Zhang, Yizhe, Liu, Hao, Li, Chunyuan, Carin, Lawrence. (2017). Triangle generative adversarial networks. Advances in Neural Information Processing Systems.

[angles2018generative] Angles, Tomás, Mallat, Stéphane. (2018). Generative networks as inverse problems with scattering transforms. arXiv preprint arXiv:1805.06621.

[zhao2016energy] Zhao, Junbo, Mathieu, Michael, LeCun, Yann. (2016). Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126.

[roth2017stabilizing] Roth, Kevin, Lucchi, Aurelien, Nowozin, Sebastian, Hofmann, Thomas. (2017). Stabilizing training of generative adversarial networks through regularization. Advances in Neural Information Processing systems.

[li2017towards] Li, Jerry, Madry, Aleksander, Peebles, John, Schmidt, Ludwig. (2017). Towards understanding the dynamics of generative adversarial networks. arXiv preprint arXiv:1706.09884.

[liu2017approximation] Liu, Shuang, Bousquet, Olivier, Chaudhuri, Kamalika. (2017). Approximation and convergence properties of generative adversarial learning. Proc. NeurIPS.

[zhang2017discrimination] Zhang, Pengchuan, Liu, Qiang, Zhou, Dengyong, Xu, Tao, He, Xiaodong. (2017). On the discrimination-generalization tradeoff in GANs. arXiv preprint arXiv:1711.02771.

[arjovsky1701towards] Arjovsky, Martin, Bottou, Léon. (2017). Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862.

[rifai2011higher] Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., Glorot, X.. (2011). Higher order contractive auto-encoder. Joint European Conference on Machine Learning and Knowledge Discovery in Databases.

[miao1992principal] Miao, Jianming, Ben-Israel, Adi. (1992). On principal angles between subspaces in Rn. Linear Algebra Appl.

[deng2020low] Deng, Tingquan, Ye, Dongsheng, Ma, Rong, Fujita, Hamido, Xiong, Lvnan. (2020). Low-rank local tangent space embedding for subspace clustering. Information Sciences.

[ma2010local] Ma, Li, Crawford, Melba M, Tian, Jinwen. (2010). Local manifold learning-based $k$-nearest-neighbor for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing.

[vincent2008extracting] Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P. A.. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th international conference on Machine learning.

[teng2019invertible] Teng, Y., Choromanska, A. (2019). Invertible Autoencoder for Domain Adaptation. Computation.

[chongxuan2017triple] Chongxuan, LI, Xu, Taufik, Zhu, Jun, Zhang, Bo. (2017). Triple generative adversarial nets. Advances in Neural Information Processing systems.

[khayatkhoei2018disconnected] Khayatkhoei, Mahyar, Singh, Maneesh K, Elgammal, Ahmed. (2018). Disconnected manifold learning for generative adversarial networks. Advances in Neural Information Processing Systems.

[tanielian2020learning] Tanielian, Ugo, Issenhuth, Thibaut, Dohmatob, Elvis, Mary, Jeremie. (2020). Learning disconnected manifolds: a no GANs land. arXiv preprint arXiv:2006.04596.

[durugkar2016generative] Durugkar, Ishan, Gemp, Ian, Mahadevan, Sridhar. (2017). Generative multi-adversarial networks. Proc. ICLR.

[ghosh2018multi] Ghosh, Arnab, Kulharia, Viveka, Namboodiri, Vinay P, Torr, Philip HS, Dokania, Puneet K. (2018). Multi-agent diverse generative adversarial networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[yang2019diversitysensitive] Yang, Dingdong, Hong, Seunghoon, Jang, Yunseok, Zhao, Tianchen, Lee, Honglak. (2019). Diversity-Sensitive Conditional Generative Adversarial Networks. arXiv preprint arXiv:1901.09024.

[kodali2017convergence] Kodali, Naveen, Abernethy, Jacob, Hays, James, Kira, Zsolt. (2017). On convergence and stability of gans. arXiv preprint arXiv:1705.07215.

[fabius2014variational] Fabius, Otto, van Amersfoort, Joost R. (2014). Variational recurrent auto-encoders. arXiv preprint arXiv:1412.6581.

[van2017neural] van den Oord, Aaron, Vinyals, Oriol, others. (2017). Neural discrete representation learning. Proc. NeurIPS.

[roy2018theory] Roy, Aurko, Vaswani, Ashish, Neelakantan, Arvind, Parmar, Niki. (2018). Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063.

[rezende2015variational] Rezende, Danilo Jimenez, Mohamed, Shakir. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

[dinh2019rad] Dinh, Laurent, Sohl-Dickstein, Jascha, Pascanu, Razvan, Larochelle, Hugo. (2019). A RAD approach to deep mixture models. arXiv preprint arXiv:1903.07714.

[grathwohl2018ffjord] Grathwohl, Will, Chen, Ricky TQ, Bettencourt, Jesse, Sutskever, Ilya, Duvenaud, David. (2018). FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367.

[dinh2014nice] Dinh, Laurent, Krueger, David, Bengio, Yoshua. (2014). Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.

[kingma2018glow] Kingma, Diederik P, Dhariwal, Prafulla. (2018). Glow: Generative flow with invertible 1x1 convolutions. Proc. NeurIPS.

[meyer2000matrix] Meyer, Carl D. (2000). Matrix analysis and applied linear algebra.

[dinh2016density] Dinh, Laurent, Sohl-Dickstein, Jascha, Bengio, Samy. (2017). Density estimation using real NVP. Proc. ICLR.

[andrsterr2019perturbation] Andrés-Terré, Helena, Lió, Pietro. (2019). Perturbation theory approach to study the latent space degeneracy of Variational Autoencoders. arXiv preprint arXiv:1907.05267.

[srivastava2017veegan] Srivastava, Akash, Valkov, Lazar, Russell, Chris, Gutmann, Michael U., Sutton, Charles. (2017). VEEGAN: Reducing Mode Collapse in GANs using Implicit Variational Learning. Advances in Neural Information Processing Systems.

[dieng2019prescribed] Dieng, Adji B., Ruiz, Francisco J. R., Blei, David M., Titsias, Michalis K. (2019). Prescribed Generative Adversarial Networks.

[biau2018some] Biau, Gérard, others. (2018). Some theoretical properties of GANs. arXiv preprint arXiv:1803.07819.

[boyd2010six] Boyd, John P. (2010). Six strategies for defeating the Runge Phenomenon in Gaussian radial basis functions on a finite interval. Computers & Mathematics with Applications.

[gorski2007biconvex] Gorski, Jochen, Pfeuffer, Frank, Klamroth, Kathrin. (2007). Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research.

[gu2006manifold] Gu, Xianfeng, He, Ying, Qin, Hong. (2006). Manifold splines. Graphical Models.

[xu2015block] Xu, Yangyang, Yin, Wotao. (2015). Block stochastic gradient iteration for convex and nonconvex optimization. SIAM Journal on Optimization.

[bezhaev1988splines] Bezhaev, A Yu. (1988). Splines on manifolds. Russian Journal of Numerical Analysis and Mathematical Modelling.

[gu2014towards] Gu, Shixiang, Rigazio, Luca. (2014). Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068.

[lyu2015unified] Lyu, Chunchuan, Huang, Kaizhu, Liang, Hai-Ning. (2015). A unified gradient regularization family for adversarial examples. Data Mining (ICDM), 2015 IEEE International Conference on.

[shaham2015understanding] Shaham, Uri, Yamada, Yutaro, Negahban, Sahand. (2015). Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization. arXiv preprint arXiv:1511.05432.

[fawzi2015analysis] Fawzi, Alhussein, Fawzi, Omar, Frossard, Pascal. (2015). Analysis of classifiers' robustness to adversarial perturbations. arXiv preprint arXiv:1502.02590.

[carlini2016defensive] Carlini, Nicholas, Wagner, David. (2016). Defensive distillation is not robust to adversarial examples. arXiv preprint.

[papernot2016distillation] Papernot, Nicolas, McDaniel, Patrick, Wu, Xi, Jha, Somesh, Swami, Ananthram. (2016). Distillation as a defense to adversarial perturbations against deep neural networks. Security and Privacy (SP), 2016 IEEE Symposium on.

[tang2013deep] Tang, Yichuan. (2013). Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239.

[shen2017disciplined] Shen, Xinyue, Diamond, Steven, Udell, Madeleine, Gu, Yuantao, Boyd, Stephen. (2017). Disciplined multi-convex programming. Control And Decision Conference (CCDC), 2017 29th Chinese.

[atteia1989spline] Atteia, M, Benbourhim, MN. (1989). Spline elastic manifolds. Mathematical methods in computer aided geometric design.

[savel1995splines] Savel'ev, Il'ya Vasil'evich. (1995). Splines and manifolds. Russian Mathematical Surveys.

[hofer2004energy] Hofer, Michael, Pottmann, Helmut. (2004). Energy-minimizing splines in manifolds. ACM Transactions on Graphics (TOG).

[chui1988multivariate] Chui, Charles K. (1988). Multivariate splines.

[bergstra2010theano] Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, others. (2010). Theano: A CPU and GPU math compiler in Python. Proc. 9th Python in Science Conf.

[afriat1957orthogonal] Afriat, Sidney N. (1957). Orthogonal and oblique projectors and the characteristics of pairs of vector spaces. Mathematical Proceedings of the Cambridge Philosophical Society.

[bjorck1973numerical] Björck, Åke, Golub, Gene H. (1973). Numerical Methods for Computing Angles Between Linear Subspaces. Mathematics of Computation.

[streubel2013representation] Streubel, Tom, Griewank, Andreas, Radons, Manuel, Bernt, Jens-Uwe. (2013). Representation and analysis of piecewise linear functions in abs-normal form. IFIP Conference on System Modeling and Optimization.

[qi1993nonsmooth] Qi, Liqun, Sun, Jie. (1993). A nonsmooth version of Newton's method. Mathematical programming.

[qi1998nonsmooth] Qi, Liqun, Sun, Defeng. (1998). Nonsmooth equations and smoothing Newton methods. Applied Mathematics Report AMR.

[courant1937differential] Courant, Richard, McShane, Edward James. (1937). Differential and integral calculus.

[absil2006largest] Absil, P-A, Edelman, Alan, Koev, Plamen. (2006). On the largest principal angle between random subspaces. Linear Algebra and its Applications.

[weinstein2000almost] Weinstein, Alan. (2000). Almost invariant submanifolds for compact group actions. Journal of the European Mathematical Society.

[cheney2009linear] Cheney, Ward, Kincaid, David. (2009). Linear algebra: Theory and applications. The Australian Mathematical Society.

[schoenberg1964interpolation] Schoenberg, Isaac J. (1964). On interpolation by spline functions and its minimal properties. On Approximation Theory / Über Approximationstheorie.

[reinsch1967smoothing] Reinsch, Christian H. (1967). Smoothing by spline functions. Numerische mathematik.

[bloor1990representing] Bloor, Malcolm IG, Wilson, Michael J. (1990). Representing PDE surfaces in terms of B-splines. Computer-Aided Design.

[smith1985numerical] Smith, Gordon D. (1985). Numerical solution of partial differential equations: finite difference methods.

[cheney1980approximation] Cheney, Elliott Ward. (1980). Approximation theory III.

[graves2013generating] Graves, Alex. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

[wang1999inverse] Wang, Genyuan, Bao, Zheng. (1999). Inverse synthetic aperture radar imaging of maneuvering targets based on chirplet decomposition. Optical Engineering.

[brock2016neural] Brock, A., Lim, T., Ritchie, J. M., Weston, N. (2016). Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093.

[huang2017orthogonal] Huang, L., Liu, X., Lang, B., Yu, A. W., Wang, Y., Li, B.. (2017). Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. arXiv preprint arXiv:1709.06079.

[flandrin2001time] Flandrin, Patrick. (2001). Time frequency and chirps. Aerospace/Defense Sensing, Simulation, and Controls.

[fan2001generalized] Fan, Jianqing, Zhang, Chunming, Zhang, Jian. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. Annals of statistics.

[zeitouni1992generalized] Zeitouni, Ofer, Ziv, Jacob, Merhav, Neri. (1992). When is the generalized likelihood ratio test optimal?. IEEE Transactions on Information Theory.

[boissonnat2006curved] Boissonnat, Jean-Daniel, Wormser, Camille, Yvinec, Mariette. (2006). Curved voronoi diagrams. Effective Computational Geometry for Curves and Surfaces.

[edelsbrunner2012algorithms] Edelsbrunner, Herbert. (2012). Algorithms in combinatorial geometry.

[aurenhammer1987power] Aurenhammer, Franz. (1987). Power diagrams: properties, algorithms and applications. SIAM Journal on Computing.

[Reference1] Achille, Alessandro, Rovere, Matteo, Soatto, Stefano. (2017). Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856.

[largemarginib] Tsai, Yao-Hung Hubert, Wu, Yue, Salakhutdinov, Ruslan, Morency, Louis-Philippe. (2020). Self-supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576.

[dubois2021lossy] Dubois, Yann, Bloem-Reddy, Benjamin, Ullrich, Karen, Maddison, Chris J. (2021). Lossy compression for lossless prediction. Advances in Neural Information Processing Systems.

[wu2018unsupervised] Wu, Zhirong, Xiong, Yuanjun, Yu, Stella, Lin, Dahua. (2018). Unsupervised feature learning via non-parametric instance-level discrimination. arXiv preprint arXiv:1805.01978.

[he2020momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[kahana2022contrastive] Kahana, Jonathan, Hoshen, Yedid. (2022). A Contrastive Objective for Learning Disentangled Representations. arXiv preprint arXiv:2203.11284.

[wang2022rethinking] Wang, Haoqing, Guo, Xun, Deng, Zhi-Hong, Lu, Yan. (2022). Rethinking Minimal Sufficient Representation in Contrastive Learning. arXiv preprint arXiv:2203.07004.

[tian2020makes] Tian, Yonglong, Sun, Chen, Poole, Ben, Krishnan, Dilip, Schmid, Cordelia, Isola, Phillip. (2020). What makes for good views for contrastive learning?. Advances in Neural Information Processing Systems.

[lee2021compressive] Lee, Kuang-Huei, Arnab, Anurag, Guadarrama, Sergio, Canny, John, Fischer, Ian. (2021). Compressive visual representations. Advances in Neural Information Processing Systems.

[fischer2020conditional] Fischer, Ian. (2020). The conditional entropy bottleneck. Entropy.

[lee2021predicting] Lee, Jason D, Lei, Qi, Saunshi, Nikunj, Zhuo, Jiacheng. (2021). Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems.

[arora2019theoretical] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.

[Reference3] Li, Yingming, Yang, Ming, Zhang, Zhongfei. (2018). A survey of multi-view representation learning. IEEE Transactions on Knowledge and Data Engineering.

[donahue2015long] Donahue, Jeffrey, Anne Hendricks, Lisa, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, Darrell, Trevor. (2015). Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE conference on computer vision and pattern recognition.

[mao2014deep] Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, Huang, Zhiheng, Yuille, Alan. (2014). Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632.

[bachman2019learning] Bachman, Philip, Hjelm, R Devon, Buchwalter, William. (2019). Learning representations by maximizing mutual information across views. Advances in neural information processing systems.

[federici2020learning] Federici, Marco, Dutta, Anjan, Forré, Patrick, Kushman, Nate, Akata, Zeynep. (2020). Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017.

[tian2020contrastive] Tian, Yonglong, Krishnan, Dilip, Isola, Phillip. (2020). Contrastive multiview coding. European conference on computer vision.

[tschannen2019mutual] Tschannen, Michael, Djolonga, Josip, Rubenstein, Paul K, Gelly, Sylvain, Lucic, Mario. (2019). On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625.

[darlow2020information] Darlow, Luke Nicholas, Storkey, Amos. (2020). What Information Does a ResNet Compress?. arXiv preprint arXiv:2003.06254.

[deepmultiview2019] Wang, Qi, Boudreau, Claire, Luo, Qixing, Tan, Pang-Ning, Zhou, Jiayu. (2019). Deep Multi-view Information Bottleneck. Proceedings of the 2019 SIAM International Conference on Data Mining (SDM). doi:10.1137/1.9781611975673.5.

[hang2018kernel] Hang, Hanyuan, Steinwart, Ingo, Feng, Yunlong, Suykens, Johan AK. (2018). Kernel density estimation for dynamical systems. The Journal of Machine Learning Research.

[kozachenko1987sample] Kozachenko, Lyudmyla F, Leonenko, Nikolai N. (1987). Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii.

[paninski2003estimation] Paninski, Liam. (2003). Estimation of entropy and mutual information. Neural computation.

[linsker88] Henaff, Olivier. (2020). Data-efficient image recognition with contrastive predictive coding. International Conference on Machine Learning.

[hjelm2018learning] Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil, Trischler, Adam, Bengio, Yoshua. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

[karpathy2015deep] Karpathy, Andrej, Fei-Fei, Li. (2015). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition.

[deepmultiview2015] Wang, Weiran, Arora, Raman, Livescu, Karen, Bilmes, Jeff. (2015). On Deep Multi-View Representation Learning. Proceedings of the 32nd International Conference on Machine Learning.

[multimodel2011] Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, Ng, Andrew Y. (2011). Multimodal Deep Learning. Proceedings of the 28th International Conference on Machine Learning.

[srivastava14b] Srivastava, Nitish, Salakhutdinov, Ruslan. (2014). Multimodal Learning with Deep Boltzmann Machines. Journal of Machine Learning Research.

[chen2010] Chen, Ning, Zhu, Jun, Xing, Eric. (2010). Predictive Subspace Learning for Multi-view Data: a Large Margin Approach. Advances in Neural Information Processing Systems.

[xing2012mining] Xing, Eric P, Yan, Rong, Hauptmann, Alexander G. (2012). Mining associated text and images with dual-wing harmoniums. arXiv preprint arXiv:1207.1423.

[multi2014] Liu, Weifeng, Tao, Dacheng, Cheng, Jun, Tang, Yuanyan. (2014). Multiview Hessian discriminative sparse coding for image annotation. Computer Vision and Image Understanding. doi:10.1016/j.cviu.2013.03.007.

[article2008] Sridharan, Karthik, Kakade, Sham. (2008). An Information Theoretic Framework for Multi-View Learning. Conference on Learning Theory (COLT).

[Tian2013] Cao, Tian, Jojic, Vladimir, Modla, Shannon, Powell, Debbie, Czymmek, Kirk, Niethammer, Marc. (2013). Robust Multimodal Dictionary Learning. Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2013.

[factorized2010] Jia, Yangqing, Salzmann, Mathieu, Darrell, Trevor. (2010). Factorized Latent Spaces with Structured Sparsity. Advances in Neural Information Processing Systems.

[matching2003] Barnard, Kobus, Duygulu, Pinar, Forsyth, David, de Freitas, Nando, Blei, David M., Jordan, Michael I. (2003). Matching Words and Pictures. J. Mach. Learn. Res.

[miss2000] Cohn, David, Hofmann, Thomas. (2000). The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. Advances in Neural Information Processing Systems.

[Sun2013ASO] Sun, Shiliang. (2013). A survey of multi-view machine learning. Neural Computing and Applications.

[hardoon2004] Bach, Francis R., Jordan, Michael I. (2003). Kernel Independent Component Analysis. J. Mach. Learn. Res. doi:10.1162/153244303768966085.

[cca1396] Hotelling, Harold. (1936). Relations Between Two Sets of Variates. Biometrika.

[Darbellay99] Vapnik, Vladimir N, Chervonenkis, A Ya. (2015). On the uniform convergence of relative frequencies of events to their probabilities. CoRR.

[cover1999elements] Cover, Thomas M. (1999). Elements of information theory.

[tishby2000information] Tishby, Naftali, Pereira, Fernando C, Bialek, William. (2000). The information bottleneck method. arXiv preprint physics/0004057.

[koopman1936distributions] Koopman, Bernard Osgood. (1936). On distributions admitting a sufficient statistic. Transactions of the American Mathematical society.

[gilad2003information] Gilad-Bachrach, Ran, Navot, Amir, Tishby, Naftali. (2003). An information theoretic tradeoff between complexity and accuracy. Learning Theory and Kernel Machines.

[kinney2014equitability] Kinney, Justin B, Atwal, Gurinder S. (2014). Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences.

[rosenblatt1958perceptron] Rosenblatt, Frank. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review.

[hinton2006fast] Hinton, Geoffrey E, Osindero, Simon, Teh, Yee-Whye. (2006). A fast learning algorithm for deep belief nets. Neural computation.

[ren2015faster] Ren, Shaoqing, He, Kaiming, Girshick, Ross, Sun, Jian. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems.

[belghazi2018mine] Belghazi, Mohamed Ishmael, Baratin, Aristide, Rajeswar, Sai, Ozair, Sherjil, Bengio, Yoshua, Courville, Aaron, Hjelm, R Devon. (2018). Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062.

[steinke2020reasoning] Steinke, Thomas, Zakynthinou, Lydia. (2020). Reasoning about generalization via conditional mutual information. Conference on Learning Theory.

[alemi2016deep] Alemi, Alexander A, Fischer, Ian, Dillon, Joshua V, Murphy, Kevin. (2016). Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.

[lee2019wide] Lee, Jaehoon, Xiao, Lechao, Schoenholz, Samuel, Bahri, Yasaman, Novak, Roman, Sohl-Dickstein, Jascha, Pennington, Jeffrey. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems.

[strouse2017deterministic] Strouse, DJ, Schwab, David J. (2017). The deterministic information bottleneck. Neural computation.

[elad2019direct] Elad, Adar, Haviv, Doron, Blau, Yochai, Michaeli, Tomer. (2019). Direct validation of the information bottleneck principle for deep nets. Proceedings of the IEEE International Conference on Computer Vision Workshops.

[fischer2020ceb] Fischer, Ian, Alemi, Alexander A. (2020). CEB Improves Model Robustness. arXiv preprint arXiv:2002.05380.

[shannon1948mathematical] Shannon, Claude E. (1948). A mathematical theory of communication. The Bell system technical journal.

[SHAMIR20102696] Ohad Shamir, Sivan Sabato, Naftali Tishby. (2010). Learning and generalization with the information bottleneck. Theoretical Computer Science. doi:10.1016/j.tcs.2010.04.006.

[painsky2018bregman] Painsky, Amichai, Wornell, Gregory W. (2018). Bregman Divergence Bounds and the Universality of the Logarithmic Loss. arXiv preprint arXiv:1810.07014.

[painsky2018information] Painsky, Amichai, Feder, Meir, Tishby, Naftali. (2018). An Information-Theoretic Framework for Non-linear Canonical Correlation Analysis. arXiv preprint arXiv:1810.13259.

[DBLP:journals/corr/abs-1801-02254] Chiyuan Zhang, Qianli Liao, Alexander Rakhlin, Brando Miranda, Noah Golowich, Tomaso A. Poggio. (2018). Theory of Deep Learning IIb: Optimization Properties of SGD. CoRR.

[entropy2019] Cheng, H., Lian, D., Gao, S., Geng, Y. (2019). Utilizing Information Bottleneck to Evaluate the Capability of Deep Neural Networks for Image Classification. Entropy.

[gabrie2018entropy] Gabrié, Marylou, Manoel, Andre, Luneau, Clément, Barbier, Jean, Macris, Nicolas, Krzakala, Florent, Zdeborová, Lenka. (2018). Entropy and mutual information in models of deep neural networks. arXiv preprint arXiv:1805.09785.

[DBLP:journals/corr/abs-1710-11029] Pratik Chaudhari, Stefano Soatto. (2017). Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. CoRR.

[2016arXiv161101353A] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[chechik2005information] Linsker, Ralph. (1988). Self-organization in a perceptual network. Computer.

[painsky2016generalized] Painsky, Amichai, Rosset, Saharon, Feder, Meir. (2016). Generalized independent component analysis over finite alphabets. IEEE Transactions on Information Theory.

[rissanen1978modeling] Rissanen, Jorma. (1978). Modeling by shortest data description. Automatica.

[vapnik1968uniform] Vapnik, Vladimir N, Chervonenkis, Aleksei Yakovlevich. (1968). The uniform convergence of frequencies of the appearance of events to their probabilities. Doklady Akademii Nauk.

[sauer1972density] Sauer, Norbert. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A.

[shelah1972combinatorial] Shelah, Saharon. (1972). A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics.

[hoeffding1963probability] Hoeffding, Wassily. (1963). Probability inequalities for sums of bounded random variables. Journal of the American statistical association.

[chigirev2004optimal] Chigirev, Denis V, Bialek, William. (2004). Optimal manifold representation of data: an information theoretic approach. Advances in Neural Information Processing Systems.

[bell1995information] Bell, Anthony J, Sejnowski, Terrence J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural computation.

[deco2012information] Deco, Gustavo, Obradovic, Dragan. (2012). An information-theoretic approach to neural computing.

[achille2018emergence] Achille, Alessandro, Soatto, Stefano. (2018). Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research.

[saxe2019information] Saxe, Andrew M, Bansal, Yamini, Dapello, Joel, Advani, Madhu, Kolchinsky, Artemy, Tracey, Brendan D, Cox, David D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment.

[yu2020understanding] Yu, Shujian, Wickstrøm, Kristoffer, Jenssen, Robert, Príncipe, José C. (2020). Understanding convolutional neural networks with information theory: An initial exploration. IEEE Transactions on Neural Networks and Learning Systems.

[cheng2018evaluating] Cheng, Hao, Lian, Dongze, Gao, Shenghua, Geng, Yanlin. (2018). Evaluating capability of deep neural networks for image classification via information plane. Proceedings of the European Conference on Computer Vision (ECCV).

[goldfeld2018estimating] Goldfeld, Ziv, Berg, Ewout van den, Greenewald, Kristjan, Melnyk, Igor, Nguyen, Nam, Kingsbury, Brian, Polyanskiy, Yury. (2018). Estimating information flow in deep neural networks. arXiv preprint arXiv:1810.05728.

[wickstrom2019information] Wickstrøm, Kristoffer, Løkse, Sigurd, Kampffmeyer, Michael, Yu, Shujian, Príncipe, José, Jenssen, Robert. (2019). Information Plane Analysis of Deep Neural Networks via Matrix-Based Renyi's Entropy and Tensor Kernels. arXiv preprint arXiv:1909.11396.

[amjad2019learning] Amjad, Rana Ali, Geiger, Bernhard Claus. (2019). Learning representations for neural network-based classification using the information bottleneck principle. IEEE transactions on pattern analysis and machine intelligence.

[goldfeld2020convergence] Goldfeld, Ziv, Greenewald, Kristjan, Niles-Weed, Jonathan, Polyanskiy, Yury. (2020). Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Transactions on Information Theory.

[cvitkovic2019minimal] Cvitkovic, Milan, Koliander, Günther. (2019). Minimal achievable sufficient statistic learning. arXiv preprint arXiv:1905.07822.

[geiger2020information] Geiger, Bernhard C. (2020). On Information Plane Analyses of Neural Network Classifiers--A Review. arXiv preprint arXiv:2003.09671.

[van2020survey] Van Engelen, Jesper E, Hoos, Holger H. (2020). A survey on semi-supervised learning. Machine Learning.

[pogodin2020kernelized] Pogodin, Roman, Latham, Peter E. (2020). Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks. arXiv preprint arXiv:2006.07123.

[chelombiev2019adaptive] Chelombiev, Ivan, Houghton, Conor, O'Donnell, Cian. (2019). Adaptive estimators show information compression in deep neural networks. ICLR.

[song2021train] Song, Yang, Kingma, Diederik P. (2021). How to train your energy-based models. arXiv preprint arXiv:2101.03288.

[huembeli2022physics] Huembeli, Patrick, Arrazola, Juan Miguel, Killoran, Nathan, Mohseni, Masoud, Wittek, Peter. (2022). The physics of energy-based models. Quantum Machine Intelligence.

[noshad2018scalable] Noshad, Morteza, Hero III, Alfred O. (2018). Scalable Mutual Information Estimation using Dependence Graphs. arXiv preprint arXiv:1801.09125.

[achille2018critical] Achille, Alessandro, Rovere, Matteo, Soatto, Stefano. (2018). Critical learning periods in deep networks. International Conference on Learning Representations.

[achille2018information] Achille, Alessandro, Soatto, Stefano. (2018). Information dropout: Learning optimal representations through noisy computation. IEEE transactions on pattern analysis and machine intelligence.

[kirsch2020unpacking] Kirsch, Andreas, Lyle, Clare, Gal, Yarin. (2020). Unpacking Information Bottlenecks: Unifying Information-Theoretic Objectives in Deep Learning. arXiv preprint arXiv:2003.12537.

[pensia2018generalization] Pensia, Ankit, Jog, Varun, Loh, Po-Ling. (2018). Generalization error bounds for noisy, iterative algorithms. 2018 IEEE International Symposium on Information Theory (ISIT).

[NIPS2019_9282] Negrea, Jeffrey, Haghifam, Mahdi, Dziugaite, Gintare Karolina, Khisti, Ashish, Roy, Daniel M. (2019). Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates. Advances in Neural Information Processing Systems 32.

[NIPS2018_7954] Asadi, Amir, Abbe, Emmanuel, Verdu, Sergio. (2018). Chaining Mutual Information and Tightening Generalization Bounds. Advances in Neural Information Processing Systems 31.

[russo2016controlling] Russo, Daniel, Zou, James. (2016). Controlling bias in adaptive data analysis using information theory. Artificial Intelligence and Statistics.

[vera2018role] Vera, Matías, Piantanida, Pablo, Rey Vega, Leonardo. (2018). The role of information complexity and randomization in representation learning. arXiv preprint arXiv:1802.05355.

[boucheron2005theory] Boucheron, Stéphane, Bousquet, Olivier, Lugosi, Gábor. (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics.

[neyshabur2014search] Neyshabur, Behnam, Tomioka, Ryota, Srebro, Nathan. (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614.

[neyshabur2015norm] Neyshabur, Behnam, Tomioka, Ryota, Srebro, Nathan. (2015). Norm-based capacity control in neural networks. Conference on Learning Theory.

[10.2307/2334522] Ralph B. D'Agostino. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika.

[Krizhevsky09learningmultiple] Alex Krizhevsky. (2009). Learning multiple layers of features from tiny images.

[bartlett2002rademacher] Bartlett, Peter L, Mendelson, Shahar. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research.

[bousquet2002stability] Bousquet, Olivier, Elisseeff, Andre. (2002). Stability and generalization. Journal of machine learning research.

[stavac] Achille, Alessandro, Paolini, Giovanni, Soatto, Stefano. (2019). Where is the information in a deep neural network?. arXiv preprint arXiv:1905.12213.

[nash2018inverting] Nash, Charlie, Kushman, Nate, Williams, Christopher KI. (2018). Inverting Supervised Representations with Autoregressive Neural Density Models. arXiv preprint arXiv:1806.00400.

[csiszar1987conditional] Csiszár, Imre, Cover, Thomas M, Choi, Byoung-Seon. (1987). Conditional limit theorems under Markov conditioning. IEEE Transactions on Information Theory.

[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.

[berglund2013measuring] Kraskov, Alexander, Stögbauer, Harald, Grassberger, Peter. (2004). Estimating mutual information. Phys. Rev. E. doi:10.1103/PhysRevE.69.066138.

[2014arXiv1412.6615S] Anonymous. (2019). Representation Compression and Generalization in Deep Neural Networks. Journal of Machine Learning Research.

[turner2007maximum] Turner, Richard, Sahani, Maneesh. (2007). A maximum-likelihood interpretation for slow feature analysis. Neural computation.

[hecht2009speaker] Hecht, Ron M, Noor, Elad, Tishby, Naftali. (2009). Speaker recognition by Gaussian information bottleneck. Tenth Annual Conference of the International Speech Communication Association.

[palmer2015predictive] Palmer, Stephanie E, Marre, Olivier, Berry, Michael J, Bialek, William. (2015). Predictive information in a sensory population. Proceedings of the National Academy of Sciences.

[buesing2010spiking] Buesing, Lars, Maass, Wolfgang. (2010). A spiking neuron as information bottleneck. Neural computation.

[saxe2018information] Saxe, Andrew M, Bansal, Yamini, Dapello, Joel, Advani, Madhu, Kolchinsky, Artemy, Tracey, Brendan D, Cox, David D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment.

[amjad2018not] Amjad, Rana Ali, Geiger, Bernhard C. (2018). How (Not) To Train Your Neural Network Using the Information Bottleneck Principle. arXiv preprint arXiv:1802.09766.

[elad2018effectiveness] Elad, Adar, Haviv, Doron, Blau, Yochai, Michaeli, Tomer. (2018). The effectiveness of layer-by-layer training using the information bottleneck principle.

[xu2017information] Xu, Aolin, Raginsky, Maxim. (2017). Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems.

[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, Stéphane. (2021). Barlow twins: Self-supervised learning via redundancy reduction. International Conference on Machine Learning.

[tian2021understanding] Tian, Yuandong, Chen, Xinlei, Ganguli, Surya. (2021). Understanding self-supervised learning dynamics without contrastive pairs. International Conference on Machine Learning.

[hua2021feature] Hua, Tianyu, Wang, Wenxiao, Xue, Zihui, Ren, Sucheng, Wang, Yue, Zhao, Hang. (2021). On feature decorrelation in self-supervised learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[zhang2022how] Chaoning Zhang, Kang Zhang, Chenshuang Zhang, Trung X. Pham, Chang D. Yoo, In So Kweon. (2022). How Does SimSiam Avoid Collapse Without Negative Samples? A Unified Understanding with Self-supervised Contrastive Learning. International Conference on Learning Representations.

[Arora2019theory] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.

[pu2020multimodal] Pu, Shi, He, Yijiang, Li, Zheng, Zheng, Mao. (2020). Multimodal Topic Learning for Video Recommendation. arXiv preprint arXiv:2010.13373.

[voloshynovskiy2019information] Voloshynovskiy, Slava, Taran, Olga, Kondah, Mouad, Holotyak, Taras, Rezende, Danilo. (2020). Variational Information Bottleneck for Semi-Supervised Classification. Entropy. doi:10.3390/e22090943.

[gao2015efficient] Gao, Shuyang, Ver Steeg, Greg, Galstyan, Aram. (2015). Efficient estimation of mutual information for strongly dependent variables. Artificial Intelligence and Statistics.

[Belghazi2018MutualIN] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, R. Devon Hjelm, Aaron C. Courville. (2018). Mutual Information Neural Estimation. ICML.

[donsker1975asymptotic] Donsker, Monroe D, Varadhan, SR Srinivasa. (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics.

[2018Estimating] Goldfeld, Ziv, et al. (2018). Estimating Information Flow in Neural Networks. ArXiv e-prints.

[jacobsen2018irevnet] Jörn-Henrik Jacobsen, Arnold W.M. Smeulders, Edouard Oyallon. (2018). i-RevNet: Deep Invertible Networks. International Conference on Learning Representations.

[bertsekas2011incremental] Bertsekas, Dimitri P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning.

[li2017convergence] Li, Yuanzhi, Yuan, Yang. (2017). Convergence analysis of two-layer neural networks with relu activation. Advances in Neural Information Processing Systems.

[dieuleveut2017bridging] Dieuleveut, Aymeric, Durmus, Alain, Bach, Francis. (2017). Bridging the gap between constant step size stochastic gradient descent and markov chains. arXiv preprint arXiv:1707.06386.

[rumelhart1986learning] Rumelhart, David E, Hinton, Geoffrey E, Williams, Ronald J. (1986). Learning representations by back-propagating errors. nature.

[oord2016wavenet] Oord, Aaron van den, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alex, Kalchbrenner, Nal, Senior, Andrew, Kavukcuoglu, Koray. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

[matias2018role] Bell, Anthony J, Sejnowski, Terrence J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural computation.

[DBLP:journals/corr/WellingHinton2005] Alemi, Alexander A, Fischer, Ian, Dillon, Joshua V, Murphy, Kevin. (2016). Deep Variational Information Bottleneck. arXiv:1612.00410.

[skincat] Alemi, Alexander A, Poole, Ben, Dillon, Joshua V, Saurous, Rif A, Murphy, Kevin. (2017). An Information-Theoretic Analysis of Deep Latent-Variable Models. arXiv:1711.00464.

[brokenelbo] Alemi, Alexander A, Poole, Ben, Dillon, Joshua V, Saurous, Rif A, Murphy, Kevin. (2018). Fixing a Broken ELBO. ICML 2018.

[infoautoencoding] Anonymous. (2018). The Information-Autoencoding Family: A Lagrangian Perspective on Latent Variable Generative Modeling. International Conference on Learning Representations.

[rationalignorance] Mattingly, Henry H, Transtrum, Mark K, Abbott, Michael C, Machta, Benjamin B. (2017). Rational ignorance: simpler models learn more from finite data. arXiv:1705.01166.

[infoscaling] Abbott, Michael C, Machta, Benjamin B. (2018). An Information Scaling Law: $\zeta = 3/4$. arXiv:1710.09351.

[thermoinfo] Parrondo, Juan MR, Horowitz, Jordan M, Sagawa, Takahiro. (2015). Thermodynamics of information. Nature physics.

[costbenefitdata] Still, Susanne. (2017). Thermodynamic cost and benefit of data representations. arXiv:1705.00612.

[marginalent] Crooks, Gavin E. (2016). Marginal and Conditional Second Laws of Thermodynamics. arXiv:1611.04628.

[thermoprediction] Still, Susanne, Sivak, David A, Bell, Anthony J, Crooks, Gavin E. (2012). Thermodynamics of Prediction. Physical Review Letters. doi:10.1103/PhysRevLett.109.120604.

[interactive] Still, Susanne. (2009). Information-theoretic approach to interactive learning. EPL (Europhysics Letters). doi:10.1209/0295-5075/85/28005.

[optimalcausal] Still, Susanne, Crutchfield, James P, Ellison, Christopher J. (2007). Optimal Causal Inference: Estimating Stored Information and Approximating Causal Architecture. arXiv:0708.1580.

[structurenoise] Still, Susanne, Crutchfield, James P. (2007). Structure or Noise?. arXiv:0708.0654.

[clusters] Still, Susanne, Bialek, William. How many clusters? An information theoretic perspective. ArXiv Physics e-prints.

[jaynes] Jaynes, Edwin T. (1957). Information theory and statistical mechanics. Physical review.

[sethna] Sethna, James. (2006). Statistical mechanics: entropy, order parameters, and complexity.

[coverthomas] Cover, Thomas M, Thomas, Joy A. (2012). Elements of information theory.

[reversible] Maclaurin, Dougal, Duvenaud, David, Adams, Ryan P.. (2015). Gradient-based Hyperparameter Optimization Through Reversible Learning. Proceedings of the 32Nd International Conference on International Conference on Machine Learning - Volume 37.

[mib] Friedman, Nir, Mosenzon, Ori, Slonim, Noam, Tishby, Naftali. (2001). Multivariate information bottleneck. Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence.

[predictive] Bialek, William, Nemenman, Ilya, Tishby, Naftali. (2001). Predictability, complexity, and learning. Neural computation.

[vae] Kingma, Diederik P, Welling, Max. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[betavae] Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, Lerchner, Alexander. (2017). $\beta$-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. International Conference on Learning Representations.

[emergence] Achille, Alessandro, Soatto, Stefano. (2017). Emergence of Invariance and Disentangling in Deep Representations. Proceedings of the ICML Workshop on Principled Approaches to Deep Learning.

[ib] Tishby, Naftali, Pereira, Fernando C., Bialek, William. (1999). The Information Bottleneck method. The 37th Annual Allerton Conference on Communication, Control, and Computing.

[bbb] Blundell, Charles, Cornebise, Julien, Kavukcuoglu, Koray, Wierstra, Daan. (2015). Weight Uncertainty in Neural Networks. arXiv:1505.05424.

[semi] Kingma, Diederik P, Mohamed, Shakir, Rezende, Danilo Jimenez, Welling, Max. (2014). Semi-Supervised Learning with Deep Generative Models. arXiv:1406.5298.

[sgdasbayes] Mandt, Stephan, Hoffman, Matthew D, Blei, David M. (2017). Stochastic Gradient Descent as Approximate Bayesian Inference. arXiv:1704.04289.

[sgr] Ma, Yi-An, Chen, Tianqi, Fox, Emily B. (2015). A Complete Recipe for Stochastic Gradient MCMC. arXiv:1506.04696.

[sgld] Welling, Max, Teh, Yee W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th international conference on machine learning (ICML-11).

[bayessgd] Smith, Samuel L, Le, Quoc V. (2017). A Bayesian Perspective on Generalization and Stochastic Gradient Descent. arXiv:1710.06451.

[sghmc] Chen, Tianqi, Fox, Emily B, Guestrin, Carlos. (2014). Stochastic Gradient Hamiltonian Monte Carlo. arXiv:1402.4102.

[snapshot] Huang, Gao, Li, Yixuan, Pleiss, Geoff, Liu, Zhuang, Hopcroft, John E, Weinberger, Kilian Q. (2017). Snapshot Ensembles: Train 1, Get M for Free. arXiv:1704.00109.

[poppar] Machta, Jonathan, Ellis, Richard S. (2011). Monte Carlo Methods for Rough Free Energy Landscapes: Population Annealing and Parallel Tempering. Journal of Statistical Physics. doi:10.1007/s10955-011-0249-0.

[finn] Finn, Colin BP. (1993). Thermal physics.

[energyentropy] Zhang, Yao, Saxe, Andrew M, Advani, Madhu S, Lee, Alpha A. (2018). Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning. arXiv:1803.01927.

[pacbayes] McAllester, David. (2013). A PAC-Bayesian Tutorial with A Dropout Bound. arXiv:1307.2118.

[pacbayesbayes] Germain, Pascal, Bach, Francis, Lacoste, Alexandre, Lacoste-Julien, Simon. (2016). PAC-Bayesian Theory Meets Bayesian Inference. Advances in Neural Information Processing Systems 29.

[marsh] Marsh, Charles. (2013). Introduction to continuous entropy.

[box] Box, George EP, Draper, Norman R. (1987). Empirical model-building and response surfaces..

[infoprojection] Csiszár, Imre, Matúš, František. (2003). Information projections revisited. IEEE Transactions on Information Theory.

[lecturenotes] Ariel Caticha. (2008). Lectures on Probability, Entropy, and Statistical Physics.

[correspondence] Colin H. LaMont, Paul A. Wiggins. (2017). A correspondence between thermodynamics and inference.

[watanabegrey] Watanabe, Sumio. (2009). Algebraic geometry and statistical learning theory.

[watanabegreen] Watanabe, Sumio. (2018). Mathematical theory of Bayesian statistics.

[whereinfo] Alessandro Achille, Stefano Soatto. (2019). Where is the Information in a Deep Neural Network?.

[ffjord] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, David Duvenaud. (2018). FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models.

[widelinear] Lee, Jaehoon, Xiao, Lechao, Schoenholz, Samuel, Bahri, Yasaman, Novak, Roman, Sohl-Dickstein, Jascha, Pennington, Jeffrey. (2019). Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. arXiv e-prints.

[fisherRao] Liang, Tengyuan, Poggio, Tomaso, Rakhlin, Alexander, Stokes, James. (2017). Fisher-rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530.

[AIC] Akaike, Hirotugu. (1974). A new look at the statistical model identification. Selected Papers of Hirotugu Akaike.

[TIC] Thomas, Valentin, et al. (2019). Information matrices and generalization. arXiv e-prints.

[generalization_dnn] Neyshabur, Behnam, Bhojanapalli, Srinadh, McAllester, David, Srebro, Nati. (2017). Exploring generalization in deep learning. Advances in Neural Information Processing Systems.

[vmibounds] Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A. Alemi, George Tucker. (2019). On Variational Bounds of Mutual Information. CoRR.

[gaussib] Chechik, Gal, Globerson, Amir, Tishby, Naftali, Weiss, Yair. (2005). Information bottleneck for Gaussian variables. Journal of machine learning research.

[halko] Halko, Nathan, Martinsson, Per-Gunnar, Tropp, Joel A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review.

[blackbox] Shwartz-Ziv, Ravid, Tishby, Naftali. (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.

[tishbydeep] Tishby, Naftali, Zaslavsky, Noga. (2015). Deep learning and the information bottleneck principle. 2015 IEEE Information Theory Workshop (ITW).

[saxe] Saxe, Andrew M, Bansal, Yamini, Dapello, Joel, Advani, Madhu, Kolchinsky, Artemy, Tracey, Brendan D, Cox, David D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment.

[hownot] Amjad, Rana Ali, Geiger, Bernhard C. (2018). How (not) to train your neural network using the information bottleneck principle. arXiv preprint arXiv:1802.09766.

[brendan] Kolchinsky, Artemy, Tracey, Brendan D, Van Kuyk, Steven. (2018). Caveats for information bottleneck in deterministic scenarios. arXiv preprint arXiv:1808.07593.

[mnist] LeCun, Yann, Cortes, Corinna, Burges, CJ. (2010). MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist.

[ntk] Jacot, Arthur, Gabriel, Franck, Hongler, Clément. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems.

[neuraltangents] Novak, Roman, Xiao, Lechao, Hron, Jiri, Lee, Jaehoon, Alemi, Alexander A, Sohl-Dickstein, Jascha, Schoenholz, Samuel S. (2019). Neural tangents: Fast and easy infinite neural networks in python. arXiv preprint arXiv:1912.02803.

[fisher] Frederik Kunstner, Lukas Balles, Philipp Hennig. (2019). Limitations of the Empirical Fisher Approximation.

[littlebits] Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, Amir Yehudayoff. (2017). Learners that Use Little Information.

[neuralode] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud. (2018). Neural Ordinary Differential Equations.

[bayesianbounds] Banerjee, Arindam. (2006). On bayesian bounds. Proceedings of the 23rd international conference on Machine learning.

[invertible] Anonymous. (2020). On the Invertibility of Invertible Neural Networks. Submitted to International Conference on Learning Representations.

[cando] Sanjeev Arora, Simon S. Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, Dingli Yu. (2019). Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks.

[liang2019fisher] Liang, Tengyuan, Poggio, Tomaso, Rakhlin, Alexander, Stokes, James. (2019). Fisher-rao metric, geometry, and complexity of neural networks. The 22nd International Conference on Artificial Intelligence and Statistics.

[neyshabur2017exploring] Neyshabur, Behnam, Bhojanapalli, Srinadh, McAllester, David, Srebro, Nati. (2017). Exploring generalization in deep learning. Advances in neural information processing systems.

[hardt2016train] Hardt, Moritz, Recht, Ben, Singer, Yoram. (2016). Train faster, generalize better: Stability of stochastic gradient descent. International Conference on Machine Learning.

[watanabe2010asymptotic] Watanabe, Sumio, Opper, Manfred. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory.. Journal of machine learning research.

[russo2019much] Russo, Daniel, Zou, James. (2019). How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory.

[slonim2002information] Slonim, Noam. (2002). The information bottleneck: Theory and applications.

[Tishby1999] Tishby, Naftali, Pereira, Fernando C, Bialek, William. (1999). The information bottleneck method. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing.

[Gilad-bachrach] Gilad-Bachrach, Ran, Navot, Amir, Tishby, Naftali. (2003). An information theoretic tradeoff between complexity and accuracy. In Proceedings of the COLT.

[CriticalSlowingDown:2004] Tredicce, Jorge R, Lippi, Gian Luca, Mandel, Paul, Charasse, Basile, Chevalier, Aude, Picqué, B. (2004). Critical slowing down at a bifurcation. American Journal of Physics.

[shwartz2017] Shwartz-Ziv, Ravid, Tishby, Naftali. (2017). Opening the Black Box of Deep Neural Networks via Information. arXiv e-prints.

[tishby99information] Tishby, Naftali, Pereira, Fernando C., Bialek, William. The information bottleneck method. Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing.

[Csiszar] Csiszár, Imre, Shields, Paul C. (2004). Information Theory and Statistics: A Tutorial. Commun. Inf. Theory. doi:10.1561/0100000004.

[Cover:2006:EIT:1146355] Cover, Thomas M., Thomas, Joy A.. (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing).

[DBLP:conf/alt/ShamirST08] Ohad Shamir, Sivan Sabato, Naftali Tishby. (2010). Learning and generalization with the information bottleneck. Theor. Comput. Sci..

[DBLP:conf/alt/2008] Algorithmic Learning Theory, 19th International Conference, ALT 2008. (2008).

[Exp_forms] Lawrence D. Brown. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Lecture Notes-Monograph Series.

[Painsky2019] Painsky, Amichai, Wornell, Gregory W. (2018). Bregman Divergence Bounds and the Universality of the Logarithmic Loss. arXiv e-prints.

[Csiszar:2004:ITS:1166379.1166380] Csiszár, Imre, Shields, Paul C. (2004). Information Theory and Statistics: A Tutorial. Commun. Inf. Theory. doi:10.1561/0100000004.

[CIS-58533] Tusnady, G., Csiszar, I.. (1984). Information geometry and alternating minimization procedures. Statistics & Decisions: Supplement Issues.

[slonim_MIB] Slonim, Noam, Friedman, Nir, Tishby, Naftali. (2006). Multivariate Information Bottleneck. Neural Computation. doi:10.1162/neco.2006.18.8.1739.

[Ay2019] Domenico Felice, Nihat Ay. (2019). Divergence Functions in Information Geometry. Geometric Science of Information - 4th International Conference, GSI 2019. doi:10.1007/978-3-030-26980-7_45.

[DBLP:conf/gsi/2019] Geometric Science of Information - 4th International Conference, GSI 2019. (2019).

[parker] Albert E. Parker, Tomáš Gedeon, Alexander G. Dimitrov. (2003). Annealing and the Rate Distortion Problem. Advances in Neural Information Processing Systems 15.

[Jaynes58] Jaynes, E. T.. (1957). Information Theory and Statistical Mechanics. Phys. Rev.. doi:10.1103/PhysRev.106.620.

[ZaslavskyTishby:2019] Zaslavsky, Noga, Tishby, Naftali. (2019). Deterministic Annealing and the Evolution of Optimal Information Bottleneck Representations. Preprint.

[Kullback58] S. Kullback. (1959). Information Theory and Statistics.

[GaussianIB] Chechik, Gal, Globerson, Amir, Tishby, Naftali, Weiss, Yair. (2005). Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res..

[globerson2003sufficient] Globerson, Amir, Tishby, Naftali. (2003). Sufficient dimensionality reduction. Journal of Machine Learning Research.

[ma2019unpaired] Ma, Shuang, McDuff, Daniel, Song, Yale. (2019). Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck. Proceedings of the IEEE International Conference on Computer Vision.

[schneidman2001analyzing] Schneidman, Elad, Slonim, Noam, Tishby, Naftali, van Steveninck, R deRuyter, Bialek, William. (2001). Analyzing neural codes using the information bottleneck method. Advances in Neural Information Processing Systems, NIPS.

[Alemi2016DeepVI] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, Kevin Murphy. (2016). Deep Variational Information Bottleneck. ArXiv.

[Parbhoo2018CausalDI] Sonali Parbhoo, Mario Wieser, Volker Roth. (2018). Causal Deep Information Bottleneck. ArXiv.

[westover2008asymptotic] Westover, M Brandon. (2008). Asymptotic geometry of multiple hypothesis testing. IEEE transactions on information theory.

[nielsen2011chernoff] Nielsen, Frank. (2011). Chernoff information of exponential families. arXiv preprint arXiv:1102.2684.

[wieczorek2020difference] Wieczorek, Aleksander, Roth, Volker. (2020). On the Difference between the Information Bottleneck and the Deep Information Bottleneck. Entropy.

[wu2020phase] Wu, Tailin, Fischer, Ian. (2020). Phase Transitions for the Information Bottleneck in Representation Learning. arXiv preprint arXiv:2001.01878.

[fischer2018conditional] Fischer, Ian. (2018). The conditional entropy bottleneck. OpenReview preprint.

[lecun-mnisthandwrittendigit-2010] LeCun, Yann, Cortes, Corinna. (2010). MNIST handwritten digit database.

[raman2017illum] Raman, Ravi Kiran, Yu, Haizi, Varshney, Lav R. (2017). Illum information. 2017 Information Theory and Applications Workshop (ITA).

[palomar2008lautum] Palomar, Daniel P, Verdú, Sergio. (2008). Lautum information. IEEE Transactions on Information Theory.

[poole2019variational] Poole, Ben, Ozair, Sherjil, Oord, Aaron van den, Alemi, Alexander A, Tucker, George. (2019). On variational bounds of mutual information. arXiv preprint arXiv:1905.06922.

[hsu2018generalizing] Hsu, Hsiang, Asoodeh, Shahab, Salamatian, Salman, Calmon, Flavio P. (2018). Generalizing bottleneck problems. 2018 IEEE International Symposium on Information Theory (ISIT).

[dusenberry2020efficient] Dusenberry, Michael W, Jerfel, Ghassen, Wen, Yeming, Ma, Yi-an, Snoek, Jasper, Heller, Katherine, Lakshminarayanan, Balaji, Tran, Dustin. (2020). Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors. arXiv preprint arXiv:2005.07186.

[muller2019does] Müller, Rafael, Kornblith, Simon, Hinton, Geoffrey E. (2019). When does label smoothing help?. Advances in Neural Information Processing Systems.

[shalev2014understanding] Shalev-Shwartz, Shai, Ben-David, Shai. (2014). Understanding machine learning: From theory to algorithms.

[zagoruyko2017diracnets] Zagoruyko, Sergey, Komodakis, Nikos. (2017). Diracnets: Training very deep neural networks without skip-connections. arXiv preprint arXiv:1706.00388.

[shamir2008learning] Shamir, Ohad, Sabato, Sivan, Tishby, Naftali. (2008). Learning and generalization with the information bottleneck. International Conference on Algorithmic Learning Theory.

[li-eisner-2019] Li, Xiang Lisa, Eisner, Jason. (2019). Specializing Word Embeddings (for Parsing) by Information Bottleneck. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).

[bib1] Achille, A. and Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE transactions on pattern analysis and machine intelligence, 40(12):2897–2905, 2018.

[bib2] Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

[bib3] Amjad, R. A. and Geiger, B. C. Learning representations for neural network-based classification using the information bottleneck principle. IEEE transactions on pattern analysis and machine intelligence, 2019.

[bib4] Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.

[bib5] Balestriero, R. and Baraniuk, R. A spline theory of deep networks. In Proc. ICML, volume 80, pp. 374–383, Jul. 2018.

[bib6] Bardes, A., Ponce, J., and LeCun, Y. VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.

[bib7] Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, R. D. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.

[bib8] Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., and Shah, R. Signature verification using a "siamese" time delay neural network. Advances in Neural Information Processing Systems, 6, 1993.

[bib9] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.

[bib10] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9650–9660, 2021.

[bib11] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020.

[bib12] Chen, X. and He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758, 2021.

[bib13] Cheney, E. W. and Light, W. A. A course in approximation theory, volume 101. American Mathematical Soc., 2009.

[bib14] D’Agostino, R. B. An omnibus test of normality for moderate and large size samples. Biometrika, 58(2):341–348, 1971. ISSN 00063444. URL http://www.jstor.org/stable/2334522.

[bib15] Dubois, Y., Bloem-Reddy, B., Ullrich, K., and Maddison, C. J. Lossy compression for lossless prediction. Advances in Neural Information Processing Systems, 34, 2021.

[bib16] Egerstedt, M. and Martin, C. Control theoretic splines: optimal control, statistics, and path planning. Princeton University Press, 2009.

[bib17] Fantuzzi, C., Simani, S., Beghelli, S., and Rovatti, R. Identification of piecewise affine models in noisy environment. International Journal of Control, 75(18):1472–1485, 2002.

[bib18] Federici, M., Dutta, A., Forré, P., Kushman, N., and Akata, Z. Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017, 2020.

[bib19] Goldfeld, Z., van den Berg, E., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., and Polyanskiy, Y. Estimating information flow in neural networks. ArXiv e-prints, 2018.

[bib20] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning, volume 1. MIT Press, 2016.

[bib21] Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.

[bib22] Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.

[bib23] Huber, M., Bailey, T., Durrant-Whyte, H., and Hanebeck, U. On entropy approximation for Gaussian mixture random vectors. In IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems, pp. 181–188, 2008. doi: 10.1109/MFI.2008.4648062.

[bib24] Kolchinsky, A. and Tracey, B. D. Estimating mixture entropy with pairwise distances. Entropy, 19(7):361, 2017.

[bib25] Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.

[bib26] Lee, J. D., Lei, Q., Saunshi, N., and Zhuo, J. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34, 2021a.

[bib27] Lee, K.-H., Arnab, A., Guadarrama, S., Canny, J., and Fischer, I. Compressive visual representations. Advances in Neural Information Processing Systems, 34, 2021b.

[bib28] Linsker, R. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.

[bib29] Misra, I. and Maaten, L. v. d. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717, 2020.

[bib30] Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. On the number of linear regions of deep neural networks. In Proc. NeurIPS, pp. 2924–2932, 2014.

[bib31] Moshksar, K. and Khandani, A. K. Arbitrarily tight bounds on differential entropy of Gaussian mixtures. IEEE Transactions on Information Theory, 62(6):3340–3354, 2016. doi: 10.1109/TIT.2016.2553147.

[bib32] Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[bib33] Piran, Z., Shwartz-Ziv, R., and Tishby, N. The dual information bottleneck. arXiv preprint arXiv:2006.04641, 2020.

[bib34] Poole, B., Ozair, S., Oord, A. v. d., Alemi, A. A., and Tucker, G. On variational bounds of mutual information. arXiv preprint arXiv:1905.06922, 2019.

[bib35] Shi, T., Belkin, M., and Yu, B. Data spectroscopy: Eigenspaces of convolution operators and clustering. The Annals of Statistics, pp. 3960–3984, 2009.

[bib36] Shwartz-Ziv, R. Information flow in deep neural networks. arXiv preprint arXiv:2202.06749, 2022.

[bib37] Shwartz-Ziv, R. and Alemi, A. A. Information in infinite ensembles of infinitely-wide neural networks. In Symposium on Advances in Approximate Bayesian Inference, pp. 1–17. PMLR, 2020.

[bib38] Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.

[bib39] Shwartz-Ziv, R., Painsky, A., and Tishby, N. Representation compression and generalization in deep neural networks, 2018.

[bib40] Steinke, T. and Zakynthinou, L. Reasoning about generalization via conditional mutual information. In Conference on Learning Theory, pp. 3437–3452. PMLR, 2020.

[bib41] Xu, A. and Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems, 30, 2017.