
Learning by Reconstruction Produces Uninformative Features For Perception

Randall Balestriero

Abstract

Input space reconstruction is an attractive representation learning paradigm. Despite the interpretability of its reconstructions and generations, we identify a misalignment between learning by reconstruction and learning for perception. We show that the former allocates a model's capacity towards the subspace of the data explaining the observed variance, a subspace with uninformative features for the latter. For example, the supervised TinyImagenet task with images projected onto the top subspace explaining 90% of the pixel variance can be solved with 45% test accuracy. Using the bottom subspace instead, accounting for only 20% of the pixel variance, reaches 55% test accuracy. Features useful for perception are captured only at the latest stage of training, since the principal subspace (uninformative for perception) is learned first; the features for perception being learned last explains the need for long training times, e.g., with Masked Autoencoders. Learning by denoising is a popular strategy to alleviate that misalignment. We prove that while some noise strategies, such as masking, are indeed beneficial, others, such as additive Gaussian noise, are not. Yet, even in the case of masking, we find that the benefits vary as a function of the mask's shape, ratio, and the considered dataset. While tuning the noise strategy without knowledge of the perception task seems challenging, we provide first clues on how to detect if a noise strategy is never beneficial regardless of the perception task.


Introduction

One of the far-reaching mandates of deep learning is to provide a self-contained methodology to learn intelligible and universal representations of data (LeCun et al., 2015). That is, to learn a nonlinear transformation of the data producing a parsimonious and informative representation which can be used to solve numerous downstream tasks. Significant progress has been made through the lens of supervised learning, i.e., by learning a representation that maps

1 Independent, 2 NYU. Correspondence to: Randall Balestriero <randallbalestriero@gmail.com>.

Figure 1. Features for reconstruction are uninformative for perception (top): TinyImagenet ResNet9 top-1 accuracy when trained and validated on images projected onto the top subspace (red) or bottom subspace (blue) of explained variance; corresponding images are displayed in the middle and in Fig. 9. Perception features are learned last (bottom): training loss evolution (red to blue) of reconstructed training images from a deep autoencoder projected onto the eigenspace of the original data (black). The top eigenspace (right) is learned first, and then, if training lasts long enough, the features most useful for perception (left) are finally learned. This explains why performance on perception tasks keeps increasing long after reconstructed samples look appealing.


the observed data to provided labels of an a priori known downstream task (Krizhevsky et al., 2012). Labels being costly and over-specialized, much progress was also made through the lens of unsupervised learning (Barlow, 1989; Ghahramani, 2003), roughly falling into three camps. First, reconstruction-based methods that produce (compressed) latent representations of the data nonetheless sufficient to recover most of the original data, e.g., Denoising/Variational/Masked Autoencoders (Vincent et al., 2010; Kingma & Welling, 2013; He et al., 2022) and autoregressive models (Van den Oord et al., 2016; Chen et al., 2018). Second, score matching, which is often solved by setting up a surrogate supervised task of classifying observed samples from noise (Hyvärinen & Dayan, 2005). Third, Self-Supervised Learning (SSL) (Chen et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Balestriero et al., 2023), whose contrastive methods can be thought of as generalized versions of score matching and Noise Contrastive Estimation (Gutmann & Hyvärinen, 2010), combining an invariance term bringing together representations of data known to be semantically similar, or generated to be so, and an anti-collapse term making sure that not all representations become similar.

In recent years, SSL emerged as the preferred solution, regularly reaching new states of the art through careful experimental design (Garrido et al., 2023). Yet, reconstruction-based methods maintain a large presence due to their ability to provide reconstructed samples that are human-interpretable, enabling informed quality assessment of a model (Selvaraju et al., 2016; Bordes et al., 2021; Brynjolfsson et al., 2023; Baidoo-Anu & Ansah, 2023). Despite that benefit, reconstruction-based learning falls behind SSL as it requires fine-tuning to reach the state of the art. One of the most popular reconstruction-based learning strategies, which emerged as the solution of choice in recent years, is the Masked Autoencoder (He et al., 2022).

We ask the following question: Why does reconstruction-based learning easily produce compelling reconstructed samples yet fail to provide competitive latent representations for perception? We can pinpoint that observation to at least three reasons.

R1: Misaligned. The features with the most reconstructive power are the least informative for perceptual tasks, as depicted in Fig. 1 (top). Instead, images projected onto the bottom subspace remain informative for perception.

R2: Ill-conditioned. Features useful for perception (low variance subspace) are learned last as depicted in Fig. 1 (bottom), in favor of learning first the top subspace of the data which explains most of the pixel variance but fails to solve perceptual tasks.

R3: Ill-posed. There exist different model parameters producing the same train and test reconstruction error but exhibiting significant performance gaps for perceptual tasks, as depicted in Fig. 6, where for a given reconstruction error the top-1 accuracy on Imagenet-10 can vary from 50% to almost 90%.

The findings from R1, R2, and R3 provide first clues as to why learning by reconstruction requires long training times and fine-tuning. Yet, those findings alone do not answer the following question: Why were Masked Autoencoders able to provide a significant improvement in the quality of the learned representation for solving perception tasks? We will prove that the hindrances R1, R2, and R3 can be pushed back through careful design of the noise distribution used in denoising autoencoders. In particular, we will demonstrate that masking is provably beneficial while other noise distributions, e.g., additive Gaussian noise, are not. We hope that our findings will help steer further research in learning by reconstruction towards exploring alternative noise distributions, as they are the main driver of learning useful representations for perception.

Background and Notations

Notations: We denote by x_n ∈ R^D, n = 1, ..., N the n-th input sample, e.g., an (H, W, C) image flattened to a D = H × W × C dimensional vector. The entire training set is collected into the matrix X ≜ [x_1, ..., x_N] ∈ M_{D,N}(R), where M_{n,m}(R) is the vector space of real n × m matrices. Throughout our study, vectors will always be column-vectors, and matrices are built by horizontally stacking column-vectors, i.e., they are column-major. Unless specified otherwise, we assume that the input matrix X is full-rank. In practice, if this is not the case, one can easily disregard the subspace associated with zero singular values and apply our analysis to that filtered matrix instead.
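To make these conventions concrete, here is a minimal sketch (NumPy; the random `images` array is a hypothetical stand-in for a real dataset) building the column-major data matrix X from a batch of (H, W, C) images:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, N = 8, 8, 3, 300
images = rng.standard_normal((N, H, W, C))  # hypothetical stand-in for real images

# Flatten each (H, W, C) image into a D-dimensional column and stack
# horizontally: X is (D, N), column-major as described in the text.
D = H * W * C
X = images.reshape(N, D).T

assert X.shape == (D, N)
# Random Gaussian data has rank D (since N > D), matching the full-rank
# assumption on X; otherwise one would drop the zero singular subspace.
assert np.linalg.matrix_rank(X) == D
```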

Learning by reconstruction. Learning representations by fitting a model's parameters θ to produce a reconstruction of presented inputs as in

$$
\min_{\theta} \sum_{n=1}^{N} \left\| x_n - (g_\theta \circ f_\theta)(x_n) \right\|_2^2 \tag{1}
$$

is common (Bottou, 2012; Kingma & Ba, 2014; LeCun et al., 2015). The reconstruction provides a qualitative signal enabling one to easily assess the quality of the model, and even interpret trained classification models (Zeiler & Fergus, 2014; Mahendran & Vedaldi, 2015; Olah et al., 2017; Shen et al., 2020). In its simplest form, the encoder f_θ : R^D → R^K and the decoder g_θ : R^K → R^D are linear, possibly with shared parameters. In such settings, the optimal parameters are obtained from Principal Component Analysis (Wold et al., 1987). Many variants of Eq. (1) have emerged, such as denoising and masked autoencoders (MAEs) (Vincent et al., 2010; He et al., 2022). The objective remains similar: learn a low-dimensional latent embedding of the data that is able to reconstruct the original samples

while being robust to some noise perturbation added onto the samples as in

$$
\min_{\theta} \sum_{n=1}^{N} \mathbb{E}_{x' \sim p_{x'|x_n}} \left\| x_n - (g_\theta \circ f_\theta)(x') \right\|_2^2 \tag{2}
$$

with p_{x'|x} applying some (conditional) noise transformation to the original input, e.g., x' ∼ N(x, ϵI), ϵ > 0.
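In the linear case mentioned above, where the optimal autoencoder parameters are given by PCA, the claim can be checked numerically. The sketch below (synthetic data, NumPy) verifies that projecting onto the top-K eigenvectors of XX⊤ attains the best possible rank-K reconstruction, relying on the Eckart-Young theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 8, 200, 3
# Synthetic data with decaying per-dimension variance.
X = rng.standard_normal((D, N)) * np.linspace(3.0, 0.1, D)[:, None]

# Linear autoencoder optimum: project onto the top-K eigenvectors of XX^T.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)   # eigenvalues in ascending order
P = eigvecs[:, ::-1][:, :K]                  # top-K principal directions
X_pca = P @ (P.T @ X)                        # rank-K PCA reconstruction

# Best possible rank-K reconstruction via truncated SVD (Eckart-Young).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_svd = (U[:, :K] * s[:K]) @ Vt[:K]

# The two coincide: the optimal linear autoencoder is PCA.
assert np.allclose(X_pca, X_svd, atol=1e-8)
```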

Known limitations. Learning by reconstruction is widely popular and thus heavily studied. Major axes of research evolve around (i) deriving novel loss functions for specific datasets that better align with semantic distance (Wang et al., 2004; Kulis et al., 2013; Ballé et al., 2016; Bao et al., 2017), (ii) explaining the learned embedding dimensions (Tran et al., 2017; Esmaeili et al., 2019; Mathieu et al., 2019), and (iii) imposing structure in the embedding space, such as clustered embeddings (Jiang et al., 2016; Dilokthanakul et al., 2016; Lim et al., 2020; Seydoux et al., 2020). Despite the rich literature, the current solutions of choice to learn representations in computer vision still rely on the mean squared error loss in pixel space with the possible application of structured noise (p_{x'|x} in Eq. (2)), e.g., as employed by the current state-of-the-art solution of masked autoencoders (MAEs) (He et al., 2022). Yet, even that solution learns a representation that needs to be fine-tuned to compete with the state of the art; e.g., MAE's performance is drastically determined by two factors: (i) the training time, i.e., evaluation performance of the learned representation does not plateau even after 1600 epochs on Imagenet, and (ii) the need to fine-tune, i.e., evaluation performance with and without fine-tuning has a significant gap, going from 70% to 84% on the top-1 Imagenet classification task.

Our study will propose some first hints as to why even current state-of-the-art solutions are plagued by slow training and the need to fine-tune (Section 3). We will conclude by proving that MAEs' masking strategy partially alleviates those limitations (Section 4), showing that the most rewarding findings for learning by reconstruction may emerge from novel denoising strategies.

Rich Features for Reconstruction are Poor Features For Perception

This section provides the theoretical ground and empirical validation of R1 and R2 from Section 1, namely, that learning by reconstruction learns features that are misaligned with common perception tasks. We start by deriving a closed form alignment measure between those two tasks in Section 3.1 and conclude by empirically measuring that mismatch in Section 3.2.

How To Measure The Alignment Between Reconstruction and Supervised Tasks

As a starting point to our study, we will build intuition and obtain theoretical results in the linear regime. As we will see at the end of this Section 3.1, this seemingly simplified setting turns out to be informative of practical cases.

Let's consider an encoder mapping V ∈ M_{K,D}(R), a decoder mapping Z ∈ M_{D,K}(R), and a predictor head W ∈ M_{C,K}(R), where C is the number of target dimensions, or classes. The targets for X ∈ M_{D,N}(R) are given by Y ∈ M_{C,N}(R). The combination of the supervised and reconstruction losses is given by

$$
\min_{V, Z, W} \left\| Y - W V X \right\|_F^2 + \lambda \left\| X - Z V X \right\|_F^2 \tag{3}
$$

where the latent representation, V X, is shared between the two losses, and λ ≥ 0 controls the trade-off between the two terms. Quantifying how the optimal parameters of Eq. (3) vary with λ will be key to assess how the two losses are aligned. As a starting point, let's formalize below the optimal parameters of this loss function, using the notation M = P_M D_M P_M^⊤ for the eigendecomposition of a symmetric positive semi-definite matrix M. We will also denote, to lighten notations, A ≜ X (Y^⊤ Y + λ X^⊤ X) X^⊤.

Theorem 1. The loss function from Eq. (3) is minimized for

$$
V^* = \left(P_{H}\right)_{\cdot,1:K}^{\top} D_{XX^\top}^{-\frac{1}{2}} P_{XX^\top}^{\top} \tag{4}
$$

$$
W^* = Y X^\top V^{*\top} \left( V^* X X^\top V^{*\top} \right)^{-1} \tag{5}
$$

$$
Z^* = X X^\top V^{*\top} \left( V^* X X^\top V^{*\top} \right)^{-1} \tag{6}
$$

where H ≜ D_{XX^⊤}^{-1/2} P_{XX^⊤}^⊤ A P_{XX^⊤} D_{XX^⊤}^{-1/2}. (Proof in Section 6.1, empirical validation in Fig. 8.)

We observe that the optimal solutions from Theorem 1 continuously interpolate between the standard ordinary least squares (OLS) problem (λ = 0) and the unsupervised linear autoencoder, or Principal Component Analysis (PCA), setting (λ → ∞). We formalize below that we recover the optimal solutions for each of those extreme cases from Theorem 1.

Corollary 1.1. The solution from Theorem 1 recovers the OLS solution for W^{*⊤} V^{*⊤} as λ → 0, and the PCA solution for Z^{*⊤} V^{*⊤} as λ → ∞. (Proof in Section 6.2.)

That observation should comfort the reader that Eq. (3) accurately conveys both ends of the spectrum from supervised learning to reconstruction based learning, while continuously interpolating in-between.

Condition for perfect alignment. The result from Theorem 1 enables us to formalize the condition for perfect alignment between the two tasks, i.e., under which condition the solution V ∗ is not impacted by λ .

Figure 2. Depiction of the closed-form alignment measure from Eq. (7), measuring the minimum supervised training error achievable given the optimal reconstruction parameters, as per Theorem 1 and Corollary 1.2. Top: depiction in terms of the latent dimension K (x-axis). Bottom: depiction in terms of the ratio of the latent dimension K to the input dimension D. We clearly observe that as the dataset becomes more realistic (going from background-free images to CIFAR and then to TinyImagenet), the alignment between the reconstruction and supervised tasks lessens. In particular, on TinyImagenet, we observe that the alignment only increases linearly with respect to the latent space dimension.


Proposition 1. The supervised and reconstruction tasks are aligned (the optimal solutions do not depend on λ) iff the intersection of the top-K eigenspaces of X^⊤ X and Y^⊤ Y is of dimension K.

In other words, whenever the condition in Proposition 1 holds, the matrix P_H (recall Theorem 1) will include the same eigenvectors (up to rotation) for any λ, making the optimal parameters (Eqs. (4) to (6)) independent of λ. In practice, we will see that Proposition 1's alignment condition is never fulfilled, motivating a more precise measure of alignment between the two tasks.
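Proposition 1's condition can be checked numerically through principal angles: the intersection of two K-dimensional subspaces has dimension K exactly when all K principal angles between them vanish, i.e., all cosines equal 1. A sketch (NumPy; `is_aligned` is our illustrative helper, not from the paper):

```python
import numpy as np

def top_eigvecs(M, K):
    """Top-K eigenvectors (columns) of a symmetric PSD matrix."""
    _, V = np.linalg.eigh(M)            # ascending eigenvalue order
    return V[:, ::-1][:, :K]

def is_aligned(X, Y, K, tol=1e-8):
    """Proposition 1's condition: the intersection of the top-K eigenspaces
    of X^T X and Y^T Y has dimension K, i.e., all K principal angles
    between the two subspaces are zero (cosines equal to 1)."""
    Px = top_eigvecs(X.T @ X, K)
    Py = top_eigvecs(Y.T @ Y, K)
    cosines = np.linalg.svd(Px.T @ Py, compute_uv=False)
    return int(np.sum(cosines > 1 - tol)) == K

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 64))
assert is_aligned(X, X.copy(), K=4)                           # identical spectra: aligned
assert not is_aligned(X, rng.standard_normal((5, 64)), K=4)   # generic targets: misaligned
```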

Continuous measure of alignment. As we aim to measure the tasks' alignment more precisely than in a yes/no setting (Proposition 1), we propose the following continuous measure

$$
\min_{W} \frac{\left\| Y - W V^{*} X \right\|_F^2}{\left\| Y^\top Y P_{X^\top X} \right\|_F^2} \tag{7}
$$

where ∥Y^⊤ Y P_{X^⊤ X}∥_F^2 simplifies to ∥Y^⊤ Y∥_F^2 when D = N. We assume that the supervised task can be at least partially solved from X, ensuring ∥Y^⊤ Y P_{X^⊤ X}∥_F^2 > ϵ. In words, Eq. (7) is the (scaled) minimum supervised training error that can be achieved given the representation (V^* X) minimizing the reconstruction loss, which is measured by how much of the matrix Y^⊤ Y can be reconstructed from the top-K subspace of X^⊤ X, as formalized below.

Findings. We now propose to evaluate the closed-form alignment metric (Eq. (7)) on a few datasets. Note that it can be implemented efficiently, as detailed in Section 6.8. In Fig. 2, we measure the metric of Eq. (7) for a sweep of the latent dimension K over 7 different datasets. We observe three striking regimes. First, for images without background, the reconstruction and classification tasks are very much aligned, even for latent dimensions as low as 20% of the input dimension. Second, when comparing datasets with the same image distribution but different numbers of classes (CIFAR10 to CIFAR100), the misalignment between the two tasks increases, especially for small embedding dimensions. This follows our intuition that additional budget must be devoted to separating more classes; and as the subspaces of the data used for reconstruction and classification do not align (recall Proposition 1), a greater misalignment is measured. Third, when looking at more realistic images (higher resolution and more diverse) such as TinyImagenet, we observe that the alignment only increases linearly with the latent space dimension K, requiring K = D in that case to ensure alignment. We thus conclude that the presence of background, finer classification tasks, and higher resolution images are all factors that drastically decrease the alignment between learning features for perception tasks and learning features that reconstruct those images.
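As a rough illustration of such a metric (not the exact recipe of Section 6.8), the sketch below scores how much of the energy of Y⊤Y is captured by the top-K eigenvectors of X⊤X, the quantity described in the text; `alignment` is our hypothetical helper name, and the normalization by ∥Y⊤Y∥²_F corresponds to the D = N simplification noted above:

```python
import numpy as np

def alignment(X, Y, K):
    """Fraction of the energy of Y^T Y captured by the top-K eigenvectors
    of X^T X, in [0, 1]; 1 means the supervised targets are fully
    reconstructible from the top-K reconstruction subspace."""
    _, P = np.linalg.eigh(X.T @ X)       # eigenvectors, ascending eigenvalues
    Pk = P[:, ::-1][:, :K]               # top-K eigenvectors
    YY = Y.T @ Y
    return np.linalg.norm(YY @ Pk) ** 2 / np.linalg.norm(YY) ** 2

rng = np.random.default_rng(0)
D, N, C = 16, 32, 4
X = rng.standard_normal((D, N))
# Targets built from the top data direction: near-perfect alignment at K=1.
top_dir = np.linalg.eigh(X.T @ X)[1][:, -1]
Y = np.outer(np.ones(C), top_dir)
assert alignment(X, Y, 1) > 0.99
assert alignment(X, Y, N) > 0.99
```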

Linear regime results are informative. We have focused on the linear encoder, decoder, and classification head of Eq. (3). Albeit insightful, one might wonder how much of those insights transfer to the more realistic setting of employing nonlinear mappings. We note that it remains common to keep the classification head linear, therefore leading to

Figure 3. Reprise of Fig. 1 for additional autoencoder architectures: convolutional encoder and deconvolutional decoder (top), and MLP encoder and decoder (bottom). We clearly observe that the top subspace is learned first during training; it is the one that best minimizes the reconstruction loss but contains the least informative features for perception, as per Fig. 4.


$$
\min_{\theta, \gamma, W} \left\| Y - W f_\theta(X) \right\|_F^2 + \lambda \left\| X - g_\gamma(f_\theta(X)) \right\|_F^2 \tag{8}
$$

The encoder is now the nonlinear mapping f_θ : R^D → R^K and the decoder is the nonlinear mapping g_γ : R^K → R^D. We formalize below a result that will reinforce the legitimacy of our linear regime analysis (Theorem 1, Proposition 1, and Corollary 1.2) by showing that it is (i) a correct model during the early phase of training, and even (ii) a correct model throughout training when the decoder being employed is under-parametrized.

Theorem 2. For any high-capacity encoder f θ , studying Eq. (3) and Eq. (8) is equivalent at initialization for any decoder, and is always equivalent when the decoder is linear. (Proof in Section 6.6.)

Combining Theorems 1 and 2, we obtain that even with DNs, during the early stages of learning, the encoder-decoder mapping focuses on the principal subspace of the data, i.e., the space that explains most of the reconstruction error in the linear regime. As our study strongly hinges on that claim, we propose to empirically validate it in the following Section 3.2.

...

Reconstruction and Perception Features Live In Different Subspaces of the Data

We characterized in the previous Section 3.1 how the classification and reconstruction tasks fail to align when it comes to learning common features. In particular, Section 3.1 and Fig. 2 validated how training focuses first on the top subspace of the data. We now reinforce our claim by showing that supervised tasks cannot be solved when restricting the images to the subspace that is learned first by reconstruction.

Perception cannot be solved from the principal subspace of the data. We first propose a controlled experiment where we artificially remove some of the original data subspace. In particular, we consider two settings. First, we gradually remove the subspace associated with the top eigenvectors of the data covariance matrix, effectively removing what is most useful for reconstruction but also what we claim to be least useful for perception. Second, we gradually remove the subspace associated with the bottom eigenvectors (the one

Figure 4. We depict the classification accuracy of a ResNet9 DNN when trained and tested on images that have been projected onto the top (red) and bottom (blue) subspace, as ordered per the eigenvalues of the data covariance matrix, without data augmentation (top) and with data augmentation (bottom). We clearly observe that, except for datasets without background and for which reconstruction and classification are better aligned (recall Fig. 2), the final performance is greater when employing the subspace of the data that explains the least pixel variance, i.e., the bottom subspace.


least useful for reconstruction but that we claim to be most useful for perception). This procedure is applied to the entire dataset (train and test images) before any DN training occurs. Hence the DN is only presented with the filtered images. We report the top-1 accuracy over numerous datasets in Fig. 1 (top) and in Fig. 4. We obtain a few key observations. First, for any % of filtering, keeping the bottom subspace of the data produces higher test accuracy than keeping the top subspace. That is, the subspace that is most useful for reconstruction (top) is least useful for perception. Second, the accuracy gap is impacted by the presence of background, finer-grained classes, and higher resolution images. This further validates our theoretical observations from Fig. 2 and the result from Theorem 2. We will now focus on validating the second part of our claim: that the subspace used for perception (bottom) is learned last, and slowly.
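The filtering procedure described above can be sketched as follows (NumPy; `project_subspace` is our illustrative helper, not the paper's code): project every image onto the eigen-subspace of the data covariance explaining a chosen fraction of the variance, taken from either end of the spectrum, before any training occurs:

```python
import numpy as np

def project_subspace(X, keep_ratio, which="top"):
    """Project the (D, N) data matrix X onto the eigen-subspace of its
    covariance explaining `keep_ratio` of the total variance, taken from
    the top or the bottom of the eigenvalue spectrum."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T / X.shape[1])
    order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if which == "top":
        cum = np.cumsum(eigvals) / eigvals.sum()
        k = np.searchsorted(cum, keep_ratio) + 1
        P = eigvecs[:, :k]                       # top eigenvectors
    else:
        cum = np.cumsum(eigvals[::-1]) / eigvals.sum()
        k = np.searchsorted(cum, keep_ratio) + 1
        P = eigvecs[:, -k:]                      # bottom eigenvectors
    return P @ (P.T @ Xc) + mean
```

The experiment then trains and evaluates a classifier on, e.g., `project_subspace(X, 0.9, "top")` versus `project_subspace(X, 0.2, "bottom")`.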

Useful features for perception are learned last. The above results demonstrate that the top subspace of the data, explaining most of the pixel variance, is not aligned with perception tasks. Yet, perfect reconstruction implies capturing both the top and bottom subspaces. Albeit correct, we demonstrate in Fig. 1 (bottom) and in Fig. 3 that the rate at which the bottom subspace is learned is exponentially slower than the rate at which the top subspace is learned. This empirically validates Theorem 2. For the reader familiar with optimization (Benzi, 2002), or power iteration methods, this observation is akin to how many procedures converge at a rate which is a function of the eigengap (Booth, 2006; Xu et al., 2018), i.e., the difference between λ_i and λ_{i+1}, where λ are the sorted eigenvalues. Because natural images have an exponential decay of their eigenvalues (Van der Schaaf & van Hateren, 1996; Ruderman, 1997), the top subspace is approximated exponentially faster than the bottom one, making the learning of useful features for perception occur only late during training. This finding also resonates with previous studies on the spectral bias of DNs in classification and generative settings (Chakrabarty & Maji, 2019; Rahaman et al., 2019; Schwarz et al., 2021).
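The eigengap argument can be simulated in the linear regime: running gradient descent on a full-capacity linear reconstruction ‖X − WX‖²_F, each eigen-mode of the data is fit at a rate governed by its eigenvalue, so an exponentially decaying spectrum means the bottom modes are learned exponentially later. A minimal sketch (our simplification, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 6, 500
# Data with an exponentially decaying spectrum, as in natural images.
scales = 2.0 ** -np.arange(D)
X = rng.standard_normal((D, N)) * scales[:, None]

C = X @ X.T / N
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# Gradient descent on ||X - W X||_F^2 from W = 0: a full-capacity linear
# "autoencoder" (no bottleneck) isolating the optimization dynamics.
W = np.zeros((D, D))
lr = 0.9 / eigvals[0]
residuals = []
for _ in range(200):
    W -= lr * (W @ C - C)                            # step along the gradient direction
    # Residual per eigen-mode: distance of W from the identity along
    # eigenvector i, which decays as (1 - lr * eigval_i)^t.
    residuals.append([abs(1.0 - eigvecs[:, i] @ W @ eigvecs[:, i])
                      for i in range(D)])

# Early in training the top mode is already fit while the bottom one is not.
assert residuals[20][0] < 1e-2 < residuals[20][-1]
```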

Combining the observations from this section supports R1 and R2 from Section 1. It remains to study R3 which states that since features for perception lie within a negligible subspace (as measured by the reconstruction loss), one can find two separate models that equally solve the reconstruction task (same train and test loss values) but provide drastically different perception task performances.

Learning By Reconstruction Needs Guidance

We now turn to R3, stating that features for perception lie within a space with negligible impact on the reconstruction

Figure 5. Depiction of multiple resnet34 autoencoders with varying embedding dimensions (light to dark), some trained only to reconstruct the input samples with data augmentations (blue) and others with an additional supervised loss signal (as per Eq. (8)) (green). We report the test set accuracy and the relative difference (y-axis) for each of the 'paired' models, i.e., the ones with every training setting identical except for the use of the supervised signal, as a function of the train and test reconstruction loss. We clearly observe that for any embedding dimension and reconstruction loss, one can find two sets of parameters with drastically different abilities to solve perception tasks. Reconstructed samples and training curves are provided in Fig. 6.


error, therefore motivating the need to add additional guidance to the training, e.g., through denoising tasks. To do so, we will show that it is possible to obtain two DNs with the same reconstruction error but one with perception capabilities far greater than the other (Section 4.1). Lastly, we will prove that some guidance can be provided to the learned representation to reduce that gap and focus on more useful features through careful design of the denoising task (Section 4.2).

Learning By Reconstruction Can Produce Optimal Representations

One interesting benefit emerging from the observations made for R1 and R2 is that guiding a DN to focus on the subspace containing informative features for perception has minimal impact on the reconstruction loss, as the two objectives focus on different subspaces. We therefore propose a simple experiment to demonstrate the above argument. We take a resnet34 autoencoder and train it with the usual reconstruction loss (MSE) on Imagenette. This gives us a model that (as per R1 and R2) fails to properly focus on discriminative features. To obtain a second DN with improved classification performance, we simply add a classification head on top of the embedding of the encoder. That is, the same embedding that is fed to the decoder for reconstruction is also fed to a linear classifier with a supervised training loss (recall Eq. (8)). We obtain the key insight of R3, which is that one can produce two DNs with the same training loss (reconstruction) and validation loss (reconstruction), but with significantly different classification performance, as reported in Figs. 5 and 6.

To further understand that observation, we can recall the results from Theorem 1. We demonstrated that the encoder (V), having K dimensions at its disposal, is optimal when selecting the top singular vectors of the data matrix X. If K is large enough that it encompasses both the top subspace of the data (which is learned first and has the greatest impact on the reconstruction loss) and the bottom subspace of the data (which is useful for perception, as per Figs. 1 and 4), then both objectives can coexist (recall Proposition 1), as long as enough capacity is given to the encoder. Therefore, we obtain the following key insight. Whenever the capacity of the autoencoder is large, the encoder embedding can (and will, at the end of training) include features useful for perception, all while being able to reconstruct its inputs. Again, for this to happen, the capacity of the encoder must grow with the image resolution (as more and more dimensions will be taken up by the top subspace) and with the complexity of the image background (again taking more dimensions in the top subspace) (recall Section 3.2).

The above observation demonstrates that learning to reconstruct needs an additional training signal to focus towards discriminative features. As we will prove below, learning by denoising offers such a solution.

Provable Benefits of Learning by Denoising

Recalling the Denoising Autoencoder setting from Eq. (2), we aim to obtain a closed form solution of the linear loss Eq. (3) in order to find some hints as to why masking and additive Gaussian noise produce representations of different quality for perception.

Our goal is therefore to study the misalignment metric (Eq. (7)) under the denoising setting which, as per Corollary 1.2, is given by

$$

$$

which is the minimum supervised loss that can be attained using the representation from V ∗ that minimizes the denoising loss. We ought to highlight that we can obtain a closed form solution under the expectation over the noise distribution ( X ′ | X ) as formalized below, where we denote

Figure 6. Depiction of two resnet34 autoencoders trained on Imagenette (Imagenet-10) images, one (orange) with an additional training signal that favors latent representations suited for classification, and the other (blue) trained with only the reconstruction loss. As per R1 and R2, the latter naturally focuses on suboptimal features, as showcased by the test accuracy with both a linear and a nonlinear probe. Crucially, the autoencoder with the additional signal produces representations with much greater discriminative power in both the linear and nonlinear settings. Yet, and despite popular belief, doing so has no impact on the reconstruction losses on the train or test set, and thus no impact on the quality of the reconstruction presented at the top, therefore validating R3.


$$

$$

Theorem 3. The closed form solution for $\bm{V}^*$ from Eq. (9) is such that $\bm{V}^*$ spans $\bm{P}_{\bm{G}}\bm{D}_{\bm{G}}^{-\frac{1}{2}}(\bm{P}_{\bm{H}})_{\cdot,1:K}$, where $\bm{H} \triangleq \bm{D}_{\bm{G}}^{-\frac{1}{2}}\bm{P}_{\bm{G}}^{\top}\bm{S}\bm{X}^{\top}\bm{X}\bm{S}^{\top}\bm{P}_{\bm{G}}\bm{D}_{\bm{G}}^{-\frac{1}{2}}$. (Proof in Section 6.4.)

The above result demonstrates that even when employing denoising autoencoders with additive Gaussian noise or masking, we can obtain a closed form solution for V , and from that obtain all the alignment metrics studied so far. In particular, Section 6.4 also provides the closed form solutions for G and S . We illustrate the alignment between reconstruction and perception tasks in the denoising autoencoder setting (Eq. (9)) in Fig. 7 for the case of random masking, as per the MAE setting. We clearly observe that the denoising task has the ability to increase the alignment between the two tasks, especially for small embedding dimensions ( K ). We however observe that the size of the masked patches providing the best gains varies with the dataset, hinting at another challenge of denoising autoencoders: the cross-validation of the denoising task. Another formal result we propose below will reinforce that point.

Denoting by V ∗ ( σ ) the optimal denoising autoencoder parameters when employing additive isotropic Gaussian noise with standard deviation σ , we obtain the following statement showcasing that this type of denoising task does not help supervised tasks.

Figure 7. Depiction of the relative alignment difference when employing denoising tasks (recall Eq. (9)) with masking noise, with probability of dropping ranging from 0% to 99% ( cyan to pink ) for patch sizes of (1, 1), recovering multiplicative dropout ( top ), (2, 2) ( middle ), and (4, 4) ( bottom ), on various datasets. A positive number indicates a beneficial impact of using the denoising loss on the supervised performance of the learned representation. We observe that for datasets such as ArabicDigits, which already have a strong alignment between the two tasks (recall Fig. 2), the use of any form of masking is detrimental except with shape (1, 1). However, for datasets such as CIFAR100 ( right column ) with originally poor alignment, masking is beneficial and increases the alignment between the two tasks. As the original alignment increases with K, the benefit of masking reduces.


Corollary 3.1. Under the settings of Theorem 3, additive Gaussian noise has no impact on the supervised task performance as $\bm{W}^{*\top}\bm{V}^{*}(\sigma)^{\top} = \bm{W}^{*\top}\bm{V}^{*}(0)^{\top}, \forall \sigma \geq 0$, regardless of the supervised task. (Proof in Section 6.5.)

We therefore obtain the following insights. Denoising tasks offer a powerful guidance to skew learned representations to better align with perception tasks (Fig. 7), but some noise distributions, such as additive Gaussian noise, are provably unable to help . A challenge that naturally emerges is selecting an adequate denoising task, e.g., to avoid Corollary 3.1 in a setting where labels are not available and the supervised tasks to be tackled may not be known a priori. An interesting byproduct of our study (Corollary 3.1) is that it is possible to assess whether a denoising task has any impact on the perception task without knowing what that supervised task is. That alone could help in at least focusing on denoising tasks that do have an impact, although it will remain unknown whether that impact is beneficial.


Conclusion

We proposed to study the transferability of representations learned by reconstruction towards perception tasks. In particular, we find that the two objectives are fundamentally misaligned, with a degree of misalignment that grows with the presence of complicated backgrounds, with a greater number of classes for the perception task, and with higher image resolutions. While our study focused on bringing those limitations forward from a theoretical and empirical angle, we also opened new avenues to reduce those limitations in the future. For example, we obtained a closed form solution to measure the impact of noise distributions to better align the learned representation to the downstream perception task. This novel methodology opens the door to a priori selecting noise distribution candidates. Even when the downstream task is unknown, we found that some noise distributions, such as additive Gaussian noise, are effectively unable to provide any benefit for better aligning reconstruction and perception tasks. In contrast, we validated that masking is a valid strategy, albeit one requiring some per-dataset tuning. That finding is in line with MAE's performance going from about 50% to 74% top-1 accuracy on Imagenet when masking is employed. We hope that our study will also open new avenues to study reconstruction methods for other modalities such as time-series and NLP.

Impact Statements

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Proof of Theorem 1

Proof. The first part of the proof finds the optima W ∗ and Z ∗ as a function of V , which is direct since we are in a least-squares-style setting for each of them. The second part consists of showing that the optimal V can be found as the solution of a generalized eigenvalue problem. The third and final step is to express the solution for V in a closed form that is also friendly for computations.

Step 1. Recall that our loss function is given by

$$

$$

recalling that $\|\bm{M}\|_F^2 = \operatorname{Tr}(\bm{M}^{\top}\bm{M})$, the above simplifies to

$$

$$
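The Frobenius-trace identity used in this simplification can be checked numerically; a minimal sketch:

```python
import numpy as np

# ||M||_F^2 = Tr(M^T M): the squared Frobenius norm equals the trace of
# the Gram matrix, which is what lets the loss expand into trace terms.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))
frob_sq = np.linalg.norm(M, "fro") ** 2
trace_form = np.trace(M.T @ M)
```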

we are now going to find the optimal W and Z , which are unique by convexity of the loss and of their domains. Recall that we assume Y and X to be full-rank (therefore also making V full-rank). Recalling the derivatives of trace forms, we obtain

$$

$$

setting it to zero (we assume here λ > 0, otherwise we cannot solve for Z since its value does not impact the loss) and solving leads to

$$

$$

We have now solved for W , Z as a function of V ; i.e., the loss is now only a function of V , which we solve for next.

Step 2. We will first proceed by plugging the values for W ∗ , Z ∗ back into the loss, which will now be only a function of V . Let's first simplify our derivations by noticing that

$$

$$

and similarly

$$

$$

finally making the entire loss simplify as follows

$$

$$

First, notice that both X ( Y ⊤ Y + λ X ⊤ X ) X ⊤ and XX ⊤ are symmetric. Therefore, we can minimize the loss by solving the following generalized eigenvalue problem:

$$

$$

where V are the eigenvectors of the generalized eigenvalue problem and Λ the eigenvalues; the solution to our problem will be any rotation of any K eigenvectors, but the minimum will be achieved for the top-K ones.

Step 3. We will first demonstrate the general solution for the generalized eigenvalue problem. Given that solution, it will be easy to take the top-K eigenvectors that solve the considered problem. Denoting A ≜ X ( Y ⊤ Y + λ X ⊤ X ) X ⊤ and

$$

$$

$$

$$

therefore the eigenvalues are given by $\bm{D}_{\bm{H}}$ and the eigenvectors are given by $\bm{P}_{\bm{B}}\bm{D}_{\bm{B}}^{-\frac{1}{2}}\bm{P}_{\bm{H}}$, or equivalently $(\bm{P}_{\bm{X}\bm{X}^{\top}}\bm{D}_{\bm{X}\bm{X}^{\top}}^{-\frac{1}{2}}\bm{P}_{\bm{H}})_{\cdot,1:K}$. So the optimal V is any rotation of the top-K eigenvectors. The above is simple to use as-is whenever N > D; if not, we can obtain a solution without having to compute any D × D matrix, thus making the process more efficient. To that end, we can obtain

$$

$$

that only involves D × min( D,N ) matrices instead of D × D .
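The three-step recipe above (eigendecompose B, whiten A, eigendecompose H, map back) can be sketched in NumPy. Random matrices stand in for the data here; this is an illustration of the closed form, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 20))
Y = rng.normal(size=(4, 20))
lam, K = 0.5, 2

A = X @ (Y.T @ Y + lam * X.T @ X) @ X.T   # left-hand side of the problem
B = X @ X.T                               # right-hand side, SPD since X is full-rank

D_B, P_B = np.linalg.eigh(B)
W = P_B / np.sqrt(D_B)                    # P_B @ diag(D_B^{-1/2})
H = W.T @ A @ W                           # whitened problem
D_H, P_H = np.linalg.eigh(H)              # eigenvalues in ascending order
V = (W @ P_H)[:, -K:]                     # top-K generalized eigenvectors
lams = D_H[-K:]                           # their generalized eigenvalues
```

Each column v of V then satisfies A v = λ B v, which is exactly the generalized eigenvalue problem of Step 2.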


Proof of Theorem 2

Proof. We will start with the fully supervised (least-squares) proof obtained when λ = 0 . Also notice that in any case, we have that

$$

$$

Ordinary Least Squares recovery. Since λ = 0, we also have

$$

$$

which then leads to

$$

$$

from the above, we can simply plug those values into the analytical form for V ∗ from Eq. (4) to obtain

$$

$$

since we easily see that P H = V ⊤ X V Y . We also have the optimum for W from Eq. (5) to be

$$

$$

and finally the product of both matrices (which produce the supervised linear model) is obtained as

$$

$$

therefore recovering the OLS optimal solution. Note that if K < C then we have a bottleneck and we therefore obtain an interesting alternative solution that looks at the top subspace of Y (this is however never the case in OLS settings).

Principal Component Analysis recovery. We now consider the case where we only employ the unsupervised loss (akin to λ →∞ ). In this setting we get

$$

$$

$$

$$

therefore the optimal form for V will be

$$

$$

which will select the top-K subspace of X (recall that the eigenvalues of H are $\bm{D}_{\bm{X}\bm{X}^{\top}}$ and therefore its top-K eigenvectors select the top-K dimensions of the subspace). Then the solution for Z from Eq. (6) gives

$$

$$

$$

$$

which is the projection matrix onto the top-K subspace of the data X , i.e., recovering the optimal solution of Principal Component Analysis.
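This PCA recovery can be verified numerically: with the top-K eigenvectors of XX ⊤ as encoder and decoder, the product Z V is exactly the orthogonal projector onto the top-K principal subspace, and Z V X is the best rank-K approximation of X. A minimal sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 50))
K = 3

# top-K eigenvectors of XX^T (eigh returns ascending order, so take the last K)
_, P = np.linalg.eigh(X @ X.T)
V = P[:, -K:].T                  # optimal encoder, shape (K, D)
Z = P[:, -K:]                    # optimal decoder, shape (D, K)
proj = Z @ V                     # should equal the rank-K PCA projector
```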


Proof of the Alignment Measure Properties

Proof. Recall that in the λ → ∞ regime, we have that $\bm{V}^* = \bm{P}_{\bm{X}\bm{X}^{\top}}(\bm{D}_{\bm{X}\bm{X}^{\top}}^{-\frac{1}{2}})_{\cdot,1:K}$ and $\bm{W}^* = \bm{P}_{\bm{X}\bm{X}^{\top}}^{\top}\bm{Y}^{\top}$. We thus develop

$$

$$

as $\|\bm{Y}\|_F^2$ is a constant with respect to the parameters, we consider $\|\bm{Y}(\bm{P}_{\bm{X}\bm{X}^{\top}})_{\cdot,1:K}\|_F^2$ as our alignment measure (the greater, the better the supervised loss can be minimized from the parameters). Since this quantity lives in the range $[0, \|\bm{Y}\bm{P}_{\bm{X}\bm{X}^{\top}}\|_F^2]$, we see that by using the reparametrization from Eq. (7) we obtain the proposed measure of alignment rescaled to $[0, 1]$.

and we directly obtain


Proof of Theorem 3

Proof. We need to find the optimal reconstruction solution ( V ∗ , Z ∗ ) first (corresponding to the case of λ → ∞), and then plug it into the supervised loss with the optimal W .

$$

$$

from which we obtain

$$

$$

leading to $\bm{W}^* = (\bm{V}^{\top}\mathbb{E}[\bm{X}'\bm{X}'^{\top}]\bm{V})^{-1}\bm{V}^{\top}\mathbb{E}[\bm{X}']\bm{X}^{\top}$, which we can plug back into the loss to obtain

$$

$$

whose solution is given (assuming E [ X ′ X ′⊤ ] is full-rank) by the solution of the generalized eigenvalue problem

$$

$$

Given those optima, we can thus obtain the alignment measure as before, i.e., the supervised loss obtained from V ∗ from the unsupervised loss and Z ∗ from the supervised one:

$$

$$

whose optimum is therefore given by $\bm{Z}^* = (\bm{V}^{*\top}\mathbb{E}[\bm{X}'\bm{X}'^{\top}]\bm{V}^*)^{-1}\bm{V}^{*\top}\mathbb{E}[\bm{X}']\bm{Y}^{\top}$, which can then be plugged back to produce the (unnormalized) measure. The analytical forms of $\mathbb{E}[\bm{X}']$ and $\mathbb{E}[\bm{X}'\bm{X}'^{\top}]$ can be obtained following the derivations of (Balestriero et al., 2022). The case of additive Gaussian noise is trivial and was commonly done before, e.g., to show the link between ridge regression and additive dropout: $\mathbb{E}[\bm{X}'] = \bm{X}$ and $\mathbb{E}[\bm{X}'\bm{X}'^{\top}] = \bm{X}\bm{X}^{\top} + \sigma\bm{I}$. The perhaps more interesting derivations concern the masking employed by MAE.

Proof of Corollary 3.1

In the case of additive, centered Gaussian noise, we have $\mathbb{E}[\bm{X}'] = \bm{X}$ and $\mathbb{E}[\bm{X}'\bm{X}'^{\top}] = \bm{X}\bm{X}^{\top} + \sigma\bm{I}$. Therefore the optimal value for V is given by solving the generalized eigenvalue problem $(\bm{X}\bm{X}^{\top}\bm{X}\bm{X}^{\top}, \bm{X}\bm{X}^{\top} + \sigma\bm{I})$. Recalling the derivations of the optimal solution for such a problem from Section 6.1, we have that

$$

$$

with

$$

$$

and the important property to notice is that the ordering of the eigenvalues of $\bm{H}$, which are given by $(\bm{D}_{\bm{X}\bm{X}^{\top}})^2(\bm{D}_{\bm{X}\bm{X}^{\top}} + \sigma\bm{I})^{-1}$, is the same as the ordering of the eigenvalues of $\bm{X}\bm{X}^{\top}$, which are given by $\bm{D}_{\bm{X}\bm{X}^{\top}}$ (the map $d \mapsto d^2/(d+\sigma)$ is increasing for $d \geq 0$). That is, the top-K subspace picked up by $\bm{P}_{\bm{H}}$ is the same for any noise standard deviation σ. Now, given the closed form for V , we can obtain the closed form of the classifier weights W from Eq. (5) to be

$$

$$

where the s subscript indicates that only the top K × K part of the diagonal matrix is nonzero. Lastly, the product of both matrices (producing the supervised linear model) is obtained as

$$

$$

therefore recovering the OLS optimal solution $\bm{Y}\bm{X}^{\top}(\bm{X}\bm{X}^{\top})^{-1}$ whenever K ≥ D, and otherwise recovering the projection onto the top subspace of X . In any case, the final parameters are invariant to the choice of the standard deviation of the additive Gaussian noise ( σ ) during the denoising autoencoder pre-training phase.
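Corollary 3.1's invariance can also be observed numerically: the top-K subspace of the generalized problem (XX ⊤ XX ⊤ , XX ⊤ + σI) does not move with σ. Below is a sketch under the assumption of distinct eigenvalues; `topk_subspace` is a helper name introduced here for illustration:

```python
import numpy as np

def topk_subspace(X, sigma, K):
    """Top-K generalized eigenvectors of (XX^T XX^T, XX^T + sigma*I),
    i.e. the optimal V under additive Gaussian denoising, returned as
    the orthogonal projector onto their span (to compare subspaces)."""
    B = X @ X.T + sigma * np.eye(X.shape[0])
    A = (X @ X.T) @ (X @ X.T)
    D_B, P_B = np.linalg.eigh(B)
    W = P_B / np.sqrt(D_B)              # whitening by B
    _, P_H = np.linalg.eigh(W.T @ A @ W)
    V = W @ P_H[:, -K:]                 # top-K generalized eigenvectors
    Q, _ = np.linalg.qr(V)              # orthonormal basis of span(V)
    return Q @ Q.T

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 30))
P0 = topk_subspace(X, 0.0, K=2)
P1 = topk_subspace(X, 5.0, K=2)       # same subspace for any sigma >= 0
```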

Proof for the Nonparametric (Deep) Encoder Setting

Proof. The first step of the proof is to rewrite the joint classification and reconstruction objective with an arbitrary encoder network f θ

$$

$$

as the nonparametric version

$$

$$

both being identical if we assume that the encoder is powerful enough to reach any representation, which is a realistic assumption given current architectures. Given that nonparametric objective, we can now solve for both the optimal decoder weight V and the optimal representation Z as follows

$$

$$

which is solved by Z being any orthogonal matrix in the subspace of the top-K eigenvectors of ( Y ⊤ Y + λ X ⊤ X ). Now, as λ → ∞, the encoder becomes more and more linear, ultimately converging to f θ ( x ) = Ux with U ∈ span { eigvec( X ⊤ X ) 1 , . . . , eigvec( X ⊤ X ) K }.


Eigendecomposition

Given a matrix $\bm{X} \in \mathcal{M}_{D,N}(\mathbb{R})$ with D > N, computing the eigendecomposition of $\bm{X}\bm{X}^{\top}$, a D × D matrix, is O ( D 3 ). It can instead be obtained in O ( N 3 + DN 2 ) as

```python
import numpy as np

def fast_gram_eigh(X, major="C", unit_test=False):
    """
    Compute the eigendecomposition of the Gram matrix:
    - X @ X.T using column (C) major notation
    - X.T @ X using row (R) major notation
    Returns (eigenvalues, eigenvectors), ascending as in np.linalg.eigh.
    """
    X_view = X.T if major == "C" else X
    if X_view.shape[1] < X_view.shape[0]:
        # this case is the usual formula
        U, S = np.linalg.eigh(X_view.T @ X_view)
    else:
        # in this case we work in the transpose domain and map back
        U, S = np.linalg.eigh(X_view @ X_view.T)
        S = X_view.T @ S
        S[:, U > 0] /= np.sqrt(U[U > 0])  # renormalize the mapped eigenvectors
    if unit_test:
        # ensure that we recover the direct (slow) computation
        Uslow, Sslow = np.linalg.eigh(X_view.T @ X_view)
        k = len(U)
        assert np.allclose(U, Uslow[-k:])
        assert np.allclose(np.abs(S), np.abs(Sslow[:, -k:]))
    return U, S
```

since we have the relation

$$

$$

and thus we can simply compute the eigenvectors of the N × N matrix $\bm{X}^{\top}\bm{X}$ and get the eigenvectors of the D × D matrix $\bm{X}\bm{X}^{\top}$ by left-multiplying them by $\bm{X}$ and rescaling each mapped eigenvector by $1/\|\bm{X}\bm{v}\|_2$; the nonzero eigenvalues of the two matrices coincide.

Fast Implementation of the Alignment Metric (Eq. (7))

We want to sweep over the latent dimension K . As such, we can avoid recomputing the metric for each value and get all of them at once as below. We again use the column-major notation as per Section 2:

```python
import numpy as np

def alignment_sweep(X, Y, major="C"):
    """Alignment metric of Eq. (7) for every latent dimension K at once."""
    U, S, Vh = np.linalg.svd(X, full_matrices=False)
    if major == "C":
        denom = np.square(np.linalg.norm(Y @ Y.T))
        numer = np.linalg.multi_dot([Y.T, Y, Vh.T])
    else:
        denom = np.square(np.linalg.norm(Y.T @ Y))
        numer = np.linalg.multi_dot([Y, Y.T, U])
    numer = np.linalg.norm(numer, axis=0) ** 2
    return np.cumsum(numer) / denom
```
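As a sanity check, the sweep can be exercised on synthetic data. The snippet below inlines the same computation so it runs standalone; the random X and one-hot Y are illustrative stand-ins for a real dataset:

```python
import numpy as np

def alignment_sweep(X, Y, major="C"):
    """Alignment metric of Eq. (7) for every latent dimension K at once."""
    U, S, Vh = np.linalg.svd(X, full_matrices=False)
    if major == "C":
        denom = np.square(np.linalg.norm(Y @ Y.T))
        numer = np.linalg.multi_dot([Y.T, Y, Vh.T])
    else:
        denom = np.square(np.linalg.norm(Y.T @ Y))
        numer = np.linalg.multi_dot([Y, Y.T, U])
    numer = np.linalg.norm(numer, axis=0) ** 2
    return np.cumsum(numer) / denom

rng = np.random.default_rng(0)
D, N, C = 32, 200, 4
X = rng.normal(size=(D, N))              # (D, N) data matrix, column-major
labels = rng.integers(0, C, size=N)
Y = np.zeros((C, N))
Y[labels, np.arange(N)] = 1.0            # one-hot targets

curve = alignment_sweep(X, Y, major="C")  # one value per K = 1..min(D, N)
```

The returned curve is nondecreasing in K and bounded by 1, matching the properties of the metric proved in Section 6.3.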

Additional figures



One of the far-reaching mandates of deep learning is to provide a self-contained methodology to learn intelligible and universal representations of data (LeCun et al., 2015). That is, to learn a nonlinear transformation of the data producing a parsimonious and informative representation which can be used to solve numerous downstream tasks. Significant progress has been made through the lens of supervised learning, i.e., by learning a representation that maps the observed data to provided labels of an a priori known downstream task (Krizhevsky et al., 2012). Labels being costly and over-specialized, much progress was also made through the lens of unsupervised learning (Barlow, 1989; Ghahramani, 2003), roughly falling into three camps. First, reconstruction-based methods that produce (compressed) latent representations of the data nonetheless sufficient to recover most of the original data, e.g., Denoising/Variational/Masked Auto-Encoders (Vincent et al., 2010; Kingma & Welling, 2013; He et al., 2022) and Autoregressive models (Van den Oord et al., 2016; Chen et al., 2018). Second, score matching, which is often solved by setting up a surrogate supervised task of classifying observed samples from noise (Hyvärinen & Dayan, 2005). Third, Self-Supervised Learning (SSL) (Chen et al., 2020; Zbontar et al., 2021; Bardes et al., 2021; Balestriero et al., 2023), whose contrastive methods can be thought of as generalized versions of score matching and Noise Contrastive Estimation (Gutmann & Hyvärinen, 2010), combining an invariance term bringing together representations of data known to be semantically similar, or generated to be so, and an anti-collapse term making sure that not all representations become similar.

In recent years, SSL emerged as the preferred solution, regularly reaching new state-of-the-art results through careful experimental design (Garrido et al., 2023). Yet, reconstruction-based methods maintain a large presence due to their ability to provide reconstructed samples that are human interpretable, enabling informed quality assessment of a model (Selvaraju et al., 2016; Bordes et al., 2021; Brynjolfsson et al., 2023; Baidoo-Anu & Ansah, 2023). Despite that benefit, reconstruction-based learning falls behind SSL as it requires fine-tuning to reach the state-of-the-art. One of the most popular reconstruction-based learning strategies, which emerged as the solution of choice in recent years, is the Masked-Autoencoder (He et al., 2022).

We ask the following question: Why does reconstruction-based learning easily produce compelling reconstructed samples but fail to provide competitive latent representations for perception? We can pinpoint that observation to at least three reasons.

R1: Misaligned. The features with the most reconstructive power are the least informative for perceptual tasks, as depicted in Fig. 1 (top). Instead, images projected onto the bottom subspace remain informative for perception.

R2: Ill-conditioned. Features useful for perception (the low variance subspace) are learned last, as depicted in Fig. 1 (bottom), in favor of first learning the top subspace of the data, which explains most of the pixel variance but fails to solve perceptual tasks.

R3: Ill-posed. There exist different model parameters producing the same train and test reconstruction error but exhibiting significant performance gaps for perceptual tasks, as depicted in Fig. 6, where for a given reconstruction error the top-1 accuracy on Imagenet-10 can vary from 50% to almost 90%.

The findings from R1, R2, and R3 provide first clues as to why learning by reconstruction requires long training times and fine-tuning. Yet, those findings alone do not answer the following question: Why were Masked Autoencoders able to provide a significant improvement in the quality of the learned representation for solving perception tasks? We will prove that the hindrances R1, R2, and R3 can be pushed back through careful design of the noise distribution used in denoising autoencoders. In particular, we will demonstrate that masking is provably beneficial while other noise distributions, e.g., additive Gaussian noise, are not. We hope that our findings will help skew further research in learning by reconstruction towards exploring alternative noise distributions, as they are the main driver of learning useful representations for perception.

Notations: We denote by $\bm{x}_n \in \mathbb{R}^{D}, n = 1, \dots, N$ the $n^{\rm th}$ input sample, e.g., a $(H, W, C)$ image flattened to a $D = H \times W \times C$ dimensional vector. The entire training set is collected into the matrix $\bm{X} \triangleq [\bm{x}_1, \dots, \bm{x}_N] \in \mathcal{M}_{D,N}(\mathbb{R})$, where $\mathcal{M}_{n,m}(\mathbb{R})$ is the vector space of real $n \times m$ matrices. Throughout our study, vectors will always be column-vectors, and matrices are built by horizontally stacking column-vectors, i.e., they are column major. Unless specified otherwise, we assume that the input matrix $\bm{X}$ is full-rank. In practice, if this is not the case, one can easily disregard the subspace associated with $0$ singular values and apply our analysis on that filtered matrix instead.

Learning by reconstruction. Learning representations by fitting a model's parameters $\theta$ to produce a reconstruction of presented inputs as in

is common (Bottou, 2012; Kingma & Ba, 2014; LeCun et al., 2015). The reconstruction provides a qualitative signal enabling one to easily assess the quality of the model, and even interpret trained classification models (Zeiler & Fergus, 2014; Mahendran & Vedaldi, 2015; Olah et al., 2017; Shen et al., 2020). In its simplest form, the encoder $f_{\theta}: \mathbb{R}^{D} \mapsto \mathbb{R}^{K}$ and the decoder $g_{\theta}: \mathbb{R}^{K} \mapsto \mathbb{R}^{D}$ are linear, possibly with shared parameters. In such settings, the optimal parameters are obtained from Principal Component Analysis (Wold et al., 1987). Many variants of Eq. 1 have emerged, such as denoising and masked autoencoders (MAEs) (Vincent et al., 2010; He et al., 2022). The objective remains similar: learn a low-dimensional latent embedding of the data that is able to reconstruct the original samples while being robust to some noise perturbation added onto the samples as in

with $p_{\bm{x}'|\bm{x}}$ applying some (conditional) noise transformation to the original input, e.g., $\bm{x}' \sim \mathcal{N}(\bm{x}, \epsilon\bm{I}), \epsilon > 0$.
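The two noise families contrasted throughout our study, additive Gaussian noise and MAE-style patch masking, can be sketched as follows; the patch size and noise scale are illustrative parameters, not the values used in the experiments:

```python
import numpy as np

def additive_gaussian(x, eps, rng):
    """x' ~ N(x, eps*I): the additive noise studied in Corollary 3.1."""
    return x + rng.normal(scale=np.sqrt(eps), size=x.shape)

def patch_mask(x, patch, p_drop, rng):
    """MAE-style masking: zero out square patches of an (H, W) image
    independently with probability p_drop."""
    H, W = x.shape
    grid = rng.random((H // patch, W // patch)) < p_drop
    mask = np.kron(grid, np.ones((patch, patch)))  # upsample to the pixel grid
    return x * (1.0 - mask)

rng = np.random.default_rng(0)
x = rng.random((8, 8))
x_gauss = additive_gaussian(x, eps=0.1, rng=rng)
x_masked = patch_mask(x, patch=2, p_drop=0.5, rng=rng)
```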

Known limitations. Learning by reconstruction is widely popular and thus heavily studied. Major axes of research evolve around (i) deriving novel loss functions for specific datasets that better align with semantic distance (Wang et al., 2004; Kulis et al., 2013; Ballé et al., 2016; Bao et al., 2017), (ii) explaining the learned embedding dimensions (Tran et al., 2017; Esmaeili et al., 2019; Mathieu et al., 2019), and (iii) imposing structure in the embedding space such as clustered embeddings (Jiang et al., 2016; Dilokthanakul et al., 2016; Lim et al., 2020; Seydoux et al., 2020). Despite the rich literature, the current solutions of choice to learn representations in computer vision still rely on the Mean Squared Error loss in pixel space with the possible application of structured noise ($p_{\bm{x}'|\bm{x}}$ in Eq. 2), e.g., as employed by the current state-of-the-art solution of masked autoencoders (MAEs) (He et al., 2022). Yet, even that solution learns a representation that needs to be fine-tuned to compete with the state-of-the-art, e.g., MAE's performance is drastically determined by two parameters: (i) the training time, i.e., evaluation performance of the learned representation does not plateau even after 1600 epochs on Imagenet, and (ii) the need to fine-tune, i.e., evaluation performance with and without fine-tuning has a significant gap, going from 70% to 84% on the top-1 Imagenet classification task.

Our study will propose some first hints as to why even current state-of-the-art solutions are plagued by slow training and the need to fine-tune (Section 3). We will conclude by proving that MAE's masking strategy partially alleviates those limitations (Section 4), suggesting that the most rewarding findings for learning by reconstruction may emerge from novel denoising strategies.

This section provides the theoretical ground and empirical validation of R1 and R2 from Section 1, namely, that learning by reconstruction learns features that are misaligned with common perception tasks. We start by deriving a closed form alignment measure between those two tasks in Section 3.1 and conclude by empirically measuring that mismatch in Section 3.2.

As a starting point to our study, we will build intuition and obtain theoretical results in the linear regime. As we will see at the end of Section 3.1, this seemingly simplified setting turns out to be informative of practical cases.

Let's consider an encoder mapping $\bm{V} \in \mathcal{M}_{K,D}(\mathbb{R})$, a decoder mapping $\bm{Z} \in \mathcal{M}_{D,K}(\mathbb{R})$, and a predictor head $\bm{W} \in \mathcal{M}_{C,K}(\mathbb{R})$, where $C$ is the number of target dimensions, or classes. The targets for $\bm{X} \in \mathcal{M}_{D,N}(\mathbb{R})$ are given by $\bm{Y} \in \mathcal{M}_{C,N}(\mathbb{R})$. The combination of the supervised and reconstruction losses is given by

where the latent representation, $\bm{V}\bm{X}$, is shared between the two losses, and $\lambda \geq 0$ controls the trade-off between the two terms. Quantifying how the optimal parameters of Eq. 3 vary with $\lambda$ will be key to assess how the two losses are aligned. As a starting point, let's formalize below the optimal parameters of this loss function using the notation $\bm{M} = \bm{P}_{\bm{M}}\bm{D}_{\bm{M}}\bm{P}_{\bm{M}}^{\top}$ for the eigendecomposition of a symmetric positive semi-definite matrix $\bm{M}$. We will also denote, to lighten notations, $\bm{A} \triangleq \bm{X}(\bm{Y}^{\top}\bm{Y} + \lambda\bm{X}^{\top}\bm{X})\bm{X}^{\top}$.
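Under the stated linear setting (MSE for both terms and the shared latent V X), the combined objective can be written in a few lines; this is a sketch of Eq. 3 with random stand-ins, not the paper's training code:

```python
import numpy as np

def joint_loss(V, Z, W, X, Y, lam):
    """Supervised + reconstruction objective with shared latent V @ X:
    ||Y - W V X||_F^2 + lam * ||X - Z V X||_F^2."""
    latent = V @ X                                    # (K, N) shared code
    sup = np.linalg.norm(Y - W @ latent, "fro") ** 2  # supervised term
    rec = np.linalg.norm(X - Z @ latent, "fro") ** 2  # reconstruction term
    return sup + lam * rec

rng = np.random.default_rng(0)
D, N, K, C = 10, 40, 3, 4
X = rng.normal(size=(D, N))
Y = rng.normal(size=(C, N))
V = rng.normal(size=(K, D))
Z = rng.normal(size=(D, K))
W = rng.normal(size=(C, K))
loss = joint_loss(V, Z, W, X, Y, lam=0.5)
```

Setting `lam=0` recovers the purely supervised objective, and letting `lam` grow recovers the purely unsupervised (PCA) regime discussed next.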

The loss function from Eq. 3 is minimized for

where $\bm{H} \triangleq \bm{D}_{\bm{X}\bm{X}^{\top}}^{-\frac{1}{2}}\bm{P}_{\bm{X}\bm{X}^{\top}}^{\top}\bm{A}\bm{P}_{\bm{X}\bm{X}^{\top}}\bm{D}_{\bm{X}\bm{X}^{\top}}^{-\frac{1}{2}}$. (Proof in Section 6.1, empirical validation in Fig. 8.)

We observe that the optimal solutions from Theorem 1 continuously interpolate between the standard least squares (OLS) problem ($\lambda = 0$) and the unsupervised linear autoencoder or Principal Component Analysis (PCA) setting ($\lambda \to \infty$). We formalize below that we recover the optimal solutions for each of those extreme cases from Theorem 1.

The solution from Theorem 1 recovers the OLS solution for ${\bm{W}^*}^{\top}{\bm{V}^*}^{\top}$ as $\lambda \to 0$, and the PCA solution for ${\bm{Z}^*}^{\top}{\bm{V}^*}^{\top}$ as $\lambda \to \infty$. (Proof in Section 6.2.)

That observation should reassure the reader that Eq. 3 accurately conveys both ends of the spectrum, from supervised learning to reconstruction-based learning, while continuously interpolating in-between.

Condition for perfect alignment. The result from Theorem 1 enables us to formalize the condition for perfect alignment between the two tasks, i.e., the condition under which the solution $\bm{V}^*$ is not impacted by $\lambda$.

The supervised and reconstruction tasks are aligned (the optimal solutions do not depend on $\lambda$) iff the intersection of the top-$K$ eigenspaces of $\bm{X}^{\top}\bm{X}$ and $\bm{Y}^{\top}\bm{Y}$ is of dimension $K$.

In other words, whenever the condition in Proposition 1 holds, the matrix $\bm{P}_{\bm{H}}$ (recall Theorem 1) will include the same eigenvectors (up to rotation) for any $\lambda$, making the optimal parameters (Eqs. 4, 5 and 6) independent of $\lambda$. In practice, we will see that Proposition 1's alignment condition is never fulfilled, pushing the need to define a more precise measure of alignment between the two tasks.


Continuous measure of alignment. As we aim to measure the tasks' alignment more precisely than in a binary yes/no setting (Proposition 1), we propose the following continuous measure

where $\|{\bm{Y}}^{\top}{\bm{Y}}{\bm{P}}_{{\bm{X}}{\bm{X}}^{\top}}\|_{F}^{2}$ simplifies to $\|{\bm{Y}}^{\top}{\bm{Y}}\|_{F}^{2}$ when $D=N$. We assume that the supervised task can be at least partially solved from ${\bm{X}}$, ensuring $\|{\bm{Y}}^{\top}{\bm{Y}}{\bm{P}}_{{\bm{X}}{\bm{X}}^{\top}}\|_{F}^{2}>\epsilon$. In words, Eq. 7 is the (scaled) minimum supervised training error that can be achieved given the representation $({\bm{V}}^{\top}{\bm{X}})$ minimizing the reconstruction loss, which is measured by how much of the matrix ${\bm{Y}}^{\top}{\bm{Y}}$ can be reconstructed from the top-$k$ subspace of ${\bm{X}}^{\top}{\bm{X}}$, as formalized below.

$\text{alignment}(k)$ from Eq. 7 increases with $k$, has value $0$ iff the two losses are misaligned, and has value $1$ iff the two losses are aligned. (Proof in Section 6.3.)

Findings. We now evaluate the closed-form alignment metric (Eq. 7) on a few datasets. Note that it can be implemented efficiently, as detailed in Section 6.8. In Fig. 2, we measure the metric of Eq. 7 for a sweep of latent dimension $K$ over 7 different datasets. We observe three striking regimes. First, for images without background, the reconstruction and classification tasks are very much aligned, even for latent dimensions as low as 20% of the input dimension. Second, when comparing datasets with the same image distribution but different numbers of classes (CIFAR10 vs. CIFAR100), the misalignment between the two tasks increases, especially for small embedding dimensions. This follows our intuition that additional budget must be devoted to separating more classes; and since the subspaces of the data used for reconstruction and classification do not align (recall Proposition 1), a greater misalignment is measured. Third, when looking at more realistic images (higher resolution and more diverse) such as TinyImagenet, we observe that the alignment only increases linearly with the latent space dimension $K$, requiring $K=D$ in that case to ensure alignment. We thus conclude that the presence of background, finer classification tasks, and higher resolution images are all factors that drastically decrease the alignment between learning features for perception tasks and learning features that reconstruct those images.

Linear regime results are informative. We have so far focused on the linear encoder, decoder, and classification head of Eq. 3. Albeit insightful, one might wonder how much of these insights transfer to the more realistic setting of nonlinear mappings. We note that it remains common to keep the classification head linear, leading to the following generalization of Eq. 3:

The encoder is now the nonlinear mapping $f_{\theta}:\mathbb{R}^{D}\mapsto\mathbb{R}^{K}$ and the decoder is the nonlinear mapping $g_{\gamma}:\mathbb{R}^{K}\mapsto\mathbb{R}^{D}$. We formalize below a result that reinforces the legitimacy of our linear regime analysis (Theorem 1 and its corollaries) by showing that it is (i) a correct model during the early phase of training, and even (ii) a correct model throughout training when the decoder being employed is under-parametrized.

For any high-capacity encoder $f_{\theta}$, studying Eq. 3 and Eq. 8 is equivalent at initialization for any decoder, and is always equivalent when the decoder is linear. (Proof in Section 6.6.)

Combining Theorems 2 and 1, we obtain that even with DNs, during the early stages of learning, the encoder-decoder mapping focuses on the principal subspace of the data, i.e., the subspace that explains most of the reconstruction error in the linear regime. As our study strongly hinges on that claim, we empirically validate it in the following Section 3.2.

We characterized in the previous Section 3.1 how the classification and reconstruction tasks fail to align when it comes to learning common features. In particular, Section 3.1 and Theorem 2 validated how training focuses first on the top subspace of the data. We now reinforce our claim by showing that supervised tasks cannot be solved when restricting the images to the subspace that is learned first by reconstruction.

Perception cannot be solved from the principal subspace of the data. We first propose a controlled experiment where we artificially remove some of the original data subspace. In particular, we consider two settings. First, we gradually remove the subspace associated with the top eigenvectors of the data covariance matrix, effectively removing what is most useful for reconstruction but also what we claim to be least useful for perception. Second, we gradually remove the subspace associated with the bottom eigenvectors (the ones least useful for reconstruction but that we claim to be most useful for perception). This procedure is applied to the entire dataset (train and test images) before any DN training occurs; hence the DN is only presented with the filtered images. We report the top-1 accuracy over numerous datasets in Fig. 1 (top) and in Fig. 4. We obtain a few key observations. First, for any percentage of filtering, keeping the bottom subspace of the data produces higher test accuracy than keeping the top subspace. That is, the subspace that is most useful for reconstruction (top) is least useful for perception. Second, the accuracy gap is impacted by the presence of background, finer-grained classes, and higher resolution images. This further validates our theoretical observations from Fig. 2 and the result from Theorem 2. We now focus on validating the second part of our claim: that the subspace used for perception (bottom) is learned last, and slowly.
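The filtering step described above can be sketched in a few lines of NumPy (a minimal sketch, not the paper's code; the function name and the variance-ratio parametrization are ours):

```python
import numpy as np

def keep_subspace(X, ratio, which="top"):
    """Project flattened images X of shape (N, D) onto the eigen-subspace
    of the data covariance explaining `ratio` of the pixel variance
    (which="top"), or onto its orthogonal complement (which="bottom")."""
    mean = X.mean(axis=0)
    Xc = X - mean
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / len(X))
    evals, evecs = evals[::-1], evecs[:, ::-1]       # descending variance
    # smallest k whose leading components explain at least `ratio` variance
    k = int(np.searchsorted(np.cumsum(evals) / evals.sum(), ratio)) + 1
    V = evecs[:, :k] if which == "top" else evecs[:, k:]
    return Xc @ V @ V.T + mean
```

Applying the function once with `which="top"` and once with `which="bottom"` splits each image into two orthogonal components whose sum (after removing the duplicated mean) recovers the original image.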

Useful features for perception are learned last. The above results demonstrate that the top subspace of the data, explaining most of the pixel variance, is not aligned with the perception tasks. Yet, perfect reconstruction implies capturing both the top and bottom subspaces. Albeit correct, we demonstrate in Fig. 1 (bottom) and in Fig. 3 that the bottom subspace is learned exponentially more slowly than the top subspace. This empirically validates Theorem 2. For the reader familiar with optimization (Benzi, 2002) or power iteration methods, this observation is akin to how many procedures converge at a rate that is a function of the eigengap (Booth, 2006; Xu et al., 2018), i.e., the difference between $\lambda_{i}$ and $\lambda_{i+1}$, where $\lambda$ are the sorted eigenvalues. Because natural images have an exponential decay of their eigenvalues (Van der Schaaf & van Hateren, 1996; Ruderman, 1997), the top subspace is approximated exponentially faster than the bottom one, making the learning of features useful for perception occur only late during training. This finding also resonates with previous studies on the spectral bias of DNs in classification and generative settings (Chakrabarty & Maji, 2019; Rahaman et al., 2019; Schwarz et al., 2021).
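The rate argument can be illustrated with a toy computation (ours, not the paper's experiment): for gradient descent on a quadratic loss, the residual along the $i$-th eigen-direction decays as $(1-\eta\lambda_i)^t$, so with an exponentially decaying spectrum the top directions are fit exponentially faster than the bottom ones.

```python
import numpy as np

def residual_per_direction(eigvals, lr, t):
    """Residual left in each eigen-direction after t steps of gradient
    descent with step size lr on a quadratic loss."""
    return (1.0 - lr * np.asarray(eigvals)) ** t

spectrum = 2.0 ** -np.arange(8)        # exponentially decaying eigenvalues
res = residual_per_direction(spectrum, lr=0.5, t=50)
# the top direction is essentially fit, the bottom one barely started
```

Under this simple model the bottom (perception-relevant) directions retain most of their residual long after the top (reconstruction-dominant) directions have converged.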

Combining the observations from this section supports R1 and R2 from Section 1. It remains to study R3, which states that since features for perception lie within a subspace that is negligible as measured by the reconstruction loss, one can find two separate models that equally solve the reconstruction task (same train and test loss values) but provide drastically different perception task performance.

We now turn to R3, stating that features for perception lie within a space with negligible impact on the reconstruction error, therefore motivating the need for additional guidance during training, e.g., through denoising tasks. To do so, we will show that it is possible to obtain two DNs with the same reconstruction error but one with perception capabilities far greater than the other (Section 4.1). Lastly, we will prove that some guidance can be provided to the learned representation to reduce that gap and focus on more useful features through careful design of the denoising task (Section 4.2).

One interesting benefit emerging from the observations made for R1 and R2 is that guiding a DN to focus on the subspace containing informative features for perception has minimal impact on the reconstruction loss, as the two objectives focus on different subspaces. We therefore propose a simple experiment to demonstrate the above argument. We take a resnet34 autoencoder and train it with the usual reconstruction loss (MSE) on Imagenette. This gives us a model that (as per R1 and R2) fails to properly focus on discriminative features. To obtain the second DN with improved classification performance, we simply add a classification head on top of the encoder's embedding. That is, the same embedding that is fed to the decoder for reconstruction is also fed to a linear classifier with a supervised training loss (recall Eq. 8). We obtain the key insight of R3: one can produce two DNs with the same training loss (reconstruction) and validation loss (reconstruction), but with significantly different classification performance, as reported in Figs. 5 and 6.

To further understand that observation, we can recall the results from Theorem 1. We demonstrated that the encoder (${\bm{V}}$), having $K$ dimensions at its disposal, is optimal when selecting the top singular vectors of the data matrix ${\bm{X}}$. If $K$ is large enough to encompass both the top subspace of the data (which is learned first and has the greatest impact on the reconstruction loss) and the bottom subspace (which is useful for perception, as per Figs. 1 and 4), then both objectives can coexist (recall Proposition 1) as long as enough capacity is given to the encoder. We therefore obtain the following key insight: whenever the capacity of the autoencoder is large, the encoder embedding can (and will, at the end of training) include features useful for perception while still being able to reconstruct its inputs. Again, for this to happen the capacity of the encoder must grow with the image resolution (as more and more dimensions are taken up by the top subspace) and with the complexity of the image background (again taking up more dimensions in the top subspace) (recall Section 3.2).

The above observation demonstrates that learning to reconstruct needs an additional training signal to focus towards discriminative features. As we will prove below, learning by denoising offers such a solution.

Recalling the Denoising Autoencoder setting from Eq. 2, we aim to obtain a closed-form solution of the linear loss (Eq. 3) in order to understand why masking and additive Gaussian noise produce representations of different quality for perception.

Our goal is therefore to study the misalignment metric (Eq. 7) under the denoising setting which, as per Corollary 1.2, is given by

which is the minimum supervised loss that can be attained using the representation from ${\bm{V}}^{*}$ that minimizes the denoising loss. We ought to highlight that we can obtain a closed-form solution under the expectation over the noise distribution $({\bm{X}}^{\prime}|{\bm{X}})$, as formalized below, where we denote ${\bm{G}}\triangleq\mathbb{E}_{{\bm{X}}^{\prime}|{\bm{X}}}\left[{\bm{X}}^{\prime}{{\bm{X}}^{\prime}}^{\top}\right]$ and ${\bm{S}}\triangleq\mathbb{E}_{{\bm{X}}^{\prime}|{\bm{X}}}\left[{\bm{X}}^{\prime}\right]$.

The closed-form solution for ${\bm{V}}^{*}$ from Eq. 9 is given by ${\bm{V}}^{*}\text{ spans }({\bm{P}}_{{\bm{G}}}{\bm{D}}_{{\bm{G}}}^{-\frac{1}{2}}{\bm{P}}_{{\bm{H}}})_{.,1:K}$, where ${\bm{H}}\triangleq{\bm{D}}_{{\bm{G}}}^{-\frac{1}{2}}{\bm{P}}_{{\bm{G}}}^{\top}{\bm{S}}{\bm{X}}^{\top}{\bm{X}}{\bm{S}}^{\top}{\bm{P}}_{{\bm{G}}}{\bm{D}}_{{\bm{G}}}^{-\frac{1}{2}}$. (Proof in Section 6.4.)

The above result demonstrates that even when employing denoising autoencoders with additive Gaussian noise or masking, we can obtain a closed-form solution for ${\bm{V}}$, and from it all the alignment metrics studied so far. In particular, Section 6.4 also provides the closed-form solutions for ${\bm{G}}$ and ${\bm{S}}$. We illustrate the alignment between reconstruction and perception tasks in the denoising autoencoder setting (Eq. 9) in Fig. 7 for the case of random masking, as per the MAE setting. We clearly observe that the denoising task has the ability to increase the alignment between the two tasks, especially for small embedding dimensions ($K$). We however observe that the size of the masked patches providing the best gains varies with the dataset, hinting at another challenge of denoising autoencoders: the cross-validation of the denoising task. Another formal result, proposed below, reinforces that point.
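For the simplest masking case, the noise moments ${\bm{S}}$ and ${\bm{G}}$ have a short closed form that is standard for multiplicative dropout (this sketch is ours, not the paper's Section 6.4 code; the function name is hypothetical): for per-pixel Bernoulli masking ${\bm{X}}'={\bm{M}}\odot{\bm{X}}$ with keep probability $p$, $\mathbb{E}[{\bm{X}}']=p{\bm{X}}$ and $\mathbb{E}[{\bm{X}}'{{\bm{X}}'}^{\top}]=p^{2}{\bm{X}}{\bm{X}}^{\top}+p(1-p)\operatorname{diag}({\bm{X}}{\bm{X}}^{\top})$.

```python
import numpy as np

def masking_moments(X, p):
    """Closed-form noise moments for per-pixel Bernoulli(p) keep-masking
    (patch size (1,1), i.e., multiplicative dropout):
    S = E[X'] and G = E[X' X'^T]."""
    S = p * X
    XXt = X @ X.T
    # off-diagonal entries see two independent masks (p^2), the diagonal
    # sees a single mask (E[m^2] = p), hence the p(1-p) correction
    G = p**2 * XXt + p * (1 - p) * np.diag(np.diag(XXt))
    return S, G
```

Plugging these moments into the theorem above yields the alignment curves of Fig. 7 for the $(1,1)$ patch size; larger patches correlate the mask entries and change ${\bm{G}}$ accordingly.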

Denoting by ${\bm{V}}^{*}(\sigma)$ the optimal denoising autoencoder parameters when employing additive isotropic Gaussian noise with standard deviation $\sigma$, we obtain the following statement showing that this type of denoising task does not help supervised tasks.

Under the settings of Theorem 3, additive Gaussian noise has no impact on the supervised task performance, as ${{\bm{W}}^{*}}^{\top}{{\bm{V}}^{*}(\sigma)}^{\top}={{\bm{W}}^{*}}^{\top}{{\bm{V}}^{*}(0)}^{\top},\forall\sigma\geq 0$, regardless of the supervised task. (Proof in Section 6.5.)

We therefore obtain the following insights. Denoising tasks offer powerful guidance to skew learned representations towards better alignment with perception tasks (Fig. 7), but some noise distributions, such as additive Gaussian noise, are provably unable to help. A challenge that naturally emerges is selecting an adequate denoising task, e.g., to avoid Corollary 3.1 in a setting where labels are not available and the supervised tasks to be tackled may not be known a priori. An interesting byproduct of our study (Corollary 3.1) is that it is possible to assess whether a denoising task has any impact on the perception task without knowing what that supervised task is. That alone could help in focusing on denoising tasks that do have an impact, albeit it will remain unknown whether that impact is beneficial.

We proposed to study the transferability of representations learned by reconstruction towards perception tasks. In particular, we found that the two objectives are fundamentally misaligned, with a degree of misalignment that grows with the presence of complicated backgrounds, with a greater number of classes for the perception task, and with higher image resolutions. While our study focused on bringing those limitations forward from theoretical and empirical angles, we also opened new avenues to reduce those limitations in the future. For example, we obtained a closed-form solution to measure the impact of noise distributions on the alignment between the learned representation and the downstream perception task. This novel methodology opens the door to a priori selection of noise distribution candidates. Even when the downstream task is unknown, we found that some noise distributions, such as additive Gaussian noise, are effectively unable to provide any benefit for better aligning reconstruction and perception tasks. Conversely, we validated that masking is a valid strategy, albeit requiring some per-dataset tuning. That finding is in line with MAE's performance going from about 50% to 74% top-1 accuracy on Imagenet when masking is employed. We hope that our study will also open new avenues to study reconstruction methods for other modalities such as time-series and NLP.

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

The first part of the proof finds the optima ${\bm{W}}^{*}$ and ${\bm{Z}}^{*}$ as a function of ${\bm{V}}$, which is direct since we are in a least-squares setting for each of them. The second part consists in showing that the optimal ${\bm{V}}$ can be found as the solution of a generalized eigenvalue problem. The third and final step expresses the solution for ${\bm{V}}$ in a closed form that is also friendly for computations.

recalling that $\|{\bm{M}}\|_{F}^{2}=\operatorname{Tr}({\bm{M}}^{\top}{\bm{M}})$, the above simplifies to

we are now going to find the optimal ${\bm{W}}$ and ${\bm{Z}}$, which are unique by convexity of the loss and of their domains. Recall that we assume ${\bm{Y}}$ and ${\bm{X}}$ to be full-rank (therefore also making ${\bm{V}}$ full-rank). Recalling the derivative rules for traces, we obtain

setting it to zero (we assume here $\lambda>0$, otherwise we cannot solve for ${\bm{Z}}$ since its value does not impact the loss) and solving leads to

We have now solved for ${\bm{W}},{\bm{Z}}$ as a function of ${\bm{V}}$, i.e., the loss is now only a function of ${\bm{V}}$, which we are going to solve for now.
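For reference, the standard trace-derivative identities invoked in the step above are (see, e.g., the Matrix Cookbook):

```latex
\frac{\partial}{\partial {\bm{W}}}\operatorname{Tr}\left({\bm{A}}{\bm{W}}{\bm{B}}\right)={\bm{A}}^{\top}{\bm{B}}^{\top},
\qquad
\frac{\partial}{\partial {\bm{W}}}\operatorname{Tr}\left({\bm{W}}^{\top}{\bm{A}}{\bm{W}}\right)=\left({\bm{A}}+{\bm{A}}^{\top}\right){\bm{W}}.
```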

Step 2. We will first proceed by plugging the values for ${\bm{W}}^{*},{\bm{Z}}^{*}$ back into the loss, which will now be only a function of ${\bm{V}}$. Let's first simplify our derivations by noticing that

and similarly

finally making the entire loss simplify as follows

First, notice that both ${\bm{X}}\left({\bm{Y}}^{\top}{\bm{Y}}+\lambda{\bm{X}}^{\top}{\bm{X}}\right){\bm{X}}^{\top}$ and ${\bm{X}}{\bm{X}}^{\top}$ are symmetric. Therefore, we can minimize the loss by solving the following generalized eigenvalue problem:

where ${\bm{V}}$ are the eigenvectors of the generalized eigenvalue problem and $\Lambda$ the eigenvalues; the solution to our problem will then be any rotation of any $K$ eigenvectors, but the minimum will be achieved for the top-$K$ ones.

Step 3. We will first demonstrate the general solution of the generalized eigenvalue problem. Given that solution, it will be easy to take the top-$K$ eigenvectors that solve the considered problem. Denoting ${\bm{A}}\triangleq{\bm{X}}\left({\bm{Y}}^{\top}{\bm{Y}}+\lambda{\bm{X}}^{\top}{\bm{X}}\right){\bm{X}}^{\top}$, ${\bm{B}}\triangleq{\bm{X}}{\bm{X}}^{\top}$, and ${\bm{H}}\triangleq{\bm{D}}_{{\bm{B}}}^{-\frac{1}{2}}{\bm{P}}_{{\bm{B}}}^{\top}{\bm{A}}{\bm{P}}_{{\bm{B}}}{\bm{D}}_{{\bm{B}}}^{-\frac{1}{2}}$, we have

therefore the eigenvalues are given by ${\bm{D}}_{{\bm{H}}}$ and the eigenvectors by ${\bm{P}}_{{\bm{B}}}{\bm{D}}_{{\bm{B}}}^{-\frac{1}{2}}{\bm{P}}_{{\bm{H}}}$, or equivalently $({\bm{P}}_{{\bm{X}}{\bm{X}}^{\top}}{\bm{D}}_{{\bm{X}}{\bm{X}}^{\top}}^{-\frac{1}{2}}{\bm{P}}_{{\bm{H}}})_{.,1:K}$ when taking the first $K$ columns. So the optimal ${\bm{V}}$ is any rotation of the top-$K$ eigenvectors. The above is simple to use as-is whenever $N>D$; if not, we can obtain a solution without having to compute any $D\times D$ matrix, thus making the process more efficient. To that end, we can obtain

that only involves $D\times\min(D,N)$ matrices instead of $D\times D$ ones.
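The whitening construction of Step 3 can be sketched numerically as follows (a minimal sketch under our naming, not the paper's code; it mirrors the definition of ${\bm{H}}$ above):

```python
import numpy as np

def generalized_eig(A, B):
    """Solve A v = lambda B v for symmetric A and positive-definite B by
    whitening with the eigendecomposition B = P_B D_B P_B^T."""
    d_B, P_B = np.linalg.eigh(B)
    W = P_B / np.sqrt(d_B)          # = P_B @ D_B^{-1/2}
    H = W.T @ A @ W                 # symmetric whitened problem
    d_H, P_H = np.linalg.eigh(H)
    V = W @ P_H                     # generalized eigenvectors
    return d_H, V
```

One can verify on random symmetric positive-definite inputs that the returned pairs satisfy ${\bm{A}}{\bm{V}}={\bm{B}}{\bm{V}}\Lambda$, as the proof requires.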

We will start with the fully supervised (least-squares) proof obtained when $\lambda=0$. Also notice that in any case, we have that

which then leads to

from the above, we can simply plug those values into the analytical form for ${\bm{V}}^{*}$ from Eq. 4 to obtain

and finally the product of both matrices (which produces the supervised linear model) is obtained as

therefore recovering the OLS optimal solution. Note that if $K<C$ then we have a bottleneck, and we obtain an interesting alternative solution that looks at the top subspace of ${\bm{Y}}$ (this is however never the case in OLS settings).

Principal Component Analysis recovery. We now consider the case where we only employ the unsupervised loss (akin to $\lambda\to\infty$). In this setting we get

therefore the optimal form for ${\bm{V}}$ will be

which will select the top-$K$ subspace of ${\bm{X}}$ (recall that the eigenvalues of ${\bm{H}}$ are ${\bm{D}}_{{\bm{X}}{\bm{X}}^{\top}}$, and therefore its top-$K$ eigenvectors select the top-$K$ dimensions of the subspace). Then the solution for ${\bm{Z}}$ from Eq. 6 gives

which is the projection matrix onto the top-$K$ subspace of the data ${\bm{X}}$, i.e., recovering the optimal solution of Principal Component Analysis. ∎
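This PCA recovery is easy to sanity-check numerically (our own check, not part of the paper): the projector onto the top-$K$ eigen-subspace of ${\bm{X}}{\bm{X}}^{\top}$ reconstructs ${\bm{X}}$ at least as well as any other rank-$K$ orthogonal projector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 40))          # D = 6, N = 40
K = 2
_, P = np.linalg.eigh(X @ X.T)
P = P[:, ::-1]                        # descending eigenvalue order
pca_proj = P[:, :K] @ P[:, :K].T      # projector the proof derives
pca_err = np.linalg.norm(X - pca_proj @ X) ** 2
```

Comparing `pca_err` against the reconstruction error of random rank-$K$ projectors confirms the Eckart-Young optimality underlying the proof.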

Recall that in the $\lambda\to\infty$ regime, we have ${\bm{V}}^{*}={\bm{P}}_{{\bm{X}}{\bm{X}}^{\top}}({\bm{D}}_{{\bm{X}}{\bm{X}}^{\top}}^{-\frac{1}{2}})_{.,1:K}$ and ${\bm{W}}^{*}={\bm{P}}^{\top}_{{\bm{X}}{\bm{X}}^{\top}}{\bm{Y}}^{\top}$. We thus develop

as $\|{\bm{Y}}\|_{F}^{2}$ is a constant with respect to the parameters, we consider $\|{\bm{Y}}({\bm{P}}_{{\bm{X}}{\bm{X}}^{\top}})_{.,1:K}\|_{F}^{2}$ as our alignment measure (the greater, the better the supervised loss can be minimized from the parameters). Since this quantity lives in the range $[0,\|{\bm{Y}}{\bm{P}}_{{\bm{X}}{\bm{X}}^{\top}}\|_{F}^{2}]$, the reparametrization from Eq. 7 yields the proposed measure of alignment rescaled to $[0,1]$. ∎

We need to find the optimal reconstruction solution $({\bm{V}}^{*},{\bm{Z}}^{*})$ first (corresponding to the case of $\lambda\to\infty$), and then plug it into the supervised loss with the optimal ${\bm{W}}$.

from which we obtain

leading to ${\bm{W}}^{*}=({\bm{V}}^{\top}\mathbb{E}[{\bm{X}}^{\prime}{\bm{X}}^{\prime\top}]{\bm{V}})^{-1}{\bm{V}}^{\top}\mathbb{E}[{\bm{X}}^{\prime}]{\bm{X}}^{\top}$, which we can plug back into the loss to obtain

Given those optima, we can obtain the alignment measure as before: the supervised loss obtained from ${\bm{V}}^{*}$ from the unsupervised loss and ${\bm{Z}}^{*}$ from the supervised one:

whose optimum is therefore given by ${\bm{Z}}^{*}=({{\bm{V}}^{*}}^{\top}\mathbb{E}[{\bm{X}}^{\prime}{\bm{X}}^{\prime\top}]{\bm{V}}^{*})^{-1}{{\bm{V}}^{*}}^{\top}\mathbb{E}[{\bm{X}}^{\prime}]{\bm{Y}}^{\top}$, which can then be plugged back to produce the (unnormalized) measure. The analytical forms of the noise moments can be obtained following the derivations of Balestriero et al. (2022). For example, the case of additive Gaussian noise is trivial and commonly derived, e.g., to show the link between ridge regression and additive noise: $\mathbb{E}[{\bm{X}}^{\prime}]={\bm{X}}$ and $\mathbb{E}[{\bm{X}}^{\prime}{\bm{X}}^{\prime\top}]={\bm{X}}{\bm{X}}^{\top}+\sigma{\bm{I}}$. The perhaps more interesting derivations concern the masking employed by MAE.

In the case of additive, centered Gaussian noise, we have $\mathbb{E}[{\bm{X}}^{\prime}]={\bm{X}}$ and $\mathbb{E}[{\bm{X}}^{\prime}{\bm{X}}^{\prime\top}]={\bm{X}}{\bm{X}}^{\top}+\sigma{\bm{I}}$. Therefore the optimal value for ${\bm{V}}$ is given by solving the generalized eigenvalue problem $({\bm{X}}{\bm{X}}^{\top}{\bm{X}}{\bm{X}}^{\top},{\bm{X}}{\bm{X}}^{\top}+\sigma{\bm{I}})$. Recalling the derivation of the optimal solution for such a problem from Section 6.1, we have that

and the important property to notice is that the ordering of the eigenvalues of ${\bm{H}}$, which are given by ${\bm{D}}_{{\bm{X}}{\bm{X}}^{\top}}^{2}({\bm{D}}_{{\bm{X}}{\bm{X}}^{\top}}+\sigma{\bm{I}})^{-1}$, is the same as the ordering of the eigenvalues of ${\bm{X}}{\bm{X}}^{\top}$, which are given by ${\bm{D}}_{{\bm{X}}{\bm{X}}^{\top}}$. That is, the top-$K$ subspace picked up by ${\bm{P}}_{{\bm{H}}}$ is the same for any noise standard deviation $\sigma$. Now, given the closed form for ${\bm{V}}$, we can obtain the closed form of the classifier weights ${\bm{W}}$ from Eq. 5 to be

therefore recovering the OLS optimal solution ${\bm{Y}}{\bm{X}}^{\top}({\bm{X}}{\bm{X}}^{\top})^{-1}$ whenever $K\geq D$, and otherwise the projection onto the top subspace of ${\bm{X}}$; in either case the final parameters are invariant to the choice of the standard deviation $\sigma$ of the additive Gaussian noise during the denoising autoencoder pre-training phase.
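The ordering argument at the heart of this proof, that $d\mapsto d^{2}/(d+\sigma)$ is strictly increasing on $d\geq 0$ and therefore preserves the eigenvalue ranking of ${\bm{X}}{\bm{X}}^{\top}$, is easy to check numerically (our own companion check, not part of the paper):

```python
import numpy as np

def denoised_spectrum(d, sigma):
    """Eigenvalues of H under additive Gaussian noise of variance sigma,
    as a function of the eigenvalues d of X X^T (sorted descending).
    Since d^2 / (d + sigma) is strictly increasing in d, the ranking,
    and hence the selected top-K subspace, is the same for every sigma."""
    return d**2 / (d + sigma)
```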

The first step of the proof is to rewrite the joint classification and reconstruction objective with an arbitrary encoder network $f_{\theta}$

as the nonparametric version

both being identical if we assume that the encoder is powerful enough to reach any representation, which is a realistic assumption given current architectures. Given that nonparametric objective, we can now solve for both the optimal decoder weights ${\bm{V}}$ and the optimal representation ${\bm{Z}}$ as follows

which is solved by ${\bm{Z}}$ being any orthogonal matrix in the subspace of the top-$K$ eigenvectors of $\left({\bm{Y}}^{\top}{\bm{Y}}+\lambda{\bm{X}}^{\top}{\bm{X}}\right)$. Now, as $\lambda\to\infty$, the encoder becomes more and more linear, ultimately converging to $f_{\theta}({\bm{x}})={\bm{U}}{\bm{x}}$ with ${\bm{U}}\in\operatorname{span}\{\operatorname{eigvec}({\bm{X}}^{\top}{\bm{X}})_{1},\dots,\operatorname{eigvec}({\bm{X}}^{\top}{\bm{X}})_{K}\}$. ∎

Given a matrix ${\bm{X}}\in\mathcal{M}_{D,N}(\mathbb{R})$ with $D>N$, computing the eigendecomposition of ${\bm{X}}{\bm{X}}^{\top}$, a $D\times D$ matrix, is $\mathcal{O}(D^{3})$; it can instead be obtained in $\mathcal{O}(N^{3}+DN^{2})$ as

```python
import numpy as np

def fast_gram_eigh(X, major="C", unit_test=False):
    """
    Compute the eigendecomposition of the Gram matrix:
    - X @ X.T using column (C) major notation
    - X.T @ X using row (R) major notation
    When the other Gram matrix is smaller, we decompose it instead and
    map its eigenvectors back, avoiding any large eigendecomposition.
    """
    X_view = X.T if major == "C" else X

    if X_view.shape[1] <= X_view.shape[0]:
        # this case is the usual formula
        eigvals, eigvecs = np.linalg.eigh(X_view.T @ X_view)
    else:
        # in this case we work in the transposed domain: if v is an
        # eigenvector of A A^T with eigenvalue s > 0, then A^T v / sqrt(s)
        # is a unit eigenvector of A^T A with the same eigenvalue
        eigvals, eigvecs = np.linalg.eigh(X_view @ X_view.T)
        eigvecs = X_view.T @ eigvecs
        eigvecs[:, eigvals > 0] /= np.sqrt(eigvals[eigvals > 0])

    if unit_test:
        # ensuring that we recover the correct (nonzero) eigenvalues
        eigvals_slow, _ = np.linalg.eigh(X_view.T @ X_view)
        assert np.allclose(eigvals, eigvals_slow[-len(eigvals):])
    return eigvals, eigvecs
```

since we have the relation

and thus we can simply compute the eigenvectors of the $N\times N$ matrix ${\bm{X}}^{\top}{\bm{X}}$ and get the eigenvectors of the $D\times D$ matrix ${\bm{X}}{\bm{X}}^{\top}$ by left-multiplying them by ${\bm{X}}$, each rescaled by $\frac{1}{\|{\bm{X}}{\bm{v}}\|_{2}}$ to remain of unit norm (the nonzero eigenvalues are unchanged).

We want to sweep over the latent dimension $K$. As such, we can avoid recomputing the metric for each value and get them all at once, as below. We again use the column-major notation as per Section 2:
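A minimal NumPy sketch of this all-at-once sweep (ours, not the paper's code; it assumes column-major data ${\bm{X}}$ of shape $(D,N)$ with $D\geq N$ so the Gram matrix is full rank, and one-hot labels ${\bm{Y}}$ of shape $(C,N)$):

```python
import numpy as np

def alignment_sweep(X, Y):
    """Compute alignment(k) of Eq. 7 for all k = 1..N in one pass:
    one eigendecomposition of the Gram matrix, then a cumulative sum
    of the per-column energies of Y^T Y projected on it."""
    _, P = np.linalg.eigh(X.T @ X)          # N x N Gram eigenvectors
    P = P[:, ::-1]                          # sort by decreasing eigenvalue
    G = Y.T @ Y                             # label Gram matrix
    energy = np.sum((G @ P) ** 2, axis=0)   # ||Y^T Y P[:, j]||^2 per column
    cums = np.cumsum(energy)                # ||Y^T Y P[:, :k]||_F^2
    return cums / cums[-1]                  # rescaled to [0, 1]
```

The returned curve is nondecreasing in $k$ and reaches $1$ at $k=N$, matching the properties stated in Section 3.1.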

Figure 1: Features for reconstruction are uninformative for perception (top): TinyImagenet ResNet9 top-1 accuracy when trained and validated on images projected onto the top subspace (red) or bottom subspace (blue) of explained variance; corresponding images are displayed in the middle and in Fig. 9. Perception features are learned last (bottom): training loss evolution (red to blue) of reconstructed training images from a deep autoencoder, projected onto the eigenspace of the original data (black). The top eigenspace (right) is learned first, and then, if training lasts long enough, the features most useful for perception (left) are finally learned. This explains why performance on perception tasks keeps increasing long after reconstructed samples look appealing.

Figure 2: Depiction of the closed-form alignment measure from Eq. 7, measuring the minimum supervised training error achievable given the optimal reconstruction parameters, as per Theorem 1 and Corollary 1.2. Top: in terms of the latent dimension $K$ (x-axis). Bottom: in terms of the ratio of the latent dimension $K$ to the input dimension $D$. We clearly observe that as the dataset becomes more realistic (going from background-free images to CIFAR and then to TinyImagenet), the alignment between the reconstruction and supervised tasks lessens. In particular, on TinyImagenet, the alignment only increases linearly with the latent space dimension.

Figure 3: Reprise of Fig. 1 for additional autoencoder architectures: convolutional encoder with deconvolutional decoder (top) and MLP encoder and decoder (bottom). We clearly observe that the top subspace is learned first during training; it is the subspace that best minimizes the reconstruction loss but contains the least informative features for perception, as per Fig. 4.

Figure 4: Classification accuracy of a ResNet9 DNN trained and tested on images projected onto the top (red) or bottom (blue) subspace, as ordered by the eigenvalues of the data covariance matrix, without data-augmentation (top) and with data-augmentation (bottom). We clearly observe that, except for datasets without background, for which reconstruction and classification are better aligned (recall Fig. 2), the final performance is greater when employing the subspace of the data that explains the least pixel variance, i.e., the bottom subspace.

Figure 5: Depiction of multiple resnet34 autoencoders with varying embedding dimensions (light to dark), some trained only to reconstruct the input samples with data-augmentations (blue) and others with an additional supervised loss signal, as per Eq. 8 (green). We report the test set accuracy and the relative difference (y-axis) for each pair of models, i.e., models with identical training settings except for the use of the supervised signal, as a function of the train and test reconstruction loss. We clearly observe that for any embedding dimension and reconstruction loss, one can find two sets of parameters with drastically different abilities to solve perception tasks. Reconstructed samples and training curves are provided in Fig. 6.

Refer to caption Depiction of two resnet34 autoencoders trained on Imagenette (Imagenet-10) images, one (orange) with an additional training signal that favors latent representations suited for classification, and the other (blue) with only the reconstruction loss. As per R1 and R2, the latter naturally focuses on suboptimal features, as showcased by the test accuracy with both a linear and a nonlinear probe. Crucially, the autoencoder with the additional signal produces representations with much greater discriminative power in both the linear and nonlinear settings. Yet, despite popular belief, doing so has no impact on the reconstruction losses on the train or test set, and thus no impact on the quality of the reconstructions presented at the top, thereby validating R3.

Refer to caption Depiction of the relative alignment difference when employing denoising tasks (recall Eq. 9) with masking noise, with probability of dropping ranging from 0% to 99% (cyan to pink), for patch sizes of (1,1), recovering multiplicative dropout (top), (2,2) (middle), and (4,4) (bottom) on various datasets. A positive number indicates a beneficial impact of the denoising loss on the supervised performance of the learned representation. We observe that for datasets such as ArabicDigits, which already have a strong alignment between the two tasks (recall Fig. 2), any form of masking is detrimental except with shape (1,1). However, for datasets such as CIFAR100 (right column) with originally poor alignment, masking is beneficial and increases the alignment between the two tasks. As the original alignment increases with K, the benefit of masking diminishes.
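The masking strategies compared above are easy to instantiate. The following is a minimal NumPy sketch (the helper name `patch_mask` and the shapes are our own illustration, not code from the paper): a Bernoulli keep/drop decision is drawn per patch and broadcast to pixel resolution, so a (1,1) patch recovers i.i.d. multiplicative dropout while (2,2) and (4,4) drop contiguous blocks.

```python
import numpy as np

def patch_mask(h, w, patch, drop_prob, rng):
    """Boolean keep-mask over an h-by-w image, dropping whole patches.

    A (1, 1) patch is i.i.d. multiplicative dropout; larger patches
    drop contiguous blocks, as in masked autoencoders."""
    gh = -(-h // patch[0])  # ceil division: number of patch rows
    gw = -(-w // patch[1])  # number of patch columns
    keep = rng.random((gh, gw)) >= drop_prob       # True = keep this patch
    mask = np.kron(keep, np.ones(patch, dtype=bool))  # upsample to pixels
    return mask[:h, :w]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8))                  # stand-in "image"
x_masked = x * patch_mask(8, 8, (2, 2), 0.5, rng)
```

The denoising objective of Eq. 9 would then reconstruct `x` from `x_masked`, with the mask shape and ratio playing the role of the hyperparameters swept in this figure.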

Figure 8. Empirical validation of Theorem 1 comparing the loss value at the optimum (from Eqs. 4, 5 and 6) against the one minimized with gradient descent (Adam optimizer) (y-axis) over gradient steps (t, x-axis). We expect that difference to get close to 0 as the gradient updates converge to the minimum value of the loss. Although the quantity loss(t) − loss(optimum) is nonnegative in theory, we observe that its minimum value (reported in the title of each subplot) is sometimes negative with negligible magnitude due to round-off error. We compare numerous values of K, D, N as given in the titles of each row, and different values of λ ∈ {0.0, 0.1, 1, 10} (column).

Figure 9. Depiction of Imagenet images (top) projected onto different subspaces obtained from Principal Component Analysis corresponding to the subspace explaining the top 75% of pixel variance (middle) and bottom 25% of pixel variance (bottom). We clearly observe that the image representation preserved after projection onto the bottom subspace makes the perception task (classification) easier to solve than if projected onto the top subspace, where the lower-frequency information is insufficient to classify the depicted object (recall the classification performances of DNs applied to those different projections from Figs. 1 and 4).
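The projections used in these figures amount to a few lines of linear algebra. Below is a sketch with synthetic correlated data standing in for flattened images (the datasets themselves are not reproduced here): it splits each input into its component in the top subspace explaining at least 75% of the pixel variance and the complementary bottom subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 64
# Synthetic correlated "pixels" standing in for flattened images.
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))

# Eigenvectors of the pixel covariance, sorted by decreasing variance.
evals, P = np.linalg.eigh(np.cov(X, rowvar=False))
evals, P = evals[::-1], P[:, ::-1]

# Smallest k whose top-k subspace explains at least 75% of the variance.
ratio = np.cumsum(evals) / evals.sum()
k = int(np.searchsorted(ratio, 0.75) + 1)

X_top = X @ P[:, :k] @ P[:, :k].T   # projection onto the top subspace
X_bot = X @ P[:, k:] @ P[:, k:].T   # projection onto the bottom subspace
```

Training one classifier on `X_top` and another on `X_bot` is the comparison whose accuracy gap is reported in Figs. 1 and 4.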

$$ \min_{\theta} \mathbb{E}_{\vx \sim p_{\vx}}\left[d\left(g_{\theta}\left(f_{\theta}(\vx)\right), \vx\right)\right],\label{eq:AE} $$ \tag{eq:AE}
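As a concrete instance of this objective, the sketch below trains a linear autoencoder by gradient descent on the squared-error version of Eq. (AE). This is our own minimal setup, not the paper's architecture; it illustrates the connection exploited throughout the paper, namely that the optimal linear reconstruction loss equals the variance left outside the top-K principal subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 8, 2, 200
X = rng.normal(size=(D, N))

# Linear encoder f(x) = E x and decoder g(z) = W z, trained with plain
# gradient descent on the reconstruction objective (d = squared error).
E = 0.1 * rng.normal(size=(K, D))
W = 0.1 * rng.normal(size=(D, K))
lr = 0.01
for _ in range(5000):
    R = W @ (E @ X) - X               # residual g(f(X)) - X
    W -= lr * 2 * R @ (E @ X).T / N
    E -= lr * 2 * W.T @ R @ X.T / N

ae_loss = np.sum((W @ (E @ X) - X) ** 2) / N

# Lower bound: variance outside the top-K eigenspace of X X^T / N.
evals = np.linalg.eigvalsh(X @ X.T / N)   # ascending order
pca_loss = evals[:-K].sum()
```

The gradient-descent loss can approach but never undercut `pca_loss`, which is the mechanism behind the principal subspace being learned first.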


$$ \text{alignment}(k) \triangleq \frac{| \mY^\top\mY(\mP_{\mX\mX^\top})_{.,1:k}|_F^2}{|\mY^\top\mY\mP_{\mX\mX^\top}|_F^2},\label{eq:alignment} $$ \tag{eq:alignment}
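A numerically convenient way to evaluate this measure goes through the SVD of the data matrix, since the numerator only keeps the top-k principal directions. The helper below is our own sketch (the function name and the use of right singular vectors are implementation choices, not code from the paper), assuming X is the D×N data matrix with samples as columns and Y the C×N target matrix.

```python
import numpy as np

def alignment(X, Y, k):
    """Fraction of the targets' energy lying in the top-k principal
    subspace of X, rescaled to [0, 1] as in Eq. (alignment)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    num = np.linalg.norm(Y @ Vt[:k].T) ** 2   # top-k principal directions
    den = np.linalg.norm(Y @ Vt.T) ** 2       # all principal directions
    return num / den

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 50))
Y = rng.normal(size=(3, 50))
curve = [alignment(X, Y, k) for k in range(1, 6)]
```

The resulting curve is nondecreasing in k and reaches 1 at k = rank(X), consistent with the corollary stating that alignment(k) increases with k.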

$$ \text{alignment}(k) \triangleq \min_{\mW}| \mW^\top{\mV^}^\top\mX - \mY|_F^2,\nonumber\ {\mV^} = \argmin_{\mV}\min_{\mZ} \mathbb{E}_{\mX'|\mX}\left[ | \mZ^\top\mV^\top\mX'- \mX|_F^2\right],\label{eq:optimal_V} $$ \tag{eq:optimal_V}

$$ \mathcal{L} =& |\mY|_F^2- 2 \Trp{\mX^\top\mV\mW\mY}+\Trp{\mX^\top\mV\mW\mW^\top\mV^\top\mX}\\ &+ \lambda |\mX|_F^2 -2\lambda\Trp{\mX^\top\mV\mZ\mX} + \lambda\Trp{\mX^\top\mV\mZ\mZ^\top\mV^\top\mX}, $$



$$ \displaystyle{\bm{M}}\triangleq{\bm{Y}}^{\top}{\bm{Y}}+\lambda{\bm{X}}^{\top}{\bm{X}} $$

$$ \displaystyle{\bm{P}}_{{\bm{A}}}={\bm{V}}_{{\bm{S}}},\;{\bm{D}}_{{\bm{A}}}={\bm{\Sigma}}^{2}_{{\bm{S}}}, $$

$$ \mX = \mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{\frac{1}{2}}\mV_{\mX}^\top. $$


$$ \mW^\top\mV^\top = \mU_{\mY}\mSigma_{\mY}({\mV_{\mY}}_{.,1:K})^\top(\mX^{\dagger})^\top=\mY\mX^\top(\mX\mX^\top)^{-1}, K\geq C, $$

$$ \min_{\theta \in \mathbb{R}^{P},\mW \in \rmat{C}{K},\mV \in \rmat{D}{K}}| \mW f_{\theta}(\mX)-\mY |_F^2 + \lambda | \mV f_{\theta}(\mX)-\mX|_F^2, $$

$$ \mathcal{L} = | \mW^\top\mV^\top\mX-\mY|_F^2 + \lambda | \mZ^\top\mV^\top\mX-\mX|_F^2, $$

Theorem 1. The loss function from Eq. 3 is minimized for
$$ {\bm{V}}^{*} \text{ spans } {\bm{P}}_{{\bm{X}}{\bm{X}}^{\top}}{\bm{D}}_{{\bm{X}}{\bm{X}}^{\top}}^{-\frac{1}{2}}({\bm{P}}_{{\bm{H}}})_{.,1:K}, $$ \tag{4}
$$ {\bm{W}}^{*} = \left({{\bm{V}}^{*}}^{\top}{\bm{X}}{\bm{X}}^{\top}{\bm{V}}^{*}\right)^{-1}{{\bm{V}}^{*}}^{\top}{\bm{X}}{\bm{Y}}^{\top}, $$ \tag{5}
$$ {\bm{Z}}^{*} = \left({{\bm{V}}^{*}}^{\top}{\bm{X}}{\bm{X}}^{\top}{\bm{V}}^{*}\right)^{-1}{{\bm{V}}^{*}}^{\top}{\bm{X}}{\bm{X}}^{\top}, $$ \tag{6}
where ${\bm{H}}\triangleq{\bm{D}}_{{\bm{X}}{\bm{X}}^{\top}}^{-\frac{1}{2}}{\bm{P}}_{{\bm{X}}{\bm{X}}^{\top}}^{\top}{\bm{A}}{\bm{P}}_{{\bm{X}}{\bm{X}}^{\top}}{\bm{D}}_{{\bm{X}}{\bm{X}}^{\top}}^{-\frac{1}{2}}$. (Proof in Section 6.1, empirical validation in Fig. 8.)
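Theorem 1's construction can be checked numerically by following Eqs. 4 to 6 literally: build H from the eigendecomposition of XX^T, take its top-K eigenvectors, and compare the resulting loss against random subspaces. The following sketch uses synthetic data with sizes and names of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, C, K, lam = 6, 40, 3, 2, 1.0
X = rng.normal(size=(D, N))
Y = rng.normal(size=(C, N))

A = X @ (Y.T @ Y + lam * X.T @ X) @ X.T
B = X @ X.T

# V* spans P_B D_B^{-1/2} (P_H)_{.,1:K}, H = D_B^{-1/2} P_B^T A P_B D_B^{-1/2}.
eB, PB = np.linalg.eigh(B)
S = PB * eB ** -0.5                    # P_B D_B^{-1/2}
eH, PH = np.linalg.eigh(S.T @ A @ S)   # eigh returns ascending eigenvalues
Vs = S @ PH[:, ::-1][:, :K]            # top-K generalized eigenvectors

def loss_given_V(V):
    # Optimal W and Z for a fixed V (Eqs. 5 and 6), then the loss of Eq. 3.
    M = V.T @ B @ V
    W = np.linalg.solve(M, V.T @ X @ Y.T)
    Z = np.linalg.solve(M, V.T @ B)
    return (np.linalg.norm(W.T @ V.T @ X - Y) ** 2
            + lam * np.linalg.norm(Z.T @ V.T @ X - X) ** 2)
```

By the theorem, `loss_given_V(Vs)` lower-bounds `loss_given_V(V)` for any other K-dimensional choice of V, and `Vs` solves the generalized eigenvalue problem of the proof.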

Corollary. The solution from thm:linear_decoder recovers the OLS solution for ${\mW^*}^\top{\mV^*}^\top$ as $\lambda \to 0$, and the PCA solution for ${\mZ^*}^\top{\mV^*}^\top$ as $\lambda \to \infty$. (Proof in proof:PCA.)

Proposition. The supervised and reconstruction tasks are aligned (the optimal solutions do not depend on $\lambda$) iff the intersection of the top-$K$ eigenspaces of $\mX^\top\mX$ and $\mY^\top\mY$ is of dimension $K$.

Corollary. $alignment(k)$ from eq:alignment increases with $k$, has value $0$ iff the two losses are misaligned, and has value $1$ iff the two losses are aligned. (Proof in proof:alignment.)

Theorem. For any high-capacity encoder $f_{\theta}$, studying eq:bilinear and eq:nonlinear is equivalent at initialization for any decoder, and is always equivalent when the decoder is linear. (Proof in proof:linear_decoder.)

Theorem. The closed-form solution for $\mV^*$ from eq:optimal_V is given by $\mV^* \text{ spans } \mP_{\mG}\mD_{\mG}^{-\frac{1}{2}}(\mP_{\mH})_{.,1:K}$, where $\mH \triangleq \mD_{\mG}^{-\frac{1}{2}}\mP_{\mG}^\top\mS\mX^\top\mX\mS^\top\mP_{\mG} \mD_{\mG}^{-\frac{1}{2}}$. (Proof in proof:DAE.)

Corollary 3.1. Under the settings of Theorem 3, additive Gaussian noise has no impact on the supervised task performance since ${{\bm{W}}^{*}}^{\top}{{\bm{V}}^{*}(\sigma)}^{\top}={{\bm{W}}^{*}}^{\top}{{\bm{V}}^{*}(0)}^{\top},\;\forall\sigma\geq 0$, regardless of the supervised task. (Proof in Section 6.5.)
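This corollary can be sanity-checked in a few lines: under isotropic additive noise, the denoising-optimal V(σ) rescales the top-K eigenvectors by (d_i + σ)^{-1/2}, and that rescaling cancels once the optimal linear probe W is composed with it. The snippet below is our own toy verification on synthetic data, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, C, K = 6, 40, 3, 2
X = rng.normal(size=(D, N))
Y = rng.normal(size=(C, N))

d, P = np.linalg.eigh(X @ X.T)
d, P = d[::-1], P[:, ::-1]            # descending eigenvalues

def supervised_map(sigma):
    # V(sigma): top-K eigenvectors of X X^T rescaled per the
    # additive-Gaussian-noise closed form.
    V = P[:, :K] * (d[:K] + sigma) ** -0.5
    # Optimal linear probe on top of the frozen representation V^T X.
    W = np.linalg.solve(V.T @ X @ X.T @ V, V.T @ X @ Y.T)
    return W.T @ V.T                  # end-to-end supervised map
```

The returned map is identical for every σ ≥ 0, which is exactly why additive Gaussian noise buys nothing for the downstream supervised task.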


$$ \mathcal{L}(\mV,\mW,\mZ) = | \mW^\top\mV^\top\mX - \mY|_F^2 + \lambda|\mZ^\top\mV^\top\mX - \mX|_F^2,\label{eq:bilinear} $$ \tag{eq:bilinear}

$$ \mV^* &\text{ spans } \mP_{\mX\mX^\top}\mD_{\mX\mX^\top}^{-\frac{1}{2}}(\mP_{\mH})_{.,1:K},\label{eq:V}\\ \mW^* =& \left({\mV^*}^\top\mX\mX^\top\mV^*\right)^{-1}{\mV^*}^\top\mX\mY^\top,\label{eq:W}\\ \mZ^* =& \left({\mV^*}^\top\mX\mX^\top\mV^*\right)^{-1}{\mV^*}^\top\mX\mX^\top,\label{eq:Z} $$ \tag{eq:V}

$$ \nabla_{\mW}\mathcal{L} &= -2\mV^\top\mX\mY^\top + 2 \mV^\top\mX\mX^\top\mV\mW,\\ \nabla_{\mZ}\mathcal{L} &= -2\lambda\mV^\top\mX\mX^\top + 2\lambda \mV^\top\mX\mX^\top\mV\mZ, $$

$$ \Trp{\mX^\top\mV\mZ^*{\mZ^*}^\top\mV^\top\mX}=\Trp{\mX^\top\mV\mZ^*\mX}, $$

$$ \mathcal{L} &= |\mY|_F^2- 2 \Trp{\mX^\top\mV\mW^*\mY}+\Trp{\mX^\top\mV\mW^*{\mW^*}^\top\mV^\top\mX}\\ &\hspace{3cm}+ \lambda |\mX|_F^2 -2\lambda\Trp{\mX^\top\mV\mZ^*\mX} + \lambda\Trp{\mX^\top\mV\mZ^*{\mZ^*}^\top\mV^\top\mX}\\ &=|\mY|_F^2- \Trp{\mX^\top\mV\mW^*\mY}+\lambda |\mX|_F^2 -\lambda\Trp{\mX^\top\mV\mZ^*\mX}\\ &=|\mY|_F^2- \Trp{\mX^\top\mV (\mV^\top\mX\mX^\top\mV)^{-1}\mV^\top\mX\mY^\top \mY}\\ &\hspace{2cm}+\lambda |\mX|_F^2 -\lambda\Trp{\mX^\top\mV (\mV^\top\mX\mX^\top\mV)^{-1}\mV^\top\mX\mX^\top \mX}\\ &=|\mY|_F^2+\lambda |\mX|_F^2 - \Trp{\mX^\top\mV (\mV^\top\mX\mX^\top\mV)^{-1}\mV^\top\mX\left(\mY^\top\mY + \lambda \mX^\top\mX\right)}. $$

$$ \text{find } \mV \in \rmat{D}{D} \text{ so that } \mX\left(\mY^\top\mY + \lambda \mX^\top\mX\right)\mX^\top\mV = \mX\mX^\top \mV \Lambda, $$

$$ \mA\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH}&=\mB\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH} \Lambda\\ \iff \mP_{\mH}^\top\mD_{\mB}^{-\frac{1}{2}}\mP_{\mB}^\top\mA\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH}&=\mP_{\mH}^\top\mD_{\mB}^{-\frac{1}{2}}\mP_{\mB}^\top\mB\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH} \Lambda &&(\mP_{\mH}^\top\mD_{\mB}^{-\frac{1}{2}}\mP_{\mB}^\top \text{ bijective})\\ \iff \mP_{\mH}^\top\mD_{\mB}^{-\frac{1}{2}}\mP_{\mB}^\top\mA\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH}&=\Lambda\\ \iff \mD_{\mH}&=\Lambda, $$

$$ \mH =& \mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mA\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}\\ =& \mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP^\top_{\mX\mX^\top}\mX\mY^\top\mY\mX^\top\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}\\ =& \mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP^\top_{\mX\mX^\top}\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{\frac{1}{2}}\mV_{\mX}^\top\mY^\top\mY\mV_{\mX}\mD_{\mX\mX^\top}^{\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}\\ =& \mV_{\mX}^\top\mY^\top\mY\mV_{\mX}, $$

$$ \mV=\mP_{\mX\mX^\top}\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP_{\mH}=\mP_{\mX\mX^\top}\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mV_{\mX}^\top\mV_{\mY}=\mX^{\dagger}(\mV_{\mY})_{.,1:K} $$

$$ | {\mW^*}^\top{\mV^*}^\top\mX-\mY|_F^2=&|\mY\mP_{\mX\mX^\top}((\mD_{\mX\mX^\top}^{-\frac{1}{2}})_{.,1:K})^{\top} \mP_{\mX\mX^\top}^\top \mX - \mY |_F^2\\ =&|\mY\mP_{\mX\mX^\top}((\mD_{\mX\mX^\top}^{-\frac{1}{2}})_{.,1:K})^{\top} \mP_{\mX\mX^\top}^\top \mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{\frac{1}{2}}\mP_{\mX\mX^\top}^\top - \mY |_F^2\\ =&| \mY (\mP_{\mX\mX^\top})_{.,1:K}((\mP_{\mX\mX^\top})_{.,1:K})^\top-\mY|_F^2\\ =&|\mY|_F^2 - 2 \Trp{\mY (\mP_{\mX\mX^\top})_{.,1:K}((\mP_{\mX\mX^\top})_{.,1:K})^\top\mY^\top} + | \mY (\mP_{\mX\mX^\top})_{.,1:K}((\mP_{\mX\mX^\top})_{.,1:K})^\top|_F^2\\ =&|\mY|_F^2 - \Trp{\mY (\mP_{\mX\mX^\top})_{.,1:K}((\mP_{\mX\mX^\top})_{.,1:K})^\top\mY^\top}\\ =&|\mY|_F^2 - |\mY (\mP_{\mX\mX^\top})_{.,1:K}|_F^2, $$

$$ \min_{\mW,\mV}\sum_{n=1}^{N}\mathbb{E}_{\vx'_n \sim p_{\vx'_n|\vx_n}}\Tr\left(\mW^\top\mV^\top\mX'\mX'^\top\mV\mW\right)-2\Tr\left(\mW^\top\mV^\top\mX'\mX^\top\right)+\cst, $$

$$ &\Tr\left(\mX\mathbb{E}[\mX']^\top\mV(\mV^{\top}\mathbb{E}[\mX'\mX'^\top]\mV)^{-1}\mV^\top\mathbb{E}[\mX'\mX'^\top]\mV(\mV^{\top}\mathbb{E}[\mX'\mX'^\top]\mV)^{-1}\mV^\top\mathbb{E}[\mX']\mX^\top\right)\\ &-2\Tr\left(\mX\mathbb{E}[\mX']^\top\mV(\mV^{\top}\mathbb{E}[\mX'\mX'^\top]\mV)^{-1}\mV^\top\mathbb{E}[\mX']\mX^\top\right)+\cst\\ =&-\Tr\left(\mX\mathbb{E}[\mX']^\top\mV(\mV^{\top}\mathbb{E}[\mX'\mX'^\top]\mV)^{-1}\mV^\top\mathbb{E}[\mX']\mX^\top\right)+\cst\\ =&-\Tr\left(\mV^\top\mathbb{E}[\mX']\mX^\top\mX\mathbb{E}[\mX']^\top\mV(\mV^{\top}\mathbb{E}[\mX'\mX'^\top]\mV)^{-1}\right)+\cst, $$

$$ \mZ^*=\argmin_{\mZ} \mathbb{E}_{\mX' \sim p_{\mX'|\mX}}| \mZ^\top{\mV^*}^\top\mX' - \mY|_2^2, $$

$$ \mV = \mP_{\mX\mX^\top + \sigma \mI}\mD_{\mX\mX^\top + \sigma \mI}^{-\frac{1}{2}}(\mP_{\mH})_{.,1:K}=\mP_{\mX\mX^\top}(\mD_{\mX\mX^\top} + \sigma \mI)^{-\frac{1}{2}}(\mP_{\mH})_{.,1:K}, $$

$$ \mH =& \mD_{\mX\mX^\top + \sigma \mI}^{-\frac{1}{2}}\mP_{\mX\mX^\top + \sigma \mI}^\top(\mX\mX^\top\mX\mX^\top)\mP_{\mX\mX^\top + \sigma \mI}\mD_{\mX\mX^\top + \sigma \mI}^{-\frac{1}{2}}\\ =& \mD_{\mX\mX^\top + \sigma \mI}^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top(\mX\mX^\top\mX\mX^\top)\mP_{\mX\mX^\top}\mD_{\mX\mX^\top + \sigma \mI}^{-\frac{1}{2}}\\ =& (\mD_{\mX\mX^\top})^2(\mD_{\mX\mX^\top} + \sigma \mI)^{-1}, $$

$$ \mW=&({\mV^*}^\top\mX\mX^\top\mV^*)^{-1}{\mV^*}^\top\mX\mY^\top\\ \mW=&\left(((\mP_{\mH})_{.,1:K})^\top (\mD_{\mX\mX^\top} + \sigma \mI)^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mX\mX^\top\mP_{\mX\mX^\top}(\mD_{\mX\mX^\top} + \sigma \mI)^{-\frac{1}{2}}(\mP_{\mH})_{.,1:K}\right)^{-1}\\ &\times ((\mP_{\mH})_{.,1:K})^\top (\mD_{\mX\mX^\top} + \sigma \mI)^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mX\mY^\top\\ \mW=&\left(((\mP_{\mH})_{.,1:K})^\top (\mD_{\mX\mX^\top} + \sigma \mI)^{-1}\mD_{\mX\mX^\top}(\mP_{\mH})_{.,1:K}\right)^{-1}((\mP_{\mH})_{.,1:K})^\top (\mD_{\mX\mX^\top} + \sigma \mI)^{-\frac{1}{2}}\mD_{\mX\mX^\top}^{\frac{1}{2}}\mV_{\mX}^\top\mY^\top\\ \mW=&((\mP_{\mH})_{.,1:K})^\top (\mD_{\mX\mX^\top} + \sigma \mI)\mD_{\mX\mX^\top}^{-1}(\mP_{\mH})_{.,1:K}((\mP_{\mH})_{.,1:K})^\top (\mD_{\mX\mX^\top} + \sigma \mI)^{-\frac{1}{2}}\mD_{\mX\mX^\top}^{\frac{1}{2}}\mV_{\mX}^\top\mY^\top\\ \mW=&((\mP_{\mH})_{.,1:K})^\top \left((\mD_{\mX\mX^\top} + \sigma \mI)^{\frac{1}{2}}\mD_{\mX\mX^\top}^{-\frac{1}{2}}\right)_{s}\mV_{\mX}^\top\mY^\top, $$

$$ \mW^\top\mV^\top =& \mY \mV_{\mX} \left((\mD_{\mX\mX^\top} + \sigma \mI)^{\frac{1}{2}}\mD_{\mX\mX^\top}^{-\frac{1}{2}}\right)_{s}(\mP_{\mH})_{.,1:K}((\mP_{\mH})_{.,1:K})^\top (\mD_{\mX\mX^\top} + \sigma \mI)^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top\\ =& \mY \mV_{\mX} (\mD_{\mX\mX^\top}^{-\frac{1}{2}})_{s}\mP_{\mX\mX^\top}^\top $$

$$ &\min_{\mZ \in \rmat{K}{N},\mW \in \rmat{C}{K},\mV \in \rmat{D}{K}} | \mW\mZ-\mY |_F^2 + \lambda | \mV\mZ-\mX|_F^2\\ =&\min_{\mZ \in \rmat{K}{N}} | \mY\mZ^\top(\mZ\mZ^\top)^{-1}\mZ-\mY |_F^2 + \lambda | \mX\mZ^\top(\mZ\mZ^\top)^{-1}\mZ-\mX|_F^2\\ =&\min_{\mZ \in \rmat{K}{N}} \Trp{\mZ^\top(\mZ\mZ^\top)^{-1}\mZ\mY^\top\mY\mZ^\top(\mZ\mZ^\top)^{-1}\mZ}-2\Trp{\mY^\top\mY\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} + |\mY|_F^2\\ &+\lambda\Trp{\mZ^\top(\mZ\mZ^\top)^{-1}\mZ\mX^\top\mX\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} -2\lambda\Trp{\mX^\top\mX\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} + \lambda|\mX|_F^2\\ =&\min_{\mZ \in \rmat{K}{N}} \Trp{(\mZ\mZ^\top)^{-1}\mZ\mY^\top\mY\mZ^\top}-2\Trp{\mY^\top\mY\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} + |\mY|_F^2\\ &+\lambda\Trp{(\mZ\mZ^\top)^{-1}\mZ\mX^\top\mX\mZ^\top} -2\lambda\Trp{\mX^\top\mX\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} + \lambda|\mX|_F^2\\ =&\min_{\mZ \in \rmat{K}{N}} -\Trp{\mY^\top\mY\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} + |\mY|_F^2 -\lambda\Trp{\mX^\top\mX\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} + \lambda|\mX|_F^2\\ =&\min_{\mZ \in \rmat{K}{N}} -\Trp{\left(\mY^\top\mY+\lambda\mX^\top\mX\right)\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} + |\mY|_F^2 + \lambda|\mX|_F^2\\ =&\min_{\mZ \in \rmat{K}{N}: \mZ\mZ^\top = \mI} -\Trp{\left(\mY^\top\mY+\lambda\mX^\top\mX\right)\mZ^\top\mZ} + |\mY|_F^2 + \lambda|\mX|_F^2 $$

Theorem. The loss function from eq:bilinear is minimized for
$$ \mV^* &\text{ spans } \mP_{\mX\mX^\top}\mD_{\mX\mX^\top}^{-\frac{1}{2}}(\mP_{\mH})_{.,1:K},\\ \mW^* =& \left({\mV^*}^\top\mX\mX^\top\mV^*\right)^{-1}{\mV^*}^\top\mX\mY^\top,\\ \mZ^* =& \left({\mV^*}^\top\mX\mX^\top\mV^*\right)^{-1}{\mV^*}^\top\mX\mX^\top, $$
where $\mH \triangleq \mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mA\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}$. (Proof in proof:linear_solution, empirical validation in fig:validation_general.)

Corollary. Under the settings of thm:DAE, additive Gaussian noise has no impact on the supervised task performance, as ${\mW^*}^\top{\mV^*(\sigma)}^\top={\mW^*}^\top{\mV^*(0)}^\top,\forall \sigma \geq 0$, regardless of the supervised task. (Proof in proof:gaussian_noise.)

Proof. The first part of the proof finds the optima $\mW^*$ and $\mZ^*$ as a function of $\mV$, which is direct since we are in a least-squares setting for each of them. The second part shows that the optimal $\mV$ can be found as the solution of a generalized eigenvalue problem. The third and final step expresses the solution for $\mV$ in a closed form that is also friendly for computations.

{\bf Step 1.}~Recall that our loss function is given by
$$ \mathcal{L} = | \mW^\top\mV^\top\mX-\mY|_F^2 + \lambda | \mZ^\top\mV^\top\mX-\mX|_F^2. $$
Recalling that $|\mM|_F^2=\Tr(\mM^\top\mM)$, the above simplifies to
$$ \mathcal{L} = |\mY|_F^2- 2 \Trp{\mX^\top\mV\mW\mY}+\Trp{\mX^\top\mV\mW\mW^\top\mV^\top\mX} + \lambda |\mX|_F^2 -2\lambda\Trp{\mX^\top\mV\mZ\mX} + \lambda\Trp{\mX^\top\mV\mZ\mZ^\top\mV^\top\mX}. $$
We now find the optimal $\mW$ and $\mZ$, which are unique by convexity of the loss and of their domains. Recall that we assume $\mY$ and $\mX$ to be full-rank (therefore also making $\mV$ full-rank). Using the derivatives of traces, we obtain
$$ \nabla_{\mW}\mathcal{L} = -2\mV^\top\mX\mY^\top + 2 \mV^\top\mX\mX^\top\mV\mW,\qquad \nabla_{\mZ}\mathcal{L} = -2\lambda\mV^\top\mX\mX^\top + 2\lambda \mV^\top\mX\mX^\top\mV\mZ. $$
Setting these to zero (we assume $\lambda>0$, otherwise we cannot solve for $\mZ$ since its value does not impact the loss) and solving leads to
$$ \mW^* = (\mV^\top\mX\mX^\top\mV)^{-1}\mV^\top\mX\mY^\top,\qquad \mZ^* = (\mV^\top\mX\mX^\top\mV)^{-1}\mV^\top\mX\mX^\top. $$
We have now solved for $\mW,\mZ$ as a function of $\mV$, i.e., the loss is only a function of $\mV$, which we solve for next.

{\bf Step 2.}~We plug the values of $\mW^*,\mZ^*$ back into the loss. Let's first simplify the derivations by noticing that
$$ \Trp{\mX^\top\mV\mW^*{\mW^*}^\top\mV^\top\mX}&=\Trp{\mX^\top\mV(\mV^\top\mX\mX^\top\mV)^{-1}\mV^\top\mX\mY^\top\mY\mX^\top\mV(\mV^\top\mX\mX^\top\mV)^{-1}\mV^\top\mX}\\ &=\Trp{\mX^\top\mV(\mV^\top\mX\mX^\top\mV)^{-1}\mV^\top\mX\mY^\top\mY}\\ &=\Trp{\mX^\top\mV\mW^*\mY}, $$
and similarly
$$ \Trp{\mX^\top\mV\mZ^*{\mZ^*}^\top\mV^\top\mX}=\Trp{\mX^\top\mV\mZ^*\mX}, $$
finally making the entire loss simplify to
$$ \mathcal{L} = |\mY|_F^2+\lambda |\mX|_F^2 - \Trp{\mX^\top\mV (\mV^\top\mX\mX^\top\mV)^{-1}\mV^\top\mX\left(\mY^\top\mY + \lambda \mX^\top\mX\right)}. $$
First, notice that both $\mX\left(\mY^\top\mY + \lambda \mX^\top\mX\right)\mX^\top$ and $\mX\mX^\top$ are symmetric. Therefore, we can minimize the loss by solving the following generalized eigenvalue problem:
$$ \text{find } \mV \in \rmat{D}{D} \text{ so that } \mX\left(\mY^\top\mY + \lambda \mX^\top\mX\right)\mX^\top\mV = \mX\mX^\top \mV \Lambda, $$
where $\mV$ are the eigenvectors of the generalized eigenvalue problem and $\Lambda$ the eigenvalues; the solution to our problem will be any rotation of any $K$ eigenvectors, but the minimum will be achieved for the top-$K$ ones.

{\bf Step 3.}~We first derive the general solution of the generalized eigenvalue problem; given that solution, it is easy to take the top-$K$ eigenvectors that solve the considered problem. Denoting $\mA \triangleq \mX\left(\mY^\top\mY + \lambda \mX^\top\mX\right)\mX^\top$, $\mB \triangleq \mX\mX^\top$, and $\mH\triangleq \mD_{\mB}^{-\frac{1}{2}}\mP_{\mB}^\top\mA\mP_{\mB} \mD_{\mB}^{-\frac{1}{2}}$, we have
$$ \mA\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH}&=\mB\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH} \Lambda\\ \iff \mP_{\mH}^\top\mD_{\mB}^{-\frac{1}{2}}\mP_{\mB}^\top\mA\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH}&=\mP_{\mH}^\top\mD_{\mB}^{-\frac{1}{2}}\mP_{\mB}^\top\mB\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH} \Lambda &&(\mP_{\mH}^\top\mD_{\mB}^{-\frac{1}{2}}\mP_{\mB}^\top \text{ bijective})\\ \iff \mP_{\mH}^\top\mD_{\mB}^{-\frac{1}{2}}\mP_{\mB}^\top\mA\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH}&=\Lambda\\ \iff \mD_{\mH}&=\Lambda, $$
therefore the eigenvalues are given by $\mD_{\mH}$ and the eigenvectors by $\mP_{\mB}\mD_{\mB}^{-\frac{1}{2}}\mP_{\mH}$, the top-$K$ ones being $(\mP_{\mX\mX^\top}\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP_{\mH})_{.,1:K}$. So the optimal $\mV$ is any rotation of the top-$K$ eigenvectors. The above is simple to use as-is whenever $N > D$; if not, we can obtain the solution without computing any $D \times D$ matrix, making the process more efficient. To that end, we can use
$$ \mM\triangleq \mY^\top \mY + \lambda \mX^\top \mX,\qquad \mS \triangleq\mD_{\mM}^{\frac{1}{2}}\mP_{\mM}^\top\mX^\top,\qquad \mP_{\mA} = \mV_{\mS},\; \mD_{\mA}=\mSigma^2_{\mS}, $$
which only involves $D \times \min(D,N)$ matrices instead of $D \times D$, concluding the proof.

Proof. We start with the fully supervised (least-squares) case obtained when $\lambda = 0$. Also notice that in any case we have
$$ \mX = \mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{\frac{1}{2}}\mV_{\mX}^\top. $$
{\bf Ordinary Least Squares recovery.}~Since $\lambda=0$ we also have
$$ \mA =\mX\left(\mY^\top\mY + \lambda \mX^\top\mX\right)\mX^\top=\mX\mY^\top\mY\mX^\top, $$
which then leads to
$$ \mH =& \mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mA\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}\\ =& \mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP^\top_{\mX\mX^\top}\mX\mY^\top\mY\mX^\top\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}\\ =& \mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP^\top_{\mX\mX^\top}\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{\frac{1}{2}}\mV_{\mX}^\top\mY^\top\mY\mV_{\mX}\mD_{\mX\mX^\top}^{\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}\\ =& \mV_{\mX}^\top\mY^\top\mY\mV_{\mX}. $$
From the above, we can plug those values into the analytical form for $\mV^*$ from eq:V to obtain
$$ \mV=\mP_{\mX\mX^\top}\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP_{\mH}=\mP_{\mX\mX^\top}\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mV_{\mX}^\top\mV_{\mY}=\mX^{\dagger}(\mV_{\mY})_{.,1:K}, $$
since we easily see that $\mP_{\mH}=\mV_{\mX}^\top\mV_{\mY}$. We also have the optimum for $\mW$ from eq:W:
$$ \mW=&({\mV^*}^\top\mX\mX^\top\mV^*)^{-1}{\mV^*}^\top\mX\mY^\top\\ =&(\mV_{\mY}^\top\mV_{\mY})^{-1}\mV_{\mY}^\top\mV_{\mX}\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mX\mY^\top\\ =&\mV_{\mY}^\top\mV_{\mX}\mV_{\mX}^{\top}\mY^\top\\ =&\mSigma_{\mY}\mU_{\mY}^\top, $$
and finally the product of both matrices (which produces the supervised linear model) is obtained as
$$ \mW^\top\mV^\top = \mU_{\mY}\mSigma_{\mY}({\mV_{\mY}}_{.,1:K})^\top(\mX^{\dagger})^\top=\mY\mX^\top(\mX\mX^\top)^{-1},\quad K\geq C, $$
therefore recovering the optimal OLS solution.
Note that if $K<C$ then we have a bottleneck and therefore obtain an interesting alternative solution that looks at the top subspace of $\mY$ (this is, however, never the case in OLS settings). {\bf Principal Component Analysis recovery.}~We now consider the case where only the unsupervised loss is employed (akin to $\lambda \to \infty$). In this setting we get
$$ \mA=\mX\mX^\top\mX\mX^\top, $$
and we directly obtain
$$ \mH=&\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mA\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}\\ =&\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP^\top_{\mX\mX^\top}\mX\mX^\top\mX\mX^\top\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}\\ =&\mD_{\mX\mX^\top}, $$
therefore the optimal form for $\mV$ is
$$ \mV=\mP_{\mX\mX^\top}\mD_{\mX\mX^\top}^{-\frac{1}{2}}(\mP_{\mH})_{.,1:K}=\mP_{\mX\mX^\top}(\mD_{\mX\mX^\top}^{-\frac{1}{2}})_{.,1:K}, $$
which selects the top-$K$ subspace of $\mX$ (recall that the eigenvalues of $\mH$ are $\mD_{\mX\mX^\top}$, so its top-$K$ eigenvectors select the top-$K$ dimensions of the subspace). Then the solution for $\mZ$ from eq:Z gives
$$ \mZ =& ({\mV^*}^\top\mX\mX^\top\mV^*)^{-1}{\mV^*}^\top\mX\mX^\top\\ =&{\mV^*}^\top\mX\mX^\top\\ =&((\mD_{\mX\mX^\top}^{\frac{1}{2}})_{.,1:K})^\top\mP_{\mX\mX^\top}^\top, $$
and lastly the product of $\mZ$ and $\mV$ (which produces the final linear transformation processing $\mX$) takes the form
$$ \mZ^\top\mV^\top=\mP_{\mX\mX^\top}(\mD_{\mX\mX^\top}^{\frac{1}{2}})_{.,1:K}((\mD_{\mX\mX^\top}^{-\frac{1}{2}})_{.,1:K})^\top\mP_{\mX\mX^\top}^\top=(\mP_{\mX\mX^\top})_{.,1:K}((\mP_{\mX\mX^\top})_{.,1:K})^\top, $$
which is the projection matrix onto the top-$K$ subspace of the data $\mX$, i.e., recovering the optimal solution of Principal Component Analysis.

Proof. Recall that in the $\lambda \to \infty$ regime, we have $\mV^*=\mP_{\mX\mX^\top}(\mD_{\mX\mX^\top}^{-\frac{1}{2}})_{.,1:K}$ and $\mW^*=\mP_{\mX\mX^\top}^\top\mY^\top$. We thus develop
$$ | {\mW^*}^\top{\mV^*}^\top\mX-\mY|_F^2=&|\mY\mP_{\mX\mX^\top}((\mD_{\mX\mX^\top}^{-\frac{1}{2}})_{.,1:K})^{\top} \mP_{\mX\mX^\top}^\top \mX - \mY |_F^2\\ =&|\mY\mP_{\mX\mX^\top}((\mD_{\mX\mX^\top}^{-\frac{1}{2}})_{.,1:K})^{\top} \mP_{\mX\mX^\top}^\top \mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{\frac{1}{2}}\mP_{\mX\mX^\top}^\top - \mY |_F^2\\ =&| \mY (\mP_{\mX\mX^\top})_{.,1:K}((\mP_{\mX\mX^\top})_{.,1:K})^\top-\mY|_F^2\\ =&|\mY|_F^2 - 2 \Trp{\mY (\mP_{\mX\mX^\top})_{.,1:K}((\mP_{\mX\mX^\top})_{.,1:K})^\top\mY^\top} + | \mY (\mP_{\mX\mX^\top})_{.,1:K}((\mP_{\mX\mX^\top})_{.,1:K})^\top|_F^2\\ =&|\mY|_F^2 - \Trp{\mY (\mP_{\mX\mX^\top})_{.,1:K}((\mP_{\mX\mX^\top})_{.,1:K})^\top\mY^\top}\\ =&|\mY|_F^2 - |\mY (\mP_{\mX\mX^\top})_{.,1:K}|_F^2. $$
As $|\mY|_F^2$ is a constant with respect to the parameters, we consider $|\mY (\mP_{\mX\mX^\top})_{.,1:K}|_F^2$ as our alignment measure (the greater it is, the better the supervised loss can be minimized from the parameters). Since this quantity lives in the range $[0,|\mY \mP_{\mX\mX^\top}|_F^2]$, the reparametrization from eq:alignment rescales the proposed measure of alignment to $[0,1]$.

Proof. The first step of the proof is to rewrite the joint classification and reconstruction objective with an arbitrary encoder network $f_{\theta}$,
$$ \min_{\theta \in \mathbb{R}^{P},\mW \in \rmat{C}{K},\mV \in \rmat{D}{K}}| \mW f_{\theta}(\mX)-\mY |_F^2 + \lambda | \mV f_{\theta}(\mX)-\mX|_F^2, $$
as the nonparametric version
$$ \min_{\mZ \in \rmat{K}{N},\mW \in \rmat{C}{K},\mV \in \rmat{D}{K}} | \mW\mZ-\mY |_F^2 + \lambda | \mV\mZ-\mX|_F^2, $$
both being identical if we assume that the encoder is powerful enough to reach any representation, which is a realistic assumption given current architectures. Given that nonparametric objective, we can now solve for both the optimal decoder weights and the optimal representation $\mZ$ as follows:
$$ &\min_{\mZ \in \rmat{K}{N},\mW \in \rmat{C}{K},\mV \in \rmat{D}{K}} | \mW\mZ-\mY |_F^2 + \lambda | \mV\mZ-\mX|_F^2\\ =&\min_{\mZ \in \rmat{K}{N}} | \mY\mZ^\top(\mZ\mZ^\top)^{-1}\mZ-\mY |_F^2 + \lambda | \mX\mZ^\top(\mZ\mZ^\top)^{-1}\mZ-\mX|_F^2\\ =&\min_{\mZ \in \rmat{K}{N}} -\Trp{\mY^\top\mY\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} + |\mY|_F^2 -\lambda\Trp{\mX^\top\mX\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} + \lambda|\mX|_F^2\\ =&\min_{\mZ \in \rmat{K}{N}} -\Trp{\left(\mY^\top\mY+\lambda\mX^\top\mX\right)\mZ^\top(\mZ\mZ^\top)^{-1}\mZ} + |\mY|_F^2 + \lambda|\mX|_F^2\\ =&\min_{\mZ \in \rmat{K}{N}: \mZ\mZ^\top = \mI} -\Trp{\left(\mY^\top\mY+\lambda\mX^\top\mX\right)\mZ^\top\mZ} + |\mY|_F^2 + \lambda|\mX|_F^2, $$
which is solved by $\mZ$ being any orthogonal matrix in the subspace of the top-$K$ eigenvectors of $\left(\mY^\top\mY+\lambda\mX^\top\mX\right)$. Now, as $\lambda \to \infty$, the encoder becomes more and more linear, ultimately converging to $f_{\theta}(\vx) = \mU\vx$ with $\mU\in \spn \{\eigvec(\mX^\top\mX)_1,\dots,\eigvec(\mX^\top\mX)_K\}$.
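The endpoint of this argument is directly computable: the optimal nonparametric representation Z is any orthonormal basis of the top-K eigenvectors of Y^T Y + λ X^T X, and the attained loss equals ||Y||_F^2 + λ||X||_F^2 minus the sum of the top-K eigenvalues. A small numerical check on synthetic data (variable names and sizes are our own):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, C, K, lam = 5, 30, 3, 2, 0.5
X = rng.normal(size=(D, N))
Y = rng.normal(size=(C, N))

# Top-K eigenvectors of M = Y^T Y + lam X^T X as orthonormal rows of Z.
M = Y.T @ Y + lam * X.T @ X
evals, U = np.linalg.eigh(M)          # ascending eigenvalues
Z = U[:, ::-1][:, :K].T               # (K, N), satisfies Z Z^T = I

# Optimal linear readouts for this Z (least squares with Z Z^T = I).
W = Y @ Z.T
V = X @ Z.T
loss = (np.linalg.norm(W @ Z - Y) ** 2
        + lam * np.linalg.norm(V @ Z - X) ** 2)
closed_form = (np.linalg.norm(Y) ** 2 + lam * np.linalg.norm(X) ** 2
               - evals[::-1][:K].sum())
```

The two quantities agree, and any other K-dimensional representation with its own optimal readouts attains a loss at least as large.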

$$ \mW=&({\mV^*}^\top\mX\mX^\top\mV^*)^{-1}{\mV^*}^\top\mX\mY^\top\\ =&(\mV_{\mY}^\top\mV_{\mY})^{-1}\mV_{\mY}^\top\mV_{\mX}\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mX\mY^\top\\ =&\mV_{\mY}^\top\mV_{\mX}\mV_{\mX}^{\top}\mY^\top\\ =&\mSigma_{\mY}\mU_{\mY}^\top $$

$$ \mH=&\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP_{\mX\mX^\top}^\top\mA\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}\\ =&\mD_{\mX\mX^\top}^{-\frac{1}{2}}\mP^\top_{\mX\mX^\top}\mX\mX^\top\mX\mX^\top\mP_{\mX\mX^\top} \mD_{\mX\mX^\top}^{-\frac{1}{2}}\\ =&\mD_{\mX\mX^\top}, $$

$$ \argmax_{\vv} \frac{\vv^\top \mathbb{E}[\mX']\mX^\top\mX\mathbb{E}[\mX']^\top \vv}{\vv^\top \mathbb{E}[\mX'\mX'^\top] \vv}. $$


References

[Bengio+chapter2007] Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. In Large Scale Kernel Machines. MIT Press, 2007.

[srivastava2014dropout] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[Hinton06] Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[huang2023contrastive] Huang, Z., Jin, X., Lu, C., Hou, Q., Cheng, M.-M., Fu, D., Shen, X., and Feng, J. Contrastive masked autoencoders are stronger vision learners. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

[kessy2018optimal] Kessy, A., Lewin, A., and Strimmer, K. Optimal whitening and decorrelation. The American Statistician, 72(4):309–314, 2018.

[shen1996eigenvalue] Shen, Z. Eigenvalue asymptotics and exponential decay of eigenfunctions for Schrödinger operators. Transactions of the American Mathematical Society, 1996.

[goodfellow2016deep] Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

[bib1] Baidoo-Anu, D. and Ansah, L. O. Education in the era of generative artificial intelligence (ai): Understanding the potential benefits of chatgpt in promoting teaching and learning. Journal of AI, 7(1):52–62, 2023.

[bib2] Balestriero et al. (2022) Balestriero, R., Misra, I., and LeCun, Y. A data-augmentation is worth a thousand samples: Analytical moments and sampling-free training. Advances in Neural Information Processing Systems, 35:19631–19644, 2022.

[bib3] Balestriero et al. (2023) Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A., Shekhar, S., Goldstein, T., Bordes, F., Bardes, A., Mialon, G., Tian, Y., et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.

[bib4] Ballé et al. (2016) Ballé, J., Laparra, V., and Simoncelli, E. P. End-to-end optimization of nonlinear transform codes for perceptual quality. In 2016 Picture Coding Symposium (PCS), pp. 1–5. IEEE, 2016.

[bib5] Bao et al. (2017) Bao, J., Chen, D., Wen, F., Li, H., and Hua, G. Cvae-gan: fine-grained image generation through asymmetric training. In Proceedings of the IEEE international conference on computer vision, pp. 2745–2754, 2017.

[bib6] Bardes et al. (2021) Bardes, A., Ponce, J., and LeCun, Y. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.

[bib7] Barlow, H. B. Unsupervised learning. Neural computation, 1(3):295–311, 1989.

[bib8] Benzi, M. Preconditioning techniques for large linear systems: a survey. Journal of computational Physics, 182(2):418–477, 2002.

[bib9] Booth, T. E. Power iteration method for the several largest eigenvalues and eigenfunctions. Nuclear science and engineering, 154(1):48–62, 2006.

[bib10] Bordes et al. (2021) Bordes, F., Balestriero, R., and Vincent, P. High fidelity visualization of what your self-supervised representation knows about. arXiv preprint arXiv:2112.09164, 2021.

[bib11] Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade: Second Edition, pp. 421–436. Springer, 2012.

[bib12] Brynjolfsson et al. (2023) Brynjolfsson, E., Li, D., and Raymond, L. R. Generative ai at work. Technical report, National Bureau of Economic Research, 2023.

[bib13] Chakrabarty, P. and Maji, S. The spectral bias of the deep image prior. arXiv preprint arXiv:1912.08905, 2019.

[bib14] Chen et al. (2020) Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G. E. Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems, 33:22243–22255, 2020.

[bib15] Chen et al. (2018) Chen, X., Mishra, N., Rohaninejad, M., and Abbeel, P. Pixelsnail: An improved autoregressive generative model. In International Conference on Machine Learning, pp. 864–872. PMLR, 2018.

[bib16] Dilokthanakul et al. (2016) Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.

[bib17] Esmaeili et al. (2019) Esmaeili, B., Wu, H., Jain, S., Bozkurt, A., Siddharth, N., Paige, B., Brooks, D. H., Dy, J., and Meent, J.-W. Structured disentangled representations. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2525–2534. PMLR, 2019.

[bib18] Garrido et al. (2023) Garrido, Q., Balestriero, R., Najman, L., and Lecun, Y. Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank. In International Conference on Machine Learning, pp. 10929–10974. PMLR, 2023.

[bib19] Ghahramani, Z. Unsupervised learning. In Summer school on machine learning, pp. 72–112. Springer, 2003.

[bib20] Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 297–304. JMLR Workshop and Conference Proceedings, 2010.

[bib21] He et al. (2022) He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16000–16009, 2022.

[bib22] Hyvärinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.

[bib23] Jiang et al. (2016) Jiang, Z., Zheng, Y., Tan, H., Tang, B., and Zhou, H. Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:1611.05148, 2016.

[bib24] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[bib25] Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[bib26] Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.

[bib27] Kulis et al. (2013) Kulis, B. et al. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013.

[bib28] LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436–444, 2015.

[bib29] Lim et al. (2020) Lim, K.-L., Jiang, X., and Yi, C. Deep clustering with variational autoencoder. IEEE Signal Processing Letters, 27:231–235, 2020.

[bib30] Mahendran, A. and Vedaldi, A. Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188–5196, 2015.

[bib31] Mathieu et al. (2019) Mathieu, E., Rainforth, T., Siddharth, N., and Teh, Y. W. Disentangling disentanglement in variational autoencoders. In International conference on machine learning, pp. 4402–4412. PMLR, 2019.

[bib32] Olah et al. (2017) Olah, C., Mordvintsev, A., and Schubert, L. Feature visualization. Distill, 2(11):e7, 2017.

[bib33] Rahaman et al. (2019) Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., and Courville, A. On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310. PMLR, 2019.

[bib34] Ruderman, D. L. Origins of scaling in natural images. Vision research, 37(23):3385–3398, 1997.

[bib35] Schwarz et al. (2021) Schwarz, K., Liao, Y., and Geiger, A. On the frequency bias of generative models. Advances in Neural Information Processing Systems, 34:18126–18136, 2021.

[bib36] Selvaraju et al. (2016) Selvaraju, R. R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., and Batra, D. Grad-cam: Why did you say that? arXiv preprint arXiv:1611.07450, 2016.

[bib37] Seydoux et al. (2020) Seydoux, L., Balestriero, R., Poli, P., Hoop, M. d., Campillo, M., and Baraniuk, R. Clustering earthquake signals and background noises in continuous seismic data with unsupervised deep learning. Nature communications, 11(1):3972, 2020.

[bib38] Shen et al. (2020) Shen, Y., Gu, J., Tang, X., and Zhou, B. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9243–9252, 2020.

[bib39] Tran et al. (2017) Tran, L., Yin, X., and Liu, X. Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1415–1424, 2017.

[bib40] Van den Oord et al. (2016) Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29, 2016.

[bib41] Van der Schaaf & van Hateren (1996) Van der Schaaf, v. A. and van Hateren, J. v. Modelling the power spectra of natural images: statistics and information. Vision research, 36(17):2759–2770, 1996.

[bib42] Vincent et al. (2010) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A., and Bottou, L. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.

[bib43] Wang et al. (2004) Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

[bib44] Wold et al. (1987) Wold, S., Esbensen, K., and Geladi, P. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–52, 1987.

[bib45] Xu et al. (2018) Xu, P., He, B., De Sa, C., Mitliagkas, I., and Re, C. Accelerated stochastic power iteration. In International Conference on Artificial Intelligence and Statistics, pp. 58–67. PMLR, 2018.

[bib46] Zbontar et al. (2021) Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pp. 12310–12320. PMLR, 2021.

[bib47] Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pp. 818–833. Springer, 2014.