Implicit Rank-Minimizing Autoencoder
Li Jing (ljng@fb.com), Jure Zbontar, Yann LeCun (yann@fb.com)
Facebook AI Research, New York
Abstract
An important component of autoencoders is the method by which the information capacity of the latent representation is minimized or limited. In this work, the rank of the covariance matrix of the codes is implicitly minimized by relying on the fact that gradient descent learning in multi-layer linear networks leads to minimum-rank solutions. By inserting a number of extra linear layers between the encoder and the decoder, the system spontaneously learns representations with a low effective dimension. The model, dubbed Implicit Rank-Minimizing Autoencoder (IRMAE), is simple, deterministic, and learns compact latent spaces. We demonstrate the validity of the method on several image generation and representation learning tasks. Code available at https://github.com/jingli9111/irmae
Introduction
Optimizing a linear multi-layer neural network through gradient descent leads to a low-rank solution. This phenomenon is known as implicit regularization and has been extensively studied in the context of matrix factorization [9, 1, 21], linear regression [24, 6], logistic regression [25], and linear convolutional neural networks [8]. The main goal of these prior works was to understand the generalization ability of deep neural networks. By contrast, the goal of the present work is to design an architecture that takes advantage of this phenomenon to improve the quality of learned representations.
Learning good representations remains a core issue in AI [2]. Representations learned in a self-supervised (or unsupervised) manner can be used for downstream tasks such as generation and classification. Autoencoders (AEs) are a popular class of methods for learning representations without requiring labeled data. The internal representation of an AE must have a limited information capacity to prevent the AE from learning a trivial identity function. Variants of AEs differ in how they impose this limitation. Bottleneck AEs (sometimes called "Diabolo networks") simply use low-dimensional codes [23]; noisy AEs, such as variational AEs, add noise to the codes while limiting the variance of their distribution [4, 14]; quantizing AEs (such as VQ-VAE) quantize the codes into discrete clusters [27]; sparse AEs impose a sparsity penalty on the code [19, 20]; contracting and saturating AEs minimize the curvature of the network function in directions outside the data manifold [22, 7]; and denoising AEs are trained to produce large reconstruction error for corrupted samples [28].
In this work, we propose a new method to implicitly minimize the rank/dimensionality of the latent code of an autoencoder. We call this model Implicit Rank-Minimizing Autoencoder (IRMAE). The method consists of inserting extra linear layers between the encoder and the decoder of a standard autoencoder. This additional linear network is trained jointly with the rest of the autoencoder through classical backpropagation. As a result, the system spontaneously learns representations with a low effective dimensionality. Like other regularization methods, this extra linear network does not appear at inference time, because the linear matrices collapse into one. Thus, the encoder and decoder architectures of the model are identical to those of the original model. In practice, we fold the collapsed linear matrices into the last layer of the encoder at inference time.
We empirically demonstrate IRMAE's regularization behavior on a synthetic dataset and show that it learns a good representation with a much smaller latent dimension. We then demonstrate that our method achieves superior representation learning performance compared to a standard deterministic autoencoder, and performance comparable to a variational autoencoder, on the MNIST and CelebA datasets through a variety of generative tasks, including interpolation, sample generation from noise, and PCA interpolation in low dimension, as well as a downstream classification task. We also conduct an ablation study to verify that the advantage of implicit regularization comes from gradient descent learning dynamics.
We summarize our contributions as follows:
· We propose a method of inserting extra linear layers in deep neural networks for rank regularization;
· We propose a simple, deterministic rank-minimizing autoencoder that learns low-dimensional representations;
· We demonstrate superior performance of our method compared to a standard deterministic autoencoder and a variational autoencoder on a variety of generative and downstream classification tasks.
Related Work
The implicit regularization provided by gradient descent optimization is widely believed to be one of the keys to the generalization ability of deep neural networks. Many works focus on linear cases to study this behavior empirically and theoretically. Soudry et al. [25] show that implicit bias helps to learn logistic regression. Saxe et al. [24] study a 2-layer linear regression and theoretically demonstrate that continuous gradient descent can lead to a low-rank solution. Gidel et al. [6] extend this theory to the discrete case for linear regression problems. In the field of matrix factorization, Gunasekar et al. [9] theoretically prove that gradient descent can lead to a minimum-nuclear-norm solution. Arora et al. [1] extend this concept to deep linear networks by theoretically and empirically demonstrating that a deep linear network can lead to low-rank solutions. Gunasekar et al. [8] prove that gradient descent has a regularization effect in linear convolutional networks. All these works try to understand why gradient descent helps generalization in existing approaches. In contrast, we take advantage of this phenomenon to develop better algorithms. Moreover, current studies of implicit regularization require a small gradient and a vanishing initialization, while our method is more general: it can be used with more complicated optimizers such as Adam [13] and can be combined with more complicated components.
Autoencoders are popular for representation learning. It is important to limit the latent capacity, as the data are embedded in a lower-dimensional space. A large family of autoencoders is based on variational autoencoders [14], such as beta-VAE [12]. These methods tend to generate blurry images due to their intrinsic probabilistic nature. On the other hand, a naive deterministic autoencoder is considered a failure in generative tasks and has 'holes' in its latent space, due to the absence of an explicit constraint on the latent distribution. Many methods based on deterministic autoencoders have been proposed to solve this problem, such as RAE [5], WAE [26], and VQ-VAE [27].
Implicit Rank-Minimizing Autoencoder
We denote by $\mathcal{E}(\cdot)$ and $\mathcal{D}(\cdot)$ the encoder and decoder of a deterministic autoencoder, respectively. The latent variable $z \in \mathbb{R}^{d}$ is determined by $\mathcal{E}(y)$. Encoder and decoder are classically trained by jointly minimizing the $L_2$ reconstruction loss $L_{AE} = \| y - \mathcal{D}(\mathcal{E}(y)) \|_2^2$. Without any constraint on the latent space, a simple deterministic autoencoder will typically learn a non-Gaussian latent space with 'holes' and hence does not generate good samples.
The implicit rank-minimizing autoencoder adds extra linear matrices $W_1, W_2, \cdots, W_l$ between the encoder and the decoder, where the $W_i \in \mathbb{R}^{d \times d}$ are randomly initialized. The corresponding diagram is shown in Figure 1. All $W_i$ matrices are trained jointly with the encoder and the decoder. Hence, the reconstruction loss is represented as
$$L = \| y - \mathcal{D}(W_l W_{l-1} \cdots W_1 \mathcal{E}(y)) \|_2^2 .$$
During training, these matrices encourage the latent variables to use a lower number of dimensions and effectively minimize the rank of the covariance matrix of the latent space. Thus, one can amplify the regularization effect by adding more $W_i$ matrices between the encoder and the decoder. The method does not rely on a special initialization of the $W_i$, and it also works with optimizers such as Adam [13].
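As a minimal sketch of this objective, the NumPy snippet below evaluates the reconstruction loss with the extra linear matrices applied between encoder and decoder. The helper name `irmae_loss`, the toy linear encoder and decoder, and the initialization scale are illustrative assumptions and not the authors' implementation; in practice the encoder and decoder are the convolutional networks described in the appendix and the loss is minimized by backpropagation.

```python
import numpy as np

def irmae_loss(y, encode, decode, linear_layers):
    """Mean squared L2 reconstruction loss with extra linear layers (hypothetical helper)."""
    z = encode(y)                       # latent codes, shape (batch, d)
    for W in linear_layers:             # W_1 is applied first, W_l last
        z = z @ W.T
    y_hat = decode(z)
    return np.mean(np.sum((y - y_hat) ** 2, axis=1))

rng = np.random.default_rng(0)
n, d, l = 16, 4, 2                      # toy input dim, latent dim, number of extra layers
A = rng.normal(size=(d, n))             # toy linear encoder (assumption)
B = rng.normal(size=(n, d))             # toy linear decoder (assumption)
Ws = [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(l)]

y = rng.normal(size=(8, n))
loss = irmae_loss(y, lambda x: x @ A.T, lambda z: z @ B.T, Ws)
# With no extra layers the function reduces to the plain autoencoder loss L_AE.
loss_plain = irmae_loss(y, lambda x: x @ A.T, lambda z: z @ B.T, [])
```

Passing an empty list of matrices recovers the baseline loss, which makes it easy to compare the two models with the same encoder and decoder.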
During inference, all $W_i$ matrices can be 'absorbed' into the encoder: because matrix multiplication is associative, the linear matrices collapse into a single matrix that is folded into the encoder. Therefore, the decoder can be used directly for generative tasks, and the linearly modified encoder can be used directly for downstream tasks such as classification.
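The collapse at inference time can be checked numerically. The sketch below multiplies a stack of random matrices into one and folds the product into a linear layer; the matrix sizes and the stand-in last encoder layer `W_enc` are hypothetical and chosen only for illustration.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
d, l = 8, 4
Ws = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(l)]   # W_1 ... W_l

# Collapse W_l ... W_1 into one matrix (W_1 is applied first).
W_collapsed = reduce(lambda acc, W: W @ acc, Ws, np.eye(d))

z = rng.normal(size=(5, d))             # stand-in for encoder outputs E(y)
z_layered = z
for W in Ws:
    z_layered = z_layered @ W.T         # apply the layers one by one
z_folded = z @ W_collapsed.T            # single matrix product

# If the last encoder layer is linear (z = h @ W_enc.T), it can absorb the product.
W_enc = rng.normal(size=(d, 12))        # hypothetical last encoder layer
W_enc_folded = W_collapsed @ W_enc
```

After folding, the network has exactly as many layers and parameters as the original autoencoder.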
Experiment
In this section, we empirically evaluate the proposed IRMAE model. We first verify the regularization effect through a synthetic task. We then demonstrate that IRMAE generates higher quality images compared to a baseline AE. IRMAE shows comparable performance to VAE. Lastly, we demonstrate IRMAE's superior performance on downstream classification tasks.
Throughout all the experiments, we visualize the effective latent dimension by plotting normalized singular values. Each plot in Figures 2 and 4 depicts the singular values (sorted from large to small) of the covariance matrix of the latent variables $z$ corresponding to examples in the validation set. The plots are normalized by dividing each singular value by the largest singular value of the covariance matrix. Therefore, the dimension of the latent space can be interpreted as the number of nonzero singular values.
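The diagnostic described above can be sketched in a few lines of NumPy. The function names and the cutoff `tol` used to decide which normalized singular values count as nonzero are assumptions made for illustration; the text does not specify a threshold.

```python
import numpy as np

def normalized_singular_values(z):
    """Singular values of the covariance of codes z (n_samples, d), divided by the largest."""
    cov = np.cov(z, rowvar=False)
    s = np.linalg.svd(cov, compute_uv=False)    # returned sorted from large to small
    return s / s[0]

def latent_rank(z, tol=1e-3):
    """Count normalized singular values above tol (tol is an assumed cutoff)."""
    return int(np.sum(normalized_singular_values(z) > tol))

rng = np.random.default_rng(0)
# Toy codes: 32-dimensional vectors that lie in a 7-dimensional linear subspace.
z = rng.normal(size=(1000, 7)) @ rng.normal(size=(7, 32))
rank = latent_rank(z)
```

For real latent codes the small singular values are rarely exactly zero, so the reported rank depends on the chosen cutoff.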
Verification with Known Intrinsic Dimension
We verify the regularization behavior of IRMAE via a synthetic shape dataset. Each example is a 32x32 RGB image with a random-color, random-sized square or circle, located at a random position. Hence, the data has a known intrinsic dimensionality of 7 (3 for color, 2 for coordinate, 1 for size, 1 for shape).
The base architecture is a deterministic autoencoder. The architecture and experimental details can be found in the supplementary material. We use a latent dimension of 32. For IRMAE, we use $l = 2$ and $l = 4$ extra matrices between the encoder and the decoder. We test our method against no regularization, L1 regularization, and L2 regularization on the hidden code with the same architecture. We show the learned latent space in Figure 2. The baseline model, L1 regularization, L2 regularization, and IRMAE with $l = 2$ all yield excellent reconstructions on the validation set.
This result shows that IRMAE with $l = 2$ is able to learn a good latent representation with a rank close to the intrinsic dimension, while L1 and L2 regularization tend to use a much larger latent space.


Image Generation
Generating high-quality images by sampling the latent space is one of the key indicators of a good representation. In order to provide a comparison with standard deterministic autoencoders and variational autoencoders [14], we train our model on the MNIST dataset [15] and the CelebA dataset [16]. We set the latent dimension to 128 for MNIST and 512 for CelebA, and use $l = 8$ and $l = 4$ extra linear matrices for regularization in IRMAE, respectively. Further experimental details can be found in the supplementary material. We evaluate our model on a variety of representation learning tasks: interpolation between data points, sample generation from random noise, a downstream classification task, and PCA interpolation in latent space. We also quantitatively evaluate sample generation using the FID score. Each model uses the same architecture, except that the VAE code is twice as large to include the means and variances. On all these tasks, our method demonstrates performance comparable to the VAE.
Latent Dimension: We show the latent dimensionality reduction of our method in Figure 4. IRMAE utilizes a significantly lower-dimensional latent space compared to the baseline autoencoder. Notice that we omit the VAE's curve because the VAE uses the whole latent space and hence all singular values tend to be large.

Interpolation between Data Points: We linearly interpolate the latent variable between two images from the validation set. The generated results are shown in Figure 5. IRMAE significantly outperforms the baseline AE on MNIST.
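The interpolation step can be sketched as follows. The function name and the number of steps are illustrative assumptions; each row of the returned array would be passed through the decoder to produce one image of the interpolation sequence.

```python
import numpy as np

def interpolate_latents(z_a, z_b, num_steps=8):
    """Linearly interpolate between two latent vectors (hypothetical helper)."""
    alphas = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - alphas) * z_a[None, :] + alphas * z_b[None, :]

rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=128), rng.normal(size=128)   # stand-ins for two encoded images
path = interpolate_latents(z_a, z_b)                    # shape (8, 128)
```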

Sampling from Noise: Deterministic autoencoders are not considered to be generative models. It is essential to have constraints on the latent space to obtain such an ability [2]. Here, we show that IRMAE can generate high-quality images from Gaussian noise. Specifically, we sample random latent variables from 1) a multivariate Gaussian fitted to the latent variables through their covariance matrix, and 2) a Gaussian Mixture Model (GMM) with 4/10 clusters. The generated results are presented in Figure 6 and Figure 7. We quantitatively evaluate the performance of each model using the Fréchet Inception Distance (FID) [11] and report the results on MNIST/CelebA in Table 1.
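The first sampling scheme can be sketched with NumPy alone. The helper below fits a Gaussian to a set of latent codes and draws new codes from it; using the empirical mean together with the covariance is an assumption, as are the function name and the toy low-rank codes. The Gaussian-mixture variant would need a mixture-model fitting routine and is not shown.

```python
import numpy as np

def sample_latents_gaussian(z_train, num_samples, rng):
    """Draw latent codes from a Gaussian fitted to z_train (n_samples, d)."""
    mean = z_train.mean(axis=0)                 # assumption: the empirical mean is used
    cov = np.cov(z_train, rowvar=False)         # covariance matrix of the latent codes
    return rng.multivariate_normal(mean, cov, size=num_samples)

rng = np.random.default_rng(0)
# Toy low-rank codes standing in for encoder outputs.
z_train = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10))
z_new = sample_latents_gaussian(z_train, 200, rng)      # decode these to obtain images
```

Because the fitted covariance has low rank, the samples stay close to the subspace occupied by the training codes.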
Table 1: FID score (smaller is better) for samples of various models for MNIST/CelebA.
PCA on Latent Space: We verify that IRMAE learns a compact and continuous latent space by performing PCA on the latent space. We project all latent variables to a 2-dimensional space. We randomly sample vectors in this low-dimensional space and interpolate them along the two principal vectors. The corresponding images are obtained by inverse PCA followed by the decoder and are shown in Figure 8. IRMAE generates higher quality images compared to the VAE.
Additional experiments are presented in the supplementary material, including a comparison of IRMAE with other deterministic AEs, a comparison of IRMAE with AEs of various latent dimensions, and the effect of varying the linear layer depth in IRMAE.
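A minimal sketch of the PCA round trip follows, assuming PCA is computed by an SVD of the centered codes. The helper names, the random stand-in codes, and the 5x5 grid are illustrative assumptions; the resulting `z_grid` rows are what would be fed to the decoder.

```python
import numpy as np

def fit_pca(z, k=2):
    """Return the mean and the top-k principal directions of codes z (n_samples, d)."""
    mean = z.mean(axis=0)
    _, _, vt = np.linalg.svd(z - mean, full_matrices=False)
    return mean, vt[:k]                          # shapes (d,), (k, d)

def pca_project(z, mean, components):
    return (z - mean) @ components.T             # coordinates in the k-dim space

def pca_inverse(coords, mean, components):
    return coords @ components + mean            # back to the d-dim latent space

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 16))                   # stand-in for latent codes
mean, comps = fit_pca(z, k=2)
coords = pca_project(z, mean, comps)

# Regular grid along the two principal directions, spanning the projected codes.
lo, hi = coords.min(axis=0), coords.max(axis=0)
axes = [np.linspace(lo[i], hi[i], 5) for i in range(2)]
grid = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, 2)
z_grid = pca_inverse(grid, mean, comps)          # pass these to the decoder
```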
Downstream classification
Latent variables are useful for downstream tasks since they capture the main underlying structure of the data distribution [10, 18, 3]. These self-supervised learning methods have the exciting potential to outperform purely supervised models. We train a multilayer perceptron (MLP) head on the latent variables produced by the encoder to classify MNIST images. This MLP head has two linear layers with hidden dimension 128 and ReLU activation. Thus, all models share the same architecture. Each model is trained with the Adam optimizer with a learning rate of 0.001. Early stopping is performed based on validation set accuracy. The encoder weights are kept fixed. We compare our method against several baselines as well as a supervised version whose entire network is trained jointly. Representations learned by IRMAE obtain a significantly lower error rate compared to those from the unregularized AE in this task. The results are listed in Table 2.
Ablation Study
We perform several ablation studies to verify that the effect of dimensionality reduction comes from the extra linear neural network and its optimization dynamics.
Linear matrices fixed: In this ablation experiment, we fix the linear matrices to verify that the regularization effect comes from the learning dynamics instead of just the architecture. Figure 9 shows that under this condition, the regularization effect is weakened, and the sampled images are significantly worse.
Table 2: Downstream classification on the MNIST dataset. We add an MLP head on top of the encoder pretrained by each method. Thus, all models share the same architecture. We do not fine-tune the pretrained encoder except in the purely supervised version. Representations learned by IRMAE obtain a significantly lower error rate compared to the baselines and the supervised version in the low labeled data regime.

Nonlinearity between matrices: One may suspect that the regularization effect comes from a deeper architecture. If we add nonlinearity between matrices, the model is equivalent to a standard autoencoder, with more layers. We show that adding a nonlinearity results in worse generation results, and the regularization effect is also completely lost. See Figure 10.
Weight Sharing: As our method introduces more parameters for training, it would be desirable for all inserted matrices to share weights in order to reduce the memory requirement. We show that forcing all matrices to share weights results in slightly worse generation results and a weakened regularization effect. See Figure 11.

Conclusion
An important component of autoencoder methods is the method by which the information capacity of the latent representation is minimized or limited. In this work, the rank of the covariance matrix of the codes is implicitly minimized by relying on the fact that gradient descent learning in multi-layer linear networks leads to minimum-rank solutions. By inserting a number of extra linear layers between the encoder and the decoder, the system spontaneously learns representations with a low effective dimension. The model, dubbed Implicit Rank-Minimizing Autoencoder (IRMAE), is simple, deterministic, and learns a low-rank latent space. We demonstrate the validity of the method on several image generation and representation learning tasks.
Broader Impact
This work provides a novel approach to representation learning and self-supervised learning. It has the potential to boost general self-supervised learning performance, with social benefits including less need for human data labeling, reduced power consumption of AI models, and improved data privacy.
Acknowledgement
We are grateful to Stephane Deny for his feedback on early versions of the manuscript. We thank Pascal Vincent, Nicolas Ballas, Lluis Castrejon, Piotr Bojanowski for their fruitful discussions.
Appendix
Experiment Detail
Dataset
For the synthetic shape dataset, we generate shape images on the fly. The size of each shape is uniformly sampled between 3 and 8, inclusive. The color is uniformly sampled in RGB. The coordinates of the center of the shape are randomly sampled with x and y between 8 and 24, inclusive.
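One way to generate such images is sketched below. The text does not say whether the size is a half-extent or a full side length, what the background color is, or how colors are scaled, so the snippet assumes the size is the radius (or half side length), a black background, and RGB values in [0, 1]; the function name is likewise hypothetical.

```python
import numpy as np

def make_shape_image(rng, image_size=32):
    """Draw one random square or circle on a black background (assumed conventions)."""
    size = rng.integers(3, 9)                    # 3..8 inclusive, used as radius / half side
    cx, cy = rng.integers(8, 25, size=2)         # center coordinates, 8..24 inclusive
    color = rng.uniform(0.0, 1.0, size=3)        # random RGB color
    is_circle = bool(rng.integers(0, 2))         # shape type: circle or square

    ys, xs = np.mgrid[0:image_size, 0:image_size]
    if is_circle:
        mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= size ** 2
    else:
        mask = (np.abs(xs - cx) <= size) & (np.abs(ys - cy) <= size)

    image = np.zeros((image_size, image_size, 3))
    image[mask] = color                          # shapes near the border are clipped
    return image

rng = np.random.default_rng(0)
batch = np.stack([make_shape_image(rng) for _ in range(4)])   # shape (4, 32, 32, 3)
```

The seven sampled quantities (three color channels, two coordinates, one size, one shape type) correspond to the intrinsic dimensionality of 7 stated in the main text.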
For the MNIST dataset, all images are resized to 32x32.
Architecture
The architectures of the encoder and the decoder for each experiment are listed in Table 3. Conv n / ConvT n denotes a convolutional / transposed-convolutional layer with output channel dimension n. All convolutional layers use a 4x4 kernel with stride 2 and padding 1. FC n denotes a fully connected layer with output dimension n.
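As a quick arithmetic check of the encoder shapes, the standard convolution output-size formula can be applied to these settings: a 4x4 kernel with stride 2 and padding 1 halves an even spatial size. The helper names below are hypothetical; the layer counts and channel widths are read from the encoder rows of Table 3.

```python
def conv_output_size(n, kernel=4, stride=2, padding=1):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1

def encoder_spatial_size(input_size, num_convs):
    size = input_size
    for _ in range(num_convs):
        size = conv_output_size(size)
    return size

shape_side = encoder_spatial_size(32, 5)     # five convs: 32 -> 16 -> 8 -> 4 -> 2 -> 1
mnist_side = encoder_spatial_size(32, 4)     # four convs: 32 -> 2
celeba_side = encoder_spatial_size(64, 4)    # four convs: 64 -> 4

shape_code_dim = shape_side ** 2 * 32        # 1 * 1 * 32 channels
mnist_flat_dim = mnist_side ** 2 * 256       # 2 * 2 * 256 channels
celeba_flat_dim = celeba_side ** 2 * 1024    # 4 * 4 * 1024 channels
```

These values agree with the 32-dimensional code of the shape encoder and with the flattened sizes 1024 and 16384 listed for MNIST and CelebA.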
Table 3: The architecture of the encoder and the decoder for each experiment.
For VAE models, the last layer of the decoder has doubled output dimension, which is split as the average and the standard deviation. It also uses Sigmoid instead of Tanh.
Hyperparameters
The hyperparameters for each experiment are listed in Table 4. The number of epochs is chosen so that the reconstruction error of the base model has converged.
Table 4: Hyperparameters.
Additional Experiments
Effect of Varying Linear Layers Initial Variance
The initial variance of the linear matrices has a strong influence on the regularization effect. We observe that a larger variance weakens the regularization effect. See Table 5.
Table 5: Effect of varying initial variance of linear layers in IRMAE. Performed on MNIST dataset. Latent rank represents corresponding number of nonzero singular values of the covariance matrix of latent space.
Effect of Varying Linear Layers Depth
Adding more linear layers increases the regularization effect. We demonstrate this effect in Table 6. The number of linear layers $l$ is a hyperparameter and needs to be tuned in practice.
Table 6: Effect of varying linear layers depth. Performed on MNIST dataset. Latent rank represents corresponding number of nonzero singular values of the covariance matrix of latent space.
Comparing to State-of-the-art Deterministic AEs
We compare IRMAE against several modern deterministic autoencoders, including WAE and RAE. IRMAE demonstrates superior performance on the CelebA dataset. See Table 7.
Comparing to AEs with Various Latent Dimension
Autoencoders with different latent dimensions or prior settings face a trade-off in learning useful representations. Here, we study the effect of the latent dimensionality of IRMAE against that of the AE in Table 8 and Figure 12. IRMAE with larger latent dimensions outperforms the AE with the optimal latent dimension.
Table 8: Comparing IRMAE against AEs with different latent dimension. Performed on CelebA dataset. IRMAE uses l = 4 throughout the experiment. Results are listed in FID score.
t-SNE visualization
Figure 12: Comparing IRMAE against AEs with different latent dimensions. Performed on the CelebA dataset.
We compare IRMAE against an AE and a VAE. Ideally, the two point clouds (test data and sampled images) overlap. IRMAE demonstrates performance comparable to the VAE and superior to the AE.
t-SNE visualization on MNIST images. Blue points represent test set data points. Orange points represent sampled images.
| Dataset | AE (Multivariate Gaussian) | VAE (Multivariate Gaussian) | IRMAE (Multivariate Gaussian) | AE (Gaussian Mixture Model) | VAE (Gaussian Mixture Model) | IRMAE (Gaussian Mixture Model) |
|---|---|---|---|---|---|---|
| MNIST | 55.0 | 33.9 | 37.4 | 38.0 | 30.8 | 34.0 |
| CelebA | 52.8 | 51.8 | 42.5 | 49.0 | 48.8 | 36.4 |
| total training size | 10 | 100 | 1000 | 10000 | 60000 |
|---|---|---|---|---|---|
| AE | 31.4 ± 0.5 | 30.2 ± 0.3 | 10.6 ± 0.2 | 3.7 ± 0.1 | 1.9 ± 0.1 |
| VAE | 21.8 ± 1.0 | 21.7 ± 0.4 | 5.1 ± 0.2 | 1.7 ± 0.1 | 1.1 ± 0.1 |
| IRMAE | 12.0 ± 0.9 | 10.2 ± 0.5 | 3.8 ± 0.3 | 2.4 ± 0.2 | 1.9 ± 0.1 |
| supervised | 29.1 ± 2.6 | 25.1 ± 0.6 | 6.0 ± 0.4 | 1.7 ± 0.1 | 0.8 ± 0.1 |
| Dataset | Shape | MNIST | CelebA |
|---|---|---|---|
| Encoder | $x \in \mathbb{R}^{32 \times 32 \times 3}$ → Conv 32 → ReLU → Conv 64 → ReLU → Conv 128 → ReLU → Conv 256 → ReLU → Conv 32 → ReLU → $z \in \mathbb{R}^{32}$ | $x \in \mathbb{R}^{32 \times 32 \times 1}$ → Conv 32 → ReLU → Conv 64 → ReLU → Conv 128 → ReLU → Conv 256 → ReLU → flatten_to 1024 → FC 128 → $z \in \mathbb{R}^{128}$ | $x \in \mathbb{R}^{64 \times 64 \times 3}$ → Conv 128 → ReLU → Conv 256 → ReLU → Conv 512 → ReLU → Conv 1024 → ReLU → flatten_to 16384 → FC 512 → $z \in \mathbb{R}^{512}$ |
| Decoder | $z \in \mathbb{R}^{32}$ → ConvT 256 → ReLU → ConvT 128 → ReLU → ConvT 64 → ReLU → ConvT 32 → ReLU → ConvT 3 → Tanh → $\hat{x} \in \mathbb{R}^{32 \times 32 \times 3}$ | $z \in \mathbb{R}^{128}$ → FC 8096 → reshape_to 8x8x128 → ConvT 64 → ReLU → ConvT 32 → ReLU → ConvT 3 → Tanh → $\hat{x} \in \mathbb{R}^{32 \times 32 \times 1}$ | $z \in \mathbb{R}^{512}$ → FC 65536 → reshape_to 8x8x1024 → ConvT 512 → ReLU → ConvT 256 → ReLU → ConvT 128 → ReLU → ConvT 3 → Tanh → $\hat{x} \in \mathbb{R}^{64 \times 64 \times 3}$ |
| Dataset | Shape | MNIST | CelebA |
|---|---|---|---|
| learning rate | 0.0001 | 0.0001 | 0.0001 |
| epochs | 100 | 50 | 100 |
| latent dimension | 32 | 128 | 512 |
| batch size | 32 | 32 | 32 |
| training examples | 50000 | 60000 | 162770 |
| evaluation examples | 10000 | 10000 | 19962 |
| Variance | 1x | 2x | 4x |
|---|---|---|---|
| Latent Rank | 8 | 43 | 66 |
| FID | 37.4 | 33.8 | 49 |
| Depth (l) | 2 | 4 | 8 | 12 |
|---|---|---|---|---|
| Latent Rank | 70 | 39 | 8 | 4 |
| FID | 44 | 30.1 | 37.4 | 62.6 |
| WAE [26] | RAE [5] | IRMAE |
|---|---|---|
| 53.7 | 44.7 | 42 |
| Latent dimension | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| IRMAE ( l = 4 ) | 81.6 | 64.6 | 47.6 | 42.7 | 42 |
| AE | 78.2 | 60.1 | 46 | 45.4 | 53.9 |
An important component of autoencoders is the method by which the information capacity of the latent representation is minimized or limited. In this work, the rank of the covariance matrix of the codes is implicitly minimized by relying on the fact that gradient descent learning in multi-layer linear networks leads to minimum-rank solutions. By inserting a number of extra linear layers between the encoder and the decoder, the system spontaneously learns representations with a low effective dimension. The model, dubbed Implicit Rank-Minimizing Autoencoder (IRMAE), is simple, deterministic, and learns compact latent spaces. We demonstrate the validity of the method on several image generation and representation learning tasks.
Optimizing a linear multi-layer neural network through gradient descent leads to a low-rank solution. This phenomenon is known as implicit regularization and has been extensively studied under the context of matrix factorization [9, 1, 21], linear regression [24, 6], logistic regression [25], and linear convolutional neural networks [8]. The main goal of these prior works were to understand the generalization ability of deep neural networks. By contrast, the goal of the present work is to design an architecture that takes advantage of this phenomenon to improve the quality of learned representations.
Learning good representations remains a core issue in AI [2]. Representations learned in a self-supervised (or unsupervised) manner can be used for downstream tasks such as generation and classification. Autoencoders (AE) are a popular class of method for learning representations without requiring labeled data. The internal representation of an AE must have a limited information capacity to prevent the AE from learning a trivial identity function. Variants of AEs differ by how they perform this limitation. Bottleneck AE (sometimes called "Diabolo networks") simply use low-dimensional codes [23], noisy AE, such as variational AE add noise to the codes while limiting the variance of their distribution [4, 14], quantizing AE (such as VQ-VAE) quantize the codes into discrete clusters [27], sparse AE impose a sparsity penalty on the code [19, 20], contracting and saturating AE minimize the curvature of the network function in directions outside the data manifold [22, 7], and denoising AE are trained to produce large reconstruction error for corrupted samples [28].
In this work, we propose a new method to implicitly minimize the rank/dimensionality of the latent code of an autoencoder. We call this model Implicit Rank-Minimizing Autoencoder (IRMAE). This method consists in inserting extra linear layers between the encoder and the decoder of a standard autoencoder. This additional linear network is trained jointly with the rest of the autoencoder through classical backpropagation. As a result, the system spontaneously learns representations with a low effective dimensionality. Like other regularization methods, this extra linear neural network does not appear at inference time as the linear matrices collapse into one. Thus, the encoder and the decoder architecture of the model is identical to the original model. In practice, we fold the collapsed linear matrices into the last layer of the encoder at inference time.
We empirically demonstrate IRMAE’s regularization behavior through a synthetic dataset and show that it learns good representation with a much smaller latent dimension. Then we demonstrate superior representation learning performance of our method against a standard deterministic autoencoder and comparable performance to a variational autoencoder on MNIST dataset and CelebA dataset through a variety of generative tasks, including interpolation, sample generation from noise, PCA interpolation in low dimension, and a downstream classification task. We also conducted an ablation study to verify that the advantage of implicit regularization comes from gradient descent learning dynamics.
We summarize our contributions as follows:
We proposed a method of inserting extra linear layers in deep neural networks for rank regularization;
We proposed a simple, deterministic rank-minimization autoencoder that learns low-dimensional representation;
We demonstrated a superior performance of our method compared to a standard deterministic autoencoder and a variational autoencoder on a variety of generative and downstream classification tasks.
The implicit regularization provided by gradient descent optimization is widely believed to be one of the keys to deep neural networks’ generalization ability. Many works focusing on linear cases are trying to study this behavior empirically and theoretically. Soudry et al. [25] show that implicit bias helps to learn logistic regression. Saxe et al. [24] study a 2-layer linear regression and theoretically demonstrated that continuous gradient descent could lead to a low-rank solution. Gidel et al. [6] extend such theory to a discrete case for linear regression problems. In the field of matrix factorization, Gunasekar et al. [9] theoretically prove that gradient descent can derive minimal nuclear norm solution. Arora et al. [1] extend this concept to the deep linear network case by theoretically and empirically demonstrating that a deep linear network can derive low-rank solutions. Gunasekar et al. [8] prove that gradient descent has a regularization effect in linear convolutional networks. All these works are trying to understand why gradient descent can help generalization in existing approaches. On the contrary, we take advantage of this phenomenon to develop better algorithms. Also, the current implicit regularization study requires a small gradient and vanishing initialization, while our method is more general and can be used with complicated optimizers such as Adam [13] and allow combination with more complicated components.
Autoencoders are popular for representation learning. It is important to limit the latent capacity as the data are embedded in a lower-dimensional space. A big family of them are based on variational autoencoders [14] such as beta-VAE [12]. These methods tend to generate blurry images due to its intrinsic probabilistic nature. On the other hand, a naive deterministic autoencoder is considered a failure in generative tasks and has “holes” in its latent space, due to the absence of explicit constraint on the latent distribution. Many methods with deterministic autoencoder are proposed to solve this problem, such as RAE [5], WAE [26], VQ-VAE [27].
We denote by ℰ()ℰ\mathcal{E}() and 𝒟()𝒟\mathcal{D}() the encoder and decoder of a deterministic autoencoder, respectively. The latent variable z∈ℝd𝑧superscriptℝ𝑑z\in\mathbb{R}^{d} is determined by ℰ(y)ℰ𝑦\mathcal{E}(y). Encoder and decoder are classically trained by jointly minimizing the L2subscript𝐿2L_{2} reconstruction loss LAE=‖y−𝒟(ℰ(y))‖22subscript𝐿𝐴𝐸superscriptsubscriptnorm𝑦𝒟ℰ𝑦22L_{AE}=||y-\mathcal{D}(\mathcal{E}(y))||_{2}^{2}. Without any constraint on the latent space, a simple deterministic autoencoder will typically learn a non-Gaussian latent space with “holes” and hence does not generate good samples.
Implicit rank-minimizing autoencoder consists in adding extra linear matrices W1,W2,⋯,Wlsubscript𝑊1subscript𝑊2⋯subscript𝑊𝑙W_{1},W_{2},\cdots,W_{l} between the encoder and decoder, where Wi∈ℝd×dsubscript𝑊𝑖superscriptℝ𝑑𝑑W_{i}\in\mathbb{R}^{d\times d} are randomly initialized. The corresponding diagram is shown in Figure 1. All Wisubscript𝑊𝑖W_{i} matrices are trained jointly with the encoder and the decoder. Hence, the reconstruction loss is represented as
During training, these matrices encourage latent variables to use a lower number of dimensions and effectively minimize the rank of the covariance matrix of the latent space. Thus, one can amplify the regularization effect by adding more Wisubscript𝑊𝑖W_{i} matrices between the encoder and the decoder. Also, we do not use special initialization of each Wisubscript𝑊𝑖W_{i}, and it works with more optimizers such as Adam [13].
During inference, all Wisubscript𝑊𝑖W_{i} matrices can be “absorbed” into the encoder as all the linear matrices collapse, as linear matrix multiplication is associative. Therefore, we can directly use this linearly modified decoder for generative tasks; we can also directly use the encoder for downstream tasks such as classification.
In this section, we empirically evaluate the proposed IRMAE model. We first verify the regularization effect through a synthetic task. We then demonstrate that IRMAE generates higher quality images compared to a baseline AE. IRMAE shows comparable performance to VAE. Lastly, we demonstrate IRMAE’s superior performance on downstream classification tasks.
Throughout the experiments, we visualize the latent dimension by plotting the normalized singular values. Each plot in Figures 2 and 4 depicts the singular values (sorted from large to small) of the covariance matrix of the latent variables $z$ corresponding to examples in the validation set. The plots are normalized by dividing each singular value by the largest singular value of the covariance matrix. Therefore, the dimension of the latent space can be interpreted as the number of nonzero singular values.
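A minimal sketch of this diagnostic is given below. The function names and the threshold `tol` used to decide which singular values count as "nonzero" are assumptions made for illustration; the placeholder codes are synthetic and confined to a 7-dimensional subspace of a 32-dimensional latent space.

```python
import numpy as np

def normalized_spectrum(codes):
    """codes: (num_examples, d) latent vectors. Returns singular values of the
    covariance matrix, sorted from large to small and divided by the largest."""
    cov = np.cov(codes, rowvar=False)         # d x d covariance matrix
    s = np.linalg.svd(cov, compute_uv=False)  # already sorted in descending order
    return s / s[0]

def effective_rank(codes, tol=1e-6):
    """Number of normalized singular values above `tol` (assumed threshold)."""
    return int(np.sum(normalized_spectrum(codes) > tol))

# Synthetic placeholder codes lying in a 7-dimensional subspace of R^32.
rng = np.random.default_rng(0)
d, k, n = 32, 7, 2000
basis = rng.normal(size=(k, d))
codes = rng.normal(size=(n, k)) @ basis

spectrum = normalized_spectrum(codes)
rank = effective_rank(codes)
print(rank)
```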
We verify the regularization behavior of IRMAE via a synthetic shape dataset. Each example is a 32x32 RGB image with a random-color, random-sized square or circle, located at a random position. Hence, the data has a known intrinsic dimensionality of 7 (3 for color, 2 for coordinate, 1 for size, 1 for shape).
The base architecture we use is a deterministic autoencoder. The architecture and experimental details can be found in the supplementary material. We use a latent dimension of 32. For IRMAE, we use $l=2$ and $l=4$ extra matrices between the encoder and the decoder. We test our method against no regularization, L1 regularization, and L2 regularization on the hidden code with the same architecture. The learned latent space is shown in Figure 2. The baseline model, L1 regularization, L2 regularization, and IRMAE with $l=2$ all yield excellent reconstructions on the validation set.
This result shows that IRMAE with $l=2$ is able to learn a good latent representation with a rank close to the intrinsic dimension, while L1 and L2 regularization tend to use a much larger latent space.
Generating high-quality images by sampling the latent space is one of the key indicators of a good representation. In order to provide a comparison with standard deterministic autoencoders and variational autoencoders [14], we train our model on the MNIST dataset [15] and the CelebA dataset [16]. We set the latent dimension to 128 and 512 for MNIST and CelebA, respectively, and use 8 and 4 extra linear matrices for regularization in IRMAE, respectively. More experimental details can be found in the supplementary material. We evaluate our model on a variety of representation learning tasks: interpolation between data points, sample generation from random noise, downstream classification, and PCA interpolation in the latent space. We also quantitatively evaluate the sample generation using the FID score. Each model uses the same architecture, except that the VAE code is twice as large to include the means and variances. On all these tasks, our method demonstrates performance comparable to the VAE.
Latent Dimension: We show the latent dimensionality reduction of our method in Figure 4. IRMAE utilizes a significantly lower-dimensional latent space than the baseline autoencoder. Note that we omit the VAE’s curve because the VAE uses the whole latent space and hence all of its singular values tend to be large.
Interpolation between Data Points: We linearly interpolate the latent variable between two images from the validation set. The generated results are shown in Figure 5. IRMAE produces visibly higher-quality interpolations than the baseline AE on MNIST.
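The interpolation step itself is simple. A sketch, with the function name and the number of steps chosen here for illustration, might look as follows; each row of the result would then be passed through the decoder.

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps=8):
    """Return `steps` codes evenly spaced on the segment from z_a to z_b,
    endpoints included."""
    alphas = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - a) * z_a + a * z_b for a in alphas])

# Placeholder codes standing in for the encodings of two validation images.
z_a = np.zeros(128)
z_b = np.ones(128)
path = interpolate_latents(z_a, z_b, steps=5)
print(path.shape)
```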
Sampling from Noise: Deterministic autoencoders are not considered to be generative models. It is essential to have constraints on the latent space to obtain such an ability [2]. Here, we show that IRMAE can generate high-quality images from Gaussian noise. Specifically, we sample random latent variables from 1) a multivariate Gaussian whose covariance is the covariance matrix of the latent variables, and 2) a Gaussian Mixture Model with 4/10 clusters. The generated results are presented in Figure 6 and Figure 7. We quantitatively evaluate the performance of each model using the Fréchet Inception Distance (FID) [11] and report the results on MNIST/CelebA in Table 1.
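The multivariate Gaussian case could be sketched as below. This is an assumed procedure, not the authors' code: the mean and covariance are estimated from placeholder codes, and samples are drawn through an eigendecomposition so that a rank-deficient covariance (the situation IRMAE aims for) is handled without numerical issues. The Gaussian Mixture Model variant is not shown.

```python
import numpy as np

def fit_gaussian(codes):
    """Estimate the mean and covariance of latent codes of shape (n, d)."""
    return codes.mean(axis=0), np.cov(codes, rowvar=False)

def sample_latents(mean, cov, num_samples, rng):
    """Draw samples from N(mean, cov); cov may be rank deficient."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # remove tiny negative rounding errors
    noise = rng.normal(size=(num_samples, len(mean)))
    return mean + (noise * np.sqrt(eigvals)) @ eigvecs.T

rng = np.random.default_rng(0)
# Placeholder low-rank codes: 3 active directions in a 16-dimensional latent space.
codes = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 16))
mean, cov = fit_gaussian(codes)
samples = sample_latents(mean, cov, num_samples=64, rng=rng)
print(samples.shape)
```

Each sampled row would then be decoded into an image.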
PCA on Latent Space: We verify that IRMAE learns a compact and continuous latent space by performing PCA on the latent space. We project all latent variables to a 2-dimensional space. We randomly sample vectors in this low-dimensional space and interpolate them along the two principal vectors. The corresponding images are obtained by applying the inverse PCA followed by the decoder, and are shown in Figure 8. IRMAE generates higher quality images compared to VAE.
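One way to realize the PCA projection and its inverse is sketched below. The regular grid, its extent in units of standard deviation, and the function names are assumptions made for illustration; the text above samples vectors randomly before interpolating.

```python
import numpy as np

def fit_pca_2d(codes):
    """Return the mean, the top two principal directions, and the standard
    deviation of the codes along each of them."""
    mean = codes.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, s, vt = np.linalg.svd(codes - mean, full_matrices=False)
    components = vt[:2]
    scales = s[:2] / np.sqrt(len(codes) - 1)
    return mean, components, scales

def pca_grid_to_latents(mean, components, scales, grid_size=5, extent=2.0):
    """Map a regular 2-D grid back to latent space (inverse PCA)."""
    ticks = np.linspace(-extent, extent, grid_size)
    coords = np.array([[a, b] for a in ticks for b in ticks])
    return mean + (coords * scales) @ components

rng = np.random.default_rng(0)
codes = rng.normal(size=(1000, 128))  # placeholder latent codes
mean, components, scales = fit_pca_2d(codes)
latents = pca_grid_to_latents(mean, components, scales)
print(latents.shape)
```

Decoding every row of `latents` would give a 5 x 5 panel of images of the kind shown in Figure 8.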
Additional experiments are presented in the supplementary material, including a comparison of IRMAE with other deterministic AEs, a comparison of IRMAE against AEs with various latent dimensions, and the effect of varying the linear layer depth in IRMAE.
Latent variables are useful for downstream tasks since they capture the main underlying structure of the data distribution [10, 18, 3]. These self-supervised learning methods have the exciting potential to outperform purely-supervised models. We train a multilayer perceptron (MLP) head on the latent variable generated by the encoder to classify MNIST images. This MLP head has two linear layers with a hidden dimension of 128 and ReLU activation. Thus, all models share the same architecture. Each model is trained with the Adam optimizer with a learning rate of 0.001. Early stopping is performed based on validation set accuracy. The encoder weights are kept fixed. We compare our method against several baselines as well as a supervised version whose entire network is trained jointly. In the low-labeled-data regime, representations learned by IRMAE obtain a significantly lower error rate than those from the unregularized AE; with all 60000 training examples, the two reach the same error rate. The results are listed in Table 2.
Linear matrices fixed: In this ablation experiment, we fix the linear matrices to verify that the regularization effect comes from the learning dynamics instead of just the architecture. Figure 9 shows that under this condition, the regularization effect is weakened, and the sampled images are significantly worse.
Nonlinearity between matrices: One may suspect that the regularization effect comes from a deeper architecture. If we add nonlinearity between matrices, the model is equivalent to a standard autoencoder, with more layers. We show that adding a nonlinearity results in worse generation results, and the regularization effect is also completely lost. See Figure 10.
Weight Sharing: As our method introduces more parameters for training, it would be desirable to have all inserted matrices share weights to reduce the memory requirement. We show that forcing all matrices to share weights results in slightly worse generation results and a weakened regularization effect. See Figure 11.
This work provides a novel approach to representation learning and self-supervised learning. It has the potential to boost general self-supervised learning performance, with social benefits including requiring less human data labeling, reducing the power consumption of AI models, and improving data privacy.
We are grateful to Stephane Deny for his feedback on early versions of the manuscript. We thank Pascal Vincent, Nicolas Ballas, Lluis Castrejon, Piotr Bojanowski for their fruitful discussions.
For the synthetic shape dataset, we generate shape images on the fly. The size of each shape is uniformly sampled between 3 and 8, inclusively. The color is uniformly sampled in RGB. The coordinate of the center of the shape is randomly sampled with x and y between 8 and 24, inclusively.
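A generator following this description might look like the sketch below. Several details are assumptions, since the text does not fix them: "size" is treated as the radius of a circle or the half-width of a square, the background is black, and color channels lie in [0, 1].

```python
import numpy as np

def sample_shape_image(rng, image_size=32):
    """Draw one random square or circle on a black RGB canvas."""
    size = int(rng.integers(3, 9))                          # 3..8 inclusive
    cx, cy = (int(v) for v in rng.integers(8, 25, size=2))  # 8..24 inclusive
    color = rng.uniform(0.0, 1.0, size=3)                   # random RGB color
    is_circle = bool(rng.integers(0, 2))

    ys, xs = np.mgrid[0:image_size, 0:image_size]
    if is_circle:
        mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= size ** 2
    else:
        mask = (np.abs(xs - cx) <= size) & (np.abs(ys - cy) <= size)

    image = np.zeros((image_size, image_size, 3))
    image[mask] = color
    return image

rng = np.random.default_rng(0)
batch = np.stack([sample_shape_image(rng) for _ in range(4)])
print(batch.shape)
```

The seven sampled quantities (three color channels, two coordinates, one size, one shape flag) correspond to the intrinsic dimensionality of 7 stated for this dataset.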
For the MNIST dataset, all images are resized to 32x32.
The architecture of the encoder and the decoder for each experiment is listed in Table 3. Conv$n$/ConvT$n$ denotes a convolutional/transposed-convolutional layer with an output channel dimension equal to $n$. All convolutional layers use a 4x4 kernel with stride 2 and padding 1. FC$n$ denotes a fully connected layer with output dimension $n$.
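With a 4x4 kernel, stride 2, and padding 1, each convolution halves the spatial resolution. The short check below uses the standard convolution output-size formula (the helper names are hypothetical) to reproduce the flattened encoder sizes listed in Table 3.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial size after one convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_size(input_size, channels):
    """Feature count after one stride-2 convolution per entry of `channels`."""
    size = input_size
    for _ in channels:
        size = conv_out(size)
    return size * size * channels[-1]

# Encoder channel sequences as listed in Table 3.
shape_features = flattened_size(32, [32, 64, 128, 256, 32])
mnist_features = flattened_size(32, [32, 64, 128, 256])
celeba_features = flattened_size(64, [128, 256, 512, 1024])
print(shape_features, mnist_features, celeba_features)
```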
For VAE models, the last layer of the encoder has a doubled output dimension, which is split into the mean and the standard deviation. The VAE also uses Sigmoid instead of Tanh.
The hyperparameters for each experiment are listed in Table 4. The number of epochs is chosen so that the reconstruction error of the base model has converged.
Adding more linear layers will increase the regularization effect. We demonstrate this effect in Table 6. The number of linear layers $l$ is a hyperparameter and needs to be tuned in practice.
Autoencoders with different latent dimensions or prior settings face a trade-off in learning useful representations. Here, we study the effect of the latent dimensionality of IRMAE against AE in Table 8 and Figure 12. IRMAE with larger latent dimensions outperforms the AE with the optimal latent dimension.
We visualize the density of the sampled MNIST images by each model in Figure 13 using t-SNE [17]. Blue points represent the original data points, and the orange points represent the sampled ones. We compare IRMAE against an AE and a VAE. It is desirable that the two point clouds overlap. IRMAE demonstrates a comparable performance to VAE and a superior performance to AE.
Table: S4.T1: FID score (smaller is better) for samples of various models for MNIST/CelebA.
| Dataset | Multivariate Gaussian: AE | Multivariate Gaussian: VAE | Multivariate Gaussian: IRMAE | Gaussian Mixture Model: AE | Gaussian Mixture Model: VAE | Gaussian Mixture Model: IRMAE |
|---|---|---|---|---|---|---|
| MNIST | 55.0 | 33.9 | 37.4 | 38.0 | 30.8 | 34.0 |
| CelebA | 52.8 | 51.8 | 42.5 | 49.0 | 48.8 | 36.4 |
Table: S4.T2: Downstream classification on the MNIST dataset. We add an MLP head on top of the encoder pretrained by each method. Thus, all models share the same architecture. We do not fine-tune the pretrained encoder except in the purely supervised version. Representations learned by IRMAE obtain a significantly lower error rate than the baselines and the supervised version in the low-labeled-data regime.
| total training size | 10 | 100 | 1000 | 10000 | 60000 |
|---|---|---|---|---|---|
| AE | 31.4 ± 0.5 | 30.2 ± 0.3 | 10.6 ± 0.2 | 3.7 ± 0.1 | 1.9 ± 0.1 |
| VAE | 21.8 ± 1.0 | 21.7 ± 0.4 | 5.1 ± 0.2 | 1.7 ± 0.1 | 1.1 ± 0.1 |
| IRMAE | 12.0 ± 0.9 | 10.2 ± 0.5 | 3.8 ± 0.3 | 2.4 ± 0.2 | 1.9 ± 0.1 |
| supervised | 29.1 ± 2.6 | 25.1 ± 0.6 | 6.0 ± 0.4 | 1.7 ± 0.1 | 0.8 ± 0.1 |
Table: Sx3.T3: The architecture of the encoder and the decoder for each experiment.
| Dataset | Shape | MNIST | CelebA |
|---|---|---|---|
| Encoder | $x\in\mathbb{R}^{32\times 32\times 3}$ → Conv32 → ReLU → Conv64 → ReLU → Conv128 → ReLU → Conv256 → ReLU → Conv32 → ReLU → $z\in\mathbb{R}^{32}$ | $x\in\mathbb{R}^{32\times 32\times 1}$ → Conv32 → ReLU → Conv64 → ReLU → Conv128 → ReLU → Conv256 → ReLU → flatten to 1024 → FC128 → $z\in\mathbb{R}^{128}$ | $x\in\mathbb{R}^{64\times 64\times 3}$ → Conv128 → ReLU → Conv256 → ReLU → Conv512 → ReLU → Conv1024 → ReLU → flatten to 16384 → FC512 → $z\in\mathbb{R}^{512}$ |
| Decoder | $z\in\mathbb{R}^{32}$ → ConvT256 → ReLU → ConvT128 → ReLU → ConvT64 → ReLU → ConvT32 → ReLU → ConvT3 → Tanh → $\hat{x}\in\mathbb{R}^{32\times 32\times 3}$ | $z\in\mathbb{R}^{128}$ → FC8096 → reshape to 8x8x128 → ConvT64 → ReLU → ConvT32 → ReLU → ConvT3 → Tanh → $\hat{x}\in\mathbb{R}^{32\times 32\times 1}$ | $z\in\mathbb{R}^{512}$ → FC65536 → reshape to 8x8x1024 → ConvT512 → ReLU → ConvT256 → ReLU → ConvT128 → ReLU → ConvT3 → Tanh → $\hat{x}\in\mathbb{R}^{64\times 64\times 3}$ |
Table: Sx3.T4: hyperparameters.
| Dataset | Shape | MNIST | CelebA |
|---|---|---|---|
| learning rate | 0.0001 | 0.0001 | 0.0001 |
| epochs | 100 | 50 | 100 |
| latent dimension | 32 | 128 | 512 |
| batch size | 32 | 32 | 32 |
| training examples | 50000 | 60000 | 162770 |
| evaluation examples | 10000 | 10000 | 19962 |
Table: Sx3.T5: Effect of varying initial variance of linear layers in IRMAE. Performed on MNIST dataset. Latent rank represents corresponding number of nonzero singular values of the covariance matrix of latent space.
| Variance | 1x | 2x | 4x |
|---|---|---|---|
| Latent Rank | 8 | 43 | 66 |
| FID | 37.4 | 33.8 | 49.0 |
Table: Sx3.T7: Comparing IRMAE against state-of-the-art deterministic AEs on CelebA dataset.
| WAE [26] | RAE [5] | IRMAE |
|---|---|---|
| 53.7 | 44.7 | 42.0 |
Implicit rank-minimizing autoencoder: a deterministic autoencoder with implicit regularization. The linear matrices that form a linear neural network between the encoder and the decoder are all square matrices. The effect of these matrices is to penalize the rank of the code variable. These matrices are equivalent to a single linear layer at inference time, and thus they do not change the capacity of the autoencoder. In practice, they are absorbed into the last layer of the encoder.
Singular values of the latent space of each model on the synthetic shape dataset. Each curve represents the singular values of the covariance matrix of the code computed on the validation set. IRMAE with $l=2$ is able to approach the minimal theoretical rank of 7.
Linear interpolation between two randomly generated samples. From top to bottom are results from the baseline unregularized AE, AE with L1 regularization, AE with L2 regularization, IRMAE with $l=2$, and IRMAE with $l=4$.
Singular value spectra of covariance matrices of codes for MNIST and CelebA datasets by IRMAE and a baseline AE. Each curve represents the singular values of the covariance matrix of the hidden code computed on the validation set.
Linear interpolation between data points on the MNIST dataset. From top to bottom are images generated from an unregularized AE, a VAE, and an IRMAE, respectively. IRMAE produces higher quality images.
Sampling images from 2-dimensional space, mapped by PCA from latent variables. We interpolate along two principal components to generate samples. From left to right are images generated from an unregularized AE, a VAE, and an IRMAE, respectively.
Ablation study: linear matrices fixed. This proves that the regularization behavior is not an effect of naive soft bottleneck.
Ablation study: sharing weights in the inserted linear layers.
$$ L=||y-\mathcal{D}(W_l\cdots W_2W_1\mathcal{E}(y))||_2^2 $$
| Depth (l) | 2 | 4 | 8 | 12 |
|---|---|---|---|---|
| Latent Rank | 70 | 39 | 8 | 4 |
| FID | 44 | 30.1 | 37.4 | 62.6 |
| Latent dimension | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|
| IRMAE ( l = 4 ) | 81.6 | 64.6 | 47.6 | 42.7 | 42 |
| AE | 78.2 | 60.1 | 46 | 45.4 | 53.9 |
References
[goroshin-lecun-iclr-13] Goroshin, Rotislav, LeCun, Yann. (2013). Saturating Auto-Encoders. International Conference on Learning Representations (ICLR2013).
[ranzato-nips-07] Ranzato, Marc'Aurelio, Boureau, Y-Lan. (2007). Sparse feature learning for deep belief networks. Advances in Neural Information Processing Systems (NIPS 2007).
[doi-nips-2004] Eizaburo Doi, Michael S. Lewicki. (2005). Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units. Advances in Neural Information Processing Systems 17.
[rhw-1986] Gunasekar, Suriya, Woodworth, Blake E, Bhojanapalli, Srinadh, Neyshabur, Behnam, Srebro, Nati. (2017). Implicit Regularization in Matrix Factorization. Advances in Neural Information Processing Systems 30 (NeurIPS '17).
[Soudry2018TheIB] Daniel Soudry, Elad Hoffer, Nathan Srebro. (2018). The Implicit Bias of Gradient Descent on Separable Data. International Conference on Learning Representations (ICLR '18).
[Saxe2019AMT] Andrew M. Saxe, James L. McClelland, Surya Ganguli. (2019). A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences of the United States of America.
[Gidel2019ImplicitRO] Gidel, Gauthier, Bach, Francis, Lacoste-Julien, Simon. (2019). Implicit Regularization of Discrete Gradient Dynamics in Linear Neural Networks. Advances in Neural Information Processing Systems 32 (NeurIPS '19).
[Arora2019ImplicitRI] Arora, Sanjeev, Cohen, Nadav, Hu, Wei, Luo, Yuping. (2019). Implicit Regularization in Deep Matrix Factorization. Advances in Neural Information Processing Systems (NeurIPS '19).
[Gunasekar2018ImplicitBO] Suriya Gunasekar, Jason D. Lee, Daniel Soudry, Nathan Srebro. (2018). Implicit Bias of Gradient Descent on Linear Convolutional Networks. Advances in Neural Information Processing Systems (NeurIPS '18).
[Kingma2014AutoEncodingVB] Diederik P. Kingma, Max Welling. (2014). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
[Higgins2017betaVAELB] Irina Higgins et al. (2017). beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. International Conference on Learning Representations (ICLR '17).
[Razin2020ImplicitRI] Noam Razin, Nadav Cohen. (2020). Implicit Regularization in Deep Learning May Not Be Explainable by Norms. arXiv preprint arXiv:2005.06398.
[Bengio2013RepresentationLA] Yoshua Bengio, Aaron C. Courville, Pascal Vincent. (2013). Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[Gulrajani2017PixelVAEAL] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, et al. (2017). PixelVAE: A Latent Variable Model for Natural Images. ArXiv.
[Tolstikhin2018WassersteinA] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Schoelkopf. (2018). Wasserstein Auto-Encoders. International Conference on Learning Representations (ICLR '18).
[Oord2017NeuralDR] Aäron van den Oord et al. (2017). Neural Discrete Representation Learning. Advances in Neural Information Processing Systems (NeurIPS '17).
[Liu2015DeepLF] Ziwei Liu, Ping Luo, Xiaogang Wang, Xiaoou Tang. (2015). Deep Learning Face Attributes in the Wild. 2015 IEEE International Conference on Computer Vision (ICCV '15).
[Kingma2015AdamAM] Diederik P. Kingma, Jimmy Ba. (2015). Adam: A Method for Stochastic Optimization. CoRR.
[He2019MomentumCF] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick. (2019). Momentum Contrast for Unsupervised Visual Representation Learning. arXiv preprint arXiv:1911.05722.
[Misra2019SelfSupervisedLO] Misra, Ishan, van der Maaten, Laurens. (2020). Self-Supervised Learning of Pretext-Invariant Representations. Conference on Computer Vision and Pattern Recognition (CVPR '20).
[Chen2020ASF] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A Simple Framework for Contrastive Learning of Visual Representations. arXiv preprint arXiv:2002.05709.
[Makhzani2015AdversarialA] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian J. Goodfellow. (2015). Adversarial Autoencoders. ArXiv.
[Louizos2016TheVF] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, Richard S. Zemel. (2016). The Variational Fair Autoencoder. CoRR.
[Lee2019MaskGANTD] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, Ping Luo. (2019). MaskGAN: Towards Diverse and Interactive Facial Image Manipulation. ArXiv.
[LeCun1998GradientbasedLA] LeCun, Yann, Bottou, Léon, Bengio, Yoshua, Haffner, Patrick. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.
[Heusel2017GANsTB] Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Hochreiter, Sepp. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Advances in Neural Information Processing Systems 30 (NeurIPS '17).
[Ghosh2020FromVT] Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, Bernhard Scholkopf. (2020). From Variational to Deterministic Autoencoders. International Conference on Learning Representations (ICLR '20).
[Vincent2008ExtractingAC] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, Pierre-Antoine Manzagol. (2008). Extracting and composing robust features with denoising autoencoders. International Conference on Machine Learning (ICML '08).
[Ng2000SparseAE] Andrew Ng. (2000). Sparse autoencoder. Online notes.
[Rifai2011ContractiveAE] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, Yoshua Bengio. (2011). Contractive Auto-Encoders: Explicit Invariance During Feature Extraction. International Conference on Machine Learning (ICML '11).
[Hechtnielsen1995ReplicatorNN] Robert Hecht-Nielsen. (1995). Replicator neural networks for universal optimal source coding. Science.
[Maaten2008VisualizingDU] L. V. D. Maaten, Geoffrey E. Hinton. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research.
[bib9] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems 30 (NeurIPS ’17), pages 6151–6159, 2017.
[bib11] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems 30 (NeurIPS ’17), pages 6626–6637, 2017.
[bib22] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In International Conference on Machine Learning (ICML ’11), 2011.
[bib23] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, pages 318–362. MIT Press, 1986.
[bib28] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning (ICML ’08), 2008.