Sparse Coding with Multi-layer Decoders using Variance Regularization

\addr Center for Data Science, New York University, % \addr Department of Computer Science and Center for Data Science, \addr Courant Institute and Center for Data Science, New York University, \addr Meta AI - FAIR

Abstract

Sparse representations of images are useful in many computer vision applications. Sparse coding with an $l_1$ penalty and a learned linear dictionary requires regularization of the dictionary to prevent a collapse in the $l_1$ norms of the codes. Typically, this regularization entails bounding the Euclidean norms of the dictionary's elements. In this work, we propose a novel sparse coding protocol which prevents a collapse in the codes without the need to regularize the decoder. Our method regularizes the codes directly so that each latent code component has variance greater than a fixed threshold over a set of sparse representations for a given set of inputs. Furthermore, we explore ways to effectively train sparse coding systems with multi-layer decoders since they can model more complex relationships than linear dictionaries. In our experiments with MNIST and natural image patches, we show that decoders learned with our approach have interpretable features both in the linear and multi-layer case. Moreover, we show that sparse autoencoders with multi-layer decoders trained using our variance regularization method produce higher quality reconstructions with sparser representations when compared to autoencoders with linear dictionaries. Additionally, sparse representations obtained with our variance regularization approach are useful in the downstream tasks of denoising and classification in the low-data regime.

Ablation: No Variance Regularization

Katrina Evtimova

Center for Data Science New York University

Yann LeCun

Courant Institute and Center for Data Science New York University

Meta AI - FAIR

Reviewed on OpenReview:

https: // openreview. net/ forum? id= 4GuIi1jJ74

Sparse representations of images are useful in many computer vision applications. Sparse coding with an l 1 penalty and a learned linear dictionary requires regularization of the dictionary to prevent a collapse in the l 1 norms of the codes. Typically, this regularization entails bounding the Euclidean norms of the dictionary's elements. In this work, we propose a novel sparse coding protocol which prevents a collapse in the codes without the need to regularize the decoder. Our method regularizes the codes directly so that each latent code component has variance greater than a fixed threshold over a set of sparse representations for a given set of inputs. Furthermore, we explore ways to effectively train sparse coding systems with multi-layer decoders since they can model more complex relationships than linear dictionaries. In our experiments with MNIST and natural image patches, we show that decoders learned with our approach have interpretable features both in the linear and multi-layer case. Moreover, we show that sparse autoencoders with multi-layer decoders trained using our variance regularization method produce higher quality reconstructions with sparser representations when compared to autoencoders with linear dictionaries. Additionally, sparse representations obtained with our variance regularization approach are useful in the downstream tasks of denoising and classification in the low-data regime.

Introduction

Finding representations of data which are useful for downstream applications is at the core of machine learning and deep learning research (Bengio et al., 2013). Self-supervised learning (SSL) is a branch of representation learning which is entirely data driven and in which the representations are learned without the need for external supervision such as semantic annotations. In SSL, learning can happen through various pre-text tasks such as reconstructing the input with autoencoders (Vincent et al., 2008), predicting the correct order of frames in a video (Misra et al., 2016; Fernando et al., 2017), enforcing equivariant representations across transformations of the same image (He et al., 2019; Grill et al., 2020; Zbontar et al., 2021; Bardes et al., 2021) or enforcing other invariance properties (Földiák, 1991; Wiskott & Sejnowski, 2002; Hyvärinen et al., 2003; Chen et al., 2018). The pre-text tasks should be designed to extract data representations useful in other downstream tasks.

We focus on learning representations of images through sparse coding , a self-supervised learning paradigm based on input reconstruction (Olshausen & Field, 1996). A representation (or code ) in the form of a feature vector z ∈ R l is considered sparse if only a small number of its components are active for a given data sample. Sparsity introduces the inductive bias that only a small number of factors are relevant for a single data point kve216@nyu.edu

yann@cs.nyu.edu

and it is a desirable property for data representations as it can improve efficiency and interpretability of the representations (Bengio, 2009; Bengio et al., 2013). While it is an open debate, sparse representations of images have been shown to improve classification performance (Ranzato et al., 2007; Kavukcuoglu et al., 2010; Makhzani & Frey, 2013) and are popular for denoising and inpainting applications (Elad & Aharon, 2006; Mairal et al., 2007). They are biologically motivated as the primary visual cortex uses sparse activations of neurons to represent natural scenes (Olshausen & Field, 1996; Yoshida & Ohki, 2020).

In sparse coding, the goal is to find a latent sparse representation z ∈ R l which can reconstruct an input y ∈ R d using a learned decoder D . Typically, the decoder is linear, namely D ∈ R d × l , and the sparse code z selects a linear combination of a few of its columns to reconstruct the input. In practice, a sparse code z is computed using an inference algorithm which minimizes z 's l 1 norm and simultaneously optimizes for a good reconstruction of the input y using D 's columns. Learning of the decoder D typically happens in an iterative EM-like fashion. Inference provides a set of sparse codes corresponding to some inputs (expectation step). Given these codes, the decoder's parameters are updated using gradient descent to minimize the error between the original inputs and their reconstructions (maximization step). Sparse coding systems in which the decoder is non-linear are hard to learn and often require layer-wise training (Zeiler et al., 2010; He et al., 2014). In this work, we explore ways to effectively train sparse coding systems with non-linear decoders using end-to-end learning.

In order to successfully train a sparse coding system which learns a linear dictionary D , it is common to bound the norms of the dictionary's weights in order to avoid collapse in the l 1 norms of the codes. Indeed, consider a reconstruction ˜ y obtained from a dictionary D and code z , namely ˜ y = D z . The same reconstruction can be obtained from a re-scaled dictionary D ′ = c D and code z ′ = z /c for any non-zero multiple c . When c > 1, the representation z ′ has a smaller l 1 norm than z but produces the same reconstruction. In practice, if D 's weights are unbounded during training, they become arbitrarily large, while the l 1 norms of the latent representations become arbitrarily small which leads to a collapse in their l 1 norm. Typically, in order to restrict a linear dictionary D , its columns are re-scaled to have a constant l 2 norm (Lee et al., 2006) or weight decay is applied to its weights (Sun et al., 2018). However, in the case when D is a multi-layer neural network trained end-to-end, it is not clear what the best strategy to bound the decoder's weights is in order to avoid collapse in the codes and at the same time learn useful features.

Instead of restricting D 's weights, we propose to prevent collapse in the l 1 norm of the codes directly by applying variance regularization to each latent code component. In particular, we add a regularization term to the energy minimized during inference for computing a set of codes. This term encourages the variance of each latent code component to be greater than a fixed lower bound. This strategy is similar to the variance principle presented in Bardes et al. (2021) and related to methods that directly regularize the latent codes (Földiak, 1990; Olshausen & Field, 1996; Barello et al., 2018). Our strategy ensures that the norms of the sparse codes do not collapse and we find that it removes the need to explicitly restrict D 's parameters. Since this variance regularization term is independent from D 's architecture, we can use it when the decoder is a deep neural network. In this work, we experiment with fully connected sparse autoencoders with a one hidden layer decoder and successfully train such models using our variance regularization strategy.

The main contributions of our work are:

· We introduce an effective way to train sparse autoencoders and prevent collapse in the l 1 norms of the codes that they produce without the need to regularize the decoder's weights but instead by encouraging the latent code components to have variance above a fixed threshold. · We show empirically that linear decoders trained with this variance regularization strategy are comparable to the ones trained using the standard l 2 -normalization approach in terms of the features that they learn and the quality of the reconstructions they produce. Moreover, variance regularization successfully extends sparse coding to the case of a non-linear fully connected decoder with one hidden layer. · Additionally, our experiments show that sparse representations obtained from autoencoders with a nonlinear decoder trained using our proposed variance regularization strategy: 1) produce higher quality reconstructions than ones obtained from autoencoders with a linear decoder, 2) are helpful in the downstream task of classification in the low-data regime.

Figure 1: Our sparse autoencoder setup. C denotes mean squared error. The inference step in Figure 1a finds a code z ∗ which minimizes the reconstruction error between the input y and its reconstruction ˜ y . The code is subject to sparsity and variance regularization, denoted by R ( z ). Additionally, the code is regularized to stay close to the encoder's predictions z E . During the learning step in Figure 1b, the encoder E is trained to predict the codes z ∗ computed during inference and the decoder D learns to reconstruct inputs y from the sparse codes z ∗ .

Method

A schematic view of our sparse autoencoder setup is presented in Figure 1. At a high level, given an input y and a fixed decoder D , we perform inference using a modified version of the FISTA algorithm (Beck & Teboulle, 2009) to find a sparse code z ∗ which can best reconstruct y using D 's elements. The decoder's weights are trained by minimizing the mean squared error between the input y and the reconstruction ˜ y computed from z ∗ . The encoder E is trained to predict the codes z ∗ computed during inference. More details on each of the components in Figure 1 are presented in the following sections.

Background: Sparse Coding and Dictionary Learning

Inference Sparse coding algorithms with an l 1 sparsity penalty and a linear decoder D ∈ R d × l perform inference to find a latent sparse representation z ∈ R l of a given data sample y ∈ R d which minimizes the following energy function:

The first term in (1) is the reconstruction error for the sample y using code z and decoder D . The second term is a regularization term which penalizes the l 1 norm of z . In essence, each code z selects a linear combination of the decoder's columns. The decoder can thus be thought of as a dictionary of building blocks. Inference algorithms find a solution for the minimization problem in (1) keeping D fixed, namely

Learning the Decoder The decoder's parameters can be learned through gradient-based optimization by minimizing the mean squared reconstruction error between elements in the training set Y = { y 1 , . . . , y N } ⊂ R d and their corresponding reconstructions from the sparse representations Z ∗ = { z ∗ 1 , . . . , z ∗ N } ⊂ R l obtained during inference, namely:

where D 's columns are typically projected onto the unit sphere to avoid collapse in the latent codes, namely ‖D : ,i ‖ 2 2 = 1 for D : ,i ∈ R d .

Background: Inference with ISTA and FISTA

We provide an overview of the Iterative Shrinkage-Thresholding Algorithm (ISTA) and its faster version FISTA which we adopt in our setup. ISTA is a proximal gradient method which performs inference to find a sparse code z ∗ which minimizes the energy in (1) for a given input y and a fixed dictionary D . We can re-write the inference minimization problem as:

where f ( z ) := 1 2 ‖ y -D z ‖ 2 2 is the reconstruction term and g ( z ) := λ ‖ z ‖ 1 is the l 1 penalty term. In ISTA, the code z can be given any initial value and is typically set to the zero vector: z (0) = 0 ∈ R l . An ISTA iteration consists of two steps: a gradient step and a shrinkage step.

Gradient Step To compute code z ( k ) from its previous iteration z ( k -1) , where k ∈ { 1 , . . . , K } , ISTA first performs the gradient update by taking gradient step of size η k with respect to the squared error between the input and the reconstruction from z ( k -1) :

The size of the gradient step η k at iteration k can be determined with backtracking (Beck & Teboulle, 2009).

Shrinkage Step The shrinkage step is applied to the intermediate output from the gradient step in ISTA. It uses the shrinkage function τ α which sets components of its input to 0 if their absolute value is less than α or otherwise contracts them by α , namely [ τ α ( x )] j = sign( x j )( | x j | -α ) + for some threshold α > 0 where j traverses the components of its input vector x . The shrinkage step for ISTA is given by:

where λ > 0 is the hyperparameter which determines the sparsity level. In the case when the codes are restricted to be non-negative, a property that we adopt in our setup, the shrinkage function can be modified to set all negative components in its output to 0: [˜ τ α ( x )] j = ([ τ α ( x )] j ) + .

FISTA In our setup, we use FISTA (Beck & Teboulle, 2009) which accelerates the convergence of ISTA by adding a momentum term in each of its iterations. Namely, z ( k -1) in (4) is replaced by:

for k ≥ 2 where t k = 1+ √ 1+4 t 2 k -1 2 and t 1 = 1. Thus, the FISTA gradient step becomes:

FISTA employs the same shrinkage step as ISTA.

Modified Inference with Variance Regularization on the Latent Code Components

In order to prevent collapse in the l 1 norm of the latent codes, we propose to ensure that the variance of each latent code component stays above a pre-set threshold. To achieve this, we add a regularization term to the energy in (3) which encourages the variance of all latent component across a mini-batch of codes to be greater than a pre-set threshold. In particular, we replace the function f ( z ) = 1 2 ‖ y -D z ‖ 2 2 in (3) with a new function ˜ f ( Z ) defined over a set of codes Z ∈ R l × n corresponding to a mini-batch of data samples Y = { y 1 , . . . , y n } as:

The first sum in (8) is over the reconstruction terms from each code Z : ,i ∈ R l . The second sum in (8) is over squared hinge terms involving the variance of each latent component Z j, : ∈ R n across the batch where Var( Z j, : ) = 1 n -1 n ∑ i =1 ( Z j,i -µ j ) 2 and µ j is the mean across the j -th latent component, namely µ j = 1 n ∑ n i =1 Z j,i . The hinge terms are non-zero for any latent dimension whose variance is below the fixed threshold of √ T .

Modified Energy: Variance Regularization Using ˜ f ( Z ) from (8) and setting ˜ g ( Z ) := n ∑ i =1 g ( Z : ,i ), our modified objective during inference is to minimize the energy:

with respect to the codes Z for a mini-batch of samples Y and a fixed dictionary D . Note that the hinge terms counteract with the l 1 penalty terms in (9). The hinge terms act as regularization terms which encourage the variance of each of the latent code components to remain above the threshold of √ T . This should prevent a collapse in the l 1 norm of the latent codes directly and thus remove the need to normalize the weights of the decoder.

Gradient Step The first step in our modified version of FISTA is to take a gradient step for each code Z : ,t ∈ R l :

In practice, since we use FISTA, the code Z ( k -1) : ,t in (10) is replaced by a momentum term as defined in (6) but here we omit this notation for simplicity. The gradient of the sum of reconstruction terms in (8) with respect to Z s,t , one of the latent components of code Z : ,t , for a linear decoder D is:

where D s denotes the s -th column of the dictionary D . 1 The gradient of the sum of hinge terms in (8) with respect to Z s,t is:

1 The gradient for any neural network decoder D can be computed through automatic differentiation.

Even though the hinge terms in (8) are not smooth convex functions 2 , the fact that the gradient in (12) is a line implies that the hinge term behaves locally as a convex quadratic function. A visualization of h ( x ) = [( 1 -√ Var( x ) ) + ] 2 for x ∈ R 2 and the full derivation of (12) are available in Appendix A.2 and A.3, respectively.

Encoder for Amortized Inference

One limitation of the proposed variance regularization is that the modified inference procedure depends on batch statistics (see equation 12) and thus the codes are not deterministic. To address this limitation, we propose to train an encoder E simultaneously with the decoder D to predict the sparse codes computed from inference with our modified FISTA. After training of the encoder and decoder is done, the encoder can be used to compute sparse codes for different inputs independently from each other in a deterministic way. Additionally, using an encoder reduces the inference time: instead of performing inference with FISTA, the encoder computes sparse representations directly from the inputs and thus provides amortized inference .

Learning the Encoder The encoder is trained using mini-batch gradient descent to minimize the following mean-squared error between its predictions and the sparse codes Z ∗ ∈ R l × n computed from inference with FISTA for inputs Y = { y 1 , . . . , y n } :

Note that the codes Z ∗ are treated as constants. Training the encoder this way is similar to target propagation (Lee et al., 2015).

Modified Energy: Variance and Encoder Regularization To encourage FISTA to find codes which can be learned by the encoder, we further modify the inference protocol by adding a term to ˜ f ( Z ) which penalizes the distance between FISTA's outputs and the encoder's outputs:

This regularization term ensures that the codes computed during inference with FISTA do not deviate too much from the encoder's predictions. In our setup, the encoder's predictions are treated as constants and are used as the initial value for codes in FISTA, namely Z (0) : ,i := E ( y i ). This can reduce the inference time by lowering the number of FISTA iterations needed for convergence if the encoder provides good initial values. With the new ˜ f ( Z ) from equation 14, the energy minimized during inference becomes:

2 A requirement for the FISTA algorithm.

Experimental Setup

Encoder and Decoder Architectures

LISTA Encoder The encoder's architecture is inspired by the Learned ISTA (LISTA) architecture (Gregor & LeCun, 2010). LISTA is designed to mimic outputs from inference with ISTA and resembles a recurrent neural network. Our encoder consists of two fully connected layers U ∈ R d × l and S ∈ R l × l , a bias term b ∈ R l , and ReLU activation functions. Algorithm 1 describes how the encoder's outputs are computed. Our version of LISTA produces non-negative codes by design. We refer to our encoder as LISTA in the rest of the paper.

Algorithm 1 LISTA encoder E : forward pass

Input: Image y ∈ R d , number of iterations L Parameters: U ∈ R d × l , S ∈ R l × l , b ∈ R l Output: sparse code z E ∈ R l u = Uy + b z 0 = ReLU( u ) for i = 1 to L do z i = ReLU( u + Sz i -1 ) end for z E = z L

Linear Decoder A linear decoder D is parameterized simply as a linear transformation W which maps codes in R l to reconstructions in the input data dimension R d , i.e. W ∈ R d × l . No bias term is used in this linear transformation.

Non-Linear Decoder In the case of a non-linear decoder D , we use a fully connected network with one hidden layer of size m and ReLU as an activation function. We refer to the layer which maps the input codes to hidden representations as W 1 ∈ R m × l and the one which maps the hidden representations to the output (reconstructions) as W 2 ∈ R d × m . There is a bias term b 1 ∈ R m following W 1 and no bias term after W 2 . Given an input z ∈ R l , the decoder's output is computed as: ˜ y = W 2 ( W 1 z + b 1 ) + .

Evaluation

We compare sparse autoencoders trained using our variance regularization method to sparse autoencoders trained using other regularization methods. We refer to our approach as variance-regularized dictionary learning ( VDL ) in the case when the decoder is a linear dictionary and as VDL-NL when the decoder is non-linear. We refer to the traditional approach in which the l 2 norm of the decoder's columns is fixed as standard dictionary learning ( SDL ) in the case the decoder is linear and as SDL-NL in the case it is non-linear - a natural extension of SDL to the case of non-linear decoders is to fix the l 2 norms of the columns of both layers W 1 and W 2 . Additionally, we compare our variance regularization method to sparse autoencoders in which only weight decay is used to avoid collapse which we refer to as WDL and WDL-NL in the case of a linear and a non-linear decoder, respectively. Finally, we consider a decoder-only setup in which the standard approach of fixing the l 2 norm of the decoder's columns is applied and the codes are computed using inference with FISTA. We refer to this setup as DO when the decoder is linear and as DO-NL when the decoder is non-linear.

In order to compare the different models, we train them with different levels of sparsity regularization λ , evaluate the quality of their reconstructions in terms of peak signal-to-noise ratio (PSNR), and visualize the features that they learn. We evaluate models with a linear decoder on the downstream task of denoising. Additionally, we measure the pre-trained models' performance on the downstream task of MNIST classification in the low data regime.

Data Processing

MNIST In the first set of our experiments, we use the MNIST dataset (LeCun & Cortes, 2010) consisting of 28 × 28 hand-written digits and do standard pre-processing by subtracting the global mean and dividing by the global standard deviation. We split the training data randomly into 55000 training samples and 5000 validation samples. The test set consists of 10000 images.

Natural Image Patches For experiments with natural images, we use patches from ImageNet ILSVRC2012 (Deng et al., 2009). We first convert the ImageNet images to grayscale and then do standard pre-

processing by subtracting the global mean and dividing by the global standard deviation. We then apply local contrast normalization to the images with a 13 × 13 Gaussian filter with standard deviation of 5 following Jarrett et al. (2009) and Zeiler et al. (2011). We use 200000 randomly selected patches of size 28 × 28 for training, 20000 patches for validation, and 20000 patches for testing.

Inference and Training

We determine all hyperparameter values through grid search. Full training details can be found in Appendix B.1.

Inference The codes Z are restricted to be non-negative as described in section 2.2. The dimension of the latent codes in our MNIST experiments is l = 128 and in experiments with ImageNet patches it is l = 256. The batch size is set to 250 which we find sufficiently large for the regularization term on the variance of each latent component in (8) for VDL and VDL-NL models. We set the maximum number of FISTA iterations K to 200 which we find sufficient for good reconstructions.

Implementation and Hardware Our PyTorch implementation is available on GitHub at https:// github.com/kevtimova/deep-sparse . We train our models on one NVIDIA RTX 8000 GPU card and all our experiments take less than 24 hours to run.

Results and Analysis

Visualizing Learned Features

Linear Decoders Figures 2 shows the dictionary elements for two SDL and two VDL decoders trained with different levels of sparsity regularization λ . As evident from Figures 2a and 2b we confirm that, in the case of standard dictionary learning on MNIST, the dictionary atoms look like orientations, strokes, and parts of digits for lower values of the sparsity regularization λ and they become looking like full digit prototypes as λ increases (Yu, 2012). A similar transition from strokes to full digit prototypes is observed in figures 2c and 2d displaying the dictionary elements of models trained using variance regularization in the latent code components. The same trend is observed in dictionary elements from WDL and DO models (Figure 12 in the Appendix). Figure 3 shows dictionary elements for two SDL and two VDL models trained on ImageNet patches. All of these models contain gratings and Gabor-like filters. We observe that models which are trained with a lower sparsity regularization parameter λ (Figures 3a and 3c) contain more high frequency atoms than ones trained with higher values of λ (Figures 3b and 3d).

Non-linear Decoders Each column W : ,i ∈ R d in a linear dictionary D (parameterized by W ) reflects what a single latent component encodes since W : ,i = W e ( i ) where e ( i ) ∈ R l is a one-hot vector with 1 in position i and 0s elsewhere. We use a similar strategy to visualize the features that non-linear decoders learn. Namely, we provide codes e ( i ) with only one non-zero component as inputs to the non-linear decoders and display the resulting 'atoms' in image space. 4 While in the linear case each code component selects a single dictionary column, in the case of non-linear decoders with one hidden layer, each code component selects a linear combination of columns in W 2 . Two SDL-NL and two VDL-NL models trained on MNIST with different levels of sparsity regularization λ yield the atoms displayed in Figure 4. They contain strokes and orientations for smaller values of λ and fuller digit prototypes for larger values of λ and are very similar to the dictionary elements from linear decoders in Figure 2. We use the same strategy to visualize the features that non-linear decoders trained on ImageNet patches learn. These features are displayed in Figure 5 for

3 As a reminder, codes are computed using amortized inference with the encoder during validation.

4 In practice, we remove the effect of the bias term b 1 after W 1 and display D ( e ( i ) ) -D ( 0 ).

Figure 2: Dictionary elements for linear dictionaries from SDL and VDL models with latent dimension l = 128 and different levels of sparsity regularization λ . The SDL models in (2a) and (2b) produce reconstructions with PSNR of 21.1 and 18.6 for codes with average sparsity level of 63% and 83% on the test set, respectively. The VDL models in (2c) and (2d) produce reconstructions with PSNR of 20.7 and 17.7 for codes with average sparsity level of 69% and 91 . 8% on the test set, respectively.

Figure 3: Dictionary elements which resemble gratings and Gabor filters for SDL and VDL models with latent dimension l = 256 trained on ImageNet patches with different levels of sparsity regularization λ . The SDL models in 3a and 3b produce reconstructions with average PSNR of 26 . 9 and 24 . 0 and average sparsity level in the codes of 74 . 6% and 90 . 5% on the test set, respectively. The VDL models in Figures 3c and 3d produce reconstructions with average PSNR of 27 . 3 and 25 . 3 and average sparsity level in the codes of 77 . 3% and 89 . 2% on the test set, respectively.

SDL-NL and VDL-NL models trained with different levels of sparsity regularization λ . As in the case of linear decoders, we observe that both SDL-NL and VDL-NL models learn gratings and Gabor-like filters.

Encoders Figure 14 in the Appendix shows visualizations of the U ∈ R d × l layer of the LISTA encoder in SDL, VDL, and WDL models with code dimension l = 128 trained on MNIST images and different levels of sparsity regularization λ . Similarly to the decoders in 11, the features learned by the encoders resemble orientations and parts of digits but in the case of WDL there is a larger proportion of noisy features compared to the other models as λ increases.

Reconstruction Quality

5 As a reference for reconstruction quality in terms of PSNR, the second row of images in Figure 7a and 7b contains reconstructions of the images in the top row with PSNR of around 17.3 and 17.7, respectively.

Figure 4: Visualizing reconstructions from SDL-NL and VDL-NL decoders trained on MNIST and input codes in which only one component is active and the rest are zero. The models are trained with different levels of sparsity regularization λ . Each code component encodes stokes, orientations, parts of digits, and full digit prototypes very similar to those in Figure 2. The SDL-NL models in 4a and 4b produce reconstructions with average PSNR of 20 . 9 and 18 . 4 and average sparsity level in the codes of 69 . 2% and 85 . 6% on the test set, respectively. The VDL-NL models in 4c and 4d produce reconstructions with average PSNR of 22 . 7 and 18 . 3 and average sparsity level in the codes of 58 . 8% and 92 . 2% on the test set, respectively.

Figure 5: Visualizing reconstructions from SDL-NL and VDL-NL decoders trained on ImageNet patches and input codes in which only one code component is active and the rest are zero. The models are trained with different levels of sparsity regularization λ . Code components encode gratings and Gabor-like filters. The SDL-NL models in (5a) and (5b) produce reconstructions with PSNR of 27.1 and 25.0 for codes with average sparsity level of 69 . 8% and 83 . 4%, respectively. The VDL-NL models in (5c) and (5d) produce reconstructions with PSNR of 28.4 and 24.9 for codes with average sparsity level of 67 . 6% and 90 . 8%, respectively.

other models for higher levels of sparsity. This implies that VDL-NL models can better utilize the additional representational capacity from the larger number of layers and parameters in the non-linear decoder compared to the other models when the sparsity level is high.

Denoising

We evaluate the performance of SDL and VDL autoencoders on the downstream task of denoising. In this setup, we corrupt the input images by adding Gaussian noise and proceed with computing their corresponding sparse codes through amortized inference using the encoder. We then feed these codes to the decoder to obtain reconstructions of the noisy inputs.

Figure 7 shows the reconstructions of MNIST test set images and of the same images corrupted with Gaussian noise with zero mean and standard deviations σ = 1 and σ = 1 . 5 using the SDL and VDL models in Figures 2b and 2d. We observe that both SDL and VDL models are able to produce reconstructions which are closer to samples from the manifold of training data than to the noisy inputs. This implies that the sparse coding systems have learned to detect features useful for the data they are trained on and cannot reconstruct random noise, as desired. SDL and VDL models trained on ImageNet patches are also robust to the addition of Gaussian noise to their inputs. This is evident in Figure 8 which contains reconstructions ˜ y and ˜ y σ of

Figure 6: Trade-off between sparsity measured by average percentage of inactive (zero) code components and reconstruction quality measured by PSNR for different models. Figure (6a) plots regret curves for SDL, SDL-NL, VDL, VDL-NL, WDL, WDL-NL, DO and DO-NL models trained on MNIST with code dimension l = 128 and different levels of sparsity regularization λ . Figure (6b) plots regret curves for SDL, SDL-NL, VDL, and VDL-NL models trained on ImageNet patches with code dimension l = 256 and different levels of sparsity regularization λ . Higher PSNR reflects better reconstruction quality.

Figure 7: Examples of denoising abilities of encoders from SDL and VDL models trained on MNIST which produce codes with average sparsity of 89 . 7% and 91 . 8%, respectively. Top two rows: original inputs from the test set (top) and their reconstructions (bottom); middle two rows: the inputs corrupted with Gaussian noise with standard deviation σ = 1 (top) and their reconstructions (bottom), bottom two rows: the inputs corrupted with Gaussian noise with standard deviation σ = 1 . 5 (top) and their reconstructions (bottom).

images y and their noisy version y σ for the SDL and VDL models from Figures 3a and 3c which produce codes with similar sparsity on average.

Table 1 shows a summary of the reconstruction quality of images with varying levels of Gaussian noise corruption when they are passed through SDL and VDL autoencoders. The results are evaluated on the test set over 5 random seeds. Both the SDL and VDL encoders are able to produce codes from corrupted inputs y σ which, when passed through their corresponding decoders, produce outputs ˜ y σ that are less noisy than the corrupted inputs. This observation is significant since the sparse autoencoders in our experiments are not explicitly trained to perform denoising.

Table 1: The performance of SDL and VDL models trained on MNIST and ImageNet patches with different levels of sparsity regularization λ on the the task of denoising inputs corrupted with Gaussian noise with zero mean and standard deviation σ = 1 . 0 or σ = 1 . 5. We report the average sparsity of codes computed with these models and the PSNR of: the reconstruction ˜ y of the original input y , the corrupted input y σ , and the reconstruction ˜ y σ of the corrupted input y σ . Evaluated on the test set over 5 random seeds.

Ablation: No Variance Regularization

VDL and VDL-NL experiments from Figure 6 show that our proposed variance regularization strategy can successfully be used to train sparse coding models without collapse in the l 1 norms of the sparse codes. In experiments without our variance regularization strategy or any regularization restricting the dictionary's weights, we do observe collapse, both for models with a linear and a non-linear decoder. This shows that our proposed variance regularization protocol is an effective alternative to the standard l 2 normalization or weight decay regularization for preventing collapse, both in the case of models with a linear decoder and with a fully connected decoder with one hidden layer.

Classification in the Low Data Regime~~

We investigate whether self-supervised pre-training with sparse coding improves classification performance over training from scratch in the low data regime. For this purpose, we compare the classification error of classifiers trained on the raw MNIST data to the classification error of linear classifiers trained on features from pre-trained frozen LISTA encoders in SDL, SDL-NL, VDL, and VDL-NL autoencoders 6 and pre-trained frozen DO and DO-NL decoders when few training samples are available for supervised learning.

6 The autoencoders are trained on the full MNIST dataset in an unsupervised way as described in section 1.

Figure 9: Linear separability (measured by classification error) of the raw MNIST data and of features from pre-trained SDL, SDL-NL, VDL, VDL-NL encoders and decoder-only DO and DO-NL models with different numbers of training samples per class. Additionally, classification error of a classifier consisting of a LISTA enocder followed by a linear layer (LISTA + Linear) trained from scratch on the raw MNIST data. Models trained with variance regularization in the codes and a non-linear decoder (VDL-NL) outperform other models in linear classification with up to 10 training samples per class. Test set performance is reported over 5 random seeds.

ones presented in Figure 6a which gives the best validation accuracy. As baselines, we train two classifiers from scratch on the raw MNSIT data. The first one is a linear classifier referred to as 'Linear on raw'. The second one consists of a linear layer on top of a randomly initialized LISTA encoder 7 and is referred to as 'LSITA + Linear on raw'.

Figure 9 summarizes the top-1 and top-3 classification errors of these classifiers on the test set evaluated over 5 random seeds. With small exceptions 8 , all linear classifiers trained on top of the pre-trained LISTA features give better top-1 and top-3 classification performance than classifiers trained on the raw MNIST data with up to 100 samples per class. This observation supports a claim made by Raina et al. (2007) that SSL is expected to be useful especially when labeled data is scarce. Notably, classifiers trained on top of VDL-NL features have the best performance among the rest of the models with up to 10 training samples per class. In other words, sparse autoencoders with a non-linear decoder pre-trained using our proposed variance regularization on the latent codes can boost the classification performance when few labeled training samples are available.

Our work is related to the extensive literature on sparse coding and dictionary learning. Mairal et al. (2014) provide a comprehensive overview of sparse modeling in computer vision, discuss its relation to the fields of statistics, information theory, and signal processing, as well as its applications in computer vision including image denoising, inpainting, and super-resolution.

7 This randomly initialized LISTA encoder is trainable and with the same architecture as the one in Algorithm 1.

8 Namely, classifiers trained on DO, DO-NL and SDL features with 1 or 2 training samples per class.

incorporate in our model. In the spiking model of Földiak (1990), an anti-Hebbian adaptive thresholding mechanism encourages each latent component to be active with a pre-set level of probability. Our variance regularization strategy is also related to Olshausen & Field (1996; 1997) where the norm of the basis elements is adjusted adaptively over time so that each of the latent components has a desired level of variance. Their method works well in the linear dictionary setup but it is not directly extendable to the case of a nonlinear decoder. In variational inference models such as VAE Kingma & Welling (2013) and its sparse coding extensions such as SVAE Barello et al. (2018) and Tonolini et al. (2020), the latent codes have a pre-set level of variance determined by the prior distribution. In SVAE, there is also variance regularization term which determines the trade-off between the input reconstruction and the sparsity level imposed by the prior. Our variance regularization term is most closely related to the variance principle presented in Bardes et al. (2021) in which each latent component is encouraged to have variance above a pre-set threshold.

Sparse Coding and Classification Performance There is an active area of research which studies whether sparse coding is useful in classification tasks. Some existing literature supports this claim Raina et al. (2007); Ranzato et al. (2007); Mairal et al. (2008a;b); Yang et al. (2009); Kavukcuoglu et al. (2010); Makhzani & Frey (2013); He et al. (2014) while other research refutes it Coates & Ng (2011); Rigamonti et al. (2011); Luther & Seung (2022). Tang et al. (2020) show that adding layers to a hierarchy of sparse dictionaries can improve classification accuracy. In our work, we provide empirical evidence that sparse coding models trained with a reconstruction objective, especially models with a non-linear decoder trained using our variance regularization method, can give better classification performance than classifiers trained on the raw data when very few training samples are available, supporting a claim made by Raina et al. (2007) that SSL is expected to be useful when labeled data is scarce.

Sparsity Constraints Sparse representations can be obtained using many different objectives. The l 1 penalty term in equation 1 can be replaced by other sparsity-inducing constraints including l 0 , l p (0 < p < 1), l 2 , or l 2 , 1 norms. In particular, the l 1 term is a convex relaxation of the non-differentiable l 0 constraint of the form ‖ z ‖ 0 ≤ k requiring z to have at most k active components. Zhang et al. (2015) provide a survey of existing algorithms for sparse coding such as K-SVD which generalizes K-means clustering (Aharon et al., 2006). Makhzani & Frey (2013) propose an approach to train a sparse autoencoder with a linear encoder and decoder in which only the top k activations in the hidden layer are kept and no other sparsifying penalty is used. In our model, we modify the commonly used sparse coding setup with l 1 penalty term in 1 by including a term which encourages a pre-set level of variance across each latent component in 9.

Inference There are many variations of the ISTA inference algorithm which address the fact that it is computationally expensive and are designed to speed it up, such as FISTA (Beck & Teboulle, 2009) which is used in our setup (for details please refer to section 2.2). Gated LISTA (Wu et al., 2019) introduces a novel gated mechanism for LISTA (Gregor & LeCun, 2010) which enhanses its performance. In Zhou et al. (2018), it is shown that deep learning with LSTMs can be used to improve learning sparse codes in ISTA by modeling the history of the iterative algorithm and acting like a momentum. Inspired by LISTA, we train an encoder to predict the non-negative representations computed during inference using the original and modified FISTA protocols as explained in section 2.4.

Multi-layer Sparse Coding There exist methods which greedily construct hierarchies of sparse representations. Zeiler et al. (2010) use ISTA in a top-down convolutional multi-layer architecture to learn sparse latent representations at each level in a greedy layer-wise fashion. Other convolutional models which produce a hierarchy of sparse representations are convolutional DBNs (Lee et al., 2009) and hierarchical convolutional factor analysis (Chen et al., 2013) which also use layer-wise training. Other greedy approaches are proposed by He et al. (2014) and Tariyal et al. (2016) who train multiple levels of dictionaries. Chen et al. (2018) combine sparse coding and manifold learning in a multi-layer network which is also trained layer-wise. In contrast, our work is about finding sparse representations by utilizing multi-layer decoders which are trained end-to-end. Hu & Tan (2018) propose a non-linear sparse coding method in which a linear dictionary is trained in the encoding space of a non-linear autoencoder but it differs from our method as the sparse representations of this encoding layer are stand-alone, i.e. they are not fed as inputs to the multi-layer decoder. Other multi-layer sparse coding models include Sun et al. (2018), Mahdizadehaghdam et al. (2019), and Tang et al. (2020) in which linear dictionaries are stacked at different scales. Rather than using linear dictionaries, our method learns a non-linear decoder to decode sparse codes.

Conclusion

In this work, we revisit the traditional setup of sparse coding with an l 1 sparsity penalty. We propose to apply variance regularization to the sparse codes during inference with the goal of training non-linear decoders without collapse in the l 1 norm of the codes. We show that using our proposed method we can successfully train sparse autoencoders with fully connected multi-layer decoders which have interpretable features, outperform models with linear decoders in terms of reconstruction quality for a given average level of sparsity in the codes, and improve MNIST classification performance in the low data regime. Future research directions include scaling up our method to natural images using deep convolutional decoders.

Broader Impact Statement

This work proposes a general-purpose machine learning algorithm for representation learning which is trained using large amounts of unlabeled data. The representations learned by this algorithm reflect any biases present in the training data. Therefore, practitioners should apply techniques for identifying and mitigating the biases in the data when applicable. This notice is doubly important since the learned representations can be incorporated in numerous real-world applications such as medical diagnosis, surveillance systems, autonomous driving, and so on. These applications can have both positive and negative effect on society, and warrant extensive testing before being deployed in order to prevent harm or malicious use.

Acknowledgments

We would like to thank our colleagues at the NYU CILVR Lab and NYU Center for Data Science for insightful discussions, and to the TMLR anonymous reviewers and action editor for their feedback. Part of this research was conducted while Katrina was interning at FAIR. This work is supported in part by AFOSR award FA9550-19-1-0343 and by the National Science Foundation under NSF Award 1922658.

References

Kai Yu. Tutorial on deep learning: Sparse coding. CVPR , 2012.

Appendix

Notation

Table 2 contains descriptions of the notation and symbols used in this work.

Variance Regularization Term

Figure 10 visualizes the variance regularization term h ( x ) = [( 1 -√ Var( x ) ) + ] 2 for x ∈ R 2 in 3D.

Gradient Derivation

We compute the gradient of the hinge term in (9) with respect to one latent code's component Z s,t :

Table 2: Descriptions of symbols and notation used in the paper.

As a reminder of our notation, l is the latent dimension, n is the batch size, Z ∈ R l × n stores the codes for all elements in a batch, and Z j, : ∈ R n stores code values for the j -th latent component.

Now, continuing with:

Figure 10: Visualizing the variance regularization function h ( x ) = [( 1 -√ Var( x ) ) + ] 2 for x ∈ R 2 in 3D.

Note that:

The gradient in (16) is non-zero only for j = s and √ Var( Z s, : ) < T . Thus,

/negationslash

and using the result in (22) we conclude that:

Additional Visualizations

Figures 11 and 13 display 128 of the dictionary elements for the same SDL and VDL models as in Figures 2 and 3. Figure 12 displays 128 of the dictionary elements for the WDL and DO models trained with different levels of sparsity regularization.

Training Details

All hyperparameter values are selected through grid search. For the constant step size η k in FISTA, we consider values in the range of 0 . 01 to 100. In VDL experiments, we consider values for the hinge regularization coefficient β between 1e -1 and 100. In all our experiments, we use Adam (Kingma & Ba, 2014) as an optimizer for both the encoder E and decoder D with a batch size of 250. We use L = 3 iterations in the LISTA encoder (see Algorithm 1). We run the FISTA algorithm for a maximum of 200 iterations. The ‖ z ( k ) -z ( k -1) ‖ ( ∗ )

‖

(

convergence criterion we set is:

FISTA's output is

where

is the index of the first iteration for which the convergence criterion holds. Table 3 contains the hyperparameter values we use in all our experiments expect for the ones with WDL, WDL-NL, DO and DO-NL models. In the case of DO and DO-NL models, we use the same number of epochs, η D , wd( b ), wd( b 1 ), and η k as in SDL and SDL-NL experiments, respectively. In the case of WDL and WDL-NL models, we use grid search to find the optimal weight decay value of 5e -4 (in terms of reconstruction quality regret curve as displayed in Figure 6; this is also the value used in Sun et al. (2018)) and use the same number of epochs, η D , γ , η E , and η k as in SDL and SDL-NL experiments, respectively.

∗

Figure 11: Dictionary elements for linear dictionaries trained on MNIST with latent dimension l = 128 and different levels of sparsity regularization λ . SDL stands for dictionary learning in which the l 2 norm of the decoder's columns is fixed to 1. VDL stands for dictionary learning in which norms of the decoder's columns are not explicitly bounded but our proposed variance regularization on the latent codes is used. The SDL models in 2a and 2b produce reconstructions with average PSNR of 21.1 and 18.6 for codes with average sparsity level of 63% and 83% on the test set, respectively. The VDL models in 2c and 2d produce reconstructions with average PSNR of 20.7 and 17.7 for codes with average sparsity level of 69% and 91 . 8% on the test set, respectively.

Figure 13: Dictionary elements which resemble Gabor filters for SDL and VDL models with latent dimension l = 256 trained on ImageNet patches with different levels of sparsity regularization λ . The SDL models in 3a and 3b produce reconstructions with average PSNR of 26 . 9 and 24 . 0 and average sparsity level in the codes of 74 . 6% and 90 . 5% on the test set, respectively. The VDL models in Figures 3c and 3d produce reconstructions with average PSNR of 27 . 3 and 25 . 3 and average sparsity level in the codes of 77 . 3% and 89 . 2% on the test set, respectively.

Figure 14: Visualization of the U ∈ R d × l layer of the LISTA encoder for SDL, VDL, and WDL models with code dimension l = 128 trained on MNIST images and different levels of sparsity regularization λ . Similarly to the decoders in 11, the features learned by the encoders resemble orientations and parts of digits but in the case of WDL models there are more noisy features compared to the other models as λ increases.

Table 3: Training hyperparameters. SDL indicates models in which columns of the decoder's layers have a fixed norm of 1. VDL indicates models in which variance regularization is applied to the codes during inference. ( ∗ ) means that the learning rate for the decoder η D is annealed by half every 30 epochs. wd stands for weight decay.

MNIST	MNIST	MNIST	MNIST	MNIST	MNIST	MNIST	MNIST
Model	λ	L 0 ( z ˜ y )	PSNR ˜ y	σ	PSNR y σ	L 0 ( z ˜ y σ )	PSNR ˜ y σ
SDL	0 . 005	89 . 7% ± 0 . 2%	17 . 3 ± 0 . 028	1.0	10 . 2	87 . 3% ± 0 . 2%	16 . 9 ± 0 . 053
VDL	0 . 02	91 . 8% ± 0 . 1%	17 . 7 ± 0 . 023	1.0	10 . 2	88 . 1% ± 0 . 2%	15 . 4 ± 0 . 175
SDL	0 . 005	89 . 7% ± 0 . 2%	17 . 3 ± 0 . 028	1.5	6 . 7	84 . 6% ± 0 . 5%	15 . 7 ± 0 . 204
VDL	0 . 02	91 . 8% ± 0 . 1%	17 . 7 ± 0 . 023	1.5	6 . 7	84 . 5% ± 0 . 3%	12 . 2 ± 0 . 275
ImageNet patches	ImageNet patches	ImageNet patches	ImageNet patches	ImageNet patches	ImageNet patches	ImageNet patches	ImageNet patches
SDL	0 . 002	74 . 6% ± 0 . 1%	26 . 9 ± 0 . 007	1.0	20 . 0	72 . 7% ± 0 . 1%	25 . 6 ± 0 . 011
VDL	0 . 005	77 . 3% ± 0 . 2%	27 . 3 ± 0 . 058	1.0	20 . 0	73 . 5% ± 0 . 2%	25 . 2 ± 0 . 035

Notation	Description
E	LISTA-inspired encoder.
D	Decoder: linear or fully connected with one hidden layer.
m	Dimension of the decoder's hidden layer.
λ	Hyperparameter for the sparsity regularization.
β	Hyperparameter for the variance regularization of the latent codes.
γ	Hyperparameter encouraging the FISTA codes to be close to the encoder's predictions.
d	Dimension of the input data.
y	Input sample in R d .
N	Total number of training samples.
n	Mini-batch size.
T	Number of ISTA/FISTA iterations.
η k	Step size for ISTA/FISTA gradient step.
η D	Learning rate for the decoder.
η E	Learning rate for the encoder.
τ α	Shrinkage function [ τ α ( x )] j = sign( x j )(	x j	- α ).
˜ τ α	Non-negative shrinkage function [˜ τ α ( x )] j = ([ τ α ( x )] j ) + .
l	Dimension of the latent code. .
z	Latent code in R l
Z	Mini-batch of latent codes in R l × n .
W	Parametrization in R d × l of a linear dictionary D
W 1	. Parametrization in R m × l of the bottom layer of a fully connected decoder D with one hidden layer.
W 2	Parametrization of a the top layer of a fully connected decoder D with one hidden .
	layer in R d × m
b 1	Parametrization of the bias term in R m following the bottom layer of a fully connected decoder D with one hidden layer.
b	Parametrization of the bias term in the LISTA encoder (see 3.1).
SDL	Standard Dictionary Learning with a LISTA encoder and a linear decoder whose columns have a fixed l 2 norm.
SDL-NL	Standard Dictionary learning with a LISTA encoder and a non-linear fully connected decoder. Each layer in the decoder has columns with a fixed l 2 norm.
VDL	Variance-regularized Dictionary Learning with a LISTA encoder and a linear decoder in which regularization is applied to the sparse codes encouraging the variance across the latent components to be above a fixed threshold.
VDL-NL	Variance-regularized Dictionary Learning (as above) with a non-linear decoder.
WDL	Dictionary Learning with a LISTA encoder and a linear decoder trained with weight decay regularization.
WDL-NL	Dictionary Learning with a LISTA encoder and a non-linear decoder trained with weight decay regularization.
DO	Standard Dictionary Learning without an encoder and with a linear decoder whose columns have a fixed l 2 norm.
DO-NL	Standard Dictionary Learning without an encoder and with a non-linear decoder whose columns have a fixed l 2 norm.

dataset	model	ep	λ	γ	β	T	η D	η E	wd( b 1 )	wd( b )	η k
MNIST	SDL	200	0	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	SDL	200	1e - 4	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	SDL	200	5e - 4	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	SDL	200	1e - 3	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	SDL	200	3e - 3	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	SDL	200	5e - 3	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	VDL	200	0	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	VDL	200	1e - 3	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	VDL	200	3e - 3	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	VDL	200	5e - 3	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	VDL	200	1e - 2	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	VDL	200	2e - 2	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	SDL-NL	200	0	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	SDL-NL	200	1e - 3	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	SDL-NL	200	3e - 3	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	SDL-NL	200	5e - 3	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	SDL-NL	200	1e - 2	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	SDL-NL	200	2e - 2	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	VDL-NL	200	0	100	10	0.5	3e - 4 ∗	1e - 4	1e - 3	0	0.5
MNIST	VDL-NL	200	1e - 3	100	10	0.5	3e - 4 ∗	1e - 4	1e - 3	0	0.5
MNIST	VDL-NL	200	3e - 3	100	10	0.5	3e - 4 ∗	1e - 4	1e - 3	0	0.5
MNIST	VDL-NL	200	5e - 3	100	10	0.5	3e - 4 ∗	1e - 4	1e - 3	0	0.5
MNIST	VDL-NL	200	1e - 2	100 100	10 10	0.5 0.5	3e - 4 ∗ ∗	1e - 4 1e 4	1e - 3 1e 3	0 0	0.5 0.5
MNIST	VDL-NL	200	2e - 2		0	-	3e - 4 1e 3	-	- -	-	0.5
ImageNet	SDL	100	0	1		-	-	1e - 4		1e 2 1e 2
ImageNet	SDL	100	5e - 4	1	0		1e - 3	1e - 4	-	-	0.5
ImageNet	SDL	100	1e - 3	1	0	-	1e - 3	1e - 4	-	1e - 2	0.5
ImageNet	SDL	100	2e - 3	1	0	-	1e - 3	1e - 4	-	1e - 2	0.5
ImageNet	SDL	100	3e - 3	1	0	-	1e - 3	1e - 4	-	1e - 2	0.5 0.5
ImageNet	SDL	100	5e - 3	1	0	-	1e - 3	1e - 4	- -	1e - 2	0.5
ImageNet ImageNet	VDL VDL	100	0	5	10	0.5	3e - 4	1e - 4 4	-	1e - 2
		100	1e - 3	5	10	0.5	3e - 4	1e - 1e 4	-	1e - 2	0.5
ImageNet ImageNet	VDL VDL	100 100	3e - 3 5e - 3	5	10	0.5 0.5	3e - 4	- 1e 4	-	1e - 2 1e 2	0.5
ImageNet	VDL	100	1e - 2	5 5	10	0.5	3e - 4	-		- 1e - 2	0.5 0.5
ImageNet			1 5e 2		10		3e - 4	1e - 4	-
ImageNet	VDL	100	. -	5	10	0.5	3e - 4	1e - 4 1e 4	-	1e - 2	0.5 0.5
ImageNet	SDL-NL SDL-NL	100 100	0	1	0	-	1e - 3	-	1e - 2	1e - 2
			1e - 3	1	0	-	1e - 3	1e - 4	1e - 2	1e - 2	0.5
ImageNet	SDL-NL	100	3e - 3	1	0	-	1e - 3	1e - 4	1e - 2	1e - 2	0.5
ImageNet	SDL-NL	100	5e - 3	1	0	-	1e - 3	1e - 4	1e - 2	1e - 2	0.5
ImageNet ImageNet	SDL-NL SDL-NL	100	8e - 3 1e 2	1	0 0	- -	1e - 3	1e - 4	1e - 2 1e 2	1e - 2 1e 2	0.5 0.5
		100	- 0	1			1e - 3 5e 5 ∗	1e - 4	-	- 1e 2
ImageNet	VDL-NL	100		20	10	0.5	-	1e - 4	1e - 1	1e - 2	0.5
ImageNet ImageNet	VDL-NL	100	1e - 3	20	10 10	0.5 0.5	5e - 5 ∗	1e - 4	1e - 1 1e 1	- 1e 2	0.5 0.5
ImageNet	VDL-NL VDL-NL	100 100	2e - 3 3e - 3	20 20	10	0.5	5e - 5 ∗ 5e - 5 ∗	1e - 4 1e - 4	- 1e - 1	- 1e - 2	0.5
ImageNet	VDL-NL	100	5e - 3	20	10	0.5	5e - 5 ∗	1e - 4	1e - 1	1e - 2	0.5
ImageNet	VDL-NL	100	1e - 2	40	10	0.5	5e - 5 ∗	1e - 4	1e - 1	1e - 2	0.5
ImageNet	VDL-NL	100	2e - 2	40	10	0.5	5e - 5 ∗	1e - 4	1e - 1	1e - 2	0.5

Sparse representations of images are useful in many computer vision applications. Sparse coding with an l1subscript𝑙1l_{1} penalty and a learned linear dictionary requires regularization of the dictionary to prevent a collapse in the l1subscript𝑙1l_{1} norms of the codes. Typically, this regularization entails bounding the Euclidean norms of the dictionary’s elements. In this work, we propose a novel sparse coding protocol which prevents a collapse in the codes without the need to regularize the decoder. Our method regularizes the codes directly so that each latent code component has variance greater than a fixed threshold over a set of sparse representations for a given set of inputs. Furthermore, we explore ways to effectively train sparse coding systems with multi-layer decoders since they can model more complex relationships than linear dictionaries. In our experiments with MNIST and natural image patches, we show that decoders learned with our approach have interpretable features both in the linear and multi-layer case. Moreover, we show that sparse autoencoders with multi-layer decoders trained using our variance regularization method produce higher quality reconstructions with sparser representations when compared to autoencoders with linear dictionaries. Additionally, sparse representations obtained with our variance regularization approach are useful in the downstream tasks of denoising and classification in the low-data regime.

We focus on learning representations of images through sparse coding, a self-supervised learning paradigm based on input reconstruction (Olshausen & Field, 1996). A representation (or code) in the form of a feature vector 𝒛∈ℝl𝒛superscriptℝ𝑙\displaystyle{\bm{z}}\in\displaystyle\mathbb{R}^{l} is considered sparse if only a small number of its components are active for a given data sample. Sparsity introduces the inductive bias that only a small number of factors are relevant for a single data point and it is a desirable property for data representations as it can improve efficiency and interpretability of the representations (Bengio, 2009; Bengio et al., 2013). While it is an open debate, sparse representations of images have been shown to improve classification performance (Ranzato et al., 2007; Kavukcuoglu et al., 2010; Makhzani & Frey, 2013) and are popular for denoising and inpainting applications (Elad & Aharon, 2006; Mairal et al., 2007). They are biologically motivated as the primary visual cortex uses sparse activations of neurons to represent natural scenes (Olshausen & Field, 1996; Yoshida & Ohki, 2020).

In sparse coding, the goal is to find a latent sparse representation 𝒛∈ℝl𝒛superscriptℝ𝑙\displaystyle{\bm{z}}\in\displaystyle\mathbb{R}^{l} which can reconstruct an input 𝒚∈ℝd𝒚superscriptℝ𝑑\displaystyle{\bm{y}}\in\displaystyle\mathbb{R}^{d} using a learned decoder 𝒟𝒟\mathcal{D}. Typically, the decoder is linear, namely 𝒟∈ℝd×l𝒟superscriptℝ𝑑𝑙\mathcal{D}\in\displaystyle\mathbb{R}^{d\times l}, and the sparse code 𝒛𝒛\displaystyle{\bm{z}} selects a linear combination of a few of its columns to reconstruct the input. In practice, a sparse code 𝒛𝒛\displaystyle{\bm{z}} is computed using an inference algorithm which minimizes 𝒛𝒛\displaystyle{\bm{z}}’s l1subscript𝑙1l_{1} norm and simultaneously optimizes for a good reconstruction of the input 𝒚𝒚\displaystyle{\bm{y}} using 𝒟𝒟\mathcal{D}’s columns. Learning of the decoder 𝒟𝒟\mathcal{D} typically happens in an iterative EM-like fashion. Inference provides a set of sparse codes corresponding to some inputs (expectation step). Given these codes, the decoder’s parameters are updated using gradient descent to minimize the error between the original inputs and their reconstructions (maximization step). Sparse coding systems in which the decoder is non-linear are hard to learn and often require layer-wise training (Zeiler et al., 2010; He et al., 2014). In this work, we explore ways to effectively train sparse coding systems with non-linear decoders using end-to-end learning.

In order to successfully train a sparse coding system which learns a linear dictionary 𝒟𝒟\mathcal{D}, it is common to bound the norms of the dictionary’s weights in order to avoid collapse in the l1subscript𝑙1l_{1} norms of the codes. Indeed, consider a reconstruction 𝒚~~𝒚\tilde{\displaystyle{\bm{y}}} obtained from a dictionary 𝒟𝒟\mathcal{D} and code 𝒛𝒛\displaystyle{\bm{z}}, namely 𝒚~=𝒟𝒛~𝒚𝒟𝒛\tilde{\displaystyle{\bm{y}}}=\mathcal{D}\displaystyle{\bm{z}}. The same reconstruction can be obtained from a re-scaled dictionary 𝒟′=c𝒟superscript𝒟′𝑐𝒟\mathcal{D}^{\prime}=c\mathcal{D} and code 𝒛′=𝒛/csuperscript𝒛′𝒛𝑐\displaystyle{\bm{z}}^{\prime}=\displaystyle{\bm{z}}/c for any non-zero multiple c𝑐c. When c>1𝑐1c>1, the representation 𝒛′superscript𝒛′\displaystyle{\bm{z}}^{\prime} has a smaller l1subscript𝑙1l_{1} norm than 𝒛𝒛\displaystyle{\bm{z}} but produces the same reconstruction. In practice, if 𝒟𝒟\mathcal{D}’s weights are unbounded during training, they become arbitrarily large, while the l1subscript𝑙1l_{1} norms of the latent representations become arbitrarily small which leads to a collapse in their l1subscript𝑙1l_{1} norm. Typically, in order to restrict a linear dictionary 𝒟𝒟\mathcal{D}, its columns are re-scaled to have a constant l2subscript𝑙2l_{2} norm (Lee et al., 2006) or weight decay is applied to its weights (Sun et al., 2018). However, in the case when 𝒟𝒟\mathcal{D} is a multi-layer neural network trained end-to-end, it is not clear what the best strategy to bound the decoder’s weights is in order to avoid collapse in the codes and at the same time learn useful features.

Instead of restricting 𝒟𝒟\mathcal{D}’s weights, we propose to prevent collapse in the l1subscript𝑙1l_{1} norm of the codes directly by applying variance regularization to each latent code component. In particular, we add a regularization term to the energy minimized during inference for computing a set of codes. This term encourages the variance of each latent code component to be greater than a fixed lower bound. This strategy is similar to the variance principle presented in Bardes et al. (2021) and related to methods that directly regularize the latent codes (Földiak, 1990; Olshausen & Field, 1996; Barello et al., 2018). Our strategy ensures that the norms of the sparse codes do not collapse and we find that it removes the need to explicitly restrict 𝒟𝒟\mathcal{D}’s parameters. Since this variance regularization term is independent from 𝒟𝒟\mathcal{D}’s architecture, we can use it when the decoder is a deep neural network. In this work, we experiment with fully connected sparse autoencoders with a one hidden layer decoder and successfully train such models using our variance regularization strategy.

The main contributions of our work are:

We introduce an effective way to train sparse autoencoders and prevent collapse in the l1subscript𝑙1l_{1} norms of the codes that they produce without the need to regularize the decoder’s weights but instead by encouraging the latent code components to have variance above a fixed threshold.

We show empirically that linear decoders trained with this variance regularization strategy are comparable to the ones trained using the standard l2subscript𝑙2l_{2}-normalization approach in terms of the features that they learn and the quality of the reconstructions they produce. Moreover, variance regularization successfully extends sparse coding to the case of a non-linear fully connected decoder with one hidden layer.

Additionally, our experiments show that sparse representations obtained from autoencoders with a non-linear decoder trained using our proposed variance regularization strategy: 1) produce higher quality reconstructions than ones obtained from autoencoders with a linear decoder, 2) are helpful in the downstream task of classification in the low-data regime.

A schematic view of our sparse autoencoder setup is presented in Figure 1. At a high level, given an input 𝒚𝒚\displaystyle{\bm{y}} and a fixed decoder 𝒟𝒟\mathcal{D}, we perform inference using a modified version of the FISTA algorithm (Beck & Teboulle, 2009) to find a sparse code 𝒛∗superscript𝒛\displaystyle{\bm{z}}^{} which can best reconstruct 𝒚𝒚\displaystyle{\bm{y}} using 𝒟𝒟\mathcal{D}’s elements. The decoder’s weights are trained by minimizing the mean squared error between the input 𝒚𝒚\displaystyle{\bm{y}} and the reconstruction 𝒚~~𝒚\tilde{\displaystyle{\bm{y}}} computed from 𝒛∗superscript𝒛\displaystyle{\bm{z}}^{}. The encoder ℰℰ\mathcal{E} is trained to predict the codes 𝒛∗superscript𝒛\displaystyle{\bm{z}}^{*} computed during inference. More details on each of the components in Figure 1 are presented in the following sections.

Inference Sparse coding algorithms with an l1subscript𝑙1l_{1} sparsity penalty and a linear decoder 𝒟∈ℝd×l𝒟superscriptℝ𝑑𝑙\mathcal{D}\in\displaystyle\mathbb{R}^{d\times l} perform inference to find a latent sparse representation 𝒛∈ℝl𝒛superscriptℝ𝑙\displaystyle{\bm{z}}\in\displaystyle\mathbb{R}^{l} of a given data sample 𝒚∈ℝd𝒚superscriptℝ𝑑\displaystyle{\bm{y}}\in\displaystyle\mathbb{R}^{d} which minimizes the following energy function:

The first term in (1) is the reconstruction error for the sample 𝒚𝒚\displaystyle{\bm{y}} using code 𝒛𝒛\displaystyle{\bm{z}} and decoder 𝒟𝒟\mathcal{D}. The second term is a regularization term which penalizes the l1subscript𝑙1l_{1} norm of 𝒛𝒛\displaystyle{\bm{z}}. In essence, each code 𝒛𝒛\displaystyle{\bm{z}} selects a linear combination of the decoder’s columns. The decoder can thus be thought of as a dictionary of building blocks. Inference algorithms find a solution for the minimization problem in (1) keeping 𝒟𝒟\mathcal{D} fixed, namely

Learning the Decoder The decoder’s parameters can be learned through gradient-based optimization by minimizing the mean squared reconstruction error between elements in the training set 𝒴={𝒚1,…,𝒚N}⊂ℝd𝒴subscript𝒚1…subscript𝒚𝑁superscriptℝ𝑑\mathcal{Y}={\displaystyle{\bm{y}}{1},\ldots,\displaystyle{\bm{y}}{N}}\subset\displaystyle\mathbb{R}^{d} and their corresponding reconstructions from the sparse representations 𝒵∗={𝒛1∗,…,𝒛N∗}⊂ℝlsuperscript𝒵superscriptsubscript𝒛1…superscriptsubscript𝒛𝑁superscriptℝ𝑙\mathcal{Z}^{}={\displaystyle{\bm{z}}_{1}^{},\ldots,\displaystyle{\bm{z}}_{N}^{*}}\subset\displaystyle\mathbb{R}^{l} obtained during inference, namely:

where 𝒟𝒟\mathcal{D}’s columns are typically projected onto the unit sphere to avoid collapse in the latent codes, namely ‖𝒟:,i‖22=1superscriptsubscriptnormsubscript𝒟:𝑖221|\mathcal{D}{:,i}|{2}^{2}=1 for 𝒟:,i∈ℝdsubscript𝒟:𝑖superscriptℝ𝑑\mathcal{D}_{:,i}\in\mathbb{R}^{d}.

We provide an overview of the Iterative Shrinkage-Thresholding Algorithm (ISTA) and its faster version FISTA which we adopt in our setup. ISTA is a proximal gradient method which performs inference to find a sparse code 𝒛∗superscript𝒛\displaystyle{\bm{z}}^{*} which minimizes the energy in (1) for a given input 𝒚𝒚\displaystyle{\bm{y}} and a fixed dictionary 𝒟𝒟\mathcal{D}. We can re-write the inference minimization problem as:

where f(𝒛):=12∥𝒚−𝒟𝒛∥22f(\displaystyle{\bm{z}})\mathrel{\mathop{:}}=\frac{1}{2}|\displaystyle{\bm{y}}-\mathcal{D}\displaystyle{\bm{z}}|{2}^{2} is the reconstruction term and g(𝒛):=λ∥𝒛∥1g(\displaystyle{\bm{z}})\mathrel{\mathop{:}}=\lambda|\displaystyle{\bm{z}}|{1} is the l1subscript𝑙1l_{1} penalty term. In ISTA, the code 𝒛𝒛\displaystyle{\bm{z}} can be given any initial value and is typically set to the zero vector: 𝒛(0)=𝟎∈ℝlsuperscript𝒛00superscriptℝ𝑙\displaystyle{\bm{z}}^{(0)}=\mathbf{0}\in\displaystyle\mathbb{R}^{l}. An ISTA iteration consists of two steps: a gradient step and a shrinkage step.

Gradient Step To compute code 𝒛(k)superscript𝒛𝑘\displaystyle{\bm{z}}^{(k)} from its previous iteration 𝒛(k−1)superscript𝒛𝑘1\displaystyle{\bm{z}}^{(k-1)}, where k∈{1,…,K}𝑘1…𝐾k\in{1,\ldots,K}, ISTA first performs the gradient update by taking gradient step of size ηksubscript𝜂𝑘\eta_{k} with respect to the squared error between the input and the reconstruction from 𝒛(k−1)superscript𝒛𝑘1\displaystyle{\bm{z}}^{(k-1)}:

The size of the gradient step ηksubscript𝜂𝑘\eta_{k} at iteration k𝑘k can be determined with backtracking (Beck & Teboulle, 2009).

Shrinkage Step The shrinkage step is applied to the intermediate output from the gradient step in ISTA. It uses the shrinkage function ταsubscript𝜏𝛼\tau_{\alpha} which sets components of its input to 0 if their absolute value is less than α𝛼\alpha or otherwise contracts them by α𝛼\alpha, namely [τα(𝒙)]j=sign(xj)(|xj|−α)+subscriptdelimited-[]subscript𝜏𝛼𝒙𝑗signsubscript𝑥𝑗subscriptsubscript𝑥𝑗𝛼[\tau_{\alpha}(\displaystyle{\bm{x}})]{j}=\textrm{sign}(\displaystyle{x}{j})(|x_{j}|-\alpha)_{+} for some threshold α>0𝛼0\alpha>0 where j𝑗j traverses the components of its input vector 𝒙𝒙\displaystyle{\bm{x}}. The shrinkage step for ISTA is given by:

where λ>0𝜆0\lambda>0 is the hyperparameter which determines the sparsity level. In the case when the codes are restricted to be non-negative, a property that we adopt in our setup, the shrinkage function can be modified to set all negative components in its output to 0: [τ~~α(𝒙)]j=([τα(𝒙)]j)+subscriptdelimited-[]subscript~~𝜏𝛼𝒙𝑗subscriptsubscriptdelimited-[]subscript𝜏𝛼𝒙𝑗[\tilde{\tau}{\alpha}(\displaystyle{\bm{x}})]{j}=([\tau_{\alpha}(\displaystyle{\bm{x}})]{j}){+}.

FISTA In our setup, we use FISTA (Beck & Teboulle, 2009) which accelerates the convergence of ISTA by adding a momentum term in each of its iterations. Namely, 𝒛(k−1)superscript𝒛𝑘1\displaystyle{\bm{z}}^{(k-1)} in (4) is replaced by:

for k≥2𝑘2k\geq 2 where tk=1+1+4tk−122subscript𝑡𝑘114superscriptsubscript𝑡𝑘122t_{k}=\frac{1+\sqrt{1+4t_{k-1}^{2}}}{2} and t1=1subscript𝑡11t_{1}=1. Thus, the FISTA gradient step becomes:

FISTA employs the same shrinkage step as ISTA.

In order to prevent collapse in the l1subscript𝑙1l_{1} norm of the latent codes, we propose to ensure that the variance of each latent code component stays above a pre-set threshold. To achieve this, we add a regularization term to the energy in (3) which encourages the variance of all latent component across a mini-batch of codes to be greater than a pre-set threshold. In particular, we replace the function f(𝒛)=12‖𝒚−𝒟𝒛‖22𝑓𝒛12superscriptsubscriptnorm𝒚𝒟𝒛22f(\displaystyle{\bm{z}})=\frac{1}{2}|\displaystyle{\bm{y}}-\mathcal{D}\displaystyle{\bm{z}}|{2}^{2} in (3) with a new function f~(𝒁)~𝑓𝒁\tilde{f}(\displaystyle{\bm{Z}}) defined over a set of codes 𝒁∈ℝl×n𝒁superscriptℝ𝑙𝑛\displaystyle{\bm{Z}}\in\displaystyle\mathbb{R}^{l\times n} corresponding to a mini-batch of data samples Y={𝒚1,…,𝒚n}𝑌subscript𝒚1…subscript𝒚𝑛Y={\displaystyle{\bm{y}}{1},\ldots,\displaystyle{\bm{y}}_{n}} as:

The first sum in (8) is over the reconstruction terms from each code 𝒁:,i∈ℝlsubscript𝒁:𝑖superscriptℝ𝑙\displaystyle{\bm{Z}}{:,i}\in\displaystyle\mathbb{R}^{l}. The second sum in (8) is over squared hinge terms involving the variance of each latent component 𝒁j,:∈ℝnsubscript𝒁𝑗:superscriptℝ𝑛\displaystyle{\bm{Z}}{j,:}\in\displaystyle\mathbb{R}^{n} across the batch where Var(𝒁j,:)=1n−1∑i=1n(𝒁j,i−μj)2Varsubscript𝒁𝑗:1𝑛1superscriptsubscript𝑖1𝑛superscriptsubscript𝒁𝑗𝑖subscript𝜇𝑗2\displaystyle\mathrm{Var}(\displaystyle{\bm{Z}}{j,:})=\frac{1}{n-1}\sum{i=1}^{n}(\displaystyle{\bm{Z}}{j,i}-\mu{j})^{2} and μjsubscript𝜇𝑗\mu_{j} is the mean across the j𝑗j-th latent component, namely μj=1n∑i=1nZj,isubscript𝜇𝑗1𝑛superscriptsubscript𝑖1𝑛subscript𝑍𝑗𝑖\mu_{j}=\frac{1}{n}\sum_{i=1}^{n}\displaystyle{Z}_{j,i}. The hinge terms are non-zero for any latent dimension whose variance is below the fixed threshold of T𝑇\sqrt{T}.

Modified Energy: Variance Regularization Using f~(𝒁)~~𝑓𝒁\tilde{f}(\displaystyle{\bm{Z}}) from (8) and setting g~~(𝒁):=∑i=1ng(𝒁:,i)\tilde{g}(\displaystyle{\bm{Z}})\mathrel{\mathop{:}}=\sum_{i=1}^{n}g(\displaystyle{\bm{Z}}_{:,i}), our modified objective during inference is to minimize the energy:

with respect to the codes 𝒁𝒁\displaystyle{\bm{Z}} for a mini-batch of samples Y𝑌Y and a fixed dictionary 𝒟𝒟\mathcal{D}. Note that the hinge terms counteract with the l1subscript𝑙1l_{1} penalty terms in (9). The hinge terms act as regularization terms which encourage the variance of each of the latent code components to remain above the threshold of T𝑇\sqrt{T}. This should prevent a collapse in the l1subscript𝑙1l_{1} norm of the latent codes directly and thus remove the need to normalize the weights of the decoder.

Gradient Step The first step in our modified version of FISTA is to take a gradient step for each code 𝒁:,t∈ℝlsubscript𝒁:𝑡superscriptℝ𝑙\displaystyle{\bm{Z}}_{:,t}\in\displaystyle\mathbb{R}^{l}:

In practice, since we use FISTA, the code 𝒁:,t(k−1)superscriptsubscript𝒁:𝑡𝑘1\displaystyle{\bm{Z}}{:,t}^{(k-1)} in (10) is replaced by a momentum term as defined in (6) but here we omit this notation for simplicity. The gradient of the sum of reconstruction terms in (8) with respect to Zs,tsubscript𝑍𝑠𝑡\displaystyle{Z}{s,t}, one of the latent components of code 𝒁:,tsubscript𝒁:𝑡\displaystyle{\bm{Z}}_{:,t}, for a linear decoder 𝒟𝒟\mathcal{D} is:

where 𝒟ssubscript𝒟𝑠\mathcal{D}{s} denotes the s𝑠s-th column of the dictionary 𝒟𝒟\mathcal{D}.111The gradient for any neural network decoder 𝒟𝒟\mathcal{D} can be computed through automatic differentiation. The gradient of the sum of hinge terms in (8) with respect to Zs,tsubscript𝑍𝑠𝑡\displaystyle{Z}{s,t} is:

Even though the hinge terms in (8) are not smooth convex functions222A requirement for the FISTA algorithm., the fact that the gradient in (12) is a line implies that the hinge term behaves locally as a convex quadratic function. A visualization of h(𝒙)=[(1−Var(𝒙))+]2ℎ𝒙superscriptdelimited-[]subscript1Var𝒙2h(\displaystyle{\bm{x}})=\big{[}\big{(}1-\sqrt{\displaystyle\mathrm{Var}(\displaystyle{\bm{x}})}\big{)}_{+}\big{]}^{2} for 𝒙∈ℝ2𝒙superscriptℝ2\displaystyle{\bm{x}}\in\displaystyle\mathbb{R}^{2} and the full derivation of (12) are available in Appendix A.2 and A.3, respectively.

Shrinkage Step Similarly to the regular FISTA protocol, the shrinkage step in our modified FISTA is applied to the intermediate results from the gradient step in (10).

One limitation of the proposed variance regularization is that the modified inference procedure depends on batch statistics (see equation 12) and thus the codes are not deterministic. To address this limitation, we propose to train an encoder ℰℰ\mathcal{E} simultaneously with the decoder 𝒟𝒟\mathcal{D} to predict the sparse codes computed from inference with our modified FISTA. After training of the encoder and decoder is done, the encoder can be used to compute sparse codes for different inputs independently from each other in a deterministic way. Additionally, using an encoder reduces the inference time: instead of performing inference with FISTA, the encoder computes sparse representations directly from the inputs and thus provides amortized inference.

Learning the Encoder The encoder is trained using mini-batch gradient descent to minimize the following mean-squared error between its predictions and the sparse codes 𝒁∗∈ℝl×nsuperscript𝒁superscriptℝ𝑙𝑛\displaystyle{\bm{Z}}^{*}\in\displaystyle\mathbb{R}^{l\times n} computed from inference with FISTA for inputs Y={𝒚1,…,𝒚n}𝑌subscript𝒚1…subscript𝒚𝑛Y={\displaystyle{\bm{y}}{1},\ldots,\displaystyle{\bm{y}}{n}}:

Note that the codes 𝒁∗superscript𝒁\displaystyle{\bm{Z}}^{*} are treated as constants. Training the encoder this way is similar to target propagation (Lee et al., 2015).

Modified Energy: Variance and Encoder Regularization To encourage FISTA to find codes which can be learned by the encoder, we further modify the inference protocol by adding a term to f~(𝒁)~𝑓𝒁\tilde{f}(\displaystyle{\bm{Z}}) which penalizes the distance between FISTA’s outputs and the encoder’s outputs:

This regularization term ensures that the codes computed during inference with FISTA do not deviate too much from the encoder’s predictions. In our setup, the encoder’s predictions are treated as constants and are used as the initial value for codes in FISTA, namely 𝒁:,i(0):=ℰ(𝒚i)\displaystyle{\bm{Z}}{:,i}^{(0)}\mathrel{\mathop{:}}=\mathcal{E}(\displaystyle{\bm{y}}{i}). This can reduce the inference time by lowering the number of FISTA iterations needed for convergence if the encoder provides good initial values. With the new f~(𝒁)~𝑓𝒁\tilde{f}(\displaystyle{\bm{Z}}) from equation 14, the energy minimized during inference becomes:

LISTA Encoder The encoder’s architecture is inspired by the Learned ISTA (LISTA) architecture (Gregor & LeCun, 2010). LISTA is designed to mimic outputs from inference with ISTA and resembles a recurrent neural network. Our encoder consists of two fully connected layers 𝑼∈ℝd×l𝑼superscriptℝ𝑑𝑙\displaystyle{\bm{U}}\in\displaystyle\mathbb{R}^{d\times l} and 𝑺∈ℝl×l𝑺superscriptℝ𝑙𝑙\displaystyle{\bm{S}}\in\displaystyle\mathbb{R}^{l\times l}, a bias term 𝒃∈ℝl𝒃superscriptℝ𝑙\displaystyle{\bm{b}}\in\displaystyle\mathbb{R}^{l}, and ReLU activation functions. Algorithm 1 describes how the encoder’s outputs are computed. Our version of LISTA produces non-negative codes by design. We refer to our encoder as LISTA in the rest of the paper.

Linear Decoder A linear decoder 𝒟𝒟\mathcal{D} is parameterized simply as a linear transformation 𝑾𝑾\displaystyle{\bm{W}} which maps codes in ℝlsuperscriptℝ𝑙\displaystyle\mathbb{R}^{l} to reconstructions in the input data dimension ℝdsuperscriptℝ𝑑\displaystyle\mathbb{R}^{d}, i.e. 𝑾∈ℝd×l𝑾superscriptℝ𝑑𝑙\displaystyle{\bm{W}}\in\displaystyle\mathbb{R}^{d\times l}. No bias term is used in this linear transformation.

Non-Linear Decoder In the case of a non-linear decoder 𝒟𝒟\mathcal{D}, we use a fully connected network with one hidden layer of size m𝑚m and ReLU as an activation function. We refer to the layer which maps the input codes to hidden representations as 𝑾1∈ℝm×lsubscript𝑾1superscriptℝ𝑚𝑙\displaystyle{\bm{W}}{1}\in{\displaystyle\mathbb{R}^{m\times l}} and the one which maps the hidden representations to the output (reconstructions) as 𝑾2∈ℝd×msubscript𝑾2superscriptℝ𝑑𝑚\displaystyle{\bm{W}}{2}\in\displaystyle\mathbb{R}^{d\times m}. There is a bias term 𝒃1∈ℝmsubscript𝒃1superscriptℝ𝑚\displaystyle{\bm{b}}{1}\in\displaystyle\mathbb{R}^{m} following 𝑾1subscript𝑾1\displaystyle{\bm{W}}{1} and no bias term after 𝑾2subscript𝑾2\displaystyle{\bm{W}}{2}. Given an input 𝒛∈ℝl𝒛superscriptℝ𝑙{\bm{z}}\in\mathbb{R}^{l}, the decoder’s output is computed as: 𝒚~=𝑾2(𝑾1𝒛+𝒃1)+~𝒚subscript𝑾2subscriptsubscript𝑾1𝒛subscript𝒃1\tilde{{\bm{y}}}=\displaystyle{\bm{W}}{2}(\displaystyle{\bm{W}}{1}{\bm{z}}+{\bm{b}}{1})_{+}.

We compare sparse autoencoders trained using our variance regularization method to sparse autoencoders trained using other regularization methods. We refer to our approach as variance-regularized dictionary learning (VDL) in the case when the decoder is a linear dictionary and as VDL-NL when the decoder is non-linear. We refer to the traditional approach in which the l2subscript𝑙2l_{2} norm of the decoder’s columns is fixed as standard dictionary learning (SDL) in the case the decoder is linear and as SDL-NL in the case it is non-linear - a natural extension of SDL to the case of non-linear decoders is to fix the l2subscript𝑙2l_{2} norms of the columns of both layers 𝑾1subscript𝑾1\displaystyle{\bm{W}}{1} and 𝑾2subscript𝑾2\displaystyle{\bm{W}}{2}. Additionally, we compare our variance regularization method to sparse autoencoders in which only weight decay is used to avoid collapse which we refer to as WDL and WDL-NL in the case of a linear and a non-linear decoder, respectively. Finally, we consider a decoder-only setup in which the standard approach of fixing the l2subscript𝑙2l_{2} norm of the decoder’s columns is applied and the codes are computed using inference with FISTA. We refer to this setup as DO when the decoder is linear and as DO-NL when the decoder is non-linear.

In order to compare the different models, we train them with different levels of sparsity regularization λ𝜆\lambda, evaluate the quality of their reconstructions in terms of peak signal-to-noise ratio (PSNR), and visualize the features that they learn. We evaluate models with a linear decoder on the downstream task of denoising. Additionally, we measure the pre-trained models’ performance on the downstream task of MNIST classification in the low data regime.

MNIST In the first set of our experiments, we use the MNIST dataset (LeCun & Cortes, 2010) consisting of 28×28282828\times 28 hand-written digits and do standard pre-processing by subtracting the global mean and dividing by the global standard deviation. We split the training data randomly into 55000 training samples and 5000 validation samples. The test set consists of 10000 images.

Natural Image Patches For experiments with natural images, we use patches from ImageNet ILSVRC-2012 (Deng et al., 2009). We first convert the ImageNet images to grayscale and then do standard pre-processing by subtracting the global mean and dividing by the global standard deviation. We then apply local contrast normalization to the images with a 13×13131313\times 13 Gaussian filter with standard deviation of 5 following Jarrett et al. (2009) and Zeiler et al. (2011). We use 200000200000200000 randomly selected patches of size 28×28282828\times 28 for training, 200002000020000 patches for validation, and 200002000020000 patches for testing.

We determine all hyperparameter values through grid search. Full training details can be found in Appendix B.1.

Inference The codes 𝒁𝒁\displaystyle{\bm{Z}} are restricted to be non-negative as described in section 2.2. The dimension of the latent codes in our MNIST experiments is l=128𝑙128l=128 and in experiments with ImageNet patches it is l=256𝑙256l=256. The batch size is set to 250 which we find sufficiently large for the regularization term on the variance of each latent component in (8) for VDL and VDL-NL models. We set the maximum number of FISTA iterations K𝐾K to 200200200 which we find sufficient for good reconstructions.

Autoencoders Training We train models for a maximum of 200 epochs in MNIST experiments and for 100 epochs in experiments with natural image patches. We apply early stopping and save the autoencoder which outputs codes with the lowest average energy measured on the validation set.333As a reminder, codes are computed using amortized inference with the encoder during validation. In SDL and SDL-NL experiments, we fix the l2subscript𝑙2l_{2} norms of the columns in the decoders’ fully connected layers 𝑾𝑾\displaystyle{\bm{W}}, 𝑾1subscript𝑾1\displaystyle{\bm{W}}{1}, and 𝑾2subscript𝑾2\displaystyle{\bm{W}}{2} to be equal to 111. We add weight decay to the bias term 𝒃1subscript𝒃1\displaystyle{\bm{b}}_{1} in models with non-linear decoder as well as to the bias term 𝒃𝒃\displaystyle{\bm{b}} in LISTA encoders to prevent their norms from inflating arbitrarily.

Implementation and Hardware Our PyTorch implementation is available on GitHub at https://github.com/kevtimova/deep-sparse. We train our models on one NVIDIA RTX 8000 GPU card and all our experiments take less than 24 hours to run.

Linear Decoders Figures 2 shows the dictionary elements for two SDL and two VDL decoders trained with different levels of sparsity regularization λ𝜆\lambda. As evident from Figures 2(a) and 2(b) we confirm that, in the case of standard dictionary learning on MNIST, the dictionary atoms look like orientations, strokes, and parts of digits for lower values of the sparsity regularization λ𝜆\lambda and they become looking like full digit prototypes as λ𝜆\lambda increases (Yu, 2012). A similar transition from strokes to full digit prototypes is observed in figures 2(c) and 2(d) displaying the dictionary elements of models trained using variance regularization in the latent code components. The same trend is observed in dictionary elements from WDL and DO models (Figure 12 in the Appendix). Figure 3 shows dictionary elements for two SDL and two VDL models trained on ImageNet patches. All of these models contain gratings and Gabor-like filters. We observe that models which are trained with a lower sparsity regularization parameter λ𝜆\lambda (Figures 3(a) and 3(c)) contain more high frequency atoms than ones trained with higher values of λ𝜆\lambda (Figures 3(b) and 3(d)).

Non-linear Decoders Each column 𝑾:,i∈ℝdsubscript𝑾:𝑖superscriptℝ𝑑\displaystyle{\bm{W}}{:,i}\in\displaystyle\mathbb{R}^{d} in a linear dictionary 𝒟𝒟\mathcal{D} (parameterized by 𝑾𝑾\displaystyle{\bm{W}}) reflects what a single latent component encodes since 𝑾:,i=W𝒆(i)subscript𝑾:𝑖𝑊superscript𝒆𝑖\displaystyle{\bm{W}}{:,i}=W\displaystyle{\bm{e}}^{(i)} where 𝒆(i)∈ℝlsuperscript𝒆𝑖superscriptℝ𝑙\displaystyle{\bm{e}}^{(i)}\in\displaystyle\mathbb{R}^{l} is a one-hot vector with 1 in position i𝑖i and 00s elsewhere. We use a similar strategy to visualize the features that non-linear decoders learn. Namely, we provide codes 𝒆(i)superscript𝒆𝑖\displaystyle{\bm{e}}^{(i)} with only one non-zero component as inputs to the non-linear decoders and display the resulting “atoms” in image space.444In practice, we remove the effect of the bias term b1subscript𝑏1b_{1} after W1subscript𝑊1W_{1} and display 𝒟(𝒆(i))−𝒟(𝟎)𝒟superscript𝒆𝑖𝒟0\mathcal{D}({\bm{e}}^{(i)})-\mathcal{D}(\mathbf{0}). While in the linear case each code component selects a single dictionary column, in the case of non-linear decoders with one hidden layer, each code component selects a linear combination of columns in 𝑾2subscript𝑾2{\bm{W}}_{2}. Two SDL-NL and two VDL-NL models trained on MNIST with different levels of sparsity regularization λ𝜆\lambda yield the atoms displayed in Figure 4. They contain strokes and orientations for smaller values of λ𝜆\lambda and fuller digit prototypes for larger values of λ𝜆\lambda and are very similar to the dictionary elements from linear decoders in Figure 2. We use the same strategy to visualize the features that non-linear decoders trained on ImageNet patches learn. These features are displayed in Figure 5 for SDL-NL and VDL-NL models trained with different levels of sparsity regularization λ𝜆\lambda. As in the case of linear decoders, we observe that both SDL-NL and VDL-NL models learn gratings and Gabor-like filters.

Figure 6 plots regret curves for the trade-off between average sparsity in the codes and reconstruction quality from these codes for models with linear decoders (dashed lines) and non-linear decoders (solid lines) trained on MNIST images (Figure 6(a)) and ImageNet patches (Figure 6(b)). The sparsity level is measured by the average percentage of non-active (zero) code components and the reconstruction quality is measured by the average PSNR. Both are evaluated on the test set over 5 random seeds. As expected, higher sparsity levels result in worse reconstructions. 555As a reference for reconstruction quality in terms of PSNR, the second row of images in Figure 7(a) and 7(b) contains reconstructions of the images in the top row with PSNR of around 17.3 and 17.7, respectively. Moreover, models trained with our variance regularization approach and a non-linear decoder (VDL-NL) on both MNIST and ImageNet patches produce better reconstructions than other models for higher levels of sparsity. This implies that VDL-NL models can better utilize the additional representational capacity from the larger number of layers and parameters in the non-linear decoder compared to the other models when the sparsity level is high.

Figure 7 shows the reconstructions of MNIST test set images and of the same images corrupted with Gaussian noise with zero mean and standard deviations σ=1𝜎1\sigma=1 and σ=1.5𝜎1.5\sigma=1.5 using the SDL and VDL models in Figures 2(b) and 2(d). We observe that both SDL and VDL models are able to produce reconstructions which are closer to samples from the manifold of training data than to the noisy inputs. This implies that the sparse coding systems have learned to detect features useful for the data they are trained on and cannot reconstruct random noise, as desired. SDL and VDL models trained on ImageNet patches are also robust to the addition of Gaussian noise to their inputs. This is evident in Figure 8 which contains reconstructions 𝒚~~𝒚\tilde{\displaystyle{\bm{y}}} and 𝒚~~σsubscript~~𝒚𝜎\tilde{\displaystyle{\bm{y}}}{\sigma} of images 𝒚𝒚\displaystyle{\bm{y}} and their noisy version 𝒚σsubscript𝒚𝜎\displaystyle{\bm{y}}{\sigma} for the SDL and VDL models from Figures 3(a) and 3(c) which produce codes with similar sparsity on average.

Table 1 shows a summary of the reconstruction quality of images with varying levels of Gaussian noise corruption when they are passed through SDL and VDL autoencoders. The results are evaluated on the test set over 5 random seeds. Both the SDL and VDL encoders are able to produce codes from corrupted inputs 𝒚σsubscript𝒚𝜎\displaystyle{\bm{y}}{\sigma} which, when passed through their corresponding decoders, produce outputs 𝒚~~σsubscript~~𝒚𝜎\tilde{\displaystyle{\bm{y}}}{\sigma} that are less noisy than the corrupted inputs. This observation is significant since the sparse autoencoders in our experiments are not explicitly trained to perform denoising.

VDL and VDL-NL experiments from Figure 6 show that our proposed variance regularization strategy can successfully be used to train sparse coding models without collapse in the l1subscript𝑙1l_{1} norms of the sparse codes. In experiments without our variance regularization strategy or any regularization restricting the dictionary’s weights, we do observe collapse, both for models with a linear and a non-linear decoder. This shows that our proposed variance regularization protocol is an effective alternative to the standard l2subscript𝑙2l_{2} normalization or weight decay regularization for preventing collapse, both in the case of models with a linear decoder and with a fully connected decoder with one hidden layer.

We investigate whether self-supervised pre-training with sparse coding improves classification performance over training from scratch in the low data regime. For this purpose, we compare the classification error of classifiers trained on the raw MNIST data to the classification error of linear classifiers trained on features from pre-trained frozen LISTA encoders in SDL, SDL-NL, VDL, and VDL-NL autoencoders666The autoencoders are trained on the full MNIST dataset in an unsupervised way as described in section 1. and pre-trained frozen DO and DO-NL decoders when few training samples are available for supervised learning.

In particular, we train a linear classifier on top of features from these pre-trained models when 1, 2, 5, 10, 20, 50, 100 training samples per class are available as well as when the full dataset is available. For each type of pre-trained model and each number of training samples per class, we select the model among the ones presented in Figure 6(a) which gives the best validation accuracy. As baselines, we train two classifiers from scratch on the raw MNSIT data. The first one is a linear classifier referred to as “Linear on raw”. The second one consists of a linear layer on top of a randomly initialized LISTA encoder 777This randomly initialized LISTA encoder is trainable and with the same architecture as the one in Algorithm 1. and is referred to as “LSITA + Linear on raw”.

Figure 9 summarizes the top-1 and top-3 classification errors of these classifiers on the test set evaluated over 5 random seeds. With small exceptions888Namely, classifiers trained on DO, DO-NL and SDL features with 1 or 2 training samples per class., all linear classifiers trained on top of the pre-trained LISTA features give better top-1 and top-3 classification performance than classifiers trained on the raw MNIST data with up to 100 samples per class. This observation supports a claim made by Raina et al. (2007) that SSL is expected to be useful especially when labeled data is scarce. Notably, classifiers trained on top of VDL-NL features have the best performance among the rest of the models with up to 10 training samples per class. In other words, sparse autoencoders with a non-linear decoder pre-trained using our proposed variance regularization on the latent codes can boost the classification performance when few labeled training samples are available.

Ways to Avoid Collapse There are different ways to avoid the collapse issue in a sparse coding model. One approach is to constrain the weights of the decoder, for example by bounding the norms of its columns (Lee et al., 2006; Raina et al., 2007; Mairal et al., 2009; Hu & Tan, 2018) or by applying weight decay regularization (Sun et al., 2018). Another approach is to regularize the latent codes directly - a strategy we incorporate in our model. In the spiking model of Földiak (1990), an anti-Hebbian adaptive thresholding mechanism encourages each latent component to be active with a pre-set level of probability. Our variance regularization strategy is also related to Olshausen & Field (1996; 1997) where the norm of the basis elements is adjusted adaptively over time so that each of the latent components has a desired level of variance. Their method works well in the linear dictionary setup but it is not directly extendable to the case of a non-linear decoder. In variational inference models such as VAE Kingma & Welling (2013) and its sparse coding extensions such as SVAE Barello et al. (2018) and Tonolini et al. (2020), the latent codes have a pre-set level of variance determined by the prior distribution. In SVAE, there is also variance regularization term which determines the trade-off between the input reconstruction and the sparsity level imposed by the prior. Our variance regularization term is most closely related to the variance principle presented in Bardes et al. (2021) in which each latent component is encouraged to have variance above a pre-set threshold.

Sparse Coding and Classification Performance There is an active area of research which studies whether sparse coding is useful in classification tasks. Some existing literature supports this claim Raina et al. (2007); Ranzato et al. (2007); Mairal et al. (2008a; b); Yang et al. (2009); Kavukcuoglu et al. (2010); Makhzani & Frey (2013); He et al. (2014) while other research refutes it Coates & Ng (2011); Rigamonti et al. (2011); Luther & Seung (2022). Tang et al. (2020) show that adding layers to a hierarchy of sparse dictionaries can improve classification accuracy. In our work, we provide empirical evidence that sparse coding models trained with a reconstruction objective, especially models with a non-linear decoder trained using our variance regularization method, can give better classification performance than classifiers trained on the raw data when very few training samples are available, supporting a claim made by Raina et al. (2007) that SSL is expected to be useful when labeled data is scarce.

Sparsity Constraints Sparse representations can be obtained using many different objectives. The l1subscript𝑙1l_{1} penalty term in equation 1 can be replaced by other sparsity-inducing constraints including l0subscript𝑙0l_{0}, lpsubscript𝑙𝑝l_{p} (0<p<1)0𝑝1(0<p<1), l2subscript𝑙2l_{2}, or l2,1subscript𝑙21l_{2,1} norms. In particular, the l1subscript𝑙1l_{1} term is a convex relaxation of the non-differentiable l0subscript𝑙0l_{0} constraint of the form ‖𝒛‖0≤ksubscriptnorm𝒛0𝑘|\displaystyle{\bm{z}}|{0}\leq k requiring 𝒛𝒛\displaystyle{\bm{z}} to have at most k𝑘k active components. Zhang et al. (2015) provide a survey of existing algorithms for sparse coding such as K-SVD which generalizes K-means clustering (Aharon et al., 2006). Makhzani & Frey (2013) propose an approach to train a sparse autoencoder with a linear encoder and decoder in which only the top k𝑘k activations in the hidden layer are kept and no other sparsifying penalty is used. In our model, we modify the commonly used sparse coding setup with l1subscript𝑙1l{1} penalty term in 1 by including a term which encourages a pre-set level of variance across each latent component in 9.

In this work, we revisit the traditional setup of sparse coding with an l1subscript𝑙1l_{1} sparsity penalty. We propose to apply variance regularization to the sparse codes during inference with the goal of training non-linear decoders without collapse in the l1subscript𝑙1l_{1} norm of the codes. We show that using our proposed method we can successfully train sparse autoencoders with fully connected multi-layer decoders which have interpretable features, outperform models with linear decoders in terms of reconstruction quality for a given average level of sparsity in the codes, and improve MNIST classification performance in the low data regime. Future research directions include scaling up our method to natural images using deep convolutional decoders.

Figure 10 visualizes the variance regularization term h(𝒙)=[(1−Var(𝒙))+]2ℎ𝒙superscriptdelimited-[]subscript1Var𝒙2h(\displaystyle{\bm{x}})=\big{[}\big{(}1-\sqrt{\displaystyle\mathrm{Var}(\displaystyle{\bm{x}})}\big{)}_{+}\big{]}^{2} for 𝒙∈ℝ2𝒙superscriptℝ2\displaystyle{\bm{x}}\in\displaystyle\mathbb{R}^{2} in 3D.

As a reminder of our notation, l𝑙l is the latent dimension, n𝑛n is the batch size, 𝒁∈ℝl×n𝒁superscriptℝ𝑙𝑛\displaystyle{\bm{Z}}\in\displaystyle\mathbb{R}^{l\times n} stores the codes for all elements in a batch, and 𝒁j,:∈ℝnsubscript𝒁𝑗:superscriptℝ𝑛\displaystyle{\bm{Z}}_{j,:}\in\displaystyle\mathbb{R}^{n} stores code values for the j𝑗j-th latent component.

Note that:

The gradient in (16) is non-zero only for j=s𝑗𝑠j=s and Var(𝒁s,:)<TVarsubscript𝒁𝑠:𝑇\sqrt{\displaystyle\mathrm{Var}(\displaystyle{\bm{Z}}_{s,:})}<T. Thus,

and using the result in (22) we conclude that:

All hyperparameter values are selected through grid search. For the constant step size ηksubscript𝜂𝑘\eta_{k} in FISTA, we consider values in the range of 0.010.010.01 to 100100100. In VDL experiments, we consider values for the hinge regularization coefficient β𝛽\beta between 1e−11e11\mathrm{e}{-1} and 100100100. In all our experiments, we use Adam (Kingma & Ba, 2014) as an optimizer for both the encoder ℰℰ\mathcal{E} and decoder 𝒟𝒟\mathcal{D} with a batch size of 250. We use L=3𝐿3L=3 iterations in the LISTA encoder (see Algorithm 1). We run the FISTA algorithm for a maximum of 200 iterations. The convergence criterion we set is: ‖𝒛(k)−𝒛(k−1)‖2‖𝒛(k−1)‖2<1e−3subscriptnormsuperscript𝒛𝑘superscript𝒛𝑘12subscriptnormsuperscript𝒛𝑘121e3\frac{|\displaystyle{\bm{z}}^{(k)}-\displaystyle{\bm{z}}^{(k-1)}|{2}}{|\displaystyle{\bm{z}}^{(k-1)}|{2}}<1\mathrm{e}{-3}. FISTA’s output is 𝒛∗=𝒛(k∗)superscript𝒛superscript𝒛superscript𝑘\displaystyle{\bm{z}}^{}=\displaystyle{\bm{z}}^{(k^{})} where k∗superscript𝑘k^{*} is the index of the first iteration for which the convergence criterion holds. Table 3 contains the hyperparameter values we use in all our experiments expect for the ones with WDL, WDL-NL, DO and DO-NL models. In the case of DO and DO-NL models, we use the same number of epochs, η𝒟subscript𝜂𝒟\eta_{\mathcal{D}}, wd(𝒃)wd𝒃\mathrm{wd}(\displaystyle{\bm{b}}), wd(𝒃1)wdsubscript𝒃1\mathrm{wd}(\displaystyle{\bm{b}}{1}), and ηksubscript𝜂𝑘\eta{k} as in SDL and SDL-NL experiments, respectively. In the case of WDL and WDL-NL models, we use grid search to find the optimal weight decay value of 5e−45e45\mathrm{e}{-4} (in terms of reconstruction quality regret curve as displayed in Figure 6; this is also the value used in Sun et al. (2018)) and use the same number of epochs, η𝒟subscript𝜂𝒟\eta_{\mathcal{D}}, γ𝛾\gamma, ηℰsubscript𝜂ℰ\eta_{\mathcal{E}}, and ηksubscript𝜂𝑘\eta_{k} as in SDL and SDL-NL experiments, respectively.

Table: S4.T1: The performance of SDL and VDL models trained on MNIST and ImageNet patches with different levels of sparsity regularization λ𝜆\lambda on the the task of denoising inputs corrupted with Gaussian noise with zero mean and standard deviation σ=1.0𝜎1.0\sigma=1.0 or σ=1.5𝜎1.5\sigma=1.5. We report the average sparsity of codes computed with these models and the PSNR of: the reconstruction 𝒚~~𝒚\tilde{\displaystyle{\bm{y}}} of the original input 𝒚𝒚\displaystyle{\bm{y}}, the corrupted input 𝒚σsubscript𝒚𝜎\displaystyle{\bm{y}}{\sigma}, and the reconstruction 𝒚~~σsubscript~~𝒚𝜎\tilde{\displaystyle{\bm{y}}}{\sigma} of the corrupted input 𝒚σsubscript𝒚𝜎\displaystyle{\bm{y}}_{\sigma}. Evaluated on the test set over 5 random seeds.


MNIST
Model	λ𝜆\lambda	L0(𝒛𝒚~)subscript𝐿0subscript𝒛~𝒚L_{0}(\displaystyle{\bm{z}}_{\tilde{\displaystyle{\bm{y}}}})	PSNR 𝒚~~𝒚\tilde{\displaystyle{\bm{y}}}	σ𝜎\sigma	PSNR 𝒚σsubscript𝒚𝜎\displaystyle{\bm{y}}_{\sigma}	L0(𝒛𝒚~~σ)subscript𝐿0subscript𝒛subscript~~𝒚𝜎L_{0}(\displaystyle{\bm{z}}{\tilde{\displaystyle{\bm{y}}}{\sigma}})	PSNR 𝒚~~σsubscript~~𝒚𝜎\tilde{\displaystyle{\bm{y}}}_{\sigma}
SDL	0.0050.0050.005	89.7%±0.2%plus-or-minuspercent89.7percent0.289.7%\pm 0.2%	17.3±0.028plus-or-minus17.30.02817.3\pm 0.028	1.0	10.210.210.2	87.3%±0.2%plus-or-minuspercent87.3percent0.287.3%\pm 0.2%	16.9±0.053plus-or-minus16.90.05316.9\pm 0.053
VDL	0.020.020.02	91.8%±0.1%plus-or-minuspercent91.8percent0.191.8%\pm 0.1%	17.7±0.023plus-or-minus17.70.02317.7\pm 0.023	1.0	10.210.210.2	88.1%±0.2%plus-or-minuspercent88.1percent0.288.1%\pm 0.2%	15.4±0.175plus-or-minus15.40.17515.4\pm 0.175
SDL	0.0050.0050.005	89.7%±0.2%plus-or-minuspercent89.7percent0.289.7%\pm 0.2%	17.3±0.028plus-or-minus17.30.02817.3\pm 0.028	1.5	6.76.76.7	84.6%±0.5%plus-or-minuspercent84.6percent0.584.6%\pm 0.5%	15.7±0.204plus-or-minus15.70.20415.7\pm 0.204
VDL	0.020.020.02	91.8%±0.1%plus-or-minuspercent91.8percent0.191.8%\pm 0.1%	17.7±0.023plus-or-minus17.70.02317.7\pm 0.023	1.5	6.76.76.7	84.5%±0.3%plus-or-minuspercent84.5percent0.384.5%\pm 0.3%	12.2±0.275plus-or-minus12.20.27512.2\pm 0.275
ImageNet patches
SDL	0.0020.0020.002	74.6%±0.1%plus-or-minuspercent74.6percent0.174.6%\pm 0.1%	26.9±0.007plus-or-minus26.90.00726.9\pm 0.007	1.0	20.020.020.0	72.7%±0.1%plus-or-minuspercent72.7percent0.172.7%\pm 0.1%	25.6±0.011plus-or-minus25.60.01125.6\pm 0.011
VDL	0.0050.0050.005	77.3%±0.2%plus-or-minuspercent77.3percent0.277.3%\pm 0.2%	27.3±0.058plus-or-minus27.30.05827.3\pm 0.058	1.0	20.020.020.0	73.5%±0.2%plus-or-minuspercent73.5percent0.273.5%\pm 0.2%	25.2±0.035plus-or-minus25.20.03525.2\pm 0.035

Table: A1.T2: Descriptions of symbols and notation used in the paper.


Notation	Description
ℰℰ\mathcal{E}	LISTA-inspired encoder.
𝒟𝒟\mathcal{D}	Decoder: linear or fully connected with one hidden layer.
m𝑚m	Dimension of the decoder’s hidden layer.
λ𝜆\lambda	Hyperparameter for the sparsity regularization.
β𝛽\beta	Hyperparameter for the variance regularization of the latent codes.
γ𝛾\gamma	Hyperparameter encouraging the FISTA codes to be close to the encoder’s predictions.
d𝑑d	Dimension of the input data.
𝒚𝒚\displaystyle{\bm{y}}	Input sample in ℝdsuperscriptℝ𝑑\displaystyle\mathbb{R}^{d}.
N𝑁N	Total number of training samples.
n𝑛n	Mini-batch size.
T𝑇T	Number of ISTA/FISTA iterations.
ηksubscript𝜂𝑘\eta_{k}	Step size for ISTA/FISTA gradient step.
η𝒟subscript𝜂𝒟\eta_{\mathcal{D}}	Learning rate for the decoder.
ηℰsubscript𝜂ℰ\eta_{\mathcal{E}}	Learning rate for the encoder.
ταsubscript𝜏𝛼\tau_{\alpha}	Shrinkage function [τα(𝒙)]j=sign(xj)(\|xj\|−α)subscriptdelimited-[]subscript𝜏𝛼𝒙𝑗signsubscript𝑥𝑗subscript𝑥𝑗𝛼[\tau_{\alpha}(\displaystyle{\bm{x}})]{j}=\textrm{sign}(x{j})(\|x_{j}\|-\alpha).
τ~~αsubscript~~𝜏𝛼\tilde{\tau}_{\alpha}	Non-negative shrinkage function [τ~~α(𝒙)]j=([τα(𝒙)]j)+subscriptdelimited-[]subscript~~𝜏𝛼𝒙𝑗subscriptsubscriptdelimited-[]subscript𝜏𝛼𝒙𝑗[\tilde{\tau}{\alpha}(\displaystyle{\bm{x}})]{j}=([\tau_{\alpha}(\displaystyle{\bm{x}})]{j}){+}.
l𝑙l	Dimension of the latent code.
𝒛𝒛\displaystyle{\bm{z}}	Latent code in ℝlsuperscriptℝ𝑙\displaystyle\mathbb{R}^{l}.
𝒁𝒁\displaystyle{\bm{Z}}	Mini-batch of latent codes in ℝl×nsuperscriptℝ𝑙𝑛\displaystyle\mathbb{R}^{l\times n}.
𝑾𝑾\displaystyle{\bm{W}}	Parametrization in ℝd×lsuperscriptℝ𝑑𝑙\displaystyle\mathbb{R}^{d\times l} of a linear dictionary 𝒟𝒟\mathcal{D}.
𝑾1subscript𝑾1\displaystyle{\bm{W}}_{1}	Parametrization in ℝm×lsuperscriptℝ𝑚𝑙\displaystyle\mathbb{R}^{m\times l} of the bottom layer of a fully connected decoder 𝒟𝒟\mathcal{D} with one
	hidden layer.
𝑾2subscript𝑾2\displaystyle{\bm{W}}_{2}	Parametrization of a the top layer of a fully connected decoder 𝒟𝒟\mathcal{D} with one hidden
	layer in ℝd×msuperscriptℝ𝑑𝑚\displaystyle\mathbb{R}^{d\times m}.
𝒃1subscript𝒃1\displaystyle{\bm{b}}_{1}	Parametrization of the bias term in ℝmsuperscriptℝ𝑚\displaystyle\mathbb{R}^{m} following the bottom layer of a fully connected
	decoder 𝒟𝒟\mathcal{D} with one hidden layer.
𝒃𝒃\displaystyle{\bm{b}}	Parametrization of the bias term in the LISTA encoder (see 3.1).
SDL	Standard Dictionary Learning with a LISTA encoder and a linear decoder whose columns have
	a fixed l2subscript𝑙2l_{2} norm.
SDL-NL	Standard Dictionary learning with a LISTA encoder and a non-linear fully connected decoder.
	Each layer in the decoder has columns with a fixed l2subscript𝑙2l_{2} norm.
VDL	Variance-regularized Dictionary Learning with a LISTA encoder and a linear decoder in which
	regularization is applied to the sparse codes encouraging the variance across the latent
	components to be above a fixed threshold.
VDL-NL	Variance-regularized Dictionary Learning (as above) with a non-linear decoder.
WDL	Dictionary Learning with a LISTA encoder and a linear decoder trained with weight decay
	regularization.
WDL-NL	Dictionary Learning with a LISTA encoder and a non-linear decoder trained with weight
	decay regularization.
DO	Standard Dictionary Learning without an encoder and with a linear decoder whose columns
	have a fixed l2subscript𝑙2l_{2} norm.
DO-NL	Standard Dictionary Learning without an encoder and with a non-linear decoder whose columns
	have a fixed l2subscript𝑙2l_{2} norm.

Table: A2.T3: Training hyperparameters. SDL indicates models in which columns of the decoder’s layers have a fixed norm of 1. VDL indicates models in which variance regularization is applied to the codes during inference. (∗)(^{*}) means that the learning rate for the decoder η𝒟subscript𝜂𝒟\eta_{\mathcal{D}} is annealed by half every 30 epochs. wdwd\mathrm{wd} stands for weight decay.


dataset	model	ep	λ𝜆\lambda	γ𝛾\gamma	β𝛽\beta	T𝑇T	η𝒟subscript𝜂𝒟\eta_{\mathcal{D}}	ηℰsubscript𝜂ℰ\eta_{\mathcal{E}}	wd(𝒃1)wdsubscript𝒃1\mathrm{wd}(\displaystyle{\bm{b}}_{1})	wd(𝒃)wd𝒃\mathrm{wd}(\displaystyle{\bm{b}})	ηksubscript𝜂𝑘\eta_{k}
MNIST	SDL	200	0	1	0	-	1e−31e31\mathrm{e}{-3}	3e−43e43\mathrm{e}{-4}	-	0	1
MNIST	SDL	200	1e−41e41\mathrm{e}{-4}	1	0	-	1e−31e31\mathrm{e}{-3}	3e−43e43\mathrm{e}{-4}	-	0	1
MNIST	SDL	200	5e−45e45\mathrm{e}{-4}	1	0	-	1e−31e31\mathrm{e}{-3}	3e−43e43\mathrm{e}{-4}	-	0	1
MNIST	SDL	200	1e−31e31\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	3e−43e43\mathrm{e}{-4}	-	0	1
MNIST	SDL	200	3e−33e33\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	3e−43e43\mathrm{e}{-4}	-	0	1
MNIST	SDL	200	5e−35e35\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	3e−43e43\mathrm{e}{-4}	-	0	1
MNIST	VDL	200	0	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	0	0.5
MNIST	VDL	200	1e−31e31\mathrm{e}{-3}	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	0	0.5
MNIST	VDL	200	3e−33e33\mathrm{e}{-3}	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	0	0.5
MNIST	VDL	200	5e−35e35\mathrm{e}{-3}	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	0	0.5
MNIST	VDL	200	1e−21e21\mathrm{e}{-2}	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	0	0.5
MNIST	VDL	200	2e−22e22\mathrm{e}{-2}	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	0	0.5
MNIST	SDL-NL	200	0	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	1
MNIST	SDL-NL	200	1e−31e31\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	1
MNIST	SDL-NL	200	3e−33e33\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	1
MNIST	SDL-NL	200	5e−35e35\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	1
MNIST	SDL-NL	200	1e−21e21\mathrm{e}{-2}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	1
MNIST	SDL-NL	200	2e−22e22\mathrm{e}{-2}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	1
MNIST	VDL-NL	200	0	100	10	0.5	3e−4∗3esuperscript43\mathrm{e}{-4}^{*}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	0.5
MNIST	VDL-NL	200	1e−31e31\mathrm{e}{-3}	100	10	0.5	3e−4∗3esuperscript43\mathrm{e}{-4}^{*}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	0.5
MNIST	VDL-NL	200	3e−33e33\mathrm{e}{-3}	100	10	0.5	3e−4∗3esuperscript43\mathrm{e}{-4}^{*}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	0.5
MNIST	VDL-NL	200	5e−35e35\mathrm{e}{-3}	100	10	0.5	3e−4∗3esuperscript43\mathrm{e}{-4}^{*}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	0.5
MNIST	VDL-NL	200	1e−21e21\mathrm{e}{-2}	100	10	0.5	3e−4∗3esuperscript43\mathrm{e}{-4}^{*}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	0.5
MNIST	VDL-NL	200	2e−22e22\mathrm{e}{-2}	100	10	0.5	3e−4∗3esuperscript43\mathrm{e}{-4}^{*}	1e−41e41\mathrm{e}{-4}	1e−31e31\mathrm{e}{-3}	0	0.5
ImageNet	SDL	100	0	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL	100	5e−45e45\mathrm{e}{-4}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL	100	1e−31e31\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL	100	2e−32e32\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL	100	3e−33e33\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL	100	5e−35e35\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL	100	0	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL	100	1e−31e31\mathrm{e}{-3}	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL	100	3e−33e33\mathrm{e}{-3}	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL	100	5e−35e35\mathrm{e}{-3}	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL	100	1e−21e21\mathrm{e}{-2}	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL	100	1.5e−21.5e21.5\mathrm{e}{-2}	5	10	0.5	3e−43e43\mathrm{e}{-4}	1e−41e41\mathrm{e}{-4}	-	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL-NL	100	0	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−21e21\mathrm{e}{-2}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL-NL	100	1e−31e31\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−21e21\mathrm{e}{-2}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL-NL	100	3e−33e33\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−21e21\mathrm{e}{-2}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL-NL	100	5e−35e35\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−21e21\mathrm{e}{-2}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL-NL	100	8e−38e38\mathrm{e}{-3}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−21e21\mathrm{e}{-2}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	SDL-NL	100	1e−21e21\mathrm{e}{-2}	1	0	-	1e−31e31\mathrm{e}{-3}	1e−41e41\mathrm{e}{-4}	1e−21e21\mathrm{e}{-2}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL-NL	100	0	20	10	0.5	5e−5∗5esuperscript55\mathrm{e}{-5}^{*}	1e−41e41\mathrm{e}{-4}	1e−11e11\mathrm{e}{-1}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL-NL	100	1e−31e31\mathrm{e}{-3}	20	10	0.5	5e−5∗5esuperscript55\mathrm{e}{-5}^{*}	1e−41e41\mathrm{e}{-4}	1e−11e11\mathrm{e}{-1}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL-NL	100	2e−32e32\mathrm{e}{-3}	20	10	0.5	5e−5∗5esuperscript55\mathrm{e}{-5}^{*}	1e−41e41\mathrm{e}{-4}	1e−11e11\mathrm{e}{-1}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL-NL	100	3e−33e33\mathrm{e}{-3}	20	10	0.5	5e−5∗5esuperscript55\mathrm{e}{-5}^{*}	1e−41e41\mathrm{e}{-4}	1e−11e11\mathrm{e}{-1}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL-NL	100	5e−35e35\mathrm{e}{-3}	20	10	0.5	5e−5∗5esuperscript55\mathrm{e}{-5}^{*}	1e−41e41\mathrm{e}{-4}	1e−11e11\mathrm{e}{-1}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL-NL	100	1e−21e21\mathrm{e}{-2}	40	10	0.5	5e−5∗5esuperscript55\mathrm{e}{-5}^{*}	1e−41e41\mathrm{e}{-4}	1e−11e11\mathrm{e}{-1}	1e−21e21\mathrm{e}{-2}	0.5
ImageNet	VDL-NL	100	2e−22e22\mathrm{e}{-2}	40	10	0.5	5e−5∗5esuperscript55\mathrm{e}{-5}^{*}	1e−41e41\mathrm{e}{-4}	1e−11e11\mathrm{e}{-1}	1e−21e21\mathrm{e}{-2}	0.5

Refer to caption (a) Inference of a sparse code 𝒛𝒛{\bm{z}}.

Refer to caption (b) Learning of the encoder ℰℰ\mathcal{E} and decoder 𝒟𝒟\mathcal{D}.

Refer to caption (a) SDL, λ=0.001𝜆0.001\lambda=0.001

Refer to caption (a) MNIST

Refer to caption (b) ImageNet patches

Refer to caption (a) Top-1 error in %percent%.

Refer to caption Visualizing the variance regularization function h(𝒙)=[(1−Var(𝒙))+]2ℎ𝒙superscriptdelimited-[]subscript1Var𝒙2h(\displaystyle{\bm{x}})=\big{[}\big{(}1-\sqrt{\displaystyle\mathrm{Var}(\displaystyle{\bm{x}})}\big{)}_{+}\big{]}^{2} for 𝒙∈ℝ2𝒙superscriptℝ2\displaystyle{\bm{x}}\in\displaystyle\mathbb{R}^{2} in 3D.

$$ \mathbf{E}(\displaystyle{\bm{z}},\displaystyle{\bm{y}},\mathcal{D})=\frac{1}{2}|\displaystyle{\bm{y}}-\mathcal{D}\displaystyle{\bm{z}}|{2}^{2}+\lambda|\displaystyle{\bm{z}}|{1}. $$ \tag{S2.E1}

$$ \displaystyle \vz^* = \arg\underset{\displaystyle \vz}{\min}~\mathbf{E}(\displaystyle \vz,\displaystyle \vy, \mathcal{D}). $$

$$ \label{d-obj} \arg\underset{\mathcal{D}}{\min}~~\mathcal{L}_{\mathcal{D}}(\mathcal{D},\mathcal{Z}^*,\mathcal{Y}) = \arg\underset{\mathcal{D}}{\min}~~\frac{1}{N}\sum_{i=1}^N|\displaystyle \vy_i - \mathcal{D}\displaystyle \vz_i^*|_2^2, $$ \tag{d-obj}

$$ \label{ista-grad} \textrm{ISTA gradient~~step:~~}\tilde{\displaystyle \vz}^{(k)} = \displaystyle \vz^{(k-1)} - \eta_k\nabla f(\displaystyle \vz^{(k-1)}), $$ \tag{ista-grad}

$$ \label{gradient-of-hinge} \frac{\partial}{\partial \displaystyle \emZ_{s, t}} \sum_{j=1}^l\beta\Big[\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{j, :}})\Big)_+\Big]^2 $$ \tag{gradient-of-hinge}

$$ \frac{\partial}{\partial\displaystyle{Z}{s,t}}\sum{j=1}^{l}\beta\Big{[}\Big{(}T-\sqrt{\displaystyle\mathrm{Var}(\displaystyle{\bm{Z}}{j,:})}\Big{)}{+}\Big{]}^{2}=\begin{cases}-\frac{2\beta}{n-1}\frac{\Big{(}T-\sqrt{\displaystyle\mathrm{Var}(\displaystyle{\bm{Z}}{s,:})}\Big{)}}{\sqrt{\displaystyle\mathrm{Var}(\displaystyle{\bm{Z}}{s,:})}}(\displaystyle{Z}{s,t}-\mu{s})&\mathrm{if~{}}\sqrt{\displaystyle\mathrm{Var}(\displaystyle{\bm{Z}}_{s,:})}<T\ {}{}~{}0&\mathrm{otherwise}.\end{cases} $$ \tag{A1.E32}

$$ \displaystyle=\frac{1}{n-1}\Big{[}\frac{\partial}{\partial\displaystyle{Z}{s,t}}(\displaystyle{Z}{s,t}-\mu_{s})^{2}+\sum_{i\neq t}\frac{\partial}{\partial\displaystyle{Z}{s,t}}(\displaystyle{Z}{s,i}-\mu_{s})^{2}\Big{]} $$

$$ \displaystyle=\frac{2}{n-1}(\displaystyle{Z}{s,t}-\mu{s}), $$

$$ \label{sc-energy} \mathbf{E}(\displaystyle \vz,\displaystyle \vy, \mathcal{D}) = \frac{1}{2}|\displaystyle \vy - \mathcal{D}\displaystyle \vz|_2^2 + \lambda |\displaystyle \vz|_1. $$ \tag{sc-energy}

$$ \label{hinge-derivative-derived} \frac{\partial}{\partial \displaystyle \emZ_{s, t}} \sum_{j=1}^l\beta\Big[\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{j, :})}\Big)+\Big]^2 = \begin{cases} -\frac{2\beta}{n-1} \frac{\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ{s, :})}\Big)}{\sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})}}(\displaystyle \emZ_{s, t} - \mu_s) & \mathrm{if~}\sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})}< T\ ~~~0&\mathrm{otherwise}. \end{cases} $$ \tag{hinge-derivative-derived}

$$ \label{modified-obj} \tilde{\mathbf{E}}(\displaystyle \mZ,Y,\mathcal{D})& = \tilde{f}(\displaystyle \mZ) + \tilde{g}(\displaystyle \mZ) = \sum_{i=1}^{n}\frac{1}{2}|\displaystyle \vy_i - \mathcal{D}\displaystyle \mZ_{:, i}|2^2+ \sum{j=1}^l\beta\Big[\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{j, :}})\Big)+\Big]^2 + \sum{i=1}^n\lambda |\displaystyle \mZ_{:, i}|_1 $$ \tag{modified-obj}

$$ \label{final-obj} \tilde{\mathbf{E}}(\displaystyle \mZ,Y,\mathcal{D})& = \sum_{i=1}^n \frac{1}{2}|\displaystyle \vy_i - \mathcal{D}\displaystyle \mZ_{:, i}|2^2 + \beta\sum{j=1}^l\Big[\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{j, :}})\Big)+\Big]^2 + \gamma \sum{i=1}^n|\displaystyle \mZ_{:, i} - \mathcal{E}(\displaystyle \vy_i)|2^2+ \sum{i=1}^n\lambda |\displaystyle \mZ_{:, i}|_1. $$ \tag{final-obj}

$$ \frac{\partial}{\partial \displaystyle \emZ_{s, t}} \displaystyle \Var(\displaystyle \mZ_{s, :}) & = \frac{\partial}{\partial \displaystyle \emZ_{s, t}} \Big[\frac{1}{n-1}\sum_{i=1}^n(\displaystyle \emZ_{s, i} - \mu_s)^2\Big]\ & = \frac{1}{n-1}\Big[\frac{\partial}{\partial \displaystyle \emZ_{s, t}}(\displaystyle \emZ_{s, t} - \mu_s)^2 + \sum_{i\neq t}\frac{\partial}{\partial \displaystyle \emZ_{s, t}}(\displaystyle \emZ_{s, i} - \mu_s)^2 \Big]\ & = \frac{1}{n-1}\Big[ 2(\displaystyle \emZ_{s, t} - \mu_s)\frac{\partial}{\partial \displaystyle \emZ_{s, t}}(\displaystyle \emZ_{s, t} - \mu_s) + \sum_{i\neq t}2(\displaystyle \emZ_{s, i} - \mu_s)\frac{\partial}{\partial \displaystyle \emZ_{s, t}}(\displaystyle \emZ_{s, i} - \mu_s)\Big]\ & = \frac{2}{n-1}\Big[(\displaystyle \emZ_{s, t} - \mu_s)\frac{\partial}{\partial \displaystyle \emZ_{s, t}}\Big(\displaystyle \emZ_{s, t} - \frac{1}{n}\sum_{m=1}^n\displaystyle \emZ_{s, m}\Big) + \sum_{i\neq t}(\displaystyle \emZ_{s, i} - \mu_s)\frac{\partial}{\partial \displaystyle \emZ_{s, t}}\Big(\displaystyle \emZ_{s, i} - \frac{1}{n}\sum_{m=1}^n\displaystyle \emZ_{s, m}\Big)\Big]\ & = \frac{2}{n-1}\Big[(\displaystyle \emZ_{s, t} - \mu_s)\Big(1- \frac{1}{n}\Big) + \sum_{i\neq t} (\displaystyle \emZ_{s, i} - \mu_s) \Big(-\frac{1}{n}\Big)\Big]\ & = \frac{2}{n-1}\Big[(\displaystyle \emZ_{s, t} - \mu_s) - \frac{1}{n}\sum_{i=1}^n(\displaystyle \emZ_{s, i} - \mu_s)\Big]\ & = \frac{2}{n-1}\Big[(\displaystyle \emZ_{s, t} - \mu_s) - \frac{1}{n}\sum_{i=1}^n\displaystyle \emZ_{s, i} + \frac{1}{n}n\mu_s\Big]\ & = \frac{2}{n-1}\Big[(\displaystyle \emZ_{s, t} - \mu_s) - \mu_s + \mu_s\Big]\ & = \frac{2}{n-1}(\displaystyle \emZ_{s, t} - \mu_s), $$

$$ \displaystyle \vx^{(k-1)} = \displaystyle \vz^{(k - 1)} + \frac{t_{k-1}-1}{t_{k}}(\displaystyle \vz^{(k-1)} - \displaystyle \vz^{(k-2)}) $$

Algorithm: algorithm
[H]
  \caption{LISTA encoder $\mathcal{E}$: forward pass}
  \label{alg:LISTA}
\begin{algorithmic}
  \STATE {\bfseries Input:} Image $\displaystyle \vy\in\displaystyle \R^{d}$, number of iterations $L$\STATE {\bfseries Parameters:} $\displaystyle \mU\in\displaystyle \R^{d\times l}$, $\displaystyle \mS\in\displaystyle \R^{l\times l}$, $\displaystyle \vb\in \displaystyle \R^l$\STATE {\bfseries Output:} sparse code $\displaystyle \vz_{\mathcal{E}}\in\displaystyle \R^l$  
  \STATE $\displaystyle \vu = \displaystyle \mU \displaystyle \vy + \displaystyle \vb$
  \STATE $\displaystyle \vz_0 = \mathrm{ReLU}(\displaystyle \vu)$
  \FOR{$i=1$ {\bfseries to} $L$}
  \STATE $\displaystyle \vz_i = \mathrm{ReLU}(\displaystyle \vu + \displaystyle \mS \displaystyle \vz_{i-1})$
  % \STATE $t = u + S z_{i-1}$
  % \STATE $z_i = \mathrm{ReLU}(t)$
  \ENDFOR
  \STATE $\displaystyle \vz_{\mathcal{E}} = \displaystyle \vz_L$
\end{algorithmic}

Algorithm: algorithm
we want its own \AND
\AtBeginEnvironment{algorithmic}{\let\AND\algoAND}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{xfrac} % diagonal fractions
\usepackage{booktabs} % toprule
\usepackage{bbold} % \mathbb{1}
\allowdisplaybreaks
\usepackage{enumitem}

\newcommand{\defeq}{\mathrel{\mathop:}=}

\title{Sparse Coding with Multi-layer Decoders using Variance Regularization}

% Authors must not appear in the submitted version. They should be hidden
% as long as the tmlr package is used without the [accepted] or [preprint] options.
% Non-anonymous submissions will be rejected without review.


\author{\name Katrina Evtimova \email kve216@nyu.edu \\
    \addr Center for Data Science \\ New York University
    \AND
    \name Yann LeCun \email yann@cs.nyu.edu \\
    % \addr Department of Computer Science and Center for Data Science\\
    \addr Courant Institute and Center for Data Science\\
       New York University \\
    \addr Meta AI - FAIR
}


% The \author macro works with any number of authors. Use \AND 
% to separate the names and addresses of multiple authors.

\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}

\def\month{08}  % Insert correct month for camera-ready version
\def\year{2022} % Insert correct year for camera-ready version
\def\openreview{\url{ https://openreview.net/forum?id=4GuIi1jJ74}} % Insert correct link to OpenReview for camera-ready version


\begin{document}


\maketitle

\begin{abstract}
Sparse representations of images are useful in many computer vision applications.
Sparse coding with an $l_1$ penalty and a learned linear dictionary requires regularization of the dictionary to prevent a collapse in the $l_1$ norms of the codes. Typically, this regularization entails bounding the Euclidean norms of the dictionary's elements.
In this work, we propose a novel sparse coding protocol which prevents a collapse in the codes without the need to regularize the decoder. Our method regularizes the codes directly so that each latent code component has variance greater than a fixed threshold over a set of sparse representations for a given set of inputs. Furthermore, we explore ways to effectively train sparse coding systems with multi-layer decoders since they can model more complex relationships than linear dictionaries.
In our experiments with MNIST and natural image patches, we show that decoders learned with our approach have interpretable features both in the linear and multi-layer case. Moreover, we show that sparse autoencoders with multi-layer decoders trained using our variance regularization method produce higher quality reconstructions with sparser representations when compared to autoencoders with linear dictionaries. Additionally, sparse representations obtained with our variance regularization approach are useful in the downstream tasks of denoising and classification in the low-data regime. 
\end{abstract}

\section{Introduction}

Finding representations of data which are useful for downstream applications is at the core of machine learning and deep learning research \citep{bengio2013representation}. Self-supervised learning (SSL) is a branch of representation learning which is entirely data driven and in which the representations are learned without the need for external supervision such as semantic annotations.
In SSL, learning can happen through various pre-text tasks such as reconstructing the input with autoencoders \citep{vincent2008extracting}, predicting the correct order of frames in a video \citep{misra2016shuffle, fernando2017self},
%solving a jigsaw puzzle \cite{noroozi2016unsupervised}. 
enforcing equivariant representations across transformations of the same image \citep{he2019momentum, grill-2020-NEURIPS, zbontar2021barlow, bardes2021vicreg} or enforcing other invariance properties \citep{foldiak1991learning, wiskott2002slow, hyvarinen2003bubbles, chen2018sparse}. The pre-text tasks should be designed to extract data representations useful in other downstream tasks. 

% What is sparsity and why is it important? 
We focus on learning representations of images through \textit{sparse coding}, a self-supervised learning paradigm based on input reconstruction \citep{olshausen1996emergence}. 
% which has been studied extensively in the machine learning literature.  \todo{Citations}
A representation (or \textit{code}) in the form of a feature vector $\displaystyle \vz\in\displaystyle \R^l$ is considered sparse if only a small number of its components are active for a given data sample. Sparsity introduces the inductive bias that only a small number of factors are relevant for a single data point and it is a desirable property for data representations as it can improve efficiency and interpretability of the representations \citep{bengio2009learning, bengio2013representation}. 
While it is an open debate, sparse representations of images have been shown to improve classification performance  \citep{ranzato2007efficient, kavukcuoglu, makhzani2013k} and are popular for denoising and inpainting applications \citep{elad2006image,mairal2007sparse}. They are biologically motivated as the primary visual cortex uses sparse activations of neurons to represent natural scenes \citep{olshausen1996emergence, yoshida2020natural}. 

% What is sparse coding and dictionary learning and why is it important? 
In sparse coding, the goal is to find a latent sparse representation $\displaystyle \vz\in\displaystyle \R^l$ which can reconstruct an input $\displaystyle \vy\in\displaystyle \R^d$ using a \textit{learned decoder} $\mathcal{D}$. Typically, the decoder is linear, namely $\mathcal{D}\in\displaystyle \R^{d\times l}$, and the sparse code $\displaystyle \vz$ selects a linear combination of a few of its columns to reconstruct the input. In practice, a sparse code $\displaystyle \vz$ is computed using an \textit{inference algorithm} which minimizes $\displaystyle \vz$'s $l_1$ norm and simultaneously optimizes for a good reconstruction of the input $\displaystyle \vy$ using $\mathcal{D}$'s columns. Learning of the decoder $\mathcal{D}$ typically happens in an iterative EM-like fashion. Inference provides a set of sparse codes corresponding to some inputs (expectation step). Given these codes, the decoder's parameters are updated using gradient descent to minimize the error between the original inputs and their reconstructions (maximization step). 
% The expectation step is to find a sparse representation $z $ which minimizes the reconstruction error for a given input using an inference algorithm and a fixed decoder. The maximization step is to update the parameters of the decoder in a way which reduces the reconstruction error with the codes computed in the first step.
Sparse coding systems in which the decoder is non-linear are hard to learn and often require layer-wise training \citep{zeiler2010deconvolutional, he2014unsupervised}. In this work, we explore ways to effectively train sparse coding systems with non-linear decoders using end-to-end learning. 

% What is an existing limitation we are addressing? 
%In order to prevent the decoder from ``cheating'', i.e. inflating its weights to reduce the $l_1$  norm of $z$, the norm of $\mathcal{D}$'s columns (or atoms) is set to a constant value: $|\mathcal{D}_{(:,i)}| = c$.  
% One of the challenges in working with non-linear decoders is ...

In order to successfully train a sparse coding system which learns a linear dictionary $\mathcal{D}$, it is common to bound the norms of the dictionary's weights in order to avoid collapse in the $l_1$ norms of the codes. Indeed, consider a reconstruction $\tilde{\displaystyle \vy} $ obtained from a dictionary $\mathcal{D}$ and code $\displaystyle \vz$, namely $\tilde{\displaystyle \vy} = \mathcal{D}\displaystyle \vz$. The same reconstruction can be obtained from a re-scaled dictionary $\mathcal{D}' = c\mathcal{D}$ and code $\displaystyle \vz' = \displaystyle \vz/c$ for any non-zero multiple $c$. When $c>1$, the representation $\displaystyle \vz'$  has a smaller $l_1$ norm than $\displaystyle \vz$ but produces the same reconstruction. In practice, if $\mathcal{D}$'s weights are unbounded during training, they become arbitrarily large, while the $l_1$ norms of the latent representations become arbitrarily small which leads to a collapse in their $l_1$ norm. 
% Limiting $\mathcal{D}$'s weights reduces the set of possible solutions  and prevents the scenario in which the decoder's weights become arbitrarily large during training while the $l_1$ norms of the latent representations become arbitrarily small which in practice leads to a collapse in their $l_1$ norm. 
Typically, in order to restrict a linear dictionary $\mathcal{D}$, its columns are re-scaled to have a constant $l_2$ norm \citep{lee2006efficient} or weight decay is applied to its weights \citep{sun2018supervised}.  
% However, it is not trivial to extend this normalization procedure to the case when the decoder $\mathcal{D}$ is a multi-layer neural network and at the same time learn useful features in $\mathcal{D}$. 
However, in the case when $\mathcal{D}$ is a multi-layer neural network trained end-to-end, it is not clear what the best strategy to bound the decoder's weights is in order to avoid collapse in the codes and at the same time learn useful features.

% What is our approach? 

% Instead of normalizing $\mathcal{D}$ directly, we propose to modify the inference procedure for computing the latent representations ensuring that their norm cannot become arbitrarily small. In particular, we modify a popular inference algorithm for computing sparse representations called the Iterative Shrinkage-Thresholding Algorithm  \cite{beck2009fast} by adding a hinge regularization term to its objective function which encourages the standard deviation of latent components of codes for a batch of target images to have a fixed lower bound. 
Instead of restricting $\mathcal{D}$'s weights, we propose to prevent collapse in the $l_1$ norm of the codes directly by applying variance regularization to each latent code component.  In particular, we add a regularization term to the energy  minimized during inference for computing a set of codes. This term encourages the variance of each latent code component to be greater than a fixed lower bound. This strategy is similar to the variance principle presented in \cite{bardes2021vicreg} and related to methods that directly regularize the latent codes \citep{foldiak1990forming, olshausen1996emergence, barello2018sparse}. Our strategy ensures that the norms of the sparse codes do not collapse and we find that it removes the need to explicitly restrict $\mathcal{D}$'s parameters. 
% What are the results and implications/significance? 
Since this variance regularization term is independent from $\mathcal{D}$'s architecture, we can use it when the decoder is a deep neural network. In this work, we experiment with fully connected sparse autoencoders with a one hidden layer decoder and successfully train such models using our variance regularization strategy. 

The main contributions of our work are:
\begin{itemize}[leftmargin=*]
    % \item We introduce an alternative  to normalizing the columns of a linear dictionary by encouraging the latent components to have variance above a fixed threshold.
    % \item We introduce an effective way to train sparse autoencoders with a linear decoder without the need to fix the $l_2$ norm of the decoder's columns but instead by encouraging the latent code components to have variance above a given threshold. 
    \item We introduce an effective way to train sparse autoencoders and prevent collapse in the $l_1$ norms of the codes that they produce without the need to regularize the decoder's weights but instead by encouraging the latent code components to have variance above a fixed threshold. 
    \item We show empirically that linear decoders trained with this variance regularization strategy are comparable to the ones trained using the standard $l_2$-normalization approach in terms of the features that they learn and the quality of the reconstructions they produce. Moreover, variance regularization successfully extends sparse coding to the case of a non-linear fully connected decoder with one hidden layer.
    \item Additionally, our experiments show that sparse representations obtained from  autoencoders with a non-linear decoder trained using our proposed variance regularization strategy: 1) produce higher quality reconstructions than ones obtained from autoencoders with a linear decoder, 2) are helpful in the downstream task of classification in the low-data regime. 
\end{itemize}

    % \item We introduce an alternative  to normalizing the columns of a linear dictionary by encouraging the latent components to have variance above a fixed threshold.
    % \item We introduce an effective way to train sparse autoencoders with a linear decoder without the need to fix the $l_2$ norm of the decoder's columns but instead by encouraging the latent code components to have variance above a given threshold. 
    % $\circ$ We introduce an effective way to train sparse autoencoders and prevent collapse in the $l_1$ norms of the codes that they produce without the need to regularize the decoder's weights but instead by encouraging the latent code components to have variance above a fixed threshold. 
    
    % We show empirically that linear decoders trained with this variance regularization strategy are comparable to the ones trained using the standard $l_2$-normalization approach in terms of the features that they learn and the quality of the reconstructions they produce. Moreover, variance regularization allows us to successfully extend sparse coding to the case of a non-linear fully-connected decoder with one hidden layer.
    
    % Additionally, our experiments show that sparse representations obtained from  autoencoders with a non-linear decoder trained using our proposed variance regularization strategy: 1) produce higher quality reconstructions than ones obtained from autoencoders with a linear decoder, 2) are helpful in the downstream task of classification in the low-data regime. 


\label{sec:method}
\begin{figure}[ht]
\centering
\begin{subfigure}[b]{\columnwidth}
  \centerline{\includegraphics[scale=0.2]{plots/model-inference.pdf}}
  \caption{Inference of a sparse code $\vz$.}
  \label{model-inference}
\end{subfigure}
\begin{subfigure}[b]{\columnwidth}
  \centering
  \includegraphics[scale=0.2]{plots/model-learning.pdf}
  \caption{Learning of the encoder $\mathcal{E}$ and decoder $\mathcal{D}$.}
  \label{model-learning}
\end{subfigure}
\caption{Our sparse autoencoder setup. $C$ denotes mean squared error. The inference step in Figure \ref{model-inference} finds a code $\vz^*$ which minimizes the reconstruction error between the input $\vy$ and its reconstruction $\tilde{\vy}$. The code is subject to sparsity and variance regularization, denoted by $R(z)$. Additionally, the code is regularized to stay close to the encoder's predictions $\vz_{\mathcal{E}}$. During the learning step in  Figure \ref{model-learning}, the encoder $\mathcal{E}$ is trained to predict the codes $\displaystyle \vz^*$ computed during inference and the decoder $\mathcal{D}$ learns to reconstruct inputs $\displaystyle \vy$ from the sparse codes $\displaystyle \vz^*$.}
\label{fig:SAE}
\end{figure}

\section{Method}
A schematic view of our sparse autoencoder setup is presented in Figure \ref{fig:SAE}. At a high level, given an input $\displaystyle \vy$ and a fixed decoder $\mathcal{D}$, we perform inference using a modified version of the FISTA algorithm \citep{beck2009fast} to find a sparse code $\displaystyle \vz^*$ which can best reconstruct $\displaystyle \vy$ using $\mathcal{D}$'s elements. The decoder's weights are trained by minimizing the mean squared error between the input $\displaystyle \vy$ and the reconstruction $\tilde{\displaystyle \vy}$ computed from $\displaystyle \vz^*$. The encoder $\mathcal{E}$ is trained to predict the codes $\displaystyle \vz^*$ computed during inference. More details on each of the components in Figure \ref{fig:SAE} are presented in the following sections.
% \begin{figure}[H]
%     \centering
%     % \includegraphics[scale=0.2]{plots/model_SAE.pdf}
%     % \includegraphics[scale=0.2]{plots/model_z_star.pdf}
%     \includegraphics[scale=0.2]{plots/model_elipse.pdf}
%     \caption{Our sparse autoencoder setup. The encoder $\mathcal{E}$ is trained to predict the codes $\displaystyle \vz^*$ computed from inference with FISTA. The decoder $\mathcal{D}$ learns to reconstruct inputs $\displaystyle \vy$ from the sparse codes $\displaystyle \vz^*$.}
%     \label{fig:SAE}
% \end{figure}


% Overview of Sparse Coding
\subsection{Background: Sparse Coding and Dictionary Learning}
\label{sparse-coding}
\textbf{Inference~~} Sparse coding algorithms with an $l_1$ sparsity penalty and a linear decoder $\mathcal{D}\in\displaystyle \R^{d\times l}$ perform inference to find a latent sparse representation $\displaystyle \vz \in\displaystyle \R^l$ of a given data sample $\displaystyle \vy\in\displaystyle \R^d$ which minimizes the following energy function:
\begin{equation}
\label{sc-energy}
    \mathbf{E}(\displaystyle \vz,\displaystyle \vy, \mathcal{D}) = \frac{1}{2}\|\displaystyle \vy - \mathcal{D}\displaystyle \vz\|_2^2 + \lambda \|\displaystyle \vz\|_1.
\end{equation}
The first term in (\ref{sc-energy}) is the reconstruction error for the sample $\displaystyle \vy$ using code $\displaystyle \vz$ and decoder $\mathcal{D}$. The second term is a regularization term which penalizes the $l_1$ norm of $\displaystyle \vz$. In essence, each code $\displaystyle \vz$ selects a linear combination of the decoder's columns. The decoder can thus be thought of as a \textit{dictionary} of building blocks. Inference algorithms find a solution for the minimization problem in (\ref{sc-energy}) keeping $\mathcal{D}$ fixed, namely 
\[\displaystyle \vz^* = \arg\underset{\displaystyle \vz}{\min}~\mathbf{E}(\displaystyle \vz,\displaystyle \vy, \mathcal{D}).\]

\textbf{Learning the Decoder~~}The decoder's parameters can be learned through gradient-based optimization by minimizing the mean squared reconstruction error between elements in the training set $\mathcal{Y}=\{\displaystyle \vy_1, \ldots, \displaystyle \vy_N\}\subset \displaystyle \R^{d}$ and their corresponding reconstructions from the sparse representations $\mathcal{Z}^* = \{\displaystyle \vz_1^* , \ldots, \displaystyle \vz_N^* \}\subset \displaystyle \R^{l}$ obtained during inference, namely:
\begin{equation}
\label{d-obj}
\arg\underset{\mathcal{D}}{\min}~\mathcal{L}_{\mathcal{D}}(\mathcal{D},\mathcal{Z}^*,\mathcal{Y}) = \arg\underset{\mathcal{D}}{\min}~\frac{1}{N}\sum_{i=1}^N\|\displaystyle \vy_i - \mathcal{D}\displaystyle \vz_i^*\|_2^2,
\end{equation}
where $\mathcal{D}$'s columns are typically projected onto the unit sphere to avoid collapse in the latent codes, namely $\|\mathcal{D}_{:,i}\|_2^2 = 1$ for $\mathcal{D}_{:,i} \in \mathbb{R}^{d}$.
% Overview of FISTA
\subsection{Background: Inference with ISTA and FISTA}
\label{ISTA-FISTA}
We provide an overview of the Iterative Shrinkage-Thresholding Algorithm (ISTA) and its faster version FISTA which we adopt in our setup.  
ISTA is a proximal gradient method which performs inference to find a sparse code $\displaystyle \vz^*$ which minimizes 
the energy in (\ref{sc-energy}) for a given input $\displaystyle \vy$ and a fixed dictionary $\mathcal{D}$. We can re-write the inference minimization problem as:
\begin{equation}
\label{inference-Z}
    \displaystyle \vz^* = \arg\underset{\displaystyle \vz}{\min}~\mathbf{E}(\displaystyle \vz, \displaystyle \vy, \mathcal{D}) = \arg\underset{\displaystyle \vz}{\min}~ f(\displaystyle \vz) + g(\displaystyle \vz),
\end{equation} 
where $f(\displaystyle \vz) \defeq \frac{1}{2}\|\displaystyle \vy - \mathcal{D}\displaystyle \vz\|_2^2$ is the reconstruction term and $g(\displaystyle \vz) \defeq \lambda \|\displaystyle \vz\|_1$ is the $l_1$ penalty term. In ISTA, the code $\displaystyle \vz$ can be given any initial value and is typically set to the zero vector: $\displaystyle \vz^{(0)} = \mathbf{0}\in \displaystyle \R^l$. An ISTA iteration consists of two steps: a gradient step and a shrinkage step.

\textbf{Gradient Step~~} To compute code $\displaystyle \vz^{(k)}$ from its previous iteration $\displaystyle \vz^{(k-1)}$, where $k\in\{1, \ldots, K\}$, ISTA first performs the gradient update by taking gradient step of size $\eta_k$ with respect to the squared error between the input and the reconstruction from $\displaystyle \vz^{(k-1)}$: \begin{equation}
\label{ista-grad}
     \textrm{ISTA gradient~step:~}\tilde{\displaystyle \vz}^{(k)}  =  \displaystyle \vz^{(k-1)} - \eta_k\nabla f(\displaystyle \vz^{(k-1)}),
    %  = z_{k} - \eta_k\nabla_{z_{k}} \frac{1}{2}\|y - \mathcal{D}z_k\|_2^2
\end{equation}

% \begin{align}
% \label{proximal}
%      \textrm{Proximal~gradient~step:~}\mathrm{prox}_{t}(z_{k+0.5}) &= \arg \underset{x\in\displaystyle \R^l}{\min}~\frac{1}{2t} \|x - (z - t \nabla f(z))\|_2^2 + g(x) \\
%      & = \tau_{\lambda t } (z - t \nabla f(z)).
% \end{align}
The size of the gradient step $\eta_k$ at iteration $k$ can be determined with backtracking \citep{beck2009fast}. 

\textbf{Shrinkage Step~~} The shrinkage step is applied to the intermediate output from the gradient step in ISTA. It uses the shrinkage function $\tau_{\alpha}$
%$\tau_{\alpha}\colon \displaystyle \R^n \to \displaystyle \R^n$  
which sets components of its input to 0 if their absolute value is less than $\alpha$ or otherwise contracts them by $\alpha$, namely $[\tau_{\alpha}(\displaystyle \vx)]_j = \textrm{sign}(\displaystyle \evx_j)(|x_j| - \alpha)_{+}$ for some threshold $\alpha>0$ where $j$ traverses the components of its input vector $\displaystyle \vx$. The shrinkage step for ISTA is given by: 
\begin{equation}
    \textrm{ISTA~shrinkage~step:~}\displaystyle \vz^{(k)} = \tau_{\lambda\eta_k} (\tilde{\displaystyle \vz}^{(k)})
\end{equation}
where  $\lambda > 0$ is the hyperparameter which determines the sparsity level. In the case when the codes are restricted to be non-negative, a property that we adopt in our setup, the shrinkage function can be modified to set all negative components in its output to 0: $[\tilde{\tau}_{\alpha}(\displaystyle \vx)]_j = ([\tau_{\alpha}(\displaystyle \vx)]_j)_{+}$.

% The convergence criterion we set for ISTA is: $\frac{\|z^{(k)} - z^{(k-1)}\|_2}{\|z^{(k-1)}\|_2} < 1\mathrm{e}{-3}$. ISTA's output is $z^* = z^{(k^*)}$ where $k^*$ is the index of the first iteration for which the convergence criterion holds.

\textbf{FISTA~~} In our setup, we use FISTA \citep{beck2009fast} which accelerates the convergence of ISTA by adding a momentum term in each of its iterations. Namely, $\displaystyle \vz^{(k-1)}$ in (\ref{ista-grad}) is replaced by:
% \[\displaystyle \vx^{(k-1)} = \displaystyle \vz^{(k - 1)} + \frac{t_{k-1}-1}{t_{k}}(\displaystyle \vz^{(k-1)} - \displaystyle \vz^{(k-2)}) \]
\begin{equation}
\label{FISTA-momentum}\displaystyle \vx^{(k)} = \displaystyle \vz^{(k-1)} + \frac{t_{k-1}-1}{t_{k}}(\displaystyle \vz^{(k-1)} - \displaystyle \vz^{(k-2)})
\end{equation}
for $k\geq 2$ where $t_k = \frac{1 + \sqrt{1+4t_{k-1}^2}}{2}$ and $t_1 = 1$. Thus, the FISTA gradient step becomes:
\begin{equation}
\label{FISTA-grad-step}
    \tilde{\displaystyle \vz}^{(k)}  =  \displaystyle \vx^{(k)} - \eta_k\nabla f(\displaystyle \vx^{(k)}).
\end{equation}
FISTA employs the same shrinkage step as ISTA.

\subsection{Modified Inference with Variance Regularization on the Latent Code Components}
\label{modified-ISTA}
% In order to prevent the decoder from ``cheating'' by inflating its weights arbitrarily, we propose to regularize the codes computed during inference as an alternative to constraining the norms of the decoder's columns directly. 
In order to prevent collapse in the $l_1$ norm of the latent codes, we propose to ensure that the variance of each latent code component stays above a pre-set threshold. To achieve this, we add a regularization term to the energy in (\ref{inference-Z}) which encourages the variance of all latent component across a mini-batch of codes to be greater than a pre-set threshold. In particular, we replace the function $f(\displaystyle \vz) = \frac{1}{2}\|\displaystyle \vy - \mathcal{D}\displaystyle \vz\|_2^2 $ in (\ref{inference-Z}) with a new function $\tilde{f}(\displaystyle \mZ)$ defined over a set of codes $\displaystyle \mZ\in\displaystyle \R^{l\times n}$ corresponding to a mini-batch of data samples $Y = \{\displaystyle \vy_1, \ldots, \displaystyle \vy_n\}$ as:
\begin{align}
\label{reg-obj}
\tilde{f}(\displaystyle \mZ) 
&= \sum_{i=1}^n \frac{1}{2}\|\displaystyle \vy_i - \mathcal{D}\displaystyle \mZ_{:, i}\|_2^2 + \sum_{j=1}^l\beta\Big[\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{j, :}})\Big)_+\Big]^2.
% \\
% &=\sum_{i=1}^n \frac{1}{2}\|y_i - \mathcal{D}\displaystyle \mZ_{:, i}\|_2^2 + \beta\sum_{j=1}^l\max\Bigg(\Bigg(1 - \sqrt{\frac{1}{n}\sum_{i=1}^n(\displaystyle \mZ_{j, i}- \mu_j)^2}\Bigg)^2, ~ 0   \Bigg)  
\end{align}
The first sum in (\ref{reg-obj}) is over the reconstruction terms from each code $\displaystyle \mZ_{:, i}\in \displaystyle \R^l$. The second sum 
in (\ref{reg-obj}) is over  squared hinge terms involving the variance of each latent component $\displaystyle \mZ_{j, :}\in\displaystyle \R^n$ across the batch where $\displaystyle \Var(\displaystyle \mZ_{j, :}) = \frac{1}{n-1}\sum_{i=1}^n(\displaystyle \mZ_{j, i}- \mu_j)^2$ and $\mu_j$ is the mean across the $j$-th latent component, namely $\mu_j = \frac{1}{n}\sum_{i=1}^n\displaystyle \emZ_{j,i}$. The hinge terms are non-zero for any latent dimension whose variance is below the fixed threshold of $\sqrt{T}$.

\textbf{Modified Energy: Variance Regularization~~} Using $\tilde{f}(\displaystyle \mZ)$ from (\ref{reg-obj}) and setting $\tilde{g}(\displaystyle \mZ) \defeq  \sum_{i=1}^{n} g(\displaystyle \mZ_{:, i})$, our modified objective during inference is to minimize the energy:
\begin{align}
    \label{modified-obj}
    \tilde{\mathbf{E}}(\displaystyle \mZ,Y,\mathcal{D})&  = \tilde{f}(\displaystyle \mZ) + \tilde{g}(\displaystyle \mZ) 
    % \nonumber \\& 
    = \sum_{i=1}^{n}\frac{1}{2}\|\displaystyle \vy_i - \mathcal{D}\displaystyle \mZ_{:, i}\|_2^2+  \sum_{j=1}^l\beta\Big[\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{j, :}})\Big)_+\Big]^2  + \sum_{i=1}^n\lambda \|\displaystyle \mZ_{:, i}\|_1
\end{align}
with respect to the codes $\displaystyle \mZ$ for a mini-batch of samples $Y$ and a fixed dictionary $\mathcal{D}$.
Note that the hinge terms counteract with the $l_1$ penalty terms in (\ref{modified-obj}). The hinge terms act as regularization terms which encourage the variance of each of the latent code components to remain above the threshold of $\sqrt{T}$. This should prevent a collapse in the $l_1$ norm of the latent codes directly and thus remove the need to normalize the weights of the decoder. 

% [TODO] Squaring the hinge terms magnifies the gradients coming from them and in practice we observe that with a sufficiently large value of $\beta$ the average hinge loss over a mini-batch of codes goes to values in the range of $1\mathrm{e}{-4}$ in the first few iterations of FISTA and remains low for the rest of the iterations.

\textbf{Gradient Step~} The first step in our modified version of FISTA is to take a gradient step for each code $\displaystyle \mZ_{:, t}\in\displaystyle \R^l$: 
\begin{equation}
    \label{eq:mod-grad-step}
    \tilde{\displaystyle \mZ}_{:, t}^{(k)} = \displaystyle \mZ_{:, t}^{(k-1)} - \eta_k \nabla_{\displaystyle \mZ_{:, t}^{(k-1)} }\tilde{f}(\displaystyle \mZ).
    % \footnote{Since we use FISTA, the gradient step is applied to momentum terms as defined in (\ref{FISTA-momentum}) but we omit this notation for simplicity.}
\end{equation}
In practice, since we use FISTA, the code $\displaystyle \mZ_{:, t}^{(k-1)}$ in (\ref{eq:mod-grad-step}) is replaced by a momentum term as defined in (\ref{FISTA-momentum}) but here we omit this notation for simplicity. The gradient of the sum of reconstruction terms in (\ref{reg-obj}) with respect to $\displaystyle \emZ_{s, t}$, one of the latent components of code $\displaystyle \mZ_{:, t}$, for a linear decoder $\mathcal{D}$ is:
\begin{align}
\label{z-grad-rec-sk}
    \frac{\partial}{\partial \displaystyle \emZ_{s, t}}  \sum_{i=1}^n \frac{1}{2}\|\displaystyle \vy_i - \mathcal{D}\displaystyle \mZ_{:, i}\|_2^2 &= \mathcal{D}_s^T (\mathcal{D} \displaystyle \mZ_{:, t} - \displaystyle \vy_t),
\end{align}
where $\mathcal{D}_s$ denotes the $s$-th column of the dictionary $\mathcal{D}$.\footnote{The gradient for any neural network decoder $\mathcal{D}$ can be computed through automatic differentiation.}
The gradient of the sum of hinge terms in (\ref{reg-obj}) with respect to $\displaystyle \emZ_{s, t}$ is:
\begin{align}
\label{z-grad-hinge}
    \frac{\partial}{\partial \displaystyle \emZ_{s, t}} \sum_{j=1}^l\beta\Big[\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{j, :}})\Big)_+\Big]^2
    & = \begin{cases} 
    -\frac{2\beta}{n-1}\frac{T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})}}{\sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})}}(\displaystyle \emZ_{s, t} - \mu_s),& \mathrm{if~} \sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})} < T\\
    ~0 & \mathrm{otherwise}.
    \end{cases}
\end{align}
Even though the hinge terms in (\ref{reg-obj}) are not smooth convex functions\footnote{A requirement for the FISTA algorithm.}, the fact that the gradient in (\ref{z-grad-hinge}) is a line implies that the hinge term behaves locally as a convex quadratic function. A visualization of $h(\displaystyle \vx) = \big[\big(1 - \sqrt{\displaystyle \Var(\displaystyle \vx)}\big)_+\big]^2$ for $\displaystyle \vx\in \displaystyle \R^2$ and the full derivation of (\ref{z-grad-hinge}) are available in Appendix \ref{app:variance-regularization-term} and \ref{app:hinge-gradient-derivation}, respectively.

\textbf{Shrinkage Step~~} Similarly to the regular FISTA protocol, the shrinkage step in our modified FISTA is applied to the intermediate results from the gradient step in (\ref{eq:mod-grad-step}). 

\subsection{Encoder for Amortized Inference}
\label{LISTA}

One limitation of the proposed variance regularization is that the modified inference procedure depends on batch statistics (see equation \ref{z-grad-hinge}) and thus the codes are not deterministic. To address this limitation, we propose to train an encoder $\mathcal{E}$ simultaneously with the decoder $\mathcal{D}$ to predict the sparse codes computed from inference with our modified FISTA. 
After training of the encoder and decoder is done, the encoder can be used to compute sparse codes for different inputs independently from each other in a deterministic way. Additionally, using an encoder reduces the inference time: instead of performing inference with FISTA, the encoder computes sparse representations directly from the inputs and thus provides \textit{amortized inference}.

\textbf{Learning the Encoder~~}The encoder is trained using mini-batch gradient descent to minimize the following mean-squared  error between its predictions and the sparse codes $\displaystyle \mZ^*\in\displaystyle \R^{l\times n}$ computed from inference with FISTA for inputs $Y = \{\displaystyle \vy_1, \ldots, \displaystyle \vy_n\}$:
% We explore two ways to address this: omitting the variance regularization term entirely during validation . One reason for this might be that the without the gradient in \ref{z-grad-hinge}, the codes' magnitudes do not reach same level of standard deviation. To alleviate this issue, we train an encoder $\mathcal{E}$ to predict the codes computed with our modified FISTA by minimizing the following mean-squared error loss:
\begin{equation}
\label{e-obj}
\arg\underset{\mathcal{E}}{\min}~\mathcal{L}_{\mathcal{E}}(\mathcal{E},\displaystyle \mZ^*,Y) = \arg\underset{\mathcal{E}}{\min}~\frac{1}{n}\sum_{i=1}^n\|\mathcal{E}(\displaystyle \vy_i)- \displaystyle \mZ_{:, i}^*\|_2^2.
\end{equation}
% Training of the encoder happens concurrently with the training of the decoder. The decoder's and encoder's parameters are updated using gradient descent after each batch of inputs. 
% During validation, we follow the standard ISTA protocol using the encoder's predictions as initialization for the latent codes $z$. The addition of the encoder is a step towards amortized inference and can potentially reduce the average number of ISTA iterations performed at validation time. In all of the equations above the decoder $\mathcal{D}$ can be replaced by any neural network. 
Note that the codes $\displaystyle \mZ^*$ are treated as constants. Training the encoder this way is similar to target propagation \citep{lee2015difference}.

\textbf{Modified Energy: Variance and Encoder Regularization~~} To encourage FISTA to find codes which can be learned by the encoder, we further modify the inference protocol by adding a term to $\tilde{f}(\displaystyle \mZ)$ which penalizes the distance between FISTA's outputs and the encoder's outputs:
\begin{align}
\label{reg-obj-final}
\tilde{f}(\displaystyle \mZ) 
&= \sum_{i=1}^n \frac{1}{2}\|\displaystyle \vy_i - \mathcal{D}\displaystyle \mZ_{:, i}\|_2^2 + \beta\sum_{j=1}^l\Big[\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{j, :}})\Big)_+\Big]^2 + \gamma \sum_{i=1}^n\|\displaystyle \mZ_{:, i} - \mathcal{E}(\displaystyle \vy_i)\|_2^2.
% \\
% &=\sum_{i=1}^n \frac{1}{2}\|y_i - \mathcal{D}\displaystyle \mZ_{:, i}\|_2^2 + \beta\sum_{j=1}^l\max\Bigg(\Bigg(1 - \sqrt{\frac{1}{n}\sum_{i=1}^n(\displaystyle \mZ_{j, i}- \mu_j)^2}\Bigg)^2, ~ 0   \Bigg)  
\end{align}
This regularization term ensures that the codes computed during inference with FISTA do not deviate too much from the encoder's predictions. In our setup, the encoder's predictions are treated as constants and are used as the initial value for codes in FISTA, namely $\displaystyle \mZ_{:, i}^{(0)} \defeq \mathcal{E}(\displaystyle \vy_i)$. This can reduce the inference time by lowering the number of FISTA iterations needed for convergence if the encoder provides good initial values. With the new $\tilde{f}(\displaystyle \mZ)$ from equation \ref{reg-obj-final}, the energy minimized during inference becomes:
\begin{align}
    \label{final-obj}
    \tilde{\mathbf{E}}(\displaystyle \mZ,Y,\mathcal{D})& 
    = \sum_{i=1}^n \frac{1}{2}\|\displaystyle \vy_i - \mathcal{D}\displaystyle \mZ_{:, i}\|_2^2 + \beta\sum_{j=1}^l\Big[\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{j, :}})\Big)_+\Big]^2 + \gamma \sum_{i=1}^n\|\displaystyle \mZ_{:, i} - \mathcal{E}(\displaystyle \vy_i)\|_2^2+ \sum_{i=1}^n\lambda \|\displaystyle \mZ_{:, i}\|_1.
\end{align}

\section{Experimental Setup}
\subsection{Encoder and Decoder Architectures}
\label{enc-dec-arch}

\begin{minipage}{0.48\textwidth}
\textbf{LISTA Encoder~~} The encoder's architecture is inspired by the Learned ISTA (LISTA) architecture \citep{gregor2010learning}. LISTA is designed to mimic outputs from inference with ISTA and resembles a recurrent neural network. Our encoder consists of two fully connected layers $\displaystyle \mU\in\displaystyle \R^{d\times l}$ and $\displaystyle \mS\in\displaystyle \R^{l\times l}$, a bias term $\displaystyle \vb\in \displaystyle \R^l$, and ReLU activation functions. Algorithm \ref{alg:LISTA} describes how the encoder's outputs are computed. Our version of LISTA produces non-negative codes by design. We refer to our encoder as LISTA in the rest of the paper.
\end{minipage}
\hfill
\begin{minipage}{0.49\textwidth}
\begin{algorithm}[H]
  \caption{LISTA encoder $\mathcal{E}$: forward pass}
  \label{alg:LISTA}
\begin{algorithmic}
  \STATE {\bfseries Input:} Image $\displaystyle \vy\in\displaystyle \R^{d}$, number of iterations $L$\STATE {\bfseries Parameters:} $\displaystyle \mU\in\displaystyle \R^{d\times l}$, $\displaystyle \mS\in\displaystyle \R^{l\times l}$, $\displaystyle \vb\in \displaystyle \R^l$\STATE {\bfseries Output:} sparse code $\displaystyle \vz_{\mathcal{E}}\in\displaystyle \R^l$  
  \STATE $\displaystyle \vu = \displaystyle \mU \displaystyle \vy + \displaystyle \vb$
  \STATE $\displaystyle \vz_0 = \mathrm{ReLU}(\displaystyle \vu)$
  \FOR{$i=1$ {\bfseries to} $L$}
  \STATE $\displaystyle \vz_i = \mathrm{ReLU}(\displaystyle \vu + \displaystyle \mS \displaystyle \vz_{i-1})$
  % \STATE $t = u + S z_{i-1}$
  % \STATE $z_i = \mathrm{ReLU}(t)$
  \ENDFOR
  \STATE $\displaystyle \vz_{\mathcal{E}} = \displaystyle \vz_L$

MNIST	MNIST	MNIST	MNIST	MNIST	MNIST	MNIST	MNIST
Model	λ	L 0 ( z ˜ y )	PSNR ˜ y	σ	PSNR y σ	L 0 ( z ˜ y σ )	PSNR ˜ y σ
SDL	0 . 005	89 . 7% ± 0 . 2%	17 . 3 ± 0 . 028	1.0	10 . 2	87 . 3% ± 0 . 2%	16 . 9 ± 0 . 053
VDL	0 . 02	91 . 8% ± 0 . 1%	17 . 7 ± 0 . 023	1.0	10 . 2	88 . 1% ± 0 . 2%	15 . 4 ± 0 . 175
SDL	0 . 005	89 . 7% ± 0 . 2%	17 . 3 ± 0 . 028	1.5	6 . 7	84 . 6% ± 0 . 5%	15 . 7 ± 0 . 204
VDL	0 . 02	91 . 8% ± 0 . 1%	17 . 7 ± 0 . 023	1.5	6 . 7	84 . 5% ± 0 . 3%	12 . 2 ± 0 . 275
ImageNet patches	ImageNet patches	ImageNet patches	ImageNet patches	ImageNet patches	ImageNet patches	ImageNet patches	ImageNet patches
SDL	0 . 002	74 . 6% ± 0 . 1%	26 . 9 ± 0 . 007	1.0	20 . 0	72 . 7% ± 0 . 1%	25 . 6 ± 0 . 011
VDL	0 . 005	77 . 3% ± 0 . 2%	27 . 3 ± 0 . 058	1.0	20 . 0	73 . 5% ± 0 . 2%	25 . 2 ± 0 . 035

Notation	Description
E	LISTA-inspired encoder.
D	Decoder: linear or fully connected with one hidden layer.
m	Dimension of the decoder's hidden layer.
λ	Hyperparameter for the sparsity regularization.
β	Hyperparameter for the variance regularization of the latent codes.
γ	Hyperparameter encouraging the FISTA codes to be close to the encoder's predictions.
d	Dimension of the input data.
y	Input sample in R d .
N	Total number of training samples.
n	Mini-batch size.
T	Number of ISTA/FISTA iterations.
η k	Step size for ISTA/FISTA gradient step.
η D	Learning rate for the decoder.
η E	Learning rate for the encoder.
τ α	Shrinkage function [ τ α ( x )] j = sign( x j )(	x j	- α ).
˜ τ α	Non-negative shrinkage function [˜ τ α ( x )] j = ([ τ α ( x )] j ) + .
l	Dimension of the latent code. .
z	Latent code in R l
Z	Mini-batch of latent codes in R l × n .
W	Parametrization in R d × l of a linear dictionary D
W 1	. Parametrization in R m × l of the bottom layer of a fully connected decoder D with one hidden layer.
W 2	Parametrization of a the top layer of a fully connected decoder D with one hidden .
	layer in R d × m
b 1	Parametrization of the bias term in R m following the bottom layer of a fully connected decoder D with one hidden layer.
b	Parametrization of the bias term in the LISTA encoder (see 3.1).
SDL	Standard Dictionary Learning with a LISTA encoder and a linear decoder whose columns have a fixed l 2 norm.
SDL-NL	Standard Dictionary learning with a LISTA encoder and a non-linear fully connected decoder. Each layer in the decoder has columns with a fixed l 2 norm.
VDL	Variance-regularized Dictionary Learning with a LISTA encoder and a linear decoder in which regularization is applied to the sparse codes encouraging the variance across the latent components to be above a fixed threshold.
VDL-NL	Variance-regularized Dictionary Learning (as above) with a non-linear decoder.
WDL	Dictionary Learning with a LISTA encoder and a linear decoder trained with weight decay regularization.
WDL-NL	Dictionary Learning with a LISTA encoder and a non-linear decoder trained with weight decay regularization.
DO	Standard Dictionary Learning without an encoder and with a linear decoder whose columns have a fixed l 2 norm.
DO-NL	Standard Dictionary Learning without an encoder and with a non-linear decoder whose columns have a fixed l 2 norm.

dataset	model	ep	λ	γ	β	T	η D	η E	wd( b 1 )	wd( b )	η k
MNIST	SDL	200	0	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	SDL	200	1e - 4	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	SDL	200	5e - 4	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	SDL	200	1e - 3	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	SDL	200	3e - 3	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	SDL	200	5e - 3	1	0	-	1e - 3	3e - 4	-	0	1
MNIST	VDL	200	0	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	VDL	200	1e - 3	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	VDL	200	3e - 3	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	VDL	200	5e - 3	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	VDL	200	1e - 2	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	VDL	200	2e - 2	5	10	0.5	3e - 4	1e - 4	-	0	0.5
MNIST	SDL-NL	200	0	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	SDL-NL	200	1e - 3	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	SDL-NL	200	3e - 3	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	SDL-NL	200	5e - 3	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	SDL-NL	200	1e - 2	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	SDL-NL	200	2e - 2	1	0	-	1e - 3	1e - 4	1e - 3	0	1
MNIST	VDL-NL	200	0	100	10	0.5	3e - 4 ∗	1e - 4	1e - 3	0	0.5
MNIST	VDL-NL	200	1e - 3	100	10	0.5	3e - 4 ∗	1e - 4	1e - 3	0	0.5
MNIST	VDL-NL	200	3e - 3	100	10	0.5	3e - 4 ∗	1e - 4	1e - 3	0	0.5
MNIST	VDL-NL	200	5e - 3	100	10	0.5	3e - 4 ∗	1e - 4	1e - 3	0	0.5
MNIST	VDL-NL	200	1e - 2	100 100	10 10	0.5 0.5	3e - 4 ∗ ∗	1e - 4 1e 4	1e - 3 1e 3	0 0	0.5 0.5
MNIST	VDL-NL	200	2e - 2		0	-	3e - 4 1e 3	-	- -	-	0.5
ImageNet	SDL	100	0	1		-	-	1e - 4		1e 2 1e 2
ImageNet	SDL	100	5e - 4	1	0		1e - 3	1e - 4	-	-	0.5
ImageNet	SDL	100	1e - 3	1	0	-	1e - 3	1e - 4	-	1e - 2	0.5
ImageNet	SDL	100	2e - 3	1	0	-	1e - 3	1e - 4	-	1e - 2	0.5
ImageNet	SDL	100	3e - 3	1	0	-	1e - 3	1e - 4	-	1e - 2	0.5 0.5
ImageNet	SDL	100	5e - 3	1	0	-	1e - 3	1e - 4	- -	1e - 2	0.5
ImageNet ImageNet	VDL VDL	100	0	5	10	0.5	3e - 4	1e - 4 4	-	1e - 2
		100	1e - 3	5	10	0.5	3e - 4	1e - 1e 4	-	1e - 2	0.5
ImageNet ImageNet	VDL VDL	100 100	3e - 3 5e - 3	5	10	0.5 0.5	3e - 4	- 1e 4	-	1e - 2 1e 2	0.5
ImageNet	VDL	100	1e - 2	5 5	10	0.5	3e - 4	-		- 1e - 2	0.5 0.5
ImageNet			1 5e 2		10		3e - 4	1e - 4	-
ImageNet	VDL	100	. -	5	10	0.5	3e - 4	1e - 4 1e 4	-	1e - 2	0.5 0.5
ImageNet	SDL-NL SDL-NL	100 100	0	1	0	-	1e - 3	-	1e - 2	1e - 2
			1e - 3	1	0	-	1e - 3	1e - 4	1e - 2	1e - 2	0.5
ImageNet	SDL-NL	100	3e - 3	1	0	-	1e - 3	1e - 4	1e - 2	1e - 2	0.5
ImageNet	SDL-NL	100	5e - 3	1	0	-	1e - 3	1e - 4	1e - 2	1e - 2	0.5
ImageNet ImageNet	SDL-NL SDL-NL	100	8e - 3 1e 2	1	0 0	- -	1e - 3	1e - 4	1e - 2 1e 2	1e - 2 1e 2	0.5 0.5
		100	- 0	1			1e - 3 5e 5 ∗	1e - 4	-	- 1e 2
ImageNet	VDL-NL	100		20	10	0.5	-	1e - 4	1e - 1	1e - 2	0.5
ImageNet ImageNet	VDL-NL	100	1e - 3	20	10 10	0.5 0.5	5e - 5 ∗	1e - 4	1e - 1 1e 1	- 1e 2	0.5 0.5
ImageNet	VDL-NL VDL-NL	100 100	2e - 3 3e - 3	20 20	10	0.5	5e - 5 ∗ 5e - 5 ∗	1e - 4 1e - 4	- 1e - 1	- 1e - 2	0.5
ImageNet	VDL-NL	100	5e - 3	20	10	0.5	5e - 5 ∗	1e - 4	1e - 1	1e - 2	0.5
ImageNet	VDL-NL	100	1e - 2	40	10	0.5	5e - 5 ∗	1e - 4	1e - 1	1e - 2	0.5
ImageNet	VDL-NL	100	2e - 2	40	10	0.5	5e - 5 ∗	1e - 4	1e - 1	1e - 2	0.5

$$ \frac{\partial}{\partial \displaystyle \emZ_{s, t}} \sum_{j=1}^l\beta\Big[\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{j, :})}\Big)+\Big]^2 & = \frac{\partial}{\partial \displaystyle \emZ{s, t}} \beta\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})}\Big)^2\ & = 2\beta \Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})} \Big) \frac{\partial}{\partial \displaystyle \emZ_{s, t}} \Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})} \Big)\ & = -2\beta \Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})} \Big)\frac{\partial}{\partial \displaystyle \emZ_{s, t}} \sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})}\ & = -\beta \frac{\Big(T - \sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})}\Big)}{\sqrt{\displaystyle \Var(\displaystyle \mZ_{s, :})}} \frac{\partial}{\partial \displaystyle \emZ_{s, t}} \displaystyle \Var(\displaystyle \mZ_{s, :}) \label{app:intermediate-step} $$ \tag{app:intermediate-step}

References

[olshausen1996emergence] Olshausen, Bruno A, Field, David J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature.

[foldiak1990forming] F{. (1990). Forming sparse representations by local anti-Hebbian learning. Biological cybernetics.

[mairal2008discriminative] Mairal, Julien, Bach, Francis, Ponce, Jean, Sapiro, Guillermo, Zisserman, Andrew. (2008). Discriminative learned dictionaries for local image analysis. 2008 IEEE conference on computer vision and pattern recognition.

[mairal2008supervised] Mairal, Julien, Ponce, Jean, Sapiro, Guillermo, Zisserman, Andrew, Bach, Francis. (2008). Supervised dictionary learning. Advances in neural information processing systems.

[raina2007self] Raina, Rajat, Battle, Alexis, Lee, Honglak, Packer, Benjamin, Ng, Andrew Y. (2007). Self-taught learning: transfer learning from unlabeled data. Proceedings of the 24th international conference on Machine learning.

[yang2009linear] Yang, Jianchao, Yu, Kai, Gong, Yihong, Huang, Thomas. (2009). Linear spatial pyramid matching using sparse coding for image classification. 2009 IEEE Conference on computer vision and pattern recognition.

[coates2011importance] Coates, Adam, Ng, Andrew Y. (2011). The importance of encoding versus training with sparse coding and vector quantization. ICML.

[rigamonti2011sparse] Rigamonti, Roberto, Brown, Matthew A, Lepetit, Vincent. (2011). Are sparse representations really relevant for image classification?. CVPR 2011.

[luther2022sensitivity] Luther, Kyle, Seung, H Sebastian. (2022). Sensitivity of sparse codes to image distortions. Neural Computation.

[chen2018sparse] Chen, Yubei, Paiton, Dylan, Olshausen, Bruno. (2018). The sparse manifold transform. Advances in neural information processing systems.

[sun2018supervised] Sun, Xiaoxia, Nasrabadi, Nasser M, Tran, Trac D. (2018). Supervised deep sparse coding networks. 2018 25th IEEE International Conference on Image Processing (ICIP).

[lee2006efficient] Lee, Honglak, Battle, Alexis, Raina, Rajat, Ng, Andrew. (2006). Efficient sparse coding algorithms. Advances in neural information processing systems.

[mairal2009online] Mairal, Julien, Bach, Francis, Ponce, Jean, Sapiro, Guillermo. (2009). Online dictionary learning for sparse coding. Proceedings of the 26th annual international conference on machine learning.

[kingma2013auto] Kingma, Diederik P, Welling, Max. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

[barello2018sparse] Barello, Gabriel, Charles, Adam S, Pillow, Jonathan W. (2018). Sparse-coding variational auto-encoders. BioRxiv.

[foldiak1991learning] F{. (1991). Learning invariance from transformation sequences. Neural computation.

[hyvarinen2003bubbles] Hyv{. (2003). Bubbles: a unifying framework for low-level statistical properties of natural image sequences. JOSA A.

[tonolini2020variational] Tonolini, Francesco, Jensen, Bj{\o. (2020). Variational sparse coding. Uncertainty in Artificial Intelligence.

[tang2020dictionary] Tang, Hao, Liu, Hong, Xiao, Wei, Sebe, Nicu. (2020). When dictionary learning meets deep learning: Deep dictionary learning and coding network for image recognition with limited data. IEEE transactions on neural networks and learning systems.

[he2014unsupervised] He, Yunlong, Kavukcuoglu, Koray, Wang, Yun, Szlam, Arthur, Qi, Yanjun. (2014). Unsupervised feature learning by deep sparse coding. Proceedings of the 2014 SIAM international conference on data mining.

[hu2018nonlinear] Hu, Junlin, Tan, Yap-Peng. (2018). Nonlinear dictionary learning with application to image classification. Pattern Recognition.

[mahdizadehaghdam2019deep] Mahdizadehaghdam, Shahin, Panahi, Ashkan, Krim, Hamid, Dai, Liyi. (2019). Deep dictionary learning: A parametric network approach. IEEE Transactions on Image Processing.

[wiskott2002slow] Wiskott, Laurenz, Sejnowski, Terrence J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural computation.

[Bengio+chapter2007] Bengio, Yoshua, LeCun, Yann. (2007). Scaling Learning Algorithms Towards {AI. Large Scale Kernel Machines.

[Hinton06] Hinton, Geoffrey E., Osindero, Simon, Teh, Yee Whye. (2006). A Fast Learning Algorithm for Deep Belief Nets. Neural Computation.

[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, Bengio, Yoshua. (2016). Deep learning.

[kingma2014adam] Kingma, Diederik P, Ba, Jimmy. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[bengio2013representation] Bengio, Yoshua, Courville, Aaron, Vincent, Pascal. (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence.

[zhang2018ista] Zhang, Jian, Ghanem, Bernard. (2018). ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. Proceedings of the IEEE conference on computer vision and pattern recognition.

[DBLP:journals/corr/MairalBP14] Julien Mairal, Francis R. Bach, Jean Ponce. (2014). Sparse Modeling for Image and Vision Processing. CoRR.

[tariyal2016deep] Tariyal, Snigdha, Majumdar, Angshul, Singh, Richa, Vatsa, Mayank. (2016). Deep dictionary learning. IEEE Access.

[mairal2007sparse] Mairal, Julien, Elad, Michael, Sapiro, Guillermo. (2007). Sparse representation for color image restoration. IEEE Transactions on image processing.

[elad2006image] Elad, Michael, Aharon, Michal. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image processing.

[yu-2012-sparse] Kai Yu. (2012). Tutorial on Deep Learning: Sparse Coding. CVPR.

[imagenet_cvpr09] Beck, Amir, Teboulle, Marc. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences.

[devlin-etal-2019-bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2019). {BERT. Proceedings of the 2019 Conference of the North {A. doi:10.18653/v1/N19-1423.

[he2019momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2019). Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722.

[misra2019self] Misra, Ishan, van der Maaten, Laurens. (2019). Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991.

[olshausen1997sparse] Olshausen, Bruno A, Field, David J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1?. Vision research.

[teh2003energy] Teh, Yee Whye, Welling, Max, Osindero, Simon, Hinton, Geoffrey E. (2003). Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research.

[bengio2009learning] Bengio, Yoshua. (2009). Learning deep architectures for AI. Foundations and trends in Machine Learning.

[langley00] P. Langley. (2000). Crafting Papers on Machine Learning. Proceedings of the 17th International Conference on Machine Learning (ICML 2000).

[mitchell80] T. M. Mitchell. (1980). The Need for Biases in Learning Generalizations.

[kearns89] M. J. Kearns. (1989). Computational Complexity of Machine Learning.

[MachineLearningI] . Machine Learning: An Artificial Intelligence Approach, Vol. I. (1983).

[DudaHart2nd] R. O. Duda, P. E. Hart, D. G. Stork. (2000). Pattern Classification.

[anonymous] Author, N. N.. (2021). Suppressed for Anonymity.

[Newell81] A. Newell, P. S. Rosenbloom. (1981). Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition.

[Samuel59] A. L. Samuel. (1959). Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development.

[ranzato2007efficient] Ranzato, Marc'Aurelio, Poultney, Christopher, Chopra, Sumit, Cun, Yann L. (2007). Efficient learning of sparse representations with an energy-based model. Advances in neural information processing systems.

[zeiler2010deconvolutional] Zeiler, Matthew D, Krishnan, Dilip, Taylor, Graham W, Fergus, Rob. (2010). Deconvolutional networks. 2010 IEEE Computer Society Conference on computer vision and pattern recognition.

[zeiler2011adaptive] Zeiler, Matthew D, Taylor, Graham W, Fergus, Rob. (2011). Adaptive deconvolutional networks for mid and high level feature learning. 2011 International Conference on Computer Vision.

[noroozi2016unsupervised] Noroozi, Mehdi, Favaro, Paolo. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. European Conference on Computer Vision.

[fernando2017self] Fernando, Basura, Bilen, Hakan, Gavves, Efstratios, Gould, Stephen. (2017). Self-supervised video representation learning with odd-one-out networks. Proceedings of the IEEE conference on computer vision and pattern recognition.

[misra2016shuffle] Misra, Ishan, Zitnick, C Lawrence, Hebert, Martial. (2016). Shuffle and learn: unsupervised learning using temporal order verification. European Conference on Computer Vision.

[vincent2008extracting] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th international conference on Machine learning.

[DBLP:journals/corr/KingmaW13] Diederik P. Kingma, Max Welling. (2014). Auto-Encoding Variational Bayes. 2nd International Conference on Learning Representations, {ICLR.

[goodfellow2014generative] Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, Bengio, Yoshua. (2014). Generative adversarial nets. Advances in neural information processing systems.

[NIPS2010_4133] Kavukcuoglu, Koray, Sermanet, Pierre, Y-lan Boureau, Gregor, Karol, Michael Mathieu, Yann L. Cun. (2010). Learning Convolutional Feature Hierarchies for Visual Recognition. Advances in Neural Information Processing Systems 23.

[kavukcuoglu] Kavukcuoglu, Koray, Sermanet, Pierre, Y-lan Boureau, Gregor, Karol, Michael Mathieu, Yann L. Cun. (2010). Learning Convolutional Feature Hierarchies for Visual Recognition. Advances in Neural Information Processing Systems 23.

[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.

[hnaff2019dataefficient] Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, Aaron van den Oord. (2019). Data-Efficient Image Recognition with Contrastive Predictive Coding.

[makhzani2013k] Makhzani, Alireza, Frey, Brendan. (2013). K-sparse autoencoders. arXiv preprint arXiv:1312.5663.

[balasubramanian2013smooth] Balasubramanian, Krishnakumar, Yu, Kai, Lebanon, Guy. (2013). Smooth sparse coding via marginal regression for learning sparse representations. International Conference on Machine Learning.

[zhang2017constructing] Zhang, Shizhou, Wang, Jinjun, Tao, Xiaoyu, Gong, Yihong, Zheng, Nanning. (2017). Constructing deep sparse coding network for image classification. Pattern Recognition.

[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition.

[gregor2010learning] Gregor, Karol, LeCun, Yann. (2010). Learning fast approximations of sparse coding. Proceedings of the 27th International Conference on International Conference on Machine Learning.

[jarrett2009best] Jarrett, Kevin, Kavukcuoglu, Koray, Ranzato, Marc'Aurelio, LeCun, Yann. (2009). What is the best multi-stage architecture for object recognition?. 2009 IEEE 12th international conference on computer vision.

[lee2009convolutional] Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, Ng, Andrew Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Proceedings of the 26th annual international conference on machine learning.

[chen2013deep] Chen, Bo, Polatkan, Gungor, Sapiro, Guillermo, Blei, David, Dunson, David, Carin, Lawrence. (2013). Deep learning with hierarchical convolutional factor analysis. IEEE transactions on pattern analysis and machine intelligence.

[lecun-mnisthandwrittendigit-2010] LeCun, Yann, Cortes, Corinna. {MNIST.

[rencker2019sparse] Rencker, Lucas, Bach, Francis, Wang, Wenwu, Plumbley, Mark D. (2019). Sparse recovery and dictionary learning from nonlinear compressive measurements. IEEE Transactions on Signal Processing.

[yoshida2020natural] Yoshida, Takashi, Ohki, Kenichi. (2020). Natural images are reliably represented by sparse and variable populations of neurons in visual cortex. Nature communications.

[Zhang2015ASO] Zheng Zhang, Y. Xu, Jian Yang, Xuelong Li, David Zhang. (2015). A Survey of Sparse Representation: Algorithms and Applications. IEEE Access.

[aharon2006k] Aharon, Michal, Elad, Michael, Bruckstein, Alfred. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on signal processing.

[Cooley1965AnAF] J. Cooley, J. W. Tukey. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation.

[Zhou2018SC2NetSL] Joey Tianyi Zhou, K. Di, J. Du, Xi Peng, H. Yang, Sinno Jialin Pan, I. Tsang, Yong Liu, Z. Qin, R. Goh. (2018). SC2Net: Sparse LSTMs for Sparse Coding. AAAI.

[ioffe2015batch] Ioffe, Sergey, Szegedy, Christian. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. International conference on machine learning.

[grill-2020-NEURIPS] Grill, Jean-Bastien, Strub, Florian, Altch'{e. (2020). Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning. Advances in Neural Information Processing Systems.

[he2020momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, St{'e. (2021). Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230.

[wu2019sparse] Wu, Kailun, Guo, Yiwen, Li, Ziang, Zhang, Changshui. (2019). Sparse coding with gated learned ista. International Conference on Learning Representations.

[lee2015difference] Lee, Dong-Hyun, Zhang, Saizheng, Fischer, Asja, Bengio, Yoshua. (2015). Difference target propagation. Joint european conference on machine learning and knowledge discovery in databases.

[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.

[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, {\L. (2017). Attention is all you need. Advances in neural information processing systems.

[bib1] Aharon et al. (2006) Michal Aharon, Michael Elad, and Alfred Bruckstein. K-svd: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on signal processing, 54(11):4311–4322, 2006.

[bib2] Bardes et al. (2021) Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.

[bib3] Barello et al. (2018) Gabriel Barello, Adam S Charles, and Jonathan W Pillow. Sparse-coding variational auto-encoders. BioRxiv, pp. 399246, 2018.

[bib4] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.

[bib5] Yoshua Bengio. Learning deep architectures for ai. Foundations and trends in Machine Learning, 2(1):1–127, 2009.

[bib6] Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.

[bib7] Chen et al. (2013) Bo Chen, Gungor Polatkan, Guillermo Sapiro, David Blei, David Dunson, and Lawrence Carin. Deep learning with hierarchical convolutional factor analysis. IEEE transactions on pattern analysis and machine intelligence, 35(8):1887–1901, 2013.

[bib8] Chen et al. (2018) Yubei Chen, Dylan Paiton, and Bruno Olshausen. The sparse manifold transform. Advances in neural information processing systems, 31, 2018.

[bib9] Adam Coates and Andrew Y Ng. The importance of encoding versus training with sparse coding and vector quantization. In ICML, 2011.

[bib10] Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[bib11] Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image processing, 15(12):3736–3745, 2006.

[bib12] Fernando et al. (2017) Basura Fernando, Hakan Bilen, Efstratios Gavves, and Stephen Gould. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3636–3645, 2017.

[bib13] Peter Földiak. Forming sparse representations by local anti-hebbian learning. Biological cybernetics, 64(2):165–170, 1990.

[bib14] Peter Földiák. Learning invariance from transformation sequences. Neural computation, 3(2):194–200, 1991.

[bib15] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 399–406, 2010.

[bib16] Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, koray kavukcuoglu, Remi Munos, and Michal Valko. Bootstrap your own latent - a new approach to self-supervised learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 21271–21284. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf.

[bib17] He et al. (2019) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.

[bib18] He et al. (2014) Yunlong He, Koray Kavukcuoglu, Yun Wang, Arthur Szlam, and Yanjun Qi. Unsupervised feature learning by deep sparse coding. In Proceedings of the 2014 SIAM international conference on data mining, pp. 902–910. SIAM, 2014.

[bib19] Junlin Hu and Yap-Peng Tan. Nonlinear dictionary learning with application to image classification. Pattern Recognition, 75:282–291, 2018.

[bib20] Hyvärinen et al. (2003) Aapo Hyvärinen, Jarmo Hurri, and Jaakko Väyrynen. Bubbles: a unifying framework for low-level statistical properties of natural image sequences. JOSA A, 20(7):1237–1252, 2003.

[bib21] Jarrett et al. (2009) Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th international conference on computer vision, pp. 2146–2153. IEEE, 2009.

[bib22] Kavukcuoglu et al. (2010) Koray Kavukcuoglu, Pierre Sermanet, Y lan Boureau, Karol Gregor, Michael Mathieu, and Yann L. Cun. Learning convolutional feature hierarchies for visual recognition. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (eds.), Advances in Neural Information Processing Systems 23, pp. 1090–1098. Curran Associates, Inc., 2010. URL http://papers.nips.cc/paper/4133-learning-convolutional-feature-hierarchies-for-visual-recognition.pdf.

[bib23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[bib24] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[bib25] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. URL http://yann.lecun.com/exdb/mnist/, 2010.

[bib26] Lee et al. (2015) Dong-Hyun Lee, Saizheng Zhang, Asja Fischer, and Yoshua Bengio. Difference target propagation. In Joint european conference on machine learning and knowledge discovery in databases, pp. 498–515. Springer, 2015.

[bib27] Lee et al. (2006) Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Ng. Efficient sparse coding algorithms. Advances in neural information processing systems, 19, 2006.

[bib28] Lee et al. (2009) Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th annual international conference on machine learning, pp. 609–616, 2009.

[bib29] Kyle Luther and H Sebastian Seung. Sensitivity of sparse codes to image distortions. Neural Computation, pp. 1–20, 2022.

[bib30] Mahdizadehaghdam et al. (2019) Shahin Mahdizadehaghdam, Ashkan Panahi, Hamid Krim, and Liyi Dai. Deep dictionary learning: A parametric network approach. IEEE Transactions on Image Processing, 28(10):4790–4802, 2019.

[bib31] Mairal et al. (2007) Julien Mairal, Michael Elad, and Guillermo Sapiro. Sparse representation for color image restoration. IEEE Transactions on image processing, 17(1):53–69, 2007.

[bib32] Mairal et al. (2008a) Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Discriminative learned dictionaries for local image analysis. In 2008 IEEE conference on computer vision and pattern recognition, pp. 1–8. IEEE, 2008a.

[bib33] Mairal et al. (2008b) Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, and Francis Bach. Supervised dictionary learning. Advances in neural information processing systems, 21, 2008b.

[bib34] Mairal et al. (2009) Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning, pp. 689–696, 2009.

[bib35] Mairal et al. (2014) Julien Mairal, Francis R. Bach, and Jean Ponce. Sparse modeling for image and vision processing. CoRR, abs/1411.3230, 2014. URL http://arxiv.org/abs/1411.3230.

[bib36] Alireza Makhzani and Brendan Frey. K-sparse autoencoders. arXiv preprint arXiv:1312.5663, 2013.

[bib37] Misra et al. (2016) Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544. Springer, 2016.

[bib38] Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.

[bib39] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997.

[bib40] Raina et al. (2007) Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th international conference on Machine learning, pp. 759–766, 2007.

[bib41] Ranzato et al. (2007) Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra, and Yann L Cun. Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems, pp. 1137–1144, 2007.

[bib42] Rigamonti et al. (2011) Roberto Rigamonti, Matthew A Brown, and Vincent Lepetit. Are sparse representations really relevant for image classification? In CVPR 2011, pp. 1545–1552. IEEE, 2011.

[bib43] Sun et al. (2018) Xiaoxia Sun, Nasser M Nasrabadi, and Trac D Tran. Supervised deep sparse coding networks. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 346–350. IEEE, 2018.

[bib44] Tang et al. (2020) Hao Tang, Hong Liu, Wei Xiao, and Nicu Sebe. When dictionary learning meets deep learning: Deep dictionary learning and coding network for image recognition with limited data. IEEE transactions on neural networks and learning systems, 32(5):2129–2141, 2020.

[bib45] Tariyal et al. (2016) Snigdha Tariyal, Angshul Majumdar, Richa Singh, and Mayank Vatsa. Deep dictionary learning. IEEE Access, 4:10096–10109, 2016.

[bib46] Tonolini et al. (2020) Francesco Tonolini, Bjørn Sand Jensen, and Roderick Murray-Smith. Variational sparse coding. In Uncertainty in Artificial Intelligence, pp. 690–700. PMLR, 2020.

[bib47] Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103, 2008.

[bib48] Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural computation, 14(4):715–770, 2002.

[bib49] Wu et al. (2019) Kailun Wu, Yiwen Guo, Ziang Li, and Changshui Zhang. Sparse coding with gated learned ista. In International Conference on Learning Representations, 2019.

[bib50] Yang et al. (2009) Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In 2009 IEEE Conference on computer vision and pattern recognition, pp. 1794–1801. IEEE, 2009.

[bib51] Takashi Yoshida and Kenichi Ohki. Natural images are reliably represented by sparse and variable populations of neurons in visual cortex. Nature communications, 11(1):1–19, 2020.

[bib52] Kai Yu. Tutorial on deep learning: Sparse coding. CVPR, 2012.

[bib53] Zbontar et al. (2021) Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.

[bib54] Zeiler et al. (2010) Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. In 2010 IEEE Computer Society Conference on computer vision and pattern recognition, pp. 2528–2535. IEEE, 2010.

[bib55] Zeiler et al. (2011) Matthew D Zeiler, Graham W Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In 2011 International Conference on Computer Vision, pp. 2018–2025. IEEE, 2011.

[bib56] Zhang et al. (2015) Zheng Zhang, Y. Xu, Jian Yang, Xuelong Li, and David Zhang. A survey of sparse representation: Algorithms and applications. IEEE Access, 3:490–530, 2015.

[bib57] Zhou et al. (2018) Joey Tianyi Zhou, K. Di, J. Du, Xi Peng, H. Yang, Sinno Jialin Pan, I. Tsang, Yong Liu, Z. Qin, and R. Goh. Sc2net: Sparse lstms for sparse coding. In AAAI, 2018.

Ablation: No Variance Regularization​

Katrina Evtimova

Yann LeCun

Introduction​

Method​

Background: Sparse Coding and Dictionary Learning​

Background: Inference with ISTA and FISTA​

Modified Inference with Variance Regularization on the Latent Code Components​

Encoder for Amortized Inference​

Experimental Setup​

Encoder and Decoder Architectures​

Algorithm 1 LISTA encoder E : forward pass

Evaluation​

Data Processing​

Inference and Training​

Results and Analysis​

Visualizing Learned Features​

Reconstruction Quality​

Denoising​

Ablation: No Variance Regularization​

Classification in the Low Data Regime~~​

Related Work​

Conclusion​

Broader Impact Statement​

Acknowledgments​

References​

Appendix​

Notation​

Variance Regularization Term​

Gradient Derivation​

Additional Visualizations​

Training Details​

References​

Ablation: No Variance Regularization

Introduction

Method

Background: Sparse Coding and Dictionary Learning

Background: Inference with ISTA and FISTA

Modified Inference with Variance Regularization on the Latent Code Components

Encoder for Amortized Inference

Experimental Setup

Encoder and Decoder Architectures

Evaluation

Data Processing

Inference and Training

Results and Analysis

Visualizing Learned Features

Reconstruction Quality

Denoising

Ablation: No Variance Regularization

Classification in the Low Data Regime~~

Related Work

Conclusion

Broader Impact Statement

Acknowledgments

References

Appendix

Notation

Variance Regularization Term

Gradient Derivation

Additional Visualizations

Training Details

References