Skip to main content

Tangent distance-based classification induced by discriminative training of a deep sparse coder

Jason Tyler Rolfe & Yann LeCun, Courant Institute of Mathematical Sciences, New York University, 719 Broadway, 12th Floor, New York, NY, %Yann LeCun, %Courant Institute of Mathematical Sciences, %New York University, %New York, NY

Abstract

We present the discriminative recurrent sparse auto-encoder model, comprising a recurrent encoder of rectified linear units, unrolled %in time for a fixed number of iterations, and connected to two linear decoders that reconstruct the input and predict its supervised classification. %The recurrent encoder is unrolled in time for a fixed number of iterations, and trained using backpropagation-through-time. Training via backpropagation-through-time initially minimizes an unsupervised sparse reconstruction error; the loss function is then augmented with a discriminative term on the supervised classification. %In its temporally-unrolled form, the network can be seen as a deep network, with parameters shared between the hidden layers. The depth implicit in the temporally-unrolled form allows the system to exhibit far more representational power, while keeping the number of trainable parameters fixed. %all the power of deep networks, while substantially reducing the number of trainable parameters. %The proposed architecture also mitigates the vanishing gradient problem, which has plagued deep and recurrent networks. From an initially unstructured network the hidden units differentiate into categorical-units, each of which represents an input prototype with a well-defined class; and part-units representing deformations of these prototypes. The learned organization of the recurrent encoder is hierarchical: part-units are driven directly by the input, whereas the activity of categorical-units builds up over time through interactions with the part-units. Even using a small number of hidden units per layer, discriminative recurrent sparse auto-encoders achieve excellent performance on MNIST. % (that are pixel-permutation agnostic / without using prior information).

Introduction

Deep networks complement the hierarchical structure in natural data (Bengio, 2009). By breaking complex calculations into many steps, deep networks can gradually build up complicated decision boundaries or input transformations, facilitate the reuse of common substructure, and explicitly compare alternative interpretations of ambiguous input (Lee, Ekanadham, & Ng, 2008; Zeiler, Taylor, & Fergus, 2011). Leveraging these strengths, deep networks have facilitated significant advances in solving sensory problems like visual classification and speech recognition (Dahl, et al., 2012; Hinton, Osindero, & Teh, 2006; Hinton, et al., 2012).

Although deep networks have traditionally used independent parameters for each layer, they are equivalent to recurrent networks in which a disjoint set of units is active on each time step. The corresponding representations are sparse, and thus invite the incorporation of powerful techniques from sparse coding (Glorot, Bordes, & Bengio, 2011; Lee, Ekanadham, & Ng, 2008; Olshausen & Field, 1996, 1997; Ranzato, et al., 2006). Recurrence opens the possibility of sharing parameters between successive layers of a deep network.

This paper introduces the Discriminative Recurrent Sparse Auto-Encoder model (DrSAE), comprising a recurrent encoder of rectified linear units (ReLU; Coates & Ng, 2011; Glorot, Bordes, & Bengio, 2011; Jarrett, et al., 2009; Nair & Hinton, 2010; Salinas & Abbott, 1996), connected to two linear decoders that reconstruct the input and predict its supervised classification. The recurrent encoder is unrolled in time for a fixed number of iterations, with the input projecting to each resulting layer, and trained using backpropagation-through-time (Rumelhart, et al., 1986). Training initially minimizes an unsupervised sparse reconstruction error; the loss function is then augmented with a

discriminative term on the supervised classification. In its temporally-unrolled form, the network can be seen as a deep network, with parameters shared between the hidden layers. The temporal depth allows the system to exhibit far more representational power, while keeping the number of trainable parameters fixed.

Interestingly, experiments show that DrSAE does not just discover more discriminative 'parts' of the form conventionally produced by sparse coding. Rather, the hidden units spontaneously differentiate into two types: a small number of categorical-units and a larger number of part-units. The categorical-units have decoder bases that look like prototypes of the input classes. They are weakly influenced by the input and activate late in the dynamics as the result of interaction with the part-units. In contrast, the part-units are strongly influenced by the input, and encode small transformations through which the prototypes of categorical-units can be reshaped into the current input. Categorical-units compete with each other through mutual inhibition and cooperate with relevant part-units. This can be interpreted as a representation of the data manifold in which the categoricalunits are points on the manifold, and the part-units are akin to tangent vectors along the manifold.

Prior work

The encoder architecture of DrSAE is modeled after the Iterative Shrinkage and Threshold Algorithm (ISTA), a proximal method for sparse coding (Chambolle, et al., 1998; Daubechies, Defrise, & De Mol, 2004). Gregor & LeCun (2010) showed that the sparse representations computed by ISTA can be efficiently approximated by a structurally similar encoder with a less restrictive, learned parameterization. Rather than learn to approximate a precomputed optimal sparse code, the LISTA autoencoders of Sprechmann, Bronstein, & Sapiro (2012a,b) are trained to directly minimize the sparse reconstruction loss function. DrSAE extends LISTA autoencoders with a non-negativity constraint, which converts the shrink nonlinearity of LISTA into a rectified linear operator; and introduces a unified classification loss, as previously used in conjunction with traditional sparse coders (Bradley & Bagnell, 2008; Mairal, et al., 2009; Mairal, Bach, & Ponce, 2012) and other autoencoders (Boureau, et al., 2010; Ranzato & Szummer, 2008).

DrSAEs resemble the structure of deep sparse rectifier neural networks (Glorot, Bordes, & Bengio, 2011), but differ in that the parameter matrices at each layer are tied (Bengio, BoulangerLewandowski, & Pascanu, 2012), the input projects to all layers, and the outputs are normalized. DrSAEs are also reminiscent of the recurrent neural networks investigated by Bengio & Gingras (1996), but use a different nonlinearity and a heavily regularized loss function. Finally, they are similar to the recurrent networks described by Seung (1998), but have recurrent connections amongst the hidden units, rather than between the hidden units and the input units, and introduce classification and sparsification losses.

Network architecture

In the following, we use lower-case bold letters to denote vectors, upper-case bold letters to denote matrices, superscripts to indicate iterative copies of a vector, and subscripts to index the columns (or rows, if explicitly specified by the context) of a matrix or (without boldface) the elements of a vector. We consider discriminative recurrent sparse auto-encoders (DrSAEs) of rectified linear units with the architecture shown in figure 1:

$$

$$

for t = 1 , . . . , T , where n -dimensional vector z t is the activity of the hidden units at iteration t , m -dimensional vector x is the input, and z t =0 = 0 . Unlike traditional recurrent autoencoders (Bengio, Boulanger-Lewandowski, & Pascanu, 2012), the input projects to every iteration. We call the n × m parameter matrix E the encoding matrix, and the n × n parameter matrix S the explaining-away matrix. The n -element parameter vector b contains a bias term. The parameters also include the m × n decoding matrix D and the l × n classification matrix C .

We pretrain DrSAEs using stochastic gradient descent on the unsupervised loss function

$$

$$

Figure 1: The discriminative recurrent sparse auto-encoder (DrSAE) architecture. z t is the hidden representation after iteration t of T , and is initialized to z 0 = 0 ; x is the input; and y is the supervised classification. Overbars denote approximations produced by the network, rather than the true input. E , S , D , and b are learned parameters.

Figure 1: The discriminative recurrent sparse auto-encoder (DrSAE) architecture. z t is the hidden representation after iteration t of T , and is initialized to z 0 = 0 ; x is the input; and y is the supervised classification. Overbars denote approximations produced by the network, rather than the true input. E , S , D , and b are learned parameters.

with the magnitude of the columns of D bounded by 1 , 1 and the magnitude of the rows of E bounded by 1 . 25 T . 2 We then add in the supervised classification loss function

$$

$$

where the multinomial logistic loss function is defined by

$$

$$

and y is the index of the desired class. 3 Starting with the parameters learned by the unsupervised pretraining, we perform discriminative fine-tune by stochastic gradient descent on L U + L S , with the magnitude of the rows of C bounded by 5 . 4 The learning rate of each matrix is scaled down by the number of times it is repeated in the network, and the learning rate of the classification matrix is scaled down by a factor of 5 , to keep the effective learning rate consistent amongst the parameter matrices.

We train DrSAEs with T = 11 recurrent iterations (ten nontrivial passes through the explainingaway matrix S ) 5 and 400 hidden units on the MNIST dataset of 28 × 28 grayscale handwritten digits (LeCun, et al., 1998), with each input normalized to have glyph[lscript] 2 magnitude equal to 1 . We use a training set of 50,000 elements, and a validation set of 10,000 elements to perform early-stopping. Encoding, decoding, and classification matrices learned via this procedure are depicted in figure 2.

The dynamics of equation 1 are inspired by the Learned Iterative Shrinkage and Thresholding Algorithm (LISTA) (Gregor & LeCun, 2010), an efficient approximation to the sparse coding Iterative Shrinkage and Threshold Algorithm (ISTA) (Chambolle, et al., 1998; Daubechies, Defrise, & De

1 This sets the scale of z ; otherwise, the magnitude of z will shrink to zero and the magnitude of the columns of D will explode. This and all other such constraints are enforced by a projection after each SGD step.

2 The size of each ISTA step must be sufficiently small to guarantee convergence. As the step size grows large, the input will be over-explained by multiple aligned hidden units, leading to extreme oscillations. This bound serves the same function as glyph[lscript] 2 weight regularization (Hinton, 2010). The particular value of the bound is heuristic, and was determined by an informal search of parameter space.

3 Consistent with standard autoencoders but unlike traditional applications of backpropagation-through-time, the loss functions L U and L S only depend directly on the final iteration of the hidden units z T .

4 As in the case of the encoder, this serves the same function as glyph[lscript] 2 weight regularization (Hinton, 2010). The particular value of the bound is heuristic, and was determined by an informal search of parameter space.

5 The chosen number of recurrent iterations achieves a heuristic balance between representational power and computational expense. Experiments were conducted with T ∈ { 2 , 6 , 11 , 21 } .

Figure 2: The hidden units differentiate into spatially localized part-units, which have well-aligned encoders and decoders; and global prototype categorical-units, which have poorly aligned encoders and decoders. A subset of the rows of encoding matrix E (a) and the columns of decoding matrix D (b), and all rows of the classification matrix C (c) after training. The first row of (a,b) shows the most categorical units; the last row contains the least categorical units; and the middle row evenly steps through the remaining units in order of categoricalness. Gray pixels denote connections with weight 0 ; darker pixels indicate more positive connections.

Figure 2: The hidden units differentiate into spatially localized part-units, which have well-aligned encoders and decoders; and global prototype categorical-units, which have poorly aligned encoders and decoders. A subset of the rows of encoding matrix E (a) and the columns of decoding matrix D (b), and all rows of the classification matrix C (c) after training. The first row of (a,b) shows the most categorical units; the last row contains the least categorical units; and the middle row evenly steps through the remaining units in order of categoricalness. Gray pixels denote connections with weight 0 ; darker pixels indicate more positive connections.

Mol, 2004). ISTA is an algorithm for minimizing the glyph[lscript] 1 -regularized reconstruction loss function L U of equation 2 with respect to z T . It is defined by the iterative step

$$

$$

where [ h θ ( x )] i = sign ( x i ) · max(0 , | x i | -θ ) and α is a small step-size parameter. With nonnegative units, ISTA is equivalent to projected gradient descent of L U of equation 2. As the number of iterations T →∞ , a DrSAE defined by equation 1 becomes a non-negative version of ISTA if it satisfies the restrictions:

$$

$$

where the positive scale factor α is less than the maximal eigenvalue of D glyph[latticetop] · D , and I is the n × n identity matrix.

As in LISTA, but unlike ISTA, the encoding matrix E and explaining-away matrix S in a DrSAE are independent of the decoding matrix D . Connections from the input to the hidden units, and recurrent connections between the hidden units, are all-to-all, so the network structure is agnostic to permutations of the input. DrSAEs can also be understood as deep, feedforward networks with the parameter matrices tied between the layers.

Analysis of the hidden unit representation

Discriminative fine-tuning naturally induces the hidden units of a DrSAE to differentiate into a hierarchy-like continuum. On one extreme are part-units, which perform an ISTA-like sparse coding computation; on the other are categorical-units, which use a sophisticated form of pooling to integrate over matching part-units, and implement winner-take-all dynamics amongst themselves. Converging lines of evidence indicate that these two groups use distinct computational mechanisms and serve different representational roles.

In the ISTA algorithm, each row of the encoding matrix E i (which we sometimes call the encoder of unit i ) is proportional to the corresponding column of the decoding matrix D i (which we call the decoder of unit i ), and each row ( S -I ) i is proportional to ( D i ) glyph[latticetop] · D , as in equation 4. As a result, the angle between E i and D i , and the angle between the rows of S -I and D glyph[latticetop] · D , are both simple measures of the degree to which a unit's dynamics follow the ISTA algorithm, and thus perform

Figure 3: The hidden units differentiate into two populations after discriminative fine-tuning. The magnitude of row ( S -I ) i (a,b,e) and C i (c,d), versus the angle between encoder row and decoder column, for each unit from networks using 11 (a,c,e) and 2 (b,d,f) iterations. All plots are from discriminatively fine-tuned networks except (a,b), which are only subject to unsupervised pretraining. We call the dense cloud in the bottom-left part-units, and the tail extending to the top-right categorical-units.

Figure 3: The hidden units differentiate into two populations after discriminative fine-tuning. The magnitude of row ( S -I ) i (a,b,e) and C i (c,d), versus the angle between encoder row and decoder column, for each unit from networks using 11 (a,c,e) and 2 (b,d,f) iterations. All plots are from discriminatively fine-tuned networks except (a,b), which are only subject to unsupervised pretraining. We call the dense cloud in the bottom-left part-units, and the tail extending to the top-right categorical-units.

sparse coding. 6 These quantities are equal to 0 in the case of perfect ISTA, and grow larger as the network diverges from ISTA. Of these two angles, the explaining-away matrix comparison is more difficult to interpret, since a distortion of any one unit's decoding column D i will affect all rows of D glyph[latticetop] · D , whereas the angle between the encoder row E i and decoder column D i only depends upon a single unit. For this reason, we use the angle between the encoder row and decoder column as a measure of the position of each unit on the part/categorical continuum.

6 We always use S -I when plotting recurrent connection strength, since it governs the perturbation of the otherwise stable hidden unit activations, as in projected gradient descent of L U ; i.e., ISTA.

Figure 3 plots, for each unit i , the magnitude of row ( S -I ) i and column C i , versus the angle between row E i and column D i . Before discriminative fine-tuning, there are no categorical-units; the angle between the encoder row and decoder column is small and the incoming recurrent connections are weak for all units, as in figure 3(a,b). After discriminative fine-tuning, there remains a dense cloud of points for which the angle between the encoder row and decoder column is very small, and the incoming recurrent and outgoing classification connections are weak. Abutting this is an extended tail of points that have a larger angle between the encoder row and decoder column, and stronger incoming recurrent and outgoing classification connections. We call units composing the dense cloud part-units , since they have ISTA-compatible connections, while we refer to those making up the extended tail as categorical-units , since they have strong connections to the classification output. 7 When trained on MNIST, part-units have localized, pen stroke-like decoders, as can be seen in the bottom rows of figure 2(a,b). Categorical-units, in contrast, tend to have whole-digit prototype-like decoders, as in the top rows of figure 2(a,b). Discriminative fine-tuning induces the differentiation of categorical-units regardless of the depth of the encoder.

Part-units

Examination of the relationship between the elements of S -I and D glyph[latticetop] · D confirms that partunits with an encoder-decoder angle less than 0 . 5 radians abide by ISTA, and so perform sparse coding on the residual input after the categorical-unit prototypes are subtracted out. The prominent diagonals with matching slopes in figure 4(a,b), which plot the value of S i,j -δ i,j versus D i · D j for connections between part-units, and from categorical-units to part-units, respectively, demonstrate that part-units receive ISTA-consistent connections from all units. The fidelity of these connections to the ISTA ideal is not strongly dependent upon whether the afferent units are ISTA-compliant partunits, or ISTA-ignoring categorical-units. As a result, the part-units treat the categorical-units as if they were also participating in the reconstruction of the input, and only attempt to reconstruct the residual input not explained by the categorical-unit prototypes.

As can be seen in figure 4(c), the degree to which the encoder conforms to the ISTA algorithm is strongly correlated with the degree to which the explaining-away matrix matches the ISTA algorithm. Figure 5 shows the decoders associated with the strongest recurrent connections to three representative part-units. As expected, the decoders of these afferent units tend to be strongly aligned or anti-aligned with their target's decoder, and include both part-units and categorical-units.

Priors on the encoder

The encoder architecture of DrSAE is modeled after the Iterative Shrinkage and Threshold Algorithm (ISTA), a proximal method for sparse coding (Chambolle, et al., 1998; Daubechies, Defrise, & De Mol, 2004). Gregor & LeCun (2010) showed that the sparse representations computed by ISTA can be efficiently approximated by a structurally similar encoder with a less restrictive, learned parameterization. Rather than learn to approximate a precomputed optimal sparse code, the LISTA autoencoders of Sprechmann, Bronstein, & Sapiro (2012a,b) are trained to directly minimize the sparse reconstruction loss function. DrSAE extends LISTA autoencoders with a non-negativity constraint, which converts the shrink nonlinearity of LISTA into a rectified linear operator; and introduces a unified classification loss, as previously used in conjunction with traditional sparse coders (Bradley & Bagnell, 2008; Mairal, et al., 2009; Mairal, Bach, & Ponce, 2012) and other autoencoders (Boureau, et al., 2010; Ranzato & Szummer, 2008).

DrSAEs resemble the structure of deep sparse rectifier neural networks (Glorot, Bordes, & Bengio, 2011), but differ in that the parameter matrices at each layer are tied (Bengio, BoulangerLewandowski, & Pascanu, 2012), the input projects to all layers, and the outputs are normalized. DrSAEs are also reminiscent of the recurrent neural networks investigated by Bengio & Gingras (1996), but use a different nonlinearity and a heavily regularized loss function. Finally, they are similar to the recurrent networks described by Seung (1998), but have recurrent connections amongst the hidden units, rather than between the hidden units and the input units, and introduce classification and sparsification losses.

Categorical-units

In contrast, the recurrent connections to categorical-units with an encoder-decoder angle greater than 0 . 7 radians are not strongly correlated with the values predicted by ISTA. Rather than analyzing connections to the categorical-units only based upon their destination, it is more informative to consider them organized by their source. Part-units are compatible with categorical-units of certain classes, 8 and not with others, as shown by figure 6(a). Part-units generally have positive connections to categorical-units with parallel prototypes, independent of offset, and negative connections to categorical-units with orthogonal prototypes, as shown in figure 7(a). This corresponds to a sophisticated form of pooling (Jarrett, et al., 2009), with a single categorical-unit drawing excitation from a large collection of parallel but not necessarily perfectly aligned part-units, as in figure 6(c). It is also suggestive of the standard Hubel and Wiesel model of complex cells in primary visual cortex (Hubel & Wiesel, 1962). ISTA would instead predict a connection proportional to the inner product, which is zero for orthogonal prototypes and negative for anti-aligned prototypes.

Part-units use sparse coding dynamics, and so are not disproportionately suppressed by categoricalunits that represent any particular class. However, each part-unit is itself compatible with (i.e., has positive connections to) categorical-units of only a subset of the classes. As a result, the categorical-

7 For the purpose of constructing figures characterizing the difference between part-units and categoricalunits, we consider units with encoder-decoder angle less than 0 . 5 radians to be part-units, and units with encoder-decoder angle greater than 0 . 7 radians to be categorical-units. These thresholds are heuristic, and fail to reflect the continuum that exists between part- and categorical-units, but they facilitate analysis of the extremes.

8 Categorical-units have strong, sparse classification matrix projections, as shown in figures 2(c) and 3(e,f), and can be identified with the output class to which they have the strongest projection.

Figure 4: Part-units have connections consistent with ISTA. The actual connection weights S -I versus the ISTA-predicted weights D glyph[latticetop] · D , for connections from part-units to part-units (a) and categorical-units to part-units (b); and the angle between the rows of S -I and the ISTA-ideal D glyph[latticetop] · D versus the angle between the encoder rows and decoder columns (c). Units are considered part-units if the angle between their encoder and decoder is less than 0 . 5 radians, and categoricalunits if the angle between their encoder and decoder is greater than 0 . 7 radians.

Figure 4: Part-units have connections consistent with ISTA. The actual connection weights S -I versus the ISTA-predicted weights D glyph[latticetop] · D , for connections from part-units to part-units (a) and categorical-units to part-units (b); and the angle between the rows of S -I and the ISTA-ideal D glyph[latticetop] · D versus the angle between the encoder rows and decoder columns (c). Units are considered part-units if the angle between their encoder and decoder is less than 0 . 5 radians, and categoricalunits if the angle between their encoder and decoder is greater than 0 . 7 radians.

Figure 5: Part-units receive ISTA-compatible connections and thus perform sparse coding on the residual input after the contribution of the categorical-units is subtracted out. The decoders of the twenty units with the strongest explaining-away connections | S i,j -δ i,j | to three typical part-units, sorted by connection magnitude. The left-most column depicts the decoder of the recipient partunit. The bars above the decoders in the remaining columns indicate the strength of the connections. Black bars are used for positive connections, and white bars for negative connections.

Figure 5: Part-units receive ISTA-compatible connections and thus perform sparse coding on the residual input after the contribution of the categorical-units is subtracted out. The decoders of the twenty units with the strongest explaining-away connections | S i,j -δ i,j | to three typical part-units, sorted by connection magnitude. The left-most column depicts the decoder of the recipient partunit. The bars above the decoders in the remaining columns indicate the strength of the connections. Black bars are used for positive connections, and white bars for negative connections.

units and thus the class chosen are determined by the part-unit activations. In particular, only a subset of the possible deformations implemented by part-unit decoders are freely available for each prototype, since part-units with a strong negative connection to a categorical-unit will tend to silence it, and so cannot be used to transform the prototype of that categorical-unit.

Categorical-units implement winner-take-all-like dynamics amongst themselves, as shown in figure 6(b), with negative connections to most other categorical-units. Positive total self-connections S i,i facilitate the integration of inputs over time.

Figure 6: Categorical-units execute a sophisticated form of pooling over part-units, and have winnertake-all dynamics amongst themselves. The decoders of the categorical-units receiving the twenty strongest connections | S i,j -δ i,j | from representative part-units (a) and categorical-units (b), and the decoders of the part-units sending the twenty strongest projections to representative categoricalunits (c). The connections are sorted first by the class of their destination, and then by the magnitude of the connection. The left-most column depicts the decoder of the source (a,b) or destination (c) unit. The bars above the decoders in the remaining columns indicate the strength of the connections. Black bars are used for positive connections, and white bars for negative connections.

Figure 6: Categorical-units execute a sophisticated form of pooling over part-units, and have winnertake-all dynamics amongst themselves. The decoders of the categorical-units receiving the twenty strongest connections | S i,j -δ i,j | from representative part-units (a) and categorical-units (b), and the decoders of the part-units sending the twenty strongest projections to representative categoricalunits (c). The connections are sorted first by the class of their destination, and then by the magnitude of the connection. The left-most column depicts the decoder of the source (a,b) or destination (c) unit. The bars above the decoders in the remaining columns indicate the strength of the connections. Black bars are used for positive connections, and white bars for negative connections.

When activated, the categorical-units make a much larger contribution to the reconstruction than any single part-unit, as can be seen in figure 7(b). Since, the projections from categorical-units to part-units are consistent with ISTA, the magnitude of the categorical-unit contribution to the reconstruction need not be tightly regulated. The part-units adjust accordingly to accommodate whatever residual is left by the categorical-units.

The units form a rough hierarchy, with part-units on the bottom and categorical-units on the top. Categorical-units receive strong recurrent connections, as shown in figure 3(c,d) implying that their activity is more determined by other hidden units and less by the input (since the magnitude of the input connections is bounded), and thus they are higher in the hierarchy. As shown in figure 7(c), part-units receive most of their input from other part-units; categorical-units receive a larger fraction of their input from other categorical-units. Whereas part-units have well-structured encoders and are generally activated directly by the input on the first iteration, categorical-units are more likely to first achieve a non-zero activation on the second iteration, as shown in figure 7(d), suggesting that they require stimulation from part-units. The immediate response of part-units in contrast to the gradual refinement of categorical-units is apparent in figure 8, which shows the optimal decoding matrix for selected units, inferred from their observed activity at each iteration.

Performance

The comparison of MNIST classification performance in table 1 demonstrates the power of the hierarchical representation learned by DrSAEs. Rather than learn to minimize the sum of equations 2 and 3, Gregor & LeCun (2010) train the LISTA encoder to approximate the code generated by a traditional sparse coder. While they do not report classification performance using LISTA, Gregor and LeCun do evaluate MNIST classification error using the related learned coordinate descent algorithm. Sprechmann, Bronstein, & Sapiro (2012a,b) extend this approach by training a LISTA auto-encoder to reconstruct the input directly, using loss functions similar to equation 2. Although they identify the possibility of using regularization dependent upon supervised information, Sprechmann and colleagues do not consider a parameterized classifier operating on a common hidden

Figure 7: Statistics of connections indicate the presence of a rough hierarchy, with categorical-units on the top integrating over part-units on the bottom. Average explaining-away connection weight S ij , binned by alignment between decoders, for connections from part-units to categorical-units (a). If no units fall in a given bin, the average is set to zero. Average final value of a unit z t = T i , given that z t = T i > 0 , versus the angle between the encoder row E i and decoder column D i (b). Average angle between encoder row E j and decoder column D j of afferents to unit i , weighted by the strength of the connection to unit i , versus the angle between encoder row E i and decoder column D i (c). Probability that z 1 i = 0 and z 2 i > 0 , versus the angle between the encoder row E i and decoder column D i (d). Average value of the decoder column D i versus the angle between the encoder row E i and the decoder column D i (e).

Figure 7: Statistics of connections indicate the presence of a rough hierarchy, with categorical-units on the top integrating over part-units on the bottom. Average explaining-away connection weight S ij , binned by alignment between decoders, for connections from part-units to categorical-units (a). If no units fall in a given bin, the average is set to zero. Average final value of a unit z t = T i , given that z t = T i > 0 , versus the angle between the encoder row E i and decoder column D i (b). Average angle between encoder row E j and decoder column D j of afferents to unit i , weighted by the strength of the connection to unit i , versus the angle between encoder row E i and decoder column D i (c). Probability that z 1 i = 0 and z 2 i > 0 , versus the angle between the encoder row E i and decoder column D i (d). Average value of the decoder column D i versus the angle between the encoder row E i and the decoder column D i (e).

representation. Instead, they train a separate encoder for each class, and classify each input based upon the encoder with the lowest sparse coding error. DrSAEs significantly outperform these other techniques based upon a LISTA encoder.

DrSAEs also perform well compared to other techniques using encoders related to LISTA. Deep sparse rectifier neural networks (Glorot, Bordes, & Bengio, 2011) combine discriminative training with an encoder similar to LISTA, but do not tie the parameters between the layers and only allow the input to project to the first layer. Differentiable sparse coding (Bradley & Bagnell, 2008) and

Figure 8: Part-units (a) respond to the input quickly, while the activity of categorical-units (b) refines slowly. Columns of the optimal decoding matrices D t minimizing the input reconstruction error || x -D t · z t || 2 2 from the hidden representation z t for t = 1 , . . . , T . The first and last columns show the corresponding encoder and decoder for the chosen representative units. Intermediate columns represent successive iterations t .

Figure 8: Part-units (a) respond to the input quickly, while the activity of categorical-units (b) refines slowly. Columns of the optimal decoding matrices D t minimizing the input reconstruction error || x -D t · z t || 2 2 from the hidden representation z t for t = 1 , . . . , T . The first and last columns show the corresponding encoder and decoder for the chosen representative units. Intermediate columns represent successive iterations t .

Table 1: MNIST classification error rate (%) for pixel-permutation-agnostic encoders without boosting-like augmentations. The first column indicates the size of each layer in the specified encoder, separated by hyphens. Exponents specify the number of recurrent iterations; asterisks denote repetition to convergence. 10 × ( · · · ) indicates that a separate encoder is trained for each input class; 45 × ( · · · ) indicates that a separate encoder is trained for each pairwise binary classification problem. Further performance improvements have been reported with regularization techniques such as dropout, architectures that enforce translation-invariance, and datasets augmented by deformations, as discussed in the main text.

supervised dictionary learning (Mairal, et al., 2009) also train discriminatively, but effectively use an infinite-depth ISTA-like encoder, and are thus much less computationally efficient than DrSAEs. Supervised dictionary learning achieves performance statistically indistinguishable from DrSAEs using a contrastive loss function. A similar technique achieves MNIST classification error as low

as 0 . 54% when the dataset is augmented with shifted copies of the inputs (Mairal, Bach, & Ponce, 2012).

Additional regularizations and boosting-like techniques can further improve performance of networks with LISTA-like encoders. Recent examples include dropout, which trains and then averages over a large set of random subnetworks formed by removing a constant fraction of the hidden units from the original network (Goodfellow, et al., 2013; Hinton, et al., 2012). Deep belief networks and deep Boltzmann machines fine-tuned with dropout are the current state-of-the-art for pixelpermutation-agnostic handwritten digit recognition (Hinton, et al., 2012), and can achieve MNIST classification error as low as 0 . 79% with a carefully tuned network structure and multi-step training procedure. Deep convex networks, which iteratively refine the classification by successively training a stack of classifiers, with the output of the i -1 st classifier provided as input to the i th classifier, can achieve an MNIST error of 0 . 83% (Deng & Yu, 2011). Regularizing by explicit modeling of the data manifold, and then minimizing the square of the Jacobian of the output along the tangent bundle around the training datapoints, can reduce MNIST error to 0 . 81% (Rifai, et al., 2011). Further performance improvements are possible if translation invariance is built directly into the network via a convolutional architecture, and deformations of the inputs are included in the training set (LeCun, et al., 1998), yielding error as low as 0 . 23% (Ciresan, Meier, & Schmidhuber, 2012). These regularizations and augmentations are potentially compatible with DrSAE, but we defer their exploration to future work.

Recurrence is essential to the performance of DrSAEs. If the number of recurrent iterations is decreased from eleven to two, MNIST classification error in a network with 400 hidden units increases from 1 . 08% to 1 . 32% . With only 200 hidden units, MNIST classification error increases from 1 . 21% to 1 . 49% , although the hidden units still differentiate into part-units and categorical-units, as shown in figure 3(d,f).

Decoder decomposition into prototype and tangent-space

Discussion

It is widely believed that natural stimuli, such as images and sounds, fall near a low-dimensional manifold within a higher-dimensional space (the manifold hypothesis ) (Bengio, Courville, & Vincent, 2012; Lee, Pedersen, & Mumford, 2003; Olshausen & Field, 2004). The low-dimensional data manifold provides an intuitively compelling and empirically effective basis for classification (Rifai, et al., 2011; Simard, LeCun, & Denker, 1993; Simard, et al., 1998). The continuous deformations that define the data manifold usually preserve identity, whereas even relatively small invalid transformations may change the class of a stimulus. For instance, the various handwritten renditions of the digit 3 in in the last column of figure 9(c) barely overlap, and so the Euclidean distance between them in pixel space is greater than that to the nearest 8 formed by closing both loops. Nevertheless, smooth deformations of one 3 into another correspond to relatively short trajectories along the data manifold, 9 whereas the transformation of a 3 into an 8 requires a much longer path within the data manifold. A prohibitive amount of data is required to fully characterize the data manifold (Narayanan & Mitter, 2010), so it is often approximated by the set of linear submanifolds tangent to the data manifold at the observed datapoints, known as the tangent spaces (Ekanadham, Tranchina, & Simoncelli, 2011; Rifai, et al., 2011; Simard, et al., 1998). DrSAEs naturally and efficiently form a tangent space-like representation, consisting of a point on the data manifold indicated by the categorical-units, and a shift within the tangent space specified by the part-units.

Before discriminative fine-tuning, DrSAEs perform a traditional part-based decomposition, familiar from sparse coding, as shown in figure 9(a). The decoding matrix columns are class-independent, local pen strokes, and many units make a comparable, small contribution to the reconstruction. After discriminative fine-tuning, the hidden units differentiate into sparse coding local part-units, and global prototype categorical-units that integrate over them. As shown in figure 9(b,c), the input is decomposed into a prototype, corresponding to a point on the data manifold; and a set of deformations from this prototype along the data manifold, corresponding to shifts within the tangent space. The same prototype can be used for very different inputs, as demonstrated in figure 9(c), since the space of deformations is rich enough to encompass diverse transformations without moving off the data

9 In particular, figure 9(c) shows how each input can be produced by identity-preserving deformations from a common prototype, using the tangent space decomposition produced by our network.

Figure 9: Discriminative recurrent sparse auto-encoders decompose the input into a prototype and deformations along the data manifold. The progressive reconstruction of selected inputs by the hidden representation before (a) or after (b,c) discriminative fine-tuning. The columns from left to right depict either the components of the reconstruction (top row of each pair), or the partial reconstruction induced by the first n parts (bottom row of each pair). Parts are added to the reconstruction in order of decreasing contribution magnitude; smoother transformations are possible with an optimized sequence. The last two columns show the final reconstruction with all parts (Fin), and the original input (Inp). Bars above the decoding matrix columns indicate the scale factor/hidden unit activity associated with the column.

Figure 9: Discriminative recurrent sparse auto-encoders decompose the input into a prototype and deformations along the data manifold. The progressive reconstruction of selected inputs by the hidden representation before (a) or after (b,c) discriminative fine-tuning. The columns from left to right depict either the components of the reconstruction (top row of each pair), or the partial reconstruction induced by the first n parts (bottom row of each pair). Parts are added to the reconstruction in order of decreasing contribution magnitude; smoother transformations are possible with an optimized sequence. The last two columns show the final reconstruction with all parts (Fin), and the original input (Inp). Bars above the decoding matrix columns indicate the scale factor/hidden unit activity associated with the column.

manifold. Even when the prototype is very different from the input, all steps along the reconstruction trajectories in figure 9(b,c) are recognizable as members of the same class.

The prototypes learned by the categorical-units for each class are not simply the average over the elements of the class, as depicted in figure 10. Each class includes many possible input variations, so its average is blurry. The prototypes, in contrast, are sharp, and look like representative elements of the appropriate class. Many categorical-units are available for each class, as shown in figure 6. Not all categorical-units correspond to full prototypes; some capture global transformations of a prototype, such as rotations (Simard, et al., 1998).

Consistent with prototypes for the non-negative MNIST inputs, the decoding matrix columns of the categorical-units are generally positive, as shown in figure 7(e). In contrast, the decoders of the part-units are approximately mean-zero and so cannot serve as prototypes themselves. Rather, they shift and transform prototypes, moving activation from one region in the image to another, as demonstrated in figure 9(b,c).

Figure 10: The prototypes learned by categorical-units resemble representative instances of the appropriate class, and are sharper than the average over all members of the class in the dataset. The left-most column in each group depicts the average over all elements of each of the ten MNIST digit classes. The other columns show the decoders of the associated units with the largest-magnitude columns in the classification matrix C . Bars above the decoders indicate the angle between the encoder and the decoder for the displayed unit. The most prototypical unit always makes the strongest contribution to the classification, and has a large (but not necessarily the largest) angle between its encoder and decoder. Some units that make large contributions to the classification represent global transformations, such as rotations, of a prototype (Simard, et al., 1998).

Figure 10: The prototypes learned by categorical-units resemble representative instances of the appropriate class, and are sharper than the average over all members of the class in the dataset. The left-most column in each group depicts the average over all elements of each of the ten MNIST digit classes. The other columns show the decoders of the associated units with the largest-magnitude columns in the classification matrix C . Bars above the decoders indicate the angle between the encoder and the decoder for the displayed unit. The most prototypical unit always makes the strongest contribution to the classification, and has a large (but not necessarily the largest) angle between its encoder and decoder. Some units that make large contributions to the classification represent global transformations, such as rotations, of a prototype (Simard, et al., 1998).

Discrepancies between the prototype and the input due to transformations along the data manifold are explained by class-consistent part-units, and only serve to further activate the categorical-units of that class, as in figure 6(a,c). Discrepancies between the prototype and the input due to deformations orthogonal to the data manifold are explained by class-incompatible part-units, and serve to suppress the categorical-units of that class, both directly and via activation of incompatible categorical-units.

If the wrong prototype is turned on, the residual input will generally contain substantial unexplained components. Part-units obey ISTA-like dynamics and thus function as a sparse coder on the residual input, so part-units that match the unexplained components of the input will be activated. These partunits will have positive connections to categorical-units with compatible prototypes, and so will tend to activate categorical-units associated with the true class (so long as the unexplained components of the input are diagnostic). The spuriously activated categorical-unit will not be able to sustain its activity, since few compatible part-units will be required to capture the residual input.

The classification approach used by DrSAEs is different from one based upon a traditional sparse coding decomposition: it projects into the space of deviations from a prototype, which is not the same as the space of prototype-free parts, as is clear from figure 9(a,b). For instance, a 5 can easily be constructed using the parts of a 6 , making it difficult to distinguish the two. Indeed, the first seven progressive reconstruction steps of the 6 in figure 9(a) could just as easily be used to produce a 5 . However, starting from a 6 prototype, the parts required to break the bottom loop are outside the data manifold of the 6 class, and so will tend to change the active prototype.

DrSAEs naturally learn a hierarchical representation within a recurrent network, thereby implementing a deep network with parameter sharing between the layers.

Biological plausibility

Conclusion

It is widely believed that natural stimuli, such as images and sounds, fall near a low-dimensional manifold within a higher-dimensional space (the manifold hypothesis ) (Bengio, Courville, & Vincent, 2012; Lee, Pedersen, & Mumford, 2003; Olshausen & Field, 2004). The low-dimensional data manifold provides an intuitively compelling and empirically effective basis for classification (Rifai, et al., 2011; Simard, LeCun, & Denker, 1993; Simard, et al., 1998). The continuous deformations that define the data manifold usually preserve identity, whereas even relatively small invalid transformations may change the class of a stimulus. For instance, the various handwritten renditions of the digit 3 in in the last column of figure 9(c) barely overlap, and so the Euclidean distance between them in pixel space is greater than that to the nearest 8 formed by closing both loops. Nevertheless, smooth deformations of one 3 into another correspond to relatively short trajectories along the data manifold, 9 whereas the transformation of a 3 into an 8 requires a much longer path within the data manifold. A prohibitive amount of data is required to fully characterize the data manifold (Narayanan & Mitter, 2010), so it is often approximated by the set of linear submanifolds tangent to the data manifold at the observed datapoints, known as the tangent spaces (Ekanadham, Tranchina, & Simoncelli, 2011; Rifai, et al., 2011; Simard, et al., 1998). DrSAEs naturally and efficiently form a tangent space-like representation, consisting of a point on the data manifold indicated by the categorical-units, and a shift within the tangent space specified by the part-units.

Before discriminative fine-tuning, DrSAEs perform a traditional part-based decomposition, familiar from sparse coding, as shown in figure 9(a). The decoding matrix columns are class-independent, local pen strokes, and many units make a comparable, small contribution to the reconstruction. After discriminative fine-tuning, the hidden units differentiate into sparse coding local part-units, and global prototype categorical-units that integrate over them. As shown in figure 9(b,c), the input is decomposed into a prototype, corresponding to a point on the data manifold; and a set of deformations from this prototype along the data manifold, corresponding to shifts within the tangent space. The same prototype can be used for very different inputs, as demonstrated in figure 9(c), since the space of deformations is rich enough to encompass diverse transformations without moving off the data

9 In particular, figure 9(c) shows how each input can be produced by identity-preserving deformations from a common prototype, using the tangent space decomposition produced by our network.

Figure 9: Discriminative recurrent sparse auto-encoders decompose the input into a prototype and deformations along the data manifold. The progressive reconstruction of selected inputs by the hidden representation before (a) or after (b,c) discriminative fine-tuning. The columns from left to right depict either the components of the reconstruction (top row of each pair), or the partial reconstruction induced by the first n parts (bottom row of each pair). Parts are added to the reconstruction in order of decreasing contribution magnitude; smoother transformations are possible with an optimized sequence. The last two columns show the final reconstruction with all parts (Fin), and the original input (Inp). Bars above the decoding matrix columns indicate the scale factor/hidden unit activity associated with the column.

Figure 9: Discriminative recurrent sparse auto-encoders decompose the input into a prototype and deformations along the data manifold. The progressive reconstruction of selected inputs by the hidden representation before (a) or after (b,c) discriminative fine-tuning. The columns from left to right depict either the components of the reconstruction (top row of each pair), or the partial reconstruction induced by the first n parts (bottom row of each pair). Parts are added to the reconstruction in order of decreasing contribution magnitude; smoother transformations are possible with an optimized sequence. The last two columns show the final reconstruction with all parts (Fin), and the original input (Inp). Bars above the decoding matrix columns indicate the scale factor/hidden unit activity associated with the column.

manifold. Even when the prototype is very different from the input, all steps along the reconstruction trajectories in figure 9(b,c) are recognizable as members of the same class.

The prototypes learned by the categorical-units for each class are not simply the average over the elements of the class, as depicted in figure 10. Each class includes many possible input variations, so its average is blurry. The prototypes, in contrast, are sharp, and look like representative elements of the appropriate class. Many categorical-units are available for each class, as shown in figure 6. Not all categorical-units correspond to full prototypes; some capture global transformations of a prototype, such as rotations (Simard, et al., 1998).

Consistent with prototypes for the non-negative MNIST inputs, the decoding matrix columns of the categorical-units are generally positive, as shown in figure 7(e). In contrast, the decoders of the part-units are approximately mean-zero and so cannot serve as prototypes themselves. Rather, they shift and transform prototypes, moving activation from one region in the image to another, as demonstrated in figure 9(b,c).

Figure 10: The prototypes learned by categorical-units resemble representative instances of the appropriate class, and are sharper than the average over all members of the class in the dataset. The left-most column in each group depicts the average over all elements of each of the ten MNIST digit classes. The other columns show the decoders of the associated units with the largest-magnitude columns in the classification matrix C . Bars above the decoders indicate the angle between the encoder and the decoder for the displayed unit. The most prototypical unit always makes the strongest contribution to the classification, and has a large (but not necessarily the largest) angle between its encoder and decoder. Some units that make large contributions to the classification represent global transformations, such as rotations, of a prototype (Simard, et al., 1998).

Figure 10: The prototypes learned by categorical-units resemble representative instances of the appropriate class, and are sharper than the average over all members of the class in the dataset. The left-most column in each group depicts the average over all elements of each of the ten MNIST digit classes. The other columns show the decoders of the associated units with the largest-magnitude columns in the classification matrix C . Bars above the decoders indicate the angle between the encoder and the decoder for the displayed unit. The most prototypical unit always makes the strongest contribution to the classification, and has a large (but not necessarily the largest) angle between its encoder and decoder. Some units that make large contributions to the classification represent global transformations, such as rotations, of a prototype (Simard, et al., 1998).

Discrepancies between the prototype and the input due to transformations along the data manifold are explained by class-consistent part-units, and only serve to further activate the categorical-units of that class, as in figure 6(a,c). Discrepancies between the prototype and the input due to deformations orthogonal to the data manifold are explained by class-incompatible part-units, and serve to suppress the categorical-units of that class, both directly and via activation of incompatible categorical-units.

If the wrong prototype is turned on, the residual input will generally contain substantial unexplained components. Part-units obey ISTA-like dynamics and thus function as a sparse coder on the residual input, so part-units that match the unexplained components of the input will be activated. These partunits will have positive connections to categorical-units with compatible prototypes, and so will tend to activate categorical-units associated with the true class (so long as the unexplained components of the input are diagnostic). The spuriously activated categorical-unit will not be able to sustain its activity, since few compatible part-units will be required to capture the residual input.

The classification approach used by DrSAEs is different from one based upon a traditional sparse coding decomposition: it projects into the space of deviations from a prototype, which is not the same as the space of prototype-free parts, as is clear from figure 9(a,b). For instance, a 5 can easily be constructed using the parts of a 6 , making it difficult to distinguish the two. Indeed, the first seven progressive reconstruction steps of the 6 in figure 9(a) could just as easily be used to produce a 5 . However, starting from a 6 prototype, the parts required to break the bottom loop are outside the data manifold of the 6 class, and so will tend to change the active prototype.

DrSAEs naturally learn a hierarchical representation within a recurrent network, thereby implementing a deep network with parameter sharing between the layers.

Efficient encoding and training

Extra ideas

Acknowledgments

Supplementary figures

Extended discussion

It is widely believed that natural stimuli, such as images and sounds, fall near a low-dimensional manifold within a higher-dimensional space (the manifold hypothesis ) (Bengio, Courville, & Vincent, 2012; Lee, Pedersen, & Mumford, 2003; Olshausen & Field, 2004). The low-dimensional data manifold provides an intuitively compelling and empirically effective basis for classification (Rifai, et al., 2011; Simard, LeCun, & Denker, 1993; Simard, et al., 1998). The continuous deformations that define the data manifold usually preserve identity, whereas even relatively small invalid transformations may change the class of a stimulus. For instance, the various handwritten renditions of the digit 3 in in the last column of figure 9(c) barely overlap, and so the Euclidean distance between them in pixel space is greater than that to the nearest 8 formed by closing both loops. Nevertheless, smooth deformations of one 3 into another correspond to relatively short trajectories along the data manifold, 9 whereas the transformation of a 3 into an 8 requires a much longer path within the data manifold. A prohibitive amount of data is required to fully characterize the data manifold (Narayanan & Mitter, 2010), so it is often approximated by the set of linear submanifolds tangent to the data manifold at the observed datapoints, known as the tangent spaces (Ekanadham, Tranchina, & Simoncelli, 2011; Rifai, et al., 2011; Simard, et al., 1998). DrSAEs naturally and efficiently form a tangent space-like representation, consisting of a point on the data manifold indicated by the categorical-units, and a shift within the tangent space specified by the part-units.

Before discriminative fine-tuning, DrSAEs perform a traditional part-based decomposition, familiar from sparse coding, as shown in figure 9(a). The decoding matrix columns are class-independent, local pen strokes, and many units make a comparable, small contribution to the reconstruction. After discriminative fine-tuning, the hidden units differentiate into sparse coding local part-units, and global prototype categorical-units that integrate over them. As shown in figure 9(b,c), the input is decomposed into a prototype, corresponding to a point on the data manifold; and a set of deformations from this prototype along the data manifold, corresponding to shifts within the tangent space. The same prototype can be used for very different inputs, as demonstrated in figure 9(c), since the space of deformations is rich enough to encompass diverse transformations without moving off the data

9 In particular, figure 9(c) shows how each input can be produced by identity-preserving deformations from a common prototype, using the tangent space decomposition produced by our network.

Figure 9: Discriminative recurrent sparse auto-encoders decompose the input into a prototype and deformations along the data manifold. The progressive reconstruction of selected inputs by the hidden representation before (a) or after (b,c) discriminative fine-tuning. The columns from left to right depict either the components of the reconstruction (top row of each pair), or the partial reconstruction induced by the first n parts (bottom row of each pair). Parts are added to the reconstruction in order of decreasing contribution magnitude; smoother transformations are possible with an optimized sequence. The last two columns show the final reconstruction with all parts (Fin), and the original input (Inp). Bars above the decoding matrix columns indicate the scale factor/hidden unit activity associated with the column.

Figure 9: Discriminative recurrent sparse auto-encoders decompose the input into a prototype and deformations along the data manifold. The progressive reconstruction of selected inputs by the hidden representation before (a) or after (b,c) discriminative fine-tuning. The columns from left to right depict either the components of the reconstruction (top row of each pair), or the partial reconstruction induced by the first n parts (bottom row of each pair). Parts are added to the reconstruction in order of decreasing contribution magnitude; smoother transformations are possible with an optimized sequence. The last two columns show the final reconstruction with all parts (Fin), and the original input (Inp). Bars above the decoding matrix columns indicate the scale factor/hidden unit activity associated with the column.

manifold. Even when the prototype is very different from the input, all steps along the reconstruction trajectories in figure 9(b,c) are recognizable as members of the same class.

The prototypes learned by the categorical-units for each class are not simply the average over the elements of the class, as depicted in figure 10. Each class includes many possible input variations, so its average is blurry. The prototypes, in contrast, are sharp, and look like representative elements of the appropriate class. Many categorical-units are available for each class, as shown in figure 6. Not all categorical-units correspond to full prototypes; some capture global transformations of a prototype, such as rotations (Simard, et al., 1998).

Consistent with prototypes for the non-negative MNIST inputs, the decoding matrix columns of the categorical-units are generally positive, as shown in figure 7(e). In contrast, the decoders of the part-units are approximately mean-zero and so cannot serve as prototypes themselves. Rather, they shift and transform prototypes, moving activation from one region in the image to another, as demonstrated in figure 9(b,c).

Figure 10: The prototypes learned by categorical-units resemble representative instances of the appropriate class, and are sharper than the average over all members of the class in the dataset. The left-most column in each group depicts the average over all elements of each of the ten MNIST digit classes. The other columns show the decoders of the associated units with the largest-magnitude columns in the classification matrix C . Bars above the decoders indicate the angle between the encoder and the decoder for the displayed unit. The most prototypical unit always makes the strongest contribution to the classification, and has a large (but not necessarily the largest) angle between its encoder and decoder. Some units that make large contributions to the classification represent global transformations, such as rotations, of a prototype (Simard, et al., 1998).

Figure 10: The prototypes learned by categorical-units resemble representative instances of the appropriate class, and are sharper than the average over all members of the class in the dataset. The left-most column in each group depicts the average over all elements of each of the ten MNIST digit classes. The other columns show the decoders of the associated units with the largest-magnitude columns in the classification matrix C . Bars above the decoders indicate the angle between the encoder and the decoder for the displayed unit. The most prototypical unit always makes the strongest contribution to the classification, and has a large (but not necessarily the largest) angle between its encoder and decoder. Some units that make large contributions to the classification represent global transformations, such as rotations, of a prototype (Simard, et al., 1998).

Discrepancies between the prototype and the input due to transformations along the data manifold are explained by class-consistent part-units, and only serve to further activate the categorical-units of that class, as in figure 6(a,c). Discrepancies between the prototype and the input due to deformations orthogonal to the data manifold are explained by class-incompatible part-units, and serve to suppress the categorical-units of that class, both directly and via activation of incompatible categorical-units.

If the wrong prototype is turned on, the residual input will generally contain substantial unexplained components. Part-units obey ISTA-like dynamics and thus function as a sparse coder on the residual input, so part-units that match the unexplained components of the input will be activated. These partunits will have positive connections to categorical-units with compatible prototypes, and so will tend to activate categorical-units associated with the true class (so long as the unexplained components of the input are diagnostic). The spuriously activated categorical-unit will not be able to sustain its activity, since few compatible part-units will be required to capture the residual input.

The classification approach used by DrSAEs is different from one based upon a traditional sparse coding decomposition: it projects into the space of deviations from a prototype, which is not the same as the space of prototype-free parts, as is clear from figure 9(a,b). For instance, a 5 can easily be constructed using the parts of a 6 , making it difficult to distinguish the two. Indeed, the first seven progressive reconstruction steps of the 6 in figure 9(a) could just as easily be used to produce a 5 . However, starting from a 6 prototype, the parts required to break the bottom loop are outside the data manifold of the 6 class, and so will tend to change the active prototype.

DrSAEs naturally learn a hierarchical representation within a recurrent network, thereby implementing a deep network with parameter sharing between the layers.

Interpretation

$$ \label{layer-dynamics} \hid^{t+1} = \max\left(0, \E \cdot \inp + \Sm \cdot \hid^t - \bv \right) $$ \tag{layer-dynamics}

$$ \label{reconstruction-loss} L^U = \frac{1}{2} \cdot \left| \left| \inp - \D \cdot \hid^T \right| \right|_2^2 + \lambda \cdot \left|\left|\hid^T \right|\right|_1, $$ \tag{reconstruction-loss}

$$ \logistic_\out(\hid) = \nobfhid_\out - \log\left( \sum_i e^{\nobfhid_i} \right) , , $$

$$ \label{reconstruction-loss} $$ \tag{reconstruction-loss}

$$ \label{ISTA-restriction} \E = \alpha \cdot \D^{\top} , \hspace{1cm} \Sm = \I - \alpha \cdot \D^{\top} \cdot \D , , \hspace{0.5cm} b_i= \alpha \cdot \lambda , , \hspace{0.5cm} %\boldsymbol\beta , , \hspace{0.5cm} \text{and} \hspace{0.5cm} \nobfhid_i^t \geq 0 , , $$ \tag{ISTA-restriction}

$$ \out_j &= \Theta\left(\sum_i W_{ji} \cdot \nobfinp_i \right) \ \frac{d \out_j}{d \nobfhid_k} &= \left( \sum_i W_{ji} \cdot \nobfinp_i > 0 \right) \cdot \sum_i W_{ji} \cdot \frac{d \nobfinp_i}{d \nobfhid_k} $$

Deep networks complement the hierarchical structure in natural data (Bengio, 2009). By breaking complex calculations into many steps, deep networks can gradually build up complicated decision boundaries or input transformations, facilitate the reuse of common substructure, and explicitly compare alternative interpretations of ambiguous input (Lee, Ekanadham, & Ng, 2008; Zeiler, Taylor, & Fergus, 2011). Leveraging these strengths, deep networks have facilitated significant advances in solving sensory problems like visual classification and speech recognition (Dahl, et al., 2012; Hinton, Osindero, & Teh, 2006; Hinton, et al., 2012).

Although deep networks have traditionally used independent parameters for each layer, they are equivalent to recurrent networks in which a disjoint set of units is active on each time step. The corresponding representations are sparse, and thus invite the incorporation of powerful techniques from sparse coding (Glorot, Bordes, & Bengio, 2011; Lee, Ekanadham, & Ng, 2008; Olshausen & Field, 1996, 1997; Ranzato, et al., 2006). Recurrence opens the possibility of sharing parameters between successive layers of a deep network.

This paper introduces the Discriminative Recurrent Sparse Auto-Encoder model (DrSAE), comprising a recurrent encoder of rectified linear units (ReLU; Coates & Ng, 2011; Glorot, Bordes, & Bengio, 2011; Jarrett, et al., 2009; Nair & Hinton, 2010; Salinas & Abbott, 1996), connected to two linear decoders that reconstruct the input and predict its supervised classification. The recurrent encoder is unrolled in time for a fixed number of iterations, with the input projecting to each resulting layer, and trained using backpropagation-through-time (Rumelhart, et al., 1986). Training initially minimizes an unsupervised sparse reconstruction error; the loss function is then augmented with a

discriminative term on the supervised classification. In its temporally-unrolled form, the network can be seen as a deep network, with parameters shared between the hidden layers. The temporal depth allows the system to exhibit far more representational power, while keeping the number of trainable parameters fixed.

Interestingly, experiments show that DrSAE does not just discover more discriminative 'parts' of the form conventionally produced by sparse coding. Rather, the hidden units spontaneously differentiate into two types: a small number of categorical-units and a larger number of part-units. The categorical-units have decoder bases that look like prototypes of the input classes. They are weakly influenced by the input and activate late in the dynamics as the result of interaction with the part-units. In contrast, the part-units are strongly influenced by the input, and encode small transformations through which the prototypes of categorical-units can be reshaped into the current input. Categorical-units compete with each other through mutual inhibition and cooperate with relevant part-units. This can be interpreted as a representation of the data manifold in which the categoricalunits are points on the manifold, and the part-units are akin to tangent vectors along the manifold.

LISTA auto-encoder, 10 × ( 289 - 100 5 ) (Sprechmann, Bronstein, &Sapiro, 2012a)3 . 76( 5 . 98 with 289 hidden units)
Learned coordinate descent, 784 - 784 50 - 10 (Gregor &LeCun, 2010)2 . 29
Differentiable sparse coding, 180 - 256 ∗ - 10 (Bradley &Bagnell, 2008)1 . 30
Deep sparse rectifier neural network 784 - 1000 - 1000 - 1000 - 10 (Glorot, Bordes, &Bengio, 2011)1 . 20( 1 . 16 with tanh nonlinearity)
Deep belief network, 784 - 500 - 500 - 2000 - 10 (Hinton, et al., 2012)1 . 18( 0 . 92 with dropout)
Discriminative recurrent sparse auto-encoder 784 - 400 11 - 101 . 08( 1 . 21 with 200 hidden units)
Supervised dictionary learning, 45 × (784 - 24 ∗ ) to 45 × (784 - 96 ∗ ) (Mairal, et al., 2009)1 . 05( 3 . 56 without contrastive loss)

Table: S4.T1: MNIST classification error rate (%) for pixel-permutation-agnostic encoders without boosting-like augmentations. The first column indicates the size of each layer in the specified encoder, separated by hyphens. Exponents specify the number of recurrent iterations; asterisks denote repetition to convergence. 10×(⋯)10⋯10\times\left(\cdots\right) indicates that a separate encoder is trained for each input class; 45×(⋯)45⋯45\times\left(\cdots\right) indicates that a separate encoder is trained for each pairwise binary classification problem. Further performance improvements have been reported with regularization techniques such as dropout, architectures that enforce translation-invariance, and datasets augmented by deformations, as discussed in the main text.

LISTA auto-encoder, 10×(289​-​1005)10289-superscript100510\times\left(289\text{-}100^{5}\right)3.763.763.76(5.985.985.98 with 289 hidden units)
(Sprechmann, Bronstein, & Sapiro, 2012a)
Learned coordinate descent, 784​-​78450​-​10784-superscript78450-10784\text{-}784^{50}\text{-}102.292.292.29
(Gregor & LeCun, 2010)
Differentiable sparse coding, 180​-​256∗​-​10180-superscript256-10180\text{-}256^{*}\text{-}101.301.301.30
(Bradley & Bagnell, 2008)
Deep sparse rectifier neural network1.201.201.20(1.161.161.16 with tanh nonlinearity)
784784784-100010001000-100010001000-100010001000-101010
(Glorot, Bordes, & Bengio, 2011)
Deep belief network, 784​-​500​-​500​-​2000​-​10784-500-500-2000-10784\text{-}500\text{-}500\text{-}2000\text{-}101.181.181.18(0.920.920.92 with dropout)
(Hinton, et al., 2012)
Discriminative recurrent sparse auto-encoder1.081.08\mathbf{1.08}(1.211.211.21 with 200 hidden units)
𝟕𝟖𝟒784\mathbf{784}-𝟒𝟎𝟎𝟏𝟏superscript40011\mathbf{400^{11}}-𝟏𝟎10\mathbf{10}
Supervised dictionary learning,1.051.051.05(3.563.563.56 without contrastive loss)
45×(784​-​24∗)45784-superscript2445\times\left(784\text{-}24^{}\right) to 45×(784​-​96∗)45784-superscript9645\times\left(784\text{-}96^{}\right)
(Mairal, et al., 2009)

$$ L^{U}=\frac{1}{2}\cdot\left|\left|\mathbf{x}-\mathbf{D}\cdot\mathbf{z}^{T}\right|\right|{2}^{2}+\lambda\cdot\left|\left|\mathbf{z}^{T}\right|\right|{1}, $$ \tag{S2.E2}

References

[barlow1961] %Barlow, H. B. (1961). %\newblock Possible principles underlying the transformation of sensory messages. %\newblock In Sensory communication. Cambridge, MA: MIT Press. %(pp. 217--234)

[bengio2009] Bengio, Y. (2009). \newblock Learning deep architectures for AI. \newblock Foundations and Trends in Machine Learning, 2(1), 1--127.

[bengio2012a] Bengio, Y., Boulanger-Lewandowski, N., & Pascanu, R. (2012). \newblock Advances in optimizing recurrent networks. \newblock arXiv:1212.0901v2 [cs.LG]

[bengio2012b] Bengio, Y., Courville, A., & Vincent, P. (2012). \newblock Representation learning: A review and new perspectives. \newblock arXiv:1206.5538 [cs.LG]

[bengio1996] Bengio, Y., & Gingras, F. (1996). \newblock Recurrent neural networks for missing or asynchronous data. \newblock In D.~Touretzky, M.~Mozer, & M.~Hasselmo (Eds.) Advances in Neural Information Processing Systems (NIPS 8) (pp. 395--401). %

[bengio1994] %Bengio, Y., Simard, P., & Frasconi, P. (1994). %\newblock Learning long-term dependencies with gradient descent is difficult. %\newblock IEEE Transactions on Neural Networks, 5(2), 157--166. %

[bishop2006] %Bishop, C. M. (2006). %\newblock Pattern recognition and machine learning. %\newblock New York: Springer.

[bradley2008] Bradley, D. M., & Bagnell, J. A. (2008). \newblock Differentiable sparse coding. \newblock In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.) Advances in Neural Information Processing Systems (NIPS 21) (pp. 113--120).

[boureau2010] Boureau, Y., Bach, F., LeCun, L., & Ponce, J. (2010). \newblock Learning mid-level features for recognition \newblock In Proceedings of the 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010).

[chambolle1998] Chambolle, A., De Vore, R. A., Lee, N. Y., & Lucier, B. J. (1998). \newblock Nonlinear wavelet image processing: Variational problems, compression, and noise removal through wavelet shrinkage. \newblock IEEE Transactions on Image Processing, 7(3), 319--335.

[ciresan2012] Ciresan, D., Meier, U., & Schmidhuber, J. (2012). \newblock Multi-column deep neural networks for image classification. \newblock In Proceedings of the 25th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012) (pp. 3642--3649).

[coates2011] Coates, A., & Ng, A. Y. (2011). \newblock The importance of encoding versus training with sparse coding and vector quantization \newblock L. Getoor & T. Scheffer (Eds.) Proceedings of the 28th International Conference on Machine Learning (ICML 2011) (pp. 921--928).

[dahl2012] Dahl, G. E., Yu, D., Deng, L., & Acero, A. (2012). \newblock Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. \newblock IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 30--42.

[daubechies2004] Daubechies, I., Defrise, M., & De Mol, C. (2004). \newblock An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. \newblock Communications on Pure and Applied Mathematics, 57(11), 1413--1457. %

[douglas1995] %Douglas, R. J., Koch, C., Mahowald, M., Martin, K. A., & Suarez, H. H. (1995). %\newblock Recurrent excitation in neocortical circuits. %\newblock Science, 269(5226), 981--985. %

[douglas2004] %Douglas, R. J., & Martin, K. A. C. (2004). %\newblock Neuronal circuits of the neocortex. %\newblock Annual Review of Neuroscience, 27, 419--451.

[ekanadham2011] Ekanadham, C., Tranchina, D., & Simoncelli, E. P. (2011). \newblock Recovery of sparse translation-invariant signals with continuous basis pursuit. \newblock IEEE Transactions on Signal Processing, 59(10), 4735--4744. %

[glorot2011] Glorot, X., Bordes, A., & Bengio, Y. (2011). \newblock Deep sparse rectifier neural networks. %\newblock In G. Gordon, D. Dunson, & M. Dudik (Eds.) AISTATS 2011 (pp. 315--323). \newblock In G. Gordon, D. Dunson, & M. Dudik (Eds.) JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011) (pp. 315--323).

[goodfellow2013] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). \newblock Maxout networks \newblock arXiv:1302.4389v3 [stat.ML]

[gregor2010] Gregor, K., & LeCun, Y. (2010). \newblock Learning fast approximations of sparse coding. %\newblock In J. F{"u}rnkranz & T. Joachims (Eds.) ICML 2010 (pp. 399--406). \newblock In J. F{"u}rnkranz & T. Joachims (Eds.) Proceedings of the 27th International Conference on Machine Learning (ICML 2010) (pp. 399--406). %

[gregor2011] %Gregor, K., Szlam, A., & LeCun, Y. (2011). %\newblock Structured sparse coding via lateral inhibition. %\newblock In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.) NIPS 24 (pp. 1116--1124). %Cambridge, MA: MIT Press. %\newblock In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.) Advances in Neural Information Processing Systems (NIPS 24) (pp. 1116--1124). %Cambridge, MA: MIT Press.

[hinton2006] Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). \newblock A fast learning algorithm for deep belief nets. \newblock Neural Computation, 18(7), 1527--1554.

[hinton2010] Hinton, G. (2010). \newblock A practical guide to training restricted Boltzmann machines (UTML TR 2010-003, version 1). \newblock Toronto, Canada: University of Toronto, Department of Computer Science.

[hinton2012] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). \newblock Improving neural networks by preventing co-adaptation of feature detectors \newblock arXiv:1207.0580v1 [cs.NE]

[hubel1962] Hubel, D. H., & Wiesel, T. N. (1962). \newblock Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. \newblock The Journal of Physiology, 160(1), 106--154. %

[hyvarinen2000] %Hyv{"a}rinen, A. & Hoyer, P. O. (2000). %\newblock Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. %\newblock Neural Computation, 12(7), 1705--1720. %

[hyvarinen2001] %Hyv{"a}rinen, A. Hoyer, P. O., & Inki, M. (2001). %\newblock Topographic independent component analysis. %\newblock Neural Computation, 13(7), 1527--1558.

[jarrett2009] Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., & LeCun, Y. (2009). \newblock What is the best multi-stage architecture for object recognition? %\newblock In ICCV 2009 (pp. 2146--2153). %IEEE. \newblock In Proceedings of the 12th International Conference on Computer Vision (ICCV 2009) (pp. 2146--2153). %IEEE. %

[jenatton2011] %Jenatton, R., Mairal, J., Obozinski, G., & Bach, F. (2011). %\newblock Proximal methods for hierarchical sparse coding. %\newblock Journal of Machine Learning Research, 12, 2297--2334. % CONSIDER REMOVING! %

[kavukcuoglu2009] %Kavukcuoglu, K., Ranzato, M. A., Fergus, R., & LeCun, Y. (2009). %\newblock Learning invariant features through topographic filter maps. %\newblock In CVPR 2009 (pp. 1605-1612). %\newblock In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009) (pp. 1605-1612). %

[le2011] %Le, Q. V., Karpenko, A., Ngiam, J., & Ng, A. Y. (2011). %\newblock ICA with reconstruction cost for efficient overcomplete feature learning. %\newblock In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.) Advances in Neural Information Processing Systems (NIPS 24) (pp. 1116--1124). %Cambridge, MA: MIT Press.

[lecun1998] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). \newblock Gradient-based learning applied to document recognition. \newblock Proceedings of the IEEE, 86(11), 2278--2324.

[lee2003] Lee, A. B., Pedersen, K. S., & Mumford, D. (2003). \newblock The nonlinear statistics of high-contrast patches in natural images. \newblock International Journal of Computer Vision, 54(1), 83--103. %

[lee1999] %Lee, D. D., & Seung, H. (1999). %\newblock Learning the parts of objects by non-negative matrix factorization. %\newblock Nature, 401(6755), 788--791.

[lee2008] Lee, H., Ekanadham, C., & Ng, A. (2008). \newblock Sparse deep belief net model for visual area V2. \newblock In J. C. Platt, D. Koller, Y. Singer & S. Roweis (Eds.) Advances in Neural Information Processing Systems (NIPS 20), (pp. 873--880). %

[lewicki2002] %Lewicki, M. S. (2002). %\newblock Efficient coding of natural sounds. %\newblock Nature Neuroscience, 5(4), 356--363.

[mairal2009] Mairal, J., Bach, F., Ponce, J., Sapiro, G., & Zisserman, A. (2009). \newblock Supervised dictionary learning. \newblock In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.) Advances in Neural Information Processing Systems (NIPS 21) (pp. 1033--1040).

[mairal2012] Mairal, J., Bach, F., & Ponce, J. (2012). \newblock Task-driven dictionary learning. \newblock IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 791--804. %

[martinez2003] %Martinez, L. M., & Alonso, J. M. (2003). %\newblock Complex receptive fields in primary visual cortex. %\newblock The Neuroscientist, 9(5), 317--331.

[nair2010] Nair, V., & Hinton, G. E. (2010). \newblock Rectified linear units improve restricted boltzmann machines. \newblock In J. F{"u}rnkranz & T. Joachims (Eds.) Proceedings of the 27th International Conference on Machine Learning (ICML 2010) (pp. 807-814). %Madison, WI: Omnipress.

[narayanan2010] Narayanan, H. & MItter, S. (2010). \newblock Sample complexity of testing the manifold hypothesis. %\newblock In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, & A. Culotta (Eds.) NIPS 23 (pp. 1786--1794). \newblock In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, & A. Culotta (Eds.) Advances in Neural Information Processing Systems (NIPS 23) (pp. 1786--1794).

[olshausen1996] Olshausen, B. A., & Field, D. J. (1996). \newblock Emergence of simple-cell receptive field properties by learning a sparse code for natural images. \newblock Nature, 381(6583), 607--609.

[olshausen1997] Olshausen, B. A., & Field, D. J. (1997). \newblock Sparse coding with an overcomplete basis set: A strategy employed by VI? \newblock Vision Research, 37(23), 3311--3326.

[olshausen2004] Olshausen, B. A., & Field, D. J. (2004). \newblock Sparse coding of sensory inputs. \newblock Current opinion in neurobiology, 14(4), 481--487.

[ranzato2006] Ranzato M., Poultney, C., Chopra, S., & LeCun, Y. (2006). \newblock Efficient learning of sparse representations with an energy-based model. \newblock In B. Sch"{o}lkopf, J. Platt, & T. Hoffman (Eds.) Advances in Neural Information Processing Systems (NIPS 19), (pp. 1137--1144).

[ranzato2008] Ranzato, M., & Szummer, M. (2008). \newblock Semi-supervised learning of compact document representations with deep networks \newblock In A. McCallum & S. Roweis (Eds.), Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008) (pp. 792--799). %

[reisberg1997] %Reisberg, D. (1997). %\newblock Cognition: Exploring the science of the mind. %\newblock New York: WW Norton. %

[rifai2011a] %Rifai, S., Vincent, P., Muller, X., Glorot, X., & Bengio, Y. (2011). %\newblock Contractive auto-encoders: Explicit invariance during feature extraction. %\newblock L. Getoor & T. Scheffer (Eds.) Proceedings of the 28th International Conference on Machine Learning (ICML 2011) (pp. 833--840).

[rifai2011b] Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., & Muller, X. (2011). \newblock The manifold tangent classifier. %\newblock In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.) NIPS 24 (pp. 2294--2302). %Cambridge, MA: MIT Press. \newblock In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.) Advances in Neural Information Processing Systems (NIPS 24) (pp. 2294--2302). %Cambridge, MA: MIT Press. %

[rozell2008] %Rozell, C. J., Johnson, D. H., Baraniuk, R. G., & Olshausen, B. A. (2008). %\newblock Sparse coding via thresholding and local competition in neural circuits. %\newblock Neural computation, 20(10), 2526--2563.

[rumelhart1986] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). \newblock Learning internal representations by error propagation. \newblock In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Vol. 1. Foundations (pp. 318--362). Cambridge, MA: MIT Press.

[salinas1996] Salinas, E., & Abbott, L. F. (1996). \newblock A model of multiplicative neural responses in parietal cortex. \newblock Proceedings of the National Academy of Sciences of the United States of America, 93(21), 11956--11961. %

[saxe2011] %Saxe, A., Bhand, M., Murdur, R., Suresh, B., & Ng, A. Y. (2011). %\newblock Unsupervised learning models of primary cortical receptive fields and receptive field plasticity. %\newblock In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.) NIPS 24 (pp. 1971--1979). %Cambridge, MA: MIT Press. %\newblock In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.) Advances in Neural Information Processing Systems (NIPS 24) (pp. 1971--1979). %Cambridge, MA: MIT Press.

[seung1998] Seung, H. S. (1998). \newblock Learning continuous attractors in recurrent networks. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.) Advances in Neural Information Processing Systems (NIPS 10) (pp. 654--660). %Morgan Kaufmann Publishers Inc.

[simard1993] Simard, P., LeCun, Y., & Denker, J. S. (1993). \newblock Efficient pattern recognition using a new transformation distance. %In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.) NIPS 5 (pp. 50--58). %Morgan Kaufmann Publishers Inc. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.) Advances in Neural Information Processing Systems (NIPS 5) (pp. 50--58). %Morgan Kaufmann Publishers Inc.

[simard1998] Simard, P., LeCun, Y., Denker, J., & Victorri, B. (1998). \newblock Transformation invariance in pattern recognition: Tangent distance and tangent propagation. \newblock In G. Orr, & K. Muller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer. %\newblock Neural networks: Tricks of the trade, 549--550. % CONSIDER REMOVING!!! %

[smith2006] %Smith, E. C., & Lewicki, M. S. (2006). %\newblock Efficient auditory coding. %\newblock Nature, 439(7079), 978--982.

[sprechmann2012a] Sprechmann, P., Bronstein, A., & Sapiro, G. (2012). \newblock Learning efficient structured sparse models. %\newblock In J. Langford & J. Pineau (Eds.) ICML 12 (pp. 615--622). \newblock In J. Langford & J. Pineau (Eds.) Proceedings of the 29th International Conference on Machine Learning (ICML 12) (pp. 615--622).

[sprechmann2012b] Sprechmann, P., Bronstein, A., & Sapiro, G. (2012). \newblock Learning efficient sparse and low rank models. \newblock arXiv:1212.3631 [cs.LG] %

[tibshirani1996] %Tibshirani, R. (1996). %\newblock Regression shrinkage and selection via the lasso. %\newblock Journal of the Royal Statistical Society. Series B, 58(1), 267--288. %

[vinje2000] %Vinje, W. E., & Gallant, J. L. (2000). %\newblock Sparse coding and decorrelation in primary visual cortex during natural vision. %\newblock Science, 287(5456), 1273--1276.

[deng2011] Deng, L. & Yu, D. (2011). \newblock Deep convex net: A scalable architecture for speech pattern classification. \newblock In Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011) (pp. 2285-2288). %

[zeiler2010] %Zeiler, M. D., Krishnan, D., Taylor, G. W., & Fergus, R. (2010). %\newblock Deconvolutional networks. %\newblock In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2010) (pp. 2528--2535). Los Alamitos, CA: IEEE Computer Society.

[zeiler2011] Zeiler, M. D., Taylor, G. W., & Fergus, R. (2011). \newblock Adaptive deconvolutional networks for mid and high level feature learning. \newblock In Proceedings of the 13th International Conference on Computer Vision (ICCV 2011) (pp. 2018--2025).

[bib1]

[bib2] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127.

[bib3] Bengio, Y., Boulanger-Lewandowski, N., & Pascanu, R. (2012). Advances in optimizing recurrent networks. arXiv:1212.0901v2 [cs.LG]

[bib4] Bengio, Y., Courville, A., & Vincent, P. (2012). Representation learning: A review and new perspectives. arXiv:1206.5538 [cs.LG]

[bib5] Bengio, Y., & Gingras, F. (1996). Recurrent neural networks for missing or asynchronous data. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.) Advances in Neural Information Processing Systems (NIPS 8) (pp. 395–401).

[bib6] Bradley, D. M., & Bagnell, J. A. (2008). Differentiable sparse coding. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.) Advances in Neural Information Processing Systems (NIPS 21) (pp. 113–120).

[bib7] Boureau, et al. (2010) Boureau, Y., Bach, F., LeCun, L., & Ponce, J. (2010). Learning mid-level features for recognition In Proceedings of the 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010).

[bib8] Chambolle, et al. (1998) Chambolle, A., De Vore, R. A., Lee, N. Y., & Lucier, B. J. (1998). Nonlinear wavelet image processing: Variational problems, compression, and noise removal through wavelet shrinkage. IEEE Transactions on Image Processing, 7(3), 319–335.

[bib9] Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In Proceedings of the 25th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012) (pp. 3642–3649).

[bib10] Coates, A., & Ng, A. Y. (2011). The importance of encoding versus training with sparse coding and vector quantization L. Getoor & T. Scheffer (Eds.) Proceedings of the 28th International Conference on Machine Learning (ICML 2011) (pp. 921–928).

[bib11] Dahl, et al. (2012) Dahl, G. E., Yu, D., Deng, L., & Acero, A. (2012). Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 30–42.

[bib12] Daubechies, I., Defrise, M., & De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11), 1413–1457.

[bib13] Ekanadham, C., Tranchina, D., & Simoncelli, E. P. (2011). Recovery of sparse translation-invariant signals with continuous basis pursuit. IEEE Transactions on Signal Processing, 59(10), 4735–4744.

[bib14] Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In G. Gordon, D. Dunson, & M. Dudik (Eds.) JMLR W&CP: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011) (pp. 315–323).

[bib15] Goodfellow, et al. (2013) Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., & Bengio, Y. (2013). Maxout networks arXiv:1302.4389v3 [stat.ML]

[bib16] Gregor, K., & LeCun, Y. (2010). Learning fast approximations of sparse coding. In J. Fürnkranz & T. Joachims (Eds.) Proceedings of the 27th International Conference on Machine Learning (ICML 2010) (pp. 399–406).

[bib17] Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7), 1527–1554.

[bib18] Hinton, G. (2010). A practical guide to training restricted Boltzmann machines (UTML TR 2010-003, version 1). Toronto, Canada: University of Toronto, Department of Computer Science.

[bib19] Hinton, et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors arXiv:1207.0580v1 [cs.NE]

[bib20] Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology, 160(1), 106–154.

[bib21] Jarrett, et al. (2009) Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., & LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proceedings of the 12th International Conference on Computer Vision (ICCV 2009) (pp. 2146–2153).

[bib22] LeCun, et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

[bib23] Lee, A. B., Pedersen, K. S., & Mumford, D. (2003). The nonlinear statistics of high-contrast patches in natural images. International Journal of Computer Vision, 54(1), 83–103.

[bib24] Lee, H., Ekanadham, C., & Ng, A. (2008). Sparse deep belief net model for visual area V2. In J. C. Platt, D. Koller, Y. Singer & S. Roweis (Eds.) Advances in Neural Information Processing Systems (NIPS 20), (pp. 873–880).

[bib25] Mairal, et al. (2009) Mairal, J., Bach, F., Ponce, J., Sapiro, G., & Zisserman, A. (2009). Supervised dictionary learning. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.) Advances in Neural Information Processing Systems (NIPS 21) (pp. 1033–1040).

[bib26] Mairal, J., Bach, F., & Ponce, J. (2012). Task-driven dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 791–804.

[bib27] Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In J. Fürnkranz & T. Joachims (Eds.) Proceedings of the 27th International Conference on Machine Learning (ICML 2010) (pp. 807-814).

[bib28] Narayanan, H. & MItter, S. (2010). Sample complexity of testing the manifold hypothesis. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, & A. Culotta (Eds.) Advances in Neural Information Processing Systems (NIPS 23) (pp. 1786–1794).

[bib29] Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.

[bib30] Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by VI? Vision Research, 37(23), 3311–3326.

[bib31] Olshausen, B. A., & Field, D. J. (2004). Sparse coding of sensory inputs. Current opinion in neurobiology, 14(4), 481–487.

[bib32] Ranzato, et al. (2006) Ranzato M., Poultney, C., Chopra, S., & LeCun, Y. (2006). Efficient learning of sparse representations with an energy-based model. In B. Schölkopf, J. Platt, & T. Hoffman (Eds.) Advances in Neural Information Processing Systems (NIPS 19), (pp. 1137–1144).

[bib33] Ranzato, M., & Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks In A. McCallum & S. Roweis (Eds.), Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008) (pp. 792–799).

[bib34] Rifai, et al. (2011) Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., & Muller, X. (2011). The manifold tangent classifier. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.) Advances in Neural Information Processing Systems (NIPS 24) (pp. 2294–2302).

[bib35] Rumelhart, et al. (1986) Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Vol. 1. Foundations (pp. 318–362). Cambridge, MA: MIT Press.

[bib36] Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proceedings of the National Academy of Sciences of the United States of America, 93(21), 11956–11961.

[bib37] Seung, H. S. (1998). Learning continuous attractors in recurrent networks. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.) Advances in Neural Information Processing Systems (NIPS 10) (pp. 654–660).

[bib38] Simard, P., LeCun, Y., & Denker, J. S. (1993). Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.) Advances in Neural Information Processing Systems (NIPS 5) (pp. 50–58).

[bib39] Simard, et al. (1998) Simard, P., LeCun, Y., Denker, J., & Victorri, B. (1998). Transformation invariance in pattern recognition: Tangent distance and tangent propagation. In G. Orr, & K. Muller (Eds.), Neural networks: Tricks of the trade. Berlin: Springer.

[bib40] Sprechmann, P., Bronstein, A., & Sapiro, G. (2012). Learning efficient structured sparse models. In J. Langford & J. Pineau (Eds.) Proceedings of the 29th International Conference on Machine Learning (ICML 12) (pp. 615–622).

[bib41] Sprechmann, P., Bronstein, A., & Sapiro, G. (2012). Learning efficient sparse and low rank models. arXiv:1212.3631 [cs.LG]

[bib42] Deng, L. & Yu, D. (2011). Deep convex net: A scalable architecture for speech pattern classification. In Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011) (pp. 2285-2288).

[bib43] Zeiler, M. D., Taylor, G. W., & Fergus, R. (2011). Adaptive deconvolutional networks for mid and high level feature learning. In Proceedings of the 13th International Conference on Computer Vision (ICCV 2011) (pp. 2018–2025).