
Tunable Efficient Unitary Neural Networks (EUNN) and their application to RNNs

Li Jing, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, Marin Soljačić

Abstract

Using unitary (instead of general) matrices in artificial neural networks (ANNs) is a promising way to solve the gradient explosion/vanishing problem, as well as to enable ANNs to learn long-term correlations in the data. This approach appears particularly promising for Recurrent Neural Networks (RNNs). In this work, we present a new architecture for implementing an Efficient Unitary Neural Network (EUNN); its main advantages can be summarized as follows. Firstly, the representation capacity of the unitary space in an EUNN is fully tunable, ranging from a subspace of SU(N) to the entire unitary space. Secondly, the computational complexity for training an EUNN is merely $O(1)$ per parameter. Finally, we test the performance of EUNNs on the standard copying task, the pixel-permuted MNIST digit recognition benchmark as well as the Speech Prediction Test (TIMIT). We find that our architecture significantly outperforms both other state-of-the-art unitary RNNs and the LSTM architecture, in terms of the final performance and/or the wall-clock training speed. EUNNs are thus promising alternatives to RNNs and LSTMs for a wide variety of applications.


Introduction

Deep Neural Networks (LeCun et al., 2015) have been successful on numerous difficult machine learning tasks, including image recognition (Krizhevsky et al., 2012; Donahue et al., 2015), speech recognition (Hinton et al., 2012) and natural language processing (Collobert et al., 2011; Bahdanau et al., 2014; Sutskever et al., 2014). However, deep neural networks can suffer from vanishing and exploding gradient problems (Hochreiter, 1991; Bengio et al., 1994), which are known to be caused by matrix eigenvalues far from unity being raised to large powers. Because the severity of these problems grows with the depth of a neural network, they are particularly grave for Recurrent Neural Networks (RNNs), whose recurrence over time is equivalent to a network with thousands or millions of hidden layers.

  • Equal contribution 1 Massachusetts Institute of Technology 2 New York University, Facebook AI Research. Correspondence to: Li Jing <ljing@mit.edu>, Yichen Shen <ycshen@mit.edu>.

Several solutions have been proposed to solve these problems for RNNs. Long Short-Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997), which use gates to help RNNs retain information inside hidden layers, remain one of the most popular RNN implementations. Other recently proposed methods such as GRUs (Cho et al., 2014) and Bidirectional RNNs (Berglund et al., 2015) also perform well in numerous applications. However, none of these approaches has fundamentally solved the vanishing and exploding gradient problems, and gradient clipping is often required to keep gradients in a reasonable range.

A recently proposed solution strategy is using orthogonal hidden weight matrices or their complex generalization (unitary matrices) (Saxe et al., 2013; Le et al., 2015; Arjovsky et al., 2015; Henaff et al., 2016), because all their eigenvalues will then have absolute values of unity, and can safely be raised to large powers. This has been shown to help both when weight matrices are initialized to be unitary (Saxe et al., 2013; Le et al., 2015) and when they are kept unitary during training, either by restricting them to a more tractable matrix subspace (Arjovsky et al., 2015) or by alternating gradient-descent steps with projections onto the unitary subspace (Wisdom et al., 2016).

In this paper, we will first present an Efficient Unitary Neural Network (EUNN) architecture that parametrizes the entire space of unitary matrices in a complete and computationally efficient way, thereby eliminating the need for time-consuming unitary subspace projections. Our architecture has a wide range of capacity-tunability to represent subspace unitary models by fixing some of our parameters; the above-mentioned unitary subspace models correspond to special cases of our architecture. We also implemented an EUNN with an earlier introduced FFT-like architecture that efficiently approximates the unitary space with a minimal number of required parameters (Mathieu & LeCun, 2014b).

We then benchmark EUNN's performance on both simulated and real tasks: the standard copying task, the pixel-permuted MNIST task, and speech prediction with the TIMIT dataset (Garofolo et al., 1993). We show that our EUNN algorithm with an O(N) hidden layer size can compute up to the entire N × N gradient matrix using O(1) computational steps and memory accesses per parameter. This is superior to the O(N) computational complexity of the existing training method for a full-space unitary network (Wisdom et al., 2016) and O(log N) more efficient than the subspace Unitary RNN (Arjovsky et al., 2015).

Background

Basic Recurrent Neural Networks

A recurrent neural network takes an input sequence and uses the current hidden state to generate a new hidden state during each step, memorizing past information in the hidden layer. We first review the basic RNN architecture.

Consider an RNN updated at regular time intervals t = 1, 2, ..., whose input is the sequence of vectors x(t) and whose hidden layer h(t) is updated according to the following rule:

$$ h^{(t)} = \sigma\left( U x^{(t)} + W h^{(t-1)} \right), $$

where σ is the nonlinear activation function. The output is generated by

$$ y^{(t)} = V h^{(t)} + b, $$

where b is the bias vector for the hidden-to-output layer. For t = 0, the hidden layer h(0) can be initialized to some special vector or set as a trainable variable. For convenience of notation, we define z(t) = Ux(t) + Wh(t−1) so that h(t) = σ(z(t)).
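As a concrete illustration, one update step of this recurrence can be sketched in NumPy; the dimensions and the tanh nonlinearity are our illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h = 4, 8                        # illustrative input and hidden sizes

U = rng.normal(size=(n_h, n_in))        # input-to-hidden matrix
W = rng.normal(size=(n_h, n_h))         # hidden-to-hidden matrix

def rnn_step(x_t, h_prev):
    """One RNN update: h(t) = sigma(U x(t) + W h(t-1))."""
    z_t = U @ x_t + W @ h_prev
    return np.tanh(z_t)                 # sigma = tanh in this sketch

h = np.zeros(n_h)                       # h(0) initialized to zeros
for x_t in rng.normal(size=(5, n_in)):  # a length-5 input sequence
    h = rnn_step(x_t, h)

print(h.shape)  # (8,)
```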

The Vanishing and Exploding Gradient Problems

When training the neural network to minimize a cost function C that depends on a parameter vector a, the gradient descent method updates this vector to a − λ ∂C/∂a, where λ is a fixed learning rate and ∂C/∂a ≡ ∇C. For an RNN, the vanishing or exploding gradient problem is most significant during back propagation from hidden to hidden layers, so we will only focus on the gradient for hidden layers. Training the input-to-hidden and hidden-to-output matrices is relatively trivial once the hidden-to-hidden matrix has been successfully optimized.

In order to evaluate ∂C/∂W_ij, one first computes the derivative ∂C/∂h(t) using the chain rule:

$$ \frac{\partial C}{\partial h^{(t)}} = \frac{\partial C}{\partial h^{(T)}} \prod_{k=t}^{T-1} \frac{\partial h^{(k+1)}}{\partial h^{(k)}} = \frac{\partial C}{\partial h^{(T)}} \prod_{k=t}^{T-1} D^{(k+1)} W, $$

where D(k) = diag{σ′(Ux(k) + Wh(k−1))} is the Jacobian matrix of the pointwise nonlinearity. For large times T, the term ∏ W plays a significant role. As long as the eigenvalues of D(k) are of order unity, then if W has eigenvalues λ_i ≫ 1, they will cause gradient explosion, |∂C/∂h(t)| → ∞, while if W has eigenvalues λ_i ≪ 1, they can cause gradient vanishing, |∂C/∂h(t)| → 0. Either situation prevents the RNN from working efficiently.
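This effect is easy to demonstrate numerically (an illustrative check, not from the paper): scaling a random orthogonal matrix so its eigenvalues move off the unit circle makes a repeatedly multiplied vector explode or vanish, while the orthogonal matrix itself preserves the norm exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 32, 200                          # hidden size and sequence length (illustrative)

# A random orthogonal matrix: all eigenvalues have |lambda| = 1.
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))

def back_prop_norm(W, steps):
    """Norm of a unit vector repeatedly multiplied by W, as in the product over time."""
    v = np.ones(N) / np.sqrt(N)
    for _ in range(steps):
        v = W @ v
    return np.linalg.norm(v)

exploding = back_prop_norm(1.1 * Q, T)  # eigenvalue moduli 1.1: explosion
vanishing = back_prop_norm(0.9 * Q, T)  # eigenvalue moduli 0.9: vanishing
stable    = back_prop_norm(Q, T)        # orthogonal/unitary: norm preserved

print(exploding, vanishing, stable)
```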

Unitary RNNs

Partial Space Unitary RNNs

In a breakthrough paper, Arjovsky, Shah & Bengio (Arjovsky et al., 2015) showed that unitary RNNs can overcome the exploding and vanishing gradient problems and perform well on long-term memory tasks if the hidden-to-hidden matrix is parametrized in the following unitary form:

$$ W = D_3 T_2 \mathcal{F}^{-1} D_2 \Pi T_1 \mathcal{F} D_1. $$

Here D_{1,2,3} are diagonal matrices with each element e^{iω_j}, j = 1, 2, ..., n. T_{1,2} are reflection matrices, with T = I − 2v̂v̂†/||v̂||², where v̂ is a vector with each of its entries as a parameter to be trained. Π is a fixed permutation matrix. 𝓕 and 𝓕⁻¹ are the Fourier and inverse Fourier transform matrices, respectively. Since each factor matrix here is unitary, the product W is also a unitary matrix.
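As a sanity check (ours, not the paper's), this parametrization can be assembled in NumPy and verified to be unitary; the size n = 8 and the random parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8                                            # illustrative hidden size

def diag_phase():
    """Diagonal matrix with entries e^{i omega_j}."""
    return np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, n)))

def reflection():
    """Householder-type reflection T = I - 2 v v^dagger / ||v||^2."""
    v = rng.normal(size=n) + 1j * rng.normal(size=n)
    return np.eye(n) - 2 * np.outer(v, v.conj()) / np.linalg.norm(v) ** 2

k = np.arange(n)
F = np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)   # unitary DFT matrix
P = np.eye(n)[rng.permutation(n)]                           # fixed permutation

# W = D3 T2 F^{-1} D2 Pi T1 F D1
W = diag_phase() @ reflection() @ F.conj().T @ diag_phase() @ P \
    @ reflection() @ F @ diag_phase()

print(np.allclose(W.conj().T @ W, np.eye(n)))  # True: W is unitary
```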

This model uses O ( N ) parameters, which spans merely a part of the whole O ( N 2 ) -dimensional space of unitary N × N matrices to enable computational efficiency. Several subsequent papers have tried to expand the space to O ( N 2 ) in order to achieve better performance, as summarized below.

Full Space Unitary RNNs

In order to maximize the power of Unitary RNNs, it is preferable to have the option to optimize the weight matrix W over the full space of unitary matrices rather than a subspace as above. A straightforward method for implementing this is simply to update W with standard back-propagation and then project the resulting matrix (which will typically no longer be unitary) back onto the space of unitary matrices. Defining G_ij ≡ ∂C/∂W_ij as the gradient with respect to W, this can be implemented by the procedure defined by (Wisdom et al., 2016):

$$ A^{(k)} = G^{(k)\dagger} W^{(k)} - W^{(k)\dagger} G^{(k)}, $$

$$ W^{(k+1)} = \left( I + \frac{\lambda}{2} A^{(k)} \right)^{-1} \left( I - \frac{\lambda}{2} A^{(k)} \right) W^{(k)}. $$

This method shows that full-space unitary networks are superior on many RNN tasks (Wisdom et al., 2016). A key limitation is that the back-propagation in this method cannot avoid N-dimensional matrix multiplication, incurring O(N³) computational cost.
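A NumPy sketch of this update (the size, learning rate, and random stand-in "gradient" are illustrative): because A is skew-Hermitian, the Cayley factor is unitary, so the step keeps W exactly on the unitary manifold.

```python
import numpy as np

rng = np.random.default_rng(2)
N, lam = 6, 0.01                       # illustrative size and learning rate

def cayley_update(W, G, lam):
    """One gradient step that stays on the unitary manifold (Wisdom et al., 2016).

    A = G^dagger W - W^dagger G is skew-Hermitian, so the Cayley transform
    (I + lam/2 A)^{-1} (I - lam/2 A) is unitary."""
    A = G.conj().T @ W - W.conj().T @ G
    I = np.eye(W.shape[0])
    return np.linalg.solve(I + (lam / 2) * A, I - (lam / 2) * A) @ W

# Start from a random unitary matrix (QR of a complex Gaussian matrix).
Q, _ = np.linalg.qr(rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N)))
G = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))   # stand-in gradient

W_new = cayley_update(Q, G, lam)
print(np.allclose(W_new.conj().T @ W_new, np.eye(N)))  # True
```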

Efficient Unitary Neural Network (EUNN) Architectures

In the following, we first describe a general parametrization method able to represent arbitrary unitary matrices with up to N² degrees of freedom. We then present an efficient algorithm for this parametrization scheme, requiring only O(1) computational and memory access steps to obtain the gradient for each parameter. Finally, we show that our scheme performs significantly better than the above-mentioned methods on a few well-known benchmarks.

Unitary Matrix Parametrization

Any N × N unitary matrix W_N can be represented as a product of rotation matrices {R_ij} and a diagonal matrix D, such that

$$ W_N = D \prod_{i=2}^{N} \prod_{j=1}^{i-1} R_{ij}, $$

where R_ij is defined as the N-dimensional identity matrix with the elements R_ii, R_ij, R_ji and R_jj replaced as follows (Reck et al., 1994; Clements et al., 2016):

$$ \begin{pmatrix} R_{ii} & R_{ij} \\ R_{ji} & R_{jj} \end{pmatrix} = \begin{pmatrix} e^{i\phi_{ij}}\cos\theta_{ij} & -e^{i\phi_{ij}}\sin\theta_{ij} \\ \sin\theta_{ij} & \cos\theta_{ij} \end{pmatrix}, $$

where θ_ij and φ_ij are unique parameters corresponding to R_ij. Each of these matrices performs a U(2) unitary transformation on a two-dimensional subspace of the N-dimensional Hilbert space, leaving an (N−2)-dimensional subspace unchanged. In other words, a series of U(2) rotations can be used to successively make all off-diagonal elements of a given N × N unitary matrix zero. This generalizes the familiar factorization of a 3D rotation matrix into 2D rotations parametrized by the three Euler angles. To provide intuition for how this works, let us briefly describe a simple way of doing this that is similar to Gaussian elimination, finishing one column at a time. There are infinitely many alternative decomposition schemes as well; Fig. 1 shows two that are particularly convenient to implement in software (and even in neuromorphic hardware (Shen et al., 2016)). The unitary matrix W_N is multiplied from the right by a succession of unitary matrices R_Nj for j = N−1, ..., 1. Once all elements of the last row except the one on the diagonal are zero, this row will not be affected by later transformations. Since all transformations are unitary, the last column will then also contain only zeros except on the diagonal:

$$ W_N R_{N,N-1} R_{N,N-2} \cdots R_{N,1} = \begin{pmatrix} W_{N-1} & 0 \\ 0 & e^{iw_N} \end{pmatrix}. $$
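A minimal NumPy sketch of one such rotation (the size and angles are illustrative): the 2×2 block matches the definition above, and the embedded matrix is verifiably unitary.

```python
import numpy as np

def rotation(N, i, j, theta, phi):
    """N-dimensional identity with a U(2) rotation block on coordinates (i, j).

    The block is [[e^{i phi} cos t, -e^{i phi} sin t], [sin t, cos t]],
    as in the decomposition of Reck et al. / Clements et al."""
    R = np.eye(N, dtype=complex)
    R[i, i] = np.exp(1j * phi) * np.cos(theta)
    R[i, j] = -np.exp(1j * phi) * np.sin(theta)
    R[j, i] = np.sin(theta)
    R[j, j] = np.cos(theta)
    return R

R = rotation(5, 1, 3, 0.7, 1.2)
print(np.allclose(R.conj().T @ R, np.eye(5)))  # True: each R_ij is unitary
```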

Figure 1. Unitary matrix decomposition: An arbitrary unitary matrix W can be decomposed (a) with the square decomposition method of Clements et al. (Clements et al., 2016) discussed in section 4.2; or approximated (b) by the Fast Fourier Transform (FFT)-style decomposition method (Mathieu & LeCun, 2014b) discussed in section 4.3. Each junction in graphs a) and b) above represents the U(2) matrix shown in c).

The effective dimensionality of the matrix W is thus reduced to N−1. The same procedure can then be repeated N−1 times until the effective dimension of W is reduced to 1, leaving us with a diagonal matrix:¹

$$ W_N R_{N,N-1} R_{N,N-2} \cdots R_{i,j} R_{i,j-1} \cdots R_{3,1} R_{2,1} = D, $$

where D is a diagonal matrix whose diagonal elements are e^{iw_j}, from which we can write the direct representation of W_N as

$$ W_N = D R'_{2,1} R'_{3,1} \cdots R'_{i,j-1} R'_{i,j} \cdots R'_{N,N-2} R'_{N,N-1}, $$

where

$$ R'_{ij} = R(-\theta_{ij}, -\phi_{ij}) = R(\theta_{ij}, \phi_{ij})^{-1} = R_{ij}^{-1}. $$

1 Note that Gaussian elimination would make merely the upper triangle of a matrix vanish, requiring a subsequent series of rotations (complete Gauss-Jordan elimination) to zero the lower triangle. We need no such subsequent series because W is unitary: it is easy to show that if a unitary matrix is triangular, it must be diagonal.

This parametrization thus involves N(N−1)/2 different θ_ij-values, N(N−1)/2 different φ_ij-values and N different w_i-values, combining to N² parameters in total, and spans the entire unitary space. Note that we can always fix a portion of our parameters to span only a subset of the unitary space; indeed, our benchmark tests below will show that for certain tasks, full unitary-space parametrization is not necessary.²

Tunable space implementation

The representation in Eq. 12 can be made more compact by reordering and grouping specific rotational matrices, as was shown in the optical community (Reck et al., 1994; Clements et al., 2016) in the context of universal multiport interferometers. For example (Clements et al., 2016), a unitary matrix can be decomposed as

$$ W_N = D F_A^{(1)} F_B^{(2)} F_A^{(3)} F_B^{(4)} \cdots F_B^{(L)}, $$

where every F_A^{(l)} is a block-diagonal matrix built from U(2) rotation blocks acting on the adjacent coordinate pairs (1, 2), (3, 4), ..., with N angle parameters in total, and every F_B^{(l)} acts analogously on the shifted pairs (2, 3), (4, 5), ..., with N−1 parameters, as is schematically shown in Fig. 1a. By choosing different values for L, W_N will span a different subspace of the unitary space. Specifically, when L = N, W_N will span the entire unitary space.

Following this physics-inspired scheme, we decompose our unitary hidden-to-hidden layer matrix W as

$$ W = D F_A^{(1)} F_B^{(2)} F_A^{(3)} F_B^{(4)} \cdots F_B^{(L)}. $$

FFT-style approximation

Inspired by (Mathieu & LeCun, 2014a), an alternative way to organize the rotation matrices is to implement an FFT-style architecture. Instead of using adjacent rotation matrices, each F_i here performs pairwise rotations between coordinates a fixed distance apart, as shown in Fig. 1b. The rotation matrices in F_i are performed between the pairs of coordinates

$$ (2pk + j,\ p(2k+1) + j), $$

where p = N/2^i, k ∈ {0, ..., 2^{i−1} − 1} and j ∈ {1, ..., p}. This requires only log(N) matrices, so there are a total of N log(N)/2 rotational pairs. This is also the minimal number of rotations that allows all input coordinates to interact with each other, providing an approximation of arbitrary unitary matrices.

2 Our preliminary experimental tests even suggest that a full-capacity unitary RNN is undesirable for some tasks.
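The index formula can be checked numerically (N = 8 is an illustrative choice, with 1-based coordinates j as in the text): each F_i pairs up all N coordinates exactly once, for N log₂(N)/2 pairs in total.

```python
import numpy as np

N = 8  # illustrative size (a power of two)

def fft_style_pairs(N, i):
    """Coordinate pairs (2pk + j, p(2k+1) + j) rotated by F_i, with p = N / 2^i."""
    p = N // 2 ** i
    pairs = []
    for k in range(2 ** (i - 1)):
        for j in range(1, p + 1):
            pairs.append((2 * p * k + j, p * (2 * k + 1) + j))
    return pairs

all_pairs = [fft_style_pairs(N, i) for i in range(1, int(np.log2(N)) + 1)]

# Each F_i touches every coordinate exactly once ...
for pairs in all_pairs:
    touched = sorted(c for pair in pairs for c in pair)
    assert touched == list(range(1, N + 1))

# ... giving N log2(N) / 2 rotational pairs in total.
print(sum(len(p) for p in all_pairs))  # 12  (= 8 * 3 / 2)
```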

Efficient implementation of rotation matrices

To implement this decomposition efficiently in an RNN, we apply vector element-wise multiplications and permutations: we evaluate the product Fx as

$$ Fx = v_1 * x + v_2 * \mathrm{permute}(x), $$

where ∗ represents element-wise multiplication and F refers to general rotational matrices such as F_{A/B} in Eq. 14 and F_i in Eq. 16. For the case of the tunable-space implementation, if we want to implement F_A^(l) in Eq. 14, the permutation swaps each adjacent pair of coordinates, permute(x) = (x_2, x_1, x_4, x_3, ...), and v_1 and v_2 hold the cosine factors and the (phase-multiplied) sine factors of the rotation angles, interleaved so that each entry multiplies the correct coordinate.

In general, the pseudocode for implementing operation F is as follows:

Algorithm 1 Efficient implementation for F with parameters θ_i and φ_i.

Input: input x, size N; parameters θ and φ, size N/2; constant permutation index lists ind_1 and ind_2.

Output: output y, size N.

v_1 ← concatenate(cos θ, cos θ ∗ exp(iφ))

v_2 ← concatenate(sin θ, −sin θ ∗ exp(iφ))

v_1 ← permute(v_1, ind_1)

v_2 ← permute(v_2, ind_1)

y ← v_1 ∗ x + v_2 ∗ permute(x, ind_2)

Note that ind 1 and ind 2 are different for different F .
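Algorithm 1 can be sketched in NumPy for the adjacent-pair rotations F_A. The concrete ind_1/ind_2 layouts below are one consistent choice we derived (the released code may order them differently); the O(N) result is checked against the dense block-diagonal rotation matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 8                                          # illustrative size (even)
theta = rng.uniform(0, 2 * np.pi, N // 2)
phi = rng.uniform(0, 2 * np.pi, N // 2)

# Index lists for the adjacent-pair rotation F_A (assumed layout, see lead-in).
half = np.arange(N // 2)
ind1 = np.empty(N, dtype=int)
ind1[0::2] = N // 2 + half                     # interleave the two halves of v1, v2
ind1[1::2] = half
ind2 = np.arange(N).reshape(-1, 2)[:, ::-1].ravel()   # swap neighbors: 1,0,3,2,...

# Algorithm 1: O(N) evaluation of y = F x.
v1 = np.concatenate([np.cos(theta), np.cos(theta) * np.exp(1j * phi)])[ind1]
v2 = np.concatenate([np.sin(theta), -np.sin(theta) * np.exp(1j * phi)])[ind1]
x = rng.normal(size=N) + 1j * rng.normal(size=N)
y = v1 * x + v2 * x[ind2]

# Cross-check against the dense block-diagonal rotation matrix.
F = np.zeros((N, N), dtype=complex)
for k in range(N // 2):
    c, s, e = np.cos(theta[k]), np.sin(theta[k]), np.exp(1j * phi[k])
    F[2 * k:2 * k + 2, 2 * k:2 * k + 2] = [[e * c, -e * s], [s, c]]

print(np.allclose(y, F @ x))  # True
```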

From a computational complexity viewpoint, since the operations ∗ and permute take O(N) computational steps, evaluating Fx requires only O(N) steps. The product Dx is trivial, consisting of an element-wise vector multiplication. Therefore, the product Wx with the total unitary matrix W can be computed in only O(NL) steps, and requires only O(NL) memory accesses (the full-space implementation has L = N, while the FFT-style approximation gives L = log N). A detailed comparison of the computational complexity of the existing unitary RNN architectures is given in Table 1.

Nonlinearity

We use the same nonlinearity as (Arjovsky et al., 2015):

$$ (\mathrm{modReLU}(z, b))_i = \frac{z_i}{|z_i|} * \mathrm{ReLU}(|z_i| + b_i), $$

where the bias vector b is a shared trainable parameter, and |z_i| is the norm of the complex number z_i.

For real-number input, modReLU can be simplified to:

$$ (\mathrm{modReLU}(z, b))_i = \mathrm{sign}(z_i) * \mathrm{ReLU}(|z_i| + b_i). $$

We empirically find that this nonlinearity function performs best. We believe that this function possibly also serves as a forgetting filter that removes noise using the bias threshold.
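A NumPy sketch of modReLU (the eps guard for z_i = 0 is our addition, not the paper's): the bias rescales each entry's modulus through a ReLU while the complex phase is preserved.

```python
import numpy as np

def mod_relu(z, b, eps=1e-8):
    """modReLU: rescale each complex entry's modulus by ReLU(|z_i| + b_i),
    keeping its phase z_i / |z_i| unchanged (eps avoids division by zero)."""
    mag = np.abs(z)
    return (z / (mag + eps)) * np.maximum(mag + b, 0.0)

z = np.array([3 + 4j, 0.1 + 0.0j])     # |z| = 5 and 0.1
b = np.array([-1.0, -1.0])             # shared trainable bias (here fixed)

out = mod_relu(z, b)
# First entry: modulus 5 shrinks to 4, phase preserved; second entry is zeroed.
print(out)
```

A negative bias thus acts as the "forgetting filter" mentioned above: entries whose modulus falls below the threshold are set to zero.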

Experimental tests of our method

In this section, we compare the performance of our Efficient Unitary Recurrent Neural Network (EURNN) with

  1. an LSTM RNN (Hochreiter & Schmidhuber, 1997),
  2. a Partial Space URNN (Arjovsky et al., 2015), and
  3. a Projective full-space URNN (Wisdom et al., 2016).

All models are implemented in both TensorFlow and Theano, available from https://github.com/jingli9111/EUNN-tensorflow and https://github.com/iguanaus/EUNN-theano.

Copying Memory Task

We compare these networks by applying them all to the well-defined Copying Memory Task (Hochreiter & Schmidhuber, 1997; Arjovsky et al., 2015; Henaff et al., 2016). The copying task is a synthetic task that is commonly used to test a network's ability to remember information seen T time steps earlier.

Specifically, the task is defined as follows (Hochreiter & Schmidhuber, 1997; Arjovsky et al., 2015; Henaff et al., 2016). An alphabet consists of symbols {a_i}, the first n of which represent data, and the remaining two represent 'blank' and 'start recall', respectively, as illustrated by the following example where T = 20 and M = 5:

Input:

BACCA-------------------:-----

Output:

-------------------------BACCA

In the above example, n = 3 and {a_i} = {A, B, C, -, :}. The input consists of M random data symbols (M = 5 above) followed by T − 1 blanks, the 'start recall' symbol and M more blanks. The desired output consists of M + T blanks followed by the data sequence. The cost function C is defined as the cross entropy between the network's output sequence and this desired output, which vanishes for perfect performance.
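A minimal generator for this task (using integer symbol ids rather than one-hot vectors; an illustrative sketch, not the paper's exact pipeline):

```python
import numpy as np

def copying_task_batch(batch, M, T, n, rng):
    """Inputs/targets for the copying task, as integer symbol ids.

    Ids 0..n-1 are data symbols, n is 'blank', n+1 is 'start recall'."""
    blank, recall = n, n + 1
    length = M + T + M                       # total sequence length
    data = rng.integers(0, n, size=(batch, M))

    x = np.full((batch, length), blank)
    x[:, :M] = data                          # data, then T-1 blanks
    x[:, M + T - 1] = recall                 # 'start recall' marker, then M blanks

    y = np.full((batch, length), blank)
    y[:, M + T:] = data                      # recall the data after M + T blanks
    return x, y

rng = np.random.default_rng(0)
x, y = copying_task_batch(batch=2, M=5, T=20, n=3, rng=rng)
print(x.shape, y.shape)  # (2, 30) (2, 30)
```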

We use n = 8 and data length M = 10. The symbol for each input is represented by an n-dimensional one-hot vector. We trained all five RNNs for T = 1000 with the same batch size of 128, using RMSProp optimization with a learning rate of 0.001. The decay rate is set to 0.5 for EURNN and 0.9 for all other models. The results (Fig. 2) show that the EURNN architectures introduced in both Sec. 4.2 (EURNN with N = 512, selecting L = 2) and Sec. 4.3 (FFT-style EURNN with N = 512) outperform the LSTM model (which suffers from long-term memory problems and only performs well on the copying task for small time delays T) and all other unitary RNN models, both in terms of learnability and in terms of convergence rate. Note that the only other unitary RNN model able to beat the baseline for T = 1000 (Wisdom et al., 2016) is significantly slower than our method.

Moreover, we find that by either choosing a smaller L or by using the FFT-style method (so that W spans a smaller unitary subspace), the EURNN converges toward optimal performance significantly more efficiently (and also faster in wall-clock time) than the partial (Arjovsky et al., 2015) and projective (Wisdom et al., 2016) unitary methods. The EURNN also performed more robustly. This means that a full-capacity unitary matrix is not necessary for this particular task.

Pixel-Permuted MNIST Task

The MNIST handwriting recognition problem is one of the classic benchmarks for quantifying the learning ability of neural networks. Each MNIST sample is a 28 × 28 grayscale image with a target label between 0 and 9.

To test the different RNN models, we feed all pixels of each MNIST image into the RNN in 28 × 28 = 784 time steps, one pixel at a time as a floating-point number. A fixed random permutation is applied to the order of the input pixels. The output is a probability distribution over the ten digit classes. We used RMSProp with a learning rate of 0.0001 and a decay rate of 0.9, and set the batch size to 128.
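The pixel-feeding scheme can be sketched as follows (a random array stands in for an MNIST image; the single fixed permutation is shared across the whole dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
perm = rng.permutation(28 * 28)        # one fixed permutation, shared by all images

def to_sequence(image, perm):
    """Flatten a 28x28 image to 784 time steps and apply the fixed pixel permutation."""
    return image.reshape(-1)[perm][:, None]   # shape (784, 1): one pixel per step

image = rng.random((28, 28))           # stand-in for an MNIST image
seq = to_sequence(image, perm)
print(seq.shape)  # (784, 1)
```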

As shown in Fig. 3, EURNN significantly outperforms LSTM with the same number of parameters. It learns faster, in fewer iteration steps, and converges to a higher classification accuracy. In addition, the EURNN reaches a similar accuracy with fewer parameters. In Table 2, we compare the performance of the different RNN models on this task.

Table 1. Performance comparison of four Recurrent Neural Network algorithms: URNN (Arjovsky et al., 2015), PURNN (Wisdom et al., 2016), and the tunable-style and FFT-style EURNN (our algorithms). T denotes the RNN length and N denotes the hidden state size. For the tunable-style EURNN, L is an integer between 1 and N parametrizing the unitary matrix capacity.

Table 2. MNIST task results. EURNN corresponds to our algorithm, PURNN corresponds to the algorithm presented in (Wisdom et al., 2016), and URNN corresponds to the algorithm presented in (Arjovsky et al., 2015).

Figure 3. Pixel-permuted MNIST performance on the validation dataset.

Speech Prediction on TIMIT dataset

We also apply our EURNN to a real-world speech prediction task and compare its performance to an LSTM. The main task we consider is predicting the log-magnitude of future frames of a short-time Fourier transform (STFT) (Wisdom et al., 2016; Sejdi et al., 2009). We use the TIMIT dataset (Garofolo et al., 1993) sampled at 8 kHz. The audio .wav file is initially diced into different time frames (all frames have the same duration, set by the Hann analysis window described below). The audio amplitude in each frame is then Fourier transformed into the frequency domain. The log-magnitude of the Fourier amplitude is normalized and used as the data for training/testing each model. In our STFT operation we use a Hann analysis window of 256 samples (32 milliseconds) and a window hop of 128 samples (16 milliseconds). The frame prediction task is as follows: given all the log-magnitudes of STFT frames up to time t, predict the log-magnitude of the STFT frame at time t + 1 with minimum mean square error (MSE). We use a training set with 2400 utterances, a validation set of 600 utterances and an evaluation set of 1000 utterances; the three sets have distinct speakers. We trained all RNNs with the same batch size of 32, using RMSProp optimization with a learning rate of 0.001, a momentum of 0.9 and a decay rate of 0.1.

Table 3. Speech prediction task results. EURNN corresponds to our algorithm, projective URNN (PURNN) corresponds to the algorithm presented in (Wisdom et al., 2016), and URNN corresponds to the algorithm presented in (Arjovsky et al., 2015).

Figure 4. Example spectrograms of ground truth and RNN prediction results from evaluation sets.
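The framing and log-magnitude computation described above can be sketched with NumPy (synthetic noise stands in for the TIMIT recordings; normalization is omitted):

```python
import numpy as np

fs = 8000                                  # 8 kHz sampling rate
win, hop = 256, 128                        # Hann window of 32 ms, hop of 16 ms

rng = np.random.default_rng(0)
audio = rng.normal(size=fs * 2)            # 2 s of synthetic audio as a stand-in

window = np.hanning(win)                   # Hann analysis window
n_frames = 1 + (len(audio) - win) // hop

# Log-magnitude STFT: one spectrum of win//2 + 1 bins per frame.
frames = np.stack([audio[i * hop:i * hop + win] * window for i in range(n_frames)])
log_mag = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

# Frame-prediction pairs: the input runs up to frame t, the target is frame t+1.
inputs, targets = log_mag[:-1], log_mag[1:]
print(log_mag.shape)  # (124, 129)
```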

The results are given in Table 3, in terms of the mean-squared error (MSE) loss function. Figure 4 shows prediction examples from the three types of networks, illustrating how EURNNs generally perform better than LSTMs. Furthermore, in this particular task, full-capacity EURNNs outperform small-capacity EURNNs and FFT-style EURNNs.

Conclusion

We have presented a method for implementing an Efficient Unitary Neural Network (EUNN) whose computational cost is merely O(1) per parameter, which is O(log N) more efficient than the other methods discussed above. It significantly outperforms existing RNN architectures on the standard copying task and the pixel-permuted MNIST task using a comparable parameter count, hence demonstrating the highest recorded ability to memorize sequential information over long time periods.

It also performs well on real tasks such as speech prediction, outperforming an LSTM on TIMIT data speech prediction.

We want to emphasize the generality and tunability of our method. The orderings of the rotation matrices we presented in Fig. 1 are merely two of many possibilities; we used them simply as concrete examples. Other orderings that span the full unitary matrix space can be used for our algorithm as well, with identical speed and memory performance. This tunability of the span of the unitary space and, correspondingly, of the total number of parameters makes it possible to use different capacities for different tasks, thus opening the way to optimal performance of the EUNN. For example, as we have shown, a small subspace of the full unitary space is preferable for the copying task, whereas the MNIST and TIMIT tasks are better performed by an EUNN covering a considerably larger unitary space. Finally, we note that our method remains applicable even if the unitary matrix is decomposed into a different product of matrices (Eq. 12).

This powerful and robust unitary RNN architecture also might be promising for natural language processing because of its ability to efficiently handle tasks with long-term correlation and very high dimensionality.

Acknowledgment

We thank Hugo Larochelle and Yoshua Bengio for helpful discussions and comments.

This work was partially supported by the Army Research Office through the Institute for Soldier Nanotechnologies under contract W911NF-13-D0001, the National Science Foundation under Grant No. CCF-1640012 and the Rothberg Family Fund for Cognitive Science.



References

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 , 2014.

Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks , 5(2): 157-166, 1994.

Berglund, Mathias, Raiko, Tapani, Honkala, Mikko, Kärkkäinen, Leo, Vetek, Akos, and Karhunen, Juha T. Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pp. 856-864, 2015.

Clements, William R., Humphreys, Peter C., Metcalf, Benjamin J., Kolthammer, W. Steven, and Walmsley, Ian A. An optimal design for universal multiport interferometers, 2016. arXiv:1603.08788.

Collobert, Ronan, Weston, Jason, Bottou, L´ eon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. Journal of Machine Learning Research , 12(Aug):2493-2537, 2011.

Donahue, Jeffrey, Anne Hendricks, Lisa, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 26252634, 2015.

Garofolo, John S, Lamel, Lori F, Fisher, William M, Fiscus, Jonathon G, and Pallett, David S. Darpa timit acousticphonetic continous speech corpus cd-rom. nist speech disc 1-1.1. NASA STI/Recon technical report n , 93, 1993.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath,


Using unitary (instead of general) matrices in artificial neural networks (ANNs) is a promising way to solve the gradient explosion/vanishing problem, as well as to enable ANNs to learn long-term correlations in the data. This approach appears particularly promising for Recurrent Neural Networks (RNNs). In this work, we present a new architecture for implementing an Efficient Unitary Neural Network (EUNNs); its main advantages can be summarized as follows. Firstly, the representation capacity of the unitary space in an EUNN is fully tunable, ranging from a subspace of SU(N) to the entire unitary space. Secondly, the computational complexity for training an EUNN is merely $\mathcal{O}(1)$ per parameter. Finally, we test the performance of EUNNs on the standard copying task, the pixel-permuted MNIST digit recognition benchmark as well as the Speech Prediction Test (TIMIT). We find that our architecture significantly outperforms both other state-of-the-art unitary RNNs and the LSTM architecture, in terms of the final performance and/or the wall-clock training speed. EUNNs are thus promising alternatives to RNNs and LSTMs for a wide variety of applications.

Deep Neural Networks (LeCun et al., 2015) have been successful on numerous difficult machine learning tasks, including image recognition (Krizhevsky et al., 2012; Donahue et al., 2015), speech recognition (Hinton et al., 2012) and natural language processing (Collobert et al., 2011; Bahdanau et al., 2014; Sutskever et al., 2014). However, deep neural networks can suffer from vanishing and exploding gradient problems (Hochreiter, 1991; Bengio et al., 1994), which are known to be caused by matrix eigenvalues far from unity being raised to large powers. Because the severity of these problems grows with the depth of a neural network, they are particularly grave for Recurrent Neural Networks (RNNs), whose recurrence can be equivalent to thousands or millions of hidden layers.

Several solutions have been proposed to solve these problems for RNNs. Long Short Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997), which help RNNs retain information inside hidden layers using gates, remain one of the most popular RNN implementations. Other recently proposed methods such as GRUs (Cho et al., 2014) and Bidirectional RNNs (Berglund et al., 2015) also perform well in numerous applications. However, none of these approaches has fundamentally solved the vanishing and exploding gradient problems, and gradient clipping is often required to keep gradients in a reasonable range.

A recently proposed solution strategy is using orthogonal hidden weight matrices or their complex generalization (unitary matrices) (Saxe et al., 2013; Le et al., 2015; Arjovsky et al., 2015; Henaff et al., 2016), because all their eigenvalues will then have absolute values of unity, and can safely be raised to large powers. This has been shown to help both when weight matrices are initialized to be unitary (Saxe et al., 2013; Le et al., 2015) and when they are kept unitary during training, either by restricting them to a more tractable matrix subspace (Arjovsky et al., 2015) or by alternating gradient-descent steps with projections onto the unitary subspace (Wisdom et al., 2016).

In this paper, we will first present an Efficient Unitary Neural Network (EUNN) architecture that parametrizes the entire space of unitary matrices in a complete and computationally efficient way, thereby eliminating the need for time-consuming unitary subspace projections. Our architecture has a wide range of capacity tunability to represent subspace unitary models by fixing some of our parameters; the above-mentioned unitary subspace models correspond to special cases of our architecture. We also implement an EUNN with a previously introduced FFT-like architecture that efficiently approximates the unitary space with the minimum number of required parameters (Mathieu & LeCun, 2014b).

We then benchmark EUNN’s performance on both simulated and real tasks: the standard copying task, the pixel-permuted MNIST task, and speech prediction with the TIMIT dataset (Garofolo et al., 1993). We show that our EUNN algorithm with an $\mathcal{O}(N)$ hidden layer size can compute up to the entire $N\times N$ gradient matrix using $\mathcal{O}(1)$ computational steps and memory accesses per parameter. This is superior to the $\mathcal{O}(N)$ computational complexity of the existing training method for a full-space unitary network (Wisdom et al., 2016) and $\mathcal{O}(\log N)$ more efficient than the subspace Unitary RNN (Arjovsky et al., 2015).

A recurrent neural network takes an input sequence and uses the current hidden state to generate a new hidden state during each step, memorizing past information in the hidden layer. We first review the basic RNN architecture.

Consider an RNN updated at regular time intervals $t=1,2,\ldots$ whose input is the sequence of vectors ${\bf x}^{(t)}$ and whose hidden layer ${\bf h}^{(t)}$ is updated according to the following rule:

where $\sigma$ is the nonlinear activation function. The output is generated by

where ${\bf b}$ is the bias vector for the hidden-to-output layer. For $t=0$, the hidden layer ${\bf h}^{(0)}$ can be initialized to some special vector or set as a trainable variable. For convenience of notation, we define ${\bf z}^{(t)}={\bf U}{\bf x}^{(t)}+{\bf W}{\bf h}^{(t-1)}$ so that ${\bf h}^{(t)}=\sigma({\bf z}^{(t)})$.
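In code, a single update step of this vanilla RNN looks like the following sketch (ours, not from the paper; we write $V$ for the hidden-to-output matrix to distinguish it from the recurrent ${\bf W}$):

```python
import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b, sigma=np.tanh):
    """One recurrent update: z^(t) = U x^(t) + W h^(t-1),
    h^(t) = sigma(z^(t)), and output y^(t) = V h^(t) + b."""
    z_t = U @ x_t + W @ h_prev
    h_t = sigma(z_t)
    y_t = V @ h_t + b
    return h_t, y_t
```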

When training the neural network to minimize a cost function $C$ that depends on a parameter vector ${\bf a}$, the gradient descent method updates this vector to ${\bf a}-\lambda\frac{\partial C}{\partial{\bf a}}$, where $\lambda$ is a fixed learning rate and $\frac{\partial C}{\partial{\bf a}}\equiv\nabla C$. For an RNN, the vanishing or exploding gradient problem is most significant during back-propagation from hidden layer to hidden layer, so we will focus only on the gradient for hidden layers. Training the input-to-hidden and hidden-to-output matrices is relatively trivial once the hidden-to-hidden matrix has been successfully optimized.

In order to evaluate $\frac{\partial C}{\partial W_{ij}}$, one first computes the derivative $\frac{\partial C}{\partial {\bf h}^{(t)}}$ using the chain rule:

where ${\bf D}^{(k)}={\rm diag}\{\sigma^{\prime}({\bf U}{\bf x}^{(k)}+{\bf W}{\bf h}^{(k-1)})\}$ is the Jacobian matrix of the pointwise nonlinearity. For large times $T$, the term $\prod{\bf W}$ plays a significant role. As long as the eigenvalues of ${\bf D}^{(k)}$ are of order unity, then if ${\bf W}$ has eigenvalues $\lambda_i \gg 1$, they will cause gradient explosion $|\frac{\partial C}{\partial{\bf h}^{(T)}}|\rightarrow\infty$, while if ${\bf W}$ has eigenvalues $\lambda_i \ll 1$, they can cause gradient vanishing, $|\frac{\partial C}{\partial{\bf h}^{(T)}}|\rightarrow 0$. Either situation prevents the RNN from working efficiently.
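The following toy computation (ours, not from the paper) makes the effect concrete: scaling an orthogonal matrix so that all its eigenvalue moduli equal $\lambda$ and applying it $T$ times scales a back-propagated vector's norm by exactly $\lambda^T$, while a unitary/orthogonal matrix ($\lambda=1$) preserves it:

```python
import numpy as np

def propagated_norm(eig_modulus, steps=200, n=4, seed=0):
    """Ratio ||W^steps v|| / ||v|| for a matrix W whose eigenvalues all have
    modulus `eig_modulus` (a scaled orthogonal matrix), mimicking the product
    over W in the back-propagated gradient."""
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.normal(size=(n, n)))[0]  # orthogonal: all moduli 1
    W = eig_modulus * Q
    v = rng.normal(size=n)
    g = v.copy()
    for _ in range(steps):
        g = W @ g
    return np.linalg.norm(g) / np.linalg.norm(v)
```

With 200 steps, eigenvalue moduli of 1.05 and 0.95 already change the norm by factors of roughly $10^{4}$ and $10^{-4}$, illustrating why eigenvalues must stay very close to unity.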

In a breakthrough paper, Arjovsky, Shah & Bengio (Arjovsky et al., 2015) showed that unitary RNNs can overcome the exploding and vanishing gradient problems and perform well on long-term memory tasks if the hidden-to-hidden matrix is parametrized in the following unitary form:

Here ${\bf D}_{1,2,3}$ are diagonal matrices whose elements are $e^{i\omega_j}, j=1,2,\cdots,n$. ${\bf T}_{1,2}$ are reflection matrices, with ${\bf T}=I-2\frac{\hat{\bf v}\hat{\bf v}^{\dagger}}{\|\hat{\bf v}\|^{2}}$, where $\hat{\bf v}$ is a vector whose entries are trainable parameters. $\bm{\Pi}$ is a fixed permutation matrix. $\mathcal{F}$ and $\mathcal{F}^{-1}$ are the Fourier and inverse Fourier transform matrices, respectively. Since each factor matrix is unitary, the product ${\bf W}$ is also a unitary matrix.

This model uses $\mathcal{O}(N)$ parameters, which span merely a part of the whole $\mathcal{O}(N^2)$-dimensional space of unitary $N\times N$ matrices, in exchange for computational efficiency. Several subsequent papers have tried to expand the space to $\mathcal{O}(N^2)$ in order to achieve better performance, as summarized below.

In order to maximize the power of unitary RNNs, it is preferable to have the option to optimize the weight matrix ${\bf W}$ over the full space of unitary matrices rather than a subspace as above. A straightforward way to implement this is to simply update ${\bf W}$ with standard back-propagation and then project the resulting matrix (which will typically no longer be unitary) back onto the space of unitary matrices. Defining ${\bf G}_{ij}\equiv\frac{\partial C}{\partial W_{ij}}$ as the gradient with respect to ${\bf W}$, this can be implemented by the procedure defined by (Wisdom et al., 2016):

This method shows that full-space unitary networks are superior on many RNN tasks (Wisdom et al., 2016). A key limitation is that back-propagation in this method cannot avoid $N$-dimensional matrix multiplication, incurring $\mathcal{O}(N^3)$ computational cost.
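For reference, here is a minimal sketch of such a unitarity-preserving update. This is our reading of the multiplicative Cayley-transform step used by full-capacity unitary RNNs in the style of Wisdom et al. (2016); the exact constants and conventions in the cited paper may differ:

```python
import numpy as np

def projective_unitary_update(W, G, lr=0.01):
    """One gradient step that keeps W exactly unitary.  G is dC/dW.

    A is skew-Hermitian (A^dagger = -A), so its Cayley transform
    (I + lr/2 A)^{-1} (I - lr/2 A) is unitary and the updated W stays on the
    unitary manifold.  Note the O(N^3) cost of the matrix solve."""
    A = G.conj().T @ W - W.conj().T @ G
    I = np.eye(W.shape[0], dtype=complex)
    return np.linalg.solve(I + (lr / 2) * A, (I - (lr / 2) * A) @ W)
```

The $\mathcal{O}(N^3)$ solve in the last line is exactly the cost that the EUNN parametrization below avoids.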

In the following, we first describe a general parametrization method able to represent arbitrary unitary matrices with up to $N^2$ degrees of freedom. We then present an efficient algorithm for this parametrization scheme, requiring only $\mathcal{O}(1)$ computational and memory access steps to obtain the gradient for each parameter. Finally, we show that our scheme performs significantly better than the above-mentioned methods on several well-known benchmarks.

Any $N\times N$ unitary matrix ${\bf W}_N$ can be represented as a product of rotation matrices $\{{\bf R}_{ij}\}$ and a diagonal matrix ${\bf D}$, such that ${\bf W}_N={\bf D}\prod_{i=2}^{N}\prod_{j=1}^{i-1}{\bf R}_{ij}$, where ${\bf R}_{ij}$ is defined as the $N$-dimensional identity matrix with the elements $R_{ii}$, $R_{ij}$, $R_{ji}$ and $R_{jj}$ replaced as follows (Reck et al., 1994; Clements et al., 2016):

where $\theta_{ij}$ and $\phi_{ij}$ are unique parameters corresponding to ${\bf R}_{ij}$. Each of these matrices performs a $U(2)$ unitary transformation on a two-dimensional subspace of the $N$-dimensional Hilbert space, leaving an $(N-2)$-dimensional subspace unchanged. In other words, a series of $U(2)$ rotations can be used to successively make all off-diagonal elements of a given $N\times N$ unitary matrix zero. This generalizes the familiar factorization of a 3D rotation matrix into 2D rotations parametrized by the three Euler angles. To provide intuition for how this works, let us briefly describe a simple scheme, similar to Gaussian elimination, that finishes one column at a time. There are infinitely many alternative decomposition schemes as well; Fig. 1 shows two that are particularly convenient to implement in software (and even in neuromorphic hardware (Shen et al., 2016)). The unitary matrix ${\bf W}_N$ is multiplied from the right by a succession of unitary matrices ${\bf R}_{Nj}$ for $j=N-1,\cdots,1$. Once all elements of the last row except the one on the diagonal are zero, this row will not be affected by later transformations. Since all transformations are unitary, the last column will then also contain only zeros except on the diagonal:

The effective dimensionality of the matrix ${\bf W}$ is thus reduced to $N-1$. The same procedure can then be repeated $N-1$ times until the effective dimension of ${\bf W}$ is reduced to 1, leaving us with a diagonal matrix:¹ (¹Note that Gaussian elimination would make merely the upper triangle of a matrix vanish, requiring a subsequent series of rotations (complete Gauss-Jordan elimination) to zero the lower triangle. We need no such subsequent series because ${\bf W}$ is unitary: it is easy to show that a triangular unitary matrix must be diagonal.)

where ${\bf D}$ is a diagonal matrix whose diagonal elements are $e^{iw_j}$, from which we can write the direct representation of ${\bf W}_N$ as

This parametrization thus involves $N(N-1)/2$ different $\theta_{ij}$-values, $N(N-1)/2$ different $\phi_{ij}$-values and $N$ different $w_i$-values, combining to $N^2$ parameters in total, and spans the entire unitary space. Note that we can always fix a portion of our parameters to span only a subset of the unitary space; indeed, our benchmark tests below will show that for certain tasks, full unitary-space parametrization is not necessary.² (²Our preliminary experimental tests even suggest that a full-capacity unitary RNN is undesirable for some tasks.)
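To make the parameter counting concrete, here is a small sketch (ours) that builds the ${\bf R}_{ij}$ of Eq. 9 and verifies that the product ${\bf D}\prod_{i}\prod_{j}{\bf R}_{ij}$ is unitary; for $N=4$ it uses $2\cdot 6+4=16=N^2$ real parameters:

```python
import numpy as np

def rotation_matrix(n, i, j, theta, phi):
    """N-dimensional identity with entries (i,i),(i,j),(j,i),(j,j) replaced
    by the 2x2 unitary of Eq. 9 (0-based indices here)."""
    R = np.eye(n, dtype=complex)
    R[i, i] = np.exp(1j * phi) * np.cos(theta)
    R[i, j] = -np.exp(1j * phi) * np.sin(theta)
    R[j, i] = np.sin(theta)
    R[j, j] = np.cos(theta)
    return R

def full_unitary(n, thetas, phis, ws):
    """W = D * prod_{i=2..N} prod_{j=1..i-1} R_ij, with N(N-1)/2 thetas,
    N(N-1)/2 phis and N phases w: N^2 real parameters in total."""
    W = np.diag(np.exp(1j * np.asarray(ws)))
    k = 0
    for i in range(1, n):      # i = 2..N in the 1-based notation of the text
        for j in range(i):     # j = 1..i-1
            W = W @ rotation_matrix(n, i, j, thetas[k], phis[k])
            k += 1
    return W
```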

The representation in Eq. 12 can be made more compact by reordering and grouping specific rotational matrices, as was shown in the optical community (Reck et al., 1994; Clements et al., 2016) in the context of universal multiport interferometers. For example (Clements et al., 2016), a unitary matrix can be decomposed as

where every

is a block diagonal matrix, with $N$ angle parameters in total, and

with $N-1$ parameters, as shown schematically in Fig. 1a. By choosing different values for $L$, ${\bf W}_N$ will span a different subspace of the unitary space. Specifically, when $L=N$, ${\bf W}_N$ will span the entire unitary space.

Following this physics-inspired scheme, we decompose our unitary hidden-to-hidden layer matrix 𝐖𝐖{\bf W} as

Inspired by (Mathieu & LeCun, 2014a), an alternative way to organize the rotation matrices is an FFT-style architecture. Instead of rotating only adjacent coordinates, each $\mathbf{F}$ here performs pairwise rotations between coordinates a fixed distance apart, as shown in Fig. 1b:

The rotation matrices in $\mathbf{F}_i$ are performed between pairs of coordinates

where $p=\frac{N}{2^i}$, $k\in\{0,\ldots,2^{i-1}-1\}$ and $j\in\{1,\ldots,p\}$. This requires only $\log(N)$ matrices, so there are a total of $N\log(N)/2$ rotational pairs. This is also the minimal number of rotations that allows all input coordinates to interact with each other, providing an approximation of arbitrary unitary matrices.
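The index pattern of Eq. 17 can be generated and sanity-checked with a short script (ours; indices are 1-based as in the text). Each level pairs every coordinate exactly once, and the $\log_2 N$ levels together give $N\log_2(N)/2$ pairs:

```python
def fft_style_pairs(N, i):
    """Coordinate pairs rotated at FFT-style level i (1-based), per Eq. 17.

    With p = N / 2^i, the pairs are (2pk + j, p(2k+1) + j) for
    k = 0..2^(i-1)-1 and j = 1..p, i.e. pairs a distance p apart."""
    p = N // 2 ** i
    pairs = []
    for k in range(2 ** (i - 1)):
        for j in range(1, p + 1):
            pairs.append((2 * p * k + j, p * (2 * k + 1) + j))
    return pairs
```

For example, for $N=8$ the first level pairs coordinates a distance 4 apart: (1,5), (2,6), (3,7), (4,8).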

To implement this decomposition efficiently in an RNN, we apply vector element-wise multiplications and permutations: we evaluate the product $\mathbf{F}{\bf x}$ as

where $*$ represents element-wise multiplication and $\mathbf{F}$ refers to general rotational matrices such as $\mathbf{F}_{A/B}$ in Eq. 14 and $\mathbf{F}_i$ in Eq. 16. For the case of the tunable-space implementation, if we want to implement $\mathbf{F}_A^{(l)}$ in Eq. 14, we define $\mathbf{v}$ and the permutation as follows:

In general, the pseudocode for implementing operation 𝐅𝐅\mathbf{F} is as follows:

Note that $\mathbf{ind_1}$ and $\mathbf{ind_2}$ are different for different $\mathbf{F}$.

From a computational complexity viewpoint, since the operations $*$ and $\mathrm{permute}$ take $\mathcal{O}(N)$ computational steps, evaluating $\mathbf{F}\mathbf{x}$ only requires $\mathcal{O}(N)$ steps. The product $\mathbf{D}\mathbf{x}$ is trivial, consisting of an element-wise vector multiplication. Therefore, the product $\mathbf{W}\mathbf{x}$ with the total unitary matrix $\mathbf{W}$ can be computed in only $\mathcal{O}(NL)$ steps and requires only $\mathcal{O}(NL)$ memory accesses (the full-space implementation has $L=N$; the FFT-style approximation gives $L=\log N$). A detailed comparison of the computational complexity of the existing unitary RNN architectures is given in Table 1.

We use the same nonlinearity as (Arjovsky et al., 2015):

where the bias vector $\mathbf{b}$ is a shared trainable parameter and $|z_i|$ is the norm of the complex number $z_i$.

For real-number input, $\mathrm{modReLU}$ can be simplified to:

We empirically find that this nonlinearity performs best. We believe it may also serve as a forgetting filter that removes noise via the bias threshold.
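A direct NumPy transcription of the $\mathrm{modReLU}$ of Eq. 19 (the small `eps`, ours, guards against division by zero at the origin):

```python
import numpy as np

def mod_relu(z, b):
    """modReLU: rescale each complex entry's modulus by ReLU(|z_i| + b_i)
    while preserving its phase.  A negative bias b acts as a noise threshold
    below which entries are zeroed out."""
    eps = 1e-8
    mag = np.abs(z)
    return (z / (mag + eps)) * np.maximum(mag + b, 0.0)
```

For example, with $z=3+4i$ (so $|z|=5$) and $b=-1$, the modulus shrinks to 4 while the phase is unchanged; with $b=-10$ the entry is zeroed entirely.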

In this section, we compare the performance of our Efficient Unitary Recurrent Neural Network (EURNN) with an LSTM RNN (Hochreiter & Schmidhuber, 1997), a unitary RNN (URNN) (Arjovsky et al., 2015), and a projective full-space URNN (PURNN) (Wisdom et al., 2016).

All models are implemented in both Tensorflow and Theano, available from https://github.com/jingli9111/EUNN-tensorflow and https://github.com/iguanaus/EUNN-theano.

We compare these networks by applying them all to the well-defined copying memory task (Hochreiter & Schmidhuber, 1997; Arjovsky et al., 2015; Henaff et al., 2016). The copying task is a synthetic task commonly used to test a network’s ability to remember information seen $T$ time steps earlier.

Specifically, the task is defined as follows (Hochreiter & Schmidhuber, 1997; Arjovsky et al., 2015; Henaff et al., 2016). An alphabet consists of symbols $\{a_i\}$, the first $n$ of which represent data and the remaining two of which represent “blank” and “start recall”, respectively, as illustrated by the following example where $T=20$ and $M=5$:

In the above example, $n=3$ and $\{a_i\}=\{A,B,C,-,:\}$. The input consists of $M$ random data symbols ($M=5$ above) followed by $T-1$ blanks, the “start recall” symbol and $M$ more blanks. The desired output consists of $M+T$ blanks followed by the data sequence. The cost function $C$ is defined as the cross entropy between the target and output sequences, which vanishes for perfect performance.
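For concreteness, here is a generator for one copying-task example (ours; symbols encoded as integers, with $n$ data symbols plus "blank" and "start recall"):

```python
import numpy as np

def copying_task_example(n, M, T, rng):
    """Input: M random data symbols, T-1 blanks, 'start recall', M blanks.
    Target: M+T blanks followed by the same data.  Both have length T + 2M."""
    blank, start = n, n + 1
    data = rng.integers(0, n, size=M)
    x = np.concatenate([data, np.full(T - 1, blank), [start], np.full(M, blank)])
    y = np.concatenate([np.full(M + T, blank), data])
    return x, y
```

A memoryless strategy that outputs $M+T$ blanks followed by $M$ uniformly random data symbols achieves the baseline cross entropy $(M\log n)/(T+2M)$ quoted in the Fig. 2 caption.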

We use $n=8$ and input length $M=10$. The symbol for each input is represented by an $n$-dimensional one-hot vector. We trained all five RNNs for $T=1000$ with the same batch size of 128, using RMSProp optimization with a learning rate of 0.001. The decay rate is set to 0.5 for EURNN and 0.9 for all other models (Fig. 2). These results show that the EURNN architectures introduced in both Sec. 4.2 (EURNN with $N=512$, selecting $L=2$) and Sec. 4.3 (FFT-style EURNN with $N=512$) outperform the LSTM model (which suffers from long-term memory problems and only performs well on the copying task for small time delays $T$) and all other unitary RNN models, both in terms of learnability and in terms of convergence rate. Note that the only other unitary RNN model able to beat the baseline for $T=1000$ (Wisdom et al., 2016) is significantly slower than our method.

Moreover, we find that by either choosing a smaller $L$ or by using the FFT-style method (so that $\mathbf{W}$ spans a smaller unitary subspace), the EURNN converges toward optimal performance significantly more efficiently (and also faster in wall-clock time) than the partial (Arjovsky et al., 2015) and projective (Wisdom et al., 2016) unitary methods. The EURNN also performs more robustly. This means that a full-capacity unitary matrix is not necessary for this particular task.

The MNIST handwriting recognition problem is one of the classic benchmarks for quantifying the learning ability of neural networks. Each MNIST sample is a 28$\times$28 grayscale image with a target label between 0 and 9.

To test the different RNN models, we feed all pixels of each MNIST image into the RNN models in $28\times 28=784$ time steps, one pixel at a time as a floating-point number. A fixed random permutation is applied to the order of input pixels. The output is the probability distribution quantifying the digit prediction. We used RMSProp with a learning rate of 0.0001 and a decay rate of 0.9, and set the batch size to 128.

As shown in Fig. 3, EURNN significantly outperforms LSTM with the same number of parameters. It learns faster, in fewer iteration steps, and converges to a higher classification accuracy. In addition, the EURNN reaches similar accuracy with fewer parameters. In Table 2, we compare the performance of different RNN models on this task.

We also apply our EURNN to a real-world speech prediction task and compare its performance to an LSTM. The main task we consider is predicting the log-magnitude of future frames of a short-time Fourier transform (STFT) (Wisdom et al., 2016; Sejdić et al., 2009). We use the TIMIT dataset (Garofolo et al., 1993) sampled at 8 kHz. Each audio .wav file is initially diced into time frames (all frames have the same duration, given by the Hann analysis window below). The audio amplitude in each frame is then Fourier transformed into the frequency domain, and the log-magnitude of the Fourier amplitude is normalized and used as the data for training/testing each model. Our STFT operation uses a Hann analysis window of 256 samples (32 milliseconds) and a window hop of 128 samples (16 milliseconds). The frame prediction task is as follows: given the log-magnitudes of all STFT frames up to time $t$, predict the log-magnitude of the STFT frame at time $t+1$ with minimum mean square error (MSE). We use a training set of 2400 utterances, a validation set of 600 utterances and an evaluation set of 1000 utterances; the three sets have distinct speakers. We trained all RNNs with the same batch size of 32, using RMSProp optimization with a learning rate of 0.001, a momentum of 0.9 and a decay rate of 0.1.
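A sketch (ours; the per-utterance normalization is omitted) of the STFT preprocessing described above:

```python
import numpy as np

def log_magnitude_stft(audio, win=256, hop=128):
    """Hann-windowed STFT log-magnitude frames.

    With a 256-sample window and 128-sample hop at 8 kHz, frames are 32 ms
    long with a 16 ms hop.  eps avoids log(0) on silent frames."""
    window = np.hanning(win)
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)  # win//2 + 1 = 129 frequency bins
    return np.log(np.abs(spec) + 1e-6)
```

One second of 8 kHz audio (8000 samples) then yields 61 frames of 129 log-magnitude values each; the RNN consumes one such frame per time step.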

The results are given in Table 3 in terms of the mean-squared error (MSE) loss function. Figure 4 shows prediction examples from the three types of networks, illustrating how EURNNs generally perform better than LSTMs. Furthermore, on this particular task, full-capacity EURNNs outperform small-capacity EURNNs and FFT-style EURNNs.

We have presented a method for implementing an Efficient Unitary Neural Network (EUNN) whose computational cost is merely $\mathcal{O}(1)$ per parameter, which is $\mathcal{O}(\log N)$ more efficient than the other methods discussed above. It significantly outperforms existing RNN architectures on the standard copying task and the pixel-permuted MNIST task using a comparable parameter count, hence demonstrating the highest recorded ability to memorize sequential information over long time periods.

It also performs well on real tasks such as speech prediction, outperforming an LSTM on TIMIT speech prediction.

We want to emphasize the generality and tunability of our method. The orderings of the rotation matrices presented in Fig. 1 are merely two of many possibilities; we used them simply as concrete examples. Other orderings that span the full unitary matrix space can be used in our algorithm as well, with identical speed and memory performance. This tunability of the span of the unitary space, and correspondingly of the total number of parameters, makes it possible to use different capacities for different tasks, thus opening the way to optimal performance of the EUNN. For example, as we have shown, a small subspace of the full unitary space is preferable for the copying task, whereas the MNIST and TIMIT tasks are better performed by an EUNN covering a considerably larger unitary space. Finally, we note that our method remains applicable even if the unitary matrix is decomposed into a different product of matrices (Eq. 12).

This powerful and robust unitary RNN architecture also might be promising for natural language processing because of its ability to efficiently handle tasks with long-term correlation and very high dimensionality.

We thank Hugo Larochelle and Yoshua Bengio for helpful discussions and comments.

This work was partially supported by the Army Research Office through the Institute for Soldier Nanotechnologies under contract W911NF-13-D0001, the National Science Foundation under Grant No. CCF-1640012 and the Rothberg Family Fund for Cognitive Science.

Table: S4.T1: Performance comparison of four recurrent neural network algorithms: URNN (Arjovsky et al., 2015), PURNN (Wisdom et al., 2016), and EURNN (our algorithm, in tunable and FFT styles). $T$ denotes the RNN length and $N$ denotes the hidden state size. For the tunable-style EURNN, $L$ is an integer between 1 and $N$ parametrizing the unitary matrix capacity.

Model | Time complexity of one online gradient step | Number of parameters in the hidden matrix | Transition matrix search space
URNN | $\mathcal{O}(TN\log N)$ | $\mathcal{O}(N)$ | subspace of $\mathbf{U}(N)$
PURNN | $\mathcal{O}(TN^2+N^3)$ | $\mathcal{O}(N^2)$ | full space of $\mathbf{U}(N)$
EURNN (tunable style) | $\mathcal{O}(TNL)$ | $\mathcal{O}(NL)$ | tunable space of $\mathbf{U}(N)$
EURNN (FFT style) | $\mathcal{O}(TN\log N)$ | $\mathcal{O}(N\log N)$ | subspace of $\mathbf{U}(N)$

Table: S5.T2: MNIST task results. EURNN is our algorithm; PURNN is the algorithm presented in (Wisdom et al., 2016); URNN is the algorithm presented in (Arjovsky et al., 2015).

Model | hidden size (capacity) | number of parameters | validation accuracy | test accuracy
LSTM | 80 | 16k | 0.908 | 0.902
URNN | 512 | 16k | 0.942 | 0.933
PURNN | 116 | 16k | 0.922 | 0.921
EURNN (tunable style) | 1024 (2) | 13.3k | 0.940 | 0.937
EURNN (FFT style) | 512 (FFT) | 9.0k | 0.928 | 0.925

Figure 1: Unitary matrix decomposition. An arbitrary unitary matrix ${\bf W}$ can be decomposed (a) with the square decomposition method of Clements et al. (Clements et al., 2016) discussed in Section 4.2, or approximated (b) by the Fast Fourier Transform (FFT)-style decomposition method (Mathieu & LeCun, 2014b) discussed in Section 4.3. Each junction in graphs (a) and (b) represents the $U(2)$ matrix shown in (c).

Figure 2: Copying task for $T=1000$. EURNN is our algorithm; projective URNN (PURNN) is the algorithm presented in (Wisdom et al., 2016); URNN is the algorithm presented in (Arjovsky et al., 2015). A useful baseline is the memoryless strategy, which outputs $M+T$ blanks followed by $M$ random data symbols and produces a cross entropy $C=(M\log n)/(T+2M)$. Note that each iteration for PURNN takes about 32 times longer than for the EURNN models in this particular simulation, so the speed advantage is much greater than apparent in this plot.

Figure 3: Pixel-permuted MNIST performance on the validation dataset.

Figure 4: Example spectrograms of ground truth and RNN prediction results from the evaluation sets.

$$ {\bf h}^{(t)}=\sigma({\bf U}{\bf x}^{(t)}+{\bf W}{\bf h}^{(t-1)}), $$ \tag{S2.E1}

$$ {\bf y}^{(t)}={\bf W}{\bf h}^{(t)}+{\bf b}, $$ \tag{S2.E2}

$$ {\bf W}={\bf D}_{3}{\bf T}_{2}\mathcal{F}^{-1}{\bf D}_{2}{\bm{\Pi}}{\bf T}_{1}\mathcal{F}{\bf D}_{1}. $$ \tag{S3.E6}

$$ \begin{pmatrix}R_{ii}&R_{ij}\\ R_{ji}&R_{jj}\end{pmatrix}=\begin{pmatrix}e^{i\phi_{ij}}\cos\theta_{ij}&-e^{i\phi_{ij}}\sin\theta_{ij}\\ \sin\theta_{ij}&\cos\theta_{ij}\end{pmatrix}. $$ \tag{S4.E9}

$$ {\bf W}_{N}{\bf R}_{N,N-1}{\bf R}_{N,N-2}\cdots{\bf R}_{i,j}{\bf R}_{i,j-1}\cdots{\bf R}_{3,1}{\bf R}_{2,1}={\bf D}, $$ \tag{S4.E11}

$$ {\bf R}^{\prime}_{ij}={\bf R}(-\theta_{ij},-\phi_{ij})={\bf R}(\theta_{ij},\phi_{ij})^{-1}={\bf R}_{ij}^{-1} $$ \tag{S4.E13}

$$ \mathbf{W}=\mathbf{D}\mathbf{F}_{A}^{(1)}\mathbf{F}_{B}^{(2)}\mathbf{F}_{A}^{(3)}\mathbf{F}_{B}^{(4)}\cdots\mathbf{F}_{B}^{(L)}. $$ \tag{S4.E15}

$$ (2pk+j,p(2k+1)+j) $$ \tag{S4.E17}

$$ \mathbf{F}\mathbf{x}=\mathbf{v_{1}}*\mathbf{x}+\mathbf{v_{2}}*\mathrm{permute}(\mathbf{x}) $$ \tag{S4.E18}

$$ (\mathrm{modReLU}(\mathbf{z},\mathbf{b}))_{i}=\frac{z_{i}}{|z_{i}|}*\mathrm{ReLU}(|z_{i}|+b_{i}) $$ \tag{S4.E19}


$$ \displaystyle\mathbf{F}_{B}^{(l)}={\bf R}_{2,3}^{(l)}{\bf R}_{4,5}^{(l)}\dots{\bf R}_{N-2,N-1}^{(l)} $$

$$ \displaystyle\mathbf{v_{1}}=(e^{i\phi_{1}^{(l)}}\cos\theta_{1}^{(l)},\cos\theta_{1}^{(l)},e^{i\phi_{2}^{(l)}}\cos\theta_{2}^{(l)},\cos\theta_{2}^{(l)},\cdots) $$

$$ \displaystyle\mathrm{permute}(\mathbf{x})=(x_{2},x_{1},x_{4},x_{3},x_{6},x_{5},\cdots). $$

Table: TIMIT speech prediction results (MSE loss).

Model | hidden size (capacity) | number of parameters | MSE (validation) | MSE (test)
LSTM | 64 | 33k | 71.4 | 66
LSTM | 128 | 98k | 55.3 | 54.5
EURNN (tunable style) | 128 (2) | 33k | 63.3 | 63.3
EURNN (tunable style) | 128 (32) | 35k | 52.3 | 52.7
EURNN (tunable style) | 128 (128) | 41k | 51.8 | 51.9
EURNN (FFT style) | 128 (FFT) | 34k | 52.3 | 52.4


References

[arjovsky2015unitary] Arjovsky, Martin, Shah, Amar, and Bengio, Yoshua. Unitary evolution recurrent neural networks. arXiv preprint arXiv:1511.06464, 2015.

[bahdanau2014neural] Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[bengio1994learning] Bengio, Yoshua, Simard, Patrice, and Frasconi, Paolo. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

[berglund2015bidirectional] Berglund, Mathias, Raiko, Tapani, Honkala, Mikko, Kärkkäinen, Leo, Vetek, Akos, and Karhunen, Juha T. Bidirectional recurrent neural networks as generative models. In Advances in Neural Information Processing Systems, pp. 856–864, 2015.

[cho2014properties] Cho, Kyunghyun, Van Merriënboer, Bart, Bahdanau, Dzmitry, and Bengio, Yoshua. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[Clements2015optimal] Clements, William R., Humphreys, Peter C., Metcalf, Benjamin J., Kolthammer, W. Steven, and Walmsley, Ian A. An optimal design for universal multiport interferometers, 2016. arXiv:1603.08788.

[collobert2011natural] Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.

[donahue2015long] Donahue, Jeffrey, Anne Hendricks, Lisa, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634, 2015.

[garofolo1993darpa] Garofolo, John S, Lamel, Lori F, Fisher, William M, Fiscus, Jonathon G, and Pallett, David S. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon Technical Report N, 93, 1993.

[henaff2016orthogonal] Henaff, Mikael, Szlam, Arthur, and LeCun, Yann. Orthogonal RNNs and long-memory tasks. arXiv preprint arXiv:1602.06662, 2016.

[hinton2012deep] Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E, Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.

[hochreiter1991untersuchungen] Hochreiter, Sepp. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München, p. 91, 1991.

[hochreiter1997long] Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[krizhevsky2012imagenet] Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

[le2015simple] Le, Quoc V, Jaitly, Navdeep, and Hinton, Geoffrey E. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941, 2015.

[lecun2015deep] LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436–444, 2015.

[michael2014mathieu] Mathieu, Michael and LeCun, Yann. Fast approximation of rotations and Hessians matrices. arXiv preprint arXiv:1404.7195, 2014a.

[Mathieu2014Fast] Mathieu, Michael and LeCun, Yann. Fast approximation of rotations and Hessians matrices. CoRR, abs/1404.7195, 2014b. URL http://arxiv.org/abs/1404.7195.

[Reck1994experimental] Reck, Michael, Zeilinger, Anton, Bernstein, Herbert J., and Bertani, Philip. Experimental realization of any discrete unitary operator. Phys. Rev. Lett., 73:58–61, Jul 1994. doi: 10.1103/PhysRevLett.73.58. URL http://link.aps.org/doi/10.1103/PhysRevLett.73.58.

[saxe2013exact] Saxe, Andrew M, McClelland, James L, and Ganguli, Surya. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

[SEJDIC2009153] Sejdić, Ervin, Djurović, Igor, and Jiang, Jin. Time–frequency feature representation using energy concentration: An overview of recent advances. Digital Signal Processing, 19(1):153–183, 2009. ISSN 1051-2004. doi: 10.1016/j.dsp.2007.12.004. URL http://www.sciencedirect.com/science/article/pii/S105120040800002X.

[shen2016deep] Shen, Yichen, Harris, Nicholas C, Skirlo, Scott, Prabhu, Mihika, Baehr-Jones, Tom, Hochberg, Michael, Sun, Xin, Zhao, Shijie, Larochelle, Hugo, Englund, Dirk, et al. Deep learning with coherent nanophotonic circuits. arXiv preprint arXiv:1610.02365, 2016.

[sutskever2014sequence] Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.

[wisdom2016full] Wisdom, Scott, Powers, Thomas, Hershey, John, Le Roux, Jonathan, and Atlas, Les. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 4880–4888, 2016.
