
Fast Approximation of Rotations and Hessian Matrices

Abstract

A new method to represent and approximate rotation matrices is introduced. The method represents approximations of a rotation matrix $Q$ with linearithmic complexity, i.e. with $\frac{1}{2}n\lg(n)$ rotations over pairs of coordinates, arranged in an FFT-like fashion. The approximation is "learned" using gradient descent. It makes it possible to represent symmetric matrices $H$ as $QDQ^T$ where $D$ is a diagonal matrix. It can be used to approximate the covariance matrices of Gaussian models in order to speed up inference, or to estimate and track the inverse Hessian of an objective function by relating changes in parameters to changes in gradient along the trajectory followed by the optimization procedure. Experiments were conducted to approximate synthetic matrices, covariance matrices of real data, and Hessian matrices of objective functions involved in machine learning problems.


Michaël Mathieu

Courant Institute of Mathematical Sciences, New York University. mathieu@cs.nyu.edu

Yann LeCun


Introduction

Covariance matrices, Hessian matrices and, more generally speaking, symmetric matrices, play a major role in machine learning. In density models containing covariance matrices (e.g. mixtures of Gaussians), estimation and inference involve computing the inverse of such matrices and computing products of such matrices with vectors. The size of these matrices grows quadratically with the dimension of the space, and some of the computations grow cubically. This renders direct evaluations impractical in high dimension, making approximations necessary. Depending on the problem, different approaches have been proposed.

In Gaussian mixture models (GMM), multiple high-dimensional Gaussian functions must be evaluated. In very large models with high dimension and many mixture components, the computational cost can be prohibitive. Approximations are often used to reduce the computational complexity, including diagonal approximations, low-rank approximations, and shared covariance matrices between multiple mixture components.

In Bayesian inference with Gaussian models, and in variational inference using the Laplace approximation, one must compute high-dimensional Gaussian integrals to marginalize over the latent variables. Estimating and manipulating the covariance matrices and their inverse can quickly become expensive computationally.

In machine learning and statistical model estimation, objective functions must be optimized with respect to high-dimensional parameters. Exploiting the second-order properties of the objective function to speed up learning (or to regularize the estimation) is always a challenge when the dimension is large, particularly when the objective function is non quadratic, or even non convex (as is the case with deep learning models).

In optimization, quasi-Newton methods such as BFGS have been proposed to keep track of the inverse Hessian as the optimization proceeds. While BFGS has quadratic complexity per iteration, Limited-Storage BFGS (LBFGS) uses a factorized form of the inverse Hessian approximation that reduces the complexity to linear times an adjustable factor [Noc80]. Quasi-Newton methods have been experimented with extensively to speed up optimization in machine learning, including LBFGS [BL89, LBOM98, SYG07, BBG09, NCL+11], although the inherently batch nature of LBFGS has limited its applicability to large-scale learning [DCM+12]. Some authors have addressed the issue of approximating the Hessian for ML systems with complex structures, such as deep learning architectures. This involves back-propagating curvatures through the computational graph of the function [BL89, LBOM98, CE11, MSS12], which is particularly easy when computing diagonal approximations or Hessians of recurrent neural nets. Others have attempted to identify a few dominant eigen-directions in which the curvatures are larger than the others, which can be seen as a low-rank Hessian approximation [LSP93].

More recently, decompositions of covariance matrices using products of rotations on pairs of coordinates have been explored in [GLC11]. The method greedily selects the best pairwise rotations to approximate the diagonalizing basis of the matrix, and then finds the best corresponding eigenvalues.

Linearithmic Symmetric Matrix Approximation

A symmetric matrix $H$ of size $n \times n$ can be factorized (diagonalized) as $H = QDQ^T$ where $Q$ is an orthogonal (rotation) matrix, and $D$ is a diagonal matrix. In general, multiplying a vector by $Q$ requires $n^2$ operations, and the full set of orthogonal matrices has $n(n-1)/2$ free parameters. The main idea of this paper is very simple: parameterize the rotation matrix $Q$ as a product of $n\lg(n)/2$ elementary rotations within 2D planes spanned by pairs of coordinates (sometimes called Givens rotations). This makes the computational complexity of the product of $H$ with a vector linearithmic instead of quadratic. It also makes the computation of the inverse trivial: $H^{-1} = Q D^{-1} Q^T$. The second idea is to compute the best such approximation of a matrix by least-squares optimization with stochastic gradient descent.

The question is how good of an approximation to real-world covariance and Hessian matrices can we get by restricting ourselves to these linearithmic rotations instead of the full set of rotations.

The main intuition that the approximation may be valid comes from a conjecture in random matrix theory according to which the distribution of random matrices obtained by randomly drawing $n\lg(n)/2$ successive Givens rotations is dense in the space of rotation matrices [Ben13]. Proven methods for generating orthogonal matrices with a uniform distribution require $O(\log(p)\,p^2)$ Givens rotations, with an arrangement based on the butterfly operators in the Fast Fourier Transform [Gen98].

Indeed, to guarantee that pairs of coordinates are shuffled in a systematic way, we chose to arrange the $n\lg(n)/2$ Givens rotations in the same way as the butterfly operators in the Fast Fourier Transform. To simplify the discussion, we assume from now on that $n$ is a power of $2$, unless specified otherwise. We decompose

$$ Q \approx Q_1 Q_2 \dots Q_{\lg(n)} \qquad (1) $$

where $Q_i$ is a sparse rotation formed by $n/2$ independent pairwise rotations, between pairs of coordinates

$$ \big(2pk+j,\qquad p(2k+1)+j\big) $$

where $p = n/2^i$, $k \in 0..2^{i-1}-1$ and $j \in 1..p$. For instance, with $n = 8$, we obtain the following matrices:

$$
Q_1 = \begin{pmatrix}
c_1 & & & & -s_1 & & & \\
 & c_2 & & & & -s_2 & & \\
 & & c_3 & & & & -s_3 & \\
 & & & c_4 & & & & -s_4 \\
s_1 & & & & c_1 & & & \\
 & s_2 & & & & c_2 & & \\
 & & s_3 & & & & c_3 & \\
 & & & s_4 & & & & c_4
\end{pmatrix}
$$

$$
Q_2 = \begin{pmatrix}
c_5 & & -s_5 & & & & & \\
 & c_6 & & -s_6 & & & & \\
s_5 & & c_5 & & & & & \\
 & s_6 & & c_6 & & & & \\
 & & & & c_7 & & -s_7 & \\
 & & & & & c_8 & & -s_8 \\
 & & & & s_7 & & c_7 & \\
 & & & & & s_8 & & c_8
\end{pmatrix}
\qquad
Q_3 = \begin{pmatrix}
c_9 & -s_9 & & & & & & \\
s_9 & c_9 & & & & & & \\
 & & c_{10} & -s_{10} & & & & \\
 & & s_{10} & c_{10} & & & & \\
 & & & & c_{11} & -s_{11} & & \\
 & & & & s_{11} & c_{11} & & \\
 & & & & & & c_{12} & -s_{12} \\
 & & & & & & s_{12} & c_{12}
\end{pmatrix}
$$

where $c_i = \cos(\theta_i)$ and $s_i = \sin(\theta_i)$.

There are $n\lg(n)/2$ pairwise rotations, grouped into $\lg(n)$ matrices, such that each matrix represents $n/2$ independent rotations. This is the minimal number of matrices that preserves the pairwise-rotation structure, and this particular arrangement has the property that every pair of input coordinates can interact.
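To make the butterfly arrangement concrete, here is a minimal NumPy sketch (our own illustration, not the authors' implementation) of applying $Q = Q_1 \dots Q_{\lg(n)}$, or its transpose, to a vector in $O(n\lg(n))$ time; the function name `apply_Q` and the layout of `thetas` are assumptions made for the example.

```python
import numpy as np

def apply_Q(thetas, x, transpose=False):
    """Apply Q = Q_1 Q_2 ... Q_{lg(n)} (or Q^T) to a vector x in O(n lg n).

    `thetas` is a list of lg(n) arrays, each holding the n/2 angles of one
    layer Q_i; coordinate pairs follow the butterfly pattern
    (2pk + j, p(2k+1) + j) described above, written 0-indexed here.
    Minimal sketch of the structure, not the authors' code.
    """
    y = np.asarray(x, dtype=float).copy()
    n = y.shape[0]
    num_layers = len(thetas)  # lg(n)
    order = range(num_layers, 0, -1) if not transpose else range(1, num_layers + 1)
    for i in order:  # Qx applies Q_{lg n} first; Q^T x applies Q_1^T first
        p = n // 2**i
        angles = thetas[i - 1]
        t = 0
        for k in range(2**(i - 1)):
            for j in range(p):
                a = 2 * p * k + j              # first coordinate of the pair
                b = p * (2 * k + 1) + j        # second coordinate of the pair
                c = np.cos(angles[t])
                s = np.sin(angles[t]) * (-1.0 if transpose else 1.0)
                ya, yb = y[a], y[b]
                y[a] = c * ya - s * yb
                y[b] = s * ya + c * yb
                t += 1
    return y
```

With $n = 8$ and three layers of four angles each, this reproduces exactly the structure of the matrices $Q_1$, $Q_2$, $Q_3$ shown above.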

The final decomposition of the symmetric matrix $H$ is therefore a product of $2\lg(n)+1$ sparse matrices, namely

$$ H \approx Q_1 Q_2 \dots Q_{\lg(n)}\, D\, Q_{\lg(n)}^T \dots Q_2^T Q_1^T \qquad (2) $$

Since all the matrices $Q_i$ are rotations, we have $Q_i^{-1} = Q_i^T$ and thus the decomposition of $H^{-1}$ is obtained by simply replacing the elements of $D$ by their inverses.
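As a sketch of how cheap both $Hx$ and $H^{-1}x$ become under this factorization, the following helpers reuse `apply_Q` from the sketch above; the small `eps` floor is our own guard against near-zero diagonal elements and anticipates the implementation details discussed later.

```python
def apply_H(thetas, d, x):
    # H x = Q diag(d) Q^T x, computed in O(n lg n) with the apply_Q sketch above.
    return apply_Q(thetas, d * apply_Q(thetas, x, transpose=True))

def apply_H_inv(thetas, d, x, eps=1e-8):
    # H^{-1} x = Q diag(1/d) Q^T x.  Flooring d at eps is our own guard
    # against tiny (or negative) eigenvalues.
    d_inv = 1.0 / np.maximum(d, eps)
    return apply_Q(thetas, d_inv * apply_Q(thetas, x, transpose=True))
```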

It is important to notice that the problem becomes easier if $H$ has eigenspaces of high dimension. If there are groups of roughly equal eigenvalues, the rotations between vectors belonging to the corresponding eigenspaces become irrelevant. This is similar to low-rank approximation, where all small eigenvalues are grouped into the same eigenspace, but it also works for eigenspaces with non-zero eigenvalues.

Approximating Hessian matrices

Methodology

Let $(\theta_1, \dots, \theta_{n\lg(n)/2})$ be the rotation parameters of the matrices $Q_1, \dots, Q_{\lg(n)}$ and $(\sigma_1, \dots, \sigma_n)$ the diagonal elements of $D$. Let us name $\omega$ the whole set of $n(\lg(n)/2 + 1)$ parameters, so

$$ \omega = (\theta_1, \dots, \theta_{n\lg(n)/2}, \sigma_1, \dots, \sigma_n) $$

When we need the explicit parametrization, we denote $Q_\omega$ the rotation matrix $Q_1 \dots Q_{\lg(n)}$ parametrized by $(\theta_j)$ and $D_\omega$ the diagonal matrix $\mathrm{diag}(\sigma_1, \dots, \sigma_n)$. Finally, let $\mathcal{Q} = \{\, Q_\omega \mid \omega \in \mathbb{R}^{n(\lg(n)/2+1)} \,\}$ be the set of all matrices $Q_\omega$.

In order to find the best set of parameters $\omega$, we first notice that fitting the sequence of matrices in decomposition (2) can be cast as a machine learning problem. Although the problem is unfortunately not convex, it can still be minimized using standard gradient descent techniques, such as SGD or minibatch gradient descent.

Let $(x^{(j)})_{j=1..m}$ be a set of $n$-dimensional vectors. We train our model to predict the output $Hx^{(j)}$ when given the input $x^{(j)}$. We use a least-squares loss, so the function we minimize is

$$ \mathcal{L}(\omega) = \sum_{j=1}^{m} L(\omega, x^{(j)}, y^{(j)}) $$

where $y^{(j)} = Hx^{(j)}$ and

$$ L(\omega, x^{(j)}, y^{(j)}) = \|Q_\omega D_\omega Q_\omega^T x^{(j)} - y^{(j)}\|_2^2 . $$
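For illustration, the loss for one training pair can be evaluated directly with the `apply_Q` sketch introduced earlier (the helper names are ours):

```python
def pair_loss(thetas, d, x, y):
    # L(omega, x, y) = || Q D Q^T x - y ||_2^2 for one training pair.
    r = apply_Q(thetas, d * apply_Q(thetas, x, transpose=True)) - y
    return float(r @ r)

def total_loss(thetas, d, X, Y):
    # \mathcal{L}(omega): sum of the per-pair losses over the training set.
    return sum(pair_loss(thetas, d, x, y) for x, y in zip(X, Y))
```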

However, the parameterization through the angles makes the problem complicated and hard to optimize. Therefore, during one gradient step, we relax the parametrization and allow each matrix $Q_i$ to have $2n$ independent parameters, and we reproject the matrices onto the set of rotation matrices after the parameter update. The problem becomes learning a multilayer linear neural network, with sparse connections and shared weights.

Formally, we use an extended set of parameters, $\widetilde\omega$, which has one parameter per non-zero element of each matrix $Q_i$. The new set of matrices parametrized by $\widetilde\omega$, which we denote $\widetilde{\mathcal{Q}}$, contains matrices which are not rotation matrices. However, because of the Givens structure inside the matrices, it is trivial to project a matrix from $\widetilde{\mathcal{Q}}$ onto the set $\mathcal{Q}$, by simply projecting every Givens onto the set of $2 \times 2$ rotations. The projection $\mathrm{proj}$ of a Givens is

$$ \mathrm{proj}\begin{pmatrix} a & b \\ c & d \end{pmatrix} = \frac{1}{\eta}\begin{pmatrix} a+d & b-c \\ c-b & a+d \end{pmatrix} \qquad (6) $$

where $\eta = \sqrt{(a+d)^2 + (b-c)^2}$ normalizes the result to a rotation.
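A minimal sketch of this projection for one $2\times 2$ block, assuming the $\eta$ above (the degenerate case $\eta = 0$ is not handled):

```python
import numpy as np

def project_givens(a, b, c, d):
    """Nearest 2x2 rotation to [[a, b], [c, d]] (the proj above);
    eta = sqrt((a+d)^2 + (b-c)^2) makes the result orthonormal."""
    eta = np.hypot(a + d, b - c)
    return np.array([[a + d, b - c],
                     [c - b, a + d]]) / eta
```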

We summarize the learning procedure in Algorithm 1, using SGD. Computing the gradient of $L(\widetilde\omega, x^{(j)}, y^{(j)})$ is done by backpropagation. Note that we don't need the matrix $H$ itself, but only a way to compute the products $y^{(j)} = Hx^{(j)}$.

The hyperparameter $\alpha$, which doesn't have to be constant, is the learning rate. There is no reason that the same learning rate should be applied to the part of $\widetilde\omega$ in $Q_{\widetilde\omega}$ and to the part in $D_{\widetilde\omega}$. Therefore, $\alpha$ can be a diagonal matrix instead of a scalar. We have observed that the learning procedure converges faster when the part in $D_{\widetilde\omega}$ has a smaller learning rate.


Algorithm 1: Hessian matrix learning

Input: set $(x^{(j)}, y^{(j)})$ for $j = 1..m$
while not converged do
    Randomly draw $j \in 1..m$
    Compute the gradient $g_{\widetilde\omega,j} = \partial L(\widetilde\omega, x^{(j)}, y^{(j)}) / \partial\widetilde\omega$
    Update $\widetilde\omega \leftarrow \widetilde\omega - \alpha\, g_{\widetilde\omega,j}$
    Normalize all the Givens to project $Q_{\widetilde\omega}$ onto $\mathcal{Q}$
end while

There are several ways to obtain the input vectors x ( j ) , depending on the intended use of the approximated matrix. If we only need to obtain a fast and compact representation, we can draw random vectors. We found that vectors uniformly sampled on the unit sphere or in a hypercube centered on zero perform well. In certain cases, we may need to evaluate products Hx for x in a certain region of R n . In this case, using vectors in this specific area can provide better results, since more of the expressivity of our decomposition will be used on this specific region. The set of vectors ( x ( j ) , y ( j ) ) can also appear naturally, as we can see in the next part when we approximate the Hessian of another optimization problem.
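As an illustrative helper (names are ours), drawing training inputs uniformly on the unit sphere can be done as follows:

```python
import numpy as np

def sample_training_vectors(n, m, seed=0):
    # m vectors drawn uniformly on the unit sphere of R^n, one common
    # choice of inputs x^{(j)} mentioned above.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, n))
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```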

Experiments

In order to confirm our hypothesis, we ran our algorithm on synthetic and natural matrices. Synthetic matrices are obtained by randomly drawing rotations and eigenvalues, while the natural matrix we tried is the covariance matrix of the MNIST dataset.

To evaluate the quality of the approximation, we measure the average angle between $QDQ^Tx$ and $Hx$ for a set of random vectors $x$.
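A sketch of this evaluation metric, assuming the approximation is available as a function `H_approx` mapping $x$ to $\widehat H x$ (the helper names are ours):

```python
import numpy as np

def average_angle_degrees(H_approx, H, X):
    """Mean angle between \\hat{H} x and H x over the rows of X.
    `H_approx` is a callable x -> \\hat{H} x; `H` is an explicit matrix."""
    angles = []
    for x in X:
        a, b = H_approx(x), H @ x
        cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0))))
    return float(np.mean(angles))
```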

The synthetic matrices $H$ are generated as $R\Lambda R^T$ where $R$ is a random rotation matrix, uniformly drawn from the set of rotation matrices, and $\Lambda = \mathrm{diag}(\lambda_i)$ is a diagonal matrix, with $\lambda_i = |\mu_i|$ where $\mu_i$ is sampled from a zero-mean Gaussian distribution with variance $0.1$, except for a small number $n_\mu$ of randomly selected $\mu_i$ that are drawn from a Gaussian of mean $1$ and variance $0.4$.
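A sketch of this generation procedure, assuming 0.1 and 0.4 denote variances (the text leaves this implicit) and using SciPy's Haar-uniform sampler for the random rotation:

```python
import numpy as np
from scipy.stats import ortho_group   # Haar-uniform random orthogonal matrices

def synthetic_matrix(n, n_mu, seed=0):
    """Sketch of the synthetic test matrices H = R diag(|mu_i|) R^T,
    with n_mu randomly placed 'large' eigenvalues."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, np.sqrt(0.1), size=n)
    large = rng.choice(n, size=n_mu, replace=False)
    mu[large] = rng.normal(1.0, np.sqrt(0.4), size=n_mu)
    R = ortho_group.rvs(dim=n, random_state=seed)
    return R @ np.diag(np.abs(mu)) @ R.T
```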


The results are displayed in Figure 1. The average angle goes down to about 35 degrees in a space of dimension 64, with the number of large eigenvalues $n_\mu = 5$.

For sanity check purposes, we also try directly learning a random rotation matrix $R$ using a single $Q$ matrix. The loss function is then

$$ \mathcal{L}'(\omega) = \sum_{j=1}^{m} \|Q_\omega x^{(j)} - R x^{(j)}\|_2^2 $$

The results are also shown in Figure 1. It doesn't perform as well as learning a Hessian, which was expected since we don't have eigenspaces of high dimension, but it still reaches 68 degrees in dimension 64.

We performed another sanity check: if the matrix $R$ is actually of the form of a matrix $Q_\omega$, then we learn it correctly (loss function $\mathcal{L}'$ reaching zero), and we also learn $H = R^T\Lambda R$ with loss function $\mathcal{L}$ going quickly to zero.

Figure 1: Average error angle versus epochs for matrix learning. In green, learning $H$, with the number of larger eigenvalues $n_\mu = 5$. In red, learning a rotation matrix $R$.

It appears that the number of larger eigenvalues $n_\mu$ has an influence on the quality of the approximation: the average error angle after convergence is lower when $n_\mu$ is close to $0$ or to the size of the space (about 20 degrees in the latter case). On the other hand, when the eigenvalues get split into two groups of about the same size, the average error angle is about 41 degrees. The average angle as a function of $n_\mu$ is shown in Figure 2.

We tested our algorithm on the covariance matrix of the MNIST dataset. It reached an angle of 38 degrees. Figure 3 shows the real and approximated covariance matrices.

Figure 2: Average error angle after convergence versus the number of dominant eigenvalues $n_\mu$.

Figure 3: True (left) and approximated (right) covariance matrix of the MNIST dataset.

Hessian matrices of loss functions

In this section, we denote $\widehat H = QDQ^T$ (or $\widehat H_{\widetilde\omega} = Q_{\widetilde\omega} D_{\widetilde\omega} Q_{\widetilde\omega}^T$) the approximated Hessian.


Batch gradient descent

Although learning an approximation of a fixed Hessian matrix has uses, our method can be used in an online fashion to track slowly-varying Hessian matrices over the course of an optimization procedure.

In numerous cases, we have an optimization problem involving too many variables to explicitly compute the Hessian, preventing us from using direct second order methods. Our Hessian representation can be used to store and evaluate the Hessian matrix. Thanks to the learning procedure involved, if the Hessian of the loss function changes, our estimation will be able to adapt.

Let us consider an optimization problem where we aim to minimize a loss function $\ell(u)$ where $u \in \mathbb{R}^n$ is the set of parameters. Therefore, we are looking for

$$ u^* = \underset{u \in \mathbb{R}^n}{\arg\min}\ \ell(u) $$

Or, in the case of a non-convex problem, we are looking for a $u^\dagger$ that gives a 'low' value for $\ell$ ('low' being problem dependent).

In addition, let us suppose we are in a case where it is possible to compute the gradient $\nabla_u \ell$ of $\ell$ at every point $u$. This is the case, for instance, for neural networks, using the backpropagation algorithm, or whenever we have an efficient way to compute the gradient.

It is well known that even if the problem is quadratic, and therefore the Hessian matrix $H$ doesn't depend on the point $u$, a gradient descent algorithm can perform poorly when $H$ has eigenvalues of very different magnitudes. This situation becomes even worse when the loss function is not quadratic in the parameters and the Hessian is not constant (we use the notation $H_u$ for the Hessian matrix at point $u$). However, if $H_u$ is known, we can replace the gradient step $\nabla_u \ell$ by $H_u^{-1}\nabla_u \ell$. This modifies the descent direction to take the curvature into account and allows quadratic convergence if the function is locally quadratic. Since $H_u$ is too large to compute or store, we can rely on approximations, such as in LBFGS or Truncated Conjugate Gradient.
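As a toy illustration of why the preconditioned step helps (our own example, not taken from the paper), consider an ill-conditioned quadratic:

```python
import numpy as np

# f(u) = 0.5 u^T A u with a badly conditioned Hessian A: a plain gradient
# step barely moves along the flat direction, while the step preconditioned
# by H^{-1} = A^{-1} jumps straight to the minimizer of this quadratic.
A = np.diag([100.0, 1.0])
u = np.array([1.0, 1.0])
grad = A @ u
u_gd = u - 0.01 * grad                    # plain gradient step
u_newton = u - np.linalg.solve(A, grad)   # H^{-1} grad: exactly the minimizer here
```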

Our method provides another way to track an approximation of $H_u$. By definition, we have

$$ H_{u+\mathrm{d}u}\,\mathrm{d}u = \nabla_{u+\mathrm{d}u}\ell - \nabla_u \ell + o(\|\mathrm{d}u\|) $$

In the context of minimization, let us suppose we have taken a step from point $u_{t-1}$ and arrived at the current point $u_t$. Then we know that

$$ \delta g = \nabla_{u_t}\ell - \nabla_{u_{t-1}}\ell \approx H_{u_t}(u_t - u_{t-1}) = H_{u_t}\,\delta u \qquad (10) $$

We can use this relation as a training sample for the matrix H .
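As a sketch of where these training pairs come from (illustrative names; Algorithm 2 below interleaves the collection with the Hessian-learning updates instead of separating the two phases):

```python
import numpy as np

def collect_curvature_pairs(grad, u0, lr=0.1, steps=100):
    """Collect (delta u, delta g) training pairs along a plain gradient
    descent trajectory; `grad` maps u to the gradient of the objective."""
    u = np.asarray(u0, dtype=float).copy()
    g = grad(u)
    pairs = []
    for _ in range(steps):
        u_next = u - lr * g
        g_next = grad(u_next)
        pairs.append((u_next - u, g_next - g))
        u, g = u_next, g_next
    return pairs
```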

Thus, we start with an approximated matrix $H_{u_0}$ set to the identity, and we begin a regular gradient descent. At each step, we obtain a training point as defined in Equation 10, and we perform one step in learning $H_{u_1}$ (following the gradient of the loss function $L$). Then we perform a second step in the minimization of $\ell$, following the vector $H_{u_1}^{-1}\cdot\nabla_{u_1}\ell$, and so on. Note that our representation of $H_{u_t}$ allows us to compute the inverse easily. The procedure is summarized in Algorithm 2.

A number of implementation details must be considered:

· It appears that in natural Hessians, many of the eigenvalues are small or zero, especially when we approach a local minimum. Therefore, the inverse of $D$ can diverge. We solve this issue by imposing a minimal value $\epsilon$ on the elements of $D$ when computing their inverse.

· Similarly, eigenvalues can be negative. This is a problem for the minimization, since it makes second-order methods diverge if not taken into account. There are several ways to deal with such eigenvalues. We chose to set the negative elements of $D$ to $\epsilon$ at the same time as the normalization of $Q_{\widetilde\omega}$.

· The norm of $\delta u$ (as defined in Algorithm 2) can vary from one step to the other. However, since the Hessian approximation is linear, scaling $\delta u$ and $\delta g$ by the same factor is equivalent to scaling the learning rate $\alpha$ by the same factor. To have better control over the learning rate, we rescale both $\delta u$ and $\delta g$ by a factor $1/\|\delta u\|$.

Algorithm 2: Optimization with linearithmic Hessian

Parameters: learning rates $\alpha$ and $\beta$
Initialize $\widetilde\omega$ such that $D_{\widetilde\omega} = I$ and $Q_{\widetilde\omega} \in \mathcal{Q}$
Initialize random $u_0$
Set $t = 0$
while not converged do
    Compute $\nabla_{u_t}\ell$
    if $t \neq 0$ then
        Set $\delta u = u_t - u_{t-1}$
        Set $\delta g = \nabla_{u_t}\ell - \nabla_{u_{t-1}}\ell$
        Update $\widetilde\omega \leftarrow \widetilde\omega - \alpha\,\partial L(\widetilde\omega, \delta u, \delta g)/\partial\widetilde\omega$
        Project $Q_{\widetilde\omega}$ onto $\mathcal{Q}$ as in Equation (6)
    end if
    Set $u_{t+1} = u_t - \beta\,\widehat H_{\widetilde\omega}^{-1}\nabla_{u_t}\ell$
    Update $t \leftarrow t + 1$
end while

Although it performs poorly, due to the small eigenvalues, it is also possible to directly learn the inverse Hessian matrix using the relation

$$ \delta u \approx H_{u_t}^{-1}\,\delta g $$

Stochastic and minibatch gradient descent

For most problems, stochastic gradient descent (SGD) performs better than batch descent. Here, we will consider the intermediate case, minibatch descent, where each step uses only a part of the dataset. It covers both plain SGD and batch methods by setting the size of the minibatch to 1 or to the size of the dataset.

In the minibatch setting, the loss function becomes $\ell_A(u)$, where $A$ is a subset of the dataset, changing at each iteration. We have $\ell_A(u) = \ell(u)$ if $A$ is the whole dataset. Because of this difference, the previous analysis doesn't hold anymore, since the Hessian of $\ell_A$ is not going to be the same as the Hessian of $\ell_{A'}$.

However, we will see that if we use the Hessian learning in this setting, it still performs something very similar to minibatch gradient descent on the Hessian approximation. When learning the approximation, instead of training pairs $(\delta u, \delta g)$, we now obtain pairs $(\delta u, \delta g_A)$, where $\delta u = u_t - u_{t-1}$ and $\delta g_A = \nabla_{u_t}\ell_A - \nabla_{u_{t-1}}\ell_A$. Let $\mathcal{A}$ be the whole dataset and $(A_i)_{i=1..p}$ be minibatches (so $A_i \subset \mathcal{A}$). Since the gradient operator is linear, if $\mathcal{A} = \biguplus_{i=1}^{p} A_i$, then

$$ \nabla_u \ell = \frac{1}{|\mathcal{A}|}\sum_{i=1}^{p}\nabla_u \ell_{A_i} $$

This simply states that if we sum gradients, taken at the same point, from a set of minibatches such that the dataset is their disjoint union, then we obtain the batch gradient. We have scaled the batch gradient by the size of the dataset for clarity purposes in the following part.

We will show that the same property holds for the loss function $L$. The loss function for the Hessian approximation corresponding to minibatch $A_i$ is $L(\widetilde\omega, \delta u, \delta g_{A_i})$. Its gradient is

$$ \frac{\partial L(\widetilde\omega, \delta u, \delta g_{A_i})}{\partial\widetilde\omega} = 2\,\frac{\partial \widehat H_{\widetilde\omega}}{\partial\widetilde\omega}\,\delta u\,\big(\widehat H_{\widetilde\omega}\,\delta u - \delta g_{A_i}\big) $$

So, summing over the minibatches, we get

$$ \sum_{i=1}^{p}\frac{\partial L(\widetilde\omega, \delta u, \delta g_{A_i})}{\partial\widetilde\omega} = 2\,\frac{\partial \widehat H_{\widetilde\omega}}{\partial\widetilde\omega}\,\delta u\,\Big(p\,\widehat H_{\widetilde\omega}\,\delta u - \sum_{i=1}^{p}\delta g_{A_i}\Big), $$

i.e. the minibatch gradients of $L$ combine linearly, just as the gradients of $\ell$ do.

Another issue with minibatch descent lies in the difficulty of reusing the gradient computed at step $t-1$ to compute $\delta g$ at step $t$. Indeed, $\delta g_{A_i} = \nabla_{u_t}\ell_{A_i} - \nabla_{u_{t-1}}\ell_{A_i}$. In the batch method, we had already computed $\nabla_{u_{t-1}}\ell_{A}$ at the previous step, and we could reuse it. With the minibatch approach, the previous step used $\nabla_{u_{t-1}}\ell_{A_{i-1}}$, which is not computed on the same minibatch.

The easiest way to overcome this issue is to recompute $\nabla_{u_{t-1}}\ell_{A_i}$ at step $t$, but it implies doubling the number of gradients computed. Although we don't have a fully satisfying answer to this problem, a possibility to reduce the number of gradients computed would be to use the same minibatch $A_i$ a few times consecutively before changing to $A_{i+1}$.

Performance

The approximated Hessians are sparse matrices with $2n$ coefficients each. They can be efficiently implemented using standard sparse linear algebra, such as spblas. The total number of operations for evaluating an approximate Hessian is $O(n\lg(n))$. If we neglect the fact that sparse matrices involve more operations, one evaluation costs $4n\lg(n) + n$ MulAdd operations. Computing the gradient of $\widehat H$ has the same cost, thanks to the backpropagation algorithm. Accumulating the gradients costs another $4n\lg(n) + n$. So tracking the Hessian matrix using our method adds about $12n\lg(n) + 3n$ operations at each iteration and, when using minibatches, possibly another gradient evaluation.

In terms of memory, there are $2n\lg(n) + n$ parameters to store, and another $2n\lg(n) + n$ values are required to store the gradient of $\widehat H$ during learning.

Applications and future work

The linearithmic structure makes the Hessian factorization scalable to larger models. In particular, large inference problems involving Gaussian functions are good candidates for our method. Evaluating a product $x^T R^T D R x$ is even faster, since it can be rewritten $(Rx)^T D (Rx)$, and therefore be performed in $2n\lg(n) + 2n$ operations. This is particularly well adapted to inference, where it is possible to spend some time to learn the approximation of the matrix once and for all. Gaussian mixture models and Bayesian inference with Gaussian models could profit from this approximation.
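A sketch of this quadratic-form evaluation, reusing the `apply_Q` helper from the earlier sketch (names are ours):

```python
def gaussian_quadratic_form(thetas, d, x):
    # x^T Q D Q^T x evaluated as (Q^T x)^T D (Q^T x), i.e. about
    # 2 n lg(n) + 2n operations instead of O(n^2).
    z = apply_Q(thetas, x, transpose=True)
    return float(np.sum(d * z * z))
```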

Although some work remains to be done, speeding up optimization by tracking the approximated Hessian could help optimize functions. In particular, deep neural networks could profit from this method, since their loss functions tend to be non-isotropic, so they could profit from fast approximated second-order methods.


References


[BBG09] Antoine Bordes, Léon Bottou, and Patrick Gallinari. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737–1754, July 2009.

[Ben13] Gérard Ben Arous. Personal communication, 2013.

[BL89] S. Becker and Y. LeCun. Improving the convergence of back-propagation learning with second-order methods. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proc. of the 1988 Connectionist Models Summer School, pages 29–37, San Mateo, 1989. Morgan Kaufmann.

[CE11] Olivier Chapelle and Dumitru Erhan. Improved preconditioner for Hessian free optimization. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[DCM+12] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc Le, Mark Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Ng. Large scale distributed deep networks. In P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1232–1240, 2012.

[Gen98] A. Genz. Methods for generating random orthogonal matrices. In H. Niederreiter and J. Spanier, editors, Monte Carlo and Quasi-Monte Carlo Methods, pages 199–213. Springer, 1998.

[GLC11] Guangzhi Cao, Leonardo R. Bachega, and Charles A. Bouman. The sparse matrix transform for covariance estimation and analysis of high dimensional signals. IEEE Trans. on Image Processing, 20(3):625–640, 2011.

[LBOM98] Y. LeCun, L. Bottou, G. Orr, and K. Muller. Efficient backprop. In G. Orr and K. Muller, editors, Neural Networks: Tricks of the Trade. Springer, 1998.

[LSP93] Y. LeCun, P. Simard, and B. Pearlmutter. Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In S. Hanson, J. Cowan, and L. Giles, editors, Advances in Neural Information Processing Systems (NIPS 1992), volume 5. Morgan Kaufmann Publishers, San Mateo, CA, 1993.

[MSS12] James Martens, Ilya Sutskever, and Kevin Swersky. Estimating the Hessian by back-propagating curvature. In Proc. of ICML, 2012. arXiv:1206.6464.

[NCL+11] Jiquan Ngiam, Adam Coates, Ahbik Lahiri, Bobby Prochnow, Andrew Ng, and Quoc V. Le. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 265–272, 2011.

[Noc80] Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.

[SYG07] Nicol Schraudolph, Jin Yu, and Simon Günter. A stochastic quasi-Newton method for online convex optimization. In Proc. 11th Intl. Conf. on Artificial Intelligence and Statistics (AIstats), pages 433–440. Society for Artificial Intelligence and Statistics, 2007.