
Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients

Tom Schaul (schaul@cims.nyu.edu) and Yann LeCun (yann@cims.nyu.edu)
Courant Institute of Mathematical Sciences, New York University, 715 Broadway, 10003, New York

Abstract

Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes the need for tuning, while automatically reducing learning rates over time on stationary problems and permitting them to grow appropriately on non-stationary tasks. Here, we extend the idea in three directions: addressing proper minibatch parallelization, including reweighted updates for sparse or orthogonal gradients, and improving robustness on non-smooth loss functions, in the process replacing the diagonal Hessian estimation procedure, which may not always be available, with a robust finite-difference approximation. The final algorithm integrates all these components, has linear complexity and is hyper-parameter free.

Introduction

Many machine learning problems can be framed as minimizing a loss function over a large (maybe infinite) number of samples. In representation learning, those loss functions are generally built on top of multiple layers of non-linearities, precluding any direct or closed-form optimization, but admitting (sample) gradients to guide iterative optimization of the loss.

Stochastic gradient descent (SGD) is among the most broadly applicable and widely-used algorithms for such learning tasks, because of its simplicity, robustness and scalability to arbitrarily large datasets. Doing many small but noisy updates instead of fewer large ones (as in batch methods) gives both a speed-up, and makes the learning process less likely to get stuck in sensitive local optima. In addition, SGD is eminently well-suited for learning in non-stationary environments, e.g., when that data stream is generated by a changing environment; but non-stationary adaptivity is useful even on stationary problems, as the initial search phase (before a local optimum is located) of the learning process can be likened to a non-stationary environment.

Given the increasingly wide adoption of machine learning tools, there is an undoubted benefit to making learning algorithms, and SGD in particular, easy to use and hyper-parameter free. In recent work, we made SGD hyper-parameter free by introducing optimal adaptive learning rates that are based on gradient variance estimates [1]. While broadly successful, the approach was limited to smooth loss functions, and to minibatch sizes of one. In this paper, we therefore complement that work, by addressing and resolving the issues of

· minibatches and parallelization,
· sparse gradients, and
· non-smooth loss functions

all while retaining the optimal adaptive learning rates. All of these issues are of practical importance: minibatch parallelization has strong diminishing returns, but in combination with sparse gradients and adaptive learning rates, we show how that effect is drastically mitigated. Robustly dealing with non-smooth loss functions is also a very practical concern: a growing number of learning architectures employ non-smooth nonlinearities, like absolute-value normalization or rectified-linear units. Our final algorithm addresses all of these issues, while remaining simple to implement and of linear complexity.

Background

There are a number of adaptive settings for SGD learning rates, or equivalently, diagonal preconditioning schemes, to be found in the literature, e.g., [2, 3, 4, 5, 6, 7]. The aim of those is generally to increase performance on stochastic optimization tasks, a concern complementary to our focus of producing an algorithm that works robustly without any hyper-parameter tuning. Often those adaptive schemes produce monotonically decreasing rates, however, which makes them no longer applicable to non-stationary tasks.

The remainder of this paper builds upon the adaptive learning rate scheme of [1], which is not monotonically decreasing, so we recapitulate its main results here. Using an idealized quadratic and separable loss function, it is possible to derive an optimal learning rate schedule that preserves the convergence guarantees of SGD. When the problem is approximately separable, the analysis is simplified because all quantities are one-dimensional. The analysis also holds as a local approximation in the non-quadratic but smooth case.

In the idealized case, and for any dimension i , the optimal learning rate can be derived analytically, and takes the following form

$$
\eta^*_i = \frac{1}{h_i} \cdot \frac{(\theta_i - \theta^*_i)^2}{(\theta_i - \theta^*_i)^2 + \sigma_i^2} \tag{1}
$$

where (θ_i − θ*_i) is the distance to the optimal parameter value, and σ²_i and h_i are the local sample variance and curvature, respectively.

We use an exponential moving average with time-constant τ (the approximate number of samples considered from recent memory) for online estimates of the quantities in equation 1:

$$
\begin{aligned}
\overline{g}_i &\leftarrow (1-\tau_i^{-1}) \cdot \overline{g}_i + \tau_i^{-1} \cdot \nabla_{\theta_i} \\
\overline{v}_i &\leftarrow (1-\tau_i^{-1}) \cdot \overline{v}_i + \tau_i^{-1} \cdot \left(\nabla_{\theta_i}\right)^2 \\
\overline{h}_i &\leftarrow (1-\tau_i^{-1}) \cdot \overline{h}_i + \tau_i^{-1} \cdot \left| h_i^{(\mathrm{bbprop})} \right|
\end{aligned}
$$

where the diagonal Hessian entries h_i^(bbprop) are computed using the 'bbprop' procedure [8], and the time-constant (memory) is adapted according to how large a step was taken:

$$
\tau_i \leftarrow \left( 1 - \frac{(\overline{g}_i)^2}{\overline{v}_i} \right) \cdot \tau_i + 1
$$

The final algorithm is called vSGD; it uses the learning rate of equation 1, with the moving-average estimates substituted in, to update the parameters (element-wise):

$$
\theta_i \leftarrow \theta_i - \eta_i^* \cdot \nabla_{\theta_i}, \qquad \eta_i^* = \frac{(\overline{g}_i)^2}{\overline{h}_i \cdot \overline{v}_i}
$$
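To make the recipe concrete, here is a minimal single-dimension sketch of the vSGD update; all names are ours, and `grad_sample` and `curv_sample` stand in for a sample gradient and a bbprop-style sample curvature:

```python
def vsgd_step(theta, state, grad_sample, curv_sample):
    """One scalar vSGD update: maintain moving averages of the gradient (g),
    squared gradient (v) and curvature (h), then step with eta = g^2 / (h * v)."""
    g, v, h, tau = state["g"], state["v"], state["h"], state["tau"]
    it = 1.0 / tau
    g = (1 - it) * g + it * grad_sample
    v = max((1 - it) * v + it * grad_sample ** 2, 1e-12)  # guard against v -> 0
    h = (1 - it) * h + it * abs(curv_sample)
    eta = g * g / (h * v)                  # adaptive learning rate
    tau = (1 - g * g / v) * tau + 1.0      # shrink memory after informative steps
    state.update(g=g, v=v, h=h, tau=tau)
    return theta - eta * grad_sample
```

Since g² ≤ v̄ always holds for these averages, eta never exceeds 1/h, so on a noise-free quadratic the step behaves like a damped Newton step.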

Parallelization with minibatches

Compared to pure online SGD, computation time can be reduced by 'minibatch' parallelization: n sample gradients are computed (simultaneously, e.g., on multiple cores) and then a single update is performed on the resulting averaged minibatch gradient:

$$
\overline{\nabla}_{\theta} = \frac{1}{n} \sum_{j=1}^{n} \nabla_{\theta}^{(j)} \tag{2}
$$

Figure 1: Diminishing returns of minibatch parallelization. Plotted is the relative log-loss gain (per number of sample gradients evaluated) of a given minibatch size compared to the gain of the n = 1 case (in the noisy quadratic scenario from section 2, for different noise levels σ, and assuming optimal learning rates as in equation 4); each figure corresponds to a different sparsity level. For example, the ratio is 0.02 for n = 100 (left plot, low noise): this means that it takes 50 times more samples to obtain the same gain in loss than with pure SGD. Those are strongly diminishing returns, but they are less drastic if the noise level is high (only 5 times more samples in this example). If the sample gradients are somewhat sparse, however, and we use that fact to increase learning rates appropriately, then the diminishing returns kick in only for much larger minibatch sizes; see the left two figures.

While n can be seen as a hyper-parameter of the algorithm [9], it is often constrained to a large extent by the computational hardware, memory requirements and communication bandwidth. A derivation just like the one that led to equation 1 can be used to determine the optimal learning rates automatically, for an arbitrary minibatch size n. The key difference is that the averaging in equation 2 reduces the effective variance by a factor n, leading to:

$$
\eta^*_i(n) = \frac{n \cdot (\overline{g}_i)^2}{\overline{h}_i \left( \overline{v}_i + (n-1)(\overline{g}_i)^2 \right)} \tag{3}
$$

This expresses the intuition that using minibatches reduces the sample noise, in turn permitting larger step sizes: if the noise (or sample diversity) is small, those gains are minimal; if it is large, they are substantial (see Figure 1, left). Varying minibatch sizes tend to be impractical to implement¹, however, and so common practice is to simply fix a minibatch size and then re-tune the learning rates (by a factor between 1 and n). With our adaptive minibatch-aware scheme (equation 3) this is no longer necessary: in fact, we get an automatic transition from initially small effective minibatches (by means of the learning rates) to large effective minibatches toward the end, when the relative noise level is higher.
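The minibatch-aware rate of equation 3 is a one-liner; a sketch with our own variable names:

```python
def minibatch_rate(g_bar, v_bar, h_bar, n):
    """Minibatch-aware adaptive learning rate: averaging n sample gradients
    divides the noise term (v - g^2) by n, so larger n permits larger steps."""
    return (n * g_bar ** 2) / (h_bar * (v_bar + (n - 1) * g_bar ** 2))
```

For n = 1 this reduces to g²/(h·v), and as n grows it approaches the deterministic rate 1/h.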

Sparse gradients

Many common learning architectures (e.g., those using rectified-linear units, or sparsity penalties) lead to sample gradients that are increasingly sparse, that is, non-zero only in a small fraction of the problem dimensions. It is possible to exploit this to speed up learning, by averaging many sparse gradients in a minibatch, or by doing asynchronous updates [10].

Here, we investigate how to set the learning rates in the presence of sparsity, and our result is simply based on the observation that doing an update using a set of sparse gradients is equivalent to doing the same update, but with a smaller effective minibatch size, while ignoring all the zero entries.

We can do this again on an element-by-element basis, where we define z_i to be the number of zero elements in dimension i within the current minibatch (so n − z_i entries are non-zero). In each dimension, we rescale the minibatch gradient accordingly by a factor n/(n − z_i), and at the same time reduce the learning rate to reflect the smaller effective minibatch size. Compounding those two effects gives the optimal learning rate for sparse minibatches (we ignore the case z_i = n, when there is no update):

$$
\eta^*_i(n, z_i) = \frac{n}{n-z_i} \cdot \frac{(n-z_i)\,(\overline{g}_i)^2}{\overline{h}_i \left( \overline{v}_i + (n-z_i-1)(\overline{g}_i)^2 \right)} = \frac{n \cdot (\overline{g}_i)^2}{\overline{h}_i \left( \overline{v}_i + (n-z_i-1)(\overline{g}_i)^2 \right)} \tag{4}
$$

Figure 1 shows how using minibatches with such adaptive learning rates reduces the impact of diminishing returns if the sample gradients are sparse. In other words, with the right learning rates, higher sparsity can be directly translated into higher parallelizability.
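The bookkeeping involved is cheap: count the zero entries per dimension in the current minibatch and plug the effective size n − z_i into the minibatch-aware rate. A pure-Python sketch (names are ours, and we treat exact zeros as the sparsity criterion):

```python
def sparse_update(grads, g_bar, v_bar, h_bar):
    """Per-dimension update for a sparse minibatch. `grads` is a list of n
    sample gradients (each a list of length d); z_i counts the zero entries
    of dimension i, so n - z_i is the effective minibatch size."""
    n, d = len(grads), len(grads[0])
    step = []
    for i in range(d):
        col = [g[i] for g in grads]
        eff = sum(1 for x in col if x != 0.0)    # n - z_i
        if eff == 0:                             # z_i = n: no update
            step.append(0.0)
            continue
        # rescaling by n/(n - z_i) and shrinking the rate to the effective
        # minibatch size compound into a single expression (equation 4):
        eta = (n * g_bar ** 2) / (h_bar * (v_bar + (eff - 1) * g_bar ** 2))
        step.append(eta * sum(col) / n)
    return step
```

With fully orthogonal (one-hot) sample gradients each dimension behaves like n = 1 and the samples are effectively summed; with dense, aligned gradients the usual minibatch average is recovered.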

¹ If the implementation/computational architecture is flexible enough, the variance term of the learning rate can also be used to adapt the minibatch size to its optimal trade-off.

Figure 2: Difference between global or instance-based computation of effective minibatch sizes in the presence of sparse gradients. Our proposed method computes the number of non-zero entries ( n -z i ) in the current mini-batch to set the learning rate (green). This involves some additional computation compared to just using the long-term average sparsity p ( nz ) i (red), but obtains a substantially higher relative gain (see figure 1), especially in the regime where the sparsity level produces minibatches with just one or a few non-zero entries (dent in the curve near n = 1 /p ( nz ) i ). If the noise level is low (left two figures), the effect is much more pronounced than if the noise is higher. For comparison, the performance for 40 different fixed learning-rate SGD settings (between 0.01 and 100) are plotted as yellow dots.


Figure 3: Illustrating the effect of reweighting minibatch gradients. Assume the samples are drawn from 2 different noisy clusters (yellow and light blue vectors), but one of the clusters has a higher probability of occurrence. The regular minibatch gradient is simply their arithmetic average (red), dominated by the more common cluster. The reweighted minibatch gradient (blue) does a full step toward each of the clusters, closely resembling the gradient one would obtain by performing a hard clustering (difficult in practice) on the samples, in dotted green.


An alternative to computing z_i for each minibatch (and each dimension) anew would be to just use the long-term average sparsity p_i^(nz) = E[(n − z_i)/n] instead. Figure 2 shows that this is suboptimal, especially if the noise level is small, and in the regime where each minibatch is expected to contain just a few non-zero entries. This figure also shows that equation 4 produces a higher relative gain than the outer envelope of the performance of all fixed learning rates.




Figure 4: Illustrating the expectation over non-smooth sample losses. In dotted blue, the loss functions for a few individual samples are shown, each a non-smooth function. However, the expectation over a distribution of such functions is smooth, as shown by the thick magenta curve. Left: absolute value; right: rectified-linear function; samples are identical but offset by a value drawn from N(0, 1).

Orthogonal gradients

One reason for the boost in parallelizability when gradients are sparse is that sparse gradients are mostly orthogonal, allowing independent progress in each direction. But sparse gradients are in fact a special case of orthogonal gradients, for which we can obtain similar speed-ups with a reweighting of the minibatch gradients:

$$
\overline{\nabla}_{\theta} = \sum_{j=1}^{n} w_j \nabla_{\theta}^{(j)}, \qquad w_j = \left( \sum_{k=1}^{n} \frac{\left| \left\langle \nabla_{\theta}^{(j)}, \nabla_{\theta}^{(k)} \right\rangle \right|}{\left\| \nabla_{\theta}^{(j)} \right\| \cdot \left\| \nabla_{\theta}^{(k)} \right\|} \right)^{-1}
$$

In other words, each sample is weighted by one over the (smoothed) number of times that its gradient interferes (is non-orthogonal) with another sample's gradient.

In the limit, this scheme simplifies to the sparse-gradient cases discussed above: if all sample gradients are aligned, they are averaged (reweighted by 1/n, corresponding to the dense case in equation 2), and if all sample gradients are orthogonal, they are summed (reweighted by 1, corresponding to the maximally sparse case z_i = n − 1 in equation 4). See Figure 3 for an illustration.

In practice, this reweighting comes at a certain cost, increasing the computational expense of a single iteration from O(nd) to O(n²d), where d is the problem dimension. In other words, it is only likely to be viable if the forward-backward passes of the gradient computation are non-trivial, or if the minibatch size is small.
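A minimal sketch of the reweighting, under our reading of the scheme (the interference count is the sum of absolute cosine similarities with all n sample gradients, including the sample itself; the exact smoothing is an assumption on our part):

```python
import math

def reweighted_gradient(grads):
    """Reweight each sample gradient by one over its total interference with
    the minibatch. Aligned samples get weight 1/n (average); mutually
    orthogonal samples get weight 1 (sum)."""
    n, d = len(grads), len(grads[0])
    norms = [math.sqrt(sum(x * x for x in g)) for g in grads]
    out = [0.0] * d
    for j in range(n):
        if norms[j] == 0.0:
            continue  # a zero gradient contributes nothing
        interference = sum(
            abs(sum(a * b for a, b in zip(grads[j], grads[k]))) / (norms[j] * norms[k])
            for k in range(n) if norms[k] > 0.0)
        w = 1.0 / interference  # >= 1 term (self), so no division by zero
        for i in range(d):
            out[i] += w * grads[j][i]
    return out
```

The two limits from the text are easy to check: identical gradients come out averaged, one-hot gradients come out summed.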

Non-smooth losses

Many commonly used non-linearities (rectified linear units, absolute value normalization, etc.) produce non-smooth sample loss functions. However, when optimizing over a distribution of samples (or just a large enough dataset), the variability between samples can lead to a smooth expected loss function, even though each sample has a non-smooth contribution. Figure 4 illustrates this point for samples that have an absolute value or a rectified linear contribution to the loss.
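A quick numerical check of this smoothing effect (the setup is ours: a symmetric grid of offsets stands in for draws from N(0, 1)):

```python
def second_difference(f, x, h):
    """Central second-difference estimate of local curvature."""
    return (f(x + h) + f(x - h) - 2.0 * f(x)) / (h * h)

# one non-smooth sample loss, and the expectation over a spread of offsets
offsets = [-2.0, -1.0, 0.0, 1.0, 2.0]

def single(theta):
    return abs(theta - 2.0)          # individual sample: |theta - xi|

def expected(theta):                 # Monte-Carlo expectation E|theta - xi|
    return sum(abs(theta - xi) for xi in offsets) / len(offsets)
```

Away from its kink the single sample shows zero measured curvature, while the expectation has strictly positive curvature at the same point.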

It is clear from this observation that, if the sample losses are non-smooth, it is not possible to reliably estimate the curvature of the true expected loss function from the curvature of the individual sample losses (which is zero almost everywhere in the two examples above). This means that our previous approach of estimating the h_i term in the optimal learning rate expression by a moving average of sample curvatures, as estimated by the 'bbprop' procedure [8] (which computes a Gauss-Newton approximation of the diagonal Hessian, at the cost of one additional backward pass), is limited to smooth sample loss functions, and we need a different approach for the general case².

² This also alleviates potential implementation effort, e.g., when using third-party software that does not implement bbprop.

Algorithm 1: vSGD-fd: minibatch-SGD with finite-difference-estimated adaptive learning rates

repeat
    draw n samples, compute the gradients ∇_θ^(j) for each sample j
    compute the gradients on the same samples, with the parameters shifted by δ_i = ḡ_i
    for i ∈ {1, …, d} do
        compute finite-difference curvatures h_i^fd(j) = | (∇_{θ_i}^(j) − ∇_{θ_i+δ_i}^(j)) / δ_i |
        if |∇_{θ_i}^(j) − ḡ_i| > 2·sqrt(v̄_i − ḡ_i²)  or  |h_i^fd − h̄_i^fd| > 2·sqrt(v̄_i^fd − (h̄_i^fd)²) then
            increase memory size for outliers: τ_i ← τ_i + 1
        update moving averages:
            ḡ_i ← (1 − τ_i⁻¹)·ḡ_i + τ_i⁻¹·(1/n)·Σ_j ∇_{θ_i}^(j)
            v̄_i ← (1 − τ_i⁻¹)·v̄_i + τ_i⁻¹·(1/n)·Σ_j (∇_{θ_i}^(j))²
            h̄_i^fd ← (1 − τ_i⁻¹)·h̄_i^fd + τ_i⁻¹·(1/n)·Σ_j h_i^fd(j)
            v̄_i^fd ← (1 − τ_i⁻¹)·v̄_i^fd + τ_i⁻¹·(1/n)·Σ_j (h_i^fd(j))²
        estimate learning rate: η_i* ← (h̄_i^fd / v̄_i^fd) · (n·(ḡ_i)²) / (v̄_i + (n−1)·(ḡ_i)²)
        update memory size: τ_i ← (1 − (ḡ_i)²/v̄_i)·τ_i + 1
        update parameter: θ_i ← θ_i − η_i*·(1/n)·Σ_j ∇_{θ_i}^(j)
until stopping criterion is met

Finite-difference curvature

A good estimate of the relevant curvature for our purposes (i.e., for determining a good learning rate) is obtained not by computing the true Hessian at the current point, but by taking the expectation over noisy finite-difference steps, where those steps are on the same scale as the actually performed update steps, because this is the regime we care about.

In practice, we obtain these finite-difference estimates by computing two gradients of the same sample loss, at points differing by the typical update distance³:

$$
h_i^{fd} = \left| \frac{\nabla_{\theta_i} - \nabla_{\theta_i + \delta_i}}{\delta_i} \right|
$$

where δ_i = ḡ_i. This approach is related to the diagonal Hessian preconditioning in SGD-QN [11], but the step-difference used is different, and the moving-average scheme there decays with time, which loses the suitability for non-stationary problems.
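As a sketch (our names; `grad_fn` is a hypothetical sample-gradient oracle evaluated twice on the same sample):

```python
def fd_curvature(grad_fn, theta, delta):
    """Finite-difference curvature: |(grad(theta) - grad(theta + delta)) / delta|.
    delta should be on the scale of the actual update steps (the paper shifts
    by the smoothed gradient)."""
    return abs((grad_fn(theta) - grad_fn(theta + delta)) / delta)
```

On a quadratic this recovers the exact curvature; stepping across the kink of an absolute-value loss, it reports the large effective curvature at the current step scale, which is precisely what a safe learning rate needs.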

Curvature variability

To further increase robustness, we reuse the intuition that originally motivated vSGD and take into account the variance of the curvature estimates produced by the finite-difference method. To reduce the likelihood of becoming overconfident (underestimating curvature, i.e., overestimating learning rates), we apply a variance normalization based on the signal-to-noise ratio of the curvature estimates.

³ Of course, this estimate does not need to be computed at every step, which can save computation time.


For this purpose we maintain two additional moving averages:

$$
\begin{aligned}
\overline{h}_i^{fd} &\leftarrow (1-\tau_i^{-1}) \cdot \overline{h}_i^{fd} + \tau_i^{-1} \cdot \frac{1}{n} \sum_{j=1}^{n} h_i^{fd\,(j)} \\
\overline{v}_i^{fd} &\leftarrow (1-\tau_i^{-1}) \cdot \overline{v}_i^{fd} + \tau_i^{-1} \cdot \frac{1}{n} \sum_{j=1}^{n} \left( h_i^{fd\,(j)} \right)^2
\end{aligned}
$$

and then compute the curvature term simply as h_i = v̄_i^fd / h̄_i^fd.

Outlier detection

If an outlier sample is encountered while the time constant τ_i is close to one (i.e., the history is mostly discarded from the moving averages at each update), it has the potential to disrupt the optimization process. Here, the statistics we keep for the adaptive learning rates have an additional, unforeseen benefit: they make it trivial to detect outliers.

The outlier's effect can be mitigated relatively simply by increasing the time-constant τ_i before incorporating the sample into the statistics (to make sure old samples are not forgotten); then, as the perceived variance shoots up, the learning rate is automatically reduced. If it was not an outlier, but a genuine change in the data distribution, the algorithm will quickly adapt and increase the learning rates again.

In practice, we use a detection threshold of two standard deviations, and increase the corresponding τ_i by one (see pseudocode).
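The test costs nothing extra, since it reuses the moving averages already maintained for the learning rates; a sketch in our notation:

```python
import math

def outlier_adjusted_tau(sample, mean, mean_sq, tau):
    """Grow the memory by one if the incoming sample lies more than two
    standard deviations from the running mean (std derived from the
    maintained moving averages: var = E[x^2] - E[x]^2)."""
    std = math.sqrt(max(mean_sq - mean * mean, 0.0))
    return tau + 1.0 if abs(sample - mean) > 2.0 * std else tau
```

The same check applies to both the gradient statistics (ḡ, v̄) and the curvature statistics (h̄^fd, v̄^fd).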

Algorithm

Algorithm 1 gives the explicit pseudocode for this finite-difference estimation, in combination with the minibatch-size-adjusted rates from equation 3, termed 'vSGD-fd'. Initialization is akin to that of vSGD, in that all moving averages are bootstrapped on a few samples (10) before any updates are done. It is also wise to add a tiny ε = 10⁻⁵ term where necessary to avoid divisions by zero.
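For concreteness, here is a one-dimensional pure-Python transcription of the full loop (our sketch, omitting the outlier test for brevity; `sample_grad(theta, xi)` is a hypothetical oracle for the gradient of the sample loss at sample xi):

```python
import random

def vsgd_fd(sample_grad, draw_sample, theta, iters, n=2, eps=1e-5):
    """One-dimensional vSGD-fd sketch: finite-difference curvature with
    variance normalization and a minibatch-aware adaptive learning rate."""
    # bootstrap the moving averages on 10 samples, no parameter updates
    boot = [sample_grad(theta, draw_sample()) for _ in range(10)]
    g = sum(boot) / len(boot)
    v = sum(x * x for x in boot) / len(boot) + eps
    delta = g if abs(g) > eps else eps
    hs = [abs((sample_grad(theta, xi) - sample_grad(theta + delta, xi)) / delta)
          for xi in [draw_sample() for _ in range(10)]]
    hfd = sum(hs) / len(hs) + eps
    vfd = sum(x * x for x in hs) / len(hs) + eps
    tau = 10.0
    for _ in range(iters):
        xis = [draw_sample() for _ in range(n)]
        grads = [sample_grad(theta, xi) for xi in xis]
        delta = g if abs(g) > eps else eps        # shift by the smoothed gradient
        fd = [abs((gr - sample_grad(theta + delta, xi)) / delta)
              for gr, xi in zip(grads, xis)]
        it = 1.0 / tau
        g = (1 - it) * g + it * sum(grads) / n
        v = (1 - it) * v + it * sum(x * x for x in grads) / n
        hfd = (1 - it) * hfd + it * sum(fd) / n
        vfd = (1 - it) * vfd + it * sum(x * x for x in fd) / n
        eta = (hfd / vfd) * (n * g * g) / (v + (n - 1) * g * g)  # equation 3 form
        tau = (1 - g * g / v) * tau + 1.0
        theta -= eta * sum(grads) / n
    return theta
```

Because ḡ² ≤ v̄ is preserved by the moving averages, the rate never exceeds (h̄^fd/v̄^fd), keeping the updates stable without any tuning.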

Simulations

An algorithm that has the ambition to work out of the box, without any tuning of hyper-parameters, must be able to pass a number of elementary tests: those may not be sufficient, but they are necessary. To that end, we set up a collection of elementary (one-dimensional) stochastic optimization test cases, varying the shape of the loss function, its curvature, and the noise level. The sample loss functions are

$$
\mathcal{L}^{(j)}(\theta) = A \cdot f\!\left( \theta - \xi^{(j)} \right)
$$

(with f ranging over the four elementary loss shapes, including the quadratic, absolute-value and rectified-linear contributions illustrated in the figures)

where A is the curvature setting and the ξ^(j) are drawn from N(0, σ²). We vary curvature and noise levels by two orders of magnitude, i.e., A ∈ {0.1, 1, 10} and σ² ∈ {0.1, 1, 10}, giving us 9×4 test cases. To visualize the large number of results, we summarize each test-case and algorithm combination in a concise heatmap square (see Figure 5 for the full explanation).

In Figure 6, we show the results for all test cases on a range of algorithms and minibatch sizes n. Each square shows the gain in loss for 100 independent runs of 1024 updates each. Each group of columns corresponds to one of the four functions, with the 9 inner columns using different curvature and noise-level settings. Color scales are identical for all heatmaps within a column, but not across columns. Each group of rows corresponds to one algorithm, with each row using a different hyper-parameter setting, namely initial learning rates η₀ ∈ {0.01, 0.1, 1, 10} (for SGD, ADAGRAD [6] and the natural gradient [7]) and decay rate γ ∈ {0, 1} for SGD. All rows come in pairs, with the upper one using pure SGD (n = 1) and the lower one using minibatches (n = 10).

Figure 5: Explanation of how to read our concise heatmap performance plots (right), based on the more common representation as learning curves (left). In the learning curve representation, we plot one curve for each algorithm and each trial (3x8 total), with a unique color/line-type per algorithm, and the mean performance per algorithm with more contrast. Performance is measured every power of 2 iterations. This gives a good idea of the progress, but becomes quickly hard to read. On the right side, we plot the identical data in heatmap format. Each square corresponds to one algorithm, the horizontal axis are still the iterations (on log 2 scale), and on the vertical axis we arrange (sort) the performance of the different trials at the given iteration. The color scale is as follows: white is the initial loss value, the stronger the blue, the lower the loss, and if the color is reddish, the algorithm overjumped to loss values that are bigger than the initial one. Good algorithm performance is visible when the square becomes blue on the right side, instability is marked in red, and the variability of the algorithm across trials is visible by the color range on the vertical axis.


The findings are clear: in contrast to the other algorithms tested, vSGD-fd does not require any hyper-parameter tuning to give reliably good performance on the broad range of tests: the learning rates adapt automatically to different curvatures and noise levels. In contrast to its predecessor vSGD, it also deals appropriately with non-smooth loss functions. The learning rates are adjusted automatically according to the minibatch size, which improves convergence speed on the noisier test cases (3 left columns), where there is a larger potential gain from minibatches.

The earlier variant (vSGD) was shown to work very robustly on a broad range of real-world benchmarks and non-convex, deep neural network-based loss functions. We expect those results on smooth losses to transfer directly to vSGD-fd. This bodes well for future work that will determine its performance on real-world non-smooth problems.

Conclusion

We have presented a novel variant of SGD with adaptive learning rates that expands on previous work in three directions. The adaptive rates properly take into account the minibatch size, which in combination with sparse gradients drastically alleviates the diminishing returns of parallelization. Also, the curvature estimation procedure is based on a finite-difference approach that can deal with non-smooth sample loss functions. The final algorithm integrates these components, has linear complexity and is hyper-parameter free. Unlike other adaptive schemes, it works on a broad range of elementary test cases, the necessary condition for an out-of-the-box method.

Future work will investigate how to adjust the presented element-wise approach to highly nonseparable problems (tightly correlated gradient dimensions), potentially relying on a low-rank or block-decomposed estimate of the gradient covariance matrix, as in TONGA [12].

Acknowledgments

The authors want to thank Sixin Zhang, Durk Kingma, Daan Wierstra, Camille Couprie, Clément Farabet and Arthur Szlam for helpful discussions. We also thank the reviewers for helpful suggestions, and the 'Open Reviewing Network' for perfectly managing the novel open and transparent reviewing process. This work was funded in part through AFR postdoc grant number 2915104 of the National Research Fund Luxembourg.




Figure 6: Performance comparisons for a number of algorithms (row groups) under different setting variants (rows) and sample loss functions (columns), the latter grouped by loss function shape. Red tones indicate a loss value worsening from its initial value, white corresponds to no progress, and darker blue tones indicate a reduction of loss (in log-scale). For a detailed explanation of how to read the heatmaps, see Figure 5. The new proposed algorithm vSGD-fd (bottom row group) performs well across all functions and noise-level settings, namely fixing the vSGD instability on non-smooth functions like the absolute value. The other algorithms need to have their hyper-parameters tuned to the task to work well.

Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all needs for tuning, while automatically reducing learning rates over time on stationary problems, and permitting learning rates to grow appropriately in non-stationary tasks. Here, we extend the idea in three directions, addressing proper minibatch parallelization, including reweighted updates for sparse or orthogonal gradients, improving robustness on non-smooth loss functions, in the process replacing the diagonal Hessian estimation procedure that may not always be available by a robust finite-difference approximation. The final algorithm integrates all these components, has linear complexity and is hyper-parameter free.

Many machine learning problems can be framed as minimizing a loss function over a large (possibly infinite) number of samples. In representation learning, those loss functions are generally built on top of multiple layers of non-linearities, precluding any direct or closed-form optimization, but admitting (sample) gradients to guide iterative optimization of the loss.

Stochastic gradient descent (SGD) is among the most broadly applicable and widely-used algorithms for such learning tasks, because of its simplicity, robustness and scalability to arbitrarily large datasets. Doing many small but noisy updates instead of fewer large ones (as in batch methods) both gives a speed-up and makes the learning process less likely to get stuck in sensitive local optima. In addition, SGD is eminently well-suited for learning in non-stationary environments, e.g., when the data stream is generated by a changing environment; but non-stationary adaptivity is useful even on stationary problems, as the initial search phase (before a local optimum is located) of the learning process can be likened to a non-stationary environment.

Given the increasingly wide adoption of machine learning tools, there is an undoubted benefit to making learning algorithms, and SGD in particular, easy to use and hyper-parameter free. In recent work, we made SGD hyper-parameter free by introducing optimal adaptive learning rates that are based on gradient variance estimates [1]. While broadly successful, the approach was limited to smooth loss functions, and to minibatch sizes of one. In this paper, we therefore complement that work, by addressing and resolving the issues of

minibatches and parallelization,

sparse gradients, and

non-smooth loss functions

all while retaining the optimal adaptive learning rates. All of these issues are of practical importance: minibatch parallelization has strong diminishing returns, but we show how that effect is drastically mitigated when sparse gradients are combined with adaptive learning rates. Robustly dealing with non-smooth loss functions is also a practical concern: a growing number of learning architectures employ non-smooth nonlinearities, such as absolute-value normalization or rectified linear units. Our final algorithm addresses all of these, while remaining simple to implement and of linear complexity.

A number of adaptive schemes for SGD learning rates, or equivalently, diagonal preconditioning schemes, can be found in the literature, e.g., [2, 3, 4, 5, 6, 7]. Their aim is generally to increase performance on stochastic optimization tasks, a concern complementary to our focus of producing an algorithm that works robustly without any hyper-parameter tuning. Many of those schemes produce monotonically decreasing rates, however, which makes them inapplicable to non-stationary tasks.

The remainder of this paper builds upon the adaptive learning rate scheme of [1], which is not monotonically decreasing, so we recapitulate its main results here. Using an idealized quadratic and separable loss function, it is possible to derive an optimal learning rate schedule which preserves the convergence guarantees of SGD. When the problem is approximately separable, the analysis is simplified as all quantities are one-dimensional. The analysis also holds as a local approximation in the non-quadratic but smooth case.

In the idealized case, and for any dimension $i$, the optimal learning rate can be derived analytically, and takes the following form:

$$\eta_i^{*}=\frac{1}{h_i}\cdot\frac{(\theta_i-\theta_i^{*})^{2}}{(\theta_i-\theta_i^{*})^{2}+\sigma_i^{2}} \tag{1}$$

where $(\theta_i-\theta_i^{*})$ is the distance to the optimal parameter value, and $\sigma_i^{2}$ and $h_i$ are the local sample variance and curvature, respectively.

We use an exponential moving average with time-constant $\tau$ (the approximate number of samples considered from recent memory) for online estimates of the quantities in equation 1:

$$\overline{g_i}\leftarrow(1-\tau_i^{-1})\,\overline{g_i}+\tau_i^{-1}\,\nabla_{\theta_i},\qquad \overline{v_i}\leftarrow(1-\tau_i^{-1})\,\overline{v_i}+\tau_i^{-1}\,(\nabla_{\theta_i})^{2},\qquad \overline{h_i}\leftarrow(1-\tau_i^{-1})\,\overline{h_i}+\tau_i^{-1}\,h_i^{(bbprop)}$$

where the diagonal Hessian entries $h_i^{(bbprop)}$ are computed using the ‘bbprop’ procedure [8], and the time-constant (memory) is adapted according to how large a step was taken:

$$\tau_i\leftarrow\left(1-\frac{(\overline{g_i})^{2}}{\overline{v_i}}\right)\tau_i+1$$

The final algorithm is called vSGD; it uses the learning rates from equation 1 to update the parameters (element-wise):

$$\theta_i\leftarrow\theta_i-\eta_i\,\nabla_{\theta_i},\qquad \eta_i=\frac{(\overline{g_i})^{2}}{\overline{h_i}\,\overline{v_i}}$$
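As a concrete sketch, the per-dimension vSGD step described in the surrounding text can be written as follows. The variable names (`g_bar`, `v_bar`, `h_bar`, `tau`) are our own shorthand for the overbar quantities, and the small $\epsilon$ guards follow the paper's later suggestion of adding a tiny term to avoid division by zero:

```python
import numpy as np

def vsgd_update(theta, grad, h_bbprop, state, tau_min=2.0, eps=1e-5):
    """One element-wise vSGD step (a sketch of the scheme from [1]).

    `state` holds moving averages of the gradient (g), squared gradient
    (v) and curvature (h), plus the per-dimension memory tau.
    """
    g_bar, v_bar, h_bar, tau = (state[k] for k in ("g", "v", "h", "tau"))
    inv = 1.0 / tau
    g_bar = (1 - inv) * g_bar + inv * grad
    v_bar = (1 - inv) * v_bar + inv * grad**2
    h_bar = (1 - inv) * h_bar + inv * np.abs(h_bbprop)
    # Optimal element-wise learning rate: eta_i = g_bar^2 / (h_bar * v_bar).
    eta = g_bar**2 / (h_bar * v_bar + eps)
    theta = theta - eta * grad
    # Memory adapts to step size: forget faster after large steps.
    tau = (1 - g_bar**2 / (v_bar + eps)) * tau + 1
    state.update(g=g_bar, v=v_bar, h=h_bar, tau=np.maximum(tau, tau_min))
    return theta, eta
```

Note that the learning rate shrinks automatically when the gradient variance dominates (`v_bar` much larger than `g_bar**2`), and grows again when gradients become consistent.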

Compared to pure online SGD, computation time can be reduced by “minibatch” parallelization: $n$ sample-gradients are computed (simultaneously, e.g., on multiple cores) and then a single update is performed on the resulting averaged minibatch gradient:

$$\nabla_{\theta_i}^{(n)}=\frac{1}{n}\sum_{j=1}^{n}\nabla_{\theta_i}^{(j)} \tag{2}$$

While $n$ can be seen as a hyper-parameter of the algorithm [9], it is often constrained to a large extent by the computational hardware, memory requirements and communication bandwidth. A derivation just like the one that led to equation 1 can be used to determine the optimal learning rates automatically, for an arbitrary minibatch size $n$. The key difference is that the averaging in equation 2 reduces the effective variance by a factor $n$, leading to:

$$\eta_i^{*(n)}=\frac{n\,(\overline{g_i})^{2}}{\overline{h_i}\left(\overline{v_i}+(n-1)(\overline{g_i})^{2}\right)} \tag{3}$$

This expresses the intuition that using minibatches reduces the sample noise, in turn permitting larger step sizes: if the noise (or sample diversity) is small, those gains are minimal; if it is large, they are substantial (see Figure 1, left). Varying minibatch sizes tend to be impractical to implement, however (although if the implementation/computational architecture is flexible enough, the variance term of the learning rate can also be used to adapt the minibatch size to its optimal trade-off), and so common practice is to simply fix a minibatch size, and then re-tune the learning rates (by a factor between 1 and $n$). With our adaptive minibatch-aware scheme (equation 3) this is no longer necessary: in fact, we get an automatic transition from initially small effective minibatches (by means of the learning rates) to large minibatches toward the end, when the relative noise level is higher.
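Under the variance-reduction argument above (the sample variance $\overline{v_i}-(\overline{g_i})^{2}$ is divided by $n$), the minibatch-aware rate can be computed per dimension as follows; the function and argument names are our own:

```python
def minibatch_rate(g_bar, v_bar, h_bar, n):
    """Optimal rate for minibatch size n (sketch).

    Derived by replacing the sample variance v_bar - g_bar^2 in the
    n = 1 rate with its n-sample average (v_bar - g_bar^2) / n.
    """
    noise = (v_bar - g_bar**2) / n  # averaging shrinks the variance
    return g_bar**2 / (h_bar * (g_bar**2 + noise))
```

With zero noise (`v_bar == g_bar**2`) the rate is independent of `n`: minibatches buy nothing. With large noise the rate grows almost linearly in `n`, matching the intuition in the text.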

Many common learning architectures (e.g., those using rectified linear units, or sparsity penalties) lead to sample gradients that are increasingly sparse, that is, they are non-zero only in a small fraction of the problem dimensions. It is possible to exploit this to speed up learning, by averaging many sparse gradients in a minibatch, or by doing asynchronous updates [10].

Here, we investigate how to set the learning rates in the presence of sparsity, and our result is simply based on the observation that doing an update using a set of sparse gradients is equivalent to doing the same update, but with a smaller effective minibatch size, while ignoring all the zero entries.

We can do this again on an element-by-element basis, where we define $z_i$ to be the number of zero-valued entries in dimension $i$ within the current minibatch (so that $n-z_i$ entries are non-zero). In each dimension, we rescale the minibatch gradient accordingly by a factor $n/(n-z_i)$, and at the same time reduce the learning rate to reflect the smaller effective minibatch size $n-z_i$. Compounding those two effects gives the optimal learning rate for sparse minibatches (we ignore the case $z_i=n$, when there is no update):

$$\eta_i^{*(n,z_i)}=\frac{n\,(\overline{g_i})^{2}}{\overline{h_i}\left(\overline{v_i}+(n-z_i-1)(\overline{g_i})^{2}\right)} \tag{4}$$

Figure 1 shows how using minibatches with such adaptive learning rates reduces the impact of diminishing returns if the sample gradients are sparse. In other words, with the right learning rates, higher sparsity can be directly translated into higher parallelizability.

An alternative to computing $z_i$ for each minibatch (and each dimension) anew would be to just use the long-term average sparsity $p_i^{(nz)}=\mathbb{E}\left[n-z_i\right]$ instead. Figure 2 shows that this is suboptimal, especially if the noise level is small, and in the regime where each minibatch is expected to contain just a few non-zero entries. This figure also shows that equation 4 produces a higher relative gain compared to the outer envelope of the performance of all fixed learning rates.
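The per-minibatch computation of effective sizes can be sketched as follows. This is our own reading of the compounded rescaling and rate reduction described above (counting zeros per dimension, rescaling by $n/(n-z_i)$, and using the effective minibatch size in the rate); dimensions with no non-zero entry are skipped, as in the text:

```python
import numpy as np

def sparse_minibatch_update(grads, g_bar, v_bar, h_bar):
    """Sparsity-aware update direction for one minibatch (sketch).

    `grads` has shape (n, d). z counts zero entries per dimension; the
    averaged gradient is rescaled by n / (n - z) and the learning rate
    uses the effective minibatch size n - z.
    """
    n, d = grads.shape
    z = np.sum(grads == 0, axis=0)
    nz = n - z                                # effective minibatch size
    update = np.zeros(d)
    a = nz > 0                                # active dimensions
    rescaled = grads.mean(axis=0)[a] * n / nz[a]
    noise = (v_bar[a] - g_bar[a]**2) / nz[a]  # variance for nz samples
    eta = g_bar[a]**2 / (h_bar[a] * (g_bar[a]**2 + noise))
    update[a] = eta * rescaled
    return update
```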

One reason for the boost in parallelizability if the gradients are sparse comes from the fact that sparse gradients are mostly orthogonal, allowing independent progress in each direction. But sparse gradients are in fact a special case of orthogonal gradients, for which we can obtain similar speedups with a reweighting of the minibatch gradients.

Concretely, each sample is weighted by the inverse of the (smoothed) number of other sample gradients with which its gradient interferes (i.e., is non-orthogonal).

In the limit, this scheme simplifies to the sparse-gradient cases discussed above: if all sample gradients are aligned, they are averaged (reweighted by $1/n$, corresponding to the dense case in equation 2), and if all sample gradients are orthogonal, they are summed (reweighted by 1, corresponding to the maximally sparse case $z_i=n-1$ in equation 4). See Figure 3 for an illustration.

In practice, this reweighting comes at a certain cost, increasing the computational expense of a single iteration from $\mathcal{O}(nd)$ to $\mathcal{O}(n^{2}d)$, where $d$ is the problem dimension. In other words, it is only likely to be viable if the forward-backward passes of the gradient computation are non-trivial, or if the minibatch size is small.
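A minimal sketch of the $\mathcal{O}(n^{2}d)$ reweighting follows. The text only specifies "one over the (smoothed) number of interfering gradients"; using the absolute cosine similarity as a smooth interference count is our own choice, but it reproduces both limiting cases:

```python
import numpy as np

def reweighted_minibatch_gradient(grads):
    """Reweighted minibatch gradient (sketch); `grads` has shape (n, d).

    |cos| between each pair of sample gradients serves as a smooth
    per-pair 'interference' measure; each row's weight is the inverse
    of its total interference (which includes itself, so it is >= 1).
    """
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.maximum(norms, 1e-12)
    interference = np.abs(unit @ unit.T).sum(axis=1)  # O(n^2 d) step
    weights = 1.0 / interference
    return weights @ grads
```

If all rows are aligned, each weight is $1/n$ and the result is the plain average; if all rows are orthogonal, each weight is 1 and the gradients are summed, as described in the limits above.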

Many commonly used non-linearities (rectified linear units, absolute value normalization, etc.) produce non-smooth sample loss functions. However, when optimizing over a distribution of samples (or just a large enough dataset), the variability between samples can lead to a smooth expected loss function, even though each sample has a non-smooth contribution. Figure 4 illustrates this point for samples that have an absolute value or a rectified linear contribution to the loss.

It is clear from this observation that if the sample losses are non-smooth, it is not possible to reliably estimate the curvature of the true expected loss function from the curvature of the individual sample losses (which are all zero in the two examples above). This means that our previous approach of estimating the $h_i$ term in the optimal learning rate expression by a moving average of sample curvatures, as estimated by the “bbprop” procedure [8] (which computes a Gauss-Newton approximation of the diagonal Hessian, at the cost of one additional backward pass), is limited to smooth sample loss functions, and we need a different approach for the general case (this also alleviates potential implementation effort, e.g., when using third-party software that does not implement bbprop).

A good estimate of the relevant curvature for our purposes (i.e., for determining a good learning rate) is obtained not by computing the true Hessian at the current point, but by taking the expectation over noisy finite-difference steps, where those steps are on the same scale as the actually performed update steps, because this is the regime we care about.

In practice, we obtain these finite-difference estimates by computing two gradients of the same sample loss, at points differing by the typical update distance (this estimate does not need to be computed at every step, which can save computation time):

$$h_{i}^{fd}=\left|\frac{\nabla_{\theta_{i}}-\nabla_{\theta_{i}+\delta_{i}}}{\delta_{i}}\right| \tag{6}$$

where $\delta_i=\overline{g_i}$. This approach is related to the diagonal Hessian preconditioning in SGD-QN [11], but the step-difference used is different, and the moving average scheme there decays with time, which loses the suitability for non-stationary problems.
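The finite-difference curvature estimate can be sketched as follows; `grad_fn` is a hypothetical function returning the gradient of one sample loss, and the small `eps` guard against a zero step is our addition:

```python
import numpy as np

def fd_curvature(grad_fn, theta, g_bar, eps=1e-5):
    """Finite-difference curvature estimate (sketch of equation 6).

    Both gradients are taken on the *same* sample, at points
    delta = g_bar apart, i.e. on the scale of a typical update step.
    """
    delta = g_bar
    g0 = grad_fn(theta)
    g1 = grad_fn(theta + delta)
    return np.abs((g0 - g1) / (delta + eps))
```

For a quadratic sample loss with gradient $a\theta$ this recovers $|a|$ exactly, while for non-smooth losses it measures the gradient change over the step actually taken, which is the quantity that matters for the learning rate.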

To further increase robustness, we reuse the intuition that originally motivated vSGD and take into account the variance of the curvature estimates produced by the finite-difference method. To reduce the likelihood of becoming overconfident (underestimating curvature, i.e., overestimating learning rates), we use a variance normalization based on the signal-to-noise ratio of the curvature estimates.

For this purpose we maintain two additional moving averages:

$$\overline{h_i}^{fd}\leftarrow(1-\tau_i^{-1})\,\overline{h_i}^{fd}+\tau_i^{-1}\,h_i^{fd},\qquad \overline{v_i}^{fd}\leftarrow(1-\tau_i^{-1})\,\overline{v_i}^{fd}+\tau_i^{-1}\left(h_i^{fd}\right)^{2}$$

and then compute the curvature term simply as $\overline{h_i}=\overline{v_i}^{fd}/\overline{h_i}^{fd}$.
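A sketch of this variance normalization, with our own names for the two moving averages:

```python
def robust_curvature(h_fd, state, tau):
    """Variance-normalized curvature h_bar = v_fd / h_fd_bar (sketch).

    Since E[x^2] / E[x] >= E[x] for a non-negative x, noisy curvature
    estimates yield a *larger* h_bar, i.e. more conservative rates,
    while consistent estimates leave the curvature unchanged.
    """
    inv = 1.0 / tau
    state["h_fd"] = (1 - inv) * state["h_fd"] + inv * h_fd
    state["v_fd"] = (1 - inv) * state["v_fd"] + inv * h_fd**2
    return state["v_fd"] / state["h_fd"]
```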

If an outlier sample is encountered while the time constant $\tau_i$ is close to one (i.e., the history is mostly discarded from the moving averages at each update), it has the potential to disrupt the optimization process. Here, the statistics we keep for the adaptive learning rates have an additional, unforeseen benefit: they make it trivial to detect outliers.

The outlier’s effect can be mitigated relatively simply: we increase the time-constant $\tau_i$ before incorporating the sample into the statistics (to make sure old samples are not forgotten), and then, because the perceived variance shoots up, the learning rate is automatically reduced. If it was not an outlier but a genuine change in the data distribution, the algorithm will quickly adapt and increase the learning rates again.

In practice, we use a detection threshold of two standard deviations, and increase the corresponding $\tau_i$ by one (see pseudocode).
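The detection rule can be sketched as follows, reusing the running mean `g` and second moment `v` already kept for the learning rates (names are ours):

```python
import numpy as np

def flag_outliers(grad, state, threshold=2.0):
    """Outlier guard (sketch): flag samples beyond `threshold` standard
    deviations of the running gradient statistics, and extend tau by
    one for those dimensions before the sample is absorbed."""
    std = np.sqrt(np.maximum(state["v"] - state["g"]**2, 0.0))
    outlier = np.abs(grad - state["g"]) > threshold * std
    state["tau"] = state["tau"] + outlier  # booleans add as 0/1
    return outlier
```

The extended memory keeps the history from being flushed; the subsequent jump in the variance estimate then shrinks the learning rate automatically, as described above.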

Algorithm 1 gives the explicit pseudocode for this finite-difference estimation, in combination with the minibatch size-adjusted rates from equation 3, termed “vSGD-fd”. Initialization is akin to that of vSGD, in that all moving averages are bootstrapped on a few samples (10) before any updates are done. It is also wise to add a tiny $\epsilon=10^{-5}$ term where necessary to avoid divisions by zero.

An algorithm that has the ambition to work out-of-the-box, without any tuning of hyper-parameters, must be able to pass a number of elementary tests: those may not be sufficient, but they are necessary. To that purpose, we set up a collection of elementary (one-dimensional) stochastic optimization test cases, varying the shape of the loss function, its curvature, and the noise level. The sample loss functions are

where $A$ is the curvature setting and the $\xi^{(j)}$ are drawn from $\mathcal{N}(0,\sigma^{2})$. We vary curvature and noise levels by two orders of magnitude, i.e., $A\in\{0.1,1,10\}$ and $\sigma^{2}\in\{0.1,1,10\}$, giving us $9\times 4$ test cases. To visualize the large number of results, we summarize each test case and algorithm combination in a concise heatmap square (see Figure 5 for the full explanation).

In Figure 6, we show the results for all test cases on a range of algorithms and minibatch sizes $n$. Each square shows the gain in loss for 100 independent runs of 1024 updates each. Each group of columns corresponds to one of the four functions, with the 9 inner columns using different curvature and noise level settings. Color scales are identical for all heatmaps within a column, but not across columns. Each group of rows corresponds to one algorithm, with each row using a different hyper-parameter setting, namely initial learning rates $\eta_0\in\{0.01,0.1,1,10\}$ (for SGD, AdaGrad [6] and the natural gradient [7]) and decay rate $\gamma\in\{0,1\}$ for SGD. All rows come in pairs, with the upper one using pure SGD ($n=1$) and the lower one using minibatches ($n=10$).

The findings are clear: in contrast to the other algorithms tested, vSGD-fd does not require any hyper-parameter tuning to give reliably good performance on the broad range of tests: the learning rates adapt automatically to different curvatures and noise levels. And in contrast to the predecessor vSGD, it also deals with non-smooth loss functions appropriately. The learning rates are adjusted automatically according to the minibatch size, which improves convergence speed on the noisier test cases (3 left columns), where there is a larger potential gain from minibatches.

The earlier variant (vSGD) was shown to work very robustly on a broad range of real-world benchmarks and non-convex, deep neural network-based loss functions. We expect those results on smooth losses to transfer directly to vSGD-fd. This bodes well for future work that will determine its performance on real-world non-smooth problems.

We have presented a novel variant of SGD with adaptive learning rates that expands on previous work in three directions. The adaptive rates properly take into account the minibatch size, which in combination with sparse gradients drastically alleviates the diminishing returns of parallelization. Also, the curvature estimation procedure is based on a finite-difference approach that can deal with non-smooth sample loss functions. The final algorithm integrates these components, has linear complexity and is hyper-parameter free. Unlike other adaptive schemes, it works on a broad range of elementary test cases, the necessary condition for an out-of-the-box method.

Future work will investigate how to adjust the presented element-wise approach to highly nonseparable problems (tightly correlated gradient dimensions), potentially relying on a low-rank or block-decomposed estimate of the gradient covariance matrix, as in TONGA [12].

The authors want to thank Sixin Zhang, Durk Kingma, Daan Wierstra, Camille Couprie, Clément Farabet and Arthur Szlam for helpful discussions. We also thank the reviewers for helpful suggestions, and the ‘Open Reviewing Network’ for perfectly managing the novel open and transparent reviewing process. This work was funded in part through AFR postdoc grant number 2915104, of the National Research Fund Luxembourg.

Figure 1: Diminishing returns of minibatch parallelization. Plotted is the relative log-loss gain (per number of sample gradients evaluated) of a given minibatch size compared to the gain of the $n=1$ case (in the noisy quadratic scenario from section 2, for different noise levels $\sigma$, and assuming optimal learning rates as in equation 4); each figure corresponds to a different sparsity level. For example, the ratio is 0.02 for $n=100$ (left plot, low noise): this means that it takes 50 times more samples to obtain the same gain in loss as with pure SGD. Those are strongly diminishing returns, but they are less drastic if the noise level is high (only 5 times more samples in this example). If the sample gradients are somewhat sparse, however, and we use that fact to increase learning rates appropriately, then the diminishing returns kick in only for much larger minibatch sizes; see the left two figures.

Figure 2: Difference between global and instance-based computation of effective minibatch sizes in the presence of sparse gradients. Our proposed method computes the number of non-zero entries ($n-z_i$) in the current minibatch to set the learning rate (green). This involves some additional computation compared to just using the long-term average sparsity $p_i^{(nz)}$ (red), but obtains a substantially higher relative gain (see Figure 1), especially in the regime where the sparsity level produces minibatches with just one or a few non-zero entries (dent in the curve near $n=1/p_i^{(nz)}$). If the noise level is low (left two figures), the effect is much more pronounced than if the noise is higher. For comparison, the performance of 40 different fixed learning-rate SGD settings (between 0.01 and 100) is plotted as yellow dots.

Figure 3: Illustrating the effect of reweighting minibatch gradients. Assume the samples are drawn from 2 different noisy clusters (yellow and light blue vectors), but one of the clusters has a higher probability of occurrence. The regular minibatch gradient is simply their arithmetic average (red), dominated by the more common cluster. The reweighted minibatch gradient (blue) does a full step toward each of the clusters, closely resembling the gradient one would obtain by performing a hard clustering (difficult in practice) on the samples, in dotted green.

Figure 4: Illustrating the expectation over non-smooth sample losses. In dotted blue, the loss functions for a few individual samples are shown, each a non-smooth function. However, the expectation over a distribution of such functions is smooth, as shown by the thick magenta curve. Left: absolute value; right: rectified linear function; samples are identical but offset by a value drawn from $\mathcal{N}(0,1)$.

Figure 5: Explanation of how to read our concise heatmap performance plots (right), based on the more common representation as learning curves (left). In the learning curve representation, we plot one curve for each algorithm and each trial (3x8 total), with a unique color/line-type per algorithm, and the mean performance per algorithm with more contrast. Performance is measured every power of 2 iterations. This gives a good idea of the progress, but quickly becomes hard to read. On the right side, we plot the identical data in heatmap format. Each square corresponds to one algorithm, the horizontal axis still shows the iterations (on $\log_2$ scale), and on the vertical axis we arrange (sort) the performance of the different trials at the given iteration. The color scale is as follows: white is the initial loss value, the stronger the blue, the lower the loss, and if the color is reddish, the algorithm overjumped to loss values that are bigger than the initial one. Good algorithm performance is visible when the square becomes blue on the right side, instability is marked in red, and the variability of the algorithm across trials is visible by the color range on the vertical axis.


References


[1] Schaul, T, Zhang, S, and LeCun, Y. No More Pesky Learning Rates. Technical report, June 2012.

[2] Jacobs, R. A. Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4):295–307, January 1988.

[3] Almeida, L and Langlois, T. Parameter adaptation in stochastic optimization. In On-line Learning in Neural Networks, 1999.

[4] George, A. P and Powell, W. B. Adaptive stepsizes for recursive estimation with applications in approximate dynamic programming. Machine Learning, 65(1):167–198, May 2006.

[5] Le Roux, N and Fitzgibbon, A. A fast natural Newton method.

[6] Duchi, J. C, Hazan, E, and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. 2010.

[7] Amari, S, Park, H, and Fukumizu, K. Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12(6):1399–1409, 2000.

[8] LeCun, Y, Bottou, L, Orr, G, and Muller, K. Efficient backprop. In Orr, G and Muller, K, editors, Neural Networks: Tricks of the Trade. Springer, 1998.

[9] Byrd, R, Chin, G, Nocedal, J, and Wu, Y. Sample size selection in optimization methods for machine learning. Mathematical Programming, 2012.

[10] Niu, F, Recht, B, Re, C, and Wright, S. J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 2011.

[11] Bordes, A, Bottou, L, and Gallinari, P. SGD-QN: Careful quasi-Newton stochastic gradient descent. Journal of Machine Learning Research, 10:1737–1754, July 2009.

[12] Le Roux, N, Manzagol, P, and Bengio, Y. Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, 2008.