
Deep learning with Elastic Averaging SGD

Sixin Zhang (zsx@cims.nyu.edu), Courant Institute, NYU; Anna Choromanska, Courant Institute, NYU; Yann LeCun, Center for Data Science, NYU & Facebook AI Research

Abstract

We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers) is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between the local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide the stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose a momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. The asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches, and furthermore is very communication efficient.


Introduction

One of the most challenging problems in large-scale machine learning is how to parallelize the training of large models that use a form of stochastic gradient descent (SGD) [1]. There have been attempts to parallelize SGD-based training for large-scale deep learning models on a large number of CPUs, including Google's DistBelief system [2]. But practical image recognition systems consist of large-scale convolutional neural networks trained on a few GPU cards sitting in a single computer [3, 4]. The main challenge is to devise parallel SGD algorithms to train large-scale deep learning models that yield a significant speedup when run on multiple GPU cards.

In this paper we introduce the Elastic Averaging SGD method (EASGD) and its variants. EASGD is motivated by the quadratic penalty method [5], but is re-interpreted as a parallelized extension of the averaging SGD algorithm [6]. The basic idea is to let each worker maintain its own local parameter; the communication and coordination of work among the local workers is based on an elastic force which links the parameters they compute with a center variable stored by the master. The center variable is updated as a moving average, where the average is taken in time and also in space over the parameters computed by the local workers. The main contribution of this paper is a new algorithm that provides fast convergent minimization while outperforming the DOWNPOUR method [2] and other baseline approaches in practice. It simultaneously reduces the communication overhead between the master and the local workers while maintaining high-quality performance measured by the test error. The new algorithm applies to deep learning settings such as parallelized training of convolutional neural networks.

The article is organized as follows. Section 2 explains the problem setting, Section 3 presents the synchronous EASGD algorithm and its asynchronous and momentum-based variants, Section 4 provides stability analysis of EASGD and ADMM in the round-robin scheme, Section 5 shows experimental results and Section 6 concludes. The Supplement contains additional material including additional theoretical analysis.

Problem setting

Consider minimizing a function F ( x ) in a parallel computing environment [7] with p ∈ N workers and a master. In this paper we focus on the stochastic optimization problem of the following form

$$
\min_{x} \; F(x) := \mathbb{E}\big[ f(x, \xi) \big] \qquad (1)
$$

where x is the model parameter to be estimated and ξ is a random variable that follows the probability distribution P over Ω such that F ( x ) = ∫ Ω f ( x, ξ ) P ( dξ ) . The optimization problem in Equation 1 can be reformulated as follows

$$
\min_{x^1,\ldots,x^p,\tilde{x}} \;\; \sum_{i=1}^{p} \mathbb{E}\big[ f(x^i, \xi^i) \big] + \frac{\rho}{2} \left\| x^i - \tilde{x} \right\|^2 \qquad (2)
$$

where each ξ^i follows the same distribution P (thus we assume each worker can sample the entire dataset). In the paper we refer to the x^i's as local variables and to x̃ as the center variable. The problem of the equivalence of these two objectives is studied in the literature and is known as the augmentability or the global variable consensus problem [8, 9]. The quadratic penalty term ρ in Equation 2 is expected to ensure that the local workers do not fall into different attractors that are far away from the center variable. This paper focuses on the problem of reducing the parameter communication overhead between the master and the local workers [10, 2, 11, 12, 13]. The problem of data communication when the data is distributed among the workers [7, 14] is a more general problem and is not addressed in this work. We emphasize, however, that our problem setting remains highly non-trivial under the communication constraints due to the existence of many local optima [15].
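To make the consensus objective concrete, here is a minimal numeric sketch (ours, not from the paper; the quadratic f, the synthetic data, and all constants are purely illustrative) of the penalized objective in Equation 2 for a one-dimensional problem:

```python
import numpy as np

# Hypothetical 1-D instance of Equation 2 with f(x, xi) = (x - xi)^2 / 2,
# so F(x) = E[f(x, xi)]; every worker samples the same distribution P.
rng = np.random.default_rng(0)
p, rho = 4, 1.0
data = [rng.normal(loc=1.0, size=100) for _ in range(p)]

def penalized_objective(xs, x_center):
    """Sum over workers of E[f(x^i, xi^i)] + (rho/2) * (x^i - x_center)^2."""
    return sum(np.mean((x - d) ** 2) / 2 + rho / 2 * (x - x_center) ** 2
               for x, d in zip(xs, data))

# When every local variable coincides with the center, the penalty vanishes,
# so moving the center away can only increase the objective.
assert penalized_objective([0.5] * p, 0.5) < penalized_objective([0.5] * p, 0.0)
```

The penalty term is what discourages the local variables from settling into attractors far from the center.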

EASGD update rule

The EASGD updates, captured in Equations 3 and 4 respectively, are obtained by taking a gradient descent step on the objective in Equation 2 with respect to the variables x^i and x̃ respectively:

$$
x^i_{t+1} = x^i_t - \eta \left( g^i_t(x^i_t) + \rho\, (x^i_t - \tilde{x}_t) \right) \qquad (3)
$$

$$
\tilde{x}_{t+1} = \tilde{x}_t + \eta \sum_{i=1}^{p} \rho\, (x^i_t - \tilde{x}_t) \qquad (4)
$$

where g^i_t(x^i_t) denotes the stochastic gradient of F with respect to x^i evaluated at iteration t, x^i_t and x̃_t denote respectively the values of the variables x^i and x̃ at iteration t, and η is the learning rate.

The update rule for the center variable x̃ takes the form of a moving average, where the average is taken over both space and time. Denote α = ηρ and β = pα; then Equations 3 and 4 become

$$
x^i_{t+1} = x^i_t - \eta\, g^i_t(x^i_t) - \alpha\, (x^i_t - \tilde{x}_t) \qquad (5)
$$

$$
\tilde{x}_{t+1} = (1 - \beta)\, \tilde{x}_t + \beta \left( \frac{1}{p} \sum_{i=1}^{p} x^i_t \right) \qquad (6)
$$

Note that choosing β = pα leads to an elastic symmetry in the update rule, i.e. there exists a symmetric force, equal to α(x^i_t − x̃_t), between the update of each x^i and x̃. This has a crucial influence on the algorithm's stability, as will be explained in Section 4. Also, in order to minimize the staleness [16] of the difference x^i_t − x̃_t between the center and the local variable, the update for the master in Equation 4 involves x^i_t instead of x^i_{t+1}.
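As a sanity check of the update rule, the following sketch (ours, not the authors' implementation; the quadratic objective, the noise level, and all constants are invented for the demo and chosen inside the stability region analyzed in Section 4 and the Supplement) simulates Equations 5 and 6 on a noisy one-dimensional quadratic:

```python
import numpy as np

# Synchronous EASGD (Equations 5 and 6) on F(x) = x^2/2 with gradient noise.
rng = np.random.default_rng(0)
p, eta, rho = 8, 0.1, 1.0
alpha = eta * rho        # alpha = eta * rho
beta = p * alpha         # elastic symmetry: beta = p * alpha
x = 5.0 + rng.normal(size=p)   # local variables x^i, started far from x* = 0
x_center = 5.0                 # center variable

for t in range(2000):
    x_old = x
    g = x_old + 0.1 * rng.normal(size=p)   # stochastic gradients g^i_t(x^i_t)
    x = x_old - eta * g - alpha * (x_old - x_center)          # Eq. 5
    x_center = (1 - beta) * x_center + beta * x_old.mean()    # Eq. 6

assert abs(x_center) < 0.5   # the center approaches the optimum x* = 0
```

Note that the master update uses the pre-update values x^i_t, as required by Equation 6.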

Note also that α = ηρ, where the magnitude of ρ represents the amount of exploration we allow in the model. In particular, small ρ allows for more exploration, as it allows the x^i's to fluctuate further from the center x̃. The distinctive idea of EASGD is to allow the local workers to perform more exploration (small ρ) and the master to perform exploitation. This approach differs from other settings explored in the literature [2, 17, 18, 19, 20, 21, 22, 23], which focus on how fast the center variable converges. In this paper we show the merits of our approach in the deep learning setting.

Asynchronous EASGD

We discussed the synchronous update of the EASGD algorithm in the previous section. In this section we propose its asynchronous variant. The local workers are still responsible for updating the local variables x^i's, whereas the master updates the center variable x̃. Each worker maintains its own clock t^i, which starts from 0 and is incremented by 1 after each stochastic gradient update of x^i, as shown in Algorithm 1. The master performs an update whenever the local workers have finished τ steps of their gradient updates, where we refer to τ as the communication period. As can be seen in Algorithm 1, whenever τ divides the local clock of the i-th worker, the i-th worker communicates with the master and requests the current value of the center variable x̃. The worker then waits until the master sends back the requested parameter value, and computes the elastic difference α(x − x̃) (this entire procedure is captured in step a) in Algorithm 1). The elastic difference is then sent back to the master (step b) in Algorithm 1), which then updates x̃.

The communication period τ controls the frequency of the communication between every local worker and the master, and thus the trade-off between exploration and exploitation.

Algorithm 1: Asynchronous EASGD: Processing by worker i and the master
Input: learning rate η, moving rate α, communication period τ ∈ N
Initialize: x̃ is initialized randomly, x^i = x̃, t^i = 0
Repeat
    x ← x^i
    if (τ divides t^i) then
        a) x^i ← x^i − α(x − x̃)
        b) x̃ ← x̃ + α(x − x̃)
    end
    x^i ← x^i − η g^i_{t^i}(x)
    t^i ← t^i + 1
Until forever
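A single-process sketch of Algorithm 1 (our illustration: workers take turns in program order rather than truly running in parallel, and the quadratic objective and all constants are invented for the demo):

```python
import numpy as np

# Algorithm 1 on F(x) = x^2/2: each worker keeps a local clock t^i and
# exchanges an elastic difference with the master every tau local steps.
rng = np.random.default_rng(1)
p, eta, alpha, tau = 4, 0.05, 0.05, 4
x_center = 10.0        # center variable held by the master
x = [10.0] * p         # local variables x^i
t = [0] * p            # local clocks t^i

for step in range(8000):
    i = step % p                  # emulate concurrency by taking turns
    snapshot = x[i]               # x <- x^i
    if t[i] % tau == 0:           # if tau divides t^i
        x[i] -= alpha * (snapshot - x_center)       # step a)
        x_center += alpha * (snapshot - x_center)   # step b)
    g = snapshot + 0.1 * rng.normal()   # stochastic gradient g^i_{t^i}(x)
    x[i] -= eta * g
    t[i] += 1

assert abs(x_center) < 1.0   # the master's center drifts to the optimum x* = 0
```

Larger τ means fewer master updates per local step, i.e. less communication and more local exploration.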

Momentum EASGD

The momentum EASGD (EAMSGD) is a variant of our Algorithm 1 and is captured in Algorithm 2. It is based on Nesterov's momentum scheme [24, 25, 26], where the update of the local worker of the form captured in Equation 3 is replaced by the following update

$$
v^i_{t+1} = \delta\, v^i_t - \eta\, g^i_t(x^i_t + \delta\, v^i_t) \qquad (7)
$$

$$
x^i_{t+1} = x^i_t + v^i_{t+1} - \eta\rho\, (x^i_t - \tilde{x}_t) \qquad (8)
$$

where δ is the momentum term. Note that when δ = 0 we recover the original EASGD algorithm.
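A one-worker sketch of the EAMSGD local update (Equations 7 and 8); the noiseless quadratic and all constants are illustrative, and the master update mirrors Equation 4 specialized to a single worker:

```python
# EAMSGD (Equations 7 and 8) on the noiseless quadratic F(x) = x^2/2,
# whose gradient at the look-ahead point x + delta*v is simply x + delta*v.
eta, alpha, delta = 0.01, 0.01, 0.9   # note alpha = eta * rho
x, v, x_center = 10.0, 0.0, 10.0

for t in range(3000):
    x_old = x
    g = x_old + delta * v                         # gradient at x^i_t + delta * v^i_t
    v = delta * v - eta * g                       # Eq. 7: Nesterov momentum step
    x = x_old + v - alpha * (x_old - x_center)    # Eq. 8: momentum plus elastic pull
    x_center += alpha * (x_old - x_center)        # master update, single worker

assert abs(x) < 1e-3 and abs(x_center) < 1e-3
```

Setting delta = 0 in the loop above collapses Equations 7 and 8 back to the plain EASGD local update.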

As we are interested in reducing the communication overhead in the parallel computing environment where the parameter vector is very large, we will be exploring in the experimental section the asynchronous EASGD algorithm and its momentum-based variant in the relatively large τ regime (less frequent communication).

Stability analysis of EASGD and ADMM in the round-robin scheme

In this section we study the stability of the asynchronous EASGD and ADMM methods in the round-robin scheme [20]. We first state the updates of both algorithms in this setting, and then we study their stability. We will show that in the one-dimensional quadratic case, the ADMM algorithm can exhibit chaotic behavior, leading to exponential divergence. The analytic condition under which the ADMM algorithm is stable is still unknown, while for the EASGD algorithm it is very simple 1 .

The analysis of the synchronous EASGD algorithm, including its convergence rate, and its averaging property, in the quadratic and strongly convex case, is deferred to the Supplement.

In our setting, the ADMM method [9, 27, 28] involves solving the following minimax problem 2 ,

$$
\max_{\lambda^1,\ldots,\lambda^p} \; \min_{x^1,\ldots,x^p,\tilde{x}} \;\; \sum_{i=1}^{p} \left( F(x^i) - \lambda^i (x^i - \tilde{x}) + \frac{\rho}{2} \left\| x^i - \tilde{x} \right\|^2 \right)
$$

where the λ^i's are the Lagrangian multipliers. The resulting updates of the ADMM algorithm in the round-robin scheme are given next. Let t ≥ 0 be a global clock. At each t, we linearize the function F(x^i) as F(x^i_t) + ⟨∇F(x^i_t), x^i − x^i_t⟩ + (1/2η)‖x^i − x^i_t‖², as in [28]. The updates become

$$
\lambda^i_{t+1} = \begin{cases} \lambda^i_t - (x^i_t - \tilde{x}_t) & \text{if } \operatorname{mod}(t, p) = i - 1 \\ \lambda^i_t & \text{if } \operatorname{mod}(t, p) \neq i - 1 \end{cases} \qquad (9)
$$

$$
x^i_{t+1} = \begin{cases} \dfrac{x^i_t - \eta \nabla F(x^i_t) + \eta\rho\, (\lambda^i_{t+1} + \tilde{x}_t)}{1 + \eta\rho} & \text{if } \operatorname{mod}(t, p) = i - 1 \\ x^i_t & \text{if } \operatorname{mod}(t, p) \neq i - 1 \end{cases} \qquad (10)
$$

$$
\tilde{x}_{t+1} = \frac{1}{p} \sum_{i=1}^{p} \left( x^i_{t+1} - \lambda^i_{t+1} \right) \qquad (11)
$$

Each local variable x^i is periodically updated (with period p). First, the Lagrangian multiplier λ^i is updated with the dual ascent update, as in Equation 9. It is followed by the gradient descent update of the local variable, as given in Equation 10. Then the center variable x̃ is updated with the most recent values of all the local variables and Lagrangian multipliers, as in Equation 11. Note that since the step size for the dual ascent update is chosen to be ρ by convention [9, 27, 28], we have re-parametrized the Lagrangian multiplier as λ^i_t ← λ^i_t/ρ in the above updates.

The EASGD algorithm in the round-robin scheme is defined similarly and is given below

$$
x^i_{t+1} = \begin{cases} x^i_t - \eta \nabla F(x^i_t) - \alpha\, (x^i_t - \tilde{x}_t) & \text{if } \operatorname{mod}(t, p) = i - 1 \\ x^i_t & \text{if } \operatorname{mod}(t, p) \neq i - 1 \end{cases} \qquad (12)
$$

$$
\tilde{x}_{t+1} = \tilde{x}_t + \alpha\, (x^i_t - \tilde{x}_t), \quad \text{where } \operatorname{mod}(t, p) = i - 1 \qquad (13)
$$

At time t, only the i-th local worker (whose index i − 1 equals t modulo p) is activated and performs the update in Equation 12, which is followed by the master update given in Equation 13.

We will now focus on the one-dimensional quadratic case without noise, i.e. F(x) = x²/2, x ∈ ℝ.

For the ADMM algorithm, let the state of the (dynamical) system at time t be s_t = (λ^1_t, x^1_t, …, λ^p_t, x^p_t, x̃_t) ∈ ℝ^{2p+1}. The local worker i's updates in Equations 9, 10, and 11 are composed of three linear maps which can be written as s_{t+1} = (F^i_3 ∘ F^i_2 ∘ F^i_1)(s_t). For simplicity, we will only write them out below for the case when i = 1 and p = 2:

$$
F^1_1(s): \; \lambda^1 \leftarrow \lambda^1 - (x^1 - \tilde{x}), \qquad
F^1_2(s): \; x^1 \leftarrow \frac{(1-\eta)\,x^1 + \eta\rho\,(\lambda^1 + \tilde{x})}{1 + \eta\rho}, \qquad
F^1_3(s): \; \tilde{x} \leftarrow \frac{(x^1 - \lambda^1) + (x^2 - \lambda^2)}{2},
$$

with all other coordinates of the state left unchanged (here ∇F(x) = x).

For each of the p linear maps, where the i-th map has the form F^i_3 ∘ F^i_2 ∘ F^i_1, it is possible to find a simple condition under which the map is stable (the absolute values of the eigenvalues of the map are smaller than or equal to one). However, when these non-symmetric maps are composed one after another as F = F^p_3 ∘ F^p_2 ∘ F^p_1 ∘ … ∘ F^1_3 ∘ F^1_2 ∘ F^1_1, the resulting map F can become unstable! (More precisely, some eigenvalues of the map can sit outside the unit circle in the complex plane.)

2 The convergence analysis in [27] is based on the assumption that 'At any master iteration, updates from the workers have the same probability of arriving at the master.', which is not satisfied in the round-robin scheme.

We now present the numerical conditions under which the ADMM algorithm becomes unstable in the round-robin scheme for p = 3 and p = 8, obtained by computing the largest absolute eigenvalue of the map F. Figure 1 summarizes the obtained results.

Figure 1: The largest absolute eigenvalue of the linear map F = F^p_3 ∘ F^p_2 ∘ F^p_1 ∘ … ∘ F^1_3 ∘ F^1_2 ∘ F^1_1 as a function of η ∈ (0, 10⁻²) and ρ ∈ (0, 10) when p = 3 and p = 8. To simulate the chaotic behavior of the ADMM algorithm, one may pick η = 0.001 and ρ = 2.5 and initialize the state s_0 either randomly or with λ^i_0 = 0, x^i_0 = x̃_0 = 1000, ∀i. Figure should be read in color.


On the other hand, the EASGD algorithm involves composing only symmetric linear maps, due to the elasticity. Let the state of the (dynamical) system at time t be s_t = (x^1_t, …, x^p_t, x̃_t) ∈ ℝ^{p+1}. The activated local worker i's update in Equation 12 and the master update in Equation 13 can be written as s_{t+1} = F^i(s_t). In the case of p = 2, the maps F_1 and F_2 are defined as follows

$$
F_1 = \begin{pmatrix} 1-\eta-\alpha & 0 & \alpha \\ 0 & 1 & 0 \\ \alpha & 0 & 1-\alpha \end{pmatrix}, \qquad
F_2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1-\eta-\alpha & \alpha \\ 0 & \alpha & 1-\alpha \end{pmatrix}
$$

For the composite map F_p ∘ … ∘ F_1 to be stable, the condition that needs to be satisfied is actually the same for each i, and is furthermore independent of p (since each linear map F_i is symmetric). It essentially involves the stability of the 2 × 2 matrix

$$
\begin{pmatrix} 1-\eta-\alpha & \alpha \\ \alpha & 1-\alpha \end{pmatrix},
$$

whose two (real) eigenvalues λ satisfy (1 − η − α − λ)(1 − α − λ) = α². The resulting stability condition (|λ| ≤ 1) is simple and is given by 0 ≤ η ≤ 2, 0 ≤ α ≤ (4 − 2η)/(4 − η).
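The stability condition above can be checked numerically (our sketch; the choice η = 0.5 is arbitrary) by computing the spectral radius of the 2 × 2 matrix on both sides of the boundary α = (4 − 2η)/(4 − η):

```python
import numpy as np

# Numerical check of the EASGD stability condition: the 2x2 matrix
# [[1-eta-alpha, alpha], [alpha, 1-alpha]] should have spectral radius <= 1
# exactly when 0 <= eta <= 2 and 0 <= alpha <= (4 - 2*eta)/(4 - eta).
def spectral_radius(eta, alpha):
    M = np.array([[1 - eta - alpha, alpha], [alpha, 1 - alpha]])
    return max(abs(np.linalg.eigvals(M)))

eta = 0.5
alpha_max = (4 - 2 * eta) / (4 - eta)       # boundary of the stability region
assert spectral_radius(eta, alpha_max - 1e-6) <= 1 + 1e-9   # just inside: stable
assert spectral_radius(eta, alpha_max + 0.1) > 1            # outside: unstable
```

At the boundary the smallest eigenvalue touches −1, which is where the composed round-robin dynamics start to diverge.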

Experiments

In this section we compare the performance of EASGD and EAMSGD with the parallel method DOWNPOUR and the sequential method SGD , as well as their averaging and momentum variants.

All the parallel comparator methods are listed below 3 :

· DOWNPOUR [2]; the pseudo-code of the implementation of DOWNPOUR used in this paper is enclosed in the Supplement.
· Momentum DOWNPOUR (MDOWNPOUR), where Nesterov's momentum scheme is applied to the master's update (note that it is unclear how to apply it to the local workers or for the case when τ > 1). The pseudo-code is in the Supplement.
· A method that we call ADOWNPOUR, where we compute the average over time of the center variable x̃ as follows: z_{t+1} = (1 − α_{t+1}) z_t + α_{t+1} x̃_t, where α_{t+1} = 1/(t+1) is a moving rate and z_0 = x̃_0. Here t denotes the master clock, which is initialized to 0 and incremented every time the center variable x̃ is updated.
· A method that we call MVADOWNPOUR, where we compute the moving average of the center variable x̃ as follows: z_{t+1} = (1 − α) z_t + α x̃_t, where the moving rate α is chosen to be constant and z_0 = x̃_0. Here t denotes the master clock and is defined in the same way as for the ADOWNPOUR method.

3 We have compared asynchronous ADMM [27] with EASGD in our setting as well; the performance is nearly the same. However, ADMM's momentum variant is not as stable for large communication periods.

· SGD [1] with constant learning rate η.
· Momentum SGD (MSGD) [26] with constant momentum δ.
· ASGD [6] with moving rate α_{t+1} = 1/(t+1).
· MVASGD [6] with moving rate α set to a constant.
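The α_{t+1} = 1/(t+1) schedule used by ADOWNPOUR and ASGD makes z_t the plain arithmetic mean of the iterates seen so far; a quick check of that identity with toy numbers (ours):

```python
# The recursion z_{t+1} = (1 - a) * z_t + a * x_t with a = 1/(t+1) and
# z_0 = x_0 keeps z equal to the running mean of x_0, ..., x_t.
xs = [3.0, 1.0, 4.0, 1.0, 5.0]   # stand-ins for successive center values
z = xs[0]                         # z_0 = x~_0
for t in range(1, len(xs)):
    a = 1.0 / (t + 1)             # moving rate alpha_{t+1}
    z = (1 - a) * z + a * xs[t]

assert abs(z - sum(xs) / len(xs)) < 1e-12
```

The constant-α variants (MVADOWNPOUR, MVASGD) replace this running mean with an exponential moving average.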

We perform experiments in a deep learning setting on two benchmark datasets: CIFAR-10 (we refer to it as CIFAR ) 4 and ImageNet ILSVRC 2013 (we refer to it as ImageNet ) 5 . We focus on the image classification task with deep convolutional neural networks. We next explain the experimental setup. The details of the data preprocessing and prefetching are deferred to the Supplement.

Experimental setup

For all our experiments we use a GPU-cluster interconnected with InfiniBand. Each node has 4 Titan GPU processors where each local worker corresponds to one GPU processor. The center variable of the master is stored and updated on the centralized parameter server [2] 6 .

To describe the architecture of the convolutional neural network, we first introduce some notation. Let (c, y) denote the size of the input image to each layer, where c is the number of color channels and y is both the horizontal and the vertical dimension of the input. Let C denote the fully-connected convolutional operator, let P denote the max pooling operator, let D denote the linear operator with dropout rate equal to 0.5, and let S denote the linear operator with softmax output non-linearity. We use the cross-entropy loss, and all inner layers use rectified linear units. For the ImageNet experiment we use an approach similar to [4] with the following 11-layer convolutional neural network (3,221)C(96,108)P(96,36)C(256,32)P(256,16)C(384,14) C(384,13)C(256,12)P(256,6)D(4096,1)D(4096,1)S(1000,1). For the CIFAR experiment we use an approach similar to [29] with the following 7-layer convolutional neural network (3,28)C(64,24)P(64,12)C(128,8)P(128,4)C(64,2)D(256,1)S(10,1).

In our experiments all the methods we run use the same initial parameter chosen randomly, except that we set all the biases to zero for the CIFAR case and to 0.1 for the ImageNet case. This parameter is used to initialize the master and all the local workers 7 . We add ℓ2-regularization (λ/2)‖x‖² to the loss function F(x). For ImageNet we use λ = 10⁻⁵ and for CIFAR we use λ = 10⁻⁴. We compute the stochastic gradient using mini-batches of sample size 128.

Experimental results

For all experiments in this section we use EASGD with β = 0.9 8 ; for all momentum-based methods we set the momentum term δ = 0.99, and finally for MVADOWNPOUR we set the moving rate to α = 0.001. We start with the experiment on the CIFAR dataset with p = 4 local workers running on a single computing node. For all the methods, we examined the communication periods from the following set τ = {1, 4, 16, 64}. For comparison we also report the performance of MSGD, which outperformed SGD, ASGD and MVASGD, as shown in Figure 6 in the Supplement. For each method we examined a wide range of learning rates (the learning rates explored in all experiments are summarized in Tables 1, 2, 3 in the Supplement). The CIFAR experiment was run 3 times independently from the same initialization, and for each method we report its best performance measured by the smallest achievable test error. From the results in Figure 2, we conclude that all DOWNPOUR-based methods achieve their best performance (test error) for small τ (τ ∈ {1, 4}), and become highly unstable for τ ∈ {16, 64}. Meanwhile, EAMSGD significantly outperforms the comparator methods for all values of τ by having faster convergence. It also finds better-quality solutions measured by the test error, and this advantage becomes more significant for τ ∈ {16, 64}. Note that the tendency to achieve better test performance with larger τ is also characteristic for the EASGD algorithm.

4 Downloaded from http://www.cs.toronto.edu/˜kriz/cifar.html .

6 Our implementation is available at https://github.com/sixin-zh/mpiT .

7 On the contrary, initializing the local workers and the master with different random seeds 'traps' the algorithm in the symmetry breaking phase.

8 Intuitively, the 'effective β' is β/τ = pα = pηρ (thus ρ = β/(τpη)) in the asynchronous setting.

Figure 2: Training and test loss and the test error for the center variable versus wallclock time for different communication periods τ on the CIFAR dataset with the 7-layer convolutional neural network.


We next explore different numbers of local workers p, from the set p = {4, 8, 16} for the CIFAR experiment and p = {4, 8} for the ImageNet experiment 9 . For the ImageNet experiment we report the results of one run with the best setting we have found. EASGD and EAMSGD were run with τ = 10, whereas DOWNPOUR and MDOWNPOUR were run with τ = 1. The results are in Figures 3 and 4. For the CIFAR experiment, it is noticeable that the lowest achievable test error by either EASGD or EAMSGD decreases with larger p. This can potentially be explained by the fact that larger p allows for more exploration of the parameter space. In the Supplement, we discuss further the trade-off between exploration and exploitation as a function of the learning rate (Section 9.5) and the communication period (Section 9.6). Finally, the results obtained for the ImageNet experiment also show the advantage of EAMSGD over the competitor methods.

Conclusion

In this paper we describe a new algorithm called EASGD and its variants for training deep neural networks in the stochastic setting when the computations are parallelized over multiple GPUs. Experiments demonstrate that this new algorithm quickly achieves improvement in test error compared to more common baseline approaches such as DOWNPOUR and its variants. We show that our approach is very stable and plausible under communication constraints. We provide the stability analysis of the asynchronous EASGD in the round-robin scheme, and show the theoretical advantage of the method over ADMM. The different behavior of the EASGD algorithm from its momentum-based variant EAMSGD is intriguing and will be studied in future works.

9 For the ImageNet experiment, the training loss is measured on a subset of the training data of size 50,000.

Figure 3: Training and test loss and the test error for the center variable versus wallclock time for different numbers of local workers p for parallel methods (MSGD uses p = 1) on CIFAR with the 7-layer convolutional neural network. EAMSGD achieves significant accelerations compared to other methods, e.g. the relative speed-up for p = 16 (the best comparator method is then MSGD) to achieve the test error 21% equals 11.1.


Acknowledgments

The authors thank R. Power, J. Li for implementation guidance, J. Bruna, O. Henaff, C. Farabet, A. Szlam, Y. Bakhtin for helpful discussion, P. L. Combettes, S. Bengio and the referees for valuable feedback.


Additional theoretical results and proofs

Quadratic case

We provide here the convergence analysis of the synchronous EASGD algorithm with constant learning rate. The analysis is focused on the convergence of the center variable to the local optimum. We discuss the one-dimensional quadratic case first, then the generalization to the multi-dimensional setting (Lemma 7.3), and finally the strongly convex case (Theorem 7.1).

Our analysis in the quadratic case extends the analysis of ASGD in [6]. Assume each of the p local workers x^i_t ∈ ℝⁿ observes a noisy gradient at time t ≥ 0 of the linear form given in Equation 14,

$$
g^i_t(x^i_t) = A\, x^i_t - b - \xi^i_t \qquad (14)
$$

where the matrix A is positive-definite (each eigenvalue is strictly positive) and the {ξ^i_t}'s are i.i.d. random variables with zero mean and positive-definite covariance Σ. Let x* denote the optimal solution, where x* = A⁻¹b ∈ ℝⁿ. In this section we analyze the behavior of the mean squared error (MSE) of the center variable x̃_t, where this error is denoted as E[‖x̃_t − x*‖²], as a function of t, p, η, α and β, where β = pα. Note that the MSE can be decomposed as (squared) bias and variance 10 : E[‖x̃_t − x*‖²] = ‖E[x̃_t − x*]‖² + V[x̃_t − x*]. For the one-dimensional case (n = 1), we assume A = h > 0 and Σ = σ² > 0.
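The decomposition of the MSE into squared bias plus variance can be checked numerically (a generic illustration with invented numbers, not tied to the EASGD dynamics):

```python
import numpy as np

# E[(x - x*)^2] = (E[x - x*])^2 + V[x - x*]: estimator ~ N(2, 0.5^2), x* = 0,
# so the MSE should be close to bias^2 + variance = 4 + 0.25 = 4.25.
rng = np.random.default_rng(0)
samples = 2.0 + 0.5 * rng.normal(size=200_000)
mse = np.mean(samples ** 2)
assert abs(mse - (2.0 ** 2 + 0.5 ** 2)) < 0.02
```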

Lemma 7.1. Let x̃_0 and {x^i_0}_{i=1,…,p} be arbitrary constants; then

$$

$$

$$

$$

where u_0 = Σ_{i=1}^p (x^i_0 − x* − (α/(1 − pα − φ))(x̃_0 − x*)), a = ηh + (p+1)α, c² = ηhpα, γ = 1 − (a − √(a² − 4c²))/2, and φ = 1 − (a + √(a² − 4c²))/2.

It follows from Lemma 7.1 that for the center variable to be stable the following condition has to hold

$$
(2 - \eta h)(2 - \beta) > \frac{2\beta}{p}, \qquad (2 - \eta h) + (2 - \beta) > \frac{\beta}{p}, \qquad \eta > 0, \;\; \beta > 0. \qquad (17)
$$

It can be verified that φ and γ are the two roots of the polynomial in λ: λ² − (2 − a)λ + (1 − a + c²). Recall that φ and γ are functions of η and α. Thus (see the proof in Section 7.1.2)

The proof of the above lemma is based on the diagonalization of the linear gradient map (this map is symmetric due to the relation β = pα). The stability analysis of the asynchronous EASGD algorithm in the round-robin scheme is similar due to this elastic symmetry.

Proof. Substituting the gradient from Equation 14 into the update rule used by each local worker in the synchronous EASGD algorithm (Equations 5 and 6), we obtain

$$
x^i_{t+1} = x^i_t - \eta\, (h\, x^i_t - b - \xi^i_t) - \alpha\, (x^i_t - \tilde{x}_t) \qquad (18)
$$

$$
\tilde{x}_{t+1} = \tilde{x}_t + \alpha \sum_{i=1}^{p} (x^i_t - \tilde{x}_t) \qquad (19)
$$

10 In our notation, V denotes the variance.

where η is the learning rate and α is the moving rate. Recall that α = ηρ and A = h. For ease of notation we redefine x̃_t and x^i_t as follows:

$$
\tilde{x}_t \leftarrow \tilde{x}_t - x^*, \qquad x^i_t \leftarrow x^i_t - x^*.
$$

We prove the lemma by explicitly solving the linear equations 18 and 19. Let x_t = (x^1_t, …, x^p_t, x̃_t)^T. We rewrite the recursive relation captured in Equations 18 and 19 simply as

$$
x_{t+1} = M\, x_t + b_t
$$

where the drift matrix M is defined as

$$
M = \begin{pmatrix}
1-\alpha-\eta h & & & \alpha \\
& \ddots & & \vdots \\
& & 1-\alpha-\eta h & \alpha \\
\alpha & \cdots & \alpha & 1-p\alpha
\end{pmatrix}
$$

and the (diffusion) vector is b_t = (ηξ^1_t, …, ηξ^p_t, 0)^T.

Note that one of the eigenvalues of the matrix M, which we call φ, satisfies (1 − α − ηh − φ)(1 − pα − φ) = pα². The corresponding eigenvector is (1, 1, …, 1, −pα/(1 − pα − φ))^T. Let u_t be the projection of x_t onto this eigenvector; thus u_t = Σ_{i=1}^p (x^i_t − (α/(1 − pα − φ)) x̃_t). Let furthermore ξ_t = Σ_{i=1}^p ξ^i_t. Therefore we have

$$
u_{t+1} = \phi\, u_t + \eta\, \xi_t \qquad (20)
$$

By combining Equations 19 and 20 as follows

$$
\tilde{x}_{t+1} = (1 - p\alpha)\,\tilde{x}_t + \alpha \sum_{i=1}^{p} x^i_t
= (1 - p\alpha)\,\tilde{x}_t + \alpha\, u_t + \frac{p\alpha^2}{1 - p\alpha - \phi}\, \tilde{x}_t
= \gamma\, \tilde{x}_t + \alpha\, u_t,
$$

where the last step results from the following relations: pα²/(1 − pα − φ) = 1 − α − ηh − φ and φ + γ = (1 − α − ηh) + (1 − pα). Thus we obtain

$$
\tilde{x}_{t+1} = \gamma\, \tilde{x}_t + \alpha\, u_t \qquad (21)
$$

Based on Equations 20 and 21, we can then expand u_t and x̃_t recursively,

$$
u_t = \phi^t u_0 + \eta \sum_{l=1}^{t} \phi^{t-l}\, \xi_{l-1} \qquad (22)
$$

$$
\tilde{x}_t = \gamma^t \tilde{x}_0 + \alpha \sum_{k=1}^{t} \gamma^{t-k}\, u_{k-1} \qquad (23)
$$

Substituting u_0, u_1, …, u_t, each given through Equation 22, into Equation 23, we obtain

$$
\tilde{x}_t = \gamma^t \tilde{x}_0 + \alpha\, u_0\, \frac{\gamma^t - \phi^t}{\gamma - \phi} + \eta\alpha \sum_{l=1}^{t-1} \frac{\gamma^{t-l} - \phi^{t-l}}{\gamma - \phi}\, \xi_{l-1} \qquad (24)
$$

To be more specific, Equation 24 is obtained by integrating by parts,

$$
\sum_{k=1}^{t} \gamma^{t-k} \sum_{l=1}^{k-1} \phi^{k-1-l}\, \xi_{l-1}
= \sum_{l=1}^{t-1} \left( \sum_{k=l+1}^{t} \gamma^{t-k}\, \phi^{k-1-l} \right) \xi_{l-1}
= \sum_{l=1}^{t-1} \frac{\gamma^{t-l} - \phi^{t-l}}{\gamma - \phi}\, \xi_{l-1}.
$$

Since the random variables ξ_l are i.i.d., we may sum the variance term by term as follows

$$
\mathbb{V}\!\left[ \eta\alpha \sum_{l=1}^{t-1} \frac{\gamma^{t-l} - \phi^{t-l}}{\gamma - \phi}\, \xi_{l-1} \right]
= \eta^2 \alpha^2 p \sigma^2 \sum_{l=1}^{t-1} \left( \frac{\gamma^{t-l} - \phi^{t-l}}{\gamma - \phi} \right)^{2} \qquad (25)
$$

Note that E[ξ_t] = Σ_{i=1}^p E[ξ^i_t] = 0 and V[ξ_t] = Σ_{i=1}^p V[ξ^i_t] = pσ². These two facts, together with Equations 24 and 25, can then be used to compute E[x̃_t] and V[x̃_t] as given in Equations 15 and 16 in Lemma 7.1.

Visualizing Lemma 7.1

Figure 5: Theoretical mean squared error (MSE) of the center x̃ in the quadratic case, with various choices of the learning rate η (horizontal within each block) and the moving rate β = pα (vertical within each block), the number of processors p = {1, 10, 100, 1000, 10000} (vertical across blocks), and the time steps t = {1, 2, 10, 100, ∞} (horizontal across blocks). The MSE is plotted in log scale, ranging from 10⁻³ to 10³ (from deep blue to red). The dark red (i.e. on the upper-right corners) indicates divergence.

In Figure 5, we illustrate the dependence of the MSE on β, η and the number of processors p over time t. We consider the large-noise setting where x̃_0 = x^i_0 = 1, h = 1 and σ = 10. The MSE is color-coded such that the deep blue color corresponds to an MSE of 10⁻³, the green color corresponds to an MSE of 1, the red color corresponds to an MSE of 10³, and the dark red color corresponds to the divergence of the EASGD algorithm (the condition in Equation 17 is then violated). The plot shows that we can achieve significant variance reduction by increasing the number of local workers p. This effect is less sensitive to the choice of β and η for large p.

Condition in Equation 17

We are going to show that

$$
-1 < \phi \le \gamma < 1 \quad \Longleftrightarrow \quad \text{the condition in Equation 17 holds.}
$$

Recall that a = ηh + (p+1)α, c² = ηhpα, γ = 1 − (a − √(a² − 4c²))/2, φ = 1 − (a + √(a² − 4c²))/2, and β = pα. We have

$$
\gamma < 1 \;\Longleftrightarrow\; a - \sqrt{a^2 - 4c^2} > 0 \;\Longleftrightarrow\; c^2 > 0 \;\Longleftrightarrow\; \eta h p \alpha > 0,
$$

$$
\phi > -1 \;\Longleftrightarrow\; a + \sqrt{a^2 - 4c^2} < 4 \;\Longleftrightarrow\; a < 4 \;\text{ and }\; 2a - c^2 < 4
\;\Longleftrightarrow\; (2 - \eta h) + (2 - \beta) > \frac{\beta}{p} \;\text{ and }\; (2 - \eta h)(2 - \beta) > \frac{2\beta}{p}.
$$

The next corollary is a consequence of Lemma 7.1. As the number of workers p grows, the averaging property of the EASGD can be characterized as follows

Corollary 7.1. Let the elastic averaging relation β = pα and the condition in Equation 17 hold; then

$$

$$

Proof. Note that when β is fixed, lim_{p→∞} a = ηh + β and lim_{p→∞} c² = ηhβ. Then lim_{p→∞} φ = min(1 − β, 1 − ηh) and lim_{p→∞} γ = max(1 − β, 1 − ηh). Also note that using Lemma 7.1 we obtain

$$

$$

Corollary 7.1 is obtained by plugging in the limiting values of φ and γ.

The crucial point of Corollary 7.1 is that the MSE in the limit t → ∞ is of order 1/p, which implies that as the number of processors p grows, the MSE will decrease for the EASGD algorithm. Also note that the smaller β is (recall that β = pα = pηρ), the more exploration is allowed (small ρ) and, simultaneously, the smaller the MSE is.
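This averaging effect can be illustrated with a small Monte Carlo experiment (ours; the quadratic objective and all constants are invented for the demo and satisfy the stability condition):

```python
import numpy as np

# Run synchronous EASGD (Equations 5 and 6) on the noisy quadratic
# F(x) = x^2/2 and estimate the long-run MSE of the center variable
# for different numbers of workers p; it should shrink as p grows.
def center_mse(p, eta=0.1, beta=0.5, steps=2000, reps=15, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    alpha = beta / p
    errs = []
    for _ in range(reps):
        x, xc = np.zeros(p), 0.0          # start at the optimum x* = 0
        for _ in range(steps):
            g = x + sigma * rng.normal(size=p)   # noisy gradient of x^2/2
            x_old = x
            x = x_old - eta * g - alpha * (x_old - xc)
            xc = (1 - beta) * xc + beta * x_old.mean()
        errs.append(xc ** 2)              # an MSE sample, since x* = 0
    return float(np.mean(errs))

assert center_mse(p=16) < center_mse(p=2)   # more workers -> smaller MSE
```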


Generalization to multidimensional case

The next lemma (Lemma 7.2) shows that the EASGD algorithm achieves the highest possible rate of convergence when we consider the double averaging sequence (similarly to [6]) {z_1, z_2, ...} defined below

$$

$$

Lemma 7.2 (Weak convergence). If the condition in Equation 17 holds, then the normalized double averaging sequence defined in Equation 26 converges weakly to the normal distribution with zero mean and variance σ²/(ph²),

$$

$$

Proof. As in the proof of Lemma 7.1, for ease of notation we redefine x̃_t and x^i_t as follows:

$$

$$

Also recall that the {ξ^i_t}'s are i.i.d. random variables (noise) with zero mean and the same covariance Σ ≻ 0. We are interested in the asymptotic behavior of the double averaging sequence {z_1, z_2, ...} defined as

$$

$$

$$

$$

where ξ_t = ∑_{i=1}^p ξ^i_t. Therefore

$$

$$

Note that the only non-vanishing term (in weak convergence) of (1/√t) ∑_{k=0}^t x̃_k as t → ∞ is

$$

$$

Also recall that V[ξ_{l-1}] = pσ² and

$$

$$

The asymptotic variance in Lemma 7.2 is optimal for any fixed η and β for which the condition in Equation 17 holds. The next lemma (Lemma 7.3) extends the result of Lemma 7.2 to the multi-dimensional setting.

Lemma 7.3 (Weak convergence). Let h denote the largest eigenvalue of A. If (2 - ηh)(2 - β) > 2β/p, (2 - ηh) + (2 - β) > β/p, η > 0 and β > 0, then the normalized double averaging sequence converges weakly to the normal distribution with zero mean and covariance matrix V = A⁻¹Σ(A⁻¹)ᵀ,

$$

$$

Proof. Since A is symmetric, one can use the proof technique of Lemma 7.2 to prove Lemma 7.3 by diagonalizing the matrix A. This diagonalization essentially generalizes Lemma 7.1 to the multidimensional case. We will not go into the details of this proof, as we will provide a simpler way to look at the system. As in the proofs of Lemma 7.1 and Lemma 7.2, for ease of notation we redefine x̃_t and x^i_t as follows:

$$

$$

Let the spatial average of the local parameters at time t be denoted y_t, where y_t = (1/p) ∑_{i=1}^p x^i_t, and let the average noise be denoted ξ_t, where ξ_t = (1/p) ∑_{i=1}^p ξ^i_t. Equations 18 and 19 can then be reduced to the following

$$

$$

$$

$$

We focus on the case where the learning rate η and the moving rate α are kept constant over time¹¹. Recall that β = pα and α = ηρ.

Let us introduce the block notation U_t = (y_t, x̃_t), Ξ_t = (ηξ_t, 0), M = I - ηL, and

$$

$$

From Equations 31 and 32 it follows that U_{t+1} = M U_t + Ξ_t. Note that this linear system has degenerate noise Ξ_t, which prevents us from directly applying the results of [6]. Expanding this recursive relation and summing by parts, we have

$$

$$

By Lemma 7.4, ‖M‖₂ < 1 and thus

$$

$$

Since A is invertible, we get

$$

$$

thus

$$

$$

$$

$$

$$

$$

Lemma 7.4. Then ‖M‖₂ < 1 if the following conditions hold:

$$

$$

Proof. An eigenvalue λ of M and a (non-zero) eigenvector (y, z) of M satisfy

$$

$$

11 As a side note, notice that the center parameter x̃_t is tracking the spatial average y_t of the local parameters with a non-symmetric spring in Equations 31 and 32. To be more precise, note that the update of y_{t+1} contains (x̃_t - y_t) scaled by α, whereas the update of x̃_{t+1} contains -(x̃_t - y_t) scaled by β. Since α = β/p, the impact of the center x̃_{t+1} on the spatial local average y_{t+1} becomes increasingly negligible as p grows.

Recall that

$$

$$

From Equations 34 and 35 we obtain

$$

$$

Since (y, z) is assumed to be non-zero, we can write z = βy/(λ + β - 1). Equation 36 can then be reduced to

$$

$$

Thus y is an eigenvector of A. Let λ_A be the eigenvalue of A such that Ay = λ_A y. From Equation 37 it then follows that

$$

$$

$$

$$

where a = ηλ_A + (p+1)α and c² = ηλ_A pα. It follows from the condition in Equation 17 that -1 < λ < 1 iff η > 0, β > 0, (2 - ηλ_A)(2 - β) > 2β/p and (2 - ηλ_A) + (2 - β) > β/p. Let h denote the maximum eigenvalue of A and note that 2 - ηλ_A ≥ 2 - ηh. This implies that the condition of our lemma is sufficient.
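As a numerical sanity check of this analysis in one dimension (A = h), the sketch below computes the spectral radius of the 2×2 map M driving (y_t, x̃_t) and compares it with the stated sufficient condition; the function names are ours, and the recursion coefficients follow Equations 31 and 32 with α = β/p.

```python
import math

def spectral_radius_M(eta, beta, p, h=1.0):
    """Largest |eigenvalue| of the 2x2 map M governing (y_t, x~_t):
    y_{t+1} = (1 - eta*h - alpha) y_t + alpha x~_t (+ noise),
    x~_{t+1} = beta y_t + (1 - beta) x~_t,  with alpha = beta / p."""
    alpha = beta / p
    a11, a12 = 1.0 - eta * h - alpha, alpha
    a21, a22 = beta, 1.0 - beta
    tr, det = a11 + a22, a11 * a22 - a12 * a21
    disc = tr * tr - 4.0 * det
    if disc >= 0.0:          # real eigenvalue pair
        r = math.sqrt(disc)
        return max(abs((tr + r) / 2.0), abs((tr - r) / 2.0))
    return math.sqrt(det)    # complex pair: modulus is sqrt(det)

def lemma_condition(eta, beta, p, h=1.0):
    """Sufficient stability condition stated above (with lambda_A = h)."""
    return (eta > 0 and beta > 0
            and (2 - eta * h) * (2 - beta) > 2 * beta / p
            and (2 - eta * h) + (2 - beta) > beta / p)
```

For example, (η, β, p) = (0.1, 0.5, 4) satisfies the condition and yields a spectral radius below one, while (2.5, 0.5, 1) violates it and diverges.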

As in Lemma 7.2, the asymptotic covariance in Lemma 7.3 is optimal, i.e. it meets the Fisher information lower bound. The fact that this asymptotic covariance matrix V does not contain any term involving ρ is quite remarkable, since the penalty term ρ does have an impact on the condition number of the Hessian in Equation 2.


Strongly convex case

We now extend the above proof ideas to analyze the strongly convex case, in which the noisy gradient g^i_t(x) = ∇F(x) - ξ^i_t has the regularity that there exist some 0 < µ ≤ L for which µ‖x - y‖² ≤ ⟨∇F(x) - ∇F(y), x - y⟩ ≤ L‖x - y‖² holds uniformly for any x ∈ ℝ^d, y ∈ ℝ^d. The noise terms {ξ^i_t} are assumed to be i.i.d. with zero mean and bounded variance E[‖ξ^i_t‖²] ≤ σ².

$$

$$

$$

$$

Proof. The idea of the proof is based on the point of view in Lemma 7.3, i.e. how close the center variable x̃_t is to the spatial average of the local variables y_t = (1/p) ∑_{i=1}^p x^i_t. To further simplify the notation, let the noisy gradient be ∇f^i_{t,ξ} = g^i_t(x^i_t) = ∇F(x^i_t) - ξ^i_t, and let ∇f^i_t = ∇F(x^i_t) be its deterministic part. Then the EASGD updates can be rewritten as follows,

$$

$$

$$

$$

We thus have the update for the spatial average,

$$

$$

$$

$$

Equation 38 is equivalent to

The idea of the proof is to bound the distance ‖x̃_t - x*‖² through ‖y_t - x*‖² and (1/p) ∑_{i=1}^p ‖x^i_t - x*‖². We start from the following estimate for strongly convex functions [31],

$$

$$

Since ∇F(x*) = 0, we have

$$

$$

From Equation 40 the following relation holds,

$$

$$

By the cosine rule (2⟨a - b, c - d⟩ = ‖a - d‖² - ‖a - c‖² + ‖c - b‖² - ‖d - b‖²), we have

$$

$$

By the Cauchy-Schwarz inequality, we have

$$

$$

Combining the above estimates in Equations 43, 44, 45, 46, we obtain

$$

$$

Choosing 0 ≤ α < 1, we can upper-bound the terms α²‖x^i_t - x̃_t‖² - α‖x^i_t - x̃_t‖² + 2ηα‖∇f^i_t‖‖x^i_t - x̃_t‖ = -α(1-α)‖x^i_t - x̃_t‖² + 2ηα‖∇f^i_t‖‖x^i_t - x̃_t‖ ≤ (η²α/(1-α))‖∇f^i_t‖² by applying -ax² + bx ≤ b²/(4a) with x = ‖x^i_t - x̃_t‖. Thus we can further bound Equation 47 with

$$

$$
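The elementary bound invoked in this step follows by completing the square: for any a > 0,

```latex
-ax^2 + bx \;=\; -a\left(x - \frac{b}{2a}\right)^2 + \frac{b^2}{4a} \;\le\; \frac{b^2}{4a}.
```

With a = α(1-α) and b = 2ηα‖∇f^i_t‖ this yields the factor η²α/(1-α) above, since (2ηα)²/(4α(1-α)) = η²α/(1-α).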

Since, as in Equations 48 and 49, the noise ξ^i_t has zero mean (E ξ^i_t = 0) and bounded variance (E‖ξ^i_t‖² ≤ σ²), if η is chosen small enough that η² + η²α/(1-α) - 2η/(µ+L) ≤ 0, then

$$

$$

$$

$$

$$

$$

$$

$$

Denoting ξ_t = (1/p) ∑_{i=1}^p ξ^i_t, we can rewrite Equation 51 as

$$

$$

$$

$$

Thus it follows from Equations 43 and 56 that

$$

$$

$$

$$

$$

$$

$$

$$

Similarly, if 0 ≤ α < 1, we can upper-bound the terms α²‖y_t - x̃_t‖² - α‖y_t - x̃_t‖² + 2ηα‖(1/p) ∑_{i=1}^p ∇f^i_t‖‖y_t - x̃_t‖ ≤ (η²α/(1-α))‖(1/p) ∑_{i=1}^p ∇f^i_t‖² by applying -ax² + bx ≤ b²/(4a) with x = ‖y_t - x̃_t‖. Thus we have the following bound for Equation 60

$$

$$

Since 2√(µL)/(µ+L) ≤ 1, we also need to bound the non-linear term ⟨∇f^i_t - ∇f^j_t, x^i_t - x^j_t⟩ ≤ L‖x^i_t - x^j_t‖². Recall the bias-variance relation (1/p) ∑_{i=1}^p ‖x^i_t - x*‖² = (1/p²) ∑_{i>j} ‖x^i_t - x^j_t‖² + ‖y_t - x*‖². The key observation is that if (1/p) ∑_{i=1}^p ‖x^i_t - x*‖² remains bounded, then a larger variance ∑_{i>j} ‖x^i_t - x^j_t‖² implies a smaller bias ‖y_t - x*‖². Thus this non-linear term can be compensated.

Again choose η small enough that η² + η²α/(1-α) - 2η/(µ+L) ≤ 0 and take the expectation in Equation 61,

$$

$$

As for the center variable in Equation 41, we simply apply the convexity of the norm ‖·‖² to obtain

$$

$$

Combining the estimates from Equations 50, 62 and 63, and denoting a_t = E‖y_t - x*‖², b_t = (1/p) ∑_{i=1}^p E‖x^i_t - x*‖², c_t = E‖x̃_t - x*‖², γ₁ = 2ηµL/(µ+L), γ₂ = 2ηL(1 - 2√(µL)/(µ+L)), we then have

$$

$$

$$

$$


Additional pseudo-codes of the algorithms

DOWNPOUR pseudo-code

Algorithm 3 captures the pseudo-code of the implementation of the DOWNPOUR used in this paper.

Algorithm 3: DOWNPOUR: Processing by worker i and the master

Input: learning rate η, communication period τ ∈ ℕ

Initialize: x̃ is initialized randomly, x^i = x̃, v^i = 0, t^i = 0
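To make the communication pattern concrete, here is a schematic, single-process Python sketch of the DOWNPOUR protocol (accumulate gradient updates locally for τ steps, push the accumulated update to the master, pull fresh parameters). The toy quadratic objective, the sequential round-robin scheduling, and all names are illustrative assumptions for this sketch, not the paper's GPU implementation.

```python
import random

def downpour(p=4, tau=5, eta=0.05, steps=200, seed=0):
    """Schematic DOWNPOUR: each worker accumulates its updates in v and,
    every tau local steps, sends v to the master and fetches the fresh
    center parameter. Toy objective: F(x) = x**2 / 2 with Gaussian noise."""
    rng = random.Random(seed)
    center = 1.0               # master's parameter x~
    x = [center] * p           # workers' local copies
    v = [0.0] * p              # accumulated updates
    t = [0] * p                # local clocks
    for _ in range(steps):
        for i in range(p):     # sequential stand-in for asynchrony
            grad = x[i] + rng.gauss(0.0, 1.0)   # noisy gradient of x^2/2
            x[i] -= eta * grad
            v[i] -= eta * grad
            t[i] += 1
            if t[i] % tau == 0:     # communicate with the master
                center += v[i]      # master applies the accumulated update
                x[i] = center       # worker pulls fresh parameters
                v[i] = 0.0
    return center
```

On this toy problem the center drifts from its initial value 1.0 toward the optimum at 0, fluctuating at a noise floor set by η and the noise variance.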

MDOWNPOUR pseudo-code

Algorithm 4 captures the pseudo-code of the implementation of MDOWNPOUR used in this paper.

Algorithm 4: MDOWNPOUR: Processing by worker i and the master

Input: learning rate η, momentum term δ, communication period τ ∈ ℕ

Initialize: x̃ is initialized randomly, x^i = x̃, v^i = 0, t^i = 0

Experiments - additional material

Data preprocessing

For the ImageNet experiment, we re-size each RGB image so that the smallest dimension is 256 pixels. We also re-scale each pixel value to the interval [0 , 1] . We then extract random crops (and their horizontal flips) of size 3 × 221 × 221 pixels and present these to the network in mini-batches of size 128 .

The training and test loss and the test error are only computed from the center patch ( 3 × 28 × 28 ) for the CIFAR experiment and the center patch ( 3 × 221 × 221 ) for the ImageNet experiment.

Data prefetching (Sampling the dataset by the local workers)

We will now explain precisely how the dataset is sampled by each local worker, as uniformly and efficiently as possible. The general parallel data loading scheme on a single machine is as follows: we use k CPUs, where k = 8, to load the data in parallel. Each data loader reads from a memory-mapped (mmap) file a chunk of c raw images (preprocessing was described in the previous subsection) and their labels (for CIFAR c = 512 and for ImageNet c = 64). For CIFAR, the mmap file of each data loader contains the entire dataset, whereas for ImageNet, each data loader's mmap file contains a different 1/k fraction of the entire dataset. A chunk of data is always sent by one of the data loaders to the first worker who requests the data. The next worker requesting data from the same data loader will get the next chunk. Each worker requests in total k data chunks from k different data loaders and then processes them before asking for new data chunks. Notice that each data loader cycles through the data in the mmap file, sending consecutive chunks to the workers in the order in which it receives requests from them. When the data loader reaches the end of the mmap file, it selects an address in memory uniformly at random from the interval [0, s], where s = (number of images in the mmap file modulo mini-batch size), and uses this address to start cycling again through the data in the mmap file. After the local worker receives the k data chunks from the data loaders, it shuffles them and divides them into mini-batches of size 128.
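A minimal sketch of this chunk-cycling logic, with the mmap file abstracted to a plain Python list; the class name and interface are our own, and the wrap-around restart draws the new start address uniformly from [0, s] with s = (number of images modulo mini-batch size), as described above.

```python
import random

class ChunkLoader:
    """Hands out consecutive chunks of size c to whichever worker asks next;
    on reaching the end of the (memory-mapped) file, restarts cycling from a
    uniformly random address in [0, s], s = len(data) % batch_size."""
    def __init__(self, data, c, batch_size=128, seed=0):
        self.data, self.c, self.batch = data, c, batch_size
        self.rng = random.Random(seed)
        self.pos = 0

    def next_chunk(self):
        if self.pos >= len(self.data):            # reached end of the file
            s = len(self.data) % self.batch
            self.pos = self.rng.randrange(s + 1)  # uniform over [0, s]
        chunk = self.data[self.pos:self.pos + self.c]
        self.pos += self.c
        return chunk
```

Successive calls return consecutive chunks, so workers interleaving their requests receive disjoint, in-order slices of the file.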

Learning rates

In Table 1 we summarize the learning rates η (we used constant learning rates) explored for each method shown in Figure 2. For all values of τ the same set of learning rates was explored for each method.

In Table 3 we summarize the initial learning rates η we use for each method shown in Figure 4. For all values of p the same set of learning rates was explored for each method. We also used a rule of thumb to decrease the initial learning rate twice: the first time we divided it by 5 and the second time by 2, when we observed that the decrease of the online predictive (training) loss saturated.

Comparison of SGD, ASGD, MVASGD and MSGD

Figure 6: Convergence of the training and test loss (negative log-likelihood) and the test error (original and zoomed) computed for the center variable as a function of wallclock time for SGD , ASGD , MVASGD and MSGD ( p = 1 ) on the CIFAR experiment.


Figure 6 shows the convergence of the training and test loss (negative log-likelihood) and the test error computed for the center variable as a function of wallclock time for SGD , ASGD , MVASGD and MSGD ( p = 1 ) on the CIFAR experiment. For all CIFAR experiments we always start the averaging for the ADOWNPOUR and ASGD methods from the very beginning of each experiment. For all ImageNet experiments we start the averaging for the ASGD at the same time when we first reduce the initial learning rate.

Dependence on the learning rate

This section discusses the dependence of the trade-off between exploration and exploitation on the learning rate. We compare the performance of, respectively, EAMSGD and EASGD for different learning rates η when p = 16 and τ = 10 on the CIFAR experiment. We observe in Figure 8 that higher learning rates η lead to better test performance for the EAMSGD algorithm, which can potentially be explained by the fact that they sustain higher fluctuations of the local workers. We conjecture that higher fluctuations lead to more exploration and simultaneously impose higher regularization. The picture, however, seems to be the opposite for the EASGD algorithm, for which larger learning rates hurt the performance of the method and lead to overfitting. Interestingly, in this experiment, for both the EASGD and EAMSGD algorithms, the learning rate for which the best training performance was achieved simultaneously led to the worst test performance.

Figure 8: Convergence of the training loss (negative log-likelihood, original) and the test error (zoomed) computed for the center variable as a function of wallclock time for EAMSGD and EASGD run with different values of η on the CIFAR experiment. p = 16 , τ = 10 .


Dependence on the communication period

This section discusses the dependence of the trade-off between exploration and exploitation on the communication period. We have observed in the CIFAR experiment that the EASGD algorithm exhibits very similar convergence behavior for τ = 1 up to even τ = 1000, whereas EAMSGD can get trapped at a worse energy (loss) level for τ = 100. This behavior of EAMSGD is most likely due to the non-convexity of the objective function. Luckily, it can be avoided by gradually decreasing the learning rate, i.e. increasing the penalty term ρ (recall α = ηρ), as shown in Figure 9. In contrast, the EASGD algorithm does not seem to get trapped at all along its trajectory. The performance of EASGD is less sensitive to increasing the communication period compared to EAMSGD, whereas for EAMSGD a careful choice of the learning rate for large communication periods seems crucial.

Compared to all earlier results, the experiment in this section is re-run three times with a new random seed¹² and with the faster cuDNN¹³ package¹⁴. All our methods are implemented in Torch¹⁵. The Message Passing Interface implementation MVAPICH2¹⁶ is used for the GPU-CPU communication.

Figure 9: Convergence of the training loss (negative log-likelihood, original) and the test error (zoomed) computed for the center variable as a function of wallclock time for EASGD and EAMSGD (p = 16, η = 0.01, β = 0.9, δ = 0.99) on the CIFAR experiment with various communication periods τ and learning rate decays γ. The learning rate is decreased gradually over time based on each local worker's own clock t, with η_t = η/(1 + γt)^0.5.


12 To clarify, the random initialization we use is by default in Torch's implementation.

13 https://developer.nvidia.com/cuDNN

16 http://mvapich.cse.ohio-state.edu
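The decay rule used in Figure 9 is simple enough to state as a one-liner; this helper (name ours) computes the per-worker learning rate from the worker's own clock t.

```python
def decayed_lr(eta, gamma, t):
    """Learning rate schedule eta_t = eta / (1 + gamma * t) ** 0.5,
    evaluated on a local worker's own clock t (as in Figure 9)."""
    return eta / (1.0 + gamma * t) ** 0.5
```

With γ = 0 the schedule reduces to a constant learning rate η.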

Breakdown of the wallclock time

In addition, we report in Table 4 the breakdown of the total running time for EASGD when τ = 10 (the time breakdown for EAMSGD is almost identical) and DOWNPOUR when τ = 1 into computation time, data loading time and parameter communication time. For the CIFAR experiment the reported time corresponds to processing 400 × 128 data samples, whereas for the ImageNet experiment it corresponds to processing 1024 × 128 data samples. For τ = 1 and p ∈ {8, 16} we observe that the communication time accounts for a significant portion of the total running time, whereas for τ = 10 the communication time becomes negligible compared to the total running time (recall that based on previous results EASGD and EAMSGD achieve the best performance with larger τ, which is ideal in the setting when communication is time-consuming).

Table 4: Approximate computation time, data loading time and parameter communication time [sec] for DOWNPOUR (top line, τ = 1) and EASGD (bottom line, τ = 10; the time breakdown for EAMSGD is almost identical). The left table corresponds to the CIFAR experiment and the right table to the ImageNet experiment.


Time speed-up


One of the most challenging problems in large-scale machine learning is how to parallelize the training of large models that use a form of stochastic gradient descent (SGD) [1]. There have been attempts to parallelize SGD-based training for large-scale deep learning models on a large number of CPUs, including Google's DistBelief system [2]. But practical image recognition systems consist of large-scale convolutional neural networks trained on a few GPU cards sitting in a single computer [3, 4]. The main challenge is to devise parallel SGD algorithms to train large-scale deep learning models that yield a significant speedup when run on multiple GPU cards.

In this paper we introduce the Elastic Averaging SGD method (EASGD) and its variants. EASGD is motivated by the quadratic penalty method [5], but is re-interpreted as a parallelized extension of the averaging SGD algorithm [6]. The basic idea is to let each worker maintain its own local parameter; the communication and coordination of work among the local workers is based on an elastic force which links the parameters they compute with a center variable stored by the master. The center variable is updated as a moving average, where the average is taken in time and also in space over the parameters computed by the local workers. The main contribution of this paper is a new algorithm that provides fast convergent minimization while outperforming the DOWNPOUR method [2] and other baseline approaches in practice. It simultaneously reduces the communication overhead between the master and the local workers while maintaining high-quality performance as measured by the test error. The new algorithm applies to deep learning settings such as parallelized training of convolutional neural networks.

The article is organized as follows. Section 2 explains the problem setting, Section 3 presents the synchronous EASGD algorithm and its asynchronous and momentum-based variants, Section 4 provides stability analysis of EASGD and ADMM in the round-robin scheme, Section 5 shows experimental results and Section 6 concludes. The Supplement contains additional material including additional theoretical analysis.

Consider minimizing a function F(x) in a parallel computing environment [7] with p ∈ ℕ workers and a master. In this paper we focus on the stochastic optimization problem of the following form

where x is the model parameter to be estimated and ξ is a random variable that follows the probability distribution ℙ over Ω such that F(x) = ∫_Ω f(x, ξ) ℙ(dξ). The optimization problem in Equation 1 can be reformulated as follows

where each ξ^i follows the same distribution ℙ (thus we assume each worker can sample the entire dataset). In the paper we refer to the x^i's as local variables and to x̃ as the center variable. The problem of the equivalence of these two objectives is studied in the literature and is known as the augmentability or the global variable consensus problem [8, 9]. The quadratic penalty term ρ in Equation 2 is expected to ensure that local workers will not fall into different attractors that are far away from the center variable. This paper focuses on the problem of reducing the parameter communication overhead between the master and the local workers [10, 2, 11, 12, 13]. The problem of data communication when the data is distributed among the workers [7, 14] is a more general problem and is not addressed in this work. We emphasize, however, that our problem setting is still highly non-trivial under the communication constraints due to the existence of many local optima [15].

The EASGD updates captured in Equations 3 and 4 are obtained by taking gradient descent steps on the objective in Equation 2 with respect to the variables x^i and x̃, respectively,

where g^i_t(x^i_t) denotes the stochastic gradient of F with respect to x^i evaluated at iteration t, x^i_t and x̃_t denote respectively the values of the variables x^i and x̃ at iteration t, and η is the learning rate.

The update rule for the center variable x̃ takes the form of a moving average, where the average is taken over both space and time. Denote α = ηρ and β = pα; then Equations 3 and 4 become

Note that choosing β = pα leads to an elastic symmetry in the update rule, i.e. there exists a symmetric force equal to α(x^i_t - x̃_t) between the update of each x^i and x̃. It has a crucial influence on the algorithm's stability, as will be explained in Section 4. Also, in order to minimize the staleness [16] of the difference x^i_t - x̃_t between the center and the local variable, the update for the master in Equation 4 involves x^i_t instead of x^i_{t+1}.

Note also that α = ηρ, where the magnitude of ρ represents the amount of exploration we allow in the model. In particular, a small ρ allows for more exploration, as it allows the x^i's to fluctuate further from the center x̃. The distinctive idea of EASGD is to allow the local workers to perform more exploration (small ρ) while the master performs exploitation. This approach differs from other settings explored in the literature [2, 17, 18, 19, 20, 21, 22, 23], which focus on how fast the center variable converges. In this paper we show the merits of our approach in the deep learning setting.

We discussed the synchronous update of the EASGD algorithm in the previous section. In this section we propose its asynchronous variant. The local workers are still responsible for updating the local variables x^i, whereas the master updates the center variable x̃. Each worker maintains its own clock t^i, which starts from 0 and is incremented by 1 after each stochastic gradient update of x^i, as shown in Algorithm 1. The master performs an update whenever a local worker finishes τ steps of its gradient updates, where we refer to τ as the communication period. As can be seen in Algorithm 1, whenever τ divides the local clock of the i-th worker, the i-th worker communicates with the master and requests the current value of the center variable x̃. The worker then waits until the master sends back the requested parameter value, and computes the elastic difference α(x - x̃) (this entire procedure is captured in step a) of Algorithm 1). The elastic difference is then sent back to the master (step b) of Algorithm 1), who then updates x̃.

The communication period τ controls the frequency of the communication between every local worker and the master, and thus the trade-off between exploration and exploitation.
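The worker-side logic of Algorithm 1 can be sketched as follows; this is a schematic sequential rendition (the function name and return convention are ours), where the master's state is threaded through explicitly instead of living in a separate process.

```python
def worker_step(x, t, center, grad, eta, alpha, tau):
    """One local step of asynchronous EASGD, schematically. If tau divides
    the local clock t, the worker exchanges an elastic difference with the
    master: the worker moves toward the center (step a) and the master
    moves toward the worker (step b). Returns the updated (x, t, center)."""
    if t % tau == 0:
        diff = alpha * (x - center)  # elastic difference alpha * (x - x~)
        x -= diff                    # step a): worker update
        center += diff               # step b): master update
    x -= eta * grad(x)               # local stochastic gradient step
    return x, t + 1, center
```

On a noiseless quadratic objective both the local variable and the center contract toward the optimum, the center at the slower rate set by α and τ.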

The momentum EASGD (EAMSGD) is a variant of Algorithm 1 and is captured in Algorithm 2. It is based on Nesterov's momentum scheme [24, 25, 26], where the update of the local worker of the form captured in Equation 3 is replaced by the following update

where δ is the momentum term. Note that when δ = 0 we recover the original EASGD algorithm.

As we are interested in reducing the communication overhead in the parallel computing environment where the parameter vector is very large, we will explore in the experimental section the asynchronous EASGD algorithm and its momentum-based variant in the relatively large τ regime (less frequent communication).

In this section we study the stability of the asynchronous EASGD and ADMM methods in the round-robin scheme [20]. We first state the updates of both algorithms in this setting, and then we study their stability. We will show that in the one-dimensional quadratic case, the ADMM algorithm can exhibit chaotic behavior, leading to exponential divergence. The analytic condition for the ADMM algorithm to be stable is still unknown, while for the EASGD algorithm it is very simple¹ (¹ This condition resembles the stability condition for the synchronous EASGD algorithm, i.e. Condition 23 for p = 1, in the analysis in the Supplement).

The analysis of the synchronous EASGD algorithm, including its convergence rate and its averaging property, in the quadratic and strongly convex cases, is deferred to the Supplement.

In our setting, the ADMM method [9, 27, 28] involves solving the following minimax problem² (² The convergence analysis in [27] is based on the assumption that "at any master iteration, updates from the workers have the same probability of arriving at the master", which is not satisfied in the round-robin scheme),

where the λ^i's are the Lagrangian multipliers. The resulting updates of the ADMM algorithm in the round-robin scheme are given next. Let t ≥ 0 be a global clock. At each t, we linearize the function F(x^i) with F(x^i_t) + ⟨∇F(x^i_t), x^i - x^i_t⟩ + (1/2η)‖x^i - x^i_t‖², as in [28]. The updates become

Each local variable x^i is periodically updated (with period p). First, the Lagrangian multiplier λ^i is updated with the dual ascent update as in Equation 11. It is followed by the gradient descent update of the local variable as given in Equation 14. Then the center variable x̃ is updated with the most recent values of all the local variables and Lagrangian multipliers as in Equation 15. Note that since the step size for the dual ascent update is chosen to be ρ by convention [9, 27, 28], we have re-parametrized the Lagrangian multiplier to be λ^i_t ← λ^i_t/ρ in the above updates.

The EASGD algorithm in the round-robin scheme is defined similarly and is given below

At time t, only the i-th local worker (whose index i - 1 equals t modulo p) is activated, and performs the update in Equation 18, which is followed by the master update given in Equation 19.

We will now focus on the one-dimensional quadratic case without noise, i.e. F(x) = x²/2, x ∈ ℝ.

For the ADMM algorithm, let the state of the (dynamical) system at time t be s_t = (λ¹_t, x¹_t, ..., λ^p_t, x^p_t, x̃_t) ∈ ℝ^{2p+1}. The local worker i's updates in Equations 11, 14, and 15 are composed of three linear maps, which can be written as s_{t+1} = (F^i_3 ∘ F^i_2 ∘ F^i_1)(s_t). For simplicity, we will only write them out below for the case i = 1 and p = 2:

For each of the p linear maps, it is possible to find a simple condition such that each map, where the i-th map has the form F^i_3 ∘ F^i_2 ∘ F^i_1, is stable (the absolute values of the eigenvalues of the map are smaller than or equal to one). However, when these non-symmetric maps are composed one after another as ℱ = F^p_3 ∘ F^p_2 ∘ F^p_1 ∘ ... ∘ F¹_3 ∘ F¹_2 ∘ F¹_1, the resulting map ℱ can become unstable (more precisely, some eigenvalues of the map can sit outside the unit circle in the complex plane).

We now present the numerical conditions for which the ADMM algorithm becomes unstable in the round-robin scheme for p = 3 and p = 8, obtained by computing the largest absolute eigenvalue of the map ℱ. Figure 1 summarizes the obtained result.

On the other hand, the EASGD algorithm involves composing only symmetric linear maps, due to the elasticity. Let the state of the (dynamical) system at time $t$ be $s_{t}=(x^{1}_{t},\ldots,x^{p}_{t},\tilde{x}_{t})\in\mathbb{R}^{p+1}$. The activated local worker $i$'s update in Equation 18 and the master update in Equation 19 can be written as $s_{t+1}=F^{i}(s_{t})$. In the case $p=2$, the maps $F^{1}$ and $F^{2}$ are defined as follows

For the composite map $F^{p}\circ\ldots\circ F^{1}$ to be stable, the condition that needs to be satisfied is actually the same for each $i$, and is furthermore independent of $p$ (since each linear map $F^{i}$ is symmetric). It essentially involves the stability of the $2\times 2$ matrix $\left(\begin{smallmatrix}1-\eta-\alpha&\alpha\\ \alpha&1-\alpha\end{smallmatrix}\right)$, whose two (real) eigenvalues $\lambda$ satisfy $(1-\eta-\alpha-\lambda)(1-\alpha-\lambda)=\alpha^{2}$. The resulting stability condition ($|\lambda|\leq 1$) is simple and is given as $0\leq\eta\leq 2$, $0\leq\alpha\leq\frac{4-2\eta}{4-\eta}$.
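The stability region can be checked numerically; a minimal sketch for the one-dimensional quadratic $F(x)=x^{2}/2$ (so $h=1$), with sample parameter values chosen only for illustration:

```python
import numpy as np

def easgd_eigs(eta, alpha):
    """Eigenvalues of the 2x2 matrix governing one EASGD round-robin step
    in the one-dimensional quadratic case F(x) = x^2/2 (h = 1)."""
    M = np.array([[1.0 - eta - alpha, alpha],
                  [alpha,             1.0 - alpha]])
    return np.linalg.eigvals(M)

def stable(eta, alpha, tol=1e-12):
    """True when both eigenvalues lie in [-1, 1] (up to numerical tolerance)."""
    return bool(max(abs(easgd_eigs(eta, alpha))) <= 1.0 + tol)

# For eta = 1 the condition reads alpha <= (4 - 2)/(4 - 1) = 2/3:
print(stable(1.0, 0.5))   # True  (inside the region)
print(stable(1.0, 0.9))   # False (outside the region)
```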

In this section we compare the performance of EASGD and EAMSGD with the parallel method DOWNPOUR and the sequential method SGD, as well as their averaging and momentum variants.

All the parallel comparator methods are listed below (we have also compared asynchronous ADMM [27] with EASGD in our setting; the performance is nearly the same, but ADMM's momentum variant is not as stable for large communication periods):

DOWNPOUR [2]; the pseudo-code of the implementation of DOWNPOUR used in this paper is enclosed in the Supplement.

Momentum DOWNPOUR (MDOWNPOUR), where Nesterov's momentum scheme is applied to the master's update (note that it is unclear how to apply it to the local workers or to the case $\tau>1$). The pseudo-code is in the Supplement.

A method that we call ADOWNPOUR, where we compute the average over time of the center variable $\tilde{x}$ as follows: $z_{t+1}=(1-\alpha_{t+1})z_{t}+\alpha_{t+1}\tilde{x}_{t}$, where $\alpha_{t+1}=\frac{1}{t+1}$ is a moving rate and $z_{0}=\tilde{x}_{0}$. Here $t$ denotes the master clock, which is initialized to $0$ and incremented every time the center variable $\tilde{x}$ is updated.

A method that we call MVADOWNPOUR, where we compute the moving average of the center variable $\tilde{x}$ as follows: $z_{t+1}=(1-\alpha)z_{t}+\alpha\tilde{x}_{t}$, where the moving rate $\alpha$ is chosen to be constant and $z_{0}=\tilde{x}_{0}$. Here $t$ denotes the master clock and is defined in the same way as for the ADOWNPOUR method.
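The two averaging rules differ only in the choice of the rate; a minimal sketch of both updates (the sequences of center iterates are hypothetical inputs):

```python
def running_average(center_values):
    """ADOWNPOUR-style time average: z_{t+1} = (1 - a_{t+1}) z_t + a_{t+1} x_t
    with a_{t+1} = 1/(t+1); z ends up being the plain mean of the iterates."""
    z = center_values[0]                  # z_0 = x~_0
    for t, x in enumerate(center_values):
        a = 1.0 / (t + 1)
        z = (1.0 - a) * z + a * x
    return z

def moving_average(center_values, alpha):
    """MVADOWNPOUR-style moving average with a constant rate alpha."""
    z = center_values[0]
    for x in center_values:
        z = (1.0 - alpha) * z + alpha * x
    return z

print(running_average([1.0, 2.0, 3.0]))   # 2.0 -> the mean of the iterates
```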

All the sequential comparator methods ($p=1$) are listed below:

SGD [1] with a constant learning rate $\eta$.

MVASGD [6] with the moving rate $\alpha$ set to a constant.

We perform experiments in a deep learning setting on two benchmark datasets: CIFAR-10 (we refer to it as CIFAR; downloaded from http://www.cs.toronto.edu/~kriz/cifar.html) and ImageNet ILSVRC 2013 (we refer to it as ImageNet; downloaded from http://image-net.org/challenges/LSVRC/2013). We focus on the image classification task with deep convolutional neural networks. We next explain the experimental setup. The details of the data preprocessing and prefetching are deferred to the Supplement.

For all our experiments we use a GPU cluster interconnected with InfiniBand. Each node has 4 Titan GPU processors, where each local worker corresponds to one GPU processor. The center variable of the master is stored and updated on the centralized parameter server [2] (our implementation is available at https://github.com/sixin-zh/mpiT).

To describe the architecture of the convolutional neural network, we first introduce some notation. Let $(c,y)$ denote the size of the input image to each layer, where $c$ is the number of color channels and $y$ is both the horizontal and the vertical dimension of the input. Let $C$ denote the fully-connected convolutional operator, $P$ the max pooling operator, $D$ the linear operator with dropout rate equal to $0.5$, and $S$ the linear operator with softmax output non-linearity. We use the cross-entropy loss and all inner layers use rectified linear units. For the ImageNet experiment we use an approach similar to [4] with the following 11-layer convolutional neural network: (3,221)C(96,108)P(96,36)C(256,32)P(256,16)C(384,14) C(384,13)C(256,12)P(256,6)D(4096,1)D(4096,1)S(1000,1). For the CIFAR experiment we use an approach similar to [29] with the following 7-layer convolutional neural network: (3,28)C(64,24)P(64,12)C(128,8)P(128,4)C(64,2)D(256,1)S(10,1).

In our experiments all the methods use the same initial parameter chosen randomly, except that we set all the biases to zero for the CIFAR case and to 0.1 for the ImageNet case. This parameter is used to initialize the master and all the local workers (on the contrary, initializing the local workers and the master with different random seeds 'traps' the algorithm in the symmetry breaking phase). We add $l_{2}$-regularization $\frac{\lambda}{2}\|x\|^{2}$ to the loss function $F(x)$. For ImageNet we use $\lambda=10^{-5}$ and for CIFAR we use $\lambda=10^{-4}$. We compute the stochastic gradient using mini-batches of size 128.

For all experiments in this section we use EASGD with $\beta=0.9$ (intuitively, the 'effective $\beta$' in the asynchronous setting is $\beta/\tau=p\alpha=p\eta\rho$, thus $\rho=\frac{\beta}{\tau p\eta}$); for all momentum-based methods we set the momentum term $\delta=0.99$, and finally for MVADOWNPOUR we set the moving rate to $\alpha=0.001$. We start with the experiment on the CIFAR dataset with $p=4$ local workers running on a single computing node. For all the methods, we examined communication periods from the set $\tau\in\{1,4,16,64\}$. For comparison we also report the performance of MSGD, which outperformed SGD, ASGD and MVASGD, as shown in Figure 6 in the Supplement. For each method we examined a wide range of learning rates (the learning rates explored in all experiments are summarized in Tables 1, 2 and 3 in the Supplement). The CIFAR experiment was run 3 times independently from the same initialization, and for each method we report its best performance measured by the smallest achievable test error.
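The relation between $\beta$, $\tau$, $p$, $\eta$ and the per-worker rates can be kept straight with a small bookkeeping sketch (the $\eta$ value below is only an illustrative placeholder, not a setting from the paper):

```python
def easgd_rates(beta=0.9, tau=10, p=4, eta=0.01):
    """From the 'effective beta' relation beta/tau = p*alpha = p*eta*rho,
    solve for the moving rate alpha and the penalty rho. The eta value is
    an illustrative placeholder only."""
    alpha = beta / (tau * p)
    rho = beta / (tau * p * eta)
    return alpha, rho

alpha, rho = easgd_rates()
print(alpha, rho)   # alpha ~ 0.0225, rho ~ 2.25
```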

From the results in Figure 2, we conclude that all DOWNPOUR-based methods achieve their best performance (test error) for small $\tau$ ($\tau\in\{1,4\}$) and become highly unstable for $\tau\in\{16,64\}$, whereas EAMSGD significantly outperforms the comparator methods for all values of $\tau$ by converging faster. It also finds better-quality solutions as measured by the test error, and this advantage becomes more significant for $\tau\in\{16,64\}$. Note that the tendency to achieve better test performance with larger $\tau$ is also characteristic of the EASGD algorithm.

We next explore different numbers of local workers, $p\in\{4,8,16\}$ for the CIFAR experiment and $p\in\{4,8\}$ for the ImageNet experiment (for the ImageNet experiment, the training loss is measured on a subset of the training data of size 50,000). For the ImageNet experiment we report the results of one run with the best setting we have found. EASGD and EAMSGD were run with $\tau=10$, whereas DOWNPOUR and MDOWNPOUR were run with $\tau=1$. The results are in Figures 3 and 4. For the CIFAR experiment, it is noticeable that the lowest test error achievable by either EASGD or EAMSGD decreases with larger $p$. This can potentially be explained by the fact that larger $p$ allows for more exploration of the parameter space. In the Supplement, we further discuss the trade-off between exploration and exploitation as a function of the learning rate (Section 9.5) and the communication period (Section 9.6). Finally, the results obtained for the ImageNet experiment also show the advantage of EAMSGD over the competitor methods.

In this paper we describe a new algorithm called EASGD and its variants for training deep neural networks in the stochastic setting when the computations are parallelized over multiple GPUs. Experiments demonstrate that this new algorithm quickly achieves an improvement in test error compared to more common baseline approaches such as DOWNPOUR and its variants. We show that our approach is very stable and communication efficient. We provide the stability analysis of the asynchronous EASGD in the round-robin scheme, and show the theoretical advantage of the method over ADMM. The different behavior of the EASGD algorithm from that of its momentum-based variant EAMSGD is intriguing and will be studied in future work.

The authors thank R. Power, J. Li for implementation guidance, J. Bruna, O. Henaff, C. Farabet, A. Szlam, Y. Bakhtin for helpful discussion, P. L. Combettes, S. Bengio and the referees for valuable feedback.

Deep learning with Elastic Averaging SGD (Supplementary Material)

We provide here the convergence analysis of the synchronous EASGD algorithm with constant learning rate. The analysis is focused on the convergence of the center variable to the local optimum. We discuss the one-dimensional quadratic case first, then the generalization to the multi-dimensional setting (Lemma 7.3), and finally the strongly convex case (Theorem 7.1).

Our analysis in the quadratic case extends the analysis of ASGD in [6]. Assume each of the $p$ local workers $x^{i}_{t}\in\mathbb{R}^{n}$ observes a noisy gradient at time $t\geq 0$ of the linear form given in Equation 20,

where the matrix $A$ is positive-definite (each eigenvalue is strictly positive) and the $\xi^{i}_{t}$'s are i.i.d. random variables with zero mean and positive-definite covariance $\Sigma$. Let $x^{\ast}$ denote the optimal solution, where $x^{\ast}=A^{-1}b\in\mathbb{R}^{n}$. In this section we analyze the behavior of the mean squared error (MSE) of the center variable $\tilde{x}_{t}$, denoted as $\mathbb{E}[\|\tilde{x}_{t}-x^{\ast}\|^{2}]$, as a function of $t$, $p$, $\eta$, $\alpha$ and $\beta$, where $\beta=p\alpha$. Note that the MSE can be decomposed into a (squared) bias and a variance ($\mathbb{V}$ denotes the variance): $\mathbb{E}[\|\tilde{x}_{t}-x^{\ast}\|^{2}]=\|\mathbb{E}[\tilde{x}_{t}-x^{\ast}]\|^{2}+\mathbb{V}[\tilde{x}_{t}-x^{\ast}]$. For the one-dimensional case ($n=1$), we assume $A=h>0$ and $\Sigma=\sigma^{2}>0$.

Let $\tilde{x}_{0}$ and $\{x^{i}_{0}\}_{i=1,\ldots,p}$ be arbitrary constants; then

where $u_{0}=\sum_{i=1}^{p}\left(x^{i}_{0}-x^{\ast}-\frac{\alpha}{1-p\alpha-\phi}(\tilde{x}_{0}-x^{\ast})\right)$, $a=\eta h+(p+1)\alpha$, $c^{2}=\eta hp\alpha$, $\gamma=1-\frac{a-\sqrt{a^{2}-4c^{2}}}{2}$, and $\phi=1-\frac{a+\sqrt{a^{2}-4c^{2}}}{2}$.

It follows from Lemma 7.1 that for the center variable to be stable the following has to hold

It can be verified that $\phi$ and $\gamma$ are the two roots of the polynomial in $\lambda$: $\lambda^{2}-(2-a)\lambda+(1-a+c^{2})$. Recall that $\phi$ and $\gamma$ are functions of $\eta$ and $\alpha$. Thus (see proof in Section 7.1.2):

$\gamma<1$ iff $c^{2}>0$ (i.e. $\eta>0$ and $\alpha>0$).

$\phi>-1$ iff $(2-\eta h)(2-p\alpha)>2\alpha$ and $(2-\eta h)+(2-p\alpha)>\alpha$.
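The claim that $\gamma$ and $\phi$ are the two roots of $\lambda^{2}-(2-a)\lambda+(1-a+c^{2})$ can be checked numerically; a minimal sketch with sample values of $\eta$, $h$, $p$, $\alpha$ chosen only for illustration:

```python
import math

# Illustrative parameter values (any setting with a^2 > 4 c^2 gives real roots).
eta, h, p, alpha = 0.1, 1.0, 4, 0.05

a = eta * h + (p + 1) * alpha        # a = eta*h + (p+1)*alpha
c2 = eta * h * p * alpha             # c^2 = eta*h*p*alpha
disc = math.sqrt(a * a - 4 * c2)
gamma = 1.0 - (a - disc) / 2.0
phi = 1.0 - (a + disc) / 2.0

# The quadratic lambda^2 - (2 - a) lambda + (1 - a + c^2) should vanish
# at both gamma and phi.
poly = lambda lam: lam**2 - (2.0 - a) * lam + (1.0 - a + c2)
print(abs(poly(gamma)), abs(poly(phi)))   # both ~ 0
```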

The proof of the above lemma is based on the diagonalization of the linear gradient map (this map is symmetric due to the relation $\beta=p\alpha$). The stability analysis of the asynchronous EASGD algorithm in the round-robin scheme is similar, due to this elastic symmetry.

Substituting the gradient from Equation 20 into the update rule used by each local worker in the synchronous EASGD algorithm (Equations 5 and 6), we obtain

where $\eta$ is the learning rate and $\alpha$ is the moving rate. Recall that $\alpha=\eta\rho$ and $A=h$.

We prove the lemma by explicitly solving the linear Equations 24 and 25. Let $x_{t}=(x^{1}_{t},\ldots,x^{p}_{t},\tilde{x}_{t})^{T}$. We rewrite the recursive relation captured in Equations 24 and 25 simply as

where the drift matrix $M$ is defined as

Note that one of the eigenvalues of the matrix $M$, which we call $\phi$, satisfies $(1-\alpha-\eta h-\phi)(1-p\alpha-\phi)=p\alpha^{2}$. The corresponding eigenvector is $(1,1,\ldots,1,-\frac{p\alpha}{1-p\alpha-\phi})^{T}$. Let $u_{t}$ be the projection of $x_{t}$ onto this eigenvector, so that $u_{t}=\sum_{i=1}^{p}(x^{i}_{t}-\frac{\alpha}{1-p\alpha-\phi}\tilde{x}_{t})$. Let furthermore $\xi_{t}=\sum_{i=1}^{p}\xi_{t}^{i}$. Therefore we have
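The drift matrix and its claimed eigenstructure can be verified numerically; a hedged sketch for the noiseless one-dimensional case, with illustrative parameter values:

```python
import numpy as np

# Illustrative values; any choice with a^2 > 4 c^2 works.
eta, h, p, alpha = 0.1, 1.0, 3, 0.05

# Noiseless drift matrix M: the first p rows update the local workers,
# the last row updates the center variable.
M = np.zeros((p + 1, p + 1))
M[:p, :p] = (1.0 - eta * h - alpha) * np.eye(p)  # local contraction
M[:p, p] = alpha                                 # pull toward the center
M[p, :p] = alpha                                 # center moves toward workers
M[p, p] = 1.0 - p * alpha

# phi solves (1 - alpha - eta*h - phi)(1 - p*alpha - phi) = p*alpha^2;
# it is the smaller root of the corresponding quadratic.
a = eta * h + (p + 1) * alpha
c2 = eta * h * p * alpha
phi = 1.0 - (a + np.sqrt(a * a - 4 * c2)) / 2.0

# The claimed eigenvector (1, ..., 1, -p*alpha / (1 - p*alpha - phi)).
v = np.ones(p + 1)
v[p] = -p * alpha / (1.0 - p * alpha - phi)
print(np.allclose(M @ v, phi * v))   # True
```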

By combining Equations 25 and 26 as follows,

where the last step results from the following relations: $\frac{p\alpha^{2}}{1-p\alpha-\phi}=1-\alpha-\eta h-\phi$ and $\phi+\gamma=(1-\alpha-\eta h)+(1-p\alpha)$. Thus we obtain

Based on Equations 26 and 27, we can then expand $u_{t}$ and $\tilde{x}_{t}$ recursively,

Substituting $u_{0},u_{1},\dots,u_{t}$, each given by Equation 28, into Equation 29, we obtain

To be more specific, Equation 30 is obtained by summation by parts,

Since the random variables $\xi_{l}$ are i.i.d., we may sum the variance term by term as follows

Note that $\mathbb{E}[\xi_{t}]=\sum_{i=1}^{p}\mathbb{E}[\xi_{t}^{i}]=0$ and $\mathbb{V}[\xi_{t}]=\sum_{i=1}^{p}\mathbb{V}[\xi_{t}^{i}]=p\sigma^{2}$. These two facts, together with the equalities in Equations 30 and 31, can then be used to compute $\mathbb{E}[\tilde{x}_{t}]$ and $\mathbb{V}[\tilde{x}_{t}]$ as given in Equations 21 and 22 in Lemma 7.1. ∎

In Figure 5, we illustrate the dependence of the MSE on $\beta$, $\eta$ and the number of processors $p$ over time $t$. We consider the large-noise setting where $\tilde{x}_{0}=x^{i}_{0}=1$, $h=1$ and $\sigma=10$. The MSE is color-coded such that deep blue corresponds to an MSE of $10^{-3}$, green corresponds to an MSE of $1$, red corresponds to an MSE of $10^{3}$, and dark red corresponds to the divergence of the EASGD algorithm (the condition in Equation 23 is then violated). The plot shows that we can achieve significant variance reduction by increasing the number of local workers $p$. This effect is less sensitive to the choice of $\beta$ and $\eta$ for large $p$.
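The variance-reduction effect can also be reproduced by direct simulation; a minimal Monte Carlo sketch of synchronous EASGD on the noisy one-dimensional quadratic, with parameter values chosen only for illustration (not the exact grid of Figure 5):

```python
import numpy as np

def easgd_mse(p, trials=2000, steps=200, eta=0.1, alpha=0.05,
              h=1.0, sigma=10.0, seed=0):
    """Simulate synchronous EASGD on F(x) = h x^2 / 2 with i.i.d. Gaussian
    gradient noise; return the MSE of the center variable at the final step.
    Illustrative parameters; beta = p*alpha as in the text."""
    rng = np.random.default_rng(seed)
    x = np.ones((trials, p))      # local workers, x_0^i = 1
    xc = np.ones(trials)          # center variable, x~_0 = 1
    for _ in range(steps):
        noise = rng.normal(0.0, sigma, size=(trials, p))
        grad = h * x - noise                  # noisy gradient of F
        diff = x - xc[:, None]
        x = x - eta * grad - alpha * diff     # local update (Eq. 24 style)
        xc = xc + alpha * diff.sum(axis=1)    # center update (Eq. 25 style)
    return np.mean(xc**2)                     # optimum is x* = 0

mse1, mse8 = easgd_mse(p=1), easgd_mse(p=8)
print(mse1 > mse8)   # True: more workers -> smaller MSE of the center
```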

We are going to show that

Recall that $a=\eta h+(p+1)\alpha$, $c^{2}=\eta hp\alpha$, $\gamma=1-\frac{a-\sqrt{a^{2}-4c^{2}}}{2}$, $\phi=1-\frac{a+\sqrt{a^{2}-4c^{2}}}{2}$, and $\beta=p\alpha$. We have

$\gamma<1\Leftrightarrow\frac{a-\sqrt{a^{2}-4c^{2}}}{2}>0\Leftrightarrow a>\sqrt{a^{2}-4c^{2}}\Leftrightarrow a^{2}>a^{2}-4c^{2}\Leftrightarrow c^{2}>0$.

The next corollary is a consequence of Lemma 7.1. As the number of workers $p$ grows, the averaging property of EASGD can be characterized as follows.

Let the elastic averaging relation $\beta=p\alpha$ and the condition in Equation 23 hold; then

Note that when $\beta$ is fixed, $\lim_{p\rightarrow\infty}a=\eta h+\beta$ and $c^{2}=\eta h\beta$. Then $\lim_{p\rightarrow\infty}\phi=\min(1-\beta,1-\eta h)$ and $\lim_{p\rightarrow\infty}\gamma=\max(1-\beta,1-\eta h)$. Also note that using Lemma 7.1 we obtain

Corollary 7.1 is obtained by plugging in the limiting values of $\phi$ and $\gamma$. ∎

The crucial point of Corollary 7.1 is that the MSE in the limit $t\rightarrow\infty$ is of order $1/p$, which implies that as the number of processors $p$ grows, the MSE of the EASGD algorithm decreases. Also note that the smaller $\beta$ is (recall that $\beta=p\alpha=p\eta\rho$), the more exploration is allowed (small $\rho$) and, simultaneously, the smaller the MSE is.

The next lemma (Lemma 7.2) shows that the EASGD algorithm achieves the highest possible rate of convergence when we consider the double averaging sequence (similarly to [6]) $\{z_{1},z_{2},\dots\}$ defined below.

If the condition in Equation 23 holds, then the normalized double averaging sequence defined in Equation 32 converges weakly to the normal distribution with zero mean and variance $\sigma^{2}/ph^{2}$,

Also recall that the $\xi^{i}_{t}$'s are i.i.d. random variables (noise) with zero mean and the same covariance $\Sigma\succ 0$. We are interested in the asymptotic behavior of the double averaging sequence $\{z_{1},z_{2},\dots\}$ defined as

Recall Equation 30 from the proof of Lemma 7.1 (for convenience it is reproduced below):

Note that the only non-vanishing term (in weak convergence) of $\frac{1}{\sqrt{t}}\sum_{k=0}^{t}\tilde{x}_{k}$ as $t\rightarrow\infty$ is

Also recall that $\mathbb{V}[\xi_{l-1}]=p\sigma^{2}$ and

The asymptotic variance in Lemma 7.2 is optimal for any fixed $\eta$ and $\beta$ for which Equation 23 holds. The next lemma (Lemma 7.3) extends the result of Lemma 7.2 to the multi-dimensional setting.

Let $h$ denote the largest eigenvalue of $A$. If $(2-\eta h)(2-\beta)>2\beta/p$, $(2-\eta h)+(2-\beta)>\beta/p$, $\eta>0$ and $\beta>0$, then the normalized double averaging sequence converges weakly to the normal distribution with zero mean and covariance matrix $V=A^{-1}\Sigma(A^{-1})^{T}$,

Since $A$ is symmetric, one can use the proof technique of Lemma 7.2 to prove Lemma 7.3 by diagonalizing the matrix $A$. This diagonalization essentially generalizes Lemma 7.1 to the multi-dimensional case. We will not go into the details of this proof, as we will provide a simpler way to look at the system. As in the proofs of Lemma 7.1 and Lemma 7.2, for ease of notation we redefine $\tilde{x}_{t}$ and $x^{i}_{t}$ as follows:

Let the spatial average of the local parameters at time $t$ be denoted $y_{t}=\frac{1}{p}\sum_{i=1}^{p}x_{t}^{i}$, and let the average noise be denoted $\xi_{t}=\frac{1}{p}\sum_{i=1}^{p}\xi_{t}^{i}$. Equations 24 and 25 can then be reduced to the following

We focus on the case where the learning rate $\eta$ and the moving rate $\alpha$ are kept constant over time (as a side note, notice that the center parameter $\tilde{x}_{t}$ tracks the spatial average $y_{t}$ of the local parameters with a non-symmetric spring in Equations 37 and 38: the update on $y_{t+1}$ contains $(\tilde{x}_{t}-y_{t})$ scaled by $\alpha$, whereas the update on $\tilde{x}_{t+1}$ contains $-(\tilde{x}_{t}-y_{t})$ scaled by $\beta$; since $\alpha=\beta/p$, the impact of the center $\tilde{x}_{t+1}$ on the spatial average $y_{t+1}$ becomes more negligible as $p$ grows). Recall that $\beta=p\alpha$ and $\alpha=\eta\rho$.

Let us introduce the block notation $U_{t}=(y_{t},\tilde{x}_{t})$, $\Xi_{t}=(\eta\xi_{t},0)$, $M=I-\eta L$, and

From Equations 37 and 38 it follows that $U_{t+1}=MU_{t}+\Xi_{t}$. Note that this linear system has a degenerate noise $\Xi_{t}$, which prevents us from directly applying the results of [6]. Expanding this recursive relation and summing by parts, we have

By Lemma 7.4, $\|M\|_{2}<1$ and thus

Since $A$ is invertible, we get

where $V=A^{-1}\Sigma(A^{-1})^{T}$. ∎

If the following conditions hold:

The eigenvalue $\lambda$ of $M$ and the (non-zero) eigenvector $(y,z)$ of $M$ satisfy

Recall that

Since $(y,z)$ is assumed to be non-zero, we can write $z=\beta y/(\lambda+\beta-1)$. Then Equation 50 can be reduced to

Thus $y$ is an eigenvector of $A$. Let $\lambda_{A}$ be the eigenvalue of the matrix $A$ such that $Ay=\lambda_{A}y$. Based on Equation 51, it follows that

Equation 52 is equivalent to

where $a=\eta\lambda_{A}+(p+1)\alpha$ and $c^{2}=\eta\lambda_{A}p\alpha$. It follows from the condition in Equation 23 that $-1<\lambda<1$ iff $\eta>0$, $\beta>0$, $(2-\eta\lambda_{A})(2-\beta)>2\beta/p$ and $(2-\eta\lambda_{A})+(2-\beta)>\beta/p$. Let $h$ denote the maximum eigenvalue of $A$ and note that $2-\eta\lambda_{A}\geq 2-\eta h$. This implies that the condition of our lemma is sufficient. ∎

As in Lemma 7.2, the asymptotic covariance in Lemma 7.3 is optimal, i.e. it meets the Fisher information lower bound. The fact that this asymptotic covariance matrix $V$ does not contain any term involving $\rho$ is quite remarkable, since the penalty term $\rho$ does have an impact on the condition number of the Hessian in Equation 2.

We now extend the above proof ideas to analyze the strongly convex case, in which the noisy gradient $g_{t}^{i}(x)=\nabla F(x)-\xi^{i}_{t}$ has the regularity that there exist some $0<\mu\leq L$ for which $\mu\|x-y\|^{2}\leq\langle\nabla F(x)-\nabla F(y),x-y\rangle\leq L\|x-y\|^{2}$ holds uniformly for any $x\in\mathbb{R}^{d},y\in\mathbb{R}^{d}$. The noise terms $\xi^{i}_{t}$ are assumed to be i.i.d. with zero mean and bounded variance $\mathbb{E}[\|\xi^{i}_{t}\|^{2}]\leq\sigma^{2}$.

Let $a_{t}=\mathbb{E}\|\frac{1}{p}\sum_{i=1}^{p}x^{i}_{t}-x^{\ast}\|^{2}$, $b_{t}=\frac{1}{p}\sum_{i=1}^{p}\mathbb{E}\|x^{i}_{t}-x^{\ast}\|^{2}$, $c_{t}=\mathbb{E}\|\tilde{x}_{t}-x^{\ast}\|^{2}$, $\gamma_{1}=2\eta\frac{\mu L}{\mu+L}$ and $\gamma_{2}=2\eta L(1-\frac{2\sqrt{\mu L}}{\mu+L})$. If $0\leq\eta\leq\frac{2}{\mu+L}(1-\alpha)$, $0\leq\alpha<1$ and $0\leq\beta\leq 1$, then

The idea of the proof is based on the point of view of Lemma 7.3, i.e. how close the center variable $\tilde{x}_{t}$ is to the spatial average of the local variables $y_{t}=\frac{1}{p}\sum_{i=1}^{p}x^{i}_{t}$. To further simplify the notation, let the noisy gradient be $\nabla f^{i}_{t,\xi}=g_{t}^{i}(x_{t}^{i})=\nabla F(x_{t}^{i})-\xi^{i}_{t}$, and let $\nabla f^{i}_{t}=\nabla F(x_{t}^{i})$ be its deterministic part. The EASGD updates can then be rewritten as follows,

We thus have the update for the spatial average,

The idea of the proof is to bound the distance $\|\tilde{x}_{t}-x^{\ast}\|^{2}$ through $\|y_{t}-x^{\ast}\|^{2}$ and $\frac{1}{p}\sum_{i=1}^{p}\|x^{i}_{t}-x^{\ast}\|^{2}$. We start from the following estimate for strongly convex functions [31],

Since $\nabla f(x^{\ast})=0$, we have

By the cosine rule ($2\langle a-b,c-d\rangle=\|a-d\|^{2}-\|a-c\|^{2}+\|c-b\|^{2}-\|d-b\|^{2}$), we have

By the Cauchy-Schwarz inequality, we have

Combining the above estimates in Equations 57, 58, 59 and 60, we obtain

Choosing $0\leq\alpha<1$, we have the following upper bound: $\alpha^{2}\|x^{i}_{t}-\tilde{x}_{t}\|^{2}-\alpha\|x^{i}_{t}-\tilde{x}_{t}\|^{2}+2\eta\alpha\|\nabla f^{i}_{t}\|\|x^{i}_{t}-\tilde{x}_{t}\|=-\alpha(1-\alpha)\|x^{i}_{t}-\tilde{x}_{t}\|^{2}+2\eta\alpha\|\nabla f^{i}_{t}\|\|x^{i}_{t}-\tilde{x}_{t}\|\leq\frac{\eta^{2}\alpha}{1-\alpha}\|\nabla f^{i}_{t}\|^{2}$, obtained by applying $-ax^{2}+bx\leq\frac{b^{2}}{4a}$ with $x=\|x^{i}_{t}-\tilde{x}_{t}\|$. Thus we can further bound Equation 61 with

As in Equations 62 and 63, since the noise $\xi^{i}_{t}$ has zero mean ($\mathbb{E}\xi^{i}_{t}=0$) and bounded variance ($\mathbb{E}\|\xi^{i}_{t}\|^{2}\leq\sigma^{2}$), if $\eta$ is chosen small enough that $\eta^{2}+\frac{\eta^{2}\alpha}{1-\alpha}-\frac{2\eta}{\mu+L}\leq 0$, then

We now apply a similar idea to estimate $\|y_{t}-x^{\ast}\|^{2}$. From Equation 56 the following relation holds,

By the identity $\langle\frac{1}{p}\sum_{i=1}^{p}a_{i},\frac{1}{p}\sum_{j=1}^{p}b_{j}\rangle=\frac{1}{p}\sum_{i=1}^{p}\langle a_{i},b_{i}\rangle-\frac{1}{p^{2}}\sum_{i>j}\langle a_{i}-a_{j},b_{i}-b_{j}\rangle$, we have

Since $\frac{2\sqrt{\mu L}}{\mu+L}\leq 1$, we also need to bound the non-linear term $\langle\nabla f^{i}_{t}-\nabla f^{j}_{t},x_{t}^{i}-x_{t}^{j}\rangle\leq L\|x_{t}^{i}-x_{t}^{j}\|^{2}$. Recall the bias-variance relation $\frac{1}{p}\sum_{i=1}^{p}\|x^{i}_{t}-x^{\ast}\|^{2}=\frac{1}{p^{2}}\sum_{i>j}\|x^{i}_{t}-x^{j}_{t}\|^{2}+\|y_{t}-x^{\ast}\|^{2}$. The key observation is that if $\frac{1}{p}\sum_{i=1}^{p}\|x^{i}_{t}-x^{\ast}\|^{2}$ remains bounded, then a larger variance $\sum_{i>j}\|x^{i}_{t}-x^{j}_{t}\|^{2}$ implies a smaller bias $\|y_{t}-x^{\ast}\|^{2}$. Thus this non-linear term can be compensated.

Again choosing $\eta$ small enough that $\eta^{2}+\frac{\eta^{2}\alpha}{1-\alpha}-\frac{2\eta}{\mu+L}\leq 0$ and taking expectations in Equation 75,

Combining the estimates from Equations 64, 76 and 77, and denoting $a_{t}=\mathbb{E}\|y_{t}-x^{\ast}\|^{2}$, $b_{t}=\frac{1}{p}\sum_{i=1}^{p}\mathbb{E}\|x^{i}_{t}-x^{\ast}\|^{2}$, $c_{t}=\mathbb{E}\|\tilde{x}_{t}-x^{\ast}\|^{2}$, $\gamma_{1}=2\eta\frac{\mu L}{\mu+L}$ and $\gamma_{2}=2\eta L(1-\frac{2\sqrt{\mu L}}{\mu+L})$, we obtain

Algorithms 4 and 5 present the pseudo-code of the momentum DOWNPOUR (MDOWNPOUR) implementation used in this paper: Algorithm 4 shows the behavior of each local worker and Algorithm 5 shows the behavior of the master.
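Since Algorithms 4 and 5 are not reproduced in this text, the following is a minimal single-process sketch of the scheme as described: the master holds the center variable and a momentum buffer, workers send gradients in round-robin order, and the master applies a momentum update and returns the new center. The quadratic objective, the constants, and the schedule are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative 1-D quadratic objective f(x) = 0.5 * h * (x - x_star)^2.
h, x_star = 1.0, 2.0
eta, delta = 0.05, 0.5         # learning rate and momentum term (assumed names)
p, steps = 4, 200

center = 0.0                   # center variable kept by the master
velocity = 0.0                 # master-side momentum buffer
workers = [0.0] * p            # each worker's (possibly stale) copy of the center

for t in range(steps):
    i = t % p                              # round-robin worker
    grad = h * (workers[i] - x_star)       # gradient at the worker's stale copy
    # Master: momentum update of the center with the received gradient.
    velocity = delta * velocity - eta * grad
    center += velocity
    workers[i] = center                    # worker pulls the updated center

print(center)   # converges toward x_star = 2.0
```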

For the ImageNet experiment, we re-size each RGB image so that its smallest dimension is 256 pixels. We also re-scale each pixel value to the interval $[0,1]$. We then extract random crops (and their horizontal flips) of size $3 \times 221 \times 221$ pixels and present these to the network in mini-batches of size 128.

The training and test loss and the test error are computed only from the center patch ($3 \times 28 \times 28$) for the CIFAR experiment and the center patch ($3 \times 221 \times 221$) for the ImageNet experiment.

We now explain how the dataset is sampled by each local worker as uniformly and efficiently as possible. The general parallel data loading scheme on a single machine is as follows: we use $k=8$ CPUs to load the data in parallel. Each data loader reads from a memory-mapped (mmap) file a chunk of $c$ raw images (preprocessing was described in the previous subsection) and their labels ($c=512$ for CIFAR and $c=64$ for ImageNet). For CIFAR, the mmap file of each data loader contains the entire dataset, whereas for ImageNet each data loader's mmap file contains a different $1/k$ fraction of the entire dataset. A chunk of data is always sent by one of the data loaders to the first worker that requests it; the next worker requesting data from the same loader gets the next chunk. Each worker requests in total $k$ data chunks from the $k$ different data loaders and processes them before asking for new chunks. Notice that each data loader cycles through the data in its mmap file, sending consecutive chunks to the workers in the order in which it receives their requests. When a data loader reaches the end of its mmap file, it selects an address in memory uniformly at random from the interval $[0,s]$, where $s = (\text{number of images in the mmap file}) \bmod (\text{mini-batch size})$, and uses this address to start cycling through the data again. After a local worker receives the $k$ data chunks from the data loaders, it shuffles them and divides them into mini-batches of size 128.
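The chunk-serving scheme described above can be sketched as follows; plain lists stand in for the mmap files, the sizes ($k=8$, chunks of 64, mini-batches of 128) follow the text, and everything else (function names, the wrap-around offset test) is an illustrative assumption:

```python
import random

def make_loader(images, chunk_size, mini_batch=128):
    """Cycle through `images`, yielding consecutive chunks; on wrap-around,
    restart from an offset drawn uniformly from [0, len(images) % mini_batch)."""
    pos = 0
    while True:
        if pos + chunk_size > len(images):
            pos = random.randrange(max(1, len(images) % mini_batch))
        yield images[pos:pos + chunk_size]
        pos += chunk_size

k, chunk_size, mini_batch = 8, 64, 128
# ImageNet-style setup: each loader owns a different 1/k fraction of the data.
loaders = [make_loader(list(range(l * 1000, (l + 1) * 1000)), chunk_size)
           for l in range(k)]

# One worker requests one chunk from each of the k loaders...
samples = [img for loader in loaders for img in next(loader)]
random.shuffle(samples)          # ...then shuffles the pooled samples...
batches = [samples[i:i + mini_batch]
           for i in range(0, len(samples), mini_batch)]
print(len(samples), [len(b) for b in batches])   # 512 samples -> 4 batches of 128
```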

In Table 1 we summarize the learning rates $\eta$ (we used constant learning rates) explored for each method shown in Figure 2. For all values of $\tau$ the same set of learning rates was explored for each method.

In Table 3 we summarize the initial learning rates $\eta$ we use for each method shown in Figure 4. For all values of $p$ the same set of learning rates was explored for each method. We also used a rule of thumb to decrease the initial learning rate twice: the first time we divided it by 5 and the second time by 2, when we observed that the decrease of the online predictive (training) loss saturates.
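The two-stage decrease can be sketched as follows; the windowed saturation test is an assumption (the text does not specify the exact criterion), and only the divide-by-5-then-by-2 rule is from the paper:

```python
def decayed_lr(eta0, losses, window=3, tol=1e-3):
    """Divide the initial rate by 5 at the first saturation of the training
    loss and by a further 2 at the second; the saturation test (loss improved
    by less than `tol` over `window` steps) is an illustrative assumption."""
    eta, drops = eta0, 0
    for t in range(window, len(losses)):
        saturated = losses[t - window] - losses[t] < tol
        if saturated and drops == 0:
            eta, drops = eta / 5, 1
        elif saturated and drops == 1:
            eta, drops = eta / 2, 2
    return eta

# Loss plateaus after step 3: two saturations, so 0.1 -> 0.02 -> 0.01.
final_eta = decayed_lr(0.1, [1.0, 0.5, 0.25, 0.2, 0.2, 0.2, 0.2, 0.2])
print(final_eta)
```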

Figure 6 shows the convergence of the training and test loss (negative log-likelihood) and the test error computed for the center variable as a function of wallclock time for SGD, ASGD, MVASGD and MSGD ($p=1$) on the CIFAR experiment. For all CIFAR experiments we always start the averaging for the ADOWNPOUR and ASGD methods from the very beginning of each experiment. For all ImageNet experiments we start the averaging for ASGD at the time when we first reduce the initial learning rate.

This section discusses the dependence of the trade-off between exploration and exploitation on the learning rate. We compare the performance of EAMSGD and EASGD for different learning rates $\eta$ with $p=16$ and $\tau=10$ on the CIFAR experiment. We observe in Figure 8 that higher learning rates $\eta$ lead to better test performance for the EAMSGD algorithm, which can potentially be justified by the fact that they sustain higher fluctuations of the local workers. We conjecture that higher fluctuations lead to more exploration, and simultaneously they also impose higher regularization. The picture, however, appears to be the opposite for the EASGD algorithm, for which larger learning rates hurt performance and lead to overfitting. Interestingly, in this experiment, for both the EASGD and EAMSGD algorithms the learning rate that achieved the best training performance simultaneously led to the worst test performance.

This section discusses the dependence of the trade-off between exploration and exploitation on the communication period. We have observed in the CIFAR experiment that the EASGD algorithm exhibits very similar convergence behavior for $\tau=1$ up to even $\tau=1000$, whereas EAMSGD can get trapped at a worse energy (loss) level for $\tau=100$. This behavior of EAMSGD is most likely due to the non-convexity of the objective function. Luckily, it can be avoided by gradually decreasing the learning rate, i.e. increasing the penalty term $\rho$ (recall $\alpha=\eta\rho$), as shown in Figure 9. In contrast, the EASGD algorithm does not seem to get trapped at all along its trajectory. The performance of EASGD is less sensitive to increasing the communication period than that of EAMSGD, whereas for EAMSGD a careful choice of the learning rate for large communication periods seems crucial.

Compared to all earlier results, the experiment in this section is re-run three times with a new random seed (the random initialization we use is the default in Torch's implementation) and with the faster cuDNN package (https://developer.nvidia.com/cuDNN, https://github.com/soumith/cudnn.torch). All our methods are implemented in Torch (http://torch.ch). The Message Passing Interface implementation MVAPICH2 (http://mvapich.cse.ohio-state.edu) is used for the GPU-CPU communication.

In addition, we report in Table 4 the breakdown of the total running time for EASGD when $\tau=10$ (the time breakdown for EAMSGD is almost identical) and DOWNPOUR when $\tau=1$ into computation time, data loading time and parameter communication time. For the CIFAR experiment the reported time corresponds to processing $400 \times 128$ data samples, whereas for the ImageNet experiment it corresponds to processing $1024 \times 128$ data samples. For $\tau=1$ and $p \in \{8,16\}$ we observe that the communication time accounts for a significant portion of the total running time, whereas for $\tau=10$ it becomes negligible (recall that, based on previous results, EASGD and EAMSGD achieve their best performance with larger $\tau$, which is ideal in the setting where communication is time-consuming).

In Figures 10 and 11 we summarize the wallclock time needed to achieve the same level of test error for all methods in the CIFAR and ImageNet experiments, as a function of the number of local workers p. For CIFAR (Figure 10) we examined the levels {21%, 20%, 19%, 18%} and for ImageNet (Figure 11) we examined {49%, 47%, 45%, 43%}. If a method does not appear in the figure for a given test error level, it never achieved that level. For the CIFAR experiment we observe that, from among the EASGD, DOWNPOUR and MDOWNPOUR methods, EASGD needs the least time to achieve a particular level of test error. We also observe that a higher p does not necessarily reduce the time each of these methods needs to achieve the same level of test error. This seems counterintuitive, though recall that the learning rate for each method is selected based on the smallest achievable test error; for larger p smaller learning rates were selected than for smaller p, which explains this result. Meanwhile, the EAMSGD method achieves a significant speed-up over the other methods for all test error levels. For the ImageNet experiment we observe that all methods outperform MSGD, and furthermore with p = 4 or p = 8 each of these methods requires less time to achieve the same level of test error. EAMSGD consistently needs less time than any other method, in particular DOWNPOUR, to achieve any of the test error levels.

Table: S9.T1: Learning rates explored for each method shown in Figure 2 (CIFAR experiment).

Method                              η
EASGD                               {0.05, 0.01, 0.005}
EAMSGD                              {0.01, 0.005, 0.001}
DOWNPOUR, ADOWNPOUR, MVADOWNPOUR    {0.005, 0.001, 0.0005}
MDOWNPOUR                           {0.00005, 0.00001, 0.000005}
SGD, ASGD, MVASGD                   {0.05, 0.01, 0.005}
MSGD                                {0.001, 0.0005, 0.0001}

Table: S9.T4: Approximate computation time, data loading time and parameter communication time [sec] for DOWNPOUR (top row, $\tau=1$) and EASGD (bottom row, $\tau=10$; the time breakdown for EAMSGD is almost identical). The left table corresponds to the CIFAR experiment and the right table to the ImageNet experiment. Entries are computation/loading/communication.

        p=1       p=4       p=8       p=16
τ=1     12/1/0    11/2/3    11/2/5    11/2/9
τ=10    NA        11/2/1    11/2/1    12/2/1

Figure: The largest absolute eigenvalue of the linear map $\mathcal{F} = F^{p}_{3} \circ F^{p}_{2} \circ F^{p}_{1} \circ \ldots \circ F^{1}_{3} \circ F^{1}_{2} \circ F^{1}_{1}$ as a function of $\eta \in (0,10^{-2})$ and $\rho \in (0,10)$ when $p=3$ and $p=8$. To simulate the chaotic behavior of the ADMM algorithm, one may pick $\eta=0.001$ and $\rho=2.5$ and initialize the state $s_{0}$ either randomly or with $\lambda^{i}_{0}=0$, $x^{i}_{0}=\tilde{x}_{0}=1000$ for all $i$. The figure should be read in color.

Figure 2: Training and test loss and the test error for the center variable versus wallclock time for different communication periods $\tau$ on the CIFAR dataset with the 7-layer convolutional neural network.

Figure 4: Training and test loss and the test error for the center variable versus wallclock time for different numbers of local workers $p$ (MSGD uses $p=1$) on ImageNet with the 11-layer convolutional neural network. The initial learning rate is decreased twice, by a factor of 5 and then 2, when we observe that the online predictive loss [30] stagnates. EAMSGD achieves significant acceleration compared to the other methods, e.g. the relative speed-up for $p=8$ (the best comparator method is then DOWNPOUR) to achieve a test error of 49% equals 1.8, and it simultaneously reduces the communication overhead (DOWNPOUR uses communication period $\tau=1$ and EAMSGD uses $\tau=10$).

Figure: Theoretical mean squared error (MSE) of the center $\tilde{x}$ in the quadratic case, with various choices of the learning rate $\eta$ (horizontal within each block) and the moving rate $\beta=p\alpha$ (vertical within each block), the number of processors $p \in \{1,10,100,1000,10000\}$ (vertical across blocks), and the time steps $t \in \{1,2,10,100,\infty\}$ (horizontal across blocks). The MSE is plotted in log scale, ranging from $10^{-3}$ to $10^{3}$ (from deep blue to red). Dark red (i.e. the upper-right corners) indicates divergence.

Figure 6: Convergence of the training and test loss (negative log-likelihood) and the test error (original and zoomed) computed for the center variable as a function of wallclock time for SGD, ASGD, MVASGD and MSGD ($p=1$) on the CIFAR experiment.


$$ \min_{x} F(x) := \mathbb{E}[f(x,\xi)], \label{eq:original} $$ \tag{eq:original}

$$ \min_{x^1,\ldots,x^p,\tilde{x}} \sum_{i=1}^p \mathbb{E}[f(x^i,\xi^i)] + \frac{\rho}{2}\|x^i - \tilde{x}\|^2, \label{eq:penalty} $$ \tag{eq:penalty}

$$ F_{1}^{1} = \begin{pmatrix} 1 & -1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \quad F_{2}^{1} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ \frac{\eta\rho}{1+\eta\rho} & \frac{1-\eta}{1+\eta\rho} & 0 & 0 & \frac{\eta\rho}{1+\eta\rho} \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \quad F_{3}^{1} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ -\frac{1}{p} & \frac{1}{p} & -\frac{1}{p} & \frac{1}{p} & 0 \end{pmatrix}. $$ \tag{S4.Ex2}

$$ F^{1} = \begin{pmatrix} 1-\eta-\alpha & 0 & \alpha \\ 0 & 1 & 0 \\ \alpha & 0 & 1-\alpha \end{pmatrix}, \quad F^{2} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1-\eta-\alpha & \alpha \\ 0 & \alpha & 1-\alpha \end{pmatrix} $$ \tag{S4.Ex3}

$$ g_t^i(x^i_t) = A x^i_t - b - \xi^i_t, \quad i \in \{1,\ldots,p\}, \label{eq:gradlform} $$ \tag{eq:gradlform}

$$ -1 < \phi < \gamma < 1. \label{eq:condition} $$ \tag{eq:condition}

$$ \tilde{x}_t \triangleq \tilde{x}_t - x^\ast \quad\text{and}\quad x^i_t \triangleq x^i_t - x^\ast. $$

$$ \tilde{x}_{t+1} = \gamma \tilde{x}_t + \alpha u_t. \label{eq:proj2} $$ \tag{eq:proj2}

$$ M = \begin{bmatrix} 1-\alpha-\eta h & 0 & \cdots & 0 & \alpha \\ 0 & 1-\alpha-\eta h & \cdots & 0 & \alpha \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1-\alpha-\eta h & \alpha \\ \alpha & \alpha & \cdots & \alpha & 1-p\alpha \end{bmatrix}, $$

$$ z_{t+1} = \frac{1}{t+1} \sum_{k=0}^{t} \tilde{x}_k. \label{eq:doubleseq} $$ \tag{eq:doubleseq}

$$ \eta \lambda_A = (1-\alpha -\lambda) + \frac{\alpha \beta}{\lambda + \beta - 1}. \label{eq:intermediate} $$ \tag{eq:intermediate}


$$ \tilde{x}_t = \gamma^t \tilde{x}_0 + \frac{\gamma^t-\phi^t}{\gamma-\phi}\, \alpha u_0 + \alpha \eta \sum_{l=1}^{t-1} \frac{\gamma^{t-l}-\phi^{t-l}}{\gamma-\phi}\, \xi_{l-1}. \label{eq:lem1e} $$ \tag{eq:lem1e}

$$ \displaystyle\mathbb{V}[\tilde{x}_{t}-x^{\ast}]=\frac{p^{2}\alpha^{2}\eta^{2}}{(\gamma-\phi)^{2}}\bigg{(}\frac{\gamma^{2}-\gamma^{2t}}{1-\gamma^{2}}+\frac{\phi^{2}-\phi^{2t}}{1-\phi^{2}}-2\frac{\gamma\phi-(\gamma\phi)^{t}}{1-\gamma\phi}\bigg{)}\frac{\sigma^{2}}{p}, $$

$$ \begin{aligned} \tilde{x}_{t+1} &= \tilde{x}_t + \sum_{i=1}^p \alpha (x^i_t - \tilde{x}_t) = (1-p\alpha) \tilde{x}_t + \alpha \Big(u_t + \frac{p\alpha}{1-p\alpha-\phi} \tilde{x}_t\Big) \\ &= \Big(1-p\alpha+\frac{p\alpha^2}{1-p\alpha-\phi}\Big) \tilde{x}_t + \alpha u_t = \gamma \tilde{x}_t + \alpha u_t, \end{aligned} $$

$$ u_{t+1} = \phi^{t+1} u_0 + \phi^{t} (\eta \xi_0) + \ldots + \phi^{0} (\eta \xi_t), \label{eq:lem1c} \\ \tilde{x}_{t+1} = \gamma^{t+1} \tilde{x}_0 + \gamma^{t} (\alpha u_0) + \ldots + \gamma^{0} (\alpha u_t). \label{eq:lem1d} $$ \tag{eq:lem1c, eq:lem1d}

$$ \lim_{p \rightarrow \infty}\lim_{t \rightarrow \infty} p\mathbb{E} [ (\tilde{x}_{t} - x^{*})^2] = \frac{\beta \eta h}{(2-\beta)(2-\eta h)} \cdot \frac{2 -\beta -\eta h + \beta \eta h}{\beta + \eta h - \beta \eta h } \cdot \frac{\sigma^2}{h^2}. $$


$$ L = \begin{pmatrix} A + \frac{\alpha}{\eta} I & -\frac{\alpha}{\eta} I \\ -\frac{\beta}{\eta} I & \frac{\beta}{\eta} I \end{pmatrix}. $$

$$ \frac{1}{\sqrt{t}} \sum_{k=0}^t U_k = \frac{1}{\sqrt{t}} U_0 + \frac{1}{\sqrt{t}} \eta L^{-1} \sum_{k=1}^{t} \Xi_{k-1} - \frac{1}{\sqrt{t}} \sum_{k=1}^{t} M^{k+1} \Xi_{k-1}. $$

$$ M \begin{pmatrix} y \\ z \end{pmatrix} = \lambda \begin{pmatrix} y \\ z \end{pmatrix}. \label{eq:oneM} $$ \tag{eq:oneM}

$$ \begin{pmatrix} a_{t+1} \\ b_{t+1} \\ c_{t+1} \end{pmatrix} \leq \begin{pmatrix} 1-\gamma_{1}-\gamma_{2}-\alpha & \gamma_{2} & \alpha \\ 0 & 1-\gamma_{1}-\alpha & \alpha \\ \beta & 0 & 1-\beta \end{pmatrix} \begin{pmatrix} a_{t} \\ b_{t} \\ c_{t} \end{pmatrix} + \begin{pmatrix} \eta^{2}\frac{\sigma^{2}}{p} \\ \eta^{2}\sigma^{2} \\ 0 \end{pmatrix}. $$

$$ \langle \nabla{F}(x)-\nabla{F}(y), x-y \rangle \geq \frac{\mu L}{\mu+L}\|x-y\|^{2} + \frac{1}{\mu+L}\|\nabla{F}(x)-\nabla{F}(y)\|^{2}. $$

$$ \langle \nabla f^{i}_{t}, x^{i}_{t}-\tilde{x}_{t} \rangle \leq \|\nabla f^{i}_{t}\|\,\|x^{i}_{t}-\tilde{x}_{t}\|. $$

Lemma. Lemma 7.1. Let $\tilde{x}_{0}$ and $\{x^{i}_{0}\}_{i=1,\ldots,p}$ be arbitrary constants; then
$$ \mathbb{E}[\tilde{x}_{t}-x^{\ast}] = \gamma^{t}(\tilde{x}_{0}-x^{\ast}) + \frac{\gamma^{t}-\phi^{t}}{\gamma-\phi}\,\alpha u_{0}, $$ \tag{21}
$$ \mathbb{V}[\tilde{x}_{t}-x^{\ast}] = \frac{p^{2}\alpha^{2}\eta^{2}}{(\gamma-\phi)^{2}}\bigg(\frac{\gamma^{2}-\gamma^{2t}}{1-\gamma^{2}} + \frac{\phi^{2}-\phi^{2t}}{1-\phi^{2}} - 2\frac{\gamma\phi-(\gamma\phi)^{t}}{1-\gamma\phi}\bigg)\frac{\sigma^{2}}{p}, $$ \tag{22}
where $u_{0} = \sum_{i=1}^{p}\big(x^{i}_{0} - x^{\ast} - \frac{\alpha}{1-p\alpha-\phi}(\tilde{x}_{0}-x^{\ast})\big)$, $a = \eta h + (p+1)\alpha$, $c^{2} = \eta h p \alpha$, $\gamma = 1-\frac{a-\sqrt{a^{2}-4c^{2}}}{2}$, and $\phi = 1-\frac{a+\sqrt{a^{2}-4c^{2}}}{2}$.
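In the noiseless case ($\xi^i_t \equiv 0$) the mean formula in Equation 21 can be checked directly against the synchronous EASGD recursion; the sketch below does so on a 1-D quadratic (all constants are arbitrary illustrative choices in the stable regime):

```python
import math

h, eta, p, alpha = 1.0, 0.05, 4, 0.01   # arbitrary constants, a^2 >= 4c^2 holds
x_star = 0.3
xs = [1.0, -0.5, 2.0, 0.25]             # local variables x^i_0
xc = -1.0                               # center variable at time 0
xc0 = xc

# Closed-form quantities from the lemma.
a = eta * h + (p + 1) * alpha
c2 = eta * h * p * alpha
root = math.sqrt(a * a - 4 * c2)
gamma = 1 - (a - root) / 2
phi = 1 - (a + root) / 2
u0 = sum(x - x_star - alpha / (1 - p * alpha - phi) * (xc0 - x_star) for x in xs)

# Simulate the deterministic (noiseless) synchronous EASGD recursion.
t = 50
for _ in range(t):
    xs_new = [x - eta * h * (x - x_star) - alpha * (x - xc) for x in xs]
    xc = xc + alpha * sum(x - xc for x in xs)   # center update uses the old x^i_t
    xs = xs_new

predicted = gamma**t * (xc0 - x_star) + (gamma**t - phi**t) / (gamma - phi) * alpha * u0
print(abs((xc - x_star) - predicted))   # agreement up to floating-point error
```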

Corollary. Corollary 7.1. Let the Elastic Averaging relation $\beta = p\alpha$ and condition 23 hold; then
$$ \lim_{p\rightarrow\infty}\lim_{t\rightarrow\infty} p\,\mathbb{E}[(\tilde{x}_{t}-x^{\ast})^{2}] = \frac{\beta\eta h}{(2-\beta)(2-\eta h)}\cdot\frac{2-\beta-\eta h+\beta\eta h}{\beta+\eta h-\beta\eta h}\cdot\frac{\sigma^{2}}{h^{2}}. $$
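The double limit can be cross-checked against the finite-$p$ stationary variance from Lemma 7.1: with $\beta = p\alpha$ held fixed and $p$ large, $p$ times the $t \rightarrow \infty$ variance approaches the closed form above. A numerical sketch with arbitrary constants:

```python
import math

h, eta, beta, sigma2 = 1.0, 0.5, 0.1, 1.0
p = 10**6
alpha = beta / p

# gamma and phi from Lemma 7.1 at finite p.
a = eta * h + (p + 1) * alpha
c2 = eta * h * p * alpha
root = math.sqrt(a * a - 4 * c2)
gamma = 1 - (a - root) / 2
phi = 1 - (a + root) / 2

# p times the stationary variance (t -> infinity limit of Equation 22).
series = (gamma**2 / (1 - gamma**2) + phi**2 / (1 - phi**2)
          - 2 * gamma * phi / (1 - gamma * phi))
mse_p = (p * alpha) ** 2 * eta**2 / (gamma - phi) ** 2 * series * sigma2

# Closed form from the corollary.
limit = (beta * eta * h / ((2 - beta) * (2 - eta * h))
         * (2 - beta - eta * h + beta * eta * h)
         / (beta + eta * h - beta * eta * h) * sigma2 / h**2)
print(mse_p, limit)   # the two values agree to several digits
```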

Lemma. [Weak convergence] If the condition in Equation eq:condition holds, then the normalized double averaging sequence defined in Equation eq:doubleseq converges weakly to the normal distribution with zero mean and variance $\sigma^2/(p h^2)$:
$$ \sqrt{t}\,(z_{t} - x^\ast) \rightharpoonup \mathcal{N}\Big(0, \frac{\sigma^2}{p h^2}\Big), \quad t\rightarrow \infty. $$

Lemma. Lemma 7.3 (Weak convergence). Let $h$ denote the largest eigenvalue of $A$. If $(2-\eta h)(2-\beta) > 2\beta/p$, $(2-\eta h)+(2-\beta) > \beta/p$, $\eta>0$ and $\beta>0$, then the normalized double averaging sequence converges weakly to the normal distribution with zero mean and covariance matrix $V=A^{-1}\Sigma(A^{-1})^{T}$:
$$ \sqrt{tp}\,(z_{t}-x^{\ast}) \rightharpoonup \mathcal{N}(0,V), \quad t\rightarrow\infty. $$ \tag{36}

Lemma. If the following conditions hold: $(2-\eta h)(2-p\alpha) > 2\alpha$, $(2-\eta h)+(2-p\alpha) > \alpha$, $\eta > 0$ and $\alpha > 0$, then $\|M\|_{2} < 1$.
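The spectral-norm condition can be probed numerically; the sketch below builds the symmetric drift matrix $M$ from the proof of Lemma 7.1 for arbitrary constants satisfying the stated inequalities and checks $\|M\|_2 < 1$ (the constants are illustrative, not from the paper):

```python
import numpy as np

def drift_matrix(p, eta, h, alpha):
    """(p+1) x (p+1) drift matrix M: diagonal 1-alpha-eta*h for the local
    blocks, coupling alpha to/from the center, center diagonal 1-p*alpha."""
    M = np.eye(p + 1) * (1 - alpha - eta * h)
    M[-1, -1] = 1 - p * alpha
    M[:-1, -1] = alpha
    M[-1, :-1] = alpha
    return M

p, eta, h, alpha = 4, 0.1, 1.0, 0.05
# Conditions of the lemma for these constants.
assert (2 - eta * h) * (2 - p * alpha) > 2 * alpha
assert (2 - eta * h) + (2 - p * alpha) > alpha
spec_norm = np.linalg.norm(drift_matrix(p, eta, h, alpha), 2)
print(spec_norm)   # strictly below 1
```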

Theorem. Theorem 7.1. Let $a_{t} = \mathbb{E}\|\frac{1}{p}\sum_{i=1}^{p}x^{i}_{t}-x^{\ast}\|^{2}$, $b_{t} = \frac{1}{p}\sum_{i=1}^{p}\mathbb{E}\|x^{i}_{t}-x^{\ast}\|^{2}$, $c_{t} = \mathbb{E}\|\tilde{x}_{t}-x^{\ast}\|^{2}$, $\gamma_{1} = 2\eta\frac{\mu L}{\mu+L}$ and $\gamma_{2} = 2\eta L(1-\frac{2\sqrt{\mu L}}{\mu+L})$. If $0 \leq \eta \leq \frac{2}{\mu+L}(1-\alpha)$, $0 \leq \alpha < 1$ and $0 \leq \beta \leq 1$, then
$$ \begin{pmatrix} a_{t+1} \\ b_{t+1} \\ c_{t+1} \end{pmatrix} \leq \begin{pmatrix} 1-\gamma_{1}-\gamma_{2}-\alpha & \gamma_{2} & \alpha \\ 0 & 1-\gamma_{1}-\alpha & \alpha \\ \beta & 0 & 1-\beta \end{pmatrix} \begin{pmatrix} a_{t} \\ b_{t} \\ c_{t} \end{pmatrix} + \begin{pmatrix} \eta^{2}\frac{\sigma^{2}}{p} \\ \eta^{2}\sigma^{2} \\ 0 \end{pmatrix}. $$
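When the spectral radius of the 3x3 matrix in the theorem is below one, iterating the bound contracts $(a_t, b_t, c_t)$ toward the fixed point $(I-T)^{-1}v$; the sketch below checks this numerically for arbitrary constants satisfying the theorem's conditions (all values illustrative):

```python
import numpy as np

mu, L, sigma2, p = 0.5, 2.0, 1.0, 4
alpha, beta = 0.1, 0.5
eta = 2 / (mu + L) * (1 - alpha) * 0.5     # satisfies 0 <= eta <= 2/(mu+L)*(1-alpha)

g1 = 2 * eta * mu * L / (mu + L)
g2 = 2 * eta * L * (1 - 2 * np.sqrt(mu * L) / (mu + L))
T = np.array([[1 - g1 - g2 - alpha, g2, alpha],
              [0, 1 - g1 - alpha, alpha],
              [beta, 0, 1 - beta]])
v = np.array([eta**2 * sigma2 / p, eta**2 * sigma2, 0.0])

rho = max(abs(np.linalg.eigvals(T)))
print(rho)                                  # spectral radius < 1 for these constants

# Iterate the bound and compare with the fixed point (I - T)^{-1} v.
s = np.array([10.0, 10.0, 10.0])
for _ in range(2000):
    s = T @ s + v
fixed = np.linalg.solve(np.eye(3) - T, v)
print(np.max(np.abs(s - fixed)))            # ~ 0 at convergence
```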

Figure 10: The wallclock time needed to achieve the same level of the test error thr as a function of the number of local workers p on the CIFAR dataset. From left to right: thr = {21%, 20%, 19%, 18%}. Missing bars denote that the method never achieved the specified level of test error.

Figure 11: The wallclock time needed to achieve the same level of the test error thr as a function of the number of local workers p on the ImageNet dataset. From left to right: thr = {49%, 47%, 45%, 43%}. Missing bars denote that the method never achieved the specified level of test error.


Method               η
EASGD                {0.05, 0.01, 0.005}
EAMSGD               {0.01, 0.005, 0.001}
DOWNPOUR             {0.005, 0.001, 0.0005}
MDOWNPOUR            {0.00005, 0.00001, 0.000005}
SGD, ASGD, MVASGD    {0.05, 0.01, 0.005}
MSGD                 {0.001, 0.0005, 0.0001}
Table 3: Initial learning rates used for each method shown in Figure 4 (ImageNet experiment).

Method               η
EASGD                0.1
EAMSGD               0.001
DOWNPOUR             0.02 (p=4), 0.01 (p=8)
SGD, ASGD, MVASGD    0.05
MSGD                 0.0005

$$ \begin{aligned} x^i_{t+1} &= x^i_{t} - \eta (A x^i_{t} - b - \xi^i_t) - \alpha (x^i_t - \tilde{x}_t), \\ \tilde{x}_{t+1} &= \tilde{x}_t + \sum_{i=1}^p \alpha (x^i_t - \tilde{x}_t), \end{aligned} $$ \tag{loc:local, loc:center}

$$ \begin{aligned} \tilde{x}_{t+1} &= \gamma^{t+1} \tilde{x}_0 + \sum_{i=0}^t \gamma^{t-i} (\alpha u_i) = \gamma^{t+1} \tilde{x}_0 + \sum_{i=0}^t \gamma^{t-i} \alpha \Big(\phi^i u_0 + \sum_{l=0}^{i-1} \phi^{i-1-l} \eta \xi_l\Big) \\ &= \gamma^{t+1} \tilde{x}_0 + \sum_{i=0}^t \gamma^{t-i} \phi^i (\alpha u_0) + \sum_{l=0}^{t-1} \sum_{i=l+1}^{t} \gamma^{t-i} \phi^{i-1-l} (\alpha \eta \xi_l) \\ &= \gamma^{t+1} \tilde{x}_0 + \frac{\gamma^{t+1}-\phi^{t+1}}{\gamma - \phi} (\alpha u_0) + \sum_{l=0}^{t-1} \frac{\gamma^{t-l}-\phi^{t-l}}{\gamma-\phi} (\alpha \eta \xi_l). \end{aligned} $$

$$ \begin{aligned} \sum_{l=0}^{t-1} \bigg( \frac{\gamma^{t-l}-\phi^{t-l}}{\gamma-\phi} \bigg)^2 &= \sum_{l=0}^{t-1} \frac{\gamma^{2(t-l)} - 2\gamma^{t-l}\phi^{t-l} +\phi^{2(t-l)}}{(\gamma-\phi)^2} \\ &= \frac{1}{(\gamma-\phi)^2} \bigg( \frac{\gamma^2-\gamma^{2(t+1)}}{1-\gamma^2} - 2\frac{\gamma \phi - (\gamma \phi)^{t+1}}{1-\gamma \phi} + \frac{\phi^2-\phi^{2(t+1)}}{1-\phi^2} \bigg). \end{aligned} $$ \tag{eq:new}

$$ \begin{aligned} \lim_{t \rightarrow \infty} \mathbb{E} [(\tilde{x}_{t} - x^{\ast})^2] &= \frac{\beta^2 \eta^2}{(\gamma-\phi)^2} \bigg( \frac{\gamma^2}{1-\gamma^2} + \frac{\phi^2}{1-\phi^2} - \frac{2 \gamma \phi}{1- \gamma \phi} \bigg) \frac{\sigma^2}{p} \\ &= \frac{\beta^2 \eta^2}{(\gamma-\phi)^2} \, \frac{\gamma^2(1-\phi^2)(1-\gamma\phi) + \phi^2(1-\gamma^2)(1-\gamma\phi) - 2\gamma \phi (1-\gamma^2)(1-\phi^2)}{(1-\gamma^2)(1-\phi^2)(1- \gamma \phi)} \, \frac{\sigma^2}{p} \\ &= \frac{\beta^2 \eta^2}{(\gamma-\phi)^2} \, \frac{(\gamma-\phi)^2(1+\gamma \phi)}{(1-\gamma^2)(1-\phi^2)(1- \gamma \phi)} \, \frac{\sigma^2}{p} \\ &= \frac{\beta^2 \eta^2}{(1-\gamma^2)(1-\phi^2)} \cdot \frac{1+\gamma \phi}{1-\gamma \phi} \cdot \frac{\sigma^2}{p}. \end{aligned} $$

$$ \frac{1}{\sqrt{t}} \alpha \eta \sum_{l=1}^{t-1} \frac{1}{\gamma - \phi} \bigg ( \frac{\gamma}{1-\gamma} - \frac{\phi}{1-\phi} \bigg) \xi_{l-1}. \label{eq:normalized} $$ \tag{eq:normalized}

$$ M^0 + M^1 + \cdots + M^t + \cdots = (I-M)^{-1} = \eta^{-1} L^{-1}. $$

$$ \mathbb{E} [\tilde{x}_{t} - x^\ast] = \gamma^{t} (\tilde{x}_0 - x^\ast) + \frac{\gamma^t-\phi^t}{\gamma-\phi} \alpha u_0, \label{eq:lem1a} \\ \mathbb{V} [\tilde{x}_{t} - x^\ast] = \frac{p^2 \alpha^2 \eta^2 }{(\gamma-\phi)^2} \bigg( \frac{\gamma^2-\gamma^{2t}}{1-\gamma^2} +\frac{\phi^2-\phi^{2t}}{1-\phi^2} -2 \frac{\gamma \phi - (\gamma \phi)^t}{1 - \gamma \phi}\bigg) \frac{\sigma^2}{p}, \label{eq:lem1b} $$ \tag{eq:lem1a, eq:lem1b}

$$ u_{t+1} = \phi u_t + \eta \xi_t. \label{eq:proj1} $$ \tag{eq:proj1}



$$ \Phi(x^i,\tilde{x}) := \mathbb{E}[f(x^i,\xi^i)] + \frac{\rho}{2}\|x^i - \tilde{x}\|^2. $$






Proof. Substituting the gradient from Equation eq:gradlform into the update rule used by each local worker in the synchronous EASGD algorithm (Equations eq:movingaverage1 and eq:movingaverage2) we obtain
$$ \begin{aligned} x^i_{t+1} &= x^i_{t} - \eta (A x^i_{t} - b - \xi^i_t) - \alpha (x^i_t - \tilde{x}_t), \\ \tilde{x}_{t+1} &= \tilde{x}_t + \sum_{i=1}^p \alpha (x^i_t - \tilde{x}_t), \end{aligned} $$
where $\eta$ is the learning rate and $\alpha$ is the moving rate. Recall that $\alpha = \eta\rho$ and $A=h$. For ease of notation we redefine $\tilde{x}_t$ and $x^i_t$ as follows:
$$ \tilde{x}_t \triangleq \tilde{x}_t - x^\ast \quad\text{and}\quad x^i_t \triangleq x^i_t - x^\ast. $$
We prove the lemma by explicitly solving the linear Equations loc:local and loc:center. Let $\mathbf{x}_t = (x^1_t,\ldots,x^p_t,\tilde{x}_t)^T$. We rewrite the recursive relation captured in Equations loc:local and loc:center simply as
$$ \mathbf{x}_{t+1} = M \mathbf{x}_t + b_t, $$
where the drift matrix $M$ is defined as
$$ M = \begin{bmatrix} 1-\alpha-\eta h & 0 & \cdots & 0 & \alpha \\ 0 & 1-\alpha-\eta h & \cdots & 0 & \alpha \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1-\alpha-\eta h & \alpha \\ \alpha & \alpha & \cdots & \alpha & 1-p\alpha \end{bmatrix} $$
and the (diffusion) vector is $b_t = (\eta\xi^1_t,\ldots,\eta\xi^p_t,0)^T$. Note that one of the eigenvalues of the matrix $M$, which we call $\phi$, satisfies $(1-\alpha-\eta h-\phi)(1-p\alpha-\phi) = p\alpha^2$. The corresponding eigenvector is $(1,1,\ldots,1,-\frac{p\alpha}{1-p\alpha-\phi})^T$. Let $u_t$ be the projection of $\mathbf{x}_t$ onto this eigenvector, so that $u_t = \sum_{i=1}^{p} (x^i_t - \frac{\alpha}{1-p\alpha-\phi}\tilde{x}_t)$, and let $\xi_t = \sum_{i=1}^p \xi^i_t$. Therefore we have
$$ u_{t+1} = \phi u_t + \eta \xi_t. $$
Combining Equations loc:center and eq:proj1 gives
$$ \tilde{x}_{t+1} = \tilde{x}_t + \sum_{i=1}^p \alpha (x^i_t - \tilde{x}_t) = (1-p\alpha) \tilde{x}_t + \alpha \Big(u_t + \frac{p\alpha}{1-p\alpha-\phi} \tilde{x}_t\Big) = \Big(1-p\alpha+\frac{p\alpha^2}{1-p\alpha-\phi}\Big) \tilde{x}_t + \alpha u_t = \gamma \tilde{x}_t + \alpha u_t, $$
where the last step follows from the relations $\frac{p\alpha^2}{1-p\alpha-\phi} = 1-\alpha-\eta h-\phi$ and $\phi + \gamma = (1-\alpha-\eta h) + (1-p\alpha)$. Thus we obtain
$$ \tilde{x}_{t+1} = \gamma \tilde{x}_t + \alpha u_t. $$
Based on Equations eq:proj1 and eq:proj2 we can expand $u_t$ and $\tilde{x}_t$ recursively:
$$ u_{t+1} = \phi^{t+1} u_0 + \phi^{t} (\eta \xi_0) + \ldots + \phi^{0} (\eta \xi_t), \qquad \tilde{x}_{t+1} = \gamma^{t+1} \tilde{x}_0 + \gamma^{t} (\alpha u_0) + \ldots + \gamma^{0} (\alpha u_t). $$
Substituting $u_0, u_1, \ldots, u_t$, each given through Equation eq:lem1c, into Equation eq:lem1d we obtain
$$ \tilde{x}_t = \gamma^t \tilde{x}_0 + \frac{\gamma^t-\phi^t}{\gamma-\phi} \alpha u_0 + \alpha \eta \sum_{l=1}^{t-1} \frac{\gamma^{t-l}-\phi^{t-l}}{\gamma-\phi} \xi_{l-1}. $$
More precisely, Equation eq:lem1e follows by summation by parts:
$$ \begin{aligned} \tilde{x}_{t+1} &= \gamma^{t+1} \tilde{x}_0 + \sum_{i=0}^t \gamma^{t-i} (\alpha u_i) = \gamma^{t+1} \tilde{x}_0 + \sum_{i=0}^t \gamma^{t-i} \alpha \Big(\phi^i u_0 + \sum_{l=0}^{i-1} \phi^{i-1-l} \eta \xi_l\Big) \\ &= \gamma^{t+1} \tilde{x}_0 + \sum_{i=0}^t \gamma^{t-i} \phi^i (\alpha u_0) + \sum_{l=0}^{t-1} \sum_{i=l+1}^{t} \gamma^{t-i} \phi^{i-1-l} (\alpha \eta \xi_l) \\ &= \gamma^{t+1} \tilde{x}_0 + \frac{\gamma^{t+1}-\phi^{t+1}}{\gamma - \phi} (\alpha u_0) + \sum_{l=0}^{t-1} \frac{\gamma^{t-l}-\phi^{t-l}}{\gamma-\phi} (\alpha \eta \xi_l). \end{aligned} $$
Since the random variables $\xi_l$ are i.i.d., we may sum the variance term by term:
$$ \sum_{l=0}^{t-1} \bigg( \frac{\gamma^{t-l}-\phi^{t-l}}{\gamma-\phi} \bigg)^2 = \frac{1}{(\gamma-\phi)^2} \bigg( \frac{\gamma^2-\gamma^{2(t+1)}}{1-\gamma^2} - 2\frac{\gamma \phi - (\gamma \phi)^{t+1}}{1-\gamma \phi} + \frac{\phi^2-\phi^{2(t+1)}}{1-\phi^2} \bigg). $$
Note that $\mathbb{E}[\xi_t] = \sum_{i=1}^p \mathbb{E}[\xi^i_t] = 0$ and $\mathbb{V}[\xi_t] = \sum_{i=1}^p \mathbb{V}[\xi^i_t] = p\sigma^2$. These two facts, together with Equations eq:lem1e and eq:new, yield $\mathbb{E}[\tilde{x}_t]$ and $\mathbb{V}[\tilde{x}_t]$ as given in Equations eq:lem1a and eq:lem1b of Lemma lem:lemm1. ∎

Proof. Note that when $\beta$ is fixed, $\lim_{p\rightarrow\infty} a = \eta h + \beta$ and $\lim_{p\rightarrow\infty} c^2 = \eta h \beta$. Then $\lim_{p\rightarrow\infty} \phi = \min(1 - \beta,\, 1 - \eta h)$ and $\lim_{p\rightarrow\infty} \gamma = \max(1 - \beta,\, 1 - \eta h)$. Also note that using Lemma (lem:lemm1) we obtain
\begin{align*}
\lim_{t \rightarrow \infty} E [(x_{t} - x^{\ast})^2] &= \frac{\beta^2 \eta^2}{(\gamma-\phi)^2} \bigg( \frac{\gamma^2}{1-\gamma^2} + \frac{\phi^2}{1-\phi^2} - \frac{2 \gamma \phi}{1- \gamma \phi} \bigg)\, \frac{\sigma^2}{p} \\
&= \frac{\beta^2 \eta^2}{(\gamma-\phi)^2} \bigg( \frac{\gamma^2(1-\phi^2)(1-\phi \gamma) + \phi^2(1-\gamma^2)(1-\phi \gamma) - 2\gamma \phi (1-\gamma^2)(1-\phi^2)}{(1-\gamma^2)(1-\phi^2)(1- \gamma \phi)} \bigg)\, \frac{\sigma^2}{p} \\
&= \frac{\beta^2 \eta^2}{(\gamma-\phi)^2} \bigg( \frac{(\gamma-\phi)^2(1+\gamma \phi)}{(1-\gamma^2)(1-\phi^2)(1- \gamma \phi)} \bigg)\, \frac{\sigma^2}{p} \\
&= \frac{\beta^2 \eta^2}{(1-\gamma^2)(1-\phi^2)} \cdot \frac{1+\gamma \phi}{1-\gamma \phi} \cdot \frac{\sigma^2}{p}.
\end{align*}
Corollary (cor:lemm1cor) is obtained by plugging in the limiting values of $\phi$ and $\gamma$.

Proof. As in the proof of Lemma (lem:lemm1), for ease of notation we redefine $x_{t}$ and $x^i_t$ as follows:
$$x_{t} \triangleq x_{t} - x^\ast \quad\text{and}\quad x^i_t \triangleq x^i_t - x^\ast.$$
Also recall that the $\xi^i_t$'s are i.i.d. random variables (noise) with zero mean and the same variance $\sigma^2 > 0$. We are interested in the asymptotic behavior of the double averaging sequence $\{z_1,z_2,\dots\}$ defined as
\begin{gather}
z_{t+1} = \frac{1}{t+1} \sum_{k=0}^{t} x_k.
\end{gather}
Recall Equation (eq:lem1e) from the proof of Lemma (lem:lemm1) (for convenience it is repeated below):
\begin{gather*}
x_k = \gamma^k x_0 + \alpha u_0\, \frac{\gamma^k-\phi^k}{\gamma-\phi} + \alpha \eta \sum_{l=1}^{k-1} \frac{\gamma^{k-l}-\phi^{k-l}}{\gamma-\phi}\, \xi_{l-1},
\end{gather*}
where $\xi_t = \sum_{i=1}^p \xi_t^i$. Therefore
\begin{eqnarray*}
\sum_{k=0}^t x_k &=& \frac{1-\gamma^{t+1}}{1-\gamma}\, x_0 + \alpha u_0\, \frac{1}{\gamma-\phi} \bigg(\frac{1-\gamma^{t+1}}{1-\gamma} - \frac{1-\phi^{t+1}}{1-\phi} \bigg) + \alpha \eta \sum_{l=1}^{t-1} \sum_{k=l+1}^{t} \frac{\gamma^{k-l}-\phi^{k-l}}{\gamma-\phi}\, \xi_{l-1} \\
&=& O(1) + \alpha \eta \sum_{l=1}^{t-1} \frac{1}{\gamma - \phi} \bigg(\gamma\, \frac{1-\gamma^{t-l}}{1-\gamma} - \phi\, \frac{1-\phi^{t-l}}{1-\phi} \bigg)\, \xi_{l-1}.
\end{eqnarray*}
Note that the only non-vanishing term (in weak convergence) of $\frac{1}{t}\sum_{k=0}^t x_k$ as $t\rightarrow \infty$ is
\begin{align}
\frac{1}{t}\, \alpha \eta \sum_{l=1}^{t-1} \frac{1}{\gamma - \phi} \bigg( \frac{\gamma}{1-\gamma} - \frac{\phi}{1-\phi} \bigg)\, \xi_{l-1}.
\end{align}
Also recall that $V[\xi_{l-1}] = p \sigma^2$ and
\begin{align*}
\frac{1}{\gamma - \phi} \bigg( \frac{\gamma}{1-\gamma} - \frac{\phi}{1-\phi} \bigg) = \frac{1}{(1-\gamma)(1-\phi)} = \frac{1}{\eta h p \alpha}.
\end{align*}
Therefore the expression in Equation (eq:normalized) is asymptotically normal with zero mean and variance $\sigma^2/(p h^2)$.

Proof. Since $A$ is symmetric, one can use the proof technique of Lemma (lem:lemm2) to prove Lemma (lem:lemm3) by diagonalizing the matrix $A$. This diagonalization essentially generalizes Lemma (lem:lemm1) to the multidimensional case. We will not go into the details of this proof as we will provide a simpler way to look at the system. As in the proofs of Lemma (lem:lemm1) and Lemma (lem:lemm2), for ease of notation we redefine $x_{t}$ and $x^i_t$ as follows:
$$x_{t} \triangleq x_{t} - x^\ast \quad\text{and}\quad x^i_t \triangleq x^i_t - x^\ast.$$
Let the spatial average of the local parameters at time $t$ be denoted as $y_t$, where $y_t = \frac{1}{p} \sum_{i=1}^p x_t^i$, and let the average noise be denoted as $\xi_t$, where $\xi_t = \frac{1}{p} \sum_{i=1}^p \xi_t^i$. Equations (loc:local) and (loc:center) can then be reduced to the following
\begin{eqnarray}
y_{t+1} &=& y_{t} - \eta ( Ay_t - \xi_t) + \alpha ( x_t - y_t ), \\
x_{t+1} &=& x_t + \beta ( y_t - x_t).
\end{eqnarray}
We focus on the case where the learning rate $\eta$ and the moving rate $\alpha$ are kept constant over time. As a side note, notice that the center parameter $x_t$ is tracking the spatial average $y_t$ of the local parameters with a non-symmetric spring in Equations (eq:reduced1) and (eq:reduced2): the update on $y_{t+1}$ contains $(x_t - y_t)$ scaled by $\alpha$, whereas the update on $x_{t+1}$ contains $-(x_t - y_t)$ scaled by $\beta$. Since $\alpha = \beta / p$, the impact of the center $x_{t+1}$ on the spatial local average $y_{t+1}$ becomes more negligible as $p$ grows. Recall $\beta = p \alpha$ and $\alpha = \eta \rho$. Let us introduce the block notation $U_t = (y_t,x_t)$, $\Xi_t = (\eta \xi_t,0)$, $M = I - \eta L$ and
\begin{align*}
L = \left( \begin{array}{cc} A + \frac{\alpha}{\eta} I & - \frac{\alpha}{\eta} I \\ - \frac{\beta}{\eta} I & \frac{\beta}{\eta} I \end{array} \right).
\end{align*}
From Equations (eq:reduced1) and (eq:reduced2) it follows that $U_{t+1} = M U_t + \Xi_t$.
Note that this linear system has a degenerate noise $\Xi_t$, which prevents us from directly applying the results of~[polyak1992acceleration]. Expanding this recursive relation and summing by parts, we have
\begin{eqnarray*}
\sum_{k=0}^t U_k &=& M^0 U_0 + \\
&& M^1 U_0 + M^0 \Xi_0 + \\
&& M^2 U_0 + M^1 \Xi_0 + M^0 \Xi_1 + \\
&& \ldots \\
&& M^t U_0 + M^{t-1} \Xi_0 + \cdots + M^0 \Xi_{t-1}.
\end{eqnarray*}
By Lemma (lem:additional), $\|M\|_2 < 1$ and thus
\begin{align*}
M^0 + M^1 + \cdots + M^t + \cdots = (I-M)^{-1} = \eta^{-1} L^{-1}.
\end{align*}
Since $A$ is invertible, we get
\begin{align*}
L^{-1} = \left( \begin{array}{cc} A^{-1} & \frac{\alpha}{\beta} A^{-1} \\ A^{-1} & \frac{\eta}{\beta} I + \frac{\alpha}{\beta} A^{-1} \end{array} \right),
\end{align*}
thus
\begin{align*}
\frac{1}{t} \sum_{k=0}^t U_k = \frac{1}{t} \sum_{k=0}^{t} M^k\, U_0 + \frac{1}{t}\, (\eta L)^{-1} \sum_{k=1}^{t} \Xi_{k-1} - \frac{1}{t}\, (\eta L)^{-1} \sum_{k=1}^{t} M^{t-k+1}\, \Xi_{k-1}.
\end{align*}
Note that the only non-vanishing term (in weak convergence) of $\frac{1}{t} \sum_{k=0}^t U_k$ is $\frac{1}{t}\, (\eta L)^{-1} \sum_{k=1}^{t} \Xi_{k-1}$, thus we have
\begin{align}
\frac{1}{t}\, (\eta L)^{-1} \sum_{k=1}^{t} \Xi_{k-1} \rightharpoonup N \bigg( \left( \begin{array}{c} 0 \\ 0 \end{array} \right), \left( \begin{array}{cc} V & V \\ V & V \end{array} \right) \bigg),
\end{align}
where $V=A^{-1} \Sigma (A^{-1})^T$.
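The claimed form of $L^{-1}$ can be sanity-checked in the scalar case $A = h > 0$, where each block is a number. The following sketch (our own check, not from the paper; the values of $h$, $\eta$, $\alpha$, $p$ are illustrative) verifies $L L^{-1} = I$:

```python
# Scalar check of the block inverse of L used in the proof: for a 1-D
# objective (A = h > 0) the claimed inverse is
#   L^{-1} = [[1/h, alpha/(beta*h)], [1/h, eta/beta + alpha/(beta*h)]].

h, eta, alpha, p = 1.5, 0.1, 0.05, 4   # illustrative values
beta = p * alpha

L = [[h + alpha / eta, -alpha / eta],
     [-beta / eta, beta / eta]]
Linv = [[1 / h, alpha / (beta * h)],
        [1 / h, eta / beta + alpha / (beta * h)]]

# multiply L * Linv and compare entrywise against the identity
prod = [[sum(L[r][k] * Linv[k][c] for k in range(2)) for c in range(2)]
        for r in range(2)]
for r in range(2):
    for c in range(2):
        assert abs(prod[r][c] - (1.0 if r == c else 0.0)) < 1e-12
print("L * Linv = I verified")
```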

Proof. The eigenvalue $\lambda$ of $M$ and the corresponding (non-zero) eigenvector $(y,z)$ of $M$ satisfy
\begin{equation}
M \left( \begin{array}{c} y \\ z \end{array} \right) = \lambda \left( \begin{array}{c} y \\ z \end{array} \right).
\end{equation}
Recall that
\begin{align}
M = I - \eta L = \left( \begin{array}{cc} I - \eta A - \alpha I & \alpha I \\ \beta I & I - \beta I \end{array} \right).
\end{align}
From Equations (eq:oneM) and (eq:twoM) we obtain
\begin{eqnarray}
\left\{ \begin{array}{l} y - \eta A y - \alpha y + \alpha z = \lambda y \\ \beta y + (1-\beta) z = \lambda z \end{array} \right.
\end{eqnarray}
Since $(y,z)$ is assumed to be non-zero, we can write $z = \beta y / (\lambda + \beta - 1)$. Then Equation (eq:bundle) can be reduced to
\begin{eqnarray}
\eta A y = (1-\alpha -\lambda)\, y + \frac{\alpha \beta}{\lambda + \beta - 1}\, y.
\end{eqnarray}
Thus $y$ is an eigenvector of $A$. Let $\lambda_A$ be the eigenvalue of the matrix $A$ such that $A y = \lambda_A y$. Based on Equation (eq:mapping) it follows that
\begin{equation}
\eta \lambda_A = (1-\alpha -\lambda) + \frac{\alpha \beta}{\lambda + \beta - 1}.
\end{equation}
Equation (eq:intermediate) is equivalent to
\begin{eqnarray}
\lambda^2 - (2-a)\lambda + (1-a+c^2) = 0,
\end{eqnarray}
where $a = \eta \lambda_A + (p+1)\alpha$ and $c^2 = \eta \lambda_A p \alpha$. It follows from the condition in Equation (eq:condition) that $-1 < \lambda < 1$ iff $\eta > 0$, $\beta > 0$, $(2-\eta \lambda_A)(2-\beta) > 2\beta/p$ and $(2-\eta \lambda_A) + (2-\beta) > \beta/p$. Let $h$ denote the maximum eigenvalue of $A$ and note that $2-\eta \lambda_A \ge 2-\eta h$. This implies that the condition of our lemma is sufficient.
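The stated equivalence can be probed numerically. For positive parameters the discriminant $a^2-4c^2$ is nonnegative (by AM-GM, $a \ge \eta\lambda_A + p\alpha \ge 2c$), so both roots are real and can be computed directly. The sketch below (our own check; the sampling ranges are arbitrary) compares the lemma's condition against direct root computation on random parameter draws:

```python
import random, math

# Check of the stability condition: the eigenvalues lambda of M solve
#   lambda^2 - (2-a) lambda + (1-a+c^2) = 0,
# with a = eta*h + (p+1)*alpha, c^2 = eta*h*p*alpha, beta = p*alpha.
# The lemma states both roots lie in (-1, 1) iff eta > 0, beta > 0,
# (2-eta*h)(2-beta) > 2*beta/p and (2-eta*h)+(2-beta) > beta/p.

random.seed(0)

def roots_in_unit_interval(eta, h, alpha, p):
    a = eta * h + (p + 1) * alpha
    c2 = eta * h * p * alpha
    disc = (2 - a) ** 2 - 4 * (1 - a + c2)   # >= 0 by AM-GM for positive params
    r = math.sqrt(max(disc, 0.0))
    lam1, lam2 = ((2 - a) - r) / 2, ((2 - a) + r) / 2
    return -1 < lam1 < 1 and -1 < lam2 < 1

def lemma_condition(eta, h, alpha, p):
    beta = p * alpha
    return (eta > 0 and beta > 0
            and (2 - eta * h) * (2 - beta) > 2 * beta / p
            and (2 - eta * h) + (2 - beta) > beta / p)

agree = 0
for _ in range(10000):
    eta, h = random.uniform(0.01, 3), random.uniform(0.1, 2)
    alpha, p = random.uniform(0.01, 1.5), random.randint(1, 8)
    if roots_in_unit_interval(eta, h, alpha, p) == lemma_condition(eta, h, alpha, p):
        agree += 1
print("agreement on %d/10000 random parameter draws" % agree)
```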

Proof. The idea of the proof is based on the point of view of Lemma (lem:lemm3), i.e. how close the center variable $x_t$ is to the spatial average of the local variables $y_t = \frac{1}{p} \sum_{i=1}^p x^i_t$. To further simplify the notation, let the noisy gradient be $\nabla f^i_{t,\xi} = g_t^i(x_t^i) = \nabla F(x_t^i) - \xi^i_t$, and let $\nabla f^i_{t} = \nabla F(x_t^i)$ be its deterministic part. The EASGD updates can then be rewritten as follows,
\begin{eqnarray}
x_{t+1}^i &=& x_t^i - \eta \nabla f^i_{t,\xi} - \alpha(x^i_t - x_t), \\
x_{t+1} &=& x_t + \beta ( y_t - x_t).
\end{eqnarray}
We thus have the following update for the spatial average,
\begin{eqnarray}
y_{t+1} &=& y_{t} - \eta\, \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t,\xi} - \alpha (y_t - x_t).
\end{eqnarray}
The idea of the proof is to bound the distance $\|x_t-x^\ast\|^2$ through $\|y_t-x^\ast\|^2$ and $\frac{1}{p} \sum_{i=1}^p \|x^i_t-x^\ast\|^2$. We start from the following estimate for strongly convex functions [nesterov2004introductory],
\begin{eqnarray*}
\langle \nabla F(x)-\nabla F(y),\, x-y \rangle \ge \frac{\mu L}{\mu+L}\, \|x-y\|^2 + \frac{1}{\mu+L}\, \|\nabla F(x)-\nabla F(y)\|^2.
\end{eqnarray*}
Since $\nabla F(x^\ast)=0$, we have
\begin{eqnarray}
\langle \nabla f^i_{t},\, x^i_t - x^\ast \rangle \ge \frac{\mu L}{\mu+L}\, \|x^i_t - x^\ast\|^2 + \frac{1}{\mu+L}\, \|\nabla f^i_{t}\|^2.
\end{eqnarray}
From Equation (eq:strong_updates1) the following relation holds,
\begin{eqnarray}
\|x^i_{t+1}-x^\ast\|^2 &=& \|x^i_{t}-x^\ast\|^2 + \eta^2 \|\nabla f^i_{t,\xi}\|^2 + \alpha^2 \|x^i_t - x_t\|^2 \nonumber \\
&& -\, 2\eta\, \langle \nabla f^i_{t,\xi},\, x^i_t - x^\ast \rangle - 2 \alpha\, \langle x^i_t - x_t,\, x^i_t - x^\ast \rangle \nonumber \\
&& +\, 2 \eta \alpha\, \langle \nabla f^i_{t,\xi},\, x^i_t - x_t \rangle.
\end{eqnarray}
By the cosine rule ($2 \langle a-b,\, c-d \rangle = \|a-d\|^2 - \|a-c\|^2 + \|c-b\|^2 - \|d-b\|^2$), we have
\begin{eqnarray}
2 \langle x^i_t - x_t,\, x^i_t - x^\ast \rangle = \|x^i_t-x^\ast\|^2 + \|x^i_t-x_t\|^2 - \|x_t - x^\ast\|^2.
\end{eqnarray}
By the Cauchy--Schwarz inequality, we have
\begin{eqnarray}
\langle \nabla f^i_{t},\, x^i_t - x_t \rangle \le \|\nabla f^i_{t}\|\, \|x^i_t - x_t\|.
\end{eqnarray}
Combining the above estimates in Equations (eq:est0), (eq:est1a), (eq:est1b) and (eq:est1c), we obtain
\begin{eqnarray}
\|x^i_{t+1}-x^\ast\|^2 &\le& \|x^i_{t}-x^\ast\|^2 + \eta^2 \|\nabla f^i_{t} - \xi^i_t\|^2 + \alpha^2 \|x^i_t - x_t\|^2 \nonumber \\
&& -\, 2\eta \bigg( \frac{\mu L}{\mu+L}\, \|x^i_t - x^\ast\|^2 + \frac{1}{\mu+L}\, \|\nabla f^i_{t}\|^2 \bigg) + 2\eta\, \langle \xi^i_t,\, x^i_t - x^\ast \rangle \nonumber \\
&& -\, \alpha \big( \|x^i_t-x^\ast\|^2 + \|x^i_t-x_t\|^2 - \|x_t - x^\ast\|^2 \big) \nonumber \\
&& +\, 2 \eta \alpha\, \|\nabla f^i_{t}\|\, \|x^i_t - x_t\| - 2\eta \alpha\, \langle \xi^i_t,\, x^i_t-x_t \rangle.
\end{eqnarray}
Choosing $0 \le \alpha < 1$, we can upper-bound the terms
$$\alpha^2 \|x^i_t - x_t\|^2 - \alpha \|x^i_t-x_t\|^2 + 2 \eta \alpha\, \|\nabla f^i_{t}\|\, \|x^i_t - x_t\| = -\alpha(1-\alpha) \|x^i_t - x_t\|^2 + 2 \eta \alpha\, \|\nabla f^i_{t}\|\, \|x^i_t - x_t\| \le \frac{\eta^2 \alpha}{1-\alpha}\, \|\nabla f^i_{t}\|^2$$
by applying $-ax^2+bx \le \frac{b^2}{4a}$ with $x = \|x^i_t - x_t\|$. Thus we can further bound Equation (eq:est1) with
\begin{eqnarray}
\|x^i_{t+1}-x^\ast\|^2 &\le& \Big(1-\frac{2\eta \mu L}{\mu + L} - \alpha\Big) \|x^i_{t}-x^\ast\|^2 + \Big(\eta^2 + \frac{\eta^2 \alpha}{1-\alpha} - \frac{2 \eta}{\mu+L}\Big) \|\nabla f^i_{t}\|^2 \nonumber \\
&& -\, 2 \eta^2\, \langle \nabla f^i_{t},\, \xi^i_t \rangle + 2\eta\, \langle \xi^i_t,\, x^i_t - x^\ast \rangle - 2\eta \alpha\, \langle \xi^i_t,\, x^i_t - x_t \rangle \nonumber \\
&& +\, \eta^2 \|\xi^i_t\|^2 + \alpha \|x_t - x^\ast\|^2.
\end{eqnarray}
Since the noise $\xi^i_t$ in Equations (eq:est2a) and (eq:est2b) is zero mean ($E[\xi^i_t] = 0$) and its variance is bounded ($E \|\xi^i_t\|^2 \le \sigma^2$), if $\eta$ is chosen small enough so that $\eta^2 + \frac{\eta^2 \alpha}{1-\alpha} - \frac{2 \eta}{\mu+L} \le 0$, then
\begin{eqnarray}
E \|x^i_{t+1}-x^\ast\|^2 &\le& \Big(1-\frac{2\eta \mu L}{\mu + L} - \alpha\Big) E \|x^i_{t}-x^\ast\|^2 + \eta^2 \sigma^2 + \alpha\, E \|x_t - x^\ast\|^2.
\end{eqnarray}
We now apply a similar idea to estimate $\|y_t-x^\ast\|^2$. From Equation (eq:strong_updates3) the following relation holds,
\begin{eqnarray}
\|y_{t+1}-x^\ast\|^2 &=& \|y_{t}-x^\ast\|^2 + \eta^2 \Big\| \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t,\xi} \Big\|^2 + \alpha^2 \|y_t - x_t\|^2 \nonumber \\
&& -\, 2\eta\, \Big\langle \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t,\xi},\, y_t - x^\ast \Big\rangle - 2 \alpha\, \langle y_t - x_t,\, y_t - x^\ast \rangle \nonumber \\
&& +\, 2 \eta \alpha\, \Big\langle \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t,\xi},\, y_t - x_t \Big\rangle.
\end{eqnarray}
By the identity $\big\langle \frac{1}{p} \sum_{i=1}^p a_i,\, \frac{1}{p} \sum_{j=1}^p b_j \big\rangle = \frac{1}{p} \sum_{i=1}^p \langle a_i, b_i \rangle - \frac{1}{p^2} \sum_{i>j} \langle a_i-a_j,\, b_i-b_j \rangle$, we have
\begin{eqnarray}
\Big\langle \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t},\, y_t - x^\ast \Big\rangle = \frac{1}{p} \sum_{i=1}^p \langle \nabla f^i_{t},\, x_t^i - x^\ast \rangle - \frac{1}{p^2} \sum_{i>j} \langle \nabla f^i_{t} - \nabla f^j_{t},\, x_t^i - x_t^j \rangle.
\end{eqnarray}
By the cosine rule, we have
\begin{eqnarray}
2 \langle y_t - x_t,\, y_t - x^\ast \rangle = \|y_t-x^\ast\|^2 + \|y_t-x_t\|^2 - \|x_t - x^\ast\|^2.
\end{eqnarray}
Denoting $\xi_t = \frac{1}{p} \sum_{i=1}^p \xi^i_t$, we can rewrite Equation (eq:est3) as
\begin{eqnarray}
\|y_{t+1}-x^\ast\|^2 &=& \|y_{t}-x^\ast\|^2 + \eta^2 \Big\| \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} - \xi_t \Big\|^2 + \alpha^2 \|y_t - x_t\|^2 \nonumber \\
&& -\, 2\eta\, \Big\langle \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} - \xi_t,\, y_t - x^\ast \Big\rangle - 2 \alpha\, \langle y_t - x_t,\, y_t - x^\ast \rangle \nonumber \\
&& +\, 2 \eta \alpha\, \Big\langle \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} - \xi_t,\, y_t - x_t \Big\rangle.
\end{eqnarray}
By combining Equations (eq:est3a) and (eq:est3b) with (eq:est3c), we obtain
\begin{eqnarray}
\|y_{t+1}-x^\ast\|^2 &=& \|y_{t}-x^\ast\|^2 + \eta^2 \Big\| \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} - \xi_t \Big\|^2 + \alpha^2 \|y_t - x_t\|^2 \nonumber \\
&& -\, 2\eta \bigg( \frac{1}{p} \sum_{i=1}^p \langle \nabla f^i_{t},\, x_t^i - x^\ast \rangle - \frac{1}{p^2} \sum_{i>j} \langle \nabla f^i_{t} - \nabla f^j_{t},\, x_t^i - x_t^j \rangle \bigg) \nonumber \\
&& +\, 2 \eta\, \langle \xi_t,\, y_t - x^\ast \rangle - \alpha \big( \|y_t-x^\ast\|^2 + \|y_t-x_t\|^2 - \|x_t - x^\ast\|^2 \big) \nonumber \\
&& +\, 2 \eta \alpha\, \Big\langle \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} - \xi_t,\, y_t - x_t \Big\rangle.
\end{eqnarray}
It thus follows from Equations (eq:est0) and (eq:est3d) that
\begin{eqnarray}
\|y_{t+1}-x^\ast\|^2 &\le& \|y_{t}-x^\ast\|^2 + \eta^2 \Big\| \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} - \xi_t \Big\|^2 + \alpha^2 \|y_t - x_t\|^2 \nonumber \\
&& -\, 2\eta\, \frac{1}{p} \sum_{i=1}^p \bigg( \frac{\mu L}{\mu+L}\, \|x^i_t - x^\ast\|^2 + \frac{1}{\mu+L}\, \|\nabla f^i_{t}\|^2 \bigg) \nonumber \\
&& +\, 2\eta\, \frac{1}{p^2} \sum_{i>j} \langle \nabla f^i_{t} - \nabla f^j_{t},\, x_t^i - x_t^j \rangle \nonumber \\
&& +\, 2 \eta\, \langle \xi_t,\, y_t - x^\ast \rangle - \alpha \big( \|y_t-x^\ast\|^2 + \|y_t-x_t\|^2 - \|x_t - x^\ast\|^2 \big) \nonumber \\
&& +\, 2 \eta \alpha\, \Big\langle \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} - \xi_t,\, y_t - x_t \Big\rangle.
\end{eqnarray}
Recalling $y_t = \frac{1}{p} \sum_{i=1}^p x^i_t$, we have the following bias-variance relations,
\begin{eqnarray}
\frac{1}{p} \sum_{i=1}^p \|x^i_t - x^\ast\|^2 &=& \frac{1}{p} \sum_{i=1}^p \|x^i_t - y_t\|^2 + \|y_t - x^\ast\|^2 = \frac{1}{p^2} \sum_{i>j} \|x^i_t - x^j_t\|^2 + \|y_t - x^\ast\|^2, \nonumber \\
\frac{1}{p} \sum_{i=1}^p \|\nabla f^i_{t}\|^2 &=& \frac{1}{p^2} \sum_{i>j} \|\nabla f^i_{t} - \nabla f^j_{t}\|^2 + \Big\| \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} \Big\|^2.
\end{eqnarray}
By the Cauchy--Schwarz inequality, we have
\begin{eqnarray}
\frac{\mu L}{\mu+L}\, \|x^i_t - x^j_t\|^2 + \frac{1}{\mu+L}\, \|\nabla f^i_{t} - \nabla f^j_{t}\|^2 \ge \frac{2\sqrt{\mu L}}{\mu+L}\, \langle \nabla f^i_{t} - \nabla f^j_{t},\, x_t^i - x_t^j \rangle.
\end{eqnarray}
Combining the above estimates in Equations (eq:est3e), (eq:est3f) and (eq:est3g), we obtain
\begin{eqnarray}
\|y_{t+1}-x^\ast\|^2 &\le& \|y_{t}-x^\ast\|^2 + \eta^2 \Big\| \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} - \xi_t \Big\|^2 + \alpha^2 \|y_t - x_t\|^2 \nonumber \\
&& -\, 2\eta \bigg( \frac{\mu L}{\mu+L}\, \|y_t - x^\ast\|^2 + \frac{1}{\mu+L}\, \Big\| \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} \Big\|^2 \bigg) \nonumber \\
&& +\, 2 \eta \bigg(1-\frac{2\sqrt{\mu L}}{\mu+L}\bigg) \frac{1}{p^2} \sum_{i>j} \langle \nabla f^i_{t} - \nabla f^j_{t},\, x_t^i - x_t^j \rangle \nonumber \\
&& +\, 2 \eta\, \langle \xi_t,\, y_t - x^\ast \rangle - \alpha \big( \|y_t-x^\ast\|^2 + \|y_t-x_t\|^2 - \|x_t - x^\ast\|^2 \big) \nonumber \\
&& +\, 2 \eta \alpha\, \Big\langle \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} - \xi_t,\, y_t - x_t \Big\rangle.
\end{eqnarray}
Similarly, if $0 \le \alpha < 1$, we can upper-bound the terms
$$\alpha^2 \|y_t - x_t\|^2 - \alpha \|y_t-x_t\|^2 + 2 \eta \alpha\, \Big\| \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} \Big\|\, \|y_t - x_t\| \le \frac{\eta^2 \alpha}{1-\alpha}\, \Big\| \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} \Big\|^2$$
by applying $-ax^2+bx \le \frac{b^2}{4a}$ with $x = \|y_t - x_t\|$. Thus we have the following bound for Equation (eq:est4),
\begin{eqnarray}
\|y_{t+1}-x^\ast\|^2 &\le& \Big(1-\frac{2\eta \mu L}{\mu + L} - \alpha\Big) \|y_{t}-x^\ast\|^2 + \Big(\eta^2 + \frac{\eta^2 \alpha}{1-\alpha} - \frac{2 \eta}{\mu+L}\Big) \Big\| \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t} \Big\|^2 \nonumber \\
&& -\, 2 \eta^2\, \Big\langle \frac{1}{p} \sum_{i=1}^p \nabla f^i_{t},\, \xi_t \Big\rangle + 2\eta\, \langle \xi_t,\, y_t - x^\ast \rangle - 2\eta \alpha\, \langle \xi_t,\, y_t - x_t \rangle \nonumber \\
&& +\, 2 \eta \bigg(1-\frac{2\sqrt{\mu L}}{\mu+L}\bigg) \frac{1}{p^2} \sum_{i>j} \langle \nabla f^i_{t} - \nabla f^j_{t},\, x_t^i - x_t^j \rangle \nonumber \\
&& +\, \eta^2 \|\xi_t\|^2 + \alpha \|x_t - x^\ast\|^2.
\end{eqnarray}
Since $\frac{2\sqrt{\mu L}}{\mu+L} \le 1$, we also need to bound the non-linear term $\langle \nabla f^i_{t} - \nabla f^j_{t},\, x_t^i - x_t^j \rangle \le L\, \|x_t^i - x_t^j\|^2$. Recall the bias-variance relation $\frac{1}{p} \sum_{i=1}^p \|x^i_t - x^\ast\|^2 = \frac{1}{p^2} \sum_{i>j} \|x^i_t - x^j_t\|^2 + \|y_t - x^\ast\|^2$. The key observation is that if $\frac{1}{p} \sum_{i=1}^p \|x^i_t - x^\ast\|^2$ remains bounded, then a larger variance $\sum_{i>j} \|x^i_t - x^j_t\|^2$ implies a smaller bias $\|y_t - x^\ast\|^2$; thus this non-linear term can be compensated. Again choosing $\eta$ small enough so that $\eta^2 + \frac{\eta^2 \alpha}{1-\alpha} - \frac{2 \eta}{\mu+L} \le 0$ and taking expectation in Equation (eq:est5),
\begin{eqnarray}
E \|y_{t+1}-x^\ast\|^2 &\le& \Big(1-\frac{2\eta \mu L}{\mu + L} - \alpha\Big) E \|y_{t}-x^\ast\|^2 \nonumber \\
&& +\, 2 \eta L \bigg(1-\frac{2\sqrt{\mu L}}{\mu+L}\bigg) \bigg( \frac{1}{p} \sum_{i=1}^p E \|x^i_t - x^\ast\|^2 - E \|y_{t}-x^\ast\|^2 \bigg) \nonumber \\
&& +\, \frac{\eta^2 \sigma^2}{p} + \alpha\, E \|x_t - x^\ast\|^2.
\end{eqnarray}
As for the center variable in Equation (eq:strong_updates2), we simply apply the convexity of the squared norm $\|\cdot\|^2$ to obtain
\begin{eqnarray}
\|x_{t+1} - x^\ast\|^2 \le (1-\beta)\, \|x_t - x^\ast\|^2 + \beta\, \|y_t - x^\ast\|^2.
\end{eqnarray}
Combining the estimates from Equations (eq:key1), (eq:key2) and (eq:key3), and denoting $a_t = E \|y_t -x^\ast\|^2$, $b_t = \frac{1}{p} \sum_{i=1}^p E \|x^i_t -x^\ast\|^2$, $c_t = E \|x_t-x^\ast\|^2$, $\gamma_1 = \frac{2\eta \mu L}{\mu+L}$ and $\gamma_2 = 2 \eta L \big(1- \frac{2 \sqrt{\mu L}}{\mu+L}\big)$, we obtain
\begin{gather*}
\begin{pmatrix} a_{t+1} \\ b_{t+1} \\ c_{t+1} \end{pmatrix} \le \begin{pmatrix} 1-\gamma_1-\gamma_2-\alpha & \gamma_2 & \alpha \\ 0 & 1-\gamma_1-\alpha & \alpha \\ \beta & 0 & 1-\beta \end{pmatrix} \begin{pmatrix} a_{t} \\ b_{t} \\ c_{t} \end{pmatrix} + \begin{pmatrix} \frac{\eta^2 \sigma^2}{p} \\ \eta^2 \sigma^2 \\ 0 \end{pmatrix},
\end{gather*}
as long as $0\le \beta \le 1$, $0 \le \alpha < 1$ and $\eta^2 + \frac{\eta^2 \alpha}{1-\alpha} - \frac{2 \eta}{\mu+L} \le 0$, i.e. $0 \le \eta \le \frac{2(1-\alpha)}{\mu+L}$.
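For small enough $\eta$ and $\alpha$ the comparison matrix in this bound is entrywise nonnegative with row sums $1-\gamma_1$, $1-\gamma_1$ and $1$, and (being irreducible for $\alpha, \beta > 0$) its spectral radius is then strictly below one, which keeps the expected squared distances bounded. A quick power-iteration check (our own sketch; the values of $\mu$, $L$, $\alpha$, $p$ are illustrative):

```python
import math

# Estimate the spectral radius of the comparison matrix
#   T = [[1-g1-g2-alpha, g2, alpha], [0, 1-g1-alpha, alpha], [beta, 0, 1-beta]]
# for sample mu, L, eta, alpha, beta satisfying the stated constraints,
# using power iteration (T is nonnegative here, so it converges to rho(T)).

mu, L, p = 0.5, 2.0, 4                       # illustrative values
alpha = 0.1
beta = p * alpha
eta = min(2 * (1 - alpha) / (mu + L), 0.1)   # respects eta <= 2(1-alpha)/(mu+L)
g1 = 2 * eta * mu * L / (mu + L)
g2 = 2 * eta * L * (1 - 2 * math.sqrt(mu * L) / (mu + L))

T = [[1 - g1 - g2 - alpha, g2, alpha],
     [0.0, 1 - g1 - alpha, alpha],
     [beta, 0.0, 1 - beta]]

def matvec(M, v):
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]

v = [1.0, 1.0, 1.0]
for _ in range(2000):
    w = matvec(T, v)
    n = max(abs(x) for x in w)   # max-norm normalization -> converges to rho(T)
    v = [x / n for x in w]
rho = n
print("estimated spectral radius: %.4f" % rho)
assert rho < 1.0
```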

\caption{Asynchronous EASGD: \newline Processing by worker $i$ and the master}
\label{alg:async}
\begin{tabular}{l}
\textbf{Input:} learning rate $\eta$, moving rate $\alpha$,\\ $\:\:\:\:\:\:\:\:\:\:\:\:\:\:$communication period $\tau \in \mathbb{N}$\\
\textbf{Initialize:} $\tilde{x}$ is initialized randomly, $x^i = \tilde{x}$, $t^i = 0$\\
\hline
\textbf{Repeat}\\
\:\:\:\:\:$x \leftarrow x^i$\\
\:\:\:\:\:\textbf{if} ($\tau $ divides $t^i $) \textbf{then}\\
\:\:\:\:\:\:\:\:\:\:\:\textbf{a)}\:$x^i \leftarrow x^i - \alpha (x - \tilde{x}) $\\
\:\:\:\:\:\:\:\:\:\:\:\textbf{b)}\:$\tilde{x} \:\:\:\!\!\leftarrow \tilde{x} \:\:\!+ \alpha (x - \tilde{x}) $\\
\:\:\:\:\:\textbf{end}\\
\:\:\:\:\:$ x^i \leftarrow x^i - \eta g^i_{t^i}(x)$\\
\:\:\:\:\:$t^i\:\:\:\:\!\!\!\!\leftarrow \:\!t^i \:\:\!\!+ 1$\\
\textbf{Until forever}
\end{tabular}
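To make the control flow of the listing above concrete, here is a single-process, round-robin sketch (our own illustration; the toy objective $f(x)=x^2/2$, the Gaussian noise model and all hyperparameter values are arbitrary choices, not from the paper):

```python
import random

# Single-process sketch of asynchronous EASGD: `center` plays the master's
# role, each worker keeps its own clock t^i, and workers are scheduled
# round-robin. The objective is the 1-D quadratic f(x) = x^2/2.

random.seed(1)
eta, alpha, tau, p = 0.05, 0.1, 4, 4
center = random.uniform(-1, 1)
workers = [center] * p          # x^i initialized to the center
clocks = [0] * p                # t^i

def grad(x):                    # noisy gradient of x^2/2
    return x + random.gauss(0, 0.1)

for step in range(5000):
    i = step % p                # round-robin scheduling
    x = workers[i]              # worker reads its parameter
    if clocks[i] % tau == 0:    # communicate every tau local steps
        workers[i] -= alpha * (x - center)   # a) elastic pull toward center
        center     += alpha * (x - center)   # b) center moves toward worker
    workers[i] -= eta * grad(x) # gradient evaluated at the snapshot x
    clocks[i] += 1

print("center = %.3f (optimum 0)" % center)
```

Note that the gradient is evaluated at the snapshot `x` taken before the elastic update, exactly as in the listing ($g^i_{t^i}(x)$, not $g^i_{t^i}(x^i)$).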
\caption{DOWNPOUR: Processing by worker $i$ and the master}
\label{alg:asyncD}
\begin{tabular}{l}
\textbf{Input:} learning rate $\eta$, communication period $\tau \in \mathbb{N}$\\
\textbf{Initialize:} $\tilde{x}$ is initialized randomly, $x^i = \tilde{x}$, $v^i = 0$, $t^i = 0$\\
\hline
\textbf{Repeat}\\
\:\:\:\:\:\textbf{if} ($\tau $ divides $t^i $) \textbf{then}\\
\:\:\:\:\:\:\:\:\:\:\:$\tilde{x} \:\:\:\!\!\leftarrow \tilde{x} \:\!+ v^i $\\
\:\:\:\:\:\:\:\:\:\:\:$x^i \leftarrow \tilde{x}$\\
\:\:\:\:\:\:\:\:\:\:\:$v^i \leftarrow 0$\\
\:\:\:\:\:\textbf{end}\\
\:\:\:\:\:$ x^i \leftarrow x^i - \eta g^i_{t^i}(x^i)$\\
\:\:\:\:\:$ v^i \leftarrow v^i - \eta g^i_{t^i}(x^i)$\\
\:\:\:\:\:$t^i\:\:\:\:\!\!\!\!\leftarrow \:\!t^i \:\:\!\!+ 1$\\
\textbf{Until forever}
\end{tabular}
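For comparison, the same toy setup applied to the DOWNPOUR listing above (again our own single-process sketch with arbitrary hyperparameters): each worker accumulates its updates in $v^i$ and flushes them to the master every $\tau$ steps, then re-reads the center.

```python
import random

# Single-process sketch of DOWNPOUR on the 1-D quadratic f(x) = x^2/2:
# v^i accumulates a worker's local updates between communications.

random.seed(2)
eta, tau, p = 0.01, 4, 4
center = random.uniform(-1, 1)
workers = [center] * p
v = [0.0] * p                   # v^i
clocks = [0] * p                # t^i

def grad(x):                    # noisy gradient of x^2/2
    return x + random.gauss(0, 0.1)

for step in range(8000):
    i = step % p                # round-robin scheduling
    if clocks[i] % tau == 0:
        center += v[i]          # master applies the accumulated update
        workers[i] = center     # worker re-syncs with the center
        v[i] = 0.0
    g = grad(workers[i])
    workers[i] -= eta * g       # local step ...
    v[i] -= eta * g             # ... recorded in the accumulator
    clocks[i] += 1

print("center = %.3f (optimum 0)" % center)
```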
\caption{MDOWNPOUR: Processing by worker $i$}
\label{alg:asyncMDworker}
\begin{tabular}{l}
\textbf{Initialize:} $x^i = \tilde{x}$\\
\hline
\textbf{Repeat}\\
\:\:\:\:\:\textsf{Receive} $\tilde{x}$ \textsf{from the master}: $x^i \leftarrow \tilde{x}$\\
\:\:\:\:\:\textsf{Compute gradient} $g^i = g^i(x^i)$\\
\:\:\:\:\:\textsf{Send} $g^i$ \textsf{to the master}\\
\textbf{Until forever}
\end{tabular}
\caption{MDOWNPOUR: Processing by the master}
\label{alg:asyncMDmaster}
\begin{tabular}{l}
\textbf{Input:} learning rate $\eta$, momentum term $\delta$\\
\textbf{Initialize:} $\tilde{x}$ is initialized randomly, $v = 0$\\
\hline
\textbf{Repeat}\\
\:\:\:\:\:\textsf{Receive} $g^i$\\
\:\:\:\:\:$v \leftarrow \delta v - \eta g^i$\\
\:\:\:\:\:$\tilde{x} \leftarrow \tilde{x} + \delta v$\\
\textbf{Until forever}
\end{tabular}
\begin{tabular}{ll}
& $\eta$ \\
\hline
EASGD & \{0.05, 0.01, 0.005\} \\
EAMSGD & \{0.01, 0.005, 0.001\} \\
DOWNPOUR, ADOWNPOUR, MVADOWNPOUR & \{0.005, 0.001, 0.0005\} \\
MDOWNPOUR & \{0.00005, 0.00001, 0.000005\} \\
SGD, ASGD, MVASGD & \{0.05, 0.01, 0.005\} \\
MSGD & \{0.001, 0.0005, 0.0001\} \\
\end{tabular}
\begin{tabular}{ll}
& $\eta$ \\
\hline
EASGD & \{0.05, 0.01, 0.005\} \\
EAMSGD & \{0.01, 0.005, 0.001\} \\
DOWNPOUR & \{0.005, 0.001, 0.0005\} \\
MDOWNPOUR & \{0.00005, 0.00001, 0.000005\} \\
SGD, ASGD, MVASGD & \{0.05, 0.01, 0.005\} \\
MSGD & \{0.001, 0.0005, 0.0001\} \\
\end{tabular}
\begin{tabular}{ll}
& $\eta$ \\
\hline
EASGD & 0.1 \\
EAMSGD & 0.001 \\
DOWNPOUR & 0.02 for $p=4$; 0.01 for $p=8$ \\
SGD, ASGD, MVASGD & 0.05 \\
MSGD & 0.0005 \\
\end{tabular}


$$ F^1 = \left( \begin{array}{ccc} 1-\eta-\alpha & 0 & \alpha \\ 0 & 1 & 0 \\ \alpha & 0 & 1-\alpha \end{array} \right), \quad F^2 = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1-\eta-\alpha & \alpha \\ 0 & \alpha & 1-\alpha \end{array} \right) $$

$$ 2\langle x^{i}_{t}-\tilde{x}_{t},\; x^{i}_{t}-x^{\ast}\rangle = \|x^{i}_{t}-x^{\ast}\|^{2}+\|x^{i}_{t}-\tilde{x}_{t}\|^{2}-\|\tilde{x}_{t}-x^{\ast}\|^{2}. $$

References

[bottou-98x] Bottou, L. \newblock Online algorithms and stochastic approximations. \newblock In {\em Online Learning and Neural Networks}. Cambridge University Press, 1998.

[2012distbelief] Dean, J, Corrado, G, Monga, R, Chen, K, Devin, M, Le, Q, Mao, M, Ranzato, M, Senior, A, Tucker, P, Yang, K, and Ng, A. \newblock Large scale distributed deep networks. \newblock In {\em NIPS}. 2012.

[krizhevsky2012imagenet] Krizhevsky, A, Sutskever, I, and Hinton, G.~E. \newblock Imagenet classification with deep convolutional neural networks. \newblock In {\em Advances in Neural Information Processing Systems 25}, pages 1106--1114, 2012.

[2013arXiv1312.6229S] {Sermanet}, P, {Eigen}, D, {Zhang}, X, {Mathieu}, M, {Fergus}, R, and {LeCun}, Y. \newblock {OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks}. \newblock {\em ArXiv}, 2013.

[nocedal2006numerical] Nocedal, J and Wright, S. \newblock {\em Numerical Optimization, Second Edition}. \newblock Springer New York, 2006.

[polyak1992acceleration] Polyak, B.~T and Juditsky, A.~B. \newblock Acceleration of stochastic approximation by averaging. \newblock {\em SIAM Journal on Control and Optimization}, 30(4):838--855, 1992.

[Bertsekas1989] Bertsekas, D.~P and Tsitsiklis, J.~N. \newblock {\em Parallel and Distributed Computation}. \newblock Prentice Hall, 1989.

[hestenes1975optimization] Hestenes, M.~R. \newblock {\em Optimization theory: the finite dimensional case}. \newblock Wiley, 1975.

[Boyd2011] Boyd, S, Parikh, N, Chu, E, Peleato, B, and Eckstein, J. \newblock Distributed optimization and statistical learning via the alternating direction method of multipliers. \newblock {\em Found. Trends Mach. Learn.}, 3(1):1--122, 2011.

[NIPS2014_5386] Shamir, O. \newblock Fundamental limits of online and distributed algorithms for statistical learning and estimation. \newblock In {\em NIPS}. 2014.

[Yadan2013] Yadan, O, Adams, K, Taigman, Y, and Ranzato, M. \newblock Multi-gpu training of convnets. \newblock In {\em Arxiv}. 2013.

[Paine2013] Paine, T, Jin, H, Yang, J, Lin, Z, and Huang, T. \newblock Gpu asynchronous stochastic gradient descent to speed up neural network training. \newblock In {\em Arxiv}. 2013.

[onebitparallel] Seide, F, Fu, H, Droppo, J, Li, G, and Yu, D. \newblock 1-bit stochastic gradient descent and application to data-parallel distributed training of speech dnns. \newblock In {\em Interspeech 2014}, September 2014.

[Bekkerman:2011:SUM:2107736.2107740] Bekkerman, R, Bilenko, M, and Langford, J. \newblock Scaling up machine learning: Parallel and distributed approaches. \newblock Cambridge University Press, 2011.

[ChoromanskaHMAL15] Choromanska, A, Henaff, M.~B, Mathieu, M, Arous, G.~B, and LeCun, Y. \newblock The loss surfaces of multilayer networks. \newblock In {\em AISTATS}, 2015.

[NIPS2013_4894] Ho, Q, Cipar, J, Cui, H, Lee, S, Kim, J.~K, Gibbons, P.~B, Gibson, G.~A, Ganger, G, and Xing, E.~P. \newblock More effective distributed ml via a stale synchronous parallel parameter server. \newblock In {\em NIPS}. 2013.

[DBLP:conf/icml/AzadiS14] Azadi, S and Sra, S. \newblock Towards an optimal stochastic alternating direction method of multipliers. \newblock In {\em ICML}, 2014.

[doi:10.1137/S0363012995282784] Borkar, V. \newblock Asynchronous stochastic approximations. \newblock {\em SIAM Journal on Control and Optimization}, 36(3):840--851, 1998.

[Nedic2001381] Nedi\'{c}, A, Bertsekas, D, and Borkar, V. \newblock Distributed asynchronous incremental subgradient methods. \newblock In {\em Inherently Parallel Algorithms in Feasibility and Optimization and their Applications}, volume~8 of {\em Studies in Computational Mathematics}, pages 381--407. 2001.

[langford2009slow] Langford, J, Smola, A, and Zinkevich, M. \newblock Slow learners are fast. \newblock In {\em NIPS}, 2009.

[NIPS2011_0574] Agarwal, A and Duchi, J. \newblock Distributed delayed stochastic optimization. \newblock In {\em NIPS}. 2011.

[DBLPRechtRWN11] Recht, B, Re, C, Wright, S.~J, and Niu, F. \newblock {Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent}. \newblock In {\em NIPS}, 2011.

[zinkevich2010parallelized] Zinkevich, M, Weimer, M, Smola, A, and Li, L. \newblock Parallelized stochastic gradient descent. \newblock In {\em NIPS}, 2010.

[Nesterov:2005:SMN:1058099.1058103] Nesterov, Y. \newblock Smooth minimization of non-smooth functions. \newblock {\em Math. Program.}, 103(1):127--152, 2005.

[lan2012optimal] Lan, G. \newblock An optimal method for stochastic composite optimization. \newblock {\em Mathematical Programming}, 133(1-2):365--397, 2012.

[icml2013_sutskever13] Sutskever, I, Martens, J, Dahl, G, and Hinton, G. \newblock On the importance of initialization and momentum in deep learning. \newblock In {\em ICML}, 2013.

[icml2014c2_zhange14] Zhang, R and Kwok, J. \newblock Asynchronous distributed admm for consensus optimization. \newblock In {\em ICML}, 2014.

[ouyang2013stochastic] Ouyang, H, He, N, Tran, L, and Gray, A. \newblock Stochastic alternating direction method of multipliers. \newblock In {\em Proceedings of the 30th International Conference on Machine Learning}, pages 80--88, 2013.

[DBLPWanZZLF13] Wan, L, Zeiler, M.~D, Zhang, S, LeCun, Y, and Fergus, R. \newblock Regularization of neural networks using dropconnect. \newblock In {\em ICML}, 2013.

[cesa2004generalization] Cesa-Bianchi, N, Conconi, A, and Gentile, C. \newblock On the generalization ability of on-line learning algorithms. \newblock {\em IEEE Transactions on Information Theory}, 50(9):2050--2057, 2004.

[nesterov2004introductory] Nesterov, Y. \newblock {\em Introductory lectures on convex optimization}, volume~87. \newblock Springer Science & Business Media, 2004.
