Explorations on high dimensional landscapes
Levent Sagun¹, V. Uğur Güney², Gérard Ben Arous¹, & Yann LeCun¹,³

¹Courant Institute of Mathematical Sciences, New York, NY 10012
²Department of Physics, The City University of New York, New York, NY
³Facebook AI Research, 770 Broadway, New York, NY 10003

sagun@cims.nyu.edu, uguney@gc.cuny.edu, benarous@cims.nyu.edu, yann@cs.nyu.edu
Abstract
Finding minima of a real valued non-convex function over a high dimensional space is a major challenge in science. We provide evidence that some such functions that are defined on high dimensional domains have a narrow band of values whose pre-image contains the bulk of its critical points. This is in contrast with the low dimensional picture in which this band is wide. Our simulations agree with the previous theoretical work on spin glasses that proves the existence of such a band when the dimension of the domain tends to infinity. Furthermore our experiments on teacher-student networks with the MNIST dataset establish a similar phenomenon in deep networks. We finally observe that both the gradient descent and the stochastic gradient descent methods can reach this level within the same number of steps.
Introduction
Many problems of practical interest can be rephrased as optimization problems in which one needs to locate a point in a given domain with an optimal energy or cost value. Given a dynamics on such a surface, the question of where it will stop on the landscape is a challenging one, especially if the terrain is rugged. Complex¹ systems can be described by highly non-convex functions that live on high dimensional spaces and have exponentially many critical points.
One such problem arises in the context of learning: Given a set of input-output pairs, one seeks to find a function whose values roughly match the ones in the set and which also performs well on new inputs. That is, it learns the task described by the set it began with. To this end, a loss function is formed which is optimized over the parameters of the desired function. Performance of this loss minimization is tested against another set of input-output pairs.
The second problem we consider comes from statistical physics: In magnetic materials, the spins are aligned according to interactions with their neighbors. Over time equilibrium is reached when particles settle down to a configuration that has a reasonably low energy, if not the lowest possible. Stability of this state is tested against small fluctuations.
Some machine learning problems, such as deep networks, hint at a similarity with the spin systems of statistical mechanics. Finding connections between the two has become an attractive research area. For a nice survey of theoretical results on Hopfield networks and Boltzmann machines, see Agliari et al. (2014) and Barra et al. (2012). In a slightly different approach, Dauphin et al. (2014) and Choromanska et al. (2014) show that stochastic gradient descent might end up in critical points other than local minima.
The energy landscapes of spin glasses are of interest in themselves. One of the main questions concerns the number of critical points of a given index at a given level. For a detailed analysis of this in spherical spin glasses, see Auffinger et al. (2013), Auffinger & Ben Arous (2013) and
¹Even though we use 'complex' to refer to systems that have exponentially many critical points, we would like to point out that, to the best of our knowledge, there is no agreed-upon definition across disciplines of what a complex system is.
Auffinger & Chen (2014). The first of this sequence of papers is used in the corresponding experimental section of the present paper. It establishes the existence of an energy level, which we name the floor², in which the bulk of the low-index critical points lie in the absence of an external field. It is proved that this level contains exponentially many local minima, and for an algorithm to pass this level the search time should be exponentially long. Yet the floor lies just above the ground state; therefore, from an optimization point of view, it does not matter whether the algorithm reaches the global minimum or a local minimum at the floor. For a similar analysis in a slightly more general context using random matrices, see Fyodorov (2013) and Fyodorov & Le Doussal (2013); for a review of random matrices and extreme value statistics, see Dean & Majumdar (2008). An added external field, as a tunable parameter, allows changing the topology of the landscape. Inspired by this work, Mehta et al. (a) give a detailed experimental analysis of the case in which there are polynomially many critical points, and Mehta et al. (b) count critical points for third order interactions, in which case there are exponentially many critical points.
In this work, we focus on the case with exponentially many critical points. On such high dimensional non-convex surfaces, a moving object is likely to get stuck at a local minimum or on a plateau of flat, close-to-degenerate areas. A reasonable algorithm should have enough noise to make the particle jump out of critical points with high energy. However, if the landscape has exponentially many critical points that lie around the same energy level, then jumping out of one well will lead to another point of similar energy, clearly not resulting in any improvement. The existence of such a floor increases the search time for points with lower energy values regardless of the method of choice. Alternatively, one can attempt to change the system so that the floor is at a different, more favorable value, namely closer to the global minimum. One will of course be curious about the performance of such a point in the learning context, or its stability in the spin glass context.
So we ask two questions:
- Where does the particle end up and how good is that point for the task at hand?
- What do various descent algorithms do differently?
We perform experiments on spin glasses and on deep networks to address the questions above. The differences in the graph connectivities of spins and weights indicate that the two systems are not mathematically analogous. Rather, we propose that they are special cases of a more general, not yet discovered phenomenon. The authors hope to extract further knowledge from such complex systems that will shed light on a thorough theoretical understanding in the future.
Setting for experiments and previous theoretical work
Mean field spherical spin glass model
The simplest model of a magnetic material is one in which atoms have spins with two states: +1 for spin up or -1 for spin down. Considering pairwise interactions (in other words, 2-body interactions) between neighbors gives the total energy of the system, $-\sum_{ij} w_i w_j$, where $w_i$ represents the spin of the particle located at position $i$. The mean field assumption ignores the lattice geometry of the particles and assumes all interactions have equal strength. Introducing a weak external field leads all particles to favor alignment, which corresponds to the minimum energy configuration, achieved at the point where all of the particles have spin up (or down, for that matter). This is a rather simple landscape whose global minimum is easy to reach. It assumes no frustration, that is, all particles favor alignment; this model is known as the Curie-Weiss model. If we have an alloy in which some particle pairs favor alignment and some favor anti-alignment, the picture changes drastically: this introduces a whole new set of constraints which may not be simultaneously satisfiable, leading to glassy states. One way to model this is to introduce random interaction coefficients between particles, $\sum_{ij} x_{ij} w_i w_j$; this model, which has a rather complex energy landscape, is known as the Sherrington-Kirkpatrick model. See Agliari et al. (2014) and Sherrington (2014) for a survey of spin glasses and their connection to complex systems.
²Throughout this paper, energy levels that an algorithm cannot pass in a reasonable amount of time will be referred to as the floor. For example, in the spin glass context of the following section it corresponds to the smallest energy at which the function $\Theta(u)$ in equation (2) attains its highest value, namely $-E_\infty$.
We can also disregard the discrete states, put the points on the sphere, and consider 3-body interactions of $N$ particles. Taking a point $w \in S^{N-1}(\sqrt{N}) \subset \mathbb{R}^N$ and $x_{(\cdot)} \sim \text{Gaussian}(0,1)$, the Hamiltonian with standard normal interactions is given by $\sum_{ijk} x_{ijk} w_i w_j w_k$. We have arrived at the model of interest for the purposes of this paper. While this model has been studied previously in great detail, our particular focus is on the location of its critical points. The main results we use in this section, and further details on the model, can be found in Auffinger et al. (2013) and the references therein.
Auffinger et al. (2013) consider a more general version that involves any combination of interactions, not only pairs or triplets. The main reason we consider triplets is that it is the smallest system with exponentially many critical points. Before moving any further, we remark on why there are only polynomially many critical points in the binary interaction case. For $\sum_i w_i^2 = N$ and $x_{ij} \sim \text{Gaussian}(0,1)$, the Hamiltonian of the system is given by
$$
H_N(w) = \sum_{i,j=1}^{N} x_{ij} w_i w_j = \langle w, M w \rangle,
$$
which is a quadratic form, where $M_{ij} = \frac{x_{ij} + x_{ji}}{2}$ is a random $N \times N$ Gaussian Orthogonal Ensemble matrix. This system has $2N$ critical points, located at the (suitably rescaled) eigenvectors of $M$, and their corresponding energy values are the eigenvalues scaled by $N$.
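This claim is easy to check numerically. The following sketch (with a small illustrative $N$, not a dimension used in the paper) builds the symmetrized matrix $M$, verifies that an eigenvector rescaled onto the sphere has vanishing spherical gradient, and that its energy is the corresponding eigenvalue scaled by $N$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = rng.standard_normal((N, N))
M = (x + x.T) / 2.0                      # GOE-type symmetric matrix

# H(w) = <w, M w> on the sphere of radius sqrt(N).
eigvals, eigvecs = np.linalg.eigh(M)
v = np.sqrt(N) * eigvecs[:, 0]           # eigenvector rescaled onto the sphere

# Spherical (projected) gradient: full gradient minus its radial component.
grad = 2.0 * M @ v
tangential = grad - (grad @ v) / (v @ v) * v

print(np.linalg.norm(tangential))        # ~0: the eigenvector is a critical point
print(v @ M @ v, N * eigvals[0])         # energy equals N times the eigenvalue
```

Since the full gradient $2Mw$ is purely radial at an eigenvector, the tangential component vanishes, which is exactly the critical-point condition on the sphere.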
Floor of the Hamiltonian
For $w \in S^{N-1}(\sqrt{N}) \subset \mathbb{R}^N$ and $x_{(\cdot)} \sim \text{Gaussian}(0,1)$, consider the function

$$
H_N(w) = \frac{1}{N}\sum_{i,j,k}^{N} x_{ijk} w_i w_j w_k \tag{1}
$$
Once the random coefficients are chosen, the landscape to be explored is fixed. Note that, as a continuous function on a compact set, $H_N$ attains its minimum and maximum. Also, since the landscape is symmetric through the origin, its local and global maximum points have the same energy values as the corresponding minima, with the opposite sign. We focus only on the minimum.
Let $\mathcal{N}_{N,k}(u)$ denote the total number of index-$k$ critical points of $H_N$ that lie below the level $Nu$. In other words, $\mathcal{N}_{N,k}(u)$ is the random variable that counts critical points in the set $\{w : H_N(w) \leq Nu\}$. Auffinger et al. (2013) find the asymptotic expected value of this quantity on a logarithmic scale:
$$
\mathbb{E}[\mathcal{N}_{N,k}(u)] \asymp e^{N\Theta_k(u)} \tag{2}
$$

Figure 1: Plot of $\Theta_0(u)$ (upper curve, local minima) and $\Theta_1(u)$ (lower curve, index-1 saddles). Their values agree at the point $-E_\infty$, which is the asymptotic value of the floor, and remain constant beyond it. Note also that the energy values on the horizontal axis are scaled by $N$.
In other words, $\Theta(u)$ is a measure of the cumulative number of critical points from the ground state up to the level $Nu$. An exact analytic description of the function $\Theta(u)$ can be found in Auffinger et al. (2013). Figure 1 shows that analytic expression for $k = 0$ (local minima) and $k = 1$ (saddles of index 1). It shows, as expected, that $\Theta$ in equation (2) is non-decreasing, as it counts the number of critical points cumulatively. $\Theta$ becomes positive³ and keeps increasing until it becomes constant at a value denoted $-E_\infty$, the lowest $u$ for which $\Theta$ is at its maximum value. Hence the number of critical points of low index does not increase at high energy values.
Clearly the function in (1) is a Gaussian random variable, as it is a sum of independent Gaussians. The $1/N$ normalization factor is chosen so that $\mathrm{Var}(H_N(w)) = N$; that is, we have a random field with mean zero and variance $N$ at each point. Another implication of the size of the variance of the Hamiltonian is that its extensive quantities scale with $N$, for which Auffinger et al. (2013) give:
$$
\liminf_{N\rightarrow \infty} \Big( \min_w \frac{H_N(w)}{N} \Big) \geq -E_0 \tag{3}
$$
where $-E_0$ is the zero of the function $\Theta_0$. In words, the global minimum of the Hamiltonian on the sphere is asymptotically bounded from below by $-NE_0$. This lower bound gives a measure of how far a given point is from the ground state. For our experiments, the energies at the ground state and at the floor are calculated to be $-1.657N$ and $-1.633N$, respectively. For practical purposes the floor level is close to the ground state. At the level $-1.633N$ (the vertical line in Figure 1) the landscape is expected to contain exponentially many points that are local minima or saddles of low index, and the exponent is at its maximum. Therefore the probability of finding a local minimum below the floor level is exponentially small. This leads us to conjecture that the floor is the first place where a descent method gets stuck, which is confirmed in our simulations. Diving deeper would require a long search time; therefore trying to find points with lower energy values is neither necessary nor practically realizable.
Setting for MNIST
Consider a centered probability measure $\mu$ on $\mathbb{R}^n$ that represents the data, and a function $\mathcal{G}$ from $\mathbb{R}^n$ to $\mathbb{R}$ that correctly labels the data. The true loss then measures how well a given parametric function $G(w)$ approximates $\mathcal{G}$:
$$
d(w) = d(\mathcal{G}, G(w)) = \int_{\mathbb{R}^n} d\big(\mathcal{G}(x), G(w)(x)\big)\, d\mu(x)
$$
For an integer $P$, consider an i.i.d. training sample $x^p$, $p = 1, \dots, P$, drawn from the measure $\mu$, and define the empirical training loss:
$$
\mathcal{L}_{\text{Train}}(w) = \frac{1}{P} \sum_{p=1}^P L(x^p, w)
$$
where $L$ is the loss per sample, which can be the mean square loss, the hinge loss, or the cross-entropy. Also, by the law of large numbers:
$$
\mathcal{L}_{\text{Train}}(w) \rightarrow E_{\mu}[L(x,w)] = d(w) \quad \mu\text{-a.s. as } P \rightarrow \infty.
$$
The limit holds pointwise, so the sampled loss approximates the true loss well as the number of samples increases. Moreover, by the central limit theorem the following holds pointwise:

$$
\sqrt{P}\big(\mathcal{L}_{\text{Train}}(w) - d(w)\big) \rightarrow \mathcal{N}\big(0, \mathrm{Var}_{\mu}[L(x,w)]\big),
$$

so that the fluctuations of this approximation are pointwise Gaussian. In real-life problems the true loss is not accessible; instead we have $\mathcal{L}_{\text{Train}}(w)$, which approximates $d(w)$. From the above convergence properties we expect both the test and the training loss to converge with good accuracy to the true loss. Therefore we also expect $\mathcal{L}_{\text{Train}}(w) \sim \mathcal{L}_{\text{Test}}(w)$ if there is enough data for both functions. A natural question is how close they are to each other. If we sample points from the floor of the training surface, do they necessarily correspond to the floor of the test surface? How do the smaller-scale fluctuations affect learning? We address these questions in the rest of the paper.
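As a toy illustration of these convergence statements, the sketch below uses a synthetic one-dimensional squared loss standing in for $L$, with hypothetical sample sizes; the empirical loss approaches the true loss as $P$ grows, with fluctuations shrinking roughly like $1/\sqrt{P}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(x, w):
    # Toy per-sample loss; stands in for L(x^p, w) in the text.
    return (x - w) ** 2

w = 0.5
true_loss = 1.0 + w ** 2                 # E[(x - w)^2] for x ~ N(0, 1)

for P in (100, 10_000, 1_000_000):
    x = rng.standard_normal(P)
    emp = loss(x, w).mean()              # empirical training loss at w
    print(P, abs(emp - true_loss))       # gap shrinks roughly like 1/sqrt(P)
```

The same reasoning applies coordinate-wise to any fixed parameter vector $w$ of a network, which is why the training and test surfaces are expected to be close pointwise when data is plentiful.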
³Note that in the binary interaction case described above, the exponent in (2) never crosses zero; otherwise it would imply exponentially many critical points. The binary interaction case, or the two-body case, is therefore an example of a simple system.
Simulations and experiments
Our simulations in the first part below show the existence of a critical dimension below which the landscape is hectic and shows no sign of a well-defined value to which algorithms converge. This implies that in low dimensions the surface is not trivial: there are traps at high energy levels, and those traps are located at somewhat arbitrary values, as seen in Figure 2. The high dimensional picture is drastically different. The landscape is trivial in the following sense: starting from a random point, following the gradient descent algorithm almost always leads to a very narrow band of values. Moreover, in the second part we show that this descent is irrespective of which algorithm is used.
Floor for high dimensional systems
Spin glasses
Simulations of the spin glass model show a clear qualitative difference between low dimensional and high dimensional surfaces. Namely, low dimensional surfaces do not exhibit a well-defined floor, whereas high dimensional surfaces do. Another important feature is that the floor level is very close to the global minimum; therefore reaching the floor is sufficient for the practical purposes of optimization, even setting aside the problem of reaching the global minimum.
The gradient descent algorithm is used on the spin glass landscape. The procedure starts by fixing random couplings: given the dimension $N$, sample $N^3$ i.i.d. standard normal random variables. Then:

- Pick a random element $w$ of the sphere $S^{N-1}(\sqrt{N})$ as a starting point for each trial.
- Iterate⁴ $w_{t+1} = w_t - \gamma_t \nabla_w H(w_t)$, and normalize: $w_{t+1} \leftarrow \sqrt{N}\, w_{t+1} / \|w_{t+1}\|$.
- Stop when the gradient norm is below the order of $10^{-5}$.
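The procedure above can be sketched in a few lines of numpy. The dimension, step size, and iteration count below are illustrative choices, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 30
x = rng.standard_normal((N, N, N))           # fixed random couplings x_ijk

def hamiltonian(w):
    # H_N(w) = (1/N) * sum_{ijk} x_ijk w_i w_j w_k
    return np.einsum('ijk,i,j,k->', x, w, w, w) / N

def grad(w):
    # Partial derivatives of H_N; x is not symmetric, so sum over all three slots.
    return (np.einsum('ijk,j,k->i', x, w, w)
            + np.einsum('ijk,i,k->j', x, w, w)
            + np.einsum('ijk,i,j->k', x, w, w)) / N

w = rng.standard_normal(N)
w *= np.sqrt(N) / np.linalg.norm(w)          # random start on S^{N-1}(sqrt(N))

gamma = 0.01
for _ in range(5000):
    g = grad(w)
    g -= (g @ w) / N * w                     # keep the step tangent to the sphere
    w = w - gamma * g
    w *= np.sqrt(N) / np.linalg.norm(w)      # project back onto the sphere

print(hamiltonian(w) / N)                    # energy per spin after descent
```

Repeating this from many random starting points produces the histograms of Figure 2: at small $N$ the final energies are scattered, while at large $N$ they concentrate in a narrow band just above the asymptotic floor.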

Figure 2: Vertical black lines indicate the theoretical value of the floor in the large-$N$ limit. When $N$ is small, the system gets trapped at bad energy values (upper histogram). When $N$ is finite but large, the floor is a narrow band (lower histogram).
For each $N$ the experiment is repeated 5000 times, and the stopping criterion is the norm of the gradient vector. The theoretical results of the previous section hold asymptotically. At this point, to the best of the authors' knowledge, there is no proof of how fast the energies of critical
4 Note here the constraint that the gradient is tangential to the sphere.
points concentrate around the limiting floor value. We hope this simulation will lead to a prediction in this direction. Nevertheless, it is observed that the points found with this procedure lie within a narrow band of values just above the asymptotic floor level given in the previous section.
Tri-partite spin glass: a toy model for multilayer Restricted Boltzmann Machines

Figure 3: Graph connectivity: p-spin spin glass vs tri-partite spin glass.

The graph structure of the fully connected spin glass above is highly coupled; in this section we modify the function so that the terms of the polynomial have a layered structure. This is achieved by simulating an uncoupled version of the above spin glass model in a way that mimics the layered structure of a multilayer Restricted Boltzmann Machine. From an optimization point of view, a similar consideration can be found in Frieze & Kannan (2008). In this new setting, take three

Figure 4: Tri-partite spins low dimension vs high dimension comparison.
different points on the sphere, or equivalently, take one point on the product of three spheres to form the tri-partite Hamiltonian. Instead of equation 1, the uncoupled function at hand becomes $\tilde{H}_N(w) = \frac{1}{N} \sum_{ijk} x_{ijk} w^1_i w^2_j w^3_k$. Then perform gradient descent over all parameters and normalize the resulting $3N$-dimensional vector on the product of spheres. Since there are fewer constraints, the values obtained by this procedure should be lower, which is indeed the case. Figure 4 shows a similar floor structure of the landscape. However, in this case we do not have any theoretical result that tells us the value of the ground state; therefore, even though we find evidence of a floor in high dimensions, it is hard to tell how far that floor value lies from the actual ground state. Since the order of growth of the function is still $N$, the ratio of the ground state to the value of the floor should be the same as in the previous section; however, we do not, at this point, have a proof of this.
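A minimal sketch of the tri-partite descent, under the same illustrative conventions as the earlier gradient descent sketch (small $N$, hypothetical step size), with each of the three blocks renormalized onto its own sphere:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 20
x = rng.standard_normal((N, N, N))

def energy(w1, w2, w3):
    # Tri-partite Hamiltonian: (1/N) sum_ijk x_ijk w1_i w2_j w3_k
    return np.einsum('ijk,i,j,k->', x, w1, w2, w3) / N

def renorm(v):
    # Project a block back onto its sphere of radius sqrt(N).
    return np.sqrt(N) * v / np.linalg.norm(v)

w1, w2, w3 = (renorm(rng.standard_normal(N)) for _ in range(3))

gamma = 0.05
for _ in range(3000):
    # Each block has its own partial gradient; no index appears twice.
    g1 = np.einsum('ijk,j,k->i', x, w2, w3) / N
    g2 = np.einsum('ijk,i,k->j', x, w1, w3) / N
    g3 = np.einsum('ijk,i,j->k', x, w1, w2) / N
    w1 = renorm(w1 - gamma * g1)
    w2 = renorm(w2 - gamma * g2)
    w3 = renorm(w3 - gamma * g3)

print(energy(w1, w2, w3) / N)        # energy per spin after descent
```

Because each coordinate block is only constrained to its own sphere, the descent has more freedom than in the coupled model, which is consistent with the lower values observed in Figure 4.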
Teacher-student network: case of a known ground state at zero
In this section we design a system for which we know where the global minimum lies. The idea for this experiment is inspired by the teacher-student network comparisons of Saad et al. (1996) and West et al. (1997). It goes as follows: split the MNIST training set into two parts. Using only the first half of the training data, train a network with two hidden layers, 784-500-300-10, with ReLUs at the first two layers and a soft-max at the output layer, using the cross-entropy loss and SGD. This gives the teacher network; it makes 211 mistakes on the test set of size 10,000. Using the teacher network, create new labels on the second half of the training set by replacing the actual tags with the probabilities assigned by the outputs of the teacher network.
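The relabeling step can be sketched as follows. The random linear "teacher" and the input data here are stand-ins for illustration only; the paper's teacher is the trained 784-500-300-10 ReLU network evaluated on real MNIST images:

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(z):
    # Numerically stable soft-max over the class dimension.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Stand-in teacher: a random linear map followed by soft-max.
W = rng.standard_normal((784, 10)) * 0.01

def teacher_probs(x):
    return softmax(x @ W)

# Stand-in for the second half of the training set.
x_second_half = rng.standard_normal((100, 784))

# Replace the hard labels with the teacher's probability outputs.
soft_labels = teacher_probs(x_second_half)
print(soft_labels.shape)                 # one probability vector per example
```

The students are then trained on `(x_second_half, soft_labels)` pairs with the same cross-entropy loss, so a student can in principle match the teacher's outputs exactly.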
Table 1: Student networks trained with SGD

| Student network | Training cost | Test cost | Test error | St. dev. (test error) |
|---|---|---|---|---|
| 784-50-50-10 | 0.0183 | 0.169 | 326 | 11.3 |
| 784-250-150-10 | 0.0172 | 0.131 | 276 | 7.6 |
| 784-500-300-10 | 0.017 | 0.125 | 265 | 6.6 |
| 784-1200-1200-10 | 0.0168 | 0.118 | 257 | 5.2 |
If the size of the student network is at least as big as the teacher network, zero-cost points are guaranteed to exist. Moreover, there are exponentially many of them: an exact copy of the teacher network can be found, and, appealing to the symmetries of the network, one can permute the weight connections without changing the cost. All student networks are trained with SGD, and the algorithm does not reach zero cost: it either gets stuck at a critical point or slows down extremely at a very flat area of the surface whose value is away from zero. We conjecture that this level is the floor of the student training surface. Also notice the different behavior of the low dimensional networks, which have bad critical points and no well-defined floor, again in contrast with the high dimensional networks.

Figure 5: Training cost values for different student network sizes. High dimensional surfaces exhibit a similar approach to the asymptotic floor value. For student networks at least as large as the teacher, there are exponentially many zero-cost points on the training surface, yet SGD cannot locate them!
The data that the students are trained on are not exact; they are partial in the sense that many labels are vague. For some characters, the teacher does not know the answer for sure. Figure 6 captures this vague response of the teacher, which is learned by the student. But this ambiguity does not stop the teacher from teaching; instead, the teacher passes on the information together with the ambiguity, which is some information by itself (reminiscent of the Dark Knowledge of Hinton et al.).
Recall that the teacher did not see the second half of the data during its training, but used it to teach its students. We look at the students to see how well they learned what the teacher taught. It turns out, perhaps not surprisingly, that the ambiguity propagates: many mistakes that the teacher makes on the second half of the data are also made by its students. There are cases where a student judges more correctly. See Figure 6 for an example digit 0 that the teacher mistakenly thought was a 6. Luckily, as an honest teacher, it did not train its students as if it were a firm 6. With the help of this extra information, the student was able to classify the digit correctly by a tiny margin.


Figure 6: One example of a medium sized student that correctly classifies a mistake of the teacher. The digit is seen on the right side. Histograms are probability outputs of the teacher (left) and the student (middle).
Floor by gradient descent vs. stochastic gradient descent
This section compares the two algorithms in terms of where they reach on their training surface.
SGD for spin glass
A spin glass field is created by a sum of smaller such fields:
$$
H_N(w) = \frac{1}{N}\sum_{i,j,k}^{N} x_{ijk} w_i w_j w_k = \sum_{p=1}^{P} \Big( \frac{1}{N}\sum_{i,j,k}^{N} x^p_{ijk} w_i w_j w_k \Big) = \frac{1}{N} \sum_{i,j,k}^{N} \Big( \sum_{p=1}^{P} x^p_{ijk} \Big) w_i w_j w_k \tag{4}
$$
Here $x^p_{ijk} \sim \text{Gaussian}(0, \frac{1}{P})$, so that the resulting sum field is distributed as the one in equation 1. At the $p$-th step, the point on the sphere is updated in the negative direction of the gradient of the $p$-th summand. This procedure is stochastic gradient descent with a mini-batch of size 1. It is important to note that all summands are independent of each other, unlike in the MNIST case. SGD still goes to the floor. The tiny difference in the values reached by SGD comes from the fact that SGD slows down very quickly and stops at the walls of a well. Once SGD slows down, one can restart GD from that point and reach the bottom of the well.
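A sketch of this mini-batch-of-size-1 procedure, under the same illustrative conventions as the earlier gradient descent sketch (small $N$ and $P$, hypothetical step size and iteration count):

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 20, 10
# P independent coefficient fields; each x^p_ijk ~ N(0, 1/P), so their sum is N(0, 1).
xs = rng.standard_normal((P, N, N, N)) / np.sqrt(P)

def grad_p(x, w):
    # Gradient of (1/N) sum_ijk x_ijk w_i w_j w_k over all three index slots.
    return (np.einsum('ijk,j,k->i', x, w, w)
            + np.einsum('ijk,i,k->j', x, w, w)
            + np.einsum('ijk,i,j->k', x, w, w)) / N

def energy(w):
    # Energy of the full (summed) field at w.
    return np.einsum('pijk,i,j,k->', xs, w, w, w) / N

w = rng.standard_normal(N)
w *= np.sqrt(N) / np.linalg.norm(w)      # random start on the sphere

gamma = 0.02
for step in range(20000):
    x = xs[step % P]                     # mini-batch of size 1: a single summand
    g = grad_p(x, w)
    g -= (g @ w) / N * w                 # tangential component only
    w = w - gamma * g
    w *= np.sqrt(N) / np.linalg.norm(w)  # project back onto the sphere

print(energy(w) / N)                     # energy per spin of the full field
```

Cycling through the summands makes the average drift equal to the full gradient, which is why SGD here reaches the same band of energies as GD, up to the small noise floor discussed above.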

Figure 7: Both methods reach the floor. Different $P$'s eventually give the same energy when measured at the same step-size-times-number-of-steps. $P = 1$ corresponds to gradient descent.

Table 2: Averages of results for gradient descent and stochastic gradient descent on MNIST

| Network and method | Training cost | Test cost | Test error | St. dev. (test error) |
|---|---|---|---|---|
| 50-50 GD | 0.000452 | 0.202 | 299 | 15.5 |
| 50-50 SGD | 0.000386 | 0.157 | 256 | 11.6 |
| 500-300 GD | 0.000155 | 0.102 | 194 | 6.9 |
| 500-300 SGD | 0.000138 | 0.0847 | 174 | 5.3 |
MNIST
This experiment compares GD with SGD on full MNIST in a two layer network. One property attributed to SGD is that, due to its noisy nature, it is capable of escaping local minima at higher cost values. This would imply that GD gets stuck before SGD slows down. However, this does not seem to be the case at all: with the same step size, SGD and GD perform very similarly on the training surface. Table 2 shows the mean cost values and the difference in test errors.

Figure 8: Mean costs of the GD and SGD experiments are given by the thin lines, and the shaded areas around the curves show the standard deviation around the mean. The vertical axis is in log scale; the decay of the cost is not noticeable in linear scale. Progress is very slow even in log scale, and the two curves follow each other tightly.
Conclusion
High dimensional systems are typically considered to come with curses, but they also exhibit many symmetries, which can be a blessing, as observed in the first part of the simulations. One goal of this paper is to trigger theoretical research on non-convex surfaces over high dimensional domains. It is crucial to repeat that we do not suggest a direct equivalence between spin glasses and deep networks; rather, we hint at a more general phenomenon that governs the two different cases. Finding similar properties in a variety of other problems might help us identify and quantify the properties of such complex systems. We hope further investigation of other optimization problems will lead to supporting conclusions in line with this research.
Acknowledgements
We thank David Belius for valuable discussions, Taylan Cemgil and Atilla Yılmaz for valuable feedback, and reviewers for valuable suggestions. We thank the developers of Torch7 (Collobert et al. (2011)) which we used for the spin glass simulations and Theano (Bastien et al. (2012) and Bergstra et al. (2010)) which we used for MNIST experiments. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for part of this research.
References
Finding minima of a real valued non-convex function over a high dimensional space is a major challenge in science. We provide evidence that some such functions that are defined on high dimensional domains have a narrow band of values whose pre-image contains the bulk of its critical points. This is in contrast with the low dimensional picture in which this band is wide. Our simulations agree with the previous theoretical work on spin glasses that proves the existence of such a band when the dimension of the domain tends to infinity. Furthermore our experiments on teacher-student networks with the MNIST dataset establish a similar phenomenon in deep networks. We finally observe that both the gradient descent and the stochastic gradient descent methods can reach this level within the same number of steps.
Many problems of practical interest can be rephrased as forms of an optimization problem in which one needs to locate a point in a given domain that has some optimal energy or cost value. Given a dynamics on such a surface the question of where it will stop on the landscape on which it is moving is a challenging one, especially if the terrain is rugged. Complex111Even though we use ‘complex’ to refer to systems that has exponentially many critical points, we would like to point out that to the best of our knowledge there is no agreed upon definition of what complex a system is across disciplines. systems can be described by functions that are highly non-convex, that lives on high dimensional spaces and they have exponentially many critical points.
One such problem arises in the context of learning: Given a set of input-output pairs, one seeks to find a function whose values roughly match the ones in the set and which also performs well on new inputs. That is, it learns the task described by the set it began with. To this end, a loss function is formed which is optimized over the parameters of the desired function. Performance of this loss minimization is tested against another set of input-output pairs.
The second problem we consider comes from statistical physics: In magnetic materials, the spins are aligned according to interactions with their neighbors. Over time equilibrium is reached when particles settle down to a configuration that has a reasonably low energy, if not the lowest possible. Stability of this state is tested against small fluctuations.
Some machine learning problems, such as deep networks, hint at similarity with spin systems of statistical mechanics. Finding connections between the two became an attractive research area. For a nice survey of theoretical results between Hopfield networks and Boltzmann machines, see Agliari et al. (2014), and Barra et al. (2012). In a slightly different approach Dauphin et al. (2014) and Choromanska et al. (2014) exhibit that stochastic gradient descent might end up in critical points other than local minima.
The energy landscapes of spin glasses are of interest in themselves. One of the main questions concerns the number of critical points of a given index at a given level. For a detailed analysis of this in spherical spin glasses, see Auffinger et al. (2013), Auffinger & Ben Arous (2013) and Auffinger & Chen (2014). The first of this sequence of papers will be used in this present paper in the corresponding experimental section. It establishes the existence of an energy level, which we named floor222Throughout this paper the energy levels that an algorithm can not pass in a reasonable amount of time will be referred to as floor. For example in the spin glass context of the following section it corresponds to the smallest energy of the highest value of the function Θ(u)Θ𝑢\Theta(u) in equation 1, which is −E∞subscript𝐸-E_{\infty}., in which bulk of the low index critical points lie in the absence of an external field. It is proved that this level contains exponentially many local minima. For an algorithm to pass this level the search time should be exponentially long. Yet the floor lies just above the ground state, therefore from an optimization point of view it does not matter whether the algorithm reaches to the global minimum or local minimum at the floor. For a similar analysis in a slightly more general context using random matrices, see Fyodorov (2013), and Fyodorov & Le Doussal (2013). For a review of random matrices and extreme value statistics, see Dean & Majumdar (2008). The added external field as a tunable parameter allows changing the topology of the landscape. Inspired by this work, Mehta et al. (a) gives a detailed experimental analysis of the case in which there are polynomially many critical points, and Mehta et al. (b) counts critical points for third order interactions in which there are exponentially many critical points.
In this work, we focus on the case with exponentially many critical points. On such high dimensional non-convex surfaces, a moving object is likely to get stuck at a local minimum or on a nearly degenerate plateau. A reasonable algorithm should have enough noise to make the particle jump out of those critical points that have high energy. However, if the landscape has exponentially many critical points that lie around the same energy level, then jumping out of one well will lead to another point of similar energy, clearly not resulting in any improvement. The existence of such a floor will increase the search time for points with lower energy values, regardless of the method of choice. Alternatively, one can attempt to change the system so that the floor is at a different, more favorable value, namely closer to the global minimum. One will of course be curious about the performance of such a point in the learning context, or its stability in the spin glass context.
So we ask two questions:
- Where does the particle end up, and how good is that point for the task at hand?
- What do various descent algorithms do differently?
We perform experiments on spin glasses and on deep networks to address these questions. The differences in the graph connectivities of spins and weights indicate that the two systems are not mathematically analogous. Rather, we propose that they are special cases of a more general phenomenon that is not yet understood. The authors hope to extract further knowledge from such complex systems that will shed light on a thorough theoretical understanding in the future.
The simplest model of a magnetic material is one in which atoms have spins with two states: +1 for spin up, or -1 for spin down. Considering pairwise interactions (in other words, 2-body interactions) between neighbors gives the total energy of the system, $-\sum_{ij}w_iw_j$, where $w_i$ represents the spin sign of the particle located at position $i$. The mean field assumption ignores the lattice geometry of the particles and assumes all interactions have equal strength. Introducing a weak external field leads all particles to favor alignment, which incidentally corresponds to the minimum energy configuration, achieved at the point where all of the particles have spin up (or down, for that matter). This is a rather simple landscape whose global minimum is easy to reach. It assumes no frustration, that is, all particles favor alignment; this model is known as the Curie-Weiss model. If we have an alloy in which some particle pairs favor alignment and some favor anti-alignment, the picture changes drastically. This introduces a whole new set of constraints which may not be simultaneously satisfiable, leading to glassy states. One way to model this is to introduce random interaction coefficients between particles, $\sum_{ij}x_{ij}w_iw_j$; this model, which has a rather complex energy landscape, is known as the Sherrington-Kirkpatrick model. See Agliari et al. (2014) and Sherrington (2014) for a survey on spin glasses and their connection to complex systems.
We can also disregard the discrete states, put the points on the sphere, and consider 3-body interactions of $N$ particles. Taking a point $w\in S^{N-1}(\sqrt{N})\subset\mathbb{R}^N$ and couplings $x_{(\cdot)}\sim\text{Gaussian}(0,1)$, the Hamiltonian with standard normal interactions is given by $\sum_{ijk}x_{ijk}w_iw_jw_k$. We have arrived at the model of interest for the purposes of this paper. While this model has been studied previously in great detail, our particular focus is on the location of critical points. The main results we will use in this section, and further details on the model, can be found in Auffinger et al. (2013) and references therein.
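This Hamiltonian is easy to evaluate numerically; the following is a minimal numpy sketch (our own illustration, not the paper's code, which used Torch7), including the $1/N$ normalization introduced later:

```python
import numpy as np

def random_sphere_point(n, rng):
    """Uniform random point on the sphere S^{n-1}(sqrt(n)) of radius sqrt(n)."""
    w = rng.standard_normal(n)
    return np.sqrt(n) * w / np.linalg.norm(w)

def hamiltonian(x, w):
    """H_N(w) = (1/N) sum_{ijk} x_ijk w_i w_j w_k over all N^3 ordered triples."""
    n = w.shape[0]
    return np.einsum('ijk,i,j,k->', x, w, w, w) / n

rng = np.random.default_rng(0)
N = 20
x = rng.standard_normal((N, N, N))  # fixed i.i.d. Gaussian(0,1) couplings
w = random_sphere_point(N, rng)
energy = hamiltonian(x, w)          # over the randomness of x: mean 0, variance N
```

The coupling tensor has $N^3$ entries, so this direct evaluation only scales to modest $N$; it also lets one check numerically that $\mathrm{Var}(H)=N$ at any fixed $w$ on the sphere.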
Auffinger et al. (2013) consider a more general version that involves any combination of interactions, not only 2-body or 3-body. The main reason we consider triplets is that it is the smallest system with exponentially many critical points. Before moving any further we would like to remark on why there are only polynomially many critical points in the binary interaction case. For $\sum_i w_i^2 = N$ and $x_{ij}\sim\text{Gaussian}(0,1)$, the Hamiltonian of the system is given by
which is a quadratic form where $M_{ij}=\frac{x_{ij}+x_{ji}}{2}$ is a random $N\times N$ Gaussian Orthogonal Ensemble matrix. This system has $2N$ critical points, located at the eigenvectors, and their corresponding energy values are the eigenvalues scaled by $N$.
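This correspondence between critical points and eigenvectors is easy to verify numerically; a small sketch (our own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
x = rng.standard_normal((N, N))
M = (x + x.T) / 2                # symmetric GOE-type matrix

# Critical points of H(w) = (Mw, w) on the sphere of radius sqrt(N) are the
# eigenvectors of M; the energy at an eigenvector is N times its eigenvalue.
vals, vecs = np.linalg.eigh(M)
w_min = np.sqrt(N) * vecs[:, 0]  # lowest-energy critical point
energy = w_min @ M @ w_min       # equals N * vals[0]
```

Since the $2N$ energies are just rescaled eigenvalues, they fluctuate on the scale of random-matrix statistics rather than piling up exponentially, which is why the 2-body case is simple.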
For $w\in S^{N-1}(\sqrt{N})\subset\mathbb{R}^N$ and $x_{(\cdot)}\sim\text{Gaussian}(0,1)$, consider the function
Once the random coefficients are chosen, the landscape that will be explored is fixed. Note that, as a continuous function on a compact set, $H_N$ attains its minimum and maximum. Also, since the landscape is symmetric through the origin, its local and global maximum points have the same energy values as the minima, with the opposite sign. We only focus on its minima.
Let $\mathcal{N}_{N,k}(u)$ denote the total number of index-$k$ critical points of $H_N$ that lie below the level $Nu$. In other words, $\mathcal{N}_{N,k}(u)$ is the random variable that counts critical points in the set $\{w : H_N(w)\leq Nu\}$. Auffinger et al. (2013) find the asymptotic expected value of this quantity in logarithmic scale:
In other words, $\Theta(u)$ is a measure of the cumulative number of critical points from the ground state to the level $Nu$. An exact analytic description of the function $\Theta(u)$ can be found in Auffinger et al. (2013). Figure 1 shows that analytic expression for $k=0$ (local minima) and $k=1$ (saddles of index 1). It shows, as expected, that $\Theta$ in equation (2) is non-decreasing, as it counts the number of critical points cumulatively. $\Theta$ becomes positive (note that in the binary interaction case described above, the exponent in (2) never crosses above zero, since otherwise it would imply exponentially many critical points; the binary, or two-body, case is thus an example of a simple system), and keeps increasing until it becomes constant at a value denoted $-E_\infty$, which is the lowest $u$ for which $\Theta$ is at its maximum value. Hence the number of critical points of low index does not increase at high energy values.
Clearly the function in (1) is a Gaussian random variable, as it is formed by a sum of independent Gaussians. The $1/N$ normalization factor is chosen so that $\mathrm{Var}(H)=N$. This implies that we have a random field with mean zero and variance $N$ at each point. Another implication of the size of the variance of the Hamiltonian is that its extensive quantities scale with $N$, for which Auffinger et al. (2013) give
where $-E_0$ is the zero of the $\Theta$ function. In words, the global minimum of the Hamiltonian on the sphere is asymptotically bounded from below by $-NE_0$. This lower bound gives a measure of how far a given point is from the ground state. For our experiments the energies at the ground state and at the floor are calculated to be $-1.657N$ and $-1.633N$, respectively. For practical purposes the floor level is close to the ground state. At the level $-1.633N$ (the vertical line in Figure 1) the landscape is expected to have exponentially many points that are local minima or saddles of low index, and the exponent is at its maximum. Therefore the probability of finding a local minimum that lies below the floor level is exponentially small. This leads us to conjecture that the floor is the first place to get stuck when using a descent method, which is confirmed in our simulations. Diving deeper would require a long search time. Therefore trying to find points with lower energy levels is not necessary, and may not even be realizable.
Consider a centered probability measure $\mu$ on $\mathbb{R}^n$ that represents the data, and a function $\mathcal{G}$ from $\mathbb{R}^n$ to $\mathbb{R}$ that correctly labels the data. The true loss then measures how well a given function, say $G(w)$, approximates $\mathcal{G}$:
For an integer $P$, consider an i.i.d. training sample $x^p$, $p=1,\dots,P$, drawn from the measure $\mu$, and define the empirical training loss:
where $L$ is the loss per sample, which can be the mean-squared loss, the hinge loss, or the cross-entropy. Also, by the law of large numbers:
The limit holds pointwise, so the sampled loss approximates the true loss well as the number of samples increases. Moreover, by the central limit theorem, the following holds pointwise:
So the fluctuations of this approximation are pointwise Gaussian. In real-life problems the true loss is not accessible; instead we have $\mathcal{L}_{\text{Train}}(w)$, which approximates $d(w)$. From the above convergence properties we expect both the test and training losses to converge with good accuracy to the true loss. Therefore we expect $\mathcal{L}_{\text{Train}}(w)\sim\mathcal{L}_{\text{Test}}(w)$ as well, if there is enough data for both functions. A natural question is how close they are to each other. If we sample points from the floor of the training surface, do they necessarily correspond to the floor of the test surface? How do the smaller-scale fluctuations affect learning? We address these questions in the rest of the paper.
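As a toy illustration of this convergence (our own example, not from the paper: a scalar $w$, data $x\sim\text{Gaussian}(0,1)$, and squared loss $L(x,w)=(w-x)^2$, for which the true loss is $d(w)=w^2+1$):

```python
import numpy as np

rng = np.random.default_rng(4)
w = 0.5
true_loss = w ** 2 + 1                     # d(w) = E[(w - x)^2] for x ~ N(0, 1)

for P in (10, 1000, 100000):
    x = rng.standard_normal(P)             # i.i.d. training sample of size P
    train_loss = np.mean((w - x) ** 2)     # empirical L_Train(w)
    gap = abs(train_loss - true_loss)      # O(1/sqrt(P)) in probability, per the CLT
    print(P, gap)
```

The pointwise Gaussian fluctuation of size $1/\sqrt{P}$ is exactly what makes the training and test surfaces agree only up to small-scale noise.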
Our simulations in the first part below show the existence of a critical dimension below which the landscape is hectic and shows no sign of a well-defined value to which algorithms converge. This implies that in low dimensions the surface is not trivial: there are traps at high energy levels, and those traps are located at somewhat arbitrary values, as seen in Figure 2. The high dimensional picture, on the other hand, is drastically different. The landscape is trivial in the following sense: starting from a random point, following the gradient descent algorithm almost always leads to a very narrow band of values. Moreover, in the second part we show that this descent is irrespective of which algorithm is used.
Simulations of the spin glass model show a clear qualitative difference between low dimensional and high dimensional surfaces. Namely, low dimensional surfaces do not exhibit a well-defined floor, whereas high dimensional surfaces do. Another important feature is that the floor level is very close to the global minimum; therefore the floor level is enough for practical purposes of optimization, even setting aside the problem of reaching the global minimum.
The gradient descent algorithm is used on the spin glass landscape. The procedure starts with fixing random couplings. Given the dimension $N$, sample $N^3$ i.i.d. standard normal random variables:
Pick a random element $w$ of the sphere $S^{N-1}(\sqrt{N})$ as a starting point for each trial.
Iterate $w^{t+1}=w^t-\gamma_t\nabla_w H(w^t)$ and normalize, $w^{t+1}\leftarrow\sqrt{N}\frac{w^{t+1}}{\|w^{t+1}\|}$ (note here the constraint that the gradient is tangential to the sphere).
Stop when the gradient size is below the order of $10^{-5}$.
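The steps above can be sketched as follows (a minimal numpy implementation of our own; the paper's simulations used Torch7, and the step size here is illustrative):

```python
import numpy as np

def grad_H(x, w):
    """Euclidean gradient of H_N(w) = (1/N) sum_{ijk} x_ijk w_i w_j w_k,
    accounting for w appearing in all three index slots."""
    n = w.shape[0]
    g = (np.einsum('ijk,j,k->i', x, w, w)
         + np.einsum('jik,j,k->i', x, w, w)
         + np.einsum('jki,j,k->i', x, w, w))
    return g / n

def gradient_descent(x, w, lr=0.01, tol=1e-5, max_steps=50000):
    n = w.shape[0]
    for _ in range(max_steps):
        g = grad_H(x, w)
        g_tan = g - (g @ w) * w / n   # keep only the component tangent to the sphere
        if np.linalg.norm(g_tan) < tol:
            break
        w = w - lr * g_tan
        w = np.sqrt(n) * w / np.linalg.norm(w)  # re-normalize onto S^{N-1}(sqrt(N))
    return w

rng = np.random.default_rng(2)
N = 10
x = rng.standard_normal((N, N, N))      # fixed random couplings
w0 = rng.standard_normal(N)
w0 = np.sqrt(N) * w0 / np.linalg.norm(w0)
w_final = gradient_descent(x, w0)
```

Repeating over many trials and normalizing the final energies by $N$ should produce histograms of the type shown in Figure 2.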
For each $N$ the experiment is repeated 5000 times, and the stopping criterion is the norm of the gradient vector. The theoretical results of the previous section hold asymptotically; at this point, to the best of the authors' knowledge, there is no proof of how fast the energies of critical points concentrate around the limiting floor value. We hope this simulation will lead to a prediction in this direction. Nevertheless, it is observed that the points found with this procedure lie within a narrow band of values just above the asymptotic floor level given in the previous section.
The graph structure of the fully connected spin glass above is highly coupled. In this section we modify the function so that the terms of the polynomial have a layered structure. This is achieved by simulating an uncoupled version of the above spin glass model in a way that mimics the layered structure of a multilayer Restricted Boltzmann Machine. From an optimization point of view, a similar consideration can be found in Frieze & Kannan (2008).
In this new setting, take three different points on the sphere, or equivalently, take one point on the product of three spheres, to form the tri-partite Hamiltonian. Instead of equation 1, the uncoupled function at hand becomes $\tilde{H}_N(w)=\frac{1}{N}\sum x_{ijk}w^1_iw^2_jw^3_k$. Then perform gradient descent over all parameters and normalize the resulting $3N$-dimensional vector on the product of spheres. Since there are fewer constraints, the values obtained by this procedure should be lower, which is indeed the case. Figure 4 shows a similar floor structure of the landscape. However, in this case we do not have any theoretical result that tells us the value of the ground state; therefore, even though we find evidence of a floor in high dimensions, it is hard to tell how far that floor value lies from the actual ground state. Since the order of growth of the function is still $N$, the ratio of the ground state to the value of the floor should be the same as in the previous section; however, we do not, at this point, have a proof of this.
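A sketch of the tri-partite variant, under our own choice of names and step size (descend over all three blocks, then re-normalize each block onto its sphere):

```python
import numpy as np

def tripartite_H(x, w1, w2, w3):
    """H~_N(w) = (1/N) sum_{ijk} x_ijk w1_i w2_j w3_k."""
    return np.einsum('ijk,i,j,k->', x, w1, w2, w3) / w1.shape[0]

def tripartite_step(x, ws, lr=0.01):
    """One gradient step over all three blocks; each block sees w in only
    one index slot, so its gradient is a single contraction."""
    n = ws[0].shape[0]
    g1 = np.einsum('ijk,j,k->i', x, ws[1], ws[2]) / n
    g2 = np.einsum('ijk,i,k->j', x, ws[0], ws[2]) / n
    g3 = np.einsum('ijk,i,j->k', x, ws[0], ws[1]) / n
    stepped = [w - lr * g for w, g in zip(ws, (g1, g2, g3))]
    # re-normalize each block back onto S^{N-1}(sqrt(N))
    return [np.sqrt(n) * w / np.linalg.norm(w) for w in stepped]
```

Because each weight appears in exactly one slot of the polynomial, the three gradients decouple, mirroring the layered connectivity sketched in Figure 3.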
In this section we design a system for which we know where the global minimum lies. The idea for this experiment is inspired by the teacher-student network comparisons of Saad et al. (1996) and West et al. (1997). It goes as follows: split the MNIST training set into two parts. Using only the first half of the training data, train a network of two hidden layers, 784-500-300-10, with ReLUs at the first two layers and a soft-max at the output layer. Use the cross-entropy loss and train the system with SGD. This gives the teacher network; it makes 211 mistakes on the test set of size 10,000. Using the teacher network, create new labels on the second half of the training set by replacing the actual tags with the probabilities assigned by the outputs of the teacher network.
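The label-creation step can be sketched as follows (a numpy sketch with hypothetical names; the actual experiments were run with Theano networks):

```python
import numpy as np

def soft_labels(teacher_logits):
    """Replace hard digit tags with the teacher's soft-max output probabilities.

    `teacher_logits` (hypothetical name) is a (num_samples, 10) array of the
    teacher's final-layer activations on the second half of the training set."""
    z = teacher_logits - teacher_logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

# Students are then trained with cross-entropy against these soft targets
# instead of the original one-hot MNIST labels.
```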
If the student network is at least as big as the teacher network, it is guaranteed that zero-cost points exist. Moreover, there are exponentially many of them: an exact copy of the teacher can be found, and appealing to the symmetries of the network one can permute the weight connections without changing the cost. All student networks are trained with SGD, and none reaches zero cost. The algorithm either gets stuck at a critical point or slows down extremely at a very flat area of the surface whose value is away from zero. We conjecture that this level is the floor of the student training surface. Also notice the different behavior of low dimensional networks, which have bad critical points and no well-defined floor, again in contrast with the high dimensional networks.
The data that the students are trained on are not exact; they are partial in the sense that many labels are vague. For some characters, the teacher does not know the answer for sure. Figure 6 captures this vague response of the teacher, which is learned by the student. But this ambiguity does not stop the teacher from teaching; instead, the teacher passes on information together with the ambiguity, which is some information in itself (this reminds us of the Dark Knowledge of Hinton et al.).
Recall that the teacher did not see the second half of the data during its training, but used it to teach its students. We look at the students to see how well they learned what the teacher taught. It turns out, perhaps not surprisingly, that the ambiguity propagates: many mistakes that the teacher makes on the second half of the data are also made by its students. There are cases, however, in which a student judges more correctly. See Figure 6 for an example digit 0 that the teacher mistakenly thought was a 6. Luckily, as an honest teacher, it did not train its students as if it were a firm 6. With the help of this extra information, the student was able to classify the digit correctly by a tiny margin.
This section compares the two algorithms in terms of where they end up on their training surfaces.
A spin glass field is created by a sum of smaller such fields:
Here $x^p_{ijk}\sim\text{Gaussian}(0,\frac{1}{P})$, so that the resulting sum field is distributed as the one in equation 1. Then, at the $p$-th step, the point on the sphere is updated in the negative direction of the gradient of the $p$-th summand. This procedure is stochastic gradient descent with a minibatch of size 1. It is important to note that all summands are independent of each other, unlike in the MNIST case. SGD still goes to the floor. The tiny difference in the values reached by SGD comes from the fact that SGD slows down very quickly and stops at the walls of a well. Once SGD slows down, one could restart GD from that point and reach the bottom of that well.
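A minimal sketch of this minibatch-of-one procedure (our own code; step size and pass count are illustrative):

```python
import numpy as np

def sgd_over_summands(xs, w, lr=0.01, passes=20):
    """At the p-th step, descend only the gradient of the p-th summand field,
    then re-normalize onto the sphere: SGD with minibatch size 1."""
    n = w.shape[0]
    for _ in range(passes):
        for x in xs:                         # each x has Gaussian(0, 1/P) entries
            g = (np.einsum('ijk,j,k->i', x, w, w)
                 + np.einsum('jik,j,k->i', x, w, w)
                 + np.einsum('jki,j,k->i', x, w, w)) / n
            w = w - lr * g
            w = np.sqrt(n) * w / np.linalg.norm(w)
    return w
```

Since the full field is the sum of the summands, cycling through the independent pieces with a small step approximates full gradient descent, which is consistent with GD and SGD reaching the same floor in Figure 7.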
This experiment compares GD with SGD over the full MNIST dataset in a two-layer network. One property attributed to SGD is that, due to its noisy nature, it is capable of escaping local minima at higher cost values. This would imply that GD gets stuck before SGD slows down. However, this does not seem to be the case at all. With the same step size, SGD and GD perform very similarly on the training surface. Table 2 shows the mean cost values and the difference in test errors.
High dimensional systems are typically considered to come with curses, but they also exhibit many symmetries, which can be a blessing, as observed in the first part of the simulations. One goal of this paper is to trigger theoretical research on non-convex surfaces over high dimensional domains. It is crucial to repeat here that we do not suggest a direct equivalence between spin glasses and deep networks; rather, we hint at a more general phenomenon that governs the two different cases. Finding similar properties in a variety of other problems might help us identify and quantify the properties of such complex systems. We hope further investigation of other optimization problems will lead to supporting conclusions in line with this research.
We thank David Belius for valuable discussions, Taylan Cemgil and Atilla Yılmaz for valuable feedback, and reviewers for valuable suggestions. We thank the developers of Torch7 (Collobert et al. (2011)) which we used for the spin glass simulations and Theano (Bastien et al. (2012) and Bergstra et al. (2010)) which we used for MNIST experiments. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for part of this research.
Table: S3.T1: Student networks trained with SGD

|  | Training cost | Test cost | Test error | St.Dev. (test error) |
|---|---|---|---|---|
| 784-50-50-10 student | 1.83e-02 | 1.69e-01 | 326 | 11.3 |
| 784-250-150-10 student | 1.72e-02 | 1.31e-01 | 276 | 7.6 |
| 784-500-300-10 student | 1.70e-02 | 1.25e-01 | 265 | 6.6 |
| 784-1200-1200-10 student | 1.68e-02 | 1.18e-01 | 257 | 5.2 |
Table: S3.T2: Averages of results for Gradient Descent and Stochastic Gradient Descent for MNIST

|  | Training cost | Test cost | Test error | St.Dev. (test error) |
|---|---|---|---|---|
| 50-50 GD | 4.52e-04 | 2.02e-01 | 299 | 15.5 |
| 50-50 SGD | 3.86e-04 | 1.57e-01 | 256 | 11.6 |
| 500-300 GD | 1.55e-04 | 1.02e-01 | 194 | 6.9 |
| 500-300 SGD | 1.38e-04 | 8.47e-02 | 174 | 5.3 |
Plot of $\Theta_0(u)$ (upper curve, local minima) and $\Theta_1(u)$ (lower curve, saddles of index 1). Their values agree at the point $-E_\infty$, which is the asymptotic value of the floor, and remain constant beyond it. Also note that the energy values on the horizontal axis are scaled by $N$.
Vertical black lines indicate the theoretical value of the floor in the large-$N$ limit. When $N$ is small, the system gets trapped at bad energy values (upper histogram). When $N$ is finite but large, the floor is a narrow band (lower histogram).
Graph connectivity: p-spin spin glass vs tri-partite spin glass.
Tri-partite spins low dimension vs high dimension comparison.
Training cost values for different student network sizes. High dimensional surfaces exhibit a similar approach to the asymptotic floor value. For student networks that are at least the size of the teacher there are exponentially many zero-cost points on the training surface, yet SGD cannot locate them!
One example of a medium-sized student that correctly classifies a mistake of the teacher. The digit is seen on the right side. Histograms are the probability outputs of the teacher (left) and the student (middle).
Both methods reach the floor. Different $P$'s eventually give the same energy when measured at the same step-size-times-number-of-steps. $P=1$ corresponds to gradient descent.
Mean costs of the GD and SGD experiments are given by the thin lines, and the shaded areas around each curve show the standard deviation around the mean. The vertical axis is in log scale; the decay of the function is not noticeable in linear scale. Progress is very slow even in log scale, and the two curves follow each other tightly.
$$ H(w)=\sum_{i\sim j}w_{i}w_{j}x_{ij}=(Mw,w) $$ \tag{S2.Ex1}
$$ \mathbb{E}[\mathcal{N}_{N,k}(u)]\asymp e^{N\Theta_{k}(u)} $$ \tag{S2.E2}
$$ \liminf_{N\rightarrow\infty}\big{(}\min_{w}\frac{H_{N}(w)}{N}\big{)}\geq-E_{0} $$ \tag{S2.E3}
$$ \mathcal{L}_{\text{Train}}(w)=\frac{1}{P}\sum_{p=1}^{P}L(x^{p},w) $$ \tag{S2.Ex3}
$$ \mathcal{L}_{\text{Train}}(w)\rightarrow E_{\mu}[L(x,w)]=d(w)\text{ $\mu$-a.s. as }P\rightarrow\infty\text{.} $$ \tag{S2.Ex4}
$$ \begin{split}H_{N}(w)=\frac{1}{N}\sum_{i,j,k}^{N}x_{ijk}w_{i}w_{j}w_{k}&=\sum_{p=1}^{P}\Big(\frac{1}{N}\sum_{i,j,k}^{N}x^{p}_{ijk}w_{i}w_{j}w_{k}\Big)\\&=\frac{1}{N}\sum_{i,j,k}^{N}\Big(\sum_{p=1}^{P}x^{p}_{ijk}\Big)w_{i}w_{j}w_{k}\end{split} $$ \tag{S3.E4}
$$ d(w)=E_{\mu}[L(x,w)] $$
References
[Agliari2014] Agliari, Elena, Barra, Adriano, Galluzzi, Andrea, Tantari, Daniele, and Tavani, Flavia. A walk in the statistical mechanical formulation of neural networks. 1948:12, July 2014.
[Auffinger2013] Auffinger, Antonio and Ben Arous, Gérard. Complexity of random smooth functions on the high-dimensional sphere. The Annals of Probability, 41(6):4214–4247, November 2013. ISSN 0091-1798. doi: 10.1214/13-AOP862.
[Auffinger2014] Auffinger, Antonio and Chen, Wei-kuo. Free Energy and Complexity of Spherical Bipartite Models. Journal of Statistical Physics, 157(1):40–59, July 2014. ISSN 0022-4715. doi: 10.1007/s10955-014-1073-0.
[Auffinger2010] Auffinger, Antonio, Ben Arous, Gérard, and Černý, Jiří. Random Matrices and Complexity of Spin Glasses. Communications on Pure and Applied Mathematics, 66(2):165–201, February 2013. ISSN 00103640. doi: 10.1002/cpa.21422.
[Barra2012] Barra, Adriano, Genovese, Giuseppe, Guerra, Francesco, and Tantari, Daniele. How glassy are neural networks? Journal of Statistical Mechanics: Theory and Experiment, 2012(07):P07009, 2012.
[Bastien2012] Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian J., Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[Bergstra2010] Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.
[Choromanska2014] Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Ben Arous, Gérard, and LeCun, Yann. The Loss Surface of Multilayer Networks. November 2014.
[Collobert2011] Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[Dauphin2014] Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Ghahramani, Z, Welling, M, Cortes, C, Lawrence, N D, and Weinberger, K Q (eds.), Advances in Neural Information Processing Systems 27, pp. 2933–2941. Curran Associates, Inc., June 2014.
[Dean2008] Dean, David S and Majumdar, Satya N. Extreme value statistics of eigenvalues of Gaussian random matrices. Physical Review E, 77(4):041108, April 2008. ISSN 1539-3755. doi: 10.1103/PhysRevE.77.041108.
[Frieze2008] Frieze, Alan and Kannan, Ravi. A new approach to the planted clique problem. In Hariharan, Ramesh, Mukund, Madhavan, and Vinay, V (eds.), IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, volume 2 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 187–198, Dagstuhl, Germany, 2008. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. ISBN 978-3-939897-08-8. doi: 10.4230/LIPIcs.FSTTCS.2008.1752.
[Fyodorov2013] Fyodorov, Yan V. High-Dimensional Random Fields and Random Matrix Theory. pp. 40, July 2013.
[Fyodorov2013a] Fyodorov, Yan V and Le Doussal, Pierre. Topology Trivialization and Large Deviations for the Minimum in the Simplest Random Optimization. Journal of Statistical Physics, 154(1-2):466–490, September 2013. ISSN 0022-4715. doi: 10.1007/s10955-013-0838-1.
[Hinton] Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Dark Knowledge. http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/geoff_hinton_dark14.pdf
[Mehta] Mehta, Dhagash, Hauenstein, Jonathan D, Niemerg, Matthew, Simm, Nicholas J, and Stariolo, Daniel A. Energy Landscape of the Finite-Size Mean-field 2-Spin Spherical Model and Topology Trivialization. pp. 1–9, a.
[Mehta2] Mehta, Dhagash, Stariolo, Daniel A, and Kastner, Michael. Energy Landscape of the Finite-Size Mean-field 3-Spin Spherical Model. pp. 1–10, b.
[Saad1996] Saad, David, Birmingham, B, and Solla, Sara A. Dynamics of On-Line Gradient Descent Learning for Multilayer Neural Networks. In Touretzky, D S, Mozer, M C, and Hasselmo, M E (eds.), Advances in Neural Information Processing Systems 8, pp. 302–308. MIT Press, 1996.
[Sherrington2014] Sherrington, David. Physics and Complexity: An Introduction. In Delitala, Marcello and Ajmone Marsan, Giulia (eds.), Managing Complexity, Reducing Perplexity, volume 67 of Springer Proceedings in Mathematics & Statistics, pp. 119–129. Springer International Publishing, Cham, 2014. ISBN 978-3-319-03758-5. doi: 10.1007/978-3-319-03759-2.
[West1997] West, Ansgar H L, Saad, David, and Nabney, Ian T. The Learning Dynamics of a Universal Approximator. In Mozer, M C, Jordan, M I, and Petsche, T (eds.), Advances in Neural Information Processing Systems 9, pp. 288–294. MIT Press, 1997.
[bib1] Agliari et al. (2014) Agliari, Elena, Barra, Adriano, Galluzzi, Andrea, Tantari, Daniele, and Tavani, Flavia. A walk in the statistical mechanical formulation of neural networks. 1948:12, July 2014.
[bib2] Auffinger, Antonio and Ben Arous, Gérard. Complexity of random smooth functions on the high-dimensional sphere. The Annals of Probability, 41(6):4214–4247, November 2013. ISSN 0091-1798. doi: 10.1214/13-AOP862.
[bib3] Auffinger, Antonio and Chen, Wei-kuo. Free Energy and Complexity of Spherical Bipartite Models. Journal of Statistical Physics, 157(1):40–59, July 2014. ISSN 0022-4715. doi: 10.1007/s10955-014-1073-0.
[bib4] Auffinger et al. (2013) Auffinger, Antonio, Ben Arous, Gérard, and Černý, Jir̆í. Random Matrices and Complexity of Spin Glasses. Communications on Pure and Applied Mathematics, 66(2):165–201, February 2013. ISSN 00103640. doi: 10.1002/cpa.21422.
[bib5] Barra et al. (2012) Barra, Adriano, Genovese, Giuseppe, Guerra, Francesco, and Tantari, Daniele. How glassy are neural networks? Journal of Statistical Mechanics: Theory and Experiment, 2012(07):P07009, 2012.
[bib6] Bastien et al. (2012) Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian J., Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[bib7] Bergstra et al. (2010) Bergstra, James, Breuleux, Olivier, Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.
[bib8] Choromanska et al. (2014) Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Ben Arous, Gérard, and LeCun, Yann. The Loss Surface of Multilayer Networks. November 2014.
[bib9] Collobert et al. (2011) Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
[bib10] Dauphin et al. (2014) Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Ghahramani, Z, Welling, M, Cortes, C, Lawrence, N D, and Weinberger, K Q (eds.), Advances in Neural Information Processing Systems 27, pp. 2933–2941. Curran Associates, Inc., June 2014.
[bib11] Dean, David S and Majumdar, Satya N. Extreme value statistics of eigenvalues of Gaussian random matrices. Physical Review E, 77(4):041108, April 2008. ISSN 1539-3755. doi: 10.1103/PhysRevE.77.041108.
[bib12] Frieze & Kannan (2008) Frieze, Alan and Kannan, Ravi. A new approach to the planted clique problem. In Hariharan, Ramesh, Mukund, Madhavan, and Vinay, V (eds.), IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science, volume 2 of Leibniz International Proceedings in Informatics (LIPIcs), pp. 187–198, Dagstuhl, Germany, 2008. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. ISBN 978-3-939897-08-8. doi: 10.4230/LIPIcs.FSTTCS.2008.1752.
[bib13] Fyodorov (2013) Fyodorov, Yan V. High-Dimensional Random Fields and Random Matrix Theory. Preprint, July 2013.
[bib14] Fyodorov & Le Doussal (2013) Fyodorov, Yan V and Le Doussal, Pierre. Topology Trivialization and Large Deviations for the Minimum in the Simplest Random Optimization. Journal of Statistical Physics, 154(1-2):466–490, September 2013. ISSN 0022-4715. doi: 10.1007/s10955-013-0838-1.
[bib15] Hinton et al. (2014) Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Dark Knowledge. Lecture slides, 2014. http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/geoff_hinton_dark14.pdf
[bib16] Mehta et al. (a) Mehta, Dhagash, Hauenstein, Jonathan D, Niemerg, Matthew, Simm, Nicholas J, and Stariolo, Daniel A. Energy Landscape of the Finite-Size Mean-field 2-Spin Spherical Model and Topology Trivialization. Preprint, pp. 1–9.
[bib17] Mehta et al. (b) Mehta, Dhagash, Stariolo, Daniel A, and Kastner, Michael. Energy Landscape of the Finite-Size Mean-field 3-Spin Spherical Model. Preprint, pp. 1–10.
[bib18] Saad & Solla (1996) Saad, David and Solla, Sara A. Dynamics of On-Line Gradient Descent Learning for Multilayer Neural Networks. In Touretzky, D S, Mozer, M C, and Hasselmo, M E (eds.), Advances in Neural Information Processing Systems 8, pp. 302–308. MIT Press, 1996.
[bib19] Sherrington (2014) Sherrington, David. Physics and Complexity: An Introduction. In Delitala, Marcello and Ajmone Marsan, Giulia (eds.), Managing Complexity, Reducing Perplexity, volume 67 of Springer Proceedings in Mathematics & Statistics, pp. 119–129. Springer International Publishing, Cham, 2014. ISBN 978-3-319-03758-5. doi: 10.1007/978-3-319-03759-2.
[bib20] West et al. (1997) West, Ansgar H L, Saad, David, and Nabney, Ian T. The Learning Dynamics of a Universal Approximator. In Mozer, M C, Jordan, M I, and Petsche, T (eds.), Advances in Neural Information Processing Systems 9, pp. 288–294. MIT Press, 1997.