Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks
Behnam Neyshabur*, Zhiyuan Li†, Srinadh Bhojanapalli‡, Yann LeCun§, Nathan Srebro¶
Abstract
Despite existing work on ensuring generalization of neural networks in terms of scale sensitive complexity measures, such as norms, margin and sharpness, these complexity measures do not offer an explanation of why neural networks generalize better with over-parametrization. In this work we suggest a novel complexity measure based on unit-wise capacities resulting in a {\em tighter} generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes, and could potentially explain the improvement in generalization with over-parametrization. We further present a matching lower bound for the Rademacher complexity that improves over previous capacity lower bounds for neural networks.
Introduction
Deep neural networks have enjoyed great success in learning across a wide variety of tasks. They played a crucial role in the seminal work of Krizhevsky et al. [12], starting an arms race of training larger networks with more hidden units in pursuit of better test performance [10]. In fact, the networks used in practice are over-parametrized to the extent that they can easily fit random labels to the data [27]. Even though they have such high capacity, when trained with real labels they achieve smaller generalization error.
Traditional wisdom in learning suggests that using models of increasing capacity will result in overfitting to the training data. Hence the capacity of models is generally controlled either by limiting the size of the model (number of parameters) or by adding explicit regularization, to prevent overfitting. Surprisingly, in the case of neural networks we notice that increasing the model size only helps in improving the generalization error, even when the networks are trained without any explicit regularization such as weight decay or early stopping [13, 26, 21]. In particular, Neyshabur et al. [21] observed that training models with an increasing number of hidden units leads to a decrease in the test error for image classification on MNIST and CIFAR-10. Similar empirical observations have been made over a wide range of architectural and hyper-parameter choices [15, 24, 14]. What explains this improvement in generalization with over-parametrization? What is the right measure of complexity of neural networks that captures this generalization phenomenon?
* Institute for Advanced Study, School of Mathematics, Princeton, NJ 08540. Email: bneyshabur@ias.edu
‡ Toyota Technological Institute at Chicago, Chicago IL 60637. Email: srinadh@ttic.edu
§ Facebook AI Research & Courant Institute, New York University, NY 10012. Email: yann@cs.nyu.edu

To study and analyze this phenomenon more carefully, we need to simplify the architecture while making sure that the property of interest is preserved after the simplification. We therefore choose two layer ReLU networks, since, as shown in the left and middle panels of Figure 1, they exhibit the same behavior with over-parametrization as the more complex pre-activation ResNet18 architecture. In this paper we prove a tighter generalization bound (Theorem 2) for two layer ReLU networks. Our capacity bound, unlike existing bounds, correlates with the test error and decreases with the increasing number of hidden units. Our key insight is to characterize complexity at the unit level, and as we see in the right panel of Figure 1, these unit level measures shrink at a rate faster than 1/√h for each hidden unit, decreasing the overall measure as the network size increases. When measured in terms of layer norms, our generalization bound depends on the Frobenius norm of the top layer and the Frobenius norm of the difference of the hidden layer weights with the initialization, which decreases with increasing network size (see Figure 2).
The closeness of learned weights to the initialization in the over-parametrized setting can be understood by considering the limiting case as the number of hidden units goes to infinity, as considered in Bengio et al. [5] and Bach [2]. In this extreme setting, just training the top layer of the network, which is a convex optimization problem for convex losses, will result in minimizing the training error, as the randomly initialized hidden layer has all possible features. Intuitively, the large number of hidden units here represents all possible features, and hence the optimization problem involves just picking the right features that will minimize the training loss. This suggests that as we over-parametrize the networks, the optimization algorithms need to do less work in tuning the weights of the hidden units to find the right solution. Dziugaite and Roy [6] have indeed numerically evaluated a PAC-Bayes measure based on the initialization used by the algorithms, and report that the Euclidean distance to the initialization is smaller than the Frobenius norm of the parameters. Nagarajan and Kolter [18] make a similar empirical observation on the significant role of initialization, and in fact prove an initialization-dependent generalization bound for linear networks; however, they do not prove a similar generalization bound for neural networks. Alternatively, Liang et al. [15] suggested a Fisher-Rao metric based complexity measure that correlates with generalization behavior in larger networks, but they too prove the capacity bound only for linear networks.
Contributions: Our contributions in this paper are as follows.
· We empirically investigate the role of over-parametrization in generalization of neural networks on 3 different datasets (MNIST, CIFAR-10 and SVHN), and show that the existing complexity measures increase with the number of hidden units, and hence do not explain the generalization behavior with over-parametrization.
· We prove tighter generalization bounds (Theorems 2 and 5) for two layer ReLU networks. Our proposed complexity measure actually decreases with the increasing number of hidden units, and can potentially explain the effect of over-parametrization on generalization of neural networks.
· We provide a matching lower bound for the Rademacher complexity of two layer ReLU networks. Our lower bound considerably improves over the best known bound given in Bartlett et al. [4], and to our knowledge is the first such lower bound that is larger than the Lipschitz constant of the network class.

Figure 2: Properties of two layer ReLU networks trained on CIFAR-10. We report different measures on the trained network. From left to right: measures on the second (output) layer, measures on the first (hidden) layer, distribution of angles of the trained weights to the initial weights in the first layer, and the distribution of unit capacities of the first layer. "Distance" in the first two plots is the distance from initialization in Frobenius norm.
Preliminaries
We consider two layer fully connected ReLU networks with input dimension d, output dimension c, and h hidden units. The output of such a network is f_{V,U}(x) = V[Ux]_+,¹ where x ∈ R^d, U ∈ R^{h×d}, V ∈ R^{c×h}, and [z]_+ = max(z, 0) is the ReLU activation applied entrywise. We denote the incoming weights to hidden unit i by u_i and the outgoing weights from hidden unit i by v_i. Therefore u_i corresponds to row i of the matrix U, and v_i corresponds to column i of the matrix V.
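As a concrete reference, the forward pass of such a network can be sketched in a few lines of NumPy (the dimensions and random weights below are illustrative, not taken from the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, c = 32, 64, 10                            # input dim, hidden units, classes (illustrative)
U = rng.standard_normal((h, d)) / np.sqrt(d)    # hidden-layer weights, rows u_i
V = rng.standard_normal((c, h)) / np.sqrt(h)    # output-layer weights, columns v_i

def relu(z):
    return np.maximum(z, 0.0)

def f(x, V, U):
    """f_{V,U}(x) = V [Ux]_+ , the two layer ReLU network considered in the paper."""
    return V @ relu(U @ x)

x = rng.standard_normal(d)
scores = f(x, V, U)           # one score per class
print(scores.shape)           # (10,)
```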
We consider the c-class classification task, where the label with the maximum output score is selected as the prediction. Following Bartlett et al. [4], we define the margin operator μ : R^c × [c] → R as the function that, given the scores f(x) ∈ R^c for each label and the correct label y ∈ [c], returns the difference between the score of the correct label and the maximum score among the other labels, i.e. μ(f(x), y) = f(x)[y] − max_{i≠y} f(x)[i]. We now define the ramp loss as follows:
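Concretely, the margin operator and the ramp loss can be sketched as follows (a minimal NumPy illustration; the ramp loss is 1 below zero margin, 0 above γ, and linear in between):

```python
import numpy as np

def margin(scores, y):
    """mu(f(x), y): correct-class score minus the best other score."""
    other = np.delete(scores, y)
    return scores[y] - np.max(other)

def ramp_loss(scores, y, gamma):
    """Ramp loss l_gamma: 0 above margin gamma, 1 below 0, linear in between."""
    m = margin(scores, y)
    if m > gamma:
        return 0.0
    if m < 0:
        return 1.0
    return 1.0 - m / gamma

scores = np.array([2.0, 0.5, 1.0])
print(margin(scores, 0))          # 1.0 (score 2.0 minus best other 1.0)
print(ramp_loss(scores, 0, 2.0))  # 0.5 (margin 1.0 is halfway into [0, 2])
```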
Generalization of Two Layer ReLU Networks
where R_S(H) is the Rademacher complexity of a class H of functions with respect to the training set S, which is defined as:
1 Since the number of bias parameters is negligible compared to the size of the network, we drop the bias parameters to simplify the analysis. Moreover, one can model the bias parameters in the first layer by adding an extra dimension with value 1.
An Empirical Investigation
We will bound the Rademacher complexity of neural networks to get a bound on the generalization error. Since the Rademacher complexity depends on the function class considered, we need to choose the right function class that captures only the networks actually learned in training, which is potentially much smaller than the class of networks with all possible weights, to get a complexity measure that explains the decrease in generalization error with increasing width. Choosing a bigger function class can result in weaker bounds that do not capture this phenomenon. Towards that, we first investigate the behavior of different measures of network layers with an increasing number of hidden units. The experiments discussed below are done on the CIFAR-10 dataset. Please see Section A for similar observations on the SVHN and MNIST datasets.
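As a side note, the empirical Rademacher complexity can be estimated by Monte Carlo sampling of the sign vector ξ. The sketch below does this for a toy class of norm-bounded linear predictors, for which the supremum has a closed form (the class, dimensions and norm bound B are illustrative assumptions, not the paper's function class):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, B = 50, 5, 1.0                       # samples, dimension, norm bound (illustrative)
X = rng.standard_normal((m, d))            # rows are the training points x_i

def estimate_rademacher_linear(X, B, trials=2000):
    """Monte Carlo estimate of R_S for {x -> <w, x> : ||w||_2 <= B}.
    For this class, sup_w sum_i xi_i <w, x_i> = B * ||sum_i xi_i x_i||_2."""
    m = X.shape[0]
    total = 0.0
    for _ in range(trials):
        xi = rng.choice([-1.0, 1.0], size=m)   # Rademacher signs
        total += B * np.linalg.norm(xi @ X)
    return total / (trials * m)

est = estimate_rademacher_linear(X, B)
bound = B * np.linalg.norm(X) / m          # classical bound B * ||X||_F / m
print(0 < est <= bound)                    # True: the estimate sits below the bound
```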
First layer: As we see in the second panel of Figure 2, even though the spectral and Frobenius norms of the learned layer decrease initially, they eventually increase with h, with the Frobenius norm increasing at a faster rate. However, the Frobenius norm of the distance to the initialization (‖U − U^0‖_F) decreases. This suggests that the increase in the Frobenius norm of the weights in larger networks is due to the increase in the Frobenius norm of the random initialization. To understand this behavior in more detail, we also plot the distance to initialization per unit and the distribution of angles between learned weights and initial weights in the last two panels of Figure 2. We indeed observe that the per-unit distance to initialization decreases with increasing h, and that there is a significant shift in the distribution of angles to the initial points, from being almost orthogonal in small networks to almost aligned in large networks. This per-unit distance to initialization is a key quantity that appears in our capacity bounds, and we refer to it as unit capacity in the remainder of the paper.
Unit capacity. We define β_i = ‖u_i − u_i^0‖_2 as the unit capacity of hidden unit i.
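A minimal sketch of how unit capacities (and the angle distributions of Figure 2) would be measured; the "trained" weights below are a synthetic stand-in for an actual training run:

```python
import numpy as np

rng = np.random.default_rng(2)
d, h = 32, 256                                   # illustrative sizes
U0 = rng.standard_normal((h, d)) / np.sqrt(d)    # random initialization
U = U0 + 0.01 * rng.standard_normal((h, d))      # stand-in for trained weights

beta = np.linalg.norm(U - U0, axis=1)            # beta_i = ||u_i - u_i^0||_2, one per unit

# Per-unit angle between trained and initial weights (as plotted in Figure 2).
cos = np.sum(U * U0, axis=1) / (np.linalg.norm(U, axis=1) * np.linalg.norm(U0, axis=1))
angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

print(beta.shape)            # (256,)
print(np.all(angles < 90))   # True: a small perturbation keeps weights aligned
```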
Second layer: Similar to the first layer, we look at the behavior of different measures of the second layer of the trained networks with increasing h in the first panel of Figure 2. Here, unlike in the first layer, we notice that the Frobenius norm and the distance to initialization both decrease and are quite close to each other, suggesting a limited role of the initialization for this layer. Moreover, as the size grows, since the Frobenius norm ‖V‖_F of the second layer slightly decreases, we can argue that the norm of the outgoing weights v_i of a hidden unit i decreases at a rate faster than 1/√h. If we think of each hidden unit as a linear separator and the top layer as an ensemble of classifiers, this means the impact of each classifier on the final decision shrinks at a rate faster than 1/√h. This per-unit measure again plays an important role, and we define it as unit impact for the remainder of this paper.
Unit impact. We define α_i = ‖v_i‖_2 as the unit impact, which is the magnitude of the outgoing weights from unit i. Motivated by our empirical observations, we consider the following class of two layer neural networks that depends on the capacity and impact of the hidden units of a network. Let W be the following restricted set of parameters:
We now consider the hypothesis class of neural networks represented using parameters in the set W :
Our empirical observations indicate that networks we learn from real data have bounded unit capacity and unit impact and therefore studying the generalization behavior of the above function class can potentially provide us with a better understanding of these networks. Given the above function class, we will now study its generalization properties.
In this section we prove a generalization bound for two layer ReLU networks. We first bound the Rademacher complexity of the class F_W in terms of a sum over hidden units of the product of unit capacity and unit impact. Combining this with equation (2) gives a generalization bound.
Table 1: Comparison with the existing generalization measures presented for the case of two layer ReLU networks with constant number of outputs and constant margin.

The proof is given in supplementary Section B. The main idea behind the proof is a new technique to decompose the complexity of the network into the complexities of the hidden units. To our knowledge, all previous works decompose the complexity into that of layers and use the Lipschitz property of the network to bound the generalization error. However, Lipschitzness of a layer is a rather weak property that ignores the linear structure of each individual layer. Instead, by decomposing the complexity across the hidden units, we get the above tighter bound on the Rademacher complexity of the networks.
The generalization bound in Theorem 1 is for any function in the function class defined by a specific choice of α and β fixed before the training procedure. To get a generalization bound that holds for all networks, we need to cover the space of possible values of α and β and take a union bound over it. The following theorem states the generalization bound for any two layer ReLU network.²
Theorem 2. For any h ≥ 2, γ > 0, δ ∈ (0, 1) and U^0 ∈ R^{h×d}, with probability 1 − δ over the choice of the training set S = {x_i}_{i=1}^m ⊂ R^d, for any function f(x) = V[Ux]_+ such that V ∈ R^{c×h} and U ∈ R^{h×d}, the generalization error is bounded as follows:
$$ L_{0}(f) \leq \hat{L}_\gamma(f) + \tilde{O}\left(\frac{\sqrt{c}\,\|V\|_F\left(\|U-U^0\|_F\|X\|_F+\|U^0 X\|_F\right)}{\gamma m} +\sqrt{\frac{h}{m}}\right). $$
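To illustrate how such a bound would be evaluated on a trained network, the sketch below computes the dominant term of the Rademacher complexity bound R_S(ℓ_γ ∘ F_W) ≤ (2√(2c)+2)/(γm) · ‖α‖_2 (‖β‖_2‖X‖_F + ‖U^0 X‖_F) from the appendix, on synthetic stand-ins for the trained weights and data:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, h, c, gamma = 1000, 32, 128, 10, 1.0    # illustrative sizes and margin
X = rng.standard_normal((d, m))               # data matrix, columns are x_i
U0 = rng.standard_normal((h, d)) / np.sqrt(d) # random initialization
U = U0 + 0.05 * rng.standard_normal((h, d))   # stand-in for trained hidden weights
V = rng.standard_normal((c, h)) / h           # stand-in for trained top layer

alpha = np.linalg.norm(V, axis=0)             # unit impacts   ||v_i||_2
beta = np.linalg.norm(U - U0, axis=1)         # unit capacities ||u_i - u_i^0||_2

# Dominant term of the Rademacher complexity bound:
#   (2*sqrt(2c) + 2) / (gamma*m) * ||alpha||_2 * (||beta||_2 * ||X||_F + ||U0 X||_F)
cap = (2 * np.sqrt(2 * c) + 2) / (gamma * m) * np.linalg.norm(alpha) * (
    np.linalg.norm(beta) * np.linalg.norm(X) + np.linalg.norm(U0 @ X))
print(np.isfinite(cap) and cap > 0)           # True
```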
Comparison with Existing Results
2 For the statement with exact constants see Lemma 13 in Supplementary Section B.


Figure 5: Left panel: Comparing capacity bounds on CIFAR10 (unnormalized). Middle panel: Comparing capacity bounds on CIFAR10 (normalized). Right panel: Comparing capacity bounds on SVHN (normalized).
Experimental comparison. We train two layer ReLU networks of size h on the CIFAR-10 and SVHN datasets, with values of h ranging from 2^6 to 2^15. The training and test errors for CIFAR-10 are shown in the first panel of Figure 1, and for SVHN in the left panel of Figure 4. We observe for both datasets that even though a network of size 128 is enough to get to zero training error, networks with sizes well beyond 128 can still get better generalization, even when trained without any regularization. We further measure the unit-wise properties introduced in the paper, namely unit capacity and unit impact. These quantities decrease with increasing h, and are reported in the right panel of Figure 1 and the second panel of Figure 4. Also notice that the number of epochs required for each network size to reach 0.01 cross-entropy loss decreases for larger networks, as shown in the third panel of Figure 4.
For the same experimental setup, Figure 5 compares the behavior of different capacity bounds over networks of increasing size. Generalization bounds typically scale as √(C/m), where C is the effective capacity of the function class. The left panel reports the effective capacity C based on different measures, calculated with all the terms and constants. We can see that our bound is the only one that decreases with h and is consistently lower than other norm-based data-independent bounds. Our bound even improves over the VC-dimension for networks with size larger than 1024. While the actual numerical values are very loose, we believe they are useful tools for understanding the relative generalization behavior with respect to different complexity measures, and in many cases, applying a set of data-dependent techniques can improve the numerical values of these bounds significantly [6, 1]. In the middle and right panels we present
each capacity bound normalized by its maximum in the range of the study for networks trained on CIFAR-10 and SVHN respectively. For both datasets, our capacity bound is the only one that decreases with the size even for networks with about 100 million parameters. All other existing norm-based bounds initially decrease for smaller networks but then increase significantly for larger networks. Our capacity bound therefore could potentially point to the right properties that allow the over-parametrized networks to generalize.
Finally, we check the behavior of our complexity measure under a different setting where we compare this measure between networks trained on real and random labels [22, 4]. We plot the distribution of margins normalized by our measure, computed on networks trained with true and random labels, in the last panel of Figure 4; as expected, they correlate well with the generalization behavior.
Lower Bound
In this section we prove a Rademacher complexity lower bound for neural networks matching the dominant term in the upper bound of Theorem 1. We show our lower bound on a smaller function class than F_W, with an additional constraint on the spectral norm of the hidden layer, as this allows for comparison with existing results; the bound extends to the bigger class F_W as well.
Theorem 3. Define the parameter set
and let F_{W'} be the function class defined on W' by equation (5). Then, for any d = h ≤ m, {α_j, β_j}_{j=1}^h ⊂ R_+ and U^0 = 0, there exists S = {x_i}_{i=1}^m ⊂ R^d such that
The proof is given in supplementary Section B.3. Clearly W' ⊆ W, since it has an extra constraint. The above lower bound matches the first term, Σ_{i=1}^h α_i β_i ‖X‖_F / (mγ), in the upper bound of Theorem 1, up to a factor of 1/γ, which comes from the 1/γ Lipschitz constant of the ramp loss ℓ_γ. Also, when c = 1 and β = 0,
where F_V = {f(x) = Vx | V ∈ R^{1×h}, ‖v_j‖ ≤ α_j}. In other words, when β = 0, the function class F_{W'} on S = {x_i}_{i=1}^m is equivalent to the linear function class F_V on [U^0 ∘ S]_+ = {[U^0 x_i]_+}_{i=1}^m, and therefore we have the above lower bound. This shows that the upper bound provided in Theorem 1 is tight. It also indicates that even if we have more information, such as a small spectral norm with respect to the reference matrix (which effectively bounds the Lipschitz constant of the network), we still cannot improve our upper bound.
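The β = 0 reduction above is easy to see in code: with U frozen at U^0, the network is exactly a linear model over the fixed features [U^0 x]_+ (sizes and weights below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d = h = 8
m = 20
U0 = rng.standard_normal((h, d))           # fixed hidden layer (beta = 0 keeps U = U0)
X = rng.standard_normal((d, m))            # columns are the points x_i
V = rng.standard_normal((1, h))            # any top layer with ||v_j|| <= alpha_j

phi = np.maximum(U0 @ X, 0.0)              # fixed ReLU features [U0 x_i]_+
net_out = V @ np.maximum(U0 @ X, 0.0)      # network f in F_{W'} with U frozen at U0
lin_out = V @ phi                          # the same f viewed as a linear model on phi

print(np.allclose(net_out, lin_out))       # True: with beta = 0 the classes coincide on S
```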
To our knowledge, all previous capacity lower bounds for spectral norm bounded classes of neural networks correspond to the Lipschitz constant of the network. Our lower bound strictly improves over this, and shows a gap between the Lipschitz constant of the network (which can be achieved even by linear models) and the capacity of neural networks. This lower bound is non-trivial in the sense that the smaller function class excludes the neural networks with all rank-1 matrices as weights, and thus shows a Θ(√h) gap between neural networks with and without ReLU; the lower bound therefore does not hold for linear networks. Finally, one can extend the construction in this bound to more layers by setting all weight matrices in the intermediate layers to the identity matrix.
Comparison with existing results. In particular, Bartlett et al. [4] proved a lower bound of Ω(s_1 s_2 ‖X‖_F / m) for the function class defined by the parameter set
Note that s_1 s_2 is a Lipschitz bound for the function class F_{W_spec}.
Given W_spec with bounds s_1 and s_2, choosing α and β such that ‖α‖_2 = s_1 and max_{i∈[h]} β_i = s_2 results in W' ⊂ W_spec. Hence we get the following result from Theorem 3.
Hence our result improves the lower bound in Bartlett et al. [4] by a factor of √h. Theorem 7 of Golowich et al. [7] also gives an Ω(s_1 s_2 √c) lower bound, where c is the number of outputs of the network, for the composition of a 1-Lipschitz loss function and neural networks with bounded spectral norm or ∞-Schatten norm. Our result improves on this lower bound as well.
In contrast to Theorem 2, the additive √(h/m) term is replaced by √(e^{−p}h/m). For p of order ln h, e^{−p}h is a constant, so this improves on the additive √(h/m) term of Theorem 2. However, the norms ‖V‖_F and ‖U − U^0‖_F in the first term are replaced by h^{1/2−1/p}‖V^T‖_{p,2} and h^{1/2−1/p}‖U − U^0‖_{p,2}. For p ≈ ln h, h^{1/2−1/p}‖V^T‖_{p,2} ≈ h^{1/2−1/ln h}‖V^T‖_{ln h,2}, which is a tight upper bound for ‖V‖_F and is of the same order if all rows of V^T have the same norm, hence giving a tighter bound that decreases with h for larger values of h. In particular, for p = ln h we get the following bound.
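The two asymptotic claims about p = ln h can be checked numerically (the value of h is an arbitrary illustrative choice):

```python
import numpy as np

h = 10_000
p = np.log(h)               # the choice p = ln h from the text

# Additive term sqrt(e^{-p} h / m): with p = ln h, e^{-p} h = 1, a constant.
print(np.isclose(h * np.exp(-p), 1.0))                        # True

# Prefactor h^{1/2 - 1/p}: with p = ln h it equals sqrt(h)/e,
# i.e. within a constant factor e of sqrt(h).
print(np.isclose(h ** (0.5 - 1.0 / p), np.sqrt(h) / np.e))    # True
```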
In this paper we presented a new capacity bound for neural networks that decreases with the increasing number of hidden units, and that could potentially explain the better generalization performance of larger networks. However, our results are currently limited to two layer networks, and it is of interest to understand and extend these results to deeper networks. Also, while these bounds are useful for relative comparisons between networks of different sizes, their absolute values are still much larger than the number of training samples, and it is of interest to obtain smaller bounds. Finally, we provided a matching lower bound for the capacity, improving on the current lower bounds for neural networks.
In this paper we do not address the question of whether optimization algorithms converge to low-complexity networks in the function class considered here, or, in general, how different hyper-parameter choices affect the complexity of the recovered solutions. It is interesting to understand the implicit regularization effects of the optimization algorithms [19, 8, 25] for neural networks, which we leave for future work.
Acknowledgements
The authors thank Sanjeev Arora for many fruitful discussions on generalization of neural networks and David McAllester for discussion on the distance to random initialization. This research was supported in part by NSF IIS-RI award 1302662 and Schmidt Foundation.
Experiments Settings
Below we describe the setting for each reported experiment.
Two Layer ReLU Networks. We trained fully connected feedforward networks on the CIFAR-10, SVHN and MNIST datasets. For each dataset, we trained 13 architectures with sizes from 2^3 to 2^15, each time increasing the number of hidden units by a factor of 2. For each experiment, we trained the network using SGD with mini-batch size 64, momentum 0.9 and a fixed step size of 0.01 for MNIST and 0.001 for CIFAR-10 and SVHN. We did not use weight decay, dropout or batch normalization. For each experiment, we stopped training when the cross-entropy loss reached 0.01 or when the number of epochs reached 1000. We use a reference line in the plots to differentiate the architectures that achieved 0.01 loss.
Evaluations. For each generalization bound, we calculated the exact bound including the log terms and constants. We set the margin to the 5th percentile of the margins of the data points. Since the bounds in [3] and [21] are given for binary classification, we multiplied the bound of [3] by a factor of c and that of [21] by a factor of √c to make sure that the bounds increase linearly with the number of classes (assuming that all output units have the same norm). Furthermore, since reference matrices can be used in the bounds given in [4] and [23], we used the random initialization as the reference matrix. When plotting distributions, we estimate the distribution using standard Gaussian kernel density estimation.
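The margin-percentile choice described above can be sketched as follows (the scores and labels are random stand-ins for actual network outputs):

```python
import numpy as np

rng = np.random.default_rng(5)
m, c = 5000, 10
scores = rng.standard_normal((m, c))       # stand-in for network outputs f(x_i)
y = rng.integers(0, c, size=m)             # stand-in labels

# Margin of each point: correct-class score minus the best other score.
correct = scores[np.arange(m), y]
rest = scores.copy()
rest[np.arange(m), y] = -np.inf
margins = correct - rest.max(axis=1)

gamma = np.percentile(margins, 5)          # margin set to the 5th percentile
print(np.mean(margins < gamma) <= 0.05)    # True: at most 5% of points fall below gamma
```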
ResNet18
Evaluations
Figures 6 and 7 show the behavior of several measures on networks of different sizes trained on the SVHN and MNIST datasets respectively. The left panel of Figure 8 shows the over-parametrization phenomenon on the MNIST dataset, and the middle and right panels compare our generalization bound to others.


Figure 8: Left panel: Training and test errors of fully connected networks trained on MNIST. Middle panel: Comparing capacity bounds on MNIST (normalized). Right panel: Comparing capacity bounds on MNIST (unnormalized).
Proofs
We start by stating a simple vector-contraction inequality for Rademacher complexities, which relates the norm of a vector to the expected magnitude of its inner product with a vector of Rademacher random variables. We use the following technical result from Maurer [16] in our proof.
Lemma 7 (Proposition 6 of Maurer [16]). Let ξ_i be Rademacher random variables. Then for any vector v ∈ R^d, the following holds:
$$ \frac{1}{\sqrt{2}}\|v\|_2 \leq \mathbb{E}_{\xi \sim \{\pm 1\}^d}\left[\left|\sum_{i=1}^{d} \xi_i v_i\right|\right]. $$
The above lemma is useful for obtaining Rademacher complexities in multi-class settings. The lemma below bounds the Rademacher-like complexity term for linear operators with multiple outputs, centered around a reference matrix. The proof is very simple and similar to that for linear separators; see [3] for similar arguments.
Lemma 8. For any positive integer c, positive scalar r > 0, reference matrix V^0 ∈ R^{c×d} and set {x_i}_{i=1}^m ⊂ R^d, the following inequality holds:
Proof.
(i) follows from Jensen's inequality.
Proof. Let ρ_{ij} = |⟨u_j^0, x_i⟩|. We prove the lemma by showing the following statement by induction on t:
The above statement holds trivially for the base case t = 1 by the definition of the Rademacher complexity (3). We now assume that it is true for any t' ≤ t and prove that it is true for t' = t + 1.
The last inequality follows from the √2/γ-Lipschitzness of the ramp loss. The ramp loss is 1/γ-Lipschitz with respect to each dimension, but since the loss at each point depends only on the score of the correct label and the maximum score among the other labels, it is √2/γ-Lipschitz overall.
We will now add and subtract the initialization U^0 terms.
From equations (9), (10), (11) and Lemma 7 we get,
This completes the induction proof.
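The √2/γ constant used in the induction above rests on the margin operator being √2-Lipschitz in ℓ2; a quick numerical sanity check of that constant over random score pairs (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

def margin(s, y):
    """mu(s, y): correct-class score minus the best other score."""
    other = np.delete(s, y)
    return s[y] - np.max(other)

c, y = 10, 0
worst = 0.0
for _ in range(10_000):
    s, t = rng.standard_normal(c), rng.standard_normal(c)
    ratio = abs(margin(s, y) - margin(t, y)) / np.linalg.norm(s - t)
    worst = max(worst, ratio)

# The margin operator is sqrt(2)-Lipschitz in l2; composing with the
# 1/gamma-Lipschitz ramp gives the sqrt(2)/gamma constant used in the proof.
print(worst <= np.sqrt(2))   # True
```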
Proof of Theorem 1. Using Lemma 8, we can bound the right hand side of the upper bound on the Rademacher complexity given in Lemma 9:
$$ \ell_\gamma(f(x), y) = \begin{cases} 0 & \mu(f(x), y) > \gamma\\ 1-\mu(f(x), y)/\gamma & \mu(f(x), y) \in [0,\gamma]\\ 1 & \mu(f(x), y) < 0. \end{cases} $$
$$ \mathcal{W}=\left\{ (V,U) \;\middle|\; V\in \mathbb{R}^{c\times h},\ U \in \mathbb{R}^{h\times d},\ \forall j\in[h]:\ \|v_j\|_2\leq \alpha_j,\ \|u_j-u^0_j\|_2\leq \beta_j\right\}, $$
$$ \mathcal{F}_\mathcal{W}=\left\{f(x)=V\left[Ux\right]_+ \;\middle|\; (V,U) \in \mathcal{W}\right\}. $$
$$ L_{0}(f) \leq \hat{L}_\gamma(f) + 2\mathcal{R}_S(\ell_\gamma \circ \mathcal{H}) + \sqrt{\frac{\ln(2/\delta)}{2m}}. $$
$$ \mathcal{R}_S(\mathcal{H}) = \frac{1}{m}\,\mathbb{E}_{\xi \sim \{\pm 1\}^m} \left[\sup_{f\in \mathcal{H}} \sum_{i=1}^m \xi_i f(x_i) \right]. $$
$$ \left\|V[z]_+-V[z']_+\right\|_2 \leq \sum_{i=1}^h \|v_i\|_2\, |z_i-z'_i|. $$
$$ L_{0}(f) \leq \hat{L}_\gamma(f) + \tilde{O}\left(\frac{\sqrt{c}\,h^{\frac{1}{2}-\frac{1}{p}}\|V^T\|_{p,2}\left(h^{\frac{1}{2}-\frac{1}{p}}\|U-U^0\|_{p,2}\|X\|_F+\|U^0 X\|_F\right)}{\gamma m} +\sqrt{\frac{e^{-p}h}{m}}\right), $$
$$ K =\left\lceil \frac{D}{(1+\epsilon)^p-1}\right\rceil. $$
$$ \begin{split} \mathcal{R}_S(\ell_\gamma \circ \mathcal{F}_\mathcal{W}) &\leq \frac{2\sqrt{2c}+2}{\gamma m}\, \|\alpha\|_2 \left(\|\beta\|_2\|X\|_F + \|U^0 X\|_F\right)\\ & \leq \frac{2\sqrt{2c}+2}{\gamma m}\, h^{\frac{1}{2}-\frac{1}{p}}\|\alpha\|_p \left(h^{\frac{1}{2}-\frac{1}{p}}\|\beta\|_p\|X\|_F + \|U^0 X\|_F\right), \end{split} $$
$$ \begin{split} &\mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\sup_{\|V-V^0\|_{F}\leq r}\sum_{i=1}^{m} \langle \xi_i, V x_i\rangle\right]\\ &\qquad = \mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\sup_{\|V-V^0\|_{F}\leq r}\left\langle V,\ \sum_{i=1}^{m} \xi_i x_i^\top\right\rangle \right]\\ &\qquad = \mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\sup_{\|V-V^0\|_{F}\leq r}\left\langle V-V^0,\ \sum_{i=1}^{m} \xi_i x_i^\top\right\rangle \right]+\mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\left\langle V^0,\ \sum_{i=1}^{m} \xi_i x_i^\top\right\rangle \right]\\ &\qquad = \mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\sup_{\|V-V^0\|_{F}\leq r}\left\langle V-V^0,\ \sum_{i=1}^{m} \xi_i x_i^\top\right\rangle \right]\\ &\qquad \leq r\,\mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\left\|\sum_{i=1}^{m} \xi_i x_i^\top\right\|_F \right]\\ &\qquad \stackrel{(i)}{\leq} r\,\mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\left\|\sum_{i=1}^{m} \xi_i x_i^\top\right\|^2_F \right]^{1/2}\\ &\qquad = r \left(\sum_{j=1}^c\mathbb{E}_{\xi\in \{\pm 1\}^{m}} \left[\left\|\sum_{i=1}^{m} \xi_i x_i^\top\right\|^2_2 \right]\right)^{1/2} = r\sqrt{c}\, \|X\|_F. \end{split} $$
$$ \begin{split} \mathcal{R}_S(\ell_\gamma \circ \mathcal{F}_\mathcal{W}) &\leq \frac{2}{\gamma m}\sum_{j=1}^h \mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\sup_{\|v_j\|_2\leq \alpha_j} \sum_{i=1}^{m} (\rho_{ij} +\beta_j\|x_i\|_2)\langle \xi_i, v_j\rangle\right] \\ &\quad + \frac{2}{\gamma m}\sum_{j=1}^h\mathbb{E}_{\xi \in \{\pm 1\}^m} \left[\sup_{\|u_j-u^0_j\|_2 \leq \beta_j }\sum_{i=1}^{m} \xi_{i}\, \alpha_j\langle u_j, x_i\rangle\right]. \end{split} $$
$$ \begin{split} &m\,\mathcal{R}_S(\ell_\gamma \circ \mathcal{F}_\mathcal{W})\\ & \leq \mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\sup_{(V,U)\in \mathcal{W}} \frac{2}{\gamma}\sum_{i=1}^{t-1} \sum_{j=1}^h \left( (\rho_{ij} +\beta_j\|x_i\|_2)\langle \xi_i, v_j\rangle + \xi_{i1}\, \alpha_j\langle u_j, x_i\rangle \right) +\sum_{i=t}^m \xi_{i1}\,\ell_\gamma(V[Ux_i]_+,y_i)\right]\\ & =\mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\sup_{(V,U)\in \mathcal{W}} \xi_{t1}\,\ell_\gamma(V[Ux_t]_+,y_t) + \phi_{V,U}\right], \end{split} $$
$$ \begin{split} m\,\mathcal{R}_S(\ell_\gamma \circ \mathcal{F}_\mathcal{W}) &\leq \mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\sup_{(V,U)\in \mathcal{W}} \xi_{t1}\,\ell_\gamma(V[Ux_t]_+,y_t) + \phi_{V,U}\right] \\ &= \frac{1}{2}\, \mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\sup_{(V,U), (V',U')\in \mathcal{W}} \ell_\gamma(V[Ux_t]_+,y_t) - \ell_\gamma(V'[U'x_t]_+,y_t) +\phi_{V,U} +\phi_{V',U'}\right] \\ &\leq \frac{1}{2}\, \mathbb{E}_{\xi_i \in \{\pm 1\}^{c},\ i \in [m]} \left[\sup_{(V,U), (V',U')\in \mathcal{W}} \frac{\sqrt{2}}{\gamma}\left\|V[Ux_t]_+ - V'[U'x_t]_+\right\|_2+\phi_{V,U} +\phi_{V',U'}\right]. \end{split} $$
$$ \begin{split} \left\|V[Ux_t]_+ - V'[U'x_t]_+\right\|_2 &\leq \left\|V[Ux_t]_+ -V'[Ux_t]_+ + V'[Ux_t]_+- V'[U'x_t]_+\right\|_2 \\ & \leq \left\|V[Ux_t]_+ -V'[Ux_t]_+\right\|_2 +\left\|V'[Ux_t]_+- V'[U'x_t]_+\right\|_2 \\ & \leq \sum_{j=1}^h\left\|[\langle u_j, x_t\rangle]_+ v_j - [\langle u_j, x_t\rangle]_+ v'_j\right\|_2 +\left\|[\langle u_j, x_t\rangle]_+ v'_j- [\langle u'_j, x_t\rangle]_+ v'_j\right\|_2 \\ & \leq \sum_{j=1}^h\left|[\langle u_j, x_t\rangle]_+\right|\left\|v_j - v'_j\right\|_2 +\left|[\langle u_j, x_t\rangle]_+ - [\langle u'_j, x_t\rangle]_+ \right|\left\|v'_j\right\|_2. \end{split} $$
$$ \norm{\vecAlpha'}_p^p =\sum_{i=1}^D \left\lceil\frac{\abs{x_i}^p K}{\beta^p}\right\rceil \frac{\beta^p}{K} \leq \sum_{i=1}^D \left(\frac{\abs{x_i}^p K}{\beta^p}+1\right)\frac{\beta^p}{K} = \norm{\vecx}_p^p + \frac{D\beta^p}{K} \leq \beta^p\left(1 + \frac{D}{K}\right). $$
$$ \norm{\vecAlpha}_2 \leq D^{1/2-1/p}\norm{\vecAlpha}_p \leq \beta D^{1/2-1/p}\left(1+(1+\eps)^p-1\right)^{1/p}=\beta D^{1/2-1/p}(1+\eps). $$
$$ \norm{\balpha}_2 \leq \left(\beta^2(1+(D-1)/K)\right)^{1/2} \leq \left(\beta^2(D^{1/2-1/p}+\eps/\beta)^2\right)^{1/2} \leq D^{1/2-1/p}\beta + \eps. $$
$$ \begin{aligned} \ln N_{p,h} &= \ln \binom{\left\lceil h/\mu\right\rceil+h-2}{h-1} \leq \ln \left(\left[e\frac{\left\lceil h/\mu\right\rceil+h-2}{h-1}\right]^{h-1}\right)=(h-1)\ln\left(e + e\frac{\left\lceil h/\mu\right\rceil-1}{h-1}\right)\\ &\leq (h-1)\ln\left(e + e\frac{h/\mu}{h-1}\right) \leq h\ln(e+2e/\mu) \leq 5h. \end{aligned} $$
$$ \begin{aligned} \ln N_{p,h} &= \ln \binom{\left\lceil h/\mu\right\rceil+h-2}{h-1} = \ln \binom{\left\lceil h/\mu\right\rceil+h-2}{\left\lceil h/\mu\right\rceil-1} \leq \ln \left(\left[e\frac{\left\lceil h/\mu\right\rceil+h-2}{\left\lceil h/\mu\right\rceil-1}\right]^{\left\lceil h/\mu\right\rceil-1}\right)\\ &=\left(\left\lceil h/(e^p-1)\right\rceil-1\right)\ln\left(e + e\frac{h-1}{\left\lceil h/(e^p-1)\right\rceil-1}\right) \leq \left(\left\lceil e^{1-p}h\right\rceil-1\right)\ln\left(eh\right). \end{aligned} $$
$$ m\calR_\calS(\ell_\gamma \circ \calF_\calW) \leq \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \bigg[\sup_{(\vecV,\vecU),(\vecV',\vecU')\in \calW}\frac{2}{\gamma} \sum_{j=1}^h(\beta_j\norm{\vecx_t}_2+\rho_{tj})\inner{\vecxi_t}{\vecv_j}+\xi_{t1}\alpha_j\inner{\vecu_j}{\vecx_t}+\phi_{\vecV,\vecU}\bigg]. $$
$$ \calR_\calS(\calF_{\calW}) \geq \calR_\calS(\calF_{\calW'}) =\Omega \left( \frac{\sum_{j=1}^h \alpha_j\beta_j\norm{\vecX}_F}{m} \right). $$
$$ Q=\left\{\balpha \in \R^D \mid \forall_i\ \alpha_i^2 \in \left\{j\beta^2/K\right\}_{j=1}^K,\ \norm{\balpha}_2^2 \le 2D^{1-2/p}\beta^2 \right\}. $$
$$ \sum_{i=1}^D \alpha_i^2 = \sum_{i=1}^D\frac{\beta^2}{K}\lceil\frac{ Kx_i^2}{\beta^2}\rceil \le \sum_{i=1}^D\frac{\beta^2}{K}(1+\frac{ Kx_i^2}{\beta^2}) = \sum_{i=1}^D x_i^2 + \frac{\beta^2D}{K} \le D^{1-\frac{2}{p}}\beta^2 + \frac{\beta^2D}{K} \le 2\beta^2D^{1-\frac{2}{p}}. $$
$$ \ln N_{p,h} \le \ln \left[\frac{e(K+h-2)}{h-1}\right]^{K-1} \le \ln \left[e\left(1+\frac{2^{1-p}h}{h-1}\right)\right]^{K-1}\leq 2^{1-p}h\ln\left(2e\right)\leq 2^{2-p}h. $$
$$ \forall i\in [d], \quad \max\left\{ \inp{\vecs}{\vecf_i}, \inp{\vecs}{-\vecf_i}\right\} \geq \frac{1}{2}\left( \inp{\vecs}{[\vecf_i]_+} + \inp{\vecs}{[-\vecf_i]_+}\right) = \frac{\sum_{j=1}^{2^k}s_j |f_{ji}|}{2} = 2^{-\frac{k}{2}-1} \balpha^\top \bbeta. $$
$$ \tF(\bxi):= [\tbf_1,\tbf_2,\ldots,\tbf_{2^k}] \textrm{ such that if } \eps_i(\bxi)\ge 0,\ \tbf_i= \bbf_{i}, \textrm{ and if } \eps_i(\bxi) <0,\ \tbf_i=\bm{0}, $$
$$ \sum_{i=1}^m \xi_i \vecV [\vecU \vecx_i]_+ = \sum_{j=1}^{2^k} \sum_{i=(j-1)n+1}^{jn} \xi_i \vecV [\vecU \vecx_i]_+ = \sum_{j=1}^{2^k} \eps_j(\bxi) \vecV [\vecU \vece_j]_+ . $$
$$ \eps_j \vecV [\vecU \vece_j]_+ = \eps_j \inp{\Diag{\bbeta}\balpha}{[\tbf_j ]_+} = \eps_j \inp{\vecs}{[\vecf_j ]_+}\bm{1}_{\eps_j > 0 } \geq 2^{-\frac{k}{2}-1} \balpha^\top \bbeta\,[\eps_j]_+ . $$
$$ \begin{split} m\calR_\calS(\calF_{\calW'}) \geq &\E_{\bxi\sim\{\pm1\}^m}\left[\sum_{i=1}^m \xi_i \vecV[\vecU(\bxi) \vecx_i]_+ \right]\\ \geq &\balpha^\top\bbeta\, 2^{-\frac{k}{2}-1}\E_{\bxi\sim\{\pm1\}^m}\left[\sum_{j=1}^{2^k}[\eps_j(\bxi)]_+\right] \\ = &\balpha^\top\bbeta\, 2^{\frac{k}{2}-1}\E_{\bxi\sim\{\pm1\}^n}\left[[\eps_1(\bxi)]_+ \right]\\ = &\balpha^\top\bbeta\, 2^{\frac{k}{2}-2}\E_{\bxi\sim\{\pm1\}^n}\left[\abs{\eps_1(\bxi)} \right]\\ \geq &\balpha^\top\bbeta\, 2^{\frac{k-1}{2}-2}\sqrt{n} \\ = &\frac{\balpha^\top\bbeta\sqrt{2d}}{8}\sqrt{\frac{m}{d}} = \frac{\balpha^\top\bbeta \sqrt{2m}}{8} \end{split} $$
Theorem. Given a training set $\calS=\{\vecx_i\}_{i=1}^m$ and $\gamma>0$, the Rademacher complexity of the composition of the loss function $\ell_\gamma$ with the class $\calF_\calW$ defined in equations eq:param and eq:class is bounded as follows:
$$ \begin{aligned} \calR_\calS(\ell_\gamma \circ \calF_\calW) &\leq \frac{2\sqrt{2c}+2}{\gamma m}\sum_{j=1}^h \alpha_j \left(\beta_j\norm{\vecX}_F + \norm{\vecu_j^0\vecX}_2\right)\\ &\leq \frac{2\sqrt{2c}+2}{\gamma \sqrt{m}} \norm{\balpha}_2 \left(\norm{\bbeta}_2 \sqrt{\frac{1}{m}\sum_{i=1}^m \norm{\vecx_i}_2^2}+ \sqrt{\frac{1}{m} \sum_{i=1}^m \norm{\vecU^0\vecx_i}_2^2 }\right). \end{aligned} $$
Theorem. For any $h\geq 2$, $\gamma>0$, $\delta\in(0,1)$ and $\vecU^0\in \R^{h\times d}$, with probability $1-\delta$ over the choice of the training set $\calS=\{\vecx_i\}_{i=1}^m\subset \R^d$, for any function $f(\vecx)=\vecV[\vecU \vecx]_{+}$ with $\vecV\in \R^{c\times h}$ and $\vecU\in\R^{h\times d}$, the generalization error is bounded as follows:
$$ \begin{aligned} L_{0}(f) & \leq \hatL_\gamma(f) + \tildeO\left(\frac{\sqrt{c}\,\norm{\vecV}_F\left(\norm{\vecU-\vecU^0}_{F}\norm{\vecX}_F+\norm{\vecU^0\vecX}_F\right)}{\gamma m } +\sqrt{\frac{h}{m}}\right)\\ &\leq \hatL_\gamma(f) + \tildeO\left(\frac{\sqrt{c}\,\norm{\vecV}_F\left(\norm{\vecU-\vecU^0}_{F}+\norm{\vecU^0}_2\right) \sqrt{\frac{1}{m}\sum_{i=1}^m \norm{\vecx_i}_2^2} }{\gamma \sqrt{m} } +\sqrt{\frac{h}{m}}\right). \end{aligned} $$
Theorem. Define the parameter set
$$ \calW'=\left\{ (\vecV,\vecU) \mid \vecV\in \R^{1\times h},\ \vecU \in \R^{h\times d},\ \abs{\vecv_j}\leq \alpha_j,\ \norm{\vecu_j-\vecu^0_j}_2 \leq\beta_j,\ \norm{\vecU-\vecU^0}_2\leq \max_{j\in [h]} \beta_j\right\}, $$
and let $\calF_{\calW'}$ be the function class defined on $\calW'$ by equation eq:class. Then, for any $d=h\leq m$, $\{\alpha_j,\beta_j\}_{j=1}^h \subset \R^+$ and $\vecU^0 = 0$, there exists $\calS=\{\vecx_i\}_{i=1}^m \subset \R^d$ such that
$$ \calR_\calS(\calF_{\calW}) \geq \calR_\calS(\calF_{\calW'}) =\Omega \left( \frac{\sum_{j=1}^h \alpha_j\beta_j\norm{\vecX}_F}{m} \right). $$
Lemma. [Proposition 6 of maurer2016vector] Let $\xi_i$ be Rademacher random variables. For any vector $\vecv\in \R^d$, the following holds:
$$ \norm{\vecv}_2 \leq \sqrt{2}\, \E_{\xi_i \sim \left\{\pm 1\right\}, i \in [d]} \left[ \abs{\inner{\vecxi}{\vecv}}\right]. $$
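As a quick numerical sanity check of this Khintchine-type inequality (with constant $\sqrt{2}$), one can enumerate all sign vectors exactly for small dimensions. This check is illustrative and not part of the original proof; the test vectors below are arbitrary choices.

```python
from itertools import product
from math import sqrt

def expected_abs_inner(v):
    """E_{xi in {+-1}^d} |<xi, v>| by exact enumeration over all 2^d sign vectors."""
    d = len(v)
    total = 0.0
    for signs in product((-1.0, 1.0), repeat=d):
        total += abs(sum(s * x for s, x in zip(signs, v)))
    return total / 2 ** d

def l2(v):
    return sqrt(sum(x * x for x in v))

# ||v||_2 <= sqrt(2) * E|<xi, v>| for every v; equality holds e.g. at v = (1, 1).
for v in [(1.0, 1.0), (3.0, -1.0, 2.0), (0.5, 0.5, 0.5, 0.5), (2.0,)]:
    assert l2(v) <= sqrt(2) * expected_abs_inner(v) + 1e-12
```

The vector $(1,1)$ attains equality, showing the constant $\sqrt{2}$ cannot be improved.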
Lemma. For any positive integer $c$, positive scalar $r>0$, reference matrix $\vecV^0\in \R^{c\times d}$ and set $\{\vecx_i\}_{i=1}^m \subset \R^d$, the following inequality holds:
$$ \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{\norm{\vecV-\vecV^0}_{F}\leq r}\sum_{i=1}^{m} \inner{\vecxi_i}{\vecV\vecx_i}\right] \leq r\sqrt{c}\, \norm{\vecX}_F. $$
Lemma. [$\ell_p$ covering lemma] Given any $\eps,D,\beta>0$ and $p\geq2$, consider the set $S^D_{p,\beta}=\{\vecx\in\R^D\mid \norm{\vecx}_p \leq \beta\}$. Then there exist $N$ sets $\{T_i\}_{i=1}^N$ of the form $T_i=\{\vecx\in\R^D\mid \abs{x_j}\leq \alpha^i_j,\ \forall j \in [D]\}$ such that $S^D_{p,\beta}\subseteq \bigcup_{i=1}^N T_i$ and $\norm{\balpha^i}_2\leq D^{1/2-1/p}\beta (1+\eps)$ for all $i \in [N]$, where $N =\binom{K+D-1}{ D-1}$ and
$$ K =\left\lceil \frac{D}{(1+\eps)^p-1}\right\rceil. $$
Lemma. Let the inputs satisfy $\norm{\vecx_i}_2 \leq B$ for all $i$. For any $h,p\geq 2$, $\gamma>0$ and $\delta\in(0,1)$, with probability $1-\delta$ over the choice of the training set $\calS=\{\vecx_i\}_{i=1}^m$, for any function $f(\vecx)=\vecV[\vecU \vecx]_{+}$, the generalization error is bounded as follows:
$$ L_{0}(f) \leq \hatL_\gamma(f) + \frac{4(\sqrt{2c}+1) \left(h^{\frac{1}{2}-\frac{1}{p}}\norm{\vecV^\top}_{p,2}+1\right)\left(h^{\frac{1}{2}-\frac{1}{p}}\norm{\vecU-\vecU^0}_{p,2}\norm{\vecX}_F+\norm{\vecU^0\vecX}_F+1\right)}{\gamma m } +3\sqrt{\frac{\ln N_{p,h} + \ln(2/\delta)}{m}}, $$
where $N_{p,h} =\binom{\left\lceil \frac{h-1}{(h^{1/2-1/p}+8/(\gamma\sqrt{m}))^2-1}\right\rceil+h-2}{ h-1 }$, and $\norm{\cdot}_{p,2}$ is the $\ell_p$ norm of the row $\ell_2$ norms.
Corollary. For all $h=d\leq m$ and $s_1,s_2\geq 0$, there exists $\calS\in \R^{d\times m}$ such that $\calR_\calS(\calF_{\calW_{spec}}) =\Omega \left( \frac{s_1s_2\sqrt{h}\norm{\vecX}_F}{m} \right)$.
Corollary. Define $\calW_{spec} = \{(\vecV,\vecU)\mid \vecV\in \R^{1\times h},\ \vecU \in \R^{h\times d},\ \norm{\vecV}_2\leq s_1,\ \norm{\vecU}_2\leq s_2\}$ and $\calF_{\calW_{spec}} = \{ f(\vecx)=\vecV[\vecU\vecx]_+\mid (\vecV,\vecU)\in\calW_{spec}\}$. Then for $h=d\leq m$, we have $\calR_\calS(\calF_{\calW_{spec}}) =\Omega\left(\frac{s_1s_2\sqrt{h}\norm{\vecX}_F}{m}\right)$.
Proof.
$$ \begin{aligned} &\E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{\norm{\vecV-\vecV^0}_{F}\leq r}\sum_{i=1}^{m} \inner{\vecxi_i}{\vecV\vecx_i}\right] = \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{\norm{\vecV-\vecV^0}_{F}\leq r}\inner{\vecV}{\sum_{i=1}^{m} \vecxi_i \vecx_i^\top} \right]\\ &\qquad = \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{\norm{\vecV-\vecV^0}_{F}\leq r}\inner{\vecV-\vecV^0}{\sum_{i=1}^{m} \vecxi_i \vecx_i^\top} \right]+\E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\inner{\vecV^0}{\sum_{i=1}^{m} \vecxi_i \vecx_i^\top} \right]\\ &\qquad = \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{\norm{\vecV-\vecV^0}_{F}\leq r}\inner{\vecV-\vecV^0}{\sum_{i=1}^{m} \vecxi_i \vecx_i^\top} \right]\\ &\qquad \leq r\,\E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\norm{\sum_{i=1}^{m} \vecxi_i \vecx_i^\top}_F \right] \overset{(i)}{\leq} r\,\E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\norm{\sum_{i=1}^{m} \vecxi_i \vecx_i^\top}^2_F \right]^{1/2}\\ &\qquad =r \left(\sum_{j=1}^c\E_{\xi\in \{\pm 1\}^{m}} \left[\norm{\sum_{i=1}^{m} \xi_i \vecx_i^\top}^2_2 \right]\right)^{1/2} = r\sqrt{c}\, \norm{\vecX}_F. \end{aligned} $$
The third equality holds because the second expectation vanishes ($\E[\vecxi_i]=0$), and $(i)$ follows from Jensen's inequality.
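The final equality above rests on the identity $\E\big\|\sum_i \vecxi_i\vecx_i^\top\big\|_F^2 = c\norm{\vecX}_F^2$, which can be confirmed by exact enumeration for small sizes. The sample matrix below is an arbitrary illustration, not data from the paper.

```python
from itertools import product

# Columns of X are the inputs x_i (d x m); the xi_i are iid uniform sign vectors in {+-1}^c.
X = [[1.0, 2.0, 0.5],
     [-1.0, 0.0, 3.0]]          # d = 2, m = 3
d, m = len(X), len(X[0])
c = 2

fro_sq_X = sum(X[a][i] ** 2 for a in range(d) for i in range(m))

# Exact expectation of ||sum_i xi_i x_i^T||_F^2 over all (2^c)^m sign assignments.
total = 0.0
count = 0
for signs in product(product((-1, 1), repeat=c), repeat=m):  # signs[i] = xi_i
    fro_sq = 0.0                 # Frobenius norm^2 of the c x d matrix sum_i xi_i x_i^T
    for j in range(c):
        for a in range(d):
            entry = sum(signs[i][j] * X[a][i] for i in range(m))
            fro_sq += entry ** 2
    total += fro_sq
    count += 1
expected = total / count

# E||sum_i xi_i x_i^T||_F^2 = c * ||X||_F^2, since cross terms cancel in expectation.
assert abs(expected - c * fro_sq_X) < 1e-9
```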
Proof. Let $\rho_{ij} =\inner{\vecu^0_j}{\vecx_i}$. We prove the lemma by showing the following statement by induction on $t$:
$$ \begin{aligned} m\calR_\calS(\ell_\gamma \circ \calF_\calW) & \leq \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{(\vecV,\vecU)\in \calW} \frac{2}{\gamma}\sum_{i=1}^{t-1} \sum_{j=1}^h \left( (\rho_{ij} +\beta_j\norm{\vecx_i}_2)\inner{\vecxi_i}{\vecv_j} + \xi_{i1} \alpha_j\inner{\vecu_j}{\vecx_i} \right) +\sum_{i=t}^m \xi_{i1}\ell_\gamma(\vecV[\vecU\vecx_i]_+,y_i)\right]\\ & =\E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{(\vecV,\vecU)\in \calW} \xi_{t1}\ell_\gamma(\vecV[\vecU\vecx_t]_+,y_t) + \phi_{\vecV,\vecU}\right], \end{aligned} $$
where for simplicity of notation we let $\phi_{\vecV,\vecU}=\frac{2}{\gamma}\sum_{i=1}^{t-1} \sum_{j=1}^h \left( (\rho_{ij} +\beta_j\norm{\vecx_i}_2)\inner{\vecxi_i}{\vecv_j} + \xi_{i1} \alpha_j\inner{\vecu_j}{\vecx_i} \right) +\sum_{i=t+1}^m \xi_{i1}\ell_\gamma(\vecV[\vecU\vecx_i]_+,y_i)$. The above statement holds trivially for the base case $t=1$ by the definition of the Rademacher complexity eq:rademacher. We now assume that it is true for any $t'\leq t$ and prove it for $t'=t+1$:
$$ \begin{aligned} m\calR_\calS(\ell_\gamma \circ \calF_\calW) &\leq \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{(\vecV,\vecU)\in \calW} \xi_{t1}\ell_\gamma(\vecV[\vecU\vecx_t]_+,y_t) + \phi_{\vecV,\vecU}\right] \\ &= \frac{1}{2} \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{(\vecV,\vecU), (\vecV',\vecU')\in \calW} \ell_\gamma(\vecV[\vecU\vecx_t]_+,y_t) - \ell_\gamma(\vecV'[\vecU'\vecx_t]_+,y_t) +\phi_{\vecV,\vecU} +\phi_{\vecV',\vecU'}\right] \\ &\leq \frac{1}{2} \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{(\vecV,\vecU), (\vecV',\vecU')\in \calW} \frac{\sqrt{2}}{\gamma}\norm{\vecV[\vecU\vecx_t]_+ - \vecV'[\vecU'\vecx_t]_+}_2+\phi_{\vecV,\vecU} +\phi_{\vecV',\vecU'}\right]. \end{aligned} $$
The last inequality follows from the $\frac{\sqrt{2}}{\gamma}$-Lipschitzness of the ramp loss.
The ramp loss is $\frac{1}{\gamma}$-Lipschitz with respect to each coordinate, but since the loss at each point depends only on the score of the correct label and the maximum score among the other labels, it is $\frac{\sqrt{2}}{\gamma}$-Lipschitz. Using the triangle inequality, we can bound the first term in the above bound as follows:
$$ \begin{aligned} \norm{\vecV[\vecU\vecx_t]_+ - \vecV'[\vecU'\vecx_t]_+}_2 &\leq \norm{\vecV[\vecU\vecx_t]_+ -\vecV'[\vecU\vecx_t]_+}_2 +\norm{\vecV'[\vecU\vecx_t]_+- \vecV'[\vecU'\vecx_t]_+}_2 \\ & \leq \sum_{j=1}^h\abs{[\inner{\vecu_j}{\vecx_t}]_+}\norm{\vecv_j - \vecv'_j}_2 +\abs{[\inner{\vecu_j}{\vecx_t}]_+ - [\inner{\vecu'_j}{\vecx_t}]_+ }\norm{\vecv'_j}_2. \end{aligned} $$
We now add and subtract the initialization $\vecU^0$:
$$ \begin{aligned} \norm{\vecV[\vecU\vecx_t]_+ - \vecV'[\vecU'\vecx_t]_+}_2 & \leq \sum_{j=1}^h\abs{\inner{\vecu_j+\vecu^0_j-\vecu^0_j}{\vecx_t}}\norm{\vecv_j - \vecv'_j}_2 +\abs{[\inner{\vecu_j}{\vecx_t}]_+ - [\inner{\vecu'_j}{\vecx_t}]_+ }\norm{\vecv'_j}_2 \\ & \leq \sum_{j=1}^h(\beta_j\norm{\vecx_t}_2+\rho_{tj})\norm{\vecv_j - \vecv'_j}_2 +\alpha_j\abs{\inner{\vecu_j}{\vecx_t}-\inner{\vecu'_j}{\vecx_t}} . \end{aligned} $$
From equations eq:proof_new1, eq:proof_new2, eq:proof_new3 and Lemma lem:contraction we get
$$ m\calR_\calS(\ell_\gamma \circ \calF_\calW) \leq \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \bigg[\sup_{(\vecV,\vecU),(\vecV',\vecU')\in \calW}\frac{2}{\gamma} \sum_{j=1}^h(\beta_j\norm{\vecx_t}_2+\rho_{tj})\inner{\vecxi_t}{\vecv_j}+\xi_{t1}\alpha_j\inner{\vecu_j}{\vecx_t}+\phi_{\vecV,\vecU}\bigg]. $$
This completes the induction proof.
Hence the induction step at $t=m$ gives us:
$$ \begin{aligned} m\calR_\calS(\ell_\gamma \circ \calF_\calW) &\leq \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{\norm{\vecu_k-\vecu^0_k}_2 \leq \beta_k,\ \norm{\vecv_k}_2\leq \alpha_k,\ k\in[h]} \frac{2}{\gamma}\sum_{i=1}^{m} \sum_{j=1}^h(\rho_{ij} +\beta_j\norm{\vecx_i}_2)\inner{\vecxi_i}{\vecv_j} + \xi_{i1} \alpha_j\inner{\vecu_j}{\vecx_i}\right]\\ &=\frac{2}{\gamma}\sum_{j=1}^h \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{ \norm{\vecv_j}_2\leq \alpha_j} \sum_{i=1}^{m} (\rho_{ij} +\beta_j\norm{\vecx_i}_2)\inner{\vecxi_i}{\vecv_j}\right] + \frac{2}{\gamma}\sum_{j=1}^h\E_{\vecxi \in \{\pm 1\}^m} \left[\sup_{\norm{\vecu_j-\vecu^0_j}_2 \leq \beta_j }\sum_{i=1}^{m} \xi_{i} \alpha_j\inner{\vecu_j}{\vecx_i}\right]. \end{aligned} $$
Proof. [Proof of Theorem thm:rademacher] Using Lemma lem:linear, we can bound the right hand side of the upper bound on the Rademacher complexity given in Lemma lem:decomposition:
$$ \begin{aligned} m\calR_\calS(\ell_\gamma \circ \calF_\calW) &\leq \frac{2}{\gamma}\sum_{j=1}^h \E_{\vecxi_i \in \{\pm 1\}^{c}, i \in [m]} \left[\sup_{ \norm{\vecv_j}_2\leq \alpha_j} \sum_{i=1}^{m} (\rho_{ij} +\beta_j\norm{\vecx_i}_2)\inner{\vecxi_i}{\vecv_j}\right] + \frac{2}{\gamma}\sum_{j=1}^h\E_{\vecxi \in \{\pm 1\}^m} \left[\sup_{\norm{\vecu_j-\vecu^0_j}_2 \leq \beta_j }\sum_{i=1}^{m} \xi_{i} \alpha_j\inner{\vecu_j}{\vecx_i}\right]\\ &\leq \frac{2\sqrt{2c}}{\gamma}\sum_{j=1}^h \alpha_j \left(\beta_j\norm{\vecX}_F + \norm{\vecu_j^0\vecX}_2\right)+ \frac{2}{\gamma} \sum_{j=1}^h \alpha_j \beta_j\norm{\vecX}_F\\ &\leq \frac{2\sqrt{2c}+2}{\gamma}\sum_{j=1}^h \alpha_j \left(\beta_j\norm{\vecX}_F + \norm{\vecu_j^0\vecX}_2\right). \end{aligned} $$
Proof. [Proof of Lemma lem:generalization] This lemma can be proved by directly applying a union bound over Theorem thm:rademacher for all values of $\alpha_j$ and $\beta_j$. When either $\norm{\vecV}_F$ or $\norm{\vecU-\vecU^0}_F$ is larger than $\frac{\gamma\sqrt{m}}{8}$, the second term of the above inequality is larger than 1 and the bound holds trivially. Otherwise, considering values in the set $\left\{1,2,3,\ldots, \frac{\gamma \sqrt{m}}{8}\right\}$, we can always find $\vecAlpha$ and $\vecBeta$ such that $\norm{\vecV}_F\leq \norm{\vecAlpha}_2\leq \norm{\vecV}_F+1$ and $\norm{\vecU-\vecU^0}_F\leq \norm{\vecBeta}_2\leq \norm{\vecU-\vecU^0}_F+1$. Therefore, for each $j\in[h]$ we need to pick $\alpha_j$ and $\beta_j$ from the set $\left\{\frac{1}{h},\frac{2}{h},\frac{3}{h},\ldots, \frac{\gamma \sqrt{m}}{8}\right\}$, which gives a cover of size $\left(\frac{\gamma h\sqrt{m}}{8}\right)^{2h}$ and completes the proof.
Proof. [Proof of Theorem thm:generalization] The proof directly follows from Lemma~lem:generalization and using $\tildeO$ notation to hide the constants and logarithmic factors.
Proof. The proof is the same as the proof of Lemma lem:generalization, except that here all $\alpha_j$ are equal to each other, and similarly all $\beta_j$. Therefore, the size of the cover is bounded by $\left(\frac{\gamma\sqrt{m}}{8}\right)^{2}$.
Proof. We prove the lemma by construction. Consider the set $Q=\left\{\balpha \in \R^D \mid \forall_i\ \alpha_i^p \in \left\{j\beta^p/K\right\}_{j=1}^K,\ \norm{\balpha}_p^p \leq\beta^p(1+D/K)\right\}$. For any $\vecx\in S^D_{p,\beta}$, consider $\vecAlpha'$ such that for any $i\in [D]$, $\alpha'_i=\left(\left\lceil\frac{\abs{x_i}^p K}{\beta^p}\right\rceil \frac{\beta^p}{K}\right)^{1/p}$. It is clear that $\abs{x_i}\leq \alpha'_i$. Moreover, we have:
$$ \norm{\vecAlpha'}_p^p =\sum_{i=1}^D \left\lceil\frac{\abs{x_i}^p K}{\beta^p}\right\rceil \frac{\beta^p}{K} \leq \sum_{i=1}^D \left(\frac{\abs{x_i}^p K}{\beta^p}+1\right)\frac{\beta^p}{K} = \norm{\vecx}_p^p + \frac{D\beta^p}{K} \leq \beta^p\left(1 + \frac{D}{K}\right). $$
Therefore, we have $\vecAlpha'\in Q$. Furthermore, for any $\vecAlpha\in Q$ we have:
$$ \norm{\vecAlpha}_2 \leq D^{1/2-1/p}\norm{\vecAlpha}_p \leq \beta D^{1/2-1/p}\left(1+(1+\eps)^p-1\right)^{1/p}=\beta D^{1/2-1/p}(1+\eps). $$
Therefore, to complete the proof, we only need to bound the size of the set $Q$. The size of $Q$ is equal to the number of solutions of $\sum_{i=1}^D z_i=K+D-1$ in positive integer variables $z_i$, which is $\binom{K+D-2}{D-1}$.
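The counting step is a standard stars-and-bars argument, and can be checked by brute force for small $K$ and $D$. The ranges below are arbitrary test values, not parameters from the paper.

```python
from itertools import product
from math import comb

def positive_solutions(D, total):
    """Count solutions of z_1 + ... + z_D = total with every z_i a positive integer."""
    return sum(1 for z in product(range(1, total + 1), repeat=D) if sum(z) == total)

# Stars and bars: the number of positive-integer solutions of
# sum_i z_i = K + D - 1 equals C(K + D - 2, D - 1).
for D in range(1, 5):
    for K in range(1, 6):
        assert positive_solutions(D, K + D - 1) == comb(K + D - 2, D - 1)
```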
Proof. We prove the lemma by construction. Consider the set
$$ Q=\left\{\balpha \in \R^D \mid \forall_i\ \alpha_i^2 \in \left\{j\beta^2/K\right\}_{j=1}^K,\ \norm{\balpha}_2^2 \le 2D^{1-2/p}\beta^2 \right\}. $$
Then for every $\vecx\in S_{p,\beta}^D$, the vector $\balpha$ satisfying $x_i^2\le \alpha_i^2 = \frac{\beta^2}{K}\left\lceil \frac{Kx_i^2}{\beta^2}\right\rceil$ for all $i\in[D]$ belongs to $Q$, because
$$ \sum_{i=1}^D \alpha_i^2 = \sum_{i=1}^D\frac{\beta^2}{K}\left\lceil\frac{ Kx_i^2}{\beta^2}\right\rceil \le \sum_{i=1}^D\frac{\beta^2}{K}\left(1+\frac{ Kx_i^2}{\beta^2}\right) = \sum_{i=1}^D x_i^2 + \frac{\beta^2D}{K} \le D^{1-\frac{2}{p}}\beta^2 + \frac{\beta^2D}{K} \le 2\beta^2D^{1-\frac{2}{p}}. $$
Therefore, to complete the proof, we only need to bound the size of the set $Q$. The size of $Q$ is at most the number of solutions of $\sum_{i=1}^D z_i\le K+D-1$ for positive integer variables $z_i$, which is bounded by $\binom{K+D-1}{D-1}$.
Proof. We prove the lemma by construction. Consider the set $Q=\left\{\balpha \in \R^D \mid \forall_i\ \alpha_i^2 \in \left\{j\beta^2/K\right\}_{j=1}^K,\ \norm{\balpha}_2^2 \le\beta^2(1+(D-1)/K)\right\}$. It is clear that for any $\vecx\in S_{p,\beta}$, there exists $\balpha\in Q$ such that $\abs{x_i}\leq \alpha_i$ for every $i\in[D]$. Moreover, we have:
$$ \norm{\balpha}_2 \leq \left(\beta^2(1+(D-1)/K)\right)^{1/2} \leq \left(\beta^2(D^{1/2-1/p}+\eps/\beta)^2\right)^{1/2} \leq D^{1/2-1/p}\beta + \eps. $$
Therefore, to complete the proof, we only need to bound the size of the set $Q$. The size of $Q$ is equal to the number of solutions of $\sum_{i=1}^D z_i=K+D-1$ in positive integer variables $z_i$, which is $\binom{K+D-2}{D-1}$.
Proof. The proof follows from the result of Theorem thm:rademacher together with a union bound over all possible values of $\{\vecV\mid \norm{\vecV^\top}_{p,2}\le C_1\}$ and $\{\vecU\mid \norm{\vecU-\vecU^0}_{p,2}\le C_2\}$. Note that for all $\vecx\in\R^h$ and $p\ge 2$, we have $\norm{\vecx}_2\le h^{\frac{1}{2}-\frac{1}{p}}\norm{\vecx}_p$. Recall the result of Theorem thm:rademacher: given any fixed $\balpha,\bbeta$, we have
$$ \begin{split} \calR_\calS(\ell_\gamma \circ \calF_\calW) &\leq \frac{2\sqrt{2c}+2}{\gamma m} \norm{\balpha}_2 \left(\norm{\bbeta}_2\norm{\vecX}_F + \norm{\vecU^0\vecX}_F\right)\\ & \leq \frac{2\sqrt{2c}+2}{\gamma m} h^{\frac{1}{2}-\frac{1}{p}}\norm{\balpha}_p \left(h^{\frac{1}{2}-\frac{1}{p}}\norm{\bbeta}_p\norm{\vecX}_F + \norm{\vecU^0\vecX}_F\right). \end{split} $$
By Lemma lem:lp-cover with $\eps=1$ and $\beta=C_1$, we can find a set of vectors $\{\balpha^i\}_{i=1}^{N_{p,h}}$, where $K= \left\lceil \frac{h}{2^p-1}\right\rceil$ and $N_{p,h} = \binom{K+h-2}{h-1}$, such that for every $\vecx$ with $\norm{\vecx}_p\leq C_1$ there exists $1\le i\le N_{p,h}$ with $\abs{x_j}\leq \alpha^i_j$ for all $j \in[h]$. Similarly, with $\eps=1$ and $\beta = C_2$, we can find a set of vectors $\{\bbeta^i\}_{i=1}^{N_{p,h}}$ such that for every $\vecx$ with $\norm{\vecx}_p\leq C_2$ there exists $1\le i\le N_{p,h}$ with $\abs{x_j}\leq \beta^i_j$ for all $j \in[h]$. Thus, applying a union bound on Eq. eq:generalization over all pairs $(\balpha,\bbeta)\in \{\balpha^i\}_{i=1}^{N_{p,h}}\times \{\bbeta^i\}_{i=1}^{N_{p,h}}$, we have
$$ L_{0}(f) \leq \hatL_\gamma(f) + \frac{8(\sqrt{2c}+1)\, h^{\frac{1}{2}-\frac{1}{p}}C_1\left(2h^{\frac{1}{2}-\frac{1}{p}}C_2\norm{\vecX}_F+\norm{\vecU^0\vecX}_F\right)}{\gamma m } + 3\sqrt{\frac{\ln N^2_{p,h} + \ln(2/\delta)}{m}}. $$
To complete the proof, we only need to upper bound $N_{p,h}$. Note that $K = \left\lceil \frac{h}{2^p-1}\right\rceil \le \frac{h}{2^p-1}+1 \le 2^{1-p}h+1$, and thus
$$ \ln N_{p,h} \le \ln \left[\frac{e(K+h-2)}{h-1}\right]^{K-1} \le \ln \left[e\left(1+\frac{2^{1-p}h}{h-1}\right)\right]^{K-1}\leq 2^{1-p}h\ln\left(2e\right)\leq 2^{2-p}h. $$
Proof. [Proof of Lemma lem:generalization_lp] This lemma can be proved by directly applying a union bound over Lemma lem:pre_lp for every pair $(C_1,C_2)\in \{1,2,3,\ldots, \frac{\gamma \sqrt{m}}{32}\}^2$. For $\norm{\vecV^\top}_{p,2}\le 1$, we can use the bound with $C_1=1$, and the additional constant $1$ in Eq. eq:generalization_lp covers that case; the same argument applies for $\norm{\vecU}_{p,2}\le 1$. When either $\norm{\vecV^\top}_{p,2}$ or $\norm{\vecU}_{p,2}$ is larger than $\frac{\gamma\sqrt{m}}{32}$, the second term in Eq. eq:generalization_lp is larger than 1, so the bound holds trivially. In the remaining case, there exists $(C_1,C_2)\in\{1,2,3,\ldots, \frac{\gamma \sqrt{m}}{32}\}^2$ such that $\norm{\vecV^\top}_{p,2}\le C_1+1$ and $\norm{\vecU}_{p,2}\le C_2+1$, which completes the proof.
Proof. To prove the lemma, we directly upper bound the generalization bound given in Lemma lem:generalization_mu for $p=2$ and $\mu=\frac{3\sqrt{2}}{4}-1$. For this choice of $\mu$ and $p$, we have $4(\mu+1)^{2/p} \leq 3\sqrt{2}$, and $\ln N_{p,h}$ is bounded as follows:
$$ \begin{aligned} \ln N_{p,h} &= \ln \binom{\left\lceil h/\mu\right\rceil+h-2}{h-1} \leq \ln \left(\left[e\frac{\left\lceil h/\mu\right\rceil+h-2}{h-1}\right]^{h-1}\right)=(h-1)\ln\left(e + e\frac{\left\lceil h/\mu\right\rceil-1}{h-1}\right)\\ &\leq (h-1)\ln\left(e + e\frac{h/\mu}{h-1}\right) \leq h\ln(e+2e/\mu) \leq 5h. \end{aligned} $$
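The chain of bounds on $\ln N_{p,h}$ for this choice of $\mu$ can be verified numerically; the range of $h$ below is an arbitrary test range, not one appearing in the paper.

```python
from math import comb, log, sqrt, ceil

mu = 3 * sqrt(2) / 4 - 1          # the choice of mu used above (p = 2)

for h in range(2, 60):
    # N_{p,h} = C(ceil(h/mu) + h - 2, h - 1); the claim is ln N_{p,h} <= 5h.
    N = comb(ceil(h / mu) + h - 2, h - 1)
    assert log(N) <= 5 * h
```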
Proof. [Proof of Theorem thm:lower] We start with the case $h=d=2^k$, $m = n2^k$ for some $k,n\in \mathN$. We pick $\vecV = \balpha^\top = [\alpha_1 \ldots \alpha_{2^k}] $ for every $\bxi$, and $\calS = \{\vecx_i\}_{i=1}^m$, where $\vecx_i := \vece_{\left\lceil i/n\right\rceil}$. That is, the dataset is divided into $2^k$ groups, each consisting of $n$ copies of a different element of the standard orthonormal basis. We further define $\eps_j(\bxi) = \sum_{i=(j-1)n+1}^{jn} \xi_i$ for all $j\in[2^k]$, and let $\vecF = (\bbf_1,\bbf_2,\ldots,\bbf_{2^k}) \in \{-2^{-k/2},2^{-k/2}\}^{2^k\times 2^k}$ be the Hadamard matrix, which satisfies $\inp{\bbf_i}{\bbf_j}=\delta_{ij}$. Note that for $\vecs\in \R^d$ with $s_j=\alpha_j\beta_j$ for all $j\in [d]$, it holds that
$$ \forall i\in [d], \quad \max\left\{ \inp{\vecs}{\vecf_i}, \inp{\vecs}{-\vecf_i}\right\} \geq \frac{1}{2}\left( \inp{\vecs}{[\vecf_i]_+} + \inp{\vecs}{[-\vecf_i]_+}\right) = \frac{\sum_{j=1}^{2^k}s_j |f_{ji}|}{2} = 2^{-\frac{k}{2}-1} \balpha^\top \bbeta. $$
Thus, without loss of generality, we can assume $\inp{\vecs}{\vecf_i} \geq 2^{-\frac{k}{2}-1} \balpha^\top\bbeta$ for all $i\in [2^k]$, by flipping the signs of the $\vecf_i$ as needed. For any $ \bxi\in \{-1,1\}^m$, let $\Diag{\bbeta}$ be the square diagonal matrix with its diagonal equal to $\bbeta$, and let $\tF(\bxi)$ be the following:
$$ \tF(\bxi):= [\tbf_1,\tbf_2,\ldots,\tbf_{2^k}] \textrm{ such that if } \eps_i(\bxi)\ge 0,\ \tbf_i= \bbf_{i}, \textrm{ and if } \eps_i(\bxi) <0,\ \tbf_i=\bm{0}, $$
and choose $\vecU(\bxi) = \Diag{\bbeta}\, \tF(\bxi)$. Since $\vecF$ is orthogonal, by the definition of $\tF(\bxi)$ we have $\norm{\tF(\bxi)}_2\le 1$, and the $2$-norm of each row of $\tF(\bxi)$ is upper bounded by $1$.
Therefore, we have $\norm{\vecU(\bxi)}_2\leq \norm{\Diag{\bbeta}}_2\norm{\tF(\bxi)}_2\leq \max_i \beta_i$, and $\norm{\vecu_i-\vecu^0_i}_2 = \norm{\vecu_i}_2 \le \beta_i \norm{\vece_i^\top \tF(\bxi)}_2\le \beta_i $. In other words, $f(\vecx) = \vecV[\vecU(\bxi)\vecx]_+ \in \calF_{\calW'}$. We omit the index $\bxi$ when it is clear from context. Now we have
$$ \sum_{i=1}^m \xi_i \vecV [\vecU \vecx_i]_+ = \sum_{j=1}^{2^k} \sum_{i=(j-1)n+1}^{jn} \xi_i \vecV [\vecU \vecx_i]_+ = \sum_{j=1}^{2^k} \eps_j(\bxi) \vecV [\vecU \vece_j]_+ . $$
Note that
$$ \eps_j \vecV [\vecU \vece_j]_+ = \eps_j \inp{\Diag{\bbeta}\balpha}{[\tbf_j ]_+} = \eps_j \inp{\vecs}{[\vecf_j ]_+}\bm{1}_{\eps_j > 0 } \geq 2^{-\frac{k}{2}-1} \balpha^\top \bbeta\,[\eps_j]_+ . $$
The last inequality uses the assumption above that $\inp{\vecs}{\vecf_i} \geq 2^{-\frac{k}{2}-1} \balpha^\top\bbeta$ for all $i\in [2^k]$. Thus,
$$ \begin{split} m\calR_\calS(\calF_{\calW'}) \geq &\E_{\bxi\sim\{\pm1\}^m}\left[\sum_{i=1}^m \xi_i \vecV[\vecU(\bxi) \vecx_i]_+ \right]\\ \geq &\balpha^\top\bbeta\, 2^{-\frac{k}{2}-1}\E_{\bxi\sim\{\pm1\}^m}\left[\sum_{j=1}^{2^k}[\eps_j(\bxi)]_+\right] = \balpha^\top\bbeta\, 2^{\frac{k}{2}-1}\E_{\bxi\sim\{\pm1\}^n}\left[[\eps_1(\bxi)]_+ \right]\\ = &\balpha^\top\bbeta\, 2^{\frac{k}{2}-2}\E_{\bxi\sim\{\pm1\}^n}\left[\abs{\eps_1(\bxi)} \right] \geq \balpha^\top\bbeta\, 2^{\frac{k-1}{2}-2}\sqrt{n} = \frac{\balpha^\top\bbeta\sqrt{2d}}{8}\sqrt{\frac{m}{d}} = \frac{\balpha^\top\bbeta \sqrt{2m}}{8}, \end{split} $$
where the last inequality is by Lemma lem:contraction. For arbitrary $d=h\le m$ with $d,h,m\in \mathZ_{+}$, let $k= \lfloor \log_2 d\rfloor$, $d'=h'=2^k$, and $m' = \lfloor m/2^k\rfloor\cdot 2^k$. Then we have $h'\ge h/2$ and $m' \geq m/2$. Moreover, there exists $S\subseteq [h]$ with $\abs{S}=h'$ such that $\sum_{i\in S}\alpha_i\beta_i \geq \frac{1}{2}\sum_{i=1}^h \alpha_i\beta_i$. Therefore we can pick $h'$ of the $h$ hidden units, $d'$ of the $d$ input dimensions, and $m'$ of the $m$ input samples to obtain a lower bound of $\frac{\sum_{i\in S}\alpha_i\beta_i\sqrt{2m'}}{8}\ge \frac{\balpha^\top\bbeta\sqrt{m}}{16}$.
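The construction can be simulated exactly on a tiny instance ($k=2$, $n=2$, hence $m=8$) by enumerating all sign vectors $\bxi$ and comparing the resulting average against $\balpha^\top\bbeta\sqrt{2m}/8$. The concrete sizes and the choice $\balpha=\bbeta=\mathbf{1}$ below are illustrative only.

```python
from itertools import product
from math import sqrt

k, n = 2, 2
h = d = 2 ** k
m = n * 2 ** k
alpha = [1.0] * h
beta = [1.0] * h
s = [a * b for a, b in zip(alpha, beta)]          # s_j = alpha_j * beta_j

# Sylvester-Hadamard matrix, normalized so columns are orthonormal.
H = [[1.0]]
while len(H) < d:
    H = [row + row for row in H] + [row + [-x for x in row] for row in H]
H = [[x / sqrt(d) for x in row] for row in H]
cols = [[H[r][c] for r in range(d)] for c in range(d)]
# WLOG step of the proof: flip signs so that <s, f_i> >= 0 for every column.
cols = [f if sum(si * fi for si, fi in zip(s, f)) >= 0 else [-x for x in f]
        for f in cols]

total = 0.0
for xi in product((-1, 1), repeat=m):
    eps = [sum(xi[j * n + i] for i in range(n)) for j in range(2 ** k)]
    # Group j contributes eps_j * <s, relu(f_j)> when eps_j >= 0, and 0 otherwise
    # (its column of F~ is zeroed out when eps_j < 0).
    val = sum(e * sum(si * max(fi, 0.0) for si, fi in zip(s, f))
              for e, f in zip(eps, cols) if e >= 0)
    total += val
avg = total / 2 ** m                               # exact expectation over xi

alpha_dot_beta = sum(s)
assert avg >= alpha_dot_beta * sqrt(2 * m) / 8 - 1e-9
```

For this instance the exact expectation is $2.5$ while the claimed lower bound is $\balpha^\top\bbeta\sqrt{2m}/8 = 2$, consistent with the inequality.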
We next use the general results in Lemma 12 to give specific results for the case $p = 2$.
Since the right hand side of the above inequality is greater than zero for $\mu \geq h$, it is true for every $\mu > 0$.
| # | Reference | Measure |
|---|---|---|
| (1) | Harvey et al. [9] | $\tilde{\Theta}(dh)$ |
| (2) | Bartlett and Mendelson [3] | $\tilde{\Theta}\left(\norm{\vecU}_{\infty,1}\norm{\vecV}_{\infty,1}\right)$ |
| (3) | Neyshabur et al. [20], Golowich et al. [7] | $\tilde{\Theta}\left(\norm{\vecU}_{F}\norm{\vecV}_{F}\right)$ |
| (4) | Bartlett et al. [4], Golowich et al. [7] | $\tilde{\Theta}\left(\norm{\vecU}_{2}\norm{\vecV-\vecV^0}_{1,2}+\norm{\vecU-\vecU^0}_{1,2}\norm{\vecV}_{2}\right)$ |
| (5) | Neyshabur et al. [23] | $\tilde{\Theta}\left(\norm{\vecU}_{2}\norm{\vecV-\vecV^0}_{F}+\sqrt{h}\norm{\vecU-\vecU^0}_{F}\norm{\vecV}_{2}\right)$ |
| (6) | Theorem 2 | $\tilde{\Theta}\left(\norm{\vecU^0}_{2}\norm{\vecV}_{F}+\norm{\vecU-\vecU^0}_{F}\norm{\vecV}_{F}+\sqrt{h}\right)$ |
We empirically investigate the role of over-parametrization in generalization of neural networks on three different datasets (MNIST, CIFAR10 and SVHN), and show that the existing complexity measures increase with the number of hidden units, and hence do not explain the generalization behavior under over-parametrization.
We prove tighter generalization bounds (Theorems 2 and 5) for two layer ReLU networks. Our proposed complexity measure actually decreases with the increasing number of hidden units, and can potentially explain the effect of over-parametrization on generalization of neural networks.
We provide a matching lower bound for the Rademacher complexity of two layer ReLU networks. Our lower bound considerably improves over the best known bound given in Bartlett et al. [4], and to our knowledge is the first such lower bound that is larger than the Lipschitz constant of the network class.
In Table 1 we compare our result with the existing generalization bounds, presented for the simpler setting of two layer networks. In comparison with the bound $\tilde{\Theta}\left(\norm{\vecU}_2\norm{\vecV-\vecV^0}_{1,2}+\norm{\vecU-\vecU^0}_{1,2}\norm{\vecV}_2\right)$ [4, 7]: the first term in their bound, $\norm{\vecU}_2\norm{\vecV-\vecV^0}_{1,2}$, is of smaller magnitude and behaves roughly similarly to the first term in our bound, $\norm{\vecU^0}_2\norm{\vecV}_F$ (see Figure 3, last two panels). The key complexity term in their bound is $\norm{\vecU-\vecU^0}_{1,2}\norm{\vecV}_2$, and in our bound it is $\norm{\vecU-\vecU^0}_F\norm{\vecV}_F$, for the range of $h$ considered. $\norm{\vecV}_2$ and $\norm{\vecV}_F$ differ by at most the number of classes, a small constant, and hence behave similarly. However, $\norm{\vecU-\vecU^0}_{1,2}$ can be as large as $\sqrt{h}\cdot\norm{\vecU-\vecU^0}_F$ when most hidden units have similar capacity, and the two are close only for really sparse networks. In fact, their bound increases with $h$ mainly because of this term $\norm{\vecU-\vecU^0}_{1,2}$. As we see in the first and second panels of Figure 3, the $\ell_1$ norm terms over hidden units appearing in Bartlett and Mendelson [3], Bartlett et al. [4], Golowich et al. [7] increase with the number of units, since the hidden layers learned in practice are usually dense.
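The dense-versus-sparse claim is just the $\ell_1$ versus $\ell_2$ relation applied to the vector of the $h$ row norms: always $\norm{\vecA}_{1,2}\le \sqrt{h}\norm{\vecA}_F$, with equality for equal row norms, and $\norm{\vecA}_{1,2}=\norm{\vecA}_F$ when only one row is non-zero. A small numerical check (the matrices below are arbitrary examples):

```python
from math import sqrt

def row_norms(A):
    return [sqrt(sum(x * x for x in row)) for row in A]

def norm_12(A):   # ||A||_{1,2}: l1 norm of the row l2 norms
    return sum(row_norms(A))

def norm_fro(A):  # ||A||_F: l2 norm of the row l2 norms
    return sqrt(sum(r * r for r in row_norms(A)))

dense = [[1.0, 2.0], [2.0, 1.0], [-1.0, 2.0], [2.0, -1.0]]   # rows of equal norm
sparse = [[3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]    # one active row
h = 4

# Always ||A||_{1,2} <= sqrt(h) * ||A||_F ...
assert norm_12(dense) <= sqrt(h) * norm_fro(dense) + 1e-12
assert norm_12(sparse) <= sqrt(h) * norm_fro(sparse) + 1e-12
# ... with equality when all rows have the same norm (dense units),
assert abs(norm_12(dense) - sqrt(h) * norm_fro(dense)) < 1e-12
# ... and ||A||_{1,2} = ||A||_F when only one row is non-zero (sparse units).
assert abs(norm_12(sparse) - norm_fro(sparse)) < 1e-12
```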
Neyshabur et al. [20] and Golowich et al. [7] showed a bound depending on the product of the Frobenius norms of the layers, which increases with h; this highlights the important role of initialization in our bounds. In fact, the proof technique of Neyshabur et al. [20] does not allow measuring norms relative to the initialization, and our new decomposition approach is the key to the tighter bound.
∀h=d≤m, s1,s2≥0, ∃𝒮∈ℝd×m such that ℛ𝒮(ℱ𝒲spec)=Ω(s1s2√h‖𝐗‖F/m).
where ∥⋅∥p,2 denotes the ℓp norm of the vector of row-wise ℓ2 norms.
where, for simplicity of notation, we let
$$ \phi_{\mathbf{V},\mathbf{U}}=\frac{2}{\gamma}\sum_{i=1}^{t-1}\sum_{j=1}^{h}\left((\rho_{ij}+\beta_{j}\left\lVert\mathbf{x}_{i}\right\rVert_{2})\left\langle\bm{\xi}_{i},\mathbf{v}_{j}\right\rangle+\xi_{i1}\alpha_{j}\left\langle\mathbf{u}_{j},\mathbf{x}_{i}\right\rangle\right)+\sum_{i=t+1}^{m}\xi_{i1}\ell_{\gamma}(\mathbf{V}[\mathbf{U}\mathbf{x}_{i}]_{+},y_{i}). $$
We will start with the case h=d=2^k, m=n⋅2^k for some k,n∈ℕ.
We will pick 𝐕=𝜶⊤=[α1 … α_{2^k}] for every 𝝃, and 𝒮={𝐱i}i=1m, where 𝐱i:=𝐞⌈i/n⌉. That is, the dataset is divided into 2^k groups, where each group consists of n copies of a different element of the standard orthonormal basis.
For arbitrary d=h≤m with d,h,m∈ℤ+, let k=⌊log2 d⌋, d′=h′=2^k, and m′=⌊m/2^k⌋⋅2^k. Then we have h′≥h/2 and m′≥m/2. Thus there exists S⊆[h] with |S|=h′ such that ∑i∈S αiβi ≥ (1/2)∑i=1h αiβi. Therefore we can pick h′ of the h hidden units, d′ of the d input dimensions, and m′ of the m input samples to construct a lower bound of (∑i∈S αiβi)√(2m′)/8 ≥ 𝜶⊤𝜷√m/16. ∎
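The construction above is easy to probe numerically. In the following sketch (ours, not from the paper), we drop the spectral-norm constraint and keep only the row constraints ∥𝐮j∥2≤βj; for this dataset the supremum over 𝐔 then has a closed form, and a Monte Carlo over the Rademacher signs illustrates the Ω(𝜶⊤𝜷√m) scaling. This illustrates the rate for the relaxed class only; it is not a proof of the theorem.

```python
import numpy as np

rng = np.random.default_rng(3)
k, n = 3, 64                      # h = d = 2^k units, m = n * 2^k samples
d = 2 ** k
m = n * d
alpha = np.ones(d)                # unit-wise output weights
beta = np.ones(d)                 # unit-wise row-norm budgets

# For the dataset x_i = e_{ceil(i/n)} and U0 = 0, keeping only the row
# constraints ||u_j||_2 <= beta_j, the supremum over U has the closed form
# (sum_j alpha_j * beta_j) * ||[eps]_+||_2, where eps_g is the signed count
# of the Rademacher variables falling in group g.
trials = 2000
vals = np.empty(trials)
for t in range(trials):
    xi = rng.choice([-1.0, 1.0], size=m)
    eps = xi.reshape(d, n).sum(axis=1)          # signed count per group
    vals[t] = (alpha @ beta) * np.linalg.norm(np.maximum(eps, 0.0))

emp = vals.mean() / m                           # estimate of the complexity
claimed = (alpha @ beta) * np.sqrt(2 * m) / (8 * m)   # the Omega(.) rate

print(emp >= claimed)             # the estimate sits above the claimed rate
```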



$$ m\mathcal{R}_{\mathcal{S}}(\ell_{\gamma}\circ\mathcal{F}_{\mathcal{W}})\leq\mathbb{E}_{\bm{\xi}_{i}\in\{\pm 1\}^{c},i\in[m]}\bigg[\sup_{(\mathbf{V},\mathbf{U})\in\mathcal{W}}\frac{2}{\gamma}\sum_{j=1}^{h}\left((\beta_{j}\left\lVert\mathbf{x}_{t}\right\rVert_{2}+\rho_{tj})\left\langle\bm{\xi}_{t},\mathbf{v}_{j}\right\rangle+\xi_{t1}\alpha_{j}\left\langle\mathbf{u}_{j},\mathbf{x}_{t}\right\rangle\right)+\phi_{\mathbf{V},\mathbf{U}}\bigg]. $$
$$ \begin{split}\mathcal{R}_{\mathcal{S}}(\ell_{\gamma}\circ\mathcal{F}_{\mathcal{W}})&\leq\frac{2\sqrt{2c}+2}{\gamma m}\left\lVert\bm{\alpha}\right\rVert_{2}\left(\left\lVert\bm{\beta}\right\rVert_{2}\left\lVert\mathbf{X}\right\rVert_{F}+\left\lVert\mathbf{U}^{0}\mathbf{X}\right\rVert_{F}\right)\\ &\leq\frac{2\sqrt{2c}+2}{\gamma m}h^{\frac{1}{2}-\frac{1}{p}}\left\lVert\bm{\alpha}\right\rVert_{p}\left(h^{\frac{1}{2}-\frac{1}{p}}\left\lVert\bm{\beta}\right\rVert_{p}\left\lVert\mathbf{X}\right\rVert_{F}+\left\lVert\mathbf{U}^{0}\mathbf{X}\right\rVert_{F}\right),\end{split} $$
$$ \begin{split}m\mathcal{R}_{\mathcal{S}}(\mathcal{F}_{\mathcal{W}_{2}})&\geq\mathbb{E}_{\bm{\xi}\sim\{\pm 1\}^{m}}\left[\sum_{i=1}^{m}\xi_{i}\mathbf{V}[\mathbf{U}(\bm{\xi})\mathbf{x}_{i}]_{+}\right]\\ &\geq\bm{\alpha}^{\top}\bm{\beta}\,2^{-\frac{k}{2}-1}\mathbb{E}_{\bm{\xi}\sim\{\pm 1\}^{m}}\left[\sum_{j=1}^{2^{k}}[\epsilon_{j}(\bm{\xi})]_{+}\right]\\ &=\bm{\alpha}^{\top}\bm{\beta}\,2^{\frac{k}{2}-1}\mathbb{E}_{\bm{\xi}\sim\{\pm 1\}^{n}}\left[[\epsilon_{1}(\bm{\xi})]_{+}\right]\\ &=\bm{\alpha}^{\top}\bm{\beta}\,2^{\frac{k}{2}-2}\mathbb{E}_{\bm{\xi}\sim\{\pm 1\}^{n}}\left[|\epsilon_{1}(\bm{\xi})|\right]\\ &\geq\bm{\alpha}^{\top}\bm{\beta}\,2^{\frac{k-1}{2}-2}\sqrt{n}\\ &=\frac{\bm{\alpha}^{\top}\bm{\beta}\sqrt{2d}}{8}\sqrt{\frac{m}{d}}\\ &=\frac{\bm{\alpha}^{\top}\bm{\beta}\sqrt{2m}}{8}\end{split} $$
$$ =\mathbb{E}_{\bm{\xi}_{i}\in\{\pm 1\}^{c},i\in[m]}\left[\sup_{\left\lVert\mathbf{V}-\mathbf{V}^{0}\right\rVert_{F}\leq r}\left\langle\mathbf{V}-\mathbf{V}^{0},\sum_{i=1}^{m}\bm{\xi}_{i}\mathbf{x}_{i}^{\top}\right\rangle\right]+\mathbb{E}_{\bm{\xi}_{i}\in\{\pm 1\}^{c},i\in[m]}\left[\sup_{\left\lVert\mathbf{V}^{0}\right\rVert_{F}\leq r}\left\langle\mathbf{V}^{0},\sum_{i=1}^{m}\bm{\xi}_{i}\mathbf{x}_{i}^{\top}\right\rangle\right] $$
$$ =r\sqrt{c}\left\lVert\mathbf{X}\right\rVert_{F}. $$
$$ \leq\mathbb{E}_{\bm{\xi}_{i}\in\{\pm 1\}^{c},i\in[m]}\left[\sup_{(\mathbf{V},\mathbf{U})\in\mathcal{W}}\frac{2}{\gamma}\sum_{i=1}^{t-1}\sum_{j=1}^{h}\left((\rho_{ij}+\beta_{j}\left\lVert\mathbf{x}_{i}\right\rVert_{2})\left\langle\bm{\xi}_{i},\mathbf{v}_{j}\right\rangle+\xi_{i1}\alpha_{j}\left\langle\mathbf{u}_{j},\mathbf{x}_{i}\right\rangle\right)+\sum_{i=t+1}^{m}\xi_{i1}\ell_{\gamma}(\mathbf{V}[\mathbf{U}\mathbf{x}_{i}]_{+},y_{i})\right] $$
$$ =\frac{1}{2}\mathbb{E}_{\bm{\xi}_{i}\in\{\pm 1\}^{c},i\in[m]}\left[\sup_{(\mathbf{V},\mathbf{U}),(\mathbf{V}',\mathbf{U}')\in\mathcal{W}}\ell_{\gamma}(\mathbf{V}[\mathbf{U}\mathbf{x}_{t}]_{+},y_{t})-\ell_{\gamma}(\mathbf{V}'[\mathbf{U}'\mathbf{x}_{t}]_{+},y_{t})+\phi_{\mathbf{V},\mathbf{U}}+\phi_{\mathbf{V}',\mathbf{U}'}\right] $$
$$ \left\lVert\mathbf{V}[\mathbf{U}\mathbf{x}_{t}]_{+}-\mathbf{V}'[\mathbf{U}'\mathbf{x}_{t}]_{+}\right\rVert_{2}\leq\left\lVert\mathbf{V}[\mathbf{U}\mathbf{x}_{t}]_{+}-\mathbf{V}'[\mathbf{U}\mathbf{x}_{t}]_{+}\right\rVert_{2}+\left\lVert\mathbf{V}'[\mathbf{U}\mathbf{x}_{t}]_{+}-\mathbf{V}'[\mathbf{U}'\mathbf{x}_{t}]_{+}\right\rVert_{2} $$
$$ \leq\sum_{j=1}^{h}(\beta_{j}\left\lVert\mathbf{x}_{t}\right\rVert_{2}+\rho_{tj})\left\lVert\mathbf{v}_{j}-\mathbf{v}'_{j}\right\rVert_{2}+\alpha_{j}\left\lvert\left\langle\mathbf{u}_{j},\mathbf{x}_{t}\right\rangle-\left\langle\mathbf{u}'_{j},\mathbf{x}_{t}\right\rangle\right\rvert. $$
Theorem 1. Given a training set 𝒮={𝐱i}i=1m and γ>0, the Rademacher complexity of the composition of the loss function ℓγ with the class ℱ𝒲 defined in equations (4) and (5) is bounded as follows:
$$ \mathcal{R}_{\mathcal{S}}(\ell_{\gamma}\circ\mathcal{F}_{\mathcal{W}})\leq\frac{2\sqrt{2c}+2}{\gamma m}\sum_{j=1}^{h}\alpha_{j}\left(\beta_{j}\left\lVert\mathbf{X}\right\rVert_{F}+\left\lVert\mathbf{u}_{j}^{0}\mathbf{X}\right\rVert_{2}\right)\qquad\text{(6)} $$
$$ \leq\frac{2\sqrt{2c}+2}{\gamma\sqrt{m}}\left\lVert\bm{\alpha}\right\rVert_{2}\left(\left\lVert\bm{\beta}\right\rVert_{2}\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left\lVert\mathbf{x}_{i}\right\rVert_{2}^{2}}+\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left\lVert\mathbf{U}^{0}\mathbf{x}_{i}\right\rVert_{2}^{2}}\right).\qquad\text{(7)} $$
Theorem 2. For any h≥2, γ>0, δ∈(0,1) and 𝐔0∈ℝh×d, with probability 1−δ over the choice of the training set 𝒮={𝐱i}i=1m⊂ℝd, for any function f(𝐱)=𝐕[𝐔𝐱]+ with 𝐕∈ℝc×h and 𝐔∈ℝh×d, the generalization error is bounded as follows:
$$ L_{0}(f)\leq\hat{L}_{\gamma}(f)+\tilde{O}\left(\frac{\sqrt{c}\left\lVert\mathbf{V}\right\rVert_{F}\left(\left\lVert\mathbf{U}-\mathbf{U}^{0}\right\rVert_{F}\left\lVert\mathbf{X}\right\rVert_{F}+\left\lVert\mathbf{U}^{0}\mathbf{X}\right\rVert_{F}\right)}{\gamma m}+\sqrt{\frac{h}{m}}\right) $$
$$ \leq\hat{L}_{\gamma}(f)+\tilde{O}\left(\frac{\sqrt{c}\left\lVert\mathbf{V}\right\rVert_{F}\left(\left\lVert\mathbf{U}-\mathbf{U}^{0}\right\rVert_{F}+\left\lVert\mathbf{U}^{0}\right\rVert_{2}\right)\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left\lVert\mathbf{x}_{i}\right\rVert_{2}^{2}}}{\gamma\sqrt{m}}+\sqrt{\frac{h}{m}}\right). $$
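Both weight-dependent terms of the Theorem 2 capacity are directly computable from the weight matrices. A minimal sketch (ours; random matrices stand in for a trained network and its initialization, and all sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
c, h, d = 10, 1000, 784  # hypothetical classes, hidden units, input dimension

# Hypothetical stand-ins: U0 plays the random initialization, U a nearby
# point such as one reached after training, V the output layer.
U0 = rng.standard_normal((h, d)) / np.sqrt(d)
U = U0 + 0.05 * rng.standard_normal((h, d)) / np.sqrt(d)
V = rng.standard_normal((c, h)) / np.sqrt(h)

# The two weight-dependent capacity terms of Theorem 2 (the common factor
# sqrt(c) * average-input-norm / (gamma * sqrt(m)) is omitted here):
term_init = np.linalg.norm(U0, 2) * np.linalg.norm(V)   # ||U0||_2 * ||V||_F
term_dist = np.linalg.norm(U - U0) * np.linalg.norm(V)  # ||U - U0||_F * ||V||_F

print(term_init, term_dist)
```

The point of the decomposition is that `term_dist` is measured from the initialization, so it can stay small, or even shrink, as h grows, unlike bounds built on the full norms of the layers.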
Theorem 3. Define the parameter set
$$ \mathcal{W}'=\left\{(\mathbf{V},\mathbf{U})\;\middle|\;\mathbf{V}\in\mathbb{R}^{1\times h},\;\mathbf{U}\in\mathbb{R}^{h\times d},\;\left\lVert\mathbf{v}_{j}\right\rVert\leq\alpha_{j},\;\left\lVert\mathbf{u}_{j}-\mathbf{u}_{j}^{0}\right\rVert_{2}\leq\beta_{j},\;\left\lVert\mathbf{U}-\mathbf{U}^{0}\right\rVert_{2}\leq\max_{j\in[h]}\beta_{j}\right\}, $$
and let ℱ𝒲′ be the function class defined on 𝒲′ by equation (5). Then, for any d=h≤m, {αj,βj}j=1h⊂ℝ+ and 𝐔0=𝟎, there exists 𝒮={𝐱i}i=1m⊂ℝd such that
$$ \mathcal{R}_{\mathcal{S}}(\mathcal{F}_{\mathcal{W}})\geq\mathcal{R}_{\mathcal{S}}(\mathcal{F}_{\mathcal{W}'})=\Omega\left(\frac{\sum_{j=1}^{h}\alpha_{j}\beta_{j}\left\lVert\mathbf{X}\right\rVert_{F}}{m}\right). $$
Corollary 4. For any h=d≤m and s1,s2≥0, there exists 𝒮∈ℝd×m such that
$$ \mathcal{R}_{\mathcal{S}}(\mathcal{F}_{\mathcal{W}}^{\mathrm{spec}})=\Omega\left(\frac{s_{1}s_{2}\sqrt{h}\left\lVert\mathbf{X}\right\rVert_{F}}{m}\right). $$
Lemma 7 (Proposition 6 of Maurer [16]). Let ξ1,…,ξd be Rademacher random variables and 𝝃=(ξ1,…,ξd). For any vector 𝐯∈ℝd, the following holds:
$$ \left\lVert\mathbf{v}\right\rVert_{2}\leq\sqrt{2}\,\mathbb{E}_{\bm{\xi}\sim\{\pm 1\}^{d}}\left[\left\lvert\left\langle\bm{\xi},\mathbf{v}\right\rangle\right\rvert\right]. $$
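Lemma 7 is easy to sanity-check numerically; a quick Monte Carlo sketch (ours; the vector and sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, trials = 20, 200000

v = rng.standard_normal(d)

# Monte Carlo estimate of E_xi |<xi, v>| over Rademacher sign vectors xi
xi = rng.choice([-1.0, 1.0], size=(trials, d))
est = np.abs(xi @ v).mean()

# Lemma 7 asserts ||v||_2 <= sqrt(2) * E|<xi, v>|
print(np.linalg.norm(v) <= np.sqrt(2) * est)
```

The constant √2 is tight for Khintchine-type inequalities (attained in two dimensions), while for generic high-dimensional v the expectation is closer to √(2/π)∥v∥2.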
Lemma 10 (ℓp covering lemma). Given any ϵ,β>0 and p≥2, consider the set S^D_{p,β}={𝐱∈ℝD ∣ ∥𝐱∥p≤β}. Then there exist N sets {Ti}i=1N of the form Ti={𝐱∈ℝD ∣ |xj|≤α^i_j, ∀j∈[D]} such that S^D_{p,β}⊆⋃i=1N Ti and ∥𝛂^i∥2 ≤ D^{1/2−1/p}β(1+ϵ) for all i∈[N], where N=\binom{K+D−1}{D−1} and K=⌈D/((1+ϵ)^p−1)⌉.
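The count N in Lemma 10 grows only polynomially in D for fixed ϵ. A small sketch tabulating it (ours; we read the lemma's K as ⌈D/((1+ϵ)^p−1)⌉, so treat the exact constant as indicative):

```python
from math import comb, ceil

def cover_size(D, p, eps):
    """Number of axis-aligned boxes N = C(K + D - 1, D - 1) in the l_p cover,
    reading K as ceil(D / ((1 + eps)**p - 1))."""
    K = ceil(D / ((1 + eps) ** p - 1))
    return comb(K + D - 1, D - 1)

# For fixed eps the cover size is polynomial in D: log N = O(D log(1/eps))
for D in (5, 10, 20):
    print(D, cover_size(D, 2, 0.5))
```

The binomial coefficient counts the nonnegative integer vectors summing to K, one per box in the cover, which is where the polynomial (rather than exponential) growth in D comes from.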




References
[1] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
[2] Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1--53, 2017.
[3] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463--482, 2002.
[4] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241--6250, 2017.
[5] Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte. Convex neural networks. In Advances in neural information processing systems, pages 123--130, 2006.
[6] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
[7] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.
[8] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6152--6160, 2017.
[9] Nick Harvey, Chris Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930, 2017.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770--778, 2016.
[11] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS), pages 1097--1105, 2012.
[13] Steve Lawrence, C Lee Giles, and Ah Chung Tsoi. What size neural network gives optimal generalization? Convergence properties of backpropagation. Technical report, U. of Maryland, 1998.
[14] Jaehoon Lee, Jascha Sohl-dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
[15] Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530, 2017.
[16] Andreas Maurer. A vector-contraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 3--17. Springer, 2016.
[17] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
[18] Vaishnavh Nagarajan and J. Zico Kolter. Generalization in deep networks: The role of distance from initialization. NIPS workshop on Deep Learning: Bridging Theory and Practice, 2017.
[19] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), 2015a.
[20] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015b.
[21] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. Proceedings of the International Conference on Learning Representations workshop track, 2015c.
[22] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems (NIPS), 2017.
[23] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
[24] Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, 2018.
[25] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
[26] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929--1958, 2014.
[27] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.