
Fast and Exact Enumeration of Deep Networks Partitions Regions

Abstract

One fruitful formulation of Deep Networks (DNs) enabling their theoretical study and providing practical guidelines to practitioners relies on Piecewise Affine Splines. In that realm, a DN's input-output mapping is expressed as a per-region affine mapping where those regions are implicitly determined by the model's architecture and form a partition of its input space. That partition --which is involved in all the results from this line of research-- has so far only been computed on $2/3$-dimensional slices of the DN's input space or estimated by random sampling. In this paper, we provide the first parallel algorithm that performs exact enumeration of the DN's partition regions. The proposed algorithm enables one to finally assess the closeness of the commonly employed approximation methods, e.g. based on random sampling of the DN input space. One of our key findings is that if one is only interested in regions with "large" volume, then uniform sampling of the space is highly efficient, but that if one is also interested in discovering the "small" regions of the partition, then uniform sampling is exponentially costly in the DN's input space dimension. On the other hand, our proposed method has complexity scaling linearly with input dimension and the number of regions.


Randall Balestriero rbalestriero@meta.com

Meta AI, FAIR NYC, USA


Introduction

Deep Networks (DNs) are compositions of linear and nonlinear operators that altogether form a differentiable functional f θ governed by some trainable parameters θ [1]. Understanding the underlying properties that make DNs the great function approximators that they are involves many different research directions, e.g. the underlying implicit regularization of architectures [2], or the impact of input and feature normalization on the optimization landscape [3]. Most existing results emerge from a few different mathematical formulations. One prominent example relies on kernels and emerges from pushing the DN's layer widths to infinity. In this case, and under some additional assumptions, a closed-form expression for the DN's underlying embedding-space metric is obtained [4]. With that form, training dynamics and generalization bounds are amenable to theoretical analysis [5]. Another line of research considers the case of deep linear networks, i.e. DNs without nonlinearities.

Yann LeCun

Meta AI, FAIR, NYU yann@meta.com NYC, USA

Fig. 1. Proposed exact region enumeration, depicted as an orange star, against sampling-based region discovery of the partition Ω, depicted as blue dots, for a single hidden layer DN with leaky-ReLU, random parameters, and width 64, as a function of computation time (x-axis) and number of partition regions found (y-axis); for a 4-dimensional input space at the top and an 8-dimensional input space at the bottom. The proposed Algorithm 1 dramatically outperforms the sampling-based search that has been used throughout recent studies on CPA DNs.


In this setting, it is possible to obtain the explicit regularizer that acts upon the DN's functional and that depends on the specifics of the architecture, e.g. depth and width [6]. Another direction, most relevant to our study, proposes to unravel the Continuous Piecewise Affine (CPA) mapping of standard DNs [7]. In short, one can combine the facts that (i) the nonlinearities present in most current DNs are themselves CPA, e.g. (leaky-)ReLU, absolute value, max-pooling, (ii) the interleaved affine mappings preserve the CPA property, and (iii) compositions of CPA mappings remain CPA. Thus, the entire input-output DN mapping is itself CPA. From that observation, it is possible to study the DN's loss landscape [8], the implicit regularizer of different architectures [9], the explicit probability distributions of CPA Deep Generative Networks [10, 11], the approximation rates [12, 13], or even the conditions for adversarial robustness guarantees [14, 15]. A striking benefit of the CPA viewpoint lies in the fact that it is an exact mathematical description of the DN's input-output mapping, without any approximation or simplification. This makes the obtained insights and guidelines highly relevant to improve currently deployed state-of-the-art architectures.

Despite this active recent development of CPA-based results around DNs, one key challenge remains open. Because under this view one expresses the DN mapping as a collection of affine mappings -one for each region ω of some partition Ω of its input space- it becomes crucial to compute that partition Ω, or at least to infer some statistics from it. Current analytical characterizations of Ω are insufficient, e.g. existing bounds on the number of regions in Ω are known to be loose and uninformative [16]. As such, practitioners resort to simple approximation strategies, e.g. sampling, to estimate such properties of Ω. Another approach is to only consider 2/3-dimensional slices of the DN's input space and estimate Ω restricted to that subspace. All in all, nothing is known yet about how accurately those approximations convey the underlying properties of the entire partition Ω that current theoretical results heavily rely on. In particular, [17] uses estimates of the partition's number of regions to perform Neural Architecture Search (NAS), for which exact computation of the DN's partition regions would further improve the NAS; [11] uses estimates of the partition to adapt the distribution of deep generative networks (e.g. variational autoencoders), for which exact computation of the partition would make their method exact rather than an approximation.

In this paper, we propose a principled and provable enumeration method for DN partitions (Algorithm 1) that we first develop for a layer-wise analysis in Section 2 and then extend to the multilayer case in Section 3. As depicted in Fig. 1, the proposed method becomes exponentially faster than the sampling-based strategy at discovering the regions ω ∈ Ω as the input dimensionality increases. Practically, the proposed enumeration method enables one, for the first time, to measure the accuracy of the currently employed approximations. Our method is efficiently implemented in a few lines of code, leverages parallel computations, and provably enumerates all the regions of the DN's partition. Lastly, our method has linear asymptotic complexity with respect to the number of regions and with respect to the DN's input space dimension. This property is crucial, as we will demonstrate that sampling-based enumeration has complexity growing exponentially with the DN's input space dimension, a direct consequence of the curse of dimensionality [18, 19]. We hope that our method will serve as the baseline algorithm for any application requiring provable partition region enumeration, or to assess the theoretical findings obtained from the CPA formulation of DNs.

Enumeration of Single-Layer Partitions

We now develop the enumeration algorithm for a single DN layer. Because a DN's partition is built by recursively subdividing the per-layer partitions, the single-layer case will be enough to iteratively compute the partition of a multilayer DN, as shown in Section 3.

Layer Partitions and Hyperplane Arrangements

We denote the input-output mapping of a single DN layer 1 as f θ : R D → R K , with θ the parameters of the mapping. Without loss of generality, we consider vectors as inputs since, when dealing with images, one can always flatten them into vectors and reparametrize the layer accordingly. The layer mapping takes the form

$$ f_{\theta}(x)=\sigma(h(x)) \quad \text{with} \quad h(x)=Wx+b, \tag{1} $$

where σ is a pointwise activation function, W is a weight matrix of dimensions K × D , b is a bias vector of length K , h ( x ) denotes the pre-activation map, and x is some input from R D . The layer parameters are thus θ ≜ { W , b }. Although simple, Eq. (1) encompasses most current DN layers by specifying the correct structural constraints on the matrix W , e.g. circulant for a convolutional layer. The details of the layer mapping will not impact our results. The CPA view of DNs [20, 7] consists in expressing Eq. (1) as

$$ f_{\theta}(x) = \sum_{\omega \in \Omega}\left(A_{\omega}x+b_{\omega}\right)1_{\{x \in \omega\}}, \tag{2} $$

where Ω is the layer's input space partition [21]. Understanding the form of Ω will greatly help the development of the enumeration algorithm in Section 2.2. Given nonlinearities σ such as (leaky-)ReLU or absolute value, it is straightforward to see that the layer stays affine exactly on regions ω within which all inputs have the same pre-activation signs. That is, a region is entirely and uniquely determined by those sign patterns
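This sign-pattern characterization is easy to probe numerically: sampling inputs, computing sign(Wx + b), and counting the distinct patterns found is exactly the sampling-based region discovery used as a baseline throughout the paper. Below is a minimal NumPy sketch (not the authors' code; the layer configuration is illustrative):

```python
import numpy as np

# Each input x belongs to the region of the layer's partition identified by
# the sign pattern of its pre-activation h(x) = Wx + b. Counting distinct
# patterns over random samples is the sampling-based region discovery.
rng = np.random.default_rng(0)
D, K = 2, 2
W = np.eye(K, D)          # two hyperplanes: x_1 = 0 and x_2 = 0
b = np.zeros(K)

X = rng.standard_normal((10_000, D))          # random probes of the input space
patterns = np.sign(X @ W.T + b).astype(int)   # one sign vector q per sample
n_found = len({tuple(q) for q in patterns})
print(n_found)  # 4: the two axis-aligned hyperplanes split R^2 into 4 quadrants
```

As the paper argues, such sampling quickly finds the "large" regions but misses "small" ones as the input dimension grows.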

$$ f_{\theta}\ \text{affine on}\ \omega \iff \operatorname{sign}(h(x))=\operatorname{sign}(h(x')),\ \forall (x,x') \in \omega^{2}, $$

where the equality is to be understood elementwise on all of the K entries of the sign vectors. The only exception arises for degenerate weights W, which we do not consider since any arbitrarily small perturbation of such degeneracies removes those

1 Without loss of generality, we consider the first layer; the exact same analysis applies to any layer of the DN when looking at the partition of its own input space.

edge cases. From the above observation alone, it becomes clear that the transition between different regions of Ω must occur when a pre-activation sign for some unit k ∈ { 1 , . . . , K } changes, and because h is nothing more than an affine mapping, this sign change for some unit k can only occur when crossing the hyperplane

$$ \mathcal{H}_k \triangleq \{x \in \mathbb{R}^{D} : \langle W_{k,\cdot}, x\rangle + b_k = 0\}. \tag{3} $$

Leveraging Eq. (3), we obtain that ∂Ω, the boundary of the layer's partition, is a hyperplane arrangement: $\partial\Omega = \bigcup_{k=1}^{K}\mathcal{H}_k$.

We are now able to leverage this particular structure of the layer's partition to present an enumeration algorithm that will recursively search for all the regions ω ∈ Ω .

Region Enumeration Algorithm

From the previous understanding that the layer's partition arises from the hyperplane arrangement of Eq. (3), we can now leverage and adapt existing enumeration methods for such partitions to obtain all the regions ω ∈ Ω, from which it will become trivial to consider the multilayer case in the following Section 3.

Enumerating the regions of the layer f θ 's partition can be done efficiently by adapting existing reverse search algorithms [22] optimized for hyperplane arrangements. In fact, a naive approach of enumerating all of the 2 K possible sign patterns q ∈ {-1 , 1 } K and checking if each defines a non-empty region

$$ \omega(q) \triangleq \bigcap_{k=1}^{K}\{x \in \mathbb{R}^{D} : (\langle W_{k,\cdot}, x\rangle + b_k)\, q_k \geq 0\} $$

would be largely wasteful. In fact, most such sign combinations produce empty regions; e.g. if the arrangement is central, i.e. the intersection of all the hyperplanes is not empty, then the total number of regions grows linearly with K [23] and is thus much smaller than 2 K . Instead, a much more efficient strategy is to only explore feasible sign patterns in a recursive tree-like structure. To do so, the algorithm recursively subdivides a parent region by the hyperplane of unit k. If that hyperplane does not intersect the current region, we can skip unit k and recurse the subdivision of that same region with unit k +1. On the other hand, if hyperplane k divides the current region, we consider both sides of it and keep the recursion going on both sides. We formally summarize the method in Algorithm 1 and present an illustrative example and a comparison against sampling-based region enumeration in Fig. 1. In particular, we report the efficiency of the sampling solution for various configurations in Table 1.

Enumeration of Multi-Layer Partitions

This section demonstrates how the derivation carried out in Section 2 for the single layer setting is sufficient to enumerate the partition of a multilayer DN, thanks to the subdivision

Algorithm 1 Proposed region enumeration method for the single hidden layer case, which recursively searches over the feasible sign patterns q one unit at a time and only explores the branches that coincide with non-empty regions, i.e. avoiding the 2 K total number of possible combinations. The step checking for the intersection between a hyperplane and a given polytopal region can be done easily by setting up a linear program with a dummy constant objective, the hyperplane as an equality constraint, and the polytopal region as inequality constraints; the feasibility check fails if and only if the intersection is empty. This algorithm is used to produce the results in Fig. 1 and Table 1. It terminates once all the regions of the partition Ω have been visited.

Require: $W \in \mathbb{R}^{K \times D}$, $b \in \mathbb{R}^{K}$, $k \in \{1,\dots,K\}$, $q \in \{-1,0,1\}^{k}$

  • 1: if k = K + 1 then this branch has reached a leaf; the sign pattern q is feasible and can be accumulated into Ω 's current estimate
  • 2: Check if the hyperplane defined by ( w k , b k ) intersects the polytopal region defined by $\bigcap_{j=1}^{k-1}\{ x \in \mathbb{R}^{D} : (\langle w_j, x\rangle + b_j)\, q_j \geq 0\}$
  • 3: if NO then unit k is redundant; call the routine again with [ q , 0] as q and k + 1 as k
  • 4: if YES then unit k splits the region into two; call the routine again with [ q , 1] and k + 1, and with [ q , -1] and k + 1
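The recursion above can be sketched in a few lines of Python. The following is our illustrative re-implementation, not the authors' released code: the function names are ours, and we realize the feasibility check exactly as the caption describes, via `scipy.optimize.linprog` with a constant zero objective, the hyperplane as an equality constraint, and the region as inequality constraints:

```python
import numpy as np
from scipy.optimize import linprog

def hyperplane_cuts_region(W, b, q, k):
    """LP feasibility check: does hyperplane k ({x : w_k.x + b_k = 0}) intersect
    the polytope { x : (w_j.x + b_j) q_j >= 0 for j < k }?"""
    D = W.shape[1]
    active = [j for j in range(k) if q[j] != 0]
    # (w_j.x + b_j) q_j >= 0  rewritten as  -q_j w_j . x <= q_j b_j
    A_ub = np.array([-q[j] * W[j] for j in active]) if active else None
    b_ub = np.array([q[j] * b[j] for j in active]) if active else None
    res = linprog(np.zeros(D), A_ub=A_ub, b_ub=b_ub,
                  A_eq=W[k][None, :], b_eq=[-b[k]],
                  bounds=[(None, None)] * D, method="highs")
    return res.status == 0  # 0: feasible, i.e. the hyperplane cuts the region

def enumerate_regions(W, b):
    """Recursively explore feasible sign patterns, one unit at a time."""
    K = W.shape[0]
    regions = []
    def recurse(q, k):
        if k == K:                      # leaf: q identifies a non-empty region
            regions.append(tuple(q))
            return
        if hyperplane_cuts_region(W, b, q, k):
            recurse(q + [1], k + 1)     # one region on each side of hyperplane k
            recurse(q + [-1], k + 1)
        else:
            recurse(q + [0], k + 1)     # unit k is redundant on this region
    recurse([], 0)
    return regions

# Three lines in general position in R^2 yield 1 + 3 + 3 = 7 regions
W = np.array([[1., 0.], [0., 1.], [1., 1.]])
b = np.array([0., 0., -1.])
print(len(enumerate_regions(W, b)))  # 7
```

Note the independent branches of the recursion can be explored in parallel, which is what gives the parallel algorithm referred to in the abstract; the sketch above keeps things sequential for clarity.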

Table 1. Comparison of our exact enumeration method against sampling-based partition discovery for various single layer configurations with random weights and biases. The sampling-based discovery is run 5 times and we report the average and standard deviation of the number of regions found. The number of input space samples is chosen so that the computation time of the sampling method matches that of the proposed method, i.e. for each configuration both methods run for the exact same amount of time. We observe that for low-dimensional input spaces, and with the same fixed time budget, both methods perform similarly and sampling is sufficient to quickly discover the layer's partition.

process under which the composition of many layers ultimately forms the global DN's input space partition. We first recall this subdivision step in Section 3.1 and summarize the enumeration algorithm in Section 3.2.

Fig. 2. Depiction of the multilayer case, which corresponds to a union of region-constrained hyperplane arrangements and thus can be studied directly with the proposed hyperplane arrangement region enumeration. The only additional step is to first enforce that the search takes place on the restricted region of interest from the up-to-layer-ℓ input space partition. For example, in the left column one first obtains the first layer partition, depicted in black. Each of the enumerated regions will be subdivided by the next layer: pick any region of interest, compose the per-region affine mapping (fixed on that region) with the second layer's affine mapping, and repeat the region enumeration algorithm. This discovers the subdivision done by the second layer, highlighted in blue in the middle column. This can be repeated to obtain the subdivision of the third layer, highlighted in red in the right column.


Deep Networks are Continuous Piecewise Affine

We specialize the per-layer notation from Section 2 by making the layer index ℓ explicit: f ( ℓ ) for the layer mapping and θ ( ℓ ) for its parameters; the entire DN's input-output mapping is now referred to as f θ : R D → R K with K the output space dimension. The composition of layers takes the form

$$ f_{\theta}=\left(f^{(L)}_{\theta^{(L)}} \circ \dots \circ f^{(1)}_{\theta^{(1)}}\right), \tag{4} $$

where each layer mapping f ( ℓ ) : R D ( ℓ ) → R D ( ℓ +1) produces a feature map, with D (1) ≜ D and D ( L +1) ≜ K ; each layer mapping is given by Eq. (1), and h ( ℓ ) denotes the pre-activation map of layer ℓ . A key result from [20, 7] is that the DN mapping is itself defined on a partition, as in

$$ f_{\theta}(x) = \sum_{\omega \in \Omega}\left(A_{\omega}x+b_{\omega}\right)1_{\{x \in \omega\}}, $$

which is known to be recursively built by each layer subdividing the previously built partition of the space [21].

Enumerating Union of Hyperplane Arrangements

Considering an arbitrarily deep model can be tackled by understanding the recurrent subdivision process for a two hidden layer DN and applying the same principle successively. In this setting, notice that for the (two-layer) DN to be affine within some region ω of the DN's input space, each layer must stay affine as well. By composition, the first layer staying affine does not ensure that the DN stays affine, but the first layer being nonlinear does imply that the entire DN is nonlinear. From that, we see that the first layer's partition regions are 'coarser' than the entire DN's partition regions. More precisely, and following the derivation of [21], each layer recursively subdivides the previously built partition: in our case, we need to search, within each region ω of the first layer's partition, for the regions where the second layer stays affine. As a result, the proposed single hidden layer enumeration method from Section 2 can be applied recursively as follows. First, compute the first layer's partition enumeration. Then, for each enumerated region with corresponding sign pattern q , define a new single layer model with h ( x ) ≜ σ ( W (2) diag( q ) W (1) x + W (2) ( q ⊙ b (1) ) + b (2) ) and apply the single layer enumeration within ω ; repeat the process for all regions -and corresponding sign patterns q - of the previously found first layer partition. This enumerates the partition of ( f (2) ◦ f (1) ), and the same process can be repeated as many times as there are layers in the DN, as illustrated in Fig. 2.
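As a sanity check of this composition step, the effective affine parameters of the second layer on a first-layer region can be formed explicitly. The sketch below uses our own notation and takes σ to be the absolute value, for which the per-region mapping is exactly diag(q) h as in the formula above:

```python
import numpy as np

def composed_layer_params(W1, b1, W2, b2, q):
    """Affine parameters of the second layer's pre-activation, seen from the
    DN's input space, on the first-layer region with sign pattern q. For the
    absolute-value activation, sigma(h) = diag(q) h on that region, hence
    h2(x) = W2 diag(q) W1 x + W2 (q * b1) + b2."""
    q = np.asarray(q, dtype=float)
    return W2 @ (q[:, None] * W1), W2 @ (q * b1) + b2

rng = np.random.default_rng(1)
D, K1, K2 = 3, 5, 4
W1, b1 = rng.standard_normal((K1, D)), rng.standard_normal(K1)
W2, b2 = rng.standard_normal((K2, K1)), rng.standard_normal(K2)

x = rng.standard_normal(D)
q = np.sign(W1 @ x + b1)                      # first-layer region containing x
W_eff, b_eff = composed_layer_params(W1, b1, W2, b2, q)
# On x's region, the composed affine map agrees with the exact forward pass
assert np.allclose(W_eff @ x + b_eff, W2 @ np.abs(W1 @ x + b1) + b2)
```

Feeding `W_eff, b_eff` to the single layer enumeration, restricted to the region of `q`, yields the second layer's subdivision of that region.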

Conclusion and Future Work

In this paper, we provided the first exact enumeration method for Deep Network partitions, relying on existing highly efficient enumeration methods for hyperplane arrangements. In fact, both shallow and deep architectures produce partitions that correspond to hyperplane arrangements or unions of restricted hyperplane arrangements. A crucial finding enabled by the proposed method is that sampling-based region enumeration, which is the only strategy used in current research studies dealing with DNs and affine splines, is in fact relatively poor at finding the regions of the DN's partition. In particular, when using such sampling to estimate sensitive statistics, e.g. the volume of the smallest region, sampling is biased and should be avoided in favor of an exact enumeration method.



References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, 'Deep learning,' Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[2] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro, 'In search of the real inductive bias: On the role of implicit regularization in deep learning,' arXiv preprint arXiv:1412.6614, 2014.
[3] Yann Le Cun, Ido Kanter, and Sara A Solla, 'Eigenvalues of covariance matrices: Application to neural-network learning,' Physical Review Letters, vol. 66, no. 18, pp. 2396, 1991.
[4] Arthur Jacot, Franck Gabriel, and Clément Hongler, 'Neural tangent kernel: Convergence and generalization in neural networks,' arXiv preprint arXiv:1806.07572, 2018.
[5] Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao, 'Why do deep residual networks generalize better than deep feedforward networks? A neural tangent kernel perspective,' Advances in Neural Information Processing Systems, vol. 33, pp. 2698-2709, 2020.
[6] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro, 'The implicit bias of gradient descent on separable data,' The Journal of Machine Learning Research, vol. 19, no. 1, pp. 2822-2878, 2018.
[7] Randall Balestriero and Richard Baraniuk, 'A spline theory of deep learning,' in International Conference on Machine Learning. PMLR, 2018, pp. 374-383.
[8] Rudolf H Riedi, Randall Balestriero, and Richard G Baraniuk, 'Singular value perturbation and deep network optimization,' arXiv preprint arXiv:2203.03099, 2022.
[9] Randall Balestriero and Richard G Baraniuk, 'From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference,' arXiv preprint arXiv:1810.09274, 2018.
[10] Randall Balestriero, Sébastien Paris, and Richard Baraniuk, 'Analytical probability distributions and exact expectation-maximization for deep generative networks,' Advances in Neural Information Processing Systems, vol. 33, pp. 14938-14949, 2020.
[11] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk, 'Polarity sampling: Quality and diversity control of pre-trained generative networks via singular values,' in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10641-10650.
[12] Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana Petrova, 'Nonlinear approximation and (deep) ReLU networks,' Constructive Approximation, vol. 55, no. 1, pp. 127-172, 2022.
[13] Randall Balestriero and Richard G Baraniuk, 'Batch normalization explained,' arXiv preprint arXiv:2209.14778, 2022.
[14] Lily Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane Boning, and Inderjit Dhillon, 'Towards fast computation of certified robustness for ReLU networks,' in International Conference on Machine Learning. PMLR, 2018, pp. 5276-5285.
[15] Aditi Raghunathan, Jacob Steinhardt, and Percy S Liang, 'Semidefinite relaxations for certifying robustness to adversarial examples,' Advances in Neural Information Processing Systems, vol. 31, 2018.
[16] Herbert Edelsbrunner, Algorithms in Combinatorial Geometry, vol. 10, Springer Science & Business Media, 1987.
[17] Wuyang Chen, Xinyu Gong, and Zhangyang Wang, 'Neural architecture search on ImageNet in four GPU hours: A theoretically inspired perspective,' arXiv preprint arXiv:2102.11535, 2021.
[18] Richard E Bellman and Stuart E Dreyfus, Applied Dynamic Programming, vol. 2050, Princeton University Press, 2015.
[19] Mario Köppen, 'The curse of dimensionality,' in 5th Online World Conference on Soft Computing in Industrial Applications (WSC5), 2000, vol. 1, pp. 4-8.
[20] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio, 'On the number of linear regions of deep neural networks,' Advances in Neural Information Processing Systems, vol. 27, 2014.
[21] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk, 'The geometry of deep networks: Power diagram subdivision,' Advances in Neural Information Processing Systems, vol. 32, 2019.
[22] David Avis and Komei Fukuda, 'Reverse search for enumeration,' Discrete Applied Mathematics, vol. 65, no. 1-3, pp. 21-46, 1996.
[23] Richard P Stanley et al., 'An introduction to hyperplane arrangements,' Geometric Combinatorics, vol. 13, no. 389-496, pp. 24, 2004.

input dim \ width        K=16     K=32     K=64      K=128       K=256
D=2   enumeration          16       13       71        127         631
      sampling           16±0     13±0     67±0      127±0       611±2
      samp. found        100%     100%      94%       100%         96%
D=4   enumeration          54       80     1107       4271       95954
      sampling           51±0     69±1    866±3    3288±18    70635±55
      samp. found         94%      86%      78%        77%         73%
D=8   enumeration          24     1242     8396     386566           -
      sampling           18±0    543±2   2875±5 136748±251           -
      samp. found         75%      44%      34%        35%           -

One fruitful formulation of Deep Networks (DNs) enabling their theoretical study and providing practical guidelines to practitioners relies on Piecewise Affine Splines. In that realm, a DN’s input-mapping is expressed as per-region affine mapping where those regions are implicitly determined by the model’s architecture and form a partition of their input space. That partition –which is involved in all the results spanned from this line of research– has so far only been computed on 2/3232/3-dimensional slices of the DN’s input space or estimated by random sampling. In this paper, we provide the first parallel algorithm that does exact enumeration of the DN’s partition regions. The proposed algorithm enables one to finally assess the closeness of the commonly employed approximations methods, e.g. based on random sampling of the DN input space. One of our key finding is that if one is only interested in regions with “large” volume, then uniform sampling of the space is highly efficient, but that if one is also interested in discovering the “small” regions of the partition, then uniform sampling is exponentially costly with the DN’s input space dimension. On the other hand, our proposed method has complexity scaling linearly with input dimension and the number of regions.

Deep Networks (DNs) are compositions of linear and nonlinear operators altogether forming a differentiable functional f𝜽subscript𝑓𝜽f_{{\bm{\theta}}} governed by some trainable parameters 𝜽𝜽{\bm{\theta}} [1]. Understanding the underlying properties that make DNs the great function approximators that they are involve many different research directions e.g. the underlying implicit regularization of architectures [2], or the impact of input and feature normalization into the optimization landscape [3]. Most existing results emerge from a few different mathematical formulations. One eponymous example relies on kernels and emerges from pushing the DN’s layers width to infinity. In this case, and under some additional assumptions, a closed-form expression for the DN’s underlying embedding space metric is obtained [4]. With that form, training dynamics and generalization bounds are amenable to theoretical analysis [5]. Another line of research considers the case of deep linear networks i.e. a DN without nonlinearities. In this setting, it is possible to obtain the explicit regularizer that acts upon the DN’s functional and that depends on the specifics of the architecture e.g. depth and with [6]. Another direction, most relevant to our study, proposes to unravel the Continuous Piecewise Affine (CPA) mapping of standard DNs [7]. In short, one can combine the fact that (i) the nonlinearities present in most current DNs are themselves CPA e.g. (leaky-)ReLU, absolute value, max-pooling, (ii) the interleaved affine mappings preserve the CPA property, and (iii) composition of CPA mappings remain CPA. Thus, the entire input-output DN is itself a CPA. From that observation, it is possible to study the DN’s loss landscape [8], the implicit regularizer of different architectures [9], the explicit probabilistic distributions of CPA Deep Generative Networks [10, 11], the approximation rates [12, 13], or even the conditions for adversarial robustness guarantees [14, 15]. 
A striking benefit of the CPA viewpoint lies in the fact that it is an exact mathematical description of the DN’s input-output mapping without any approximation nor simplification. This makes the obtained insights and guidelines highly relevant to improve currently deployed state-of-the-art architectures.

Despite this active recent development of CPA-based results around DNs, one key challenge remains open. In fact, because under this view one expresses the DN mapping as a collection of affine mappings (one for each region $\omega$ of some partition $\Omega$ of their input space), it becomes crucial to compute that partition $\Omega$, or at least infer some statistics from it. Current analytical characterizations of $\Omega$ are in fact insufficient, e.g. existing bounds characterizing the number of regions in $\Omega$ are known to be loose and uninformative [16]. As such, practitioners resort to simple approximation strategies, e.g. sampling, to estimate such properties of $\Omega$. Another approach is to only consider 2/3-dimensional slices of the DN's input space and estimate $\Omega$ restricted to that subspace. All in all, nothing is known yet about how accurate those approximations are at conveying the underlying properties of the entire partition $\Omega$ that current theoretical results heavily rely on. In particular, [17] uses estimates of the partition's number of regions to perform Neural Architecture Search (NAS), for which exact computation of the DN's partition regions would further improve the NAS; [11] uses estimates of the partition to adapt the distribution of deep generative networks (e.g. variational autoencoders), for which exact computation of the partition would make their method exact rather than an approximation.

In this paper, we propose a principled and provable enumeration method for DN partitions (Algorithm 1) that we first develop for a layer-wise analysis in Section 2 and then extend to the multilayer case in Section 3. As depicted in Fig. 1, the proposed method becomes exponentially faster than the sampling-based strategy at discovering the regions $\omega\in\Omega$ as the input dimensionality increases. Practically, the proposed enumeration method enables for the first time to measure the accuracy of the currently employed approximations. Our method is efficiently implemented in a few lines of code, leverages parallel computations, and provably enumerates all the regions of the DN's partition. Lastly, our method has linear asymptotic complexity with respect to the number of regions and with respect to the DN's input space dimension. This property is crucial as we will demonstrate that sampling-based enumeration has complexity growing exponentially with the DN's input space dimension, a direct consequence of the curse of dimensionality [18, 19]. We hope that our method will serve as the baseline algorithm for any application requiring provable partition region enumeration, or to assess the theoretical findings obtained from the CPA formulation of DNs.

We now develop the enumeration algorithm for a single DN layer. Because a DN's partition is built by recursively subdividing the per-layer partitions, the single-layer case will be enough to iteratively compute the partition of a multilayer DN, as shown in the next Section 3.

We denote the input-output mapping of a single DN layer (without loss of generality the first layer, although the exact same analysis applies to any layer in the DN when looking at the partition of its own input space) as $f_{\bm{\theta}}:\mathbb{R}^{D}\mapsto\mathbb{R}^{K}$, with $\bm{\theta}$ the parameters of the mapping. Without loss of generality, we consider vectors as inputs since, when dealing with images, one can always flatten them into vectors and reparametrize the layer accordingly. The layer mapping takes the form

where $\sigma$ is a pointwise activation function, $\bm{W}$ is a weight matrix of dimensions $K\times D$, $\bm{b}$ is a bias vector of length $K$, $\bm{h}(\bm{x})$ denotes the pre-activation map, and lastly $\bm{x}$ is some input from $\mathbb{R}^{D}$. The layer parameters are thus $\bm{\theta}\triangleq\{\bm{W},\bm{b}\}$. Although simple, Eq. 1 encompasses most current DN layers by specifying the correct structural constraints on the matrix $\bm{W}$, e.g. to be circulant for a convolutional layer. The details of the layer mapping will not impact our results. The CPA view of DNs [20, 7] consists in expressing Eq. 1 as

where $\Omega$ is the layer's input space partition [21]. Understanding the form of $\Omega$ will greatly help the development of the enumeration algorithm in Section 2.2. Given nonlinearities $\sigma$ such as (leaky-)ReLU or absolute value, it is straightforward to see that the layer stays affine on a region $\omega$ exactly when all the inputs within it have the same pre-activation signs. That is, a region is entirely and uniquely determined by those sign patterns

where the equality is to be understood elementwise on all of the $K$ entries of the sign vectors. The only exception arises for degenerate weights $\bm{W}$, which we do not consider since any arbitrarily small perturbation of such degeneracies removes those edge cases. From the above observation alone, it becomes clear that the transition between different regions of $\Omega$ must occur when a pre-activation sign for some unit $k\in\{1,\dots,K\}$ changes, and because $\bm{h}$ is nothing but an affine mapping, this sign change for some unit $k$ can only occur when crossing the hyperplane

Leveraging Eq. 3 we obtain that $\partial\Omega$, the boundary of the layer's partition, is a hyperplane arrangement: $\partial\Omega=\bigcup_{k=1}^{K}{\mathbb{H}}_{k}$.
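The sign-pattern characterization above can be checked numerically. The following toy sketch (made-up random weights; the name `sign_pattern` is ours) shows that the region containing a point is read off from its pre-activation signs, and that a perturbation small enough to cross no hyperplane $\mathbb{H}_k$ leaves those signs unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 5, 3                       # 5 units, 3-dimensional input space
W = rng.standard_normal((K, D))   # hypothetical random layer weights
b = rng.standard_normal(K)

def sign_pattern(x):
    # The region of the partition containing x is uniquely identified
    # by the signs of the K pre-activations <W_k, x> + b_k.
    return np.sign(W @ x + b).astype(int)

x = rng.standard_normal(D)
# A tiny perturbation (almost surely) crosses no hyperplane H_k,
# hence stays inside the same region omega:
assert np.array_equal(sign_pattern(x), sign_pattern(x + 1e-9))
```

Two inputs belong to the same region of $\Omega$ exactly when their sign patterns coincide, which is the test the enumeration algorithm of Section 2.2 builds upon.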

We are now able to leverage this particular structure of the layer's partition to present an enumeration algorithm that will recursively search for all the regions $\omega\in\Omega$.

From the previous understanding that the layer's partition arises from a hyperplane arrangement involving Eq. 3, we can now leverage and adapt existing enumeration methods for such partitions to obtain all the regions $\omega\in\Omega$, from which it will become trivial to handle the multilayer case that we leave for the following Section 3.

Enumerating the regions of the layer $f_{\bm{\theta}}$'s partition can be done efficiently by adapting existing reverse search algorithms [22] optimized for hyperplane arrangements. In fact, a naive approach of enumerating all of the $2^{K}$ possible sign patterns $\bm{q}\in\{-1,1\}^{K}$ and checking if each defines a non-empty region

would be largely wasteful. In fact, most such sign combinations produce empty regions; e.g. if the arrangement is central, i.e. the intersection of all the hyperplanes is not empty, then the total number of regions grows linearly with $K$ [23] and is thus much smaller than $2^{K}$. Instead, a much more efficient strategy is to only explore feasible sign patterns in a recursive tree-like structure. To do so, the algorithm recursively subdivides a parent region by the hyperplane of unit $k$. If that hyperplane does not intersect the current region, then we can skip unit $k$ and recurse the subdivision of that same region with unit $k+1$. On the other hand, if hyperplane $k$ divides the current region, we consider both sides of it and keep the recursion going on both sides. We formally summarize the method in Algorithm 1 and present one illustrative example and comparison against sampling-based region enumeration in Fig. 1. In particular, we provide the efficiency of the sampling solution for various configurations in Table 1.
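The contrast between the naive $2^K$ scan and the recursive sign-fixing search can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1 itself: the names (`region_nonempty`, `naive_enumeration`, `enumerate_regions`), the LP-based feasibility test, and the bounding box $[-\text{box},\text{box}]^D$ are our own assumptions.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def region_nonempty(signs, W, b, box=10.0, eps=1e-7):
    """LP feasibility: is there an x in [-box, box]^D with
    signs[k] * (<W_k, x> + b_k) >= eps for the first len(signs) units?"""
    k = len(signs)
    if k == 0:
        return True
    q = np.asarray(signs, dtype=float)
    # q_k(<W_k, x> + b_k) >= eps  <=>  (-q_k W_k) @ x <= q_k b_k - eps
    res = linprog(np.zeros(W.shape[1]),
                  A_ub=-q[:, None] * W[:k], b_ub=q * b[:k] - eps,
                  bounds=[(-box, box)] * W.shape[1])
    return res.success

def naive_enumeration(W, b):
    """Wasteful baseline: test every one of the 2^K candidate sign patterns."""
    K = W.shape[0]
    return [q for q in itertools.product((-1, 1), repeat=K)
            if region_nonempty(q, W, b)]

def enumerate_regions(W, b):
    """Recursive search: fix signs one unit at a time and prune any prefix
    that is already infeasible, so subtrees of empty regions are skipped."""
    K, regions = W.shape[0], []
    def recurse(signs):
        if len(signs) == K:
            regions.append(tuple(signs))
            return
        for s in (-1, 1):
            if region_nonempty(signs + [s], W, b):
                recurse(signs + [s])
    recurse([])
    return regions

# Three lines in general position split the plane into 7 regions.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, -1.0])
print(len(enumerate_regions(W, b)))  # 7 regions out of 2^3 = 8 candidates
```

In this tiny example only one of the eight sign patterns is infeasible, but in higher dimensions the pruned tree visits exponentially fewer candidates than the naive scan, which is the source of the speed-up reported in Fig. 1.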

This section demonstrates how the derivation carried out in Section 2 for the single-layer setting is sufficient to enumerate the partition of a multilayer DN, thanks to the subdivision process under which the composition of many layers ultimately forms the global DN's input space partition. We first recall this subdivision step in Section 3.1 and summarize the enumeration algorithm in Section 3.2.

We specialize the per-layer notation from Section 2 by making the layer index $\ell$ explicit: $f^{(\ell)}$ for the layer mapping, $\bm{\theta}^{(\ell)}$ for its parameters, and the entire DN's input-output mapping now referred to as $f_{\bm{\theta}}:\mathbb{R}^{D}\mapsto\mathbb{R}^{K}$ with $K$ the output space dimension. The composition of layers takes the form

where each layer mapping $f^{(\ell)}:\mathbb{R}^{D^{(\ell)}}\mapsto\mathbb{R}^{D^{(\ell+1)}}$ produces a feature map, with $D^{(1)}\triangleq D$ and $D^{(L+1)}\triangleq K$, with each mapping given by Eq. 1, and $\bm{h}^{(\ell)}$ denoting the pre-activation map of layer $\ell$. A key result from [20, 7] is that the DN mapping is itself defined on a partition, as in

which is known to be recursively built by each layer subdividing the previously built partition of the space [21].
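As a quick numerical sanity check of this per-region affine behavior, the toy sketch below (made-up random weights, leaky-ReLU; all names are ours) verifies that two nearby points sharing all per-layer sign patterns lie in the same region, on which the network responds to a fixed displacement with a fixed, affine change:

```python
import numpy as np

rng = np.random.default_rng(2)
# A tiny two-hidden-layer leaky-ReLU network with random weights.
W1, b1 = rng.standard_normal((6, 3)), rng.standard_normal(6)
W2, b2 = rng.standard_normal((5, 6)), rng.standard_normal(5)

def leaky_relu(h, a=0.1):
    return np.where(h >= 0, h, a * h)

def forward(x):
    return leaky_relu(W2 @ leaky_relu(W1 @ x + b1) + b2)

def codes(x):
    # Concatenated per-layer sign patterns: the region code of x.
    h1 = W1 @ x + b1
    h2 = W2 @ leaky_relu(h1) + b2
    return np.concatenate([np.sign(h1), np.sign(h2)])

x = rng.standard_normal(3)
d = 1e-6 * rng.standard_normal(3)   # tiny step staying inside the region
assert np.array_equal(codes(x), codes(x + d))
assert np.array_equal(codes(x), codes(x + 2 * d))
# Within one region the map is affine: f(x+d) - f(x) = A d exactly,
# so the same displacement applied again gives the same response.
assert np.allclose(forward(x + d) - forward(x),
                   forward(x + 2 * d) - forward(x + d))
```

Once any of those signs flips, the displacement response changes, which is precisely where the multilayer subdivision boundaries lie.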

Considering an arbitrarily deep model can be tackled by understanding the recurrent subdivision process of a two-hidden-layer DN and applying the same principle successively. In this setting, notice that for the (two-layer) DN to be affine within some region $\omega$ of the DN's input space, each layer must stay affine as well. By composition, the first layer staying affine does not ensure that the DN stays affine, but the first layer being nonlinear does imply that the entire DN is nonlinear. From that, we see that the first layer's partition regions are "coarser" than the entire DN's partition regions. More precisely, and following the derivation of [21], each layer recursively subdivides the previously built partition: in our case, we need to search, for each region $\omega$ of the first layer's partition, the regions within it where the second layer stays affine. As a result, the proposed single-layer enumeration method from Section 2 can be applied recursively as follows. First, compute the first layer's partition enumeration. Then, for each enumerated region with corresponding sign pattern $\bm{q}$, define a new single-layer model with $\bm{h}(\bm{x})\triangleq\sigma(\bm{W}^{(2)}\operatorname{diag}(\bm{q})\bm{W}^{(1)}\bm{x}+\bm{W}^{(2)}(\bm{q}\odot\bm{b}^{(1)})+\bm{b}^{(2)})$ and apply the single-layer enumeration within $\omega$; the process is repeated for all regions (and corresponding sign patterns $\bm{q}$) of the previously found first-layer partition. This enumerates the partition of $(f^{(2)}\circ f^{(1)})$, and the same process can be repeated as many times as there are layers in the DN, as illustrated in Fig. 2.
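The per-region collapse used above can be verified numerically. The sketch below (random weights of our choosing) uses the absolute-value activation, for which $\sigma(\bm{h})=\bm{q}\odot\bm{h}$ on a region with sign pattern $\bm{q}$, and checks that the composed two-layer pre-activation agrees with the collapsed single-layer form on the region containing a given point:

```python
import numpy as np

rng = np.random.default_rng(1)
D, D2, D3 = 3, 4, 2
W1, b1 = rng.standard_normal((D2, D)), rng.standard_normal(D2)
W2, b2 = rng.standard_normal((D3, D2)), rng.standard_normal(D3)

x = rng.standard_normal(D)
q = np.sign(W1 @ x + b1)          # layer-1 region code of x

# Two-layer pre-activation with absolute-value nonlinearity ...
h_exact = W2 @ np.abs(W1 @ x + b1) + b2
# ... equals the collapsed single-layer form valid on x's region:
h_region = (W2 @ np.diag(q) @ W1) @ x + W2 @ (q * b1) + b2
assert np.allclose(h_exact, h_region)
```

On each first-layer region the second layer thus reduces to an ordinary affine layer, so the hyperplane-arrangement enumeration of Section 2 applies verbatim, restricted to that region.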

In this paper, we provided the first exact enumeration method for Deep Network partitions, which relies on the existing highly efficient enumeration methods for hyperplane arrangements. In fact, both shallow and deep architectures produce partitions that correspond to hyperplane arrangements or unions of restricted hyperplane arrangements. A crucial finding enabled by the proposed method is that sampling-based region enumeration, which is the only strategy used in current research studies dealing with DNs and affine splines, is in fact relatively poor at finding the regions of the DN's partition. In particular, when using such sampling to estimate some sensitive statistics, e.g. the volume of the smallest region, sampling is biased and should be avoided in favor of an exact enumeration method.

Table: S2.T1: Comparison of our exact enumeration method versus sampling-based partition discovery for various single-layer configurations with random weights and biases. The sampling-based discovery is run 5 times and we report the average and standard deviation of the number of regions found after sampling. The number of input space samples is chosen so that the computation time of the sampling method matches the computation time of the proposed method, i.e. for each configuration, both methods run for the exact same amount of time. We observe that for low-dimensional input spaces, with the same fixed time budget, both methods perform similarly and sampling is sufficient to quickly discover all of the layer's partition.

input dim |             | K=16    | K=32     | K=64     | K=128        | K=256
D=2       | enumeration | 16      | 13       | 71       | 127          | 631
          | sampling    | 16 ± 0  | 13 ± 0   | 67 ± 0   | 127 ± 0      | 611 ± 2
          | samp. found | 100%    | 100%     | 94%      | 100%         | 96%
D=4       | enumeration | 54      | 80       | 1107     | 4271         | 95954
          | sampling    | 51 ± 0  | 69 ± 1   | 866 ± 3  | 3288 ± 18    | 70635 ± 55
          | samp. found | 94%     | 86%      | 78%      | 77%          | 73%
D=8       | enumeration | 24      | 1242     | 8396     | 386566       | -
          | sampling    | 18 ± 0  | 543 ± 2  | 2875 ± 5 | 136748 ± 251 | -
          | samp. found | 75%     | 44%      | 34%      | 35%          | -

Fig. 1: Proposed exact region enumeration (orange star) against sampling-based region discovery (blue dots) of the partition $\Omega$ for a single hidden layer DN with leaky-ReLU, random parameters, and width 64, as a function of computation time (x-axis) and number of partition regions found (y-axis); for a 4-dimensional input space at the top and an 8-dimensional input space at the bottom. The proposed Algorithm 1 dramatically outperforms the sampling-based search that has been used throughout recent studies on CPA DNs.

Fig. 2: Depiction of the multilayer case, which corresponds to a union of region-constrained hyperplane arrangements and thus can be studied directly from the proposed hyperplane arrangement region enumeration. The only additional step is to first enforce that the search takes place on the restricted region of interest from the up-to-layer-$\ell$ input space partition. For example, on the left column one first obtains the first layer's partition, depicted in black. Each of the enumerated regions will be subdivided by the next layer; pick any region of interest, compose the per-region affine mapping (fixed on that region) with the second layer's affine mapping, and repeat the region enumeration algorithm. This discovers the subdivision done by the second layer, highlighted in blue in the middle column. This can be repeated to obtain the subdivision of the third layer, here highlighted in red in the right column.

$$ \displaystyle f_{{\bm{\theta}}}({\bm{x}})=\sigma({\bm{h}}({\bm{x}}))\text{ with }{\bm{h}}({\bm{x}})={\bm{W}}{\bm{x}}+{\bm{b}} $$

$$ \displaystyle{\mathbb{H}}_{k}\triangleq\left\{{\bm{x}}\in\mathbb{R}^{D}:\langle{\bm{W}}_{k,\cdot},{\bm{x}}\rangle+{\bm{b}}_{k}=0\right\}. $$

$$ \displaystyle f_{{\bm{\theta}}}=\left(f_{{\bm{\theta}}^{(L)}}^{(L)}\circ\dots\circ f_{{\bm{\theta}}^{(1)}}^{(1)}\right), $$


References

[Bengio+chapter2007] Bengio, Yoshua, LeCun, Yann. (2007). Scaling Learning Algorithms Towards AI. Large Scale Kernel Machines.

[Hinton06] Hinton, Geoffrey E., Osindero, Simon, Teh, Yee Whye. (2006). A Fast Learning Algorithm for Deep Belief Nets. Neural Computation.

[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron. (2016). Deep learning.

[madry2017towards] Madry, Aleksander, Makelov, Aleksandar, Schmidt, Ludwig, Tsipras, Dimitris, Vladu, Adrian. (2017). Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.

[balestriero2022batch] Balestriero, Randall, Baraniuk, Richard G. (2022). Batch Normalization Explained. arXiv preprint arXiv:2209.14778.

[david2004order] David, Herbert A, Nagaraja, Haikady N. (2004). Order statistics.

[soudry2018implicit] Soudry, Daniel, Hoffer, Elad, Nacson, Mor Shpigel, Gunasekar, Suriya, Srebro, Nathan. (2018). The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research.

[weng2018towards] Weng, Lily, Zhang, Huan, Chen, Hongge, Song, Zhao, Hsieh, Cho-Jui, Daniel, Luca, Boning, Duane, Dhillon, Inderjit. (2018). Towards fast computation of certified robustness for relu networks. International Conference on Machine Learning.

[raghunathan2018semidefinite] Raghunathan, Aditi, Steinhardt, Jacob, Liang, Percy S. (2018). Semidefinite relaxations for certifying robustness to adversarial examples. Advances in Neural Information Processing Systems.

[daubechies2022nonlinear] Daubechies, Ingrid, DeVore, Ronald, Foucart, Simon, Hanin, Boris, Petrova, Guergana. (2022). Nonlinear Approximation and (Deep) ReLU Networks. Constructive Approximation.

[balestriero2020analytical] Balestriero, Randall, Paris, Sébastien, Baraniuk, Richard G. (2020). Analytical probability distributions and exact expectation-maximization for deep generative networks. Advances in neural information processing systems.

[humayun2021magnet] Humayun, Ahmed Imtiaz, Balestriero, Randall, Baraniuk, Richard. (2021). Magnet: Uniform sampling from deep generative network manifolds without retraining. International Conference on Learning Representations.

[khodak2021initialization] Khodak, Mikhail, Tenenholtz, Neil, Mackey, Lester, Fusi, Nicolo. (2021). Initialization and Regularization of Factorized Neural Layers. arXiv preprint arXiv:2105.01029.

[krogh1991simple] Krogh, Anders, Hertz, John. (1991). A simple weight decay can improve generalization. Advances in neural information processing systems.

[lecun2015deep] LeCun, Yann, Bengio, Yoshua, Hinton, Geoffrey. (2015). Deep learning. nature.

[wong2018provable] Wong, Eric, Kolter, Zico. (2018). Provable defenses against adversarial examples via the convex outer adversarial polytope. International Conference on Machine Learning.

[yuan2019adversarial] Yuan, Xiaoyong, He, Pan, Zhu, Qile, Li, Xiaolin. (2019). Adversarial examples: Attacks and defenses for deep learning. IEEE transactions on neural networks and learning systems.

[goodfellow2014explaining] Goodfellow, Ian J, Shlens, Jonathon, Szegedy, Christian. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

[bers1964partial] Bers, Lipman, John, Fritz, Schechter, Martin. (1964). Partial differential equations.

[farlow1993partial] Farlow, Stanley J. (1993). Partial differential equations for scientists and engineers.

[stanley2004introduction] Stanley, Richard P, others. (2004). An introduction to hyperplane arrangements. Geometric combinatorics.

[evans2010partial] Evans, Lawrence C. (2010). Partial differential equations.

[sun2018concolic] Sun, Youcheng, Wu, Min, Ruan, Wenjie, Huang, Xiaowei, Kwiatkowska, Marta, Kroening, Daniel. (2018). Concolic testing for deep neural networks. Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering.

[wu2018training] Wu, Shuang, Li, Guoqi, Chen, Feng, Shi, Luping. (2018). Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680.

[liu2021algorithms] Liu, Changliu, Arnon, Tomer, Lazarus, Christopher, Strong, Christopher, Barrett, Clark, Kochenderfer, Mykel J, others. (2021). Algorithms for verifying deep neural networks. Foundations and Trends in Optimization.

[montufar2014number] Montufar, Guido F, Pascanu, Razvan, Cho, Kyunghyun, Bengio, Yoshua. (2014). On the number of linear regions of deep neural networks. Advances in neural information processing systems.

[cabannes_overcoming_2021] Cabannes, Vivien, Pillaud-Vivien, Loucas, Bach, Francis, Rudi, Alessandro. (2021). Overcoming the curse of dimensionality with Laplacian regularization in semi-supervised learning of large graphs. NeurIPS.

[humayun2022polarity] Humayun, Ahmed Imtiaz, Balestriero, Randall, Baraniuk, Richard. (2022). Polarity Sampling: Quality and Diversity Control of Pre-Trained Generative Networks via Singular Values. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[chen2021neural] Chen, Wuyang, Gong, Xinyu, Wang, Zhangyang. (2021). Neural architecture search on imagenet in four gpu hours: A theoretically inspired perspective. arXiv preprint arXiv:2102.11535.

[zhu_semi-supervised_2003] Zhu, Xiaojin, Ghahramani, Zoubin, Lafferty, John. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. ICML.

[devlin_bert_2019] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.

[brown_language_2020] Brown, Tom, others. (2020). Language Models are Few-Shot Learners.

[chowdhery_palm_2022] Chowdhery, Aakanksha, others. (2022). PaLM: Scaling Language Modeling with Pathways.

[radford_robust_2022] Radford, Alec, others. (2022). Robust Speech Recognition via Large-Scale Weak Supervision.

[balestriero_contrastive_2022] Balestriero, Randall, LeCun, Yann. (2022). Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods. NeurIPS.

[haochen_provable_2021] HaoChen, Jeff Z., Wei, Colin, Gaidon, Adrien, Ma, Tengyu. (2021). Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss. NeurIPS.

[chen_simple_2020] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML.

[zbontar_barlow_2021] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, Stephane. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. ICML.

[bardes_vicreg_2022] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. ICLR.

[rijsbergen_information_1979] Rijsbergen, Van. (1979). Information Retrieval.

[rigollet_generalization_2007] Rigollet, Philippe. (2007). Generalization Error Bounds in Semi-supervised Classification Under the Cluster Assumption. JMLR.

[krizhevsky_learning_2008] Krizhevsky, Alex. Learning Multiple Layers of Features from Tiny Images.

[engelen_surver_2020] van Engelen, Jesper, Hoos, Holger. (2020). A Survey of Semi-Supervised Learning. Machine Learning.

[bartlett_convexity_2006] Bartlett, Peter, Jordan, Michael, Mcauliffe, Jon. (2006). Convexity, Classification, and Risk Bounds. JASA.

[von_luxburg_consistency_2008] von Luxburg, Ulrike, Belkin, Mikhail, Bousquet, Olivier. (2008). Consistency of Spectral Clustering. Annals of Stat..

[hein_graph_2007] Hein, Matthias, Audibert, Jean-Yves, von Luxburg, Ulrike. (2007). Graph Laplacians and their convergence on random neighborhood graphs. JMLR.

[bousquet_measure_2003] Bousquet, Olivier, Chapelle, Olivier, Hein, Matthias. (2003). Measure Based Regularization. NeurIPS.

[belkin_laplacian_2003] Belkin, Mikhail, Niyogi, Partha. (2003). Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Comp..

[bengio_label_2006] Bengio, Yoshua, Delalleau, Olivier, Roux, Nicolas Le. (2006). Label Propagation and Quadratic Criterion. Semi-Supervised Learning.

[gulrajani_improved_2017] Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, Courville, Aaron. (2017). Improved training of Wasserstein GANs. NeurIPS.

[hoffman_robust_2019] Hoffman, Judy, Roberts, Daniel A., Yaida, Sho. (2019). Robust Learning with Jacobian Regularization.

[belkin_manifold_2006] Belkin, Mikhail, Niyogi, Partha, Sindhwani, Vikas. (2006). Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. JMLR.

[vapnik_nature_1995] Vapnik, Vladimir. (1995). The Nature of Statistical Learning.

[pillaudvivien_learning_2020] Pillaud-Vivien, Loucas. (2020). Learning with reproducing kernel Hilbert spaces.

[coifman_diffusion_2006] Coifman, Ronald, Lafon, St'ephane. (2006). Diffusion maps. Applied and Computational Harmonic Analysis.

[bakry2014analysis] Bakry, Dominique, Gentil, Ivan, Ledoux, Michel, others. (2014). Analysis and geometry of Markov diffusion operators.

[caron_unsupervised_2020] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. NeurIPS.

[drucker1992improving] Drucker, Harris, Le Cun, Yann. (1992). Improving generalization performance using double backpropagation. IEEE transactions on neural networks.

[bietti2019kernel] Bietti, Alberto, Mialon, Grégoire. (2019). A kernel perspective for regularizing deep neural networks. ICML.

[saunshi_understanding_2022] Saunshi, Nikunj, Ash, Jordan, Goel, Surbhi, Misra, Dipendra, Zhang, Cyril, Arora, Sanjeev, Kakade, Sham, Krishnamurthy, Akshay. (2022). Understanding Contrastive Learning Requires Incorporating Inductive Biases. ICML.

[tangent_prop] Simard, Patrice, Victorri, Bernard, LeCun, Yann, Denker, John. (1991). Tangent Prop: a formalism for specifying selected invariances in an adaptive network. NeurIPS.

[balestriero2018hard] Balestriero, Randall, Baraniuk, Richard G. (2018). From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference. arXiv preprint arXiv:1810.09274.

[balestriero2019geometry] Balestriero, Randall, Cosentino, Romain, Aazhang, Behnaam, Baraniuk, Richard. (2019). The geometry of deep networks: Power diagram subdivision. Advances in Neural Information Processing Systems.

[balestriero2018spline] Balestriero, Randall, Baraniuk, Richard. (2018). A spline theory of deep learning. International Conference on Machine Learning.

[wang2020orthogonal] Wang, Jiayun, Chen, Yubei, Chakraborty, Rudrasis, Yu, Stella X. (2020). Orthogonal convolutional neural networks. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[bansal2018can] Bansal, Nitin, Chen, Xiaohan, Wang, Zhangyang. (2018). Can we gain more from orthogonality regularizations in training deep networks?. Advances in Neural Information Processing Systems.

[balestriero2020mad] Balestriero, Randall, Baraniuk, Richard G. (2020). Mad max: Affine spline insights into deep learning. Proceedings of the IEEE.

[pennington2018emergence] Pennington, Jeffrey, Schoenholz, Samuel, Ganguli, Surya. (2018). The emergence of spectral universality in deep networks. International Conference on Artificial Intelligence and Statistics.

[pennington2017resurrecting] Pennington, Jeffrey, Schoenholz, Samuel S, Ganguli, Surya. (2017). Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Proceedings of the 31st International Conference on Neural Information Processing Systems.

[rainforth2018tighter] Rainforth, Tom, Kosiorek, Adam, Le, Tuan Anh, Maddison, Chris, Igl, Maximilian, Wood, Frank, Teh, Yee Whye. (2018). Tighter variational bounds are not necessarily better. International Conference on Machine Learning.

[brockett1991dynamical] Brockett, Roger W. (1991). Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems. Linear Algebra and its applications.

[specht1991general] Specht, Donald F, others. (1991). A general regression neural network. IEEE transactions on neural networks.

[lai2000kernel] Lai, Pei Ling, Fyfe, Colin. (2000). Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems.

[he2021masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

[bao2021beit] Bao, Hangbo, Dong, Li, Wei, Furu. (2021). Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254.

[qiu2018network] Qiu, Jiezhong, Dong, Yuxiao, Ma, Hao, Li, Jian, Wang, Kuansan, Tang, Jie. (2018). Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. Proceedings of the eleventh ACM international conference on web search and data mining.

[sprekeler2011relation] Sprekeler, Henning. (2011). On the relation of slow feature analysis and laplacian eigenmaps. Neural computation.

[xing2002distance] Xing, Eric, Jordan, Michael, Russell, Stuart J, Ng, Andrew. (2002). Distance metric learning with application to clustering with side-information. Advances in neural information processing systems.

[von2007tutorial] Von Luxburg, Ulrike. (2007). A tutorial on spectral clustering. Statistics and computing.

[wathen2015spectral] Wathen, Andrew J, Zhu, Shengxin. (2015). On spectral distribution of kernel matrices related to radial basis functions. Numerical Algorithms.

[cunningham2015linear] Cunningham, John P, Ghahramani, Zoubin. (2015). Linear dimensionality reduction: Survey, insights, and generalizations. The Journal of Machine Learning Research.

[knaf2007kernel] Knaf, H. (2007). Kernel Fisher discriminant functions--a concise and rigorous introduction.

[weinberger2009distance] Weinberger, Kilian Q, Saul, Lawrence K. (2009). Distance metric learning for large margin nearest neighbor classification.. Journal of machine learning research.

[hadsell2006dimensionality] Hadsell, Raia, Chopra, Sumit, LeCun, Yann. (2006). Dimensionality reduction by learning an invariant mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[kuss2003geometry] Kuss, Malte, Graepel, Thore. (2003). The geometry of kernel canonical correlation analysis.

[fukumizu2007statistical] Fukumizu, Kenji, Bach, Francis R, Gretton, Arthur. (2007). Statistical Consistency of Kernel Canonical Correlation Analysis.. Journal of Machine Learning Research.

[broomhead1988radial] Broomhead, David S, Lowe, David. (1988). Radial basis functions, multi-variable functional interpolation and adaptive networks.

[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, Mairal, Julien, Bojanowski, Piotr, Joulin, Armand. (2021). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[kokiopoulou2011trace] Kokiopoulou, Effrosini, Chen, Jie, Saad, Yousef. (2011). Trace optimization and eigenproblems in dimension reduction methods. Numerical Linear Algebra with Applications.

[li2014fisher] Li, Cheng, Wang, Bingyu. (2014). Fisher linear discriminant analysis. CCIS Northeastern University.

[webb2003statistical] Webb, Andrew R. (2003). Statistical pattern recognition.

[guo2003generalized] Guo, Yue-Fei, Li, Shi-Jin, Yang, Jing-Yu, Shu, Ting-Ting, Wu, Li-De. (2003). A generalized Foley--Sammon transform based on generalized fisher discriminant criterion and its application to face recognition. Pattern Recognition Letters.

[wang2007trace] Wang, Huan, Yan, Shuicheng, Xu, Dong, Tang, Xiaoou, Huang, Thomas. (2007). Trace ratio vs. ratio trace for dimensionality reduction. 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[he2003locality] He, Xiaofei, Niyogi, Partha. (2003). Locality preserving projections. Advances in neural information processing systems.

[cohen2014applied] Cohen, Patricia, West, Stephen G, Aiken, Leona S. (2014). Applied multiple regression/correlation analysis for the behavioral sciences.

[fisher1936use] Fisher, Ronald A. (1936). The use of multiple measurements in taxonomic problems. Annals of eugenics.

[tenenbaum2000global] Tenenbaum, Joshua B, Silva, Vin de, Langford, John C. (2000). A global geometric framework for nonlinear dimensionality reduction. science.

[kruskal1964multidimensional] Kruskal, Joseph B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika.

[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, others. (2020). Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems.

[chen2021exploring] Chen, Xinlei, He, Kaiming. (2021). Exploring simple siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[caron2018deep] Caron, Mathilde, Bojanowski, Piotr, Joulin, Armand, Douze, Matthijs. (2018). Deep clustering for unsupervised learning of visual features. Proceedings of the European conference on computer vision (ECCV).

[von1937some] Von Neumann, John. (1937). Some matrix-inequalities and metrization of matric space.

[hagen1992new] Hagen, Lars, Kahng, Andrew B. (1992). New spectral methods for ratio cut partitioning and clustering. IEEE transactions on computer-aided design of integrated circuits and systems.

[belkin2003laplacian] Belkin, Mikhail, Niyogi, Partha. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation.

[zhang2021canonical] Zhang, Hengrui, Wu, Qitian, Yan, Junchi, Wipf, David, Yu, Philip S. (2021). From canonical correlation analysis to self-supervised graph neural networks. Advances in Neural Information Processing Systems.

[jing2021understanding] Jing, Li, Vincent, Pascal, LeCun, Yann, Tian, Yuandong. (2021). Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348.

[rashmi2019optimal] Rashmi, Manazhy, Sankaran, Praveen. (2019). Optimal landmark point selection using clustering for manifold modeling and data classification. Journal of Classification.

[vinod1976canonical] Vinod, Hrishikesh D. (1976). Canonical ridge and econometrics of joint production. Journal of econometrics.

[platt2005fastmap] Platt, John. (2005). FastMap, MetricMap, and Landmark MDS are all Nyström algorithms. International Workshop on Artificial Intelligence and Statistics.

[jahan2016regularized] Jahan, Sohana, Qi, Hou-Duo. (2016). Regularized multidimensional scaling with radial basis functions. Journal of Industrial & Management Optimization.

[hua2021feature] Hua, Tianyu, Wang, Wenxiao, Xue, Zihui, Ren, Sucheng, Wang, Yue, Zhao, Hang. (2021). On feature decorrelation in self-supervised learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[preparata2012computational] Preparata, Franco P, Shamos, Michael I. (2012). Computational geometry: an introduction.

[scholkopf1998nonlinear] Schölkopf, Bernhard, Smola, Alexander, Müller, Klaus-Robert. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural computation.

[webb1995multidimensional] Webb, Andrew R. (1995). Multidimensional scaling by iterative majorization using radial basis functions. Pattern Recognition.

[williams2000connection] Williams, Christopher. (2000). On a connection between kernel PCA and metric multidimensional scaling. Advances in neural information processing systems.

[liang2013trace] Liang, Xin, Li, Ren-Cang, Bai, Zhaojun. (2013). Trace minimization principles for positive semi-definite pencils. Linear Algebra and its Applications.

[yang2017breaking] Yang, Zhilin, Dai, Zihang, Salakhutdinov, Ruslan, Cohen, William W. (2017). Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953.

[ganea2019breaking] Ganea, Octavian, Gelly, Sylvain, Bécigneul, Gary, Severyn, Aliaksei. (2019). Breaking the softmax bottleneck via learnable monotonic pointwise non-linearities. International Conference on Machine Learning.

[silva2002global] Silva, Vin, Tenenbaum, Joshua. (2002). Global versus local methods in nonlinear dimensionality reduction. Advances in neural information processing systems.

[pokle2022contrasting] Pokle, Ashwini, Tian, Jinjin, Li, Yuchen, Risteski, Andrej. (2022). Contrasting the landscape of contrastive and non-contrastive learning. arXiv preprint arXiv:2203.15702.

[gretton2005kernel] Gretton, Arthur, Herbrich, Ralf, Smola, Alexander, Bousquet, Olivier, Schölkopf, Bernhard. (2005). Kernel methods for measuring independence. Journal of Machine Learning Research.

[dauxois1998nonlinear] Dauxois, Jacques, Nkiet, Guy Martial. (1998). Nonlinear canonical analysis and independence tests. The Annals of Statistics.

[bao2021sharp] Bao, Han, Nagano, Yoshihiro, Nozawa, Kento. (2021). Sharp Learning Bounds for Contrastive Unsupervised Representation Learning. arXiv preprint arXiv:2110.02501.

[aronszajn1950theory] Aronszajn, Nachman. (1950). Theory of reproducing kernels. Transactions of the American mathematical society.

[tai2022kernelized] Tai, Mariko, Kudo, Mineichi, Tanaka, Akira, Imai, Hideyuki, Kimura, Keigo. (2022). Kernelized Supervised Laplacian Eigenmap for Visualization and Classification of Multi-Label Data. Pattern Recognition.

[bengio2003out] Bengio, Yoshua, Paiement, Jean-François, Vincent, Pascal, Delalleau, Olivier, Le Roux, Nicolas, Ouimet, Marie. (2003). Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. Advances in neural information processing systems.

[cheng2005supervised] Cheng, Jian, Liu, Qingshan, Lu, Hanqing, Chen, Yen-Wei. (2005). Supervised kernel locality preserving projections for face recognition. Neurocomputing.

[nazi2019generalized] Nazi, Azade, Hang, Will, Goldie, Anna, Ravi, Sujith, Mirhoseini, Azalia. (2019). Generalized Clustering by Learning to Optimize Expected Normalized Cuts. arXiv preprint arXiv:1910.07623.

[smola2003kernels] Smola, Alexander J, Kondor, Risi. (2003). Kernels and regularization on graphs. Learning theory and kernel machines.

[knyazev1987convergence] Knyazev, Andrew V. (1987). Convergence rate estimates for iterative methods for a mesh symmetrie eigenvalue problem.

[hautamaki2004outlier] Hautamaki, Ville, Karkkainen, Ismo, Franti, Pasi. (2004). Outlier detection using k-nearest neighbour graph. Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[hirsch1983discrete] Hirsch, Jorge E. (1983). Discrete Hubbard-Stratonovich transformation for fermion lattice models. Physical Review B.

[wen2021toward] Wen, Zixin, Li, Yuanzhi. (2021). Toward understanding the feature learning process of self-supervised contrastive learning. International Conference on Machine Learning.

[tian2021understanding] Tian, Yuandong, Chen, Xinlei, Ganguli, Surya. (2021). Understanding self-supervised learning dynamics without contrastive pairs. International Conference on Machine Learning.

[wang2021towards] Wang, Xiang, Chen, Xinlei, Du, Simon S, Tian, Yuandong. (2021). Towards Demystifying Representation Learning with Non-contrastive Self-supervision. arXiv preprint arXiv:2110.04947.

[uurtio2017tutorial] Uurtio, Viivi, Monteiro, João M., Kandola, Jaz, Shawe-Taylor, John, Fernandez-Reyes, Delmiro, Rousu, Juho. (2017). A tutorial on canonical correlation methods. ACM Computing Surveys (CSUR).

[ding2005equivalence] Ding, Chris, He, Xiaofeng, Simon, Horst D. (2005). On the equivalence of nonnegative matrix factorization and spectral clustering. Proceedings of the 2005 SIAM international conference on data mining.

[qiao2015new] Qiao, Hanli. (2015). New SVD based initialization strategy for non-negative matrix factorization. Pattern Recognition Letters.

[ding2004k] Ding, Chris, He, Xiaofeng. (2004). K-means clustering via principal component analysis. Proceedings of the twenty-first international conference on Machine learning.

[ding2006orthogonal] Ding, Chris, Li, Tao, Peng, Wei, Park, Haesun. (2006). Orthogonal nonnegative matrix t-factorizations for clustering. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.

[liang2021generalizing] Liang, Xin, Wang, Li, Zhang, Lei-Hong, Li, Ren-Cang. (2021). On Generalizing Trace Minimization. arXiv preprint arXiv:2104.00257.

[hui2020evaluation] Hui, Like, Belkin, Mikhail. (2020). Evaluation of Neural Architectures Trained with Square Loss vs Cross-Entropy in Classification Tasks. arXiv preprint arXiv:2006.07322.

[liu2022convnet] Liu, Zhuang, Mao, Hanzi, Wu, Chao-Yuan, Feichtenhofer, Christoph, Darrell, Trevor, Xie, Saining. (2022). A ConvNet for the 2020s. arXiv preprint arXiv:2201.03545.

[lampinen2018analytic] Lampinen, Andrew K, Ganguli, Surya. (2018). An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374.

[huang2017densely] Huang, Gao, Liu, Zhuang, Van Der Maaten, Laurens, Weinberger, Kilian Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition.

[yu2018deep] Yu, Fisher, Wang, Dequan, Shelhamer, Evan, Darrell, Trevor. (2018). Deep layer aggregation. Proceedings of the IEEE conference on computer vision and pattern recognition.

[simonyan2014very] Simonyan, Karen, Zisserman, Andrew. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[kingma2014adam] Kingma, Diederik P, Ba, Jimmy. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[ravanelli2018speaker] Ravanelli, Mirco, Bengio, Yoshua. (2018). Speaker recognition from raw waveform with sincnet. 2018 IEEE Spoken Language Technology Workshop (SLT).

[seydoux2020clustering] Seydoux, Léonard, Balestriero, Randall, Poli, Piero, de Hoop, Maarten, Campillo, Michel, Baraniuk, Richard. (2020). Clustering earthquake signals and background noises in continuous seismic data with unsupervised deep learning. Nature communications.

[caron2020unsupervised] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882.

[ewerbring1989canonical] Ewerbring, L Magnus, Luk, Franklin T. (1989). Canonical correlations and generalized SVD: applications and new algorithms. Journal of computational and applied mathematics.

[andrew2013deep] Andrew, Galen, Arora, Raman, Bilmes, Jeff, Livescu, Karen. (2013). Deep canonical correlation analysis. International conference on machine learning.

[witten2009penalized] Witten, Daniela M, Tibshirani, Robert, Hastie, Trevor. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics.

[healy1957rotation] Healy, MJR. (1957). A rotation method for computing canonical correlations. Mathematics of Computation.

[xue2014does] Xue, Jing-Hao, Hall, Peter. (2014). Why does rebalancing class-unbalanced data improve AUC for linear discriminant analysis?. IEEE transactions on pattern analysis and machine intelligence.

[mika1999fisher] Mika, Sebastian, Ratsch, Gunnar, Weston, Jason, Scholkopf, Bernhard, Mullers, Klaus-Robert. (1999). Fisher discriminant analysis with kernels. Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop (cat. no. 98th8468).

[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.

[tosh2021contrastive] Tosh, Christopher, Krishnamurthy, Akshay, Hsu, Daniel. (2021). Contrastive learning, multi-view redundancy, and linear models. Algorithmic Learning Theory.

[tian2022deep] Tian, Yuandong. (2022). Deep Contrastive Learning is Provably (almost) Principal Component Analysis. arXiv preprint arXiv:2201.12680.

[bach2005probabilistic] Bach, Francis R, Jordan, Michael I. (2005). A probabilistic interpretation of canonical correlation analysis.

[haochen2022beyond] HaoChen, Jeff Z, Wei, Colin, Kumar, Ananya, Ma, Tengyu. (2022). Beyond Separability: Analyzing the Linear Transferability of Contrastive Representations to Related Subpopulations. arXiv preprint arXiv:2204.02683.

[pfau2018spectral] David Pfau, Stig Petersen, Ashish Agarwal, David G. T. Barrett, Kimberly L. Stachenfeld. (2019). Spectral Inference Networks: Unifying Deep and Spectral Learning. International Conference on Learning Representations.

[baevski2020wav2vec] Baevski, Alexei, Zhou, Yuhao, Mohamed, Abdelrahman, Auli, Michael. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems.

[higham1988computing] Higham, Nicholas J. (1988). Computing a nearest symmetric positive semidefinite matrix. Linear algebra and its applications.

[geirhos2020surprising] Geirhos, Robert, Narayanappa, Kantharaju, Mitzkus, Benjamin, Bethge, Matthias, Wichmann, Felix A, Brendel, Wieland. (2020). On the surprising similarities between supervised and self-supervised models. arXiv preprint arXiv:2010.08377.

[golub1971singular] Golub, Gene H, Reinsch, Christian. (1971). Singular value decomposition and least squares solutions. Linear algebra.

[hestenes1958inversion] Hestenes, Magnus R. (1958). Inversion of matrices by biorthogonalization and related results. Journal of the Society for Industrial and Applied Mathematics.

[eckart1936approximation] Eckart, Carl, Young, Gale. (1936). The approximation of one matrix by another of lower rank. Psychometrika.

[horn2012matrix] Horn, Roger A, Johnson, Charles R. (2012). Matrix analysis.

[ioffe2006probabilistic] Ioffe, Sergey. (2006). Probabilistic linear discriminant analysis. European Conference on Computer Vision.

[goyal2019scaling] Goyal, Priya, Mahajan, Dhruv, Gupta, Abhinav, Misra, Ishan. (2019). Scaling and benchmarking self-supervised visual representation learning. Proceedings of the ieee/cvf International Conference on computer vision.

[cortes1995support] Cortes, Corinna, Vapnik, Vladimir. (1995). Support-vector networks. Machine learning.

[wang2020understanding] Wang, Tongzhou, Isola, Phillip. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. International Conference on Machine Learning.

[arora2019theoretical] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.

[kalofolias2016learn] Kalofolias, Vassilis. (2016). How to learn a graph from smooth signals. Artificial Intelligence and Statistics.

[dong2016learning] Dong, Xiaowen, Thanou, Dorina, Frossard, Pascal, Vandergheynst, Pierre. (2016). Learning Laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing.

[koohpayegani2021mean] Koohpayegani, Soroush Abbasi, Tejankar, Ajinkya, Pirsiavash, Hamed. (2021). Mean shift for self-supervised learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[dwibedi2021little] Dwibedi, Debidatta, Aytar, Yusuf, Tompson, Jonathan, Sermanet, Pierre, Zisserman, Andrew. (2021). With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[roweis2000nonlinear] Roweis, Sam T, Saul, Lawrence K. (2000). Nonlinear dimensionality reduction by locally linear embedding. science.

[shi2020run] Shi, Haizhou, Luo, Dongliang, Tang, Siliang, Wang, Jian, Zhuang, Yueting. (2020). Run away from your teacher: Understanding byol by a novel self-supervised approach. arXiv preprint arXiv:2011.10944.

[gidaris2018unsupervised] Gidaris, Spyros, Singh, Praveer, Komodakis, Nikos. (2018). Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.

[ericsson2021well] Ericsson, Linus, Gouk, Henry, Hospedales, Timothy M. (2021). How well do self-supervised models transfer?. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[kanazawa2016warpnet] Kanazawa, Angjoo, Jacobs, David W, Chandraker, Manmohan. (2016). Warpnet: Weakly supervised matching for single-view reconstruction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[bordes2021high] Bordes, Florian, Balestriero, Randall, Vincent, Pascal. (2021). High Fidelity Visualization of What Your Self-Supervised Representation Knows About. arXiv preprint arXiv:2112.09164.

[novotny2018self] Novotny, David, Albanie, Samuel, Larlus, Diane, Vedaldi, Andrea. (2018). Self-supervised learning of geometrically stable features through probabilistic introspection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[o1985manova] O'Brien, Ralph G, Kaiser, Mary K. (1985). MANOVA method for analyzing repeated measures designs: an extensive primer.. Psychological bulletin.

[xu2019self] Xu, Dejing, Xiao, Jun, Zhao, Zhou, Shao, Jian, Xie, Di, Zhuang, Yueting. (2019). Self-supervised spatiotemporal learning via video clip order prediction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[kim2019self] Kim, Dahun, Cho, Donghyeon, Kweon, In So. (2019). Self-supervised video representation learning with space-time cubic puzzles. Proceedings of the AAAI conference on artificial intelligence.

[sermanet2018time] Sermanet, Pierre, Lynch, Corey, Chebotar, Yevgen, Hsu, Jasmine, Jang, Eric, Schaal, Stefan, Levine, Sergey, Brain, Google. (2018). Time-contrastive networks: Self-supervised learning from video. 2018 IEEE international conference on robotics and automation (ICRA).

[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. International conference on machine learning.

[huberty2006applied] Huberty, Carl J, Olejnik, Stephen. (2006). Applied MANOVA and discriminant analysis.

[haochen2021provable] HaoChen, Jeff Z, Wei, Colin, Gaidon, Adrien, Ma, Tengyu. (2021). Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems.

[bromley1993signature] Bromley, Jane, Guyon, Isabelle, LeCun, Yann, Säckinger, Eduard, Shah, Roopak. (1993). Signature verification using a "Siamese" time delay neural network. Advances in neural information processing systems.

[misra2020self] Misra, Ishan, Maaten, Laurens van der. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[hastie1995penalized] Hastie, Trevor, Buja, Andreas, Tibshirani, Robert. (1995). Penalized discriminant analysis. The Annals of Statistics.

[bartlett1938further] Bartlett, Maurice S. (1938). Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society.

[hardoon2004canonical] Hardoon, David R, Szedmak, Sandor, Shawe-Taylor, John. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural computation.

[avis1996reverse] Avis, David, Fukuda, Komei. (1996). Reverse search for enumeration. Discrete applied mathematics.

[kursun2011canonical] Kursun, Olcay, Alpaydin, Ethem, Favorov, Oleg V. (2011). Canonical correlation analysis using within-class coupling. Pattern Recognition Letters.

[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, Stéphane. (2021). Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230.

[sankararaman2020impact] Sankararaman, Karthik Abinav, De, Soham, Xu, Zheng, Huang, W Ronny, Goldstein, Tom. (2020). The impact of neural network overparameterization on gradient confusion and stochastic gradient descent. International Conference on Machine Learning.

[zhang2019identity] Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Mozer, Michael C, Singer, Yoram. (2019). Identity crisis: Memorization and generalization under extreme overparameterization. arXiv preprint arXiv:1902.04698.

[arora2018optimization] Arora, Sanjeev, Cohen, Nadav, Hazan, Elad. (2018). On the optimization of deep networks: Implicit acceleration by overparameterization. International Conference on Machine Learning.

[neyshabur2018towards] Neyshabur, Behnam, Li, Zhiyuan, Bhojanapalli, Srinadh, LeCun, Yann, Srebro, Nathan. (2018). Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076.

[srebro2004generalization] Srebro, Nathan, Alon, Noga, Jaakkola, Tommi S. (2004). Generalization error bounds for collaborative prediction with low-rank matrices.. NIPS.

[guo2018expandnets] Guo, Shuxuan, Alvarez, Jose M, Salzmann, Mathieu. (2018). Expandnets: Linear over-parameterization to train compact convolutional networks. arXiv preprint arXiv:1811.10495.

[riedi2022singular] Riedi, Rudolf H, Balestriero, Randall, Baraniuk, Richard G. (2022). Singular Value Perturbation and Deep Network Optimization. arXiv preprint arXiv:2203.03099.

[arora2019implicit] Arora, Sanjeev, Cohen, Nadav, Hu, Wei, Luo, Yuping. (2019). Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems.

[hecht1992theory] Hecht-Nielsen, Robert. (1992). Theory of the backpropagation neural network. Neural networks for perception.

[alemohammad2021the] Sina Alemohammad, Zichao Wang, Randall Balestriero, Richard Baraniuk. (2021). The Recurrent Neural Tangent Kernel. International Conference on Learning Representations.

[neyshabur2014search] Neyshabur, Behnam, Tomioka, Ryota, Srebro, Nathan. (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614.

[jacot2018neural] Jacot, Arthur, Gabriel, Franck, Hongler, Clément. (2018). Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572.

[gunasekar2018implicitb] Gunasekar, Suriya, Lee, Jason, Soudry, Daniel, Srebro, Nathan. (2018). Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468.

[kakade2008complexity] Kakade, Sham M, Sridharan, Karthik, Tewari, Ambuj. (2008). On the complexity of linear prediction: Risk bounds, margin bounds, and regularization.

[bartlett2002rademacher] Bartlett, Peter L, Mendelson, Shahar. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research.

[jing2020implicit] Jing, Li, Zbontar, Jure, LeCun, Yann. (2020). Implicit rank-minimizing autoencoder. arXiv preprint arXiv:2010.00679.

[gunasekar2018characterizing] Gunasekar, Suriya, Lee, Jason, Soudry, Daniel, Srebro, Nathan. (2018). Characterizing implicit bias in terms of optimization geometry. International Conference on Machine Learning.

[telgarsky2013margins] Telgarsky, Matus. (2013). Margins, shrinkage, and boosting. International Conference on Machine Learning.

[park1991universal] Park, Jooyoung, Sandberg, Irwin W. (1991). Universal approximation using radial-basis-function networks. Neural computation.

[nocedal1999numerical] Nocedal, Jorge, Wright, Stephen J. (1999). Numerical optimization.

[kelaghan1982barrier] Kelaghan, Joseph, Rubin, George L, Ory, Howard W, Layde, Peter M. (1982). Barrier-method contraceptives and pelvic inflammatory disease. Jama.

[gunasekar2018implicit] Gunasekar, Suriya, Woodworth, Blake, Bhojanapalli, Srinadh, Neyshabur, Behnam, Srebro, Nathan. (2018). Implicit regularization in matrix factorization. 2018 Information Theory and Applications Workshop (ITA).

[yariv2021volume] Yariv, Lior, Gu, Jiatao, Kasten, Yoni, Lipman, Yaron. (2021). Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems.

[bracewell1986fourier] Bracewell, Ronald Newbold, Bracewell, Ronald N. (1986). The Fourier transform and its applications.

[bellman2015applied] Bellman, Richard E, Dreyfus, Stuart E. (2015). Applied dynamic programming.

[srivastava2014dropout] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, Salakhutdinov, Ruslan. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research.

[sitzmann2020implicit] Sitzmann, Vincent, Martel, Julien, Bergman, Alexander, Lindell, David, Wetzstein, Gordon. (2020). Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems.

[neyshabur2018role] Neyshabur, Behnam, Li, Zhiyuan, Bhojanapalli, Srinadh, LeCun, Yann, Srebro, Nathan. (2018). The role of over-parametrization in generalization of neural networks. International Conference on Learning Representations.

[wang2017residual] Wang, Fei, Jiang, Mengqing, Qian, Chen, Yang, Shuo, Li, Cheng, Zhang, Honggang, Wang, Xiaogang, Tang, Xiaoou. (2017). Residual attention network for image classification. Proceedings of the IEEE conference on computer vision and pattern recognition.

[koppen2000curse] Köppen, Mario. (2000). The curse of dimensionality. 5th online world conference on soft computing in industrial applications (WSC5).

[boltyanski2012excursions] Boltyanski, Vladimir, Martini, Horst, Soltan, Petru S. (2012). Excursions into combinatorial geometry.

[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition.

[lecun1989backpropagation] LeCun, Yann, Boser, Bernhard, Denker, John S, Henderson, Donnie, Howard, Richard E, Hubbard, Wayne, Jackel, Lawrence D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation.

[edelsbrunner1987algorithms] Edelsbrunner, Herbert. (1987). Algorithms in combinatorial geometry.

[huang2020deep] Huang, Kaixuan, Wang, Yuqing, Tao, Molei, Zhao, Tuo. (2020). Why Do Deep Residual Networks Generalize Better than Deep Feedforward Networks?---A Neural Tangent Kernel Perspective. Advances in neural information processing systems.

[he2016identity] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Identity mappings in deep residual networks. European conference on computer vision.

[neyshabur2017implicit] Neyshabur, Behnam. (2017). Implicit regularization in deep learning. arXiv preprint arXiv:1709.01953.

[le1991eigenvalues] Le Cun, Yann, Kanter, Ido, Solla, Sara A. (1991). Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters.

[grill2017two] Grill, Thomas, Schlüter, Jan. (2017). Two convolutional neural networks for bird detection in audio signals. 2017 25th European Signal Processing Conference (EUSIPCO).

[zhang2021attention] Zhang, Zhichao, Xu, Shugong, Zhang, Shunqing, Qiao, Tianhao, Cao, Shan. (2021). Attention based convolutional recurrent neural network for environmental sound classification. Neurocomputing.

[rybakov2020streaming] Rybakov, Oleg, Kononenko, Natasha, Subrahmanya, Niranjan, Visontai, Mirko, Laurenzo, Stella. (2020). Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720.

[mohri2018foundations] Mohri, Mehryar, Rostamizadeh, Afshin, Talwalkar, Ameet. (2018). Foundations of machine learning.

[ioffe2015batch] Ioffe, Sergey, Szegedy, Christian. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. International conference on machine learning.

[jordan2015machine] Jordan, Michael I, Mitchell, Tom M. (2015). Machine learning: Trends, perspectives, and prospects. Science.

[passalis2018learning] Passalis, Nikolaos, Tefas, Anastasios. (2018). Learning deep representations with probabilistic knowledge transfer. Proceedings of the European Conference on Computer Vision (ECCV).

[wolf2020transformers] Wolf, Thomas, Chaumond, Julien, Debut, Lysandre, Sanh, Victor, Delangue, Clement, Moi, Anthony, Cistac, Pierric, Funtowicz, Morgan, Davison, Joe, Shleifer, Sam, others. (2020). Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, Jürgen. (1997). Long short-term memory. Neural computation.

[long2015fully] Long, Jonathan, Shelhamer, Evan, Darrell, Trevor. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition.

[csimcsek2021geometry] Şimşek, Berfin, Ged, François, Jacot, Arthur, Spadaro, Francesco, Hongler, Clément, Gerstner, Wulfram, Brea, Johanni. (2021). Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances. arXiv preprint arXiv:2105.12221.

[vapnik2015uniform] Vapnik, Vladimir N, Chervonenkis, A Ya. (2015). On the uniform convergence of relative frequencies of events to their probabilities. Measures of complexity.

[pmlr-v9-glorot10a] Glorot, Xavier, Bengio, Yoshua. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.

[bib1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015.

[bib2] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro, “In search of the real inductive bias: On the role of implicit regularization in deep learning,” arXiv preprint arXiv:1412.6614, 2014.

[bib3] Yann Le Cun, Ido Kanter, and Sara A Solla, “Eigenvalues of covariance matrices: Application to neural-network learning,” Physical Review Letters, vol. 66, no. 18, pp. 2396, 1991.

[bib4] Arthur Jacot, Franck Gabriel, and Clément Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” arXiv preprint arXiv:1806.07572, 2018.

[bib5] Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao, “Why do deep residual networks generalize better than deep feedforward networks?—a neural tangent kernel perspective,” Advances in neural information processing systems, vol. 33, pp. 2698–2709, 2020.

[bib6] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro, “The implicit bias of gradient descent on separable data,” The Journal of Machine Learning Research, vol. 19, no. 1, pp. 2822–2878, 2018.

[bib7] Randall Balestriero and Richard Baraniuk, “A spline theory of deep learning,” in International Conference on Machine Learning. PMLR, 2018, pp. 374–383.

[bib8] Rudolf H Riedi, Randall Balestriero, and Richard G Baraniuk, “Singular value perturbation and deep network optimization,” arXiv preprint arXiv:2203.03099, 2022.

[bib9] Randall Balestriero and Richard G Baraniuk, “From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference,” arXiv preprint arXiv:1810.09274, 2018.

[bib10] Randall Balestriero, Sébastien Paris, and Richard Baraniuk, “Analytical probability distributions and exact expectation-maximization for deep generative networks,” Advances in neural information processing systems, vol. 33, pp. 14938–14949, 2020.

[bib11] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk, “Polarity sampling: Quality and diversity control of pre-trained generative networks via singular values,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10641–10650.

[bib12] Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana Petrova, “Nonlinear approximation and (deep) relu networks,” Constructive Approximation, vol. 55, no. 1, pp. 127–172, 2022.

[bib13] Randall Balestriero and Richard G Baraniuk, “Batch normalization explained,” arXiv preprint arXiv:2209.14778, 2022.

[bib14] Lily Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane Boning, and Inderjit Dhillon, “Towards fast computation of certified robustness for relu networks,” in International Conference on Machine Learning. PMLR, 2018, pp. 5276–5285.

[bib15] Aditi Raghunathan, Jacob Steinhardt, and Percy S Liang, “Semidefinite relaxations for certifying robustness to adversarial examples,” Advances in Neural Information Processing Systems, vol. 31, 2018.

[bib16] Herbert Edelsbrunner, Algorithms in combinatorial geometry, vol. 10, Springer Science & Business Media, 1987.

[bib17] Wuyang Chen, Xinyu Gong, and Zhangyang Wang, “Neural architecture search on imagenet in four gpu hours: A theoretically inspired perspective,” arXiv preprint arXiv:2102.11535, 2021.

[bib18] Richard E Bellman and Stuart E Dreyfus, Applied dynamic programming, vol. 2050, Princeton university press, 2015.

[bib19] Mario Köppen, “The curse of dimensionality,” in 5th online world conference on soft computing in industrial applications (WSC5), 2000, vol. 1, pp. 4–8.

[bib20] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio, “On the number of linear regions of deep neural networks,” Advances in neural information processing systems, vol. 27, 2014.

[bib21] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk, “The geometry of deep networks: Power diagram subdivision,” Advances in Neural Information Processing Systems, vol. 32, 2019.

[bib22] David Avis and Komei Fukuda, “Reverse search for enumeration,” Discrete applied mathematics, vol. 65, no. 1-3, pp. 21–46, 1996.

[bib23] Richard P Stanley et al., “An introduction to hyperplane arrangements,” Geometric combinatorics, vol. 13, pp. 389–496, 2004.