
POLICE: Provably Optimal Linear Constraint Enforcement for Deep Neural Networks

Abstract

Deep Neural Networks (DNNs) outshine alternative function approximators in many settings thanks to their modularity in composing any desired differentiable operator. The resulting parametrized functional is then tuned to solve the task at hand via simple gradient descent. This modularity comes at the cost of making strict enforcement of constraints on DNNs, e.g. from a priori knowledge of the task or from desired physical properties, an open challenge. In this paper we propose the first provable affine constraint enforcement method for DNNs that requires only minimal changes to a given DNN's forward pass, that is computationally friendly, and that leaves the optimization of the DNN's parameters unconstrained, i.e. standard gradient-based methods can be employed. Our method does not require any sampling and provably ensures that the DNN fulfills the affine constraint on a given region of the input space at any point during training and testing. We coin this method POLICE, standing for Provably Optimal LInear Constraint Enforcement. GitHub: https://github.com/RandallBalestriero/POLICE

Strict Enforcement of Affine Constraints for Deep Neural Networks

Randall Balestriero

Meta AI, FAIR

NYC, USA


Introduction

Deep Neural Networks (DNNs) are compositions of interleaved linear and nonlinear operators forming a differentiable functional f θ governed by some parameters θ [1]. DNNs are universal approximators, but so are many other alternatives, e.g. Fourier series [2], kernel machines [3], and decision trees [4]. The strength of DNNs rather lies in the ability of the function f θ to quickly and relatively easily fit most encountered target functionals f ∗ (often known only through a given and finite training set) solely by performing gradient updates on its parameters θ [5]. This success, combined with the development of software and hardware making DNNs' training and inference efficient, has made DNNs the operator of choice for most tasks encountered in machine learning and pattern recognition [6].

Yann LeCun

Meta AI, FAIR, NYU
yann@meta.com

NYC, USA

Fig. 1. Classification task of orange versus purple. The learned decision boundary (red) comes from an SGD-trained POLICEd DNN of depth 3 with leaky-ReLU and dense layers of width 256; POLICE provably enforces the DNN to be affine within the region delimited by the black lines through a simple projection of each layer's bias parameters (Algorithm 1); standard unconstrained optimization is still used to train the DNN. See Fig. 2 for regression tasks.

Although ubiquitous for general-purpose function fitting, DNNs have had a much harder time shining in settings where specific constraints must be enforced exactly on the learned mapping. A motivating example follows from the optimization problem

$$
\min_{\theta} \left\| f_{\theta} - f^{*} \right\| + \lambda\, \mathcal{R}(\theta) \;\; \text{ s.t. } \;\; J f_{\theta}(x) = A, \;\forall x \in R,
\tag{1}
$$

where J is the Jacobian operator and A is a target matrix that the model f θ's Jacobian should match for any input x in a given region R of the input space. We explicitly include a regularizer ℛ with associated hyper-parameter λ, as one commonly employs weight decay, i.e. ℓ2 regularization on θ [7]. The type of constraint in Eq. (1) is crucial in a variety of applications, e.g. based on informed physical properties that f θ should obey [8, 9, 10] or based on (adversarial) noise robustness constraints [11, 12]. Another motivating example emerges when one has a priori knowledge that the target mapping f ∗ is itself linear with parameters A ∗ , b ∗ within a region R, leading to the following optimization problem

$$
\min_{\theta} \left\| f_{\theta} - f^{*} \right\| + \lambda\, \mathcal{R}(\theta) \;\; \text{ s.t. } \;\; f_{\theta}(x) = A^{*} x + b^{*}, \;\forall x \in R.
\tag{2}
$$

It is easy to see that one might arrive at numerous variants of the constraints in Eqs. (1) and (2). Crucial to our study is the observation that, in all cases, the necessary constraint one must provably and strictly enforce on f θ is to be affine within R. As a result, we summarize the generalized affine constraint that will interest us as follows

$$
\exists\, A, b \;\text{ such that }\; f_{\theta}(x) = A x + b, \;\forall x \in R,
\tag{3}
$$

and from which any more specific constraint can be efficiently enforced. For example, given Eq. (3), imposing Eq. (1) is straightforward by using a barrier method [13] with ‖ J f θ ( v ) − A ‖, which only requires sampling a single point v ∈ R and computing the DNN's Jacobian matrix at that point. The same goes for imposing Eq. (2) given that Eq. (3) is fulfilled.
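To illustrate why a single point suffices: once f θ is affine on R, its slope and offset on all of R can be read off from a handful of forward passes around any v ∈ R. Below is a minimal NumPy sketch of this idea using forward differences; the helper name `local_affine_params` is ours, not from the paper, and in practice one would use automatic differentiation instead.

```python
import numpy as np

def local_affine_params(f, anchor, eps=1e-3):
    """Recover the slope A and offset b of a map f that is affine around
    `anchor`, via D forward differences. For an affine map the result is
    exact up to floating-point error, which is what POLICE guarantees on R."""
    anchor = np.asarray(anchor, dtype=float)
    f0 = np.atleast_1d(f(anchor))
    # column j of A is the directional derivative along basis vector e_j
    A = np.stack(
        [(np.atleast_1d(f(anchor + eps * e)) - f0) / eps for e in np.eye(anchor.size)],
        axis=1,
    )
    b = f0 - A @ anchor
    return A, b
```

With A and b in hand, a penalty such as ‖A − A_target‖ can be evaluated without any sampling of R.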

The main question that naturally arises is how to leverage DNNs while imposing Eq. (3). Constraint enforcement for DNNs has so far only considered questions of verification [14, 15, 16] or constrained parameter spaces, e.g. with integer parameters θ [17], leaving us with little help to solve Eq. (3). A direct regularization-based approach suffers from the curse of dimensionality [18, 19] since sampling R to ensure that f θ is affine requires exponentially many points as the dimension of R increases. A more direct approach, e.g. defining an alternative mapping g θ , A , b such as

$$
g_{\theta, A, b}(x) = 1_{\{x \in R\}}\, (A x + b) + 1_{\{x \notin R\}}\, f_{\theta}(x),
$$

would destroy key properties such as continuity at the boundary ∂R . In short, we are in need of a principled and informed solution to strictly enforce Eq. (3) in arbitrarily wide and/or deep DNNs .

In this paper, we propose a theoretically motivated and provably optimal solution to enforce the affine constraint of Eq. (3) in DNNs. In particular, by focusing on convex polytopal regions R, we are able to exactly enforce the model to be affine on R; a visual depiction of the method at work is given in Fig. 1, and the pseudo-code is given in Algorithm 1. Our method has linear asymptotic complexity with respect to the number of vertices P defining (or approximating) the region R, i.e. it has time-complexity O(KP) where K is the complexity of the model's forward pass. Hence the proposed strategy will often be unimpaired by the curse of dimensionality: for example, a simplicial region R ⊂ R^D has only D + 1 vertices [20], i.e. our method in this case has complexity growing linearly with the input space dimension. Our method is exact for any DNN architecture in which the nonlinearities are continuous piecewise affine, such as (leaky-)ReLU, absolute value, max-pooling, and the like; the extension to smooth nonlinearities is discussed in Section 3. We coin our method POLICE, standing for Provably Optimal LInear Constraint Enforcement, and we denote a constrained DNN as being POLICEd. Two crucial benefits of POLICE are that the constraint enforcement is exact without requiring any sampling, and that standard gradient-based training can be used on the POLICEd DNN parameters without any changes.

Strict Enforcement of Affine Constraints for Deep Neural Networks

In this section, we develop the proposed POLICE method, which relies on two ingredients: (i) a DNN f θ using continuous piecewise-affine (CPA) nonlinearities, formally introduced in Section 2.1; and (ii) a convex polytopal region R. The derivation of the proposed method is carried out in Section 2.2 and empirically validated in Section 2.3.

Deep Neural Networks Are Continuous Piecewise Linear Operators

We denote the DNN input-output mapping as f θ : R^D → R^K with θ the governing parameters of the mapping. In full generality, a DNN is formed by a composition of many layers as in

$$
f_{\theta} = f^{(L)}_{\theta^{(L)}} \circ \dots \circ f^{(1)}_{\theta^{(1)}},
\tag{4}
$$

where each layer mapping f^{(ℓ)} : R^{D^{(ℓ)}} → R^{D^{(ℓ+1)}} produces a feature map, with D^{(1)} ≜ D and D^{(L)} ≜ K. For most architectures, i.e. parametrizations of the layers, the input-output mapping producing each feature map takes the form

$$
f^{(\ell)}_{\theta^{(\ell)}}(v) = \sigma^{(\ell)}\!\left( h^{(\ell)}(v) \right) \;\text{ with }\; h^{(\ell)}(v) = W^{(\ell)} v + b^{(\ell)},
\tag{5}
$$

where σ is a pointwise activation function, W^{(ℓ)} is a weight matrix of dimensions D^{(ℓ+1)} × D^{(ℓ)}, b^{(ℓ)} is a bias vector of length D^{(ℓ+1)}, h^{(ℓ)} is denoted the pre-activation map, and the layer parameters are gathered into θ^{(ℓ)} ≜ {W^{(ℓ)}, b^{(ℓ)}}. The matrix W^{(ℓ)} will often have a specific structure (e.g., circulant) at different layers. Without loss of generality we consider vectors as inputs: when dealing with images, for example, one can always flatten them into vectors and reparametrize the layers accordingly, leaving the input-output mapping of the DNN unchanged. The details of the layer mapping will not impact our results. One recent line of research that we will heavily rely on consists in formulating DNNs as Continuous Piecewise Affine (CPA) mappings [21, 22], which can be expressed as

$$
f_{\theta}(x) = \sum_{\omega \in \Omega} (A_{\omega} x + b_{\omega})\, 1_{\{x \in \omega\}},
\tag{6}
$$

where Ω is the input space partition induced by the DNN architecture¹, ω is a partition region, and A ω , b ω are the corresponding per-region slope and offset parameters. A key result that we will build upon is that the CPA formulation of Eq. (6) represents exactly the DNN functional of Eqs. (4) and (5) when the nonlinearities σ are themselves CPA, e.g. (leaky-)ReLU, absolute value, max-pooling [22].

POLICE Algorithm and Optimality

Equipped with the CPA notation expressing DNNs as per-region affine mappings (recall Eq. (6)), we now derive the proposed POLICE algorithm, which strictly enforces the POLICEd DNN to fulfill the affine constraint of Eq. (3) on a convex region R. We will then empirically validate the method in Section 2.3.

As mentioned above, our goal is to enforce the DNN to be a simple affine mapping on a convex region R. In particular, we will consider the V-representation of the region R, i.e. we consider P vertices v_1, ..., v_P so that the considered region R is, or can be closely approximated by, the convex hull of those vertices as in

$$
R = \left\{ V^{T} \alpha \;:\; \alpha_p \geq 0\ \forall p, \;\; \alpha^{T} \mathbf{1}_{P} = 1 \right\},
\tag{7}
$$

where we gathered the P vertices into the P × D matrix

$$
V \triangleq [v_1, \dots, v_P]^{T}.
\tag{8}
$$

We highlight that the actual ordering of the rows of V in Eq. (8), i.e. which vertex is put in which row, does not impact our result. We also denote by v_p^{(ℓ)} the p-th vertex mapped through the first ℓ − 1 layers, starting with v_p for ℓ = 1, with corresponding matrix V^{(ℓ)}. By leveraging the CPA property of the DNN f θ, one should notice that a necessary and sufficient condition for the model to stay affine within R is to have the same pre-activation sign patterns (recall Eq. (5)), which we formalize below.
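As a concrete illustration of the V-representation in Eqs. (7) and (8), the points of R are exactly the convex combinations V^T α; e.g., drawing α from a Dirichlet distribution yields samples inside the hull (an illustrative snippet of ours; POLICE itself needs no such sampling):

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])  # P = 3 vertices in D = 2
alpha = rng.dirichlet(np.ones(len(V)), size=5)       # rows satisfy alpha >= 0, sum = 1
points = alpha @ V                                   # 5 points in R = conv(V)
```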

Theorem 1. A necessary and sufficient condition for a CPA DNN as per Eq. (6) to stay affine within a region R given by Eq. (7) is to have the same pre-activation sign patterns across the vertices v_p, ∀p ∈ [P], and this at each layer.

Proof. The proof is direct. If the pre-activation signs are the same for all vertices, e.g. sign(h^{(1)}(v_p)) = sign(h^{(1)}(v_q)), ∀(p, q) ∈ [P]^2 for the first layer, then the vertices all lie within the same partition region, i.e. ∃ω* ∈ Ω such that v_p ∈ ω*, ∀p. Using Eq. (7) we obtain that R ⊂ ω*. Now, because the partition regions ω that form the DNN partition Ω (recall Eq. (6)) are themselves convex polytopes, all the points within ω*, and thus within R since R ⊂ ω*, also have the same pre-activation patterns; i.e., the entire DNN input-output mapping stays affine within R, concluding the proof.
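The key step of the proof, namely that sign agreement at the vertices forces sign agreement on the entire hull because h^{(1)} is affine and the coefficients α are nonnegative and sum to one, is easy to check numerically (illustrative snippet of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
W, b = rng.standard_normal((16, 3)), rng.standard_normal(16)
V = rng.standard_normal((4, 3))                       # P = 4 vertices
H = V @ W.T + b                                       # pre-activations at the vertices
agree = np.all(np.sign(H) == np.sign(H[0]), axis=0)   # units where all vertices agree
X = rng.dirichlet(np.ones(len(V)), size=200) @ V      # points in conv(V)
HX = X @ W.T + b
# on every unit where the vertices agree, each hull point shares that sign
same = np.all(np.sign(HX[:, agree]) == np.sign(H[0, agree]))
```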

¹ Exact relations between the architecture and the partition are given in [23] and are beyond the scope of this study.

Algorithm 1 POLICE procedure given a DNN f θ (recall Eq. (4)) with activations σ that can be (leaky-)ReLU or absolute value, and the region R expressed by its vertices (recall Eq. (8)). POLICE strictly enforces that f θ be affine in R; for clarity we denote by MajorityVote(., axis=0) the s vector from Proposition 1.

Train-time procedure for each mini-batch X:

Require: X ∈ R^{N×D}, V ∈ R^{P×D}
1: V^{(1)} ← V, X^{(1)} ← X  (initialization)
2: for ℓ = 1, ..., L do  (forward pass)
3:   H ← V^{(ℓ)} (W^{(ℓ)})^T + 1_P (b^{(ℓ)})^T
4:   s ← MajorityVote(sign(H), axis=0)
5:   c ← ReLU(−H diag(s)).max(axis=0) ⊙ s
6:   X^{(ℓ+1)} ← σ(X^{(ℓ)} (W^{(ℓ)})^T + 1_N (b^{(ℓ)} + c)^T)
7:   V^{(ℓ+1)} ← σ(V^{(ℓ)} (W^{(ℓ)})^T + 1_P (b^{(ℓ)} + c)^T)
8: end for
Ensure: X^{(L)}  (evaluate loss and back-propagate as usual)

Test-time procedure (to run once post-training): run the train-time procedure without X, X^{(1)}, ..., and set b^{(ℓ)} ← b^{(ℓ)} + c, ∀ℓ, with c from line 5.

Proposition 1. A sufficient condition for layer ℓ to be linear within the input-space region defined by V (recall Eq. (8)) is

$$
0 \leq \min_{(p,k) \in [P] \times [D^{(\ell+1)}]} H_{p,k}\, s_k,
\tag{9}
$$

with H ≜ V^{(ℓ)} (W^{(ℓ)})^T + 1_P (b^{(ℓ)})^T and with s_k being 1 if #{i : H_{i,k} > 0} ≥ P/2 and −1 otherwise.

Proof. When Eq. (9) holds, every unit k has the same pre-activation sign s_k at all P vertices, hence the vertices share a common sign pattern at layer ℓ and the layer is linear on the region by the argument of Theorem 1.
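Below is a minimal NumPy sketch of Algorithm 1's train-time procedure for a dense leaky-ReLU network; this is our own illustrative reproduction under those assumptions (function and variable names are ours), not the official implementation from the repository.

```python
import numpy as np

def police_forward(X, V, weights, biases, alpha=0.01):
    """POLICE-constrained forward pass (a sketch of Algorithm 1, train time).

    X: (N, D) mini-batch; V: (P, D) vertices of the region R;
    weights[l]: (D_out, D_in); biases[l]: (D_out,); leaky-ReLU slope alpha.
    Returns the constrained output X^{(L)} and the per-layer bias shifts c^{(l)}.
    """
    leaky = lambda Z: np.where(Z > 0, Z, alpha * Z)
    shifts = []
    for W, b in zip(weights, biases):
        H = V @ W.T + b                                        # vertex pre-activations
        s = np.where((H > 0).sum(0) >= len(V) / 2, 1.0, -1.0)  # majority sign per unit
        c = np.maximum(-H * s, 0.0).max(0) * s                 # smallest sign-agreeing shift
        X = leaky(X @ W.T + (b + c))                           # mini-batch path
        V = leaky(V @ W.T + (b + c))                           # vertex path
        shifts.append(c)
    return X, shifts
```

At test time the shifts are folded into the biases once (b^{(ℓ)} ← b^{(ℓ)} + c^{(ℓ)}), so inference has zero overhead; on conv(V) the resulting network is exactly affine up to floating-point error.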

Empirical Validation

We now empirically validate the POLICE method of Algorithm 1. First, we point out that POLICE can be applied regardless of the downstream task one aims to solve, e.g. classification or regression, as presented in Figs. 1 and 2 respectively. In either case, the exact same methodology is employed (Algorithm 1). To complement the classification case, we also provide in Fig. 3 the evolution of the DNN approximation at different training steps. We observe how POLICE enables gradient-based learning to take the constraints into account so that the approximation remains of high quality despite the constraint enforcement. Crucially, POLICE provides strict constraint enforcement at any point during training, requiring no tuning or sampling. We provide training times in Table 1 for the case where R is the simplex of the considered ambient space, and remind the reader that POLICE has zero overhead at test time. We observe that, especially as the dimension or the width of the model grows, POLICE incurs relatively low computational overhead; in all studied cases the slow-down factor lies between 1.4 and 6.6. Additionally, those numbers were obtained from a naive PyTorch reproduction of Algorithm 1, and we expect the slow-down factor to diminish greatly with an optimized implementation, which we leave for future work.

Table 1. Average (over 1024 runs) time (in ms) and standard deviation to compute a forward pass and a backward pass with gradient accumulation, for a standard DNN without any constraint enforced versus a POLICEd DNN, for various input dimensions, widths, and depths, given a mini-batch size of 1024. We recall that at test time a POLICEd DNN has zero overhead (recall Algorithm 1).

Fig. 2. Regression task from a 2-dimensional space to a 1-dimensional space, with the target function depicted in the left column, using a leaky-ReLU MLP of depth 4 and width 256, and for which POLICE constrains the DNN to be an affine mapping within the region R delimited by the black lines, as depicted in the second column. The right column provides a zoom-in demonstrating how gradient-based learning adapted the DNN parameters to approximate the target function very accurately.

Fig. 3. Evolution of the DNN function approximation to the target function (left) at different training steps (5, 50, 10000) in each column. At any point during training, POLICE strictly enforces the constraints on the region R delimited by the black lines in a differentiable way, allowing gradient-based learning to discover a good value for the DNN parameters.


Conclusion and Future Work

We proposed a novel algorithm, POLICE, to provably enforce affine constraints in DNNs that employ continuous piecewise-affine nonlinearities such as (leaky-)ReLU, absolute value, or max-pooling. Given a polytopal region R, POLICE enforces the DNN to be affine on R merely by forward-passing the vertices defining R through the DNN (Algorithm 1), making it computationally efficient (Table 1) and lightweight to implement (no change in the model or optimizer). We believe that strict enforcement of affine constraints is key to enabling critical applications to leverage DNNs, e.g. to enforce Eqs. (1) and (2). Among future directions, we hope to explore constraint enforcement on multiple regions simultaneously, and the extension to DNNs employing smooth activation functions using a probabilistic argument as done e.g. in [24, 25]. Regarding applications of POLICE, we believe that implicit representation DNNs [26] could be fruitful beneficiaries.




References

Deep Neural Networks (DNNs) outshine alternative function approximators in many settings thanks to their modularity in composing any desired differentiable operator. The formed parametrized functional is then tuned to solve a task at hand from simple gradient descent. This modularity comes at the cost of making strict enforcement of constraints on DNNs, e.g. from a priori knowledge of the task, or from desired physical properties, an open challenge. In this paper we propose the first provable affine constraint enforcement method for DNNs that only requires minimal changes into a given DNN’s forward-pass, that is computationally friendly, and that leaves the optimization of the DNN’s parameter to be unconstrained, i.e. standard gradient-based method can be employed. Our method does not require any sampling and provably ensures that the DNN fulfills the affine constraint on a given input space’s region at any point during training, and testing. We coin this method POLICE, standing for Provably Optimal LInear Constraint Enforcement. Github:https://github.com/RandallBalestriero/POLICE

Deep Neural Networks (DNNs) are compositions of interleaved linear and nonlinear operators forming a differentiable functional f𝜽subscript𝑓𝜽f_{{\bm{\theta}}} governed by some parameters 𝜽𝜽{\bm{\theta}} [1]. DNNs are universal approximators, but so are many other alternatives e.g. Fourier series [2], kernel machines [3], and decision trees [4]. The strength of DNNs rather lies in the ability of the function f𝜽subscript𝑓𝜽f_{{\bm{\theta}}} to quickly and relatively easily fit most encountered target functional f∗superscript𝑓f^{*} –often known through a given and finite training set– only by performing gradient updates on its parameters 𝜽𝜽{\bm{\theta}} [5]. This success, combined with the development of software and hardware making DNNs’ training and inference efficient, have made DNNs the operator of choice for most tasks encountered in machine learning and pattern recognition [6].

Although ubiquitous for general purpose function fitting, DNNs have had a much harder time to shine in settings where specific constraints must be enforced exactly on the learned mapping. A motivating example follows from the following optimization problem

where 𝑱𝑱{\bm{J}} is the Jacobian operator and 𝑨𝑨{\bm{A}} is a target matrix that the model f𝜽subscript𝑓𝜽f_{{\bm{\theta}}}’s Jacobian should match for any input 𝒙𝒙{\bm{x}} in a given region R𝑅R of the input space. We employed explicitly a regularizer ℛℛ\mathcal{R} with associated hyper-parameter λ𝜆\lambda as one commonly employs weight-decay i.e. ℓ2subscriptℓ2\ell_{2} regularization on 𝜽𝜽{\bm{\theta}} [7]. The type of constraint of Eq. 1 is crucial in a variety of applications e.g. based on informed physical properties that f𝜽subscript𝑓𝜽f_{{\bm{\theta}}} should obey [8, 9, 10] or based on (adversarial) noise robustness constraints [11, 12]. Another motivating example emerges when one has a priori knowledge that the target mapping f∗superscript𝑓f^{} is itself linear with parameters 𝑨∗,𝒃∗superscript𝑨superscript𝒃{\bm{A}}^{},{\bm{b}}^{*} within a region R𝑅R leading to the following optimization problem

It is easy to see that one might arrive at numerous variants of Eqs. 1 and 2’s constraints. Crucial to our study is the observation that in all cases the necessary constraint one must provably and strictly enforce into f𝛉subscript𝑓𝛉f_{{\bm{\theta}}} is to be affine within R𝑅R. As a result, we summarize the generalized affine constraint that will interest us as follows

and from which any more specific constraint can be efficiently enforced. For example, given Eq. 3, imposing Eq. 1 is straightforward by using a barrier method [13] with ‖𝑱​f𝜽​(𝒗)−𝑨‖norm𝑱subscript𝑓𝜽𝒗𝑨|{\bm{J}}f_{{\bm{\theta}}}({\bm{v}})-{\bm{A}}| which can be done only by sampling a single sample 𝒗∈R𝒗𝑅{\bm{v}}\in R and computing the DNN’s Jacobian matrix at that point. The same goes for imposing Eq. 2 given that Eq. 3 is fulfilled.

The main question that naturally arises is how to leverage DNNs while imposing Eq. 3. Current constraint enforcement with DNNs have so far only considered questions of verification [14, 15, 16] or constrained parameter space e.g. with integer parameters 𝜽𝜽{\bm{\theta}} [17] leaving us with little help to solve Eq. 3. A direct regularization-based approach suffers from the curse of dimensionality [18, 19] since sampling R𝑅R to ensure that f𝜽subscript𝑓𝜽f_{{\bm{\theta}}} is affine requires exponentially many points as the dimension of R𝑅R increases. A more direct approach e.g. defining an alternative mapping g𝜽,𝑨,𝒃subscript𝑔𝜽𝑨𝒃g_{{\bm{\theta}},{\bm{A}},{\bm{b}}} such as

would destroy key properties such as continuity at the boundary ∂R𝑅\partial R. In short, we are in need of a principled and informed solution to strictly enforce Eq. 3 in arbitrarily wide and/or deep DNNs.

In this paper, we propose a theoretically motivated and provably optimal solution to enforce the affine constraint of Eq. 3 into DNNs. In particular, by focusing on convex polytopal regions R𝑅R, we are able to exactly enforce the model to be affine on R𝑅R; a visual depiction of the method at work is given in Fig. 1, and the pseudo-code is given in Algorithm 1. Our method has linear asymptotic complexity with respect to the number of vertices P𝑃P defining (or approximating) the region R𝑅R i.e. it has time-complexity 𝒪​(K​P)𝒪𝐾𝑃\mathcal{O}(KP) where K𝐾K is the complexity of the model’s forward pass. Hence the proposed strategy will often be unimpaired by the curse of dimensionality since for example a simplicial region R⊂ℝD𝑅superscriptℝ𝐷R\subset\mathbb{R}^{D} has only D+1𝐷1D+1 vertices [20] i.e. our method in this case has complexity growing linearly with the input space dimension. Our method is exact for any DNN architecture in which the nonlinearities are such as (leaky-)ReLU, absolute value, max-pooling and the likes, extension to smooth nonlinearities is discussed in Section 3. We coin our method POLICE standing for Provably Optimal LInear Constraint Enforcement; and we denoted a constrained DNN as being POLICEd. Two crucial benefits of POLICE are that the constraint enforcement is exact without requiring any sampling, and that the standard gradient-based training can be used on the POLICEd DNN parameters without any changes.

In this section, we develop the proposed POLICE method that relies on two ingredients: (i) a DNN f𝜽subscript𝑓𝜽f_{{\bm{\theta}}} using continuous piecewise-affine (CPA) nonlinearities which we formally introduce in Section 2.1; and (ii) a convex polytopal region R𝑅R. The derivations of the proposed method will be carried in Section 2.2 and will empirically validated in Section 2.3.

We denote the DNN input-output mapping as f𝜽:ℝD↦ℝK:subscript𝑓𝜽maps-tosuperscriptℝ𝐷superscriptℝ𝐾f_{{\bm{\theta}}}:\mathbb{R}^{D}\mapsto\mathbb{R}^{K} with 𝜽𝜽{\bm{\theta}} the governing parameters of the mapping. In all generality, a DNN is formed by a composition of many layers as in

where each layer mapping f(ℓ):ℝD(ℓ)↦ℝD(ℓ+1):superscript𝑓ℓmaps-tosuperscriptℝsuperscript𝐷ℓsuperscriptℝsuperscript𝐷ℓ1f^{(\ell)}:\mathbb{R}^{D^{(\ell)}}\mapsto\mathbb{R}^{D^{(\ell+1)}} produces a feature map; with D(1)≜D≜superscript𝐷1𝐷D^{(1)}\triangleq D and D(L)≜K≜superscript𝐷𝐿𝐾D^{(L)}\triangleq K. For most architectures, i.e. parametrizations of the layers, the input-output mapping producing each feature map takes the form

where σ𝜎\sigma is a pointwise activation function, 𝑾(ℓ)superscript𝑾ℓ{\bm{W}}^{(\ell)} is a weight matrix of dimensions D(ℓ+1)×D(ℓ)superscript𝐷ℓ1superscript𝐷ℓD^{(\ell+1)}\times D^{(\ell)}, and 𝒃(ℓ)superscript𝒃ℓ{\bm{b}}^{(\ell)} is a bias vector of length D(ℓ+1)superscript𝐷ℓ1D^{(\ell+1)}, 𝒉(ℓ)superscript𝒉ℓ{\bm{h}}^{(\ell)} is denoted as the pre-activation map, and the layer parameters are gathered into 𝜽(ℓ)≜{𝑾(ℓ),𝒃(ℓ)}≜superscript𝜽ℓsuperscript𝑾ℓsuperscript𝒃ℓ{\bm{\theta}}^{(\ell)}\triangleq{{\bm{W}}^{(\ell)},{\bm{b}}^{(\ell)}}. The matrix 𝑾(ℓ)superscript𝑾ℓ{\bm{W}}^{(\ell)} will often have a specific structure (e.g., circulant) at different layers. Without loss of generality we consider vectors as inputs since when dealing with images for example, one can always flatten them into vectors and reparametrize the layers accordingly leaving the input-output mapping of the DNN unchanged. The details on the layer mapping will not impact our results. One recent line of research that we will heavily rely on consists in formulating DNNs as Continuous Piecewise Affine (CPA) mappings [21, 22], that be expressed as

where ΩΩ\Omega is the input space partition induced by the DNN architecture111exact relations between architecture and partition are given in [23] and are beyond the scope of this study, ω𝜔\omega is a partition-region, and 𝑨ω,𝒃ωsubscript𝑨𝜔subscript𝒃𝜔{\bm{A}}{\omega},{\bm{b}}{\omega} are the corresponding per-region slope and offset parameters. A Key result that we will build upon is that the CPA formulation of Eq. 6 represents exactly the DNN functional of Eqs. 4 and 5, when the nonlinearities σ𝜎\sigma are themselves CPA e.g. (leaky-)ReLU, absolute value, max-pooling [22].

Equipped with the CPA notations expressing DNNs as per-region affine mappings (recall Eq. 6) we now derive the proposed POLICE algorithm that will strictly enforce the POLICEd DNN to fulfill the affine constraint from Eq. 3) on a region R𝑅R which is convex. We will then empirically validate the method in the following Section 2.3.

As mentioned above, our goal is to enforce the DNN to be a simple affine mapping on a convex region R𝑅R. In particular, we will consider the V𝑉V-representation of the region R𝑅R i.e. we consider P𝑃P vertices 𝒗1,…,𝒗Psubscript𝒗1…subscript𝒗𝑃{\bm{v}}{1},\dots,{\bm{v}}{P} so that the considered region R𝑅R is, or can be closely approximated by, the convex hull of those vertices as in

where we gathered the $P$ vertices into the $P\times D$ matrix

We highlight that the actual ordering of the rows of ${\bm{V}}$ in Eq. 8, i.e. which vertex is put in which row, does not impact our result. We also denote by ${\bm{v}}^{(\ell)}_{p}$ the $p^{\rm th}$ vertex mapped through the first $\ell-1$ layers, starting with ${\bm{v}}_{p}$ for $\ell=1$, and the corresponding matrix by ${\bm{V}}^{(\ell)}$. By leveraging the CPA property of the DNN $f_{{\bm{\theta}}}$, one should notice that a necessary and sufficient condition for the model to stay affine within $R$ is that the vertices share the same pre-activation sign patterns (recall Eq. 5), which we formalize below.

Theorem 1. A necessary and sufficient condition for a CPA DNN as per Eq. 6 to stay affine within a region $R$ given by Eq. 7 is that the pre-activation sign patterns coincide across the vertices ${\bm{v}}_{p},\forall p\in[P]$, and that at each layer.

The proof is direct. If the pre-activation signs are the same for all vertices, e.g. ${\rm sign}({\bm{h}}^{(1)}({\bm{v}}_{p}))={\rm sign}({\bm{h}}^{(1)}({\bm{v}}_{q})), \forall (p,q)\in[P]^{2}$ for the first layer, then the vertices all lie within the same partition region, i.e. $\exists \omega^{*}\in\Omega$ such that ${\bm{v}}_{p}\in\omega^{*},\forall p$. Using Eq. 7 we obtain that $R\subset\omega^{*}$. Now, because the partition regions $\omega$ that form the DNN partition $\Omega$ (recall Eq. 6) are themselves convex polytopes, all the points within $\omega^{*}$, and thus within $R$ since $R\subset\omega^{*}$, also have the same pre-activation patterns; hence the entire DNN input-output mapping stays affine within $R$, concluding the proof. ∎
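The argument can be checked numerically. In the sketch below (our own construction, not the paper's code), weights, biases and vertices are hand-picked in the positive orthant so that every pre-activation is strictly positive; the vertices therefore share one sign pattern, and the network is verified to be affine on their convex hull by testing random convex combinations.

```python
import numpy as np

rng = np.random.default_rng(1)
ALPHA = 0.1

# Positive weights/biases plus vertices in the positive orthant guarantee
# strictly positive pre-activations, i.e. identical sign patterns.
W1 = rng.uniform(0.1, 1.0, size=(4, 2)); b1 = rng.uniform(0.1, 1.0, size=4)
W2 = rng.normal(size=(1, 4)); b2 = rng.normal(size=1)

def f(x):
    h = W1 @ x + b1
    return W2 @ np.where(h > 0, h, ALPHA * h) + b2

V = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])  # triangle vertices

# Same sign pattern at every vertex ...
patterns = [np.sign(W1 @ v + b1) for v in V]
assert all(np.array_equal(p, patterns[0]) for p in patterns)

# ... hence f is affine on the hull: since the Dirichlet weights sum to 1,
# an affine map satisfies f(sum_i l_i v_i) = sum_i l_i f(v_i).
for _ in range(100):
    lam = rng.dirichlet(np.ones(3))
    x = lam @ V
    assert np.allclose(f(x), sum(l * f(v) for l, v in zip(lam, V)))
```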

Theorem 1 provides a necessary and sufficient condition but does not yet answer the principal question of how to ensure that the vertices have the same pre-activation sign patterns. There are many ways to enforce this; in our study we focus on the method we deem simplest and leave further comparisons for future work.

Proposition 1. A sufficient condition for layer $\ell$ to be linear within the input-space region defined by ${\bm{V}}$ (recall Eq. 8) is

with ${\bm{H}}\triangleq{\bm{V}}^{(\ell)}({\bm{W}}^{(\ell)})^{T}+\mathbf{1}_{P}({\bm{b}}^{(\ell)})^{T}$ and with ${\bm{s}}_{k}$ equal to $1$ if $\#\{i:({\bm{H}})_{i,k}>0\}\geq P/2$ and $-1$ otherwise.

In the above, ${\bm{H}}$ is simply the pre-activation of layer $\ell$ given input ${\bm{V}}^{(\ell)}$, and Eq. 9 ensures that all vertices have the same sign pattern as ${\bm{s}}$; the latter determines on which side of each hyperplane the region ${\bm{V}}^{(\ell)}$ is projected. Combining Theorem 1 and Proposition 1 leads to the POLICE algorithm given in Algorithm 1. Note that our choice of majority vote for ${\bm{s}}$ is arbitrary; other options would be valid as well, depending on the desired properties POLICE should fulfill. For example, one could pick that direction based on margins instead. We now turn to empirically validating the method in a variety of tasks in the following section.
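As a concrete single-layer illustration, the numpy sketch below (our own rendition; the official implementation lives in the linked Github repository) computes ${\bm{H}}$, takes the majority-vote sign ${\bm{s}}$, and shifts each bias entry just enough that every vertex lands on the side of each hyperplane prescribed by ${\bm{s}}$. The minimal-shift projection is one natural choice; the proposition only requires that Eq. 9 ends up satisfied.

```python
import numpy as np

def police_layer_bias(V, W, b):
    """Return a shifted bias so that every row of H = V W^T + 1 b^T has,
    per unit, the sign chosen by majority vote (Eq. 9)."""
    H = V @ W.T + b                      # (P, K) vertex pre-activations
    s = np.where((H > 0).sum(0) >= V.shape[0] / 2, 1.0, -1.0)
    # Minimal per-unit shift making min_p H[p, k] * s[k] >= 0:
    shift = np.maximum(0.0, -(H * s).min(0)) * s
    return b + shift

rng = np.random.default_rng(2)
V = rng.normal(size=(4, 3))              # P = 4 vertices in R^3
W = rng.normal(size=(5, 3)); b = rng.normal(size=5)

b_new = police_layer_bias(V, W, b)
H = V @ W.T + b_new
s = np.where((H > 0).sum(0) >= 2, 1.0, -1.0)
assert (H * s).min() >= 0                # Eq. 9 holds after the projection
```

Only the bias moves; the weight matrix and the gradient-based optimizer are untouched, which is what keeps training fully unconstrained.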

Algorithm 1 (train-time procedure, run for each mini-batch ${\bm{X}}$): forward the vertices of $R$ through the layers alongside the mini-batch and project each layer's bias so that Eq. 9 holds; see the Github repository for the exact listing. [Figure legend: target function; POLICE-constrained DNN approximation.]
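The pieces above can be assembled into an end-to-end train-time forward pass. The following numpy mock-up is a hypothetical sketch in the spirit of Algorithm 1 (names and structure are ours; the authoritative listing is the paper's Github repository): at each layer the current vertex embeddings ${\bm{V}}^{(\ell)}$ are forwarded alongside the mini-batch, the bias is projected so that Eq. 9 holds, and the resulting network is affine on the hull of the original vertices by Theorem 1.

```python
import numpy as np

ALPHA = 0.1  # leaky-ReLU negative slope

def lrelu(h):
    return np.where(h > 0, h, ALPHA * h)

def policed_forward(X, V, params):
    """Forward a mini-batch X (N, D) while enforcing affinity on conv(V).

    params: list of (W, b) per dense layer. Each stored bias is replaced by
    its projection (Eq. 9) so the vertices keep one common sign pattern."""
    for i, (W, b) in enumerate(params):
        H = V @ W.T + b                                   # vertex pre-activations
        s = np.where((H > 0).sum(0) >= V.shape[0] / 2, 1.0, -1.0)
        b = b + np.maximum(0.0, -(H * s).min(0)) * s      # minimal bias shift
        X, V = X @ W.T + b, V @ W.T + b
        if i < len(params) - 1:                           # linear output layer
            X, V = lrelu(X), lrelu(V)
    return X

rng = np.random.default_rng(3)
params = [(rng.normal(size=(16, 2)), rng.normal(size=16)),
          (rng.normal(size=(16, 16)), rng.normal(size=16)),
          (rng.normal(size=(1, 16)), rng.normal(size=1))]
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])        # 2-simplex region R

# Affinity on R: the output at any convex combination of the vertices
# equals the same combination of the vertex outputs.
fV = policed_forward(V, V, params)
for _ in range(50):
    lam = rng.dirichlet(np.ones(3))
    out = policed_forward((lam @ V)[None, :], V, params)
    assert np.allclose(out[0], lam @ fV)
```

Since the projection depends only on the vertices and the current parameters, it is recomputed at every forward pass during training and becomes a fixed bias at test time, consistent with the zero test-time overhead claimed in Table 1.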

We now empirically validate the proposed POLICE method from Algorithm 1. First, we point out that POLICE can be applied regardless of the downstream task one aims to solve, e.g. classification or regression, as presented in Figs. 1 and 2 respectively. In either case the exact same methodology is employed (Algorithm 1). To complement the classification case, we also provide in Fig. 3 the evolution of the DNN approximation at different training steps. We observe how POLICE enables gradient-based learning to take the constraints into account, so that the approximation remains of high quality despite the constraint enforcement. Crucially, POLICE provides strict constraint enforcement at any point during training, requiring no tuning or sampling. We provide training times in Table 1 for the case of $R$ being the simplex of the considered ambient space, and remind the reader that POLICE has zero overhead at test time. We observe that, especially as the dimension or the width of the model grows, POLICE incurs relatively low computational overhead; in all studied cases the slow-down factor lies between 1.4 and 6.6. Additionally, those numbers were obtained from a naive PyTorch reproduction of Algorithm 1, and we expect the slow-down factor to diminish greatly with an optimized implementation, which we leave for future work.


We proposed a novel algorithm, POLICE, to provably enforce affine constraints in DNNs that employ continuous piecewise-affine nonlinearities such as (leaky-)ReLU, absolute value, or max-pooling. Given a polytopal region $R$, POLICE enforces the DNN to be affine on $R$ solely through a forward pass of the vertices defining $R$ (Algorithm 1), making it computationally efficient (Table 1) and lightweight to implement (no change to the model or the optimizer). We believe that strict enforcement of affine constraints is key to enabling critical applications to leverage DNNs, e.g. to enforce Eqs. 1 and 2. Among future directions, we hope to explore constraint enforcement on multiple regions simultaneously, and the extension to DNNs employing smooth activation functions using a probabilistic argument as done e.g. in [24, 25]. Regarding applications of POLICE, we believe that implicit neural representation DNNs [26] could be fruitful beneficiaries.

Table 1: Average (over 1024 runs) time (in ms) and standard deviation to compute a forward pass and a backward pass with gradient accumulation, for a standard DNN without any constraint enforced versus a POLICEd DNN, for various input dimensions, widths, and depths, given a mini-batch size of 1024. We recall that at test time a POLICEd DNN has zero overhead (recall Algorithm 1).

widths $D^{(\ell)}$ |  256  |   64  |  4096  | 1024  |  1024  |   4096
no constr.          | 0.9±0 | 1.3±0 | 27.9±3 | 0.9±0 |  3.1±0 |  51.9±5
POLICEd             | 4.3±0 | 7.7±1 | 40.3±1 | 5.2±0 | 20.4±1 | 202.2±12
slow-down           | ×4.5  | ×5.7  | ×1.4   | ×5.5  | ×6.6   | ×3.9

Fig. 1: Classification task of orange versus purple samples. The learned decision boundary (red) comes from an SGD-trained POLICEd DNN of depth 3 with leaky-ReLU and dense layers of width 256; POLICE provably enforces the DNN to be affine within the region delimited by the black lines through a simple projection of each layer's bias parameters (Algorithm 1), while standard unconstrained optimization can still be used to train the DNN. See Fig. 2 for regression tasks.

Fig. 2: Regression task from a 2-dimensional space to a 1-dimensional space, with the target function depicted in the left column, using a leaky-ReLU MLP of depth 4 and width 256 for which POLICE constrains the DNN to be an affine mapping within the region $R$ delimited by the black lines, as depicted in the second column. The right column provides a zoom-in demonstrating how gradient-based learning adapted the DNN parameters to approximate the target function very accurately.

Fig. 3: Evolution of the DNN approximation of the target function (left) at different training steps (5, 50, 10000) in each column. At any point during training, POLICE strictly enforces the constraints on the region $R$ delimited by the black lines, in a differentiable way that allows gradient-based learning to discover good values for the DNN parameters.

$$ \displaystyle\min_{{\bm{\theta}}}\|f_{{\bm{\theta}}}-f^{*}\|+\lambda\mathcal{R}({\bm{\theta}}) $$

$$ \displaystyle f_{{\bm{\theta}}}=\left(f_{{\bm{\theta}}^{(L)}}^{(L)}\circ\dots\circ f_{{\bm{\theta}}^{(1)}}^{(1)}\right), $$

$$ \displaystyle f_{{\bm{\theta}}}({\bm{x}})=\sum_{\omega\in\Omega}({\bm{A}}_{\omega}{\bm{x}}+{\bm{b}}_{\omega})1_{\{{\bm{x}}\in\omega\}}, $$

$$ \displaystyle{\bm{V}}\triangleq[{\bm{v}}_{1},\dots,{\bm{v}}_{P}]^{T}. $$

$$ \displaystyle 0\leq\min_{(p,k)\in[P]\times[D^{(\ell+1)}]}{\bm{H}}_{p,k}\,{\bm{s}}_{k} $$


References

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[2] Ronald N. Bracewell, The Fourier Transform and Its Applications, McGraw-Hill, New York, 1986.
[3] Jooyoung Park and Irwin W. Sandberg, "Universal approximation using radial-basis-function networks," Neural Computation, vol. 3, no. 2, pp. 246-257, 1991.
[4] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of Machine Learning, MIT Press, 2018.
[5] Yann Le Cun, Ido Kanter, and Sara A. Solla, "Eigenvalues of covariance matrices: Application to neural-network learning," Physical Review Letters, vol. 66, no. 18, p. 2396, 1991.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, vol. 1, MIT Press, 2016.
[7] Anders Krogh and John Hertz, "A simple weight decay can improve generalization," Advances in Neural Information Processing Systems, vol. 4, 1991.
[8] Lipman Bers, Fritz John, and Martin Schechter, Partial Differential Equations, American Mathematical Society, 1964.
[9] Stanley J. Farlow, Partial Differential Equations for Scientists and Engineers, Courier Corporation, 1993.
[10] Lawrence C. Evans, Partial Differential Equations, vol. 19, American Mathematical Society, 2010.
[11] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[12] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li, "Adversarial examples: Attacks and defenses for deep learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 9, pp. 2805-2824, 2019.
[13] Anders Forsgren, Philip E. Gill, and Margaret H. Wright, "Interior methods for nonlinear optimization," SIAM Review, vol. 44, no. 4, pp. 525-597, 2002.
[14] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening, "Concolic testing for deep neural networks," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 109-119.
[15] Eric Wong and Zico Kolter, "Provable defenses against adversarial examples via the convex outer adversarial polytope," in International Conference on Machine Learning, PMLR, 2018, pp. 5286-5295.
[16] Changliu Liu, Tomer Arnon, Christopher Lazarus, Christopher Strong, Clark Barrett, Mykel J. Kochenderfer, et al., "Algorithms for verifying deep neural networks," Foundations and Trends® in Optimization, vol. 4, no. 3-4, pp. 244-404, 2021.
[17] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi, "Training and inference with integers in deep neural networks," arXiv preprint arXiv:1802.04680, 2018.
[18] Richard E. Bellman and Stuart E. Dreyfus, Applied Dynamic Programming, vol. 2050, Princeton University Press, 2015.
[19] Mario Köppen, "The curse of dimensionality," in 5th Online World Conference on Soft Computing in Industrial Applications (WSC5), 2000, vol. 1, pp. 4-8.
[20] Herbert Edelsbrunner, Algorithms in Combinatorial Geometry, vol. 10, Springer Science & Business Media, 1987.
[21] Guido F. Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio, "On the number of linear regions of deep neural networks," Advances in Neural Information Processing Systems, vol. 27, 2014.
[22] Randall Balestriero and Richard Baraniuk, "A spline theory of deep learning," in International Conference on Machine Learning, PMLR, 2018, pp. 374-383.
[23] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk, "The geometry of deep networks: Power diagram subdivision," Advances in Neural Information Processing Systems, vol. 32, 2019.
[24] Randall Balestriero and Richard G. Baraniuk, "From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference," arXiv preprint arXiv:1810.09274, 2018.
[25] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk, "Polarity sampling: Quality and diversity control of pre-trained generative networks via singular values," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10641-10650.
[26] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein, "Implicit neural representations with periodic activation functions," Advances in Neural Information Processing Systems, vol. 33, pp. 7462-7473, 2020.



[witten2009penalized] Witten, Daniela M, Tibshirani, Robert, Hastie, Trevor. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics.

[healy1957rotation] Healy, MJR. (1957). A rotation method for computing canonical correlations. Mathematics of Computation.

[xue2014does] Xue, Jing-Hao, Hall, Peter. (2014). Why does rebalancing class-unbalanced data improve AUC for linear discriminant analysis?. IEEE transactions on pattern analysis and machine intelligence.

[mika1999fisher] Mika, Sebastian, Ratsch, Gunnar, Weston, Jason, Scholkopf, Bernhard, Mullers, Klaus-Robert. (1999). Fisher discriminant analysis with kernels. Neural networks for signal processing IX: Proceedings of the 1999 IEEE signal processing society workshop (cat. no. 98th8468).

[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.

[tosh2021contrastive] Tosh, Christopher, Krishnamurthy, Akshay, Hsu, Daniel. (2021). Contrastive learning, multi-view redundancy, and linear models. Algorithmic Learning Theory.

[tian2022deep] Tian, Yuandong. (2022). Deep Contrastive Learning is Provably (almost) Principal Component Analysis. arXiv preprint arXiv:2201.12680.

[bach2005probabilistic] Bach, Francis R, Jordan, Michael I. (2005). A probabilistic interpretation of canonical correlation analysis.

[haochen2022beyond] HaoChen, Jeff Z, Wei, Colin, Kumar, Ananya, Ma, Tengyu. (2022). Beyond Separability: Analyzing the Linear Transferability of Contrastive Representations to Related Subpopulations. arXiv preprint arXiv:2204.02683.

[pfau2018spectral] David Pfau, Stig Petersen, Ashish Agarwal, David G. T. Barrett, Kimberly L. Stachenfeld. (2019). Spectral Inference Networks: Unifying Deep and Spectral Learning. International Conference on Learning Representations.

[baevski2020wav2vec] Baevski, Alexei, Zhou, Yuhao, Mohamed, Abdelrahman, Auli, Michael. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems.

[higham1988computing] Higham, Nicholas J. (1988). Computing a nearest symmetric positive semidefinite matrix. Linear algebra and its applications.

[geirhos2020surprising] Geirhos, Robert, Narayanappa, Kantharaju, Mitzkus, Benjamin, Bethge, Matthias, Wichmann, Felix A, Brendel, Wieland. (2020). On the surprising similarities between supervised and self-supervised models. arXiv preprint arXiv:2010.08377.

[golub1971singular] Golub, Gene H, Reinsch, Christian. (1971). Singular value decomposition and least squares solutions. Linear algebra.

[hestenes1958inversion] Hestenes, Magnus R. (1958). Inversion of matrices by biorthogonalization and related results. Journal of the Society for Industrial and Applied Mathematics.

[eckart1936approximation] Eckart, Carl, Young, Gale. (1936). The approximation of one matrix by another of lower rank. Psychometrika.

[horn2012matrix] Horn, Roger A, Johnson, Charles R. (2012). Matrix analysis.

[ioffe2006probabilistic] Ioffe, Sergey. (2006). Probabilistic linear discriminant analysis. European Conference on Computer Vision.

[goyal2019scaling] Goyal, Priya, Mahajan, Dhruv, Gupta, Abhinav, Misra, Ishan. (2019). Scaling and benchmarking self-supervised visual representation learning. Proceedings of the ieee/cvf International Conference on computer vision.

[cortes1995support] Cortes, Corinna, Vapnik, Vladimir. (1995). Support-vector networks. Machine learning.

[wang2020understanding] Wang, Tongzhou, Isola, Phillip. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. International Conference on Machine Learning.

[arora2019theoretical] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.

[kalofolias2016learn] Kalofolias, Vassilis. (2016). How to learn a graph from smooth signals. Artificial Intelligence and Statistics.

[dong2016learning] Dong, Xiaowen, Thanou, Dorina, Frossard, Pascal, Vandergheynst, Pierre. (2016). Learning Laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing.

[koohpayegani2021mean] Koohpayegani, Soroush Abbasi, Tejankar, Ajinkya, Pirsiavash, Hamed. (2021). Mean shift for self-supervised learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[dwibedi2021little] Dwibedi, Debidatta, Aytar, Yusuf, Tompson, Jonathan, Sermanet, Pierre, Zisserman, Andrew. (2021). With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[roweis2000nonlinear] Roweis, Sam T, Saul, Lawrence K. (2000). Nonlinear dimensionality reduction by locally linear embedding. science.

[shi2020run] Shi, Haizhou, Luo, Dongliang, Tang, Siliang, Wang, Jian, Zhuang, Yueting. (2020). Run away from your teacher: Understanding byol by a novel self-supervised approach. arXiv preprint arXiv:2011.10944.

[gidaris2018unsupervised] Gidaris, Spyros, Singh, Praveer, Komodakis, Nikos. (2018). Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.

[ericsson2021well] Ericsson, Linus, Gouk, Henry, Hospedales, Timothy M. (2021). How well do self-supervised models transfer?. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[kanazawa2016warpnet] Kanazawa, Angjoo, Jacobs, David W, Chandraker, Manmohan. (2016). Warpnet: Weakly supervised matching for single-view reconstruction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[bordes2021high] Bordes, Florian, Balestriero, Randall, Vincent, Pascal. (2021). High Fidelity Visualization of What Your Self-Supervised Representation Knows About. arXiv preprint arXiv:2112.09164.

[novotny2018self] Novotny, David, Albanie, Samuel, Larlus, Diane, Vedaldi, Andrea. (2018). Self-supervised learning of geometrically stable features through probabilistic introspection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[o1985manova] O'Brien, Ralph G, Kaiser, Mary K. (1985). MANOVA method for analyzing repeated measures designs: an extensive primer.. Psychological bulletin.

[xu2019self] Xu, Dejing, Xiao, Jun, Zhao, Zhou, Shao, Jian, Xie, Di, Zhuang, Yueting. (2019). Self-supervised spatiotemporal learning via video clip order prediction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[kim2019self] Kim, Dahun, Cho, Donghyeon, Kweon, In So. (2019). Self-supervised video representation learning with space-time cubic puzzles. Proceedings of the AAAI conference on artificial intelligence.

[sermanet2018time] Sermanet, Pierre, Lynch, Corey, Chebotar, Yevgen, Hsu, Jasmine, Jang, Eric, Schaal, Stefan, Levine, Sergey, Brain, Google. (2018). Time-contrastive networks: Self-supervised learning from video. 2018 IEEE international conference on robotics and automation (ICRA).

[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. International conference on machine learning.

[huberty2006applied] Huberty, Carl J, Olejnik, Stephen. (2006). Applied MANOVA and discriminant analysis.

[haochen2021provable] HaoChen, Jeff Z, Wei, Colin, Gaidon, Adrien, Ma, Tengyu. (2021). Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems.

[bromley1993signature] Bromley, Jane, Guyon, Isabelle, LeCun, Yann, S{. (1993). Signature verification using a. Advances in neural information processing systems.

[misra2020self] Misra, Ishan, Maaten, Laurens van der. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[hastie1995penalized] Hastie, Trevor, Buja, Andreas, Tibshirani, Robert. (1995). Penalized discriminant analysis. The Annals of Statistics.

[bartlett1938further] Bartlett, Maurice S. (1938). Further aspects of the theory of multiple regression. Mathematical Proceedings of the Cambridge Philosophical Society.

[hardoon2004canonical] Hardoon, David R, Szedmak, Sandor, Shawe-Taylor, John. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural computation.

[kursun2011canonical] Kursun, Olcay, Alpaydin, Ethem, Favorov, Oleg V. (2011). Canonical correlation analysis using within-class coupling. Pattern Recognition Letters.

[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, St{'e. (2021). Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230.

[sankararaman2020impact] Sankararaman, Karthik Abinav, De, Soham, Xu, Zheng, Huang, W Ronny, Goldstein, Tom. (2020). The impact of neural network overparameterization on gradient confusion and stochastic gradient descent. International Conference on Machine Learning.

[zhang2019identity] Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Mozer, Michael C, Singer, Yoram. (2019). Identity crisis: Memorization and generalization under extreme overparameterization. arXiv preprint arXiv:1902.04698.

[arora2018optimization] Arora, Sanjeev, Cohen, Nadav, Hazan, Elad. (2018). On the optimization of deep networks: Implicit acceleration by overparameterization. International Conference on Machine Learning.

[neyshabur2018towards] Neyshabur, Behnam, Li, Zhiyuan, Bhojanapalli, Srinadh, LeCun, Yann, Srebro, Nathan. (2018). Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076.

[srebro2004generalization] Srebro, Nathan, Alon, Noga, Jaakkola, Tommi S. (2004). Generalization error bounds for collaborative prediction with low-rank matrices.. NIPS.

[guo2018expandnets] Guo, Shuxuan, Alvarez, Jose M, Salzmann, Mathieu. (2018). Expandnets: Linear over-parameterization to train compact convolutional networks. arXiv preprint arXiv:1811.10495.

[arora2019implicit] Arora, Sanjeev, Cohen, Nadav, Hu, Wei, Luo, Yuping. (2019). Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems.

[hecht1992theory] Hecht-Nielsen, Robert. (1992). Theory of the backpropagation neural network. Neural networks for perception.

[alemohammad2021the] Sina Alemohammad, Zichao Wang, Randall Balestriero, Richard Baraniuk. (2021). The Recurrent Neural Tangent Kernel. International Conference on Learning Representations.

[jacot2018neural] Jacot, Arthur, Gabriel, Franck, Hongler, Cl{'e. (2018). Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572.

[gunasekar2018implicitb] Gunasekar, Suriya, Lee, Jason, Soudry, Daniel, Srebro, Nathan. (2018). Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468.

[kakade2008complexity] Kakade, Sham M, Sridharan, Karthik, Tewari, Ambuj. (2008). On the complexity of linear prediction: Risk bounds, margin bounds, and regularization.

[bartlett2002rademacher] Bartlett, Peter L, Mendelson, Shahar. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research.

[soudry2018implicit] Soudry, Daniel, Hoffer, Elad, Nacson, Mor Shpigel, Gunasekar, Suriya, Srebro, Nathan. (2018). The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research.

[jing2020implicit] Jing, Li, Zbontar, Jure, LeCun, Yann. (2020). Implicit rank-minimizing autoencoder. arXiv preprint arXiv:2010.00679.

[gunasekar2018characterizing] Gunasekar, Suriya, Lee, Jason, Soudry, Daniel, Srebro, Nathan. (2018). Characterizing implicit bias in terms of optimization geometry. International Conference on Machine Learning.

[telgarsky2013margins] Telgarsky, Matus. (2013). Margins, shrinkage, and boosting. International Conference on Machine Learning.

[park1991universal] Park, Jooyoung, Sandberg, Irwin W. (1991). Universal approximation using radial-basis-function networks. Neural computation.

[nocedal1999numerical] Nocedal, Jorge, Wright, Stephen J. (1999). Numerical optimization.

[kelaghan1982barrier] Kelaghan, Joseph, Rubin, George L, Ory, Howard W, Layde, Peter M. (1982). Barrier-method contraceptives and pelvic inflammatory disease. Jama.

[gunasekar2018implicit] Gunasekar, Suriya, Woodworth, Blake, Bhojanapalli, Srinadh, Neyshabur, Behnam, Srebro, Nathan. (2018). Implicit regularization in matrix factorization. 2018 Information Theory and Applications Workshop (ITA).

[yariv2021volume] Yariv, Lior, Gu, Jiatao, Kasten, Yoni, Lipman, Yaron. (2021). Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems.

[bracewell1986fourier] Bracewell, Ronald Newbold, Bracewell, Ronald N. (1986). The Fourier transform and its applications.

[bellman2015applied] Bellman, Richard E, Dreyfus, Stuart E. (2015). Applied dynamic programming.

[srivastava2014dropout] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, Salakhutdinov, Ruslan. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research.

[sitzmann2020implicit] Sitzmann, Vincent, Martel, Julien, Bergman, Alexander, Lindell, David, Wetzstein, Gordon. (2020). Implicit neural representations with periodic activation functions. Advances in Neural Information Processing Systems.

[neyshabur2018role] Neyshabur, Behnam, Li, Zhiyuan, Bhojanapalli, Srinadh, LeCun, Yann, Srebro, Nathan. (2018). The role of over-parametrization in generalization of neural networks. International Conference on Learning Representations.

[wang2017residual] Wang, Fei, Jiang, Mengqing, Qian, Chen, Yang, Shuo, Li, Cheng, Zhang, Honggang, Wang, Xiaogang, Tang, Xiaoou. (2017). Residual attention network for image classification. Proceedings of the IEEE conference on computer vision and pattern recognition.

[koppen2000curse] K{. (2000). The curse of dimensionality. 5th online world conference on soft computing in industrial applications (WSC5).

[boltyanski2012excursions] Boltyanski, Vladimir, Martini, Horst, Soltan, Petru S. (2012). Excursions into combinatorial geometry.

[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition.

[lecun1989backpropagation] LeCun, Yann, Boser, Bernhard, Denker, John S, Henderson, Donnie, Howard, Richard E, Hubbard, Wayne, Jackel, Lawrence D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation.

[edelsbrunner1987algorithms] Edelsbrunner, Herbert. (1987). Algorithms in combinatorial geometry.

[neyshabur2014search] Neyshabur, Behnam, Tomioka, Ryota, Srebro, Nathan. (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614.

[he2016identity] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Identity mappings in deep residual networks. European conference on computer vision.

[neyshabur2017implicit] Neyshabur, Behnam. (2017). Implicit regularization in deep learning. arXiv preprint arXiv:1709.01953.

[le1991eigenvalues] Le Cun, Yann, Kanter, Ido, Solla, Sara A. (1991). Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters.

[grill2017two] Grill, Thomas, Schl{. (2017). Two convolutional neural networks for bird detection in audio signals. 2017 25th European Signal Processing Conference (EUSIPCO).

[zhang2021attention] Zhang, Zhichao, Xu, Shugong, Zhang, Shunqing, Qiao, Tianhao, Cao, Shan. (2021). Attention based convolutional recurrent neural network for environmental sound classification. Neurocomputing.

[rybakov2020streaming] Rybakov, Oleg, Kononenko, Natasha, Subrahmanya, Niranjan, Visontai, Mirko, Laurenzo, Stella. (2020). Streaming keyword spotting on mobile devices. arXiv preprint arXiv:2005.06720.

[mohri2018foundations] Mohri, Mehryar, Rostamizadeh, Afshin, Talwalkar, Ameet. (2018). Foundations of machine learning.

[ioffe2015batch] Ioffe, Sergey, Szegedy, Christian. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. International conference on machine learning.

[jordan2015machine] Jordan, Michael I, Mitchell, Tom M. (2015). Machine learning: Trends, perspectives, and prospects. Science.

[passalis2018learning] Passalis, Nikolaos, Tefas, Anastasios. (2018). Learning deep representations with probabilistic knowledge transfer. Proceedings of the European Conference on Computer Vision (ECCV).

[wolf2020transformers] Wolf, Thomas, Chaumond, Julien, Debut, Lysandre, Sanh, Victor, Delangue, Clement, Moi, Anthony, Cistac, Pierric, Funtowicz, Morgan, Davison, Joe, Shleifer, Sam, others. (2020). Transformers: State-of-the-art natural language processing. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations.

[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, J{. (1997). Long short-term memory. Neural computation.

[long2015fully] Long, Jonathan, Shelhamer, Evan, Darrell, Trevor. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition.

[csimcsek2021geometry] {\c{S. (2021). Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances. arXiv preprint arXiv:2105.12221.

[vapnik2015uniform] Vapnik, Vladimir N, Chervonenkis, A Ya. (2015). On the uniform convergence of relative frequencies of events to their probabilities. Measures of complexity.

[pmlr-v9-glorot10a] Glorot, Xavier, Bengio, Yoshua. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.

[bib1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015.

[bib2] Ronald Newbold Bracewell and Ronald N Bracewell, The Fourier transform and its applications, vol. 31999, McGraw-Hill New York, 1986.

[bib3] Jooyoung Park and Irwin W Sandberg, “Universal approximation using radial-basis-function networks,” Neural computation, vol. 3, no. 2, pp. 246–257, 1991.

[bib4] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of machine learning, MIT press, 2018.

[bib5] Yann Le Cun, Ido Kanter, and Sara A Solla, “Eigenvalues of covariance matrices: Application to neural-network learning,” Physical Review Letters, vol. 66, no. 18, p. 2396, 1991.

[bib6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep learning, vol. 1, MIT Press, 2016.

[bib7] Anders Krogh and John Hertz, “A simple weight decay can improve generalization,” Advances in neural information processing systems, vol. 4, 1991.

[bib8] Lipman Bers, Fritz John, and Martin Schechter, Partial differential equations, American Mathematical Soc., 1964.

[bib9] Stanley J Farlow, Partial differential equations for scientists and engineers, Courier Corporation, 1993.

[bib10] Lawrence C Evans, Partial differential equations, vol. 19, American Mathematical Soc., 2010.

[bib11] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.

[bib12] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li, “Adversarial examples: Attacks and defenses for deep learning,” IEEE transactions on neural networks and learning systems, vol. 30, no. 9, pp. 2805–2824, 2019.

[bib13] Anders Forsgren, Philip E Gill, and Margaret H Wright, “Interior methods for nonlinear optimization,” SIAM review, vol. 44, no. 4, pp. 525–597, 2002.

[bib14] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening, “Concolic testing for deep neural networks,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 109–119.

[bib15] Eric Wong and Zico Kolter, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” in International Conference on Machine Learning. PMLR, 2018, pp. 5286–5295.

[bib16] Changliu Liu, Tomer Arnon, Christopher Lazarus, Christopher Strong, Clark Barrett, Mykel J Kochenderfer, et al., “Algorithms for verifying deep neural networks,” Foundations and Trends® in Optimization, vol. 4, no. 3-4, pp. 244–404, 2021.

[bib17] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi, “Training and inference with integers in deep neural networks,” arXiv preprint arXiv:1802.04680, 2018.

[bib18] Richard E Bellman and Stuart E Dreyfus, Applied dynamic programming, vol. 2050, Princeton university press, 2015.

[bib19] Mario Köppen, “The curse of dimensionality,” in 5th online world conference on soft computing in industrial applications (WSC5), 2000, vol. 1, pp. 4–8.

[bib20] Herbert Edelsbrunner, Algorithms in combinatorial geometry, vol. 10, Springer Science & Business Media, 1987.

[bib21] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio, “On the number of linear regions of deep neural networks,” Advances in neural information processing systems, vol. 27, 2014.

[bib22] Randall Balestriero and Richard Baraniuk, “A spline theory of deep learning,” in International Conference on Machine Learning. PMLR, 2018, pp. 374–383.

[bib23] Randall Balestriero, Romain Cosentino, Behnaam Aazhang, and Richard Baraniuk, “The geometry of deep networks: Power diagram subdivision,” Advances in Neural Information Processing Systems, vol. 32, 2019.

[bib24] Randall Balestriero and Richard G Baraniuk, “From hard to soft: Understanding deep network nonlinearities via vector quantization and statistical inference,” arXiv preprint arXiv:1810.09274, 2018.

[bib25] Ahmed Imtiaz Humayun, Randall Balestriero, and Richard Baraniuk, “Polarity sampling: Quality and diversity control of pre-trained generative networks via singular values,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10641–10650.

[bib26] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein, “Implicit neural representations with periodic activation functions,” Advances in Neural Information Processing Systems, vol. 33, pp. 7462–7473, 2020.