
The SSL Interplay: Augmentations, Inductive Bias, and Generalization

Vivien Cabannes, Bobak T. Kiani, Randall Balestriero, Yann LeCun, Alberto Bietti

Abstract

Self-supervised learning (SSL) has emerged as a powerful framework to learn representations from raw data without supervision. Yet in practice, engineers face issues such as instability in tuning optimizers and collapse of representations during training. Such challenges motivate the need for a theory to shed light on the complex interplay between the choice of data augmentation, network architecture, and training algorithm. We study such an interplay with a precise analysis of generalization performance on both pretraining and downstream tasks in a theory friendly setup, and highlight several insights for SSL practitioners that arise from our theory.

Introduction

Self-supervised learning (SSL) aims to construct useful representations of data without the need for pre-constructed labels. Due to the recent success and widespread applicability of SSL, established methods for training large neural networks now incorporate pre-training of models in an unsupervised manner over large amounts of data, before fine-tuning/probing them over downstream datasets (Devlin et al., 2019; Chen et al., 2020; Brown et al., 2020; Radford et al., 2021). Self-supervised pretraining generally aims to render the model invariant to certain distortions/views of the inputs, in order to capture useful features for downstream tasks (e.g., Chen et al., 2020; Caron et al., 2020; Grill et al., 2020; Caron et al., 2021; Bardes et al., 2022). Though very powerful, SSL methods can be challenging to implement properly. They tend to suffer from various practical issues, such as instability and collapse during training and the need to carefully tune parameters related to the architecture, optimization algorithm, representation dimension, and form of augmentations. These different aspects of pretraining can lead to widely different behaviors and representations,

1 Meta AI, New York, NY, USA 2 MIT Department of Electrical Engineering and Computer Science, Cambridge, MA, USA. Correspondence to: Vivien Cabannes vivc@meta.com.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

as illustrated for instance in Figure 1. These challenges motivate new theoretical insights to better understand why such issues arise and how to better address them.

Our study focuses on the joint-embedding framework and characterizes learned representations for given choices of input distributions, data augmentations, and architecture. To obtain a fine-grained picture, we study linear classes of functions endowed with a reproducing kernel, and analyze a theoretically friendly loss function that models both contrastive and non-contrastive methods. Our work generalizes the discrete data setting of HaoChen et al. (2021) and the finite-dimensional setting of Saunshi et al. (2022), encompassing more expressive nonparametric models, potentially with universal approximation properties, and which can capture certain properties of architectures through their limiting kernels (Jacot et al., 2018).

Our contributions are as follows:

  1. We unveil two central integral operators: an 'intrinsic' one that depends on the input distribution and choice of augmentations and another capturing the inductive bias associated with the model of computation.
  2. We propose new generalization bounds on the pretraining excess risk via tools from convex analysis. This analysis yields novel insights, including an understanding of the benefits of using multiple augmentations per sample (e.g., 'multi-crop').
  3. We provide new bounds on the downstream generalization error that are sharper than previous work, and which can handle distribution shift between data before and after performing augmentations.
  4. We detail several examples where optimal representations are found in closed form, illustrating the role of augmentations, architecture, and regularization in forming representations.
  5. We discuss several practical insights for SSL practitioners that emerge from our theory, in particular on how design choices in pretraining may affect downstream performance, and on how to avoid collapse of representations.

Related work. Foundations for theoretically analyzing SSL have emerged in the past few years. Particularly relevant to our work, Balestriero & LeCun (2022); Kiani et al.

Figure 1. Effect of augmentations and architecture. TSNE of representations learned on MNIST with no augmentations (left) or with rotations and an MLP (middle) or a CNN (right). The representations depend on both the augmentations and the architecture.

(2022) provide theoretically friendly characterizations of many self-supervised learning settings, including closed-form solutions of representations in the kernel setting. Contrastive learning was first theoretically analyzed by Arora et al. (2019); Tosh et al. (2021a;b); Tian et al. (2021). Notably, HaoChen et al. (2021) recently leveraged tools from spectral graph theory to characterize guarantees on SSL performance under clustering assumptions. These assumptions were deemed impractical by Saunshi et al. (2022), who highlighted the importance of incorporating inductive bias to obtain provable guarantees. This line of work was extended to multi-modal SSL by Lee et al. (2021), where in essence the central symmetric operator T is replaced by a non-symmetric one, and the eigendecomposition is replaced by a singular value decomposition. The role of inductive bias has also been scrutinized through analysis of feature learning in training dynamics by Wen & Li (2021) and Tian (2022).

Setup

Machine learning streamlines the task of creating algorithms for finding patterns in data. An algorithm is conceptualized as a mapping f from an input x ∈ X to an output y ∈ Y. To construct this mapping f : X → Y, one can choose a measure of disagreement ℓ : Y × Y → R, and minimize the risk

R(f) = E_{(X,Y)∼ρ}[ℓ(f(X), Y)],

for ρ ∈ ∆_{X×Y} a distribution on input/output pairs. We denote by f∗ ∈ arg min R an optimal input/output map according to the risk. Mapping raw inputs (e.g., arrays of pixels) to outputs (e.g., classifying an animal in an image) is in general a challenging task. An effective technique consists of first extracting (or engineering) meaningful features ψ : X → R^k from input data before using those features to search for f in the form g ◦ ψ, for g : R^k → Y a simple function. 1

Though features ψ can be hand-engineered, representation learning aims at improving such designs via unsupervised learning procedures. On the one hand, reconstruction-based methods mask or add noise to inputs via a mapping x ↦ Mx and aim to reconstruct the original input x from the features as g ◦ ψ(Mx)

1 For convenience, several technicalities, such as measurability, have been deferred to the appendix.

Table 1. Analogy between practice and theory that this paper proposes to help disentangle the various phenomena of SSL training.

using g, a simple prediction head. Large language models largely rely on this paradigm, usually learning ψ by completing sentences Mx where word tokens are masked (e.g., Devlin et al., 2019). On the other hand, joint embedding methods learn ψ by leveraging invariance to small perturbations of the semantic information contained in inputs. This is the paradigm we shall focus on. Recently, joint embedding methods have relied heavily on the concept of data augmentation, such as small rotations, translations, or color jittering of images. In particular, contrastive methods learn ψ by enforcing that if two augmentations ξ and ξ′ come from the same data point, their representations ψ(ξ) and ψ(ξ′) are close, while if they come from different data points, their representations are far away from one another (e.g., Chen et al., 2020). Non-contrastive methods only enforce similarity of augmented data points and avoid collapse by enforcing richness of the representation (see, e.g., Bardes et al., 2022). In the following, we focus on a theoretically friendly variant of VICReg (Balestriero & LeCun, 2022) with parameter β > 0, defined for ψ : X → R^k by

L(ψ) = β E_X E_{ξ,ξ′|X}[‖ψ(ξ) − ψ(ξ′)‖²] + ‖E_ξ[ψ(ξ)ψ(ξ)⊤] − I‖²_F,   (2)

where pairs of inputs/augmentations (X, ξ) follow a distribution µ ∈ ∆_{X×X}, whose conditional (ξ | X) arises from the choice of augmentation. The first term in L enforces invariance of the representation ψ to two augmentations ξ and ξ′ of the same input X, while the second term lowers the risk of collapse by pushing the coordinates ψ_i : X → R of ψ = (ψ_i)_{i∈[k]} to be orthogonal in L².

Remark 1 (Contrastive learning with L). When β = 1, the population loss L is equivalent to the spectral contrastive loss studied in HaoChen et al. (2021) as a theoretically friendly proxy for SimCLR (Chen et al., 2020). In other words, L covers both contrastive and non-contrastive approaches to representation learning.
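To make the objective concrete, here is a minimal numpy sketch of such a loss computed on a minibatch of embeddings. It is an illustration of the invariance-plus-orthogonality description above, not the authors' implementation; the batch estimate of the second moment is our own choice.

```python
import numpy as np

def ssl_loss(psi1, psi2, beta=1.0):
    """VICReg-style population loss estimated on a minibatch (sketch).

    psi1, psi2: (batch, k) embeddings of two augmentations of the same
    inputs. beta weights the invariance term; the second term pushes the
    second moment E[psi psi^T] towards the identity to prevent collapse.
    """
    batch, k = psi1.shape
    # Invariance: representations of the two views should agree.
    invariance = np.mean(np.sum((psi1 - psi2) ** 2, axis=1))
    # Anti-collapse: coordinates of psi should be orthonormal on average.
    psi = np.concatenate([psi1, psi2], axis=0)
    second_moment = psi.T @ psi / (2 * batch)
    orthogonality = np.sum((second_moment - np.eye(k)) ** 2)
    return beta * invariance + orthogonality
```

With identical views the invariance term vanishes and only the anti-collapse term remains; with whitened embeddings the loss is zero.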

Given a representation ψ, one can optimize for f through linear probing by constructing f = g ◦ ψ where g is a linear function. f thereby belongs to the class of functions

F = { g ◦ ψ | g : R^k → Y linear }.

In practice, one might not know the optimal ψ, but can estimate it as ψ̂ from empirical data, leading to an estimate F̂ of this class of functions.


In this section, we study the representations induced by pretraining with specific augmentations and inductive biases.

Closed form solution

Equation (2) admits a closed-form solution for ψ upon noting that the invariance term is a quadratic form.

Lemma 2 (Spectral embedding). There exists a linear positive symmetric operator T on L² for which the operator I − T is positive and

L(ψ) = 2β Σ_{i∈[k]} ⟨ψ_i, (I − T)ψ_i⟩_{L²} + ‖E_ξ[ψ(ξ)ψ(ξ)⊤] − I‖²_F.

As a consequence, if (λ_i) are the eigenvalues of T and (f_i) are the corresponding eigenvectors, a minimizer of L is ψ_i = √µ_i f_i with µ_i = 1 − β + βλ_i.

Lemma 2 is closely tied to the guiding principle in unsupervised learning that a good representation of data should minimize variations over the manifold of the data (Cabannes et al., 2023), and techniques that learn such representations through spectral decomposition of a central operator (see, e.g., Coifman & Lafon, 2006).
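In a discrete setting where T is a symmetric matrix, the minimizer of Lemma 2 can be computed directly; the following numpy sketch does so (the clipping of negative µ_i is our own safeguard, not part of the lemma).

```python
import numpy as np

def spectral_embedding(T, k, beta=1.0):
    """Minimizer of the population loss following Lemma 2 (sketch).

    Diagonalize T, keep its top-k eigenpairs (lambda_i, f_i), and
    rescale each eigenvector as psi_i = sqrt(mu_i) f_i with
    mu_i = 1 - beta + beta * lambda_i (clipped at 0).
    """
    lam, F = np.linalg.eigh(T)            # ascending eigenvalues
    order = np.argsort(lam)[::-1][:k]     # top-k eigenvalues of T
    mu = np.clip(1.0 - beta + beta * lam[order], 0.0, None)
    return F[:, order] * np.sqrt(mu)      # column i is psi_i
```

For β = 0, all µ_i equal one and the loss no longer distinguishes between eigenfunctions, which is where the inductive bias of the architecture enters.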

Search within a linear class of functions

In this more technical section, we study solutions of L for ψ belonging to a linear class of functions. The coordinates of the mapping ψ : X → R^k are typically searched within a space of functions Ψ ⊂ R^X, leading to ψ ∈ Ψ^k. In our theoretically friendly setup, we assume that Ψ is a linear class of functions endowed with a Hilbertian topology such that the linear evaluations ψ ↦ ψ(x) are continuous for almost all x ∈ X. The theory of reproducing kernel Hilbert spaces (Scholkopf & Smola, 2001) asserts that Ψ can be parameterized by a Hilbert space H and a mapping φ : X → H such that

Ψ = { f_θ : x ↦ ⟨θ, φ(x)⟩_H | θ ∈ H }.   (4)

This generalizes the setting of HaoChen et al. (2021) where X is assumed to be finite and Ψ is parameterized by H = R X and φ ( x ) = δ x , as well as the setting of Saunshi et al. (2022) where H is assumed to be finite dimensional.

To describe architectures such as neural networks with such a linear structure, it is common to linearize those models (e.g., Jacot et al., 2018) as

ψ_θ(x) ≈ ψ_{θ0}(x) + ⟨θ − θ0, ∇_θ ψ_{θ0}(x)⟩,

Figure 2. Interplay between T and K as a function of λ. Illustration of Proposition 4 in a setting where (λ_i) = (0.9, 0.75, 0.5) and (‖θ_i‖²) = (0.4, 0.25, 0.125). The plot displays the eigenvalues associated with three different eigenfunctions as a function of λ; β is set to one for convenience. When λ = 0, the minimizer ψ∗ : X → R of (2) is defined through T alone, here ψ∗ = f_1 (i = 1, shown in blue); when λ is large, ψ∗ = f_3 (green) mainly depends on K. In the middle, there is an interplay between these two regimes, leading to ψ∗ = f_2 (orange). The three regimes are named the 'augmentation', the 'architecture' (or VCReg), and the 'interplay' regimes respectively. This abstract setting can be instantiated with a two-layer ReLU network and cropping, as detailed in Figure 8.

where θ are the network parameters, assumed close to their initialization θ 0 , and ψ θ is the neural network. In this case, we may take φ = ∇ θ 0 ψ θ 0 , which arguably describes some regimes of wide neural networks (Lee et al., 2019).
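This linearization can be sanity-checked on a toy network; below is a finite-difference sketch in which the model, the sizes, and the perturbation scale are all arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 3, 16

def net(theta, x):
    """Tiny two-layer tanh network; theta packs both layers."""
    W = theta[: width * d].reshape(width, d)
    a = theta[width * d:]
    return a @ np.tanh(W @ x)

def grad_theta(theta, x, eps=1e-6):
    """Finite-difference gradient of net(theta, x) w.r.t. theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (net(theta + e, x) - net(theta - e, x)) / (2 * eps)
    return g

theta0 = rng.normal(size=width * d + width) / np.sqrt(width)
x = rng.normal(size=d)
phi = grad_theta(theta0, x)                # feature map at initialization
delta = 1e-3 * rng.normal(size=theta0.size)
linearized = net(theta0, x) + phi @ delta  # first-order prediction
exact = net(theta0 + delta, x)
```

Near initialization the linearized prediction tracks the exact network closely, and the gap shrinks quadratically with the perturbation size, as the Taylor expansion dictates.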

To minimize L in practice and improve generalization, a regularization parameter is typically introduced. 2 The following lemma provides a closed-form solution of the regularized variant of L.

Lemma 3 (Regularized population loss) . For Θ ∈ R k ⊗H , and a regularizer λ > 0 , the regularized loss L ( S Θ) + λ ∥ Θ ∥ 2 2 can be minimized in closed form with the operator

The need for inductive bias

Two different points of view motivate the introduction of the regularizer λ‖Θ‖² leading to the operator T_λ.

In the classical viewpoint of statistical learning theory, one would like to retrieve the eigenfunctions of T to minimize L (Lemma 2). However, when solely accessing finitely many samples of data, eigenfunctions of T should be searched

2 While we study here the bias of Tikhonov regularization for simplicity, similar studies can be done for early stopped gradient descent or stochastic gradient descent when they are cast as spectral filters, as in Lin et al. (2020), see also literature related to optimization for matrix factorization problems (Chi et al., 2019), which has been applied to SSL by Simon et al. (2023).

within a space of finite capacity (i.e., {f ∈ Ψ : ‖f‖²_Ψ ≤ λ⁻¹}). Though fewer samples are needed for smaller models (e.g., the fewer neurons and layers in a deep network), such small models are unlikely to be expressive enough to represent the ideal solutions. This echoes the classical tradeoff between approximation and estimation error. In the case of Laplacians, one can assume that the eigenfunctions of T are smooth and thereby belong to a small space of functions that are well approximated with a finite model of computation. We refer the curious reader to Cabannes et al. (2021a) for results in this vein when I − T is the real Laplacian in L².

Another take was suggested by Saunshi et al. (2022), who pointed out that eigenvalues of T can have large multiplicity in realistic situations (in particular in the non-localized augmentation setting of Section 4.2), meaning that the space F is not uniquely defined from the loss L. As a consequence, defining the optimal solution solely from T is somewhat ill-posed, whereas, when K is properly chosen, T_λ could define a 'more principled' representation ψ. Paradoxically, with this viewpoint, bias could reduce the approximation error. Figure 3 illustrates this idea, leveraging the following interpretation of the inductive bias in the friendly setting where T and K commute.

Proposition 4. If T and K commute, and if (λ_i) are the eigenvalues of T and (f_i) its eigenfunctions, then there exist (θ_i) such that f_i = f_{θ_i} in (4). Moreover, the optimal representations minimizing the regularized loss are the f_i that maximize βλ_i − λ‖θ_i‖². In other terms, the regularization biases towards representations that have a small complexity with respect to the model of computation.
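Proposition 4 turns representation choice into a simple score; the numpy sketch below ranks eigenfunctions by that score and reproduces the three regimes of Figure 2, with the numbers copied from that caption.

```python
import numpy as np

def best_eigenfunctions(lam, theta_sq, k, beta=1.0, reg=0.0):
    """Indices of the k eigenfunctions maximizing beta*lambda_i - reg*||theta_i||^2.

    lam: eigenvalues of T (invariance scores); theta_sq: squared parameter
    norms ||theta_i||^2 (complexity under the architecture); reg plays the
    role of the regularization strength lambda.
    """
    scores = beta * np.asarray(lam) - reg * np.asarray(theta_sq)
    return np.argsort(scores)[::-1][:k]

# Values from the Figure 2 caption.
lam, theta_sq = [0.9, 0.75, 0.5], [0.4, 0.25, 0.125]
```

With reg = 0 the augmentation operator decides and f_1 wins; for large reg the architecture decides and f_3 wins; for intermediate reg, the interplay selects f_2.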

Lemma 3 shows an interesting behavior of the VCReg loss (β = 0, i.e., VICReg without the invariance term). In this setting, the optimal ψ retrieves the eigenfunctions of K with largest eigenvalues, recovering kernel PCA. Learning downstream tasks with linear probing of the resulting ψ is equivalent to linear regression with an eigenvalue cut-off, which is a powerful spectral filtering technique (see, e.g., Bun et al., 2017).
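The β = 0 case can be made concrete with a small numpy sketch of the kernel PCA features that such a VCReg objective recovers; the eigenvalue cut-off at rank k is made explicit (illustrative only).

```python
import numpy as np

def kernel_pca_features(gram, k):
    """Top-k kernel PCA features from an (n, n) Gram matrix (sketch).

    Keeps the k leading eigenvectors, scaled by the square root of
    their eigenvalues; linear probing on these features amounts to
    linear regression with an eigenvalue cut-off at rank k.
    """
    lam, U = np.linalg.eigh(gram)
    idx = np.argsort(lam)[::-1][:k]
    return U[:, idx] * np.sqrt(np.clip(lam[idx], 0.0, None))
```

The resulting feature directions are orthogonal, with energies given by the leading eigenvalues of the Gram matrix.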


Illustrative examples

The analysis of this paper relies on two central operators: T, which is 'intrinsically' defined from the data distribution and augmentations, and K, which relates to the model of computation (e.g., the network architecture). Once those operators are chosen, Section 5 provides a sharp analysis of convergence and generalization with SSL in the kernel regime. In essence, Assumption 3 requires that the target function (downstream) align well with the learned representation (upstream) when given infinite data. Here, the effect of T and of the inductive bias introduced by λK⁻¹ on the learned representation can appear abstract. To provide intuition and outline important properties of these operators, this section lays out several concrete examples to help practitioners.

Figure 3. Trade-off on eigenvalues between T and K . Illustration of a harmonic setting where T and K are diagonalized in the same basis. This basis is parametrized by an 'invariance score' ( x = m in (54)) and a 'complexity score' ( y = | S | in (54)). The eigenvalues λ x,y ( A ) for A ∈ { T, K, T λ } are represented with colors and displayed in a grid associated with x ∈ [15] and y ∈ [8] . The sole use of the operator T biases towards invariance (lower x ) with high complexity (lower y ), while the sole use of K biases toward low complexity. The interplay between the two results in T λ whose biggest eigenfunctions have high invariance and low complexity, and corresponds to an ideal representation ψ .

Two different perspectives have emerged to understand learned representations in SSL. One intuition comes from the spectral clustering literature and is the object of subsection 4.1. The other way to understand SSL is based on harmonic analysis and is the object of subsection 4.2. All in all, this section generalizes previous works by dropping strong clustering assumptions on the data, showing that what really matters are the eigenfunctions of T, which eventually capture clustering structures when such clustering assumptions are invoked. It further uses harmonic analysis tools to better describe these eigenfunctions, as suggested by Saunshi et al. (2022) and detailed in Table 2.

Low-variation with localized augmentations

When augmentations (ξ | X) are localized around the input X, optimizing the loss L (2) biases towards small gradients of ψ along the directions of augmentations. Formally, for ψ : X → R, using a first-order Taylor expansion,

E[|ψ(ξ) − ψ(ξ′)|² | X] ≈ E[⟨∇ψ(X), ξ − ξ′⟩² | X].

Under isotropic augmentations, the objective simplifies, up to a factor set by the augmentation variance, as

E_X[‖∇ψ(X)‖²],

which enforces ψ to have small variations on densely populated regions of the input space, reminiscent of popular approaches to tackle representation and semi-supervised learning in the last two decades (van Engelen & Hoos, 2020). More generally, augmentations govern the important directions of invariance for ψ, recovering a finite-differences approach to the Tangent Prop algorithm (Simard et al., 1991).
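This first-order picture is easy to verify numerically; here is a seeded Monte Carlo sketch with a toy scalar ψ on R² and small isotropic Gaussian augmentations (all choices ours).

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(x):
    """Toy smooth scalar representation on R^2."""
    return np.sin(x[:, 0]) * np.cos(x[:, 1])

def grad_psi(x):
    return np.stack([np.cos(x[:, 0]) * np.cos(x[:, 1]),
                     -np.sin(x[:, 0]) * np.sin(x[:, 1])], axis=1)

sigma, n = 1e-2, 200_000
X = rng.normal(size=(n, 2))
xi = X + sigma * rng.normal(size=(n, 2))    # first augmentation
xi2 = X + sigma * rng.normal(size=(n, 2))   # second augmentation

# Invariance term vs its Taylor surrogate 2 * sigma^2 * E||grad psi||^2.
invariance = np.mean((psi(xi) - psi(xi2)) ** 2)
surrogate = 2 * sigma ** 2 * np.mean(np.sum(grad_psi(X) ** 2, axis=1))
```

For localized augmentations the two quantities agree up to higher-order terms, so minimizing the invariance term indeed penalizes the gradient of ψ along the augmentation directions.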

Low-variation methods are particularly useful when data display a clustering structure (cf. Figure 10 for an illustration with neural networks). If augmentations preserve

the clustering structure, L is minimized by piecewise constant functions on each cluster, leading to useful features for downstream tasks that involve classifying different clusters (HaoChen et al., 2021; Schiebinger et al., 2015). The inductive bias further deforms those top eigenfunctions to be regular in a sense defined by Ψ (4), e.g., analytic if we use a radial basis function kernel (Sun & Zhou, 2008).

When augmentations are not localized, which is often the case in practice, harmonic analysis provides useful tools to study in-depth the role of augmentations, in particular when data are uniform on the sphere or the hypercube. Our findings on the hypercube are summarized in Table 2. In such a setting, we show that common augmentations enforce smoothness, locality, or invariance to certain symmetries. For example, crops push ψ to focus on details that can appear within the crop size, filtering out long-range interactions between parts of the input that are likely to be spurious features. The following example formalizes this.

Example 1 (Cropping). Consider the hypercube setting where X = {−1, 1}^n and X is uniformly distributed. A basis of L²(X, R) is given by the parity functions χ_S : x ↦ ∏_{i∈S} x_i for all subsets S ⊆ [n]. Pre-training via cropping with window size v × w sets Tχ_S = 0 for all S whose support cannot fit within a v × w window. For all the other S, Tχ_S = λ_S χ_S, where λ_S decreases with the diameter of S. In other terms, pre-training with 2-D cropping eliminates the influence of functions which act globally, outside of the cropping window. This, in effect, imparts a locality to the induced representation ψ which is often desirable for generalization.

This example suggests that the ideal crop size should match the desired scale of details for ψ ; e.g., on a dataset with fine-grained details such as iNaturalist, one should reduce the crop window size in comparison to a dataset such as ImageNet. Appendix D discusses further examples of augmentations, such as random noise or translations, and shows how they bias towards smooth or invariant eigenfunctions.
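Example 1 can be probed by simulation. The seeded sketch below uses 1-D 'crops' on 8-bit strings (window size and subsets are our choices): a local parity keeps a positive correlation across two crops of the same input, while a parity wider than the window is filtered out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, window, n = 8, 3, 200_000

def crop(x):
    """Keep a random contiguous window of bits, resample the rest uniformly."""
    starts = rng.integers(0, n_bits - window + 1, size=(len(x), 1))
    cols = np.arange(n_bits)
    keep = (cols >= starts) & (cols < starts + window)
    return np.where(keep, x, rng.choice([-1, 1], size=x.shape))

def parity(x, S):
    return np.prod(x[:, S], axis=1)

X = rng.choice([-1, 1], size=(n, n_bits))
xi, xi2 = crop(X), crop(X)

# chi_S with small diameter: positive correlation across crops ...
corr_local = np.mean(parity(xi, [0, 1]) * parity(xi2, [0, 1]))
# ... chi_S wider than the window: T chi_S = 0, correlation vanishes.
corr_global = np.mean(parity(xi, [0, 7]) * parity(xi2, [0, 7]))
```

Here corr_local concentrates around the square of the probability that a crop covers the support of χ_{0,1}, while corr_global concentrates around zero, mirroring Tχ_S = 0 for long-range parities.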


While the design of augmentations and architecture can be done separately, changes to the architecture and optimization scheme play an important role in the resulting optimal ψ . Generally, increasing the amount of inductive bias by increasing λ shifts ψ towards smoother functions, in the sense captured by the H norm, which we illustrate in Figure 4. In practice, the right amount (captured here by the parameter λ ) of inductive bias to enforce is often set by a mix of intuition, common knowledge and empirical evidence. For example, Caron et al. (2021) links the inductive bias of early stopping to beneficial outcomes noting that 'training longer [...] has been leading to worse performance'.

Example 2 (Dot-product kernel). In the Boolean hypercube setting of Example 1, many linear models (4) take the form φ(x)⊤φ(y) = h(⟨x, y⟩) (e.g., the classical NTK linearization of a fully connected layer), leading to an integral operator K that is diagonalizable by parity functions. More precisely, there exist (ν_j) such that Kχ_S = ν_{|S|} χ_S, where |S| is the cardinality of S and ν_{|S|} decreases with |S|. In the setting of crops, T pushes towards representations built on parity functions with small diameter (ψ = (χ_S)_S for S with small diameters), while the inductive bias acts on the cardinality of the sets S, pushing towards the χ_S that maximize ν_{|S|}. Formal derivations are provided in Appendix D.
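Example 2 can be verified by brute force on a small hypercube; a numpy sketch (the kernel profile h and the dimension n = 4 are arbitrary choices of ours) checks that parities are eigenfunctions of K with eigenvalues depending only on |S| and decreasing with it.

```python
import numpy as np
from itertools import product

n = 4
X = np.array(list(product([-1, 1], repeat=n)), dtype=float)  # all 2^n points

def h(t):
    """An arbitrary dot-product kernel profile."""
    return np.exp(t)

K = h(X @ X.T / n) / len(X)   # integral operator under the uniform law

def parity(S):
    return np.prod(X[:, S], axis=1) if S else np.ones(len(X))

def eig_of(S):
    chi = parity(S)
    nu = (K @ chi) @ chi / (chi @ chi)                # Rayleigh quotient
    return nu, np.max(np.abs(K @ chi - nu * chi))     # residual ~ 0

# nu_{|S|} for |S| = 0, ..., n, probed on one subset per cardinality.
nus, residuals = zip(*[eig_of(list(range(k))) for k in range(n + 1)])
```

The residuals confirm that each χ_S is an exact eigenvector, and the sequence (ν_k) is positive and strictly decreasing, i.e., a low-degree bias.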

Appendix D provides additional examples. For instance, in the case of translations, there is a similar interplay between a low-degree bias in K and an invariance bias in T. We also consider convolutional architectures, which can impose locality through constraints on diam(S), on top of a low-degree bias. Figure 3 shows such trade-offs in eigenvalues, Figure 4 visualizes how this interplay may affect leading eigenfunctions in a spherical setup, and Figure 5 illustrates the resulting effect on different downstream tasks.

Convergence analysis

The following section analyzes guarantees on both the pretraining and downstream tasks. 3 For simplicity, we consider the mean-square loss ℓ(y, y′) = ‖y − y′‖² with Y = R^{d_y}. The study of many losses can be reduced to the least-squares case thanks to calibration inequalities (Bartlett et al., 2006) or self-concordance properties (Ostrovskii & Bach, 2018). To precisely study convergence rates, we consider the kernel regime of Section 3.2, where F is specified through Θ∗ of Lemma 3 as

and ˆ F is defined similarly with ˆ Θ as an estimate of Θ ∗ . In the following, ( f i ) denote the eigenfunctions of T λ ordered by decreasing eigenvalues, and λ is considered to be fixed throughout this section.

Dealing with distribution shift

Self-supervised learning algorithms often incorporate strong augmentations, leading to potentially different marginal distributions over inputs and augmentations. This discrepancy is often overlooked, with many theoretical works implicitly assuming ρ_X = µ_Ξ. In practice, the marginal distribution ρ_X of inputs in the downstream task can be meaningfully different from the marginal distribution of augmentations µ_Ξ

3 The pretraining and downstream tasks refer to minimization of L and R respectively.

Table 2. Effect of common augmentations on the optimal representation ψ through the operator T . Without augmentations, ψ could match any Fourier basis function. Augmentations filter out some of those by attenuating their eigenvalues in T , and the architecture will push ψ to pick some specific frequencies among the remaining ones through the operator K . The table stylizes the effect of usual augmentations on parity functions over bit streams. We refer the reader to Appendix D for further details and derivations.

Figure 4. Interplay on the sphere. Level lines of the 7th eigenfunction of T_λ for three values of λ. Augmentations consist of translations of the x, y, z coordinates together with Gaussian perturbations. K is the integral operator associated with the radial basis function kernel. Without regularization (left), the eigenfunction is highly localized at clusters corresponding to the action of the augmentations. Increasing the regularization biases towards smoother harmonic eigenfunctions of K (middle and right).

on which we have imposed orthogonality of the representation ψ in the pretraining task. However, the optimal representation ψ is likely to be invariant to augmentations, meaning that ideally, ψ(X) should have the same distribution when X ∼ µ_Ξ or X ∼ ρ_X, which we write formally as ψ#µ_Ξ = ψ#ρ_X. Moreover, augmentations are likely to spread the input data distribution, leading to the domination ρ_X ≪ µ_Ξ. This motivates the following assumptions and definitions.

Assumption 1 (Low expansion). There exists c_r > 0 such that for any function f in the original space of functions Ψ defined in (4),

‖f‖²_{L²(ρ_X)} ≤ c_r ‖f‖²_{L²(µ_Ξ)}.

Assumption 2. For any i smaller than the number of positive eigenvalues of T λ , the projection of the target f ∗ on f i in L 2 ( µ Ξ ) coincides with the projection on f i in L 2 ( ρ X ) .

To make those two concepts more concrete, we provide three examples below.

Example 3. If µ_Ξ has a density against ρ_X which is bounded from below by δ ∈ (0, 1] on the support of ρ_X, i.e., µ_Ξ = δρ_X + (1 − δ)µ⊥ with µ⊥ ∈ ∆_X, then Assumption 1 is met with c_r = 1/δ.

Example 4. Let Σ_τ = E_{X∼τ}[φ(X)φ(X)⊤] be the covariance matrix of φ under the distribution τ. When there exists c such that Σ_{ρ_X} ⪯ cΣ_{µ_Ξ} (i.e., cΣ_{µ_Ξ} − Σ_{ρ_X} is positive semidefinite), then Assumption 1 holds with c_r = c.
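Example 4 suggests a direct numerical check of Assumption 1; the sketch below estimates the smallest c with Σ_{ρ_X} ⪯ cΣ_{µ_Ξ} from sampled features (the function name and the data are illustrative, not from the paper).

```python
import numpy as np

def domination_constant(phi_rho, phi_mu, jitter=1e-12):
    """Smallest c with Sigma_rho <= c * Sigma_mu, from (n, d) feature samples.

    Computed as the largest generalized eigenvalue of the pencil
    (Sigma_rho, Sigma_mu) via a Cholesky factorization of Sigma_mu.
    """
    S_rho = phi_rho.T @ phi_rho / len(phi_rho)
    S_mu = phi_mu.T @ phi_mu / len(phi_mu)
    L = np.linalg.cholesky(S_mu + jitter * np.eye(len(S_mu)))
    M = np.linalg.solve(L, np.linalg.solve(L, S_rho).T)  # L^-1 S_rho L^-T
    return float(np.max(np.linalg.eigvalsh((M + M.T) / 2)))
```

Augmentations that spread the data enlarge Σ_{µ_Ξ} and thus shrink the constant; Assumption 1 then holds with c_r equal to this value.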

Figure 5. Trade-off on downstream errors. Effect of the pretraining regularization λ on the empirical downstream error for two tasks on the sphere S⁷. The targets f∗_ℓ are polynomials of degree ℓ ∈ {1, 3}, with only f∗_3 invariant to translations. K is built from a dot-product kernel that acts as a regularizer on degrees, while T is built from local translations. Designing ψ from T alone (λ = 0) is helpful to learn globally invariant polynomials in the downstream task, while increasing the regularization λ helps to learn polynomials of small degree. Experiment details are in Appendix E.2, and Figure 11 showcases a similar behavior for neural networks.

Example 5. If ψ#µ_Ξ = ψ#ρ_X holds for the optimal representation ψ = (f_i), with (f_i) the positive eigenfunctions of T_λ, and there exists a measurable function g : R^k → Y such that f∗ = g ◦ ψ, then Assumption 2 is verified.

In essence, Assumptions 1 and 2 allow for the incorporation of augmented data that does not resemble the original data as long as the model of computation (Assumption 1) and training via the VICReg loss (Assumption 2) do not bias too much towards this aberrant augmented data. Example 3 states that when the augmented data mostly looks like the original samples then one does not have to worry about bias introduced by the model of computation. Example 4 gives a more relaxed guarantee based on second order moments. Finally, Example 5 states that one need not worry about the idiosyncrasies of the augmented data if the learned representations confound augmented data with their original samples.


Generalization on downstream tasks

The following assumption states that the target function f∗ : x ↦ E_ρ[Y | X = x] of the downstream task is well represented by the pretraining problem.

Assumption 3 (Source condition). f∗ belongs to the positive eigenspace of T_λ, i.e., f∗ ∈ Span{f_i | λ_i > 0}.

Example 6 (Cluster assumption) . If the support of the density µ Ξ has k connected components, f ∗ is constant on those clusters, and λ = 0 , then Assumption 3 holds.

We now give a simplified version of our downstream guarantee. See Theorem 4 in Appendix B for the full statement.

Theorem 1 (Downstream error). Let (X_i, Y_i) ∼ ρ^{⊗n} be n samples drawn from the distribution of the downstream task and let ℓ be the square loss. Define k_λ < +∞ as the number of strictly positive eigenvalues of T_λ. Under Assumptions 1, 2, and 3, after a transitory regime, the average excess risk of the optimally-regularized empirical risk minimizer f_n is

where ε² is the noise level of Y (the supremum of the conditional variances), k_e ≤ k is the effective dimension of the representation ψ = Θφ on the downstream task, c_{f,k} ≤ (k_λ − k)₊ ‖f∗‖²_{L²(ρ_X)} is a constant relating to the concentration of the energy of the target function f∗ on the downstream task with respect to the eigenspaces of T_λ, c_{f,T_λ} ≤ ‖T_λ⁻¹ f∗‖ is a similar constant taking into account the decay of the eigenvalues of T_λ, and the index k in L_k indicates that we search over ψ : X → R^k.

The results of Theorem 1 can be seen as a variance-bias decomposition. A variance term, due to misspecified linear regression, displays rates in k log(n)/n. The log(n) factor is actually an artefact of our derivations, and could be removed with Theorem 1 of Mourtada et al. (2022). A bias term relates to the approximation error. It captures both the hardness of learning f∗ with T_λ, through the constants c_{f,T_λ} and c_{f,k}, and the error made on the pre-training task, through L − L∗. Note that the proof of Theorem 1 mindfully avoids bounding c_{f,T_λ} by ‖T_λ⁻¹‖_op ‖f∗‖, which would introduce the inverse of the spectral gap of T_λ in the bound and would not characterize well situations where the target function f∗ is actually easy to learn with T_λ. We also remark that for classification tasks, recent work shows that under mild assumptions on ρ_X and low-noise conditions, it should be possible to convert the rates of Theorem 1 into exponentially fast rates for the zero-one loss (Cabannes et al., 2021b). This is particularly the case under the cluster setting studied by HaoChen et al. (2021). 4 The theoretical convergence rates of Theorem 1 are validated experimentally in Figure 6.

4 See also Rigollet (2007) for fast rates in this setting.

Figure 6. Empirical downstream performance on a simple task (detailed in Appendix D) depends on the numbers of both downstream samples (x-axis) and pretraining samples (y-axis), in log scale. Along the left-hand side of the plot, convergence rates in n_pre^{−1/2} are observed with respect to the number of pretraining samples (Theorem 3), while along the top, convergence rates in n_down^{−1} are observed with respect to the number of downstream samples (Theorem 1). At the bottom, a saturation phenomenon is observed where added downstream samples do not result in noticeable benefits, as the excess risk stalls at R(Π_F̂ f∗) − R(f∗) > 0.

When a least-squares problem benefits from additional structure, such as smoothness or strong convexity, results from convex optimization could lead to improvements over the usual convergence rates in n^{−1/2}. We recall basic results from convex optimization.

Lemma 36. Let L(Θ) = E_Z[ℓ(Θ, Z)] be a convex function optimized over a convex domain. Given n samples (Z_i), (unbiased) stochastic gradient descent with final averaging of the iterates, denoted Θ̄_n, can achieve an excess of risk

E[L(Θ̄_n)] − L(Θ∗) ≤ MV/√n,

with M² = ‖Θ∗ − Θ0‖² and V² = E[‖∇_Θ ℓ(Θ, Z_i)‖²]. Moreover, if L is α-smooth, then it can achieve

E[L(Θ̄_n)] − L(Θ∗) ≤ αM²/n + MV/√n.

As a consequence, given n data samples, there exists an empirical estimate Θ̂ that guarantees those generalization bounds.

Proof. This lemma is a direct consequence of Theorems 6.1, 6.2 and 6.3 of Bubeck (2015).
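As a concrete illustration of Lemma 36, here is a minimal numerical sketch of stochastic gradient descent with final (Polyak-Ruppert) averaging on a noisy least-squares objective. The problem instance, step-size schedule, and sample size are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20000
theta_star = rng.normal(size=d)

def sample_gradient(theta):
    # Unbiased stochastic gradient of L(theta) = E[(x^T theta - y)^2] / 2
    x = rng.normal(size=d)
    y = x @ theta_star + 0.1 * rng.normal()
    return (x @ theta - y) * x

theta = np.zeros(d)
avg = np.zeros(d)
for t in range(1, n + 1):
    theta = theta - 0.1 / np.sqrt(t) * sample_gradient(theta)
    avg += (theta - avg) / t  # running average of the iterates

error = np.linalg.norm(avg - theta_star)
print(error)
```

The averaged iterate converges to the minimizer at the rate predicted by the lemma, despite the noise in each gradient.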

It should be noted that, when parameterized with Λ = Θ⊤Θ, L is a quadratic form, as stated by Lemma 30; yet it is minimized over a non-convex domain, the set of symmetric operators of rank k. We will relax this constraint and consider the harder problem of optimizing over Λ in the set of self-adjoint positive operators. This is justified by the fact that Theorem 4 provides guarantees on the downstream task even when one relaxes the rank constraint on Λ.

To benefit from Lemma 36, one should consider an unbiased expression of L. Consider the minibatch scheme that consists of sampling two inputs X₁, X₂ and m augmentations ξ_ij for each X_i, formally

Here µ_X denotes the marginal of µ with respect to X, which is likely to match ρ_X, and µ_{|X} denotes the distribution of Ξ conditionally on X.
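To see why such a scheme matters, note that plugging the empirical mean of the m views into a squared conditional mean such as ∥E_ξ[ψ(ξ)|X]∥² yields a biased estimate, while averaging products over distinct views j ≠ j′ is unbiased. A toy numerical sketch (one-dimensional ψ and Gaussian augmentations, our assumptions rather than the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, m, trials = 1.0, 2.0, 4, 100000

xi = rng.normal(mu, sigma, size=(trials, m))  # m augmentations per trial
s = xi.sum(axis=1)
# plug-in estimate of E[psi(xi)|X]^2, biased upward by sigma^2/m
naive = (s / m) ** 2
# average of psi(xi_j) * psi(xi_j') over distinct pairs j != j', unbiased
unbiased = (s ** 2 - (xi ** 2).sum(axis=1)) / (m * (m - 1))

print(naive.mean(), unbiased.mean(), mu ** 2)
```

Averaged over many trials, the pair-based estimate matches the true squared mean, while the plug-in estimate overshoots it by σ²/m.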

Lemma 37. An unbiased formulation of L is based on ℓ defined as

Moreover, when L is regularized, one has to add +λI to obtain a gradient of the regularized risk.


Proof. This formula follows from Lemma 30.

In order to bound the norm squared of the gradient, one can use the following lemma.

Lemma 38. For ℓ given in (42) , bounds on the gradient norm and its variance are

where σ X relates to the variance of E [ ψ ( ξ ) | X ] and σ ξ relates to the average variance of ( ξ | X ) .

Proof. Let us decompose ∇ℓ into three terms ∇ℓ = a + b + c, as they appear in (42). We have

Let us begin with the part in a ,

Similarly, the part in b can be expressed as

Finally,

As a consequence, we get

where

Using the fact that m² ≥ m and choosing appropriate σ_X and σ_ξ leads to the lemma.

The following lemma states the convexity properties of L.

Lemma 39. As a function of Λ, the objective L is α-smooth with α = κ⁴, where κ is a bound on ∥φ∥. Moreover, when X is finite, it is α′-strongly convex, with α′ the square of the eigengap of K = SS⊤.

Proof. This is a consequence of Lemma 30: L is a quadratic function, with the quadratic part being

In other terms, the Hessian of L is Σ ⊗ Σ ∈ H^{⊗2} ⊗ H^{⊗2}. As a consequence, Σ ⊗ Σ ⪯ ∥Σ ⊗ Σ∥_op I = ∥Σ∥²_op I ⪯ κ⁴ I. Similarly, Σ ⊗ Σ ⪰ ∥Σ^{-1}∥_op^{-2} I = γ_ξ² I, where γ_ξ is the eigengap of Σ, hence of K.

A few remaining difficulties must be addressed before concluding. First, although the identity is not Hilbert-Schmidt, it should be noted that the term in λ will only contract distances in the stochastic gradient descent. As a consequence, optimizing the regularized risk will only contract the descent trajectory (to prove it formally, one could go back to the proofs of Bubeck (2015)). Second, we have described a descent in the space of self-adjoint positive operators, without incorporating any constraint on the rank of Λ. Notice that, based on Lemma 35, one can restrict the search for Λ to the domain ∥Λ∥ ≤ k_λ/λ. Finally, if Λ minimizes the loss L, then one can show that thresholding its eigenvalues to make it of rank at most k can only increase the loss L by a bounded multiplicative factor. We note that, without explicit regularization, the previously described stochastic gradient descent algorithm with early stopping has a regularization effect that could be studied in the spectral filtering framework of Lin et al. (2020).
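The eigenvalue-thresholding step mentioned above can be sketched as follows, with a hypothetical helper `threshold_rank` that projects a symmetric positive matrix onto its k largest eigenpairs (a finite-dimensional toy, not code from the paper):

```python
import numpy as np

def threshold_rank(Lam, k):
    """Keep only the k largest eigenvalues of a symmetric PSD matrix (k >= 1)."""
    vals, vecs = np.linalg.eigh(Lam)  # eigenvalues in ascending order
    vals[:-k] = 0.0                   # zero out all but the k largest
    return (vecs * vals) @ vecs.T

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6))
Lam = A @ A.T                         # full-rank PSD operator
Lam_k = threshold_rank(Lam, 2)
print(np.linalg.matrix_rank(Lam_k))
```

The result is symmetric, positive, and of rank at most k, as required by the relaxation argument.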

Two regimes should be distinguished for the downstream problem. When few downstream samples are available, few-shot learning requires a small effective dimension k_e (6) to lower the estimation error and avoid fitting noise. Limiting k_e (or equivalently the capacity of F̂) can be done either by decreasing the representation dimension k or by applying regularization on downstream tasks. This theoretical tradeoff between effective dimension and number of downstream examples is illustrated empirically by He & Ozay (2020, Figure 6). On the contrary, when a substantial amount of data is available for training downstream tasks, one can confidently augment the representation dimension k to decrease the approximation error. This was notably observed on large-scale datasets by Garrido et al. (2022, Figure 1): as k increases, the effective dimension k_e converges to a limit, and the downstream performance keeps increasing until this limit is reached. Remarkably, our theory explains this phenomenon: since k_λ is finite, as k increases, the effective dimension k_e will be bounded by the limiting case where k = k_λ.⁵
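This saturation can be illustrated numerically. Assuming a decaying limiting spectrum λ_i = i^{-2} (an arbitrary choice for illustration), the effective dimension k_e = Σ_{i≤k} λ_i/(λ_i + γ) grows with k but is capped by the number of eigenvalues that dominate the regularization γ:

```python
import numpy as np

gamma = 1e-2
spectrum = 1.0 / np.arange(1, 201, dtype=float) ** 2  # assumed limiting eigenvalues

def effective_dimension(k):
    lam = spectrum[:k]  # a k-dimensional representation keeps the top k eigenvalues
    return float(np.sum(lam / (lam + gamma)))

for k in (5, 20, 50, 100, 200):
    print(k, effective_dimension(k))
```

Doubling k from 100 to 200 barely moves k_e, mirroring the plateau reported by Garrido et al. (2022).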

Theorem 1 above states that representations with small pretraining loss can solve downstream tasks that satisfy Assumption 3, but it does not address how difficult it is to find such representations. This section aims to bridge that gap. The following theorem details convergence rates of the empirical risk minimizer using Rademacher complexity arguments.

Theorem 2 (Empirical risk minimizer). Let Θ_n ∈ R^k ⊗ H be the minimizer of the unbiased regularized empirical version of L based on a dataset D_n. Assume that D_n is built from n input samples (X_i) ∼ µ_X^{⊗n} and m augmentations per sample (ξ_ij) ∼ µ_{|X_i}^{⊗m}; then the average excess risk is bounded by

where κ is a bound on ∥ φ ( X ) ∥ .

Note that the proof of Theorem 2 proceeds with a loose bound on the variance of the empirical risk, which is mainly due to the difficulty in dealing with the non-exchangeability of the samples (ξ_ij). In essence, the ease of minimizing L depends on both the variance of L when estimated with empirical data (or the variance of stochastic gradients when performing SGD), and the size of the space where we aim to find representations ψ : X → R^k. With stronger assumptions on the distribution of φ(ξ) (e.g., data are clustered, and the law of (ξ|X) is invariant per cluster), one could show much better behavior of the excess risk with respect to the number of augmentations (e.g., replacing n by the minimum number of points in one cluster multiplied by the number of views). The following theorem states convergence rates with a stochastic gradient descent algorithm capturing such a potential situation. Proofs and technicalities, based on the convex optimization literature, are detailed in Appendix C.

Theorem 3 (Sharper bounds) . There exists an implementable algorithm that guarantees an average excess risk

where c_λ = 1 + κ²k_λ/λ, c′_λ = 1 + k_λ²/λ², k_λ is the number of positive eigenvalues of T_λ, κ is a bound on ∥φ∥, σ_X relates to the variance of E[ψ(ξ)|X], and σ_ξ relates to the average variance of (ξ|X). Moreover, when K = SS⊤ or the covariance of the φ(ξ) has a finite number of positive eigenvalues (e.g., X finite or H finite-dimensional), with c_K a constant that relates to the condition number of K, this bound can be tightened to

In the setting studied by HaoChen et al. (2021), we stress that Theorem 3 guarantees convergence rates in O(n^{-1}) rather than O(n^{-1/2}) on the upstream loss. In effect, we improve the rates of HaoChen et al. (2021, Theorem 4.3) from n^{-1/2} to n^{-1} on both pretraining and downstream tasks.

Mathematical details and simple proofs

The pretraining problem

Usefulness of multiple augmentations per sample. Theorem 3 shows how multiple augmentations, such as multi-crop, can result in faster convergence to an optimal representation ψ. There, the variance of the empirical risk depends on both σ_X, due to variation over inputs, and σ_ξ, due to variations over the resulting views after augmentation. With multiple augmentations per sample, one can reduce the latter variance and improve performance, as observed with the introduction of multi-crop in Caron et al. (2020). However, when the total amount m × n of pre-processed data is held fixed, it is generally better to process many inputs with two views (m = 2) rather than a few inputs with many augmentations. This finding matches the empirical observation of Bardes et al. (2022) that, if available, fresh samples are always better than more views.
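This tradeoff can be checked with a short simulation. Under a toy model where views are ξ = X + noise and one estimates E[ψ(ξ)] under a fixed budget of n × m processed views (our illustrative assumptions), the variance of the empirical mean decomposes as σ_X²/n + σ_ξ²/(nm), so fresh inputs beat extra views whenever σ_X is non-negligible:

```python
import numpy as np

rng = np.random.default_rng(0)
budget, sigma_x, sigma_xi, trials = 64, 1.0, 1.0, 20000

def estimator_variance(m):
    n = budget // m                                   # fresh inputs under a fixed budget n * m
    x = rng.normal(0.0, sigma_x, size=(trials, n, 1))
    xi = x + rng.normal(0.0, sigma_xi, size=(trials, n, m))
    return xi.mean(axis=(1, 2)).var()                 # variance of the empirical mean

v_fresh, v_views = estimator_variance(2), estimator_variance(16)
print(v_fresh, v_views)  # sigma_x^2/n + sigma_xi^2/(n m): small m wins
```

With the same budget of 64 views, two views per input (32 inputs) give a much smaller variance than sixteen views per input (4 inputs).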

Capacity trouble in pretraining. Theorems 2 and 3 show that, without regularization restricting the capacity of the computational model, one cannot expect to meaningfully solve the pretraining task. This is captured by the quantity c_λ, which goes to infinity as λ goes to zero. Such issues related to the lack of regularization commonly arise in practice. Given n × m upstream samples (ξ_ij), the empirical minimization of VICReg can be implemented by approximating µ with Σ_ij δ_{(i,ξ_ij)}/nm. In this setting, T is the adjacency matrix of a graph with as many connected components as there are inputs n, as detailed in Appendix E. Each connected component defines a maximal eigenvector of the empirical approximation of T, leading to a 'collapsed' representation ψ = Σ_j δ_{ξ_ij}/m. Regularizing forces the optimizer to search for representations inside the space Ψ, which mixes those small clusters, letting meaningful eigenfunctions emerge (see Figure 7 for an illustration).
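The collapse mechanism can be made concrete in the discrete case. With n inputs, m augmentations each, and no connections across inputs, the empirical operator is block diagonal; each all-ones block J_m/m carries eigenvalue one, so the top eigenvalue has multiplicity n and the top eigenvectors are per-input indicators (a toy construction under an assumed normalization):

```python
import numpy as np

n, m = 5, 4                              # inputs and augmentations per input
block = np.ones((m, m)) / m              # within-input affinity, eigenvalues {1, 0}
T_hat = np.kron(np.eye(n), block)        # block-diagonal empirical operator

vals = np.linalg.eigvalsh(T_hat)
top_multiplicity = int(np.sum(vals > 1 - 1e-9))
print(top_multiplicity)                  # one 'collapsed' direction per input
```

The top eigenspace has dimension n, one indicator per input, which is exactly the degenerate representation that regularization is meant to break.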

Our theoretical study provides several insights that may be useful for SSL practitioners. We highlight a few below.

Avoiding collapse. The common collapse phenomenon, where pretraining ends up fitting noise instead of learning useful features, may be addressed in several ways. Our theory suggests to:

· Reduce the model capacity, through regularization (e.g., early stopping) or simpler architectures (e.g., a shallow CNN instead of an MLP). As a consequence,

5 Note that without regularization, (1 − β)I + βT is not trace-class, so k_e will not converge as k increases.

Figure 7. Capacity trouble. Level lines of the top eigenfunction of the empirical estimate of T_λ for negligible regularization (left) and small regularization λ (right). Experiments are done with a Gaussian kernel with scale about one tenth of the problem diameter; augmentations are represented as black dots, connected by a line when they come from the same input X. When λ is negligibly small, capacity troubles arise, preventing the recovery of the cluster structure shown on the right.


Ψ will have a lower effective dimension, and K will encourage 'simpler' representations that can be learned with less data, even without any data augmentation. · Use stronger augmentations. T will become more compact, reducing k_λ, the dimension of the 'positive eigenspace' of T_λ. The ideal ψ will exhibit more structure, so its search can be restricted to smaller spaces, making collapse harder.

Incorporating priors. Representations are typically used for solving downstream tasks, thus it is crucial to incorporate the right priors during pretraining. Our theory showcases the important role of several factors. (i) Augmentations determine the nature of the invariance that is enforced (e.g., low variations, short-range dependencies, translation invariance); this affects the top eigenfunctions of T. (ii) The architecture promotes 'simple' representations (e.g., smoothness, locality); this affects the top eigenfunctions of K. (iii) Regularization balances the interplay between augmentations and architecture; this affects the top eigenfunctions of T_λ. (iv) Pretraining data impacts both T and K and their eigenfunctions, e.g., through clustering structure or natural image statistics.

This paper presents a theoretical framework for studying self-supervised learning in the kernel regime. It examines three key operators and their impact on convergence and generalization: T, linked with augmentations; K, linked with architecture choices; and T_λ, resulting from their interplay and tuned by the parameter λ. Our analysis offers useful guarantees and practical guidelines for practitioners to improve the stability and performance of SSL algorithms. We leave for future work the extension of our analysis beyond the kernel regime, in particular to understand non-linear training dynamics in finite-width neural networks and feature learning within layers. Moreover, future studies could encompass more techniques that enhance performance in SSL, including projecting representations before enforcing losses, batching the data, or applying different loss functions.

Experiments

Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. In ICML , 2019.

Bach, F. Learning Theory from First Principles . To appear at MIT press, 2023.

Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning , 2015.

Several technicalities were left implicit in the main text; we discuss them now. In particular, we assumed that there exists a minimizer f* of the risk R, which is true when Y is finite, or when ℓ is the least-squares loss and (Y|X) has a second-order moment almost everywhere. Moreover, in the proof for least-squares, we will assume for simplicity that Y = R. The same derivations still hold true when Y = R^k, although this requires slight precautions, such as working in Y ⊗ H rather than in H (see Cabannes et al., 2021b, for example).

Integers. The representation dimension k is an integer, and [k] denotes the set {1, 2, · · · , k}. For simplicity, we abuse notation and denote by N the set of strictly positive integers.

Geometry. The product A × B denotes the set of elements (a, b) with a ∈ A and b ∈ B. The notation a⊤ denotes the adjoint of a, which depends on the Hilbertian metric space one considers a to be part of (e.g., the adjoint in L²(µ_Ξ) is not the same as the adjoint in L²(ρ_X)). The notation a⊤b denotes the scalar product ⟨a, b⟩ in the Hilbertian metric space a and b are understood to be part of. The Hilbertian norm on matrices or operators is denoted by ∥·∥_F (Frobenius), ∥·∥₂ or ∥·∥_HS (Hilbert-Schmidt). The operator norm is denoted by ∥·∥_op. Moreover, the identity is always denoted by I.

Distributions. In order to define probabilities, X and Y are assumed to be Polish spaces endowed with the Borel topologies. We use the simplex notation ∆_A to denote the set of probability measures on A, and the tensor notation ρ^{⊗n} to denote the distribution of n independent random variables all distributed according to ρ. The notation φ#ρ denotes the distribution of φ(X) when X is distributed according to the measure ρ. The notation ρ ≪ µ means that for any measurable set X, µ(X) = 0 implies ρ(X) = 0. The notation δ_x denotes the Dirac distribution, which satisfies ⟨f, δ_x⟩ = f(x) using the duality bracket between functions and distributions. For any distribution p, the space L²(p) is made of measurable functions that are square-integrable.

Functions. All functions, such as ℓ, f, ψ, φ, and so on, are restricted to be measurable. The notation ◦ denotes the composition of functions, f ◦ g(·) = f(g(·)). A function f : X → Y is understood as an element of Y^X, and we use some isomorphisms such as (R^k)^X = (R^X)^k. We use the notation R^k ⊗ H to denote bounded linear operators from H to R^k. This tensor product notation generalizes matrix notations, with R^k ⊗ R^{d_h} = R^{k×d_h}. In particular,

For Θ ∈ R^k ⊗ H, one can write Θ in row-style as an element of H^k, as well as its adjoint Θ⊤ ∈ H ⊗ R^k in column-style, which follows from the fact that H^k is self-dual when endowed with the ℓ²-product topology.


Proof of Remark

Let us characterize (2) in order to easily implement it with unbiased stochastic gradients. We need to move the expectation outside the norm. This can be done with the following derivations.

For the first part, we get

In particular, when β = 1 , we retrieve the spectral contrastive loss introduced by HaoChen et al. (2021),

First, notice that if we define, for ψ : X → R, the mapping ω : ψ ↦ E_X[E_{ξ,ξ′}[∥ψ(ξ) − ψ(ξ′)∥² | X]], then ω is a quadratic form on L²(µ_Ξ). As a consequence, it can be represented by a linear self-adjoint operator L on L²(µ_Ξ) such that ω(ψ) = ⟨ψ, Lψ⟩_{L²(µ_Ξ)}. Because ω(ψ) ≥ 0, we have L ⪰ 0 (with ⪰ the Loewner order on symmetric operators, i.e., A ⪰ B if A − B is positive). The following lemma shows that L is bounded.

Lemma 5. For any ψ ∈ L 2 ( µ Ξ ) , ω ( ψ ) ≤ 2 ∥ ψ ∥ 2 L 2 ( µ Ξ ) . As a consequence, L ⪯ 2 I .

Hence for any ψ, with the L²(µ_Ξ) geometry, we have ψ⊤Lψ ≤ 2ψ⊤ψ, which implies, since L is self-adjoint, that ∥L∥_op ≤ 2.

Because 0 ⪯ L ⪯ 2I, let us introduce T = (2I − L)/2; we have 0 ⪯ T ⪯ I and, with the L²(µ_Ξ) geometry, for ψ : X → R^k

Assumption 4. Assume that T has a pure point spectrum.

Example 7. When the distribution of augmentations has a density p with respect to some measure and (x, ξ) ↦ p(ξ|x)/p(ξ) is in L²(µ), or when X is finite, T can be shown to be a compact operator, hence to have a pure point spectrum according to the spectral theorem.

Proof. When X is finite, the L² spaces are finite-dimensional, which implies that all operators are compact. To prove the case with density, let us develop T as an integral operator. We have, in the L²(µ_Ξ) geometry, for f : X → R,

This allows us to identify T through the inner product: we have, for g : X → R and p the density of augmentations,

As a consequence, one can consider T as the integral operator in L 2 ( µ Ξ ) linked with the kernel

When this kernel is bounded, or simply when ξ → k ( ξ, ξ ) belongs to L 2 ( µ Ξ ) , T is trace-class hence compact.
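In the discrete case, this operator can be written down explicitly. Assuming finitely many inputs x with known augmentation probabilities p(ξ|x) (our toy construction), the quadratic form ⟨ψ, Tψ⟩ = E_X[E[ψ(ξ)|X]²] corresponds to T = D^{-1}Q with Q = Σ_x ρ(x)p(·|x)p(·|x)⊤ and D = diag(µ_Ξ); conjugating by D^{1/2} gives a symmetric matrix with the same spectrum, which one can check lies in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_xi = 3, 8                          # finite inputs and augmented points
rho = np.full(n_x, 1.0 / n_x)             # assumed uniform input distribution
P = rng.random((n_x, n_xi))
P /= P.sum(axis=1, keepdims=True)         # conditional probabilities p(xi | x)

mu = rho @ P                              # marginal distribution over augmentations
Q = (P * rho[:, None]).T @ P              # sum_x rho(x) p(.|x) p(.|x)^T
M = Q / np.sqrt(np.outer(mu, mu))         # D^{-1/2} Q D^{-1/2}, same spectrum as T

vals = np.linalg.eigvalsh(M)
print(vals.min(), vals.max())             # spectrum lies in [0, 1]
```

The top eigenvalue is exactly one, attained by the constant function, consistently with 0 ⪯ T ⪯ I.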

Let us now prove that, in order to minimize L, one should take the eigenfunctions of the operator (1 − β)I + βT whose corresponding eigenvalues are the largest positive ones. This can be proven with simple geometry in a somewhat abstract space. To do so, remark that ψ : X → R^k in L²(µ_Ξ, X, R^k) can be represented as ψ̃ ∈ R^k ⊗ L²(µ_Ξ, X, R) with the linear map that associates (ψ_i⊤φ) to a function φ ∈ L²(µ_Ξ, X, R). Denoting T_β = (1 − β)I + βT, the upstream loss can be characterized as

In order to find the minimizer of L with this new characterization, slight precautions are needed here since the two operators are not trace-class. The following lemma takes those precautions in order to finish the proof.

Lemma 6. Let A be a self-adjoint operator on L²(µ_Ξ). Assume that there exists c such that A ⪯ cI and that A has a pure point spectrum. Then, if (λ_i, f_i) denotes the eigendecomposition of A with the λ_i in decreasing order, the minimization of Tr((B − A)² − A²) under the constraint that B is a self-adjoint positive operator of rank at most k is reached for B = ψ̃⊤ψ̃ with ψ : X → R^k such that ψ_i = max(0, λ_i)^{1/2} f_i.

Let us decompose B into k symmetric operators of rank at most one as B = Σ_{i=1}^k B_i, such that B_i B_j = 0 for any i ≠ j ∈ [k]. Using the different properties of the operators introduced, we proceed with

where Π_B denotes the orthogonal projector on the image of B, and σ_i(A) the i-th singular value of A (ordered decreasingly, with σ₁(A) the largest). The last inequality is due to the Courant-Fischer min-max principle. This inequality can be achieved with Π_{B_i} the projection on the i-th eigenspace of A and ∥B_i∥_op = σ_i(A). In other terms, B should match the first k positive eigenvalues of A. In the case where A has fewer than k positive eigenvalues, B should match all the positive eigenvalues and be null on the range of A₋.
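A quick numerical check of this statement: expanding Tr((B − A)² − A²) = Tr(B²) − 2Tr(AB), the truncation to the positive part of the top-k eigenpairs should beat any other positive rank-k candidate (a finite-dimensional toy with random matrices as assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3
A = rng.normal(size=(d, d))
A = (A + A.T) / 2                              # self-adjoint 'target' operator

def objective(B):
    return np.trace((B - A) @ (B - A) - A @ A)  # = Tr(B^2) - 2 Tr(AB)

vals, vecs = np.linalg.eigh(A)                  # ascending eigenvalues
top = vecs[:, -k:] * np.maximum(vals[-k:], 0.0)
B_star = top @ vecs[:, -k:].T                   # positive part of the top-k eigenpairs

for _ in range(100):                            # random positive rank-k competitors
    C = rng.normal(size=(d, k))
    assert objective(B_star) <= objective(C @ C.T) + 1e-9
print(objective(B_star))
```

No random positive rank-k competitor improves on the truncated eigendecomposition, matching the conclusion of the lemma.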

Proposition 7 (Uniqueness of minimizers). The minimizers of L are unique up to orthogonal transformations and eigenfunction picking. More specifically, if U ∈ R^{k×k} is orthogonal, i.e., U⊤U = I, then L(ψ) = L(Uψ); and if λ_k = λ_{k+1}, one can choose different eigenfunctions as f_k in the eigendecomposition (λ_i, f_i) of T_β.

Proof.


Let us consider ψ : X → R^k with ψ_i = f_{θ_i} for θ_i ∈ H, and S : H → L²(µ_Ξ); θ ↦ θ⊤φ(·). We can use the tensor notations introduced earlier to parameterize ψ = SΘ⊤ with Θ = (θ_i)_{i∈[k]} seen as an element of R^k ⊗ H. The proof of Lemma 3 follows from the fact that

Since Ψ = S(H) = im S = im K^{1/2} = K^{1/2}(L²(µ_Ξ)), one can consider K^{-1} as the inverse of K such that, for ψ_i ∈ ker K, ψ_i⊤K^{-1}ψ_i = +∞. This is what we implicitly assumed in the main paper, which leads to the (ψ_i) being all in Ψ. Note that in many cases, Ψ is dense in L²(µ_Ξ) (Micchelli et al., 2006), and one does not need to take such a precaution, since then ker K = 0 and there is only one way to define K^{-1} on L²(µ_Ξ).

The proof of Lemma 3 given above might seem quite abstract for the reader unfamiliar with reproducing kernel Hilbert spaces. In this subsection, we provide a somewhat more accessible proof of this lemma based on covariance operators.

where the last term comes from the fact that

The search for ψ will be done in the form Θφ for Θ ∈ R^k ⊗ H and φ : X → H. Let us discuss technicalities related to the infinite-dimensional operators that will appear.

Assumption 5. The Hilbert space H is separable, and the mapping φ belongs to L 2 ( µ X ) endowed with Borel topology on both X and H .

As a consequence, Σ is compact, hence has a pure point spectrum, and since H is separable it can be diagonalized with its eigenvectors forming a basis of H .

We will see later that Σ^{-1/2}Σ_XΣ^{-1/2} is indeed isometric to T. Hence, under Assumption 4, Σ^{-1/2}Σ_XΣ^{-1/2} has a pure point spectrum. Meanwhile, the following remark shows that this operator is bounded without using the fact that T ⪯ I.

Remark 9. The operator Σ_X = E_X[E_{ξ,ξ′}[φ(ξ)φ(ξ′)⊤ | X]] ∈ H ⊗ H verifies 0 ⪯ Σ_X ⪯ Σ, with ⪯ the Loewner order (A ⪯ B if B − A is positive semi-definite). As a consequence, Σ_X is trace-class and Σ^{-1/2}Σ_XΣ^{-1/2} is continuous.

Proof. This follows from Jensen's inequality applied to A ↦ AA⊤, which can be proven using the positivity of covariance operators.

As a consequence, Tr Σ_X ≤ Tr Σ < +∞, Σ^{-1/2}Σ_XΣ^{-1/2} ⪯ I, and ∥Σ^{-1/2}Σ_XΣ^{-1/2}∥_op ≤ 1. The positivity follows from the fact that Σ_X is a covariance operator: Σ_X = E_X[E_ξ[φ(ξ)|X] E_ξ[φ(ξ)|X]⊤].


Let us begin by proving a variant of the lemma where everything is expressed in H. We expand later on the isometry between H and L²(µ_Ξ) (due to the isometry between S and Σ^{1/2}) that allows us to transfer it to the lemma as written in the paper.

Lemma 10. For (θ_i) ∈ H^k, f_θ : x ↦ ⟨φ(x), θ⟩, and a regularizer λ ∈ R,

with A and Σ being operators on H defined as

Moreover, ( f θ i ) are orthogonal in L 2 ( µ Ξ ) , where µ Ξ denotes the marginal distribution over augmentations.

The adjoint Θ ⊤ is taken with respect to the canonical topology on H and R k . Similarly,

This proves the first part of the lemma. Remark that the expression of the lemma is slightly different from the generalization to continuous X suggested by HaoChen et al. (2021) in their Appendix F, which would reuse the work of Schiebinger et al. (2015) and consider the covariance operator with feature φ̄(x) = q^{-1/2}(x) E[φ(ξ) | X = x], where q : x ↦ E_{X∼µ_Ξ}[k(x, X)], rather than Σ^{-1/2}Σ_XΣ^{-1/2}.

This proves the orthogonality of the f θ i in L 2 ( µ Ξ ) .


Lemma 11. S is isometric to Σ^{1/2}, and K = SS⊤ is an integral operator that maps f ∈ L²(µ_Ξ) to Kf ∈ L²(µ_Ξ), defined for ξ ∈ X as

Proof. This follows from the fact that both S and Σ^{1/2} are square roots of Σ. Indeed, Σ = S⊤S, since for θ ∈ H,

As a consequence, S is isometric to Σ^{1/2} (if we write the singular value decomposition of S as UDV⊤, then Σ^{1/2} = VDV⊤). Regarding the part in K, one can check with the same derivation that S⊤f = E[f(ξ)φ(ξ)] ∈ H, hence the value (Kf)(ξ) = (S⊤f)⊤φ(ξ) = E_{ξ′}[f(ξ′)φ(ξ′)⊤φ(ξ)].

Using the isometry, one can replace ∥Sθ∥ = ∥Σ^{1/2}θ∥ with the Hilbertian norms on H and L²(µ_Ξ), so that, for C operating on H, Tr SCS⊤ = Tr Σ^{1/2}CΣ^{1/2}. Going back to the proof in H, one can replace all the Σ^{1/2} by S or its adjoint at the right places to get the following statement.

Proof. This lemma follows from the previous discussion. The fact that S^{-⊤}Σ_X S^{-1} equates to T on the L²(µ_Ξ)-closure of Ψ is due to the characterization in Lemma 3. We can nonetheless prove it in a more direct fashion, by adapting Lemma B.9 of Saunshi et al. (2022) to our case.


Proof of Proposition 4

Proposition 4 relies on the fact that when two operators commute, they can be diagonalized in the same basis.

Proof. When the operators commute, if f is an eigenfunction of T with Tf = λf, then TKf = KTf = λKf. This means that the eigenspaces of T, i.e., ker(T − λI), are stable under K. As a consequence, K can be decomposed with respect to the direct sum L² = ⊕_{λ∈spec(T)} ker(T − λI). By diagonalizing the restrictions of K to each of those spaces, there exists a basis that diagonalizes both K and T.
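The argument can be checked numerically in finite dimension: two symmetric matrices built on a shared eigenbasis commute, and diagonalizing a generic linear combination recovers a basis that diagonalizes both (toy matrices standing in for T and K, not the paper's operators):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # shared orthonormal eigenbasis
T = Q @ np.diag(rng.random(d)) @ Q.T          # stand-in for the augmentation operator
K = Q @ np.diag(rng.random(d)) @ Q.T          # stand-in for the architecture operator

assert np.allclose(T @ K, K @ T)              # the operators commute

# A generic linear combination has a simple spectrum, and its eigenbasis
# diagonalizes both T and K simultaneously.
_, V = np.linalg.eigh(T + np.pi * K)

def off_diag(M):
    return M - np.diag(np.diag(M))

print(np.abs(off_diag(V.T @ T @ V)).max(), np.abs(off_diag(V.T @ K @ V)).max())
```

Both conjugated matrices come out diagonal up to numerical precision, as the proposition predicts.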

While we did not discuss it in the main text, one should not consider just any eigenvalue decomposition of T, but only the eigenfunctions that jointly diagonalize T and K. However, note that to find those eigenfunctions, based on the Courant-Fischer principle, one can take, recursively on i ∈ N, f_i = f_{θ_i} an eigenfunction in ker(T − λ_iI) that maximizes or minimizes ∥θ_i∥. Those eigenfunctions (f_i) will diagonalize T_λ, and the optimal representation will pick the ones that maximize f_i⊤T_λf_i as long as this quantity is positive.

If the f_i diagonalize K, then f_i ∈ im K^{1/2} = Ψ = im S, hence there exists θ_i ∈ H such that f_i = Sθ_i. As a consequence, with the L²(µ_Ξ) geometry, f_i⊤K^{-1}f_i = (Sθ_i)⊤(S⊤S)^{-1}Sθ_i = ∥θ_i∥². We use this to derive that

Remark 14. Recently, HaoChen & Ma (2022) have taken this second perspective on inductive bias by looking at the 'barrier' case where one can only match eigenfunctions that belong to the function space Ψ. In the kernel regime, this is deceptive since, for example, when considering the Gaussian kernel φ(x)⊤φ(x′) = exp(−∥x − x′∥²), Ψ is made of analytic functions (Sun & Zhou, 2008), hence cannot parameterize any indicator function without being one everywhere; therefore their approach would fail to explain how the Gaussian kernel could learn fast under the cluster assumption.


Remark about VICReg

When L = 0, finding ψ corresponds to finding k functions (f_{θ_i})_i that are orthogonal in L²(µ_Ξ) and maximize 1 − λ∥θ∥² = 1 − λf_θ⊤K^{-1}f_θ, before multiplying them by (1 − λ∥θ_i∥²)₊. Using the Courant-Fischer min-max principle, the functions (f_{θ_i})_i are given by the k largest eigenfunctions of K.
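The Courant-Fischer step here can be illustrated numerically: among unit-norm functions, the top eigenfunction of K minimizes f⊤K^{-1}f, i.e., maximizes 1 − λ∥θ∥² (a toy finite-dimensional K, assumed invertible; not the paper's kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
A = rng.normal(size=(d, d))
K = A @ A.T / d + 1e-3 * np.eye(d)    # toy PSD 'kernel' operator, invertible
K_inv = np.linalg.inv(K)

s, F = np.linalg.eigh(K)              # ascending eigenvalues
f_top = F[:, -1]                      # top eigenfunction of K
best = f_top @ K_inv @ f_top          # equals 1 / s.max()

for _ in range(100):                  # no unit-norm competitor does better
    f = rng.normal(size=d)
    f /= np.linalg.norm(f)
    assert best <= f @ K_inv @ f + 1e-9
print(best, 1.0 / s.max())
```

Repeating the argument on the orthogonal complement selects the k largest eigenfunctions, as stated above.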

If µ Ξ = δρ X +(1 -δ ) µ ⊥ , then for any measurable function f

This follows from the embeddings S_ρ and S_µ of H in L²(ρ_X) and L²(µ_Ξ), respectively. We have seen earlier that S_µ⊤S_µ = Σ_{µ_Ξ} and S_ρ⊤S_ρ = Σ_{ρ_X}. Let f ∈ Ψ; there exists θ ∈ H such that f = f_θ, hence, using the isometry between S and Σ^{1/2},

We conclude by using the fact that A ⪯ cB implies B^{-1/2}AB^{-1/2} ⪯ c·I.

This follows from the definition of the different objects,

We develop this last objective as

If µ_Ξ has k connected components, then the indicators of those components will be orthogonal in L²(µ_Ξ) while minimizing the invariance term E_X[E[∥φ(ξ) − φ(ξ′)∥² | X]]. As a consequence, f* belongs to the span of the (f_i)_{i≤k}.

This section is devoted to the proof of Theorem 1. In all the following, k_λ denotes the number of positive eigenvalues of T_λ (counting multiplicity) as an operator on L²(µ_Ξ). We fix k ≤ k_λ, and denote by F the span of the (f_i)_{i∈[k]}. In the kernel regime, the space F can also be written as F = {w⊤Θ*φ | w ∈ R^k} for Θ* the minimizer defined in Lemma 12. We denote by F̂ the space defined similarly from an estimate Θ̂ of Θ*.

The error on the downstream task can be decomposed into three quantities: the error on the downstream task linked with the capacity of F̂ (12); the error on the upstream task linked to the approximation error between F and F̂ (13); and the error due to the fact that the downstream task might not be effectively solvable within F (14).

Lemma 15 (Decomposition intuition). Let F and F̂ be two closed convex subsets of L²(ρ_X), and let Π_F denote the orthogonal projection onto the space F in the L²(ρ_X) geometry. For any function f : X → Y in F̂, the excess risk (1) can be decomposed as

Proof. The proof of the lemma follows from the classical characterization of the mean square error and a triangle inequality. We introduce the following technical assumption.

When ℓ(y, y′) = ∥y − y′∥², using the fact that (X, Y) ↦ Y − E[Y|X] is orthogonal in L²(ρ) to any measurable function that does not depend on Y,

For linear probing (3), when the downstream task is learned with n data points and a noise level ε, (12) is expected to behave as ε²k/n (Mourtada & Rosasco, 2022). In this linear setting, (13) should be seen as a measure of the angle between F and F̂, seen through the eyes of f* (Davis & Kahan, 1970; Kato, 1995).


The downstream task error relates to the generalization error of a mis-specified linear model. To bound it, we will use the convergence-rate analysis through concentration of integral operators of Smale & Zhou (2007) and Caponnetto & De Vito (2007). This requires slightly reworking the previous decomposition.

Lemma 16 (Warm-up). Let F̂ be the span of the (ψ_i)_{i∈[k]}, with S_ψ : R^k → L² defined as S_ψw = w⊤ψ; then

Based on data ( X i , Y i ) , one can define the empirical risk minimizer f n = S ψ w n , where w n is the minimizer of

Proof. The two formulas can be proven at once by remarking that if Π_F̂f* is defined as S_ψw for w minimizing

Minimizing this quadratic form leads to the first result. The second result is proven in the same way, after substituting the distribution over (X, Y) with the empirical one n^{-1}Σ_{i∈[n]} δ_{(X_i,Y_i)}.

where ℓ²(n) is endowed with the normalized (i.e., probability-like) scalar product ⟨a, b⟩ = n^{-1}Σ_{i∈[n]} a_ib_i. Similarly to Lemma 11, one can show that the adjoints of S_ψ and Ŝ_ψ, and the covariance operators, are

In this subsection, we will only consider the S and Σ associated with ψ, and we remove the indices for convenience. To simplify notation, when f ∈ L²(ρ_X) we will write Ŝ⊤f for Ŝ⊤(f(X_i))_{i∈[n]}.

When doing so, under Assumption 7, the average excess risk can be decomposed as, with M = sup∥ψ(X)∥,

Proof. Revisiting the warm-up lemma, one can show that

The first term can be worked out with the techniques of Mourtada & Rosasco (2022) as

Proof. Let us set f = (I − Π_F̂)f* and A_γ = A + γI for simplicity. Remark that f is orthogonal to the image of S, hence S⊤f = 0. We decompose the last quantity as

We know that

We also have that, for A and Â self-adjoint and any t > 0, the following sequence of implications holds:

Probabilistic arguments will show that t, as well as ∥(Σ + γ)^{-1/2}Ŝ⊤(I − Π_F̂)f*∥, vanishes to zero in n^{-1/2}. We will use Bernstein's concentration inequality.

Lemma 19 (Bernstein concentration inequalities). Denote by A a Hilbert space and by (Z_i)_{i∈[n]} a sequence of independent random vectors in A such that E[Z_i] = 0, and such that there exist two positive constants M and σ such that for all m > 2,

In particular, when the (Z_i) are bounded by 3M and σ² = n^{-1}Σ_{i∈[n]} E[∥Z_i∥²], the condition holds. When, instead, the Z_i are symmetric matrices in R^{k×k} and ∥·∥ is the operator norm, the same bound holds with k exp(···) instead of 2 exp(···) on the right-hand side, where σ² = ∥n^{-1}Σ_{i∈[n]} E[Z_i²]∥.

Proof. See Corollary 1 of Pinelis & Sakhanenko (1986) for the first part, and Tropp (2015) for the matrix version.

Lemma 20. For any t > 0, the vector part of the last term of the bias decomposition (20) can be controlled with

For any t > 0 ,

where b = 2 Tr((Σ + γ)^{-1} Σ), M = sup ∥ψ(X)∥, and a = ∥f∗∥_{L∞} + M ∥f∗∥_{L²}. Moreover, this vector part is bounded by γ^{-1} a² M². The matrix part in the last term of (20) is controlled with

Proof. Let us introduce

as well as, since im S = ˆ F

Moreover,

We have (Σ + γ ) -1 / 2 ( ˆ Σ -Σ)(Σ + γ ) -1 / 2 = 1 n ∑ i ∈ [ n ] Z i , and

Finally, using the fact that U_i² ⪯ U_i sup ∥U_i∥, together with the variational definition of the mean, with the infimum taken with respect to the Loewner order,

Proof. In essence, we have two random variables: the matrix one, X = ∥Σ_γ^{-1/2}(Σ̂ − Σ)Σ_γ^{-1/2}∥²_op, and the vector one, Y = ∥Σ_γ^{-1/2} Ŝ⊤(I − Π_F̂) f∗∥². We proceed using the facts that, for X positive, E[X] = ∫_{t>0} P(X > t) dt, and that ab > t implies, for any s, that a > 1 + s or b > t/(1 + s).

Rather than solving this in closed form, we proceed with a much simpler bound, which consists in taking s = 1 without any optimization. It gives the much simpler formula

For Y, we can use the same technique as before; using that exp(−(a + b)^{-1}) ≤ exp(−(max(2a, 2b))^{-1}) ≤ exp(−(2a)^{-1}) + exp(−(2b)^{-1}), we get

Let us now simplify the constants that appear in the bounds derived so far.

Proof. The first bound is a direct application of the fact that Σ ⪯ Σ + γ, hence Tr((Σ + γ)^{-1} Σ) ≤ Tr(I) = k. The second bound is due to the fact that ψ = Θ̂φ, hence ∥ψ∥ ≤ ∥Θ̂∥_op ∥φ∥ ≤ ∥Θ̂∥_F ∥φ∥. In the meantime, if Θ̂ were regularized,

Finally, the last equality is due to the fact that f ∗ ( x ) is the mean of Y conditionally to X = x ,

where k_e = Tr(Σ(Σ + γI)^{-1}) ≤ k is the effective dimension, a = ∥(I − Π_F̂) f∗∥_{L∞} ≤ ∥f∗∥_{L∞} + M ∥f∗∥_{L²}, and M = sup ∥ψ∥ ≤ kλ^{-1} sup ∥φ∥.

Proof. When γ = c log(n)^{1+δ} n^{-1}, the excess risk reads


An ideal control of (13) would leverage closed-form solutions to both the population and empirical risks and use concentration inequalities on integral operators, as in Cabannes et al. (2021b); Pillaud-Vivien & Bach (2023). Yet those proofs proceed with the estimation of the smallest eigenfunctions of (Σ̂_X + λ)^{-1/2} Σ (Σ̂_X + λ)^{-1/2}, rather than the biggest of Σ^{-1/2}(Σ_X − λ)Σ^{-1/2}. In this proof, we will instead use derivations based on empirical-process concentration, together with the following 'transfer bound'.

Proof. For simplicity, let us remove the dependency on μ_Ξ in the proof. Let us introduce C = SΘ̂Θ̂⊤S⊤; C is a positive operator of rank k in L², so we can write it as C = ∑_{i∈[k]} μ_i g_i g_i⊤ with μ_i ≥ 0.

Let us decompose T_λ = T₊ − T₋, where T₊ and T₋ are positive. Since T_λ ⪯ T₊, we have −C^{1/2} T_λ C^{1/2} ⪰ −C^{1/2} T₊ C^{1/2}, hence

Minimizing this quantity with respect to μ_i leads to

Let us now introduce (f_i), the eigenfunctions of T_λ. With U = (⟨g_i, f_j⟩²)_{ij} ∈ R^{k×k_λ} and λ = (λ_i) ∈ R^{k_λ}, we have

Note that U is at most doubly stochastic, since both (g_i) and (f_i) are orthonormal families; thus ∥U∥ ≤ 1 and U⊤U ⪯ I. If one replaces the f_i by f_i/∥Π_F̂ f_i∥ in the definition of U, the latter becomes Ũ = diag((∥Π_F̂ f_i∥²)_{i≤k_λ})^{-1} U, which is still right stochastic. Hence

The left-hand side in Lemma 24 is to be linked with the desired control of (13). In order to deal more finely with distribution-shift, we introduce the following generic variant of Assumptions 1 and 2.

with ζ : R → R continuous, increasing and ζ (0) = 0 .

Definition 25 (Distribution ε-robustness). A closed convex set of functions F is said to be ε-robust to distribution shift conditionally on the function f if

Assumption 9. There exists a profile σ : R² → R, increasing and bounded, such that for any k ∈ N, Span{f_i}_{i∈[k]} is σ(k)-robust to f∗.

Applied to Π ( µ Ξ ) F l f ∗ , this leads to

We are done with all the quantities that relate to the distribution shift. Under Assumption 3, we have

To get a finer control of (13), remark that the left-hand side of (26) has some additional constraints that can help us tighten our bound. For simplicity, we will remove all the dependency on μ_Ξ in the following. In essence, we want to lower bound the λ_i² and to upper bound the ∥(Π_F − Π_F̂) f_i∥. The next lemma adds a constraint on the maximal error one can make on (13) under a constraint on L(Θ; λ).

Proof. Let us consider two projections U and V onto the spans of (u_i)_{i∈[k]} and (v_i)_{i∈[k]}, with (u_i)_{i∈N} and (v_i)_{i∈N} two orthonormal bases of the ambient space. We have, with the Hilbert-Schmidt norm everywhere,

Based on the invariance of the Hilbert-Schmidt norm under adjoints, and the fact that projections are self-adjoint, we have

Finally, we also know that, since projections contract distances, ∥(I − V)U∥² ≤ ∥U∥² = k. The claim of the lemma consists in writing explicitly

Given a control on (2), finding an upper bound on (13) reduces to a purely algebraic problem. In order to find the worst value that ∑_{i≤k} |⟨f∗, f_i⟩| ∥(Π^{(μ_Ξ)}_F − Π_F̂) f_i∥_{L²(μ_Ξ)} can take, let us introduce

The previous results lead to the following maximization problem in order to find the worst value of (13):


Keeping it simple and concluding after controlling (13)

Solving the algebraic problem above smartly, to get the best bound on (13), requires distinguishing between many cases. While it might be relevant to distinguish those different cases and show different convergence regimes, this subsection proceeds in a simpler, although less tight, way. In particular, we can simplify the problem with respect to the (x_i)_{i≤k}: using the fact that k′ ≤ k (it is the minimum between k and the number of positive eigenvalues of T̂_λ based on samples), it leads to x²_{k+1} = ∑_{i≤k} x_i² and x²_{k+1+j} = 0, and (30) becomes

In general, one could refine this formulation by introducing a probabilistic argument that tells us how much one can expect the error between Π_F̂ and Π_F to concentrate on the eigenspace linked to the smallest eigenvalue of T_λ². The problem shows two behaviors: if the c_i decrease faster than the λ_i, then we want to charge the energy of (x_i)_{i≤k} on the smallest indices; otherwise, we want to charge the (x_i)_{i≤k} on the biggest indices.

To keep it simple, we will first optimize L without any rank restriction, which allows considering λ_{k_λ+1} = 0, before thresholding the rank to get to a space of dimension k.

Proof. Keeping the algebraic notation above, this comes from a simple application of Cauchy-Schwarz: for (a_i) ∈ R^k,

For the second part, set F̂_k to be the span of the first k eigenfunctions among all the ones retrieved by the empirical minimization of L, and F to be the span of all the eigenfunctions linked with positive eigenvalues of T_λ. Let us rework the decomposition of the excess risk; we have

The last bound is due to Assumption 3, as well as a lax bound on the operator norm of the difference of two projections. While one could remove the factor k − k_λ, we keep it, as we expect the quantity to behave this way, with a constant similar to ∥f∗∥²/k_λ instead of ∥f∗∥².

Theorem 4. Under Assumptions 3, 7, 8 and 9, there exists a regularizer γ such that the minimizer of the regularized empirical risk (19) verifies the following: for any δ > 0, there exists N_δ > 0 such that for any n > N_δ, its excess risk reads

where F_l is the span of the first l eigenfunctions of T_λ, k_λ the number of strictly positive eigenvalues of T_λ, k_e ≤ k the effective dimension of ψ in L²(ρ_X), a = ∥(I − Π_F̂) f∗∥_{L∞} ≤ ∥f∗∥_{L∞} + M ∥f∗∥_{L²}, M = sup ∥ψ∥ ≤ kλ^{-1} sup ∥φ∥, and T̃_λ = ∑_{i∈[k]} (λ_i² − λ_{k+1}²)^{1/2} f_i f_i⊤. Moreover, under the sole Assumptions 1 and 2, we have the simpler bound

where Θ̂ is understood as belonging to R^{k_λ} ⊗ H in this last expression, and F_λ is the eigenspace linked with the positive eigenvalues of T_λ.


The following result relates the eigenvalues of T λ with those of K . It notably proves that k λ is finite when K is trace-class, which is one claim of Theorem 1.

Lemma 29 (Relating capacity between K and T_λ). If (μ_i) are the eigenvalues of K, then the number of eigenvalues of T_λ that are bigger than t ∈ R is smaller than the cardinality of {i | μ_i > λ/(1 − t)}. Moreover, if there exists q > 0 such that Tr(K^{1/q}) < +∞, then there exists a c_q such that μ_i ≤ c_q i^{-q}. As a consequence, in this setting, for any t ∈ R, the number of eigenvalues of T_λ that are bigger than t is smaller than (c_q(1 − t)/λ)^{1/q}.

Proof. Let us consider the set of eigenvectors (f_i) whose eigenvalues are bigger than t, and the span of this set, whose dimension we want to quantify. We know that all unit vectors in this span satisfy

This means that this span does not intersect the span of the φ_i, for φ_i the eigenvectors of K^{-1} whose eigenvalues are bigger than λ/(1 − t). In other terms, this linear space does not intersect a linear space of co-dimension d, where d is the cardinality mentioned in the lemma statement. Let us denote by U the space we are interested in, by V the space it intersects only at the origin, and by E the ambient space. Since U ∩ V = {0}, the quotient (U + V)/V is isomorphic to U, hence

The second claim follows from the fact that the μ_i^{1/q} are summable and decreasing, hence the sequence S_n = ∑_{i≤n} μ_i^{1/q} is a Cauchy sequence. As a consequence, there exists N ∈ N such that for any s > N/2, we have

Hence, for all s ≥ N, we have μ_s ≤ s^{-q}, so μ_s/s^{-q} is bounded. Denoting by c_q the maximum leads to the first result. The final statement is a consequence of the fact that c_q i^{-q} > λ/(1 − t) implies i < (c_q(1 − t)/λ)^{1/q}.

Example 8. When considering the radial basis function kernel φ(x)⊤φ(x′) = exp(−∥x − x′∥²), Ψ is the space of analytic functions (Sun & Zhou, 2008), which is known to be small compared to L² spaces (Kolmogorov & Tikhomirov, 1959). As a consequence, one can think of q as +∞ in the previous lemma. More generally, when φ is bounded, K is trace-class and one can take q = 1.

Proof. The capacity of K relates to the capacity of K({f | ∥f∥_{L²(μ_Ξ)} ≤ 1}), which itself relates to the capacity of Ψ = im K^{1/2}. This explains why q can be taken, in essence, arbitrarily big (Bach, 2023).

proves that K is trace class.


In the main text, we have assumed that T_λ was the right operator to define the solution of the representation learning problem (which explains Assumption 3). This might offend the purist, as it would be nicer to define a principled solution that does not depend on the choice of the architecture (yet might be easier to approximate with some architectures than others). This suggests studying the behavior of the last expression in Theorem 4 when λ goes to zero.

We leave for future work a more precise study of the inductive bias in this vanishing-regularization setting: in essence, the choice of architecture Ψ perturbs T by λK^{-1} to make it T_λ, and ideally we would like to quantify the speed at which T_λ converges to T, when seen through the eyes of f∗, as we decrease the regularization parameter. In the kernel regime, this could be characterized by perturbation theory (Kato, 1995) and refinements of the Davis-Kahan theorem (Davis & Kahan, 1970) taking into account Assumption 3. Moreover, when K and T commute, the interplay can be studied in a more direct fashion thanks to Proposition 4.

Control of the upstream excess risk

In order to control the excess risk, one can use techniques stemming from optimization as well as techniques stemming from classical statistical learning.

Rademacher complexity

Lemma 30. Let Θ ∈ R k ⊗H , denote Λ = Θ ⊤ Θ ∈ H ⊗ H

Moreover, the regularization reads λ ∥ Θ ∥ 2 = λ TrΛ = λ ⟨ Λ , I ⟩ .

Let us recall three useful facts from the statistical learning literature.

Lemma 31. Let R(ζ) = E_Z[ℓ(ζ, Z)], let ζ∗ be the minimizer of R inside a domain for ζ, and let ζ_n be the minimizer of R_{(Z_i)}(ζ) = (1/n) ∑_{i∈[n]} ℓ(ζ, Z_i) based on exchangeable data Z_i such that E_{(Z_i)}[R_{(Z_i)}] = R. The average excess risk of ζ_n is bounded by the Rademacher complexity as

where the σ_i are i.i.d. variables taking values +1 and −1 with probability one half.

Proof. This is a classical result from learning theory (Bartlett & Mendelson, 2002). The proof consists in introducing both the empirical risks of ζ_n and ζ∗, and bounding the difference between the empirical and population risks of ζ∗ by the supremum of this deviation over the entire domain of ζ. This is followed by replacing the population risk with the average empirical one, and a symmetrization trick that introduces the variables (σ_i), based on the exchangeability of the (Z_i).

Proof. This is a classical result on the Rademacher complexity of ball-constrained predictors (Bartlett & Mendelson, 2002).

Lemma 33. Moreover, when h : R → R is Lipschitz, the following contraction principle holds:

To work out those terms, remark that if ( Z i ) are i.i.d. variables,

While one could work out each term, the lemma consists in simply bounding φ by κ, hence bounding all the means and standard deviations that involve φ by expressions in κ.

Lemma 35. When minimizing a regularized risk, one can restrict the search of Θ to the constraint ∥Λ∥_HS ≤ λ^{-1} k.

The attentive reader will remark that, compared to the bound of HaoChen et al. (2021), we gain a factor k^{-1/2}. Indeed, this factor could be recovered in HaoChen et al. (2021) by using the techniques of Maurer (2016) rather than a trivial bound on the Rademacher complexity of vector-valued function spaces in k max_{i∈[k]} R̂(F_i), with the notations of HaoChen et al. (2021).
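The ball-constraint computation underlying these Rademacher bounds can be checked numerically. The sketch below (hypothetical Gaussian data and radius B, chosen only for illustration; stdlib only) uses the fact that the supremum of a linear functional over an L² ball has a closed form, and verifies the classical bound of Bartlett & Mendelson (2002):

```python
import math
import random

random.seed(0)

n, d, B = 50, 5, 1.0
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

def rademacher(trials=2000):
    # Empirical Rademacher complexity of {x -> <theta, x> : ||theta|| <= B}.
    # For each sign draw, the supremum over the ball is attained in closed form:
    #   sup_{||theta|| <= B} (1/n) sum_i sigma_i <theta, x_i> = (B/n) ||sum_i sigma_i x_i||.
    total = 0.0
    for _ in range(trials):
        sigma = [random.choice([-1, 1]) for _ in range(n)]
        s = [sum(sig * x[k] for sig, x in zip(sigma, xs)) for k in range(d)]
        total += B / n * math.sqrt(sum(v * v for v in s))
    return total / trials

# Classical bound (Bartlett & Mendelson, 2002): R_n <= (B/n) sqrt(sum_i ||x_i||^2).
bound = B / n * math.sqrt(sum(sum(v * v for v in x) for x in xs))
assert rademacher() <= bound
```

The gap between the Monte Carlo estimate and the bound comes from Jensen's inequality (E∥·∥ ≤ (E∥·∥²)^{1/2}), which is the only inequality used in the closed-form bound.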


This section is devoted to illustrating what T and K are under simple distributions, thanks to harmonic analysis techniques.

Harmonic analysis on sequences of bits, a.k.a. the Boolean hypercube

A fine-grained analysis of the role of classical augmentations can be derived in settings that allow precise derivations. We shall focus on invariant data distributions, such as the uniform distribution, and on augmentations consisting of permutations or perturbations of coordinates that leave this distribution invariant. While such distributions may lack structure present in real data, they allow for a precise study of the effect of certain architectures and augmentations, which may also partly apply to more realistic data. The study involves the construction of appropriate L² bases that ease the study of the effect of both the kernel operator K and the smoothing operator T defined from augmentations. These are closely related to the study of invariant kernels (see, e.g., Bietti et al., 2021; Bietti, 2022; Mei et al., 2021; Misiakiewicz & Mei, 2022).

We focus here on data that are d-bit inputs on the Boolean cube X = {−1, +1}^d, sampled from the uniform distribution, which allows us to use harmonic analysis tools to their fullest. In this setting, the space of functions L²(X) = L²(X, R, μ_X) is defined through the usual scalar product, for f, g : X → R,

Let us now analyze the role of augmentations in the definition of T on the Boolean cube. For simplicity and ease of notation, we assume indexing of the bits is taken mod d, e.g., x_{−1} = x_d.

Parity functions. A useful basis of this space is given by the parity functions, which can be seen as Fourier functions in this L² space (O'Donnell, 2014). They are defined, for each subset S ⊆ [d], as counting the parity of x within this set:


Proof. It is straightforward to check that ⟨χ_S, χ_S⟩ = 1. If S ≠ S′, then w.l.o.g. there is an i ∈ S \ S′, and we have

Proposition 41 (Random noise). Consider flipping each bit of x with probability p, formally via the operation

where the operation x ⊙ y applies pointwise multiplication and the distribution Ber({−1, +1}, p) returns the value −1 with probability p and +1 with probability 1 − p. Under the augmentations ξ = X ⊙ y, T is diagonalized in the parity basis with

In other terms, T applies a factor |1 − 2p|^{|S|} that reduces the effect of higher-order Fourier functions.

Proof. Recall the formula g⊤Tf = E_X E_{ξ,ξ′}[⟨f(ξ), g(ξ′)⟩ | X]. As a consequence, with y, y′ denoting the noise strings (each bit equal to −1 with probability p) and S △ S′ = (S ∪ S′) \ (S ∩ S′),
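The key computation in this proof, E_y[χ_S(x ⊙ y)] = (1 − 2p)^{|S|} χ_S(x), can be verified exactly by enumerating all noise strings. A small sketch (the choices of d, p, and S are illustrative):

```python
import itertools

d, p = 5, 0.3
S = (0, 2, 3)  # an example subset of bits

def chi(T, x):
    out = 1
    for i in T:
        out *= x[i]
    return out

def averaged_parity(x):
    # Exact expectation of chi_S(x . y) over bit-flip noise y, by enumerating
    # all noise strings: y_i = -1 with prob p, +1 with prob 1 - p, independently.
    total = 0.0
    for y in itertools.product([-1, 1], repeat=d):
        prob = 1.0
        for yi in y:
            prob *= p if yi == -1 else 1 - p
        total += prob * chi(S, tuple(xi * yi for xi, yi in zip(x, y)))
    return total

# Each bit of S contributes a factor E[y_i] = (1 - p) - p = 1 - 2p.
for x in itertools.product([-1, 1], repeat=d):
    assert abs(averaged_parity(x) - (1 - 2 * p) ** len(S) * chi(S, x)) < 1e-12
```

Since the noise acts independently per coordinate, only the bits inside S matter, which is exactly why the shrinkage factor is exponential in |S|.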

where [a, a + w) = {a, a + 1, . . . , a + w − 1}, a is drawn from the uniform distribution over [d], and the distribution Ber({−1, +1}, 0.5) returns a random bit with equal probability for +1 and −1, thus effectively masking the values outside of the window [a, a + w). Under the augmentations ξ = M^w_a(X), T is diagonalized in the parity basis with

In other terms, the action of cropping effectively removes from the kernel any dependence on high-order parity functions whose support falls outside a window of size w.

Proposition 43 (2D Cropping). Consider the 2D setting X = {−1, +1}^{m×d}, where inputs are organized into an m × d grid. Consider the cropping operation to a window of size v × w, formally

where J is the involution that matches any set S to its mirror S̃ = {−i | i ∈ S}. In this setting, T is diagonalized by the (χ_S + χ_S̃)/√2 and (χ_S − χ_S̃)/√2 for S ⊆ [d].

which explains the lemma.
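The diagonalization by symmetrized parities can be checked directly: χ_S ∘ J = χ_{S̃}, so symmetric combinations are invariant under the flip while antisymmetric ones change sign. A small enumeration sketch (the choices of d and S are illustrative):

```python
import itertools

d = 5

def chi(S, x):
    out = 1
    for i in S:
        out *= x[i % d]
    return out

def mirror(S):
    # The mirror set S~ = {-i mod d | i in S}.
    return tuple(sorted((-i) % d for i in S))

def flip(x):
    # The involution J: coordinate i receives the value at index -i (mod d).
    return tuple(x[(-i) % d] for i in range(d))

S = (1, 2)
Sm = mirror(S)
for x in itertools.product([-1, 1], repeat=d):
    sym = chi(S, x) + chi(Sm, x)
    anti = chi(S, x) - chi(Sm, x)
    # chi_S(Jx) = chi_{mirror(S)}(x): the symmetric combination is invariant
    # (eigenvalue 1) while the antisymmetric one changes sign.
    assert chi(S, flip(x)) + chi(Sm, flip(x)) == sym
    assert chi(S, flip(x)) - chi(Sm, flip(x)) == -anti
```

Averaging over applying J or not therefore keeps the symmetric combinations and annihilates the antisymmetric ones.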

Remark 45. Up to now, we have studied all the operators in the space L²(X, R, μ_X), while the main text considered those operators in L²(X, R, μ_Ξ). This is justified by the fact that all the transformations studied earlier leave the uniform distribution invariant, hence



Study of translations through cyclic parities

In order to study augmentations that consist of permutations, and more specifically translations, the parity basis is not adapted to diagonalizing T. Instead, we define below a different basis that incorporates cyclic symmetries (Misiakiewicz & Mei, 2022). We note that a similar study may be carried out on other distributions, e.g., uniform on the sphere, products of spheres, or the torus (Bietti et al., 2021; Bietti, 2022; Favero et al., 2021; Mei et al., 2021).

Cyclic parity functions. The functions χ S are polynomials that can be grouped by their degree ℓ = | S | into spaces V d,ℓ , whose direct sum yields the full L 2 ( X ) space, with

Those different spaces can be further decomposed into orbits under the action of a group. In particular for the group of permutations G = S d , we define the action A : G ×X → X denoted A ( a, x ) = a · x as

To give a concrete example of the study of augmentations through harmonic analysis, let us focus more specifically on the action of translations, which form a subgroup of permutations. For simplicity, we will denote this group [d], understood as Z/dZ, acting on X as

where i − a is understood modulo d. Define the orbits of this action as {S + a | a ∈ [d]} for S ⊆ [d]. On those different orbits, one can define the following 'cyclic parities' ψ_{m,S} : X → C:

where m ∈ [k_S] and S is taken as a representative of an orbit.

Lemma 46. The cyclic parities (ψ_{m,S}), for m ∈ [k_S] and S in a set of representatives of the orbits of the translation action, form an orthogonal basis of L²(X, C, μ), where μ is the uniform measure on X. Moreover, they diagonalize the operators A : L² → L² defined as Af(x) = f(a · x) for any a ∈ [d].

Proof. The first part follows from the fact that L²(X) can be decomposed into the direct sum of the V_{d,ℓ} for ℓ ∈ [0, d], and that each subspace can be further decomposed along the orbits of the translation action, orb(S) = {S + a | a ∈ [d]} (note that translations do not change the cardinality of the sets S). Those latter spaces can be parameterized through the discrete Fourier transform, yielding the ψ_{m,S}.
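The eigen-relation for translations can be verified by brute force on a small hypercube. In the sketch below (illustrative d, a set S with trivial periodicity so that k_S = d, and one fixed sign convention for the phases, all assumptions of the example), each translation acts on ψ_{m,S} by a unit-modulus phase:

```python
import cmath
import itertools

d = 6
S = (0, 1, 3)  # a set with no cyclic periodicity, so its orbit has size k_S = d
kS = d

def chi(T, x):
    out = 1
    for i in T:
        out *= x[i % d]
    return out

def psi(m, x):
    # Cyclic parity: DFT along the orbit {S + k | k in [d]}.
    return sum(cmath.exp(2j * cmath.pi * k * m / kS)
               * chi(tuple((i + k) % d for i in S), x)
               for k in range(kS)) / kS ** 0.5

def translate(x, a):
    # (a . x)_i = x_{i - a mod d}
    return tuple(x[(i - a) % d] for i in range(d))

# Each translation multiplies psi_{m,S} by a phase e^{2 i pi a m / k_S}
# (the sign of the exponent depends on the chosen convention).
for m in range(kS):
    for a in range(d):
        phase = cmath.exp(2j * cmath.pi * a * m / kS)
        for x in itertools.product([-1, 1], repeat=d):
            assert abs(psi(m, translate(x, a)) - phase * psi(m, x)) < 1e-9
```

This is exactly the statement that the circulant blocks of T on each orbit are diagonalized by the discrete Fourier transform.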

A natural way to 'find' those bases is to try to diagonalize an operator T such that (χ_S⊤ T χ_{S′})_{S,S′⊆[d]} is block diagonal, where each block corresponds to a circulant matrix on an orbit, which can be diagonalized with the discrete Fourier transform. This is especially the case for the operator of the lemma:

The above is only nonzero when [d] · S intersects [d] · S′, which implies orb(S) = orb(S′), thereby yielding a block-diagonal structure. Indexing the elements of the i-th block by S_{i,k} = S_i + k for k ∈ [d], we have

The study of the operator T can be simplified thanks to its square root A : L²(μ_Ξ) → L²(μ_X), formally defined by

and verifying

This decomposition will be particularly useful when μ_X is invariant under the action of permutations, which implies μ_X = μ_Ξ =: μ.

Lemma 47. In the uniform Boolean setting, when augmentations are defined as ξ = a · X where a is a permutation sampled from the probability distribution p ∈ ∆ S d ,

Proof. The square root of T is defined as Af(x) = ∑_{a∈S_d} p(a) f(a · x). Let us focus on the case where p(b) = δ_{a=b}; using the fact that μ_X is the uniform measure, hence left invariant by translations, we compute the adjoint of A with

In the general case, we get by linearity,

where A_k is the operator that maps f to x ↦ f(k · x); it is a translation operator, and revisiting the proof of Lemma 46, A_k ψ_{m,S} = e^{−2iπkm/k_S} ψ_{m,S}. This leads to

Computing T = A ⊤ A leads to the result. Remark that if we further assume that p is symmetric (i.e., p ( a ) = p ( a -1 ) ), then we have A ⊤ = A , so that T = A 2 .

where ˆ p is the Fourier transform of p , defined for ω ∈ [ d ] by

and

We now show how different sampling distributions over translations induce varying smoothing effects in the operator T .

Example 9 (Smoothing effect of translations) . To see the effect of augmentation strength, consider a distribution p over translations that takes the form p ( a ) = ωp 0 ( ωa ) , where p 0 is a localized window shape (e.g., uniform or Gaussian) that sums to 1 . Here ω ≈ 1 / ∆ is inversely related to the window size ∆ , which controls the 'strength' or range of augmentations. Then we have

Here, the squared Fourier coefficients |p̂₀(m)|² typically decay with the frequency m, which shows that T has a smoothing effect that penalizes eigenfunctions ψ_{m,S} with larger m, i.e., those which oscillate more quickly. The above formula also highlights that increasing the augmentation strength ∆ will lead to faster decay with m, while leaving the translation-invariant eigenfunctions (m = 0) unaffected.
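The claims of this example are easy to check numerically: the DFT of any distribution over translations equals 1 at frequency 0, has modulus at most 1, and wider uniform windows damp the first nonzero frequency more. A small sketch (the choices of d and window widths are illustrative):

```python
import cmath

d = 12

def p_hat(p, omega):
    # Discrete Fourier transform of a distribution p over translations Z/dZ.
    return sum(p[a] * cmath.exp(-2j * cmath.pi * omega * a / d) for a in range(d))

def uniform_window(width):
    # Uniform distribution over translations {0, ..., width - 1}.
    return [1.0 / width if a < width else 0.0 for a in range(d)]

for width in (2, 4, 8):
    p = uniform_window(width)
    assert abs(p_hat(p, 0) - 1) < 1e-12               # invariant component untouched
    assert all(abs(p_hat(p, m)) <= 1 + 1e-12 for m in range(d))

# Stronger augmentations (wider windows) damp the oscillating components more.
assert abs(p_hat(uniform_window(4), 1)) < abs(p_hat(uniform_window(2), 1))
```

In particular, the full window (width d) sends every nonzero frequency to zero, recovering exact translation invariance.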


A particularly useful feature map φ to define the linear class of functions Ψ is given by the set:

for any sequence (e_S) ∈ R^{2^d}. Linear models of this form can be diagonalized in the parity basis, which allows one to effectively study the interplay between the role of augmentations and the role of the architecture.

Among those classes of functions are the dot-product kernels, which verify k(x, y) := φ(x)⊤φ(y) = h̃(∥x − y∥²) = h(x⊤y). Once again, those kernels are particularly well adapted to the Fourier geometry of the Boolean hypercube.

Lemma 50 (Spectral decomposition of dot-product kernels). Any dot-product kernel is diagonalizable in the parity basis. Specifically, there exist (ν_i)_{i∈[0,d]} ∈ R^{d+1} such that, when μ_X is the uniform distribution on the hypercube,

Proof. One can check that x⊤y = d − 2k, with k the number of bits that differ between x and y. Define Q_{ℓ,d}, the averaged polynomial of degree ℓ, as

for any Boolean strings x and y. The Q_{ℓ,d} are well defined since the left-hand side is translation invariant. Moreover, leveraging the orthogonality of the χ_S, one can show that the (Q_{ℓ,d})_{ℓ∈[0,d]} form a basis of functions on {d − 2k | k ∈ [0, d]}.

More exactly, the m ↦ (d choose ℓ)^{-1/2} Q_{ℓ,d}(m) form an orthonormal basis of the L² space endowed with τ, the pushforward measure of the uniform distribution on X through the mapping x ↦ ⟨x, y⟩ for any fixed y, and the dimensions match. As a consequence, there exists ν_ℓ such that

where ν ℓ can be found by computing the scalar product between h and Q ℓ in L 2 ( τ ) .
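The lemma can be verified by brute force on a small hypercube: for any dot-product profile h, the parity functions are eigenfunctions of K, with eigenvalues depending only on |S|. A sketch with an arbitrary smooth profile (the choice h(t) = exp(t/d) is an illustrative assumption):

```python
import itertools
import math

d = 4
points = list(itertools.product([-1, 1], repeat=d))

def h(t):
    # An arbitrary smooth dot-product profile, for illustration only.
    return math.exp(t / d)

def chi(S, x):
    out = 1
    for i in S:
        out *= x[i]
    return out

def K_apply(f):
    # (Kf)(x) = E_y[h(x^T y) f(y)] under the uniform measure on the hypercube.
    return [sum(h(sum(a * b for a, b in zip(x, y))) * f[j]
                for j, y in enumerate(points)) / len(points)
            for x in points]

def eigenvalue(S):
    f = [chi(S, x) for x in points]
    Kf = K_apply(f)
    # K chi_S is proportional to chi_S; read the ratio at x = (1, ..., 1).
    lam = Kf[points.index((1,) * d)]
    assert all(abs(Kf[j] - lam * f[j]) < 1e-12 for j in range(len(points)))
    return lam

# The eigenvalue only depends on the cardinality |S|, not on which bits are in S.
assert abs(eigenvalue((0,)) - eigenvalue((2,))) < 1e-12
assert abs(eigenvalue((0, 1)) - eigenvalue((1, 3))) < 1e-12
```

The change of variables z = x ⊙ y in the proof is what makes the eigenvalue depend only on |S|, and the code reflects this symmetry.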

Lemma 49 can also be shown on the sphere. Its proof showcases the Q_ℓ, which act as normalized Legendre (or Gegenbauer) polynomials. See, e.g., Smola et al. (2000); Bietti et al. (2021); Mei et al. (2021) for details. Note that for common kernel functions on the sphere, such as the ones appearing in the NTK, the ν_k decay polynomially with k (Bach, 2017; Bietti & Mairal, 2019).

The features (60) are rich enough to describe the neural tangent kernels of simple architectures with fully connected or convolutional layers. First, we describe the general form of such NTKs below.

Proposition 51 (Linearization of simple network) . Define a simple neural architecture as

where x^{(q)}(k) = (x_k, x_{k+1}, ..., x_{k+q−1}) is a local patch of size q (with indices defined modulo d), the w_i are weights initialized from a rotation-invariant distribution W, σ : R → R is an activation function, ω ∈ N is the size of the average pooling window, ∆ ∈ N is the pooling stride, and N is the number of channels. The linearization of this network near initialization yields the kernel

Such a linearization can be found, e.g., in Proposition 3 of Misiakiewicz & Mei (2022).

where the coefficients are given by ν h ( d, ℓ ) = ⟨ h, Q ℓ ⟩ L 2 ( τ ) as in (64) .

Note that the eigenvalues ν_h(d, ℓ) are non-increasing with ℓ, and for fixed ℓ and large d they satisfy ν_h(d, ℓ) = Θ_d(d^{−ℓ}). More generally, it can be shown that lim_{d→∞} d^k ν_h(d, k) = (d^k/dt^k) h(t) |_{t=0}.

Proof. The first part is a direct consequence of the previous proposition with ω = 1 and q = ∆ = d. The second part is due to Lemma 50 and (63). For the statements on eigenvalues, see Yang & Salman (2019).

where ν h ( q, ℓ ) are defined by Proposition 52.

The fact that K leaves the spans of the {χ_S : |S| = a, diam(S) = b} invariant, since its eigenvalues only depend on |S| and diam(S), allows changing from the parity basis to the cyclic basis.

When pooling is included in the kernel and ω > 1 in (65), the architecture enforces local translation invariance. As a simple example, consider the setting of global average pooling, ω = d, where strict invariance to translations is enforced and parity functions are projected onto the sum of the elements of their orbit to form the eigenbasis. In this case, K is no longer diagonalized in the parity basis, but it is diagonal in the basis of cyclic parities.


Lemma 54. The operator K associated with a dot-product kernel in the uniform Boolean setting commutes with all the operators T that can be built from bitwise noise, cropping, translations or index flip.

Proof. In the case of a dot-product kernel in the uniform setting, the spaces V d,ℓ are eigenspaces of K . Those spaces are left invariant by all the T defined through usual augmentations, since translations and index-flip operations preserve the cardinality of subsets. As a consequence, K and T can be diagonalized in the same basis, hence they commute.

As a consequence of the previous lemma, the integral operator K associated with the linear model of a fully connected layer commutes with all the operators T defined for the usual augmentations. This is also the case for the convolutional layer, with T deriving from random noise, cropping, or translation.⁶ As a consequence, the interplay between the architecture and the augmentations can be studied easily thanks to Proposition 4.

Example 10 (Interplay between FC kernel and translation augmentations). Recall from Example 9 that, when sampling translations from a localized window, the eigenvalues of T are of the form |p̂(m)|² and typically decay with the frequency index m in ψ_{m,S} = (1/√k_S) ∑_{k∈[k_S]} e^{2iπkm/k_S} χ_{S+k}, for any set S with no periodicity. In contrast, the eigenvalues ν_h(d, |S|) of K for eigenfunctions ψ_{m,S} decay as Θ_d(d^{−|S|}), independently of m. Regularization with parameter λ thus shrinks the eigenvalues to |p̂(m)|² − λ ν_h(d, |S|)^{-1} after pre-training. This most notably eliminates contributions from eigenfunctions ψ_{m,S} where m is small (i.e., near-invariant) but |S| is large. See Figures 3 and 5 for an illustration.

Example 11 (Interplay between kernel for CNN and translation augmentations). Consider the setting of Example 10, with translations sampled from a localized window. For a single-layer CNN with patch width q, eigenfunctions correspond to parity functions χ_S, or cyclic parities ψ_{m,S} with diam(S) ≤ q, with corresponding eigenvalue ν_h(q, ℓ)(q + 1 − diam(S))/d. Here, the eigenfunctions ψ_{m,S} of T for S with diameter larger than q are completely eliminated, regardless of the regularization strength λ. For eigenfunctions ψ_{m,S} with diam(S) ≤ q, the CNN shrinks the contribution to |p̂(m)|² − λ (ν_h(q, ℓ)(q + 1 − diam(S))/d)^{-1}, which shrinks more when diam(S) is larger.


Figure 8. Illustration of the interplay between T and K as a function of λ, where K is the NTK of a 2-layer ReLU network and T performs crops of window size 8 on 12-bit inputs. Here we plot the eigenvalues of three different parity functions in the eigenbasis of both operators. Parity functions with large diameters have smaller eigenvalues for T (here, the parity function with the largest diameter is χ_{1,6}(X) = X₁X₆). Eigenvalues of K, in contrast, bias towards parities supported over fewer bits. Therefore, small regularization biases towards parities with small diameter, whereas added regularization penalizes parities with high cardinality.


In experiments, we also consider a setup with uniform data on the sphere X = S d -1 , with augmentations consisting of permutations, and a dot-product kernel φ ( x ) ⊤ φ ( y ) = h ( x ⊤ y ) . A natural choice of basis functions for L 2 ( X ) in this case are spherical harmonics (Efthimiou & Frye, 2014). These consist of homogeneous harmonic polynomials, and similar to the parity case, these can be grouped by degree, leading to orthogonal spaces V d,ℓ of spherical harmonics of any degree ℓ ≥ 0 , with

It is well-known that for dot-product kernels, K is diagonal in such a basis (Smola et al., 2000; Bach, 2017), with decaying eigenvalues that only depend on the degree ℓ . These are given analogously to the hypercube setting by

where the Q_{ℓ,d} are now Legendre (Gegenbauer) polynomials of degree ℓ, orthogonal w.r.t. a different measure, dτ(t) = (1 − t²)^{(d−3)/2} dt over [−1, 1].

Since the spaces V d,ℓ are left stable by the operator T = A ⊤ A , it is possible to show that there exists a choice of spherical harmonics that also diagonalizes T (see, e.g., Bietti et al., 2021, Lemma 12). We may then see the eigenvalues λ ℓ,j of T in this basis as capturing the invariance of the corresponding harmonic Y ℓ,j , in particular Y ℓ,j is invariant to all augmentations when λ ℓ,j = 1 , and non-invariant or only partially invariant when λ ℓ,j < 1 .

Ordering the λ_{ℓ,j} at fixed ℓ by decreasing j, the interplay between T and K then resembles the one described, e.g., in Figure 3.

Previously, we extensively studied the embedding of H in L², defined as S : H → L²; θ ↦ φ(·)⊤θ. Given samples (ξ_{ij})_{i≤n,j≤m}, all the action on H can be reduced to the span of the φ(ξ_{ij}) (which is known as the representer theorem), and S can be reduced to the embedding Ŝ : H → R^{nm}; θ ↦ ((1/nm) φ(ξ_{ij})⊤θ)_{ij}. This leads to the implementation

Figure 9. Extending Figure 4. The i-th row represents the i-th eigenfunction of T_λ (ordered by decreasing eigenvalues). Regularization λ increases over the columns, as λ ∈ {0, 0.1, 1, 10, 100}. Small λ biases towards functions invariant to the translation augmentation chosen here, whereas large λ biases towards smoother functions on the sphere, corresponding to low-order spherical harmonics in this setting. The last two on the right are artifacts of the instability of the pseudo-inverse of K (leading to the implementation φ⊤K^{-1}φ = 0, while we have defined φ⊤K^{-1}φ = +∞ when Kφ = 0).

where $\hat T \in \R^{nm \times nm}$ is the matrix below, indexing elements of $\R^{nm}$ by $ij$ with $i \in [n]$ and $j \in [m]$,

Note that the matrix $\hat T - I$ can be seen as the adjacency matrix of the graph that connects two augmentations if and only if they come from the same input. Equivalently, $\hat T$ can be seen as a Laplacian matrix. An eigenvector of $\hat T_\lambda$ in $\R^{nm}$ is projected back onto $L^2$ thanks to $S \hat S^{-1} = S \hat S^\top (\hat S \hat S^\top)^{-1} = K_x^\top K^{-1}$, where
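A minimal numpy sketch of this structure (illustrative; it assumes $\hat T = I + A$ with $A$ the adjacency matrix above): $\hat T$ is then block diagonal with one all-ones $m \times m$ block per input, so its spectrum is the eigenvalue $m$ with multiplicity $n$ (constant-over-augmentations, i.e. invariant directions) and $0$ with multiplicity $n(m-1)$.

```python
import numpy as np

n, m = 4, 3                                    # n inputs, m augmentations each
# hatT - I: adjacency matrix connecting distinct augmentations of one input
A = np.kron(np.eye(n), np.ones((m, m)) - np.eye(m))
hatT = np.eye(n * m) + A                       # block diagonal, all-ones blocks

eigvals = np.linalg.eigvalsh(hatT)             # ascending eigenvalues
```

The top eigenvectors are constant over each block, which is exactly the empirical counterpart of the invariant eigenfunctions of $T$.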

Figure 10. VCReg with neural networks. Contour plots of the minimizer $\psi : \X \to \R$ of $\cal L$ for $\beta = 1$ (left) and $\beta = 0$ (right) with a two-layer fully connected neural network, when $k = 1$, $\X = \R^2$, $X$ is distributed according to a half-moon structure, and $\xi = X + \varepsilon$ for a small noise $\varepsilon$. Augmentations are represented as black dots, connected by a line when they derive from the same input $X$.


Experiment details for Figure 5.

We consider data uniformly distributed on the sphere $\mathbb{S}^{d-1}$ with $d = 8$, augmentations consisting of cyclic shifts of $-1, 0, 1$, and a dot-product kernel of the form $k(x, y) = (1 + x^\top y)\,\kappa(x^\top y)$, with $\kappa(u) = 1 - \arccos(u)/\pi$.

The target functions f ∗ ℓ are given by:

where $Q_{\ell,d}$ are the Gegenbauer polynomials introduced in Appendix D.5. Note that $f_3^*$ is a cyclic-invariant spherical harmonic of degree 3, while $f_1^*$ is a non-invariant spherical harmonic of degree 1 (though it has some local shift stability). Labels on the downstream tasks are generated from the $f_\ell^*$ without noise.

Figure 5 shows the downstream relative excess risk $\norm{\hat f_n - f_\ell^*}^2_{L^2} / \norm{f_\ell^*}^2_{L^2}$, approximated over 1500 test datapoints, as a function of the regularization parameter $\lambda$ used in pretraining. We use the same $n = 300$ samples for pretraining and downstream linear prediction. Pretraining uses all 3 augmentations of each sample, with a representation dimension $k = 20$. The downstream problem is solved with kernel ridge regression using the kernel induced by pretraining, and the ridge parameter is tuned on test samples to avoid dealing with model selection issues.
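The data pipeline described above can be sketched as follows (an illustrative reimplementation, not the paper's code); note that $k(x, x) = 2$ since $\arccos(1) = 0$, and cyclic shifts preserve the sphere.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 300

def sample_sphere(n, d, rng):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def augment(x):
    # the three views of each point: cyclic shifts by -1, 0, +1
    return np.stack([np.roll(x, s, axis=-1) for s in (-1, 0, 1)])

def kernel(x, y):
    # k(x, y) = (1 + x^T y) kappa(x^T y),  kappa(u) = 1 - arccos(u)/pi
    u = np.clip(x @ y.T, -1.0, 1.0)
    return (1 + u) * (1 - np.arccos(u) / np.pi)

X = sample_sphere(n, d, rng)
views = augment(X)            # shape (3, n, d)
K = kernel(X, X)              # Gram matrix of the raw inputs
```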

$$ {\cal R}(f) = \E_{(X, Y)\sim\rho}[\ell(f(X), Y)], $$

$$ {\cal F} = \brace{x\mapsto w^\top \psi(x)\midvert w\in\R^k}. $$

$$ \Psi = \brace{x\mapsto f_\theta(x) \midvert f_\theta(x) = \scap{\theta}{\phi(x)}_{\cal H}, \theta\in{\cal H}}. $$

$$ T_\lambda = (1-\beta) I + \beta T - \lambda K^{-1}. $$

$$ \E_{{\cal D}_n}[{\cal L}(S\Theta_n)] - {\cal L}(S\Theta) \leq \frac{12\kappa^2 k}{\lambda\sqrt{n}} \paren{1 + \frac{\kappa^2 k}{\lambda}}, $$

$$ {\cal L}(\psi; \beta) = 2(\beta - 1)\E_{\xi}[\norm{\psi(\xi)}^2] - 2\beta\E_X\E_{\xi, \xi'}\bracket{\psi(\xi)^\top\psi(\xi')\midvert X} + \E_{\xi, \xi'}\bracket{(\psi(\xi')^\top\psi(\xi))^2} + k. $$

$$ \Theta_* = (\theta_i)_{i\in[k]}, \qquad\text{with}\qquad \theta_i = \sqrt{\max(\lambda_i, 0)}\, \Sigma^{-1/2}u_i. $$

$$ Kf(\xi) = \E_{\xi'}\bracket{\phi(\xi)^\top\phi(\xi')f(\xi')}. $$

$$ \Pi_{\hat{\cal F}}f^* = S_\psi \E[\psi(X)\psi(X)^\top]^{-1} \E[Y\psi(X)]. $$

$$ w_n \in \argmin_{w\in\R^k} \sum_{i=1}^n \norm{w^\top \psi(X_i) - Y_i}^2 = \bracket{\frac{1}{n}\sum_{i=1}^n\psi(X_i)\psi(X_i)^\top}^{-1} \frac{1}{n} \sum_{i=1}^n Y_i\psi(X_i). $$
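As a sanity check of this closed form, one can compare it with a generic least-squares solver on synthetic features (an illustrative sketch; `Psi` plays the role of the matrix of representations $\psi(X_i)$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5
Psi = rng.standard_normal((n, k))             # rows play the role of psi(X_i)
w_true = rng.standard_normal(k)
Y = Psi @ w_true + 0.01 * rng.standard_normal(n)

# closed form: (1/n sum psi psi^T)^{-1} (1/n sum Y_i psi(X_i))
Sigma_hat = Psi.T @ Psi / n
w_closed = np.linalg.solve(Sigma_hat, Psi.T @ Y / n)

# generic least-squares solver for comparison
w_lstsq, *_ = np.linalg.lstsq(Psi, Y, rcond=None)
```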

$$ S_\psi:\R^k \to L^2(\rho_\X); w\to w^\top \psi, \qquad \hat S_\psi:\R^k \to \ell^2(n); w \to (w^\top \psi(X_i))_{i\in[n]}, $$

$$ \norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top (I - \Pi_{\hat{\cal F}}) f^*}_{L^2(\rho_\X)} \leq \min\brace{\frac{1}{1-t}, 1 + t\cdot\frac{M^2 + \gamma}{\gamma}}\norm{\Sigma_\gamma^{-1/2} \hat S^\top (I - \Pi_{\hat{\cal F}})f^*}. $$

$$ \Pbb\paren{\norm{\Sigma_\gamma^{-1/2} \hat S^\top (I - \Pi_{\hat{\cal F}})f^*} \geq t} \leq 2\exp\paren{\frac{-nt^2}{a(b + 2M\gamma^{-1/2}t/3)}} $$

$$ Z_i = (I-\Pi_{\hat{\cal F}})f^*(X_i) (\Sigma+\gamma)^{-1/2}\psi(X_i) \in \R^k. $$

$$ \trace\paren{\Sigma(\Sigma+\gamma)^{-1}} \leq k,\qquad M \leq \lambda^{-1}k \sup \norm{\phi},\qquad \scap{\Pi_{\hat{\cal F}}f^*}{\Sigma(\Sigma + \gamma)^{-1}\Pi_{\hat{\cal F}}f^*}_{L^2(\rho_\X)} \leq \norm{f^*}_{L^2(\rho_\X)}^2. $$

$$ \norm{f^*}_{L^2(\rho_\X)} \leq \norm{f^*}_{L^\infty(\rho_\X)} \leq \sigma, \qquad \epsilon^2 \leq \sigma^2,\qquad \sigma^2 = \sup_x\E\bracket{Y^2 \midvert X=x}. $$

$$ \E_{(X_i, Y_i)}[{\cal R}(f_n) - {\cal R}(f^*)] \leq \frac{2k_e\epsilon^2}{n} + \frac{8M^2\log(n)^{1+\delta}}{n}\norm{f^*}_{L^2(\rho_\X)} + \frac{64 k a}{n} + 2\norm{(I-\Pi_{\hat{\cal F}})\Pi_{\cal F}f^*}^2 + \norm{(I-\Pi_{\cal F})f^*}^2 $$

$$ \sum_{i\in[k]} \lambda_i^2 \norm{(\Pi^{(\mu_\Xi)}_{\cal F} - \Pi^{(\mu_\Xi)}_{\hat{\cal F}})f_i}_{L^2(\mu_\Xi)}^2 -\sum_{k< i \leq k_\lambda} \lambda_i^2 \norm{\Pi^{(\mu_\Xi)}_{\hat{\cal F}}f_i}_{L^2(\mu_\Xi)}^2 \leq {\cal L}(\hat\Theta;\lambda) - {\cal L}(\Theta_*;\lambda), $$

$$ \norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)})\Pi_{{\cal F}_l}^{(\rho_\X)} f^*}_{L^2(\rho_\X)} \leq \sigma(l) + \zeta\paren{\sum_{i\leq l} \abs{\scap{f^*}{f_i}_{L^2(\mu_\Xi)}} \norm{(\Pi_{{\cal F}_l}^{(\mu_\Xi)} - \Pi_{\hat{\cal F}}^{(\mu_\Xi)})f_i}_{L^2(\mu_\Xi)}}. $$

$$ \sum_{i\leq k} \norm{(\Pi_{\cal F} - \Pi_{\hat{\cal F}})f_i}^2 = k-k'+\sum_{i > k}\norm{\Pi_{\hat{\cal F}}f_i}^2 \leq k. $$

$$ {\cal R}(\zeta_n) - {\cal R}(\zeta_*) \leq 4 \E_{(Z_i), (\sigma_i)}\bracket{\sup_{\zeta}\frac{1}{n}\sum_{i=1}^n \sigma_i\ell(\zeta, Z_i)} $$

$$ X_i \sim \mu_\X^{\otimes 2},\qquad \xi_{ij} \sim \mu\vert_{X_i}^{\otimes m}. $$

$$ \norm{\nabla_\Lambda \ell} \leq 2\kappa^2 + \kappa^4 \sup\norm{\Lambda}, \qquad\text{and}\qquad \E[\norm{\nabla_\Lambda \ell-\nabla {\cal L}}^2] \leq (\sigma_X^2 + m^{-1}\sigma_\xi^2)(1+\sup\norm{\Lambda}^2), $$

$$ \chi_S( x) = \prod_{i \in S} x_i. $$

$$ B^p_y(x) = x \odot y, \qquad y \sim \operatorname{Ber}(\{-1,+1\}, p)^{\otimes d}, $$

$$ T\chi_S = \abs{1 - 2p}^{\card{S}} \chi_S. $$
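This eigenvalue computation can be checked by brute force for the averaging over a single mask: since the $y_i$ are independent with $\E[y_i] = 1-2p$, averaging $\chi_S(x \odot y)$ over $y$ yields $(1-2p)^{|S|}\chi_S(x)$. A small sketch (hypothetical parameters $d = 4$, $p = 0.3$, $S = \{0, 2\}$):

```python
import numpy as np
from itertools import product

d, p = 4, 0.3
S = (0, 2)                                     # chi_S(x) = x_0 * x_2

def chi(S, x):
    return np.prod([x[i] for i in S])

def averaged_chi(S, x, p):
    # E_y[chi_S(x ⊙ y)] with independent y_i = +1 w.p. 1-p and -1 w.p. p
    total = 0.0
    for y in product([1, -1], repeat=d):
        prob = np.prod([(1 - p) if yi == 1 else p for yi in y])
        total += prob * chi(S, np.array(x) * np.array(y))
    return total

x = np.array([1, -1, -1, 1])
lhs = averaged_chi(S, x, p)
rhs = (1 - 2 * p) ** len(S) * chi(S, x)
```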

$$ [M^w_{a}(x)]_i = \begin{cases} x_i & \text{if } i \in [a, a+w) \\ \operatorname{Ber}(\{-1,+1\}, 0.5) & \text{otherwise} \end{cases}, \qquad a \sim \uniform{[d]}, $$

$$ T\chi_S = \frac{\max\brace{1+w-\diam(S), 0}^2}{d^2}\cdot \chi_S \qquad\text{with}\qquad \diam(S) = \min \brace{v\midvert v, a \in [d]; S \subseteq [a, a + v)}. $$
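The combinatorial factor can be verified by brute force: with cyclic windows $[a, a+w)$ and $w \leq d/2$, the number of offsets $a$ with $S \subseteq [a, a+w)$ equals $\max(1 + w - \diam(S), 0)$. A sketch (assuming the cyclic convention; the parameters are illustrative):

```python
from itertools import combinations

d, w = 10, 5                                   # cyclic windows [a, a+w), w <= d/2

def window(a, v, d):
    return {(a + i) % d for i in range(v)}

def diam(S, d):
    # smallest v such that S fits in some cyclic interval [a, a+v)
    return min(v for v in range(1, d + 1)
               for a in range(d) if S <= window(a, v, d))

for S in combinations(range(d), 3):
    S = set(S)
    count = sum(S <= window(a, w, d) for a in range(d))
    assert count == max(1 + w - diam(S, d), 0)
```

For windows much wider than half the circle, a set can be covered along more than one arc and the count can exceed this formula, which is why the sketch restricts to $w \leq d/2$.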

$$ T\chi_S = \frac{1}{m^2d^2}\paren{1 + v - \diam_{e_1}(S)}_+^2\cdot \paren{1 + w - \diam_{e_2}(S)}_+^2\, \chi_S, $$

$$ [R(x)]_i = x_{-i}. $$

$$ T = (1-2p + 2p^2) I + 2p(1-p) J, $$

$$ \psi_{m,S} = \frac{1}{\sqrt{k_S}}\sum_{k\in[k_S]} e^{2i\pi k m / k_S} \chi_{S+k} = \frac{\sqrt{k_S}}{d}\sum_{k\in[d]} e^{2i\pi k \frac{m d}{k_S} / d} \chi_{S+k} \qquad\text{where}\qquad k_S = \card{\orb(S)}, $$

$$ \scap{f}{T g}_{L^2(\mu_\Xi)} = \E_X\E_{\xi, \xi'}[ f(\xi) g(\xi') \midvert X] = \scap{Af}{Ag}_{L^2(\mu_\X)}. $$

$$ T\psi_{m, S} = \frac{d^2}{k_S^2} \abs{\hat{p}\paren{\frac{m d}{k_S}}}^2 \psi_{m, S}, $$

$$ \hat{p}(\omega) = \sum_{a\in[d]} p(a) \exp\paren{\frac{-2i\pi a\omega }{d}} $$

$$ \phi:\X\mapsto\R^{2^d}; x\mapsto(e_S\chi_S(x))_{S\subseteq[d]}. $$

$$ \sum_{S \subseteq [d], |S| = \ell} \chi_S(x) \chi_S(y) = {d \choose \ell} Q_{\ell, d}(\scap{x}{y}), $$
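One way to check this addition formula numerically is to verify its zonal nature: the level-$\ell$ sum equals the elementary symmetric polynomial $e_\ell(x \odot y)$ and hence depends on $(x, y)$ only through $\scap{x}{y}$. A brute-force sketch:

```python
import numpy as np
from itertools import combinations

d, ell = 6, 2

def level_sum(x, y, ell):
    # sum over |S| = ell of chi_S(x) chi_S(y) = e_ell(x ⊙ y),
    # the elementary symmetric polynomial of the products x_i y_i
    return sum(np.prod([x[i] * y[i] for i in S])
               for S in combinations(range(len(x)), ell))

rng = np.random.default_rng(0)
vals = {}
for _ in range(200):
    x = rng.choice([-1, 1], size=d)
    y = rng.choice([-1, 1], size=d)
    vals.setdefault(int(x @ y), set()).add(int(level_sum(x, y, ell)))

# zonal: the level sum is a function of <x, y> alone
assert all(len(v) == 1 for v in vals.values())
```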

$$ \nu_\ell = \scap{h}{Q_\ell}_{L^2(\tau)}. $$

$$ f(x) = \sqrt{\frac{\Delta}{N\omega d}} \sum_{i\in[N]} \sum_{k\in[d/\Delta]} a_{ik} \sum_{s\in[\omega]} \sigma\paren{\scap{w_i}{x_{(k\Delta + s)}^{(q)}}}, $$

$$ h(\scap{u}{v} / q) = \E_{w\sim\cal W}\bracket{\sigma(\scap{u}{w} / \sqrt{q}) \sigma(\scap{v}{w} / \sqrt{q}) + \sigma'(\scap{u}{w} / \sqrt{q})\sigma'(\scap{v}{w} / \sqrt{q})\cdot\scap{u}{v} / q}. $$

$$ {\cal L}(\psi) = \beta\E_X\E_{\xi, \xi'} \bracket{\norm{\psi(\xi) - \psi(\xi')}^2\midvert X} + \norm{\E_{\xi}[\psi(\xi)\psi(\xi)^\top] - I}^2_2, $$

$$ \begin{aligned} \norm{\E[\psi(\xi)\psi(\xi)^\top] - I}^2 &= \trace\paren{(\E[\psi(\xi)\psi(\xi)^\top] - I)(\E[\psi(\xi')\psi(\xi')^\top] - I)} \\ &= \E_{\xi, \xi'}\bracket{\trace\paren{\psi(\xi)\psi(\xi)^\top \psi(\xi')\psi(\xi')^\top}} - 2\E_\xi\bracket{\trace\paren{\psi(\xi)\psi(\xi)^\top}} + \trace(I) \\ &= \E_{\xi, \xi'}\bracket{(\psi(\xi')^\top\psi(\xi))^2} - 2\E_\xi\bracket{\norm{\psi(\xi)}^2} + k. \end{aligned} $$

$$ \begin{aligned} \omega(\psi) &= \E_X[\E_{\xi,\xi'}\bracket{\norm{\psi(\xi) - \psi(\xi')}^2\midvert X}] \\ &= \E_X[\E_{\xi,\xi'}\bracket{\norm{\psi(\xi) - \E\bracket{\psi(\xi)\midvert X} + \E\bracket{\psi(\xi)\midvert X} - \psi(\xi')}^2\midvert X}] \\ &= \E_X[\E_{\xi,\xi'}\bracket{\norm{\psi(\xi) - \E\bracket{\psi(\xi)\midvert X}}^2 + \norm{\E\bracket{\psi(\xi)\midvert X} - \psi(\xi')}^2\midvert X}] \\ &= 2\E_X[\E_{\xi}\bracket{\norm{\psi(\xi) - \E\bracket{\psi(\xi)\midvert X}}^2\midvert X}] \\ &= 2\min_{\psi_0:\X\to\R}\E_X[\E_{\xi}\bracket{\norm{\psi(\xi) - \psi_0(X)}^2\midvert X}] \\ &\leq 2\E_X[\E_{\xi}\bracket{\norm{\psi(\xi)}^2\midvert X}] = 2\E_{\xi}\bracket{\norm{\psi(\xi)}^2} = 2\norm{\psi}^2_{L^2(\mu_\Xi)} \end{aligned} $$

$$ \begin{aligned} 2f^\top (I - T) f &= \E_X\E\bracket{\norm{f(\xi) - f(\xi')}^2\midvert X} = \E_X\E\bracket{\norm{f(\xi)}^2 + \norm{f(\xi')}^2 - 2\scap{f(\xi)}{f(\xi')}\midvert X} \\ &= 2f^\top f - 2\E_X\E\bracket{\scap{f(\xi)}{f(\xi')}\midvert X}. \end{aligned} $$

$$ \begin{aligned} f^\top T g &= \E_X\E\bracket{\scap{f(\xi)}{g(\xi')}\midvert X} = \int \scap{f(\xi)}{g(\xi')} p\paren{\xi \midvert x} p\paren{\xi'\midvert x} \diff \xi' \diff \xi\, \mu_\X(\diff x) \\ &= \int \mu_\Xi(\diff \xi)\, \scap{f(\xi)}{\int \mu_\Xi(\diff \xi')\, g(\xi') \frac{\int \mu_\X(\diff x) p\paren{\xi \midvert x} p\paren{\xi'\midvert x}}{p(\xi)p(\xi')}}. \end{aligned} $$

$$ \begin{aligned} {\cal L}(\psi) &= 2\beta \sum_{i\in[k]} \scap{\psi_i}{(I - T)\psi_i} + \norm{\E_\xi[\psi(\xi)\psi(\xi)^\top] - I}^2 \\ &= 2\beta \sum_{i\in[k]} \scap{e_i^\top\psi}{(I - T)\psi^\top e_i} + \norm{\E_\xi[\sum_{i,j\in[k]}e_i^\top\psi(\xi)\psi(\xi)^\top e_j e_i e_j^\top] - I}^2 \\ &= 2\beta \sum_{i\in[k]} e_i \tilde\psi(I - T)\tilde\psi^\top e_i + \norm{\sum_{i, j\in[k]} e_i^\top\tilde\psi\tilde\psi^\top e_j e_i e_j^\top - I}^2 \\ &= 2\beta \trace\paren{\tilde\psi (I - T)\tilde\psi^\top} + \norm{\tilde\psi\tilde\psi^\top - I}^2 = 2\beta \trace\paren{\tilde\psi (I - T)\tilde\psi^\top} + \trace\paren{\paren{\tilde\psi\tilde\psi^\top - I}^2} \\ &= \trace\paren{2\beta\tilde\psi (I - T)\tilde\psi^\top + \tilde\psi\tilde\psi^\top \tilde\psi\tilde\psi^\top - 2 \tilde\psi\tilde\psi^\top + I} \\ &= \trace_{L^2(\mu_\Xi)}\paren{\tilde\psi^\top \tilde\psi\tilde\psi^\top\tilde\psi + (2\beta (I - T) - 2 I)\tilde\psi^\top\tilde\psi} + k \\ &= \trace_{L^2(\mu_\Xi)}\paren{\tilde\psi^\top \tilde\psi\tilde\psi^\top\tilde\psi - 2 T_\beta\tilde\psi^\top\tilde\psi} + k = \trace_{L^2(\mu_\Xi)}\paren{(\tilde\psi^\top \tilde\psi - T_\beta)^2 - T_\beta^2} + k. \end{aligned} $$

$$ \begin{aligned} \trace\paren{(B-A)^2 - A^2} &= \trace\paren{B^2 - 2B^{1/2}AB^{1/2}} = \trace\paren{B^2 - 2B^{1/2}A_+ B^{1/2}} + 2\trace\paren{B^{1/2}A_-B^{1/2}} \\ &\geq \trace\paren{B^2 - 2B^{1/2}A_+ B^{1/2}}. \end{aligned} $$

$$ \begin{aligned} \trace\paren{(B-A)^2 - A^2} &\geq \sum_{i=1}^k \trace\paren{B_i^2} - 2\trace\paren{B_iA_+} = \sum_{i=1}^k \norm{B_i}^2_{\op} - 2\norm{B_iA_+}_{\op} \\ &\geq \sum_{i=1}^k \norm{B_i}^2_{\op} - 2\norm{B_i}_{\op}\norm{\Pi_{B_i} A_+}_{\op} \geq \sum_{i=1}^k \norm{B_i}^2_{\op} - 2\norm{B_i}_{\op}\norm{\prod_{j< i}(I - \Pi_{B_j}) A_+}_{\op} \\ &= \sum_{i=1}^k \paren{\norm{B_i}_{\op} - \norm{\prod_{j< i}(I - \Pi_{B_j}) A_+}_{\op}}^2 - \norm{\prod_{j< i}(I - \Pi_{B_j}) A_+}_{\op}^2 \\ &\geq -\sum_{i=1}^k \norm{\prod_{j< i}(I - \Pi_{B_j}) A_+}_{\op}^2 \geq -\sum_{i=1}^k \sigma_i(A_+) \end{aligned} $$

$$ \begin{aligned} \E_{\xi, \xi'}\bracket{(\psi(\xi)^\top\psi(\xi'))^2} &= \E_{\xi, \xi'}\bracket{\psi(\xi)^\top\psi(\xi')\psi(\xi')^\top\psi(\xi)} = \E_{\xi, \xi'}\bracket{\trace\paren{\psi(\xi')\psi(\xi')^\top\psi(\xi)\psi(\xi)^\top}} \\ &= \trace\paren{\E_{\xi}[\psi(\xi)\psi(\xi)^\top]\E_{\xi'}[\psi(\xi')\psi(\xi')^\top]} = \trace\paren{\E_{\xi}[\psi(\xi)\psi(\xi)^\top]^2}. \end{aligned} $$

$$ A = \Sigma^{-1/2}((1-\beta)\Sigma + \beta\Sigma_X - \lambda I)\Sigma^{-1/2},\quad \Sigma = \E_{\xi}\bracket{\phi(\xi)\phi(\xi)^\top},\quad \Sigma_X = \E_X[\E_{\xi, \xi'}\bracket{\phi(\xi)\phi(\xi')^\top \midvert X}]. $$

$$ \begin{aligned} &{\cal L}(\Theta\phi) + 2\lambda\trace(\Theta^\top \Theta) - k \\ &= \trace\big(2(\beta - 1) \Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2} - 2\beta\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2}\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2} \\ &\qquad\qquad\qquad\qquad+ \Sigma^{1/2}\Theta^\top \Theta\Sigma\Theta^\top \Theta\Sigma^{1/2} + 2\lambda\Sigma^{-1}\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2}\big) \\ &= \trace\paren{\paren{\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2} + (\beta-1)I - \beta\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2} + \lambda\Sigma^{-1}}^2 - \paren{(\beta-1)I - \beta\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2} + \lambda\Sigma^{-1}}^2} \\ &= \trace\paren{\paren{\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2} - \Sigma^{-1/2}((1-\beta)\Sigma + \beta \Sigma_X - \lambda I)\Sigma^{-1/2}}^2 - \paren{\Sigma^{-1/2}((1-\beta)\Sigma + \beta \Sigma_X - \lambda I)\Sigma^{-1/2}}^2}. \end{aligned} $$

$$ \begin{aligned} \scap{f_{\theta_i}}{f_{\theta_j}}_{L^2(\mu_\Xi)} &= \sqrt{\max(\lambda_i, 0)\max(\lambda_j, 0)}\, \E[\scap{\Sigma^{-1/2}u_i}{\phi(\xi)}\scap{\Sigma^{-1/2}u_j}{\phi(\xi)}] \\ &= \sqrt{\max(\lambda_i, 0)\max(\lambda_j, 0)}\, \E[u_i^\top\Sigma^{-1/2}\phi(\xi)\phi(\xi)^\top\Sigma^{-1/2}u_j] \\ &= \sqrt{\max(\lambda_i, 0)\max(\lambda_j, 0)}\, u_i^\top\Sigma^{-1/2}\E[\phi(\xi)\phi(\xi)^\top]\Sigma^{-1/2}u_j \\ &= \sqrt{\max(\lambda_i, 0)\max(\lambda_j, 0)}\, u_i^\top\Sigma^{-1/2}\Sigma\Sigma^{-1/2}u_j \\ &= \sqrt{\max(\lambda_i, 0)\max(\lambda_j, 0)}\, u_i^\top u_j = \sqrt{\max(\lambda_i, 0)\max(\lambda_j, 0)}\, \delta_{ij}. \end{aligned} $$

$$ \begin{aligned} \scap{\theta}{S^\top S\theta}_{\cal H} &= \scap{S\theta}{S\theta}_{L^2(\mu_\Xi)} = \E_\xi[S\theta(\xi)^2] \\ &= \E_\xi[\scap{\theta}{\phi(\xi)}^2] = \E_\xi[\scap{\theta}{\phi(\xi)\otimes\phi(\xi)\theta}] \\ &= \scap{\theta}{\E[\phi(\xi)\otimes\phi(\xi)]\theta} = \scap{\theta}{\Sigma\theta}. \end{aligned} $$

$$ \begin{aligned} \norm{f_\theta}_{L^2(\rho_\X)}^2 &= \norm{S_X\theta}_{L^2(\rho_\X)}^2 = \norm{\Sigma_{\rho_\X}^{1/2}\theta}_{\cal H}^2 = \norm{\Sigma_{\rho_\X}^{1/2}\Sigma_{\mu_\Xi}^{-1/2} \Sigma_{\mu_\Xi}^{1/2}\theta}_{\cal H}^2 \\ &\leq \norm{\Sigma_{\mu_\Xi}^{-1/2}\Sigma_{\rho_\X}\Sigma_{\mu_\Xi}^{-1/2}}_{\op} \norm{\Sigma_{\mu_\Xi}^{1/2}\theta}_{\cal H}^2 = \norm{\Sigma_{\mu_\Xi}^{-1/2}\Sigma_{\rho_\X}\Sigma_{\mu_\Xi}^{-1/2}}_{\op} \norm{f_\theta}_{L^2(\mu_\Xi)}^2. \end{aligned} $$

$$ \begin{aligned} \E_{X\sim\rho_\X}[\norm{f(X) - wf_i(X)}^2] &= \E_{X\sim\rho_\X}[\norm{g(\psi(X)) - wf_i(X)}^2] = \E_{Z\sim\psi_\#\rho_\X}[\norm{g(Z) - w^\top Z}^2] \\ &= \E_{Z\sim\mu_\Xi}[\norm{f(X) - w^\top f_i(X)}^2]. \end{aligned} $$

$$ \begin{aligned} \norm{(I - \Pi_{\hat{\cal F}})f^*}^2_{L^2(\rho_\X)} &= \norm{(I - \Pi_{\hat{\cal F}})\Pi_{\cal F}f^* + (I - \Pi_{\hat{\cal F}})(I-\Pi_{\cal F})f^*}^2_{L^2(\rho_\X)} \\ &\leq 2\norm{(I - \Pi_{\hat{\cal F}})\Pi_{\cal F}f^*}^2_{L^2(\rho_\X)} + \norm{(I - \Pi_{\hat{\cal F}})(I-\Pi_{\cal F})f^*}^2_{L^2(\rho_\X)} \\ &\leq 2\norm{(I - \Pi_{\hat{\cal F}})\Pi_{\cal F}f^*}^2_{L^2(\rho_\X)} + \norm{(I-\Pi_{\cal F})f^*}^2_{L^2(\rho_\X)} \end{aligned} $$

$$ \begin{aligned} &S_\psi:L^2(\rho_\X)\to \R^k;\ f \mapsto \E_{\rho_\X}[f(X)\psi(X)], \qquad \hat S_\psi:\ell^2(n)\to \R^k;\ (Y_i)_{i\in[n]} \mapsto \frac{1}{n}\sum_{i\in[n]} Y_i\psi(X_i), \\ &\Sigma_\psi = S_\psi S_\psi^\top = \E_{\rho_\X}[\psi(X)\psi(X)^\top], \qquad \hat \Sigma_\psi = \hat S_\psi \hat S_\psi^\top = \frac{1}{n}\sum_{i\in[n]}\psi(X_i)\psi(X_i)^\top, \end{aligned} $$

$$ \begin{aligned} \hat\Sigma_\gamma^{-1}\hat S^\top f &=(\hat\Sigma_\gamma^{-1} - \Sigma_\gamma^{-1}) \hat S^\top f + \Sigma_\gamma^{-1} \hat S^\top f \\ &=\hat\Sigma_\gamma^{-1} (\Sigma_\gamma - \hat\Sigma_\gamma) \Sigma_\gamma^{-1} \hat S^\top f + \Sigma_\gamma^{-1} \hat S^\top f \\ &=\hat\Sigma_\gamma^{-1} (\Sigma- \hat\Sigma) \Sigma_\gamma^{-1} \hat S^\top f + \Sigma_\gamma^{-1} \hat S^\top f \\ &=\Sigma_\gamma^{-1/2}\paren{\Sigma_\gamma^{1/2}\hat\Sigma_\gamma^{-1}\Sigma_\gamma^{1/2}\Sigma_\gamma^{-1/2} (\Sigma- \hat\Sigma) \Sigma_\gamma^{-1/2} + I} \Sigma_\gamma^{-1/2} \hat S^\top f \end{aligned} $$

$$ \begin{aligned} \norm{A^{-1/2}(A - \hat A) A^{-1/2}}_{\op} \leq t &\quad\Leftrightarrow\quad -t I \preceq A^{-1/2}(\hat A - A) A^{-1/2} \preceq t I \\ &\quad\Leftrightarrow\quad -t A \preceq \hat A - A \preceq t A \\ &\quad\Leftrightarrow\quad (1-t) A \preceq \hat A \preceq (1+t) A \\ &\quad\Leftrightarrow\quad (1+t)^{-1} A^{-1} \preceq \hat A^{-1} \preceq (1-t)^{-1} A^{-1} \\ &\quad\Leftrightarrow\quad (1+t)^{-1} \preceq A^{1/2}\hat A^{-1}A^{1/2} \preceq (1-t)^{-1}. \end{aligned} $$
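These equivalences are easy to sanity-check numerically with a random well-conditioned $A$ and a small symmetric perturbation $\hat A$ (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 6
B = rng.standard_normal((k, k))
A = B @ B.T + np.eye(k)                        # A is positive definite
E = rng.standard_normal((k, k))
Ahat = A + 0.05 * (E + E.T) / 2                # small symmetric perturbation

w, V = np.linalg.eigh(A)
A_sqrt = (V * np.sqrt(w)) @ V.T                # A^{1/2}
A_isqrt = (V / np.sqrt(w)) @ V.T               # A^{-1/2}

t = np.linalg.norm(A_isqrt @ (A - Ahat) @ A_isqrt, ord=2)
M = A_sqrt @ np.linalg.inv(Ahat) @ A_sqrt      # A^{1/2} Ahat^{-1} A^{1/2}
eigs = np.linalg.eigvalsh((M + M.T) / 2)
```

The spectrum of `M` should lie in $[(1+t)^{-1}, (1-t)^{-1}]$ whenever $t < 1$.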

$$ \begin{aligned} \abs{(I - \Pi_{\hat{\cal F}})f^*(X_i)} &\leq \abs{f^*(X_i)} + \abs{\Pi_{\hat{\cal F}}f^*(X_i)} = \abs{f^*(X_i)} + \abs{\scap{SS^{-1}\Pi_{\hat{\cal F}}f^*}{\phi(X_i)}} \\ &\leq \abs{f^*(X_i)} + \norm{SS^{-1}}_{\op}\norm{\Pi_{\hat{\cal F}}f^*}_{L^2}\norm{\phi(X_i)} \leq \norm{f^*}_{L^\infty} + M\norm{f^*}_{L^2}. \end{aligned} $$

$$ \begin{aligned} &\E\bracket{\min\brace{\frac{1}{1-X}, 1 + \frac{(M^2 + \gamma)X}{\gamma}}^2 Y^2} \\ &= \int_{t\in(0,\sup (1+\gamma^{-1}(M^2 + \gamma)X)^2Y^2)} \Pbb\paren{\min\brace{\frac{1}{1-X}, 1 + \frac{(M^2 + \gamma)X}{\gamma}}^2 Y^2 > t} \diff t \\ &\leq \int \inf_{s} \Pbb\paren{\min\brace{\frac{1}{1-X}, 1 + \frac{(M^2 + \gamma)X}{\gamma}}^2 > 1+s} + \Pbb(Y^2 > t/(1+s)) \diff t. \end{aligned} $$

$$ \begin{aligned} \E[Y^2] &= \int_{t>0} \Pbb(Y^2> t)\diff t \leq \int_{t>0} 2\exp\paren{\frac{-nt}{a(b + 2M\gamma^{-1/2}t^{1/2}/3)}}\diff t \\ &\leq 4\int_{t>0} \exp\paren{\frac{-nt}{2ab}} + \exp\paren{\frac{-nt^{1/2}}{4 aM\gamma^{-1/2}/3}}\diff t \\ &= 8 ab n^{-1} + 256 a^2M^2\gamma^{-1} n^{-2} / 9 \end{aligned} $$

$$ \begin{aligned} {\cal L}(\hat\Theta; \lambda) - {\cal L}(\Theta_*;\lambda) &\geq \sum_{i\leq k} \lambda_i^2 \paren{f_i^\top f_i - f_i^\top \Pi_{\hat{\cal F}}f_i} - \sum_{k<i\leq k_\lambda} \lambda_i^2 f_i^\top \Pi_{\hat{\cal F}}f_i \\ &= \sum_{i\leq k} \lambda_i^2 \scap{f_i}{(I - \Pi_{\hat{\cal F}})f_i} - \sum_{k<i\leq k_\lambda} \lambda_i^2 \norm{\Pi_{\hat{\cal F}}f_i}^2 \\ &= \sum_{i\leq k} \lambda_i^2 \norm{(\Pi_{\cal F} - \Pi_{\hat{\cal F}})f_i}^2 - \sum_{k<i\leq k_\lambda} \lambda_i^2 \norm{\Pi_{\hat{\cal F}}f_i}^2. \end{aligned} $$

$$ \norm{U(I-V)}^2 = \norm{U}^2 - \norm{UV}^2 = k - \norm{UV}^2 = k - \norm{(UV)^\top}^2 = k - k' + k' - \norm{VU}^2 = k - k' + \norm{V(I-U)}^2. $$

$$ \begin{aligned} \max_x \quad& \sum_{i\leq k} c_i x_i \\ \text{subject to}\quad & \sum_{i\leq k} \lambda_i^2 x_i^2 - \sum_{k< i\leq k_\lambda} \lambda_i^2 x_i^2 \leq \epsilon \\ &\sum_{i\leq k} x_i^2 = k-k' + \sum_{k< i\leq k_\lambda}x_i^2 \leq k \end{aligned} $$

$$ \begin{aligned} \norm{(I - \Pi_{\hat{\cal F}_k})f^*}^2 &= \norm{\Pi_{\hat{\cal F}_{k_\lambda}}(I - \Pi_{\hat{\cal F}_k})f^*}^2 + \norm{(I - \Pi_{\hat{\cal F}_{k_\lambda}})(I - \Pi_{\hat{\cal F}_k})f^*}^2 \\ &= \norm{(\Pi_{\hat{\cal F}_{k_\lambda}} - \Pi_{\hat{\cal F}_k})f^*}^2 + \norm{(I - \Pi_{\hat{\cal F}_{k_\lambda}})f^*}^2 \\ &\leq \norm{(\Pi_{\hat{\cal F}_{k_\lambda}} - \Pi_{\hat{\cal F}_k})f^*}^2 + 2\norm{(I - \Pi_{\hat{\cal F}_{k_\lambda}})\Pi_{\cal F}f^*}^2 + \norm{(I - \Pi_{\cal F})f^*}^2 \\ &\leq \abs{k-k_\lambda}\norm{f^*}^2 + 2\norm{(I - \Pi_{\hat{\cal F}_{k_\lambda}})\Pi_{\cal F}f^*}^2. \end{aligned} $$

$$ \begin{aligned} \trace\paren{K} &= \trace\paren{SS^\top} = \trace\paren{S^\top S} = \trace\paren{\E[\phi(X)\phi(X)^\top]} = \E[\trace\paren{\phi(X)\phi(X)^\top}] \\ &= \E[\phi(X)^\top\phi(X)] = \E[\norm{\phi(X)}^2] < +\infty, \end{aligned} $$

$$ \begin{aligned} {\cal L}(\psi; \beta) &= 2(\beta - 1)\E_{\xi}[\psi(\xi)^\top\psi(\xi)] - 2\beta\E_X\E_{\xi, \xi'}\bracket{\psi(\xi)^\top\psi(\xi')\midvert X} + \E_{\xi, \xi'}\bracket{(\psi(\xi')^\top\psi(\xi))^2} + k \\ &= 2(\beta - 1)\E_{\xi}[\phi(\xi)^\top \Lambda \phi(\xi)] - 2\beta\E_X\E_{\xi, \xi'}\bracket{\phi(\xi)^\top\Lambda\phi(\xi')\midvert X} + \E_{\xi, \xi'}\bracket{(\phi(\xi')^\top\Lambda\phi(\xi))^2} + k \\ &= 2(\beta - 1)\E_{\xi}[\trace\paren{\Lambda \phi(\xi)\phi(\xi)^\top}] - 2\beta\E_X\E_{\xi, \xi'}\bracket{\trace\paren{\Lambda \phi(\xi')\phi(\xi)^\top}\midvert X} + \E_{\xi, \xi'}\bracket{\trace\paren{\Lambda\phi(\xi)\phi(\xi')^\top}^2} + k. \end{aligned} $$

$$ \begin{aligned} &\E_{{\cal D}_n}[{\cal L}(S\Theta_n);\lambda] - {\cal L}(S\Theta;\lambda) \leq 8 \E_{{\cal D}_n, \sigma}\bracket{\sup_{\Lambda} \frac{1-\beta}{n} \sum_{i\in[n]}\sigma_i\scap{\Lambda}{\frac{1}{m}\sum_{j\in[m]}\phi(\xi_{ij})\phi(\xi_{ij})^\top}} \\ &\qquad\qquad+ 8 \E_{{\cal D}_n, \sigma}\bracket{\sup_{\Lambda} \frac{\beta}{n}\sum_{i\in[n]}\sigma_{i}\scap{\Lambda}{\frac{2}{m}\sum_{j\in [m/2]; j+k-1=m}\phi(\xi_{ij})\phi(\xi_{ik})^\top}} \\ &\qquad\qquad+ 4 \E_{{\cal D}_n,\sigma}\bracket{\frac{2}{n}\sum_{i \in [n/2]; i+j-1=n}\sigma_{i} \frac{1}{m^2} \sum_{k,l\in[m]}\scap{\Lambda}{\phi(\xi_{ik})\phi(\xi_{jk})^\top}^2} \\ &\leq \frac{8\sup \norm{\Lambda}_{HS}}{\sqrt{n}} \paren{(1-\beta)\E_X\bracket{\E\bracket{\norm{\frac{1}{m}\sum_{i=1}^m \phi(\xi_i)\phi(\xi_i)^\top}_{HS}^2 \midvert X}}^{1/2}} \\ &\qquad\qquad + \frac{8\sup \norm{\Lambda}_{HS}}{\sqrt{n}} \paren{\beta\E_X\bracket{\E\bracket{\norm{\frac{2}{m}\sum_{i,j=1}^{m/2} \phi(\xi_i)\phi(\xi_j)^\top}_{HS}^2 \midvert X}}^{1/2}} \\ &\qquad\qquad+ \frac{8\sup \norm{\Lambda}_{HS}}{\sqrt{n}} \paren{\sqrt{2}\sup \abs{\scap{\Lambda}{\phi(\xi)\phi(\xi')^\top}} \E\bracket{\norm{\frac{1}{m^2}\sum_{i,j=1}^{m} \phi(\xi_i)\phi(\xi_j)^\top}_{HS}^2}^{1/2} }. \end{aligned} $$

$$ \begin{aligned} \nabla_{\Lambda}\ell(S\Theta;\lambda) &= \frac{2(\beta - 1)}{m}\sum_{j\in[m]} \phi(\xi_{1j})\phi(\xi_{1j})^\top - \frac{2\beta}{m(m-1)} \sum_{1\leq j \neq k \leq m} \phi(\xi_{1j})\phi(\xi_{1k})^\top \\ &\qquad\qquad+ \frac{1}{m^2} \sum_{i,i'=1}^2 \sum_{j, k=1}^m \scap{\Lambda}{\phi(\xi_{ij})\phi(\xi_{i'k})^\top} \phi(\xi_{ij})\phi(\xi_{i'k})^\top. \end{aligned} $$

$$ \begin{aligned} &\E\bracket{\norm{\frac{1}{m}\sum_{i\in[m]}\phi(\xi_{1i})\phi(\xi_{1i})^\top - \E[\phi(\xi)\phi(\xi)^\top]}^2} = \E\bracket{\norm{\frac{1}{m}\sum_{i\in[m]}\phi(\xi_{1i})\phi(\xi_{1i})^\top - \E[\phi(\xi)\phi(\xi)^\top\midvert X=X_1]}^2} \\ &\qquad\qquad\qquad\qquad+ \E\bracket{\norm{\E[\phi(\xi)\phi(\xi)^\top\midvert X=X_1] - \E[\phi(\xi)\phi(\xi)^\top]}^2} \\ &\qquad\qquad= \frac{1}{m} \E_X \E_\xi\bracket{\norm{\phi(\xi)\phi(\xi)^\top - \E[\phi(\xi)\phi(\xi)^\top\midvert X]}^2\midvert X} + \E\bracket{\norm{\E[\phi(\xi)\phi(\xi)^\top\midvert X] - \E[\phi(\xi)\phi(\xi)^\top]}^2}. \end{aligned} $$

$$ \begin{aligned} \E[\norm{c - \E[c]}^2] &= \frac{1}{m^2} \E_X \E_\xi\bracket{\norm{\scap{\Lambda}{\phi(\xi)\phi(\xi')^\top} \phi(\xi)\phi(\xi')^\top - \E[\scap{\Lambda}{\phi(\xi)\phi(\xi')^\top}\phi(\xi)\phi(\xi')^\top\midvert X, X']}^2\midvert X, X'} \\ &\qquad\qquad+ \E\bracket{\norm{\E\bracket{\scap{\Lambda}{\phi(\xi)\phi(\xi')^\top}\phi(\xi)\phi(\xi')^\top\midvert X, X'} - \E[\scap{\Lambda}{\phi(\xi)\phi(\xi')^\top}\phi(\xi)\phi(\xi')^\top]}^2} \\ &= \frac{1}{m^2} \E_X \E_\xi\bracket{\norm{\scap{\Lambda}{\phi(\xi)\phi(\xi')^\top \otimes \phi(\xi)\phi(\xi')^\top - \E[\phi(\xi)\phi(\xi')^\top \otimes \phi(\xi)\phi(\xi')^\top\midvert X, X']}}^2\midvert X, X'} \\ &\qquad\qquad+ \E\bracket{\norm{\scap{\Lambda}{\E\bracket{\phi(\xi)\phi(\xi')^\top\otimes \phi(\xi)\phi(\xi')^\top\midvert X, X'} - \E[\phi(\xi)\phi(\xi')^\top\otimes \phi(\xi)\phi(\xi')^\top]}}^2} \\ &\leq \frac{1}{m^2} \norm{\Lambda}^2\E_X \E_\xi\bracket{\norm{\phi(\xi)\phi(\xi')^\top \otimes \phi(\xi)\phi(\xi')^\top - \E[\phi(\xi)\phi(\xi')^\top \otimes \phi(\xi)\phi(\xi')^\top\midvert X, X']}^2\midvert X, X'} \\ &\qquad\qquad+\norm{\Lambda}^2 \E\bracket{\norm{\E\bracket{\phi(\xi)\phi(\xi')^\top\otimes \phi(\xi)\phi(\xi')^\top\midvert X, X'} - \E[\phi(\xi)\phi(\xi')^\top\otimes \phi(\xi)\phi(\xi')^\top]}^2}. \end{aligned} $$

$$ \begin{aligned} \sigma_{\xi, 1}^2 &= \E_X \E_\xi\bracket{\norm{\phi(\xi)\phi(\xi)^\top - \E[\phi(\xi)\phi(\xi)^\top\midvert X]}^2\midvert X} \\ \sigma_{X, 1}^2 &= \E\bracket{\norm{\E[\phi(\xi)\phi(\xi)^\top\midvert X] - \E[\phi(\xi)\phi(\xi)^\top]}^2} \\ \sigma_{\xi, 2}^2 &= \E_X \E_\xi\bracket{\norm{\phi(\xi)\phi(\xi')^\top - \E[\phi(\xi)\phi(\xi')^\top\midvert X]}^2\midvert X} \\ \sigma_{X, 2}^2 &= \E\bracket{\norm{\E[\phi(\xi)\phi(\xi')^\top\midvert X] - \E[\phi(\xi)\phi(\xi')^\top]}^2} \\ \sigma_{\xi,3}^2 &= \E_X \E_\xi\bracket{\norm{\phi(\xi)\phi(\xi')^\top \otimes \phi(\xi)\phi(\xi')^\top - \E[\phi(\xi)\phi(\xi')^\top \otimes \phi(\xi)\phi(\xi')^\top\midvert X, X']}^2\midvert X, X'} \\ \sigma_{X,3}^2 &= \E\bracket{\norm{\E\bracket{\phi(\xi)\phi(\xi')^\top\otimes \phi(\xi)\phi(\xi')^\top\midvert X, X'} - \E[\phi(\xi)\phi(\xi')^\top\otimes \phi(\xi)\phi(\xi')^\top]}^2}. \end{aligned} $$

$$ \begin{aligned} \chi_S^\top T \chi_{S'} &= \E_X[ \E_{y,y'} [ \chi_S (X \odot y) \chi_{S'} (X \odot y') ] ] = \E_X \bracket{ \E_{y,y'}\bracket{\prod_{i \in S } X_i y_i \prod_{j \in S'} X_j y_j'}} \\ &= \E_X \bracket{ \E_y \bracket{\prod_{i \in S\vartriangle S'} X_i y_i }} \E_{y,y'}\bracket{\prod_{i \in S \cap S'} y_i y_i' } \\ &=\E[\chi_{S\vartriangle S'}(X)] \cdot |1-2p|^{|S\vartriangle S'|} |1-2p|^{2 | S \cap S'|} = |1-2p|^{|S|} \delta_{S,S'}. \end{aligned} $$

$$ \begin{aligned} \chi_S^\top T \chi_{S'} &= \E_X[ \E_{a,b} [ \chi_S (M_a^w(X)) \chi_{S'} (M_b^w(X)) ] ] \\ &= \frac{1}{d^2}\sum_{a,b=1}^d \E_{X, \nu, \nu'} \bracket{ \prod_{i \in S \cap [a,a+w)} x_i \prod_{i' \in S \backslash [a,a+w)} \nu_{i'} \prod_{j \in S' \cap [b,b+w)} x_j \prod_{j' \in S'\backslash [b,b+w)} \nu_{j'}' } \\ &= \frac{1}{d^2}\sum_{a,b=1}^d \ind{S \subseteq [a,a+w)}\,\ind{S' \subseteq [b,b+w)}\, \E_X[\chi_S(X)\chi_{S'} (X)] = \frac{1}{d^2}\sum_{a,b=1}^d \ind{S \subseteq [a,a+w)}\,\ind{S' \subseteq [b,b+w)}\, \delta_{S,S'} \\ &= \paren{\frac{1}{d}\sum_{a=1}^d \ind{S \subseteq [a,a+w)}}^2 \delta_{S, S'}. \end{aligned} $$

$$ \begin{aligned} \chi_S^\top T\chi_{S'} &= ((1-p)^2 + p^2) \E_X\bracket{ \chi_S (X) \chi_{S'} (X) } + 2p(1-p) \E_X\bracket{ \chi_{\tilde{S}}(X) \chi_{S'} (X) } \\ &= (1-2p+2p^2) \delta_{S,S'} + 2p(1-p)\delta_{\tilde{S},S'}, \end{aligned} $$

$$ \begin{aligned} \scap{Af}{g}_{L^2(\mu_\X)} &= \frac{1}{2^d}\sum_{x\in\X} Af(x) g(x) = \frac{1}{2^d}\sum_{x\in\X} \sum_a p(a) f(a\cdot x) g(x) \\ &= \frac{1}{2^d}\sum_{x\in\X} \sum_a p(a) f(a\cdot x) g(a^{-1} \cdot a\cdot x) = \frac{1}{2^d}\sum_{x\in\X} \sum_a p(a) f(x) g(a^{-1} \cdot x) \\ &= \scap{f}{x\mapsto \textstyle\sum_a p(a)g(a^{-1} \cdot x)}_{L^2(\mu_\X)} = \scap{f}{A^\top g}_{L^2(\mu_\Xi)}. \end{aligned} $$

$$ f_1^*(x) = \frac{1}{3} \sum_{j=1}^3 Q_{1,d}(x_j), \qquad f_3^*(x) = \frac{1}{d} \sum_{j=1}^d Q_{3,d}(x_j). $$

$$ \psi_\theta(x) = \psi_{\theta_0}(x) + \scap{\nabla_{\theta_0}\psi_{\theta_0}(x)}{\theta-\theta_0} + o(\norm{\theta - \theta_0}), $$

$$ \E\bracket{\scap{\nabla\psi(X)}{\xi - \xi'}^2 \midvert X} \propto \norm{\nabla\psi(X)}^2, $$

$$ \norm{f}^2_{L^2(\rho_\X)} \leq c_r\norm{f}^2_{L^2(\mu_\X)}, $$

$$ k(\xi, \xi') = \frac{\int \mu_\X(\diff x) p\paren{\xi \midvert x} p\paren{\xi'\midvert x}}{p(\xi)p(\xi')}. $$

$$ \norm{\Theta}_2^2 = \norm{\Theta^\top}_2^2 = \scap{S\Theta^\top}{(S^\top S)^{-1}S\Theta^\top}_{L^2(\mu_\Xi)} = \sum_{i=1}^k (S\theta_i)^\top K^{-1} S\theta_i = \sum_{i=1}^k \psi_i^\top K^{-1} \psi_i. $$

$$ 0\preceq \E[(A - \E[A])(A - \E[A])^\top], \qquad\Rightarrow\qquad \E[A]\E[A]^\top\preceq\E[AA^\top]. $$

$$ \Sigma_X \preceq \Sigma. $$
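This domination is exact at the empirical level as well: per input, the within-views second moment dominates the outer product of the conditional mean. A numpy sketch with identity features $\phi$ and hypothetical Gaussian views:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 500, 8, 4                  # inputs, views per input, feature dimension

X = rng.standard_normal((n, p))
xi = X[:, None, :] + 0.3 * rng.standard_normal((n, m, p))   # views xi | X

# Sigma = E[phi(xi) phi(xi)^T] with phi = identity
Sigma = np.einsum('nmp,nmq->pq', xi, xi) / (n * m)
# Sigma_X = E_X[ E[phi(xi)|X] E[phi(xi)|X]^T ], conditional means per input
mu = xi.mean(axis=1)
Sigma_X = mu.T @ mu / n

# Sigma - Sigma_X is an average of per-input covariances, hence PSD
gap = np.linalg.eigvalsh((Sigma - Sigma_X + (Sigma - Sigma_X).T) / 2)
```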

$$ {\cal L}((f_{\theta_i})_{i\in[k]}) + \lambda \sum_{i\in[k]}\norm{\theta_i}^2_2 = \trace\paren{\paren{\Sigma^{1/2}\paren{\sum_{i\in[k]}\theta_i\theta_i^\top}\Sigma^{1/2} - A}^2 - A^2} + k, $$

$$ \trace\paren{\E_X\E\bracket{\psi(\xi)\psi(\xi')^\top\midvert X}} = \trace\paren{\Theta\Sigma_X \Theta^\top } = \trace\paren{\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2}\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2}}. $$

$$ \norm{f}^2_{L^2(\mu_\Xi)} = \delta \int_\X f(x)^2 \rho_\X(\diff x) + (1-\delta) \int_\X f(x)^2 \mu_\perp(\diff x) \geq \delta \norm{f}_{L^2(\rho_\X)}^2. $$

$$ \Pi_{f_i}^{(\rho_\X)}(f) = w f_i, \qquad\text{with}\qquad w = \argmin_{w\in\R} \E_{X\sim\rho_\X}[\norm{f(X) - w f_i(X)}^2]. $$

$$ {\cal R}(f) = \E[\norm{f(X) - Y}^2] = \E[\norm{f(X) - \E\bracket{Y\midvert X}}^2] + \E[\norm{\E\bracket{Y\midvert X} - Y}^2]. $$

$$ {\cal R}(f) - {\cal R}(f^*) = \norm{f - f^*}^2_{L^2(\rho_\X)}. $$
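Both displays can be checked on a toy discrete model (illustrative): with $f^* = \E[Y \midvert X]$ and unit noise, the empirical risk splits into the approximation term plus the noise floor, up to $O(n^{-1/2})$ cross terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.integers(0, 3, size=n)                 # three discrete inputs
f_star = np.array([0.0, 1.0, -1.0])            # E[Y | X]
Y = f_star[X] + rng.standard_normal(n)         # unit conditional noise

f = np.array([0.2, 0.8, -1.1])                 # an arbitrary candidate predictor

risk = np.mean((f[X] - Y) ** 2)                # R(f)
bias = np.mean((f[X] - f_star[X]) ** 2)        # ||f - f*||^2 under rho_X
noise = np.mean((f_star[X] - Y) ** 2)          # E||E[Y|X] - Y||^2
```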

$$ \frac{1}{n}\sum_{i\in[n]}\E[\norm{Z_i}^m] \leq m! \sigma^2 M^{m-2} / 2 $$

$$ \sup\norm{Z} \leq \sup\norm{U}^2 \leq \gamma^{-1} M^2. $$

$$ \E[Z_i^2] = \inf_{a} \E[(Z_i - a)^2] \preceq \E[(U_i U_i^\top)^2] \preceq \sup\norm{U}^2\E[U_i^\top U_i] = \sup\norm{U}^2(\Sigma+\gamma)^{-1}\Sigma \preceq \sup\norm{U}^2 I $$

$$ {\cal L}(\hat\Theta; \lambda) - k = \trace\paren{(C - T_\lambda)^2 - T_\lambda^2} = \trace\paren{C^2 - 2C^{1/2}T_\lambda C^{1/2}}. $$

$$ \sum_{i\leq k} \norm{T_+^{1/2}g_i}^4 = \sum_{i\leq k} (g_i^\top T_+ g_i)^2 = \sum_{i\leq k} \paren{\sum_{j\leq k_\lambda} \lambda_j \scap{g_i}{f_j}^2}^2 = \sum_{j, m\leq k_\lambda} \lambda_j\lambda_m \sum_{i\leq k}\scap{g_i}{f_j}^2\scap{g_i}{f_m}^2 = \lambda^\top U^\top U \lambda. $$

$$ U^\top U \preceq \diag\paren{\norm{\Pi_{\hat{\cal F}}f_i}^2_{i\leq k_\lambda}}^2. $$

$$ \sum_{i\leq l} c_i x_i = \sum_{i\leq l} \frac{c_i}{a_i} a_i x_i \leq \paren{\sum_{i\leq l} \frac{c_i^2}{a_i^2}}^{1/2} \paren{\sum_{i\leq l} a_i^2 x_i^2}^{1/2}. $$

$$ x^\top K^{-1} x \leq \frac{1 - t}{\lambda} $$

$$ \dim(U) = \dim\paren{ \frac{U+V}{V} } \leq \dim\paren{\frac{E}{V}} = \operatorname{codim}(V) = d. $$

$$ \E[\sup_f \frac{1}{n}\sum_{i=1}^n \sigma_i h(f(Z_i))] \leq \sup \norm{\diff h(x)} \E[\sup_f \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i)] $$

$$ \E[\norm{\frac{1}{p} \sum_{i\in[p]}Z_i}^2] = \E[\norm{\frac{1}{p}\sum_{i\in[p]} Z_i - \E[Z]}^2] + \norm{\E[Z]}^2 = \frac{1}{p}\E[\norm{Z - \E[Z]}^2] + \norm{\E[Z]}^2. $$

$$ \norm{\Lambda}_{HS} = \norm{\Theta^\top\Theta}_{HS} \leq \norm{\Theta}_{\op} \norm{\Theta}_{HS} \leq \norm{\Theta}_{HS}^2, $$

$$ \E\norm{\nabla \ell - \nabla{\cal L}}^2\leq 3\E\norm{a - \E[a]}^2 + 3\E\norm{b-\E[b]}^2 + 3\E\norm{c - \E[c]}^2. $$

$$ \E\norm{\nabla \ell - \nabla{\cal L}}^2\leq 3\paren{2(1-\beta)\paren{\frac{\sigma_{\xi, 1}^2}{m} + \sigma_{X, 1}^2} + 2\beta\paren{\frac{2\sigma_{\xi, 2}^2}{m} + \sigma_{X, 2}^2} + \sup\norm{\Lambda}^2 \paren{\frac{\sigma_{\xi, 3}^2}{m^2} + \sigma_{X, 3}^2}} $$

$$ \E[\scap{\Lambda}{\phi(\xi)\phi(\xi')^\top}^2] = \scap{\Lambda}{\E[\phi(\xi')\phi(\xi')^\top] \otimes \E[\phi(\xi)\phi(\xi)^\top] \Lambda} = \scap{\Lambda}{\Sigma \otimes \Sigma \Lambda}. $$

$$ \scap{f}{g} = \E_{x\sim\tau}[f(x) g(x)] = \frac{1}{2^d}\sum_{x\in\X} f(x) g(x). $$

$$ \chi_{S_{i,k}}^\top A\chi_{S_{i,k'}} = \ind{S_{i,k} = S_{i,k'} + a} = \ind{S_{i} + k = S_{i} + k' + a} = \ind{k - k' = a}, $$

$$ c_i = \ind{i = a}, $$

$$ |\hat p(m)|^2 = |\hat p_0(m / \omega)|^2 \approx |\hat p_0(\Delta m)|^2. $$

$$ k_{CNN}(x, y) = \frac{1}{d} \sum_{k\in[d]} h\paren{\scap{x_{(k)}^{(q)}}{y_{(k)}^{(q)}}/ q}. $$

$$ N(d,\ell) := \dim V_{d,\ell} = \frac{2\ell + d - 2}{\ell}{\ell + d -3 \choose d - 2}. $$

$$ \nu_h(d,\ell) = \E_{t \sim \tau}[h(t) Q_{\ell,d}(t)], $$

$$ \hat{T} = I + \sum_{ijk} e_{ij} e_{ik}^\top, $$

Theorem. [Downstream error] Let $(X_i, Y_i)\sim\rho^{\otimes n}$ be $n$ samples drawn from the distribution of the downstream task and $\ell$ be the square loss. Define $k_\lambda < +\infty$ as the number of strictly positive eigenvalues of $T_\lambda$. Under Assumptions ass:interpolation_simple, ass:robust_simple, and ass:source, after a transitory regime, the average excess risk of the optimally-regularized empirical risk minimizer $f_n$ satisfies $$ \E[{\cal R}(f_n) - {\cal R}(f^*)] \leq \frac{2k_e\epsilon^2}{n} + \frac{\log(n)^{1.1}}{n}\norm{f^*}_{L^2(\rho_\X)} + c_{f,T_\lambda}^2 \paren{{\cal L}_k(S\hat\Theta)- {\cal L}_k(S\Theta_*)} + c_{f,k}, $$ where $\epsilon^2$ is the noise level of $Y$ (the supremum of the conditional variances), $k_e \leq k$ is the effective dimension of the representation $\psi = \Theta\phi$ on the downstream task, $c_{f,k} \leq (k_\lambda - k)_+ \norm{f^*}^2_{L^2(\rho_\X)}$ is a constant relating to the concentration of the energy of the downstream target function $f^*$ with respect to the eigenspaces of $T_\lambda$, $c_{f,T_\lambda} \leq \norm{T_\lambda^{-1}f^*}$ is a similar constant taking into account the decay of the eigenvalues of $T_\lambda$, and the index $k$ in ${\cal L}_k$ indicates that we search over $\psi:\X\to\R^k$.

Theorem. [Empirical risk minimizer] Let $\Theta_n\in\R^k\otimes {\cal H}$ be the minimizer of the unbiased regularized empirical version of ${\cal L}$ based on a dataset ${\cal D}_n$. Assume that ${\cal D}_n$ is built from $n$ input samples $(X_i) \sim \mu_\X^{\otimes n}$ and $m$ augmentations per sample $(\xi_{ij}) \sim\mu\vert_{X_i}^{\otimes m}$; then the average excess risk is bounded by $$ \E_{{\cal D}_n}[{\cal L}(S\Theta_n)] - {\cal L}(S\Theta) \leq \frac{12\kappa^2 k}{\lambda\sqrt{n}} \paren{1 + \frac{\kappa^2 k}{\lambda}}, $$ where $\kappa$ is a bound on $\norm{\phi(X)}$.

Theorem. [Sharper bounds] There exists an implementable algorithm that guarantees an average excess risk $$ \E_{{\cal D}_n}[{\cal L}(S\Theta_n)] - {\cal L}(S\Theta) \leq 3\kappa^2 c_\lambda c_\lambda'\paren{\frac{\sigma_X^2}{n} + \frac{\sigma_\xi^2}{nm}} + \frac{4\kappa^6 c_\lambda^2}{n}, $$ where $c_\lambda = 1 + \kappa^2 k_\lambda/\lambda$, $c_\lambda' = 1 + k_\lambda^2 / \lambda^2$, $k_\lambda$ is the number of positive eigenvalues of $T_\lambda$, $\kappa$ is a bound on $\norm{\phi}$, $\sigma_X$ relates to the variance of $\E[\psi(\xi)\midvert X]$, and $\sigma_\xi$ relates to the average variance of $\xi\midvert X$. Moreover, when $K = SS^\top$, the covariance of the $\phi(\xi)$, has a finite number of positive eigenvalues (e.g. $\X$ finite or $\cal H$ finite dimensional), with $c_K$ a constant that relates to the condition number of $K$, this bound can be tightened to $$ \E_{{\cal D}_n}[{\cal L}(S\Theta_n)] - {\cal L}(S\Theta) \leq \frac{4 c_K^2 c_\lambda^2}{n}. $$

Theorem. Under Assumptions \ref{ass:source}, \ref{ass:noise}, \ref{ass:interpolation} and \ref{ass:robust}, there exists a regularizer $\gamma$ such that the regularized empirical risk minimizer verifies the following: for any $\delta>0$, there exists an $N_\delta >0$ such that for any $n > N_\delta$, the excess risk of the regularized empirical risk minimizer \eqref{eq:erm} reads
\begin{align}
\nonumber
{\cal R}(f) - {\cal R}(f^*) &\leq \frac{2k_e\epsilon^2}{n} + \frac{8M^2\log(n)^{1+\delta}}{n}\norm{f^*}^2_{L^2(\rho_\X)} + \frac{64 k a}{n} \\
&\qquad + \inf_{l\leq k} \Big(\norm{(\Pi_{{\cal F}_{k_\lambda}} - \Pi_{{\cal F}_l}^{(\rho_\X)})f^*}_{L^2(\rho_\X)}^2 + 4\sigma(l)^2 + 4\zeta^2\norm{\tilde{T}_\lambda^{-1}\Pi_{{\cal F}_l}^{(\mu_\Xi)} f^*}_{L^2(\mu_\Xi)} \big({\cal L}_k(\hat\Theta;\lambda) - {\cal L}_k(\Theta_*;\lambda)\big)^{1/2}\Big),
\end{align}
where ${\cal F}_l$ is the span of the first $l$ eigenfunctions of $T_\lambda$, $k_\lambda$ the number of strictly positive eigenvalues of $T_\lambda$, $k_e\leq k$ is the effective dimension of $\psi$ in $L^2(\rho_\X)$, $a = \norm{(I-\Pi_{\hat{\cal F}})f^*}_{L^\infty} \leq \norm{f^*}_{L^\infty} + M\norm{f^*}_{L^2}$, $M = \sup\norm{\psi} \leq k\lambda^{-1}\sup\norm{\phi}$, and $\tilde T_{\lambda} = \sum_{i\in[k]} (\lambda_i^2 - \lambda_{k+1}^2)^{1/2} f_i f_i^\top$. Moreover, under the sole Assumptions \ref{ass:interpolation_simple} and \ref{ass:robust_simple}, we have the simpler bound
\begin{align*}
{\cal R}(f) - {\cal R}(f^*) &\leq \frac{2k_e\epsilon^2}{n} + \frac{8M^2\log(n)^{1+\delta}}{n}\norm{f^*}^2_{L^2(\rho_\X)} + \frac{64 k a}{n} + \max(k-k_\lambda, 0)\norm{f^*}_{L^2(\rho_\X)}^2 \\
&\qquad\qquad+ 2c_r \norm{T_\lambda^{-1} \Pi_{{\cal F}_\lambda} f^*}_{L^2(\mu_\Xi)}^2\big({\cal L}_{k_\lambda}(\hat\Theta;\lambda) - {\cal L}_{k_\lambda}(\Theta_*;\lambda)\big) + \norm{(I - \Pi_{{\cal F}_{\lambda}}) f^*}_{L^2(\mu_\Xi)},
\end{align*}
where $\hat\Theta$ is understood as belonging to $\R^{k_\lambda}\otimes {\cal H}$ in this last expression and ${\cal F}_\lambda$ is the eigenspace associated with the positive eigenvalues of $T_\lambda$.

Lemma. [Spectral embedding] There exists a linear positive symmetric operator $L$ on $L^2$, for which the operator $I-T$ below is positive, such that
\[
\E_X\E_{\xi, \xi'}[\norm{\psi(\xi) - \psi(\xi')}^2\midvert X] = \sum_{i\in[k]}\psi_i^\top L\psi_i.
\]
To be consistent with previous literature, we will rather use $T = I - L/2$, which is also a linear positive symmetric operator, and is defined, for $\psi_1, \psi_2 \in L^2$, by
\[
\psi_1^\top T \psi_2 = \E_X\E_{\xi, \xi'}[\psi_1(\xi)^\top \psi_2(\xi')\midvert X].
\]
As a consequence, if $(\lambda_i)$ are the eigenvalues of $T$ and $(f_i)$ are the corresponding eigenvectors, a minimizer of ${\cal L}$ is $\psi_i = \mu_i^{1/2} f_i$ with $\mu_i = 1-\beta+\beta\lambda_i$.

Lemma. [Regularized population loss] For $\Theta\in\R^k\otimes{\cal H}$, and a regularizer $\lambda > 0$, the regularized loss ${\cal L}(S\Theta) + \lambda\norm{\Theta}^2_2$ can be minimized in closed form with the operator
\begin{equation}
T_\lambda = (1-\beta) I + \beta T - \lambda K^{-1},
\end{equation}
where $K = SS^\top$ for $S:{\cal H}\to L^2(\mu_\Xi);\theta\mapsto f_\theta$ the embedding of ${\cal H}$ in $L^2$. Specifically, if $(\lambda_i)$ are the (decreasing) eigenvalues of $T_\lambda$ and $(f_i)$ the corresponding eigenfunctions, a minimizer is given by $\psi_i = \max(\lambda_i, 0)^{1/2} f_i$.

Lemma. For any $\psi\in L^2(\mu_\Xi)$, $\omega(\psi) \leq 2\norm{\psi}^2_{L^2(\mu_\Xi)}$. As a consequence, $L \preceq 2I$.

Lemma. Let $A$ be a self-adjoint operator on $L^2(\mu_\Xi)$. Assume that there exists $c$ such that $A \preceq c I$ and that $A$ has a pure point spectrum. Then, if $(\lambda_i, f_i)$ denotes the eigendecomposition of $A$ with the $\lambda_i$ in decreasing order, the minimum of $\trace((B-A)^2 - A^2)$ under the constraint that $B$ is a self-adjoint positive operator of rank at most $k$ is reached for $B = \psi^\top\psi$ with $\psi:\X\to\R^k$ such that $\psi_i = \max(0, \lambda_i)^{1/2} f_i$.
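The content of this lemma can be sanity-checked numerically in finite dimension: since the objective differs from $\norm{B-A}_F^2$ by a constant, the minimizer is the spectral truncation of $A$ to its $k$ largest positive eigenvalues. The sketch below is a toy finite-dimensional check (matrix size, rank and trial count are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3
A = rng.normal(size=(n, n))
A = (A + A.T) / 2                                   # a symmetric test operator

lam, F = np.linalg.eigh(A)                          # ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, F = lam[order], F[:, order]

# Candidate from the lemma: keep the k largest eigenvalues, clipped at zero.
B_star = (F[:, :k] * np.maximum(lam[:k], 0.0)) @ F[:, :k].T

def obj(B):
    return np.trace((B - A) @ (B - A) - A @ A)

# Any other positive semi-definite matrix of rank <= k does at least as badly.
for _ in range(200):
    C = rng.normal(size=(n, k))
    assert obj(B_star) <= obj(C @ C.T) + 1e-9
```

The check exercises only the finite-dimensional case; the operator statement follows the same clipping rule.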

Lemma. For $(\theta_i)\in {\cal H}^k$ and $f_\theta:x\mapsto \scap{\phi(x)}{\theta}$, and a regularizer $\lambda \in \R$,
\[
{\cal L}((f_{\theta_i})_{i\in[k]}) + \lambda \sum_{i\in[k]}\norm{\theta_i}^2_2 = \trace\Big(\big(\Sigma^{1/2}\big(\textstyle\sum_{i\in[k]}\theta_i\theta_i^\top\big)\Sigma^{1/2} - A\big)^2 - A^2\Big) + k,
\]
with $A$ and $\Sigma$ being operators on ${\cal H}$ defined as
\begin{align*}
A = \Sigma^{-1/2}((1-\beta)\Sigma + \beta\Sigma_X - \lambda I)\Sigma^{-1/2},\quad \Sigma = \E_{\xi}[\phi(\xi)\phi(\xi)^\top],\quad \Sigma_X = \E_X[\E_{\xi, \xi'}[\phi(\xi)\phi(\xi')^\top \midvert X]].
\end{align*}
As a consequence, a minimizer~$\Theta_*$ of ${\cal L}$ is such that $\Theta_*$ matches the eigenvalue decomposition of $A$ on the positive eigenvalues up to the $k$-th. Formally, if $A = \sum_{i\in\N} \lambda_i u_i\otimes u_i$ with $u_i\in{\cal H}$ and $(\lambda_i)$ in decreasing order,
\begin{equation*}
\Theta_* = (\theta_i)_{i\in[k]},\quad\text{with}\quad \theta_i = \max(\lambda_i, 0)^{1/2} \Sigma^{-1/2}u_i.
\end{equation*}
Moreover, the $(f_{\theta_i})$ are orthogonal in $L^2(\mu_{\Xi})$, where $\mu_{\Xi}$ denotes the marginal distribution over augmentations.

Lemma. $S$ is isometric to $\Sigma^{1/2}$, and $K = SS^\top$ is an integral operator that maps $f\in L^2(\mu_\Xi)$ to $Kf \in L^2(\mu_\Xi)$ defined for $\xi\in\X$ as
\begin{equation}
Kf(\xi) = \E_{\xi'}[\phi(\xi)^\top\phi(\xi')f(\xi')].
\end{equation}

Lemma. For $\Theta\in\R^k\otimes{\cal H}$, and a regularizer $\lambda \in \R$,
\[
{\cal L}(S\Theta) + \lambda\norm{\Theta}^2_2 = \trace\big((S\Theta^\top \Theta S^\top-T_\lambda)^2 - T_\lambda^2\big) + k,
\]
where
\[
T = S^{-\top}\Sigma_X S^{-1}, \qquad T_\lambda = (1-\beta)I + \beta T - \lambda K^{-1}, \qquad K = SS^\top,
\]
with $S:{\cal H}\to L^2(\mu_\Xi); \theta \mapsto f_\theta$ the embedding of ${\cal H}$ in $L^2(\mu_\Xi)$, where $\mu_{\Xi}$ denotes the marginal distribution over augmentations. As a consequence, a minimizer~$\Theta_*$ of ${\cal L}(\,\cdot\,;\lambda)$ is such that $S\Theta_*^\top\Theta_* S^\top$ matches the eigenvalue decomposition of $T_\lambda$ on the positive eigenvalues up to the $k$-th.

Lemma. When $K$ and $T$ commute, $K$ and $T$ can be diagonalized by the same eigenfunctions $(f_i)$.

Lemma. [Decomposition intuition] Let $\hat{\cal F}$ and ${\cal F}$ be two closed convex sets of $L^2(\rho_\X)$, and let $\Pi_{\cal F}$ denote the orthogonal projection onto ${\cal F}$ with respect to the $L^2(\rho_\X)$ geometry. For any function $f:\X\to\Y$ in $\hat{\cal F}$, the excess risk \eqref{eq:down_obj} can be decomposed as
\begin{align}
{\cal R}(f) - {\cal R}(f^*) &\leq \norm{f - \Pi_{\hat{\cal F}}f^*}^2_{L^2(\rho_\X)} \\
&\quad+ 2\norm{(I - \Pi_{\hat{\cal F}})\Pi_{\cal F} f^*}^2_{L^2(\rho_\X)} \\
&\quad+ \norm{(I - \Pi_{\cal F}) f^*}^2_{L^2(\rho_\X)}.
\end{align}

Lemma. [Warm-up] Let $\hat{\cal F}$ be the span of the $(\psi_i)_{i\in[k]}$, with $S_\psi:\R^k \to L^2$ defined as $S_\psi w = w^\top \psi$. Then
\begin{equation}
\Pi_{\hat{\cal F}}f^* = S_\psi \E[\psi(X)\psi(X)^\top]^{-1} \E[Y\psi(X)].
\end{equation}
Based on data $(X_i, Y_i)$, one can define the empirical risk minimizer $f_n= S_\psi w_n$, where $w_n$ is the minimizer of
\begin{equation}
w_n \in \argmin_{w\in\R^k} \sum_{i=1}^n \big(w^\top \psi(X_i) - Y_i\big)^2 = \Big[\frac{1}{n}\sum_{i=1}^n\psi(X_i)\psi(X_i)^\top\Big]^{-1} \frac{1}{n} \sum_{i=1}^n Y_i\psi(X_i).
\end{equation}
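The closed form above is easy to sanity-check numerically. In the sketch below, the representation $\psi(X_i)$ and the targets are synthetic toy data (not the paper's setup), and the explicit formula is compared against a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5

# Synthetic toy data: rows of Psi play the role of psi(X_i).
Psi = rng.normal(size=(n, k))
w_true = rng.normal(size=k)
Y = Psi @ w_true + 0.1 * rng.normal(size=n)

# Closed form from the lemma: w_n = [1/n sum psi psi^T]^{-1} (1/n sum Y_i psi(X_i)).
w_n = np.linalg.solve(Psi.T @ Psi / n, Psi.T @ Y / n)

# Cross-check against a generic least-squares solver.
w_ls, *_ = np.linalg.lstsq(Psi, Y, rcond=None)
assert np.allclose(w_n, w_ls)
```

The $1/n$ factors cancel in the closed form, which is why both solvers agree exactly.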

Lemma. [Bias-variance decomposition] Based on data $(X_i, Y_i)$, one can define the regularized empirical risk minimizer $f_n= S_\psi w_n$ with a regularization parameter $\gamma > 0$ as
\begin{equation}
w_n \in \argmin_{w\in\R^k} \sum_{i=1}^n \big(w^\top \psi(X_i) - Y_i\big)^2 + \gamma \norm{w}^2.
\end{equation}
When doing so, under Assumption \ref{ass:noise}, the average excess risk can be decomposed, with $M = \sup \norm{\psi(X)}$, as
\begin{align}
\E_{(X_i, Y_i)}[\norm{f_n- \Pi_{\hat{\cal F}}f^*}^2_{L^2(\rho_\X)}] \nonumber &\leq \frac{\epsilon^2}{n} \Big(1 + \frac{M^2}{\gamma n}\Big) \trace\big((\Sigma+ \gamma)^{-1}\Sigma\big) \nonumber \\
&\quad+ 2\gamma\Big(1 + \frac{M^2}{\gamma n}\Big)^2\scap{\Pi_{\hat{\cal F}}f^*}{\Sigma(\Sigma + \gamma)^{-1}\Pi_{\hat{\cal F}}f^*}_{L^2(\rho_\X)} \nonumber \\
&\quad+ 2\E_{(X_i)}\norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top (I - \Pi_{\hat{\cal F}}) f^*}^2_{L^2(\rho_\X)}.
\end{align}

Lemma. For $t = \norm{(\Sigma+\gamma)^{-1/2}(\Sigma - \hat \Sigma) (\Sigma+\gamma)^{-1/2}}_{\op}$ and $M$ such that $\norm{\psi(X)}\leq M$ almost everywhere,
\begin{equation}
\norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top (I - \Pi_{\hat{\cal F}}) f^*}_{L^2(\rho_\X)} \leq \min\Big(\frac{1}{1-t}, 1 + t\cdot\frac{M^2 + \gamma}{\gamma}\Big)\norm{\Sigma_\gamma^{-1/2} \hat S^\top (I - \Pi_{\hat{\cal F}})f^*}.
\end{equation}

Lemma. [Bernstein concentration inequalities] Denote by ${\cal A}$ a Hilbert space and by $(Z_i)_{i\in[n]}$ a sequence of independent random vectors in ${\cal A}$ such that $\E[Z_i] = 0$, and such that there exist two positive constants $M$ and $\sigma$ with, for all $m > 2$,
\[
\frac{1}{n}\sum_{i\in[n]}\E[\norm{Z_i}^m] \leq m! \sigma^2 M^{m-2} / 2.
\]
For any $t>0$,
\[
\Pbb\Big(\Big\|\frac{1}{n}\sum_{i=1}^{n} Z_{i}\Big\| \geq t\Big) \leq 2\exp\Big(\frac{-nt^2}{2\sigma^2 + 2tM}\Big).
\]
In particular, when the $(Z_i)$ are bounded by $3M$ and $\sigma^2 = n^{-1}\sum_{i\in[n]} \E[\norm{Z_i}^2]$, the condition holds. When, instead, the $Z_i$ are symmetric matrices in $\R^{k\times k}$ and $\norm{\cdot}$ is the operator norm, the same bound holds with $k\exp(\cdots)$ instead of $2\exp(\cdots)$ on the right-hand side, where $\sigma^2 = \norm{n^{-1}\sum_{i\in[n]}\E[Z_i^2]}$.
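As a quick numerical illustration (not part of the proof), one can check the scalar special case by Monte Carlo with $Z_i$ uniform on $[-1,1]$, for which $M = 1$ and $\sigma^2 = 1/3$ satisfy the moment condition; the Bernstein tail should dominate the empirical tail at every threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 5000
sigma2, M = 1 / 3, 1.0               # E[Z^2] = 1/3 and |Z| <= 1 for Z ~ U[-1, 1]

# Empirical tail of |mean of n i.i.d. uniform variables| over many trials.
Z = rng.uniform(-1, 1, size=(trials, n))
means = np.abs(Z.mean(axis=1))

for t in [0.05, 0.1, 0.2]:
    empirical = (means >= t).mean()
    bound = 2 * np.exp(-n * t ** 2 / (2 * sigma2 + 2 * t * M))
    assert empirical <= bound        # Bernstein tail dominates the empirical one
```

The bound is loose for small $t$ (it can exceed one) and becomes informative as $t$ grows, which matches its role in the proofs above.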

Lemma. For any $t > 0$, the vector part in the last term of the bias decomposition \eqref{eq:to_work_out} can be controlled with
\begin{equation}
\Pbb\Big(\norm{\Sigma_\gamma^{-1/2} \hat S^\top (I - \Pi_{\hat{\cal F}})f^*} \geq t\Big) \leq 2\exp\Big(\frac{-nt^2}{a(b + 2M\gamma^{-1/2}t/3)}\Big),
\end{equation}
where $b = 2\trace\big((\Sigma+\gamma)^{-1}\Sigma\big)$, $M = \sup \norm{\psi(X)}$ and $a=\norm{f^*}_{L^\infty} + M\norm{f^*}_{L^2}$. Moreover, this vector part is bounded by $\gamma^{-1}a^2M^2$. The matrix part in the last term of \eqref{eq:to_work_out} is controlled with
\begin{equation}
\Pbb\Big(\norm{\Sigma_\gamma^{-1/2}( \hat\Sigma - \Sigma )\Sigma_\gamma^{-1/2}}_{\op} \geq t\Big) \leq k\exp\Big(\frac{-nt^2}{2M^2\gamma^{-1}(1 + t/3)}\Big).
\end{equation}
Moreover, this matrix part is bounded by $\gamma^{-2} M^4$.

Lemma. Keeping the notation of the previous lemma,
\begin{align*}
\E_{(X_i)}\norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top (I - \Pi_{\hat{\cal F}}) f^*}^2_{L^2(\rho_\X)} &\leq k\exp\Big(\frac{-3n\gamma}{(3+2)M^2}\Big) \big(\gamma^{-4}M^6 a^2(M^2 + 2\gamma)\big)^2 \\
&\quad+ \frac{16ab}{n} + \frac{512 a^2M^2}{9\gamma n^{2}}.
\end{align*}

Lemma. [Simplifying constants] The constants in the previous bound can be worked out as
\begin{equation*}
\trace\big(\Sigma(\Sigma+\gamma)^{-1}\big) \leq k,\qquad M \leq \lambda^{-1}k \sup \norm{\phi},\qquad \scap{\Pi_{\hat{\cal F}}f^*}{\Sigma(\Sigma + \gamma)^{-1}\Pi_{\hat{\cal F}}f^*}_{L^2(\rho_\X)} \leq \norm{f^*}^2_{L^2(\rho_\X)}.
\end{equation*}
We also have
\begin{equation*}
\norm{f^*}_{L^2(\rho_\X)} \leq \norm{f^*}_{L^\infty(\rho_\X)} \leq \sigma, \qquad \epsilon^2 \leq \sigma^2,\qquad \sigma^2 = \sup_x\E[Y^2 \midvert X=x].
\end{equation*}
As a consequence, the constant $a$ appearing earlier is smaller than $(1+M)\sigma$.

Lemma. Under Assumption \ref{ass:noise}, when $\gamma = M^2 \log(n)^{1+\delta} n^{-1}$, with $\delta > 0$, there exists $N>0$ such that for any $n > N$, the excess risk of the regularized empirical risk minimizer \eqref{eq:erm} reads
\begin{equation}
\E_{(X_i, Y_i)}[{\cal R}(f_n) - {\cal R}(f^*)] \leq \frac{2k_e\epsilon^2}{n} + \frac{8M^2\log(n)^{1+\delta}}{n}\norm{f^*}^2_{L^2(\rho_\X)} + \frac{64 k a}{n} + 2\norm{(I-\Pi_{\hat{\cal F}})\Pi_{\cal F}f^*}^2 + \norm{(I-\Pi_{\cal F})f^*}^2,
\end{equation}
where $k_e = \trace\big(\Sigma(\Sigma+\gamma I)^{-1}\big) \leq k$ is the effective dimension, $a = \norm{(I-\Pi_{\hat{\cal F}})f^*}_{L^\infty} \leq \norm{f^*}_{L^\infty} + M\norm{f^*}_{L^2}$, and $M = \sup\norm{\psi} \leq k\lambda^{-1}\sup\norm{\phi}$.

Lemma. [Transfer bound] For $\Theta\in\R^k\otimes{\cal H}$, and $\hat{\cal F} = \{x\mapsto w^\top\hat\Theta\phi(x)\midvert w\in\R^k\}$,
\begin{equation}
\sum_{i\in[k]} \lambda_i^2 \norm{(\Pi^{(\mu_\Xi)}_{{\cal F}} - \Pi^{(\mu_\Xi)}_{\hat{\cal F}})f_i}_{L^2(\mu_\Xi)}^2 -\sum_{k< i \leq k_\lambda} \lambda_i^2 \norm{\Pi^{(\mu_\Xi)}_{\hat{\cal F}}f_i}_{L^2(\mu_\Xi)}^2 \leq {\cal L}(\hat\Theta;\lambda) - {\cal L}(\Theta_*;\lambda),
\end{equation}
where $\Pi_{\cal F}^{(\tau)}$ is the orthogonal projection onto ${\cal F}$ in $L^2(\tau)$.

Lemma. [Decomposition] Under Assumptions \ref{ass:interpolation} and \ref{ass:robust}, with ${\cal F}_l$ the span of the $(f_i)_{i\in[l]}$,
\begin{equation}
\norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)})\Pi_{{\cal F}_l}^{(\rho_\X)} f^*}_{L^2(\rho_\X)} \leq \sigma(l) + \zeta\sum_{i\leq l} \big|\scap{f^*}{f_i}_{L^2(\mu_\Xi)}\big| \, \norm{(\Pi_{{\cal F}_l}^{(\mu_\Xi)} - \Pi_{\hat{\cal F}}^{(\mu_\Xi)})f_i}_{L^2(\mu_\Xi)}.
\end{equation}

Lemma. When ${\cal F}$ is of dimension $k$ and $\hat{\cal F}$ is of dimension $k'$, we have
\begin{equation}
\sum_{i\leq k} \norm{(\Pi_{\hat{\cal F}} - \Pi_{\cal F})f_i}^2 = k-k'+\sum_{i > k}\norm{\Pi_{\hat{\cal F}}f_i}^2 \leq k.
\end{equation}

Lemma. Under Assumptions \ref{ass:interpolation} and \ref{ass:robust}, with ${\cal F}_{l}$ the span of the first $l$ eigenfunctions of $T_\lambda$,
\begin{align}
\nonumber
&\norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)}) f^*}_{L^2(\rho_\X)}^2 \\
&\qquad\leq \inf_{l\leq k} \Big(\norm{(I - \Pi_{{\cal F}_l}^{(\rho_\X)})f^*}_{L^2(\rho_\X)}^2 + 4\sigma(l)^2 + 4\zeta^2\norm{\tilde{T}_\lambda^{-1}\Pi_{{\cal F}_l}^{(\mu_\Xi)} f^*}_{L^2(\mu_\Xi)} \big({\cal L}(\hat\Theta;\lambda) - {\cal L}(\Theta_*;\lambda)\big)^{1/2}\Big),
\end{align}
where $\tilde T_{\lambda} = \sum_{i\in[k]} (\lambda_i^2 - \lambda_{k+1}^2)^{1/2} f_i f_i^\top$. Moreover, when the search for $\hat{\cal F}$ is done without rank restriction on $\Theta$, before thresholding to reduce $\hat{\cal F}$ to a space of dimension $k$, under the stronger Assumptions \ref{ass:interpolation_simple} and \ref{ass:robust_simple}, as well as Assumption \ref{ass:source},
\begin{equation}
\norm{(I - \Pi_{\hat{\cal F}_k})f^*}^2 \leq \max(k-k_\lambda, 0)\norm{f^*}_{L^2(\rho_\X)}^2 + 2c_r \norm{T_\lambda^{-1}f^*}_{L^2(\mu_\Xi)}^2\big({\cal L}(\hat\Theta;\lambda) - {\cal L}(\Theta_*;\lambda)\big).
\end{equation}

Lemma. [Relating capacity between $K$ and $T_\lambda$] If $(\mu_i)$ are the eigenvalues of $K$, then the number of eigenvalues of $T_\lambda$ that are bigger than $t\in\R$ is smaller than the cardinality of $\{ i \midvert \mu_i > \lambda / (1-t)\}$. Moreover, if there exists $q>0$ such that $\trace(K^{1/q}) < +\infty$, then there exists a $c_q$ such that, with $(\mu_i)$ the eigenvalues of $K$, we have $\mu_i \leq c_q i^{-q}$. As a consequence, in this setting, for any $t\in\R$, the number of eigenvalues of $T_\lambda$ that are bigger than $t$ is smaller than $(c_q(1-t)/\lambda)^{1/q}$.

Lemma. Let $\Theta\in\R^k\otimes {\cal H}$ and denote $\Lambda = \Theta^\top \Theta \in {\cal H}\otimes{\cal H}$. Then
\begin{equation}
{\cal L}(S\Theta) = 2(\beta - 1)\E_{\xi}[\scap{\Lambda}{\phi(\xi)\phi(\xi)^\top}] - 2\beta\E_X\E_{\xi, \xi'}[\scap{\Lambda}{\phi(\xi')\phi(\xi)^\top}\midvert X] + \E_{\xi, \xi'}[\scap{\Lambda}{\phi(\xi)\phi(\xi')^\top}^2] + k.
\end{equation}
Moreover, the regularization reads $\lambda \norm{\Theta}^2 = \lambda \trace\Lambda = \lambda \scap{\Lambda}{I}$.

Lemma. Let ${\cal R}(\zeta) = \E_Z[\ell(\zeta, Z)]$, let $\zeta^*$ be the minimizer of ${\cal R}$ inside a domain for $\zeta$, and let $\zeta_n$ be the minimizer of ${\cal R}_{(Z_i)}(\zeta) = \frac{1}{n} \sum_{i\in[n]} \ell(\zeta, Z_i)$ based on exchangeable data $Z_i$ such that $\E_{(Z_i)}[{\cal R}_{(Z_i)}] = {\cal R}$. The average excess risk of $\zeta_n$ is bounded by the Rademacher complexity as
\begin{equation}
\E_{(Z_i)}[{\cal R}(\zeta_n)] - {\cal R}(\zeta^*) \leq 4 \E_{(Z_i), (\sigma_i)}\Big[\sup_{\zeta}\frac{1}{n}\sum_{i=1}^n \sigma_i\ell(\zeta, Z_i)\Big],
\end{equation}
where the $\sigma_i$ are i.i.d. variables taking values one and minus one with probability one half.

Lemma. For linear models, the Rademacher complexity can be bounded as
\begin{equation}
\E_{(Z_i),(\sigma_i)}\Big[\sup_{\norm{\zeta}\leq M} \frac{1}{n}\sum_{i=1}^n \sigma_i \scap{Z_i}{\zeta}\Big] \leq \frac{M}{\sqrt{n}} \sqrt{\E[\norm{Z}^2]}.
\end{equation}
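This bound is easy to probe by Monte Carlo: for a fixed sample, the supremum over the ball is attained along $\sum_i \sigma_i Z_i$, and the gap to the bound is a Jensen gap. A minimal sketch with synthetic data ($n$, $d$, $M$ and the trial count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, M = 100, 10, 2.0
Z = rng.normal(size=(n, d))                  # a fixed sample (Z_i)

# sup_{||zeta|| <= M} (1/n) sum_i sigma_i <Z_i, zeta> is attained at
# zeta = M * u, with u the unit vector along sum_i sigma_i Z_i.
trials = 4000
sups = np.empty(trials)
for t in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)
    sups[t] = (M / n) * np.linalg.norm(sigma @ Z)

# Conditional version of the lemma's bound for this fixed sample.
bound = (M / np.sqrt(n)) * np.sqrt((Z ** 2).sum(axis=1).mean())
assert sups.mean() <= bound                  # Jensen: E||W|| <= sqrt(E||W||^2)
```

Since $\E_\sigma\|\sum_i \sigma_i Z_i\|^2 = \sum_i \|Z_i\|^2$, the bound follows by Jensen's inequality, which is what the simulation exhibits.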

Lemma. Moreover, when $h:\R\to\R$ is Lipschitz, the following contraction principle holds:
\[
\E\Big[\sup_f \frac{1}{n}\sum_{i=1}^n \sigma_i h(f(Z_i))\Big] \leq \sup_x |h'(x)| \cdot \E\Big[\sup_f \frac{1}{n}\sum_{i=1}^n \sigma_i f(Z_i)\Big].
\]

Lemma. When minimizing a regularized risk, one can reduce the search over $\Theta$ to the constraint set $\norm{\Lambda}_{HS} \leq \lambda^{-1} k$.

Lemma. Let ${\cal L}(\Theta) = \E_Z[\ell(\Theta, Z)]$ be a convex function optimized over a convex domain. Given $n$ samples $(Z_i)$, (unbiased) stochastic gradient descent with final averaging can achieve an excess risk
\begin{equation}
\E_{(Z_i)}[{\cal L}(\hat\Theta)] - {\cal L}(\Theta_*) \leq 2 M V n^{-1/2},
\end{equation}
with $M^2 = \norm{\Theta_* - \Theta_0}^2$ and $V^2 = \E[\norm{\nabla_\Theta\ell(\Theta, Z_i)}^2]$. Moreover, if ${\cal L}$ is $\alpha$-smooth, then it can achieve
\begin{equation}
\E_{(Z_i)}[{\cal L}(\hat\Theta)] - {\cal L}(\Theta_*) \leq 2 M\sigma n^{-1/2} + \alpha M^2 n^{-1},
\end{equation}
where $\sigma^2 = \E[\norm{\nabla {\cal L} - \nabla \ell}^2]$. Finally, when ${\cal L}$ is $\alpha$-strongly convex, it achieves
\begin{equation}
\E_{(Z_i)}[{\cal L}(\hat\Theta)] - {\cal L}(\Theta_*) \leq \frac{2V^2}{\alpha (n+1)}.
\end{equation}
As a consequence, given $n$ data samples, there exists an empirical estimate $\hat\Theta$ that guarantees those generalization bounds.
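The single-pass averaged-SGD scheme behind these rates can be sketched on a synthetic least-squares objective (the data model, step size $\propto 1/\sqrt{n}$, and the error threshold are illustrative choices, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20000
theta_star = rng.normal(size=d)

# Streaming samples from a synthetic least-squares model l(theta, (z, y)) = (theta^T z - y)^2.
def sample():
    z = rng.normal(size=d)
    return z, z @ theta_star + 0.1 * rng.normal()

# Single pass of SGD with constant step ~ 1/sqrt(n), plus running averaging.
theta, avg = np.zeros(d), np.zeros(d)
for t in range(1, n + 1):
    z, y = sample()
    theta -= (2 * (theta @ z - y) * z) / np.sqrt(n)
    avg += (theta - avg) / t

excess = np.sum((avg - theta_star) ** 2)     # proxy for the excess risk
assert excess < 0.05
```

Each sample is used exactly once, so the stochastic gradients are unbiased for $\nabla{\cal L}$, which is the setting of the lemma.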

Lemma. An unbiased formulation of ${\cal L}$ is based on $\ell$ defined through
\begin{align}
\nonumber
\nabla_{\Lambda}\ell(S\Theta;\lambda) &= \frac{2(\beta - 1)}{m}\sum_{j\in[m]} \phi(\xi_{1j})\phi(\xi_{1j})^\top - \frac{2\beta}{m(m-1)} \sum_{1\leq j \neq k \leq m} \phi(\xi_{1j})\phi(\xi_{1k})^\top \\
&\qquad\qquad+ \frac{1}{m^2} \sum_{i,i'=1}^2 \sum_{j, k=1}^m \scap{\Lambda}{\phi(\xi_{ij})\phi(\xi_{i'k})^\top} \phi(\xi_{ij})\phi(\xi_{i'k})^\top.
\end{align}
Moreover, when ${\cal L}$ is regularized, one has to add $\lambda I$ to get a gradient of the regularized risk.

Lemma. For $\ell$ given in \eqref{eq:sgd_sample}, bounds on the gradient norm and its variance are
\begin{equation}
\norm{\nabla_\Lambda \ell} \leq 2\kappa^2 + \kappa^4 \sup\norm{\Lambda}, \qquad\text{and}\qquad \E[\norm{\nabla_\Lambda \ell-\nabla {\cal L}}^2] \leq (\sigma_X^2 + m^{-1}\sigma_\xi^2)(1+\sup\norm{\Lambda}^2),
\end{equation}
where $\sigma_X$ relates to the variance of $\E[\psi(\xi)\midvert X]$ and $\sigma_\xi$ relates to the average variance of $\xi\midvert X$.

Lemma. As a function of $\Lambda$, the objective ${\cal L}$ is $\alpha$-smooth with $\alpha = \kappa^4$, where $\kappa$ is a bound on $\norm{\phi}$. Moreover, when $\X$ is finite, it is $\alpha'$-strongly convex, with $\alpha'$ the square of the eigengap of $K = SS^\top$.

Lemma. The parity functions $\chi_S$ form an orthonormal basis of $L^2(\X)$.

Lemma. The cyclic parities $(\psi_{m, S})$, for $m \in [k_S]$ and $S$ ranging over a set of representatives of the orbits of the translation action, form an orthogonal basis of $L^2(\X, \C, \mu)$, where $\mu$ is the uniform measure on $\X$. Moreover, they diagonalize the operators $A:L^2\to L^2$ defined as $Af(x) = f(a\cdot x)$ for any $a\in [d]$.

Lemma. In the uniform Boolean setting, when augmentations are defined as $\xi = a\cdot X$ where $a$ is a permutation sampled from a probability distribution $p$ on $\Sfrak_d$,
\[
Tf(x) = \sum_{a,b \in \Sfrak_d} p(a) p(b) f((a^{-1}b) \cdot x).
\]

Lemma. For any linear model defined through the features $\phi$ in \eqref{eq:bool_features}, the integral operator $K:L^2(\X)\to L^2(\X)$ is diagonalized in the parity basis,
\begin{equation}
K\chi_S = e_S^2 \chi_S.
\end{equation}

Lemma. [Spectral decomposition of dot-product kernels] Any dot-product kernel is diagonalizable in the parity basis. Specifically, there exists $(\nu_i)_{i\in[0,d]} \in \R^{d+1}$ such that, when $\mu_\X$ is the uniform distribution on the hypercube,
\begin{equation}
K\chi_S = \nu_{|S|} \chi_S.
\end{equation}
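This diagonalization can be verified exactly by brute force on a small hypercube. The sketch below (with illustrative $d=4$ and profile $h(t)=e^t$, both arbitrary choices) checks that every parity function is an eigenfunction of the integral operator and that the eigenvalue depends only on $|S|$:

```python
import itertools
import numpy as np

d = 4
X = np.array(list(itertools.product([-1, 1], repeat=d)))  # all 2^d points
h = np.exp                                                # arbitrary dot-product profile

# Integral operator K w.r.t. the uniform measure: (Kf)(x) = E_{x'}[h(<x,x'>/d) f(x')].
G = h(X @ X.T / d) / len(X)

nus = {r: [] for r in range(d + 1)}
subsets = itertools.chain.from_iterable(
    itertools.combinations(range(d), r) for r in range(d + 1))
for S in subsets:
    chi = X[:, S].prod(axis=1) if S else np.ones(len(X))
    v = G @ chi
    nu = (v @ chi) / len(X)                 # <chi_S, K chi_S> in L^2(mu)
    assert np.allclose(v, nu * chi)         # chi_S is an exact eigenfunction
    nus[len(S)].append(nu)

# The eigenvalue depends only on |S|, and decreases with |S| for this h.
levels = [np.mean(nus[r]) for r in range(d + 1)]
for r in range(d + 1):
    assert np.allclose(nus[r], levels[r])
assert all(levels[r] > levels[r + 1] for r in range(d))
```

The check works because $h(x^\top y/d)$ only depends on $x\odot y$, so $K$ is a group convolution on $\{-1,1\}^d$ and the characters $\chi_S$ diagonalize it exactly.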

Lemma. The operator $K$ associated with a dot-product kernel in the uniform Boolean setting commutes with all the operators $T$ that can be built from bitwise noise, cropping, translations or index flip.

Proposition. If $T$ and $K$ commute, and if $(\lambda_i)$ are the eigenvalues of $T$ and $(f_i)$ its eigenfunctions, then there exist $(\theta_i)$ such that $f_i = f_{\theta_i}$ \eqref{eq:rkhs}. Moreover, the optimal representations to minimize the regularized loss are the $f_i$ that maximize $\beta\lambda_i - \lambda\norm{\theta_i}^2$. In other terms, the regularization biases towards representations that have a small complexity with respect to the model of computation.

Proposition. [Uniqueness of minimizers] The minimizers of ${\cal L}$ are unique up to orthogonal transformations and eigenfunction picking. More specifically, if $U\in\R^{k\times k}$ is orthogonal, i.e. $U^\top U = I$, then ${\cal L}(\psi) = {\cal L}(U\psi)$; and if $\lambda_k = \lambda_{k+1}$, one can choose different eigenfunctions as $f_k$ in the eigen-decomposition $(\lambda_i,f_i)$ of $T_\beta$.

Proposition. [Random noise] Consider the flip of each bit of $x$ with probability $p$, formally via the operation
\begin{equation}
B^p_y(x) = x \odot y, \qquad y \sim Ber(\{-1,+1\}, p)^{\otimes d},
\end{equation}
where the operation $x \odot y$ applies pointwise multiplication and the distribution $Ber(\{-1,+1\}, p)$ returns the value $-1$ with probability $p$ and $+1$ with probability $1-p$. Under the augmentations $\xi = X\odot y$, $T$ is diagonalized in the parity basis with
\begin{equation}
T\chi_S = (1 - 2p)^{2|S|} \chi_S.
\end{equation}
In other terms, $T$ applies a factor $(1-2p)^{2|S|}$ to reduce the effect of higher-order Fourier functions.
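One can confirm this factor exactly on a small hypercube by enumerating all flip patterns. Since $T$ acts as the square of the conditional-expectation operator $f\mapsto \E[f(\xi)\midvert X]$, checking $\E[\chi_S(x\odot y)] = (1-2p)^{|S|}\chi_S(x)$ yields the squared factor above (a toy check; $d$, $p$ and $S$ are illustrative choices):

```python
import itertools
import numpy as np

d, p, S = 4, 0.2, (0, 2, 3)          # illustrative dimension, flip rate, subset

def chi(x, S):
    return np.prod([x[i] for i in S])

# Exact enumeration of all flip patterns y, with P(y_i = -1) = p.
for x in itertools.product([-1, 1], repeat=d):
    e = 0.0
    for y in itertools.product([-1, 1], repeat=d):
        prob = np.prod([p if yi == -1 else 1 - p for yi in y])
        e += prob * chi(np.multiply(x, y), S)
    # Conditional expectation: E[chi_S(x * y)] = (1-2p)^|S| * chi_S(x).
    assert np.isclose(e, (1 - 2 * p) ** len(S) * chi(x, S))
```

The factorization holds because the bits of $y$ are independent, so the expectation splits into $\prod_{i\in S}\E[y_i] = (1-2p)^{|S|}$.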

Proposition. [Cropping/Masking] Consider the cropping operation within a window of size $w$, formally defined as
\begin{equation}
[M^w_{a}(x)]_i = \begin{cases} x_i & \text{if } i \in [a, a+w), \\ Ber(\{-1,+1\}, 0.5) & \text{otherwise}, \end{cases} \qquad a \sim [d],
\end{equation}
where $[a, a+w) = \{a, a+1, \dots, a+w-1\}$, $a$ is drawn from the uniform distribution over $[d]$, and the distribution $Ber(\{-1,+1\}, 0.5)$ returns a random bit with equal probability for $+1$ and $-1$, thus effectively masking the values outside of the window $[a, a+w)$. Under the augmentations $\xi = M^w_a(X)$, $T$ is diagonalized in the parity basis with
\begin{equation}
T\chi_S = \frac{\max(1+w-\diam(S), 0)^2}{d^2}\cdot \chi_S \qquad\text{with}\qquad \diam(S) = \min\{ v \midvert \exists a \in [d],\ S \subseteq [a, a + v)\}.
\end{equation}
In other terms, the action of cropping effectively removes from the kernel any dependence on parity functions of high order whose support does not fit within a window of size $w$.
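The combinatorial factor $\max(1+w-\diam(S),0)$ is just the number of windows containing $S$: since the masked coordinates are centered, $\E[\chi_S(\xi)\midvert X, a]$ vanishes unless $S$ fits in the window, and $T$ squares the resulting fraction. The sketch below checks the count against the diameter formula, treating windows cyclically (an assumption consistent with the translation-invariant setting; $d$, $w$ and the test sets are illustrative choices):

```python
d, w = 8, 4                          # illustrative dimension and window size

def window(a, v):
    return {(a + t) % d for t in range(v)}       # cyclic window [a, a+v)

def diam(S):
    return min(v for v in range(1, d + 1)
               for a in range(d) if set(S) <= window(a, v))

def n_windows_containing(S):
    return sum(set(S) <= window(a, w) for a in range(d))

# Fraction of windows containing S gives E[chi_S(xi) | X] = (count/d) chi_S(X);
# T then squares this factor, matching max(1+w-diam(S), 0)^2 / d^2.
for S in [(0,), (0, 1), (0, 3), (2, 5), (0, 4)]:
    assert n_windows_containing(S) == max(1 + w - diam(S), 0)
```

For instance, with $d=8$ and $w=4$, the set $\{0,4\}$ has diameter $5>w$, so no window contains it and its parity is eliminated.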

Proposition. [2D Cropping] Consider the 2D setting $\X = \{-1,+1\}^{m \times d}$ where inputs are organized into an $m \times d$ grid. Consider the cropping operation to a window of size $v \times w$, formally
\begin{equation}
[M^{v \times w}_{a,b}(x)]_{i+jm} = \begin{cases} x_{i+jm} & \text{if } i \in [a, a+v),\ j \in [b, b+w), \\ Ber(\{-1,+1\}, 0.5) & \text{otherwise}, \end{cases} \qquad (a, b) \sim [m]\times [d].
\end{equation}
Under the augmentation $\xi = M^{v\times w}_{a,b}(X)$, $T$ is diagonalizable in the parity basis and
\begin{equation}
T\chi_S = \frac{1}{m^2d^2}\big(1 + v - \diam_{e_1}(S)\big)_+^2\cdot \big(1 + w - \diam_{e_2}(S)\big)_+^2 \chi_S,
\end{equation}
where $\diam_{e_1}(S)$ is the diameter of $S$ projected onto the first dimension.

Proposition. [Flipping] Consider the operator which, with probability $p$, flips the indices into reverse order, formally
\begin{equation}
[R(x)]_i = x_{-i}.
\end{equation}
Under the augmentation $\xi = R(X)$,
\begin{equation}
T = (1-2p + 2p^2) I + 2p(1-p) J,
\end{equation}
where $J$ is the involution that matches any set $S$ to its mirror $\tilde S = \{-i\midvert i\in S\}$. In this setting, $T$ is diagonalized by the $2^{-1/2}(\chi_S + \chi_{\tilde S})$ and $2^{-1/2}(\chi_S - \chi_{\tilde S})$ for $S\subseteq [d]$.

Proposition. [Translations] Consider the translation operator defined formally as
\begin{equation}
[T_a(x)]_i = x_{i - a}, \qquad a \sim p,
\end{equation}
with $p$ a probability distribution on $[d]$. Under the augmentation $\xi = T_a(X)$, $T$ is diagonalized in $\C$ by the cyclic parity functions \eqref{eq:spherical},
\begin{equation}
T\psi_{m, S} = \frac{d^2}{k_S^2} \Big|\hat{p}\Big(\frac{m d}{k_S}\Big)\Big|^2 \psi_{m, S},
\end{equation}
where $\hat p$ is the Fourier transform of $p$, defined for $\omega \in [d]$ by
\begin{equation}
\hat p(\omega) = \sum_{a\in[d]} p(a) \exp\Big(\frac{-2i\pi a\omega}{d}\Big).
\end{equation}

Proposition. [Linearization of a simple network] Define a simple neural architecture as
\begin{equation}
f(x) = \frac{\Delta}{N\omega d} \sum_{i\in[N]} \sum_{k\in[d/\Delta]} a_{ik} \sum_{s\in[\omega]} \sigma\big(\scap{w_i}{x_{(k\Delta + s)}^{(q)}}\big),
\end{equation}
where $x_{(k)}^{(q)} = (x_k, x_{k+1}, \cdots, x_{k+q-1})$ is a local patch of size~$q$ (with indices defined modulo $d$), the $w_i$ are weights initialized from a rotation-invariant distribution~$\cal W$, $\sigma:\R\to\R$ is an activation function, $\omega\in\N$ is the size of the average pooling window, $\Delta\in\N$ is the pooling stride, and $N$ is the number of channels. The linearization of this network near initialization yields the kernel
\begin{equation}
k(x, x') = \phi(x)^\top \phi(x') = \frac{\Delta}{d\omega} \sum_{k\in[d/\Delta]} \sum_{s,s'\in [\omega]} h\big(\scap{x_{(k\Delta + s)}^{(q)}}{x'_{(k\Delta + s')}^{(q)}} / q\big),
\end{equation}
where
\begin{equation}
h(\scap{u}{v} / q) = \E_{w\sim\cal W}\big[\sigma(\scap{u}{w} / q) \sigma(\scap{v}{w} / q) + \sigma'(\scap{u}{w} / q)\sigma'(\scap{v}{w} / q)\cdot\scap{u}{v} / q\big].
\end{equation}

Proposition. [Linearization of a fully connected network] A one-hidden-layer fully connected network
\[
f_{FC}(x) = \frac{1}{N}\sum_{i\in[N]} a_{i} \sigma(w_i^\top x)
\]
can be linearized as a dot-product kernel with $k_{FC}(x, y) = h(x^\top y / d)$ for $h$ defined in \eqref{eq:boolean_kernel}. Moreover, the resulting integral operator $K_{FC}$ is diagonalized in the parity basis as
\[
K_{FC}\chi_S = \nu_{h}(d,|S|) \chi_S,
\]
where the coefficients are given by $\nu_h(d,\ell) = \scap{h}{Q_\ell}_{L^2(\tau)}$ as in \eqref{eq:gg_coeff}. Note that the eigenvalues~$\nu_h(d,\ell)$ are non-increasing with~$\ell$, and for fixed~$\ell$ and large~$d$ they satisfy~$\nu_h(d, \ell) = \Theta_d(d^{-\ell})$. More generally, it can be shown that $\lim_{d \to \infty} d^\ell \nu_h(d,\ell) = \frac{d^\ell}{dt^\ell}h(t) \bigr\rvert_{t=0}$.

Proposition. [Linearization of a convolutional network] A convolutional layer followed by a fully connected layer
\[
f_{CNN}(x) = \frac{1}{N d} \sum_{i\in[N]} \sum_{k\in[d]} a_{ik} \sigma\big(w_i^\top x_{(k)}^{(q)}\big)
\]
can be linearized with the $h$ of \eqref{eq:boolean_kernel} as
\[
k_{CNN}(x, y) = \frac{1}{d} \sum_{k\in[d]} h\big(\scap{x_{(k)}^{(q)}}{y_{(k)}^{(q)}}/ q\big).
\]
In the Boolean setting, the resulting integral operator $K_{CNN}$ is diagonalized in both the parity and the cyclic basis as
\[
K_{CNN}\psi_{m, S} = \begin{cases} \nu_h(q,|S|)\, \frac{q+1-\diam(S)}{d}\, \psi_{m, S}, & \text{if } \diam(S) \leq q, \\ 0 & \text{otherwise}, \end{cases}
\]
where the $\nu_h(q,\ell)$ are defined in Proposition~\ref{prop:eigbasis_FC}.
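The case distinction can be verified by brute force on a small hypercube: the CNN kernel depends only on $x\odot y$, so parity functions are exact eigenfunctions, and the eigenvalue factorizes into a single-patch coefficient $\nu_h(q,|S|)$ times the fraction of patches containing $S$. The sketch below uses illustrative choices ($d=6$, $q=3$, cyclic patches, profile $h(t)=e^t$) and checks both branches:

```python
import itertools
import numpy as np

d, q = 6, 3
h = np.exp                                              # illustrative profile
X = np.array(list(itertools.product([-1, 1], repeat=d)))

def patch(x, k):
    return x[[(k + t) % d for t in range(q)]]           # cyclic patch of width q

# Gram of the CNN kernel as an integral operator w.r.t. the uniform measure.
G = np.array([[np.mean([h(patch(x, k) @ patch(y, k) / q) for k in range(d)])
               for y in X] for x in X]) / len(X)

# Single-patch coefficient nu_h(q, r), computed on the q-dimensional cube.
Zq = np.array(list(itertools.product([-1, 1], repeat=q)))
def nu(r):
    chi_q = Zq[:, :r].prod(axis=1) if r else np.ones(len(Zq))
    return np.mean(h(Zq.sum(axis=1) / q) * chi_q)

def n_patches_containing(S):
    return sum(all((i - k) % d < q for i in S) for k in range(d))

for S in [(0,), (0, 1), (0, 2), (0, 1, 2), (0, 3)]:
    chi = X[:, S].prod(axis=1)
    v = G @ chi
    lam = n_patches_containing(S) / d * nu(len(S))      # predicted eigenvalue
    assert np.allclose(v, lam * chi)
```

In particular, for $S=(0,3)$ no patch of width $3$ contains both coordinates, so the predicted (and computed) eigenvalue is zero, matching the second branch.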

Definition. [Distribution $\epsilon$-robustness] A closed convex set of functions ${\cal F}$ is said to be $\epsilon$-robust to distribution shift conditionally on the function $f$ if
\[
\norm{\Pi_{\cal F}^{(\rho_\X)}f - \Pi_{\cal F}^{(\mu_\Xi)}f}_{L^2(\rho_\X)} \leq \epsilon \norm{f}_{L^2(\rho_\X)},
\]
where $\Pi_{\cal F}^{(\tau)}$ is the orthogonal projection onto ${\cal F}$ in $L^2(\tau)$.

Remark. [Contrastive learning with ${\cal L}$] When $\beta=1$, the population loss ${\cal L}$ is equivalent to the spectral contrastive loss studied in haochen_provable_2021 as a theoretically friendly proxy for SimCLR chen_simple_2020. In other terms, ${\cal L}$ analyzes both contrastive and non-contrastive approaches to representation learning.

Remark. The operator $\Sigma = \E_\xi[\phi(\xi)\phi(\xi)^\top] \in {\cal H}\otimes{\cal H}$ is trace-class.

Remark. The operator $\Sigma_X = \E_X[\E_{\xi,\xi'}[\phi(\xi)\phi(\xi')^\top\midvert X]] \in {\cal H}\otimes{\cal H}$ verifies $0 \preceq \Sigma_X \preceq \Sigma$, with $\preceq$ the Loewner order ($A\preceq B$ if $B - A$ is positive semi-definite). As a consequence, $\Sigma_X$ is trace-class and $\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2}$ is continuous.

Remark. Recently, haochen2022theoretical have taken this second perspective on inductive bias by looking at the ``barrier'' case where one can only match eigenfunctions that belong to the function space $\Psi$. In the kernel regime, this is deceptive since, for example, when considering the Gaussian kernel $\phi(x)^\top \phi(x') = \exp(-\norm{x-x'}^2)$, $\Psi$ is made of analytic functions Sun2008, hence cannot parameterize an indicator function without being one everywhere; therefore their approach would fail to explain how the Gaussian kernel can learn fast under the cluster assumption.

Remark. Up to now, we have studied all the operators in the space $L^2(\X,\R,\mu_\X)$, while the main text considered those operators in $L^2(\X, \R,\mu_\Xi)$. This is justified by the fact that all the transformations studied earlier leave the uniform distribution invariant, hence
\begin{equation}
L^2(\mu_\X) = L^2(\mu_\Xi).
\end{equation}

Example. [Cropping] Consider the hypercube setting where $\X = \{-1, 1\}^n$ and $X$ is uniformly distributed. A basis of $L^2(\X,\R)$ is given by the parity functions $\chi_S:x\mapsto\prod_{i \in S} x_i$ for all subsets $S\subseteq [n]$. Pre-training via cropping with window size $v \times w$ sets $T\chi_S = 0$ for all $S$ whose support exceeds a window of size $v \times w$. For all the other $S$, $T\chi_S = \lambda_S \chi_S$, where $\lambda_S$ decreases with the diameter of $S$. In other terms, pre-training with 2D cropping eliminates the influence of functions which act globally outside of the cropping window. This, in effect, imparts a locality to the induced representation $\psi$ which is often desirable for generalization.

Example. [Dot-product kernel] In the Boolean hypercube setting of Example~\ref{ex:2d_cropping}, many linear models \eqref{eq:rkhs} take the form $\phi(x)^\top \phi(y) = h(\scap{x}{y})$ (e.g., the classical NTK linearization of a fully connected layer), leading to an integral operator~$K$ that is diagonalizable by the parity functions. More precisely, there exists $(\nu_i)\in\R^d$ such that $K\chi_S =\nu_{|S|}\chi_S$, where $|S|$ is the cardinality of $S$ and~$\nu_{|S|}$ decreases with~$|S|$. In the setting of crops, $T$ pushes towards representations built on parity functions with small diameter ($\psi = (\chi_S)_S$ for $S$ with small diameter), while the inductive bias acts on the cardinality of the sets $S$, pushing towards the $\chi_S$ that maximize~$\nu_{|S|}$. Formal derivations are provided in Appendix~\ref{app:examples}.

Example. If $\rho_\X$ has a density with respect to $\mu_\Xi$ which is bounded from below by $\delta \in (0, 1]$ on its support, i.e., $\mu_\Xi = \delta \rho_\X + (1-\delta)\mu_\perp$ with $\mu_\perp$ a probability measure on $\X$, then Assumption~\ref{ass:interpolation_simple} is met with $c_r = 1 / \delta$.

Example. Let $\Sigma_\tau = \E_{X\sim\tau}[\phi(X)\phi(X)^\top]$ be the covariance matrix of $\phi$ under the distribution~$\tau$. When there exists $c$ such that $\Sigma_{\rho_\X} \preceq c\Sigma_{\mu_\Xi}$ (i.e., $c \Sigma_{\mu_\Xi} - \Sigma_{\rho_\X}$ is positive semi-definite), then Assumption~\ref{ass:interpolation_simple} holds with $c_r=c$.

Example. If $\psi_\sharp\mu_\Xi = \psi_\sharp\rho_\X$ holds for the optimal representation $\psi = (f_i)$, with $(f_i)$ the positive eigenfunctions of $T_\lambda$, and there exists a measurable function $g:\R^k\to\Y$ such that $f^* = g \circ \psi$, then Assumption~\ref{ass:robust_simple} is verified.

Example. [Cluster assumption] If the support of $\mu_\Xi$ has $k$ connected components, $f^*$ is constant on those clusters, and $\lambda=0$, then Assumption~\ref{ass:source} holds.

Example. When the distribution of augmentations has a density $p$ with respect to some measure and $(x, \xi)\mapsto p(\xi\midvert x)/p(\xi)$ is in $L^2(\mu)$, or when $\X$ is finite, $T$ can be shown to be a compact operator, hence to have a pure point spectrum according to the spectral theorem.

Example. When considering the radial basis function kernel $\phi(x)^\top\phi(x') = \exp(-\norm{x-x'}^2)$, $\Psi$ is the space of analytic functions Sun2008, which is known to be small compared to $L^2$ spaces Kolmogorov1959. As a consequence, one can think of $q=+\infty$ in the previous lemma. More generally, when $\phi$ is bounded, $K$ is trace-class and one can take $q = 1$.

Example. [Smoothing effect of translations] To see the effect of augmentation strength, consider a distribution~$p$ over translations that takes the form~$p(a) = \omega p_0(\omega a)$, where~$p_0$ is a localized window shape (e.g., uniform or Gaussian) that sums to~$1$. Here~$\omega \approx 1/\Delta$ is inversely related to the window size~$\Delta$, which controls the ``strength'' or range of augmentations. Then we have
\[
|\hat p(m)|^2 = |\hat p_0(m / \omega)|^2 \approx |\hat p_0(\Delta m)|^2.
\]
Here, the squared Fourier coefficients~$|\hat p_0(m)|^2$ typically decay with the frequency~$m$, which shows that~$T$ has a smoothing effect penalizing eigenfunctions~$\psi_{m,S}$ with larger~$m$, i.e., those which oscillate more quickly. The above formula also highlights that increasing the augmentation strength~$\Delta$ will lead to faster decay with~$m$, while leaving the translation-invariant eigenfunctions ($m = 0$) unaffected.

Example. [Interplay between the FC kernel and translation augmentations] Recall from Example~\ref{ex:translation_smoothing} that when sampling translations from a localized window, the eigenvalues of~$T$ are of the form~$|\hat p(m)|^2$ and typically decay with the frequency index $m$ in $\psi_{m,S} = \frac{1}{k_S}\sum_{k\in[k_S]} e^{2i\pi k m / k_S} \chi_{S+k}$ for any set~$S$ with no periodicity. In contrast, the eigenvalues $\nu_h(d,|S|)$ of $K$ for the eigenfunctions $\psi_{m,S}$ decay as $\Theta_d(d^{-|S|})$, independently of $m$. Regularization with parameter $\lambda$ thus shrinks the eigenvalues to $|\hat p(m)|^2 - \lambda \nu_h(d,|S|)^{-1}$ after pre-training. This most notably eliminates contributions from eigenfunctions $\psi_{m,S}$ where $m$ is small (i.e., near-invariant) but $|S|$ is large. See Figures~\ref{fig:degree_vs_invariance} and~\ref{fig:downstream_error} for an illustration.

Example. [Interplay between kernel for CNN and translation augmentations] Consider the same setting as in Example~\ref{ex:interplay_FC_translation}, with translations sampled from a localized window. For a single-layer CNN with patch width $q$, eigenfunctions correspond to parity functions $\chi_S$, or cyclic parities~$\psi_{m,S}$ where $\diam(S) \leq q$, with corresponding eigenvalue $\nu_h(q,\ell)\frac{q+1-\diam(S)}{d}$. Here, the eigenfunctions $\psi_{m,S}$ of $T$ for $S$ with diameter larger than $q$ are completely eliminated, regardless of the regularization strength~$\lambda$. For eigenfunctions $\psi_{m,S}$ where $\diam(S) \leq q$, the CNN shrinks the contribution to $|\hat p(m)|^2 - \lambda \paren{\nu_h(q,\ell)\frac{q+1-\diam(S)}{d}}^{-1}$, which shrinks more when $\diam(S)$ is larger.
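A minimal sketch of the resulting eigenvalue pattern for $K$, treating $\nu_h(q,\ell)$ as a constant and using assumed toy sizes for $d$ and $q$:

```python
d, q = 32, 5  # assumed signal length and patch width

def cnn_k_eig(diam_S, nu_hq=1.0):
    """K-eigenvalue of a cyclic parity psi_{m,S} under a one-layer CNN:
    nu_h(q, l) * (q + 1 - diam(S)) / d, and exactly 0 once diam(S) > q."""
    if diam_S > q:
        return 0.0
    return nu_hq * (q + 1 - diam_S) / d

inside = cnn_k_eig(3)   # diam(S) <= q: retained, with value (q+1-3)/d
outside = cnn_k_eig(7)  # diam(S) > q: eliminated for any lambda
```

Supports that fit inside a patch are retained with an eigenvalue decreasing in $\diam(S)$, while supports wider than the patch are zeroed out, so no amount of regularization can recover them.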

Example. [Translations with convolutional models on the hypercube] We consider random cyclic translations sampled from a distribution~$h$, data on the binary hypercube, and a convolutional kernel of the form~$k(x, y) = \sum_u \kappa(x_u^\top y_u)$, where~$x_u, y_u$ are patches centered at position~$u$, with cyclic symmetry. Here, the convolutional architecture limits the interactions of each polynomial to be within a patch. We may choose a basis involving oscillating averages of shifts of the basis functions obtained on a single patch, indexed by sets~$S$ of coordinates and by a spatial frequency~$\omega$. Then, the eigenvalues of~$K$ decay with the degree~$|S|$ and are independent of the spatial frequency~$\omega$, while the eigenvalues of~$T$ only depend on~$\omega$ and are given by~$|\hat h(\omega)|^2$. In particular, the spectrum of~$K$ puts no mass on high-degree harmonics that span multiple patches, and the eigenvalues of~$T$ at each degree correspond to Fourier coefficients of the probability mass function of the translation distribution \citep{bietti2022approximation,misiakiewicz2021learning}.

Proof. This follows from the fact that
\begin{align*}
\omega(\psi) &= \E_X\bracket{\E_{\xi,\xi'}\bracket{\norm{\psi(\xi) - \psi(\xi')}^2 \midvert X}}
\\&= \E_X\bracket{\E_{\xi,\xi'}\bracket{\norm{\psi(\xi) - \E\bracket{\psi(\xi)\midvert X} + \E\bracket{\psi(\xi)\midvert X} - \psi(\xi')}^2 \midvert X}}
\\&= \E_X\bracket{\E_{\xi,\xi'}\bracket{\norm{\psi(\xi) - \E\bracket{\psi(\xi)\midvert X}}^2 + \norm{\E\bracket{\psi(\xi)\midvert X} - \psi(\xi')}^2 \midvert X}}
\\&= 2\E_X\bracket{\E_{\xi}\bracket{\norm{\psi(\xi) - \E\bracket{\psi(\xi)\midvert X}}^2 \midvert X}}
\\&= 2\min_{\psi_0:\X\to\R}\E_X\bracket{\E_{\xi}\bracket{\norm{\psi(\xi) - \psi_0(X)}^2 \midvert X}}
\\&\leq 2\E_X\bracket{\E_{\xi}\bracket{\norm{\psi(\xi)}^2 \midvert X}} = 2\E_{\xi}\bracket{\norm{\psi(\xi)}^2} = 2\norm{\psi}_{L^2(\mu_\Xi)}^2,
\end{align*}
where the third equality holds since the cross term vanishes in expectation, $\xi$ and $\xi'$ being independent given $X$. Hence, for any $\psi$, with the $L^2(\mu_\Xi)$ geometry we have $\psi^\top L\psi \leq 2\psi^\top \psi$, which implies, since $L$ is self-adjoint, that $\norm{L}_{\op} \leq 2$.

Proof. When $\X$ is finite, the $L^2$ spaces are finite-dimensional, so that all operators on them are compact. To prove the case with a density, let us develop $T$ as an integral operator. We have, in the $L^2(\mu_\Xi)$ geometry, for $f:\X\to\R$,
\begin{align*}
2f^\top (I - T) f &= \E_X\E\bracket{\paren{f(\xi) - f(\xi')}^2\midvert X} = \E_X\E\bracket{f(\xi)^2 + f(\xi')^2 - 2f(\xi)f(\xi')\midvert X}
\\&= 2f^\top f - 2\E_X\E\bracket{f(\xi)f(\xi')\midvert X}.
\end{align*}
This allows us to identify $T$ with an inner product: for $g:\X\to\R$ and $p$ the density of augmentations,
\begin{align*}
f^\top T g &= \E_X\E\bracket{f(\xi)g(\xi')\midvert X} = \int f(\xi)g(\xi')\, p(\xi \midvert x)\, p(\xi'\midvert x) \diff \xi' \diff \xi\, \mu_\X(\diff x)
\\&= \int \mu_\Xi(\diff \xi)\, f(\xi)\int \mu_\Xi(\diff \xi')\, g(\xi') \int \mu_\X(\diff x)\, \frac{p(\xi \midvert x)\, p(\xi'\midvert x)}{p(\xi)p(\xi')}.
\end{align*}
As a consequence, one can consider $T$ as the integral operator in $L^2(\mu_\Xi)$ associated with the kernel
\[ k(\xi, \xi') = \int \mu_\X(\diff x)\, \frac{p(\xi \midvert x)\, p(\xi'\midvert x)}{p(\xi)p(\xi')}. \]
When this kernel is bounded, or simply when $\xi \mapsto k(\xi, \xi)$ belongs to $L^1(\mu_\Xi)$, $T$ is trace-class, hence compact.

Proof. Let us decompose $A$ into a positive part $A_+ \succeq 0$ and a negative part $A_- \succeq 0$ such that $A = A_+ - A_-$. Using the fact that $B$ is positive self-adjoint, we get
\begin{align*}
\trace\paren{(B-A)^2 - A^2} &= \trace\paren{B^2 - 2B^{1/2}AB^{1/2}} = \trace\paren{B^2} - 2\trace\paren{B^{1/2}A_+ B^{1/2}} + 2\trace\paren{B^{1/2}A_-B^{1/2}}
\\&\geq \trace\paren{B^2} - 2\trace\paren{B^{1/2}A_+ B^{1/2}}.
\end{align*}
Let us decompose $B$ into $k$ symmetric operators of rank at most one as $B = \sum_{i=1}^k B_i$, such that $B_iB_j = 0$ for any $i \neq j\in[k]$. Using the different properties of the operators introduced, we proceed with
\begin{align*}
\trace\paren{(B-A)^2 - A^2} &\geq \sum_{i=1}^k \trace\paren{B_i^2} - 2\trace\paren{B_iA_+} \geq \sum_{i=1}^k \norm{B_i}_{\op}^2 - 2\norm{B_iA_+}_{\op}
\\&\geq \sum_{i=1}^k \norm{B_i}_{\op}^2 - 2\norm{B_i}_{\op}\norm{\Pi_{B_i} A_+}_{\op} \geq \sum_{i=1}^k \norm{B_i}_{\op}^2 - 2\norm{B_i}_{\op}\Big\|\prod_{j< i}(I - \Pi_{B_j}) A_+\Big\|_{\op}
\\&= \sum_{i=1}^k \paren{\norm{B_i}_{\op} - \Big\|\prod_{j< i}(I - \Pi_{B_j}) A_+\Big\|_{\op}}^2 - \Big\|\prod_{j< i}(I - \Pi_{B_j}) A_+\Big\|_{\op}^2
\\&\geq -\sum_{i=1}^k \Big\|\prod_{j< i}(I - \Pi_{B_j}) A_+\Big\|_{\op}^2 \geq -\sum_{i=1}^k \sigma_i(A_+)^2,
\end{align*}
where $\Pi_B$ denotes the orthogonal projector on the image of $B$, and $\sigma_i(A)$ the $i$-th singular value of $A$ (monotonically ordered with $\sigma_1(A)$ the biggest). The last inequality is due to the Courant--Fischer min-max principle. This inequality can be achieved with $\Pi_{B_i}$ the projection on the $i$-th eigenspace of $A$ and $\norm{B_i}_{\op} = \sigma_i(A)$. In other terms, $B$ should match the first $k$ positive eigenvalues of $A$. In the case where $A$ has fewer than $k$ positive eigenvalues, $B$ should match all the positive eigenvalues and be null on the range of $A_-$.
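The conclusion of this proof, namely that the optimal positive rank-$k$ operator $B$ matches the top $k$ positive eigenvalues of $A$, can be sanity-checked numerically. A minimal sketch, with an assumed toy spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
V, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal eigenbasis
spec = np.array([3.0, 2.0, 1.0, -0.5, -1.0, -2.0])
A = (V * spec) @ V.T  # symmetric matrix with known eigenvalues

# Optimal B keeps the k largest positive eigenvalues of A.
B = (V[:, :k] * spec[:k]) @ V[:, :k].T

objective = np.trace((B - A) @ (B - A) - A @ A)
# Lower bound from the proof: minus the sum of squared top-k positive eigenvalues.
bound = -np.sum(np.sort(np.maximum(spec, 0.0))[::-1][:k] ** 2)
```

Here `objective` attains the bound $-\sum_{i\leq k}\sigma_i(A_+)^2 = -(3^2+2^2) = -13$.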

Proof. This follows from linearity of traces and expectations, together with the fact that $\trace(AB) = \trace(BA)$:
\[ \trace\Sigma = \E_\xi\bracket{\trace\paren{\phi(\xi)\phi(\xi)^\top}} = \E_\xi\bracket{\norm{\phi(\xi)}^2} = \norm{\phi}_{L^2(\mu_\Xi)}^2 < +\infty. \]
As a consequence, $\Sigma$ is trace-class, hence compact, hence has a pure point spectrum, and since ${\cal H}$ is separable it can be diagonalized with its eigenvectors forming a basis of ${\cal H}$.

Proof. This follows from Jensen's inequality applied to $A\mapsto AA^\top$, which can be proven using the positivity of the covariance operator:
\[ 0\preceq \E[(A - \E[A])(A - \E[A])^\top] \qquad\Rightarrow\qquad \E[A]\E[A]^\top\preceq\E[AA^\top]. \]
As a consequence,
\[ \E_{\xi,\xi'}\bracket{\phi(\xi)\phi(\xi')^\top\midvert X=x} \preceq \E_\xi\bracket{\phi(\xi)\phi(\xi)^\top \midvert X=x}, \]
which implies, after taking the expectation over $X$, that
\[ \Sigma_X \preceq \Sigma. \]
As a consequence, $\trace\Sigma_X \leq \trace\Sigma < +\infty$ and $\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2} \preceq I$, hence $\norm{\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2}}_{\op} \leq 1$. The positivity follows from the fact that $\Sigma_X$ is a covariance operator:
\[ \Sigma_X = \E_X\bracket{\E_\xi\bracket{\phi(\xi)\midvert X}\,\E_\xi\bracket{\phi(\xi)\midvert X}^\top}. \]
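A quick numerical illustration of the resulting ordering $\Sigma_X \preceq \Sigma$, with assumed toy features standing in for $\phi$ and a finite set of augmentations per input:

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_aug, dim = 5, 20, 3
# phi(xi) for each of n_x inputs and n_aug augmentations per input.
phi = rng.standard_normal((n_x, n_aug, dim))

# Sigma = E[phi(xi) phi(xi)^T], averaging over inputs and augmentations.
Sigma = np.einsum('xad,xae->de', phi, phi) / (n_x * n_aug)
# Conditional means E[phi(xi) | X = x], then Sigma_X = E_X[mean mean^T].
cond_mean = phi.mean(axis=1)
Sigma_X = np.einsum('xd,xe->de', cond_mean, cond_mean) / n_x

# Jensen: Sigma - Sigma_X is positive semi-definite.
gap_eigs = np.linalg.eigvalsh(Sigma - Sigma_X)
```

All eigenvalues of the gap are nonnegative (up to floating-point error), as the proof guarantees for any choice of features.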

Proof. Let us now rewrite the different quantities appearing in ${\cal L}$ based on the parameterization $\psi = \Theta\phi$. We have
\[ \trace\E[\psi(\xi)\psi(\xi)^\top] = \trace\paren{\Theta\E[\phi(\xi)\phi(\xi)^\top]\Theta^\top} = \trace\paren{\Theta\Sigma\Theta^\top} = \trace\paren{\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2}}. \]
The adjoint $\Theta^\top$ is taken with respect to the canonical topologies on ${\cal H}$ and $\R^k$. Similarly,
\[ \trace\E_X\E\bracket{\psi(\xi)\psi(\xi')^\top\midvert X} = \trace\paren{\Theta\Sigma_X \Theta^\top} = \trace\paren{\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2}\,\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2}}. \]
For the last term, we get
\[ \trace\paren{\E[\psi(\xi)\psi(\xi)^\top]}^2 = \trace\paren{\Theta \Sigma\Theta^\top}^2 = \trace\paren{\Theta \Sigma\Theta^\top \Theta \Sigma \Theta^\top} = \trace\paren{\Sigma^{1/2}\Theta^\top \Theta \Sigma \Theta^\top \Theta \Sigma^{1/2}}. \]
Collecting the different terms, we get
\begin{align*}
&{\cal L}(\Theta\phi) + 2\lambda\trace(\Theta^\top \Theta) - k
\\&= \trace\big(2(\beta - 1) \Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2} - 2\beta\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2}\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2}
\\&\qquad\qquad\qquad\qquad+ \Sigma^{1/2}\Theta^\top \Theta\Sigma\Theta^\top \Theta\Sigma^{1/2} + 2\lambda\Sigma^{-1}\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2}\big)
\\&= \trace\paren{\paren{\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2} + (\beta-1)I - \beta\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2} + \lambda\Sigma^{-1}}^2 - \paren{(\beta-1)I - \beta\Sigma^{-1/2}\Sigma_X \Sigma^{-1/2} + \lambda\Sigma^{-1}}^2}
\\&= \trace\paren{\paren{\Sigma^{1/2}\Theta^\top \Theta\Sigma^{1/2} - \Sigma^{-1/2}((1-\beta)\Sigma + \beta \Sigma_X - \lambda)\Sigma^{-1/2}}^2 - \paren{\Sigma^{-1/2}((1-\beta)\Sigma + \beta \Sigma_X - \lambda)\Sigma^{-1/2}}^2}.
\end{align*}
This proves the first part of the lemma.
Remark that the expression of the lemma is slightly different from the generalization to continuous $\X$ suggested by \citet{haochen_provable_2021} in their Appendix F, which would reuse the work of \citet{Schiebinger2015} and consider the covariance operator with feature $\bar\phi(x) = q^{-1/2}(x)\E\bracket{\phi(\xi)\midvert X=x}$, where $q:x\mapsto\E_{X\sim\mu_\Xi}[k(x, X)]$, rather than $\Sigma^{-1/2}\Sigma_X\Sigma^{-1/2}$. Finally, let us prove that the $f_{\theta_i}$ are orthogonal in $L^2$. We have
\begin{align*}
\scap{f_{\theta_i}}{f_{\theta_j}}_{L^2(\mu_\Xi)} &= \max(\lambda_i, 0)\max(\lambda_j, 0)\, \E[\scap{\Sigma^{-1/2}u_i}{\phi(\xi)}\scap{\Sigma^{-1/2}u_j}{\phi(\xi)}]
\\&= \max(\lambda_i, 0)\max(\lambda_j, 0)\, \E[u_i^\top\Sigma^{-1/2}\phi(\xi)\phi(\xi)^\top\Sigma^{-1/2}u_j]
\\&= \max(\lambda_i, 0)\max(\lambda_j, 0)\, u_i^\top\Sigma^{-1/2}\E[\phi(\xi)\phi(\xi)^\top]\Sigma^{-1/2}u_j
\\&= \max(\lambda_i, 0)\max(\lambda_j, 0)\, u_i^\top\Sigma^{-1/2}\Sigma\Sigma^{-1/2}u_j
\\&= \max(\lambda_i, 0)\max(\lambda_j, 0)\, u_i^\top u_j = \max(\lambda_i, 0)\max(\lambda_j, 0)\, \delta_{ij}.
\end{align*}
This proves the orthogonality of the $f_{\theta_i}$ in $L^2(\mu_\Xi)$.

Proof. This follows from the fact that both $S$ and $\Sigma^{1/2}$ are square roots of $\Sigma$. Indeed, $\Sigma = S^\top S$, since for $\theta\in{\cal H}$,
\begin{align*}
\scap{\theta}{S^\top S\theta}_{\cal H} &= \scap{S\theta}{S\theta}_{L^2(\mu_\Xi)} = \E_\xi[(S\theta)(\xi)^2]
\\&= \E_\xi[\scap{\theta}{\phi(\xi)}^2] = \E_\xi[\scap{\theta}{\phi(\xi)\otimes\phi(\xi)\,\theta}]
\\&= \scap{\theta}{\E[\phi(\xi)\otimes\phi(\xi)]\theta} = \scap{\theta}{\Sigma\theta}.
\end{align*}
As a consequence, $S$ is isometric to $\Sigma^{1/2}$ (if we write the singular value decomposition of $S$ as $UDV^\top$, then $\Sigma^{1/2} = VDV^\top$). Regarding the part in $K$, one can check with the same derivation that $S^\top f = \E[f(\xi)\phi(\xi)] \in {\cal H}$, hence the value $(Kf)(\xi) = (S^\top f)^\top \phi(\xi) = \E_{\xi'}[f(\xi')\phi(\xi')^\top \phi(\xi)]$.

Proof. This lemma follows from the previous discussion. The fact that $S^{-\top}\Sigma S^{-1}$ coincides with $T$ on the $L^2(\mu_\Xi)$-closure of $\Psi$ is due to the characterization in Lemma~\ref{lem:close_rkhs}. We can nonetheless prove it in a more direct fashion, by adapting Lemma B.9 of \citet{saunshi_understanding_2022} to our case.

Proof. When the operators commute, if $f$ is an eigenfunction of $T$ with $Tf = \lambda f$, then $TKf = KTf = \lambda Kf$. This means that the eigenspaces of $T$, i.e., the spaces $\ker(T-\lambda I)$, are stable under $K$. As a consequence, $K$ can be decomposed with respect to the direct sum $L^2 = \oplus_{\lambda\in\spec(T)} \ker(T-\lambda I)$. By diagonalizing the restriction of $K$ to each of those spaces, there exists a basis that diagonalizes both $K$ and $T$.
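A small numerical sketch of this argument, with toy commuting matrices built from an assumed shared eigenbasis:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
V, _ = np.linalg.qr(rng.standard_normal((n, n)))  # shared orthonormal eigenbasis
K = (V * np.array([5.0, 4.0, 3.0, 2.0, 1.0])) @ V.T
T = (V * np.array([0.9, 0.7, 0.5, 0.3, 0.1])) @ V.T

# K and T commute by construction; since T has distinct eigenvalues,
# the eigenvectors of T must also diagonalize K.
_, U = np.linalg.eigh(T)
KU = U.T @ K @ U
off_diag = KU - np.diag(np.diag(KU))
```

The off-diagonal part of $U^\top K U$ vanishes (numerically), exhibiting the common eigenbasis claimed by the lemma.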

Proof. The proof of the lemma follows from the classical characterization of the mean square error and a triangular inequality. Introduce the following technical assumption.
\begin{assumption}
Assume that $(X, Y) \mapsto Y$ belongs to $L^2(\rho)$.
\end{assumption}
When $\ell(y, y') = (y - y')^2$, using the fact that $(X, Y) \mapsto Y - \E\bracket{Y\midvert X}$ is orthogonal in $L^2(\rho)$ to any measurable function that does not depend on $Y$,
\[ {\cal R}(f) = \E[(f(X) - Y)^2] = \E[(f(X) - \E\bracket{Y\midvert X})^2] + \E[(\E\bracket{Y\midvert X} - Y)^2]. \]
As a consequence, $f^*(x) = \E\bracket{Y\midvert X=x}$ and
\[ {\cal R}(f) - {\cal R}(f^*) = \norm{f - f^*}^2_{L^2(\rho_\X)}. \]
Let us decompose the excess of risk with the orthogonal projection onto $\hat{\cal F}$; we have
\begin{align*}
{\cal R}(f) - {\cal R}(f^*) &= \norm{f - f^*}^2_{L^2(\rho_\X)} = \norm{f - \Pi_{\hat{\cal F}}f^*}^2_{L^2(\rho_\X)} + \norm{(I - \Pi_{\hat{\cal F}})f^*}^2_{L^2(\rho_\X)}.
\end{align*}
The second term is worked out as
\begin{align*}
\norm{(I - \Pi_{\hat{\cal F}})f^*}^2_{L^2(\rho_\X)} &= \norm{(I - \Pi_{\hat{\cal F}})\Pi_{\cal F}f^* + (I - \Pi_{\hat{\cal F}})(I-\Pi_{\cal F})f^*}^2_{L^2(\rho_\X)}
\\&\leq 2\norm{(I - \Pi_{\hat{\cal F}})\Pi_{\cal F}f^*}^2_{L^2(\rho_\X)} + 2\norm{(I - \Pi_{\hat{\cal F}})(I-\Pi_{\cal F})f^*}^2_{L^2(\rho_\X)}
\\&\leq 2\norm{(I - \Pi_{\hat{\cal F}})\Pi_{\cal F}f^*}^2_{L^2(\rho_\X)} + 2\norm{(I-\Pi_{\cal F})f^*}^2_{L^2(\rho_\X)},
\end{align*}
where the last inequality is due to the fact that projections contract distances.

Proof. The two formulas can be proven at once by remarking that $\Pi_{\cal F}f^*$ is defined as $S_\psi w$ for $w$ minimizing
\[ \E[(w^\top \psi(X) - Y)^2] = w^\top \E[\psi(X)\psi(X)^\top]w - 2w^\top \E[Y\psi(X)] + \E[Y^2]. \]
Minimizing this quadratic form leads to the first result. The second result is proven in the same way after substituting the empirical distribution $n^{-1} \sum_{i\in[n]} \delta_{(X_i, Y_i)}$ for the distribution of $(X, Y)$.
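The minimization above is just the normal equations of least squares; as a numerical sketch (toy data with assumed dimensions), one can check the empirical closed form against a library solver:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
Phi = rng.standard_normal((n, p))  # rows are features psi(X_i)
w_true = np.array([1.0, -2.0, 0.5, 0.0])
Y = Phi @ w_true + 0.1 * rng.standard_normal(n)

# Empirical normal equations: w = (E_n[psi psi^T])^{-1} E_n[Y psi].
w_hat = np.linalg.solve(Phi.T @ Phi / n, Phi.T @ Y / n)
# Reference solution from the library least-squares solver.
w_lstsq, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
```

Both routes give the same minimizer, matching the substitution of the empirical distribution in the proof.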

Proof. Retaking the warm-up lemma, one can show that
\[ w_n = (\hat\Sigma + \gamma)^{-1}\hat S^\top (Y_i)_{i\in[n]}. \]
As a consequence, using the usual bias-variance decomposition and the fact that $f^* = \E_\rho\bracket{Y\midvert X=\cdot}$, we develop
\begin{align*}
&\E_{Y_i\midvert X=X_i}\bracket{\norm{f_n - \Pi_{\hat{\cal F}}f^*}^2} = \E_{Y_i\midvert X=X_i}\bracket{\norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top (Y_i)_{i\in[n]} - \Pi_{\hat{\cal F}} f^*}^2}
\\&= \E_{Y_i\midvert X=X_i}\bracket{\norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top (Y_i - \E\bracket{Y\midvert X=X_i})_{i\in[n]}}^2} + \norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top f^* - \Pi_{\hat{\cal F}}f^*}^2.
\end{align*}
The first term can be worked out with the techniques of \citet{mourtada2022exact} as
\[ \E_{(X_i, Y_i)}\bracket{\norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top (Y_i - \E\bracket{Y\midvert X=X_i})_{i\in[n]}}^2} \leq \frac{\epsilon^2}{n}\paren{1 + \frac{R^2}{\gamma n}} \trace\paren{(\Sigma+ \gamma)^{-1}\Sigma} \]
under the assumption that the variance of $Y\midvert X$ is bounded by $\epsilon^2$. We work out the second term with
\[ \norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top f^* - \Pi_{\hat{\cal F}}f^*} \leq \norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top (f^* - \Pi_{\hat{\cal F}} f^*)} + \norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top \Pi_{\hat{\cal F}}f^* - \Pi_{\hat{\cal F}}f^*}. \]
Once again, the last part can be worked out with the techniques of \citet{mourtada2022exact} to get
\[ \E_{(X_i, Y_i)}\bracket{\norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top \Pi_{\hat{\cal F}}f^* - \Pi_{\hat{\cal F}}f^*}^2} \leq \gamma\paren{1 + \frac{R^2}{\gamma n}}^2\scap{\Pi_{\hat{\cal F}}f^*}{\Sigma(\Sigma + \gamma)^{-1}\Pi_{\hat{\cal F}}f^*}_{L^2(\rho_\X)}. \]
This provides the decomposition of the lemma.

Proof. Let us set $f = (I - \Pi_{\cal F})f^*$ and $A_\gamma = A + \gamma I$ for simplicity. Remark that $f$ is orthogonal to the image of $S$, hence $S^\top f = 0$. We decompose the last quantity with
\begin{align*}
\hat\Sigma_\gamma^{-1}\hat S^\top f &= (\hat\Sigma_\gamma^{-1} - \Sigma_\gamma^{-1}) \hat S^\top f + \Sigma_\gamma^{-1} \hat S^\top f
\\&= \hat\Sigma_\gamma^{-1}(\Sigma_\gamma - \hat\Sigma_\gamma) \Sigma_\gamma^{-1} \hat S^\top f + \Sigma_\gamma^{-1} \hat S^\top f
\\&= \hat\Sigma_\gamma^{-1}(\Sigma - \hat\Sigma) \Sigma_\gamma^{-1} \hat S^\top f + \Sigma_\gamma^{-1} \hat S^\top f
\\&= \Sigma_\gamma^{-1/2}\paren{\Sigma_\gamma^{1/2}\hat\Sigma_\gamma^{-1}\Sigma_\gamma^{1/2}\, \Sigma_\gamma^{-1/2}(\Sigma - \hat\Sigma) \Sigma_\gamma^{-1/2} + I} \Sigma_\gamma^{-1/2} \hat S^\top f.
\end{align*}
Using the fact that $S$ is isometric to $\Sigma^{1/2}$, which itself is smaller than $\Sigma_\gamma^{1/2}$ (in the Loewner order), we have
\[ \norm{S(\hat\Sigma + \gamma)^{-1}\hat S^\top (I - \Pi_{\cal F}) f^*}_{L^2(\rho_\X)} \leq \paren{1 + \norm{\Sigma_\gamma^{1/2}\hat\Sigma_\gamma^{-1}\Sigma_\gamma^{1/2}}_{\op}\norm{\Sigma_\gamma^{-1/2}(\Sigma - \hat\Sigma) \Sigma_\gamma^{-1/2}}_{\op}}\norm{\Sigma_\gamma^{-1/2} \hat S^\top f}. \]
We know that
\[ \norm{\Sigma_\gamma^{1/2}\hat\Sigma_\gamma^{-1}\Sigma_\gamma^{1/2}}_{\op} \leq \gamma^{-1}(\norm{\Sigma}_{\op} + \gamma) \leq \gamma^{-1} \paren{\sup_{x\in\supp\rho_\X} \norm{\psi(x)}^2 + \gamma}. \]
We also have that for $A$ and $\hat A$ positive self-adjoint and any $t \in (0, 1)$, the sequence of implications
\begin{align*}
\norm{A^{-1/2}(A - \hat A) A^{-1/2}}_{\op} \leq t &\quad\Leftrightarrow\quad -t I \preceq A^{-1/2}(\hat A - A) A^{-1/2} \preceq t I
\\&\quad\Leftrightarrow\quad -t A \preceq \hat A - A \preceq t A
\\&\quad\Leftrightarrow\quad (1-t) A \preceq \hat A \preceq (1+t) A
\\&\quad\Rightarrow\quad (1+t)^{-1} A^{-1} \preceq \hat A^{-1} \preceq (1-t)^{-1} A^{-1}
\\&\quad\Leftrightarrow\quad (1+t)^{-1} \preceq A^{1/2}\hat A^{-1}A^{1/2} \preceq (1-t)^{-1}.
\end{align*}
Combining the different results leads to the lemma.

Proof. See Corollary 1 in \citet{Pinelis1986} for the first part, and \citet{tropp2015} for the matrix version.

Proof. Let us introduce
\begin{equation*}
Z_i = (I-\Pi_{\cal F})f^*(X_i)\, (\Sigma+\gamma)^{-1/2}\psi(X_i) \in \R^k.
\end{equation*}
One can check that
\[ \frac{1}{n} \sum_{i\in[n]} Z_i = (\Sigma+\gamma)^{-1/2}\frac{1}{n}\sum_{i\in[n]} (I-\Pi_{\cal F})f^*(X_i)\, \psi(X_i) = (\Sigma+\gamma)^{-1/2} \hat S^\top (I-\Pi_{\cal F})f^*, \]
as well as, since $\ima S = \cal F$,
\[ \E[Z_i] = (\Sigma+\gamma)^{-1/2} S^\top (I - \Pi_{\cal F})f^* = 0. \]
Moreover,
\[ \norm{Z_i} = \norm{(\Sigma+\gamma)^{-1/2}\psi(X_i)}\, |(I - \Pi_{\cal F})f^*(X_i)| \leq \gamma^{-1/2} M (\norm{f^*}_{L^\infty} + M\norm{f^*}_{L^2}), \]
where $M = \sup_x \norm{\psi(x)}$ and we have used the fact that
\begin{align*}
|(I - \Pi_{\cal F})f^*(X_i)| &\leq |f^*(X_i)| + |\Pi_{\cal F}f^*(X_i)| = |f^*(X_i)| + \scap{S^{-1}\Pi_{\cal F}f^*}{\phi(X_i)}
\\&\leq |f^*(X_i)| + \norm{S^{-1}\Pi_{\cal F}f^*}\norm{\phi(X_i)} \leq \norm{f^*}_{L^\infty} + M\norm{f^*}_{L^2}.
\end{align*}
Finally, we have
\begin{align*}
\E[\norm{Z_i}^2] &= \E\bracket{\norm{(\Sigma+\gamma)^{-1/2}\psi(X_i)}^2\, |(I - \Pi_{\cal F})f^*(X_i)|^2}
\\&\leq \E\bracket{\norm{(\Sigma+\gamma)^{-1/2}\psi(X_i)}^2} (\norm{f^*}_{L^\infty} + M\norm{f^*}_{L^2})^2
\\&= \trace\paren{(\Sigma+\gamma)^{-1}\Sigma} (\norm{f^*}_{L^\infty} + M\norm{f^*}_{L^2})^2.
\end{align*}
Using Bernstein's inequality leads to the control of the vector term. For the matrix term, let us introduce
\[ Z_i = U_i U_i^\top - \E[U_i U_i^\top], \qquad U_i = (\Sigma + \gamma)^{-1/2} \phi(X_i). \]
We have $(\Sigma + \gamma)^{-1/2}(\hat\Sigma - \Sigma)(\Sigma + \gamma)^{-1/2} = \frac{1}{n} \sum_{i\in[n]} Z_i$, and
\[ \sup\norm{Z} \leq \sup\norm{U}^2 \leq \gamma^{-1} M^2. \]
Finally, using the fact that $(U_iU_i^\top)^2 \preceq U_iU_i^\top \sup\norm{U_i}^2$, with the variational definition of the mean (the infimum taken with respect to the Loewner order),
\[ \E[Z_i^2] = \inf_{a} \E[(Z_i - a)^2] \preceq \E[(U_i U_i^\top)^2] \preceq \sup\norm{U}^2\,\E[U_i U_i^\top] = \sup\norm{U}^2\,(\Sigma+\gamma)^{-1}\Sigma \preceq \sup\norm{U}^2 I. \]
Applying the matrix version of Bernstein's inequality leads to the lemma.

Proof. In essence, we have two random variables: $X = \norm{\Sigma_\gamma^{-1/2}(\hat\Sigma - \Sigma)\Sigma_\gamma^{-1/2}}_{\op}$ and the vector one, $Y = \norm{\Sigma_\gamma^{-1/2} \hat S^\top (I - \Pi_{\cal F})f^*}$. We proceed using the fact that, for $X$ positive, $\E[X] = \int_{t>0} \Pbb(X>t)\diff t$, and that $ab > t$ implies, for any $s$, that $a > 1+s$ or $b > t/(1+s)$:
\begin{align*}
&\E\bracket{\min\brace{\frac{1}{1-X}, 1 + \frac{(M^2 + \gamma)X}{\gamma}}^2 Y^2}
\\&= \int_{0}^{\sup (1+\gamma^{-1}(M^2 + \gamma)X)^2Y^2} \Pbb\paren{\min\brace{\frac{1}{1-X}, 1 + \frac{(M^2 + \gamma)X}{\gamma}}^2 Y^2 > t} \diff t
\\&\leq \int \inf_{s} \brace{\Pbb\paren{\min\brace{\frac{1}{1-X}, 1 + \frac{(M^2 + \gamma)X}{\gamma}}^2 > 1+s} + \Pbb(Y^2 > t/(1+s))} \diff t.
\end{align*}
Rather than solving this in closed form, we proceed with a much simpler bound, taking $s = 1$ without any optimization. It gives the much simpler formula
\[ \E\bracket{\min\brace{\frac{1}{1-X}, 1 + \frac{(M^2 + \gamma)X}{\gamma}}^2 Y^2} \leq \Pbb\paren{\frac{1}{(1-X)^2} > 2} \sup\brace{(1+\gamma^{-1}(M^2 + \gamma)X)^2 Y^2} + 2\E[Y^2]. \]
For $Y$, we can use the same technique as before. Using that $\exp(-(a+b)^{-1}) \leq \exp(-\max(2a, 2b)^{-1}) \leq \exp(-(2a)^{-1}) + \exp(-(2b)^{-1})$, we get
\begin{align*}
\E[Y^2] &= \int_{t>0} \Pbb(Y^2> t)\diff t \leq \int_{t>0} 2\exp\paren{-\frac{nt}{a(b + 2M\gamma^{-1/2}t^{1/2}/3)}}\diff t
\\&\leq 4\int_{t>0} \exp\paren{-\frac{nt}{2ab}} + \exp\paren{-\frac{nt^{1/2}}{4 aM\gamma^{-1/2}/3}}\diff t
\\&= 8 ab n^{-1} + 256 a^2M^2\gamma^{-1} n^{-2} / 9.
\end{align*}
We conclude the lemma with the previous one.

Proof. The first bound is a direct application of the fact that $\Sigma \preceq \Sigma + \gamma$, hence $\trace((\Sigma+\gamma)^{-1}\Sigma) \leq \trace(I) = k$. The second bound is due to the fact that $\psi = \hat\Theta\phi$, hence $\norm{\psi} \leq \norm{\hat\Theta}_{\op} \norm{\phi} \leq \norm{\hat\Theta}_F \norm{\phi}$. In the meantime, if $\hat\Theta$ is regularized,
\[ \lambda \norm{\hat\Theta}_F^2 \leq {\cal L}(\hat\Theta) + \lambda\norm{\hat\Theta}_F^2 \leq {\cal L}(0) = k. \]
For the part in $f^*$, we have
\[ \norm{\Sigma^{1/2}(\Sigma + \gamma)^{-1/2}\Pi_{\hat{\cal F}}f^*} \leq \norm{\Pi_{\hat{\cal F}}f^*} \leq \norm{f^*}. \]
Finally, the last bound is due to the fact that $f^*(x)$ is the mean of $Y$ conditioned on $X=x$:
\[ |f^*(X)| = |\E\bracket{Y\midvert X}| \leq \E\bracket{Y^2\midvert X}^{1/2} \leq \sigma. \]
This ends the proof of the lemma.

Proof. When $\gamma = c \log(n)^{1+\delta} n^{-1}$, the excess of risk reads
\begin{align*}
\E_{(X_i, Y_i)}[{\cal R}(f_n) - {\cal R}(f^*)] &\leq \frac{k\epsilon^2}{n}\paren{1+\frac{M^2}{c\log(n)}} + \frac{2c\log(n)^{1+\delta}}{n}\paren{1+\frac{M^2}{c\log(n)}}^2 \norm{f^*}_{L^2(\rho_\X)}^2 + \frac{64 k a}{n}
\\&\quad+ \frac{114 a^2 M^2}{9 c n \log(n)} + O(\exp(-\log(n)^{1+\delta}/2)) + 2\norm{(I-\Pi_{\hat{\cal F}})\Pi_{\cal F}f^*}^2 + 2\norm{(I-\Pi_{\cal F})f^*}^2.
\end{align*}
Taking $c = M^2$ leads to the lemma.

Proof. For simplicity, let us drop the dependency on $\mu_\Xi$ in the proof. Let us introduce $C = S\hat\Theta\hat\Theta^\top S^\top$; $C$ is a positive operator of rank $k$ in $L^2$, which we write as $C = \sum_{i\in[k]} \mu_i g_ig_i^\top$ with $\mu_i \geq 0$. We have
\[ {\cal L}(\hat\Theta; \lambda) - k = \trace\paren{(C - T_\lambda)^2 - T_\lambda^2} = \trace\paren{C^2 - 2C^{1/2}T_\lambda C^{1/2}}. \]
Let us decompose $T_\lambda = T_+ - T_-$ where $T_+$ and $T_-$ are positive. Since $T_\lambda \preceq T_+$, $-C^{1/2}T_\lambda C^{1/2} \succeq -C^{1/2}T_+C^{1/2}$, hence
\[ {\cal L}(\hat\Theta; \lambda) - k \geq \trace\paren{C^2 - 2C^{1/2}T_+ C^{1/2}} = \sum_{i\leq k} \mu_i^2 - 2\mu_i \norm{T_+^{1/2}g_i}^2. \]
Minimizing this quantity with respect to the $\mu_i$ leads to
\[ {\cal L}(\hat\Theta; \lambda) - {\cal L}(\Theta;\lambda) \geq \sum_{i\leq k} \lambda_i^2 - \sum_{i\leq k} \norm{T_+^{1/2}g_i}^4. \]
Let us now introduce $(f_i)$ the eigenfunctions of $T_\lambda$. With $U = (\scap{g_i}{f_j}^2)_{ij} \in \R^{k\times k_\lambda}$ and $\lambda = (\lambda_i)\in\R^{k_\lambda}$, we have
\[ \sum_{i\leq k} \norm{T_+^{1/2}g_i}^4 = \sum_{i\leq k} (g_i^\top T_+ g_i)^2 = \sum_{i\leq k} \paren{\sum_{j\leq k_\lambda} \lambda_j \scap{g_i}{f_j}^2}^2 = \sum_{j, m\leq k_\lambda} \lambda_j\lambda_m \sum_{i\leq k}\scap{g_i}{f_j}^2\scap{g_i}{f_m}^2 = \lambda^\top U^\top U \lambda. \]
Note that $U$ is doubly substochastic since both $(g_i)$ and $(f_i)$ are orthonormal families, thus $\norm{U} \leq 1$ and $U^\top U \preceq I$. If one replaces the $f_i$ by $f_i / \norm{\Pi_{\hat{\cal F}}f_i}$ in the definition of $U$, which then becomes $U\diag((\norm{\Pi_{\hat{\cal F}}f_i}^2)_{i\leq k_\lambda})^{-1}$, the matrix is still right substochastic. Hence
\[ U^\top U \preceq \diag\paren{(\norm{\Pi_{\hat{\cal F}}f_i}^2)_{i\leq k_\lambda}}. \]
It follows that
\[ \sum_{i\leq k} \norm{T_+^{1/2}g_i}^4 \leq \lambda^\top U^\top U \lambda \leq \sum_{i\leq k_\lambda} \lambda_i^2 \norm{\Pi_{\hat{\cal F}}f_i}^2. \]
This allows us to simplify the lower bound as
\begin{align*}
{\cal L}(\hat\Theta; \lambda) - {\cal L}(\Theta;\lambda) &\geq \sum_{i\leq k} \lambda_i^2 \paren{f_i^\top f_i - f_i^\top \Pi_{\hat{\cal F}}f_i} - \sum_{k<i\leq k_\lambda} \lambda_i^2 f_i^\top \Pi_{\hat{\cal F}}f_i
\\&= \sum_{i\leq k} \lambda_i^2 \scap{f_i}{(I - \Pi_{\hat{\cal F}})f_i} - \sum_{k<i\leq k_\lambda} \lambda_i^2 \norm{\Pi_{\hat{\cal F}}f_i}^2
\\&= \sum_{i\leq k} \lambda_i^2 \norm{(I - \Pi_{\hat{\cal F}})f_i}^2 - \sum_{k<i\leq k_\lambda} \lambda_i^2 \norm{\Pi_{\hat{\cal F}}f_i}^2.
\end{align*}
This ends the proof of our transfer bound.

Proof. Using the fact that $I - \Pi$ is a projection when $\Pi$ is a projection, and that projections contract distances, we get
\begin{align*}
\norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)})\Pi_{{\cal F}_l}^{(\rho_\X)} f^*}_{L^2(\rho_\X)} &\leq \norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)})(\Pi_{{\cal F}_l}^{(\rho_\X)} - \Pi_{{\cal F}_l}^{(\mu_\Xi)}) f^*}_{L^2(\rho_\X)} + \norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)})\Pi_{{\cal F}_l}^{(\mu_\Xi)}f^*}_{L^2(\rho_\X)}
\\&\leq \norm{(\Pi_{{\cal F}_l}^{(\rho_\X)} - \Pi_{{\cal F}_l}^{(\mu_\Xi)}) f^*}_{L^2(\rho_\X)} + \norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)})\Pi_{{\cal F}_l}^{(\mu_\Xi)}f^*}_{L^2(\rho_\X)}.
\end{align*}
Under Assumption~\ref{ass:robust}, the first term on the right-hand side of the previous equation is bounded by $\sigma(l)$. Regarding the second term, under Assumption~\ref{ass:interpolation}, for $f\in \Psi$ and $f'\in\hat{\cal F}\subset \Psi$, we have
\[ \norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)})f}_{L^2(\rho_\X)} \leq \norm{f - f'}_{L^2(\rho_\X)} \leq \zeta\paren{\norm{f - f'}_{L^2(\mu_\Xi)}}. \]
Taking the minimum over $f'$ on the right-hand side and using the fact that $\zeta$ is increasing leads to
\[ \norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)}) f}_{L^2(\rho_\X)} \leq \zeta\paren{\norm{(I - \Pi_{\hat{\cal F}}^{(\mu_\Xi)}) f}_{L^2(\mu_\Xi)}}. \]
Applied to $\Pi_{{\cal F}_l}^{(\mu_\Xi)}f^*$, this leads to
\[ \norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)})\Pi_{{\cal F}_l}^{(\mu_\Xi)}f^*}_{L^2(\rho_\X)} \leq \zeta\paren{\norm{(I - \Pi_{\hat{\cal F}}^{(\mu_\Xi)})\Pi_{{\cal F}_l}^{(\mu_\Xi)}f^*}_{L^2(\mu_\Xi)}}. \]
We are now done with all the quantities that relate to the distribution shift. Under Assumption~\ref{ass:source}, we have
\begin{align*}
\norm{(I - \Pi_{\hat{\cal F}}^{(\mu_\Xi)})\Pi_{{\cal F}_l}^{(\mu_\Xi)} f^*}_{L^2(\mu_\Xi)} &= \norm{\sum_{i\leq l} \scap{f^*}{f_i}_{\mu_\Xi} (I - \Pi_{\hat{\cal F}}^{(\mu_\Xi)})f_i}_{L^2(\mu_\Xi)}
\\&\leq \sum_{i\leq l} |\scap{f^*}{f_i}_{\mu_\Xi}|\, \norm{(I - \Pi_{\hat{\cal F}}^{(\mu_\Xi)})f_i}_{L^2(\mu_\Xi)}.
\end{align*}
Collecting the previous equations leads to the lemma.

Proof. Let us consider two projections $U$ and $V$ onto the spans of $(u_i)_{i\in[k]}$ and $(v_i)_{i\in[k']}$, with $(u_i)_{i\in\N}$ and $(v_i)_{i\in\N}$ two orthonormal bases of the ambient space. We have, with the Hilbert--Schmidt norm everywhere,
\begin{align*}
\norm{U(I-V)}^2 &= \norm{U}^2 - \norm{UV}^2 = k - \norm{UV}^2 = k - \norm{(UV)^\top}^2 = k - k' + k' - \norm{VU}^2 = k - k' + \norm{V(I-U)}^2.
\end{align*}
Based on the invariance of the Hilbert--Schmidt norm under adjoints, and the fact that projections are self-adjoint, we have
\[ \norm{U(I-V)}^2 = \norm{(I-V)U}^2 = k-k' + \norm{V(I-U)}^2 = k-k' + \norm{(I-U)V}^2. \]
Finally, we also know that, since projections contract distances, $\norm{(I-V)U}^2 \leq \norm{U}^2 = k$. The claim of the lemma consists in writing explicitly
\begin{align*}
\norm{(I - \Pi_{\hat{\cal F}})\Pi_{\cal F}}^2 &= \sum_{i\leq k}\norm{(I - \Pi_{\hat{\cal F}})f_i}^2
\\&= k-k' + \norm{\Pi_{\hat{\cal F}}(I-\Pi_{\cal F})}^2 = k-k'+\sum_{i > k}\norm{\Pi_{\hat{\cal F}}f_i}^2 \leq k.
\end{align*}
This leads to the statement of the lemma.
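The Hilbert--Schmidt identity used here can be verified numerically with random projections; the dimensions below are assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, kp = 8, 3, 2
Qu, _ = np.linalg.qr(rng.standard_normal((n, k)))
Qv, _ = np.linalg.qr(rng.standard_normal((n, kp)))
U, V = Qu @ Qu.T, Qv @ Qv.T  # orthogonal projections of rank k and k'
I = np.eye(n)

hs = lambda A: np.sum(A * A)  # squared Hilbert-Schmidt (Frobenius) norm

# Identity from the proof: ||(I-V)U||^2 = k - k' + ||(I-U)V||^2.
lhs = hs((I - V) @ U)
rhs = k - kp + hs((I - U) @ V)
```

The two sides agree, and `lhs` is bounded by $k$ as the contraction argument predicts.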

Proof. Keeping the algebraic notation above, this comes from a simple application of Cauchy--Schwarz: for $(a_i)\in\R^l$,
\[ \sum_{i\leq l} c_i x_i = \sum_{i\leq l} \frac{c_i}{a_i}\, a_i x_i \leq \paren{\sum_{i\leq l} \frac{c_i^2}{a_i^2}}^{1/2} \paren{\sum_{i\leq l} a_i^2 x_i^2}^{1/2}. \]
When applied to the quantities in eq:algebraic with $a_i^2 = \lambda_i^2 - \lambda_{k+1}^2$ and $l\leq k$, the previous lemma leads to
\begin{align*}
\norm{(I - \Pi_{\hat{\cal F}}^{(\rho_\X)})\Pi_{{\cal F}_l}^{(\rho_\X)} f^*}_{L^2(\rho_\X)} &\leq \sigma(l) + \zeta\paren{\sum_{i\leq l} |\scap{f^*}{f_i}_{L^2(\mu_\Xi)}|\, \norm{(I - \Pi_{\hat{\cal F}}^{(\mu_\Xi)})f_i}_{L^2(\mu_\Xi)}}
\\&\leq \sigma(l) + \zeta\paren{\sum_{i\leq l} c_ix_i} \leq \sigma(l) + \zeta\paren{\paren{\sum_{i\leq l} \frac{c_i^2}{a_i^2}}^{1/2} \paren{\sum_{i\leq l} a_i^2 x_i^2}^{1/2}}
\\&\leq \sigma(l) + \zeta\paren{\paren{\sum_{i\leq l} \frac{c_i^2}{a_i^2}}^{1/2} \paren{{\cal L}(\hat\Theta;\lambda) - {\cal L}(\Theta;\lambda)}^{1/2}}.
\end{align*}
We conclude by remarking that $\sum_{i\leq l} c_i^2/a_i^2 = \norm{\tilde T_\lambda^{-1}\Pi_{{\cal F}_l}^{(\mu_\Xi)}f^*}_{L^2(\mu_\Xi)}^2$. For the second part, set $\hat{\cal F}_k$ to be the span of the $k$ first eigenfunctions among all the ones retrieved by the empirical minimization of ${\cal L}$, and ${\cal F}$ to be the span of all the eigenfunctions linked with positive eigenvalues of $T_\lambda$. Let us rework the decomposition of the excess of risk; we have
\begin{align*}
\norm{(I - \Pi_{\hat{\cal F}_k})f^*}^2 &= \norm{\Pi_{\hat{\cal F}_{k_\lambda}}(I - \Pi_{\hat{\cal F}_k})f^*}^2 + \norm{(I - \Pi_{\hat{\cal F}_{k_\lambda}})(I - \Pi_{\hat{\cal F}_k})f^*}^2
\\&= \norm{(\Pi_{\hat{\cal F}_{k_\lambda}} - \Pi_{\hat{\cal F}_k})f^*}^2 + \norm{(I - \Pi_{\hat{\cal F}_{k_\lambda}})f^*}^2
\\&\leq \norm{(\Pi_{\hat{\cal F}_{k_\lambda}} - \Pi_{\hat{\cal F}_k})f^*}^2 + 2\norm{(I - \Pi_{\hat{\cal F}_{k_\lambda}})\Pi_{\cal F}f^*}^2 + 2\norm{(I - \Pi_{\cal F})f^*}^2
\\&\leq (k_\lambda - k)\norm{f^*}^2 + 2\norm{(I - \Pi_{\hat{\cal F}_{k_\lambda}})\Pi_{\cal F}f^*}^2,
\end{align*}
the last bound being due to Assumption~\ref{ass:source}, as well as a lax bound on the operator norm of the difference of two projections. While one could remove the $k_\lambda - k$ factor, we leave it as is, as we expect the quantity to behave this way, with a constant similar to $\norm{f^*}^2 / k_\lambda$ instead of $\norm{f^*}^2$.

Proof. Let us consider the set of eigenvectors $(f_i)$ whose eigenvalues are bigger than $t$, and the span of these eigenvectors, whose dimension we want to quantify. We know that all unit vectors $x$ in this span satisfy
\[ t \leq x^\top T_\lambda x \leq x^\top Tx - \lambda x^\top K^{-1}x \leq 1 - \lambda x^\top K^{-1}x, \]
hence
\[ x^\top K^{-1} x \leq \frac{1 - t}{\lambda}. \]
This means that this span does not intersect the span of the eigenvectors $\phi_i$ of $K$ whose eigenvalues are smaller than $\lambda / (1 - t)$, i.e., the eigenvectors of $K^{-1}$ whose eigenvalues are bigger than $(1-t)/\lambda$. In other terms, this linear space does not intersect a linear space of co-dimension $d$, where $d$ is the cardinality mentioned in the lemma statement. Let us denote by $U$ the space we are interested in, by $V$ the space it intersects only at the origin, and by $E$ the ambient space. Since $U\cap V = \{0\}$, the quotient $(U+V)/V$ is isomorphic to $U$, hence
\[ \dim(U) = \dim\paren{\frac{U+V}{V}} \leq \dim\paren{\frac{E}{V}} = \codim(V) = d. \]
This concludes the proof of the first part of the lemma. The second claim follows from the fact that the $\mu_i^{1/q}$ are summable and decreasing, hence the sequence $S_n = \sum_{i\leq n} \mu_i^{1/q}$ is a Cauchy sequence. As a consequence, there exists $N\in\N$ such that, for any $s > N/2$, we have
\[ s\mu_{2s}^{1/q} \leq S_{2s} - S_s \leq 1/2. \]
Hence, for all $s\geq N$, we have $\mu_{s}^{1/q} \leq s^{-1}$, i.e., $\mu_s \leq s^{-q}$, hence $\mu_s / s^{-q}$ is bounded. Denoting by $c_q$ the maximum leads to the first result. The final statement is a consequence of the fact that $c_q i^{-q} > \lambda / (1-t)$ implies $i < (c_q(1-t) / \lambda)^{1/q}$.

Proof. The capacity of $K$ relates to the capacity of $K(\{f \mid \norm{f}_{L^2(\mu_\Xi)} \leq 1\})$, which itself relates to the capacity of $\Psi = \ima K^{1/2}$. This explains why $q$ can be taken, in essence, arbitrarily big \citep{Bach2023}. When $\phi$ is bounded, the derivation
\begin{align*}
\trace K &= \trace\paren{SS^\top} = \trace\paren{S^\top S} = \trace\E[\phi(X)\phi(X)^\top] = \E[\trace\paren{\phi(X)\phi(X)^\top}]
\\&= \E[\phi(X)^\top\phi(X)] = \E[\norm{\phi(X)}^2] < +\infty
\end{align*}
proves that $K$ is trace-class.

Proof. Consider $\psi = \Theta\phi$; with $\Lambda = \Theta^\top\Theta$, we have
\begin{align*}
{\cal L}(\psi; \beta) &= 2(\beta - 1)\E_{\xi}[\psi(\xi)^\top\psi(\xi)] - 2\beta\E_X\E_{\xi, \xi'}\bracket{\psi(\xi)^\top\psi(\xi')\midvert X} + \E_{\xi, \xi'}\bracket{(\psi(\xi')^\top\psi(\xi))^2} + k
\\&= 2(\beta - 1)\E_{\xi}[\phi(\xi)^\top \Lambda \phi(\xi)] - 2\beta\E_X\E_{\xi, \xi'}\bracket{\phi(\xi)^\top\Lambda\phi(\xi')\midvert X} + \E_{\xi, \xi'}\bracket{(\phi(\xi')^\top\Lambda\phi(\xi))^2} + k
\\&= 2(\beta - 1)\E_{\xi}[\trace\paren{\Lambda \phi(\xi)\phi(\xi)^\top}] - 2\beta\E_X\E_{\xi, \xi'}\trace\paren{\Lambda \phi(\xi')\phi(\xi)^\top\midvert X} + \E_{\xi, \xi'}\trace\paren{\paren{\Lambda\phi(\xi)\phi(\xi')^\top}^2} + k.
\end{align*}
The lemma follows from the characterization of the Hilbert--Schmidt geometry with the trace, the fact that $\Lambda$ is self-adjoint, and the fact that the regularization reads $\norm{\Theta}^2 = \trace(\Theta^\top\Theta)$.

Proof. The proof is a classical result from learning theory \citep{bartlett_rademacher_2002}. It consists in introducing the empirical risks of both $\zeta_n$ and $\zeta$, and bounding the difference between the empirical and population risks of $\zeta$ by the supremum of this deviation over the entire domain of $\zeta$. This is followed by the replacement of the population risk by the average empirical one, and a symmetrization trick that introduces the variables $(\sigma_i)$ based on the exchangeability of the $(Z_i)$.

Proof. This is a classical result on the Rademacher complexity of ball-constrained predictors (Bartlett & Mendelson, 2002).

Proof. This follows from the contraction of space capacity by Lipschitz functions (Vitushkin, 1954); see Meir & Zhang (2003) for a proof in the context of machine learning.

Proof. Following the previous lemmas on Rademacher complexity, we have
$$ \begin{aligned} &\mathbb{E}_{{\cal D}_n}[{\cal L}(S\Theta_n;\lambda)] - {\cal L}(S\Theta;\lambda) \leq 8\, \mathbb{E}_{{\cal D}_n, \sigma}\Big[\sup_{\Lambda}\frac{1-\beta}{n} \sum_{i\in[n]}\sigma_i\Big\langle\Lambda, \frac{1}{m}\sum_{j\in[m]}\phi(\xi_{ij})\phi(\xi_{ij})^\top\Big\rangle\Big] \\ &\qquad\qquad+ 8\, \mathbb{E}_{{\cal D}_n, \sigma}\Big[\sup_{\Lambda}\frac{\beta}{n}\sum_{i\in[n]}\sigma_{i}\Big\langle\Lambda, \frac{2}{m}\sum_{j\in [m/2];\, j+k-1=m}\phi(\xi_{ij})\phi(\xi_{ik})^\top\Big\rangle\Big] \\ &\qquad\qquad+ 4\, \mathbb{E}_{{\cal D}_n,\sigma}\Big[\frac{2}{n}\sum_{i \in [n/2];\, i+j-1=n}\sigma_{i}\, \frac{1}{m^2} \sum_{k,l\in[m]}\Big\langle\Lambda, \phi(\xi_{ik})\phi(\xi_{jl})^\top\Big\rangle^2\Big] \\ &\leq \frac{8\sup \|\Lambda\|_{HS}}{\sqrt{n}}\, (1-\beta)\,\mathbb{E}_X\Big[\mathbb{E}\Big[\Big\|\frac{1}{m}\sum_{i=1}^m \phi(\xi_i)\phi(\xi_i)^\top\Big\|_{HS}^2 \,\Big|\, X\Big]^{1/2}\Big] \\ &\qquad\qquad + \frac{8\sup \|\Lambda\|_{HS}}{\sqrt{n}}\, \beta\,\mathbb{E}_X\Big[\mathbb{E}\Big[\Big\|\frac{2}{m}\sum_{i,j=1}^{m/2} \phi(\xi_i)\phi(\xi_j)^\top\Big\|_{HS}^2 \,\Big|\, X\Big]^{1/2}\Big] \\ &\qquad\qquad+ \frac{8\sup \|\Lambda\|_{HS}}{\sqrt{n}} \sqrt{2\sup \big\langle\Lambda, \phi(\xi)\phi(\xi')^\top\big\rangle}\; \mathbb{E}\Big[\Big\|\frac{1}{m^2}\sum_{i,j=1}^{m} \phi(\xi_i)\phi(\xi_j)^\top\Big\|_{HS}^2\Big]^{1/2}. \end{aligned} $$
To work out those terms, remark that if $(Z_i)$ are i.i.d. variables,
\[ \mathbb{E}\Big[\Big\|\frac{1}{p} \sum_{i\in[p]}Z_i\Big\|^2\Big] = \mathbb{E}\Big[\Big\|\frac{1}{p}\sum_{i\in[p]} (Z_i - \mathbb{E}[Z])\Big\|^2\Big] + \|\mathbb{E}[Z]\|^2 = \frac{1}{p}\,\mathbb{E}[\|Z - \mathbb{E}[Z]\|^2] + \|\mathbb{E}[Z]\|^2. \]
While one could work out each term precisely, the lemma simply bounds $\|\phi\|$ by $\kappa$, which controls all the means and standard deviations appearing above.

Proof. When regularizing, we have
\[ \|\Lambda\|_{HS} = \|\Theta^\top\Theta\|_{HS} \leq \|\Theta\|_{\operatorname{op}} \|\Theta\|_{HS} \leq \|\Theta\|_{HS}^2, \]
and for a minimizer of the empirical or population risk,
\[ \lambda\|\Theta\|^2 \leq {\cal L}(S\Theta) + \lambda\|\Theta\|^2 \leq {\cal L}(0) = k, \]
which explains the statement of the lemma.

Proof. This lemma is a direct consequence of Theorems 6.1, 6.2 and 6.3 of Bubeck (2015).

Proof. This formula follows from Lemma lem:quadra.

Proof. Let us decompose $\nabla\ell$ into three terms $\nabla\ell = a+b+c$ as appearing in eq:sgd_sample. We have
$$ \begin{aligned} \|a\| &\leq 2(1-\beta)\, \|\phi(\xi)\phi(\xi)^\top\| \leq 2(1-\beta) \kappa^2, \\ \|b\| &\leq 2 \beta\, \|\phi(\xi)\phi(\xi')^\top\| \leq 2\beta \kappa^2, \\ \|c\| &\leq \|\langle\Lambda, \phi(\xi)\phi(\xi')^\top\rangle\,\phi(\xi)\phi(\xi')^\top\| \leq \sup \|\Lambda\|\,\kappa^4. \end{aligned} $$
To bound the variance, one can proceed with
\[ \mathbb{E}\|\nabla \ell - \nabla{\cal L}\|^2 \leq 3\,\mathbb{E}\|a - \mathbb{E}[a]\|^2 + 3\,\mathbb{E}\|b-\mathbb{E}[b]\|^2 + 3\,\mathbb{E}\|c - \mathbb{E}[c]\|^2. \]
Let us begin with the part in $a$:
$$ \begin{aligned} \mathbb{E}\Big[\Big\|\frac{1}{m}\sum_{i\in[m]}\phi(\xi_{1i})\phi(\xi_{1i})^\top - \mathbb{E}[\phi(\xi)\phi(\xi)^\top]\Big\|^2\Big] &= \mathbb{E}\Big[\Big\|\frac{1}{m}\sum_{i\in[m]}\phi(\xi_{1i})\phi(\xi_{1i})^\top - \mathbb{E}[\phi(\xi)\phi(\xi)^\top \mid X=X_1]\Big\|^2\Big] \\ &\qquad+ \mathbb{E}\big[\big\|\mathbb{E}[\phi(\xi)\phi(\xi)^\top \mid X=X_1] - \mathbb{E}[\phi(\xi)\phi(\xi)^\top]\big\|^2\big] \\ &= \frac{1}{m}\, \mathbb{E}_X \mathbb{E}_\xi\big[\big\|\phi(\xi)\phi(\xi)^\top - \mathbb{E}[\phi(\xi)\phi(\xi)^\top \mid X]\big\|^2 \mid X\big] \\ &\qquad+ \mathbb{E}\big[\big\|\mathbb{E}[\phi(\xi)\phi(\xi)^\top \mid X] - \mathbb{E}[\phi(\xi)\phi(\xi)^\top]\big\|^2\big]. \end{aligned} $$
Similarly, the part in $b$ can be expressed as
$$ \begin{aligned} \mathbb{E}\|b - \mathbb{E}[b]\|^2 &= 2\beta\,\frac{2}{m}\, \mathbb{E}_X \mathbb{E}_\xi\big[\big\|\phi(\xi)\phi(\xi')^\top - \mathbb{E}[\phi(\xi)\phi(\xi')^\top \mid X]\big\|^2 \mid X\big] \\ &\qquad+ 2\beta\,\mathbb{E}\big[\big\|\mathbb{E}[\phi(\xi)\phi(\xi')^\top \mid X] - \mathbb{E}[\phi(\xi)\phi(\xi')^\top]\big\|^2\big]. \end{aligned} $$
Finally,
$$ \begin{aligned} \mathbb{E}\|c - \mathbb{E}[c]\|^2 &= \frac{1}{m^2}\, \mathbb{E}_X \mathbb{E}_\xi\big[\big\|\langle\Lambda, \phi(\xi)\phi(\xi')^\top\rangle \phi(\xi)\phi(\xi')^\top - \mathbb{E}[\langle\Lambda, \phi(\xi)\phi(\xi')^\top\rangle\phi(\xi)\phi(\xi')^\top \mid X, X']\big\|^2 \mid X, X'\big] \\ &\qquad+ \mathbb{E}\big[\big\|\mathbb{E}[\langle\Lambda, \phi(\xi)\phi(\xi')^\top\rangle\phi(\xi)\phi(\xi')^\top \mid X, X'] - \mathbb{E}[\langle\Lambda, \phi(\xi)\phi(\xi')^\top\rangle\phi(\xi)\phi(\xi')^\top]\big\|^2\big] \\ &\leq \frac{\|\Lambda\|^2}{m^2}\, \mathbb{E}_X \mathbb{E}_\xi\big[\big\|\phi(\xi)\phi(\xi')^\top \otimes \phi(\xi)\phi(\xi')^\top - \mathbb{E}[\phi(\xi)\phi(\xi')^\top \otimes \phi(\xi)\phi(\xi')^\top \mid X, X']\big\|^2 \mid X, X'\big] \\ &\qquad+ \|\Lambda\|^2\, \mathbb{E}\big[\big\|\mathbb{E}[\phi(\xi)\phi(\xi')^\top\otimes \phi(\xi)\phi(\xi')^\top \mid X, X'] - \mathbb{E}[\phi(\xi)\phi(\xi')^\top\otimes \phi(\xi)\phi(\xi')^\top]\big\|^2\big]. \end{aligned} $$
As a consequence, we get
\[ \mathbb{E}\|\nabla \ell - \nabla{\cal L}\|^2 \leq 32(1-\beta)\Big(\frac{\sigma_{\xi, 1}^2}{m} + \sigma_{X, 1}^2\Big) + 2\beta\Big(\frac{2\sigma_{\xi, 2}^2}{m} + \sigma_{X, 2}^2\Big) + \sup\|\Lambda\|^2 \Big(\frac{\sigma_{\xi, 3}^2}{m^2} + \sigma_{X, 3}^2\Big), \]
where
$$ \begin{aligned} \sigma_{\xi, 1}^2 &= \mathbb{E}_X \mathbb{E}_\xi\big[\big\|\phi(\xi)\phi(\xi)^\top - \mathbb{E}[\phi(\xi)\phi(\xi)^\top \mid X]\big\|^2 \mid X\big], \\ \sigma_{X, 1}^2 &= \mathbb{E}\big[\big\|\mathbb{E}[\phi(\xi)\phi(\xi)^\top \mid X] - \mathbb{E}[\phi(\xi)\phi(\xi)^\top]\big\|^2\big], \\ \sigma_{\xi, 2}^2 &= \mathbb{E}_X \mathbb{E}_\xi\big[\big\|\phi(\xi)\phi(\xi')^\top - \mathbb{E}[\phi(\xi)\phi(\xi')^\top \mid X]\big\|^2 \mid X\big], \\ \sigma_{X, 2}^2 &= \mathbb{E}\big[\big\|\mathbb{E}[\phi(\xi)\phi(\xi')^\top \mid X] - \mathbb{E}[\phi(\xi)\phi(\xi')^\top]\big\|^2\big], \\ \sigma_{\xi,3}^2 &= \mathbb{E}_X \mathbb{E}_\xi\big[\big\|\phi(\xi)\phi(\xi')^\top \otimes \phi(\xi)\phi(\xi')^\top - \mathbb{E}[\phi(\xi)\phi(\xi')^\top \otimes \phi(\xi)\phi(\xi')^\top \mid X, X']\big\|^2 \mid X, X'\big], \\ \sigma_{X,3}^2 &= \mathbb{E}\big[\big\|\mathbb{E}[\phi(\xi)\phi(\xi')^\top\otimes \phi(\xi)\phi(\xi')^\top \mid X, X'] - \mathbb{E}[\phi(\xi)\phi(\xi')^\top\otimes \phi(\xi)\phi(\xi')^\top]\big\|^2\big]. \end{aligned} $$
Using the fact that $m^2 \geq m$ and choosing the right $\sigma_X$ and $\sigma_\xi$ leads to the lemma.

Proof. This is a consequence of Lemma lem:quadra: ${\cal L}$ is a quadratic function, whose quadratic part is
\[ \mathbb{E}\big[\langle\Lambda, \phi(\xi)\phi(\xi')^\top\rangle^2\big] = \big\langle\Lambda, \big(\mathbb{E}[\phi(\xi')\phi(\xi')^\top] \otimes \mathbb{E}[\phi(\xi)\phi(\xi)^\top]\big) \Lambda\big\rangle = \langle\Lambda, (\Sigma \otimes \Sigma) \Lambda\rangle. \]
In other terms, the Hessian of ${\cal L}$ is $\Sigma\otimes \Sigma \in {\cal H}^{\otimes 2}\otimes {\cal H}^{\otimes 2}$. As a consequence, $\Sigma\otimes \Sigma \preceq \|\Sigma\otimes\Sigma\|_{\operatorname{op}} I = \|\Sigma\|^2_{\operatorname{op}} I \preceq \kappa^4 I$. Similarly, $\Sigma\otimes \Sigma \succeq \|\Sigma^{-1}\|^{-2}_{\operatorname{op}} I = \gamma_\xi^2 I$, where $\gamma_\xi$ is the smallest eigenvalue of $\Sigma$, hence of $K$.

Proof. It is straightforward to check that $\langle\chi_S, \chi_S\rangle = 1$. If $S \neq S'$, then w.l.o.g.~there is an $i\in S\setminus S'$, and we have
\[ \langle\chi_S, \chi_{S'}\rangle = \mathbb{E}_x[x_{i}\, \chi_{S\setminus\{i\}}(x)\, \chi_{S'}(x)] = \mathbb{E}_{x_{i}}[x_{i}]\; \mathbb{E}_{x_{-i}}[\chi_{S\setminus\{i\}}(x)\, \chi_{S'}(x)] = 0, \]
where we used the independence of the coordinates and $\mathbb{E}[x_i] = 0$. This proves the orthonormality of this basis.
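This orthonormality is easy to check numerically. The following minimal Python sketch (ours, not from the paper) enumerates the hypercube for $d=4$ and verifies that the Gram matrix of the parities under the uniform measure is the identity.

```python
import itertools
import numpy as np

d = 4
# all points of the hypercube {-1, 1}^d, weighted uniformly
X = np.array(list(itertools.product([-1, 1], repeat=d)), dtype=float)

def chi(S):
    # parity function chi_S(x) = prod_{i in S} x_i, vectorized over rows of X
    return np.prod(X[:, list(S)], axis=1) if S else np.ones(len(X))

subsets = [tuple(S) for r in range(d + 1)
           for S in itertools.combinations(range(d), r)]
# Gram matrix <chi_S, chi_{S'}> in L^2(uniform measure)
G = np.array([[np.mean(chi(S) * chi(Sp)) for Sp in subsets] for S in subsets])

assert np.allclose(G, np.eye(len(subsets)))  # the parities are orthonormal
```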

Proof. Recall the formula $g^\top T f = \mathbb{E}_X\mathbb{E}_{\xi, \xi'}[\langle f(\xi), g(\xi')\rangle \mid X]$. As a consequence, with $y,y'$ denoting the noise strings (each bit equal to $-1$ with probability $p$) and $S\vartriangle S' = (S \cup S') \setminus (S \cap S')$,
$$ \begin{aligned} \chi_S^\top T \chi_{S'} &= \mathbb{E}_X[ \mathbb{E}_{y,y'} [ \chi_S (X \odot y)\, \chi_{S'} (X \odot y') ] ] = \mathbb{E}_X \mathbb{E}_{y,y'}\Big[\prod_{i \in S} X_i y_i \prod_{j \in S'} X_j y_j'\Big] \\ &= \mathbb{E}_X \big[\chi_{S\vartriangle S'}(X)\big]\; \mathbb{E}_{y}\Big[\prod_{i \in S} y_i\Big]\; \mathbb{E}_{y'}\Big[\prod_{j \in S'} y_j'\Big] = \delta_{S,S'}\, (1-2p)^{|S|}(1-2p)^{|S'|} = (1-2p)^{2|S|}\, \delta_{S,S'}. \end{aligned} $$
Therefore, in the case of bit-flip augmentations, $T$ is diagonalized in the parity basis.
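As a sanity check, the matrix of $T$ in the parity basis can be computed exactly for small $d$ by enumerating flip patterns. The Python sketch below (ours, not from the paper) verifies that $T$ is diagonal with entries $(1-2p)^{2|S|}$.

```python
import itertools
import numpy as np

d, p = 4, 0.1
pts = np.array(list(itertools.product([-1, 1], repeat=d)))

def chi(S, x):
    # parity chi_S evaluated at a single point x
    return float(np.prod(x[list(S)])) if S else 1.0

def weight(y):
    # probability of the flip pattern y (each bit set to -1 w.p. p)
    k = int(np.sum(y == -1))
    return p ** k * (1 - p) ** (d - k)

subsets = [tuple(S) for r in range(d + 1)
           for S in itertools.combinations(range(d), r)]

def T_entry(S, Sp):
    # chi_S^T T chi_{S'} = E_X E_{y,y'}[chi_S(X*y) chi_{S'}(X*y')]
    total = 0.0
    for x in pts:
        e1 = sum(weight(y) * chi(S, x * y) for y in pts)
        e2 = sum(weight(y) * chi(Sp, x * y) for y in pts)
        total += e1 * e2
    return total / len(pts)

T = np.array([[T_entry(S, Sp) for Sp in subsets] for S in subsets])
expected = np.diag([(1 - 2 * p) ** (2 * len(S)) for S in subsets])
assert np.allclose(T, expected)  # T is diagonal in the parity basis
```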

Proof. In this setting,
$$ \begin{aligned} \chi_S^\top T \chi_{S'} &= \mathbb{E}_X[ \mathbb{E}_{a,b} [ \chi_S (M_a^w(X))\, \chi_{S'} (M_b^w(X)) ] ] \\ &= \frac{1}{d^2}\sum_{a,b=1}^d \mathbb{E}_{X, \nu, \nu'} \Big[ \prod_{i \in S \cap [a,a+w)} x_i \prod_{i' \in S \setminus [a,a+w)} \nu_{i'} \prod_{j \in S' \cap [b,b+w)} x_j \prod_{j' \in S'\setminus [b,b+w)} \nu_{j'}' \Big] \\ &= \frac{1}{d^2}\sum_{a,b=1}^d \mathbf{1}_{S \subseteq [a,a+w)}\, \mathbf{1}_{S' \subseteq [b,b+w)}\, \mathbb{E}_X[\chi_S(X)\chi_{S'} (X)] = \frac{1}{d^2}\sum_{a,b=1}^d \mathbf{1}_{S \subseteq [a,a+w)}\, \mathbf{1}_{S' \subseteq [b,b+w)}\, \delta_{S,S'} \\ &= \Big(\frac{1}{d}\sum_{a=1}^d \mathbf{1}_{S \subseteq [a,a+w)} \Big)^2 \delta_{S, S'}, \end{aligned} $$
where the second equality uses that the masked coordinates $\nu, \nu'$ are independent centered bits. The count in the last sum relates to the diameter of $S$.
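The eigenvalue $\big(\frac{1}{d}\sum_a \mathbf{1}_{S\subseteq[a,a+w)}\big)^2$ can be computed by brute force. The sketch below (ours, assuming cyclic windows, with $d=12$ and $w=8$ as in the paper's figures) illustrates that the count, hence the eigenvalue, shrinks as the diameter of $S$ grows.

```python
d, w = 12, 8  # 12-bit inputs, crops of window size 8

def window(a):
    # cyclic window [a, a+w) of kept coordinates
    return {(a + t) % d for t in range(w)}

def eigenvalue(S):
    # lambda_S = ((1/d) * #{a : S subset of [a, a+w)})^2
    count = sum(S <= window(a) for a in range(d))
    return (count / d) ** 2

# Larger diameter -> fewer windows contain S -> smaller eigenvalue of T.
assert eigenvalue({0}) == (8 / 12) ** 2
assert eigenvalue({0}) > eigenvalue({0, 3}) > eigenvalue({0, 6})
```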

Proof. In this setting,
$$ \chi_S^\top T\chi_{S'} = ((1-p)^2 + p^2)\, \mathbb{E}_X\left[ \chi_S (X) \chi_{S'} (X) \right] + 2p(1-p)\, \mathbb{E}_X\left[ \chi_{\bar S}(X) \chi_{S'} (X) \right] = (1-2p+2p^2)\, \delta_{S,S'} + 2p(1-p)\, \delta_{\bar S,S'}, $$
where $\bar S$ denotes the image of $S$ under the flip. This explains the lemma: the eigenvectors are obtained by pairing $\chi_S$ and $\chi_{\bar S}$.

Proof. The first part follows from the fact that $L^2(\X)$ can be decomposed into the direct sum of the $V_{d, \ell}$ for $\ell \in [0, d]$, and that each subspace can be decomposed along the orbits of the translation action, $\operatorname{orb}(S) = \{S + a \mid a\in[d]\}$ (note that translations do not change the cardinality of the sets $S$). Those latter spaces can be parameterized through the discrete Fourier transform, yielding the $\psi_{m, S}$. A natural way to ``find'' those bases is to diagonalize an operator $T$ whose matrix $(\chi_S^\top T \chi_{S'})_{S,S'\subseteq [d]}$ is block diagonal, where each block corresponds to a circulant matrix on an orbit, which can be diagonalized with the discrete Fourier transform. This is the case for the operators of the lemma, built from the elementary translations
\[ A\chi_{S} = \chi_{S+a}. \]
The matrix element $\chi_{S'}^\top A \chi_{S}$ is only nonzero when $\operatorname{orb}(S)$ intersects $\operatorname{orb}(S')$, which implies $\operatorname{orb}(S) = \operatorname{orb}(S')$, thereby yielding a block diagonal structure. Indexing the elements of the $i$-th block by $S_{i,k} = S_i + k$ for $k \in [d]$, we have
\[ \chi_{S_{i,k}}^\top A\chi_{S_{i,k'}} = \mathbf{1}_{S_{i,k} = S_{i,k'} + a} = \mathbf{1}_{S_{i} + k = S_{i} + k' + a} = \mathbf{1}_{k - k' = a}, \]
which only depends on the value of $(k-k')$. Therefore, each block above is a circulant matrix, which is diagonalized by the discrete Fourier transform. The eigenvectors of this matrix are
\[ v_m = \frac{1}{\sqrt{k_S}}\sum_{k\in[k_S]} e^{2i\pi k m / k_S} e_k, \qquad\text{where}\qquad k_S = |\operatorname{orb}(S)|, \]
for $m\in[k_S]$, and the corresponding eigenvalues read
\[ \mu_m = \sum_{k\in[k_S]} c_{k_S-k} \exp\Big(\frac{2i\pi k m}{k_S}\Big), \qquad\text{where}\qquad c_i = \mathbf{1}_{i = a}. \]
Using the fact that we wrote those matrices for $e_i\simeq \chi_{S+i}$ yields the lemma.
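The key fact used above, that any circulant matrix is diagonalized by the discrete Fourier transform, is easy to verify numerically (a small Python sketch of ours):

```python
import numpy as np

# Any circulant matrix C[i, j] = c[(j - i) % k] satisfies
# C F[:, m] = fft(c)[m] * F[:, m] for the Fourier modes F[:, m].
rng = np.random.default_rng(0)
k = 8
c = rng.standard_normal(k)
C = np.array([[c[(j - i) % k] for j in range(k)] for i in range(k)])

# Fourier modes as columns, sign convention matching np.fft.fft
F = np.array([[np.exp(-2j * np.pi * i * m / k) for m in range(k)]
              for i in range(k)]) / np.sqrt(k)
eigs = np.fft.fft(c)

assert np.allclose(C @ F, F * eigs)  # each Fourier mode is an eigenvector
```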

Proof. The square root of $T$ is defined as $Af(x) = \sum_{a\in\mathfrak{S}_d} p(a)f(a\cdot x)$. Let us focus on the case where $p(b) = \mathbf{1}_{b=a}$. Using the fact that $\mu_\X$ is the uniform measure, hence is left invariant by translation, we compute the adjoint of $A$ with
$$ \begin{aligned} \langle Af, g\rangle_{L^2(\mu_\X)} &= \frac{1}{2^d}\sum_{x\in\X} Af(x) g(x) = \frac{1}{2^d}\sum_{x\in\X} p(a) f(a\cdot x) g(x) \\ &= \frac{1}{2^d}\sum_{x\in\X} p(a) f(a\cdot x) g(a^{-1} \cdot a\cdot x) = \frac{1}{2^d}\sum_{x\in\X} p(a) f(x) g(a^{-1} \cdot x) \\ &= \langle f, x\mapsto p(a)g(a^{-1} \cdot x)\rangle_{L^2(\mu_\X)}. \end{aligned} $$
In the general case, we get by linearity
\[ A^\top f(x) = \sum_{a\in \mathfrak{S}_d} p(a) f(a^{-1}\cdot x). \]
Computing $T = A^\top A$ leads to the result. Remark that if we further assume that $p$ is symmetric (i.e., $p(a) = p(a^{-1})$), then we have $A^\top = A$, so that $T = A^2$.

Proof. In the case of translations, we have
\[ Af(x) = \sum_{a\in[d]} p(a)f(a\cdot x) = \sum_{a\in[d]} p(a) A_a f(x), \]
where $A_a$ is the operator that maps $f$ to $x\mapsto f(a\cdot x)$. It is a translation operator and, retaking the proof of Lemma lem:diag_trans, $A_a \psi_{m, S} = e^{-2i\pi am / k_S}\psi_{m,S}$. This leads to
\[ A \psi_{m, S} = \sum_{a\in[d]} p(a) \exp\Big(\frac{-2i\pi a m}{k_S}\Big) \psi_{m, S} = \frac{d}{k_S} \cdot \hat{p}\Big(\frac{m d}{k_S}\Big) \psi_{m, S}, \]
and
\[ T = A^*A = \sum_{m, S} \frac{d^2}{k_S^2}\, \Big|\hat{p}\Big(\frac{m d}{k_S}\Big)\Big|^2 \psi_{m, S} \psi_{m, S}^*. \]
This proves the lemma.

Proof. This follows from the fact that, in the uniform setting, $Kf(x) = 2^{-d}\sum_{x'\in\X} k(x,x') f(x')$ where $k(x,x') = \phi(x)^\top \phi(x')$.

Proof. One can check that $x^\top y = d - 2k$ for $k$ the number of bits that differ between $x$ and $y$. Define the averaged polynomials $Q_{\ell, d}$ of degree $\ell$ as
\[ \sum_{S \subseteq [d], |S| = \ell} \chi_S(x) \chi_S(y) = {d \choose \ell} Q_{\ell, d}(\langle x, y\rangle), \]
for any Boolean strings $x$ and $y$. The $Q_{\ell, d}$ are well defined since the left-hand side only depends on $\langle x, y\rangle$. Moreover, leveraging the orthogonality of the $\chi_S$, one can show that the $(Q_{\ell, d})_{\ell\in[0, d]}$ form a basis of the functions on $\{d-2k \mid k\in[0, d]\}$. More exactly, the $m\mapsto {d \choose \ell}^{-1/2} Q_{\ell,d}(m)$ form an orthonormal basis of the $L^2$ space endowed with $\tau$, the pushforward of the uniform distribution on $\X$ through the mapping $x\mapsto\langle x, y\rangle$ for any fixed $y$ (and the dimensions match). As a consequence, there exist $(\nu_\ell)$ such that
\[ h(\langle x, y\rangle) = \sum_{\ell\in[0,d]} \nu_\ell {d \choose \ell} Q_{\ell, d}(\langle x, y\rangle), \]
where $\nu_\ell$ can be found by computing the scalar product between $h$ and $Q_{\ell, d}$ in $L^2(\tau)$:
\[ \nu_\ell = \langle h, Q_{\ell, d}\rangle_{L^2(\tau)}. \]
Finally, using the fact that, in the uniform setting, $Kf(x) = 2^{-d}\sum_{x'\in\X} k(x,x') f(x')$ where $k(x,x') = \phi(x)^\top \phi(x')$, we have
\[ K\chi_S(x) = \mathbb{E}_Y[h(\langle x, Y\rangle)\chi_S(Y)] = \sum_{\ell} \nu_\ell \sum_{S'\subset[d], |S'| = \ell} \chi_{S'}(x)\, \mathbb{E}[\chi_{S'}(Y) \chi_S(Y)] = \nu_{|S|}\, \chi_S(x). \]
This ends the proof of this lemma.
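One can verify numerically that, on the uniform hypercube, any dot-product kernel admits the parities as eigenfunctions, with eigenvalues depending only on $|S|$. A small Python check of ours (the choice $h(t) = e^{t/d}$ is an arbitrary illustration):

```python
import itertools
import numpy as np

d = 4
X = np.array(list(itertools.product([-1, 1], repeat=d)), dtype=float)
n = len(X)

def h(t):
    # any fixed function of the inner product gives a dot-product kernel
    return np.exp(t / d)

Kmat = h(X @ X.T) / n  # (K f)(x) = E_Y[h(<x, Y>) f(Y)], Y uniform

def chi(S):
    return np.prod(X[:, list(S)], axis=1) if S else np.ones(n)

nus = {}
for r in range(d + 1):
    for S in itertools.combinations(range(d), r):
        v = chi(S)
        Kv = Kmat @ v
        nu = np.mean(Kv * v)  # <K chi_S, chi_S> in L^2(uniform)
        assert np.allclose(Kv, nu * v)  # chi_S is an eigenfunction of K
        nus.setdefault(r, []).append(nu)

# the eigenvalue only depends on the cardinality |S|
for vals in nus.values():
    assert np.allclose(vals, vals[0])
```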

Proof. Such a linearization can be found, e.g., in Proposition 3 of Misiakiewicz & Mei (2021).

Proof. The first part is a direct consequence of the prior proposition with $\omega = 1$ and $q=\Delta=d$. The second part is due to Lemma lem:strong_diag and eq:gegenbauer_boolean. For the statements on eigenvalues, see Yang & Salman (2019).

Proof. The first part corresponds to the case $\omega=\Delta=1$. The second part is due to the expansion of $h$ over the $Q_\ell$ basis, which leads to (see Eq. (30) in Misiakiewicz & Mei (2021) for details)
\[ k_{CNN}(x,y) = \sum_{S \subseteq [d],\, \operatorname{diam}(S) \leq q} q^{-1}\, \nu_h(q,S)\, (q+1-\operatorname{diam}(S))_+\, \chi_S(x) \chi_S(y). \]
The fact that $K$ leaves the spaces $\operatorname{span}\{\chi_S \mid \operatorname{card}(S)=a,\, \operatorname{diam}(S)=b\}$ invariant, since the eigenvalues only depend on $\operatorname{card}(S)$ and $\operatorname{diam}(S)$, allows changing from the parity basis to the cyclic basis.

Proof. In the case of a dot-product kernel in the uniform setting, the spaces $V_{d,\ell}$ are eigenspaces of $K$. Those spaces are left invariant by all the operators $T$ defined through the usual augmentations, since translations and index-flip operations preserve the cardinality of subsets. As a consequence, $K$ and $T$ can be diagonalized in the same basis, hence they commute.

We unveil two central integral operators: an “intrinsic” one that depends on the input distribution and choice of augmentations and another capturing the inductive bias associated with the model of computation.

We provide new bounds on the downstream generalization error that are sharper than previous work, and which can handle distribution shifts between data before and after performing augmentations.

We propose new generalization bounds on the pretraining excess risk via tools from convex analysis. This analysis yields novel insights, including an understanding of the benefits of using multiple augmentations per sample (e.g., “multi-crop”).

We discuss several practical insights for SSL practitioners that emerge from our theory, in particular on how design choices in pretraining may affect downstream performance, and on how to avoid collapse of representations.

Though features ψ can be hand-engineered, representation learning aims at improving such designs via unsupervised learning procedures. On the one hand, reconstruction-based methods mask or add noise to inputs via a mapping M​x and aim to reconstruct the original input x from the features g∘ψ, where g is a simple prediction head. Large language models largely rely on this paradigm, usually learning ψ by completing sentences M​x where word tokens are masked (e.g., Devlin et al., 2019). On the other hand, joint-embedding methods learn ψ by leveraging invariance to small perturbations that preserve the semantic information contained in inputs. This is the paradigm we shall focus on. Recently, joint-embedding methods have relied heavily on the concept of data augmentation, such as small rotations, translations, or color jittering of images. In particular, contrastive methods learn ψ by enforcing that if two augmentations ξ and ξ′ come from the same data point, their representations ψ​(ξ) and ψ​(ξ′) are close, while if they come from different data points, their representations are far away from one another (e.g., Chen et al., 2020). Non-contrastive methods only enforce similarity of augmented data points and avoid collapse by enforcing richness of the representation (see, e.g., Bardes et al., 2022). In the following, we focus on a theoretically friendly variant of VICReg (Balestriero & LeCun, 2022) with parameter β>0, defined for ψ:𝒳→ℝk by
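To make such an objective concrete, here is a minimal NumPy sketch (ours, not the authors' code) of an empirical version of this type of loss on two batches of augmented views; the constants follow the quadratic expansion used in the analysis, with k the representation dimension.

```python
import numpy as np

def ssl_loss(psi_a, psi_b, beta):
    """Empirical version of a theory-friendly VICReg-style loss.

    psi_a, psi_b: (n, k) arrays, representations of two augmentations
    of the same n inputs; beta trades off the contrastive term.
    """
    n, k = psi_a.shape
    norm_term = 2 * (beta - 1) * np.mean(np.sum(psi_a ** 2, axis=1))
    # invariance: two views of the same input should have close representations
    invariance = -2 * beta * np.mean(np.sum(psi_a * psi_b, axis=1))
    # richness term preventing collapse, over (approximately) independent pairs
    richness = np.mean((psi_a @ psi_b.T) ** 2)
    return norm_term + invariance + richness + k

# A collapsed representation (psi = 0) pays the constant k, while a
# spread-out representation achieves a lower loss.
psi = np.eye(4)
assert ssl_loss(np.zeros((4, 4)), np.zeros((4, 4)), beta=1.0) == 4.0
assert ssl_loss(psi, psi, beta=1.0) < 4.0
```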

When β=1, the population loss ℒ is equivalent to the spectral contrastive loss studied in HaoChen et al. (2021) as a theoretically friendly proxy for SimCLR (Chen et al., 2020). In other terms, the study of ℒ covers both contrastive and non-contrastive approaches to representation learning.


In the classical viewpoint of statistical learning theory, one would like to retrieve the eigenfunctions of T to minimize ℒ (Lemma 2). However, when solely accessing finitely many samples of data, eigenfunctions of T should be searched for within a space of finite capacity (i.e., {f∈Ψ : ‖f‖Ψ2≤λ−1}). Though fewer samples are needed for smaller models (e.g., the fewer neurons and layers in a deep network), such small models are unlikely to be expressive enough to represent the ideal solutions. This echoes the classical trade-off between approximation and estimation errors. In the case of Laplacians, one can assume that the eigenfunctions of T are smooth, and thereby belong to a small space of functions that are well approximated with a finite model of computation. We refer the curious reader to Cabannes et al. (2021a) for results in this vein when I−T is the real Laplacian in L2.

Random noise | Attenuates higher-order Fourier modes
Cropping | Keeps Fourier modes within the cropping windows
Translations | Biases towards Fourier modes with cyclic invariance
Flipping | Equates eigenvectors of subsets related by flips
Legend of the illustration: −1 bit, +1 bit, flipped bit, random bit.

Reduce the model capacity, through regularization (e.g., early stopping) or simpler architectures (e.g., a shallow CNN instead of an MLP). As a consequence, Ψ will have a lower effective dimension, and K will encourage “simpler” representations that can be learned with less data, even without any data augmentation.

Use stronger augmentations. T will become more compact, reducing kλ, the dimension of the “positive eigenspace” of Tλ. The ideal ψ will exhibit more structure, so its search can be restricted to smaller spaces, making it harder for representations to collapse.

with S:ℋ→L2​(μΞ); θ↦fθ the embedding of ℋ in L2​(μΞ), where μΞ denotes the marginal distribution over augmentations. As a consequence, a minimizer Θ∗ of the regularized loss ℒ(·;λ) is such that S​Θ∗⊤ matches the eigenvalue decomposition of Tλ on positive eigenvalues up to the k-th.

Assume that (X,Y)↦Y belongs to L2​(ρ).

For t=‖(Σ+γ)−1/2​(Σ−Σ^)​(Σ+γ)−1/2‖op and M such that ‖ψ​(X)‖≤M almost everywhere,

Using the fact that S is isometric to Σ1/2, which itself is smaller than Σγ1/2 (in the Loewner order), we have

Moreover, this matrix part is bounded by γ−2​M4.

In essence, we have two random variables: the matrix one, X=‖Σγ−1/2​(Σ^−Σ)​Σγ−1/2‖op2, and the vector one, Y=‖Σγ−1/2​S^⊤​(I−Πℱ^)​f∗‖2. We proceed using the fact that, for X positive, 𝔼⁡[X]=∫t>0ℙ⁡(X>t)​dt, and that a​b>t implies, for any s, that a>1+s or b>t/(1+s),

As a consequence, the constant a appearing earlier is smaller than (1+M)​σ.

For Θ^∈ℝk⊗ℋ, and ℱ^={x↦w⊤​Θ^​φ​(x) | w∈ℝk},

Under Assumptions 8 and 9, with ℱl the span of the (fi)i∈[l]

Taking the minimum on the right-hand side and using the fact that ζ is increasing leads to

When ℱ is of dimension k and ℱ^ is of dimension k′ we have

We conclude by remarking that ∑i≤lci2ai2=‖T~λ−1​Πℱl(μΞ)​f∗‖L2​(μΞ)2.




A one-hidden-layer fully connected network

Figure 6 considers a classification problem involving four classes with a pretraining task specifically constructed to design a representation ψ:𝒳→ℝk for k=4 that solves this particular classification problem. The dataset we consider is the halfmoon dataset, where X=Z+𝟏⟨Z,e1⟩>0​e2+U, Z∼𝒰​(𝕊2), and U∼𝒩​(0,σ2​I) for σ=0.1. Augmentations apply Gaussian noise, ξ=X+V for V∼𝒩​(0,σ2​I) with σ=0.1. This setting corresponds to that with a Laplacian where ℒ​(ψ)≃‖∇ψ‖L2​(ρ𝒳)2. As a consequence, the ideal ψ will correspond to the top eigenvalues of the Laplacian. I.e., the first two span the constant functions on both moons, the next two are waves with a single oscillation on a given moon, etc. In essence, one can view the harmonics on L2​([0,1]) as x→cos⁡(2​i​π​ω​x+χ) for χ∈{0,π/2} and ω∈ℕ, deforming the segment [0,1] to match one moon, and duplicating this basis on the other moon. In this setting, eigenfunctions are not analytic, since analytic functions cannot be dissociated on two different manifolds (e.g., a locally constant analytic function is globally constant). As a consequence, searching for the eigenfunctions with the radial basis function kernel (Ψ only contains analytic functions in this case (Sun & Zhou, 2008)) requires proper tuning of the regularization parameter as a function of the number of samples. This explains our choice of the exponential kernel in this experiment, which corresponds to φ​(x)⊤​φ​(y)=exp⁡(−‖x−y‖/σ) and is associated with a looser Sobolev space that is still a reproducing kernel Hilbert space (in ℝ2, this is H1). This improves the learning of the top eigenfunctions of T without varying λ, better illustrating the convergence rates of Theorems 1, 2 and 3.
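For reference, the data and augmentations of this experiment can be generated in a few lines of NumPy (a sketch under our reading of the setup; interpreting 𝒰(𝕊2) as the uniform distribution on the unit circle in ℝ2 is our assumption):

```python
import numpy as np

def sample_halfmoons(n, sigma=0.1, seed=0):
    # X = Z + 1_{<Z, e1> > 0} e2 + U, with Z uniform on the unit circle
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    Z = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    X = Z.copy()
    X[:, 1] += (Z[:, 0] > 0)                   # shift the right half upwards
    X += sigma * rng.standard_normal((n, 2))   # U ~ N(0, sigma^2 I)
    return X

def augment(X, sigma=0.1, seed=1):
    # Gaussian-noise augmentation: xi = X + V, V ~ N(0, sigma^2 I)
    rng = np.random.default_rng(seed)
    return X + sigma * rng.standard_normal(X.shape)

X = sample_halfmoons(1000)
xi = augment(X)
assert X.shape == xi.shape == (1000, 2)
```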

In our experiments, we fixed λ = 10−3 and the scale of the exponential kernel σ to be about one fifth of the problem diameter. We plot the eigenfunctions of T derived empirically with npre = 2000 samples in Figure 13. The classification task aims to learn the four classes described on the left of Figure 12. Class labels include some noise, as indicated by the level lines of the conditional probability of Y as a function of X shown in the middle of Figure 12. A training set example is shown on the right of this figure with ndown = 100. In the experiments we fix k = 5, which ensures that there is strong correlation in performance between the pretraining and downstream tasks. The downstream task is optimized with a least-squares surrogate: we learn g:𝒳→ℝ4 that minimizes the least-squares error 𝔼[‖g(X)−eY‖2] before decoding it as f(X)=argmaxi∈[4] gi(X) to get an estimate of the ideal mapping f∗:𝒳→𝒴. We report the downstream generalization error for both the least-squares (surrogate) loss and the 0-1 loss in Figure 14. This error is computed as the average over 100 trials on the pretraining task and 200 trials on the downstream task.

Table: S2.T1: Analogy between practice and theory that this paper proposes to help disentangle the various phenomena of SSL training.

Practice | Theory | Quantity
Augmentation | Spectral embedding | T
Architecture | Space of functions | K
Optimization | Regularization | λ
Subtle Interplay

Table: S4.SS2.26: Effect of common augmentations on the optimal representation ψ through the operator T. Without augmentations, ψ could match any Fourier basis function. Augmentations filter out some of those by attenuating their eigenvalues in T, and the architecture will push ψ to pick some specific frequencies among the remaining ones through the operator K. The table stylizes the effect of usual augmentations on parity functions over bit streams. We refer the reader to Appendix D for further details and derivations.

Augmentation example | Effect of the operator T
Input (no augmentation) | All Fourier modes preserved

Illustration of the interplay between T and K as a function of λ, where K is the NTK of a 2-layer ReLU network and T performs crops of window size 8 on 12-bit inputs. Here we plot the eigenvalues of three different parity functions in the eigenbasis of both operators. Parity functions with large diameters have smaller eigenvalues for T (here, the parity function with the largest diameter is χ{1,6}​(X)=X1​X6). Eigenvalues of K, in contrast, favor parities supported on fewer bits. Therefore, small regularization biases towards parities with small diameter, whereas added regularization penalizes parities with high cardinality.

Behavior of Figure 5 with a neural network. The regularization parameter λ is replaced by early stopping of SGD. We consider a neural network with two hidden layers, both made of 200 neurons. Optimization was performed with gradient descent with a constant step size. Randomness due to weight initialization is averaged over 100 trials, the standard deviation being shown in the figure.

Setting of Figure 6. The downstream task consists in learning four classes in 𝒳=ℝ2, which are represented on the left. Those classes are generated with noise. The level lines of the conditional distribution of Y given X are represented in the middle for the left moon; the right moon follows the same structure. A training set example is on the right.

Eigenvalues of Tλ estimated empirically with 2000 pretraining samples on the problem that yields the empirical rates displayed in Figure 6.

Averaged downstream error computed over 100 trials on the pretraining task and 200 trials on the downstream task, for both the least-squares loss (right) and the 0-1 loss (left).

$$ \|\Sigma_\gamma^{1/2}\hat\Sigma_\gamma^{-1}\Sigma_\gamma^{1/2}\|_{\operatorname{op}} \leq \gamma^{-1}(\|\Sigma\|_{\operatorname{op}} + \gamma) \leq \gamma^{-1} \big(\sup_{x\in\operatorname{supp}\rho_{\mathcal{X}}} \|\psi(x)\|^2 + \gamma\big) $$

$$ \mathbb{E}\Big[\min\Big\{\frac{1}{1-X}, 1 + \frac{(M^2 + \gamma)X}{\gamma}\Big\}^2 Y^2\Big] \leq \mathbb{P}\Big(\frac{1}{(1-X)^2} > 2\Big) \sup\big((1+\gamma^{-1}(M^2 + \gamma)X)^2 Y^2\big) + 2\,\mathbb{E}[Y^2]. $$

$$ \sum_{i\leq k}\left|T_{+}^{1/2}g_{i}\right|^{4}=\sum_{i\leq k}(g_{i}^{\top}T_{+}g_{i})^{2}=\sum_{i\leq k}\left(\sum_{j\leq k_{\lambda}}\lambda_{j}\left\langle g_{i},f_{j}\right\rangle^{2}\right)^{2}=\sum_{j,m\leq k_{\lambda}}\lambda_{j}\lambda_{m}\sum_{i\leq k}\left\langle g_{i},f_{j}\right\rangle^{2}\left\langle g_{i},f_{m}\right\rangle^{2}=\lambda^{\top}U^{\top}U\lambda. $$


$$ h(\left\langle u,v\right\rangle/q)=\mathbb{E}_{w\sim\cal W}\left[\sigma(\left\langle u,w\right\rangle/\sqrt{q})\,\sigma(\left\langle v,w\right\rangle/\sqrt{q})+\sigma^{\prime}(\left\langle u,w\right\rangle/\sqrt{q})\,\sigma^{\prime}(\left\langle v,w\right\rangle/\sqrt{q})\cdot\left\langle u,v\right\rangle/q\right]. $$

$$ \displaystyle{\cal L}(\psi) $$

$$ \displaystyle\omega(\psi) $$

$$ \displaystyle=2\beta\sum_{i\in[k]}\left\langle e_{i}^{\top}\psi,(I-T)\psi^{\top}e_{i}\right\rangle+\Big\|\mathbb{E}_{\xi}\Big[\sum_{i,j\in[k]}e_{i}^{\top}\psi(\xi)\psi(\xi)^{\top}e_{j}\,e_{i}e_{j}^{\top}\Big]-I\Big\|^{2} $$

$$ \displaystyle\geq\sum_{i=1}^{k}\|B_{i}\|_{\operatorname{op}}^{2}-2\|B_{i}\|_{\operatorname{op}}\|\Pi_{B_{i}}A_{+}\|_{\operatorname{op}}\geq\sum_{i=1}^{k}\|B_{i}\|_{\operatorname{op}}^{2}-2\|B_{i}\|_{\operatorname{op}}\Big\|\prod_{j<i}(I-\Pi_{B_{j}})A_{+}\Big\|_{\operatorname{op}} $$

$$ \displaystyle\geq-\sum_{i=1}^{k}\Big\|\prod_{j<i}(I-\Pi_{B_{j}})A_{+}\Big\|_{\operatorname{op}}^{2}\geq-\sum_{i=1}^{k}\sigma_{i}(A_{+}) $$

$$ \displaystyle\qquad\qquad\qquad\qquad+\Sigma^{1/2}\Theta^{\top}\Theta\Sigma\Theta^{\top}\Theta\Sigma^{1/2}+2\lambda\Sigma^{-1}\Sigma^{1/2}\Theta^{\top}\Theta\Sigma^{1/2}\big) $$

$$ \left\langle f_{\theta_{i}},f_{\theta_{j}}\right\rangle_{L^{2}(\mu_{\Xi})}=\sqrt{\max(\lambda_{i},0)\max(\lambda_{j},0)}\,u_{i}^{\top}u_{j}=\sqrt{\max(\lambda_{i},0)\max(\lambda_{j},0)}\,\delta_{ij}. $$

$$ \displaystyle\leq\|\Sigma_{\mu_{\Xi}}^{-1/2}\Sigma_{\rho_{\mathcal{X}}}\Sigma_{\mu_{\Xi}}^{-1/2}\|_{\operatorname{op}}\,\|\Sigma_{\mu_{\Xi}}^{1/2}\theta\|_{\cal H}^{2}=\|\Sigma_{\mu_{\Xi}}^{-1/2}\Sigma_{\rho_{\mathcal{X}}}\Sigma_{\mu_{\Xi}}^{-1/2}\|_{\operatorname{op}}\,\|f_{\theta}\|_{L^{2}(\mu_{\Xi})}^{2}. $$

$$ \displaystyle\quad\Leftrightarrow\quad-tA\preceq\hat{A}-A\preceq tA $$

$$ \displaystyle=\int_{t\in(0,\sup(1+\gamma^{-1}(M^{2}+\gamma)X)^{2}Y^{2})}\mathbb{P}\Big(\min\Big\{\frac{1}{1-X},1+\frac{(M^{2}+\gamma)X}{\gamma}\Big\}^{2}Y^{2}>t\Big)\,\mathrm{d}t $$

$$ \displaystyle\leq 4\int_{t>0}\exp\Big(-\frac{nt}{2ab}\Big)+\exp\Big(-\frac{nt^{1/2}}{4aM\gamma^{-1/2}/3}\Big)\,\mathrm{d}t $$

$$ \displaystyle\leq\|(\Pi_{{\cal F}_{l}}^{(\rho_{\mathcal{X}})}-\Pi_{{\cal F}_{l}}^{(\mu_{\Xi})})f^{*}\|_{L^{2}(\rho_{\mathcal{X}})}+\|(I-\Pi_{\hat{\cal F}}^{(\rho_{\mathcal{X}})})\Pi_{{\cal F}_{l}}^{(\mu_{\Xi})}f^{*}\|_{L^{2}(\rho_{\mathcal{X}})}. $$

$$ \displaystyle\operatorname{\mathbb{E}}{{\cal D}{n}}[{\cal L}(S\Theta_{n});\lambda]-{\cal L}(S\Theta;\lambda)\leq 8\operatorname{\mathbb{E}}{{\cal D}{n},\sigma}\left[\sup_{\Lambda}\frac{1-\beta}{n}\sum_{i\in[n]}\sigma_{i}\left\langle\Lambda,\frac{1}{m}\sum_{j\in[m]}\varphi(\xi_{ij})\varphi(\xi_{ij})^{\top}\right\rangle\right] $$

$$ \displaystyle\operatorname{\mathbb{E}}\left[\left|\frac{1}{m}\sum_{i\in[m]}\varphi(\xi_{1i})\varphi(\xi_{1i})^{\top}-\operatorname{\mathbb{E}}[\varphi(\xi)\varphi(\xi)^{\top}]\right|^{2}\right]=\operatorname{\mathbb{E}}\left[\left|\frac{1}{m}\sum_{i\in[m]}\varphi(\xi_{1i})\varphi(\xi_{1i})^{\top}-\operatorname{\mathbb{E}}[\varphi(\xi)\varphi(\xi)^{\top},\middle|,X=X_{1}]\right|^{2}\right] $$

$$ \displaystyle\qquad\qquad+\operatorname{\mathbb{E}}\left[\left|\operatorname{\mathbb{E}}\left[\left\langle\Lambda,\varphi(\xi)\varphi(\xi^{\prime})^{\top}\right\rangle\varphi(\xi)\varphi(\xi^{\prime})^{\top},\middle|,X,X^{\prime}\right]-\operatorname{\mathbb{E}}[\left\langle\Lambda,\varphi(\xi)\varphi(\xi^{\prime})^{\top}\right\rangle\varphi(\xi)\varphi(\xi^{\prime})^{\top}]\right|^{2}\right] $$

$$ \displaystyle=\frac{1}{m^{2}}\operatorname{\mathbb{E}}{X}\operatorname{\mathbb{E}}{\xi}\left[\left|\left\langle\Lambda,\varphi(\xi)\varphi(\xi^{\prime})^{\top}\otimes\varphi(\xi)\varphi(\xi^{\prime})^{\top}-\operatorname{\mathbb{E}}[\varphi(\xi)\varphi(\xi^{\prime})^{\top}\otimes\varphi(\xi)\varphi(\xi^{\prime})^{\top},\middle|,X,X^{\prime}]\right\rangle\right|^{2},\middle|,X,X^{\prime}\right] $$

$$ \displaystyle\leq\frac{1}{m^{2}}\left|\Lambda\right|^{2}\operatorname{\mathbb{E}}{X}\operatorname{\mathbb{E}}{\xi}\left[\left|\varphi(\xi)\varphi(\xi^{\prime})^{\top}\otimes\varphi(\xi)\varphi(\xi^{\prime})^{\top}-\operatorname{\mathbb{E}}[\varphi(\xi)\varphi(\xi^{\prime})^{\top}\otimes\varphi(\xi)\varphi(\xi^{\prime})^{\top},\middle|,X,X^{\prime}]\right|^{2},\middle|,X,X^{\prime}\right] $$

$$ \displaystyle\sigma_{\xi,3}^{2}=\operatorname{\mathbb{E}}{X}\operatorname{\mathbb{E}}{\xi}\left[\left|\varphi(\xi)\varphi(\xi^{\prime})^{\top}\otimes\varphi(\xi)\varphi(\xi^{\prime})^{\top}-\operatorname{\mathbb{E}}[\varphi(\xi)\varphi(\xi^{\prime})^{\top}\otimes\varphi(\xi)\varphi(\xi^{\prime})^{\top},\middle|,X,X^{\prime}]\right|^{2},\middle|,X,X^{\prime}\right] $$

$$ \displaystyle=\operatorname{\mathbb{E}}{X}\left[\operatorname{\mathbb{E}}{y}\left[\prod_{i\in S\vartriangle S^{\prime}}X_{i}y_{i}\right]\right]\operatorname{\mathbb{E}}{y,y^{\prime}}\left[\prod{i\in S\cap S^{\prime}}y_{i}y_{i}^{\prime}\right] $$

$$ \displaystyle=\frac{1}{d^{2}}\sum_{a,b=1}^{d}\operatorname{\mathbb{E}}_{X,\nu,\nu^{\prime}}\left[\prod_{i\in S\cap[a,a+w)}x_{i}\prod_{i^{\prime}\in S\backslash[a,a+w)}\nu_{i^{\prime}}\prod_{j\in S^{\prime}\cap[b,b+w)}x_{j}\prod_{j^{\prime}\in S^{\prime}\backslash[b,b+w)}\nu_{j^{\prime}}^{\prime}\right] $$

$$ \displaystyle=\frac{1}{d^{2}}\sum_{a,b=1}^{d}\mathbf{1}_{S\subseteq[a,a+w)}\,\mathbf{1}_{S^{\prime}\subseteq[b,b+w)}\,\operatorname{\mathbb{E}}_{X}[\chi_{S}(X)\chi_{S^{\prime}}(X)]=\frac{1}{d^{2}}\sum_{a,b=1}^{d}\mathbf{1}_{S\subseteq[a,a+w)}\,\mathbf{1}_{S^{\prime}\subseteq[b,b+w)}\,\delta_{S,S^{\prime}} $$

$$ \displaystyle=\frac{1}{2^{d}}\sum_{x\in\mathcal{X}}p(a)f(a\cdot x)g(a^{-1}\cdot a\cdot x)=\frac{1}{2^{d}}\sum_{x\in\mathcal{X}}p(a)f(x)g(a^{-1}\cdot x) $$

Lemma 2 (Spectral embedding). There exists a linear positive symmetric operator L in L2 for which the operator I−L/2 is also positive, and 𝔼X⁡𝔼ξ,ξ′⁡[‖ψ​(ξ)−ψ​(ξ′)‖2|X]=∑i∈[k]ψi⊤​L​ψi. To be consistent with previous literature, we will rather use T=I−L/2, which is also a linear positive symmetric operator, and is defined, for ψ1,ψ2∈L2, by ψ1⊤​T​ψ2=𝔼X⁡𝔼ξ,ξ′⁡[ψ1​(ξ)⊤​ψ2​(ξ′)|X]. As a consequence, if (λi) are the eigenvalues of T and (fi) the corresponding eigenvectors, a minimizer of ℒ is ψi=μi​fi with μi=√(1−β+β​λi).

Proposition 4. If T and K commute, and if (λi) are the eigenvalues of T and (fi) its eigenfunctions, then there exist (θi) such that fi=fθi (4). Moreover, the optimal representations that minimize the regularized loss are the fi that maximize β​λi−λ​‖θi‖2. In other terms, the regularization biases the representation towards functions that have a small complexity with respect to the model of computation.

Assumption 1 (Low expansion). There exists cr>0 such that for any function f in the original space of functions Ψ defined in (4), ‖f‖L2​(ρ𝒳)2≤cr​‖f‖L2​(μΞ)2.

Assumption 2. For any i smaller than the number of positive eigenvalues of Tλ, the projection of the target f∗ on fi in L2​(μΞ) coincides with its projection on fi in L2​(ρ𝒳).

Example 4. Let Στ=𝔼X∼τ⁡[φ​(X)​φ​(X)⊤] be the covariance matrix of φ under the distribution τ. When there exists c such that Σρ𝒳⪯c​ΣμΞ (i.e., c​ΣμΞ−Σρ𝒳 is positive semi-definite), Assumption 1 holds with cr=c.

Assumption 3 (Source condition). f∗ belongs to the positive eigenspace of Tλ, i.e., f∗∈Span⁡{fi | λi>0}.

Theorem 2 (Empirical risk minimizer). Let Θn∈ℝk⊗ℋ be the minimizer of the unbiased regularized empirical version of ℒ based on a dataset 𝒟n. Assume that 𝒟n is built from n input samples (Xi)∼μ𝒳⊗n and m augmentations per sample (ξi​j)∼μ|Xi⊗m; then the average excess risk is bounded by 𝔼𝒟n⁡[ℒ​(S​Θn)]−ℒ​(S​Θ)≤(12​κ2​k/(λ​√n))​(1+κ2​k/λ), (7) where κ is a bound on ‖φ​(X)‖.

Theorem 3 (Sharper bounds). There exists an implementable algorithm that guarantees an average excess risk 𝔼𝒟n⁡[ℒ​(S​Θn)]−ℒ​(S​Θ) ≤3​κ2​cλ​cλ′​σX2/n+σξ2/(n​m)+4​κ6​cλ2/n, (8) where cλ=1+κ2​kλ/λ, cλ′=1+kλ2/λ2, kλ is the number of positive eigenvalues of Tλ, κ is a bound on ‖φ‖, σX relates to the variance of 𝔼⁡[ψ​(ξ)|X], and σξ relates to the average variance of (ξ|X). Moreover, when K=S​S⊤ or the covariance of the φ​(ξ) has a finite number of positive eigenvalues (e.g., 𝒳 finite or ℋ finite dimensional), with cK a constant that relates to the condition number of K, this bound can be tightened to 𝔼𝒟n⁡[ℒ​(S​Θn)]−ℒ​(S​Θ)≤4​cK2​cλ2/n. (9)

Assumption 4. Assume that T has a pure point spectrum.

Example 7. When the distribution of augmentations has a density p with respect to a reference measure and (x,ξ)↦p(ξ|x)/p(ξ) is in L2​(μ), or when 𝒳 is finite, T can be shown to be a compact operator, hence to have a pure point spectrum according to the spectral theorem.

Proposition 7 (Uniqueness of minimizers). The minimizers of ℒ are unique up to orthogonal transformations and eigenfunction picking. More specifically, if U∈ℝk×k is orthogonal, i.e., U⊤​U=I, then ℒ​(ψ)=ℒ​(U​ψ); and if λk=λk+1, one can choose different eigenfunctions as fk in the eigendecomposition (λi,fi) of Tβ.

Lemma 10. For (θi)∈ℋk and fθ:x↦⟨φ​(x),θ⟩, and a regularizer λ∈ℝ, ℒ​((fθi)i∈[k])+λ​∑i∈[k]‖θi‖22=Tr⁡((Σ1/2​(∑i∈[k]θi​θi⊤)​Σ1/2−A)2−A2)+k, with A and Σ being operators on ℋ defined as A=Σ−1/2​((1−β)​Σ+β​ΣX−λ​I)​Σ−1/2, Σ=𝔼ξ⁡[φ​(ξ)​φ​(ξ)⊤], ΣX=𝔼X⁡[𝔼ξ,ξ′⁡[φ​(ξ)​φ​(ξ′)⊤|X]]. As a consequence, a minimizer Θ∗ of ℒ is such that Θ∗ matches the eigenvalue decomposition of A on positive eigenvalues up to the k-th. Formally, if A=∑i∈ℕλi​ui⊗ui with ui∈ℋ and (λi) in decreasing order, then Θ∗=(θi)i∈[k], with θi=√(max⁡(λi,0))​Σ−1/2​ui. Moreover, (fθi) are orthogonal in L2​(μΞ), where μΞ denotes the marginal distribution over augmentations.
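In finite dimension, Lemma 10 gives a directly computable recipe for the optimal representation. The following sketch is ours: the toy data, the variable names, and the choice λ = 0 (which avoids the regularized form of A) are all assumptions made for illustration. It builds Σ and ΣX from a small discrete set of augmented features, forms A, extracts the closed-form Θ∗, and checks numerically that it does at least as well as random representations of the same rank.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, beta = 6, 2, 0.7           # feature dim, representation dim, invariance weight
n_x, m = 5, 3                    # number of inputs X, augmentations per input

# Toy augmentation features phi(xi) in R^p, grouped by the input they come from.
phis = rng.standard_normal((n_x, m, p))

# Sigma   = E_xi[phi phi^T];  Sigma_X = E_X[E_{xi,xi'|X}[phi(xi) phi(xi')^T]].
Sigma = np.einsum('xmp,xmq->pq', phis, phis) / (n_x * m)
mean_per_x = phis.mean(axis=1)                 # E[phi(xi) | X]
Sigma_X = mean_per_x.T @ mean_per_x / n_x

def loss(Lam):
    """L(S Theta) - k for Lambda = Theta^T Theta, written with exact traces, cf. (34)."""
    return (2 * (beta - 1) * np.trace(Lam @ Sigma)
            - 2 * beta * np.trace(Lam @ Sigma_X)
            + np.trace(Lam @ Sigma @ Lam @ Sigma))

# A = Sigma^{-1/2} ((1 - beta) Sigma + beta Sigma_X) Sigma^{-1/2}   (lambda = 0 here).
w, V = np.linalg.eigh(Sigma)
Sigma_inv_half = V @ np.diag(w ** -0.5) @ V.T
A = Sigma_inv_half @ ((1 - beta) * Sigma + beta * Sigma_X) @ Sigma_inv_half
evals, U = np.linalg.eigh(A)
order = np.argsort(evals)[::-1]                # decreasing eigenvalues
evals, U = evals[order], U[:, order]

# theta_i = sqrt(max(lambda_i, 0)) Sigma^{-1/2} u_i for the top-k eigenpairs.
Theta = np.stack([np.sqrt(max(evals[i], 0.0)) * (Sigma_inv_half @ U[:, i])
                  for i in range(k)])
Lam_star = Theta.T @ Theta

# The closed form should beat random rank-k representations of the same shape.
best_random = min(loss((T := rng.standard_normal((k, p))).T @ T) for _ in range(200))
print(loss(Lam_star) <= best_random)
```

The check relies on the fact that, for λ = 0, minimizing ℒ over rank-k representations amounts to the best positive semi-definite rank-k approximation of A in Frobenius norm.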

Lemma 12. For Θ∈ℝk⊗ℋ and a regularizer λ∈ℝ, ℒ​(S​Θ)+λ​‖Θ‖22=Tr⁡((S​Θ⊤​Θ​S⊤−Tλ)2−Tλ2)+k, where T=S−⊤​ΣX​S−1, Tλ=(1−β)​I+β​T−λ​K−1, K=S​S⊤, with S:ℋ→L2​(μΞ); θ↦fθ the embedding of ℋ in L2​(μΞ), where μΞ denotes the marginal distribution over augmentations. As a consequence, a minimizer Θ∗ of ℒ​(⋅;λ) is such that S​Θ∗⊤​Θ∗​S⊤ matches the eigenvalue decomposition of Tλ on positive eigenvalues up to the k-th.

Lemma 18. For t=‖(Σ+γ)−1/2​(Σ−Σ^)​(Σ+γ)−1/2‖op and M such that ‖ψ​(X)‖≤M almost everywhere, ‖S​(Σ^+γ)−1​S^⊤​(I−Πℱ^)​f∗‖L2​(ρ𝒳)≤min⁡{1/(1−t), 1+t⋅(M2+γ)/γ}​‖Σγ−1/2​S^⊤​(I−Πℱ^)​f∗‖. (21)

Lemma 21. Retaining the notation of the previous lemma, 𝔼(Xi)⁡‖S​(Σ^+γ)−1​S^⊤​(I−Πℱ^)​f∗‖L2​(ρ𝒳)2 ≤ k​exp⁡(−3​n​γ/((3+√2)​M2))​(γ−4​M6​a2​(M2+2​γ))2 + 16​a​b/n + 512​a2​M2/(9​γ​n2).

Lemma 22 (Simplifying constants). The constants in the previous bound can be worked out as Tr⁡(Σ​(Σ+γ)−1)≤k, M≤√(k/λ)​sup‖φ‖, ⟨Πℱ^​f∗,Σ​(Σ+γ)−1​Πℱ^​f∗⟩L2​(ρ𝒳)≤‖f∗‖L2​(ρ𝒳). We also have ‖f∗‖L2​(ρ𝒳)≤‖f∗‖L∞​(ρ𝒳)≤σ and ε2≤σ2, where σ2=supx​𝔼⁡[Y2|X=x]. As a consequence, the constant a appearing earlier is smaller than (1+M)​σ.

Lemma 23. Under Assumption 7, when γ=M2​log⁡(n)1+δ/n with δ>0, there exists an N>0 such that for any n>N, the excess risk of the regularized empirical risk minimizer (19) satisfies 𝔼(Xi,Yi)⁡[ℛ​(fn)−ℛ​(f∗)]≤2​ke​ε2/n+8​M2​log⁡(n)1+δ/n​‖f∗‖L2​(ρ𝒳)+64​k​a/n+2​‖(I−Πℱ^)​Πℱ​f∗‖2+‖(I−Πℱ)​f∗‖2, (25) where ke=Tr⁡(Σ​(Σ+γ​I)−1)≤k is the effective dimension, a=‖(I−Πℱ^)​f∗‖L∞≤‖f∗‖L∞+M​‖f∗‖L2, and M=sup‖ψ‖≤√(k/λ)​sup‖φ‖.

Lemma 24 (Transfer bound). For Θ^∈ℝk⊗ℋ and ℱ^={x↦w⊤​Θ^​φ​(x) | w∈ℝk}, ∑i∈[k]λi2​‖(Πℱ(μΞ)−Πℱ^(μΞ))​fi‖L2​(μΞ)2−∑k<i≤kλλi2​‖Πℱ^(μΞ)​fi‖L2​(μΞ)2≤ℒ​(Θ^;λ)−ℒ​(Θ∗;λ), (26) where Πℱ(τ) is the orthogonal projection onto ℱ in L2​(τ).

Definition 25 (Distribution ε-robustness). A closed convex set of functions ℱ is said to be ε-robust to distribution shift conditionally on the function f if ‖Πℱ(ρ𝒳)​f−Πℱ(μΞ)​f‖L2​(ρ𝒳)≤ε​‖f‖L2​(ρ𝒳), where Πℱ(τ) is the orthogonal projection onto ℱ in L2​(τ).

Lemma 26 (Decomposition). Under Assumptions 8 and 9, with ℱl the span of the (fi)i∈[l], ‖(I−Πℱ^(ρ𝒳))​Πℱl(ρ𝒳)​f∗‖L2​(ρ𝒳)≤σ​(l)+ζ​(∑i≤l|⟨f∗,fi⟩L2​(μΞ)|​‖(Πℱl(μΞ)−Πℱ^(μΞ))​fi‖L2​(μΞ)). (27)

Lemma 28. Under Assumptions 8 and 9, with ℱl the span of the first l eigenfunctions of Tλ, ‖(I−Πℱ^(ρ𝒳))​f∗‖L2​(ρ𝒳)2 ≤ infl≤k​‖(I−Πℱl(ρ𝒳))​f∗‖L2​(ρ𝒳)2+4​σ​(l)2+4​ζ2​(‖T~λ−1​Πℱl(μΞ)​f∗‖L2​(μΞ)​(ℒ​(Θ^;λ)−ℒ​(Θ;λ))1/2), (31) where T~λ=∑i∈[k]λi1/2​fi​fi⊤. Moreover, when the search for ℱ^ is done without rank restriction on Θ, before thresholding to reduce ℱ^ to a space of dimension k, under the strong Assumptions 1 and 2, as well as Assumption 3, ‖(I−Πℱ^k)​f∗‖2≤|k−kλ|​‖f∗‖L2​(ρ𝒳)2+2​cr​‖Tλ−1​f∗‖L2​(μΞ)2​{ℒ​(Θ^;λ)−ℒ​(Θ∗;λ)}. (32)

Theorem 4. Under Assumptions 3, 7, 8 and 9, there exists a regularizer γ such that the regularized empirical risk minimizer verifies the following: for any δ>0, there exists an Nδ>0 such that for any n>Nδ, the excess risk of the regularized empirical risk minimizer (19) satisfies ℛ​(f)−ℛ​(f∗)≤2​ke​ε2/n+8​M2​log⁡(n)1+δ/n​‖f∗‖L2​(ρ𝒳)+64​k​a/n +infl≤k​‖(Πℱkλ−Πℱl(ρ𝒳))​f∗‖L2​(ρ𝒳)2+4​σ​(l)2+4​ζ2​(‖T~λ−1​Πℱl(μΞ)​f∗‖L2​(μΞ)​(ℒk​(Θ^;λ)−ℒk​(Θ;λ))1/2), (33) where ℱl is the span of the first l eigenfunctions of Tλ, kλ is the number of strictly positive eigenvalues of Tλ, ke≤k is the effective dimension of ψ in L2​(ρ𝒳), a=‖(I−Πℱ^)​f∗‖L∞≤‖f∗‖L∞+M​‖f∗‖L2, M=sup‖ψ‖≤√(k/λ)​sup‖φ‖, and T~λ=∑i∈[k]λi1/2​fi​fi⊤. Moreover, under the sole Assumptions 1 and 2, we have the simpler bound ℛ​(f)−ℛ​(f∗) ≤2​ke​ε2/n+8​M2​log⁡(n)1+δ/n​‖f∗‖L2​(ρ𝒳)+64​k​a/n+max⁡(k−kλ,0)​‖f∗‖L2​(ρ𝒳)2 +2​cr​‖Tλ−1​Πℱλ​f∗‖L2​(μΞ)2​{ℒkλ​(Θ^;λ)−ℒkλ​(Θ∗;λ)}+‖(I−Πℱλ)​f∗‖L2​(μΞ), where Θ^ is understood as belonging to ℝkλ⊗ℋ in this last expression and ℱλ is the eigenspace linked with the positive eigenvalues of Tλ.

Lemma 29 (Relating capacity between K and Tλ). If (μi) are the eigenvalues of K, then the number of eigenvalues of Tλ that are bigger than t∈ℝ is smaller than the cardinality of {i | μi>λ/(1−t)}. Moreover, if there exists q>0 such that Tr⁡(K1/q)<+∞, then there exists a cq such that μi≤cq​i−q. As a consequence, in this setting, for any t∈ℝ, the number of eigenvalues of Tλ that are bigger than t is smaller than (cq​(1−t)/λ)1/q.

Lemma 30. Let Θ∈ℝk⊗ℋ and denote Λ=Θ⊤​Θ∈ℋ⊗ℋ; then ℒ​(S​Θ)=2​(β−1)​𝔼ξ⁡[⟨Λ,φ​(ξ)​φ​(ξ)⊤⟩]−2​β​𝔼X⁡𝔼ξ,ξ′⁡[⟨Λ,φ​(ξ′)​φ​(ξ)⊤⟩|X]+𝔼ξ,ξ′⁡[⟨Λ,φ​(ξ)​φ​(ξ′)⊤⟩2]+k. (34) Moreover, the regularization reads λ​‖Θ‖2=λ​Tr⁡Λ=λ​⟨Λ,I⟩.

Lemma 31. Let ℛ​(ζ)=𝔼Z⁡[ℓ​(ζ,Z)], let ζ∗ be the minimizer of ℛ inside a domain for ζ, and let ζn be the minimizer of ℛ(Zi)​(ζ)=(1/n)​∑i∈[n]ℓ​(ζ,Zi) based on exchangeable data Zi such that 𝔼(Zi)⁡[ℛ(Zi)]=ℛ. The average excess risk of ζn is bounded via Rademacher complexity as ℛ​(ζn)−ℛ​(ζ∗)≤4​𝔼(Zi),(σi)⁡[supζ​(1/n)​∑i=1nσi​ℓ​(ζ,Zi)], (35) where the σi are i.i.d. variables taking values one and minus one with probability one half.
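Lemma 31 can be made concrete with a class where the supremum in (35) is available in closed form. Below is a minimal Monte-Carlo sketch (the loss ℓ(ζ, z) = ⟨ζ, z⟩ over the unit ball ‖ζ‖ ≤ 1 is our illustrative choice, not the paper's loss); for this class the supremum equals the norm of the signed average of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 200, 5, 2000

# Fixed sample Z_1, ..., Z_n; for l(zeta, z) = <zeta, z> with ||zeta|| <= 1,
# sup_zeta (1/n) sum_i sigma_i <zeta, Z_i> = || (1/n) sum_i sigma_i Z_i ||.
Z = rng.standard_normal((n, d))
sups = []
for _ in range(trials):
    sigma = rng.choice([-1.0, 1.0], size=n)    # i.i.d. Rademacher signs
    sups.append(np.linalg.norm(sigma @ Z / n))
rad = float(np.mean(sups))                     # empirical Rademacher complexity
print(rad)
```

By Jensen's inequality, this quantity is at most √(Σᵢ‖Zᵢ‖²)/n, so it decays like O(1/√n), which is the rate Theorem 2 inherits from this lemma.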

Lemma 37. An unbiased formulation of ℒ is based on ℓ defined through its gradient: ∇Λℓ​(S​Θ;λ) = (2​(β−1)/m)​∑j∈[m]φ​(ξ1​j)​φ​(ξ1​j)⊤−(2​β/(m​(m−1)))​∑1≤j≠k≤mφ​(ξ1​j)​φ​(ξ1​k)⊤ +(1/m2)​∑i,i′∈[2]∑j,k∈[m]⟨Λ,φ​(ξi​j)​φ​(ξi′​k)⊤⟩​φ​(ξi​j)​φ​(ξi′​k)⊤. (42) Moreover, when ℒ is regularized, one adds +λ​I to obtain a gradient of the regularized risk.

Lemma 38. For ℓ given in (37), bounds on the gradient norm and its variance are ‖∇Λℓ‖≤2​κ2+κ4​sup‖Λ‖ and 𝔼⁡[‖∇Λℓ−∇ℒ‖2]≤(σX2+m−1​σξ2)​(1+sup‖Λ‖2), (43) where σX relates to the variance of 𝔼⁡[ψ​(ξ)|X] and σξ relates to the average variance of (ξ|X).

Lemma 39. As a function of Λ, the objective ℒ is α-smooth with α=κ4, where κ is a bound on ‖φ‖. Moreover, when 𝒳 is finite, it is α′-strongly convex, with α′ the square of the eigengap of K=S​S⊤.

Proposition 41 (Random noise). Consider flipping each bit of x independently with probability p, formally via the operation Byp(x)=x⊙y, y∼Ber⁡({−1,+1},p)⊗d, (45) where the operation x⊙y applies pointwise multiplication and the distribution Ber⁡({−1,+1},p) returns the value −1 with probability p and +1 with probability 1−p. Under the augmentations ξ=X⊙y, T is diagonalized in the parity basis with T​χS=|1−2​p||S|​χS. (46) In other terms, T applies a factor |1−2​p||S| that reduces the effect of higher-order Fourier functions.
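The eigenvalue in (46) can be checked by exact enumeration for a small d (the dimension, the support S, the flip probability, and the point x below are arbitrary choices of ours): averaging χS(x ⊙ y) over all flip patterns y with their exact Bernoulli probabilities recovers (1 − 2p)^|S| χS(x).

```python
import itertools
import numpy as np

d, p = 6, 0.3
S = (0, 2, 3)                 # support of the parity chi_S(x) = prod_{i in S} x_i
x = np.array([1, -1, 1, 1, -1, 1])

def chi(S, x):
    return np.prod([x[i] for i in S])

# E_y[chi_S(x * y)] = chi_S(x) * prod_{i in S} E[y_i] = (1 - 2p)^{|S|} chi_S(x):
# enumerate every flip pattern y in {-1,+1}^d with its exact probability.
expectation = 0.0
for y in itertools.product([-1, 1], repeat=d):
    prob = np.prod([p if yi == -1 else 1 - p for yi in y])
    expectation += prob * chi(S, x * np.array(y))

eigenvalue = expectation / chi(S, x)
print(eigenvalue)             # (1 - 2*0.3)**3 = 0.064
```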

Proposition 43 (2D Cropping). Consider the 2D setting 𝒳={−1,+1}m×d where inputs are organized into an m×d grid. Consider the cropping operation to a window of size v×w, formally [Ma,bv×w​(x)]i+j​m={xi+j​m if i∈[a,a+v), j∈[b,b+w); Ber⁡({−1,+1},0.5) otherwise}, (a,b)∼𝒰​([m]×[d]). (49) Under the augmentation ξ=Ma,bv×w​(X), T is diagonalizable in the parity basis with T​χS=(1/(m2​d2))​((1+v−diame1⁡S)+)2​((1+w−diame2⁡S)+)2​χS, (50) where (⋅)+ denotes the positive part and diame1⁡S is the diameter of S projected onto the first dimension.

Lemma 50 (Spectral decomposition of dot-product kernel). Any dot-product kernel is diagonalizable in the parity basis. Specifically, there exist (νi)i∈[0,d]∈ℝd+1 such that, when μ𝒳 is the uniform distribution on the hypercube, K​χS=ν|S|​χS. (62)
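Lemma 50 can be verified by brute force on a small hypercube: build the integral operator of a dot-product kernel under the uniform distribution and check that every parity χS is an eigenfunction, with an eigenvalue depending only on |S|. The kernel profile h(t) = exp(t) below is an arbitrary choice of ours.

```python
import itertools
import numpy as np

d = 4
cube = np.array(list(itertools.product([-1, 1], repeat=d)), dtype=float)  # 2^d points
N = len(cube)

# A dot-product kernel k(x, y) = h(<x, y> / d), here with h = exp.
K = np.exp(cube @ cube.T / d)

def chi(S):
    """Parity function chi_S evaluated on the whole hypercube."""
    return np.prod(cube[:, list(S)], axis=1) if S else np.ones(N)

# Integral operator under the uniform distribution: (K f)(x) = (1/N) sum_y k(x,y) f(y).
# Check K chi_S = nu_{|S|} chi_S, with nu depending only on |S| (Eq. (62)).
eigs = {}
for r in range(d + 1):
    for S in itertools.combinations(range(d), r):
        f = chi(S)
        Kf = K @ f / N
        nu = Kf @ f / N                      # <K chi_S, chi_S> since ||chi_S||^2 = N
        assert np.allclose(Kf, nu * f), "chi_S should be an eigenfunction"
        eigs.setdefault(r, nu)
        assert np.isclose(eigs[r], nu), "eigenvalue should depend only on |S|"
print({r: float(v) for r, v in eigs.items()})
```

For this choice of h, the eigenvalues factor as ν_r = sinh(1/d)^r cosh(1/d)^(d−r), hence decrease with the order r of the parity.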

Proposition 53 (Linearization of a convolutional network). A convolutional layer followed by a fully connected layer, fC​N​N​(x)=(1/√(N​d))​∑i∈[N]∑k∈[d]ai​k​σ​(wi⊤​x(k)(q)), can be linearized with the h of (67) as kC​N​N​(x,y)=(1/d)​∑k∈[d]h​(⟨x(k)(q),y(k)(q)⟩/q). In the Boolean setting, the resulting integral operator KC​N​N is diagonalized in both the parity and the cyclic basis as KC​N​N​ψm,S=νh​(q,|S|)​((q+1−diam⁡(S))/d)​ψm,S if diam⁡(S)≤q, and KC​N​N​ψm,S=0 otherwise, where the νh​(q,ℓ) are defined in Proposition 52.

Example 11 (Interplay between kernels for CNNs and translation augmentations). Consider the setting of Example 10 with translations sampled from a localized window. For a single-layer CNN with patch width q, eigenfunctions correspond to parity functions χS, or cyclic parities ψm,S with diam⁡(S)≤q, with corresponding eigenvalue νh​(q,ℓ)​(q+1−diam⁡(S))/d. Here, the eigenfunctions ψm,S of T for S with diameter larger than q are completely eliminated, regardless of the regularization strength λ. For eigenfunctions ψm,S with diam⁡(S)≤q, the CNN shrinks the contribution to |p^​(m)|2−λ​(νh​(q,ℓ)​(q+1−diam⁡(S))/d)−1, which shrinks more when diam⁡(S) is larger.

Figure 6 considers a classification problem involving four classes, with a pretraining task specifically constructed to design a representation ψ:𝒳→ℝk for k=4 that solves this particular classification problem. The dataset we consider is the halfmoon dataset, where X=Z+1⟨Z,e1⟩>0​e2+U, Z∼𝒰(S2), and U∼𝒩(0,σ2I) for σ=0.1. Augmentations apply Gaussian noise, ξ=X+V for V∼𝒩(0,σ2I) with σ=0.1. This setting corresponds to one with a Laplacian, where

ℒ(ψ)≃‖∇ψ‖L2(ρ𝒳)2. As a consequence, the ideal ψ will correspond to the top eigenvalues of the Laplacian: the first two span the constant functions on the two moons, the next two are waves with a single oscillation on a given moon, etc. In essence, one can view the harmonics on L2([0,1]) as x↦cos(2πωx+χ) for χ∈{0,π/2} and ω∈ℕ, deforming the segment [0,1] to match one moon, and duplicating this basis on the other moon. In this setting, eigenfunctions are not analytic, since analytic functions cannot be dissociated on two different manifolds (e.g., a locally constant analytic function is globally constant). As a consequence, searching for the eigenfunctions with the radial basis function kernel (Ψ only contains analytic functions in this case (Sun & Zhou, 2008)) requires proper tuning of the regularization parameter as a function of the number of samples. This explains our choice of the exponential kernel in this experiment, which corresponds to φ(x)⊤φ(y)=exp(−‖x−y‖/σ) and is associated with a looser Sobolev space that is still a reproducing kernel Hilbert space (in ℝ2, this is H1). This improves the learning of the top eigenfunctions of T without varying λ, better illustrating the convergence rates of Theorems 1, 2 and 3.
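A minimal sketch (ours) of the data-generation process and the exponential kernel described above; the kernel bandwidth s is our choice, as the text does not pin it down, and we only look at the decay of the Gram spectrum rather than reproducing the full experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 500, 0.1

# Data model from the text: Z uniform on the circle, points with positive first
# coordinate shifted up by e2, plus Gaussian noise; augmentations add more noise.
theta = rng.uniform(0, 2 * np.pi, n)
Z = np.stack([np.cos(theta), np.sin(theta)], axis=1)
X = Z + np.where(Z[:, :1] > 0, 1.0, 0.0) * np.array([0.0, 1.0]) \
      + sigma * rng.standard_normal((n, 2))
xi = X + sigma * rng.standard_normal((n, 2))       # one augmentation per sample

# Exponential kernel phi(x)^T phi(y) = exp(-||x - y|| / s): a Laplacian-type
# kernel whose RKHS (H^1 in R^2) can approach non-analytic eigenfunctions.
s = 0.5                                            # bandwidth, our choice
D = np.linalg.norm(xi[:, None, :] - xi[None, :, :], axis=-1)
K = np.exp(-D / s)

evals = np.linalg.eigvalsh(K)[::-1] / n            # spectrum of the Gram operator
print(evals[:4])
```

The top of this spectrum is what the pretraining objective targets; the corresponding eigenvectors would approximate the locally constant and slowly oscillating functions on the two moons described in the text.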

Figure: examples of augmentations and the effect of the operator T. Input (no augmentation); random noise, which attenuates higher-order Fourier modes; cropping, which keeps Fourier modes within the cropping window; translations, which bias towards Fourier modes with cyclic invariance; flipping, which equates eigenvectors of subsets related by flips.


Figure 12. Setting of Figure 6. The downstream task consists in learning four classes in 𝒳 = ℝ2, which are represented on the left. Those classes are generated with noise. The level lines of the conditional distribution of Y given X are represented in the middle for the left moon; the right moon follows the same structure. A training set example is on the right.

References

[1] Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. In ICML, 2019.

[2] Bach, F. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629--681, 2017.

[3] Bach, F. Learning Theory from First Principles. To appear at MIT Press, 2023.

[4] Balestriero, R. and LeCun, Y. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. In NeurIPS, 2022.

[5] Bardes, A., Ponce, J., and LeCun, Y. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR, 2022.

[6] Bartlett, P. and Mendelson, S. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2002.

[7] Bartlett, P., Jordan, M., and Mcauliffe, J. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 2006.

[8] Bietti, A. Approximation and learning with deep convolutional models: a kernel perspective. In International Conference on Learning Representations, 2022.

[9] Bietti, A. and Mairal, J. On the inductive bias of neural tangent kernels. Advances in Neural Information Processing Systems, 2019.

[10] Bietti, A., Venturi, L., and Bruna, J. On the sample complexity of learning under invariance and geometric stability. In Advances in Neural Information Processing Systems, 2021.

[11] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. In Advances in neural information processing systems, 2020.

[12] Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 2015.

[13] Bun, J., Bouchaud, J.-P., and Potters, M. Cleaning large correlation matrices: Tools from random matrix theory. Physics Reports, 2017.

[14] Cabannes, V., Pillaud-Vivien, L., Bach, F., and Rudi, A. Overcoming the curse of dimensionality with Laplacian regularization in semi-supervised learning. In NeurIPS, 2021a.

[15] Cabannes, V., Rudi, A., and Bach, F. Fast rates in structured prediction. In Conference on Learning Theory, 2021b.

[16] Cabannes, V., Bietti, A., and Balestriero, R. On minimal variations for unsupervised representation learning. ICASSP, 2023.

[17] Caponnetto, A. and De Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 2007.

[18] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., and Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems, 2020.

[19] Caron, M., Touvron, H., Misra, I., Jegou, H., Mairal, J., Bojanowski, P., and Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.

[20] Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. In ICML, 2020.

[21] Chi, Y., Lu, Y., and Chen, Y. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 2019.

[22] Coifman, R. and Lafon, S. Diffusion maps. Applied and Computational Harmonic Analysis, 2006.

[23] Davis, C. and Kahan, W. The rotation of eigenvectors by a perturbation. SIAM Journal on Numerical Analysis, 1970.

[24] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, 2019.

[25] Efthimiou, C. and Frye, C. Spherical harmonics in p dimensions. World Scientific, 2014.

[26] Favero, A., Cagnetta, F., and Wyart, M. Locality defeats the curse of dimensionality in convolutional teacher-student scenarios. In Advances in Neural Information Processing Systems, 2021.

[27] Garrido, Q., Balestriero, R., Najman, L., and Lecun, Y. Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank. arXiv preprint arXiv:2210.02885, 2022.

[28] Grill, J.-B., Strub, F., Altche, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent-a new approach to self-supervised learning. In Advances in neural information processing systems, 2020.

[29] HaoChen, J., Wei, C., Gaidon, A., and Ma, T. Provable guarantees for self-supervised deep learning with spectral contrastive loss. In NeurIPS, 2021.

[30] HaoChen, J. Z. and Ma, T. A theoretical study of inductive biases in contrastive learning. arXiv preprint arXiv:2211.14699, 2022.

[31] He, B. and Ozay, M. Exploring the gap between collapsed & whitened features in self-supervised learning. In ICML, 2020.

[32] Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, 2018.

[33] Kato, T. Perturbation Theory for Linear Operators. Springer, 1995.

[34] Kiani, B. T., Balestriero, R., Chen, Y., Lloyd, S., and LeCun, Y. Joint embedding self-supervised learning in the kernel regime. arXiv preprint arXiv:2209.14884, 2022.

[35] Kolmogorov, A. and Tikhomirov, V. ε-entropy and ε-capacity of sets in functional spaces. Uspekhi Matematicheskikh Nauk, 1959.

[36] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. In NeurIPS, 2019.

[37] Lee, J., Lei, Q., Saunshi, N., and Zhuo, J. Predicting what you already know helps: Provable self-supervised learning. In Advances in Neural Information Processing Systems, 2021.

[38] Lin, J., Rudi, A., Rosasco, L., and Cevher, V. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 2020.

[39] Maurer, A. A vector-contraction inequality for Rademacher complexities. In Algorithmic Learning Theory, 2016.

[40] Mei, S., Misiakiewicz, T., and Montanari, A. Learning with invariances in random features and kernel models. In Conference on Learning Theory, 2021.

[41] Meir, R. and Zhang, T. Generalization error bounds for bayesian mixture algorithms. Journal of Machine Learning Research, 2003.

[42] Micchelli, C., Xu, Y., and Zhang, H. Universal kernels. Journal of Machine Learning Research, 2006.

[43] Misiakiewicz, T. and Mei, S. Learning with convolution and pooling operations in kernel methods. In Advances in Neural Information Processing Systems, 2022.

[44] Mourtada, J. and Rosasco, L. An elementary analysis of ridge regression with random design. Comptes Rendus. Mathematique, 2022.

[45] Mourtada, J., Vaškevičius, T., and Zhivotovskiy, N. Distribution-free robust linear regression. Mathematical Statistics and Learning, 2022.

[46] O'Donnell, R. Analysis of boolean functions. Cambridge University Press, 2014.

[47] Ostrovskii, D. and Bach, F. Finite-sample analysis of m-estimators using self-concordance. Electronic Journal of Statistics, 2018.

[48] Pillaud-Vivien, L. and Bach, F. Kernelized diffusion maps. ArXiv, 2023.

[49] Pinelis, I. and Sakhanenko, A. Remarks on inequalities for large deviation probabilities. Theory of Probability and Its Applications, 1986.

[50] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.

[51] Rigollet, P. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research, 2007.

[52] Saunshi, N., Ash, J., Goel, S., Misra, D., Zhang, C., Arora, S., Kakade, S., and Krishnamurthy, A. Understanding contrastive learning requires incorporating inductive biases. In ICML, 2022.

[53] Schiebinger, G., Wainwright, M., and Yu, B. The geometry of kernelized spectral clustering. The annals of Statistics, 2015.

[54] Scholkopf, B. and Smola, A. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.

[55] Simard, P., Victorri, B., LeCun, Y., and Denker, J. Tangent prop - a formalism for specifying selected invariances in an adaptive network. In NeurIPS, 1991.

[56] Simon, J., Knutins, M., Ziyin, L., Geisz, D., Fetterman, A., and Albrecht, J. On the stepwise nature of self-supervised learning. ArXiv, 2023.

[57] Smale, S. and Zhou, D.-X. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 2007.

[58] Smola, A., Ovari, Z., and Williamson, R. C. Regularization with dot-product kernels. In Advances in neural information processing systems, 2000.

[59] Sun, H.-W. and Zhou, D.-X. Reproducing kernel hilbert spaces associated with analytic translation-invariant mercer kernels. Journal of Fourier Analysis and Applications, 2008.

[60] Tian, Y. Understanding the role of nonlinearity in training dynamics of contrastive learning. arXiv preprint arXiv:2206.01342, 2022.

[61] Tian, Y., Yu, L., Chen, X., and Ganguli, S. Understanding self-supervised learning with dual deep networks, 2021.

[62] Tosh, C., Krishnamurthy, A., and Hsu, D. Contrastive learning, multi-view redundancy, and linear models. In Algorithmic Learning Theory, 2021a.

[63] Tosh, C., Krishnamurthy, A., and Hsu, D. Contrastive estimation reveals topic posterior information to linear models. JMLR, 2021b.

[64] Tropp, J. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 2015.

[65] van Engelen, J. and Hoos, H. A survey of semi-supervised learning. Machine Learning, 2020.

[66] Vitushkin, A. On Hilbert's thirteenth problem. Proceedings of the USSR Academy of Sciences, 1954.

[67] Wen, Z. and Li, Y. Toward understanding the feature learning process of self-supervised contrastive learning. In International Conference on Machine Learning, 2021.

[68] Yang, G. and Salman, H. A fine-grained spectral perspective on neural networks. arXiv preprint arXiv:1907.10599, 2019.