Joint Embedding Self-Supervised Learning in the Kernel Regime
% Bobak T. Kiani, MIT & Meta AI, FAIR, Randall Balestriero, Meta AI, FAIR, Yubei Chen, Meta AI, FAIR, Seth Lloyd, MIT & Turing Inc., Yann LeCun, NYU & Meta AI, FAIR
Abstract
The fundamental goal of self-supervised learning (SSL) is to produce useful representations of data without access to any labels for classifying the data. Modern methods in SSL, which form representations based on known or constructed relationships between samples, have been particularly effective at this task. Here, we aim to extend this framework to incorporate algorithms based on kernel methods where embeddings are constructed by linear maps acting on the feature space of a kernel. In this kernel regime, we derive methods to find the optimal form of the output representations for contrastive and non-contrastive loss functions. This procedure produces a new representation space with an inner product denoted as the induced kernel which generally correlates points which are related by an augmentation in kernel space and de-correlates points otherwise. We analyze our kernel model on small datasets to identify common features of self-supervised learning algorithms and gain theoretical insights into their performance on downstream tasks.
Interpreting the induced kernel
Bobak T. Kiani MIT & Meta AI, FAIR bkiani@mit.edu
Seth Lloyd MIT & Turing Inc. slloyd@mit.edu
The fundamental goal of self-supervised learning (SSL) is to produce useful representations of data without access to any labels for classifying the data. Modern methods in SSL, which form representations based on known or constructed relationships between samples, have been particularly effective at this task. Here, we aim to extend this framework to incorporate algorithms based on kernel methods where embeddings are constructed by linear maps acting on the feature space of a kernel. In this kernel regime, we derive methods to find the optimal form of the output representations for contrastive and non-contrastive loss functions. This procedure produces a new representation space with an inner product denoted as the induced kernel which generally correlates points which are related by an augmentation in kernel space and de-correlates points otherwise. We analyze our kernel model on small datasets to identify common features of self-supervised learning algorithms and gain theoretical insights into their performance on downstream tasks.
Introduction
Self-supervised learning (SSL) algorithms are broadly tasked with learning from unlabeled data. In the joint embedding framework of SSL, mainstream contrastive methods build representations by reducing the distance between inputs related by an augmentation (positive pairs) and increasing the distance between inputs not known to be related (negative pairs) (Chen et al., 2020; He et al., 2020; Oord et al., 2018; Ye et al., 2019). Non-contrastive methods only enforce similarities between positive pairs but are designed carefully to avoid collapse of representations (Grill et al., 2020; Zbontar et al., 2021). Recent algorithms for SSL have performed remarkably well reaching similar performance to baseline supervised learning algorithms on many downstream tasks (Caron et al., 2020; Bardes et al., 2021; Chen & He, 2021).
In this work, we study SSL from a kernel perspective. In standard SSL tasks, inputs are fed into a neural network and mapped into a feature space which encodes the final representations used in downstream tasks (e.g., in classification tasks). In the kernel setting, inputs are embedded in a feature space corresponding to a kernel, and representations are constructed via an optimal mapping from this feature space to the vector space for the representations of the data. Since the feature space of a kernel can be infinite dimensional, one practically may only have access to the kernel function itself. Here, the task can be framed as one of finding an optimal 'induced' kernel, which is a mapping from the original kernel in the input feature space to an updated kernel function acting on the vector space of the representations. Our results show that such an induced kernel can be constructed using only manipulations of kernel functions and data that encodes the relationships between inputs in an SSL algorithm (e.g., adjacency matrices between the input datapoints).
More broadly, we make the following contributions:
Yubei Chen Meta AI, FAIR yubeic@fb.com
· For a contrastive and non-contrastive loss, we provide closed form solutions when the algorithm is trained over a single batch of data. These solutions form a new 'induced' kernel which can be used to perform downstream supervised learning tasks. · We show that a version of the representer theorem in kernel methods can be used to formulate kernelized SSL tasks as optimization problems. As an example, we show how to optimially find induced kernels when the loss is enforced over separate batches of data. · We empirically study the properties of our SSL kernel algorithms to gain insights about the training of SSL algorithms in practice. We study the generalization properties of SSL algorithms and show that the choice of augmentation and adjacency matrices encoding relationships between the datapoints are crucial to performance.
We proceed as follows. First, we provide a brief background of the goals of our work and the theoretical tools used in our study. Second, we show that kernelized SSL algorithms trained on a single batch admit a closed form solution for commonly used contrastive and non-contrastive loss functions. Third, we generalize our findings to provide a semi-definite programming formulation to solve for the optimal induced kernel in more general settings and provide heuristics to better understand the form and properties of the induced kernels. Finally, we empirically investigate our kernelized SSL algorithms when trained on various datasets.
Notation and setup
We denote vectors and matrices with lowercase ( x ) and uppercase ( X ) letters respectively. The vector 2-norm and matrix operator norm is denoted by ‖ · ‖ . The Frobenius norm of a matrix M is denoted as ‖ M ‖ F . We denote the transpose and conjugate transpose of M by M ᵀ and M † respectively. We denote the identity matrix as I and the vector with each entry equal to one as 1 . For a diagonalizable matrix M , its projection onto the eigenspace of its positive eigenvalues is M + .
For a dataset of size N , let x i ∈ X for i ∈ [ N ] denote the elements of the dataset. Given a kernel function k : X × X → R , let Φ( x ) = k ( x , · ) be the map from inputs to the reproducing kernel Hilbert space (RKHS) denoted by H with corresponding inner product 〈· , ·〉 H and RKHS norm ‖ · ‖ H . Throughout we denote K s,s ∈ R N × N to be the kernel matrix of the SSL dataset where ( K s,s ) ij = k ( x i , x j ) . We consider linear models W : H → R K which map features to representations z i = W Φ( x i ) . Let Z be the representation matrix which contains Φ( x i ) as rows of the matrix. This linear function space induces a corresponding RKHS norm which can be calculated as ‖ W ‖ H = √ ∑ K i =1 〈 W i , W i 〉 2 H where W i ∈ H denotes the i -th component of the output of the linear mapping W . This linear mapping constructs an 'induced' kernel denoted as k ∗ ( · , · ) as discussed later.
The driving motive behind modern self-supervised algorithms is to maximize the information of given inputs in a dataset while enforcing similarity between inputs that are known to be related. The adjacency matrix A ∈ { 0 , 1 } N × N (also can be generalized to A ∈ R N × N ) connects related inputs x i and x j (i.e., A ij = 1 if inputs i and j are related by a transformation) and D A is a diagonal matrix where entry i on the diagonal is equal to the number of nonzero elements of row i of A .
Related works
Any joint-embedding SSL algorithm requires a properly chosen loss function and access to a set of observations and known pairwise positive relation between those observations. Methods are denoted as non-contrastive if the loss function is only a function of pairs that are related (Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021). One common method using the VICReg loss (Bardes et al., 2021), for example, takes the form glyph[negationslash]
$$
$$
We adapt the above for the non-contrastive loss we study in our work. Contrastive methods also penalize similarities of representations that are not related. Popular algorithms include SimCLR
(Chen et al., 2020), SwA V (Caron et al., 2020), NNCLR (Dwibedi et al., 2021), contrastive predictive coding (Oord et al., 2018), and many others. We consider a variant of the spectral contrastive loss in our work (HaoChen et al., 2021).
Theoretical studies of SSL: In tandem with the success of SSL in deep learning, a host of theoretical tools have been developed to help understand how SSL algorithms learn (Arora et al., 2019b; Balestriero & LeCun, 2022; HaoChen et al., 2022; Lee et al., 2021). Findings are often connected to the underlying graph connecting the data distribution (Wei et al., 2020; HaoChen et al., 2021) or the choice of augmentation (Wen & Li, 2021). Building representations from known relationships between datapoints is also studied in spectral graph theory (Chung, 1997). We employ findings from this body of literature to provide intuition behind the properties of the algorithms discussed here.
Neural tangent kernels and gaussian processes: Prior work has connected the outputs of infinite width neural networks to a corresponding gaussian process (Williams & Rasmussen, 2006; Neal, 1996; Lee et al., 2017). When trained using continuous time gradient descent, these infinite width models evolve as linear models under the so called neural tangent kernel (NTK) regime (Jacot et al., 2018; Arora et al., 2019a). The discovery of the NTK opened a flurry of exploration into the connections between so-called lazy training of wide networks and kernel methods (Yang, 2019; Chizat &Bach, 2018; Wang et al., 2022; Bietti & Mairal, 2019). Though the training dynamics of the NTK has previously been studied in the supervised settings, one can analyze an NTK in a self-supervised setting by using that kernel in the SSL algorithms that we study here. We perform some preliminary investigation into this direction in our experiments.
Kernel and metric learning: From an algorithmic perspective, perhaps the closest lines of work are related to kernel and metric learning (Bellet et al., 2013; Yang & Jin, 2006). Since our focus is on directly kernelizing SSL methods to eventually analyze and better understand SSL algorithms, we differ from these methods in that our end goal is not to improve the performance of kernel methods but instead, to bridge algorithms in SSL with kernel methods. In kernel learning, prior works have proposed constructing a kernel via a learning procedure; e.g., via convex combinations of kernels (Cortes et al., 2010), kernel alignment (Cristianini et al., 2001), and unsupervised kernel learning to match local data geometry (Zhuang et al., 2011). Prior work in distance metric learning using kernel methods aim to produce representations of data in unsupervised or semi-supervised settings by taking advantage of links between data points. For example, Baghshah & Shouraki (2010); Hoi et al. (2007) learn to construct a kernel based on optimizing distances between points embedded in Hilbert space according to a similarity and dissimilarity matrix. Yeung & Chang (2007) perform kernel distance metric learning in a semi-supervised setting where pairwise relations between data and labels are provided. Xia et al. (2013) propose an online procedure to learn a kernel which maps similar points closer to each other than dissimilar points. Many of these works also use semi-definite programs to perform optimization to find the optimal kernel.
Contrastive and non-contrastive kernel methods
Stated informally, the goal of SSL in the kernel setting is to start with a given kernel function k : X × X → R (e.g., RBF kernel or a neural tangent kernel) and map this kernel function to a new 'induced' kernel k ∗ : X ×X → R which is a function of the SSL loss function and the SSL dataset. For two new inputs x and x ′ , the induced kernel k ∗ ( x , x ′ ) generally outputs correlated values if x and x ′ are correlated in the original kernel space to some datapoint in the SSL dataset or correlated to separate but related datapoints in the SSL dataset as encoded in the graph adjacency matrix. If no relations are found in the SSL dataset between x and x ′ , then the induced kernel will generally output an uncorrelated value.
To kernelize SSL methods, we consider a setting generalized from the prototypical SSL setting where representations are obtained by maximizing/minimizing distances between augmented/unaugmented samples. Translating this to the kernel regime, as illustrated in Figure 1, our goal is to find a linear mapping W ∗ : H → R K which obtains the optimal representation of the data for a given SSL loss function and minimizes the RKHS norm. This optimal solution produces an 'induced kernel' k ∗ ( · , · ) which is the inner product of the data in the output representation space. Once constructed, the induced kernel can be used in downstream tasks to perform supervised learning.

Figure 1: To translate the SSL setting into the kernel regime, we aim to find the optimal linear function which maps inputs from the RKHS into the K -dimensional feature space of the representations. This new feature space induces a new optimal kernel denoted the 'induced' kernel. Relationships between data-points are encoded in an adjacency matrix (the example matrix shown here contains pairwise relationships between datapoints).
Due to a generalization of the representer theorem (Sch¨ olkopf et al., 2001), we can show that the optimal linear function W ∗ must be in the support of the data. This implies that the induced kernel can be written as a function of the kernel between datapoints in the SSL dataset.
Proposition 3.1 (Form of optimal representation) . Given a dataset x 1 , . . . , x N ∈ X , let k ( · , · ) be a kernel function with corresponding map Φ : X → H into the RKHS H . Let W : H → R K be a function drawn from the space of linear functions W mapping inputs in the RKHS to the vector space of the representation. For a risk function R ( W Φ( x 1 ) , . . . , W Φ( x N )) ∈ R and any strictly increasing function r : [0 , ∞ ) → R , consider the optimization problem
$$
$$
The optimal solutions of the above take the form
$$
$$
where M ∈ R K × N is a matrix that must be solved for and k X , x ∈ R N is a vector with entries [ k X , x ] i = k ( x i , x ) .
Proposition 3.1, proved in Appendix A.1, provides a prescription for finding the optimal representations or induced kernels: i.e, one must search over the set of matrices M ∈ R K × N to find an optimal matrix. This search can be performed using standard optimization techniques as we will discuss later, but in certain cases, the optimal solution can be calculated in closed-form as shown next for both a contrastive and non-contrastive loss function.
Non-contrastive loss Consider a variant of the VICReg (Bardes et al., 2021) loss function below:
$$
$$
where β ∈ R + is a hyperparameter that controls the invariance term in the loss and L = D A -A is the graph Laplacian of the data. When the representation space has dimension K ≥ N and the kernel matrix of the data is full rank, the induced kernel of the above loss function is:
$$
$$
where ( · ) + projects the matrix inside the parentheses onto the eigenspace of its positive eigenvalues, k x ,s ∈ R 1 × N is the kernel row-vector with entry i equal to k ( x , x i ) with k s, x equal to its transpose,
and K s,s ∈ R N × N is the kernel matrix of the training data for the self-supervised dataset where entry i, j is equal to k ( x i , x j ) . When we restrict the output space of the self-supervised learning task to be of dimension K < N , then the induced kernel only incorporates the top K eigenvectors of I -1 N 11 ᵀ -β 2 L :
$$
$$
where CDC ᵀ = I -1 N 11 ᵀ -β 2 L is the eigendecomposition including only positive eigenvalues sorted in descending order, C : , ≤ K denotes the matrix consisting of the first K columns of C and D 1 / 2 ≤ K, ≤ K denotes the K × K matrix consisting of entries in the first K rows and columns. Proofs of the above are in Appendix A.2.
$$
$$
$$
$$
where ( I + A ) + is equal to the projection of I + A onto its eigenspace of positive eigenvalues. In the standard SSL setting where relationships are pair-wise (i.e., A ij = 1 if x i and x j are related by an augmentation), then I + A has only positive or zero eigenvalues so the projection can be ignored. If K < N , then we similarly project the matrix I + A onto its top K eigenvalues and obtain an induced kernel similar to the non-contrastive one:
$$
$$
where as before, CDC ᵀ = I + A is the eigendecomposition including only positive eigenvalues with eigenvalues in descending order, C : , ≤ K consists of the first K columns of C and D 1 / 2 ≤ K, ≤ K is the K × K matrix of the first K rows and columns. Proofs of the above are in Appendix A.3.
(Form of optimal representation).
Theorem A.1 (Representer theorem for self-supervised tasks; adapted from Sch¨ olkopf et al. (2001)) . Let x i ∈ X for i ∈ [ N ] be elements of a dataset of size N , k : X × X → R be a kernel function with corresponding map Φ : X → H into the RKHS H , and r : [0 , ∞ ) → R be any strictly increasing function. Let W : H → R K be a linear function mapping inputs to their corresponding representations. Given a regularized loss function of the form
$$
$$
where R ( W Φ( x 1 ) , . . . , W Φ( x N )) is an error function that depends on the representations of the dataset, the minimizer W ∗ of this loss function will be in the span of the training points { Φ( x i ) , i ∈ { 1 , . . . , N }} , i.e. for any φ ∈ H :
$$
$$
Proof. Decompose W = W ‖ + W ⊥ , where W ⊥ Φ( x i ) = 0 for all i ∈ [ N ] . For a loss function L of the form listed above, we have
$$
$$
Where in the terms in the sum, we used the property that W ⊥ is not in the span of the data. For the regularizer term, we note that
$$
$$
Therefore, strictly enforcing W ⊥ = 0 minimizes the regularizer while leaving the rest of the cost function unchanged.
As a consequence of the above, all optimal solutions must have support over the span of the data. This directly results in the statement shown in Proposition 3.1.
Non-contrastive loss
For the contrastive loss, we follow a similar approach as above to find the minimum norm solution that obtains the optimal representation. Note, that the loss function contains the term ‖ XW ᵀ WX ᵀ -( I + A ) ‖ 2 F . Since XW ᵀ WX ᵀ is positive semi-definite, then this is optimized when XW ᵀ WX ᵀ matches the positive eigenspace of ( I + A ) defined as ( I + A ) + . Enumerating the eigenvalues of A as v i with corresponding eigenvalues e i , then ( I + A ) + = ∑ i : e i ≥-1 ( e i + 1) v i v † i . More generally, if the dimension of the representation is restricted such that K < N , then we abuse notation and define
$$
$$
where e i are sorted in descending order.
To find the minimum RKHS norm solution, we have to solve a similar SDP to Equation (25):
$$
$$
$$
$$
The optimal primal is B ∗ = X + ( I + A ) + ( X + ) ᵀ . The optimal dual is equal to Y ∗ = ( X + ) ᵀ X + . Directly plugging these in shows that the optimum value of the primal and dual formulation are equal for the given solutions. As before, we now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):
$$
$$
The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that
$$
$$
In the above, we used the fact that X + X is a projection onto the row space of X . This completes the proof of the optimality.
This has corresponding dual
Contrastive loss
For the contrastive loss, we follow a similar approach as above to find the minimum norm solution that obtains the optimal representation. Note, that the loss function contains the term ‖ XW ᵀ WX ᵀ -( I + A ) ‖ 2 F . Since XW ᵀ WX ᵀ is positive semi-definite, then this is optimized when XW ᵀ WX ᵀ matches the positive eigenspace of ( I + A ) defined as ( I + A ) + . Enumerating the eigenvalues of A as v i with corresponding eigenvalues e i , then ( I + A ) + = ∑ i : e i ≥-1 ( e i + 1) v i v † i . More generally, if the dimension of the representation is restricted such that K < N , then we abuse notation and define
$$
$$
where e i are sorted in descending order.
To find the minimum RKHS norm solution, we have to solve a similar SDP to Equation (25):
$$
$$
$$
$$
The optimal primal is B ∗ = X + ( I + A ) + ( X + ) ᵀ . The optimal dual is equal to Y ∗ = ( X + ) ᵀ X + . Directly plugging these in shows that the optimum value of the primal and dual formulation are equal for the given solutions. As before, we now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):
$$
$$
The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that
$$
$$
In the above, we used the fact that X + X is a projection onto the row space of X . This completes the proof of the optimality.
This has corresponding dual
General form as SDP
The closed form solutions for the induced kernel obtained above assumed the loss function was enforced across a single batch. Of course, in practice, data are split into several batches. This batched setting may not admit a closed-form solution, but by using Proposition 3.1, we know that any optimal induced kernel takes the general form:
$$
$$
where B ∈ R N × N is a positive semi-definite matrix. With constraints properly chosen so that the solution for each batch is optimal (Balestriero & LeCun, 2022; HaoChen et al., 2022), one can find the optimal matrix B ∗ by solving a semi-definite program (SDP). We perform this conversion to a SDP for the contrastive loss here and leave proofs and further details including the non-contrastive case to Appendix A.4.
Introducing some notation to deal with batches, assume we have N datapoints split into n batches of size b . We denote the i -th datapoint within batch j as x ( j ) i . As before, x i denotes the i -th datapoint across the whole dataset. Let K s,s ∈ R N × N be the kernel matrix over the complete dataset where [ K s,s ] i,j = k ( x i , x j ) , K s,s j ∈ R N × b be the kernel matrix between the complete dataset and batch j where [ K s,s j ] a,b = k ( x a , x ( j ) b ) , and A ( j ) be the adjacency matrix for inputs in batch j . With this notation, we now aim to minimize the loss function adapted from Equation (7) including a regularizing term for the RKHS norm:
$$
$$
where α ∈ R + is a weighting term for the regularizer. Taking the limit of α → 0 , we can find the optimal induced kernel for a representation of dimension K > b by enforcing that optimal representations are obtained in each batch:
$$
$$
where as before, where ( I + A ( j ) ) + is equal to the projection of I + A ( j ) onto its eigenspace of positive eigenvalues. Relaxing and removing the constraint that rank( B ) = K results in an SDP which can be efficiently solved using existing optimizers. Further details and a generalization of this conversion to other representation dimensions is shown in Appendix A.4.
Interpreting the induced kernel
As a loose rule, the induced kernel will correlate points that are close in the kernel space or related by augmentations in the SSL dataset and uncorrelate points otherwise. As an example, note that in the contrastive setting (Equation (8)), if one calculates the induced kernel k ∗ ( x i , x j ) between two points in the SSL dataset indexed by i and j that are related by an augmentation (i.e., A ij = 1 ), then the kernel between these two points is k ∗ ( x i , x j ) = 1 . More generally, if the two inputs to the induced kernel are close in kernel space to different points in the SSL dataset that are known to be related by A , then the kernel value will be close to 1 . We formalize this intuition below for the standard setting with pairwise augmentations.
Proposition 3.2. Given kernel function k ( · , · ) with corresponding map Φ( · ) into the RKHS H , let { x 1 , x 2 , . . . x N } be an SSL dataset normalized such that k ( x i , x i ) = 1 and formed by pairwise augmentations (i.e., every element has exactly one neighbor in A ) with kernel matrix K s,s . Given two points x and x ′ , if there exists two points in the SSL dataset indexed by i and j which are related by an augmentation ( A ij =1) and ‖ Φ( x ) -Φ( x i ) ‖ H ≤ ∆ 5 ‖| K -1 s,s ‖ √ N and ‖ Φ( x ′ ) -Φ( x j ) ‖ H ≤
$$
$$
Weprove the above statement in Appendix A.5. The bounds in the above statement which depend on the number of datapoints N and the kernel matrix norm ‖ K -1 s,s ‖ are not very tight and solely meant to provide intuition for the properties of the induced kernel. In more realistic settings, stronger correlations will be observed for much weaker values of the assumptions. In light of this, we visualize the induced kernel values and their relations to the original kernel function in Section 4.1 on a simple 2 -dimensional spiral dataset. Here, it is readily observed that the induced kernel better connects points along the data manifold that are related by the adjacency matrix.
.
Downstream tasks
In downstream tasks, one can apply the induced kernels directly on supervised algorithms such as kernel regression or SVM. Alternatively, one can also extract representations directly by obtaining the representation as k ∗ ( x , · ) = Mk s, x as shown in Proposition 3.1 and employ any learning algorithm from these features. As an example, in kernel regression, we are given a dataset of size N t consisting of input-output pairs { x ( t ) i , y i } and aim to train a linear model to minimize the mean squared error loss of the outputs (Williams & Rasmussen, 2006). The optimal solution using an induced kernel k ∗ ( · , · ) takes the form:
$$
$$
where K ∗-1 t,t is the kernel matrix of the supervised training dataset with entry i, j equal to k ∗ ( x ( t ) i , x ( t ) j ) and y is the concatenation of the targets as a vector. Note that since kernel methods generally have complexity that scales quadratically with the number of datapoints, such algorithms may be unfeasible in large-scale learning tasks unless modifications are made.
A natural question is when and why should one prefer the induced kernel of SSL to a kernel used in the standard supervised setting perhaps including data augmentation? Kernel methods generally fit

Figure 2: Comparison of the RBF kernel space (first row) and induced kernel space (second row). The induced kernel is computed based on Equation 5, and the graph Laplacian matrix is derived from the inner product neighborhood in the RBF kernel space, i.e., using the neighborhoods as data augmentation. a) We plot three randomly chosen points' kernel entries with respect to the other points on the manifolds. When the neighborhood augmentation range used to construct the Laplacian matrix is small enough, the SSL-induced kernel faithfully learns the topology of the entangled spiral manifolds. b) When the neighborhood augmentation range used to construct the Laplacian matrix is too large, it creates the 'short-circuit' effect in the induced kernel space. Each subplot on the second row is normalized by its largest absolute value for better contrast.
a dataset perfectly so an answer to the question more likely arises from studying generalization. In kernel methods, generalization error typically tracks well with the norm of the classifier captured by the complexity quantity s N ( K ) defined as (Mohri et al., 2018; Steinwart & Christmann, 2008)
$$
$$
where y is a vector of targets and K is the kernel matrix of the supervised dataset. For example, the generalization gap of an SVM algorithm can be bounded with high probability by O ( √ s N ( K ) /N ) (see example proof in Appendix B.1) (Meir & Zhang, 2003; Huang et al., 2021). For kernels functions k ( · , · ) bounded in output between 0 and 1 , the quantity s N ( K ) is minimized in binary classification when k ( x , x ′ ) = 1 for x , x ′ drawn from the same class and k ( x , x ′ ) = 0 for x , x ′ drawn from distinct classes. If the induced kernel works ideally - in the sense that it better correlates points within a class and decorrelates points otherwise - then the entries of the kernel matrix approach these optimal values. This intuition is also supported by the hypothesis that self-supervised and semi-supervised algorithms perform well by connecting the representations of points on a common data manifold (HaoChen et al., 2021; Belkin & Niyogi, 2004). To formalize this somewhat, consider such an ideal, yet fabricated, setting where the SSL induced kernel has complexity s N ( K ∗ ) that does not grow with the dataset size.
Proposition 3.3 (Ideal SSL outcome) . Given a supervised dataset of N points for binary classification drawn from a distribution with m -1 and m +1 connected manifolds for classes with labels -1 and +1 respectively, if the induced kernel matrix of the dataset K ∗ successfully separates the manifolds such that k ∗ ( x , x ′ ) = 1 if x , x ′ are in the same manifold and k ∗ ( x , x ′ ) = 0 otherwise, then s N ( K ∗ ) = m -1 + m +1 = O (1) .
The simple proof of the above is in Appendix B. In short, we conjecture that SSL should be preferred in such settings where the relationships between datapoints are 'strong' enough to connect similar points in a class on the same manifold. We analyze the quantity s N ( K ) in Appendix C.3 to add further empirical evidence behind this hypothesis.
(Ideal SSL outcome).
Experiments
In this section, we empirically investigate the performance and properties of the SSL kernel methods on a toy spiral dataset and portions of the MNIST and eMNIST datasets for hand-drawn digits and characters (Cohen et al., 2017). As with other works, we focus on small-data tasks where kernel methods can be performed efficiently without modifications needed for handling large datasets (Arora et al., 2019a; Fern´ andez-Delgado et al., 2014). For simplicity and ease of analysis, we per-
![Figure 3: Depiction of MNIST and EMNIST full test set performances using the contrastive and non-contrastive kernels ( in red ) and benchmarked against the supervised case with labels on all samples (original + augmented) in black and with labels only on the original samples in blue , with the number of original samples given by N ∈ { 16 , 64 , 256 } (each column ) and the number of augmented samples (in log2) in x-axis . The first row corresponds to Gaussian blur data-augmentation (poorly aligned with the data distributions) and the second row corresponds to random rotations (-10,10), translations (-0.1,0.1) and scaling (0.9,1.1). We set the SVM glyph[lscript] 2 regularization to 0 . 001 and use the RBF kernel (NTK kernels in Appendix C.2). We restrict to settings where the total number of samples does not except 50 k , and in all cases, the kernel representation dimensions are unconstrained. Two key observations emerge. First, whenever the augmentation is not aligned with the data distribution, the SSL kernels falls below the supervised case, especially as N increases. Second, when the augmentation is aligned with the data distribution, both SSL kernels are able to get close and even outperform the supervised benchmark with augmented labels.](2209.14884-figure_002.png)
Figure 3: Depiction of MNIST and EMNIST full test set performances using the contrastive and non-contrastive kernels ( in red ) and benchmarked against the supervised case with labels on all samples (original + augmented) in black and with labels only on the original samples in blue , with the number of original samples given by N ∈ { 16 , 64 , 256 } (each column ) and the number of augmented samples (in log2) in x-axis . The first row corresponds to Gaussian blur data-augmentation (poorly aligned with the data distributions) and the second row corresponds to random rotations (-10,10), translations (-0.1,0.1) and scaling (0.9,1.1). We set the SVM glyph[lscript] 2 regularization to 0 . 001 and use the RBF kernel (NTK kernels in Appendix C.2). We restrict to settings where the total number of samples does not except 50 k , and in all cases, the kernel representation dimensions are unconstrained. Two key observations emerge. First, whenever the augmentation is not aligned with the data distribution, the SSL kernels falls below the supervised case, especially as N increases. Second, when the augmentation is aligned with the data distribution, both SSL kernels are able to get close and even outperform the supervised benchmark with augmented labels.
form experiments here with respect to the RBF kernel. Additional experiments reinforcing these findings and also including analysis with neural tangent kernels can be found in Appendix C.
Visualizing the induced kernel on the spiral dataset
For an intuitive understanding, we provide a visualization in Figure 2 to show how the SSL-induced kernel is helping to manipulate the distance and disentangle manifolds in the representation space. In Figure 2, we study two entangled 1-D spiral manifolds in a 2D space with 200 training points uniformly distributed on the spiral manifolds. We use the non-contrastive SSL-induced kernel, following Equation (5), to demonstrate this result, whereas a contrastive SSL-induced kernel is qualitatively similar and left to Appendix C.1. In the RBF kernel space shown in the first row of Figure 2, the value of the kernel is captured purely by distance. Next, to consider the SSL setting, we construct the graph Laplacian matrix L by connecting vertices between any training points with k ( x 1 , x 2 ) > d , i.e., L ij = -1 if k ( x i , x j ) > d and L ij = 0 otherwise. The diagonal entries of L are equal to the degree of the vertices, respectively. This construction can be viewed as using the Euclidean neighborhoods of x as the data augmentation. We choose β = 0 . 4 , where other choices within a reasonable range lead to similar qualitative results. In the second row of Figure 2, we show the induced kernel between selected points (marked with an x) and other training points in the SSL-
![Figure 4: MNIST classification task with 100 original training samples and 100 augmentations per sample on the train and test set (full, 10 , 000 samples) with baselines in the titles given by supervised SVM using labels for all samples or original ( 100 ) samples only. We provide on the left the contrastive kernel performances when ablating over the (inverse) of the SVM's glyph[lscript] 2 regularizer y-axis and representation dimension ( K ) in the x-axis . Similarly, we provide on the right the noncontrastive kernel performances when ablating over β on the y-axis and over the representation dimension ( K ) on the x-axis . We observe, as expected, that reducing K prevents overfitting and should be preferred over the glyph[lscript] 2 regularizer, and that β acts jointly with the representation dimension i.e. one only needs to tune one of the two.](2209.14884-figure_003.png)
Figure 4: MNIST classification task with 100 original training samples and 100 augmentations per sample on the train and test set (full, 10 , 000 samples) with baselines in the titles given by supervised SVM using labels for all samples or original ( 100 ) samples only. We provide on the left the contrastive kernel performances when ablating over the (inverse) of the SVM's glyph[lscript] 2 regularizer y-axis and representation dimension ( K ) in the x-axis . Similarly, we provide on the right the noncontrastive kernel performances when ablating over β on the y-axis and over the representation dimension ( K ) on the x-axis . We observe, as expected, that reducing K prevents overfitting and should be preferred over the glyph[lscript] 2 regularizer, and that β acts jointly with the representation dimension i.e. one only needs to tune one of the two.
induced kernel space. When d is chosen properly, as observed in the second row of Figure 2(a), the SSL-induced kernel faithfully captures the topology of manifolds and leads to a more disentangled representation. However, the augmentation has to be carefully chosen as Figure 2(b) shows that when d is too large, the two manifolds become mixed in the representation space.
Classification Experiments
We explore in Figure 3 the supervised classification setting of MNIST and EMNIST which consist of 28 × 28 grayscale images. (E)MNIST provide a strong baseline to evaluate kernel methods due to the absence of background in the images making kernels such as RBF more aligned to measure input similarities. In this setting, we explore two different data-augmentation (DA) policies, one aligned with the data distribution (rotation+translation+scaling) and one largely misaligned with the data distribution (aggressive Gaussian blur). Because our goal is to understand how much DA impacts the SSL kernel compared to a fully supervised benchmark, we consider two (supervised) benchmarks: one that employs the labels of the sampled training set and all the augmented samples and one that only employs the sampled training set and no augmented samples. We explore a small training set size going from N = 16 to N = 256 and for each case we produce a number of augmented samples for each datapoint so that the total number of samples does not exceed 50 , 000 which is a standard threshold for kernel methods. We note that our implementation is based on the common Python scientific library Numpy/Scipy (Harris et al., 2020) and runs on CPU. We observed in Figure 3 that the SSL kernel is able to match and even outperform the fully supervised case when employing the correct data-augmentation, while with the incorrect data-augmentation, the performance is not even able to match the supervised case that did not see the augmented samples. To better understand the impact of different hyper-parameters onto the two kernels, we also study in Figure 4 the MNIST test set performances when varying the representation dimension K , the SVM's glyph[lscript] 2 regularization parameter, and the non-contrastive kernel's β parameter. We observe that although β is an additional hyper-parameter to tune, its tuning plays a similar role to K , the representation dimension. Hence, in practice, the design of the non-contrastive model can be controlled with a single parameter as with the contrastive setting. We also interestingly observe that K is a more preferable parameter to tune to prevent over-fitting as opposed to SVM's glyph[lscript] 2 regularizer.
Discussion
Our work explores the properties of SSL algorithms when trained via kernel methods. Connections between kernel methods and neural networks have gained significant interest in the supervised learning setting (Neal, 1996; Lee et al., 2017) for their potential insights into the training of deep networks. As we show in this study, such insights into the training properties of SSL algorithms can similarly be garnered from an analysis of SSL algorithms in the kernel regime. Our theoretical and
empirical analysis, for example, highlights the importance of the choice of augmentations and encoded relationships between data points on downstream performance. Looking forward, we believe that interrelations between this kernel regime and the actual deep networks trained in practice can be strengthened particularly by analyzing the neural tangent kernel. In line with similar analysis in the supervised setting (Yang et al., 2022; Seleznova & Kutyniok, 2022; Lee et al., 2020), neural tangent kernels and their corresponding induced kernels in the SSL setting may shine light on some of the theoretical properties of the finite width networks used in practice.
Experiments
Deferred proofs
Representer theorem
Theorem A.1 (Representer theorem for self-supervised tasks; adapted from Sch¨ olkopf et al. (2001)) . Let x i ∈ X for i ∈ [ N ] be elements of a dataset of size N , k : X × X → R be a kernel function with corresponding map Φ : X → H into the RKHS H , and r : [0 , ∞ ) → R be any strictly increasing function. Let W : H → R K be a linear function mapping inputs to their corresponding representations. Given a regularized loss function of the form
$$
$$
where R ( W Φ( x 1 ) , . . . , W Φ( x N )) is an error function that depends on the representations of the dataset, the minimizer W ∗ of this loss function will be in the span of the training points { Φ( x i ) , i ∈ { 1 , . . . , N }} , i.e. for any φ ∈ H :
$$
$$
Proof. Decompose W = W ‖ + W ⊥ , where W ⊥ Φ( x i ) = 0 for all i ∈ [ N ] . For a loss function L of the form listed above, we have
$$
$$
Where in the terms in the sum, we used the property that W ⊥ is not in the span of the data. For the regularizer term, we note that
$$
$$
Therefore, strictly enforcing W ⊥ = 0 minimizes the regularizer while leaving the rest of the cost function unchanged.
As a consequence of the above, all optimal solutions must have support over the span of the data. This directly results in the statement shown in Proposition 3.1.
(Representer theorem for self-supervised tasks; adapted from Schölkopf et al. (2001)).
Deferred proofs
Closed form non-contrastive loss
Throughout this section, for simplicity, we define M := I -1 N 11 ᵀ . With slight abuse of notation, we also denote X ∈ R N × D as a matrix whose i -th row contains the features of Φ( x i ) .
$$
$$
Note that if the RKHS is infinite dimensional, one can apply the general form of the solution as shown in Proposition 3.1 to reframe the problem into the finite dimensional setting below. As a reminder, we aim to minimize the VICreg cost function:
$$
$$
By applying the definition of the Frobenius norm and from some algebra, we obtain:
$$
$$
The optimum of the above formulation has the same optimum as the following optimization problem defined as C ′ ( W ) :
$$
$$
Since M is a projector and M (2 M -β L ) = 2 M -β L (since the all-ones vector is in the kernel of L ), then we can solve the above by employing the Eckart-Young theorem and matching the K -dimensional eigenspace of XW ᵀ WX ᵀ M with that of the top K eigenspace of M -β L / 2 .
One must be careful in choosing this optimum as WX ᵀ MXW ᵀ can only take positive eigenvalues. Therefore, this is achieved by choosing the optimal W ∗ to project the data onto this eigenspace as
$$
$$
where we set the eigendecomposition of 1 N 11 ᵀ + β 2 L as CDC ᵀ = 1 N 11 ᵀ + β 2 L and C : , ≤ K is the matrix consisting of the first K rows of C . Similarly, ( I -D ) 1 / 2 ≤ K, ≤ K denotes the top left K × K matrix of ( I -D ) . Also in the above, X ( s ) + denotes the pseudo-inverse of X . If the diagonal matrix I -D contains negative entries, which can only happen when β is set to a large value, then the values of ( I -D ) 1 / 2 for those entries is undefined. Here, the optimum choice is to set those entries to equal zero. In practice, this can be avoided by setting β to be no larger two times than the degree of the graph.
Note, that since W only appears in the cost function in the form W ᵀ W , the solution above is only unique up to an orthogonal transformation. Furthermore, the rank of XW ᵀ WX ᵀ M is at most N -1 so a better optimum cannot be achieved by increasing the output dimension of the linear transformation beyond N . To see that this produces the optimal induced kernel, we simply plug in the optimal W ∗ :
$$
$$
Now, it needs to be shown that W ∗ is the unique norm optimizer of the optimization problem. To show this, we analyze the following semi-definite program which is equivalent since the cost function is over positive semi-definite matrices of the form W ᵀ W :
$$
$$
To lighten notation, we denote P K = C : , ≤ K ( I -D ) 1 / 2 ≤ K, ≤ K C ᵀ : , ≤ K and simply use X for X . The above can easily be derived by setting B = W ᵀ W in Equation (22). This has corresponding dual
$$
$$
The optimal primal can be obtained from W ∗ by B ∗ = ( W ∗ ) ᵀ W ∗ = X + P K ( X + ) ᵀ . The optimal dual can be similarly calculated and is equal to Y ∗ = ( X + ) ᵀ X + . A straightforward calculation shows that the optimum value of the primal and dual formulation are equal for the given solutions. We now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):
$$
$$
The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that
$$
$$
In the above we used the fact that ML K = L K since M is a projector and L K is unchanged by that projection. This completes the proof of the optimality.
Contrastive loss
For the contrastive loss, we follow a similar approach as above to find the minimum norm solution that obtains the optimal representation. Note, that the loss function contains the term ‖ XW ᵀ WX ᵀ -( I + A ) ‖ 2 F . Since XW ᵀ WX ᵀ is positive semi-definite, then this is optimized when XW ᵀ WX ᵀ matches the positive eigenspace of ( I + A ) defined as ( I + A ) + . Enumerating the eigenvalues of A as v i with corresponding eigenvalues e i , then ( I + A ) + = ∑ i : e i ≥-1 ( e i + 1) v i v † i . More generally, if the dimension of the representation is restricted such that K < N , then we abuse notation and define
$$
$$
where e i are sorted in descending order.
To find the minimum RKHS norm solution, we have to solve a similar SDP to Equation (25):
$$
$$
$$
$$
The optimal primal is B ∗ = X + ( I + A ) + ( X + ) ᵀ . The optimal dual is equal to Y ∗ = ( X + ) ᵀ X + . Directly plugging these in shows that the optimum value of the primal and dual formulation are equal for the given solutions. As before, we now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):
$$
$$
The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that
$$
$$
In the above, we used the fact that X + X is a projection onto the row space of X . This completes the proof of the optimality.
This has corresponding dual
Optimization via semi-definite program
In general scenarios, Proposition 3.1 gives a prescription for calculating the optimal induced kernel for more complicated optimization tasks since we note that the optimal induced kernel k ∗ ( x , x ′ ) must be of the form below:
$$
$$
where k x ,s is a row vector whose entry i equals the kernel k ( x , x i ) between x and the i -th datapoint and B ∈ R N × N is a positive semi-definite matrix.
For example, such a scenario arises when one wants to apply the loss function across n batches batches of data. To frame this as an optimization statement, assume we have N datapoints split into batches of size b each. We denote the i -th datapoint within batch k as x ( k ) i . As before, x i denotes the i -th datapoint across the whole dataset. We define the following variables:
In what follows, we denote the representation dimension as K .
Non-contrastive loss function In the non-contrastive setting, we consider a regularized version of the batched loss of Equation (4). Applying the reduction of the loss function in Equation (22), we consider the following loss function where we want to find the minimizer B :
$$
$$
where α ∈ R + is a hyperparameter. The term Tr( BK s,s ) regularizes for the RKHS norm of the resulting solution given by B . For simplicity, we denote as before M = I -1 b 11 ᵀ . Taking the limit α → 0 and enforcing a representation of dimension K , the loss function above is minimized when we obtain the optimal representation, i.e. we must have that
$$
$$
where ( M -β L ( j ) / 2 ) K denotes the projection of M -β L ( j ) / 2 onto the eigenspace of the top K positive singular values (this is the optimal representation as shown earlier).
Therefore, we can find the optimal induced kernel by solving the optimization problem below:
$$
$$
Relaxing the constraint rank( B ) = K forms an SDP which can be solved efficiently using existing SDP solvers (ApS, 2019; Sturm, 1999).
Contrastive loss As shown in Section 3.1, the contrastive loss function takes the form
$$
$$
, where α ∈ R + is a weighting term for the regularizer. In this setting, the optimal representation of dimension K is equal to
$$
$$
$$
$$
As before, relaxing the rank constraint results in an SDP.
Non-contrastive loss function
For the contrastive loss, we follow a similar approach as above to find the minimum norm solution that obtains the optimal representation. Note, that the loss function contains the term ‖ XW ᵀ WX ᵀ -( I + A ) ‖ 2 F . Since XW ᵀ WX ᵀ is positive semi-definite, then this is optimized when XW ᵀ WX ᵀ matches the positive eigenspace of ( I + A ) defined as ( I + A ) + . Enumerating the eigenvalues of A as v i with corresponding eigenvalues e i , then ( I + A ) + = ∑ i : e i ≥-1 ( e i + 1) v i v † i . More generally, if the dimension of the representation is restricted such that K < N , then we abuse notation and define
$$
$$
where e i are sorted in descending order.
To find the minimum RKHS norm solution, we have to solve a similar SDP to Equation (25):
$$
$$
$$
$$
The optimal primal is B ∗ = X + ( I + A ) + ( X + ) ᵀ . The optimal dual is equal to Y ∗ = ( X + ) ᵀ X + . Directly plugging these in shows that the optimum value of the primal and dual formulation are equal for the given solutions. As before, we now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):
$$
$$
The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that
$$
$$
In the above, we used the fact that X + X is a projection onto the row space of X . This completes the proof of the optimality.
This has corresponding dual
Contrastive loss
For the contrastive loss, we follow a similar approach as above to find the minimum norm solution that obtains the optimal representation. Note, that the loss function contains the term ‖ XW ᵀ WX ᵀ -( I + A ) ‖ 2 F . Since XW ᵀ WX ᵀ is positive semi-definite, then this is optimized when XW ᵀ WX ᵀ matches the positive eigenspace of ( I + A ) defined as ( I + A ) + . Enumerating the eigenvalues of A as v i with corresponding eigenvalues e i , then ( I + A ) + = ∑ i : e i ≥-1 ( e i + 1) v i v † i . More generally, if the dimension of the representation is restricted such that K < N , then we abuse notation and define
$$
$$
where e i are sorted in descending order.
To find the minimum RKHS norm solution, we have to solve a similar SDP to Equation (25):
$$
$$
$$
$$
The optimal primal is B ∗ = X + ( I + A ) + ( X + ) ᵀ . The optimal dual is equal to Y ∗ = ( X + ) ᵀ X + . Directly plugging these in shows that the optimum value of the primal and dual formulation are equal for the given solutions. As before, we now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):
$$
$$
The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that
$$
$$
In the above, we used the fact that X + X is a projection onto the row space of X . This completes the proof of the optimality.
This has corresponding dual
Proof of kernel closeness
Our goal is to prove Proposition 3.2 copied below.
Proposition 3.2. Given kernel function k ( · , · ) with corresponding map Φ( · ) into the RKHS H , let { x 1 , x 2 , . . . x N } be an SSL dataset normalized such that k ( x i , x i ) = 1 and formed by pairwise augmentations (i.e., every element has exactly one neighbor in A ) with kernel matrix K s,s . Given two points x and x ′ , if there exists two points in the SSL dataset indexed by i and j which are related by an augmentation ( A ij =1) and ‖ Φ( x ) -Φ( x i ) ‖ H ≤ ∆ 5 ‖| K -1 s,s ‖ √ N and ‖ Φ( x ′ ) -Φ( x j ) ‖ H ≤ ∆ -1 , then the induced kernel for the contrastive loss is at least k ∗ ( x , x ′ ) ≥ 1 -∆ .
5
‖|
K
s,s
‖ √
N
Note, the adjacency matrix A for pairwise augmentations takes the form below assuming augmented samples are placed next to each other in order.
$$
$$
Before proceeding, we prove a helper lemma that shows that ‖ k s, x a -k s, x b ‖ is small if ‖ Φ( x a ) -Φ( x b ) ‖ is also relatively small with a factor of dependence on the dataset size.
Lemma A.2. Given kernel function k ( · , · ) with map Φ( · ) into RKHS H and an SSL dataset { x i } N i =1 normalized such that k ( x i , x i ) = 1 , let K s,s be the kernel matrix of the SSL dataset. If ‖ Φ( x a ) -Φ( x b ) ‖ H ≤ glyph[epsilon1] , then ∥ ∥ K -1 s,s k s, x a -K -1 s,s k s, x a ∥ ∥ ≤ ‖ K -1 s,s ‖ √ Nglyph[epsilon1] .
Proof. We have that
$$
$$
Proof. Note that K -1 s,s x i = e i where e i is the vector with a 1 placed on entry i and zeros elsewhere. From equation Equation (8), we have that
$$
$$
Note that e ᵀ i ( I + A ) + e j = 1 since A ij = 1 . Therefore,
$$
$$
Let ‖ Φ( x ) -Φ( x i ) ‖ ≤ glyph[epsilon1] = ∆ 5 ‖| K -1 s,s ‖ √ N and ‖ Φ( x ′ ) -Φ( x j ) ‖ ≤ glyph[epsilon1] = ∆ 5 ‖| K -1 s,s ‖ √ N and by applying Lemma A.2, we have
$$
$$
In the above, we used the fact that A is block diagonal with pairwise constraints (see Equation (41)) so ∥ ∥ ( I + A ) + e i ∥ ∥ = √ 2 and ∥ ∥ ( I + A ) + ∥ ∥ = 2 .
.
Deferred proofs
Deferred proofs
Idealization of downstream tasks
In this section, we prove Proposition 3.3 copied below.
Proposition 3.3 (Ideal SSL outcome) . Given a supervised dataset of N points for binary classification drawn from a distribution with m -1 and m +1 connected manifolds for classes with labels -1 and +1 respectively, if the induced kernel matrix of the dataset K ∗ successfully separates the manifolds such that k ∗ ( x , x ′ ) = 1 if x , x ′ are in the same manifold and k ∗ ( x , x ′ ) = 0 otherwise, then s N ( K ∗ ) = m -1 + m +1 = O (1) .
Proof. There are m +1 + m -1 total manifolds in the dataset. Assume some ordering of these manifolds from { 1 , 2 , . . . , m +1 + m -1 } and let #( i ) be the number of points in the i -th manifold.
The rows and columns of the kernel matrix K ∗ can be permuted such that it becomes block diagonal with m +1 + m -1 blocks with block i equal to 1 #( i ) 1 #( i ) 1 ᵀ #( i ) where 1 k is the all ones vector of length k . I.e., K ∗ permuted accordingly takes the form below:
$$
$$
Each block of the above is clearly a rank one matrix with eigenvalue 1 . Let y m i ∈ R #( i ) be the vector containing all labels for entries in manifold i . Then we have
$$
$$
Deferred proofs
Proof of generalization bound
For sake of completeness, we include an example proof of the generalization bound referred to in the main text. Common to classic generalization bounds for kernel methods from several prior works (Huang et al., 2021; Meir & Zhang, 2003; Mohri et al., 2018; Vapnik, 1999; Bartlett & Mendelson, 2002), the norm of the linear solution in kernel space correlates with the resulting bound on the generalization error. We closely follow the methodology of (Huang et al., 2021), though other works follow a similar line of reasoning.
Given a linear solution in the reproducing kernel Hilbert space denoted by w , we aim to bound the generalization error in the loss function glyph[lscript] ( y, y ′ ) = | min(1 , max( -1 , y )) -y ′ | where y ′ denotes binary classification targets in {-1 , 1 } , y is the output of the kernel function equal to y = 〈 w , x 〉 for a corresponding input x in the Hilbert space or feature space. For our purposes, the solution w is given by e.g., Equation (13) equal to
$$
$$
where φ ( x ( t ) i ) denotes the mapping of x ( t ) i to the Hilbert space of the kernel. Given this solution, we have that
$$
$$
The norm above controls, in a sense, the complexity of the resulting solution as it appears in the resulting generalization bound.
Given input and output spaces X and Y respectively, let D be a distribution of input/output pairs over the support X × Y . Slightly modifying existing generalization bounds via Rademacher complexity arguments (Huang et al., 2021; Bartlett & Mendelson, 2002; Mohri et al., 2018), we prove the following generalization bound.
Theorem B.1 (Adapted from Section 4.C of (Huang et al., 2021)) . Let x 1 , . . . , x N and y 1 , . . . , y N t (with y the corresponding vector storing the scalars y i as entries) be our training set of N independent samples drawn i.i.d. from D . Let w = √ Tr( K ) N ∑ N t i =1 [ K -1 y ] i φ ( x ( t ) i ) denote the solution to the kernel regression problem normalized by the trace of the data kernel. For an L -lipschitz loss function glyph[lscript] : Y × Y → [0 , b ] , with probability 1 -δ for any δ > 0 , we have
$$
$$
To prove the above, we apply a helper theorem and lemma copied below.
Theorem B.2 (Concentration of sum; see Theorem 3.1 in Mohri et al. (2018)) . Let G be a family of functions mapping from Z to [0 , 1] . Given N independent samples z 1 , . . . , z n from Z , then for any δ > 0 , with probability at least 1 -δ , the following holds:
$$
$$
where ̂ R S ( G ) denotes the empirical Rademacher complexity of G equal to
$$
$$
where σ i are independent uniform random variables over {-1 , +1 } .
Lemma B.3 (Talagrand's lemma; see Lemma 4.2 in Mohri et al. (2018)) . Let Φ : R → R be L -Lipschitz. Then for any hypothesis set G of real-valued functions,
$$
$$
Now, we are ready to prove Theorem B.1.
Proof. We consider a function class G γ defined as the set of linear functions on the reproducing kernel Hilbert space such that G γ = {〈 w , ·〉 : ‖ w ‖ 2 ≤ γ } . Given a dataset of inputs x 1 , . . . , x N in the reproducing kernel Hilbert space with corresponding targets y 1 , . . . , y N , let glyph[epsilon1] w ( x i ) = glyph[lscript] ( 〈 w , x 〉 , y i ) ∈ [0 , b ] . The inequality in Theorem B.2 applies for any given G γ but we would like this to hold for all γ ∈ { 1 , 2 , 3 , . . . } since ‖ w ‖ can be unbounded. Since our loss function glyph[lscript] is bounded between [0 , b ] , we multiply it by 1 /b so that it is ranged in [0 , 1] as needed for Theorem B.2. Then, we apply Theorem B.2 to glyph[epsilon1] w ( x i ) over the class G γ for each γ ∈ { 1 , 2 , 3 , . . . } with probability δ γ = δ ( e -1) e -γ . This implies that
$$
$$
This shows that for any γ , the above inequality holds with probability 1 -δ ( e -1) e -γ . However, we need to show this holds for all γ simultaneously. To achieve this, we apply a union bound which holds with probability 1 -∑ ∞ γ =1 δ γ = 1 -∑ ∞ γ =1 δ ( e -1) e -γ = 1 -δ .
To proceed, we consider the inequality where γ = glyph[ceilingleft]‖ w ‖ 2 glyph[ceilingright] copied below.
$$
$$
Applying Talagrand's lemma (Lemma B.3) followed by the Cauchy-Schwarz inequality,
$$
$$
Expanding out the quantity ∥ ∥ ∥ 1 N ∑ N i =1 σ i x i ∥ ∥ ∥ and noting that the random variables are independent, we have
$$
$$
Since we normalize such that Tr( K ) = N , we have
$$
$$
Plugging in ‖ w ‖ 2 = y ᵀ K -1 y and noting that we normalized the kernel to have Tr( K ) = N thus completes the proof.
(Adapted from Section 4.C of (Huang et al., 2021)).
(Concentration of sum; see Theorem 3.1 in Mohri et al. (2018)).
(Talagrand’s lemma; see Lemma 4.2 in Mohri et al. (2018)).
Deferred proofs
Additional numerical experiments
Visualizing the contrastive induced kernel on the spiral dataset
Here, we provide the contrastive SSL-induced kernel visualization in Figure 5 to show how the SSLinduced kernel is helping to manipulate the distance and disentangle manifolds in the representation space. In Figure 5, we study two entangled 1-D spiral manifolds in a 2D space with 200 training points uniformly distributed on the spiral manifolds. We use the contrastive SSL-induced kernel, following Equation 8, to demonstrate this result, whereas the non-contrastive SSL-induced kernel is provided earlier in the main text. We use the radial basis function (RBF) kernel to calculate the inner products between different points and plot a few points' RBF neighborhoods in the first row of Figure 2, i.e., k ( x 1 , x 2 ) = exp ( -‖ x 1 -x 2 ‖ 2 2 σ 2 ) . As we can see, the RBF kernel captures the locality of the 2D space. Next, we use the RBF kernel space neighborhoods to construct the adjacency matrix A and any training points with k ( x 1 , x 2 ) > d are treated as connected vertices, i.e., A ij = 1 and A ij = 0 otherwise. The diagonal entries of A are 0 . This construction can be considered as using the Euclidean neighborhoods of x as the data augmentation. In the second row of Figure 5, we show the selected points' inner products with the other training points in the SSL-induced kernel space. Given σ , when d is chosen small enough, we can see in the second row of Figure 5(a) that the SSL-induced kernel faithfully captures the topology of manifolds and leads to a more disentangled representation. Figure 5(b) shows that when d is too large, i.e., an improper data augmentation, the SSL-induced kernel leads to the 'short-circuit' effect, and the two manifolds are mixed in the representation space.

Figure 5: Inner products comparison in the RBF kernel space (first row) and induced kernel space (second row). The induced kernel is computed based on Equation 8 and the graph adjacency matrix A is derived from the inner product neighborhood in the RBF kernel space, i.e., using the neighborhoods as data augmentation. We plot three randomly chosen points' kernel entries with respect to the other points on the manifolds. a) When the neighborhood augmentation range used to construct the adjacency matrix is small enough, the SSL-induced kernel faithfully learns the topology of the entangled spiral manifolds. b) When the neighborhood augmentation range used to construct the adjacency matrix is too large, it creates the 'short-circuit' effect in the induced kernel space. Each subplot on the second row is normalized by dividing its largest absolute value for better contrast.
Table 1: NTK for 3-layer fully connected network: Test set accuracy of SVM using the neural tangent kernel of a 3 layer fully connected network in only a supervised (baseline) setting or via the induced kernel in a self-supervised setting. The induced kernel for the SSL algorithm is calculated using the contrastive induced kernel. Numbers shown above are the test set accuracy for classifying MNIST digits for small dataset sizes with the given number of samples. Due to the quadratic scaling of memory and runtime for kernel methods, we restricted analysis to more feasible settings where there were less than 25,000 total samples (number of augmentations times number of samples).
MNIST with neural tangent kernels
In further exploring the performance of SSL kernel methods on small datasets, we perform further numerical experiments on the MNIST dataset using kernel functions derived from the neural tangent kernel (NTK) of commonly used neural networks (Jacot et al., 2018). We use the neural-tangents package to explicitly calculate the NTK for a given architecture (Novak et al., 2020). The basic setup is repeated from Section 4.2 where two different types of augmentations are performed on im-
Table 2: NTK for 7-layer convolutional neural network: Test set accuracy of SVM using the neural tangent kernel of a CNN with 7 layers of 3 × 3 convolution followed by a global pooling layer. The induced kernel for the SSL algorithm is calculated using the contrastive induced kernel and compared to the standard SVM using the kernel itself with or without augmentation. Numbers shown above are the test set accuracy for classifying MNIST digits for small dataset sizes with the given number of samples. Due to the quadratic scaling of memory and runtime for kernel methods, we restricted analysis to more feasible settings where there were less than 25,000 total samples (number of augmentations times number of samples).
ages. As before, we consider augmentations by Gaussian blurring of the pixels or small translations, rotations, and zooms.
Test set accuracy results are shown in Table 1 for the NTK associated to a 3-layer fully connected network with infinite width at each hidden layer. Similarly, in Table 2, we show a similar analysis for the NTK of a CNN with seven convolutional layers followed by a global pooling layer. We use the contrastive induced kernel for the SSL kernel algorithm. The findings are similar to those reported in the main text. The SSL method performs similarly to the baseline method without any self-supervised learning but including labeled augmented datapoints in the training set. In some cases, the SSL method even outperforms the baseline augmented setting.
Analysis of downstream solution
The fundamental goal of self-supervised learning (SSL) is to produce useful representations of data without access to any labels for classifying the data. Modern methods in SSL, which form representations based on known or constructed relationships between samples, have been particularly effective at this task. Here, we aim to extend this framework to incorporate algorithms based on kernel methods where embeddings are constructed by linear maps acting on the feature space of a kernel. In this kernel regime, we derive methods to find the optimal form of the output representations for contrastive and non-contrastive loss functions. This procedure produces a new representation space with an inner product denoted as the induced kernel which generally correlates points which are related by an augmentation in kernel space and de-correlates points otherwise. We analyze our kernel model on small datasets to identify common features of self-supervised learning algorithms and gain theoretical insights into their performance on downstream tasks.
Self-supervised learning (SSL) algorithms are broadly tasked with learning from unlabeled data. In the joint embedding framework of SSL, mainstream contrastive methods build representations by reducing the distance between inputs related by an augmentation (positive pairs) and increasing the distance between inputs not known to be related (negative pairs) (Chen et al., 2020; He et al., 2020; Oord et al., 2018; Ye et al., 2019). Non-contrastive methods only enforce similarities between positive pairs but are designed carefully to avoid collapse of representations (Grill et al., 2020; Zbontar et al., 2021). Recent algorithms for SSL have performed remarkably well reaching similar performance to baseline supervised learning algorithms on many downstream tasks (Caron et al., 2020; Bardes et al., 2021; Chen & He, 2021).
In this work, we study SSL from a kernel perspective. In standard SSL tasks, inputs are fed into a neural network and mapped into a feature space which encodes the final representations used in downstream tasks (e.g., in classification tasks). In the kernel setting, inputs are embedded in a feature space corresponding to a kernel, and representations are constructed via an optimal mapping from this feature space to the vector space for the representations of the data. Since the feature space of a kernel can be infinite dimensional, one practically may only have access to the kernel function itself. Here, the task can be framed as one of finding an optimal “induced” kernel, which is a mapping from the original kernel in the input feature space to an updated kernel function acting on the vector space of the representations. Our results show that such an induced kernel can be constructed using only manipulations of kernel functions and data that encodes the relationships between inputs in an SSL algorithm (e.g., adjacency matrices between the input datapoints).
More broadly, we make the following contributions:
For a contrastive and non-contrastive loss, we provide closed form solutions when the algorithm is trained over a single batch of data. These solutions form a new “induced” kernel which can be used to perform downstream supervised learning tasks.
We show that a version of the representer theorem in kernel methods can be used to formulate kernelized SSL tasks as optimization problems. As an example, we show how to optimially find induced kernels when the loss is enforced over separate batches of data.
We empirically study the properties of our SSL kernel algorithms to gain insights about the training of SSL algorithms in practice. We study the generalization properties of SSL algorithms and show that the choice of augmentation and adjacency matrices encoding relationships between the datapoints are crucial to performance.
We proceed as follows. First, we provide a brief background of the goals of our work and the theoretical tools used in our study. Second, we show that kernelized SSL algorithms trained on a single batch admit a closed form solution for commonly used contrastive and non-contrastive loss functions. Third, we generalize our findings to provide a semi-definite programming formulation to solve for the optimal induced kernel in more general settings and provide heuristics to better understand the form and properties of the induced kernels. Finally, we empirically investigate our kernelized SSL algorithms when trained on various datasets.
We denote vectors and matrices with lowercase (𝒙𝒙{\bm{x}}) and uppercase (𝑿𝑿{\bm{X}}) letters respectively. The vector 2-norm and matrix operator norm is denoted by ∥⋅∥|\cdot|. The Frobenius norm of a matrix 𝑴𝑴{\bm{M}} is denoted as ‖𝑴‖Fsubscriptnorm𝑴𝐹|{\bm{M}}|{F}. We denote the transpose and conjugate transpose of 𝑴𝑴{\bm{M}} by 𝑴⊺superscript𝑴⊺{\bm{M}}^{\intercal} and 𝑴†superscript𝑴†{\bm{M}}^{\dagger} respectively. We denote the identity matrix as 𝑰𝑰{\bm{I}} and the vector with each entry equal to one as 𝟏1\bm{1}. For a diagonalizable matrix 𝑴𝑴{\bm{M}}, its projection onto the eigenspace of its positive eigenvalues is 𝑴+subscript𝑴{\bm{M}}{+}.
For a dataset of size N𝑁N, let 𝒙i∈𝒳subscript𝒙𝑖𝒳{\bm{x}}{i}\in\mathcal{X} for i∈[N]𝑖delimited-[]𝑁i\in[N] denote the elements of the dataset. Given a kernel function k:𝒳×𝒳→ℝ:𝑘→𝒳𝒳ℝk:\mathcal{X}\times\mathcal{X}\to\mathbb{R}, let Φ(𝒙)=k(𝒙,⋅)Φ𝒙𝑘𝒙⋅\Phi({\bm{x}})=k({\bm{x}},\cdot) be the map from inputs to the reproducing kernel Hilbert space (RKHS) denoted by ℋℋ\mathcal{H} with corresponding inner product ⟨⋅,⋅⟩ℋsubscript⋅⋅ℋ\langle\cdot,\cdot\rangle{\mathcal{H}} and RKHS norm ∥⋅∥ℋ|\cdot|{\mathcal{H}}. Throughout we denote 𝑲s,s∈ℝN×Nsubscript𝑲𝑠𝑠superscriptℝ𝑁𝑁{\bm{K}}{s,s}\in\mathbb{R}^{N\times N} to be the kernel matrix of the SSL dataset where (𝑲s,s)ij=k(𝒙i,𝒙j)subscriptsubscript𝑲𝑠𝑠𝑖𝑗𝑘subscript𝒙𝑖subscript𝒙𝑗({\bm{K}}{s,s}){ij}=k({\bm{x}}{i},{\bm{x}}{j}). We consider linear models 𝑾:ℋ→ℝK:𝑾→ℋsuperscriptℝ𝐾{\bm{W}}:\mathcal{H}\to\mathbb{R}^{K} which map features to representations 𝒛i=𝑾Φ(𝒙i)subscript𝒛𝑖𝑾Φsubscript𝒙𝑖{\bm{z}}{i}={\bm{W}}\Phi({\bm{x}}{i}). Let 𝒁𝒁{\bm{Z}} be the representation matrix which contains Φ(𝒙i)Φsubscript𝒙𝑖\Phi({\bm{x}}{i}) as rows of the matrix. This linear function space induces a corresponding RKHS norm which can be calculated as ‖𝑾‖ℋ=∑i=1K⟨𝑾i,𝑾i⟩ℋ2subscriptnorm𝑾ℋsuperscriptsubscript𝑖1𝐾superscriptsubscriptsubscript𝑾𝑖subscript𝑾𝑖ℋ2|{\bm{W}}|{\mathcal{H}}=\sqrt{\sum_{i=1}^{K}\langle{\bm{W}}{i},{\bm{W}}{i}\rangle_{\mathcal{H}}^{2}} where 𝑾i∈ℋsubscript𝑾𝑖ℋ{\bm{W}}_{i}\in\mathcal{H} denotes the i𝑖i-th component of the output of the linear mapping 𝑾𝑾{\bm{W}}. This linear mapping constructs an “induced” kernel denoted as k∗(⋅,⋅)superscript𝑘⋅⋅k^{*}(\cdot,\cdot) as discussed later.
The driving motive behind modern self-supervised algorithms is to maximize the information of given inputs in a dataset while enforcing similarity between inputs that are known to be related. The adjacency matrix 𝑨∈{0,1}N×N𝑨superscript01𝑁𝑁{\bm{A}}\in{0,1}^{N\times N} (also can be generalized to 𝑨∈ℝN×N𝑨superscriptℝ𝑁𝑁{\bm{A}}\in\mathbb{R}^{N\times N}) connects related inputs 𝒙isubscript𝒙𝑖{\bm{x}}{i} and 𝒙jsubscript𝒙𝑗{\bm{x}}{j} (i.e., 𝑨ij=1subscript𝑨𝑖𝑗1{\bm{A}}{ij}=1 if inputs i𝑖i and j𝑗j are related by a transformation) and 𝑫𝑨subscript𝑫𝑨{\bm{D}}{{\bm{A}}} is a diagonal matrix where entry i𝑖i on the diagonal is equal to the number of nonzero elements of row i𝑖i of 𝑨𝑨{\bm{A}}.
Any joint-embedding SSL algorithm requires a properly chosen loss function and access to a set of observations and known pairwise positive relation between those observations. Methods are denoted as non-contrastive if the loss function is only a function of pairs that are related (Grill et al., 2020; Chen & He, 2021; Zbontar et al., 2021). One common method using the VICReg loss (Bardes et al., 2021), for example, takes the form
We adapt the above for the non-contrastive loss we study in our work. Contrastive methods also penalize similarities of representations that are not related. Popular algorithms include SimCLR (Chen et al., 2020), SwAV (Caron et al., 2020), NNCLR (Dwibedi et al., 2021), contrastive predictive coding (Oord et al., 2018), and many others. We consider a variant of the spectral contrastive loss in our work (HaoChen et al., 2021).
Theoretical studies of SSL: In tandem with the success of SSL in deep learning, a host of theoretical tools have been developed to help understand how SSL algorithms learn (Arora et al., 2019b; Balestriero & LeCun, 2022; HaoChen et al., 2022; Lee et al., 2021). Findings are often connected to the underlying graph connecting the data distribution (Wei et al., 2020; HaoChen et al., 2021) or the choice of augmentation (Wen & Li, 2021). Building representations from known relationships between datapoints is also studied in spectral graph theory (Chung, 1997). We employ findings from this body of literature to provide intuition behind the properties of the algorithms discussed here.
Neural tangent kernels and gaussian processes: Prior work has connected the outputs of infinite width neural networks to a corresponding gaussian process (Williams & Rasmussen, 2006; Neal, 1996; Lee et al., 2017). When trained using continuous time gradient descent, these infinite width models evolve as linear models under the so called neural tangent kernel (NTK) regime (Jacot et al., 2018; Arora et al., 2019a). The discovery of the NTK opened a flurry of exploration into the connections between so-called lazy training of wide networks and kernel methods (Yang, 2019; Chizat & Bach, 2018; Wang et al., 2022; Bietti & Mairal, 2019). Though the training dynamics of the NTK has previously been studied in the supervised settings, one can analyze an NTK in a self-supervised setting by using that kernel in the SSL algorithms that we study here. We perform some preliminary investigation into this direction in our experiments.
Kernel and metric learning: From an algorithmic perspective, perhaps the closest lines of work are related to kernel and metric learning (Bellet et al., 2013; Yang & Jin, 2006). Since our focus is on directly kernelizing SSL methods to eventually analyze and better understand SSL algorithms, we differ from these methods in that our end goal is not to improve the performance of kernel methods but instead, to bridge algorithms in SSL with kernel methods. In kernel learning, prior works have proposed constructing a kernel via a learning procedure; e.g., via convex combinations of kernels (Cortes et al., 2010), kernel alignment (Cristianini et al., 2001), and unsupervised kernel learning to match local data geometry (Zhuang et al., 2011). Prior work in distance metric learning using kernel methods aim to produce representations of data in unsupervised or semi-supervised settings by taking advantage of links between data points. For example, Baghshah & Shouraki (2010); Hoi et al. (2007) learn to construct a kernel based on optimizing distances between points embedded in Hilbert space according to a similarity and dissimilarity matrix. Yeung & Chang (2007) perform kernel distance metric learning in a semi-supervised setting where pairwise relations between data and labels are provided. Xia et al. (2013) propose an online procedure to learn a kernel which maps similar points closer to each other than dissimilar points. Many of these works also use semi-definite programs to perform optimization to find the optimal kernel.
Stated informally, the goal of SSL in the kernel setting is to start with a given kernel function k:𝒳×𝒳→ℝ:𝑘→𝒳𝒳ℝk:\mathcal{X}\times\mathcal{X}\to\mathbb{R} (e.g., RBF kernel or a neural tangent kernel) and map this kernel function to a new “induced” kernel k∗:𝒳×𝒳→ℝ:superscript𝑘→𝒳𝒳ℝk^{}:\mathcal{X}\times\mathcal{X}\to\mathbb{R} which is a function of the SSL loss function and the SSL dataset. For two new inputs 𝒙𝒙{\bm{x}} and 𝒙′superscript𝒙′{\bm{x}}^{\prime}, the induced kernel k∗(𝒙,𝒙′)superscript𝑘𝒙superscript𝒙′k^{}({\bm{x}},{\bm{x}}^{\prime}) generally outputs correlated values if 𝒙𝒙{\bm{x}} and 𝒙′superscript𝒙′{\bm{x}}^{\prime} are correlated in the original kernel space to some datapoint in the SSL dataset or correlated to separate but related datapoints in the SSL dataset as encoded in the graph adjacency matrix. If no relations are found in the SSL dataset between 𝒙𝒙{\bm{x}} and 𝒙′superscript𝒙′{\bm{x}}^{\prime}, then the induced kernel will generally output an uncorrelated value.
To kernelize SSL methods, we consider a setting generalized from the prototypical SSL setting where representations are obtained by maximizing/minimizing distances between augmented/un-augmented samples. Translating this to the kernel regime, as illustrated in Figure 1, our goal is to find a linear mapping 𝑾∗:ℋ→ℝK:superscript𝑾→ℋsuperscriptℝ𝐾{\bm{W}}^{}:\mathcal{H}\to\mathbb{R}^{K} which obtains the optimal representation of the data for a given SSL loss function and minimizes the RKHS norm. This optimal solution produces an “induced kernel” k∗(⋅,⋅)superscript𝑘⋅⋅k^{}(\cdot,\cdot) which is the inner product of the data in the output representation space. Once constructed, the induced kernel can be used in downstream tasks to perform supervised learning.
Due to a generalization of the representer theorem (Schölkopf et al., 2001), we can show that the optimal linear function 𝑾∗superscript𝑾{\bm{W}}^{*} must be in the support of the data. This implies that the induced kernel can be written as a function of the kernel between datapoints in the SSL dataset.
Given a dataset 𝐱1,…,𝐱N∈𝒳subscript𝐱1…subscript𝐱𝑁𝒳{\bm{x}}{1},\dots,{\bm{x}}{N}\in\mathcal{X}, let k(⋅,⋅)𝑘⋅⋅k(\cdot,\cdot) be a kernel function with corresponding map Φ:𝒳→ℋ:Φ→𝒳ℋ\Phi:\mathcal{X}\to\mathcal{H} into the RKHS ℋℋ\mathcal{H}. Let 𝐖:ℋ→ℝK:𝐖→ℋsuperscriptℝ𝐾{\bm{W}}:\mathcal{H}\to\mathbb{R}^{K} be a function drawn from the space of linear functions 𝒲𝒲\mathcal{W} mapping inputs in the RKHS to the vector space of the representation. For a risk function ℛ(𝐖Φ(𝐱1),…,𝐖Φ(𝐱N))∈ℝℛ𝐖Φsubscript𝐱1…𝐖Φsubscript𝐱𝑁ℝ\mathcal{R}({\bm{W}}\Phi({\bm{x}}{1}),\dots,{\bm{W}}\Phi({\bm{x}}{N}))\in\mathbb{R} and any strictly increasing function r:[0,∞)→ℝ:𝑟→0ℝr:[0,\infty)\to\mathbb{R}, consider the optimization problem
The optimal solutions of the above take the form
where 𝐌∈ℝK×N𝐌superscriptℝ𝐾𝑁{\bm{M}}\in\mathbb{R}^{K\times N} is a matrix that must be solved for and 𝐤𝐗,𝐱∈ℝNsubscript𝐤𝐗𝐱superscriptℝ𝑁{\bm{k}}{{\bm{X}},{\bm{x}}}\in\mathbb{R}^{N} is a vector with entries [𝐤𝐗,𝐱]i=k(𝐱i,𝐱)subscriptdelimited-[]subscript𝐤𝐗𝐱𝑖𝑘subscript𝐱𝑖𝐱[{\bm{k}}{{\bm{X}},{\bm{x}}}]{i}=k({\bm{x}}{i},{\bm{x}}).
Proposition 3.1, proved in Section A.1, provides a prescription for finding the optimal representations or induced kernels: i.e, one must search over the set of matrices 𝑴∈ℝK×N𝑴superscriptℝ𝐾𝑁{\bm{M}}\in\mathbb{R}^{K\times N} to find an optimal matrix. This search can be performed using standard optimization techniques as we will discuss later, but in certain cases, the optimal solution can be calculated in closed-form as shown next for both a contrastive and non-contrastive loss function.
Consider a variant of the VICReg (Bardes et al., 2021) loss function below:
where β∈ℝ+𝛽superscriptℝ\beta\in\mathbb{R}^{+} is a hyperparameter that controls the invariance term in the loss and 𝑳=𝑫𝑨−𝑨𝑳subscript𝑫𝑨𝑨{\bm{L}}={\bm{D}}_{{\bm{A}}}-{\bm{A}} is the graph Laplacian of the data. When the representation space has dimension K≥N𝐾𝑁K\geq N and the kernel matrix of the data is full rank, the induced kernel of the above loss function is:
where (⋅)+subscript⋅(\cdot){+} projects the matrix inside the parentheses onto the eigenspace of its positive eigenvalues, 𝒌𝒙,s∈ℝ1×Nsubscript𝒌𝒙𝑠superscriptℝ1𝑁{\bm{k}}{{\bm{x}},s}\in\mathbb{R}^{1\times N} is the kernel row-vector with entry i𝑖i equal to k(𝒙,𝒙i)𝑘𝒙subscript𝒙𝑖k({\bm{x}},{\bm{x}}{i}) with 𝒌s,𝒙subscript𝒌𝑠𝒙{\bm{k}}{s,{\bm{x}}} equal to its transpose, and 𝑲s,s∈ℝN×Nsubscript𝑲𝑠𝑠superscriptℝ𝑁𝑁{\bm{K}}{s,s}\in\mathbb{R}^{N\times N} is the kernel matrix of the training data for the self-supervised dataset where entry i,j𝑖𝑗i,j is equal to k(𝒙i,𝒙j)𝑘subscript𝒙𝑖subscript𝒙𝑗k({\bm{x}}{i},{\bm{x}}_{j}). When we restrict the output space of the self-supervised learning task to be of dimension K<N𝐾𝑁K<N, then the induced kernel only incorporates the top K𝐾K eigenvectors of 𝑰−1N𝟏𝟏⊺−β2𝑳𝑰1𝑁superscript11⊺𝛽2𝑳{\bm{I}}-\frac{1}{N}\bm{1}\bm{1}^{\intercal}-\frac{\beta}{2}{\bm{L}}:
where 𝑪𝑫𝑪⊺=𝑰−1N𝟏𝟏⊺−β2𝑳𝑪𝑫superscript𝑪⊺𝑰1𝑁superscript11⊺𝛽2𝑳{\bm{C}}{\bm{D}}{\bm{C}}^{\intercal}={\bm{I}}-\frac{1}{N}\bm{1}\bm{1}^{\intercal}-\frac{\beta}{2}{\bm{L}} is the eigendecomposition including only positive eigenvalues sorted in descending order, 𝑪:,≤Ksubscript𝑪:absent𝐾{\bm{C}}{:,\leq K} denotes the matrix consisting of the first K𝐾K columns of 𝑪𝑪{\bm{C}} and 𝑫≤K,≤K1/2superscriptsubscript𝑫absent𝐾absent𝐾12{\bm{D}}{\leq K,\leq K}^{1/2} denotes the K×K𝐾𝐾K\times K matrix consisting of entries in the first K𝐾K rows and columns. Proofs of the above are in Section A.2.
For contrastive SSL, we can also obtain a closed form solution to the induced kernel for a variant of the spectral contrastive loss (HaoChen et al., 2021):
where (𝑰+𝑨)+subscript𝑰𝑨({\bm{I}}+{\bm{A}}){+} is equal to the projection of 𝑰+𝑨𝑰𝑨{\bm{I}}+{\bm{A}} onto its eigenspace of positive eigenvalues. In the standard SSL setting where relationships are pair-wise (i.e., 𝑨ij=1subscript𝑨𝑖𝑗1{\bm{A}}{ij}=1 if 𝒙isubscript𝒙𝑖{\bm{x}}{i} and 𝒙jsubscript𝒙𝑗{\bm{x}}{j} are related by an augmentation), then 𝑰+𝑨𝑰𝑨{\bm{I}}+{\bm{A}} has only positive or zero eigenvalues so the projection can be ignored. If K<N𝐾𝑁K<N, then we similarly project the matrix 𝑰+𝑨𝑰𝑨{\bm{I}}+{\bm{A}} onto its top K𝐾K eigenvalues and obtain an induced kernel similar to the non-contrastive one:
The closed form solutions for the induced kernel obtained above assumed the loss function was enforced across a single batch. Of course, in practice, data are split into several batches. This batched setting may not admit a closed-form solution, but by using Proposition 3.1, we know that any optimal induced kernel takes the general form:
where 𝑩∈ℝN×N𝑩superscriptℝ𝑁𝑁{\bm{B}}\in\mathbb{R}^{N\times N} is a positive semi-definite matrix. With constraints properly chosen so that the solution for each batch is optimal (Balestriero & LeCun, 2022; HaoChen et al., 2022), one can find the optimal matrix 𝑩∗superscript𝑩{\bm{B}}^{*} by solving a semi-definite program (SDP). We perform this conversion to a SDP for the contrastive loss here and leave proofs and further details including the non-contrastive case to Section A.4.
Introducing some notation to deal with batches, assume we have N𝑁N datapoints split into nbatchessubscript𝑛batchesn_{\text{batches}} of size b𝑏b. We denote the i𝑖i-th datapoint within batch j𝑗j as 𝒙i(j)subscriptsuperscript𝒙𝑗𝑖{\bm{x}}^{(j)}{i}. As before, 𝒙isubscript𝒙𝑖{\bm{x}}{i} denotes the i𝑖i-th datapoint across the whole dataset. Let 𝑲s,s∈ℝN×Nsubscript𝑲𝑠𝑠superscriptℝ𝑁𝑁{\bm{K}}{s,s}\in\mathbb{R}^{N\times N} be the kernel matrix over the complete dataset where [𝑲s,s]i,j=k(𝒙i,𝒙j)subscriptdelimited-[]subscript𝑲𝑠𝑠𝑖𝑗𝑘subscript𝒙𝑖subscript𝒙𝑗[{\bm{K}}{s,s}]{i,j}=k({\bm{x}}{i},{\bm{x}}{j}), 𝑲s,sj∈ℝN×bsubscript𝑲𝑠subscript𝑠𝑗superscriptℝ𝑁𝑏{\bm{K}}{s,s_{j}}\in\mathbb{R}^{N\times b} be the kernel matrix between the complete dataset and batch j𝑗j where [𝑲s,sj]a,b=k(𝒙a,𝒙b(j))subscriptdelimited-[]subscript𝑲𝑠subscript𝑠𝑗𝑎𝑏𝑘subscript𝒙𝑎subscriptsuperscript𝒙𝑗𝑏[{\bm{K}}{s,s{j}}]{a,b}=k({\bm{x}}{a},{\bm{x}}^{(j)}_{b}), and 𝑨(j)superscript𝑨𝑗{\bm{A}}^{(j)} be the adjacency matrix for inputs in batch j𝑗j. With this notation, we now aim to minimize the loss function adapted from Equation 7 including a regularizing term for the RKHS norm:
where α∈ℝ+𝛼superscriptℝ\alpha\in\mathbb{R^{+}} is a weighting term for the regularizer. Taking the limit of α→0→𝛼0\alpha\to 0, we can find the optimal induced kernel for a representation of dimension K>b𝐾𝑏K>b by enforcing that optimal representations are obtained in each batch:
where as before, where (𝑰+𝑨(j))+subscript𝑰superscript𝑨𝑗({\bm{I}}+{\bm{A}}^{(j)})_{+} is equal to the projection of 𝑰+𝑨(j)𝑰superscript𝑨𝑗{\bm{I}}+{\bm{A}}^{(j)} onto its eigenspace of positive eigenvalues. Relaxing and removing the constraint that rank(𝑩)=Krank𝑩𝐾\operatorname{rank}({\bm{B}})=K results in an SDP which can be efficiently solved using existing optimizers. Further details and a generalization of this conversion to other representation dimensions is shown in Section A.4.
As a loose rule, the induced kernel will correlate points that are close in the kernel space or related by augmentations in the SSL dataset and uncorrelate points otherwise. As an example, note that in the contrastive setting (Equation 8), if one calculates the induced kernel k∗(𝒙i,𝒙j)superscript𝑘subscript𝒙𝑖subscript𝒙𝑗k^{}({\bm{x}}{i},{\bm{x}}{j}) between two points in the SSL dataset indexed by i𝑖i and j𝑗j that are related by an augmentation (i.e., 𝑨ij=1subscript𝑨𝑖𝑗1{\bm{A}}_{ij}=1), then the kernel between these two points is k∗(𝒙i,𝒙j)=1superscript𝑘subscript𝒙𝑖subscript𝒙𝑗1k^{}({\bm{x}}{i},{\bm{x}}{j})=1. More generally, if the two inputs to the induced kernel are close in kernel space to different points in the SSL dataset that are known to be related by 𝑨𝑨{\bm{A}}, then the kernel value will be close to 111. We formalize this intuition below for the standard setting with pairwise augmentations.
Given kernel function k(⋅,⋅)𝑘⋅⋅k(\cdot,\cdot) with corresponding map Φ(⋅)Φ⋅\Phi(\cdot) into the RKHS ℋℋ\mathcal{H}, let {𝐱1,𝐱2,…𝐱N}subscript𝐱1subscript𝐱2…subscript𝐱𝑁{{\bm{x}}{1},{\bm{x}}{2},\dots\bm{x}{N}} be an SSL dataset normalized such that k(𝐱i,𝐱i)=1𝑘subscript𝐱𝑖subscript𝐱𝑖1k({\bm{x}}{i},{\bm{x}}{i})=1 and formed by pairwise augmentations (i.e., every element has exactly one neighbor in 𝐀𝐀{\bm{A}}) with kernel matrix 𝐊s,ssubscript𝐊𝑠𝑠{\bm{K}}{s,s}. Given two points 𝐱𝐱{\bm{x}} and 𝐱′superscript𝐱′{\bm{x}}^{\prime}, if there exists two points in the SSL dataset indexed by i𝑖i and j𝑗j which are related by an augmentation (𝐀ijsubscript𝐀𝑖𝑗{\bm{A}}{ij}=1) and ‖Φ(𝐱)−Φ(𝐱i)‖ℋ≤Δ5∥|𝐊s,s−1∥N|\Phi({\bm{x}})-\Phi({\bm{x}}{i})|{\mathcal{H}}\leq\frac{\Delta}{5||{\bm{K}}{s,s}^{-1}|\sqrt{N}} and ‖Φ(𝐱′)−Φ(𝐱j)‖ℋ≤Δ5∥|𝐊s,s−1∥N|\Phi({\bm{x}}^{\prime})-\Phi({\bm{x}}{j})|{\mathcal{H}}\leq\frac{\Delta}{5||{\bm{K}}_{s,s}^{-1}|\sqrt{N}}, then the induced kernel for the contrastive loss is at least k∗(𝐱,𝐱′)≥1−Δsuperscript𝑘𝐱superscript𝐱′1Δk^{*}({\bm{x}},{\bm{x}}^{\prime})\geq 1-\Delta.
We prove the above statement in Section A.5. The bounds in the above statement which depend on the number of datapoints N𝑁N and the kernel matrix norm ‖𝑲s,s−1‖normsuperscriptsubscript𝑲𝑠𝑠1|{\bm{K}}_{s,s}^{-1}| are not very tight and solely meant to provide intuition for the properties of the induced kernel. In more realistic settings, stronger correlations will be observed for much weaker values of the assumptions. In light of this, we visualize the induced kernel values and their relations to the original kernel function in Section 4.1 on a simple 222-dimensional spiral dataset. Here, it is readily observed that the induced kernel better connects points along the data manifold that are related by the adjacency matrix.
In downstream tasks, one can apply the induced kernels directly on supervised algorithms such as kernel regression or SVM. Alternatively, one can also extract representations directly by obtaining the representation as k∗(𝒙,⋅)=𝑴𝒌s,𝒙superscript𝑘𝒙⋅𝑴subscript𝒌𝑠𝒙k^{}({\bm{x}},\cdot)={\bm{M}}{\bm{k}}{s,{\bm{x}}} as shown in Proposition 3.1 and employ any learning algorithm from these features. As an example, in kernel regression, we are given a dataset of size Ntsubscript𝑁𝑡N{t} consisting of input-output pairs {𝒙i(t),yi}subscriptsuperscript𝒙𝑡𝑖subscript𝑦𝑖{{\bm{x}}^{(t)}{i},y{i}} and aim to train a linear model to minimize the mean squared error loss of the outputs (Williams & Rasmussen, 2006). The optimal solution using an induced kernel k∗(⋅,⋅)superscript𝑘⋅⋅k^{}(\cdot,\cdot) takes the form:
where 𝑲t,t∗−1superscriptsubscript𝑲𝑡𝑡absent1{\bm{K}}{t,t}^{-1} is the kernel matrix of the supervised training dataset with entry i,j𝑖𝑗i,j equal to k∗(𝒙i(t),𝒙j(t))superscript𝑘subscriptsuperscript𝒙𝑡𝑖subscriptsuperscript𝒙𝑡𝑗k^{}({\bm{x}}^{(t)}{i},{\bm{x}}^{(t)}_{j}) and 𝒚𝒚{\bm{y}} is the concatenation of the targets as a vector. Note that since kernel methods generally have complexity that scales quadratically with the number of datapoints, such algorithms may be unfeasible in large-scale learning tasks unless modifications are made.
A natural question is when and why should one prefer the induced kernel of SSL to a kernel used in the standard supervised setting perhaps including data augmentation? Kernel methods generally fit a dataset perfectly so an answer to the question more likely arises from studying generalization. In kernel methods, generalization error typically tracks well with the norm of the classifier captured by the complexity quantity sN(𝑲)subscript𝑠𝑁𝑲s_{N}({\bm{K}}) defined as (Mohri et al., 2018; Steinwart & Christmann, 2008)
where 𝒚𝒚{\bm{y}} is a vector of targets and 𝑲𝑲{\bm{K}} is the kernel matrix of the supervised dataset. For example, the generalization gap of an SVM algorithm can be bounded with high probability by O(sN(𝑲)/N)𝑂subscript𝑠𝑁𝑲𝑁O(\sqrt{s_{N}({\bm{K}})/N}) (see example proof in Section B.1) (Meir & Zhang, 2003; Huang et al., 2021). For kernels functions k(⋅,⋅)𝑘⋅⋅k(\cdot,\cdot) bounded in output between 00 and 111, the quantity sN(𝑲)subscript𝑠𝑁𝑲s_{N}({\bm{K}}) is minimized in binary classification when k(𝒙,𝒙′)=1𝑘𝒙superscript𝒙′1k({\bm{x}},{\bm{x}}^{\prime})=1 for 𝒙,𝒙′𝒙superscript𝒙′{\bm{x}},{\bm{x}}^{\prime} drawn from the same class and k(𝒙,𝒙′)=0𝑘𝒙superscript𝒙′0k({\bm{x}},{\bm{x}}^{\prime})=0 for 𝒙,𝒙′𝒙superscript𝒙′{\bm{x}},{\bm{x}}^{\prime} drawn from distinct classes. If the induced kernel works ideally – in the sense that it better correlates points within a class and decorrelates points otherwise – then the entries of the kernel matrix approach these optimal values. This intuition is also supported by the hypothesis that self-supervised and semi-supervised algorithms perform well by connecting the representations of points on a common data manifold (HaoChen et al., 2021; Belkin & Niyogi, 2004). To formalize this somewhat, consider such an ideal, yet fabricated, setting where the SSL induced kernel has complexity sN(𝑲∗)subscript𝑠𝑁superscript𝑲s_{N}({\bm{K}}^{*}) that does not grow with the dataset size.
Given a supervised dataset of N𝑁N points for binary classification drawn from a distribution with m−1subscript𝑚1m_{-1} and m+1subscript𝑚1m_{+1} connected manifolds for classes with labels −11-1 and +11+1 respectively, if the induced kernel matrix of the dataset 𝐊∗superscript𝐊{\bm{K}}^{} successfully separates the manifolds such that k∗(𝐱,𝐱′)=1superscript𝑘𝐱superscript𝐱′1k^{}({\bm{x}},{\bm{x}}^{\prime})=1 if 𝐱,𝐱′𝐱superscript𝐱′{\bm{x}},{\bm{x}}^{\prime} are in the same manifold and k∗(𝐱,𝐱′)=0superscript𝑘𝐱superscript𝐱′0k^{}({\bm{x}},{\bm{x}}^{\prime})=0 otherwise, then sN(𝐊∗)=m−1+m+1=O(1)subscript𝑠𝑁superscript𝐊subscript𝑚1subscript𝑚1𝑂1s_{N}({\bm{K}}^{})=m_{-1}+m_{+1}=O(1).
The simple proof of the above is in Appendix B. In short, we conjecture that SSL should be preferred in such settings where the relationships between datapoints are “strong” enough to connect similar points in a class on the same manifold. We analyze the quantity sN(𝑲)subscript𝑠𝑁𝑲s_{N}({\bm{K}}) in Section C.3 to add further empirical evidence behind this hypothesis.
In this section, we empirically investigate the performance and properties of the SSL kernel methods on a toy spiral dataset and portions of the MNIST and eMNIST datasets for hand-drawn digits and characters (Cohen et al., 2017). As with other works, we focus on small-data tasks where kernel methods can be performed efficiently without modifications needed for handling large datasets (Arora et al., 2019a; Fernández-Delgado et al., 2014). For simplicity and ease of analysis, we perform experiments here with respect to the RBF kernel. Additional experiments reinforcing these findings and also including analysis with neural tangent kernels can be found in Appendix C.
contrastive-MNIST
For an intuitive understanding, we provide a visualization in Figure 2 to show how the SSL-induced kernel is helping to manipulate the distance and disentangle manifolds in the representation space. In Figure 2, we study two entangled 1-D spiral manifolds in a 2D space with 200 training points uniformly distributed on the spiral manifolds. We use the non-contrastive SSL-induced kernel, following Equation 5, to demonstrate this result, whereas a contrastive SSL-induced kernel is qualitatively similar and left to Section C.1. In the RBF kernel space shown in the first row of Figure 2, the value of the kernel is captured purely by distance. Next, to consider the SSL setting, we construct the graph Laplacian matrix 𝑳𝑳{\bm{L}} by connecting vertices between any training points with k(x1,x2)>d𝑘subscript𝑥1subscript𝑥2𝑑k(x_{1},x_{2})>d, i.e., 𝑳ij=−1subscript𝑳𝑖𝑗1{\bm{L}}{ij}=-1 if k(xi,xj)>d𝑘subscript𝑥𝑖subscript𝑥𝑗𝑑k(x{i},x_{j})>d and 𝑳ij=0subscript𝑳𝑖𝑗0{\bm{L}}_{ij}=0 otherwise. The diagonal entries of 𝑳𝑳{\bm{L}} are equal to the degree of the vertices, respectively. This construction can be viewed as using the Euclidean neighborhoods of x𝑥x as the data augmentation. We choose β=0.4𝛽0.4\beta=0.4, where other choices within a reasonable range lead to similar qualitative results. In the second row of Figure 2, we show the induced kernel between selected points (marked with an x) and other training points in the SSL-induced kernel space. When d𝑑d is chosen properly, as observed in the second row of Figure 2(a), the SSL-induced kernel faithfully captures the topology of manifolds and leads to a more disentangled representation. However, the augmentation has to be carefully chosen as Figure 2(b) shows that when d𝑑d is too large, the two manifolds become mixed in the representation space.
We explore in Figure 3 the supervised classification setting of MNIST and EMNIST which consist of 28×28282828\times 28 grayscale images. (E)MNIST provide a strong baseline to evaluate kernel methods due to the absence of background in the images making kernels such as RBF more aligned to measure input similarities. In this setting, we explore two different data-augmentation (DA) policies, one aligned with the data distribution (rotation+translation+scaling) and one largely misaligned with the data distribution (aggressive Gaussian blur). Because our goal is to understand how much DA impacts the SSL kernel compared to a fully supervised benchmark, we consider two (supervised) benchmarks: one that employs the labels of the sampled training set and all the augmented samples and one that only employs the sampled training set and no augmented samples. We explore a small training set size going from N=16𝑁16N=16 to N=256𝑁256N=256 and for each case we produce a number of augmented samples for each datapoint so that the total number of samples does not exceed 50,0005000050,000 which is a standard threshold for kernel methods. We note that our implementation is based on the common Python scientific library Numpy/Scipy (Harris et al., 2020) and runs on CPU. We observed in Figure 3 that the SSL kernel is able to match and even outperform the fully supervised case when employing the correct data-augmentation, while with the incorrect data-augmentation, the performance is not even able to match the supervised case that did not see the augmented samples. To better understand the impact of different hyper-parameters onto the two kernels, we also study in Figure 4 the MNIST test set performances when varying the representation dimension K𝐾K, the SVM’s ℓ2subscriptℓ2\ell_{2} regularization parameter, and the non-contrastive kernel’s β𝛽\beta parameter. We observe that although β𝛽\beta is an additional hyper-parameter to tune, its tuning plays a similar role to K𝐾K, the representation dimension. Hence, in practice, the design of the non-contrastive model can be controlled with a single parameter as with the contrastive setting. We also interestingly observe that K𝐾K is a more preferable parameter to tune to prevent over-fitting as opposed to SVM’s ℓ2subscriptℓ2\ell_{2} regularizer.
Our work explores the properties of SSL algorithms when trained via kernel methods. Connections between kernel methods and neural networks have gained significant interest in the supervised learning setting (Neal, 1996; Lee et al., 2017) for their potential insights into the training of deep networks. As we show in this study, such insights into the training properties of SSL algorithms can similarly be garnered from an analysis of SSL algorithms in the kernel regime. Our theoretical and empirical analysis, for example, highlights the importance of the choice of augmentations and encoded relationships between data points on downstream performance. Looking forward, we believe that interrelations between this kernel regime and the actual deep networks trained in practice can be strengthened particularly by analyzing the neural tangent kernel. In line with similar analysis in the supervised setting (Yang et al., 2022; Seleznova & Kutyniok, 2022; Lee et al., 2020), neural tangent kernels and their corresponding induced kernels in the SSL setting may shine light on some of the theoretical properties of the finite width networks used in practice.
Let 𝐱i∈𝒳subscript𝐱𝑖𝒳{\bm{x}}_{i}\in\mathcal{X} for i∈[N]𝑖delimited-[]𝑁i\in[N] be elements of a dataset of size N𝑁N, k:𝒳×𝒳→ℝ:𝑘→𝒳𝒳ℝk:\mathcal{X}\times\mathcal{X}\to\mathbb{R} be a kernel function with corresponding map Φ:𝒳→ℋ:Φ→𝒳ℋ\Phi:\mathcal{X}\to\mathcal{H} into the RKHS ℋℋ\mathcal{H}, and r:[0,∞)→ℝ:𝑟→0ℝr:[0,\infty)\to\mathbb{R} be any strictly increasing function. Let 𝐖:ℋ→ℝK:𝐖→ℋsuperscriptℝ𝐾{\bm{W}}:\mathcal{H}\to\mathbb{R}^{K} be a linear function mapping inputs to their corresponding representations. Given a regularized loss function of the form
where ℛ(𝐖Φ(𝐱1),…,𝐖Φ(𝐱N))ℛ𝐖Φsubscript𝐱1…𝐖Φsubscript𝐱𝑁\mathcal{R}({\bm{W}}\Phi({\bm{x}}{1}),\dots,{\bm{W}}\Phi({\bm{x}}{N})) is an error function that depends on the representations of the dataset, the minimizer 𝐖∗superscript𝐖{\bm{W}}^{*} of this loss function will be in the span of the training points {Φ(𝐱i),i∈{1,…,N}}Φsubscript𝐱𝑖𝑖1…𝑁{\Phi({\bm{x}}_{i}),;i\in{1,\dots,N}}, i.e. for any ϕ∈ℋbold-ϕℋ\bm{\phi}\in\mathcal{H}:
Decompose 𝑾=𝑾∥+𝑾⊥𝑾subscript𝑾parallel-tosubscript𝑾bottom{\bm{W}}={\bm{W}}{\parallel}+{\bm{W}}{\bot}, where 𝑾⊥Φ(𝒙i)=𝟎subscript𝑾bottomΦsubscript𝒙𝑖0{\bm{W}}{\bot}\Phi({\bm{x}}{i})=\bm{0} for all i∈[N]𝑖delimited-[]𝑁i\in[N]. For a loss function ℒℒ\mathcal{L} of the form listed above, we have
Where in the terms in the sum, we used the property that 𝑾⊥subscript𝑾bottom{\bm{W}}_{\bot} is not in the span of the data. For the regularizer term, we note that
Therefore, strictly enforcing 𝑾⊥=0superscript𝑾bottom0{\bm{W}}^{\bot}=0 minimizes the regularizer while leaving the rest of the cost function unchanged. ∎
As a consequence of the above, all optimal solutions must have support over the span of the data. This directly results in the statement shown in Proposition 3.1.
Throughout this section, for simplicity, we define 𝑴:=𝑰−1N𝟏𝟏⊺assign𝑴𝑰1𝑁superscript11⊺{\bm{M}}:={\bm{I}}-\frac{1}{N}\bm{1}\bm{1}^{\intercal}. With slight abuse of notation, we also denote 𝑿∈ℝN×D𝑿superscriptℝ𝑁𝐷{\bm{X}}\in\mathbb{R}^{N\times D} as a matrix whose i𝑖i-th row contains the features of Φ(𝒙i)Φsubscript𝒙𝑖\Phi({\bm{x}}_{i}).
Note that if the RKHS is infinite dimensional, one can apply the general form of the solution as shown in Proposition 3.1 to reframe the problem into the finite dimensional setting below. As a reminder, we aim to minimize the VICreg cost function:
By applying the definition of the Frobenius norm and from some algebra, we obtain:
The optimum of the above formulation has the same optimum as the following optimization problem defined as C′(𝑾)superscript𝐶′𝑾C^{\prime}({\bm{W}}):
Since 𝑴𝑴{\bm{M}} is a projector and 𝑴(2𝑴−β𝑳)=2𝑴−β𝑳𝑴2𝑴𝛽𝑳2𝑴𝛽𝑳{\bm{M}}(2{\bm{M}}-\beta{\bm{L}})=2{\bm{M}}-\beta{\bm{L}} (since the all-ones vector is in the kernel of 𝑳𝑳{\bm{L}}), then we can solve the above by employing the Eckart-Young theorem and matching the K𝐾K-dimensional eigenspace of 𝑿𝑾⊺𝑾𝑿⊺𝑴𝑿superscript𝑾⊺𝑾superscript𝑿⊺𝑴{\bm{X}}{\bm{W}}^{\intercal}{\bm{W}}{\bm{X}}^{\intercal}{\bm{M}} with that of the top K𝐾K eigenspace of 𝑴−β𝑳/2𝑴𝛽𝑳2{\bm{M}}-\beta{\bm{L}}/2.
One must be careful in choosing this optimum as 𝑾𝑿⊺𝑴𝑿𝑾⊺𝑾superscript𝑿⊺𝑴𝑿superscript𝑾⊺{\bm{W}}{\bm{X}}^{\intercal}{\bm{M}}{\bm{X}}{\bm{W}}^{\intercal} can only take positive eigenvalues. Therefore, this is achieved by choosing the optimal 𝑾∗superscript𝑾{\bm{W}}^{*} to project the data onto this eigenspace as
Note, that since 𝑾𝑾{\bm{W}} only appears in the cost function in the form 𝑾⊺𝑾superscript𝑾⊺𝑾{\bm{W}}^{\intercal}{\bm{W}}, the solution above is only unique up to an orthogonal transformation. Furthermore, the rank of 𝑿𝑾⊺𝑾𝑿⊺𝑴𝑿superscript𝑾⊺𝑾superscript𝑿⊺𝑴{\bm{X}}{\bm{W}}^{\intercal}{\bm{W}}{\bm{X}}^{\intercal}{\bm{M}} is at most N−1𝑁1N-1 so a better optimum cannot be achieved by increasing the output dimension of the linear transformation beyond N𝑁N. To see that this produces the optimal induced kernel, we simply plug in the optimal 𝑾∗superscript𝑾{\bm{W}}^{*}:
Now, it needs to be shown that 𝑾∗superscript𝑾{\bm{W}}^{*} is the unique norm optimizer of the optimization problem. To show this, we analyze the following semi-definite program which is equivalent since the cost function is over positive semi-definite matrices of the form 𝑾⊺𝑾superscript𝑾⊺𝑾{\bm{W}}^{\intercal}{\bm{W}}:
To lighten notation, we denote 𝑷K=𝑪:,≤K(𝑰−𝑫)≤K,≤K1/2𝑪:,≤K⊺subscript𝑷𝐾subscript𝑪:absent𝐾superscriptsubscript𝑰𝑫absent𝐾absent𝐾12superscriptsubscript𝑪:absent𝐾⊺{\bm{P}}{K}={\bm{C}}{:,\leq K}\left({\bm{I}}-{\bm{D}}\right){\leq K,\leq K}^{1/2}{\bm{C}}{:,\leq K}^{\intercal} and simply use 𝑿𝑿{\bm{X}} for 𝑿𝑿{\bm{X}}. The above can easily be derived by setting 𝑩=𝑾⊺𝑾𝑩superscript𝑾⊺𝑾{\bm{B}}={\bm{W}}^{\intercal}{\bm{W}} in Equation 22. This has corresponding dual
The optimal primal can be obtained from 𝑾∗superscript𝑾{\bm{W}}^{} by 𝑩∗=(𝑾∗)⊺𝑾∗=𝑿+𝑷K(𝑿+)⊺superscript𝑩superscriptsuperscript𝑾⊺superscript𝑾superscript𝑿subscript𝑷𝐾superscriptsuperscript𝑿⊺{\bm{B}}^{}=\left({\bm{W}}^{}\right)^{\intercal}{\bm{W}}^{}={\bm{X}}^{+}{\bm{P}}_{K}\left({\bm{X}}^{+}\right)^{\intercal}. The optimal dual can be similarly calculated and is equal to 𝒀∗=(𝑿+)⊺𝑿+superscript𝒀superscriptsuperscript𝑿⊺superscript𝑿{\bm{Y}}^{*}=\left({\bm{X}}^{+}\right)^{\intercal}{\bm{X}}^{+}. A straightforward calculation shows that the optimum value of the primal and dual formulation are equal for the given solutions. We now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):
The primal feasibility and dual feasibility criteria are straightforward to check. For complementary slackness, we note that
In the above we used the fact that 𝑴𝑳K=𝑳K𝑴subscript𝑳𝐾subscript𝑳𝐾{\bm{M}}{\bm{L}}{K}={\bm{L}}{K} since 𝑴𝑴{\bm{M}} is a projector and 𝑳Ksubscript𝑳𝐾{\bm{L}}_{K} is unchanged by that projection. This completes the proof of the optimality.
For the contrastive loss, we follow a similar approach as above to find the minimum norm solution that obtains the optimal representation. Note, that the loss function contains the term ‖𝑿𝑾⊺𝑾𝑿⊺−(𝑰+𝑨)‖F2superscriptsubscriptnorm𝑿superscript𝑾⊺𝑾superscript𝑿⊺𝑰𝑨𝐹2\left|{\bm{X}}{\bm{W}}^{\intercal}{\bm{W}}{\bm{X}}^{\intercal}-\left({\bm{I}}+{\bm{A}}\right)\right|{F}^{2}. Since 𝑿𝑾⊺𝑾𝑿⊺𝑿superscript𝑾⊺𝑾superscript𝑿⊺{\bm{X}}{\bm{W}}^{\intercal}{\bm{W}}{\bm{X}}^{\intercal} is positive semi-definite, then this is optimized when 𝑿𝑾⊺𝑾𝑿⊺𝑿superscript𝑾⊺𝑾superscript𝑿⊺{\bm{X}}{\bm{W}}^{\intercal}{\bm{W}}{\bm{X}}^{\intercal} matches the positive eigenspace of (𝑰+𝑨)𝑰𝑨\left({\bm{I}}+{\bm{A}}\right) defined as (𝑰+𝑨)+subscript𝑰𝑨\left({\bm{I}}+{\bm{A}}\right){+}. Enumerating the eigenvalues of 𝑨𝑨{\bm{A}} as 𝒗isubscript𝒗𝑖{\bm{v}}{i} with corresponding eigenvalues eisubscript𝑒𝑖e{i}, then (𝑰+𝑨)+=∑i:ei≥−1(ei+1)𝒗i𝒗i†subscript𝑰𝑨subscript:𝑖subscript𝑒𝑖1subscript𝑒𝑖1subscript𝒗𝑖superscriptsubscript𝒗𝑖†\left({\bm{I}}+{\bm{A}}\right){+}=\sum{i:e_{i}\geq-1}(e_{i}+1){\bm{v}}{i}{\bm{v}}{i}^{\dagger}. More generally, if the dimension of the representation is restricted such that K<N𝐾𝑁K<N, then we abuse notation and define
where eisubscript𝑒𝑖e_{i} are sorted in descending order.
To find the minimum RKHS norm solution, we have to solve a similar SDP to Equation 25:
This has corresponding dual
The optimal primal is 𝑩∗=𝑿+(𝑰+𝑨)+(𝑿+)⊺superscript𝑩superscript𝑿subscript𝑰𝑨superscriptsuperscript𝑿⊺{\bm{B}}^{}={\bm{X}}^{+}({\bm{I}}+{\bm{A}})_{+}\left({\bm{X}}^{+}\right)^{\intercal}. The optimal dual is equal to 𝒀∗=(𝑿+)⊺𝑿+superscript𝒀superscriptsuperscript𝑿⊺superscript𝑿{\bm{Y}}^{}=\left({\bm{X}}^{+}\right)^{\intercal}{\bm{X}}^{+}. Directly plugging these in shows that the optimum value of the primal and dual formulation are equal for the given solutions. As before, we now check whether the chosen solutions of the primal and dual satisfy the KKT optimality conditions (Boyd et al., 2004):
In general scenarios, Proposition 3.1 gives a prescription for calculating the optimal induced kernel for more complicated optimization tasks since we note that the optimal induced kernel k∗(𝒙,𝒙′)superscript𝑘𝒙superscript𝒙′k^{*}({\bm{x}},{\bm{x}}^{\prime}) must be of the form below:
where 𝒌𝒙,ssubscript𝒌𝒙𝑠{\bm{k}}{{\bm{x}},s} is a row vector whose entry i𝑖i equals the kernel k(𝒙,𝒙i)𝑘𝒙subscript𝒙𝑖k({\bm{x}},{\bm{x}}{i}) between 𝒙𝒙{\bm{x}} and the i𝑖i-th datapoint and 𝑩∈ℝN×N𝑩superscriptℝ𝑁𝑁{\bm{B}}\in\mathbb{R}^{N\times N} is a positive semi-definite matrix.
For example, such a scenario arises when one wants to apply the loss function across nbatchessubscript𝑛batchesn_{\text{batches}} batches of data. To frame this as an optimization statement, assume we have N𝑁N datapoints split into batches of size b𝑏b each. We denote the i𝑖i-th datapoint within batch k𝑘k as 𝒙i(k)subscriptsuperscript𝒙𝑘𝑖{\bm{x}}^{(k)}{i}. As before, 𝒙isubscript𝒙𝑖{\bm{x}}{i} denotes the i𝑖i-th datapoint across the whole dataset. We define the following variables:
𝑲s,s∈ℝN×Nsubscript𝑲𝑠𝑠superscriptℝ𝑁𝑁{\bm{K}}{s,s}\in\mathbb{R}^{N\times N} (kernel matrix over complete dataset including all batches) where [𝑲s,s]i,j=k(𝒙i,𝒙j)subscriptdelimited-[]subscript𝑲𝑠𝑠𝑖𝑗𝑘subscript𝒙𝑖subscript𝒙𝑗[{\bm{K}}{s,s}]{i,j}=k({\bm{x}}{i},{\bm{x}}_{j})
𝑨(k)superscript𝑨𝑘{\bm{A}}^{(k)} is the adjacency matrix for inputs in batch k𝑘k with corresponding graph Laplacian 𝑳(k)superscript𝑳𝑘{\bm{L}}^{(k)}
In what follows, we denote the representation dimension as K𝐾K.
In the non-contrastive setting, we consider a regularized version of the batched loss of Equation 4. Applying the reduction of the loss function in Equation 22, we consider the following loss function where we want to find the minimizer 𝑩𝑩{\bm{B}}:
where α∈ℝ+𝛼superscriptℝ\alpha\in\mathbb{R}^{+} is a hyperparameter. The term Tr(𝑩𝑲s,s)Tr𝑩subscript𝑲𝑠𝑠\operatorname{Tr}({\bm{B}}{\bm{K}}_{s,s}) regularizes for the RKHS norm of the resulting solution given by 𝑩𝑩{\bm{B}}. For simplicity, we denote as before 𝑴=𝑰−1b𝟏𝟏⊺𝑴𝑰1𝑏superscript11⊺{\bm{M}}={\bm{I}}-\frac{1}{b}\bm{1}\bm{1}^{\intercal}. Taking the limit α→0→𝛼0\alpha\to 0 and enforcing a representation of dimension K𝐾K, the loss function above is minimized when we obtain the optimal representation, i.e. we must have that
where (𝑴−β𝑳(j)/2)Ksubscript𝑴𝛽superscript𝑳𝑗2𝐾\left({\bm{M}}-\beta{\bm{L}}^{(j)}/2\right)_{K} denotes the projection of 𝑴−β𝑳(j)/2𝑴𝛽superscript𝑳𝑗2{\bm{M}}-\beta{\bm{L}}^{(j)}/2 onto the eigenspace of the top K𝐾K positive singular values (this is the optimal representation as shown earlier).
Relaxing the constraint rank(𝑩)=Krank𝑩𝐾\operatorname{rank}({\bm{B}})=K forms an SDP which can be solved efficiently using existing SDP solvers (ApS, 2019; Sturm, 1999).
where (𝑰+𝑨(j))Ksubscript𝑰superscript𝑨𝑗𝐾\left({\bm{I}}+{\bm{A}}^{(j)}\right)_{K} denotes the projection of 𝑰+𝑨(j)𝑰superscript𝑨𝑗{\bm{I}}+{\bm{A}}^{(j)} onto the eigenspace of the top K𝐾K positive singular values (this is the optimal representation as shown earlier). Taking the limit of α→0→𝛼0\alpha\to 0, have a similar optimization problem:
Our goal is to prove Proposition 3.2 copied below. See 3.2
Note, the adjacency matrix 𝑨𝑨{\bm{A}} for pairwise augmentations takes the form below assuming augmented samples are placed next to each other in order.
Before proceeding, we prove a helper lemma that shows that ‖𝒌s,𝒙a−𝒌s,𝒙b‖normsubscript𝒌𝑠subscript𝒙𝑎subscript𝒌𝑠subscript𝒙𝑏|{\bm{k}}{s,{\bm{x}}{a}}-{\bm{k}}{s,{\bm{x}}{b}}| is small if ‖Φ(𝒙a)−Φ(𝒙b)‖normΦsubscript𝒙𝑎Φsubscript𝒙𝑏|\Phi({\bm{x}}{a})-\Phi({\bm{x}}{b})| is also relatively small with a factor of dependence on the dataset size.
Given kernel function k(⋅,⋅)𝑘⋅⋅k(\cdot,\cdot) with map Φ(⋅)Φ⋅\Phi(\cdot) into RKHS ℋℋ\mathcal{H} and an SSL dataset {𝐱i}i=1Nsuperscriptsubscriptsubscript𝐱𝑖𝑖1𝑁{{\bm{x}}{i}}{i=1}^{N} normalized such that k(𝐱i,𝐱i)=1𝑘subscript𝐱𝑖subscript𝐱𝑖1k({\bm{x}}{i},{\bm{x}}{i})=1, let 𝐊s,ssubscript𝐊𝑠𝑠{\bm{K}}{s,s} be the kernel matrix of the SSL dataset. If ‖Φ(𝐱a)−Φ(𝐱b)‖ℋ≤ϵsubscriptnormΦsubscript𝐱𝑎Φsubscript𝐱𝑏ℋitalic-ϵ|\Phi({\bm{x}}{a})-\Phi({\bm{x}}{b})|{\mathcal{H}}\leq\epsilon, then ‖𝐊s,s−1𝐤s,𝐱a−𝐊s,s−1𝐤s,𝐱a‖≤‖𝐊s,s−1‖Nϵnormsuperscriptsubscript𝐊𝑠𝑠1subscript𝐤𝑠subscript𝐱𝑎superscriptsubscript𝐊𝑠𝑠1subscript𝐤𝑠subscript𝐱𝑎normsuperscriptsubscript𝐊𝑠𝑠1𝑁italic-ϵ\left|{\bm{K}}{s,s}^{-1}{\bm{k}}{s,{\bm{x}}{a}}-{\bm{K}}{s,s}^{-1}{\bm{k}}{s,{\bm{x}}{a}}\right|\leq|{\bm{K}}_{s,s}^{-1}|\sqrt{N}\epsilon.
We have that
Note that 𝑲s,s−1𝒙i=𝒆isuperscriptsubscript𝑲𝑠𝑠1subscript𝒙𝑖subscript𝒆𝑖{\bm{K}}{s,s}^{-1}{\bm{x}}{i}={\bm{e}}{i} where 𝒆isubscript𝒆𝑖{\bm{e}}{i} is the vector with a 111 placed on entry i𝑖i and zeros elsewhere. From equation Equation 8, we have that
Note that 𝒆i⊺(𝑰+𝑨)+𝒆j=1superscriptsubscript𝒆𝑖⊺subscript𝑰𝑨subscript𝒆𝑗1{\bm{e}}{i}^{\intercal}\left({\bm{I}}+{\bm{A}}\right){+}{\bm{e}}{j}=1 since 𝑨ij=1subscript𝑨𝑖𝑗1{\bm{A}}{ij}=1. Therefore,
Let ‖Φ(𝒙)−Φ(𝒙i)‖≤ϵ=Δ5∥|𝑲s,s−1∥N|\Phi({\bm{x}})-\Phi({\bm{x}}{i})|\leq\epsilon=\frac{\Delta}{5||{\bm{K}}{s,s}^{-1}|\sqrt{N}} and ‖Φ(𝒙′)−Φ(𝒙j)‖≤ϵ=Δ5∥|𝑲s,s−1∥N|\Phi({\bm{x}}^{\prime})-\Phi({\bm{x}}{j})|\leq\epsilon=\frac{\Delta}{5||{\bm{K}}{s,s}^{-1}|\sqrt{N}} and by applying Lemma A.2, we have
In the above, we used the fact that 𝑨𝑨{\bm{A}} is block diagonal with pairwise constraints (see Equation 41) so ‖(𝑰+𝑨)+𝒆i‖=2normsubscript𝑰𝑨subscript𝒆𝑖2\left|\left({\bm{I}}+{\bm{A}}\right){+}{\bm{e}}{i}\right|=\sqrt{2} and ‖(𝑰+𝑨)+‖=2normsubscript𝑰𝑨2\left|\left({\bm{I}}+{\bm{A}}\right)_{+}\right|=2.
There are m+1subscript𝑚1m_{+1} + m−1subscript𝑚1m_{-1} total manifolds in the dataset. Assume some ordering of these manifolds from {1,2,…,m+1{1,2,\dots,m_{+1} + m−1}m_{-1}} and let #(i)#𝑖#(i) be the number of points in the i𝑖i-th manifold.
The rows and columns of the kernel matrix 𝑲∗superscript𝑲{\bm{K}}^{} can be permuted such that it becomes block diagonal with m+1+m−1subscript𝑚1subscript𝑚1m_{+1}+m_{-1} blocks with block i𝑖i equal to 1#(i)𝟏#(i)𝟏#(i)⊺1#𝑖subscript1#𝑖superscriptsubscript1#𝑖⊺\frac{1}{#(i)}\bm{1}{#(i)}\bm{1}{#(i)}^{\intercal} where 𝟏ksubscript1𝑘\bm{1}_{k} is the all ones vector of length k𝑘k. I.e., 𝑲∗superscript𝑲{\bm{K}}^{} permuted accordingly takes the form below:
Each block of the above is clearly a rank one matrix with eigenvalue 111. Let 𝒚mi∈ℝ#(i)subscript𝒚subscript𝑚𝑖superscriptℝ#𝑖{\bm{y}}{m{i}}\in\mathbb{R}^{#(i)} be the vector containing all labels for entries in manifold i𝑖i. Then we have
For sake of completeness, we include an example proof of the generalization bound referred to in the main text. Common to classic generalization bounds for kernel methods from several prior works (Huang et al., 2021; Meir & Zhang, 2003; Mohri et al., 2018; Vapnik, 1999; Bartlett & Mendelson, 2002), the norm of the linear solution in kernel space correlates with the resulting bound on the generalization error. We closely follow the methodology of (Huang et al., 2021), though other works follow a similar line of reasoning.
Given a linear solution in the reproducing kernel Hilbert space denoted by 𝒘𝒘{\bm{w}}, we aim to bound the generalization error in the loss function ℓ(y,y′)=|min(1,max(−1,y))−y′|ℓ𝑦superscript𝑦′11𝑦superscript𝑦′\ell(y,y^{\prime})=|\min(1,\max(-1,y))-y^{\prime}| where y′superscript𝑦′y^{\prime} denotes binary classification targets in {−1,1}11{-1,1}, y𝑦y is the output of the kernel function equal to y=⟨𝒘,𝒙⟩𝑦𝒘𝒙y=\langle{\bm{w}},{\bm{x}}\rangle for a corresponding input 𝒙𝒙{\bm{x}} in the Hilbert space or feature space. For our purposes, the solution 𝒘𝒘{\bm{w}} is given by e.g., Equation 13 equal to
The norm above controls, in a sense, the complexity of the resulting solution as it appears in the resulting generalization bound.
Given input and output spaces 𝒳𝒳\mathcal{X} and 𝒴𝒴\mathcal{Y} respectively, let 𝒟𝒟\mathcal{D} be a distribution of input/output pairs over the support 𝒳×𝒴𝒳𝒴\mathcal{X}\times\mathcal{Y}. Slightly modifying existing generalization bounds via Rademacher complexity arguments (Huang et al., 2021; Bartlett & Mendelson, 2002; Mohri et al., 2018), we prove the following generalization bound.
Let 𝐱1,…,𝐱Nsubscript𝐱1…subscript𝐱𝑁{\bm{x}}{1},\dots,{\bm{x}}{N} and y1,…,yNtsubscript𝑦1…subscript𝑦subscript𝑁𝑡y_{1},\dots,y_{N_{t}} (with 𝐲𝐲{\bm{y}} the corresponding vector storing the scalars yisubscript𝑦𝑖y_{i} as entries) be our training set of N𝑁N independent samples drawn i.i.d. from 𝒟𝒟\mathcal{D}. Let 𝐰=Tr(𝐊)N∑i=1Nt[𝐊−1𝐲]iϕ(𝐱i(t))𝐰Tr𝐊𝑁superscriptsubscript𝑖1subscript𝑁𝑡subscriptdelimited-[]superscript𝐊1𝐲𝑖italic-ϕsuperscriptsubscript𝐱𝑖𝑡{\bm{w}}=\sqrt{\frac{\operatorname{Tr}({\bm{K}})}{N}}\sum_{i=1}^{N_{t}}\left[{\bm{K}}^{-1}{\bm{y}}\right]{i}\phi\left({\bm{x}}{i}^{(t)}\right) denote the solution to the kernel regression problem normalized by the trace of the data kernel. For an L𝐿L-lipschitz loss function ℓ:𝒴×𝒴→[0,b]:ℓ→𝒴𝒴0𝑏\ell:\mathcal{Y}\times\mathcal{Y}\to[0,b], with probability 1−δ1𝛿1-\delta for any δ>0𝛿0\delta>0, we have
To prove the above, we apply a helper theorem and lemma copied below.
Let G𝐺G be a family of functions mapping from Z𝑍Z to [0,1]01[0,1]. Given N𝑁N independent samples z1,…,znsubscript𝑧1…subscript𝑧𝑛z_{1},\dots,z_{n} from Z𝑍Z, then for any δ>0𝛿0\delta>0, with probability at least 1−δ1𝛿1-\delta, the following holds:
where ℜ^S(G)subscript^ℜ𝑆𝐺\widehat{\mathfrak{R}}_{S}(G) denotes the empirical Rademacher complexity of G𝐺G equal to
Let Φ:ℝ→ℝ:Φ→ℝℝ\Phi:\mathbb{R}\to\mathbb{R} be L𝐿L-Lipschitz. Then for any hypothesis set G𝐺G of real-valued functions,
Now, we are ready to prove Theorem B.1.
We consider a function class Gγsubscript𝐺𝛾G_{\gamma} defined as the set of linear functions on the reproducing kernel Hilbert space such that Gγ={⟨𝒘,⋅⟩:‖𝒘‖2≤γ}subscript𝐺𝛾conditional-set𝒘⋅superscriptnorm𝒘2𝛾G_{\gamma}={\langle{\bm{w}},\cdot\rangle:|{\bm{w}}|^{2}\leq\gamma}. Given a dataset of inputs 𝒙1,…,𝒙Nsubscript𝒙1…subscript𝒙𝑁{\bm{x}}{1},\dots,{\bm{x}}{N} in the reproducing kernel Hilbert space with corresponding targets y1,…,yNsubscript𝑦1…subscript𝑦𝑁y_{1},\dots,y_{N}, let ϵ𝒘(𝒙i)=ℓ(⟨𝒘,𝒙⟩,yi)∈[0,b]subscriptitalic-ϵ𝒘subscript𝒙𝑖ℓ𝒘𝒙subscript𝑦𝑖0𝑏\epsilon_{{\bm{w}}}({\bm{x}}{i})=\ell(\langle{\bm{w}},{\bm{x}}\rangle,y{i})\in[0,b]. The inequality in Theorem B.2 applies for any given Gγsubscript𝐺𝛾G_{\gamma} but we would like this to hold for all γ∈{1,2,3,…}𝛾123…\gamma\in{1,2,3,\dots} since ‖𝒘‖norm𝒘|{\bm{w}}| can be unbounded. Since our loss function ℓℓ\ell is bounded between [0,b]0𝑏[0,b], we multiply it by 1/b1𝑏1/b so that it is ranged in [0,1]01[0,1] as needed for Theorem B.2. Then, we apply Theorem B.2 to ϵ𝒘(𝒙i)subscriptitalic-ϵ𝒘subscript𝒙𝑖\epsilon_{{\bm{w}}}({\bm{x}}{i}) over the class Gγsubscript𝐺𝛾G{\gamma} for each γ∈{1,2,3,…}𝛾123…\gamma\in{1,2,3,\dots} with probability δγ=δ(e−1)e−γsubscript𝛿𝛾𝛿𝑒1superscript𝑒𝛾\delta_{\gamma}=\delta(e-1)e^{-\gamma}. This implies that
This shows that for any γ𝛾\gamma, the above inequality holds with probability 1−δ(e−1)e−γ1𝛿𝑒1superscript𝑒𝛾1-\delta(e-1)e^{-\gamma}. However, we need to show this holds for all γ𝛾\gamma simultaneously. To achieve this, we apply a union bound which holds with probability 1−∑γ=1∞δγ=1−∑γ=1∞δ(e−1)e−γ=1−δ1superscriptsubscript𝛾1subscript𝛿𝛾1superscriptsubscript𝛾1𝛿𝑒1superscript𝑒𝛾1𝛿1-\sum_{\gamma=1}^{\infty}\delta_{\gamma}=1-\sum_{\gamma=1}^{\infty}\delta(e-1)e^{-\gamma}=1-\delta.
Applying Talagrand’s lemma (Lemma B.3) followed by the Cauchy-Schwarz inequality,
Since we normalize such that Tr(𝑲)=NTr𝑲𝑁\operatorname{Tr}({\bm{K}})=N, we have
Plugging in ‖𝒘‖2=𝒚⊺𝑲−1𝒚superscriptnorm𝒘2superscript𝒚⊺superscript𝑲1𝒚|{\bm{w}}|^{2}={\bm{y}}^{\intercal}{\bm{K}}^{-1}{\bm{y}} and noting that we normalized the kernel to have Tr(𝑲)=NTr𝑲𝑁\operatorname{Tr}({\bm{K}})=N thus completes the proof.
In further exploring the performance of SSL kernel methods on small datasets, we perform further numerical experiments on the MNIST dataset using kernel functions derived from the neural tangent kernel (NTK) of commonly used neural networks (Jacot et al., 2018). We use the neural-tangents package to explicitly calculate the NTK for a given architecture (Novak et al., 2020). The basic setup is repeated from Section 4.2 where two different types of augmentations are performed on images. As before, we consider augmentations by Gaussian blurring of the pixels or small translations, rotations, and zooms.
Test set accuracy results are shown in Table 1 for the NTK associated to a 3-layer fully connected network with infinite width at each hidden layer. Similarly, in Table 2, we show a similar analysis for the NTK of a CNN with seven convolutional layers followed by a global pooling layer. We use the contrastive induced kernel for the SSL kernel algorithm. The findings are similar to those reported in the main text. The SSL method performs similarly to the baseline method without any self-supervised learning but including labeled augmented datapoints in the training set. In some cases, the SSL method even outperforms the baseline augmented setting.
To empirically analyze generalization, we track the complexity quantity s N ( K ) here defined in Equation (14) and copied below:
where y is a vector of targets and K is the kernel matrix of the supervised dataset. As a reminder, the generalization gap can be bounded with high probability by O ( √ s N ( K ) /N ) , and in the main text, we provided heuristic indication that this quantity may be reduced when using the induced kernel. Here, we empirically analyze whether this holds true in the MNIST setting considered in the main text. As shown in Figure 6, the SSL induced kernel produces smaller values of s N ( K ) than its corresponding supervised counterpart. We calculate this figure by splitting the classes into binary parts (even and odd integers) in order to construct a vector y that mimics a binary classification task. Comparing this to the test accuracy results reported in Figure 4, it is clear that s N ( K ) also correlates inversely with test accuracy as expected. The results here lend further evidence to the hypothesis that the induced kernel better correlates points along a data manifold as outlined in Proposition 3.3.
Table: A3.T1: NTK for 3-layer fully connected network: Test set accuracy of SVM using the neural tangent kernel of a 3 layer fully connected network in only a supervised (baseline) setting or via the induced kernel in a self-supervised setting. The induced kernel for the SSL algorithm is calculated using the contrastive induced kernel. Numbers shown above are the test set accuracy for classifying MNIST digits for small dataset sizes with the given number of samples. Due to the quadratic scaling of memory and runtime for kernel methods, we restricted analysis to more feasible settings where there were less than 25,000 total samples (number of augmentations times number of samples).
| Test Set Accuracy | |||||||||
| Num. Augmentations | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | ||
| Augmentation | Samples | Algorithm | |||||||
| Gaussian blur | 16 | self-supervised | 32.3 | 32.3 | 32.6 | 29.1 | 32.8 | 32.3 | 32.5 |
| baseline (no augment) | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | ||
| baseline (augmented) | 34.8 | 34.9 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | ||
| 64 | self-supervised | 71.8 | 71.8 | 71.6 | 71.4 | 72.1 | |||
| baseline (no augment) | 65.0 | 65.0 | 65.0 | 65.0 | 65.0 | ||||
| baseline (augmented) | 74.6 | 74.6 | 74.7 | 74.7 | 74.7 | ||||
| 256 | self-supervised | 87.5 | 88.0 | 88.0 | |||||
| baseline (no augment) | 86.5 | 86.5 | 86.5 | ||||||
| baseline (augmented) | 88.6 | 88.6 | 88.6 | ||||||
| translate, zoom rotate | 16 | self-supervised | 34.2 | 35.2 | 36.0 | 36.8 | 37.3 | 37.5 | 38.0 |
| baseline (no augment) | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | ||
| baseline (augmented) | 37.5 | 38.5 | 39.3 | 40.2 | 41.1 | 41.5 | 41.7 | ||
| 64 | self-supervised | 73.4 | 74.8 | 75.3 | 75.9 | 76.8 | |||
| baseline (no augment) | 65.0 | 65.0 | 65.0 | 65.0 | 65.0 | ||||
| baseline (augmented) | 77.7 | 78.8 | 79.4 | 79.7 | 80.0 | ||||
| 256 | self-supervised | 90.4 | 91.0 | 91.5 | |||||
| baseline (no augment) | 86.5 | 86.5 | 86.5 | ||||||
| baseline (augmented) | 90.4 | 90.8 | 91.1 |
Table: A3.T2: NTK for 7-layer convolutional neural network: Test set accuracy of SVM using the neural tangent kernel of a CNN with 7 layers of 3×3333\times 3 convolution followed by a global pooling layer. The induced kernel for the SSL algorithm is calculated using the contrastive induced kernel and compared to the standard SVM using the kernel itself with or without augmentation. Numbers shown above are the test set accuracy for classifying MNIST digits for small dataset sizes with the given number of samples. Due to the quadratic scaling of memory and runtime for kernel methods, we restricted analysis to more feasible settings where there were less than 25,000 total samples (number of augmentations times number of samples).
| Test Set Accuracy | |||||||||
| Num. Augmentations | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | ||
| Augmentation | Samples | Algorithm | |||||||
| Gaussian blur | 16 | self-supervised | 28.4 | 28.4 | 28.4 | 28.4 | 28.3 | 28.3 | 28.4 |
| baseline (no augment) | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | ||
| baseline (augmented) | 26.6 | 26.6 | 26.6 | 26.6 | 26.6 | 26.6 | 26.6 | ||
| 64 | self-supervised | 60.0 | 60.0 | 60.1 | 60.1 | 60.1 | |||
| baseline (no augment) | 55.6 | 55.6 | 55.6 | 55.6 | 55.6 | ||||
| baseline (augmented) | 56.5 | 56.5 | 56.5 | 56.5 | 56.5 | ||||
| 256 | self-supervised | 87.6 | 87.9 | 87.6 | |||||
| baseline (no augment) | 81.5 | 81.5 | 81.5 | ||||||
| baseline (augmented) | 82.9 | 82.9 | 82.9 | ||||||
| translate, zoom rotate | 16 | self-supervised | 33.4 | 33.8 | 34.4 | 34.6 | 35.1 | 35.4 | 35.3 |
| baseline (no augment) | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | ||
| baseline (augmented) | 30.6 | 30.7 | 30.8 | 31.6 | 31.7 | 31.9 | 32.0 | ||
| 64 | self-supervised | 70.7 | 72.7 | 73.2 | 73.8 | 74.3 | |||
| baseline (no augment) | 55.6 | 55.6 | 55.6 | 55.6 | 55.6 | ||||
| baseline (augmented) | 66.0 | 67.8 | 68.8 | 69.5 | 70.2 | ||||
| 256 | self-supervised | 91.9 | 92.3 | 92.6 | |||||
| baseline (no augment) | 81.5 | 81.5 | 81.5 | ||||||
| baseline (augmented) | 89.5 | 90.0 | 90.3 |
To translate the SSL setting into the kernel regime, we aim to find the optimal linear function which maps inputs from the RKHS into the K𝐾K-dimensional feature space of the representations. This new feature space induces a new optimal kernel denoted the “induced” kernel. Relationships between data-points are encoded in an adjacency matrix (the example matrix shown here contains pairwise relationships between datapoints).
Comparison of the RBF kernel space (first row) and induced kernel space (second row). The induced kernel is computed based on Equation 5, and the graph Laplacian matrix is derived from the inner product neighborhood in the RBF kernel space, i.e., using the neighborhoods as data augmentation. a) We plot three randomly chosen points’ kernel entries with respect to the other points on the manifolds. When the neighborhood augmentation range used to construct the Laplacian matrix is small enough, the SSL-induced kernel faithfully learns the topology of the entangled spiral manifolds. b) When the neighborhood augmentation range used to construct the Laplacian matrix is too large, it creates the “short-circuit” effect in the induced kernel space. Each subplot on the second row is normalized by its largest absolute value for better contrast.
Depiction of MNIST and EMNIST full test set performances using the contrastive and non-contrastive kernels (in red) and benchmarked against the supervised case with labels on all samples (original + augmented) in black and with labels only on the original samples in blue, with the number of original samples given by N∈{16,64,256}𝑁1664256N\in{16,64,256} (each column) and the number of augmented samples (in log2) in x-axis. The first row corresponds to Gaussian blur data-augmentation (poorly aligned with the data distributions) and the second row corresponds to random rotations (-10,10), translations (-0.1,0.1) and scaling (0.9,1.1). We set the SVM ℓ2subscriptℓ2\ell_{2} regularization to 0.0010.0010.001 and use the RBF kernel (NTK kernels in Section C.2). We restrict to settings where the total number of samples does not except 50k50𝑘50k, and in all cases, the kernel representation dimensions are unconstrained. Two key observations emerge. First, whenever the augmentation is not aligned with the data distribution, the SSL kernels falls below the supervised case, especially as N𝑁N increases. Second, when the augmentation is aligned with the data distribution, both SSL kernels are able to get close and even outperform the supervised benchmark with augmented labels.
MNIST classification task with 100100100 original training samples and 100100100 augmentations per sample on the train and test set (full, 10,0001000010,000 samples) with baselines in the titles given by supervised SVM using labels for all samples or original (100100100) samples only. We provide on the left the contrastive kernel performances when ablating over the (inverse) of the SVM’s ℓ2subscriptℓ2\ell_{2} regularizer y-axis and representation dimension (K𝐾K) in the x-axis. Similarly, we provide on the right the non-contrastive kernel performances when ablating over β𝛽\beta on the y-axis and over the representation dimension (K𝐾K) on the x-axis. We observe, as expected, that reducing K𝐾K prevents overfitting and should be preferred over the ℓ2subscriptℓ2\ell_{2} regularizer, and that β𝛽\beta acts jointly with the representation dimension i.e. one only needs to tune one of the two.
Depiction of sN(𝑲)subscript𝑠𝑁𝑲s_{N}({\bm{K}}) (Equation 14) for the case of MNIST classification as depicted in the left figure in Figure 4 computed on the training set. This setting considers MNIST classification with the contrastive kernel on a 100001000010000-sample dataset with 100100100 original MNIST samples and 100100100 augmentations. sN(𝑲)subscript𝑠𝑁𝑲s_{N}({\bm{K}}) for the supervised baselines for the full dataset (including labeled augmented samples) and only the original dataset (no augmented samples) are calculated as approximately 1534 and 203 respectively. Many values of sN(𝑲)subscript𝑠𝑁𝑲s_{N}({\bm{K}}) are smaller than the baseline numbers especially at points where the test set accuracy for the SSL induced kernel was comparitively larger. Here, we recover the trend of the test set performances obtained from Figure 4 showing that sN(𝑲)subscript𝑠𝑁𝑲s_{N}({\bm{K}}) is potentially a good indicator of test accuracy.
$$ \mW^* = \argmin_{\mW\in \mathcal{W}} \mathcal{R}(\mW \Phi(\vx_1), \dots, \mW \Phi(\vx_N)) + r\left(|\mW|_{\mathcal{H}}\right). $$
$$ \begin{split} \text{optimal representation: }& \mW^* \Phi(\vx) = \mM \vk_{\mX,\vx} \ \text{induced kernel: }& k^*(\vx, \vx') = \left( \mM \vk_{\mX,\vx}\right)^\intercal \mM \vk_{\mX,\vx'} = \vk_{\mX,\vx}^\intercal \mM ^\intercal \mM \vk_{\mX,\vx'}, \end{split} $$
$$ \mathcal{L}_{VIC} = \left| \mZ^{\intercal} \left(\mI - \frac{1}{N} \bm 1 \bm 1^\intercal \right) \mZ - \mI \right|_F^2 + \beta \operatorname{Tr}\left[ \mZ^{\intercal} \mL \mZ \right], \label{eq:vicreg_cost} $$ \tag{eq:vicreg_cost}
$$ \begin{split} k^*(\vx, \vx') &= \vk_{\vx,s} \mK_{s,s}^{-1} \left(\mI - \frac{1}{N} \bm 1 \bm 1^\intercal - \frac{\beta}{2} \mL \right)+ \mK{s,s}^{-1} \vk_{s,\vx'}, \end{split} \label{eq:induced_kernel} $$ \tag{eq:induced_kernel}
$$ \mathcal{L} = \sum_{j=1}^{n_{\text{batches}}} \left| \mK_{s_j,s} \mB \mK_{s,s_j} - \left(\mI + \mA^{(j)} \right) \right|F^2 + \alpha \Tr(\mB \mK{s,s}), $$
$$ f^(\vx) = \left[ k^(\vx, \vx^{(t)}1), k^(\vx, \vx^{(t)}_2), \dots, k^(\vx, \vx^{(t)}{N_t}) \right] \cdot \mK_{t,t}^{*-1} \vy, \label{eq:supervised_learning_kernel_output} $$ \tag{eq:supervised_learning_kernel_output}
$$ s_N(\mK)= \frac{\operatorname{Tr}(\mK)}{N} \vy^\intercal \mK^{-1} \vy,\label{eq:n_s} $$ \tag{eq:n_s}
$$ \mW^* \bm \phi=\boldsymbol{0} ;\text{ if } ; \langle \bm \phi,\Phi(\vx_i)\rangle_{\mathcal{H}}=0 ;; \forall i \in [N]. $$
$$ \begin{split}\mathcal{L}({\bm{W}}{\parallel}+{\bm{W}}{\bot})&=\mathcal{R}\left(({\bm{W}}{\parallel}+{\bm{W}}{\bot})\Phi({\bm{x}}{1}),\dots,({\bm{W}}{\parallel}+{\bm{W}}{\bot})\Phi({\bm{x}}{N})\right)+r\left(|{\bm{W}}{\parallel}+{\bm{W}}{\bot}|{\mathcal{H}}\right)\ &=\mathcal{R}\left({\bm{W}}{\parallel}\Phi({\bm{x}}{1}),\dots,{\bm{W}}{\parallel}\Phi({\bm{x}}{N})\right)+r\left(|{\bm{W}}{\parallel}+{\bm{W}}{\bot}|{\mathcal{H}}\right).\end{split} $$ \tag{A1.E17}
$$ \mX = \begin{bmatrix} \text{---} & \Phi(\vx_1)^\intercal & \text{---} \ \text{---} & \Phi(\vx_2)^\intercal & \text{---} \ & \vdots & \ \text{---} & \Phi(\vx_N)^\intercal & \text{---} \end{bmatrix} $$
$$ \begin{split}C({\bm{W}})&=\left|{\bm{W}}{\bm{X}}^{\intercal}{\bm{M}}{\bm{X}}{\bm{W}}^{\intercal}-{\bm{I}}\right|{F}^{2}+\beta\operatorname{Tr}\left[{\bm{W}}{\bm{X}}^{\intercal}{\bm{L}}{\bm{X}}{\bm{W}}^{\intercal}\right]\ &=\left|{\bm{W}}{\bm{X}}^{\intercal}{\bm{M}}{\bm{X}}{\bm{W}}^{\intercal}-{\bm{I}}\right|{F}^{2}+\operatorname{Tr}\left[{\bm{W}}{\bm{X}}^{\intercal}\left(\beta{\bm{L}}+2{\bm{M}}-2{\bm{M}}\right){\bm{X}}{\bm{W}}^{\intercal}\right]\ &=K+|{\bm{W}}{\bm{X}}^{\intercal}{\bm{M}}{\bm{X}}{\bm{W}}^{\intercal}|{F}^{2}-2\operatorname{Tr}\left[{\bm{W}}{\bm{X}}^{\intercal}{\bm{M}}{\bm{X}}{\bm{W}}^{\intercal}\right]\ &;;;;+\operatorname{Tr}\left[{\bm{W}}{\bm{X}}^{\intercal}\left(\beta{\bm{L}}+2{\bm{M}}-2{\bm{M}}\right){\bm{X}}{\bm{W}}^{\intercal}\right]\ &=K+|{\bm{W}}{\bm{X}}^{\intercal}{\bm{M}}{\bm{X}}{\bm{W}}^{\intercal}|{F}^{2}-\operatorname{Tr}\left[{\bm{W}}{\bm{X}}^{\intercal}\left(2{\bm{M}}-\beta{\bm{L}}\right){\bm{X}}{\bm{W}}^{\intercal}\right]\ &=K+|{\bm{X}}{\bm{W}}^{\intercal}{\bm{W}}{\bm{X}}^{\intercal}{\bm{M}}|_{F}^{2}-\operatorname{Tr}\left[{\bm{X}}{\bm{W}}^{\intercal}{\bm{W}}{\bm{X}}^{\intercal}{\bm{M}}\left(2{\bm{M}}-\beta{\bm{L}}\right)\right].\end{split} $$ \tag{A1.E21}
$$ C'(\mW) = | \mX \mW^\intercal \mW \mX^\intercal \mM - \left( \mM - \beta \mL/2 \right) |_F^2. \label{eq:equivalent_vicreg_cost} $$ \tag{eq:equivalent_vicreg_cost}
$$ \mW^* = \left[ \mX^{(s)^{+}} \mC_{:, \leq K} \left(\mI - \mD \right)_{\leq K, \leq K}^{1/2} \right]^\intercal , $$
$$ \begin{split} k^(\vx, \vx') &:= (\mW^\vx)^\intercal (\mW^* \vx') \ &= \vx^\intercal \mX^{(s)^{+}} \mC_{:, \leq K} \left(\mI - \mD \right){\leq K, \leq K}^{1/2} \left(\mI - \mD \right){\leq K, \leq K}^{1/2} \mC_{:, \leq K}^\intercal \left(\mX^{(s)^{+}}\right)^\intercal \vx' \ &= \vk_{\vx,s} \mK_{s,s}^{-1} \mC_{:, \leq K} \left(\mI - \mD \right){ \leq K, \leq K} \mC{:, \leq K}^\intercal \mK_{s,s}^{-1} \vk_{s,\vx'}. \end{split} $$
$$ \begin{split} \max_{\mY \in \mathbb{R}^{N \times N}} &\Tr(\mY^\intercal \mP_K) \ \text{s.t. } & \mI - \mX^\intercal \mY \mM \mX \succeq 0 \end{split} $$
$$ \begin{split} \text{Primal feasibility: }& \mX \mB^* \mX^\intercal \mM = \mP_K, ; ; \mB^* \succeq 0 \ \text{Dual feasibility: }& \mI - \mX^\intercal \mY^* \mM \mX \succeq 0 \ \text{Complementary slackness: }& \left( \mI - \mX^\intercal \mY^* \mM \mX \right) \mB^* = 0. \end{split} $$
$$ \begin{split}\left({\bm{I}}-{\bm{X}}^{\intercal}{\bm{Y}}^{}{\bm{M}}{\bm{X}}\right){\bm{B}}^{}&={\bm{X}}^{+}{\bm{L}}{K}\left({\bm{X}}^{+}\right)^{\intercal}-{\bm{X}}^{\intercal}\left({\bm{X}}^{+}\right)^{\intercal}{\bm{X}}^{+}{\bm{M}}{\bm{X}}{\bm{X}}^{+}{\bm{L}}{K}\left({\bm{X}}^{+}\right)^{\intercal}\ &={\bm{X}}^{+}{\bm{L}}{K}\left({\bm{X}}^{+}\right)^{\intercal}-{\bm{X}}^{+}{\bm{M}}{\bm{L}}{K}\left({\bm{X}}^{+}\right)^{\intercal}\ &={\bm{X}}^{+}{\bm{L}}{K}\left({\bm{X}}^{+}\right)^{\intercal}-{\bm{X}}^{+}{\bm{L}}{K}\left({\bm{X}}^{+}\right)^{\intercal}\ &=0.\end{split} $$ \tag{A1.E28}
$$ \left(\mI + \mA \right)+ = \sum{i=1}^K \max(e_i + 1,0)\vv_i \vv_i^\dagger, $$
$$ {\bm{A}}=\begin{bmatrix}0&1&&&&&\ 1&0&&&&&\ &&0&1&&&\ &&1&0&&&\ &&&&\ddots&&\ &&&&&0&1\ &&&&&1&0\end{bmatrix}. $$ \tag{A1.E41}
$$ \begin{split}\left|{\bm{K}}{s,s}^{-1}{\bm{k}}{s,{\bm{x}}{a}}-{\bm{K}}{s,s}^{-1}{\bm{k}}{s,{\bm{x}}{a}}\right|&\leq\left|{\bm{K}}{s,s}^{-1}\right|\left|{\bm{k}}{s,{\bm{x}}{a}}-{\bm{k}}{s,{\bm{x}}{a}}\right|\ &=\left|{\bm{K}}{s,s}^{-1}\right|\left[\sum_{i=1}^{N}(\langle\Phi({\bm{x}}{i}),\Phi({\bm{x}}{a})\rangle_{\mathcal{H}}-\langle\Phi({\bm{x}}{i}),\Phi({\bm{x}}{b})\rangle_{\mathcal{H}})^{2}\right]^{1/2}\ &=\left|{\bm{K}}{s,s}^{-1}\right|\left[\sum{i=1}^{N}(\langle\Phi({\bm{x}}{i}),\Phi({\bm{x}}{a})-\Phi({\bm{x}}{b})\rangle{\mathcal{H}})^{2}\right]^{1/2}\ &\leq\left|{\bm{K}}{s,s}^{-1}\right|\left[\sum{i=1}^{N}|\Phi({\bm{x}}{i})|{\mathcal{H}}^{2}|\Phi({\bm{x}}{a})-\Phi({\bm{x}}{b})|{\mathcal{H}}^{2}\right]^{1/2}\ &=\left|{\bm{K}}{s,s}^{-1}\right|\sqrt{N}\epsilon\end{split} $$ \tag{A1.E42}
$$ \begin{split}k^{*}({\bm{x}},{\bm{x}}^{\prime})&={\bm{k}}{{\bm{x}},s}{\bm{K}}{s,s}^{-1}\left({\bm{I}}+{\bm{A}}\right){+}{\bm{K}}{s,s}^{-1}{\bm{k}}{s,{\bm{x}}^{\prime}}\ &=({\bm{K}}{s,s}^{-1}{\bm{k}}{s,x}-{\bm{e}}{i}+{\bm{e}}{i})^{\intercal}\left({\bm{I}}+{\bm{A}}\right){+}({\bm{K}}{s,s}^{-1}{\bm{k}}{s,x^{\prime}}-{\bm{e}}{j}+{\bm{e}}{j})\ &={\bm{e}}{i}^{\intercal}\left({\bm{I}}+{\bm{A}}\right){+}{\bm{e}}{j}+({\bm{K}}{s,s}^{-1}{\bm{k}}{s,x}-{\bm{e}}{i})^{\intercal}\left({\bm{I}}+{\bm{A}}\right){+}{\bm{e}}{j}+{\bm{e}}{i}^{\intercal}\left({\bm{I}}+{\bm{A}}\right){+}({\bm{K}}{s,s}^{-1}{\bm{k}}{s,x^{\prime}}-{\bm{e}}{j})\ &;;;;;+({\bm{K}}{s,s}^{-1}{\bm{k}}{s,x}-{\bm{e}}{i})^{\intercal}\left({\bm{I}}+{\bm{A}}\right){+}({\bm{K}}{s,s}^{-1}{\bm{k}}{s,x^{\prime}}-{\bm{e}}{j}).\end{split} $$ \tag{A1.E43}
$$ \begin{split}k^{*}({\bm{x}},{\bm{x}}^{\prime})&=1+({\bm{K}}{s,s}^{-1}{\bm{k}}{s,x}-{\bm{e}}{i})^{\intercal}\left({\bm{I}}+{\bm{A}}\right){+}{\bm{e}}{j}+{\bm{e}}{i}^{\intercal}\left({\bm{I}}+{\bm{A}}\right){+}({\bm{K}}{s,s}^{-1}{\bm{k}}{s,x^{\prime}}-{\bm{e}}{j})\ &;;;;;+({\bm{K}}{s,s}^{-1}{\bm{k}}{s,x}-{\bm{e}}{i})^{\intercal}\left({\bm{I}}+{\bm{A}}\right){+}({\bm{K}}{s,s}^{-1}{\bm{k}}{s,x^{\prime}}-{\bm{e}}{j})\ &\geq 1-|{\bm{K}}{s,s}^{-1}{\bm{k}}{s,x}-{\bm{e}}{i}|\left|\left({\bm{I}}+{\bm{A}}\right){+}{\bm{e}}{j}\right|+|{\bm{K}}{s,s}^{-1}{\bm{k}}{s,x^{\prime}}-{\bm{e}}{j}|\left|\left({\bm{I}}+{\bm{A}}\right){+}{\bm{e}}{i}\right|\ &;;;;;+\left|{\bm{K}}{s,s}^{-1}{\bm{k}}{s,x}-{\bm{e}}{i}\right|\left|\left({\bm{I}}+{\bm{A}}\right){+}\right||{\bm{K}}{s,s}^{-1}{\bm{k}}{s,x^{\prime}}-{\bm{e}}{j}|.\ \end{split} $$ \tag{A1.E44}
$$ \begin{split}k^{*}({\bm{x}},{\bm{x}}^{\prime})&\geq 1-|{\bm{K}}{s,s}^{-1}{\bm{k}}{s,x}-{\bm{e}}{i}|\left|\left({\bm{I}}+{\bm{A}}\right){+}{\bm{e}}{j}\right|+|{\bm{K}}{s,s}^{-1}{\bm{k}}{s,x^{\prime}}-{\bm{e}}{j}|\left|\left({\bm{I}}+{\bm{A}}\right){+}{\bm{e}}{i}\right|\ &;;;;;+\left|{\bm{K}}{s,s}^{-1}{\bm{k}}{s,x}-{\bm{e}}{i}\right|\left|\left({\bm{I}}+{\bm{A}}\right){+}\right||{\bm{K}}{s,s}^{-1}{\bm{k}}{s,x^{\prime}}-{\bm{e}}{j}|\ &\geq 1-2\sqrt{2}|{\bm{K}}{s,s}^{-1}|\sqrt{N}\epsilon-2|{\bm{K}}{s,s}^{-1}|^{2}N\epsilon^{2}\ &\geq 1-(2\sqrt{2}+2)|{\bm{K}}{s,s}^{-1}|\sqrt{N}\epsilon\ &\geq 1-5|{\bm{K}}_{s,s}^{-1}|\sqrt{N}\epsilon\ &=1-\Delta.\end{split} $$ \tag{A1.E45}
$$ \begin{split} \vy^\intercal (\mK^*)^{-1} \vy &= \sum_{i=1}^{m_{+1} + m_{-1}} #(i)^{-1} \left(\langle \bm 1_{#(i)}, \vy_{m_i} \rangle \right)^2 \ &= \sum_{i=1}^{m_{+1} + m_{-1}} #(i)^{-1} (\sqrt{#(i)})^2 \ &= m_{+1} + m_{-1}. \end{split} $$
$$ \mathbb{E}{\vx,y \sim \mathcal{D}} \left[ \ell(\langle \vw, \vx \rangle, y)\right] - \frac{1}{N_t} \sum{i=1}^{N_t} \ell(\langle \vw, \vx_i \rangle, y_i) \leq \frac{2\sqrt{2}L + 3b}{\sqrt{2}} \frac{\sqrt{\operatorname{\Tr}(\mK)\vy^\intercal \mK^{-1} \vy}}{N} + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))}{2N}} $$
$$ \mathbb{E}z[g(z)] - \frac{1}{N} \sum{i=1}^N g(z_i) \leq 2 \widehat{\mathfrak{R}}_S (G) + 3 \sqrt{ \frac{\log(2/\delta)}{2N} }, $$
$$ \widehat{\mathfrak{R}}S (G) = \mathbb{E}\sigma \left[ \sup_{g \in G} \frac{1}{N} \sum_{i=1}^N \sigma_i g(z_i) \right], $$
$$ \widehat{\mathfrak{R}}_S( \Phi \circ G) \leq L \widehat{\mathfrak{R}}_S( G). $$
$$ \mathbb{E}{\vx}[\epsilon{\vw}(\vx)] - \frac{1}{N} \sum_{i=1}^N \epsilon_{\vw}(\vx_i) \leq 2 \mathbb{E}\sigma \left[ \sup{|\vv|^2 \leq \gamma } \frac{1}{N} \sum_{i=1}^N \sigma_i \epsilon_{\vv}(\vx_i) \right] + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\gamma}{2N} }. $$
$$ \begin{split} \mathbb{E}{\vx}[\epsilon{\vw}(\vx)] - \frac{1}{N} \sum_{i=1}^N \epsilon_{\vw}(\vx_i) &\leq 2\mathbb{E}\sigma \left[ \sup{|\vv|^2 \leq \ceil{|\vw|^2} } \frac{1}{N} \sum_{i=1}^N \sigma_i \epsilon_{\vv}(\vx_i) \right] + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}}{2N} } \ &\leq 2L\mathbb{E}\sigma \left[ \sup{|\vv|^2 \leq \ceil{|\vw|^2} } \frac{1}{N} \sum_{i=1}^N \sigma_i \langle \vv, \vx_i \rangle \right] + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}}{2N} } \ &\leq 2L\mathbb{E}\sigma \left[ \sup{|\vv|^2 \leq \ceil{|\vw|^2} } |\vv| \left| \frac{1}{N} \sum_{i=1}^N \sigma_i \vx_i \right| \right] + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}}{2N} } \ &\leq 2L\mathbb{E}\sigma \left[ \ceil{|\vw|} \left| \frac{1}{N} \sum{i=1}^N \sigma_i \vx_i \right| \right] + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}}{2N} }. \end{split} $$
$$ \begin{split} \mathbb{E}{\vx}[\epsilon{\vw}(\vx)] - \frac{1}{N} \sum_{i=1}^N \epsilon_{\vw}(\vx_i) &\leq \frac{2L}{\sqrt{N}} \ceil{|\vw|} + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}}{2N} } \ &\leq \frac{2L}{\sqrt{N}} \ceil{|\vw|} + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))}{2N}} + 3b\frac{\ceil{|\vw|}}{\sqrt{2N}} \ &= \frac{2\sqrt{2}L + 3b}{\sqrt{2N}} \ceil{|\vw|} + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))}{2N}}. \end{split} $$
$$ \displaystyle\mathcal{L}_{\rm vic}= $$
Proposition. [Form of optimal representation] Given a dataset $\vx_1, \dots, \vx_N \in X$, let $k(\cdot, \cdot)$ be a kernel function with corresponding map $\Phi:X \to H$ into the RKHS $H$. Let $\mW:H \to R^K$ be a function drawn from the space of linear functions $W$ mapping inputs in the RKHS to the vector space of the representation. For a risk function $R(\mW \Phi(\vx_1), \dots, \mW \Phi(\vx_N)) \in R$ and any strictly increasing function $r:[0,\infty)\to R$, consider the optimization problem equation \mW^* = \argmin_{\mW\in W} R(\mW \Phi(\vx_1), \dots, \mW \Phi(\vx_N)) + r\left(|\mW|{H}\right). equation The optimal solutions of the above take the form equation split optimal representation: & \mW^* \Phi(\vx) = \mM \vk{\mX,\vx} \ induced kernel: & k^*(\vx, \vx') = \left( \mM \vk_{\mX,\vx}\right)^\intercal \mM \vk_{\mX,\vx'} = \vk_{\mX,\vx}^\intercal \mM ^\intercal \mM \vk_{\mX,\vx'}, split equation where $\mM \in R^{K \times N}$ is a matrix that must be solved for and $\vk_{\mX,\vx} \in R^{N}$ is a vector with entries $[\vk_{\mX,\vx}]_i = k(\vx_i, \vx)$.
Proposition. Proposition 3.2. Given kernel function k(⋅,⋅)𝑘⋅⋅k(\cdot,\cdot) with corresponding map Φ(⋅)Φ⋅\Phi(\cdot) into the RKHS ℋℋ\mathcal{H}, let {𝐱1,𝐱2,…𝐱N}subscript𝐱1subscript𝐱2…subscript𝐱𝑁{{\bm{x}}{1},{\bm{x}}{2},\dots\bm{x}{N}} be an SSL dataset normalized such that k(𝐱i,𝐱i)=1𝑘subscript𝐱𝑖subscript𝐱𝑖1k({\bm{x}}{i},{\bm{x}}{i})=1 and formed by pairwise augmentations (i.e., every element has exactly one neighbor in 𝐀𝐀{\bm{A}}) with kernel matrix 𝐊s,ssubscript𝐊𝑠𝑠{\bm{K}}{s,s}. Given two points 𝐱𝐱{\bm{x}} and 𝐱′superscript𝐱′{\bm{x}}^{\prime}, if there exists two points in the SSL dataset indexed by i𝑖i and j𝑗j which are related by an augmentation (𝐀ijsubscript𝐀𝑖𝑗{\bm{A}}{ij}=1) and ‖Φ(𝐱)−Φ(𝐱i)‖ℋ≤Δ5∥|𝐊s,s−1∥N|\Phi({\bm{x}})-\Phi({\bm{x}}{i})|{\mathcal{H}}\leq\frac{\Delta}{5||{\bm{K}}{s,s}^{-1}|\sqrt{N}} and ‖Φ(𝐱′)−Φ(𝐱j)‖ℋ≤Δ5∥|𝐊s,s−1∥N|\Phi({\bm{x}}^{\prime})-\Phi({\bm{x}}{j})|{\mathcal{H}}\leq\frac{\Delta}{5||{\bm{K}}_{s,s}^{-1}|\sqrt{N}}, then the induced kernel for the contrastive loss is at least k∗(𝐱,𝐱′)≥1−Δsuperscript𝑘𝐱superscript𝐱′1Δk^{*}({\bm{x}},{\bm{x}}^{\prime})\geq 1-\Delta.
Proposition. [Ideal SSL outcome] Given a supervised dataset of $N$ points for binary classification drawn from a distribution with $m_{-1}$ and $m_{+1}$ connected manifolds for classes with labels $-1$ and $+1$ respectively, if the induced kernel matrix of the dataset $\mK^$ successfully separates the manifolds such that $k^(\vx,\vx') = 1$ if $\vx,\vx'$ are in the same manifold and $k^(\vx,\vx') = 0$ otherwise, then $s_N(\mK^)=m_{-1}+m_{+1}=O(1)$. %
Theorem. [Representer theorem for self-supervised tasks; adapted from scholkopf2001generalized] Let $\vx_i \in X$ for $i\in[N]$ be elements of a dataset of size $N$, $k:X \times X \to R$ be a kernel function with corresponding map $\Phi:X \to H$ into the RKHS $H$, and $r:[0,\infty)\to R$ be any strictly increasing function. Let $\mW:H \to R^K$ be a linear function mapping inputs to their corresponding representations. Given a regularized loss function of the form equation R\left(\mW \Phi(\vx_1), \dots, \mW \Phi(\vx_N)\right) + r\left(|\mW|{H}\right), equation where $R(\mW \Phi(\vx_1), \dots, \mW \Phi(\vx_N))$ is an error function that depends on the representations of the dataset, the minimizer $\mW^$ of this loss function will be in the span of the training points ${\Phi(\vx_i), ; i \in {1,\dots, N}}$, i.e. for any $\bm \phi \in H$: equation \mW^ \bm \phi=0 ; if ; \langle \bm \phi,\Phi(\vx_i)\rangle{H}=0 ;; \forall i \in [N]. equation
Lemma. Lemma A.2. Given kernel function k(⋅,⋅)𝑘⋅⋅k(\cdot,\cdot) with map Φ(⋅)Φ⋅\Phi(\cdot) into RKHS ℋℋ\mathcal{H} and an SSL dataset {𝐱i}i=1Nsuperscriptsubscriptsubscript𝐱𝑖𝑖1𝑁{{\bm{x}}{i}}{i=1}^{N} normalized such that k(𝐱i,𝐱i)=1𝑘subscript𝐱𝑖subscript𝐱𝑖1k({\bm{x}}{i},{\bm{x}}{i})=1, let 𝐊s,ssubscript𝐊𝑠𝑠{\bm{K}}{s,s} be the kernel matrix of the SSL dataset. If ‖Φ(𝐱a)−Φ(𝐱b)‖ℋ≤ϵsubscriptnormΦsubscript𝐱𝑎Φsubscript𝐱𝑏ℋitalic-ϵ|\Phi({\bm{x}}{a})-\Phi({\bm{x}}{b})|{\mathcal{H}}\leq\epsilon, then ‖𝐊s,s−1𝐤s,𝐱a−𝐊s,s−1𝐤s,𝐱a‖≤‖𝐊s,s−1‖Nϵnormsuperscriptsubscript𝐊𝑠𝑠1subscript𝐤𝑠subscript𝐱𝑎superscriptsubscript𝐊𝑠𝑠1subscript𝐤𝑠subscript𝐱𝑎normsuperscriptsubscript𝐊𝑠𝑠1𝑁italic-ϵ\left|{\bm{K}}{s,s}^{-1}{\bm{k}}{s,{\bm{x}}{a}}-{\bm{K}}{s,s}^{-1}{\bm{k}}{s,{\bm{x}}{a}}\right|\leq|{\bm{K}}_{s,s}^{-1}|\sqrt{N}\epsilon.
Theorem. Theorem B.1 (Adapted from Section 4.C of (Huang et al., 2021)). Let 𝐱1,…,𝐱Nsubscript𝐱1…subscript𝐱𝑁{\bm{x}}{1},\dots,{\bm{x}}{N} and y1,…,yNtsubscript𝑦1…subscript𝑦subscript𝑁𝑡y_{1},\dots,y_{N_{t}} (with 𝐲𝐲{\bm{y}} the corresponding vector storing the scalars yisubscript𝑦𝑖y_{i} as entries) be our training set of N𝑁N independent samples drawn i.i.d. from 𝒟𝒟\mathcal{D}. Let 𝐰=Tr(𝐊)N∑i=1Nt[𝐊−1𝐲]iϕ(𝐱i(t))𝐰Tr𝐊𝑁superscriptsubscript𝑖1subscript𝑁𝑡subscriptdelimited-[]superscript𝐊1𝐲𝑖italic-ϕsuperscriptsubscript𝐱𝑖𝑡{\bm{w}}=\sqrt{\frac{\operatorname{Tr}({\bm{K}})}{N}}\sum_{i=1}^{N_{t}}\left[{\bm{K}}^{-1}{\bm{y}}\right]{i}\phi\left({\bm{x}}{i}^{(t)}\right) denote the solution to the kernel regression problem normalized by the trace of the data kernel. For an L𝐿L-lipschitz loss function ℓ:𝒴×𝒴→[0,b]:ℓ→𝒴𝒴0𝑏\ell:\mathcal{Y}\times\mathcal{Y}\to[0,b], with probability 1−δ1𝛿1-\delta for any δ>0𝛿0\delta>0, we have 𝔼𝒙,y∼𝒟[ℓ(⟨𝒘,𝒙⟩,y)]−1Nt∑i=1Ntℓ(⟨𝒘,𝒙i⟩,yi)≤22L+3b2Tr(𝑲)𝒚⊺𝑲−1𝒚N+3blog(2/(δ(e−1)))2Nsubscript𝔼similar-to𝒙𝑦𝒟delimited-[]ℓ𝒘𝒙𝑦1subscript𝑁𝑡superscriptsubscript𝑖1subscript𝑁𝑡ℓ𝒘subscript𝒙𝑖subscript𝑦𝑖22𝐿3𝑏2Tr𝑲superscript𝒚⊺superscript𝑲1𝒚𝑁3𝑏2𝛿𝑒12𝑁\mathbb{E}{{\bm{x}},y\sim\mathcal{D}}\left[\ell(\langle{\bm{w}},{\bm{x}}\rangle,y)\right]-\frac{1}{N{t}}\sum_{i=1}^{N_{t}}\ell(\langle{\bm{w}},{\bm{x}}{i}\rangle,y{i})\leq\frac{2\sqrt{2}L+3b}{\sqrt{2}}\frac{\sqrt{\operatorname{\operatorname{Tr}}({\bm{K}}){\bm{y}}^{\intercal}{\bm{K}}^{-1}{\bm{y}}}}{N}+3b\sqrt{\frac{\log(2/(\delta(e-1)))}{2N}} (50)
Theorem. Theorem B.2 (Concentration of sum; see Theorem 3.1 in Mohri et al. (2018)). Let G𝐺G be a family of functions mapping from Z𝑍Z to [0,1]01[0,1]. Given N𝑁N independent samples z1,…,znsubscript𝑧1…subscript𝑧𝑛z_{1},\dots,z_{n} from Z𝑍Z, then for any δ>0𝛿0\delta>0, with probability at least 1−δ1𝛿1-\delta, the following holds: 𝔼z[g(z)]−1N∑i=1Ng(zi)≤2ℜ^S(G)+3log(2/δ)2N,subscript𝔼𝑧delimited-[]𝑔𝑧1𝑁superscriptsubscript𝑖1𝑁𝑔subscript𝑧𝑖2subscript^ℜ𝑆𝐺32𝛿2𝑁\mathbb{E}{z}[g(z)]-\frac{1}{N}\sum{i=1}^{N}g(z_{i})\leq 2\widehat{\mathfrak{R}}{S}(G)+3\sqrt{\frac{\log(2/\delta)}{2N}}, (51) where ℜ^S(G)subscript^ℜ𝑆𝐺\widehat{\mathfrak{R}}{S}(G) denotes the empirical Rademacher complexity of G𝐺G equal to ℜ^S(G)=𝔼σ[supg∈G1N∑i=1Nσig(zi)],subscript^ℜ𝑆𝐺subscript𝔼𝜎delimited-[]subscriptsupremum𝑔𝐺1𝑁superscriptsubscript𝑖1𝑁subscript𝜎𝑖𝑔subscript𝑧𝑖\widehat{\mathfrak{R}}{S}(G)=\mathbb{E}{\sigma}\left[\sup_{g\in G}\frac{1}{N}\sum_{i=1}^{N}\sigma_{i}g(z_{i})\right], (52) where σisubscript𝜎𝑖\sigma_{i} are independent uniform random variables over {−1,+1}11{-1,+1}.
Lemma. [Talagrand's lemma; see Lemma 4.2 in [mohri2018foundations]] Let $\Phi:R \to R$ be $L$-Lipschitz. Then for any hypothesis set $G$ of real-valued functions, equation \mathfrak{R}_S( \Phi \circ G) \leq L \mathfrak{R}_S( G). equation
$$
$$

Figure 6: Depiction of s N ( K ) (Equation (14)) for the case of MNIST classification as depicted in the left figure in Figure 4 computed on the training set. This setting considers MNIST classification with the contrastive kernel on a 10000 -sample dataset with 100 original MNIST samples and 100 augmentations. s N ( K ) for the supervised baselines for the full dataset (including labeled augmented samples) and only the original dataset (no augmented samples) are calculated as approximately 1534 and 203 respectively. Many values of s N ( K ) are smaller than the baseline numbers especially at points where the test set accuracy for the SSL induced kernel was comparitively larger. Here, we recover the trend of the test set performances obtained from Figure 4 showing that s N ( K ) is potentially a good indicator of test accuracy.
| Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | |||
|---|---|---|---|---|---|---|---|---|---|
| Augmentation | Samples | Num. Augmentations Algorithm | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| Gaussian blur | 16 | self-supervised | 32.3 | 32.3 | 32.6 | 29.1 | 32.8 | 32.3 | 32.5 |
| baseline (no augment) | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | ||
| baseline (augmented) | 34.8 | 34.9 | 35 | 35.0 | 35.0 | 35.0 | 35.0 | ||
| 64 | self-supervised | 71.8 | 71.8 | 71.6 | 71.4 | 72.1 | |||
| baseline (no augment) | 65 | 65 | 65 | 65.0 | 65.0 | ||||
| baseline (augmented) | 74.6 | 74.6 | 74.7 | 74.7 | 74.7 | ||||
| 256 | self-supervised | 87.5 | 88 | 88 | |||||
| baseline (no augment) | 86.5 | 86.5 | 86.5 | ||||||
| baseline (augmented) | 88.6 | 88.6 | 88.6 | ||||||
| translate, zoom rotate | 16 | self-supervised | 34.2 | 35.2 | 36 | 36.8 | 37.3 | 37.5 | 38.0 |
| baseline (no augment) | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | ||
| baseline (augmented) | 37.5 | 38.5 | 39.3 | 40.2 | 41.1 | 41.5 | 41.7 | ||
| 64 | self-supervised | 73.4 | 74.8 | 75.3 | 75.9 | 76.8 | |||
| baseline (no augment) | 65 | 65 | 65 | 65.0 | 65.0 | ||||
| baseline (augmented) | 77.7 | 78.8 | 79.4 | 79.7 | 80.0 | ||||
| 256 | self-supervised | 90.4 | 91 | 91.5 | |||||
| baseline (no augment) | 86.5 | 86.5 | 86.5 | ||||||
| baseline (augmented) | 90.4 | 90.8 | 91.1 |
| Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | |||
|---|---|---|---|---|---|---|---|---|---|
| Augmentation | Samples | Num. Augmentations Algorithm | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| Gaussian blur | 16 | self-supervised | 28.4 | 28.4 | 28.4 | 28.4 | 28.3 | 28.3 | 28.4 |
| baseline (no augment) | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | ||
| baseline (augmented) | 26.6 | 26.6 | 26.6 | 26.6 | 26.6 | 26.6 | 26.6 | ||
| 64 | self-supervised | 60 | 60 | 60.1 | 60.1 | 60.1 | |||
| baseline (no augment) | 55.6 | 55.6 | 55.6 | 55.6 | 55.6 | ||||
| baseline (augmented) | 56.5 | 56.5 | 56.5 | 56.5 | 56.5 | ||||
| 256 | self-supervised | 87.6 | 87.9 | 87.6 | |||||
| baseline (no augment) | 81.5 | 81.5 | 81.5 | ||||||
| baseline (augmented) | 82.9 | 82.9 | 82.9 | ||||||
| translate, zoom rotate | 16 | self-supervised | 33.4 | 33.8 | 34.4 | 34.6 | 35.1 | 35.4 | 35.3 |
| baseline (no augment) | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | ||
| baseline (augmented) | 30.6 | 30.7 | 30.8 | 31.6 | 31.7 | 31.9 | 32.0 | ||
| 64 | self-supervised | 70.7 | 72.7 | 73.2 | 73.8 | 74.3 | |||
| baseline (no augment) | 55.6 | 55.6 | 55.6 | 55.6 | 55.6 | ||||
| baseline (augmented) | 66 | 67.8 | 68.8 | 69.5 | 70.2 | ||||
| 256 | self-supervised | 91.9 | 92.3 | 92.6 | |||||
| baseline (no augment) | 81.5 | 81.5 | 81.5 | ||||||
| baseline (augmented) | 89.5 | 90 | 90.3 |
$$ k^*(\vx,\vx') = \vk_{\vx,s} \mB \vk_{s,\vx'}, $$
$$ \begin{split} \min_{\mB \in \mathbb{R}^{N \times N} } & \Tr(\mB \mK_{s,s} ) \ \text{s.t. } & \mK_{s_j,s} \mB \mK_{s,s_j} = \left(\mI + \mA^{(j)} \right)+ ; ; ; ; \forall j \in {1,2,\dots,n{\text{batches}}} \ & \mB \succeq 0, ; \operatorname{rank}(\mB)=K, \end{split} $$
$$ \begin{split} \mathcal{L}(\mW_\parallel + \mW_\bot) &= \mathcal{R}\left((\mW_\parallel + \mW_\bot)\Phi(\vx_1), \dots, (\mW_\parallel + \mW_\bot) \Phi(\vx_N)\right) + r\left(|\mW_\parallel + \mW_\bot|{\mathcal{H}}\right) \ &= \mathcal{R}\left(\mW\parallel \Phi(\vx_1), \dots, \mW_\parallel \Phi(\vx_N)\right) + r\left(|\mW_\parallel + \mW_\bot|_{\mathcal{H}}\right). \end{split} $$
$$ \begin{split} C(\mW) &= \left| \mW \mX^\intercal \mM \mX \mW^\intercal - \mI \right|_F^2 + \beta \operatorname{Tr}\left[ \mW\mX^\intercal \mL \mX\mW^\intercal \right] \ &= \left| \mW \mX^\intercal \mM \mX \mW^\intercal - \mI \right|_F^2 + \operatorname{Tr}\left[ \mW\mX^\intercal \left(\beta \mL + 2\mM - 2 \mM \right) \mX\mW^\intercal \right] \ &= K + | \mW \mX^\intercal \mM \mX \mW^\intercal |_F^2 - 2 \operatorname{Tr}\left[ \mW\mX^\intercal \mM \mX\mW^\intercal \right] \ & ;;;; + \operatorname{Tr}\left[ \mW\mX^\intercal \left(\beta \mL + 2\mM - 2 \mM \right) \mX\mW^\intercal \right] \ &= K + | \mW \mX^\intercal \mM \mX \mW^\intercal |_F^2 -\operatorname{Tr}\left[ \mW\mX^\intercal \left( 2 \mM - \beta \mL \right) \mX\mW^\intercal \right] \ &= K + | \mX \mW^\intercal \mW \mX^\intercal \mM |_F^2 -\operatorname{Tr}\left[ \mX\mW^\intercal \mW\mX^\intercal \mM \left( 2 \mM - \beta \mL \right) \right]. \end{split} \label{eq:loss_simplification_steps} $$ \tag{eq:loss_simplification_steps}
$$ \begin{split} \left( \mI - \mX^\intercal \mY^* \mM \mX \right) \mB^* &= \mX^+ \mL_K \left( \mX^+ \right)^\intercal - \mX^\intercal \left(\mX^+\right)^\intercal \mX^+ \mM \mX \mX^+ \mL_K \left( \mX^+ \right)^\intercal \ &=\mX^+ \mL_K \left( \mX^+ \right)^\intercal - \mX^+ \mM \mL_K \left( \mX^+ \right)^\intercal \ &= \mX^+ \mL_K \left( \mX^+ \right)^\intercal - \mX^+ \mL_K \left( \mX^+ \right)^\intercal \ &= 0. \end{split} $$
$$ \label{eq:pairwise_matrix_block_form} \mA = \begin{bmatrix} 0 & 1 & & & & & \ 1 & 0 & & & & & \ & & 0 & 1 & & & \ & & 1 & 0 & & & \ & & & & \ddots & & \ & & & & & 0 & 1 \ & & & & & 1 & 0 \end{bmatrix}. $$ \tag{eq:pairwise_matrix_block_form}
$$ \begin{split} \left| \mK_{s,s}^{-1} \vk_{s,\vx_a} - \mK_{s,s}^{-1}\vk_{s,\vx_a} \right| & \leq \left| \mK_{s,s}^{-1} \right| \left| \vk_{s,\vx_a} - \vk_{s,\vx_a} \right| \ &= \left| \mK_{s,s}^{-1} \right| \left[ \sum_{i=1}^N ( \langle \Phi(\vx_i), \Phi(\vx_a) \rangle_{\mathcal{H}} - \langle \Phi(\vx_i), \Phi(\vx_b) \rangle_{\mathcal{H}} )^2 \right]^{1/2} \ &= \left| \mK_{s,s}^{-1} \right| \left[ \sum_{i=1}^N ( \langle \Phi(\vx_i), \Phi(\vx_a) - \Phi(\vx_b) \rangle_{\mathcal{H}} )^2 \right]^{1/2} \ &\leq \left| \mK_{s,s}^{-1} \right| \left[ \sum_{i=1}^N |\Phi(\vx_i) |{\mathcal{H}}^2 |\Phi(\vx_a) - \Phi(\vx_b) |{\mathcal{H}}^2 \right]^{1/2} \ & = \left| \mK_{s,s}^{-1} \right| \sqrt{N} \epsilon \end{split} $$
$$ \begin{split} k^*(\vx, \vx') &= \vk_{\vx,s} \mK_{s,s}^{-1} \left(\mI + \mA \right)+ \mK{s,s}^{-1} \vk_{s,\vx'} \ &= (\mK_{s,s}^{-1} \vk_{s,x} - \ve_i + \ve_i)^\intercal \left(\mI + \mA \right)+ (\mK{s,s}^{-1} \vk_{s,x'} - \ve_j + \ve_j) \ &= \ve_i^\intercal \left(\mI + \mA \right)+ \ve_j + (\mK{s,s}^{-1} \vk_{s,x} - \ve_i )^\intercal \left(\mI + \mA \right)+ \ve_j + \ve_i^\intercal \left(\mI + \mA \right)+ (\mK_{s,s}^{-1} \vk_{s,x'} - \ve_j) \ & ;;;;; +(\mK_{s,s}^{-1} \vk_{s,x} - \ve_i)^\intercal \left(\mI + \mA \right)+ (\mK{s,s}^{-1} \vk_{s,x'} - \ve_j ). \end{split} $$
$$ \begin{split} k^*(\vx, \vx') &= 1 + (\mK_{s,s}^{-1} \vk_{s,x} - \ve_i )^\intercal \left(\mI + \mA \right)+ \ve_j + \ve_i^\intercal \left(\mI + \mA \right)+ (\mK_{s,s}^{-1} \vk_{s,x'} - \ve_j) \ & ;;;;; +(\mK_{s,s}^{-1} \vk_{s,x} - \ve_i)^\intercal \left(\mI + \mA \right)+ (\mK{s,s}^{-1} \vk_{s,x'} - \ve_j ) \ & \geq 1 - |\mK_{s,s}^{-1} \vk_{s,x} - \ve_i | \left| \left(\mI + \mA \right)+ \ve_j \right|+ |\mK{s,s}^{-1} \vk_{s,x'} - \ve_j | \left| \left(\mI + \mA \right)+ \ve_i \right| \ & ;;;;; +\left|\mK{s,s}^{-1} \vk_{s,x} - \ve_i \right| \left|\left(\mI + \mA \right)+ \right| |\mK{s,s}^{-1} \vk_{s,x'} - \ve_j |.\ \end{split} $$
$$ \begin{split} k^*(\vx, \vx') &\geq 1 - |\mK_{s,s}^{-1} \vk_{s,x} - \ve_i | \left| \left(\mI + \mA \right)+ \ve_j \right|+ |\mK{s,s}^{-1} \vk_{s,x'} - \ve_j | \left| \left(\mI + \mA \right)+ \ve_i \right| \ & ;;;;; +\left|\mK{s,s}^{-1} \vk_{s,x} - \ve_i \right| \left|\left(\mI + \mA \right)+ \right| |\mK{s,s}^{-1} \vk_{s,x'} - \ve_j | \ &\geq 1-2\sqrt{2}|\mK_{s,s}^{-1}|\sqrt{N} \epsilon - 2 |\mK_{s,s}^{-1}|^2 N \epsilon^2 \ &\geq 1 - (2\sqrt{2} + 2) |\mK_{s,s}^{-1}|\sqrt{N} \epsilon \ &\geq 1-5 |\mK_{s,s}^{-1}|\sqrt{N} \epsilon \ &= 1-\Delta. \end{split} $$
$$ \begin{bmatrix} \frac{1}{#(1)} \bm 1_{#(1)} \bm 1_{#(1)}^\intercal & & & \ & \frac{1}{#(2)} \bm 1_{#(2)} \bm 1_{#(2)}^\intercal & & \ & & \ddots & \ & & & \frac{1}{#(m_{+1} + m_{-1})} \bm 1_{#(m_{+1} + m_{-1})} \bm 1_{#(m_{+1} + m_{-1})}^\intercal \ \end{bmatrix}. $$
$$ \vw = \sum_{i = 1}^{N_t} \left[ \mK_{t,t}^{*-1} \vy \right]i \phi\left(\vx{i}^{(t)} \right), $$
$$ \begin{split} \mathbb{E}{\vx}[\epsilon{\vw}(\vx)] - \frac{1}{N} \sum_{i=1}^N \epsilon_{\vw}(\vx_i) &\leq 2L\mathbb{E}\sigma \left[ \ceil{|\vw|} \left| \frac{1}{N} \sum{i=1}^N \sigma_i \vx_i \right| \right] + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}}{2N} } \ & = \frac{2L}{N} \ceil{|\vw|} \mathbb{E}\sigma \left[ \sqrt{ \sum{i=1}^N \sum_{j=1}^N \sigma_i \sigma_j k(\vx_i, \vx_j) } \right] + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}}{2N} } \ & \leq \frac{2L}{N} \ceil{|\vw|} \sqrt{ \mathbb{E}\sigma \left[ \sum{i=1}^N \sum_{j=1}^N \sigma_i \sigma_j k(\vx_i, \vx_j) \right]} + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}}{2N} } \ & = \frac{2L}{N} \sqrt{\sum_{i=1}^N k(\vx_i, \vx_i)} + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}}{2N} } \ &= \frac{2L}{N} \ceil{|\vw|} \sqrt{\operatorname{Tr}(\mK)} + 3b \sqrt{ \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}}{2N} }. \end{split} $$
$$ \mathcal{L}{\rm vic}\hspace{-0.05cm}=&\alpha\hspace{-0.05cm} \sum{k=1}^{K}\max\hspace{-0.05cm}\left(0,1\hspace{-0.05cm}-\hspace{-0.05cm}\sqrt{\Cov(\mZ){k,k}}\right)\hspace{-0.08cm}+\hspace{-0.05cm} \beta\hspace{-0.1cm} \sum{j=1,j\not = k}^{K}\hspace{-0.15cm}\Cov(\mZ)^2_{k,j}\hspace{-0.05cm}+\hspace{-0.05cm}\frac{\gamma}{N} \sum_{i=1}^{N}\sum_{j=1}^{N}(\mA){i,j}|\mZ{i,.}-\mZ_{j,.}|_2^2. \label{eq:VICReg} $$ \tag{eq:VICReg}
Theorem. [Adapted from Section 4.C of Huang2021] Let $\vx_1, \dots, \vx_N$ and $y_1, \dots, y_{N_t}$ (with $\vy$ the corresponding vector storing the scalars $y_i$ as entries) be our training set of $N$ independent samples drawn i.i.d. from $D$. Let $\vw = \frac{\operatorname{Tr(\mK)}{N}} \sum_{i = 1}^{N_t} \left[ \mK^{-1} \vy \right]i \phi\left(\vx{i}^{(t)} \right)$ denote the solution to the kernel regression problem normalized by the trace of the data kernel. For an $L$-lipschitz loss function $\ell: Y \times Y \to [0,b]$, with probability $1-\delta$ for any $\delta > 0 $, we have equation E_{\vx,y \sim D} \left[ \ell(\langle \vw, \vx \rangle, y)\right] - 1{N_t} \sum_{i=1}^{N_t} \ell(\langle \vw, \vx_i \rangle, y_i) \leq 2\sqrt{2L + 3b}{2} \sqrt{\operatorname{\Tr(\mK)\vy^\intercal \mK^{-1} \vy}}{N} + 3b \frac{\log(2/(\delta (e-1))){2N}} equation
Theorem. [Concentration of sum; see Theorem 3.1 in [mohri2018foundations]] Let $G$ be a family of functions mapping from $Z$ to $[0,1]$. Given $N$ independent samples $z_1, \dots, z_n$ from $Z$, then for any $\delta > 0$, with probability at least $1-\delta$, the following holds: equation E_z[g(z)] - 1{N} \sum_{i=1}^N g(z_i) \leq 2 \mathfrak{R}S (G) + 3 \frac{\log(2/\delta){2N} }, equation where $\mathfrak{R}S (G)$ denotes the empirical Rademacher complexity of $G$ equal to equation \mathfrak{R}S (G) = E\sigma \left[ \sup{g \in G} 1{N} \sum{i=1}^N \sigma_i g(z_i) \right], equation where $\sigma_i$ are independent uniform random variables over ${-1,+1}$.
Lemma. Given kernel function $k(\cdot,\cdot)$ with map $\Phi(\cdot)$ into RKHS $H$ and an SSL dataset ${\vx_i}{i=1}^N$ normalized such that $k(\vx_i,\vx_i)=1$, let $\mK{s,s}$ be the kernel matrix of the SSL dataset. If $| \Phi(\vx_a) - \Phi(\vx_b)|{H} \leq \epsilon $, then $\left| \mK{s,s}^{-1} \vk_{s,\vx_a} - \mK_{s,s}^{-1}\vk_{s,\vx_a} \right| \leq |\mK_{s,s}^{-1}|N \epsilon $.
Proposition. [Ideal SSL outcome] % % For binary classification with a supervised dataset of $N$ points drawn from a distribution with $m$ connected manifolds for each class, assume the data is distributed such that the distance between two random points within a class and between two classes is distributed uniformly between $[0,a]$ and $[b,1]$ respectively for $a<b<1$. Let $\mK$ and $\mK^$ be the kernel matrices for the RBF kernel and the induced kernel respectively. If the induced kernel successfully separates the manifolds such that $k^(\vx,\vx') = 1$ if $\vx,\vx'$ are in the same manifold and $k^(\vx,\vx') = 0$ otherwise, then $s_N(\mK^)<s_N(\mK)$. % %
Proof. Decompose $\mW = \mW_{\parallel} + \mW_{\bot}$, where $\mW_{\bot}\Phi(\vx_i)=0$ for all $i\in[N]$. For a loss function $L$ of the form listed above, we have equation split L(\mW_\parallel + \mW_\bot) &= R\left((\mW_\parallel + \mW_\bot)\Phi(\vx_1), \dots, (\mW_\parallel + \mW_\bot) \Phi(\vx_N)\right) + r\left(|\mW_\parallel + \mW_\bot|{H}\right) \ &= R\left(\mW\parallel \Phi(\vx_1), \dots, \mW_\parallel \Phi(\vx_N)\right) + r\left(|\mW_\parallel + \mW_\bot|{H}\right). split equation Where in the terms in the sum, we used the property that $\mW\bot$ is not in the span of the data. For the regularizer term, we note that equation split r(|\mW_\parallel + \mW_\bot|{H}) &= r\left( |\mW\parallel|{H^2 + |\mW\bot|{H}^2 }\right) \ &\geq r(|\mW\parallel|_{H}) . split equation Therefore, strictly enforcing $\mW^\bot=0$ minimizes the regularizer while leaving the rest of the cost function unchanged.
Proof. We have that equation split \left| \mK_{s,s}^{-1} \vk_{s,\vx_a} - \mK_{s,s}^{-1}\vk_{s,\vx_a} \right| & \leq \left| \mK_{s,s}^{-1} \right| \left| \vk_{s,\vx_a} - \vk_{s,\vx_a} \right| \ &= \left| \mK_{s,s}^{-1} \right| \left[ \sum_{i=1}^N ( \langle \Phi(\vx_i), \Phi(\vx_a) \rangle_{H} - \langle \Phi(\vx_i), \Phi(\vx_b) \rangle_{H} )^2 \right]^{1/2} \ &= \left| \mK_{s,s}^{-1} \right| \left[ \sum_{i=1}^N ( \langle \Phi(\vx_i), \Phi(\vx_a) - \Phi(\vx_b) \rangle_{H} )^2 \right]^{1/2} \ &\leq \left| \mK_{s,s}^{-1} \right| \left[ \sum_{i=1}^N |\Phi(\vx_i) |{H}^2 |\Phi(\vx_a) - \Phi(\vx_b) |{H}^2 \right]^{1/2} \ & = \left| \mK_{s,s}^{-1} \right| N \epsilon split equation
Proof. Note that $\mK_{s,s}^{-1}\vx_i = \ve_i$ where $\ve_i$ is the vector with a $1$ placed on entry $i$ and zeros elsewhere. From equation eq:induced_kernel_contrastive, we have that equation split k^(\vx, \vx') &= \vk_{\vx,s} \mK_{s,s}^{-1} \left(\mI + \mA \right)+ \mK{s,s}^{-1} \vk_{s,\vx'} \ &= (\mK_{s,s}^{-1} \vk_{s,x} - \ve_i + \ve_i)^\intercal \left(\mI + \mA \right)+ (\mK{s,s}^{-1} \vk_{s,x'} - \ve_j + \ve_j) \ &= \ve_i^\intercal \left(\mI + \mA \right)+ \ve_j + (\mK{s,s}^{-1} \vk_{s,x} - \ve_i )^\intercal \left(\mI + \mA \right)+ \ve_j + \ve_i^\intercal \left(\mI + \mA \right)+ (\mK_{s,s}^{-1} \vk_{s,x'} - \ve_j) \ & ;;;;; +(\mK_{s,s}^{-1} \vk_{s,x} - \ve_i)^\intercal \left(\mI + \mA \right)+ (\mK{s,s}^{-1} \vk_{s,x'} - \ve_j ). split equation Note that $\ve_i^\intercal \left(\mI + \mA \right)+ \ve_j = 1$ since $\mA{ij}=1$. Therefore, equation split k^(\vx, \vx') &= 1 + (\mK_{s,s}^{-1} \vk_{s,x} - \ve_i )^\intercal \left(\mI + \mA \right)+ \ve_j + \ve_i^\intercal \left(\mI + \mA \right)+ (\mK_{s,s}^{-1} \vk_{s,x'} - \ve_j) \ & ;;;;; +(\mK_{s,s}^{-1} \vk_{s,x} - \ve_i)^\intercal \left(\mI + \mA \right)+ (\mK{s,s}^{-1} \vk_{s,x'} - \ve_j ) \ & \geq 1 - |\mK_{s,s}^{-1} \vk_{s,x} - \ve_i | \left| \left(\mI + \mA \right)+ \ve_j \right|+ |\mK{s,s}^{-1} \vk_{s,x'} - \ve_j | \left| \left(\mI + \mA \right)+ \ve_i \right| \ & ;;;;; +\left|\mK{s,s}^{-1} \vk_{s,x} - \ve_i \right| \left|\left(\mI + \mA \right)+ \right| |\mK{s,s}^{-1} \vk_{s,x'} - \ve_j |.\ split equation Let $ | \Phi(\vx) - \Phi(\vx_i)| \leq \epsilon= \Delta{5 ||\mK_{s,s}^{-1}| N }$ and $| \Phi(\vx') - \Phi(\vx_j)| \leq \epsilon= \Delta{5 ||\mK_{s,s}^{-1}| N }$ and by applying lem:helper_close_kernel_vec, we have equation split k^*(\vx, \vx') &\geq 1 - |\mK_{s,s}^{-1} \vk_{s,x} - \ve_i | \left| \left(\mI + \mA \right)+ \ve_j \right|+ |\mK{s,s}^{-1} \vk_{s,x'} - \ve_j | \left| \left(\mI + \mA \right)+ \ve_i \right| \ & ;;;;; +\left|\mK{s,s}^{-1} \vk_{s,x} - \ve_i \right| \left|\left(\mI + \mA \right)+ \right| |\mK{s,s}^{-1} \vk_{s,x'} - \ve_j | \ &\geq 1-22|\mK_{s,s}^{-1}|N \epsilon - 2 |\mK_{s,s}^{-1}|^2 N \epsilon^2 \ &\geq 1 - (22 + 2) |\mK_{s,s}^{-1}|N \epsilon \ &\geq 1-5 |\mK_{s,s}^{-1}|N \epsilon \ &= 1-\Delta. split equation In the above, we used the fact that $\mA$ is block diagonal with pairwise constraints (see eq:pairwise_matrix_block_form) so $\left| \left(\mI + \mA \right)+ \ve_i \right| = 2$ and $\left|\left(\mI + \mA \right)+ \right|=2$.
Proof. There are $m_{+1}$ + $m_{-1}$ total manifolds in the dataset. Assume some ordering of these manifolds from ${1,2,\dots,m_{+1}$ + $m_{-1}} $ and let $#(i)$ be the number of points in the $i$-th manifold. The rows and columns of the kernel matrix $\mK^$ can be permuted such that it becomes block diagonal with $m_{+1} + m_{-1}$ blocks with block $i$ equal to $1{#(i)} \bm 1_{#(i)} \bm 1_{#(i)}^\intercal$ where $\bm 1_k$ is the all ones vector of length $k$. I.e., $\mK^$ permuted accordingly takes the form below: equation bmatrix 1{#(1)} \bm 1_{#(1)} \bm 1_{#(1)}^\intercal & & & \ & 1{#(2)} \bm 1_{#(2)} \bm 1_{#(2)}^\intercal & & \ & & \ddots & \ & & & 1{#(m_{+1} + m_{-1})} \bm 1_{#(m_{+1} + m_{-1})} \bm 1_{#(m_{+1} + m_{-1})}^\intercal \ bmatrix. equation Each block of the above is clearly a rank one matrix with eigenvalue $1$. Let $\vy_{m_i} \in R^{#(i)}$ be the vector containing all labels for entries in manifold $i$. Then we have equation split \vy^\intercal (\mK^*)^{-1} \vy &= \sum_{i=1}^{m_{+1} + m_{-1}} #(i)^{-1} \left(\langle \bm 1_{#(i)}, \vy_{m_i} \rangle \right)^2 \ &= \sum_{i=1}^{m_{+1} + m_{-1}} #(i)^{-1} (#(i))^2 \ &= m_{+1} + m_{-1}. split equation
Proof. % Denote the elements of the supervised dataset as ${\vx_1, \vx_2, \dots \vx_N}$. As a reminder, we consider the RBF kernel of the form % equation % k(\vx,\vx') = \exp\left( -|\vx - \vx'|^2{2}\right). % equation % First, we prove a lower bound on $s_N(\mK)$. Let $e_i$ and $\vv_i$ be the eigenvalues and eigenvectors of $\mK$ such that $\mK = \sum_i e_i \vv_i \vv_i^\intercal$. We have that % equation % split % \vy^\intercal \mK^{-1} \vy &= \sum_i e_i^{-1} \left(\vv_i^\intercal \vy\right)^2 \ % & \geq \left( \sum_i e_i \left(\vv_i^\intercal \vy\right)^2 \right)^{-1} \ % &= \left( \vy^\intercal \mK \vy \right)^{-1}, % split % equation % where we used Jensen's inequality in the above. Assuming class labels are given by ${-1,1}$, we have % equation % split % \vy^\intercal \mK \vy &= \sum_{i=1}^N \sum_{j=1}^N \vy_i \vy_j k(\vx_i, \vx_j). % split % equation % Therefore, combining the above and taking an expectation, we have that % equation % split % E[\vy^\intercal \mK^{-1} \vy] &\geq E\left[\left( \vy^\intercal \mK \vy \right)^{-1}\right] \ % &\geq E\left[ \vy^\intercal \mK \vy\right]^{-1} \ % &= \left[ \sum_{i=1}^N \sum_{j=1}^N \vy_i \vy_j E[k(\vx_i, \vx_j)] \right]^{-1}, % split % equation % where above we applied Jensen's inequality again and the linearity of expectation. %
Proof. We consider a function class $G_\gamma$ defined as the set of linear functions on the reproducing kernel Hilbert space such that $G_\gamma = {\langle \vw, \cdot \rangle: |\vw|^2 \leq \gamma } $. Given a dataset of inputs $\vx_1,\dots,\vx_N$ in the reproducing kernel Hilbert space with corresponding targets $y_1, \dots, y_N$, let $\epsilon_{\vw}(\vx_i) = \ell(\langle \vw, \vx \rangle, y_i) \in [0,b]$. The inequality in thm:concentration_theorem applies for any given $G_{\gamma}$ but we would like this to hold for all $\gamma \in {1,2,3,\dots}$ since $|\vw|$ can be unbounded. Since our loss function $\ell$ is bounded between $[0,b]$, we multiply it by $1/b$ so that it is ranged in $[0,1]$ as needed for thm:concentration_theorem. Then, we apply thm:concentration_theorem to $\epsilon_{\vw}(\vx_i)$ over the class $G_\gamma$ for each $\gamma \in {1,2,3,\dots}$ with probability $\delta_\gamma = \delta (e-1) e^{-\gamma}$. This implies that equation E_{\vx}[\epsilon_{\vw}(\vx)] - 1{N} \sum_{i=1}^N \epsilon_{\vw}(\vx_i) \leq 2 E_\sigma \left[ \sup_{|\vv|^2 \leq \gamma } 1{N} \sum_{i=1}^N \sigma_i \epsilon_{\vv}(\vx_i) \right] + 3b \frac{\log(2/(\delta (e-1)))+\gamma{2N} }. equation This shows that for any $\gamma$, the above inequality holds with probability $1-\delta (e-1) e^{-\gamma}$. However, we need to show this holds for all $\gamma$ simultaneously. To achieve this, we apply a union bound which holds with probability $1-\sum_{\gamma=1}^\infty \delta_\gamma = 1-\sum_{\gamma=1}^\infty \delta (e-1) e^{-\gamma} = 1-\delta$. To proceed, we consider the inequality where $\gamma = |\vw|^2$ copied below. equation E_{\vx}[\epsilon_{\vw}(\vx)] - 1{N} \sum_{i=1}^N \epsilon_{\vw}(\vx_i) \leq 2 E_\sigma \left[ \sup_{|\vv|^2 \leq |\vw|^2 } 1{N} \sum_{i=1}^N \sigma_i \epsilon_{\vv}(\vx_i) \right] + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} }. equation Applying Talagrand's lemma (lem:Talagrand) followed by the Cauchy-Schwarz inequality, equation split E_{\vx}[\epsilon_{\vw}(\vx)] - 1{N} \sum_{i=1}^N \epsilon_{\vw}(\vx_i) &\leq 2E_\sigma \left[ \sup_{|\vv|^2 \leq |\vw|^2 } 1{N} \sum_{i=1}^N \sigma_i \epsilon_{\vv}(\vx_i) \right] + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} } \ &\leq 2LE_\sigma \left[ \sup_{|\vv|^2 \leq |\vw|^2 } 1{N} \sum_{i=1}^N \sigma_i \langle \vv, \vx_i \rangle \right] + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} } \ &\leq 2LE_\sigma \left[ \sup_{|\vv|^2 \leq |\vw|^2 } |\vv| \left| 1{N} \sum_{i=1}^N \sigma_i \vx_i \right| \right] + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} } \ &\leq 2LE_\sigma \left[ |\vw| \left| 1{N} \sum_{i=1}^N \sigma_i \vx_i \right| \right] + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} }. split equation Expanding out the quantity $\left| 1{N} \sum_{i=1}^N \sigma_i \vx_i \right|$ and noting that the random variables are independent, we have equation split E_{\vx}[\epsilon_{\vw}(\vx)] - 1{N} \sum_{i=1}^N \epsilon_{\vw}(\vx_i) &\leq 2LE_\sigma \left[ |\vw| \left| 1{N} \sum_{i=1}^N \sigma_i \vx_i \right| \right] + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} } \ & = 2L{N} |\vw| E_\sigma \left[ \sum_{i=1^N \sum_{j=1}^N \sigma_i \sigma_j k(\vx_i, \vx_j) } \right] + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} } \ & \leq 2L{N} |\vw| \mathbb{E_\sigma \left[ \sum_{i=1}^N \sum_{j=1}^N \sigma_i \sigma_j k(\vx_i, \vx_j) \right]} + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} } \ & = 2L{N} \sum_{i=1^N k(\vx_i, \vx_i)} + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} } \ &= 2L{N} |\vw| \operatorname{Tr(\mK)} + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} }. split equation Since we normalize such that $Tr(\mK) = N$, we have equation split E_{\vx}[\epsilon_{\vw}(\vx)] - 1{N} \sum_{i=1}^N \epsilon_{\vw}(\vx_i) &\leq 2L{N} |\vw| + 3b \frac{\log(2/(\delta (e-1)))+\ceil{|\vw|^2}{2N} } \ &\leq 2L{N} |\vw| + 3b \frac{\log(2/(\delta (e-1))){2N}} + 3b\ceil{|\vw|}{2N} \ &= 2\sqrt{2L + 3b}{2N} |\vw| + 3b \frac{\log(2/(\delta (e-1))){2N}}. split equation Plugging in $|\vw|^2 = \vy^\intercal \mK^{-1} \vy$ and noting that we normalized the kernel to have $\Tr(\mK) = N$ thus completes the proof.
| Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | |||
|---|---|---|---|---|---|---|---|---|---|
| Augmentation | Samples | Num. Augmentations Algorithm | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| Gaussian blur | 16 | self-supervised | 32.3 | 32.3 | 32.6 | 29.1 | 32.8 | 32.3 | 32.5 |
| baseline (no augment) | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | ||
| baseline (augmented) | 34.8 | 34.9 | 35 | 35.0 | 35.0 | 35.0 | 35.0 | ||
| 64 | self-supervised | 71.8 | 71.8 | 71.6 | 71.4 | 72.1 | |||
| baseline (no augment) | 65 | 65 | 65 | 65.0 | 65.0 | ||||
| baseline (augmented) | 74.6 | 74.6 | 74.7 | 74.7 | 74.7 | ||||
| 256 | self-supervised | 87.5 | 88 | 88 | |||||
| baseline (no augment) | 86.5 | 86.5 | 86.5 | ||||||
| baseline (augmented) | 88.6 | 88.6 | 88.6 | ||||||
| translate, zoom rotate | 16 | self-supervised | 34.2 | 35.2 | 36 | 36.8 | 37.3 | 37.5 | 38.0 |
| baseline (no augment) | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | 21.8 | ||
| baseline (augmented) | 37.5 | 38.5 | 39.3 | 40.2 | 41.1 | 41.5 | 41.7 | ||
| 64 | self-supervised | 73.4 | 74.8 | 75.3 | 75.9 | 76.8 | |||
| baseline (no augment) | 65 | 65 | 65 | 65.0 | 65.0 | ||||
| baseline (augmented) | 77.7 | 78.8 | 79.4 | 79.7 | 80.0 | ||||
| 256 | self-supervised | 90.4 | 91 | 91.5 | |||||
| baseline (no augment) | 86.5 | 86.5 | 86.5 | ||||||
| baseline (augmented) | 90.4 | 90.8 | 91.1 |
| Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | Test Set Accuracy | |||
|---|---|---|---|---|---|---|---|---|---|
| Augmentation | Samples | Num. Augmentations Algorithm | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
| Gaussian blur | 16 | self-supervised | 28.4 | 28.4 | 28.4 | 28.4 | 28.3 | 28.3 | 28.4 |
| baseline (no augment) | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | ||
| baseline (augmented) | 26.6 | 26.6 | 26.6 | 26.6 | 26.6 | 26.6 | 26.6 | ||
| 64 | self-supervised | 60 | 60 | 60.1 | 60.1 | 60.1 | |||
| baseline (no augment) | 55.6 | 55.6 | 55.6 | 55.6 | 55.6 | ||||
| baseline (augmented) | 56.5 | 56.5 | 56.5 | 56.5 | 56.5 | ||||
| 256 | self-supervised | 87.6 | 87.9 | 87.6 | |||||
| baseline (no augment) | 81.5 | 81.5 | 81.5 | ||||||
| baseline (augmented) | 82.9 | 82.9 | 82.9 | ||||||
| translate, zoom rotate | 16 | self-supervised | 33.4 | 33.8 | 34.4 | 34.6 | 35.1 | 35.4 | 35.3 |
| baseline (no augment) | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | 25.6 | ||
| baseline (augmented) | 30.6 | 30.7 | 30.8 | 31.6 | 31.7 | 31.9 | 32.0 | ||
| 64 | self-supervised | 70.7 | 72.7 | 73.2 | 73.8 | 74.3 | |||
| baseline (no augment) | 55.6 | 55.6 | 55.6 | 55.6 | 55.6 | ||||
| baseline (augmented) | 66 | 67.8 | 68.8 | 69.5 | 70.2 | ||||
| 256 | self-supervised | 91.9 | 92.3 | 92.6 | |||||
| baseline (no augment) | 81.5 | 81.5 | 81.5 | ||||||
| baseline (augmented) | 89.5 | 90 | 90.3 |
$$ \begin{split} r(|\mW_\parallel + \mW_\bot|{\mathcal{H}}) &= r\left(\sqrt{ |\mW\parallel|{\mathcal{H}}^2 + |\mW\bot|{\mathcal{H}}^2 }\right) \ &\geq r(|\mW\parallel|_{\mathcal{H}}) . \end{split} $$
$$ \begin{split} \text{Primal feasibility: }& \mX \mB^* \mX^\intercal = \left(\mI + \mA \right)_+, ; ; \mB^* \succeq 0 \ \text{Dual feasibility: }& \mI - \mX^\intercal \mY^* \mX \succeq 0 \ \text{Complementary slackness: }& \left( \mI - \mX^\intercal \mY^* \mX \right) \mB^* = 0. \end{split} $$
$$ \begin{split}\left({\bm{I}}-{\bm{X}}^{\intercal}{\bm{Y}}^{}{\bm{X}}\right){\bm{B}}^{}&={\bm{X}}^{+}({\bm{I}}+{\bm{A}}){+}\left({\bm{X}}^{+}\right)^{\intercal}-{\bm{X}}^{\intercal}\left({\bm{X}}^{+}\right)^{\intercal}{\bm{X}}^{+}{\bm{X}}{\bm{X}}^{+}({\bm{I}}+{\bm{A}}){+}\left({\bm{X}}^{+}\right)^{\intercal}\ &={\bm{X}}^{+}({\bm{I}}+{\bm{A}}){+}\left({\bm{X}}^{+}\right)^{\intercal}-{\bm{X}}^{+}({\bm{I}}+{\bm{A}}){+}\left({\bm{X}}^{+}\right)^{\intercal}\ &=0.\end{split} $$ \tag{A1.E33}
References
[hausladen1994pretty] Hausladen, Paul, Wootters, William K. (1994). A ‘pretty good’measurement for distinguishing quantum states. Journal of Modern Optics.
[bellet2013survey] Bellet, Aur{'e. (2013). A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709.
[harris2020array] Harris, Charles R, Millman, K Jarrod, Van Der Walt, St{'e. (2020). Array programming with NumPy. Nature.
[cortes2010two] Cortes, Corinna, Mohri, Mehryar, Rostamizadeh, Afshin. (2010). Two-stage learning kernel algorithms. Proceedings of the 27th International Conference on International Conference on Machine Learning.
[baghshah2010kernel] Baghshah, Mahdieh Soleymani, Shouraki, Saeed Bagheri. (2010). Kernel-based metric learning for semi-supervised clustering. Neurocomputing.
[hoi2007learning] Hoi, Steven CH, Jin, Rong, Lyu, Michael R. (2007). Learning nonparametric kernel matrices from pairwise constraints. Proceedings of the 24th international conference on Machine learning.
[yeung2007kernel] Yeung, Dit-Yan, Chang, Hong. (2007). A kernel approach for semisupervised metric learning. IEEE Transactions on Neural Networks.
[zhuang2011unsupervised] Zhuang, Jinfeng, Wang, Jialei, Hoi, Steven CH, Lan, Xiangyang. (2011). Unsupervised multiple kernel learning. Asian Conference on Machine Learning.
[cristianini2001kernel] Cristianini, Nello, Shawe-Taylor, John, Elisseeff, Andre, Kandola, Jaz. (2001). On kernel-target alignment. Advances in neural information processing systems.
[dwibedi2021little] Dwibedi, Debidatta, Aytar, Yusuf, Tompson, Jonathan, Sermanet, Pierre, Zisserman, Andrew. (2021). With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[yang2006distance] Yang, Liu, Jin, Rong. (2006). Distance metric learning: A comprehensive survey. Michigan State Universiy.
[belkin2004semi] Belkin, Mikhail, Niyogi, Partha. (2004). Semi-supervised learning on Riemannian manifolds. Machine learning.
[lloyd2020quantum] Lloyd, Seth, Bosch, Samuel, De Palma, Giacomo, Kiani, Bobak, Liu, Zi-Wen, Marvian, Milad, Rebentrost, Patrick, Arvidsson-Shukur, David M. (2020). Quantum polar decomposition algorithm. arXiv preprint arXiv:2006.00841.
[wen2021toward] Wen, Zixin, Li, Yuanzhi. (2021). Toward understanding the feature learning process of self-supervised contrastive learning. International Conference on Machine Learning.
[steinwart2008support] Steinwart, Ingo, Christmann, Andreas. (2008). Support vector machines.
[lee2017deep] Lee, Jaehoon, Bahri, Yasaman, Novak, Roman, Schoenholz, Samuel S, Pennington, Jeffrey, Sohl-Dickstein, Jascha. (2017). Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165.
[wang2022and] Wang, Sifan, Yu, Xinling, Perdikaris, Paris. (2022). When and why PINNs fail to train: A neural tangent kernel perspective. Journal of Computational Physics.
[bietti2019inductive] Bietti, Alberto, Mairal, Julien. (2019). On the inductive bias of neural tangent kernels. Advances in Neural Information Processing Systems.
[yang2020feature] Yang, Greg, Hu, Edward J. (2020). Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522.
[chizat2018note] Chizat, Lenaic, Bach, Francis. (2018). A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956.
[seleznova2022analyzing] Seleznova, Mariia, Kutyniok, Gitta. (2022). Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory?. Mathematical and Scientific Machine Learning.
[lee2020finite] Lee, Jaehoon, Schoenholz, Samuel, Pennington, Jeffrey, Adlam, Ben, Xiao, Lechao, Novak, Roman, Sohl-Dickstein, Jascha. (2020). Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems.
[yang2022tensor] Yang, Greg, Hu, Edward J, Babuschkin, Igor, Sidor, Szymon, Liu, Xiaodong, Farhi, David, Ryder, Nick, Pachocki, Jakub, Chen, Weizhu, Gao, Jianfeng. (2022). Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. arXiv preprint arXiv:2203.03466.
[Huang2021] Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni, Ryan Babbush, Sergio Boixo, Hartmut Neven, Jarrod R. McClean. (2021). Power of data in quantum machine learning. Nature Communications. doi:10.1038/s41467-021-22539-9.
[meir2003generalization] Meir, Ron, Zhang, Tong. (2003). Generalization error bounds for Bayesian mixture algorithms. Journal of Machine Learning Research.
[mohri2018foundations] Mohri, Mehryar, Rostamizadeh, Afshin, Talwalkar, Ameet. (2018). Foundations of machine learning.
[vapnik1999nature] Vapnik, Vladimir. (1999). The nature of statistical learning theory.
[bartlett2002rademacher] Bartlett, Peter L, Mendelson, Shahar. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research.
[haochen2022beyond] HaoChen, Jeff Z, Wei, Colin, Kumar, Ananya, Ma, Tengyu. (2022). Beyond separability: Analyzing the linear transferability of contrastive representations to related subpopulations. arXiv preprint arXiv:2204.02683.
[balestriero2022contrastive] Balestriero, Randall, LeCun, Yann. (2022). Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. arXiv preprint arXiv:2205.11508.
[boyd2004convex] Boyd, Stephen, Boyd, Stephen P, Vandenberghe, Lieven. (2004). Convex optimization.
[scholkopf2001generalized] Sch{. (2001). A generalized representer theorem. International conference on computational learning theory.
[wei2020theoretical] Wei, Colin, Shen, Kendrick, Chen, Yining, Ma, Tengyu. (2020). Theoretical analysis of self-training with deep networks on unlabeled data. arXiv preprint arXiv:2010.03622.
[chung1997spectral] Chung, Fan RK. (1997). Spectral graph theory.
[haochen2021provable] HaoChen, Jeff Z, Wei, Colin, Gaidon, Adrien, Ma, Tengyu. (2021). Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems.
[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. International conference on machine learning.
[bachman2019learning] Bachman, Philip, Hjelm, R Devon, Buchwalter, William. (2019). Learning representations by maximizing mutual information across views. Advances in neural information processing systems.
[misra2020self] Misra, Ishan, Maaten, Laurens van der. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.
[ye2019unsupervised] Ye, Mang, Zhang, Xu, Yuen, Pong C, Chang, Shih-Fu. (2019). Unsupervised embedding learning via invariant and spreading instance feature. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, St{'e. (2021). Barlow twins: Self-supervised learning via redundancy reduction. International Conference on Machine Learning.
[chen2021exploring] Chen, Xinlei, He, Kaiming. (2021). Exploring simple siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[arora2019theoretical] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.
[lee2021predicting] Lee, Jason D, Lei, Qi, Saunshi, Nikunj, Zhuo, Jiacheng. (2021). Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems.
[cohen2017emnist] Cohen, Gregory, Afshar, Saeed, Tapson, Jonathan, Van Schaik, Andre. (2017). EMNIST: Extending MNIST to handwritten letters. 2017 international joint conference on neural networks (IJCNN).
[arora2019harnessing] Arora, Sanjeev, Du, Simon S, Li, Zhiyuan, Salakhutdinov, Ruslan, Wang, Ruosong, Yu, Dingli. (2019). Harnessing the power of infinitely wide deep nets on small-data tasks. arXiv preprint arXiv:1910.01663.
[fernandez2014we] Fern{'a. (2014). Do we need hundreds of classifiers to solve real world classification problems?. The journal of machine learning research.
[williams2006gaussian] Williams, Christopher KI, Rasmussen, Carl Edward. (2006). Gaussian processes for machine learning.
[yang2019scaling] Yang, Greg. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760.
[neuraltangents2020] Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, Samuel S. Schoenholz. (2020). Neural Tangents: Fast and Easy Infinite Neural Networks in Python. International Conference on Learning Representations.
[arora2019exact] Arora, Sanjeev, Du, Simon S, Hu, Wei, Li, Zhiyuan, Salakhutdinov, Russ R, Wang, Ruosong. (2019). On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems.
[jacot2018neural] Jacot, Arthur, Gabriel, Franck, Hongler, Cl{'e. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems.
[he2020momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altch{'e. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems.
[caron2020unsupervised] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems.
[mosek] Sturm, Jos F. (1999). Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization methods and software.
[xia2013online] Xia, Hao, Hoi, Steven CH, Jin, Rong, Zhao, Peilin. (2013). Online multiple kernel similarity learning for visual search. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[neal1996priors] Neal, Radford M. (1996). Priors for infinite networks. Bayesian Learning for Neural Networks.
[bib1] MOSEK ApS. The MOSEK optimization toolbox for MATLAB manual. Version 9.0., 2019. URL http://docs.mosek.com/9.0/toolbox/index.html.
[bib2] Arora et al. (2019a) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. Advances in Neural Information Processing Systems, 32, 2019a.
[bib3] Arora et al. (2019b) Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019b.
[bib4] Mahdieh Soleymani Baghshah and Saeed Bagheri Shouraki. Kernel-based metric learning for semi-supervised clustering. Neurocomputing, 73(7-9):1352–1361, 2010.
[bib5] Randall Balestriero and Yann LeCun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. arXiv preprint arXiv:2205.11508, 2022.
[bib6] Bardes et al. (2021) Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
[bib7] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
[bib8] Mikhail Belkin and Partha Niyogi. Semi-supervised learning on riemannian manifolds. Machine learning, 56(1):209–239, 2004.
[bib9] Bellet et al. (2013) Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013.
[bib10] Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. Advances in Neural Information Processing Systems, 32, 2019.
[bib11] Boyd et al. (2004) Stephen Boyd, Stephen P Boyd, and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
[bib12] Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.
[bib13] Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020.
[bib14] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758, 2021.
[bib15] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. arXiv preprint arXiv:1812.07956, 8, 2018.
[bib16] Fan RK Chung. Spectral graph theory, volume 92. American Mathematical Soc., 1997.
[bib17] Cohen et al. (2017) Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending mnist to handwritten letters. In 2017 international joint conference on neural networks (IJCNN), pp. 2921–2926. IEEE, 2017.
[bib18] Cortes et al. (2010) Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Two-stage learning kernel algorithms. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 239–246, 2010.
[bib19] Cristianini et al. (2001) Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz Kandola. On kernel-target alignment. Advances in neural information processing systems, 14, 2001.
[bib20] Dwibedi et al. (2021) Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9588–9597, 2021.
[bib21] Fernández-Delgado et al. (2014) Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim. Do we need hundreds of classifiers to solve real world classification problems? The journal of machine learning research, 15(1):3133–3181, 2014.
[bib22] Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
[bib23] HaoChen et al. (2021) Jeff Z HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems, 34:5000–5011, 2021.
[bib24] HaoChen et al. (2022) Jeff Z HaoChen, Colin Wei, Ananya Kumar, and Tengyu Ma. Beyond separability: Analyzing the linear transferability of contrastive representations to related subpopulations. arXiv preprint arXiv:2204.02683, 2022.
[bib25] Harris et al. (2020) Charles R Harris, K Jarrod Millman, Stéfan J Van Der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array programming with numpy. Nature, 585(7825):357–362, 2020.
[bib26] He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738, 2020.
[bib27] Hoi et al. (2007) Steven CH Hoi, Rong Jin, and Michael R Lyu. Learning nonparametric kernel matrices from pairwise constraints. In Proceedings of the 24th international conference on Machine learning, pp. 361–368, 2007.
[bib28] Huang et al. (2021) Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni, Ryan Babbush, Sergio Boixo, Hartmut Neven, and Jarrod R. McClean. Power of data in quantum machine learning. Nature Communications, 12(1), May 2021. doi: 10.1038/s41467-021-22539-9. URL https://doi.org/10.1038/s41467-021-22539-9.
[bib29] Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
[bib30] Lee et al. (2017) Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
[bib31] Lee et al. (2020) Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems, 33:15156–15172, 2020.
[bib32] Lee et al. (2021) Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34:309–323, 2021.
[bib33] Ron Meir and Tong Zhang. Generalization error bounds for bayesian mixture algorithms. Journal of Machine Learning Research, 4(Oct):839–860, 2003.
[bib34] Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
[bib35] Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pp. 29–53. Springer, 1996.
[bib36] Novak et al. (2020) Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. Neural tangents: Fast and easy infinite neural networks in python. In International Conference on Learning Representations, 2020. URL https://github.com/google/neural-tangents.
[bib37] Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[bib38] Schölkopf et al. (2001) Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International conference on computational learning theory, pp. 416–426. Springer, 2001.
[bib39] Mariia Seleznova and Gitta Kutyniok. Analyzing finite neural networks: Can we trust neural tangent kernel theory? In Mathematical and Scientific Machine Learning, pp. 868–895. PMLR, 2022.
[bib40] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
[bib41] Jos F Sturm. Using sedumi 1.02, a matlab toolbox for optimization over symmetric cones. Optimization methods and software, 11(1-4):625–653, 1999.
[bib42] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 1999.
[bib43] Wang et al. (2022) Sifan Wang, Xinling Yu, and Paris Perdikaris. When and why pinns fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449:110768, 2022.
[bib44] Wei et al. (2020) Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training with deep networks on unlabeled data. arXiv preprint arXiv:2010.03622, 2020.
[bib45] Zixin Wen and Yuanzhi Li. Toward understanding the feature learning process of self-supervised contrastive learning. In International Conference on Machine Learning, pp. 11112–11122. PMLR, 2021.
[bib46] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006.
[bib47] Xia et al. (2013) Hao Xia, Steven CH Hoi, Rong Jin, and Peilin Zhao. Online multiple kernel similarity learning for visual search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):536–549, 2013.
[bib48] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
[bib49] Yang et al. (2022) Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
[bib50] Liu Yang and Rong Jin. Distance metric learning: A comprehensive survey. Michigan State Universiy, 2(2):4, 2006.
[bib51] Ye et al. (2019) Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6210–6219, 2019.
[bib52] Dit-Yan Yeung and Hong Chang. A kernel approach for semisupervised metric learning. IEEE Transactions on Neural Networks, 18(1):141–149, 2007.
[bib53] Zbontar et al. (2021) Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pp. 12310–12320. PMLR, 2021.
[bib54] Zhuang et al. (2011) Jinfeng Zhuang, Jialei Wang, Steven CH Hoi, and Xiangyang Lan. Unsupervised multiple kernel learning. In Asian Conference on Machine Learning, pp. 129–144. PMLR, 2011.