Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need
Vivien Cabannes*, Meta AI, Leon Bottou, Meta AI, Yann LeCun, Meta AI, Randall Balestriero*, Meta AI
Abstract
Self-Supervised Learning (SSL) has emerged as the solution of choice to learn transferable representations from unlabeled data. However, SSL requires building samples that are known to be semantically akin, i.e. positive views. Requiring such knowledge is the main limitation of SSL, and is often tackled by ad-hoc strategies, e.g. applying known data-augmentations to the same input. In this work, we formalize and generalize this principle through Positive Active Learning (PAL), where an oracle queries semantic relationships between samples. PAL achieves three main objectives. First, it unveils a theoretically grounded learning framework beyond SSL, based on similarity graphs, that can be extended to tackle supervised and semi-supervised learning depending on the employed oracle. Second, it provides a consistent algorithm to embed a priori knowledge, e.g. some observed labels, into any SSL loss without any change in the training pipeline. Third, it provides a proper active learning framework yielding low-cost solutions to annotate datasets, arguably bridging the gap between theory and practice of active learning that is based on simple-to-answer-by-non-experts queries of semantic relationships between inputs.
Introduction
Learning representations of data that can be used to solve multiple tasks, out-of-the-box, and with minimal postprocessing is one of the main goals of current AI research [39, 35, 24]. Such representations are generally found by processing given inputs through Deep Neural Networks (DNs). The main question around which contemporary research focuses is the choice of the training setting employed to tune the DN's parameters. A few different strategies have emerged, such as layerwise [6], reconstruction based [48], and more recently, based on Self-Supervised Learning (SSL) [14, 37]. In fact,
*Equal contribution.
due to the cost of labeling and the size of datasets constantly growing, recent methods have tried to drift away from traditional supervised learning [42]. Among existing training solutions, joint-embedding SSL has emerged as one of the most promising [34]. It consists of learning representations that are invariant to some known transformations while preventing dimensional collapse of the representation. Such invariance is enforced by applying known Data-Augmentations (DAs), e.g. translations for images, to the same input and making sure that the corresponding representations coincide.
Despite tremendous progress, several limitations remain in the way of a widespread deployment of SSL. In particular, it is not clear how to incorporate a priori knowledge into SSL frameworks beyond the usual tweaking of the loss and of the DAs employed, although some efforts are being made [15, 53, 52]. Indeed, it is not surprising that vision-language pre-training has replaced SSL as the state-of-the-art to learn image representations [26], as those models are better suited to incorporate information stemming from captions that often come alongside images collected on the Internet.
In this study, we propose to redefine existing SSL losses in terms of a similarity graph, where nodes represent data samples and edges reflect known inter-sample relationships. Our first contribution stemming from this formulation provides a generic framework to think about learning in terms of similarity graphs: it yields a spectrum on which SSL and supervised learning can be seen as two extremes. Within this realm, those two extremes are connected through the similarity matrix, and in fact can be made equivalent by varying the similarity graph. Our second contribution naturally emerges from using such a similarity graph, unveiling an elegant framework to reduce the cost and expert requirement of active learning, summarized by:
Tell me who your friends are, and I will tell you who you are.
Active learning, which aims to reduce supervised learning cost by only asking an oracle for sample labels when needed [41, 20, 27, 31], can now be formulated in terms of relative sample comparisons, rather than absolute sample labeling.

Figure 1. Active Self-Supervised Learning introduces PAL (right box), an alternative to active learning (left box) where the oracle is asked if a collection of inputs are semantically related or not. As opposed to active learning, expert knowledge is reduced, as one need not know all the possible classes but only how to distinguish inputs from different classes. PAL querying is flexible; as an illustrative example we exhibit an à la captcha version where a given input is presented along with a collection of other inputs, and the oracle can select among those inputs the positive ones.

This much more efficient and low-cost approach is exactly the active learning strategy stemming from our framework: rather than asking for labels, one asks whether two (or more) inputs belong to the same class, as depicted in Fig. 1. We coin such a strategy Positive Active Learning (PAL), and we will present some key analysis of the benefits of PAL over traditional active learning. We summarize our contributions below:
· We provide a unified learning framework based on the concept of a similarity graph, which encompasses self-supervised, supervised, and semi-supervised learning, as well as many variants.
· We derive a generic PAL algorithm based on an oracle to query the underlying similarity graph (Algorithm 1). The different learning frameworks (SSL, supervised, and so forth) are recovered by different oracles, which can be combined to benefit from each framework's distinctions.
· We show how PAL extends into an active learning framework based on similarity queries that provides low-cost, efficient strategies to annotate a dataset (Fig. 1).
All statements of this study are proven in Appendix B, code to reproduce experiments is provided at https:// github.com/facebookresearch/pal .
Background on Self-Supervised Learning
This section provides a brief reminder of the main self-supervised learning (SSL) methods, their associated losses, and common notations for the remainder of the study.
A common strategy to learn a model in machine learning is to curate labeled examples ( x n , y n ) n , and to learn a model that, given x n ∈ X ≜ R D as input, outputs y n ∈ [ C ] , hoping that this model will learn to recognize patterns and relations that generalize to new, unseen input data. Yet, as datasets grew larger and annotating data became a major bottleneck, machine learning has shifted its attention to learning methods that do not require knowledge of y n . SSL has emerged as a powerful solution to circumvent the need for expensive and time-consuming labeling. It learns an embedding f : X → R K for a small K by enforcing either reconstruction properties, or some invariance and symmetry onto a learned representation. SSL also relies on a set of observations X = { x n } N n =1 ∈ R N × D , yet instead of labels y n , it requires known pairwise positive relations that indicate whether two samples are semantically similar or not. For simplicity, we shall focus on the joint-embedding framework, where those positive pairs are artificially generated on the fly by applying Data Augmentations (DAs), e.g. adding white noise, masking, to the same input. Let us denote by T 1 , T 2 : X → X the generators of two (random) DAs T 1 ( x ) and T 2 ( x ) from an input x , by f θ : R D → R K the parametric

Figure 2. Left: Depiction of the 'knowledge graph' arising from binary classification with supervised learning. Notice the two connected components, each corresponding to a single class. Each sample is associated with a node of the graph (blue circle) and the known positive relation between samples is represented by an edge. This knowledge is summarized into the G matrix depicted on the right. Right: Examples of the N × N symmetric graph-adjacency matrices G for the case of binary classification with supervised (same graph as on the left), semi-supervised and active learning. Each nonzero entry ( G ) i,j represents the known positive relation between sample i and j .
model to be learned, and
$$
Z^{(v)} \triangleq \big[ f_\theta(T_v(x_n)) \big]_{n \in [N]} \in \mathbb{R}^{N \times K}, \qquad v \in \{1, 2\}, \tag{1}
$$
where ( z (1) n , z (2) n ) , the n th row of Z (1) and Z (2) respectively, form the n th positive pair associated to sample x n . Using (1), different SSL losses will employ different measures of invariance and dimensional collapse. Typically, the losses are minimized with gradient descent and backpropagation to learn θ .
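As a toy illustration of this pipeline, the two views of (1) can be generated with a simple additive-noise augmentation and a linear map standing in for f θ (both the augmentation and the linear embedding are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(X, W):
    # stand-in for the parametric model f_theta: here a linear map R^D -> R^K
    return X @ W

def two_views(X, sigma=0.1):
    # T_1, T_2: additive Gaussian noise as an illustrative data augmentation
    return X + sigma * rng.normal(size=X.shape), X + sigma * rng.normal(size=X.shape)

N, D, K = 16, 5, 3
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, K))
T1, T2 = two_views(X)
Z1, Z2 = f_theta(T1, W), f_theta(T2, W)  # the n-th rows (z1_n, z2_n) form a positive pair
```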
VICReg. With the above notations, the VICReg loss [3] reads, with hyper-parameters α, β > 0,

$$
\mathcal{L}_{\mathrm{VIC}} = \frac{1}{N} \sum_{n \in [N]} \big\| z_n^{(1)} - z_n^{(2)} \big\|^2 + \sum_{v \in \{1,2\}} \Big( \alpha \sum_{k \in [K]} \big( 1 - \sqrt{C^{(v)}_{kk}} \big)_+ + \beta \sum_{k \neq k'} \big( C^{(v)}_{kk'} \big)^2 \Big), \tag{2}
$$

where C^{(v)} = Cov(Z^{(v)}) denotes the covariance matrix of the embeddings of the v-th view.

SimCLR. With CoSim denoting the cosine similarity and τ > 0 a temperature, the SimCLR loss reads

$$
\mathcal{L}_{\mathrm{Sim}} = - \sum_{n \in [N]} \Big( \mathrm{CoSim}\big( z_n^{(1)}, z_n^{(2)} \big) / \tau - \log \sum_{m \in [N]} \exp\big( \mathrm{CoSim}( z_n^{(1)}, z_m^{(2)} ) / \tau \big) \Big). \tag{3}
$$
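The two terms of the VICReg loss can be sketched in NumPy as follows (a simplified implementation: a hinge on the per-coordinate standard deviation plus an off-diagonal covariance penalty; the exact weighting and the small ε usually placed inside the square root are illustrative simplifications):

```python
import numpy as np

def vicreg_loss(Z1, Z2, alpha=1.0, beta=1.0):
    N, K = Z1.shape
    # (i) invariance: match the two views of each sample
    invariance = np.sum((Z1 - Z2) ** 2) / N
    reg = 0.0
    for Z in (Z1, Z2):
        C = np.cov(Z, rowvar=False)                        # K x K covariance
        std = np.sqrt(np.clip(np.diag(C), 0.0, None))
        reg += alpha * np.sum(np.maximum(0.0, 1.0 - std))  # variance hinge
        off_diag = C - np.diag(np.diag(C))
        reg += beta * np.sum(off_diag ** 2)                # covariance penalty
    return invariance + reg
```

A collapsed (constant) embedding saturates the variance hinge, while well-spread, decorrelated views incur no penalty.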
BarlowTwins. BarlowTwins [51] is built on the cross-correlation matrix C = CoSim( Z (1) ⊤ , Z (2) ⊤ ); with hyper-parameter λ it reads

$$
\mathcal{L}_{\mathrm{BT}} = \sum_{k \in [K]} \big( 1 - C_{kk} \big)^2 + \lambda \sum_{k \neq k'} C_{kk'}^2. \tag{4}
$$
Spectral Contrastive Loss. Finally, the spectral contrastive loss [28] is a theory-friendly proxy for SSL, reading

$$
\mathcal{L}_{\mathrm{spec}} = -2 \sum_{n \in [N]} z_n^{(1)\top} z_n^{(2)} + \sum_{n \neq m} \big( z_n^{(1)\top} z_m^{(2)} \big)^2. \tag{5}
$$
In particular, as proven in Appendix B.1.1, (5) recovers VICReg (2) when the ReLU-hinge loss is replaced by the mean-square error, hence the denomination VIC².
The Commonality between SSL Losses. All the above losses, Eqs. (2) to (5), combine two terms: (i) a matching term between positive pairs, and (ii) a term to avoid collapse towards predicting a constant solution for all inputs. (i) can take different forms, such as the squared norm between Z (1) and Z (2) (2), the opposite of their scalar product (5), or of their cosine (3), or the squared norm between the centered-cosine and one (4). (ii) can also take different forms, such as the infoNCE softmax (3), or an energy that enforces richness of the learned features, such as the variance-covariance regularization in (2) forcing the different coordinates of f θ to be orthogonal [12].
While at face value those losses seem distinct, they all combine variants of (i) and (ii), and even more importantly, they all rely on the exact same information of positive inter-sample relations for (i). This is exactly what the next section dives into, as a means to unify SSL losses, along with supervised learning methods.
The Ubiquity of Similarity Graphs
The goal of this section is to unify SSL and supervised learning through the introduction of a special object: a similarity graph .
The Graphs for (Self-)Supervised Learning
Throughout this study, a similarity graph denotes a graph for which nodes represent data samples, and edges reflect similarity relationships. Formally, such a graph is expressed through its symmetric adjacency matrix G ∈ R N × N , the semantic relation between inputs i and j being encoded in the real entry G i,j . The remainder of this section demonstrates (i) how SSL losses are implicitly based on a similarity graph, and (ii) how those losses tackle the supervised learning problem when provided with a richer graph G .
Supervised learning. In addition to the N input samples X ∈ R N × D , supervised learning has access to paired labels y ≜ [ y 1 , . . . , y N ] . For clarity, we focus here on categorical labels i.e. y n belongs to { 1 , . . . , C } for C the number of classes. 1 The one-hot encoding of y will be denoted by the matrix Y ∈ R N × C . In terms of the similarity graph G , the label-based relation becomes naturally encoded as G i,j = 1 { y i = y j } , or equivalently
$$
G^{(\mathrm{sup})} \triangleq Y Y^\top. \tag{6}
$$
A key observation that we must emphasize is that the graph G almost explicitly encodes the labels Y , which will be made explicit with Theorem 2.
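Concretely, (6) is one line of NumPy once the labels are one-hot encoded:

```python
import numpy as np

def supervised_graph(y, C):
    # Eq. (6): one-hot encode the labels, then G = Y Y^T
    Y = np.eye(C)[y]
    return Y @ Y.T

y = np.array([0, 0, 1, 2, 1])
G = supervised_graph(y, C=3)   # G[i, j] = 1 iff y_i == y_j
```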
Multiple Epochs with Data Augmentation. When DA is employed, and training is carried out for E epochs, the original N input samples are transformed into N × E 'augmented' samples. For more generality, since DA will also be used in SSL, let us denote by A ∈ N ∗ the number of augmentations, where here A = E . We now have the augmented dataset
$$
X^{(A)} \triangleq \big[ T^{(a)}(x_n) \big]_{n \in [N],\, a \in [A]} \in \mathbb{R}^{NA \times D},
$$
where each T has its own randomness. When available, i.e. for supervised learning, the corresponding 'augmented' labels Y ( A ) are given by repeating A times each row of Y , formally written with the Kronecker product Y (sup) ≜ Y ⊗ 1 A , and from that, we can now define the supervised dataset and associated graph extending (6) to the case of multiple epochs and DA training
$$
\big( X^{(\mathrm{sup})},\, G^{(\mathrm{sup})} \big) \triangleq \big( X^{(A)},\, Y^{(\mathrm{sup})} Y^{(\mathrm{sup})\top} \big), \qquad Y^{(\mathrm{sup})} \triangleq Y \otimes \mathbf{1}_A. \tag{7}
$$
The resulting graph (7) is depicted on the left part of Fig. 2.

Self-Supervised Learning. SSL does not rely on labels, but on positive pairs/tuples/views generated at each epoch. Let us denote by V the number of positive views generated, commonly V = 2 for positive pairs as modeled in (1). With E the total number of epochs, SSL produces V × E samples semantically related to each original sample x n through the course of training, i.e. in SSL A = V × E while in supervised learning A = E . The total number of samples is thus
1 While we focus here on classification for simplicity, our approach is easily extendable to generic problems involving a loss ℓ by defining the graph as G ij = -ℓ ( y i , y j ) . In classification, ℓ could be the zero-one loss ℓ ( y i , y j ) = 1 { y i ≠ y j } , and G ij ≃ 1 -ℓ ( y i , y j ) . See Appendix B.2.1 for details.
N × E × V , defining the dataset and associated graph
$$
\big( X^{(\mathrm{ssl})},\, G^{(\mathrm{ssl})} \big) \triangleq \big( X^{(A)},\, I_N \otimes \mathbf{1}_{EV} \mathbf{1}_{EV}^\top \big), \qquad A = EV, \tag{8}
$$
where the associated similarity graph G (ssl), now of size NEV × NEV, captures whether two samples were generated as DAs of the same original input.
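The block structure of G (ssl) can be materialized with a Kronecker product (a sketch that, per the description above, treats all E·V augmentations of a sample as mutual positives):

```python
import numpy as np

def ssl_graph(N, E, V):
    # block-diagonal similarity graph: the A = E*V augmentations of each
    # original sample are all positives of one another
    A = E * V
    return np.kron(np.eye(N), np.ones((A, A)))

G_ssl = ssl_graph(N=3, E=2, V=2)   # 12 x 12, three all-ones 4x4 blocks
```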
Self-Supervised Learning on a Graph
This section reformulates the different SSL losses through the sole usage of the similarity graph G (ssl) . To lighten notations, and without loss of generality, we redefine X ∈ R N × D to denote the full dataset, i.e. N ← NEV , with X = X (sup) for supervised learning run over V × E epochs, or X = X (ssl) for SSL run over E epochs with V views. The model embedding is shortened to Z ≜ f θ ( X ) ∈ R N × K as per Eq. (1).
Theorem 1. The VICReg (2) , SimCLR (3) , and BarlowTwins (4) losses can be expressed in terms of the graph G (8)
$$
\mathcal{L}_{\mathrm{VIC}}(Z) = \big\| Z Z^\top - G \big\|_F^2, \qquad
\mathcal{L}_{\mathrm{Sim}}(Z) = - \sum_{i,j \in [N]} \frac{G_{ij}}{D_{ii}} \log \frac{\exp( \tilde{z}_i^\top \tilde{z}_j )}{\sum_{m \in [N]} \exp( \tilde{z}_i^\top \tilde{z}_m )}, \qquad
\mathcal{L}_{\mathrm{BT}}(Z) = \big\| \tilde{Z}^\top G \tilde{Z} - I_K \big\|_F^2,
$$
where D = diag( G 1 ) is the degree matrix of G ; with ˜ z ≜ z / ∥ z ∥ and ˜ Z the column normalized Z so that each column has unit norm.
In essence, SSL is about making sure that representations match for samples that were deemed similar through the design of data augmentation. As such, it is not surprising that one can rewrite SSL losses through the sole usage of the similarity graph. From Theorem 1, the attentive observer will notice how VICReg is akin to Laplacian Eigenmaps or multidimensional scaling, SimCLR to cross-entropy, and BarlowTwins to Canonical Correlation Analysis; observations already made in the literature [2] and reinforced above.
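The multidimensional-scaling flavor of the graph form can be checked numerically using the VIC² (mean-square) variant of the loss: the minimizer of ‖ZZ⊤ − G‖²_F over rank-K matrices is given by the top-K eigenpairs of G, and for a balanced supervised graph with C classes, K = C reconstructs G exactly (a sketch, not the paper's experiment):

```python
import numpy as np

# balanced 2-class supervised graph on N = 6 samples
y = np.array([0, 0, 0, 1, 1, 1])
Y = np.eye(2)[y]
G = Y @ Y.T

# closed-form minimizer of ||Z Z^T - G||_F^2 with K columns:
# eigenvectors of the top-K eigenvalues, scaled by sqrt(eigenvalue)
K = 2
vals, vecs = np.linalg.eigh(G)
Z = vecs[:, -K:] * np.sqrt(np.clip(vals[-K:], 0.0, None))
```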
Beyond recovering such representation learning losses, our goal is to go one step further and to tie SSL and supervised learning through the lens of G , which follows in the next section.
Self-Supervised is a G Away from Supervised
What happens if one takes the different SSL losses, but replaces the usual SSL graph G (ssl) with the supervised one G (sup) ?
It turns out that the learned representations emerging from such losses are identical, up to some negligible symmetries that can be corrected for when learning a linear probe, to the one-hot encoding of Y . To make our formal statement (Theorem 2) clearer, we introduce the set of optimal representations that minimize a given loss:
$$
\mathcal{S}_{\mathrm{method}} \triangleq \operatorname*{arg\,min}_{Z \in \mathbb{R}^{N \times K}} \mathcal{L}_{\mathrm{method}}(Z),
$$
where 'method' refers to the different losses.
Theorem 2 (Interpolation optimum) . When K ≥ C , and Z = f θ ( X ) is unconstrained (e.g. interpolation regime with a rich function class), the SSL losses as per Theorem 1 with the supervised graph (7) solve the supervised learning problem with:
$$
\mathcal{S}_{\mathrm{VIC}} = \big\{ Y M R \,\big|\, R \in \mathcal{O} \big\}, \qquad
\mathcal{S}_{\mathrm{Sim}} = \mathcal{S}_{\mathrm{BT}} = \big\{ Y M D \,\big|\, D \in \mathrm{diag}_+ \big\},
$$
where R ∈ O means that R is a rotation matrix as defined for the VICReg loss, diag + = diag( R N + ) denotes the set of diagonal matrices with positive entries, i.e. renormalization matrices, and M is a matrix that maps a deformation of the simplex into the canonical basis. Moreover, provided class templates, i.e. C data points associated with each of the C classes, Y is easily retrieved for any method and any Z ∈ S method .
In essence, Theorem 2 states that SSL losses solve the supervised learning problem when the employed graph G is G (sup) . Moreover, the matrix D appearing in Theorem 2 captures the fact that SimCLR solutions are invariant to logit rescaling and akin to the cross-entropy loss, while BarlowTwins is invariant to column renormalization of Z and akin to discriminant analysis. Lastly, VICReg might be thought of as a proxy for the least-squares loss. At a high level, Theorem 2 suggests fruitful links between the spectral embedding techniques captured in Theorem 1 and supervised learning. We leave the investigation of this link, and the translation of spectral embedding results into the realm of supervised learning, for future work.
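Theorem 2's last claim, that Y is easily retrieved given one class template per class, can be checked on a toy factorization of G (sup) (a sketch; the eigendecomposition stands in for training f θ to optimality):

```python
import numpy as np

y = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
C = 3
Y = np.eye(C)[y]
G = Y @ Y.T

# an optimal embedding: Z Z^T = G (top-C eigenpairs of G)
vals, vecs = np.linalg.eigh(G)
Z = vecs[:, -C:] * np.sqrt(np.clip(vals[-C:], 0.0, None))

# one labeled template per class suffices to read the labels off Z:
# z_i . z_t equals 1 for the template t of the right class, 0 otherwise
templates = [int(np.flatnonzero(y == c)[0]) for c in range(C)]
y_hat = np.array([int(np.argmax([Z[i] @ Z[t] for t in templates]))
                  for i in range(len(y))])
```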
While Theorem 2 describes what we have coined the 'interpolation optimum', i.e. the solution in the interpolation regime with rich models, we ought to highlight that the classical statistical learning literature analyzes losses under the light of the 'Bayes optimum', i.e. solutions in a noisy context-free setting [4]. Those Bayes optima do not make as much sense for losses that intrinsically relate different inputs, yet for completeness we provide such a proposition on the Bayes optimum in Appendix B.3.
PAL: Positive Active Learning
Now that we have demonstrated how one should focus on the graph G , rather than on the (self-)supervised loss, we turn our focus to obtaining that graph G . In particular, we propose an active learning framework that discovers G through efficient, low-cost queries.
One Framework to Rule Them All
From our understanding (Theorem 2), the difficulties of both supervised learning and SSL are the same: they need a correct graph G , i.e. they need to identify samples that are semantically similar, either through label annotations or through the right design of DA.
Algorithm 1: PAL (Positive Active Learning)
Data: samples X ∈ R N × D ; unknown similarity graph G .
Result: embedding f θ : R D → R K .
Initialization: weights θ 0 , step-size scheduler ( γ t ) t ∈ [ T ] .
For t ∈ [ T ]:
    Sampler: choose index sets I t , J t ;
    Labeler: collect G ij = 1 { y i ∼ y j } for ( i, j ) ∈ I t ;
    Update: θ ← θ - γ t ∇ θ L ( θ ; I t , J t , G ) .
Our framework suggests a generic way to proceed, having fixed the samples X in advance, and without much a priori knowledge of the similarity graph G . In an active learning spirit, one would like to design a query strategy to discover G , and an update rule for the learned parameter θ . To ground the discussion, let us focus on VICReg. With R ( a, b ) = ( a ⊤ b )² - ∥ a ∥² - ∥ b ∥², the variance-covariance term can be rewritten so as to yield the formula, proven in Appendix B.1.1,
$$
\mathcal{L}_{\mathrm{VIC}}(\theta) = \sum_{(i,j) \in I} G_{ij} \big\| z_i - z_j \big\|^2 + \sum_{(i,j) \in J} R( z_i, z_j ), \tag{10}
$$
where I = J = [ N ] 2 . An oracle would typically consider two small sets of indices I, J ⊂ [ N ] 2 , ask labelers to provide G ij for ( i, j ) ∈ I , and, given a step size γ t , update the weights with
$$
\theta_{t+1} \leftarrow \theta_t - \gamma_t \nabla_\theta \Big( \sum_{(i,j) \in I_t} G_{ij} \big\| z_i - z_j \big\|^2 + \sum_{(i,j) \in J_t} R( z_i, z_j ) \Big),
$$
which can be performed with sole access to ( G ij ) ( i,j ) ∈ I t . The pairs in J are used to avoid dimensional collapse, and in particular, for the VICReg loss, to compute the variance-covariance regularization term. The complete picture leads to PAL, Algorithm 1. A particularly useful feature of SGD for active learning is its robustness to labeling noise [10]. In other terms, Algorithm 1 is robust to noise in the query answers .
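A minimal sketch of this update loop follows, with the matching term only and a toy linear model (the collapse-preventing pairs J t and the term R are omitted for brevity, and the data, step size, and query rule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: class signal in coordinate 0, within-class noise in coordinate 1
N, D, K = 20, 2, 2
y = np.repeat([0, 1], N // 2)
X = np.stack([4.0 * y, rng.normal(size=N)], axis=1)
W = np.full((D, K), 0.1)          # linear embedding z = W^T x

gamma = 5e-3
for t in range(2000):
    i, j = rng.integers(N, size=2)
    G_ij = float(y[i] == y[j])    # one oracle query of a graph entry
    if G_ij:                      # matching term: pull positive pairs together
        d = X[i] - X[j]
        # gradient of G_ij * ||W^T x_i - W^T x_j||^2 with respect to W
        W -= gamma * 2.0 * np.outer(d, W.T @ d)

Z = X @ W                          # same-class samples end up close together
```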
We now dive deeper into two variants of oracles: passive and active ones. As we will see, passive oracles recover traditional SSL and supervised learning as special cases, but are much less efficient at learning good representations than active strategies.
Passive Oracles
Passive variations of the PAL algorithm consist of fixing the oracle behavior at iterations t ∈ [ T ] before starting the training. This formulation, under which the oracle does not leverage any information collected along training, recovers both SSL and supervised learning, depending on the querying strategy.
Self-Supervised Oracle. Probably the simplest oracle to describe is the one corresponding to the SSL strategy. The original VICReg algorithm [3] is made of T = N 0 E gradient updates over N = N 0 V E samples, where N 0 is the number of original samples, E the number of epochs, and V the number of views. At time t ∈ [ T ] , I t is chosen as { (2 t +1 , 2( t +1)) } , describing a positive pair generated on the fly from one original sample x i for i = t mod N 0 ; and J t is chosen as some mini-batch to estimate the covariance matrix of the features at the current epoch. Because it has been built to remove human feedback, SSL does not actually need to ask labelers to query G s,s +1 (where s = 2 t +1 ), since those entries are known to be one by construction.
Supervised Oracle. When it comes to a supervised learning oracle, the supervised learning loss provided in Theorem 1 -which is known to recover Y (given class templates) as per Theorem 2- is easily minimized with gradient descent based on (10). Hence a simple oracle to solve the supervised learning problem with stochastic gradient descent: at time t , consider a random pair of indices ( i t , j t ) and set I t = J t ← { ( i t , j t ) } . The querying of G i t ,j t can either be done on the fly, or, if the dataset is already annotated, be deduced from the labels as G i t ,j t = 1 { y i t = y j t } .
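The two passive oracles can be written as drop-in query functions for the generic loop (a sketch; the interface, returning the queried index set and its answers, is an illustrative choice):

```python
import numpy as np

def ssl_oracle(t):
    # positive pair generated on the fly from one original sample:
    # the answer is 1 by construction, so no human query is needed
    s = 2 * t + 1
    return [(s, s + 1)], {(s, s + 1): 1.0}

def supervised_oracle(t, y, rng):
    # query a random pair; the labeler answers 1{y_i = y_j}
    i, j = (int(v) for v in rng.integers(len(y), size=2))
    return [(i, j)], {(i, j): float(y[i] == y[j])}
```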
Algorithm 2: Passive Oracles
SSL oracle: I t = { (2 t +1 , 2( t +1)) } ; set G 2 t +1 , 2( t +1) = 1 by construction (no query needed); J t a minibatch, random in [ N ] .
Supervised oracle: I t = J t = { ( i t , j t ) } for a random pair ( i t , j t ) ; query G i t ,j t = 1 { y i t = y j t } .
Theoretical Remarks. Note that (10) is an unbiased formulation of VICReg, in the sense that
$$
\mathbb{E}_{I_t, J_t} \Big[ \sum_{(i,j) \in I_t} G_{ij} \| z_i - z_j \|^2 + \sum_{(i,j) \in J_t} R( z_i, z_j ) \Big] \propto \mathcal{L}_{\mathrm{VIC}}(\theta) + \mathrm{cst}.
$$
As a consequence, when θ ↦→ ∥ f θ ( X ) f θ ( X ) ⊤ - G ∥ 2 is strongly convex, Algorithm 1 with either the self-supervised or the supervised oracle will converge to the minimizer of the VICReg loss in O (1 /T ) [8]. Moreover, while this result holds for the empirical loss with resampling, it is equally possible to get a similar result for the minimization of the infinite-data (aka population) version of the VICReg loss, and for the recovery of the ideal embedding representation, when performing a single pass over the data. In particular, by making sure that J only charges pairs ( i, j ) for i and j in two disjoint subsets of [ N ] , one can prove convergence rates in O (1 /N ) (Theorem 3 in [12]).
Moreover, because the VICReg loss in Theorem 2 is nothing but a matrix factorization problem, one can directly translate results from this body of literature into PAL. In particular, recent works have derived theoretical results on the matrix factorization problem based on toy models of neural networks, which can be plugged in directly here to claim theoretical results about the soundness of the PAL algorithm with neural networks [50, 22, 29]. Since those results hold for any graph G , they apply to both SSL and supervised learning, highlighting how PAL jointly derives results for both.
Active Oracles
Seen through the eyes of PAL, supervised learning and SSL (which employ passive querying) can be improved by refining the oracle to choose the next indices I t and J t to process at time t .
Low-Cost and Efficient Active Learning. A crucial point of this study is that the active learning framework stemming from PAL differs fundamentally from classic active learning. In the latter, at time t , one asks for a fresh label y i t for some chosen index i t . Instead, PAL considers a batch of indices I t and asks for pairwise comparisons 1 { y i ∼ y j } for ( i, j ) ∈ I t . Rather than asking labelers for fine-grained labels, such as 'caracal' or 'numbfish' on ImageNet, PAL rather asks labelers whether two images are related, or even to spot outliers in a set of images compared to a template, as illustrated in Fig. 1. 2 This is particularly useful when spotting a few outliers in a batch of M images is much less costly than annotating M data points. On such instances, Criteo engineers found that batches of 15 images were a sweet spot in terms of labeling efficiency [5]; ImageNet itself was annotated by querying images on search engines and spotting outliers among the results [21]. Meanwhile, reCaptcha (illustrated in Fig. 1) is said to have helped annotate millions of images [9]. We refer the curious reader to [43] and references therein regarding the design of efficient user interfaces for those labeling tasks.
Zoo of Active Learning Strategies. By introducing PAL, we open a way to match the practice of active learning with its theory through a grounded framework that encompasses current heuristics to annotate big chunks of data.
2 This 'spot-the-outliers' strategy is formalized with I t = { ( i t , j ) | j ∈ ˜ I t } for i t representing the class template, and ˜ I t capturing the batch of data in which to spot outliers.
![Figure 3](2303.15256-figure_003.png)
Figure 3. Comparison of the active oracle of Algorithm 3 and the passive supervised one of Algorithm 2. Given q queries made, and the consequent reconstructed graph G q , we learn f θ t : X → R C by minimizing L VIC 2 , and plot the downstream mean-square error of the optimal linear classifier w ⊤ f θ t for the best w ∈ R C . Here X = R 2 , and y ∈ [4] spans four concentric circles (represented by the blue, red, green and orange classes), N = 100 , query batches are of size 10, and results are averaged over 100 trials (standard deviations represented by the colorized regions). Snapshots at different points on the curve show the third coordinate of the reconstructed f θ t , and the ideal linear classifier that can be learned from this embedding.
While the optimal oracle depends on the problem idiosyncrasies, as well as on the labeling cost model, the vast active learning literature provides many heuristics to design sophisticated oracles under different structural assumptions. One could query information based on the least certain predictions [27, 1]; based on the distance to the decision boundaries [44]; by comparing predictions made by different methods in an ensemble [7, 17]; or by finding the queries whose answers are the most likely to modify the current guess for f θ [49, 33, 31]. We refer the curious reader to Appendix A for further discussion on the matter. Thorough reviews, adaptations to PAL, ablation studies, and comparisons of those strategies on different benchmarks are left for future work.
PAL à la Captcha. A natural and easy property to leverage in order to build active learning strategies is the fact that the N 2 -entry matrix G is actually derived from the NC -entry matrix Y . In particular, one can recover the full graph G = G (sup) with fewer than NC pairwise queries, and in the best case only N queries; compare this to the N 2 entries that are queried by the supervised learning oracle. This idea is captured formally by the oracle described in Algorithm 3, where the matrix Q remembers past queries, and illustrated in Fig. 3. At time t , this oracle chooses to query against the class with the least number of known instances, chooses M data points, asks whether they match this class, and updates known labels accordingly. An advantage of the query strategy of Algorithm 3 is that one can stop at any time t and have a balanced labeled dataset to learn with.
Algorithm 3: Oracle à la Captcha
Data: class templates ( µ 1 , · · · , µ C ) ∈ X C ; Q ∈ R N × C initialized at zero.
Choose the class with the least known examples: j = arg min j ∈ [ C ] 1 ⊤ Q t e j ;
Collect pairwise comparisons Q ij ← 1 { x i ∼ µ j } for i in a batch B ⊂ [ N ] \ K t , where K t removes queries whose results are already known from Q t ;
Sampler: I t = J t ← all the new entries deduced in G ;
Labeler: human feedback Q ij ; deduction to fill G .
The basic Algorithm 3 can be improved in several ways. First, class templates can be deduced from initial queries: the first data point µ 1 = x 1 provides a first class template; after querying 1 { x 2 ∼ x 1 } , if the answer is negative, µ 2 = x 2 provides a second class template (otherwise it is part of class one); and so forth (if 1 { x = µ 1 } = · · · = 1 { x = µ k } = 0 , set µ k +1 = x ). Those templates could be refined during training by redefining each template as the example closest to the center of its class's examples, with some well-thought notion of distance (either in the input space or the embedding space). Second, when classes are unbalanced and class probabilities are roughly known, one should rather choose y ( t ) as the class that minimizes the number of known examples in the class divided by the probability of the class. Third, if the number of classes C is small, random sampling of the batch B will work well enough. Yet, when C is big, random sampling will mainly lead to negative observations and too few positive ones. In this situation, the algorithm is improved by training a classifier on the labels known at time t (possibly incorporating unlabeled data with semi-supervised learning techniques), and querying samples that were classified as the same class. Finally, to avoid only getting negative pairs on datasets with many classes, one could leverage hierarchy in the labels: if dealing with the classes of ImageNet, one can first ask reviewers for coarse-grained information, e.g. to flag pictures that are not fishes, before going deeper into the taxonomy.
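The querying loop of Algorithm 3, including the last-class deduction discussed above, can be sketched as follows (the label vector plays the role of the human answering 1{x_i ∼ µ_j}, and batch selection is simplified to the first M candidates; both are illustrative choices):

```python
import numpy as np

def captcha_labeling(y, C, M=5):
    """Recover all labels via batched 'does x_i match class j?' queries."""
    N = len(y)
    labels = np.full(N, -1)                   # -1: still unknown
    ruled_out = np.zeros((N, C), dtype=bool)  # recorded negative answers (the Q matrix)
    n_queries = 0
    while (labels < 0).any():
        counts = np.bincount(labels[labels >= 0], minlength=C)
        # class with the least known examples, among classes with remaining candidates
        avail = [c for c in range(C) if np.any((labels < 0) & ~ruled_out[:, c])]
        j = min(avail, key=lambda c: counts[c])
        batch = np.flatnonzero((labels < 0) & ~ruled_out[:, j])[:M]
        for i in batch:
            n_queries += 1
            if y[i] == j:                     # oracle answer to 1{x_i ~ mu_j}
                labels[i] = j
            else:
                ruled_out[i, j] = True
                if ruled_out[i].sum() == C - 1:
                    # all but one class ruled out: deduce the label for free
                    labels[i] = int(np.flatnonzero(~ruled_out[i])[0])
    return labels, n_queries
```

Each sample costs at most C − 1 queries, since the last class is always deduced rather than queried.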
Table 1. Best known performance on ImageNet for state-of-the-art SSL methods. Notice how NNCLR [23] achieves state-of-the-art performance on ImageNet thanks to an active rule for labelers in Algorithm 1, which consists in defining positive pairs as nearest neighbors in the embedding space, as detailed in Algorithm 4. This rule allows it to beat the passive strategies provided by SimCLR and VICReg.
Experiments
This section provides experimental details to validate the various theoretical results derived in the previous sections. In order to remove confounding effects linked with architecture, optimization, data curation and other design choices that might impact the empirical validation, we focus here on closed-form solutions based on kernel methods with synthetic datasets. Further real-world empirical validations are provided in Appendix C. In particular, Table 1 reminds the reader how NNCLR [23] succeeds in beating state-of-the-art SSL methods on ImageNet thanks to an active labeler oracle, which defines positive pairs as nearest neighbors in the embedding space f θ t ( X ) .
Kernel methods are rich 'linear' parametric models defined as f θ ( x ) = θ ⊤ ϕ ( x ) , for ϕ ( x ) and θ belonging to a separable Hilbert space H . Because those models can approximate any function [36], it is important to regularize θ in practice, either with early stopping in SGD, or with Tikhonov regularization, which can be written as λ Tr( Z ⊤ K -1 Z ) where λ > 0 is a regularization parameter and K ∈ R N × N is the kernel matrix defined as K ij = k ( x i , x j ) = ϕ ( x i ) ⊤ ϕ ( x j ) . In this setting, rather than matching the top of the spectral decomposition of G , the solution recovered by VICReg amounts to the top spectral decomposition of G - λ K -1 [12]. This allows us to compute the ideal representation f θ in closed form given any graph G based on the regularized kernel model f θ ( x ) = θ ⊤ ϕ ( x ) , hence ablating the effects that are unrelated to the theory described in this study. In this controlled setting, the superiority of active algorithms is undeniable, as illustrated on Fig. 3, where we show the optimal downstream error one can achieve with linear probing of the minimizer f θ of the VICReg loss. Experimental details and more extensive validations are provided in Appendix C: in particular, the use of contrastive versus non-contrastive graphs, i.e. graphs that set G ij = -1 when y i ≠ y j , is studied on Fig. 7; the ability to incorporate label knowledge in SSL methods is the object of Fig. 4; robustness to noise is shown on Fig. 8; and the relation between test error and the number of connected components of the reconstructed G is analyzed on Fig. 9.
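The closed-form representation above amounts to an eigendecomposition; the following sketch illustrates it with NumPy, where the RBF kernel, the jitter term, and all names are our own illustrative choices rather than the paper's exact setup:

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def closed_form_embedding(X, G, lam=1e-3, dim=2):
    """Top spectral decomposition of G - lam * K^{-1}, the VICReg minimizer
    under a regularized kernel model, as described in the text."""
    K = rbf_kernel(X) + 1e-6 * np.eye(len(X))  # jitter for invertibility
    M = G - lam * np.linalg.inv(K)
    M = (M + M.T) / 2                          # symmetrize before eigh
    _, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
    return vecs[:, -dim:]                      # keep the top-`dim` eigenvectors

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X[:, 0] > 0).astype(int)                  # two synthetic classes
G = (y[:, None] == y[None, :]).astype(float)   # supervised graph Y Y^T
Z = closed_form_embedding(X, G)
print(Z.shape)  # (20, 2)
```

Any graph G (self-supervised, supervised, or mixed) can be plugged into the same routine, which is what makes this setting convenient for controlled ablations.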

Figure 4. A major motivation of this paper is to be able to add prior information on sample relationships in SSL methods, and in particular, to have a simple way to leverage known labels. We do so by considering Ŷ containing the one-hot encodings of known labels, with zero rows otherwise, and the mixed graph G = (1 - α ) · G (ssl) + α · Ŷ Ŷ ⊤ . The setting is the same as Fig. 5 with N = 200 and two augmentations per sample. When zero labels are known (left of the plot), we are in the full SSL regime, while when all 200 labels are known (right of the plot), we recover supervised learning performance. When few labels are given, the effect of the supervised graph can be counterproductive if the mixing coefficient α is too big. However, when mixed properly, adding prior label information in SSL methods improves performance.
Conclusions
This work introduces PAL, a learning framework that revolves around the central concept of the similarity graph. We first showed how similarity graphs are the implicit backbone of self-supervised learning methods, and how this concept extends to tackle supervised learning problems. This observation does not solely unveil a rich learning framework; it also provides a single algorithm, based on a querying oracle, that can describe both SSL and supervised learning techniques, opening the way to new oracles that benefit from techniques stemming from both the supervised and self-supervised learning literature. Finally, PAL leads to an efficient formalization of active learning as performed in practice to annotate large datasets, potentially enabling fruitful exchanges between the practice and the theory of active learning. Promising directions for future work include empirical validations on large-scale datasets, as well as a theoretical study of the newly introduced active learning framework.
Experiments
Active Learning Algorithms
NNCLR oracle. State-of-the-art in the year of its release, NNCLR [23] proposes a variation of SSL that does not need human feedback but does update the similarity graph based on past observations. This furnishes strong evidence for the usefulness of active learning algorithms, even without human feedback. The NNCLR oracle consists in setting I t = J t equal to some minibatch at time t , but labeling as positive pairs in G ij only nearest neighbors, i.e.
$$
G_{ij} = \mathbf{1}\{ z_j \in \mathcal{N}(z_i; I_t) \}.
$$
We describe the corresponding active oracle in Algorithm 4.
Algorithm 4: Active Oracle with No Human Feedback as per [23]
NNCLR oracle: Sampler: I t = J t is some minibatch.
Labeler: G ij = 1 { z j ∈ N ( z i ; I ) } , where N ( z ; I ) designates the nearest neighbor of z in the batch I .
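This labeler can be sketched on a batch of embeddings as follows, assuming the nearest neighbor is taken in cosine similarity (our illustrative choice; `nnclr_graph` is our name):

```python
import numpy as np

def nnclr_graph(Z):
    """Within-batch NNCLR labeler: G_ij = 1 iff z_j is the nearest neighbor
    of z_i in the batch (self excluded), as in Algorithm 4."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # unit-normalize rows
    sim = Zn @ Zn.T                                    # cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-matches
    nn = sim.argmax(axis=1)                            # nearest neighbor of each z_i
    G = np.zeros((len(Z), len(Z)))
    G[np.arange(len(Z)), nn] = 1.0
    return G

Z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
G = nnclr_graph(Z)
print(G[0, 1], G[1, 0])  # 1.0 1.0: the first two points pick each other
```

Note that the resulting graph need not be symmetric: a point can be someone's nearest neighbor without the relation holding in reverse.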
The New Rise of Active Learning. Recently, active learning has become a focus of the machine learning community when training big models, as those models' performance is known to depend on the order in which they process data [45], as well as on the percentage of data of different types they ingest [46]. In particular, the best paper award at NeurIPS last year [44] suggested that the distance to the decision boundary could be leveraged smartly to reduce the number of data points needed by current AI algorithms to train state-of-the-art models (as compared to the scaling laws of [30]). We describe their sampling oracle in Algorithm 5. Note that in the original paper, they query the exact labels of the selected points, while PAL only needs to query pairwise comparisons. In the meantime, [17] suggested using ensemble active learning, while [1] suggested modeling uncertain predictions through gradient computations in deep learning models.
Algorithm 5: Active Sampler as per [44]
Sampler: Perform k-means clustering on Z t = f θ t ( X ) .
For each unlabeled point, compute the cosine distance to its cluster center.
if few examples have been labeled then query points close to their cluster centers (easy examples), else query points far from their cluster centers (hard examples).
From Coarse to Fine-grained Queries. While active learning usually assumes that the cost of answering any question is constant, in practice, some queries might be easier to answer than others. For example, if a child has never seen some object, such as a sophisticated designer chair, they might not easily provide pairwise comparisons regarding that object: they would be puzzled by the designer chair, and would hesitate to say that it is a chair. Similarly, it might be easier for human labelers to recognize attributes in an image, e.g. sandy fur, desert background, tufted ears, feline, rather than a precise species, such as 'caracal'. This has been the basis for weakly supervised learning [18, 40]. It has also motivated some bandit models, such as [13, 25]. More generally, it suggests that one could efficiently learn by first querying weak, coarse-grained information, before refining queries to get precise, fine-grained feedback. We illustrate this high-level idea with Algorithm 6.
Algorithm 6: From Coarse to Fine-grained Queries
Sampler: Your favorite active learning sampler.
if f θ t has not formed strong opinions on clusters then
Labeler: Human feedback for coarse-grained information (e.g. click on all animals with fur),
else
Labeler: Human feedback for fine-grained information (e.g. click on fishes that match a precise species).
Proofs
Proof of Theorem 1
VICReg Loss and Spectral Contrastive
This subsection identifies the VICReg loss with the spectral contrastive one through their matrix formulations.
Let us begin by reformulating the invariance term in VICReg. For Z defined in (1), it is generalized to multiple pairs through
$$
$$
where D is the degree matrix defined as a diagonal matrix, with A the number of augmented samples per original input
$$
$$
The variance-covariance term can be simplified by replacing the Hinge loss for the variance by a squared norm [32], by setting β = α = 1 and replacing A by N in D to regularize diagonal terms a bit more. Those simplifications lead to a more principled regularization term that enforces orthogonality over the dataset of the different features learned by the network f θ [11]. The consequent regularization reads ∥ Z ⊤ Z /N - I K ∥ 2 . As a consequence, VICReg can be understood as solving for
$$
$$
For the spectral contrastive loss, it is useful to incorporate negative pairs that are sampled from the same augmentation for two different samples, 〈 Z (1) i , Z (1) j 〉 , in the repulsive term. Moreover, adding ∥ Z ( v ) i ∥ on both the positive part and the negative part will not change much, since -2 x + x 2 is minimized for x = 1 . Those modifications lead to
$$
$$
The last term being a finite constant, it can be removed from the loss.
Indeed, one can relate the spectral contrastive loss and VICReg by remarking that
$$
$$
Finally, the variance-covariance term can be written as
$$
$$
where R ( a, b ) = ( a ⊤ b ) 2 - ∥ a ∥ 2 - ∥ b ∥ 2 .
The SimCLR Loss
SimCLR can be seen as a generalized linear model, where two variables A , B are observed and the probability of observing B given A is given by
$$
$$
For simplicity, let us define z̃ = z / ∥ z ∥ . SimCLR tries to maximize the likelihood of ( A, B ) , denoting random pairs coming from the same augmentation, based on the observation of the graph
$$
$$
The SimCLR loss is nothing but the negative log-likelihood.
$$
$$
Barlow Twins
When λ = 1 , which we will consider for simplicity, the BarlowTwins loss simplifies as
$$
$$
Because cross-correlations are normalized cross-covariances, it is useful to introduce Z̃ , the column-normalized version of Z . Formally, it is written in matrix form with the Hadamard product notation as
$$
$$
The way the cross-correlation is built can be generalized to multiple positive pairs as
$$
$$
As a consequence, the BarlowTwins loss can be rewritten with the sole use of G as
$$
$$
The SSL Losses for Supervised Learning
This subsection is devoted to the proof of Theorem 2.
Recovery Lemma
The backbone of Theorem 2 is the following lemma.
Lemma 1 (Equivalence between Y and G ) . Given any supervised classification similarity matrix G = Y Y ⊤ as in (6) , one can recover the corresponding one-hot label encoding Y , up to an orthogonal transformation R , as
$$
$$
where PDP ⊤ is the eigenvalue decomposition of the adjacency matrix G ( Y ) . Moreover, the rotation R is easily recovered by specifying the labels of C samples associated with each of the C different classes.
Proof. Lemma 1 follows from the fact that G = Y Y ⊤ so that Y is a square root of G , and that any two square roots of a matrix are isometric. In particular, if the SVD of Y is written as
$$
$$
the decomposition G = PDP ⊤ is an eigenvalue decomposition of G . The part P √ D is unique up to the application of a rotation on the right, which can be absorbed in R .
In order to recover Y from G , notice that, up to a permutation of rows and columns, G has a block-diagonal structure where each block corresponds to one label. If one label is given to each block, this allows us to retrieve exactly Y , and hence to identify R afterwards by solving R = ( P √ mD ) -1 Y .
While Lemma 1 describes the classification case, in the generic case, if the y i are categorical yet the loss ℓ ( y, z ) is not the zero-one loss, it is natural to define the similarity matrix as
$$
$$
For example, y i could be rankings modeled with y i ∈ S m where m ! = C , i.e. m = Γ -1 ( C ) -1 , and ℓ could be the Kendall loss. In this setting,
$$
$$
and Y is retrieved through Y = P √ D R L 1 / 2 , where PDP ⊤ is the eigenvalue decomposition of G , and R ∈ R C × C is an unknown rotation matrix that might be identified by specifying at most C labels associated with the C different classes, but might be identified with a smaller number of samples if ℓ has a strong structure implying that L is low-rank (see Eq. (11) in [38]). Indeed, the fact that, compared to (6), the graph (12) could be of much lower rank could lead to more efficient algorithms to reconstruct it in the active learning framework. In essence, it would better leverage the structure encoded by the loss ℓ .
Finally, in the regression setting, one can choose G ij = -y ⊤ i y j .
The VICReg Loss
The minimizers of the VICReg loss are characterized as follows, where we define the set of rotations O ( C, K ) as
$$
$$
The SimCLR Loss
The probabilistic interpretation of SimCLR states that the SimCLR loss tries to maximize the likelihood of the events
$$
$$
which translates into the minimization of
$$
$$
This is the cross-entropy between G ij and p ij defined in the proof of the characterization of SimCLR. If the minimization with respect to p ij were unconstrained, then one should match p ij ∝ G ij . Yet, the form p ij ∈ [exp( -1) , exp(1)] constrains it to a slightly different solution.
$$
$$
So, it is minimized for Z being a square root of the matrix G . This is possible when the rank of G , which is at most C since G = Y Y ⊤ , is less than the rank of Z , which is K . In this setting, since Y and Z are two square roots of G , we get
$$
$$
Remark that for two z̃ i , z̃ j whose indices i and j belong to different clusters defined by the graph G , the loss is an increasing function of the quantity exp( z̃ i ⊤ z̃ j ) . On the other hand, the loss is a decreasing function of exp( z̃ i ⊤ z̃ j ) when G ij = 1 . By symmetry, we deduce that z̃ i ⊤ z̃ j = cos( z i , z j ) should be one for all ( i, j ) such that G ij = 1 . When the number of samples per class is constant, we deduce by symmetry that the anchors for the different classes should be put at the extremities of the simplex with C vertices centered at the origin, rotated by an arbitrary matrix R ∈ O ( C -1) , which allows us to recover the different classes (without their explicit labels if not provided). When the different classes have different numbers of samples N i with ∑ i ∈ [ C ] N i = N , and their anchors in the output space are c i ∈ R K , we are trying to minimize
$$
$$
which will deform the simplex to have bigger angles between classes that are highly represented. For example, when N 1 = N 2 ≈ N/ 2 , we will have c 1 ≈ - c 2 while the other anchors are orthogonal to one another and to c 1 . Denoting by M ∈ R C × C the matrix that maps the anchor v i of class i for one solution of (14) to the i -th element of the canonical basis e i , i.e. v i M = e i , we get that the solutions are
$$
$$
The fact that Z is invariant to scaling each vector z reminds us of implementations of the cross-entropy, where, to avoid divergence to infinity (since the sigmoid is optimized at infinity), one has to normalize the solution. The SimCLR loss is actually built on the same generalized linear model as the cross-entropy, and one can roughly think of SimCLR as the SSL version of the cross-entropy.
BarlowTwins
To minimize the BarlowTwins loss
$$
$$
we want Z̃ to be a square root of the inverse of G . To be more precise, introduce the eigenvalue decomposition of G as G = PSP ⊤ where P ∈ R N × C and S ∈ R C × C , since G is at most of rank C . The minimizer of BarlowTwins is Z̃ = [ PS -1 / 2 , 0 N × ( K - C ) ] . Since Z̃ is the column-normalized version of Z , Z can be reconstructed for any diagonal matrix D ∈ diag( R K + ) as Z = Z̃ D . Incorporating [ S -1 , 0] in D , we get that the minimizers of the BarlowTwins loss are exactly the matrices Z = PS 1 / 2 [ D , 0 K - C ] for D ∈ diag( R C + ) . Moreover, since both PS 1 / 2 and Y are square roots of G , we know that there exists a rotation matrix R ∈ O ( K ) such that PS 1 / 2 = Y R . Altogether, we get that the minimizers of the BarlowTwins loss are exactly the
$$
$$
The fact that BarlowTwins does not care about the amplitude of the solution Z reminds us of discriminant analysis, which learns classifiers by optimizing ratios and angles.
Bayes optimum
For completeness, we now state a Bayes optimum proposition regarding the VICReg loss of the paper.
Proposition 1 (Bayes optimum) . When K ≥ C and there is no context, i.e. x i = x 0 for all i ∈ [ N ] , and the y i are sampled according to a noisy distribution ( y | x = x 0 ) , the naive study of the VICReg Bayes optimum is meaningless, since
$$
$$
Yet, if one frees the variable Z ∈ R N × K , we have
$$
$$
where ( e i ) i ∈ [ K ] is the canonical basis of R K .
Proof. For the first part of the proof, remark that the invariance term in VICReg will be zero for any z , so the VICReg loss is minimized by any vector that minimizes the variance-covariance term ∥ zz ⊤ - I ∥ 2 , which is achieved by any unit vector.
For the second part, remark that G = Y Y ⊤ has C connected components, which are all full cliques, i.e. the adjacency is filled with ones. As a consequence, the eigenvectors of G associated with non-zero eigenvalues are exactly the ( 1 { y i = y } ) i ∈ [ N ] for y ∈ [ C ] , and the corresponding eigenvalues are N y , where N y = ∑ i ∈ [ N ] 1 { y i = y } = N P ( Y = y ) is the number of elements in class y ∈ [ C ] . As a consequence, a square root of G is N 1 / 2 ( P ( Y = i ) 1 / 2 δ ij ) i ∈ [ C ] ,j ∈ [ K ] , hence the proposition, following the fact that all the square roots of G are isometric.
Additional experimental details
Essential Code
SSL Graph
SSL Graph.
G = torch.zeros(N * V, N * V)  # X in R^{N x D}, V views per sample
i = torch.arange(0, N * V).repeat_interleave(V - 1)  # row indices
j = (i + torch.arange(1, V).repeat(N * V) * N).remainder(N * V)  # column indices
G[i, j] = 1  # unweighted graph

Sup Graph.
Y = torch.nn.functional.one_hot(labels, num_classes=num_classes).float()
G = Y @ Y.T

VICReg.
C = torch.cov(Z.t())  # Z in R^{N x K}
reg_loss = torch.nn.functional.mse_loss(C, torch.eye(K))
reg_loss *= out_dim ** 2  # correct for mean vs sum
i, j = G.nonzero(as_tuple=True)
inv_loss = torch.nn.functional.mse_loss(Z[i], Z[j])  # pairwise L2 weighted by G_{i,j}
inv_loss *= out_dim
loss = beta * inv_loss + reg_loss

SimCLR.
Z_renorm = torch.nn.functional.normalize(Z, dim=1)  # Z in R^{N x K}
cosim = Z_renorm @ Z_renorm.t() / tau  # N x N matrix, tau is the temperature
mask = 1 - torch.eye(N, N, device=Z.device, dtype=Z.dtype)
loss = (G * (torch.logsumexp(cosim * mask, dim=1, keepdim=True) - cosim)).mean()

SCL.
Z = torch.nn.functional.normalize(Z, dim=1)
loss = torch.nn.functional.mse_loss(G, Z @ Z.T)
Controlled experiments
Setup
The train and test sets of Figure 3 are shown on Fig. 5. The similarity graphs corresponding to the different snapshots of Fig. 3 are shown on Fig. 6. In all the experiments, we consider K = C + 1 = 5 .

Figure 5. Setup for the controlled experiments of Fig. 3. The dataset is made of four concentric circles that correspond to four different classes, represented by different colors. The training dataset is made of one hundred random points, with some noise.

Figure 6. Graphs G corresponding to the different snapshots taken on Fig. 3. Grey indicates zeros, white indicates negative observations, and black indicates positive ones. The main strength of the active strategy in Algorithm 3 is that, by leveraging the underlying structure of the graph, it is able to deduce the full graph G much faster than the naive passive implementation that only asks for random query pairs. Basically, a positive observation is turned into many negative observations.
Contrastive vs. Non-Contrastive
Intuitively, it is useful to distinguish more explicitly between positive, negative and unknown relations, which we test on Fig. 7. To do so, the graph G is modified to encode semantically similar elements as positive edges and dissimilar ones as negative edges, while unknown relationships are represented by zeros:
$$
G_{ij} = \begin{cases} 1 & \text{if } x_i \text{ and } x_j \text{ are known to be similar,} \\ -1 & \text{if } x_i \text{ and } x_j \text{ are known to be dissimilar,} \\ 0 & \text{if their relationship is unknown.} \end{cases}
$$
One might wonder if this really improves performance. The comparison is the object of Fig. 7.

Figure 7. Comparison of the contrastive ( G ij ∈ {-1 , 0 , 1 } ) and non-contrastive ( G ij ∈ { 0 , 1 } ) variations of VICReg with N = 300 . The setting is the same as Fig. 3 with Algorithm 3. We remark the usefulness of distinguishing between negative pairs and unknown pairs, although some instability issues seem to appear for the contrastive method when few entries are known.
The Benefits of Incorporating Known Labels
A major motivation of this paper is to be able to add prior information on sample relationships in SSL methods, and in particular, to have a simple way to leverage known labels. Let us denote by Ŷ ∈ R N × C the matrix whose i -th row is the one-hot encoding of the label y i if it is known, and zero otherwise. The knowledge of some coefficients of the real Y leads to the knowledge of a few coefficients of G (sup) = Y Y ⊤ ; those can be added to the SSL graph to add useful connections deduced from the labels, leading to
$$
G = (1 - \alpha) \cdot G^{(\mathrm{ssl})} + \alpha \cdot \hat{Y} \hat{Y}^\top,
$$
where α ∈ [0 , 1] is a mixing coefficient stating how much the supervised information should weigh in the similarity matrix. Naively, we could set α = 1 / 2 , yet when only few labels are given this would destabilize the spectral decomposition of G too much, and we observe on Figure 4 that a small mixing coefficient is better. An explanation could be that the relations encoded by SSL are quite local and subtle, while the connections suggested by supervised learning are quite global and brutal, as they suggest to fold the input space, hence they need to be dampened when mixing the SSL and supervised graphs.
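Building this mixed graph takes only a few lines; `mixed_graph` and the toy inputs below are our illustrative names:

```python
import numpy as np

def mixed_graph(G_ssl, labels, num_classes, alpha=0.1):
    """Blend the SSL graph with the partial supervised graph:
    G = (1 - alpha) * G_ssl + alpha * Yhat Yhat^T, where Yhat has a one-hot
    row for each known label and a zero row otherwise."""
    Yhat = np.zeros((len(labels), num_classes))
    for i, y in enumerate(labels):
        if y is not None:          # only known labels contribute
            Yhat[i, y] = 1.0
    return (1 - alpha) * G_ssl + alpha * (Yhat @ Yhat.T)

G_ssl = np.eye(4)                  # toy SSL graph: self-loops only
labels = [0, 0, None, 1]           # samples 0 and 1 share a known label
G = mixed_graph(G_ssl, labels, num_classes=2, alpha=0.5)
print(G[0, 1])  # 0.5: a connection deduced from the labels alone
```

Samples with unknown labels contribute zero rows to Ŷ Ŷ ⊤, so they are only connected through the SSL graph, which is exactly the semi-supervised regime discussed in the text.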

Figure 8. Study of the effect of labeling noise. The setup is the same as Fig. 3, yet with N = 300 points. We consider having full access to Y , thus to G = Y Y ⊤ , yet we assume that a certain number of labels y i are corrupted. We see that the algorithm is somewhat robust to noise.
Robustness to noise
As mentioned in the last part of the paper, depending on the algorithm used, noise in query answers might lead to dramatic performance loss. In the main text, we were careful to describe algorithms that are robust to noise. The effect of noise in the labels for Algorithm 3 is studied in Fig. 8. Because of its structure, noise in the queries for Algorithm 3 is equivalent to noise in the labels y . This explains the setup of the figure.
The Importance to Recover Connected Components
An interesting experiment is provided by Fig. 9, which compares the test error and the number of connected components of the graph G as a function of the number of missing entries of G = G (sup) . In our synthetic experiment, G (sup) has four connected components corresponding to the four classes in the dataset, e.g. Fig. 2. Typically, based on the transitivity of the similarity relation ∼ , one can hope to only need a fraction O (1 /N ) = O ( NC/N 2 ) of the entries of G , i.e. reconstructed entries of G , to have a good sense of the global G , hence to learn f θ . Moreover, on Fig. 9, the test error can be relatively well predicted by the number of connected components of the graph G . This suggests creative ways to design active learning strategies that search to optimize the number of connected components of G . However, leveraging the transitivity of the similarity relationship to fill G efficiently might be limited when query answers are noisy, although the literature on error-correcting codes might be useful [47, 19]. Moreover, the binary (and transitive) nature of similarity can be questioned, as SSL sometimes uses DA that produces unrealistic images, and one might prefer to assign similarity scores. These problems do not occur with the transitivity-agnostic Algorithm 3.
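The number of connected components can be tracked cheaply with a union-find structure as positive query answers arrive, which is one way to instantiate the search strategy suggested above (a sketch with our own names):

```python
def count_components(n, positive_pairs):
    """Number of connected components of the graph on n nodes whose edges
    are the positively-answered similarity queries."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in positive_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                # merge the two components
    return len({find(i) for i in range(n)})

# Six samples; each positive answer can merge two components.
print(count_components(6, [(0, 1), (1, 2), (3, 4)]))          # 3
print(count_components(6, [(0, 1), (1, 2), (3, 4), (4, 5)]))  # 2
```

An active sampler could then prioritize queries between nodes lying in different components, since only those can reduce the count further.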

Figure 9. Comparison between the test error and the number of connected components in the graph G as a function of the percentage of missing entries in G . The test error is reported as in Fig. 3, but as a function of the missing entries of the supervised learning graph G (sup) . The standard deviation for the red curve is not represented here, as the number of connected components is highly concentrated around its mean.
Mixture of Gaussian
One can question whether the findings presented so far are specific to the concentric circles dataset. In order to assert the validity of those findings, we consider a second dataset, made of a mixture of Gaussians; formally,
$$
x \mid y \sim \mathcal{N}(e_y, \sigma I_C), \qquad y \in [C],
$$

Figure 10. Same figures as before with the mixture of Gaussians dataset. This dataset has the particularity that the downstream task can be solved with any orthogonal basis of R C . When no queries have been made, G = I N , and the spectral decomposition of this graph leads to a representation that can solve the downstream task, explaining why the downstream task can be solved when no queries have been made, or when all the entries of G are removed.
given a label y ∈ [ C ] , x is generated according to N ( e y , σ I C ) . The results are reported on Fig. 10 with σ = 0.3 .
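The dataset can be generated in a few lines (a sketch; we read σ as the noise scale, and `sample_mixture` is our name):

```python
import numpy as np

def sample_mixture(N, C=4, sigma=0.3, seed=0):
    """Mixture of Gaussians: draw y uniformly in [C], then sample
    x = e_y + sigma * noise, with e_y the canonical basis vector."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, C, size=N)
    X = np.eye(C)[y] + sigma * rng.normal(size=(N, C))
    return X, y

X, y = sample_mixture(200)
print(X.shape, y.shape)  # (200, 4) (200,)
```

Each class mean sits on a distinct axis of R^C, which is why any orthogonal basis solves the downstream task, as noted in the caption of Fig. 10.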
Real-world experiment
While it is hard to control all the factors that come into play when training a neural network on real data, our experiments suggest that what we have seen in controlled experiments transfers to real-world problems. In particular, we consider the CIFAR-10 dataset with a ResNet-18 architecture. The first stage was representation learning, where we used the VICReg loss to learn representations on the CIFAR-10 training set. In particular, we removed the classifier head of the ResNet and replaced it with two fully connected layers with batch norms. The number of output dimensions was set to K = 16 , and the number of hidden neurons was set to 4 K . After the representation was learned, we replaced the classifier head by a linear layer with K = C output dimensions and fit this last layer on the CIFAR-10 training set. The resulting network was then tested on the CIFAR-10 test set. Regarding hyperparameters (network, DA, optimizer), we fixed them in accordance with tutorials online (in particular the pytorch-lightning tutorial) in order to achieve high-performance results on CIFAR with SSL. In our first experiments, we stopped after two epochs of pretraining (since the output dimension is quite small, there is no need to train much longer) and twenty epochs downstream. The pretraining task used all the training data of CIFAR-10. We found that the representation learned with SSL achieved 28% accuracy on CIFAR-10 with linear probing, while the representation learned directly with the supervised graph achieved 63% accuracy. Meanwhile, training a ResNet whose classifier head is made of 60 hidden neurons and 10 output dimensions with the ground-truth labels and the mean-square error, in the exact same setting, also leads to a performance of 63%. In other terms, in these simple experiments, one can use the VICReg technique we derived here, or the MSE loss, and get the same performance.
Training for ten epochs on the upstream task (the minimization of the VICReg loss), and one hundred on the downstream one (the linear head fitting), we improved performance to 62% for SSL and 66% for the supervised learning graph. Furthermore, we did not perform extensive hyperparameter tuning, which suggests that the supervised learning performance could be even more competitive, since we took parameters that are known to be good for self-supervised learning techniques. All the code is available to reproduce our experiments.
| Modality | Oracle | Top-1 accuracy | Top-5 accuracy |
|---|---|---|---|
| Passive | SimCLR [16] | 71.7 | - |
| Passive | VICReg [3] | 73.2 | 91.1 |
| Active | NNCLR [23] | 75.6 | 92.4 |
Figure 1. Active Learning: given an input, the oracle returns its label (e.g., for ImageNet: Tinca tinca, Carassius auratus, Carcharodon carcharias, ...); an expensive oracle requiring expert knowledge. Positive Active Learning (PAL): given inputs, the oracle decides whether they are semantically related (yes/no); low-cost relationship information requiring reduced expertise (or a recaptcha).
Learning representations of data that can be used to solve multiple tasks, out-of-the-box and with minimal post-processing, is one of the main goals of current AI research [39, 35, 24]. Such representations are generally found by processing given inputs through Deep neural Networks (DNs). The main question around which contemporary research focuses deals with the choice of the training setting employed to tune the DN's parameters. A few different strategies have emerged, such as layerwise [6], reconstruction-based [48], and, more recently, Self-Supervised Learning (SSL) [14, 37]. In fact, due to the cost of labeling and the constantly growing size of datasets, recent methods have tried to drift away from traditional supervised learning [42]. Among existing training solutions, joint-embedding SSL has emerged as one of the most promising [34]. It consists in learning representations that are invariant along some known transformations while preventing dimensional collapse of the representation. Such invariance is enforced by applying some known Data-Augmentation (DA), e.g. translations for images, to the same input and making sure that the corresponding representations are the same.
Despite tremendous progress, several limitations remain in the way of a widespread deployment of SSL. In particular, it is not clear how to incorporate a priori knowledge into SSL frameworks beyond the usual tweaking of the loss and DAs being employed, although some efforts are being made [15, 53, 52]. Indeed, it is not surprising that vision-language pre-training has replaced SSL as the state-of-the-art to learn image representation [26], as those models are better suited to incorporate information stemming from captions that often come alongside images collected on the Internet.
In this study, we propose to redefine existing SSL losses in terms of a similarity graph, where nodes represent data samples and edges reflect known inter-sample relationships. Our first contribution stemming from this formulation provides a generic framework to think about learning in terms of similarity graphs: it yields a spectrum on which SSL and supervised learning can be seen as two extremes. Within this realm, those two extremes are connected through the similarity matrix, and in fact can be made equivalent by varying the similarity graph. Our second contribution naturally emerges from using such a similarity graph, unveiling an elegant framework to reduce the cost and expertise requirements of active learning, summarized by:
Tell me who your friends are, and I will tell you who you are.
Active learning, which aims to reduce supervised learning costs by only asking an oracle for sample labels when needed [41, 20, 27, 31], can now be formulated in terms of relative sample comparisons rather than absolute sample labeling. This much more efficient and low-cost approach is exactly the active learning strategy stemming from our framework: rather than asking for labels, one asks whether two (or more) inputs belong to the same class or not, as depicted in Fig. 1. We coin such a strategy Positive Active Learning (PAL), and we will present some key analysis of the benefits of PAL over traditional active learning. We summarize our contributions below:
We provide a unified learning framework based on the concept of similarity graph, which encompasses self-supervised learning, supervised learning, as well as semi-supervised learning and many variants.
We derive a generic PAL algorithm based on an oracle that queries the underlying similarity graph (Algorithm 1). The different learning frameworks (SSL, supervised, and so forth) are recovered by different oracles, which can be combined to benefit from the strengths of each framework.
We show how PAL extends into an active learning framework based on similarity queries that provides low-cost efficient strategies to annotate a dataset (Fig. 1).
All statements of this study are proven in Appendix B, code to reproduce experiments is provided at https://github.com/facebookresearch/pal.
This section provides a brief reminder of the main self-supervised learning (SSL) methods, their associated losses, and common notations for the remainder of the study.
A common strategy to learn a model in machine learning is to curate labeled examples $({\bm{x}}_n, y_n)_n$, and to learn a model that, given ${\bm{x}}_n \in \mathcal{X} \triangleq \mathbb{R}^D$ as input, outputs $y_n \in [C]$, hoping that this model will learn to recognize patterns and relations that generalize to new, unseen input data. Yet, as datasets grew larger and annotating data became a major bottleneck, machine learning shifted its attention to learning methods that do not require knowledge of $y_n$. SSL has emerged as a powerful solution to circumvent the need for expensive and time-consuming labeling. It learns an embedding $f : \mathcal{X} \to \mathbb{R}^K$ for a small $K$ by enforcing either reconstruction properties, or some invariance and symmetry, onto a learned representation. SSL also relies on a set of observations ${\bm{X}} = \{{\bm{x}}_n\}_{n=1}^N \in \mathbb{R}^{N \times D}$; yet instead of labels $y_n$, it requires known pairwise positive relations that indicate whether two samples are semantically similar or not. For simplicity, we shall focus on the joint-embedding framework, where those positive pairs are artificially generated on the fly by applying Data Augmentations (DA), e.g. adding white noise or masking, to the same input. Let ${\mathcal{T}}_1, {\mathcal{T}}_2 : \mathcal{X} \to \mathcal{X}$ denote the generators of two (random) DAs ${\mathcal{T}}_1({\bm{x}})$ and ${\mathcal{T}}_2({\bm{x}})$ from an input ${\bm{x}}$, $f_\theta : \mathbb{R}^D \to \mathbb{R}^K$ the parametric model to be learned, and
where $({\bm{z}}^{(1)}_n, {\bm{z}}^{(2)}_n)$, the $n^{\rm th}$ rows of ${\bm{Z}}^{(1)}$ and ${\bm{Z}}^{(2)}$ respectively, form the $n^{\rm th}$ positive pair associated to sample ${\bm{x}}_n$. Using (1), different SSL losses employ different measures of invariance and dimensional collapse. Typically, the losses are minimized with gradient descent and backpropagation to learn $\theta$.
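As a toy illustration of this setup, the sketch below builds the two views and their embeddings; the additive-noise augmentation and the linear model are our own simplifications, not the paper's actual architecture:

```python
import numpy as np

# Hypothetical sketch of Eq. (1): two random augmentations T1, T2 (here,
# additive white noise) are applied to the same inputs, and both views are
# embedded with a model f_theta (here, linear for simplicity).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))               # N = 8 inputs in R^4
theta = rng.normal(size=(4, 2))           # f_theta(x) = theta^T x, so K = 2

def augment(X: np.ndarray) -> np.ndarray:
    # a stand-in DA: each call draws fresh noise
    return X + 0.1 * rng.normal(size=X.shape)

Z1 = augment(X) @ theta                   # Z^(1): embeddings of first views
Z2 = augment(X) @ theta                   # Z^(2): embeddings of second views
# row n of Z1 and row n of Z2 form the n-th positive pair
```

The two views differ only through the augmentation randomness, which is exactly the information SSL losses exploit.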
VICReg. With the above notations, the VICReg loss [3] reads, with hyper-parameters $\alpha, \beta > 0$,
BarlowTwins. BarlowTwins [51] is built on the cross-correlation matrix ${\bm{C}} = \operatorname{CoSim}({\bm{Z}}^{(1)\top}, {\bm{Z}}^{(2)\top})$; with hyper-parameter $\lambda$ it reads
Spectral Contrastive Loss. Finally, the spectral contrastive loss [28] is a theory-friendly proxy for SSL, reading
In particular, as proven in Section B.1.1, (5) recovers VICReg (2) when the ReLU-hinge loss is replaced by the mean-square error, hence the denomination VIC2.
The Commonality between SSL Losses. All the above losses (Eqs. 2 to 5) combine two terms: (i) a matching term between positive pairs, and (ii) a term to avoid collapse towards predicting a constant solution for all inputs. (i) can take different forms, such as the squared norm between ${\bm{Z}}^{(1)}$ and ${\bm{Z}}^{(2)}$ (2), the opposite of their scalar product (5) or of their cosine (3), or the squared norm between the centered cosine and one (4). (ii) can also take different forms, such as the InfoNCE softmax (5), or an energy that enforces richness of the learned features, such as the variance-covariance regularization in (2), which forces the different coordinates of $f_\theta$ to be orthogonal [12].
While at face value those losses seem distinct, they actually all combine some variants of (i) and (ii), and, even more importantly, they all rely on the exact same information of positive inter-sample relations for (i). This is exactly what Section 3 dives into, as a means to unify SSL losses along with supervised learning methods.
The goal of this section is to unify SSL and supervised learning through the introduction of a special object: a similarity graph.
Throughout this study, a similarity graph denotes a graph whose nodes represent data samples and whose edges reflect similarity relationships. Formally, such a graph is expressed through its symmetric adjacency matrix ${\bm{G}} \in \mathbb{R}^{N \times N}$, the semantic relation between inputs $i$ and $j$ being encoded in the real entry ${\bm{G}}_{i,j}$. The remainder of this section demonstrates (i) how SSL losses are implicitly based on a similarity graph, and (ii) how those losses tackle the supervised learning problem when provided with a richer graph ${\bm{G}}$.
Supervised learning. In addition to the $N$ input samples ${\bm{X}} \in \mathbb{R}^{N \times D}$, supervised learning has access to paired labels ${\bm{y}} \triangleq [y_1, \dots, y_N]$. For clarity, we focus here on categorical labels, i.e. $y_n$ belongs to $\{1, \dots, C\}$ for $C$ the number of classes.¹While we focus here on classification for simplicity, our approach easily extends to generic problems involving a loss $\ell$ by defining the graph as ${\bm{G}}_{ij} = -\ell(y_i, y_j)$. In classification, $\ell$ could be the zero-one loss $\ell(y_i, y_j) = \bm{1}_{\{y_i \neq y_j\}}$, and ${\bm{G}}_{ij} \simeq 1 - \ell(y_i, y_j)$. See Section B.2.1 for details. The one-hot encoding of ${\bm{y}}$ will be denoted by the matrix ${\bm{Y}} \in \mathbb{R}^{N \times C}$. In terms of the similarity graph ${\bm{G}}$, the label-based relation becomes naturally encoded as ${\bm{G}}_{i,j} = \bm{1}_{\{y_i = y_j\}}$, or equivalently
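To make the encoding concrete, the supervised graph can be sketched in a few lines (illustrative code with hypothetical names, not the released implementation):

```python
import numpy as np

# Sketch of the supervised similarity graph of Eq. (6):
# G[i, j] = 1 if samples i and j share the same label, else 0.
def supervised_graph(y: np.ndarray) -> np.ndarray:
    return (y[:, None] == y[None, :]).astype(float)

y = np.array([0, 1, 0, 2, 1])             # N = 5 samples, C = 3 classes
Y = np.eye(3)[y]                          # one-hot encoding, shape (5, 3)
G = supervised_graph(y)

# G coincides with the outer product of the one-hot encodings: G = Y Y^T
assert np.allclose(G, Y @ Y.T)
```

The identity $G = YY^\top$ checked in the last line is precisely why the graph almost explicitly encodes the labels.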
A key observation that we must emphasize is that the graph ${\bm{G}}$ almost explicitly encodes the labels ${\bm{Y}}$, which will be made explicit with Theorem 2.
Multiple Epochs with Data Augmentation. When DA is employed and training is carried out for $E$ epochs, the original $N$ input samples are transformed into $N \times E$ “augmented” samples. For more generality, since DA will also be used in SSL, let us denote by $A \in \mathbb{N}^*$ the number of augmentations, where here $A = E$. We now have the augmented dataset
where each ${\mathcal{T}}$ has its own randomness. When available, i.e. for supervised learning, the corresponding “augmented” labels ${\bm{Y}}^{(\sup)}$ are given by repeating $A$ times each row of ${\bm{Y}}$, formally written with the Kronecker product ${\bm{Y}}^{(\sup)} \triangleq {\bm{Y}} \otimes \bm{1}_A$, and from that, we can now define the supervised dataset and associated graph, extending (6) to the case of training with multiple epochs and DA
The resulting graph (7) is depicted on the left part of Fig. 2.
Self-Supervised Learning. SSL does not rely on labels, but on positive pairs/tuples/views generated at each epoch. Let us denote by $V$ the number of positive views generated, commonly $V = 2$ for positive pairs as modeled in (1). With $E$ the total number of epochs, SSL produces $V \times E$ samples semantically related to each original sample ${\bm{x}}_n$ through the course of training, i.e. $A = V \times E$ in SSL while $A = E$ in supervised learning. The total number of samples is thus $N \times E \times V$, defining the dataset and associated graph
where the associated similarity graph ${\bm{G}}^{(\operatorname{ssl})}$, now of size $NEV \times NEV$, captures whether two samples were generated as DAs of the same original input.
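The block structure of this graph can be sketched directly (a minimal illustration in our notation, with hypothetical function names):

```python
import numpy as np

# Sketch of the SSL similarity graph over N*V augmented samples:
# the V views of the same original input are pairwise connected,
# giving a block-diagonal adjacency matrix.
def ssl_graph(n_samples: int, n_views: int) -> np.ndarray:
    # ones within each group of views, zeros across groups
    return np.kron(np.eye(n_samples), np.ones((n_views, n_views)))

G = ssl_graph(n_samples=3, n_views=2)     # 6 x 6 adjacency matrix
assert G[0, 1] == 1.0                     # two views of the same input
assert G[0, 2] == 0.0                     # views of different inputs
```

Contrasting this block-diagonal matrix with the label-based graph $YY^\top$ makes the spectrum between SSL and supervised learning visually apparent.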
This section reformulates the different SSL losses through the sole usage of the similarity graph ${\bm{G}}^{(\operatorname{ssl})}$. To lighten notations, and without loss of generality, we redefine ${\bm{X}} \in \mathbb{R}^{N \times D}$ to denote the full dataset, i.e. $N \leftarrow NEV$, with ${\bm{X}} = {\bm{X}}^{(\sup)}$ for supervised learning with $V \times E$ epochs, or ${\bm{X}} = {\bm{X}}^{(\operatorname{ssl})}$ for SSL with $E$ epochs and $V$ views. The model embedding is shortened to ${\bm{Z}} \triangleq f_\theta({\bm{X}}) \in \mathbb{R}^{N \times K}$ as per Eq. 1.
VICReg (2), SimCLR (3), and BarlowTwins (4) losses can be expressed in terms of the graph ${\bm{G}}$ (8)
where ${\bm{D}} = \operatorname{diag}({\bm{G}}\bm{1})$ is the degree matrix of ${\bm{G}}$; with $\tilde{\bm{z}} \triangleq {\bm{z}} / \|{\bm{z}}\|$ and $\tilde{\bm{Z}}$ the column-normalized ${\bm{Z}}$, so that each column has unit norm.
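A minimal numerical sketch in the spirit of this graph formulation (the exact constants and terms differ across the losses above; this is an illustration, not the paper's code):

```python
import numpy as np

# Illustrative graph-based SSL objective: a Laplacian invariance term
# Tr(Z^T (D - G) Z) pulling connected samples together, plus a
# regularizer ||Z^T Z / N - I||^2 discouraging dimensional collapse.
def graph_ssl_loss(Z: np.ndarray, G: np.ndarray) -> float:
    N, K = Z.shape
    D = np.diag(G.sum(axis=1))            # degree matrix D = diag(G 1)
    invariance = np.trace(Z.T @ (D - G) @ Z)
    anticollapse = np.linalg.norm(Z.T @ Z / N - np.eye(K)) ** 2
    return float(invariance + anticollapse)

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 2))               # embeddings of 6 augmented samples
G = np.kron(np.eye(3), np.ones((2, 2)))   # positive pairs of views
loss = graph_ssl_loss(Z, G)
```

Since $D - G$ is a graph Laplacian and hence positive semi-definite, both terms are non-negative.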
In essence, SSL is about making sure that representations match for samples deemed similar through the design of data augmentations. As such, it is not surprising that one can rewrite SSL losses through the sole usage of the similarity graph. From Theorem 1, the attentive observer will notice that VICReg is akin to Laplacian eigenmaps or multidimensional scaling, SimCLR is akin to cross-entropy, and BarlowTwins is akin to canonical correlation analysis; observations already made in the literature [2] and reinforced above.
Beyond recovering such representation learning losses, our goal is to go one step further and to tie SSL and supervised learning through the lens of 𝑮𝑮{\bm{G}}, which follows in the next section.
What happens if one takes the different SSL losses, but replaces the usual SSL graph ${\bm{G}}^{(\rm ssl)}$ with the supervised one ${\bm{G}}^{(\sup)}$?
It turns out that the learned representations emerging from such losses are identical, up to some negligible symmetries that can be corrected for when learning a linear probe, to the one-hot encoding of ${\bm{Y}}$. To make our formal statement (Theorem 2) clearer, we introduce the set of optimal representations that minimize a given loss:
where “method” refers to the different losses.
When $K \geq C$, and ${\bm{Z}} = f_\theta({\bm{X}})$ is unconstrained (e.g. interpolation regime with a rich function class), the SSL losses as per Theorem 1 with the supervised graph (7) solve the supervised learning problem with:
where ${\bm{R}} \in O$ means that ${\bm{R}}$ is a rotation matrix as defined for the VICReg loss, $\operatorname{diag}_+ = \operatorname{diag}(\mathbb{R}^N_+)$ is the set of diagonal matrices with positive entries, i.e. renormalization matrices, and ${\bm{M}}$ is a matrix that maps a deformation of the simplex into the canonical basis. Moreover, provided class templates, i.e. $C$ data points associated with each of the $C$ classes, ${\bm{Y}}$ is easily retrieved from any method and ${\bm{Z}} \in {\mathcal{S}}_{\rm method}$.
In essence, Theorem 2 states that SSL losses solve the supervised learning problem when the employed graph ${\bm{G}}$ is ${\bm{G}}^{(\sup)}$. Moreover, the matrix ${\bm{D}}$ appearing in Theorem 2 captures the fact that SimCLR solutions are invariant to a rescaling of the logits and akin to the cross-entropy loss, while BarlowTwins is invariant to column renormalization of ${\bm{Z}}$ and akin to discriminant analysis. Lastly, VICReg might be thought of as a proxy for the least-squares loss. At a high level, Theorem 2 suggests fruitful links between the spectral embedding techniques captured in Theorem 1 and supervised learning. We leave for future work the investigation of this link and the translation of spectral embedding results into the realm of supervised learning.
While Theorem 2 describes what we have coined the “interpolation optimum”, i.e. the solution in the interpolation regime with rich models, we ought to highlight that the classical statistical learning literature analyzes losses under the light of the “Bayes optimum”, i.e. solutions in a noisy, context-free setting [4]. Those Bayes optima do not make as much sense for losses that intrinsically relate different inputs, yet for completeness we provide such a proposition on the Bayes optimum in Section B.3.
Now that we have demonstrated how one should focus on the graph ${\bm{G}}$, rather than on the (self-)supervised loss, we turn our focus to obtaining that graph ${\bm{G}}$. In particular, we propose an active learning framework that discovers ${\bm{G}}$ through efficient, low-cost queries.
From our understanding (Theorem 2), the difficulties of both supervised learning and SSL are the same: they need a correct graph ${\bm{G}}$, i.e. they need to identify samples that are semantically similar, either through label annotations or through the right design of DA.
Our framework suggests a generic way to proceed, having fixed the samples ${\bm{X}}$ in advance, and without much a priori knowledge of the similarity graph ${\bm{G}}$. In an active learning spirit, one would like to design a query strategy to discover ${\bm{G}}$, and an update rule for the learned parameter $\theta$. To ground the discussion, let us focus on VICReg. The variance-covariance term can be rewritten with $R(a, b) = (a^\top b)^2 - \|a\|^2 - \|b\|^2$; this leads to the formula, proven in Section B.1.1,
where $I = J = [N]^2$. An oracle would typically consider two small sets of indices $I, J \subset [N]^2$, ask labelers to provide ${\bm{G}}_{ij}$ for $(i, j) \in I$, and, given a step size $\gamma_t$, update the weights with
which can be performed with the sole access to $({\bm{G}}_{ij})_{(i,j) \in I}$. The pairs in $J$ are used to avoid dimensional collapse, and in particular, for the VICReg loss, to compute the variance-covariance regularization term. The complete picture leads to PAL, Algorithm 1. A particularly useful feature of SGD for active learning is its robustness to labeling noise [10]. In other terms, Algorithm 1 is robust to noise in the query answers.
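The resulting training loop can be sketched as follows; this is a deliberately minimal rendition (linear model, single-pair queries, a plain factorization residual without the variance-covariance term), not the paper's Algorithm 1:

```python
import numpy as np

# Minimal PAL-style loop: at each step an oracle is queried for one entry
# G_ij, and a linear model Z = X @ theta takes a gradient step on the
# matrix-factorization residual 0.5 * (z_i^T z_j - G_ij)^2.
rng = np.random.default_rng(0)
N, D, K = 20, 5, 3
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=N)            # hidden labels, known to the oracle
theta = 0.1 * rng.normal(size=(D, K))

def oracle(i: int, j: int) -> float:
    # supervised oracle: "do samples i and j share a class?"
    return float(y[i] == y[j])

for t in range(200):
    i, j = rng.integers(0, N, size=2)
    g = oracle(i, j)                      # one low-cost pairwise query
    zi, zj = X[i] @ theta, X[j] @ theta
    err = zi @ zj - g                     # scalar residual
    # gradient of 0.5 * err^2 with respect to theta
    grad = err * (np.outer(X[i], zj) + np.outer(X[j], zi))
    theta -= 0.01 * grad
```

Each iteration consumes a single oracle answer, which is the sense in which PAL trades labels for cheap pairwise queries.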
We will now dive more in-depth into two variants of oracles: passive and active ones. As we will see, passive oracles recover traditional SSL as a special case, but are much less efficient at learning good representations than active strategies.
Passive variations of the PAL algorithm consist in fixing the oracle behavior at iterations $t \in [T]$ before starting the training. This formulation, under which the oracle does not leverage any information collected along training, recovers both SSL and supervised learning, depending on the querying strategy.
Self-Supervised Oracle. Probably the simplest oracle to describe is the one corresponding to the SSL strategy. The original VICReg algorithm [3] is made of $T = N_0 E$ gradient updates with $N = N_0 V E$ samples, where $N_0$ is the number of original samples, $E$ the number of epochs, and $V$ the number of views. At time $t \in [T]$, $I_t$ is chosen as $\{(2t+1, 2(t+1))\}$, describing a positive pair generated on the fly from one original sample ${\bm{x}}_i$ for $i = t \bmod N_0$; and $J_t$ is chosen as some mini-batch to estimate the covariance matrix of the features at the current epoch. Because it has been built to remove human feedback, SSL actually does not need to ask labelers to query ${\bm{G}}_{s,s+1}$ (where $s = 2t+1$), since it is known by construction that those entries are going to be one.
Supervised Oracle. When it comes to a supervised learning oracle, the supervised learning loss provided in Theorem 1, which is known to recover ${\bm{Y}}$ (given class templates) as per Theorem 2, is easily minimized with gradient descent based on (10). Hence a simple oracle to solve the supervised learning problem with stochastic gradient descent: at time $t$, consider a random pair of indices $(i_t, j_t)$ and set $I_t = J_t \leftarrow \{(i_t, j_t)\}$. The querying of ${\bm{G}}_{i_t, j_t}$ can either be done on the fly, or, if the dataset is already annotated, deduced from the knowledge of ${\bm{G}}_{i_t, j_t} = \bm{1}_{\{y_{i_t} = y_{j_t}\}}$.
Theoretical Remarks. Remarking that (10) is an unbiased formulation of VICReg, in the sense that
As a consequence, when $\theta \mapsto \|f_\theta({\bm{X}}) f_\theta({\bm{X}})^\top - {\bm{G}}\|^2$ is strongly convex, Algorithm 1 with either the self-supervised or the supervised oracle will converge to the minimizer of the VICReg loss in $O(1/T)$ [8]. Moreover, while this result holds for the empirical loss with resampling, it is equally possible to get a similar result for the minimization of the infinite-data (aka population) version of the VICReg loss and the recovery of the ideal embedding representation, when performing a single pass over the data. In particular, by making sure that $J$ only charges pairs $(i, j)$ for $i$ and $j$ in two disjoint subsets of $[N]$, one can prove convergence rates in $O(1/N)$ (Theorem 3 in [12]).
Moreover, because the VICReg loss in Theorem 2 is nothing but a matrix factorization problem, one can directly translate results from this body of literature into PAL. In particular, recent works have derived theoretical results on the matrix factorization problem based on toy models of neural networks, which might be plugged in directly here to claim theoretical results about the soundness of the PAL algorithm with neural networks [50, 22, 29]. Since those results hold for any graph ${\bm{G}}$, they directly apply to both SSL and supervised learning, highlighting how PAL jointly derives results for SSL and supervised learning.
Seen through the eyes of PAL, supervised learning and SSL, which employ passive querying, can be improved by refining the oracle to choose the next indices $I_t$ and $J_t$ to process at time $t$.
Low-Cost and Efficient Active Learning. A crucial point of this study is that the active learning framework stemming from PAL differs fundamentally from classic active learning. In the latter, at time $t$, one asks for a fresh label $y_{i_t}$ for some chosen index $i_t$. Instead, PAL considers a batch of data $I_t$ and asks for pairwise comparisons $\bm{1}_{\{y_i \sim y_j\}}$ for $(i, j) \in I_t$. Rather than asking labelers for fine-grained labels, such as “caracal” or “numbfish” on ImageNet, PAL would rather ask labelers if two images are related, or even to spot outliers in a set of images compared to a template, as illustrated in Fig. 1.²This “spot-the-outliers” strategy is formalized with $I_t = \{(i_t, j) \,|\, j \in \tilde{I}_t\}$ for $i_t$ representing the class template, and $\tilde{I}_t$ capturing the batch of data to spot outliers in. This is particularly useful when the cost of spotting a few outliers in a batch of $M$ images is much smaller than that of annotating $M$ data points. On such instances, Criteo engineers found that batches of 15 images were a sweet spot in terms of labeling efficiency [5]; while ImageNet was annotated by querying images on search engines and spotting outliers among the results [21]. Meanwhile, reCaptcha (illustrated in Fig. 1) is said to have helped annotate millions of images [9]. We refer the curious reader to [43] and references therein regarding the design of efficient user interfaces for those labeling tasks.
Zoo of Active Learning Strategies. By introducing PAL, we open a way to match the practice of active learning with its theory through a grounded framework that encompasses current heuristics to annotate big chunks of data. While the optimal oracle depends on the problem idiosyncrasies, as well as on the labeling cost model, the vast active learning literature provides many heuristics to design sophisticated oracles under different structural assumptions. One could query information based on the least certain predictions [27, 1]; based on the distance to the decision boundaries [44]; by comparing predictions made by different methods in an ensemble [7, 17]; or by finding the queries whose answers are the most likely to modify the current guess for $f_\theta$ [49, 33, 31]. We refer the curious reader to Appendix A for further discussion on the matter. Thorough reviews, adaptations to PAL, ablation studies, and comparisons of those strategies on different benchmarks are left for future work.
PAL à la Captcha. A natural and easy property to leverage in order to build active learning strategies is the fact that the $N^2$-entry matrix ${\bm{G}}$ is actually derived from the $NC$-entry matrix ${\bm{Y}}$. In particular, one can recover the full graph ${\bm{G}} = {\bm{G}}^{(\sup)}$ with fewer than $NC$ pairwise queries, and in the best case only $N$ queries; compare this to the $N^2$ entries queried by the supervised learning oracle. This idea is captured formally with the oracle described in Algorithm 3, where the matrix ${\bm{Q}}$ remembers past queries, and illustrated in Fig. 3. At time $t$, this oracle chooses to query against the class with the fewest known instances, chooses $M$ data points, asks if they match this class, and updates the known labels accordingly. An advantage of the query strategy of Algorithm 3 is that one can stop at any time $t$ and have a balanced labeled dataset to learn with.
The basic Algorithm 3 can be improved in several ways. First, class templates can be deduced from initial queries: the first data point $\mu_1 = {\bm{x}}_1$ provides a first class template; after querying $\bm{1}_{\{{\bm{x}}_2 \sim {\bm{x}}_1\}}$, if the answer is negative, $\mu_2 = {\bm{x}}_2$ provides a second class template (otherwise it is part of class one); and so forth (if $\bm{1}_{\{{\bm{x}} \sim \mu_1\}} = \cdots = \bm{1}_{\{{\bm{x}} \sim \mu_k\}} = 0$, set $\mu_{k+1} = {\bm{x}}$). Those templates could be refined during training by redefining each template as the example closest to the center of its class examples, under some well-thought notion of distance (either in input space or embedding space). Second, when classes are unbalanced and class probabilities are roughly known, one should rather choose $y(t)$ to be the class that minimizes the number of known examples in this class divided by the probability of this class. Third, if the number of classes $C$ is small, random sampling of the batch $B$ will work well enough. Yet, when $C$ is big, random sampling will mainly lead to negative observations and too few positive ones. In this situation, the algorithm is improved by training a classifier based on the labels known at time $t$ (possibly incorporating unlabeled data with semi-supervised learning techniques), and querying labels that were classified as the same class. Finally, to avoid only getting negative pairs on datasets with many classes, one could leverage hierarchy in the labels: when dealing with the classes of ImageNet, one can first ask reviewers for coarse-grained information, e.g. flagging pictures that are not fishes, before going deeper into the taxonomy.
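The template-discovery idea can be sketched as follows; this is an illustrative simplification (sequential queries, an exact oracle), not the paper's Algorithm 3:

```python
import numpy as np

# Sketch: label a dataset using only pairwise "same class?" queries
# against a growing list of class templates, never asking for a label.
def annotate_with_pairwise_queries(y_true):
    templates = []                        # indices of one sample per class
    labels, n_queries = [], 0
    for i, yi in enumerate(y_true):
        assigned = None
        for c, t in enumerate(templates): # query against known templates
            n_queries += 1
            if yi == y_true[t]:           # oracle answer 1{x_i ~ template}
                assigned = c
                break
        if assigned is None:              # new class discovered
            templates.append(i)
            assigned = len(templates) - 1
        labels.append(assigned)
    return np.array(labels), n_queries

y_true = np.array([0, 1, 0, 2, 1, 2, 0])
labels, n_queries = annotate_with_pairwise_queries(y_true)
# labels are recovered up to a permutation of class names,
# with at most N * C pairwise queries
assert n_queries <= len(y_true) * 3
```

In the best case (one class) this uses exactly $N$ queries, matching the counting argument above.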
This section provides experimental details to validate the various theoretical results derived in previous sections. In order to remove confounding effects linked with architecture, optimization, data curation, and other design choices that might impact the different empirical validations, we focus here on closed-form solutions based on kernel methods with synthetic datasets. Further real-world empirical validations are provided in Appendix C. In particular, Table 1 reminds the reader how NNCLR [23] succeeded in beating state-of-the-art SSL methods on ImageNet thanks to an active labeler oracle, which defines positive pairs as nearest neighbors in the embedding space $f_{\theta_t}(\mathcal{X})$.
Kernel methods are rich “linear” parametric models defined as $f_\theta({\bm{x}}) = \theta^\top \phi({\bm{x}})$, for $\phi({\bm{x}})$ and $\theta$ belonging to a separable Hilbert space $\mathcal{H}$. Because those models can approximate any function [36], it is important to regularize $\theta$ in practice, either with early stopping in SGD, or with Tikhonov regularization, which can be written as $\lambda \operatorname{Tr}({\bm{Z}}^\top {\bm{K}}^{-1} {\bm{Z}})$, where $\lambda > 0$ is a regularization parameter and ${\bm{K}} \in \mathbb{R}^{N \times N}$ is the kernel matrix defined as ${\bm{K}}_{ij} = k({\bm{x}}_i, {\bm{x}}_j) = \phi({\bm{x}}_i)^\top \phi({\bm{x}}_j)$. In this setting, rather than matching the top of the spectral decomposition of ${\bm{G}}$, the solution recovered by VICReg amounts to the top spectral decomposition of ${\bm{G}} - \lambda {\bm{K}}^{-1}$ [12]. This allows computing the ideal representation $f_\theta$ in closed form given any graph ${\bm{G}}$ based on the regularized kernel model, hence ablating the effects that are unrelated to the theory described in this study. In this controlled setting, the superiority of active algorithms is undeniable, as illustrated in Fig. 3, where we show the optimal downstream error one can achieve with linear probing of the minimizer $f_\theta$ of the VICReg loss. Experimental details and more extensive validations are provided in Appendix C: in particular, the use of non-contrastive versus contrastive graphs, i.e. setting ${\bm{G}}_{ij} = -1$ when $y_i \neq y_j$, is studied in Fig. 7; the ability to incorporate label knowledge in SSL methods is the object of Fig. 4; robustness to noise is shown in Fig. 8; and relations between test error and the number of connected components of the reconstructed ${\bm{G}}$ are analyzed in Fig. 9.
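The closed-form recipe can be sketched numerically; the RBF kernel, the jitter term, and all dimensions below are our own illustrative choices, not the paper's experimental setup:

```python
import numpy as np

# Sketch of the closed-form kernel solution: the ideal K-dimensional
# representation is given by the top eigenvectors of G - lambda * K^{-1},
# computed directly without any training loop.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10)
G = (y[:, None] == y[None, :]).astype(float)   # supervised graph

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                          # RBF kernel matrix
lam = 1e-3
M = G - lam * np.linalg.inv(K + 1e-8 * np.eye(10))  # jitter for stability
eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)    # symmetrize numerically
Z = eigvecs[:, -2:]                            # top-2 closed-form embedding
```

Swapping in a different graph $G$ (SSL, noisy, or partially observed) and recomputing `Z` is what makes this setting a clean ablation of the theory.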
This work introduces PAL, a learning framework that revolves around the central concept of a similarity graph. We first showed how similarity graphs are the implicit backbone of self-supervised learning methods, and how this concept extends to tackle supervised learning problems. This observation does not solely unveil a rich learning framework, but also provides a single algorithm, based on a querying oracle, that can describe both SSL and supervised learning techniques, opening the way to new oracles that benefit from techniques stemming from both the supervised and self-supervised learning literatures. Finally, PAL leads to an efficient formalization of active learning as performed in practice to annotate large datasets, potentially enabling fruitful exchanges between the practice and the theory of active learning. Promising directions for future work include empirical validations on large-scale datasets, as well as a theoretical study of the newly introduced active learning framework.
NNCLR oracle. State of the art the year of its release, [23] proposes a variation of SSL that does not need human feedback, but does update the similarity graph based on past observations. This furnishes strong evidence for the usefulness of active learning algorithms, even without human feedback. The NNCLR oracle consists in setting $I_t = J_t$ equal to some minibatch at time $t$, but labeling positive pairs in ${\bm{G}}_{ij}$ only for nearest neighbors, i.e.
We describe the corresponding active oracle in Algorithm 4.
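The positive-selection step of this oracle can be sketched as follows (our simplified rendering of the nearest-neighbor idea, not the NNCLR implementation):

```python
import numpy as np

# Sketch of NNCLR-style positives: instead of hand-designed augmentations,
# the positive for each embedding is its nearest neighbor in a support set.
def nn_positives(Z: np.ndarray, support: np.ndarray) -> np.ndarray:
    # cosine similarity between batch embeddings and the support set
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Sn = support / np.linalg.norm(support, axis=1, keepdims=True)
    nn = (Zn @ Sn.T).argmax(axis=1)       # index of each nearest neighbor
    return support[nn]                    # one positive per batch sample

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 3))               # batch embeddings
support = np.vstack([Z + 0.01 * rng.normal(size=Z.shape),
                     rng.normal(size=(4, 3))])
pos = nn_positives(Z, support)
```

This is an oracle in the PAL sense: the graph entries are decided from past observations (the support set) rather than fixed in advance.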
The New Rise of Active Learning. Recently, active learning has become a focus of the machine learning community when training big models, as those models’ performance is known to depend on the order in which they process data [45], as well as on the percentage of data of different types they ingest [46]. In particular, a recent NeurIPS best paper [44] suggested that the distance to the decision boundary could be leveraged smartly to reduce the amount of data needed by current AI algorithms to train state-of-the-art models (as compared to the scaling law of [30]). We describe their sampling oracle in Algorithm 5. Note that in the original paper, they query the exact labels of the selected points, while PAL only needs to query pairwise comparisons. In the meantime, [17] suggested using ensemble active learning, while [1] suggested modeling prediction uncertainty through gradient computations in deep learning models.
From Coarse to Fine-grained Queries. While active learning usually assumes that the cost of answering any question is constant, in practice, some queries might be easier to answer than others. For example, if a child has never seen some objects, such as a sophisticated designer chair, they might not easily provide pairwise comparisons regarding those objects, e.g. they would be puzzled by the designer chair, and would hesitate to say that it is a chair. Similarly, it might be easier for human labelers to recognize attributes in an image, e.g. sandy fur, desert background, tufted ears, feline, rather than a precise species, such as “caracal”. This has been the basis for weakly supervised learning [18, 40]. It has also motivated some bandit models, such as [13, 25]. More generally, it suggests that one could efficiently learn by first querying weak, coarse-grained information, before refining queries to get precise, fine-grained feedback. We illustrate this high-level idea with Algorithm 6.
Let us begin by reformulating the invariance term in VICReg. For ${\bm{Z}}$ defined in (1), it is generalized to multiple pairs through
where ${\bm{D}}$ is the degree matrix, defined as a diagonal matrix, with $A$ the number of augmented samples per original input
The variance-covariance term can be simplified by replacing the hinge loss for the variance by a squared norm [32], by setting $\beta = \alpha = 1$, and by replacing $A$ by $N$ in ${\bm{D}}$ to regularize diagonal terms a bit more. Those simplifications lead to a more principled regularization term that enforces orthogonality, over the dataset, of the different features learned by the network $f_\theta$ [11]. The consequent regularization reads $\|{\bm{Z}}^\top {\bm{Z}} / N - {\bm{I}}_K\|^2$. As a consequence, VICReg can be understood as solving for
For the spectral contrastive loss, it is useful to incorporate negative pairs, sampled from the same augmentation for two different samples, $\langle{\bm{Z}}_{i}^{(1)},{\bm{Z}}_{j}^{(1)}\rangle$, in the repulsive term. Moreover, adding $\|{\bm{Z}}_{i}^{(v)}\|$ on both the positive and the negative part will not change much, since $-2x+x^{2}$ is minimized for $x=1$. Those modifications lead to
The last term being a finite constant, it can be removed from the loss.
Finally, the variance covariance term can be written as
where $R(a,b)=(a^{\top}b)^{2}-\|a\|^{2}-\|b\|^{2}$.
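The expansion of the spectral contrastive loss into a matrix-factorization objective on the graph (used again in the proof of Theorem 1) can be checked numerically. A small numpy sanity check of the identity $-2\operatorname{Tr}({\bm{G}}{\bm{Z}}{\bm{Z}}^{\top})+\operatorname{Tr}({\bm{Z}}{\bm{Z}}^{\top}{\bm{Z}}{\bm{Z}}^{\top})=\|{\bm{Z}}{\bm{Z}}^{\top}-{\bm{G}}\|_{F}^{2}-\|{\bm{G}}\|_{F}^{2}$, with sizes chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, C = 8, 3, 4
Z = rng.standard_normal((N, K))
Y = np.eye(C)[rng.integers(0, C, N)]   # one-hot labels
G = Y @ Y.T                            # supervised similarity graph

# The cross-term and quartic term equal the Frobenius objective up to
# the constant ||G||_F^2, which can be dropped from the loss.
lhs = -2 * np.trace(G @ Z @ Z.T) + np.trace(Z @ Z.T @ Z @ Z.T)
rhs = np.linalg.norm(Z @ Z.T - G, "fro") ** 2 - np.linalg.norm(G, "fro") ** 2
```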
SimCLR can be seen as a generalized linear model, where two variables $A$, $B$ are observed and the probability of observing $B$ knowing $A$ is given by
For simplicity, let us define $\tilde{\bm{z}}={\bm{z}}/\|{\bm{z}}\|$. SimCLR tries to maximize the likelihood of $(A,B)$, denoting random pairs coming from the same augmentations, based on the observation of the graph
The SimCLR loss is nothing but the negative log-likelihood.
When $\lambda=1$, which we will consider for simplicity, the BarlowTwins loss simplifies to
Because cross-correlations are normalized cross-covariances, it is useful to introduce $\tilde{\bm{Z}}$, the column-normalized version of ${\bm{Z}}$, formally written with the Hadamard product notation as
The backbone of Theorem 2 is the following lemma.
Given any supervised classification similarity matrix ${\bm{G}}={\bm{Y}}{\bm{Y}}^{\top}$ (6), one can recover the corresponding one-hot label encoding ${\bm{Y}}$, up to an orthogonal transformation ${\bm{R}}$, as
where ${\bm{P}}{\bm{D}}{\bm{P}}^{\top}$ is the eigenvalue decomposition of the adjacency matrix ${\bm{G}}({\bm{Y}})$. Moreover, the rotation ${\bm{R}}$ is easily recovered by specifying the labels of $C$ samples associated with each of the $C$ different classes.
Lemma 1 follows from the fact that ${\bm{G}}={\bm{Y}}{\bm{Y}}^{\top}$, so that ${\bm{Y}}$ is a square root of ${\bm{G}}$, and that any two square roots of a matrix are isometric. In particular, if the SVD of ${\bm{Y}}$ is written as
the decomposition ${\bm{G}}={\bm{P}}{\bm{D}}{\bm{P}}^{\top}$ is an eigenvalue decomposition of ${\bm{G}}$. The part ${\bm{P}}\sqrt{{\bm{D}}}$ is unique up to the application of a rotation on the right, which can be absorbed in ${\bm{R}}$.
In order to recover ${\bm{Y}}$ from ${\bm{G}}$, notice that, up to a permutation of rows and columns, ${\bm{G}}$ has a block-diagonal structure where each block corresponds to one label. If one label is given for each block, this allows retrieving exactly ${\bm{Y}}$, hence identifying ${\bm{R}}$ afterwards by solving ${\bm{R}}=({\bm{P}}\sqrt{{\bm{D}}})^{-1}{\bm{Y}}$. ∎
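The recovery in Lemma 1 is easy to check numerically. A sketch of our own (the clipping of tiny negative eigenvalues is a numerical-safety assumption, not part of the lemma):

```python
import numpy as np

rng = np.random.default_rng(1)
N, C = 12, 3
Y = np.eye(C)[rng.integers(0, C, N)]   # one-hot label matrix
G = Y @ Y.T                            # block-diagonal up to permutation

# Eigendecomposition G = P D P^T; keep the C leading eigenpairs.
vals, vecs = np.linalg.eigh(G)
P = vecs[:, -C:]
sqrtD = np.sqrt(np.clip(vals[-C:], 0.0, None))
S = P * sqrtD                          # columns scaled: S = P sqrt(D)

# S is a square root of G, hence equals Y up to an orthogonal
# transformation on the right: S S^T = Y Y^T = G.
```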
While Lemma 1 describes the classification case, in the generic case where the $y$ are categorical yet the loss $\ell(y,z)$ is not the zero-one loss, it is natural to define the similarity matrix as
For example, the $y_{i}$ could be rankings modeled with $y_{i}\in\mathfrak{S}_{m}$ where $m!=C$, i.e. $m=\Gamma^{-1}(C)-1$, and $\ell$ could be the Kendall loss. In this setting,
and ${\bm{Y}}$ is retrieved through ${\bm{Y}}={\bm{P}}\sqrt{{\bm{D}}}{\bm{R}}{\bm{L}}^{1/2}$, where ${\bm{P}}{\bm{D}}{\bm{P}}^{\top}$ is the eigenvalue decomposition of ${\bm{G}}$, and ${\bm{R}}\in\mathbb{R}^{C\times C}$ is an unknown rotation matrix that can be identified by specifying at most $C$ labels associated with the $C$ different classes, or with a smaller number of samples if $\ell$ has a strong structure implying that ${\bm{L}}$ is low-rank (see Eq. (11) in [38]). Indeed, the fact that, compared to (6), the graph (12) could be of much lower rank could lead to more efficient algorithms to recover it in the active learning framework. In essence, it would better leverage the structure encoded by the loss $\ell$.
Finally, in the regression setting, one can choose ${\bm{G}}_{ij}=-y_{i}^{\top}y_{j}$.
The VICReg loss is characterized as
So it is minimized when ${\bm{Z}}$ is a square root of the matrix ${\bm{G}}$. This is possible when the rank of ${\bm{G}}$, which is at most $C$ since ${\bm{G}}={\bm{Y}}{\bm{Y}}^{\top}$, is less than the rank of ${\bm{Z}}$, which is $K$. In this setting, since ${\bm{Y}}$ and ${\bm{Z}}$ are two square roots of ${\bm{G}}$, we get
where we define the set of rotations $O(C,K)$ as
The probabilistic interpretation of SimCLR states that the SimCLR loss tries to maximize the likelihood of the events
which translates, as a loss, into the minimization of
This is the cross-entropy between ${\bm{G}}_{ij}$ and the $p_{ij}$ defined in the proof of the characterization of SimCLR. If the minimization with respect to $p_{ij}$ were unconstrained, one should match $p_{ij}\propto{\bm{G}}_{ij}$. Yet, the form of $p_{ij}\in[\exp(-1),\exp(1)]$ constrains it to go for a slightly different solution.
Remark that for two $\tilde{\bm{z}}_{i}$, $\tilde{\bm{z}}_{j}$ whose indices $i$ and $j$ belong to different clusters defined by the graph ${\bm{G}}$, the loss is an increasing function of the quantity $\exp(\tilde{\bm{z}}_{i}^{\top}\tilde{\bm{z}}_{j})$. On the other hand, the loss is a decreasing function of $\exp(\tilde{\bm{z}}_{i}^{\top}\tilde{\bm{z}}_{j})$ when ${\bm{G}}_{ij}=1$; by symmetry, we deduce that $\tilde{\bm{z}}_{i}^{\top}\tilde{\bm{z}}_{j}=\cos({\bm{z}}_{i},{\bm{z}}_{j})$ should be one for all $(i,j)$ such that ${\bm{G}}_{ij}=1$. When the number of samples per class is constant, we deduce by symmetry that the anchors of the different classes should be placed at the vertices of the simplex with $C$ vertices centered at the origin, rotated by an arbitrary matrix ${\bm{R}}\in O(C-1)$, which allows recovering the different classes (without their explicit labels if not provided). When the different classes have different numbers of samples $N_{i}$ with $\sum_{i\in[C]}N_{i}=N$, and the anchor of class $i$ in the output space is ${\bm{c}}_{i}\in\mathbb{R}^{K}$, we are trying to minimize
which will deform the simplex to have bigger angles between classes that are highly represented. For example, when $N_{1}=N_{2}\approx N/2$, we will have ${\bm{c}}_{1}\approx-{\bm{c}}_{2}$ while the other anchors are orthogonal to one another and to ${\bm{c}}_{1}$. Denoting ${\bm{M}}\in\mathbb{R}^{C\times C}$ the matrix that maps the anchor ${\bm{v}}_{i}$ of class $i$ for one solution of (14) to the $i$-th element of the canonical basis ${\bm{e}}_{i}$, i.e. ${\bm{v}}_{i}{\bm{M}}={\bm{e}}_{i}$, we get that the solutions are
The fact that ${\bm{Z}}$ is invariant to scaling each vector ${\bm{z}}$ reminds us of implementations of the cross-entropy, where, to avoid divergence to infinity (since the sigmoid is optimized at infinity), one has to normalize the solution. The SimCLR loss is actually built on the same generalized linear model as the cross-entropy, and one can roughly think of SimCLR as the SSL version of the cross-entropy.
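The graph form of the SimCLR loss used in this analysis is straightforward to implement. A minimal numpy sketch of our own, with the softmax taken over all $k\in[N]$ as in Theorem 1:

```python
import numpy as np

def simclr_graph_loss(Z, G):
    # Row-normalize the embeddings: z̃ = z / ||z||.
    Zt = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    logits = Zt @ Zt.T
    # log-softmax over k in [N] for every row i.
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy between the graph G and the pairwise softmax.
    return -(G * log_p).sum()
```

Minimizing this pulls together pairs with ${\bm{G}}_{ij}=1$ and pushes the remaining pairs apart through the log-partition term.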
To minimize the BarlowTwins loss
we want $\tilde{\bm{Z}}$ to be a square root of the (pseudo-)inverse of ${\bm{G}}$. To be more precise, introduce the eigenvalue decomposition of ${\bm{G}}$ as ${\bm{G}}={\bm{P}}{\bm{S}}{\bm{P}}^{\top}$, where ${\bm{P}}\in\mathbb{R}^{N\times C}$ and ${\bm{S}}\in\mathbb{R}^{C\times C}$, since ${\bm{G}}$ is at most of rank $C$. A minimizer of BarlowTwins is $\tilde{\bm{Z}}=[{\bm{P}}{\bm{S}}^{-1/2},0_{N\times(K-C)}]$. Since $\tilde{\bm{Z}}$ is the column-normalized version of ${\bm{Z}}$, ${\bm{Z}}$ can be reconstructed for any diagonal matrix ${\bm{D}}\in\operatorname{diag}(\mathbb{R}_{+}^{K})$ as ${\bm{Z}}=\tilde{\bm{Z}}{\bm{D}}$. Incorporating $[{\bm{S}}^{-1},0]$ into ${\bm{D}}$, we get that the minimizers of the BarlowTwins loss are exactly the matrices ${\bm{Z}}={\bm{P}}{\bm{S}}^{1/2}[{\bm{D}},0_{K-C}]$ for ${\bm{D}}\in\operatorname{diag}(\mathbb{R}_{+}^{C})$. Moreover, since both ${\bm{P}}{\bm{S}}^{1/2}$ and ${\bm{Y}}$ are square roots of ${\bm{G}}$, there exists a rotation matrix ${\bm{R}}\in O(K)$ such that ${\bm{P}}{\bm{S}}^{1/2}={\bm{Y}}{\bm{R}}$. All together, we get that the minimizers of the BarlowTwins loss are exactly the
The fact that BarlowTwins does not care about the amplitude of the solution ${\bm{Z}}$ reminds us of discriminant analysis, which learns classifiers by optimizing ratios and angles.
For completeness, we now state a Bayes optimum proposition regarding the VICReg loss of the paper.
When $K\geq C$ and there is no context, i.e. ${\bm{x}}_{i}={\bm{x}}_{0}$ for all $i\in[N]$, and the $y_{i}$ are sampled according to a noisy distribution $(y\,|\,{\bm{x}}={\bm{x}}_{0})$, the naive study of the VICReg Bayes optimum is meaningless, since
Yet, if one frees the variable ${\bm{Z}}\in\mathbb{R}^{N\times K}$, we have
where $(e_{i})_{i\in[K]}$ is the canonical basis of $\mathbb{R}^{K}$.
For the first part of the proof, remark that the invariance term in VICReg will be zero for any ${\bm{z}}$, so the VICReg loss is minimized by any vector that minimizes the variance-covariance term $\|{\bm{z}}{\bm{z}}^{\top}-I\|^{2}$, i.e. by any unit vector.
For the second part, remark that ${\bm{G}}={\bm{Y}}{\bm{Y}}^{\top}$ has $C$ connected components that are all full cliques, i.e. within each component the adjacency matrix is filled with ones. As a consequence, the eigenvectors of ${\bm{G}}$ associated with non-zero eigenvalues are exactly the $(\bm{1}_{\{y_{i}=y\}})_{i\in[N]}$ for $y\in[C]$, and the corresponding eigenvalues are the $N_{y}=\sum_{i\in[N]}\bm{1}_{\{y_{i}=y\}}=N\operatorname{\mathbb{P}}(Y=y)$, i.e. the numbers of elements in each class $y\in[C]$. As a consequence, a square root of ${\bm{G}}$ is $N^{1/2}(\operatorname{\mathbb{P}}(Y=i)^{1/2}\delta_{ij})_{i\in[C],j\in[K]}$, hence the proposition, following the fact that all the square roots of ${\bm{G}}$ are isomorphic. ∎
The train and test sets of Figure 3 are shown on Fig. 5. The similarity graphs corresponding to the different snapshots of Fig. 3 are shown on Fig. 6. In all the experiments, we consider $K=C+1=5$.
Intuitively, it is useful to distinguish more explicitly between positive, negative and unknown relations, which we test on Fig. 7. To do so, the graph ${\bm{G}}$ is modified to encode semantically similar elements as positive edges and dissimilar ones as negative edges, while unknown relationships are represented by zeros.
One might wonder if this really improves performance. The comparison is the object of Fig. 7.
A major motivation of this paper is to be able to add prior information on sample relationships in SSL methods, and in particular to have a simple way to leverage known labels. Let us denote by $\hat{\bm{Y}}\in\mathbb{R}^{N\times D}$ the one-hot matrix stacking the $({\bm{y}}_{i})_{i\in[N]}$, where ${\bm{y}}_{i}$ is the one-hot vector of the label $y_{i}$, such that ${\bm{y}}_{i}=0$ if $y_{i}$ is unknown. The knowledge of some coefficients of the true ${\bm{Y}}$ leads to the knowledge of a few coefficients of ${\bm{G}}^{(\sup)}={\bm{Y}}{\bm{Y}}^{\top}$; those can be added to the SSL graph to provide useful connections deduced from the labels, leading to
where $\alpha\in[0,1]$ is a mixing coefficient stating how much the supervised information should weigh in the similarity matrix. Naively, we could set $\alpha=1/2$, yet when only a few labels are given this would destabilize the spectral decomposition of ${\bm{G}}$ too much, and we observe on Figure 4 that a small mixing coefficient is better. An explanation could be that the relations encoded by SSL are quite local and subtle, while the connections suggested by supervised learning are quite global and brutal, as they suggest folding the input space, hence they need to be dampened when mixing the SSL and supervised graphs.
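A minimal sketch of this graph mixing (our own helper; encoding unknown labels as -1 is an assumption not in the paper):

```python
import numpy as np

def mix_graphs(G_ssl, y_partial, alpha, num_classes):
    # Build Ŷ: one-hot rows for known labels, zero rows for unknown (-1).
    y = np.asarray(y_partial)
    Y_hat = np.zeros((len(y), num_classes))
    known = y >= 0
    Y_hat[known, y[known]] = 1.0
    # G = (1 - alpha) * G_ssl + alpha * Ŷ Ŷ^T.
    return (1 - alpha) * G_ssl + alpha * (Y_hat @ Y_hat.T)
```

Any SSL loss expressed in graph form can then be trained on the mixed ${\bm{G}}$ without any other change to the pipeline.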
As mentioned in the last part of the paper, depending on the algorithm used, noise in query answers might lead to dramatic performance loss. In the main text, we were careful to describe algorithms that are robust to noise. The effect of noise in the labels for Algorithm 3 is studied in Fig. 8. Because of its structure, noise in the queries for Algorithm 3 is equivalent to noise in the labels ${\bm{y}}$. This explains the setup of the figure.
An interesting experiment is provided by Fig. 9, which compares the test error and the number of connected components of the graph ${\bm{G}}$ as a function of the number of missing entries of ${\bm{G}}={\bm{G}}^{(\sup)}$. In our synthetic experiment, ${\bm{G}}^{(\sup)}$ has four connected components corresponding to the four classes in the dataset, e.g. Fig. 2. Typically, based on the transitivity of the similarity relation $\sim$, one can hope to only need an $O(1/N)=O(NC/N^{2})$ fraction of queries, i.e. of reconstructed entries of ${\bm{G}}$, to have a good sense of the global ${\bm{G}}$, hence to learn $f_{\theta}$. Moreover, on Fig. 9, the test error can be relatively well predicted by the number of connected components of the graph ${\bm{G}}$. This suggests creative ways to design active learning strategies based on searches that optimize the number of connected components of ${\bm{G}}$. However, leveraging the transitivity of the similarity relationship to fill ${\bm{G}}$ efficiently might be limited when query answers are noisy, although the literature on error-correcting codes might be useful [47, 19]. Moreover, the binary (and transitive) nature of similarity can be questioned, since SSL sometimes uses data augmentations that produce unrealistic images, and one might prefer to assign similarity scores. Neither problem occurs with the transitivity-agnostic Algorithm 3.
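The transitivity argument above (each positive answer merges two components, and the component count tracks the test error) can be maintained incrementally with a standard union-find structure; this is a sketch of our own, not an algorithm from the paper:

```python
class UnionFind:
    # Track connected components of the similarity graph as positive
    # query answers x_i ~ x_j arrive.
    def __init__(self, n):
        self.parent = list(range(n))
        self.count = n  # current number of connected components

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[ri] = rj
            self.count -= 1
```

With $N$ samples and $C$ classes, $N-C$ well-chosen positive answers suffice to reach $C$ components, consistent with the $O(NC/N^{2})$ fraction of the $N^{2}$ entries mentioned above.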
One can question whether the findings presented so far are specific to the concentric circles dataset. In order to assert the validity of those findings, we consider a second dataset, made of a mixture of Gaussians; formally,
given a label $y\in[C]$, ${\bm{x}}$ is generated according to ${\cal N}({\bm{e}}_{y},\sigma I_{C})$. The results are reported on Fig. 10 with $\sigma=.3$.
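A sketch of this generator, following ${\bm{X}}={\bm{Y}}+\sigma{\bm{E}}$ above (our own helper; function name and defaults are assumptions):

```python
import numpy as np

def gaussian_mixture(N, C, sigma=0.3, seed=0):
    # x = e_y + sigma * noise: each class template is a canonical basis
    # vector of R^C, matching X = Y + sigma E with E_ij ~ N(0, 1).
    rng = np.random.default_rng(seed)
    y = rng.integers(0, C, N)
    X = np.eye(C)[y] + sigma * rng.standard_normal((N, C))
    return X, y
```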
Table: S5.T1: Best known performance on ImageNet for state-of-the-art SSL methods. Notice how NNCLR [23] achieves state-of-the-art performance on ImageNet thanks to an active rule for labelers in Algorithm 1, which consists in defining positive pairs as nearest neighbors in the embedding space, as detailed in Algorithm 4. This rule allows it to beat the passive strategies provided by SimCLR and VICReg.
| Oracle | Method | Top-1 accuracy | Top-5 accuracy |
|---|---|---|---|
| Passive | SimCLR [16] | 71.7 | - |
| Passive | VICReg [3] | 73.2 | 91.1 |
| Active | NNCLR [23] | 75.6 | 92.4 |
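The NNCLR rule cited in the table caption defines positives as nearest neighbors in the embedding space. A minimal numpy sketch of such a nearest-neighbor positive graph (our illustration, not the exact Algorithm 4):

```python
import numpy as np

def nn_positive_graph(Z):
    # G_ij = 1 iff z_j is the nearest neighbor of z_i (excluding i itself).
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    G = np.zeros_like(d)
    G[np.arange(len(Z)), d.argmin(axis=1)] = 1.0
    return G
```

Each row of the returned matrix contains exactly one positive, the sample's current nearest neighbor, which plays the role of the oracle's answer.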
Active Self-Supervised Learning introduces PAL (right box), an alternative to active learning (left box) where the oracle is asked whether a collection of inputs are semantically related or not. As opposed to active learning, expert knowledge is reduced, as one need not know all the possible classes but only how to distinguish inputs from different classes. PAL querying is flexible; as an illustrative example, we exhibit an à la captcha version where a given input is presented along with a collection of other inputs, and the oracle can select among those inputs the positive ones.
Left: Depiction of the “knowledge graph” arising from binary classification with supervised learning. Notice the two connected components, each corresponding to a single class. Each sample is associated with a node of the graph (blue circle) and the known positive relation between samples is represented by an edge. This knowledge is summarized into the ${\bm{G}}$ matrix depicted on the right. Right: Examples of the $N\times N$ symmetric graph-adjacency matrices ${\bm{G}}$ for the case of binary classification with supervised (same graph as on the left), semi-supervised and active learning. Each nonzero entry $({\bm{G}})_{i,j}$ represents the known positive relation between samples $i$ and $j$.
Comparison of the active oracle of Algorithm 3 and the passive supervised one of Algorithm 2. Given $q$ queries made, and the consequent reconstructed graph ${\bm{G}}_{q}$, we learn $f_{\theta_{t}}:\mathcal{X}\to\mathbb{R}^{C}$ by minimizing ${\mathcal{L}}_{\rm VIC^{2}}$, and plot the downstream mean-square error of the optimal linear classifier $w^{\top}f_{\theta_{t}}$ for the best $w\in\mathbb{R}^{C}$. Here $\mathcal{X}=\mathbb{R}^{2}$, and $y\in[4]$ spans four concentric circles (represented by the blue, red, green and orange classes), $N=100$, query batches are of size 10, and results are averaged over 100 trials (standard deviations being represented by the colorized regions). Snapshots at different points on the curve show the third coordinate of the reconstructed $f_{\theta_{t}}$, and the ideal linear classifier that can be learned based on this embedding.
A major motivation of this paper is to be able to add prior information on sample relationships in SSL methods, and in particular to have a simple way to leverage known labels. We do so by considering $\hat{\bm{Y}}$ containing one-hot encodings of known labels, with rows being zero otherwise, and the mixed graph ${\bm{G}}=(1-\alpha)\cdot{\bm{G}}^{(\operatorname{ssl})}+\alpha\cdot\hat{\bm{Y}}\hat{\bm{Y}}^{\top}$. The setting is the same as Fig. 5 with $N=200$ and two augmentations per sample. When zero labels are known (left of the plot), we are in the full SSL regime, while when all 200 labels are known (right of the plot), we recover supervised learning performance. When few labels are given, the effect of the supervised graph can be counterproductive if the mixing coefficient $\alpha$ is too big. However, when mixed properly, adding prior label information in SSL methods improves performance.
Setup for the controlled experiments of Fig. 3. The dataset is made of four concentric circles that correspond to four different classes, represented by different colors. The training dataset is made of one hundred random points, with some noise.
Graphs ${\bm{G}}$ corresponding to the different snapshots taken on Fig. 3. Grey indicates zeros, white indicates negative observations, and black positive ones. The main strength of the active strategy in Algorithm 3 is that, by leveraging the underlying structure of the graph, it is able to deduce the full graph ${\bm{G}}$ much faster than the naive passive implementation that only asks for random query pairs. Basically, a positive observation is turned into many negative observations.
Comparison of contrastive (${\bm{G}}_{ij}\in\{-1,0,1\}$) and non-contrastive (${\bm{G}}_{ij}\in\{0,1\}$) variations of VICReg with $N=300$. The setting is the same as Fig. 3 with Algorithm 3. We remark the usefulness of distinguishing between negative pairs and unknown pairs, although some instability issues seem to appear for the contrastive method when few entries are known.
Study of the effect of labeling noise. The setup is the same as Fig. 3 yet with $N=300$ points. We consider having full access to ${\bm{Y}}$, thus to ${\bm{G}}={\bm{Y}}{\bm{Y}}^{\top}$, yet we assume that a certain number of labels $y_{i}$ are corrupted. We see that the algorithm is somewhat robust to noise.
Comparison between the test error and the number of connected components in the graph ${\bm{G}}$ as a function of the percentage of missing entries in ${\bm{G}}$. The test error is reported as in Fig. 3, but as a function of missing entries of the supervised learning graph ${\bm{G}}^{(\sup)}$. The standard deviation for the red curve is not represented here, as the number of connected components is highly concentrated around its mean.
Same figures as before with the mixture of Gaussians dataset. This dataset has the particularity that the downstream task can be solved with any orthogonal basis of $\mathbb{R}^{C}$. When no queries have been made, ${\bm{G}}=I_{N}$, and the spectral decomposition of this graph leads to a representation that can solve the downstream task, explaining why the downstream task can be solved when no queries have been made, or when all the entries of ${\bm{G}}$ are removed.
$$ \gL_{\rm VIC^2}(\mZ) = \E_{I,J\sim \uniform{[N]^2}}\bracket{\gL_{\rm VIC^2}(\mZ; I, J)}. $$
$$ \mG^{(t)}_{ij} = \ind{\vz_i\in{\cal N}(\vz_j, I)} \triangleq \ind{ \norm{\vz_i - \vz_j} = \min_{k\in I_t} \norm{\vz_i - \vz_k}}. $$
$$ \mD_{ij} = \ind{i=j}\cdot \sum_{k\in[N]} \mG_{ik} = A\ind{i=j}. $$
$$ N\gL_{\rm VIC} \approx \norm{\mZ^\top \mZ - N \mI_K}^2 + 2\trace(\mZ^\top (N\mI_N -\mG)\mZ). $$
$$ p_{ij} = \Pbb(B=j \mid A=i) \propto \exp\paren{\frac{\vz_i^\top \vz_j}{\norm{\vz_i}\norm{\vz_j}}}. $$
$$ \prod_{ij\in[N]} p_{ij}^{\mG_{ij}} = \exp\paren{\sum_{ij\in[N]} \mG_{ij} \log\paren{\frac{\exp\paren{\tilde\vz_i^\top \tilde\vz_j}}{\sum_{k\in[N]} \exp(\tilde\vz_i^\top \tilde\vz_k)}}}. $$
$$ \label{eq:BT} \gL_{\rm BT} = \sum_{k=1}^K (1 - \mC_{kk})^2 + \lambda \sum_{i\neq j} \mC_{ij}^2. $$ \tag{eq:BT}
$$ \tilde\mZ_{ij} = \frac{z_{ij}}{\sqrt{\sum_{k} z_{kj}^2} } = \frac{z_{ij}}{[(\mZ\otimes\mZ)\1]_j^{1/2}} \qquad\text{i.e.}\qquad \tilde\mZ = \mZ \diag(\mZ^{\otimes 2} \1)^{-1/2} $$
$$ \mC_{ij} = \frac{\sum_{kl} \mG_{kl} z_{ki} z_{lj}}{\sqrt{\sum_{k} z_{ki}^2} \sqrt{\sum_{l} z_{lj}^2}} =\sum_{kl} \mG_{kl} \tilde{z}_{ki} \tilde{z}_{lj} = [\tilde{\mZ}^\top \mG \tilde\mZ]_{ij}. $$
$$ \exists \mR \in O(C, K),\qquad \mZ = \mY \mR, $$
$$ \mY = \mP \sqrt{\mD} \mR, \qquad \mP\in O(N), \mR \in O(C), \mD = \begin{bmatrix} \mD_1 \\ 0 \end{bmatrix} \in \R^{N\times C}, \mD_1 = \diag(\sigma_1^2, \cdots, \sigma_C^2), $$
$$ \label{eq:gen_G_sup} \mG \triangleq (-\ell(y_i, y_j))_{i,j\in[N]} \in \R^{N\times N}. $$ \tag{eq:gen_G_sup}
$$ O(C, K) = \brace{\mR\in \R^{C\times K}\midvert \mR\mR^\top = \mI_C}. $$
$$ \cup_{i,j\in[N]} \brace{\mG_{ij}=1} \cap \brace{A=i,\; B=j}, $$
$$ \mZ = \mD \mY \mR \mM^{-1}, \qquad\text{with}\qquad \mD \in \diag(\R_+^N); \mR \in O(K, C). $$
$$ \mG_{ij} = \begin{cases} 1 & \text{if } \vx_i\sim \vx_j \text{ has been observed}, \\ -1 & \text{if } \vx_i \not\sim \vx_j \text{ has been observed}, \\ 0 & \text{otherwise}. \end{cases} $$
$$ \mG = (1-\alpha) \cdot \mG^{(\ssl)} + \alpha \cdot \hat\mY\hat\mY^\top, $$
$$ \mX = \mY + \sigma\mE, \qquad\text{where}\qquad \mE_{ij} \sim {\cal N}(0, 1), $$
$$ \mZ^{(1)} \triangleq \begin{bmatrix} f_{\theta}(\gT_1(\vx_1))\\ \vdots\\ f_{\theta}(\gT_1(\vx_N)) \end{bmatrix}, \qquad \mZ^{(2)} \triangleq \begin{bmatrix} f_{\theta}(\gT_2(\vx_1))\\ \vdots\\ f_{\theta}(\gT_2(\vx_N)) \end{bmatrix},\label{eq:Z} $$ \tag{eq:Z}
$$ \gL_{\rm VIC} = \alpha \sum_{k=1}^{K}\relu\left(1-\sqrt{\mC_{k,k}}\right)+\beta \sum_{k\neq l}\mC^2_{k,l} + \frac{1}{N}\norm{\mZ^{(1)}-\mZ^{(2)}}_2^2, \qquad\mC \triangleq \Cov(\begin{bmatrix} \mZ^{(1)}\\ \mZ^{(2)} \end{bmatrix}). \label{eq:VICReg} $$ \tag{eq:VICReg}
$$ \gL_{\rm Sim}=-\sum_{i=1}^{N}\frac{\mC_{i,i}}{\tau}+\log\left(\sum_{i\neq j}^{N}\exp\paren{\frac{\mC_{i,j}}{\tau}}\right), \qquad \mC_{i,j} \triangleq \CosSim(\mZ^{(1)}, \mZ^{(2)})_{ij} \triangleq \frac{\scap{\vz^{(1)}_i}{\vz^{(2)}_j}}{\norm{\vz^{(1)}_i}\,\norm{\vz^{(2)}_j}},\label{eq:simCLR} $$ \tag{eq:simCLR}
$$ \label{eq:G_sup_easy} \mG(\mY) = \mY \mY^\top $$ \tag{eq:G_sup_easy}
$$ \label{eq:G_sup} \mX^{(\sup)}\triangleq \mX^{(E)},\qquad \mG^{(\sup)} \triangleq \mY^{(\sup)} {\mY^{(\sup)}}^\top. $$ \tag{eq:G_sup}
$$ \mX^{(\ssl)} \triangleq\mX^{(V\times E)},\qquad \mG^{(\ssl)}_{i,j} = \Indic{\floor{i/VE}=\floor{j/VE}},\label{eq:G_ssl} $$ \tag{eq:G_ssl}
$$ \gS_{\rm method}(\mG) \triangleq \argmin_{\mZ \in \R^{N \times K}} \gL_{\rm method}(\mZ;\mG), $$
$$ \gL_{\rm VIC^2}(\theta; \mG, I, J) = \sum_{(i,j)\in I} \mG_{i,j} \norm{f_\theta(\vx_i) - f_\theta(\vx_j)}^2 + \sum_{(i,j)\in J} R(f_\theta(\vx_i), f_\theta(\vx_j)), \label{eq:vic_sgd} $$ \tag{eq:vic_sgd}
$$ \theta_{t+1} = \theta_t - \gamma_t \nabla_{\theta} \gL_{\rm VIC^2}(\theta_t; \mG, I, J), \label{eq:weight_update} $$ \tag{eq:weight_update}
$$ \argmin_{\vz \in \R^K} \gL_{\rm VIC-2}(\begin{bmatrix} \vz\\ \vdots \\ \vz \end{bmatrix};\mG) = \brace{\vz\in\R^K\midvert \norm{\vz} = 1}. $$
Theorem. Theorem 1. VICReg (2), SimCLR (3), and BarlowTwins (4) losses can be expressed in terms of the graph $\mG$ (8):
$$ \gL_{\rm VIC^2}(\mZ;\mG) = \norm{\mZ\mZ^\top - \mG}_F^2, \qquad \gL_{\rm Sim}(\mZ;\mG) = -\sum_{i,j\in[N]} \mG_{i,j}\log\paren{\frac{\exp(\tilde\vz_i^\top \tilde\vz_j)}{\sum_{k\in[N]} \exp(\tilde\vz_i^\top \tilde\vz_k)}}, \qquad \gL_{\rm BT}(\mZ;\mG) = \norm{\tilde\mZ^\top \mG \tilde\mZ - \mI}^2, $$
where $\mD = \diag(\mG\1)$ is the degree matrix of $\mG$, $\tilde\vz \triangleq \vz/\norm{\vz}$, and $\tilde\mZ$ is the column-normalized $\mZ$, so that each column has unit norm.
Theorem. Theorem 2 (Interpolation optimum). When $K \geq C$, and $\mZ = f_\theta(\mX)$ is unconstrained (e.g. interpolation regime with a rich function class), the SSL losses as per Theorem 1 with the supervised graph (7) solve the supervised learning problem with:
$$ \gS_{\rm VIC^2}(\mG^{(\sup)}) = \brace{\mY\mR \midvert \mR \in \R^{C\times K};\ \mR\mR^\top = \mI_C}, \qquad \gS_{\rm Sim}(\mG^{(\sup)}) = \brace{\mD\mY\mR\mM^{-1} \midvert \mD \in \diag_+,\ \mR \in O}, \qquad \gS_{\rm BT}(\mG^{(\sup)}) = \brace{\mY\mR\mD \midvert \mD \in \diag_+,\ \mR \in O}, $$
where $\mR \in O$ means that $\mR$ is a rotation matrix as defined for the VICReg loss, $\diag_+ = \diag(\R_+^N)$ is the set of diagonal matrices with positive entries, i.e. renormalization matrices, and $\mM$ is a matrix that maps a deformation of the simplex into the canonical basis. Moreover, provided class templates, i.e. $C$ data points associated with each of the $C$ classes, $\mY$ is easily retrieved from any method and any $\mZ \in \gS_{\rm method}$.
Lemma. Lemma 1 (Equivalence between $\mY$ and $\mG$). Given any supervised classification similarity matrix $\mG = \mY\mY^\top$ (6), one can recover the corresponding one-hot label encoding $\mY$, up to an orthogonal transformation $\mR$, as
$$ \exists \mR \in O(C) \quad\text{s.t.}\quad \mY = \mP\sqrt{\mD}\mR, $$
where $\mP\mD\mP^\top$ is the eigenvalue decomposition of the adjacency matrix $\mG(\mY)$. Moreover, the rotation $\mR$ is easily recovered by specifying the labels of $C$ samples associated with each of the $C$ different classes.
Proposition. Proposition 1 (Bayes optimum). When $K \geq C$ and there is no context, i.e. $\vx_i = \vx_0$ for all $i \in [N]$, and the $y_i$ are sampled according to a noisy distribution $(y \mid \vx = \vx_0)$, the naive study of the VICReg Bayes optimum is meaningless, since
$$ \argmin_{\vz \in \R^K} \gL_{\rm VIC^2}(\begin{bmatrix} \vz\\ \vdots\\ \vz \end{bmatrix};\mG) = \brace{\vz \in \R^K \midvert \norm{\vz} = 1}. $$
Yet, if one frees the variable $\mZ \in \R^{N\times K}$, we have
$$ \argmin_{\mZ \in \R^{N\times K}} \gL_{\rm VIC^2}(\mZ;\mG) = N^{1/2} \cdot \brace{(\Pbb(Y=i)^{1/2} e_i)_{i\in[C]} \cdot \mR \midvert \mR \in O(C,K)}, $$
where $(e_i)_{i\in[K]}$ is the canonical basis of $\R^K$.
$$ \sum_{j\in[C]} N_j \log\Big(\sum_{i\in[C]} N_i \exp(\vc_i^\top\vc_j)\Big). \tag{eq:anchor_simclr} $$
$$ \gL_{\rm VIC^2} = -2\langle\mZ^{(1)},\mZ^{(2)}\rangle + \frac{1}{N}\sum_{i\neq j} \langle\vz^{(1)}_i,\vz^{(2)}_j\rangle^2. \tag{eq:scl} $$
$$ \mX^{(A)} \triangleq [\underbrace{\gT(\vx_1),\dots,\gT(\vx_1)}_{\text{repeated $A$ times}},\dots,\gT(\vx_N),\dots,\gT(\vx_N)]^\top. $$
$$
\begin{aligned}
\gL_{\rm VIC-INV}
&= \sum_{i,j\in[N]} \mG_{ij} \norm{\vz_i-\vz_j}^2
= \sum_{i,j\in[N]} 2\mG_{ij}\norm{\vz_i}^2 - 2\mG_{ij}\langle\vz_i,\vz_j\rangle \\
&= \sum_{i,j\in[N]} 2\mG_{ij}[\mZ\mZ^\top]_{ii} - 2\mG_{ij}[\mZ\mZ^\top]_{ji}
= \sum_{i\in[N]} 2[\mD\mZ\mZ^\top]_{ii} - 2[\mG\mZ\mZ^\top]_{ii} \\
&= 2\Tr\big((\mD-\mG)\mZ\mZ^\top\big) = 2\Tr\big(\mZ^\top(\mD-\mG)\mZ\big).
\end{aligned}
$$
$$
\begin{aligned}
\gL_{\rm VIC^2}
&= -2\sum_{i,j\in[N]} \mG_{ij}\,\vz_i^\top\vz_j + \sum_{i,j\in[N]} (\vz_i^\top\vz_j)^2
= -2\sum_{i,j\in[N]} \mG_{ij}[\mZ\mZ^\top]_{ij} + \sum_{i,j\in[N]} [\mZ\mZ^\top]_{ij}^2 \\
&= -2\Tr(\mG\mZ\mZ^\top) + \Tr(\mZ\mZ^\top\mZ\mZ^\top)
= \Tr(\mZ\mZ^\top\mZ\mZ^\top - 2\mG\mZ\mZ^\top + \mG^2) - \Tr(\mG^2) \\
&= \Tr\big((\mZ\mZ^\top-\mG)^2\big) - \Tr(\mG^2)
= \norm{\mZ\mZ^\top-\mG}_F^2 + \text{cst}.
\end{aligned}
$$
$$
\begin{aligned}
&\norm{\mZ\mZ^\top - N\mI_N}^2 + 2\Tr\big(\mZ^\top(N\mI_N-\mG)\mZ\big) \\
&\qquad= \Tr(\mZ\mZ^\top\mZ\mZ^\top - 2N\mZ\mZ^\top + N^2\mI_N) + 2\Tr\big(\mZ^\top(N\mI_N-\mG)\mZ\big) \\
&\qquad= \Tr(\mZ\mZ^\top\mZ\mZ^\top - 2\mG\mZ\mZ^\top) + N^3
= \Tr\big((\mZ\mZ^\top-\mG)^2 - \mG^2\big) + N^3 \\
&\qquad= \norm{\mZ\mZ^\top-\mG}_F^2 + \text{cst}.
\end{aligned}
$$
$$
\begin{aligned}
\norm{\mZ^\top\mZ - N\mI}^2
&= \norm{\sum_{i\in[N]}\vz_i\vz_i^\top - N\mI}^2
= \Tr\Big(\sum_{i,j\in[N]} \vz_j\vz_j^\top\vz_i\vz_i^\top - 2N\sum_{i\in[N]}\vz_i\vz_i^\top + N^2\mI\Big) \\
&= \sum_{i,j\in[N]} (\vz_j^\top\vz_i)^2 - \sum_{i,j\in[N]} (\vz_i^\top\vz_i + \vz_j^\top\vz_j) + \text{cst}
= \sum_{i,j\in[N]} R(\vz_i,\vz_j) + \text{cst}.
\end{aligned}
$$
$$ \gL_{\rm VIC-2} = \norm{\mZ\mZ^\top - \mG}^2_F + \text{cst}. $$
$$ \argmin_{\mZ\in\R^{N\times K}} \gL_{\rm VIC-2}(\mZ;\mG) = N^{1/2}\cdot\big\{(\mathbb{P}(Y=i)^{1/2} e_i)_{i\in[C]}\cdot\mR \;\big|\; \mR\in O(C,K)\big\}. $$
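The derivations above reduce the VIC$^2$ loss to a matrix-factorization objective. This can be sanity-checked numerically; the snippet below (a sketch with illustrative variable names) compares the pairwise form of $\gL_{\rm VIC^2}$ with its Frobenius-norm rewriting on a random embedding and a supervised graph:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, C = 12, 4, 3

# Supervised graph G = Y Y^T built from one-hot labels.
y = rng.integers(0, C, size=N)
Y = np.eye(C)[y]                      # one-hot encoding, shape (N, C)
G = Y @ Y.T                           # G_ij = 1{y_i = y_j}

Z = rng.normal(size=(N, K))           # unconstrained embedding

# Pairwise form: -2 sum_ij G_ij <z_i, z_j> + sum_ij <z_i, z_j>^2.
gram = Z @ Z.T
loss_pairwise = -2 * np.sum(G * gram) + np.sum(gram ** 2)

# Spectral form: ||Z Z^T - G||_F^2 - Tr(G^2), the constant from the derivation.
loss_frobenius = np.linalg.norm(gram - G, "fro") ** 2 - np.trace(G @ G)

assert np.allclose(loss_pairwise, loss_frobenius)
```

The equality holds exactly for any $\mZ$ and $\mG$, since both sides expand to the same trace terms.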
Theorem. Theorem 1 (Characterization). The VICReg (eq:VICReg), SimCLR (eq:simCLR), and BarlowTwins (eq:BT) losses can be expressed in terms of the graph $\mG$ (eq:G_ssl) as
$$
\begin{aligned}
\gL_{\rm VIC^2}(\mZ;\mG) &= \norm{\mZ\mZ^\top - \mG}_F^2, \\
\gL_{\rm Sim}(\mZ;\mG) &= -\sum_{i,j\in[N]} \mG_{i,j}\log\Big(\frac{\exp(\tilde\vz_i^\top\tilde\vz_j)}{\sum_{k\in[N]}\exp(\tilde\vz_i^\top\tilde\vz_k)}\Big), \\
\gL_{\rm BT}(\mZ;\mG) &= \norm{\tilde\mZ^\top\mG\tilde\mZ - \mI}^2,
\end{aligned}
$$
where $\mD=\diag(\mG\bm{1})$ is the degree matrix of $\mG$, $\tilde\vz \triangleq \vz/\norm{\vz}$, and $\tilde\mZ$ is the column-normalized $\mZ$, so that each column has unit norm.
Proof. Lemma 1 follows from the fact that $\mG=\mY\mY^\top$, so that $\mY$ is a square root of $\mG$, and any two square roots of a matrix coincide up to an isometry. In particular, writing the SVD of $\mY$ as
$$ \mY = \mP\mD\mR, \qquad \mP\in O(N),\ \mR\in O(C),\ \mD = \begin{bmatrix}\mD_1\\ 0\end{bmatrix}\in\R^{N\times C},\ \mD_1 = \diag(\sigma_1,\cdots,\sigma_C), $$
the decomposition $\mG=\mP\mD\mD^\top\mP^\top$ is an eigenvalue decomposition of $\mG$. The part $\mP\mD$ is unique up to the application of a rotation on the right, which can be absorbed in $\mR$. In order to recover $\mY$ from $\mG$, notice that, up to a permutation of rows and columns, $\mG$ has a block-diagonal structure where each block corresponds to one label. Assigning one label to each block retrieves exactly $\mY$, hence $\mR$ can be identified afterwards by solving $\mR=(\mP\mD)^{\dagger}\mY$.
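The recovery procedure in this proof can be illustrated with a short NumPy sketch. Below, $\hat\mY = \mP\sqrt{\mD}$ is computed from the eigendecomposition of $\mG$; instead of forming $\mR$ explicitly, the rotation ambiguity is resolved by matching each row of $\hat\mY$ to one labeled template per class (the helper logic is an illustrative choice, not the paper's exact procedure):

```python
import numpy as np

N, C = 10, 3
y = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])   # ground-truth labels
Y = np.eye(C)[y]                                # one-hot encoding, shape (N, C)
G = Y @ Y.T                                     # similarity matrix G = Y Y^T

# Eigendecomposition G = P diag(d) P^T; keep the C leading components.
d, P = np.linalg.eigh(G)                        # eigh returns ascending order
d, P = d[::-1][:C], P[:, ::-1][:, :C]
Y_hat = P * np.sqrt(np.clip(d, 0.0, None))      # Y_hat = P sqrt(D) = Y R for some R

# Y_hat reproduces the graph exactly (G has rank C) ...
assert np.allclose(Y_hat @ Y_hat.T, G)

# ... and its rows coincide exactly for samples of the same class, since
# ||Y_hat[i] - Y_hat[j]||^2 = 2(1 - G_ij).  Hence C class templates (one
# labeled sample per class) pin down every label.
templates = [np.flatnonzero(y == c)[0] for c in range(C)]
labels = [int(np.argmax([np.allclose(Y_hat[i], Y_hat[t]) for t in templates]))
          for i in range(N)]
assert labels == list(y)
```

Note that row matching is insensitive to the rotation ambiguity: even when classes have equal sizes (tied eigenvalues), $\hat\mY\hat\mY^\top = \mG$ still forces rows to agree exactly within a class.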
Proof. For the first part, remark that the invariance term in VICReg is zero for any such $\vz$, so the VICReg loss is minimized by any vector minimizing the variance-covariance term $\norm{\vz\vz^\top-\mI}^2$, which is achieved by any unit vector. For the second part, remark that $\mG=\mY\mY^\top$ has $C$ connected components that are all full cliques, i.e. their adjacency submatrices are filled with ones. As a consequence, the eigenvectors of $\mG$ associated with non-zero eigenvalues are exactly the $(\ind{y_i=y})_{i\in[N]}$ for $y\in[C]$, and the corresponding eigenvalues are $N_y$, where $N_y=\sum_{i\in[N]}\ind{y_i=y}=N\,\mathbb{P}(Y=y)$ is the number of elements in class $y\in[C]$. As a consequence, a square root of $\mG$ is $N^{1/2}(\mathbb{P}(Y=i)^{1/2}\delta_{ij})_{i\in[C],j\in[K]}$, hence the proposition, following from the fact that all square roots of $\mG$ are isometric.
\begin{algorithm}[ht]
\KwData{$\mX\in\R^{N\times D}$; unknown graph $\mG = \mG^{(\sup)}$.}
\KwResult{Embedding $f_\theta:\R^D \to \R^K$.}
Initialization: weights $\theta_0$, scheduler $(\gamma_t)$; $T\in\N$;\\
\For{$t \in [T]$}{
Collect $I_t, J_t\leftarrow$ from {\color{blue} sampler};\\
Collect $(\mG_{ij} = \ind{y_i=y_j})_{(i,j)\in I_t}$ from {\color{blue} labelers};\\
Update $\theta_{t+1} \leftarrow \theta_t - \gamma_t \nabla_{\theta} \gL(\theta_t; \mG, I_t, J_t)$.
}
\caption{PAL framework with {\color{blue} oracle}}
\label{alg:pal}
\end{algorithm}
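As a concrete sketch of the PAL loop, the following NumPy snippet works in the interpolation regime (where, per Theorem 2, the embedding can be treated as a free variable $\mZ$), alternating a random-pair sampler with a supervised labeler answering $\mG_{ij}=\ind{y_i=y_j}$. The direct parameterization of $\mZ$, the batch size, and the step size are illustrative choices, not the paper's training pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, C, T, lr = 24, 3, 3, 3000, 0.01
y = np.repeat(np.arange(C), N // C)          # ground-truth labels, hidden from the learner
Z = 0.1 * rng.normal(size=(N, K))            # interpolation regime: one free embedding per sample

def batch_loss(Z, idx, G):
    """Spectral SSL loss ||Z_I Z_I^T - G||_F^2 on a minibatch, with its gradient."""
    R = Z[idx] @ Z[idx].T - G
    return np.sum(R ** 2), 4 * R @ Z[idx]    # d/dZ ||Z Z^T - G||^2 = 4 R Z (R symmetric)

for t in range(T):
    I = rng.choice(N, size=8, replace=False)            # sampler: a random query batch
    G = (y[I][:, None] == y[I][None, :]).astype(float)  # labeler: G_ij = 1{y_i = y_j}
    _, g = batch_loss(Z, I, G)
    Z[I] -= lr * g                                      # SGD step of the PAL loop

# After training, Z Z^T should approximate the full supervised graph.
G_full = (y[:, None] == y[None, :]).astype(float)
err = np.linalg.norm(Z @ Z.T - G_full) / np.linalg.norm(G_full)
assert err < 0.5
```

Since every minibatch objective is a subset of the same pairwise sum, all batch losses are simultaneously zeroed when $\mZ\mZ^\top=\mG$, so the stochastic updates share a common interpolating optimum.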
\begin{algorithm}[ht]
SSL oracle: \\
\hspace{.3cm} Sampler: $I_t = \{(i_{2t+1}, i_{2(t+1)})\}$, $J_t$ a minibatch;\\
\hspace{.3cm} Labeler: $\mG_{2t+1,2(t+1)} = 1$.\\
Supervised oracle: \\
\hspace{.3cm} Sampler: $I_t = J_t = \{(i_t, j_t)\}$ drawn at random in $[N]^2$;\\
\hspace{.3cm} Labeler: $\mG_{i,j} = \ind{y_i = y_j}$.
\caption{Passive Oracle Specifications}
\label{alg:passive}
\end{algorithm}
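The two passive oracles above differ only in who answers the query. A minimal sketch of this sampler/labeler interface, with hypothetical function names chosen for illustration:

```python
import numpy as np

def ssl_oracle(t, batch_indices):
    """SSL passive oracle: the only asserted positive pair is the pair of
    consecutive augmented views (2t+1, 2(t+1)) of the same input."""
    pair = (2 * t + 1, 2 * (t + 1))
    return [pair], batch_indices, {pair: 1}          # I_t, J_t, known G entries

def supervised_oracle(y, rng):
    """Supervised passive oracle: one random pair, answered from labels y."""
    i, j = rng.integers(0, len(y), size=2)
    return [(i, j)], [(i, j)], {(i, j): int(y[i] == y[j])}

rng = np.random.default_rng(0)
y = np.array([0, 0, 1, 1])
pairs, _, answer = supervised_oracle(y, rng)
(i, j), = pairs
assert answer[(i, j)] == int(y[i] == y[j])
```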
\begin{algorithm}[ht]
\KwData{Class templates $(\vmu_1, \cdots, \vmu_C) \in \X^C$; $\mQ \in \R^{N\times C}$ initialized at zero.}
Choose the class with the fewest known examples, $j = \argmin_{j\in[C]} {\bf 1}^\top \mQ_t \ve_j$;\\
Collect pairwise comparisons $\mQ_{ij} \leftarrow \ind{\vx_i \sim \vmu_j}$ for $i$ in a batch $B \subset [N] \setminus \mK_t$, where $\mK_t$ removes queries whose results are already known from $\mQ_t$;\\
Sampler: $I_t = J_t$, all the new entries deduced in $\mG$;\\
Labeler: human feedback $\mQ_{ij}$; deduction to fill $\mG$.
\caption{Oracle {\em \`a la} Captcha}
\label{alg:active}
\end{algorithm}
\begin{algorithm}[ht]
NNCLR oracle: \\
\hspace{.3cm} Sampler: $I_t = J_t$ is some minibatch;\\
\hspace{.3cm} Labeler: $\mG_{i,j} = \ind{\vz_i\in{\cal N}(\vz_j; I)}$, where ${\cal N}(\vz; I)$ denotes the nearest neighbor of $\vz$ in the batch $I$.
\caption{Active Oracle with No Human Feedback as per \cite{dwibedi2021little}}
\label{alg:nnclr}
\end{algorithm}
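This nearest-neighbor labeler requires no human in the loop: the model's own embedding answers the relational query. A sketch of such a labeler using cosine nearest neighbors within the batch (the helper name is illustrative):

```python
import numpy as np

def nnclr_labeler(Z):
    """Fill G_ij = 1 iff z_i is the nearest neighbor of z_j within the batch,
    self-matches excluded, as in the NNCLR-style oracle above."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # unit-normalize rows
    sim = Zn @ Zn.T                                     # cosine similarities
    np.fill_diagonal(sim, -np.inf)                      # a point is not its own neighbor
    nn = np.argmax(sim, axis=1)                         # nearest neighbor of each z_j
    G = np.zeros((len(Z), len(Z)))
    G[nn, np.arange(len(Z))] = 1.0                      # G_ij = 1{z_i = NN(z_j)}
    return G

Z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
G = nnclr_labeler(Z)
assert G[1, 0] == 1 and G[0, 1] == 1    # the two right-pointing vectors pair up
assert G[3, 2] == 1 and G[2, 3] == 1    # the two up-pointing vectors pair up
```

Note the resulting graph need not be symmetric in general, since nearest-neighbor relations are directed.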
\begin{algorithm}[ht]
Perform k-means clustering on $\mZ_t = f_{\theta_t}(\mX)$.\\
For each unlabeled point, compute the cosine distance to its cluster center.\\
\eIf{few examples have been labeled}{$I_t=J_t\leftarrow$ points near cluster centers,}{$I_t=J_t\leftarrow$ points far from cluster centers.}
\caption{Active Oracle with Data Pruning as per \cite{pruning2022}}
\label{alg:boundary}
\end{algorithm}
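The ranking step of this pruning-based oracle can be sketched as follows, assuming the k-means centers and assignments are already available (the function name and signature are illustrative):

```python
import numpy as np

def pruning_sampler(Z, centers, assign, n_query, few_labeled):
    """Rank points by cosine distance to their k-means center and query the
    easiest (near-center) points early in training, the hardest ones later,
    following the data-pruning heuristic above."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Cn = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    dist = 1.0 - np.sum(Zn * Cn[assign], axis=1)   # cosine distance to own center
    order = np.argsort(dist)                        # near-center first
    return order[:n_query] if few_labeled else order[-n_query:]

Z = np.array([[2.0, 0.0], [1.0, 0.1], [0.0, 2.0], [0.3, 1.0]])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
assign = np.array([0, 0, 1, 1])
early = pruning_sampler(Z, centers, assign, 2, few_labeled=True)
late = pruning_sampler(Z, centers, assign, 2, few_labeled=False)
assert set(early) == {0, 2}   # points exactly aligned with their centers
assert set(late) == {1, 3}    # points farther from their centers
```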
\begin{algorithm}[ht]
Sampler: your favorite active learning sampler.\\
\eIf{$f_{\theta_t}$ has not formed strong opinions on clusters}{Labeler: human feedback for coarse-grained information (e.g. click on all animals with fur),}{Labeler: human feedback for fine-grained information (e.g. click on fishes that match a precise species).}
\caption{Active Oracle with Hierarchical Taxonomy}
\label{alg:hierarchical}
\end{algorithm}
| Oracle type | Method | Top-1 acc. (%) | Top-5 acc. (%) |
|---|---|---|---|
| Passive | SimCLR [16] | 71.7 | - |
| Passive | VICReg [3] | 73.2 | 91.1 |
| Active | NNCLR [23] | 75.6 | 92.4 |

References
[Ash2020] Jordan Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. \newblock Deep batch active learning by diverse, uncertain gradient lower bounds. \newblock In {\em ICLR}, 2020.
[balestriero2022contrastive] Randall Balestriero and Yann LeCun. \newblock Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. \newblock In {\em NeurIPS}, 2022.
[bardes2021vicreg] Adrien Bardes, Jean Ponce, and Yann LeCun. \newblock Vicreg: Variance-invariance-covariance regularization for self-supervised learning. \newblock In {\em ICLR}, 2022.
[Bartlett2006] Peter Bartlett, Michael Jordan, and Jon McAuliffe. \newblock Convexity, classification, and risk bounds. \newblock {\em Journal of the American Statistical Association}, 2006.
[criteo] Renaud Bauvin. \newblock Lessons learned from annotating 5 million images, 2019. \newblock Medium.
[bengio2006greedy] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. \newblock Greedy layer-wise training of deep networks. \newblock In {\em NeurIPS}, 2006.
[Bilgic2010active] Mustafa Bilgic, Lilyana Mihalkova, and Lise Getoor. \newblock Active learning for networked data. \newblock In {\em ICML}, 2010.
[Bubeck2015] Sébastien Bubeck. \newblock Convex optimization: Algorithms and complexity. \newblock {\em Foundations and Trends in Machine Learning}, 2015.
[captcha] ByteBridge. \newblock Data annotation: By typing captcha, you are actually helping ai model training, 2021. \newblock Medium.
[Cabannes2022active] Vivien Cabannes, Francis Bach, Vianney Perchet, and Alessandro Rudi. \newblock Active labeling: streaming stochastic gradients. \newblock In {\em NeurIPS}, 2022.
[cabannes2023minimal] Vivien Cabannes, Alberto Bietti, and Randall Balestriero. \newblock On minimal variations for unsupervised representation learning. \newblock In {\em ICASSP}, 2023.
[cabannes2023ssl] Vivien Cabannes, Bobak Kiani, Randall Balestriero, Yann LeCun, and Alberto Bietti. \newblock The {SSL} interplay: Augmentations, inductive bias, and generalization. \newblock In {\em ICML}, 2023.
[Cesa-Bianchi2006] Nicolò Cesa-Bianchi, Claudio Gentile, and Luca Zaniboni. \newblock Incremental algorithms for hierarchical classification. \newblock {\em Journal of Machine Learning Research}, 2006.
[chen2020simple] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. \newblock A simple framework for contrastive learning of visual representations. \newblock In {\em ICML}, 2020.
[chen2020big] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. \newblock Big self-supervised models are strong semi-supervised learners. \newblock In {\em NeurIPS}, 2020.
[simclrv2] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. \newblock Big self-supervised models are strong semi-supervised learners. \newblock In {\em NeurIPS}, 2020.
[ensemble2021] Kashyap Chitta, Jose Alvarez, Elmar Haussmann, and Clement Farabet. \newblock Training data subset search with ensemble active learning. \newblock {\em IEEE Transactions on Intelligent Transportation Systems}, 2021.
[Cour2011] Timothée Cour, Benjamin Sapp, and Ben Taskar. \newblock Learning from partial labels. \newblock {\em Journal of Machine Learning Research}, 2011.
[Cover1991] Thomas Cover and Joy Thomas. \newblock {\em Elements of Information Theory}. \newblock Wiley, 1991.
[Dasgupta2011] Sanjoy Dasgupta. \newblock Two faces of active learning. \newblock {\em Theoretical Computer Science}, 2011.
[ImageNet] Jia Deng, Wei Dong, Richard Socher, Li{-}Jia Li, Kai Li, and Fei{-}Fei Li. \newblock Imagenet: {A} large-scale hierarchical image database. \newblock In {\em Conference on Computer Vision and Pattern Recognition}, 2009.
[Du2018] Simon Du, Wei Hu, and Jason Lee. \newblock Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. \newblock In {\em NeurIPS}, 2018.
[dwibedi2021little] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. \newblock With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. \newblock In {\em ICCV}, 2021.
[foundation] Nanyi Fei, Zhiwu Lu, Yizhao Gao, Guoxing Yang, Yuqi Huo, Jingyuan Wen, Haoyu Lu, Ruihua Song, Xin Gao, Tao Xiang, Hao Sun, and Ji-Rong Wen. \newblock Towards artificial general intelligence via a multimodal foundation model. \newblock {\em Nature Communications}, 2022.
[Fiez2019] Tanner Fiez, Lalit Jain, Kevin Jamieson, and Lillian Ratliff. \newblock Sequential {Experimental} {Design} for {Transductive} {Linear} {Bandits}. \newblock In {\em NeurIPS}, 2019.
[gan2022visionlanguage] Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. \newblock {\em Vision-Language Pre-training: Basics, Recent Advances, and Future Trends}. \newblock Now Foundations and Trends, 2022.
[Hanneke2014] Steve Hanneke. \newblock Theory of disagreement-based active learning. \newblock {\em Foundations and Trends in Machine Learning}, pages 131--309, 2014.
[haochen2021provable] Jeff HaoChen, Colin Wei, Adrien Gaidon, and Tengyu Ma. \newblock Provable guarantees for self-supervised deep learning with spectral contrastive loss. \newblock In {\em NeurIPS}, 2021.
[Jian2022] Liwei Jiang, Yudong Chen, and Lijun Ding. \newblock Algorithmic regularization in model-free overparametrized asymmetric matrix factorization. \newblock {\em SIAM Journal on Mathematics of Data Science}, 2023.
[scaling2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. \newblock Scaling laws for neural language models. \newblock In {\em ArXiv}, 2020.
[Karzand2020] Mina Karzand and Robert Nowak. \newblock Maximin active learning in overparameterized model classes. \newblock {\em IEEE Journal on Selected Areas in Information Theory}, 2020.
[kiani2022joint] Bobak Kiani, Randall Balestriero, Yubei Chen, Seth Lloyd, and Yann LeCun. \newblock Joint embedding self-supervised learning in the kernel regime. \newblock In {\em arXiv}, 2022.
[Knuth1977] Donald Knuth. \newblock The computer as master mind. \newblock {\em Journal of Recreational Mathematics}, 1977.
[jepa] Yann LeCun. \newblock A path towards autonomous machine intelligence. \newblock In {\em ArXiv}, 2022.
[lecun2015deep] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. \newblock Deep learning. \newblock {\em Nature}, 2015.
[Micchelli2006] Charles Micchelli, Yuesheng Xu, and Haizhang Zhang. \newblock Universal kernels. \newblock {\em Journal of Machine Learning Research}, 2006.
[misra2020self] Ishan Misra and Laurens van~der Maaten. \newblock Self-supervised learning of pretext-invariant representations. \newblock In {\em CVPR}, 2020.
[Nowak2019] Alex Nowak{-}Vila, Francis Bach, and Alessandro Rudi. \newblock Sharp analysis of learning with discrete losses. \newblock In {\em AISTATS}, 2019.
[pan2010survey] Sinno~Jialin Pan and Qiang Yang. \newblock A survey on transfer learning. \newblock {\em IEEE Transactions on knowledge and data engineering}, 2010.
[Ratner2020] Alexander Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. \newblock Snorkel: rapid training data creation with weak supervision. \newblock {\em The VLDB Journal}, 2020.
[Settles2010] Burr Settles. \newblock Active learning literature survey. \newblock Technical report, University of Wisconsin-Madison, 2010.
[settles2011theories] Burr Settles. \newblock From theories to queries: Active learning in practice. \newblock In {\em AISTATS workshop on active learning and experimental design}, 2011.
[Simard2017] Patrice Simard, Saleema Amershi, David Chickering, Alicia~Edelman Pelton, Soroush Ghorashi, Christopher Meek, Gonzalo Ramos, Jina Suh, Johan Verwey, Mo Wang, and John Wernsing. \newblock Machine teaching: A new paradigm for building machine learning systems. \newblock In {\em ArXiv}, 2017.
[pruning2022] Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, and Ari Morcos. \newblock Beyond neural scaling laws: beating power law scaling via data pruning. \newblock In {\em NeurIPS}, 2022.
[curriculum] Petru Soviany, Radu~Tudor Ionescu, Paolo Rota, and Nicu Sebe. \newblock Curriculum learning: A survey. \newblock {\em International Journal of Computer Vision}, 2022.
[llama] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. \newblock Llama: Open and efficient foundation language models. \newblock In {\em ArXiv}, 2023.
[Varshamov1957] Rom Varshamov. \newblock Estimate of the number of signals in error correcting codes. \newblock {\em Doklady Akademii Nauk SSSR}, 1957.
[vincent2008extracting] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. \newblock Extracting and composing robust features with denoising autoencoders. \newblock In {\em ICML}, 2008.
[Yang2016active] Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. \newblock Semi-supervised learning with graph embeddings. \newblock In {\em ICML}, 2016.
[Ye2021] Tian Ye and Simon Du. \newblock Global convergence of gradient descent for asymmetric low-rank matrix factorization. \newblock In {\em NeurIPS}, 2021.
[zbontar2021barlow] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. \newblock Barlow twins: Self-supervised learning via redundancy reduction. \newblock In {\em ICML}, 2021.
[zhai2019s4l] Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. \newblock S4l: Self-supervised semi-supervised learning. \newblock In {\em ICCV}, 2019.
[zheltonozhskii2022contrast] Evgenii Zheltonozhskii, Chaim Baskin, Avi Mendelson, Alex~M Bronstein, and Or Litany. \newblock Contrast to divide: Self-supervised pre-training for learning with noisy labels. \newblock In {\em WACV}, 2022.