LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
Randall Balestriero, Yann LeCun
Introduction
Learning manipulable representations of the world and its dynamics is a long-standing question in AI, with roots dating back centuries [Von Helmholtz, 1867, Tolman, 1948, Gregory, 1980, Sutton, 1991, Friston, 2010]. Across domains, e.g., image recognition, robotics, physics, space exploration, the unifying question is how to learn an organized and actionable high-dimensional embedding space from observations. Using Deep Networks (parameterized nonlinear operators 𝑓 𝜽) to map observations to embeddings is a standard first piece of that puzzle [LeCun et al., 2015, Goodfellow et al., 2016]. The second, less standardized, piece of that puzzle is how to train 𝑓 𝜽. Joint-Embedding Predictive Architectures (JEPAs) suggest training 𝑓 𝜽 by maximizing predictive agreement between the embeddings of semantically related views [Bromley et al., 1993, LeCun, 2022, Balestriero et al., 2023]. Views can come in two forms: transformations or corruptions. They can involve masking, cropping, blurring, temporal or spatial translations, geometric or photometric transformations, viewpoint changes, views from different sensor modalities, etc. The supervised forms involve human-produced components such as image-caption pairs, text-code pairs, etc. [Tian et al., 2020]. In any case, views are expected to share some degree of semantic relationship to allow the prediction task to align 𝑓 𝜽's embeddings towards the underlying knowledge present in the data.
Alas, JEPA's prediction task admits failure modes, such as representation collapse, where 𝑓 𝜽 maps all inputs to nearly identical embeddings (complete collapse) or to a low-dimensional subspace (dimensional collapse) [Jing et al., 2021, Cosentino et al., 2022, Balestriero and LeCun, 2022]. To mitigate such shortcut solutions, state-of-the-art recipes rely on heuristics, stop-gradient [Chen et al., 2020a], asymmetric view generation [Wang et al., 2022], teacher-student networks with carefully tuned EMA schedules [Caron et al., 2021, Tian et al., 2021], explicit normalization and whitening layers [Ermolov et al., 2021, Chen et al., 2021], and a delicate balance of hyperparameters. As a result, today's JEPA training is brittle, and most research has shifted toward scaling data [Vo et al., 2024], models [Fan et al., 2025] and even post-training [Rodas et al., 2025], while leaving the theoretical foundations of JEPAs largely unexplored.
Our study proposes to break that cycle by questioning some of the fundamental design principles underpinning JEPAs. That introspection starts by asking: what are the necessary conditions that JEPAs should abide by? Those minimal conditions then act as axioms for us to design a novel and lean JEPA. We identify two axioms: (i) solving the prediction task while (ii) enforcing an isotropic Gaussian distribution of the embeddings (Section 3). While (i) follows standard practice [Balestriero and LeCun, 2022], we introduce in Section 4 a novel distribution matching objective, Sketched Isotropic Gaussian Regularization (SIGReg), to enforce (ii). The use of SIGReg not only removes the need for the numerous heuristics previously employed to prevent representation collapse; SIGReg also exhibits favorable scaling properties, as its memory and computational complexity is linear in dimension and sample size. Crucially, SIGReg's isotropic Gaussian enforcement rules out the collapsed shortcut solution and provably minimizes the model's expected risk over the space of downstream tasks to be encountered post-training. The resulting JEPA solution, coined Latent-Euclidean JEPA (LeJEPA), is introduced in Section 5. Beyond theoretical optimality, LeJEPA offers numerous benefits such as (i) provable statistical guarantees, (ii) removal of heuristics such as teacher-student networks, (iii) linear memory and computational complexity, and most importantly (iv) a unified design with a single trade-off parameter that works out of the box across datasets, architectures and scales (see Section 6). We summarize our contributions below.
Contribution 1: We prove the optimal embedding distribution for foundation models. We establish that the isotropic Gaussian uniquely minimizes downstream prediction risk across broad task families. In Section 3, we derive this result rigorously for both linear (Section 3.1) and nonlinear probes (Section 3.2), providing the first principled answer to which distribution 𝑓 𝜽's embeddings should follow. This theoretical result transforms JEPA design from heuristic exploration into targeted optimization.
Contribution 2: We introduce SIGReg, a distribution matching objective that uniquely combines provable correctness with computational efficiency at scale. We present Sketched Isotropic Gaussian Regularization (SIGReg), a novel objective that enforces distributional alignment via random projections and characteristic-function matching (Section 4 and Figure 2). SIGReg provides statistical guarantees (Sections 4.1 and 4.2) while achieving linear complexity and bounded gradients, a combination that existing distribution matching methods do not offer. Critically, its projection-based construction defeats the curse of dimensionality (Section 4.3), making it both theoretically sound and practically efficient for high-dimensional embeddings.
Contribution 3: We design LeJEPA, a statistically optimal JEPA that eliminates collapse by construction. By combining JEPA's predictive objective with SIGReg targeting the isotropic Gaussian, we introduce LeJEPA, the Latent-Euclidean JEPA (Section 5). LeJEPA requires only a single hyperparameter, eliminates representational collapse without stop-gradients or teacher-student architectures, and transfers across architectures and datasets without hyperparameter tuning. This demonstrates that principled theory directly yields practical simplicity.

Figure 2. Sketched Isotropic Gaussian Regularization (SIGReg): Given some arbitrary input data with density 𝑝𝑥 whose support may or may not lie on a manifold (left), a Deep Network (DN) encoder 𝑓 𝜽 produces embeddings 𝒛 = 𝑓 𝜽(𝒙) with some distribution 𝒛 ∼ 𝑝𝑧 (middle). Our proposed Backward Cramér-Wold statistics objective (Section 4) pushes 𝑝𝑧 to match a target distribution 𝑝𝑡 by projecting the embeddings along 1d directions (middle, arrows) and enforcing that the univariate densities (right, colored lines) match the distribution of 𝑝𝑡 projected along the same directions. Any popular statistical test (provided in Section 4.2) can assess the goodness-of-fit; in practice we argue for characteristic function tests (Section 4.2). By using SIGReg with 𝑝𝑡 an isotropic Gaussian (right, black lines), we introduce a lean and provably optimal (Section 3) JEPA, coined LeJEPA, free of numerous heuristics and able to produce competitive performances (Sections 5 and 6).
Contribution 4: We validate LeJEPA at scale across diverse architectures and establish in-domain pretraining as viable. Our experiments (Section 6) span ViTs, ConvNeXts, ResNets, MaxViTs, and Swin Transformers at scales approaching 1 billion parameters, where LeJEPA matches or exceeds state-of-the-art methods while maintaining training simplicity and robustness. Critically, on domain-specific datasets (Galaxy10, Food101), LeJEPA outperforms DINOv2-based transfer learning when pretrained directly on target data. This challenges the transfer learning paradigm and demonstrates that principled SSL can unlock effective in-domain pretraining-previously considered impractical for small datasets.
Background and Notations
We start by introducing some of the notations we will be using throughout our manuscript (Section 2.1), followed by a review of JEPAs (Section 2.2), and existing literature studying their design (Section 2.3).
Notations and Definitions
Data. We are in possession of a dataset of shape (𝑁, 𝑉, 𝐷) ∈ (N∗)³, where 𝑁 is the number of samples, 𝑉 is the number of views, and 𝐷 is the dimension. One entry of this dataset is accessed via 𝒙 𝑛,𝑣,𝑑. Those dimensions are often interpreted as follows: (N) is the number of independent samples, e.g., different images or different videos, (V) is the number of views, e.g., data-augmentations for images, frames for videos, and (D) is the dimension of each 𝒙 𝑛,𝑣, e.g., the number of RGB pixels for images. In many cases the ordering over 𝑉 is given by time, but in some cases, e.g., data-augmentation of an image, ordering becomes irrelevant. Our study does not require any particular choice to organize one's dataset into a (𝑁, 𝑉, 𝐷) tensor, and none of our theory and implementation assumes a particular design decision for that tensor. However, we will rely on the following two properties: (independence) the samples 𝒙 𝑛, 𝒙 𝑛′ have been obtained independently from each other ∀𝑛 ≠ 𝑛′, and (identically distributed) the sampling process was identical among 𝒙 𝑛, ∀𝑛.
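For concreteness, here is a toy construction of such an (N, V, D) tensor; the shapes and the additive-noise view model are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

N, V, D = 8, 2, 32  # hypothetical: 8 samples, 2 views each, 32 dims
base = rng.normal(size=(N, 1, D))          # one underlying observation per sample
noise = 0.1 * rng.normal(size=(N, V, D))   # view-specific corruption
x = base + noise                           # views of sample n share content

# samples are i.i.d. across n; views of the same n are semantically related
print(x.shape)  # (8, 2, 32)
```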
Deep Networks. Today's AI solutions rely on Deep (Neural) Networks (DNs), which are compositions of a large number of parameterized linear and nonlinear operators. We denote the DN's mapping as 𝑓 𝜽 : R^𝐷 → R^𝐾, with 𝐾 the dimension of the embedding space. The internals of 𝑓 𝜽 are designed by the researcher to incorporate as much prior knowledge about the data as possible. The details of 𝑓 𝜽 are irrelevant to our study; as we will see, the proposed LeJEPA works out-of-the-box on any 𝑓 𝜽. In any case, all the learnable parameters are gathered in the vector 𝜽 ∈ R^𝑃, with 𝑃 counting the total number of parameters. A central challenge in AI research is to design the right architecture and training objective so that 𝜽 can be learned from gradient descent to ultimately produce a useful system, or foundation model, 𝑓 𝜽.
JEPAs. A foundation model is any system, e.g., a DN, able to solve numerous downstream tasks without requiring any change in its internal parameters 𝜽. This is in sharp contrast with a supervised model that only considers its training task. JEPAs were formally introduced by LeCun [2022] as a vehicle to produce foundation models. The core building blocks of JEPAs rely on numerous well-established techniques such as siamese networks [Bromley et al., 1993] and predictive coding [Helmholtz et al., 1867, Bruner and Postman, 1949]. While the exact blueprint of JEPAs varies greatly between use-cases, they all rely on two core principles: (i) being able to predict the embedding of a view 𝒙 𝑛,𝑣 from the embedding of another view 𝒙 𝑛,𝑣′, 𝑣′ ≠ 𝑣, all while (ii) ensuring that the embeddings do not become degenerate. Concretely, once a JEPA is designed and trained, it should be able to solve numerous downstream tasks in zero or few shots. The JEPA objective function, along with some examples for 𝒙, is provided in Equation (1). The predictability criterion can be enforced by directly comparing the embeddings of the partial views Enc(𝒙 𝑛,𝑣,.) and Enc(𝒙 𝑛,𝑣′,.) with a metric, e.g., ℓ𝑝. In some cases, an additional DN, coined Pred, is employed to compare Pred(Enc(𝒙 𝑛,𝑣,.)) against Enc(𝒙 𝑛,𝑣′,.), which is only justified when there exists an asymmetry between the information content of the different views, e.g., by conditioning the predictions on observed actions from robotics data [Khazatsky et al., 2024].
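A minimal sketch of the predictive criterion with an ℓ2 metric; the linear `enc`/`pred` maps are stand-ins for the deep networks, and all names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 4
W_enc = rng.normal(size=(D, K)) / np.sqrt(D)   # stand-in encoder f_theta
W_pred = np.eye(K)                             # stand-in predictor

def enc(x):
    return x @ W_enc

def pred(z):
    return z @ W_pred

# two views of the same samples: shared content plus view-specific noise
content = rng.normal(size=(8, D))
v1 = content + 0.05 * rng.normal(size=(8, D))
v2 = content + 0.05 * rng.normal(size=(8, D))

# l2 predictive loss: predict the embedding of view 2 from view 1
loss = np.mean(np.sum((pred(enc(v1)) - enc(v2)) ** 2, axis=1))
print(round(float(loss), 4))
```

Because the views share content, this loss is far smaller than the same quantity computed across mismatched samples; collapse avoidance (principle (ii)) is what the rest of the paper addresses.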
The Need for Reliable Pretraining
The JEPA's prediction task is designed based on a priori knowledge of the data. Its design is often quite natural since it is relatively intuitive to form 𝒙 so that its views share the relevant information content one hopes to capture. On the other hand, the design of the 'anti-collapse' criterion is much closer to a game of Whac-A-Mole. Today's designs rely on many different under-specified safeguards which are carefully combined in the hope that degenerate shortcut solutions are avoided during training. Such mechanisms include (i) feature whitening [Ermolov et al., 2021, Bardes et al., 2021], (ii) negative samples [Chen et al., 2020a, He et al., 2020], and (iii) asymmetric views and teacher-student networks with stop-gradient [Caron et al., 2021, Assran et al., 2023]. Those mechanisms all suffer from at least two of the following limitations: (i)
under-specification, i.e., the criteria can be minimized while embeddings are in a degenerate configuration, (ii) quadratic time and memory complexity with mini-batch size and/or embedding dimension, (iii) sensitivity to data distribution, hyperparameters, architecture, and (iv) lack of theoretical understanding and guarantees.
The Need for Actionable Theory
For decades, the two major solutions for AI were supervised learning [LeCun et al., 2015] and learning by reconstruction [Rumelhart et al., 1986]-sometimes combined together, e.g., for semi-supervised learning [Kingma et al., 2014]. In supervised learning, the labels both ensure that semantically similar samples are close to each other in embedding space while preventing complete representation collapse. In particular, it is possible to measure the amount of collapse in supervised learning as a function of the number of classes [Papyan et al., 2020]. The reconstruction objective is similarly well suited to prevent representation collapse as the original input must be recovered from the embeddings, i.e., the embeddings must be as informative about the input as possible-up to some optional denoising tasks that users can setup as part of the training [Vincent et al., 2010].
Because supervised and reconstruction-based learning have been widely studied for decades, there exists a large body of work to explain and inform practical designs-as well as studying their limitations in producing foundation models [Balestriero and LeCun, 2024, Van Assel et al., 2025]. This is not the case for the more recent JEPAs where empirical advances quickly outpace anyone hoping to delve into their inner workings. This dynamic led the community to focus on post-hoc theoretical justification of already found solutions [Liu et al., 2021, Shwartz Ziv
and LeCun, 2024, Shwartz-Ziv et al., 2022, Zhang et al., 2023]. In most cases, those studies involve Mutual Information (MI) [Shannon, 1948, Cover, 1999], whose different bounds recover established methods [Gutmann and Hyvärinen, 2010, Ma and Collins, 2018, Oord et al., 2018, Poole et al., 2019, Hjelm et al., 2018, McAllester and Stratos, 2020]. Because existing studies focus on explaining and interpreting already developed JEPAs, too little principled guidance and innovation has been brought forward. Instead, most of the recent empirical advances take the form of collecting larger datasets, scaling up pre-existing training recipes [Goyal et al., 2019, Chen et al., 2020b, Oquab et al., 2023, Fan et al., 2025], and deriving novel data curation processes [Vo et al., 2024, Kerdreux et al., 2025].
In contrast, our goal in the following Sections 3 to 5 will be to derive a novel JEPA solution from first principles, i.e., one whose design relies on proven necessary conditions for optimality, and with a pretraining recipe that can finally reconcile exploratory research, scalability, and state-of-the-art performance.
Latent Euclidean: Embeddings Should be Isotropic Gaussian
We address a fundamental question: which distribution should Enc ( 𝒙 ) follow to minimize empirical risk on any downstream task? We prove that the isotropic Gaussian is the unique optimal distribution for both linear (Section 3.1) and nonlinear probing (Section 3.2), with geometric intuition provided in Section 3.3. This theoretical result establishes the necessary design principle for our JEPA; Section 4 then provides the practical implementation to achieve it.
Linear Probing
We begin by identifying the optimal distribution for 𝑓 𝜽 's embeddings by analyzing linear probes-one of the most popular methods for frozen encoder evaluation. Specifically, we ask: which distribution for 𝑓 𝜽 ( 𝒙 ) would be most favorable for solving arbitrary downstream tasks, i.e., for any realization of targets 𝒚 ?
Denote as 𝒁 ∈ R^{𝑁×𝐾} the matrix of 𝑁 embeddings, each 𝐾-dimensional, obtained from 𝑓 𝜽(𝒙 𝑛). The unknown corresponding labels are denoted as 𝒚 ∈ R^𝑁. Without loss of generality, we consider univariate targets; the following analysis extends to multivariate targets. The linear probe minimizes the following least squares problem [Bishop and Nasrabadi, 2006]

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{K}} \; \|\boldsymbol{Z}\beta - \boldsymbol{y}\|_2^2 + \lambda \|\beta\|_2^2,$$
where ˆ𝛽 denotes the optimal probe parameters, and 𝜆 ≥ 0 is a hyperparameter controlling the Tikhonov regularizer strength [Bishop, 1995, Golub et al., 1999]. Despite not knowing 𝒚, it is possible to describe the bias and variance of the estimator ˆ𝛽 as a function of the distribution of 𝒁. Consider two embeddings with identical column spans, 𝒁 aniso and 𝒁 iso. The eigenvalues of 𝒁 aniso's covariance matrix are given by {𝜆𝑘}𝐾𝑘=1 with at least two distinct values, while the eigenvalues of 𝒁 iso's covariance matrix are all equal to (1/𝐾) ∑𝐾𝑘=1 𝜆𝑘. Hence, the two candidate embeddings 𝒁 aniso and 𝒁 iso capture the same intrinsic features and have the same energy, but different geometries.
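This effect can be checked numerically. Below is a sketch (assuming Gaussian embeddings and a fixed ground-truth β, both illustrative choices) comparing the spread of the ridge estimator under isotropic versus anisotropic covariances of equal trace:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, lam, trials = 200, 8, 1e-2, 200
beta_true = np.ones(K)

def ridge_fit(Z, y, lam):
    # closed form: (Z^T Z + lam I)^{-1} Z^T y
    return np.linalg.solve(Z.T @ Z + lam * np.eye(K), Z.T @ y)

def estimator_spread(eigs):
    # eigs: covariance eigenvalues; both settings below share the same trace
    scales = np.sqrt(np.asarray(eigs))
    betas = []
    for _ in range(trials):
        Z = rng.normal(size=(N, K)) * scales
        y = Z @ beta_true + rng.normal(size=N)
        betas.append(ridge_fit(Z, y, lam))
    # total variance of the fitted coefficients across training sets
    return np.var(np.stack(betas), axis=0).sum()

iso = estimator_spread([1.0] * K)                  # isotropic, trace K
aniso = estimator_spread([3.5] * 2 + [1 / 6] * 6)  # anisotropic, trace K
print(iso < aniso)  # True: anisotropy inflates estimator variance
```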
Nonlinear Probing
To allow for more flexible evaluation of the pretrained encoder 𝑓 𝜽 , it has become increasingly common to work with a nonlinear probe. We analyze two widely-used nonlinear methods: radius-based k-NN [Taunk et al., 2019, Sun and Huang, 2010, Zhang et al., 2017, Abu Alfeilat et al., 2019] for its simplicity and kernel methods [Nadaraya, 1964, Watson, 1964] for their theoretical tractability.
As in Section 3.1, we ask ourselves which distribution of embeddings would be preferable for a foundation model. We first define our prediction function. The training data consists of the 𝑁 embeddings along with their training labels {( 𝒛 𝑛 , 𝒚 𝑛 )} 𝑁 𝑛 = 1 . The prediction, using radius-based k-NN for a query vector 𝒒 is formed as
$$\hat{\boldsymbol{y}}(\boldsymbol{q}) = \frac{1}{|\mathcal{N}_{r_0}(\boldsymbol{q})|} \sum_{n \in \mathcal{N}_{r_0}(\boldsymbol{q})} \boldsymbol{y}_n, \tag{kNN}$$
where 𝒩𝑟0(𝒒) = {𝑛 : ‖𝒛 𝑛 − 𝒒‖ ≤ 𝑟0}. The specific choice of radius 𝑟0 controls how many neighbors' predictions are averaged to form the query's prediction. The kernel's prediction at a query 𝒒 ∈ R^𝐾 is given by
$$\hat{\boldsymbol{y}}(\boldsymbol{q}) = \frac{\sum_{n=1}^{N} K_h(\boldsymbol{q} - \boldsymbol{z}_n)\, \boldsymbol{y}_n}{\sum_{n=1}^{N} K_h(\boldsymbol{q} - \boldsymbol{z}_n)}.$$
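The radius-based k-NN prediction rule can be sketched in a few lines (toy data, univariate targets; the zero fallback for an empty neighborhood is an illustrative convention):

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 4, 200
z = rng.normal(size=(N, K))              # training embeddings z_n
y = z[:, 0] + 0.1 * rng.normal(size=N)   # toy targets y_n

def knn_predict(q, z, y, r0):
    # N_{r0}(q) = {n : ||z_n - q|| <= r0}
    idx = np.linalg.norm(z - q, axis=1) <= r0
    if not idx.any():
        return 0.0  # no neighbor within radius (convention)
    return y[idx].mean()

q = np.zeros(K)
print(round(knn_predict(q, z, y, r0=1.5), 3))
```

As the text notes, r0 trades off locality against the number of averaged neighbors; taking r0 large enough recovers the global mean of the training labels.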
We search over all distributions of 𝒁 subject to a fixed total variance constraint, e.g., Tr(Cov(𝒁)) = 𝜅1 or ‖Cov(𝒁)‖𝐹 = 𝜅2. The specific value of 𝜅 does not affect the optimal distribution shape. Following the same type of derivations as done in the linear regime, with the exception of some additional regularity conditions, we are able to precisely identify the isotropic Gaussian as the unique optimum to minimize bias, as formalized below.

Figure 3. Illustration of Lemma 2 showcasing how anisotropic (right) embeddings lead to higher-variance estimators compared to isotropic embeddings (left). We sample 100 training points for the 2-class classification task and fit a logistic regression, repeating the process over numerous training set samples. Each sampling results in a decision boundary (purple).
Kernel Probing
As an alternative to (kNN), it is also common to leverage kernel methods, which we consider in this section. Consider a kernel 𝐾 : R 𝐾 → R with the following standard properties
$$\int K(\boldsymbol{t})\, d\boldsymbol{t} = 1, \qquad \int \boldsymbol{t}\, K(\boldsymbol{t})\, d\boldsymbol{t} = \boldsymbol{0}, \qquad \int \boldsymbol{t}\boldsymbol{t}^{\top} K(\boldsymbol{t})\, d\boldsymbol{t} = \mu_2(K)\, \boldsymbol{I},$$
for some 𝜇2(𝐾) ∈ (0, ∞). For some bandwidth ℎ > 0, and denoting 𝐾ℎ(𝒕) ≜ ℎ^{−𝑑} 𝐾(𝒕/ℎ), we remind the reader that the Nadaraya-Watson estimator, introduced in Nadaraya [1964], Watson [1964], at a query 𝒒 ∈ R^𝑑 is
$$\hat{\boldsymbol{y}}(\boldsymbol{q}) = \frac{\sum_{n=1}^{N} K_h(\boldsymbol{q} - \boldsymbol{z}_n)\, \boldsymbol{y}_n}{\sum_{n=1}^{N} K_h(\boldsymbol{q} - \boldsymbol{z}_n)}, \tag{NW}$$
Similarly to (kNN), we will see that the performance of (NW) depends crucially on the distribution of the training points. We have access to our dataset of inputs from 𝑝𝑧, and for each sample 𝒛 𝑛 the corresponding target is given by 𝜂(𝒛 𝑛) = E[𝑌𝑛 | 𝒛 𝑛]. We also denote the corresponding conditional variance of the target at a point 𝒛 as 𝑣(𝒛) = Var(𝑌𝑛 | 𝒛 𝑛 = 𝒛). We follow the regularity conditions of the k-NN probing derivations and additionally assume that 𝑝𝑧 has sufficiently light tails so that, for each coordinate 𝑗, lim‖𝒛‖→∞ 𝑝𝑧(𝒛) = 0 and lim‖𝒛‖→∞ 𝑧𝑗 𝑝𝑧(𝒛) = 0. We first derive the pointwise bias and variance of ˆ𝒚(𝒒).
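A sketch of the (NW) estimator with a Gaussian kernel (the kernel choice and the toy regression target are assumptions for illustration; the h^{-d} normalization cancels in the ratio):

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, h = 3, 500, 0.5
z = rng.normal(size=(N, K))                 # training points from p_z
y = np.sin(z[:, 0]) + 0.1 * rng.normal(size=N)  # noisy targets, eta(z) = sin(z_1)

def nw_predict(q, z, y, h):
    # Gaussian K_h(t) = h^{-d} K(t / h); h^{-d} cancels between numerator and denominator
    w = np.exp(-0.5 * np.sum(((z - q) / h) ** 2, axis=1))
    return (w @ y) / w.sum()

q = np.zeros(K)
est = nw_predict(q, z, y, h)
print(round(float(est), 3))  # estimates eta(0) = sin(0) = 0
```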
Geometric and Practical Insights
We now empirically validate that the isotropic Gaussian is optimal when no information about downstream tasks is available. We focus on linear probing (Section 3.1), where all considered distributions have the same total variance.
When employing a linear probe, an anisotropic distribution increases both bias (with Tikhonov regularization) and variance. Examining bias first (Lemma 1), we present in Figure 18 visualizations for both continuous regression and discrete classification tasks. We observe that the cosine similarity between estimated and ground-truth parameters equals 1 only for isotropic distributions, degrading for anisotropic cases regardless of sample size or regularization strength. Regarding variance (Lemma 2), we show in Figure 3 that learned parameters vary significantly more across training sets when the covariance is anisotropic (right) compared to isotropic (left), even when using logistic regression instead of OLS. Figure 17 further illustrates this effect, showing the distribution of learned 𝛽 parameters across different training samples for both cases. The anisotropic distribution clearly produces higher-variance estimators.
These theoretical and empirical results establish our design principle for LeJEPA: embeddings 𝑓 𝜽 ( 𝒙 ) should follow an isotropic Gaussian distribution to minimize worst-case risk across downstream tasks encountered post-training . Section 4 introduces a novel regularizer to achieve this distribution.
SIGReg: Reliable Isotropic Gaussian Regularization in High-Dimension
Having established the isotropic Gaussian as the optimal embedding distribution (Section 3), we now introduce Sketched Isotropic Gaussian Regularization (SIGReg)-a distribution matching objective that is simultaneously (i) differentiable , (ii) scalable , (iii) provable , and (iv) interpretable . SIGReg builds on three key innovations. First, we formulate distribution matching as a statistical test under the null hypothesis 𝑃 𝜽 = 𝑄 (Section 4.1). Second, we identify a test that guarantees bounded gradients and curvature while maintaining linear complexity and efficient multi-GPU scaling (Section 4.2). Third, SIGReg bypasses the curse of dimensionality, eliminating collapsed shortcut solutions entirely (Section 4.3).
Hypothesis Testing as a Judge
Asking for 𝑓 𝜽(𝒙)'s distribution 𝑃 𝜽 to match a target distribution 𝑄 is typically done by creating various measures of distance or divergence, and estimating them in high dimension. We propose a different starting point grounded in statistics. Consider the hypothesis testing framework [Fisher, 1928, Neyman and Pearson, 1933] given by
$$H_0 : P_{\boldsymbol{\theta}} = Q \quad \text{against} \quad H_1 : P_{\boldsymbol{\theta}} \neq Q, \tag{2}$$
with 𝐻 0 being referred to as the null hypothesis . That is, we are asking in Equation (2) if there is enough empirical evidence to reject the null. To answer that question, one (i) employs a test-statistic , i.e., a single scalar value summarizing the evidence from the empirical samples, (ii) determines a critical value 𝜏𝛼 for the test-statistic based on the probability 𝛼 of Type I error, i.e., of mistakenly rejecting a true null hypothesis, (iii) compares the test-statistic to
the critical value 𝜏𝛼 ; if the test-statistic exceeds 𝜏𝛼 , reject the null hypothesis. If the null is not rejected, we can only claim that there is not sufficient empirical evidence against 𝑃 𝜽 = 𝑄 .
As it stands, Equation (2) remains impractical in large dimension as existing tests have at least quadratic complexity in the number of samples considered (more details in Section F). We thus propose to derive a sketching strategy by decomposing Equation (2) into simpler univariate tests. Denoting the push-forward distributions 𝑃^{(𝒂)}_𝜽 ≜ (𝒂⊤)#𝑃 𝜽 and 𝑄^{(𝒂)} ≜ (𝒂⊤)#𝑄, we can define the following directional univariate test
$$H_0^{(\boldsymbol{a})} : P_{\boldsymbol{\theta}}^{(\boldsymbol{a})} = Q^{(\boldsymbol{a})} \quad \text{against} \quad H_1^{(\boldsymbol{a})} : P_{\boldsymbol{\theta}}^{(\boldsymbol{a})} \neq Q^{(\boldsymbol{a})}, \tag{3}$$
for a given unit-norm direction 𝒂 ∈ 𝕊^{𝐾−1}. The corresponding directional test-statistic of Equation (3) is computed as 𝑇({𝒂⊤𝑓 𝜽(𝒙 𝑛)}𝑁𝑛=1). Examples of tests 𝑇 will be provided in Section 4.2. Repeating that process over a set of 𝑀 directions 𝒜 ≜ {𝒂1, . . . , 𝒂𝑀} and aggregating the individual values leads to the following global test-statistic
$$\widehat{T}(\mathcal{A}) = \frac{1}{M} \sum_{m=1}^{M} T\big(\{\boldsymbol{a}_m^{\top} f_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\}_{n=1}^{N}\big). \tag{4}$$
We now provide a formal statement asserting the consistency of Equation (4) to test the original multivariate null hypothesis from Equation (2). Our result leverages the well-known union-intersection principle [Roy, 1953] and a slightly modified Cramér-Wold theorem. We denote by =ᵈ equality in distribution.
SIGReg: Sketching the Epps-Pulley Test is Stable and Scalable
Our proposed regularizer, coined Sketched Isotropic Gaussian Regularization (SIGReg), follows directly from Theorem 2 using any statistical test 𝑇 targeted towards the isotropic Gaussian; it is illustrated in Figures 2 and 5 and formalized below.
Moments are Unstable and Insufficient
Figure 5. Constructed data density with 'X' distribution whose marginals are standard Gaussian and whose covariance is the identity (left densities). Applying 𝑀 = 10 projections on half-circle directions produces 10 univariate distributions that can be compared against a standard Gaussian (left) using any preferred statistic from Section 4.2. The appropriate direction is able to capture the degenerate distribution of the data, thereby creating a spike in the statistic value.

The first family of statistics we consider are moment-based. Taking the standard Gaussian as an instantiation for the moments, we can define the Jarque-Bera test [Jarque and Bera, 1980] that compares the third and fourth moments, i.e., skewness and kurtosis, as

$$JB = \frac{N}{6}\left(\widehat{\mathrm{skew}}^2 + \frac{(\widehat{\mathrm{kurt}} - 3)^2}{4}\right), \tag{Jarque-Bera}$$
where $\widehat{\mathrm{skew}}$ is the skewness computed from the data as $\frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^3/\hat{\sigma}^3$ and $\widehat{\mathrm{kurt}}$ is the kurtosis $\frac{1}{N}\sum_{i=1}^{N}(x_i - \hat{\mu})^4/\hat{\sigma}^4$. Typically, the (Jarque-Bera) test is used to check whether a density follows a Gaussian distribution of any mean and variance, hence it only looks at moments 3 and 4. In our case we aim for a standard Gaussian test and thus add the usual statistics on the first two moments, leading to the extended test
$$\mathrm{EJB} = N\hat{\mu}^2 + \frac{N}{2}\big(\hat{\sigma}^2 - 1\big)^2 + \frac{N}{6}\left(\widehat{\mathrm{skew}}^2 + \frac{(\widehat{\mathrm{kurt}} - 3)^2}{4}\right). \tag{Extended Jarque-Bera}$$
The (Extended Jarque-Bera) test acts as a moment matching problem over the first four moments. Such moment matching methods have proven powerful not only for statistical tests but also as a means to learn parametric and nonparametric models of data.
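The Jarque-Bera statistic is straightforward to compute from sample moments; a sketch on toy data, using the standard N/6 [skew² + (kurt − 3)²/4] normalization:

```python
import numpy as np

def jarque_bera(x):
    n = x.size
    mu, sig = x.mean(), x.std()
    skew = np.mean((x - mu) ** 3) / sig ** 3
    kurt = np.mean((x - mu) ** 4) / sig ** 4
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(0)
gauss = rng.normal(size=5000)
heavy = rng.standard_t(df=3, size=5000)  # heavy tails inflate the kurtosis term
print(jarque_bera(gauss) < jarque_bera(heavy))  # True
```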
The Stability and Identifiability Conundrum. We now explain why moment-based tests, albeit powerful, are not suited for LeJEPA. The 𝑘-th moment of a distribution 𝑃 is denoted as 𝑚𝑘(𝑃). The first observation is that well-behaved distributions abiding by Carleman's condition $\sum_{k=1}^{\infty} m_{2k}(Q)^{-1/(2k)} = \infty$ [Carleman, 1926], such as the Gaussian, or distributions supported on a finite interval [Hausdorff, 1923], are uniquely determined by their moments. However, using a finite number of moments creates the following non-identifiability issue, which is well known in statistics and often used as a motivation to use all moments [Lehmann and Romano, 2005].
Cumulative Density Functions are Impractical
The second family of tests acts upon the CDF. Because those tests require sorting, let us denote the 𝑘-th order statistic of 𝑁 samples by 𝑥𝑘:𝑁. Two highly standard tests are quadratic Empirical Distribution Function statistics with different weightings, known as Cramér-von Mises [Cramér, 1928, Von Mises, 1981] and Anderson-Darling [Anderson and Darling, 1952], given by
$$W^2 = N \int_{-\infty}^{\infty} \big(F_N(x) - F(x)\big)^2\, dF(x), \tag{Cramér-von Mises}$$

$$A^2 = N \int_{-\infty}^{\infty} \big(F_N(x) - F(x)\big)^2\, w(x)\, dF(x), \tag{Anderson-Darling}$$
where 𝐹𝑁 is the empirical CDF, 𝐹 the target CDF, and 𝑤(𝑥) a weighting function, set to 𝑤(𝑥) = [𝐹(𝑥)(1 − 𝐹(𝑥))]^{−1} for Anderson-Darling. Adding the 𝑈² correction on top of Equation (Cramér-von Mises) recovers the Watson test [Watson, 1961]

$$U^2 = W^2 - N\left(\frac{1}{N}\sum_{k=1}^{N} F(x_{k:N}) - \frac{1}{2}\right)^2, \tag{Watson}$$

with 𝑊² the (Cramér-von Mises) statistic.

Figure 6. 𝑁 = 100 samples are drawn from a 1024-dimensional standard Gaussian, and the first 2 coordinates are altered to produce the 'X' distribution from Figure 5 (left-most column). For each statistic (all other columns), we perform gradient descent on the samples to minimize its value, at each iteration sampling 𝑀 = 10 random directions to evaluate SIGReg (recall Definition 2). We observe that although this is a high-dimensional distribution with a limited number of samples, SIGReg is able to capture the degenerate subspace and adapt the data accordingly to match an isotropic Gaussian distribution. Additional figures with varying dimensions and numbers of 1d projections are provided in Figure 16.
We do not consider the Kolmogorov-Smirnov test [Kolmogorov, 1933] as it employs the ℓ∞-norm instead of the ℓ2-norm, thereby producing sparse gradients. Another common test is the Shapiro-Wilk test [Shapiro and Wilk, 1965], which we found to be unstable in practice; details are provided in Section E.
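For reference, these EDF statistics are available off the shelf, e.g., `scipy.stats.cramervonmises` (assuming SciPy ≥ 1.6 is available); a quick check that the statistic separates Gaussian from non-Gaussian samples:

```python
import numpy as np
from scipy.stats import cramervonmises

rng = np.random.default_rng(0)
gauss = rng.normal(size=2000)
unif = rng.uniform(-3, 3, size=2000)

w2_g = cramervonmises(gauss, 'norm').statistic
w2_u = cramervonmises(unif, 'norm').statistic
print(w2_g < w2_u)  # True: uniform samples deviate from the N(0, 1) CDF
```

As the next paragraph explains, the obstacle to using such tests as training losses is not their discriminative power but their reliance on sorting and order statistics.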
Lack of Scalability and Differentiability. CDF-based tests require sorting, which has been highly optimized, e.g., with the 𝒪(𝑁 log 𝑁) Quicksort algorithm [Hoare, 1962], but which nonetheless breaks the embarrassingly parallel nature of SGD, especially on multi-GPU setups [Tanasic et al., 2013, Maltenberger et al., 2022], due to synchronization requirements. Moreover, these tests involve non-differentiable operations (sorting and order statistics), making them unsuitable for gradient-based optimization without relaxations [Cuturi et al., 2019, Grover et al., 2019, Petersen et al., 2022]. While there exist intricate sketching solutions [Dunning and Ertl, 2019, Masson et al., 2019, Dunning, 2021], each of those solutions introduces numerous additional hyper-parameters, going against our first motivation for LeJEPA.
Characteristic Functions are Stable, Scalable and Identifiable
The third family of tests is concerned with Empirical Characteristic Functions (ECF), i.e., the Fourier transform of the density function. The Epps-Pulley test [Epps and Pulley, 1983] is one of the most popular such tests and simply compares, in weighted ℓ2-norm, the ECF of the data against a target CF
$$T_{\mathrm{EP}} = N \int_{-\infty}^{\infty} \left|\hat{\phi}_X(t) - e^{-t^2/2}\right|^2 w(t)\, dt. \tag{Epps-Pulley}$$
The first crucial observation is that the ECF, defined as $\hat{\phi}_X(t) = \frac{1}{N}\sum_{j=1}^{N} e^{itX_j}$, is naturally differentiable and easily computed in distributed settings via efficient all_reduce operations, as the ECF is a simple average of complex exponentials. The weight function is typically Gaussian, such as 𝑤(𝑡) = 𝑒^{−𝑡²/𝜎²} with 𝜎 commonly set to 1.
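A univariate sketch of the Epps-Pulley statistic using a finite integration grid (the grid [−5, 5] with 17 points mirrors Algorithm 1; the e^{−t²/2} weight is an assumption for illustration):

```python
import numpy as np

def epps_pulley(x):
    # empirical CF of the samples on a fixed grid of t values
    t = np.linspace(-5.0, 5.0, 17)
    ecf = np.exp(1j * np.outer(x, t)).mean(axis=0)
    target = np.exp(-0.5 * t ** 2)           # CF of N(0, 1)
    err = np.abs(ecf - target) ** 2 * target  # Gaussian weight w(t)
    # trapezoidal rule on the uniform grid (avoids np.trapz/np.trapezoid naming churn)
    dt = t[1] - t[0]
    return x.size * dt * (err.sum() - 0.5 * (err[0] + err[-1]))

rng = np.random.default_rng(0)
ep_gauss = epps_pulley(rng.normal(size=4000))
ep_unif = epps_pulley(rng.uniform(-1, 1, size=4000))
print(ep_gauss < ep_unif)  # True
```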
Epps-Pulley has bounded loss, gradient and curvature. We now consider the remaining two families of tests: moment-based and CF-based. First, recall that moments are polynomial in the data and with extreme growth rate
Other tests, e.g., based on the Entropy [Székely and Rizzo, 2005] are not considered here as they require numerous additional design choices for the univariate Entropy estimation [Silverman, 2018, Beirlant et al., 1997], e.g., using kernels [Joe, 1989], or M-estimators [Miller, 2003].
Algorithm 1. SIGReg with Epps-Pulley statistic with DDP support and /u1D4AA ( 𝑁 ) time and memory complexity. x is a (N, K) tensor, num_slices is | A | in def. 2, 'global_step' is used for sync. sampling across GPUs and can be omited for single-GPU training. An optimized implementation with caching is also provided in our official codebase, computation times provided in Table 6.
def SIGReg(x , global_step , num_slices=256) : # s l i c e sampling --synced across devices --dev = dict ( device=x . device ) g = t o r c h . Generator (∗∗ dev ) g. manual_seed ( global_step ) proj_shape = ( x . s i z e ( 1 ) , num_slices ) A = t o r c h . randn ( proj_shape , generator=g, ∗∗dev) A / = A. norm(p=2, dim=0) # --Epps-Pulley st a t . see Sec. 4.3 f o r a l t . --# i n t e g r a t i o n points t = t o r c h . l i n s p a c e ( -5 , 5, 17, ∗∗dev) # t h e o r e t i c a l CF f o r N(0 , 1) and Gauss. window exp_f = t o r c h . exp ( -0.5 ∗ t ∗∗2) # empirical CF -gathered across devices --x_t = ( x @ A) . unsqueeze (2) ∗ t # (N, M, T) ecf = (1 j ∗ x_t ) . exp ( ) . mean( 0 ) ecf = all_reduce ( ecf , op="AVG" ) # weighted L2 distance err = ( ecf -exp_f ) . abs ( ) . square ( ) . mul ( exp_f ) N = x . s i z e ( 0 ) ∗ world_size T = t o r c h . t r a p z ( e r r , t , dim=1) ∗ N return T
for higher moments, assuming they even exist. Even for well-behaved distributions, raising values to a power of 𝑘 can quickly lead to exploding gradients. This comes in sharp contrast with the ECF, which is always bounded and has bounded gradients for any input distribution of the projected samples $z_n = \boldsymbol{a}^\top f_{\boldsymbol{\theta}}(\boldsymbol{x}_n)$, $n = 1, \dots, N$.
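A quick numerical illustration of this contrast (ours): on heavy-tailed projected samples, a high-order empirical moment explodes while the ECF magnitude stays bounded by 1 at every t:

```python
import numpy as np

# Illustration (ours, not from the paper): higher-order moments explode on
# heavy-tailed projections, while the ECF magnitude is always at most 1.
rng = np.random.default_rng(0)
z = rng.standard_t(df=3, size=100_000)          # heavy-tailed projected samples

m8 = np.mean(z ** 8)                            # 8th empirical moment: huge, unstable
ecf_mag = np.abs(np.exp(1j * 2.0 * z).mean())   # |ECF| at t = 2: always <= 1

print(m8, ecf_mag)
```

For a Student-t distribution with 3 degrees of freedom the 8th moment does not even exist, so the empirical estimate is dominated by the largest samples, whereas the ECF remains perfectly stable.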
How SIGReg Beats the Curse of Dimensionality
This last section seeks to characterize how many slices in A one must sample for (SIGReg) to be an effective statistical test. That design is crucial if we hope for LeJEPA to successfully converge towards isotropic Gaussian embeddings.
Smoothness Beats the Curse of Dimensionality
Our first argument for a favorable scaling of | A | with the embedding dimension 𝐾 relies on the smoothness of 𝑃 𝜽 as measured by its Sobolev regularity 𝛼 [Adams and Fournier, 2003]. We formalize below a bound on the directional test from Equation (3) over all possible directions 𝒂 when the test statistic is minimized over | A | = 𝑀 directions. While we provide bounds on the expected discrepancy over random directions 𝒂 when the EP test is satisfied (equals zero) on a finite set of directions, the provided proof covers moment-based and CDF-based tests as well.

As | A | → ∞, the bound decays as $|A|^{-2\alpha/(K-1)}$, showing that | A | = 𝑂(𝐾) directions suffice for 𝜖-approximation when 𝛼 is large. Some examples of embedding densities with varying 𝛼 are provided in Figure 4. The following statement characterizes how the 𝑀 directions actually constrain the entire space as a function of 𝛼. The constant $C(K, \alpha) = \frac{2^{2\alpha}\, \pi^{(K-1)/2}\, \Gamma\left(\alpha + \frac{K-1}{2}\right)}{(K-1)\, \Gamma(\alpha)\, \Gamma\left(\frac{K-1}{2}\right)}$ is visualized in Figure 15 (left), depicting how 𝛼 and | A | interact. In words, we obtain that thanks to the natural smoothness of DNs, stemming either from the architecture or from the implicit and explicit regularizers used during training, applying SIGReg on | A | directions can be sufficient to tightly constrain the entire space. We note that considering the worst case over 𝒂, or using low-discrepancy sequences for 𝒂, does not impact the asymptotic bounds; details are provided in Section D.
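To make the interplay between 𝛼 and | A | concrete, the constant and the decay factor can be evaluated numerically. The sketch below is ours; it assumes the bound takes the form $C(K, \alpha) \cdot |A|^{-2\alpha/(K-1)}$ suggested by the text, and uses log-gamma for numerical stability:

```python
import math

def log_C(K, alpha):
    # log of the constant C(K, alpha) defined above, via log-gamma for stability
    return (2 * alpha * math.log(2)
            + 0.5 * (K - 1) * math.log(math.pi)
            + math.lgamma(alpha + (K - 1) / 2)
            - math.log(K - 1) - math.lgamma(alpha) - math.lgamma((K - 1) / 2))

def decay(M, K, alpha):
    # relative decay factor |A|^(-2 alpha / (K - 1)) of the bound
    return M ** (-2 * alpha / (K - 1))

# smoother densities (larger alpha) enjoy a faster relative decay in |A|
print(decay(256, K=64, alpha=4.0), decay(256, K=64, alpha=16.0))
print(log_C(64, 4.0))
```

Note that the constant itself grows with 𝛼, so the asymptotic advantage of smoothness is in the decay rate, matching the "when 𝛼 is large" qualifier above.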
SGD Beats the Curse of Dimensionality
Our second argument leverages the iterative nature of DN training. Although | A | may only be a few hundred at each step, the cumulative number of sampled directions grows linearly with training time. This resampling effect (illustrated in Figure 7, bottom) enables rapid convergence: even a small | A | achieves tight distributional matching compared to keeping the set A fixed throughout minibatches (recall thm. 5). Our experiments show that | A | as low as 16 can easily outperform a fixed set with | A | on the order of thousands, thanks to the compounding effect of resampling at each minibatch.
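The compounding effect can be visualized geometrically. In the toy sketch below (ours), we measure how well an arbitrary fixed direction u is approached by the best of M fixed slices versus the cumulative T·M slices seen across T resampling steps:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, T = 64, 16, 500
u = rng.standard_normal(K)
u /= np.linalg.norm(u)                       # an arbitrary direction to cover

def best_alignment(directions):
    # largest |cosine| between u and any sampled slice
    return float(np.abs(directions.T @ u).max())

fixed = rng.standard_normal((K, M))
fixed /= np.linalg.norm(fixed, axis=0)       # one fixed set of M slices
cumulative = rng.standard_normal((K, M * T))
cumulative /= np.linalg.norm(cumulative, axis=0)  # T steps of fresh slices

print(best_alignment(fixed), best_alignment(cumulative))
```

With fresh slices every minibatch, any given direction is eventually well covered, which is the intuition behind the compounding effect of resampling.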
Empirical Validation on Synthetic Data
We conclude this section with a controlled experiment applying (SIGReg) with gradient-based training to produce isotropic embeddings. In this setup, we directly consider embeddings 𝒁 which we differentiate and optimize to minimize (SIGReg). By directly optimizing the embeddings, we are able to observe the impact of the loss without any constraint or regularization that would otherwise come from the architecture. We sample 𝑁 i.i.d. samples 𝒙 𝑛 in a 𝐷-dimensional space. This sampling is based on an isotropic Gaussian distribution-but the first
Algorithm 2. LeJEPA implementation-works out-of-the-box on any dataset, with DDP, and with any backbone, e.g., torchvision or timm. For non-ViT architectures (e.g., ResNet), set global_views = all_views. We use bs for the minibatch size; SIGReg is from algorithm 1.
def LeJEPA(global_views, all_views, lambd):
    """global_views and all_views are lists of tensors, lambd is a scalar"""
    # embedding of global views
    g_emb = forward(torch.cat(global_views))
    # embedding of local views
    # if resnet: skip with a_emb = g_emb
    a_emb = forward(torch.cat(all_views))
    # LeJEPA loss
    centers = g_emb.view(-1, bs, K).mean(0)
    a_emb = a_emb.view(-1, bs, K)
    sim = (centers - a_emb).square().mean()
    sigreg = mean(SIGReg(emb, global_step) for emb in a_emb)
    return (1 - lambd) * sim + lambd * sigreg
two dimensions are again set to the adversarial 'X' shape. That is, among the 𝐷 dimensions, only two must be transformed, as all the other ones already obey the isotropic Gaussian target distribution. We then make the samples 𝒙 𝑛 differentiable and optimize them to minimize the value of the different statistical tests computed on 𝑀 random directions. Those directions are resampled after each gradient step-which follows the procedure we will employ in LeJEPA. We present the results in Figure 6, demonstrating that even in the challenging case of 𝐷 = 512 and 𝑀 = 16, SIGReg is able to detect the two degenerate dimensions and unfold them back to their expected shape under the target distribution.
LeJEPA: Stable and Scalable Implementation
Having established that isotropic Gaussians are the optimal embedding distribution for foundation models (Section 3) and introduced SIGReg to achieve this distribution (def. 2), we now present the complete LeJEPA framework. We first evaluate candidate statistical tests (Sections 4.2.1 and 4.2.2) and identify characteristic function-based tests as optimal for gradient-based training (Section 4.2.3). The full LeJEPA implementation follows in Section 5.1.
LeJEPA: SIGReg + Prediction Loss
We now discuss the implementation of LeJEPA starting with SIGReg and followed by the prediction and total losses.
The SIGReg Loss. We chose (Epps-Pulley) for its provable boundedness (thm. 4) and its scalability. Its implementation follows the equation exactly, except for the integral, which is estimated using a quadrature approximation. We
find that the simple trapezoidal quadrature rule is sufficient even with as few as 17 knots, as ablated in Figure 20. In particular, we leverage the symmetry of the integrand to double the number of knots for free; see the official code. On the other hand, the use of minibatches introduces a bias vanishing at rate 𝒪(1/𝑁), as formalized below.
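The symmetry trick can be checked directly: for an even integrand, a trapezoidal rule on [0, 5], doubled, equals the rule on [-5, 5] at the same spacing. A minimal sketch (ours, with an illustrative even integrand rather than the paper's exact one):

```python
import numpy as np

# Sketch (ours): the EP integrand is even in t, so a trapezoidal rule on [0, 5]
# doubled (with the t = 0 knot counted once) matches the full [-5, 5] rule
# while evaluating half as many knots.
def trapezoid(f, t):
    dt = t[1] - t[0]
    return ((f[:-1] + f[1:]) * 0.5 * dt).sum()

t_full = np.linspace(-5, 5, 33)
t_half = np.linspace(0, 5, 17)                 # same spacing, half the knots
f = lambda t: (1 - np.exp(-t**2)) * np.exp(-0.5 * t**2)   # an even integrand

full = trapezoid(f(t_full), t_full)
half = 2 * trapezoid(f(t_half), t_half)
print(full, half)   # the two estimates agree
```

This is why 17 knots on the half-domain behave like 33 knots on the full domain.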
Relation to Prior Work
Prior to presenting our experiments (Section 6), we conclude by discussing how our proposed LeJEPA and SIGReg objective relate to existing frameworks in the literature.
While there is no existing solution employing such slicing and distribution matching for JEPAs, similar pipelines exist for generative models and optimal transport. Notably, Sliced Score Matching [Song et al., 2020] proposes to leverage univariate slicing of the space to ease the estimation of a density for generative models. In a similar vein, the sliced Wasserstein distance [Bonneel et al., 2015, Nguyen and Ho, 2023] uses such a strategy to speed up and improve optimal transport. Furthermore, when the integral of the (Epps-Pulley) test is computed exactly, as opposed to our quadrature, each slice loss value recovers the kernel MMD [Sriperumbudur et al., 2010, Gretton et al., 2012, Chwialkowski et al., 2016] measuring the distance between two distributions-albeit with a quadratic complexity. Lastly, it is possible to recover some existing SSL frameworks in the limit by employing LeJEPA with a particular test-instead of the preferred (Epps-Pulley). For example, setting $T(\{x_n\}_{n=1}^{B}) = \mathrm{mean}(\{x_n\}_{n=1}^{B})^2 + (\mathrm{std}(\{x_n\}_{n=1}^{B}) - 1)^2$ and using that 𝑇 with SIGReg in LeJEPA recovers the VICReg SSL method in the limit of a large number of slices. In fact, SIGReg will enforce in expectation that E[𝒁] = 0 and Cov(𝒁) = 𝐼 𝑑, where 𝐼 𝑑 denotes the 𝑑 × 𝑑 identity matrix-derivations are provided in Section B.14. And since our invariance term is simply the ℓ2 distance between the views' embeddings, LeJEPA recovers VICReg for this degenerate statistical test. Based on thm. 3, we however strongly advocate against such a setting, as it would lead to shortcut solutions-a phenomenon already observed in VICReg.
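To illustrate why we advocate against the degenerate statistic, the sketch below (ours) applies the mean/std statistic over random slices to both Gaussian samples and a Rademacher "shortcut" with the same first two moments; the statistic cannot tell them apart:

```python
import numpy as np

# Sketch (ours): the degenerate per-slice statistic mean^2 + (std - 1)^2,
# averaged over random slices. Driving it to zero only constrains the first
# two moments, unlike the full Epps-Pulley statistic.
rng = np.random.default_rng(0)

def vicreg_like_stat(z, num_slices=512):
    A = rng.standard_normal((z.shape[1], num_slices))
    A /= np.linalg.norm(A, axis=0)
    p = z @ A
    return float((p.mean(0) ** 2 + (p.std(0) - 1) ** 2).mean())

gauss = rng.standard_normal((8192, 16))
# a shortcut solution: correct mean and covariance, highly non-Gaussian marginals
radem = rng.integers(0, 2, (8192, 16)) * 2.0 - 1.0

s_gauss = vicreg_like_stat(gauss)
s_radem = vicreg_like_stat(radem)
print(s_gauss, s_radem)   # both near zero: the test cannot separate them
```

Both distributions have zero mean and identity covariance, so this statistic is blind to the difference, whereas the Epps-Pulley statistic compares full characteristic functions and is not.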
LeJEPA: Empirical Validation
We now use the LeJEPA implementation described in Section 5.1 to demonstrate its effectiveness through comprehensive experiments. We show that LeJEPA: (i) trains reliably across diverse architectures and datasets (Section 6.1), (ii) provides an informative training loss for model selection (Section 6.2), (iii) outperforms frontier vision models on small-scale in-domain pretraining (Section 6.3), (iv) scales successfully to nearly 1 billion parameters on ImageNet-1k (Section 6.4), and (v) learns rich

Figure 8. Inet100 with 400 pretraining epochs and a ResNet-50 backbone. We depict linear probe performance as a function of 𝜆 and the number of views 𝑉 (recall (LeJEPA)). We observe that performance is stable over 𝜆-with peak performance obtained by slightly adjusting 𝜆 proportionally to the number of views. The corresponding performance values are provided in Table 7.
semantic segmentation features without explicit supervision.
LeJEPA's Stability Across Hyper-Parameters and Architectures
We now demonstrate LeJEPA's stability across hyperparameters, architectures, and experimental setups. Additional cross-domain stability results are presented in Section 6.3.
Stability across Epps-Pulley hyperparameters. We next examine hyperparameters specific to LeJEPA: the number of slices | A | in SIGReg, the integration domain for the Epps-Pulley test (Epps-Pulley), and the number of quadrature points for numerical integration. Table 1a shows ablations on ImageNet-1K with ViT-Large/14. Both the integration domain and the number of quadrature points have negligible impact on performance. This is expected: since the characteristic function is accurate at zero, the
Stability across standard hyperparameters. We begin by evaluating LeJEPA on ImageNet-100 and ImageNet-1K. On ImageNet-100, we train a ResNet-50 and vary the number of views and the loss weighting 𝜆 (Figure 8). Performance remains stable across both dimensions, leading us to recommend 𝜆 = 0.05 as a robust default. On ImageNet-1K, we train a ViT-Large/14 and explore batch size, as well as the number of global (𝑉 g) and local (𝑉 l) views (Table 1b). We find that the configuration commonly used in prior work (𝑉 g = 2, 𝑉 l = 8) transfers well to LeJEPA. Notably, LeJEPA achieves competitive performance with batch sizes as small as 128 on ImageNet-1K (Table 1c), suggesting reduced memory requirements compared to existing methods. We thus recommend 𝜆 = 0.05, 𝑉 g = 2, 𝑉 l = 8, and batch size ≥ 128 as starting points.
Table 1. ViT-Large/14, inet1k pretraining for 100 epochs, evaluated with frozen-backbone linear probing (top-1 accuracy, %). LeJEPA's performance is stable across all its hyperparameters; while some choices may slightly improve performance, e.g., the number of slices | A | and the projector sizes, none lead to a catastrophic collapse.
LeJEPA's Training Loss is Informative of Downstream Performance
A major challenge in SSL pretraining is the lack of reliable signals conveying the quality of the learned representation. As a result, it is common to monitor a supervised
Removal of popular heuristics. In addition to providing reliable performance across models and datasets, LeJEPA's provable construction enables us to remove many heuristics traditionally used to prevent collapse. First, prior work has shown both empirically and theoretically that predictors in image JEPAs (without asymmetric information) and teacher-student architectures serve primarily to prevent collapse [Grill et al., 2020, Jing et al., 2021, Tian et al., 2021, Caron et al., 2021, Chen et al., 2021]. Removing these components produces collapsed encoders, i.e., with performance at chance level. Thanks to LeJEPA's SIGReg loss, we can remove both the predictor and the teacher-student architecture without suffering from collapse, as shown in Table 4. While a teacher-student configuration does provide a small performance boost for ViT models-consistent with observations in supervised learning via Stochastic

Figure 10. (SIGReg, prediction loss) 2D plane with downstream task accuracy shown in colors from blue (low) to red (high). We clearly observe that within this plane there exist trade-off fronts between the two terms of LeJEPA producing similar downstream performance, corresponding to different values of 𝜆. Yet those fronts are linear and point towards the lower-left corner, i.e., LeJEPA's training loss is informative of downstream test performance across models and datasets (columns). Additional models and datasets are provided in Figure 21.

Figure 11. Spearman correlation (y-axis) between LeJEPA's training loss and downstream accuracy on the dataset's classification task with a frozen backbone and linear evaluation. The x-axis varies 𝛼 in Equation (8), following our scaling law of the loss w.r.t. 𝜆. Using 𝛼 = 0 recovers the plain training loss. We clearly observe a very high correlation already at 𝛼 = 0, which further increases up to 99% for 𝛼 = 0.4. The entire set of points is obtained across numerous hyper-parameters such as learning rate, weight decay, number of epochs, and 𝜆-demonstrating that LeJEPA's training loss is strongly predictive of downstream performance and can be used for label-free cross-validation.
downstream task performance, sometimes supplemented with unsupervised embedding statistics [Agrawal et al., 2022, Garrido et al., 2023, Thilak et al., 2023]. This process is highly limiting since it requires labeled data that is costly and overly specialized. This is further exacerbated in the latest JEPA models where training losses exhibit low correlation with downstream performance-and may not even decrease monotonically during training.
In contrast, we find that LeJEPA's training loss behaves much more favorably-providing us with a meaningful signal of model quality. First, we provide in Figure 10 the 2D plane spanned by the SIGReg and prediction losses, where a clear trend with downstream task accuracy can be observed. More strikingly, the combined training loss (LeJEPA) with mixing coefficient 𝜆 exhibits a very high Spearman correlation [Spearman, 1961], denoted 𝜌 𝑠, of about 85% with downstream accuracy-which is considered a strong signal. This strong relationship holds across datasets and architectures. As a result, a lower LeJEPA training loss reliably indicates better downstream performance.
We can further improve this correlation through a simple scaling law based upon the trade-off weighting hyperparameter 𝜆.
$$
$$
By setting 𝛼 ≈ 0 . 4, LeJEPA's training loss is able to achieve nearly 99% correlation with downstream performance across multiple datasets and models. We depict the changes in 𝐶 ( 𝛼 ) as a function of 𝛼 on multiple datasets and models in Figure 11, as well as the training LeJEPA loss against downstream performance in Figure 19. The strong alignment between LeJEPA's training loss and model quality enables label-free SSL model selection and cross-validation .
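Label-free model selection then amounts to a rank correlation between training losses and downstream metrics. The sketch below is ours: Spearman's 𝜌 implemented from scratch (no tie handling, no dependencies), applied to hypothetical (loss, accuracy) pairs that we made up for illustration:

```python
# Sketch (ours): Spearman's rho from scratch, for label-free model selection.
# The (loss, accuracy) pairs below are hypothetical illustration values.
def spearman(x, y):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# hypothetical checkpoints: lower training loss, higher downstream accuracy
losses = [0.91, 0.85, 0.79, 0.74, 0.70]
accs = [61.2, 64.0, 66.5, 68.1, 69.3]
print(spearman(losses, accs))  # -1.0: ranks are perfectly anti-correlated
```

In practice one would rank checkpoints by the (rescaled) training loss and keep the lowest, with no labels required.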
In-Domain LeJEPA Outperforms Frontier Model Transfer Learning
A key promise of self-supervised learning is to learn universal representations that generalize across tasks and domains. However, current frontier foundation models (e.g., DINOv2/v3, IJEPA) are pretrained on natural images, forcing practitioners in specialized domains to collect large amounts of labels for supervised finetuning. In fact, most frontier models cannot be trained directly on those domains, as the number of samples may be small and searching again for the hyper-parameters would be cum-

Figure 12. Small-architecture in-domain (Galaxy10) LeJEPA pretraining with linear probe evaluation using a frozen backbone or full finetuning (columns) and with a varying number of samples per class (x-axis). We compare against state-of-the-art foundation models (DINOv2/v3, IJEPA) over 3 different random seeds. We observe that LeJEPA enables in-domain pretraining out of the box across architectures and is able to outperform frontier foundation models. Corresponding numbers are provided in Table 3.
Table 2. Few-shot classification accuracy (percentages) on 8 datasets spanning textures, objects, and fine-grained categories. Our LeJEPA achieves superior performance on fine-grained tasks (DTD, flowers102, food101) while requiring only 100 pretraining epochs compared to I-JEPA's 300 epochs-a 3× reduction in training time and computational resources without sacrificing downstream task performance. This efficiency gain is particularly valuable for practical applications where training budget is limited. Bold indicates best performance within the IN-1K comparison group, all numbers are percentages.

Figure 13. Emergent Object Segmentation via Last Layer Thresholding. LeJEPA naturally learns to segment and track salient objects (shown in attention maps on the right of each video) without explicit supervision. The results display impressive visual quality and strong temporal consistency across video frames ( videos provided on our project page ). This emergent capability demonstrates the rich semantic representations learned through our self-supervised approach.

Figure 14. LeJEPA learns rich semantic representations through self-supervised learning. PCA visualization of last-layer features from LeJEPA (ViT-Large, 100 epochs on ImageNet-1K). For each image, features are independently projected to RGB using the first 3 principal components. Without any supervision, LeJEPA spontaneously develops semantically meaningful representations: notice how warm colors (red/magenta/pink) consistently capture foreground objects (parrot bodies, dog face), while cool colors (cyan/green/yellow) represent backgrounds and foliage. This emergent object-background separation and perceptual grouping is discovered purely from unlabeled data.
bersome yet necessary [Assran et al., 2022].
To demonstrate LeJEPA's versatility and its ability to resolve this current pain point, we propose to pretrain directly on a new domain without any change to the loss or the pretraining pipeline. We select the Galaxy10 dataset, a galaxy morphology classification task that differs significantly from natural images in both visual structure and statistical properties [Balestriero et al., 2025]. The dataset contains 11,000 training samples across 10 galaxy types. For LeJEPA, we use the default hyper-parameters and pretrain a variety of backbones for 400 epochs. We compare against the latest DINOv2, DINOv3, and IJEPA. We report in Figure 12 the top-1 accuracy for linear probing, both with a frozen backbone and with full finetuning. We observe that in-domain pretraining with LeJEPA substantially outperforms state-of-the-art frontier models (DINOv2, DINOv3) on both linear probing and full finetuning. Additional datasets and backbones are provided in Table 5, depicting LeJEPA's ability to train in-domain even with a dataset of 1,000 samples (flowers102). Coupled with LeJEPA's stability across architectures and hyper-parameters, this result offers a promising alternative in domains not yet accounted for by the latest frontier models.
LeJEPA Scales Across Data and Models
We now propose to apply LeJEPA to a larger pretraining dataset, i.e., ImageNet-1k, and to larger backbones such as ViT-Large (0.3B) and ConvNextV2-Huge (0.6B). For those two models, we reach an online linear probe accuracy on inet1k of 77.1% and 78.5%, respectively. Beyond in-distribution performance, we also explore transfer learning. For those experiments, our baselines are IJEPA with a ViT-Huge (0.6B), which is the closest to our setup, and a recent improved version of IJEPA with additional stochastic prediction tasks [Bar et al., 2023], coined IJEPA + STOP. For LeJEPA, we employ the same recipe as described in Section 6.1 and report transfer learning performance with a frozen backbone in Table 2. We observe that we consistently outperform IJEPA while employing a smaller model and a shorter training schedule. Beyond top-1 accuracy, we also echo our findings from Section 6.2 about the quality of LeJEPA's training loss. In our setup, we observe a very stable and smooth training curve, indicating a stable optimization landscape and removing the need for careful hyperparameter selection (recall thm. 4). We provide an example on a ViT-gigantic (1.8B parameters) in Figure 1.
Emergent Semantic Structure in LeJEPA Representations
A hallmark of successful self-supervised learning is the emergence of semantically meaningful attention patterns
without explicit supervision [Caron et al., 2021]. To assess whether LeJEPA learns such structure, we visualize the attention maps of the learned representations. Following DINO [Caron et al., 2021], we apply PCA to the embeddings and visualize the first principal components, which reveal clear correspondence to object boundaries and salient regions (Figure 14). Furthermore, we explore whether these attention patterns can enable unsupervised video segmentation-a challenging task requiring temporal consistency and object understanding. By thresholding the self-attention maps of the [CLS] token, we obtain binary masks that track objects across frames without any segmentation labels during training. As shown in Figure 13, LeJEPA's attention naturally segments foreground objects from background with remarkable temporal coherence, suggesting that the learned representations capture both spatial semantics and temporal structure. This emergent capability demonstrates that LeJEPA's stability-focused objective does not sacrifice the semantic richness of learned features.
Conclusion
We have established a principled theoretical framework for JEPA-based self-supervised learning that fundamentally resolves its core pathologies. Our contributions span theory and practice: we proved that isotropic Gaussian embeddings uniquely minimize worst-case downstream risk, introduced SIGReg as a tractable and provably correct method to enforce this distribution, and demonstrated that this approach eliminates representational collapse by design-and not through ad-hoc combinations of teacherstudent networks, stop-gradients, or asymmetric architectures.
We validate LeJEPA across domains and over 60 architectures, including gigantic variants with 1.8B parameters. In spite of its simplicity, LeJEPA matches state-of-the-art performance while requiring fewer than 50 lines of core implementation. Critically, our approach provides what SSL has long needed: a mathematically rigorous foundation that directly informs practical algorithm design.
Acknowledgments
We would like to thank Mike Rabbat and Lucas Maes for providing valuable feedback on the manuscript.
Additional Details on Nonlinear Probing
kNN Probing
To allow for a more flexible evaluation of the pretrained encoder 𝑓 𝜽, it is standard to work with a 𝑘-NN prober [Taunk et al., 2019], both for regression and classification. We rely on the radial 𝑘-NN variation that leverages a sample-dependent 𝑘, improving performance for non-uniform distributions of samples [Sun and Huang, 2010, Zhang et al., 2017, Abu Alfeilat et al., 2019].
We denote the underlying embedding density as $p_z \in C^3$ with derivatives of order up to 3 bounded, and finite Fisher information and covariance. This regularity condition is fulfilled by current encoders. The unknown labels come from the target function $\eta : \mathbb{R}^K \to \mathbb{R}$, assumed $C^2$. We handle classification tasks by setting $\eta(\boldsymbol{z}) = \mathbb{P}(Y = 1 \mid \boldsymbol{z})$. The training data consists of the 𝑁 embeddings along with their training labels $\{(\boldsymbol{z}_n, \eta(\boldsymbol{z}_n))\}_{n=1}^{N}$, where we denote $\boldsymbol{y}_n \triangleq \eta(\boldsymbol{z}_n)$. The prediction for a query vector 𝒒 is formed as
$$
$$
with $\boldsymbol{y}(\boldsymbol{q}) \triangleq \#\{n : \|\boldsymbol{z}_n - \boldsymbol{q}\| \leq r_0\}$ counting the number of samples within an $r_0$-radius ball around 𝒒. The radius $r_0$ controls how many neighbors' predictions are averaged to form the query's prediction. As per the linear probing's lemma. 1, we can characterize the bias of the estimator Equation (kNN) at a particular query point, as formalized below.
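A minimal sketch (ours) of this radial prober, with synthetic embeddings and a hypothetical labeling rule chosen purely for illustration:

```python
import numpy as np

# Sketch (ours) of the radial k-NN prober: average the labels of all training
# embeddings within an r0-ball of the query.
def radial_knn_predict(Z, y, q, r0):
    """Z: (N, K) embeddings, y: (N,) labels, q: (K,) query, r0: ball radius."""
    mask = np.linalg.norm(Z - q, axis=1) <= r0
    if not mask.any():
        return np.nan              # empty neighborhood: prediction undefined
    return y[mask].mean()

rng = np.random.default_rng(0)
Z = rng.standard_normal((1000, 8))
y = (Z[:, 0] > 0).astype(float)    # hypothetical labels from the first coordinate
q = np.zeros(8); q[0] = 2.0        # query deep in the positive region

pred = radial_knn_predict(Z, y, q, r0=2.0)
print(pred)                        # high: neighbors in the ball carry label 1
```

The sample-dependent variant mentioned above would adapt $r_0$ to the local density; here the radius is fixed for simplicity.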
Kernel Probing
As an alternative to (kNN), it is also common to leverage kernel methods, which we consider in this section. Consider a kernel 𝐾 : R 𝐾 → R with the following standard properties
$$
$$
for some $\mu_2(K) \in (0, \infty)$ and some bandwidth $h > 0$, and denoting $K_h(t) \triangleq h^{-d} K(t/h)$, we remind the reader that the Nadaraya-Watson estimator, introduced in Nadaraya [1964], Watson [1964], at a query $\boldsymbol{q} \in \mathbb{R}^d$ is
$$
$$
Similarly to (kNN), we will see that the performance of (NW) depends crucially on the distribution of the training points. We have access to our dataset of inputs from $p_z$, and for each sample $\boldsymbol{z}_n$ the corresponding target is given by $\eta(\boldsymbol{z}_n) = \mathbb{E}[Y_n \mid \boldsymbol{z}_n]$. We also denote the corresponding conditional variance of the target at that point as $v(x) = \mathrm{Var}(Y_i \mid X_i = x)$. We follow the regularity conditions of the k-NN probing derivations and additionally assume that 𝑝 has sufficiently light tails so that for each coordinate 𝑗, $\lim_{\|x\| \to \infty} p(x) = 0$ and $\lim_{\|x\| \to \infty} x_j\, p(x) = 0$. We first derive the pointwise bias and variance for $\hat{\boldsymbol{y}}(\boldsymbol{q})$.
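A minimal sketch (ours) of the (NW) estimator with a Gaussian kernel, on synthetic embeddings and a smooth hypothetical target chosen for illustration:

```python
import numpy as np

# Sketch (ours) of the Nadaraya-Watson prober with a Gaussian kernel K_h.
def nw_predict(Z, y, q, h):
    """Z: (N, K) embeddings, y: (N,) targets, q: (K,) query, h: bandwidth."""
    d2 = ((Z - q) ** 2).sum(axis=1)
    w = np.exp(-0.5 * d2 / h**2)       # Gaussian kernel weights
    return float((w * y).sum() / w.sum())

rng = np.random.default_rng(0)
Z = rng.standard_normal((2000, 4))
y = Z[:, 0] ** 2                       # a smooth hypothetical target eta(z) = z_0^2
q = np.array([1.0, 0.0, 0.0, 0.0])

pred = nw_predict(Z, y, q, h=0.5)
print(pred)                            # roughly eta(q) = 1, up to smoothing bias
```

The smoothing bias pulls the estimate toward regions of higher training density, which is exactly why the distribution of the embeddings matters for the prober's quality.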
Proofs
Proof of the Linear Probe Bias Theorem
To allow for more flexible evaluation of the pretrained encoder 𝑓 𝜽 , it has become increasingly common to work with a nonlinear probe. We analyze two widely-used nonlinear methods: radius-based k-NN [Taunk et al., 2019, Sun and Huang, 2010, Zhang et al., 2017, Abu Alfeilat et al., 2019] for its simplicity and kernel methods [Nadaraya, 1964, Watson, 1964] for their theoretical tractability.
As in Section 3.1, we ask ourselves which distribution of embeddings would be preferable for a foundation model. We first define our prediction function. The training data consists of the 𝑁 embeddings along with their training labels {( 𝒛 𝑛 , 𝒚 𝑛 )} 𝑁 𝑛 = 1 . The prediction, using radius-based k-NN for a query vector 𝒒 is formed as
$$
$$
where $\mathcal{N}_{r_0}(\boldsymbol{q}) = \{n : \|\boldsymbol{z}_n - \boldsymbol{q}\| \leq r_0\}$. The specific choice of radius $r_0$ controls how many neighbors' predictions are averaged to form the query's prediction. The kernel's prediction at a query $\boldsymbol{q} \in \mathbb{R}^K$ is given by
$$
$$
We search over all distributions of 𝒁 subject to a fixed total variance constraint, e.g., $\mathrm{Tr}(\mathrm{Cov}(\boldsymbol{Z})) = \kappa_1$ or $\|\mathrm{Cov}(\boldsymbol{Z})\|_F = \kappa_2$. The specific value of 𝜅 does not affect the optimal dis-

Figure 3. Illustration of lemma. 2 showcasing how anisotropic (right) embeddings lead to a higher-variance estimator compared to isotropic embeddings (left). We sample 100 training points for the 2-class classification task and fit a logistic regression, repeating the process over numerous training set samples. Each sampling results in a decision boundary (purple).
tribution shape. Following the same type of derivations as in the linear regime-with the exception of some additional regularity conditions-we are able to precisely identify the isotropic Gaussian as the unique optimum minimizing the bias, as formalized below.
Proof of the Linear Probe Variance Theorem
To allow for more flexible evaluation of the pretrained encoder 𝑓 𝜽 , it has become increasingly common to work with a nonlinear probe. We analyze two widely-used nonlinear methods: radius-based k-NN [Taunk et al., 2019, Sun and Huang, 2010, Zhang et al., 2017, Abu Alfeilat et al., 2019] for its simplicity and kernel methods [Nadaraya, 1964, Watson, 1964] for their theoretical tractability.
As in Section 3.1, we ask ourselves which distribution of embeddings would be preferable for a foundation model. We first define our prediction function. The training data consists of the 𝑁 embeddings along with their training labels {( 𝒛 𝑛 , 𝒚 𝑛 )} 𝑁 𝑛 = 1 . The prediction, using radius-based k-NN for a query vector 𝒒 is formed as
$$
$$
where $\mathcal{N}_{r_0}(\boldsymbol{q}) = \{n : \|\boldsymbol{z}_n - \boldsymbol{q}\| \leq r_0\}$. The specific choice of radius $r_0$ controls how many neighbors' predictions are averaged to form the query's prediction. The kernel's prediction at a query $\boldsymbol{q} \in \mathbb{R}^K$ is given by
$$
$$
We search over all distributions of 𝒁 subject to a fixed total variance constraint, e.g., $\mathrm{Tr}(\mathrm{Cov}(\boldsymbol{Z})) = \kappa_1$ or $\|\mathrm{Cov}(\boldsymbol{Z})\|_F = \kappa_2$. The specific value of 𝜅 does not affect the optimal dis-

Figure 3. Illustration of lemma. 2 showcasing how anisotropic (right) embeddings lead to a higher-variance estimator compared to isotropic embeddings (left). We sample 100 training points for the 2-class classification task and fit a logistic regression, repeating the process over numerous training set samples. Each sampling results in a decision boundary (purple).
tribution shape. Following the same type of derivations as in the linear regime-with the exception of some additional regularity conditions-we are able to precisely identify the isotropic Gaussian as the unique optimum minimizing the variance, as formalized below.
Appendix: PPP justification for the surrogate ball-average law
A. Inhomogeneous Poisson point process (PPP) preliminaries
B. Equalized-radius scheme and choice of $r_0(N)$, $k(x)$
C. Exact formula for $\mathbb{E}$
Proof of the k-NN Bias Theorem
Proof. Recall from Section B.3 that the bias term at sample 𝒙 is given by
$$
$$
where we defined $A(x) \triangleq \nabla\eta(x) \cdot \nabla \log p(x)$ and $C(x) \triangleq \frac{1}{2}\Delta\eta(x)$. We now square and take the expectation over $X \sim p$ and the isotropic gradient prior
$$
$$
$$
$$
$$
$$
We will derive each term separately, recalling that we assume an isotropic gradient prior for 𝜂, i.e., $\mathbb{E}[\nabla\eta(x)] = 0$ and $\mathbb{E}[\nabla\eta(x)\nabla\eta(x)^\top] = \tau_g^2 I_d$, for some $\tau_g^2 \in (0, \infty)$.
$$
$$
recovering the Fisher-information functional $J(p)$, scaled by $\tau_g^2$.
E. Concentration of the $k(x)$-NN radius around $r_0$
F. Scaling for consistency and relation to the variable-$k$ scheme
A. Ratio-of-integrals derivation (ball-average estimator)
Assumptions and Taylor expansions.
Data. We are in possession of a dataset of shape $(N, V, D) \in (\mathbb{N}^*)^3$, where 𝑁 is the number of samples, 𝑉 is the number of views, and 𝐷 is the dimension. One entry of this dataset is accessed via 𝒙 𝑛,𝑣,𝑑. Those dimensions are often interpreted as follows: (N) is the number of independent samples, e.g., different images or different videos, (V) is the number of views, e.g., data-augmentations for images or frames for videos, and (D) is the dimension of each 𝒙 𝑛,𝑣, e.g., the number of RGB pixels for images. In many cases the ordering over 𝑉 is given by time-but in some cases, e.g., data-augmentation of an image, ordering becomes irrelevant. Our study does not require any particular choice to organize one's dataset into an (𝑁, 𝑉, 𝐷) tensor, and none of our theory and implementation assumes a particular design decision for that tensor. However, we will rely on the following two properties: (independence) the samples 𝒙 𝑛, 𝒙 𝑛′ have been obtained independently from each other ∀𝑛 ≠ 𝑛′, and (identically distributed) the sampling process was identical among 𝒙 𝑛, ∀𝑛.
JEPAs. A foundation model is any system, e.g., a DN, able to solve numerous downstream tasks without requiring any change in its internal parameters 𝜽. This is in sharp contrast with a supervised model that only considers its training task. JEPAs were formally introduced by LeCun [2022] as a vehicle to produce foundation models. The core building blocks of JEPAs rely on numerous well-established techniques such as siamese networks [Bromley et al., 1993] and predictive coding [Helmholtz et al., 1867, Bruner and Postman, 1949]. While the exact blueprint of
Deep Networks. Today's AI solutions rely on Deep (Neural) Networks (DNs), which are compositions of a large number of parameterized linear and nonlinear operators. We denote the DN's mapping as 𝑓 𝜽 : R 𝐷 → R 𝐾 with 𝐾 the dimension of the embedding space. The internals of 𝑓 𝜽 are designed by the researcher to incorporate as much prior knowledge about the data as possible. The details of 𝑓 𝜽 are irrelevant to our study-as we will see the proposed LeJEPA works out-of-the-box on any 𝑓 𝜽 . In any case, all the learnable parameters are gathered in the vector 𝜽 ∈ R 𝑃 , with 𝑃 counting the total number of parameters. A central challenge in AI research is to design the right architecture and training objective so that 𝜽 can be learned from gradient descent to ultimately produce a useful system, or foundation model, 𝑓 𝜽 .

Sec 1: Intro | Sec 2: Background | Sec 3: Why Gaussian? | Sec 4: SIGReg | Sec 5: LeJEPA | Sec 6: Experiments
JEPAs varies greatly between use-cases, they all rely on two core principles: (i) being able to predict the embedding of a view 𝒙 𝑛,𝑣 from the embedding of another view 𝒙 𝑛,𝑣 ′ , 𝑣 ′ ≠ 𝑣 , all while (ii) ensuring that the embeddings do not become degenerate. Concretely, once a JEPA is designed and trained, it should be able to solve numerous downstream tasks in zero or few shots. The JEPA objective function, along with some examples for 𝒙 , is provided in Equation (1). The predictability criterion can be done by directly comparing the embeddings of the partial views 𝐸𝑛𝑐 ( 𝒙 𝑛,𝑣,. ) and 𝐸𝑛𝑐 ( 𝒙 𝑛,𝑣 ′ ,. ) with a metric, e.g., ℓ 𝑝 . In some cases, an additional DN coined Pred , is employed to compare 𝑃𝑟𝑒𝑑 ( 𝐸𝑛𝑐 ( 𝒙 𝑛,𝑣,. )) against 𝐸𝑛𝑐 ( 𝒙 𝑛,𝑣 ′ ,. ) -which is only justified when there exists an asymmetry between the information content of the different views, e.g., by conditioning the predictions on observed actions from robotics data [Khazatsky et al., 2024].
To allow for more flexible evaluation of the pretrained encoder 𝑓 𝜽 , it has become increasingly common to work with a nonlinear probe. We analyze two widely-used nonlinear methods: radius-based k-NN [Taunk et al., 2019, Sun and Huang, 2010, Zhang et al., 2017, Abu Alfeilat et al., 2019] for its simplicity and kernel methods [Nadaraya, 1964, Watson, 1964] for their theoretical tractability.
As in Section 3.1, we ask ourselves which distribution of embeddings would be preferable for a foundation model. We first define our prediction function. The training data consists of the $N$ embeddings along with their training labels $\{(\boldsymbol{z}_n, \boldsymbol{y}_n)\}_{n=1}^{N}$. The prediction, using radius-based k-NN for a query vector $\boldsymbol{q}$, is formed as
$$ \widehat{\boldsymbol{y}}(\boldsymbol{q}) := \frac{1}{|\mathcal{N}_{r_0}(\boldsymbol{q})|}\sum_{n \in \mathcal{N}_{r_0}(\boldsymbol{q})}\boldsymbol{y}_n, \tag{kNN} $$
where $\mathcal{N}_{r_0}(\boldsymbol{q}) = \{n : \|\boldsymbol{z}_n - \boldsymbol{q}\| \leq r_0\}$. The specific choice of radius $r_0$ controls how many neighbors' predictions are averaged to form the query's prediction. The kernel's prediction at a query $\boldsymbol{q} \in \mathbb{R}^K$ is given by
$$ \widehat{\boldsymbol{y}}(\boldsymbol{q}) \triangleq \frac{\sum_{n=1}^{N} K_h(\boldsymbol{q}-\boldsymbol{z}_n)\boldsymbol{y}_n}{\sum_{n=1}^{N} K_h(\boldsymbol{q}-\boldsymbol{z}_n)}. \tag{Kernel} $$
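The two probes can be sketched in a few lines of pure Python; a minimal illustration with our own helper names, assuming a Gaussian kernel for $K_h$:

```python
import math

def knn_predict(q, zs, ys, r0):
    """Radius-based k-NN (eq. kNN): average the labels of all training
    embeddings whose distance to the query is at most r0."""
    idx = [n for n, z in enumerate(zs) if math.dist(q, z) <= r0]
    if not idx:
        return None  # zero-count event: no neighbor inside the ball
    return sum(ys[n] for n in idx) / len(idx)

def nw_predict(q, zs, ys, h):
    """Nadaraya-Watson kernel regression (eq. Kernel); the Gaussian K_h
    here is an assumption, any kernel with bandwidth h works."""
    w = [math.exp(-math.dist(q, z) ** 2 / (2 * h * h)) for z in zs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

zs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
ys = [0.0, 1.0, 1.0, 4.0]
print(knn_predict((0.1, 0.1), zs, ys, r0=1.2))  # averages the three nearby labels
print(nw_predict((0.1, 0.1), zs, ys, h=0.5))
```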
We search over all distributions of $\boldsymbol{Z}$ subject to a fixed total variance constraint, e.g., ${\rm Tr}({\rm Cov}(\boldsymbol{Z})) = \kappa_1$ or $\|{\rm Cov}(\boldsymbol{Z})\|_F = \kappa_2$. The specific value of $\kappa$ does not affect the optimal distribution shape. Following the same type of derivations as done in the linear regime, with the exception of some additional regularity conditions, we are able to precisely identify the isotropic Gaussian as the unique optimum to minimize bias, as formalized below.

Figure 3. Illustration of Lemma 2 showcasing how anisotropic (right) embeddings lead to a higher-variance estimator compared to isotropic (left) embeddings. We sample 100 training points for the 2-class classification task and fit a logistic regression, repeating the process over numerous training set samples. Each sampling results in a decision boundary (purple).
Numerator expansion.
B. Moment-based derivation via $\mathbb{E}$.
First moment.
Second moment.
Bias via Taylor expansion of $\eta$.
Proof. Our proof follows standard derivations for studying the bias of an estimator. Let us consider the ridge regression problem (Tikhonov-regularized least squares) with closed-form estimator
$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X} + \lambda_{\rm wd} \mathbf{I})^{-1} \mathbf{X}^T \mathbf{Y}. $$
The labels are formed from the ground-truth parameter $\boldsymbol{\beta}_{\rm true}$ with centered error, as per $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta}_{\rm true} + \boldsymbol{\varepsilon}$ where $\mathbb{E}[\boldsymbol{\varepsilon}] = 0$. We can now look at the bias of our estimator, given by
$$ \mathrm{Bias}(\hat{\boldsymbol{\beta}}) = \mathbb{E}[\hat{\boldsymbol{\beta}}] - \boldsymbol{\beta}_{\rm true} = -\lambda_{\rm wd}(\mathbf{X}^T\mathbf{X} + \lambda_{\rm wd}\mathbf{I})^{-1}\boldsymbol{\beta}_{\rm true}. $$
We will now compare that bias when $\mathbf{X}$ has isotropic and anisotropic covariance with the same total variance, i.e., eigenvalues $\lambda_1, \ldots, \lambda_p$ satisfying
$$ \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_p}{p} = \bar{\lambda}. $$
For any anisotropic covariance matrix of $\mathbf{X}$, denote by $\mathbf{q}_p$ the eigenvector associated with the smallest eigenvalue $\lambda_p$, and denote by $\kappa > 0$ a positive constant. We now define
$$ \boldsymbol{\beta}_{\rm true} = \kappa \cdot \mathbf{q}_p, $$
leading to
$$ \|\mathrm{Bias}(\hat{\boldsymbol{\beta}})\|_{\text{isotropic}} = \frac{\lambda_{\rm wd}\,\kappa}{\bar{\lambda} + \lambda_{\rm wd}}, \qquad \|\mathrm{Bias}(\hat{\boldsymbol{\beta}})\|_{\text{non-isotropic}} = \frac{\lambda_{\rm wd}\,\kappa}{\lambda_p + \lambda_{\rm wd}}. $$
Since $\lambda_p < \bar{\lambda}$ (strict inequality when not isotropic):
$$ \frac{\lambda_{\rm wd}\,\kappa}{\lambda_p + \lambda_{\rm wd}} > \frac{\lambda_{\rm wd}\,\kappa}{\bar{\lambda} + \lambda_{\rm wd}}, $$
we obtain that
$$ \|\mathrm{Bias}(\hat{\boldsymbol{\beta}})\|_{\text{non-isotropic}} > \|\mathrm{Bias}(\hat{\boldsymbol{\beta}})\|_{\text{isotropic}}. $$
As a result, whenever the covariance matrix of $\mathbf{X}$ is anisotropic, there will be downstream tasks for which the estimator bias is increased compared to having an isotropic covariance matrix. Anisotropic covariance structure thus amplifies regularization bias when the true parameter vector aligns unfavorably with the data's covariance structure. □
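This lemma admits a quick numeric sanity check using the closed-form ridge bias for a diagonal population covariance; the 2-D eigenvalue profiles below are hypothetical examples with equal trace:

```python
# Sanity check: with the same total variance, anisotropic covariance inflates
# ridge bias when beta_true aligns with the smallest-eigenvalue eigenvector.
lam = 0.5  # ridge penalty (lambda_wd)

def ridge_bias_norm(eigs, beta):
    """||Bias|| for a diagonal covariance diag(eigs):
    Bias = -lam (Sigma + lam I)^{-1} beta_true, computed per axis."""
    return sum((lam * b / (e + lam)) ** 2 for e, b in zip(eigs, beta)) ** 0.5

kappa = 1.0
iso = [1.0, 1.0]           # isotropic: both eigenvalues equal the mean variance
aniso = [1.9, 0.1]         # anisotropic, same total variance (trace = 2)
beta_worst = [0.0, kappa]  # beta_true aligned with the smallest eigenvalue q_p

assert ridge_bias_norm(aniso, beta_worst) > ridge_bias_norm(iso, beta_worst)
print(ridge_bias_norm(iso, beta_worst), ridge_bias_norm(aniso, beta_worst))
```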
C. Uniformity, remainder control, and the zero-count event
Remainders.
Haneen Arafat Abu Alfeilat, Ahmad BA Hassanat, Omar Lasassmeh, Ahmad S Tarawneh, Mahmoud Bashir Alhasanat, Hamzeh S Eyal Salman, and VB Surya Prasath. Effects of distance measure choice on k-nearest neighbor classifier performance: a review. Big Data, 7(4):221-248, 2019.
Kumar K Agrawal, Arnab Kumar Mondal, Arna Ghosh, and Blake Richards. α-ReQ: Assessing representation quality in self-supervised learning by measuring eigenspectrum decay. Advances in Neural Information Processing Systems, 35:17626-17638, 2022.
Theodore W Anderson and Donald A Darling. Asymptotic theory of certain "goodness of fit" criteria based on stochastic processes. The Annals of Mathematical Statistics, pages 193-212, 1952.
Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619-15629, 2023.
Randall Balestriero and Yann LeCun. Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. Advances in Neural Information Processing Systems, 35:26671-26685, 2022.
Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.
Amir Bar, Florian Bordes, Assaf Shocher, Mahmoud Assran, Pascal Vincent, Nicolas Ballas, Trevor Darrell, Amir Globerson, and Yann LeCun. Stochastic positional embeddings improve masked image modeling. arXiv preprint arXiv:2308.00566, 2023.
Zero-count event $\{M_x = 0\}$.
D. Choice and scaling of $r_0(N)$ and variable $k(x)$
E. Summary bias formula
For any fixed $\boldsymbol{q} \in \mathbb{R}^d$ with $p(\boldsymbol{q}) > 0$, as $h \to 0$ and $nh^d \to \infty$,
$$ \mathrm{Bias}(\boldsymbol{q}) = h^2 \mu_2(K)\left(\tfrac{1}{2}\Delta\eta(\boldsymbol{q}) + \nabla\eta(\boldsymbol{q})\cdot\nabla\log p(\boldsymbol{q})\right) + o(h^2). $$
The $o(\cdot)$ terms are uniform over compact sets where $p$ is bounded away from zero. (Proof in Section B.5.)
We now show that, under a fixed mean and total-covariance constraint on $p_z$, the isotropic Gaussian distribution uniquely minimizes the bias and variance of the kernel regression estimator at any test point. We restrict the smoothness class of the target function using
$$ \mathcal{F}_{L,B} \triangleq \left\{\eta : \|\nabla\eta(\boldsymbol{x})\| \leq L,\ |\Delta\eta(\boldsymbol{x})| \leq B,\ \forall \boldsymbol{x}\right\}, $$
allowing us to formalize below the worst-case integrated bias and the optimal density for $z$.
Proof of \cref{thm:knn_optimal}
Proof. Identical to the ridge-regression bias argument given above. □
1) The score-gradient term $\mathbb{E}[A(X)^2]$.
The expectation of (Epps-Pulley) satisfies
$$
$$
therefore both the loss and its derivative have a bias of order 𝑂 ( 1 / 𝑛 ) . (Proof in Section B.13.)
Hence, the gradients we obtain from using (Epps-Pulley) are biased by an explicit $\mathcal{O}(1/N)$ term. We found this bias to be minimal and not a concern even for minibatches as small as 16. Unbiased alternatives include U-statistic debiasing of $|\phi_\theta|^2$ or sample splitting, which we do not explore in this study. Our final implementation of the SIGReg term with the Epps-Pulley statistic is provided in algorithm 1.
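A minimal sketch of the (Epps-Pulley) statistic for a single 1-D slice; the standard-normal weight $w$ and the 17-point trapezoidal rule on $[-5, 5]$ follow the recommended defaults later in the paper, but this illustration is ours, not the paper's PyTorch implementation:

```python
import math, random

def epps_pulley(x, n_points=17, lo=-5.0, hi=5.0):
    """Sketch of (Epps-Pulley): N * integral of |ecf(t) - exp(-t^2/2)|^2 w(t) dt,
    with a Gaussian weight w and a trapezoidal rule (grid choice is an assumption)."""
    N = len(x)
    ts = [lo + (hi - lo) * i / (n_points - 1) for i in range(n_points)]
    dt = (hi - lo) / (n_points - 1)

    def integrand(t):
        # Empirical characteristic function of the sample at t.
        ecf = sum(complex(math.cos(t * xi), math.sin(t * xi)) for xi in x) / N
        w = math.exp(-t * t / 2) / math.sqrt(2 * math.pi)  # Gaussian weight
        return abs(ecf - math.exp(-t * t / 2)) ** 2 * w    # target CF is N(0,1)

    vals = [integrand(t) for t in ts]
    return N * dt * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

random.seed(0)
gauss = [random.gauss(0, 1) for _ in range(512)]
unif = [random.uniform(-2, 2) for _ in range(512)]
assert epps_pulley(gauss) < epps_pulley(unif)  # farther from N(0,1) => larger
```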
The Prediction Loss. To standardize notations, we adopt the DINO [Caron et al., 2021] setup of generating $V_g$ global views and $V_l$ local views, leading to a total of $V = V_g + V_l$ views. We set the first $1, \ldots, V_g$ indices of each $\boldsymbol{z}_{n,v}$ as the global views. For the cases without local views, simply set $V_l = 0$. The prediction loss is then given by having all views predict the global views as
$$
$$
where we denote $\boldsymbol{\mu}_n \triangleq \frac{1}{V_g}\sum_{v=1}^{V_g} \boldsymbol{z}_{n,v}$; the Equation (5) to Equation (6) derivations are detailed in Section B.6.
LeJEPA Loss. The final total loss simply combines the above prediction loss with SIGReg on each view, as per
$$
$$
We present LeJEPA's implementation in algorithm 2. Altogether, the entire implementation, besides the usual model definitions, optimizers, and data loaders, only takes a few dozen lines of PyTorch (algorithms 1 and 2). The absence of prototypes, stop-gradients, and teacher-student networks makes LeJEPA appealing, as it only contains one hyperparameter, $\lambda$, balancing the trade-off between the prediction and isotropic Gaussian terms.
2) The cross term $2\,\mathbb{E}[A(X)C(X)]$.
$$ 2\,\mathbb{E}[A(X)C(X)] = \mathbb{E}\big[\big(\nabla\eta(X)\cdot\nabla\log p(X)\big)\,\Delta\eta(X)\big]. $$
Under the prior, $\nabla\eta$ is mean-zero and isotropic; if, additionally, $\Delta\eta$ is uncorrelated with $\nabla\eta$ and has zero mean (or is bounded and mean-zero after centering), then $\mathbb{E}_\eta[A(x)C(x)] = 0$. If one does not assume the orthogonality/vanishing covariance above, then $\mathbb{E}[A(X)C(X)]$ is a finite constant (depending on the joint law of the derivatives of $\eta$), and the cross term contributes at order
$$ O(r_0^4), $$
not $o(r_0^4)$. In that general case, the leading $p$-dependent term of $\mathbb{E}[\mathrm{Bias}(X)^2]$ is still the score-gradient $\tau_g^2 J(p)$.
3) The curvature term $\mathbb{E}[C(X)^2]$.
$$ \mathbb{E}\big[C(X)^2\big] = \tfrac{1}{4}\,\mathbb{E}\big[\Delta\eta(X)^2\big], $$
which is independent of $p$, hence $\mathbb{E}\left[C(X)^2\right] = O(1)$.
Putting it together.
$$
$$
We show that, among all mean-zero distributions $p$ on $\mathbb{R}^d$ with a given scalar constraint on the covariance (trace, determinant, Frobenius norm, or spectral radius), the density that minimizes the Fisher-information functional
$$ J(p) = \int_{\mathbb{R}^d} \|\nabla \log p(\boldsymbol{x})\|^2\, p(\boldsymbol{x})\, d\boldsymbol{x} $$
is the Gaussian with isotropic covariance satisfying the same scalar constraint. We proceed in two steps: (i) for a fixed covariance matrix $\Sigma \succ 0$, $J(p)$ is minimized by the Gaussian $\mathcal{N}(0, \Sigma)$ and attains the value ${\rm tr}(\Sigma^{-1})$; (ii) for each scalar constraint, ${\rm tr}(\Sigma^{-1})$ is minimized by $\Sigma = sI_d$ for the appropriate scalar $s > 0$.
Step 2: Optimizing over covariance shapes under scalar constraints
Write the eigenvalues of $\Sigma$ as $\lambda_1, \ldots, \lambda_d > 0$. Then
$$ {\rm tr}(\Sigma^{-1}) = \sum_{i=1}^{d} \frac{1}{\lambda_i}. $$
We now solve min ˝ 𝑖 1 / 𝜆 𝑖 under each scalar constraint; in every case the minimum is attained when all 𝜆 𝑖 are equal, i.e., Σ = 𝑠𝐼 𝑑 .
(a) Trace constraint. Given ${\rm tr}(\Sigma) = \sum_i \lambda_i = t > 0$, by Cauchy-Schwarz,
$$ d^2 = \Big(\sum_{i=1}^d 1\Big)^2 \leq \Big(\sum_{i=1}^d \lambda_i\Big)\Big(\sum_{i=1}^d \frac{1}{\lambda_i}\Big) = t \sum_{i=1}^d \frac{1}{\lambda_i}, $$
so that $\sum_i 1/\lambda_i \geq d^2/t$, with equality if and only if all $\lambda_i$ are equal, i.e., $\lambda_i = t/d$.
(b) Determinant constraint. Given $\det(\Sigma) = \prod_i \lambda_i = \delta > 0$, the AM-GM inequality gives
$$ \frac{1}{d}\sum_{i=1}^d \frac{1}{\lambda_i} \geq \Big(\prod_{i=1}^d \frac{1}{\lambda_i}\Big)^{1/d} = \delta^{-1/d}, $$
with equality if and only if all $\lambda_i$ are equal, i.e., $\lambda_i = \delta^{1/d}$.
(c) Frobenius-norm constraint. Given $\|\Sigma\|_F^2 = \sum_i \lambda_i^2 = c^2 > 0$, minimize $f(\lambda) := \sum_i 1/\lambda_i$ over $\lambda_i > 0$ subject to $g(\lambda) := \sum_i \lambda_i^2 = c^2$. The Lagrangian
$$ \mathcal{L}(\lambda, \nu) = \sum_{i=1}^d \frac{1}{\lambda_i} + \nu\Big(\sum_{i=1}^d \lambda_i^2 - c^2\Big) $$
has first-order conditions $-\lambda_i^{-2} + 2\nu\lambda_i = 0$ for all $i$, i.e., $\lambda_i^3 = \frac{1}{2\nu}$, so all $\lambda_i$ are equal. Imposing $\sum_i \lambda_i^2 = c^2$ yields $\lambda_i = c/\sqrt{d}$, hence ${\rm tr}(\Sigma^{-1}) = d^{3/2}/c$.
(d) Spectral-radius constraint. Let the spectral radius be constrained by $\rho(\Sigma) = \max_i \lambda_i \leq r$ for some $r > 0$. Since $x \mapsto 1/x$ is strictly decreasing on $(0, \infty)$,
$$ \sum_{i=1}^d \frac{1}{\lambda_i} \geq \frac{d}{r}, $$
with equality if and only if $\lambda_i = r$ for all $i$. Therefore
$$ {\rm tr}(\Sigma^{-1}) \geq \frac{d}{r}, \quad \text{with the minimum attained at } \Sigma = rI_d. $$
(The same conclusion holds if the constraint is 𝜌 ( Σ ) = 𝑟 , since one may take all eigenvalues equal to 𝑟 .)
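The four constraint cases can be checked numerically: for each constraint, the equal-eigenvalue (isotropic) profile yields the smallest $\mathrm{tr}(\Sigma^{-1})$. The eigenvalue profiles below are arbitrary illustrative examples:

```python
# Among eigenvalue profiles satisfying each scalar constraint, the isotropic
# profile minimizes sum_i 1/lambda_i = tr(Sigma^{-1}).
d = 4

def tr_inv(eigs):
    return sum(1.0 / e for e in eigs)

# (a) trace constraint: both profiles have trace 4.
iso_t, skew_t = [1.0] * d, [2.5, 0.5, 0.5, 0.5]
assert tr_inv(iso_t) < tr_inv(skew_t)

# (b) determinant constraint: both profiles have determinant 1.
iso_d, skew_d = [1.0] * d, [2.0, 2.0, 0.5, 0.5]
assert tr_inv(iso_d) < tr_inv(skew_d)

# (c) Frobenius-norm constraint: both profiles have sum of squares c^2 = 4.
c = 2.0
iso_f = [c / d ** 0.5] * d
skew_f = [(c ** 2 - 3 * 0.25) ** 0.5, 0.5, 0.5, 0.5]
assert tr_inv(iso_f) < tr_inv(skew_f)

# (d) spectral-radius constraint: both profiles have max eigenvalue r = 1.
iso_r, skew_r = [1.0] * d, [1.0, 0.5, 0.5, 0.5]
assert tr_inv(iso_r) < tr_inv(skew_r)
```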
moments of the distribution are well-characterized even with a modest integration range. The number of slices $|\mathcal{A}|$ has a modest effect: while more slices slightly improve performance, even 512 slices yield competitive results. We thus recommend using 17 integration points, an integration domain of $[-5, 5]$, and 1024 slices as starting points.

Figure 9. INet10 pretraining and frozen-backbone linear evaluation across 50 timm models using LeJEPA out of the box. We cross-validate the learning rate and weight decay. While there is a small variation between the best- and worst-performing models, we clearly see that across 50 models spanning 8 families, LeJEPA produces non-trivial representations able to solve the downstream task at SOTA levels.
Stability across architectures. A key advantage of LeJEPA over recent methods (e.g., I-JEPA, DINOv2) is its architecture-agnostic design. While most modern self-supervised methods are tailored to Vision Transformers, LeJEPA works across diverse architecture families without modification. To validate this claim, we pretrain approximately 50 architectures from 8 different families on ImageNet-10, selecting all models in the timm library with fewer than 20M parameters. All models are able to learn high-quality representations, reaching between 91.5% and 95% top-1 accuracy with frozen-backbone linear probing. It seems that models performing well in supervised learning setups, such as ResNets and ViTs, are also the ones to favor for LeJEPA. We thus recommend using standard architectures such as ResNets and ViTs over specialized models like EfficientNet as a starting point.
Stochastic Weight Averaging (SWA) [Izmailov et al., 2019]: it is not necessary to prevent collapse. In our setup, we apply SWA on the encoder producing $\mu$ in Equation (6). Second, recent work demonstrated that register tokens are needed to prevent training instabilities in vision models [Oquab et al., 2023, Siméoni et al., 2025, Darcet et al., 2023]. We show in Table 1 that such instabilities likely stem from poorly conditioned training objectives. In contrast, LeJEPA does not require register tokens and achieves stable performance with or without them. We thus recommend training without a predictor or register tokens, and optionally applying SWA with ViTs for a possible performance gain.
Conclusion: Isotropic Gaussian is optimal
Combining Lemma 6 with the solutions (a)-(d), we obtain:
Proof of \cref{thm:kernel_bias}
Proof. Recall from Section B.3 that the bias term at sample $\boldsymbol{x}$ is given by
$$ \mathrm{Bias}(x) = h^2\mu_2(K)\big(A(x) + C(x)\big) + o(h^2), $$
where we defined $A(x) \triangleq \nabla\eta(x)\cdot\nabla\log p(x)$ and $C(x) \triangleq \frac{1}{2}\Delta\eta(x)$. We now square and take the expectation over $X \sim p$ and the isotropic gradient prior:
$$ \mathbb{E}\big[\mathrm{Bias}(X)^2\big] = h^4\mu_2(K)^2\,\mathbb{E}\big[(A(X)+C(X))^2\big] + o(h^4) $$
$$ = h^4\mu_2(K)^2\Big(\mathbb{E}[A(X)^2] + 2\,\mathbb{E}[A(X)C(X)] + \mathbb{E}[C(X)^2]\Big) + o(h^4). $$
We will derive each term separately, recalling that we assume an isotropic gradient prior for $\eta$, i.e., $\mathbb{E}\left[\nabla\eta(x)\right] = 0$ and $\mathbb{E}\left[\nabla\eta(x)\nabla\eta(x)^\top\right] = \tau_g^2 I_d$, for some $\tau_g^2 \in (0, \infty)$. For the first term,
$$ \mathbb{E}[A(X)^2] = \mathbb{E}\big[\nabla\log p(X)^\top\,\tau_g^2 I_d\,\nabla\log p(X)\big] = \tau_g^2\,\mathbb{E}\big[\|\nabla\log p(X)\|^2\big] = \tau_g^2 J(p), $$
recovering the Fisher-information functional $J(p)$, scaled by $\tau_g^2$. □
Proof of \cref{eq:pred1}
Proof. We prove this result in two parts.
Part I: $\mathbb{E}[\mathbf{X}] = \mathbf{0}$. Given that $\mathbb{E}[\langle\mathbf{X}, \mathbf{a}\rangle] = 0$ for all unit vectors $\mathbf{a}$, and noting that $\langle\mathbf{X}, \mathbf{a}\rangle = \mathbf{a}^T\mathbf{X}$, we have:
$$ \mathbb{E}[\mathbf{a}^T \mathbf{X}] = 0 \quad \text{for all } \mathbf{a} \in \mathbb{R}^d \text{ with } \|\mathbf{a}\| = 1. \tag{33} $$
By linearity of expectation, $\mathbf{a}^T\,\mathbb{E}[\mathbf{X}] = 0$ for every unit vector $\mathbf{a}$.
Let $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]$. We claim that $\boldsymbol{\mu} = \mathbf{0}$. Suppose, for the sake of contradiction, that $\boldsymbol{\mu} \neq \mathbf{0}$. Then $\|\boldsymbol{\mu}\|_2 > 0$. Define the unit vector:
$$ \mathbf{a}^* = \frac{\boldsymbol{\mu}}{\|\boldsymbol{\mu}\|_2}. $$
Since $\mathbf{a}^*$ is a unit vector, equation (33) implies $(\mathbf{a}^*)^T\boldsymbol{\mu} = 0$. However, substituting the definition of $\mathbf{a}^*$:
$$ (\mathbf{a}^*)^T \boldsymbol{\mu} = \frac{\boldsymbol{\mu}^T \boldsymbol{\mu}}{\|\boldsymbol{\mu}\|_2} = \|\boldsymbol{\mu}\|_2 > 0. $$
This contradiction establishes that $\boldsymbol{\mu} = \mathbf{0}$.
Part II: $\mathrm{Cov}(\mathbf{X}) = \mathbf{I}_d$. Since $\mathbb{E}[\mathbf{X}] = \mathbf{0}$, we have:
$$ \mathrm{Var}(\langle\mathbf{X}, \mathbf{a}\rangle) = \mathbb{E}[(\langle\mathbf{X}, \mathbf{a}\rangle)^2] = \mathbb{E}[(\mathbf{a}^T\mathbf{X})^2]. $$
Expanding the quadratic form:
$$ \mathbb{E}[(\mathbf{a}^T\mathbf{X})^2] = \mathbb{E}[\mathbf{a}^T\mathbf{X}\mathbf{X}^T\mathbf{a}] = \mathbf{a}^T\,\mathbb{E}[\mathbf{X}\mathbf{X}^T]\,\mathbf{a}. $$
Since $\mathbb{E}[\mathbf{X}] = \mathbf{0}$, the covariance matrix is $\mathrm{Cov}(\mathbf{X}) = \mathbb{E}[\mathbf{X}\mathbf{X}^T]$. Let $\boldsymbol{\Sigma} = \mathrm{Cov}(\mathbf{X})$. The variance condition gives us:
$$ \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{a} = 1 \quad \text{for all unit vectors } \mathbf{a}. \tag{40} $$
We now show that $\boldsymbol{\Sigma} = \mathbf{I}_d$. Step 1: Diagonal entries. For $i \in \{1, 2, \ldots, d\}$, let $\mathbf{e}_i$ denote the $i$-th standard basis vector. Setting $\mathbf{a} = \mathbf{e}_i$ in equation (40):
$$ \mathbf{e}_i^T\boldsymbol{\Sigma}\mathbf{e}_i = \Sigma_{ii} = 1. $$
Step 2: Off-diagonal entries. For $i \neq j$, set
$$ \mathbf{a} = \frac{\mathbf{e}_i + \mathbf{e}_j}{\|\mathbf{e}_i + \mathbf{e}_j\|_2} = \frac{\mathbf{e}_i + \mathbf{e}_j}{\sqrt{2}}, $$
so that equation (40) gives $\mathbf{a}^T\boldsymbol{\Sigma}\mathbf{a} = \frac{1}{2}(\Sigma_{ii} + \Sigma_{jj} + 2\Sigma_{ij}) = 1$, which combined with $\Sigma_{ii} = \Sigma_{jj} = 1$ yields $\Sigma_{ij} = 0$. Hence $\boldsymbol{\Sigma} = \mathbf{I}_d$. □
Proof of \cref{thm:kernel_optimal}
Proof. Identical to the ridge-regression bias argument given above. □
Proof of \cref{thm:spherical_cramer}
Let $X, Y$ be $\mathbb{R}^d$-valued random vectors, then
$$ X \stackrel{d}{=} Y \iff \boldsymbol{a}^\top X \stackrel{d}{=} \boldsymbol{a}^\top Y, \quad \forall \boldsymbol{a} \in \mathbb{S}^{d-1}. $$
Convergence in distribution also holds. (Proof in Section B.8.)
Proof of \cref{thm:bcs}
Proof. Identical to the kernel-bias derivation given in the proof of \cref{thm:kernel_bias} above. □
Proof of \cref{thm:spherical_bounds}
Proof. Identical to the two-part moment argument (Parts I and II) given earlier. □
Proof of \cref{thm:moment_conendrum}
Proof. Identical to the two-part moment argument (Parts I and II) given earlier. □
Proof of \cref{thm:ecf_stability}
Proof. Identical to the two-part moment argument (Parts I and II) given earlier. □
Proof of \cref{thm:moment_bound}
Proof. Identical to the kernel-bias derivation given in the proof of \cref{thm:kernel_bias} above. □
Proof of \cref{thm:moment_bound_ld}
Proof. Identical to the kernel-bias derivation given in the proof of \cref{thm:kernel_bias} above. □
Proof of \cref{thm:gradient_bias}
The expectation of (Epps-Pulley) satisfies
$$
$$
therefore both the loss and its derivative have a bias of order 𝑂 ( 1 / 𝑛 ) . (Proof in Section B.13.)
Proof of VICReg's Recovery
Proof. Identical to the two-part moment argument (Parts I and II) given earlier. □
Background
Foundation: The Linear Regression Model. We start with the standard linear regression model:
$$ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, $$
where $\mathbf{Y}$ is the vector of responses, $\mathbf{X}$ the design matrix, $\boldsymbol{\beta}$ the parameter vector, and $\boldsymbol{\varepsilon}$ the error vector. The error assumption means:
$$ \mathbb{E}[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}. $$
Step 1: Deriving the OLS Estimator. To find the OLS estimator, we minimize the sum of squared residuals:
$$ \mathrm{RSS}(\boldsymbol{\beta}) = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}). $$
Expanding this quadratic form:
$$ \mathrm{RSS}(\boldsymbol{\beta}) = \mathbf{Y}^T\mathbf{Y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{Y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}. $$
Taking the derivative with respect to $\boldsymbol{\beta}$:
$$ \frac{\partial\,\mathrm{RSS}}{\partial\boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{Y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}. $$
Setting it equal to zero and solving:
$$ \mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{Y}. $$
Assuming $\mathbf{X}^T\mathbf{X}$ is invertible:
$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}. $$
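The closed form can be verified on a tiny example by solving the normal equations directly; the design matrix and targets below are hypothetical, and the $2 \times 2$ inverse is written out by hand:

```python
# Verify beta_hat = (X^T X)^{-1} X^T Y on a noiseless 2-parameter problem,
# where OLS must recover the generating coefficients exactly.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]  # intercept + one feature
Y = [0.5 + 2.0 * x1 for (_, x1) in X]                 # true beta = (0.5, 2.0)

# Entries of X^T X and X^T Y.
s00 = sum(x0 * x0 for x0, _ in X)
s01 = sum(x0 * x1 for x0, x1 in X)
s11 = sum(x1 * x1 for _, x1 in X)
t0 = sum(x0 * y for (x0, _), y in zip(X, Y))
t1 = sum(x1 * y for (_, x1), y in zip(X, Y))

# Explicit 2x2 inverse applied to the normal equations.
det = s00 * s11 - s01 * s01
b0 = (s11 * t0 - s01 * t1) / det
b1 = (s00 * t1 - s01 * t0) / det

assert abs(b0 - 0.5) < 1e-9 and abs(b1 - 2.0) < 1e-9
```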
Details on Low-Discrepancy Sequences
Quasi-Monte Carlo (QMC) methods, such as the Sobol sequence, are widely used to generate low-discrepancy samples in the unit hypercube, providing improved uniformity over purely random sampling. To obtain samples uniformly distributed on the hypersphere, each QMC point is mapped to a standard normal vector via the inverse cumulative

Figure 15. Depiction of the expected BCS loss upper bound (Thm. 5) for various smoothness values $\alpha$. We clearly see that as the smoothness increases (blue to red), the upper bound decreases more and more rapidly with $M$.
Table 3. Performance metrics across different sample sizes from Figure 12
Table 4. Top 1 accuracy (in %) with LeJEPA pretraining on Imagenet-100 for 400 epochs (All values are percentages)
distribution function (CDF), and then projected onto the sphere by normalization. This approach leverages the rotational invariance of the multivariate normal distribution, ensuring that the resulting directions are uniformly distributed on
Table 5. Small architecture in-domain LeJEPA pretraining from random initialization across datasets and architectures, with frozen backbone linear evaluation. First, LeJEPA is able to produce near state-of-the-art performances on tiny dataset with only a thousand samples , e.g., flowers102. Second, on non-natural image data, LeJEPA clearly outperforms the latest frontier vision models , e.g., Galaxy10. See Figure 12 for additional experiments with varying number of training samples and with full finetuning.
Table 6. Time (in milliseconds) to compute the proposed SIGReg loss from algorithm 1 on a Tesla V100-SXM2-16GB for varying mini-batch size ($N$), number of slices ($M$), and integration points. Results are computed over 10 runs.
Table 7. Number of Figure 8.
the sphere's surface. While the low-discrepancy property is not strictly preserved under this nonlinear mapping, the resulting samples are empirically more uniform than random samples and are standard in high-dimensional applications Marsaglia [1972], Dick and Pillichshammer [2010], Caflisch [1998].
Require: Number of points $N$, dimension $d$
Ensure: Points $\{\mathbf{y}_i\}_{i=1}^{N}$ quasi-uniformly distributed on $\mathbb{S}^{d-1}$
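A self-contained sketch of this sphere-sampling recipe. The paper uses a Sobol sequence; for a dependency-free illustration we substitute a Halton sequence (also low-discrepancy) and the standard-normal inverse CDF from Python's `statistics` module:

```python
import math
from statistics import NormalDist

def halton(i, base):
    """Radical-inverse (van der Corput) value of index i in the given base."""
    f, r = 1.0, 0.0
    while i > 0:
        f /= base
        r += f * (i % base)
        i //= base
    return r

def sphere_points(N, d, bases=(2, 3, 5, 7, 11, 13)):
    """Quasi-uniform points on S^{d-1}: low-discrepancy cube samples,
    mapped through the normal inverse CDF, then normalized (d <= 6 here)."""
    inv = NormalDist().inv_cdf
    pts = []
    for i in range(1, N + 1):  # skip index 0 (maps to u = 0, where inv_cdf blows up)
        u = [min(max(halton(i, bases[k]), 1e-12), 1 - 1e-12) for k in range(d)]
        g = [inv(uk) for uk in u]  # standard-normal vector (rotationally invariant)
        norm = math.sqrt(sum(gk * gk for gk in g))
        pts.append([gk / norm for gk in g])  # project onto the sphere
    return pts

pts = sphere_points(64, 3)
assert all(abs(sum(c * c for c in p) - 1.0) < 1e-9 for p in pts)
```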
Shapiro-Wilk Test
Let $X_1 < X_2 < \cdots < X_n$ denote an ordered random sample of size $n$ from a standard normal distribution. Also, let $\mathbf{m} = (m_1, m_2, \ldots, m_n)$ be the vector of expected values of standard normal order statistics, and let $\mathbf{V} = (v_{ij})$ be the corresponding $n \times n$ covariance matrix, so that
$$ E\left(X_{i}\right) = m_{i} \quad \text{and} \quad \operatorname{cov}\left(X_{i}, X_{j}\right) = v_{ij}, \quad i, j = 1, 2, \ldots, n. $$
The W test statistic Shapiro and Wilk [1965] for normality is then denoted by
$$ W = \frac{\left(\sum_{i=1}^{n} a_{i} Y_{i}\right)^{2}}{\sum_{i=1}^{n}\left(Y_{i} - \bar{Y}\right)^{2}}, \qquad \mathbf{a}^{\prime} = \left(a_{1}, \ldots, a_{n}\right) = \mathbf{m}^{\prime} \mathbf{V}^{-1}\left(\mathbf{m}^{\prime} \mathbf{V}^{-1} \mathbf{V}^{-1} \mathbf{m}\right)^{-1/2}. $$
Shapiro and Francia [1972] suggested replacing the covariance matrix $\mathbf{V}$ by the identity matrix $\mathbf{I}$, because for large samples the observations $Y_i$ may be treated as if they were independent (see Gupta [1952]). Another asymptotic extension was suggested by Weisberg and Bingham [1975]:
$$
$$
building atop Elfving [1947]'s approximation but using 3 / 8 instead of 𝜋 / 8.
Rahman and Govindarajulu [1997] proposed another variation using the approximation for the expected values of order statistics given by Blom [1958] and the approximations for the elements of the variance-covariance matrix given by Blom [1958], Mosteller [2006]. These approximations are
$$
$$
$$
$$
$$
$$
We know (see Hammersley and Morton [1954], Plackett [1958])
$$
$$
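The Shapiro-Francia idea above (identity matrix in place of $\mathbf{V}$, Blom-style scores for $\mathbf{m}$) can be sketched in a few lines; the `shapiro_francia` helper is our own illustration, not any official implementation, and the $3/8$ plotting position follows Blom's approximation:

```python
import math, random
from statistics import NormalDist

def shapiro_francia(sample):
    """Shapiro-Francia-style W': squared correlation between the ordered sample
    and Blom-approximated normal scores m_i = Phi^{-1}((i - 3/8) / (n + 1/4))."""
    y = sorted(sample)
    n = len(y)
    inv = NormalDist().inv_cdf
    m = [inv((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    norm_m = math.sqrt(sum(mi * mi for mi in m))
    a = [mi / norm_m for mi in m]          # V replaced by the identity: a = m / ||m||
    ybar = sum(y) / n
    num = sum(ai * yi for ai, yi in zip(a, y)) ** 2
    den = sum((yi - ybar) ** 2 for yi in y)
    return num / den                        # close to 1 for Gaussian data

random.seed(0)
w_gauss = shapiro_francia([random.gauss(0, 1) for _ in range(200)])
w_exp = shapiro_francia([random.expovariate(1.0) for _ in range(200)])
assert w_exp < w_gauss <= 1.0  # skewed data scores lower than Gaussian data
```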
Multivariate Statistics
$$ \hat{\beta} = \underset{\beta \in \mathbb{R}^K}{\arg\min}\ \|\vy - \mZ\beta\|_2^2 + \lambda \|\beta\|_2^2 \tag{OLS}\label{eq:OLS} $$
$$ U^{2} = T_{w} - N\left(\bar{F} - \frac{1}{2}\right)^{2}. \tag{Watson}\label{eq:watson} $$
$$ EP = N \int_{-\infty}^{\infty} \left| \hat{\phi}_X(t) - \phi(t) \right|^2 w(t)\, dt. \tag{Epps-Pulley}\label{eq:epps_pulley} $$
$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X} + \lambda_{\rm wd} \mathbf{I})^{-1} \mathbf{X}^T \mathbf{Y}. $$
$$ \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_p}{p} = \bar{\lambda}. $$
$$ \boldsymbol{\beta}_{\text{true}} = \kappa \cdot \mathbf{q}_p, $$
$$ \|\text{Bias}(\hat{\boldsymbol{\beta}})\|_{\text{non-isotropic}} > \|\text{Bias}(\hat{\boldsymbol{\beta}})\|_{\text{isotropic}} $$
$$ \frac{1}{V_g}\sum_{v=1}^{V_g}\frac{1}{V}\sum_{v'=1}^{V}\left\| \mathbf{z}_{n,v} - \mathbf{z}_{n,v'} \right\|_2^2 = \frac{1}{V}\sum_{v'=1}^{V}\left\| \bar{\mathbf{z}} - \mathbf{z}_{n,v'} \right\|_2^2 $$
$$ \label{eq:mean_condition} \mathbb{E}[\mathbf{a}^T \mathbf{X}] = 0 \quad \text{for all } \mathbf{a} \in \mathbb{R}^d \text{ with } \|\mathbf{a}\| = 1 \tag{eq:mean_condition} $$
$$ \mathbf{a}^* = \frac{\boldsymbol{\mu}}{|\boldsymbol{\mu}|_2} $$
$$ (\mathbf{a}^*)^T \boldsymbol{\mu} = 0 $$
$$ (\mathbf{a}^*)^T \boldsymbol{\mu} = \left(\frac{\boldsymbol{\mu}}{|\boldsymbol{\mu}|_2}\right)^T \boldsymbol{\mu} = \frac{\boldsymbol{\mu}^T \boldsymbol{\mu}}{|\boldsymbol{\mu}|_2} = \frac{|\boldsymbol{\mu}|_2^2}{|\boldsymbol{\mu}|_2} = |\boldsymbol{\mu}|_2 > 0 $$
$$ \mathrm{Var}(\langle \mathbf{X}, \mathbf{a} \rangle) = \mathbb{E}[(\langle \mathbf{X}, \mathbf{a} \rangle)^2] = \mathbb{E}[(\mathbf{a}^T \mathbf{X})^2] $$
$$ \mathbb{E}[(\mathbf{a}^T \mathbf{X})^2] = \mathbb{E}[\mathbf{a}^T \mathbf{X} \mathbf{X}^T \mathbf{a}] = \mathbf{a}^T \mathbb{E}[\mathbf{X} \mathbf{X}^T] \mathbf{a} $$
$$ \mathbf{e}_i^T \boldsymbol{\Sigma} \mathbf{e}_i = \Sigma_{ii} = 1 $$
$$ \mathbf{a} = \frac{\mathbf{e}_i + \mathbf{e}_j}{|\mathbf{e}_i + \mathbf{e}_j|_2} = \frac{\mathbf{e}_i + \mathbf{e}_j}{\sqrt{2}} $$
$$ E\left(X_{i}\right)=m_{i} \quad \text { and } \quad \operatorname{cov}\left(X_{i}, X_{j}\right)=v_{i j}, \quad i, j=1,2, \ldots, n $$
$$ \begin{array}{l} W=\frac{\left(\sum_{i=1}^{n} a_{i} Y_{i}\right)}{\sum_{i=1}^{n}\left(Y_{i} -\bar{Y}\right)^{2}}=\frac{(\mathbf{a} \mathbf{Y})}{S^{2}}\ \mathbf{a}^{\prime}=\left(a_{1}, a_{2}, \ldots, a_{n}\right)=\mathbf{m} \mathbf{V}^{-1}\left(\mathbf{m} \mathbf{V}^{-1} \mathbf{V}^{-1} \mathbf{m}\right)^{-1 / 2}\ \mathrm{S}^{2}=\sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2} \end{array} $$
$$ p_{i}=\frac{i}{n+1} $$
$$ \begin{array}{l} \mathbf{V}^{-1}=(n+1)(n+2) \ \times\left(\begin{array}{cccccc} 2 \phi^{2}\left(m_{1}\right) & -\phi\left(m_{1}\right) \phi\left(m_{2}\right) & 0 & 0 & \ldots & 0 \ -\phi\left(m_{1}\right) \phi\left(m_{2}\right) & 2 \phi^{2}\left(m_{2}\right) & -\phi\left(m_{2}\right) \phi\left(m_{3}\right) & 0 & \ldots & 0 \ 0 & -\phi\left(m_{2}\right) \phi\left(m_{3}\right) & 2 \phi^{2}\left(m_{3}\right) & -\phi\left(m_{3}\right) \phi\left(m_{4}\right) & \ldots & 0 \ \vdots & & & & & \ 0 & 0 & 0 & 0 & \ldots & 2 \phi^{2}\left(m_{n}\right) \end{array}\right) \end{array} $$
$$ \beta_{n}=2^{-1 / 2}((2 d+1) n / 4)^{1 /(d+4)} $$
$$ \begin{array}{l} T_{n, \gamma}:=\int_{\mathbb{R}^{d}} U_{n}^{2}(t) w_{\gamma}(t) \mathrm{d} t\ U_{n}(t):=\sqrt{n}\left(R_{n}(t) M_{n}(t)-1\right) \end{array} $$
$$ \begin{aligned} T_{n, \gamma}= & \left(\frac{\pi}{\gamma}\right)^{d / 2}\left\{\frac{1}{2 n^{3}} \sum_{j, k, \ell, m=1}^{n}\left[\exp \left(\frac{\left\|Y_{j k}^{+}\right\|^{2}-\left\|Y_{\ell m}^{-}\right\|^{2}}{4 \gamma}\right) \cos \left(\frac{Y_{j k}^{+\top} Y_{\ell m}^{-}}{2 \gamma}\right)\right.\right. \\ & \left.-\exp \left(\frac{\left\|Y_{j k}^{+}\right\|^{2}-\left\|Y_{\ell m}^{+}\right\|^{2}}{4 \gamma}\right) \cos \left(\frac{Y_{j k}^{+\top} Y_{\ell m}^{+}}{2 \gamma}\right)\right] \\ & \left.-\frac{2}{n} \sum_{j, k=1}^{n} \exp \left(\frac{\left\|Y_{n, j}\right\|^{2}-\left\|Y_{n, k}\right\|^{2}}{4 \gamma}\right) \cos \left(\frac{Y_{n, j}^{\top} Y_{n, k}}{2 \gamma}\right)+n\right\}, \end{aligned} $$
$$ \mathrm{HV}_{n, \gamma}=\frac{1}{n}\left(\frac{\pi}{\gamma}\right)^{d / 2} \sum_{j, k=1}^{n} \exp \left(\frac{\left\|Y_{n, j, k}^{+}\right\|^{2}}{4 \gamma}\right) \left(Y_{n, j}^{\top} Y_{n, k}-\frac{\left\|Y_{n, j, k}^{+}\right\|^{2}}{2 \gamma}+\frac{d}{2 \gamma}+\frac{\left\|Y_{n, j, k}^{+}\right\|^{2}}{4 \gamma^{2}}\right) . $$
$$ b_{1, d}=\frac{1}{n^{2}} \sum_{j, k=1}^{n}\left(Y_{n, j}^{\top} Y_{n, k}\right)^{3} $$
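The skewness measure $b_{1,d}$ above (Mardia's multivariate skewness) reduces to an elementwise cube of the Gram matrix of the standardized sample; a minimal numpy sketch, where the function name and the assumption that `Y` is already standardized are ours:

```python
import numpy as np

def mardia_skewness(Y):
    """Mardia's b_{1,d} = (1/n^2) * sum_{j,k} (Y_j^T Y_k)^3 for a
    standardized sample Y of shape (n, d)."""
    G = Y @ Y.T          # Gram matrix of inner products Y_j^T Y_k
    n = Y.shape[0]
    return float((G ** 3).sum()) / n ** 2
```

For a sample symmetric about the origin the statistic vanishes, as expected of a skewness measure.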
$$ {\rm JEPA}(\vx) \iff {\rm Enc}\left(\vx_{n,t+1,.}\right) \text{ is predictable from }{\rm Enc}\left(\vx_{n,t,.}\right), \forall n,t,\text{ and } {\rm Enc}\left(\vx_{.,.,.}\right) \text{ is not degenerate}.\label{def:SSL} $$
$$ \widehat{\vy}(\vq) := \frac{1}{|\mathcal{N}_{r_0}(\vq)|}\sum_{n \in \mathcal{N}_{r_0}(\vq)}\vy_n, \tag{kNN}\label{eq:kNN} $$
$$ \widehat \vy(\vq)\triangleq \frac{\sum_{n=1}^N K_h(\vq-\vz_n)\vy_n}{\sum_{n=1}^N K_h(\vq-\vz_n)}.\tag{Kernel}\label{eq:NW} $$
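The two nonparametric regressors of eq.~(kNN) and eq.~(Kernel) translate into a few lines of numpy; this is a minimal sketch, where the Gaussian choice for $K_h$ and the helper names are illustrative assumptions rather than anything prescribed by the text:

```python
import numpy as np

def knn_predict(q, Z, Y, r0):
    """Fixed-radius neighborhood regressor of eq. (kNN): average y_n over
    N_{r0}(q) = {n : ||q - z_n||_2 <= r0} (assumed non-empty)."""
    mask = np.linalg.norm(Z - q, axis=1) <= r0
    return Y[mask].mean(axis=0)

def nw_predict(q, Z, Y, h):
    """Nadaraya-Watson estimator of eq. (Kernel), here with a Gaussian
    kernel K_h as an illustrative choice."""
    w = np.exp(-np.linalg.norm(Z - q, axis=1) ** 2 / (2 * h ** 2))
    return (w[:, None] * Y).sum(axis=0) / w.sum()
```

Both estimators interpolate the labels `Y` attached to embeddings `Z`; the kernel version replaces the hard neighborhood with soft weights.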
$$ \begin{aligned} \text{ISB}_{k\text{-NN}} &=\frac{r_0^4}{(K+2)^2}\tau_g^2J(p)+o(r_0^4),&&\text{(k-NN)}\\ \text{ISB}_{\text{kernel}} &\le \Big(\frac{h^2\mu_2(K)}{2}\Big)^2 \Big(2 B^2 + 8 L^2J(p)\Big)+o(h^4),&&\text{(kernel)} \end{aligned} $$
$$ H_0: P_{\vtheta}=Q \quad \text{vs.} \quad H_1: P_{\vtheta}\neq Q, \label{eq:null} $$
$$ T_{\sA}(\{f_{\vtheta}(\vx_n)\}_{n=1}^N)\triangleq \max_{\va \in \sA} T(\{\va^\top f_{\vtheta}(\vx_n)\}_{n=1}^N).\label{eq:T_max} $$
$$ {\rm JB}(\vu)\triangleq\frac{N}{6}\left(\widehat{\rm skew}(\vu)^{2}+\left(\frac{\widehat{\rm kurt}(\vu)-3}{2}\right)^{2}\right)\tag{Jarque-Bera}\label{eq:jarque}, $$
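The Jarque-Bera statistic above is cheap to evaluate on a 1d projection; a minimal numpy sketch using moment-based skewness and kurtosis estimates (the helper name is ours):

```python
import numpy as np

def jarque_bera(u):
    """JB statistic of eq. (Jarque-Bera): N/6 * (skew^2 + ((kurt-3)/2)^2),
    where skew and kurt are standardized sample moments of u."""
    u = np.asarray(u, dtype=float)
    n = u.size
    c = u - u.mean()
    s2 = (c ** 2).mean()
    skew = (c ** 3).mean() / s2 ** 1.5
    kurt = (c ** 4).mean() / s2 ** 2
    return n / 6.0 * (skew ** 2 + ((kurt - 3.0) / 2.0) ** 2)
```

The statistic vanishes exactly when the sample skewness is zero and the sample kurtosis equals the Gaussian value of 3.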
$$ \begin{aligned} T_{w}&=N \int_{-\infty}^{\infty}\left(F_{N}(x)-F(x)\right)^{2} w(x) d F(x)\nonumber\\ w(x)&=1,\tag{Cramér-von Mises}\label{eq:cramer}\\ w(x)&=[F(x)(1-F(x))]^{-1}, \tag{Anderson-Darling}\label{eq:anderson_darling} \end{aligned} $$
$$ \begin{aligned} \mathcal{L}_{\rm pred}(\{\vz_{n,v}\}_{v=1}^{V})=&\frac{1}{V_g}\sum_{v=1}^{V_g}\frac{1}{V}\sum_{v'=1}^{V}\left\| \vz_{n,v} - \vz_{n,v'} \right\|_2^2\label{eq:pred1}\\ =&\frac{1}{V}\sum_{v'=1}^{V}\left\| \frac{1}{V_{\rm g}}\sum_{v=1}^{V_{\rm g}}\vz_{n,v} - \vz_{n,v'} \right\|_2^2\label{eq:pred2}\\ \triangleq& \frac{1}{V}\sum_{v'=1}^{V}\left\| \bm{\mu}_{n} - \vz_{n,v'} \right\|_2^2,\label{eq:pred_as} \end{aligned} $$
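The final form of the prediction loss is the average squared distance of every view embedding to the anchor $\bm{\mu}_n$ computed from the first $V_g$ views. A minimal numpy sketch (the shapes and helper name are our assumptions):

```python
import numpy as np

def pred_loss(Z, Vg):
    """Prediction loss in the form of eq. (pred): Z has shape (V, d), one row
    per view embedding z_{n,v} of a single sample; the first Vg rows are
    averaged into the anchor mu_n and every view is pulled toward it."""
    mu = Z[:Vg].mean(axis=0)                          # mu_n
    return float(((Z - mu) ** 2).sum(axis=1).mean())  # (1/V) sum_v' ||mu - z_v'||^2
```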
$$ C^{(\alpha)} = \rho_s\left(\frac{\text{train\_loss}}{\lambda^{\alpha}}, \text{test\_accuracy}\right). \label{eq:corr} $$
$$ \begin{aligned} &\int_{\mathbb{R}^d} K(u)du=1,\tag{normalized}\\ &\int_{\mathbb{R}^d} uK(u)du=0,\tag{symmetric}\\ &\int_{\mathbb{R}^d} u u^\top K(u)du=\mu_2(K)I_d,\tag{isotropic}\\ &R(K)\triangleq\int_{\mathbb{R}^d} K(u)^2du<\infty,\tag{finite roughness} \end{aligned} $$
$$ \begin{aligned} \mathrm{Bias}\big[\widehat \vy(\vq)\big] &=\frac{h^2\mu_2(K)}{2}\Big(\Delta \vy(\vq)+2\nabla \vy(\vq)^\top \nabla\log p(\vq)\Big)+o(h^2),\\ \mathrm{Var}\big[\widehat \vy(\vq)\big] &=\frac{R(K)}{n h^d}\frac{v(\vq)}{p(\vq)}+o\big((n h^d)^{-1}\big). \end{aligned} $$
$$ \begin{aligned} \text{Bias}(\hat{\boldsymbol{\beta}}) &= \mathbb{E}[\hat{\boldsymbol{\beta}}] - \boldsymbol{\beta}_{\text{true}} \\ &=(\mathbf{X}^T \mathbf{X} + \lambda_{\rm wd} \mathbf{I})^{-1} \mathbf{X}^T \mathbf{X}\boldsymbol{\beta}_{\text{true}}-\boldsymbol{\beta}_{\text{true}}\\ &= -\lambda_{\rm wd}(\mathbf{X}^T \mathbf{X} + \lambda_{\rm wd} \mathbf{I})^{-1} \boldsymbol{\beta}_{\text{true}}\\ &= -\lambda_{\rm wd} \mathbf{Q}(\boldsymbol{\Lambda} + \lambda_{\rm wd} \mathbf{I})^{-1}\mathbf{Q}^T \boldsymbol{\beta}_{\text{true}} \end{aligned} $$
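The key algebraic step above, that the weight-decay bias collapses to $-\lambda_{\rm wd}(\mathbf{X}^T\mathbf{X}+\lambda_{\rm wd}\mathbf{I})^{-1}\boldsymbol{\beta}_{\text{true}}$, can be spot-checked numerically; a sketch with arbitrary data (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
beta_true = rng.normal(size=4)
lam = 0.3  # the weight-decay strength lambda_wd

A = X.T @ X + lam * np.eye(4)
# E[beta_hat] - beta_true for the ridge / weight-decay estimator
bias = np.linalg.solve(A, X.T @ X @ beta_true) - beta_true
# closed form from the last line of the derivation
closed = -lam * np.linalg.solve(A, beta_true)
assert np.allclose(bias, closed)
```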
$$ \begin{aligned} \text{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) &= \mathbb{E}[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T|\mathbf{X}]\\ &= \mathbb{E}[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}|\mathbf{X}]\\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbb{E}[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T|\mathbf{X}]\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\sigma^2\mathbf{I}_n)\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\\ &= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} \end{aligned} $$
$$ \begin{aligned} &\frac{1}{K}\sum_{k=1}^K \frac{1}{\lambda_k} > \frac{1}{\frac{1}{K}\sum_{j=1}^K \lambda_j}\\ \iff &\frac{1}{K}\sum_{k=1}^K \frac{1}{\lambda_k} > \frac{1}{K}\sum_{k=1}^{K}\frac{1}{\frac{1}{K}\sum_{j=1}^K \lambda_j}\\ \iff &\sum_{k=1}^K \frac{1}{\lambda_k} > \sum_{k=1}^{K}\frac{1}{\frac{1}{K}\sum_{j=1}^K \lambda_j}\\ \iff &\text{tr}(\text{Var}(\hat{\boldsymbol{\beta}}))_{\text{aniso}} > \text{tr}(\text{Var}(\hat{\boldsymbol{\beta}}))_{\text{iso}} \end{aligned} $$
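The chain of equivalences above is the arithmetic-harmonic mean inequality applied to the eigenvalue spectrum; a quick numerical illustration with an arbitrary anisotropic spectrum (the specific eigenvalues are ours):

```python
import numpy as np

lams = np.array([0.5, 1.0, 2.0, 4.0])  # anisotropic eigenvalue spectrum
tr_aniso = (1.0 / lams).sum()          # sum_k 1/lambda_k
tr_iso = lams.size / lams.mean()       # same trace budget, isotropic spectrum
# for any non-constant positive spectrum the anisotropic trace is larger
assert tr_aniso > tr_iso
```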
$$ \begin{aligned} p(x+z)&=p(x)+\nabla p(x)^\top z+\tfrac12 z^\top H p(x)z + O(\|z\|^3),\\ \eta(x+z)&=\eta(x)+\nabla\eta(x)^\top z+\tfrac12 z^\top H\eta(x)z + O(\|z\|^3), \end{aligned} $$
$$ \begin{aligned} \mathcal{D}(x)&\triangleq\int_{\Ball(0,r_0)} p(x+z)dz\\ &= \int_{\Ball(0,r_0)} \Big[p(x) + \grad p(x)^\top z + \tfrac{1}{2}z^\top \Hess p(x)z + R_p(x;z)\Big]dz\\ &= \vol r_0^{d}p(x)\;+\;\frac{\vol r_0^{d+2}}{2(d+2)}\tr\big(\Hess p(x)\big)\;+\;O(r_0^{d+3}), \end{aligned} $$
$$ \begin{aligned} \mathcal{N}(x)&\triangleq \int_{\Ball(0,r_0)} \eta(x+z)p(x+z)dz\\ &= \int \Big[\eta(x)+\grad\eta(x)^\top z+\tfrac{1}{2}z^\top \Hess\eta(x)z\Big] \Big[p(x)+\grad p(x)^\top z+\tfrac{1}{2}z^\top \Hess p(x)z\Big]dz+O(r_0^{d+3})\\ &= \eta(x)p(x)\vol r_0^d+\eta(x)\frac{\vol r_0^{d+2}}{2(d+2)}\tr\big(\Hess p(x)\big)+\frac{\vol r_0^{d+2}}{d+2}\grad\eta(x)\cdot\grad p(x)+ \frac{\vol r_0^{d+2}}{2(d+2)}p(x)\tr\big(\Hess\eta(x)\big) +O(r_0^{d+3}). \end{aligned} $$
$$ \begin{aligned} \frac{\mathcal{N}(x)}{\mathcal{D}(x)}-\eta(x) &= \frac{ \frac{v_d r_0^{d+2}}{d+2}\left(\nabla\eta\cdot\nabla p + \frac{1}{2}p\Delta\eta\right) + O(r_0^{d+3})}{ v_d r_0^d p \left(1+\alpha r_0^2+O(r_0^3)\right)}\\[0.5ex] &= \frac{r_0^2}{d+2}\left(\frac{\nabla\eta\cdot\nabla p}{p} + \frac{1}{2}\Delta\eta\right)\Big(1-\alpha r_0^2+O(r_0^3)\Big)\ +\ O(r_0^3)\\[0.5ex] &= \frac{r_0^2}{d+2}\Big(\nabla\eta(x)\cdot\nabla\log p(x) + \tfrac{1}{2}\Delta\eta(x)\Big)\ +\ o(r_0^2), \end{aligned} $$
$$ \begin{aligned} \mathrm{Bias}(\vx) =&\frac{r_0^2}{d+2}\Big(\grad\eta(x)\cdot\grad\log p(x)\Big)\;+\;\frac{r_0^2}{2(d+2)}\Delta\eta(x)\;+\;o(r_0^2)\\ =&\frac{r_0^2}{d+2}\big(A(x)+C(x)\big)+o(r_0^2), \end{aligned} $$
$$ \begin{aligned} \mathbb{E}\big[\mathrm{Bias}(X)^2\big] &=\mathbb{E}\Big[\left(\frac{r_0^2}{d+2}\right)^2\big(A(X)^2 + 2A(X)C(X) + C(X)^2\big) + o(r_0^4)\Big]\\ &=\left(\frac{r_0^2}{d+2}\right)^2 \Big\{\underbrace{\mathbb{E}\big[A(X)^2\big]}_{\text{score-gradient term}} + \underbrace{2\mathbb{E}\big[A(X)C(X)\big]}_{\text{cross term}} + \underbrace{\mathbb{E}\big[C(X)^2\big]}_{\text{curvature term}}\Big\} + o(r_0^4). \label{eq:three-terms} \end{aligned} $$
$$ \begin{aligned} \mathbb{E}\big[A(X)^2\big] =&\mathbb{E}_X\big[\mathbb{E}_\eta[A(X)^2]\big]\\ =&\mathbb{E}_X\big[\mathbb{E}_\eta[\big(\nabla\eta(X)^\top v(X)\big)^2]\big]\\ =&\mathbb{E}_X\big[\mathbb{E}_\eta[\nabla\eta(X)^\top\Big(v(X)v(X)^\top\Big)\nabla\eta(X)]\big]\\ =&\mathbb{E}_X\big[\mathbb{E}_\eta[\mathrm{tr}\Big(v(X)v(X)^\top\nabla\eta(X)\nabla\eta(X)^\top\Big)]\big]\\ =&\mathbb{E}_X\big[\mathrm{tr}\Big(v(X)v(X)^\top\mathbb{E}_\eta[\nabla\eta(X)\nabla\eta(X)^\top]\Big)\big]\\ =&\mathbb{E}_X\big[\tau_g^2\|v(X)\|^2\big]\\ =&\tau_g^2\mathbb{E}_X\big[\|v(X)\|^2\big]\\ =&\tau_g^2\int_{\mathbb{R}^d} \|\nabla\log p(x)\|^2p(x)dx \end{aligned} $$
$$ \begin{aligned} \mathbb{E}\big[C(X)^2\big] =&\mathbb{E}_X\big[\mathbb{E}_\eta[C(X)^2]\big]\\ =&\frac{1}{4}\mathbb{E}_X\big[\mathbb{E}_\eta[(\Delta\eta(X))^2]\big] \end{aligned} $$
$$ \begin{aligned} \mathbb{E}[A_n(x)] &=n\mathbb{E}\big[K_h(x-X)\big]\\ &=n\int_{\mathbb{R}^d} h^{-d}K\Big(\frac{x-u}{h}\Big)p(u)du\\ &=n\int_{\mathbb{R}^d} K(t)p(x-h t)dt\qquad (t:=(x-u)/h)\\ &=n\int_{\mathbb{R}^d} K(t)\Big(p(x)-ht^\top \nabla p(x)+\frac{h^2}{2}t^\top \nabla^2 p(x)t+o(h^2)\Big)dt\\ &=n\Big(p(x)+\frac{h^2}{2}\underbrace{\int t^\top \nabla^2 p(x)tK(t)dt}_{=\mu_2(K)\Delta p(x)}+o(h^2)\Big), \end{aligned} $$
$$ \begin{aligned} \mathbb{E}[B_n(x)] &=n\mathbb{E}\big[K_h(x-X)Y\big] =n\int K(t)(m p)(x-h t)dt\\ &=n\int K(t)\Big((mp)(x)-ht^\top \nabla(mp)(x)+\frac{h^2}{2}t^\top \nabla^2(mp)(x)t+o(h^2)\Big)dt\\ &=n\Big(m(x)p(x)+\frac{h^2}{2}\mu_2(K)\mathrm{tr}\big(\nabla^2(mp)(x)\big)+o(h^2)\Big)\\ &=n\Big(m(x)p(x)+\frac{h^2\mu_2(K)}{2}\big(p\Delta m + m\Delta p + 2\nabla m^\top \nabla p\big)(x)+o(h^2)\Big), \end{aligned} $$
$$ \begin{aligned} \frac{\mathbb{E}[B_n(x)]}{\mathbb{E}[A_n(x)]} &=m(x) +\frac{h^2\mu_2(K)}{2}\frac{\big(p\Delta m + m\Delta p + 2\nabla m^\top \nabla p\big)p - m p\Delta p}{p^2}\Big|_{x}+o(h^2)\\ &=m(x) +\frac{h^2\mu_2(K)}{2}\Big(\Delta m(x)+2\nabla m(x)^\top \frac{\nabla p(x)}{p(x)}\Big)+o(h^2), \end{aligned} $$
$$ \begin{aligned} \mathrm{Var}[B_n(x)] &= \sum_{i=1}^n \mathrm{Var}\big(K_h(x-X_i)Y_i\big)\quad\text{(independence)}\\ &= n\mathbb{E}\big[K_h(x-X)^2\mathrm{Var}(Y\mid X)\big] = n\mathbb{E}\big[K_h(x-X)^2v(X)\big]\\ &= n\int h^{-2d}K\Big(\frac{x-u}{h}\Big)^2 v(u)p(u)du\\ &= n h^{-d}\int K(t)^2v(x-h t)p(x-h t)dt = n h^{-d}\Big(R(K)v(x)p(x)+o(1)\Big), \end{aligned} $$
$$ \begin{aligned} \text{LHS} &= \frac{1}{V_g V}\sum_{v=1}^{V_g}\sum_{v'=1}^{V}\| \mathbf{z}_{n,v} - \mathbf{z}_{n,v'} \|_2^2 \\ &= \frac{1}{V_g V}\sum_{v=1}^{V_g}\sum_{v'=1}^{V}\left(\|\mathbf{z}_{n,v}\|_2^2 - 2\mathbf{z}_{n,v}^T\mathbf{z}_{n,v'} + \|\mathbf{z}_{n,v'}\|_2^2\right) \\ &= \frac{1}{V_g}\sum_{v=1}^{V_g}\|\mathbf{z}_{n,v}\|_2^2 - \frac{2}{V_g V}\sum_{v=1}^{V_g}\sum_{v'=1}^{V}\mathbf{z}_{n,v}^T\mathbf{z}_{n,v'} + \frac{1}{V}\sum_{v'=1}^{V}\|\mathbf{z}_{n,v'}\|_2^2 \\ &= \frac{1}{V_g}\sum_{v=1}^{V_g}\|\mathbf{z}_{n,v}\|_2^2 - \frac{2}{V}\bar{\mathbf{z}}^T\sum_{v'=1}^{V}\mathbf{z}_{n,v'} + \frac{1}{V}\sum_{v'=1}^{V}\|\mathbf{z}_{n,v'}\|_2^2 \end{aligned} $$
$$ \begin{aligned} \|\bar{\mathbf{z}}\|_2^2 &= \left\|\frac{1}{V_g}\sum_{v=1}^{V_g}\mathbf{z}_{n,v}\right\|_2^2 \\ &= \frac{1}{V_g^2}\sum_{v=1}^{V_g}\sum_{v''=1}^{V_g}\mathbf{z}_{n,v}^T\mathbf{z}_{n,v''} \\ &\le \frac{1}{V_g}\sum_{v=1}^{V_g}\|\mathbf{z}_{n,v}\|_2^2 \end{aligned} $$
$$ \begin{aligned} \mathcal{B}^2(h;p,m) &=\Big(\frac{h^2\mu_2(K)}{2}\Big)^2\int \Big(\Delta m(x)+2\nabla m(x)^\top \nabla\log p(x)\Big)^2p(x)dx+o(h^4)\\ &\le \Big(\frac{h^2\mu_2(K)}{2}\Big)^2 \int \Big(2(\Delta m(x))^2+2(2\nabla m(x)^\top \nabla\log p(x))^2\Big)p(x)dx+o(h^4)\\ &=\Big(\frac{h^2\mu_2(K)}{2}\Big)^2\Big(2\int (\Delta m(x))^2p(x)dx+8\int (\nabla m(x)^\top \nabla\log p(x))^2p(x)dx\Big)+o(h^4), \end{aligned} $$
$$ \begin{aligned} &(\nabla m(x)^\top \nabla\log p(x))^2 \le \|\nabla m(x)\|^2\|\nabla\log p(x)\|^2 \le L^2\|\nabla\log p(x)\|^2\\ \implies &\int (\nabla m(x)^\top \nabla\log p(x))^2p(x)dx \le L^2\int \|\nabla\log p(x)\|^2p(x)dx = L^2J(p). \end{aligned} $$
$$ \mathcal{V}(h;p) =\int \Big(\frac{R(K)}{n h^d}\frac{v(x)}{p(x)}+o\big((n h^d)^{-1}\big)\Big)p(x)dx=\frac{R(K)}{n h^d}\int v(x)dx+o\big((n h^d)^{-1}\big), $$
$$ \begin{aligned} \frac{\partial \widehat{D}_V}{\partial X_i} =& \int_{\mathbb{R}} w_s(t)2\Re\!\Big(\big(\widehat{\varphi}_N(t)-\varphi_G(t)\big)\overline{\frac{\partial \widehat{\varphi}_N(t)}{\partial X_i}}\Big)dt,\\ \frac{\partial \widehat{\varphi}_N(t)}{\partial X_i} =& \frac{1}{N}i te^{itX_i}, \end{aligned} $$
$$ \begin{aligned} \mathbb{E}\left[ \big|\phi_n - \psi\big|^2 \right] =& \mathbb{E}\left[ |\phi_n|^2 \right] - \psi\mathbb{E}\left[ \overline{\phi_n} \right] - \overline{\psi}\mathbb{E}\left[ \phi_n \right] + |\psi|^2,\\ =& \mathbb{E}\left[ |\phi_n|^2 \right] - \psi\overline{\mathbb{E}[\phi_n]}-\overline{\psi}\frac{1}{n}\sum_{j=1}^n \mathbb{E}[Z_j] + |\psi|^2,\\ =& \mathbb{E}\left[ |\phi_n|^2 \right] -\psi\overline{\phi_\theta} - \overline{\psi}\phi_\theta+ |\psi|^2,\\ =&\mathbb{E}\left[ |\phi_n|^2 \right] -2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big)+ |\psi|^2,\\ =&\mathbb{E}\left[ \left|\frac{1}{n}\sum_{j=1}^n Z_j\right|^2 \right]-2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big)+ |\psi|^2,\\ =&\frac{1}{n^2}\sum_{j=1}^n \sum_{l=1}^n \mathbb{E}\left[ Z_j\overline{Z_l} \right]-2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big)+ |\psi|^2, \end{aligned} $$
$$ \begin{aligned} \mathbb{E}\left[ |\phi_n|^2 \right] =& \frac{1}{n^2}\Big( n + n(n-1) |\phi_\theta|^2 \Big)\\ =& \frac{1}{n} + \left(1-\frac{1}{n}\right)|\phi_\theta|^2\\ =& |\phi_\theta|^2 + \frac{1-|\phi_\theta|^2}{n} \end{aligned} $$
$$ \begin{aligned} \mathbb{E}\left[ \big|\phi_n - \psi\big|^2 \right] &= \left( |\phi_\theta|^2 + \frac{1-|\phi_\theta|^2}{n} \right) - 2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big) + |\psi|^2 \\[4pt] &= \big( |\phi_\theta|^2 - 2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big) + |\psi|^2 \big) + \frac{1-|\phi_\theta|^2}{n} \\[4pt] &= \big|\phi_\theta - \psi\big|^2 + \frac{1-|\phi_\theta|^2}{n}. \end{aligned} $$
$$ \begin{aligned} \frac{1}{2}(\mathbf{e}_i^T \boldsymbol{\Sigma} \mathbf{e}_i + 2\mathbf{e}_i^T \boldsymbol{\Sigma} \mathbf{e}_j + \mathbf{e}_j^T \boldsymbol{\Sigma} \mathbf{e}_j) &= 1\\ \frac{1}{2}(\Sigma_{ii} + 2\Sigma_{ij} + \Sigma_{jj}) &= 1\\ \frac{1}{2}(1 + 2\Sigma_{ij} + 1) &= 1\\ 1 + \Sigma_{ij} &= 1\\ \Sigma_{ij} &= 0 \end{aligned} $$
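The derivation above can be illustrated numerically: with a unit diagonal, the variance along $\va=(\mathbf{e}_i+\mathbf{e}_j)/\sqrt{2}$ equals $1+\Sigma_{ij}$, so forcing it to one forces $\Sigma_{ij}=0$. A minimal sketch (the helper name is ours):

```python
import numpy as np

def var_along_pair(Sigma, i, j):
    """Variance of a^T X for a = (e_i + e_j)/sqrt(2): with unit diagonal
    this equals 1 + Sigma_ij, matching the derivation above."""
    d = Sigma.shape[0]
    a = (np.eye(d)[i] + np.eye(d)[j]) / np.sqrt(2)
    return a @ Sigma @ a

Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])  # unit diagonal, correlated
assert np.isclose(var_along_pair(Sigma, 0, 1), 1.0 + Sigma[0, 1])
```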
$$ \text{SSR}(\boldsymbol{\beta}) &= \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} $$
$$ N\int |\hat{\psi}_{N}(\vt) - \psi_{0}(\vt) |^2\omega(\vt) dt=N\int |\hat{\psi}_{N}(\vt) - e^{-\|\vt\|_2^2/2} |^2\omega(\vt) dt, $$
$$ \begin{aligned} \mathrm{BHEP}_{n, \beta}= & \frac{1}{n} \sum_{j, k=1}^{n} \exp \left(-\frac{\beta^{2}\left\|Y_{n, j}-Y_{n, k}\right\|^{2}}{2}\right) \\ & -\frac{2}{\left(1+\beta^{2}\right)^{d / 2}} \sum_{j=1}^{n} \exp \left(-\frac{\beta^{2}\left\|Y_{n, j}\right\|^{2}}{2\left(1+\beta^{2}\right)}\right)+\frac{n}{\left(1+2 \beta^{2}\right)^{d / 2}} . \end{aligned} $$
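The BHEP statistic as written above translates directly into numpy; a minimal sketch (the function name is ours, and `Y` is assumed to be the standardized sample $Y_{n,j}$):

```python
import numpy as np

def bhep(Y, beta):
    """BHEP statistic as written above for a standardized sample Y of
    shape (n, d); nonnegative since it is an integrated squared distance."""
    n, d = Y.shape
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # ||Y_j - Y_k||^2
    t1 = np.exp(-beta**2 * sq / 2).sum() / n
    t2 = 2 * np.exp(-beta**2 * (Y**2).sum(1) / (2 * (1 + beta**2))).sum() \
         / (1 + beta**2) ** (d / 2)
    t3 = n / (1 + 2 * beta**2) ** (d / 2)
    return t1 - t2 + t3
```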
$$ \begin{aligned} T_{n, \beta}=\pi^{d / 2}&\left(\frac{1}{n} \sum_{i, j=1}^{n} \frac{1}{\beta^{d / 2}} \exp \left(\frac{\left\|Y_{n, i}+Y_{n, j}\right\|^{2}}{4 \beta}\right)+\frac{n}{(\beta-1)^{d / 2}}\right. \\ &\left.-2 \sum_{j=1}^{n} \frac{1}{(\beta-1 / 2)^{d / 2}} \exp \left(\frac{\left\|Y_{n, j}\right\|^{2}}{4 \beta-2}\right)\right), \end{aligned} $$
$$ \begin{aligned} \mathbb{E}_{\va} \left[ \int_{\mathbb{R}} \left| \varphi_a(t) - \varphi_{\mathcal{N}}(t) \right|^2 dt \right] \leq\ & C(K, \alpha) |\sA|^{-2\alpha/(K-1)} \\ &\times\int_0^\infty \left\| \varphi_{\cdot}(r) - \varphi_{\mathcal{N}}(r) \right\|_{H^\alpha(\mathcal{S}^{K-1})}^2 dr, \end{aligned} $$
$$ \mathcal{L}_{\rm LeJEPA}(\{\vx_{n,v}\}_{n,v=1}^{B,V})=\frac{\lambda}{V}\sum_{v=1}^{V}{\rm SIGReg}(\{\vz_{n,v}\}_{n=1}^{B})\\ +\frac{1-\lambda}{B}\sum_{n=1}^{B}\mathcal{L}^{(V_{\rm g})}_{\rm pred}(\{\vz_{n,v}\}_{v=1}^{V}).\tag{LeJEPA}\label{eq:lejepa} $$
$$ \mathcal{M}(L,B)\triangleq\Big\{m\in C^2(\mathbb{R}^d):\|\nabla \vy(\vq)\|\le L,\ |\Delta \vy(\vq)|\le B, \ \forall \vq\in\mathbb{R}^d\Big\}, $$
\begin{center}
\vspace{-0.7cm}
\begin{minipage}{0.44\linewidth}
\centering
\includegraphics[width=\linewidth]{toy_figures/exps/loss_corr/loss_corr_ViT-base-8_inet1k.pdf}
\end{minipage}%
\hfill
\begin{minipage}{0.54\linewidth}
\centering
\includegraphics[width=\linewidth]{toy_figures/teaser/training_comparison.pdf}
\end{minipage}
\vspace{0.03cm}
\begin{minipage}{0.44\linewidth}
\centering
\includegraphics[width=0.485\linewidth]{toy_figures/pca/n02099601_5984_original.png}\hfill
\includegraphics[width=0.485\linewidth]{toy_figures/pca/n02099601_5984_pca.png}
\end{minipage}%
\hfill
\begin{minipage}{0.54\linewidth}
\centering
\footnotesize
\setlength{\tabcolsep}{2.5pt}
\renewcommand{\arraystretch}{0.95}
\begin{tabular}{lcccc}
\toprule
& \multicolumn{2}{c}{\textbf{Full FT}} & \multicolumn{2}{c}{\textbf{Frozen}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5}
\textbf{Method} & \textbf{1-sh} & \textbf{Full} & \textbf{1-sh} & \textbf{Full} \\
\midrule
\multicolumn{5}{l}{\textit{LeJEPA (in-domain)}} \\
\;\;ConvNeXt-V2 Nano & \textbf{29.42} & 82.72 & 28.74 & 76.52 \\
\;\;ResNet-34 & 24.27 & \textbf{83.28} & \textbf{31.08} & \textbf{78.17} \\
\midrule
\multicolumn{5}{l}{\textit{Frontier (transfer)}} \\
\;\;DINOv2 ViT-S/16 & 21.05 & 78.34 & 27.68 & 67.62 \\
\;\;DINOv3 ViT-S/16 & 24.71 & 81.60 & 30.17 & 71.38 \\
\bottomrule
\end{tabular}
\end{minipage}
\vspace{0.05cm}
\captionof{figure}{\textbf{LeJEPA overview.} \textbf{Top-left:} Training loss exhibits strong correlation with downstream linear probe performance on ImageNet-1k (ViT-base), providing the first practical loss for model selection without supervised probing. \textbf{Top-right:} Training remains stable without heuristics, with a smooth training loss even on 1.8B-parameter ViT-g models. \textbf{Bottom-left:} PCA features from an ImageNet-1k pretrained LeJEPA ViT-Large demonstrate clear semantic relationships. \textbf{Bottom-right:} Galaxy10 in-domain results showing that LeJEPA's in-domain pretraining consistently outperforms transfer learning from state-of-the-art frontier foundation models (DINOv2/v3, trained on natural images) across data regimes from 1-shot to full supervision. This demonstrates that \textit{domain-specific SSL beats generic transfer learning}, even against massive-scale frontier models, when the framework scales effortlessly to any domain, model, and data scale.}
\label{fig:teaser}
\vspace{-0.5cm}
\end{center} } \icmlrunningtitle{LeJEPA:} \usepackage{natbib} \usepackage{hyperref} \usepackage{microtype} \usepackage{graphicx} \usepackage{subfigure} \usepackage{booktabs} % for professional tables \usepackage[toc,page,header]{appendix} \usepackage{minitoc} \usepackage{makecell} \usepackage{dblfloatfix} \usepackage{amsmath} \usepackage[capitalize,noabbrev]{cleveref} \usepackage{mathtools} \usepackage{cuted} % For the strip environment \usepackage[most]{tcolorbox} \tcbuselibrary{theorems} \usepackage{caption} \usepackage{enumitem} \usepackage{listings}
\usepackage{tgheros} % TeX Gyre Heros (Helvetica-like, sans-serif, proportional) \definecolor{bg}{HTML}{F8FAFD} \definecolor{keyword}{HTML}{0077B6} \definecolor{string}{HTML}{E76F51} \definecolor{comment}{HTML}{A0AEC0} \definecolor{number}{HTML}{457B9D} \definecolor{function}{HTML}{2A9D8F} \definecolor{class}{HTML}{F4A261} \definecolor{text}{HTML}{2D3A4A} \lstdefinestyle{pytorchheros}{ backgroundcolor=\color{bg}, basicstyle=\fontfamily{qhv}\selectfont\footnotesize\color{text}, % qhv = TeX Gyre Heros keywordstyle=\color{keyword}\bfseries, stringstyle=\color{string}, commentstyle=\color{comment}\itshape, numberstyle=\color{number}, identifierstyle=\color{text}, classoffset=1, morekeywords={Net}, keywordstyle=\color{class}\bfseries, classoffset=0, emph={init,forward,print}, emphstyle=\color{function}, frame=single, framerule=0pt, rulecolor=\color{bg}, tabsize=4, showstringspaces=false, breaklines=true, linewidth=\linewidth, xleftmargin=0em, xrightmargin=0em, aboveskip=1em, belowskip=1em, literate={~}{{\textasciitilde}}1 } \lstset{language=Python, style=pytorchheros} \captionsetup[lstlisting]{labelfont=bf,font=small} \renewcommand\lstlistingname{Algorithm} \crefname{lstlisting}{\MakeLowercase\lstlistingname}{\MakeLowercase\lstlistingname s} \Crefname{lstlisting}{\lstlistingname}{\lstlistingname s} \usepackage{float} % For defining custom floats \newfloat{lstfloat}{htbp}{lop} \floatname{lstfloat}{Listing} \def\lstfloatautorefname{Listing}
\usepackage{amsmath,amsthm} \usepackage{bm,bbm}
\usepackage{varwidth}
\usepackage{hyperref} \usepackage{cleveref} \usepackage{mathtools} \usepackage{algpseudocode} \usepackage{mdframed}
\usepackage{multicol} \usepackage{multirow} \usepackage{setspace}
\usepackage{colortbl}
\usepackage[svgnames]{xcolor} \usepackage{framed}
\newcommand{\1}{\mathbf{1}} \newcommand{\norm}[1]{\left\lVert #1\right\rVert} \newcommand{\grad}{\nabla} \newcommand{\Hess}{H} \newcommand{\tr}{\mathrm{tr}} \newcommand{\Ball}{\mathrm{B}} \newcommand{\vol}{v_d} \newcommand{\surf}{s_{d-1}} \newcommand{\RR}{\mathbb{R}} \newcommand{\PP}{\mathbb{P}} \newcommand{\EE}{\mathbb{E}} \newcommand{\ind}{\mathbbm{1}} \newcommand{\Law}{\mathcal{L}} \newcommand{\sphere}{S^{d-1}} \newcommand{\inner}[2]{\left\langle #1, #2 \right\rangle} \newcommand{\BL}{\mathrm{BL}} \newcommand{\dBL}{d_{\mathrm{BL}}} \newcommand{\RightarrowDist}{\Rightarrow} \newcommand{\toProb}{\xrightarrow{\,\mathbb{P}\,}}
\usepackage{tikz,ifthen} \usetikzlibrary{positioning} \usetikzlibrary{shapes} \usetikzlibrary{arrows} \usetikzlibrary{fit} \usetikzlibrary{calc} \usetikzlibrary{shapes.misc}
\crefname{defi}{defn.}{defns.}
\def\ceil#1{\lceil #1 \rceil} \def\floor#1{\lfloor #1 \rfloor} \def\1{\bm{1}} \newcommand{\ReLU}{\text{ReLU}} \newcommand{\flatten}{\text{vec}} \newcommand{\train}{\mathcal{D}} \newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}} \newcommand{\test}{\mathcal{D_{\mathrm{test}}}} \def\eps{{\epsilon}} \def\cst{{\rm cst}} \newcommand{\rmat}[2]{\mathcal{M}_{#1,#2}(\mathbb{R})} \newcommand{\romat}[2]{\mathcal{O}_{#1}(\mathbb{R})} \DeclareMathOperator{\spn}{span} \DeclareMathOperator{\diag}{diag} \DeclareMathOperator{\sign}{sign} \DeclareMathOperator{\Tr}{Tr} \newcommand{\Trp}[1]{\Tr\left(#1\right)} \DeclareMathOperator{\eigvec}{eigvec} \newcommand{\eigvecp}[1]{\eigvec\left(#1\right)}
\def\reta{{\textnormal{$\eta$}}} \def\ra{{\textnormal{a}}} \def\rb{{\textnormal{b}}} \def\rc{{\textnormal{c}}} \def\rd{{\textnormal{d}}} \def\re{{\textnormal{e}}} \def\rf{{\textnormal{f}}} \def\rg{{\textnormal{g}}} \def\rh{{\textnormal{h}}} \def\ri{{\textnormal{i}}} \def\rj{{\textnormal{j}}} \def\rk{{\textnormal{k}}} \def\rl{{\textnormal{l}}} \def\rn{{\textnormal{n}}} \def\ro{{\textnormal{o}}} \def\rp{{\textnormal{p}}} \def\rq{{\textnormal{q}}} \def\rr{{\textnormal{r}}} \def\rs{{\textnormal{s}}} \def\rt{{\textnormal{t}}} \def\ru{{\textnormal{u}}} \def\rv{{\textnormal{v}}} \def\rw{{\textnormal{w}}} \def\rx{{\textnormal{x}}} \def\ry{{\textnormal{y}}} \def\rz{{\textnormal{z}}}
\def\rvepsilon{{\mathbf{\epsilon}}} \def\rvtheta{{\mathbf{\theta}}} \def\rva{{\mathbf{a}}} \def\rvb{{\mathbf{b}}} \def\rvc{{\mathbf{c}}} \def\rvd{{\mathbf{d}}} \def\rve{{\mathbf{e}}} \def\rvf{{\mathbf{f}}} \def\rvg{{\mathbf{g}}} \def\rvh{{\mathbf{h}}} \def\rvu{{\mathbf{i}}} \def\rvj{{\mathbf{j}}} \def\rvk{{\mathbf{k}}} \def\rvl{{\mathbf{l}}} \def\rvm{{\mathbf{m}}} \def\rvn{{\mathbf{n}}} \def\rvo{{\mathbf{o}}} \def\rvp{{\mathbf{p}}} \def\rvq{{\mathbf{q}}} \def\rvr{{\mathbf{r}}} \def\rvs{{\mathbf{s}}} \def\rvt{{\mathbf{t}}} \def\rvu{{\mathbf{u}}} \def\rvv{{\mathbf{v}}} \def\rvw{{\mathbf{w}}} \def\rvx{{\mathbf{x}}} \def\rvy{{\mathbf{y}}} \def\rvz{{\mathbf{z}}}
\def\erva{{\textnormal{a}}} \def\ervb{{\textnormal{b}}} \def\ervc{{\textnormal{c}}} \def\ervd{{\textnormal{d}}} \def\erve{{\textnormal{e}}} \def\ervf{{\textnormal{f}}} \def\ervg{{\textnormal{g}}} \def\ervh{{\textnormal{h}}} \def\ervi{{\textnormal{i}}} \def\ervj{{\textnormal{j}}} \def\ervk{{\textnormal{k}}} \def\ervl{{\textnormal{l}}} \def\ervm{{\textnormal{m}}} \def\ervn{{\textnormal{n}}} \def\ervo{{\textnormal{o}}} \def\ervp{{\textnormal{p}}} \def\ervq{{\textnormal{q}}} \def\ervr{{\textnormal{r}}} \def\ervs{{\textnormal{s}}} \def\ervt{{\textnormal{t}}} \def\ervu{{\textnormal{u}}} \def\ervv{{\textnormal{v}}} \def\ervw{{\textnormal{w}}} \def\ervx{{\textnormal{x}}} \def\ervy{{\textnormal{y}}} \def\ervz{{\textnormal{z}}}
\def\rmA{{\mathbf{A}}} \def\rmB{{\mathbf{B}}} \def\rmC{{\mathbf{C}}} \def\rmD{{\mathbf{D}}} \def\rmE{{\mathbf{E}}} \def\rmF{{\mathbf{F}}} \def\rmG{{\mathbf{G}}} \def\rmH{{\mathbf{H}}} \def\rmI{{\mathbf{I}}} \def\rmJ{{\mathbf{J}}} \def\rmK{{\mathbf{K}}} \def\rmL{{\mathbf{L}}} \def\rmM{{\mathbf{M}}} \def\rmN{{\mathbf{N}}} \def\rmO{{\mathbf{O}}} \def\rmP{{\mathbf{P}}} \def\rmQ{{\mathbf{Q}}} \def\rmR{{\mathbf{R}}} \def\rmS{{\mathbf{S}}} \def\rmT{{\mathbf{T}}} \def\rmU{{\mathbf{U}}} \def\rmV{{\mathbf{V}}} \def\rmW{{\mathbf{W}}} \def\rmX{{\mathbf{X}}} \def\rmY{{\mathbf{Y}}} \def\rmZ{{\mathbf{Z}}}
\def\ermA{{\textnormal{A}}} \def\ermB{{\textnormal{B}}} \def\ermC{{\textnormal{C}}} \def\ermD{{\textnormal{D}}} \def\ermE{{\textnormal{E}}} \def\ermF{{\textnormal{F}}} \def\ermG{{\textnormal{G}}} \def\ermH{{\textnormal{H}}} \def\ermI{{\textnormal{I}}} \def\ermJ{{\textnormal{J}}} \def\ermK{{\textnormal{K}}} \def\ermL{{\textnormal{L}}} \def\ermM{{\textnormal{M}}} \def\ermN{{\textnormal{N}}} \def\ermO{{\textnormal{O}}} \def\ermP{{\textnormal{P}}} \def\ermQ{{\textnormal{Q}}} \def\ermR{{\textnormal{R}}} \def\ermS{{\textnormal{S}}} \def\ermT{{\textnormal{T}}} \def\ermU{{\textnormal{U}}} \def\ermV{{\textnormal{V}}} \def\ermW{{\textnormal{W}}} \def\ermX{{\textnormal{X}}} \def\ermY{{\textnormal{Y}}} \def\ermZ{{\textnormal{Z}}}
\def\vzero{{\bm{0}}} \def\vone{{\bm{1}}} \def\vmu{{\bm{\mu}}} \def\vsigma{{\bm{\sigma}}} \def\vtheta{{\bm{\theta}}} \def\vepsilon{{\bm{\epsilon}}} \def\va{{\bm{a}}} \def\vb{{\bm{b}}} \def\vc{{\bm{c}}} \def\vd{{\bm{d}}} \def\ve{{\bm{e}}} \def\vf{{\bm{f}}} \def\vg{{\bm{g}}} \def\vh{{\bm{h}}} \def\vi{{\bm{i}}} \def\vj{{\bm{j}}} \def\vk{{\bm{k}}} \def\vl{{\bm{l}}} \def\vm{{\bm{m}}} \def\vn{{\bm{n}}} \def\vo{{\bm{o}}} \def\vp{{\bm{p}}} \def\vq{{\bm{q}}} \def\vr{{\bm{r}}} \def\vs{{\bm{s}}} \def\vt{{\bm{t}}} \def\vu{{\bm{u}}} \def\vv{{\bm{v}}} \def\vw{{\bm{w}}} \def\vx{{\bm{x}}} \def\vy{{\bm{y}}} \def\vz{{\bm{z}}}
\def\evalpha{{\alpha}} \def\evbeta{{\beta}} \def\evepsilon{{\epsilon}} \def\evlambda{{\lambda}} \def\evomega{{\omega}} \def\evmu{{\mu}} \def\evpsi{{\psi}} \def\evsigma{{\sigma}} \def\evtheta{{\theta}} \def\eva{{a}} \def\evb{{b}} \def\evc{{c}} \def\evd{{d}} \def\eve{{e}} \def\evf{{f}} \def\evg{{g}} \def\evh{{h}} \def\evi{{i}} \def\evj{{j}} \def\evk{{k}} \def\evl{{l}} \def\evm{{m}} \def\evn{{n}} \def\evo{{o}} \def\evp{{p}} \def\evq{{q}} \def\evr{{r}} \def\evs{{s}} \def\evt{{t}} \def\evu{{u}} \def\evv{{v}} \def\evw{{w}} \def\evx{{x}} \def\evy{{y}} \def\evz{{z}}
\def\mA{{\bm{A}}} \def\mB{{\bm{B}}} \def\mC{{\bm{C}}} \def\mD{{\bm{D}}} \def\mE{{\bm{E}}} \def\mF{{\bm{F}}} \def\mG{{\bm{G}}} \def\mH{{\bm{H}}} \def\mI{{\bm{I}}} \def\mJ{{\bm{J}}} \def\mK{{\bm{K}}} \def\mL{{\bm{L}}} \def\mM{{\bm{M}}} \def\mN{{\bm{N}}} \def\mO{{\bm{O}}} \def\mP{{\bm{P}}} \def\mQ{{\bm{Q}}} \def\mR{{\bm{R}}} \def\mS{{\bm{S}}} \def\mT{{\bm{T}}} \def\mU{{\bm{U}}} \def\mV{{\bm{V}}} \def\mW{{\bm{W}}} \def\mX{{\bm{X}}} \def\mY{{\bm{Y}}} \def\mZ{{\bm{Z}}} \def\mBeta{{\bm{\beta}}} \def\mPhi{{\bm{\Phi}}} \def\mLambda{{\bm{\Lambda}}} \def\mSigma{{\bm{\Sigma}}}
\DeclareMathAlphabet{\mathsfit}{\encodingdefault}{\sfdefault}{m}{sl} \SetMathAlphabet{\mathsfit}{bold}{\encodingdefault}{\sfdefault}{bx}{n} \newcommand{\tens}[1]{\bm{\mathsfit{#1}}} \def\tA{{\tens{A}}} \def\tB{{\tens{B}}} \def\tC{{\tens{C}}} \def\tD{{\tens{D}}} \def\tE{{\tens{E}}} \def\tF{{\tens{F}}} \def\tG{{\tens{G}}} \def\tH{{\tens{H}}} \def\tI{{\tens{I}}} \def\tJ{{\tens{J}}} \def\tK{{\tens{K}}} \def\tL{{\tens{L}}} \def\tM{{\tens{M}}} \def\tN{{\tens{N}}} \def\tO{{\tens{O}}} \def\tP{{\tens{P}}} \def\tQ{{\tens{Q}}} \def\tR{{\tens{R}}} \def\tS{{\tens{S}}} \def\tT{{\tens{T}}} \def\tU{{\tens{U}}} \def\tV{{\tens{V}}} \def\tW{{\tens{W}}} \def\tX{{\tens{X}}} \def\tY{{\tens{Y}}} \def\tZ{{\tens{Z}}}
\def\gA{{\mathcal{A}}} \def\gB{{\mathcal{B}}} \def\gC{{\mathcal{C}}} \def\gD{{\mathcal{D}}} \def\gE{{\mathcal{E}}} \def\gF{{\mathcal{F}}} \def\gG{{\mathcal{G}}} \def\gH{{\mathcal{H}}} \def\gI{{\mathcal{I}}} \def\gJ{{\mathcal{J}}} \def\gK{{\mathcal{K}}} \def\gL{{\mathcal{L}}} \def\gM{{\mathcal{M}}} \def\gN{{\mathcal{N}}} \def\gO{{\mathcal{O}}} \def\gP{{\mathcal{P}}} \def\gQ{{\mathcal{Q}}} \def\gR{{\mathcal{R}}} \def\gS{{\mathcal{S}}} \def\gT{{\mathcal{T}}} \def\gU{{\mathcal{U}}} \def\gV{{\mathcal{V}}} \def\gW{{\mathcal{W}}} \def\gX{{\mathcal{X}}} \def\gY{{\mathcal{Y}}} \def\gZ{{\mathcal{Z}}}
\def\sA{{\mathbb{A}}} \def\sB{{\mathbb{B}}} \def\sC{{\mathbb{C}}} \def\sD{{\mathbb{D}}} \def\sF{{\mathbb{F}}} \def\sG{{\mathbb{G}}} \def\sH{{\mathbb{H}}} \def\sI{{\mathbb{I}}} \def\sJ{{\mathbb{J}}} \def\sK{{\mathbb{K}}} \def\sL{{\mathbb{L}}} \def\sM{{\mathbb{M}}} \def\sN{{\mathbb{N}}} \def\sO{{\mathbb{O}}} \def\sP{{\mathbb{P}}} \def\sQ{{\mathbb{Q}}} \def\sR{{\mathbb{R}}} \def\sS{{\mathbb{S}}} \def\sT{{\mathbb{T}}} \def\sU{{\mathbb{U}}} \def\sV{{\mathbb{V}}} \def\sW{{\mathbb{W}}} \def\sX{{\mathbb{X}}} \def\sY{{\mathbb{Y}}} \def\sZ{{\mathbb{Z}}}
\def\emLambda{{\Lambda}} \def\emA{{A}} \def\emB{{B}} \def\emC{{C}} \def\emD{{D}} \def\emE{{E}} \def\emF{{F}} \def\emG{{G}} \def\emH{{H}} \def\emI{{I}} \def\emJ{{J}} \def\emK{{K}} \def\emL{{L}} \def\emM{{M}} \def\emN{{N}} \def\emO{{O}} \def\emP{{P}} \def\emQ{{Q}} \def\emR{{R}} \def\emS{{S}} \def\emT{{T}} \def\emU{{U}} \def\emV{{V}} \def\emW{{W}} \def\emX{{X}} \def\emY{{Y}} \def\emZ{{Z}} \def\emSigma{{\Sigma}}
\newcommand{\etens}[1]{\mathsfit{#1}} \def\etLambda{{\etens{\Lambda}}} \def\etA{{\etens{A}}} \def\etB{{\etens{B}}} \def\etC{{\etens{C}}} \def\etD{{\etens{D}}} \def\etE{{\etens{E}}} \def\etF{{\etens{F}}} \def\etG{{\etens{G}}} \def\etH{{\etens{H}}} \def\etI{{\etens{I}}} \def\etJ{{\etens{J}}} \def\etK{{\etens{K}}} \def\etL{{\etens{L}}} \def\etM{{\etens{M}}} \def\etN{{\etens{N}}} \def\etO{{\etens{O}}} \def\etP{{\etens{P}}} \def\etQ{{\etens{Q}}} \def\etR{{\etens{R}}} \def\etS{{\etens{S}}} \def\etT{{\etens{T}}} \def\etU{{\etens{U}}} \def\etV{{\etens{V}}} \def\etW{{\etens{W}}} \def\etX{{\etens{X}}} \def\etY{{\etens{Y}}} \def\etZ{{\etens{Z}}}
\newcommand{\pdata}{p_{\rm{data}}} \newcommand{\ptrain}{\hat{p}_{\rm{data}}} \newcommand{\Ptrain}{\hat{P}_{\rm{data}}} \newcommand{\pmodel}{p_{\rm{model}}} \newcommand{\Pmodel}{P_{\rm{model}}} \newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}} \newcommand{\pencode}{p_{\rm{encoder}}} \newcommand{\pdecode}{p_{\rm{decoder}}} \newcommand{\precons}{p_{\rm{reconstruct}}}
\newcommand{\E}{\mathbb{E}} \newcommand{\Ls}{\mathcal{L}} \newcommand{\R}{\mathbb{R}} \newcommand{\emp}{\tilde{p}} \newcommand{\lr}{\alpha} \newcommand{\reg}{\lambda} \newcommand{\rect}{\mathrm{rectifier}} \newcommand{\softmax}{\mathrm{softmax}} \newcommand{\sigmoid}{\sigma} \newcommand{\softplus}{\zeta} \newcommand{\KL}{D_{\mathrm{KL}}} \newcommand{\Var}{\mathrm{Var}} \newcommand{\standarderror}{\mathrm{SE}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\normlzero}{L^0} \newcommand{\normlone}{L^1} \newcommand{\normltwo}{L^2} \newcommand{\normlp}{L^p} \newcommand{\normmax}{L^\infty}
\newcommand{\parents}{Pa} % See usage in notation.tex. Chosen to match Daphne's book.
\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min}
\let\ab\allowbreak
\sectionheaderline{% \seclink{sec:intro}{1}{Intro} | \seclink{sec:background}{2}{Background} | \seclink{sec:gaussian}{3}{Why Gaussian?} | \seclink{sec:bcs}{4}{SIGReg} | \seclink{sec:lejepa}{5}{LeJEPA} | \seclink{sec:experiments}{6}{Experiments} }
\begin{document} \icmlmaketitle
\newtcbtheorem [ crefname={detail}{detail}] {detail}% name {Experiment Details}% title {% fontupper=\small, colback=orange!5, colframe=orange!35!black, fonttitle=\bfseries, boxsep=1pt, left=1.5mm, right=1.5mm, top=2mm, bottom=1mm, }% options {detail}% prefix
\newtcbtheorem [ crefname={def.}{def.}] {definition}% name {Definition}% title {% fontupper=\small, colback=green!5, colframe=green!35!black, fonttitle=\bfseries, boxsep=1pt, left=1.5mm, right=1.5mm, top=2mm, bottom=1mm, }% options {def}% prefix
\newtcbtheorem [ crefname={thm.}{thms.}] {theorem}% name {Theorem}% title {% fontupper=\small, colback=red!5, colframe=red!35!black, fonttitle=\bfseries, boxsep=1pt, left=1.5mm, right=1.5mm, top=2mm, bottom=1mm, }% options {theorem}% prefix \newtcbtheorem [ crefname={lemma.}{lemmas.}]% init options {lemma}% name {Lemma}% title {% fontupper=\small, colback=blue!3, colframe=blue!35!black, fonttitle=\bfseries, boxsep=1pt, left=1.5mm, right=1.5mm, top=2mm, bottom=1mm, }% options {lemma}% prefix \newtcbtheorem [ crefname={prop.}{props.}]% init options {proposition}% name {Proposition}% title {% breakable, enhanced, fontupper=\small, colback=red!5, colframe=red!35!black, fonttitle=\bfseries, boxsep=1pt, left=1.5mm, right=1.5mm, top=2mm, bottom=1mm, }% options {proposition}% prefix \newtcbtheorem [ crefname={cor.}{corrs.}]% init options {corollary}% name {Corollary}% title {% breakable, enhanced, fontupper=\small, colback=red!5, colframe=red!35!black, fonttitle=\bfseries, boxsep=1pt, left=1.5mm, right=1.5mm, top=2mm, bottom=1mm, }% options {corollary}% prefix
\begin{figure*}[t!] \centering \begin{minipage}{0.34\linewidth} \includegraphics[width=\linewidth]{toy_figures/teaser/teaser_manifold_0.png} \end{minipage} \begin{minipage}{0.05\linewidth} \hspace{-0.4cm}{\Large $\underset{\rightarrow}{f_{\vtheta}}$} \end{minipage} \begin{minipage}{0.6\linewidth} \hspace{-0.3cm}\includegraphics[width=\linewidth]{toy_figures/teaser/teaser_manifold_1.png} \end{minipage} \caption{ \textbf{Sketched Isotropic Gaussian Regularization (SIGReg):}~Given some arbitrary input data with density $p_{x}$ with support that may or may not lie on a manifold ({\bf left}), a Deep network (DN) encoder ($f_{\vtheta}$) produces embeddings $\vz=f_{\vtheta}(\vx)$ with some distribution $\vz \sim p_{z}$ ({\bf middle}). Our proposed Backward Cramér-Wold Statistics (\cref{sec:bcs}) objective pushes $p_z$ to match a target distribution $p_t$ by projecting the embeddings along $1d$ directions ({\bf middle, arrows}) and enforcing that the univariate densities ({\bf right, colored lines}) match the distribution of $p_t$, projected along the same directions. Any popular statistical test (provided in \cref{sec:tests}) can assess the goodness-of-fit--in practice we argue for characteristic function tests (\cref{sec:CF_better}). By using SIGReg with $p_t$ isotropic Gaussian ({\bf right, black lines}), we introduce a lean and provably optimal (\cref{sec:gaussian}) JEPA, coined LeJEPA, free of numerous heuristics and able to produce competitive performances (\cref{sec:lejepa,sec:experiments}).} \label{fig:bcs_teaser} \end{figure*}
\section{Introduction} \label{sec:intro}
Learning manipulable representations of the world and its dynamics is a long‑standing question in AI, with roots dating back centuries \citep{von1867handbuch,tolman1948cognitive,gregory1980perceptions,sutton1991dyna,friston2010free}. Across domains, e.g., image recognition, robotics, physics, space exploration, the unifying question is {\em how to learn an organized and actionable high‑dimensional embedding space from observations?} Using Deep Networks--parameterized nonlinear operators $f_{\vtheta}$--to map observations to embeddings is a standard first piece of that puzzle \citep{lecun2015deep,goodfellow2016deep}. The second, less standardized, piece of that puzzle is {\em how to train $f_{\vtheta}$}. Joint-Embedding Predictive Architectures (JEPAs) suggest training $f_{\vtheta}$ by maximizing predictive agreement between the embeddings of semantically related {\em views} \citep{bromley1993signature,lecun2022path,balestriero2023cookbook}. Views can come in two forms: transformations or corruptions. They can involve masking, cropping, blurring, temporal or spatial translations, geometric or photometric transformations, viewpoint changes, views from different sensor modalities, etc. The supervised forms involve human-produced components such as image-caption pairs, text-code pairs, etc.\ \citep{tian2020makes}. In any case, views are expected to share some degree of semantic relationship to allow the prediction task to align $f_{\vtheta}$'s embeddings towards the underlying knowledge present in the data.
Alas, JEPA's prediction task admits failure modes, such as representation collapse, where $f_{\vtheta}$ maps all inputs to nearly identical embeddings ({\em complete collapse}) or to a low-dimensional subspace ({\em dimensional collapse}) \citep{jing2021understanding,cosentino2022toward,balestriero2022contrastive}. To mitigate such shortcut solutions, state‑of‑the‑art recipes rely on heuristics--stop‑gradient \citep{chen2020simple}, asymmetric view generation \citep{wang2022importance}, teacher–student networks with carefully tuned EMA schedules \citep{caron2021emerging,tian2021understanding}, explicit normalization and whitening layers \citep{ermolov2021whitening,chen2021empirical}--and a delicate balance of hyperparameters. As a result, today's JEPA training is brittle and most research has shifted toward scaling data \citep{vo2024automatic}, models \citep{fan2025scaling} and even post-training \citep{rodas2025diet} while leaving the theoretical foundations of JEPAs largely unexplored.
Our study proposes to break that cycle by questioning some of the fundamental design principles underpinning JEPAs. That introspection will start by asking {\em what are the necessary conditions that JEPAs should abide by?} Those minimal conditions will then act as {\em axioms} for us to design a novel and lean JEPA. We identify two axioms: (i) solving the prediction task while (ii) enforcing an isotropic Gaussian distribution of the embeddings (\cref{sec:gaussian}). While (i) follows standard practice \citep{balestriero2022contrastive}, we introduce in \cref{sec:bcs} a novel distribution matching objective--Sketched Isotropic Gaussian Regularization (SIGReg)--to enforce (ii). The use of SIGReg not only removes the need for the numerous heuristics previously employed to prevent representation collapse, but SIGReg also exhibits favorable scaling properties as its {\em memory and computational complexity is linear in dimension and sample size}. Crucially, SIGReg's isotropic Gaussian enforcement solves the collapsed shortcut solution and provably minimizes the model's expected risk over the space of downstream tasks to be encountered post-training. The resulting JEPA solution--coined Latent-Euclidean JEPA (LeJEPA)--is introduced in \cref{sec:lejepa}. Beyond theoretical optimality, LeJEPA offers numerous benefits such as (i) provable statistical guarantees, (ii) removal of heuristics such as teacher-student networks, (iii) linear memory and computational complexity, and most importantly (iv) a unified design with a single trade-off parameter that works out of the box across datasets, architectures and scales (see \cref{sec:experiments}). We summarize our contributions below.
{\bf Contribution 1: We prove the optimal embedding distribution for foundation models.}~We establish that the isotropic Gaussian uniquely minimizes downstream prediction risk across broad task families. In \cref{sec:gaussian}, we derive this result rigorously for both linear (\cref{sec:linear_probing}) and nonlinear probes (\cref{sec:nonlinear_probing}), providing the first principled answer to what distribution $f_{\vtheta}$'s embeddings should follow. This theoretical result transforms JEPA design from heuristic exploration to targeted optimization. {\bf Contribution 2: We introduce SIGReg, a distribution matching objective that uniquely combines provable correctness with computational efficiency at scale.}~We present {\em Sketched Isotropic Gaussian Regularization} (SIGReg), a novel objective that enforces distributional alignment via random projections and characteristic-function matching (\cref{sec:bcs,fig:bcs_teaser}). SIGReg provides statistical guarantees (\cref{sec:general_test,sec:tests}) while achieving linear complexity and bounded gradients—a combination that existing distribution matching methods do not offer. Critically, its projection-based construction defeats the curse of dimensionality (\cref{sec:dimension}), making it both theoretically sound and practically efficient for high-dimensional embeddings.
{\bf Contribution 3: We design LeJEPA, a statistically optimal JEPA that eliminates collapse by construction.}~By combining JEPA's predictive objective with SIGReg targeting the isotropic Gaussian, we introduce {\em LeJEPA}—Latent-Euclidean JEPA (\cref{sec:lejepa}). LeJEPA requires only a single hyperparameter, eliminates representational collapse without stop-gradients or teacher-student architectures, and transfers across architectures and datasets without hyperparameter tuning. This demonstrates that principled theory directly yields practical simplicity.
{\bf Contribution 4: We validate LeJEPA at scale across diverse architectures and establish in-domain pretraining as viable.}~Our experiments (\cref{sec:experiments}) span ViTs, ConvNeXts, ResNets, MaxViTs, and Swin Transformers at scales approaching 1 billion parameters, where LeJEPA matches or exceeds state-of-the-art methods while maintaining training simplicity and robustness. Critically, on domain-specific datasets (Galaxy10, Food101), LeJEPA outperforms DINOv2-based transfer learning when pretrained directly on target data. This challenges the transfer learning paradigm and demonstrates that principled SSL can unlock effective in-domain pretraining—previously considered impractical for small datasets. \section{Background and Notations} \label{sec:background}
We start by introducing some of the notations we will be using throughout our manuscript (\cref{sec:notations}), followed by a review of JEPAs (\cref{sec:JEPA}), and existing literature studying their design (\cref{sec:mi}).
\subsection{Notations and Definitions} \label{sec:notations}
\begin{figure*}[t!] \begin{definition}{JEPA}{} \begin{align} {\rm JEPA}(\vx) \iff& {\rm Enc}\left(\vx_{n,t+1,.}\right) \text { is predictable from }{\rm Enc}\left(\vx_{n,t,.}\right), \forall n,t\text{ and } {\rm Enc}\left(\vx_{.,.,.}\right) \text{ is not degenerate}.\label{def:SSL} \end{align} \begin{minipage}{0.32\linewidth} \includegraphics[width=\linewidth]{toy_figures/teaser/teaser_JEPA_0.png} \end{minipage} \begin{minipage}{0.32\linewidth} \includegraphics[width=\linewidth]{toy_figures/teaser/teaser_JEPA_1.png} \end{minipage} \begin{minipage}{0.32\linewidth} \includegraphics[width=\linewidth]{toy_figures/teaser/teaser_JEPA_2.png} \end{minipage} \end{definition} \end{figure*}
{\bf Data.}~We are in possession of a dataset of shape $(N, V, D) \in {\mathbb{N}^*}^3$ where $N$ is the number of samples, $V$ is the number of views, and $D$ is the dimension. One entry of this dataset is accessed via $\vx_{n,v,d}$. Those dimensions are often interpreted as follows: ({\bf N}) is the number of independent samples, e.g., different images or different videos, ({\bf V}) is the number of {\em views}, e.g., data-augmentations for images, frames for videos, and ({\bf D}) is the dimension of each $\vx_{n,v}$, e.g., number of RGB pixels for images. In many cases the ordering over $V$ is given by {\em time}--but in some cases, e.g., data-augmentation of an image, ordering becomes irrelevant. Our study does not require any particular choice to organize one's dataset into a $(N,V,D)$ tensor--{\em and none of our theory and implementation assumes a particular design decision for that tensor}. However, we will rely on the following two properties, ({\em independence}) the samples $\vx_{n},\vx_{n'}$ have been obtained independently from each other $\forall n\not = n'$, and ({\em identically distributed}) the sampling process was identical among $\vx_{n}, \forall n$.
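As a toy illustration of this $(N,V,D)$ organization, the snippet below builds such a tensor with NumPy. The additive-noise corruption and all shapes are our illustrative assumptions (a stand-in for masking, cropping, etc.), not part of the paper's recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: N independent samples, each of raw dimension D.
N, V, D = 8, 4, 32
x = rng.normal(size=(N, D))

def make_view(sample, noise_scale=0.1):
    # A toy "view": additive Gaussian corruption of the sample.
    return sample + noise_scale * rng.normal(size=sample.shape)

# Stack V views per sample into the (N, V, D) tensor described in the text.
views = np.stack([[make_view(x[n]) for _ in range(V)] for n in range(N)], axis=0)
assert views.shape == (N, V, D)
```

Note that nothing in this layout imposes an ordering over $V$; for video data the view axis would simply index time.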
{\bf Deep Networks.}~Today's AI solutions rely on {\em Deep (Neural) Networks} (DNs), which are compositions of a large number of parameterized linear and nonlinear operators. We denote the DN's mapping as $f_\vtheta: \R^D \rightarrow \R^K$ with $K$ the dimension of the embedding space. The internals of $f_\vtheta$ are designed by the researcher to incorporate as much prior knowledge about the data as possible. The details of $f_\vtheta$ are irrelevant to our study--as we will see the proposed LeJEPA works out-of-the-box on any $f_\vtheta$. In any case, all the {\em learnable parameters} are gathered in the vector $\vtheta \in \R^P$, with $P$ counting the total number of parameters. A central challenge in AI research is to design the right architecture and training objective so that $\vtheta$ can be learned from gradient descent to ultimately produce a useful system, or foundation model, $f_{\vtheta}$.
{\bf JEPAs.}~A foundation model is any system, e.g., a DN, able to solve numerous downstream tasks without requiring any change in its internal parameters $\vtheta$. This is in sharp contrast with a supervised model that only considers its training task. JEPAs have formally been introduced by \citet{lecun2022path} as a vehicle to produce foundation models. The core building blocks of JEPAs rely on numerous well-established techniques such as siamese networks \citep{bromley1993signature} and predictive coding \citep{helmholtz1867handbook,bruner1949perception}. While the exact blueprint of JEPAs varies greatly between use-cases, they all rely on two core principles: (i) being able to predict the embedding of a view $\vx_{n,v}$ from the embedding of another view $\vx_{n,v'}, v' \not = v$, all while (ii) ensuring that the embeddings do not become degenerate. Concretely, once a JEPA is designed and trained, it should be able to solve numerous downstream tasks in zero or few shots. The JEPA objective function, along with some examples for $\vx$, is provided in \cref{def:SSL}. The {\em predictability} criterion can be enforced by directly comparing the embeddings of the partial views $Enc(\vx_{n,v,.})$ and $Enc(\vx_{n,v',.})$ with a metric, e.g., $\ell_p$. In some cases, an additional DN, coined {\em Pred}, is employed to compare $Pred(Enc(\vx_{n,v,.}))$ against $Enc(\vx_{n,v',.})$--which is only justified when there exists an asymmetry between the information content of the different views, e.g., by conditioning the predictions on observed actions from robotics data \citep{khazatsky2024droid}.
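A minimal sketch of the predictability criterion, assuming a toy linear-tanh encoder standing in for ${\rm Enc}$ and an optional linear map standing in for the {\em Pred} network; both are hypothetical stand-ins, not the architectures used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def enc(x, W):
    # Toy nonlinear encoder standing in for f_theta / Enc.
    return np.tanh(x @ W)

def jepa_prediction_loss(x_v, x_vp, W, W_pred=None):
    """L2 predictability term between the embeddings of two views.

    If W_pred is given, a linear predictor maps Enc(x_v) before comparing
    it to Enc(x_vp), mirroring the optional Pred network of the text.
    """
    z_v, z_vp = enc(x_v, W), enc(x_vp, W)
    pred = z_v if W_pred is None else z_v @ W_pred
    return np.mean(np.sum((pred - z_vp) ** 2, axis=-1))

N, D, K = 16, 32, 8
W = rng.normal(size=(D, K)) / np.sqrt(D)
x = rng.normal(size=(N, D))
x_view = x + 0.05 * rng.normal(size=x.shape)  # a second, corrupted view
loss = jepa_prediction_loss(x, x_view, W)
assert loss >= 0.0
```

Minimizing this term alone admits the collapsed solution discussed next, which is precisely why an anti-collapse criterion is needed.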
\subsection{The Need for Reliable Pretraining} \label{sec:JEPA}
The JEPA's prediction task is designed based on a priori knowledge of the data. Its design is often quite natural since it is relatively intuitive to form $\vx$ so that its views share the relevant information content one hopes to capture. On the other hand, the design of the ``anti-collapse'' criterion is much closer to a game of Whac-A-Mole. Today's designs rely on many different under-specified safeguards which are carefully combined in the hope that degenerate shortcut solutions are avoided during training. Such mechanisms include (i) feature whitening \citep{ermolov2021whitening,bardes2021vicreg}, (ii) negative samples \citep{chen2020simple,he2020momentum}, and (iii) asymmetric views and teacher-student networks with stop-gradient \citep{caron2021emerging,assran2023self}. Those mechanisms all suffer from at least two of the following limitations: (i) under-specification, i.e., the criteria can be minimized while embeddings are in a degenerate configuration, (ii) quadratic time and memory complexity with mini-batch size and/or embedding dimension, (iii) sensitivity to data distribution, hyperparameters, architecture, and (iv) lack of theoretical understanding and guarantees.
\subsection{The Need for Actionable Theory} \label{sec:mi}
For decades, the two major solutions for AI were supervised learning \citep{lecun2015deep} and learning by reconstruction \citep{rumelhart1986learning}--sometimes combined, e.g., for semi-supervised learning \citep{kingma2014semi}. In supervised learning, the labels both ensure that semantically similar samples are close to each other in embedding space while preventing complete representation collapse. In particular, it is possible to measure the amount of collapse in supervised learning as a function of the number of classes \citep{papyan2020prevalence}. The reconstruction objective is similarly well suited to prevent representation collapse as the original input must be recovered from the embeddings, i.e., the embeddings must be as informative about the input as possible--up to some optional denoising tasks that users can setup as part of the training \citep{vincent2010stacked}.
Because supervised and reconstruction-based learning have been widely studied for decades, there exists a large body of work to explain and inform practical designs--as well as studying their limitations in producing foundation models \citep{balestriero2024learning,van2025joint}. This is not the case for the more recent JEPAs where empirical advances quickly outpace anyone hoping to delve into their inner workings. This dynamic led the community to focus on post-hoc theoretical justification of already found solutions \citep{liu2021self,shwartz2024compress,shwartz2022we,zhang2023matrix}. In most cases, those studies involve the {\em Mutual Information (MI)} \citep{shannon1948mathematical,cover1999elements} whose different bounds recover established methods \citep{gutmann2010noise,ma2018noise,oord2018representation,poole2019variational,hjelm2018learning,mcallester2020formal}. Because existing studies focus on explaining and interpreting already developed JEPAs, too little principled guidance and innovation has been brought forward. Instead, most of the recent empirical advances take the form of collecting larger datasets, scaling up pre-existing training recipes \citep{goyal2019scaling,chen2020big,oquab2023dinov2,fan2025scaling}, and deriving novel data curation processes \citep{vo2024automatic,kerdreux2025efficient}.
In contrast, our goal in the following \cref{sec:gaussian,sec:bcs,sec:lejepa} will be to derive a novel JEPA solution from first principles, i.e., whose design relies on proved necessary conditions for optimality, and with a pretraining recipe that can finally reconcile exploratory research, scalability, and state-of-the-art performances.
\section{Latent Euclidean: Embeddings Should be Isotropic Gaussian} \label{sec:gaussian}
We address a fundamental question: {\em which distribution should ${\rm Enc}(\vx)$ follow to minimize empirical risk on any downstream task?} We prove that the isotropic Gaussian is the unique optimal distribution for both linear (\cref{sec:linear_probing}) and nonlinear probing (\cref{sec:nonlinear_probing}), with geometric intuition provided in \cref{sec:le}. This theoretical result establishes the necessary design principle for our JEPA; \cref{sec:bcs} then provides the practical implementation to achieve it.
\subsection{Linear Probing} \label{sec:linear_probing}
We begin by identifying the optimal distribution for $f_\vtheta$'s embeddings by analyzing linear probes--one of the most popular methods for frozen encoder evaluation. Specifically, we ask: \emph{which distribution for $f_\vtheta(\vx)$ would be most favorable for solving arbitrary downstream tasks, i.e., for any realization of targets $\vy$?}
Denote as $\mZ \in \mathbb{R}^{N \times K}$ the matrix of $N$ embeddings, each $K$-dimensional, from $f_{\vtheta}(\vx_n)$. The {\em unknown} corresponding labels are denoted as $\vy \in \mathbb{R}^{N}$. Without loss of generality, we consider univariate targets; the following analysis extends to multivariate targets. The linear probe minimizes the following least square problem \citep{bishop2006pattern} \begin{equation} \hat{\beta} = \underset{\beta \in \mathbb{R}^K}{\arg\min} \|\vy - \mZ\beta\|_2^2+\lambda \|\beta\|_2^2,\tag{OLS}\label{eq:OLS} \end{equation} where $\hat{\beta}$ is the optimal probe parameters, and $\lambda \geq 0$ is a hyperparameter controlling the Tikhonov regularizer strength \citep{bishop1995training,golub1999tikhonov}. Despite not knowing $\vy$, it is possible to describe the bias and variance of the estimator $\hat{\beta}$ as a function of the distribution of $\mZ$. Consider two embeddings with identical column spans $\mZ_{\rm aniso}, \mZ_{\rm iso}$. $\mZ_{\rm aniso}$'s covariance matrix eigenvalues are given by $\{\lambda_k\}_{k=1}^K$ with at least two distinct values, while $\mZ_{\rm iso}$'s covariance matrix eigenvalues are all equal to $\frac{1}{K}\sum_{k=1}^{K}\lambda_k$. Hence, the two candidate embeddings $\mZ_{\rm aniso}, \mZ_{\rm iso}$ capture the same intrinsic features and have the same energy, but different geometries.
\begin{lemma}[label={thm:linear_probe_bias}]{Anisotropy amplifies bias}{} Whenever $\lambda_K>\lambda_1$, there always exists a downstream task ($\vy$) for which $\mZ_{\rm aniso}$ produces a higher bias estimator than $\mZ_{\rm iso}$ for $\lambda>0$. (Proof in \cref{proof:linear_probe_bias}.) \end{lemma} \begin{lemma}[label={thm:linear_probe_variance}]{Anisotropy amplifies variance}{} With $\lambda=0$, the total variance of $\hat{\beta}$ \eqref{eq:OLS} is minimized for $\mZ_{\rm iso}$ with $\text{tr}(\text{Var}(\hat{\boldsymbol{\beta}}_{\text{aniso}})) > \text{tr}(\text{Var}(\hat{\boldsymbol{\beta}}_{\text{iso}}))$. (Proof in \cref{proof:linear_probe_variance}.) \end{lemma}
From the above \cref{thm:linear_probe_variance,thm:linear_probe_bias} we obtain that the distribution of features must be isotropic. We now move to nonlinear probing where the standard Gaussian will emerge as the unique optimum.
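The variance statement can be checked numerically with a small Monte-Carlo sketch; the eigenvalue choices, noise level, and diagonal-covariance construction below are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, noise = 200, 5, 1.0
beta_true = rng.normal(size=K)

def probe_variance(cov_eigs, trials=500):
    # Monte-Carlo estimate of tr(Var(beta_hat)) for the OLS probe (lambda = 0)
    # fit on embeddings with diagonal covariance given by cov_eigs.
    Z = rng.normal(size=(N, K)) * np.sqrt(cov_eigs)
    betas = [
        np.linalg.lstsq(Z, Z @ beta_true + noise * rng.normal(size=N), rcond=None)[0]
        for _ in range(trials)
    ]
    return np.var(np.array(betas), axis=0).sum()

eigs_aniso = np.array([4.0, 0.25, 0.25, 0.25, 0.25])  # at least two distinct eigenvalues
eigs_iso = np.full(K, eigs_aniso.mean())              # same trace, i.e., same energy
v_aniso, v_iso = probe_variance(eigs_aniso), probe_variance(eigs_iso)
assert v_iso < v_aniso  # the isotropic embedding yields the lower-variance probe
```

This mirrors the closed form $\mathrm{tr}(\mathrm{Var}(\hat{\beta})) \approx \sigma^2 \mathrm{tr}((\mZ^\top\mZ)^{-1})$, which is minimized when all eigenvalues are equal.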
\subsection{Nonlinear Probing} \label{sec:nonlinear_probing}
To allow for more flexible evaluation of the pretrained encoder $f_{\vtheta}$, it has become increasingly common to work with a nonlinear probe. We analyze two widely-used nonlinear methods: radius-based k-NN \citep{taunk2019brief,sun2010adaptive,zhang2017efficient,abu2019effects} for its simplicity and kernel methods \citep{nadaraya1964estimating,watson1964smooth} for their theoretical tractability.
As in \cref{sec:linear_probing}, we ask ourselves which distribution of embeddings would be preferable for a foundation model. We first define our prediction function. The training data consists of the $N$ embeddings along with their training labels $\{(\vz_n,\vy_{n})\}_{n=1}^{N}$. The prediction, using radius-based k-NN for a query vector $\vq$ is formed as \begin{align} \widehat{\vy}(\vq) := \frac{1}{|\mathcal{N}_{r_0}(\vq)|}\sum_{n \in \mathcal{N}_{r_0}(\vq)}\vy_n, \tag{kNN}\label{eq:kNN} \end{align} where $\mathcal{N}_{r_0}(\vq) = \{n : \|\vz_n - \vq\| \le r_0\}$. The specific choice of radius $r_0$ controls how many neighbors' predictions are averaged to form the query's prediction. The kernel's prediction at a query $\vq\in\mathbb{R}^K$ is given by \begin{align} \widehat \vy(\vq)\triangleq \frac{\sum_{n=1}^N K_h(\vq-\vz_n)\vy_n}{\sum_{n=1}^N K_h(\vq-\vz_n)}.\tag{Kernel}\label{eq:NW} \end{align}
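Both predictors above fit in a few lines of NumPy; the Gaussian kernel choice for $K_h$ and the toy regression data are our illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def radius_knn_predict(q, Z, y, r0):
    """Radius-based k-NN: average the labels of all samples within r0 of q."""
    mask = np.linalg.norm(Z - q, axis=1) <= r0
    if not mask.any():
        return np.nan  # no neighbor inside the ball
    return y[mask].mean()

def nw_kernel_predict(q, Z, y, h):
    """Nadaraya-Watson prediction with a Gaussian kernel of bandwidth h."""
    d2 = np.sum((Z - q) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / h**2)
    return np.sum(w * y) / np.sum(w)

# Toy regression task on 2D embeddings.
Z = rng.normal(size=(500, 2))
y = np.sin(Z[:, 0]) + 0.1 * rng.normal(size=500)
q = np.zeros(2)
knn_hat = radius_knn_predict(q, Z, y, r0=0.5)
nw_hat = nw_kernel_predict(q, Z, y, h=0.3)
assert abs(knn_hat) < 0.5 and abs(nw_hat) < 0.5  # both near sin(0) = 0
```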
We search over all distributions of $\mZ$ subject to a fixed total variance constraint, e.g., $\Tr(\Cov(\mZ)) = \kappa_1$ or $\|\Cov(\mZ)\|_F=\kappa_2$. The specific value of $\kappa$ does not affect the optimal distribution shape. Following the same type of derivations as done in the linear regime--with the exception of some additional regularity conditions--we are able to precisely identify the isotropic Gaussian as the unique optimum to minimize bias as formalized below.
\begin{theorem}[label={thm:nonlinear_optimal}]{isotropic Gaussian Optimality}{} The integrated square bias (ISB) over query points is given by \begin{align*} \text{ISB}_{k\text{-NN}} &=\frac{r_0^4}{(K+2)^2}\tau_g^2J(p)+O(r_0^4),&&\text{(k-NN)}\\ \text{ISB}_{\text{kernel}} &\le \Big(\frac{h^2\mu_2(K)}{2}\Big)^2 \Big(2 B^2 + 8 L^2J(p)\Big)+o(h^4),&&\text{(kernel)} \end{align*} and among distributions with a scalar-based covariance constraint, the isotropic Gaussian is the unique minimizer of the integrated square bias. (Proof in \cref{proof:knn_optimal,proof:kernel_optimal}.) \end{theorem}
Numerous additional details and discussions on the regularity assumptions we employed are provided in \cref{sec:additional_nonlinear}. Together, these results establish the isotropic Gaussian distribution as the optimal design to minimize the worst-case risk of a foundation model across downstream tasks.
\newcommand{\FancyAppendixTitle}[1]{%
  \begin{center}
    \begin{tikzpicture} \draw[line width=2pt] (0,0) -- (\textwidth,0); \end{tikzpicture}\\[0.5em]
    \begin{tikzpicture} \draw[line width=0.5pt] (0,0) -- (\textwidth,0); \end{tikzpicture}\\[2.5em]
    {\fontsize{22pt}{26pt}\selectfont\bfseries #1}\\[1.2em]
    {\fontsize{16pt}{20pt}\selectfont\bfseries Appendix}\\[2.5em]
    \begin{tikzpicture} \draw[line width=0.5pt] (0,0) -- (\textwidth,0); \end{tikzpicture}\\[0.5em]
    \begin{tikzpicture} \draw[line width=2pt] (0,0) -- (\textwidth,0); \end{tikzpicture}
  \end{center}
  \vspace{2cm}%
}
\FancyAppendixTitle{LeJEPA}
\section{Additional Details on Nonlinear Probing} \label{sec:additional_nonlinear}
\subsection{kNN Probing}
To allow for more flexible evaluation of the pretrained encoder $f_{\vtheta}$, it is standard to work with a $k$-NN prober \citep{taunk2019brief}, both for regression and classification. We rely on the radial $k$-NN variation that leverages a sample-dependent $k$--improving performance for non-uniform distributions of samples \citep{sun2010adaptive,zhang2017efficient,abu2019effects}.
We denote the underlying embedding density as $p_{z}\in C^3$ with derivatives of order up to $3$ bounded, and finite Fisher information and covariance. This regularity condition is fulfilled by current encoders. The {\em unknown} labels come from the target function $\eta:\R^K\to\R$, assumed $C^2$. We handle classification tasks by setting $\eta(\vz)=\mathbb{P}(Y=1\mid \vz)$. The training data consists of the $N$ embeddings along with their training labels $\{(\vz_n,\eta(\vz_n))\}_{n=1}^{N}$, where we will denote $\vy_{n}\triangleq \eta(\vz_n)$. The prediction for a query vector $\vq$ is formed as \begin{align} \widehat{\vy}(\vq) := \frac{1}{N_{r_0}(\vq)}\sum_{n:\norm{\vz_{n}-\vq}\le r_0}\vy_n,\tag{kNN}\label{eq:kNN_radial} \end{align} with $N_{r_0}(\vq)\triangleq\#\{n:\norm{\vz_{n}-\vq}\le r_0\}$ counting the number of samples within an $r_0$-radius ball around $\vq$. The radius $r_0$ controls how many neighbors' predictions are averaged to form the query's prediction. As per the linear probing's \cref{thm:linear_probe_bias}, we can characterize the bias of the estimator \eqref{eq:kNN_radial} at a particular query point, as formalized below.
\begin{lemma}[label={thm:knn_bias}]{k-NN Pointwise Bias}{} The \eqref{eq:kNN} estimator has bias at query $\vq$ given by \begin{multline*} \mathrm{Bias}(\vq)= \frac{r_0^2}{d+2}\Big(\grad\eta(\vq)^\top\grad\log p_{z}(\vq)+\tfrac{1}{2}\Delta\eta(\vq)\Big)+o(r_0^2), \end{multline*} where the remainder $o(r_0^2)$ is uniform in $\vq$. (Proof in \cref{proof:knn_bias}.) \end{lemma}
To obtain the integrated bias, i.e., over the distribution of query points, we consider the following two properties. First, the distribution of query points follows the training distribution, i.e., $\vq \sim p_{z}$; second, the target function $\eta$ has a gradient which is mean-zero and isotropic with $\E\big[\grad\eta(\vz)\grad\eta(\vz)^\top\big]=\tau_g^2I_d$ with $\tau_g^2\in(0,\infty)$ uniformly in $\vz$. We also impose a finite scalar constraint on the covariance of the embeddings, such as $\Tr(\Sigma)=c$ or $\|\Sigma\|_F=c$ for a finite constant $c$.
\begin{theorem}[label={thm:knn_optimal}]{k-NN isotropic Gaussian Optimality}{} The integrated squared bias of \eqref{eq:kNN} satisfies \[ \E_{\vz}\big[\mathrm{Bias}(\vz)^2\big] = \frac{r_0^4}{(K+2)^2}\tau_g^2J(p)+O(r_0^4), \] and among densities satisfying the scalar covariance constraint, the isotropic Gaussian is the unique minimizer. (Proof in \cref{proof:knn_optimal}.) \end{theorem}
$$ \E\big[\widehat{\eta}(x)\big] = \frac{\int_{\Ball(0,r_0)} \eta(x+z)p(x+z)dz}{\int_{\Ball(0,r_0)} p(x+z)dz} \quad\text{to second order in }r_0, $$
using the ball moments
$$ \int_{\Ball(0,r)} z\,dz=0,\qquad \int_{\Ball(0,r)} zz^\top dz=\frac{V_d r^{d+2}}{d+2}I_d,\qquad \int_{\Ball(0,r)} \norm{z}^2dz=\frac{d\,V_d r^{d+2}}{d+2}, $$
with $V_d$ the volume of the unit ball in $\R^d$. This yields
\begin{align*} \mathbb{E}\big[\widehat{\eta}(x)\big]-\eta(x) &= \frac{r_0^2}{d+2}\left(\frac{\nabla\eta\cdot\nabla p}{p} + \frac{1}{2}\Delta\eta\right)\Big(1-\alpha r_0^2+O(r_0^3)\Big) + O(r_0^3)\\ &= \frac{r_0^2}{d+2}\Big(\nabla\eta(x)\cdot\nabla\log p(x) + \tfrac{1}{2}\Delta\eta(x)\Big) + o(r_0^2), \end{align*}
uniformly on $\mathcal{K}$, while the cross term contributes
$$ \left(\frac{r_0^2}{d+2}\right)^2\cdot 2\mathbb{E}[A(X)C(X)] = O(r_0^4). $$
The Fisher information functional is defined as
$$ J(p) := \int_{\mathbb{R}^d}\norm{\nabla\log p(x)}^2 p(x)dx, $$
and satisfies
$$ J(p) \ge \mathrm{tr}(\Sigma^{-1}), $$
since for the location family $p_\theta(x)=p(x-\theta)$,
$$ \mathcal{I}(\theta) = \mathbb{E}\big[\nabla_\theta\log p_\theta(X)\nabla_\theta\log p_\theta(X)^\top\big] = \mathbb{E}\big[\nabla\log p(X)\nabla\log p(X)^\top\big], $$
with
$$ \mathrm{tr}(\Sigma^{-1})=\sum_{i=1}^d \frac{1}{\lambda_i}. $$
By Cauchy--Schwarz,
$$ \left(\sum_{i=1}^d \frac{1}{\lambda_i}\right)\left(\sum_{i=1}^d \lambda_i\right) \ge \left(\sum_{i=1}^d 1\right)^2=d^2, $$
hence
$$ \min_{\Sigma\succ 0:\ \mathrm{tr}(\Sigma)=t}\ \mathrm{tr}(\Sigma^{-1}) = \frac{d^2}{t}, \quad\text{attained at}\quad \Sigma=\frac{t}{d}I_d. $$
The minimal values under each constraint are
$$ \begin{array}{ll} \text{trace } t: & J_{\min}=\dfrac{d^2}{t},\quad s=\dfrac{t}{d},\\[1.2ex] \text{determinant } \delta: & J_{\min}=d\delta^{-1/d},\quad s=\delta^{1/d},\\[1.2ex] \text{Frobenius } c: & J_{\min}=\dfrac{d^{3/2}}{c},\quad s=\dfrac{c}{\sqrt{d}},\\[1.2ex] \text{spectral radius } r: & J_{\min}=\dfrac{d}{r},\quad s=r. \end{array} $$
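The trace-constrained bound $\mathrm{tr}(\Sigma^{-1}) \ge d^2/t$, with equality at the isotropic choice $\Sigma = (t/d)I_d$, can be verified numerically; the random SPD construction below is our illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 6, 3.0

def random_spd_with_trace(d, t):
    # Random symmetric positive-definite matrix, rescaled to trace t.
    A = rng.normal(size=(d, d))
    S = A @ A.T + 0.1 * np.eye(d)
    return S * (t / np.trace(S))

# tr(Sigma^{-1}) >= d^2 / t holds for every SPD Sigma with tr(Sigma) = t ...
for _ in range(100):
    S = random_spd_with_trace(d, t)
    assert np.trace(np.linalg.inv(S)) >= d**2 / t - 1e-9

# ... with equality at the isotropic Sigma = (t/d) I_d.
S_iso = (t / d) * np.eye(d)
assert np.isclose(np.trace(np.linalg.inv(S_iso)), d**2 / t)
```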
$$ \frac{a_0+h^2 a_2+o(h^2)}{b_0+h^2 b_2+o(h^2)} =\frac{a_0}{b_0} +h^2\frac{a_2 b_0-a_0 b_2}{b_0^2} +o(h^2), $$
$$ \mathrm{Var}[\widehat m(x)]\approx \frac{\mathrm{Var}[B_n(x)]}{(\mathbb{E}[A_n(x)])^2}. $$
$$ \mathbb{E}[A_n(x)]=n\big(p(x)+o(1)\big). $$
$$ \int (\Delta m)^2p \le \int B^2p = B^2. $$
$$ \varphi_X(t) = \mathbb{E}\big[e^{i\langle t,X\rangle}\big] = \mathbb{E}\big[e^{i s \langle u,X\rangle}\big] = \mathbb{E}\big[e^{i s \langle u,Y\rangle}\big] = \mathbb{E}\big[e^{i\langle t,Y\rangle}\big] = \varphi_Y(t). $$
$$ \psi_{n,t} = \mathbb{E}\big[e^{i\langle t,X_n\rangle}\big] \longrightarrow \mathbb{E}\big[e^{i\langle t,X\rangle}\big] = \psi_{t}, \qquad \text{for all } t \in \mathbb{R}^d. $$
$$ \inf_{a\in U}\lim_{n\to\infty}\Pr\big(T_{a,n} \ge u_n(\alpha)\big)=1. $$
$$ \Pr(\Psi_n=1) = \Pr\big(M_n \ge u_n(\alpha)\big) \ge \Pr\big( T_{a_n,n} \ge u_n(\alpha) \big) \longrightarrow 1. $$
$$ \|f - P_L f\|_{L^2(\mathbb{S}^d)} \leq (1 + L^2)^{-\alpha/2} \|f\|_{H^\alpha(\mathbb{S}^d)}, $$
$$ w_s(t)=e^{-s^2 t^2},\qquad s>0, $$
$$ \widehat{\varphi}_N(t)=\frac{1}{N}\sum_{i=1}^N e^{itX_i}, $$
$$ \Bigg|\frac{\partial \widehat{D}_V}{\partial X_i}\Bigg| \le \frac{4}{N}\int_{\mathbb{R}} w_s(t)|t|dt = \frac{4}{Ns^2}. $$
$$ \widehat{D}_k = (\bar{\phi}-\mu)^\top W(\bar{\phi}-\mu),\qquad \bar{\phi}:=\frac{1}{N}\sum_{i=1}^N \phi(X_i),\quad \phi(x)=(x,x^2,\dots,x^k)^\top, $$
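A weighted empirical-characteristic-function statistic of the kind analyzed here can be sketched in a few lines. The quadrature grid, bandwidth, and comparison distributions below are our illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

def ecf(x, t):
    """Empirical characteristic function of a 1D sample x at frequencies t."""
    return np.exp(1j * np.outer(t, x)).mean(axis=1)

def ep_statistic(x, s=1.0, grid=np.linspace(-5, 5, 201)):
    """Squared ECF distance to the N(0,1) characteristic function,
    weighted by w_s(t) = exp(-s^2 t^2) and integrated by quadrature."""
    phi_hat = ecf(x, grid)
    phi_gauss = np.exp(-0.5 * grid**2)  # CF of the standard normal
    w = np.exp(-(s**2) * grid**2)
    dt = grid[1] - grid[0]
    return np.sum(w * np.abs(phi_hat - phi_gauss) ** 2) * dt

x_gauss = rng.normal(size=4000)
x_exp = rng.exponential(size=4000) - 1.0  # non-Gaussian sample with zero mean
ep_g, ep_e = ep_statistic(x_gauss), ep_statistic(x_exp)
assert ep_g < ep_e  # the Gaussian sample sits much closer to the target CF
```

The bounded-derivative property above is what makes such statistics attractive as training objectives: each sample's gradient contribution shrinks as $O(1/N)$.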
$$ \big|\phi_n(t) - \psi(t)\big|^2 = \phi_n(t)\overline{\phi_n(t)} - \psi(t)\overline{\phi_n(t)} - \overline{\psi(t)}\phi_n(t) + \big|\psi(t)\big|^2, $$
$$ \mathbb{E}\left[ Z_j\overline{Z_l} \right] = \begin{cases} \mathbb{E}\left[ |Z_1|^2 \right] = 1, & \text{if } j=l,\\[4pt] \mathbb{E}[Z_j]\overline{\mathbb{E}[Z_l]} = \phi_\theta\overline{\phi_\theta} = |\phi_\theta|^2, & \text{if } j\neq l, \end{cases} $$
\begin{align*} \mathbb{E}[D_n(t)] &= \big( |\phi_\theta|^2 - 2\mathrm{Re}\big( \overline{\psi}\phi_\theta \big) + |\psi|^2 \big) + \frac{1-|\phi_\theta|^2}{n} \\ &= \big|\phi_\theta - \psi\big|^2 + \frac{1-|\phi_\theta|^2}{n}. \end{align*} Under dominated convergence, $\E[\nabla_\theta D_n(t)] = \nabla_\theta \E[D_n(t)]$, hence
$$ \E\left[\nabla_\theta D_n(t)\right] = \nabla_\theta \big|\phi_\theta(t)-\psi(t)\big|^2 + \nabla_\theta \frac{1-|\phi_\theta(t)|^2}{n}, $$
$$ L(\theta) \approx \sum_{k} \omega_k\big|\phi_\theta(t_k)-\psi(t_k)\big|^2, \quad \widehat{L}_n(\theta) \approx \sum_{k} \omega_k\big|\phi_n(t_k)-\psi(t_k)\big|^2, $$
\begin{theorem}[label={thm:kernel_optimal}]{Kernel isotropic Gaussian Optimality}{} The integrated squared bias of \eqref{eq:NW} satisfies \begin{multline*} \sup_{m\in\mathcal{M}(L,B)}\E_{\vz}\left[\mathrm{Bias}\big[\widehat \vy(\vz)\big]\right] \le \Big(\frac{h^2\mu_2(K)}{2}\Big)^2\\ \times \Big(2 B^2 + 8 L^2J(p)\Big)+o(h^4), \end{multline*} and the integrated variance is independent of $p$. Among all densities $p$ on $\R^d$ with total variance constrained, e.g., $\Tr(\Sigma)=c$, the isotropic Gaussian is the unique minimizer. (Proof in \cref{proof:kernel_optimal}.) \end{theorem}
\begin{theorem}[label={thm:bcs}]{Sufficiency of Directional Tests}{} \eqref{eq:T_max} is a valid statistical test for \eqref{eq:HUV} as \begin{align*} P=Q&\implies\limsup_{n\to\infty}\Pr\left(T_{\sA}(\{f_{\vtheta}(\vx_n)\}_{n=1}^N) \ge \tau_\alpha\right)\le \alpha,&&\textbf{(level)}\\ P\neq Q& \implies\limsup_{n\to\infty}\Pr\left(T_{\sA}(\{f_{\vtheta}(\vx_n)\}_{n=1}^N) \ge \tau_\alpha\right)=1.&&\textbf{(power)} \end{align*} (Proof in \cref{proof:bcs}.) \end{theorem}
\begin{theorem}[label={thm:moment_conendrum}]{Insufficiency of $K$ Moments}{} Minimizing the following objective with $c_k>0, \forall k$ \[ \sum_{k=1}^K c_k\left(m_k\left(P_{\vtheta}^{(\va)}\right)-m_k\left(Q^{(\va)}\right)\right)^2, \] for finite $K$ does not imply $P_{\vtheta}^{(\va)}=Q^{(\va)}$. (Proof in \cref{proof:moment_conendrum}.) \end{theorem}
\begin{theorem}[label={thm:ecf_stability}]{Stability of Epps--Pulley Test}{} \eqref{eq:epps_pulley} satisfies for samples $z_1,\dots,z_N$ \[ \left|\frac{\partial EP(\va)}{\partial z_i}\right| \le \frac{4\sigma^2}{N}, \quad \left|\frac{\partial^2 EP(\va)}{\partial z_i^2}\right| \le \frac{C\sqrt{\pi}\sigma^3}{2N}, \] with constant $C$, and bandwidth $\sigma$. (Proof in \cref{proof:ecf_stability}.) \end{theorem}
Theorem. [label={thm:spherical_bounds}]{Unified Error Bounds} Let $p_{\vtheta}\in H^\alpha(R^K)$, $\va \sim U(S^{K-1})$, and eq:epps_pulley$=0$, i.e., $P_\theta^{(\va)} = Q^{(\va)}$ for all $\va \in \sA$. Then
\begin{multline*}
E_{\va} \left[ \int_{R} \left| \varphi_{\va}(t) - \varphi_{N}(t) \right|^2 dt \right] \leq C(K, \alpha)\, |\sA|^{-2\alpha/(K-1)} \\
\times\int_0^\infty \left\| \varphi_{\cdot}(r) - \varphi_{N}(r) \right\|_{H^\alpha(S^{K-1})}^2 dr.
\end{multline*}
(Proof in proof:spherical_bounds.)
Theorem. [label={thm:gradient_bias}]{Vanishing Gradient Bias} The expectation of eq:epps_pulley satisfies
\[
E\left[L_N(\theta)\right] = L(\theta)+\frac{1}{N}\int_{R} w_s(t)\big(1-|\varphi_P(t)|^2\big)dt;
\]
therefore both the loss and its derivative have a bias of order $O(1/N)$. (Proof in proof:gradient_bias.)
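The $O(1/N)$ bias has a closed form when the data actually follow the target, so it can be checked empirically. A minimal NumPy sketch, assuming the Gaussian weight $w_s(t)=e^{-t^2}$ (i.e., $s=1$), target $G=\mathcal{N}(0,1)$, and a simple grid integration (all illustrative choices, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
N, reps = 100, 500
t = np.linspace(-5.0, 5.0, 101)
dt = t[1] - t[0]
w = np.exp(-t**2)                  # Gaussian weight w_s with s = 1
target = np.exp(-0.5 * t**2)       # characteristic function of N(0, 1)

# Draw `reps` datasets from P = G and evaluate the V-statistic on each.
z = rng.standard_normal((reps, N))
ecf = np.exp(1j * z[:, :, None] * t).mean(axis=1)          # empirical CFs
D_V = ((np.abs(ecf - target) ** 2) * w).sum(axis=1) * dt   # grid integral

# Under P = G the population distance is 0, so E[D_V] is pure bias:
# (1/N) * int w_s(t) (1 - |phi_P(t)|^2) dt = sqrt(pi) (1 - 1/sqrt(2)) / N.
predicted = np.sqrt(np.pi) * (1.0 - 1.0 / np.sqrt(2.0)) / N
assert abs(D_V.mean() - predicted) < 0.5 * predicted
```

The measured mean of the V-statistic matches the predicted $1/N$ bias, consistent with the theorem.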
Theorem. [label={thm:isotropic-optimal}]{Special case: Recovery of VCReg} Fix one of the following scalar covariance constraints for a mean-zero distribution $p$ on $R^d$:
\begin{itemize}
\item trace: $\tr(\mathrm{Cov}(X))=t$,
\item determinant: $\det(\mathrm{Cov}(X))=\delta$,
\item Frobenius norm: $\|\mathrm{Cov}(X)\|_F=c$,
\item spectral radius upper bound: $\rho(\mathrm{Cov}(X))\le r$.
\end{itemize}
Then the Fisher-information functional $J(p)$ is minimized over all such $p$ by the isotropic Gaussian $p_G=N(0,sI_d)$ with $s$ chosen to satisfy the constraint. The minimal values are:
\[
\begin{array}{ll}
\text{trace } t: & J_{\min}=\frac{d^2}{t},\quad s=\frac{t}{d},\\[1.2ex]
\text{determinant } \delta: & J_{\min}=d\,\delta^{-1/d},\quad s=\delta^{1/d},\\[1.2ex]
\text{Frobenius } c: & J_{\min}=\frac{d^{3/2}}{c},\quad s=\frac{c}{\sqrt{d}},\\[1.2ex]
\text{spectral radius } r: & J_{\min}=\frac{d}{r},\quad s=r.
\end{array}
\]
In each case, $p_G$ is the unique minimizer (up to null sets).
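The scalar-constraint step, that $\tr(\Sigma^{-1})$ under a trace constraint is minimized by the isotropic covariance, can be spot-checked numerically; the sketch below (dimension and trace values are illustrative) verifies the trace case:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 5, 10.0
iso_value = d**2 / t    # tr(S^{-1}) at the isotropic S = (t/d) I

for _ in range(100):
    # Random SPD covariance rescaled to have trace exactly t.
    A = rng.standard_normal((d, d))
    S = A @ A.T + 1e-3 * np.eye(d)
    S *= t / np.trace(S)
    # AM-HM on the eigenvalues: tr(S^{-1}) >= d^2 / tr(S).
    assert np.trace(np.linalg.inv(S)) >= iso_value - 1e-9
```

Every random trace-$t$ covariance attains at least the isotropic value $d^2/t$, matching the first row of the table.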
Theorem. [label={def:cramer}]{Cramér-Wold [cramer1936some]} Let $X$ and $Y$ be random vectors in $R^D$:
\begin{align}
X \overset{d}{=} Y \iff \langle X, \va \rangle \overset{d}{=} \langle Y, \va \rangle, \quad \forall \va \in R^D.
\end{align}
Lemma. [label={thm:linear_probe_bias}]{Anisotropy amplifies bias} Whenever $\lambda_1>\lambda_K$, there always exists a downstream task ($\vy$) for which $\mZ_{\rm aniso}$ produces a higher-bias estimator than $\mZ_{\rm iso}$ for any $\lambda>0$. (Proof in proof:linear_probe_bias.)
Lemma. [label={thm:linear_probe_variance}]{Anisotropy amplifies variance} With $\lambda=0$, the total variance of $\boldsymbol{\beta}$ in eq:OLS is minimized by $\mZ_{\rm iso}$: $\tr(\mathrm{Var}(\boldsymbol{\beta}_{\rm aniso})) > \tr(\mathrm{Var}(\boldsymbol{\beta}_{\rm iso}))$. (Proof in proof:linear_probe_variance.)
Lemma. [label={thm:knn_bias}]{k-NN Pointwise Bias} The eq:kNN estimator has bias at query $\vq$ given by
\[
\mathrm{Bias}(\vq)= \frac{r_0^2}{d+2}\Big(\grad\eta(\vq)^\top\grad\log p_{z}(\vq)+\frac{1}{2}\Delta\eta(\vq)\Big)+o(r_0^2),
\]
where the remainder $o(r_0^2)$ is uniform in $\vq$. (Proof in proof:knn_bias.)
Lemma. [label={thm:kernel_bias}]{Kernel Bias and Variance} For any fixed $\vq\in R^d$ with $p(\vq)>0$, as $h\to 0$ and $n h^d\to\infty$,
\begin{align*}
\mathrm{Bias}\big[\widehat \vy(\vq)\big] &=\frac{h^2\mu_2(K)}{2}\Big(\Delta \vy(\vq)+2\nabla \vy(\vq)^\top \nabla\log p(\vq)\Big)+o(h^2),\\
\mathrm{Var}\big[\widehat \vy(\vq)\big] &=\frac{R(K)}{n h^d}\,\frac{v(\vq)}{p(\vq)}+o\big((n h^d)^{-1}\big).
\end{align*}
The $o(\cdot)$ terms are uniform over compact sets where $p$ is bounded away from zero. (Proof in proof:kernel_bias.)
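The $h^2$ bias rate is easy to probe numerically. A minimal sketch, assuming $d=1$, $p=\mathcal{N}(0,1)$, $m(x)=x^2$, a Gaussian kernel, and noiseless labels so that the estimate at $0$ is pure bias; all of these choices are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)   # p = N(0, 1)
y = x**2                     # noiseless target m(x) = x^2, so m(0) = 0

def nw_at_zero(h):
    """Nadaraya-Watson estimate of m(0) with a Gaussian kernel."""
    w = np.exp(-x**2 / (2.0 * h**2))
    return (w * y).sum() / w.sum()

# Here Delta m = 2 and grad m(0) = 0, so the expansion predicts
# bias ~ h^2 at x = 0; halving h should shrink it roughly 4x.
b_small, b_large = nw_at_zero(0.2), nw_at_zero(0.4)
assert 0.0 < b_small < b_large
assert 2.5 < b_large / b_small < 4.5
```

The observed ratio sits near $4$ (exactly $4$ only in the $h\to 0$ limit, since the $o(h^2)$ terms are still visible at $h=0.4$).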
Lemma. [label={thm:spherical_cramer}]{Hyperspherical Cramér-Wold} Let $X,Y$ be $R^d$-valued random vectors. Then
\[
\langle \vu,X\rangle \overset{d}{=} \langle \vu,Y\rangle,\ \forall \vu \in S^{d-1} \iff X \overset{d}{=} Y.
\]
The analogous statement holds for convergence in distribution. (Proof in proof:spherical_cramer.)
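A small numerical illustration of why all (or densely many) directions are needed: two different joint laws can agree along a finite set of directions. The construction below is an illustrative assumption, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.standard_normal(n)
X = np.stack([z, z], axis=1)        # perfectly correlated coordinates
Y = rng.standard_normal((n, 2))     # independent coordinates

# The two axis-aligned projections agree: both are N(0, 1).
for a in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    assert abs(np.var(X @ a) - np.var(Y @ a)) < 0.05

# But the diagonal direction separates them: variance 2 vs. 1.
diag = np.array([1.0, 1.0]) / np.sqrt(2.0)
assert np.var(X @ diag) > 1.8 and np.var(Y @ diag) < 1.2
```

Checking only the two coordinate axes would wrongly conclude $X \overset{d}{=} Y$; the lemma's quantifier over the whole sphere is what makes the equivalence hold.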
Lemma. [label={lem:PPP_conditional}]{Conditional law of Poisson points} Fix a bounded Borel set $A$ with $\Lambda(A)>0$. Conditional on $M(A)=m\ge 1$, the $m$ points of $\Pi$ that fall in $A$ are i.i.d. with density
\[
q_A(y) := \frac{\lambda(y)\,\mathbb{1}\{y\in A\}}{\Lambda(A)} \quad \text{on } A.
\]
Lemma. [label={lem:fixed-Sigma}]{Special case: Recovery of VCReg} Let $p$ be a mean-zero probability density on $R^d$ with covariance $\Sigma=E[X X^\top]\succ 0$. Then
\[
J(p) \ge \tr(\Sigma^{-1}),
\]
with equality if and only if $p=N(0,\Sigma)$.
Proposition. [label={prop:exact_ratio}]{Exact conditional ratio} With $\lambda(y)=N p(y)$ and $M_x:=M_x(r_0)$, we have
\[
E\big[\eta(x)\,\big|\, M_x=m\big] = \frac{\int_{B(0,r_0)} \eta(x+z)p(x+z)dz}{\int_{B(0,r_0)} p(x+z)dz} \qquad \text{for every } m\ge 1.
\]
Consequently,
\[
E\big[\eta(x)\,\big|\, M_x\ge 1\big] = \frac{\int_{B(0,r_0)} \eta(x+z)p(x+z)dz}{\int_{B(0,r_0)} p(x+z)dz}.
\]
Corollary. [label={cor:uncond}]{Unconditional expectation} Let $\mu_x:=\mu_x(r_0)=\int_{B(x,r_0)} N p(y)dy$. Then
\[
E\big[\eta(x)\big] = P(M_x\ge 1)\cdot \frac{\int_{B(0,r_0)} \eta(x+z)p(x+z)dz}{\int_{B(0,r_0)} p(x+z)dz}.
\]
Moreover, $P(M_x=0)=e^{-\mu_x}$. Hence, if $N r_0(N)^d\inf_{x\in K} p(x)\to \infty$ on a compact $K$, then $\sup_{x\in K}\big|E[\eta(x)] - \frac{\int \eta p}{\int p}\big| \le e^{-\mu_x}$ vanishes exponentially fast in $N r_0^d$.
Definition. {JEPA}{}
\begin{align*}
{\rm JEPA}(\vx) \iff\ & {\rm Enc}\left(\vx_{n,t+1,.}\right) \text{ is predictable from } {\rm Enc}\left(\vx_{n,t,.}\right),\ \forall n,t,\\
&\text{and } {\rm Enc}\left(\vx_{.,.,.}\right) \text{ is not degenerate.}
\end{align*}
\begin{minipage}{0.32\linewidth}\includegraphics[width=\linewidth]{toy_figures/teaser/teaser_JEPA_0.png}\end{minipage}
\begin{minipage}{0.32\linewidth}\includegraphics[width=\linewidth]{toy_figures/teaser/teaser_JEPA_1.png}\end{minipage}
\begin{minipage}{0.32\linewidth}\includegraphics[width=\linewidth]{toy_figures/teaser/teaser_JEPA_2.png}\end{minipage}
Definition. [label={def:bcs}]{SIGReg (PyTorch code in lst:epps-pulley-pytorch)} SIGReg aggregates a univariate statistical test $T$ of closeness to the isotropic Gaussian over a set of directions $\sA$:
\begin{multline}
{\rm SIGReg}_{T}(\sA,\{f_{\vtheta}(\vx_n)\}_{n=1}^{N})\triangleq\frac{1}{|\sA|}\sum_{\va\in\sA}T(\{\va^\top f_{\vtheta}(\vx_n)\}_{n=1}^{N}),
\end{multline}
where we recommend the Epps-Pulley test (sec:cf_tests) for $T$.
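A minimal NumPy sketch of this definition with a grid-based Epps-Pulley-style statistic; it is not the paper's PyTorch listing (lst:epps-pulley-pytorch), and the helper names, bandwidths, and grids are illustrative assumptions:

```python
import numpy as np

def epps_pulley(z, t_max=5.0, n_t=101):
    """Weighted L2 distance between the empirical characteristic function
    of the 1D samples z and the standard normal CF exp(-t^2/2)."""
    t = np.linspace(-t_max, t_max, n_t)
    dt = t[1] - t[0]
    ecf = np.exp(1j * np.outer(z, t)).mean(axis=0)
    target = np.exp(-0.5 * t**2)
    weight = np.exp(-0.5 * t**2)
    return float((np.abs(ecf - target) ** 2 * weight).sum() * dt)

def sigreg(embeddings, n_directions=64, seed=0):
    """Average the 1D test over random directions a ~ U(S^{K-1})."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n_directions, embeddings.shape[1]))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    return float(np.mean([epps_pulley(embeddings @ a) for a in A]))

rng = np.random.default_rng(0)
gauss = rng.standard_normal((2000, 16))                        # isotropic Gaussian
collapsed = np.tile(rng.standard_normal((2000, 1)), (1, 16))   # rank-1 collapse

assert sigreg(gauss) < sigreg(collapsed)   # the penalty flags collapse
```

Isotropic Gaussian embeddings score near zero, while collapsed (rank-deficient) embeddings are penalized, which is exactly the role SIGReg plays as a regularizer.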
Remark. [Asymptotic ``no-loss'' vs. multivariate tests] Let $\{\Phi_n\}$ be any level-$\alpha$ multivariate GOF test that is consistent (i.e., $\Pr_{P\neq Q}(\Phi_n=1)\to 1$ and $\limsup_{n}\Pr_{P=Q}(\Phi_n=1)\le \alpha$). Then Theorem thm:main(ii) implies $\Pr(\Psi_n=1)\to 1$ whenever $\Pr(\Phi_n=1)\to 1$. Conversely, if $\Pr(\Psi_n=1)\not\to 1$, Assumption ass:margin forces $P=Q$, so no consistent multivariate test can reject with probability tending to one. Thus scanning 1D projections is asymptotically as strong as any consistent multivariate GOF method, without committing to any particular 1D test.
Proof. Our proof follows standard derivations for the bias of an estimator. Consider the ridge regression problem (Tikhonov-regularized least squares) with closed-form estimator
\begin{equation}
\boldsymbol{\beta} = (X^\top X + \lambda_{\rm wd} I)^{-1} X^\top Y.
\end{equation}
The labels are formed from the ground-truth parameter $\beta_{\rm true}$ with centered noise, i.e., $Y = X\beta_{\rm true} + \varepsilon$ where $E[\varepsilon] = 0$. The bias of the estimator is
\begin{align*}
\mathrm{Bias}(\boldsymbol{\beta}) &= E[\boldsymbol{\beta}] - \beta_{\rm true}\\
&=(X^\top X + \lambda_{\rm wd} I)^{-1} X^\top X\beta_{\rm true}-\beta_{\rm true}\\
&= -\lambda_{\rm wd}(X^\top X + \lambda_{\rm wd} I)^{-1} \beta_{\rm true}\\
&= -\lambda_{\rm wd} Q(\Lambda + \lambda_{\rm wd} I)^{-1}Q^\top \beta_{\rm true},
\end{align*}
using the eigendecomposition $X^\top X = Q\Lambda Q^\top$. We now compare that bias when $\mX$ has isotropic and anisotropic covariance with the same total variance:
\begin{equation}
\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_p}{p} = \lambda.
\end{equation}
For any anisotropic covariance matrix of $\mX$, denote by $\vq_p$ the eigenvector with smallest eigenvalue $\lambda_p$, and let $\kappa>0$ be a positive constant. Setting
\begin{equation}
\beta_{\rm true} = \kappa \cdot \vq_p
\end{equation}
leads to
\begin{align*}
\|\mathrm{Bias}(\boldsymbol{\beta})\|_{\rm isotropic} = \frac{\lambda_{\rm wd}}{\lambda + \lambda_{\rm wd}} \|\beta_{\rm true}\|,\qquad
\|\mathrm{Bias}(\boldsymbol{\beta})\|_{\rm anisotropic} = \frac{\lambda_{\rm wd}}{\lambda_p + \lambda_{\rm wd}} \|\beta_{\rm true}\|.
\end{align*}
Since $\lambda_p < \lambda$ (strict inequality when not isotropic),
\begin{equation*}
\frac{\lambda_{\rm wd}}{\lambda_p + \lambda_{\rm wd}} > \frac{\lambda_{\rm wd}}{\lambda + \lambda_{\rm wd}},
\end{equation*}
we obtain
\begin{equation*}
\|\mathrm{Bias}(\boldsymbol{\beta})\|_{\rm anisotropic} > \|\mathrm{Bias}(\boldsymbol{\beta})\|_{\rm isotropic}.
\end{equation*}
As a result, whenever the covariance matrix of $\mX$ is anisotropic, there exist downstream tasks for which the estimator bias is larger than under an isotropic covariance matrix.
Anisotropic covariance structure thus amplifies regularization bias when the true parameter vector aligns unfavorably with the data's covariance structure.
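A quick numerical check of this comparison; the sample size, eigenvalue spectra, and weight-decay value below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam_wd = 2000, 4, 5.0

def ridge_bias_norm(eigs):
    """Ridge bias norm when beta_true aligns with the smallest-eigenvalue
    direction of the (diagonal) feature covariance."""
    X = rng.standard_normal((n, p)) * np.sqrt(eigs)
    beta_true = np.zeros(p)
    beta_true[-1] = 1.0
    G = X.T @ X
    # E[beta_hat] - beta_true = -lam (G + lam I)^{-1} beta_true (noise-free mean).
    bias = np.linalg.solve(G + n * lam_wd * np.eye(p), G) @ beta_true - beta_true
    return float(np.linalg.norm(bias))

iso = ridge_bias_norm(np.array([1.0, 1.0, 1.0, 1.0]))
aniso = ridge_bias_norm(np.array([2.5, 1.0, 0.4, 0.1]))  # same total variance
assert aniso > iso
```

With equal total variance, the anisotropic spectrum yields a visibly larger bias norm, in line with $\lambda_{\rm wd}/(\lambda_p + \lambda_{\rm wd}) > \lambda_{\rm wd}/(\lambda + \lambda_{\rm wd})$.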
Proof. We use the same formula as in proof:linear_probe_bias with $\lambda_{\rm wd}=0$; the estimator is then unbiased. We leverage that result to compute the covariance matrix of the estimator:
\begin{align*}
\mathrm{Var}(\boldsymbol{\beta}|X) &= E[(\boldsymbol{\beta} - \beta)(\boldsymbol{\beta} - \beta)^\top|X]\\
&= E[(X^\top X)^{-1}X^\top\varepsilon\varepsilon^\top X(X^\top X)^{-1}|X]\\
&= (X^\top X)^{-1}X^\top E[\varepsilon\varepsilon^\top|X]X(X^\top X)^{-1}\\
&= (X^\top X)^{-1}X^\top(\sigma^2 I_n)X(X^\top X)^{-1}\\
&= \sigma^2(X^\top X)^{-1},
\end{align*}
leading to the total variance
\[
\tr(\mathrm{Var}(\boldsymbol{\beta})) = \sigma^2\tr(G^{-1})=\sigma^2 \sum_{j=1}^p \frac{1}{\lambda_j},
\]
where we used the eigendecomposition $G = Q\Lambda Q^\top$. The function $f(x) = \frac{1}{x}$ is strictly convex on $(0, \infty)$, so Jensen's inequality gives
\[
\frac{1}{p}\sum_{j=1}^p \frac{1}{\lambda_j} > \frac{1}{\frac{1}{p}\sum_{k=1}^p \lambda_k}
\implies \sum_{j=1}^p \frac{1}{\lambda_j} > \sum_{j=1}^{p}\frac{1}{\frac{1}{p}\sum_{k=1}^p \lambda_k}
\implies \tr(\mathrm{Var}(\boldsymbol{\beta}))_{\rm aniso} > \tr(\mathrm{Var}(\boldsymbol{\beta}))_{\rm iso}.
\]
The inequality is strict whenever the eigenvalues $\{\lambda_j\}_{j=1}^p$ are not all equal.
Proof. By Lemma lem:PPP_conditional with $A=B(x,r_0)$, conditional on $M_x=m$, the $m$ points in $B(x,r_0)$ are i.i.d. with density proportional to $\lambda(y)=N p(y)$ on $B(x,r_0)$; equivalently, with density
\[
q_{x,r_0}(y) = \frac{p(y)\,\mathbb{1}\{\|y-x\|\le r_0\}}{\int_{B(x,r_0)} p(u)du}.
\]
Therefore,
\[
E\left[\frac{1}{m}\sum_{j=1}^m \eta(Y_j)\,\Big|\,M_x=m\right] = \int_{B(x,r_0)} \eta(y)q_{x,r_0}(y)dy = \frac{\int_{B(x,r_0)} \eta(y)p(y)dy}{\int_{B(x,r_0)} p(y)dy}.
\]
Changing variables $y=x+z$ gives the stated ratio. Since the right-hand side is independent of $m$, the same holds conditional on $M_x\ge 1$ by averaging over $m\ge 1$.
Proof. Condition on $\{M_x=0\}$ and $\{M_x\ge 1\}$ and use Proposition prop:exact_ratio. The Poisson count identity gives $P(M_x=0)=e^{-\mu_x}$. The uniform bound follows from eq:mu_expansion and the lower bound on $p$ on $K$.
Proof. Under the PPP model, conditional expectations of $\eta(x)$ coincide with the normalized ball average
\[
E\big[\eta(x)\big] = \frac{\int_{\Ball(0,r_0)} \eta(x+z)p(x+z)dz}{\int_{\Ball(0,r_0)} p(x+z)dz} \quad\text{to second order in } r_0,
\]
which is the key surrogate used below.
\noindent\textbf{Ball integrals.} By symmetry, for any $r>0$:
\[
\int_{\Ball(0,r)} z\,dz=0,\qquad \int_{\Ball(0,r)} zz^\top dz=\frac{v_d r^{d+2}}{d+2}I_d,\qquad \int_{\Ball(0,r)} \|z\|^2dz=\frac{d\,v_d r^{d+2}}{d+2},
\]
with $v_d$ the volume of the unit ball. Fix $x\in R^d$ and write $z\in\Ball(0,r_0)$ for local displacements. Assume $p\in C^3$, $\eta\in C^2$ with bounded derivatives on the region of interest, and take second-order Taylor expansions:
\begin{align*}
p(x+z)&=p(x)+\nabla p(x)^\top z+\tfrac12 z^\top \Hess p(x)z + R_p(x;z),\\
\eta(x+z)&=\eta(x)+\nabla\eta(x)^\top z+\tfrac12 z^\top \Hess\eta(x)z + R_\eta(x;z),
\end{align*}
with remainders satisfying $|R_\eta(x;z)|\le C_\eta\|z\|^3$ and $|R_p(x;z)|\le C_p\|z\|^3$ uniformly for $\|z\|\le r_0$. Using the ball identities and collecting terms up to order $r_0^{d+2}$, the denominator simplifies as
\begin{align*}
D(x)&\triangleq\int_{\Ball(0,r_0)} p(x+z)dz\\
&= \int_{\Ball(0,r_0)} \Big[p(x) + \grad p(x)^\top z + \tfrac{1}{2}z^\top \Hess p(x)z + R_p(x;z)\Big]dz\\
&= v_d r_0^d\,p(x)+\frac{v_d r_0^{d+2}}{2(d+2)}\tr\big(\Hess p(x)\big)+O(r_0^{d+3}),
\end{align*}
since $\int z\,dz=0$ and $\int z^\top \Hess p\,z\,dz=\tr(\Hess p)\frac{v_d r_0^{d+2}}{d+2}$, and the numerator as
\begin{align*}
N(x)&\triangleq \int_{\Ball(0,r_0)} \eta(x+z)p(x+z)dz\\
&= \int \Big[\eta(x)+\grad\eta(x)^\top z+\tfrac{1}{2}z^\top \Hess\eta(x)z\Big] \Big[p(x)+\grad p(x)^\top z+\tfrac{1}{2}z^\top \Hess p(x)z\Big]dz+O(r_0^{d+3})\\
&= \eta(x)p(x)v_d r_0^d+\eta(x)\frac{v_d r_0^{d+2}}{2(d+2)}\tr\big(\Hess p(x)\big)+\frac{v_d r_0^{d+2}}{d+2}\grad\eta(x)\cdot\grad p(x)+ \frac{v_d r_0^{d+2}}{2(d+2)}p(x)\tr\big(\Hess\eta(x)\big) +O(r_0^{d+3}).
\end{align*}
Indeed, the $\grad\eta\cdot z$ term paired with $p(x)$ integrates to $0$, while $\int (\grad\eta^\top z)(\grad p^\top z)dz=\grad\eta^\top\big(\int zz^\top dz\big)\grad p=\frac{v_d r_0^{d+2}}{d+2}\grad\eta\cdot\grad p$. Cubic terms vanish by symmetry, and quartic terms are $O(r_0^{d+4})$. Subtracting $\eta(x)D(x)$ yields the bias numerator:
\[
N(x)-\eta(x)D(x) = \frac{v_d r_0^{d+2}}{d+2}\Big(\nabla\eta(x)\cdot\nabla p(x) + \tfrac{1}{2}p(x)\Delta\eta(x)\Big) + O(r_0^{d+3}).
\]
Write $D(x)=v_d r_0^d p(x)\big(1+\alpha(x)r_0^2+O(r_0^3)\big)$ with $\alpha(x):=\frac{\tr(\Hess p(x))}{2(d+2)p(x)}$. Then
\begin{align*}
\frac{N(x)}{D(x)}-\eta(x) &= \frac{\frac{v_d r_0^{d+2}}{d+2}\left(\nabla\eta\cdot\nabla p + \tfrac{1}{2}p\Delta\eta\right) + O(r_0^{d+3})}{v_d r_0^d\,p\,\left(1+\alpha r_0^2+O(r_0^3)\right)}\\[0.5ex]
&= \frac{r_0^2}{d+2}\left(\frac{\nabla\eta\cdot\nabla p}{p} + \tfrac{1}{2}\Delta\eta\right)\Big(1-\alpha r_0^2+O(r_0^3)\Big) + O(r_0^3)\\[0.5ex]
&= \frac{r_0^2}{d+2}\Big(\nabla\eta(x)\cdot\nabla\log p(x) + \tfrac{1}{2}\Delta\eta(x)\Big) + o(r_0^2),
\end{align*}
uniformly on $K$. This gives the bias formula
\[
E\big[\eta(x)\big]-\eta(x) = \frac{r_0^2}{d+2}\Big(\nabla\eta(x)\cdot\nabla\log p(x) + \tfrac{1}{2}\Delta\eta(x)\Big) + o(r_0^2),
\]
completing the proof.
Proof. Recall from proof:knn_bias that the bias at sample $\vx$ is given by
\begin{align*}
\mathrm{Bias}(\vx) &=\frac{r_0^2}{d+2}\Big(\grad\eta(x)\cdot\grad\log p(x)\Big)+\frac{r_0^2}{2(d+2)}\Delta\eta(x)+o(r_0^2)\\
&=\frac{r_0^2}{d+2}\big(A(x)+C(x)\big)+o(r_0^2),
\end{align*}
where we defined $A(x)\triangleq \nabla\eta(x)\cdot\nabla\log p(x)$ and $C(x)\triangleq\tfrac{1}{2}\Delta\eta(x)$. We now square, and take the expectation over $X\sim p$ and over the isotropic gradient prior:
\begin{align}
E\big[\mathrm{Bias}(X)^2\big] &=E\Big[\Big(\frac{r_0^2}{d+2}\Big)^2\big(A(X)^2 + 2A(X)C(X) + C(X)^2\big) + o(r_0^4)\Big]\nonumber\\
&=\Big(\frac{r_0^2}{d+2}\Big)^2 \Big\{\underbrace{E\big[A(X)^2\big]}_{\text{score-gradient term}} + 2\underbrace{E\big[A(X)C(X)\big]}_{\text{cross term}} + \underbrace{E\big[C(X)^2\big]}_{\text{curvature term}}\Big\} + o(r_0^4).
\end{align}
We derive each term separately, recalling that we assume an isotropic gradient prior for $\eta$, i.e., $E\big[\nabla\eta(x)\big]=0$ and $E\big[\nabla\eta(x)\nabla\eta(x)^\top\big]=\tau_g^2 I_d$, for some $\tau_g^2\in(0,\infty)$.
\textbf{1) The score-gradient term $E[A(X)^2]$.} Using $v(x):=\nabla\log p(x)$ for brevity:
\begin{align*}
E\big[A(X)^2\big] &=E_X\big[E_\eta[A(X)^2]\big]
=E_X\big[E_\eta[\big(\nabla\eta(X)^\top v(X)\big)^2]\big]\\
&=E_X\big[E_\eta[\nabla\eta(X)^\top\big(v(X)v(X)^\top\big)\nabla\eta(X)]\big]
=E_X\big[E_\eta[\tr\big(v(X)v(X)^\top\nabla\eta(X)\nabla\eta(X)^\top\big)]\big]\\
&=E_X\big[\tr\big(v(X)v(X)^\top E_\eta[\nabla\eta(X)\nabla\eta(X)^\top]\big)\big]
=E_X\big[\tau_g^2\|v(X)\|^2\big]\\
&=\tau_g^2\int_{R^d} \|\nabla\log p(x)\|^2 p(x)dx,
\end{align*}
recovering the Fisher-information functional $J(p)$, scaled by $\tau_g^2$.
\textbf{2) The cross term $2E[A(X)C(X)]$.} We have
\[
A(x)C(x)=\tfrac{1}{2}\big(\nabla\eta(x)^\top v(x)\big)\Delta\eta(x).
\]
Under the prior, $\nabla\eta$ is mean-zero and isotropic; if, additionally, $\Delta\eta$ is uncorrelated with $\nabla\eta$ and has zero mean (or is bounded and mean-zero after centering), then $E_\eta[A(x)C(x)]=0$.
If one does not assume the vanishing covariance above, then $E[A(X)C(X)]$ is a finite constant (depending on the joint law of the derivatives of $\eta$), and the cross term contributes
\[
\Big(\frac{r_0^2}{d+2}\Big)^2\cdot 2E[A(X)C(X)] = O(r_0^4),
\]
not $o(r_0^4)$. In that general case, the leading $p$-dependent term of $E[\mathrm{Bias}(X)^2]$ is still the score-gradient term $\tau_g^2 J(p)$.
\textbf{3) The curvature term $E[C(X)^2]$.}
\[
E\big[C(X)^2\big] =E_X\big[E_\eta[C(X)^2]\big] =\tfrac{1}{4}E_X\big[E_\eta[(\Delta\eta(X))^2]\big],
\]
which is independent of $p$; hence $E\big[C(X)^2\big]=O(1)$.
\textbf{Putting it together.} Substituting into eq:three-terms:
\begin{align*}
E\big[\mathrm{Bias}(X)^2\big] &=\Big(\frac{r_0^2}{d+2}\Big)^2\Big\{\tau_g^2 J(p) + O(1)\Big\} + o(r_0^4)
=\frac{r_0^4}{(d+2)^2}\tau_g^2 J(p)+O(r_0^4).
\end{align*}
We now show that, among all mean-zero distributions $p$ on $R^d$ with a given scalar constraint on the covariance (trace, determinant, Frobenius norm, or spectral radius), the density minimizing the Fisher-information functional
\[
J(p) := \int_{R^d}\|\nabla\log p(x)\|^2 p(x)dx
\]
is the Gaussian with isotropic covariance satisfying the same scalar constraint. We proceed in two steps: (i) for fixed covariance matrix $\Sigma\succ 0$, $J(p)$ is minimized by the Gaussian $N(0,\Sigma)$ and attains the value $\tr(\Sigma^{-1})$; (ii) for each scalar constraint, $\tr(\Sigma^{-1})$ is minimized by $\Sigma=sI_d$ for the appropriate scalar $s>0$. Step (i) is Lemma lem:fixed-Sigma, whose proof follows. Consider the location family $p_\theta(x):=p(x-\theta)$, $\theta\in R^d$. Its Fisher-information matrix at $\theta$ is
\[
I(\theta) = E\big[\nabla_\theta\log p_\theta(X)\nabla_\theta\log p_\theta(X)^\top\big] = E\big[\nabla\log p(X)\nabla\log p(X)^\top\big],
\]
so that $J(p)=\tr I(\theta)$.
The estimator $T(X)\equiv X$ is unbiased for $\theta$ under $p_\theta$, with $Cov(T)=\Sigma$. The matrix Cramér--Rao bound gives $Cov(T)\succeq I(\theta)^{-1}$, i.e., $I(\theta)\succeq \Sigma^{-1}$. Taking traces yields $J(p)\ge tr(\Sigma^{-1})$. Equality in the matrix Cramér--Rao bound holds if and only if the score is an affine function of $X-\theta$, i.e., $\nabla\log p_\theta(X)=A(X-\theta)$ a.s.\ for some matrix $A$; integrating this identity shows $p_\theta$ is Gaussian with precision matrix $-A$, hence $p=N(0,\Sigma)$.
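The inequality $J(p)\ge\tr(\Sigma^{-1})$ can be spot-checked by numerical integration for unit-variance densities; the Gaussian-vs-Laplace comparison below is an illustrative choice, not an example from the paper:

```python
import numpy as np

# J(p) = \int (d/dx log p)^2 p dx for unit-variance densities, on a grid.
x = np.linspace(-8.0, 8.0, 160_001)
dx = x[1] - x[0]

def fisher_info(p):
    score = np.gradient(np.log(p), dx)
    return float((score**2 * p).sum() * dx)

gauss = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)  # Var = 1, J = 1
b = 1.0 / np.sqrt(2.0)                               # Laplace scale: Var = 2b^2 = 1
laplace = np.exp(-np.abs(x) / b) / (2.0 * b)         # J = 1/b^2 = 2

assert abs(fisher_info(gauss) - 1.0) < 1e-2          # attains tr(Sigma^{-1}) = 1
assert fisher_info(laplace) > fisher_info(gauss)     # non-Gaussian: strictly larger
```

The Gaussian attains the Cramér-Rao floor $\tr(\Sigma^{-1})=1$ exactly, while the equal-variance Laplace density has twice the Fisher information, as the lemma's equality condition predicts.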
Proof. For any admissible $p$ with covariance $\Sigma$, Lemma lem:fixed-Sigma gives $J(p)\ge \tr(\Sigma^{-1})$. Minimizing the right-hand side under the stated scalar constraint yields $\Sigma=sI_d$ by the calculations in (a)--(d). Equality in Lemma lem:fixed-Sigma holds if and only if $p$ is Gaussian with that covariance, hence $p_G$ uniquely attains the bound.
Proof. Write the numerator and denominator of $\widehat m(x)$ as
\[
B_n(x):=\sum_{i=1}^n K_h(x-X_i)Y_i,\qquad A_n(x):=\sum_{i=1}^n K_h(x-X_i),
\]
so that $\widehat m(x)=\frac{B_n(x)}{A_n(x)}$.
\textbf{Bias.} Compute expectations using independence and a change of variables. For the denominator,
\begin{align*}
E[A_n(x)] &=nE\big[K_h(x-X)\big]
=n\int_{R^d} h^{-d}K\Big(\frac{x-u}{h}\Big)p(u)du\\
&=n\int_{R^d} K(t)p(x-h t)dt\qquad (t:=(x-u)/h)\\
&=n\int_{R^d} K(t)\Big(p(x)-h\,t^\top \nabla p(x)+\frac{h^2}{2}t^\top \nabla^2 p(x)t+o(h^2)\Big)dt\\
&=n\Big(p(x)+\frac{h^2}{2}\underbrace{\int t^\top \nabla^2 p(x)t\,K(t)dt}_{=\mu_2(K)\Delta p(x)}+o(h^2)\Big),
\end{align*}
where we used the symmetry $\int t K(t)dt=0$ and isotropy $\int t t^\top K(t)dt=\mu_2(K) I_d$, which implies $\int t^\top \nabla^2 p(x)t\,K(t)dt=\mu_2(K)\tr(\nabla^2 p(x))=\mu_2(K)\Delta p(x)$. Similarly, for the numerator,
\begin{align*}
E[B_n(x)] &=nE\big[K_h(x-X)Y\big] =n\int K(t)(m p)(x-h t)dt\\
&=n\int K(t)\Big((mp)(x)-h\,t^\top \nabla(mp)(x)+\frac{h^2}{2}t^\top \nabla^2(mp)(x)t+o(h^2)\Big)dt\\
&=n\Big(m(x)p(x)+\frac{h^2}{2}\mu_2(K)\tr\big(\nabla^2(mp)(x)\big)+o(h^2)\Big)\\
&=n\Big(m(x)p(x)+\frac{h^2\mu_2(K)}{2}\big(p\Delta m + m\Delta p + 2\nabla m^\top \nabla p\big)(x)+o(h^2)\Big),
\end{align*}
where the last step uses $\tr\big(\nabla^2(mp)\big)=p\Delta m + m\Delta p + 2\nabla m^\top \nabla p$, by the product rule and symmetry of mixed derivatives. Now expand the ratio $\frac{E[B_n(x)]}{E[A_n(x)]}$ using the identity
\[
\frac{a_0+h^2 a_2+o(h^2)}{b_0+h^2 b_2+o(h^2)} =\frac{a_0}{b_0} +h^2\frac{a_2 b_0-a_0 b_2}{b_0^2} +o(h^2),
\]
with $a_0=m(x)p(x)$, $a_2=\frac{\mu_2(K)}{2}\big(p\Delta m + m\Delta p + 2\nabla m^\top \nabla p\big)(x)$, $b_0=p(x)$, and $b_2=\frac{\mu_2(K)}{2}\Delta p(x)$. This yields
\begin{align*}
\frac{E[B_n(x)]}{E[A_n(x)]} &=m(x) +h^2\frac{\mu_2(K)}{2}\,\frac{\big(p\Delta m + m\Delta p + 2\nabla m^\top \nabla p\big)p - m p\Delta p}{p^2}\Big|_{x}+o(h^2)\\
&=m(x) +h^2\frac{\mu_2(K)}{2}\Big(\Delta m(x)+2\frac{\nabla m(x)^\top \nabla p(x)}{p(x)}\Big)+o(h^2),
\end{align*}
which recovers our statement.
\textbf{Variance.}
Linearize $\widehat m(x)=B_n(x)/A_n(x)$ around $(E[B_n(x)],E[A_n(x)])$ and use independence. To leading order,
\[
\mathrm{Var}[\widehat m(x)]\approx \frac{\mathrm{Var}[B_n(x)]}{(E[A_n(x)])^2}.
\]
Compute
\begin{align*}
\mathrm{Var}[B_n(x)] &= \sum_{i=1}^n \mathrm{Var}\big(K_h(x-X_i)Y_i\big)\quad\text{(independence)}\\
&= nE\big[K_h(x-X)^2\mathrm{Var}(Y\mid X)\big] = nE\big[K_h(x-X)^2v(X)\big]\\
&= n\int h^{-2d}K\Big(\frac{x-u}{h}\Big)^2 v(u)p(u)du\\
&= n h^{-d}\int K(t)^2v(x-h t)p(x-h t)dt = n h^{-d}\Big(R(K)v(x)p(x)+o(1)\Big),
\end{align*}
while $E[A_n(x)]=n\big(p(x)+o(1)\big)$. Therefore,
\[
\mathrm{Var}[\widehat m(x)] \approx \frac{n h^{-d}R(K)v(x)p(x)}{n^2p(x)^2} =\frac{R(K)}{n h^d}\,\frac{v(x)}{p(x)}+o\big((n h^d)^{-1}\big),
\]
completing the proof.
Proof. Let $\bar z = \frac{1}{V_g}\sum_{v=1}^{V_g}z_{n,v}$ denote the mean of the first $V_g$ vectors. We prove that
\begin{equation}
\frac{1}{V_g}\sum_{v=1}^{V_g}\frac{1}{V}\sum_{v'=1}^{V}\| z_{n,v} - z_{n,v'} \|_2^2 = \frac{1}{V}\sum_{v'=1}^{V}\left\| \bar z - z_{n,v'} \right\|_2^2.
\end{equation}
Expanding the left-hand side:
\begin{align}
\mathrm{LHS} &= \frac{1}{V_g V}\sum_{v=1}^{V_g}\sum_{v'=1}^{V}\| z_{n,v} - z_{n,v'} \|_2^2\nonumber\\
&= \frac{1}{V_g V}\sum_{v=1}^{V_g}\sum_{v'=1}^{V}\left(\|z_{n,v}\|_2^2 - 2z_{n,v}^\top z_{n,v'} + \|z_{n,v'}\|_2^2\right)\nonumber\\
&= \frac{1}{V_g}\sum_{v=1}^{V_g}\|z_{n,v}\|_2^2 - \frac{2}{V_g V}\sum_{v=1}^{V_g}\sum_{v'=1}^{V}z_{n,v}^\top z_{n,v'} + \frac{1}{V}\sum_{v'=1}^{V}\|z_{n,v'}\|_2^2\nonumber\\
&= \frac{1}{V_g}\sum_{v=1}^{V_g}\|z_{n,v}\|_2^2 - \frac{2}{V}\bar z^\top\sum_{v'=1}^{V}z_{n,v'} + \frac{1}{V}\sum_{v'=1}^{V}\|z_{n,v'}\|_2^2.
\end{align}
Expanding the right-hand side:
\begin{align}
\mathrm{RHS} &= \frac{1}{V}\sum_{v'=1}^{V}\left(\|\bar z\|_2^2 - 2\bar z^\top z_{n,v'} + \|z_{n,v'}\|_2^2\right)
= \|\bar z\|_2^2 - \frac{2}{V}\bar z^\top\sum_{v'=1}^{V}z_{n,v'} + \frac{1}{V}\sum_{v'=1}^{V}\|z_{n,v'}\|_2^2.
\end{align}
To complete the proof, we verify that
\begin{equation}
\frac{1}{V_g}\sum_{v=1}^{V_g}\|z_{n,v}\|_2^2 = \|\bar z\|_2^2.
\end{equation}
Expanding the right-hand side:
\begin{align}
\|\bar z\|_2^2 &= \left\|\frac{1}{V_g}\sum_{v=1}^{V_g}z_{n,v}\right\|_2^2
= \frac{1}{V_g^2}\sum_{v=1}^{V_g}\sum_{v''=1}^{V_g}z_{n,v}^\top z_{n,v''}
= \frac{1}{V_g}\sum_{v=1}^{V_g}\|z_{n,v}\|_2^2.
\end{align}
Therefore, $\mathrm{LHS} = \mathrm{RHS}$, completing the proof.
Proof. For each $x$,
\[
\mathrm{Bias}[\widehat m(x)]=\frac{h^2\mu_2(K)}{2}\Big(\Delta m(x)+2\nabla m(x)^\top \nabla\log p(x)\Big)+o(h^2).
\]
Square and integrate against $p(x)$:
\begin{align*}
B^2(h;p,m) &=\Big(\frac{h^2\mu_2(K)}{2}\Big)^2\int \Big(\Delta m(x)+2\nabla m(x)^\top \nabla\log p(x)\Big)^2p(x)dx+o(h^4)\\
&\le \Big(\frac{h^2\mu_2(K)}{2}\Big)^2 \int \Big(2(\Delta m(x))^2+2(2\nabla m(x)^\top \nabla\log p(x))^2\Big)p(x)dx+o(h^4)\\
&=\Big(\frac{h^2\mu_2(K)}{2}\Big)^2\Big(2\int (\Delta m(x))^2p(x)dx+8\int (\nabla m(x)^\top \nabla\log p(x))^2p(x)dx\Big)+o(h^4),
\end{align*}
where we used $(a+b)^2\le 2 a^2+2 b^2$ pointwise. Since $|\Delta m(x)|\le B$ for all $x$, we have
\[
\int (\Delta m)^2p \le \int B^2p = B^2.
\]
For the second term, first use Cauchy--Schwarz and then integrate against $p(x)$:
\begin{align*}
(\nabla m(x)^\top \nabla\log p(x))^2 &\le \|\nabla m(x)\|^2\|\nabla\log p(x)\|^2 \le L^2\|\nabla\log p(x)\|^2\\
\implies \int (\nabla m(x)^\top \nabla\log p(x))^2p(x)dx &\le L^2\int \|\nabla\log p(x)\|^2p(x)dx = L^2 J(p),
\end{align*}
which can be combined with the bounds above to obtain the desired result. Similarly, for the integrated variance,
\begin{align*}
V(h;p) &=\int \Big(\frac{R(K)}{n h^d}\,\frac{v(x)}{p(x)}+o\big((n h^d)^{-1}\big)\Big)p(x)dx=\frac{R(K)}{n h^d}\int v(x)dx+o\big((n h^d)^{-1}\big),
\end{align*}
which is independent of $p$.
Proof. We first remind the reader of the original Cramér-Wold theorem, which quantifies over all directions (not only unit-norm ones).
Theorem. [label={def:cramer}]{Cramér-Wold [cramer1936some]} Let $X$ and $Y$ be random vectors in $R^D$:
\begin{align}
X \overset{d}{=} Y \iff \langle X, \va \rangle \overset{d}{=} \langle Y, \va \rangle, \quad \forall \va \in R^D.
\end{align}
Our proof follows the same structure as that of def:cramer. Necessity is immediate: if $X \overset{d}{=} Y$, then every measurable function of $X$ has the same distribution as the corresponding function of $Y$, of which the linear map $x \mapsto \langle u,x\rangle$ for $u \in S^{d-1}$ is a special case. For sufficiency, assume $\langle u,X\rangle \overset{d}{=} \langle u,Y\rangle$ for all $u \in S^{d-1}$. Let $\varphi_X(t) := E\big[e^{i\langle t,X\rangle}\big]$ and $\varphi_Y(t) := E\big[e^{i\langle t,Y\rangle}\big]$ denote the characteristic functions of $X$ and $Y$. Fix an arbitrary $t \in R^d$; if $t=0$, then $\varphi_X(0)=\varphi_Y(0)=1$. If $t \neq 0$, write $t = s u$ with $s := \|t\| > 0$ and $u := t/\|t\| \in S^{d-1}$. By assumption, $\langle u,X\rangle \overset{d}{=} \langle u,Y\rangle$, hence for this $u$ and $s$ we have
\[
\varphi_X(t) = E\big[e^{i\langle t,X\rangle}\big] = E\big[e^{i s \langle u,X\rangle}\big] = E\big[e^{i s \langle u,Y\rangle}\big] = E\big[e^{i\langle t,Y\rangle}\big] = \varphi_Y(t).
\]
Thus $\varphi_X(t) = \varphi_Y(t)$ for all $t \in R^d$, i.e., $\varphi_X \equiv \varphi_Y$ on $R^d$. By the uniqueness theorem for characteristic functions, this implies $X \overset{d}{=} Y$.
(ii) Define $\psi_{n,t} := E\big[e^{i\langle t,X_n\rangle}\big]$ and $\psi_{t} := E\big[e^{i\langle t,X\rangle}\big]$. Fix $t \in R^d$ and decompose $t = s u$ with $s := \|t\| \ge 0$ and $u \in S^{d-1}$ (take, e.g., $u = t/\|t\|$ if $t \neq 0$, and any $u$ if $t=0$). The map $g_s:R\to R$, $g_s(x)=s x$, is continuous.
By the continuous mapping theorem applied to the real-valued random variables $\langle u,X_n\rangle \overset{d}{\to} \langle u,X\rangle$, we obtain
\[
\langle t,X_n\rangle = s \langle u,X_n\rangle \overset{d}{\to} s \langle u,X\rangle = \langle t,X\rangle.
\]
Hence, for every fixed $t \in R^d$, the one-dimensional projections satisfy $\langle t,X_n\rangle \overset{d}{\to} \langle t,X\rangle$, which in turn yields pointwise convergence of characteristic functions:
\[
\psi_{n,t} = E\big[e^{i\langle t,X_n\rangle}\big] \longrightarrow E\big[e^{i\langle t,X\rangle}\big] = \psi_{t}, \qquad \text{for all } t \in R^d.
\]
Therefore, by Lévy's continuity theorem, $X_n \overset{d}{\to} X$. This completes the proof.
Proof. We first formulate the assumptions required for the proof -- all of which are satisfied by typical univariate statistical tests: (a) $P=Q$ if and only if $P_a=Q_a$ for all $a\in S^{d-1}$ (population-level equivalence of laws); (b) the $A_n$ are finite sets with mesh $\Delta(A_n):=\sup_{u\in S^{d-1}} \min_{a\in A_n}\|u-a\| \to 0$ as $n\to\infty$; (c) if $P\neq Q$, there exists a separating direction $a^\star\in S^{d-1}$ and a neighborhood $U$ of $a^\star$ such that
\[
\inf_{a\in U}\lim_{n\to\infty}\Pr\big(T_{a,n} \ge u_n(\alpha)\big)=1
\]
(intuitively: near a truly separating direction, the 1D statistic eventually exceeds the global null threshold with probability tending to one).
(i) Under $H_0:P=Q$, assumption (a) implies that no separating direction exists at the population level, and the calibration of $u_n(\alpha)$ ensures $\Pr(M_n \ge u_n(\alpha)) \le \alpha$ for all $n$, hence $\limsup_{n\to\infty}\Pr(\Psi_n=1)\le \alpha$.
(ii) Suppose $P\neq Q$. Assumption (a) guarantees that there exists at least one separating direction $a^\star$ with $P_{a^\star}\neq Q_{a^\star}$, and assumption (c) provides a neighborhood $U$ of $a^\star$ in which the projection statistics exceed the global null threshold with probability tending to one. By assumption (b), for all large $n$ the set $A_n$ contains at least one direction $a_n\in U$ (dense coverage). Therefore,
\[
\Pr(\Psi_n=1) = \Pr\big(M_n \ge u_n(\alpha)\big) \ge \Pr\big( T_{a_n,n} \ge u_n(\alpha) \big) \longrightarrow 1,
\]
which proves consistency.
Proof. For each case, consider the function $g(a)$ on $S^{D-1}$ defined by the quantity of interest (CF, CDF, or moment) at a fixed $t$ or $k$. Since $f \in H^\alpha(R^D)$, the mapping $a \mapsto g(a)$ is in $H^\alpha(S^{D-1})$ for each fixed $t$ or $k$. Given $M$ samples $\{a_i\}_{i=1}^M$ on the sphere, the best possible reconstruction of $g$ from its values at these points is given by spherical interpolation. By classical results on Sobolev spaces and spherical harmonics (see, e.g., [narcowich2006localized]), the $L^2$ interpolation error for functions in $H^\alpha(S^{D-1})$ using $M$ points is bounded by
\[
E_b \left[ |g(b) - g^*(b)|^2 \right] \leq C(D, \alpha)\, M^{-2\alpha/(D-1)} \| g \|_{H^\alpha(S^{D-1})}^2,
\]
where $g^*$ is the interpolant matching $g$ at the $M$ sampled points. The interpolation error bound on the sphere follows from the theory of spherical harmonics and Marcinkiewicz--Zygmund (MZ) inequalities. Any $f \in H^\alpha(S^d)$ admits a spherical harmonics expansion, and the best $L^2$ approximation by harmonics of degree at most $L$ satisfies
\[
\|f - P_L f\|_{L^2(S^d)} \leq (1 + L^2)^{-\alpha/2} \|f\|_{H^\alpha(S^d)},
\]
where $P_L f$ is the projection onto harmonics of degree $\leq L$ \cite[Lemma~2.1]{narcowich2006localized}. If $M$ points are distributed quasi-uniformly on $S^d$, then for $L \sim c M^{1/d}$ the set forms an MZ set for degree $L$ \cite[Theorem 1.1]{mhaskar2001spherical}. This allows reconstruction of any function in the space of harmonics of degree at most $L$ from its values at these points, and the $L^2$ interpolation error for $f$ is bounded by
\[
\|f - I_M f\|_{L^2(S^d)} \leq C (1 + L^2)^{-\alpha/2} \|f\|_{H^\alpha(S^d)},
\]
where $I_M f$ is any interpolant matching $f$ at the $M$ points \cite[Theorem 3.1]{narcowich2006localized}.
Substituting $L \sim c M^{1/d}$ yields the rate $M^{-\alpha/d}$, and thus
\[
E_{\omega} |f(\omega) - I_M f(\omega)|^2 \leq C(d, \alpha)\, M^{-2\alpha/d} \|f\|_{H^\alpha(S^d)}^2,
\]
with explicit $C(d, \alpha)$ as in the main theorem. Integrating (or summing) over $t$ (for the CF and CDF) or $k$ (for moments, with weights $w_k$) yields the stated bounds. The explicit constant $C(D, \alpha)$ arises from the theory of spherical Sobolev spaces and is given above. For the moment case, the sum over $k$ is weighted to ensure convergence, as higher moments may grow rapidly; the weights can be chosen, for example, as $w_k = 1/k!$. This completes the proof.
Proof. (A) ECF sample-gradients are uniformly bounded and Lipschitz. Fix the Gaussian weight
$$w_s(t) = e^{-s^2 t^2}, \qquad s > 0,$$
and define the population CF distance
$$D(P, G) = \int_{\mathbb{R}} w_s(t)\, \big|\varphi_P(t) - \varphi_G(t)\big|^2\, dt.$$
Let the empirical CF be
$$\varphi_N(t) = \frac{1}{N}\sum_{i=1}^N e^{itX_i},$$
and consider the V-statistic estimator
$$D_V = \int_{\mathbb{R}} w_s(t)\, \big|\varphi_N(t) - \varphi_G(t)\big|^2\, dt.$$
We use only that $|e^{itX}| = 1$, $|\varphi_P(t)| \le 1$, $|\varphi_G(t)| \le 1$, and the integrability of $w_s$. For each $i$, differentiate under the integral sign (dominated convergence applies because the integrand and its derivative are bounded):
$$\frac{\partial D_V}{\partial X_i} = \int_{\mathbb{R}} w_s(t)\, 2\,\Re\!\Big(\overline{\varphi_N(t) - \varphi_G(t)}\; \frac{\partial \varphi_N(t)}{\partial X_i}\Big)\, dt, \qquad \frac{\partial \varphi_N(t)}{\partial X_i} = \frac{1}{N}\, i t\, e^{itX_i}.$$
Since $|\varphi_N(t)| \le 1$ and $|\varphi_G(t)| \le 1$,
$$\left|\frac{\partial D_V}{\partial X_i}\right| \le \frac{2}{N}\int w_s(t)\, |t|\, \big(|\varphi_N(t)| + |\varphi_G(t)|\big)\, dt \le \frac{4}{N}\int w_s(t)\, |t|\, dt = \frac{4}{N s^2},$$
using $\int_{\mathbb{R}} e^{-s^2 t^2}\, |t|\, dt = 1/s^2$. Moreover, differentiating once more in $X_i$ and using $|\varphi_N(t)| \le 1$, $|\varphi_G(t)| \le 1$ gives the global Lipschitz bound
$$\left|\frac{\partial^2 D_V}{\partial X_i^2}\right| \le \frac{C}{N}\int_{\mathbb{R}} w_s(t)\, t^2\, dt = \frac{C}{N}\cdot\frac{\sqrt{\pi}}{2 s^3},$$
for some absolute constant $C$ arising from the bounded factors and the product rule. Hence ECF gradients are uniformly bounded and Lipschitz, with scale controlled only by $(N, s)$.
(B) Moment sample-gradients are polynomial in $X_i$ and unbounded for $k \ge 2$. Define the moment objective
$$D_k = (\bar{\phi} - \mu)^\top W (\bar{\phi} - \mu), \qquad \bar{\phi} := \frac{1}{N}\sum_{i=1}^N \phi(X_i), \quad \phi(x) = (x, x^2, \dots, x^k)^\top,$$
for a symmetric positive semidefinite $W \in \mathbb{R}^{k \times k}$ and Gaussian target moments $\mu = \mathbb{E}_G[\phi(Y)]$.
For each $i$, the chain rule and the linearity of the empirical mean give
$$\frac{\partial D_k}{\partial X_i} = \frac{2}{N}\, (\bar{\phi} - \mu)^\top W\, \frac{\partial \phi(X_i)}{\partial X_i}, \qquad \frac{\partial \phi(X)}{\partial X} = \big(1, 2X, 3X^2, \dots, k X^{k-1}\big)^\top.$$
Let $c := W(\bar{\phi} - \mu)$ and write $c_r$ for its $r$-th coordinate. Then
$$\frac{\partial D_k}{\partial X_i} = \frac{2}{N}\sum_{r=1}^k c_r\, r\, X_i^{r-1},$$
which is a polynomial in $X_i$ of degree $\max\{r - 1 : c_r \neq 0\} \le k - 1$. The expression is a nonconstant polynomial in $X_i$ whenever some $c_r \neq 0$ with $r \ge 2$, so the gradient cannot be uniformly bounded on $\mathbb{R}$. In particular, if $c_k \neq 0$ (the generic case, when the top-weighted deviation is nonzero), the leading term dominates and
$$\left|\frac{\partial D_k}{\partial X_i}\right| \xrightarrow[|X_i|\to\infty]{} \infty \quad\text{at rate}\quad |X_i|^{k-1},$$
proving unboundedness for $k \ge 2$.
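The contrast between the two regimes above is easy to check numerically. The sketch below (not the paper's implementation; the quadrature grid, outlier value, and $k=4$ penalty are illustrative choices) evaluates the ECF gradient from its closed-form derivative and the gradient of a fourth-moment penalty, with one extreme outlier planted in the sample: the former respects the uniform $4/(Ns^2)$ bound, the latter explodes.

```python
import numpy as np

def ecf_grad(x, j, s=1.0, T=6.0, K=20001):
    """Gradient of D_V = ∫ w_s(t)|φ_N(t) - φ_G(t)|² dt w.r.t. x[j],
    via the closed form ∂φ_N/∂x_j = (i t / N) e^{i t x_j} and a Riemann sum."""
    t = np.linspace(-T, T, K)
    dt = t[1] - t[0]
    w = np.exp(-(s * t) ** 2)                           # Gaussian weight w_s(t)
    phi_N = np.exp(1j * np.outer(t, x)).mean(axis=1)    # empirical CF
    phi_G = np.exp(-t ** 2 / 2)                         # standard normal CF
    dphi = (1j * t / len(x)) * np.exp(1j * t * x[j])
    integrand = w * 2 * np.real(np.conj(phi_N - phi_G) * dphi)
    return integrand.sum() * dt

def moment_grad(x, j, k=4, m_k=3.0):
    """Gradient of D_k = ((1/N) Σ x_i^k - m_k)² w.r.t. x[j]: polynomial in x[j]."""
    N = len(x)
    return 2 * (np.mean(x ** k) - m_k) * k * x[j] ** (k - 1) / N

rng = np.random.default_rng(0)
N, s = 20, 1.0
x = rng.standard_normal(N)
x[0] = 50.0                                             # plant one extreme outlier

g_ecf = ecf_grad(x, 0, s=s)
g_mom = moment_grad(x, 0)
print(abs(g_ecf), 4 / (N * s ** 2), abs(g_mom))
# ECF gradient stays under the uniform 4/(N s²) bound; the moment gradient explodes.
```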
Proof. Step 1 (Spherical harmonic expansion). Any function $g_k : S^{D-1} \to \mathbb{R}$ can be expanded in spherical harmonics:
$$g_k(a) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(a),$$
where the $Y_{\ell m}$ are the orthonormal spherical harmonics and
$$c_{\ell m}^{(k)} = \int_{S^{D-1}} g_k(a)\, Y_{\ell m}(a)\, d\sigma(a).$$
For the target Gaussian moments $m_k$, the ideal function is the constant $g_k(a) = m_k$, which corresponds to
$$c_{00}^{(k)} = m_k\, |S^{D-1}|, \qquad c_{\ell m}^{(k)} = 0 \quad \text{for all } \ell > 0.$$
Step 2 (Constraint analysis). The constraints $\mathbb{E}[\langle x, a_i \rangle^k] = m_k$ translate to
$$\sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(a_i) = m_k \quad \text{for } i = 1, \ldots, N.$$
Let $c = (c_{\ell m}^{(k)})$ be the vector of all coefficients and $Y_i = (Y_{\ell m}(a_i))$ the vector of harmonic evaluations, so the constraints read $Y_i^\top c = m_k$ for $i = 1, \ldots, N$.
Step 3 (Truncation at level $L$). Truncate the expansion at degree $L$:
$$g_k^{(L)}(a) = \sum_{\ell=0}^{L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(a).$$
The number of coefficients up to degree $L$ is
$$M_L = \sum_{\ell=0}^{L} \dim(H_\ell) = \sum_{\ell=0}^{L} \left[\binom{\ell + D - 2}{\ell} + \binom{\ell + D - 3}{\ell - 1}\right] \asymp L^{D-1}.$$
Step 4 (Decomposition of the error). For any $b \in S^{D-1}$, since $c_{00}^{(k)} Y_{00}(b) = m_k$,
$$|g_k(b) - m_k| = \left|\sum_{\ell=1}^{\infty} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(b)\right| \leq \underbrace{\left|\sum_{\ell=1}^{L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(b)\right|}_{\text{low-frequency error}} + \underbrace{\left|\sum_{\ell>L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(b)\right|}_{\text{high-frequency error}}.$$
Step 5 (Bounding the high-frequency error: bias). By the Cauchy--Schwarz inequality and Parseval's identity,
$$\left|\sum_{\ell>L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(b)\right|^2 \leq \left(\sum_{\ell>L} \sum_{m=-\ell}^{\ell} |c_{\ell m}^{(k)}|^2\right) \left(\sum_{\ell>L} \sum_{m=-\ell}^{\ell} |Y_{\ell m}(b)|^2\right).$$
From the smoothness assumption with parameter $\alpha$,
$$\sum_{\ell>L} \sum_{m=-\ell}^{\ell} |c_{\ell m}^{(k)}|^2 \leq C_{\alpha}\, |\mathbb{E}[\|x\|^k]|^2 \sum_{\ell>L} (1 + \ell)^{-2\alpha} \lesssim C_{\alpha}\, |\mathbb{E}[\|x\|^k]|^2\, L^{-2\alpha+1},$$
and combining with the harmonic-evaluation term we obtain
$$\left|\sum_{\ell>L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(b)\right| \lesssim C_{\alpha}\, |\mathbb{E}[\|x\|^k]|\, L^{-\alpha}.$$
Step 6 (Bounding the low-frequency error: variance). The low-frequency coefficients $\{c_{\ell m}^{(k)} : \ell \leq L\}$ are constrained by our $N$ linear equations; in the worst case, the unconstrained degrees of freedom allow for maximum deviation. Let $c_L$ be the vector of coefficients up to degree $L$ and $A \in \mathbb{R}^{N \times M_L}$ the matrix with entries $A_{i,(\ell,m)} = Y_{\ell m}(a_i)$, so the constraints become $A c_L = m_k \mathbf{1}_N$. The unconstrained subspace has dimension $\dim(\mathrm{null}(A)) = M_L - \mathrm{rank}(A)$. For generic points $\{a_i\}$ and $M_L \geq N$, we have $\mathrm{rank}(A) = N$ with high probability, so the null space has dimension $M_L - N \asymp L^{D-1} - N$. Using concentration of measure for spherical harmonics, the maximum deviation in the null space satisfies
$$\max_{v \in \mathrm{null}(A),\, \|v\|=1} |v^\top Y(b)| \lesssim \sqrt{\frac{M_L \log(1/\delta)}{N}} \lesssim \sqrt{\frac{L^{D-1} \log(1/\delta)}{N}}$$
with probability at least $1 - \delta$.
Step 7 (Combining the bounds). Combining Steps 5 and 6:
$$|\mathbb{E}[\langle x, b \rangle^k] - m_k| \leq C\, |\mathbb{E}[\|x\|^k]| \left( L^{-\alpha} + \sqrt{\frac{L^{D-1} \log(1/\delta)}{N}} \right).$$
Step 8 (Optimal choice of $L$: minimax analysis). To minimize the bound, we optimize over $L$:
$$\min_L \left[ L^{-\alpha} + \sqrt{\frac{L^{D-1}}{N}} \right].$$
Setting the derivative to zero,
$$\frac{d}{dL}\left[L^{-\alpha} + \frac{L^{(D-1)/2}}{\sqrt{N}}\right] = -\alpha L^{-\alpha-1} + \frac{D-1}{2}\,\frac{L^{(D-3)/2}}{\sqrt{N}} = 0 \;\Longleftrightarrow\; L^{-\alpha-(D-1)/2} = \frac{D-1}{2\alpha\sqrt{N}},$$
therefore
$$L^* = \left(\frac{2\alpha\sqrt{N}}{D-1}\right)^{1/(\alpha + (D-1)/2)} \asymp N^{1/(2\alpha + D - 1)}.$$
Substituting $L^*$ back into the bound,
$$L^{-\alpha} + \sqrt{\frac{L^{D-1}}{N}} \asymp N^{-\alpha/(2\alpha + D - 1)} + N^{-\alpha/(2\alpha + D - 1)} \asymp N^{-\alpha/(2\alpha + D - 1)}.$$
This gives the final optimal rate:
$$|\mathbb{E}[\langle x, b \rangle^k] - m_k| \leq C\, |\mathbb{E}[\|x\|^k]| \cdot N^{-\alpha/(2\alpha + D - 1)}\, \log^{1/2}(1/\delta).$$
This completes the proof.
Proof. Step 1 (Definition and properties of spherical designs). A finite set $\{a_1, \ldots, a_N\} \subset S^{D-1}$ is called a spherical $L$-design if
$$\frac{1}{N} \sum_{i=1}^N f(a_i) = \frac{1}{|S^{D-1}|} \int_{S^{D-1}} f(a)\, d\sigma(a)$$
for all polynomials $f$ of degree at most $L$. This condition is equivalent to exact integration of all spherical harmonics up to degree $L$:
$$\frac{1}{N} \sum_{i=1}^N Y_{\ell m}(a_i) = \begin{cases} \dfrac{1}{|S^{D-1}|} & \text{if } \ell = 0,\, m = 0, \\ 0 & \text{if } 1 \leq \ell \leq L. \end{cases}$$
Step 2 (Existence and lower bounds). By the theory of spherical designs (Delsarte, Goethals, and Seidel), the minimum number of points in a spherical $L$-design satisfies
$$N \geq \binom{L + D - 1}{D - 1} + \binom{L + D - 2}{D - 1} \asymp L^{D-1}.$$
Therefore, given $N$ points, the maximum achievable design parameter satisfies $L \leq C_D N^{1/(D-1)}$ for some constant $C_D > 0$ depending only on the dimension.
Step 3 (Spherical harmonic expansion). As in the random sampling case, expand
$$g_k(a) = \sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(a).$$
For the target Gaussian moments, the ideal function is the constant $g_k(a) = m_k$, corresponding to $c_{00}^{(k)} = m_k |S^{D-1}|$ and $c_{\ell m}^{(k)} = 0$ for all $\ell > 0$.
Step 4 (Constraint analysis for spherical designs). The constraints $\mathbb{E}[\langle x, a_i \rangle^k] = m_k$ become
$$\sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(a_i) = m_k \quad \text{for } i = 1, \ldots, N.$$
Averaging over $i$ and using the spherical design property,
$$\sum_{\ell=0}^{\infty} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} \left(\frac{1}{N} \sum_{i=1}^N Y_{\ell m}(a_i)\right) = m_k,$$
which, since the harmonics with $1 \leq \ell \leq L$ average to zero over the design, simplifies to
$$\frac{c_{00}^{(k)}}{|S^{D-1}|} + \sum_{\ell>L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} \left(\frac{1}{N} \sum_{i=1}^N Y_{\ell m}(a_i)\right) = m_k.$$
Step 5 (Key insight: elimination of low-frequency terms). The spherical design property implies that the constraints provide no information about the coefficients $c_{\ell m}^{(k)}$ for $1 \leq \ell \leq L$. Since we want $c_{\ell m}^{(k)} = 0$ for $\ell \geq 1$ (the Gaussian case) but the constraints do not determine these coefficients for $\ell \leq L$, the worst-case scenario allows them to be arbitrary.
Step 6 (Error decomposition). For any $b \in S^{D-1}$,
$$|g_k(b) - m_k| \leq \underbrace{\left|\sum_{\ell=1}^{L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(b)\right|}_{\text{unconstrained low frequencies}} + \underbrace{\left|\sum_{\ell>L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(b)\right|}_{\text{high frequencies (bias)}}.$$
Step 7 (Bounding the high-frequency term). By the same analysis as in the random sampling case,
$$\left|\sum_{\ell>L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(b)\right| \leq C_{\alpha}\, |\mathbb{E}[\|x\|^k]| \cdot L^{-\alpha}.$$
Step 8 (Bounding the low-frequency term). The key difference from random sampling is that we must bound the worst-case contribution of the unconstrained coefficients $\{c_{\ell m}^{(k)} : 1 \leq \ell \leq L\}$. These coefficients are still subject to the smoothness constraint
$$\sum_{\ell=1}^{L} \sum_{m=-\ell}^{\ell} (1 + \ell)^{2\alpha} |c_{\ell m}^{(k)}|^2 \leq C_{\alpha}\, |\mathbb{E}[\|x\|^k]|^2.$$
By Cauchy--Schwarz,
$$\left|\sum_{\ell=1}^{L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(b)\right| \leq \left(\sum_{\ell=1}^{L} \sum_{m=-\ell}^{\ell} |c_{\ell m}^{(k)}|^2\right)^{1/2} \left(\sum_{\ell=1}^{L} \sum_{m=-\ell}^{\ell} |Y_{\ell m}(b)|^2\right)^{1/2}.$$
Since $\sum_{\ell=1}^{L} \sum_{m=-\ell}^{\ell} |Y_{\ell m}(b)|^2 \leq \sum_{\ell=1}^{L} \dim(H_\ell) \lesssim L^{D-1}$ and, from the smoothness bound,
$$\sum_{\ell=1}^{L} \sum_{m=-\ell}^{\ell} |c_{\ell m}^{(k)}|^2 \leq C_{\alpha}\, |\mathbb{E}[\|x\|^k]|^2 \sum_{\ell=1}^{L} (1 + \ell)^{-2\alpha} \lesssim C_{\alpha}\, |\mathbb{E}[\|x\|^k]|^2\, L^{-2\alpha+1},$$
we obtain
$$\left|\sum_{\ell=1}^{L} \sum_{m=-\ell}^{\ell} c_{\ell m}^{(k)} Y_{\ell m}(b)\right| \lesssim C_{\alpha}\, |\mathbb{E}[\|x\|^k]| \cdot L^{-\alpha+1/2} \cdot L^{(D-1)/2} = C_{\alpha}\, |\mathbb{E}[\|x\|^k]| \cdot L^{-\alpha + D/2}.$$
For $\alpha > D/2$, this term is dominated by the high-frequency term $L^{-\alpha}$.
Step 9 (Final bound). Combining both terms and using $L \geq c_D N^{1/(D-1)}$:
$$|g_k(b) - m_k| \leq C\, |\mathbb{E}[\|x\|^k]| \cdot L^{-\alpha} \leq C'\, |\mathbb{E}[\|x\|^k]| \cdot N^{-\alpha/(D-1)}.$$
Step 10 (Rate comparison). Comparing with the random sampling rate $N^{-\alpha/(2\alpha + D - 1)}$:
$$\frac{\text{random sampling exponent}}{\text{spherical design exponent}} = \frac{\alpha/(2\alpha + D - 1)}{\alpha/(D - 1)} = \frac{D - 1}{2\alpha + D - 1} = \frac{1}{1 + \frac{2\alpha}{D-1}} < 1.$$
Therefore, the spherical design rate is faster by a factor of
$$\frac{2\alpha + D - 1}{D - 1} = 1 + \frac{2\alpha}{D - 1}$$
in the exponent. This completes the proof.
Proof. Fix $t \in \mathbb{R}^d$ and abbreviate $Z_j := e^{i t^\top X_j}$, so that $\phi_n(t) = \frac{1}{n}\sum_{j=1}^n Z_j$. Note that $|Z_j| = 1$ almost surely (since $t^\top X_j \in \mathbb{R}$) and $\mathbb{E}[Z_j] = \phi_\theta(t)$ for all $j$. We start from the algebraic identity
$$\big|\phi_n(t) - \psi(t)\big|^2 = |\phi_n(t)|^2 - \overline{\psi(t)}\,\phi_n(t) - \psi(t)\,\overline{\phi_n(t)} + |\psi(t)|^2.$$
Taking expectations term by term (all functions are evaluated at $t$; the dependence on $t$ is suppressed for readability) and using the unbiasedness of the empirical CF, $\mathbb{E}[\phi_n] = \frac{1}{n}\sum_{j=1}^n \mathbb{E}[Z_j] = \phi_\theta$,
$$\mathbb{E}\big[|\phi_n - \psi|^2\big] = \mathbb{E}[|\phi_n|^2] - \overline{\psi}\,\phi_\theta - \psi\,\overline{\phi_\theta} + |\psi|^2 = \mathbb{E}[|\phi_n|^2] - 2\Re\big(\overline{\psi}\,\phi_\theta\big) + |\psi|^2.$$
For the first term, expand the squared modulus of the empirical mean:
$$\mathbb{E}[|\phi_n|^2] = \mathbb{E}\left[\left|\frac{1}{n}\sum_{j=1}^n Z_j\right|^2\right] = \frac{1}{n^2}\sum_{j=1}^n \sum_{l=1}^n \mathbb{E}\big[Z_j \overline{Z_l}\big].$$
Since the $Z_j$ are i.i.d.,
$$\mathbb{E}\big[Z_j \overline{Z_l}\big] = \begin{cases} \mathbb{E}[|Z_1|^2] = 1 & \text{if } j = l, \\ \mathbb{E}[Z_j]\,\overline{\mathbb{E}[Z_l]} = |\phi_\theta|^2 & \text{if } j \neq l, \end{cases}$$
hence
$$\mathbb{E}[|\phi_n|^2] = \frac{1}{n^2}\Big(n + n(n-1)|\phi_\theta|^2\Big) = |\phi_\theta|^2 + \frac{1 - |\phi_\theta|^2}{n}.$$
Plugging this back,
$$\mathbb{E}\big[|\phi_n - \psi|^2\big] = \Big(|\phi_\theta|^2 - 2\Re\big(\overline{\psi}\,\phi_\theta\big) + |\psi|^2\Big) + \frac{1 - |\phi_\theta|^2}{n} = \big|\phi_\theta - \psi\big|^2 + \frac{1 - |\phi_\theta|^2}{n}.$$
Equivalently, $\mathrm{Var}\big(\phi_n(t)\big) = \frac{1}{n}\big(1 - |\phi_\theta(t)|^2\big)$, so integrating against the weight $w_s$,
$$\mathbb{E}[D_V] = \int w_s(t)\, \big|\phi_\theta(t) - \psi(t)\big|^2\, dt + \frac{1}{n}\int w_s(t)\, \big(1 - |\phi_\theta(t)|^2\big)\, dt,$$
which is the claimed identity; the $O(n^{-1})$ rate follows from the finiteness of $\int w_s(t)\, dt = \sqrt{\pi}/s$. By dominated convergence, $\mathbb{E}[\nabla_\theta D_n(t)] = \nabla_\theta\, \mathbb{E}[D_n(t)]$, hence
$$\mathbb{E}\left[\nabla_\theta D_n(t)\right] = \nabla_\theta \big|\phi_\theta(t) - \psi(t)\big|^2 + \frac{1}{n}\,\nabla_\theta\big(1 - |\phi_\theta(t)|^2\big),$$
concluding the proof. In practice, one replaces $\int_{\mathbb{R}} w(t)(\cdot)\, dt$ by a deterministic quadrature on a uniform grid $t_k \in [-T, T]$ with weights $\omega_k$ (e.g., the trapezoidal rule) and a Gaussian window $w(t) = e^{-\alpha t^2}$.
All statements above remain valid with the integral replaced by $\sum_k \omega_k (\cdot)$:
$$L(\theta) \approx \sum_{k} \omega_k\, \big|\phi_\theta(t_k) - \psi(t_k)\big|^2, \qquad L_n(\theta) \approx \sum_{k} \omega_k\, \big|\phi_n(t_k) - \psi(t_k)\big|^2,$$
and the bias term becomes
$$\mathrm{Bias}(\theta) = -\frac{1}{n}\sum_k \omega_k\, \nabla_\theta \big|\phi_\theta(t_k)\big|^2.$$
Since the grid and weights are deterministic, they do not affect unbiasedness with respect to sampling; they only introduce a deterministic approximation error to the target functional $L(\theta)$.
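The bias identity above is easy to verify by Monte Carlo. In the sketch below (illustrative grid resolution, sample size, and repetition count; not the paper's code), we take $P = G = \mathcal{N}(0,1)$ so the population distance vanishes and the mean of the V-statistic should match the $\frac{1}{n}\int w_s(t)(1-|\phi_\theta(t)|^2)\,dt$ bias term, which for $w_s(t)=e^{-s^2t^2}$ equals $\frac{1}{n}\big(\sqrt{\pi}/s - \sqrt{\pi}/\sqrt{s^2+1}\big)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, reps = 100, 1.0, 300
t = np.linspace(-6, 6, 1201)
dt = t[1] - t[0]
w = np.exp(-(s * t) ** 2)          # Gaussian weight w_s(t)
phi_G = np.exp(-t ** 2 / 2)        # CF of the standard normal target

def D_V(x):
    """V-statistic ∫ w_s(t)|φ_n(t) - φ_G(t)|² dt, Riemann-sum approximation."""
    phi_n = np.exp(1j * np.outer(t, x)).mean(axis=1)
    return (w * np.abs(phi_n - phi_G) ** 2).sum() * dt

# Monte Carlo estimate of E[D_V] when P = G (population distance is zero)
mc = np.mean([D_V(rng.standard_normal(n)) for _ in range(reps)])
theory = (np.sqrt(np.pi) / s - np.sqrt(np.pi) / np.sqrt(s ** 2 + 1)) / n
print(mc, theory)   # both ≈ 0.005: the expectation is entirely the 1/n bias term
```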
Proof. We prove this result in two parts.
Part I: $\mathbb{E}[X] = 0$. Given that $\mathbb{E}[\langle X, a \rangle] = 0$ for all unit vectors $a$, and noting that $\langle X, a \rangle = a^\top X$, linearity of expectation gives
$$a^\top \mathbb{E}[X] = 0 \quad \text{for all unit vectors } a.$$
Let $\mu = \mathbb{E}[X]$. We claim that $\mu = 0$. Suppose, for the sake of contradiction, that $\mu \neq 0$, so $\|\mu\|_2 > 0$, and define the unit vector $a^* = \mu / \|\mu\|_2$. Since $a^*$ is a unit vector, the condition above implies $(a^*)^\top \mu = 0$. However, substituting the definition of $a^*$,
$$(a^*)^\top \mu = \left(\frac{\mu}{\|\mu\|_2}\right)^\top \mu = \frac{\mu^\top \mu}{\|\mu\|_2} = \frac{\|\mu\|_2^2}{\|\mu\|_2} = \|\mu\|_2 > 0.$$
This contradiction establishes that $\mu = 0$.
Part II: $\mathrm{Cov}(X) = I_d$. Since $\mathbb{E}[X] = 0$, we have
$$\mathrm{Var}(\langle X, a \rangle) = \mathbb{E}[(a^\top X)^2] = \mathbb{E}[a^\top X X^\top a] = a^\top \mathbb{E}[X X^\top] a = a^\top \Sigma a,$$
where $\Sigma = \mathrm{Cov}(X) = \mathbb{E}[X X^\top]$. The variance condition gives
$$a^\top \Sigma a = 1 \quad \text{for all unit vectors } a.$$
We now show that $\Sigma = I_d$.
Step 1 (Diagonal entries). For $i \in \{1, \ldots, d\}$, let $e_i$ denote the $i$-th standard basis vector. Setting $a = e_i$ yields $\Sigma_{ii} = e_i^\top \Sigma e_i = 1$, so all diagonal entries of $\Sigma$ equal $1$.
Step 2 (Off-diagonal entries). For distinct indices $i, j \in \{1, \ldots, d\}$, consider the unit vector
$$a = \frac{e_i + e_j}{\|e_i + e_j\|_2} = \frac{e_i + e_j}{\sqrt{2}}.$$
Applying the quadratic-form condition,
$$a^\top \Sigma a = \frac{1}{2}(e_i + e_j)^\top \Sigma (e_i + e_j) = 1,$$
and expanding with the symmetry of $\Sigma$:
$$\frac{1}{2}\big(\Sigma_{ii} + 2\Sigma_{ij} + \Sigma_{jj}\big) = \frac{1}{2}\big(1 + 2\Sigma_{ij} + 1\big) = 1 \;\Longrightarrow\; \Sigma_{ij} = 0.$$
Therefore, all off-diagonal entries of $\Sigma$ equal zero, establishing that $\Sigma = I_d$.
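The two steps above are constructive: the directional variances at $e_i$ and $(e_i+e_j)/\sqrt{2}$ determine every entry of $\Sigma$. A small numerical sketch (NumPy; the matrix $\Sigma$ below is a hypothetical example) recovers an arbitrary symmetric matrix from exactly these directional quadratic forms, and in particular yields $I_d$ when every directional variance equals $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
Sigma = A @ A.T                      # a hypothetical symmetric PSD "covariance"

def dir_var(a):
    """Directional variance a^T Σ a, i.e., what Var(<X, a>) measures."""
    return a @ Sigma @ a

I = np.eye(d)
rec = np.zeros((d, d))
for i in range(d):
    rec[i, i] = dir_var(I[i])        # Step 1: diagonal entries from a = e_i
for i in range(d):
    for j in range(i + 1, d):
        a = (I[i] + I[j]) / np.sqrt(2.0)   # Step 2: a = (e_i + e_j)/√2
        # a^T Σ a = (Σ_ii + 2 Σ_ij + Σ_jj)/2  ⇒  solve for the off-diagonal Σ_ij
        rec[i, j] = rec[j, i] = dir_var(a) - (rec[i, i] + rec[j, j]) / 2

print(np.allclose(rec, Sigma))       # the directional variances determine Σ
```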
Algorithm 1 (Quasi-Monte Carlo Sampling on the $d$-Dimensional Unit Hypersphere): \label{alg:qmc_hypersphere}
\begin{algorithmic}[1]
\Require Number of points $N$, dimension $d$
\Ensure Points $\{\mathbf{y}_i\}_{i=1}^N$ quasi-uniformly distributed on $\mathbb{S}^{d-1}$
\For{$i = 1$ to $N$}
\State Generate $\mathbf{x}_i \in [0,1]^d$ as the $i$-th point of a Sobol sequence
\State Transform each component: $z_{i,j} = \Phi^{-1}(x_{i,j})$ for $j = 1, \ldots, d$ \Comment{$\Phi^{-1}$ is the inverse CDF of the standard normal}
\State Normalize: $\mathbf{y}_i = \mathbf{z}_i / \|\mathbf{z}_i\|_2$
\EndFor
\end{algorithmic}
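A possible NumPy/SciPy realization of the algorithm above (a sketch, not the paper's code): `scipy.stats.qmc.Sobol` generates the low-discrepancy points and `scipy.special.ndtri` applies the inverse normal CDF $\Phi^{-1}$; scrambling, an added choice here, keeps coordinates strictly inside $(0,1)$ so the inverse CDF stays finite.

```python
import numpy as np
from scipy.special import ndtri          # inverse CDF of the standard normal
from scipy.stats.qmc import Sobol

def qmc_sphere(n, d, seed=0):
    """Quasi-uniform points on S^{d-1}: Sobol -> Gaussianize -> normalize."""
    x = Sobol(d=d, scramble=True, seed=seed).random(n)  # points in (0,1)^d
    z = ndtri(x)                          # componentwise inverse normal CDF
    return z / np.linalg.norm(z, axis=1, keepdims=True)

y = qmc_sphere(256, 8)                    # 256 unit vectors on S^7
print(y.shape)
```

Powers of two for `n` are the natural choice for Sobol sequences (they preserve balance properties of the point set).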
Ideally, we would like to compare the distributions themselves. A slight variation is to compare the characteristic functions of the distributions. Given samples $x_1, \ldots, x_N$, the Empirical Characteristic Function (ECF) is defined as
$$\hat{\varphi}_N(t) = \frac{1}{N}\sum_{n=1}^{N} e^{i\langle t, x_n\rangle}.$$
We can now compare our ECF to that of the target distribution and build the statistic
$$T_N = \int_{\mathbb{R}^d} \big|\hat{\varphi}_N(t) - \varphi(t)\big|^2\, \omega(t)\, dt,$$
where $\varphi$ is the CF of the target distribution and $\omega$ a weighting function.
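In one dimension, this statistic can be sketched numerically with a trapezoidal grid and a Gaussian window (both illustrative choices, not the paper's exact configuration): the value stays near zero for Gaussian samples and is visibly larger for a skewed alternative.

```python
import numpy as np

def ecf_stat(x, T=5.0, K=201):
    """∫ ω(t)|φ̂_N(t) - φ(t)|² dt for a standard-normal target,
    approximated with the trapezoidal rule on [-T, T]."""
    t = np.linspace(-T, T, K)
    w = np.exp(-t ** 2)                                  # Gaussian window ω(t)
    phi_hat = np.exp(1j * np.outer(t, x)).mean(axis=1)   # empirical CF
    phi = np.exp(-t ** 2 / 2)                            # target N(0,1) CF
    integrand = w * np.abs(phi_hat - phi) ** 2
    dt = t[1] - t[0]
    return np.sum((integrand[:-1] + integrand[1:]) / 2) * dt  # trapezoid

rng = np.random.default_rng(0)
s_g = ecf_stat(rng.standard_normal(4000))            # Gaussian samples
s_e = ecf_stat(rng.exponential(size=4000) - 1.0)     # standardized but skewed
print(s_g, s_e)                                      # small vs clearly larger
```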
If the weighting function is given by $\omega(t) = (2\pi\beta^2)^{-d/2}\, e^{-\|t\|_2^2/(2\beta^2)}$, then the following simplification can be made
$$T_{N,\beta} = \frac{1}{N^2}\sum_{j=1}^{N}\sum_{k=1}^{N} e^{-\frac{\beta^2}{2}\|x_j - x_k\|^2} - \frac{2}{N(1+\beta^2)^{d/2}}\sum_{j=1}^{N} e^{-\frac{\beta^2 \|x_j\|^2}{2(1+\beta^2)}} + \frac{1}{(1+2\beta^2)^{d/2}},$$
with $\beta > 0$ (Baringhaus--Henze--Epps--Pulley). Choosing $\beta$ from a kernel-bandwidth rule¹ leads to the HZ test², which uses
$$\beta = \frac{1}{\sqrt{2}}\left(\frac{N(2d+1)}{4}\right)^{1/(d+4)}.$$
The same can be done with the moment generating function³:
$$\widehat{M}_N(t) = \frac{1}{N}\sum_{n=1}^{N} e^{\langle t, x_n\rangle},$$
$$T_N^{\mathrm{MGF}} = \int_{\mathbb{R}^d} \Big(\widehat{M}_N(t) - e^{\|t\|^2/2}\Big)^2 e^{-\beta\|t\|^2}\, dt.$$
There is also a statistic combining both⁴:
$$
$$
and its simplified version
$$
$$
$$
$$
$$
$$
1 https://www.routledge.com/Density-Estimation-for-Statistics-and-Data-Analysis/Silverman/p/book/9780412246203
2 https://www.tandfonline.com/doi/abs/10.1080/03610929008830400
3 https://arxiv.org/pdf/1711.07199
here with $\beta > 2$.
skewness⁶:
$$b_{1,d} = \frac{1}{N^2}\sum_{j=1}^{N}\sum_{k=1}^{N} \big(x_j^\top x_k\big)^3,$$
$$
$$
$$
$$
which should be $0$ for a Gaussian, and the kurtosis, which should be $d(d+2)$:
$$b_{2,d} = \frac{1}{N}\sum_{j=1}^{N} \|x_j\|^4.$$
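These two classical quantities (Mardia's multivariate skewness and kurtosis, here assuming the data is already standardized to zero mean and identity covariance) are straightforward to estimate; for Gaussian samples the kurtosis estimate concentrates around $d(d+2)$. A minimal sketch:

```python
import numpy as np

def mardia(x):
    """Mardia's skewness b1 = mean((x_j·x_k)³) and kurtosis b2 = mean(‖x‖⁴),
    assuming x (shape n × d) is standardized: zero mean, identity covariance."""
    G = x @ x.T                                  # Gram matrix of inner products
    b1 = np.mean(G ** 3)                         # ≈ 0 under Gaussianity
    b2 = np.mean(np.sum(x ** 2, axis=1) ** 2)    # ≈ d(d+2) under Gaussianity
    return b1, b2

rng = np.random.default_rng(0)
n, d = 2000, 3
b1, b2 = mardia(rng.standard_normal((n, d)))
print(b1, b2, d * (d + 2))                       # b2 should be close to 15
```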
Figure 16. Reprise of Figure 6 for additional dimensions and number of 1d projections.

Figure 17. Depiction of the distribution of optimized $\beta$ values from OLS when comparing $Z_{\mathrm{iso}}$ and $Z_{\mathrm{aniso}}$ from Lemmas 1 and 2. We clearly observe that the anisotropic version (blue) provides much lower variance compared to the isotropic case (red). We consider a binary classification task with linearly separable classes (top row), a linear regression task (middle row), and a nonlinear regression task with smooth targets (bottom row). For each case, we resample the training samples numerous times and produce an estimate of $\beta$ each time. Because the data is 2-dimensional, we can visualize the $\beta$ distribution directly.
Figure 18. Depiction of accuracy ( top ) and cosine similarity between estimated and true estimator ( bottom ) for the OLS setting with varying strength of Tikhonov regularization ( x-axis) comparing isotropic and anisotropic embeddings. As per thm. 6, the anisotropic distribution creates a bias in the OLS estimation for nonzero regularization.

Figure 20. Proposed trapezoid quadrature for the Epps-Pulley statistic as implemented in Algorithm 1. We depict the approximation error of the integral for various distributions, demonstrating rapid convergence (faster than quadratic, shown as the grey line) across possible embedding distributions.
| Method | Full FT 1-sh | Full FT Full | Frozen 1-sh | Frozen Full |
|---|---|---|---|---|
| LeJEPA (in-domain) ConvNeXt-V2 Nano | 29.42 | 82.72 | 28.74 | 76.52 |
| ResNet-34 | 24.27 | 83.28 | 31.08 | 78.17 |
| Frontier (transfer) DINOv2 ViT-S/16 | 21.05 | 78.34 | 27.68 | 67.62 |
| DINOv3 ViT-S/16 | 24.71 | 81.60 | 30.17 | 71.38 |
| integration | num_slices | bstat_n_points = 5 | bstat_n_points = 17 | bstat_n_points = 41 |
|---|---|---|---|---|
| [-1, 1] | 512 | 71.82 | 72.13 | 72.04 |
| [-1, 1] | 2048 | 72.88 | 72.3 | 72.69 |
| [-3, 3] | 512 | 73.95 | 74.16 | 74.04 |
| [-3, 3] | 2048 | 75.02 | 74.68 | 74.77 |
| [-5, 5] | 512 | 73.71 | 74.21 | 74.15 |
| [-5, 5] | 2048 | 74.5 | 74.8 | 74.77 |
| # views ($V = V_g + V_l$) \ # global views ($V_g$) | 1 | 2 | 4 |
|---|---|---|---|
| 4 | 53.06 | 72.26 | - |
| 6 | 58.65 | 73.07 | 73.68 |
| 8 | 64.46 | 74.24 | 73.94 |
| 10 | 68.97 | 74.06 | 75.08 |
(c) Mini-batch size
| batch_size | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| | 72.2 | 74.15 | 74.72 | 74.07 |
| num_slices \ (emb. dim., proj. dim.) | (1024, 512) | (1024, 2048) | (4096, 512) | (4096, 2048) |
|---|---|---|---|---|
| 64 | 75.29 | 75.32 | 75.5 | 75.65 |
| 128 | 74.77 | 75.09 | 75.26 | 75.47 |
| 256 | 74.56 | 74.66 | 75.08 | 75.02 |
| 512 | 73.94 | 74.11 | 74.81 | 74.65 |
| 1024 | 73.65 | 73.94 | 74.71 | 74.79 |
| num_slices \ reg_tokens | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| 1024 | 75.14 | 75.18 | 75.08 | 75.34 | 75.23 |
| 4096 | 75.61 | 75.58 | 75.67 | 75.63 | 75.84 |
| shots | model | params | pretrain | epochs | DTD | aircr. | cars | cifar10 | cifar100 | flowers102 | food | pets | avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LeJEPA ViT-L | 304M | IN-1K | 100 | 33.21 | 9.37 | 3.40 | 51.65 | 27.01 | 48.53 | 17.14 | 46.11 | 29.55 | |
| LeJEPA ConvNeXtV2-H | 660M | IN-1K | 100 | 32.15 | 8.07 | 4.28 | 50.95 | 31.48 | 48.74 | 17.95 | 58.98 | 31.58 | |
| 1 | I-JEPA ViT-H | 632M | IN-1K | 300 | 27.71 | 9.86 | 4.33 | 56.52 | 30.58 | 44.69 | 14.53 | 53.38 | 30.20 |
| I-JEPA ViT-H + STOP | 632M | IN-1K | 300 | 26.60 | 11.18 | 4.75 | 56.27 | 35.20 | 47.17 | 15.75 | 59.47 | 32.05 | |
| I-JEPA ViT-H (22K) | 632M | IN-22K | 900 | 27.98 | 13.00 | 3.45 | 61.84 | 34.70 | 89.72 | 19.62 | 30.86 | 35.15 | |
| 10 | LeJEPA ViT-L | 304M | IN-1K | 100 | 64.72 | 35.25 | 22.25 | 85.15 | 59.77 | 92.53 | 50.90 | 77.00 | 60.95 |
| LeJEPA ConvNeXtV2-H | 660M | IN-1K | 100 | 61.84 | 30.67 | 24.46 | 85.74 | 63.29 | 91.78 | 49.32 | 78.53 | 60.70 | |
| I-JEPA ViT-H | 632M | IN-1K | 300 | 57.68 | 33.82 | 21.96 | 88.77 | 66.42 | 88.24 | 43.97 | 83.23 | 60.51 | |
| I-JEPA ViT-H + STOP | 632M | IN-1K | 300 | 57.00 | 39.77 | 25.21 | 90.09 | 70.32 | 90.16 | 45.68 | 85.13 | 62.92 | |
| I-JEPA ViT-H (22K) | 632M | IN-22K | 900 | 58.74 | 43.52 | 18.27 | 94.83 | 75.23 | 98.94 | 49.06 | 67.66 | 63.28 | |
| all | LeJEPA ViT-L | 304M | IN-1K | 100 | 78.30 | 57.01 | 57.28 | 96.50 | 83.71 | 91.21 | 82.05 | 89.74 | 79.48 |
| LeJEPA ConvNeXtV2-H | 660M | IN-1K | 100 | 76.60 | 52.99 | 54.88 | 96.15 | 81.34 | 91.11 | 77.64 | 89.76 | 77.56 | |
| I-JEPA ViT-H | 632M | IN-1K | 300 | 73.32 | 56.61 | 54.47 | 97.54 | 86.42 | 86.47 | 81.02 | 92.11 | 78.50 | |
| I-JEPA ViT-H + STOP | 632M | IN-1K | 300 | 73.87 | 61.95 | 61.27 | 98.02 | 87.78 | 88.08 | 81.72 | 92.88 | 80.70 | |
| I-JEPA ViT-H (22K) | 632M | IN-22K | 900 | 75.67 | 65.39 | 49.79 | 98.46 | 89.95 | 98.54 | 81.58 | 87.19 | 80.82 |
| Freeze Backbone | Model Name | All | 1/cls | 2/cls | 5/cls | 10/cls | 100/cls | 1000/cls |
|---|---|---|---|---|---|---|---|---|
| No | LeJEPA (Ours) ConvNeXt-V2 Nano | 82.72 | 29.42 | 36.65 | 50.94 | 59.85 | 75.34 | 81.97 |
| No | LeViT-128 | 79.41 | 18.45 | 24.08 | 33.11 | 41.76 | 64.59 | 77.59 |
| No | ResNet-18 | 82.15 | 23.34 | 31.56 | 43.82 | 54.64 | 73.53 | 81.41 |
| No | ResNet-34 | 83.28 | 24.27 | 31.51 | 44.23 | 53.95 | 74.93 | 82.32 |
| No | Baselines DINOv2 Small | 78.34 | 21.05 | 21.71 | 30.33 | 36.23 | 60.81 | 75.55 |
| No | DINOv3 ViT-S/16 | 81.60 | 24.71 | 29.43 | 37.71 | 44.71 | 69.87 | 80.54 |
| Yes | LeJEPA (Ours) ConvNeXt-V2 Nano | 76.52 | 28.74 | 36.65 | 50.60 | 59.5 | 72.62 | 77.24 |
| Yes | LeViT-128 | 69.00 | 25.85 | 33.30 | 45.52 | 52.43 | 64.37 | 69.39 |
| Yes | ResNet-18 | 75.95 | 30.48 | 38.22 | 50.85 | 58.86 | 72.70 | 76.39 |
| Yes | ResNet-34 | 78.17 | 31.08 | 38.33 | 52.26 | 60.63 | 74.77 | 78.62 |
| Yes | Baselines DINOv2 Small | 67.62 | 27.68 | 32.22 | 40.72 | 47.72 | 62.49 | 67.89 |
| Yes | DINOv3 ViT-S/16 | 71.38 | 30.17 | 36.65 | 45.74 | 51.51 | 65.90 | 71.35 |
| w/ backbone SWA | w/ projector SWA | resnet50 (1-layer) | resnet50 (2-layer) | resnet50 (3-layer) | vit_small_patch8_224 (1-layer) | vit_small_patch8_224 (2-layer) | vit_small_patch8_224 (3-layer) | vit_tiny_patch8_224 (1-layer) | vit_tiny_patch8_224 (2-layer) | vit_tiny_patch8_224 (3-layer) |
|---|---|---|---|---|---|---|---|---|---|---|
| False | 79.71 | 82.44 | 83.93 | 76.59 | 80.77 | 81.07 | 71.79 | 76.87 | 80.37 | |
| False | True | 79.79 | 82.69 | 83.50 | 79.96 | 83.63 | 84.12 | 75.86 | 82.36 | 80.50 |
| False | 79.41 | 82.44 | 83.57 | 77.58 | 79.41 | 81.91 | 67.74 | 77.64 | 80.73 | |
| True | 78.87 | 82.04 | 82.82 | 77.11 | 81.77 | 82.58 | 69.53 | 78.27 | 79.77 |
| Pretraining model (# params) | pretraining data | flowers102 (1020) | cifar100 (50000) | food101 (75750) | inet10 (13000) | cifar10 (50000) | galaxy10 (11008) |
|---|---|---|---|---|---|---|---|
| LeJEPA (convnextv2_nano) 14M | in-domain | 64.34 | 69.26 | 69.59 | 90.81 | 92.22 | 76.05 |
| LeJEPA (resnet18) 11M | in-domain | 74.57 | 69.94 | 73.57 | 92.36 | 92.51 | 75.32 |
| LeJEPA (resnet34) 21M | in-domain | 71.85 | 70.44 | 74.95 | 92.8 | 93.16 | 77.29 |
| LeJEPA (resnext26ts) 8M | in-domain | 82.19 | 69.1 | 76.77 | 92.82 | 91.59 | 73.78 |
| LeJEPA (swin_tiny) 27M | in-domain | 63.94 | 65.08 | 78.4 | 92.87 | 92.67 | 74.89 |
| IJEPA-inet22k (ViT-H/14) 630M | inet1k | 85.76 | 86.93 | 81.06 | 98.65 | 97.77 | 62.93 |
| N | M | # integration points | mean (ms) | std (ms) |
|---|---|---|---|---|
| 512 | 512 | 16 | 0.465236 | 0.011642 |
| 512 | 512 | 64 | 0.461317 | 0.003894 |
| 512 | 512 | 256 | 0.627644 | 0.003337 |
| 2048 | 512 | 16 | 1.40644 | 0.002415 |
| 8192 | 512 | 16 | 6.1883 | 0.007226 |
| 8192 | 8192 | 16 | 8.68501 | 0.038829 |
| 32768 | 512 | 16 | 26.3731 | 0.012732 |
| 512 | 2048 | 16 | 0.465614 | 0.005274 |
| 512 | 8192 | 16 | 0.670379 | 0.006854 |
resnet50:
| # views \ $\lambda$ | 0.001 | 0.005 | 0.01 | 0.02 | 0.025 | 0.050 | 0.100 | 0.150 | 0.200 | 0.300 | 0.400 | 0.500 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 81.41 | 82.73 | 83.49 | 82.99 | 82.23 | - | - | - | - | - | - | - |
| 4 | 79.88 | 83.04 | 84.36 | 84.68 | 84.33 | 83.00 | 82.91 | 81.05 | 78.58 | - | - | - |
| 8 | 76.67 | 81.58 | 83.59 | 83.49 | 83.76 | 84.32 | 83.66 | 83.07 | 82.16 | 81.00 | 79.25 | 77.72 |
LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
Randall Balestriero1,2,* Yann LeCun3,2,*
1 Brown University 2 Meta-FAIR 3 New York University (NYU)
* Equal contribution
Learning manipulable representations of the world and its dynamics is central to AI. Joint-Embedding Predictive Architectures (JEPAs) offer a promising blueprint, but lack of practical guidance and theory has led to ad-hoc R&D. We present a comprehensive theory of JEPAs and instantiate it in LeJEPA, a lean, scalable, and theoretically grounded training objective. First, we identify the isotropic Gaussian as the optimal distribution that JEPAs' embeddings should follow to minimize downstream prediction risk. Second, we introduce a novel objective, Sketched Isotropic Gaussian Regularization (SIGReg), to constrain embeddings to reach that ideal distribution. Combining the JEPA predictive loss with SIGReg yields LeJEPA with numerous theoretical and practical benefits: (i) a single trade-off hyperparameter, (ii) linear time and memory complexity, (iii) stability across hyperparameters, architectures (ResNets, ViTs, ConvNets) and domains, (iv) freedom from heuristics, e.g., no stop-gradient, no teacher-student, no hyperparameter schedulers, and (v) a distributed-training-friendly implementation requiring only approximately 50 lines of code. Our empirical validation covers 10+ datasets and 60+ architectures, all with varying scales and domains. As an example, using ImageNet-1k for pretraining and linear evaluation with a frozen backbone, LeJEPA reaches 79% with a ViT-H/14. We hope that the simplicity and theory-friendly ecosystem offered by LeJEPA will reestablish self-supervised pre-training as a core pillar of AI research (GitHub repo). Figure 1: LeJEPA overview.
Top-left: Training loss exhibits strong correlation with downstream linear probe performance on ImageNet-1k (ViT-Base), providing the first practical loss for model selection without supervised probing. Top-right: Stable training loss without heuristics, even on 1.8B-parameter ViT-g models. Bottom-left: PCA of features from an ImageNet-1k-pretrained LeJEPA ViT-Large demonstrates clear semantic relationships. Bottom-right: Galaxy10 in-domain results showcasing that LeJEPA's in-domain pretraining consistently outperforms transfer learning from state-of-the-art frontier foundation models (DINOv2/v3 trained on natural images) across data regimes from 1-shot to full supervision. This demonstrates that domain-specific SSL beats generic transfer learning, even against massive-scale frontier models, when the framework scales effortlessly to any domain, model, and data scale.
Learning manipulable representations of the world and its dynamics is a long-standing question in AI, with roots dating back centuries (von1867handbuch; tolman1948cognitive; gregory1980perceptions; sutton1991dyna; friston2010free). Across domains, e.g., image recognition, robotics, physics, space exploration, the unifying question is how to learn an organized and actionable high-dimensional embedding space from observations. Using Deep Networks, i.e., parameterized nonlinear operators $f_{\bm{\theta}}$, to map observations to embeddings is a standard first piece of that puzzle (lecun2015deep; goodfellow2016deep). The second, less standardized, piece of that puzzle is how to train $f_{\bm{\theta}}$. Joint-Embedding Predictive Architectures (JEPAs) suggest training $f_{\bm{\theta}}$ by maximizing predictive agreement between the embeddings of semantically related views (bromley1993signature; lecun2022path; balestriero2023cookbook). Views can come in two forms: transformations or corruptions. They can involve masking, cropping, blurring, temporal or spatial translations, geometric or photometric transformations, viewpoint changes, views from different sensor modalities, etc. The supervised forms involve human-produced components such as image-caption pairs, text-code pairs, etc. (tian2020makes). In any case, views are expected to share some degree of semantic relationship to allow the prediction task to align $f_{\bm{\theta}}$'s embeddings towards the underlying knowledge present in the data.
Alas, JEPA's prediction task admits failure modes, such as representation collapse, where $f_{\bm{\theta}}$ maps all inputs to nearly identical embeddings (complete collapse) or to a low-dimensional subspace (dimensional collapse) (jing2021understanding; cosentino2022toward; balestriero2022contrastive). To mitigate such shortcut solutions, state-of-the-art recipes rely on heuristics, e.g., stop-gradient (chen2020simple), asymmetric view generation (wang2022importance), teacher-student networks with carefully tuned EMA schedules (caron2021emerging; tian2021understanding), explicit normalization and whitening layers (ermolov2021whitening; chen2021empirical), and a delicate balance of hyperparameters. As a result, today's JEPA training is brittle, and most research has shifted toward scaling data (vo2024automatic), models (fan2025scaling) and even post-training (rodas2025diet) while leaving the theoretical foundations of JEPAs largely unexplored.
Our study proposes to break that cycle by questioning some of the fundamental design principles underpinning JEPAs. That introspection starts by asking: what are the necessary conditions that JEPAs should abide by? Those minimal conditions then act as axioms for us to design a novel and lean JEPA. We identify two axioms: (i) solving the prediction task while (ii) enforcing an isotropic Gaussian distribution of the embeddings (Section 3). While (i) follows standard practice (balestriero2022contrastive), we introduce in Section 4 a novel distribution matching objective, Sketched Isotropic Gaussian Regularization (SIGReg), to enforce (ii). SIGReg not only removes the need for the numerous heuristics previously employed to prevent representation collapse, but also exhibits favorable scaling properties, as its memory and computational complexity is linear in dimension and sample size. Crucially, SIGReg's isotropic Gaussian enforcement rules out collapsed shortcut solutions and provably minimizes the model's expected risk over the space of downstream tasks encountered post-training. The resulting JEPA solution, coined Latent-Euclidean JEPA (LeJEPA), is introduced in Section 5. Beyond theoretical optimality, LeJEPA offers numerous benefits such as (i) provable statistical guarantees, (ii) removal of heuristics such as teacher-student networks, (iii) linear memory and computational complexity, and, most importantly, (iv) a unified design with a single trade-off parameter that works out of the box across datasets, architectures, and scales (see Section 6). We summarize our contributions below.
Contribution 1: We prove the optimal embedding distribution for foundation models. We establish that the isotropic Gaussian uniquely minimizes downstream prediction risk across broad task families. In Section 3, we derive this result rigorously for both linear (Section 3.1) and nonlinear probes (Section 3.2), providing the first principled answer to which distribution $f_{\bm{\theta}}$'s embeddings should follow. This theoretical result transforms JEPA design from heuristic exploration to targeted optimization.
Contribution 2: We introduce SIGReg, a distribution matching objective that uniquely combines provable correctness with computational efficiency at scale. We present Sketched Isotropic Gaussian Regularization (SIGReg), a novel objective that enforces distributional alignment via random projections and characteristic-function matching (Sections 4 and 2). SIGReg provides statistical guarantees (Sections 4.1 and 4.2) while achieving linear complexity and bounded gradients, a combination that existing distribution matching methods do not offer. Critically, its projection-based construction defeats the curse of dimensionality (Section 4.3), making it both theoretically sound and practically efficient for high-dimensional embeddings.
Contribution 3: We design LeJEPA, a statistically optimal JEPA that eliminates collapse by construction. By combining JEPA's predictive objective with SIGReg targeting the isotropic Gaussian, we introduce LeJEPA, the Latent-Euclidean JEPA (Section 5). LeJEPA requires only a single hyperparameter, eliminates representational collapse without stop-gradients or teacher-student architectures, and transfers across architectures and datasets without hyperparameter tuning. This demonstrates that principled theory directly yields practical simplicity.
Contribution 4: We validate LeJEPA at scale across diverse architectures and establish in-domain pretraining as viable. Our experiments (Section 6) span ViTs, ConvNeXts, ResNets, MaxViTs, and Swin Transformers at scales approaching 1 billion parameters, where LeJEPA matches or exceeds state-of-the-art methods while maintaining training simplicity and robustness. Critically, on domain-specific datasets (Galaxy10, Food101), LeJEPA outperforms DINOv2-based transfer learning when pretrained directly on target data. This challenges the transfer learning paradigm and demonstrates that principled SSL can unlock effective in-domain pretraining, previously considered impractical for small datasets.
We start by introducing some of the notation we will be using throughout our manuscript (Section 2.1), followed by a review of JEPAs (Section 2.2) and the existing literature studying their design (Section 2.3).
Data. We are in possession of a dataset of shape $(N,V,D)\in(\mathbb{N}^{*})^{3}$ where $N$ is the number of samples, $V$ is the number of views, and $D$ is the dimension. One entry of this dataset is accessed via ${\bm{x}}_{n,v,d}$. Those dimensions are often interpreted as follows: ($N$) is the number of independent samples, e.g., different images or different videos, ($V$) is the number of views, e.g., data augmentations for images, frames for videos, and ($D$) is the dimension of each ${\bm{x}}_{n,v}$, e.g., the number of RGB pixels for images. In many cases the ordering over $V$ is given by time, but in some cases, e.g., data augmentation of an image, ordering is irrelevant. Our study does not require any particular choice to organize one's dataset into an $(N,V,D)$ tensor, and none of our theory and implementation assumes a particular design decision for that tensor. However, we will rely on the following two properties: (independence) the samples ${\bm{x}}_{n},{\bm{x}}_{n'}$ have been obtained independently from each other $\forall n\neq n'$, and (identically distributed) the sampling process was identical among ${\bm{x}}_{n},\forall n$.
Deep Networks. Today's AI solutions rely on Deep (Neural) Networks (DNs), which are compositions of a large number of parameterized linear and nonlinear operators. We denote the DN's mapping as $f_{\bm{\theta}}:\mathbb{R}^{D}\rightarrow\mathbb{R}^{K}$ with $K$ the dimension of the embedding space. The internals of $f_{\bm{\theta}}$ are designed by the researcher to incorporate as much prior knowledge about the data as possible. The details of $f_{\bm{\theta}}$ are irrelevant to our study; as we will see, the proposed LeJEPA works out of the box on any $f_{\bm{\theta}}$. In any case, all the learnable parameters are gathered in the vector ${\bm{\theta}}\in\mathbb{R}^{P}$, with $P$ counting the total number of parameters. A central challenge in AI research is to design the right architecture and training objective so that ${\bm{\theta}}$ can be learned from gradient descent to ultimately produce a useful system, or foundation model, $f_{\bm{\theta}}$.
JEPAs. A foundation model is any system, e.g., a DN, able to solve numerous downstream tasks without requiring any change in its internal parameters ${\bm{\theta}}$. This is in sharp contrast with a supervised model that only considers its training task. JEPAs have formally been introduced by lecun2022path as a vehicle to produce foundation models. The core building blocks of JEPAs rely on numerous well-established techniques such as siamese networks (bromley1993signature) and predictive coding (helmholtz1867handbook; bruner1949perception). While the exact blueprint of JEPAs varies greatly between use-cases, they all rely on two core principles: (i) being able to predict the embedding of a view ${\bm{x}}_{n,v}$ from the embedding of another view ${\bm{x}}_{n,v'},v'\neq v$, all while (ii) ensuring that the embeddings do not become degenerate. Concretely, once a JEPA is designed and trained, it should be able to solve numerous downstream tasks in zero or few shots. The JEPA objective function, along with some examples for ${\bm{x}}$, is provided in Equation 1. The predictability criterion can be enforced by directly comparing the embeddings of the partial views $\mathrm{Enc}({\bm{x}}_{n,v,.})$ and $\mathrm{Enc}({\bm{x}}_{n,v',.})$ with a metric, e.g., $\ell_{p}$. In some cases, an additional DN, coined $\mathrm{Pred}$, is employed to compare $\mathrm{Pred}(\mathrm{Enc}({\bm{x}}_{n,v,.}))$ against $\mathrm{Enc}({\bm{x}}_{n,v',.})$, which is only justified when there exists an asymmetry between the information content of the different views, e.g., by conditioning the predictions on observed actions from robotics data (khazatsky2024droid).
The JEPA's prediction task is designed based on a priori knowledge of the data. Its design is often quite natural since it is relatively intuitive to form ${\bm{x}}$ so that its views share the relevant information content one hopes to capture. On the other hand, the design of the "anti-collapse" criterion is much closer to a game of Whac-A-Mole. Today's designs rely on many different under-specified safeguards which are carefully combined in the hope that degenerate shortcut solutions are avoided during training. Such mechanisms include (i) feature whitening (ermolov2021whitening; bardes2021vicreg), (ii) negative samples (chen2020simple; he2020momentum), and (iii) asymmetric views and teacher-student networks with stop-gradient (caron2021emerging; assran2023self). Those mechanisms all suffer from at least two of the following limitations: (i) under-specification, i.e., the criteria can be minimized while embeddings are in a degenerate configuration, (ii) quadratic time and memory complexity with mini-batch size and/or embedding dimension, (iii) sensitivity to data distribution, hyperparameters, and architecture, and (iv) lack of theoretical understanding and guarantees.
For decades, the two major solutions for AI were supervised learning (lecun2015deep) and learning by reconstruction (rumelhart1986learning), sometimes combined together, e.g., for semi-supervised learning (kingma2014semi). In supervised learning, the labels both ensure that semantically similar samples are close to each other in embedding space and prevent complete representation collapse. In particular, it is possible to measure the amount of collapse in supervised learning as a function of the number of classes (papyan2020prevalence). The reconstruction objective is similarly well suited to prevent representation collapse as the original input must be recovered from the embeddings, i.e., the embeddings must be as informative about the input as possible, up to some optional denoising tasks that users can set up as part of the training (vincent2010stacked).
Because supervised and reconstruction-based learning have been widely studied for decades, there exists a large body of work to explain and inform practical designs, as well as to study their limitations in producing foundation models (balestriero2024learning; van2025joint). This is not the case for the more recent JEPAs, where empirical advances quickly outpace anyone hoping to delve into their inner workings. This dynamic led the community to focus on post-hoc theoretical justification of already found solutions (liu2021self; shwartz2024compress; shwartz2022we; zhang2023matrix). In most cases, those studies involve the Mutual Information (MI) (shannon1948mathematical; cover1999elements), whose different bounds recover established methods (gutmann2010noise; ma2018noise; oord2018representation; poole2019variational; hjelm2018learning; mcallester2020formal). Because existing studies focus on explaining and interpreting already developed JEPAs, too little principled guidance and innovation has been brought forward. Instead, most of the recent empirical advances take the form of collecting larger datasets, scaling up pre-existing training recipes (goyal2019scaling; chen2020big; oquab2023dinov2; fan2025scaling), and deriving novel data curation processes (vo2024automatic; kerdreux2025efficient).
In contrast, our goal in the following Sections 3, 4 and 5 is to derive a novel JEPA solution from first principles, i.e., one whose design relies on proven necessary conditions for optimality, and with a pretraining recipe that can finally reconcile exploratory research, scalability, and state-of-the-art performance.
We address a fundamental question: which distribution should $\mathrm{Enc}({\bm{x}})$ follow to minimize empirical risk on any downstream task? We prove that the isotropic Gaussian is the unique optimal distribution for both linear (Section 3.1) and nonlinear probing (Section 3.2), with geometric intuition provided in Section 3.3. This theoretical result establishes the necessary design principle for our JEPA; Section 4 then provides the practical implementation to achieve it.
We begin by identifying the optimal distribution for $f_{\bm{\theta}}$'s embeddings by analyzing linear probes, one of the most popular methods for frozen-encoder evaluation. Specifically, we ask: which distribution for $f_{\bm{\theta}}({\bm{x}})$ would be most favorable for solving arbitrary downstream tasks, i.e., for any realization of targets ${\bm{y}}$?
Denote as ${\bm{Z}}\in\mathbb{R}^{N\times K}$ the matrix of $N$ embeddings, each $K$-dimensional, from $f_{\bm{\theta}}({\bm{x}}_{n})$. The unknown corresponding labels are denoted as ${\bm{y}}\in\mathbb{R}^{N}$. Without loss of generality, we consider univariate targets; the following analysis extends to multivariate targets. The linear probe minimizes the following least-squares problem (bishop2006pattern)
where $\hat{\beta}$ is the optimal probe parameters, and $\lambda\geq 0$ is a hyperparameter controlling the Tikhonov regularizer strength (bishop1995training; golub1999tikhonov). Despite not knowing ${\bm{y}}$, it is possible to describe the bias and variance of the estimator $\hat{\beta}$ as a function of the distribution of ${\bm{Z}}$. Consider two embeddings with identical column spans ${\bm{Z}}_{\rm aniso},{\bm{Z}}_{\rm iso}$. ${\bm{Z}}_{\rm aniso}$'s covariance matrix eigenvalues are given by $\{\lambda_{k}\}_{k=1}^{K}$ with at least two distinct values, while ${\bm{Z}}_{\rm iso}$'s covariance matrix eigenvalues are all equal to $\frac{1}{K}\sum_{k=1}^{K}\lambda_{k}$. Hence, the two candidate embeddings ${\bm{Z}}_{\rm aniso},{\bm{Z}}_{\rm iso}$ capture the same intrinsic features and have the same energy, but different geometries.
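To make the bias argument concrete, the following minimal numpy sketch (all names hypothetical) fits the ridge estimator $\hat{\beta}=({\bm{Z}}^{\top}{\bm{Z}}+\lambda I)^{-1}{\bm{Z}}^{\top}{\bm{y}}$ on noiseless targets and compares the cosine similarity between $\hat{\beta}$ and the true parameters for an isotropic versus an anisotropic spectrum of equal trace:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, lam = 200, 8, 50.0
beta_true = rng.normal(size=K)

def ridge_cosine(eigvals):
    # Sample N embeddings with diagonal covariance given by `eigvals`.
    Z = rng.normal(size=(N, K)) * np.sqrt(eigvals)
    y = Z @ beta_true  # noiseless targets: any misalignment is pure regularization bias
    beta_hat = np.linalg.solve(Z.T @ Z + lam * np.eye(K), Z.T @ y)
    return float(beta_hat @ beta_true /
                 (np.linalg.norm(beta_hat) * np.linalg.norm(beta_true)))

cos_iso = ridge_cosine(np.full(K, 1.0))                      # all eigenvalues equal
cos_aniso = ridge_cosine(np.array([0.02] * 4 + [1.98] * 4))  # same trace, skewed spectrum
print(cos_iso, cos_aniso)
```

With equal eigenvalues, Tikhonov shrinkage is uniform across directions, so $\hat{\beta}$ stays parallel to the true parameter; an anisotropic spectrum shrinks directions unevenly and biases the estimated direction.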
From the above results of Section 3.1 we obtain that the distribution of features must be isotropic. We now move to nonlinear probing, where the standard Gaussian will emerge as the unique optimum.
To allow for more flexible evaluation of the pretrained encoder $f_{\bm{\theta}}$, it has become increasingly common to work with a nonlinear probe. We analyze two widely used nonlinear methods: radius-based k-NN (taunk2019brief; sun2010adaptive; zhang2017efficient; abu2019effects) for its simplicity and kernel methods (nadaraya1964estimating; watson1964smooth) for their theoretical tractability.
As in Section 3.1, we ask ourselves which distribution of embeddings would be preferable for a foundation model. We first define our prediction function. The training data consists of the $N$ embeddings along with their training labels $\{({\bm{z}}_{n},{\bm{y}}_{n})\}_{n=1}^{N}$. The prediction, using radius-based k-NN for a query vector ${\bm{q}}$, is formed as
where $\mathcal{N}_{r_{0}}({\bm{q}})=\{n:\|{\bm{z}}_{n}-{\bm{q}}\|\leq r_{0}\}$. The specific choice of radius $r_{0}$ controls how many neighbors' labels are averaged to form the query's prediction. The kernel prediction at a query ${\bm{q}}\in\mathbb{R}^{K}$ is given by
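Both probes admit near one-line implementations; the sketch below (hypothetical helper names, numpy only) implements the radius-based k-NN average and a Gaussian-kernel Nadaraya-Watson estimator as stand-ins for the two predictors discussed above:

```python
import numpy as np

def radius_knn_predict(q, Z, y, r0):
    """Average the labels of all training embeddings within radius r0 of the query q."""
    mask = np.linalg.norm(Z - q, axis=1) <= r0
    return y[mask].mean() if mask.any() else np.nan  # undefined if no neighbor is in the ball

def nadaraya_watson_predict(q, Z, y, h=1.0):
    """Nadaraya-Watson estimate: labels weighted by a Gaussian kernel of bandwidth h."""
    w = np.exp(-np.sum((Z - q) ** 2, axis=1) / (2 * h ** 2))
    return (w * y).sum() / w.sum()

Z = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0]])
y = np.array([1.0, 1.0, 0.0])
q = np.array([0.1, 0.0])
print(radius_knn_predict(q, Z, y, r0=1.0))  # → 1.0 (only the two nearby labels are averaged)
```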
We search over all distributions of ${\bm{Z}}$ subject to a fixed total variance constraint, e.g., $\operatorname{Tr}(\mathrm{Cov}({\bm{Z}}))=\kappa_{1}$ or $\|\mathrm{Cov}({\bm{Z}})\|_{F}=\kappa_{2}$. The specific value of $\kappa$ does not affect the optimal distribution shape. Following the same type of derivations as in the linear regime, with the exception of some additional regularity conditions, we are able to precisely identify the isotropic Gaussian as the unique optimum to minimize bias, as formalized below.
Numerous additional details and discussions on the regularity assumptions we employed are provided in Appendix A. Together, these results establish the isotropic Gaussian distribution as the optimal design to minimize the worst-case risk of a foundation model across downstream tasks.
We now empirically validate that the isotropic Gaussian is optimal when no information about downstream tasks is available. We focus on linear probing (Section 3.1), where all considered distributions have the same total variance.
When employing a linear probe, an anisotropic distribution increases both bias (with Tikhonov regularization) and variance. Examining bias first (Section 3.1), we present in Figure 18 visualizations for both continuous regression and discrete classification tasks. We observe that the cosine similarity between estimated and ground-truth parameters equals 1 only for isotropic distributions, degrading for anisotropic cases regardless of sample size or regularization strength. Regarding variance (Section 3.1), we show in Figure 3 that learned parameters vary significantly more across training sets when the covariance is anisotropic (right) compared to isotropic (left), even when using logistic regression instead of OLS. Figure 17 further illustrates this effect, showing the distribution of learned $\beta$ parameters across different training samples for both cases. The anisotropic distribution clearly produces higher-variance estimators.
These theoretical and empirical results establish our design principle for LeJEPA: embeddings $f_{\bm{\theta}}({\bm{x}})$ should follow an isotropic Gaussian distribution to minimize worst-case risk across downstream tasks encountered post-training. Section 4 introduces a novel regularizer to achieve this distribution.
Having established the isotropic Gaussian as the optimal embedding distribution (Section 3), we now introduce Sketched Isotropic Gaussian Regularization (SIGReg), a distribution matching objective that is simultaneously (i) differentiable, (ii) scalable, (iii) provable, and (iv) interpretable. SIGReg builds on three key innovations. First, we formulate distribution matching as a statistical test under the null hypothesis $P_{\bm{\theta}}=Q$ (Section 4.1). Second, we identify a test that guarantees bounded gradients and curvature while maintaining linear complexity and efficient multi-GPU scaling (Section 4.2). Third, SIGReg bypasses the curse of dimensionality, eliminating collapsed shortcut solutions entirely (Section 4.3).
Asking for $f_{\bm{\theta}}({\bm{x}})$'s distribution $P_{\bm{\theta}}$ to match a target distribution $Q$ is typically done by constructing various measures of distance or divergence and estimating them in high dimension. We propose a different starting point grounded in statistics. Consider the hypothesis testing framework (fisher1928statistical; neyman1933ix) given by
with $H_{0}$ being referred to as the null hypothesis. That is, we are asking in Equation 2 if there is enough empirical evidence to reject the null. To answer that question, one (i) employs a test statistic, i.e., a single scalar value summarizing the evidence from the empirical samples, (ii) determines a critical value $\tau_{\alpha}$ for the test statistic based on the probability $\alpha$ of Type I error, i.e., of mistakenly rejecting a true null hypothesis, and (iii) compares the test statistic to the critical value $\tau_{\alpha}$: if the test statistic exceeds $\tau_{\alpha}$, reject the null hypothesis. If the null is not rejected, we can only claim that there is not sufficient empirical evidence against $P_{\bm{\theta}}=Q$.
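For concreteness, here is the three-step recipe instantiated on the simplest possible case, a two-sided z-test on the mean with known unit variance (purely illustrative; SIGReg will instead rely on Gaussianity tests):

```python
import numpy as np

def z_test_mean(x, tau=1.96):
    """Reject H0: E[x] = 0 at level alpha ≈ 0.05 (unit variance assumed)."""
    t_stat = abs(x.mean()) * np.sqrt(len(x))  # (i) scalar summary of the evidence
    return t_stat > tau                       # (ii)+(iii) compare to the critical value

print(z_test_mean(np.zeros(100)), z_test_mean(np.full(100, 0.5)))  # → False True
```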
As it stands, Equation 2 remains impractical in large dimension as existing tests have at least quadratic complexity in the number of samples considered (more details in Appendix F). We thus propose to derive a sketching strategy by decomposing Equation 2 into simpler univariate tests. Denoting the push-forward distributions $P_{\bm{\theta}}^{({\bm{a}})}\triangleq({\bm{a}}^{\top})_{\#}P_{\bm{\theta}}$ and $Q^{({\bm{a}})}\triangleq({\bm{a}}^{\top})_{\#}Q$, we can define the following directional univariate test
for a given directional unit-norm vector ${\bm{a}}\in\mathcal{S}^{K-1}$. The corresponding directional test statistic of Equation 3 is computed as $T(\{{\bm{a}}^{\top}f_{\bm{\theta}}({\bm{x}}_{n})\}_{n=1}^{N})$. Examples of tests $T$ will be provided in Section 4.2. Repeating that process over a set of $M$ directions ${\mathbb{A}}\triangleq\{{\bm{a}}_{1},\dots,{\bm{a}}_{M}\}$ and aggregating the individual values leads to the following global test statistic
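The sketching step can be prototyped in a few lines; the snippet below (hypothetical names, with a toy two-moment statistic standing in for the tests $T$ introduced later) samples $M$ unit directions, projects the embeddings, and aggregates the per-direction statistic by averaging:

```python
import numpy as np

def sketched_statistic(Z, T, M, rng):
    """Average a univariate statistic T of the projected embeddings over M random directions."""
    A = rng.normal(size=(M, Z.shape[1]))
    A /= np.linalg.norm(A, axis=1, keepdims=True)  # unit-norm directions on the sphere
    return float(np.mean([T(Z @ a) for a in A]))

# Toy stand-in for T: squared deviation of the first two moments from N(0, 1).
moment_T = lambda x: x.mean() ** 2 + (x.var() - 1.0) ** 2

rng = np.random.default_rng(0)
stat_gauss = sketched_statistic(rng.normal(size=(4096, 32)), moment_T, 64, rng)
stat_collapse = sketched_statistic(np.zeros((4096, 32)), moment_T, 64, rng)
print(stat_gauss, stat_collapse)  # near 0 for isotropic Gaussian embeddings, 1.0 when collapsed
```

Complete collapse is flagged in every direction (each projection has zero variance), whereas isotropic Gaussian embeddings drive the statistic toward zero.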
We now provide a formal statement asserting the consistency of Equation 4 for testing the original multivariate null hypothesis from Equation 2. Our result leverages the well-known union-intersection principle (roy1953heuristic) and a slightly modified Cramér-Wold theorem. We denote by $\stackrel{d}{=}$ equality in distribution.
The assumptions required in the proof of Section 4.1 hold for classical consistent univariate tests $T$ such as the ones presented in the following Section 4.2.
Our proposed regularizer, coined Sketched Isotropic Gaussian Regularization (SIGReg), follows directly from Section 4.1 using any statistical test $T$ targeted towards the isotropic Gaussian; it is illustrated in Figures 2 and 5 and formalized below.
We replace the maximum over ${\bm{a}}\in{\mathbb{A}}$ in Section 4.1 by an average in (5) to avoid sparse gradients over the directions in ${\mathbb{A}}$. We now delve into the choice of $T$, comparing well-known candidate tests from statistics that fall into (i) moment-based (Section 4.2.1), (ii) CDF-based (Section 4.2.2), and (iii) CF-based (Section 4.2.3) statistics, ultimately justifying our choice of the Epps-Pulley statistic.
The first family of statistics we consider is moment-based. Taking the standard Gaussian as an instantiation for the moments, we can define the Jarque-Bera test (jarque1980efficient) that compares the third and fourth moments, i.e., skewness and kurtosis, as
where $\widehat{\rm skew}$ is the skewness computed from the data as $\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\hat{\mu})^{3}/\hat{\sigma}^{3}$ and $\widehat{\rm kurt}$ is the kurtosis $\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\hat{\mu})^{4}/\hat{\sigma}^{4}$. Typically, the (Jarque-Bera) test is used to check whether a density follows a Gaussian distribution of any mean and variance, hence it only looks at moments 3 and 4. In our case we aim for a standard Gaussian test and thus add the usual statistics on the first two moments, leading to the extended test
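A direct implementation of these moment statistics is short; in the sketch below the classical Jarque-Bera term follows the formulas above, while the exact weighting of the added first-two-moment terms is our assumption, since the extended test's display is not reproduced here:

```python
import numpy as np

def extended_jarque_bera(x):
    """Moment-matching statistic against N(0, 1): classical Jarque-Bera on
    skewness/kurtosis, plus first-two-moment terms (assumed weighting)."""
    n = len(x)
    mu, sigma = x.mean(), x.std()
    skew = ((x - mu) ** 3).mean() / sigma ** 3
    kurt = ((x - mu) ** 4).mean() / sigma ** 4
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)   # classical Jarque-Bera statistic
    return jb + n / 2 * (mu ** 2 + (sigma ** 2 - 1) ** 2)

rng = np.random.default_rng(0)
x_gauss = rng.normal(size=10_000)
print(extended_jarque_bera(x_gauss))  # stays O(1) for standard Gaussian data
```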
The (Extended Jarque-Bera) acts as a moment matching problem over the first four moments. Such moment matching methods have proven powerful not only for statistical tests but also as a means to learn parametric and nonparametric models of data.
The Stability and Identifiability Conundrum. We now explain why moment-based tests, albeit powerful, are not suited for LeJEPA. The $k^{\rm th}$ moment of a distribution $P$ is denoted as $m_{k}(P)$. The first observation is that well-behaved distributions abiding by Carleman's condition $\sum_{k=1}^{\infty}m_{2k}(Q)^{-1/(2k)}=\infty$ (carleman1926fonctions), such as the Gaussian, or distributions supported on a finite interval (hausdorff1923momentprobleme), are uniquely determined by their moments. However, using a finite number of moments creates the following non-identifiability issue, which is well known in statistics and often used as a motivation to use all moments (lehmann2005testing).
Hence Section 4.2.1 prescribes the guideline to employ as many moments as possible to remove collapsed shortcut solutions by making sure our distribution matching is accurate. Yet, doing so leads to unstable gradient-based training due to the gradient norm scaling as $O(k)$, and the variance of Monte Carlo gradient estimates growing as $O(k^{2}m_{2(k-1)})$ for the $k$-th moment since $\big\|\nabla_{\theta}m_{k}(P_{\bm{\theta}}^{({\bm{a}})})\big\|=\big\|\mathbb{E}\big[k({\bm{a}}^{\top}f_{\bm{\theta}}({\bm{x}}))^{k-1}{\bm{a}}^{\top}J_{f_{\bm{\theta}}}({\bm{x}})\big]\big\|$, with $J_{f_{\bm{\theta}}}({\bm{x}})\in\mathbb{R}^{K\times P}$ the Jacobian matrix, hereby creating an impractical situation where training stability and identifiability cannot be achieved simultaneously.
The second family of tests acts upon the CDF. Because those tests require sorting, let us denote the $k^{\rm th}$ order statistic of $N$ samples by $x_{k:N}$. Two highly standard tests are the quadratic Empirical Distribution Function (EDF) statistics with different weightings known as Cramér-von Mises (cramer1928composition; von1981probability) and Anderson-Darling (anderson1952asymptotic), given by
where $w(x)$ is a weighting function. Adding the $U^{2}$ statistic on top of the (Cramér-von Mises) statistic recovers the Watson test (watson1961goodness)
We do not consider the Kolmogorov-Smirnov test (kolmogorov1933) as it employs the $\ell_{\infty}$-norm instead of the $\ell_{2}$-norm, thereby producing sparse gradients. Another common test is the Shapiro-Wilk test (shapiro1965analysis), which we found to be unstable in practice; details are provided in Appendix E.
Lack of Scalability and Differentiability. CDF-based tests require sorting, which has been highly optimized, e.g., with the $\mathcal{O}(N\log(N))$ Quicksort algorithm (quicksort), but which nonetheless breaks the embarrassingly parallel nature of SGD, especially on multi-GPU setups (tanasic2013comparison; maltenberger2022evaluating), due to synchronization requirements. Moreover, these tests involve non-differentiable operations (sorting and order statistics), making them unsuitable for gradient-based optimization without relaxations (cuturi2019differentiable; grover2019stochastic; petersen2022monotonic). While there exist intricate sketching solutions (dunning2019computing; masson2019ddsketch; dunning2021t), each of those solutions introduces numerous additional hyperparameters, going against our first motivation for LeJEPA.
The third family of tests is concerned with Empirical Characteristic Functions (ECFs), i.e., the Fourier transform of the density function. The Epps-Pulley test (epps1983test) is one of the most popular tests and simply compares, in weighted $\ell_{2}$-norm, the ECF of the data against a target CF
The first crucial observation is that the ECF, defined as $\hat{\phi}_{X}(t)=\frac{1}{n}\sum_{j=1}^{n}e^{itX_{j}}$, is naturally differentiable and easily computed in distributed settings via efficient all_reduce operations, as the ECF is a simple average of complex exponentials. The weight function is typically Gaussian, such as $w(t)=e^{-t^{2}/\sigma^{2}}$ with $\sigma$ commonly set to $1$.
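The following sketch evaluates the Epps-Pulley discrepancy by simple quadrature on a truncated grid (the truncation and grid size are our choices; a practical implementation would integrate the Gaussian-weighted ECF gap in closed form):

```python
import numpy as np

def epps_pulley(x, sigma=1.0, t_max=5.0, num=201):
    """Weighted L2 gap between the ECF of x and the N(0, 1) characteristic
    function, integrated numerically on a truncated grid (illustrative)."""
    t = np.linspace(-t_max, t_max, num)
    ecf = np.exp(1j * np.outer(t, x)).mean(axis=1)  # empirical CF at each t
    target = np.exp(-t ** 2 / 2)                    # CF of N(0, 1), real-valued
    w = np.exp(-t ** 2 / sigma ** 2)                # Gaussian weight from the text
    return float((np.abs(ecf - target) ** 2 * w).sum() * (t[1] - t[0]))

rng = np.random.default_rng(0)
ep_gauss = epps_pulley(rng.normal(size=5_000))
ep_shift = epps_pulley(rng.normal(loc=2.0, size=5_000))
print(ep_gauss, ep_shift)  # tiny for N(0, 1) samples, clearly non-zero when shifted
```

Note that both the statistic and its gradient stay bounded no matter how heavy-tailed the projected samples are, since every term is a complex exponential of unit modulus.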
Other tests, e.g., entropy-based ones (szekely2005new), are not considered here as they require numerous additional design choices for the univariate entropy estimation (silverman2018density; beirlant1997nonparametric), e.g., using kernels (joe1989estimation) or M-estimators (miller2003new).
Epps-Pulley has bounded loss, gradient and curvature. We now consider the remaining two families of tests: moment-based and CF-based. First, recall that moments are polynomial in the data, with extreme growth rates for higher moments, assuming they even exist. Even for well-behaved distributions, raising values to a power of $k$ can quickly lead to exploding gradients. This comes in sharp contrast with the ECF, which is always bounded and has bounded gradients for any input distribution of the projected samples $z_{n}={\bm{a}}^{\top}f_{\theta}({\bm{x}}_{n})$, $n=1,\ldots,N$.
By the chain rule, Section 4.2.3 directly gives $\left\|\nabla_{\theta}EP({\bm{a}})\right\|\leq\frac{4\sigma^{2}}{N}\sum_{i=1}^{N}\left\|{\bm{a}}^{\top}\nabla_{\theta}f_{\theta}({\bm{x}}_{i})\right\|$, providing stable gradients. The limitations of moment-based and CDF-based tests, coupled with Section 4.2.3, justify our choice of the (Epps-Pulley) statistic: (i) DDP-friendly and scalable, (ii) uniformly bounded gradients and curvature regardless of input distribution, and (iii) hyperparameter-free implementation. Lastly, we highlight that our implementation has a linear memory and computational complexity of $\mathcal{O}(N)$, with $N$ the minibatch size. The implementation of SIGReg using that statistical test is provided in LABEL:lst:epps-pulley-pytorch, along with computation times of the forward-backward pass in Table 6.
As a last step before introducing LeJEPA, we ought to study the requirements on the number of directions $|{\mathbb{A}}|$ for (4.2) to be effective in high dimensions.
This last section seeks to characterize how many slices in ${\mathbb{A}}$ one must sample for (5) to be an effective statistical test. That design is crucial if we hope for LeJEPA to successfully converge towards isotropic Gaussian embeddings.
Our first argument for a favorable scaling of $|{\mathbb{A}}|$ with the embedding dimension $K$ relies on the smoothness of $P_{{\bm{\theta}}}$ as measured by its Sobolev regularity $\alpha$ (adams2003sobolev). We formalize below a bound on the directional test from Equation˜3 over all possible directions ${\bm{a}}$ when the test statistic is minimized over $|{\mathbb{A}}|=M$ directions. While we provide bounds on the expected discrepancy over random directions ${\bm{a}}$ when the EP test is satisfied (equals zero) on a finite set of directions, the provided proof covers moment-based and CDF-based tests as well.
As $|{\mathbb{A}}|\to\infty$, the bound decays as $|{\mathbb{A}}|^{-2\alpha/(K-1)}$, showing that $|{\mathbb{A}}|=O(K)$ directions suffice for $\epsilon$-approximation when $\alpha$ is large. Some examples of embedding densities with varying $\alpha$ are provided in Figure˜4. The following statement characterizes how the $M$ directions actually constrain the entire space as a function of $\alpha$. The constant $C(K,\alpha)=\frac{2^{2\alpha}\pi^{(K-1)/2}\Gamma\left(\alpha+\frac{K-1}{2}\right)}{(K-1)\Gamma(\alpha)\Gamma\left(\frac{K-1}{2}\right)}$ is visualized in Figure˜15 (left), depicting how $\alpha$ and $|{\mathbb{A}}|$ interact. In words, we obtain that thanks to the natural smoothness of DNs–whether stemming from the architecture or from the implicit and explicit regularizers used during training–applying SIGReg on $|{\mathbb{A}}|$ directions can be sufficient to tightly constrain the entire space. We note that considering the worst case over ${\bm{a}}$ or using low-discrepancy sequences for ${\bm{a}}$ does not impact the asymptotic bounds; details are provided in Appendix˜D.
Our second argument leverages the iterative nature of DN training. Although $|{\mathbb{A}}|$ may only be a few hundred at any given step, the cumulative number of sampled directions grows linearly with training time. This resampling effect (illustrated in Figure˜7, bottom) enables rapid convergence. Even a small $|{\mathbb{A}}|$ achieves tight distributional matching compared to keeping the set ${\mathbb{A}}$ fixed throughout minibatches (recall Section˜4.3). Our experiments show that $|{\mathbb{A}}|$ as low as $16$ can easily outperform a fixed set with $|{\mathbb{A}}|$ on the order of thousands, thanks to the compounding effect of resampling at each minibatch.
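A sketch of this per-step resampling (our own minimal version, not the official code): fresh isotropic unit vectors are drawn at every minibatch, and each projected column then feeds the univariate test.

```python
import torch

def sample_directions(num_dirs, dim, device="cpu"):
    """Draw fresh isotropic directions (unit vectors) for slicing.

    Called anew at every minibatch so that, over training, the cumulative
    number of constrained directions grows linearly with the step count.
    """
    a = torch.randn(num_dirs, dim, device=device)
    return a / a.norm(dim=1, keepdim=True)

# per training step: Z (N, K) embeddings -> (N, M) univariate projections
Z = torch.randn(32, 64)
A = sample_directions(16, 64)
proj = Z @ A.T   # each column is fed to the univariate goodness-of-fit test
```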
We conclude this section with a controlled experiment applying (5) with gradient-based training to produce isotropic embeddings. In this setup, we directly consider embeddings ${\bm{Z}}$ which we differentiate and optimize to minimize (5). By directly optimizing the embeddings we can observe the impact of the loss without any constraint or regularization that would come from the architecture. We sample $N$ i.i.d. samples ${\bm{x}}_n$ in a $D$-dimensional space. This sampling is based on an isotropic Gaussian distribution–but the first two dimensions are again set to the adversarial “X” shape. That is, among the $D$ dimensions, only two must be transformed, as all the other ones already obey the isotropic Gaussian target distribution. We then make the samples ${\bm{x}}_n$ differentiable and optimize them to minimize the value of the different statistical tests computed on $M$ random directions. Those directions are resampled after each gradient step–which follows the procedure we will employ in LeJEPA. We present the results in Figure˜6, demonstrating that even in the challenging case, i.e., $D=512$ and $M=16$, SIGReg is able to detect the two degenerate dimensions and unfold them back to how they should look under the target distribution.
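The experiment can be condensed as follows; `epps_pulley_1d` is our own simplified stand-in for the test (trapezoidal quadrature of the weighted ECF gap), with a much smaller $D$ than in the paper for brevity, and the degenerate-dimension construction is an illustrative approximation rather than the exact adversarial shape:

```python
import torch

def epps_pulley_1d(z, n_t=17, bound=5.0):
    # weighted squared gap between the ECF of z and the N(0,1) CF e^{-t^2/2},
    # integrated with the trapezoid rule on [-bound, bound]
    t = torch.linspace(-bound, bound, n_t)
    tz = z[:, None] * t[None, :]
    re, im = torch.cos(tz).mean(0), torch.sin(tz).mean(0)
    target = torch.exp(-t ** 2 / 2)
    gap = (re - target) ** 2 + im ** 2
    w = torch.exp(-t ** 2)                      # Gaussian weight, sigma = 1
    return torch.trapz(gap * w, t)

# directly optimize a cloud of "embeddings": all dims Gaussian except dim 0,
# which is set to a degenerate shape, mimicking the adversarial setup
torch.manual_seed(0)
Z = torch.randn(256, 8)
Z[:, 0] = Z[:, 1].abs() * torch.sign(torch.randn(256))
Z.requires_grad_(True)
opt = torch.optim.Adam([Z], lr=0.05)
for _ in range(100):
    A = torch.randn(16, 8)
    A = A / A.norm(dim=1, keepdim=True)         # fresh slices every step
    loss = torch.stack([epps_pulley_1d(Z @ a) for a in A]).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```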
Having established that isotropic Gaussians are the optimal embedding distribution for foundation models (Section˜3) and introduced SIGReg to achieve this distribution (Section˜4.2), we now present the complete LeJEPA framework. We first evaluate candidate statistical tests (Sections˜4.2.1 and 4.2.2) and identify characteristic function-based tests as optimal for gradient-based training (Section˜4.2.3). The full LeJEPA implementation follows in Section˜5.1.
We now discuss the implementation of LeJEPA starting with SIGReg and followed by the prediction and total losses.
The SIGReg Loss. We chose (Epps–Pulley) for its provable boundedness (Section˜4.2.3) and its scalability. Its implementation follows the equation exactly, except for the integral, which is estimated using a quadrature approximation. We find that the simple trapezoidal quadrature rule is sufficient even with as few as $17$ knots, as ablated in Figure˜20. In particular, we leverage the symmetry of the integrand to double the number of knots for free; see the official code. On the other hand, the use of minibatches introduces a bias vanishing at rate $\mathcal{O}(1/N)$, as formalized below.
Hence, the gradients we obtain from using (Epps–Pulley) are biased by an explicit $\mathcal{O}(1/N)$ term. We found this bias to be minimal and not a concern even for minibatches as small as 16. Unbiased alternatives include U-statistic debiasing of $|\phi_{\theta}|^{2}$ or sample splitting, which we do not explore in this study. Our final implementation of the SIGReg term with the Epps–Pulley statistic is provided in LABEL:lst:epps-pulley-pytorch.
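Independently of the official listing, the symmetry trick can be illustrated generically: the integrand is even in $t$, so one may evaluate it only on $[0,b]$ and double the half-domain trapezoidal integral (a sketch under that evenness assumption, not the paper's implementation):

```python
import numpy as np

def trapezoid_even(f_half, t_half):
    """Integrate an even function over [-b, b] from values on [0, b] only.

    f_half: integrand values at t_half = linspace(0, b, n), uniform spacing.
    By symmetry the full integral is twice the half-domain one, so n
    evaluations stand in for 2n - 1 knots on the full interval.
    """
    h = t_half[1] - t_half[0]
    half = h * (f_half.sum() - 0.5 * (f_half[0] + f_half[-1]))
    return 2.0 * half

t = np.linspace(0.0, 5.0, 17)
val = trapezoid_even(np.exp(-t ** 2), t)   # approximates the integral of
                                           # exp(-t^2) over [-5, 5]
```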
The Prediction Loss. To standardize notations, we adopt the DINO (caron2021emerging) setup of generating $V_g$ global views and $V_l$ local views, leading to a total of $V=V_g+V_l$ views. We set the first $1,\dots,V_g$ indices of each ${\bm{z}}_{n,v}$ as the global views. For the cases without local views, simply set $V_l=0$. The prediction loss is then given by having all views predict the global views as
where we denote $\bm{\mu}_n\triangleq\frac{1}{V_g}\sum_{v=1}^{V_g}{\bm{z}}_{n,v}$; the Equation˜6 to Equation˜7 derivations are detailed in Section˜B.6.
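A sketch of this term, with the $V$ views of each sample stacked along dimension 1 and the first $V_g$ of them global (shapes and names are our own choices):

```python
import torch

def prediction_loss(z, num_global):
    """All views predict the mean of the global views.

    z: (N, V, K) tensor of view embeddings; the first `num_global` entries
    along dim 1 are the global views.
    """
    mu = z[:, :num_global].mean(dim=1, keepdim=True)   # (N, 1, K)
    return ((z - mu) ** 2).sum(dim=-1).mean()          # squared l2 per view
```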
LeJEPA Loss. The final total loss simply combines the above prediction loss with SIGReg applied to each view, as per
We present (9)’s implementation in LABEL:code:lejepa. Altogether, the entire implementation–besides the usual model definitions, optimizers, and data loaders–only takes a few dozen lines of PyTorch (LABEL:lst:epps-pulley-pytorch and LABEL:code:lejepa). The absence of prototypes, stop-gradients, and teacher-student networks makes (9) appealing, as it only contains one hyperparameter, $\lambda$, balancing the trade-off between the prediction and isotropic Gaussian terms.
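Combining the two terms can be sketched as follows, with `sigreg_fn` standing in for any per-view SIGReg implementation (a sketch of the combination only; all names are ours):

```python
import torch

def lejepa_loss(z, num_global, sigreg_fn, lam=0.05):
    """Total objective: prediction term + lam * SIGReg averaged over views.

    z: (N, V, K) view embeddings; sigreg_fn maps a (N, K) batch to a scalar
    (e.g., a sliced Epps-Pulley statistic) and is passed in as a stand-in.
    """
    mu = z[:, :num_global].mean(dim=1, keepdim=True)
    pred = ((z - mu) ** 2).sum(dim=-1).mean()
    sig = torch.stack([sigreg_fn(z[:, v]) for v in range(z.shape[1])]).mean()
    return pred + lam * sig
```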
Prior to presenting our experiments (Section˜6), we conclude by discussing how our proposed LeJEPA and SIGReg objective relate to existing frameworks in the literature.
While there is no existing solution employing such slicing and distribution matching for JEPAs, similar pipelines exist for generative models and optimal transport. Notably, Sliced Score Matching (song2020sliced) leverages univariate slicing of the space to ease density estimation for generative models. In a similar vein, the sliced Wasserstein distance (bonneel2015sliced; nguyen2023energy) uses that strategy to speed up and improve optimal transport. Furthermore, when the integral of the (Epps–Pulley) test is computed exactly, as opposed to our quadrature, each slice's loss value recovers the kernel MMD (sriperumbudur2010hilbert; gretton2012kernel; chwialkowski2016kernel) measuring the distance between two distributions–albeit with quadratic complexity. Lastly, it is possible to recover some existing SSL frameworks in the limit by employing LeJEPA with a particular test–instead of the preferred (Epps–Pulley). For example, setting $T(\{x_n\}_{n=1}^{B})={\rm mean}(\{x_n\}_{n=1}^{B})^{2}+({\rm std}(\{x_n\}_{n=1}^{B})-1)^{2}$ and using that $T$ with SIGReg in LeJEPA recovers the VICReg SSL method in the limit of a large number of slices. In fact, SIGReg will enforce in expectation that $\mathbb{E}[\mathbf{Z}]=\mathbf{0}$ and $\mathrm{Cov}(\mathbf{Z})=\mathbf{I}_d$, where $\mathbf{I}_d$ denotes the $d\times d$ identity matrix–derivations are provided in Section˜B.14. And since our invariance term is simply the $\ell_2$ distance between the views’ embeddings, LeJEPA recovers VICReg for this degenerate statistical test. Based on Section˜4.2.1, we however strongly advocate against such a setting, as it would lead to shortcut solutions–a phenomenon already observed in VICReg.
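For completeness, the degenerate statistic above reads as follows in code (we reiterate that the paper advocates against using it):

```python
import torch

def mean_std_stat(x):
    """T({x_n}) = mean^2 + (std - 1)^2 on a 1-D projection.

    Penalizes only the first two moments, so with many slices SIGReg then
    enforces zero mean and identity covariance (the VICReg-style limit)
    rather than full Gaussianity.
    """
    return x.mean() ** 2 + (x.std() - 1.0) ** 2
```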
We now use the LeJEPA implementation described in Section˜5.1 to demonstrate its effectiveness through comprehensive experiments. We show that LeJEPA: (i) trains reliably across diverse architectures and datasets (Section˜6.1), (ii) provides an informative training loss for model selection (Section˜6.2), (iii) outperforms frontier vision models on small-scale in-domain pretraining (Section˜6.3), (iv) scales successfully to nearly 1 billion parameters on ImageNet-1k (Section˜6.4), and (v) learns rich semantic segmentation features without explicit supervision.
We now demonstrate LeJEPA’s stability across hyperparameters, architectures, and experimental setups. Additional cross-domain stability results are presented in Section˜6.3.
Table˜1 panels: (a) (Epps–Pulley) parameters, (b) number of local/global views, (c) mini-batch size, (d) embedding/projector dimension, (e) register tokens.
Stability across standard hyperparameters. We begin by evaluating LeJEPA on ImageNet-100 and ImageNet-1K. On ImageNet-100, we train a ResNet-50 and vary the number of views and the loss weighting $\lambda$ (Figure˜8). Performance remains stable across both dimensions, leading us to recommend $\lambda=0.05$ as a robust default. On ImageNet-1K, we train a ViT-Large/14 and explore batch size, as well as the number of global ($V_{\rm g}$) and local ($V_{\rm l}$) views (Table˜1b). We find that the configuration commonly used in prior work ($V_{\rm g}=2, V_{\rm l}=8$) transfers well to LeJEPA. Notably, LeJEPA achieves competitive performance with batch sizes as small as 128 on ImageNet-1K (Table˜1c), suggesting reduced memory requirements compared to existing methods. We thus recommend $\lambda=0.05$, $V_{\rm g}=2$, $V_{\rm l}=8$, and batch size $\geq 128$ as starting points.
Stability across Epps-Pulley hyperparameters. We next examine hyperparameters specific to LeJEPA: the number of slices $|\mathcal{A}|$ in SIGReg, the integration domain for the Epps-Pulley test (Epps–Pulley), and the number of quadrature points for numerical integration. Table˜1a shows ablations on ImageNet-1K with ViT-Large/14. Both the integration domain and the number of quadrature points have negligible impact on performance. This is expected: since the characteristic function around zero already captures the moments of the distribution, a modest integration range suffices. The number of slices $|\mathcal{A}|$ has a modest effect–while more slices slightly improve performance, even 512 slices yield competitive results. We thus recommend 17 integration points, an integration domain of $[-5,5]$, and 1024 slices as starting points.
Stability across architectures. A key advantage of LeJEPA over recent methods (e.g., IJEPA, DINOv2) is its architecture-agnostic design. While most modern self-supervised methods are tailored to Vision Transformers, LeJEPA works across diverse architecture families without modification. To validate this claim, we pretrain approximately 50 architectures from 8 different families on ImageNet-10, selecting all models in the timm library with fewer than 20M parameters. All models learn high-quality representations, reaching between 91.5% and 95% top-1 accuracy with frozen-backbone linear probing. Models performing well in supervised learning setups, such as ResNets and ViTs, are also the ones to favor for LeJEPA. We thus recommend standard architectures such as ResNets and ViTs over specialized models like EfficientNet as a starting point.
Removal of popular heuristics. In addition to providing reliable performance across models and datasets, LeJEPA’s provable construction enables us to remove many heuristics traditionally used to prevent collapse. First, prior work has shown both empirically and theoretically that predictors in image JEPAs (without asymmetric information) and teacher-student architectures serve primarily to prevent collapse (grill2020bootstrap; jing2021understanding; tian2021understanding; caron2021emerging; chen2021empirical). Removing these components produces collapsed encoders, i.e., with performance at chance level. Thanks to LeJEPA’s SIGReg loss, we can remove both the predictor and the teacher-student architecture without suffering from collapse, as shown in Table˜4. While a teacher-student configuration does provide a small performance boost for ViT models–consistent with observations in supervised learning via Stochastic Weight Averaging (izmailov2019averagingweightsleadswider)–it is not necessary to prevent collapse. In our setup, we apply SWA on the encoder producing $\mu$ in Equation˜7. Second, recent work demonstrated that register tokens are needed to prevent training instabilities in vision models (oquab2023dinov2; simeoni2025dinov3; darcet2023vision). We show in Table˜1 that such instabilities likely stem from poorly conditioned training objectives. In contrast, LeJEPA does not require register tokens and achieves stable performance with or without them. We thus recommend training without a predictor or register tokens, and optionally applying SWA with ViTs for a possible performance gain.
A major challenge in SSL pretraining is the lack of reliable signals conveying the quality of the learned representation. As a result, it is common to monitor performance on a supervised downstream task, sometimes supplemented with unsupervised embedding statistics (agrawal2022alpha; garrido2023rankme; thilak2023lidar). This process is highly limiting since it requires labeled data that is costly and overly specialized. The issue is further exacerbated in the latest JEPA models, whose training losses exhibit low correlation with downstream performance–and may not even decrease monotonically during training.
In contrast, we find that LeJEPA’s training loss behaves much more favorably–providing a meaningful signal on model quality. First, we show in Figure˜10 the 2D plane spanned by the SIGReg and prediction losses, where a clear trend with downstream task accuracy can be observed. More strikingly, the combined training loss (9) with mixing coefficient $\lambda$ exhibits a very high Spearman correlation (spearman1961proof), denoted $\rho_s$, of about $85\%$ with downstream accuracy–which is considered a strong signal. This relationship holds across datasets and architectures. As a result, a lower LeJEPA training loss reliably indicates better downstream performance.
We can further improve this correlation through a simple scaling law based on the trade-off weighting hyperparameter $\lambda$
By setting $\alpha\approx 0.4$, LeJEPA’s training loss achieves nearly 99% correlation with downstream performance across multiple datasets and models. We depict the changes in $C^{(\alpha)}$ as a function of $\alpha$ on multiple datasets and models in Figure˜11, as well as the LeJEPA training loss against downstream performance in Figure˜19. The strong alignment between LeJEPA’s training loss and model quality enables label-free SSL model selection and cross-validation.
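As an illustration of such label-free ranking, Spearman's $\rho_s$ is simply the Pearson correlation of ranks; a toy sketch with made-up numbers (not the paper's measurements):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation (assumes no ties): Pearson on the ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

# hypothetical checkpoints: if lower training loss tracks higher accuracy,
# ranking by (negated) loss selects models without any labels
losses = np.array([2.1, 1.7, 1.3, 1.0, 0.8])
accs = np.array([55.0, 61.0, 67.0, 71.0, 74.0])
rho = spearman_rho(-losses, accs)   # negate: smaller loss should rank higher
```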
A key promise of self-supervised learning is to learn universal representations that generalize across tasks and domains. However, current frontier foundation models (e.g., DINOv2/v3, IJEPA) are pretrained on natural images, forcing practitioners in specialized domains to collect large amounts of labels for supervised finetuning. In fact, most frontier models cannot be trained directly on those domains, as the number of samples may be small and searching again for the hyper-parameters would be cumbersome yet necessary (assran2022hidden).
To demonstrate LeJEPA’s versatility and ability to resolve that pain point, we propose to pretrain directly on a new domain without any change to the loss or the pretraining pipeline. We select the Galaxy10 dataset, a galaxy morphology classification task that differs significantly from natural images in both visual structure and statistical properties (balestriero2025gaussian). The dataset contains 11,000 training samples across 10 galaxy types. For LeJEPA, we use the default hyper-parameters and pretrain a variety of backbones for 400 epochs. We compare against the latest DINOv2, DINOv3, and IJEPA. We report in Figure˜12 the top-1 accuracy for linear probing, both with a frozen backbone and with full finetuning. We observe that in-domain pretraining with LeJEPA substantially outperforms state-of-the-art frontier models (DINOv2, DINOv3) on both linear probing and full finetuning. Additional datasets and backbones are provided in Table˜5, depicting LeJEPA’s ability to train in-domain even on a dataset with 1,000 samples (flowers102). Coupled with LeJEPA's stability across architectures and hyper-parameters, this result offers a promising alternative in domains not yet covered by the latest frontier models.
We now apply LeJEPA to a larger pretraining dataset, i.e., ImageNet-1k, and to larger backbones such as ViT-Large (0.3B) and ConvNextV2-Huge (0.6B). For those two models, we reach an online linear-probe accuracy on ImageNet-1k of 77.1% and 78.5%, respectively. Beyond in-distribution performance, we also explore transfer learning. For those experiments, our baseline is IJEPA with a ViT-Huge (0.6B), which is the closest to our setup, and we also include a recent improved version of IJEPA with additional stochastic prediction tasks (bar2023stochastic), coined IJEPA + STOP. For LeJEPA, we employ the same recipe as described in Section˜6.1 and report transfer learning performance with a frozen backbone in Table˜2. We consistently outperform IJEPA while employing a smaller model and a shorter training schedule. Beyond top-1 accuracy, we also echo our findings from Section˜6.2 about the quality of LeJEPA’s training loss. In our setup, we observe a very stable and smooth training curve, indicating a stable optimization landscape and removing the need for careful hyperparameter selection (recall Section˜4.2.3). We provide an example on a ViT-gigantic (1.8B parameters) in Figure˜2.
A hallmark of successful self-supervised learning is the emergence of semantically meaningful attention patterns without explicit supervision (caron2021emerging). To assess whether LeJEPA learns such structure, we visualize the attention maps of the learned representations. Following DINO (caron2021emerging), we apply PCA to the embeddings and visualize the first principal components, which reveal clear correspondence to object boundaries and salient regions (Figure˜14). Furthermore, we explore whether these attention patterns can enable unsupervised video segmentation—a challenging task requiring temporal consistency and object understanding. By thresholding the self-attention maps of the [CLS] token, we obtain binary masks that track objects across frames without any segmentation labels during training. As shown in Figure˜13, LeJEPA’s attention naturally segments foreground objects from background with remarkable temporal coherence, suggesting that the learned representations capture both spatial semantics and temporal structure. This emergent capability demonstrates that LeJEPA’s stability-focused objective does not sacrifice the semantic richness of learned features.
We have established a principled theoretical framework for JEPA-based self-supervised learning that fundamentally resolves its core pathologies. Our contributions span theory and practice: we proved that isotropic Gaussian embeddings uniquely minimize worst-case downstream risk, introduced SIGReg as a tractable and provably correct method to enforce this distribution, and demonstrated that this approach eliminates representational collapse by design–and not through ad-hoc combinations of teacher-student networks, stop-gradients, or asymmetric architectures.
We validate LeJEPA across domains and over 60 architectures, including gigantic versions with 1.8B parameters. In spite of its simplicity, LeJEPA matches state-of-the-art performance while requiring fewer than 50 lines of core implementation. Critically, our approach provides what SSL has long needed: a mathematically rigorous foundation that directly informs practical algorithm design.
We would like to thank Mike Rabbat and Lucas Maes for providing valuable feedback on the manuscript.
LeJEPA Appendix
To allow for more flexible evaluation of the pretrained encoder $f_{{\bm{\theta}}}$, it is standard to work with a $k$-NN prober [taunk2019brief], both for regression and classification. We rely on the radial $k$-NN variation that leverages a sample-dependent $k$–improving performance for non-uniform distributions of samples [sun2010adaptive, zhang2017efficient, abu2019effects].
We denote the underlying embedding density as $p_z\in C^{3}$ with derivatives of order up to $3$ bounded, and finite Fisher information and covariance. This regularity condition is fulfilled by current encoders. The unknown labels come from the target function $\eta:\mathbb{R}^{K}\to\mathbb{R}$, assumed $C^{2}$. We handle classification tasks by setting $\eta({\bm{z}})=\mathbb{P}(Y=1\mid{\bm{z}})$. The training set consists of the $N$ embeddings along with their training labels $\{({\bm{z}}_n,\eta({\bm{z}}_n))\}_{n=1}^{N}$, where we denote ${\bm{y}}_n\triangleq\eta({\bm{z}}_n)$. The prediction for a query vector ${\bm{q}}$ is formed as
with ${\bm{y}}({\bm{q}})\triangleq\#\{n:\left\lVert{\bm{z}}_n-{\bm{q}}\right\rVert\leq r_0\}$ counting the number of samples within an $r_0$-radius ball around ${\bm{q}}$. The radius $r_0$ controls how many neighbors’ predictions are averaged to form the query’s prediction. As in the linear probing of Section˜3.1, we can characterize the bias of the estimator Equation˜kNN at a particular query point, as formalized below.
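Before turning to the bias analysis, the prober itself can be sketched as follows (our own minimal version: uniform weights inside the ball, nearest-neighbour fallback when the ball is empty):

```python
import numpy as np

def radius_knn_predict(Z, y, q, r0):
    """Average the labels of all training embeddings within an r0-ball of q.

    Z: (N, K) training embeddings, y: (N,) labels (real values or class
    probabilities), q: (K,) query vector.
    """
    d = np.linalg.norm(Z - q[None, :], axis=1)
    mask = d <= r0
    if not mask.any():
        mask = d == d.min()                # empty ball: nearest neighbour
    return y[mask].mean()
```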
To obtain the integrated bias, i.e., over the distribution of query points, we consider the following two properties. First, the distribution of query points follows the training distribution, i.e., ${\bm{q}}\sim p_z$; second, the target function $\eta$ has a gradient that is mean-zero and isotropic, with $\mathbb{E}\big[\nabla\eta({\bm{z}})\nabla\eta({\bm{z}})^{\top}\big]=\tau_g^{2}I_d$ and $\tau_g^{2}\in(0,\infty)$ uniformly in ${\bm{z}}$. We also assume some finite scalar constraint on the covariance of the embeddings, such as $\operatorname{Tr}(\Sigma)=c$ or $\|\Sigma\|_F=c$ for a finite constant $c$.
As a result, we now have a unique minimizer for the optimal embedding density for both the linear and k-NN probes.
As an alternative to (kNN), it is also common to leverage kernel methods, which we consider in this section.
Consider a kernel $K:\mathbb{R}^{K}\to\mathbb{R}$ with the following standard properties
for some $\mu_2(K)\in(0,\infty)$ and some bandwidth $h>0$, and denoting $K_h(t)\triangleq h^{-d}K(t/h)$, we remind the reader that the Nadaraya–Watson estimator, introduced in nadaraya1964estimating, watson1964smooth, at a query ${\bm{q}}\in\mathbb{R}^{d}$ is
Similarly to (kNN), we will see that the performance of (NW) depends crucially on the distribution of the training points. We have access to our dataset of inputs from $p_z$, and for each sample ${\bm{z}}_n$ the corresponding target is given by $\eta({\bm{z}}_n)=\mathbb{E}[Y_n\mid{\bm{z}}_n]$. We also denote the conditional variance of the target at a point as $v(x)=\mathrm{Var}(Y_i\mid X_i=x)$. We follow the regularity conditions of the $k$-NN probing derivations and additionally assume that $p$ has sufficiently light tails so that for each coordinate $j$, $\lim_{|x|\to\infty}p(x)=0$ and $\lim_{|x|\to\infty}x_j p(x)=0$. We first derive the pointwise bias and variance of $\hat{{\bm{y}}}({\bm{q}})$.
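For reference, the estimator (NW) is a short function once the kernel is fixed; a sketch with a Gaussian kernel (our own names, $h$ the bandwidth):

```python
import numpy as np

def nadaraya_watson(Z, y, q, h):
    """NW estimate at query q: kernel-weighted average of training labels,
    m_hat(q) = sum_n K_h(z_n - q) y_n / sum_n K_h(z_n - q), Gaussian K.
    """
    d2 = ((Z - q[None, :]) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2.0 * h ** 2))       # unnormalized Gaussian kernel
    return float((w * y).sum() / w.sum())
```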
We now show that, under a fixed mean and total-covariance constraint on $p_z$, the isotropic Gaussian distribution uniquely minimizes the bias and variance of the kernel regression estimator at any test point. We restrict the smoothness class of the target function using
allowing us to formalize below the worst-case integrated bias and the optimal density for $z$.
Our proof follows standard derivations for studying the bias of an estimator. Let us consider the ridge regression problem (Tikhonov-regularized least squares) with closed-form estimator
The labels are formed from the ground-truth parameter $\bm{\beta}_{\text{true}}$ with centered error, as per $\mathbf{Y}=\mathbf{X}\bm{\beta}_{\text{true}}+\bm{\varepsilon}$ where $\mathbb{E}[\bm{\varepsilon}]=\mathbf{0}$. We can now look at the bias of our estimator, given by
We will now compare that bias when ${\bm{X}}$ has an isotropic versus an anisotropic covariance with the same total variance:
For any anisotropic covariance matrix of ${\bm{X}}$, denote by ${\bm{q}}_1$ the eigenvector with smallest eigenvalue, and let $\kappa>0$ be a positive constant. We now define
leading to
Since $\lambda_p<\bar{\lambda}$ (strict inequality when not isotropic):
As a result, whenever the covariance matrix of ${\bm{X}}$ is anisotropic, there will be downstream tasks for which the estimator bias is increased compared to having an isotropic covariance matrix. Anisotropic covariance structure thus amplifies regularization bias when the true parameter vector aligns unfavorably with the data’s covariance structure. ∎
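This can be checked numerically in the population (infinite-data) limit, where the ridge bias is $-\lambda(\Sigma+\lambda I)^{-1}\bm{\beta}_{\text{true}}$; the numbers below are illustrative choices of ours, not from the paper:

```python
import numpy as np

def ridge_bias_norm(Sigma, beta, lam):
    """Norm of the population-limit ridge bias -lam * (Sigma + lam I)^{-1} beta."""
    d = Sigma.shape[0]
    bias = -lam * np.linalg.solve(Sigma + lam * np.eye(d), beta)
    return float(np.linalg.norm(bias))

lam = 0.1
iso = np.eye(2)                     # isotropic covariance, trace 2
aniso = np.diag([1.9, 0.1])         # anisotropic covariance, same trace
beta = np.array([0.0, 1.0])         # aligned with the smallest eigenvalue
b_iso = ridge_bias_norm(iso, beta, lam)
b_aniso = ridge_bias_norm(aniso, beta, lam)
```

With the same total variance, shrinkage along the low-variance eigendirection inflates the bias whenever $\bm{\beta}_{\text{true}}$ aligns with it.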
We use the same formula as in Section˜B.1 with $\lambda_{\rm wd}=0$. We first see that the estimator is unbiased. We now leverage that result to compute the covariance matrix of the estimator
where we used the eigendecomposition:
The function $f(x)=\frac{1}{x}$ is strictly convex on $(0,\infty)$, allowing us to leverage Jensen’s inequality:
The inequality is strict whenever the eigenvalues $\{\lambda_j\}_{j=1}^{p}$ are not all equal. ∎
Under $P$, conditional expectations of $\hat{\eta}(x)$ coincide with the normalized ball average
which is the key surrogate used below.
Ball integrals. For computations we use (by symmetry), for any $r>0$:
Fix $x\in\mathbb{R}^{d}$ and write $z\in\mathrm{B}(0,r_0)$ for local displacements. Assume $p\in C^{3}$ and $\eta\in C^{2}$ with bounded derivatives on the region of interest, and perform a second-order Taylor expansion:
with remainders satisfying $|R_{\eta}(x;z)|\leq C_{\eta}\left\lVert z\right\rVert^{3}$ and $|R_{p}(x;z)|\leq C_{p}\left\lVert z\right\rVert^{3}$ uniformly for $\left\lVert z\right\rVert\leq r_0$. Using the ball identities $\int_{B(0,r)}z\,dz=0$ and $\int_{B(0,r)}zz^{\top}dz=\frac{v_d r^{d+2}}{d+2}I_d$ and collecting terms up to order $r_0^{d+2}$, we simplify the denominator as
since $\int z\,dz=0$ and $\int z^{\top}H_p z\,dz=\mathrm{tr}(H_p)\frac{v_d r_0^{d+2}}{d+2}$, and the numerator as
Cubic terms vanish by symmetry, and quartic terms are $O(r_0^{d+4})$. Subtract $\eta(x)\mathcal{D}(x)$ to obtain the bias numerator:
Write $\mathcal{D}(x)=v_d r_0^{d}p(x)\big(1+\alpha(x)r_0^{2}+O(r_0^{3})\big)$ where $\alpha(x):=\frac{1}{2(d+2)p(x)}\mathrm{tr}(H_p(x))$. Then
uniformly on $\mathcal{K}$. This gives the bias formula
Recall from Section˜B.3 that the bias term at sample ${\bm{x}}$ is given by
where we defined $A(x)\triangleq\nabla\eta(x)\cdot\nabla\log p(x)$ and $C(x)\triangleq\frac{1}{2}\Delta\eta(x)$. We now square and take the expectation over $X\sim p$ and the isotropic gradient prior
We will derive each term separately, recalling that we assume an isotropic gradient prior for $\eta$, i.e., $\mathbb{E}\big[\nabla\eta(x)\big]=0$ and $\mathbb{E}\big[\nabla\eta(x)\nabla\eta(x)^{\top}\big]=\tau_g^{2}I_d$, for some $\tau_g^{2}\in(0,\infty)$.
Using $v(x):=\nabla\log p(x)$ for brevity:
recovering the Fisher-information functional $J(p)$, scaled by $\tau_g^{2}$
Under the prior, $\nabla\eta$ is mean-zero and isotropic; if, additionally, $\Delta\eta$ is uncorrelated with $\nabla\eta$ and has zero mean (or is bounded and mean-zero after centering), then $\mathbb{E}_{\eta}[A(x)C(x)]=0$. If one does not assume the orthogonality/vanishing covariance above, then $\mathbb{E}[A(X)C(X)]$ is a finite constant (depending on the joint law of the derivatives of $\eta$), and the cross term contributes
not $o(r_0^{4})$. In that general case, the leading $p$-dependent term of $\mathbb{E}[\mathrm{Bias}(X)^{2}]$ is still the score-gradient term $\tau_g^{2}J(p)$.
which is independent of $p$, hence $\mathbb{E}\big[C(X)^{2}\big]=O(1)$
Substituting into (15):
We show that, among all mean-zero distributions $p$ on $\mathbb{R}^{d}$ with a given scalar constraint on the covariance (trace, determinant, Frobenius norm, or spectral radius), the density that minimizes the Fisher-information functional
We proceed in two steps: (i) for a fixed covariance matrix $\Sigma\succ 0$, $J(p)$ is minimized by the Gaussian $\mathcal{N}(0,\Sigma)$ and attains the value $\mathrm{tr}(\Sigma^{-1})$; (ii) for each scalar constraint, $\mathrm{tr}(\Sigma^{-1})$ is minimized by $\Sigma=sI_d$ for the appropriate scalar $s>0$.
Consider the location family $p_{\theta}(x):=p(x-\theta)$, $\theta\in\mathbb{R}^{d}$. Its Fisher-information matrix at $\theta$ is
so that $J(p)=\mathrm{tr}\,\mathcal{I}(\theta)$. The estimator $T(X)\equiv X$ is unbiased for $\theta$ under $p_{\theta}$, with $\mathrm{Cov}(T)=\Sigma$. The matrix Cramér–Rao bound gives $\mathrm{Cov}(T)\succeq\mathcal{I}(\theta)^{-1}$, i.e., $\mathcal{I}(\theta)\succeq\Sigma^{-1}$. Taking traces yields $J(p)\geq\mathrm{tr}(\Sigma^{-1})$. Equality in the matrix Cramér–Rao bound holds if and only if the score is an affine function of $X-\theta$, i.e., $\nabla\log p_{\theta}(X)=A(X-\theta)$ a.s. for some matrix $A$; integrating this identity shows $p_{\theta}$ is Gaussian with precision matrix $-A$, hence $p=\mathcal{N}(0,\Sigma)$. ∎
We now solve $\min\sum_{i}1/\lambda_i$ under each scalar constraint; in every case the minimum is attained when all $\lambda_i$ are equal, i.e., $\Sigma=sI_d$.
Given $\mathrm{tr}(\Sigma)=\sum_{i}\lambda_i=t>0$, by Cauchy–Schwarz,
with equality if and only if $\lambda_1=\cdots=\lambda_d$. Hence
Given $\|\Sigma\|_F^{2}=\sum_{i}\lambda_i^{2}=c^{2}>0$, minimize $f(\lambda):=\sum_{i}1/\lambda_i$ over $\lambda_i>0$ subject to $g(\lambda):=\sum_{i}\lambda_i^{2}=c^{2}$. The Lagrangian
has first-order conditions $-\lambda_i^{-2}+2\nu\lambda_i=0$ for all $i$, i.e., $\lambda_i^{3}=\frac{1}{2\nu}$, so all $\lambda_i$ are equal. Imposing $\sum_i\lambda_i^{2}=c^{2}$ yields $\lambda_i=c/\sqrt{d}$, hence
Let the spectral radius be constrained by $\rho(\Sigma)=\max_{i}\lambda_{i}\leq r$ for some $r>0$. Since $x\mapsto 1/x$ is strictly decreasing on $(0,\infty)$,
(The same conclusion holds if the constraint is $\rho(\Sigma)=r$, since one may take all eigenvalues equal to $r$.)
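These eigenvalue calculations can be sanity-checked numerically: under a fixed trace budget, $\sum_i 1/\lambda_i$ is minimized by equal eigenvalues. A minimal sketch (the dimension and trace value below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 5, 10.0  # illustrative dimension and trace budget

def tr_inv(eigs):
    """tr(Sigma^{-1}) for a covariance matrix with eigenvalues `eigs`."""
    return float(np.sum(1.0 / eigs))

# Isotropic candidate Sigma = (t/d) I_d: attains tr(Sigma^{-1}) = d^2/t.
iso = tr_inv(np.full(d, t / d))

# Random competitors with the same trace never do better.
for _ in range(1000):
    lam = rng.uniform(0.1, 1.0, size=d)
    lam *= t / lam.sum()  # rescale so the eigenvalues sum to t
    assert tr_inv(lam) >= iso - 1e-9

print(iso)  # equals d**2 / t = 2.5
```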
Combining Lemma B.4 with the solutions (a)–(d), we obtain:
For any admissible $p$ with covariance $\Sigma$, Lemma B.4 gives $J(p)\geq\mathrm{tr}(\Sigma^{-1})$. Minimizing the right-hand side under the stated scalar constraint yields $\Sigma=sI_{d}$ by the calculations in (a)–(d). Equality in Lemma B.4 holds if and only if $p$ is Gaussian with that covariance, hence $p_{G}$ uniquely attains the bound. ∎
Write the numerator and denominator of $\hat{m}(x)$ as
so that $\hat{m}(x)=\frac{B_{n}(x)}{A_{n}(x)}$. Bias. Compute expectations using independence and a change of variables. For the denominator,
where we used symmetry $\int tK(t)\,dt=0$ and isotropy $\int tt^{\top}K(t)\,dt=\mu_{2}(K)I_{d}$, which implies $\int t^{\top}\nabla^{2}p(x)t\,K(t)\,dt=\mu_{2}(K)\,\mathrm{tr}(\nabla^{2}p(x))=\mu_{2}(K)\,\Delta p(x)$. Similarly, for the numerator,
where the last step uses the fact that $\mathrm{tr}\big(\nabla^{2}(mp)\big)=p\Delta m+m\Delta p+2\nabla m^{\top}\nabla p$ by the product rule and symmetry of mixed derivatives.
with $a_{0}=m(x)p(x)$, $a_{2}=\frac{\mu_{2}(K)}{2}\big(p\Delta m+m\Delta p+2\nabla m^{\top}\nabla p\big)(x)$, $b_{0}=p(x)$, and $b_{2}=\frac{\mu_{2}(K)}{2}\Delta p(x)$. This yields
Therefore,
We prove that:
Expanding the left-hand side:
Square and integrate against p(x)p(x):
where we used $(a+b)^{2}\leq 2a^{2}+2b^{2}$ pointwise. Since $|\Delta m(x)|\leq B$ for all $x$, we have
which, combined with the bounds above, gives the desired result. Similarly, for the integrated variance,
We first recall the original Cramér–Wold theorem, which is stated over all directions (not only unit-norm ones).
Our proof follows the same argument as in Section B.8. Necessity is immediate: if $X\stackrel{d}{=}Y$, then every measurable function of $X$ has the same distribution as the corresponding function of $Y$; the linear map $x\mapsto\langle u,x\rangle$ for $u\in\mathbb{S}^{d-1}$ is a special case. For sufficiency, assume $\langle u,X\rangle\stackrel{d}{=}\langle u,Y\rangle$ for all $u\in\mathbb{S}^{d-1}$. Let $\varphi_{X}(t):=\mathbb{E}\big[e^{i\langle t,X\rangle}\big]$ and $\varphi_{Y}(t):=\mathbb{E}\big[e^{i\langle t,Y\rangle}\big]$ denote the characteristic functions of $X$ and $Y$. Fix an arbitrary $t\in\mathbb{R}^{d}$; if $t=0$, then $\varphi_{X}(0)=\varphi_{Y}(0)=1$. If $t\neq 0$, write $t=su$ with $s:=\|t\|>0$ and $u:=t/\|t\|\in\mathbb{S}^{d-1}$. By the assumption, $\langle u,X\rangle\stackrel{d}{=}\langle u,Y\rangle$, hence for this $u$ and $s$ we have
Thus $\varphi_{X}(t)=\varphi_{Y}(t)$ for all $t\in\mathbb{R}^{d}$, i.e., $\varphi_{X}\equiv\varphi_{Y}$ on $\mathbb{R}^{d}$. By the uniqueness theorem for characteristic functions, this implies $X\stackrel{d}{=}Y$. (ii) Define $\psi_{n,t}:=\mathbb{E}\big[e^{i\langle t,X_{n}\rangle}\big]$ and $\psi_{t}:=\mathbb{E}\big[e^{i\langle t,X\rangle}\big]$. Fix $t\in\mathbb{R}^{d}$ and decompose $t=su$ with $s:=\|t\|\geq 0$ and $u\in\mathbb{S}^{d-1}$ (take, e.g., $u=t/\|t\|$ if $t\neq 0$, and any $u$ if $t=0$). The map $g_{s}:\mathbb{R}\to\mathbb{R}$, $g_{s}(x)=sx$, is continuous. By the continuous mapping theorem applied to the real-valued random variables $\langle u,X_{n}\rangle\xrightarrow{d}\langle u,X\rangle$, we obtain
Hence, for every fixed $t\in\mathbb{R}^{d}$, the one-dimensional projections satisfy $\langle t,X_{n}\rangle\xrightarrow{d}\langle t,X\rangle$, which in turn yields pointwise convergence of characteristic functions:
Therefore, by Lévy’s continuity theorem, $X_{n}\xrightarrow{d}X$. This completes the proof. ∎
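The role of directions can be illustrated empirically with two laws that share axis-aligned marginals but differ along a specific projection. The sketch below (illustrative sizes, using SciPy's two-sample Kolmogorov–Smirnov test) mirrors the "X"-shaped construction used elsewhere in the appendix: coordinates $(0,1)$ of $Q$ satisfy $z_1=\pm z_0$, so each coordinate is still standard normal while the diagonal direction separates $P$ from $Q$:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
n, d = 4000, 8

# P: standard Gaussian. Q: identical except coordinates (0, 1) follow the
# degenerate "X" law z1 = ±z0 (each marginal is still standard normal).
P = rng.standard_normal((n, d))
Q = rng.standard_normal((n, d))
Q[:, 1] = np.sign(rng.standard_normal(n)) * Q[:, 0]

def proj_ks(u):
    """Two-sample KS statistic between the projections of P and Q on u."""
    u = u / np.linalg.norm(u)
    return ks_2samp(P @ u, Q @ u).statistic

ks_axis = proj_ks(np.eye(d)[0])                 # axis direction: no signal
ks_diag = proj_ks(np.eye(d)[0] + np.eye(d)[1])  # diagonal: half of Q's
                                                # projection collapses to 0
print(ks_axis, ks_diag)  # the diagonal statistic is far larger
```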
$P=Q$ if and only if $P_{a}=Q_{a}$ for all $a\in\mathbb{S}^{d-1}$ (population-level equivalence of laws).
$A_{n}$ are finite sets with mesh $\Delta(A_{n}):=\sup_{u\in\mathbb{S}^{d-1}}\min_{a\in A_{n}}\|u-a\|\to 0$ as $n\to\infty$.
If $P\neq Q$, there exists a separating direction $a^{\star}\in\mathbb{S}^{d-1}$ and a neighborhood $U$ of $a^{\star}$ such that
(Intuitively: near a truly separating direction, the 1D statistic eventually exceeds the global null threshold with probability tending to 1.)
(i) Under $H_{0}:P=Q$, our assumption implies no separating direction exists at the population level, and the calibration of $u_{n}(\alpha)$ ensures $\Pr(M_{n}\geq u_{n}(\alpha))\leq\alpha$ for all $n$, hence $\limsup_{n\to\infty}\Pr(\text{reject }H_{0})\leq\alpha$. (ii) Suppose $P\neq Q$. Our assumption guarantees at least one separating direction $a^{\star}$ with $P_{a^{\star}}\neq Q_{a^{\star}}$, together with a neighborhood $U$ of $a^{\star}$ in which the projection statistics exceed the global null threshold with probability tending to 1:
By assumption, for all large nn the set AnA_{n} contains at least one direction an∈Ua_{n}\in U (dense coverage). Therefore,
which proves consistency. ∎
For each case, consider the function $g(a)$ on $\mathbb{S}^{D-1}$ defined by the quantity of interest (CF, CDF, or moment) at a fixed $t$ or $k$. Since $f\in H^{\alpha}(\mathbb{R}^{D})$, the mapping $a\mapsto g(a)$ is in $H^{\alpha}(\mathbb{S}^{D-1})$ for each fixed $t$ or $k$.
Given $M$ samples $\{a_{i}\}_{i=1}^{M}$ on the sphere, the best possible reconstruction of $g$ from its values at these points is given by spherical interpolation. By classical results on Sobolev spaces and spherical harmonics (see, e.g., [narcowich2006localized]), the $L^{2}$ interpolation error for functions in $H^{\alpha}(\mathbb{S}^{D-1})$ using $M$ points is bounded by
where $g^{*}$ is the interpolant matching $g$ at the $M$ sampled points. The interpolation error bound on the sphere follows from the theory of spherical harmonics and Marcinkiewicz–Zygmund (MZ) inequalities. Any $f\in H^{\alpha}(\mathbb{S}^{d})$ admits a spherical harmonics expansion, and the best $L^{2}$ approximation by harmonics of degree at most $L$ satisfies
where $P_{L}f$ is the projection onto harmonics of degree $\leq L$ [narcowich2006localized, Lemma 2.1]. If $M$ points are distributed quasi-uniformly on $\mathbb{S}^{d}$, then for $L\sim cM^{1/d}$ the set forms a Marcinkiewicz–Zygmund (MZ) set for degree $L$ [mhaskar2001spherical, Theorem 1.1]. This allows reconstruction of any function in the space of harmonics of degree at most $L$ from its values at these points, and the $L^{2}$ interpolation error for $f$ is bounded by
where $I_{M}f$ is any interpolant matching $f$ at the $M$ points [narcowich2006localized, Theorem 3.1]. Substituting $L\sim cM^{1/d}$ yields the rate $M^{-\alpha/d}$, and thus
with explicit $C(d,\alpha)$ as in the main theorem. Integrating (or summing) over $t$ (for CF and CDF) or $k$ (for moments, with weights $w_{k}$) yields the stated bounds. The explicit constant $C(D,\alpha)$ arises from the theory of spherical Sobolev spaces and is given above.
For the moment case, the sum over $k$ is weighted to ensure convergence, since higher moments may grow rapidly. The weights $w_{k}$ can be chosen, for example, as $w_{k}=1/k!$.
Pick distinct $x_{0},\dots,x_{K+1}\in\mathbb{R}$ and consider the linear map $A:\mathbb{R}^{K+2}\to\mathbb{R}^{K+1}$, $(Ap)_{r}=\sum_{j=0}^{K+1}p_{j}x_{j}^{r}$ for $r=0,\dots,K$. Then $\mathrm{rank}(A)\leq K+1$, so $\ker(A)\neq\{0\}$. Let $v\in\ker(A)\setminus\{0\}$; from $(Ap)_{0}=\sum_{j}p_{j}$, we get $\sum_{j}v_{j}=0$, hence $v$ has both positive and negative entries. Choose a strictly positive probability vector $p$ and $\varepsilon>0$ small enough that $p^{\pm}:=p\pm\varepsilon v$ remain probability vectors. Then $Ap^{+}=Ap^{-}$, so the distributions supported on $\{x_{j}\}$ with masses $p^{\pm}$ are distinct yet match moments up to order $K$.
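This kernel construction is easy to instantiate numerically. The sketch below (with arbitrary support points and $K=3$) builds two distinct discrete distributions that match all moments up to order $K$:

```python
import numpy as np

K = 3
x = np.arange(K + 2, dtype=float)  # K+2 distinct support points 0, 1, ..., K+1

# (Ap)_r = sum_j p_j x_j^r for r = 0..K: more columns than rows,
# so the kernel of A is nontrivial.
A = np.vstack([x**r for r in range(K + 1)])
v = np.linalg.svd(A)[2][-1]        # right singular vector with sigma = 0
assert np.allclose(A @ v, 0, atol=1e-10)

p = np.full(K + 2, 1.0 / (K + 2))  # strictly positive probability vector
eps = 0.5 * p.min() / np.abs(v).max()
p_plus, p_minus = p + eps * v, p - eps * v

# Both remain probability vectors (row r=0 of A forces sum(v) = 0)...
assert (p_plus > 0).all() and np.isclose(p_plus.sum(), 1.0)
assert (p_minus > 0).all() and np.isclose(p_minus.sum(), 1.0)
# ...they are distinct, yet share all moments of order 0..K.
assert not np.allclose(p_plus, p_minus)
for r in range(K + 1):
    assert np.isclose(p_plus @ x**r, p_minus @ x**r)
```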
Fix the Gaussian weight
Let the empirical CF be
and consider the V-statistic estimator
We use only that $|e^{itX}|=1$, $|\varphi_{P}(t)|\leq 1$, $|\varphi_{G}(t)|\leq 1$, and integrability of $w_{s}$. For each $i$, differentiate under the integral (dominated convergence applies because the integrand and its derivative are bounded):
since $|\hat{\varphi}_{N}(t)|\leq 1$ and $|\varphi_{G}(t)|\leq 1$,
using $\int_{\mathbb{R}}e^{-s^{2}t^{2}}|t|\,dt=1/s^{2}$.
for some absolute constant $C$ arising from the bounded factors and the product rule. Hence ECF gradients are uniformly bounded and Lipschitz, with scale controlled only by $(N,s)$.
(Moment sample-gradients are polynomial in $X_{i}$ and unbounded for $k\geq 2$.) Let $\hat{D}_{V}$ be as above. Define the moment objective
for a symmetric positive semidefinite $W\in\mathbb{R}^{k\times k}$ and Gaussian target moments $\mu=\mathbb{E}_{G}[\phi(Y)]$. For each $i$,
The gradient formula follows from the chain rule and the linearity of $\bar{\phi}$. Let $c:=W(\bar{\phi}-\mu)$ and write $c_{r}$ for its $r$-th coordinate. Then
which is a polynomial in $X_{i}$ of degree $\deg=\max\{r-1:c_{r}\neq 0\}\leq k-1$. In particular, if $c_{k}\neq 0$ (the generic case, when the top-weighted deviation is nonzero), then
The expression is a nonconstant polynomial in $X_{i}$ of degree $\deg\leq k-1$ whenever some $c_{r}\neq 0$ with $r\geq 2$. Thus the gradient cannot be uniformly bounded on $\mathbb{R}$. If $c_{k}\neq 0$, the leading term dominates and the magnitude grows like $|X_{i}|^{k-1}$, proving unboundedness for $k\geq 2$. ∎
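The contrast between the two gradient behaviors can be observed directly. The sketch below (all constants illustrative) compares the per-sample gradient of a trapezoid-quadrature ECF objective with that of a fourth-moment objective as one sample is pushed toward an outlier:

```python
import numpy as np

rng = np.random.default_rng(0)
N, s = 256, 1.0
X = rng.normal(size=N)
t = np.linspace(-5.0, 5.0, 201)   # quadrature grid
w = np.exp(-(s * t) ** 2)         # Gaussian weight w_s
dt = t[1] - t[0]

def trap(y):
    """Trapezoid rule on the uniform grid t."""
    return dt * (y.sum() - 0.5 * (y[0] + y[-1]))

def ecf_grad(X, i):
    """d/dX_i of int w_s |phi_N - phi_G|^2 dt (phi_G: standard normal CF)."""
    phi_N = np.exp(1j * np.outer(t, X)).mean(axis=1)
    phi_G = np.exp(-t**2 / 2)
    dphi = 1j * t / len(X) * np.exp(1j * t * X[i])  # d phi_N / d X_i
    return trap(2 * w * np.real(np.conj(phi_N - phi_G) * dphi))

def moment_grad(X, i):
    """d/dX_i of (mean(X^4) - 3)^2: grows like X_i^3."""
    return 2 * (np.mean(X**4) - 3) * 4 * X[i] ** 3 / len(X)

ecf_grads, mom_grads = [], []
for outlier in (1.0, 10.0, 100.0):
    X[0] = outlier
    ecf_grads.append(abs(ecf_grad(X, 0)))
    mom_grads.append(abs(moment_grad(X, 0)))

print(ecf_grads)  # stays O(1/(N s^2)) no matter the outlier
print(mom_grads)  # blows up polynomially with the outlier
```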
Fix $t\in\mathbb{R}^{d}$ and abbreviate $Z_{j}:=e^{\mathrm{i}t^{\top}X_{j}}$, so that $\phi_{n}(t)=\frac{1}{n}\sum_{j=1}^{n}Z_{j}$. Note that $|Z_{j}|=1$ almost surely (since $t^{\top}X_{j}\in\mathbb{R}$), and $\mathbb{E}[Z_{j}]=\phi_{\theta}(t)$ for all $j$. We start from the algebraic identity
Taking expectations term by term gives
Since the $Z_{j}$ are i.i.d.,
By dominated convergence, $\mathbb{E}[\nabla_{\theta}D_{n}(t)]=\nabla_{\theta}\mathbb{E}[D_{n}(t)]$, hence
In practice one replaces $\int_{\mathbb{R}}w(t)(\cdot)\,dt$ by a deterministic quadrature on a uniform grid $t_{k}\in[-T,T]$ with weights $\omega_{k}$ (e.g., the trapezoidal rule) and a Gaussian window $w(t)=e^{-\alpha t^{2}}$. All statements above remain valid with the integral replaced by $\sum_{k}\omega_{k}(\cdot)$:
Since the grid and weights are deterministic, they do not affect unbiasedness with respect to sampling; they only introduce a deterministic approximation error to the target functional $L(\theta)$.
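The quality of such a deterministic quadrature is easy to probe on an integrand with a known closed form. The sketch below (illustrative window $\alpha$, frequency $b$, and grid half-width $T$) compares the trapezoid sum against the exact value $\int e^{-\alpha t^2}\cos(bt)\,dt=\sqrt{\pi/\alpha}\,e^{-b^2/(4\alpha)}$:

```python
import numpy as np

alpha, b, T = 0.5, 3.0, 6.0  # illustrative window, frequency, grid half-width
exact = np.sqrt(np.pi / alpha) * np.exp(-b**2 / (4 * alpha))

def trapezoid(T, K):
    """Trapezoid quadrature of int w(t) cos(bt) dt on K uniform points."""
    t = np.linspace(-T, T, K)
    y = np.exp(-alpha * t**2) * np.cos(b * t)
    dt = t[1] - t[0]
    return dt * (y.sum() - 0.5 * (y[0] + y[-1]))

errs = [abs(trapezoid(T, K) - exact) for K in (17, 33, 65)]
print(errs)  # the error decays very rapidly thanks to the Gaussian window
```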
Given that $\mathbb{E}[\langle\mathbf{X},\mathbf{a}\rangle]=0$ for all unit vectors $\mathbf{a}$, and noting that $\langle\mathbf{X},\mathbf{a}\rangle=\mathbf{a}^{T}\mathbf{X}$, we have:
By linearity of expectation:
Let $\bm{\mu}=\mathbb{E}[\mathbf{X}]$. We claim that $\bm{\mu}=\mathbf{0}$. Suppose, for the sake of contradiction, that $\bm{\mu}\neq\mathbf{0}$. Then $\|\bm{\mu}\|_{2}>0$. Define the unit vector:
Since $\mathbf{a}^{*}$ is a unit vector, equation (35) implies:
This contradiction establishes that $\bm{\mu}=\mathbf{0}$.
Since $\mathbb{E}[\mathbf{X}]=\mathbf{0}$, we have:
Since $\mathbb{E}[\mathbf{X}]=\mathbf{0}$, the covariance matrix is $\mathrm{Cov}(\mathbf{X})=\mathbb{E}[\mathbf{X}\mathbf{X}^{T}]$. Let $\bm{\Sigma}=\mathrm{Cov}(\mathbf{X})$. The variance condition gives us:
We now show that $\bm{\Sigma}=\mathbf{I}_{d}$. Step 1: Diagonal entries. For $i\in\{1,2,\ldots,d\}$, let $\mathbf{e}_{i}$ denote the $i$-th standard basis vector. Setting $\mathbf{a}=\mathbf{e}_{i}$ in equation (42):
Applying equation (42):
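Taken together, the steps above show that directional means force $\bm{\mu}=\mathbf{0}$ and directional variances along $\mathbf{e}_i$ and $(\mathbf{e}_i+\mathbf{e}_j)/\sqrt{2}$ force $\bm{\Sigma}=\mathbf{I}_d$. A numerical sanity check on a sample whose population law is $\mathcal{N}(\mathbf{0},\mathbf{I})$ (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 200_000
X = rng.standard_normal((n, d))

# Every unit-direction projection has mean ~0 and variance ~1.
for _ in range(20):
    a = rng.standard_normal(d)
    a /= np.linalg.norm(a)
    proj = X @ a
    assert abs(proj.mean()) < 0.02        # directional mean ~ 0
    assert abs(proj.var() - 1.0) < 0.02   # directional variance ~ 1

# Directional variances determine Sigma: a = e_i reads off Sigma_ii, and
# a = (e_i + e_j)/sqrt(2) reads off (Sigma_ii + Sigma_jj)/2 + Sigma_ij,
# which isolates the off-diagonal entry.
S = np.cov(X.T, bias=True)
i, j = 0, 1
a = np.zeros(d)
a[i] = a[j] = 1 / np.sqrt(2)
off = (X @ a).var() - 0.5 * (S[i, i] + S[j, j])
assert abs(off - S[i, j]) < 1e-8
print(np.round(S, 2))  # close to the identity matrix
```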
Foundation: the linear regression model. We start with the standard linear regression model:
$\mathbf{y}=[y_{1},y_{2},\ldots,y_{n}]^{T}\in\mathbb{R}^{n}$ is the response vector
$\mathbf{X}\in\mathbb{R}^{n\times p}$ is the design matrix with $\mathbf{X}_{ij}=x_{ij}$
The error assumption means:
Step 1: Deriving the OLS estimator. To find the OLS estimator, we minimize the sum of squared residuals:
Taking the derivative with respect to $\bm{\beta}$:
Assuming $\mathbf{X}^{T}\mathbf{X}$ is invertible:
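The derivation can be checked numerically on synthetic data (illustrative sizes and coefficients): solving the normal equations $\mathbf{X}^{T}\mathbf{X}\bm{\beta}=\mathbf{X}^{T}\mathbf{y}$ agrees with NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.5, -2.0, 0.5])  # illustrative ground truth
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# OLS via the normal equations (assumes X^T X is invertible).
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred equivalent: least squares via SVD.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(beta_normal, beta_lstsq)
print(beta_normal)  # close to beta_true
```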
(Figure panels: dimension=128 with 10 slices; dimension=128 with 100 slices; dimension=1024 with 100 slices.)
Quasi-Monte Carlo (QMC) methods, such as the Sobol sequence, are widely used to generate low-discrepancy samples in the unit hypercube, providing improved uniformity over purely random sampling. To obtain samples uniformly distributed on the hypersphere, each QMC point is mapped to a standard normal vector via the inverse cumulative distribution function (CDF) and then projected onto the sphere by normalization. This approach leverages the rotational invariance of the multivariate normal distribution, ensuring that the resulting directions are uniformly distributed on the sphere’s surface. While the low-discrepancy property is not strictly preserved under this nonlinear mapping, the resulting samples are empirically more uniform than random samples and are standard in high-dimensional applications [marsaglia1972choosing, dick2010digital, caflisch1998monte].
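A sketch of this pipeline using SciPy's scrambled Sobol sampler (the dimension and sample count below are illustrative):

```python
import numpy as np
from scipy.stats import norm, qmc

def qmc_sphere(M, d, seed=0):
    """Quasi-uniform directions on the sphere: Sobol points in (0,1)^d ->
    inverse normal CDF -> normalize (rotational invariance of N(0, I))."""
    u = qmc.Sobol(d=d, scramble=True, seed=seed).random(M)
    g = norm.ppf(u)  # map each coordinate to a standard normal
    return g / np.linalg.norm(g, axis=1, keepdims=True)

A = qmc_sphere(256, 16)  # M a power of 2 preserves Sobol balance properties
assert np.allclose(np.linalg.norm(A, axis=1), 1.0)  # unit directions
print(np.linalg.norm(A.mean(axis=0)))  # near 0 by symmetry
```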
Let $X_{1}<X_{2}<\dots<X_{n}$ denote an ordered random sample of size $n$ from a standard normal distribution. Also, let $\mathbf{m}=(m_{1},m_{2},\ldots,m_{n})$ be the vector of expected values of standard normal order statistics, and let $\mathbf{V}=(v_{ij})$ be the corresponding $n\times n$ covariance matrix, so that
The $W$ test statistic [shapiro1965analysis] for normality is then denoted by
[shapiro1972approximate] suggested replacing the covariance matrix $\mathbf{V}$ by the identity matrix $\mathbf{I}$, because for large samples the observations $Y_{i}$ may be treated as if they were independent (see [gupta1952estimation]). Another asymptotic extension was suggested by [weisburg1975approximate],
building atop the approximation of [elfving1947asymptotical] but using $3/8$ instead of $\pi/8$.
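The $W$ statistic itself is available in SciPy; an illustrative sketch contrasting a Gaussian sample with a skewed one:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
gauss = rng.standard_normal(500)
skewed = rng.exponential(size=500)

W_g, p_g = shapiro(gauss)   # W near 1: consistent with normality
W_s, p_s = shapiro(skewed)  # W well below 1, tiny p-value: rejects normality
print(W_g, W_s)
```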
[rahman1997modification] proposed another variation using the approximation for the expected values of order statistics given by [blom1958statistical] and the approximations for the elements of the variance–covariance matrix given by [blom1958statistical, mosteller2006some]. These approximations are
We know (see [hammersley1954estimation, plackett1958linear])
skewness (see https://www.jstor.org/stable/2334770):
Table: S6.T1: ViT-Large/14 pretrained on inet1k for 100 epochs and evaluated with frozen-backbone linear probing (top-1 accuracy, %). LeJEPA’s performance is stable across all its hyperparameters, and while some may slightly improve performance, e.g., the number of slices $|{\mathbb{A}}|$ and the projector sizes, none of the choices lead to a catastrophic collapse.
| integration | num_slices | bstat_n_points=5 | bstat_n_points=17 | bstat_n_points=41 |
|---|---|---|---|---|
| $[-1,1]$ | 512 | 71.82 | 72.13 | 72.04 |
| $[-1,1]$ | 2048 | 72.88 | 72.30 | 72.69 |
| $[-3,3]$ | 512 | 73.95 | 74.16 | 74.04 |
| $[-3,3]$ | 2048 | 75.02 | 74.68 | 74.77 |
| $[-5,5]$ | 512 | 73.71 | 74.21 | 74.15 |
| $[-5,5]$ | 2048 | 74.50 | 74.80 | 74.77 |
Table: S6.T2: Few-shot classification accuracy (percentages) on 8 datasets spanning textures, objects, and fine-grained categories. Our LeJEPA achieves superior performance on fine-grained tasks (DTD, flowers102, food101) while requiring only 100 pretraining epochs compared to I-JEPA’s 300 epochs—a 3× reduction in training time and computational resources without sacrificing downstream task performance. This efficiency gain is particularly valuable for practical applications where training budget is limited. Bold indicates best performance within the IN-1K comparison group, all numbers are percentages.
| shots | model | params | pretrain | epochs | DTD | aircr. | cars | cifar10 | cifar100 | flowers102 | food | pets | avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LeJEPA ViT-L | 304M | IN-1K | 100 | 33.21 | 9.37 | 3.40 | 51.65 | 27.01 | 48.53 | 17.14 | 46.11 | 29.55 |
| 1 | LeJEPA ConvNeXtV2-H | 660M | IN-1K | 100 | 32.15 | 8.07 | 4.28 | 50.95 | 31.48 | 48.74 | 17.95 | 58.98 | 31.58 |
| 1 | I-JEPA ViT-H | 632M | IN-1K | 300 | 27.71 | 9.86 | 4.33 | 56.52 | 30.58 | 44.69 | 14.53 | 53.38 | 30.20 |
| 1 | I-JEPA ViT-H + STOP | 632M | IN-1K | 300 | 26.60 | 11.18 | 4.75 | 56.27 | 35.20 | 47.17 | 15.75 | 59.47 | 32.05 |
| 1 | I-JEPA ViT-H (22K) | 632M | IN-22K | 900 | 27.98 | 13.00 | 3.45 | 61.84 | 34.70 | 89.72 | 19.62 | 30.86 | 35.15 |
| 10 | LeJEPA ViT-L | 304M | IN-1K | 100 | 64.72 | 35.25 | 22.25 | 85.15 | 59.77 | 92.53 | 50.90 | 77.00 | 60.95 |
| 10 | LeJEPA ConvNeXtV2-H | 660M | IN-1K | 100 | 61.84 | 30.67 | 24.46 | 85.74 | 63.29 | 91.78 | 49.32 | 78.53 | 60.70 |
| 10 | I-JEPA ViT-H | 632M | IN-1K | 300 | 57.68 | 33.82 | 21.96 | 88.77 | 66.42 | 88.24 | 43.97 | 83.23 | 60.51 |
| 10 | I-JEPA ViT-H + STOP | 632M | IN-1K | 300 | 57.00 | 39.77 | 25.21 | 90.09 | 70.32 | 90.16 | 45.68 | 85.13 | 62.92 |
| 10 | I-JEPA ViT-H (22K) | 632M | IN-22K | 900 | 58.74 | 43.52 | 18.27 | 94.83 | 75.23 | 98.94 | 49.06 | 67.66 | 63.28 |
| all | LeJEPA ViT-L | 304M | IN-1K | 100 | 78.30 | 57.01 | 57.28 | 96.50 | 83.71 | 91.21 | 82.05 | 89.74 | 79.48 |
| all | LeJEPA ConvNeXtV2-H | 660M | IN-1K | 100 | 76.60 | 52.99 | 54.88 | 96.15 | 81.34 | 91.11 | 77.64 | 89.76 | 77.56 |
| all | I-JEPA ViT-H | 632M | IN-1K | 300 | 73.32 | 56.61 | 54.47 | 97.54 | 86.42 | 86.47 | 81.02 | 92.11 | 78.50 |
| all | I-JEPA ViT-H + STOP | 632M | IN-1K | 300 | 73.87 | 61.95 | 61.27 | 98.02 | 87.78 | 88.08 | 81.72 | 92.88 | 80.70 |
| all | I-JEPA ViT-H (22K) | 632M | IN-22K | 900 | 75.67 | 65.39 | 49.79 | 98.46 | 89.95 | 98.54 | 81.58 | 87.19 | 80.82 |
Table: A3.T3: Performance metrics across different sample sizes from Figure˜12
| Freeze Backbone | Group | Model | All | 1 | 2 | 5 | 10 | 100 | 1000 |
|---|---|---|---|---|---|---|---|---|---|
| No | LeJEPA (Ours) | ConvNeXt-V2 Nano | 82.72 | 29.42 | 36.65 | 50.94 | 59.85 | 75.34 | 81.97 |
| No | LeJEPA (Ours) | LeViT-128 | 79.41 | 18.45 | 24.08 | 33.11 | 41.76 | 64.59 | 77.59 |
| No | LeJEPA (Ours) | ResNet-18 | 82.15 | 23.34 | 31.56 | 43.82 | 54.64 | 73.53 | 81.41 |
| No | LeJEPA (Ours) | ResNet-34 | 83.28 | 24.27 | 31.51 | 44.23 | 53.95 | 74.93 | 82.32 |
| No | Baselines | DINOv2 Small | 78.34 | 21.05 | 21.71 | 30.33 | 36.23 | 60.81 | 75.55 |
| No | Baselines | DINOv3 ViT-S/16 | 81.60 | 24.71 | 29.43 | 37.71 | 44.71 | 69.87 | 80.54 |
| Yes | LeJEPA (Ours) | ConvNeXt-V2 Nano | 76.52 | 28.74 | 36.65 | 50.60 | 59.50 | 72.62 | 77.24 |
| Yes | LeJEPA (Ours) | LeViT-128 | 69.00 | 25.85 | 33.30 | 45.52 | 52.43 | 64.37 | 69.39 |
| Yes | LeJEPA (Ours) | ResNet-18 | 75.95 | 30.48 | 38.22 | 50.85 | 58.86 | 72.70 | 76.39 |
| Yes | LeJEPA (Ours) | ResNet-34 | 78.17 | 31.08 | 38.33 | 52.26 | 60.63 | 74.77 | 78.62 |
| Yes | Baselines | DINOv2 Small | 67.62 | 27.68 | 32.22 | 40.72 | 47.72 | 62.49 | 67.89 |
| Yes | Baselines | DINOv3 ViT-S/16 | 71.38 | 30.17 | 36.65 | 45.74 | 51.51 | 65.90 | 71.35 |
Table: A3.T4: Top 1 accuracy (in %) with LeJEPA pretraining on Imagenet-100 for 400 epochs (All values are percentages)
| w/ predictor | w/ SWA | resnet50 1-layer | resnet50 2-layer | resnet50 3-layer | vit_small_patch8_224 1-layer | vit_small_patch8_224 2-layer | vit_small_patch8_224 3-layer | vit_tiny_patch8_224 1-layer | vit_tiny_patch8_224 2-layer | vit_tiny_patch8_224 3-layer |
|---|---|---|---|---|---|---|---|---|---|---|
| False | False | 79.71 | 82.44 | 83.93 | 76.59 | 80.77 | 81.07 | 71.79 | 76.87 | 80.37 |
| False | True | 79.79 | 82.69 | 83.50 | 79.96 | 83.63 | 84.12 | 75.86 | 82.36 | 80.50 |
| True | False | 79.41 | 82.44 | 83.57 | 77.58 | 79.41 | 81.91 | 67.74 | 77.64 | 80.73 |
| True | True | 78.87 | 82.04 | 82.82 | 77.11 | 81.77 | 82.58 | 69.53 | 78.27 | 79.77 |
Table: A3.T5: Small-architecture in-domain LeJEPA pretraining from random initialization across datasets and architectures, with frozen-backbone linear evaluation. First, LeJEPA produces near state-of-the-art performance on tiny datasets with only about a thousand samples, e.g., flowers102. Second, on non-natural image data, LeJEPA clearly outperforms the latest frontier vision models, e.g., on Galaxy10. See Figure 12 for additional experiments with varying numbers of training samples and with full finetuning.
| Model | Pretraining | flowers102 | cifar100 | food101 | inet10 | cifar10 | galaxy10 |
|---|---|---|---|---|---|---|---|
| # train. samples | | 1020 | 50000 | 75750 | 13000 | 50000 | 11008 |
| LeJEPA (convnextv2_nano) 14M | in-domain | 64.34 | 69.26 | 69.59 | 90.81 | 92.22 | 76.05 |
| LeJEPA (resnet18) 11M | in-domain | 74.57 | 69.94 | 73.57 | 92.36 | 92.51 | 75.32 |
| LeJEPA (resnet34) 21M | in-domain | 71.85 | 70.44 | 74.95 | 92.80 | 93.16 | 77.29 |
| LeJEPA (resnext26ts) 8M | in-domain | 82.19 | 69.10 | 76.77 | 92.82 | 91.59 | 73.78 |
| LeJEPA (swin_tiny) 27M | in-domain | 63.94 | 65.08 | 78.40 | 92.87 | 92.67 | 74.89 |
| IJEPA-inet22k (ViT-H/14) 630M | inet1k | 85.76 | 86.93 | 81.06 | 98.65 | 97.77 | 62.93 |
Table: A3.T6: Time (in milliseconds) to compute the proposed SIGReg loss from LABEL:lst:epps-pulley-pytorch on a Tesla V100-SXM2-16GB for varying mini-batch size ($N$), number of slices ($M$), and number of integration points. Results are computed over 10 runs.
| N | M | # integration points | mean (ms) | std (ms) |
|---|---|---|---|---|
| 512 | 512 | 16 | 0.465236 | 0.011642 |
| 512 | 512 | 64 | 0.461317 | 0.003894 |
| 512 | 512 | 256 | 0.627644 | 0.003337 |
| 2048 | 512 | 16 | 1.406441 | 0.002415 |
| 8192 | 512 | 16 | 6.188304 | 0.007226 |
| 8192 | 8192 | 16 | 8.685009 | 0.038829 |
| 32768 | 512 | 16 | 26.373118 | 0.012732 |
| 512 | 2048 | 16 | 0.465614 | 0.005274 |
| 512 | 8192 | 16 | 0.670379 | 0.006854 |
Table: A3.T7: Numerical values corresponding to Figure 8.
Backbone: resnet50.

| #views \ $\lambda$ | 0.001 | 0.005 | 0.010 | 0.020 | 0.025 | 0.050 | 0.100 | 0.150 | 0.200 | 0.300 | 0.400 | 0.500 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 81.41 | 82.73 | 83.49 | 82.99 | 82.23 | - | - | - | - | - | - | - |
| 4 | 79.88 | 83.04 | 84.36 | 84.68 | 84.33 | 83.00 | 82.91 | 81.05 | 78.58 | - | - | - |
| 8 | 76.67 | 81.58 | 83.59 | 83.49 | 83.76 | 84.32 | 83.66 | 83.07 | 82.16 | 81.00 | 79.25 | 77.72 |
Sketched Isotropic Gaussian Regularization (SIGReg): Given some arbitrary input data with density $p_x$ whose support may or may not lie on a manifold (left), a deep network (DN) encoder $f_{\bm{\theta}}$ produces embeddings ${\bm{z}}=f_{\bm{\theta}}({\bm{x}})$ with some distribution ${\bm{z}}\sim p_z$ (middle). Our proposed Backward Cramér–Wold Statistics (Section 4) objective pushes $p_z$ to match a target distribution $p_t$ by projecting the embeddings along $1d$ directions (middle, arrows) and enforcing that the univariate densities (right, colored lines) match the distribution of $p_t$ projected along the same directions. Any popular statistical test (provided in Section 4.2) can assess the goodness-of-fit; in practice we argue for characteristic function tests (Section 4.2). By using SIGReg with $p_t$ an isotropic Gaussian (right, black lines), we introduce a lean and provably optimal (Section 3) JEPA, coined LeJEPA, free of numerous heuristics and able to produce competitive performance (Sections 5 and 6).
![[Uncaptioned image]](2511.08544-figure_001.png)
Illustration of Section 3.1 showcasing how anisotropic embeddings (right) lead to higher-variance estimators compared to isotropic embeddings (left). We sample 100 training points for the 2-class classification task and fit a logistic regression, repeating the process over numerous training-set samples. Each sampling results in a decision boundary (purple).
Examples of distributions living on the surface of the sphere with varying Sobolev smoothness coefficients $\alpha$. As per Section 4.3, the greater $\alpha$ is, the more global the impact of SIGReg will be for a given number of directions $M$. Practically, this represents the distribution of the encoder’s output. Because the target density (isotropic Gaussian) is smooth, the $\alpha$ coefficients of the embedding will quickly grow, thereby making SIGReg (Section 4.2) immune to the curse of dimensionality.
Constructed data density with an “X” distribution whose marginals are standard Gaussian and whose covariance is the identity (left densities). Applying $M=10$ projections along directions on the half circle produces 10 univariate distributions that can be compared against a standard Gaussian (left) using any preferred statistic from Section 4.2. The appropriate direction is able to capture the degenerate distribution of the data, thereby creating a spike in the statistic value.
$N=100$ samples are drawn from a 1024-dimensional standard Gaussian, and the first 2 coordinates are altered to produce the “X” distribution from Figure 5 (left-most column). For each statistic (all other columns), we perform gradient descent on the samples to minimize its value, at each iteration sampling $M=10$ random directions to evaluate SIGReg (recall Section 4.2). Although this is a high-dimensional distribution with a limited number of samples, SIGReg is able to capture the degenerate subspace and adapt the data accordingly to match an isotropic Gaussian distribution. Additional figures with varying dimensions and numbers of 1d projections are provided in Figure 16.
Expected directional statistic at the end of training (y-axis) for varying $M$ (number of directions used at each training step, x-axis). The $M$ directions are either resampled (green) or kept fixed (blue) at each training step. While for fixed directions we benefit from the Section 4.3 bound, where increasing $M$ reduces the overall expected loss, being able to resample at every step provides a significant coverage boost for free.
Inet100 with 400 pretraining epochs and a resnet50 backbone. We depict linear probe performance as a function of $\lambda$ and the number of views $V$ (recall (9)). We observe that performance is stable over $\lambda$, with peak performance obtained by slightly adjusting $\lambda$ proportionally to the number of views. The corresponding performance values are provided in Table 7.
INet10 pretraining and frozen-backbone linear evaluation across 50 timm models using LeJEPA out of the box. We cross-validate the learning rate and weight decay. While there is a small variation between the best and worst performing models, we clearly see that across 50 models spanning 8 families, LeJEPA produces non-trivial representations able to solve the downstream task at SOTA levels.
(SIGReg, prediction loss) 2D plane with downstream task accuracy shown in colors from blue (low) to red (high). We clearly observe that within this plane there exist trade-off fronts between the two terms of LeJEPA producing similar downstream performance, corresponding to different values of $\lambda$. Yet those fronts are linear and pointed towards the lower left corner, i.e., LeJEPA’s training loss is informative of downstream test performance across models and datasets (columns). Additional models and datasets are provided in Figure 21.
Spearman correlation (y-axis) between LeJEPA’s training loss and downstream accuracy on the dataset’s classification task with a frozen backbone and linear evaluation. The x-axis varies $\alpha$ in Equation 10 following our scaling law of the loss w.r.t. $\lambda$. Using $\alpha=0$ recovers the plain training loss. We clearly observe a very high correlation already for $\alpha=0$, which further increases up to 99% for $\alpha=0.4$. The entire set of points is obtained across numerous hyper-parameters such as learning rate, weight decay, number of epochs, and $\lambda$, demonstrating that LeJEPA’s training loss is strongly predictive of downstream performance and can be used for label-free cross-validation.
Small-architecture in-domain (Galaxy10) LeJEPA pretraining with linear probe evaluation using a frozen backbone or full finetuning (columns) and with varying numbers of samples per class (x-axis). We compare against state-of-the-art foundation models (DINOv2/v3, IJEPA) over 3 different random seeds. We observe that LeJEPA enables in-domain pretraining out of the box across architectures and is able to outperform frontier foundation models. Corresponding numbers are provided in Table 3.
Emergent Object Segmentation via Last Layer Thresholding. LeJEPA naturally learns to segment and track salient objects (shown in attention maps on the right of each video) without explicit supervision. The results display impressive visual quality and strong temporal consistency across video frames (videos provided on our project page). This emergent capability demonstrates the rich semantic representations learned through our self-supervised approach.
LeJEPA learns rich semantic representations through self-supervised learning. PCA visualization of last-layer features from LeJEPA (ViT-Large, 100 epochs on ImageNet-1K). For each image, features are independently projected to RGB using the first 3 principal components. Without any supervision, LeJEPA spontaneously develops semantically meaningful representations: notice how warm colors (red/magenta/pink) consistently capture foreground objects (parrot bodies, dog face), while cool colors (cyan/green/yellow) represent backgrounds and foliage. This emergent object-background separation and perceptual grouping shows that LeJEPA discovered the visual structure of the world purely from unlabeled data.
Depiction of the expected BCS loss upper bound (Section 4.3) for various smoothness values $\alpha$. We clearly see that as the smoothness increases (blue to red), the upper bound decreases more and more rapidly with $M$.
Depiction of the distribution of optimized $\beta$ values from OLS when comparing ${\bm{Z}}_{\rm iso}$ and ${\bm{Z}}_{\rm aniso}$ from Section 3.1. We clearly observe that the isotropic version provides much lower variance than the anisotropic one. We consider a binary classification task (linearly separable classes) (top row), a linear regression task (middle row), and a nonlinear regression task with smooth targets (bottom row). For each case, we resample the training samples numerous times and produce an estimate of $\beta$ each time. Because the data is 2-dimensional, we can visualize the $\beta$ distribution directly.
Additional figures are provided in Figure 19.
Proposed trapezoid quadrature for the Epps–Pulley statistic as implemented in LABEL:lst:epps-pulley-pytorch. We depict the approximation error of the integral for various distributions, demonstrating rapid convergence (faster than quadratic, shown as the grey line) across possible embedding distributions.
$$ \frac{\lambda_{\rm wd}}{\lambda_{p}+\lambda_{\rm wd}}>\frac{\lambda_{\rm wd}}{\bar{\lambda}+\lambda_{\rm wd}} $$ \tag{A2.Ex36}
$$ \mathbf{G}=\mathbf{Q}\bm{\Lambda}\mathbf{Q}^{T} $$ \tag{A2.Ex44}
$$ \frac{1}{d}\sum_{i=1}^{d}\mu_{i}\;\geq\;\left(\prod_{i=1}^{d}\mu_{i}\right)^{1/d}=\delta^{-1/d}, $$ \tag{A2.Ex86}
$$ \mathrm{Var}[\hat{m}(x)]\approx\frac{nh^{-d}R(K)v(x)p(x)}{n^{2}p(x)^{2}}=\frac{R(K)}{nh^{d}}\frac{v(x)}{p(x)}+o\big((nh^{d})^{-1}\big), $$ \tag{A2.Ex112}
$$ \hat{D}_{V}=\int_{\mathbb{R}}w_{s}(t)\big|\hat{\varphi}_{N}(t)-\varphi_{G}(t)\big|^{2}\,dt. $$ \tag{A2.Ex134}
$$ \frac{\partial\hat{D}_{k}}{\partial X_{i}}=\frac{2}{N}\sum_{r=1}^{k}c_{r}\,r\,X_{i}^{r-1}, $$ \tag{A2.Ex145}
$$ \mathbf{a}^{T}\mathbb{E}[\mathbf{X}]=0\quad\text{for all unit vectors }\mathbf{a} $$ \tag{A2.E36}
$$ \mathbf{y}=\mathbf{X}\bm{\beta}+\bm{\varepsilon} $$ \tag{A3.Ex158}
$$ \mathbb{E}[\varepsilon_{i}]=0,\quad\text{Var}(\varepsilon_{i})=\sigma^{2},\quad\text{Cov}(\varepsilon_{i},\varepsilon_{j})=0\text{ for }i\neq j $$ \tag{A3.Ex159}
$$ \text{SSR}(\bm{\beta})=\sum_{i=1}^{n}(y_{i}-\mathbf{x}_{i}^{T}\bm{\beta})^{2}=(\mathbf{y}-\mathbf{X}\bm{\beta})^{T}(\mathbf{y}-\mathbf{X}\bm{\beta}) $$ \tag{A3.Ex160}
$$ \mathbf{X}^{T}\mathbf{X}\bm{\beta}=\mathbf{X}^{T}\mathbf{y} $$ \tag{A3.Ex163}
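The normal equations above can be solved directly; a minimal NumPy sketch on synthetic data (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)   # y = X beta + eps

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# With small noise, the estimate is close to beta_true.
assert np.allclose(beta_hat, beta_true, atol=0.05)
```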
$$ W=\frac{\left(\sum_{i=1}^{n}a_{i}Y_{i}\right)^{2}}{\sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2}}=\frac{(\mathbf{a}^{\prime}\mathbf{Y})^{2}}{S^{2}},\qquad \mathbf{a}^{\prime}=\left(a_{1},a_{2},\ldots,a_{n}\right)=\mathbf{m}^{\prime}\mathbf{V}^{-1}\left(\mathbf{m}^{\prime}\mathbf{V}^{-1}\mathbf{V}^{-1}\mathbf{m}\right)^{-1/2},\qquad S^{2}=\sum_{i=1}^{n}\left(Y_{i}-\bar{Y}\right)^{2} $$ \tag{A5.E53}
$$ \mathbf{V}^{-1}=(n+1)(n+2)\times\begin{pmatrix}2\phi^{2}(m_{1})&-\phi(m_{1})\phi(m_{2})&0&0&\cdots&0\\ -\phi(m_{1})\phi(m_{2})&2\phi^{2}(m_{2})&-\phi(m_{2})\phi(m_{3})&0&\cdots&0\\ 0&-\phi(m_{2})\phi(m_{3})&2\phi^{2}(m_{3})&-\phi(m_{3})\phi(m_{4})&\cdots&0\\ \vdots&&&&\ddots&\vdots\\ 0&0&0&0&\cdots&2\phi^{2}(m_{n})\end{pmatrix} $$ \tag{A5.E58}
$$ \displaystyle{\rm JB}({\bm{u}})\triangleq $$
$$ \displaystyle T_{w} $$
$$ \displaystyle\frac{1}{V}\sum_{v^{\prime}=1}^{V}\left\|\bm{\mu}_{n}-{\bm{z}}_{n,v^{\prime}}\right\|_{2}^{2}, $$
$$ \displaystyle R(K)\triangleq\int_{\mathbb{R}^{d}}K(u)^{2}\,du<\infty, $$
$$ \displaystyle=\mathbb{E}[(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\bm{\varepsilon}\bm{\varepsilon}^{T}\mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}|\mathbf{X}] $$
$$ \displaystyle=\frac{r_{0}^{2}}{d+2}\left(\frac{\nabla\eta\cdot\nabla p}{p}+\frac{1}{2}\Delta\eta\right)\Big(1-\alpha r_{0}^{2}+O(r_{0}^{3})\Big)\ +\ O(r_{0}^{3}) $$
$$ \displaystyle=n\intslop\ilimits@K(t)\Big((mp)(x)-ht^{\top}\nabla(mp)(x)+\frac{h^{2}}{2}t^{\top}\nabla^{2}(mp)(x)t+o(h^{2})\Big)dt $$
$$ \displaystyle=\big|\phi_{\theta}-\psi\big|^{2}+\frac{1-|\phi_{\theta}|^{2}}{n}. $$
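The decomposition above — the expected squared ECF error equals the population discrepancy $|\phi_\theta-\psi|^2$ plus a $(1-|\phi_\theta|^2)/n$ sampling term — can be checked by Monte Carlo (a sketch with an illustrative frequency and sample size, using a uniform law for $\phi_\theta$ and the standard-normal CF for $\psi$):

```python
import numpy as np

rng = np.random.default_rng(0)
t, n, trials = 1.3, 64, 20000
psi = np.exp(-t ** 2 / 2)                 # standard-normal CF at frequency t

# Draw many size-n samples from a uniform law, whose CF phi_theta != psi.
x = rng.uniform(-1.0, 1.0, size=(trials, n))
phi_hat = np.exp(1j * t * x).mean(axis=1)         # empirical CF per sample
mc = np.mean(np.abs(phi_hat - psi) ** 2)          # Monte-Carlo expectation

phi_theta = np.sin(t) / t                         # CF of uniform(-1, 1)
closed = (phi_theta - psi) ** 2 + (1 - phi_theta ** 2) / n
assert abs(mc - closed) < 2e-3                    # the two agree
```

The $1/n$ term is exactly the variance of the empirical characteristic function, which is why the bias of the plug-in statistic vanishes as the sample grows.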
| Method | Full FT, 1-sh | Full FT, Full | Frozen, 1-sh | Frozen, Full |
|---|---|---|---|---|
| LeJEPA (in-domain): ConvNeXt-V2 Nano | 29.42 | 82.72 | 28.74 | 76.52 |
| LeJEPA (in-domain): ResNet-34 | 24.27 | 83.28 | 31.08 | 78.17 |
| Frontier (transfer): DINOv2 ViT-S/16 | 21.05 | 78.34 | 27.68 | 67.62 |
| Frontier (transfer): DINOv3 ViT-S/16 | 24.71 | 81.60 | 30.17 | 71.38 |
| integration range | num_slices | bstat_n_points = 5 | bstat_n_points = 17 | bstat_n_points = 41 |
|---|---|---|---|---|
| [-1, 1] | 512 | 71.82 | 72.13 | 72.04 |
| [-1, 1] | 2048 | 72.88 | 72.30 | 72.69 |
| [-3, 3] | 512 | 73.95 | 74.16 | 74.04 |
| [-3, 3] | 2048 | 75.02 | 74.68 | 74.77 |
| [-5, 5] | 512 | 73.71 | 74.21 | 74.15 |
| [-5, 5] | 2048 | 74.50 | 74.80 | 74.77 |
| # views (V = V_g + V_l) \ # global views (V_g) | 1 | 2 | 4 |
|---|---|---|---|
| 4 | 53.06 | 72.26 | - |
| 6 | 58.65 | 73.07 | 73.68 |
| 8 | 64.46 | 74.24 | 73.94 |
| 10 | 68.97 | 74.06 | 75.08 |
(c) Mini-batch size

| batch_size | 128 | 256 | 512 | 1024 |
|---|---|---|---|---|
| accuracy | 72.20 | 74.15 | 74.72 | 74.07 |
| num_slices \ (emb. dim., proj. dim.) | (1024, 512) | (1024, 2048) | (4096, 512) | (4096, 2048) |
|---|---|---|---|---|
| 64 | 75.29 | 75.32 | 75.50 | 75.65 |
| 128 | 74.77 | 75.09 | 75.26 | 75.47 |
| 256 | 74.56 | 74.66 | 75.08 | 75.02 |
| 512 | 73.94 | 74.11 | 74.81 | 74.65 |
| 1024 | 73.65 | 73.94 | 74.71 | 74.79 |
| num_slices \ reg_tokens | 0 | 1 | 2 | 4 | 8 |
|---|---|---|---|---|---|
| 1024 | 75.14 | 75.18 | 75.08 | 75.34 | 75.23 |
| 4096 | 75.61 | 75.58 | 75.67 | 75.63 | 75.84 |
| shots | model | params | pretrain | epochs | DTD | aircr. | cars | cifar10 | cifar100 | flowers102 | food | pets | avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LeJEPA ViT-L | 304M | IN-1K | 100 | 33.21 | 9.37 | 3.40 | 51.65 | 27.01 | 48.53 | 17.14 | 46.11 | 29.55 |
| 1 | LeJEPA ConvNeXtV2-H | 660M | IN-1K | 100 | 32.15 | 8.07 | 4.28 | 50.95 | 31.48 | 48.74 | 17.95 | 58.98 | 31.58 |
| 1 | I-JEPA ViT-H | 632M | IN-1K | 300 | 27.71 | 9.86 | 4.33 | 56.52 | 30.58 | 44.69 | 14.53 | 53.38 | 30.20 |
| 1 | I-JEPA ViT-H + STOP | 632M | IN-1K | 300 | 26.60 | 11.18 | 4.75 | 56.27 | 35.20 | 47.17 | 15.75 | 59.47 | 32.05 |
| 1 | I-JEPA ViT-H (22K) | 632M | IN-22K | 900 | 27.98 | 13.00 | 3.45 | 61.84 | 34.70 | 89.72 | 19.62 | 30.86 | 35.15 |
| 10 | LeJEPA ViT-L | 304M | IN-1K | 100 | 64.72 | 35.25 | 22.25 | 85.15 | 59.77 | 92.53 | 50.90 | 77.00 | 60.95 |
| 10 | LeJEPA ConvNeXtV2-H | 660M | IN-1K | 100 | 61.84 | 30.67 | 24.46 | 85.74 | 63.29 | 91.78 | 49.32 | 78.53 | 60.70 |
| 10 | I-JEPA ViT-H | 632M | IN-1K | 300 | 57.68 | 33.82 | 21.96 | 88.77 | 66.42 | 88.24 | 43.97 | 83.23 | 60.51 |
| 10 | I-JEPA ViT-H + STOP | 632M | IN-1K | 300 | 57.00 | 39.77 | 25.21 | 90.09 | 70.32 | 90.16 | 45.68 | 85.13 | 62.92 |
| 10 | I-JEPA ViT-H (22K) | 632M | IN-22K | 900 | 58.74 | 43.52 | 18.27 | 94.83 | 75.23 | 98.94 | 49.06 | 67.66 | 63.28 |
| all | LeJEPA ViT-L | 304M | IN-1K | 100 | 78.30 | 57.01 | 57.28 | 96.50 | 83.71 | 91.21 | 82.05 | 89.74 | 79.48 |
| all | LeJEPA ConvNeXtV2-H | 660M | IN-1K | 100 | 76.60 | 52.99 | 54.88 | 96.15 | 81.34 | 91.11 | 77.64 | 89.76 | 77.56 |
| all | I-JEPA ViT-H | 632M | IN-1K | 300 | 73.32 | 56.61 | 54.47 | 97.54 | 86.42 | 86.47 | 81.02 | 92.11 | 78.50 |
| all | I-JEPA ViT-H + STOP | 632M | IN-1K | 300 | 73.87 | 61.95 | 61.27 | 98.02 | 87.78 | 88.08 | 81.72 | 92.88 | 80.70 |
| all | I-JEPA ViT-H (22K) | 632M | IN-22K | 900 | 75.67 | 65.39 | 49.79 | 98.46 | 89.95 | 98.54 | 81.58 | 87.19 | 80.82 |
| Freeze Backbone | Model | All | 1/cls | 2/cls | 5/cls | 10/cls | 100/cls | 1000/cls |
|---|---|---|---|---|---|---|---|---|
| No | LeJEPA (Ours) ConvNeXt-V2 Nano | 82.72 | 29.42 | 36.65 | 50.94 | 59.85 | 75.34 | 81.97 |
| No | LeJEPA (Ours) LeViT-128 | 79.41 | 18.45 | 24.08 | 33.11 | 41.76 | 64.59 | 77.59 |
| No | LeJEPA (Ours) ResNet-18 | 82.15 | 23.34 | 31.56 | 43.82 | 54.64 | 73.53 | 81.41 |
| No | LeJEPA (Ours) ResNet-34 | 83.28 | 24.27 | 31.51 | 44.23 | 53.95 | 74.93 | 82.32 |
| No | Baseline DINOv2 Small | 78.34 | 21.05 | 21.71 | 30.33 | 36.23 | 60.81 | 75.55 |
| No | Baseline DINOv3 ViT-S/16 | 81.60 | 24.71 | 29.43 | 37.71 | 44.71 | 69.87 | 80.54 |
| Yes | LeJEPA (Ours) ConvNeXt-V2 Nano | 76.52 | 28.74 | 36.65 | 50.60 | 59.50 | 72.62 | 77.24 |
| Yes | LeJEPA (Ours) LeViT-128 | 69.00 | 25.85 | 33.30 | 45.52 | 52.43 | 64.37 | 69.39 |
| Yes | LeJEPA (Ours) ResNet-18 | 75.95 | 30.48 | 38.22 | 50.85 | 58.86 | 72.70 | 76.39 |
| Yes | LeJEPA (Ours) ResNet-34 | 78.17 | 31.08 | 38.33 | 52.26 | 60.63 | 74.77 | 78.62 |
| Yes | Baseline DINOv2 Small | 67.62 | 27.68 | 32.22 | 40.72 | 47.72 | 62.49 | 67.89 |
| Yes | Baseline DINOv3 ViT-S/16 | 71.38 | 30.17 | 36.65 | 45.74 | 51.51 | 65.90 | 71.35 |
| w/ Projector | w/ SWA | resnet50 1-layer | resnet50 2-layer | resnet50 3-layer | vit_small_patch8_224 1-layer | vit_small_patch8_224 2-layer | vit_small_patch8_224 3-layer | vit_tiny_patch8_224 1-layer | vit_tiny_patch8_224 2-layer | vit_tiny_patch8_224 3-layer |
|---|---|---|---|---|---|---|---|---|---|---|
| False | False | 79.71 | 82.44 | 83.93 | 76.59 | 80.77 | 81.07 | 71.79 | 76.87 | 80.37 |
| False | True | 79.79 | 82.69 | 83.50 | 79.96 | 83.63 | 84.12 | 75.86 | 82.36 | 80.50 |
| True | False | 79.41 | 82.44 | 83.57 | 77.58 | 79.41 | 81.91 | 67.74 | 77.64 | 80.73 |
| True | True | 78.87 | 82.04 | 82.82 | 77.11 | 81.77 | 82.58 | 69.53 | 78.27 | 79.77 |
| Model (# params) | Pretraining | flowers102 (1020) | cifar100 (50000) | food101 (75750) | inet10 (13000) | cifar10 (50000) | galaxy10 (11008) |
|---|---|---|---|---|---|---|---|
| LeJEPA convnextv2_nano (14M) | in-domain | 64.34 | 69.26 | 69.59 | 90.81 | 92.22 | 76.05 |
| LeJEPA resnet18 (11M) | in-domain | 74.57 | 69.94 | 73.57 | 92.36 | 92.51 | 75.32 |
| LeJEPA resnet34 (21M) | in-domain | 71.85 | 70.44 | 74.95 | 92.80 | 93.16 | 77.29 |
| LeJEPA resnext26ts (8M) | in-domain | 82.19 | 69.10 | 76.77 | 92.82 | 91.59 | 73.78 |
| LeJEPA swin_tiny (27M) | in-domain | 63.94 | 65.08 | 78.40 | 92.87 | 92.67 | 74.89 |
| IJEPA-inet22k ViT-H/14 (630M) | inet1k | 85.76 | 86.93 | 81.06 | 98.65 | 97.77 | 62.93 |
| N | M | # integration points | mean (ms) | std (ms) |
|---|---|---|---|---|
| 512 | 512 | 16 | 0.465236 | 0.011642 |
| 512 | 512 | 64 | 0.461317 | 0.003894 |
| 512 | 512 | 256 | 0.627644 | 0.003337 |
| 2048 | 512 | 16 | 1.40644 | 0.002415 |
| 8192 | 512 | 16 | 6.1883 | 0.007226 |
| 8192 | 8192 | 16 | 8.68501 | 0.038829 |
| 32768 | 512 | 16 | 26.3731 | 0.012732 |
| 512 | 2048 | 16 | 0.465614 | 0.005274 |
| 512 | 8192 | 16 | 0.670379 | 0.006854 |
resnet50

| # views \ λ | 0.001 | 0.005 | 0.01 | 0.02 | 0.025 | 0.050 | 0.100 | 0.150 | 0.200 | 0.300 | 0.400 | 0.500 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 81.41 | 82.73 | 83.49 | 82.99 | 82.23 | - | - | - | - | - | - | - |
| 4 | 79.88 | 83.04 | 84.36 | 84.68 | 84.33 | 83.00 | 82.91 | 81.05 | 78.58 | - | - | - |
| 8 | 76.67 | 81.58 | 83.59 | 83.49 | 83.76 | 84.32 | 83.66 | 83.07 | 82.16 | 81.00 | 79.25 | 77.72 |





![[Uncaptioned image]](2511.08544-figure_001.png)
$$ \|\mathrm{Bias}(\hat{\bm{\beta}})\|_{\text{isotropic}} = \frac{\lambda_{\rm wd}}{\bar{\lambda} + \lambda_{\rm wd}} \|\bm{\beta}_{\text{true}}\|,\qquad \|\mathrm{Bias}(\hat{\bm{\beta}})\|_{\text{non-isotropic}} = \frac{\lambda_{\rm wd}}{\lambda_{p} + \lambda_{\rm wd}} \|\bm{\beta}_{\text{true}}\| $$
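The two shrinkage factors above can be compared numerically; a minimal sketch with illustrative eigenvalues and weight-decay strength (all three constants are assumptions for this sketch):

```python
# Illustrative values: lambda_bar is the shared eigenvalue of the isotropic
# Gram matrix, lambda_p a small eigenvalue of the anisotropic one, and
# lambda_wd the weight-decay (ridge) strength.
lambda_bar, lambda_p, lambda_wd = 1.0, 0.1, 0.05

# Relative shrinkage bias ||Bias(beta_hat)|| / ||beta_true|| in each regime.
bias_iso = lambda_wd / (lambda_bar + lambda_wd)     # isotropic spectrum
bias_aniso = lambda_wd / (lambda_p + lambda_wd)     # anisotropic spectrum

# Whenever lambda_p < lambda_bar, the anisotropic bias is strictly larger.
assert bias_aniso > bias_iso
```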
$$ \begin{aligned} \mathbb{E}\big[\mathrm{Bias}(X)^{2}\big] &= \left(\frac{r_{0}^{2}}{d+2}\right)^{2}\Big\{\tau_{g}^{2}J(p)+O(1)\Big\}+o(r_{0}^{4})\\ &= \frac{r_{0}^{4}}{(d+2)^{2}}\tau_{g}^{2}J(p)\;+\;O(r_{0}^{4}), \end{aligned} $$
$$ \begin{aligned} \text{RHS} &= \frac{1}{V}\sum_{v'=1}^{V}\left(\|\bar{\mathbf{z}}\|_{2}^{2}-2\bar{\mathbf{z}}^{T}\mathbf{z}_{n,v'}+\|\mathbf{z}_{n,v'}\|_{2}^{2}\right)\\ &= \|\bar{\mathbf{z}}\|_{2}^{2}-\frac{2}{V}\bar{\mathbf{z}}^{T}\sum_{v'=1}^{V}\mathbf{z}_{n,v'}+\frac{1}{V}\sum_{v'=1}^{V}\|\mathbf{z}_{n,v'}\|_{2}^{2} \end{aligned} $$
$$ \displaystyle=-\lambda_{\rm wd}\mathbf{Q}(\bm{\Lambda}+\lambda\mathbf{I})^{-1}\mathbf{Q}^{T}\bm{\beta}_{\text{true}} $$
$$ \displaystyle-\frac{2}{\left(1+\beta^{2}\right)^{d/2}}\sum_{j=1}^{n}\exp\left(-\frac{\beta^{2}\left\|Y_{n,j}\right\|^{2}}{2\left(1+\beta^{2}\right)}\right)+\frac{n}{\left(1+2\beta^{2}\right)^{d/2}}. $$
References
[shapiro1965analysis] Shapiro, Samuel Sanford, Wilk, Martin B. (1965). An analysis of variance test for normality (complete samples). Biometrika.
[watson1961goodness] Watson, George S. (1961). Goodness-of-fit tests on a circle. Biometrika.
[dick2010digital] Dick, Josef, Pillichshammer, Friedrich. (2010). Digital nets and sequences: discrepancy theory and quasi--Monte Carlo integration.
[lukacs1970characteristic] Lukacs, Eugene. (1970). Characteristic functions.
[yu2004empirical] Yu, Jun. (2004). Empirical characteristic function estimation and its applications. Econometric reviews.
[carleman1926fonctions] Carleman, Torsten. (1926). Les fonctions quasi analytiques.
[feuerverger1977empirical] Feuerverger, Andrey, Mureika, Roman A. (1977). The empirical characteristic function and its applications. The annals of Statistics.
[rudin1987real] Rudin, Walter. (1987). Real and complex analysis.
[lecun2022path] LeCun, Yann. (2022). A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review.
[he2020momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. International conference on machine learning.
[ermolov2021whitening] Ermolov, Aleksandr, Siarohin, Aliaksandr, Sangineto, Enver, Sebe, Nicu. (2021). Whitening for self-supervised representation learning. International conference on machine learning.
[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.
[von1981probability] Von Mises, Richard. (1981). Probability, statistics, and truth.
[zhang2017efficient] Zhang, Shichao, Li, Xuelong, Zong, Ming, Zhu, Xiaofeng, Wang, Ruili. (2017). Efficient kNN classification with different numbers of nearest neighbors. IEEE transactions on neural networks and learning systems.
[nadaraya1964estimating] Nadaraya, Elizbar A. (1964). On estimating regression. Theory of Probability & Its Applications.
[bonneel2015sliced] Bonneel, Nicolas, Rabin, Julien, Peyré, Gabriel, Pfister, Hanspeter. (2015). Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision.
[izmailov2019averagingweightsleadswider] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson. (2019). Averaging Weights Leads to Wider Optima and Better Generalization.
[song2020sliced] Song, Yang, Garg, Sahaj, Shi, Jiaxin, Ermon, Stefano. (2020). Sliced score matching: A scalable approach to density and score estimation. Uncertainty in artificial intelligence.
[dunning2021t] Dunning, Ted. (2021). The t-digest: Efficient estimates of distributions. Software Impacts.
[dunning2019computing] Dunning, Ted, Ertl, Otmar. (2019). Computing extremely accurate quantiles using t-digests. arXiv preprint arXiv:1902.04023.
[masson2019ddsketch] Masson, Charles, Rim, Jee E, Lee, Homin K. (2019). Ddsketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. arXiv preprint arXiv:1908.10693.
[tanasic2013comparison] Tanasic, Ivan, Vilanova, Lluís, others. (2013). Comparison based sorting for systems with multiple GPUs. Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units.
[maltenberger2022evaluating] Maltenberger, Tobias, Ilic, Ivan, Tolovski, Ilin, Rabl, Tilmann. (2022). Evaluating multi-GPU sorting with modern interconnects. Proceedings of the 2022 International Conference on Management of Data.
[nguyen2023energy] Nguyen, Khai, Ho, Nhat. (2023). Energy-based sliced wasserstein distance. Advances in Neural Information Processing Systems.
[watson1964smooth] Watson, Geoffrey S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A.
[taunk2019brief] Taunk, Kashvi, De, Sanjukta, Verma, Srishti, Swetapadma, Aleena. (2019). A brief review of nearest neighbor algorithm for learning and classification. 2019 international conference on intelligent computing and control systems (ICCS).
[sun2010adaptive] Sun, Shiliang, Huang, Rongqing. (2010). An adaptive k-nearest neighbor algorithm. 2010 seventh international conference on fuzzy systems and knowledge discovery.
[bishop2006pattern] Bishop, Christopher M, Nasrabadi, Nasser M. (2006). Pattern recognition and machine learning.
[golub1999tikhonov] Golub, Gene H, Hansen, Per Christian, O'Leary, Dianne P. (1999). Tikhonov regularization and total least squares. SIAM journal on matrix analysis and applications.
[bishop1995training] Bishop, Chris M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural computation.
[abu2019effects] Abu Alfeilat, Haneen Arafat, Hassanat, Ahmad BA, Lasassmeh, Omar, Tarawneh, Ahmad S, Alhasanat, Mahmoud Bashir, Eyal Salman, Hamzeh S, Prasath, VB Surya. (2019). Effects of distance measure choice on k-nearest neighbor classifier performance: a review. Big data.
[asadi2017alternative] Asadi, Kavosh, Littman, Michael L. (2017). An alternative softmax operator for reinforcement learning. International Conference on Machine Learning.
[beirlant1997nonparametric] Beirlant, Jan, Dudewicz, Edward J, Györfi, László, others. (1997). Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences.
[miller2003new] Miller, Erik G. (2003). A new class of entropy estimators for multi-dimensional densities. 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP'03)..
[joe1989estimation] Joe, Harry. (1989). Estimation of entropy and other functionals of a multivariate density. Annals of the Institute of Statistical Mathematics.
[silverman2018density] Silverman, Bernard W. (2018). Density estimation for statistics and data analysis.
[darcet2023vision] Darcet, Timothée, Oquab, Maxime, Mairal, Julien, Bojanowski, Piotr. (2023). Vision transformers need registers. arXiv preprint arXiv:2309.16588.
[simeoni2025dinov3] Siméoni, Oriane, others. (2025). DINOv3. arXiv preprint arXiv:2508.10104.
[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, others. (2020). Bootstrap your own latent: a new approach to self-supervised learning. Advances in neural information processing systems.
[bar2023stochastic] Bar, Amir, Bordes, Florian, Shocher, Assaf, Assran, Mahmoud, Vincent, Pascal, Ballas, Nicolas, Darrell, Trevor, Globerson, Amir, LeCun, Yann. (2023). Stochastic positional embeddings improve masked image modeling. arXiv preprint arXiv:2308.00566.
[sriperumbudur2010hilbert] Sriperumbudur, Bharath K, Gretton, Arthur, Fukumizu, Kenji, Schölkopf, Bernhard, Lanckriet, Gert RG. (2010). Hilbert space embeddings and metrics on probability measures. The Journal of Machine Learning Research.
[chwialkowski2016kernel] Chwialkowski, Kacper, Strathmann, Heiko, Gretton, Arthur. (2016). A kernel test of goodness of fit. International conference on machine learning.
[gretton2012kernel] Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J, Schölkopf, Bernhard, Smola, Alexander. (2012). A kernel two-sample test. The Journal of Machine Learning Research.
[kolmogorov1933] A. N. Kolmogorov. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari.
[shapiro1990test] Shapiro, Samuel S. (1990). How to test normality and other distributional assumptions.
[anderson1952asymptotic] Anderson, Theodore W, Darling, Donald A. (1952). Asymptotic theory of certain "goodness of fit" criteria based on stochastic processes. The Annals of Mathematical Statistics.
[cramer1928composition] Cramér, Harald. (1928). On the composition of elementary errors: First paper: Mathematical deductions. Scandinavian Actuarial Journal.
[cramer1936some] Cramér, Harald. (1936). Some theorems on distribution functions. Journal of the London Mathematical Society.
[srinath2023implicit] Srinath Halvagal, Manu, Laborieux, Axel, Zenke, Friedemann. (2023). Implicit variance regularization in non-contrastive SSL. Advances in Neural Information Processing Systems.
[oquab2023dinov2] Oquab, Maxime, Darcet, Timothée, others. (2023). DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
[fan2025scaling] Fan, David, Tong, Shengbang, Zhu, Jiachen, Sinha, Koustuv, Liu, Zhuang, Chen, Xinlei, Rabbat, Michael, Ballas, Nicolas, LeCun, Yann, Bar, Amir, others. (2025). Scaling language-free visual representation learning. arXiv preprint arXiv:2504.01017.
[chen2020big] Chen, Ting, Kornblith, Simon, Swersky, Kevin, Norouzi, Mohammad, Hinton, Geoffrey E. (2020). Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems.
[goyal2019scaling] Goyal, Priya, Mahajan, Dhruv, Gupta, Abhinav, Misra, Ishan. (2019). Scaling and benchmarking self-supervised visual representation learning. Proceedings of the ieee/cvf International Conference on computer vision.
[kerdreux2025efficient] Kerdreux, Thomas, Tuel, Alexandre, Febvre, Quentin, Mouche, Alexis, Chapron, Bertrand. (2025). Efficient Self-Supervised Learning for Earth Observation via Dynamic Dataset Curation. Proceedings of the Computer Vision and Pattern Recognition Conference.
[vo2024automatic] Vo, Huy V, Khalidov, Vasil, Darcet, Timothée, others. (2024). Automatic data curation for self-supervised learning: A clustering-based approach. arXiv preprint arXiv:2405.15613.
[zhang2023matrix] Zhang, Yifan, Tan, Zhiquan, Yang, Jingqin, Huang, Weiran, Yuan, Yang. (2023). Matrix information theory for self-supervised learning. arXiv preprint arXiv:2305.17326.
[liu2021self] Liu, Xiao, Zhang, Fanjin, Hou, Zhenyu, Mian, Li, Wang, Zhaoyu, Zhang, Jing, Tang, Jie. (2021). Self-supervised learning: Generative or contrastive. IEEE transactions on knowledge and data engineering.
[shwartz2022we] Shwartz-Ziv, Ravid, Balestriero, Randall, LeCun, Yann. (2022). What do we maximize in self-supervised learning?. arXiv preprint arXiv:2207.10081.
[shwartz2024compress] Shwartz Ziv, Ravid, LeCun, Yann. (2024). To compress or not to compress—self-supervised learning and information theory: A review. Entropy.
[vincent2010stacked] Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, Manzagol, Pierre-Antoine, Bottou, Léon. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research.
[van2018inaturalist] Van Horn, Grant, Mac Aodha, Oisin, Song, Yang, Cui, Yin, Sun, Chen, Shepard, Alex, Adam, Hartwig, Perona, Pietro, Belongie, Serge. (2018). The inaturalist species classification and detection dataset. Proceedings of the IEEE conference on computer vision and pattern recognition.
[papyan2020prevalence] Papyan, Vardan, Han, XY, Donoho, David L. (2020). Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences.
[kingma2014semi] Kingma, Diederik P, Rezende, Danilo J, Mohamed, Shakir, Welling, Max. (2014). Semi-supervised learning with deep generative models. Advances in neural information processing systems.
[rumelhart1986learning] Rumelhart, David E, Hinton, Geoffrey E, Williams, Ronald J. (1986). Learning representations by back-propagating errors. nature.
[lecun2015deep] LeCun, Yann, Bengio, Yoshua, Hinton, Geoffrey. (2015). Deep learning. nature.
[khazatsky2024droid] Khazatsky, Alexander, Pertsch, Karl, Nair, Suraj, Balakrishna, Ashwin, Dasari, Sudeep, Karamcheti, Siddharth, Nasiriany, Soroush, Srirama, Mohan Kumar, Chen, Lawrence Yunliang, Ellis, Kirsty, others. (2024). Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945.
[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, others. (2021). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF international conference on computer vision.
[assran2023self] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[littwin2024jepa] Littwin, Etai, Saremi, Omid, Advani, Madhu, Thilak, Vimal, Nakkiran, Preetum, Huang, Chen, Susskind, Joshua. (2024). How jepa avoids noisy features: The implicit bias of deep linear self distillation networks. Advances in Neural Information Processing Systems.
[bruner1949perception] Bruner, Jerome S, Postman, Leo. (1949). On the perception of incongruity: A paradigm. Journal of personality.
[helmholtz1867handbook] Helmholtz, H von, others. (1867). Handbook of physiological optics. Voss, Leipzig.
[bromley1993signature] Bromley, Jane, Guyon, Isabelle, LeCun, Yann, Säckinger, Eduard, Shah, Roopak. (1993). Signature verification using a "Siamese" time delay neural network. Advances in neural information processing systems.
[wang2022importance] Wang, Xiao, Fan, Haoqi, Tian, Yuandong, Kihara, Daisuke, Chen, Xinlei. (2022). On the importance of asymmetry for siamese representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
[chen2021empirical] Chen, Xinlei, Xie, Saining, He, Kaiming. (2021). An empirical study of training self-supervised vision transformers. Proceedings of the IEEE/CVF international conference on computer vision.
[tian2021understanding] Tian, Yuandong, Chen, Xinlei, Ganguli, Surya. (2021). Understanding self-supervised learning dynamics without contrastive pairs. International Conference on Machine Learning.
[cover2006geometrical] Cover, Thomas M. (2006). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers.
[jing2021understanding] Jing, Li, Vincent, Pascal, LeCun, Yann, Tian, Yuandong. (2021). Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348.
[von1867handbuch] Von Helmholtz, Hermann. (1867). Handbuch der physiologischen Optik.
[gregory1980perceptions] Gregory, Richard Langton. (1980). Perceptions as hypotheses. Philosophical Transactions of the Royal Society of London. B, Biological Sciences.
[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, Bengio, Yoshua. (2016). Deep learning.
[tian2020makes] Tian, Yonglong, Sun, Chen, Poole, Ben, Krishnan, Dilip, Schmid, Cordelia, Isola, Phillip. (2020). What makes for good views for contrastive learning?. Advances in neural information processing systems.
[balestriero2022contrastive] Balestriero, Randall, LeCun, Yann. (2022). Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. Advances in Neural Information Processing Systems.
[cosentino2022toward] Cosentino, Romain, Sengupta, Anirvan, Avestimehr, Salman, Soltanolkotabi, Mahdi, Ortega, Antonio, Willke, Ted, Tepper, Mariano. (2022). Toward a geometrical understanding of self-supervised contrastive learning. arXiv preprint arXiv:2205.06926.
[balestriero2024learning] Balestriero, Randall, LeCun, Yann. (2024). Learning by reconstruction produces uninformative features for perception. arXiv preprint arXiv:2402.11337.
[petersen2022monotonic] Petersen, Felix, Borgelt, Christian, Kuehne, Hilde, Deussen, Oliver. (2022). Monotonic differentiable sorting networks. arXiv preprint arXiv:2203.09630.
[spearman1961proof] Spearman, Charles. (1961). The proof and measurement of association between two things.
[agrawal2022alpha] Agrawal, Kumar K, Mondal, Arnab Kumar, Ghosh, Arna, Richards, Blake. (2022). a-ReQ: Assessing Representation Quality in Self-Supervised Learning by measuring eigenspectrum decay. Advances in Neural Information Processing Systems.
[thilak2023lidar] Thilak, Vimal, Huang, Chen, Saremi, Omid, Dinh, Laurent, Goh, Hanlin, Nakkiran, Preetum, Susskind, Joshua M, Littwin, Etai. (2023). Lidar: Sensing linear probing performance in joint embedding ssl architectures. arXiv preprint arXiv:2312.04000.
[garrido2023rankme] Garrido, Quentin, Balestriero, Randall, Najman, Laurent, Lecun, Yann. (2023). Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank. International conference on machine learning.
[balestriero2025gaussian] Balestriero, Randall, Ballas, Nicolas, Rabbat, Mike, LeCun, Yann. (2025). Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density. arXiv preprint arXiv:2510.05949.
[assran2022hidden] Assran, Mahmoud, Balestriero, Randall, Duval, Quentin, Bordes, Florian, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, Ballas, Nicolas. (2022). The hidden uniform cluster prior in self-supervised learning. arXiv preprint arXiv:2210.07277.
[grover2019stochastic] Grover, Aditya, Wang, Eric, Zweig, Aaron, Ermon, Stefano. (2019). Stochastic optimization of sorting networks via continuous relaxations. arXiv preprint arXiv:1903.08850.
[cuturi2019differentiable] Cuturi, Marco, Teboul, Olivier, Vert, Jean-Philippe. (2019). Differentiable ranking and sorting using optimal transport. Advances in neural information processing systems.
[lehmann2005testing] Lehmann, Erich Leo, Romano, Joseph P. (2005). Testing statistical hypotheses.
[adams2003sobolev] Adams, Robert A, Fournier, John JF. (2003). Sobolev spaces.
[rodas2025diet] Rodas, Bryan, Montesino, Natalie, Ambsdorf, Jakob, Klindt, David, Balestriero, Randall. (2025). DIET-CP: Lightweight and Data Efficient Self Supervised Continued Pretraining. arXiv preprint arXiv:2509.06990.
[neyman1933ix] Neyman, Jerzy, Pearson, Egon Sharpe. (1933). IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character.
[fisher1928statistical] Fisher, Ronald Aylmer. (1928). Statistical methods for research workers.
[van2025joint] Van Assel, Hugues, Ibrahim, Mark, Biancalani, Tommaso, Regev, Aviv, Balestriero, Randall. (2025). Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning. arXiv preprint arXiv:2505.12477.
[balestriero2023cookbook] Balestriero, Randall, Ibrahim, Mark, Sobal, Vlad, Morcos, Ari, Shekhar, Shashank, Goldstein, Tom, Bordes, Florian, Bardes, Adrien, Mialon, Gregoire, Tian, Yuandong, others. (2023). A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210.
[quicksort] Hoare, C. A. R.. (1962). Quicksort. The Computer Journal. doi:10.1093/comjnl/5.1.10.
[jarque1980efficient] Jarque, Carlos M, Bera, Anil K. (1980). Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics letters.
[friston2010free] Friston, Karl. (2010). The free-energy principle: a unified brain theory?. Nature reviews neuroscience.
[sutton1991dyna] Sutton, Richard S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin.
[tolman1948cognitive] Tolman, Edward C. (1948). Cognitive maps in rats and men.. Psychological review.
[biau2015lectures] Biau, Gérard, Devroye, Luc. (2015). Lectures on the nearest neighbor method.
[caflisch1998monte] Caflisch, Russel E. (1998). Monte carlo and quasi-monte carlo methods. Acta numerica.
[marsaglia1972choosing] Marsaglia, George. (1972). Choosing a point from the surface of a sphere. The Annals of Mathematical Statistics.
[mhaskar2001spherical] Mhaskar, H, Narcowich, F, Ward, J. (2001). Spherical Marcinkiewicz-Zygmund inequalities and positive quadrature. Mathematics of computation.
[narcowich2006localized] Narcowich, Francis J, Petrushev, Pencho, Ward, Joseph D. (2006). Localized tight frames on spheres. SIAM Journal on Mathematical Analysis.
[epps1983test] Epps, Thomas W, Pulley, Lawrence B. (1983). A test for normality based on the empirical characteristic function. Biometrika.
[roy1953heuristic] Roy, Samarendra Nath. (1953). On a heuristic method of test construction and its use in multivariate analysis. The Annals of Mathematical Statistics.
[tishby2000information] Tishby, Naftali, Pereira, Fernando C, Bialek, William. (2000). The information bottleneck method. arXiv preprint physics/0004057.
[hyvarinen2000independent] Hyvärinen, Aapo, Oja, Erkki. (2000). Independent component analysis: algorithms and applications. Neural Networks.
[shannon1948mathematical] Shannon, Claude E. (1948). A mathematical theory of communication. The Bell system technical journal.
[cover1999elements] Cover, Thomas M. (1999). Elements of information theory.
[fang2019generic] Fang, Song, Skoglund, Mikael, Johansson, Karl Henrik, Ishii, Hideaki, Zhu, Quanyan. (2019). Generic variance bounds on estimation and prediction errors in time series analysis: An entropy perspective. 2019 IEEE Information Theory Workshop (ITW).
[cvitkovic2019minimal] Cvitkovic, Milan, Koliander, Günther. (2019). Minimal achievable sufficient statistic learning. International Conference on Machine Learning.
[szekely2005new] Székely, Gábor J, Rizzo, Maria L. (2005). A new test for multivariate normality. Journal of Multivariate Analysis.
[henter2016minimum] Henter, Gustav Eje, Kleijn, W Bastiaan. (2016). Minimum entropy rate simplification of stochastic processes. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[hjelm2018learning] Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil, Trischler, Adam, Bengio, Yoshua. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
[gutmann2010noise] Gutmann, Michael, Hyvärinen, Aapo. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.
[poole2019variational] Poole, Ben, Ozair, Sherjil, Van Den Oord, Aaron, Alemi, Alex, Tucker, George. (2019). On variational bounds of mutual information. International conference on machine learning.
[mcallester2020formal] McAllester, David, Stratos, Karl. (2020). Formal limitations on the measurement of mutual information. International Conference on Artificial Intelligence and Statistics.
[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[ma2018noise] Ma, Zhuang, Collins, Michael. (2018). Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. arXiv preprint arXiv:1809.01812.
[belghazi2018mutual] Belghazi, Mohamed Ishmael, Baratin, Aristide, Rajeshwar, Sai, Ozair, Sherjil, Bengio, Yoshua, Courville, Aaron, Hjelm, Devon. (2018). Mutual information neural estimation. International conference on machine learning.
[barber2004algorithm] Barber, David, Agakov, Felix. (2004). The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems.
[suzuki2008approximating] Suzuki, Taiji, Sugiyama, Masashi, Sese, Jun, Kanamori, Takafumi. (2008). Approximating mutual information by maximum likelihood density ratio estimation. New challenges for feature selection in data mining and knowledge discovery.
[kraskov2004estimating] Kraskov, Alexander, Stögbauer, Harald, Grassberger, Peter. (2004). Estimating mutual information. Physical Review E—Statistical, Nonlinear, and Soft Matter Physics.
[ebner2020tests] Ebner, Bruno, Henze, Norbert. (2020). Tests for multivariate normality—A critical review with emphasis on weighted L2-statistics. Test.
[bell1995information] Bell, Anthony J, Sejnowski, Terrence J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural computation.
[linsker1988self] Linsker, Ralph. (1988). Self-organization in a perceptual network. Computer.
[elfving1947asymptotical] Elfving, Gustav. (1947). The asymptotical distribution of range in samples from a normal population. Biometrika.
[gupta1952estimation] Gupta, AK. (1952). Estimation of the mean and standard deviation of a normal population from a censored sample. Biometrika.
[shapiro1972approximate] Shapiro, Samuel S, Francia, RS. (1972). An approximate analysis of variance test for normality. Journal of the American Statistical Association.
[mosteller2006some] Mosteller, Frederick. (2006). On some useful “inefficient” statistics.
[blom1958statistical] Blom, Gunnar. (1958). Statistical estimates and transformed beta-variables.
[rahman1997modification] Rahman, M Mahibbur, Govindarajulu, Z. (1997). A modification of the test of Shapiro and Wilk for normality. Journal of Applied Statistics.
[weisburg1975approximate] Weisberg, S, Bingham, C. (1975). An approximate analysis of variance test for non-normality suitable for machine computation. Technometrics.
[plackett1958linear] Plackett, RL. (1958). Linear estimation from censored data. The Annals of Mathematical Statistics.
[hammersley1954estimation] Hammersley, JM, Morton, KW. (1954). The estimation of location and scale parameters from grouped data. Biometrika.
[hausdorff1923momentprobleme] Hausdorff, Felix. (1923). Momentprobleme für ein endliches Intervall. Mathematische Zeitschrift.