Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density
Randall Balestriero (Meta-FAIR & Brown University), Nicolas Ballas (Meta-FAIR), Mike Rabbat (Meta-FAIR), Yann LeCun (Meta-FAIR & NYU)
Abstract
Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more: it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic to the dataset and architecture used: in any case, one can compute the learned probability of a sample $x$ efficiently and in closed form using the model's Jacobian matrix at $x$. Our findings are empirically validated across datasets (synthetic, controlled, and Imagenet) and across different Self Supervised Learning methods falling under the JEPA family (I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the method extracting the JEPA-learned density as JEPA-SCORE.
Introduction
The training procedure of foundation models, i.e., Deep Networks (DNs) $f_\theta$ able to solve many tasks in zero or few-shot, can take many forms and is at the center of Self Supervised Learning research [2]. Over the years, one principle has emerged as central to all current state-of-the-art methods: encouraging $f_\theta(X)$ to be maximum Entropy given i.i.d. pretraining samples $X$ with density $p_X$ [17, 10]. Because the differential Entropy is difficult to estimate in high-dimensional spaces, and $f_\theta(X)$ often contains thousands of dimensions, one solution is to maximize a lower bound by using a decoder and learning to reconstruct $X$ from $f(X)$ [16]. Because this approach comes with known limitations [3], more and more foundation models are trained with Joint-Embedding Predictive Architectures (JEPAs) [12] that directly encourage $f_\theta(X)$ to be Gaussian. In fact, the Gaussian distribution is the one of maximum differential Entropy under a covariance constraint, leading to $f_\theta(X)$ producing Gaussian Embeddings (GE).
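For completeness, the maximum-entropy property invoked above can be stated concisely: among all zero-mean densities with a fixed covariance $\Sigma$, the Gaussian maximizes differential entropy, since

$$
h(X) \triangleq -\int p(x)\log p(x)\,\mathrm{d}x, \qquad h\big(\mathcal{N}(0,\Sigma)\big) = \tfrac{1}{2}\log\!\left((2\pi e)^K \det\Sigma\right),
$$

and for any zero-mean density $p$ with covariance $\Sigma$ (the second moments make $\int p \log q$ equal $\int q \log q$ for $q = \mathcal{N}(0,\Sigma)$),

$$
h\big(\mathcal{N}(0,\Sigma)\big) - h(p) = D_{\mathrm{KL}}\big(p \,\|\, \mathcal{N}(0,\Sigma)\big) \geq 0.
$$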
JEPAs can take many forms by employing numerous implicit and explicit regularizers [15, 13]. Today's JEPAs mostly take three forms: (i) moment-matching objectives (VICReg [4], W-MSE [8]), (ii) non-parametric estimators (SimCLR [6], MoCo [9], CLIP [14]), and (iii) implicit teacher-student methods (DINO [5], I-JEPA [1]). While JEPAs produce state-of-the-art representations, they are currently seen as disconnected from generative models whose goal is to estimate p X . In fact, the absence of generative modeling is a praised benefit of JEPAs. But one may wonder. . .
Can the density of f ( X ) be specified without f learning about p X ?
That question will be at the core of our study, and the answer is no: producing Gaussian Embeddings can only happen if $f_\theta$ learns the underlying data density $p_X$. But JEPAs estimate $p_X$ in a highly non-standard way, free of input-space reconstruction, and free of a parametric model for $p_X$. One question remains. . .
Is there any further benefit of not only specifying a density for f θ ( X ) but using the eponymous Gaussian density?
As it turns out, this choice guarantees that the estimator for $p_X$ implicitly learned during JEPA training can easily be extracted from the final trained model $f_\theta$, an estimator we call the JEPA-SCORE (eq. (5)). Our findings not only open new avenues in using JEPA-SCORE for outlier detection or data curation, but also shake up the Self Supervised Learning paradigm by showing how non-parametric density estimation in high dimension is now amenable through JEPAs. Our theory and its corresponding controlled experiments are provided in sections 2.1 and 2.2, and experiments with state-of-the-art models such as DINOv2, MetaCLIP and I-JEPA are provided in section 2.3. JEPA-SCORE's implementation only takes a few lines of code and is provided in listing 1.
JEPA-SCORE: the Data Density Implicitly Learned by JEPAs
We now derive our main result stating that in order to minimize the JEPA objective, a DN must learn the data density. We start with some preliminary results in section 2.1, we formalize our general finding in section 2.2, culminating in the JEPA result of section 2.3 and theorem 1. An efficient implementation is also provided in section 2.3.
Preliminaries: Gaussian Embeddings Are Uniform on the Hypersphere
Our derivations will rely on a simple observation widely known in high-dimensional statistics: $K$-dimensional standard Gaussians, appropriately normalized, converge to being Uniform on the hypersphere. Let's denote the $K$-dimensional standard Gaussian random variable by $Z$, and the normalized version by $X \triangleq Z/\sqrt{K}$ with density $f_{\mathcal{N}(0, I/K)}$. Let's also denote the Uniform distribution on the $K$-dimensional hypersphere surface of radius $R > 0$ by $f_{\mathcal{U}(\mathbb{S}(0,R,K))}$.
Lemma 1. As K grows, X quickly concentrates around the hypersphere of radius 1 , converging to a Uniform density over the hypersphere surface. (Proof in section A.1.)
Lemma 1 provides an interesting geometric fact which we turn into a practical result for SSL in the following section 2.2, where we demonstrate how learning to produce Gaussian embeddings equates with learning the Energy function of the training data.
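Lemma 1 is easy to verify numerically. The following sketch (our own illustration, with the hypothetical helper `norm_stats`; NumPy assumed) samples $X = Z/\sqrt{K}$ and shows the norms concentrating around 1 as $K$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_stats(K: int, n: int = 10_000):
    """Sample X = Z / sqrt(K) with Z ~ N(0, I_K); return mean and std of ||X||."""
    Z = rng.standard_normal((n, K))
    X = Z / np.sqrt(K)                     # X ~ N(0, I/K)
    norms = np.linalg.norm(X, axis=1)
    return norms.mean(), norms.std()

# the spread of the norms shrinks as K grows: X concentrates on the unit sphere
for K in (10, 100, 1_000):
    mean, std = norm_stats(K)
```

The standard deviation of the norms shrinks roughly as $1/\sqrt{2K}$, matching the concentration claimed by the lemma.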

Figure 2: Top left: Visual illustration of JEPA-SCORE: the DN $f_\theta$ must learn $p_X$ for its Jacobian matrix to expand or contract the density in order to produce a Uniform density on the hypersphere surface in its embedding space (lemmas 1 and 2). Top right: Pearson correlation between JEPA-SCORE and the true $\log p(x)$ on a GMM data model for various input dimensions (rows) and numbers of samples (columns). In all cases, producing Gaussian embeddings makes the backbone $f_\theta$ internalize the data density, which can be easily extracted using our proposed JEPA-SCORE. Bottom: as JEPA-SCORE is an approximation of the true score function, it is possible to perform Langevin sampling to recover the true data distribution, as shown here in two dimensions.
Producing Gaussian Embeddings Equates Learning an Energy Function
This section builds upon lemma 1 to demonstrate how learning to produce Gaussian embeddings implies learning about the data density.
Consider two densities, one on the input domain ($p_X$) and one on the output domain ($p_{f(X)}$). For $p_{f(X)}$ to have a particular form, e.g., $\mathcal{N}(0, I/K)$, $f$ must learn something about $p_X$. To see that, we leverage the change of variables formula expressing the embedding density $p_{f_\theta}$ as a function of the data density and the DN's Jacobian matrix:
$$
p_{f(X)}(f(x)) = \int_{\{u \in \mathbb{R}^D \,\mid\, f(u) = f(x)\}} \frac{p_X(u)}{\prod_{k=1}^{\operatorname{rank}(J_f(u))} \sigma_k(J_f(u))} \,\mathrm{d}\mathcal{H}^r(u), \tag{1}
$$
where $\mathcal{H}^r$ denotes the $r$-dimensional Hausdorff measure, with $r \triangleq \dim(\{u \in \mathbb{R}^D \mid f(u) = f(x)\})$ being the dimension of the level set of $f$ at $x$. We note that eq. (1) does not require $f$ to be bijective, which will be crucial for our JEPA result in section 2.3; for details see [11, 7]. Combining eq. (1) and lemma 1 leads us to the following result.
Lemma 2. In order for $f(X)$ to be distributed as $\mathcal{N}(0, I/K)$ for large $K$, $f$ must learn the data density $p_X$ up to mean-preserving rescaling within each level set $\{u \in \mathbb{R}^D \mid f(u) = f(x)\}$. (Proof in section A.2.)
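The change of variables formula of eq. (1) can be sanity-checked numerically in the simplest bijective case (our own toy example, NumPy assumed): for a linear map $f(x) = Ax$, the level sets are points, the product of singular values of the Jacobian equals $|\det A|$, and both sides of the formula are available in closed form:

```python
import numpy as np

# bijective linear map f(x) = A x; the change of variables reduces to
# p_{f(X)}(f(x)) = p_X(x) / prod_k sigma_k(A), with prod_k sigma_k(A) = |det A|
A = np.array([[2.0, 0.5],
              [0.0, 1.5]])
sing = np.linalg.svd(A, compute_uv=False)        # singular values of the Jacobian

def p_X(x):                                      # standard normal density in 2D
    return np.exp(-0.5 * x @ x) / (2 * np.pi)

def p_fX(y):                                     # exact density of AX ~ N(0, A A^T)
    cov = A @ A.T
    quad = y @ np.linalg.solve(cov, y)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

x = np.array([0.3, -0.7])
lhs = p_fX(A @ x)                                # left-hand side of eq. (1)
rhs = p_X(x) / sing.prod()                       # right-hand side of eq. (1)
```

The two sides agree to machine precision, as the change of variables formula requires.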
Empirical validation. Before broadening lemma 2 to JEPAs, we first provide empirical validations that learning to produce Gaussian embeddings implies learning the data density in fig. 2. We show that, in fact, the data density can be recovered with high accuracy, and it is even possible to draw samples from the estimated density through Langevin dynamics.
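As a concrete illustration of the Langevin sampling mentioned above (our own toy sketch, not the paper's experiment), unadjusted Langevin dynamics only needs the score function; here we use the closed-form score of a standard Gaussian target:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy target: standard 2D Gaussian, whose score is known in closed form
def score(x):
    return -x                                    # grad log p(x) for N(0, I)

eps = 0.01                                       # step size
x = 5.0 * rng.standard_normal((5_000, 2))        # start far from the target
for _ in range(2_000):
    # unadjusted Langevin dynamics: drift along the score plus Gaussian noise
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(x.shape)
# x is now approximately distributed as N(0, I)
```

In the paper's setting, the closed-form score above would be replaced by the JEPA-SCORE gradient.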
JEPA-SCORE: The Data Density Learned by JEPAs
Most Joint-Embedding Predictive Architecture (JEPA) methods aim to achieve two goals: (i) predictive invariance, and (ii) representation diversity, as seen in the following loss function
$$
\mathcal{L} \triangleq \frac{1}{N}\sum_{n=1}^{N} \operatorname{dist}\!\left(\operatorname{Pred}\!\left(\operatorname{Enc}\!\left(x_n^{(1)}\right)\right), \operatorname{Enc}\!\left(x_n^{(2)}\right)\right) + \operatorname{diversity}\!\left(\left(\operatorname{Enc}(x_n)\right)_{n \in [N]}\right), \tag{2}
$$

Figure 3: Depiction of JEPA-SCORE for 5,000 samples from different datasets (imagenet-1k/a/r, MNIST and Galaxy). We clearly observe that as the pretraining dataset size increases (all models against IJEPA-1k), MNIST and Galaxy images are seen as lower-probability samples, i.e., those images are less and less represented within the overall pretraining dataset. While our score does not rely on singular vectors, we provide some examples in fig. 7. This can be used to assess whether a model is ready to handle particular data domains at test time for zero-shot tasks.
where $x_n^{(1)}, x_n^{(2)}$ are two 'views' generated from the original sample through the stochastic operator $\mathcal{G}$, and dist is a distance function (e.g., L2). For images, $\mathcal{G}$ typically involves two different data augmentations. At this point, lemma 2 only takes into account the non-collapse term of JEPAs. But an interesting observation from lemma 2 is that the integration occurs over the level set of the function $f_\theta$, which coincides with the JEPA's invariance term when Pred is near identity.
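A schematic PyTorch rendering of eq. (2) may help fix ideas. The diversity term below is a VICReg-style variance hinge, only one of the several anti-collapse instantiations listed earlier, and `jepa_loss`, `enc`, `pred` are illustrative names of our own:

```python
import torch
import torch.nn.functional as F

def jepa_loss(enc, pred, x1, x2, div_weight=1.0):
    """Schematic JEPA objective: latent prediction + anti-collapse.

    The diversity term is a VICReg-style variance hinge; actual JEPAs
    instantiate it differently (moment matching, contrastive, teacher-student).
    """
    z1, z2 = enc(x1), enc(x2)
    invariance = F.mse_loss(pred(z1), z2)     # dist(Pred(Enc(x1)), Enc(x2))
    std = z1.std(dim=0)                       # per-dimension spread over the batch
    diversity = F.relu(1.0 - std).mean()      # penalize collapsed dimensions
    return invariance + div_weight * diversity
```

For instance, with `enc = torch.nn.Linear(4, 8)` and `pred = torch.nn.Identity()`, the loss is strictly positive for two noisy views of the same batch.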
Data assumption. To make our result as precise as possible, we focus on the case where the views are generated from stochastic transformations, with density p T . We also denote the density of generators as p µ , from which the data density p X is defined as
$$
p_X(x) \triangleq \mathbb{E}_{\mu \sim p_\mu,\, T \sim p_T}\left[\delta\big(x - T(\mu)\big)\right]. \tag{3}
$$
In other words, each training sample is seen as some transformation of some generator. We do not impose further restrictions, e.g., that there is only one generator per class. In practical settings, the generators ($p_\mu$) are the original training samples prior to applying any augmentation; hence, estimating $p_\mu$ amounts to estimating the data density.
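The generator/transformation data model above can be instantiated in a few lines (our own toy example, with illustrative names `generators` and `sample_view`; NumPy assumed): a finite set of anchor points plays the role of $p_\mu$, and additive jitter plays the role of $p_T$:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy instantiation of eq. (3): generators are a few fixed anchor points,
# and each view is a random perturbation of one generator
generators = rng.standard_normal((10, 16))          # p_mu: 10 "original samples"

def sample_view():
    mu = generators[rng.integers(len(generators))]  # mu ~ p_mu
    t = 0.1 * rng.standard_normal(16)               # T ~ p_T (additive jitter)
    return mu + t                                    # x = T(mu)

x = sample_view()
```

Every sampled view lies close to one of the generators, mirroring how augmented images cluster around the original training samples.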
JEPA-SCORE. Combining eqs. (2) and (3) and lemma 2 leads to the following result proved in section A.3.
Theorem 1. At optimality, JEPA embeddings estimate the data density as per
$$
p_\mu(\mu) \propto \mathbb{E}_{p_T}\left[\frac{1}{\prod_{k=1}^{\operatorname{rank}(J_f(x))} \sigma_k(J_f(\mu, T))}\right]^{-1}. \tag{4}
$$
We define our JEPA-SCORE for input $x$ as the Monte Carlo estimator of eq. (4); for a single-sample estimate we have (in log-scale)
$$
\text{JEPA-SCORE}(x) \triangleq \sum_{k=1}^{\operatorname{rank}(J_f(x))} \log \sigma_k\big(J_f(x)\big), \tag{5}
$$
which exactly recovers $p_\mu$ as long as the JEPA loss is minimized. We use a logarithmic scale to align with the definition of a score function. A visual illustration is provided at the top of fig. 2. We empirically validate eq. (5) by using pretrained JEPA models and visualizing a few Imagenet-1k classes after ordering the images based on their JEPA-SCORE in figs. 1, 5 and 6. We observe that for bird classes, high-probability samples depict flying birds while low-probability ones are seated. We also conduct an additional experiment where we compute JEPA-SCORE for 5,000 samples from different datasets (Imagenet, MNIST and Galaxy) and depict the distribution of their JEPA-SCOREs in fig. 3. We clearly see that datasets that weren't seen during pretraining, e.g., Galaxy, have much lower JEPA-SCOREs than Imagenet samples.
Conclusion. We provided a novel connection between JEPAs and score-based methods, two families of methods thought to be unrelated until now. Controlled experiments on synthetic data confirm the validity of our theory, and qualitative experiments with state-of-the-art large-scale JEPAs further validate our findings. Although this is only a first step, we hope that JEPA-SCORE will open new avenues for outlier detection and model assessment for downstream tasks.
Proofs
Proof of Lemma 1
Proof. Proof 1: The proof consists in expressing both densities in spherical coordinates, and studying their convergence as K increases. Let's first express the Uniform distribution in spherical coordinates:
$$
f_{\mathcal{U}(\mathbb{S}(0,1,K))}(r, \theta) = \delta(r - 1) \, \frac{\Gamma(K/2)}{2\pi^{K/2}} \prod_{i=1}^{K-1} \sin(\theta_i)^{K-i-1},
$$
and let's now express the rescaled standard Gaussian density $Z/\sqrt{K}$ in spherical coordinates:
$$
f_{\mathcal{N}(0, I/K)}(r, \theta) = \frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)} r^{K-1} e^{-\frac{K r^2}{2}} \, \frac{\Gamma(K/2)}{2\pi^{K/2}} \prod_{i=1}^{K-1} \sin(\theta_i)^{K-i-1}.
$$
As $K$ increases, the scaled Chi-distribution converges to a Dirac function at $1$, leading to our desired result.
Proof 2: The above proof provides granular details into the convergence to the Uniform distribution on the hypersphere by studying the scaled Chi-distribution. For completeness, we also provide a more straightforward argument, sufficient for the limiting case. First, since $Z/\sqrt{K}$ is isotropic Gaussian, the distribution of squared norms, $\|Z\|_2^2/K$, is a scaled Chi-squared distribution with mean $1$ and variance $2/K$. That is, as $K$ increases, the distribution of norms converges to a Dirac at $1$. Lastly, because $Z/\sqrt{K}$ is isotropic, it would be uniformly distributed on the hypersphere after normalization. But as $K$ increases, the samples are already (approximately) normalized, hence leading to our result.
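The mean and variance claimed for $\|Z\|_2^2/K$ in Proof 2 can be checked numerically (our own sanity check, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 512, 20_000
Z = rng.standard_normal((n, K))
s = (Z ** 2).sum(axis=1) / K      # ||Z||^2 / K: a Chi-squared(K) scaled by 1/K
# mean ~ 1 and variance ~ 2/K, so the norms collapse onto 1 as K grows
```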
Proof of Lemma 2
Proof. First and foremost, recall that the density of the random variable $f(X)$ is given by eq. (1). Relying on lemma 1, which states that for large $K$ our assumption on the output density reads $f(x) \sim \mathcal{U}(\mathbb{S}(0,1,K))$, we obtain that
$$
\int_{\{u \in \mathbb{R}^D \mid f(u) = f(x)\}} \frac{p_X(u)}{\prod_{k=1}^{\operatorname{rank}(J_f(u))} \sigma_k(J_f(u))} \,\mathrm{d}\mathcal{H}^r(u) = \mathrm{cst}.
$$
Now if $f$ is bijective between $\operatorname{supp}(p_X)$ and $\mathbb{R}^K$, then it is direct to see that $p_X(x) \propto \prod_{k=1}^{\operatorname{rank}(J_f(x))} \sigma_k(J_f(x))$. If instead $f$ is surjective, there is no longer a one-to-one mapping between $f$ and $p_X$: there is ambiguity over each level set of $f$. To see that, recall that we only need to maintain a constant value of the integral over the level set. Hence, $f$ is free to scale up one subset of that level set and scale down another subset, proportionally to $p_X$, to preserve the constant value of the integral. ∎
Proof of Theorem 1
Proof. The role of the predictor in JEPA training is to allow for additional computation to predict one view's embedding from the other view's embedding. While this provides numerous empirical benefits, e.g., in terms of optimization landscape, it does not impact the level sets of the encoder, which is what is needed in eq. (1).
To understand the above argument, consider the case where the views are obtained by applying a transformation such as masking. We denote by $\mathcal{M}$ the masking random variable and by $\operatorname{mask}(x)$ the application of one realization of $\mathcal{M}$ onto the input $x$. We thus have for the invariance term of sample $x_n$
$$
\operatorname{dist}\!\left(\operatorname{Pred}\!\left(\operatorname{Enc}\!\left(\operatorname{mask}^{(1)}(x_n)\right)\right), \operatorname{Enc}\!\left(\operatorname{mask}^{(2)}(x_n)\right)\right).
$$
Because the predictor is only applied to one of the two embeddings, it is clear that for the JEPA loss to be minimized it must also be true that
$$
\operatorname{Enc}\!\left(\operatorname{mask}^{(2)}(x_n)\right) = \operatorname{Enc}\!\left(\operatorname{mask}^{(3)}(x_n)\right), \quad \forall\, \operatorname{mask}^{(2)}, \operatorname{mask}^{(3)} \sim \mathcal{M},
$$
for any realization of $\operatorname{mask}^{(1)}$. In other words, the encoder's invariance is over the support of $\mathcal{M}$ no matter if the predictor is identity or nonlinear. Therefore, our result directly follows from the above combined with eqs. (1) and (3). ∎
Implementation Details
import torch
from torch.autograd.functional import jacobian

eps = 1e-6  # numerical floor for the singular values

# model returns a tensor of shape (num_samples, features_dim)
J = jacobian(lambda x: model(x).sum(0), inputs=images)
with torch.inference_mode():
    J = J.flatten(2).permute(1, 0, 2)
    svdvals = torch.linalg.svdvals(J)
    jepa_score = svdvals.clip_(eps).log_().sum(1)
Listing 1: JEPA-SCORE implementation in PyTorch. Our empirical ablations demonstrate that JEPA-SCORE is not sensitive to the choice of eps (we pick 1e-6).
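A self-contained usage sketch of the listing above; the two-layer network below is a toy stand-in for a pretrained JEPA encoder (an assumption for illustration only), and `ranking` is an illustrative name of our own:

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
# toy stand-in for a pretrained encoder: maps 12-d inputs to 4-d embeddings
model = torch.nn.Sequential(torch.nn.Linear(12, 6), torch.nn.Tanh(), torch.nn.Linear(6, 4))
images = torch.randn(8, 12)
eps = 1e-6

J = jacobian(lambda x: model(x).sum(0), inputs=images)   # shape (K, N, D)
with torch.inference_mode():
    J = J.flatten(2).permute(1, 0, 2)                    # (N, K, D): one Jacobian per sample
    svdvals = torch.linalg.svdvals(J)
    jepa_score = svdvals.clip_(eps).log_().sum(1)        # per-sample log-density score

ranking = jepa_score.argsort(descending=True)            # most to least likely sample indices
```

Sorting a dataset by `jepa_score` is all that is needed to reproduce the low/high-probability orderings shown in the figures.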
Additional Figures

Figure 5: Top: least and most likely samples based on the proposed JEPA-SCORE for a few different pretrained backbones on class 141. Bottom: Random samples from Imagenet-1k training dataset for class 141.


Figure 7: Depiction of the top singular vectors of the Jacobian matrix (columns) for a given input image (left column). The corresponding singular value is provided in the title of each subplot.
Figure 1: Depiction of the 5 least (left) and 5 most (right) likely samples of class 21 from Imagenet as per JEPA-SCORE, JEPAs' implicit density estimator learned during pretraining. Two striking observations: (i) across all JEPAs (rows) the types of samples with low and high probabilities are alike, and (ii) the same samples (amongst 1,000) are found at those extrema. Random samples from that class are provided in fig. 4.
Figure 4: Random samples from Imagenet-1k training dataset for class 21.
JEPAs can take many forms by employing numerous implicit and explicit regularizers [15, 13]. Today's JEPAs mostly take three forms: (i) moment-matching objectives (VICReg [4], W-MSE [8]), (ii) non-parametric estimators (SimCLR [6], MoCo [9], CLIP [14]), and (iii) implicit teacher-student methods (DINO [5], I-JEPA [1]). While JEPAs produce state-of-the-art representations, they are currently seen as disconnected from generative models, whose goal is to estimate $p_X$. In fact, the absence of generative modeling is a praised benefit of JEPAs. But one may wonder…
Can the density of $f(X)$ be specified without $f$ learning about $p_X$?
That question will be at the core of our study, and the answer is no: producing Gaussian Embeddings can only happen if $f_{\bm{\theta}}$ learns the underlying data density $p_X$. But JEPAs estimate $p_X$ in a highly non-standard way, free of input-space reconstruction and free of a parametric model for $p_X$. One question remains…
Is there any further benefit of not only specifying a density for $f_{\bm{\theta}}(X)$ but using the eponymous Gaussian density?
As it turns out, this choice guarantees that the estimator of $p_X$ implicitly learned during JEPA training can easily be extracted from the final trained model $f_{\bm{\theta}}$; we call this estimator the JEPA-SCORE (eq. (5)). Our findings not only open new avenues for using JEPA-SCORE in outlier detection or data curation, but also shake up the Self Supervised Learning paradigm by showing that non-parametric density estimation in high dimensions is now attainable through JEPAs. Our theory and its corresponding controlled experiments are provided in sections 2.1 and 2.2, and experiments with state-of-the-art models such as DINOv2, MetaCLIP and I-JEPA are provided in section 2.3. JEPA-SCORE's implementation only takes a few lines of code and is provided in listing 1.
Joint Embeddings Secretly Learn the Data Density Function

We now derive our main result stating that, in order to minimize the JEPA objective, a DN must learn the data density. We start with some preliminary results in section 2.1, and formalize our general finding in section 2.2, culminating in the JEPA result (theorem 1) in section 2.3. An efficient implementation is also provided in section 2.3.
Gaussian Embedding Enforcement

Our derivations will rely on a simple observation widely known in high-dimensional statistics: $K$-dimensional standard Gaussians, appropriately normalized, converge to being Uniform on the hypersphere. Let's denote the $K$-dimensional standard Gaussian random variable by $Z$, and the normalized version by $X \triangleq Z/\sqrt{K}$ with density $f_{\mathcal{N}(0,{\bm{I}}/K)}$. Let's also denote the Uniform distribution on the surface of the $K$-dimensional hypersphere of radius $R>0$ by $f_{\mathcal{U}(\mathbb{S}(0,R,K))}$.
Lemma 1. As $K$ grows, $X$ quickly concentrates around the hypersphere of radius $1$, converging to a Uniform density over the hypersphere surface. (Proof in section A.1.)
Lemma 1 provides an interesting geometric fact which we turn into a practical result for SSL in section 2.2, where we demonstrate how learning to produce Gaussian embeddings equates to learning the Energy function of the training data.
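Lemma 1 is also easy to verify numerically. The following is a small sketch (ours, not the paper's code, with variable names of our choosing) showing how the norms of $Z/\sqrt{K}$ concentrate at $1$ as $K$ grows:

```python
import numpy as np

# Numerical check of lemma 1: samples of X = Z/sqrt(K), Z ~ N(0, I_K),
# concentrate on the unit hypersphere as the embedding dimension K grows.
rng = np.random.default_rng(0)

def norm_spread(K, n=20_000):
    """Mean and std of ||Z/sqrt(K)|| over n Gaussian samples."""
    Z = rng.standard_normal((n, K))
    norms = np.linalg.norm(Z, axis=1) / np.sqrt(K)
    return norms.mean(), norms.std()

for K in (4, 64, 1024):
    mean, std = norm_spread(K)
    print(f"K={K:5d}  mean={mean:.3f}  std={std:.4f}")  # std shrinks with K
```

The standard deviation of the norms decays like $1/\sqrt{2K}$, matching the Chi-squared variance argument used in the proof.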
Consider two densities, one on the input domain ($p_X$) and one on the output domain ($p_{f(X)}$). For $p_{f(X)}$ to have a particular form, e.g., $\mathcal{N}(0,{\bm{I}}/K)$, $f$ must learn something about $p_X$. To see that, we will leverage the well-known change of variable formula expressing the embedding density $p_{f_{\bm{\theta}}}$ as a function of the data density and the DN's Jacobian matrix:
$$ p_{f(X)}(f({\bm{x}}))=\int_{\{{\bm{u}}\in\mathbb{R}^{D}\,|\,f({\bm{u}})=f({\bm{x}})\}}\frac{p_{X}({\bm{u}})}{\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{u}}))}\sigma_{k}(J_{f}({\bm{u}}))}\,\mathrm{d}\mathcal{H}^{r}({\bm{u}}), \tag{1} $$
where $\mathcal{H}^{r}$ denotes the $r$-dimensional Hausdorff measure, with $r \triangleq \dim(\{{\bm{u}}\in\mathbb{R}^{D}\,|\,f({\bm{u}})=f({\bm{x}})\})$ the dimension of the level set of $f$ at ${\bm{x}}$. We note that eq. (1) does not require $f$ to be bijective, which will be crucial for our JEPA result in section 2.3; for details see [krantz2008geometric, cvitkovic2019minimal]. Combining eq. (1) and lemma 1 leads us to the following result.
Lemma 2. In order for $f(X)$ to be distributed as $\mathcal{N}(0,{\bm{I}}/K)$ for large $K$, $f$ must learn the data density $p_X$ up to a mean-preserving rescaling within each level set $\{{\bm{u}}\in\mathbb{R}^{D}\,|\,f({\bm{u}})=f({\bm{x}})\}$. (Proof in section A.2.)
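The change-of-variable machinery behind lemma 2 can be sanity-checked in one dimension, where the product of singular values reduces to $|f'(x)|$. The sketch below is our own illustration (the map `tanh`, the point `x0`, and the window width are our choices):

```python
import numpy as np

# Compare the change-of-variable prediction p_X(x0)/|f'(x0)| with a Monte
# Carlo estimate of the density of f(X), for X ~ N(0, 1) and f = tanh
# (bijective and smooth, so eq. (1) has a single level-set point).
rng = np.random.default_rng(0)

x0 = 0.5
p_x = np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi)   # p_X(x0)
fprime = 1.0 - np.tanh(x0) ** 2                 # |f'(x0)|, the 1x1 Jacobian
predicted = p_x / fprime                        # change-of-variable formula

# Empirical density of f(X) in a small window around f(x0).
samples = np.tanh(rng.standard_normal(2_000_000))
eps = 1e-2
empirical = np.mean(np.abs(samples - np.tanh(x0)) < eps) / (2 * eps)

print(predicted, empirical)  # the two estimates should agree closely
```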
Empirical validation. Before broadening lemma 2 to JEPAs, we first provide empirical validation in fig. 2 that learning to produce Gaussian embeddings implies learning the data density. We show that, in fact, the data density can be recovered with high accuracy, and that it is even possible to draw samples from the estimated density through Langevin dynamics.
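The Langevin experiment at the bottom of fig. 2 can be sketched in a few lines. Below is our own toy version: we use the exact score of a one-dimensional Gaussian mixture in place of a learned score, purely to illustrate the sampling loop (all names and constants are ours):

```python
import numpy as np

# Langevin dynamics: given a score function s(x) ~ grad log p(x), iterating
#   x <- x + eta * s(x) + sqrt(2 * eta) * noise
# draws samples from p. Here p is an equal-weight mixture of N(-2, 1)
# and N(2, 1), whose score is available in closed form.
rng = np.random.default_rng(0)
mus = np.array([-2.0, 2.0])

def score(x):
    """Exact score of the equal-weight Gaussian mixture."""
    resp = np.exp(-0.5 * (x[:, None] - mus) ** 2)
    resp /= resp.sum(axis=1, keepdims=True)       # posterior responsibilities
    return (resp * (mus - x[:, None])).sum(axis=1)

x = 0.1 * rng.standard_normal(5_000)              # particles init near origin
eta = 1e-2
for _ in range(2_000):
    x = x + eta * score(x) + np.sqrt(2 * eta) * rng.standard_normal(x.size)

print(x.mean(), (x > 0).mean())  # both modes end up populated
```

Replacing `score` with an estimate extracted from a trained model is exactly the experiment shown in fig. 2.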
Implicit Density Learning

Most Joint-Embedding Predictive Architecture (JEPA) methods aim to achieve two goals: (i) predictive invariance, and (ii) representation diversity, as seen in the following loss function
$$ \mathcal{L}\triangleq\sum_{n=1}^{N}{\rm dist}\left({\rm Pred}\left({\rm Enc}\left({\bm{x}}_{n}^{(1)}\right)\right),{\rm Enc}\left({\bm{x}}_{n}^{(2)}\right)\right)+{\rm diversity}\left(\left({\rm Enc}\left({\bm{x}}_{n}\right)\right)_{n\in[N]}\right), \tag{2} $$
where ${\bm{x}}_{n}^{(1)},{\bm{x}}_{n}^{(2)}$ are two "views" generated from the original sample through the stochastic operator $\mathcal{G}$, and ${\rm dist}$ is a distance function (e.g., L2). For images, $\mathcal{G}$ typically involves two different data-augmentations. At this point, lemma 2 only takes into account the non-collapse term of JEPAs. But an interesting observation from lemma 2 is that the integration occurs over the level set of the function $f_{\bm{\theta}}$, which coincides with the JEPA's invariance term when ${\rm Pred}$ is near identity.
Data assumption. To make our result as precise as possible, we focus on the case where the views are generated from stochastic transformations with density $p_T$. We also denote the density of generators by $p_\mu$, from which the data density $p_X$ is defined as
$$ p_{X}({\bm{x}})=\int p_{T}({\bm{x}}\mid{\bm{\mu}})\,p_{\mu}({\bm{\mu}})\,\mathrm{d}{\bm{\mu}}. \tag{3} $$
In other words, each training sample is seen as some transformation of some generator. We do not impose further restrictions, e.g., that there is only one generator per class. In practical settings, the generators ($p_\mu$) are the original training samples prior to applying any augmentation, hence estimating $p_\mu$ amounts to estimating the data density.
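To make the data assumption concrete, here is a toy numpy instantiation (entirely ours, for intuition): the generators are three point masses and the stochastic transformation is small additive jitter.

```python
import numpy as np

# Toy data model: generators mu ~ p_mu are three point masses, and the
# transformation T ~ p_T is additive jitter, so p_X is a mixture of narrow
# bumps centered on the generators.
rng = np.random.default_rng(0)
generators = np.array([-3.0, 0.0, 3.0])            # support of p_mu
mu = rng.choice(generators, size=10_000)           # mu ~ p_mu
x = mu + 0.1 * rng.standard_normal(mu.size)        # x = T(mu)

# Every sample stays close to its generator, so estimating p_mu from x
# amounts to estimating the data density at the level of the generators.
closest = np.min(np.abs(x[:, None] - generators[None, :]), axis=1)
print(closest.max())
```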
JEPA-SCORE. Combining eqs. (2) and (3) with lemma 2 leads to the following result, proved in section A.3.
Theorem 1. At optimality, JEPA embeddings estimate the data density as per
$$ p_{\mu}({\bm{\mu}})\propto\mathbb{E}_{p_{T}}\left[\frac{1}{\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{\mu}},T))}\sigma_{k}(J_{f}({\bm{\mu}},T))}\right]^{-1}. \tag{4} $$
We define our JEPA-SCORE for input ${\bm{x}}$ as the Monte Carlo estimator of eq. (4); for a single-sample estimate we have (in log-scale)
$$ \text{JEPA-SCORE}({\bm{x}})\triangleq\sum_{k=1}^{\operatorname{rank}(J_{f}({\bm{x}}))}\log\sigma_{k}\left(J_{f}({\bm{x}})\right), \tag{5} $$
which exactly recovers $p_\mu$ as long as the JEPA loss is minimized. We use a logarithmic scale to align with the definition of a score function. A visual illustration is provided at the top of fig. 2. We empirically validate eq. (5) by using pretrained JEPA models and visualizing a few Imagenet-1k classes after ordering the images by their JEPA-SCORE in figs. 1, 5 and 6. We observe that for bird classes, high-probability samples depict flying birds while low-probability ones are seated. We also conduct an additional experiment where we compute JEPA-SCORE for 5,000 samples from different datasets (Imagenet, MNIST and Galaxy) and depict the distribution of their JEPA-SCORE in fig. 3. We clearly see that datasets that weren't seen during pretraining, e.g., Galaxy, have much lower JEPA-SCORE than Imagenet samples.

Conclusion. We provided a novel connection between JEPAs and score-based methods, two families of methods thought to be unrelated until now. Controlled experiments on synthetic data confirm the validity of our theory, and experiments with state-of-the-art large-scale JEPAs qualitatively validate our findings. Although this is only a first step, we hope that JEPA-SCORE will open new avenues for outlier detection and model assessment for downstream tasks.
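The computation behind eq. (5) reduces to summing the log singular values of the encoder's Jacobian. The sketch below is our own minimal illustration on a tiny random two-layer map; the paper's listing 1 instead operates on a pretrained backbone with automatic differentiation:

```python
import numpy as np

# Minimal sketch of JEPA-SCORE: sum the log singular values of the
# encoder's Jacobian at x. The "encoder" here is a toy random map
# purely for illustration.
rng = np.random.default_rng(0)
D, K = 8, 4
W1 = rng.standard_normal((16, D))
W2 = rng.standard_normal((K, 16))

def f(x):
    """Toy encoder f: R^D -> R^K (stand-in for a JEPA backbone)."""
    return W2 @ np.tanh(W1 @ x)

def jacobian(f, x, eps=1e-5):
    """Finite-difference Jacobian of f at x (use autograd in practice)."""
    fx = f(x)
    cols = [(f(x + eps * e) - fx) / eps for e in np.eye(x.size)]
    return np.stack(cols, axis=1)                  # shape (K, D)

def jepa_score(x):
    """log-density of x up to an additive constant, as per eq. (5)."""
    s = np.linalg.svd(jacobian(f, x), compute_uv=False)
    s = s[s > 1e-12]                               # keep the numerical rank
    return np.log(s).sum()

x = rng.standard_normal(D)
print(jepa_score(x))                               # higher = more likely
```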
Proof 1: The proof consists in expressing both densities in spherical coordinates and studying their convergence as $K$ increases. Let's first express the Uniform distribution in spherical coordinates:
$$ f_{\mathcal{U}(\mathbb{S}(0,R,K))}(r,{\bm{\theta}})=\delta(r-R)\,\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\theta_{i})^{K-i-1}, $$
and let's now express the rescaled standard Gaussian density $Z/\sqrt{K}$ in spherical coordinates:
$$ f_{Z/\sqrt{K}}(r,{\bm{\theta}})=\frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)}e^{-\frac{Kr^{2}}{2}}r^{K-1}\,\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\theta_{i})^{K-i-1}. $$
As $K$ increases, the scaled Chi-distribution (the radial factor above) converges to a Dirac delta at $1$, leading to our desired result. ∎
Proof 2: The above proof provides granular detail on the convergence to the Uniform distribution on the hypersphere by studying the scaled Chi-distribution. For completeness, we also provide a more straightforward argument, sufficient to study the limiting case. First, since $Z$ is isotropic Gaussian, the distribution of the scaled squared norms $\|Z\|_{2}^{2}/K$ is a scaled Chi-squared distribution with mean $1$ and variance $2/K$. That is, as $K$ increases, the distribution of norms converges to a Dirac at $1$. Lastly, because $Z/\sqrt{K}$ is isotropic, it is uniformly distributed on the hypersphere after normalization. But as $K$ increases, the samples are already (approximately) normalized, hence leading to our result. ∎
First and foremost, recall that the density of the random variable $f(X)$ is given by eq. (1). Relying on lemma 1, which states that for large $K$ our assumption on the output density reads $f({\bm{x}})\sim\mathcal{U}(\mathbb{S}(0,1,K))$, we obtain that $\int_{\{{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})\}}\frac{p_{X}({\bm{u}})}{\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{u}}))}\sigma_{k}(J_{f}({\bm{u}}))}\mathrm{d}\mathcal{H}^{r}({\bm{u}})={\rm cst}$. Now if $f$ is bijective between ${\rm supp}(p_X)$ and $\mathbb{R}^{K}$, it is direct to see that $p_{X}({\bm{x}})\propto\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{x}}))}\sigma_{k}(J_{f}({\bm{x}}))$. If instead $f$ is merely surjective, there is no longer a one-to-one mapping between $f$ and $p_X$; there is ambiguity over each level set of $f$. To see that, recall that we only need to maintain a constant value for the integral over each level set. Hence, $f$ is free to scale up one subset of that level set and scale down another, proportionally to $p_X$, so as to preserve the constant value of the integral. ∎
The role of the predictor in JEPA training is to allow for additional computation to predict one view's embedding from the other view's embedding. While this provides numerous empirical benefits, e.g., in terms of the optimization landscape, it does not impact the level sets of the encoder, which is what matters in eq. (1).
To understand the above argument, consider the case where the views are obtained by applying a transformation such as masking. We denote by $\mathcal{M}$ the masking random variable and by ${\rm mask}({\bm{x}})$ the application of one realization of $\mathcal{M}$ to the input ${\bm{x}}$. We thus have for the invariance term of sample ${\bm{x}}_n$
$$ {\rm dist}\left({\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}({\bm{x}}_{n})\right)\right),{\rm Enc}\left({\rm mask}^{(2)}({\bm{x}}_{n})\right)\right). $$
Because the predictor is only applied to one of the two embeddings, it is clear that for the JEPA loss to be minimized, it must also be true that
$$ {\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}({\bm{x}}_{n})\right)\right)={\rm Enc}\left({\rm mask}^{(2)}({\bm{x}}_{n})\right) $$
for any realization of ${\rm mask}^{(1)}$. In other words, the encoder's invariance holds over the support of $\mathcal{M}$ whether the predictor is the identity or nonlinear. Therefore our result directly follows from the above combined with eqs. (1) and (3). ∎

Figure 5: Top: least and most likely samples based on the proposed JEPA-SCORE for a few different pretrained backbones on class 141. Bottom: Random samples from the Imagenet-1k training dataset for class 141.


Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample’s representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs’ anti-collapse term does much more–it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used–in any case one can compute the learned probabilities of sample 𝒙{\bm{x}} efficiently and in closed-form using the model’s Jacobian matrix at 𝒙{\bm{x}}. Our findings are empirically validated across datasets (synthetic, controlled, and Imagenet) and across different Self Supervised Learning methods falling under the JEPA family (I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the method extracting the JEPA learned density as JEPA-SCORE.
low probability
MetaCLIP IJEPA-22k IJEPA-1k DINOv2
The training procedure of foundation models—Deep Networks (DNs) f𝜽f_{{\bm{\theta}}} able to solve many tasks in zero or few-shot—can take many forms and is at the center of Self Supervised Learning research balestriero2023cookbook . Over the years, one principle has emerged as central to all current state-of-the-art methods: encouraging f𝜽(X)f_{{\bm{\theta}}}(X) to be maximum Entropy given i.i.d pretraining samples XX with density pXp_{X} wang2020understanding ; hjelm2018learning . Because the differential Entropy is difficult to estimate in high-dimensional spaces, and f𝜽(X)f_{{\bm{\theta}}}(X) often contains thousands of dimensions, one solution is to maximize a lower bound by using a decoder and learning to reconstruct XX from f(X)f(X) vincent2008extracting . Because this approach comes with known limitations balestriero2024learning , more and more foundation models are trained with Joint-Embedding Predictive Architectures (JEPAs) lecun2022path that directly encourage f𝜽(X)f_{{\bm{\theta}}}(X) to be Gaussian. In fact, the Gaussian distribution is the one of maximum differential Entropy under covariance constraint, leading to f𝜽(X)f_{{\bm{\theta}}}(X) producing Gaussian Embeddings (GE).
JEPAs can take many forms by employing numerous implicit and explicit regularizers srinath2023implicit ; littwin2024jepa . Today’s JEPAs mostly take three forms: (i) moment-matching objectives (VICReg bardes2021vicreg , W-MSE ermolov2021whitening ), (ii) non-parametric estimators (SimCLR chen2020simple , MoCo he2020momentum , CLIP radford2021learning ), and (iii) implicit teacher-student methods (DINO caron2021emerging , I-JEPA assran2023self ). While JEPAs produce state-of-the-art representations, they are currently seen as disconnected from generative models whose goal is to estimate pXp_{X}. In fact, the absence of generative modeling is a praised benefit of JEPAs. But one may wonder…
Can the density of f(X)f(X) be specified without ff learning about pXp_{X}?
That question will be at the core of our study, and the answer is no: producing Gaussian Embeddings can only happen if f𝜽f_{{\bm{\theta}}} learns the underlying data density pXp_{X}. But JEPAs estimate pXp_{X} in a highly non standard way, free of input space reconstruction, and free of a parametric model for pXp_{X}. One question remains…
Is there any further benefit of not only specifying a density for f𝛉(X)f_{{\bm{\theta}}}(X) but using the eponymous Gaussian density?
At it turns out, this choice guarantees that the estimator for pXp_{X} implicitly learned during JEPA training can easily be extracted from the final trained model f𝜽f_{{\bm{\theta}}}–an estimator we call the JEPA-SCORE (eq.˜5). Our findings not only open new avenues in using JEPA-SCORE for outlier detection or data curation, but also shaken the Self Supervised Learning paradigm by showing how non parametric density estimation in high dimension is now amenable through JEPAs. Our theory and its corresponding controlled experiments are provided in sections˜2.1 and 2.2, and experiments with state-of-the-art models such as DINOv2, MetaCLIP and I-JEPA are provided in section˜2.3. JEPA-SCORE’s implementation only takes a few lines of code and is provided in LABEL:code.
We now derive our main result stating that, in order to minimize the JEPA objective, a DN must learn the data density. We start with some preliminary results in section 2.1, formalize our general finding in section 2.2, and culminate with the JEPA result (theorem 1) in section 2.3. An efficient implementation is also provided in section 2.3.
Our derivations rely on a simple observation widely known in high-dimensional statistics: $K$-dimensional standard Gaussians, appropriately normalized, converge to being Uniform on the hypersphere. Let's denote the $K$-dimensional standard Gaussian random variable by $Z$, and the normalized version by $X\triangleq\frac{Z}{\sqrt{K}}$ with density $f_{\mathcal{N}(0,\bm{I}/K)}$. Let's also denote by $f_{\mathcal{U}(\mathbb{S}(0,R,K))}$ the Uniform distribution on the surface of the $K$-dimensional hypersphere of radius $R>0$.
Lemma 1. As $K$ grows, $X$ quickly concentrates around the hypersphere of radius $1$, converging to a Uniform density over the hypersphere surface. (Proof in section A.1.)
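This concentration is easy to verify numerically. The following minimal sketch (ours, for illustration) samples $Z\sim\mathcal{N}(0,\bm{I}_K)$ and checks that the norm of $X=Z/\sqrt{K}$ concentrates around $1$ as $K$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# For Z ~ N(0, I_K), the norm of X = Z / sqrt(K) concentrates around 1
# as K grows (its standard deviation shrinks roughly like 1 / sqrt(2K)).
stds = []
for K in (10, 100, 1_000):
    Z = rng.standard_normal((10_000, K))
    norms = np.linalg.norm(Z / np.sqrt(K), axis=1)
    stds.append(norms.std())
    print(f"K={K:5d}  mean={norms.mean():.4f}  std={norms.std():.4f}")
```

The printed standard deviations shrink with $K$, matching the Dirac limit used in the proof of lemma 1.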
Lemma 1 provides an interesting geometric fact which we turn into a practical result for SSL in section 2.2, where we demonstrate how learning to produce Gaussian embeddings equates to learning the Energy function of the training data.
Consider two densities, one on the input domain ($p_{X}$) and one on the output domain ($p_{f(X)}$). For $p_{f(X)}$ to have a particular form, e.g., $\mathcal{N}(0,\bm{I}/K)$, $f$ must learn something about $p_{X}$. To see that, we leverage the change of variable formula expressing the embedding density $p_{f_{\bm{\theta}}}$ as a function of the data density and the DN's Jacobian matrix:
$$p_{f(X)}(f(\bm{x}))=\int_{\{\bm{u}\in\mathbb{R}^{D}\,|\,f(\bm{u})=f(\bm{x})\}}\frac{p_{X}(\bm{u})}{\prod_{k=1}^{\operatorname{rank}(J_{f}(\bm{u}))}\sigma_{k}(J_{f}(\bm{u}))}\,\mathrm{d}\mathcal{H}^{r}(\bm{u}),\tag{1}$$
where $\mathcal{H}^{r}$ denotes the $r$-dimensional Hausdorff measure, with $r\triangleq\dim(\{\bm{u}\in\mathbb{R}^{D}\,|\,f(\bm{u})=f(\bm{x})\})$ the dimension of the level set of $f$ at $\bm{x}$. We note that eq. 1 does not require $f$ to be bijective, which will be crucial for our JEPA result in section 2.3; for details see [krantz2008geometric, cvitkovic2019minimal]. Combining eq. 1 and lemma 1 leads us to the following result.
Lemma 2. In order for $f(X)$ to be distributed as $\mathcal{N}(0,\bm{I}/K)$ for large $K$, $f$ must learn the data density $p_{X}$ up to mean-preserving rescaling within each level set $\{\bm{u}\in\mathbb{R}^{D}\,|\,f(\bm{u})=f(\bm{x})\}$. (Proof in section A.2.)
Empirical validation. Before broadening lemma 2 to JEPAs, we first provide empirical validation in fig. 2 that learning to produce Gaussian embeddings implies learning the data density. We show that, in fact, the data density can be recovered with high accuracy, and that it is even possible to draw samples from the estimated density through Langevin dynamics.
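As a toy illustration of the Langevin sampling referenced above (our own minimal sketch, not the paper's exact experimental setup), the snippet below runs unadjusted Langevin dynamics using the analytic score of a two-mode 2-D Gaussian mixture as a stand-in for a learned score:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[-2.0, 0.0], [2.0, 0.0]])  # two unit-variance modes

def score(x):
    # grad_x log p(x) for p = 0.5 N(m1, I) + 0.5 N(m2, I)
    d = x[None, :] - means                    # (2, 2): x - m_i per mode
    w = np.exp(-0.5 * (d ** 2).sum(axis=1))   # unnormalized responsibilities
    w = w / w.sum()
    return -(w[:, None] * d).sum(axis=0)

eps = 1e-2                                    # step size
x = rng.standard_normal(2)
samples = []
for t in range(20_000):
    # unadjusted Langevin update: x <- x + eps * score + sqrt(2 eps) * noise
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(2)
    if t > 5_000:                             # discard burn-in
        samples.append(x.copy())
samples = np.asarray(samples)
```

The chain spends its time near the two modes at $x_1\approx\pm 2$; replacing the analytic score with JEPA-SCORE's gradient is the idea behind the sampling experiment of fig. 2.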
Most Joint-Embedding Predictive Architecture (JEPA) methods aim to achieve two goals: (i) predictive invariance, and (ii) representation diversity, as seen in the following loss function
$$\mathcal{L}\triangleq\sum_{n=1}^{N}\mathbb{E}_{(\bm{x}_{n}^{(1)},\bm{x}_{n}^{(2)})\sim\mathcal{G}(\bm{x}_{n})}\left[{\rm dist}\left({\rm Pred}\left({\rm Enc}\left(\bm{x}_{n}^{(1)}\right)\right),{\rm Enc}\left(\bm{x}_{n}^{(2)}\right)\right)\right]+{\rm diversity}\left(\left({\rm Enc}\left(\bm{x}_{n}\right)\right)_{n\in[N]}\right),\tag{2}$$
where $\bm{x}_{n}^{(1)},\bm{x}_{n}^{(2)}$ are two "views" generated from the original sample through the stochastic operator $\mathcal{G}$, and ${\rm dist}$ is a distance function (e.g., L2). For images, $\mathcal{G}$ typically involves two different data augmentations. At this point, lemma 2 only takes into account the non-collapse term of JEPAs. But an interesting observation from lemma 2 is that the integration occurs over the level set of the function $f_{\bm{\theta}}$, which coincides with the JEPA's invariance term when ${\rm Pred}$ is near identity.
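A schematic numpy rendition of this two-term objective may help fix ideas. This is an illustrative sketch of ours, not any specific method's implementation: Enc and Pred are placeholder linear maps, and the diversity term is a VICReg-style variance hinge, one of several valid anti-collapse choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch of the two-term JEPA objective: an invariance/prediction
# term plus an anti-collapse "diversity" term (here, a variance hinge).
D, K, N = 8, 4, 32
W_enc = rng.standard_normal((K, D)) / np.sqrt(D)   # toy linear encoder
W_pred = np.eye(K)                                  # toy predictor (identity here)

def jepa_loss(x1, x2):
    z1, z2 = x1 @ W_enc.T, x2 @ W_enc.T
    invariance = np.mean(((z1 @ W_pred.T) - z2) ** 2)   # dist(Pred(Enc(x1)), Enc(x2))
    std = np.sqrt(z2.var(axis=0) + 1e-4)
    diversity = np.mean(np.maximum(0.0, 1.0 - std))     # penalize collapsed dimensions
    return invariance + diversity

x = rng.standard_normal((N, D))                  # a batch of samples
views = x + 0.1 * rng.standard_normal((2, N, D)) # two noisy "views" per sample
loss = jepa_loss(views[0], views[1])
print(f"toy JEPA loss: {loss:.3f}")
```

Collapsing the encoder (e.g., `W_enc = 0`) would zero the invariance term but saturate the diversity hinge, which is exactly the trade-off eq. 2 formalizes.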
Data assumption. To make our result as precise as possible, we focus on the case where the views are generated from stochastic transformations $T$ with density $p_{T}$. We also denote the density of generators as $p_{\mu}$, from which the data density $p_{X}$ is defined as
$$p_{X}(\bm{x})\triangleq\int p_{\mu}(\mu)\int p_{T}(T)\,\delta\big(\bm{x}-T(\mu)\big)\,\mathrm{d}T\,\mathrm{d}\mu.\tag{3}$$
In other words, each training sample is seen as some transformation of some generator. We do not impose further restrictions, e.g., that there is only one generator per class. In practical settings, the generators ($p_{\mu}$) are the original training samples prior to applying any augmentation–hence estimating $p_{\mu}$ amounts to estimating the data density.
JEPA-SCORE. Combining eqs. 1, 2 and 3 leads to the following result, proved in section A.3.
Theorem 1. At optimality, JEPA embeddings estimate the data density as per
$$p_{\mu}(\mu)\propto\mathbb{E}_{p_{T}}\left[\frac{1}{\prod_{k=1}^{\operatorname{rank}(J_{f}(\bm{x}))}\sigma_{k}(J_{f}(\mu,T))}\right]^{-1}.\tag{4}$$
We define our JEPA-SCORE for input $\bm{x}$ as the Monte Carlo estimator of eq. 4; for a single-sample estimate we have (in log-scale)
$$\text{JEPA-SCORE}(\bm{x})\triangleq\sum_{k=1}^{\operatorname{rank}(J_{f}(\bm{x}))}\log\sigma_{k}\left(J_{f}(\bm{x})\right).\tag{5}$$
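In code, eq. 5 amounts to summing the log singular values of the encoder's Jacobian at $\bm{x}$. Below is a minimal self-contained sketch of ours that uses a finite-difference Jacobian on a toy smooth map; with an actual JEPA encoder one would instead obtain $J_f$ by automatic differentiation (the function name is our own):

```python
import numpy as np

def jepa_score(f, x, eps=1e-5, tol=1e-12):
    """Single-sample JEPA-SCORE (eq. 5): sum_k log sigma_k(J_f(x)).

    Uses a forward-difference Jacobian for self-containedness; with a real
    encoder one would obtain J_f via automatic differentiation instead.
    """
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x), dtype=float)
    J = np.empty((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (np.asarray(f(xp), dtype=float) - fx) / eps
    s = np.linalg.svd(J, compute_uv=False)
    s = s[s > tol]                  # keep singular values up to rank(J_f(x))
    return float(np.log(s).sum())

# Sanity check on a toy map: f(u) = a*u has Jacobian a*I in D dimensions,
# so its JEPA-SCORE should equal D * log(a).
a, D = 2.0, 3
score = jepa_score(lambda u: a * u, np.ones(D))
```

For large models, the SVD of the full Jacobian is the dominant cost; truncated decompositions or Hutchinson-style estimators are natural accelerations.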
Eq. 5 exactly recovers $p_{\mu}$ as long as the JEPA loss is minimized. We use a logarithmic scale to align with the definition of a score function. A visual illustration is provided at the top of fig. 2. We empirically validate eq. 5 by using pretrained JEPA models and visualizing a few Imagenet-1k classes after ordering the images by their JEPA-SCORE in figs. 1, 5 and 6. We find that for bird classes, high-probability samples depict flying birds while low-probability ones are seated. We also conduct an additional experiment where we compute JEPA-SCORE for 5,000 samples from different datasets (Imagenet, MNIST and Galaxy) and depict the distribution of their JEPA-SCORE in fig. 3. We clearly see that datasets that weren't seen during pretraining, e.g., Galaxy, have much lower JEPA-SCORE than Imagenet samples.

Conclusion. We provided a novel connection between JEPAs and score-based methods, two families of methods thought to be unrelated until now. Controlled experiments on synthetic data confirm the validity of our theory, and experiments with state-of-the-art large-scale JEPAs qualitatively validate our findings. Although this is only a first step, we hope that JEPA-SCORE will open new avenues for outlier detection and model assessment for downstream tasks.
Proof 1: The proof consists in expressing both densities in spherical coordinates and studying their convergence as $K$ increases. Let's first express the Uniform distribution in spherical coordinates:
$$f_{\mathcal{U}(\mathbb{S}(0,1,K))}(r,\bm{\theta})=\delta(r-1)\,\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\bm{\theta}_{i})^{K-i-1},$$
and let's now express the rescaled standard Gaussian $\frac{Z}{\sqrt{K}}$ in spherical coordinates:
$$f_{\mathcal{N}(0,\bm{I}/K)}(r,\bm{\theta})=\underbrace{\frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)}e^{-\frac{Kr^{2}}{2}}r^{K-1}}_{\text{scaled Chi-distribution}}\,\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\bm{\theta}_{i})^{K-i-1}.$$
As $K$ increases, the scaled Chi-distribution converges to a Dirac function at $1$, leading to our desired result. ∎
Proof 2: The above proof provides granular details on the convergence to the Uniform distribution on the hypersphere by studying the scaled Chi-distribution. For completeness, we also provide a more direct argument, sufficient for the limiting case. First, since $Z$ is isotropic Gaussian, the squared norm $\|Z\|_{2}^{2}/K$ follows a scaled Chi-squared distribution with mean $1$ and variance $2/K$; that is, as $K$ increases, the norm distribution converges to a Dirac at $1$. Second, because $Z$ is isotropic, $\frac{Z}{\|Z\|_{2}}$ is uniformly distributed on the hypersphere. As $K$ increases, the samples $\frac{Z}{\sqrt{K}}$ are already nearly normalized, hence our result. ∎
First and foremost, recall that the density of the random variable $f(X)$ is given by eq. 1. Relying on lemma 1, which states that for large $K$ our assumption on the output density reads $f(\bm{x})\sim\mathcal{U}(\mathbb{S}(0,1,K))$, we obtain that
$$\int_{\{\bm{u}\in\mathbb{R}^{D}\,|\,f(\bm{u})=f(\bm{x})\}}\frac{p_{X}(\bm{u})}{\prod_{k=1}^{\operatorname{rank}(J_{f}(\bm{u}))}\sigma_{k}(J_{f}(\bm{u}))}\,\mathrm{d}\mathcal{H}^{r}(\bm{u})={\rm cst}.$$
Now if $f$ is bijective between ${\rm supp}(p_{X})$ and $\mathbb{R}^{K}$, it is direct to see that $p_{X}(\bm{x})\propto\prod_{k=1}^{\operatorname{rank}(J_{f}(\bm{x}))}\sigma_{k}(J_{f}(\bm{x}))$. If $f$ is merely surjective, there is no longer a one-to-one correspondence between $f$ and $p_{X}$; instead, there is ambiguity over each level set of $f$. To see that, recall that we only need to maintain a constant value of the integral over the level set. Hence, $f$ is free to scale up one subset of that level set and scale down another, proportionally to $p_{X}$, so as to preserve the constant integral. ∎
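The bijective case of the proof can be sanity-checked numerically: for a strictly increasing 1-D map, the level sets are singletons and eq. 1 reduces to the classic change-of-variables formula, which we compare against an empirical histogram (a toy check of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bijective 1-D check of eq. 1: for y = f(x) with f strictly increasing,
# p_{f(X)}(f(x)) = p_X(x) / |f'(x)|   (the lone "singular value" of J_f).
f = lambda x: x ** 3 + x            # strictly increasing, hence bijective
fprime = lambda x: 3 * x ** 2 + 1

x = rng.standard_normal(200_000)    # X ~ N(0, 1)
y = f(x)

x0 = 0.5                            # evaluation point
p_x0 = np.exp(-x0 ** 2 / 2) / np.sqrt(2 * np.pi)
predicted = p_x0 / fprime(x0)       # density of f(X) at f(x0) per eq. 1

h = 0.05                            # histogram half-width around f(x0)
empirical = np.mean(np.abs(y - f(x0)) < h) / (2 * h)
# predicted and empirical agree up to Monte Carlo / binning error.
```
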
The role of the predictor in JEPA training is to allow for additional computation to predict one view's embedding from the other view's embedding. While this provides numerous empirical benefits, e.g., in terms of optimization landscape, it does not impact the level sets of the encoder–which is what enters eq. 1.
To understand the above argument, consider the case where the views are obtained by applying a transformation such as masking. We denote by $\mathcal{M}$ the masking random variable and by ${\rm mask}(\bm{x})$ the application of one realization of $\mathcal{M}$ onto the input $\bm{x}$. We thus have for the invariance term of sample $\bm{x}_{n}$
$${\rm dist}\left({\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}(\bm{x}_{n})\right)\right),{\rm Enc}\left({\rm mask}^{(2)}(\bm{x}_{n})\right)\right).$$
Because the predictor is only applied on one of the two embeddings, it is clear that for the JEPA loss to be minimized, it must also be true that
$${\rm Enc}\left({\rm mask}^{(2)}(\bm{x}_{n})\right)={\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}(\bm{x}_{n})\right)\right),\quad\forall\,{\rm mask}^{(2)}\in{\rm supp}(\mathcal{M}),$$
for any realization of ${\rm mask}^{(1)}$. In other words, the encoder is invariant over the support of $\mathcal{M}$ whether the predictor is the identity or nonlinear. Therefore our result directly follows from the above combined with eqs. 1 and 3. ∎
Figure 1: Depiction of the 5 least (left) and 5 most (right) likely samples of class 21 from Imagenet as per JEPA-SCORE–JEPAs' implicit density estimator learned during pretraining. Two striking observations: (i) across all JEPAs (rows), the types of samples with low and high probabilities are alike, and (ii) the same samples (amongst 1,000) are found at those extrema. Random samples from that class are provided in fig. 4.
Figure 2: Top left: visual illustration of JEPA-SCORE–the DN $f_{\bm{\theta}}$ must learn $p_{X}$ for its Jacobian matrix to expand or contract the density in order to produce a Uniform density on the hypersphere surface in its embedding space (lemmas 1 and 2). Top right: Pearson correlation between JEPA-SCORE and the true $\log p(x)$ on a GMM data model for various input dimensions (rows) and numbers of samples (columns). In all cases, producing Gaussian embeddings makes the backbone $f_{\bm{\theta}}$ internalize the data density, which can easily be extracted using our proposed JEPA-SCORE. Bottom: as JEPA-SCORE is an approximation of the true score function, it is possible to perform Langevin sampling to recover the true data distribution, as shown here in two dimensions.
Figure 3: Depiction of JEPA-SCORE for 5,000 samples from different datasets (Imagenet-1k/a/r, MNIST and Galaxy). We clearly observe that as the pretraining dataset size increases (all models against IJEPA-1k), MNIST and Galaxy images are seen as lower-probability samples, i.e., those images are less and less represented within the overall pretraining dataset. While our score does not rely on singular vectors, we provide some examples in fig. 7. This can be used to assess whether a model is ready to handle particular data domains at test time for zero-shot tasks.
Figure 4: Random samples from the Imagenet-1k training dataset for class 21.
Figure 7: Depiction of the top singular vectors of the Jacobian matrix (columns) for a given input image (left column). The corresponding singular value is provided in the title of each subplot.
Figure 5: Top: least and most likely samples based on the proposed JEPA-SCORE for a few different pretrained backbones on class 141. Bottom: random samples from the Imagenet-1k training dataset for class 141.
The training procedure of foundation models—Deep Networks (DNs) f𝜽f_{{\bm{\theta}}} able to solve many tasks in zero or few-shot—can take many forms and is at the center of Self Supervised Learning research balestriero2023cookbook . Over the years, one principle has emerged as central to all current state-of-the-art methods: encouraging f𝜽(X)f_{{\bm{\theta}}}(X) to be maximum Entropy given i.i.d pretraining samples XX with density pXp_{X} wang2020understanding ; hjelm2018learning . Because the differential Entropy is difficult to estimate in high-dimensional spaces, and f𝜽(X)f_{{\bm{\theta}}}(X) often contains thousands of dimensions, one solution is to maximize a lower bound by using a decoder and learning to reconstruct XX from f(X)f(X) vincent2008extracting . Because this approach comes with known limitations balestriero2024learning , more and more foundation models are trained with Joint-Embedding Predictive Architectures (JEPAs) lecun2022path that directly encourage f𝜽(X)f_{{\bm{\theta}}}(X) to be Gaussian. In fact, the Gaussian distribution is the one of maximum differential Entropy under covariance constraint, leading to f𝜽(X)f_{{\bm{\theta}}}(X) producing Gaussian Embeddings (GE).
JEPAs can take many forms by employing numerous implicit and explicit regularizers srinath2023implicit ; littwin2024jepa . Today’s JEPAs mostly take three forms: (i) moment-matching objectives (VICReg bardes2021vicreg , W-MSE ermolov2021whitening ), (ii) non-parametric estimators (SimCLR chen2020simple , MoCo he2020momentum , CLIP radford2021learning ), and (iii) implicit teacher-student methods (DINO caron2021emerging , I-JEPA assran2023self ). While JEPAs produce state-of-the-art representations, they are currently seen as disconnected from generative models whose goal is to estimate pXp_{X}. In fact, the absence of generative modeling is a praised benefit of JEPAs. But one may wonder…
Can the density of f(X)f(X) be specified without ff learning about pXp_{X}?
That question will be at the core of our study, and the answer is no: producing Gaussian Embeddings can only happen if f𝜽f_{{\bm{\theta}}} learns the underlying data density pXp_{X}. But JEPAs estimate pXp_{X} in a highly non standard way, free of input space reconstruction, and free of a parametric model for pXp_{X}. One question remains…
Is there any further benefit of not only specifying a density for f𝛉(X)f_{{\bm{\theta}}}(X) but using the eponymous Gaussian density?
At it turns out, this choice guarantees that the estimator for pXp_{X} implicitly learned during JEPA training can easily be extracted from the final trained model f𝜽f_{{\bm{\theta}}}–an estimator we call the JEPA-SCORE (eq.˜5). Our findings not only open new avenues in using JEPA-SCORE for outlier detection or data curation, but also shaken the Self Supervised Learning paradigm by showing how non parametric density estimation in high dimension is now amenable through JEPAs. Our theory and its corresponding controlled experiments are provided in sections˜2.1 and 2.2, and experiments with state-of-the-art models such as DINOv2, MetaCLIP and I-JEPA are provided in section˜2.3. JEPA-SCORE’s implementation only takes a few lines of code and is provided in LABEL:code.
We now derive our main result stating that in order to minimize the JEPA objective, a DN must learn the data density. We start with some preliminary results in section˜2.1, we formalize our general finding in section˜2.2, culminating in the JEPA result of sections˜2.3 and 1. An efficient implementation is also provided in section˜2.3.
Our derivations will rely on a simple observations widely known in high-dimensional statistics: KK-dimensional standard Gaussians, appropriately normalized, converge to being Uniform on the hypersphere. Let’s denote the KK-dimensional standard Gaussian random variable by ZZ, and the normalized version by X≜ZKX\triangleq\frac{Z}{\sqrt{K}} with density f𝒩(0,𝑰/K)f_{\mathcal{N}(0,{\bm{I}}/K)}. Let’s also denote the Uniform distribution on the KK-dimensional hypersphere surface by f𝒰(𝕊(0,R,K))f_{\mathcal{U}(\mathbb{S}(0,R,K))} with radius R>0R>0.
As KK grows, XX quickly concentrates around the hypersphere of radius 11, converging to a Uniform density over the hypersphere surface. (Proof in section˜A.1.)
Lemma˜1 provides an interesting geometric fact which we turn into a practical result for SSL in the following section˜2.2, where we demonstrate how learning to produce Gaussian embeddings equates with learning the Energy function of the training data.
Consider two densities, one on the input domain (pXp_{X}) and one on the output domain (pf(X)p_{f(X)}). For pf(X)p_{f(X)} to have a particular form, e.g., 𝒩(0,𝑰/K)\mathcal{N}(0,{\bm{I}}/K), ff must learn something about pXp_{X}. To see that, we will have leverage the eponymous change of variable formula expressing the embedding density pf𝜽p_{f_{{\bm{\theta}}}} as a function of the data density and the DN’s Jacobian matrix:
where ℋr\mathcal{H}^{r} denotes rr-dimensional Hausdorff measure, with r≜dim({𝒖∈ℝD|f(𝒖)=f(𝒙)})r\triangleq\dim({{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})}) being the dimension of the level set of ff at 𝒙{\bm{x}}. We note that eq.˜1 does not require ff to be bijective, which will be crucial for our JEPA result in section˜2.3; for details see krantz2008geometric ; cvitkovic2019minimal . Combining eq.˜1 and lemma˜1 leads us to the following result.
In order for f(X)f(X) to be distributed as 𝒩(0,𝐈/K)\mathcal{N}(0,{\bm{I}}/K) for large KK, ff must learn the data density pXp_{X} up to mean-preserving rescaling within each level-set {𝐮∈ℝD|f(𝐮)=f(𝐱)}{{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})}. (Proof in section˜A.2.)
Empirical validation. Before broadening lemma˜2 to JEPAs, we first provide empirical validations that learning to produce Gaussian embeddings implies learning the data density in fig.˜2. We show that, in fact, the data density can be recovered with high accuracy, and it is even possible to draw samples from the estimated density through Langevin dynamics.
Most Joint-Embedding Predictive Architecture (JEPA) methods aim to achieve two goals: (i) predictive invariance, and (ii) representation diversity, as seen in the following loss function
where 𝒙n(1),𝒙n(2){\bm{x}}{n}^{(1)},{\bm{x}}{n}^{(2)} are two generated “views” from the original sample through the stochastic operator 𝒢\mathcal{G}, and dist\rm dist is a distance function (e.g., L2). For images, 𝒢\mathcal{G} typically involves two different data-augmentations. At this point, lemma˜2 only takes into account the non-collapse term of JEPAs. But an interesting observation from lemma˜2 is that the integration occurs over the level set of the function f𝜽f_{{\bm{\theta}}} which coincides with the JEPA’s invariance term when Pred is near identity.
Data assumption. To make our result as precise as possible, we focus on the case where the views are generated from stochastic transformations, with density pTp_{T}. We also denote the density of generators as pμp_{\mu}, from which the data density pXp_{X} is defined as
In other words, each training sample is seen as some transformation of some generator. We do not impose further restrictions, e.g., that there is only one generator per class. In practical setting, the generators (pμp_{\mu}) are the original training samples prior to applying any augmentation–hence estimating pμp_{\mu} will amount to estimating the data density.
JEPA-SCORE. Combining eqs.˜3, 2 and 2 leads to the following result proved in section˜A.3.
At optimality, JEPA embeddings estimate the data density as per
We define our JEPA-SCORE for input 𝒙{\bm{x}} as the Monte Carlo estimator of eq.˜4, for a single-sample estimate we have (in log-scale)
which exactly recovers pμp_{\mu} as long as the JEPA loss is minimized. We use a logarithmic scale to align with the definition of a score function. A visual illustration is provided at the top of fig.˜2. We empirically validate eq.˜5 by using pretrained JEPA models and visualizing a few Imagenet-1k classes after ordering the images based on their JEPA-SCORE in figs.˜1, 5 and 6. We obtain that for bird classes, high probability samples depict flying birds while low probability ones are seated. We also conduct an additional experiment where we compute JEPA-SCORE for 5,0005,000 samples of different dataset (Imagenet, MNIST and Galaxy) and depict the distribution of their JEPA-SCORE in fig.˜3. We clearly see that datasets that weren’t seen during pretraining, e.g., Galaxy, have much lower JEPA-SCORE than Imagenet samples. Conclusion. We provided a novel connection between JEPAs and score-based methods, two families of methods thought to be unrelated until now. Controlled experiments on synthetic data confirm the validity of our theory and qualitative experiments with state-of-the-art large scale JEPAs also qualitatively validate our findings. Although this is only a first step, we hope that JEPA-SCORE will open new avenues for outlier detection and model assessment for downstream tasks.
Proof 1: The proof consists in expressing both densities in spherical coordinates, and studying their convergence as KK increases. Let’s first express the Uniform distribution in spherical coordinates:
and let’s now express the rescaled standard Gaussian density ZK\frac{Z}{\sqrt{K}} in spherical coordinates:
As KK increases, as the scaled Chi-distribution converges to a Dirac function at 11, leading to our desired result.
Proof 2: The above proof provides granular details into the convergence to the Uniform distribution on the hypersphere by studying the scaled Chi-distribution. For completeness, we also provide a more straightforward argument, sufficient to study the limiting case. First, it is known that ZK\frac{Z}{\sqrt{K}} being isotropic Gaussian, the distribution of norms, ‖Z‖22/K|Z|_{2}^{2}/K, is a Chi-squared distribution with mean 11 and variance 2/K2/K. That is, as KK increases as the norms distribution converges to a Dirac at 11. Lastly, because ZK\frac{Z}{\sqrt{K}} is isotropic, it will be uniformly distribution on the hypersphere after normalization. But as KK increases, as the samples are already normalized, hence leading to our result. ∎
First and foremost, recall that the density of the random variable f(X)f(X) is given by eq.˜1. Relying on lemma˜1 which stated that for large KK, our assumption on the output density reads f(𝒙)∼𝒰(0,1)f({\bm{x}})\sim\mathcal{U}(0,1), we obtain that ∫{𝒖∈ℝD|f(𝒖)=f(𝒙)}pX(𝒖)∏k=1rank(Jf(𝒖))σk(Jf(𝒖))dℋr(𝒖)=cst\int_{{{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})}}\frac{p_{X}({\bm{u}})}{\prod_{k=1}^{rank(J_{f}({\bm{u}}))}\sigma_{k}(J_{f}({\bm{u}}))}\mathrm{d}\mathcal{H}^{r}({\bm{u}})={\rm cst}. Now if ff is bijective between supp(pX){\rm supp}(p_{X}) and ℝK\mathbb{R}^{K}, then it is direct to see that pX(x)∝∏k=1rank(Jf(𝒙))σk(Jf(𝒙))p_{X}(x)\propto\prod_{k=1}^{rank(J_{f}({\bm{x}}))}\sigma_{k}(J_{f}({\bm{x}})). Now if ff is surjective there is no longer a one-to-one mapping between ff and pXp_{X}. Instead, there is ambiguity over each level set of ff. To see that, recall that we only need to maintain a constant value over the integration on the level set. Hence, ff is free to scale up one subset of that level set, and scale down another subset, proportionally to pXp_{X} to preserve the integration to a constant. ∎
The role of the predictor in JEPA training is to allow for additional computation to predict one view’s embedding from the other view’s embedding. While this provides numerous empirical benefits, e.g., in terms of optimization landscape, it actually does not impact the level-set of the encoder–which is what is needed in eq.˜1.
To understand the above argument, consider the case where the views are obtained from applying a transformation such as masking. We denote by ℳ\mathcal{M} the masking random and by mask(𝒙){\rm mask}({\bm{x}}) the application of one realization of ℳ\mathcal{M} onto the input 𝒙{\bm{x}}. We thus have for the invariance term of sample 𝒙n{\bm{x}}_{n}
Because the predictor is only applied on one of the two embeddings, it is clear that for the JEPA loss to be minimized, it must also be true that
for any realization of mask(1){\rm mask}^{(1)}. In other word, the encoder’s invariance is over the support of ℳ\mathcal{M} no matter if the predictor is identity or nonlinear. Therefore our result directly follows from the above combined with eqs.˜1 and 3. ∎
Depiction of the 5 least (left) and 5 most (right) likely samples of class 21 from Imagenet as per JEPA-SCORE–JEPAs’ implicit density estimator learned during pretraining. Two striking observations: (i) across all JEPAs (rows) the type of samples with low and high probabilities are alike, and (ii) the same samples (amongst 1,000) are found at those extrema. Random samples from that class are provided in fig.˜4
Top left: Visual illustration of JEPA-SCORE–the DN f𝜽f_{{\bm{\theta}}} must learn pXp_{X} for its Jacobian matrix to expand or contract the density in order to produce a Uniform density on the hypersphere surface in its embedding space (lemmas˜1 and 2). Top right: Pearson correlation between JEPA-SCORE and the true logp(x)logp(x) on a GMM data model for various input dimensions (rows) and number of samples (columns). In all cases, producing Gaussian embeddings make the backbone f𝜽f_{{\bm{\theta}}} internalize the data density which can be easily extracted using our proposed JEPA-SCORE. Bottom: as JEPA-SCORE is an approximation of the true score function, it is possible to perform Langevin sample to recover the true data distribution as shown here in two dimensions.
Depiction of JEPA-SCORE for 5,0005,000 samples from different datasets (imagenet-1k/a/r, MNIST and Galaxy). We clearly observe that as the pretraining dataset size increases (all models against IJEPA-1k) as MNIST and Galaxy images are seen as lower probability samples, i.e., those images are less and less represented within the overall pretraining dataset. While our score does not rely on singular vectors, we provide some examples in fig.˜7. This can be used to assess if a model is ready or not to handle particular data domains at test time for zero-shot tasks.
Random samples from Imagenet-1k training dataset for class 21.
Depiction of the top singular vectors of the Jacobian matrix (columns) for a given input image (left column). The corresponding singular value is provided in the title of each subplot.
$$ p_{f(X)}(f({\bm{x}}))=\int_{{{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})}}\frac{p_{X}(x)}{\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{x}}))}\sigma_{k}(J_{f}({\bm{x}}))}\mathrm{d}\mathcal{H}^{r}({\bm{x}}), $$ \tag{S2.E1}
$$ \displaystyle\mathcal{L}\triangleq $$
$$ \displaystyle+{\rm diversity}\left(\left({\rm Enc}\left({\bm{x}}{n}\right)\right){n\in[N]}\right), $$
$$ \displaystyle=\frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)}e^{-\frac{Kr^{2}}{2}}\frac{\Gamma(K/2)}{2\pi^{K/2}}r^{K-1}\hskip-4.26773pt\prod_{i=1}^{K-1}\sin({\bm{\theta}}_{i})^{K-i-1} $$
Lemma. Lemma 1. As KK grows, XX quickly concentrates around the hypersphere of radius 11, converging to a Uniform density over the hypersphere surface. (Proof in section˜A.1.)
Lemma. Lemma 2. In order for f(X)f(X) to be distributed as 𝒩(0,𝐈/K)\mathcal{N}(0,{\bm{I}}/K) for large KK, ff must learn the data density pXp_{X} up to mean-preserving rescaling within each level-set {𝐮∈ℝD|f(𝐮)=f(𝐱)}{{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})}. (Proof in section˜A.2.)
Theorem. Theorem 1. At optimality, JEPA embeddings estimate the data density as per pμ(μ)∝𝔼pT[1∏k=1rank(Jf(𝒙))σk(Jf(μ,T))]−1.p_{\mu}(\mu)\propto\mathbb{E}{p{T}}\left[\frac{1}{\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{x}}))}\sigma_{k}(J_{f}(\mu,T))}\right]^{-1}. (4)
Tables
Default Notation
Final instructions
The training procedure of foundation models-Deep Networks (DNs) f θ able to solve many tasks in zero or few-shot-can take many forms and is at the center of Self Supervised Learning research [2]. Over the years, one principle has emerged as central to all current state-of-the-art methods: encouraging f θ ( X ) to be maximum Entropy given i.i.d pretraining samples X with density p X [17, 10]. Because the differential Entropy is difficult to estimate in high-dimensional spaces, and f θ ( X ) often contains thousands of dimensions, one solution is to maximize a lower bound by using a decoder and learning to reconstruct X from f ( X ) [16]. Because this approach comes with known limitations [3], more and more foundation models are trained with Joint-Embedding Predictive Architectures (JEPAs) [12] that directly encourage f θ ( X ) to be Gaussian. In fact, the Gaussian distribution is the one of maximum differential Entropy under covariance constraint, leading to f θ ( X ) producing Gaussian Embeddings (GE).
JEPAs can take many forms by employing numerous implicit and explicit regularizers [15, 13]. Today's JEPAs mostly take three forms: (i) moment-matching objectives (VICReg [4], W-MSE [8]), (ii) non-parametric estimators (SimCLR [6], MoCo [9], CLIP [14]), and (iii) implicit teacher-student methods (DINO [5], I-JEPA [1]). While JEPAs produce state-of-the-art representations, they are currently seen as disconnected from generative models whose goal is to estimate p X . In fact, the absence of generative modeling is a praised benefit of JEPAs. But one may wonder. . .
Can the density of f ( X ) be specified without f learning about p X ?
That question will be at the core of our study, and the answer is no: producing Gaussian Embeddings can only happen if f θ learns the underlying data density p X . But JEPAs estimate p X in a highly non standard way, free of input space reconstruction, and free of a parametric model for p X . One question remains. . .
Is there any further benefit of not only specifying a density for $f_{\theta}(X)$, but of using the eponymous Gaussian density?
As it turns out, this choice guarantees that the estimator for $p_X$ implicitly learned during JEPA training can easily be extracted from the final trained model $f_{\theta}$, an estimator we call the JEPA-SCORE (eq. (5)). Our findings not only open new avenues for using JEPA-SCORE in outlier detection or data curation, but also shake the Self Supervised Learning paradigm by showing that nonparametric density estimation in high dimension is now tractable through JEPAs. Our theory and its corresponding controlled experiments are provided in sections 2.1 and 2.2, and experiments with state-of-the-art models such as DINOv2, MetaCLIP, and I-JEPA are provided in section 2.3. JEPA-SCORE's implementation only takes a few lines of code and is provided in listing 1.
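The extraction can be sketched in a few lines. The following is a minimal illustration (it is not the paper's listing 1), assuming `encoder` maps a flat input vector in $\mathbb{R}^{D}$ to an embedding in $\mathbb{R}^{K}$; the score is the sum of the log singular values of the encoder's Jacobian at the sample:

```python
import torch

def jepa_score(encoder, x, eps=1e-12):
    """Unnormalized log-density of x under a trained JEPA encoder:
    sum over k of log sigma_k(J_f(x))."""
    J = torch.autograd.functional.jacobian(encoder, x)  # shape (K, D)
    s = torch.linalg.svdvals(J)
    s = s[s > eps]  # keep only the rank(J_f(x)) nonzero singular values
    return s.log().sum()

# toy usage with a random two-layer encoder on a 16-d input
torch.manual_seed(0)
enc = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 8))
score = jepa_score(enc, torch.randn(16))
```

The returned value is only defined up to an additive constant, which suffices for ranking samples, e.g., for curation or outlier detection.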
$$ p_{f(X)}(f(\vx))=\int_{\{\vu \in \mathbb{R}^{D} \,|\, f(\vu)=f(\vx)\}} \frac{p_{X}(\vu)}{\prod_{k=1}^{\rank(J_{f}(\vu))}\sigma_k(J_{f}(\vu))} \mathrm{d} \mathcal{H}^{r}(\vu),\label{eq:general} $$ \tag{eq:general}
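When $f$ is bijective, eq:general collapses to the classical change-of-variables formula $p_{f(X)}(f(\vx)) = p_X(\vx)/\prod_k \sigma_k(J_f(\vx))$. A quick numerical sanity check with a linear map (an illustrative choice, not from the paper), where the product of singular values equals $|\det A|$:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
A = rng.normal(size=(D, D))  # a bijective linear "encoder" f(x) = A x
x = rng.normal(size=D)

def gauss_pdf(v, cov):
    d = len(v)
    return np.exp(-0.5 * v @ np.linalg.solve(cov, v)) / np.sqrt(
        (2 * np.pi) ** d * np.linalg.det(cov))

# left-hand side: density of f(X) at f(x); X ~ N(0, I) so f(X) ~ N(0, A A^T)
lhs = gauss_pdf(A @ x, A @ A.T)
# right-hand side: p_X(x) divided by the product of the Jacobian's singular values
sigma = np.linalg.svd(A, compute_uv=False)
rhs = gauss_pdf(x, np.eye(D)) / np.prod(sigma)
```

The two quantities agree to machine precision, as expected since for a square matrix $\prod_k \sigma_k(A) = |\det A|$.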
$$ \mathrm{HV}_{n, \gamma}=\frac{1}{n}\left(\frac{\pi}{\gamma}\right)^{d / 2} \sum_{j, k=1}^{n} \exp \left(-\frac{\left\|Y_{n, j, k}^{+}\right\|^{2}}{4 \gamma}\right) $$
$$ \mathcal{L}\triangleq\,& \sum_{n=1}^{N} \mathbb{E}_{(\vx_n^{(1)},\vx_n^{(2)})\sim \mathcal{G}(\vx_n)}\left[{\rm dist}\left({\rm Pred}\left({\rm Enc}\left(\vx_n^{(1)}\right)\right),{\rm Enc}\left(\vx_n^{(2)}\right)\right)\right]&&(\text{predictive invariance})\\ &+{\rm diversity}\left(\left({\rm Enc}\left(\vx_{n}\right)\right)_{n \in [N]}\right),&&(\text{anti-collapse}),\label{eq:JEPA} $$ \tag{eq:JEPA}
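A minimal instantiation of eq:JEPA, using mean-squared error as the `dist` term and a VICReg-style variance hinge as the `diversity` term (one valid choice among the families listed above, not the only one):

```python
import torch
import torch.nn.functional as F

def jepa_loss(Enc, Pred, x1, x2, weight=1.0):
    z1, z2 = Enc(x1), Enc(x2)              # embeddings of the two views
    invariance = F.mse_loss(Pred(z1), z2)  # predictive-invariance term
    # anti-collapse: hinge pushing each embedding dimension's std above 1
    std = z1.var(dim=0).add(1e-4).sqrt()
    diversity = F.relu(1.0 - std).mean()
    return invariance + weight * diversity

# toy usage: batch of 64 paired views in R^32, embeddings in R^16
torch.manual_seed(0)
Enc = torch.nn.Linear(32, 16)
Pred = torch.nn.Linear(16, 16)
x = torch.randn(64, 32)
loss = jepa_loss(Enc, Pred, x + 0.1 * torch.randn_like(x), x)
```

Both terms are nonnegative, so the loss is minimized exactly when the two objectives (predictability and non-collapse) are jointly satisfied.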
$$ p_{X} \triangleq p_{\mu}\otimes p_{T}.\label{eq:data} $$ \tag{eq:data}
$$ f_{\mathcal{U}(\mathbb{S}(0,R,K))}(\vx) &= \delta(\|\vx\|_2-R)\frac{\Gamma(K/2)}{2\pi^{K/2}R^{K-1}},&&\text{(Cartesian coordinates)}\\ f_{\mathcal{U}(\mathbb{S}(0,R,K))}(r,\vtheta) &= \delta(r-R)\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1},&&\text{(spherical coordinates)}, $$
$$ f_{\mathcal{N}(0,\mI/K)}(\vx) &=\left(\frac{K}{2\pi}\right)^{K/2}e^{-\frac{K}{2}\|\vx\|_2^2},&&\text{(Cartesian coordinates)}\\ f_{\mathcal{N}(0,\mI/K)}(r,\vtheta) &= \left(\frac{K}{2\pi}\right)^{K/2}e^{-\frac{K}{2}r^2}r^{K-1}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1}\\ &= \frac{K^{\frac{K}{2}}}{2^{K / 2-1} \Gamma(K / 2)} e^{-\frac{Kr^{2}}{2}}\frac{\Gamma(K/2)}{2\pi^{K/2}}r^{K-1}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1}\\ &= \underbrace{\frac{K^{\frac{K}{2}}}{2^{K / 2-1} \Gamma(K / 2)} r^{K-1} e^{-\frac{Kr^{2}}{2}}}_{\text{scaled Chi-distribution}\;\overset{K\rightarrow \infty}{\rightarrow}\; \delta(r-1)}f_{\mathcal{U}(\mathbb{S}(0,R,K))}(r ,\vtheta),&&\text{(spherical coordinates)}. $$
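The concentration of $\mathcal{N}(0,\mI/K)$ onto the unit sphere can also be checked empirically; a small simulation (illustrative, not part of the paper's experiments) shows the spread of the radius shrinking as $K$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
stds = []
for K in (10, 100, 10000):
    Z = rng.normal(scale=1 / np.sqrt(K), size=(5000, K))  # Z ~ N(0, I/K)
    norms = np.linalg.norm(Z, axis=1)
    stds.append(norms.std())
# the radius concentrates at 1; its std shrinks roughly as 1/sqrt(2K)
```

For $K=10^4$ the sampled radii are within about one percent of $1$, matching the Dirac limit $\delta(r-1)$ above.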
Theorem. At optimality, JEPA embeddings estimate the data density as per $$ p_{\mu}(\vmu) \propto \mathbb{E}_{p_T}\left[\frac{1}{\prod_{k=1}^{\rank(J_{f}(\vmu,T))}\sigma_k(J_{f}(\vmu,T))}\right]^{-1}, $$ hereby allowing $p_{\mu}$ to be recovered from an optimally trained JEPA model $f_{\vtheta}$.
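The expectation over $T$ in the theorem can be approximated by Monte-Carlo over sampled transformations. A hedged sketch under the assumption that the Jacobian is evaluated at the transformed input (`encoder`, `augment`, and all names here are illustrative), using log-sum-exp for numerical stability:

```python
import math
import torch

def jepa_score_mc(encoder, augment, x, n_draws=8, eps=1e-12):
    """Estimates log p(x) up to an additive constant via
    p(x) ∝ E_T[1 / prod_k sigma_k(J_f(T(x)))]^{-1}."""
    neg_log_prods = []
    for _ in range(n_draws):
        J = torch.autograd.functional.jacobian(encoder, augment(x))
        s = torch.linalg.svdvals(J)
        s = s[s > eps]                        # sum over the rank(J) singular values
        neg_log_prods.append(-s.log().sum())  # -log prod_k sigma_k for this draw of T
    m = torch.stack(neg_log_prods)
    # -log E_T[1/prod sigma_k], computed stably with logsumexp
    return -(torch.logsumexp(m, dim=0) - math.log(n_draws))

# toy usage: Gaussian-noise "augmentation" and a small random encoder
torch.manual_seed(0)
enc = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 8))
logp = jepa_score_mc(enc, lambda v: v + 0.05 * torch.randn_like(v), torch.randn(16))
```

With a deterministic `augment` (identity), this reduces to the single-Jacobian score described in the introduction.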
Lemma. As $K$ grows, $X$ quickly concentrates around the hypersphere of radius $1$, converging to a Uniform density over the hypersphere surface. (Proof in proof:uniform.)
Lemma. In order for $f(X)$ to be distributed as $\mathcal{N}(0,\mI/K)$ for large $K$, $f$ must learn the data density $p_X$ up to mean-preserving rescaling within each level-set $\{\vu \in \mathbb{R}^{D} \,|\, f(\vu)=f(\vx)\}$. (Proof in proof:general_density.)
Definition. [Optimality Conditions] An optimal JEPA should (i) be maximally informative about the input, i.e., $h(\vx^{(t)})$ is maximal, and (ii) be maximally predictive, i.e., no other model within the given class can achieve a better rate of reduction in the conditional entropy.
Proof. {\bf Proof 1:}~The proof consists in expressing both densities in spherical coordinates and studying their convergence as $K$ increases. Let us first express the Uniform distribution in spherical coordinates: $$ f_{\mathcal{U}(\mathbb{S}(0,R,K))}(\vx) &= \delta(\|\vx\|_2-R)\frac{\Gamma(K/2)}{2\pi^{K/2}R^{K-1}},&&\text{(Cartesian coordinates)}\\ f_{\mathcal{U}(\mathbb{S}(0,R,K))}(r,\vtheta) &= \delta(r-R)\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1},&&\text{(spherical coordinates)}, $$ and let us now express the rescaled standard Gaussian density $Z_{K}$ in spherical coordinates: $$ f_{\mathcal{N}(0,\mI/K)}(\vx) &=\left(\frac{K}{2\pi}\right)^{K/2}e^{-\frac{K}{2}\|\vx\|_2^2},&&\text{(Cartesian coordinates)}\\ f_{\mathcal{N}(0,\mI/K)}(r,\vtheta) &= \left(\frac{K}{2\pi}\right)^{K/2}e^{-\frac{K}{2}r^2}r^{K-1}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1}\\ &= \frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)} e^{-\frac{Kr^{2}}{2}}\frac{\Gamma(K/2)}{2\pi^{K/2}}r^{K-1}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1}\\ &= \underbrace{\frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)} r^{K-1} e^{-\frac{Kr^{2}}{2}}}_{\text{scaled Chi-distribution}\;\overset{K\rightarrow \infty}{\rightarrow}\; \delta(r-1)}f_{\mathcal{U}(\mathbb{S}(0,R,K))}(r,\vtheta),&&\text{(spherical coordinates)}. $$ As $K$ increases, the scaled Chi-distribution converges to a Dirac function at $1$, leading to our desired result. {\bf Proof 2:}~The above proof provides granular details on the convergence to the Uniform distribution on the hypersphere by studying the scaled Chi-distribution. For completeness, we also provide a more straightforward argument, sufficient to study the limiting case. First, it is known that, $Z_{K}$ being isotropic Gaussian, the distribution of squared norms $\|Z\|_2^2/K$ follows a scaled Chi-squared distribution with mean $1$ and variance $2/K$. That is, as $K$ increases, the distribution of norms converges to a Dirac at $1$. Lastly, because $Z_{K}$ is isotropic, it is uniformly distributed on the hypersphere after normalization. But as $K$ increases, the samples are already approximately normalized, hence our result.
Proof. First and foremost, recall that the density of the random variable $f(X)$ is given by eq:general. Relying on thm:uniform, which states that for large $K$ our assumption on the output density reads $f(X) \sim \mathcal{U}(\mathbb{S}(0,1,K))$, we obtain that $ \int_{\{\vu \in \mathbb{R}^{D} | f(\vu)=f(\vx)\}} \frac{p_{X}(\vu)}{\prod_{k=1}^{\rank(J_{f}(\vu))}\sigma_k(J_{f}(\vu))} \mathrm{d}\mathcal{H}^{r}(\vu)={\rm cst} $. Now, if $f$ is bijective between ${\rm supp}(p_X)$ and $\mathbb{R}^K$, it directly follows that $ p_{X}(\vx) \propto \prod_{k=1}^{\rank(J_{f}(\vx))}\sigma_k(J_{f}(\vx)) $. If instead $f$ is merely surjective, there is no longer a one-to-one mapping between $f$ and $p_X$; instead, there is ambiguity over each level set of $f$. To see that, recall that we only need the integral over each level set to remain constant. Hence, $f$ is free to scale up one subset of that level set and scale down another, proportionally to $p_X$, so as to preserve the constant integral.
Proof. The role of the predictor in JEPA training is to allow for additional computation to predict one view's embedding from the other view's embedding. While this provides numerous empirical benefits, e.g., in terms of optimization landscape, it does not impact the level-sets of the encoder, which is what matters in eq:general. To understand the above argument, consider the case where the views are obtained by applying a transformation such as masking. We denote by $M$ the masking random variable and by ${\rm mask}(\vx)$ the application of one realization of $M$ to the input $\vx$. We thus have, for the invariance term of sample $\vx_n$, $$ \mathbb{E}_{({\rm mask}^{(1)},{\rm mask}^{(2)})\sim (M,M)}\left[{\rm dist}\left({\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}(\vx_n)\right)\right),{\rm Enc}\left({\rm mask}^{(2)}(\vx_n)\right)\right)\right]. $$ Because the predictor is only applied to one of the two embeddings, it is clear that, for the JEPA loss to be minimized, it must also be true that $$ \mathbb{E}_{{\rm mask}^{(2)}\sim M}\left[{\rm dist}\left({\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}(\vx_n)\right)\right),{\rm Enc}\left({\rm mask}^{(2)}(\vx_n)\right)\right)\right]=0, $$ for any realization of ${\rm mask}^{(1)}$. In other words, the encoder's invariance holds over the support of $M$ no matter whether the predictor is the identity or nonlinear. Therefore, our result directly follows from the above combined with eq:general and eq:data.



References
[bib1] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.
[bib2] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.
[bib3] Randall Balestriero and Yann LeCun. Learning by reconstruction produces uninformative features for perception. arXiv preprint arXiv:2402.11337, 2024.
[bib4] Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
[bib5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
[bib6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
[bib7] Milan Cvitkovic and Günther Koliander. Minimal achievable sufficient statistic learning. In International Conference on Machine Learning, pages 1465–1474. PMLR, 2019.
[bib8] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning. In International conference on machine learning, pages 3015–3024. PMLR, 2021.
[bib9] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
[bib10] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
[bib11] Steven Krantz and Harold Parks. Geometric integration theory. Springer, 2008.
[bib12] Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022.
[bib13] Etai Littwin, Omid Saremi, Madhu Advani, Vimal Thilak, Preetum Nakkiran, Chen Huang, and Joshua Susskind. How jepa avoids noisy features: The implicit bias of deep linear self distillation networks. Advances in Neural Information Processing Systems, 37:91300–91336, 2024.
[bib14] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[bib15] Manu Srinath Halvagal, Axel Laborieux, and Friedemann Zenke. Implicit variance regularization in non-contrastive ssl. Advances in Neural Information Processing Systems, 36:63409–63436, 2023.
[bib16] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008.
[bib17] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pages 9929–9939. PMLR, 2020.