Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density
Randall Balestriero (Meta-FAIR & Brown University), Nicolas Ballas (Meta-FAIR), Mike Rabbat (Meta-FAIR), Yann LeCun (Meta-FAIR & NYU)
Abstract
Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample's representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs' anti-collapse term does much more: it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic to the dataset and architecture used: in any case, one can compute the learned probability of a sample $x$ efficiently and in closed form using the model's Jacobian matrix at $x$. Our findings are empirically validated across datasets (synthetic, controlled, and Imagenet) and across different Self Supervised Learning methods falling under the JEPA family (I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the method extracting the JEPA-learned density as JEPA-SCORE.
Introduction
The training procedure of foundation models, i.e., Deep Networks (DNs) $f_\theta$ able to solve many tasks in zero or few-shot, can take many forms and is at the center of Self Supervised Learning research [2]. Over the years, one principle has emerged as central to all current state-of-the-art methods: encouraging $f_\theta(X)$ to be maximum Entropy given i.i.d. pretraining samples $X$ with density $p_X$ [17, 10]. Because the differential Entropy is difficult to estimate in high-dimensional spaces, and $f_\theta(X)$ often contains thousands of dimensions, one solution is to maximize a lower bound by using a decoder and learning to reconstruct $X$ from $f(X)$ [16]. Because this approach comes with known limitations [3], more and more foundation models are trained with Joint-Embedding Predictive Architectures (JEPAs) [12] that directly encourage $f_\theta(X)$ to be Gaussian. In fact, the Gaussian distribution is the one of maximum differential Entropy under a covariance constraint, leading to $f_\theta(X)$ producing Gaussian Embeddings (GE).
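For completeness, the maximum-entropy property invoked above can be stated concisely: among all zero-mean densities with a fixed covariance $\Sigma$, the Gaussian maximizes differential entropy, since

$$
h(X) \triangleq -\int p(x)\log p(x)\,\mathrm{d}x, \qquad h\big(\mathcal{N}(0,\Sigma)\big) = \tfrac{1}{2}\log\!\left((2\pi e)^K \det\Sigma\right),
$$

and for any zero-mean density $p$ with covariance $\Sigma$ (the second moments make $\int p \log q$ equal $\int q \log q$ for $q = \mathcal{N}(0,\Sigma)$),

$$
h\big(\mathcal{N}(0,\Sigma)\big) - h(p) = D_{\mathrm{KL}}\big(p \,\|\, \mathcal{N}(0,\Sigma)\big) \geq 0.
$$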
JEPAs can take many forms by employing numerous implicit and explicit regularizers [15, 13]. Today's JEPAs mostly take three forms: (i) moment-matching objectives (VICReg [4], W-MSE [8]), (ii) non-parametric estimators (SimCLR [6], MoCo [9], CLIP [14]), and (iii) implicit teacher-student methods (DINO [5], I-JEPA [1]). While JEPAs produce state-of-the-art representations, they are currently seen as disconnected from generative models whose goal is to estimate p X . In fact, the absence of generative modeling is a praised benefit of JEPAs. But one may wonder. . .
Can the density of f ( X ) be specified without f learning about p X ?
That question will be at the core of our study, and the answer is no: producing Gaussian Embeddings can only happen if $f_\theta$ learns the underlying data density $p_X$. But JEPAs estimate $p_X$ in a highly non-standard way, free of input-space reconstruction, and free of a parametric model for $p_X$. One question remains. . .
Is there any further benefit of not only specifying a density for f θ ( X ) but using the eponymous Gaussian density?
As it turns out, this choice guarantees that the estimator for $p_X$ implicitly learned during JEPA training can easily be extracted from the final trained model $f_\theta$, an estimator we call the JEPA-SCORE (eq. (5)). Our findings not only open new avenues in using JEPA-SCORE for outlier detection or data curation, but also shake up the Self Supervised Learning paradigm by showing how non-parametric density estimation in high dimension is now amenable through JEPAs. Our theory and its corresponding controlled experiments are provided in sections 2.1 and 2.2, and experiments with state-of-the-art models such as DINOv2, MetaCLIP and I-JEPA are provided in section 2.3. JEPA-SCORE's implementation only takes a few lines of code and is provided in listing 1.
JEPA-SCORE: the Data Density Implicitly Learned by JEPAs
We now derive our main result stating that in order to minimize the JEPA objective, a DN must learn the data density. We start with some preliminary results in section 2.1, we formalize our general finding in section 2.2, culminating in the JEPA result of section 2.3 and theorem 1. An efficient implementation is also provided in section 2.3.
Preliminaries: Gaussian Embeddings Are Uniform on the Hypersphere
Our derivations will rely on a simple observation widely known in high-dimensional statistics: $K$-dimensional standard Gaussians, appropriately normalized, converge to being Uniform on the hypersphere. Let's denote the $K$-dimensional standard Gaussian random variable by $Z$, and the normalized version by $X \triangleq Z/\sqrt{K}$ with density $f_{\mathcal{N}(0, I/K)}$. Let's also denote the Uniform distribution on the $K$-dimensional hypersphere surface of radius $R > 0$ by $f_{\mathcal{U}(\mathbb{S}(0,R,K))}$.
Lemma 1. As K grows, X quickly concentrates around the hypersphere of radius 1 , converging to a Uniform density over the hypersphere surface. (Proof in section A.1.)
Lemma 1 provides an interesting geometric fact which we turn into a practical result for SSL in the following section 2.2, where we demonstrate how learning to produce Gaussian embeddings equates with learning the Energy function of the training data.
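Lemma 1 is easy to verify numerically. The following sketch (our own illustration, with the hypothetical helper `norm_stats`; NumPy assumed) samples $X = Z/\sqrt{K}$ and shows the norms concentrating around 1 as $K$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_stats(K: int, n: int = 10_000):
    """Sample X = Z / sqrt(K) with Z ~ N(0, I_K); return mean and std of ||X||."""
    Z = rng.standard_normal((n, K))
    X = Z / np.sqrt(K)                     # X ~ N(0, I/K)
    norms = np.linalg.norm(X, axis=1)
    return norms.mean(), norms.std()

# the spread of the norms shrinks as K grows: X concentrates on the unit sphere
for K in (10, 100, 1_000):
    mean, std = norm_stats(K)
```

The standard deviation of the norms shrinks roughly as $1/\sqrt{2K}$, matching the concentration claimed by the lemma.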

Figure 2: Top left: Visual illustration of JEPA-SCORE: the DN $f_\theta$ must learn $p_X$ for its Jacobian matrix to expand or contract the density in order to produce a Uniform density on the hypersphere surface in its embedding space (lemmas 1 and 2). Top right: Pearson correlation between JEPA-SCORE and the true $\log p(x)$ on a GMM data model for various input dimensions (rows) and numbers of samples (columns). In all cases, producing Gaussian embeddings makes the backbone $f_\theta$ internalize the data density, which can be easily extracted using our proposed JEPA-SCORE. Bottom: as JEPA-SCORE is an approximation of the true score function, it is possible to perform Langevin sampling to recover the true data distribution, as shown here in two dimensions.
Producing Gaussian Embeddings Equates Learning an Energy Function
This section builds upon lemma 1 to demonstrate how learning to produce Gaussian embeddings implies learning about the data density.
Consider two densities, one on the input domain ($p_X$) and one on the output domain ($p_{f(X)}$). For $p_{f(X)}$ to have a particular form, e.g., $\mathcal{N}(0, I/K)$, $f$ must learn something about $p_X$. To see that, we leverage the change of variables formula expressing the embedding density $p_{f_\theta}$ as a function of the data density and the DN's Jacobian matrix:
$$
p_{f(X)}(f(x)) = \int_{\{u \in \mathbb{R}^D \,\mid\, f(u) = f(x)\}} \frac{p_X(u)}{\prod_{k=1}^{\operatorname{rank}(J_f(u))} \sigma_k(J_f(u))} \,\mathrm{d}\mathcal{H}^r(u), \tag{1}
$$
where $\mathcal{H}^r$ denotes the $r$-dimensional Hausdorff measure, with $r \triangleq \dim(\{u \in \mathbb{R}^D \mid f(u) = f(x)\})$ being the dimension of the level set of $f$ at $x$. We note that eq. (1) does not require $f$ to be bijective, which will be crucial for our JEPA result in section 2.3; for details see [11, 7]. Combining eq. (1) and lemma 1 leads us to the following result.
Lemma 2. In order for $f(X)$ to be distributed as $\mathcal{N}(0, I/K)$ for large $K$, $f$ must learn the data density $p_X$ up to mean-preserving rescaling within each level set $\{u \in \mathbb{R}^D \mid f(u) = f(x)\}$. (Proof in section A.2.)
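The change of variables formula of eq. (1) can be sanity-checked numerically in the simplest bijective case (our own toy example, NumPy assumed): for a linear map $f(x) = Ax$, the level sets are points, the product of singular values of the Jacobian equals $|\det A|$, and both sides of the formula are available in closed form:

```python
import numpy as np

# bijective linear map f(x) = A x; the change of variables reduces to
# p_{f(X)}(f(x)) = p_X(x) / prod_k sigma_k(A), with prod_k sigma_k(A) = |det A|
A = np.array([[2.0, 0.5],
              [0.0, 1.5]])
sing = np.linalg.svd(A, compute_uv=False)        # singular values of the Jacobian

def p_X(x):                                      # standard normal density in 2D
    return np.exp(-0.5 * x @ x) / (2 * np.pi)

def p_fX(y):                                     # exact density of AX ~ N(0, A A^T)
    cov = A @ A.T
    quad = y @ np.linalg.solve(cov, y)
    return np.exp(-0.5 * quad) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

x = np.array([0.3, -0.7])
lhs = p_fX(A @ x)                                # left-hand side of eq. (1)
rhs = p_X(x) / sing.prod()                       # right-hand side of eq. (1)
```

The two sides agree to machine precision, as the change of variables formula requires.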
Empirical validation. Before broadening lemma 2 to JEPAs, we first provide empirical validations that learning to produce Gaussian embeddings implies learning the data density in fig. 2. We show that, in fact, the data density can be recovered with high accuracy, and it is even possible to draw samples from the estimated density through Langevin dynamics.
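As a concrete illustration of the Langevin sampling mentioned above (our own toy sketch, not the paper's experiment), unadjusted Langevin dynamics only needs the score function; here we use the closed-form score of a standard Gaussian target:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy target: standard 2D Gaussian, whose score is known in closed form
def score(x):
    return -x                                    # grad log p(x) for N(0, I)

eps = 0.01                                       # step size
x = 5.0 * rng.standard_normal((5_000, 2))        # start far from the target
for _ in range(2_000):
    # unadjusted Langevin dynamics: drift along the score plus Gaussian noise
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(x.shape)
# x is now approximately distributed as N(0, I)
```

In the paper's setting, the closed-form score above would be replaced by the JEPA-SCORE gradient.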
JEPA-SCORE: The Data Density Learned by JEPAs
Most Joint-Embedding Predictive Architecture (JEPA) methods aim to achieve two goals: (i) predictive invariance, and (ii) representation diversity, as seen in the following loss function
$$
\mathcal{L} \triangleq \frac{1}{N}\sum_{n=1}^{N} \operatorname{dist}\!\left(\operatorname{Pred}\!\left(\operatorname{Enc}\!\left(x_n^{(1)}\right)\right), \operatorname{Enc}\!\left(x_n^{(2)}\right)\right) + \operatorname{diversity}\!\left(\left(\operatorname{Enc}(x_n)\right)_{n \in [N]}\right), \tag{2}
$$

Figure 3: Depiction of JEPA-SCORE for 5,000 samples from different datasets (imagenet-1k/a/r, MNIST and Galaxy). We clearly observe that as the pretraining dataset size increases (all models against IJEPA-1k), MNIST and Galaxy images are seen as lower-probability samples, i.e., those images are less and less represented within the overall pretraining dataset. While our score does not rely on singular vectors, we provide some examples in fig. 7. This can be used to assess whether a model is ready to handle particular data domains at test time for zero-shot tasks.
where $x_n^{(1)}, x_n^{(2)}$ are two 'views' generated from the original sample through the stochastic operator $\mathcal{G}$, and dist is a distance function (e.g., L2). For images, $\mathcal{G}$ typically involves two different data augmentations. At this point, lemma 2 only takes into account the non-collapse term of JEPAs. But an interesting observation from lemma 2 is that the integration occurs over the level set of the function $f_\theta$, which coincides with the JEPA's invariance term when Pred is near identity.
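A schematic PyTorch rendering of eq. (2) may help fix ideas. The diversity term below is a VICReg-style variance hinge, only one of the several anti-collapse instantiations listed earlier, and `jepa_loss`, `enc`, `pred` are illustrative names of our own:

```python
import torch
import torch.nn.functional as F

def jepa_loss(enc, pred, x1, x2, div_weight=1.0):
    """Schematic JEPA objective: latent prediction + anti-collapse.

    The diversity term is a VICReg-style variance hinge; actual JEPAs
    instantiate it differently (moment matching, contrastive, teacher-student).
    """
    z1, z2 = enc(x1), enc(x2)
    invariance = F.mse_loss(pred(z1), z2)     # dist(Pred(Enc(x1)), Enc(x2))
    std = z1.std(dim=0)                       # per-dimension spread over the batch
    diversity = F.relu(1.0 - std).mean()      # penalize collapsed dimensions
    return invariance + div_weight * diversity
```

For instance, with `enc = torch.nn.Linear(4, 8)` and `pred = torch.nn.Identity()`, the loss is strictly positive for two noisy views of the same batch.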
Data assumption. To make our result as precise as possible, we focus on the case where the views are generated from stochastic transformations, with density p T . We also denote the density of generators as p µ , from which the data density p X is defined as
$$
p_X(x) \triangleq \mathbb{E}_{\mu \sim p_\mu,\, T \sim p_T}\left[\delta\big(x - T(\mu)\big)\right]. \tag{3}
$$
In other words, each training sample is seen as some transformation of some generator. We do not impose further restrictions, e.g., that there is only one generator per class. In practical settings, the generators ($p_\mu$) are the original training samples prior to applying any augmentation; hence, estimating $p_\mu$ amounts to estimating the data density.
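The generator/transformation data model above can be instantiated in a few lines (our own toy example, with illustrative names `generators` and `sample_view`; NumPy assumed): a finite set of anchor points plays the role of $p_\mu$, and additive jitter plays the role of $p_T$:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy instantiation of eq. (3): generators are a few fixed anchor points,
# and each view is a random perturbation of one generator
generators = rng.standard_normal((10, 16))          # p_mu: 10 "original samples"

def sample_view():
    mu = generators[rng.integers(len(generators))]  # mu ~ p_mu
    t = 0.1 * rng.standard_normal(16)               # T ~ p_T (additive jitter)
    return mu + t                                    # x = T(mu)

x = sample_view()
```

Every sampled view lies close to one of the generators, mirroring how augmented images cluster around the original training samples.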
JEPA-SCORE. Combining eqs. (2) and (3) and lemma 2 leads to the following result proved in section A.3.
Theorem 1. At optimality, JEPA embeddings estimate the data density as per
$$
p_\mu(\mu) \propto \mathbb{E}_{p_T}\left[\frac{1}{\prod_{k=1}^{\operatorname{rank}(J_f(x))} \sigma_k(J_f(\mu, T))}\right]^{-1}. \tag{4}
$$
We define our JEPA-SCORE for input $x$ as the Monte Carlo estimator of eq. (4); for a single-sample estimate we have (in log-scale)
$$
\text{JEPA-SCORE}(x) \triangleq \sum_{k=1}^{\operatorname{rank}(J_f(x))} \log \sigma_k\big(J_f(x)\big), \tag{5}
$$
which exactly recovers $p_\mu$ as long as the JEPA loss is minimized. We use a logarithmic scale to align with the definition of a score function. A visual illustration is provided at the top of fig. 2. We empirically validate eq. (5) by using pretrained JEPA models and visualizing a few Imagenet-1k classes after ordering the images based on their JEPA-SCORE in figs. 1, 5 and 6. We observe that for bird classes, high-probability samples depict flying birds while low-probability ones are seated. We also conduct an additional experiment where we compute JEPA-SCORE for 5,000 samples from different datasets (Imagenet, MNIST and Galaxy) and depict the distribution of their JEPA-SCOREs in fig. 3. We clearly see that datasets that weren't seen during pretraining, e.g., Galaxy, have much lower JEPA-SCOREs than Imagenet samples.
Conclusion. We provided a novel connection between JEPAs and score-based methods, two families of methods thought to be unrelated until now. Controlled experiments on synthetic data confirm the validity of our theory, and qualitative experiments with state-of-the-art large-scale JEPAs further validate our findings. Although this is only a first step, we hope that JEPA-SCORE will open new avenues for outlier detection and model assessment for downstream tasks.
Proofs
Proof of Lemma 1
Proof. Proof 1: The proof consists in expressing both densities in spherical coordinates, and studying their convergence as K increases. Let's first express the Uniform distribution in spherical coordinates:
$$
f_{\mathcal{U}(\mathbb{S}(0,1,K))}(r, \theta) = \delta(r - 1) \, \frac{\Gamma(K/2)}{2\pi^{K/2}} \prod_{i=1}^{K-1} \sin(\theta_i)^{K-i-1},
$$
and let's now express the rescaled standard Gaussian density $Z/\sqrt{K}$ in spherical coordinates:
$$
f_{\mathcal{N}(0, I/K)}(r, \theta) = \frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)} r^{K-1} e^{-\frac{K r^2}{2}} \, \frac{\Gamma(K/2)}{2\pi^{K/2}} \prod_{i=1}^{K-1} \sin(\theta_i)^{K-i-1}.
$$
As $K$ increases, the scaled Chi-distribution converges to a Dirac function at $1$, leading to our desired result.
Proof 2: The above proof provides granular details into the convergence to the Uniform distribution on the hypersphere by studying the scaled Chi-distribution. For completeness, we also provide a more straightforward argument, sufficient for the limiting case. First, since $Z/\sqrt{K}$ is isotropic Gaussian, the distribution of squared norms, $\|Z\|_2^2/K$, is a scaled Chi-squared distribution with mean $1$ and variance $2/K$. That is, as $K$ increases, the distribution of norms converges to a Dirac at $1$. Lastly, because $Z/\sqrt{K}$ is isotropic, it would be uniformly distributed on the hypersphere after normalization. But as $K$ increases, the samples are already (approximately) normalized, hence leading to our result.
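The mean and variance claimed for $\|Z\|_2^2/K$ in Proof 2 can be checked numerically (our own sanity check, NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 512, 20_000
Z = rng.standard_normal((n, K))
s = (Z ** 2).sum(axis=1) / K      # ||Z||^2 / K: a Chi-squared(K) scaled by 1/K
# mean ~ 1 and variance ~ 2/K, so the norms collapse onto 1 as K grows
```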
Proof of Lemma 2
Proof. First and foremost, recall that the density of the random variable $f(X)$ is given by eq. (1). Relying on lemma 1, which states that for large $K$ our assumption on the output density reads $f(x) \sim \mathcal{U}(\mathbb{S}(0,1,K))$, we obtain that
$$
\int_{\{u \in \mathbb{R}^D \mid f(u) = f(x)\}} \frac{p_X(u)}{\prod_{k=1}^{\operatorname{rank}(J_f(u))} \sigma_k(J_f(u))} \,\mathrm{d}\mathcal{H}^r(u) = \mathrm{cst}.
$$
Now if $f$ is bijective between $\operatorname{supp}(p_X)$ and $\mathbb{R}^K$, then it is direct to see that $p_X(x) \propto \prod_{k=1}^{\operatorname{rank}(J_f(x))} \sigma_k(J_f(x))$. If instead $f$ is surjective, there is no longer a one-to-one mapping between $f$ and $p_X$: there is ambiguity over each level set of $f$. To see that, recall that we only need to maintain a constant value of the integral over the level set. Hence, $f$ is free to scale up one subset of that level set and scale down another subset, proportionally to $p_X$, to preserve the constant value of the integral. ∎
Proof of Theorem 1
Proof. The role of the predictor in JEPA training is to allow for additional computation to predict one view's embedding from the other view's embedding. While this provides numerous empirical benefits, e.g., in terms of optimization landscape, it does not impact the level sets of the encoder, which is what is needed in eq. (1).
To understand the above argument, consider the case where the views are obtained by applying a transformation such as masking. We denote by $\mathcal{M}$ the masking random variable and by $\operatorname{mask}(x)$ the application of one realization of $\mathcal{M}$ onto the input $x$. We thus have for the invariance term of sample $x_n$
$$
\operatorname{dist}\!\left(\operatorname{Pred}\!\left(\operatorname{Enc}\!\left(\operatorname{mask}^{(1)}(x_n)\right)\right), \operatorname{Enc}\!\left(\operatorname{mask}^{(2)}(x_n)\right)\right).
$$
Because the predictor is only applied to one of the two embeddings, it is clear that for the JEPA loss to be minimized it must also be true that
$$
\operatorname{Enc}\!\left(\operatorname{mask}^{(2)}(x_n)\right) = \operatorname{Enc}\!\left(\operatorname{mask}^{(3)}(x_n)\right), \quad \forall\, \operatorname{mask}^{(2)}, \operatorname{mask}^{(3)} \sim \mathcal{M},
$$
for any realization of $\operatorname{mask}^{(1)}$. In other words, the encoder's invariance is over the support of $\mathcal{M}$ no matter if the predictor is identity or nonlinear. Therefore, our result directly follows from the above combined with eqs. (1) and (3). ∎
Implementation Details
import torch
from torch.autograd.functional import jacobian

eps = 1e-6  # numerical floor for the singular values

# model returns a tensor of shape (num_samples, features_dim)
J = jacobian(lambda x: model(x).sum(0), inputs=images)
with torch.inference_mode():
    J = J.flatten(2).permute(1, 0, 2)
    svdvals = torch.linalg.svdvals(J)
    jepa_score = svdvals.clip_(eps).log_().sum(1)
Listing 1: JEPA-SCORE implementation in PyTorch. Our empirical ablations demonstrate that JEPA-SCORE is not sensitive to the choice of eps (we pick 1e-6).
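A self-contained usage sketch of the listing above; the two-layer network below is a toy stand-in for a pretrained JEPA encoder (an assumption for illustration only), and `ranking` is an illustrative name of our own:

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
# toy stand-in for a pretrained encoder: maps 12-d inputs to 4-d embeddings
model = torch.nn.Sequential(torch.nn.Linear(12, 6), torch.nn.Tanh(), torch.nn.Linear(6, 4))
images = torch.randn(8, 12)
eps = 1e-6

J = jacobian(lambda x: model(x).sum(0), inputs=images)   # shape (K, N, D)
with torch.inference_mode():
    J = J.flatten(2).permute(1, 0, 2)                    # (N, K, D): one Jacobian per sample
    svdvals = torch.linalg.svdvals(J)
    jepa_score = svdvals.clip_(eps).log_().sum(1)        # per-sample log-density score

ranking = jepa_score.argsort(descending=True)            # most to least likely sample indices
```

Sorting a dataset by `jepa_score` is all that is needed to reproduce the low/high-probability orderings shown in the figures.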
Additional Figures

Figure 5: Top: least and most likely samples based on the proposed JEPA-SCORE for a few different pretrained backbones on class 141. Bottom: Random samples from Imagenet-1k training dataset for class 141.


Figure 7: Depiction of the top singular vectors of the Jacobian matrix (columns) for a given input image (left column). The corresponding singular value is provided in the title of each subplot.
Figure 1: Depiction of the 5 least (left) and 5 most (right) likely samples of class 21 from Imagenet as per JEPA-SCORE, JEPAs' implicit density estimator learned during pretraining. Two striking observations: (i) across all JEPAs (rows) the types of samples with low and high probabilities are alike, and (ii) the same samples (amongst 1,000) are found at those extrema. Random samples from that class are provided in fig. 4.
Figure 4: Random samples from Imagenet-1k training dataset for class 21.
JEPAs can take many forms by employing numerous implicit and explicit regularizers [15, 13]. Today's JEPAs mostly take three forms: (i) moment-matching objectives (VICReg [4], W-MSE [8]), (ii) non-parametric estimators (SimCLR [6], MoCo [9], CLIP [14]), and (iii) implicit teacher-student methods (DINO [5], I-JEPA [1]). While JEPAs produce state-of-the-art representations, they are currently seen as disconnected from generative models, whose goal is to estimate $p_X$. In fact, the absence of generative modeling is a praised benefit of JEPAs. But one may wonder…
Can the density of $f(X)$ be specified without $f$ learning about $p_X$?
That question will be at the core of our study, and the answer is no: producing Gaussian Embeddings can only happen if $f_{\bm{\theta}}$ learns the underlying data density $p_X$. But JEPAs estimate $p_X$ in a highly non-standard way, free of input-space reconstruction and free of a parametric model for $p_X$. One question remains…
Is there any further benefit of not only specifying a density for $f_{\bm{\theta}}(X)$ but using the eponymous Gaussian density?
As it turns out, this choice guarantees that the estimator of $p_X$ implicitly learned during JEPA training can easily be extracted from the final trained model $f_{\bm{\theta}}$; we call this estimator the JEPA-SCORE (eq. (5)). Our findings not only open new avenues for using JEPA-SCORE in outlier detection or data curation, but also shake up the Self Supervised Learning paradigm by showing that non-parametric density estimation in high dimensions is now attainable through JEPAs. Our theory and its corresponding controlled experiments are provided in sections 2.1 and 2.2, and experiments with state-of-the-art models such as DINOv2, MetaCLIP and I-JEPA are provided in section 2.3. JEPA-SCORE's implementation only takes a few lines of code and is provided in listing 1.
Joint Embeddings Secretly Learn the Data Density Function

We now derive our main result stating that, in order to minimize the JEPA objective, a DN must learn the data density. We start with some preliminary results in section 2.1, and formalize our general finding in section 2.2, culminating in the JEPA result (theorem 1) in section 2.3. An efficient implementation is also provided in section 2.3.
Gaussian Embedding Enforcement

Our derivations will rely on a simple observation widely known in high-dimensional statistics: $K$-dimensional standard Gaussians, appropriately normalized, converge to being Uniform on the hypersphere. Let's denote the $K$-dimensional standard Gaussian random variable by $Z$, and the normalized version by $X \triangleq Z/\sqrt{K}$ with density $f_{\mathcal{N}(0,{\bm{I}}/K)}$. Let's also denote the Uniform distribution on the surface of the $K$-dimensional hypersphere of radius $R>0$ by $f_{\mathcal{U}(\mathbb{S}(0,R,K))}$.
Lemma 1. As $K$ grows, $X$ quickly concentrates around the hypersphere of radius $1$, converging to a Uniform density over the hypersphere surface. (Proof in section A.1.)
Lemma 1 provides an interesting geometric fact which we turn into a practical result for SSL in section 2.2, where we demonstrate how learning to produce Gaussian embeddings equates to learning the Energy function of the training data.
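Lemma 1 is also easy to verify numerically. The following is a small sketch (ours, not the paper's code, with variable names of our choosing) showing how the norms of $Z/\sqrt{K}$ concentrate at $1$ as $K$ grows:

```python
import numpy as np

# Numerical check of lemma 1: samples of X = Z/sqrt(K), Z ~ N(0, I_K),
# concentrate on the unit hypersphere as the embedding dimension K grows.
rng = np.random.default_rng(0)

def norm_spread(K, n=20_000):
    """Mean and std of ||Z/sqrt(K)|| over n Gaussian samples."""
    Z = rng.standard_normal((n, K))
    norms = np.linalg.norm(Z, axis=1) / np.sqrt(K)
    return norms.mean(), norms.std()

for K in (4, 64, 1024):
    mean, std = norm_spread(K)
    print(f"K={K:5d}  mean={mean:.3f}  std={std:.4f}")  # std shrinks with K
```

The standard deviation of the norms decays like $1/\sqrt{2K}$, matching the Chi-squared variance argument used in the proof.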
Consider two densities, one on the input domain ($p_X$) and one on the output domain ($p_{f(X)}$). For $p_{f(X)}$ to have a particular form, e.g., $\mathcal{N}(0,{\bm{I}}/K)$, $f$ must learn something about $p_X$. To see that, we will leverage the well-known change of variable formula expressing the embedding density $p_{f_{\bm{\theta}}}$ as a function of the data density and the DN's Jacobian matrix:
$$ p_{f(X)}(f({\bm{x}}))=\int_{\{{\bm{u}}\in\mathbb{R}^{D}\,|\,f({\bm{u}})=f({\bm{x}})\}}\frac{p_{X}({\bm{u}})}{\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{u}}))}\sigma_{k}(J_{f}({\bm{u}}))}\,\mathrm{d}\mathcal{H}^{r}({\bm{u}}), \tag{1} $$
where $\mathcal{H}^{r}$ denotes the $r$-dimensional Hausdorff measure, with $r \triangleq \dim(\{{\bm{u}}\in\mathbb{R}^{D}\,|\,f({\bm{u}})=f({\bm{x}})\})$ the dimension of the level set of $f$ at ${\bm{x}}$. We note that eq. (1) does not require $f$ to be bijective, which will be crucial for our JEPA result in section 2.3; for details see [krantz2008geometric, cvitkovic2019minimal]. Combining eq. (1) and lemma 1 leads us to the following result.
Lemma 2. In order for $f(X)$ to be distributed as $\mathcal{N}(0,{\bm{I}}/K)$ for large $K$, $f$ must learn the data density $p_X$ up to a mean-preserving rescaling within each level set $\{{\bm{u}}\in\mathbb{R}^{D}\,|\,f({\bm{u}})=f({\bm{x}})\}$. (Proof in section A.2.)
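The change-of-variable machinery behind lemma 2 can be sanity-checked in one dimension, where the product of singular values reduces to $|f'(x)|$. The sketch below is our own illustration (the map `tanh`, the point `x0`, and the window width are our choices):

```python
import numpy as np

# Compare the change-of-variable prediction p_X(x0)/|f'(x0)| with a Monte
# Carlo estimate of the density of f(X), for X ~ N(0, 1) and f = tanh
# (bijective and smooth, so eq. (1) has a single level-set point).
rng = np.random.default_rng(0)

x0 = 0.5
p_x = np.exp(-x0**2 / 2) / np.sqrt(2 * np.pi)   # p_X(x0)
fprime = 1.0 - np.tanh(x0) ** 2                 # |f'(x0)|, the 1x1 Jacobian
predicted = p_x / fprime                        # change-of-variable formula

# Empirical density of f(X) in a small window around f(x0).
samples = np.tanh(rng.standard_normal(2_000_000))
eps = 1e-2
empirical = np.mean(np.abs(samples - np.tanh(x0)) < eps) / (2 * eps)

print(predicted, empirical)  # the two estimates should agree closely
```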
Empirical validation. Before broadening lemma 2 to JEPAs, we first provide empirical validation in fig. 2 that learning to produce Gaussian embeddings implies learning the data density. We show that, in fact, the data density can be recovered with high accuracy, and that it is even possible to draw samples from the estimated density through Langevin dynamics.
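The Langevin experiment at the bottom of fig. 2 can be sketched in a few lines. Below is our own toy version: we use the exact score of a one-dimensional Gaussian mixture in place of a learned score, purely to illustrate the sampling loop (all names and constants are ours):

```python
import numpy as np

# Langevin dynamics: given a score function s(x) ~ grad log p(x), iterating
#   x <- x + eta * s(x) + sqrt(2 * eta) * noise
# draws samples from p. Here p is an equal-weight mixture of N(-2, 1)
# and N(2, 1), whose score is available in closed form.
rng = np.random.default_rng(0)
mus = np.array([-2.0, 2.0])

def score(x):
    """Exact score of the equal-weight Gaussian mixture."""
    resp = np.exp(-0.5 * (x[:, None] - mus) ** 2)
    resp /= resp.sum(axis=1, keepdims=True)       # posterior responsibilities
    return (resp * (mus - x[:, None])).sum(axis=1)

x = 0.1 * rng.standard_normal(5_000)              # particles init near origin
eta = 1e-2
for _ in range(2_000):
    x = x + eta * score(x) + np.sqrt(2 * eta) * rng.standard_normal(x.size)

print(x.mean(), (x > 0).mean())  # both modes end up populated
```

Replacing `score` with an estimate extracted from a trained model is exactly the experiment shown in fig. 2.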
Implicit Density Learning

Most Joint-Embedding Predictive Architecture (JEPA) methods aim to achieve two goals: (i) predictive invariance, and (ii) representation diversity, as seen in the following loss function
$$ \mathcal{L}\triangleq\sum_{n=1}^{N}{\rm dist}\left({\rm Pred}\left({\rm Enc}\left({\bm{x}}_{n}^{(1)}\right)\right),{\rm Enc}\left({\bm{x}}_{n}^{(2)}\right)\right)+{\rm diversity}\left(\left({\rm Enc}\left({\bm{x}}_{n}\right)\right)_{n\in[N]}\right), \tag{2} $$
where ${\bm{x}}_{n}^{(1)},{\bm{x}}_{n}^{(2)}$ are two "views" generated from the original sample through the stochastic operator $\mathcal{G}$, and ${\rm dist}$ is a distance function (e.g., L2). For images, $\mathcal{G}$ typically involves two different data-augmentations. At this point, lemma 2 only takes into account the non-collapse term of JEPAs. But an interesting observation from lemma 2 is that the integration occurs over the level set of the function $f_{\bm{\theta}}$, which coincides with the JEPA's invariance term when ${\rm Pred}$ is near identity.
Data assumption. To make our result as precise as possible, we focus on the case where the views are generated from stochastic transformations with density $p_T$. We also denote the density of generators by $p_\mu$, from which the data density $p_X$ is defined as
$$ p_{X}({\bm{x}})=\int p_{T}({\bm{x}}\mid{\bm{\mu}})\,p_{\mu}({\bm{\mu}})\,\mathrm{d}{\bm{\mu}}. \tag{3} $$
In other words, each training sample is seen as some transformation of some generator. We do not impose further restrictions, e.g., that there is only one generator per class. In practical settings, the generators ($p_\mu$) are the original training samples prior to applying any augmentation, hence estimating $p_\mu$ amounts to estimating the data density.
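To make the data assumption concrete, here is a toy numpy instantiation (entirely ours, for intuition): the generators are three point masses and the stochastic transformation is small additive jitter.

```python
import numpy as np

# Toy data model: generators mu ~ p_mu are three point masses, and the
# transformation T ~ p_T is additive jitter, so p_X is a mixture of narrow
# bumps centered on the generators.
rng = np.random.default_rng(0)
generators = np.array([-3.0, 0.0, 3.0])            # support of p_mu
mu = rng.choice(generators, size=10_000)           # mu ~ p_mu
x = mu + 0.1 * rng.standard_normal(mu.size)        # x = T(mu)

# Every sample stays close to its generator, so estimating p_mu from x
# amounts to estimating the data density at the level of the generators.
closest = np.min(np.abs(x[:, None] - generators[None, :]), axis=1)
print(closest.max())
```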
JEPA-SCORE. Combining eqs. (2) and (3) with lemma 2 leads to the following result, proved in section A.3.
Theorem 1. At optimality, JEPA embeddings estimate the data density as per
$$ p_{\mu}({\bm{\mu}})\propto\mathbb{E}_{p_{T}}\left[\frac{1}{\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{\mu}},T))}\sigma_{k}(J_{f}({\bm{\mu}},T))}\right]^{-1}. \tag{4} $$
We define our JEPA-SCORE for input ${\bm{x}}$ as the Monte Carlo estimator of eq. (4); for a single-sample estimate we have (in log-scale)
$$ \text{JEPA-SCORE}({\bm{x}})\triangleq\sum_{k=1}^{\operatorname{rank}(J_{f}({\bm{x}}))}\log\sigma_{k}\left(J_{f}({\bm{x}})\right), \tag{5} $$
which exactly recovers $p_\mu$ as long as the JEPA loss is minimized. We use a logarithmic scale to align with the definition of a score function. A visual illustration is provided at the top of fig. 2. We empirically validate eq. (5) by using pretrained JEPA models and visualizing a few Imagenet-1k classes after ordering the images by their JEPA-SCORE in figs. 1, 5 and 6. We observe that for bird classes, high-probability samples depict flying birds while low-probability ones are seated. We also conduct an additional experiment where we compute JEPA-SCORE for 5,000 samples from different datasets (Imagenet, MNIST and Galaxy) and depict the distribution of their JEPA-SCORE in fig. 3. We clearly see that datasets that weren't seen during pretraining, e.g., Galaxy, have much lower JEPA-SCORE than Imagenet samples.

Conclusion. We provided a novel connection between JEPAs and score-based methods, two families of methods thought to be unrelated until now. Controlled experiments on synthetic data confirm the validity of our theory, and experiments with state-of-the-art large-scale JEPAs qualitatively validate our findings. Although this is only a first step, we hope that JEPA-SCORE will open new avenues for outlier detection and model assessment for downstream tasks.
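The computation behind eq. (5) reduces to summing the log singular values of the encoder's Jacobian. The sketch below is our own minimal illustration on a tiny random two-layer map; the paper's listing 1 instead operates on a pretrained backbone with automatic differentiation:

```python
import numpy as np

# Minimal sketch of JEPA-SCORE: sum the log singular values of the
# encoder's Jacobian at x. The "encoder" here is a toy random map
# purely for illustration.
rng = np.random.default_rng(0)
D, K = 8, 4
W1 = rng.standard_normal((16, D))
W2 = rng.standard_normal((K, 16))

def f(x):
    """Toy encoder f: R^D -> R^K (stand-in for a JEPA backbone)."""
    return W2 @ np.tanh(W1 @ x)

def jacobian(f, x, eps=1e-5):
    """Finite-difference Jacobian of f at x (use autograd in practice)."""
    fx = f(x)
    cols = [(f(x + eps * e) - fx) / eps for e in np.eye(x.size)]
    return np.stack(cols, axis=1)                  # shape (K, D)

def jepa_score(x):
    """log-density of x up to an additive constant, as per eq. (5)."""
    s = np.linalg.svd(jacobian(f, x), compute_uv=False)
    s = s[s > 1e-12]                               # keep the numerical rank
    return np.log(s).sum()

x = rng.standard_normal(D)
print(jepa_score(x))                               # higher = more likely
```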
Proof 1: The proof consists in expressing both densities in spherical coordinates and studying their convergence as $K$ increases. Let's first express the Uniform distribution in spherical coordinates:
$$ f_{\mathcal{U}(\mathbb{S}(0,R,K))}(r,{\bm{\theta}})=\delta(r-R)\,\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\theta_{i})^{K-i-1}, $$
and let's now express the rescaled standard Gaussian density $Z/\sqrt{K}$ in spherical coordinates:
$$ f_{Z/\sqrt{K}}(r,{\bm{\theta}})=\frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)}e^{-\frac{Kr^{2}}{2}}r^{K-1}\,\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\theta_{i})^{K-i-1}. $$
As $K$ increases, the scaled Chi-distribution (the radial factor above) converges to a Dirac delta at $1$, leading to our desired result. ∎
Proof 2: The above proof provides granular detail on the convergence to the Uniform distribution on the hypersphere by studying the scaled Chi-distribution. For completeness, we also provide a more straightforward argument, sufficient to study the limiting case. First, since $Z$ is isotropic Gaussian, the distribution of the scaled squared norms $\|Z\|_{2}^{2}/K$ is a scaled Chi-squared distribution with mean $1$ and variance $2/K$. That is, as $K$ increases, the distribution of norms converges to a Dirac at $1$. Lastly, because $Z/\sqrt{K}$ is isotropic, it is uniformly distributed on the hypersphere after normalization. But as $K$ increases, the samples are already (approximately) normalized, hence leading to our result. ∎
First and foremost, recall that the density of the random variable $f(X)$ is given by eq. (1). Relying on lemma 1, which states that for large $K$ our assumption on the output density reads $f({\bm{x}})\sim\mathcal{U}(\mathbb{S}(0,1,K))$, we obtain that $\int_{\{{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})\}}\frac{p_{X}({\bm{u}})}{\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{u}}))}\sigma_{k}(J_{f}({\bm{u}}))}\mathrm{d}\mathcal{H}^{r}({\bm{u}})={\rm cst}$. Now if $f$ is bijective between ${\rm supp}(p_X)$ and $\mathbb{R}^{K}$, it is direct to see that $p_{X}({\bm{x}})\propto\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{x}}))}\sigma_{k}(J_{f}({\bm{x}}))$. If instead $f$ is merely surjective, there is no longer a one-to-one mapping between $f$ and $p_X$; there is ambiguity over each level set of $f$. To see that, recall that we only need to maintain a constant value for the integral over each level set. Hence, $f$ is free to scale up one subset of that level set and scale down another, proportionally to $p_X$, so as to preserve the constant value of the integral. ∎
The role of the predictor in JEPA training is to allow for additional computation to predict one view's embedding from the other view's embedding. While this provides numerous empirical benefits, e.g., in terms of the optimization landscape, it does not impact the level sets of the encoder, which is what matters in eq. (1).
To understand the above argument, consider the case where the views are obtained by applying a transformation such as masking. We denote by $\mathcal{M}$ the masking random variable and by ${\rm mask}({\bm{x}})$ the application of one realization of $\mathcal{M}$ to the input ${\bm{x}}$. We thus have for the invariance term of sample ${\bm{x}}_n$
$$ {\rm dist}\left({\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}({\bm{x}}_{n})\right)\right),{\rm Enc}\left({\rm mask}^{(2)}({\bm{x}}_{n})\right)\right). $$
Because the predictor is only applied to one of the two embeddings, it is clear that for the JEPA loss to be minimized, it must also be true that
$$ {\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}({\bm{x}}_{n})\right)\right)={\rm Enc}\left({\rm mask}^{(2)}({\bm{x}}_{n})\right) $$
for any realization of ${\rm mask}^{(1)}$. In other words, the encoder's invariance holds over the support of $\mathcal{M}$ whether the predictor is the identity or nonlinear. Therefore our result directly follows from the above combined with eqs. (1) and (3). ∎

Figure 5: Top: least and most likely samples based on the proposed JEPA-SCORE for a few different pretrained backbones on class 141. Bottom: Random samples from the Imagenet-1k training dataset for class 141.


Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample’s representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs’ anti-collapse term does much more–it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used–in any case one can compute the learned probabilities of sample 𝒙{\bm{x}} efficiently and in closed-form using the model’s Jacobian matrix at 𝒙{\bm{x}}. Our findings are empirically validated across datasets (synthetic, controlled, and Imagenet) and across different Self Supervised Learning methods falling under the JEPA family (I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the method extracting the JEPA learned density as JEPA-SCORE.
low probability
MetaCLIP IJEPA-22k IJEPA-1k DINOv2
The training procedure of foundation models—Deep Networks (DNs) f𝜽f_{{\bm{\theta}}} able to solve many tasks in zero or few-shot—can take many forms and is at the center of Self Supervised Learning research balestriero2023cookbook . Over the years, one principle has emerged as central to all current state-of-the-art methods: encouraging f𝜽(X)f_{{\bm{\theta}}}(X) to be maximum Entropy given i.i.d pretraining samples XX with density pXp_{X} wang2020understanding ; hjelm2018learning . Because the differential Entropy is difficult to estimate in high-dimensional spaces, and f𝜽(X)f_{{\bm{\theta}}}(X) often contains thousands of dimensions, one solution is to maximize a lower bound by using a decoder and learning to reconstruct XX from f(X)f(X) vincent2008extracting . Because this approach comes with known limitations balestriero2024learning , more and more foundation models are trained with Joint-Embedding Predictive Architectures (JEPAs) lecun2022path that directly encourage f𝜽(X)f_{{\bm{\theta}}}(X) to be Gaussian. In fact, the Gaussian distribution is the one of maximum differential Entropy under covariance constraint, leading to f𝜽(X)f_{{\bm{\theta}}}(X) producing Gaussian Embeddings (GE).
JEPAs can take many forms by employing numerous implicit and explicit regularizers srinath2023implicit ; littwin2024jepa . Today’s JEPAs mostly take three forms: (i) moment-matching objectives (VICReg bardes2021vicreg , W-MSE ermolov2021whitening ), (ii) non-parametric estimators (SimCLR chen2020simple , MoCo he2020momentum , CLIP radford2021learning ), and (iii) implicit teacher-student methods (DINO caron2021emerging , I-JEPA assran2023self ). While JEPAs produce state-of-the-art representations, they are currently seen as disconnected from generative models whose goal is to estimate pXp_{X}. In fact, the absence of generative modeling is a praised benefit of JEPAs. But one may wonder…
Can the density of f(X)f(X) be specified without ff learning about pXp_{X}?
That question will be at the core of our study, and the answer is no: producing Gaussian Embeddings can only happen if f𝜽f_{{\bm{\theta}}} learns the underlying data density pXp_{X}. But JEPAs estimate pXp_{X} in a highly non standard way, free of input space reconstruction, and free of a parametric model for pXp_{X}. One question remains…
Is there any further benefit of not only specifying a density for f𝛉(X)f_{{\bm{\theta}}}(X) but using the eponymous Gaussian density?
At it turns out, this choice guarantees that the estimator for pXp_{X} implicitly learned during JEPA training can easily be extracted from the final trained model f𝜽f_{{\bm{\theta}}}–an estimator we call the JEPA-SCORE (eq.˜5). Our findings not only open new avenues in using JEPA-SCORE for outlier detection or data curation, but also shaken the Self Supervised Learning paradigm by showing how non parametric density estimation in high dimension is now amenable through JEPAs. Our theory and its corresponding controlled experiments are provided in sections˜2.1 and 2.2, and experiments with state-of-the-art models such as DINOv2, MetaCLIP and I-JEPA are provided in section˜2.3. JEPA-SCORE’s implementation only takes a few lines of code and is provided in LABEL:code.
We now derive our main result stating that, in order to minimize the JEPA objective, a DN must learn the data density. We start with some preliminary results in section 2.1, formalize our general finding in section 2.2, and culminate with the JEPA result (theorem 1) in section 2.3. An efficient implementation is also provided in section 2.3.
Our derivations rely on a simple observation widely known in high-dimensional statistics: $K$-dimensional standard Gaussians, appropriately normalized, converge to being Uniform on the hypersphere. Let's denote the $K$-dimensional standard Gaussian random variable by $Z$, and the normalized version by $X\triangleq\frac{Z}{\sqrt{K}}$ with density $f_{\mathcal{N}(0,\bm{I}/K)}$. Let's also denote by $f_{\mathcal{U}(\mathbb{S}(0,R,K))}$ the Uniform distribution on the surface of the $K$-dimensional hypersphere of radius $R>0$.
Lemma 1. As $K$ grows, $X$ quickly concentrates around the hypersphere of radius $1$, converging to a Uniform density over the hypersphere surface. (Proof in section A.1.)
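This concentration is easy to verify numerically. The following minimal sketch (ours, for illustration) samples $Z\sim\mathcal{N}(0,\bm{I}_K)$ and checks that the norm of $X=Z/\sqrt{K}$ concentrates around $1$ as $K$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# For Z ~ N(0, I_K), the norm of X = Z / sqrt(K) concentrates around 1
# as K grows (its standard deviation shrinks roughly like 1 / sqrt(2K)).
stds = []
for K in (10, 100, 1_000):
    Z = rng.standard_normal((10_000, K))
    norms = np.linalg.norm(Z / np.sqrt(K), axis=1)
    stds.append(norms.std())
    print(f"K={K:5d}  mean={norms.mean():.4f}  std={norms.std():.4f}")
```

The printed standard deviations shrink with $K$, matching the Dirac limit used in the proof of lemma 1.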
Lemma 1 provides an interesting geometric fact which we turn into a practical result for SSL in section 2.2, where we demonstrate how learning to produce Gaussian embeddings equates to learning the Energy function of the training data.
Consider two densities, one on the input domain ($p_{X}$) and one on the output domain ($p_{f(X)}$). For $p_{f(X)}$ to have a particular form, e.g., $\mathcal{N}(0,\bm{I}/K)$, $f$ must learn something about $p_{X}$. To see that, we leverage the change of variable formula expressing the embedding density $p_{f_{\bm{\theta}}}$ as a function of the data density and the DN's Jacobian matrix:
$$p_{f(X)}(f(\bm{x}))=\int_{\{\bm{u}\in\mathbb{R}^{D}\,|\,f(\bm{u})=f(\bm{x})\}}\frac{p_{X}(\bm{u})}{\prod_{k=1}^{\operatorname{rank}(J_{f}(\bm{u}))}\sigma_{k}(J_{f}(\bm{u}))}\,\mathrm{d}\mathcal{H}^{r}(\bm{u}),\tag{1}$$
where $\mathcal{H}^{r}$ denotes the $r$-dimensional Hausdorff measure, with $r\triangleq\dim(\{\bm{u}\in\mathbb{R}^{D}\,|\,f(\bm{u})=f(\bm{x})\})$ the dimension of the level set of $f$ at $\bm{x}$. We note that eq. 1 does not require $f$ to be bijective, which will be crucial for our JEPA result in section 2.3; for details see [krantz2008geometric, cvitkovic2019minimal]. Combining eq. 1 and lemma 1 leads us to the following result.
Lemma 2. In order for $f(X)$ to be distributed as $\mathcal{N}(0,\bm{I}/K)$ for large $K$, $f$ must learn the data density $p_{X}$ up to mean-preserving rescaling within each level set $\{\bm{u}\in\mathbb{R}^{D}\,|\,f(\bm{u})=f(\bm{x})\}$. (Proof in section A.2.)
Empirical validation. Before broadening lemma 2 to JEPAs, we first provide empirical validation in fig. 2 that learning to produce Gaussian embeddings implies learning the data density. We show that, in fact, the data density can be recovered with high accuracy, and that it is even possible to draw samples from the estimated density through Langevin dynamics.
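As a toy illustration of the Langevin sampling referenced above (our own minimal sketch, not the paper's exact experimental setup), the snippet below runs unadjusted Langevin dynamics using the analytic score of a two-mode 2-D Gaussian mixture as a stand-in for a learned score:

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([[-2.0, 0.0], [2.0, 0.0]])  # two unit-variance modes

def score(x):
    # grad_x log p(x) for p = 0.5 N(m1, I) + 0.5 N(m2, I)
    d = x[None, :] - means                    # (2, 2): x - m_i per mode
    w = np.exp(-0.5 * (d ** 2).sum(axis=1))   # unnormalized responsibilities
    w = w / w.sum()
    return -(w[:, None] * d).sum(axis=0)

eps = 1e-2                                    # step size
x = rng.standard_normal(2)
samples = []
for t in range(20_000):
    # unadjusted Langevin update: x <- x + eps * score + sqrt(2 eps) * noise
    x = x + eps * score(x) + np.sqrt(2 * eps) * rng.standard_normal(2)
    if t > 5_000:                             # discard burn-in
        samples.append(x.copy())
samples = np.asarray(samples)
```

The chain spends its time near the two modes at $x_1\approx\pm 2$; replacing the analytic score with JEPA-SCORE's gradient is the idea behind the sampling experiment of fig. 2.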
Most Joint-Embedding Predictive Architecture (JEPA) methods aim to achieve two goals: (i) predictive invariance, and (ii) representation diversity, as seen in the following loss function
$$\mathcal{L}\triangleq\sum_{n=1}^{N}\mathbb{E}_{(\bm{x}_{n}^{(1)},\bm{x}_{n}^{(2)})\sim\mathcal{G}(\bm{x}_{n})}\left[{\rm dist}\left({\rm Pred}\left({\rm Enc}\left(\bm{x}_{n}^{(1)}\right)\right),{\rm Enc}\left(\bm{x}_{n}^{(2)}\right)\right)\right]+{\rm diversity}\left(\left({\rm Enc}\left(\bm{x}_{n}\right)\right)_{n\in[N]}\right),\tag{2}$$
where $\bm{x}_{n}^{(1)},\bm{x}_{n}^{(2)}$ are two "views" generated from the original sample through the stochastic operator $\mathcal{G}$, and ${\rm dist}$ is a distance function (e.g., L2). For images, $\mathcal{G}$ typically involves two different data augmentations. At this point, lemma 2 only takes into account the non-collapse term of JEPAs. But an interesting observation from lemma 2 is that the integration occurs over the level set of the function $f_{\bm{\theta}}$, which coincides with the JEPA's invariance term when ${\rm Pred}$ is near identity.
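A schematic numpy rendition of this two-term objective may help fix ideas. This is an illustrative sketch of ours, not any specific method's implementation: Enc and Pred are placeholder linear maps, and the diversity term is a VICReg-style variance hinge, one of several valid anti-collapse choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sketch of the two-term JEPA objective: an invariance/prediction
# term plus an anti-collapse "diversity" term (here, a variance hinge).
D, K, N = 8, 4, 32
W_enc = rng.standard_normal((K, D)) / np.sqrt(D)   # toy linear encoder
W_pred = np.eye(K)                                  # toy predictor (identity here)

def jepa_loss(x1, x2):
    z1, z2 = x1 @ W_enc.T, x2 @ W_enc.T
    invariance = np.mean(((z1 @ W_pred.T) - z2) ** 2)   # dist(Pred(Enc(x1)), Enc(x2))
    std = np.sqrt(z2.var(axis=0) + 1e-4)
    diversity = np.mean(np.maximum(0.0, 1.0 - std))     # penalize collapsed dimensions
    return invariance + diversity

x = rng.standard_normal((N, D))                  # a batch of samples
views = x + 0.1 * rng.standard_normal((2, N, D)) # two noisy "views" per sample
loss = jepa_loss(views[0], views[1])
print(f"toy JEPA loss: {loss:.3f}")
```

Collapsing the encoder (e.g., `W_enc = 0`) would zero the invariance term but saturate the diversity hinge, which is exactly the trade-off eq. 2 formalizes.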
Data assumption. To make our result as precise as possible, we focus on the case where the views are generated from stochastic transformations $T$ with density $p_{T}$. We also denote the density of generators as $p_{\mu}$, from which the data density $p_{X}$ is defined as
$$p_{X}(\bm{x})\triangleq\int p_{\mu}(\mu)\int p_{T}(T)\,\delta\big(\bm{x}-T(\mu)\big)\,\mathrm{d}T\,\mathrm{d}\mu.\tag{3}$$
In other words, each training sample is seen as some transformation of some generator. We do not impose further restrictions, e.g., that there is only one generator per class. In practical settings, the generators ($p_{\mu}$) are the original training samples prior to applying any augmentation–hence estimating $p_{\mu}$ amounts to estimating the data density.
JEPA-SCORE. Combining eqs. 1, 2 and 3 leads to the following result, proved in section A.3.
Theorem 1. At optimality, JEPA embeddings estimate the data density as per
$$p_{\mu}(\mu)\propto\mathbb{E}_{p_{T}}\left[\frac{1}{\prod_{k=1}^{\operatorname{rank}(J_{f}(\bm{x}))}\sigma_{k}(J_{f}(\mu,T))}\right]^{-1}.\tag{4}$$
We define our JEPA-SCORE for input $\bm{x}$ as the Monte Carlo estimator of eq. 4; for a single-sample estimate we have (in log-scale)
$$\text{JEPA-SCORE}(\bm{x})\triangleq\sum_{k=1}^{\operatorname{rank}(J_{f}(\bm{x}))}\log\sigma_{k}\left(J_{f}(\bm{x})\right).\tag{5}$$
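In code, eq. 5 amounts to summing the log singular values of the encoder's Jacobian at $\bm{x}$. Below is a minimal self-contained sketch of ours that uses a finite-difference Jacobian on a toy smooth map; with an actual JEPA encoder one would instead obtain $J_f$ by automatic differentiation (the function name is our own):

```python
import numpy as np

def jepa_score(f, x, eps=1e-5, tol=1e-12):
    """Single-sample JEPA-SCORE (eq. 5): sum_k log sigma_k(J_f(x)).

    Uses a forward-difference Jacobian for self-containedness; with a real
    encoder one would obtain J_f via automatic differentiation instead.
    """
    x = np.asarray(x, dtype=float)
    fx = np.asarray(f(x), dtype=float)
    J = np.empty((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (np.asarray(f(xp), dtype=float) - fx) / eps
    s = np.linalg.svd(J, compute_uv=False)
    s = s[s > tol]                  # keep singular values up to rank(J_f(x))
    return float(np.log(s).sum())

# Sanity check on a toy map: f(u) = a*u has Jacobian a*I in D dimensions,
# so its JEPA-SCORE should equal D * log(a).
a, D = 2.0, 3
score = jepa_score(lambda u: a * u, np.ones(D))
```

For large models, the SVD of the full Jacobian is the dominant cost; truncated decompositions or Hutchinson-style estimators are natural accelerations.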
Eq. 5 exactly recovers $p_{\mu}$ as long as the JEPA loss is minimized. We use a logarithmic scale to align with the definition of a score function. A visual illustration is provided at the top of fig. 2. We empirically validate eq. 5 by using pretrained JEPA models and visualizing a few Imagenet-1k classes after ordering the images by their JEPA-SCORE in figs. 1, 5 and 6. We find that for bird classes, high-probability samples depict flying birds while low-probability ones are seated. We also conduct an additional experiment where we compute JEPA-SCORE for 5,000 samples from different datasets (Imagenet, MNIST and Galaxy) and depict the distribution of their JEPA-SCORE in fig. 3. We clearly see that datasets that weren't seen during pretraining, e.g., Galaxy, have much lower JEPA-SCORE than Imagenet samples.

Conclusion. We provided a novel connection between JEPAs and score-based methods, two families of methods thought to be unrelated until now. Controlled experiments on synthetic data confirm the validity of our theory, and experiments with state-of-the-art large-scale JEPAs qualitatively validate our findings. Although this is only a first step, we hope that JEPA-SCORE will open new avenues for outlier detection and model assessment for downstream tasks.
Proof 1: The proof consists in expressing both densities in spherical coordinates and studying their convergence as $K$ increases. Let's first express the Uniform distribution in spherical coordinates:
$$f_{\mathcal{U}(\mathbb{S}(0,1,K))}(r,\bm{\theta})=\delta(r-1)\,\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\bm{\theta}_{i})^{K-i-1},$$
and let's now express the rescaled standard Gaussian $\frac{Z}{\sqrt{K}}$ in spherical coordinates:
$$f_{\mathcal{N}(0,\bm{I}/K)}(r,\bm{\theta})=\underbrace{\frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)}e^{-\frac{Kr^{2}}{2}}r^{K-1}}_{\text{scaled Chi-distribution}}\,\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\bm{\theta}_{i})^{K-i-1}.$$
As $K$ increases, the scaled Chi-distribution converges to a Dirac function at $1$, leading to our desired result. ∎
Proof 2: The above proof provides granular details on the convergence to the Uniform distribution on the hypersphere by studying the scaled Chi-distribution. For completeness, we also provide a more direct argument, sufficient for the limiting case. First, since $Z$ is isotropic Gaussian, the squared norm $\|Z\|_{2}^{2}/K$ follows a scaled Chi-squared distribution with mean $1$ and variance $2/K$; that is, as $K$ increases, the norm distribution converges to a Dirac at $1$. Second, because $Z$ is isotropic, $\frac{Z}{\|Z\|_{2}}$ is uniformly distributed on the hypersphere. As $K$ increases, the samples $\frac{Z}{\sqrt{K}}$ are already nearly normalized, hence our result. ∎
First and foremost, recall that the density of the random variable $f(X)$ is given by eq. 1. Relying on lemma 1, which states that for large $K$ our assumption on the output density reads $f(\bm{x})\sim\mathcal{U}(\mathbb{S}(0,1,K))$, we obtain that
$$\int_{\{\bm{u}\in\mathbb{R}^{D}\,|\,f(\bm{u})=f(\bm{x})\}}\frac{p_{X}(\bm{u})}{\prod_{k=1}^{\operatorname{rank}(J_{f}(\bm{u}))}\sigma_{k}(J_{f}(\bm{u}))}\,\mathrm{d}\mathcal{H}^{r}(\bm{u})={\rm cst}.$$
Now if $f$ is bijective between ${\rm supp}(p_{X})$ and $\mathbb{R}^{K}$, it is direct to see that $p_{X}(\bm{x})\propto\prod_{k=1}^{\operatorname{rank}(J_{f}(\bm{x}))}\sigma_{k}(J_{f}(\bm{x}))$. If $f$ is merely surjective, there is no longer a one-to-one correspondence between $f$ and $p_{X}$; instead, there is ambiguity over each level set of $f$. To see that, recall that we only need to maintain a constant value of the integral over the level set. Hence, $f$ is free to scale up one subset of that level set and scale down another, proportionally to $p_{X}$, so as to preserve the constant integral. ∎
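The bijective case of the proof can be sanity-checked numerically: for a strictly increasing 1-D map, the level sets are singletons and eq. 1 reduces to the classic change-of-variables formula, which we compare against an empirical histogram (a toy check of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bijective 1-D check of eq. 1: for y = f(x) with f strictly increasing,
# p_{f(X)}(f(x)) = p_X(x) / |f'(x)|   (the lone "singular value" of J_f).
f = lambda x: x ** 3 + x            # strictly increasing, hence bijective
fprime = lambda x: 3 * x ** 2 + 1

x = rng.standard_normal(200_000)    # X ~ N(0, 1)
y = f(x)

x0 = 0.5                            # evaluation point
p_x0 = np.exp(-x0 ** 2 / 2) / np.sqrt(2 * np.pi)
predicted = p_x0 / fprime(x0)       # density of f(X) at f(x0) per eq. 1

h = 0.05                            # histogram half-width around f(x0)
empirical = np.mean(np.abs(y - f(x0)) < h) / (2 * h)
# predicted and empirical agree up to Monte Carlo / binning error.
```
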
The role of the predictor in JEPA training is to allow for additional computation to predict one view's embedding from the other view's embedding. While this provides numerous empirical benefits, e.g., in terms of optimization landscape, it does not impact the level sets of the encoder–which is what enters eq. 1.
To understand the above argument, consider the case where the views are obtained by applying a transformation such as masking. We denote by $\mathcal{M}$ the masking random variable and by ${\rm mask}(\bm{x})$ the application of one realization of $\mathcal{M}$ onto the input $\bm{x}$. We thus have for the invariance term of sample $\bm{x}_{n}$
$${\rm dist}\left({\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}(\bm{x}_{n})\right)\right),{\rm Enc}\left({\rm mask}^{(2)}(\bm{x}_{n})\right)\right).$$
Because the predictor is only applied on one of the two embeddings, it is clear that for the JEPA loss to be minimized, it must also be true that
$${\rm Enc}\left({\rm mask}^{(2)}(\bm{x}_{n})\right)={\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}(\bm{x}_{n})\right)\right),\quad\forall\,{\rm mask}^{(2)}\in{\rm supp}(\mathcal{M}),$$
for any realization of ${\rm mask}^{(1)}$. In other words, the encoder is invariant over the support of $\mathcal{M}$ whether the predictor is the identity or nonlinear. Therefore our result directly follows from the above combined with eqs. 1 and 3. ∎
Figure 1: Depiction of the 5 least (left) and 5 most (right) likely samples of class 21 from Imagenet as per JEPA-SCORE–JEPAs' implicit density estimator learned during pretraining. Two striking observations: (i) across all JEPAs (rows), the types of samples with low and high probabilities are alike, and (ii) the same samples (amongst 1,000) are found at those extrema. Random samples from that class are provided in fig. 4.
Figure 2: Top left: visual illustration of JEPA-SCORE–the DN $f_{\bm{\theta}}$ must learn $p_{X}$ for its Jacobian matrix to expand or contract the density in order to produce a Uniform density on the hypersphere surface in its embedding space (lemmas 1 and 2). Top right: Pearson correlation between JEPA-SCORE and the true $\log p(x)$ on a GMM data model for various input dimensions (rows) and numbers of samples (columns). In all cases, producing Gaussian embeddings makes the backbone $f_{\bm{\theta}}$ internalize the data density, which can easily be extracted using our proposed JEPA-SCORE. Bottom: as JEPA-SCORE is an approximation of the true score function, it is possible to perform Langevin sampling to recover the true data distribution, as shown here in two dimensions.
Figure 3: Depiction of JEPA-SCORE for 5,000 samples from different datasets (Imagenet-1k/a/r, MNIST and Galaxy). We clearly observe that as the pretraining dataset size increases (all models against IJEPA-1k), MNIST and Galaxy images are seen as lower-probability samples, i.e., those images are less and less represented within the overall pretraining dataset. While our score does not rely on singular vectors, we provide some examples in fig. 7. This can be used to assess whether a model is ready to handle particular data domains at test time for zero-shot tasks.
Figure 4: Random samples from the Imagenet-1k training dataset for class 21.
Figure 7: Depiction of the top singular vectors of the Jacobian matrix (columns) for a given input image (left column). The corresponding singular value is provided in the title of each subplot.
Figure 5: Top: least and most likely samples based on the proposed JEPA-SCORE for a few different pretrained backbones on class 141. Bottom: random samples from the Imagenet-1k training dataset for class 141.
The training procedure of foundation models—Deep Networks (DNs) f𝜽f_{{\bm{\theta}}} able to solve many tasks in zero or few-shot—can take many forms and is at the center of Self Supervised Learning research balestriero2023cookbook . Over the years, one principle has emerged as central to all current state-of-the-art methods: encouraging f𝜽(X)f_{{\bm{\theta}}}(X) to be maximum Entropy given i.i.d pretraining samples XX with density pXp_{X} wang2020understanding ; hjelm2018learning . Because the differential Entropy is difficult to estimate in high-dimensional spaces, and f𝜽(X)f_{{\bm{\theta}}}(X) often contains thousands of dimensions, one solution is to maximize a lower bound by using a decoder and learning to reconstruct XX from f(X)f(X) vincent2008extracting . Because this approach comes with known limitations balestriero2024learning , more and more foundation models are trained with Joint-Embedding Predictive Architectures (JEPAs) lecun2022path that directly encourage f𝜽(X)f_{{\bm{\theta}}}(X) to be Gaussian. In fact, the Gaussian distribution is the one of maximum differential Entropy under covariance constraint, leading to f𝜽(X)f_{{\bm{\theta}}}(X) producing Gaussian Embeddings (GE).
JEPAs can take many forms by employing numerous implicit and explicit regularizers srinath2023implicit ; littwin2024jepa . Today’s JEPAs mostly take three forms: (i) moment-matching objectives (VICReg bardes2021vicreg , W-MSE ermolov2021whitening ), (ii) non-parametric estimators (SimCLR chen2020simple , MoCo he2020momentum , CLIP radford2021learning ), and (iii) implicit teacher-student methods (DINO caron2021emerging , I-JEPA assran2023self ). While JEPAs produce state-of-the-art representations, they are currently seen as disconnected from generative models whose goal is to estimate pXp_{X}. In fact, the absence of generative modeling is a praised benefit of JEPAs. But one may wonder…
Can the density of f(X)f(X) be specified without ff learning about pXp_{X}?
That question will be at the core of our study, and the answer is no: producing Gaussian Embeddings can only happen if f𝜽f_{{\bm{\theta}}} learns the underlying data density pXp_{X}. But JEPAs estimate pXp_{X} in a highly non standard way, free of input space reconstruction, and free of a parametric model for pXp_{X}. One question remains…
Is there any further benefit of not only specifying a density for f𝛉(X)f_{{\bm{\theta}}}(X) but using the eponymous Gaussian density?
At it turns out, this choice guarantees that the estimator for pXp_{X} implicitly learned during JEPA training can easily be extracted from the final trained model f𝜽f_{{\bm{\theta}}}–an estimator we call the JEPA-SCORE (eq.˜5). Our findings not only open new avenues in using JEPA-SCORE for outlier detection or data curation, but also shaken the Self Supervised Learning paradigm by showing how non parametric density estimation in high dimension is now amenable through JEPAs. Our theory and its corresponding controlled experiments are provided in sections˜2.1 and 2.2, and experiments with state-of-the-art models such as DINOv2, MetaCLIP and I-JEPA are provided in section˜2.3. JEPA-SCORE’s implementation only takes a few lines of code and is provided in LABEL:code.
We now derive our main result stating that in order to minimize the JEPA objective, a DN must learn the data density. We start with some preliminary results in section˜2.1, we formalize our general finding in section˜2.2, culminating in the JEPA result of sections˜2.3 and 1. An efficient implementation is also provided in section˜2.3.
Our derivations will rely on a simple observations widely known in high-dimensional statistics: KK-dimensional standard Gaussians, appropriately normalized, converge to being Uniform on the hypersphere. Let’s denote the KK-dimensional standard Gaussian random variable by ZZ, and the normalized version by X≜ZKX\triangleq\frac{Z}{\sqrt{K}} with density f𝒩(0,𝑰/K)f_{\mathcal{N}(0,{\bm{I}}/K)}. Let’s also denote the Uniform distribution on the KK-dimensional hypersphere surface by f𝒰(𝕊(0,R,K))f_{\mathcal{U}(\mathbb{S}(0,R,K))} with radius R>0R>0.
As KK grows, XX quickly concentrates around the hypersphere of radius 11, converging to a Uniform density over the hypersphere surface. (Proof in section˜A.1.)
Lemma˜1 provides an interesting geometric fact which we turn into a practical result for SSL in the following section˜2.2, where we demonstrate how learning to produce Gaussian embeddings equates with learning the Energy function of the training data.
Consider two densities, one on the input domain (pXp_{X}) and one on the output domain (pf(X)p_{f(X)}). For pf(X)p_{f(X)} to have a particular form, e.g., 𝒩(0,𝑰/K)\mathcal{N}(0,{\bm{I}}/K), ff must learn something about pXp_{X}. To see that, we will have leverage the eponymous change of variable formula expressing the embedding density pf𝜽p_{f_{{\bm{\theta}}}} as a function of the data density and the DN’s Jacobian matrix:
where ℋr\mathcal{H}^{r} denotes rr-dimensional Hausdorff measure, with r≜dim({𝒖∈ℝD|f(𝒖)=f(𝒙)})r\triangleq\dim({{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})}) being the dimension of the level set of ff at 𝒙{\bm{x}}. We note that eq.˜1 does not require ff to be bijective, which will be crucial for our JEPA result in section˜2.3; for details see krantz2008geometric ; cvitkovic2019minimal . Combining eq.˜1 and lemma˜1 leads us to the following result.
In order for f(X)f(X) to be distributed as 𝒩(0,𝐈/K)\mathcal{N}(0,{\bm{I}}/K) for large KK, ff must learn the data density pXp_{X} up to mean-preserving rescaling within each level-set {𝐮∈ℝD|f(𝐮)=f(𝐱)}{{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})}. (Proof in section˜A.2.)
Empirical validation. Before broadening lemma˜2 to JEPAs, we first provide empirical validations that learning to produce Gaussian embeddings implies learning the data density in fig.˜2. We show that, in fact, the data density can be recovered with high accuracy, and it is even possible to draw samples from the estimated density through Langevin dynamics.
Most Joint-Embedding Predictive Architecture (JEPA) methods aim to achieve two goals: (i) predictive invariance, and (ii) representation diversity, as seen in the following loss function
where 𝒙n(1),𝒙n(2){\bm{x}}{n}^{(1)},{\bm{x}}{n}^{(2)} are two generated “views” from the original sample through the stochastic operator 𝒢\mathcal{G}, and dist\rm dist is a distance function (e.g., L2). For images, 𝒢\mathcal{G} typically involves two different data-augmentations. At this point, lemma˜2 only takes into account the non-collapse term of JEPAs. But an interesting observation from lemma˜2 is that the integration occurs over the level set of the function f𝜽f_{{\bm{\theta}}} which coincides with the JEPA’s invariance term when Pred is near identity.
Data assumption. To make our result as precise as possible, we focus on the case where the views are generated from stochastic transformations, with density pTp_{T}. We also denote the density of generators as pμp_{\mu}, from which the data density pXp_{X} is defined as
In other words, each training sample is seen as some transformation of some generator. We do not impose further restrictions, e.g., that there is only one generator per class. In practical setting, the generators (pμp_{\mu}) are the original training samples prior to applying any augmentation–hence estimating pμp_{\mu} will amount to estimating the data density.
JEPA-SCORE. Combining eqs.˜3, 2 and 2 leads to the following result proved in section˜A.3.
At optimality, JEPA embeddings estimate the data density as per
We define our JEPA-SCORE for input 𝒙{\bm{x}} as the Monte Carlo estimator of eq.˜4, for a single-sample estimate we have (in log-scale)
which exactly recovers pμp_{\mu} as long as the JEPA loss is minimized. We use a logarithmic scale to align with the definition of a score function. A visual illustration is provided at the top of fig.˜2. We empirically validate eq.˜5 by using pretrained JEPA models and visualizing a few Imagenet-1k classes after ordering the images based on their JEPA-SCORE in figs.˜1, 5 and 6. We obtain that for bird classes, high probability samples depict flying birds while low probability ones are seated. We also conduct an additional experiment where we compute JEPA-SCORE for 5,0005,000 samples of different dataset (Imagenet, MNIST and Galaxy) and depict the distribution of their JEPA-SCORE in fig.˜3. We clearly see that datasets that weren’t seen during pretraining, e.g., Galaxy, have much lower JEPA-SCORE than Imagenet samples. Conclusion. We provided a novel connection between JEPAs and score-based methods, two families of methods thought to be unrelated until now. Controlled experiments on synthetic data confirm the validity of our theory and qualitative experiments with state-of-the-art large scale JEPAs also qualitatively validate our findings. Although this is only a first step, we hope that JEPA-SCORE will open new avenues for outlier detection and model assessment for downstream tasks.
Proof 1: The proof consists in expressing both densities in spherical coordinates, and studying their convergence as KK increases. Let’s first express the Uniform distribution in spherical coordinates:
and let’s now express the rescaled standard Gaussian density ZK\frac{Z}{\sqrt{K}} in spherical coordinates:
As KK increases, as the scaled Chi-distribution converges to a Dirac function at 11, leading to our desired result.
Proof 2: The above proof provides granular details into the convergence to the Uniform distribution on the hypersphere by studying the scaled Chi-distribution. For completeness, we also provide a more straightforward argument, sufficient to study the limiting case. First, it is known that ZK\frac{Z}{\sqrt{K}} being isotropic Gaussian, the distribution of norms, ‖Z‖22/K|Z|_{2}^{2}/K, is a Chi-squared distribution with mean 11 and variance 2/K2/K. That is, as KK increases as the norms distribution converges to a Dirac at 11. Lastly, because ZK\frac{Z}{\sqrt{K}} is isotropic, it will be uniformly distribution on the hypersphere after normalization. But as KK increases, as the samples are already normalized, hence leading to our result. ∎
First and foremost, recall that the density of the random variable f(X)f(X) is given by eq.˜1. Relying on lemma˜1 which stated that for large KK, our assumption on the output density reads f(𝒙)∼𝒰(0,1)f({\bm{x}})\sim\mathcal{U}(0,1), we obtain that ∫{𝒖∈ℝD|f(𝒖)=f(𝒙)}pX(𝒖)∏k=1rank(Jf(𝒖))σk(Jf(𝒖))dℋr(𝒖)=cst\int_{{{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})}}\frac{p_{X}({\bm{u}})}{\prod_{k=1}^{rank(J_{f}({\bm{u}}))}\sigma_{k}(J_{f}({\bm{u}}))}\mathrm{d}\mathcal{H}^{r}({\bm{u}})={\rm cst}. Now if ff is bijective between supp(pX){\rm supp}(p_{X}) and ℝK\mathbb{R}^{K}, then it is direct to see that pX(x)∝∏k=1rank(Jf(𝒙))σk(Jf(𝒙))p_{X}(x)\propto\prod_{k=1}^{rank(J_{f}({\bm{x}}))}\sigma_{k}(J_{f}({\bm{x}})). Now if ff is surjective there is no longer a one-to-one mapping between ff and pXp_{X}. Instead, there is ambiguity over each level set of ff. To see that, recall that we only need to maintain a constant value over the integration on the level set. Hence, ff is free to scale up one subset of that level set, and scale down another subset, proportionally to pXp_{X} to preserve the integration to a constant. ∎
The role of the predictor in JEPA training is to allow for additional computation to predict one view’s embedding from the other view’s embedding. While this provides numerous empirical benefits, e.g., in terms of optimization landscape, it actually does not impact the level-set of the encoder–which is what is needed in eq.˜1.
To understand the above argument, consider the case where the views are obtained from applying a transformation such as masking. We denote by ℳ\mathcal{M} the masking random and by mask(𝒙){\rm mask}({\bm{x}}) the application of one realization of ℳ\mathcal{M} onto the input 𝒙{\bm{x}}. We thus have for the invariance term of sample 𝒙n{\bm{x}}_{n}
Because the predictor is only applied on one of the two embeddings, it is clear that for the JEPA loss to be minimized, it must also be true that
for any realization of mask(1){\rm mask}^{(1)}. In other word, the encoder’s invariance is over the support of ℳ\mathcal{M} no matter if the predictor is identity or nonlinear. Therefore our result directly follows from the above combined with eqs.˜1 and 3. ∎
Depiction of the 5 least (left) and 5 most (right) likely samples of class 21 from Imagenet as per JEPA-SCORE–JEPAs’ implicit density estimator learned during pretraining. Two striking observations: (i) across all JEPAs (rows) the type of samples with low and high probabilities are alike, and (ii) the same samples (amongst 1,000) are found at those extrema. Random samples from that class are provided in fig.˜4
Top left: Visual illustration of JEPA-SCORE–the DN f𝜽f_{{\bm{\theta}}} must learn pXp_{X} for its Jacobian matrix to expand or contract the density in order to produce a Uniform density on the hypersphere surface in its embedding space (lemmas˜1 and 2). Top right: Pearson correlation between JEPA-SCORE and the true logp(x)logp(x) on a GMM data model for various input dimensions (rows) and number of samples (columns). In all cases, producing Gaussian embeddings make the backbone f𝜽f_{{\bm{\theta}}} internalize the data density which can be easily extracted using our proposed JEPA-SCORE. Bottom: as JEPA-SCORE is an approximation of the true score function, it is possible to perform Langevin sample to recover the true data distribution as shown here in two dimensions.
Depiction of JEPA-SCORE for 5,0005,000 samples from different datasets (imagenet-1k/a/r, MNIST and Galaxy). We clearly observe that as the pretraining dataset size increases (all models against IJEPA-1k) as MNIST and Galaxy images are seen as lower probability samples, i.e., those images are less and less represented within the overall pretraining dataset. While our score does not rely on singular vectors, we provide some examples in fig.˜7. This can be used to assess if a model is ready or not to handle particular data domains at test time for zero-shot tasks.
Random samples from Imagenet-1k training dataset for class 21.
Depiction of the top singular vectors of the Jacobian matrix (columns) for a given input image (left column). The corresponding singular value is provided in the title of each subplot.
$$ p_{f(X)}(f({\bm{x}}))=\int_{{{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})}}\frac{p_{X}(x)}{\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{x}}))}\sigma_{k}(J_{f}({\bm{x}}))}\mathrm{d}\mathcal{H}^{r}({\bm{x}}), $$ \tag{S2.E1}
$$ \displaystyle\mathcal{L}\triangleq $$
$$ \displaystyle+{\rm diversity}\left(\left({\rm Enc}\left({\bm{x}}{n}\right)\right){n\in[N]}\right), $$
$$ \displaystyle=\frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)}e^{-\frac{Kr^{2}}{2}}\frac{\Gamma(K/2)}{2\pi^{K/2}}r^{K-1}\hskip-4.26773pt\prod_{i=1}^{K-1}\sin({\bm{\theta}}_{i})^{K-i-1} $$
Lemma. Lemma 1. As KK grows, XX quickly concentrates around the hypersphere of radius 11, converging to a Uniform density over the hypersphere surface. (Proof in section˜A.1.)
Lemma. Lemma 2. In order for f(X)f(X) to be distributed as 𝒩(0,𝐈/K)\mathcal{N}(0,{\bm{I}}/K) for large KK, ff must learn the data density pXp_{X} up to mean-preserving rescaling within each level-set {𝐮∈ℝD|f(𝐮)=f(𝐱)}{{\bm{u}}\in\mathbb{R}^{D}|f({\bm{u}})=f({\bm{x}})}. (Proof in section˜A.2.)
Theorem. Theorem 1. At optimality, JEPA embeddings estimate the data density as per pμ(μ)∝𝔼pT[1∏k=1rank(Jf(𝒙))σk(Jf(μ,T))]−1.p_{\mu}(\mu)\propto\mathbb{E}{p{T}}\left[\frac{1}{\prod_{k=1}^{\operatorname{rank}(J_{f}({\bm{x}}))}\sigma_{k}(J_{f}(\mu,T))}\right]^{-1}. (4)
Tables
Default Notation
Final instructions
The training procedure of foundation models-Deep Networks (DNs) f θ able to solve many tasks in zero or few-shot-can take many forms and is at the center of Self Supervised Learning research [2]. Over the years, one principle has emerged as central to all current state-of-the-art methods: encouraging f θ ( X ) to be maximum Entropy given i.i.d pretraining samples X with density p X [17, 10]. Because the differential Entropy is difficult to estimate in high-dimensional spaces, and f θ ( X ) often contains thousands of dimensions, one solution is to maximize a lower bound by using a decoder and learning to reconstruct X from f ( X ) [16]. Because this approach comes with known limitations [3], more and more foundation models are trained with Joint-Embedding Predictive Architectures (JEPAs) [12] that directly encourage f θ ( X ) to be Gaussian. In fact, the Gaussian distribution is the one of maximum differential Entropy under covariance constraint, leading to f θ ( X ) producing Gaussian Embeddings (GE).
JEPAs can take many forms by employing numerous implicit and explicit regularizers [15, 13]. Today's JEPAs mostly take three forms: (i) moment-matching objectives (VICReg [4], W-MSE [8]), (ii) non-parametric estimators (SimCLR [6], MoCo [9], CLIP [14]), and (iii) implicit teacher-student methods (DINO [5], I-JEPA [1]). While JEPAs produce state-of-the-art representations, they are currently seen as disconnected from generative models whose goal is to estimate p X . In fact, the absence of generative modeling is a praised benefit of JEPAs. But one may wonder. . .
Can the density of f ( X ) be specified without f learning about p X ?
That question will be at the core of our study, and the answer is no: producing Gaussian Embeddings can only happen if f θ learns the underlying data density p X . But JEPAs estimate p X in a highly non standard way, free of input space reconstruction, and free of a parametric model for p X . One question remains. . .
Is there any further benefit of not only specifying a density for $f_{\theta}(X)$, but of using the eponymous Gaussian density?
As it turns out, this choice guarantees that the estimator for $p_X$ implicitly learned during JEPA training can easily be extracted from the final trained model $f_{\theta}$, an estimator we call the JEPA-SCORE (eq. (5)). Our findings not only open new avenues for using JEPA-SCORE in outlier detection or data curation, but also shake the Self Supervised Learning paradigm by showing that nonparametric density estimation in high dimension is now tractable through JEPAs. Our theory and its corresponding controlled experiments are provided in sections 2.1 and 2.2, and experiments with state-of-the-art models such as DINOv2, MetaCLIP, and I-JEPA are provided in section 2.3. JEPA-SCORE's implementation only takes a few lines of code and is provided in listing 1.
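The extraction can be sketched in a few lines. The following is a minimal illustration (it is not the paper's listing 1), assuming `encoder` maps a flat input vector in $\mathbb{R}^{D}$ to an embedding in $\mathbb{R}^{K}$; the score is the sum of the log singular values of the encoder's Jacobian at the sample:

```python
import torch

def jepa_score(encoder, x, eps=1e-12):
    """Unnormalized log-density of x under a trained JEPA encoder:
    sum over k of log sigma_k(J_f(x))."""
    J = torch.autograd.functional.jacobian(encoder, x)  # shape (K, D)
    s = torch.linalg.svdvals(J)
    s = s[s > eps]  # keep only the rank(J_f(x)) nonzero singular values
    return s.log().sum()

# toy usage with a random two-layer encoder on a 16-d input
torch.manual_seed(0)
enc = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 8))
score = jepa_score(enc, torch.randn(16))
```

The returned value is only defined up to an additive constant, which suffices for ranking samples, e.g., for curation or outlier detection.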
$$ p_{f(X)}(f(\vx))=\int_{\{\vu \in \mathbb{R}^{D} \,|\, f(\vu)=f(\vx)\}} \frac{p_{X}(\vu)}{\prod_{k=1}^{\rank(J_{f}(\vu))}\sigma_k(J_{f}(\vu))} \mathrm{d} \mathcal{H}^{r}(\vu),\label{eq:general} $$ \tag{eq:general}
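When $f$ is bijective, eq:general collapses to the classical change-of-variables formula $p_{f(X)}(f(\vx)) = p_X(\vx)/\prod_k \sigma_k(J_f(\vx))$. A quick numerical sanity check with a linear map (an illustrative choice, not from the paper), where the product of singular values equals $|\det A|$:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
A = rng.normal(size=(D, D))  # a bijective linear "encoder" f(x) = A x
x = rng.normal(size=D)

def gauss_pdf(v, cov):
    d = len(v)
    return np.exp(-0.5 * v @ np.linalg.solve(cov, v)) / np.sqrt(
        (2 * np.pi) ** d * np.linalg.det(cov))

# left-hand side: density of f(X) at f(x); X ~ N(0, I) so f(X) ~ N(0, A A^T)
lhs = gauss_pdf(A @ x, A @ A.T)
# right-hand side: p_X(x) divided by the product of the Jacobian's singular values
sigma = np.linalg.svd(A, compute_uv=False)
rhs = gauss_pdf(x, np.eye(D)) / np.prod(sigma)
```

The two quantities agree to machine precision, as expected since for a square matrix $\prod_k \sigma_k(A) = |\det A|$.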
$$ \mathrm{HV}_{n, \gamma}=\frac{1}{n}\left(\frac{\pi}{\gamma}\right)^{d / 2} \sum_{j, k=1}^{n} \exp \left(-\frac{\left\|Y_{n, j, k}^{+}\right\|^{2}}{4 \gamma}\right) $$
$$ \mathcal{L}\triangleq\,& \sum_{n=1}^{N} \mathbb{E}_{(\vx_n^{(1)},\vx_n^{(2)})\sim \mathcal{G}(\vx_n)}\left[{\rm dist}\left({\rm Pred}\left({\rm Enc}\left(\vx_n^{(1)}\right)\right),{\rm Enc}\left(\vx_n^{(2)}\right)\right)\right]&&(\text{predictive invariance})\\ &+{\rm diversity}\left(\left({\rm Enc}\left(\vx_{n}\right)\right)_{n \in [N]}\right),&&(\text{anti-collapse}),\label{eq:JEPA} $$ \tag{eq:JEPA}
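A minimal instantiation of eq:JEPA, using mean-squared error as the `dist` term and a VICReg-style variance hinge as the `diversity` term (one valid choice among the families listed above, not the only one):

```python
import torch
import torch.nn.functional as F

def jepa_loss(Enc, Pred, x1, x2, weight=1.0):
    z1, z2 = Enc(x1), Enc(x2)              # embeddings of the two views
    invariance = F.mse_loss(Pred(z1), z2)  # predictive-invariance term
    # anti-collapse: hinge pushing each embedding dimension's std above 1
    std = z1.var(dim=0).add(1e-4).sqrt()
    diversity = F.relu(1.0 - std).mean()
    return invariance + weight * diversity

# toy usage: batch of 64 paired views in R^32, embeddings in R^16
torch.manual_seed(0)
Enc = torch.nn.Linear(32, 16)
Pred = torch.nn.Linear(16, 16)
x = torch.randn(64, 32)
loss = jepa_loss(Enc, Pred, x + 0.1 * torch.randn_like(x), x)
```

Both terms are nonnegative, so the loss is minimized exactly when the two objectives (predictability and non-collapse) are jointly satisfied.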
$$ p_{X} \triangleq p_{\mu}\otimes p_{T}.\label{eq:data} $$ \tag{eq:data}
$$ f_{\mathcal{U}(\mathbb{S}(0,R,K))}(\vx) &= \delta(\|\vx\|_2-R)\frac{\Gamma(K/2)}{2\pi^{K/2}R^{K-1}},&&\text{(Cartesian coordinates)}\\ f_{\mathcal{U}(\mathbb{S}(0,R,K))}(r,\vtheta) &= \delta(r-R)\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1},&&\text{(spherical coordinates)}, $$
$$ f_{\mathcal{N}(0,\mI/K)}(\vx) &=\left(\frac{K}{2\pi}\right)^{K/2}e^{-\frac{K}{2}\|\vx\|_2^2},&&\text{(Cartesian coordinates)}\\ f_{\mathcal{N}(0,\mI/K)}(r,\vtheta) &= \left(\frac{K}{2\pi}\right)^{K/2}e^{-\frac{K}{2}r^2}r^{K-1}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1}\\ &= \frac{K^{\frac{K}{2}}}{2^{K / 2-1} \Gamma(K / 2)} e^{-\frac{Kr^{2}}{2}}\frac{\Gamma(K/2)}{2\pi^{K/2}}r^{K-1}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1}\\ &= \underbrace{\frac{K^{\frac{K}{2}}}{2^{K / 2-1} \Gamma(K / 2)} r^{K-1} e^{-\frac{Kr^{2}}{2}}}_{\text{scaled Chi-distribution}\;\overset{K\rightarrow \infty}{\rightarrow}\; \delta(r-1)}f_{\mathcal{U}(\mathbb{S}(0,R,K))}(r ,\vtheta),&&\text{(spherical coordinates)}. $$
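The concentration of $\mathcal{N}(0,\mI/K)$ onto the unit sphere can also be checked empirically; a small simulation (illustrative, not part of the paper's experiments) shows the spread of the radius shrinking as $K$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
stds = []
for K in (10, 100, 10000):
    Z = rng.normal(scale=1 / np.sqrt(K), size=(5000, K))  # Z ~ N(0, I/K)
    norms = np.linalg.norm(Z, axis=1)
    stds.append(norms.std())
# the radius concentrates at 1; its std shrinks roughly as 1/sqrt(2K)
```

For $K=10^4$ the sampled radii are within about one percent of $1$, matching the Dirac limit $\delta(r-1)$ above.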
Theorem. At optimality, JEPA embeddings estimate the data density as per $$ p_{\mu}(\vmu) \propto \mathbb{E}_{p_T}\left[\frac{1}{\prod_{k=1}^{\rank(J_{f}(\vmu,T))}\sigma_k(J_{f}(\vmu,T))}\right]^{-1}, $$ hereby allowing $p_{\mu}$ to be recovered from an optimally trained JEPA model $f_{\vtheta}$.
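The expectation over $T$ in the theorem can be approximated by Monte-Carlo over sampled transformations. A hedged sketch under the assumption that the Jacobian is evaluated at the transformed input (`encoder`, `augment`, and all names here are illustrative), using log-sum-exp for numerical stability:

```python
import math
import torch

def jepa_score_mc(encoder, augment, x, n_draws=8, eps=1e-12):
    """Estimates log p(x) up to an additive constant via
    p(x) ∝ E_T[1 / prod_k sigma_k(J_f(T(x)))]^{-1}."""
    neg_log_prods = []
    for _ in range(n_draws):
        J = torch.autograd.functional.jacobian(encoder, augment(x))
        s = torch.linalg.svdvals(J)
        s = s[s > eps]                        # sum over the rank(J) singular values
        neg_log_prods.append(-s.log().sum())  # -log prod_k sigma_k for this draw of T
    m = torch.stack(neg_log_prods)
    # -log E_T[1/prod sigma_k], computed stably with logsumexp
    return -(torch.logsumexp(m, dim=0) - math.log(n_draws))

# toy usage: Gaussian-noise "augmentation" and a small random encoder
torch.manual_seed(0)
enc = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 8))
logp = jepa_score_mc(enc, lambda v: v + 0.05 * torch.randn_like(v), torch.randn(16))
```

With a deterministic `augment` (identity), this reduces to the single-Jacobian score described in the introduction.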
Lemma. As $K$ grows, $X$ quickly concentrates around the hypersphere of radius $1$, converging to a Uniform density over the hypersphere surface. (Proof in proof:uniform.)
Lemma. In order for $f(X)$ to be distributed as $\mathcal{N}(0,\mI/K)$ for large $K$, $f$ must learn the data density $p_X$ up to mean-preserving rescaling within each level-set $\{\vu \in \mathbb{R}^{D} \,|\, f(\vu)=f(\vx)\}$. (Proof in proof:general_density.)
Definition. [Optimality Conditions] An optimal JEPA should (i) be maximally informative about the input, i.e., $h(\vx^{(t)})$ is maximal, and (ii) be maximally predictive, i.e., no other model within the given class can achieve a better rate of reduction in the conditional entropy.
Proof. {\bf Proof 1:}~The proof consists in expressing both densities in spherical coordinates and studying their convergence as $K$ increases. Let us first express the Uniform distribution in spherical coordinates: $$ f_{\mathcal{U}(\mathbb{S}(0,R,K))}(\vx) &= \delta(\|\vx\|_2-R)\frac{\Gamma(K/2)}{2\pi^{K/2}R^{K-1}},&&\text{(Cartesian coordinates)}\\ f_{\mathcal{U}(\mathbb{S}(0,R,K))}(r,\vtheta) &= \delta(r-R)\frac{\Gamma(K/2)}{2\pi^{K/2}}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1},&&\text{(spherical coordinates)}, $$ and let us now express the rescaled standard Gaussian density $Z_{K}$ in spherical coordinates: $$ f_{\mathcal{N}(0,\mI/K)}(\vx) &=\left(\frac{K}{2\pi}\right)^{K/2}e^{-\frac{K}{2}\|\vx\|_2^2},&&\text{(Cartesian coordinates)}\\ f_{\mathcal{N}(0,\mI/K)}(r,\vtheta) &= \left(\frac{K}{2\pi}\right)^{K/2}e^{-\frac{K}{2}r^2}r^{K-1}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1}\\ &= \frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)} e^{-\frac{Kr^{2}}{2}}\frac{\Gamma(K/2)}{2\pi^{K/2}}r^{K-1}\prod_{i=1}^{K-1}\sin(\vtheta_i)^{K-i-1}\\ &= \underbrace{\frac{K^{\frac{K}{2}}}{2^{K/2-1}\Gamma(K/2)} r^{K-1} e^{-\frac{Kr^{2}}{2}}}_{\text{scaled Chi-distribution}\;\overset{K\rightarrow \infty}{\rightarrow}\; \delta(r-1)}f_{\mathcal{U}(\mathbb{S}(0,R,K))}(r,\vtheta),&&\text{(spherical coordinates)}. $$ As $K$ increases, the scaled Chi-distribution converges to a Dirac function at $1$, leading to our desired result. {\bf Proof 2:}~The above proof provides granular details on the convergence to the Uniform distribution on the hypersphere by studying the scaled Chi-distribution. For completeness, we also provide a more straightforward argument, sufficient to study the limiting case. First, it is known that, $Z_{K}$ being isotropic Gaussian, the distribution of squared norms $\|Z\|_2^2/K$ follows a scaled Chi-squared distribution with mean $1$ and variance $2/K$. That is, as $K$ increases, the distribution of norms converges to a Dirac at $1$. Lastly, because $Z_{K}$ is isotropic, it is uniformly distributed on the hypersphere after normalization. But as $K$ increases, the samples are already approximately normalized, hence our result.
Proof. First and foremost, recall that the density of the random variable $f(X)$ is given by eq:general. Relying on thm:uniform, which states that for large $K$ our assumption on the output density reads $f(X) \sim \mathcal{U}(\mathbb{S}(0,1,K))$, we obtain that $ \int_{\{\vu \in \mathbb{R}^{D} | f(\vu)=f(\vx)\}} \frac{p_{X}(\vu)}{\prod_{k=1}^{\rank(J_{f}(\vu))}\sigma_k(J_{f}(\vu))} \mathrm{d}\mathcal{H}^{r}(\vu)={\rm cst} $. Now, if $f$ is bijective between ${\rm supp}(p_X)$ and $\mathbb{R}^K$, it directly follows that $ p_{X}(\vx) \propto \prod_{k=1}^{\rank(J_{f}(\vx))}\sigma_k(J_{f}(\vx)) $. If instead $f$ is merely surjective, there is no longer a one-to-one mapping between $f$ and $p_X$; instead, there is ambiguity over each level set of $f$. To see that, recall that we only need the integral over each level set to remain constant. Hence, $f$ is free to scale up one subset of that level set and scale down another, proportionally to $p_X$, so as to preserve the constant integral.
Proof. The role of the predictor in JEPA training is to allow for additional computation to predict one view's embedding from the other view's embedding. While this provides numerous empirical benefits, e.g., in terms of optimization landscape, it does not impact the level-sets of the encoder, which is what matters in eq:general. To understand the above argument, consider the case where the views are obtained by applying a transformation such as masking. We denote by $M$ the masking random variable and by ${\rm mask}(\vx)$ the application of one realization of $M$ to the input $\vx$. We thus have, for the invariance term of sample $\vx_n$, $$ \mathbb{E}_{({\rm mask}^{(1)},{\rm mask}^{(2)})\sim (M,M)}\left[{\rm dist}\left({\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}(\vx_n)\right)\right),{\rm Enc}\left({\rm mask}^{(2)}(\vx_n)\right)\right)\right]. $$ Because the predictor is only applied to one of the two embeddings, it is clear that, for the JEPA loss to be minimized, it must also be true that $$ \mathbb{E}_{{\rm mask}^{(2)}\sim M}\left[{\rm dist}\left({\rm Pred}\left({\rm Enc}\left({\rm mask}^{(1)}(\vx_n)\right)\right),{\rm Enc}\left({\rm mask}^{(2)}(\vx_n)\right)\right)\right]=0, $$ for any realization of ${\rm mask}^{(1)}$. In other words, the encoder's invariance holds over the support of $M$ no matter whether the predictor is the identity or nonlinear. Therefore, our result directly follows from the above combined with eq:general and eq:data.



References
[bib1] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.
[bib2] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.
[bib3] Randall Balestriero and Yann LeCun. Learning by reconstruction produces uninformative features for perception. arXiv preprint arXiv:2402.11337, 2024.
[bib4] Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
[bib5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
[bib6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
[bib7] Milan Cvitkovic and Günther Koliander. Minimal achievable sufficient statistic learning. In International Conference on Machine Learning, pages 1465–1474. PMLR, 2019.
[bib8] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning. In International conference on machine learning, pages 3015–3024. PMLR, 2021.
[bib9] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
[bib10] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
[bib11] Steven Krantz and Harold Parks. Geometric integration theory. Springer, 2008.
[bib12] Yann LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review, 62(1):1–62, 2022.
[bib13] Etai Littwin, Omid Saremi, Madhu Advani, Vimal Thilak, Preetum Nakkiran, Chen Huang, and Joshua Susskind. How jepa avoids noisy features: The implicit bias of deep linear self distillation networks. Advances in Neural Information Processing Systems, 37:91300–91336, 2024.
[bib14] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[bib15] Manu Srinath Halvagal, Axel Laborieux, and Friedemann Zenke. Implicit variance regularization in non-contrastive ssl. Advances in Neural Information Processing Systems, 36:63409–63436, 2023.
[bib16] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008.
[bib17] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International conference on machine learning, pages 9929–9939. PMLR, 2020.