A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures
Basile Terver, Randall Balestriero, Megi Dervishi, David Fan, Quentin Garrido, Tushar Nagarajan, Koustuv Sinha, Wancong Zhang, Mike Rabbat, Yann LeCun, Amir Bar
Abstract
We present EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures (JEPAs). JEPAs learn to predict in representation space rather than pixel space, avoiding the pitfalls of generative modeling while capturing semantically meaningful features suitable for downstream tasks. Our library provides modular, self-contained implementations that illustrate how representation learning techniques developed for image-level self-supervised learning can transfer to video, where temporal dynamics add complexity, and ultimately to action-conditioned world models, where the model must additionally learn to predict the effects of control inputs. Each example is designed for single-GPU training within a few hours, making energy-based self-supervised learning accessible for research and education. We provide ablations of JEPA components on CIFAR-10. Probing these representations yields 91% accuracy, indicating that the model learns useful features. Extending to video, we include a multi-step prediction example on Moving MNIST that demonstrates how the same principles scale to temporal modeling. Finally, we show how these representations can drive action-conditioned world models, achieving a 97% planning success rate on the Two Rooms navigation task. Comprehensive ablations reveal the critical importance of each regularization component for preventing representation collapse. Code is available at \url{https://github.com/facebookresearch/eb_jepa}.
Introduction
The idea that intelligent systems should learn internal models of their environment has deep roots in cognitive science, from early theories of mental models (Craik, 1967) to predictive coding accounts of perception (Rao & Ballard, 1999) and learned world models for planning (Sutton, 1991; Schmidhuber, 1990). Recent advances in video generation (Brooks et al., 2024; Blattmann et al., 2023) and interactive world simulators (Bruce et al., 2024; Parker-Holder et al., 2024) have shown impressive results, but these approaches face a fundamental challenge: they must model all pixels, including task-irrelevant details, thereby requiring substantial computational resources (Balestriero & LeCun, 2024). Joint-Embedding Predictive Architectures (JEPAs) (Assran et al., 2023; Bardes et al., 2024) offer an alternative paradigm. Rather than reconstructing observations in pixel space, JEPAs learn to predict in a learned representation space, focusing computational effort on semantically meaningful features.
JEPA builds on a rich history of self-supervised representation learning (Chen et al., 2020; He et al., 2020; Grill et al., 2020; Zbontar et al., 2021; Chen & He, 2021) while avoiding the need for negative samples. JEPAs have demonstrated strong performance for visual representation learning (Assran et al., 2023) and have been extended to video understanding (Bardes et al., 2024) and world modeling for planning (Assran et al., 2025; Sobal et al., 2025; Zhou et al., 2024a; Terver et al., 2026). Despite this growing body of work, accessible implementations that bridge theoretical principles and practical application remain scarce. Production-scale implementations are designed for large-scale training and are challenging to navigate. World model implementations like DINO-WM (Zhou et al., 2024a) and JEPA-WMs (Terver et al., 2026) enable planning on simple environments but rely on particular
Figure 1: EB-JEPA is a modular code base and tutorial, providing self-contained implementations of Joint-Embedding Predictive Architectures for (a) self-supervised image representation learning, (b) video prediction in latent space, and (c) action-conditioned world models that enable (d) goal-directed planning.
setups, e.g., frozen pre-trained encoders. As a result, while JEPAs have shown promise, they retain a high barrier to entry, which we hope to lower in this study. This paper introduces EB-JEPA, an open-source library that addresses this gap through modular, well-documented implementations of JEPA-based models trainable at small scale, with simple, concise code designed for educational purposes and rapid experimentation. Our contributions are:
- Accessible implementations : Three progressively complex examples (image representation learning, video prediction, and action-conditioned planning), each trainable on a single GPU in a few hours.
- Modular architecture : Reusable components (encoders, predictors, regularizers, planners) that can be easily recombined for new applications.
- Comprehensive evaluation : Systematic experiments and ablations demonstrating the importance of each component, with practical guidance on hyperparameter selection.
- Educational resource : Clear documentation and code structure designed to help researchers understand JEPA principles.
Related Work
Joint-Embedding methods. EB-JEPA builds on the JEPA framework and non-contrastive self-supervised learning (Bardes et al., 2022; Zbontar et al., 2021; Grill et al., 2020; Chen & He, 2021; Oquab et al., 2024). Recent theoretical work has provided a deeper understanding of these methods: Shwartz-Ziv et al. (2023) analyze VICReg from an information-theoretic perspective, while Balestriero & LeCun (2022) show connections between contrastive and non-contrastive methods and spectral embedding. While I-JEPA and V-JEPA focus on masked prediction within single images or videos, our action-conditioned example extends this to interactive settings where actions determine future states. Recent work has shown that JEPA-style pretraining leads to emergent understanding of intuitive physics (Garrido et al., 2025), motivating the use of such architectures for world modeling. Importantly, JEPAs differ fundamentally from reconstruction-based methods such as MAE (He et al., 2021) and VideoMAE (Tong et al., 2022; Wang et al., 2023), which predict in pixel space rather than representation space. Balestriero & LeCun (2024) provide theoretical analysis showing that reconstruction-based learning can produce uninformative features for perception, further motivating the joint-embedding paradigm that our library focuses on.
World models for planning. Latent world models have been extensively studied for model-based RL (Hafner et al., 2019; 2024; Hansen et al., 2024). Our work is most closely related to PLDM (Sobal et al., 2025), IWM (Garrido et al., 2024), DINO-WM (Zhou et al., 2024a), Navigation World
Models (Bar et al., 2025), and JEPA-WMs (Terver et al., 2026), which use joint-embedding objectives for planning. Unlike these works, we focus on providing accessible, educational implementations rather than state-of-the-art performance on complex benchmarks.
Preliminaries: A Unified JEPA Framework
Our goal is to train models that map inputs to latent semantic representations useful for perception, planning, and action. We view this through the lens of Energy-Based Models (EBMs) (LeCun et al., 2006; Hopfield, 1982). An EBM defines a scalar energy function E(x, y) measuring compatibility between inputs x and outputs y, where low energy indicates high compatibility. Learning consists of shaping the energy landscape so that correct input-output pairs have lower energy than incorrect ones.
The key challenge in training EBMs is preventing collapse: a degenerate solution where the energy is uniformly low for all inputs. Classical EBMs address this through contrastive methods that explicitly push up the energy of 'negative' samples (Hinton, 2002; Chen et al., 2020; He et al., 2020). JEPAs instead rely on regularization-based approaches (Bardes et al., 2022; Zbontar et al., 2021; Balestriero & LeCun, 2025) that maintain representation diversity without requiring negative samples. This insight has proven powerful across self-supervised learning (Grathwohl et al., 2020; Grill et al., 2020). In the JEPA framework, we instantiate the energy principle by defining energy as prediction error in representation space. With the regularizer R and a given prediction loss L_pred, the general JEPA training objective takes the form
$$
\mathcal{L}(\theta, \phi, \omega) = \mathcal{L}_{\text{pred}}\big(g_\phi(z, u),\, z'\big) + \lambda\, \mathcal{R}(z), \tag{1}
$$
where z = f θ ( x ) is the representation of input x , u = q ω ( a ) is optional conditioning information (e.g., robotic controls), z ′ is the target representation, and λ balances prediction and regularization. This unified framework encompasses three instantiations of increasing complexity, detailed below.
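As a concrete sketch of this objective (illustrative NumPy code, not the library's actual API; `jepa_loss`, `predictor`, and `regularizer` are names of our choosing):

```python
import numpy as np

def jepa_loss(z, z_target, u, predictor, regularizer, lam=1.0):
    """Generic JEPA objective: prediction error in representation space
    plus a collapse-preventing regularizer R, weighted by lambda."""
    z_pred = predictor(z, u)                        # g_phi(z, u)
    pred_loss = np.mean((z_pred - z_target) ** 2)   # L_pred
    return pred_loss + lam * regularizer(z)

# Toy instantiation: identity predictor (the image-JEPA case) and a
# hinge-on-std regularizer keeping each dimension's spread near 1.
identity = lambda z, u: z
var_hinge = lambda z: float(np.maximum(0.0, 1.0 - z.std(axis=0)).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 8))       # batch of 32 embeddings, dimension 8
loss = jepa_loss(z, z, None, identity, var_hinge)
```

Swapping `predictor` and `regularizer` is exactly how the three instantiations below differ while sharing the same objective.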
(i) Image-JEPA: view invariance. Given an image x , we create two views x and x ′ (random crops, color jittering, etc.). The encoder produces representations z = f θ ( x ) and z ′ = f θ ( x ′ ) . The objective learns representations invariant to different views, with the energy function
$$
E(x, x') = \big\| f_\theta(x) - f_\theta(x') \big\|_2^2. \tag{2}
$$
Here, the energy directly measures how similar the representations of two views of the same image are. Low energy means the model has learned view-invariant features.
(ii) Video-JEPA: temporal prediction. We denote a video sequence as x_{1:T} := (x_1, ..., x_T). The encoder produces per-frame representations z_t = f_θ(x_{t-w:t}), where w is the encoder's temporal receptive field. A predictor takes a context of v+1 frame representations, where v is the predictor's temporal receptive field (see hyperparameter values in Tab. 6), and predicts the next representation, yielding the energy
$$
E(x_{1:T}) = \sum_{t} \big\| g_\phi(z_{t-v:t}) - z_{t+1} \big\|_2^2. \tag{3}
$$
The model learns to capture temporal dynamics without access to future frames during prediction.
(iii) Action-conditioned video-JEPA (AC-video-JEPA): world modeling. Given observation-action sequences (x_t, a_t)_{t=1}^{T}, an action encoder q_ω maps actions to control representations u_t = q_ω(a_{t-w:t}), and the predictor is conditioned on these representations, yielding the energy
$$
E(x_{1:T}, a_{1:T}) = \sum_{t} \big\| g_\phi(z_{t-v:t},\, u_t) - z_{t+1} \big\|_2^2. \tag{4}
$$
This learns a latent dynamics model suitable for planning: given a current state and control representation, predict the next state representation.
A unified energy formulation. The three settings above share a common structure. Given an encoder f θ , a predictor g ϕ , and optional conditioning a with conditioning encoder q ω , we can write a general energy function
$$
E(x, a, x') = \big\| g_\phi\big(f_\theta(x),\, q_\omega(a)\big) - f_\theta(x') \big\|_2^2. \tag{5}
$$

Figure 2: Hyperparameter sensitivity comparison between SIGReg and VICReg on CIFAR-10. SIGReg demonstrates greater stability across different hyperparameter configurations, while VICReg achieves similar peak performance but requires more careful tuning.
Image-JEPA corresponds to g_φ = Id (identity) and no conditioning; video-JEPA uses a temporal predictor without conditioning; AC-video-JEPA includes the full formulation with action conditioning. This unified view highlights how the same energy-based principle, minimizing prediction error in representation space, underlies all three settings, with complexity increasing as we move from static images to video to action-conditioned dynamics.
Regularization: Preventing Collapse. The key challenge in training JEPAs is preventing representation collapse, where the encoder learns trivial constant representations. EB-JEPA implements two regularization families. VICReg (Bardes et al., 2022) prevents collapse through two complementary terms. The variance term ensures each feature dimension has sufficient spread across the batch and reads
$$
\mathcal{L}_{\text{var}}(Z) = \frac{1}{D} \sum_{d=1}^{D} \max\!\Big(0,\, \gamma - \sqrt{\operatorname{Var}(Z_{:,d}) + \epsilon}\Big), \tag{6}
$$
where Z ∈ R N × D is the batch of embeddings, D is the feature dimension, and γ is the target standard deviation (typically 1). The covariance term decorrelates feature dimensions to encourage the model to use all available capacity and reads
$$
\mathcal{L}_{\text{cov}}(Z) = \frac{1}{D} \sum_{d \neq d'} \big[C(Z)\big]_{d,d'}^{2}, \qquad C(Z) = \frac{1}{N-1} \sum_{i=1}^{N} (z_i - \bar{z})(z_i - \bar{z})^{\top}. \tag{7}
$$
The full VICReg regularizer is R VICReg = α L var + β L cov .
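As an illustration of the two terms above (a NumPy sketch, independent of the library's actual implementation; the coefficient defaults are ours, not EB-JEPA's):

```python
import numpy as np

def vicreg_regularizer(Z, gamma=1.0, alpha=25.0, beta=1.0, eps=1e-4):
    """Sketch of the VICReg collapse regularizer on embeddings Z (N x D):
    a hinge pushing each dimension's std up toward gamma, plus the mean of
    squared off-diagonal covariance entries to decorrelate dimensions."""
    N, D = Z.shape
    std = np.sqrt(Z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))   # hinge on per-dim spread
    Zc = Z - Z.mean(axis=0)
    cov = (Zc.T @ Zc) / (N - 1)                        # D x D covariance matrix
    off_diag = cov[~np.eye(D, dtype=bool)]
    cov_loss = np.sum(off_diag ** 2) / D               # penalize correlated dims
    return alpha * var_loss + beta * cov_loss

rng = np.random.default_rng(0)
collapsed = np.zeros((256, 16))          # constant embeddings: heavily penalized
healthy = rng.normal(size=(256, 16))     # well-spread embeddings: small penalty
assert vicreg_regularizer(collapsed) > vicreg_regularizer(healthy)
```

The hinge means the variance term vanishes once every dimension's spread reaches γ, so the regularizer stops fighting the prediction loss.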
For image-JEPA and video-JEPA, the regularization losses are computed in a projected space rather than directly on the encoder outputs. A learned projector h ψ maps representations to embeddings r = h ψ ( z ) on which the regularizer is computed. LeJEPA (Balestriero & LeCun, 2025) introduces SIGReg, a theoretically grounded alternative regularizer. It identifies the isotropic Gaussian N (0 , I ) as the optimal embedding distribution for minimizing downstream prediction risk. The SIGReg objective enforces this by testing Gaussianity along random 1D projections ξ k ∼ N (0 , I ) and reads
$$
\mathcal{R}_{\text{SIGReg}}(Z) = \frac{1}{K} \sum_{k=1}^{K} G\big(Z\,\xi_k\big), \tag{8}
$$
where G is the Epps-Pulley Gaussianity test statistic. This approach offers a single hyperparameter λ , linear time/memory complexity, and stability across architectures.
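The slicing idea can be sketched as follows. Note that this is only an illustration of the random-projection structure: we substitute a simple moment-matching penalty for the Epps-Pulley statistic G used by the actual method, and the function name is ours.

```python
import numpy as np

def sigreg_sketch(Z, num_slices=16, rng=None):
    """SIGReg-style regularizer sketch: project embeddings onto random unit
    directions and penalize deviation from N(0, 1) along each 1D slice.
    NOTE: real SIGReg scores each slice with the Epps-Pulley Gaussianity
    test; here we use a first-two-moments penalty purely for illustration."""
    rng = rng or np.random.default_rng(0)
    N, D = Z.shape
    xi = rng.normal(size=(D, num_slices))
    xi /= np.linalg.norm(xi, axis=0, keepdims=True)    # unit directions
    P = Z @ xi                                         # N x num_slices projections
    # N(0, 1) has mean 0 and variance 1 along every direction.
    return float(np.mean(P.mean(axis=0) ** 2 + (P.var(axis=0) - 1.0) ** 2))

rng = np.random.default_rng(1)
gaussian = rng.normal(size=(1024, 16))   # near-isotropic-Gaussian embeddings
collapsed = np.zeros((1024, 16))         # collapsed embeddings
assert sigreg_sketch(collapsed) > sigreg_sketch(gaussian)
```

Because each slice is one-dimensional, the cost is linear in batch size and embedding dimension, which is the complexity advantage noted above.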
Training and Planning with World Models
Multistep Rollout Training. In practice, for both video JEPA and Action-Conditioned JEPA, we augment the single-step prediction loss with multistep rollout losses, following Terver et al. (2026);

Figure 3: Video-JEPA training dynamics and multistep rollout ablation. (a) Training dynamics over 50 epochs: variance-covariance regularization loss R (left), prediction loss L pred (center), and mean Average Precision (right). (b) Training with k -step recursive predictions achieves significantly better Average Precision compared to single-step predictions, demonstrating improved temporal understanding, with a Pareto optimum around k = 4 rollout steps.

Figure 4: Video JEPA visualization on Moving MNIST. From left to right: input frames, 1-step prediction visualization, and full autoregressive rollout. The model maintains coherent predictions of digit motion over extended horizons, correctly capturing trajectory and dynamics.
Assran et al. (2025). At each training iteration, in addition to the single-step loss of Eqs. (3)-(4), we compute additional k-step rollout losses L_k for k ≥ 1. Let us define the order of a prediction as the number of calls to the predictor required to obtain it from a ground-truth representation. For a predicted representation z_t^{(k)}, we denote the timestep it corresponds to as t and its prediction order as k, with z_t^{(0)} = z_t = f_θ(x_{t-w:t}). For k ≥ 1, L_k is defined as
$$
\mathcal{L}_k = \sum_{t} \mathcal{L}_{\text{pred}}\big(z^{(k)}_t,\, z^{(0)}_t\big), \tag{9}
$$

where z_t^{(k)} is obtained by recursively unrolling the predictor for all t ≤ T (with u_t omitted in the action-free case), as

$$
z^{(k)}_{t+1} = g_\phi\big(z^{(k-1)}_{t-v:t},\, u_t\big), \qquad z^{(0)}_t = z_t = f_\theta(x_{t-w:t}). \tag{10}
$$
Note that L_1 recovers the single-step loss. Thus the total prediction losses extending Eqs. (3)-(4) read

$$
\mathcal{L}_{\text{pred}} = \sum_{k=1}^{K} \mathcal{L}_k. \tag{11}
$$
Note that we could perform truncated backpropagation through time (TBPTT) (Jaeger, 2002), detaching the gradient after each call to the predictor. Training with k -step rollouts aligns the training procedure with autoregressive inference, reducing exposure bias and improving long-horizon prediction quality (see Figure 3).
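The rollout bookkeeping can be sketched as follows (illustrative NumPy code, not the library's API; for clarity the predictor here is single-step with a one-frame context, whereas the library's predictor consumes a longer context and, in the action-conditioned case, actions):

```python
import numpy as np

def multistep_rollout_losses(z, predict, K):
    """Compute k-step rollout losses L_k for k = 1..K from a sequence of
    ground-truth latents z (T x D). Order-k predictions are obtained by
    feeding order-(k-1) predictions back through the predictor."""
    losses = {}
    preds = z.copy()                   # order-0 "predictions" are ground truth
    for k in range(1, K + 1):
        # Advance every rollout one step; preds[i] now predicts z[i + k].
        preds = np.array([predict(p) for p in preds[:-1]])
        losses[k] = float(np.mean((preds - z[k:]) ** 2))
    return losses

# Toy linear latent dynamics z[t+1] = A @ z[t]: the exact predictor drives
# every rollout loss to (numerically) zero regardless of horizon.
A = np.diag([0.9, 0.8])
z = np.stack([np.linalg.matrix_power(A, t) @ np.ones(2) for t in range(8)])
losses = multistep_rollout_losses(z, lambda v: A @ v, K=4)
```

With an imperfect predictor the losses grow with k, which is exactly the compounding error that training on L_k for k > 1 suppresses.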
Additional Regularizers for World Models. Training action-conditioned JEPAs in randomized environments requires additional regularization beyond VICReg or SIGReg terms. The temporal similarity loss L sim encourages smooth representation trajectories along action sequences, and the inverse dynamics model (IDM) loss (Pathak et al., 2017) L IDM predicts actions from consecutive representations. These losses read
$$
\mathcal{L}_{\text{sim}} = \sum_{t} \big\| z_{t+1} - z_t \big\|_2^2, \qquad \mathcal{L}_{\text{IDM}} = \sum_{t} \big\| h_\chi(z_t, z_{t+1}) - a_t \big\|_2^2, \tag{12}
$$

where h_χ denotes the learned inverse dynamics head.
Table 1: Image-JEPA Linear probing accuracy on CIFAR-10 with ResNet-18 backbone trained for 300 epochs, comparing regularizers (SIGReg and VICReg) and the impact of using a projector.
This term is critical for preventing collapse from spurious correlations in randomized environments (Sobal et al., 2022). The full training objective for action-conditioned video-JEPA combines prediction with all regularization terms and reads
$$
\mathcal{L}_{\text{AC}} = \sum_{k=1}^{K} \mathcal{L}_k + \lambda\, \mathcal{R}(Z) + \lambda_{\text{sim}}\, \mathcal{L}_{\text{sim}} + \lambda_{\text{IDM}}\, \mathcal{L}_{\text{IDM}}. \tag{13}
$$
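The two world-model regularizers can be sketched as follows (illustrative NumPy code; the function and `idm_head` names are ours, standing in for the learned inverse dynamics head):

```python
import numpy as np

def world_model_regularizers(z, actions, idm_head):
    """Sketch of the two extra world-model losses on a latent trajectory
    z (T x D) with actions ((T-1) x A): temporal similarity encourages
    smooth trajectories; inverse dynamics recovers the action that links
    consecutive latents."""
    sim = float(np.mean((z[1:] - z[:-1]) ** 2))
    pred_a = np.array([idm_head(z[t], z[t + 1]) for t in range(len(z) - 1)])
    idm = float(np.mean((pred_a - actions) ** 2))
    return sim, idm

# Toy check: with additive point-mass dynamics z[t+1] = z[t] + a[t], the
# difference of consecutive latents recovers the action.
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 2))
z = np.concatenate([np.zeros((1, 2)), np.cumsum(a, axis=0)])
sim, idm = world_model_regularizers(z, a, lambda zt, zt1: zt1 - zt)
```

In the library the IDM head is trained jointly with the encoder, so a zero IDM loss certifies that the latents retain enough information to identify the action taken.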
Goal-Conditioned Planning. We perform goal-conditioned planning by optimizing action sequences to reach a goal observation x g . This extends the energy function from Eq. (5) to trajectories: rather than measuring prediction error for a single step, we accumulate the energy over an imagined rollout towards the goal as
$$
E(a_{1:H}) = \sum_{t=1}^{H} \big\| z_g - \hat{z}_t \big\|_2^2, \tag{14}
$$

where ẑ_t is the predicted representation at step t of the imagined rollout under actions a_{1:H}, and z_g = f_θ(x_g).
Low energy corresponds to action sequences that successfully reach the goal; planning thus reduces to finding the minimum-energy trajectory. We use MPPI (Williams et al., 2015), a population-based optimizer that samples action trajectories, weights them by exponentiated negative energy (i.e., a Boltzmann distribution over trajectories), and iteratively refines the proposal distribution toward lower-energy solutions. Summing over intermediate states (rather than only the final state) encourages efficient paths and provides robustness to compounding prediction errors.
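The MPPI loop can be sketched as follows (illustrative NumPy code under toy point-mass dynamics; all names and hyperparameter defaults here are ours, not the library's planning configuration):

```python
import numpy as np

def mppi_plan(z0, z_goal, step, horizon=8, pop=256, iters=4,
              sigma=0.5, temp=1.0, action_dim=2, rng=None):
    """MPPI sketch: sample action sequences around a mean plan, roll each
    out with the latent dynamics `step(z, a)`, score by cumulative squared
    distance to the goal latent, and refit the mean plan with Boltzmann
    weights exp(-cost / temp)."""
    rng = rng or np.random.default_rng(0)
    mean = np.zeros((horizon, action_dim))
    for _ in range(iters):
        actions = mean + rng.normal(scale=sigma, size=(pop, horizon, action_dim))
        costs = np.zeros(pop)
        for i in range(pop):
            z = z0
            for t in range(horizon):
                z = step(z, actions[i, t])
                costs[i] += np.sum((z_goal - z) ** 2)   # cumulative, not final-state
        w = np.exp(-(costs - costs.min()) / temp)       # Boltzmann weights
        mean = np.tensordot(w / w.sum(), actions, axes=1)
    return mean

# Toy latent dynamics: the action is a displacement in representation space.
step = lambda z, a: z + a
plan = mppi_plan(np.zeros(2), np.full(2, 3.0), step)
z = np.zeros(2)
for a in plan:                                          # roll the plan forward
    z = step(z, a)
```

Swapping the cumulative sum for the final-step term only recovers the weaker final-state-only cost ablated in Table 4.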
Experiments
Experimental Setup. We evaluate the JEPA framework on three tasks of increasing complexity: image representation learning on CIFAR-10, video prediction on Moving MNIST (Srivastava et al., 2015), and goal-conditioned planning on the Two Rooms environment (Sobal et al., 2025). Our implementation uses modular building blocks: Encoders (ResNet-18 (He et al., 2016), Vision Transformers (Dosovitskiy et al., 2021), IMPALA (Espeholt et al., 2018)), Predictors (UNet-based spatial predictors, GRU-based temporal predictors), Regularizers (VICReg, SIGReg, temporal similarity, inverse dynamics losses), and Planners (MPPI (Williams et al., 2015) and CEM optimizers). We provide comprehensive hyperparameter tables in Appendix A: Tables 5 and 6 summarize the best training hyperparameters for each example, and Table 7 details the planning configuration.
Image Representation Learning. Tables 1, 2, and 3 compare VICReg and SIGReg on CIFAR-10, using a naive hyperparameter search. Both methods achieve approximately 90-91% linear probing accuracy, competitive with prior self-supervised methods on this benchmark. We find that using a learned projector provides around a 3-point improvement over directly regularizing encoder outputs. Projector architecture matters: a bottleneck design (large hidden → small output) works best for SIGReg, while VICReg prefers larger output dimensions. Having only one hyperparameter, SIGReg can be easier to tune in this naive setting.
Video Prediction. Multi-step autoregressive rollouts on Moving MNIST maintain prediction quality over extended horizons. Training with k -step prediction (rather than single-step) significantly improves Average Precision on downstream detection tasks by reducing exposure bias, i.e., the discrepancy between teacher-forced training and autoregressive inference. Figure 3 shows that models trained with longer prediction horizons achieve better downstream performance, as recursive prediction during training aligns with the autoregressive inference procedure.
Action-Conditioned Video-JEPA. We display three successful planning evaluation episodes in Figure 5, showing the ability of the model to plan given randomized initial and goal states. This
Table 2: Ablation of Image-JEPA loss hyperparameters when training on CIFAR-10 with a ResNet-18 backbone for 300 epochs.
navigation task is non-monotonic, meaning the optimal trajectory must first move away from the goal before ultimately reaching it. Table 4 shows planning results on the challenging random-wall setup. Our best model achieves a 97% success rate using MPPI with a cumulative cost over the planning horizon.
We perform an ablation of the regularization components of the action-conditioned video-JEPA models. Table 4 reveals the importance of each regularization component: IDM is critical (without it, the model collapses to 1% success due to spurious correlations (Sobal et al., 2022)); variance and covariance terms each contribute ∼ 50% absolute improvement; temporal similarity adds ∼ 35%.
We also ablate planning cost design. Using a cumulative cost over all timesteps (∑_t ∥z_g − ẑ_t∥) outperforms a final-state-only cost by 8% (Table 4). This formulation encourages efficient paths and provides gradient signal throughout the trajectory.
Future Directions
EB-JEPA is designed for fast iteration on algorithmic innovations at small scale: single-GPU training, simple datasets, and controlled simulated environments. This enables rapid prototyping and fundamental research on JEPA architectures before scaling to more complex settings. We identify three promising algorithmic directions that EB-JEPA's modular design enables researchers to explore.
Advancing Regularization Theory. Our experiments highlight the critical role of regularization in preventing representation collapse, yet the theoretical understanding of why certain regularization combinations work remains incomplete. EB-JEPA provides a testbed for systematically studying regularization dynamics: investigating the interplay between variance, covariance, temporal similarity, and inverse dynamics terms (Bardes et al., 2022; Balestriero & LeCun, 2025; Sobal et al., 2022); understanding when each becomes necessary; and developing principled methods for automatic hyperparameter selection. The controlled, single-GPU setting enables rapid iteration on these fundamental questions without the confounding factors introduced by large-scale distributed training.
Hierarchical World Models. Current JEPA models predict at a single temporal resolution, but intelligent planning often requires reasoning at multiple timescales (Schmidhuber, 2015; Hafner et al., 2022). Hierarchical world models could learn to predict both fine-grained dynamics for
Table 4: AC-video-JEPA planning ablations on Two Rooms with randomized wall positions. Each result averages over 3 seeds × 3 checkpoints × 20 episodes. Left: Planner configuration comparison. Right: Regularization term ablation; removing IDM causes collapse.
Configuration       Success     Time
MPPI (full cost)    97 ± 2%     37s

Figure 5: Visualization of three successful planning evaluation episodes of our AC-video-JEPA on the Two Rooms environment with a random wall. From left to right: initial frame (red), full episode produced by the planning optimization procedure, and goal frame used to define the planning cost (red). Each episode allows a maximum of 200 steps in the environment.
local control and coarse-grained abstractions for long-horizon planning. Prior work in hierarchical reinforcement learning (Nachum et al., 2018; Levy et al., 2019) has demonstrated the benefits of learning at multiple levels of abstraction. EB-JEPA's separation of encoder, predictor, and regularizer components provides a natural starting point for implementing such multi-scale architectures, and future releases may include basic hierarchical prediction examples.
Learned Cost and Value Functions. Our current planning approach uses simple distance-based costs in representation space, but this may be suboptimal for complex tasks. Learning task-specific cost functions or value functions from demonstrations or sparse rewards could enable more sophisticated goal-directed behavior. Combining JEPA world models with learned value functions (Hansen et al., 2022; 2024) offers a promising avenue for making better use of the predictive models trained with this codebase, potentially bridging the gap between pure world modeling and reward-driven reinforcement learning. EB-JEPA's simple planning interface makes it straightforward to experiment with alternative cost formulations.
Complementary to Large-Scale Codebases. EB-JEPA is intended for algorithmic exploration and fundamental research. Once promising approaches are validated at small scale, researchers can transition to codebases supporting distributed training, pre-trained visual backbones, and more complex environments, such as JEPA-WMs (Terver et al., 2026) for planning with frozen encoders on diverse benchmarks. This two-stage workflow enables efficient research: rapid prototyping with EB-JEPA followed by rigorous evaluation at scale.
Advancing Regularization Theory.
Hierarchical World Models.
Multistep Rollout Training. In practice, for both video JEPA and Action-Conditioned JEPA, we augment the single-step prediction loss with multistep rollout losses, following Terver et al. (2026);

Figure 3: Video-JEPA training dynamics and multistep rollout ablation. (a) Training dynamics over 50 epochs: variance-covariance regularization loss R (left), prediction loss L pred (center), and mean Average Precision (right). (b) Training with k -step recursive predictions achieves significantly better Average Precision compared to single-step predictions, demonstrating improved temporal understanding, with a Pareto optimum around k = 4 rollout steps.

Figure 4: Video JEPA visualization on Moving MNIST. From left to right: input frames, 1-step prediction visualization, and full autoregressive rollout. The model maintains coherent predictions of digit motion over extended horizons, correctly capturing trajectory and dynamics.
Assran et al. (2025). At each training iteration, in addition to the single-step loss of Eqs. (3)-(4), we compute additional k -step rollout losses L k for k ≥ 1 . Let us define the order of a prediction as the number of calls to the predictor function required to obtain it from a groundtruth representation. For a predicted representation z ( k ) t , we denote the timestep it corresponds to as t and its prediction order as k , with z (0) = z = f θ ( x ) . For k ≥ 1 , L k is defined as
$$
\mathcal{L}_k = \sum_{t=1}^{T-k} \left\| g_\phi\!\left(z^{(k-1)}_{t-v:t},\, u_{t-v:t}\right) - z_{t+1} \right\|_2^2, \label{eq:multistep_rollout_loss}
$$
where $z_t^{(k)}$ is obtained by recursively unrolling the predictor for all $t \leq T$, as
$$
z^{(k)}_{t+1} = g_\phi\!\left(z^{(k-1)}_{t-v:t},\, u_{t-v:t}\right), \qquad z^{(0)}_t = z_t = f_\theta(x_t).
$$
Note that $\mathcal{L}_1$ recovers the single-step loss. The total energy-function losses of Eqs. (3)-(4) thus read
$$
\mathcal{L}_{\text{video}} = \mathcal{L}_{\text{pred}} + \lambda\,\mathcal{R}(z_{1:T}), \quad \mathcal{L}_{\text{world}} = \mathcal{L}_{\text{pred}} + \lambda\,\mathcal{R}(z_{1:T}, u_{1:T}), \quad \mathcal{L}_{\text{pred}} = \sum_{k=1}^{K} \mathcal{L}_k. \label{eq:full_multistep_losses_video}
$$
Note that we could perform truncated backpropagation through time (TBPTT) (Jaeger, 2002), detaching the gradient after each call to the predictor. Training with k -step rollouts aligns the training procedure with autoregressive inference, reducing exposure bias and improving long-horizon prediction quality (see Figure 3).
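To make the recursion concrete, here is a minimal NumPy sketch of the $k$-step rollout losses $\mathcal{L}_1, \ldots, \mathcal{L}_K$, with toy linear maps standing in for the encoder $f_\theta$ and predictor $g_\phi$ (receptive field $v = 1$); all names and shapes are illustrative assumptions, not the library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
T, obs_dim, act_dim, latent_dim = 10, 8, 2, 4
W_enc = rng.normal(size=(obs_dim, latent_dim))
W_pred = rng.normal(size=(latent_dim, latent_dim)) / latent_dim
W_act = rng.normal(size=(act_dim, latent_dim))

def encoder(x):
    # toy stand-in for the encoder f_theta: linear projection to the latent space
    return x.reshape(len(x), -1) @ W_enc

def predictor(z_t, u_t):
    # toy stand-in for the predictor g_phi(z_t, u_t) -> z_{t+1}, receptive field v = 1
    return np.tanh(z_t @ W_pred + u_t @ W_act)

def rollout_losses(z, u, K):
    """L_1..L_K: an order-k prediction is produced by re-applying the predictor
    to order-(k-1) predictions; each L_k compares it against ground-truth z."""
    T = len(z)
    losses = {}
    prev = z          # order-0 "predictions" are the ground-truth representations
    start = 0         # timestep index of prev[0]
    for k in range(1, K + 1):
        pred = predictor(prev[:-1], u[start:T - 1])  # order-k, timesteps start+1..T-1
        start += 1
        losses[k] = np.mean((pred - z[start:]) ** 2)  # L_k against ground truth
        prev = pred
    return losses

x = rng.normal(size=(T, obs_dim))   # observation trajectory x_1..x_T
u = rng.normal(size=(T, act_dim))   # action trajectory
z = encoder(x)                      # ground-truth representations z_t = f_theta(x_t)
L = rollout_losses(z, u, K=4)       # K = 4, as in the Moving MNIST example
```

An order-$k$ prediction at timestep $t$ reuses the order-$(k-1)$ prediction at $t-1$, so each additional order drops one timestep from the front of the trajectory, matching the $T-k$ terms in $\mathcal{L}_k$.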
Additional Regularizers for World Models. Training action-conditioned JEPAs in randomized environments requires additional regularization beyond the VICReg or SIGReg terms. The temporal similarity loss $\mathcal{L}_{\text{sim}}$ encourages smooth representation trajectories along action sequences, and the inverse dynamics model (IDM) loss (Pathak et al., 2017) $\mathcal{L}_{\text{IDM}}$ predicts actions from consecutive representations. These losses read
$$
\mathcal{L}_{\text{sim}} = \sum_t \left\| z_t - z_{t+1} \right\|_2^2, \qquad \mathcal{L}_{\text{IDM}} = \sum_t \left\| a_t - \text{MLP}(z_t, z_{t+1}) \right\|_2^2.
$$
Table 1: Linear probing accuracy of image-JEPA on CIFAR-10 with a ResNet-18 backbone trained for 300 epochs, comparing regularizers (SIGReg and VICReg) and the impact of using a projector.
The IDM term is critical for preventing collapse from spurious correlations in randomized environments (Sobal et al., 2022). The full training objective for action-conditioned video-JEPA combines prediction with all regularization terms and reads
$$
\mathcal{L} = \mathcal{L}_{\text{pred}} + \alpha\,\mathcal{L}_{\text{var}} + \beta\,\mathcal{L}_{\text{cov}} + \delta\,\mathcal{L}_{\text{sim}} + \omega\,\mathcal{L}_{\text{IDM}}. \label{eq:ac_video_full_loss}
$$
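As an illustration, the four regularization terms can be sketched in NumPy as follows (the prediction term is omitted). The single linear map standing in for the IDM's MLP and the toy shapes are assumptions; the coefficients $\alpha = 16$, $\beta = 8$, $\delta = 12$, $\omega = 1$ follow the AC-Video-JEPA settings in the hyperparameter tables of Appendix A.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, A = 16, 8, 2                    # toy trajectory length, latent dim, action dim
z = rng.normal(size=(T, D))           # representation trajectory z_1..z_T
a = rng.normal(size=(T - 1, A))       # actions a_1..a_{T-1}
W_idm = rng.normal(size=(2 * D, A))   # single linear map standing in for the IDM MLP

def variance_loss(z, gamma=1.0, eps=1e-4):
    # hinge on each latent dimension's std: keeps every dimension informative
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, gamma - std))

def covariance_loss(z):
    # penalize squared off-diagonal covariance: decorrelates latent dimensions
    zc = z - z.mean(axis=0)
    C = (zc.T @ zc) / (len(z) - 1)
    off = C - np.diag(np.diag(C))
    return (off ** 2).sum() / (D * (D - 1))

def similarity_loss(z):
    # L_sim: consecutive representations should stay close along a trajectory
    return np.sum((z[:-1] - z[1:]) ** 2)

def idm_loss(z, a):
    # L_IDM: recover the action from each pair of consecutive representations
    pred_a = np.concatenate([z[:-1], z[1:]], axis=1) @ W_idm
    return np.sum((a - pred_a) ** 2)

alpha, beta, delta, omega = 16.0, 8.0, 12.0, 1.0   # AC-Video-JEPA coefficients
total_reg = (alpha * variance_loss(z) + beta * covariance_loss(z)
             + delta * similarity_loss(z) + omega * idm_loss(z, a))
```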
Goal-Conditioned Planning. We perform goal-conditioned planning by optimizing action sequences to reach a goal observation $x_g$. This extends the energy function from Eq. (5) to trajectories: rather than measuring prediction error for a single step, we accumulate the energy over an imagined rollout towards the goal as
$$
E_{\text{plan}}(a_{0:H}; x_0, x_g) = \sum_{t=0}^{H} \left\| f_\theta(x_g) - \hat{z}_t \right\|_2, \quad \text{where } \hat{z}_{t+1} = g_\phi(\hat{z}_{t-v:t}, u_{t-v:t}), \quad \hat{z}_0 = f_\theta(x_0).
$$
Low energy corresponds to action sequences that successfully reach the goal; planning thus reduces to finding the minimum-energy trajectory. We use MPPI (Williams et al., 2015), a population-based optimizer that samples action trajectories, weights them by exponentiated negative energy (i.e., a Boltzmann distribution over trajectories), and iteratively refines the proposal distribution toward lower-energy solutions. Summing over intermediate states (rather than only the final state) encourages efficient paths and provides robustness to compounding prediction errors.
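A minimal NumPy sketch of this accumulated energy, with a toy linear map standing in for the predictor $g_\phi$; the dynamics, names, and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
D, A, H = 4, 2, 5                     # toy latent dim, action dim, horizon
W_z = np.eye(D) * 0.9                 # toy latent dynamics, stands in for g_phi
W_u = rng.normal(size=(A, D)) * 0.1

def plan_energy(z0, z_goal, actions):
    """E_plan: accumulate the distance between the goal embedding and every
    imagined latent state along the action-conditioned rollout."""
    z, energy = z0, 0.0
    for u in actions:                 # one predictor call per planned step
        z = z @ W_z + u @ W_u         # stands in for z_{t+1} = g_phi(z_t, u_t)
        energy += np.linalg.norm(z_goal - z)
    return energy

z0, z_goal = rng.normal(size=D), rng.normal(size=D)
actions = rng.normal(size=(H, A))     # one candidate action sequence
E = plan_energy(z0, z_goal, actions)
```

Because the energy is a deterministic function of the action sequence, a planner only needs to evaluate this rollout cheaply and often, which is what MPPI exploits.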
Conclusion
We have presented EB-JEPA, an open-source library for learning representations and world models using Joint-Embedding Predictive Architectures. Our implementations span image representation learning, video prediction, and action-conditioned planning, each designed to be trainable on a single GPU within a few hours. Comprehensive experiments demonstrate that our implementations achieve strong results on established benchmarks while providing insights into the importance of each
component. The ablation studies reveal that all regularization terms (variance, covariance, temporal similarity, and inverse dynamics) play important roles in preventing collapse and enabling effective planning. We hope EB-JEPA serves as both a practical toolkit for researchers exploring JEPA-based methods and an educational resource for understanding energy-based self-supervised learning.
Ethics statement
EB-JEPA is an educational library for self-supervised learning research. All experiments use standard public benchmarks (CIFAR-10, Moving MNIST) or procedurally generated environments (Two Rooms). None of these datasets contain personally identifiable information. We see no direct ethical concerns with this work.
Reproducibility statement
Reproducibility is the central goal of this work. Our full codebase is included in the supplementary material, with all training scripts, model implementations, and evaluation code. Each example is self-contained and trains on a single GPU in a few hours, removing the need for large compute clusters. Hyperparameters for all experiments are listed in Appendix A. The Two Rooms environment is procedurally generated with documented seeds.
Acknowledgments
We thank Adrien Bardes and Gaoyue Zhou for participating in the discussions and conceptualization of the project.
Hyperparameters
This section provides the hyperparameters used for training and evaluation across our examples. Tables 5 and 6 summarize the key training hyperparameters, including the number of rollout steps K used for multistep prediction training (Eq. 9) and the trajectory slice length T for temporal examples. Table 7 details the MPPI planning configuration used for goal-conditioned navigation in the action-conditioned video-JEPA example.
Table 5: Training hyperparameters for image-JEPA examples on CIFAR-10.
Table 7: Planning hyperparameters for the action-conditioned video-JEPA example using MPPI, corresponding to the notation of Algorithm 1. The total number of replanning steps for an evaluation episode is $M/m$.
Planning Algorithm
We use Model Predictive Path Integral (MPPI) control (Williams et al., 2015) for planning, a sampling-based optimization algorithm that uses importance sampling to iteratively refine action sequences. Unlike the Cross-Entropy Method (CEM), which fits a Gaussian to elite samples, MPPI weights all samples by their exponentiated costs, providing smoother gradient information and better exploration.
Given a trained encoder $f_\theta$, predictor $g_\phi$, and action encoder $q_\omega$, we minimize the planning energy $E_{\text{plan}}$ from Eq. (5) over action sequences, as described in Algorithm 1.
Extended Related Work
Diffusion-Based Planning. An alternative paradigm for planning uses diffusion models to generate trajectories. Diffuser (Janner et al., 2022) pioneered planning with diffusion by treating trajectory optimization as iterative denoising. Diffusion MPC (Zhou et al., 2024b) extends this to model predictive control settings, while Diffusion Policy (Chi et al., 2023) applies diffusion to visuomotor policy learning. These approaches complement JEPA-based methods: while diffusion models excel
at generating diverse, multimodal trajectories, JEPAs provide efficient latent dynamics suitable for fast online planning.
| Regularizer | Best acc. | Average acc. | w/o Projector | # Hyperparams | Best projector |
|---|---|---|---|---|---|
| SIGReg | 91.02% | 89.22% | -3.3 points | 1 | 2048 × 128 |
| VICReg | 90.12% | 84.90% | -2.9 points | 2 | 2048 × 1024 |
| Rank | SIGReg hyperparameters | SIGReg accuracy | VICReg hyperparameters | VICReg accuracy |
|---|---|---|---|---|
| 1 | λ = 10 | 90.88% | std = 1, cov = 100 | 90.12% |
| 2 | λ = 1 | 86.94% | std = 1, cov = 10 | 89.93% |
| 3 | λ = 100 | 80.86% | std = 10, cov = 10 | 89.20% |
| Last | λ = 0.1 | 27.20% | std = 100, cov = 100 | 10.00% |
| Rank | SIGReg projector dims | SIGReg accuracy | VICReg projector dims | VICReg accuracy |
|---|---|---|---|---|
| 1 | 2048 × 128 | 91.02% | 2048 × 1024 | 90.12% |
| 2 | 4096 × 1024 | 91.00% | 4096 × 512 | 90.10% |
| 3 | 2048 × 64 | 90.99% | 1024 × 1024 | 90.05% |
| 4 | 512 × 256 | 90.99% | 2048 × 512 | 90.03% |
| 5 | 4096 × 64 | 90.96% | 4096 × 1024 | 90.02% |
| – | None (no projector) | 87.75% | None (no projector) | 87.27% |
| Ablated Term | Success |
|---|---|
| None (full model) | 97 ± 2% |
| Variance (α = 0) | 47 ± 3% |
| Covariance (β = 0) | 46 ± 3% |
| Temporal Sim. (δ = 0) | 61 ± 2% |
| IDM (ω = 0) | 1 ± 1% |
| Group | Hyperparameter | VICReg | ViT Image-JEPA | SIGReg |
|---|---|---|---|---|
| Optimization | Learning rate | 0.3 | 0.3 | 0.3 |
| | Epochs | 300 | 100 | 300 |
| | Batch size | 256 | 512 | 256 |
| | Weight decay | 10⁻⁴ | 10⁻⁴ | 10⁻⁴ |
| Data | Dataset | CIFAR-10 | CIFAR-10 | CIFAR-10 |
| | Image size | 32 × 32 | 32 × 32 | 32 × 32 |
| Architecture | Encoder | ResNet-18 | ViT-S | ResNet-18 |
| | Predictor | Identity | Identity | Identity |
| | Encoder output dim | 512 | 384 | 512 |
| | Projector hidden dim | 2048 | 2048 | 2048 |
| | Projector output dim | 2048 | 2048 | 128 |
| | Projector layers | 3 | 3 | 3 |
| Loss | Loss type | VICReg | VICReg | BCS |
| | Variance coeff. α | 1.0 | 25.0 | – |
| | Covariance coeff. β | 80.0 | 1.0 | – |
| | BCS coeff. λ | – | – | 10.0 |
Table 6: Training hyperparameters for the video-JEPA (Moving MNIST) and action-conditioned video-JEPA (Two Rooms) examples.

| Group | Hyperparameter | Video-JEPA | AC-Video-JEPA |
|---|---|---|---|
| Optimization | Learning rate | 0.001 | 0.001 |
| | Epochs | 50 | 12 |
| | Batch size | 64 | 384 |
| | Weight decay | – | 10⁻⁵ |
| Data | Dataset | Moving MNIST | Two Rooms |
| | Trajectory length T | 10 | 17 |
| | Image size | 64 × 64 | 65 × 65 |
| Architecture | Encoder | ResNet-5 | IMPALA |
| | Predictor | ResUNet | GRU |
| | Latent dimension d | 16 | 32 |
| | Hidden dimension | 32 | 32 |
| | Encoder receptive field w | 1 | 1 |
| | Predictor receptive field v | 2 | 1 |
| Loss | Rollout steps K | 4 | 8 |
| | Variance coeff. α | 10 | 16 |
| | Covariance coeff. β | 100 | 8 |
| | Time similarity coeff. δ | – | 12 |
| | IDM coeff. ω | – | 1 |
| Hyperparameter | Symbol | Value |
|---|---|---|
| Planning horizon | H | 90 |
| Number of parallel samples | N | 200 |
| Number of iterations | J | 20 |
| Number of elites | K | 20 |
| Noise scale | σ | 2 |
| Temperature | τ | 0.005 |
| Actions stepped per plan | m | 1 |
| Max steps per episode | M | 200 |
\caption{Model Predictive Path Integral (MPPI)}
\label{algo:MPPI}
\begin{algorithmic}[1]
\STATE \textbf{Input:} Initial observation $x_0$, goal observation $x_g$, initial mean $\mu \in \mathbb{R}^{H \times A}$, noise scale $\sigma$, temperature $\tau$, number of samples $N$, number of iterations $J$, number of elites $K$, max steps per episode $M$
\STATE Encode initial and goal: $\hat{z}_0 = f_\theta(x_0)$, $z_g = f_\theta(x_g)$
\FOR{$j = 1$ to $J$}
\STATE Sample $N$ noise perturbations: $\epsilon_i \sim \mathcal{N}(0, \sigma^2 \textrm{I})$ for $i = 1, \ldots, N$
\STATE Compute candidate action sequences: $a^{(i)}_{0:H-1} = \mu + \epsilon_i$
\STATE Unroll predictor for each trajectory: $\hat{z}^{(i)}_{t+1} = g_\phi(\hat{z}^{(i)}_{t-v:t}, u^{(i)}_{t-v:t})$ for $t = 0, \ldots, H-1$
\STATE Compute trajectory costs: $S_i = \sum_{t=1}^{H} \| z_g - \hat{z}^{(i)}_t \|_2$
\STATE Select top $K$ elite samples with lowest costs
\STATE Compute weights over elites: $w_i = \frac{\exp(-S_i / \tau)}{\sum_{k=1}^{K} \exp(-S_k / \tau)}$
\STATE Update mean: $\mu \leftarrow \sum_{i=1}^{K} w_i \cdot a^{(i)}_{0:H-1}$
\ENDFOR
\STATE \textbf{Return:} Execute first $m$ actions of $\mu$, then replan from new observation until $M$ steps reached
\end{algorithmic}
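For illustration, Algorithm 1 can be sketched end-to-end in NumPy on toy linear latent dynamics. The dynamics, dimensions, and shortened horizon are assumptions made for a self-contained example; $N$, $J$, $K$, $\sigma$, and $\tau$ follow Table 7.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A, H = 4, 2, 8                      # latent dim, action dim, shortened horizon
N, J, K_elite = 200, 20, 20            # samples, iterations, elites (Table 7)
sigma, tau = 2.0, 0.005                # noise scale and temperature (Table 7)
W_z = np.eye(D) * 0.9                  # toy latent dynamics, stands in for g_phi
W_u = rng.normal(size=(A, D)) * 0.5

def rollout_cost(z0, z_goal, actions):
    """Unroll the toy dynamics for a batch of action sequences (B, H, A)
    and accumulate the distance to the goal embedding at every step."""
    z = np.repeat(z0[None], len(actions), axis=0)
    cost = np.zeros(len(actions))
    for t in range(H):
        z = z @ W_z + actions[:, t] @ W_u
        cost += np.linalg.norm(z_goal - z, axis=1)
    return cost

def mppi(z0, z_goal):
    mu = np.zeros((H, A))                          # initial action-sequence mean
    for _ in range(J):
        eps = rng.normal(scale=sigma, size=(N, H, A))
        candidates = mu[None] + eps                # N perturbed action sequences
        costs = rollout_cost(z0, z_goal, candidates)
        elites = np.argsort(costs)[:K_elite]       # K lowest-cost samples
        w = np.exp(-(costs[elites] - costs[elites].min()) / tau)
        w /= w.sum()                               # Boltzmann weights over elites
        mu = np.einsum('i,ihk->hk', w, candidates[elites])
    return mu

z0, z_goal = rng.normal(size=D), rng.normal(size=D)
initial_cost = rollout_cost(z0, z_goal, np.zeros((1, H, A)))[0]
mu = mppi(z0, z_goal)
final_cost = rollout_cost(z0, z_goal, mu[None])[0]
```

In a receding-horizon loop, only the first $m$ actions of the returned mean would be executed before re-encoding the new observation and replanning.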

References
[IJEPA] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. \newblock Self-supervised learning from images with a joint-embedding predictive architecture. \newblock In CVPR, 2023.
[VJEPA2] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois~Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, and Nicolas Ballas. \newblock V-jepa 2: Self-supervised video models enable understanding, prediction and planning, 2025.
[SLLspectralembeddings2022] Randall Balestriero and Yann LeCun. \newblock Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. \newblock In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, 2022. Curran Associates Inc. \newblock ISBN 9781713871088.
[ReconstructionUninformativebalestriero24b] Randall Balestriero and Yann LeCun. \newblock How learning by reconstruction produces uninformative features for perception. \newblock In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 2566--2585. PMLR, 21--27 Jul 2024. \newblock URL https://proceedings.mlr.press/v235/balestriero24b.html.
[balestriero2024learning] Randall Balestriero and Yann LeCun. \newblock Learning by reconstruction produces uninformative features for perception. \newblock arXiv preprint arXiv:2402.11337, 2024.
[balestriero2025lejepaprovablescalableselfsupervised] Randall Balestriero and Yann LeCun. \newblock Lejepa: Provable and scalable self-supervised learning without the heuristics, 2025. \newblock URL https://arxiv.org/abs/2511.08544.
[NWM_Bar_2025_CVPR] Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. \newblock Navigation world models. \newblock In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 15791--15801, June 2025.
[VICReg] Adrien Bardes, Jean Ponce, and Yann LeCun. \newblock Vicreg: Variance-invariance-covariance regularization for self-supervised learning. \newblock In ICLR, 2022.
[VJEPA] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. \newblock Revisiting feature prediction for learning visual representations from video, 2024. \newblock ISSN 2835-8856.
[blattmann2023stablevideodiffusionscaling] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. \newblock Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023. \newblock URL https://arxiv.org/abs/2311.15127.
[brooks2024video] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. \newblock Video generation models as world simulators, 2024. \newblock URL https://openai.com/research/video-generation-models-as-world-simulators.
[bruce2024genie] Jake Bruce, Michael D. Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. \newblock Genie: Generative interactive environments. \newblock In Forty-first International Conference on Machine Learning, 2024.
[SimCLR] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. \newblock A simple framework for contrastive learning of visual representations. \newblock In ICML, 2020.
[SimSiam] Xinlei Chen and Kaiming He. \newblock Exploring simple siamese representation learning. \newblock In CVPR, 2021.
[chi2023diffusion] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. \newblock Diffusion policy: Visuomotor policy learning via action diffusion. \newblock The International Journal of Robotics Research, pp.\ 02783649241273668, 2023.
[craik1967nature] Kenneth James~Williams Craik. \newblock The nature of explanation, volume 445. \newblock CUP Archive, 1967.
[VITs] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. \newblock An image is worth 16x16 words: Transformers for image recognition at scale. \newblock In International Conference on Learning Representations, 2021.
[Impala] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. \newblock {IMPALA}: Scalable distributed deep-{RL} with importance weighted actor-learner architectures. \newblock In Jennifer Dy and Andreas Krause (eds.), Proceedings of the 35th International Conference on Machine Learning, volume~80 of Proceedings of Machine Learning Research, pp.\ 1407--1416. PMLR, 10--15 Jul 2018. \newblock URL https://proceedings.mlr.press/v80/espeholt18a.html.
[IWM] Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, and Yann LeCun. \newblock Learning and leveraging world models in visual representation learning, 2024.
[garrido2025intuitivephysicsunderstandingemerges] Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, and Yann LeCun. \newblock Intuitive physics understanding emerges from self-supervised pretraining on natural videos, 2025. \newblock URL https://arxiv.org/abs/2502.11831.
[BYOL] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. \newblock Bootstrap your own latent: A new approach to self-supervised learning. \newblock In NeurIPS, 2020.
[PlaNet] Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. \newblock Learning latent dynamics for planning from pixels. \newblock In Proceedings of the 36th International Conference on Machine Learning, volume~97, pp.\ 2555--2565. PMLR, 2019.
[Director] Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. \newblock Deep hierarchical planning from pixels. \newblock In Alice~H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022.
[DreamerV3] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. \newblock Mastering diverse domains through world models, 2024.
[TDMPC2] Nicklas Hansen, Hao Su, and Xiaolong Wang. \newblock Td-mpc2: Scalable, robust world models for continuous control. \newblock In The Twelfth International Conference on Learning Representations, 2024.
[TD-MPC] Nicklas~A Hansen, Hao Su, and Xiaolong Wang. \newblock Temporal difference learning for model predictive control. \newblock In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.\ 8387--8406. PMLR, 17--23 Jul 2022.
[ResNet] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. \newblock Deep residual learning for image recognition. \newblock In CVPR, 2016.
[MoCo] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. \newblock Momentum contrast for unsupervised visual representation learning. \newblock In CVPR, 2020.
[MAE] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. \newblock Masked autoencoders are scalable vision learners. \newblock In CVPR, 2021.
[ContrastiveDivergence2002] Geoffrey E. Hinton. \newblock Training products of experts by minimizing contrastive divergence. \newblock Neural Computation, 14(8):1771--1800, August 2002. MIT Press. \newblock doi: 10.1162/089976602760128018. \newblock URL https://doi.org/10.1162/089976602760128018.
[Hopfield1982] J. J. Hopfield. \newblock Neural networks and physical systems with emergent collective computational abilities. \newblock Proceedings of the National Academy of Sciences, 79(8):2554--2558, 1982. \newblock doi: 10.1073/pnas.79.8.2554. \newblock URL https://www.pnas.org/doi/abs/10.1073/pnas.79.8.2554.
[TBPTT2002tuto] Herbert Jaeger. \newblock Tutorial on training recurrent neural networks, covering bppt, rtrl, ekf and the echo state network approach. \newblock GMD-Forschungszentrum Informationstechnik, 2002., 5, 01 2002.
[Diffuser] Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. \newblock Planning with diffusion for flexible behavior synthesis. \newblock In ICML, 2022.
[LeCunEBMTutorial2006] Yann LeCun, Sumit Chopra, Raia Hadsell, Marc'Aurelio Ranzato, and Fu Jie Huang. \newblock A tutorial on energy-based learning. \newblock In Predicting Structured Data. MIT Press, 2006.
[HAC] Andrew Levy, Robert Platt, and Kate Saenko. \newblock Hierarchical reinforcement learning with hindsight. \newblock In International Conference on Learning Representations, 2019.
[HIRO] Ofir Nachum, Shixiang (Shane) Gu, Honglak Lee, and Sergey Levine. \newblock Data-efficient hierarchical reinforcement learning. \newblock In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
[DINOV2] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jégou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. \newblock DINOv2: Learning robust visual features without supervision. \newblock Transactions on Machine Learning Research, 2024. \newblock ISSN 2835-8856.
[genie2] Jack Parker-Holder, Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, Stephen Spencer, Jessica Yung, Michael Dennis, Sultan Kenjeyev, Shangbang Long, Vlad Mnih, Harris Chan, Maxime Gazeau, Bonnie Li, Fabio Pardo, Luyu Wang, Lei Zhang, Frederic Besse, Tim Harley, Anna Mitenkova, Jane Wang, Jeff Clune, Demis Hassabis, Raia Hadsell, Adrian Bolton, Satinder Singh, and Tim Rocktäschel. \newblock Genie 2: A large-scale foundation world model. \newblock 2024. \newblock URL https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/.
[curiositydrivenexplorationselfsupervisedprediction] Deepak Pathak, Pulkit Agrawal, Alexei~A. Efros, and Trevor Darrell. \newblock Curiosity-driven exploration by self-supervised prediction. \newblock In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pp.\ 2778–2787. JMLR.org, 2017.
[Rao99] R. P. Rao and D. H. Ballard. \newblock Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. \newblock Nature Neuroscience, 2(1):79--87, January 1999. \newblock ISSN 1097-6256. \newblock doi: 10.1038/4580. \newblock URL http://dx.doi.org/10.1038/4580.
[Schmidhuber2015] Juergen Schmidhuber. \newblock On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models, 2015. \newblock URL https://arxiv.org/abs/1511.09249.
[Schmidhuber1990] Jürgen Schmidhuber. \newblock Making the world differentiable: on using self supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. \newblock Forschungsberichte, TU Munich, FKI 126 90:1--26, 1990. \newblock URL https://api.semanticscholar.org/CorpusID:28490120.
[InformationTheoPerspectiveVICREG2023] Ravid Shwartz-Ziv, Randall Balestriero, Kenji Kawaguchi, Tim G. J. Rudner, and Yann LeCun. \newblock An information theory perspective on variance-invariance-covariance regularization. \newblock In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.\ 33965--33998. Curran Associates, Inc., 2023. \newblock URL https://proceedings.neurips.cc/paper_files/paper/2023/file/6b1d4c03391b0aa6ddde0b807a78c950-Paper-Conference.pdf.
[sobal2022jepaslowfeatures] Vlad Sobal, Jyothir~S V, Siddhartha Jalagam, Nicolas Carion, Kyunghyun Cho, and Yann LeCun. \newblock Joint embedding predictive architectures focus on slow features, 2022. \newblock URL https://arxiv.org/abs/2211.10831.
[PLDM] Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, and Yann LeCun. \newblock Learning from reward-free offline data: A case for planning with latent dynamics models, 2025.
[moving_mnist] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. \newblock Unsupervised learning of video representations using lstms. \newblock In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pp.\ 843–852. JMLR.org, 2015.
[Dyna1991] Richard S. Sutton. \newblock Dyna, an integrated architecture for learning, planning, and reacting. \newblock SIGART Bull., 2(4):160--163, July 1991. \newblock ISSN 0163-5719. \newblock doi: 10.1145/122344.122377. \newblock URL https://doi.org/10.1145/122344.122377.
[terver2026JEPAWMs] Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, and Yann LeCun. \newblock What drives success in physical planning with joint-embedding predictive world models?, 2026. \newblock URL https://arxiv.org/abs/2512.24497.
[VideoMAE] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. \newblock Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. \newblock In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.\ 10078--10093. Curran Associates, Inc., 2022.
[VideoMAEv2] Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. \newblock Videomae v2: Scaling video masked autoencoders with dual masking. \newblock In CVPR, 2023.
[MPPI] Grady Williams, Andrew Aldrich, and Evangelos Theodorou. \newblock Model predictive path integral control using covariance variable importance sampling, 2015.
[BarlowTwins] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. \newblock Barlow twins: Self-supervised learning via redundancy reduction. \newblock In ICML, 2021.
[DINO-WM] Gaoyue Zhou, Hengkai Pan, Yann LeCun, and Lerrel Pinto. \newblock Dino-wm: World models on pre-trained visual features enable zero-shot planning, 2024a. \newblock URL https://arxiv.org/abs/2411.04983.
[DMPC] Guangyao Zhou, Sivaramakrishnan Swaminathan, Rajkumar Vasudeva Raju, J. Swaroop Guntupalli, Wolfgang Lehrach, Joseph Ortiz, Antoine Dedieu, Miguel Lázaro-Gredilla, and Kevin Murphy. \newblock Diffusion model predictive control. \newblock arXiv preprint arXiv:2410.05364, 2024b.