Value-Guided Action Planning with JEPA World Models
Matthieu Destrade, Oumayma Bounou, Quentin Le Lidec, Jean Ponce, Yann LeCun
Abstract
Building deep learning models that can reason about their environment requires capturing its underlying dynamics. Joint-Embedded Predictive Architectures (JEPA) provide a promising framework to model such dynamics by learning representations and predictors through a self-supervised prediction objective. However, their ability to support effective action planning remains limited. We propose an approach to enhance planning with JEPA world models by shaping their representation space so that the negative goal-conditioned value function for a reaching cost in a given environment is approximated by a distance (or quasi-distance) between state embeddings. We introduce a practical method to enforce this constraint during training and show that it leads to significantly improved planning performance compared to standard JEPA models on simple control tasks.
Matthieu Destrade 1,2,∗, Oumayma Bounou 3, Quentin Le Lidec 3, Jean Ponce 2,3, Yann LeCun 3
1 École Polytechnique, 2 ENS Paris, 3 New York University
Introduction
World models are a class of deep learning architectures designed to capture the dynamics of systems (Ha & Schmidhuber (2018); Ding et al. (2025)). They are trained to predict future states of an environment given a sequence of actions. By explicitly modeling the system's dynamics, they capture a causal understanding of how actions influence future outcomes, enabling reasoning and planning over possible trajectories.
Among the various architectures proposed to implement such models, Joint-Embedded Predictive Architectures (JEPA) (LeCun (2022)) provide an effective framework for learning predictive representations. By optimizing a self-supervised prediction loss, JEPA models jointly learn representations of states and predictors that map past states and actions to future representations. This formulation has proven effective for both representation learning (Assran et al. (2023); Bardes et al. (2024)) and action planning (Sobal et al. (2025); Zhou et al. (2025)), the latter referring to the optimization of action sequences that drive a system from an initial state to a goal state.
In this work, we aim to enhance the planning capabilities of JEPA models. Inspired by advances in reinforcement learning, we learn representations such that the Euclidean distance (or a quasi-distance) between embedded states approximates the negative goal-conditioned value function associated with a reaching cost (Park et al. (2024b;a); Wang et al. (2023)). This structure provides a meaningful latent representation space for planning, potentially mitigating local minima during planning optimization. We evaluate our method on control tasks and observe that incorporating such representations consistently improves planning performance compared to standard JEPA models.
Related work
JEPA world models
Joint-Embedded Predictive Architectures (JEPA) (LeCun (2022)) provide an effective way to implement world models for representation learning and action planning. They rely on the hypothesis that predicting future states is easier in a learned representation space than in the original observation space, and that enforcing predictability encourages meaningful representations. A JEPA model typically consists of a state encoder, an action encoder, and a predictor. It is trained on sequences of states and actions by minimizing a prediction loss, L pred, between a predicted representation and that of the actual state resulting from applying a given action. To prevent collapse during training, standard approaches use a VCReg loss, L VCReg, as in Sobal et al. (2025), or an exponential moving average (EMA) scheme, as in Assran et al. (2023); Bardes et al. (2024).

∗ Correspondence: matthieu.destrade@polytechnique.edu
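As a concrete illustration, the prediction objective combined with a VCReg-style anti-collapse regularizer can be sketched as follows. This is a minimal NumPy sketch with our own variable names, loss weights, and coefficients; the actual architectures and hyperparameters are not reproduced here.

```python
import numpy as np

def vcreg_loss(z, var_target=1.0, cov_weight=0.04, eps=1e-4):
    """VCReg-style regularizer: keep each embedding dimension's std above
    a target (variance term) and penalize off-diagonal covariance
    (covariance term) to prevent representation collapse."""
    z = z - z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.maximum(0.0, var_target - std).mean()
    n, d = z.shape
    cov = z.T @ z / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss + cov_weight * cov_loss

def jepa_losses(encode, predict, s_t, a_t, s_next):
    """L_pred between the predicted and actual next representation,
    plus VCReg on both embeddings; `encode` and `predict` stand in for
    the state encoder and predictor networks."""
    z_t, z_next = encode(s_t), encode(s_next)
    z_pred = predict(z_t, a_t)
    l_pred = ((z_pred - z_next) ** 2).mean()
    return l_pred + vcreg_loss(z_t) + vcreg_loss(z_next)
```

In a real implementation the two VCReg terms and L pred would be combined with tunable weights and minimized by gradient descent over the encoder and predictor parameters.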
Recent works (Sobal et al. (2025); Zhou et al. (2025)) have applied JEPA models to action-planning tasks, showing promising yet still limited performance. To do so, they employ a model predictive control (MPC) procedure (García et al. (1989)), which iteratively minimizes a planning loss measuring the distance between predicted and goal representations over a finite horizon.
Learning a value function
To improve the effectiveness of MPC, several works have proposed learning a value function to guide the MPC procedure (Farshidian et al. (2019); Jordana et al. (2025)). This approach allows MPC to account for longer time horizons, and can stabilize the procedure by providing an additional cost term whose minimization facilitates goal-reaching tasks.

Implicit Q-Learning (IQL) (Ghosh et al. (2023); Kostrikov et al. (2021); Xu et al. (2023)) learns a goal-conditioned value function from unlabeled trajectories by leveraging expectile regression. The authors of Park et al. (2024b) leverage IQL to learn a structured representation space for states of a system, where the negative Euclidean distance approximates a goal-conditioned value function corresponding to the terminal cost in a reaching objective. They show that these representations enable solving various reinforcement learning tasks efficiently. Since a goal-conditioned value function is not symmetric in general, additional work has proposed learning it using a quasi-distance (Wang et al. (2023)).
Value-guided JEPA for action planning
To improve the planning capabilities of JEPA models, we focus on enhancing the representations used to compute the MPC planning cost. In the standard JEPA framework, planning is performed by minimizing the distance between a predicted state and the goal in the representation space. However, this cost can have numerous local minima, making optimization challenging. To address this, we propose learning representations such that the Euclidean distance in the representation space corresponds to the negative of the goal-conditioned value function associated with a reaching cost in a given environment, as in Park et al. (2024b). Unlike previous works, we focus on using these representations for planning with JEPA models and MPC procedures, rather than solely for policy execution. Under this formulation, setting the planning cost to the learned value function and minimizing it naturally drives the model toward the goal.
Baseline loss functions
To enforce the value function criterion in the representation space, we consider several simple loss functions for the state encoder of a JEPA model, which serve as baselines. Specifically, we apply a contrastive loss L contrastive using successive states from training trajectories as positive examples and random pairs of states as negative examples, as well as a regression loss L regressive explicitly enforcing the distance between successive states to be 1.
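The two baseline objectives can be sketched as follows. The precise loss forms and the margin value are our own illustrative choices, not the exact implementation.

```python
import numpy as np

def regressive_loss(z_t, z_next):
    """L_regressive: push the distance between embeddings of successive
    states toward 1."""
    d = np.linalg.norm(z_next - z_t, axis=-1)
    return np.mean((d - 1.0) ** 2)

def contrastive_loss(z_t, z_next, z_rand, margin=5.0):
    """L_contrastive: pull successive states (positives) together and
    push random pairs (negatives) at least `margin` apart via a hinge."""
    d_pos = np.linalg.norm(z_next - z_t, axis=-1)
    d_neg = np.linalg.norm(z_rand - z_t, axis=-1)
    return np.mean(d_pos ** 2) + np.mean(np.maximum(0.0, margin - d_neg) ** 2)
```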
IQL for JEPA models
Let S_0 be the state space, θ the parameters, and E_θ the state encoder of a JEPA model. For all (s, g) ∈ S_0², we define V_θ(s, g) = −‖E_θ(s) − E_θ(g)‖₂. Our goal is to learn θ such that V_θ approximates the goal-conditioned value function V⋆ associated with the reaching cost C : (s, a, g) ↦ 1_{s ≠ g}, which penalizes all time steps where the state s is not equal to the goal g.
Let (T, N) ∈ ℕ² represent the length of the training trajectories and the number of training goals. Let D be a dataset of trajectories (s_t)_{t ∈ ⟦0, T⟧} belonging to S_0^{T+1} and goals (g_n)_{n ∈ ⟦0, N⟧} belonging to S_0^{N+1}. We minimize the mean IQL loss with respect to θ via gradient descent:
$$
\mathcal{L}_{\text{VF}}(\theta) \;=\; \frac{1}{T\,(N+1)} \sum_{t=0}^{T-1} \sum_{n=0}^{N} L^2_{\tau}\!\left( -C(s_t, a_t, g_n) + \gamma\, \bar{V}_{\theta}(s_{t+1}, g_n) - V_{\theta}(s_t, g_n) \right)
$$
where ·̄ denotes a stop-gradient; τ, γ ∈ ]0, 1[ are close to 1; and for all x ∈ ℝ, the term L²_τ(x) = |τ − 1_{x<0}| x² performs expectile regression. The parameter γ is the discount factor of the value function we aim to learn. In practice, we use two different types of goals: the last state of the training trajectories, and random goals sampled from the training batches.
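For illustration, the expectile term and the resulting value loss can be computed on precomputed embeddings as follows. This is a minimal NumPy sketch with our own function and variable names; in a real training loop the target terms would use stop-gradient (or target-network) embeddings.

```python
import numpy as np

def expectile_loss(x, tau):
    """L^2_tau(x) = |tau - 1_{x<0}| * x^2 (expectile regression)."""
    weight = np.where(x < 0, 1.0 - tau, tau)
    return weight * x ** 2

def iql_value_loss(z_t, z_next, z_goal, not_at_goal, tau=0.8, gamma=0.98):
    """IQL TD loss for V(s, g) = -||E(s) - E(g)||_2 on a batch of
    embedded transitions. `not_at_goal` is the cost C(s, a, g) = 1_{s != g}."""
    v = -np.linalg.norm(z_t - z_goal, axis=-1)           # V_theta(s_t, g)
    v_next = -np.linalg.norm(z_next - z_goal, axis=-1)   # target V(s_{t+1}, g)
    td_err = -not_at_goal + gamma * v_next - v           # TD residual
    return np.mean(expectile_loss(td_err, tau))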
To obtain a better approximation, we further explore replacing the Euclidean distance in the definition of V θ with a quasimetric distance, following Wang et al. (2023). The quasi-distance used to learn V ⋆ is the generic form introduced in Wang & Isola (2022).
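To make the asymmetry concrete, the following toy function satisfies the quasimetric axioms (non-negativity, d(x, x) = 0, triangle inequality) while being asymmetric in general. It is a simplified stand-in for exposition only, not the IQE quasimetric of Wang & Isola (2022) actually used here.

```python
import numpy as np

def quasi_distance(z_s, z_g):
    """Toy quasimetric: d(x, y) = sum_i max(0, y_i - x_i).
    Non-negative, d(x, x) = 0, satisfies the triangle inequality,
    but d(x, y) != d(y, x) in general."""
    return np.maximum(0.0, z_g - z_s).sum(axis=-1)
```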
We consider two approaches to training JEPA models. The first approach, which we call 'Sep', consists of training the state encoder alone using the L VF objective, followed by training the action encoder and predictor with the L pred loss. The second approach consists of training all networks together using as objective the sum of L VF and L pred .
Experiments
Experiment settings
We conduct our experiments in two environments under an offline reinforcement learning setting. Models are trained with random trajectories sampled in the environments. The states used as inputs to our models are observation images, potentially including additional sensory information. A detailed description of the datasets used is provided in Appendix 7.1.
The wall environment consists of a square space separated by a wall with a door. The positions of the wall and door are randomly initialized when the environment is instantiated. The agent has to move from a random starting position to a random goal located on the opposite side of the wall. It can execute actions that are vectors corresponding to displacements. We generate datasets with two settings: WS, with actions of small norms, and WB, with actions of larger norms.
The maze environment consists of an agent that must move from a random starting point to a random goal within a random maze. Its actions are velocity commands. Planning in this environment requires that both the agent's position and velocity be encoded in the representations, as it simulates inertia. Following a similar approach to Sobal et al. (2025), we include the agent's velocity as an input to the encoders for a given state.
Planning with the representations
We conduct experiments to evaluate the planning performance of different learning methods. Specifically, we train JEPA models with the approaches listed in Table 1.
| Name | State encoder loss | Sep |
|---|---|---|
| Contrastive | L contrastive | ✓ |
| Regressive | L regressive | ✓ |
| pred VCReg | L pred & L VCReg | × |
| pred EMA | L pred (EMA) | × |
| VF | L VF | ✓ |
| VF pred | L VF | × |
| VF quasi | L VF & quasi-distance | ✓ |
| VF quasi pred | L VF & quasi-distance | × |
| VF VCReg | L VF & L VCReg | ✓ |
| VF VCReg pred | L VF & L VCReg | × |

Table 1: Training approaches
The precise settings of the experiments are described in Appendix 7.2.
We assess the quality of the learned representations by evaluating the planning accuracy of the model, defined as the proportion of successful plans for random pairs of initial states and goals. We compute this success rate on 200 instances of the wall environment and 80 instances of the maze environment, so that the variance of the results is small. We use an MPC procedure with an MPPI optimizer. The results are displayed in Table 2.
Table 2: Planning results in the different environments

They show that IQL-inspired approaches provide valuable guidance during planning and achieve better results than intuitive or prediction-based approaches, as used in Sobal et al. (2025). Interestingly, the VF quasi approach consistently outperforms the VF approach, even when the theoretical value function is symmetric. This suggests that using a quasi-distance facilitates the training process by enhancing the expressiveness of the networks.
Learning representations using both a prediction loss and an IQL loss is less effective than using the latter loss alone. Using VCReg to promote diversity when learning with an IQL loss also results in poor planning performance. The results obtained with the WB dataset are better than those obtained with the WS dataset. This may be because a single trajectory explores more of the environment in the WB dataset, and because the agent collides with the wall more often, providing a stronger training signal about its position.
Discussion
Locality of the training. The imperfect results indicate that the value functions learned with our approach are inaccurate. While local relationships between states can be expected to be correctly captured, this is less probable for distant relationships, for two main reasons. First, during training, the space of distant triplets of states (starting state, following state and goal) is sparsely sampled. Second, the gradient of the discounted value function with respect to the state becomes small when the state is far from the given goal. For such states, the signal-to-noise ratio of the value function tends to be low. This suggests that using a hierarchy of representation spaces, where higher levels model longer-range transitions or more coarsely sampled trajectories, may better capture distant relationships and yield improved results.
Influence of the dataset. Theoretical results on the IQL loss show that only the support of the policy used to create the training dataset actually matters when τ tends to 1. In practice, however, other factors are likely relevant. In highly suboptimal trajectories, states that are close to each other may appear far apart, potentially making training more difficult. Therefore, it might be preferable to use 'expert' trajectories. However, they are often hard to obtain and come at the cost of diversity and exploration. Moreover, it is important that the states used in the IQL loss during training span the entire state space. In practice, this can be achieved either by increasing the size of the training dataset or by employing more effective data collection strategies that better explore underrepresented states.
Conclusion
In this study, we aimed to improve the planning capabilities of JEPA world models. To this end, we proposed enhancing the representations used for planning by learning them such that the Euclidean distance, or a quasi-distance, in the representation space approximates the negative goal-conditioned value function associated with a goal-reaching cost for the system under consideration. This was achieved by training the state encoder of a JEPA model using an implicit Q-learning (IQL) loss.
We compared these methods to more intuitive approaches, as well as to standard prediction-based JEPA training approaches, on benchmark action-planning tasks. Our results show that the value function-based methods, particularly those using a quasi-distance, achieve superior performance, suggesting that such approaches are a promising direction for world model action planning.
Further experiments would be valuable, especially in stochastic environments. Prediction-based methods are indeed expected to be more robust in non-deterministic environments and may enable the learning of more general representations than the other approaches tested, whereas our IQL approach is known to be biased in stochastic environments.
Appendix
Datasets
Wall
The observations of states in the wall environment are of size 64 × 64 and consist of 2 channels: one representing the agent and the other representing the walls. Visualizations of typical states of this environment (with flattened channels) are shown in Fig. 1.
To generate the dataset of training trajectories, we follow the approach of Sobal et al. (2025) and do not sample actions using Gaussian noise, as this would result in trajectories concentrated in a small region of the environment. Instead, we generate actions by sampling a random direction and perturbing it with noise drawn from a von Mises distribution with concentration parameter 5. We generate datasets containing 1000 trajectories of length 64, ensuring that half of the trajectories correspond to the agent passing through the door.


Figure 1: Examples of trajectories from the wall datasets (crossing trajectory, WS; non-crossing trajectory, WB)
The WS dataset is generated with action norms sampled randomly from a Gaussian distribution with mean 1 pixel and standard deviation 0.4, and clipped to the range [0.2, 1.8]. The WB dataset is generated with action norms sampled randomly from a Gaussian distribution with mean 2 pixels and standard deviation 0.8, and clipped to the range [0.4, 3.6].
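The wall action sampling described above can be sketched as follows. This is a NumPy sketch using the WS parameters; the door-crossing constraint and other implementation details are omitted, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_wall_actions(n_steps, mean_norm=1.0, std_norm=0.4,
                        clip=(0.2, 1.8), kappa=5.0):
    """Sample displacement actions for one trajectory (WS defaults):
    a random base direction perturbed per step by von Mises noise with
    concentration `kappa`, and Gaussian norms clipped to `clip`."""
    base_angle = rng.uniform(0.0, 2.0 * np.pi)
    angles = base_angle + rng.vonmises(0.0, kappa, size=n_steps)
    norms = np.clip(rng.normal(mean_norm, std_norm, size=n_steps), *clip)
    return np.stack([norms * np.cos(angles), norms * np.sin(angles)], axis=-1)
```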
Maze
The maze environment follows the setting used in Sobal et al. (2025), which is based on the MuJoCo PointMaze environment (Fu et al. (2021)). It uses a grid of 4 × 4 squares, of which a contiguous subset comprising between 50% and 60% of the squares is selected to form the maze. Observations of states in this environment are colored images of size 64 × 64 with 3 channels.
The actions controlling the agent correspond to target speeds to reach. The environment computes the force required to achieve the desired speed after a certain number of time steps. The trajectories are generated by sampling random speed vectors with norms smaller than 5, starting from random positions. To evaluate the planning capabilities of our approaches in this environment, random starting points and goals are sampled, such that they are at least 3 cells apart. The dataset contains 1000 trajectories of length 101.
To assess the generalization capabilities of the different approaches we experiment with, we follow the methodology of Sobal et al. (2025). The training trajectories all belong to five maze layouts, which are different from those used for evaluation.

Figure 2: Examples of states of the maze environment (the agent is the green point)
Experiment settings
The code used for the experiments is based on an implementation of JEPA models for action planning by Sobal et al. (2025).
In the models, we use flat representations of size 512, a predictor with an MLP architecture, and an action encoder set to the identity. The state encoder is based on a simple architecture combining convolutions and residual connections. Before being passed to the predictor, the representations of states and actions are concatenated. The encoder has 2.2M parameters and the predictor has 1.3M parameters. The input trajectories are subsampled into segments of length 16 during training.
All networks were trained with a base learning rate of 0.0028, using the Adam optimizer and a cosine learning rate schedule. For the wall environments, the VCReg loss is computed along the batch dimension of the representations. At planning time, the MPPI optimization in the MPC is configured with 2000 initial perturbations sampled from a Gaussian distribution with mean 0 and standard deviation 12, and a temperature parameter of λ = 0.005. We use a planning horizon of 96 for a total of 200 planning steps in the WS environment, and a planning horizon of 64 for a total of 64 planning steps in the WB environment. For the maze environment, the VCReg loss is computed along both the batch and temporal dimensions. The MPPI optimization in the MPC is configured with 500 initial perturbations sampled from a Gaussian distribution with mean 0 and standard deviation 5, and a temperature parameter of λ = 0.0025. We use a planning horizon of 100 for a total of 200 planning steps.
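A single MPPI refinement step over a nominal action sequence can be sketched as follows. This is a minimal NumPy version using a final-state planning cost for simplicity; the actual implementation may accumulate per-step costs over the horizon and differs in other details.

```python
import numpy as np

rng = np.random.default_rng(0)

def mppi_update(z0, z_goal, predict, actions,
                n_samples=2000, noise_std=12.0, lam=0.005):
    """One MPPI refinement of a nominal action sequence: perturb it with
    Gaussian noise, roll each candidate out in latent space with the JEPA
    predictor, score by the distance of the final predicted state to the
    goal, and return the exponentially re-weighted average sequence."""
    H, act_dim = actions.shape
    noise = rng.normal(0.0, noise_std, size=(n_samples, H, act_dim))
    candidates = actions[None] + noise
    costs = np.empty(n_samples)
    for k in range(n_samples):
        z = z0
        for t in range(H):
            z = predict(z, candidates[k, t])
        costs[k] = np.linalg.norm(z - z_goal)   # latent planning cost
    w = np.exp(-(costs - costs.min()) / lam)    # softmin weights
    w /= w.sum()
    return (w[:, None, None] * candidates).sum(axis=0)
```

In the full MPC loop, this update is repeated, the first action of the refined sequence is executed, and the procedure restarts from the new state.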
For all experiments, we use γ = 0.98 and τ = 0.80 for the VF-based approaches, and γ = 0.93 and τ = 0.60 for the VF quasi-based ones. These values were optimized following the procedure described in Appendix 7.3.
Additional experiments
Hyperparameter optimization. Before running our experiments, we optimized the two main hyperparameters controlling the behavior of the value function learning methods, namely τ and γ . This was done using a WS dataset different from the one used for the rest of the tests. The results are shown below:

Figure 3: Evolution of planning accuracy with respect to hyperparameters (results for τ obtained with VF: γ = 0.98; VF quasi: γ = 0.93)
Increasing γ improves performance, as it better captures the relationships between distant states. The same applies to τ , which should theoretically be set as close to 1 as possible. However, setting either parameter too close to 1 introduces instabilities that degrade performance. We chose the values of γ and τ that maximized the planning accuracy for the rest of the experiments.
Separate predictive and planning representations. One might hypothesize that representations learned using a prediction loss yield better prediction accuracy, while those learned with an IQL approach result in a more effective planning cost. It is possible to combine the advantages of both by adopting an intermediate approach in which two separate representation spaces are learned: the first with a standard prediction loss, and the second with an IQL loss using a second state encoder. During planning, the first space is used to compute predictions, and the second to compute the cost. We tested this method on the WS dataset using the pred VCReg approach for the first space and the VF approach for the second. It did not improve planning results, yielding a planning accuracy of 0.60.
| Type | WS | WB | Maze |
|---|---|---|---|
| Contrastive | 0.49 | 0.59 | 0.50 |
| Regressive | 0.54 | 0.57 | 0.46 |
| pred VCReg | 0.55 | 0.89 | 0.54 |
| pred EMA | 0.46 | 0.43 | 0.04 |
| VF | 0.63 | 0.94 | 0.49 |
| VF pred | 0.55 | 0.75 | 0.49 |
| VF quasi | 0.71 | 0.96 | 0.63 |
| VF quasi pred | 0.61 | 0.85 | 0.43 |
| VF VCReg | 0.49 | 0.75 | 0.39 |
| VF VCReg pred | 0.47 | 0.67 | 0.39 |
Building deep learning models that can reason about their environment requires capturing its underlying dynamics. Joint-Embedded Predictive Architectures (JEPA) provide a promising framework to model such dynamics by learning representations and predictors through a self-supervised prediction objective. However, their ability to support effective action planning remains limited. We propose an approach to enhance planning with JEPA world models by shaping their representation space so that the negative goal-conditioned value function for a reaching cost in a given environment is approximated by a distance (or quasi-distance) between state embeddings. We introduce a practical method to enforce this constraint during training and show that it leads to significantly improved planning performance compared to standard JEPA models on simple control tasks.
World models are a class of deep learning architectures designed to capture the dynamics of systems (Ha and Schmidhuber (2018); Ding et al. (2025)). They are trained to predict future states of an environment given a sequence of actions. By explicitly modeling the system’s dynamics, they capture a causal understanding of how actions influence future outcomes, enabling reasoning and planning over possible trajectories.
Among the various architectures proposed to implement such models, Joint-Embedded Predictive Architectures (JEPA) (LeCun (2022)) provide an effective framework for learning predictive representations. By optimizing a self-supervised prediction loss, JEPA models jointly learn representations of states and predictors that map past states and actions to future representations. This formulation has proven effective for both representation learning (Assran et al. (2023); Bardes et al. (2024)) and action planning (Sobal et al. (2025); Zhou et al. (2025)), the latter referring to the optimization of action sequences that drive a system from an initial state to a goal state.
In this work, we aim to enhance the planning capabilities of JEPA models. Inspired by advances in reinforcement learning, we learn representations such that the Euclidean distance (or a quasi-distance) between embedded states approximates the negative goal-conditioned value function associated with a reaching cost (Park et al. (2024b; a); Wang et al. (2023)). This structure provides a meaningful latent representation space for planning, potentially mitigating local minima during planning optimization. We evaluate our method on control tasks and observe that incorporating such representations consistently improves planning performance compared to standard JEPA models.
Joint-Embedded Predictive Architectures (JEPA) (LeCun (2022)) provide an effective way to implement world models for representation learning and action planning. They rely on the hypothesis that predicting future states is easier in a learned representation space than in the original observation space, and that enforcing predictability encourages meaningful representations. A JEPA model typically consists of a state encoder, an action encoder, and a predictor. It is trained on sequences of states and actions by minimizing a prediction loss, ℒpred\mathcal{L}{\text{pred}}, between a predicted representation and that of the actual state resulting from applying a given action. To prevent collapse during training, standard approaches use a VCReg loss, ℒVCReg\mathcal{L}{\text{VCReg}}, as in Sobal et al. (2025), or an exponential moving average (EMA) scheme, as in Assran et al. (2023); Bardes et al. (2024).
Recent works (Sobal et al. (2025); Zhou et al. (2025)) have applied JEPA models to action-planning tasks, showing promising yet still limited performance. To do so, they employ a model predictive control (MPC) procedure (García et al. (1989)), which iteratively minimizes a planning loss measuring the distance between predicted and goal representations over a finite horizon.
To improve the effectiveness of MPC, several works have proposed learning a value function to guide the MPC procedure Farshidian et al. (2019); Jordana et al. (2025). This approach allows MPC to account for longer time horizons, and can stabilize the procedure by providing an additional cost term whose minimization facilitates goal-reaching tasks.
Implicit Q-Learning (IQL) Ghosh et al. (2023); Kostrikov et al. (2021); Xu et al. (2023) learns a goal-conditioned value function from unlabeled trajectories by leveraging expectile regression. The authors of Park et al. (2024b) leverage IQL to learn a structured representation space for states of a system, where the negative Euclidean distance approximates a goal-conditioned value function corresponding to the terminal cost in a reaching objective. They show that these representations enable solving various reinforcement learning tasks efficiently. Since a goal-conditioned value function is not symmetric in general, additional work has proposed learning it using a quasi-distance Wang et al. (2023).
To improve the planning capabilities of JEPA models, we focus on enhancing the representations used to compute the MPC planning cost. In the standard JEPA framework, planning is performed by minimizing the distance between a predicted state and the goal in the representation space. However, this cost can have numerous local minima, making optimization challenging. To address this, we propose learning representations such that the Euclidean distance in the representation space corresponds to the negative of the goal-conditioned value function associated with a reaching cost in a given environment, as in Park et al. (2024b). Unlike previous works, we focus on using these representations for planning with JEPA models and MPC procedures, rather than solely for policy execution. Under this formulation, setting the planning cost to the learned value function and minimizing it naturally drives the model toward the goal.
To enforce the value function criterion in the representation space, we consider several simple loss functions for the state encoder of a JEPA model, which serve as baselines. Specifically, we apply a contrastive loss ℒcontrastive\mathcal{L}{\text{contrastive}} using successive states from training trajectories as positive examples and random pairs of states as negative examples, as well as a regression loss ℒregressive\mathcal{L}{\text{regressive}} explicitly enforcing the distance between successive states to be 1.
Let 𝒮0\mathcal{S}{0} be the state space, θ\theta the parameters and ℰθ\mathcal{E}{\theta} the state encoder of a JEPA model. For all (s,g)∈𝒮02(s,g)\in\mathcal{S}{0}^{2}, we define Vθ(s,g)=−‖ℰθ(s)−ℰθ(g)‖2V{\theta}(s,g)=-|\mathcal{E}{\theta}(s)-\mathcal{E}{\theta}(g)|{2}. Our goal is to learn θ\theta such that VθV{\theta} approximates the goal-conditioned value function V⋆V^{\star} associated with the reaching cost C:(s,a,g)↦𝟏s≠gC:(s,a,g)\mapsto\mathbf{1}_{s\neq g}, which penalizes all time steps where the state ss is not equal to the goal gg.
Let (T,N)∈ℕ2(T,N)\in\mathbb{N}^{2} represent the length of the training trajectories and the number of training goals. Let 𝒟\mathcal{D} be a dataset of trajectories (st)t∈⟦0,T⟧(s_{t}){t\in\llbracket 0,T\rrbracket} belonging to 𝒮0T+1\mathcal{S}{0}^{T+1} and goals (gn)n∈⟦0,N⟧(g_{n}){n\in\llbracket 0,N\rrbracket} belonging to 𝒮0N+1\mathcal{S}{0}^{N+1}. We minimize the mean IQL loss with respect to θ\theta via gradient descent:
where ⋅¯\bar{\cdot} denotes a stop-gradient ; τ,γ∈]0,1[\tau,\gamma\in]0,1[ are close to 11 ; and for all x∈ℝ,x\in\mathbb{R}, the term Lτ2(x)=|τ−𝟏x<0|x2;L_{\tau}^{2}(x)=|\tau-\mathbf{1}_{x<0}|,x^{2} performs expectile regression. The parameter γ\gamma is the discount factor of the value function we aim to learn. In practice, we use two different types of goals: the last state of the training trajectories, and random goals sampled from the training batches.
To obtain a better approximation, we further explore replacing the Euclidean distance in the definition of VθV_{\theta} with a quasimetric distance, following Wang et al. (2023). The quasi-distance used to learn V⋆V^{\star} is the generic form introduced in Wang and Isola (2022).
We consider two approaches to training JEPA models. The first approach, which we call “Sep”, consists of training the state encoder alone using the ℒVF\mathcal{L}{\text{VF}} objective, followed by training the action encoder and predictor with the ℒpred\mathcal{L}{\text{pred}} loss. The second approach consists of training all networks together using as objective the sum of ℒVF\mathcal{L}{\text{VF}} and ℒpred\mathcal{L}{\text{pred}}.
We conduct our experiments in two environments under an offline reinforcement learning setting. Models are trained with random trajectories sampled in the environments. The states used as inputs to our models are observation images, potentially including additional sensory information. A detailed description of the datasets used is provided in the Appendix 7.1.
The wall environment consists of a square space separated by a wall with a door. The positions of the wall and door are randomly initialized when the environment is instantiated. The agent has to move from a random starting position to a random goal located on the opposite side of the wall. It can execute actions that are vectors corresponding to displacements. We generate datasets with two settings: WS, with actions of small norms, and WB, with actions of larger norms.
The maze environment consists of an agent that must move from a random starting point to a random goal within a random maze. Its actions are velocity commands. Planning in this environment requires that both the agent’s position and velocity be encoded in the representations, as it simulates inertia. Following a similar approach to Sobal et al. (2025), we include the agent’s velocity as an input to the encoders for a given state.
We conduct experiments to evaluate the planning performance of different learning methods. Specifically, we train JEPA models with:
The precise settings of the experiments are described in Appendix 7.2.
We assess the quality of the learned representations by evaluating the planning accuracy of the model, defined as the proportion of successful plans for random pairs of initial states and goals. We compute this success rate on 200 instances of the wall environment and 80 instances of the maze environment, so that the variance of the results is small. We use an MPC procedure with an MPPI optimizer. The results are displayed in Table 2.
They show that IQL-inspired approaches provide valuable guidance during planning and achieve better results than intuitive or prediction-based approaches, as used in Sobal et al. (2025). Interestingly, the VF_quasi approach consistently outperforms the VF approach, even when the theoretical value function is symmetric. This suggests that using a quasi-distance facilitates the training process by enhancing the expressiveness of the networks.
Learning representations using both a prediction loss and an IQL loss is less effective than using the IQL loss alone. Using VCReg to promote diversity alongside the IQL loss also results in poor planning performance. The results obtained on the WB dataset are better than those obtained on the WS dataset. This may be because, in the WB dataset, a single trajectory explores more of the environment and the agent is more likely to collide with the wall.
Locality of the training. The imperfect results indicate that the value functions learned with our approach are inaccurate. While local relationships between states can be expected to be captured correctly, this is less likely for distant relationships, for two main reasons. First, during training, the space of distant triplets (starting state, next state, and goal) is sparsely sampled. Second, the gradient of the discounted value function with respect to the state becomes small when the state is far from the goal, so for such states the signal-to-noise ratio of the value function tends to be low. This suggests that a hierarchy of representation spaces, where higher levels model longer-range transitions or more coarsely sampled trajectories, may better capture distant relationships and yield improved results.
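The second point can be made concrete. For the unit reaching cost $-\mathbf{1}_{s \neq g}$ and a deterministic shortest path of length $d$, the optimal discounted value is $V^{\star} = -(1-\gamma^{d})/(1-\gamma)$, so the value gap between states at distances $d$ and $d+1$ from the goal is exactly $\gamma^{d}$, which vanishes for distant goals. A short numeric sketch (with $\gamma = 0.98$, the value used for the VF-based approaches):

```python
def v_star(d, gamma=0.98):
    # Optimal value of a state exactly d steps from the goal,
    # under a cost of -1 per step until the goal is reached.
    return -(1.0 - gamma**d) / (1.0 - gamma)

# The gap |V*(d) - V*(d+1)| equals gamma**d and shrinks with distance,
# so the training signal for distant states is easily drowned by noise.
gaps = [abs(v_star(d) - v_star(d + 1)) for d in (1, 10, 100)]
```

At $d = 1$ the gap is $0.98$, at $d = 100$ it has shrunk to about $0.13$, illustrating why distant relationships are harder to learn accurately.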
Influence of the dataset. Theoretical results on the IQL loss show that only the support of the policy used to create the training dataset matters as $\tau$ tends to 1. In practice, however, other factors are likely relevant. In highly suboptimal trajectories, states that are close to each other may appear far apart, potentially making training more difficult. It might therefore be preferable to use “expert” trajectories, but these are often hard to obtain and come at the cost of diversity and exploration. Moreover, it is important that the states used in the IQL loss during training span the entire state space. In practice, this can be achieved either by increasing the size of the training dataset or by employing more effective data collection strategies that better cover underrepresented states.
In this study, we aimed to improve the planning capabilities of JEPA world models. To this end, we proposed enhancing the representations used for planning by learning them such that the Euclidean distance, or a quasi-distance, in the representation space approximates the negative goal-conditioned value function associated with a goal-reaching cost for the system under consideration. This was achieved by training the state encoder of a JEPA model using an implicit Q-learning (IQL) loss.
We compared these methods to more intuitive approaches, as well as to standard prediction-based JEPA training approaches, on benchmark action-planning tasks. Our results show that the value function–based methods, particularly those using a quasi-distance, achieve superior performance, suggesting that such approaches are a promising direction for world model action planning.
Further experiments would be valuable, especially in stochastic environments. Prediction-based methods are expected to be more robust to stochasticity and may enable the learning of more general representations than the other approaches tested, whereas our IQL-based approach is known to be biased in stochastic environments.
The observations of states in the wall environment are of size $64\times 64$ and consist of 2 channels: one representing the agent and the other the walls. Visualizations of typical states of this environment (with flattened channels) are shown in Fig. 1.
To generate the dataset of training trajectories, we follow the approach of Sobal et al. (2025), and do not sample actions using Gaussian noise, as this would result in trajectories concentrated in a small region of the environment. Instead, we generate actions by sampling a random direction, perturbing it with noise drawn from the von Mises distribution with concentration parameter 5. We generate datasets containing 1000 trajectories of length 64, ensuring that half of the trajectories correspond to the agent passing through the door.
The WS dataset is generated with action norms sampled from a Gaussian distribution with mean 1 pixel and standard deviation 0.4, clipped to the range $[0.2, 1.8]$. The WB dataset is generated with action norms sampled from a Gaussian distribution with mean 2 pixels and standard deviation 0.8, clipped to the range $[0.4, 3.6]$.
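The action-sampling procedure above can be sketched as follows (WS settings shown; the helper name and the choice of one base direction per trajectory are our reading of the description, and are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_actions(n_steps, mean_norm=1.0, std_norm=0.4,
                   clip=(0.2, 1.8), kappa=5.0):
    """Sample one trajectory's worth of displacement actions.

    A base direction is drawn once per trajectory; each step perturbs it
    with von Mises noise (concentration kappa = 5), and norms are drawn
    from a clipped Gaussian (WS: mean 1 px, std 0.4, clipped to [0.2, 1.8]).
    """
    base = rng.uniform(0.0, 2.0 * np.pi)
    angles = rng.vonmises(base, kappa, size=n_steps)
    norms = np.clip(rng.normal(mean_norm, std_norm, size=n_steps), *clip)
    return norms[:, None] * np.stack([np.cos(angles), np.sin(angles)], axis=1)
```

Swapping in `mean_norm=2.0, std_norm=0.8, clip=(0.4, 3.6)` would give the WB setting.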
The maze environment follows the setting used in Sobal et al. (2025), which is based on the MuJoCo PointMaze environment of Fu et al. (2021). It uses a grid of $4\times 4$ squares, of which a contiguous subset covering between 50% and 60% of the squares is selected to form the maze. Observations of states in this environment are of size $64\times 64$ with 3 color channels.
The actions controlling the agent are target velocities. The environment computes the force required to achieve the desired velocity after a certain number of time steps. Trajectories are generated by sampling random velocity vectors with norms smaller than 5, starting from random positions. To evaluate the planning capabilities of our approaches in this environment, random starting points and goals are sampled at least 3 cells apart. The dataset contains 1000 trajectories of length 101.
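One way to draw such velocity commands is sketched below; the paper only specifies norms below 5, so the uniform direction and uniform norm here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_target_velocity(max_norm=5.0):
    # Uniform random direction with a uniform random norm below max_norm.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    norm = rng.uniform(0.0, max_norm)
    return norm * np.array([np.cos(theta), np.sin(theta)])
```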
To assess the generalization capabilities of the different approaches, we follow the methodology of Sobal et al. (2025): the training trajectories all belong to five maze layouts that differ from those used for evaluation.
The code used for the experiments is based on an implementation of JEPA models for action planning by Sobal et al. (2025).
The models use flat representations of size 512, an MLP predictor, and an action encoder set to the identity. The state encoder is a simple architecture combining convolutions and residual connections. The representations of states and actions are concatenated before being passed to the predictor. The encoder has 2.2M parameters and the predictor 1.3M. Input trajectories are subsampled into segments of length 16 during training.
All networks were trained with a base learning rate of 0.0028, using the Adam optimizer and a cosine learning rate schedule. For the wall environments, the VCReg loss is computed along the batch dimension of the representations. At planning time, the MPPI optimization in the MPC is configured with 2000 initial perturbations sampled from a Gaussian distribution with mean 0 and standard deviation 12, and a temperature parameter $\lambda = 0.005$. We use a planning horizon of 96 for a total of 200 planning steps in the WS environment, and a planning horizon of 64 for a total of 64 planning steps in the WB environment. For the maze environment, the VCReg loss is computed along both the batch and temporal dimensions; the MPPI optimization is configured with 500 initial perturbations sampled from a Gaussian distribution with mean 0 and standard deviation 5, a temperature parameter $\lambda = 0.0025$, and a planning horizon of 100 for a total of 200 planning steps.
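For reference, the core of an MPPI update is an exponentially weighted average of perturbed action sequences, with lower-cost rollouts weighted more heavily (Williams et al., 2015). A minimal sketch of that single step is shown below; the rollout costs would come from unrolling the predictor and scoring states with the learned distance, and the function name and shapes are ours:

```python
import numpy as np

def mppi_update(nominal, costs, perturbations, lam=0.005):
    """One MPPI refinement of a nominal action sequence.

    nominal:       (horizon, action_dim) current plan
    costs:         (n_samples,) total cost of each perturbed rollout
    perturbations: (n_samples, horizon, action_dim) sampled noise
    lam:           temperature (0.005 for the wall environments here)
    """
    w = np.exp(-(costs - costs.min()) / lam)   # subtract min for numerical stability
    w = w / w.sum()                            # softmax weights over rollouts
    # Weighted average of the perturbations, added to the nominal plan.
    return nominal + np.tensordot(w, perturbations, axes=1)
```

With the temperature $\lambda \to 0$ this approaches greedily copying the best sampled rollout; larger $\lambda$ averages over more of them.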
For all experiments, we use $\gamma = 0.98$ and $\tau = 0.80$ for the VF-based approaches, and $\gamma = 0.93$ and $\tau = 0.60$ for the VF_quasi-based ones. These values were optimized following the procedure described in Appendix 7.3.
Table: S4.T1: Training approaches

| Name | State encoder loss | Sep |
|---|---|---|
| Contrastive | $\mathcal{L}_{\text{contrastive}}$ | ✓ |
| Regressive | $\mathcal{L}_{\text{regressive}}$ & $\mathcal{L}_{\text{VCReg}}$ | ✓ |
| pred_VCReg | $\mathcal{L}_{\text{VCReg}}$ | × |
| pred_EMA | EMA procedure | × |
| VF | $\mathcal{L}_{\text{VF}}$ | ✓ |
Table: S4.T2: Planning results in the different environments
| Type | WS | WB | Maze |
|---|---|---|---|
| Contrastive | 0.49 | 0.59 | 0.50 |
| Regressive | 0.54 | 0.57 | 0.46 |
| pred_VCReg | 0.55 | 0.89 | 0.54 |
| pred_EMA | 0.46 | 0.43 | 0.04 |
| VF | 0.63 | 0.94 | 0.49 |
Figure: Crossing trajectory, WS.

Figure: Examples of states of the maze environment (the agent is the green point).

Figure: Hyperparameter-search results for $\gamma$ (with $\tau = 0.7$).
$$ \forall ((s_t),(g_n)) \in \mathcal{D}, \quad \mathcal{L}_{\text{VF}}^{\theta}((s_t),(g_n)) = \sum_{n=0}^{N} \sum_{t=0}^{T-1} L_{\tau}^{2} \Big( -\mathbf{1}_{s_t \neq g_n} + \gamma V_{\bar{\theta}}(s_{t+1}, g_n) - V_{\theta}(s_t, g_n) \Big) $$
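The asymmetric term $L_\tau^2$ in this loss is the expectile loss of implicit Q-learning (Kostrikov et al., 2021), $L_\tau^2(u) = |\tau - \mathbf{1}_{u<0}|\, u^2$: for $\tau > 0.5$, positive temporal-difference residuals are penalized more heavily than negative ones. A one-line sketch:

```python
def expectile_loss(u, tau=0.8):
    # L_tau^2(u) = |tau - 1[u < 0]| * u^2.  With tau > 0.5 (tau = 0.8 in
    # the VF runs), positive residuals u are weighted more than negative ones.
    return abs(tau - (1.0 if u < 0 else 0.0)) * u * u
```

At $\tau = 0.5$ this reduces to a (scaled) squared error; as $\tau \to 1$ it pushes $V_\theta$ toward an upper expectile of the targets.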
Table: S4.T1 (continued): Training approaches, VF variants

| Name | State encoder loss | Sep |
|---|---|---|
| VF_pred | $\mathcal{L}_{\text{VF}}$ | × |
| VF_quasi | $\mathcal{L}_{\text{VF}}$ & quasi-distance | ✓ |
| VF_quasi_pred | $\mathcal{L}_{\text{VF}}$ & quasi-distance | × |
| VF_VCReg | $\mathcal{L}_{\text{VF}}$ & $\mathcal{L}_{\text{VCReg}}$ | ✓ |
| VF_VCReg_pred | $\mathcal{L}_{\text{VF}}$ & $\mathcal{L}_{\text{VCReg}}$ | × |
Table: S4.T2 (continued): Planning results for the VF variants

| Type | WS | WB | Maze |
|---|---|---|---|
| VF_pred | 0.55 | 0.75 | 0.49 |
| VF_quasi | 0.71 | 0.96 | 0.63 |
| VF_quasi_pred | 0.61 | 0.85 | 0.43 |
| VF_VCReg | 0.49 | 0.75 | 0.39 |
| VF_VCReg_pred | 0.47 | 0.67 | 0.39 |
References
[hilp] Seohong Park, Tobias Kreiman, Sergey Levine. (2024). Foundation Policies with Hilbert Representations.
[xu2023policyguidedimitationapproachoffline] Haoran Xu, Li Jiang, Jianxiong Li, Xianyuan Zhan. (2023). A Policy-Guided Imitation Approach for Offline Reinforcement Learning.
[kostrikov2021offlinereinforcementlearningimplicit] Ilya Kostrikov, Ashvin Nair, Sergey Levine. (2021). Offline Reinforcement Learning with Implicit Q-Learning.
[ma2021offlinereinforcementlearningvaluebased] Xiaoteng Ma, Yiqin Yang, Hao Hu, Qihan Liu, Jun Yang, Chongjie Zhang, Qianchuan Zhao, Bin Liang. (2021). Offline Reinforcement Learning with Value-based Episodic Memory.
[williams2015modelpredictivepathintegral] Grady Williams, Andrew Aldrich, Evangelos Theodorou. (2015). Model Predictive Path Integral Control using Covariance Variable Importance Sampling.
[sobal2025learningrewardfreeofflinedata] Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, Yann LeCun. (2025). Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models.
[zhou2025dinowmworldmodelspretrained] Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto. (2025). DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning.
[bardes2022vicregvarianceinvariancecovarianceregularizationselfsupervised] Adrien Bardes, Jean Ponce, Yann LeCun. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning.
[ghosh2023reinforcementlearningpassivedata] Dibya Ghosh, Chethan Bhateja, Sergey Levine. (2023). Reinforcement Learning from Passive Data via Latent Intentions.
[ding2025understandingworldpredictingfuture] Jingtao Ding, Yunke Zhang, Yu Shang, Yuheng Zhang, Zefang Zong, Jie Feng, Yuan Yuan, Hongyuan Su, Nian Li, Nicholas Sukiennik, Fengli Xu, Yong Li. (2025). Understanding World or Predicting Future? A Comprehensive Survey of World Models.
[alma990026608780107879] Johnson-Laird, P. N.. (1983). Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness.
[LeCun2022APT] Yann LeCun. (2022). A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27.
[WM] Ha, David, Schmidhuber, Jürgen. (2018). World Models. doi:10.5281/ZENODO.1207631.
[balestriero2024learningreconstructionproducesuninformative] Randall Balestriero, Yann LeCun. (2024). Learning by Reconstruction Produces Uninformative Features For Perception.
[assran2023selfsupervisedlearningimagesjointembedding] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture.
[garrido2024learningleveragingworldmodels] Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, Yann LeCun. (2024). Learning and Leveraging World Models in Visual Representation Learning.
[bardes2024revisitingfeaturepredictionlearning] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video.
[assran2025vjepa2selfsupervisedvideo] Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas. (2025). V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning.
[garrido2025intuitivephysicsunderstandingemerges] Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun. (2025). Intuitive physics understanding emerges from self-supervised pretraining on natural videos.
[oquab2024dinov2learningrobustvisual] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. (2024). DINOv2: Learning Robust Visual Features without Supervision.
[GARCIA1989335] Carlos E. García, David M. Prett, Manfred Morari. (1989). Model predictive control: Theory and practice—A survey. Automatica. doi:10.1016/0005-1098(89)90002-2.
[florence2021implicitbehavioralcloning] Pete Florence, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, Jonathan Tompson. (2021). Implicit Behavioral Cloning.
[jordana2025infinitehorizonvaluefunctionapproximation] Armand Jordana, Sébastien Kleff, Arthur Haffemayer, Joaquim Ortiz-Haro, Justin Carpentier, Nicolas Mansard, Ludovic Righetti. (2025). Infinite-Horizon Value Function Approximation for Model Predictive Control.
[lawrence2025viewlearningrobustgoalconditioned] Nathan P. Lawrence, Philip D. Loewen, Michael G. Forbes, R. Bhushan Gopaluni, Ali Mesbah. (2025). A view on learning robust goal-conditioned value functions: Interplay between RL and MPC.
[levine2020offlinereinforcementlearningtutorial] Sergey Levine, Aviral Kumar, George Tucker, Justin Fu. (2020). Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems.
[pmlr-v37-schaul15] Schaul, Tom, Horgan, Daniel, Gregor, Karol, Silver, David. (2015). Universal Value Function Approximators. Proceedings of the 32nd International Conference on Machine Learning.
[park2024hiqlofflinegoalconditionedrl] Seohong Park, Dibya Ghosh, Benjamin Eysenbach, Sergey Levine. (2024). HIQL: Offline Goal-Conditioned RL with Latent States as Actions.
[hafner2024masteringdiversedomainsworld] Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap. (2024). Mastering Diverse Domains through World Models.
[gupta2019relaypolicylearningsolving] Abhishek Gupta, Vikash Kumar, Corey Lynch, Sergey Levine, Karol Hausman. (2019). Relay Policy Learning: Solving Long-Horizon Tasks via Imitation and Reinforcement Learning.
[shwartzziv2024informationtheoreticperspectivevarianceinvariancecovarianceregularization] Ravid Shwartz-Ziv, Randall Balestriero, Kenji Kawaguchi, Tim G. J. Rudner, Yann LeCun. (2024). An Information-Theoretic Perspective on Variance-Invariance-Covariance Regularization.
[10.5555/3312046] Sutton, Richard S., Barto, Andrew G.. (2018). Reinforcement Learning: An Introduction.
[wang2023optimalgoalreaching] Tongzhou Wang, Antonio Torralba, Phillip Isola, Amy Zhang. (2023). Optimal Goal-Reaching Reinforcement Learning via Quasimetric Learning.
[wang2022improved] Tongzhou Wang, Phillip Isola. (2022). Improved Representation of Asymmetrical Distances with Interval Quasimetric Embeddings. NeurIPS 2022 Workshop on Symmetry and Geometry in Neural Representations.
[goodfellow2015explainingharnessingadversarialexamples] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. (2015). Explaining and Harnessing Adversarial Examples.
[kingma2022autoencodingvariationalbayes] Diederik P Kingma, Max Welling. (2022). Auto-Encoding Variational Bayes.
[fu2021d4rldatasetsdeepdatadriven] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, Sergey Levine. (2021). D4RL: Datasets for Deep Data-Driven Reinforcement Learning.
[ostrovskii2017newapproachlowdistortionembeddings] Mikhail I. Ostrovskii, Beata Randrianantoanina. (2017). A new approach to low-distortion embeddings of finite metric spaces into non-superreflexive Banach spaces.
[pitis2020inductivebiasdistancesneural] Silviu Pitis, Harris Chan, Kiarash Jamali, Jimmy Ba. (2020). An Inductive Bias for Distances: Neural Nets that Respect the Triangle Inequality.
[farshidian2019deepvaluemodelpredictive] Farbod Farshidian, David Hoeller, Marco Hutter. (2019). Deep Value Model Predictive Control.