Closing the Train-Test Gap in Gradient-Based Planning
Arjun Parthasarathy, Nimit Kalra, Rohun Agrawal, Yann LeCun, Oumayma Bounou, Pavel Izmailov, Micah Goldblum
Abstract
World models paired with model predictive control (MPC) can be trained offline on large-scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively solving optimization problems exactly, gradient-based planning offers a computationally efficient alternative. However, the performance of gradient-based planning has thus far lagged behind that of other approaches. In this paper, we propose improved methods for training world models that enable efficient gradient-based planning. We begin with the observation that although a world model is trained on a next-state prediction objective, it is used at test-time to instead estimate a sequence of actions. The goal of our work is to close this train-test gap. To that end, we propose train-time data synthesis techniques that enable significantly improved gradient-based planning with existing world models. At test time, our approach outperforms or matches the classical gradient-free cross-entropy method (CEM) across a variety of object manipulation and navigation tasks in 10% of the time budget.
Introduction
In robotic tasks, anticipating how the actions of an agent affect the state of its environment is fundamental for both prediction (Finn et al., 2016) and planning (Mohanan & Salgoankar, 2018; Kavraki et al., 2002). Classical approaches derive models of the environment evolution analytically from first principles, relying on prior knowledge of the environment, the agent, and any uncertainty (Goldstein et al., 1950; Siciliano et al., 2009; Spong et al., 2020). In contrast, learning-based methods infer such models directly from data, enabling them to capture complex dynamics and thus improve generalization and robustness to uncertainty (Sutton et al., 1998; Schrittwieser et al., 2020; LeCun, 2022).
World models (Ha & Schmidhuber, 2018), in particular, have emerged as a powerful paradigm. Given the current state and an action, the world model predicts the resulting next state. These models can be learned either from exact state information (Sutton, 1991) or from high-dimensional sensory inputs such as images (Hafner et al., 2023). The latter setup is especially compelling as it enables perception, prediction, and control directly from raw images by leveraging pre-trained visual representations, and removes the need for measuring the precise environment states which is difficult in practice (Assran et al., 2023; Bardes et al., 2024). Recently, world models and their predictive capabilities have been leveraged for planning, enabling agents to solve a variety of tasks (Hafner et al., 2019a;b; Schrittwieser et al., 2020; Hafner et al., 2023; Zhou et al., 2025). A model of the dynamics is learned offline, while the planning task is defined at inference as a constrained optimization problem: given the current state, find a sequence of actions that results in a state as close as possible to the target state. This inference-time optimization provides an effective alternative to reinforcement learning approaches (Sutton et al., 1998) that often suffer from poor sample-efficiency.
Correspondence to: Nimit Kalra (nimit@utexas.edu) and Rohun Agrawal (rohun.agrawal@columbia.edu).


Figure 1: An overview of our two proposed methods. When planning with a world model, actions may result in trajectories that lie outside the distribution of expert trajectories on which the world model was trained, leading to inaccurate world modeling. Online World Modeling finetunes a pretrained world model by using the simulator to correct trajectories produced via gradient-based planning, leading to accurate world modeling beyond the expert trajectory distribution. Adversarial World Modeling finetunes a world model on perturbations of actions and expert trajectories, promoting robustness and smoothing the world model's input gradients.
World models are compatible with many model-based planning algorithms. Traditional methods such as DDP (Mayne, 1966) and iLQR (Li & Todorov, 2004) rely on iteratively solving exact optimization problems derived from linear and quadratic approximations of the dynamics around a nominal trajectory. While highly effective in low-dimensional settings, these methods become impractical for large-scale world models, where solving the resulting optimization problem is computationally intractable. As an alternative, search-based methods such as the Cross Entropy Method (CEM) (Rubinstein & Kroese, 2004) and Model Predictive Path Integral control (MPPI) (Williams et al., 2017a) have been widely adopted as gradient-free alternatives and have proven effective in practice. However, they are computationally intensive as they require iteratively sampling candidate solutions and performing world model rollouts to evaluate each one, a procedure that scales poorly in high-dimensional spaces. Gradient-based methods (SV et al., 2023), in contrast, avoid the limitations of sampling by directly exploiting the differentiability of world models to optimize actions end-to-end. These methods eliminate the costly rollouts required by search-based approaches, thus scaling more efficiently in high-dimensional spaces. Despite this promise, gradient-based approaches have thus far seen limited empirical success.
This procedure suffers from a fundamental train-test gap. World models are typically trained using a next-state prediction objective on datasets of expert trajectories. At test time, however, they are used to optimize a planning objective over sequences of actions. We argue that this mismatch underlies the poor empirical performance of gradient-based planning (GBP), and we offer two hypotheses to explain why. (1) During planning, the intermediate sequences of actions explored by gradient descent drive the world model into states that were not encountered during training. In these out-of-distribution states, model errors compound, making the world model unreliable as a surrogate for optimization. (2) The action-level optimization landscape induced by the world model may be difficult to traverse, containing many poor local minima or flat regions, which hinders effective gradient-based optimization.
In this work, we address both of these challenges by proposing two algorithms: Online World Modeling and Adversarial World Modeling. Both expand the region of familiar latent states by continuously adding new trajectories to the dataset and finetuning the world model on them. To manage the distribution shift between offline expert trajectories and predicted trajectories from planning, Online World Modeling uses the environment simulator to correct states along a trajectory produced by performing GBP. Finetuning on these corrected trajectories ensures that the world model performs sufficiently well when GBP enters regimes of latent state space outside of the expert trajectory distribution. To overcome the difficulties of optimizing over a non-smooth loss surface during GBP, Adversarial World Modeling perturbs expert trajectories in the direction that maximizes the world model's loss. Adversarial finetuning smooths the induced action loss landscape, making it easier to optimize via gradient-based planning. We provide a visual depiction of both methods in Figure 1.
We show that finetuning world models with these algorithms leads to substantial improvements in the performance of gradient-based planning (GBP). Applying Adversarial World Modeling to a pretrained world model enables gradient-based planning to match or exceed the performance of search-based CEM on a variety of robotic object manipulation and navigation tasks. Importantly, this performance is achieved with a 10× reduction in computation time compared to CEM, underscoring the practicality of our approach for real-world planning. Additionally, we empirically demonstrate that Adversarial World Modeling smooths the planning loss landscape, and that both methods can reverse the train-test gap in world model error.
Online and Adversarial World Modeling
Problem formulation
World models learn environment dynamics by predicting the state resulting from taking an action in the current state. Then, at test time, the learned world model enables planning by simulating future trajectories and guiding action optimization. Formally, a world model approximates the (potentially unknown) dynamics function h : S × A → S , where S denotes the state space and A the action space. The environment evolves according to
$$
s_{t+1} = h(s_t, a_t),
$$
where $s_t \in S$ and $a_t \in A$ denote the state and action at time $t$, respectively.
Latent world models. In practice, we typically do not have access to the exact state of the environment; instead, we only receive partial observations of it, such as images. In order for a world model to efficiently learn in the high-dimensional observation space $O$, an embedding function $\Phi_\mu : O \to Z$ is employed to map observations to a lower-dimensional latent space $Z$. Then, given an embedding function $\Phi_\mu$, our goal is to learn a latent world model $f_\theta : Z \times A \to Z$, such that
$$
f_\theta\big(\Phi_\mu(o_t), a_t\big) \approx \Phi_\mu(o_{t+1}).
$$
The choice of Φ µ directly affects the expressivity of the latent world model. In this work, we use a fixed encoder pretrained with self-supervised learning that yields rich feature representations out of the box.
Training. To train a latent world model, we sample triplets of the form $(o_t, a_t, o_{t+1})$ from an offline dataset of trajectories $\mathcal{T}$ and minimize the $\ell_2$ distance between the true next latent state $z_{t+1} = \Phi_\mu(o_{t+1})$ and the predicted next latent state $\hat z_{t+1}$. This procedure is represented by the following teacher-forcing objective:
$$
\min_{\theta} \; \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim \mathcal{T}} \Big[ \big\| f_\theta\big(\Phi_\mu(o_t), a_t\big) - \Phi_\mu(o_{t+1}) \big\|_2^2 \Big]
$$
Notably, we only minimize this objective with respect to the world model's parameters θ , not those of the potentially large embedding function.
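To make the teacher-forcing objective concrete, here is a minimal runnable sketch in which a toy linear map stands in for $f_\theta$; the dimensions, learning rate, and synthetic triplets are our illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Toy stand-in for the latent world model: f(z, a) = W @ [z; a].
# The paper's model is a ViT; this linear sketch only illustrates the
# teacher-forcing objective min_theta ||f(z_t, a_t) - z_{t+1}||^2.
rng = np.random.default_rng(0)
dz, da = 4, 2                                  # assumed latent/action dims

def predict(W, z, a):
    return W @ np.concatenate([z, a])

def teacher_forcing_step(W, batch, lr=1e-2):
    """One SGD step on the mean squared next-latent prediction error."""
    grad = np.zeros_like(W)
    for z_t, a_t, z_next in batch:
        x = np.concatenate([z_t, a_t])
        grad += 2.0 * np.outer(W @ x - z_next, x)   # d/dW ||Wx - z'||^2
    return W - lr * grad / len(batch)

# Synthetic (o_t, a_t, o_{t+1})-style triplets, already encoded to latents,
# generated by a hidden linear environment the learner must recover.
W_env = rng.normal(size=(dz, dz + da))
batch = [(z, a, predict(W_env, z, a))
         for z, a in ((rng.normal(size=dz), rng.normal(size=da))
                      for _ in range(256))]

loss = lambda W: float(np.mean([np.sum((predict(W, z, a) - zn) ** 2)
                                for z, a, zn in batch]))
W = rng.normal(scale=0.1, size=(dz, dz + da))
loss_before = loss(W)
for _ in range(400):
    W = teacher_forcing_step(W, batch)
loss_after = loss(W)
```

With the actual world model, the same loop is simply an optimizer step on the $\ell_2$ loss above, with the encoder $\Phi_\mu$ kept frozen.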
Planning. During test-time, we use a learned world model to optimize candidate action sequences for reaching a goal state. By recursively applying the world model over an action sequence starting from an initial latent state, we obtain a predicted latent goal state and therefore the distance to the true goal state in latent space. This allows us to find the optimal action sequence
$$
\{\hat a_t\}_{t=1}^{H} = \operatorname*{arg\,min}_{\{a_t\}_{t=1}^{H}} \big\| \hat z_{H+1} - \Phi_\mu(o_{\text{goal}}) \big\|_2^2,
$$
where $\hat z_{H+1}$ is produced by the recursive procedure
$$
\hat z_{t+1} = f_\theta(\hat z_t, \hat a_t), \qquad \hat z_1 = \Phi_\mu(o_1), \qquad t = 1, \dots, H.
$$
Gradient-based planning (GBP) solves the planning objective above via gradient descent. Crucially, since the world model is differentiable, $\nabla_{\{\hat a_t\}} \hat z_{H+1} = \nabla_{\{\hat a_t\}} \operatorname{rollout}_f(z_1, \{\hat a_t\})_{H+1}$ is well-defined. In contrast, the search-based CEM is gradient-free, but requires evaluating substantially more action sequences. We detail GBP in Algorithm 1 and CEM in Section A.2.
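For concreteness, here is a runnable sketch of this procedure on a toy linear latent model, where backpropagation through the rollout can be written out by hand; with a real world model, autograd supplies the same gradient. The dynamics, horizon, and step size below are our assumptions.

```python
import numpy as np

# Gradient-based planning on a toy linear latent model z' = A z + B a.
# Backprop through the rollout is written explicitly; with a real
# differentiable world model, autograd computes the same quantity.
rng = np.random.default_rng(1)
dz, H = 3, 5
A = 0.9 * np.eye(dz) + 0.05 * rng.normal(size=(dz, dz))
B = np.eye(dz) + 0.1 * rng.normal(size=(dz, dz))    # fully actuated toy

def rollout(z1, actions):
    """Recursively apply the model to obtain the predicted final latent."""
    z = z1
    for a in actions:
        z = A @ z + B @ a
    return z

def gbp(z1, z_goal, steps=3000, lr=0.01):
    actions = rng.normal(size=(H, dz))              # random initialization
    for _ in range(steps):
        g = 2.0 * (rollout(z1, actions) - z_goal)   # d loss / d z_{H+1}
        for t in range(H - 1, -1, -1):              # chain rule through time
            actions[t] -= lr * B.T @ g              # d z_{H+1} / d a_t
            g = A.T @ g                             # push gradient one step back
    return actions

z1, z_goal = rng.normal(size=dz), rng.normal(size=dz)
plan = gbp(z1, z_goal)
final_err = float(np.sum((rollout(z1, plan) - z_goal) ** 2))
```

Note that each iteration costs one forward rollout and one backward sweep, whereas CEM would evaluate a full rollout per candidate sample per iteration.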
Online World Modeling
During gradient-based planning, the action sequences being optimized are not constrained to lie within the distribution of behavior seen during training. World models are typically trained on fixed datasets of expert trajectories, whereas GBP selects actions solely to improve the planning objective, without regard to whether those actions resemble expert behavior. As a result, the optimization process often proposes action sequences that are out of distribution. Optimizing through learned models under such conditions is known to induce adversarial inputs (Szegedy et al., 2013; Goodfellow et al., 2014). In our setting, these adversarial action sequences drive the world model into regions of the latent state space that were rarely or never observed during training, causing large prediction errors. Even when errors are initially small, they accumulate as the planner rolls the model forward, ultimately degrading long-horizon planning performance.
To address this issue, we propose Online World Modeling , which iteratively corrects the trajectories produced by GBP and finetunes the world model on the resulting rollouts. Rather than training solely on expert demonstrations, we repeatedly incorporate trajectories induced by the planner itself, thereby expanding the region of latent states that the world model can reliably predict.
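A minimal end-to-end sketch of this loop follows, with a hidden linear system as the "simulator" and a least-squares refit as the finetuning step; both are stand-ins we chose for illustration, since the paper finetunes a ViT world model on latent trajectories.

```python
import numpy as np

# Online World Modeling in miniature: plan with the current model, execute
# the planned action in the simulator to get the *corrected* next state,
# add it to the dataset, and refit. The linear model/simulator and the
# least-squares "finetune" are illustrative stand-ins.
rng = np.random.default_rng(2)
dz, da = 3, 2
A_true, B_true = 0.8 * np.eye(dz), rng.normal(size=(dz, da))
sim_step = lambda z, a: A_true @ z + B_true @ a      # environment dynamics

model_step = lambda W, z, a: W @ np.concatenate([z, a])

def plan(W, z1, z_goal, steps=500, lr=0.02):
    """One-step gradient-based planning: min_a ||f(z1, a) - z_goal||^2."""
    Wz, Wa = W[:, :dz], W[:, dz:]
    a = rng.normal(size=da)
    for _ in range(steps):
        a -= lr * 2.0 * Wa.T @ (Wz @ z1 + Wa @ a - z_goal)
    return a

W = rng.normal(scale=0.1, size=(dz, dz + da))        # initially wrong model
dataset = []
for _ in range(200):                                 # online rounds
    z1, z_goal = rng.normal(size=dz), rng.normal(size=dz)
    a = plan(W, z1, z_goal)                          # planner-induced action
    dataset.append((z1, a, sim_step(z1, a)))         # simulator correction
    # "Finetune": refit W on every planner-visited transition so far.
    X = np.stack([np.concatenate([z, a_]) for z, a_, _ in dataset])
    Y = np.stack([zn for _, _, zn in dataset])
    W = np.linalg.lstsq(X, Y, rcond=None)[0].T

# The corrected model now predicts well even on fresh, non-expert inputs.
z, a = rng.normal(size=dz), rng.normal(size=da)
pred_err = float(np.sum((model_step(W, z, a) - sim_step(z, a)) ** 2))
```

The essential structure carries over to the full method: the transitions the planner actually induces, not the expert's, are what get added to the training set.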
Adversarial World Modeling
Since world models are only trained on the next-state prediction objective, there is no particular reason for their input gradients to be well-behaved. Adversarial training has been shown to result in better-behaved input gradients (Mejia et al., 2019), consequently smoothing the input loss surface. Motivated by this observation, we propose an adversarial training objective that explicitly targets regions of the state-action space where the world model is expected to perform poorly. These adversarial samples may lie outside the expert trajectory distribution, which can expose the model to precisely the regions that matter for action optimization. We find that this procedure, which we call Adversarial World Modeling, does in fact smooth the loss surface of the planning objective (see Figure 2), improving the stability of action-sequence optimization.
Adversarial training improves model robustness by optimizing performance under worst-case perturbations (Madry et al., 2018). An adversarial example is generated by applying a perturbation δ to an input that maximally increases the model's loss. To train a world model on adversarial examples, we use the objective
$$
\min_{\theta} \; \mathbb{E}_{(z_t, a_t, z_{t+1})} \Big[ \max_{\delta_a \in B_a, \, \delta_z \in B_z} \big\| f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \big\|_2^2 \Big]
$$
where $B_a = \{\delta_a : \|\delta_a\|_\infty \le \epsilon_a\}$ and $B_z = \{\delta_z : \|\delta_z\|_\infty \le \epsilon_z\}$ constrain the magnitude of perturbations for given $\epsilon_a, \epsilon_z$. Training on these adversarially perturbed trajectories provides an alternative method to Online World Modeling for surfacing states that may be encountered during planning, without relying on GBP rollouts. This is a significant advantage in settings where simulation is expensive or infeasible.
We generate adversarial latent states using the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014), which efficiently approximates the worst-case perturbations that maximize prediction error (Wong et al., 2020). Although stronger iterative attacks such as Projected Gradient Descent (PGD) can be used, we find that FGSM delivers comparable improvements in GBP performance while being significantly more computationally efficient (see Section D.1). This enables us to generate adversarial samples over entire large-scale offline imitation learning datasets.
For each state-action pair in a given minibatch, we look for small changes to the latent state or action that most increase the world model's prediction error. Let $\epsilon_a, \epsilon_z$ denote the radii of the perturbations to the actions $\{a_t\}$ and latent states $\{z_t\}$, respectively. We compute gradients $\nabla_{\delta_a, \delta_z} \| f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \|_2^2$ with respect to the perturbations and take a signed gradient ascent step (i.e., in a direction that degrades the prediction) with step sizes $\alpha_a = 1.25\,\epsilon_a$, $\alpha_z = 1.25\,\epsilon_z$. We clip the result so that each entry of the perturbation stays within the radius. This procedure corresponds to a single step of a PGD-style attack, producing perturbations that lie on the edge of the allowed region where they are maximally challenging for the model. See Algorithm 3 for a detailed treatment.
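The perturbation step above can be sketched as follows. We use a toy linear model with analytic input gradients in place of the ViT (in practice, autograd provides $\nabla_{\delta_a, \delta_z}$); the radii are arbitrary illustrative values.

```python
import numpy as np

# Single-step FGSM-style perturbation of a world model's inputs: signed
# ascent step of size 1.25*eps on the prediction loss, clipped back into
# the infinity-norm ball. The linear model is an illustrative stand-in.
rng = np.random.default_rng(3)
dz, da = 4, 2
Wz, Wa = rng.normal(size=(dz, dz)), rng.normal(size=(dz, da))
f = lambda z, a: Wz @ z + Wa @ a

def fgsm_perturb(z, a, z_next, eps_z=0.2, eps_a=0.08):
    err = f(z, a) - z_next                       # current prediction error
    # Analytic gradients of ||f(z + dz, a + da) - z_next||^2 at dz = da = 0
    g_z, g_a = 2.0 * Wz.T @ err, 2.0 * Wa.T @ err
    d_z = np.clip(1.25 * eps_z * np.sign(g_z), -eps_z, eps_z)
    d_a = np.clip(1.25 * eps_a * np.sign(g_a), -eps_a, eps_a)
    return z + d_z, a + d_a

z, a = rng.normal(size=dz), rng.normal(size=da)
z_next = f(z, a) + 0.01 * rng.normal(size=dz)    # model is nearly perfect here
z_adv, a_adv = fgsm_perturb(z, a, z_next)
clean_loss = float(np.sum((f(z, a) - z_next) ** 2))
adv_loss = float(np.sum((f(z_adv, a_adv) - z_next) ** 2))
```

Because the ascent step saturates and is then clipped, each perturbation entry lands exactly on the boundary of the allowed ball wherever the gradient is nonzero.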
Experiments
We evaluate our methods by finetuning world models pretrained with the next-state prediction objective on 3 tasks: PushT, PointMaze, and Wall. For each task, we measure the success rate of reaching a target configuration $o_{\text{goal}}$ from an initial configuration $o_1$. We report planning results with both open-loop planning and MPC in Table 1. In the open-loop setting, we run Algorithm 1 from $o_1$ once and evaluate the predicted action sequence. In the MPC setting, we run Algorithm 1 once per MPC step (using $\Phi_\mu(o_1)$ as the initial latent state for the first MPC step), roll out the predicted actions $\{\hat a_t\}$ in the environment simulator to reach latent state $\hat z_{H+1}$, and set $\hat z_1 = \hat z_{H+1}$ for the next MPC iteration. We report all finetuning, planning, and optimization hyperparameters in Table 3.
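The MPC protocol can be sketched as a receding-horizon loop; here, again on a toy linear system where the simulator and model coincide, with horizon, round count, and rates chosen by us for illustration.

```python
import numpy as np

# Receding-horizon (MPC) evaluation: plan H actions by gradient descent,
# execute them in the simulator, then re-plan from the state actually
# reached. Toy linear dynamics stand in for both simulator and model.
rng = np.random.default_rng(4)
dz, H = 3, 2
A = 0.9 * np.eye(dz)
B = np.eye(dz) + 0.1 * rng.normal(size=(dz, dz))   # fully actuated toy
step = lambda z, a: A @ z + B @ a

def plan(z1, z_goal, steps=800, lr=0.01):
    acts = rng.normal(size=(H, dz))
    for _ in range(steps):
        z = z1
        for a in acts:
            z = step(z, a)                          # model rollout
        g = 2.0 * (z - z_goal)
        for t in range(H - 1, -1, -1):              # backprop through time
            acts[t] -= lr * B.T @ g
            g = A.T @ g
    return acts

z, z_goal = rng.normal(size=dz), rng.normal(size=dz)
for _ in range(5):                                  # MPC iterations
    for a in plan(z, z_goal):                       # re-plan every iteration
        z = step(z, a)                              # execute in the simulator
mpc_err = float(np.sum((z - z_goal) ** 2))
```

Re-planning from the executed state is what lets MPC recover from model error mid-trajectory, which is why the MPC rows in Table 1 dominate their open-loop counterparts.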
We use DINO-WM (Zhou et al., 2025) as our initial world model for its strong performance with CEM across our chosen tasks. The embedding function $\Phi_\mu$ is taken to be the pre-trained DINOv2 encoder (Oquab et al., 2024) and remains frozen while finetuning the transition model $f_\theta$, which is implemented using the ViT architecture (Dosovitskiy et al., 2021). We additionally train a VQ-VAE decoder (van den Oord et al., 2018) to visualize latent states, though it plays no role in planning. To validate the broad applicability of our approach, we also study the use of the IRIS (Micheli et al., 2023) world model architecture in Section B.3.
To initialize the action sequence for planning optimization, we evaluate both random sampling from a standard normal distribution and the use of an initialization network. Our initialization network $g_\theta : Z \times Z \to A^T$ is trained such that $g_\theta(z_1, z_g) = \{\hat a_t\}_{t=1}^{T}$. We find that random initialization tends to outperform the initialization network, and we analyze its impact in depth in Section B.1.
During GBP, we set $L_{\text{goal}}$ in Algorithm 1 to a weighted goal loss to obtain a gradient from each predicted state instead of only the last one. We find empirically that this choice generalizes to both navigation (e.g., PointMaze and Wall) and non-navigation tasks (e.g., PushT); i.e., on tasks with or without subgoal decomposability, this objective improves or matches the performance of the final-state loss. We provide the exact formulation and more details in Section A.4. We additionally evaluate using the Adam optimizer (Kingma & Ba, 2014) during GBP. Although Adam improves performance significantly over GD for all world models in our experiments, we find that Adam alone does not scale performance to match or surpass CEM.
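One way such a weighted goal loss can look is sketched below; the exponential weighting (heaviest at the horizon) is our illustrative assumption, with the exact formulation given in Section A.4.

```python
import numpy as np

# A weighted goal loss: every predicted latent along the rollout contributes
# its distance to the goal, not only the final state. The exponential
# weighting scheme here is an illustrative assumption, not the paper's
# exact formulation (which appears in its Section A.4).
def weighted_goal_loss(pred_states, z_goal, gamma=0.5):
    H = len(pred_states)
    weights = np.array([gamma ** (H - 1 - t) for t in range(H)])  # w_H = 1
    weights /= weights.sum()
    return float(sum(w * np.sum((z - z_goal) ** 2)
                     for w, z in zip(weights, pred_states)))

# Intermediate progress is rewarded: a trajectory at the goal throughout
# scores lower than one that only arrives at the final step.
early = [np.ones(2), np.ones(2)]                 # at goal the whole time
late = [np.zeros(2), np.ones(2)]                 # reaches goal at the end
loss_early = weighted_goal_loss(early, np.ones(2))
loss_late = weighted_goal_loss(late, np.ones(2))
```

The practical effect is that every predicted state contributes a gradient, rather than only $\hat z_{H+1}$.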
∗ We could not reproduce the Wall environment open-loop CEM success rate reported in DINO-WM (74% over our 32%), so we report their better value.

Figure 3: Planning efficiency of DINO-WM, Online World Modeling, and Adversarial World Modeling on the PushT task. Gradient-based planning is orders of magnitude faster than CEM.
Planning Results
On all three tasks, our methods outperform DINO-WM with gradient-descent GBP and either match or outperform it with the far more expensive CEM. In the open-loop setting, we achieve success-rate increases of +18% on PushT, +20% on PointMaze, and +30% on Wall. In the MPC setting, Adam GBP with Adversarial World Modeling outperforms CEM with DINO-WM on PointMaze and Wall and matches CEM on PushT.
While both Online World Modeling and Adversarial World Modeling bootstrap new data to improve the robustness of our world model during GBP, the distributions they induce are quite different. Whereas Online World Modeling anticipates and covers the distribution seen at planning time, Adversarial World Modeling exploits the current loss landscape of the world model to encourage local smoothness near expert trajectories. For all environments, we find Adversarial World Modeling outperforms Online World Modeling when using Adam to perform GBP.
To demonstrate the advantages of Adversarial World Modeling in more complex environments where the simulator may be very costly and the number of action dimensions is larger, we also evaluate planning performance on two robotic manipulation tasks in Section B.2.
Train-Test Gap
Comparing the world model error between training trajectories and planning trajectories allows us to evaluate whether the world model will perform well during planning even if it is trained to convergence on expert trajectories. We evaluate world model error as the deviation between the world model's predicted next latent state and the next latent state given by the environment simulator. Given an initial state $s_1$ (associated with $o_1$) and a sequence of actions $\{a_t\}$ (either from the training dataset or a planning procedure), the world model error at timestep $t$ is given by

Figure 4: Difference in World Model Error between expert and planning trajectories on PushT.
$$
\operatorname{err}_t = \big\| \operatorname{rollout}_f(z_1, \{a_t\})_{t+1} - \Phi_\mu(o_{t+1}) \big\|_2^2,
$$
where $o_{t+1}$ is the observation returned by the simulator after executing $a_1, \dots, a_t$ from $s_1$. A train-test gap appears when the world model performs relatively worse on sequences of actions produced during planning. Figure 4 demonstrates that this is the case with DINO-WM, but not with Online World Modeling or Adversarial World Modeling, indicating a narrowing of the train-test gap. See Section B.6 for results for PointMaze and Wall.
Loss Surface Visualization
We include visualizations of planning trajectories for DINO-WM, Online World Modeling, and Adversarial World Modeling to further study their success and failure modes. Visualizations for PushT and Wall can be found in Figures 10 and 11 respectively.

(a) We see that DINO-WM is more likely to enter states outside of the training distribution, and so the decoder is not able to reconstruct the state accurately. This is not the case with Online World Modeling, but it still fails to successfully reach the goal state. Adversarial World Modeling successfully completes the task.

(b) Again we notice the failure of DINO-WM's decoder to reconstruct states it encounters during planning; this is not the case with Online World Modeling and Adversarial World Modeling, which both complete the task successfully.
Figure 10: Trajectory visualizations of the PushT task. We plot the expert trajectory to reach the goal state, alongside both the simulator states and decoded latent states for DINO-WM, Online World Modeling, and Adversarial World Modeling.
(a) In this challenging example, all three world models enter states through planning that their respective decoders cannot reconstruct, but only Online World Modeling is able to complete the task successfully.
(b) In this example, we see that DINO-WM predicts that it successfully completed the task according to its reconstructed last latent state, but the simulator indicates the true position to be off of the goal state. Online and Adversarial World Modeling correct for this and successfully complete the task.
| | PushT | PushT | PushT | PointMaze | PointMaze | PointMaze | Wall | Wall | Wall |
|---|---|---|---|---|---|---|---|---|---|
| | GD | Adam | CEM | GD | Adam | CEM | GD | Adam | CEM |
| DINO-WM | 38 | 54 | 78 | 12 | 24 | 90 | 2 | 10 | 74 ∗ |
| + MPC | 56 | 76 | 92 | 42 | 68 | 90 | 12 | 80 | 82 |
| OnlineWM | 34 | 52 | 90 | 20 | 14 | 62 | 16 | 18 | 54 ∗ |
| + MPC | 50 | 76 | 92 | 54 | 88 | 96 | 38 | 80 | 90 |
| AdversarialWM | 56 | 82 | 94 | 32 | 70 | 88 | 32 | 34 | 30 ∗ |
| + MPC | 66 | 92 | 92 | 50 | 94 | 98 | 14 | 94 | 94 |
| Environment | H | Frameskip | Dataset Size | Trajectory Length |
|---|---|---|---|---|
| Push-T | 3 | 5 | 18500 | 100-300 |
| PointMaze | 3 | 5 | 2000 | 100 |
| Wall | 1 | 5 | 1920 | 50 |
| Rope | 1 | 1 | 1000 | 5 |
| Granular | 1 | 1 | 1000 | 5 |
(a) Finetuning Parameters

| Name | Value |
|---|---|
| Image size | 224 |
| Optimizer | AdamW |
| Predictor LR | 1e-5 |

(b) Open-Loop Planning

| Name | GD | Adam |
|---|---|---|
| Opt. steps | 300 | 300 |
| LR | 1.0 | 0.3 |

(c) MPC Parameters

| Name | GD | Adam |
|---|---|---|
| MPC steps | 10 | 10 |
| Opt. steps | 100 | 100 |
| LR | 1 | 0.2 |
| Environment | # Rollouts | Batch Size | GPU | Epochs | ϵ visual | ϵ proprio | ϵ action |
|---|---|---|---|---|---|---|---|
| PushT | 20000 (all) | 16 | 8x B200 | 1 | 0.05 | 0.02 | 0.02 |
| PointMaze | 2000 (all) | 16 | 1x B200 | 1 | 0.2 | 0.08 | 0.08 |
| Wall | 1920 (all) | 48 | 1x B200 | 2 | 0.2 | 0.08 | 0.08 |
| Environment | # Rollouts | Batch Size | GPU | Epochs |
|---|---|---|---|---|
| PushT | 6000 | 32 | 4x B200 | 1 |
| PointMaze | 500 | 32 | 4x B200 | 1 |
| Wall | 1920 (all) | 80 | 4x B200 | 1 |
| PushT | PushT | PointMaze | PointMaze | Wall | Wall | |
|---|---|---|---|---|---|---|
| GD+IN | Ad+IN | GD+IN | Ad+IN | GD+IN | Ad+IN | |
| DINO-WM | 44 60 | 62 84 | 16 40 | 14 54 | 4 6 | 12 32 |
| + MPC | 56 | 8 | 28 46 | |||
| OnlineWM + MPC | 66 | 10 | 18 | |||
| 52 | 82 | 40 | 2 | 22 | ||
| AdversarialWM | 74 | 90 | 22 | 36 | 18 | 24 |
| + MPC | 74 | 90 | 44 | 56 | 24 | 48 |
| Rope | Rope | Granular | Granular | |
|---|---|---|---|---|
| GD | CEM | GD | CEM | |
| DINO-WM | 1.73 | 0.93 | 0.30 | 0.22 |
| AdversarialWM | 0.93 | 0.82 | 0.24 | 0.28 |
| GD | CEM | |
|---|---|---|
| IRIS | 0 | 4 |
| IRIS + OnlineWM | 0 | 0 |
| IRIS + AdversarialWM | 8 | 6 |
| PushT | PointMaze | |
|---|---|---|
| DINO-WM | 16 | 70 |
| OnlineWM | 16 | 96 |
| AdversarialWM | 26 | 88 |
| PushT | PointMaze | Wall | |
|---|---|---|---|
| Simulator | 0.959 | 0.717 | 4.465 |
| DINO-WM | 0.029 | 0.029 | 0.029 |
| | Backward Passes | PointMaze | PointMaze | PointMaze | Wall | Wall | Wall |
|---|---|---|---|---|---|---|---|
| | | Min/Epoch | Open-Loop | MPC | Min/Epoch | Open-Loop | MPC |
| FGSM | 2 | 120 | 70 | 94 | 14 | 34 | 94 |
| 2-Step PGD | 3 | 165 | 80 | 96 | 20 | 8 | 90 |
| 3-Step PGD | 4 | 201 | 78 | 94 | 24 | 14 | 94 |
github.com/nimitkalra/robust-world-model-planning
In robotic tasks, anticipating how the actions of an agent affect the state of its environment is fundamental for both prediction (Finn2016UnsupervisedLFA) and planning (mohanan2018survey; kavraki2002probabilistic). Classical approaches derive models of the environment evolution analytically from first principles, relying on prior knowledge of the environment, the agent, and any uncertainty (goldstein1950classical; siciliano2009robotics; spong2020robot). In contrast, learning-based methods infer such models directly from data, enabling them to capture complex dynamics and thus improve generalization and robustness to uncertainty (sutton1998reinforcement; schrittwieser2020mastering; lecun2022path).
World models (ha2018world), in particular, have emerged as a powerful paradigm. Given the current state and an action, the world model predicts the resulting next state. These models can be learned either from exact state information (sutton1991dyna) or from high-dimensional sensory inputs such as images (hafner2023mastering). The latter setup is especially compelling as it enables perception, prediction, and control directly from raw images by leveraging pre-trained visual representations, and removes the need for measuring the precise environment states which is difficult in practice (assran2023self; Bardes2024RevisitingFPA). Recently, world models and their predictive capabilities have been leveraged for planning, enabling agents to solve a variety of tasks (hafner2019dream; hafner2019learning; schrittwieser2020mastering; hafner2023mastering; zhou2025dinowmworldmodelspretrained). A model of the dynamics is learned offline, while the planning task is defined at inference as a constrained optimization problem: given the current state, find a sequence of actions that results in a state as close as possible to the target state. This inference-time optimization provides an effective alternative to reinforcement learning approaches (sutton1998reinforcement) that often suffer from poor sample-efficiency.
World models are compatible with many model-based planning algorithms. Traditional methods such as DDP (mayne1966second) and iLQR (li2004iterative) rely on iteratively solving exact optimization problems derived from linear and quadratic approximations of the dynamics around a nominal trajectory. While highly effective in low-dimensional settings, these methods become impractical for large-scale world models, where solving the resulting optimization problem is computationally intractable. As an alternative, search-based methods such as the Cross Entropy Method (CEM) (rubinstein2004cross) and Model Predictive Path Integral control (MPPI) (williams2017model) have been widely adopted as gradient-free alternatives and have proven effective in practice. However, they are computationally intensive as they require iteratively sampling candidate solutions and performing world model rollouts to evaluate each one, a procedure that scales poorly in high-dimensional spaces. Gradient-based methods (sv2023gradient), in contrast, avoid the limitations of sampling by directly exploiting the differentiability of world models to optimize actions end-to-end. These methods eliminate the costly rollouts required by search-based approaches, thus scaling more efficiently in high-dimensional spaces. Despite this promise, gradient-based approaches have thus far seen limited empirical success.
This procedure suffers from a fundamental train-test gap. World models are typically trained using a next-state prediction objective on datasets of expert trajectories. At test time, however, they are used to optimize a planning objective over sequences of actions. We argue that this mismatch underlies the poor empirical performance of gradient-based planning (GBP), and we offer two hypotheses to explain why. (1) During planning, the intermediate action sequences explored by gradient descent drive the world model into states that were not encountered during training. In these out-of-distribution states, model errors compound, making the world model unreliable as a surrogate for optimization. (2) The action-level optimization landscape induced by the world model may be difficult to traverse, containing many poor local minima or flat regions, which hinders effective gradient-based optimization.
In this work, we address both of these challenges by proposing two algorithms: Online World Modeling and Adversarial World Modeling. Both expand the region of familiar latent states by continuously adding new trajectories to the dataset and finetuning the world model on them. To manage the distribution shift between offline expert trajectories and predicted trajectories from planning, Online World Modeling uses the environment simulator to correct states along a trajectory produced by performing GBP. Finetuning on these corrected trajectories ensures that the world model performs sufficiently well when GBP enters regimes of latent state space outside of the expert trajectory distribution. To overcome the difficulties of optimizing over a non-smooth loss surface during GBP, Adversarial World Modeling perturbs expert trajectories in the direction that maximizes the world model’s loss. Adversarial finetuning smooths the induced action loss landscape, making it easier to optimize via gradient-based planning. We provide a visual depiction of both methods in Figure 1.
We show that finetuning world models with these algorithms leads to substantial improvements in the performance of gradient-based planning (GBP). Applying Adversarial World Modeling to a pretrained world model enables gradient-based planning to match or exceed the performance of search-based CEM on a variety of robotic object manipulation and navigation tasks. Importantly, this performance is achieved with a 10× reduction in computation time compared to CEM, underscoring the practicality of our approach for real-world planning. Additionally, we empirically demonstrate that Adversarial World Modeling smooths the planning loss landscape, and that both methods can reverse the train-test gap in world model error.
World models learn environment dynamics by predicting the state resulting from taking an action in the current state. Then, at test time, the learned world model enables planning by simulating future trajectories and guiding action optimization. Formally, a world model approximates the (potentially unknown) dynamics function $h : \mathcal{S}\times\mathcal{A}\to\mathcal{S}$, where $\mathcal{S}$ denotes the state space and $\mathcal{A}$ the action space. The environment evolves according to
$$s_{t+1} = h(s_t, a_t),$$
where $s_t\in\mathcal{S}$ and $a_t\in\mathcal{A}$ denote the state and action at time $t$, respectively.
In practice, we typically do not have access to the exact state of the environment; instead, we only receive partial observations of it, such as images. In order for a world model to learn efficiently in the high-dimensional observation space $\mathcal{O}$, an embedding function $\Phi_\mu : \mathcal{O}\to\mathcal{Z}$ is employed to map observations to a lower-dimensional latent space $\mathcal{Z}$. Then, given an embedding function $\Phi_\mu$, our goal is to learn a latent world model $f_\theta : \mathcal{Z}\times\mathcal{A}\to\mathcal{Z}$, such that
$$f_\theta(\Phi_\mu(o_t), a_t) \approx \Phi_\mu(o_{t+1}).$$
The choice of $\Phi_\mu$ directly affects the expressivity of the latent world model. In this work, we use a fixed encoder pretrained with self-supervised learning that yields rich feature representations out of the box.
To train a latent world model, we sample triplets of the form $(o_t, a_t, o_{t+1})$ from an offline dataset of trajectories $\mathcal{T}$ and minimize the $\ell_2$ distance between the true next latent state $z_{t+1} = \Phi_\mu(o_{t+1})$ and the predicted next latent state $\hat{z}_{t+1}$. This procedure is represented by the following teacher-forcing objective:
$$\min_\theta \; \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim \mathcal{T}} \left\lVert f_\theta(\Phi_\mu(o_t), a_t) - \Phi_\mu(o_{t+1}) \right\rVert_2^2.$$
Notably, we only minimize this objective with respect to the world model's parameters $\theta$, not those of the potentially large embedding function.
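As a minimal illustration of this training setup, the sketch below fits a toy linear world model to frozen latent/action pairs by gradient descent on the teacher-forcing loss. The linear form, dimensions, and learning rate are all assumptions made for illustration (the paper's $f_\theta$ is a ViT); what it mirrors is that the loss is taken in latent space and only the world model's parameters are updated.

```python
import numpy as np

# Toy linear world model f_theta(z, a) = W @ concat(z, a); W plays the role of theta.
rng = np.random.default_rng(0)
dz, da, n = 4, 2, 256
W = rng.normal(size=(dz, dz + da)) * 0.1   # world-model parameters theta (trained)
W_true = rng.normal(size=(dz, dz + da))    # unknown "true" latent dynamics

z = rng.normal(size=(n, dz))               # z_t = Phi_mu(o_t), precomputed and frozen
a = rng.normal(size=(n, da))
x = np.concatenate([z, a], axis=1)
z_next = x @ W_true.T                      # targets z_{t+1} = Phi_mu(o_{t+1})

def loss(W):
    """Teacher-forcing objective: mean squared next-latent prediction error."""
    return np.mean(np.sum((x @ W.T - z_next) ** 2, axis=1))

lr = 0.05
for _ in range(300):
    resid = x @ W.T - z_next               # f_theta(z_t, a_t) - z_{t+1}
    grad = 2.0 * resid.T @ x / n           # gradient w.r.t. theta only; the encoder is frozen
    W -= lr * grad

print(loss(W))  # near zero: the model fits the next-latent targets
```

Note that the encoder outputs (`z`, `z_next`) never receive gradients, matching the frozen-embedding design above.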
During test-time, we use a learned world model to optimize candidate action sequences for reaching a goal state. By recursively applying the world model over an action sequence starting from an initial latent state, we obtain a predicted latent goal state and therefore the distance to the true goal state in latent space. This allows us to find the optimal action sequence
$$\{\hat{a}_t\}_{t=1}^{H} = \operatorname*{arg\,min}_{\{a_t\}_{t=1}^{H}} \left\lVert \hat{z}_{H+1} - z_{\text{goal}} \right\rVert_2^2,$$
where $\hat{z}_{H+1}$ is produced by the recursive procedure
$$\hat{z}_1 = z_1, \qquad \hat{z}_{t+1} = f_\theta(\hat{z}_t, a_t), \quad t = 1, \dots, H.$$
We use the function $\text{rollout}_f : \mathcal{Z}\times\mathcal{A}^H \to \mathcal{Z}^H$ to denote this recursive procedure.
Gradient-based planning (GBP) solves the planning objective (4) via gradient descent. Crucially, since the world model is differentiable, $\nabla_{\{\hat{a}_t\}} \hat{z}_{H+1} = \nabla_{\{\hat{a}_t\}} \text{rollout}_f(z_1, \{\hat{a}_t\})_{H+1}$ is well-defined. In contrast, the search-based CEM is gradient-free, but requires evaluating substantially more action sequences. We detail GBP in Algorithm 1 and CEM in Section A.2.
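To make the procedure concrete, here is a minimal GBP sketch assuming a toy linear world model $f(z, a) = Az + Ba$, so that backpropagation through the rollout can be written out analytically; the model form, dimensions, and optimizer settings are illustrative assumptions, not the paper's implementation (which differentiates a ViT with autograd).

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da, H = 4, 2, 5
A = 0.9 * np.eye(dz)                 # toy linear transition on the latent state
B = rng.normal(size=(dz, da)) * 0.5  # toy action effect

def rollout(z1, actions):
    """Recursively apply the world model f(z, a) = A z + B a."""
    zs = [z1]
    for a in actions:
        zs.append(A @ zs[-1] + B @ a)
    return zs

def gbp(z1, z_goal, steps=400, lr=0.02):
    """Minimize ||z_hat_{H+1} - z_goal||^2 over the action sequence by gradient descent."""
    actions = rng.normal(size=(H, da))
    for _ in range(steps):
        zs = rollout(z1, actions)
        # Backpropagate the goal loss through the rollout by hand:
        # dL/dz_{H+1} = 2 (z_hat_{H+1} - z_goal); dz_{t+1}/da_t = B; dz_{t+1}/dz_t = A.
        g = 2.0 * (zs[-1] - z_goal)
        grads = np.zeros_like(actions)
        for t in range(H - 1, -1, -1):
            grads[t] = B.T @ g
            g = A.T @ g
        actions -= lr * grads
    return actions

z1, z_goal = rng.normal(size=dz), rng.normal(size=dz)
planned = gbp(z1, z_goal)
print(np.linalg.norm(rollout(z1, planned)[-1] - z_goal))  # goal distance after planning
```

The manual backward loop is exactly what autograd performs through $\text{rollout}_f$ for a real world model.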
As errors can propagate over long horizons, Model Predictive Control (MPC) is commonly used to repeatedly re-plan by optimizing an $H$-step action sequence but executing only the first $K \leq H$ actions before replanning from the updated state.
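The receding-horizon loop can be sketched as follows. We use a toy point-mass environment and a simple random-shooting planner as a stand-in for GBP or CEM; all dynamics, costs, and hyperparameters here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
H, K, d = 10, 2, 2                     # plan H steps, execute only the first K
goal = np.array([1.0, -1.0])

def env_step(s, a):
    """Toy point-mass dynamics, used here as both world model and simulator."""
    return s + 0.2 * np.clip(a, -1.0, 1.0)

def plan(s, n_samples=256):
    """Random-shooting planner (stand-in for GBP/CEM): pick the H-step
    sequence with the lowest summed distance to the goal along its rollout."""
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        actions = rng.normal(size=(H, d))
        z, cost = s, 0.0
        for a in actions:
            z = env_step(z, a)
            cost += np.linalg.norm(z - goal)
        if cost < best_cost:
            best, best_cost = actions, cost
    return best

s = np.zeros(d)
for _ in range(15):                    # MPC iterations
    actions = plan(s)                  # optimize an H-step sequence...
    for a in actions[:K]:              # ...but execute only the first K <= H actions
        s = env_step(s, a)

print(np.linalg.norm(s - goal))        # distance to goal after receding-horizon control
```

Replanning from the updated state is what keeps compounding model errors from derailing the executed trajectory.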
As the planning objective is induced entirely by the world model, the success of GBP hinges on (1) the model accurately predicting future states under any candidate action sequence, and (2) the stability of this differentiable optimization. We now present two finetuning methods designed to improve on these fronts.
During gradient-based planning, the action sequences being optimized are not constrained to lie within the distribution of behavior seen during training. World models are typically trained on fixed datasets of expert trajectories, whereas GBP selects actions solely to improve the planning objective, without regard to whether those actions resemble expert behavior. As a result, the optimization process often proposes action sequences that are out of distribution. Optimizing through learned models under such conditions is known to induce adversarial inputs (szegedy2013intriguing; goodfellow2014explaining). In our setting, these adversarial action sequences drive the world model into regions of the latent state space that were rarely or never observed during training, causing large prediction errors. Even when errors are initially small, they accumulate as the planner rolls the model forward, ultimately degrading long-horizon planning performance.
To address this issue, we propose Online World Modeling, which iteratively corrects the trajectories produced by GBP and finetunes the world model on the resulting rollouts. Rather than training solely on expert demonstrations, we repeatedly incorporate trajectories induced by the planner itself, thereby expanding the region of latent states that the world model can reliably predict.
First, we conduct GBP using the initial and goal latent states of an expert trajectory $\tau$, yielding a sequence of predicted actions $\{\hat{a}_t\}_{t=1}^{H}$. These actions might send the world model into regions of the latent space that lie outside of the training distribution. To adjust for this, we obtain a corrected trajectory: the actual sequence of states that would result by executing the action sequence $\{\hat{a}_t\}_{t=1}^{H}$ in the environment using the true dynamics simulator $h$. We add the corrected trajectory,
to the dataset on which the world model is trained, finetuning the model each time the dataset is updated. Re-training on these corrected trajectories expands the training distribution to cover the regions of latent space induced by gradient-based planning, mitigating compounding prediction errors during planning. We provide more detail in Algorithm 2 and illustrate the method in Figure 1.
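A toy sketch of this aggregation loop, with a linear simulator standing in for the true dynamics $h$ and random action sequences standing in for GBP's planned actions (both assumptions made for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da, H = 3, 2, 4
W_true = rng.normal(size=(dz, dz + da)) * 0.2     # true latent dynamics (simulator stand-in)
W = W_true + rng.normal(size=W_true.shape) * 0.3  # imperfect world model

def step(Wm, z, a):
    return Wm @ np.concatenate([z, a])

def simulate(z1, actions):
    """Ground-truth simulator h: returns the corrected latent trajectory."""
    zs = [z1]
    for a in actions:
        zs.append(step(W_true, zs[-1], a))
    return zs

dataset = []                                      # (z_t, a_t, z_{t+1}) triplets
for _ in range(20):                               # online iterations
    z1 = rng.normal(size=dz)
    actions = rng.normal(size=(H, da))            # stand-in for GBP's planned actions
    zs = simulate(z1, actions)                    # correct the trajectory with h
    dataset += [(zs[t], actions[t], zs[t + 1]) for t in range(H)]
    for z, a, z_next in dataset:                  # finetune on the aggregated dataset
        x = np.concatenate([z, a])
        W -= 0.05 * np.outer(step(W, z, a) - z_next, x)

err = np.mean([np.linalg.norm(step(W, z, a) - zn) for z, a, zn in dataset])
print(err)  # model error shrinks as corrected trajectories accumulate
```

The key structural point is that planner-induced states, once corrected by the simulator, re-enter the training set rather than being discarded.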
This procedure is reminiscent of DAgger (Dataset Aggregation) (ross2011reduction), an online imitation learning method wherein a base policy network is iteratively trained on its own rollouts with the action predictions replaced by those from an expert policy. In a similar spirit, we invoke the ground-truth simulator as our expert world model that we imitate.
Since world models are only trained on the next-state prediction objective, there is no particular reason for their input gradients to be well-behaved. Adversarial training has been shown to result in better behaved input gradients (mejia2019robust), consequently smoothing the input loss surface. Motivated by this observation, we propose an adversarial training objective that explicitly targets regions of the state-action space where the world model is expected to perform poorly. These adversarial samples may lie outside the expert trajectory distribution, which can expose the model to precisely the regions that matter for action optimization. We find that this procedure, which we call Adversarial World Modeling, does in fact smooth the loss surface of the planning objective (see Figure 2), improving the stability of action-sequence optimization.
Adversarial training improves model robustness by optimizing performance under worst-case perturbations (madry2019deeplearningmodelsresistant). An adversarial example is generated by applying a perturbation $\delta$ to an input that maximally increases the model's loss. To train a world model on adversarial examples, we use the objective
$$\min_\theta \; \mathbb{E}_{(z_t, a_t, z_{t+1})} \; \max_{\delta_z \in \mathcal{B}_z, \, \delta_a \in \mathcal{B}_a} \left\lVert f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \right\rVert_2^2,$$
where $\mathcal{B}_a = \{\delta_a : \lVert\delta_a\rVert_\infty \leq \epsilon_a\}$ and $\mathcal{B}_z = \{\delta_z : \lVert\delta_z\rVert_\infty \leq \epsilon_z\}$ constrain the magnitude of perturbations for given $\epsilon_a, \epsilon_z$. Training on these adversarially perturbed trajectories provides an alternative method to Online World Modeling for surfacing states that may be encountered during planning, without relying on GBP rollouts. This is a significant advantage in settings where simulation is expensive or infeasible.
We generate adversarial latent states using the Fast Gradient Sign Method (FGSM) (goodfellow2014explaining), which efficiently approximates the worst-case perturbations that maximize prediction error (fastbetterthanfree). Although stronger iterative attacks such as Projected Gradient Descent (PGD) can be used, we find that FGSM delivers comparable improvements in GBP performance while being significantly more computationally efficient (see Section D.1). This enables us to generate adversarial samples over entire large-scale offline imitation learning datasets.
For each state-action pair in a given minibatch, we look for small changes to the latent state or action that most increase the world model's prediction error. Let $\epsilon_a, \epsilon_z$ denote the radii of the perturbations to the actions $\{a_t\}$ and latent states $\{z_t\}$, respectively. We compute gradients $\nabla_{\delta_a, \delta_z} \lVert f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \rVert_2^2$ with respect to the perturbations and take a signed gradient ascent step (i.e., in a direction that degrades the prediction) with step sizes $\alpha_a = 1.25\epsilon_a$ and $\alpha_z = 1.25\epsilon_z$. We clip the result so that each entry of the perturbation stays within the radius. This procedure corresponds to a single step of a PGD-style attack, producing perturbations that lie on the edge of the allowed region where they are maximally challenging for the model. See Algorithm 3 for a detailed treatment.
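A sketch of this single-step attack on a toy linear world model, where the gradient with respect to the perturbations is analytic (a real world model would use autograd). The radii, model, and the choice to fit this transition exactly are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da = 4, 2
A = rng.normal(size=(dz, dz)) * 0.3
B = rng.normal(size=(dz, da)) * 0.3

def f(z, a):                         # toy differentiable world model
    return A @ z + B @ a

z, a = rng.normal(size=dz), rng.normal(size=da)
z_next = f(z, a)                     # assume the model fits this transition exactly

eps_z, eps_a = 0.1, 0.1              # perturbation radii
alpha_z, alpha_a = 1.25 * eps_z, 1.25 * eps_a

# Random start inside the infinity-norm ball (as in fast adversarial training).
dz_pert = rng.uniform(-eps_z, eps_z, size=dz)
da_pert = rng.uniform(-eps_a, eps_a, size=da)

# Gradient of ||f(z + dz, a + da) - z_next||^2 w.r.t. the perturbations
# (analytic for a linear model; autograd in practice).
r = f(z + dz_pert, a + da_pert) - z_next
grad_z, grad_a = 2.0 * A.T @ r, 2.0 * B.T @ r

# Single signed ascent step, then clip each entry back to the radius.
dz_pert = np.clip(dz_pert + alpha_z * np.sign(grad_z), -eps_z, eps_z)
da_pert = np.clip(da_pert + alpha_a * np.sign(grad_a), -eps_a, eps_a)

adv_loss = np.sum((f(z + dz_pert, a + da_pert) - z_next) ** 2)
clean_loss = np.sum((f(z, a) - z_next) ** 2)
print(clean_loss, adv_loss)          # the adversarial loss exceeds the clean loss
```

Because the step size exceeds the radius and is then clipped, the resulting perturbation sits on the boundary of the allowed ball, as described above.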
To initialize the perturbation radii $\epsilon_a, \epsilon_z$, we use scaling factors $\lambda_a, \lambda_z$ and find that Adversarial World Modeling is robust for $0 \leq \lambda_a \leq 1$ and $0 \leq \lambda_z \leq 0.5$. Furthermore, we find that fixing $\epsilon_a, \epsilon_z$ to the standard deviation of the initial minibatch is stable across all experiments. Updating this estimate for each batch as in Algorithm 3 yields no consistent improvement in final planning performance. We further analyze design ablations in Appendix D.
We evaluate our methods by finetuning world models pretrained with the next-state prediction objective on three tasks: PushT, PointMaze, and Wall. For each task we measure the success rate of reaching a target configuration $o_{\text{goal}}$ from an initial configuration $o_1$. We report planning results with both open-loop planning and MPC in Table 1. In the open-loop setting, we run Algorithm 1 from $o_1$ once and evaluate the predicted action sequence. In the MPC setting, we run Algorithm 1 once for each MPC step (using $\Phi_\mu(o_1^{\prime})$ as the initial latent state for the first MPC step), roll out the predicted actions $\{\hat{a}_t\}$ in the environment simulator to reach latent state $\hat{z}_{H+1}$, and set $\hat{z}_1 = \hat{z}_{H+1}$ for the next MPC iteration. We report all finetuning, planning, and optimization hyperparameters in Table 3.
We use DINO-WM (zhou2025dinowmworldmodelspretrained) as our initial world model for its strong performance with CEM across our chosen tasks. The embedding function $\Phi_\mu$ is taken to be the pre-trained DINOv2 encoder (oquab2024dinov2learningrobustvisual) and remains frozen while finetuning the transition model $f_\theta$, which is implemented using the ViT architecture (dosovitskiy2021imageworth16x16words). We additionally train a VQVAE decoder (oord2018neuraldiscreterepresentationlearning) to visualize latent states, though it plays no role in planning. To validate the broad applicability of our approach, we also study the use of the IRIS (micheli2023transformerssampleefficientworldmodels) world model architecture in Section B.3.
To initialize the action sequence for planning optimization, we evaluate both random sampling from a standard normal distribution and the use of an initialization network. Our initialization network $g_\theta : \mathcal{Z}\times\mathcal{Z}\to\mathcal{A}^T$ is trained such that $g_\theta(z_1, z_g) = \{\hat{a}_t\}_{t=1}^{T}$. We find that random initialization tends to outperform the initialization network, and we analyze its impact in depth in Section B.1.
During GBP, we set $\mathcal{L}_{\text{goal}}$ in Algorithm 1 to a weighted goal loss to obtain a gradient from each predicted state instead of simply the last one. We find empirically that this task assumption generalizes to both navigation (e.g., PointMaze and Wall) and non-navigation tasks (e.g., PushT); i.e., on tasks with or without subgoal decomposability, this objective improves or matches performance of the final-state loss. We provide the exact formulation and more details in Section A.4. We additionally evaluate using the Adam optimizer (Kingma2014AdamAM) during GBP. Although using Adam improves performance significantly over GD for all world models in our experiments, we find that Adam alone does not scale performance to match or surpass CEM.
On all three tasks, our methods outperform DINO-WM with Gradient Descent GBP and either match or outperform it with the far more expensive CEM. In the open-loop setting, we achieve success rate increases of +18% on PushT, +20% on PointMaze, and +30% on Wall. In the MPC setting, Adam GBP with Adversarial World Modeling outperforms CEM with DINO-WM on PointMaze and Wall and matches CEM on PushT.
While both Online World Modeling and Adversarial World Modeling bootstrap new data to improve the robustness of our world model during GBP, the distributions they induce are quite different. Whereas Online World Modeling anticipates and covers the distribution seen at planning time, Adversarial World Modeling exploits the current loss landscape of the world model to encourage local smoothness near expert trajectories. For all environments, we find Adversarial World Modeling outperforms Online World Modeling when using Adam to perform GBP.
To demonstrate the advantages of Adversarial World Modeling in more complex environments where the simulator may be very costly and the number of action dimensions is larger, we also evaluate planning performance on two robotic manipulation tasks in Section B.2.
Comparing the world model error between training trajectories and planning trajectories allows us to evaluate whether the world model will perform well during planning even if it is trained to convergence on expert trajectories. We evaluate world model error as the deviation between the world model's predicted next latent state and the next latent state given by the environment simulator. Given an initial state $s_1$ (associated with $o_1$) and a sequence of actions $\{a_t\}$ (either from the training dataset or a planning procedure), the world model error at timestep $t$ is given by
$$\left\lVert f_\theta(\Phi_\mu(o_t), a_t) - \Phi_\mu(o_{t+1}) \right\rVert_2^2,$$
where $o_{t+1}$ is the observation produced by the environment simulator.
This error is averaged over all timesteps of a trajectory. If the difference in world model error between expert trajectories and planning trajectories is negative, then the world model will perform relatively worse on sequences of actions produced during planning. Figure 4 demonstrates that this is the case with DINO-WM, but not with Online World Modeling or Adversarial World Modeling, indicating a narrowing of the train-test gap. See Section B.6 for results for PointMaze and Wall.
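A sketch of this metric with toy linear stand-ins for both the world model and the simulator (assumptions for illustration): at each step, the model predicts the next latent from the simulator's true current latent, and the per-step deviations are averaged over the trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da, T = 3, 2, 10
W_sim = rng.normal(size=(dz, dz + da)) * 0.3           # simulator stand-in
W_model = W_sim + rng.normal(size=W_sim.shape) * 0.1   # imperfect world model

def world_model_error(z1, actions):
    """Average one-step deviation between model prediction and simulator latent."""
    z, errs = z1, []
    for a in actions:
        x = np.concatenate([z, a])
        z_true = W_sim @ x                 # next latent from the simulator
        z_pred = W_model @ x               # model prediction from the true current latent
        errs.append(np.linalg.norm(z_pred - z_true))
        z = z_true                         # continue from the simulator's state
    return np.mean(errs)

err = world_model_error(rng.normal(size=dz), rng.normal(size=(T, da)))
print(err)
```

Evaluating this quantity on both expert action sequences and planner-produced action sequences yields the train-test error gap discussed above.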
When using a world model to conduct planning in real-world settings, fast inference is crucial for actively interacting with the environment. On all three tasks, we find that GBP with Adversarial World Modeling is able to match or come near the best performing world model when planning with CEM, in over an order of magnitude less wall clock time. We compare wall clock times across world models and planning procedures for PushT in Figure 3. The planning efficiency results for PointMaze and Wall can be found in Section B.7.
Learning world models from sensory data. Learning-based dynamics models have become central to control and decision making, offering a data-driven alternative to classical approaches that rely on first principles modeling (goldstein1950classical; Schmidt2009DistillingFNA; macchelli2009port). Early work focused on modeling dynamics in low-dimensional state-space (deisenroth2011pilco; lenz2015deepmpc; henaff2017model; Sharma2019DynamicsAwareUD), while more recent methods learn directly from high-dimensional sensory inputs such as images. Pixel-space prediction methods (Finn2016UnsupervisedLFA; Kaiser2019ModelBasedRL) have shown success in applications such as human motion prediction (Finn2016UnsupervisedLFA), robotic manipulation (Finn2016DeepVF; agrawal2016learning; zhang2019solar), and solving Atari games (Kaiser2019ModelBasedRL), but they remain computationally expensive due to the cost of image reconstructions. To address this, alternative approaches learn a compact latent representation where dynamics are modeled (Karl2016DeepVB; hafner2019learning; Shi2022RoboCraftLT; karypidis2024dino). These models are typically supervised either by decoding latent predictions to match ground truth observations (Edwards2018ImitatingLPA; Zhang2021DeformableLOA; bounou2021online; Hu2022ModelBasedILA; Akan2022StretchBEVSFA; hafner2019learning), or by using prediction objectives that operate directly in latent space, such as those in joint-embedding prediction architectures (JEPAs) (lecun2022path; Bardes2024RevisitingFPA; Drozdov2024VideoRLA; Guan2024WorldMFA; zhou2025dinowmworldmodelspretrained). Our work builds upon this latter category of world models and specifically leverages the DINOv2-based latent world models introduced in zhou2025dinowmworldmodelspretrained. 
However, unlike prior work that primarily targets improving general representation quality or prediction accuracy, we focus on enhancing the trainability of world models to improve the convergence and reliability of gradient-based planning.
Planning with world models. Planning with world models is challenging due to the non-linearity and non-convexity of the objective. Search-based methods such as CEM (rubinstein2004cross) and MPPI (williams2017model) are widely used in this context (Williams2017InformationTMA; Nagabandi2019DeepDMA; hafner2019learning; Zhan2021ModelBasedOPA; zhou2025dinowmworldmodelspretrained). These methods explore the action space effectively, helping to escape from local minima, but typically scale poorly in high-dimensional settings due to their sampling-based nature. In contrast, gradient-based methods offer a more scalable alternative by exploiting the differentiability of the world model to optimize actions directly via backpropagation. Despite their efficiency, these methods suffer from local minima in highly non-smooth loss landscapes (Bharadhwaj2020ModelPredictiveCVA; Xu2022AcceleratedPLA; Chen2022BenchmarkingDOA; Wang2023SoftZooASA), and gradient optimization can induce adversarial action sequences that exploit model inaccuracies (Schiewer2024ExploringTLA; Jackson2024PolicyGuidedDA). zhou2025dinowmworldmodelspretrained have observed that GBP is particularly brittle when used with world models built on pre-trained visual embeddings, such as DINOv2 (oquab2024dinov2learningrobustvisual), often underperforming compared to CEM. To address these challenges, several stabilizing techniques have been proposed. For instance, random-sampling shooting helps mitigate adversarial trajectories by injecting noise in the action sequence and exploring a broader set of actions during trajectory optimization (nagabandi2018neural), and Zhang2025StateAwarePOA introduce adversarial attacks on learned policies to make them robust to environmental perturbations by selectively perturbing state inputs at inference time. In contrast, we apply perturbation directly to latent states and latent actions during world model training. 
florence2022implicit add gradient penalties when training an implicit policy function to improve its smoothness and stabilize optimization, but their method does not involve training or using a world model. Other approaches aim to use a hybrid method that combines search and gradient steps to balance global exploration and local refinement (Bharadhwaj2020ModelPredictiveCVA). In our work, we modify the world-model training procedure itself to improve GBP stability. In particular, through our Adversarial World Modeling approach, we enhance the robustness of the world model to perturbed states and actions, producing more stable and informative gradients that prevent adversarial action sequences at test time.
Train-test gap in world models. A key challenge when planning with learned world models is the mismatch between the training objective and the planning objective (lambert2020objective). During training, world models are typically optimized to minimize one-step prediction or reconstruction error on trajectories collected from expert demonstrations or behavioral policies. At test time, however, the same models are used inside a planner to optimize multi-step action sequences. As a result, the objectives at training and test time are inherently different, inducing a distribution shift between trajectories seen during training and those encountered during planning. This mismatch can cause planners to drive the model into out-of-distribution regions of the state space, where prediction errors compound over time and the model becomes unreliable for long-horizon optimization (Ajay2018AugmentingPSA; Ke2019LearningDMA; Zhu2023DiffusionMFA). A common strategy to address this train-test gap is dataset aggregation (ross2011reduction), which expands the training distribution by rolling out action trajectories generated by the planning algorithm and adding them to the training set (Talvitie2014ModelRFA; nagabandi2018neural). Unlike these approaches, which typically apply this technique directly in the environment's low-dimensional state space, our approach uses dataset aggregation in the context of high-dimensional latent world models, where training occurs in latent space rather than directly on states. Through our Online World Modeling approach, we explicitly close the train-test gap for gradient-based planning by using the planner itself to generate off-distribution trajectories and correcting them with simulator feedback.
In this work, we introduced Online World Modeling and Adversarial World Modeling as techniques for addressing the train-test gap that arises when world models trained on next-state prediction are used for iterative gradient-based planning. Across our experiments, these methods substantially improve the reliability of GBP and, in some settings, allow it to match or outperform sampling-based planners such as CEM. By narrowing this gap, our results suggest that gradient-based planning can be a practical alternative for planning with world models, particularly in settings where computational efficiency is critical. An important direction for future work is to evaluate these methods on real-world systems. Adversarial training may additionally improve a world model’s robustness to environmental adversaries or stochasticity. More broadly, world models offer a natural advantage over policy-based reinforcement learning in long-horizon decision making. We believe our methods are especially well-suited to multi-timescale or hierarchical world models, where long-horizon planning is enabled by improving planning stability at different levels of abstraction.
Compute resources used in this work were provided by the Modal and NVIDIA Academic Grants. Micah Goldblum was supported by the Google Cyber NYC Award.
This task, introduced by pusht, uses an agent interacting with a T-shaped block to guide both the agent and the block from a randomly initialized state to a feasible goal state within 25 steps. We use the dataset of 18500 trajectories given in zhou2025dinowmworldmodelspretrained, in which the green anchor serves purely as a visual reference. We draw a goal state from one of the noisy expert trajectories, 25 steps from the starting state.
In this task, introduced by pointmaze, a force-actuated ball that can move in the $x, y$ Cartesian directions must reach a target goal within a maze. We use the dataset of 2000 random trajectories present in zhou2025dinowmworldmodelspretrained, with a goal state chosen 25 steps from the starting state.
This task introduced by DINO-WM (zhou2025dinowmworldmodelspretrained) features a 2D navigation environment with two rooms separated by a wall with a door. The agent’s task is to navigate from a randomized starting location in one room to a random goal state in the other room, passing through the door. We use the dataset of 1920 trajectories as provided in DINO-WM, with a goal state chosen 25 steps from the starting state.
In this task, introduced by zhang2024adaptigraph, a simulated Xarm must push roughly one hundred small particles into the goal configuration. We use the dataset of 1000 trajectories of 20 steps each provided in DINO-WM.
We reproduce the dataset statistics used to train the base world model for each environment from zhou2025dinowmworldmodelspretrained. We use the same datasets for our alternative world model architecture ablation in Section B.3.
We detail the cross-entropy method used in our planning experiments in Algorithm 4.
In Table 3, we list all shared hyperparameters used in training and planning.
We provide data quantity and synthetic data parameters for our Online and Adversarial World Modeling training setups in Table 5 and Table 4, respectively. In addition to maintaining perturbation radii for the visual latent and action embeddings, we use a distinct radius for the proprioceptive embeddings. We empirically find that the scales of the visual and proprioceptive embeddings are incompatible and semantically distinct, thereby necessitating independent perturbation. Throughout all of our experiments, we set the perturbation radii of the action embedding and proprioceptive embedding identically for simplicity.
To facilitate progress towards the goal in gradient-based planning, we introduce an alternate loss function: Weighted Goal Loss (WGL). Instead of the standard goal loss function that only minimizes the $\ell_2$-distance between the final latent state produced by planning actions and the goal latent state, WGL encourages intermediate latent states to also be close to the goal latent state. Formally,
$$\mathcal{L}_{\text{WGL}} = \sum_{i=2}^{H+1} w_i \left\lVert \hat{z}_i - z_g \right\rVert_2^2,$$
where the sequence of normalized weights $\{w_i\}_{i=2}^{H+1}$ is a hyperparameter choice. Empirically, we find that using this objective for gradient-based planning either maintains or improves planning performance. For PointMaze and Wall, we found that exponentially upweighting later states in the planning horizon improved planning performance, so we set $w_i = 2^i$. For PushT, we found that exponentially upweighting earlier states improved planning performance, so we set $w_i = (1/2)^i$. We leave the optimal selection of this sequence of weights as future work.
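A small sketch of WGL with normalized exponential weights, illustrating the two weighting regimes; the latents and dimensions are made up for illustration, and since the weights are normalized, the exact index offset of the exponent is immaterial.

```python
import numpy as np

def weighted_goal_loss(z_hats, z_goal, base=2.0):
    """WGL over predicted latents z_hat_2 ... z_hat_{H+1}.
    base > 1 upweights later states (PointMaze/Wall-style, w_i = 2^i);
    base < 1 upweights earlier states (PushT-style, w_i = (1/2)^i)."""
    w = np.array([base ** i for i in range(1, len(z_hats) + 1)])
    w = w / w.sum()                                   # normalized weights
    dists = np.array([np.sum((z - z_goal) ** 2) for z in z_hats])
    return float(np.dot(w, dists))

z_goal = np.zeros(3)
z_hats = [np.ones(3) * (1.0 - t / 4) for t in range(4)]  # latents approaching the goal
late = weighted_goal_loss(z_hats, z_goal, base=2.0)      # upweight later states
early = weighted_goal_loss(z_hats, z_goal, base=0.5)     # upweight earlier states
print(late, early)  # late < early, since later states are closer to the goal
```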
Motivated by the hypothesis that the optimization landscape is rugged (see Figure 2 for some evidence of this), we train an initialization network $g_\theta : \mathcal{Z}\times\mathcal{Z}\to\mathcal{A}^T$, $g_\theta(z_1, z_g) = \{\hat{a}_t\}$, to initialize a sequence of actions for gradient-based planning.
We provide details on training the initialization network $g_\theta$ in Algorithm 5. We train $g_\theta$ on a single epoch over the trajectories in the task's training dataset.
We show results of including the initialization network in GBP for each task in Table 6. Comparing to Table 1, we see that for both GD and Adam, the initialization network performs comparably to random initialization only in the PushT environment.
Table 1: Planning Results. We evaluate the planning performance of our finetuned world models against DINO-WM (Zhou et al., 2025) on 3 tasks in terms of success rate (%) using both open-loop and model predictive control (MPC) procedures. For each task, we perform gradient-based planning using both stochastic gradient descent (GD) and Adam (Kingma & Ba, 2014), and search-based planning using the cross-entropy method (CEM).
| | PushT | | | PointMaze | | | Wall | | |
|---|---|---|---|---|---|---|---|---|---|
| Method | GD | Adam | CEM | GD | Adam | CEM | GD | Adam | CEM |
| DINO-WM | 38 | 54 | 78 | 12 | 24 | 90 | 2 | 10 | 74∗ |
| + MPC | 56 | 76 | 92 | 42 | 68 | 90 | 12 | 80 | 82 |
| Online WM | 34 | 52 | 90 | 20 | 14 | 62 | 16 | 18 | 54∗ |
| + MPC | 50 | 76 | 92 | 54 | 88 | 96 | 38 | 80 | 90 |
| Adversarial WM | 56 | 82 | 94 | 32 | 70 | 88 | 32 | 34 | 30∗ |
| + MPC | 66 | 92 | 92 | 50 | 94 | 98 | 14 | 94 | 94 |
Table 2: Trajectory datasets used to pretrain the base DINO-WM and IRIS world models.
| Environment | H | Frameskip | Dataset Size | Trajectory Length |
|---|---|---|---|---|
| Push-T | 3 | 5 | 18500 | 100-300 |
| PointMaze | 3 | 5 | 2000 | 100 |
| Wall | 1 | 5 | 1920 | 50 |
| Rope | 1 | 1 | 1000 | 5 |
| Granular | 1 | 1 | 1000 | 5 |
Table 3: (a) Finetuning Parameters
| Name | Value |
|---|---|
| Image size | 224 |
| Optimizer | AdamW |
| Predictor LR | 1e-5 |
Table 3: (b) Open-Loop Planning
| Name | GD | Adam |
|---|---|---|
| Opt. steps | 300 | 300 |
| LR | 1.0 | 0.3 |
Table 4: Training parameters for Adversarial World Modeling as reported in Table 1.
| Environment | # Rollouts | Batch Size | GPU | Epochs | ϵ_visual | ϵ_proprio | ϵ_action |
|---|---|---|---|---|---|---|---|
| PushT | 20000 (all) | 16 | 8x B200 | 1 | 0.05 | 0.02 | 0.02 |
| PointMaze | 2000 (all) | 16 | 1x B200 | 1 | 0.20 | 0.08 | 0.08 |
| Wall | 1920 (all) | 48 | 1x B200 | 2 | 0.20 | 0.08 | 0.08 |
Table 6: For both gradient descent (GD) and Adam (Ad), we evaluate initializing the actions for gradient-based planning (GBP) from the initialization network (IN) instead of a normal distribution.
| | PushT | | PointMaze | | Wall | |
|---|---|---|---|---|---|---|
| Method | GD+IN | Ad+IN | GD+IN | Ad+IN | GD+IN | Ad+IN |
| DINO-WM | 44 | 62 | 16 | 14 | 4 | 12 |
| + MPC | 60 | 84 | 40 | 54 | 6 | 32 |
| Online WM | 56 | 66 | 8 | 28 | 10 | 18 |
| + MPC | 52 | 82 | 40 | 46 | 2 | 22 |
| Adversarial WM | 74 | 90 | 22 | 36 | 18 | 24 |
| + MPC | 74 | 90 | 44 | 56 | 24 | 48 |
Table 7: Planning performance measured with Chamfer distance (lower is better) on two robotic manipulation tasks: Rope and Granular.
| | Rope | | Granular | |
|---|---|---|---|---|
| Method | GD | CEM | GD | CEM |
| DINO-WM | 1.73 | 0.93 | 0.30 | 0.22 |
| Adversarial WM | 0.93 | 0.82 | 0.24 | 0.28 |
Table 8: Planning results in terms of success rate using the IRIS (Micheli et al., 2023) architecture on the Wall task.
| Method | GD | CEM |
|---|---|---|
| IRIS | 0 | 4 |
| IRIS + Online WM | 0 | 0 |
| IRIS + Adversarial WM | 8 | 6 |
Table 9: (a) Long-Horizon GBP
| Method | PushT | PointMaze |
|---|---|---|
| DINO-WM | 16 | 70 |
| Online WM | 16 | 96 |
| Adversarial WM | 26 | 88 |
Table 9: (b) MPPI and GradCEM on PushT
| Method | MPPI | GradCEM |
|---|---|---|
| DINO-WM | 2 | 78 |
| Online WM | 2 | 74 |
| Adversarial WM | 2 | 84 |
Table 10: Wall clock time (in seconds) of rolling out 25 steps with each environment simulator compared to the DINO-WM architecture.
| | PushT | PointMaze | Wall |
|---|---|---|---|
| Simulator | 0.959 | 0.717 | 4.465 |
| DINO-WM | 0.029 | 0.029 | 0.029 |
Table 11: FGSM vs. K-step PGD adversarial training. Both Open-Loop and MPC (Closed-Loop) planning use the Adam optimizer with the same parameters as the main experiments.
| Method | Backward Passes | Min/Epoch | Open-Loop | MPC | Min/Epoch | Open-Loop | MPC |
|---|---|---|---|---|---|---|---|
| FGSM | 2 | 120 | 70 | 94 | 14 | 34 | 94 |
| 2-Step PGD | 3 | 165 | 80 | 96 | 20 | 8 | 90 |
| 3-Step PGD | 4 | 201 | 78 | 94 | 24 | 14 | 94 |
An overview of our two proposed methods. When planning with a world model, actions may result in trajectories that lie outside the distribution of expert trajectories on which the world model was trained, leading to inaccurate world modeling. Online World Modeling finetunes a pretrained world model by using the simulator to correct trajectories produced via gradient-based planning, leading to accurate world modeling beyond the expert trajectory distribution. Adversarial World Modeling finetunes a world model on perturbations of actions and expert trajectories, promoting robustness and smoothing the world model’s input gradients.
Optimization landscape of DINO-WM (zhou2025dinowmworldmodelspretrained) before and after finetuning with our Adversarial World Modeling objective on the Push-T task. Adversarial World Modeling yields a smoother landscape with a broader basin around the optimum. Visualization details in Appendix C.
Difference in World Model Error between expert and planning trajectories on PushT.
Task environments: PushT, PointMaze, Wall, Rope, and Granular.
Planning efficiency of DINO-WM, Online WM, and Adversarial WM using both GBP methods and CEM on the PointMaze task.
$$ s_{t+1}=h(s_{t},a_{t}),\quad\text{ for all $t$}, $$ \tag{S2.E1}
$$ \min_{\theta}\mathbb{E}_{(o_{t},a_{t},o_{t+1})\sim\mathcal{T}}\lVert f_{\theta}(\Phi_{\mu}(o_{t}),a_{t})-\Phi_{\mu}(o_{t+1})\rVert_{2}^{2}. $$ \tag{S2.E3}
$$ \{\hat{a}_{t}^{*}\}_{t=1}^{H}=\operatorname*{arg\,min}_{\{\hat{a}_{t}\}}\lVert\hat{z}_{H+1}-z_{\text{goal}}\rVert^{2}_{2} $$ \tag{S2.E4}
$$ \hat{z}_{2}=f_{\theta}(z_{1},\hat{a}_{1}),\quad\hat{z}_{t+1}=f_{\theta}(\hat{z}_{t},\hat{a}_{t})\quad\text{for}\quad t>1. $$ \tag{S2.E5}
$$ \tau^{\prime}=(z_{1},\hat{a}_{1},z_{2}^{\prime},\hat{a}_{2},\dots,z^{\prime}_{H+1}), $$ \tag{S2.E6}
$$ \delta^{(k+1)}=\Pi_{\lVert\delta\rVert_{\infty}\leq\epsilon}\left(\delta^{(k)}+\alpha\cdot\nabla_{x}\mathcal{L}(f_{\theta}(x+\delta^{(k)}),y)\right) $$ \tag{A4.E10}
Planning Computational Efficiency
When using a world model to conduct planning in real-world settings, fast inference is crucial for actively interacting with the environment. On all three tasks, we find that GBP with Adversarial World Modeling matches or approaches the best-performing world model planned with CEM, in over an order of magnitude less wall-clock time. We compare wall-clock times across world models and planning procedures for PushT in Figure 3. The planning efficiency results for PointMaze and Wall can be found in Section B.7.
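For concreteness, the open-loop gradient-based planning loop whose wall-clock time we measure can be sketched as below. This is a minimal sketch, not the released implementation: `predictor` stands in for the world model f_θ, and the interface is our assumption.

```python
import torch


def gbp_open_loop(predictor, z1, z_goal, horizon, action_dim, steps=300, lr=0.3):
    """Optimize an action sequence so the rolled-out latent reaches z_goal."""
    actions = torch.randn(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        z = z1
        for t in range(horizon):  # autoregressive latent rollout
            z = predictor(z, actions[t])
        loss = ((z - z_goal) ** 2).sum()  # squared l2 distance to the goal latent
        loss.backward()  # backpropagate through the entire rollout
        opt.step()
    return actions.detach()
```

Because the whole rollout is differentiable, a single backward pass updates all H actions at once, which is the source of GBP's speed advantage over sampling-based search.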
Related Work
Learning world models from sensory data. Learning-based dynamics models have become central to control and decision making, offering a data-driven alternative to classical approaches that rely on first-principles modeling (Goldstein et al., 1950; Schmidt & Lipson, 2009; Macchelli et al., 2009). Early work focused on modeling dynamics in low-dimensional state spaces (Deisenroth & Rasmussen, 2011; Lenz et al., 2015; Henaff et al., 2017; Sharma et al., 2019), while more recent methods learn directly from high-dimensional sensory inputs such as images. Pixel-space prediction methods (Finn et al., 2016; Kaiser et al., 2019) have shown success in applications such as human motion prediction (Finn et al., 2016), robotic manipulation (Finn & Levine, 2016; Agrawal et al., 2016; Zhang et al., 2019), and solving Atari games (Kaiser et al., 2019), but they remain computationally expensive due to the cost of image reconstructions. To address this, alternative approaches learn a compact latent representation where dynamics are modeled (Karl et al., 2016; Hafner et al., 2019b; Shi et al., 2022; Karypidis et al., 2024). These models are typically supervised either by decoding latent predictions to match ground-truth observations (Edwards et al., 2018; Zhang et al., 2021; Bounou et al., 2021; Hu et al., 2022; Akan & Güney, 2022; Hafner et al., 2019b), or by using prediction objectives that operate directly in latent space, such as those in joint-embedding prediction architectures (JEPAs) (LeCun, 2022; Bardes et al., 2024; Drozdov et al., 2024; Guan et al., 2024; Zhou et al., 2025). Our work builds upon this latter category of world models and specifically leverages the DINOv2-based latent world models introduced in Zhou et al. (2025). However, unlike prior work that primarily targets improving general representation quality or prediction accuracy, we focus on enhancing the trainability of world models to improve the convergence and reliability of gradient-based planning.
Planning with world models. Planning with world models is challenging due to the non-linearity and non-convexity of the objective. Search-based methods such as CEM (Rubinstein & Kroese, 2004) and MPPI (Williams et al., 2017a) are widely used in this context (Williams et al., 2017b; Nagabandi et al., 2019; Hafner et al., 2019b; Zhan et al., 2021; Zhou et al., 2025). These methods explore the action space effectively, helping to escape from local minima, but typically scale poorly in high-dimensional settings due to their sampling-based nature. In contrast, gradient-based methods offer a more scalable alternative by exploiting the differentiability of the world model to optimize actions directly via backpropagation. Despite their efficiency, these methods suffer from local minima in highly non-smooth loss landscapes (Bharadhwaj et al., 2020a; Xu et al., 2022; Chen et al., 2022; Wang et al., 2023), and gradient optimization can induce adversarial action sequences that exploit model inaccuracies (Schiewer et al., 2024; Jackson et al., 2024). Zhou et al. (2025) have observed that GBP is particularly brittle when used with world models built on pre-trained visual embeddings, such as DINOv2 (Oquab et al., 2024), often underperforming compared to CEM. To address these challenges, several stabilizing techniques have been proposed. For instance, random-sampling shooting helps mitigate adversarial trajectories by injecting noise in the action sequence and exploring a broader set of actions during trajectory optimization (Nagabandi et al., 2018), and Zhang et al. (2025) introduce adversarial attacks on learned policies to make them robust to environmental perturbations by selectively perturbing state inputs at inference time. In contrast, we apply perturbation directly to latent states and latent actions during world model training. Florence et al. (2022) add gradient penalties when training an implicit policy function to improve its smoothness and stabilize optimization, but their method does not involve training or using a world model. Other approaches aim to use a hybrid method that combines search and gradient steps to balance global exploration and local refinement (Bharadhwaj et al., 2020a). In our work, we modify the world-model training procedure itself to improve GBP stability. In particular, through our Adversarial World Modeling approach, we enhance the robustness of the world model to perturbed states and actions, producing more stable and informative gradients that prevent adversarial action sequences at test time.
Train-test gap in world models. A key challenge when planning with learned world models is the mismatch between the training objective and the planning objective (Lambert et al., 2020). In fact, during training, world models are typically optimized to minimize one-step prediction or reconstruction error on trajectories collected from expert demonstrations or behavioral policies. At test time, however, the same models are used inside a planner to optimize multi-step action sequences. As a result, the objectives at training and test times are inherently different, inducing a distribution shift between trajectories seen during training and those encountered during planning. This mismatch can cause planners to drive the model into out-of-distribution regions of the state space, where prediction errors compound over time and the model becomes unreliable for long-horizon optimization (Ajay et al., 2018; Ke et al., 2019; Zhu et al., 2023). A common strategy to address this train-test gap is dataset-aggregation (Ross et al., 2011), which expands the training distribution by rolling out action trajectories generated by the planning algorithm and adding them to the training set (Talvitie, 2014; Nagabandi et al., 2018). However, unlike these approaches which typically apply this technique directly in the environment's low-dimensional state space, our approach uses dataset-aggregation in the context of high-dimensional latent world models, where training occurs in latent space rather than directly on states. Through our Online World Modeling approach, we explicitly close the train-test gap for gradient-based planning by using the planner itself to generate off-distribution trajectories and correcting them with simulator feedback.
Conclusion
In this work, we introduced Online World Modeling and Adversarial World Modeling as techniques for addressing the train-test gap that arises when world models trained on next-state prediction are used for iterative gradient-based planning. Across our experiments, these methods substantially improve the reliability of GBP and, in some settings, allow it to match or outperform sampling-based planners such as CEM. By narrowing this gap, our results suggest that gradient-based planning can be a practical alternative for planning with world models, particularly in settings where computational efficiency is critical. An important direction for future work is to evaluate these methods on real-world systems. Adversarial training may additionally improve a world model's robustness to environmental adversaries or stochasticity. More broadly, world models offer a natural advantage over policy-based reinforcement learning in long-horizon decision making. We believe our methods are especially well-suited to multi-timescale or hierarchical world models, where long-horizon planning is enabled by improving planning stability at different levels of abstraction.
Acknowledgments
Compute resources used in this work were provided by the Modal and NVIDIA Academic Grants. Micah Goldblum was supported by the Google Cyber NYC Award.
Experimental Details
Task Details
PushT: This task, introduced by Chi et al. (2024), uses an agent interacting with a T-shaped block to guide both the agent and the block from a randomly initialized state to a feasible goal state within 25 steps. We use the dataset of 18500 trajectories given in Zhou et al. (2025), in which the green anchor serves purely as a visual reference. We draw a goal state from one of the noisy expert trajectories at 25 steps from the starting state.
PointMaze: In this task, introduced by Fu et al. (2021), a force-actuated ball that can move in the x and y Cartesian directions must reach a target goal within a maze. We use the dataset of 2000 random trajectories present in Zhou et al. (2025), with a goal state chosen 25 steps from the starting state.
Wall: This task introduced by DINO-WM (Zhou et al., 2025) features a 2D navigation environment with two rooms separated by a wall with a door. The agent's task is to navigate from a randomized starting location in one room to a random goal state in the other room, passing through the door. We use the dataset of 1920 trajectories as provided in DINO-WM, with a goal state chosen 25 steps from the starting state.
Granular: In this task, introduced by Zhang et al. (2024), a simulated xArm must push around one hundred small particles into the goal configuration. We use the dataset of 1000 trajectories of 20 steps each provided in DINO-WM.
We reproduce the dataset statistics used to train the base world model for each environment from Zhou et al. (2025). We use the same datasets for our alternative world model architecture ablation in Section B.3.
Table 2: Trajectory datasets used to pretrain the base DINO-WM and IRIS world models.
CEM Algorithm
We detail the cross-entropy method used in our planning experiments in Algorithm 4.
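As a rough illustration (not a verbatim transcription of our Algorithm 4), CEM iteratively refits a Gaussian over action sequences to the lowest-cost samples. Here `rollout_cost`, a callable scoring one action sequence under the world model, is a hypothetical stand-in:

```python
import numpy as np


def cem_plan(rollout_cost, horizon, action_dim, n_samples=100,
             n_elites=10, n_iters=30, seed=0):
    """Cross-entropy method over action sequences of shape (horizon, action_dim)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = mean + std * rng.standard_normal((n_samples, horizon, action_dim))
        costs = np.array([rollout_cost(s) for s in samples])
        # Keep the lowest-cost candidates and refit the sampling distribution.
        elites = samples[np.argsort(costs)[:n_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```

Note that every iteration requires `n_samples` full world-model rollouts, which is the main driver of CEM's wall-clock cost relative to GBP.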
Finetuning and Planning Hyperparameters
In Table 3, we list all shared hyperparameters used in training and planning.
We provide data quantity and synthetic data parameters for our Online and Adversarial World Modeling training setups in Table 5 and Table 4, respectively. In addition to maintaining perturbation radii for the visual latent and action embeddings, we use a distinct radius for the proprioceptive embeddings. We empirically find that the scales of the visual and proprioceptive embeddings are incompatible and semantically distinct, thereby necessitating independent perturbation. Throughout all of our experiments, we set the perturbation radii of the action embedding and proprioceptive embedding identically for simplicity.
Weighted Goal Loss
To facilitate progress towards the goal in gradient-based planning, we introduce an alternate loss function: Weighted Goal Loss (WGL). Instead of the standard goal loss function that only minimizes the ℓ2-distance between the final latent state produced by planning actions and the goal latent state,

$$ \{\hat{a}_{t}^{*}\}_{t=1}^{H}=\operatorname*{arg\,min}_{\{\hat{a}_{t}\}}\lVert\hat{z}_{H+1}-z_{\text{goal}}\rVert_{2}^{2}, $$
WGL encourages intermediate latent states to also be close to the goal latent state. Formally,
$$ \{\hat{a}_{t}^{*}\}_{t=1}^{H}=\operatorname*{arg\,min}_{\{\hat{a}_{t}\}}\sum_{i=2}^{H+1}w_{i}\lVert\hat{z}_{i}-z_{\text{goal}}\rVert_{2}^{2}, $$
where the sequence of normalized weights {w_i}_{i=2}^{H+1} is a hyperparameter choice. Empirically, we find that using this objective for gradient-based planning either maintains or improves planning performance. For PointMaze and Wall, we found that exponentially upweighting later states in the planning horizon improved planning performance, so we set w_i = 2^i. For PushT, we found that exponentially upweighting earlier states improved planning performance, so we set w_i = (1/2)^i. We leave the optimal selection of this sequence of weights to future work.
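A minimal sketch of WGL under the notation above; normalizing the weights to sum to one is our assumption:

```python
import numpy as np


def wgl_weights(horizon, base=2.0):
    # w_i proportional to base**i for latent states i = 2, ..., H+1,
    # normalized to sum to 1 (normalization convention is an assumption).
    w = base ** np.arange(2, horizon + 2, dtype=float)
    return w / w.sum()


def weighted_goal_loss(latents, z_goal, weights):
    """latents: predicted latent states z_2 ... z_{H+1}, shape (H, D)."""
    sq_dists = ((latents - z_goal) ** 2).sum(axis=-1)  # per-state squared l2
    return float((weights * sq_dists).sum())
```

Setting `base=2.0` upweights later states (our PointMaze/Wall choice), while `base=0.5` upweights earlier ones (our PushT choice).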
Additional Experiment Results
Initialization Network
Motivated by the hypothesis that the optimization landscape is rugged (see Figure 2 for some evidence of this), we train an initialization network g_θ : Z × Z → A^T, g_θ(z_1, z_g) = {â_t}, to initialize the sequence of actions for gradient-based planning. We show results of including the initialization network in GBP for each task in Table 6. Comparing to Table 1, we see that for both GD and Adam, the initialization network performs comparably to a random initialization only in the PushT environment.
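A hypothetical sketch of such an initialization network; the MLP architecture and sizes are our assumptions, not the trained model:

```python
import torch
import torch.nn as nn


class InitNetwork(nn.Module):
    """Maps (initial latent, goal latent) to a length-T action sequence."""

    def __init__(self, latent_dim, action_dim, horizon, hidden=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, z1, z_goal):
        # Concatenate current and goal latents; reshape into an action sequence.
        out = self.net(torch.cat([z1, z_goal], dim=-1))
        return out.view(-1, self.horizon, self.action_dim)
```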
Robotic Manipulation Tasks
We evaluate Adversarial World Modeling on two robotic manipulation tasks: Rope and Granular. Planning results for both tasks can be found in Table 7. To measure the accuracy of planned actions, we evaluate the Chamfer distance between the goal set of keypoints and the predicted set of keypoints.
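For reference, a standard symmetric Chamfer distance between keypoint sets can be computed as below; the exact squaring and averaging conventions used in our evaluation may differ:

```python
import numpy as np


def chamfer_distance(a, b):
    """a: (N, D) predicted keypoints, b: (M, D) goal keypoints."""
    # Pairwise squared distances between the two point sets.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    # Average nearest-neighbor distance in each direction, summed.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```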
Different World Model Architecture
We ablate the use of the DINO-WM architecture by evaluating planning performance with the IRIS (Micheli et al., 2023) architecture. Specifically, IRIS uses a VQ-VAE (van den Oord et al., 2018) for both the encoder and decoder, and a standard decoder-only Transformer (Vaswani et al., 2017). We find that even with a learned encoder, Adversarial World Modeling improves GBP performance and even CEM performance. Planning success rates of the IRIS architecture for the Wall task are reported in Table 8.
Table 8: Planning results in terms of success rate using the IRIS (Micheli et al., 2023) architecture on the Wall Task.
Long Horizon Planning
We evaluate GBP over a longer horizon in Table 9a. We use Adam in the MPC setting for each of these runs, setting a goal state 50 timesteps into the future drawn from an expert trajectory, a planning horizon of 50 steps, and 20 MPC iterations where we take a single action at each iteration. The dataset of held-out validation trajectories for the Wall environment does not contain expert trajectories of 50 timesteps, so we omit it from our evaluations. In comparison, our results in Table 1 use a goal state drawn 25 timesteps in the future and a planning horizon of 25 steps. We find that on the longer horizon, Adversarial World Modeling outperforms DINO-WM on PushT and both Adversarial and Online World Modeling outperform DINO-WM on PointMaze.
Table 9: Performance for (a) long-horizon GBP and (b) the MPPI and GradCEM algorithms.
Additional Planning Algorithms
Additionally, we evaluate both the MPPI (Williams et al., 2017c) and GradCEM (Bharadhwaj et al., 2020b) algorithms under MPC on the PushT task in Table 9b. MPPI is an online, receding-horizon controller that samples and evaluates perturbed action sequences, executes the first action of the lowest-cost trajectory, and then replans from the updated state at each timestep.
GradCEM refines the candidate sequences used to update the estimated action distribution with gradient descent to provide a more accurate estimate of the true distribution's parameters. We see that Adversarial World Modeling outperforms DINO-WM with GradCEM. Additionally, GradCEM exhibits slightly lower performance than vanilla CEM. We hypothesize this is due to the memory requirements of gradient descent necessitating reducing the number of candidate sequences by a factor of 6 compared to vanilla CEM, leading to reduced accuracy in estimating the true action distribution.
For MPPI, we use 5 samples at each MPC iteration, with 100 MPC steps. For GradCEM, we use 50 samples, 30 CEM steps, and 2 Adam steps per CEM step with an LR of 0.3, and we take 10 MPC steps.
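A minimal sketch of the GradCEM inner loop with the hyperparameters above; `cost_fn`, a differentiable batched rollout cost, is a hypothetical stand-in for the world-model planning objective:

```python
import torch


def gradcem_plan(cost_fn, horizon, action_dim, n_samples=50,
                 cem_steps=30, adam_steps=2, lr=0.3):
    """CEM whose candidates are refined by a few Adam steps before elite selection."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    n_elites = max(1, n_samples // 10)  # elite fraction is an assumption
    for _ in range(cem_steps):
        cand = mean + std * torch.randn(n_samples, horizon, action_dim)
        cand = cand.requires_grad_(True)
        opt = torch.optim.Adam([cand], lr=lr)
        for _ in range(adam_steps):  # gradient refinement of the candidates
            opt.zero_grad()
            cost_fn(cand).sum().backward()
            opt.step()
        with torch.no_grad():
            elite = cand[cost_fn(cand).argsort()[:n_elites]]
            mean, std = elite.mean(0), elite.std(0) + 1e-6
    return mean
```

The refinement step is why GradCEM must hold gradients for every candidate in memory, which forces the smaller sample count discussed above.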
Additional Train-Test Gap Results
We present additional results for the difference in World Model Error between training and planning for the PointMaze and Wall tasks in Figure 6. For both tasks, our methods have lower error during planning compared to training except for Online World Modeling on PointMaze, which is inconclusive due to the low magnitude of world model error. Planning actions are obtained after 300 steps of GBP with GD on 50 rollouts using the initial and goal state from a training trajectory.

Figure 6: Difference in World Model Error between expert trajectories and planning trajectories on (a) PointMaze and (b) Wall. Larger positive numbers indicate better performance on the actions seen during planning.
Planning Computational Efficiency
For PointMaze and Wall, we compare the planning efficiency of DINO-WM and our two approaches across planning methodologies in Figures 7 and 8, respectively. All planning is performed with MPC.
Rollout Inference Time
To understand the additional cost of using the environment simulator in Online World Modeling, we record the wall clock time of rolling out 25 steps with the DINO-WM architecture and each environment simulator in Table 10. We see that in all environments, the simulator takes longer to roll out than the world model. We also note that the simulator for all 3 tasks is deterministic in terms of reproducing the training trajectories from their actions.
Table 10: Wall clock time (in seconds) of rolling out 25 steps with each environment simulator compared to the DINO-WM architecture.
Visualizing the Optimization Landscape
We visualize the loss landscape of the DINO World Model before and after applying our Adversarial World Modeling objective. We perform a grid search over the subspace spanned by:

- â_GBP-Pretrained: the actions obtained by gradient-based planning on the original DINO World Model with 300 optimization steps of Adam with LR = 1e-3, from a fixed initialization a_init;
- a_GT: the ground-truth actions from the expert demonstrator.

We define the axes as α = â_GBP-Pretrained − a_GT and β = â_GBP-Adversarial − a_GT, and compute the loss surface over a 50 × 50 grid spanning α, β ∈ [−1.25, 1.25].
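The grid evaluation can be sketched as below; `plan_loss`, mapping an action sequence to the planning objective, is a hypothetical stand-in:

```python
import numpy as np


def loss_surface(plan_loss, a_gt, alpha_dir, beta_dir, radius=1.25, n=50):
    """Evaluate plan_loss over an n x n grid in the plane spanned by two directions."""
    coords = np.linspace(-radius, radius, n)
    surface = np.empty((n, n))
    for i, u in enumerate(coords):
        for j, v in enumerate(coords):
            # Point in action space: ground truth plus grid offsets along each axis.
            surface[i, j] = plan_loss(a_gt + u * alpha_dir + v * beta_dir)
    return surface
```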
Adversarial World Modeling: Design Decisions
Fast Gradient Sign Method (FGSM) vs. Projected Gradient Descent (PGD)
Projected Gradient Descent (PGD) has been used as an iterative method for generating adversarial perturbations (Madry et al., 2018). At each step, PGD takes a gradient ascent step and projects the result onto the space of allowed perturbations (a ball of radius ϵ around the input). The projection (Π) is typically performed via clipping or scaling. Formally,
$$ \delta^{(k+1)}=\Pi_{\lVert\delta\rVert_{\infty}\leq\epsilon}\left(\delta^{(k)}+\alpha\cdot\nabla_{x}\mathcal{L}(f_{\theta}(x+\delta^{(k)}),y)\right) $$
However, PGD is computationally expensive to use for adversarial training, as it requires an additional backward pass for each iteration. If one uses a single step, replaces the gradient by its sign, and uses step size α = ϵ, this recovers the well-known Fast Gradient Sign Method (FGSM) update (Goodfellow et al., 2014):
$$ \delta=\epsilon\cdot\operatorname{sign}\left(\nabla_{x}\mathcal{L}(f_{\theta}(x),y)\right) $$
In Wong et al. (2020), the authors demonstrate that initializing δ in the ℓ∞-ball with radius ϵ and performing FGSM adversarial training on these perturbations substantially improves robustness to PGD attacks and matches the performance of PGD-based training. We leverage this observation to perform cheap adversarial training that requires only 2× the backward passes of traditional supervised learning. In comparison, K-step PGD requires K+1 backward passes per batch (3× the supervised baseline for K=2 and 4× for K=3). In Table 11, we show that 2- and 3-step PGD do not consistently outperform FGSM, despite requiring a much larger training budget.
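A minimal sketch of one FGSM adversarial training step in this style (random start in the ϵ-ball, one signed-gradient step with step size ϵ, then a supervised update on the perturbed input); the model and loss here are generic stand-ins, not our world-model objective:

```python
import torch


def fgsm_adv_step(model, loss_fn, x, y, opt, eps):
    # Random initialization inside the l_inf ball of radius eps.
    delta = (2 * torch.rand_like(x) - 1) * eps
    delta.requires_grad_(True)
    loss_fn(model(x + delta), y).backward()  # backward pass 1: attack gradient
    with torch.no_grad():
        # One signed-gradient step, projected back into the eps-ball.
        delta = (delta + eps * delta.grad.sign()).clamp(-eps, eps)
    # Backward pass 2: supervised update on the adversarially perturbed input.
    opt.zero_grad()
    loss = loss_fn(model(x + delta), y)
    loss.backward()
    opt.step()
    return loss.item()
```

This matches the 2-backward-passes-per-batch accounting above: one pass to form the perturbation and one for the parameter update.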
Scaling Factor (λ) and Perturbation Radii (ϵ) Ablations
To assess the robustness of Adversarial World Modeling to the scaling factor and perturbation radius hyperparameters, we conduct an ablation study varying these two factors, shown in Figure 9. We evaluate λ_a, λ_z ∈ [0.0, 0.02, 0.05, 0.20, 0.50, 1.0]² and either fix ϵ_a, ϵ_p, ϵ_z to the standard deviation of the first minibatch ("Fixed") or recompute it for every minibatch ("Adaptive"). We observe no consistent improvement or degradation across any value of λ_a, for 0 ≤ λ_z ≤ 0.5, or between the "Fixed" and "Adaptive" perturbation radii. We note that setting the visual scaling factor λ_z too high (e.g., 0.5, 1.0) can significantly degrade performance. We hypothesize that excessively large perturbations distort the semantic content of the visual latent state, pushing it outside the range of semantically equivalent representations.
Table 11: FGSM vs. K-step PGD adversarial training. Both Open-Loop and MPC (Closed-Loop) planning use the Adam optimizer with the same parameters as the main experiments.

Figure 9: Success rate of closed-loop MPC planning using Adam on an Adversarial World Model trained with scaling factors λ_a, λ_z and perturbation radii ϵ_a, ϵ_z on the Wall environment. We find that 0 ≤ λ_z, λ_a ≤ 0.2 are stable for either 'Fixed' or 'Adaptive' perturbation radii.
Trajectory Visualization
We include visualizations of planning trajectories for DINO-WM, Online World Modeling, and Adversarial World Modeling to further study their success and failure modes. Visualizations for PushT and Wall can be found in Figures 10 and 11 respectively.

(a) We see that DINO-WM is more likely to enter states outside of the training distribution, and so the decoder is not able to reconstruct the state accurately. This is not the case with Online World Modeling, but it still fails to reach the goal state. Adversarial World Modeling successfully completes the task.

(b) Again we notice the failure of DINO-WM's decoder to reconstruct states it encounters during planning; this is not the case with Online World Modeling and Adversarial World Modeling, which both complete the task successfully.
Figure 10: Trajectory visualizations of the PushT task. We plot the expert trajectory to reach the goal state, alongside both the simulator states and decoded latent states for DINO-WM, Online World Modeling, and Adversarial World Modeling.
(a) In this challenging example, all three world models enter states through planning that their respective decoders cannot reconstruct, but only Online World Modeling is able to complete the task successfully.
(b) In this example, we see that DINO-WM predicts that it successfully completed the task according to its reconstructed last latent state, but the simulator indicates the true position to be off of the goal state. Online and Adversarial World Modeling correct for this and successfully complete the task.
| Method | PushT GD | PushT Adam | PushT CEM | PointMaze GD | PointMaze Adam | PointMaze CEM | Wall GD | Wall Adam | Wall CEM |
|---|---|---|---|---|---|---|---|---|---|
| DINO-WM | 38 | 54 | 78 | 12 | 24 | 90 | 2 | 10 | 74∗ |
| + MPC | 56 | 76 | 92 | 42 | 68 | 90 | 12 | 80 | 82 |
| OnlineWM | 34 | 52 | 90 | 20 | 14 | 62 | 16 | 18 | 54∗ |
| + MPC | 50 | 76 | 92 | 54 | 88 | 96 | 38 | 80 | 90 |
| AdversarialWM | 56 | 82 | 94 | 32 | 70 | 88 | 32 | 34 | 30∗ |
| + MPC | 66 | 92 | 92 | 50 | 94 | 98 | 14 | 94 | 94 |
| Environment | H | Frameskip | Dataset Size | Trajectory Length |
|---|---|---|---|---|
| Push-T | 3 | 5 | 18500 | 100-300 |
| PointMaze | 3 | 5 | 2000 | 100 |
| Wall | 1 | 5 | 1920 | 50 |
| Rope | 1 | 1 | 1000 | 5 |
| Granular | 1 | 1 | 1000 | 5 |
| Name | Value |
|---|---|
| Image size | 224 |
| Optimizer | AdamW |
| Predictor LR | 1e-5 |

(a) Finetuning Parameters

| Name | GD | Adam |
|---|---|---|
| Opt. steps | 300 | 300 |
| LR | 1.0 | 0.3 |

(b) Open-Loop Planning

| Name | GD | Adam |
|---|---|---|
| MPC steps | 10 | 10 |
| Opt. steps | 100 | 100 |
| LR | 1 | 0.2 |

(c) MPC Parameters
| Environment | # Rollouts | Batch Size | GPU | Epochs | ϵ visual | ϵ proprio | ϵ action |
|---|---|---|---|---|---|---|---|
| PushT | 20000 (all) | 16 | 8x B200 | 1 | 0.05 | 0.02 | 0.02 |
| PointMaze | 2000 (all) | 16 | 1x B200 | 1 | 0.2 | 0.08 | 0.08 |
| Wall | 1920 (all) | 48 | 1x B200 | 2 | 0.2 | 0.08 | 0.08 |
| Environment | # Rollouts | Batch Size | GPU | Epochs |
|---|---|---|---|---|
| PushT | 6000 | 32 | 4x B200 | 1 |
| PointMaze | 500 | 32 | 4x B200 | 1 |
| Wall | 1920 (all) | 80 | 4x B200 | 1 |
| PushT | PushT | PointMaze | PointMaze | Wall | Wall | |
|---|---|---|---|---|---|---|
| GD+IN | Ad+IN | GD+IN | Ad+IN | GD+IN | Ad+IN | |
| DINO-WM | 44 60 | 62 84 | 16 40 | 14 54 | 4 6 | 12 32 |
| + MPC | 56 | 8 | 28 46 | |||
| OnlineWM + MPC | 66 | 10 | 18 | |||
| 52 | 82 | 40 | 2 | 22 | ||
| AdversarialWM | 74 | 90 | 22 | 36 | 18 | 24 |
| + MPC | 74 | 90 | 44 | 56 | 24 | 48 |
| Rope | Rope | Granular | Granular | |
|---|---|---|---|---|
| GD | CEM | GD | CEM | |
| DINO-WM | 1.73 | 0.93 | 0.30 | 0.22 |
| AdversarialWM | 0.93 | 0.82 | 0.24 | 0.28 |
| GD | CEM | |
|---|---|---|
| IRIS | 0 | 4 |
| IRIS + OnlineWM | 0 | 0 |
| IRIS + AdversarialWM | 8 | 6 |
| PushT | PointMaze | |
|---|---|---|
| DINO-WM | 16 | 70 |
| OnlineWM | 16 | 96 |
| AdversarialWM | 26 | 88 |
| PushT | PointMaze | Wall | |
|---|---|---|---|
| Simulator | 0.959 | 0.717 | 4.465 |
| DINO-WM | 0.029 | 0.029 | 0.029 |
| Method | Backward Passes | PointMaze Min/Epoch | PointMaze Open-Loop | PointMaze MPC | Wall Min/Epoch | Wall Open-Loop | Wall MPC |
|---|---|---|---|---|---|---|---|
| FGSM | 2 | 120 | 70 | 94 | 14 | 34 | 94 |
| 2-Step PGD | 3 | 165 | 80 | 96 | 20 | 8 | 90 |
| 3-Step PGD | 4 | 201 | 78 | 94 | 24 | 14 | 94 |
World models paired with model predictive control (MPC) can be trained offline on large-scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively solving optimization problems exactly, gradient-based planning offers a computationally efficient alternative. However, the performance of gradient-based planning has thus far lagged behind that of other approaches. In this paper, we propose improved methods for training world models that enable efficient gradient-based planning. We begin with the observation that although a world model is trained on a next-state prediction objective, it is used at test-time to instead estimate a sequence of actions. The goal of our work is to close this train-test gap. To that end, we propose train-time data synthesis techniques that enable significantly improved gradient-based planning with existing world models. At test time, our approach outperforms or matches the classical gradient-free cross-entropy method (CEM) across a variety of object manipulation and navigation tasks in 10% of the time budget.
github.com/nimitkalra/robust-world-model-planning
In robotic tasks, anticipating how the actions of an agent affect the state of its environment is fundamental for both prediction (Finn2016UnsupervisedLFA) and planning (mohanan2018survey; kavraki2002probabilistic). Classical approaches derive models of the environment evolution analytically from first principles, relying on prior knowledge of the environment, the agent, and any uncertainty (goldstein1950classical; siciliano2009robotics; spong2020robot). In contrast, learning-based methods infer such models directly from data, enabling them to capture complex dynamics and thus improve generalization and robustness to uncertainty (sutton1998reinforcement; schrittwieser2020mastering; lecun2022path).
World models (ha2018world), in particular, have emerged as a powerful paradigm. Given the current state and an action, the world model predicts the resulting next state. These models can be learned either from exact state information (sutton1991dyna) or from high-dimensional sensory inputs such as images (hafner2023mastering). The latter setup is especially compelling as it enables perception, prediction, and control directly from raw images by leveraging pre-trained visual representations, and removes the need to measure precise environment states, which is difficult in practice (assran2023self; Bardes2024RevisitingFPA). Recently, world models and their predictive capabilities have been leveraged for planning, enabling agents to solve a variety of tasks (hafner2019dream; hafner2019learning; schrittwieser2020mastering; hafner2023mastering; zhou2025dinowmworldmodelspretrained). A model of the dynamics is learned offline, while the planning task is defined at inference as a constrained optimization problem: given the current state, find a sequence of actions that results in a state as close as possible to the target state. This inference-time optimization provides an effective alternative to reinforcement learning approaches (sutton1998reinforcement), which often suffer from poor sample efficiency.
World models are compatible with many model-based planning algorithms. Traditional methods such as DDP (mayne1966second) and iLQR (li2004iterative) rely on iteratively solving exact optimization problems derived from linear and quadratic approximations of the dynamics around a nominal trajectory. While highly effective in low-dimensional settings, these methods become impractical for large-scale world models, where solving the resulting optimization problem is computationally intractable. As an alternative, search-based methods such as the Cross Entropy Method (CEM) (rubinstein2004cross) and Model Predictive Path Integral control (MPPI) (williams2017model) have been widely adopted as gradient-free alternatives and have proven effective in practice. However, they are computationally intensive as they require iteratively sampling candidate solutions and performing world model rollouts to evaluate each one, a procedure that scales poorly in high-dimensional spaces. Gradient-based methods (sv2023gradient), in contrast, avoid the limitations of sampling by directly exploiting the differentiability of world models to optimize actions end-to-end. These methods eliminate the costly rollouts required by search-based approaches, thus scaling more efficiently in high-dimensional spaces. Despite this promise, gradient-based approaches have thus far seen limited empirical success.
This procedure suffers from a fundamental train-test gap. World models are typically trained using a next-state prediction objective on datasets of expert trajectories. At test time, however, they are used to optimize a planning objective over sequences of actions. We argue that this mismatch underlies the poor empirical performance of gradient-based planning (GBP), and we offer two hypotheses to explain why. (1) During planning, the intermediate sequence of actions explored by gradient descent drives the world model into states that were not encountered during training. In these out-of-distribution states, model errors compound, making the world model unreliable as a surrogate for optimization. (2) The action-level optimization landscape induced by the world model may be difficult to traverse, containing many poor local minima or flat regions, which hinders effective gradient-based optimization.
In this work, we address both of these challenges by proposing two algorithms: Online World Modeling and Adversarial World Modeling. Both expand the region of familiar latent states by continuously adding new trajectories to the dataset and finetuning the world model on them. To manage the distribution shift between offline expert trajectories and predicted trajectories from planning, Online World Modeling uses the environment simulator to correct states along a trajectory produced by performing GBP. Finetuning on these corrected trajectories ensures that the world model performs sufficiently well when GBP enters regimes of latent state space outside of the expert trajectory distribution. To overcome the difficulties of optimizing over a non-smooth loss surface during GBP, Adversarial World Modeling perturbs expert trajectories in the direction that maximizes the world model’s loss. Adversarial finetuning smooths the induced action loss landscape, making it easier to optimize via gradient-based planning. We provide a visual depiction of both methods in Figure 1.
We show that finetuning world models with these algorithms leads to substantial improvements in the performance of gradient-based planning (GBP). Applying Adversarial World Modeling to a pretrained world model enables gradient-based planning to match or exceed the performance of search-based CEM on a variety of robotic object manipulation and navigation tasks. Importantly, this performance is achieved with a 10× reduction in computation time compared to CEM, underscoring the practicality of our approach for real-world planning. Additionally, we empirically demonstrate that Adversarial World Modeling smooths the planning loss landscape, and that both methods can reverse the train-test gap in world model error.
World models learn environment dynamics by predicting the state resulting from taking an action in the current state. Then, at test time, the learned world model enables planning by simulating future trajectories and guiding action optimization. Formally, a world model approximates the (potentially unknown) dynamics function $h\colon \mathcal{S}\times\mathcal{A}\to\mathcal{S}$, where $\mathcal{S}$ denotes the state space and $\mathcal{A}$ the action space. The environment evolves according to
$$
s_{t+1} = h(s_t, a_t),
$$
where $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$ denote the state and action at time $t$, respectively.
In practice, we typically do not have access to the exact state of the environment; instead, we only receive partial observations of it, such as images. In order for a world model to efficiently learn in the high-dimensional observation space $\mathcal{O}$, an embedding function $\Phi_\mu\colon \mathcal{O}\to\mathcal{Z}$ is employed to map observations to a lower-dimensional latent space $\mathcal{Z}$. Then, given an embedding function $\Phi_\mu$, our goal is to learn a latent world model $f_\theta\colon \mathcal{Z}\times\mathcal{A}\to\mathcal{Z}$ such that
$$
f_\theta(z_t, a_t) \approx z_{t+1}, \qquad \text{where } z_t = \Phi_\mu(o_t).
$$
The choice of Φμ\Phi_{\mu} directly affects the expressivity of the latent world model. In this work, we use a fixed encoder pretrained with self-supervised learning that yields rich feature representations out of the box.
To train a latent world model, we sample triplets of the form $(o_t, a_t, o_{t+1})$ from an offline dataset of trajectories $\mathcal{T}$ and minimize the $\ell_2$ distance between the true next latent state $z_{t+1} = \Phi_\mu(o_{t+1})$ and the predicted next latent state $\hat{z}_{t+1}$. This procedure is represented by the following teacher-forcing objective:
$$
\min_\theta \; \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim \mathcal{T}} \left[ \lVert f_\theta(\Phi_\mu(o_t), a_t) - \Phi_\mu(o_{t+1}) \rVert_2^2 \right].
$$
Notably, we only minimize this objective with respect to the world model’s parameters $\theta$, not those of the potentially large embedding function.
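To make the training setup concrete, the sketch below minimizes the teacher-forcing objective by plain gradient descent in a self-contained toy: the frozen encoder is folded away (we work directly on latent triplets), the world model is a linear map `W`, and `step` stands in for the environment dynamics. All names here are illustrative, not the paper's code; with a neural model, the hand-written gradient would be replaced by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da = 4, 2

# Toy latent dynamics: z_{t+1} = W_true [z; a]. The encoder Phi is folded
# in, so the dataset consists of latent triplets (z_t, a_t, z_{t+1}).
W_true = 0.3 * rng.normal(size=(dz, dz + da))
def step(z, a):
    return W_true @ np.concatenate([z, a])

# Offline dataset of triplets sampled from "expert" trajectories.
batch = [(rng.normal(size=dz), rng.normal(size=da)) for _ in range(64)]
batch = [(z, a, step(z, a)) for z, a in batch]

W = np.zeros((dz, dz + da))  # world model parameters theta

def tf_loss(W):
    # Teacher-forcing objective: mean squared next-state prediction error.
    return float(np.mean([np.sum((W @ np.concatenate([z, a]) - zn) ** 2)
                          for z, a, zn in batch]))

lr = 0.05
for _ in range(200):
    g = np.zeros_like(W)
    for z, a, zn in batch:
        x = np.concatenate([z, a])
        g += 2 * np.outer(W @ x - zn, x) / len(batch)  # d loss / d W
    W -= lr * g

print(tf_loss(W))  # near zero: the model fits next-state prediction
```

Only `W` is updated here, mirroring the fact that the (frozen) embedding function receives no gradient.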
During test time, we use a learned world model to optimize candidate action sequences for reaching a goal state. By recursively applying the world model over an action sequence starting from an initial latent state, we obtain a predicted latent goal state and therefore the distance to the true goal state in latent space. This allows us to find the optimal action sequence
$$
\{\hat{a}_t\}_{t=1}^{H} = \arg\min_{\{a_t\} \in \mathcal{A}^H} \lVert \hat{z}_{H+1} - z_{\text{goal}} \rVert_2^2,
$$
where $\hat{z}_{H+1}$ is produced by the recursive procedure
$$
\hat{z}_{t+1} = f_\theta(\hat{z}_t, a_t), \qquad \hat{z}_1 = \Phi_\mu(o_1).
$$
We use the function $\text{rollout}_f\colon \mathcal{Z}\times\mathcal{A}^H\to\mathcal{Z}^H$ to denote this recursive procedure.
Gradient-based planning (GBP) solves the planning objective (4) via gradient descent. Crucially, since the world model is differentiable, $\nabla_{\{\hat{a}_t\}} \hat{z}_{H+1} = \nabla_{\{\hat{a}_t\}} \text{rollout}_f(z_1, \{\hat{a}_t\})_{H+1}$ is well-defined. In contrast, the search-based CEM is gradient-free, but requires evaluating substantially more action sequences. We detail GBP in Algorithm 1 and CEM in Section A.2.
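The mechanics of GBP can be sketched on a toy linear world model $z_{t+1} = Az_t + Ba_t$, where the gradient of the rollout with respect to each action is available in closed form ($\partial \hat{z}_{H+1} / \partial \hat{a}_t = A^{H-1-t}B$); with a neural world model, the same quantity would come from backpropagating through the rollout. The names `A`, `B`, and `rollout` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da, H = 4, 2, 5

# Toy differentiable world model: a contracting rotation A and random B.
c, s = np.cos(0.7), np.sin(0.7)
A = 0.9 * np.array([[c, -s, 0, 0], [s, c, 0, 0],
                    [0, 0, c, -s], [0, 0, s, c]])
B = 0.5 * rng.normal(size=(dz, da))
z1, z_goal = rng.normal(size=dz), rng.normal(size=dz)

def rollout(z, actions):
    # Recursively apply the model over the action sequence.
    for a in actions:
        z = A @ z + B @ a
    return z

# Closed-form rollout Jacobians: d z_{H+1} / d a_t = A^{H-1-t} B.
grads = [np.linalg.matrix_power(A, H - 1 - t) @ B for t in range(H)]

# GBP: gradient descent on ||z_{H+1} - z_goal||^2 over the actions.
actions = np.zeros((H, da))
lr = 0.03
for _ in range(3000):
    resid = 2 * (rollout(z1, actions) - z_goal)
    for t in range(H):
        actions[t] -= lr * grads[t].T @ resid

print(np.linalg.norm(rollout(z1, actions) - z_goal))
```

Because the objective is induced entirely by the model, the same loop applied to a learned nonlinear model inherits whatever sharpness or local minima that model's landscape contains.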
As errors can propagate over long horizons, Model Predictive Control (MPC) is commonly used to repeatedly re-plan by optimizing an $H$-step action sequence but executing only the first $K \leq H$ actions before replanning from the updated state.
As the planning objective is induced entirely by the world model, the success of GBP hinges on (1) the model accurately predicting future states under any candidate action sequence, and (2) the stability of this differentiable optimization. We now present two finetuning methods designed to improve on these fronts.
During gradient-based planning, the action sequences being optimized are not constrained to lie within the distribution of behavior seen during training. World models are typically trained on fixed datasets of expert trajectories, whereas GBP selects actions solely to improve the planning objective, without regard to whether those actions resemble expert behavior. As a result, the optimization process often proposes action sequences that are out of distribution. Optimizing through learned models under such conditions is known to induce adversarial inputs (szegedy2013intriguing; goodfellow2014explaining). In our setting, these adversarial action sequences drive the world model into regions of the latent state space that were rarely or never observed during training, causing large prediction errors. Even when errors are initially small, they accumulate as the planner rolls the model forward, ultimately degrading long-horizon planning performance.
To address this issue, we propose Online World Modeling, which iteratively corrects the trajectories produced by GBP and finetunes the world model on the resulting rollouts. Rather than training solely on expert demonstrations, we repeatedly incorporate trajectories induced by the planner itself, thereby expanding the region of latent states that the world model can reliably predict.
First, we conduct GBP using the initial and goal latent states of an expert trajectory $\tau$, yielding a sequence of predicted actions $\{\hat{a}_t\}_{t=1}^{H}$. These actions might send the world model into regions of the latent space that lie outside of the training distribution. To adjust for this, we obtain a corrected trajectory: the actual sequence of states that would result from executing the action sequence $\{\hat{a}_t\}_{t=1}^{H}$ in the environment using the true dynamics simulator $h$. We add the corrected trajectory,
$$
\tau' = \left(o_1, \hat{a}_1, o'_2, \hat{a}_2, \ldots, \hat{a}_H, o'_{H+1}\right), \qquad \text{where } o'_{t+1} \text{ is the observation after executing } \hat{a}_t \text{ under } h,
$$
to the dataset that the world model trains with every time the dataset is updated. Re-training on these corrected trajectories expands the training distribution to cover the regions of latent space induced by gradient-based planning, mitigating compounding prediction errors during planning. We provide more detail in Algorithm 2 and illustrate the method in Figure 1.
This procedure is reminiscent of DAgger (Dataset Aggregation) (ross2011reduction), an online imitation learning method wherein a base policy network is iteratively trained on its own rollouts with the action predictions replaced by those from an expert policy. In a similar spirit, we invoke the ground-truth simulator as our expert world model that we imitate.
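The aggregation loop can be illustrated with a schematic toy: a nonlinear simulator `h`, a linear world model fit by least squares, and "planner" actions that leave the expert distribution. The simulator corrects the resulting next states, and refitting on the union reduces model error on planner-induced states. All names (`h`, `fit`, `error`) are illustrative stand-ins, not the paper's code, and the least-squares refit stands in for finetuning.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da = 3, 2

# Nonlinear "simulator" h; the world model is a linear map fit to triplets.
M = 0.8 * rng.normal(size=(dz, dz + da))
def h(z, a):
    return np.tanh(M @ np.concatenate([z, a]))

def fit(dataset):
    # Least-squares fit of the world model to (z, a, z_next) triplets.
    X = np.stack([np.concatenate([z, a]) for z, a, _ in dataset])
    Y = np.stack([zn for _, _, zn in dataset])
    return np.linalg.lstsq(X, Y, rcond=None)[0].T

def error(W, triplets):
    return float(np.mean([np.sum((W @ np.concatenate([z, a]) - zn) ** 2)
                          for z, a, zn in triplets]))

# Expert trajectories stay in a narrow, near-linear regime.
expert = [(0.2 * rng.normal(size=dz), 0.2 * rng.normal(size=da))
          for _ in range(100)]
data = [(z, a, h(z, a)) for z, a in expert]
W = fit(data)

# "Planner" actions are out of distribution; the simulator corrects the
# resulting next states, and the corrected triplets are added back.
planner = [(rng.normal(size=dz), 2.0 * rng.normal(size=da))
           for _ in range(100)]
corrected = [(z, a, h(z, a)) for z, a in planner]

before = error(W, corrected)   # large: model never saw these regions
W = fit(data + corrected)      # refit on the aggregated dataset
after = error(W, corrected)
print(before, after)           # error on planner-induced states drops
```

The key mechanism is the same as in the paper's Algorithm 2: the simulator, playing the role of the DAgger expert, supplies the corrected targets for states the planner actually visits.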
Since world models are only trained on the next-state prediction objective, there is no particular reason for their input gradients to be well-behaved. Adversarial training has been shown to result in better behaved input gradients (mejia2019robust), consequently smoothing the input loss surface. Motivated by this observation, we propose an adversarial training objective that explicitly targets regions of the state-action space where the world model is expected to perform poorly. These adversarial samples may lie outside the expert trajectory distribution, which can expose the model to precisely the regions that matter for action optimization. We find that this procedure, which we call Adversarial World Modeling, does in fact smooth the loss surface of the planning objective (see Figure 2), improving the stability of action-sequence optimization.
Adversarial training improves model robustness by optimizing performance under worst-case perturbations (madry2019deeplearningmodelsresistant). An adversarial example is generated by applying a perturbation $\delta$ to an input that maximally increases the model’s loss. To train a world model on adversarial examples, we use the objective
$$
\min_\theta \; \mathbb{E}_{(z_t, a_t, z_{t+1})} \left[ \max_{\delta_z \in \mathcal{B}_z,\, \delta_a \in \mathcal{B}_a} \lVert f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \rVert_2^2 \right],
$$
where $\mathcal{B}_a = \{\delta_a : \lVert\delta_a\rVert_\infty \leq \epsilon_a\}$ and $\mathcal{B}_z = \{\delta_z : \lVert\delta_z\rVert_\infty \leq \epsilon_z\}$ constrain the magnitude of perturbations for given $\epsilon_a, \epsilon_z$. Training on these adversarially perturbed trajectories provides an alternative method to Online World Modeling for surfacing states that may be encountered during planning, without relying on GBP rollouts. This is a significant advantage in settings where simulation is expensive or infeasible.
We generate adversarial latent states using the Fast Gradient Sign Method (FGSM) (goodfellow2014explaining), which efficiently approximates the worst-case perturbations that maximize prediction error (fastbetterthanfree). Although stronger iterative attacks such as Projected Gradient Descent (PGD) can be used, we find that FGSM delivers comparable improvements in GBP performance while being significantly more computationally efficient (see Section D.1). This enables us to generate adversarial samples over entire large-scale offline imitation learning datasets.
For each state-action pair in a given minibatch, we look for small changes to the latent state or action that most increase the world model’s prediction error. Let $\epsilon_a, \epsilon_z$ denote the radii of the perturbations to the actions $\{a_t\}$ and latent states $\{z_t\}$, respectively. We compute gradients $\nabla_{\delta_a, \delta_z} \lVert f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \rVert_2^2$ with respect to the perturbations and take a signed gradient ascent step (i.e., in a direction that degrades the prediction) with step sizes $\alpha_a = 1.25\epsilon_a$ and $\alpha_z = 1.25\epsilon_z$. We clip the result so that each entry of the perturbation stays within the radius. This procedure corresponds to a single step of a PGD-style attack, producing perturbations that lie on the edge of the allowed region, where they are maximally challenging for the model. See Algorithm 3 for a detailed treatment.
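The perturbation step can be sketched on a toy linear world model, where the input gradient is available in closed form (a neural $f_\theta$ would supply it via backpropagation). The random initialization inside the ℓ∞-ball follows the FGSM variant discussed in Section D.1; the model `Wz`, `Wa` and the radii chosen from the data scale are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
nz, na = 4, 2  # latent and action dimensions

# Toy linear world model f(z, a) = Wz z + Wa a; its input gradient below
# stands in for backpropagation through a neural world model.
Wz = 0.4 * rng.normal(size=(nz, nz))
Wa = 0.4 * rng.normal(size=(nz, na))
z, a = rng.normal(size=nz), rng.normal(size=na)
z_next = Wz @ z + Wa @ a  # training target (exact, so unperturbed loss is 0)

def loss(dz, da):
    r = Wz @ (z + dz) + Wa @ (a + da) - z_next
    return float(r @ r)

# Radii from the data scale (cf. the 'Adaptive' choice: std of the
# minibatch); signed step sizes alpha = 1.25 * eps.
eps_z, eps_a = 0.1 * np.std(z), 0.1 * np.std(a)

# Random init in the l_inf ball, one signed ascent step, then clip.
dz = rng.uniform(-eps_z, eps_z, size=nz)
da = rng.uniform(-eps_a, eps_a, size=na)
resid = 2 * (Wz @ (z + dz) + Wa @ (a + da) - z_next)
dz = np.clip(dz + 1.25 * eps_z * np.sign(Wz.T @ resid), -eps_z, eps_z)
da = np.clip(da + 1.25 * eps_a * np.sign(Wa.T @ resid), -eps_a, eps_a)

# The perturbed pair raises the prediction error above its unperturbed value.
print(loss(np.zeros(nz), np.zeros(na)), loss(dz, da))
```

Because only one extra gradient computation is needed per minibatch, this keeps the 2× backward-pass budget discussed in Section D.1.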
To initialize the perturbation radii $\epsilon_a, \epsilon_z$, we use scaling factors $\lambda_a, \lambda_z$ and find that Adversarial World Modeling is robust for $0 \leq \lambda_a \leq 1$ and $0 \leq \lambda_z \leq 0.5$. Furthermore, we find that fixing $\epsilon_a, \epsilon_z$ to the standard deviation of the initial minibatch is stable across all experiments. Updating this estimate for each batch as in Algorithm 3 yields no consistent improvement in final planning performance. We further analyze design ablations in Appendix D.
We evaluate our methods by finetuning world models pretrained with the next-state prediction objective on three tasks: PushT, PointMaze, and Wall. For each task we measure the success rate of reaching a target configuration $o_{\text{goal}}$ from an initial configuration $o_1$. We report planning results with both open-loop planning and MPC in Table 1. In the open-loop setting, we run Algorithm 1 from $o_1$ once and evaluate the predicted action sequence. In the MPC setting, we run Algorithm 1 once for each MPC step (using $\Phi_\mu(o_1)$ as the initial latent state for the first MPC step), roll out the predicted actions $\{\hat{a}_t\}$ in the environment simulator to reach latent state $\hat{z}_{H+1}$, and set $\hat{z}_1 = \hat{z}_{H+1}$ for the next MPC iteration. We report all finetuning, planning, and optimization hyperparameters in Table 3.
We use DINO-WM (zhou2025dinowmworldmodelspretrained) as our initial world model for its strong performance with CEM across our chosen tasks. The embedding function $\Phi_\mu$ is taken to be the pre-trained DINOv2 encoder (oquab2024dinov2learningrobustvisual), and remains frozen while finetuning the transition model $f_\theta$, which is implemented using the ViT architecture (dosovitskiy2021imageworth16x16words). We additionally train a VQVAE decoder (oord2018neuraldiscreterepresentationlearning) to visualize latent states, though it plays no role in planning. To validate the broad applicability of our approach, we also study the use of the IRIS (micheli2023transformerssampleefficientworldmodels) world model architecture in Section B.3.
To initialize the action sequence for planning optimization, we evaluate both random sampling from a standard normal distribution and the use of an initialization network. Our initialization network $g_\theta\colon \mathcal{Z}\times\mathcal{Z}\to\mathcal{A}^T$ is trained such that $g_\theta(z_1, z_g) = \{\hat{a}_t\}_{t=1}^{T}$. We find that random initialization tends to outperform the initialization network, and we analyze its impact in depth in Section B.1.
During GBP, we set $\mathcal{L}_{\text{goal}}$ in Algorithm 1 to a weighted goal loss to obtain a gradient from each predicted state instead of only the last one. We find empirically that this objective generalizes to both navigation (e.g., PointMaze and Wall) and non-navigation tasks (e.g., PushT); i.e., on tasks with or without subgoal decomposability, it improves or matches the performance of the final-state loss. We provide the exact formulation and more details in Section A.4. We additionally evaluate using the Adam optimizer (Kingma2014AdamAM) during GBP. Although Adam improves performance significantly over GD for all world models in our experiments, we find that Adam alone does not scale performance to match or surpass CEM.
On all three tasks, our methods outperform DINO-WM when performing GBP with gradient descent, and either match or outperform it under the far more expensive CEM. In the open-loop setting, we achieve success-rate increases of +18% on PushT, +20% on PointMaze, and +30% on Wall. In the MPC setting, Adam GBP with Adversarial World Modeling outperforms CEM with DINO-WM on PointMaze and Wall and matches CEM on PushT.
While both Online World Modeling and Adversarial World Modeling bootstrap new data to improve the robustness of our world model during GBP, the distributions they induce are quite different. Whereas Online World Modeling anticipates and covers the distribution seen at planning time, Adversarial World Modeling exploits the current loss landscape of the world model to encourage local smoothness near expert trajectories. For all environments, we find Adversarial World Modeling outperforms Online World Modeling when using Adam to perform GBP.
To demonstrate the advantages of Adversarial World Modeling in more complex environments where the simulator may be very costly and the number of action dimensions is larger, we also evaluate planning performance on two robotic manipulation tasks in Section B.2.
Comparing the world model error between training trajectories and planning trajectories allows us to evaluate whether the world model will perform well during planning even if it is trained to convergence on expert trajectories. We evaluate world model error as the deviation between the world model’s predicted next latent state and the next latent state given by the environment simulator. Given an initial state $s_1$ (associated with $o_1$) and a sequence of actions $\{a_t\}$ (either from the training dataset or a planning procedure), the world model error at timestep $t$ is given by
$$
e_t = \lVert f_\theta(\Phi_\mu(o_t), a_t) - \Phi_\mu(o_{t+1}) \rVert_2,
$$
where $o_{t+1}$ is the observation produced by the simulator after executing $a_t$.
This error is averaged over all timesteps of a trajectory. If the difference in world model error between expert trajectories and planning trajectories is negative, then the world model will perform relatively worse on sequences of actions produced during planning. Figure 4 demonstrates that this is the case with DINO-WM, but not with Online World Modeling or Adversarial World Modeling, indicating a narrowing of the train-test gap. See Section B.6 for results for PointMaze and Wall.
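This gap can be illustrated on a toy setup: a world model that matches a nonlinear simulator only near its training regime shows low error on expert-like (small) actions and much higher error on planner-like (large) actions. Here `h`, `f`, and `wm_error` are illustrative stand-ins, with `f` taken to be the simulator's linearization around the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
nz, na = 3, 2

# Nonlinear simulator h; the world model f agrees with it only near the
# origin (it is the linearization of h), mimicking a model trained on a
# narrow expert distribution.
M = 0.5 * rng.normal(size=(nz, nz + na))
def h(z, a):
    return np.tanh(M @ np.concatenate([z, a]))
def f(z, a):
    return M @ np.concatenate([z, a])

def wm_error(z1, actions):
    # Mean per-step deviation between model prediction and simulator,
    # following the simulator's states along the trajectory.
    z, errs = z1, []
    for a in actions:
        z_next = h(z, a)
        errs.append(np.linalg.norm(f(z, a) - z_next))
        z = z_next
    return float(np.mean(errs))

z1 = 0.1 * rng.normal(size=nz)
expert_actions = [0.1 * rng.normal(size=na) for _ in range(20)]  # in-distribution
plan_actions = [2.0 * rng.normal(size=na) for _ in range(20)]    # planner-like

print(wm_error(z1, expert_actions), wm_error(z1, plan_actions))
```

A negative expert-minus-planning error difference in this sense is exactly the train-test gap the proposed finetuning methods aim to close.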
When using a world model to conduct planning in real-world settings, fast inference is crucial for actively interacting with the environment. On all three tasks, we find that GBP with Adversarial World Modeling is able to match or come near the best performing world model when planning with CEM, in over an order of magnitude less wall clock time. We compare wall clock times across world models and planning procedures for PushT in Figure 3. The planning efficiency results for PointMaze and Wall can be found in Section B.7.
Learning world models from sensory data. Learning-based dynamics models have become central to control and decision making, offering a data-driven alternative to classical approaches that rely on first principles modeling (goldstein1950classical; Schmidt2009DistillingFNA; macchelli2009port). Early work focused on modeling dynamics in low-dimensional state-space (deisenroth2011pilco; lenz2015deepmpc; henaff2017model; Sharma2019DynamicsAwareUD), while more recent methods learn directly from high-dimensional sensory inputs such as images. Pixel-space prediction methods (Finn2016UnsupervisedLFA; Kaiser2019ModelBasedRL) have shown success in applications such as human motion prediction (Finn2016UnsupervisedLFA), robotic manipulation (Finn2016DeepVF; agrawal2016learning; zhang2019solar), and solving Atari games (Kaiser2019ModelBasedRL), but they remain computationally expensive due to the cost of image reconstructions. To address this, alternative approaches learn a compact latent representation where dynamics are modeled (Karl2016DeepVB; hafner2019learning; Shi2022RoboCraftLT; karypidis2024dino). These models are typically supervised either by decoding latent predictions to match ground truth observations (Edwards2018ImitatingLPA; Zhang2021DeformableLOA; bounou2021online; Hu2022ModelBasedILA; Akan2022StretchBEVSFA; hafner2019learning), or by using prediction objectives that operate directly in latent space, such as those in joint-embedding prediction architectures (JEPAs) (lecun2022path; Bardes2024RevisitingFPA; Drozdov2024VideoRLA; Guan2024WorldMFA; zhou2025dinowmworldmodelspretrained). Our work builds upon this latter category of world models and specifically leverages the DINOv2-based latent world models introduced in zhou2025dinowmworldmodelspretrained. 
However, unlike prior work that primarily targets improving general representation quality or prediction accuracy, we focus on enhancing the trainability of world models to improve the convergence and reliability of gradient-based planning.
Planning with world models. Planning with world models is challenging due to the non-linearity and non-convexity of the objective. Search-based methods such as CEM (rubinstein2004cross) and MPPI (williams2017model) are widely used in this context (Williams2017InformationTMA; Nagabandi2019DeepDMA; hafner2019learning; Zhan2021ModelBasedOPA; zhou2025dinowmworldmodelspretrained). These methods explore the action space effectively, helping to escape from local minima, but typically scale poorly in high-dimensional settings due to their sampling-based nature. In contrast, gradient-based methods offer a more scalable alternative by exploiting the differentiability of the world model to optimize actions directly via backpropagation. Despite their efficiency, these methods suffer from local minima in highly non-smooth loss landscapes (Bharadhwaj2020ModelPredictiveCVA; Xu2022AcceleratedPLA; Chen2022BenchmarkingDOA; Wang2023SoftZooASA), and gradient optimization can induce adversarial action sequences that exploit model inaccuracies (Schiewer2024ExploringTLA; Jackson2024PolicyGuidedDA). zhou2025dinowmworldmodelspretrained have observed that GBP is particularly brittle when used with world models built on pre-trained visual embeddings, such as DINOv2 (oquab2024dinov2learningrobustvisual), often underperforming compared to CEM. To address these challenges, several stabilizing techniques have been proposed. For instance, random-sampling shooting helps mitigate adversarial trajectories by injecting noise in the action sequence and exploring a broader set of actions during trajectory optimization (nagabandi2018neural), and Zhang2025StateAwarePOA introduce adversarial attacks on learned policies to make them robust to environmental perturbations by selectively perturbing state inputs at inference time. In contrast, we apply perturbation directly to latent states and latent actions during world model training. 
florence2022implicit add gradient penalties when training an implicit policy function to improve its smoothness and stabilize optimization, but their method does not involve training or using a world model. Other approaches aim to use a hybrid method that combines search and gradient steps to balance global exploration and local refinement (Bharadhwaj2020ModelPredictiveCVA). In our work, we modify the world-model training procedure itself to improve GBP stability. In particular, through our Adversarial World Modeling approach, we enhance the robustness of the world model to perturbed states and actions, producing more stable and informative gradients that prevent adversarial action sequences at test time.
Train-test gap in world models. A key challenge when planning with learned world models is the mismatch between the training objective and the planning objective (lambert2020objective). During training, world models are typically optimized to minimize one-step prediction or reconstruction error on trajectories collected from expert demonstrations or behavioral policies. At test time, however, the same models are used inside a planner to optimize multi-step action sequences. As a result, the objectives at training and test time are inherently different, inducing a distribution shift between trajectories seen during training and those encountered during planning. This mismatch can cause planners to drive the model into out-of-distribution regions of the state space, where prediction errors compound over time and the model becomes unreliable for long-horizon optimization (Ajay2018AugmentingPSA; Ke2019LearningDMA; Zhu2023DiffusionMFA). A common strategy to address this train-test gap is dataset aggregation (ross2011reduction), which expands the training distribution by rolling out action trajectories generated by the planning algorithm and adding them to the training set (Talvitie2014ModelRFA; nagabandi2018neural). However, unlike these approaches, which typically apply this technique directly in the environment’s low-dimensional state space, we apply dataset aggregation in the context of high-dimensional latent world models, where training occurs in latent space rather than directly on states. Through our Online World Modeling approach, we explicitly close the train-test gap for gradient-based planning by using the planner itself to generate off-distribution trajectories and correcting them with simulator feedback.
In this work, we introduced Online World Modeling and Adversarial World Modeling as techniques for addressing the train-test gap that arises when world models trained on next-state prediction are used for iterative gradient-based planning. Across our experiments, these methods substantially improve the reliability of GBP and, in some settings, allow it to match or outperform sampling-based planners such as CEM. By narrowing this gap, our results suggest that gradient-based planning can be a practical alternative for planning with world models, particularly in settings where computational efficiency is critical. An important direction for future work is to evaluate these methods on real-world systems. Adversarial training may additionally improve a world model’s robustness to environmental adversaries or stochasticity. More broadly, world models offer a natural advantage over policy-based reinforcement learning in long-horizon decision making. We believe our methods are especially well-suited to multi-timescale or hierarchical world models, where long-horizon planning is enabled by improving planning stability at different levels of abstraction.
Compute resources used in this work were provided by the Modal and NVIDIA Academic Grants. Micah Goldblum was supported by the Google Cyber NYC Award.
PushT: This task, introduced by pusht, uses an agent interacting with a T-shaped block to guide both the agent and block from a randomly initialized state to a feasible goal state within 25 steps. We use the dataset of 18500 trajectories given in zhou2025dinowmworldmodelspretrained, in which the green anchor serves purely as a visual reference. We draw a goal state from one of the noisy expert trajectories at 25 steps from the starting state.
PointMaze: In this task, introduced by pointmaze, a force-actuated ball that can move in the $x, y$ Cartesian directions must reach a target goal within a maze. We use the dataset of 2000 random trajectories provided in zhou2025dinowmworldmodelspretrained, with a goal state chosen 25 steps from the starting state.
Wall: This task, introduced by DINO-WM (zhou2025dinowmworldmodelspretrained), features a 2D navigation environment with two rooms separated by a wall with a door. The agent’s task is to navigate from a randomized starting location in one room to a random goal state in the other room, passing through the door. We use the dataset of 1920 trajectories provided in DINO-WM, with a goal state chosen 25 steps from the starting state.
Granular: In this task, introduced by zhang2024adaptigraph, a simulated Xarm must push roughly one hundred small particles into the goal configuration. We use the dataset of 1000 trajectories of 20 steps each provided in DINO-WM.
We reproduce the dataset statistics used to train the base world model for each environment from zhou2025dinowmworldmodelspretrained. We use the same datasets for our alternative world model architecture ablation in Section B.3.
We detail the cross-entropy method used in our planning experiments in Algorithm 4.
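For illustration, the loop of Algorithm 4 can be sketched in runnable form. The diagonal covariance (in place of the full covariance $\Sigma_i$), the equal action and latent dimensionality, and the function names are simplifying assumptions of ours, not the exact setup used in our experiments.

```python
import numpy as np

def cem_plan(z1, z_goal, f, H, N=64, K=8, iters=30, seed=0):
    """Cross-entropy method planning sketch (cf. Algorithm 4): sample N
    action sequences, roll each out through the world model f, keep the
    K lowest-cost elites, and refit the sampling distribution to them."""
    rng = np.random.default_rng(seed)
    d = z1.shape[0]
    mu = np.zeros((H, d))                  # mean of the action distribution
    sigma = np.ones((H, d))                # per-dimension std (diagonal cov.)
    for _ in range(iters):
        A = mu + sigma * rng.standard_normal((N, H, d))   # candidate sequences
        costs = np.empty(N)
        for j in range(N):
            z = z1
            for t in range(H):
                z = f(z, A[j, t])          # latent rollout
            costs[j] = float(np.sum((z - z_goal) ** 2))
        elites = A[np.argsort(costs)[:K]]  # top-K elites
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6  # avoid premature collapse
    return mu
```

For example, with toy linear dynamics $f(z,a)=z+0.5a$, `cem_plan(np.zeros(2), np.ones(2), f, H=3)` returns a mean action sequence whose rollout lands near the goal.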
In Table 3, we list all shared hyperparameters used in training and planning.
We provide data quantity and synthetic data parameters for our Online and Adversarial World Modeling training setups in Table 5 and Table 4, respectively. In addition to maintaining perturbation radii for the visual latent and action embeddings, we use a distinct radius for the proprioceptive embeddings. We empirically find that the visual and proprioceptive embeddings differ in scale and are semantically distinct, necessitating independent perturbation radii. Throughout all of our experiments, we set the perturbation radii of the action embedding and proprioceptive embedding identically for simplicity.
To facilitate progress towards the goal in gradient-based planning, we introduce an alternate loss function: Weighted Goal Loss (WGL). Instead of the standard goal loss, which only minimizes the $\ell_{2}$-distance between the final latent state produced by the planned actions and the goal latent state, WGL encourages intermediate latent states to also be close to the goal latent state. Formally,

$$ \mathcal{L}_{\text{WGL}}=\sum_{i=2}^{H+1}w_{i}\lVert\hat{z}_{i}-z_{\text{goal}}\rVert_{2}^{2}, $$

where the sequence of normalized weights $\{w_{i}\}_{i=2}^{H+1}$ is a hyperparameter choice. Empirically, we find that using this objective for gradient-based planning either maintains or improves planning performance. For PointMaze and Wall, we found that exponentially upweighting later states in the planning horizon improved planning performance, so we set $w_{i}=2^{i}$. For PushT, we found that exponentially upweighting earlier states improved planning performance, so we set $w_{i}=\left(1/2\right)^{i}$. We leave the optimal selection of this sequence of weights as future work.
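A minimal NumPy sketch of the objective (the function name and array shapes are our own, for illustration):

```python
import numpy as np

def weighted_goal_loss(z_hat, z_goal, weights):
    """Weighted Goal Loss (WGL): a weighted sum of squared l2 distances
    between each predicted latent state and the goal latent state.

    z_hat   : (H, d) array of predicted latents z_hat_2 ... z_hat_{H+1}
    z_goal  : (d,) goal latent
    weights : (H,) weight sequence w_i; normalized internally
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                 # normalize the weights
    dists = np.sum((np.asarray(z_hat) - z_goal) ** 2, axis=1)  # per-step l2^2
    return float(w @ dists)
```

Placing all weight on the final state recovers the standard goal loss, while an exponential schedule such as `weights=[2**i for i in range(H)]` upweights later states as we do for PointMaze and Wall.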
Motivated by the hypothesis that the optimization landscape is rugged (see Figure 2 for some evidence of this), we train an initialization network $g_{\theta}:\mathcal{Z}\times\mathcal{Z}\to\mathcal{A}^{H}$, $g_{\theta}(z_{1},z_{g})=\{\hat{a}_{t}\}_{t=1}^{H}$, to initialize a sequence of actions for gradient-based planning.
We provide details on training the initialization network $g_{\theta}$ in Algorithm 5. We train $g_{\theta}$ for a single epoch over the trajectories in the task’s training dataset.
We show results of including the initialization network in GBP for each task in Table 6. Comparing to Table 1, we see that for both GD and Adam, the initialization network performs comparably to a random initialization only in the PushT environment.
We evaluate Adversarial World Modeling on two robotic manipulation tasks: Rope and Granular. Planning results for both tasks can be found in Table 7. To measure the accuracy of planned actions, we evaluate the Chamfer distance between the goal set of keypoints and the predicted set of keypoints.
We ablate the use of the DINO-WM architecture by evaluating planning performance with the IRIS (micheli2023transformerssampleefficientworldmodels) architecture. Specifically, IRIS uses a VQ-VAE (oord2018neuraldiscreterepresentationlearning) for both the encoder and decoder, and a standard decoder-only Transformer (NIPS2017_3f5ee243). We find that even with a learned encoder, Adversarial World Modeling improves GBP performance and even CEM performance. Planning success rates of the IRIS architecture for the Wall task are reported in Table 8.
We evaluate GBP over a longer horizon in Table 9(a). We use Adam in the MPC setting for each of these runs, setting a goal state 50 timesteps into the future drawn from an expert trajectory, a planning horizon of 50 steps, and 20 MPC iterations where we take a single action at each iteration. The dataset of held-out validation trajectories for the Wall environment does not contain expert trajectories of 50 timesteps, so we omit it from our evaluations. In comparison, our results in Table 1 use a goal state drawn 25 timesteps in the future and a planning horizon of 25 steps. We find that on the longer horizon, Adversarial World Modeling outperforms DINO-WM on PushT and both Adversarial and Online World Modeling outperform DINO-WM on PointMaze.
Additionally, we evaluate both the MPPI (torch_mppi) and GradCEM (gradcem) algorithms under MPC on the PushT task in Table 9(b). MPPI is an online, receding-horizon controller that samples and evaluates perturbed action sequences, executes the first action of the lowest-cost trajectory, and then replans from the updated state at each timestep.
GradCEM refines the candidate sequences used to update the estimated action distribution with gradient descent, providing a more accurate estimate of the true distribution’s parameters. We see that Adversarial World Modeling outperforms DINO-WM with GradCEM. Additionally, GradCEM exhibits slightly lower performance than vanilla CEM. We hypothesize this is because the memory requirements of gradient descent necessitate reducing the number of candidate sequences by a factor of 6 compared to vanilla CEM, leading to a less accurate estimate of the true action distribution.
For MPPI, we use 5 samples per MPC iteration, with 100 MPC steps. For GradCEM, we use 50 samples, 30 CEM steps, and 2 Adam steps per CEM step with an LR of 0.3, and we take 10 MPC steps.
We present additional results for the difference in World Model Error between training and planning for the PointMaze and Wall tasks in Figure 6. For both tasks, our methods have lower error during planning compared to training except for Online World Modeling on PointMaze, which is inconclusive due to the low magnitude of world model error. Planning actions are obtained after 300 steps of GBP with GD on 50 rollouts using the initial and goal state from a training trajectory.
For PointMaze and Wall, we compare the planning efficiency of DINO-WM and our two approaches across planning methodologies in Figures 7 and 8 respectively. All planning is performed with MPC.
To understand the additional cost of using the environment simulator in Online World Modeling, we record the wall-clock time of rolling out 25 steps with the DINO-WM architecture and each environment simulator in Table 10. We see that in all environments, the simulator takes longer to roll out than the world model. We also note that the simulator for all 3 tasks is deterministic in terms of reproducing the training trajectories from their actions.
We visualize the loss landscape of the DINO world model both before and after applying our Adversarial World Modeling objective. We perform a grid search over the subspace spanned by
$\hat{a}_{\text{GBP-Pretrained}}$: gradient-based planning on the original DINO world model with 300 optimization steps of Adam with LR = 1e-3. We set a fixed initialization $a_{\text{init}}$.
$a_{\text{GT}}$: the ground-truth actions from the expert demonstrator.
We define the axes as $\alpha=\hat{a}_{\text{GBP-Pretrained}}-a_{\text{GT}}$ and $\beta=\hat{a}_{\text{GBP-Adversarial}}-a_{\text{GT}}$, and compute the loss surface over a $50\times 50$ grid spanning $\alpha,\beta\in[-1.25,1.25]$.
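Concretely, the grid evaluation can be sketched as follows, treating $\alpha$ and $\beta$ as scalar coefficients along the two difference directions (function and argument names are illustrative, not from our codebase):

```python
import numpy as np

def loss_surface(loss_fn, center, dir_a, dir_b, extent=1.25, n=50):
    """Evaluate loss_fn on an n x n grid of action sequences
    center + alpha * dir_a + beta * dir_b, alpha, beta in [-extent, extent]."""
    coords = np.linspace(-extent, extent, n)
    surface = np.empty((n, n))
    for i, alpha in enumerate(coords):
        for j, beta in enumerate(coords):
            surface[i, j] = loss_fn(center + alpha * dir_a + beta * dir_b)
    return coords, surface
```

Here `center` would be $a_{\text{GT}}$ and `dir_a`, `dir_b` the two difference directions; `surface` can then be rendered as the contour plots in Figure 2.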
Projected Gradient Descent (PGD) has been used as an iterative method for generating adversarial perturbations (madry2019deeplearningmodelsresistant). At each step, PGD takes a gradient ascent step and projects the result onto the set of allowed perturbations (a ball of radius $\epsilon$ around the input). The projection ($\Pi$) is typically implemented via clipping or scaling. Formally,
However, PGD is computationally expensive for adversarial training, as it requires an additional backward pass per iteration. Using a single step, replacing the gradient by its sign, and setting the step size $\alpha=\epsilon$ recovers the well-known Fast Gradient Sign Method (FGSM) update (goodfellow2014explaining).
In fastbetterthanfree, the authors demonstrate that initializing $\delta$ uniformly in the $\ell_{\infty}$-ball of radius $\epsilon$ and performing FGSM adversarial training on these perturbations substantially improves robustness to PGD attacks and matches the performance of PGD-based training. We leverage this observation to perform cheap adversarial training that requires only $2\times$ the backward passes of standard supervised learning. In comparison, $K$-step PGD requires $K$ additional backward passes ($3\times$ for $K=2$ and $4\times$ for $K=3$). In Table 11, we show that 2- and 3-step PGD do not consistently outperform FGSM, despite requiring a much larger training budget.
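The random-init FGSM step can be sketched as a generic helper; the function name and the `grad_fn` callback (which returns the loss gradient with respect to the perturbation) are our own abstractions:

```python
import numpy as np

def fgsm_perturb(grad_fn, shape, eps, alpha=None, rng=None):
    """One FGSM step with random initialization (fastbetterthanfree):
    draw delta uniformly from the l_inf ball of radius eps, take a single
    signed gradient-ascent step of size alpha (default 1.25 * eps), and
    clip the result back into the eps-ball."""
    rng = np.random.default_rng(0) if rng is None else rng
    step = 1.25 * eps if alpha is None else alpha
    delta = rng.uniform(-eps, eps, size=shape)       # random init in the ball
    delta = delta + step * np.sign(grad_fn(delta))   # ascend the loss
    return np.clip(delta, -eps, eps)                 # project onto the ball
```

In Adversarial World Modeling (Algorithm 3), this perturbation is applied jointly to latent states and actions, with the separate radii $\epsilon_z$, $\epsilon_a$ described above.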
To assess the robustness of Adversarial World Modeling to the scaling factor and perturbation radius hyperparameters, we conduct an ablation study varying these two factors, shown in Figure 9. We evaluate $\lambda_{a},\lambda_{z}\in\{0.0,0.02,0.05,0.20,0.50,1.0\}^{2}$ and either fix $\epsilon_{a},\epsilon_{p},\epsilon_{z}$ to the standard deviation of the first minibatch (“Fixed”) or recompute them for every minibatch (“Adaptive”). We observe no consistent improvement or degradation across any value of $\lambda_{a}$, for $0\leq\lambda_{z}\leq 0.5$, or between the “Fixed” and “Adaptive” perturbation radii. We note that setting the visual scaling factor $\lambda_{z}$ too high (e.g., $0.5$, $1.0$) can significantly degrade performance. We hypothesize that excessively large perturbations distort the semantic content of the visual latent state, pushing it outside the range of semantically equivalent representations.
Table: S3.T1: Planning Results. We evaluate the planning performance of our finetuned world models against DINO-WM (zhou2025dinowmworldmodelspretrained) on 3 tasks in terms of success rate (%) using both open-loop and model predictive control (MPC) procedures. For each task, we perform gradient-based planning using both stochastic gradient descent (GD) and Adam (Kingma2014AdamAM), and search-based planning using the cross-entropy method (CEM).
| | PushT | PushT | PushT | PointMaze | PointMaze | PointMaze | Wall | Wall | Wall |
|---|---|---|---|---|---|---|---|---|---|
| | GD | Adam | CEM | GD | Adam | CEM | GD | Adam | CEM |
| DINO-WM | 38 | 54 | 78 | 12 | 24 | 90 | 2 | 10 | 74∗ |
| + MPC | 56 | 76 | 92 | 42 | 68 | 90 | 12 | 80 | 82 |
| Online WM | 34 | 52 | 90 | 20 | 14 | 62 | 16 | 18 | 54∗ |
| + MPC | 50 | 76 | 92 | 54 | 88 | 96 | 38 | 80 | 90 |
| Adversarial WM | 56 | 82 | 94 | 32 | 70 | 88 | 32 | 34 | 30∗ |
| + MPC | 66 | 92 | 92 | 50 | 94 | 98 | 14 | 94 | 94 |
Table: A1.T2: Trajectory datasets used to pretrain the base DINO-WM and IRIS world models.
| Environment | H | Frameskip | Dataset Size | Trajectory Length |
|---|---|---|---|---|
| Push-T | 3 | 5 | 18500 | 100-300 |
| PointMaze | 3 | 5 | 2000 | 100 |
| Wall | 1 | 5 | 1920 | 50 |
| Rope | 1 | 1 | 1000 | 5 |
| Granular | 1 | 1 | 1000 | 5 |
Table: A1.T3: (a) Finetuning Parameters
| Name | Value |
|---|---|
| Image size | 224 |
| Optimizer | AdamW |
| Predictor LR | 1e-5 |
Table: A1.T3.st2: (b) Open-Loop Planning
| Name | GD | Adam |
|---|---|---|
| Opt. steps | 300 | 300 |
| LR | 1.0 | 0.3 |
Table: A1.T4: Training parameters for Adversarial World Modeling as reported in Table 1.
| Environment | # Rollouts | Batch Size | GPU | Epochs | $\epsilon_{\text{visual}}$ | $\epsilon_{\text{proprio}}$ | $\epsilon_{\text{action}}$ |
|---|---|---|---|---|---|---|---|
| PushT | 20000 (all) | 16 | 8x B200 | 1 | 0.05 | 0.02 | 0.02 |
| PointMaze | 2000 (all) | 16 | 1x B200 | 1 | 0.20 | 0.08 | 0.08 |
| Wall | 1920 (all) | 48 | 1x B200 | 2 | 0.20 | 0.08 | 0.08 |
Table: A2.T6: For both gradient descent (GD) and Adam (Ad), we evaluate initializing the actions for gradient-based planning (GBP) from the initialization network (IN) instead of a normal distribution.
| | PushT | PushT | PointMaze | PointMaze | Wall | Wall |
|---|---|---|---|---|---|---|
| | GD+IN | Ad+IN | GD+IN | Ad+IN | GD+IN | Ad+IN |
| DINO-WM | 44 | 62 | 16 | 14 | 4 | 12 |
| + MPC | 60 | 84 | 40 | 54 | 6 | 32 |
| Online WM | 56 | 66 | 8 | 28 | 10 | 18 |
| + MPC | 52 | 82 | 40 | 46 | 2 | 22 |
| Adversarial WM | 74 | 90 | 22 | 36 | 18 | 24 |
| + MPC | 74 | 90 | 44 | 56 | 24 | 48 |
Table: A2.T7: Planning performance measured with Chamfer Distance (less is better) on two robotic manipulation tasks: Rope and Granular.
| | Rope | Rope | Granular | Granular |
|---|---|---|---|---|
| | GD | CEM | GD | CEM |
| DINO-WM | 1.73 | 0.93 | 0.30 | 0.22 |
| Adversarial WM | 0.93 | 0.82 | 0.24 | 0.28 |
Table: A2.T8: Planning results in terms of success rate using the IRIS (micheli2023transformerssampleefficientworldmodels) architecture on the Wall Task.
| | GD | CEM |
|---|---|---|
| IRIS | 0 | 4 |
| IRIS + Online WM | 0 | 0 |
| IRIS + Adversarial WM | 8 | 6 |
Table: A2.T9: (a) Long-Horizon GBP
| | PushT | PointMaze |
|---|---|---|
| DINO-WM | 16 | 70 |
| Online WM | 16 | 96 |
| Adversarial WM | 26 | 88 |
Table: A2.T9.st2: (b) MPPI and GradCEM on PushT
| | MPPI | GradCEM |
|---|---|---|
| DINO-WM | 2 | 78 |
| Online WM | 2 | 74 |
| Adversarial WM | 2 | 84 |
Table: A2.T10: Wall clock time (in seconds) of rolling out 25 steps with each environment simulator compared to the DINO-WM architecture.
| | PushT | PointMaze | Wall |
|---|---|---|---|
| Simulator | 0.959 | 0.717 | 4.465 |
| DINO-WM | 0.029 | 0.029 | 0.029 |
Table: A4.T11: Both Open-Loop and MPC (Closed-Loop) use the Adam optimizer with the same parameters as the main experiments.
| | Backward Passes | PointMaze | PointMaze | PointMaze | Wall | Wall | Wall |
|---|---|---|---|---|---|---|---|
| | | Min/Epoch | Open-Loop | MPC | Min/Epoch | Open-Loop | MPC |
| FGSM | 2 | 120 | 70 | 94 | 14 | 34 | 94 |
| 2-Step PGD | 3 | 165 | 80 | 96 | 20 | 8 | 90 |
| 3-Step PGD | 4 | 201 | 78 | 94 | 24 | 14 | 94 |
An overview of our two proposed methods. When planning with a world model, actions may result in trajectories that lie outside the distribution of expert trajectories on which the world model was trained, leading to inaccurate world modeling. Online World Modeling finetunes a pretrained world model by using the simulator to correct trajectories produced via gradient-based planning, leading to accurate world modeling beyond the expert trajectory distribution. Adversarial World Modeling finetunes a world model on perturbations of actions and expert trajectories, promoting robustness and smoothing the world model’s input gradients.
Optimization landscape of DINO-WM (zhou2025dinowmworldmodelspretrained) before and after finetuning with our Adversarial World Modeling objective on the Push-T task. Adversarial World Modeling yields a smoother landscape with a broader basin around the optimum. Visualization details in Appendix C.
Difference in World Model Error between expert and planning trajectories on PushT.
PushT, PointMaze, Wall, Rope, Granular
Planning efficiency of DINO-WM, Online WM, and Adversarial WM using both GBP methods and CEM on the PointMaze task.
Success rate of closed-loop MPC planning using Adam on an Adversarial World Model trained with scaling factors $\lambda_{a},\lambda_{z}$ and perturbation radii $\epsilon_{a},\epsilon_{z}$ on the Wall environment. We find that $0\leq\lambda_{z},\lambda_{a}\leq 0.2$ are stable for either “Fixed” or “Adaptive” perturbation radii.
(a) We see that DINO-WM is more likely to enter states outside of the training distribution, so the decoder is not able to reconstruct the state accurately. This is not the case with Online World Modeling, but it still fails to successfully reach the goal state. Adversarial World Modeling successfully completes the task.
(b) Again we notice the failure for DINO-WM’s decoder to reconstruct states it encounters during planning, while this is not the case with Online World Modeling and Adversarial World Modeling, which both complete the task successfully.
$$ s_{t+1}=h(s_{t},a_{t}),\quad\text{ for all $t$}, $$ \tag{S2.E1}
$$ \min_{\theta}\mathbb{E}_{(o_{t},a_{t},o_{t+1})\sim\mathcal{T}}\lVert f_{\theta}(\Phi_{\mu}(o_{t}),a_{t})-\Phi_{\mu}(o_{t+1})\rVert_{2}^{2}. $$ \tag{S2.E3}
$$ \{\hat{a}_{t}^{*}\}_{t=1}^{H}=\operatorname*{arg\,min}_{\{\hat{a}_{t}\}}\lVert\hat{z}_{H+1}-z_{\text{goal}}\rVert^{2}_{2} $$ \tag{S2.E4}
$$ \hat{z}_{2}=f_{\theta}(z_{1},\hat{a}_{1}),\quad\hat{z}_{t+1}=f_{\theta}(\hat{z}_{t},\hat{a}_{t})\quad\text{for}\quad t>1. $$ \tag{S2.E5}
$$ \tau^{\prime}=(z_{1},\hat{a}_{1},z_{2}^{\prime},\hat{a}_{2},\dots,z^{\prime}_{H+1}), $$ \tag{S2.E6}
$$ \delta^{(k+1)}=\Pi_{\lVert\delta\rVert_{\infty}\leq\epsilon}\left(\delta^{(k)}+\alpha\cdot\nabla_{x}\mathcal{L}(f_{\theta}(x+\delta^{(k)}),y)\right) $$ \tag{A4.E10}
Appendix
Training Details
Algorithm: algorithm
\caption{Gradient-Based Planning (GBP) via Gradient Descent}
\label{algo:gbp}
\KwIn{Start state $z_1$, goal state $z_\text{goal}$, world model $f_{\theta}$, horizon $H$, optimization iterations $N$}
\KwOut{Optimal action sequence $\{\hat{a}_t\}_{t=1}^H$}
\BlankLine
Initialize action prediction $\{\hat{a}_t\}_{t=1}^H \sim \mathcal{N}(0, I_H)$ \;
\For{$i = 1, \dots, N$}{
$\hat{z}_{H+1} \leftarrow \text{rollout}_{f}(z_1, \{\hat{a}_t\})_{H + 1}$ \;
$\mathcal{L}_{\text{goal}} \leftarrow \lVert\hat{z}_{H+1} - z_{\text{goal}}\rVert^2_2$ \;
$\{\hat{a}_t\} \leftarrow \{\hat{a}_t\} - \eta \cdot \nabla_{\{\hat{a}_t\}} \mathcal{L}_{\text{goal}}$ \;
}
\Return $\{\hat{a}_t\}_{t=1}^H$\;
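For concreteness, the loop above can be written in runnable form. We assume toy linear latent dynamics $f(z,a)=z+0.5a$ (not our learned world model), for which the goal-loss gradient with respect to each action is available in closed form, so no autograd framework is needed:

```python
import numpy as np

def gbp_plan(z1, z_goal, H, steps=100, lr=0.3, seed=0):
    """Gradient-based planning under toy linear dynamics f(z, a) = z + 0.5*a.
    The rollout gives z_{H+1} = z_1 + 0.5 * sum_t a_t, so
    d/da_t ||z_{H+1} - z_goal||^2 = (z_{H+1} - z_goal) for every t."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((H, z1.shape[0]))   # init a_t ~ N(0, I)
    for _ in range(steps):
        z_final = z1 + 0.5 * actions.sum(axis=0)      # rollout through f
        grad = np.tile(z_final - z_goal, (H, 1))      # closed-form gradient
        actions = actions - lr * grad                 # gradient descent step
    return actions
```

In our experiments the gradient is instead obtained by backpropagating through the learned world model $f_\theta$, and the learning rates follow Table 3.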
Algorithm: algorithm
\caption{Online World Modeling}
\label{algo:online_wm}
\KwIn{Pretrained world model $f_\theta$, simulator dynamics function $h$, encoder $\Phi_\mu$, dataset of trajectories $\mathcal{T}$, online iterations $N$, horizon $H$, planning optimization iterations $M$}
\KwOut{Updated world model $f_\theta$}
\BlankLine
Initialize new trajectory dataset $\mathcal{T}'$ \;
\For{$i = 1, \dots, N$}{
Sample trajectory $\tau_i = (z_1, a_1, z_2, a_2, \dots, a_H, z_{H + 1}) \sim \mathcal{T}$ \;
$\{\hat{a}_t \}_{t = 1}^H \gets \text{GBP}(z_{1}, z_{H + 1}, f_\theta, H, M)$\;
$\{s_t'\}_{t = 2}^{H + 1} \gets \text{rollout}_{h}(s_1, \{\hat{a}_t \}) $ \;
$\{z_t'\}_{t = 2}^{H + 1} \gets \{\Phi_\mu(s_t')\}_{t = 2}^{H + 1}$\;
$\tau_i' \gets (z_1, \hat{a}_1, z'_2, \hat{a}_2, \dots, \hat{a}_H, z'_{H+1})$ \;
$\mathcal{T}' \gets \mathcal{T}' \cup \tau'_i$ \;
Train $f_\theta$ on next-state prediction using $\mathcal{T}'$
}
\Return $f_\theta$\;
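A structural sketch of the data-synthesis portion of Algorithm 2; `plan_fn`, `sim_step`, and `encode` stand in for GBP, the simulator dynamics $h$, and the encoder $\Phi_\mu$, and the retraining step is omitted:

```python
import numpy as np

def online_wm_rollouts(plan_fn, sim_step, encode, tasks):
    """Online World Modeling data synthesis: plan with the current model,
    execute the planned actions in the simulator, and re-encode the
    simulator-corrected states into new training trajectories."""
    corrected = []
    for (s1, z1, z_goal) in tasks:
        actions = plan_fn(z1, z_goal)      # GBP under the current model
        s, latents = s1, [z1]
        for a in actions:                  # simulator-corrected rollout
            s = sim_step(s, a)
            latents.append(encode(s))      # z'_t = Phi_mu(s'_t)
        corrected.append((latents, actions))
    return corrected
```

Each returned pair corresponds to a corrected trajectory $\tau_i'$, which is added to $\mathcal{T}'$ for next-state-prediction training.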
Algorithm: algorithm
\caption{Adversarial World Modeling}
\label{algo:adv_wm}
\KwIn{Pretrained world model $f_\theta$, dataset of trajectories $\mathcal{T}$, action perturbation scaling $\lambda_a$, state perturbation scaling $\lambda_z$, horizon $H$, training iterations $N$, minibatch size $B$}
\KwOut{Updated world model $f_\theta$}
\BlankLine
Initialize new trajectory dataset $\mathcal{T}'$\;
\For{$i = 1, \dots, N$}{
Sample minibatch $\tau \gets
\{(z_1^j, a_1^j, z_2^j), (z_2^j, a_2^j, z_3^j), \dots, (z_H^j, a_H^j, z_{H+1}^j)\}_{j=1}^{B}
\sim \mathcal{T}$ \;
$(\epsilon_a, \epsilon_z) \gets
\Big(
\lambda_a\, \mathrm{mean}_j\big[\mathrm{std}(\{a_1^j, \ldots, a_H^j\})\big],\,
\lambda_z\, \mathrm{mean}_j\big[\mathrm{std}(\{z_1^j, \ldots, z_{H+1}^j\})\big]
\Big)$ \;
$(\alpha_a, \alpha_z) \gets (1.25\,\epsilon_a, 1.25\,\epsilon_z)$ \;
$\delta_a \sim \mathrm{Uniform}(-\epsilon_a, \epsilon_a)$ \;
$\delta_z \sim \mathrm{Uniform}(-\epsilon_z, \epsilon_z)$ \;
\For{$t = 1, \dots, H$}{
$\nabla_{\delta_a}, \nabla_{\delta_z} \gets
\nabla_{\delta_a, \delta_z}
\big\lVert f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \big\rVert_2^2$ \;
$\delta_a \gets
\mathrm{clip}(\delta_a + \alpha_a \,\mathrm{sign}(\nabla_{\delta_a}), -\epsilon_a, \epsilon_a)$ \;
$\delta_z \gets
\mathrm{clip}(\delta_z + \alpha_z \,\mathrm{sign}(\nabla_{\delta_z}), -\epsilon_z, \epsilon_z)$ \;
$a_t' \gets a_t + \delta_a$ \;
$z_t' \gets z_t + \delta_z$ \;
}
$\tau' \gets
\{(z_1^{\prime j}, a_1^{\prime j}, z_2^j), (z_2^{\prime j}, a_2^{\prime j}, z_3^j), \ldots, (z_H^{\prime j}, a_H^{\prime j}, z_{H+1}^j)\}_{j=1}^{B}$ \;
Train $f_\theta$ on next-state prediction using $\tau'$ \;
}
\Return $f_\theta$\;
Algorithm: algorithm
\caption{Cross-Entropy Method (CEM) Planning}
\label{algo:cem}
\KwIn{Current observation $o_1$, goal observation $o_g$, encoder $\Phi_\mu$, world model $f_\theta$, \\
\hspace{1.2cm} horizon $H$, population size $N$, top-K selection $K$, iterations $I$}
\KwOut{Action sequence $\{\hat{a}_t\}_{t=1}^{H}$}
\BlankLine
% --- Encode observations ---
$\hat{z}_1 \leftarrow \Phi_\mu(o_1)$ \;
$z_g \leftarrow \Phi_\mu(o_g)$ \;
% --- Initialize distribution ---
Initialize Gaussian distribution parameters: mean $\mu_0$, covariance $\Sigma_0$ \;
\For{$i = 1, \dots, I$}{
% --- Sample population ---
Sample $N$ action sequences $\{a^{(j)}_{1:H}\}_{j=1}^N \sim \mathcal{N}(\mu_{i-1}, \Sigma_{i-1})$ \;
\For{$j = 1, \dots, N$}{
$\hat{z}^{(j)}_1 \leftarrow \hat{z}_1$ \;
\For{$t = 2, \dots, H+1$}{
$\hat{z}^{(j)}_t \leftarrow f_\theta(\hat{z}^{(j)}_{t-1}, a^{(j)}_{t-1})$ \;
}
$C^{(j)} \leftarrow \lVert \hat{z}^{(j)}_{H+1} - z_g \rVert^2$ \;
}
% --- Select elites ---
Select $K$ sequences with lowest cost: $\mathcal{E} = \{a^{(j)}\}_{\text{top-}K}$ \;
% --- Update distribution ---
$\mu_i \leftarrow \frac{1}{K} \sum_{a \in \mathcal{E}} a$ \;
$\Sigma_i \leftarrow \frac{1}{K} \sum_{a \in \mathcal{E}} (a - \mu_i)(a - \mu_i)^\top$ \;
}
\Return $\mu_I$ as the final action sequence estimate $\{\hat{a}_t\}_{t=1}^{H}$ \;
Algorithm: algorithm
\caption{Initialization Network Training}
\label{algo:initnet}
\KwIn{Initialization network $g_\theta$, LR $\eta$, dataset of trajectories $\mathcal{T}$, iterations $N$, horizon $H$}
\KwOut{Trained initialization network $g_\theta$}
\BlankLine
\For{$i = 1, \dots, N$}{
Sample trajectory $\tau_i = (z_1, a_1, z_2, a_2, \dots, a_H, z_{H + 1}) \sim \mathcal{T}$ \;
$\{\hat{a}_t \}_{t = 1}^H \gets g_{\theta}(z_{1}, z_{H+1})$\;
$\mathcal{L}_{\text{actions}} \gets \sum_{t=1}^{H}{\lVert \hat{a}_t - a_t \rVert_2^2}$ \;
$\theta \gets \theta - \eta\nabla_{\theta}\mathcal{L}_{\text{actions}}$ \;
}
\Return $g_\theta$\;
Table: A1.T3.st3: (c) MPC Parameters
| Name | GD | Adam |
|---|---|---|
| MPC steps | 10 | 10 |
| Opt. steps | 100 | 100 |
| LR | 1 | 0.2 |
Table: A1.T5: Training parameters for Online World Modeling as reported in Table 1.
| Environment | # Rollouts | Batch Size | GPU | Epochs |
|---|---|---|---|---|
| PushT | 6000 | 32 | 4x B200 | 1 |
| PointMaze | 500 | 32 | 4x B200 | 1 |
| Wall | 1920 (all) | 80 | 4x B200 | 1 |
| PushT | PushT | PointMaze | PointMaze | Wall | Wall | |
|---|---|---|---|---|---|---|
| GD+IN | Ad+IN | GD+IN | Ad+IN | GD+IN | Ad+IN | |
| DINO-WM | 44 60 | 62 84 | 16 40 | 14 54 | 4 6 | 12 32 |
| + MPC | 56 | 8 | 28 46 | |||
| OnlineWM + MPC | 66 | 10 | 18 | |||
| 52 | 82 | 40 | 2 | 22 | ||
| AdversarialWM | 74 | 90 | 22 | 36 | 18 | 24 |
| + MPC | 74 | 90 | 44 | 56 | 24 | 48 |
| Rope | Rope | Granular | Granular | |
|---|---|---|---|---|
| GD | CEM | GD | CEM | |
| DINO-WM | 1.73 | 0.93 | 0.30 | 0.22 |
| AdversarialWM | 0.93 | 0.82 | 0.24 | 0.28 |
| GD | CEM | |
|---|---|---|
| IRIS | 0 | 4 |
| IRIS + OnlineWM | 0 | 0 |
| IRIS + AdversarialWM | 8 | 6 |
| PushT | PointMaze | |
|---|---|---|
| DINO-WM | 16 | 70 |
| OnlineWM | 16 | 96 |
| AdversarialWM | 26 | 88 |
| PushT | PointMaze | Wall | |
|---|---|---|---|
| Simulator | 0.959 | 0.717 | 4.465 |
| DINO-WM | 0.029 | 0.029 | 0.029 |
| | Backward Passes | PointMaze Min/Epoch | PointMaze Open-Loop | PointMaze MPC | Wall Min/Epoch | Wall Open-Loop | Wall MPC |
|---|---|---|---|---|---|---|---|
| FGSM | 2 | 120 | 70 | 94 | 14 | 34 | 94 |
| 2-Step PGD | 3 | 165 | 80 | 96 | 20 | 8 | 90 |
| 3-Step PGD | 4 | 201 | 78 | 94 | 24 | 14 | 94 |
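The backward-pass counts above follow from k-step PGD spending k gradient evaluations crafting the perturbation plus one more for the training update on the perturbed input, so FGSM (the one-step case) costs 2 and 3-step PGD costs 4. The sketch below makes that accounting explicit; the quadratic loss, step size, and budget are illustrative, not the paper's settings.

```python
import numpy as np

# Count "backward passes" (gradient evaluations) for k-step PGD on a toy
# quadratic loss 0.5 * ||W x - target||^2.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
x0, target = rng.normal(size=4), rng.normal(size=4)
passes = 0

def grad(x):
    global passes
    passes += 1                             # one backward pass per call
    return W.T @ (W @ x - target)           # d/dx 0.5 * ||W x - target||^2

def pgd(x, eps, steps):
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta += (eps / steps) * np.sign(grad(x + delta))
        delta = np.clip(delta, -eps, eps)   # project onto the L-inf ball
    return x + delta

x_adv = pgd(x0, eps=0.2, steps=3)           # 3 backward passes
passes += 1                                 # final pass: training gradient on x_adv
print(passes)                               # -> 4, matching the 3-Step PGD row
```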






References
[fastbetterthanfree] Wong, Eric, Rice, Leslie, Kolter, J Zico. (2020). Fast is better than free: Revisiting adversarial training. International Conference on Learning Representations.
[goodfellow2014explaining] Goodfellow, Ian J, Shlens, Jonathon, Szegedy, Christian. (2014). Explaining and harnessing adversarial examples. International Conference on Learning Representations.
[Finn2016UnsupervisedLFA] Chelsea Finn, I. Goodfellow, S. Levine. (2016). Unsupervised Learning for Physical Interaction through Video Prediction. ArXiv.
[Zhang2025StateAwarePOA] Zongyuan Zhang, Tian-dong Duan, Zheng Lin, Dong Huang, Zihan Fang, Zekai Sun, Ling Xiong, Hongbin Liang, Heming Cui, Yong Cui. (2025). State-Aware Perturbation Optimization for Robust Deep Reinforcement Learning. ArXiv.
[nagabandi2018neural] Nagabandi, Anusha, Kahn, Gregory, Fearing, Ronald S, Levine, Sergey. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. 2018 IEEE international conference on robotics and automation (ICRA).
[shen2025babnd] Ross, Stéphane, Gordon, Geoffrey, Bagnell, Drew. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. Proceedings of the fourteenth international conference on artificial intelligence and statistics.
[levine2020offline] Levine, Sergey, Kumar, Aviral, Tucker, George, Fu, Justin. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
[kostrikov2021offline] Kostrikov, Ilya, Nair, Ashvin, Levine, Sergey. (2021). Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169.
[park2023hiql] Park, Seohong, Ghosh, Dibya, Eysenbach, Benjamin, Levine, Sergey. (2023). Hiql: Offline goal-conditioned rl with latent states as actions. Advances in Neural Information Processing Systems.
[rubinstein2004cross] Rubinstein, Reuven Y, Kroese, Dirk P. (2004). The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning.
[williams2017model] Williams, Grady, Aldrich, Andrew, Theodorou, Evangelos A. (2017). Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics.
[deisenroth2011pilco] Deisenroth, Marc, Rasmussen, Carl E. (2011). PILCO: A model-based and data-efficient approach to policy search. Proceedings of the 28th International Conference on machine learning (ICML-11).
[lenz2015deepmpc] Lenz, Ian, Knepper, Ross A, Saxena, Ashutosh. (2015). DeepMPC: Learning deep latent features for model predictive control.. Robotics: Science and Systems.
[Karl2016DeepVB] Maximilian Karl, Maximilian Sölch, Justin Bayer, Patrick van der Smagt. (2016). Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data. ArXiv.
[Sharma2019DynamicsAwareUD] Archit Sharma, Shixiang Shane Gu, Sergey Levine, Vikash Kumar, Karol Hausman. (2019). Dynamics-Aware Unsupervised Discovery of Skills. ArXiv.
[Kaiser2019ModelBasedRL] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H. Campbell, K. Czechowski, D. Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, G. Tucker, Henryk Michalewski. (2019). Model-Based Reinforcement Learning for Atari. ArXiv.
[Shi2022RoboCraftLT] Haochen Shi, Huazhe Xu, Zhiao Huang, Yunzhu Li, Jiajun Wu. (2022). RoboCraft: Learning to See, Simulate, and Shape Elasto-Plastic Objects with Graph Networks. ArXiv.
[Finn2016DeepVF] Chelsea Finn, Sergey Levine. (2016). Deep visual foresight for planning robot motion. 2017 IEEE International Conference on Robotics and Automation (ICRA).
[lecun2022path] LeCun, Yann. (2022). A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review.
[hafner2019learning] Hafner, Danijar, Lillicrap, Timothy, Fischer, Ian, Villegas, Ruben, Ha, David, Lee, Honglak, Davidson, James. (2019). Learning latent dynamics for planning from pixels. International conference on machine learning.
[henaff2017model] Henaff, Mikael, Whitney, William F, LeCun, Yann. (2017). Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177.
[agrawal2016learning] Agrawal, Pulkit, Nair, Ashvin V, Abbeel, Pieter, Malik, Jitendra, Levine, Sergey. (2016). Learning to poke by poking: Experiential learning of intuitive physics. Advances in neural information processing systems.
[zhang2019solar] Zhang, Marvin, Vikram, Sharad, Smith, Laura, Abbeel, Pieter, Johnson, Matthew, Levine, Sergey. (2019). Solar: Deep structured representations for model-based reinforcement learning. International conference on machine learning.
[florence2022implicit] Florence, Pete, Lynch, Corey, Zeng, Andy, Ramirez, Oscar A, Wahid, Ayzaan, Downs, Laura, Wong, Adrian, Lee, Johnny, Mordatch, Igor, Tompson, Jonathan. (2022). Implicit behavioral cloning. Conference on robot learning.
[shen2024bab] Shen, Keyi, Yu, Jiangwei, Barreiros, Jose, Zhang, Huan, Li, Yunzhu. (2024). Bab-nd: Long-horizon motion planning with branch-and-bound and neural dynamics. arXiv preprint arXiv:2412.09584.
[Schmidt2009DistillingFNA] Michael D. Schmidt, Hod Lipson. (2009). Distilling Free-Form Natural Laws from Experimental Data. Science.
[goldstein1950classical] Goldstein, Herbert, Poole, Charles P, Safko, John. (1950). Classical mechanics.
[macchelli2009port] Macchelli, Alessandro, Melchiorri, Claudio, Stramigioli, Stefano. (2009). Port-based modeling and simulation of mechanical systems with rigid and flexible links. IEEE transactions on robotics.
[Nagabandi2019DeepDMA] Anusha Nagabandi, K. Konolige, S. Levine, Vikash Kumar. (2019). Deep Dynamics Models for Learning Dexterous Manipulation. Conference on Robot Learning.
[Bharadhwaj2020ModelPredictiveCVA] Homanga Bharadhwaj, Kevin Xie, F. Shkurti. (2020). Model-Predictive Control via Cross-Entropy and Gradient-Based Optimization. ArXiv.
[Williams2017InformationTMA] Grady Williams, Nolan Wagener, Brian Goldfain, P. Drews, James M. Rehg, Byron Boots, Evangelos A. Theodorou. (2017). Information theoretic MPC for model-based reinforcement learning. 2017 IEEE International Conference on Robotics and Automation (ICRA).
[Zhan2021ModelBasedOPA] Xianyuan Zhan, Xiangyu Zhu, Haoran Xu. (2021). Model-Based Offline Planning with Trajectory Pruning. International Joint Conference on Artificial Intelligence.
[karypidis2024dino] Karypidis, Efstathios, Kakogeorgiou, Ioannis, Gidaris, Spyros, Komodakis, Nikos. (2024). DINO-Foresight: Looking into the Future with DINO. arXiv preprint arXiv:2412.11673.
[Wang2023SoftZooASA] Tsun-Hsuan Wang, Pingchuan Ma, A. Spielberg, Zhou Xian, Hao Zhang, J. Tenenbaum, D. Rus, Chuang Gan. (2023). SoftZoo: A Soft Robot Co-design Benchmark For Locomotion In Diverse Environments. ArXiv.
[Xu2022AcceleratedPLA] Jie Xu, Viktor Makoviychuk, Yashraj S. Narang, Fabio Ramos, W. Matusik, Animesh Garg, M. Macklin. (2022). Accelerated Policy Learning with Parallel Differentiable Simulation. ArXiv.
[Chen2022BenchmarkingDOA] Siwei Chen, Yiqing Xu, Cunjun Yu, Linfeng Li, Xiao Ma, Zhongwen Xu, David Hsu. (2022). Benchmarking Deformable Object Manipulation with Differentiable Physics. ArXiv.
[Hu2022ModelBasedILA] Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zak Murez, Corina Gurau, Hudson Yeo, Alex Kendall, R. Cipolla, J. Shotton. (2022). Model-Based Imitation Learning for Urban Driving. ArXiv.
[Akan2022StretchBEVSFA] Adil Kaan Akan, Fatma Güney. (2022). StretchBEV: Stretching Future Instance Prediction Spatially and Temporally. European Conference on Computer Vision.
[Edwards2018ImitatingLPA] Ashley D. Edwards, Himanshu Sahni, Yannick Schroecker, C. Isbell. (2018). Imitating Latent Policies from Observation. ArXiv.
[Zhang2021DeformableLOA] Wenbo Zhang, Karl Schmeckpeper, P. Chaudhari, Kostas Daniilidis. (2021). Deformable Linear Object Prediction Using Locally Linear Latent Dynamics. 2021 IEEE International Conference on Robotics and Automation (ICRA).
[bounou2021online] Bounou, Oumayma, Ponce, Jean, Carpentier, Justin. (2021). Online learning and control of complex dynamical systems from sensory input. Advances in Neural Information Processing Systems.
[madry2019deeplearningmodelsresistant] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks.
[zhou2025dinowmworldmodelspretrained] Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto. (2025). DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning.
[oquab2024dinov2learningrobustvisual] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. (2024). DINOv2: Learning Robust Visual Features without Supervision.
[dosovitskiy2021imageworth16x16words] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
[oord2018neuraldiscreterepresentationlearning] Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu. (2018). Neural Discrete Representation Learning.
[mohanan2018survey] Mohanan, MG, Salgoankar, Ambuja. (2018). A survey of robotic motion planning in dynamic environments. Robotics and Autonomous Systems.
[kavraki2002probabilistic] Kavraki, Lydia E, Svestka, Petr, Latombe, J-C, Overmars, Mark H. (2002). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE transactions on Robotics and Automation.
[spong2020robot] Spong, Mark W, Hutchinson, Seth, Vidyasagar, M. (2020). Robot modeling and control. John Wiley &.
[siciliano2009robotics] Siciliano, Bruno, Sciavicco, Lorenzo, Villani, Luigi, Oriolo, Giuseppe. (2009). Robotics: modelling, planning and control.
[ha2018world] Ha, David, Schmidhuber, Jürgen. (2018). World models. arXiv preprint arXiv:1803.10122.
[sutton1998reinforcement] Sutton, Richard S, Barto, Andrew G, others. (1998). Reinforcement learning: An introduction.
[schrittwieser2020mastering] Schrittwieser, Julian, Antonoglou, Ioannis, Hubert, Thomas, Simonyan, Karen, Sifre, Laurent, Schmitt, Simon, Guez, Arthur, Lockhart, Edward, Hassabis, Demis, Graepel, Thore, others. (2020). Mastering atari, go, chess and shogi by planning with a learned model. Nature.
[sutton1991dyna] Sutton, Richard S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin.
[hafner2023mastering] Hafner, Danijar, Pasukonis, Jurgis, Ba, Jimmy, Lillicrap, Timothy. (2023). Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104.
[assran2023self] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[Ajay2018AugmentingPSA] Anurag Ajay, Jiajun Wu, Nima Fazeli, Maria Bauzá. (2018). Augmenting Physical Simulators with Stochastic Neural Networks: Case Study of Planar Pushing and Bouncing. IEEE/RJS International Conference on Intelligent RObots and Systems.
[Jackson2024PolicyGuidedDA] Matthew Jackson, Michael Matthews, Cong Lu, Benjamin Ellis, Shimon Whiteson, J. Foerster. (2024). Policy-Guided Diffusion. ArXiv.
[Schiewer2024ExploringTLA] Robin Schiewer, Anand Subramoney, Laurenz Wiskott. (2024). Exploring the limits of hierarchical world models in reinforcement learning. Scientific Reports.
[Talvitie2014ModelRFA] Erik Talvitie. (2014). Model Regularization for Stable Sample Rollouts. Conference on Uncertainty in Artificial Intelligence.
[Zhu2023DiffusionMFA] Zhengbang Zhu, Hanye Zhao, Haoran He, Yichao Zhong, Shenyu Zhang, Yong Yu, Weinan Zhang. (2023). Diffusion Models for Reinforcement Learning: A Survey. ArXiv.
[Ke2019LearningDMA] Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati, Anirudh Goyal, Yoshua Bengio, Devi Parikh, Dhruv Batra. (2019). Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future. ArXiv.
[Bardes2024RevisitingFPA] Adrien Bardes, Q. Garrido, Jean Ponce, Xinlei Chen, Michael G. Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. ArXiv.
[Drozdov2024VideoRLA] Katrina Drozdov, Ravid Shwartz-Ziv, Yann LeCun. (2024). Video Representation Learning with Joint-Embedding Predictive Architectures. ArXiv.
[Guan2024WorldMFA] Yanchen Guan, Haicheng Liao, Zhenning Li, Guohui Zhang, Chengzhong Xu. (2024). World Models for Autonomous Driving: An Initial Survey. ArXiv.
[hafner2019dream] Hafner, Danijar, Lillicrap, Timothy, Ba, Jimmy, Norouzi, Mohammad. (2019). Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603.
[hansen2023td] Hansen, Nicklas, Su, Hao, Wang, Xiaolong. (2023). Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828.
[mayne1966second] Mayne, David. (1966). A second-order gradient method for determining optimal trajectories of non-linear discrete-time systems. International Journal of Control.
[li2004iterative] Li, Weiwei, Todorov, Emanuel. (2004). Iterative linear quadratic regulator design for nonlinear biological movement systems. First International Conference on Informatics in Control, Automation and Robotics.
[sv2023gradient] SV, Jyothir, Jalagam, Siddhartha, LeCun, Yann, Sobal, Vlad. (2023). Gradient-based Planning with World Models. arXiv preprint arXiv:2312.17227.
[szegedy2013intriguing] Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, Fergus, Rob. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
[mejia2019robust] Mejia, Felipe A, Gamble, Paul, Hampel-Arias, Zigfried, Lomnitz, Michael, Lopatina, Nina, Tindall, Lucas, Barrios, Maria Alejandra. (2019). Robust or private? adversarial training makes models more vulnerable to privacy attacks. arXiv preprint arXiv:1906.06449.
[Kingma2014AdamAM] Diederik P. Kingma, Jimmy Ba. (2014). Adam: A Method for Stochastic Optimization. CoRR.
[pointmaze] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, Sergey Levine. (2021). D4RL: Datasets for Deep Data-Driven Reinforcement Learning.
[pusht] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, Shuran Song. (2024). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.
[micheli2023transformerssampleefficientworldmodels] Vincent Micheli, Eloi Alonso, François Fleuret. (2023). Transformers are Sample-Efficient World Models.
[NIPS2017_3f5ee243] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is All you Need. Advances in Neural Information Processing Systems.
[torch_mppi] Williams, Grady, Wagener, Nolan, Goldfain, Brian, Drews, Paul, Rehg, James M, Boots, Byron, Theodorou, Evangelos A. (2017). Information theoretic MPC for model-based reinforcement learning. 2017 IEEE international conference on robotics and automation (ICRA).
[gradcem] Bharadhwaj, Homanga, Xie, Kevin, Shkurti, Florian. (2020). Model-predictive control via cross-entropy and gradient-based optimization. Learning for Dynamics and Control.
[zhang2024adaptigraph] Lambert, Nathan, Amos, Brandon, Yadan, Omry, Calandra, Roberto. (2020). Objective mismatch in model-based reinforcement learning. arXiv preprint arXiv:2002.04523.