Closing the Train-Test Gap in Gradient-Based Planning
Arjun Parthasarathy, Nimit Kalra, Rohun Agrawal, Yann LeCun, Oumayma Bounou, Pavel Izmailov, Micah Goldblum
Abstract
World models paired with model predictive control (MPC) can be trained offline on large-scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively solving optimization problems exactly, gradient-based planning offers a computationally efficient alternative. However, the performance of gradient-based planning has thus far lagged behind that of other approaches. In this paper, we propose improved methods for training world models that enable efficient gradient-based planning. We begin with the observation that although a world model is trained on a next-state prediction objective, it is used at test-time to instead estimate a sequence of actions. The goal of our work is to close this train-test gap. To that end, we propose train-time data synthesis techniques that enable significantly improved gradient-based planning with existing world models. At test time, our approach outperforms or matches the classical gradient-free cross-entropy method (CEM) across a variety of object manipulation and navigation tasks in 10% of the time budget.
Introduction
In robotic tasks, anticipating how the actions of an agent affect the state of its environment is fundamental for both prediction (Finn et al., 2016) and planning (Mohanan & Salgoankar, 2018; Kavraki et al., 2002). Classical approaches derive models of the environment evolution analytically from first principles, relying on prior knowledge of the environment, the agent, and any uncertainty (Goldstein et al., 1950; Siciliano et al., 2009; Spong et al., 2020). In contrast, learning-based methods infer such models directly from data, enabling them to capture complex dynamics and thus improve generalization and robustness to uncertainty (Sutton et al., 1998; Schrittwieser et al., 2020; LeCun, 2022).
World models (Ha & Schmidhuber, 2018), in particular, have emerged as a powerful paradigm. Given the current state and an action, the world model predicts the resulting next state. These models can be learned either from exact state information (Sutton, 1991) or from high-dimensional sensory inputs such as images (Hafner et al., 2023). The latter setup is especially compelling as it enables perception, prediction, and control directly from raw images by leveraging pre-trained visual representations, and removes the need for measuring the precise environment states which is difficult in practice (Assran et al., 2023; Bardes et al., 2024). Recently, world models and their predictive capabilities have been leveraged for planning, enabling agents to solve a variety of tasks (Hafner et al., 2019a;b; Schrittwieser et al., 2020; Hafner et al., 2023; Zhou et al., 2025). A model of the dynamics is learned offline, while the planning task is defined at inference as a constrained optimization problem: given the current state, find a sequence of actions that results in a state as close as possible to the target state. This inference-time optimization provides an effective alternative to reinforcement learning approaches (Sutton et al., 1998) that often suffer from poor sample-efficiency.
Correspondence to: Nimit Kalra (nimit@utexas.edu) and Rohun Agrawal (rohun.agrawal@columbia.edu).


Figure 1: An overview of our two proposed methods. When planning with a world model, actions may result in trajectories that lie outside the distribution of expert trajectories on which the world model was trained, leading to inaccurate world modeling. Online World Modeling finetunes a pretrained world model by using the simulator to correct trajectories produced via gradient-based planning, leading to accurate world modeling beyond the expert trajectory distribution. Adversarial World Modeling finetunes a world model on perturbations of actions and expert trajectories, promoting robustness and smoothing the world model's input gradients.
World models are compatible with many model-based planning algorithms. Traditional methods such as DDP (Mayne, 1966) and iLQR (Li & Todorov, 2004) rely on iteratively solving exact optimization problems derived from linear and quadratic approximations of the dynamics around a nominal trajectory. While highly effective in low-dimensional settings, these methods become impractical for large-scale world models, where solving the resulting optimization problem is computationally intractable. As an alternative, search-based methods such as the Cross Entropy Method (CEM) (Rubinstein & Kroese, 2004) and Model Predictive Path Integral control (MPPI) (Williams et al., 2017a) have been widely adopted as gradient-free alternatives and have proven effective in practice. However, they are computationally intensive as they require iteratively sampling candidate solutions and performing world model rollouts to evaluate each one, a procedure that scales poorly in high-dimensional spaces. Gradient-based methods (SV et al., 2023), in contrast, avoid the limitations of sampling by directly exploiting the differentiability of world models to optimize actions end-to-end. These methods eliminate the costly rollouts required by search-based approaches, thus scaling more efficiently in high-dimensional spaces. Despite this promise, gradient-based approaches have thus far seen limited empirical success.
This procedure suffers from a fundamental train-test gap. World models are typically trained using a next-state prediction objective on datasets of expert trajectories. At test time, however, they are used to optimize a planning objective over sequences of actions. We argue that this mismatch underlies the poor empirical performance of gradient-based planning (GBP), and we offer two hypotheses to explain why. (1) During planning, the intermediate sequences of actions explored by gradient descent drive the world model into states that were not encountered during training. In these out-of-distribution states, model errors compound, making the world model unreliable as a surrogate for optimization. (2) The action-level optimization landscape induced by the world model may be difficult to traverse, containing many poor local minima or flat regions, which hinders effective gradient-based optimization.
In this work, we address both of these challenges by proposing two algorithms: Online World Modeling and Adversarial World Modeling. Both expand the region of familiar latent states by continuously adding new trajectories to the dataset and finetuning the world model on them. To manage the distribution shift between offline expert trajectories and predicted trajectories from planning, Online World Modeling uses the environment simulator to correct states along a trajectory produced by performing GBP. Finetuning on these corrected trajectories ensures that the world model performs sufficiently well when GBP enters regimes of latent state space outside of the expert trajectory distribution. To overcome the difficulties of optimizing over a non-smooth loss surface during GBP, Adversarial World Modeling perturbs expert trajectories in the direction that maximizes the world model's loss. Adversarial finetuning smooths the induced action loss landscape, making it easier to optimize via gradient-based planning. We provide a visual depiction of both methods in Figure 1.
We show that finetuning world models with these algorithms leads to substantial improvements in the performance of gradient-based planning (GBP). Applying Adversarial World Modeling to a pretrained world model enables gradient-based planning to match or exceed the performance of search-based CEM on a variety of robotic object manipulation and navigation tasks. Importantly, this performance is achieved with a 10× reduction in computation time compared to CEM, underscoring the practicality of our approach for real-world planning. Additionally, we empirically demonstrate that Adversarial World Modeling smooths the planning loss landscape, and that both methods can reverse the train-test gap in world model error.
Online and Adversarial World Modeling
Problem formulation
World models learn environment dynamics by predicting the state resulting from taking an action in the current state. Then, at test time, the learned world model enables planning by simulating future trajectories and guiding action optimization. Formally, a world model approximates the (potentially unknown) dynamics function h : S × A → S , where S denotes the state space and A the action space. The environment evolves according to
$$
s_{t+1} = h(s_t, a_t),
$$
where $s_t \in S$ and $a_t \in A$ denote the state and action at time $t$, respectively.
Latent world models. In practice, we typically do not have access to the exact state of the environment; instead, we only receive partial observations of it, such as images. In order for a world model to efficiently learn in the high-dimensional observation space $O$, an embedding function $\Phi_\mu : O \to Z$ is employed to map observations to a lower-dimensional latent space $Z$. Then, given an embedding function $\Phi_\mu$, our goal is to learn a latent world model $f_\theta : Z \times A \to Z$, such that
$$
f_\theta\big(\Phi_\mu(o_t), a_t\big) \approx \Phi_\mu(o_{t+1}).
$$
The choice of Φ µ directly affects the expressivity of the latent world model. In this work, we use a fixed encoder pretrained with self-supervised learning that yields rich feature representations out of the box.
Training. To train a latent world model, we sample triplets of the form $(o_t, a_t, o_{t+1})$ from an offline dataset of trajectories $\mathcal{T}$ and minimize the $\ell_2$ distance between the true next latent state $z_{t+1} = \Phi_\mu(o_{t+1})$ and the predicted next latent state $\hat z_{t+1}$. This procedure is represented by the following teacher-forcing objective:
$$
\min_{\theta} \; \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim \mathcal{T}} \Big[ \big\| f_\theta\big(\Phi_\mu(o_t), a_t\big) - \Phi_\mu(o_{t+1}) \big\|_2^2 \Big]
$$
Notably, we only minimize this objective with respect to the world model's parameters θ , not those of the potentially large embedding function.
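To make the teacher-forcing objective concrete, here is a minimal runnable sketch in which a toy linear map stands in for $f_\theta$; the dimensions, learning rate, and synthetic triplets are our illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Toy stand-in for the latent world model: f(z, a) = W @ [z; a].
# The paper's model is a ViT; this linear sketch only illustrates the
# teacher-forcing objective min_theta ||f(z_t, a_t) - z_{t+1}||^2.
rng = np.random.default_rng(0)
dz, da = 4, 2                                  # assumed latent/action dims

def predict(W, z, a):
    return W @ np.concatenate([z, a])

def teacher_forcing_step(W, batch, lr=1e-2):
    """One SGD step on the mean squared next-latent prediction error."""
    grad = np.zeros_like(W)
    for z_t, a_t, z_next in batch:
        x = np.concatenate([z_t, a_t])
        grad += 2.0 * np.outer(W @ x - z_next, x)   # d/dW ||Wx - z'||^2
    return W - lr * grad / len(batch)

# Synthetic (o_t, a_t, o_{t+1})-style triplets, already encoded to latents,
# generated by a hidden linear environment the learner must recover.
W_env = rng.normal(size=(dz, dz + da))
batch = [(z, a, predict(W_env, z, a))
         for z, a in ((rng.normal(size=dz), rng.normal(size=da))
                      for _ in range(256))]

loss = lambda W: float(np.mean([np.sum((predict(W, z, a) - zn) ** 2)
                                for z, a, zn in batch]))
W = rng.normal(scale=0.1, size=(dz, dz + da))
loss_before = loss(W)
for _ in range(400):
    W = teacher_forcing_step(W, batch)
loss_after = loss(W)
```

With the actual world model, the same loop is simply an optimizer step on the $\ell_2$ loss above, with the encoder $\Phi_\mu$ kept frozen.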
Planning. During test-time, we use a learned world model to optimize candidate action sequences for reaching a goal state. By recursively applying the world model over an action sequence starting from an initial latent state, we obtain a predicted latent goal state and therefore the distance to the true goal state in latent space. This allows us to find the optimal action sequence
$$
\{\hat a_t\}_{t=1}^{H} = \operatorname*{arg\,min}_{\{a_t\}_{t=1}^{H}} \big\| \hat z_{H+1} - \Phi_\mu(o_{\text{goal}}) \big\|_2^2,
$$
where $\hat z_{H+1}$ is produced by the recursive procedure
$$
\hat z_{t+1} = f_\theta(\hat z_t, \hat a_t), \qquad \hat z_1 = \Phi_\mu(o_1), \qquad t = 1, \dots, H.
$$
Gradient-based planning (GBP) solves the planning objective above via gradient descent. Crucially, since the world model is differentiable, $\nabla_{\{\hat a_t\}} \hat z_{H+1} = \nabla_{\{\hat a_t\}} \operatorname{rollout}_f(z_1, \{\hat a_t\})_{H+1}$ is well-defined. In contrast, the search-based CEM is gradient-free, but requires evaluating substantially more action sequences. We detail GBP in Algorithm 1 and CEM in Section A.2.
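For concreteness, here is a runnable sketch of this procedure on a toy linear latent model, where backpropagation through the rollout can be written out by hand; with a real world model, autograd supplies the same gradient. The dynamics, horizon, and step size below are our assumptions.

```python
import numpy as np

# Gradient-based planning on a toy linear latent model z' = A z + B a.
# Backprop through the rollout is written explicitly; with a real
# differentiable world model, autograd computes the same quantity.
rng = np.random.default_rng(1)
dz, H = 3, 5
A = 0.9 * np.eye(dz) + 0.05 * rng.normal(size=(dz, dz))
B = np.eye(dz) + 0.1 * rng.normal(size=(dz, dz))    # fully actuated toy

def rollout(z1, actions):
    """Recursively apply the model to obtain the predicted final latent."""
    z = z1
    for a in actions:
        z = A @ z + B @ a
    return z

def gbp(z1, z_goal, steps=3000, lr=0.01):
    actions = rng.normal(size=(H, dz))              # random initialization
    for _ in range(steps):
        g = 2.0 * (rollout(z1, actions) - z_goal)   # d loss / d z_{H+1}
        for t in range(H - 1, -1, -1):              # chain rule through time
            actions[t] -= lr * B.T @ g              # d z_{H+1} / d a_t
            g = A.T @ g                             # push gradient one step back
    return actions

z1, z_goal = rng.normal(size=dz), rng.normal(size=dz)
plan = gbp(z1, z_goal)
final_err = float(np.sum((rollout(z1, plan) - z_goal) ** 2))
```

Note that each iteration costs one forward rollout and one backward sweep, whereas CEM would evaluate a full rollout per candidate sample per iteration.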
Online World Modeling
During gradient-based planning, the action sequences being optimized are not constrained to lie within the distribution of behavior seen during training. World models are typically trained on fixed datasets of expert trajectories, whereas GBP selects actions solely to improve the planning objective, without regard to whether those actions resemble expert behavior. As a result, the optimization process often proposes action sequences that are out of distribution. Optimizing through learned models under such conditions is known to induce adversarial inputs (Szegedy et al., 2013; Goodfellow et al., 2014). In our setting, these adversarial action sequences drive the world model into regions of the latent state space that were rarely or never observed during training, causing large prediction errors. Even when errors are initially small, they accumulate as the planner rolls the model forward, ultimately degrading long-horizon planning performance.
To address this issue, we propose Online World Modeling , which iteratively corrects the trajectories produced by GBP and finetunes the world model on the resulting rollouts. Rather than training solely on expert demonstrations, we repeatedly incorporate trajectories induced by the planner itself, thereby expanding the region of latent states that the world model can reliably predict.
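A minimal end-to-end sketch of this loop follows, with a hidden linear system as the "simulator" and a least-squares refit as the finetuning step; both are stand-ins we chose for illustration, since the paper finetunes a ViT world model on latent trajectories.

```python
import numpy as np

# Online World Modeling in miniature: plan with the current model, execute
# the planned action in the simulator to get the *corrected* next state,
# add it to the dataset, and refit. The linear model/simulator and the
# least-squares "finetune" are illustrative stand-ins.
rng = np.random.default_rng(2)
dz, da = 3, 2
A_true, B_true = 0.8 * np.eye(dz), rng.normal(size=(dz, da))
sim_step = lambda z, a: A_true @ z + B_true @ a      # environment dynamics

model_step = lambda W, z, a: W @ np.concatenate([z, a])

def plan(W, z1, z_goal, steps=500, lr=0.02):
    """One-step gradient-based planning: min_a ||f(z1, a) - z_goal||^2."""
    Wz, Wa = W[:, :dz], W[:, dz:]
    a = rng.normal(size=da)
    for _ in range(steps):
        a -= lr * 2.0 * Wa.T @ (Wz @ z1 + Wa @ a - z_goal)
    return a

W = rng.normal(scale=0.1, size=(dz, dz + da))        # initially wrong model
dataset = []
for _ in range(200):                                 # online rounds
    z1, z_goal = rng.normal(size=dz), rng.normal(size=dz)
    a = plan(W, z1, z_goal)                          # planner-induced action
    dataset.append((z1, a, sim_step(z1, a)))         # simulator correction
    # "Finetune": refit W on every planner-visited transition so far.
    X = np.stack([np.concatenate([z, a_]) for z, a_, _ in dataset])
    Y = np.stack([zn for _, _, zn in dataset])
    W = np.linalg.lstsq(X, Y, rcond=None)[0].T

# The corrected model now predicts well even on fresh, non-expert inputs.
z, a = rng.normal(size=dz), rng.normal(size=da)
pred_err = float(np.sum((model_step(W, z, a) - sim_step(z, a)) ** 2))
```

The essential structure carries over to the full method: the transitions the planner actually induces, not the expert's, are what get added to the training set.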
Adversarial World Modeling
Since world models are only trained on the next-state prediction objective, there is no particular reason for their input gradients to be well-behaved. Adversarial training has been shown to result in better-behaved input gradients (Mejia et al., 2019), consequently smoothing the input loss surface. Motivated by this observation, we propose an adversarial training objective that explicitly targets regions of the state-action space where the world model is expected to perform poorly. These adversarial samples may lie outside the expert trajectory distribution, which can expose the model to precisely the regions that matter for action optimization. We find that this procedure, which we call Adversarial World Modeling, does in fact smooth the loss surface of the planning objective (see Figure 2), improving the stability of action-sequence optimization.
Adversarial training improves model robustness by optimizing performance under worst-case perturbations (Madry et al., 2018). An adversarial example is generated by applying a perturbation δ to an input that maximally increases the model's loss. To train a world model on adversarial examples, we use the objective
$$
\min_{\theta} \; \mathbb{E}_{(z_t, a_t, z_{t+1})} \Big[ \max_{\delta_a \in B_a, \, \delta_z \in B_z} \big\| f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \big\|_2^2 \Big]
$$
where $B_a = \{\delta_a : \|\delta_a\|_\infty \le \epsilon_a\}$ and $B_z = \{\delta_z : \|\delta_z\|_\infty \le \epsilon_z\}$ constrain the magnitude of perturbations for given $\epsilon_a, \epsilon_z$. Training on these adversarially perturbed trajectories provides an alternative method to Online World Modeling for surfacing states that may be encountered during planning, without relying on GBP rollouts. This is a significant advantage in settings where simulation is expensive or infeasible.
We generate adversarial latent states using the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014), which efficiently approximates the worst-case perturbations that maximize prediction error (Wong et al., 2020). Although stronger iterative attacks such as Projected Gradient Descent (PGD) can be used, we find that FGSM delivers comparable improvements in GBP performance while being significantly more computationally efficient (see Section D.1). This enables us to generate adversarial samples over entire large-scale offline imitation learning datasets.
For each state-action pair in a given minibatch, we look for small changes to the latent state or action that most increase the world model's prediction error. Let $\epsilon_a, \epsilon_z$ denote the radii of the perturbations to the actions $\{a_t\}$ and latent states $\{z_t\}$, respectively. We compute gradients $\nabla_{\delta_a, \delta_z} \| f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \|_2^2$ with respect to the perturbations and take a signed gradient ascent step (i.e., in a direction that degrades the prediction) with step sizes $\alpha_a = 1.25\,\epsilon_a$, $\alpha_z = 1.25\,\epsilon_z$. We clip the result so that each entry of the perturbation stays within the radius. This procedure corresponds to a single step of a PGD-style attack, producing perturbations that lie on the edge of the allowed region where they are maximally challenging for the model. See Algorithm 3 for a detailed treatment.
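The perturbation step above can be sketched as follows. We use a toy linear model with analytic input gradients in place of the ViT (in practice, autograd provides $\nabla_{\delta_a, \delta_z}$); the radii are arbitrary illustrative values.

```python
import numpy as np

# Single-step FGSM-style perturbation of a world model's inputs: signed
# ascent step of size 1.25*eps on the prediction loss, clipped back into
# the infinity-norm ball. The linear model is an illustrative stand-in.
rng = np.random.default_rng(3)
dz, da = 4, 2
Wz, Wa = rng.normal(size=(dz, dz)), rng.normal(size=(dz, da))
f = lambda z, a: Wz @ z + Wa @ a

def fgsm_perturb(z, a, z_next, eps_z=0.2, eps_a=0.08):
    err = f(z, a) - z_next                       # current prediction error
    # Analytic gradients of ||f(z + dz, a + da) - z_next||^2 at dz = da = 0
    g_z, g_a = 2.0 * Wz.T @ err, 2.0 * Wa.T @ err
    d_z = np.clip(1.25 * eps_z * np.sign(g_z), -eps_z, eps_z)
    d_a = np.clip(1.25 * eps_a * np.sign(g_a), -eps_a, eps_a)
    return z + d_z, a + d_a

z, a = rng.normal(size=dz), rng.normal(size=da)
z_next = f(z, a) + 0.01 * rng.normal(size=dz)    # model is nearly perfect here
z_adv, a_adv = fgsm_perturb(z, a, z_next)
clean_loss = float(np.sum((f(z, a) - z_next) ** 2))
adv_loss = float(np.sum((f(z_adv, a_adv) - z_next) ** 2))
```

Because the ascent step saturates and is then clipped, each perturbation entry lands exactly on the boundary of the allowed ball wherever the gradient is nonzero.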
Experiments
We evaluate our methods by finetuning world models pretrained with the next-state prediction objective on 3 tasks: PushT, PointMaze, and Wall. For each task, we measure the success rate of reaching a target configuration $o_{\text{goal}}$ from an initial configuration $o_1$. We report planning results with both open-loop planning and MPC in Table 1. In the open-loop setting, we run Algorithm 1 from $o_1$ once and evaluate the predicted action sequence. In the MPC setting, we run Algorithm 1 once per MPC step (using $\Phi_\mu(o_1)$ as the initial latent state for the first MPC step), roll out the predicted actions $\{\hat a_t\}$ in the environment simulator to reach latent state $\hat z_{H+1}$, and set $\hat z_1 = \hat z_{H+1}$ for the next MPC iteration. We report all finetuning, planning, and optimization hyperparameters in Table 3.
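The MPC protocol can be sketched as a receding-horizon loop; here, again on a toy linear system where the simulator and model coincide, with horizon, round count, and rates chosen by us for illustration.

```python
import numpy as np

# Receding-horizon (MPC) evaluation: plan H actions by gradient descent,
# execute them in the simulator, then re-plan from the state actually
# reached. Toy linear dynamics stand in for both simulator and model.
rng = np.random.default_rng(4)
dz, H = 3, 2
A = 0.9 * np.eye(dz)
B = np.eye(dz) + 0.1 * rng.normal(size=(dz, dz))   # fully actuated toy
step = lambda z, a: A @ z + B @ a

def plan(z1, z_goal, steps=800, lr=0.01):
    acts = rng.normal(size=(H, dz))
    for _ in range(steps):
        z = z1
        for a in acts:
            z = step(z, a)                          # model rollout
        g = 2.0 * (z - z_goal)
        for t in range(H - 1, -1, -1):              # backprop through time
            acts[t] -= lr * B.T @ g
            g = A.T @ g
    return acts

z, z_goal = rng.normal(size=dz), rng.normal(size=dz)
for _ in range(5):                                  # MPC iterations
    for a in plan(z, z_goal):                       # re-plan every iteration
        z = step(z, a)                              # execute in the simulator
mpc_err = float(np.sum((z - z_goal) ** 2))
```

Re-planning from the executed state is what lets MPC recover from model error mid-trajectory, which is why the MPC rows in Table 1 dominate their open-loop counterparts.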
We use DINO-WM (Zhou et al., 2025) as our initial world model for its strong performance with CEM across our chosen tasks. The embedding function $\Phi_\mu$ is taken to be the pre-trained DINOv2 encoder (Oquab et al., 2024) and remains frozen while finetuning the transition model $f_\theta$, which is implemented using the ViT architecture (Dosovitskiy et al., 2021). We additionally train a VQ-VAE decoder (van den Oord et al., 2018) to visualize latent states, though it plays no role in planning. To validate the broad applicability of our approach, we also study the use of the IRIS (Micheli et al., 2023) world model architecture in Section B.3.
To initialize the action sequence for planning optimization, we evaluate both random sampling from a standard normal distribution and the use of an initialization network. Our initialization network $g_\theta : Z \times Z \to A^T$ is trained such that $g_\theta(z_1, z_g) = \{\hat a_t\}_{t=1}^{T}$. We find that random initialization tends to outperform the initialization network, and we analyze its impact in depth in Section B.1.
During GBP, we set $L_{\text{goal}}$ in Algorithm 1 to a weighted goal loss to obtain a gradient from each predicted state instead of only the last one. We find empirically that this choice generalizes to both navigation (e.g., PointMaze and Wall) and non-navigation tasks (e.g., PushT); i.e., on tasks with or without subgoal decomposability, this objective improves or matches the performance of the final-state loss. We provide the exact formulation and more details in Section A.4. We additionally evaluate using the Adam optimizer (Kingma & Ba, 2014) during GBP. Although Adam improves performance significantly over GD for all world models in our experiments, we find that Adam alone does not scale performance to match or surpass CEM.
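One way such a weighted goal loss can look is sketched below; the exponential weighting (heaviest at the horizon) is our illustrative assumption, with the exact formulation given in Section A.4.

```python
import numpy as np

# A weighted goal loss: every predicted latent along the rollout contributes
# its distance to the goal, not only the final state. The exponential
# weighting scheme here is an illustrative assumption, not the paper's
# exact formulation (which appears in its Section A.4).
def weighted_goal_loss(pred_states, z_goal, gamma=0.5):
    H = len(pred_states)
    weights = np.array([gamma ** (H - 1 - t) for t in range(H)])  # w_H = 1
    weights /= weights.sum()
    return float(sum(w * np.sum((z - z_goal) ** 2)
                     for w, z in zip(weights, pred_states)))

# Intermediate progress is rewarded: a trajectory at the goal throughout
# scores lower than one that only arrives at the final step.
early = [np.ones(2), np.ones(2)]                 # at goal the whole time
late = [np.zeros(2), np.ones(2)]                 # reaches goal at the end
loss_early = weighted_goal_loss(early, np.ones(2))
loss_late = weighted_goal_loss(late, np.ones(2))
```

The practical effect is that every predicted state contributes a gradient, rather than only $\hat z_{H+1}$.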
∗ We could not reproduce the Wall environment open-loop CEM success rate reported in DINO-WM (74% over our 32%), so we report their better value.

Figure 3: Planning efficiency of DINO-WM, Online World Modeling, and Adversarial World Modeling on the PushT task. Gradient-based planning is orders of magnitude faster than CEM.
Planning Results
On all three tasks, our methods outperform DINO-WM with gradient-descent GBP and either match or outperform it with the far more expensive CEM. In the open-loop setting, we achieve success-rate increases of +18% on PushT, +20% on PointMaze, and +30% on Wall. In the MPC setting, Adam GBP with Adversarial World Modeling outperforms CEM with DINO-WM on PointMaze and Wall and matches CEM on PushT.
While both Online World Modeling and Adversarial World Modeling bootstrap new data to improve the robustness of our world model during GBP, the distributions they induce are quite different. Whereas Online World Modeling anticipates and covers the distribution seen at planning time, Adversarial World Modeling exploits the current loss landscape of the world model to encourage local smoothness near expert trajectories. For all environments, we find Adversarial World Modeling outperforms Online World Modeling when using Adam to perform GBP.
To demonstrate the advantages of Adversarial World Modeling in more complex environments where the simulator may be very costly and the number of action dimensions is larger, we also evaluate planning performance on two robotic manipulation tasks in Section B.2.
Train-Test Gap
Comparing the world model error between training trajectories and planning trajectories allows us to evaluate whether the world model will perform well during planning even if it is trained to convergence on expert trajectories. We evaluate world model error as the deviation between the world model's predicted next latent state and the next latent state given by the environment simulator. Given an initial state $s_1$ (associated with $o_1$) and a sequence of actions $\{a_t\}$ (either from the training dataset or a planning procedure), the world model error at timestep $t$ is given by

Figure 4: Difference in World Model Error between expert and planning trajectories on PushT.
$$
\operatorname{err}_t = \big\| \operatorname{rollout}_f(z_1, \{a_t\})_{t+1} - \Phi_\mu(o_{t+1}) \big\|_2^2,
$$
where $o_{t+1}$ is the observation returned by the simulator after executing $a_1, \dots, a_t$ from $s_1$. A train-test gap appears when the world model performs relatively worse on sequences of actions produced during planning. Figure 4 demonstrates that this is the case with DINO-WM, but not with Online World Modeling or Adversarial World Modeling, indicating a narrowing of the train-test gap. See Section B.6 for results for PointMaze and Wall.
Loss Surface Visualization
We include visualizations of planning trajectories for DINO-WM, Online World Modeling, and Adversarial World Modeling to further study their success and failure modes. Visualizations for PushT and Wall can be found in Figures 10 and 11 respectively.

(a) We see that DINO-WM is more likely to enter states outside of the training distribution, and so the decoder is not able to reconstruct the state accurately. This is not the case with Online World Modeling, but it still fails to successfully reach the goal state. Adversarial World Modeling successfully completes the task.

(b) Again we notice the failure of DINO-WM's decoder to reconstruct states it encounters during planning; this is not the case with Online World Modeling and Adversarial World Modeling, which both complete the task successfully.
Figure 10: Trajectory visualizations of the PushT task. We plot the expert trajectory to reach the goal state, alongside both the simulator states and decoded latent states for DINO-WM, Online World Modeling, and Adversarial World Modeling.
(a) In this challenging example, all three world models enter states through planning that their respective decoders cannot reconstruct, but only Online World Modeling is able to complete the task successfully.
(b) In this example, we see that DINO-WM predicts that it successfully completed the task according to its reconstructed last latent state, but the simulator indicates the true position to be off of the goal state. Online and Adversarial World Modeling correct for this and successfully complete the task.
| | PushT | PushT | PushT | PointMaze | PointMaze | PointMaze | Wall | Wall | Wall |
|---|---|---|---|---|---|---|---|---|---|
| | GD | Adam | CEM | GD | Adam | CEM | GD | Adam | CEM |
| DINO-WM | 38 | 54 | 78 | 12 | 24 | 90 | 2 | 10 | 74 ∗ |
| + MPC | 56 | 76 | 92 | 42 | 68 | 90 | 12 | 80 | 82 |
| OnlineWM | 34 | 52 | 90 | 20 | 14 | 62 | 16 | 18 | 54 ∗ |
| + MPC | 50 | 76 | 92 | 54 | 88 | 96 | 38 | 80 | 90 |
| AdversarialWM | 56 | 82 | 94 | 32 | 70 | 88 | 32 | 34 | 30 ∗ |
| + MPC | 66 | 92 | 92 | 50 | 94 | 98 | 14 | 94 | 94 |
| Environment | H | Frameskip | Dataset Size | Trajectory Length |
|---|---|---|---|---|
| Push-T | 3 | 5 | 18500 | 100-300 |
| PointMaze | 3 | 5 | 2000 | 100 |
| Wall | 1 | 5 | 1920 | 50 |
| Rope | 1 | 1 | 1000 | 5 |
| Granular | 1 | 1 | 1000 | 5 |
(a) Finetuning Parameters

| Name | Value |
|---|---|
| Image size | 224 |
| Optimizer | AdamW |
| Predictor LR | 1e-5 |

(b) Open-Loop Planning

| Name | GD | Adam |
|---|---|---|
| Opt. steps | 300 | 300 |
| LR | 1.0 | 0.3 |

(c) MPC Parameters

| Name | GD | Adam |
|---|---|---|
| MPC steps | 10 | 10 |
| Opt. steps | 100 | 100 |
| LR | 1 | 0.2 |
| Environment | # Rollouts | Batch Size | GPU | Epochs | ϵ visual | ϵ proprio | ϵ action |
|---|---|---|---|---|---|---|---|
| PushT | 20000 (all) | 16 | 8x B200 | 1 | 0.05 | 0.02 | 0.02 |
| PointMaze | 2000 (all) | 16 | 1x B200 | 1 | 0.2 | 0.08 | 0.08 |
| Wall | 1920 (all) | 48 | 1x B200 | 2 | 0.2 | 0.08 | 0.08 |
| Environment | # Rollouts | Batch Size | GPU | Epochs |
|---|---|---|---|---|
| PushT | 6000 | 32 | 4x B200 | 1 |
| PointMaze | 500 | 32 | 4x B200 | 1 |
| Wall | 1920 (all) | 80 | 4x B200 | 1 |
| PushT | PushT | PointMaze | PointMaze | Wall | Wall | |
|---|---|---|---|---|---|---|
| GD+IN | Ad+IN | GD+IN | Ad+IN | GD+IN | Ad+IN | |
| DINO-WM | 44 60 | 62 84 | 16 40 | 14 54 | 4 6 | 12 32 |
| + MPC | 56 | 8 | 28 46 | |||
| OnlineWM + MPC | 66 | 10 | 18 | |||
| 52 | 82 | 40 | 2 | 22 | ||
| AdversarialWM | 74 | 90 | 22 | 36 | 18 | 24 |
| + MPC | 74 | 90 | 44 | 56 | 24 | 48 |
| Rope | Rope | Granular | Granular | |
|---|---|---|---|---|
| GD | CEM | GD | CEM | |
| DINO-WM | 1.73 | 0.93 | 0.30 | 0.22 |
| AdversarialWM | 0.93 | 0.82 | 0.24 | 0.28 |
| GD | CEM | |
|---|---|---|
| IRIS | 0 | 4 |
| IRIS + OnlineWM | 0 | 0 |
| IRIS + AdversarialWM | 8 | 6 |
| PushT | PointMaze | |
|---|---|---|
| DINO-WM | 16 | 70 |
| OnlineWM | 16 | 96 |
| AdversarialWM | 26 | 88 |
| PushT | PointMaze | Wall | |
|---|---|---|---|
| Simulator | 0.959 | 0.717 | 4.465 |
| DINO-WM | 0.029 | 0.029 | 0.029 |
| | Backward Passes | PointMaze | PointMaze | PointMaze | Wall | Wall | Wall |
|---|---|---|---|---|---|---|---|
| | | Min/Epoch | Open-Loop | MPC | Min/Epoch | Open-Loop | MPC |
| FGSM | 2 | 120 | 70 | 94 | 14 | 34 | 94 |
| 2-Step PGD | 3 | 165 | 80 | 96 | 20 | 8 | 90 |
| 3-Step PGD | 4 | 201 | 78 | 94 | 24 | 14 | 94 |
github.com/nimitkalra/robust-world-model-planning
In robotic tasks, anticipating how the actions of an agent affect the state of its environment is fundamental for both prediction (Finn2016UnsupervisedLFA) and planning (mohanan2018survey; kavraki2002probabilistic). Classical approaches derive models of the environment evolution analytically from first principles, relying on prior knowledge of the environment, the agent, and any uncertainty (goldstein1950classical; siciliano2009robotics; spong2020robot). In contrast, learning-based methods infer such models directly from data, enabling them to capture complex dynamics and thus improve generalization and robustness to uncertainty (sutton1998reinforcement; schrittwieser2020mastering; lecun2022path).
World models (ha2018world), in particular, have emerged as a powerful paradigm. Given the current state and an action, the world model predicts the resulting next state. These models can be learned either from exact state information (sutton1991dyna) or from high-dimensional sensory inputs such as images (hafner2023mastering). The latter setup is especially compelling as it enables perception, prediction, and control directly from raw images by leveraging pre-trained visual representations, and removes the need for measuring the precise environment states which is difficult in practice (assran2023self; Bardes2024RevisitingFPA). Recently, world models and their predictive capabilities have been leveraged for planning, enabling agents to solve a variety of tasks (hafner2019dream; hafner2019learning; schrittwieser2020mastering; hafner2023mastering; zhou2025dinowmworldmodelspretrained). A model of the dynamics is learned offline, while the planning task is defined at inference as a constrained optimization problem: given the current state, find a sequence of actions that results in a state as close as possible to the target state. This inference-time optimization provides an effective alternative to reinforcement learning approaches (sutton1998reinforcement) that often suffer from poor sample-efficiency.
World models are compatible with many model-based planning algorithms. Traditional methods such as DDP (mayne1966second) and iLQR (li2004iterative) rely on iteratively solving exact optimization problems derived from linear and quadratic approximations of the dynamics around a nominal trajectory. While highly effective in low-dimensional settings, these methods become impractical for large-scale world models, where solving the resulting optimization problem is computationally intractable. As an alternative, search-based methods such as the Cross Entropy Method (CEM) (rubinstein2004cross) and Model Predictive Path Integral control (MPPI) (williams2017model) have been widely adopted as gradient-free alternatives and have proven effective in practice. However, they are computationally intensive as they require iteratively sampling candidate solutions and performing world model rollouts to evaluate each one, a procedure that scales poorly in high-dimensional spaces. Gradient-based methods (sv2023gradient), in contrast, avoid the limitations of sampling by directly exploiting the differentiability of world models to optimize actions end-to-end. These methods eliminate the costly rollouts required by search-based approaches, thus scaling more efficiently in high-dimensional spaces. Despite this promise, gradient-based approaches have thus far seen limited empirical success.
This procedure suffers from a fundamental train-test gap. World models are typically trained using a next-state prediction objective on datasets of expert trajectories. At test time, however, they are used to optimize a planning objective over sequences of actions. We argue that this mismatch underlies the poor empirical performance of gradient-based planning (GBP), and we offer two hypotheses to explain why. (1) During planning, the intermediate action sequences explored by gradient descent drive the world model into states that were not encountered during training. In these out-of-distribution states, model errors compound, making the world model unreliable as a surrogate for optimization. (2) The action-level optimization landscape induced by the world model may be difficult to traverse, containing many poor local minima or flat regions, which hinders effective gradient-based optimization.
In this work, we address both of these challenges by proposing two algorithms: Online World Modeling and Adversarial World Modeling. Both expand the region of familiar latent states by continuously adding new trajectories to the dataset and finetuning the world model on them. To manage the distribution shift between offline expert trajectories and predicted trajectories from planning, Online World Modeling uses the environment simulator to correct states along a trajectory produced by performing GBP. Finetuning on these corrected trajectories ensures that the world model performs sufficiently well when GBP enters regimes of latent state space outside of the expert trajectory distribution. To overcome the difficulties of optimizing over a non-smooth loss surface during GBP, Adversarial World Modeling perturbs expert trajectories in the direction that maximizes the world model’s loss. Adversarial finetuning smooths the induced action loss landscape, making it easier to optimize via gradient-based planning. We provide a visual depiction of both methods in Figure 1.
We show that finetuning world models with these algorithms leads to substantial improvements in the performance of gradient-based planning (GBP). Applying Adversarial World Modeling to a pretrained world model enables gradient-based planning to match or exceed the performance of search-based CEM on a variety of robotic object manipulation and navigation tasks. Importantly, this performance is achieved with a 10× reduction in computation time compared to CEM, underscoring the practicality of our approach for real-world planning. Additionally, we empirically demonstrate that Adversarial World Modeling smooths the planning loss landscape, and that both methods can reverse the train-test gap in world model error.
World models learn environment dynamics by predicting the state resulting from taking an action in the current state. Then, at test time, the learned world model enables planning by simulating future trajectories and guiding action optimization. Formally, a world model approximates the (potentially unknown) dynamics function $h : \mathcal{S}\times\mathcal{A}\to\mathcal{S}$, where $\mathcal{S}$ denotes the state space and $\mathcal{A}$ the action space. The environment evolves according to
$$s_{t+1} = h(s_t, a_t),$$
where $s_t\in\mathcal{S}$ and $a_t\in\mathcal{A}$ denote the state and action at time $t$, respectively.
In practice, we typically do not have access to the exact state of the environment; instead, we only receive partial observations of it, such as images. In order for a world model to learn efficiently in the high-dimensional observation space $\mathcal{O}$, an embedding function $\Phi_\mu : \mathcal{O}\to\mathcal{Z}$ is employed to map observations to a lower-dimensional latent space $\mathcal{Z}$. Then, given an embedding function $\Phi_\mu$, our goal is to learn a latent world model $f_\theta : \mathcal{Z}\times\mathcal{A}\to\mathcal{Z}$, such that
$$f_\theta(\Phi_\mu(o_t), a_t) \approx \Phi_\mu(o_{t+1}).$$
The choice of $\Phi_\mu$ directly affects the expressivity of the latent world model. In this work, we use a fixed encoder pretrained with self-supervised learning that yields rich feature representations out of the box.
To train a latent world model, we sample triplets of the form $(o_t, a_t, o_{t+1})$ from an offline dataset of trajectories $\mathcal{T}$ and minimize the $\ell_2$ distance between the true next latent state $z_{t+1} = \Phi_\mu(o_{t+1})$ and the predicted next latent state $\hat{z}_{t+1}$. This procedure is represented by the following teacher-forcing objective:
$$\min_\theta \; \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim \mathcal{T}} \left\lVert f_\theta(\Phi_\mu(o_t), a_t) - \Phi_\mu(o_{t+1}) \right\rVert_2^2.$$
Notably, we only minimize this objective with respect to the world model's parameters $\theta$, not those of the potentially large embedding function.
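As a minimal illustration of this training setup, the sketch below fits a toy linear world model to frozen latent/action pairs by gradient descent on the teacher-forcing loss. The linear form, dimensions, and learning rate are all assumptions made for illustration (the paper's $f_\theta$ is a ViT); what it mirrors is that the loss is taken in latent space and only the world model's parameters are updated.

```python
import numpy as np

# Toy linear world model f_theta(z, a) = W @ concat(z, a); W plays the role of theta.
rng = np.random.default_rng(0)
dz, da, n = 4, 2, 256
W = rng.normal(size=(dz, dz + da)) * 0.1   # world-model parameters theta (trained)
W_true = rng.normal(size=(dz, dz + da))    # unknown "true" latent dynamics

z = rng.normal(size=(n, dz))               # z_t = Phi_mu(o_t), precomputed and frozen
a = rng.normal(size=(n, da))
x = np.concatenate([z, a], axis=1)
z_next = x @ W_true.T                      # targets z_{t+1} = Phi_mu(o_{t+1})

def loss(W):
    """Teacher-forcing objective: mean squared next-latent prediction error."""
    return np.mean(np.sum((x @ W.T - z_next) ** 2, axis=1))

lr = 0.05
for _ in range(300):
    resid = x @ W.T - z_next               # f_theta(z_t, a_t) - z_{t+1}
    grad = 2.0 * resid.T @ x / n           # gradient w.r.t. theta only; the encoder is frozen
    W -= lr * grad

print(loss(W))  # near zero: the model fits the next-latent targets
```

Note that the encoder outputs (`z`, `z_next`) never receive gradients, matching the frozen-embedding design above.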
During test-time, we use a learned world model to optimize candidate action sequences for reaching a goal state. By recursively applying the world model over an action sequence starting from an initial latent state, we obtain a predicted latent goal state and therefore the distance to the true goal state in latent space. This allows us to find the optimal action sequence
$$\{\hat{a}_t\}_{t=1}^{H} = \operatorname*{arg\,min}_{\{a_t\}_{t=1}^{H}} \left\lVert \hat{z}_{H+1} - z_{\text{goal}} \right\rVert_2^2,$$
where $\hat{z}_{H+1}$ is produced by the recursive procedure
$$\hat{z}_1 = z_1, \qquad \hat{z}_{t+1} = f_\theta(\hat{z}_t, a_t), \quad t = 1, \dots, H.$$
We use the function $\text{rollout}_f : \mathcal{Z}\times\mathcal{A}^H \to \mathcal{Z}^H$ to denote this recursive procedure.
Gradient-based planning (GBP) solves the planning objective (4) via gradient descent. Crucially, since the world model is differentiable, $\nabla_{\{\hat{a}_t\}} \hat{z}_{H+1} = \nabla_{\{\hat{a}_t\}} \text{rollout}_f(z_1, \{\hat{a}_t\})_{H+1}$ is well-defined. In contrast, the search-based CEM is gradient-free, but requires evaluating substantially more action sequences. We detail GBP in Algorithm 1 and CEM in Section A.2.
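To make the procedure concrete, here is a minimal GBP sketch assuming a toy linear world model $f(z, a) = Az + Ba$, so that backpropagation through the rollout can be written out analytically; the model form, dimensions, and optimizer settings are illustrative assumptions, not the paper's implementation (which differentiates a ViT with autograd).

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da, H = 4, 2, 5
A = 0.9 * np.eye(dz)                 # toy linear transition on the latent state
B = rng.normal(size=(dz, da)) * 0.5  # toy action effect

def rollout(z1, actions):
    """Recursively apply the world model f(z, a) = A z + B a."""
    zs = [z1]
    for a in actions:
        zs.append(A @ zs[-1] + B @ a)
    return zs

def gbp(z1, z_goal, steps=400, lr=0.02):
    """Minimize ||z_hat_{H+1} - z_goal||^2 over the action sequence by gradient descent."""
    actions = rng.normal(size=(H, da))
    for _ in range(steps):
        zs = rollout(z1, actions)
        # Backpropagate the goal loss through the rollout by hand:
        # dL/dz_{H+1} = 2 (z_hat_{H+1} - z_goal); dz_{t+1}/da_t = B; dz_{t+1}/dz_t = A.
        g = 2.0 * (zs[-1] - z_goal)
        grads = np.zeros_like(actions)
        for t in range(H - 1, -1, -1):
            grads[t] = B.T @ g
            g = A.T @ g
        actions -= lr * grads
    return actions

z1, z_goal = rng.normal(size=dz), rng.normal(size=dz)
planned = gbp(z1, z_goal)
print(np.linalg.norm(rollout(z1, planned)[-1] - z_goal))  # goal distance after planning
```

The manual backward loop is exactly what autograd performs through $\text{rollout}_f$ for a real world model.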
As errors can propagate over long horizons, Model Predictive Control (MPC) is commonly used to repeatedly re-plan by optimizing an $H$-step action sequence but executing only the first $K \leq H$ actions before replanning from the updated state.
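The receding-horizon loop can be sketched as follows. We use a toy point-mass environment and a simple random-shooting planner as a stand-in for GBP or CEM; all dynamics, costs, and hyperparameters here are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
H, K, d = 10, 2, 2                     # plan H steps, execute only the first K
goal = np.array([1.0, -1.0])

def env_step(s, a):
    """Toy point-mass dynamics, used here as both world model and simulator."""
    return s + 0.2 * np.clip(a, -1.0, 1.0)

def plan(s, n_samples=256):
    """Random-shooting planner (stand-in for GBP/CEM): pick the H-step
    sequence with the lowest summed distance to the goal along its rollout."""
    best, best_cost = None, np.inf
    for _ in range(n_samples):
        actions = rng.normal(size=(H, d))
        z, cost = s, 0.0
        for a in actions:
            z = env_step(z, a)
            cost += np.linalg.norm(z - goal)
        if cost < best_cost:
            best, best_cost = actions, cost
    return best

s = np.zeros(d)
for _ in range(15):                    # MPC iterations
    actions = plan(s)                  # optimize an H-step sequence...
    for a in actions[:K]:              # ...but execute only the first K <= H actions
        s = env_step(s, a)

print(np.linalg.norm(s - goal))        # distance to goal after receding-horizon control
```

Replanning from the updated state is what keeps compounding model errors from derailing the executed trajectory.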
As the planning objective is induced entirely by the world model, the success of GBP hinges on (1) the model accurately predicting future states under any candidate action sequence, and (2) the stability of this differentiable optimization. We now present two finetuning methods designed to improve on these fronts.
During gradient-based planning, the action sequences being optimized are not constrained to lie within the distribution of behavior seen during training. World models are typically trained on fixed datasets of expert trajectories, whereas GBP selects actions solely to improve the planning objective, without regard to whether those actions resemble expert behavior. As a result, the optimization process often proposes action sequences that are out of distribution. Optimizing through learned models under such conditions is known to induce adversarial inputs (szegedy2013intriguing; goodfellow2014explaining). In our setting, these adversarial action sequences drive the world model into regions of the latent state space that were rarely or never observed during training, causing large prediction errors. Even when errors are initially small, they accumulate as the planner rolls the model forward, ultimately degrading long-horizon planning performance.
To address this issue, we propose Online World Modeling, which iteratively corrects the trajectories produced by GBP and finetunes the world model on the resulting rollouts. Rather than training solely on expert demonstrations, we repeatedly incorporate trajectories induced by the planner itself, thereby expanding the region of latent states that the world model can reliably predict.
First, we conduct GBP using the initial and goal latent states of an expert trajectory $\tau$, yielding a sequence of predicted actions $\{\hat{a}_t\}_{t=1}^{H}$. These actions might send the world model into regions of the latent space that lie outside of the training distribution. To adjust for this, we obtain a corrected trajectory: the actual sequence of states that would result by executing the action sequence $\{\hat{a}_t\}_{t=1}^{H}$ in the environment using the true dynamics simulator $h$. We add the corrected trajectory,
to the dataset on which the world model is trained, finetuning the model each time the dataset is updated. Re-training on these corrected trajectories expands the training distribution to cover the regions of latent space induced by gradient-based planning, mitigating compounding prediction errors during planning. We provide more detail in Algorithm 2 and illustrate the method in Figure 1.
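A toy sketch of this aggregation loop, with a linear simulator standing in for the true dynamics $h$ and random action sequences standing in for GBP's planned actions (both assumptions made for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da, H = 3, 2, 4
W_true = rng.normal(size=(dz, dz + da)) * 0.2     # true latent dynamics (simulator stand-in)
W = W_true + rng.normal(size=W_true.shape) * 0.3  # imperfect world model

def step(Wm, z, a):
    return Wm @ np.concatenate([z, a])

def simulate(z1, actions):
    """Ground-truth simulator h: returns the corrected latent trajectory."""
    zs = [z1]
    for a in actions:
        zs.append(step(W_true, zs[-1], a))
    return zs

dataset = []                                      # (z_t, a_t, z_{t+1}) triplets
for _ in range(20):                               # online iterations
    z1 = rng.normal(size=dz)
    actions = rng.normal(size=(H, da))            # stand-in for GBP's planned actions
    zs = simulate(z1, actions)                    # correct the trajectory with h
    dataset += [(zs[t], actions[t], zs[t + 1]) for t in range(H)]
    for z, a, z_next in dataset:                  # finetune on the aggregated dataset
        x = np.concatenate([z, a])
        W -= 0.05 * np.outer(step(W, z, a) - z_next, x)

err = np.mean([np.linalg.norm(step(W, z, a) - zn) for z, a, zn in dataset])
print(err)  # model error shrinks as corrected trajectories accumulate
```

The key structural point is that planner-induced states, once corrected by the simulator, re-enter the training set rather than being discarded.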
This procedure is reminiscent of DAgger (Dataset Aggregation) (ross2011reduction), an online imitation learning method wherein a base policy network is iteratively trained on its own rollouts with the action predictions replaced by those from an expert policy. In a similar spirit, we invoke the ground-truth simulator as our expert world model that we imitate.
Since world models are only trained on the next-state prediction objective, there is no particular reason for their input gradients to be well-behaved. Adversarial training has been shown to result in better behaved input gradients (mejia2019robust), consequently smoothing the input loss surface. Motivated by this observation, we propose an adversarial training objective that explicitly targets regions of the state-action space where the world model is expected to perform poorly. These adversarial samples may lie outside the expert trajectory distribution, which can expose the model to precisely the regions that matter for action optimization. We find that this procedure, which we call Adversarial World Modeling, does in fact smooth the loss surface of the planning objective (see Figure 2), improving the stability of action-sequence optimization.
Adversarial training improves model robustness by optimizing performance under worst-case perturbations (madry2019deeplearningmodelsresistant). An adversarial example is generated by applying a perturbation $\delta$ to an input that maximally increases the model's loss. To train a world model on adversarial examples, we use the objective
$$\min_\theta \; \mathbb{E}_{(z_t, a_t, z_{t+1})} \; \max_{\delta_z \in \mathcal{B}_z, \, \delta_a \in \mathcal{B}_a} \left\lVert f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \right\rVert_2^2,$$
where $\mathcal{B}_a = \{\delta_a : \lVert\delta_a\rVert_\infty \leq \epsilon_a\}$ and $\mathcal{B}_z = \{\delta_z : \lVert\delta_z\rVert_\infty \leq \epsilon_z\}$ constrain the magnitude of perturbations for given $\epsilon_a, \epsilon_z$. Training on these adversarially perturbed trajectories provides an alternative method to Online World Modeling for surfacing states that may be encountered during planning, without relying on GBP rollouts. This is a significant advantage in settings where simulation is expensive or infeasible.
We generate adversarial latent states using the Fast Gradient Sign Method (FGSM) (goodfellow2014explaining), which efficiently approximates the worst-case perturbations that maximize prediction error (fastbetterthanfree). Although stronger iterative attacks such as Projected Gradient Descent (PGD) can be used, we find that FGSM delivers comparable improvements in GBP performance while being significantly more computationally efficient (see Section D.1). This enables us to generate adversarial samples over entire large-scale offline imitation learning datasets.
For each state-action pair in a given minibatch, we look for small changes to the latent state or action that most increase the world model's prediction error. Let $\epsilon_a, \epsilon_z$ denote the radii of the perturbations to the actions $\{a_t\}$ and latent states $\{z_t\}$, respectively. We compute gradients $\nabla_{\delta_a, \delta_z} \lVert f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \rVert_2^2$ with respect to the perturbations and take a signed gradient ascent step (i.e., in a direction that degrades the prediction) with step sizes $\alpha_a = 1.25\epsilon_a$ and $\alpha_z = 1.25\epsilon_z$. We clip the result so that each entry of the perturbation stays within the radius. This procedure corresponds to a single step of a PGD-style attack, producing perturbations that lie on the edge of the allowed region where they are maximally challenging for the model. See Algorithm 3 for a detailed treatment.
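A sketch of this single-step attack on a toy linear world model, where the gradient with respect to the perturbations is analytic (a real world model would use autograd). The radii, model, and the choice to fit this transition exactly are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da = 4, 2
A = rng.normal(size=(dz, dz)) * 0.3
B = rng.normal(size=(dz, da)) * 0.3

def f(z, a):                         # toy differentiable world model
    return A @ z + B @ a

z, a = rng.normal(size=dz), rng.normal(size=da)
z_next = f(z, a)                     # assume the model fits this transition exactly

eps_z, eps_a = 0.1, 0.1              # perturbation radii
alpha_z, alpha_a = 1.25 * eps_z, 1.25 * eps_a

# Random start inside the infinity-norm ball (as in fast adversarial training).
dz_pert = rng.uniform(-eps_z, eps_z, size=dz)
da_pert = rng.uniform(-eps_a, eps_a, size=da)

# Gradient of ||f(z + dz, a + da) - z_next||^2 w.r.t. the perturbations
# (analytic for a linear model; autograd in practice).
r = f(z + dz_pert, a + da_pert) - z_next
grad_z, grad_a = 2.0 * A.T @ r, 2.0 * B.T @ r

# Single signed ascent step, then clip each entry back to the radius.
dz_pert = np.clip(dz_pert + alpha_z * np.sign(grad_z), -eps_z, eps_z)
da_pert = np.clip(da_pert + alpha_a * np.sign(grad_a), -eps_a, eps_a)

adv_loss = np.sum((f(z + dz_pert, a + da_pert) - z_next) ** 2)
clean_loss = np.sum((f(z, a) - z_next) ** 2)
print(clean_loss, adv_loss)          # the adversarial loss exceeds the clean loss
```

Because the step size exceeds the radius and is then clipped, the resulting perturbation sits on the boundary of the allowed ball, as described above.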
To initialize the perturbation radii $\epsilon_a, \epsilon_z$, we use scaling factors $\lambda_a, \lambda_z$ and find that Adversarial World Modeling is robust for $0 \leq \lambda_a \leq 1$ and $0 \leq \lambda_z \leq 0.5$. Furthermore, we find that fixing $\epsilon_a, \epsilon_z$ to the standard deviation of the initial minibatch is stable across all experiments. Updating this estimate for each batch as in Algorithm 3 yields no consistent improvement in final planning performance. We further analyze design ablations in Appendix D.
We evaluate our methods by finetuning world models pretrained with the next-state prediction objective on three tasks: PushT, PointMaze, and Wall. For each task we measure the success rate of reaching a target configuration $o_{\text{goal}}$ from an initial configuration $o_1$. We report planning results with both open-loop planning and MPC in Table 1. In the open-loop setting, we run Algorithm 1 from $o_1$ once and evaluate the predicted action sequence. In the MPC setting, we run Algorithm 1 once for each MPC step (using $\Phi_\mu(o_1^{\prime})$ as the initial latent state for the first MPC step), roll out the predicted actions $\{\hat{a}_t\}$ in the environment simulator to reach latent state $\hat{z}_{H+1}$, and set $\hat{z}_1 = \hat{z}_{H+1}$ for the next MPC iteration. We report all finetuning, planning, and optimization hyperparameters in Table 3.
We use DINO-WM (zhou2025dinowmworldmodelspretrained) as our initial world model for its strong performance with CEM across our chosen tasks. The embedding function $\Phi_\mu$ is taken to be the pre-trained DINOv2 encoder (oquab2024dinov2learningrobustvisual) and remains frozen while finetuning the transition model $f_\theta$, which is implemented using the ViT architecture (dosovitskiy2021imageworth16x16words). We additionally train a VQVAE decoder (oord2018neuraldiscreterepresentationlearning) to visualize latent states, though it plays no role in planning. To validate the broad applicability of our approach, we also study the use of the IRIS (micheli2023transformerssampleefficientworldmodels) world model architecture in Section B.3.
To initialize the action sequence for planning optimization, we evaluate both random sampling from a standard normal distribution and the use of an initialization network. Our initialization network $g_\theta : \mathcal{Z}\times\mathcal{Z}\to\mathcal{A}^T$ is trained such that $g_\theta(z_1, z_g) = \{\hat{a}_t\}_{t=1}^{T}$. We find that random initialization tends to outperform the initialization network, and we analyze its impact in depth in Section B.1.
During GBP, we set $\mathcal{L}_{\text{goal}}$ in Algorithm 1 to a weighted goal loss to obtain a gradient from each predicted state instead of simply the last one. We find empirically that this task assumption generalizes to both navigation (e.g., PointMaze and Wall) and non-navigation tasks (e.g., PushT); i.e., on tasks with or without subgoal decomposability, this objective improves or matches performance of the final-state loss. We provide the exact formulation and more details in Section A.4. We additionally evaluate using the Adam optimizer (Kingma2014AdamAM) during GBP. Although using Adam improves performance significantly over GD for all world models in our experiments, we find that Adam alone does not scale performance to match or surpass CEM.
On all three tasks, our methods outperform DINO-WM with Gradient Descent GBP and either match or outperform it with the far more expensive CEM. In the open-loop setting, we achieve success rate increases of +18% on PushT, +20% on PointMaze, and +30% on Wall. In the MPC setting, Adam GBP with Adversarial World Modeling outperforms CEM with DINO-WM on PointMaze and Wall and matches CEM on PushT.
While both Online World Modeling and Adversarial World Modeling bootstrap new data to improve the robustness of our world model during GBP, the distributions they induce are quite different. Whereas Online World Modeling anticipates and covers the distribution seen at planning time, Adversarial World Modeling exploits the current loss landscape of the world model to encourage local smoothness near expert trajectories. For all environments, we find Adversarial World Modeling outperforms Online World Modeling when using Adam to perform GBP.
To demonstrate the advantages of Adversarial World Modeling in more complex environments where the simulator may be very costly and the number of action dimensions is larger, we also evaluate planning performance on two robotic manipulation tasks in Section B.2.
Comparing the world model error between training trajectories and planning trajectories allows us to evaluate whether the world model will perform well during planning even if it is trained to convergence on expert trajectories. We evaluate world model error as the deviation between the world model's predicted next latent state and the next latent state given by the environment simulator. Given an initial state $s_1$ (associated with $o_1$) and a sequence of actions $\{a_t\}$ (either from the training dataset or a planning procedure), the world model error at timestep $t$ is given by
$$\left\lVert f_\theta(\Phi_\mu(o_t), a_t) - \Phi_\mu(o_{t+1}) \right\rVert_2^2,$$
where $o_{t+1}$ is the observation produced by the environment simulator.
This error is averaged over all timesteps of a trajectory. If the difference in world model error between expert trajectories and planning trajectories is negative, then the world model will perform relatively worse on sequences of actions produced during planning. Figure 4 demonstrates that this is the case with DINO-WM, but not with Online World Modeling or Adversarial World Modeling, indicating a narrowing of the train-test gap. See Section B.6 for results for PointMaze and Wall.
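A sketch of this metric with toy linear stand-ins for both the world model and the simulator (assumptions for illustration): at each step, the model predicts the next latent from the simulator's true current latent, and the per-step deviations are averaged over the trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da, T = 3, 2, 10
W_sim = rng.normal(size=(dz, dz + da)) * 0.3           # simulator stand-in
W_model = W_sim + rng.normal(size=W_sim.shape) * 0.1   # imperfect world model

def world_model_error(z1, actions):
    """Average one-step deviation between model prediction and simulator latent."""
    z, errs = z1, []
    for a in actions:
        x = np.concatenate([z, a])
        z_true = W_sim @ x                 # next latent from the simulator
        z_pred = W_model @ x               # model prediction from the true current latent
        errs.append(np.linalg.norm(z_pred - z_true))
        z = z_true                         # continue from the simulator's state
    return np.mean(errs)

err = world_model_error(rng.normal(size=dz), rng.normal(size=(T, da)))
print(err)
```

Evaluating this quantity on both expert action sequences and planner-produced action sequences yields the train-test error gap discussed above.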
When using a world model to conduct planning in real-world settings, fast inference is crucial for actively interacting with the environment. On all three tasks, we find that GBP with Adversarial World Modeling is able to match or come near the best performing world model when planning with CEM, in over an order of magnitude less wall clock time. We compare wall clock times across world models and planning procedures for PushT in Figure 3. The planning efficiency results for PointMaze and Wall can be found in Section B.7.
Learning world models from sensory data. Learning-based dynamics models have become central to control and decision making, offering a data-driven alternative to classical approaches that rely on first principles modeling (goldstein1950classical; Schmidt2009DistillingFNA; macchelli2009port). Early work focused on modeling dynamics in low-dimensional state-space (deisenroth2011pilco; lenz2015deepmpc; henaff2017model; Sharma2019DynamicsAwareUD), while more recent methods learn directly from high-dimensional sensory inputs such as images. Pixel-space prediction methods (Finn2016UnsupervisedLFA; Kaiser2019ModelBasedRL) have shown success in applications such as human motion prediction (Finn2016UnsupervisedLFA), robotic manipulation (Finn2016DeepVF; agrawal2016learning; zhang2019solar), and solving Atari games (Kaiser2019ModelBasedRL), but they remain computationally expensive due to the cost of image reconstructions. To address this, alternative approaches learn a compact latent representation where dynamics are modeled (Karl2016DeepVB; hafner2019learning; Shi2022RoboCraftLT; karypidis2024dino). These models are typically supervised either by decoding latent predictions to match ground truth observations (Edwards2018ImitatingLPA; Zhang2021DeformableLOA; bounou2021online; Hu2022ModelBasedILA; Akan2022StretchBEVSFA; hafner2019learning), or by using prediction objectives that operate directly in latent space, such as those in joint-embedding prediction architectures (JEPAs) (lecun2022path; Bardes2024RevisitingFPA; Drozdov2024VideoRLA; Guan2024WorldMFA; zhou2025dinowmworldmodelspretrained). Our work builds upon this latter category of world models and specifically leverages the DINOv2-based latent world models introduced in zhou2025dinowmworldmodelspretrained. 
However, unlike prior work that primarily targets improving general representation quality or prediction accuracy, we focus on enhancing the trainability of world models to improve the convergence and reliability of gradient-based planning.
Planning with world models. Planning with world models is challenging due to the non-linearity and non-convexity of the objective. Search-based methods such as CEM (rubinstein2004cross) and MPPI (williams2017model) are widely used in this context (Williams2017InformationTMA; Nagabandi2019DeepDMA; hafner2019learning; Zhan2021ModelBasedOPA; zhou2025dinowmworldmodelspretrained). These methods explore the action space effectively, helping to escape from local minima, but typically scale poorly in high-dimensional settings due to their sampling-based nature. In contrast, gradient-based methods offer a more scalable alternative by exploiting the differentiability of the world model to optimize actions directly via backpropagation. Despite their efficiency, these methods suffer from local minima in highly non-smooth loss landscapes (Bharadhwaj2020ModelPredictiveCVA; Xu2022AcceleratedPLA; Chen2022BenchmarkingDOA; Wang2023SoftZooASA), and gradient optimization can induce adversarial action sequences that exploit model inaccuracies (Schiewer2024ExploringTLA; Jackson2024PolicyGuidedDA). zhou2025dinowmworldmodelspretrained have observed that GBP is particularly brittle when used with world models built on pre-trained visual embeddings, such as DINOv2 (oquab2024dinov2learningrobustvisual), often underperforming compared to CEM. To address these challenges, several stabilizing techniques have been proposed. For instance, random-sampling shooting helps mitigate adversarial trajectories by injecting noise in the action sequence and exploring a broader set of actions during trajectory optimization (nagabandi2018neural), and Zhang2025StateAwarePOA introduce adversarial attacks on learned policies to make them robust to environmental perturbations by selectively perturbing state inputs at inference time. In contrast, we apply perturbation directly to latent states and latent actions during world model training. 
florence2022implicit add gradient penalties when training an implicit policy function to improve its smoothness and stabilize optimization, but their method does not involve training or using a world model. Other approaches aim to use a hybrid method that combines search and gradient steps to balance global exploration and local refinement (Bharadhwaj2020ModelPredictiveCVA). In our work, we modify the world-model training procedure itself to improve GBP stability. In particular, through our Adversarial World Modeling approach, we enhance the robustness of the world model to perturbed states and actions, producing more stable and informative gradients that prevent adversarial action sequences at test time.
Train-test gap in world models. A key challenge when planning with learned world models is the mismatch between the training objective and the planning objective (lambert2020objective). During training, world models are typically optimized to minimize one-step prediction or reconstruction error on trajectories collected from expert demonstrations or behavioral policies. At test time, however, the same models are used inside a planner to optimize multi-step action sequences. As a result, the objectives at training and test time are inherently different, inducing a distribution shift between trajectories seen during training and those encountered during planning. This mismatch can cause planners to drive the model into out-of-distribution regions of the state space, where prediction errors compound over time and the model becomes unreliable for long-horizon optimization (Ajay2018AugmentingPSA; Ke2019LearningDMA; Zhu2023DiffusionMFA). A common strategy to address this train-test gap is dataset aggregation (ross2011reduction), which expands the training distribution by rolling out action trajectories generated by the planning algorithm and adding them to the training set (Talvitie2014ModelRFA; nagabandi2018neural). Unlike these approaches, which typically apply this technique directly in the environment's low-dimensional state space, our approach uses dataset aggregation in the context of high-dimensional latent world models, where training occurs in latent space rather than directly on states. Through our Online World Modeling approach, we explicitly close the train-test gap for gradient-based planning by using the planner itself to generate off-distribution trajectories and correcting them with simulator feedback.
In this work, we introduced Online World Modeling and Adversarial World Modeling as techniques for addressing the train-test gap that arises when world models trained on next-state prediction are used for iterative gradient-based planning. Across our experiments, these methods substantially improve the reliability of GBP and, in some settings, allow it to match or outperform sampling-based planners such as CEM. By narrowing this gap, our results suggest that gradient-based planning can be a practical alternative for planning with world models, particularly in settings where computational efficiency is critical. An important direction for future work is to evaluate these methods on real-world systems. Adversarial training may additionally improve a world model’s robustness to environmental adversaries or stochasticity. More broadly, world models offer a natural advantage over policy-based reinforcement learning in long-horizon decision making. We believe our methods are especially well-suited to multi-timescale or hierarchical world models, where long-horizon planning is enabled by improving planning stability at different levels of abstraction.
Compute resources used in this work were provided by the Modal and NVIDIA Academic Grants. Micah Goldblum was supported by the Google Cyber NYC Award.
This task, introduced by pusht, uses an agent interacting with a T-shaped block to guide both the agent and the block from a randomly initialized state to a feasible goal state within 25 steps. We use the dataset of 18500 trajectories given in zhou2025dinowmworldmodelspretrained, in which the green anchor serves purely as a visual reference. We draw a goal state from one of the noisy expert trajectories, 25 steps from the starting state.
In this task, introduced by pointmaze, a force-actuated ball that can move in the $x, y$ Cartesian directions must reach a target goal within a maze. We use the dataset of 2000 random trajectories present in zhou2025dinowmworldmodelspretrained, with a goal state chosen 25 steps from the starting state.
This task introduced by DINO-WM (zhou2025dinowmworldmodelspretrained) features a 2D navigation environment with two rooms separated by a wall with a door. The agent’s task is to navigate from a randomized starting location in one room to a random goal state in the other room, passing through the door. We use the dataset of 1920 trajectories as provided in DINO-WM, with a goal state chosen 25 steps from the starting state.
In this task, introduced by zhang2024adaptigraph, a simulated Xarm must push roughly one hundred small particles into the goal configuration. We use the dataset of 1000 trajectories of 20 steps each provided in DINO-WM.
We reproduce the dataset statistics used to train the base world model for each environment from zhou2025dinowmworldmodelspretrained. We use the same datasets for our alternative world model architecture ablation in Section B.3.
We detail the cross-entropy method used in our planning experiments in Algorithm 4.
In Table 3, we list all shared hyperparameters used in training and planning.
We provide data quantity and synthetic data parameters for our Online and Adversarial World Modeling training setups in Table 5 and Table 4, respectively. In addition to maintaining perturbation radii for the visual latent and action embeddings, we use a distinct radius for the proprioceptive embeddings. We empirically find that the scales of the visual and proprioceptive embeddings are incompatible and semantically distinct, thereby necessitating independent perturbation. Throughout all of our experiments, we set the perturbation radii of the action embedding and proprioceptive embedding identically for simplicity.
To facilitate progress towards the goal in gradient-based planning, we introduce an alternate loss function: Weighted Goal Loss (WGL). Instead of the standard goal loss function that only minimizes the $\ell_2$-distance between the final latent state produced by planning actions and the goal latent state, WGL encourages intermediate latent states to also be close to the goal latent state. Formally,
$$\mathcal{L}_{\text{WGL}} = \sum_{i=2}^{H+1} w_i \left\lVert \hat{z}_i - z_g \right\rVert_2^2,$$
where the sequence of normalized weights $\{w_i\}_{i=2}^{H+1}$ is a hyperparameter choice. Empirically, we find that using this objective for gradient-based planning either maintains or improves planning performance. For PointMaze and Wall, we found that exponentially upweighting later states in the planning horizon improved planning performance, so we set $w_i = 2^i$. For PushT, we found that exponentially upweighting earlier states improved planning performance, so we set $w_i = (1/2)^i$. We leave the optimal selection of this sequence of weights as future work.
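A small sketch of WGL with normalized exponential weights, illustrating the two weighting regimes; the latents and dimensions are made up for illustration, and since the weights are normalized, the exact index offset of the exponent is immaterial.

```python
import numpy as np

def weighted_goal_loss(z_hats, z_goal, base=2.0):
    """WGL over predicted latents z_hat_2 ... z_hat_{H+1}.
    base > 1 upweights later states (PointMaze/Wall-style, w_i = 2^i);
    base < 1 upweights earlier states (PushT-style, w_i = (1/2)^i)."""
    w = np.array([base ** i for i in range(1, len(z_hats) + 1)])
    w = w / w.sum()                                   # normalized weights
    dists = np.array([np.sum((z - z_goal) ** 2) for z in z_hats])
    return float(np.dot(w, dists))

z_goal = np.zeros(3)
z_hats = [np.ones(3) * (1.0 - t / 4) for t in range(4)]  # latents approaching the goal
late = weighted_goal_loss(z_hats, z_goal, base=2.0)      # upweight later states
early = weighted_goal_loss(z_hats, z_goal, base=0.5)     # upweight earlier states
print(late, early)  # late < early, since later states are closer to the goal
```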
Motivated by the hypothesis that the optimization landscape is rugged (see Figure 2 for some evidence of this), we train an initialization network $g_\theta : \mathcal{Z}\times\mathcal{Z}\to\mathcal{A}^T$, $g_\theta(z_1, z_g) = \{\hat{a}_t\}$, to initialize a sequence of actions for gradient-based planning.
We provide details on training the initialization network $g_\theta$ in Algorithm 5. We train $g_\theta$ on a single epoch over the trajectories in the task's training dataset.
We show results of including the initialization network in GBP for each task in Table 6. Comparing to Table 1, we see that for both GD and Adam, the initialization network performs comparably to random initialization only in the PushT environment.
Table 1: Planning Results. We evaluate the planning performance of our finetuned world models against DINO-WM (Zhou et al., 2025) on 3 tasks in terms of success rate (%) using both open-loop and model predictive control (MPC) procedures. For each task, we perform gradient-based planning using both stochastic gradient descent (GD) and Adam (Kingma & Ba, 2014), and search-based planning using the cross-entropy method (CEM).
| | PushT | | | PointMaze | | | Wall | | |
|---|---|---|---|---|---|---|---|---|---|
| Method | GD | Adam | CEM | GD | Adam | CEM | GD | Adam | CEM |
| DINO-WM | 38 | 54 | 78 | 12 | 24 | 90 | 2 | 10 | 74∗ |
| + MPC | 56 | 76 | 92 | 42 | 68 | 90 | 12 | 80 | 82 |
| Online WM | 34 | 52 | 90 | 20 | 14 | 62 | 16 | 18 | 54∗ |
| + MPC | 50 | 76 | 92 | 54 | 88 | 96 | 38 | 80 | 90 |
| Adversarial WM | 56 | 82 | 94 | 32 | 70 | 88 | 32 | 34 | 30∗ |
| + MPC | 66 | 92 | 92 | 50 | 94 | 98 | 14 | 94 | 94 |
Table 2: Trajectory datasets used to pretrain the base DINO-WM and IRIS world models.
| Environment | H | Frameskip | Dataset Size | Trajectory Length |
|---|---|---|---|---|
| Push-T | 3 | 5 | 18500 | 100-300 |
| PointMaze | 3 | 5 | 2000 | 100 |
| Wall | 1 | 5 | 1920 | 50 |
| Rope | 1 | 1 | 1000 | 5 |
| Granular | 1 | 1 | 1000 | 5 |
Table 3: (a) Finetuning Parameters
| Name | Value |
|---|---|
| Image size | 224 |
| Optimizer | AdamW |
| Predictor LR | 1e-5 |
Table 3: (b) Open-Loop Planning
| Name | GD | Adam |
|---|---|---|
| Opt. steps | 300 | 300 |
| LR | 1.0 | 0.3 |
Table 4: Training parameters for Adversarial World Modeling as reported in Table 1.
| Environment | # Rollouts | Batch Size | GPU | Epochs | ϵ_visual | ϵ_proprio | ϵ_action |
|---|---|---|---|---|---|---|---|
| PushT | 20000 (all) | 16 | 8x B200 | 1 | 0.05 | 0.02 | 0.02 |
| PointMaze | 2000 (all) | 16 | 1x B200 | 1 | 0.20 | 0.08 | 0.08 |
| Wall | 1920 (all) | 48 | 1x B200 | 2 | 0.20 | 0.08 | 0.08 |
Table 6: For both gradient descent (GD) and Adam (Ad), we evaluate initializing the actions for gradient-based planning (GBP) from the initialization network (IN) instead of a normal distribution.
| | PushT | | PointMaze | | Wall | |
|---|---|---|---|---|---|---|
| Method | GD+IN | Ad+IN | GD+IN | Ad+IN | GD+IN | Ad+IN |
| DINO-WM | 44 | 62 | 16 | 14 | 4 | 12 |
| + MPC | 60 | 84 | 40 | 54 | 6 | 32 |
| Online WM | 56 | 66 | 8 | 28 | 10 | 18 |
| + MPC | 52 | 82 | 40 | 46 | 2 | 22 |
| Adversarial WM | 74 | 90 | 22 | 36 | 18 | 24 |
| + MPC | 74 | 90 | 44 | 56 | 24 | 48 |
Table 7: Planning performance measured with Chamfer distance (lower is better) on two robotic manipulation tasks: Rope and Granular.
| | Rope | | Granular | |
|---|---|---|---|---|
| Method | GD | CEM | GD | CEM |
| DINO-WM | 1.73 | 0.93 | 0.30 | 0.22 |
| Adversarial WM | 0.93 | 0.82 | 0.24 | 0.28 |
Table 8: Planning results in terms of success rate using the IRIS (Micheli et al., 2023) architecture on the Wall task.
| Method | GD | CEM |
|---|---|---|
| IRIS | 0 | 4 |
| IRIS + Online WM | 0 | 0 |
| IRIS + Adversarial WM | 8 | 6 |
Table 9: (a) Long-Horizon GBP
| Method | PushT | PointMaze |
|---|---|---|
| DINO-WM | 16 | 70 |
| Online WM | 16 | 96 |
| Adversarial WM | 26 | 88 |
Table 9: (b) MPPI and GradCEM on PushT
| Method | MPPI | GradCEM |
|---|---|---|
| DINO-WM | 2 | 78 |
| Online WM | 2 | 74 |
| Adversarial WM | 2 | 84 |
Table 10: Wall clock time (in seconds) of rolling out 25 steps with each environment simulator compared to the DINO-WM architecture.
| | PushT | PointMaze | Wall |
|---|---|---|---|
| Simulator | 0.959 | 0.717 | 4.465 |
| DINO-WM | 0.029 | 0.029 | 0.029 |
Table 11: FGSM vs. K-step PGD adversarial training. Both Open-Loop and MPC (Closed-Loop) planning use the Adam optimizer with the same parameters as the main experiments.
| Method | Backward Passes | Min/Epoch | Open-Loop | MPC | Min/Epoch | Open-Loop | MPC |
|---|---|---|---|---|---|---|---|
| FGSM | 2 | 120 | 70 | 94 | 14 | 34 | 94 |
| 2-Step PGD | 3 | 165 | 80 | 96 | 20 | 8 | 90 |
| 3-Step PGD | 4 | 201 | 78 | 94 | 24 | 14 | 94 |
An overview of our two proposed methods. When planning with a world model, actions may result in trajectories that lie outside the distribution of expert trajectories on which the world model was trained, leading to inaccurate world modeling. Online World Modeling finetunes a pretrained world model by using the simulator to correct trajectories produced via gradient-based planning, leading to accurate world modeling beyond the expert trajectory distribution. Adversarial World Modeling finetunes a world model on perturbations of actions and expert trajectories, promoting robustness and smoothing the world model’s input gradients.
Optimization landscape of DINO-WM (zhou2025dinowmworldmodelspretrained) before and after finetuning with our Adversarial World Modeling objective on the Push-T task. Adversarial World Modeling yields a smoother landscape with a broader basin around the optimum. Visualization details in Appendix C.
Difference in World Model Error between expert and planning trajectories on PushT.
Task environments: PushT, PointMaze, Wall, Rope, and Granular.
Planning efficiency of DINO-WM, Online WM, and Adversarial WM using both GBP methods and CEM on the PointMaze task.
$$ s_{t+1}=h(s_{t},a_{t}),\quad\text{ for all $t$}, $$ \tag{S2.E1}
$$ \min_{\theta}\mathbb{E}_{(o_{t},a_{t},o_{t+1})\sim\mathcal{T}}\lVert f_{\theta}(\Phi_{\mu}(o_{t}),a_{t})-\Phi_{\mu}(o_{t+1})\rVert_{2}^{2}. $$ \tag{S2.E3}
$$ \{\hat{a}_{t}^{*}\}_{t=1}^{H}=\operatorname*{arg\,min}_{\{\hat{a}_{t}\}}\lVert\hat{z}_{H+1}-z_{\text{goal}}\rVert^{2}_{2} $$ \tag{S2.E4}
$$ \hat{z}_{2}=f_{\theta}(z_{1},\hat{a}_{1}),\quad\hat{z}_{t+1}=f_{\theta}(\hat{z}_{t},\hat{a}_{t})\quad\text{for}\quad t>1. $$ \tag{S2.E5}
$$ \tau^{\prime}=(z_{1},\hat{a}_{1},z_{2}^{\prime},\hat{a}_{2},\dots,z^{\prime}_{H+1}), $$ \tag{S2.E6}
$$ \delta^{(k+1)}=\Pi_{\lVert\delta\rVert_{\infty}\leq\epsilon}\left(\delta^{(k)}+\alpha\cdot\nabla_{x}\mathcal{L}(f_{\theta}(x+\delta^{(k)}),y)\right) $$ \tag{A4.E10}
Planning Computational Efficiency
When using a world model to conduct planning in real-world settings, fast inference is crucial for actively interacting with the environment. On all three tasks, we find that GBP with Adversarial World Modeling matches or approaches the best-performing world model planned with CEM, in over an order of magnitude less wall-clock time. We compare wall-clock times across world models and planning procedures for PushT in Figure 3. The planning efficiency results for PointMaze and Wall can be found in Section B.7.
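For concreteness, the open-loop gradient-based planning loop whose wall-clock time we measure can be sketched as below. This is a minimal sketch, not the released implementation: `predictor` stands in for the world model f_θ, and the interface is our assumption.

```python
import torch


def gbp_open_loop(predictor, z1, z_goal, horizon, action_dim, steps=300, lr=0.3):
    """Optimize an action sequence so the rolled-out latent reaches z_goal."""
    actions = torch.randn(horizon, action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        z = z1
        for t in range(horizon):  # autoregressive latent rollout
            z = predictor(z, actions[t])
        loss = ((z - z_goal) ** 2).sum()  # squared l2 distance to the goal latent
        loss.backward()  # backpropagate through the entire rollout
        opt.step()
    return actions.detach()
```

Because the whole rollout is differentiable, a single backward pass updates all H actions at once, which is the source of GBP's speed advantage over sampling-based search.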
Related Work
Learning world models from sensory data. Learning-based dynamics models have become central to control and decision making, offering a data-driven alternative to classical approaches that rely on first-principles modeling (Goldstein et al., 1950; Schmidt & Lipson, 2009; Macchelli et al., 2009). Early work focused on modeling dynamics in low-dimensional state spaces (Deisenroth & Rasmussen, 2011; Lenz et al., 2015; Henaff et al., 2017; Sharma et al., 2019), while more recent methods learn directly from high-dimensional sensory inputs such as images. Pixel-space prediction methods (Finn et al., 2016; Kaiser et al., 2019) have shown success in applications such as human motion prediction (Finn et al., 2016), robotic manipulation (Finn & Levine, 2016; Agrawal et al., 2016; Zhang et al., 2019), and solving Atari games (Kaiser et al., 2019), but they remain computationally expensive due to the cost of image reconstructions. To address this, alternative approaches learn a compact latent representation where dynamics are modeled (Karl et al., 2016; Hafner et al., 2019b; Shi et al., 2022; Karypidis et al., 2024). These models are typically supervised either by decoding latent predictions to match ground-truth observations (Edwards et al., 2018; Zhang et al., 2021; Bounou et al., 2021; Hu et al., 2022; Akan & Güney, 2022; Hafner et al., 2019b), or by using prediction objectives that operate directly in latent space, such as those in joint-embedding prediction architectures (JEPAs) (LeCun, 2022; Bardes et al., 2024; Drozdov et al., 2024; Guan et al., 2024; Zhou et al., 2025). Our work builds upon this latter category of world models and specifically leverages the DINOv2-based latent world models introduced in Zhou et al. (2025). However, unlike prior work that primarily targets improving general representation quality or prediction accuracy, we focus on enhancing the trainability of world models to improve the convergence and reliability of gradient-based planning.
Planning with world models. Planning with world models is challenging due to the non-linearity and non-convexity of the objective. Search-based methods such as CEM (Rubinstein & Kroese, 2004) and MPPI (Williams et al., 2017a) are widely used in this context (Williams et al., 2017b; Nagabandi et al., 2019; Hafner et al., 2019b; Zhan et al., 2021; Zhou et al., 2025). These methods explore the action space effectively, helping to escape from local minima, but typically scale poorly in high-dimensional settings due to their sampling-based nature. In contrast, gradient-based methods offer a more scalable alternative by exploiting the differentiability of the world model to optimize actions directly via backpropagation. Despite their efficiency, these methods suffer from local minima in highly non-smooth loss landscapes (Bharadhwaj et al., 2020a; Xu et al., 2022; Chen et al., 2022; Wang et al., 2023), and gradient optimization can induce adversarial action sequences that exploit model inaccuracies (Schiewer et al., 2024; Jackson et al., 2024). Zhou et al. (2025) have observed that GBP is particularly brittle when used with world models built on pre-trained visual embeddings, such as DINOv2 (Oquab et al., 2024), often underperforming compared to CEM. To address these challenges, several stabilizing techniques have been proposed. For instance, random-sampling shooting helps mitigate adversarial trajectories by injecting noise in the action sequence and exploring a broader set of actions during trajectory optimization (Nagabandi et al., 2018), and Zhang et al. (2025) introduce adversarial attacks on learned policies to make them robust to environmental perturbations by selectively perturbing state inputs at inference time. In contrast, we apply perturbation directly to latent states and latent actions during world model training. Florence et al. (2022) add gradient penalties when training an implicit policy function to improve its smoothness and stabilize optimization, but their method does not involve training or using a world model. Other approaches aim to use a hybrid method that combines search and gradient steps to balance global exploration and local refinement (Bharadhwaj et al., 2020a). In our work, we modify the world-model training procedure itself to improve GBP stability. In particular, through our Adversarial World Modeling approach, we enhance the robustness of the world model to perturbed states and actions, producing more stable and informative gradients that prevent adversarial action sequences at test time.
Train-test gap in world models. A key challenge when planning with learned world models is the mismatch between the training objective and the planning objective (Lambert et al., 2020). In fact, during training, world models are typically optimized to minimize one-step prediction or reconstruction error on trajectories collected from expert demonstrations or behavioral policies. At test time, however, the same models are used inside a planner to optimize multi-step action sequences. As a result, the objectives at training and test times are inherently different, inducing a distribution shift between trajectories seen during training and those encountered during planning. This mismatch can cause planners to drive the model into out-of-distribution regions of the state space, where prediction errors compound over time and the model becomes unreliable for long-horizon optimization (Ajay et al., 2018; Ke et al., 2019; Zhu et al., 2023). A common strategy to address this train-test gap is dataset-aggregation (Ross et al., 2011), which expands the training distribution by rolling out action trajectories generated by the planning algorithm and adding them to the training set (Talvitie, 2014; Nagabandi et al., 2018). However, unlike these approaches which typically apply this technique directly in the environment's low-dimensional state space, our approach uses dataset-aggregation in the context of high-dimensional latent world models, where training occurs in latent space rather than directly on states. Through our Online World Modeling approach, we explicitly close the train-test gap for gradient-based planning by using the planner itself to generate off-distribution trajectories and correcting them with simulator feedback.
Conclusion
In this work, we introduced Online World Modeling and Adversarial World Modeling as techniques for addressing the train-test gap that arises when world models trained on next-state prediction are used for iterative gradient-based planning. Across our experiments, these methods substantially improve the reliability of GBP and, in some settings, allow it to match or outperform sampling-based planners such as CEM. By narrowing this gap, our results suggest that gradient-based planning can be a practical alternative for planning with world models, particularly in settings where computational efficiency is critical. An important direction for future work is to evaluate these methods on real-world systems. Adversarial training may additionally improve a world model's robustness to environmental adversaries or stochasticity. More broadly, world models offer a natural advantage over policy-based reinforcement learning in long-horizon decision making. We believe our methods are especially well-suited to multi-timescale or hierarchical world models, where long-horizon planning is enabled by improving planning stability at different levels of abstraction.
Acknowledgments
Compute resources used in this work were provided by the Modal and NVIDIA Academic Grants. Micah Goldblum was supported by the Google Cyber NYC Award.
Experimental Details
Task Details
PushT: This task, introduced by Chi et al. (2024), uses an agent interacting with a T-shaped block to guide both the agent and the block from a randomly initialized state to a feasible goal state within 25 steps. We use the dataset of 18500 trajectories given in Zhou et al. (2025), in which the green anchor serves purely as a visual reference. We draw a goal state from one of the noisy expert trajectories at 25 steps from the starting state.
PointMaze: In this task, introduced by Fu et al. (2021), a force-actuated ball that can move in the x and y Cartesian directions must reach a target goal within a maze. We use the dataset of 2000 random trajectories present in Zhou et al. (2025), with a goal state chosen 25 steps from the starting state.
Wall: This task introduced by DINO-WM (Zhou et al., 2025) features a 2D navigation environment with two rooms separated by a wall with a door. The agent's task is to navigate from a randomized starting location in one room to a random goal state in the other room, passing through the door. We use the dataset of 1920 trajectories as provided in DINO-WM, with a goal state chosen 25 steps from the starting state.
Granular: In this task, introduced by Zhang et al. (2024), a simulated xArm must push around one hundred small particles into the goal configuration. We use the dataset of 1000 trajectories of 20 steps each provided in DINO-WM.
We reproduce the dataset statistics used to train the base world model for each environment from Zhou et al. (2025). We use the same datasets for our alternative world model architecture ablation in Section B.3.
Table 2: Trajectory datasets used to pretrain the base DINO-WM and IRIS world models.
CEM Algorithm
We detail the cross-entropy method used in our planning experiments in Algorithm 4.
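As a rough illustration (not a verbatim transcription of our Algorithm 4), CEM iteratively refits a Gaussian over action sequences to the lowest-cost samples. Here `rollout_cost`, a callable scoring one action sequence under the world model, is a hypothetical stand-in:

```python
import numpy as np


def cem_plan(rollout_cost, horizon, action_dim, n_samples=100,
             n_elites=10, n_iters=30, seed=0):
    """Cross-entropy method over action sequences of shape (horizon, action_dim)."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        samples = mean + std * rng.standard_normal((n_samples, horizon, action_dim))
        costs = np.array([rollout_cost(s) for s in samples])
        # Keep the lowest-cost candidates and refit the sampling distribution.
        elites = samples[np.argsort(costs)[:n_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```

Note that every iteration requires `n_samples` full world-model rollouts, which is the main driver of CEM's wall-clock cost relative to GBP.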
Finetuning and Planning Hyperparameters
In Table 3, we list all shared hyperparameters used in training and planning.
We provide data quantity and synthetic data parameters for our Online and Adversarial World Modeling training setups in Table 5 and Table 4, respectively. In addition to maintaining perturbation radii for the visual latent and action embeddings, we use a distinct radius for the proprioceptive embeddings. We empirically find that the scales of the visual and proprioceptive embeddings are incompatible and semantically distinct, thereby necessitating independent perturbation. Throughout all of our experiments, we set the perturbation radii of the action embedding and proprioceptive embedding identically for simplicity.
Weighted Goal Loss
To facilitate progress towards the goal in gradient-based planning, we introduce an alternate loss function: Weighted Goal Loss (WGL). Instead of the standard goal loss function that only minimizes the ℓ2-distance between the final latent state produced by planning actions and the goal latent state,

$$ \{\hat{a}_{t}^{*}\}_{t=1}^{H}=\operatorname*{arg\,min}_{\{\hat{a}_{t}\}}\lVert\hat{z}_{H+1}-z_{\text{goal}}\rVert_{2}^{2}, $$
WGL encourages intermediate latent states to also be close to the goal latent state. Formally,
$$ \{\hat{a}_{t}^{*}\}_{t=1}^{H}=\operatorname*{arg\,min}_{\{\hat{a}_{t}\}}\sum_{i=2}^{H+1}w_{i}\lVert\hat{z}_{i}-z_{\text{goal}}\rVert_{2}^{2}, $$
where the sequence of normalized weights {w_i}_{i=2}^{H+1} is a hyperparameter choice. Empirically, we find that using this objective for gradient-based planning either maintains or improves planning performance. For PointMaze and Wall, we found that exponentially upweighting later states in the planning horizon improved planning performance, so we set w_i = 2^i. For PushT, we found that exponentially upweighting earlier states improved planning performance, so we set w_i = (1/2)^i. We leave the optimal selection of this sequence of weights to future work.
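A minimal sketch of WGL under the notation above; normalizing the weights to sum to one is our assumption:

```python
import numpy as np


def wgl_weights(horizon, base=2.0):
    # w_i proportional to base**i for latent states i = 2, ..., H+1,
    # normalized to sum to 1 (normalization convention is an assumption).
    w = base ** np.arange(2, horizon + 2, dtype=float)
    return w / w.sum()


def weighted_goal_loss(latents, z_goal, weights):
    """latents: predicted latent states z_2 ... z_{H+1}, shape (H, D)."""
    sq_dists = ((latents - z_goal) ** 2).sum(axis=-1)  # per-state squared l2
    return float((weights * sq_dists).sum())
```

Setting `base=2.0` upweights later states (our PointMaze/Wall choice), while `base=0.5` upweights earlier ones (our PushT choice).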
Additional Experiment Results
Initialization Network
Motivated by the hypothesis that the optimization landscape is rugged (see Figure 2 for some evidence of this), we train an initialization network g_θ : Z × Z → A^T, g_θ(z_1, z_g) = {â_t}, to initialize the sequence of actions for gradient-based planning. We show results of including the initialization network in GBP for each task in Table 6. Comparing to Table 1, we see that for both GD and Adam, the initialization network performs comparably to a random initialization only in the PushT environment.
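A hypothetical sketch of such an initialization network; the MLP architecture and sizes are our assumptions, not the trained model:

```python
import torch
import torch.nn as nn


class InitNetwork(nn.Module):
    """Maps (initial latent, goal latent) to a length-T action sequence."""

    def __init__(self, latent_dim, action_dim, horizon, hidden=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon * action_dim),
        )

    def forward(self, z1, z_goal):
        # Concatenate current and goal latents; reshape into an action sequence.
        out = self.net(torch.cat([z1, z_goal], dim=-1))
        return out.view(-1, self.horizon, self.action_dim)
```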
Robotic Manipulation Tasks
We evaluate Adversarial World Modeling on two robotic manipulation tasks: Rope and Granular. Planning results for both tasks can be found in Table 7. To measure the accuracy of planned actions, we evaluate the Chamfer distance between the goal set of keypoints and the predicted set of keypoints.
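For reference, a standard symmetric Chamfer distance between keypoint sets can be computed as below; the exact squaring and averaging conventions used in our evaluation may differ:

```python
import numpy as np


def chamfer_distance(a, b):
    """a: (N, D) predicted keypoints, b: (M, D) goal keypoints."""
    # Pairwise squared distances between the two point sets.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    # Average nearest-neighbor distance in each direction, summed.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```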
Different World Model Architecture
We ablate the use of the DINO-WM architecture by evaluating planning performance with the IRIS (Micheli et al., 2023) architecture. Specifically, IRIS uses a VQ-VAE (van den Oord et al., 2018) for both the encoder and decoder, and a standard decoder-only Transformer (Vaswani et al., 2017). We find that even with a learned encoder, Adversarial World Modeling improves GBP performance and even CEM performance. Planning success rates of the IRIS architecture for the Wall task are reported in Table 8.
Table 8: Planning results in terms of success rate using the IRIS (Micheli et al., 2023) architecture on the Wall Task.
Long Horizon Planning
We evaluate GBP over a longer horizon in Table 9a. We use Adam in the MPC setting for each of these runs, setting a goal state 50 timesteps into the future drawn from an expert trajectory, a planning horizon of 50 steps, and 20 MPC iterations where we take a single action at each iteration. The dataset of held-out validation trajectories for the Wall environment does not contain expert trajectories of 50 timesteps, so we omit it from our evaluations. In comparison, our results in Table 1 use a goal state drawn 25 timesteps in the future and a planning horizon of 25 steps. We find that on the longer horizon, Adversarial World Modeling outperforms DINO-WM on PushT and both Adversarial and Online World Modeling outperform DINO-WM on PointMaze.
Table 9: Performance for (a) long-horizon GBP and (b) the MPPI and GradCEM algorithms.
Additional Planning Algorithms
Additionally, we evaluate both the MPPI (Williams et al., 2017c) and GradCEM (Bharadhwaj et al., 2020b) algorithms under MPC on the PushT task in Table 9b. MPPI is an online, receding-horizon controller that samples and evaluates perturbed action sequences, executes the first action of the lowest-cost trajectory, and then replans from the updated state at each timestep.
GradCEM refines the candidate sequences used to update the estimated action distribution with gradient descent to provide a more accurate estimate of the true distribution's parameters. We see that Adversarial World Modeling outperforms DINO-WM with GradCEM. Additionally, GradCEM exhibits slightly lower performance than vanilla CEM. We hypothesize this is due to the memory requirements of gradient descent necessitating reducing the number of candidate sequences by a factor of 6 compared to vanilla CEM, leading to reduced accuracy in estimating the true action distribution.
For MPPI, we use 5 samples at each MPC iteration, with 100 MPC steps. For GradCEM, we use 50 samples, 30 CEM steps, and 2 Adam steps per CEM step with an LR of 0.3, and we take 10 MPC steps.
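A minimal sketch of the GradCEM inner loop with the hyperparameters above; `cost_fn`, a differentiable batched rollout cost, is a hypothetical stand-in for the world-model planning objective:

```python
import torch


def gradcem_plan(cost_fn, horizon, action_dim, n_samples=50,
                 cem_steps=30, adam_steps=2, lr=0.3):
    """CEM whose candidates are refined by a few Adam steps before elite selection."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    n_elites = max(1, n_samples // 10)  # elite fraction is an assumption
    for _ in range(cem_steps):
        cand = mean + std * torch.randn(n_samples, horizon, action_dim)
        cand = cand.requires_grad_(True)
        opt = torch.optim.Adam([cand], lr=lr)
        for _ in range(adam_steps):  # gradient refinement of the candidates
            opt.zero_grad()
            cost_fn(cand).sum().backward()
            opt.step()
        with torch.no_grad():
            elite = cand[cost_fn(cand).argsort()[:n_elites]]
            mean, std = elite.mean(0), elite.std(0) + 1e-6
    return mean
```

The refinement step is why GradCEM must hold gradients for every candidate in memory, which forces the smaller sample count discussed above.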
Additional Train-Test Gap Results
We present additional results for the difference in World Model Error between training and planning for the PointMaze and Wall tasks in Figure 6. For both tasks, our methods have lower error during planning compared to training except for Online World Modeling on PointMaze, which is inconclusive due to the low magnitude of world model error. Planning actions are obtained after 300 steps of GBP with GD on 50 rollouts using the initial and goal state from a training trajectory.

Figure 6: Difference in World Model Error between expert trajectories and planning trajectories on (a) PointMaze and (b) Wall. Larger positive numbers indicate better performance on the actions seen during planning.
Planning Computational Efficiency
For PointMaze and Wall, we compare the planning efficiency of DINO-WM and our two approaches across planning methodologies in Figures 7 and 8, respectively. All planning is performed with MPC.
Rollout Inference Time
To understand the additional cost of using the environment simulator in Online World Modeling, we record the wall clock time of rolling out 25 steps with the DINO-WM architecture and each environment simulator in Table 10. We see that in all environments, the simulator takes longer to roll out than the world model. We also note that the simulator for all 3 tasks is deterministic in terms of reproducing the training trajectories from their actions.
Table 10: Wall clock time (in seconds) of rolling out 25 steps with each environment simulator compared to the DINO-WM architecture.
Visualizing the Optimization Landscape
We visualize the loss landscape of the DINO World Model before and after applying our Adversarial World Modeling objective. We perform a grid search over the subspace spanned by:

- â_GBP-Pretrained: the actions obtained by gradient-based planning on the original DINO World Model with 300 optimization steps of Adam with LR = 1e-3, from a fixed initialization a_init;
- a_GT: the ground-truth actions from the expert demonstrator.

We define the axes as α = â_GBP-Pretrained − a_GT and β = â_GBP-Adversarial − a_GT, and compute the loss surface over a 50 × 50 grid spanning α, β ∈ [−1.25, 1.25].
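The grid evaluation can be sketched as below; `plan_loss`, mapping an action sequence to the planning objective, is a hypothetical stand-in:

```python
import numpy as np


def loss_surface(plan_loss, a_gt, alpha_dir, beta_dir, radius=1.25, n=50):
    """Evaluate plan_loss over an n x n grid in the plane spanned by two directions."""
    coords = np.linspace(-radius, radius, n)
    surface = np.empty((n, n))
    for i, u in enumerate(coords):
        for j, v in enumerate(coords):
            # Point in action space: ground truth plus grid offsets along each axis.
            surface[i, j] = plan_loss(a_gt + u * alpha_dir + v * beta_dir)
    return surface
```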
Adversarial World Modeling: Design Decisions
Fast Gradient Sign Method (FGSM) vs. Projected Gradient Descent (PGD)
Projected Gradient Descent (PGD) has been used as an iterative method for generating adversarial perturbations (Madry et al., 2018). At each step, PGD takes a gradient ascent step and projects the result onto the space of allowed perturbations (a ball of radius ϵ around the input). The projection (Π) is typically performed via clipping or scaling. Formally,
$$ \delta^{(k+1)}=\Pi_{\lVert\delta\rVert_{\infty}\leq\epsilon}\left(\delta^{(k)}+\alpha\cdot\nabla_{x}\mathcal{L}(f_{\theta}(x+\delta^{(k)}),y)\right) $$
However, PGD is computationally expensive to use for adversarial training, as it requires an additional backward pass for each iteration. If one uses a single step, replaces the gradient by its sign, and uses step size α = ϵ, this recovers the well-known Fast Gradient Sign Method (FGSM) update (Goodfellow et al., 2014):
$$ \delta=\epsilon\cdot\operatorname{sign}\left(\nabla_{x}\mathcal{L}(f_{\theta}(x),y)\right) $$
In Wong et al. (2020), the authors demonstrate that initializing δ in the ℓ∞-ball with radius ϵ and performing FGSM adversarial training on these perturbations substantially improves robustness to PGD attacks and matches the performance of PGD-based training. We leverage this observation to perform cheap adversarial training that requires only 2× the backward passes of traditional supervised learning. In comparison, K-step PGD requires K+1 backward passes per batch (3× the supervised baseline for K=2 and 4× for K=3). In Table 11, we show that 2- and 3-step PGD do not consistently outperform FGSM, despite requiring a much larger training budget.
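A minimal sketch of one FGSM adversarial training step in this style (random start in the ϵ-ball, one signed-gradient step with step size ϵ, then a supervised update on the perturbed input); the model and loss here are generic stand-ins, not our world-model objective:

```python
import torch


def fgsm_adv_step(model, loss_fn, x, y, opt, eps):
    # Random initialization inside the l_inf ball of radius eps.
    delta = (2 * torch.rand_like(x) - 1) * eps
    delta.requires_grad_(True)
    loss_fn(model(x + delta), y).backward()  # backward pass 1: attack gradient
    with torch.no_grad():
        # One signed-gradient step, projected back into the eps-ball.
        delta = (delta + eps * delta.grad.sign()).clamp(-eps, eps)
    # Backward pass 2: supervised update on the adversarially perturbed input.
    opt.zero_grad()
    loss = loss_fn(model(x + delta), y)
    loss.backward()
    opt.step()
    return loss.item()
```

This matches the 2-backward-passes-per-batch accounting above: one pass to form the perturbation and one for the parameter update.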
Scaling Factor (λ) and Perturbation Radii (ϵ) Ablations
To assess the robustness of Adversarial World Modeling to the scaling factor and perturbation radius hyperparameters, we conduct an ablation study varying these two factors, shown in Figure 9. We evaluate λ_a, λ_z ∈ [0.0, 0.02, 0.05, 0.20, 0.50, 1.0]² and either fix ϵ_a, ϵ_p, ϵ_z to the standard deviation of the first minibatch ("Fixed") or recompute it for every minibatch ("Adaptive"). We observe no consistent improvement or degradation across any value of λ_a, for 0 ≤ λ_z ≤ 0.5, or between the "Fixed" and "Adaptive" perturbation radii. We note that setting the visual scaling factor λ_z too high (e.g., 0.5, 1.0) can significantly degrade performance. We hypothesize that excessively large perturbations distort the semantic content of the visual latent state, pushing it outside the range of semantically equivalent representations.
Table 11: FGSM vs. K-step PGD adversarial training. Both Open-Loop and MPC (Closed-Loop) planning use the Adam optimizer with the same parameters as the main experiments.

Figure 9: Success rate of closed-loop MPC planning using Adam on an Adversarial World Model trained with scaling factors λ_a, λ_z and perturbation radii ϵ_a, ϵ_z on the Wall environment. We find that 0 ≤ λ_z, λ_a ≤ 0.2 are stable for either 'Fixed' or 'Adaptive' perturbation radii.
Trajectory Visualization
We include visualizations of planning trajectories for DINO-WM, Online World Modeling, and Adversarial World Modeling to further study their success and failure modes. Visualizations for PushT and Wall can be found in Figures 10 and 11 respectively.

(a) We see that DINO-WM is more likely to enter states outside of the training distribution, and so the decoder is not able to reconstruct the state accurately. This is not the case with Online World Modeling, but it still fails to reach the goal state. Adversarial World Modeling successfully completes the task.

(b) Again we notice the failure of DINO-WM's decoder to reconstruct states it encounters during planning; this is not the case with Online World Modeling and Adversarial World Modeling, which both complete the task successfully.
Figure 10: Trajectory visualizations of the PushT task. We plot the expert trajectory to reach the goal state, alongside both the simulator states and decoded latent states for DINO-WM, Online World Modeling, and Adversarial World Modeling.
(a) In this challenging example, all three world models enter states through planning that their respective decoders cannot reconstruct, but only Online World Modeling is able to complete the task successfully.
(b) In this example, we see that DINO-WM predicts that it successfully completed the task according to its reconstructed last latent state, but the simulator indicates the true position to be off of the goal state. Online and Adversarial World Modeling correct for this and successfully complete the task.
| Method | PushT GD | PushT Adam | PushT CEM | PointMaze GD | PointMaze Adam | PointMaze CEM | Wall GD | Wall Adam | Wall CEM |
|---|---|---|---|---|---|---|---|---|---|
| DINO-WM | 38 | 54 | 78 | 12 | 24 | 90 | 2 | 10 | 74∗ |
| + MPC | 56 | 76 | 92 | 42 | 68 | 90 | 12 | 80 | 82 |
| OnlineWM | 34 | 52 | 90 | 20 | 14 | 62 | 16 | 18 | 54∗ |
| + MPC | 50 | 76 | 92 | 54 | 88 | 96 | 38 | 80 | 90 |
| AdversarialWM | 56 | 82 | 94 | 32 | 70 | 88 | 32 | 34 | 30∗ |
| + MPC | 66 | 92 | 92 | 50 | 94 | 98 | 14 | 94 | 94 |
| Environment | H | Frameskip | Dataset Size | Trajectory Length |
|---|---|---|---|---|
| Push-T | 3 | 5 | 18500 | 100-300 |
| PointMaze | 3 | 5 | 2000 | 100 |
| Wall | 1 | 5 | 1920 | 50 |
| Rope | 1 | 1 | 1000 | 5 |
| Granular | 1 | 1 | 1000 | 5 |
| Name | Value |
|---|---|
| Image size | 224 |
| Optimizer | AdamW |
| Predictor LR | 1e-5 |

(a) Finetuning Parameters

| Name | GD | Adam |
|---|---|---|
| Opt. steps | 300 | 300 |
| LR | 1.0 | 0.3 |

(b) Open-Loop Planning

| Name | GD | Adam |
|---|---|---|
| MPC steps | 10 | 10 |
| Opt. steps | 100 | 100 |
| LR | 1 | 0.2 |

(c) MPC Parameters
| Environment | # Rollouts | Batch Size | GPU | Epochs | ϵ visual | ϵ proprio | ϵ action |
|---|---|---|---|---|---|---|---|
| PushT | 20000 (all) | 16 | 8x B200 | 1 | 0.05 | 0.02 | 0.02 |
| PointMaze | 2000 (all) | 16 | 1x B200 | 1 | 0.2 | 0.08 | 0.08 |
| Wall | 1920 (all) | 48 | 1x B200 | 2 | 0.2 | 0.08 | 0.08 |
| Environment | # Rollouts | Batch Size | GPU | Epochs |
|---|---|---|---|---|
| PushT | 6000 | 32 | 4x B200 | 1 |
| PointMaze | 500 | 32 | 4x B200 | 1 |
| Wall | 1920 (all) | 80 | 4x B200 | 1 |
| PushT | PushT | PointMaze | PointMaze | Wall | Wall | |
|---|---|---|---|---|---|---|
| GD+IN | Ad+IN | GD+IN | Ad+IN | GD+IN | Ad+IN | |
| DINO-WM | 44 60 | 62 84 | 16 40 | 14 54 | 4 6 | 12 32 |
| + MPC | 56 | 8 | 28 46 | |||
| OnlineWM + MPC | 66 | 10 | 18 | |||
| 52 | 82 | 40 | 2 | 22 | ||
| AdversarialWM | 74 | 90 | 22 | 36 | 18 | 24 |
| + MPC | 74 | 90 | 44 | 56 | 24 | 48 |
| Rope | Rope | Granular | Granular | |
|---|---|---|---|---|
| GD | CEM | GD | CEM | |
| DINO-WM | 1.73 | 0.93 | 0.30 | 0.22 |
| AdversarialWM | 0.93 | 0.82 | 0.24 | 0.28 |
| GD | CEM | |
|---|---|---|
| IRIS | 0 | 4 |
| IRIS + OnlineWM | 0 | 0 |
| IRIS + AdversarialWM | 8 | 6 |
| PushT | PointMaze | |
|---|---|---|
| DINO-WM | 16 | 70 |
| OnlineWM | 16 | 96 |
| AdversarialWM | 26 | 88 |
| PushT | PointMaze | Wall | |
|---|---|---|---|
| Simulator | 0.959 | 0.717 | 4.465 |
| DINO-WM | 0.029 | 0.029 | 0.029 |
| Method | Backward Passes | PointMaze Min/Epoch | PointMaze Open-Loop | PointMaze MPC | Wall Min/Epoch | Wall Open-Loop | Wall MPC |
|---|---|---|---|---|---|---|---|
| FGSM | 2 | 120 | 70 | 94 | 14 | 34 | 94 |
| 2-Step PGD | 3 | 165 | 80 | 96 | 20 | 8 | 90 |
| 3-Step PGD | 4 | 201 | 78 | 94 | 24 | 14 | 94 |
World models paired with model predictive control (MPC) can be trained offline on large-scale datasets of expert trajectories and enable generalization to a wide range of planning tasks at inference time. Compared to traditional MPC procedures, which rely on slow search algorithms or on iteratively solving optimization problems exactly, gradient-based planning offers a computationally efficient alternative. However, the performance of gradient-based planning has thus far lagged behind that of other approaches. In this paper, we propose improved methods for training world models that enable efficient gradient-based planning. We begin with the observation that although a world model is trained on a next-state prediction objective, it is used at test-time to instead estimate a sequence of actions. The goal of our work is to close this train-test gap. To that end, we propose train-time data synthesis techniques that enable significantly improved gradient-based planning with existing world models. At test time, our approach outperforms or matches the classical gradient-free cross-entropy method (CEM) across a variety of object manipulation and navigation tasks in 10% of the time budget.
github.com/nimitkalra/robust-world-model-planning
In robotic tasks, anticipating how the actions of an agent affect the state of its environment is fundamental for both prediction (Finn2016UnsupervisedLFA) and planning (mohanan2018survey; kavraki2002probabilistic). Classical approaches derive models of the environment evolution analytically from first principles, relying on prior knowledge of the environment, the agent, and any uncertainty (goldstein1950classical; siciliano2009robotics; spong2020robot). In contrast, learning-based methods infer such models directly from data, enabling them to capture complex dynamics and thus improve generalization and robustness to uncertainty (sutton1998reinforcement; schrittwieser2020mastering; lecun2022path).
World models (ha2018world), in particular, have emerged as a powerful paradigm. Given the current state and an action, the world model predicts the resulting next state. These models can be learned either from exact state information (sutton1991dyna) or from high-dimensional sensory inputs such as images (hafner2023mastering). The latter setup is especially compelling as it enables perception, prediction, and control directly from raw images by leveraging pre-trained visual representations, and removes the need to measure precise environment states, which is difficult in practice (assran2023self; Bardes2024RevisitingFPA). Recently, world models and their predictive capabilities have been leveraged for planning, enabling agents to solve a variety of tasks (hafner2019dream; hafner2019learning; schrittwieser2020mastering; hafner2023mastering; zhou2025dinowmworldmodelspretrained). A model of the dynamics is learned offline, while the planning task is defined at inference as a constrained optimization problem: given the current state, find a sequence of actions that results in a state as close as possible to the target state. This inference-time optimization provides an effective alternative to reinforcement learning approaches (sutton1998reinforcement), which often suffer from poor sample efficiency.
World models are compatible with many model-based planning algorithms. Traditional methods such as DDP (mayne1966second) and iLQR (li2004iterative) rely on iteratively solving exact optimization problems derived from linear and quadratic approximations of the dynamics around a nominal trajectory. While highly effective in low-dimensional settings, these methods become impractical for large-scale world models, where solving the resulting optimization problem is computationally intractable. As an alternative, search-based methods such as the Cross Entropy Method (CEM) (rubinstein2004cross) and Model Predictive Path Integral control (MPPI) (williams2017model) have been widely adopted as gradient-free alternatives and have proven effective in practice. However, they are computationally intensive as they require iteratively sampling candidate solutions and performing world model rollouts to evaluate each one, a procedure that scales poorly in high-dimensional spaces. Gradient-based methods (sv2023gradient), in contrast, avoid the limitations of sampling by directly exploiting the differentiability of world models to optimize actions end-to-end. These methods eliminate the costly rollouts required by search-based approaches, thus scaling more efficiently in high-dimensional spaces. Despite this promise, gradient-based approaches have thus far seen limited empirical success.
This procedure suffers from a fundamental train-test gap. World models are typically trained using a next-state prediction objective on datasets of expert trajectories. At test time, however, they are used to optimize a planning objective over sequences of actions. We argue that this mismatch underlies the poor empirical performance of gradient-based planning (GBP), and we offer two hypotheses to explain why. (1) During planning, the intermediate sequence of actions explored by gradient descent drives the world model into states that were not encountered during training. In these out-of-distribution states, model errors compound, making the world model unreliable as a surrogate for optimization. (2) The action-level optimization landscape induced by the world model may be difficult to traverse, containing many poor local minima or flat regions, which hinders effective gradient-based optimization.
In this work, we address both of these challenges by proposing two algorithms: Online World Modeling and Adversarial World Modeling. Both expand the region of familiar latent states by continuously adding new trajectories to the dataset and finetuning the world model on them. To manage the distribution shift between offline expert trajectories and predicted trajectories from planning, Online World Modeling uses the environment simulator to correct states along a trajectory produced by performing GBP. Finetuning on these corrected trajectories ensures that the world model performs sufficiently well when GBP enters regimes of latent state space outside of the expert trajectory distribution. To overcome the difficulties of optimizing over a non-smooth loss surface during GBP, Adversarial World Modeling perturbs expert trajectories in the direction that maximizes the world model’s loss. Adversarial finetuning smooths the induced action loss landscape, making it easier to optimize via gradient-based planning. We provide a visual depiction of both methods in Figure 1.
We show that finetuning world models with these algorithms leads to substantial improvements in the performance of gradient-based planning (GBP). Applying Adversarial World Modeling to a pretrained world model enables gradient-based planning to match or exceed the performance of search-based CEM on a variety of robotic object manipulation and navigation tasks. Importantly, this performance is achieved with a 10× reduction in computation time compared to CEM, underscoring the practicality of our approach for real-world planning. Additionally, we empirically demonstrate that Adversarial World Modeling smooths the planning loss landscape, and that both methods can reverse the train-test gap in world model error.
World models learn environment dynamics by predicting the state resulting from taking an action in the current state. Then, at test time, the learned world model enables planning by simulating future trajectories and guiding action optimization. Formally, a world model approximates the (potentially unknown) dynamics function $h\colon \mathcal{S}\times\mathcal{A}\to\mathcal{S}$, where $\mathcal{S}$ denotes the state space and $\mathcal{A}$ the action space. The environment evolves according to
$$
s_{t+1} = h(s_t, a_t),
$$
where $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$ denote the state and action at time $t$, respectively.
In practice, we typically do not have access to the exact state of the environment; instead, we only receive partial observations of it, such as images. In order for a world model to efficiently learn in the high-dimensional observation space $\mathcal{O}$, an embedding function $\Phi_\mu\colon \mathcal{O}\to\mathcal{Z}$ is employed to map observations to a lower-dimensional latent space $\mathcal{Z}$. Then, given an embedding function $\Phi_\mu$, our goal is to learn a latent world model $f_\theta\colon \mathcal{Z}\times\mathcal{A}\to\mathcal{Z}$ such that
$$
f_\theta(z_t, a_t) \approx z_{t+1}, \qquad \text{where } z_t = \Phi_\mu(o_t).
$$
The choice of Φμ\Phi_{\mu} directly affects the expressivity of the latent world model. In this work, we use a fixed encoder pretrained with self-supervised learning that yields rich feature representations out of the box.
To train a latent world model, we sample triplets of the form $(o_t, a_t, o_{t+1})$ from an offline dataset of trajectories $\mathcal{T}$ and minimize the $\ell_2$ distance between the true next latent state $z_{t+1} = \Phi_\mu(o_{t+1})$ and the predicted next latent state $\hat{z}_{t+1}$. This procedure is represented by the following teacher-forcing objective:
$$
\min_\theta \; \mathbb{E}_{(o_t, a_t, o_{t+1}) \sim \mathcal{T}} \left[ \lVert f_\theta(\Phi_\mu(o_t), a_t) - \Phi_\mu(o_{t+1}) \rVert_2^2 \right].
$$
Notably, we only minimize this objective with respect to the world model’s parameters $\theta$, not those of the potentially large embedding function.
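To make the training setup concrete, the sketch below minimizes the teacher-forcing objective by plain gradient descent in a self-contained toy: the frozen encoder is folded away (we work directly on latent triplets), the world model is a linear map `W`, and `step` stands in for the environment dynamics. All names here are illustrative, not the paper's code; with a neural model, the hand-written gradient would be replaced by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da = 4, 2

# Toy latent dynamics: z_{t+1} = W_true [z; a]. The encoder Phi is folded
# in, so the dataset consists of latent triplets (z_t, a_t, z_{t+1}).
W_true = 0.3 * rng.normal(size=(dz, dz + da))
def step(z, a):
    return W_true @ np.concatenate([z, a])

# Offline dataset of triplets sampled from "expert" trajectories.
batch = [(rng.normal(size=dz), rng.normal(size=da)) for _ in range(64)]
batch = [(z, a, step(z, a)) for z, a in batch]

W = np.zeros((dz, dz + da))  # world model parameters theta

def tf_loss(W):
    # Teacher-forcing objective: mean squared next-state prediction error.
    return float(np.mean([np.sum((W @ np.concatenate([z, a]) - zn) ** 2)
                          for z, a, zn in batch]))

lr = 0.05
for _ in range(200):
    g = np.zeros_like(W)
    for z, a, zn in batch:
        x = np.concatenate([z, a])
        g += 2 * np.outer(W @ x - zn, x) / len(batch)  # d loss / d W
    W -= lr * g

print(tf_loss(W))  # near zero: the model fits next-state prediction
```

Only `W` is updated here, mirroring the fact that the (frozen) embedding function receives no gradient.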
During test time, we use a learned world model to optimize candidate action sequences for reaching a goal state. By recursively applying the world model over an action sequence starting from an initial latent state, we obtain a predicted latent goal state and therefore the distance to the true goal state in latent space. This allows us to find the optimal action sequence
$$
\{\hat{a}_t\}_{t=1}^{H} = \arg\min_{\{a_t\} \in \mathcal{A}^H} \lVert \hat{z}_{H+1} - z_{\text{goal}} \rVert_2^2,
$$
where $\hat{z}_{H+1}$ is produced by the recursive procedure
$$
\hat{z}_{t+1} = f_\theta(\hat{z}_t, a_t), \qquad \hat{z}_1 = \Phi_\mu(o_1).
$$
We use the function $\text{rollout}_f\colon \mathcal{Z}\times\mathcal{A}^H\to\mathcal{Z}^H$ to denote this recursive procedure.
Gradient-based planning (GBP) solves the planning objective (4) via gradient descent. Crucially, since the world model is differentiable, $\nabla_{\{\hat{a}_t\}} \hat{z}_{H+1} = \nabla_{\{\hat{a}_t\}} \text{rollout}_f(z_1, \{\hat{a}_t\})_{H+1}$ is well-defined. In contrast, the search-based CEM is gradient-free, but requires evaluating substantially more action sequences. We detail GBP in Algorithm 1 and CEM in Section A.2.
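The mechanics of GBP can be sketched on a toy linear world model $z_{t+1} = Az_t + Ba_t$, where the gradient of the rollout with respect to each action is available in closed form ($\partial \hat{z}_{H+1} / \partial \hat{a}_t = A^{H-1-t}B$); with a neural world model, the same quantity would come from backpropagating through the rollout. The names `A`, `B`, and `rollout` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da, H = 4, 2, 5

# Toy differentiable world model: a contracting rotation A and random B.
c, s = np.cos(0.7), np.sin(0.7)
A = 0.9 * np.array([[c, -s, 0, 0], [s, c, 0, 0],
                    [0, 0, c, -s], [0, 0, s, c]])
B = 0.5 * rng.normal(size=(dz, da))
z1, z_goal = rng.normal(size=dz), rng.normal(size=dz)

def rollout(z, actions):
    # Recursively apply the model over the action sequence.
    for a in actions:
        z = A @ z + B @ a
    return z

# Closed-form rollout Jacobians: d z_{H+1} / d a_t = A^{H-1-t} B.
grads = [np.linalg.matrix_power(A, H - 1 - t) @ B for t in range(H)]

# GBP: gradient descent on ||z_{H+1} - z_goal||^2 over the actions.
actions = np.zeros((H, da))
lr = 0.03
for _ in range(3000):
    resid = 2 * (rollout(z1, actions) - z_goal)
    for t in range(H):
        actions[t] -= lr * grads[t].T @ resid

print(np.linalg.norm(rollout(z1, actions) - z_goal))
```

Because the objective is induced entirely by the model, the same loop applied to a learned nonlinear model inherits whatever sharpness or local minima that model's landscape contains.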
As errors can propagate over long horizons, Model Predictive Control (MPC) is commonly used to repeatedly re-plan by optimizing an $H$-step action sequence but executing only the first $K \leq H$ actions before replanning from the updated state.
As the planning objective is induced entirely by the world model, the success of GBP hinges on (1) the model accurately predicting future states under any candidate action sequence, and (2) the stability of this differentiable optimization. We now present two finetuning methods designed to improve on these fronts.
During gradient-based planning, the action sequences being optimized are not constrained to lie within the distribution of behavior seen during training. World models are typically trained on fixed datasets of expert trajectories, whereas GBP selects actions solely to improve the planning objective, without regard to whether those actions resemble expert behavior. As a result, the optimization process often proposes action sequences that are out of distribution. Optimizing through learned models under such conditions is known to induce adversarial inputs (szegedy2013intriguing; goodfellow2014explaining). In our setting, these adversarial action sequences drive the world model into regions of the latent state space that were rarely or never observed during training, causing large prediction errors. Even when errors are initially small, they accumulate as the planner rolls the model forward, ultimately degrading long-horizon planning performance.
To address this issue, we propose Online World Modeling, which iteratively corrects the trajectories produced by GBP and finetunes the world model on the resulting rollouts. Rather than training solely on expert demonstrations, we repeatedly incorporate trajectories induced by the planner itself, thereby expanding the region of latent states that the world model can reliably predict.
First, we conduct GBP using the initial and goal latent states of an expert trajectory $\tau$, yielding a sequence of predicted actions $\{\hat{a}_t\}_{t=1}^{H}$. These actions might send the world model into regions of the latent space that lie outside of the training distribution. To adjust for this, we obtain a corrected trajectory: the actual sequence of states that would result from executing the action sequence $\{\hat{a}_t\}_{t=1}^{H}$ in the environment using the true dynamics simulator $h$. We add the corrected trajectory,
$$
\tau' = \left(o_1, \hat{a}_1, o'_2, \hat{a}_2, \ldots, \hat{a}_H, o'_{H+1}\right), \qquad \text{where } o'_{t+1} \text{ is the observation after executing } \hat{a}_t \text{ under } h,
$$
to the dataset that the world model trains with every time the dataset is updated. Re-training on these corrected trajectories expands the training distribution to cover the regions of latent space induced by gradient-based planning, mitigating compounding prediction errors during planning. We provide more detail in Algorithm 2 and illustrate the method in Figure 1.
This procedure is reminiscent of DAgger (Dataset Aggregation) (ross2011reduction), an online imitation learning method wherein a base policy network is iteratively trained on its own rollouts with the action predictions replaced by those from an expert policy. In a similar spirit, we invoke the ground-truth simulator as our expert world model that we imitate.
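The aggregation loop can be illustrated with a schematic toy: a nonlinear simulator `h`, a linear world model fit by least squares, and "planner" actions that leave the expert distribution. The simulator corrects the resulting next states, and refitting on the union reduces model error on planner-induced states. All names (`h`, `fit`, `error`) are illustrative stand-ins, not the paper's code, and the least-squares refit stands in for finetuning.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, da = 3, 2

# Nonlinear "simulator" h; the world model is a linear map fit to triplets.
M = 0.8 * rng.normal(size=(dz, dz + da))
def h(z, a):
    return np.tanh(M @ np.concatenate([z, a]))

def fit(dataset):
    # Least-squares fit of the world model to (z, a, z_next) triplets.
    X = np.stack([np.concatenate([z, a]) for z, a, _ in dataset])
    Y = np.stack([zn for _, _, zn in dataset])
    return np.linalg.lstsq(X, Y, rcond=None)[0].T

def error(W, triplets):
    return float(np.mean([np.sum((W @ np.concatenate([z, a]) - zn) ** 2)
                          for z, a, zn in triplets]))

# Expert trajectories stay in a narrow, near-linear regime.
expert = [(0.2 * rng.normal(size=dz), 0.2 * rng.normal(size=da))
          for _ in range(100)]
data = [(z, a, h(z, a)) for z, a in expert]
W = fit(data)

# "Planner" actions are out of distribution; the simulator corrects the
# resulting next states, and the corrected triplets are added back.
planner = [(rng.normal(size=dz), 2.0 * rng.normal(size=da))
           for _ in range(100)]
corrected = [(z, a, h(z, a)) for z, a in planner]

before = error(W, corrected)   # large: model never saw these regions
W = fit(data + corrected)      # refit on the aggregated dataset
after = error(W, corrected)
print(before, after)           # error on planner-induced states drops
```

The key mechanism is the same as in the paper's Algorithm 2: the simulator, playing the role of the DAgger expert, supplies the corrected targets for states the planner actually visits.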
Since world models are only trained on the next-state prediction objective, there is no particular reason for their input gradients to be well-behaved. Adversarial training has been shown to result in better behaved input gradients (mejia2019robust), consequently smoothing the input loss surface. Motivated by this observation, we propose an adversarial training objective that explicitly targets regions of the state-action space where the world model is expected to perform poorly. These adversarial samples may lie outside the expert trajectory distribution, which can expose the model to precisely the regions that matter for action optimization. We find that this procedure, which we call Adversarial World Modeling, does in fact smooth the loss surface of the planning objective (see Figure 2), improving the stability of action-sequence optimization.
Adversarial training improves model robustness by optimizing performance under worst-case perturbations (madry2019deeplearningmodelsresistant). An adversarial example is generated by applying a perturbation $\delta$ to an input that maximally increases the model’s loss. To train a world model on adversarial examples, we use the objective
$$
\min_\theta \; \mathbb{E}_{(z_t, a_t, z_{t+1})} \left[ \max_{\delta_z \in \mathcal{B}_z,\, \delta_a \in \mathcal{B}_a} \lVert f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \rVert_2^2 \right],
$$
where $\mathcal{B}_a = \{\delta_a : \lVert\delta_a\rVert_\infty \leq \epsilon_a\}$ and $\mathcal{B}_z = \{\delta_z : \lVert\delta_z\rVert_\infty \leq \epsilon_z\}$ constrain the magnitude of perturbations for given $\epsilon_a, \epsilon_z$. Training on these adversarially perturbed trajectories provides an alternative method to Online World Modeling for surfacing states that may be encountered during planning, without relying on GBP rollouts. This is a significant advantage in settings where simulation is expensive or infeasible.
We generate adversarial latent states using the Fast Gradient Sign Method (FGSM) (goodfellow2014explaining), which efficiently approximates the worst-case perturbations that maximize prediction error (fastbetterthanfree). Although stronger iterative attacks such as Projected Gradient Descent (PGD) can be used, we find that FGSM delivers comparable improvements in GBP performance while being significantly more computationally efficient (see Section D.1). This enables us to generate adversarial samples over entire large-scale offline imitation learning datasets.
For each state-action pair in a given minibatch, we look for small changes to the latent state or action that most increase the world model’s prediction error. Let $\epsilon_a, \epsilon_z$ denote the radii of the perturbations to the actions $\{a_t\}$ and latent states $\{z_t\}$, respectively. We compute gradients $\nabla_{\delta_a, \delta_z} \lVert f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \rVert_2^2$ with respect to the perturbations and take a signed gradient ascent step (i.e., in a direction that degrades the prediction) with step sizes $\alpha_a = 1.25\epsilon_a$ and $\alpha_z = 1.25\epsilon_z$. We clip the result so that each entry of the perturbation stays within the radius. This procedure corresponds to a single step of a PGD-style attack, producing perturbations that lie on the edge of the allowed region, where they are maximally challenging for the model. See Algorithm 3 for a detailed treatment.
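The perturbation step can be sketched on a toy linear world model, where the input gradient is available in closed form (a neural $f_\theta$ would supply it via backpropagation). The random initialization inside the ℓ∞-ball follows the FGSM variant discussed in Section D.1; the model `Wz`, `Wa` and the radii chosen from the data scale are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
nz, na = 4, 2  # latent and action dimensions

# Toy linear world model f(z, a) = Wz z + Wa a; its input gradient below
# stands in for backpropagation through a neural world model.
Wz = 0.4 * rng.normal(size=(nz, nz))
Wa = 0.4 * rng.normal(size=(nz, na))
z, a = rng.normal(size=nz), rng.normal(size=na)
z_next = Wz @ z + Wa @ a  # training target (exact, so unperturbed loss is 0)

def loss(dz, da):
    r = Wz @ (z + dz) + Wa @ (a + da) - z_next
    return float(r @ r)

# Radii from the data scale (cf. the 'Adaptive' choice: std of the
# minibatch); signed step sizes alpha = 1.25 * eps.
eps_z, eps_a = 0.1 * np.std(z), 0.1 * np.std(a)

# Random init in the l_inf ball, one signed ascent step, then clip.
dz = rng.uniform(-eps_z, eps_z, size=nz)
da = rng.uniform(-eps_a, eps_a, size=na)
resid = 2 * (Wz @ (z + dz) + Wa @ (a + da) - z_next)
dz = np.clip(dz + 1.25 * eps_z * np.sign(Wz.T @ resid), -eps_z, eps_z)
da = np.clip(da + 1.25 * eps_a * np.sign(Wa.T @ resid), -eps_a, eps_a)

# The perturbed pair raises the prediction error above its unperturbed value.
print(loss(np.zeros(nz), np.zeros(na)), loss(dz, da))
```

Because only one extra gradient computation is needed per minibatch, this keeps the 2× backward-pass budget discussed in Section D.1.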
To initialize the perturbation radii $\epsilon_a, \epsilon_z$, we use scaling factors $\lambda_a, \lambda_z$ and find that Adversarial World Modeling is robust for $0 \leq \lambda_a \leq 1$ and $0 \leq \lambda_z \leq 0.5$. Furthermore, we find that fixing $\epsilon_a, \epsilon_z$ to the standard deviation of the initial minibatch is stable across all experiments. Updating this estimate for each batch as in Algorithm 3 yields no consistent improvement in final planning performance. We further analyze design ablations in Appendix D.
We evaluate our methods by finetuning world models pretrained with the next-state prediction objective on three tasks: PushT, PointMaze, and Wall. For each task we measure the success rate of reaching a target configuration $o_{\text{goal}}$ from an initial configuration $o_1$. We report planning results with both open-loop planning and MPC in Table 1. In the open-loop setting, we run Algorithm 1 from $o_1$ once and evaluate the predicted action sequence. In the MPC setting, we run Algorithm 1 once for each MPC step (using $\Phi_\mu(o_1)$ as the initial latent state for the first MPC step), roll out the predicted actions $\{\hat{a}_t\}$ in the environment simulator to reach latent state $\hat{z}_{H+1}$, and set $\hat{z}_1 = \hat{z}_{H+1}$ for the next MPC iteration. We report all finetuning, planning, and optimization hyperparameters in Table 3.
We use DINO-WM (zhou2025dinowmworldmodelspretrained) as our initial world model for its strong performance with CEM across our chosen tasks. The embedding function $\Phi_\mu$ is taken to be the pre-trained DINOv2 encoder (oquab2024dinov2learningrobustvisual), and remains frozen while finetuning the transition model $f_\theta$, which is implemented using the ViT architecture (dosovitskiy2021imageworth16x16words). We additionally train a VQVAE decoder (oord2018neuraldiscreterepresentationlearning) to visualize latent states, though it plays no role in planning. To validate the broad applicability of our approach, we also study the use of the IRIS (micheli2023transformerssampleefficientworldmodels) world model architecture in Section B.3.
To initialize the action sequence for planning optimization, we evaluate both random sampling from a standard normal distribution and the use of an initialization network. Our initialization network $g_\theta\colon \mathcal{Z}\times\mathcal{Z}\to\mathcal{A}^T$ is trained such that $g_\theta(z_1, z_g) = \{\hat{a}_t\}_{t=1}^{T}$. We find that random initialization tends to outperform the initialization network, and we analyze its impact in depth in Section B.1.
During GBP, we set $\mathcal{L}_{\text{goal}}$ in Algorithm 1 to a weighted goal loss to obtain a gradient from each predicted state instead of only the last one. We find empirically that this objective generalizes to both navigation (e.g., PointMaze and Wall) and non-navigation tasks (e.g., PushT); i.e., on tasks with or without subgoal decomposability, it improves or matches the performance of the final-state loss. We provide the exact formulation and more details in Section A.4. We additionally evaluate using the Adam optimizer (Kingma2014AdamAM) during GBP. Although Adam improves performance significantly over GD for all world models in our experiments, we find that Adam alone does not scale performance to match or surpass CEM.
On all three tasks, our methods outperform DINO-WM when performing GBP with gradient descent, and either match or outperform it under the far more expensive CEM. In the open-loop setting, we achieve success-rate increases of +18% on PushT, +20% on PointMaze, and +30% on Wall. In the MPC setting, Adam GBP with Adversarial World Modeling outperforms CEM with DINO-WM on PointMaze and Wall and matches CEM on PushT.
While both Online World Modeling and Adversarial World Modeling bootstrap new data to improve the robustness of our world model during GBP, the distributions they induce are quite different. Whereas Online World Modeling anticipates and covers the distribution seen at planning time, Adversarial World Modeling exploits the current loss landscape of the world model to encourage local smoothness near expert trajectories. For all environments, we find Adversarial World Modeling outperforms Online World Modeling when using Adam to perform GBP.
To demonstrate the advantages of Adversarial World Modeling in more complex environments where the simulator may be very costly and the number of action dimensions is larger, we also evaluate planning performance on two robotic manipulation tasks in Section B.2.
Comparing the world model error between training trajectories and planning trajectories allows us to evaluate whether the world model will perform well during planning even if it is trained to convergence on expert trajectories. We evaluate world model error as the deviation between the world model’s predicted next latent state and the next latent state given by the environment simulator. Given an initial state $s_1$ (associated with $o_1$) and a sequence of actions $\{a_t\}$ (either from the training dataset or a planning procedure), the world model error at timestep $t$ is given by
$$
e_t = \lVert f_\theta(\Phi_\mu(o_t), a_t) - \Phi_\mu(o_{t+1}) \rVert_2,
$$
where $o_{t+1}$ is the observation produced by the simulator after executing $a_t$.
This error is averaged over all timesteps of a trajectory. If the difference in world model error between expert trajectories and planning trajectories is negative, then the world model will perform relatively worse on sequences of actions produced during planning. Figure 4 demonstrates that this is the case with DINO-WM, but not with Online World Modeling or Adversarial World Modeling, indicating a narrowing of the train-test gap. See Section B.6 for results for PointMaze and Wall.
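This gap can be illustrated on a toy setup: a world model that matches a nonlinear simulator only near its training regime shows low error on expert-like (small) actions and much higher error on planner-like (large) actions. Here `h`, `f`, and `wm_error` are illustrative stand-ins, with `f` taken to be the simulator's linearization around the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
nz, na = 3, 2

# Nonlinear simulator h; the world model f agrees with it only near the
# origin (it is the linearization of h), mimicking a model trained on a
# narrow expert distribution.
M = 0.5 * rng.normal(size=(nz, nz + na))
def h(z, a):
    return np.tanh(M @ np.concatenate([z, a]))
def f(z, a):
    return M @ np.concatenate([z, a])

def wm_error(z1, actions):
    # Mean per-step deviation between model prediction and simulator,
    # following the simulator's states along the trajectory.
    z, errs = z1, []
    for a in actions:
        z_next = h(z, a)
        errs.append(np.linalg.norm(f(z, a) - z_next))
        z = z_next
    return float(np.mean(errs))

z1 = 0.1 * rng.normal(size=nz)
expert_actions = [0.1 * rng.normal(size=na) for _ in range(20)]  # in-distribution
plan_actions = [2.0 * rng.normal(size=na) for _ in range(20)]    # planner-like

print(wm_error(z1, expert_actions), wm_error(z1, plan_actions))
```

A negative expert-minus-planning error difference in this sense is exactly the train-test gap the proposed finetuning methods aim to close.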
When using a world model to conduct planning in real-world settings, fast inference is crucial for actively interacting with the environment. On all three tasks, we find that GBP with Adversarial World Modeling is able to match or come near the best performing world model when planning with CEM, in over an order of magnitude less wall clock time. We compare wall clock times across world models and planning procedures for PushT in Figure 3. The planning efficiency results for PointMaze and Wall can be found in Section B.7.
Learning world models from sensory data. Learning-based dynamics models have become central to control and decision making, offering a data-driven alternative to classical approaches that rely on first principles modeling (goldstein1950classical; Schmidt2009DistillingFNA; macchelli2009port). Early work focused on modeling dynamics in low-dimensional state-space (deisenroth2011pilco; lenz2015deepmpc; henaff2017model; Sharma2019DynamicsAwareUD), while more recent methods learn directly from high-dimensional sensory inputs such as images. Pixel-space prediction methods (Finn2016UnsupervisedLFA; Kaiser2019ModelBasedRL) have shown success in applications such as human motion prediction (Finn2016UnsupervisedLFA), robotic manipulation (Finn2016DeepVF; agrawal2016learning; zhang2019solar), and solving Atari games (Kaiser2019ModelBasedRL), but they remain computationally expensive due to the cost of image reconstructions. To address this, alternative approaches learn a compact latent representation where dynamics are modeled (Karl2016DeepVB; hafner2019learning; Shi2022RoboCraftLT; karypidis2024dino). These models are typically supervised either by decoding latent predictions to match ground truth observations (Edwards2018ImitatingLPA; Zhang2021DeformableLOA; bounou2021online; Hu2022ModelBasedILA; Akan2022StretchBEVSFA; hafner2019learning), or by using prediction objectives that operate directly in latent space, such as those in joint-embedding prediction architectures (JEPAs) (lecun2022path; Bardes2024RevisitingFPA; Drozdov2024VideoRLA; Guan2024WorldMFA; zhou2025dinowmworldmodelspretrained). Our work builds upon this latter category of world models and specifically leverages the DINOv2-based latent world models introduced in zhou2025dinowmworldmodelspretrained. 
However, unlike prior work that primarily targets improving general representation quality or prediction accuracy, we focus on enhancing the trainability of world models to improve the convergence and reliability of gradient-based planning.
Planning with world models. Planning with world models is challenging due to the non-linearity and non-convexity of the objective. Search-based methods such as CEM (rubinstein2004cross) and MPPI (williams2017model) are widely used in this context (Williams2017InformationTMA; Nagabandi2019DeepDMA; hafner2019learning; Zhan2021ModelBasedOPA; zhou2025dinowmworldmodelspretrained). These methods explore the action space effectively, helping to escape from local minima, but typically scale poorly in high-dimensional settings due to their sampling-based nature. In contrast, gradient-based methods offer a more scalable alternative by exploiting the differentiability of the world model to optimize actions directly via backpropagation. Despite their efficiency, these methods suffer from local minima in highly non-smooth loss landscapes (Bharadhwaj2020ModelPredictiveCVA; Xu2022AcceleratedPLA; Chen2022BenchmarkingDOA; Wang2023SoftZooASA), and gradient optimization can induce adversarial action sequences that exploit model inaccuracies (Schiewer2024ExploringTLA; Jackson2024PolicyGuidedDA). zhou2025dinowmworldmodelspretrained have observed that GBP is particularly brittle when used with world models built on pre-trained visual embeddings, such as DINOv2 (oquab2024dinov2learningrobustvisual), often underperforming compared to CEM. To address these challenges, several stabilizing techniques have been proposed. For instance, random-sampling shooting helps mitigate adversarial trajectories by injecting noise in the action sequence and exploring a broader set of actions during trajectory optimization (nagabandi2018neural), and Zhang2025StateAwarePOA introduce adversarial attacks on learned policies to make them robust to environmental perturbations by selectively perturbing state inputs at inference time. In contrast, we apply perturbation directly to latent states and latent actions during world model training. 
florence2022implicit add gradient penalties when training an implicit policy function to improve its smoothness and stabilize optimization, but their method does not involve training or using a world model. Other approaches aim to use a hybrid method that combines search and gradient steps to balance global exploration and local refinement (Bharadhwaj2020ModelPredictiveCVA). In our work, we modify the world-model training procedure itself to improve GBP stability. In particular, through our Adversarial World Modeling approach, we enhance the robustness of the world model to perturbed states and actions, producing more stable and informative gradients that prevent adversarial action sequences at test time.
Train-test gap in world models. A key challenge when planning with learned world models is the mismatch between the training objective and the planning objective (lambert2020objective). During training, world models are typically optimized to minimize one-step prediction or reconstruction error on trajectories collected from expert demonstrations or behavioral policies. At test time, however, the same models are used inside a planner to optimize multi-step action sequences. As a result, the objectives at training and test time are inherently different, inducing a distribution shift between trajectories seen during training and those encountered during planning. This mismatch can cause planners to drive the model into out-of-distribution regions of the state space, where prediction errors compound over time and the model becomes unreliable for long-horizon optimization (Ajay2018AugmentingPSA; Ke2019LearningDMA; Zhu2023DiffusionMFA). A common strategy to address this train-test gap is dataset aggregation (ross2011reduction), which expands the training distribution by rolling out action trajectories generated by the planning algorithm and adding them to the training set (Talvitie2014ModelRFA; nagabandi2018neural). However, unlike these approaches, which typically apply this technique directly in the environment’s low-dimensional state space, we apply dataset aggregation in the context of high-dimensional latent world models, where training occurs in latent space rather than directly on states. Through our Online World Modeling approach, we explicitly close the train-test gap for gradient-based planning by using the planner itself to generate off-distribution trajectories and correcting them with simulator feedback.
In this work, we introduced Online World Modeling and Adversarial World Modeling as techniques for addressing the train-test gap that arises when world models trained on next-state prediction are used for iterative gradient-based planning. Across our experiments, these methods substantially improve the reliability of GBP and, in some settings, allow it to match or outperform sampling-based planners such as CEM. By narrowing this gap, our results suggest that gradient-based planning can be a practical alternative for planning with world models, particularly in settings where computational efficiency is critical. An important direction for future work is to evaluate these methods on real-world systems. Adversarial training may additionally improve a world model’s robustness to environmental adversaries or stochasticity. More broadly, world models offer a natural advantage over policy-based reinforcement learning in long-horizon decision making. We believe our methods are especially well-suited to multi-timescale or hierarchical world models, where long-horizon planning is enabled by improving planning stability at different levels of abstraction.
Compute resources used in this work were provided by the Modal and NVIDIA Academic Grants. Micah Goldblum was supported by the Google Cyber NYC Award.
PushT: This task, introduced by pusht, uses an agent interacting with a T-shaped block to guide both the agent and block from a randomly initialized state to a feasible goal state within 25 steps. We use the dataset of 18500 trajectories given in zhou2025dinowmworldmodelspretrained, in which the green anchor serves purely as a visual reference. We draw a goal state from one of the noisy expert trajectories at 25 steps from the starting state.
PointMaze: In this task, introduced by pointmaze, a force-actuated ball that can move in the $x, y$ Cartesian directions must reach a target goal within a maze. We use the dataset of 2000 random trajectories provided in zhou2025dinowmworldmodelspretrained, with a goal state chosen 25 steps from the starting state.
Wall: This task, introduced by DINO-WM (zhou2025dinowmworldmodelspretrained), features a 2D navigation environment with two rooms separated by a wall with a door. The agent’s task is to navigate from a randomized starting location in one room to a random goal state in the other room, passing through the door. We use the dataset of 1920 trajectories provided in DINO-WM, with a goal state chosen 25 steps from the starting state.
Granular: In this task, introduced by zhang2024adaptigraph, a simulated Xarm must push roughly one hundred small particles into the goal configuration. We use the dataset of 1000 trajectories of 20 steps each provided in DINO-WM.
We reproduce the dataset statistics used to train the base world model for each environment from zhou2025dinowmworldmodelspretrained. We use the same datasets for our alternative world model architecture ablation in Section B.3.
We detail the cross-entropy method used in our planning experiments in Algorithm 4.
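For illustration, the loop of Algorithm 4 can be sketched in runnable form. The diagonal covariance (in place of the full covariance $\Sigma_i$), the equal action and latent dimensionality, and the function names are simplifying assumptions of ours, not the exact setup used in our experiments.

```python
import numpy as np

def cem_plan(z1, z_goal, f, H, N=64, K=8, iters=30, seed=0):
    """Cross-entropy method planning sketch (cf. Algorithm 4): sample N
    action sequences, roll each out through the world model f, keep the
    K lowest-cost elites, and refit the sampling distribution to them."""
    rng = np.random.default_rng(seed)
    d = z1.shape[0]
    mu = np.zeros((H, d))                  # mean of the action distribution
    sigma = np.ones((H, d))                # per-dimension std (diagonal cov.)
    for _ in range(iters):
        A = mu + sigma * rng.standard_normal((N, H, d))   # candidate sequences
        costs = np.empty(N)
        for j in range(N):
            z = z1
            for t in range(H):
                z = f(z, A[j, t])          # latent rollout
            costs[j] = float(np.sum((z - z_goal) ** 2))
        elites = A[np.argsort(costs)[:K]]  # top-K elites
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6  # avoid premature collapse
    return mu
```

For example, with toy linear dynamics $f(z,a)=z+0.5a$, `cem_plan(np.zeros(2), np.ones(2), f, H=3)` returns a mean action sequence whose rollout lands near the goal.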
In Table 3, we list all shared hyperparameters used in training and planning.
We provide data quantity and synthetic data parameters for our Online and Adversarial World Modeling training setups in Table 5 and Table 4, respectively. In addition to maintaining perturbation radii for the visual latent and action embeddings, we use a distinct radius for the proprioceptive embeddings. We empirically find that the visual and proprioceptive embeddings differ in scale and are semantically distinct, necessitating independent perturbation radii. Throughout all of our experiments, we set the perturbation radii of the action embedding and proprioceptive embedding identically for simplicity.
To facilitate progress towards the goal in gradient-based planning, we introduce an alternate loss function: Weighted Goal Loss (WGL). Instead of the standard goal loss, which only minimizes the $\ell_{2}$-distance between the final latent state produced by the planned actions and the goal latent state, WGL encourages intermediate latent states to also be close to the goal latent state. Formally,

$$ \mathcal{L}_{\text{WGL}}=\sum_{i=2}^{H+1}w_{i}\lVert\hat{z}_{i}-z_{\text{goal}}\rVert_{2}^{2}, $$

where the sequence of normalized weights $\{w_{i}\}_{i=2}^{H+1}$ is a hyperparameter choice. Empirically, we find that using this objective for gradient-based planning either maintains or improves planning performance. For PointMaze and Wall, we found that exponentially upweighting later states in the planning horizon improved planning performance, so we set $w_{i}=2^{i}$. For PushT, we found that exponentially upweighting earlier states improved planning performance, so we set $w_{i}=\left(1/2\right)^{i}$. We leave the optimal selection of this sequence of weights as future work.
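A minimal NumPy sketch of the objective (the function name and array shapes are our own, for illustration):

```python
import numpy as np

def weighted_goal_loss(z_hat, z_goal, weights):
    """Weighted Goal Loss (WGL): a weighted sum of squared l2 distances
    between each predicted latent state and the goal latent state.

    z_hat   : (H, d) array of predicted latents z_hat_2 ... z_hat_{H+1}
    z_goal  : (d,) goal latent
    weights : (H,) weight sequence w_i; normalized internally
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                 # normalize the weights
    dists = np.sum((np.asarray(z_hat) - z_goal) ** 2, axis=1)  # per-step l2^2
    return float(w @ dists)
```

Placing all weight on the final state recovers the standard goal loss, while an exponential schedule such as `weights=[2**i for i in range(H)]` upweights later states as we do for PointMaze and Wall.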
Motivated by the hypothesis that the optimization landscape is rugged (see Figure 2 for some evidence of this), we train an initialization network $g_{\theta}:\mathcal{Z}\times\mathcal{Z}\to\mathcal{A}^{H}$, $g_{\theta}(z_{1},z_{g})=\{\hat{a}_{t}\}_{t=1}^{H}$, to initialize a sequence of actions for gradient-based planning.
We provide details on training the initialization network $g_{\theta}$ in Algorithm 5. We train $g_{\theta}$ for a single epoch over the trajectories in the task’s training dataset.
We show results of including the initialization network in GBP for each task in Table 6. Comparing to Table 1, we see that for both GD and Adam, the initialization network performs comparably to a random initialization only in the PushT environment.
We evaluate Adversarial World Modeling on two robotic manipulation tasks: Rope and Granular. Planning results for both tasks can be found in Table 7. To measure the accuracy of planned actions, we evaluate the Chamfer distance between the goal set of keypoints and the predicted set of keypoints.
We ablate the use of the DINO-WM architecture by evaluating planning performance with the IRIS (micheli2023transformerssampleefficientworldmodels) architecture. Specifically, IRIS uses a VQ-VAE (oord2018neuraldiscreterepresentationlearning) for both the encoder and decoder, and a standard decoder-only Transformer (NIPS2017_3f5ee243). We find that even with a learned encoder, Adversarial World Modeling improves GBP performance and even CEM performance. Planning success rates of the IRIS architecture for the Wall task are reported in Table 8.
We evaluate GBP over a longer horizon in Table 9(a). We use Adam in the MPC setting for each of these runs, setting a goal state 50 timesteps into the future drawn from an expert trajectory, a planning horizon of 50 steps, and 20 MPC iterations where we take a single action at each iteration. The dataset of held-out validation trajectories for the Wall environment does not contain expert trajectories of 50 timesteps, so we omit it from our evaluations. In comparison, our results in Table 1 use a goal state drawn 25 timesteps in the future and a planning horizon of 25 steps. We find that on the longer horizon, Adversarial World Modeling outperforms DINO-WM on PushT and both Adversarial and Online World Modeling outperform DINO-WM on PointMaze.
Additionally, we evaluate both the MPPI (torch_mppi) and GradCEM (gradcem) algorithms under MPC on the PushT task in Table 9(b). MPPI is an online, receding-horizon controller that samples and evaluates perturbed action sequences, executes the first action of the lowest-cost trajectory, and then replans from the updated state at each timestep.
GradCEM refines the candidate sequences used to update the estimated action distribution with gradient descent, providing a more accurate estimate of the true distribution’s parameters. We see that Adversarial World Modeling outperforms DINO-WM with GradCEM. Additionally, GradCEM exhibits slightly lower performance than vanilla CEM. We hypothesize this is because the memory requirements of gradient descent necessitate reducing the number of candidate sequences by a factor of 6 compared to vanilla CEM, leading to a less accurate estimate of the true action distribution.
For MPPI, we use 5 samples per MPC iteration, with 100 MPC steps. For GradCEM, we use 50 samples, 30 CEM steps, and 2 Adam steps per CEM step with an LR of 0.3, and we take 10 MPC steps.
We present additional results for the difference in World Model Error between training and planning for the PointMaze and Wall tasks in Figure 6. For both tasks, our methods have lower error during planning compared to training except for Online World Modeling on PointMaze, which is inconclusive due to the low magnitude of world model error. Planning actions are obtained after 300 steps of GBP with GD on 50 rollouts using the initial and goal state from a training trajectory.
For PointMaze and Wall, we compare the planning efficiency of DINO-WM and our two approaches across planning methodologies in Figures 7 and 8 respectively. All planning is performed with MPC.
To understand the additional cost of using the environment simulator in Online World Modeling, we record the wall-clock time of rolling out 25 steps with the DINO-WM architecture and each environment simulator in Table 10. We see that in all environments, the simulator takes longer to roll out than the world model. We also note that the simulator for all 3 tasks is deterministic in terms of reproducing the training trajectories from their actions.
We visualize the loss landscape of the DINO world model both before and after applying our Adversarial World Modeling objective. We perform a grid search over the subspace spanned by
$\hat{a}_{\text{GBP-Pretrained}}$: gradient-based planning on the original DINO world model with 300 optimization steps of Adam with LR = 1e-3. We set a fixed initialization $a_{\text{init}}$.
$a_{\text{GT}}$: the ground-truth actions from the expert demonstrator.
We define the axes as $\alpha=\hat{a}_{\text{GBP-Pretrained}}-a_{\text{GT}}$ and $\beta=\hat{a}_{\text{GBP-Adversarial}}-a_{\text{GT}}$, and compute the loss surface over a $50\times 50$ grid spanning $\alpha,\beta\in[-1.25,1.25]$.
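Concretely, the grid evaluation can be sketched as follows, treating $\alpha$ and $\beta$ as scalar coefficients along the two difference directions (function and argument names are illustrative, not from our codebase):

```python
import numpy as np

def loss_surface(loss_fn, center, dir_a, dir_b, extent=1.25, n=50):
    """Evaluate loss_fn on an n x n grid of action sequences
    center + alpha * dir_a + beta * dir_b, alpha, beta in [-extent, extent]."""
    coords = np.linspace(-extent, extent, n)
    surface = np.empty((n, n))
    for i, alpha in enumerate(coords):
        for j, beta in enumerate(coords):
            surface[i, j] = loss_fn(center + alpha * dir_a + beta * dir_b)
    return coords, surface
```

Here `center` would be $a_{\text{GT}}$ and `dir_a`, `dir_b` the two difference directions; `surface` can then be rendered as the contour plots in Figure 2.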
Projected Gradient Descent (PGD) has been used as an iterative method for generating adversarial perturbations (madry2019deeplearningmodelsresistant). At each step, PGD takes a gradient ascent step and projects the result onto the set of allowed perturbations (a ball of radius $\epsilon$ around the input). The projection ($\Pi$) is typically implemented via clipping or scaling. Formally,
However, PGD is computationally expensive for adversarial training, as it requires an additional backward pass per iteration. Using a single step, replacing the gradient by its sign, and setting the step size $\alpha=\epsilon$ recovers the well-known Fast Gradient Sign Method (FGSM) update (goodfellow2014explaining).
In fastbetterthanfree, the authors demonstrate that initializing $\delta$ uniformly in the $\ell_{\infty}$-ball of radius $\epsilon$ and performing FGSM adversarial training on these perturbations substantially improves robustness to PGD attacks and matches the performance of PGD-based training. We leverage this observation to perform cheap adversarial training that requires only $2\times$ the backward passes of standard supervised learning. In comparison, $K$-step PGD requires $K$ additional backward passes ($3\times$ for $K=2$ and $4\times$ for $K=3$). In Table 11, we show that 2- and 3-step PGD do not consistently outperform FGSM, despite requiring a much larger training budget.
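The random-init FGSM step can be sketched as a generic helper; the function name and the `grad_fn` callback (which returns the loss gradient with respect to the perturbation) are our own abstractions:

```python
import numpy as np

def fgsm_perturb(grad_fn, shape, eps, alpha=None, rng=None):
    """One FGSM step with random initialization (fastbetterthanfree):
    draw delta uniformly from the l_inf ball of radius eps, take a single
    signed gradient-ascent step of size alpha (default 1.25 * eps), and
    clip the result back into the eps-ball."""
    rng = np.random.default_rng(0) if rng is None else rng
    step = 1.25 * eps if alpha is None else alpha
    delta = rng.uniform(-eps, eps, size=shape)       # random init in the ball
    delta = delta + step * np.sign(grad_fn(delta))   # ascend the loss
    return np.clip(delta, -eps, eps)                 # project onto the ball
```

In Adversarial World Modeling (Algorithm 3), this perturbation is applied jointly to latent states and actions, with the separate radii $\epsilon_z$, $\epsilon_a$ described above.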
To assess the robustness of Adversarial World Modeling to the scaling factor and perturbation radius hyperparameters, we conduct an ablation study varying these two factors, shown in Figure 9. We evaluate $\lambda_{a},\lambda_{z}\in\{0.0,0.02,0.05,0.20,0.50,1.0\}^{2}$ and either fix $\epsilon_{a},\epsilon_{p},\epsilon_{z}$ to the standard deviation of the first minibatch (“Fixed”) or recompute them for every minibatch (“Adaptive”). We observe no consistent improvement or degradation across any value of $\lambda_{a}$, for $0\leq\lambda_{z}\leq 0.5$, or between the “Fixed” and “Adaptive” perturbation radii. We note that setting the visual scaling factor $\lambda_{z}$ too high (e.g., $0.5$, $1.0$) can significantly degrade performance. We hypothesize that excessively large perturbations distort the semantic content of the visual latent state, pushing it outside the range of semantically equivalent representations.
Table: S3.T1: Planning Results. We evaluate the planning performance of our finetuned world models against DINO-WM (zhou2025dinowmworldmodelspretrained) on 3 tasks in terms of success rate (%) using both open-loop and model predictive control (MPC) procedures. For each task, we perform gradient-based planning using both stochastic gradient descent (GD) and Adam (Kingma2014AdamAM), and search-based planning using the cross-entropy method (CEM).
| | PushT | PushT | PushT | PointMaze | PointMaze | PointMaze | Wall | Wall | Wall |
|---|---|---|---|---|---|---|---|---|---|
| | GD | Adam | CEM | GD | Adam | CEM | GD | Adam | CEM |
| DINO-WM | 38 | 54 | 78 | 12 | 24 | 90 | 2 | 10 | 74∗ |
| + MPC | 56 | 76 | 92 | 42 | 68 | 90 | 12 | 80 | 82 |
| Online WM | 34 | 52 | 90 | 20 | 14 | 62 | 16 | 18 | 54∗ |
| + MPC | 50 | 76 | 92 | 54 | 88 | 96 | 38 | 80 | 90 |
| Adversarial WM | 56 | 82 | 94 | 32 | 70 | 88 | 32 | 34 | 30∗ |
| + MPC | 66 | 92 | 92 | 50 | 94 | 98 | 14 | 94 | 94 |
Table: A1.T2: Trajectory datasets used to pretrain the base DINO-WM and IRIS world models.
| Environment | H | Frameskip | Dataset Size | Trajectory Length |
|---|---|---|---|---|
| Push-T | 3 | 5 | 18500 | 100-300 |
| PointMaze | 3 | 5 | 2000 | 100 |
| Wall | 1 | 5 | 1920 | 50 |
| Rope | 1 | 1 | 1000 | 5 |
| Granular | 1 | 1 | 1000 | 5 |
Table: A1.T3: (a) Finetuning Parameters
| Name | Value |
|---|---|
| Image size | 224 |
| Optimizer | AdamW |
| Predictor LR | 1e-5 |
Table: A1.T3.st2: (b) Open-Loop Planning
| Name | GD | Adam |
|---|---|---|
| Opt. steps | 300 | 300 |
| LR | 1.0 | 0.3 |
Table: A1.T4: Training parameters for Adversarial World Modeling as reported in Table 1.
| Environment | # Rollouts | Batch Size | GPU | Epochs | $\epsilon_{\text{visual}}$ | $\epsilon_{\text{proprio}}$ | $\epsilon_{\text{action}}$ |
|---|---|---|---|---|---|---|---|
| PushT | 20000 (all) | 16 | 8x B200 | 1 | 0.05 | 0.02 | 0.02 |
| PointMaze | 2000 (all) | 16 | 1x B200 | 1 | 0.20 | 0.08 | 0.08 |
| Wall | 1920 (all) | 48 | 1x B200 | 2 | 0.20 | 0.08 | 0.08 |
Table: A2.T6: For both gradient descent (GD) and Adam (Ad), we evaluate initializing the actions for gradient-based planning (GBP) from the initialization network (IN) instead of a normal distribution.
| | PushT | PushT | PointMaze | PointMaze | Wall | Wall |
|---|---|---|---|---|---|---|
| | GD+IN | Ad+IN | GD+IN | Ad+IN | GD+IN | Ad+IN |
| DINO-WM | 44 | 62 | 16 | 14 | 4 | 12 |
| + MPC | 60 | 84 | 40 | 54 | 6 | 32 |
| Online WM | 56 | 66 | 8 | 28 | 10 | 18 |
| + MPC | 52 | 82 | 40 | 46 | 2 | 22 |
| Adversarial WM | 74 | 90 | 22 | 36 | 18 | 24 |
| + MPC | 74 | 90 | 44 | 56 | 24 | 48 |
Table: A2.T7: Planning performance measured with Chamfer Distance (less is better) on two robotic manipulation tasks: Rope and Granular.
| | Rope | Rope | Granular | Granular |
|---|---|---|---|---|
| | GD | CEM | GD | CEM |
| DINO-WM | 1.73 | 0.93 | 0.30 | 0.22 |
| Adversarial WM | 0.93 | 0.82 | 0.24 | 0.28 |
Table: A2.T8: Planning results in terms of success rate using the IRIS (micheli2023transformerssampleefficientworldmodels) architecture on the Wall Task.
| | GD | CEM |
|---|---|---|
| IRIS | 0 | 4 |
| IRIS + Online WM | 0 | 0 |
| IRIS + Adversarial WM | 8 | 6 |
Table: A2.T9: (a) Long-Horizon GBP
| | PushT | PointMaze |
|---|---|---|
| DINO-WM | 16 | 70 |
| Online WM | 16 | 96 |
| Adversarial WM | 26 | 88 |
Table: A2.T9.st2: (b) MPPI and GradCEM on PushT
| | MPPI | GradCEM |
|---|---|---|
| DINO-WM | 2 | 78 |
| Online WM | 2 | 74 |
| Adversarial WM | 2 | 84 |
Table: A2.T10: Wall clock time (in seconds) of rolling out 25 steps with each environment simulator compared to the DINO-WM architecture.
| | PushT | PointMaze | Wall |
|---|---|---|---|
| Simulator | 0.959 | 0.717 | 4.465 |
| DINO-WM | 0.029 | 0.029 | 0.029 |
Table: A4.T11: Both Open-Loop and MPC (Closed-Loop) use the Adam optimizer with the same parameters as the main experiments.
| | Backward Passes | PointMaze | PointMaze | PointMaze | Wall | Wall | Wall |
|---|---|---|---|---|---|---|---|
| | | Min/Epoch | Open-Loop | MPC | Min/Epoch | Open-Loop | MPC |
| FGSM | 2 | 120 | 70 | 94 | 14 | 34 | 94 |
| 2-Step PGD | 3 | 165 | 80 | 96 | 20 | 8 | 90 |
| 3-Step PGD | 4 | 201 | 78 | 94 | 24 | 14 | 94 |
An overview of our two proposed methods. When planning with a world model, actions may result in trajectories that lie outside the distribution of expert trajectories on which the world model was trained, leading to inaccurate world modeling. Online World Modeling finetunes a pretrained world model by using the simulator to correct trajectories produced via gradient-based planning, leading to accurate world modeling beyond the expert trajectory distribution. Adversarial World Modeling finetunes a world model on perturbations of actions and expert trajectories, promoting robustness and smoothing the world model’s input gradients.
Optimization landscape of DINO-WM (zhou2025dinowmworldmodelspretrained) before and after finetuning with our Adversarial World Modeling objective on the Push-T task. Adversarial World Modeling yields a smoother landscape with a broader basin around the optimum. Visualization details in Appendix C.
Difference in World Model Error between expert and planning trajectories on PushT.
PushT, PointMaze, Wall, Rope, Granular
Planning efficiency of DINO-WM, Online WM, and Adversarial WM using both GBP methods and CEM on the PointMaze task.
Success rate of closed-loop MPC planning using Adam on an Adversarial World Model trained with scaling factors $\lambda_{a},\lambda_{z}$ and perturbation radii $\epsilon_{a},\epsilon_{z}$ on the Wall environment. We find that $0\leq\lambda_{z},\lambda_{a}\leq 0.2$ are stable for either “Fixed” or “Adaptive” perturbation radii.
(a) We see that DINO-WM is more likely to enter states outside of the training distribution, so the decoder is not able to reconstruct the state accurately. This is not the case with Online World Modeling, but it still fails to successfully reach the goal state. Adversarial World Modeling successfully completes the task.
(b) Again we notice the failure for DINO-WM’s decoder to reconstruct states it encounters during planning, while this is not the case with Online World Modeling and Adversarial World Modeling, which both complete the task successfully.
$$ s_{t+1}=h(s_{t},a_{t}),\quad\text{ for all $t$}, $$ \tag{S2.E1}
$$ \min_{\theta}\mathbb{E}_{(o_{t},a_{t},o_{t+1})\sim\mathcal{T}}\lVert f_{\theta}(\Phi_{\mu}(o_{t}),a_{t})-\Phi_{\mu}(o_{t+1})\rVert_{2}^{2}. $$ \tag{S2.E3}
$$ \{\hat{a}_{t}^{*}\}_{t=1}^{H}=\operatorname*{arg\,min}_{\{\hat{a}_{t}\}}\lVert\hat{z}_{H+1}-z_{\text{goal}}\rVert^{2}_{2} $$ \tag{S2.E4}
$$ \hat{z}_{2}=f_{\theta}(z_{1},\hat{a}_{1}),\quad\hat{z}_{t+1}=f_{\theta}(\hat{z}_{t},\hat{a}_{t})\quad\text{for}\quad t>1. $$ \tag{S2.E5}
$$ \tau^{\prime}=(z_{1},\hat{a}_{1},z_{2}^{\prime},\hat{a}_{2},\dots,z^{\prime}_{H+1}), $$ \tag{S2.E6}
$$ \delta^{(k+1)}=\Pi_{\lVert\delta\rVert_{\infty}\leq\epsilon}\left(\delta^{(k)}+\alpha\cdot\nabla_{x}\mathcal{L}(f_{\theta}(x+\delta^{(k)}),y)\right) $$ \tag{A4.E10}
Appendix
Training Details
Algorithm: algorithm
\caption{Gradient-Based Planning (GBP) via Gradient Descent}
\label{algo:gbp}
\KwIn{Start state $z_1$, goal state $z_\text{goal}$, world model $f_{\theta}$, horizon $H$, optimization iterations $N$}
\KwOut{Optimal action sequence $\{\hat{a}_t\}_{t=1}^H$}
\BlankLine
Initialize action prediction $\{\hat{a}_t\}_{t=1}^H \sim \mathcal{N}(0, I_H)$ \;
\For{$i = 1, \dots, N$}{
$\hat{z}_{H+1} \leftarrow \text{rollout}_{f}(z_1, \{\hat{a}_t\})_{H + 1}$ \;
$\mathcal{L}_{\text{goal}} \leftarrow \lVert\hat{z}_{H+1} - z_{\text{goal}}\rVert^2_2$ \;
$\{\hat{a}_t\} \leftarrow \{\hat{a}_t\} - \eta \cdot \nabla_{\{\hat{a}_t\}} \mathcal{L}_{\text{goal}}$ \;
}
\Return $\{\hat{a}_t\}_{t=1}^H$\;
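For concreteness, the loop above can be written in runnable form. We assume toy linear latent dynamics $f(z,a)=z+0.5a$ (not our learned world model), for which the goal-loss gradient with respect to each action is available in closed form, so no autograd framework is needed:

```python
import numpy as np

def gbp_plan(z1, z_goal, H, steps=100, lr=0.3, seed=0):
    """Gradient-based planning under toy linear dynamics f(z, a) = z + 0.5*a.
    The rollout gives z_{H+1} = z_1 + 0.5 * sum_t a_t, so
    d/da_t ||z_{H+1} - z_goal||^2 = (z_{H+1} - z_goal) for every t."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((H, z1.shape[0]))   # init a_t ~ N(0, I)
    for _ in range(steps):
        z_final = z1 + 0.5 * actions.sum(axis=0)      # rollout through f
        grad = np.tile(z_final - z_goal, (H, 1))      # closed-form gradient
        actions = actions - lr * grad                 # gradient descent step
    return actions
```

In our experiments the gradient is instead obtained by backpropagating through the learned world model $f_\theta$, and the learning rates follow Table 3.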
Algorithm: algorithm
\caption{Online World Modeling}
\label{algo:online_wm}
\KwIn{Pretrained world model $f_\theta$, simulator dynamics function $h$, encoder $\Phi_\mu$, dataset of trajectories $\mathcal{T}$, online iterations $N$, horizon $H$, planning optimization iterations $M$}
\KwOut{Updated world model $f_\theta$}
\BlankLine
Initialize new trajectory dataset $\mathcal{T}'$ \;
\For{$i = 1, \dots, N$}{
Sample trajectory $\tau_i = (z_1, a_1, z_2, a_2, \dots, a_H, z_{H + 1}) \sim \mathcal{T}$ \;
$\{\hat{a}_t \}_{t = 1}^H \gets \text{GBP}(z_{1}, z_{H + 1}, f_\theta, H, M)$\;
$\{s_t'\}_{t = 2}^{H + 1} \gets \text{rollout}_{h}(s_1, \{\hat{a}_t \}) $ \;
$\{z_t'\}_{t = 2}^{H + 1} \gets \{\Phi_\mu(s_t')\}_{t = 2}^{H + 1}$\;
$\tau_i' \gets (z_1, \hat{a}_1, z'_2, \hat{a}_2, \dots, \hat{a}_H, z'_{H+1})$ \;
$\mathcal{T}' \gets \mathcal{T}' \cup \tau'_i$ \;
Train $f_\theta$ on next-state prediction using $\mathcal{T}'$
}
\Return $f_\theta$\;
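A structural sketch of the data-synthesis portion of Algorithm 2; `plan_fn`, `sim_step`, and `encode` stand in for GBP, the simulator dynamics $h$, and the encoder $\Phi_\mu$, and the retraining step is omitted:

```python
import numpy as np

def online_wm_rollouts(plan_fn, sim_step, encode, tasks):
    """Online World Modeling data synthesis: plan with the current model,
    execute the planned actions in the simulator, and re-encode the
    simulator-corrected states into new training trajectories."""
    corrected = []
    for (s1, z1, z_goal) in tasks:
        actions = plan_fn(z1, z_goal)      # GBP under the current model
        s, latents = s1, [z1]
        for a in actions:                  # simulator-corrected rollout
            s = sim_step(s, a)
            latents.append(encode(s))      # z'_t = Phi_mu(s'_t)
        corrected.append((latents, actions))
    return corrected
```

Each returned pair corresponds to a corrected trajectory $\tau_i'$, which is added to $\mathcal{T}'$ for next-state-prediction training.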
Algorithm: algorithm
\caption{Adversarial World Modeling}
\label{algo:adv_wm}
\KwIn{Pretrained world model $f_\theta$, dataset of trajectories $\mathcal{T}$, action perturbation scaling $\lambda_a$, state perturbation scaling $\lambda_z$, horizon $H$, training iterations $N$, minibatch size $B$}
\KwOut{Updated world model $f_\theta$}
\BlankLine
Initialize new trajectory dataset $\mathcal{T}'$\;
\For{$i = 1, \dots, N$}{
Sample minibatch $\tau \gets
\{(z_1^j, a_1^j, z_2^j), (z_2^j, a_2^j, z_3^j), \dots, (z_H^j, a_H^j, z_{H+1}^j)\}_{j=1}^{B}
\sim \mathcal{T}$ \;
$(\epsilon_a, \epsilon_z) \gets
\Big(
\lambda_a\, \mathrm{mean}_j\big[\mathrm{std}(\{a_1^j, \ldots, a_H^j\})\big],\,
\lambda_z\, \mathrm{mean}_j\big[\mathrm{std}(\{z_1^j, \ldots, z_{H+1}^j\})\big]
\Big)$ \;
$(\alpha_a, \alpha_z) \gets (1.25\,\epsilon_a, 1.25\,\epsilon_z)$ \;
$\delta_a \sim \mathrm{Uniform}(-\epsilon_a, \epsilon_a)$ \;
$\delta_z \sim \mathrm{Uniform}(-\epsilon_z, \epsilon_z)$ \;
\For{$t = 1, \dots, H$}{
$\nabla_{\delta_a}, \nabla_{\delta_z} \gets
\nabla_{\delta_a, \delta_z}
\big\lVert f_\theta(z_t + \delta_z, a_t + \delta_a) - z_{t+1} \big\rVert_2^2$ \;
$\delta_a \gets
\mathrm{clip}(\delta_a + \alpha_a \,\mathrm{sign}(\nabla_{\delta_a}), -\epsilon_a, \epsilon_a)$ \;
$\delta_z \gets
\mathrm{clip}(\delta_z + \alpha_z \,\mathrm{sign}(\nabla_{\delta_z}), -\epsilon_z, \epsilon_z)$ \;
$a_t' \gets a_t + \delta_a$ \;
$z_t' \gets z_t + \delta_z$ \;
}
$\tau' \gets
\{(z_1^{\prime j}, a_1^{\prime j}, z_2^j), (z_2^{\prime j}, a_2^{\prime j}, z_3^j), \ldots, (z_H^{\prime j}, a_H^{\prime j}, z_{H+1}^j)\}_{j=1}^{B}$ \;
Train $f_\theta$ on next-state prediction using $\tau'$ \;
}
\Return $f_\theta$\;
Algorithm: algorithm
\caption{Cross-Entropy Method (CEM) Planning}
\label{algo:cem}
\KwIn{Current observation $o_1$, goal observation $o_g$, encoder $\Phi_\mu$, world model $f_\theta$, \\
\hspace{1.2cm} horizon $H$, population size $N$, top-K selection $K$, iterations $I$}
\KwOut{Action sequence $\{\hat{a}_t\}_{t=1}^{H}$}
\BlankLine
% --- Encode observations ---
$\hat{z}_1 \leftarrow \Phi_\mu(o_1)$ \;
$z_g \leftarrow \Phi_\mu(o_g)$ \;
% --- Initialize distribution ---
Initialize Gaussian distribution parameters: mean $\mu_0$, covariance $\Sigma_0$ \;
\For{$i = 1, \dots, I$}{
% --- Sample population ---
Sample $N$ action sequences $\{a^{(j)}_{1:H}\}_{j=1}^N \sim \mathcal{N}(\mu_{i-1}, \Sigma_{i-1})$ \;
\For{$j = 1, \dots, N$}{
$\hat{z}^{(j)}_1 \leftarrow \hat{z}_1$ \;
\For{$t = 2, \dots, H+1$}{
$\hat{z}^{(j)}_t \leftarrow f_\theta(\hat{z}^{(j)}_{t-1}, a^{(j)}_{t-1})$ \;
}
$C^{(j)} \leftarrow \lVert \hat{z}^{(j)}_{H+1} - z_g \rVert^2$ \;
}
% --- Select elites ---
Select $K$ sequences with lowest cost: $\mathcal{E} = \{a^{(j)}\}_{\text{top-}K}$ \;
% --- Update distribution ---
$\mu_i \leftarrow \frac{1}{K} \sum_{a \in \mathcal{E}} a$ \;
$\Sigma_i \leftarrow \frac{1}{K} \sum_{a \in \mathcal{E}} (a - \mu_i)(a - \mu_i)^\top$ \;
}
\Return $\mu_I$ as the final action sequence estimate $\{\hat{a}_t\}_{t=1}^{H}$ \;
Algorithm: algorithm
\caption{Initialization Network Training}
\label{algo:initnet}
\KwIn{Initialization network $g_\theta$, LR $\eta$, dataset of trajectories $\mathcal{T}$, iterations $N$, horizon $H$}
\KwOut{Trained initialization network $g_\theta$}
\BlankLine
\For{$i = 1, \dots, N$}{
Sample trajectory $\tau_i = (z_1, a_1, z_2, a_2, \dots, a_H, z_{H + 1}) \sim \mathcal{T}$ \;
$\{\hat{a}_t \}_{t = 1}^H \gets g_{\theta}(z_{1}, z_{H+1})$\;
$\mathcal{L}_{\text{actions}} \gets \sum_{t=1}^{H}{\lVert \hat{a}_t - a_t \rVert_2^2}$ \;
$\theta \gets \theta - \eta\nabla_{\theta}\mathcal{L}_{\text{actions}}$ \;
}
\Return $g_\theta$\;
Table: A1.T3.st3: (c) MPC Parameters
| Name | GD | Adam |
|---|---|---|
| MPC steps | 10 | 10 |
| Opt. steps | 100 | 100 |
| LR | 1 | 0.2 |
Table: A1.T5: Training parameters for Online World Modeling as reported in Table 1.
| Environment | # Rollouts | Batch Size | GPU | Epochs |
|---|---|---|---|---|
| PushT | 6000 | 32 | 4x B200 | 1 |
| PointMaze | 500 | 32 | 4x B200 | 1 |
| Wall | 1920 (all) | 80 | 4x B200 | 1 |
| PushT | PushT | PointMaze | PointMaze | Wall | Wall | |
|---|---|---|---|---|---|---|
| GD+IN | Ad+IN | GD+IN | Ad+IN | GD+IN | Ad+IN | |
| DINO-WM | 44 60 | 62 84 | 16 40 | 14 54 | 4 6 | 12 32 |
| + MPC | 56 | 8 | 28 46 | |||
| OnlineWM + MPC | 66 | 10 | 18 | |||
| 52 | 82 | 40 | 2 | 22 | ||
| AdversarialWM | 74 | 90 | 22 | 36 | 18 | 24 |
| + MPC | 74 | 90 | 44 | 56 | 24 | 48 |
| Rope | Rope | Granular | Granular | |
|---|---|---|---|---|
| GD | CEM | GD | CEM | |
| DINO-WM | 1.73 | 0.93 | 0.30 | 0.22 |
| AdversarialWM | 0.93 | 0.82 | 0.24 | 0.28 |
| GD | CEM | |
|---|---|---|
| IRIS | 0 | 4 |
| IRIS + OnlineWM | 0 | 0 |
| IRIS + AdversarialWM | 8 | 6 |
| PushT | PointMaze | |
|---|---|---|
| DINO-WM | 16 | 70 |
| OnlineWM | 16 | 96 |
| AdversarialWM | 26 | 88 |
| PushT | PointMaze | Wall | |
|---|---|---|---|
| Simulator | 0.959 | 0.717 | 4.465 |
| DINO-WM | 0.029 | 0.029 | 0.029 |
| | Backward Passes | PointMaze Min/Epoch | PointMaze Open-Loop | PointMaze MPC | Wall Min/Epoch | Wall Open-Loop | Wall MPC |
|---|---|---|---|---|---|---|---|
| FGSM | 2 | 120 | 70 | 94 | 14 | 34 | 94 |
| 2-Step PGD | 3 | 165 | 80 | 96 | 20 | 8 | 90 |
| 3-Step PGD | 4 | 201 | 78 | 94 | 24 | 14 | 94 |
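The backward-pass counts above follow from k-step PGD spending k gradient evaluations crafting the perturbation plus one more for the training update on the perturbed input, so FGSM (the one-step case) costs 2 and 3-step PGD costs 4. The sketch below makes that accounting explicit; the quadratic loss, step size, and budget are illustrative, not the paper's settings.

```python
import numpy as np

# Count "backward passes" (gradient evaluations) for k-step PGD on a toy
# quadratic loss 0.5 * ||W x - target||^2.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
x0, target = rng.normal(size=4), rng.normal(size=4)
passes = 0

def grad(x):
    global passes
    passes += 1                             # one backward pass per call
    return W.T @ (W @ x - target)           # d/dx 0.5 * ||W x - target||^2

def pgd(x, eps, steps):
    delta = np.zeros_like(x)
    for _ in range(steps):
        delta += (eps / steps) * np.sign(grad(x + delta))
        delta = np.clip(delta, -eps, eps)   # project onto the L-inf ball
    return x + delta

x_adv = pgd(x0, eps=0.2, steps=3)           # 3 backward passes
passes += 1                                 # final pass: training gradient on x_adv
print(passes)                               # -> 4, matching the 3-Step PGD row
```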






References
[fastbetterthanfree] Wong, Eric, Rice, Leslie, Kolter, J Zico. (2020). Fast is better than free: Revisiting adversarial training. International Conference on Learning Representations.
[goodfellow2014explaining] Goodfellow, Ian J, Shlens, Jonathon, Szegedy, Christian. (2014). Explaining and harnessing adversarial examples. International Conference on Learning Representations.
[Finn2016UnsupervisedLFA] Chelsea Finn, I. Goodfellow, S. Levine. (2016). Unsupervised Learning for Physical Interaction through Video Prediction. ArXiv.
[Zhang2025StateAwarePOA] Zongyuan Zhang, Tian-dong Duan, Zheng Lin, Dong Huang, Zihan Fang, Zekai Sun, Ling Xiong, Hongbin Liang, Heming Cui, Yong Cui. (2025). State-Aware Perturbation Optimization for Robust Deep Reinforcement Learning. ArXiv.
[nagabandi2018neural] Nagabandi, Anusha, Kahn, Gregory, Fearing, Ronald S, Levine, Sergey. (2018). Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. 2018 IEEE international conference on robotics and automation (ICRA).
[shen2025babnd] Ross, Stéphane, Gordon, Geoffrey, Bagnell, Drew. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. Proceedings of the fourteenth international conference on artificial intelligence and statistics.
[levine2020offline] Levine, Sergey, Kumar, Aviral, Tucker, George, Fu, Justin. (2020). Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643.
[kostrikov2021offline] Kostrikov, Ilya, Nair, Ashvin, Levine, Sergey. (2021). Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169.
[park2023hiql] Park, Seohong, Ghosh, Dibya, Eysenbach, Benjamin, Levine, Sergey. (2023). Hiql: Offline goal-conditioned rl with latent states as actions. Advances in Neural Information Processing Systems.
[rubinstein2004cross] Rubinstein, Reuven Y, Kroese, Dirk P. (2004). The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning.
[williams2017model] Williams, Grady, Aldrich, Andrew, Theodorou, Evangelos A. (2017). Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics.
[deisenroth2011pilco] Deisenroth, Marc, Rasmussen, Carl E. (2011). PILCO: A model-based and data-efficient approach to policy search. Proceedings of the 28th International Conference on machine learning (ICML-11).
[lenz2015deepmpc] Lenz, Ian, Knepper, Ross A, Saxena, Ashutosh. (2015). DeepMPC: Learning deep latent features for model predictive control.. Robotics: Science and Systems.
[Karl2016DeepVB] Maximilian Karl, Maximilian Sölch, Justin Bayer, Patrick van der Smagt. (2016). Deep Variational Bayes Filters: Unsupervised Learning of State Space Models from Raw Data. ArXiv.
[Sharma2019DynamicsAwareUD] Archit Sharma, Shixiang Shane Gu, Sergey Levine, Vikash Kumar, Karol Hausman. (2019). Dynamics-Aware Unsupervised Discovery of Skills. ArXiv.
[Kaiser2019ModelBasedRL] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H. Campbell, K. Czechowski, D. Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, G. Tucker, Henryk Michalewski. (2019). Model-Based Reinforcement Learning for Atari. ArXiv.
[Shi2022RoboCraftLT] Haochen Shi, Huazhe Xu, Zhiao Huang, Yunzhu Li, Jiajun Wu. (2022). RoboCraft: Learning to See, Simulate, and Shape Elasto-Plastic Objects with Graph Networks. ArXiv.
[Finn2016DeepVF] Chelsea Finn, Sergey Levine. (2016). Deep visual foresight for planning robot motion. 2017 IEEE International Conference on Robotics and Automation (ICRA).
[lecun2022path] LeCun, Yann. (2022). A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review.
[hafner2019learning] Hafner, Danijar, Lillicrap, Timothy, Fischer, Ian, Villegas, Ruben, Ha, David, Lee, Honglak, Davidson, James. (2019). Learning latent dynamics for planning from pixels. International conference on machine learning.
[henaff2017model] Henaff, Mikael, Whitney, William F, LeCun, Yann. (2017). Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177.
[agrawal2016learning] Agrawal, Pulkit, Nair, Ashvin V, Abbeel, Pieter, Malik, Jitendra, Levine, Sergey. (2016). Learning to poke by poking: Experiential learning of intuitive physics. Advances in neural information processing systems.
[zhang2019solar] Zhang, Marvin, Vikram, Sharad, Smith, Laura, Abbeel, Pieter, Johnson, Matthew, Levine, Sergey. (2019). Solar: Deep structured representations for model-based reinforcement learning. International conference on machine learning.
[florence2022implicit] Florence, Pete, Lynch, Corey, Zeng, Andy, Ramirez, Oscar A, Wahid, Ayzaan, Downs, Laura, Wong, Adrian, Lee, Johnny, Mordatch, Igor, Tompson, Jonathan. (2022). Implicit behavioral cloning. Conference on robot learning.
[shen2024bab] Shen, Keyi, Yu, Jiangwei, Barreiros, Jose, Zhang, Huan, Li, Yunzhu. (2024). Bab-nd: Long-horizon motion planning with branch-and-bound and neural dynamics. arXiv preprint arXiv:2412.09584.
[Schmidt2009DistillingFNA] Michael D. Schmidt, Hod Lipson. (2009). Distilling Free-Form Natural Laws from Experimental Data. Science.
[goldstein1950classical] Goldstein, Herbert, Poole, Charles P, Safko, John. (1950). Classical mechanics.
[macchelli2009port] Macchelli, Alessandro, Melchiorri, Claudio, Stramigioli, Stefano. (2009). Port-based modeling and simulation of mechanical systems with rigid and flexible links. IEEE transactions on robotics.
[Nagabandi2019DeepDMA] Anusha Nagabandi, K. Konolige, S. Levine, Vikash Kumar. (2019). Deep Dynamics Models for Learning Dexterous Manipulation. Conference on Robot Learning.
[Bharadhwaj2020ModelPredictiveCVA] Homanga Bharadhwaj, Kevin Xie, F. Shkurti. (2020). Model-Predictive Control via Cross-Entropy and Gradient-Based Optimization. ArXiv.
[Williams2017InformationTMA] Grady Williams, Nolan Wagener, Brian Goldfain, P. Drews, James M. Rehg, Byron Boots, Evangelos A. Theodorou. (2017). Information theoretic MPC for model-based reinforcement learning. 2017 IEEE International Conference on Robotics and Automation (ICRA).
[Zhan2021ModelBasedOPA] Xianyuan Zhan, Xiangyu Zhu, Haoran Xu. (2021). Model-Based Offline Planning with Trajectory Pruning. International Joint Conference on Artificial Intelligence.
[karypidis2024dino] Karypidis, Efstathios, Kakogeorgiou, Ioannis, Gidaris, Spyros, Komodakis, Nikos. (2024). DINO-Foresight: Looking into the Future with DINO. arXiv preprint arXiv:2412.11673.
[Wang2023SoftZooASA] Tsun-Hsuan Wang, Pingchuan Ma, A. Spielberg, Zhou Xian, Hao Zhang, J. Tenenbaum, D. Rus, Chuang Gan. (2023). SoftZoo: A Soft Robot Co-design Benchmark For Locomotion In Diverse Environments. ArXiv.
[Xu2022AcceleratedPLA] Jie Xu, Viktor Makoviychuk, Yashraj S. Narang, Fabio Ramos, W. Matusik, Animesh Garg, M. Macklin. (2022). Accelerated Policy Learning with Parallel Differentiable Simulation. ArXiv.
[Chen2022BenchmarkingDOA] Siwei Chen, Yiqing Xu, Cunjun Yu, Linfeng Li, Xiao Ma, Zhongwen Xu, David Hsu. (2022). Benchmarking Deformable Object Manipulation with Differentiable Physics. ArXiv.
[Hu2022ModelBasedILA] Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zak Murez, Corina Gurau, Hudson Yeo, Alex Kendall, R. Cipolla, J. Shotton. (2022). Model-Based Imitation Learning for Urban Driving. ArXiv.
[Akan2022StretchBEVSFA] Adil Kaan Akan, Fatma Güney. (2022). StretchBEV: Stretching Future Instance Prediction Spatially and Temporally. European Conference on Computer Vision.
[Edwards2018ImitatingLPA] Ashley D. Edwards, Himanshu Sahni, Yannick Schroecker, C. Isbell. (2018). Imitating Latent Policies from Observation. ArXiv.
[Zhang2021DeformableLOA] Wenbo Zhang, Karl Schmeckpeper, P. Chaudhari, Kostas Daniilidis. (2021). Deformable Linear Object Prediction Using Locally Linear Latent Dynamics. 2021 IEEE International Conference on Robotics and Automation (ICRA).
[bounou2021online] Bounou, Oumayma, Ponce, Jean, Carpentier, Justin. (2021). Online learning and control of complex dynamical systems from sensory input. Advances in Neural Information Processing Systems.
[madry2019deeplearningmodelsresistant] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, Adrian Vladu. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks.
[zhou2025dinowmworldmodelspretrained] Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto. (2025). DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning.
[oquab2024dinov2learningrobustvisual] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. (2024). DINOv2: Learning Robust Visual Features without Supervision.
[dosovitskiy2021imageworth16x16words] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
[oord2018neuraldiscreterepresentationlearning] Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu. (2018). Neural Discrete Representation Learning.
[mohanan2018survey] Mohanan, MG, Salgoankar, Ambuja. (2018). A survey of robotic motion planning in dynamic environments. Robotics and Autonomous Systems.
[kavraki2002probabilistic] Kavraki, Lydia E, Svestka, Petr, Latombe, J-C, Overmars, Mark H. (2002). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE transactions on Robotics and Automation.
[spong2020robot] Spong, Mark W, Hutchinson, Seth, Vidyasagar, M. (2020). Robot modeling and control. John Wiley &.
[siciliano2009robotics] Siciliano, Bruno, Sciavicco, Lorenzo, Villani, Luigi, Oriolo, Giuseppe. (2009). Robotics: modelling, planning and control.
[ha2018world] Ha, David, Schmidhuber, Jürgen. (2018). World models. arXiv preprint arXiv:1803.10122.
[sutton1998reinforcement] Sutton, Richard S, Barto, Andrew G, others. (1998). Reinforcement learning: An introduction.
[schrittwieser2020mastering] Schrittwieser, Julian, Antonoglou, Ioannis, Hubert, Thomas, Simonyan, Karen, Sifre, Laurent, Schmitt, Simon, Guez, Arthur, Lockhart, Edward, Hassabis, Demis, Graepel, Thore, others. (2020). Mastering atari, go, chess and shogi by planning with a learned model. Nature.
[sutton1991dyna] Sutton, Richard S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin.
[hafner2023mastering] Hafner, Danijar, Pasukonis, Jurgis, Ba, Jimmy, Lillicrap, Timothy. (2023). Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104.
[assran2023self] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[Ajay2018AugmentingPSA] Anurag Ajay, Jiajun Wu, Nima Fazeli, Maria Bauzá. (2018). Augmenting Physical Simulators with Stochastic Neural Networks: Case Study of Planar Pushing and Bouncing. IEEE/RJS International Conference on Intelligent RObots and Systems.
[Jackson2024PolicyGuidedDA] Matthew Jackson, Michael Matthews, Cong Lu, Benjamin Ellis, Shimon Whiteson, J. Foerster. (2024). Policy-Guided Diffusion. ArXiv.
[Schiewer2024ExploringTLA] Robin Schiewer, Anand Subramoney, Laurenz Wiskott. (2024). Exploring the limits of hierarchical world models in reinforcement learning. Scientific Reports.
[Talvitie2014ModelRFA] Erik Talvitie. (2014). Model Regularization for Stable Sample Rollouts. Conference on Uncertainty in Artificial Intelligence.
[Zhu2023DiffusionMFA] Zhengbang Zhu, Hanye Zhao, Haoran He, Yichao Zhong, Shenyu Zhang, Yong Yu, Weinan Zhang. (2023). Diffusion Models for Reinforcement Learning: A Survey. ArXiv.
[Ke2019LearningDMA] Nan Rosemary Ke, Amanpreet Singh, Ahmed Touati, Anirudh Goyal, Yoshua Bengio, Devi Parikh, Dhruv Batra. (2019). Learning Dynamics Model in Reinforcement Learning by Incorporating the Long Term Future. ArXiv.
[Bardes2024RevisitingFPA] Adrien Bardes, Q. Garrido, Jean Ponce, Xinlei Chen, Michael G. Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. ArXiv.
[Drozdov2024VideoRLA] Katrina Drozdov, Ravid Shwartz-Ziv, Yann LeCun. (2024). Video Representation Learning with Joint-Embedding Predictive Architectures. ArXiv.
[Guan2024WorldMFA] Yanchen Guan, Haicheng Liao, Zhenning Li, Guohui Zhang, Chengzhong Xu. (2024). World Models for Autonomous Driving: An Initial Survey. ArXiv.
[hafner2019dream] Hafner, Danijar, Lillicrap, Timothy, Ba, Jimmy, Norouzi, Mohammad. (2019). Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603.
[hansen2023td] Hansen, Nicklas, Su, Hao, Wang, Xiaolong. (2023). Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828.
[mayne1966second] Mayne, David. (1966). A second-order gradient method for determining optimal trajectories of non-linear discrete-time systems. International Journal of Control.
[li2004iterative] Li, Weiwei, Todorov, Emanuel. (2004). Iterative linear quadratic regulator design for nonlinear biological movement systems. First International Conference on Informatics in Control, Automation and Robotics.
[sv2023gradient] SV, Jyothir, Jalagam, Siddhartha, LeCun, Yann, Sobal, Vlad. (2023). Gradient-based Planning with World Models. arXiv preprint arXiv:2312.17227.
[szegedy2013intriguing] Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, Fergus, Rob. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
[mejia2019robust] Mejia, Felipe A, Gamble, Paul, Hampel-Arias, Zigfried, Lomnitz, Michael, Lopatina, Nina, Tindall, Lucas, Barrios, Maria Alejandra. (2019). Robust or private? adversarial training makes models more vulnerable to privacy attacks. arXiv preprint arXiv:1906.06449.
[Kingma2014AdamAM] Diederik P. Kingma, Jimmy Ba. (2014). Adam: A Method for Stochastic Optimization. CoRR.
[pointmaze] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, Sergey Levine. (2021). D4RL: Datasets for Deep Data-Driven Reinforcement Learning.
[pusht] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, Shuran Song. (2024). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.
[micheli2023transformerssampleefficientworldmodels] Vincent Micheli, Eloi Alonso, François Fleuret. (2023). Transformers are Sample-Efficient World Models.
[NIPS2017_3f5ee243] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is All you Need. Advances in Neural Information Processing Systems.
[torch_mppi] Williams, Grady, Wagener, Nolan, Goldfain, Brian, Drews, Paul, Rehg, James M, Boots, Byron, Theodorou, Evangelos A. (2017). Information theoretic MPC for model-based reinforcement learning. 2017 IEEE international conference on robotics and automation (ICRA).
[gradcem] Bharadhwaj, Homanga, Xie, Kevin, Shkurti, Florian. (2020). Model-predictive control via cross-entropy and gradient-based optimization. Learning for Dynamics and Control.
[zhang2024adaptigraph] Lambert, Nathan, Amos, Brandon, Yadan, Omry, Calandra, Roberto. (2020). Objective mismatch in model-based reinforcement learning. arXiv preprint arXiv:2002.04523.