Parallel Stochastic Gradient-Based Planning for World Models
Michael Psenka, Michael Rabbat, Aditi Krishnapriyan, Yann LeCun, Amir Bar
Introduction
Intelligent agents carry a small-scale model of external reality which allows them to simulate actions, reason about their consequences, and choose the ones that lead to the best outcome. Attempts to build such models date back to control theory, and in recent years researchers have made progress in building world models using deep neural networks trained directly from the raw sensory input (e.g., vision). For example, recent world models have shown success modeling computer games (Valevski et al., 2024), navigating real-world environments (Bar et al., 2025; Ball et al., 2025), and robot arm motion commands (Goswami et al., 2025). World models have numerous impactful applications from simulating complex medical procedures (Koju et al., 2025) to testing robots in visually realistic environments (Guo et al., 2025).
Current planning algorithms used with world models often rely on zeroth-order optimization methods such as the Cross-Entropy Method (CEM; Rubinstein and Kroese, 2004) or, more generally, shooting methods, which plan by iteratively rolling out trajectories and choosing actions from the optimal rollout (Bock and Plitt, 1984; Piovesan and Tanner, 2009). These approaches are simple and robust, but their performance degrades with longer planning horizons and higher action dimensionality (Bharadhwaj et al., 2020), motivating the use of gradient information when differentiable world models are available.
Gradient-based planners exploit the differentiability of learned world models to directly optimize action sequences, enabling more sample-efficient planning and finer-grained improvement than zero-order methods. However, there are two main challenges to this optimization approach: local minima (Jyothir et al., 2023) and instability due to differentiating through the full rollout, akin to backpropagation through time (Werbos, 2002). While prior methods have formulated the optimization for better conditioning (e.g., multiple shooting and direct collocation (Von Stryk, 1993)), these techniques are typically developed for known dynamical systems and do not scale as well to deep neural dynamics (often in latent spaces) with brittle or poorly calibrated Jacobians.

Figure 1 Difficulty of the planning problem. Subfigure (a) shows the distance to the goal in L₂ norm throughout a successful trajectory. This illustrates the difficulty of planning optimization away from a minimizer: successful trajectories often have to first move away from the goal to successfully plan towards it later, causing greedy strategies to fail. Subfigures (b)-(c) depict the loss landscape at convergence of standard rollout-based planners vs. our planner. The example given is in the Push-T environment at horizon length 50. The axes plotted over are two random, orthogonal, unit-norm directions in the full action space ℝ^{50×2}. Our planner's loss is taken as in Eq. (10), and for GD the loss is taken as in Eq. (3).
In this work, we introduce a novel gradient-based planning method for learned world models that decouples temporal dynamics into parallel optimized states (rather than serial rollouts) while remaining robust at long horizons and in high-dimensional state spaces. Rather than planning exclusively through a deep, sequential rollout of the dynamics model, our approach optimizes over lifted intermediate states that are treated as independent optimization variables. Our approach makes two fundamental additions that address the main obstacles to gradient-based planning in higher dimensions: gradient sensitivity and local minima.
Firstly, a fundamental difficulty arises in the setting of vision-based world models. In high-dimensional learned state spaces, gradients with respect to state inputs can be brittle or adversarial, allowing the optimizer to exploit sensitive Jacobian structure rather than discovering physically meaningful transitions. To mitigate this issue, our planner deliberately stops gradients through the state inputs of the world model, while retaining gradients with respect to actions, which we find behave more reasonably. On its own, this would bias trajectories toward the starting state; however, with a dense one-step goal loss over the full trajectory, the converged trajectories of these noisy iterations tend toward the goal.
Secondly, to address the remaining non-convexity of the lifted state approach, our planner incorporates Langevin-style stochastic updates on the lifted state variables, explicitly injecting noise during optimization to promote exploration of the state space and facilitate escape from unfavorable basins. This stochastic relaxation allows the planner to search over diverse intermediate trajectories while still favoring solutions that approximately satisfy the learned dynamics. Finally, we intermittently apply a small GD step to fine-tune stochastically optimized trajectories towards fully-optimized paths.
Together, these components yield a practical gradient-based planner for learned visual dynamics that remains stable at long horizons while avoiding the failure modes commonly encountered when backpropagating through deep world-model rollouts. We call our planner GRASP (Gradient RelAxed Stochastic Planner) to emphasize its primary components: using gradient information, relaxing the dynamics constraints, and stochastic optimization for exploration. In various settings, we achieve up to +10% success rate at less than half the compute time cost. We also provide a theoretical model for our planner to further illustrate its role. We demonstrate our planner on visual world models trained on problems in D4RL (Fu et al., 2020) and the DeepMind Control Suite (Tassa et al., 2018).
Problem formulation
Our main object of interest is a learned world model F θ : S × A → S that predicts the next state given the current state and action. For visual domains, states are typically represented in a learned latent space to handle high-dimensional observations. Here we assume A = R k is a continuous Euclidean action space.
We consider the problem of fixed-goal path planning: finding an action sequence a = ( a 0 , a 1 , . . . , a T -1 )

Figure 2 Graphical depiction of (a) a standard serial-based setup for optimization-based planning, where states are rolled out using the actions and the loss is evaluated on the goal state, (b) our setup, which parallelizes the world model evaluations by optimizing 'virtual states' directly and only supervising pairwise dynamics satisfaction. The crossed lines and skipped connections for our method's depiction (b) are detailed in Section 3.3, which keeps the full planning graph connected while not requiring state gradients of the dynamics F θ . For our planner, we find it helpful to alternate between (a) and (b) throughout the planning optimization.
that, with respect to the dynamics of the world model F_θ and a given initial state s_0 ∈ S, reaches a set goal state g ∈ S:
$$ \text{find } \mathbf{a}^* \ \text{ such that } \ s_T(\mathbf{a}^*, s_0) = g, \tag{1} $$
where the terminal state s_T is generated recursively through the update rule s_{t+1} = F_θ(s_t, a_t):
$$ s_T(\mathbf{a}, s_0) \;=\; F_\theta\big( F_\theta( \cdots F_\theta(s_0, a_0) \cdots,\; a_{T-2} ),\; a_{T-1} \big). \tag{2} $$
We can compute a* by solving the following optimization problem:
$$ \mathbf{a}^* \;=\; \operatorname*{arg\,min}_{\mathbf{a} \in \mathcal{A}^T} \; \big\| s_T(\mathbf{a}, s_0) - g \big\|_2^2. \tag{3} $$
Optimizing Eq. (3) directly is challenging due to two main problems. First, it requires T serial applications of F_θ (see Eq. (2)), which is computationally expensive and difficult to optimize due to the poor conditioning arising from repeated applications of F_θ (see Appendix A.1 for details). Second, it is susceptible to local minima and a jagged loss landscape; see Figure 1. For these reasons, existing planners are based on zeroth-order optimization algorithms like CEM and MPPI (Williams et al., 2016), which are highly stochastic and do not require gradient computation.
In what follows, we propose a gradient-based planner which alleviates these difficulties, while also using the differentiability of the model F θ . In Section 3, we lift the optimization problem by also optimizing over states, which leads to faster convergence and better conditioning. In Section 3.2, we introduce stochasticity, which helps escape local minima.
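To make the baseline concrete, here is a minimal sketch (not the paper's code) of rollout-based gradient planning on Eq. (3). We use a toy linear world model F(s, a) = As + Ba so that backpropagation through time reduces to an explicit adjoint pass; the matrices, horizon, and step size are illustrative choices.

```python
import numpy as np

# Hedged sketch: gradient descent on the serial rollout loss of Eq. (3)
# for a toy linear world model F(s, a) = A s + B a. All constants are
# illustrative, not values from the paper.
rng = np.random.default_rng(0)
n, T = 4, 20
A = 0.95 * np.eye(n)                      # stable toy dynamics
B = np.eye(n) + 0.1 * rng.normal(size=(n, n))
s0, g = np.zeros(n), np.ones(n)

def loss_and_grad(a):
    s = s0
    for t in range(T):                    # serial rollout: T sequential calls
        s = A @ s + B @ a[t]
    r = s - g                             # terminal residual of Eq. (3)
    lam, grad = 2 * r, np.zeros_like(a)
    for t in reversed(range(T)):          # adjoint (backward-in-time) pass
        grad[t] = B.T @ lam               # dL/da_t
        lam = A.T @ lam                   # propagate adjoint through A
    return r @ r, grad

a = np.zeros((T, n))
for _ in range(500):
    L, grad = loss_and_grad(a)
    a -= 0.02 * grad                      # plain gradient descent
```

Every update requires a full forward rollout and a backward pass of the same depth; the lifted formulation of Section 3 removes this serial dependency.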
Decoupling dynamics for gradient-based planning
We consider planning with a world model F θ and horizon T . Given an initial state s 0 ∈ S and goal state g ∈ S , we optimize an action sequence a = ( a 0 , . . . , a T -1 ) such that the rolled-out state s T ( a ) is as close to g as possible. A standard approach defines a trajectory by rolling out the model, but backpropagating through a deep composition of F θ can be unstable and ill-conditioned. Following prior work on lifted planning (Tamimi and Li, 2009; Rybkin et al., 2021), we introduce auxiliary states z = ( z 1 , . . . , z T ) and enforce dynamics consistency through a penalty function.
Parallelized planning
We first want to decouple the states from the explicit rollout outputs. Writing Eq. (3) in terms of each intermediate dynamics condition, we get the following:
$$ \min_{\mathbf{a},\; s_1, \dots, s_{T-1}} \; \sum_{t=0}^{T-1} \big\| F_\theta(s_t, a_t) - s_{t+1} \big\|_2^2, \qquad s_T := g. \tag{4} $$
The minimization in Eq. (4) is equivalent to Eq. (3) in that, whenever the goal is reachable under the dynamics, they share global minimizers.
Immediately, this gives a great benefit in that all world model evaluations are parallel; there is no need for the serial rollouts required by Eq. (3). There are, however, two main issues with optimizing this loss directly:
- Local minima . When optimizing with respect to states, the states might be stuck in an unphysical region; for example, Figure 7 shows a case where states go straight through a barrier. To address this, we propose to use Langevin state updates which promote exploration (see Section 3.2).
- World model sensitivity for high-dimensional states. When optimizing s directly over a higher-dimensional space (e.g., vision-based), we observe that the Jacobian J_s F_θ(s, a) does not necessarily have any nice low-dimensional or convex structure; in practice, the world model can be easily steered toward any desired output state, as depicted in Figure 3. We address this in Section 3.3 with a reshaping of the descent directions.
We now describe our approach to address these two fundamental problems with lifted-states approaches to planning.
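As a minimal sketch of the lifted formulation (toy linear model F(s, a) = As + Ba, illustrative constants, not the paper's code), Eq. (4) can be evaluated with one batched model call:

```python
import numpy as np

# Hedged sketch of the lifted objective of Eq. (4): the intermediate states
# are free variables, so all T world-model evaluations happen in a single
# parallel batch rather than a serial rollout. Shapes and constants are
# illustrative.
rng = np.random.default_rng(1)
n, T = 4, 20
A = 0.9 * np.eye(n)
B = rng.normal(size=(n, n)) / 2
s0, g = np.zeros(n), np.ones(n)

def lifted_loss(a, z):
    # z holds the free states s_1 .. s_{T-1}; s_0 and s_T = g are fixed
    s_in  = np.vstack([s0, z])            # inputs  s_0 .. s_{T-1}
    s_out = np.vstack([z, g])             # targets s_1 .. s_T
    pred = s_in @ A.T + a @ B.T           # one batched evaluation of F
    return float(np.sum((pred - s_out) ** 2))

a = np.zeros((T, n))
z = np.linspace(s0, g, T + 1)[1:-1]       # linear interpolation init
print(lifted_loss(a, z))                  # all T residuals scored at once
```

A dynamically feasible goal-reaching trajectory drives this loss to zero, matching the shared-global-minimizer claim above.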
Exploration via Langevin state updates
The lifted optimization in Eq. (4) is still non-convex and can get trapped in poor local minima. In practice, we frequently observe that deterministic joint updates in ( a , s ) converge to 'bad' stationary points where the intermediate variables settle into an unfavorable basin; for example, a linear route that ignores barriers or walls like in Figure 7. To circumvent this, we inject stochasticity directly into the state iterates, yielding a Langevin-style update that encourages exploration in the lifted state space.
Langevin dynamics on state iterates. Consider the optimization induced by Eq. (4). A standard way to escape spurious basins is to replace deterministic gradient descent on s with overdamped Langevin dynamics (Gelfand and Mitter, 1991), whose Euler

Figure 3 Sensitivity of state gradient structure . Examples of three states far away from the goal on the right (either in-distribution or out-of-distribution), such that taking a small step along the gradient s ′ = s -ϵ ∇ s L ( s ) , L ( s ) = ∥ F θ ( s, a = 0 ) -g ∥ 2 2 , leads to a nearby state s ′ that solves the planning problem in a single step: F θ ( s ′ , 0 ) = g . Thus, optimizing states directly through the world model F θ can be quite challenging.
discretization takes the following form:
$$ s_t^{k+1} \;=\; s_t^k \;-\; \eta_s\, \nabla_{s_t} L\big(\mathbf{a}^k, \mathbf{s}^k\big) \;+\; \sigma_{\text{state}}\, \xi_t^k, \tag{5} $$
$$ a_t^{k+1} \;=\; a_t^k \;-\; \eta_a\, \nabla_{a_t} L\big(\mathbf{a}^k, \mathbf{s}^k\big), \tag{6} $$
where ξ k t ∼ N (0 , I ) . That is, each optimization step performs a gradient descent update on the intermediate states, followed by an isotropic Gaussian perturbation. Intuitively, the noise allows the iterates to 'hop' between nearby basins of the lifted loss landscape.
Noise on states vs. actions. By only noising the states, we can still condition on more dynamically feasible trajectories, while still allowing exploration over a wider distribution. Intuitively, planning problems often have a single (or small number of) intermediate states to find for the solution, and being able to noise directly over states rather than actions allows us to find these intermediate states faster. See Appendix A.3 for a characterization of the sampled distribution.
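A minimal sketch of the Langevin-style state update of Eq. (5), applied here to a simple quadratic surrogate loss; the step size and noise scale are illustrative choices, not tuned values from the paper:

```python
import numpy as np

# Hedged sketch of Eq. (5): a gradient descent step on the states followed
# by an isotropic Gaussian perturbation. The quadratic loss and constants
# are illustrative stand-ins for the lifted loss.
rng = np.random.default_rng(2)
eta_s, sigma_state = 0.1, 0.05

def langevin_state_step(z, grad_z):
    noise = rng.standard_normal(z.shape)
    return z - eta_s * grad_z + sigma_state * noise   # descend, then 'hop'

z = np.zeros((9, 4))                                  # states s_1 .. s_{T-1}
target = np.ones((9, 4))
for _ in range(500):
    z = langevin_state_step(z, 2 * (z - target))      # grad of ||z - target||^2
print(np.abs(z - target).max())
```

The iterates hover in a noise ball around the minimizer rather than converging exactly, which is what allows hopping between nearby basins.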
Sensitivity to state gradients
A note on adversarial robustness of state gradients. In practice, F θ is learned and can have brittle local geometry. When optimizing Eq. (4) by gradient descent in both a and s , we observed empirically that gradients with respect to the state inputs, ∇ s F θ ( s, a ) , can be exploited: for any local goal-reaching objective of the following form:
$$ L(s, a) \;=\; \big\| F_\theta(s, a) - y \big\|_2^2, \tag{7} $$
instead of finding an on-manifold s such that applying the action a leads to the end state y, the optimizer can find a nearby ambient point s + δ, ‖δ‖₂ ≪ 1, such that the loss is practically minimized: y ≈ F_θ(s + δ, a), regardless of the starting state s. This is analogous to the adversarial robustness issue of image classifiers (Szegedy et al., 2013; Shamir et al., 2021): trained neural networks over high-dimensional input spaces can have large Lipschitz constants, hindering optimization performance.
Unfortunately, any loss function over s and a whose minimizers are feasible dynamics must depend on the state gradient ∇ s F in a meaningful way. We provide the informal theorem here, with formalization and proof in Appendix A.4.
Theorem 1 (informal) . A differentiable loss function over state/action trajectories L : S T ×A T → R given a world model F θ : S × A → S cannot satisfy both of the following at the same time:
- Minimizers of L correspond to dynamically feasible trajectories: F θ ( s t , a t ) = s t +1 ,
- L is insensitive to the world model state gradient ∇ s F θ .
To address this adversarial sensitivity, we detach gradients through the state inputs of the world model, while still differentiating with respect to the actions. We denote by ¯ s t a stop-gradient copy of s t (i.e., ¯ s t = s t in value, but treated as constant during differentiation).
Algorithm 1 GRASP planner: parallel stochastic gradient-based planning
Require: initial observation o_0, goal observation o_g, world model F_θ, horizon T, steps K, learning rates η_a, η_s.
Ensure: optimized action sequence a*
1: s_0 ← encode(o_0), s_T ← encode(o_g)
2: a ← a_0; s ← init_states(s_0, s_T)
3: for k = 0 to K − 1 do
4:  Compute L as in Eq. (10)
5:  Joint step: (a, s) ← (a, s) − (η_a ∇_a L, η_s ∇_s L)
6:  Stochastic state: s_t ← s_t + σ_state ξ_t, ξ_t ∼ N(0, I), for t = 1, …, T − 1
7:  Sync (periodic): if k mod K_sync = 0, roll out from s_0: s_{t+1} ← F_θ(s_t, a_t) for t = 0, …, T − 1,
8:   then take a GD step a ← a − η_sync ∇_a ‖s_T − g‖²₂
9: end for
10: return a* ← a
Grad-cut dynamics loss. We begin by applying a gradient stop to the state inputs in the dynamics loss:
$$ L_{\text{dyn}}(\mathbf{a}, \mathbf{s}) \;=\; \sum_{t=0}^{T-1} \big\| F_\theta(\bar s_t, a_t) - s_{t+1} \big\|_2^2. \tag{8} $$
This objective is differentiable with respect to a and the next states s t +1 , but does not backpropagate through s t via F θ .
Dense goal shaping on one-step predictions. While Eq. (8) improves robustness, it introduces a new degeneracy: paths gravitate towards the current rollout, regardless of proximity to the goal. To provide a task-aligned signal at every time step without state-input gradients, we add a goal loss on the one-step predictions:
$$ L_{\text{goal}}(\mathbf{a}, \mathbf{s}) \;=\; \sum_{t=0}^{T-1} \big\| F_\theta(\bar s_t, a_t) - g \big\|_2^2. \tag{9} $$
This encourages each predicted next state to move toward the goal, supplying gradient information to every action a t while maintaining the grad-cut on s t . This is depicted visually in Figure 2, and theoretically in Appendix A.4. Crucially, due to the stop-gradient ¯ s t , gradients through F θ (¯ s t , a t ) flow only with respect to a t (and not s t ), which prevents the optimizer from exploiting adversarial state-input directions. The final energy that is sampled from is then the following:
$$ L(\mathbf{a}, \mathbf{s}) \;=\; L_{\text{dyn}}(\mathbf{a}, \mathbf{s}) \;+\; \gamma\, L_{\text{goal}}(\mathbf{a}, \mathbf{s}), \tag{10} $$
where γ > 0 is fixed.
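As a hedged sketch (toy linear model, not the paper's code), the grad-cut update implied by Eqs. (8)-(10) can be implemented with a hand-coded stop-gradient; all constants below are illustrative:

```python
import numpy as np

# Hedged sketch of the grad-cut update of Eqs. (8)-(10) for a toy linear
# model F(s, a) = A s + B a. The stop-gradient is implemented by hand:
# s_t enters the one-step prediction only as a constant, so the state
# gradient at s_{t+1} comes solely from residual t.
rng = np.random.default_rng(3)
n, T, gamma, eta_a, eta_s = 4, 15, 0.5, 0.02, 0.1
A = 0.9 * np.eye(n)
B = np.eye(n) + 0.1 * rng.normal(size=(n, n))
s0, g = np.zeros(n), np.ones(n)

def grad_cut_step(a, z):
    s_in  = np.vstack([s0, z])            # sbar_0 .. sbar_{T-1} (detached)
    s_out = np.vstack([z, g])             # s_1 .. s_T, with s_T = g fixed
    mu = s_in @ A.T + a @ B.T             # one-step predictions F(sbar_t, a_t)
    r_dyn, r_goal = mu - s_out, mu - g
    grad_a = 2 * (r_dyn + gamma * r_goal) @ B   # no gradient through s_in
    grad_z = -2 * r_dyn[:-1]              # d/ds_{t+1} of ||mu_t - s_{t+1}||^2
    return a - eta_a * grad_a, z - eta_s * grad_z

a = np.zeros((T, n))
z = np.linspace(s0, g, T + 1)[1:-1]       # linear interpolation init
for _ in range(5000):
    a, z = grad_cut_step(a, z)
```

Without noise, these updates settle at a fixed point, consistent with the linear-case analysis in the appendix.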
Resulting noisy dynamics. The final resulting dynamics, after explicitly writing out ∇ a L , are as follows:
$$ a_t^{k+1} \;=\; a_t^k \;-\; 2\eta_a\, J_t^\top \Big[ \big( F_\theta(\bar s_t^k, a_t^k) - s_{t+1}^k \big) \;+\; \gamma\, \big( F_\theta(\bar s_t^k, a_t^k) - g \big) \Big], \qquad J_t = \nabla_{a_t} F_\theta\big(\bar s_t^k, a_t^k\big), $$
$$ s_{t+1}^{k+1} \;=\; s_{t+1}^k \;-\; 2\eta_s\, \big( s_{t+1}^k - F_\theta(\bar s_t^k, a_t^k) \big) \;+\; \sigma_{\text{state}}\, \xi_t^k, \tag{11} $$
where ξ^k_t ∼ N(0, I). Importantly, while the action dynamics still follow a gradient flow, the states do not follow a true gradient vector field, and thus the resulting dynamics are not Langevin. What results are still noisy dynamics that bias towards valid, goal-oriented trajectories, but whose efficiency requires an extra synchronization step, as described in the following section.
Grad-cut dynamics loss.
We now analyze a variant of the optimization method where we 'cut' the gradient flow through the state evolution in the dynamics. This is akin to removing the adjoint (backward) pass through time, transforming the global trajectory optimization into a sequence of locally regularized problems.
First, we prove that no loss function can both (i) have minimizers corresponding to dynamically feasible trajectories and (ii) have no dependence on the state gradients of the world model F_θ(s, a). We formalize this in the following theorem:
Theorem 5 (Nonexistence of exact dynamics-enforcing losses with Jacobian-free state gradients). Let S ⊆ ℝⁿ and A ⊆ ℝᵐ each be open and connected sets, and fix s_0 ∈ S and g ∈ S. Consider horizon T = 2 with decision variables (s_1, a_0, a_1) ∈ S × A × A and boundary s_2 = g fixed. Let L_F : S × A × A → ℝ be decomposable into the following form:
$$ L_F(s_1, a_0, a_1) \;=\; \Phi\big( s_1,\, a_0,\, a_1,\; F(s_0, a_0),\; F(s_1, a_1) \big), $$
where Φ : S × A × A × S × S → ℝ is C¹. Let
$$ \mathcal{M}(F) \;=\; \big\{ (s_1, a_0, a_1) \;:\; F(s_0, a_0) = s_1,\; F(s_1, a_1) = g \big\} $$
denote the set of trajectories that satisfy the dynamics exactly at both steps (with the fixed boundary conditions). There does not exist such a Φ for which the following two properties hold simultaneously for every C¹ model F: (i) exact minimizers, arg min L_F = M(F); and (ii) Jacobian-invariance, ∇_{s_1} L_F does not depend on the state Jacobian ∇_s F(s_1, a_1).
Proof. We argue by contradiction.
Fix any point (s_1, a_0, a_1) and write y_0 = F(s_0, a_0) and y_1 = F(s_1, a_1). By the chain rule,
$$ \nabla_{s_1} L_F \;=\; \frac{\partial \Phi}{\partial s_1} \;+\; \big( \nabla_s F(s_1, a_1) \big)^{\!\top} \frac{\partial \Phi}{\partial y_1}. $$
We claim that ∂Φ/∂y_1 must vanish identically. To see this, fix arbitrary arguments (s_1, a_0, a_1, y_0, y_1) with (s_0, a_0) ≠ (s_1, a_1) in the domain, and construct two C¹ models F and G such that F(s_0, a_0) = G(s_0, a_0) = y_0 and F(s_1, a_1) = G(s_1, a_1) = y_1, but whose state Jacobians at (s_1, a_1) are prescribed arbitrarily and differ:
$$ \nabla_s F(s_1, a_1) = J_F, \qquad \nabla_s G(s_1, a_1) = J_G, \qquad J_F \neq J_G. $$
¹Equivalently, one can view σ_state in Eq. (5) as setting an effective temperature: larger σ_state yields broader exploration, while smaller σ_state concentrates around local minima.
To construct, choose a small ball B around ( s 1 , a 1 ) contained in S × A , construct a smooth bump function ψ that equals 1 on a smaller concentric ball and 0 outside B (possible since S , A open), and define
$$ F(s, a) \;=\; \big( 1 - \psi(s, a) \big)\, H(s, a) \;+\; \psi(s, a)\, \big[ y_1 + J_F\, (s - s_1) \big] $$
for an arbitrary C 1 base map H . Then F ( s 1 , a 1 ) = y 1 and ∇ s F ( s 1 , a 1 ) = J F . Defining G analogously with J G gives the desired pair; values at ( s 0 , a 0 ) can be kept fixed by choosing disjoint supports or applying the same local surgery at ( s 0 , a 0 ) .
By Jacobian-invariance, ∇ s 1 L F = ∇ s 1 L G at ( s 1 , a 0 , a 1 ) . Subtracting the two chain-rule expressions cancels ∂ Φ /∂s 1 and yields
$$ \big( J_F - J_G \big)^{\!\top}\, \frac{\partial \Phi}{\partial y_1} \;=\; 0. $$
Since J F -J G can be any matrix in R n × n , it follows that
$$ \frac{\partial \Phi}{\partial y_1}\big( s_1, a_0, a_1, y_0, y_1 \big) \;=\; 0. $$
Because the arguments were arbitrary, and since S is connected, we conclude ∂ Φ /∂y 1 ≡ 0 , meaning the loss is independent of y 1 . Therefore there exists a C 1 function ˜ Φ such that for every F ,
$$ L_F(s_1, a_0, a_1) \;=\; \tilde{\Phi}\big( s_1,\, a_0,\, a_1,\; F(s_0, a_0) \big). $$
In particular, if two models F and G satisfy F ( s 0 , a ) = G ( s 0 , a ) for all a ∈ A , then L F ≡ L G as functions of ( s 1 , a 0 , a 1 ) , and hence
$$ \operatorname*{arg\,min} L_F \;=\; \operatorname*{arg\,min} L_G. $$
We now construct such a pair F, G but with different feasible sets, contradicting the exact-minimizers assumption. Pick two distinct actions u, v ∈ A , two distinct states s A , s B ∈ S , and an action a ⋆ ∈ A . Using bump-function surgery as above, construct C 1 models F and G such that
$$ F(s_0, \cdot) \;\equiv\; G(s_0, \cdot), \qquad F(s_0, u) = s_A, \qquad F(s_0, v) = s_B, $$
but with swapped second-step goal reachability:
$$ F(s_A, a^\star) = g, \qquad F(s_B, a^\star) \neq g, $$
$$ G(s_B, a^\star) = g, \qquad G(s_A, a^\star) \neq g. $$
Then (s_A, u, a⋆) ∈ M(F) but (s_A, u, a⋆) ∉ M(G), and (s_B, v, a⋆) ∈ M(G) but (s_B, v, a⋆) ∉ M(F), so M(F) ≠ M(G). On the other hand, since F(s_0, ·) = G(s_0, ·) we have arg min L_F = arg min L_G. By assumption (i),
$$ \mathcal{M}(F) \;=\; \operatorname*{arg\,min} L_F \;=\; \operatorname*{arg\,min} L_G \;=\; \mathcal{M}(G), $$
contradicting M(F) ≠ M(G). ∎
We introduce the stop-gradient operator sg ( · ) , where sg ( x ) = x during the forward evaluation, but ∇ sg ( x ) = 0 during the backward pass. The modified objective incorporating the stop-gradient mechanism is:
$$ L(\mathbf{a}, \mathbf{s}) \;=\; \sum_{t=0}^{T-1} \Big[ \big\| F_\theta(\mathrm{sg}(s_t), a_t) - s_{t+1} \big\|_2^2 \;+\; \beta_t\, \big\| F_\theta(\mathrm{sg}(s_t), a_t) - g \big\|_2^2 \Big], $$
where β t ≥ 0 are the goal loss coefficients. Note that the target g is effectively applied as a penalty on the state at each step to guide the local optimization.
Theorem 6 (Linear convergence to a unique fixed point) . Consider the gradient descent iteration
$$ \mathbf{z}^{k+1} \;=\; \mathbf{z}^k \;-\; \eta\, \nabla_{\mathbf{z}} L\big(\mathbf{z}^k\big), \qquad \mathbf{z} = (\mathbf{s}, \mathbf{a}), $$
where s 0 is fixed and gradients are computed with the stop-gradient convention (i.e., treating sg ( s t ) as constant during differentiation). Assume the linear dynamics setting and that
$$ F_\theta(s, a) \;=\; A s + B a, \qquad \beta_t \in [\beta_{\min}, \beta_{\max}], \quad \beta_{\min} > 0, $$
and that B has full column rank (equivalently, σ min ( B ) > 0 ). Then there exists a stepsize η ∈ (0 , ¯ η ) , where ¯ η depends only on β min , β max , ∥ B ∥ 2 , σ min ( B ) (and not on T ), such that the induced update operator T on z = ( s , a ) has a unique fixed point z ⋆ and the iterates converge linearly:
$$ \big\| \mathbf{z}^k - \mathbf{z}^\star \big\| \;\le\; C\, q^k\, \big\| \mathbf{z}^0 - \mathbf{z}^\star \big\| \qquad \text{for some } C > 0,\; q \in (0, 1). $$
Proof. Write the objective (re-indexing the goal term to match the action index) as
$$ L(\mathbf{s}, \mathbf{a}) \;=\; \sum_{t=0}^{T-1} \Big[ \big\| A\, \mathrm{sg}(s_t) + B a_t - s_{t+1} \big\|_2^2 \;+\; \beta_t\, \big\| A\, \mathrm{sg}(s_t) + B a_t - g \big\|_2^2 \Big]. $$
Define the residuals
$$ r_t \;=\; A\, \mathrm{sg}(s_t) + B a_t - s_{t+1}, \qquad r_t^g \;=\; A\, \mathrm{sg}(s_t) + B a_t - g. $$
Under the stop-gradient convention, sg( s t ) is treated as constant during differentiation, so s t does not receive gradient contributions through the A sg( s t ) terms. It follows that the only state-gradient at time t comes from the appearance of s t as the next state in the previous residual, namely
$$ \nabla_{s_t} L \;=\; -2\, r_{t-1}, $$
with the understanding that r -1 = 0 if s 0 is fixed. Likewise, the action-gradient at time t is
$$ \nabla_{a_t} L \;=\; 2\, B^\top \big( r_t + \beta_t\, r_t^g \big). $$
Therefore, gradient descent with stepsize η > 0 yields the explicit update rules
$$ s_t^{k+1} \;=\; s_t^k \;+\; 2\eta\, r_{t-1}^k, $$
$$ a_t^{k+1} \;=\; a_t^k \;-\; 2\eta\, B^\top \big( r_t^k + \beta_t\, r_t^{g,k} \big). $$
Stack the variables in the time-ordered vector z := ( s 1 , a 0 , s 2 , a 1 , . . . , s T , a T -1 ) . The updates above define an affine map z k +1 = T ( z k ) = J z k + c whose Jacobian J is block lower-triangular with respect to this ordering: indeed, ( s k +1 t +1 , a k +1 t ) depends only on ( s k t , s k t +1 , a k t ) (and on the fixed constants g and s 0 ) and is independent of any future variables ( s k t +2 , a k t +1 , . . . ) . Consequently, the eigenvalues of J are exactly the union of the eigenvalues of its diagonal blocks.
To characterize a diagonal block, fix t ∈ { 0 , . . . , T -1 } and consider the pair y t := ( s t +1 , a t ) . Conditioned on s t (which appears only as a constant inside sg( s t ) for differentiation), the update ( s k +1 t +1 , a k +1 t ) is precisely one gradient step on the quadratic function
$$ \phi_t(s_{t+1}, a_t) \;=\; \big\| A s_t + B a_t - s_{t+1} \big\|_2^2 \;+\; \beta_t\, \big\| A s_t + B a_t - g \big\|_2^2, $$
so the corresponding diagonal block equals I -ηH t where H t = ∇ 2 y t ϕ t is the constant Hessian with respect to ( s t +1 , a t ) :
$$ H_t \;=\; 2 \begin{pmatrix} I & -B \\ -B^\top & (1 + \beta_t)\, B^\top B \end{pmatrix}. $$
Assume β t ∈ [ β min , β max ] with β min > 0 and that B has full column rank, so B ⊤ B ≻ 0 . The Schur complement of the I block is
$$ (1 + \beta_t)\, B^\top B \;-\; B^\top B \;=\; \beta_t\, B^\top B \;\succ\; 0; $$
hence H t ≻ 0 for every t , with eigenvalues uniformly bounded away from 0 and ∞ as t varies:
$$ \mu\, I \;\preceq\; H_t \;\preceq\; L\, I $$
for constants µ, L depending only on β min , β max , ∥ B ∥ 2 , and σ min ( B ) . Choosing any stepsize η such that 0 < η < 2 /L , we obtain for every t that all eigenvalues of I -ηH t lie strictly inside the unit disk, and in particular:
$$ \rho\big( I - \eta H_t \big) \;\le\; \max\big( |1 - \eta\mu|,\; |1 - \eta L| \big) \;=:\; q_0 \;<\; 1, $$
where q_0 is independent of t and T. Since J is block lower-triangular and its diagonal blocks are exactly (I − ηH_t) (up to a fixed permutation corresponding to the stacking order), we conclude
$$ \rho(J) \;=\; \max_t\, \rho\big( I - \eta H_t \big) \;\le\; q_0 \;<\; 1. $$
Fix any q such that q_0 < q < 1. By Gelfand's formula, for any square matrix M and any q > ρ(M), there exists an induced norm ∥·∥† such that ∥M∥† ≤ q. Applying this with M = J yields a norm satisfying
$$ \| J \|_\dagger \;\le\; q. $$
Hence, for all k ≥ 0,
$$ \big\| \mathbf{z}^{k+1} - \mathbf{z}^\star \big\|_\dagger \;=\; \big\| J\, (\mathbf{z}^k - \mathbf{z}^\star) \big\|_\dagger \;\le\; q\, \big\| \mathbf{z}^k - \mathbf{z}^\star \big\|_\dagger, $$
and therefore
$$ \big\| \mathbf{z}^k - \mathbf{z}^\star \big\|_\dagger \;\le\; q^k\, \big\| \mathbf{z}^0 - \mathbf{z}^\star \big\|_\dagger. $$
By norm equivalence in finite dimensions, there exists C > 0 such that
$$ \big\| \mathbf{z}^k - \mathbf{z}^\star \big\| \;\le\; C\, q^k\, \big\| \mathbf{z}^0 - \mathbf{z}^\star \big\|. \qquad \square $$
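The contraction of Theorem 6 can be checked numerically; the sketch below (toy linear model, illustrative sizes and stepsize, not the paper's code) iterates the stop-gradient updates and verifies that a further step barely moves the iterate:

```python
import numpy as np

# Hedged numeric check of Theorem 6 (linear setting F(s,a) = A s + B a):
# the stop-gradient updates form an affine iteration whose Jacobian has
# spectral radius < 1, so (s, a) contracts to a unique fixed point.
rng = np.random.default_rng(4)
n, T, eta, beta = 3, 6, 0.05, 0.5
A = 0.9 * np.eye(n)
B = np.eye(n) + 0.1 * rng.normal(size=(n, n))   # full column rank
s0, g = np.zeros(n), np.ones(n)

def step(a, z):
    # z rows are the free states s_1 .. s_T; s_0 is fixed
    s_in = np.vstack([s0, z[:-1]])
    pred = s_in @ A.T + a @ B.T
    r, rg = pred - z, pred - g                   # residuals r_t and r_t^g
    z_new = z + 2 * eta * r                      # state step: grad is -2 r_t
    a_new = a - 2 * eta * (r + beta * rg) @ B    # action step
    return a_new, z_new

a, z = np.zeros((T, n)), rng.normal(size=(T, n))
for _ in range(10_000):
    a, z = step(a, z)
```

After enough iterations the update map is essentially at its fixed point, consistent with the linear convergence rate being independent of T.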
Notes on the stopgrad optimization. The optimization indeed converges to a fixed point, but one can show that in the linear convex case these stable points are merely greedy rollouts towards the goal. Two things make the optimization in our setting nontrivial: the nonconvexity of the world model F_θ, and the stochastic noise on the states s_t. We now present some characterization of the distribution of trajectories that our planner tends towards.
Let F θ : S × A → S be a differentiable world model and define the stop-gradient one-step prediction
$$ \mu_t \;=\; F_\theta\big( \mathrm{sg}(s_t),\, a_t \big). $$
Consider the stopgrad lifted objective (cf. Eq. (10))
$$ L(\mathbf{a}, \mathbf{s}) \;=\; \sum_{t=0}^{T-1} \Big[ \big\| \mu_t - s_{t+1} \big\|_2^2 \;+\; \gamma\, \big\| \mu_t - g \big\|_2^2 \Big], \tag{92} $$
and the (no-sync) optimization updates
$$ s_{t+1}^{k+1} \;=\; s_{t+1}^k \;-\; 2\eta_s\, \big( s_{t+1}^k - \mu_t^k \big) \;+\; \sigma\, \xi_t^k, \tag{93} $$
$$ a_t^{k+1} \;=\; a_t^k \;-\; \eta_a\, \nabla_{a_t} L\big(\mathbf{a}^k, \mathbf{s}^k\big). \tag{94} $$
Throughout, assume 0 < η s < 1 (for stability of the state contraction).
Theorem 7 (Gaussian tube around one-step predictions) . Fix { ¯ s k t } T -1 t =0 and { a k t } T -1 t =0 at iteration k , and let µ k t = F θ (¯ s k t , a k t ) . Then the state update Eq. (93) satisfies the conditional mean recursion
$$ \mathbb{E}\big[ s_{t+1}^{k+1} \,\big|\, \mathbf{s}^k, \mathbf{a}^k \big] \;=\; (1 - 2\eta_s)\, s_{t+1}^k \;+\; 2\eta_s\, \mu_t^k. \tag{95} $$
Moreover, if µ k t ≡ µ t is held fixed, then s k t +1 converges in distribution to a Gaussian 'tube' around µ t :
$$ s_{t+1}^k \;\xrightarrow{\;d\;}\; \mathcal{N}\Big( \mu_t,\; \tfrac{\sigma^2}{4\eta_s(1-\eta_s)}\, I \Big) \quad \text{as } k \to \infty. \tag{96} $$
Analogously, in continuous optimization-time τ, the limiting SDE
$$ \mathrm{d} s_{t+1} \;=\; -\lambda\, \big( s_{t+1} - \mu_t \big)\, \mathrm{d}\tau \;+\; \sigma\, \mathrm{d} W_\tau $$
has stationary law N(μ_t, (σ²/2λ) I).
Proof. Rewrite Eq. (93) as an affine Gaussian recursion:
$$ s_{t+1}^{k+1} \;=\; (1 - 2\eta_s)\, s_{t+1}^k \;+\; 2\eta_s\, \mu_t^k \;+\; \sigma\, \xi^k. $$
Taking conditional expectation yields Eq. (95). If µ k t ≡ µ t is fixed, the centered process u k := s k t +1 -µ t satisfies u k +1 = (1 -2 η s ) u k + σξ k , i.e. an AR(1) process with contraction factor | 1 -2 η s | < 1 . Its unique stationary covariance Σ tube solves the discrete Lyapunov equation Σ tube = (1 -2 η s ) 2 Σ tube + σ 2 I , giving Eq. (96). The continuous-time statement is standard, given that µ t is fixed.
Theorem 8 (Goal shaping induces goal-directed drift of tube center) . Define µ k t = F θ (¯ s k t , a k t ) and let
$$ J_t^k \;=\; \nabla_{a_t} F_\theta\big( \bar s_t^k, a_t^k \big), \qquad P_t^k \;=\; J_t^k \big( J_t^k \big)^{\!\top}. $$
Assume a first-order linearization in the action step:
$$ \mu_t^{k+1} \;\approx\; \mu_t^k \;+\; J_t^k\, \big( a_t^{k+1} - a_t^k \big). $$
Then the action update Eq. (94) induced by Eq. (92) yields the tube-center evolution
$$ \mu_t^{k+1} \;=\; \mu_t^k \;-\; \alpha\, P_t^k \Big[ \gamma\, \big( \mu_t^k - g \big) \;-\; \varepsilon_{t+1}^k \Big], \qquad \alpha = 2\eta_a, \tag{98} $$
where ε k t +1 denotes the tube residual s k t +1 = µ k t + ε k t +1 . In particular, if E [ ε k t +1 | µ k t ] = 0 , then
$$ \mathbb{E}\big[ \mu_t^{k+1} \,\big|\, \mu_t^k \big] \;=\; \big( I - \alpha\gamma\, P_t^k \big)\, \mu_t^k \;+\; \alpha\gamma\, P_t^k\, g, \tag{99} $$
so in controllable directions the mean prediction µ t moves toward g as an averaging step. If 0 < αγλ max ( P k t ) < 1 , this is a contractive averaging step toward g on Range( P k t ) .
Proof. From Eq. (92), only terms at time t depend on a t via µ t . Using ∇ a t µ t = J t , the gradient is
$$ \nabla_{a_t} L \;=\; 2\, \big( J_t^k \big)^{\!\top} \Big[ \big( \mu_t^k - s_{t+1}^k \big) + \gamma\, \big( \mu_t^k - g \big) \Big], $$
so the update Eq. (94) gives
$$ a_t^{k+1} \;=\; a_t^k \;-\; 2\eta_a\, \big( J_t^k \big)^{\!\top} \Big[ \big( \mu_t^k - s_{t+1}^k \big) + \gamma\, \big( \mu_t^k - g \big) \Big], $$
and the linearization yields
$$ \mu_t^{k+1} \;\approx\; \mu_t^k \;-\; 2\eta_a\, P_t^k \Big[ \big( \mu_t^k - s_{t+1}^k \big) + \gamma\, \big( \mu_t^k - g \big) \Big], $$
which simplifies to Eq. (98) after substituting α = 2 η a . Taking conditional expectation and using E [ ε k t +1 | µ k t ] = 0 yields Eq. (99).
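The goal-directed drift of Theorem 8 can also be simulated directly; the sketch below (illustrative J, α, γ, and noise scale, not the paper's code) iterates the tube-center recursion of Eq. (98) with zero-mean residuals:

```python
import numpy as np

# Hedged check of Theorem 8: with zero-mean tube residuals, the tube
# center mu is pulled toward the goal g by a contractive averaging step
# on the range of P = J J^T. All constants are illustrative.
rng = np.random.default_rng(6)
alpha, gamma = 0.1, 1.0
J = np.eye(3)                             # fully controllable toy Jacobian
P = J @ J.T
g = np.array([2.0, -1.0, 0.5])
mu = np.zeros(3)
for _ in range(300):
    eps = 0.05 * rng.standard_normal(3)   # zero-mean tube residual
    mu = mu - alpha * P @ (gamma * (mu - g) - eps)   # recursion of Eq. (98)
print(np.abs(mu - g).max())
```

After the transient, μ hovers in a small noise ball around g, as the conditional-expectation contraction predicts.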
Unlike the stopgrad lifted-state dynamics (Theorem 7), the rollout distribution is not contracted toward the current rollout, but instead noise accumulates throughout nonlinear iterations of the world model. The stochastic rollout distribution need not concentrate in a local tube around the current deterministic rollout trajectory; it can drift and spread away as the horizon T grows.
Theorem 9 (Mean evolution and non-tube behavior of noisy rollouts) . Consider a rollout-based stochastic trajectory generated in model-time:
$$ s_{t+1} \;=\; F_\theta(s_t, a_t) \;+\; \sigma_{\text{env}}\, \xi_t, \qquad \xi_t \sim \mathcal{N}(0, I), \tag{101} $$
and a dense goal objective along the rollout (e.g. ∑ T -1 t =0 ∥ s t +1 -g ∥ 2 ). Let m t := E [ s t ] and Σ t := Cov( s t ) . The rollout mean obeys the exact identity
$$ m_{t+1} \;=\; \mathbb{E}\big[ F_\theta(s_t, a_t) \big]. \tag{102} $$
In particular, if F θ is affine in s (i.e. F θ ( s, a ) = As + Ba + c ), then
$$ m_{t+1} \;=\; A\, m_t \;+\; B\, a_t \;+\; c. \tag{103} $$
For general nonlinear F θ , a second-order moment expansion yields
$$ m_{t+1} \;\approx\; F_\theta(m_t, a_t) \;+\; \tfrac{1}{2} \sum_{i,j} \frac{\partial^2 F_\theta}{\partial s_i\, \partial s_j}(m_t, a_t)\, \big( \Sigma_t \big)_{ij}, \tag{104} $$
so the mean generally does not follow the deterministic rollout F_θ(m_t, a_t). The covariance propagates forward approximately as
$$ \Sigma_{t+1} \;\approx\; G_t\, \Sigma_t\, G_t^\top \;+\; \sigma_{\text{env}}^2\, I, \qquad G_t = \nabla_s F_\theta(m_t, a_t). \tag{105} $$
Proof. Taking conditional expectation of Eq. (101) gives E[s_{t+1} | s_t] = F_θ(s_t, a_t), and then total expectation implies Eq. (102). For affine F_θ, expectation commutes with F_θ, yielding Eq. (103). For nonlinear F_θ, expand F_θ(s_t, a_t) around m_t by Taylor's theorem; the first-order term vanishes in expectation and the second-order term produces Eq. (104). The covariance recursion Eq. (105) follows by linearizing F_θ(s_t, a_t) ≈ F_θ(m_t, a_t) + G_t(s_t − m_t) and computing Cov(·), adding the independent noise variance σ²_env I. The final 'non-tube' claim follows because there is no optimization-time contraction that repeatedly pulls s_{t+1} back toward a moving center (as in Eq. (95)); instead the forward propagation Eq. (105) typically increases spread, and the mean can deviate from the deterministic path by Eq. (104).
Theorem 7 shows that noisy lifted state updates form a noisy 'tube' around the one-step predictions m_t = F_θ(¯s_t, a_t), keeping exploration local and dynamically consistent in optimization time. Theorem 8 then shows that dense one-step goal shaping moves the tube center toward the goal by a preconditioned averaging step, while the dynamics residual contributes approximately zero-mean stochastic forcing that enables exploration without horizon-coupled backpropagation. In contrast, Theorem 9 shows that noisy rollouts evolve by forward propagation of randomness: the mean follows m_{t+1} = E[F_θ(s_t, a_t)] (not generally the deterministic rollout), and the distribution can drift and spread rather than concentrate in a tube around the current plan.
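The contrast in Theorem 9 can be seen numerically. Below is a toy simulation (the dynamics map and all constants are illustrative, not a learned world model): under noisy rollouts, per-step noise is propagated forward through the nonlinearity, so the spread of states grows with the horizon and the sample mean can drift away from the deterministic rollout.

```python
import numpy as np

# Toy simulation of Theorem 9: under noisy rollouts s_{t+1} = F(s_t, a_t) + eps,
# noise accumulates through the nonlinear iterations, so the spread of states
# grows with the horizon and the sample mean deviates from the deterministic
# rollout. F below is an illustrative scalar map, not a learned model.
rng = np.random.default_rng(0)

def F(s, a):
    return 1.05 * np.sin(s) + a    # mildly expansive nonlinear dynamics

T, n_samples, sigma_env = 20, 5000, 0.1
a_t = 0.1                          # constant action, for simplicity

s = np.zeros(n_samples)            # stochastic rollouts, all from s_0 = 0
s_det = 0.0                        # deterministic rollout
spreads = []
for t in range(T):
    s = F(s, a_t) + sigma_env * rng.standard_normal(n_samples)
    s_det = F(s_det, a_t)
    spreads.append(s.std())

mean_gap = abs(s.mean() - s_det)   # mean drifts from the deterministic path
```

The final spread exceeds the one-step noise level, illustrating the absence of any optimization-time contraction back toward a tube center.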
Full-rollout synchronization
The no-state-gradient updates are designed to be robust to brittle state-input Jacobians of the learned world model. However, the stochastic optimization in Eq. (11) still needs a method of strict descent towards

Figure 4 Virtual states learned through planning. All examples are instantiations of our planner at horizon 50 in the Point-Maze, Wall-Single, and Push-T environments. Regardless of the dynamics constraint relaxation and state noising, directly optimized states find realistic, non-greedy paths towards the goal.
true minima. In practice, we found it beneficial to periodically 'sync' the plan by briefly running standard full-gradient planning on the original rollout objective.
Full-gradient rollout step. Every K sync iterations, we perform J sync steps of gradient descent on the original planning loss
$$
L_{\mathrm{roll}}(\mathbf{a}) = \left\| s_T(\mathbf{a}, s_0) - g \right\|_2^2,
$$
where s T ( a , s 0 ) is computed by sequentially rolling out the world model
$$
s_{t+1} = F_\theta(s_t, a_t), \qquad t = 0, \dots, T - 1.
$$
During this synchronization phase we update only the actions,
$$
\mathbf{a} \leftarrow \mathbf{a} - \eta_{\mathrm{sync}}\, \nabla_{\mathbf{a}} L_{\mathrm{roll}}(\mathbf{a}),
$$
using full backpropagation through the T -step rollout. By keeping these GD steps small relative to the stochastic dynamics of Eq. (11), we benefit from the smoothed loss landscape in Figure 1c for wider exploration, and the sharp but brittle landscape in Figure 1b for refinement.
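The synchronization step above can be sketched on toy linear dynamics (a stand-in for the learned F_θ; all constants are illustrative). The gradient of the terminal loss is obtained by an explicit adjoint sweep, i.e. backpropagation through the full T-step rollout:

```python
import numpy as np

# Minimal sketch of the full-gradient synchronization step on linear toy
# dynamics s_{t+1} = A s_t + B a_t (a stand-in for the learned F_theta).
# Every K_sync stochastic iterations, the planner takes J_sync small GD steps
# on the original rollout loss ||s_T(a, s_0) - g||^2, with gradients computed
# by backpropagation through the full T-step rollout (an adjoint sweep here).
A = 0.9 * np.eye(2)
B = np.eye(2)

def rollout(actions, s0):
    states = [s0]
    for a in actions:
        states.append(A @ states[-1] + B @ a)
    return states

def sync_step(actions, s0, goal, j_sync=50, lr=0.05):
    actions = actions.copy()
    for _ in range(j_sync):
        states = rollout(actions, s0)
        lam = 2.0 * (states[-1] - goal)             # adjoint at the final state
        grad = np.zeros_like(actions)
        for t in reversed(range(len(actions))):     # backprop through time
            grad[t] = B.T @ lam
            lam = A.T @ lam
        actions -= lr * grad
    return actions

s0, goal = np.zeros(2), np.ones(2)
a0 = np.zeros((10, 2))
a_synced = sync_step(a0, s0, goal)
final = rollout(a_synced, s0)[-1]
```

In the full planner these GD steps are kept small relative to the stochastic updates, as described above.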
Results
We evaluate our proposed planner GRASP across two complementary classes of environments designed to test (i) nonconvex long-horizon planning with obstacles and (ii) data-driven visual control under learned dynamics. Concretely, these experiments aim to answer three questions:
- Can the proposed planner overcome the greedy local minima that often trap shooting methods?
- Does the method remain robust as the planning horizon increases?
- Does the proposed planner converge to plans faster than rollout-based planners?
We provide self-ablations in Table 2, demonstrating the value of the planner's main components: the state-gradient detaching, the GD sync steps, and the noise level.
Figure 4 visualizes planning iterations in several navigation environments, illustrating how trajectories initialized far from dynamically consistent rollouts converge to feasible plans that satisfy the learned dynamics.
Baselines
We compare against three commonly used planners. CEM optimizes action sequences by iteratively sampling candidate trajectories, selecting elites, and refitting a sampling distribution. GD directly optimizes the action sequence by backpropagating through the dynamics model. LatCo (Rybkin et al., 2021) optimizes in a lifted latent/state-space by jointly adjusting intermediate latent variables and actions. This setting is different from the original LatCo method, which was applied in a model-based RL environment, but it still provides an important baseline for what is achievable by direct optimization of Eq. (4).
For all methods, we sweep over hyperparameters and report results using the best-performing setting for each environment and horizon. For our planner, we initialize the states {s_t}_{t=0}^{T} as noised samples around the linear interpolation between s_0 and g: s_t = (t/T) g + (1 − t/T) s_0 + z, z ∼ N(0, ϵI), and actions initialized
Table 1 Open-loop planning results on long range Push-T. Reported are success rate (%) and median success time (seconds; successful trials only) across planning horizons. 500 trials per setting. Each cell reports Success / Time .

Figure 5 Success rate over time at a fixed horizon. Success rate over fixed set of open-loop planning tasks for CEM, GD, LatCo (Rybkin et al., 2021), and our planner for a fixed horizon of 50. Curves summarize how quickly each planner makes progress under the learned world model setting when evaluated at a fixed planning horizon. Shaded regions are Wald 95% confidence intervals.
at zero: a_t = 0.
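The initialization just described can be sketched as follows (dimensions are illustrative):

```python
import numpy as np

# Sketch of the planner's initialization described above: virtual states are
# noised samples around the linear interpolation between s_0 and the goal g,
# and actions start at zero. Dimensions here are illustrative.
rng = np.random.default_rng(0)

def init_plan(s0, g, T, action_dim, eps=0.01):
    ts = np.arange(T + 1)[:, None] / T                   # t/T for t = 0..T
    states = ts * g + (1 - ts) * s0                      # linear interpolation
    states += rng.normal(scale=np.sqrt(eps), size=states.shape)  # z ~ N(0, eps I)
    actions = np.zeros((T, action_dim))                  # a_t = 0
    return states, actions

states, actions = init_plan(np.zeros(4), np.ones(4), T=50, action_dim=2)
```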
Environments and evaluation protocol
We evaluate planning on three visual control environments with learned dynamics: PointMaze , WallSingle , and Push-T . World models are trained using the DINO-wm framework (Zhou et al., 2024), following the original paper's setup, where the world model F_θ(s, a) takes 5 actions and predicts 5 steps ahead; that is, if dim(A) = 2, then F_θ takes actions as stacked vectors a ∈ R^10.
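For concreteness, the action stacking works as in the following sketch (values are illustrative):

```python
import numpy as np

# Illustrative sketch of the action stacking described above: with dim(A) = 2,
# 5 primitive actions are concatenated into a single model input a in R^10.
primitive_actions = np.arange(10.0).reshape(5, 2)   # 5 actions, 2-dim each
stacked = primitive_actions.reshape(-1)             # model input, shape (10,)
```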
All reported metrics measure task success under the learned world model. Success is defined as reaching the goal region within the planning horizon.
Long-term planning and horizon scaling
We evaluate planners in the long-horizon regime that our parallelized stochastic planner is designed for, where greedy local minima and optimization instability become the dominant challenges.
GRASP remains reliable as the planning horizon increases: it solves more tasks and finishes a majority of its successful trials faster, showing stronger robustness
Table 2 Ablation studies over the GD sync steps, the noise level for the Langevin dynamics, and whether we use our detached-gradient approach with the goal-reaching objective. GD sync happens every 100 stochastic steps. Ablations are done on the Push-T environment at horizon H = 40. Time reported is the median over successful trials; ablated variants only beat our method on time when their success rate is much lower.
to long horizons than the baselines, as shown in Table 1. Beyond the median completion times, Figure 5 further illustrates the solving speed of our planner, showing that most of its plans converge earlier. At the longer Push-T horizons, where exploring non-greedy optima matters most, GRASP succeeds more often than the baselines and finds the needed non-greedy trajectories, as visualized in Figure 6.
Short-term planning
We also evaluate short-horizon planning, to demonstrate that our planner can match performance on shorter, easier tasks. Table 3 reports success rates across environments for horizons ranging from H = 10 to H = 30 , while Table 4 reports median wall-clock planning times.
Table 3 Short Term Planning. Success rate (%) for Push-T, PointMaze, and WallSingle. 500 trials per setting. Our method has comparable success rates while maintaining consistently low completion times (Table 4).
Across all environments and short horizons, the proposed planner achieves success rates comparable to the baselines. Alongside similar success rates, our proposed planner exhibits consistently low planning times. As shown in Table 4, it is among the fastest methods across all environments and horizons, often significantly faster than sampling-based approaches and competitive with gradient-based optimization, trading a small amount of speed for a higher success rate. These results indicate that even in relatively short and easy planning regimes, our planner remains competitive with the baselines.
Overall, these results demonstrate that GRASP consistently matches strong baselines in short-term planning, while outperforming them in long-horizon settings by avoiding greedy failures and converging more quickly in wall-clock time.
Related work
World modeling has shown significant improvement in sample efficiency for model-based reinforcement learning (Hafner et al., 2025). By learning to predict future states given current states and actions, world models enable planning without access to an interactive environment (Ding et al., 2024). Recent work has focused on learning latent-space representations to handle high-dimensional observations (Assran et al., 2025), with models now demonstrating the ability to scale and generalize across diverse environments (Bar et al., 2025). In this paper, we develop an efficient planning algorithm for action-conditioned video models.
Sampling-based planning in world models traditionally relies on methods like the Cross-Entropy Method (CEM Rubinstein and Kroese (2004)) and random shooting. While these methods are robust and simple to implement, they suffer from serial evaluation bottlenecks and poor scaling with planning horizon length (Bharadhwaj et al., 2020). Recent work proposes performance improvements-such as faster CEM variants with action correlation and memory, parallelized sampling via diffeomorphic transforms, and massively parallel strategies-but fundamental limitations remain for very long horizons (Pinneri et al., 2021; Lai et al., 2022).
Gradient-based planning leverages the differentiability of neural world models to optimize action sequences directly (Jyothir et al., 2023). Early approaches applied backpropagation through time to optimize actions (Thrun et al., 1990), but face challenges with vanishing/exploding gradients and poor conditioning over long horizons. Hybrid strategies combining gradient descent with sampling-based methods-such as interleaving CEM with gradient updates-have shown promise. CEM-GD variants interleave backward passes through the learned model with population-based search for improved convergence and scalability (Bharadhwaj et al., 2020; Huang et al., 2021). Recent work improves gradient-based planners by training world models to be adversarially robust (Parthasarathy et al., 2025); our method instead aims to improve gradient-based planning for more general world models, without any pretraining modifications.
State optimization and multiple shooting in optimal control. The idea of treating states as optimization variables separate from dynamics constraints has a rich history in classical optimal control. Non-condensed QP formulations in MPC (Jerez et al., 2011) decouple state and input optimization for improved numerical properties. Multiple shooting methods (Tamimi and Li, 2009; Diedam and Sager, 2018) break long-horizon problems into shorter segments with continuity constraints. Direct collocation approaches (Bordalba et al., 2022; Nie and Kerrigan, 2025) optimize state and control trajectories simultaneously while enforcing dynamics through collocation constraints. These methods have primarily been applied to systems with known analytical dynamics. Trajectory optimization methods in robotics have developed parallel shooting techniques and GPU-accelerated planning algorithms (Guhathakurta et al., 2022), but most approaches still face fundamental limitations when applied to learned world models, particularly visual world models where dynamics are approximate and high-dimensional.
Noise and regularization in optimization. Stochastic optimization techniques and noise injection have long been recognized for their ability to improve optimization outcomes (Robbins and Monro, 1951), and can help regularize and explore complex loss landscapes (Welling and Teh, 2011; Xu et al., 2018; Bras, 2023; Foret et al., 2020). In the context of planning, noise is commonly used in sampling-based methods, but its systematic incorporation into gradient-based planning for learned models remains underexplored.
Limitations and future work
Although the proposed planner shows clear advantages in long-horizon settings, its benefits are more limited at short horizons. As demonstrated in our experiments, for small planning horizons the planner typically achieves success rates and completion times that are comparable to strong baselines such as CEM and gradient-based optimization, rather than strictly outperforming them. This suggests that the primary gains of the method arise in regimes where long-horizon reasoning and non-greedy planning are essential, rather than in short-horizon settings where simpler methods already perform well.
Hybrid planners (that combine iterations of a rollout-based planner like CEM and a gradient-based planner like GD) have been implemented to get the 'best of both worlds' from the two approaches (Huang et al., 2021), and there are many ways to 'hybridize' our planner as well. We leave exploration of such methods for future work.
Several components of the planner are designed to mitigate the unreliability of state gradients in learned world models. While effective, these modifications introduce additional structure and hyperparameters that would ideally be unnecessary. If state representations induced by the world model were smoother or more geometrically well-behaved in the state space, many of these stabilization mechanisms could be removed, potentially leading to further speed improvements. Promising directions toward this goal include improved representation learning through adversarial training, diffusion-based world models, or other techniques that explicitly regularize the geometry of the learned state space.
Conclusion
World models provide a powerful framework for planning in complex environments, but existing approaches struggle with long horizons, high-dimensional actions, and serial computation. We propose GRASP, a new gradient-based planning algorithm with two key contributions: (a) a lifted planner that optimizes actions together with 'virtual states' in a time-parallel manner, yielding more stable and scalable optimization while allowing direct control over exploration via stochastic state updates, and (b) an action-gradient-only planning variant for learned visual world models that avoids brittle state-input gradients while still exploiting differentiability with respect to actions. Experiments on visual world-model benchmarks show that our approach remains robust as horizons grow and finds non-greedy solutions at a faster rate than commonly used planners such as CEM or vanilla GD.
Theory
In this section, we provide a theoretical analysis of the convergence properties of our planning approach compared to traditional shooting methods. We consider a simplified linear dynamics setting to derive formal convergence guarantees. We reproduce proofs here for self-containedness; see e.g. Ascher et al. (1995) for other theoretical treatments of this problem.
Convergence of various methods in the convex setting
Consider a linear dynamical system with dynamics:
$$
s_{t+1} = A s_t + B a_t,
$$
where s t ∈ R n is the state at time t , a t ∈ R m is the control input, and A ∈ R n × n , B ∈ R n × m .
Let F ( s 0 , a ) : R n × R mT → R n denote the rollout of the dynamics for T timesteps. Under linear dynamics, F takes the compact form:
$$
F(s_0, \mathbf{a}) = A^T s_0 + \sum_{t=0}^{T-1} A^{T-1-t} B\, a_t.
$$
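The compact form can be checked numerically against the sequential rollout (dimensions and matrices below are arbitrary illustrative choices):

```python
import numpy as np

# Numerical check of the compact rollout form: under s_{t+1} = A s_t + B a_t,
# the final state is s_T = A^T s_0 + C_T a with C_T = [A^{T-1} B, ..., B].
# Dimensions and matrices are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n, m, T = 3, 2, 6
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
s0 = rng.normal(size=n)
a = rng.normal(size=(T, m))

# sequential rollout
s = s0
for t in range(T):
    s = A @ s + B @ a[t]

# compact form with the controllability matrix
C_T = np.hstack([np.linalg.matrix_power(A, T - 1 - t) @ B for t in range(T)])
s_compact = np.linalg.matrix_power(A, T) @ s0 + C_T @ a.reshape(-1)
```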
We analyze the optimization landscape of two fundamental formulations for reaching a target state g from initial state s init .
The Shooting Method optimizes only the controls and penalizes the terminal error of the sequential rollout:
$$
J_S(\mathbf{a}) = \left\| F(s_{\mathrm{init}}, \mathbf{a}) - g \right\|_2^2.
$$
The Lifted States Method (or Multiple Shooting) treats both states and controls as optimization variables and minimizes the violation of the dynamics constraints (the physics defects):
$$
J_L(\mathbf{s}, \mathbf{a}) = \sum_{t=0}^{T-1} \left\| s_{t+1} - (A s_t + B a_t) \right\|_2^2,
$$
$$
s_0 = s_{\mathrm{init}}, \qquad s_T = g,
$$
where s = ( s 0 , . . . , s T ) .
Matrix Representation To analyze the convergence properties, we express both objectives in quadratic matrix form.
Shooting Method. Let a = ( a 0 ; . . . ; a T -1 ) ∈ R mT . The final state is linear in a :
$$
s_T = A^T s_0 + C_T\, \mathbf{a},
$$
where C_T = [A^{T−1}B, A^{T−2}B, …, B] ∈ R^{n×mT} is the controllability matrix. The objective is:
$$
J_S(\mathbf{a}) = \left\| A^T s_0 + C_T\, \mathbf{a} - g \right\|_2^2.
$$
The Hessian of this objective is H S = 2 C ⊤ T C T .
Lifted Method. We eliminate the fixed boundary variables s 0 and s T and optimize over the free variables z = ( s 1 ; . . . ; s T -1 ; a 0 ; . . . ; a T -1 ) . The dynamics residuals can be written as a linear system M z -b . The dynamics equations for t = 0 , . . . , T -1 correspond to rows in a large matrix M .
The objective is J L ( z ) = ∥ M z -b ∥ 2 2 , and the Hessian is H L = 2 M ⊤ M . The matrix M has a sparse, block-banded structure.
Smoothness Analysis We compare the smoothness of the two optimization problems by comparing the Lipschitz constants of their gradients, L = λ_max(H).
Theorem 2 (Shooting: Exploding Smoothness) . Let A ⊤ have a real eigenvalue λ with | λ | > 1 and unit eigenvector v (a left eigenvector of A ). Assume B aligns with this mode such that for some input direction w ( ∥ w ∥ 2 = 1 ), the projection |⟨ v, Bw ⟩| = µ > 0 .
Then, the Lipschitz constant of the Shooting method gradient grows exponentially with T :
$$
L_S = \lambda_{\max}(H_S) \ge 2\, \mu^2\, |\lambda|^{2(T-1)}.
$$
Proof. The Lipschitz constant is the maximum eigenvalue of the Hessian H S = 2 C ⊤ T C T , which equals 2 ∥C T ∥ 2 2 . By definition, the spectral norm is the maximum gain over all possible inputs:
$$
\| C_T \|_2 = \sup_{\| \mathbf{a} \|_2 = 1} \| C_T\, \mathbf{a} \|_2.
$$
From the existence of a controllable non-contractive mode, we can construct a unit-norm a test = ( w ; 0 ; . . . ; 0 ) that evaluates to the form:
$$
C_T\, \mathbf{a}_{\mathrm{test}} = A^{T-1} B\, w,
$$
and such that the following holds when projecting to the corresponding unstable eigenvector v :
$$
\| C_T\, \mathbf{a}_{\mathrm{test}} \|_2 \ge \left| \left\langle v, A^{T-1} B w \right\rangle \right| = \left| \left\langle (A^\top)^{T-1} v, B w \right\rangle \right| = |\lambda|^{T-1} \left| \left\langle v, B w \right\rangle \right| = \mu\, |\lambda|^{T-1}.
$$
Squaring this result gives the desired bound.
Theorem 3 (Lifted: Bounded Smoothness) . The Lipschitz constant of the Lifted method gradient is bounded independently of T :
$$
L_L = \lambda_{\max}(H_L) \le 6 \left( 1 + \| A \|_2^2 + \| B \|_2^2 \right).
$$
Proof. The Hessian is H L = 2 M ⊤ M , so λ max ( H L ) = 2 ∥ M ∥ 2 2 . It therefore suffices to upper bound ∥ M ∥ 2 by a constant that does not depend on T .
The matrix M is block-sparse: each block row corresponding to timestep t contains at most three non-zero blocks, namely an identity block (selecting s t +1 ), a dynamics block (multiplying s t ), and an input block (multiplying a t ). Equivalently, the corresponding residual has the form
$$
r_t = s_{t+1} - A s_t - B a_t,
$$
with s 0 and s T treated as fixed boundary values (so r t is affine in the free variables).
Fix any optimization vector z = ( s 1 ; . . . ; s T -1 ; a 0 ; . . . ; a T -1 ) (stacking the free states and controls), and let M z denote the stacked residuals ( r 0 , . . . , r T -1 ) with constants removed. Using the inequality ∥ x + y + z ∥ 2 2 ≤ 3( ∥ x ∥ 2 2 + ∥ y ∥ 2 2 + ∥ z ∥ 2 2 ) and the operator norm bounds ∥ As ∥ 2 ≤ ∥ A ∥ 2 ∥ s ∥ 2 , ∥ Ba ∥ 2 ≤ ∥ B ∥ 2 ∥ a ∥ 2 , we obtain for each t :
$$
\| r_t \|_2^2 = \| s_{t+1} - A s_t - B a_t \|_2^2 \le 3 \left( \| s_{t+1} \|_2^2 + \| A \|_2^2\, \| s_t \|_2^2 + \| B \|_2^2\, \| a_t \|_2^2 \right).
$$
Crucially, due to the banded structure, each free intermediate state s 1 , . . . , s T -1 appears in at most two terms in the sum: once as s t +1 and once as s t . Hence the state contributions can be bounded without any accumulation in T , yielding the following:
$$
\| M \mathbf{z} \|_2^2 = \sum_{t=0}^{T-1} \| r_t \|_2^2 \le 3 \left( \sum_{t} \| s_{t+1} \|_2^2 + \| A \|_2^2 \sum_{t} \| s_t \|_2^2 + \| B \|_2^2 \sum_{t} \| a_t \|_2^2 \right)
$$
$$
\le 3 \left( 1 + \| A \|_2^2 + \| B \|_2^2 \right) \| \mathbf{z} \|_2^2.
$$
Therefore, ∥M∥²_2 = sup_{z ≠ 0} ∥Mz∥²_2 / ∥z∥²_2 ≤ 3(1 + ∥A∥²_2 + ∥B∥²_2), and thus
$$
\lambda_{\max}(H_L) = 2 \| M \|_2^2 \le 6 \left( 1 + \| A \|_2^2 + \| B \|_2^2 \right),
$$
which is independent of T .
Interpretation. While the lower bound on the Shooting method's Hessian required a slightly more restrictive assumption, such a condition is not unreasonable: many realistic systems are not universally stable within the controllable subspace. In these settings, the shooting method forces the optimization to traverse a loss landscape whose curvature varies exponentially in T (requiring exponentially small steps to remain stable), while the lifted states method 'preconditions' the problem by creating variables for intermediate states, decoupling the long-term dependencies into local constraints. This results in a loss landscape with uniform O(1) smoothness with respect to the planning horizon length T.
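This interpretation can be verified numerically: for a system with one unstable mode, the shooting Hessian's top eigenvalue explodes with T, while the lifted Hessian's stays below the 6(1 + ∥A∥² + ∥B∥²) bound. A sketch (matrices are illustrative):

```python
import numpy as np

# Numerical illustration of Theorems 2-3: for an unstable A, the shooting
# Hessian's top eigenvalue grows exponentially with T, while the lifted
# (multiple-shooting) Hessian's stays bounded by 6(1 + ||A||^2 + ||B||^2).
n, m = 2, 2
A = np.diag([1.2, 0.5])          # one unstable mode, |lambda| = 1.2 > 1
B = np.eye(n)

def shooting_smoothness(T):
    C = np.hstack([np.linalg.matrix_power(A, T - 1 - t) @ B for t in range(T)])
    return 2 * np.linalg.norm(C, 2) ** 2          # lambda_max(2 C^T C)

def lifted_smoothness(T):
    # M stacks residuals r_t = s_{t+1} - A s_t - B a_t over free variables
    # z = (s_1, ..., s_{T-1}, a_0, ..., a_{T-1}); s_0 and s_T are fixed.
    n_free = n * (T - 1) + m * T
    M = np.zeros((n * T, n_free))
    for t in range(T):
        if t + 1 <= T - 1:                        # s_{t+1} is a free variable
            M[n*t:n*(t+1), n*t:n*(t+1)] = np.eye(n)
        if t >= 1:                                # s_t is a free variable
            M[n*t:n*(t+1), n*(t-1):n*t] = -A
        M[n*t:n*(t+1), n*(T-1) + m*t : n*(T-1) + m*(t+1)] = -B
    return 2 * np.linalg.norm(M, 2) ** 2          # lambda_max(2 M^T M)

L_shoot = [shooting_smoothness(T) for T in (5, 10, 20)]
L_lift = [lifted_smoothness(T) for T in (5, 10, 20)]
```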

Figure 1 Difficulty of the planning problem . Subfigure (a) shows the distance to the goal in L 2 norm throughout a successful trajectory. This illustrates the difficulty of planning optimization away from a minimizer: successful trajectories often have to first move away from the goal to successfully plan towards it later, resulting in greedy strategies failing. Subfigures (b)-(c) depict the loss landscape at convergence of standard rollout-based planners vs. our planner. The example given is in the Push-T environment at horizon length 50. The axes plotted over are with respect to two random, orthogonal, unit-norm directions in the full action space R 50 × 2 . Our planner loss is taken as in Eq. (10), and for GD the loss is taken as in Eq. (3).
In this work, we introduce a novel gradient-based planning method for learned world models that decouples temporal dynamics into parallel optimized states (rather than serial rollouts) while remaining robust at long horizons and in high-dimensional state spaces. Rather than planning exclusively through a deep, sequential rollout of the dynamics model, our approach optimizes over lifted intermediate states that are treated as independent optimization variables. Our approach makes two fundamental additions that help solve issues for gradient-based planning in higher dimensions: gradient sensitivity and local minima.
Firstly, a fundamental difficulty arises in the setting of vision-based world models. In high-dimensional learned state spaces, gradients with respect to state inputs can be brittle or adversarial, allowing the optimizer to exploit sensitive Jacobian structure rather than discovering physically meaningful transitions. To mitigate this issue, our planner deliberately stops gradients through the state inputs of the world model, while retaining gradients with respect to actions, which we find behave more reasonably. This alone would promote trajectories near the starting state, but with a dense one-step goal loss over the full trajectory, converged trajectories for these noisy iterations tend towards the goal.
Secondly, to address the remaining non-convexity of the lifted state approach, our planner incorporates Langevin-style stochastic updates on the lifted state variables, explicitly injecting noise during optimization to promote exploration of the state space and facilitate escape from unfavorable basins. This stochastic relaxation allows the planner to search over diverse intermediate trajectories while still favoring solutions that approximately satisfy the learned dynamics. Finally, we intermittently apply a small GD step to fine-tune stochastically optimized trajectories towards fully-optimized paths.
Together, these components yield a practical gradient-based planner for learned visual dynamics that remains stable at long horizons while avoiding the failure modes commonly encountered when backpropagating through deep world-model rollouts. We call our planner GRASP (Gradient RelAxed Stochastic Planner) to emphasize its primary components: using gradient information, relaxing the dynamics constraints, and stochastic optimization for exploration. In various settings, we achieve up to +10% success rate at less than half the compute time cost. We also provide a theoretical model for our planner to further illustrate its role. We demonstrate our planner on visual world models trained on problems in D4RL (Fu et al., 2020) and the DeepMind control suite (Tassa et al., 2018).
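As a schematic illustration of how the three components fit together, the following heavily simplified sketch runs the loop on linear toy dynamics (the real planner operates on a learned visual world model; all step sizes, schedules, and constants here are illustrative assumptions):

```python
import numpy as np

# Schematic sketch of the GRASP loop on linear toy dynamics:
# (i) lifted virtual states, updated without state-input gradients;
# (ii) Langevin-style decaying noise on the states for exploration;
# (iii) periodic full-gradient sync steps on the actions through the rollout.
rng = np.random.default_rng(0)
n, T = 2, 10
A_dyn, B_dyn = 0.95 * np.eye(n), np.eye(n)
F = lambda s, a: A_dyn @ s + B_dyn @ a
s0, g = np.zeros(n), np.ones(n)

# init: states noised around the s0 -> g interpolation, actions at zero
S = np.array([t / T * g + (1 - t / T) * s0 for t in range(T + 1)])
S += 0.05 * rng.standard_normal(S.shape)
a = np.zeros((T, n))

eta_s, eta_a, noise0, iters = 0.2, 0.2, 0.05, 300
for it in range(iters):
    noise_t = noise0 * (1 - it / iters)           # decaying noise schedule
    for t in range(T):
        pred = F(S[t], a[t])                      # states treated as constants
        # action-only gradient of the residual ||F(s_t, a_t) - s_{t+1}||^2
        a[t] -= eta_a * 2.0 * B_dyn.T @ (pred - S[t + 1])
        # state update: pull s_{t+1} toward the one-step prediction and
        # (dense goal shaping) toward the goal, plus exploration noise
        grad_s = 2.0 * (S[t + 1] - pred) + 0.2 * (S[t + 1] - g)
        S[t + 1] += -eta_s * grad_s + noise_t * rng.standard_normal(n)
    if it % 100 == 99:                            # periodic full-gradient sync
        for _ in range(20):
            states = [s0]
            for t in range(T):
                states.append(F(states[-1], a[t]))
            lam = 2.0 * (states[-1] - g)          # adjoint at the final state
            for t in reversed(range(T)):          # backprop through time
                a[t] -= 0.02 * B_dyn.T @ lam
                lam = A_dyn.T @ lam

# evaluate the final action plan with a clean sequential rollout
s = s0
for t in range(T):
    s = F(s, a[t])
```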
Gaussian noise regularization
The regularity from noisy gradient descent (or Langevin-based optimization) primarily stems from the smoothing of the Gaussian convolution:
Theorem 4 (Gaussian smoothing contracts gradients and yields scale control) . Let d ≥ 1 , σ > 0 , and let
$$
\phi_\sigma(\epsilon) = \frac{1}{(2 \pi \sigma^2)^{d/2}} \exp\!\left( - \frac{\| \epsilon \|_2^2}{2 \sigma^2} \right)
$$
be the density of N (0 , σ 2 I d ) . For L ∈ C 1 ( R d ) define
$$
L_\sigma(s) := (L * \phi_\sigma)(s) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I_d)} \left[ L(s + \epsilon) \right].
$$
The following statements hold for the resulting convolution:
- (Gradient contraction.) For any 1 ≤ p ≤ ∞, if ∇_s L ∈ L^p(R^d), then
$$
\| \nabla_s L_\sigma \|_{L^p} \le \| \nabla_s L \|_{L^p}.
$$
In particular, if L is Lipschitz, then Lip( L σ ) = ∥∇ s L σ ∥ L ∞ ≤ ∥∇ s L ∥ L ∞ = Lip( L ) .
- (Explicit regularity control by variance.) For any 1 ≤ p ≤ ∞ , if L ∈ L p ( R d ) , then
$$
\| \nabla_s L_\sigma \|_{L^p} \le \frac{\mathbb{E} \| Z \|_2}{\sigma} \, \| L \|_{L^p} \le \frac{\sqrt{d}}{\sigma} \, \| L \|_{L^p}, \qquad Z \sim \mathcal{N}(0, I_d).
$$
Proof. An important property of convolution is that the gradient may be placed on either factor:
$$
\nabla_s L_\sigma = \nabla_s (L * \phi_\sigma) = (\nabla_s L) * \phi_\sigma = L * \nabla \phi_\sigma,
$$
where ∇ ϕ σ ( ϵ ) = -( ϵ /σ 2 ) ϕ σ ( ϵ ) .
For part 1, apply Young's convolution inequality with the fact that ∥ ϕ σ ∥ L 1 = 1 . For any g ∈ L p ,
$$
\| g * \phi_\sigma \|_{L^p} \le \| g \|_{L^p} \, \| \phi_\sigma \|_{L^1} = \| g \|_{L^p}.
$$
Taking g = ∇_s L and using the commutation identity above gives
$$
\| \nabla_s L_\sigma \|_{L^p} = \| (\nabla_s L) * \phi_\sigma \|_{L^p} \le \| \nabla_s L \|_{L^p}.
$$
When p = ∞ , ∥∇ s L ∥ L ∞ is the Lipschitz constant of L , so Lip( L σ ) ≤ Lip( L ) .
For part 2, use the identity ∇_s L_σ = L ∗ ∇φ_σ together with Young's inequality:
$$
\| \nabla_s L_\sigma \|_{L^p} = \| L * \nabla \phi_\sigma \|_{L^p} \le \| L \|_{L^p} \, \| \nabla \phi_\sigma \|_{L^1}.
$$
It remains to compute ∥∇φ_σ∥_{L^1}. Since ∥∇φ_σ(ε)∥_2 = ∥ε∥_2 φ_σ(ε)/σ² and X ∼ N(0, σ² I_d) has density φ_σ, we get
$$
\| \nabla \phi_\sigma \|_{L^1} = \int_{\mathbb{R}^d} \frac{\| \epsilon \|_2}{\sigma^2} \, \phi_\sigma(\epsilon) \, d\epsilon = \frac{\mathbb{E} \| X \|_2}{\sigma^2} = \frac{\mathbb{E} \| Z \|_2}{\sigma} \le \frac{\sqrt{d}}{\sigma},
$$
where X = σZ with Z ∼ N (0 , I d ) . Substituting this into the previous display yields the claimed bound. The p = ∞ statement is the corresponding Lipschitz estimate.
We then get regularity in expectation by noting that, after adding noise to a gradient step, this noise is then fed as input into the next step.
$$
$$
Assume L ∈ C 1 ( R d ) and that E ξ ∥∇ L ( s + ξ ) ∥ < ∞ (so differentiation may be interchanged with expectation). Then
$$
\mathbb{E}_\xi \left[ \nabla L(s + \xi) \right] = \nabla L_\sigma(s).
$$
Moreover, if L ∈ L ∞ ( R d ) , then by Theorem 4 (Part 2),
$$
\| \nabla L_\sigma \|_{L^\infty} \le \frac{\sqrt{d}}{\sigma} \, \| L \|_{L^\infty}.
$$
This then provides motivation for the decaying noise schedule: as an annealing process from a smoother, less accurate gradient structure to a rigid but more accurate gradient structure. For more detailed theory on the regularity of noisy gradient descent, see e.g. Chaudhari et al. (2019) and the references therein.
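The annealing intuition can be illustrated on a one-dimensional double well (the loss, schedule, and constants below are all illustrative): plain gradient descent from a point in the shallow basin stays trapped, while noisy gradient steps with a decaying noise scale, which on average follow the smoothed L_σ, can reach the deeper basin.

```python
import numpy as np

# Toy illustration of the annealing view above: a noisy gradient step
# evaluates the gradient at a Gaussian-perturbed point, so on average it
# follows the smoothed surrogate L_sigma; decaying the noise anneals from
# the smooth surrogate back to the true loss. The double well is illustrative.
rng = np.random.default_rng(1)

# gradient of L(s) = s^4 - 2 s^2 + 0.5 s: global min near s = -1.06,
# shallower local min near s = +0.93
L_grad = lambda s: 4 * s**3 - 4 * s + 0.5

def noisy_gd(s, sigma0, steps=4000, lr=1e-3):
    for k in range(steps):
        sigma = sigma0 * (1 - k / steps)          # decaying noise schedule
        s -= lr * L_grad(s + sigma * rng.standard_normal())
    return s

s_plain = noisy_gd(0.5, sigma0=0.0)     # plain GD: trapped in the shallow basin
s_annealed = noisy_gd(0.5, sigma0=1.0)  # annealed noise: reaches the deep basin
```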
Connection to Boltzmann sampling, state-only noising
The continuous-time Langevin dynamics corresponding to Eq. (5), if we were to also noise the actions a t similarly, has a stationary distribution that concentrates on low-energy regions of L dyn ( s , a ) ; in particular, under mild regularity conditions it admits the Gibbs (Boltzmann) density
$$ p(\mathbf{s}, \mathbf{a}) \ \propto\ \exp\big(-\beta\,\mathcal{L}_{\mathrm{dyn}}(\mathbf{s}, \mathbf{a})\big), $$
for an inverse temperature β > 0 determined by the relative scaling between the drift and diffusion terms.¹ However, since we only noise the states, the converged distribution is not Boltzmann. Written loosely, it collapses in the action variables onto the solutions of the local dynamics problems:
$$ p(\mathbf{s}, \mathbf{a}) \ \propto\ \exp\big(-\beta\,\mathcal{L}_{\mathrm{dyn}}(\mathbf{s}, \mathbf{a}^\ast(\mathbf{s}))\big)\,\delta\big(\mathbf{a}-\mathbf{a}^\ast(\mathbf{s})\big), $$
where a∗(s) = arg min_a L_dyn(s, a).
State-gradient-free Dynamics
We now analyze a variant of the optimization method where we 'cut' the gradient flow through the state evolution in the dynamics. This is akin to removing the adjoint (backward) pass through time, transforming the global trajectory optimization into a sequence of locally regularized problems.
First, we prove that there is no loss function that (i) has minimizers corresponding exactly to dynamically feasible trajectories while (ii) having no dependency on the state gradients through the world model F_θ(s, a). We formalize this in the following theorem:
Theorem 5 (Nonexistence of exact dynamics-enforcing losses with Jacobian-free state gradients) . Let S ⊆ R n and A ⊆ R m each be open and connected sets, and fix s 0 ∈ S and g ∈ S . Consider horizon T = 2 with decision variables ( s 1 , a 0 , a 1 ) ∈ S × A × A and boundary s 2 = g fixed. Let L F : S × A × A → R be decomposable into the following form:
$$ \begin{aligned} L_F(s_1,a_0,a_1) &= \Phi(s_1,a_0,a_1,y_0,y_1), \\ y_0 &= F(s_0,a_0), \\ y_1 &= F(s_1,a_1), \end{aligned} $$
where Φ : S × A × A × S × S → R is C 1 . Let
$$ \mathcal{M}(F) \;=\; \{(s_1,a_0,a_1)\ :\ s_1=F(s_0,a_0)\ \text{and}\ g=F(s_1,a_1)\} $$
denote the set of trajectories that satisfy the dynamics exactly at both steps (with the fixed boundary conditions). There does not exist such a Φ for which the following two properties hold simultaneously for every C¹ model F:
- (i) Minimizers correspond to feasible dynamics: arg min_{(s_1, a_0, a_1)} L_F(s_1, a_0, a_1) = M(F).
- (ii) The loss is insensitive to the dynamics' state gradient: for every C¹ model G : S × A → S, if F(s_0, a_0) = G(s_0, a_0) and F(s_1, a_1) = G(s_1, a_1), then ∇_{s_1} L_F(s_1, a_0, a_1) = ∇_{s_1} L_G(s_1, a_0, a_1).
Proof. We argue by contradiction.
Fix any point ( s 1 , a 0 , a 1 ) and write y 0 = F ( s 0 , a 0 ) and y 1 = F ( s 1 , a 1 ) . By the chain rule,
$$ \nabla_{s_1}L_F(s_1,a_0,a_1) \;=\; \frac{\partial \Phi}{\partial s_1}(s_1,a_0,a_1,y_0,y_1) \;+\; \big(\nabla_s F(s_1,a_1)\big)^{\top} \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1), $$
since y_0 = F(s_0, a_0) does not depend on s_1.
We claim that ∂Φ/∂y_1 must vanish identically. To see this, fix arbitrary arguments (s_1, a_0, a_1, y_0, y_1) with (s_0, a_0) ≠ (s_1, a_1) in the domain, and construct two C¹ models F and G such that F(s_0, a_0) = G(s_0, a_0) = y_0 and F(s_1, a_1) = G(s_1, a_1) = y_1, but whose state Jacobians at (s_1, a_1) are prescribed arbitrarily and differ:
$$ \nabla_s F(s_1,a_1)=J_F, \qquad \nabla_s G(s_1,a_1)=J_G. $$
¹Equivalently, one can view σ_state in Eq. (5) as setting an effective temperature: larger σ_state yields broader exploration, while smaller σ_state concentrates around local minima.
To construct, choose a small ball B around ( s 1 , a 1 ) contained in S × A , construct a smooth bump function ψ that equals 1 on a smaller concentric ball and 0 outside B (possible since S , A open), and define
$$ F(s,a) \;:=\; \big(1-\psi(s,a)\big)\,H(s,a) \;+\; \psi(s,a)\,\big(y_1 + J_F\,(s-s_1)\big) $$
for an arbitrary C 1 base map H . Then F ( s 1 , a 1 ) = y 1 and ∇ s F ( s 1 , a 1 ) = J F . Defining G analogously with J G gives the desired pair; values at ( s 0 , a 0 ) can be kept fixed by choosing disjoint supports or applying the same local surgery at ( s 0 , a 0 ) .
By Jacobian-invariance, ∇ s 1 L F = ∇ s 1 L G at ( s 1 , a 0 , a 1 ) . Subtracting the two chain-rule expressions cancels ∂ Φ /∂s 1 and yields
$$ \big(J_F - J_G\big)^{\top}\,\frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1) \;=\; 0. $$
Since J F -J G can be any matrix in R n × n , it follows that
$$ \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1)=0. $$
Because the arguments were arbitrary, and since S is connected, we conclude ∂ Φ /∂y 1 ≡ 0 , meaning the loss is independent of y 1 . Therefore there exists a C 1 function ˜ Φ such that for every F ,
$$ L_F(s_1,a_0,a_1) \;=\; \tilde\Phi\big(s_1,a_0,a_1,\,F(s_0,a_0)\big). $$
In particular, if two models F and G satisfy F ( s 0 , a ) = G ( s 0 , a ) for all a ∈ A , then L F ≡ L G as functions of ( s 1 , a 0 , a 1 ) , and hence
$$ \arg\min L_F \;=\; \arg\min L_G. $$
We now construct such a pair F, G but with different feasible sets, contradicting the exact-minimizers assumption. Pick two distinct actions u, v ∈ A , two distinct states s A , s B ∈ S , and an action a ⋆ ∈ A . Using bump-function surgery as above, construct C 1 models F and G such that
$$ \begin{aligned} F(s_0,u) &= G(s_0,u) = s_A, \\ F(s_0,v) &= G(s_0,v) = s_B, \end{aligned} $$
but with swapped second-step goal reachability:
$$ \begin{aligned} F(s_A,a^\star) &= g, & F(s_B,a^\star) &\neq g, \\ G(s_A,a^\star) &\neq g, & G(s_B,a^\star) &= g. \end{aligned} $$
Then (s_A, u, a⋆) ∈ M(F) but (s_A, u, a⋆) ∉ M(G), and (s_B, v, a⋆) ∈ M(G) but (s_B, v, a⋆) ∉ M(F), so M(F) ≠ M(G). On the other hand, since F(s_0, ·) = G(s_0, ·) we have arg min L_F = arg min L_G. By assumption (i),
$$ \mathcal{M}(F) \;=\; \arg\min L_F \;=\; \arg\min L_G \;=\; \mathcal{M}(G), $$
contradicting M(F) ≠ M(G). This completes the proof.
We introduce the stop-gradient operator sg ( · ) , where sg ( x ) = x during the forward evaluation, but ∇ sg ( x ) = 0 during the backward pass. The modified objective incorporating the stop-gradient mechanism is:
$$ \mathcal{L}_{sg}(\mathbf{s},\mathbf{a}) \;=\; \sum_{t=0}^{T-1}\big\|s_{t+1}-F_\theta(\mathrm{sg}(s_t),a_t)\big\|_2^2 \;+\; \sum_{t=0}^{T-1}\beta_t\big\|g-F_\theta(\mathrm{sg}(s_t),a_t)\big\|_2^2, $$
where β t ≥ 0 are the goal loss coefficients. Note that the target g is effectively applied as a penalty on the state at each step to guide the local optimization.
Theorem 6 (Linear convergence to a unique fixed point) . Consider the gradient descent iteration
$$ (\mathbf{s}^{k+1},\mathbf{a}^{k+1}) \;=\; (\mathbf{s}^{k},\mathbf{a}^{k}) - \eta\,\nabla \mathcal{L}_{sg}(\mathbf{s}^{k},\mathbf{a}^{k}), $$
where s_0 is fixed and gradients are computed with the stop-gradient convention (i.e., treating sg(s_t) as constant during differentiation). Assume the linear dynamics setting (s_{t+1} = As_t + Ba_t) and that
$$ \beta_t \in [\beta_{\min},\beta_{\max}] \quad \text{for all } t, \qquad \beta_{\min}>0, $$
and that B has full column rank (equivalently, σ min ( B ) > 0 ). Then there exists a stepsize η ∈ (0 , ¯ η ) , where ¯ η depends only on β min , β max , ∥ B ∥ 2 , σ min ( B ) (and not on T ), such that the induced update operator T on z = ( s , a ) has a unique fixed point z ⋆ and the iterates converge linearly:
$$ \|\mathbf{z}^{k}-\mathbf{z}^\star\| \;\le\; Cq^k\,\|\mathbf{z}^{0}-\mathbf{z}^\star\| \qquad \text{for some } q\in(0,1) \text{ and } C > 0. $$
Proof. Write the objective (re-indexing the goal term to match the action index) as
$$ \mathcal{L}_{sg}(\mathbf{s},\mathbf{a}) \;=\; \sum_{t=0}^{T-1}\big\|s_{t+1}-A\,\mathrm{sg}(s_t)-Ba_t\big\|_2^2 \;+\; \sum_{t=0}^{T-1}\beta_t\big\|g-A\,\mathrm{sg}(s_t)-Ba_t\big\|_2^2. $$
Define the residuals
$$ r_t \;:=\; s_{t+1} - A s_t - B a_t, \qquad t=0,\dots,T-1. $$
Under the stop-gradient convention, sg( s t ) is treated as constant during differentiation, so s t does not receive gradient contributions through the A sg( s t ) terms. It follows that the only state-gradient at time t comes from the appearance of s t as the next state in the previous residual, namely
$$ \nabla_{s_t}\mathcal{L}_{sg} \;=\; 2r_{t-1}, \qquad t=1,\dots,T, $$
with the understanding that r_{-1} = 0 since s_0 is fixed. Likewise, the action-gradient at time t is
$$ \nabla_{a_t}\mathcal{L}_{sg} \;=\; -2B^\top\Big(r_t + \beta_t\big(g - A s_t - B a_t\big)\Big), \qquad t=0,\dots,T-1. $$
Therefore, gradient descent with stepsize η > 0 yields the explicit update rules
$$ s_t^{k+1} \;=\; s_t^k - 2\eta\,r_{t-1}^k, \qquad t=1,\dots,T, $$
$$ a_t^{k+1} \;=\; a_t^k + 2\eta\,B^\top\Big(\big(s_{t+1}^k-As_t^k-Ba_t^k\big)+\beta_t\big(g-As_t^k-Ba_t^k\big)\Big), \qquad t=0,\dots,T-1. $$
Stack the variables in the time-ordered vector z := (s_1, a_0, s_2, a_1, …, s_T, a_{T-1}). The updates above define an affine map z^{k+1} = T(z^k) = J z^k + c whose Jacobian J is block lower-triangular with respect to this ordering: indeed, (s_{t+1}^{k+1}, a_t^{k+1}) depends only on (s_t^k, s_{t+1}^k, a_t^k) (and on the fixed constants g and s_0) and is independent of any future variables (s_{t+2}^k, a_{t+1}^k, …). Consequently, the eigenvalues of J are exactly the union of the eigenvalues of its diagonal blocks.
To characterize a diagonal block, fix t ∈ {0, …, T−1} and consider the pair y_t := (s_{t+1}, a_t). Conditioned on s_t (which appears only as a constant inside sg(s_t) for differentiation), the update (s_{t+1}^{k+1}, a_t^{k+1}) is precisely one gradient step on the quadratic function
$$ \phi_t(s_{t+1},a_t) \;:=\; \big\|s_{t+1}-As_t-Ba_t\big\|_2^2 \;+\; \beta_t\big\|g-As_t-Ba_t\big\|_2^2, $$
so the corresponding diagonal block equals I -ηH t where H t = ∇ 2 y t ϕ t is the constant Hessian with respect to ( s t +1 , a t ) :
$$ H_t \;=\; 2\begin{bmatrix} I & -B \\ -B^\top & (1+\beta_t)B^\top B \end{bmatrix}. $$
Assume β t ∈ [ β min , β max ] with β min > 0 and that B has full column rank, so B ⊤ B ≻ 0 . The Schur complement of the I block is
$$ (1+\beta_t)B^\top B - B^\top I^{-1}B \;=\; \beta_t B^\top B \;\succ\; 0, $$
hence H t ≻ 0 for every t , with eigenvalues uniformly bounded away from 0 and ∞ as t varies:
$$ 0 \;<\; \mu I \;\preceq\; H_t \;\preceq\; L I \;<\; \infty, $$
for constants µ, L depending only on β min , β max , ∥ B ∥ 2 , and σ min ( B ) . Choosing any stepsize η such that 0 < η < 2 /L , we obtain for every t that all eigenvalues of I -ηH t lie strictly inside the unit disk, and in particular:
$$ \rho(I-\eta H_t) \;\le\; \max\{|1-\eta\mu|,\ |1-\eta L|\} \;=:\; q_0 \;<\; 1, $$
where q_0 is independent of t and T. Since J is block lower-triangular and its diagonal blocks are exactly (I − ηH_t) (up to a fixed permutation corresponding to the stacking order), we conclude
$$ \rho(J) \;=\; \max_{t}\rho(I-\eta H_t) \;\le\; q_0 \;<\; 1. $$
Fix any q such that q_0 < q < 1. By Gelfand's formula, for any square matrix M and any q > ρ(M), there exists an induced norm ∥·∥† such that ∥M∥† ≤ q. Applying this with M = J yields a norm ∥·∥† satisfying
$$ \|J\|_\dagger \;\le\; q \;<\; 1. $$
In particular, the affine map T is a contraction in ∥·∥†, so by the Banach fixed-point theorem it has a unique fixed point z⋆. Hence, for all k ≥ 0,
$$ \|\mathbf{z}^{k+1}-\mathbf{z}^\star\|_\dagger \;=\; \|J(\mathbf{z}^{k}-\mathbf{z}^\star)\|_\dagger \;\le\; \|J\|_\dagger\,\|\mathbf{z}^{k}-\mathbf{z}^\star\|_\dagger \;\le\; q\,\|\mathbf{z}^{k}-\mathbf{z}^\star\|_\dagger, $$
and therefore
$$ \|\mathbf{z}^{k}-\mathbf{z}^\star\|_\dagger \;\le\; q^k\,\|\mathbf{z}^{0}-\mathbf{z}^\star\|_\dagger. $$
By norm equivalence in finite dimensions, there exists C > 0 such that
$$ \|\mathbf{z}^{k}-\mathbf{z}^\star\|_2 \;\le\; C q^k\,\|\mathbf{z}^{0}-\mathbf{z}^\star\|_2. $$
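As a sanity check of Theorem 6, the following sketch runs the explicit stop-gradient updates from the proof on an illustrative linear system (the matrices, horizon, and stepsize are hand-picked for the example, not taken from the paper). Consistent with the note below on the stopgrad optimization, the iterates converge to the greedy rollout that places every reached state at the goal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 2, 20
A = np.array([[1.1, 0.2], [0.0, 0.9]])   # illustrative (mildly unstable) dynamics
B = np.eye(n)                             # full column rank, as Theorem 6 requires
beta, eta = 1.0, 0.15                     # eta < 2 / lambda_max(H_t) for this B, beta
s0, g = np.zeros(n), np.ones(n)

s = rng.normal(size=(T + 1, n)); s[0] = s0   # s[0] fixed, s[1..T] optimized
a = rng.normal(size=(T, n))

for k in range(3000):
    s_new, a_new = s.copy(), a.copy()
    for t in range(T):
        r_t = s[t + 1] - A @ s[t] - B @ a[t]    # dynamics residual
        s_new[t + 1] -= 2 * eta * r_t            # stop-grad: no gradient flows into s_t
        a_new[t] += 2 * eta * B.T @ (r_t + beta * (g - A @ s[t] - B @ a[t]))
    s, a = s_new, a_new

# At the unique fixed point all residuals vanish and (B being invertible here)
# every reached state equals g: the greedy rollout toward the goal.
print("max |s_t - g| over t >= 1:", np.abs(s[1:] - g).max())
```

Note that the number of iterations needed does not blow up with T, reflecting the horizon-independent contraction factor in the theorem.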
Notes on the stopgrad optimization. The optimization indeed converges to a fixed point, but one can show that these stable points in the linear convex case are merely greedy rollouts toward the goal. Two things make the optimization in our setting nontrivial: the nonconvexity of the world model F_θ, and the stochastic noise on the states s_t. We now present a characterization of the distribution of trajectories that our planner tends toward.
Let F θ : S × A → S be a differentiable world model and define the stop-gradient one-step prediction
$$ \mu_t \;:=\; F_\theta(\bar{s}_t, a_t), \qquad \bar{s}_t = \mathrm{sg}(s_t). $$
Consider the stopgrad lifted objective (cf. Eq. (10))
$$ \mathcal{L}(\mathbf{s},\mathbf{a}) \;=\; \sum_{t=0}^{T-1}\big\|s_{t+1}-\mu_t\big\|_2^2 \;+\; \gamma\sum_{t=0}^{T-1}\big\|\mu_t-g\big\|_2^2, $$
and the (no-sync) optimization updates
$$ s_{t+1}^{k+1} \;=\; s_{t+1}^{k} - 2\eta_s\big(s_{t+1}^k-\mu_t^k\big) + \sigma\,\xi_{t+1}^k, \qquad \xi_{t+1}^k\sim\mathcal{N}(0,I), $$
$$ a_t^{k+1} \;=\; a_t^k - \eta_a\,\nabla_{a_t}\mathcal{L}(\mathbf{s}^k,\mathbf{a}^k). $$
Throughout, assume 0 < η s < 1 (for stability of the state contraction).
Theorem 7 (Gaussian tube around one-step predictions). Fix {s̄_t^k}_{t=0}^{T-1} and {a_t^k}_{t=0}^{T-1} at iteration k, and let μ_t^k = F_θ(s̄_t^k, a_t^k). Then the state update Eq. (93) satisfies the conditional mean recursion
$$ \mathbb{E}\big[s_{t+1}^{k+1}\mid \mu_t^k\big] \;=\; (1-2\eta_s)\,\mathbb{E}\big[s_{t+1}^{k}\mid \mu_t^k\big] \;+\; 2\eta_s\,\mu_t^k. $$
Moreover, if μ_t^k ≡ μ_t is held fixed, then s_{t+1}^k converges in distribution to a Gaussian 'tube' around μ_t:
$$ s_{t+1}^\infty \ \sim\ \mathcal{N}\big(\mu_t,\ \Sigma_{\mathrm{tube}}\big), \qquad \Sigma_{\mathrm{tube}} \;=\; \frac{\sigma^2}{1-(1-2\eta_s)^2}\,I \;=\; \frac{\sigma^2}{4\eta_s(1-\eta_s)}\,I. $$
Analogously, in continuous optimization-time τ , the limiting SDE
$$ ds_{t+1}(\tau) \;=\; -\lambda\big(s_{t+1}(\tau)-\mu_t\big)\,d\tau \;+\; \sigma\,dW_{t+1}(\tau), \qquad \lambda>0, $$
has stationary law N(μ_t, (σ²/2λ) I).
Proof. Rewrite Eq. (93) as an affine Gaussian recursion:
$$ s_{t+1}^{k+1} \;=\; (1-2\eta_s)\,s_{t+1}^{k} \;+\; 2\eta_s\,\mu_t^k \;+\; \sigma\,\xi_{t+1}^k. $$
Taking conditional expectation yields Eq. (95). If µ k t ≡ µ t is fixed, the centered process u k := s k t +1 -µ t satisfies u k +1 = (1 -2 η s ) u k + σξ k , i.e. an AR(1) process with contraction factor | 1 -2 η s | < 1 . Its unique stationary covariance Σ tube solves the discrete Lyapunov equation Σ tube = (1 -2 η s ) 2 Σ tube + σ 2 I , giving Eq. (96). The continuous-time statement is standard, given that µ t is fixed.
Theorem 8 (Goal shaping induces goal-directed drift of tube center). Define μ_t^k = F_θ(s̄_t^k, a_t^k) and let
$$ J_t^k \;:=\; \nabla_{a_t}F_\theta(\bar{s}_t^k,a_t^k)\in\mathbb{R}^{d_s\times d_a}, \qquad P_t^k \;:=\; J_t^k(J_t^k)^\top \;\succeq\; 0. $$
Assume a first-order linearization in the action step:
$$ \mu_t^{k+1} \;\approx\; \mu_t^k \;+\; J_t^k\,\big(a_t^{k+1}-a_t^k\big). $$
Then the action update Eq. (94) induced by Eq. (92) yields the tube-center evolution
$$ \mu_t^{k+1} \;\approx\; \underbrace{\big(I-\alpha\gamma P_t^k\big)\mu_t^k + \alpha\gamma P_t^k\,g}_{\text{goal-directed averaging (drift)}} \;+\; \underbrace{\alpha\,P_t^k\,\varepsilon_{t+1}^k}_{\text{exploration forcing}}, \qquad \alpha := 2\eta_a, $$
where ε_{t+1}^k denotes the tube residual via s_{t+1}^k = μ_t^k + ε_{t+1}^k. In particular, if E[ε_{t+1}^k | μ_t^k] = 0, then
$$ \mathbb{E}\big[\mu_t^{k+1}\mid \mu_t^{k}, s_t^k, a_t^k\big] \;\approx\; \big(I-\alpha\gamma P_t^k\big)\mu_t^k \;+\; \alpha\gamma P_t^k\,g, $$
so in controllable directions the mean prediction μ_t moves toward g as an averaging step. If 0 < αγ λ_max(P_t^k) < 1, this is a contractive averaging step toward g on Range(P_t^k).
Proof. From Eq. (92), only terms at time t depend on a_t, through μ_t^k. Using ∇_{a_t} μ_t^k = J_t^k, the gradient is
$$ \nabla_{a_t}\mathcal{L} \;=\; 2(J_t^k)^\top\big(\mu_t^k-s_{t+1}^k\big) \;+\; 2\gamma\,(J_t^k)^\top\big(\mu_t^k-g\big). $$
Substituting into the linearization Eq. (97) gives
$$ \begin{aligned} \mu_t^{k+1} &\approx \mu_t^k - 2\eta_a\,P_t^k\big(\mu_t^k-s_{t+1}^k\big) - 2\eta_a\gamma\,P_t^k\big(\mu_t^k-g\big) \\ &= \mu_t^k - 2\eta_a\,P_t^k\big(\mu_t^k-(\mu_t^k+\varepsilon_{t+1}^k)\big) - 2\eta_a\gamma\,P_t^k\big(\mu_t^k-g\big) \\ &= \mu_t^k + 2\eta_a\,P_t^k\,\varepsilon_{t+1}^k - 2\eta_a\gamma\,P_t^k\big(\mu_t^k-g\big), \end{aligned} $$
which simplifies to Eq. (98) after substituting α = 2η_a. Taking conditional expectation and using E[ε_{t+1}^k | μ_t^k] = 0 yields Eq. (99).
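When the model is linear in the action, the linearization Eq. (97) is exact, so the drift term of Eq. (98) can be checked directly. The sketch below uses an illustrative model F_θ(s̄, a) = Ba with a hand-picked B (not from the paper), zero tube residual, and a goal g ∈ Range(B); the tube center contracts to g as predicted:

```python
import numpy as np

# Illustrative model, linear in the action: F(s_bar, a) = B a, so J_t = B exactly
# and the linearization Eq. (97) holds with equality. Constants are hand-picked.
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # J_t
g = np.array([1.0, 2.0, 3.0])                         # goal; g = B @ [1, 2], reachable
gamma, eta_a = 1.0, 0.05
alpha = 2 * eta_a                                     # as in Eq. (98)
P = B @ B.T                                           # P_t = J_t J_t^T, eigenvalues {0, 1, 3}

a = np.zeros(2)
mu = B @ a
for k in range(2000):
    # tube residual eps set to 0: only the goal-shaping term of the gradient remains
    a = a - eta_a * (2 * gamma * B.T @ (mu - g))
    mu = B @ a   # exactly mu <- (I - alpha*gamma*P) mu + alpha*gamma*P g
print("||mu - g|| after goal-directed drift:", np.linalg.norm(mu - g))
```

Here αγλ_max(P) = 0.3 < 1, so the averaging step is contractive on Range(P); a goal component outside Range(B) would persist, matching the "controllable directions" caveat.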
Unlike the stopgrad lifted-state dynamics (Theorem 7), the rollout distribution is not contracted toward the current rollout, but instead noise accumulates throughout nonlinear iterations of the world model. The stochastic rollout distribution need not concentrate in a local tube around the current deterministic rollout trajectory; it can drift and spread away as the horizon T grows.
Theorem 9 (Mean evolution and non-tube behavior of noisy rollouts) . Consider a rollout-based stochastic trajectory generated in model-time:
$$ s_{t+1} \;=\; F_\theta(s_t,a_t) \;+\; \sigma_{\mathrm{env}}\,\zeta_t, \qquad \zeta_t\sim\mathcal{N}(0,I), \qquad s_0 \ \text{fixed}, $$
and a dense goal objective along the rollout (e.g. ∑_{t=0}^{T-1} ∥s_{t+1} − g∥²). Let m_t := E[s_t] and Σ_t := Cov(s_t). The rollout mean obeys the exact identity
$$ m_{t+1} \;=\; \mathbb{E}\big[F_\theta(s_t,a_t)\big]. $$
In particular, if F θ is affine in s (i.e. F θ ( s, a ) = As + Ba + c ), then
$$ m_{t+1} \;=\; F_\theta(m_t,a_t). $$
For general nonlinear F θ , a second-order moment expansion yields
$$ m_{t+1} \;\approx\; F_\theta(m_t,a_t) \;+\; \tfrac12\,\big(\mathrm{Hess}_s F_\theta(m_t,a_t)\big):\Sigma_t, $$
so the mean generally does not follow the deterministic rollout F_θ(m_t, a_t). A first-order linearization gives the approximate covariance propagation
$$ \Sigma_{t+1} \;\approx\; G_t\,\Sigma_t\,G_t^\top \;+\; \sigma_{\mathrm{env}}^2 I, \qquad G_t := \nabla_s F_\theta(m_t,a_t). $$
Proof. Taking conditional expectation of Eq. (101) gives E[s_{t+1} | s_t] = F_θ(s_t, a_t), and then total expectation implies Eq. (102). For affine F_θ, expectation commutes with F_θ, yielding Eq. (103). For nonlinear F_θ, expand F_θ(s_t, a_t) around m_t by Taylor's theorem; the first-order term vanishes in expectation and the second-order term produces Eq. (104). The covariance recursion Eq. (105) follows by linearizing F_θ(s_t, a_t) ≈ F_θ(m_t, a_t) + G_t(s_t − m_t) and computing Cov(·), adding the independent noise variance σ_env² I. The final 'non-tube' claim follows because there is no optimization-time contraction that repeatedly pulls s_{t+1} back toward a moving center (as in Eq. (95)); instead, the forward propagation Eq. (105) typically increases spread, and the mean can deviate from the deterministic path by Eq. (104).
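A small Monte Carlo check of this non-tube behavior, using a toy scalar model F_θ(s, a) = tanh(s) + a with illustrative constants (not from the paper): because tanh has curvature along the rollout, the mean of noisy rollouts drifts away from the deterministic rollout, the gap predicted by the second-order term in Eq. (104):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_env, T, N = 0.5, 10, 100_000   # illustrative constants
a_seq = 0.3 * np.ones(T)             # fixed action sequence

def F(s, a):
    # toy scalar world model, nonlinear in s: curvature makes E[F(s,a)] != F(E[s],a)
    return np.tanh(s) + a

# deterministic rollout
s_det = 0.0
for t in range(T):
    s_det = F(s_det, a_seq[t])

# Monte Carlo mean of noisy rollouts, Eq. (101)
s = np.zeros(N)
for t in range(T):
    s = F(s, a_seq[t]) + sigma_env * rng.normal(size=N)

print(f"deterministic rollout s_T = {s_det:.3f}")
print(f"mean of noisy rollouts    = {s.mean():.3f}")
```

The noisy-rollout mean sits visibly below the deterministic value here (tanh is concave where the trajectory concentrates), whereas for an affine model the two would coincide, as in Eq. (103).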
Theorem 7 shows that noisy lifted state updates form a noisy 'tube' around the one-step predictions μ_t = F_θ(s̄_t, a_t), keeping exploration local and dynamically consistent in optimization-time. Theorem 8 then shows that dense one-step goal shaping moves the tube center toward the goal by a preconditioned averaging step, while the dynamics residual contributes approximately zero-mean stochastic forcing that enables exploration without horizon-coupled backpropagation. In contrast, Theorem 9 shows that noisy rollouts evolve by forward propagation of randomness: the mean follows m_{t+1} = E[F_θ(s_t, a_t)] (not generally the deterministic rollout), and the distribution can drift and spread rather than concentrate in a tube around the current plan.
Ablations
To rigorously evaluate the contribution of each individual component within our proposed framework, we conducted an ablation study on the Push-T environment with a horizon of 40 steps. In this experiment, we systematically removed specific modules or mechanisms from the full method while keeping all other hyperparameters constant. This analysis isolates the impact of the key components: the noised-states optimization, the GD sync steps, and whether our state-gradient-free approach is needed.
The results, summarized in Table 2, highlight the critical role of each design choice in achieving robust performance. We observe that the Full Method achieves the highest success rate, validating the synergy between the proposed components.
The stopgrad is an important component; without it, the state optimization feels 'sticky' and is difficult to tune. Noise also remains an important component, allowing the planner to explore around local minima.
We also ablate different, simpler options of a noise and gradient-based planner, building off a normal rollout-based GD planner. We ablate with two kinds of noise:
Table 5 Stochastic GD Baselines Ablation Study, showing alternatives from our method for a stochastic gradient-based planner. For rollouts of the world model F_θ(s, a), values of σ_a correspond to the variance of noise added to actions throughout optimization, and values of σ_s correspond to the variance of noise added to states directly through the rollout; that is, for rollouts of the world model, s_{t+1} = F_θ(s_t, a_t) + ξ, ξ ∼ N(0, σ_s² I). None beat our method's performance, shown in Table 2. Reported time is the average time over all experiments, with reported intervals as 95% CIs.
$$ a_t \ \leftarrow\ a_t + \xi, \qquad \xi \sim \mathcal{N}(0, \sigma_a^2 I), $$
$$ s_{t+1} \;=\; F_\theta(s_t, a_t) + z, \qquad z \sim \mathcal{N}(0, \sigma_s^2 I). $$
These ablations are provided in Table 5.
$$ \mc F_\theta^T(\mb a, s_0) = g, $$
$$ \label{eq:forward} \mc{F}\theta^T(\mathbf{a},s_0) \coloneqq \underbrace{\forward(\dots \forward(\forward(}{\text{$T$ times}}s_0, a_0), a_1), \dots a_{T-1}). $$ \tag{eq:forward}
$$ \begin{split} \min_{\mb{s}, \mathbf{a}}\ \mathcal{L}{\mathrm{dyn}}(\mathbf{s}, \mathbf{a}) \coloneqq \sum{t=0}^{T-1}\big| F_\theta(s_t,a_t)-s_{t+1}\big|_2^2, \ \text{with } s_0 \text{ fixed},\ s_T=g. \end{split} \label{eq:loss_dynamics} $$ \tag{eq:loss_dynamics}
$$ \argmin_s |y - F_\theta(s, a)|_2^2, $$
$$ \label{eq:sync_rollout_loss} \min_{\mathbf{a}} |s_T(\mathbf{a}, s_0) - g|_2^2, $$ \tag{eq:sync_rollout_loss}
$$ \label{eq:app_proof_lindynamics} s_{t+1} = As_t + Ba_t, $$ \tag{eq:app_proof_lindynamics}
$$ \mathcal{F}(s_0, \mathbf{a}) = A^T s_0 + \sum_{t=0}^{T-1} A^{T-1-t} B a_t. $$
$$ \min_{\mathbf{a} \in \mathbb{R}^{mT}} J_S(\mathbf{a}) = |\mathcal{F}(s_{init}, \mathbf{a}) - g|_2^2 \label{eq:shooting} $$ \tag{eq:shooting}
$$ L_S = \lambda_{\max}(H_S) \geq 2\mu^2 |\lambda|^{2(T-1)}. $$
$$ |\mathcal{C}_T|2 = \sup{\mathbf{a} \neq 0} \frac{|\mathcal{C}_T \mathbf{a}|_2}{|\mathbf{a}|_2}. $$
$$ \mathcal{C}T \mathbf{a}{test} = A^{T-1}Bw, $$
$$ |M\mathbf{z}|2^2 ;=; \sum{t=0}^{T-1}|r_t|2^2 ;\le; 3\sum{t=0}^{T-1}\Bigl(|s_{t+1}|_2^2 + |A|_2^2|s_t|_2^2 + |B|_2^2|a_t|_2^2\Bigr). $$
$$ \phi_\sigma(\boldsymbol{\epsilon})=(2\pi\sigma^2)^{-d/2}\exp!\big(-|\boldsymbol{\epsilon}|^2/(2\sigma^2)\big) $$
$$ L_\sigma(\mathbf s)=(\phi_\sigma*L)(\mathbf s)=\int_{\mathbb R^d}\phi_\sigma(\boldsymbol{\epsilon}),L(\mathbf s-\boldsymbol{\epsilon}),d\boldsymbol{\epsilon}. $$
$$ |\nabla_{s} L_\sigma|{L^p}\ \le\ |\nabla{s} L|_{L^p}. $$
$$ |\nabla_{a} L_\sigma|{L^p}\ \le\ \frac{c_d}{\sigma},|L|{L^p}, \qquad c_d:=\mathbb E|Z|=\sqrt{2},\frac{\Gamma\big(\frac{d+1}{2}\big)}{\Gamma\big(\frac{d}{2}\big)},\ \ Z\sim\mathcal N(0,\mathbf I_d). $$
$$ \mathbb E_{\boldsymbol{\xi}}\big[\nabla L(\mathbf s+\boldsymbol{\xi})\big] = \nabla_{!\mathbf s} L_\sigma(\mathbf s). $$
$$ \label{eq:boltzmann_states} p(\mathbf{s}, \mathbf{a}) \ \propto\ \exp \big(-\beta,\mathcal{L}_{\mathrm{dyn}}(\mathbf{s}, \mathbf{a})\big), $$ \tag{eq:boltzmann_states}
$$ \mathcal M(F)={(s_1,a_0,a_1): s_1=F(s_0,a_0)\ \text{and}\ g=F(s_1,a_1)} $$
$$ \nabla_{s_1}L_F(s_1,a_0,a_1)
\frac{\partial \Phi}{\partial s_1}(s_1,a_0,a_1,y_0,y_1) + \big(\nabla_s F(s_1,a_1)\big)^\top \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1). $$
$$ \nabla_sF(s_1,a_1)=J_F,\qquad \nabla_sG(s_1,a_1)=J_G. $$
$$ \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1)=0. $$
$$ \arg\min L_F=\arg\min L_G. $$
$$ F(s_0,a)=G(s_0,a)\quad \text{ for all } a\in\mathcal A, $$
$$ \mathcal{L}_{sg}(\mathbf{s},\mathbf{a})
\sum_{t=0}^{T-1}|s_{t+1}-A,\mathrm{sg}(s_t)-Ba_t|2^2 ;+; \sum{t=0}^{T-1}\beta_t|g-A,\mathrm{sg}(s_t)-Ba_t|_2^2 . $$
$$ \nabla_{s_t}\mathcal{L}{sg} = 2r{t-1}, \qquad t=1,\dots,T, $$
$$ a_t^{k+1}
a_t^k + 2\eta B^\top\Big(\big(s_{t+1}^k-As_t^k-Ba_t^k\big)+\beta_t\big(g-As_t^k-Ba_t^k\big)\Big), \qquad t=0,\dots,T-1. $$
$$ H_t
2\begin{bmatrix} I & -B\[0.2em] -B^\top & (1+\beta_t)B^\top B \end{bmatrix}. $$
$$ (1+\beta_t)B^\top B - B^\top I^{-1}B = \beta_t B^\top B \succ 0, $$
$$ 0<\mu I \preceq H_t \preceq L I < \infty, $$
$$ \rho(I-\eta H_t)\le \max{|1-\eta\mu|,\ |1-\eta L|}=:q<1, $$
$$ |J|_\dagger \le q < 1. $$
$$ |\mathbf z^{k+1}-\mathbf z^\star|\dagger = |J(\mathbf z^{k}-\mathbf z^\star)|\dagger \le |J|\dagger,|\mathbf z^{k}-\mathbf z^\star|\dagger \le q,|\mathbf z^{k}-\mathbf z^\star|_\dagger, $$
$$ \label{eq:ou_mean_recursion} \mathbb{E}!\left[s_{t+1}^{k+1}\mid \mu_t^k\right]
(1-2\eta_s),\mathbb{E}!\left[s_{t+1}^{k}\mid \mu_t^k\right]
- 2\eta_s,\mu_t^k. $$ \tag{eq:ou_mean_recursion}
$$ \label{eq:ou_stationary_tube} s_{t+1}^\infty \ \sim\ \mathcal N!\bigl(\mu_t,\ \Sigma_{\mathrm{tube}}\bigr), \qquad \Sigma_{\mathrm{tube}}
\frac{\sigma^2}{1-(1-2\eta_s)^2},I
\frac{\sigma^2}{4\eta_s(1-\eta_s)},I. $$ \tag{eq:ou_stationary_tube}
$$ \label{eq:mu_linearization} \mu_t^{k+1} \approx \mu_t^k + J_t^k,(a_t^{k+1}-a_t^k). $$ \tag{eq:mu_linearization}
$$ \label{eq:tube_center_update} \mu_t^{k+1} ;\approx; \underbrace{\bigl(I-\alpha\gamma P_t^k\bigr)\mu_t^k + \alpha\gamma P_t^k g}{\textbf{goal-directed averaging (drift)}} ;+; \underbrace{\alpha,P_t^k,\varepsilon{t+1}^k}_{\textbf{exploration forcing}}, \qquad \alpha \coloneqq 2\eta_a, $$ \tag{eq:tube_center_update}
$$ \label{eq:action_grad_decomp} \nabla_{a_t}\mathcal L
2(J_t)^\top(\mu_t-s_{t+1}) + 2\gamma(J_t)^\top(\mu_t-g). $$ \tag{eq:action_grad_decomp}
$$ \label{eq:rollout_noise} s_{t+1} = F_\theta(s_t,a_t) + \sigma_{\mathrm{env}}\zeta_t, \qquad \zeta_t\sim\mathcal N(0,I), \qquad s_0 \ \text{fixed}, $$ \tag{eq:rollout_noise}
$$ \label{eq:rollout_mean_second_order} m_{t+1} \approx F_\theta(m_t,a_t) + \frac12,\big(\mathrm{Hess}s F\theta(m_t,a_t)\big):\Sigma_t, $$ \tag{eq:rollout_mean_second_order}
$$ \label{eq:langevin_euler} s_t^{k+1} &\leftarrow s_t^{k} - \eta_s \nabla_{s_t}\mathcal{L}{\mathrm{dyn}}(\mathbf{s}^{k}, \mathbf{a}^{k}) + \sigma{\text{state}}\xi_t^{k},\ a_t^{k+1} &\leftarrow a_t^{k} - \eta_a \nabla_{a_t}\mathcal{L}_{\mathrm{dyn}}(\mathbf{s}^{k}, \mathbf{a}^{k}), $$ \tag{eq:langevin_euler}
$$ |\mathcal{C}T \mathbf{a}{test}|_2 &\geq |\langle v, A^{T-1}Bw \rangle|, \ &= |\langle (A^\top)^{T-1}v, Bw \rangle|, \ &= |\langle \lambda^{T-1}v, Bw \rangle|, \ &= |\lambda|^{T-1} |\langle v, Bw \rangle|, \ &= \mu |\lambda|^{T-1}. $$
$$ |M\mathbf{z}|_2^2 &\le 3\Bigl((1+|A|2^2)\sum{t=1}^{T-1}|s_t|_2^2 ;+; |B|2^2\sum{t=0}^{T-1}|a_t|_2^2\Bigr),\ &\le 3\bigl(1+|A|_2^2+|B|_2^2\bigr),|\mathbf{z}|_2^2. $$
$$ \nabla_{s} L_\sigma &=\nabla_{s}(\phi_\sigma*L), \ &=(\nabla \phi_\sigma)L, \ &=\phi_\sigma(\nabla_{s}L), $$
$$ |\nabla\phi_\sigma|{L^1} &=\int{\mathbb R^d}\frac{|\boldsymbol{\epsilon}|2}{\sigma^2},\phi\sigma(\boldsymbol{\epsilon}),d\boldsymbol{\epsilon} \ &=\frac{1}{\sigma^2},\mathbb E|X|_2 \ &=\frac{1}{\sigma},\mathbb E|Z|_2 \ &=\frac{c_d}{\sigma}, $$
$$ F(s_0,u)&=G(s_0,u)=s_A,\ F(s_0,v)&=G(s_0,v)=s_B, $$
$$ F(s_A,a^\star) &= g, & F(s_B,a^\star) &\neq g,\ G(s_A,a^\star) &\neq g, & G(s_B,a^\star) &= g. $$
$$ \label{eq:sg_state_update_theory} s_{t+1}^{k+1} &= s_{t+1}^{k} - 2\eta_s\bigl(s_{t+1}^k-\mu_t^k\bigr) + \sigma,\xi_{t+1}^k, \qquad \xi_{t+1}^k\sim\mathcal N(0,I),\ \label{eq:sg_action_update_theory} a_t^{k+1} &= a_t^k - \eta_a \nabla_{a_t}\mathcal{L}(\mathbf{s}^k,\mathbf{a}^k). $$ \tag{eq:sg_state_update_theory}
$$ \mu_t^{k+1} &\approx \mu_t^k -2\eta_a J_t^k(J_t^k)^\top(\mu_t^k-s_{t+1}^k) -2\eta_a\gamma J_t^k(J_t^k)^\top(\mu_t^k-g)\ &= \mu_t^k -2\eta_a P_t^k(\mu_t^k-(\mu_t^k+\varepsilon_{t+1}^k)) -2\eta_a\gamma P_t^k(\mu_t^k-g), $$
$$ s_{t+1} &= F(s_t, a_t) + z,\ z &\sim \mc{N}(0, \sigma_s I). $$
$$ \beta_t \in [\beta_{\min},\beta_{\max}] \quad \text{for all } t,\qquad \beta_{\min}>0, $$
$$ |\mathbf z^{k}-\mathbf z^\star| \le Cq^k|\mathbf z^{0}-\mathbf z^\star|,\qquad \text{for some } q\in(0,1) \text{ and } C > 0. $$
$$ 0.2em] -B^\top & (1+\beta_t)B^\top B \end{bmatrix}. \end{equation} Assume $\beta_t\in[\beta_{\min},\beta_{\max}]$ with $\beta_{\min}>0$ and that $B$ has full column rank, so $B^\top B\succ 0$. The Schur complement of the $I$ block is \begin{equation} (1+\beta_t)B^\top B - B^\top I^{-1}B = \beta_t B^\top B \succ 0, \end{equation} hence $H_t\succ 0$ for every $t$, with eigenvalues uniformly bounded away from $0$ and $\infty$ as $t$ varies: \begin{equation} 0<\mu I \preceq H_t \preceq L I < \infty, \end{equation} for constants $\mu,L$ depending only on $\beta_{\min},\beta_{\max},|B|2,$ and $\sigma{\min}(B)$. Choosing any stepsize $\eta$ such that $0<\eta<2/L$, we obtain for every $t$ that all eigenvalues of $I-\eta H_t$ lie strictly inside the unit disk, and in particular: \begin{equation} \rho(I-\eta H_t)\le \max{|1-\eta\mu|,\ |1-\eta L|}=:q<1, \end{equation} where $q$ is independent of $t$ and $T$. Since $J$ is block lower-triangular and its diagonal blocks are exactly $(I-\eta H_t)$ (up to a fixed permutation corresponding to the stacking order), we conclude \begin{equation} \rho(J)=\max_{t}\rho(I-\eta H_t)\le q_0<1. \end{equation} Fix any $q$ such that $q_0<q<1$. By applying Gelfand's formula, for any square matrix $M$ and any $q>\rho(M)$, there exists an induced norm $|\cdot|\dagger$ such that $|M|\dagger\le q$. Applying this with $M=J$ yields a norm $|\cdot|\dagger$ satisfying \begin{equation} |J|\dagger \le q < 1. \end{equation} Hence, for all $k\ge 0$, \begin{equation} |\mathbf z^{k+1}-\mathbf z^\star|\dagger = |J(\mathbf z^{k}-\mathbf z^\star)|\dagger \le |J|\dagger,|\mathbf z^{k}-\mathbf z^\star|\dagger \le q,|\mathbf z^{k}-\mathbf z^\star|\dagger, \end{equation} and therefore \begin{equation} |\mathbf z^{k}-\mathbf z^\star|\dagger \le q^k |\mathbf z^{0}-\mathbf z^\star|_\dagger . 
\end{equation} By norm equivalence in finite dimensions, there exists $C>0$ such that \begin{equation} |\mathbf z^{k}-\mathbf z^\star|_2 \le C q^k |\mathbf z^{0}-\mathbf z^\star|_2 . \end{equation} \end{proof}
\paragraph{Notes on the stopgrad optimization.} The optimization indeed converges to a fixed point, but one can show that these stable points in the linear convex case are merely the greedy rollouts towards the goal. Two things make the optimization in our setting nontrivial: the nonconvexity of the world model $F_\theta$, and the stochastic noise on the states $s_t$. We now present some characterization on the distribution of trajectories that our planner tends towards.
Let $F_\theta:\mathcal S\times\mathcal A\to\mathcal S$ be a differentiable world model and define the stop-gradient one-step prediction [ \mu_t ;\coloneqq; F_\theta(\bar s_t, a_t), \qquad \bar s_t = \mathrm{stopgrad}(s_t). $$
$$ ds_{t+1}(\tau) = -\lambda\bigl(s_{t+1}(\tau)-\mu_t\bigr)d\tau + \sigma,dW_{t+1}(\tau), \quad \lambda>0, $$
$$ J_t^k ;\coloneqq; \nabla_{a_t}F_\theta(\bar s_t^k,a_t^k)\in\mathbb R^{d_s\times d_a}, \qquad P_t^k ;\coloneqq; J_t^k(J_t^k)^\top \succeq 0. $$
Theorem. [informal] A differentiable loss function over state/action trajectories $\mc L: \mc S^T \times \mc A^T \to \R$ given a world model $F_\theta: \mc S \times \mc A \to \mc S$ cannot satisfy both of the following at the same time: enumerate \item Minimizers of $\mc L$ correspond to dynamically feasible trajectories: $F_\theta(s_t, a_t) = s_{t+1}$, \item $\mc L$ is insensitive to the world model state gradient $\nabla_s F_\theta$. enumerate
Theorem. [Shooting: Exploding Smoothness] Let $A^\top$ have a real eigenvalue $\lambda$ with $|\lambda| > 1$ and unit eigenvector $v$ (a left eigenvector of $A$). Assume $B$ aligns with this mode such that for some input direction $w$ ($|w|2 =1$), the projection $|\langle v, Bw \rangle| = \mu > 0$. Then, the Lipschitz constant of the Shooting method gradient grows exponentially with $T$: equation L_S = \lambda{\max}(H_S) \geq 2\mu^2 |\lambda|^{2(T-1)}. equation
Theorem. [Lifted: Stable Smoothness] The Lipschitz constant of the Lifted gradient is bounded by a constant independent of the horizon $T$. Specifically, one valid bound is: equation L_L = \lambda_{\max}(H_L) \le 6\bigl(1 + |A|_2^2 + |B|_2^2\bigr). equation
Theorem. [Gaussian smoothing contracts gradients and yields scale control] Let $d\ge 1$, $\sigma>0$, and let equation \phi_\sigma(\epsilon)=(2\pi\sigma^2)^{-d/2}\exp!\big(-|\epsilon|^2/(2\sigma^2)\big) equation be the density of $\mathcal N(0,\sigma^2 \mathbf I_d)$. For $L \in \mc C^1(\R^d)$ define equation L_\sigma(\mathbf s)=(\phi_\sigma*L)(\mathbf s)=\int_{\mathbb R^d}\phi_\sigma(\epsilon),L(\mathbf s-\epsilon),d\epsilon. equation The following statements hold of the resulting convolution: enumerate \item (Regularity never decreases.) For any $1\le p\le\infty$, if the distributional gradient $\nabla_{s}L\in L^p(\mathbb R^d)$, then equation |\nabla_{s} L_\sigma|{L^p}\ \le\ |\nabla{s} L|{L^p}. equation In particular, if $L$ is Lipschitz, then $Lip(L\sigma) = |\nabla_{s} L_\sigma|{L^\infty} \le |\nabla{s} L|{L^\infty} = Lip(L)$. \item (Explicit regularity control by variance.) For any $1\le p\le\infty$, if $L\in L^p(\mathbb R^d)$, then equation |\nabla{a} L_\sigma|{L^p}\ \le\ c_d{\sigma},|L|{L^p}, \qquad c_d:=\mathbb E|Z|=2,\Gamma\big(\frac{d+1{2}\big)}{\Gamma\big(d{2}\big)},\ \ Z\sim\mathcal N(0,\mathbf I_d). equation Applying $p=\infty$, we get $Lip(L_\sigma)\le c_d{\sigma},|L|_{L^\infty}$. enumerate
Theorem. [Nonexistence of exact dynamics-enforcing losses with Jacobian-free state gradients] Let $\mathcal S\subseteq\mathbb R^n$ and $\mathcal A\subseteq\mathbb R^m$ each be open and connected sets, and fix $s_0\in\mathcal S$ and $g\in\mathcal S$. Consider horizon $T=2$ with decision variables $(s_1,a_0,a_1)\in\mathcal S\times\mathcal A\times\mathcal A$ and boundary $s_2=g$ fixed. Let $ L_F : \mc S \times \mc A \times \mc A \to \R$ be decomposable into the following form: align L_F(s_1,a_0,a_1) &= \Phi(s_1,a_0,a_1,y_0,y_1),\ y_0 &= F(s_0,a_0),\ y_1 &= F(s_1,a_1), align where $\Phi:\mathcal S\times\mathcal A\times\mathcal A\times\mathcal S\times\mathcal S\to\mathbb R$ is $C^1$. Let equation \mathcal M(F)={(s_1,a_0,a_1): s_1=F(s_0,a_0)\ and\ g=F(s_1,a_1)} equation denote the set of trajectories that satisfy the dynamics exactly at both steps (with the fixed boundary conditions). There does not exist such a $\Phi$ for which the following two properties hold simultaneously for every $C^1$ model $F$: enumerate \item Minimizers correspond to feasible dynamics: $\argmin_{(s_1,a_0,a_1)} L_F(s_1,a_0,a_1) = \mathcal M(F),$ \item Independence of loss to the dynamics' state gradient: for all $G: \mc S \times \mc A \to \mc S, G\in C^1$, $ \Big(F(s_0,a_0)=G(s_0,a_0), F(s_1,a_1)=G(s_1,a_1)\Big) \Rightarrow \nabla_{s_1}L_F(s_1,a_0,a_1)=\nabla_{s_1}L_G(s_1,a_0,a_1)$. enumerate
Theorem. [Linear convergence to a unique fixed point] Consider the gradient descent iteration [ (\mathbf s^{k+1},\mathbf a^{k+1}) = (\mathbf s^{k},\mathbf a^{k})-\eta \nabla \mathcal L_{sg}(\mathbf s^{k},\mathbf a^{k}) ] where $s_0$ is fixed and gradients are computed with the stop-gradient convention (i.e., treating $sg(s_t)$ as constant during differentiation). Assume the linear dynamics setting and that [ \beta_t \in [\beta_{\min},\beta_{\max}] \quad for all t,\qquad \beta_{\min}>0, ] and that $B$ has full column rank (equivalently, $\sigma_{\min}(B)>0$). Then there exists a stepsize $\eta\in (0,\bar \eta)$, where $\bar\eta$ depends only on $\beta_{\min},\beta_{\max},|B|2,\sigma{\min}(B)$ (and not on $T$), such that the induced update operator $\mathcal T$ on $\mathbf z=(\mathbf s,\mathbf a)$ has a unique fixed point $\mathbf z^\star$ and the iterates converge linearly: [ |\mathbf z^{k}-\mathbf z^\star| \le Cq^k|\mathbf z^{0}-\mathbf z^\star|,\qquad for some q\in(0,1) and C > 0. ]
Theorem. [Gaussian tube around one-step predictions] Fix ${\bar s_t^k}{t=0}^{T-1}$ and ${a_t^k}{t=0}^{T-1}$ at iteration $k$, and let $\mu_t^k=F_\theta(\bar s_t^k,a_t^k)$. Then the state update eq:sg_state_update_theory satisfies the conditional mean recursion equation E!\left[s_{t+1}^{k+1}\mid \mu_t^k\right] = (1-2\eta_s),E!\left[s_{t+1}^{k}\mid \mu_t^k\right] + 2\eta_s,\mu_t^k. equation Moreover, if $\mu_t^k\equiv \mu_t$ is held fixed, then $s_{t+1}^k$ converges in distribution to a Gaussian ``tube'' around $\mu_t$: equation s_{t+1}^\infty \ \sim\ \mathcal N!\bigl(\mu_t,\ \Sigma_{tube}\bigr), \qquad \Sigma_{tube} = \sigma^2{1-(1-2\eta_s)^2},I = \sigma^2{4\eta_s(1-\eta_s)},I. equation Analogously, in continuous optimization-time $\tau$, the limiting SDE [ ds_{t+1}(\tau) = -\lambda\bigl(s_{t+1}(\tau)-\mu_t\bigr)d\tau + \sigma,dW_{t+1}(\tau), \quad \lambda>0, ] has stationary law $\mathcal N(\mu_t,\sigma^2{2\lambda}I)$.
Theorem. [Goal shaping induces goal-directed drift of tube center] Define $\mu_t^k = F_\theta(\bar s_t^k,a_t^k)$ and let [ J_t^k ;\coloneqq; \nabla_{a_t}F_\theta(\bar s_t^k,a_t^k)\in\mathbb R^{d_s\times d_a}, \qquad P_t^k ;\coloneqq; J_t^k(J_t^k)^\top \succeq 0. ] Assume a first-order linearization in the action step: equation \mu_t^{k+1} \approx \mu_t^k + J_t^k,(a_t^{k+1}-a_t^k). equation Then the action update eq:sg_action_update_theory induced by eq:sg_full_objective_theory yields the tube-center evolution equation \mu_t^{k+1} ;\approx; \bigl(I-\alpha\gamma P_t^k\bigr)\mu_t^k + \alpha\gamma P_t^k g_{goal-directed averaging (drift)} ;+; \alpha,P_t^k,\varepsilon_{t+1^k}{exploration forcing}, \qquad \alpha \coloneqq 2\eta_a, equation where $\varepsilon{t+1}^k$ denotes the tube residual $s_{t+1}^k=\mu_t^k+\varepsilon_{t+1}^k$. In particular, if $\mathbb E[\varepsilon_{t+1}^k\mid \mu_t^k]=0$, then equation \mathbb E[\mu_t^{k+1}\mid \mu_t^{k}, s_t^k, a_t^k] ;\approx; \bigl(I-\alpha\gamma P_t^k\bigr)\mu_t^k + \alpha\gamma P_t^k g, equation so in controllable directions the mean prediction $\mu_t$ moves toward $g$ as an averaging step. If $0 < \alpha\gamma\lambda_{\max}(P_t^k) < 1$, this is a contractive averaging step toward $g$ on $Range(P_t^k)$.
Theorem. [Mean evolution and non-tube behavior of noisy rollouts] Consider a rollout-based stochastic trajectory generated in model-time: equation s_{t+1} = F_\theta(s_t,a_t) + \sigma_{env}\zeta_t, \qquad \zeta_t\sim\mathcal N(0,I), \qquad s_0 \ fixed, equation and a dense goal objective along the rollout (e.g. $\sum_{t=0}^{T-1}|s_{t+1}-g|^2$). Let $m_t\coloneqq \mathbb E[s_t]$ and $\Sigma_t\coloneqq Cov(s_t)$. The rollout mean obeys the exact identity equation m_{t+1} = \mathbb E!\left[F_\theta(s_t,a_t)\right]. equation In particular, if $F_\theta$ is affine in $s$ (i.e. $F_\theta(s,a)=As+Ba+c$), then equation m_{t+1} = F_\theta(m_t,a_t). equation For general nonlinear $F_\theta$, a second-order moment expansion yields equation m_{t+1} \approx F_\theta(m_t,a_t) + \frac12,\big(Hess_s F_\theta(m_t,a_t)\big):\Sigma_t, equation so the mean generally does not follow the deterministic rollout $F_\theta(m_t,a_t)$. A first-order linearization gives the approximate covariance propagation equation \Sigma_{t+1} \approx G_t,\Sigma_t,G_t^\top + \sigma_{env}^2 I, \qquad G_t\coloneqq \nabla_s F_\theta(\mu_t,a_t). equation
Corollary. [Noise induces smoothed gradients in expectation] Let $\xi \sim \mathcal N(0,\sigma^2 \mathbf I_d)$ and define the Gaussian-smoothed loss
\begin{equation}
L_\sigma(\mathbf s) \coloneqq \mathbb E_{\xi}\bigl[L(\mathbf s+\xi)\bigr].
\end{equation}
Assume $L \in \mathcal C^1(\mathbb R^d)$ and that $\mathbb E_{\xi}\|\nabla L(\mathbf s+\xi)\|<\infty$ (so differentiation may be interchanged with expectation). Then
\begin{equation}
\mathbb E_{\xi}\bigl[\nabla L(\mathbf s+\xi)\bigr] = \nabla_{\mathbf s} L_\sigma(\mathbf s).
\end{equation}
Moreover, if $L\in L^\infty(\mathbb R^d)$, then by Theorem~\ref{thm:gaussian_regularity} (Part 2),
\begin{equation}
\|\nabla_{\mathbf s} L_\sigma\|_{L^\infty} \le \frac{c_d}{\sigma}\,\|L\|_{L^\infty},
\qquad\text{and hence}\qquad
\Bigl\|\mathbb E_{\xi}\bigl[\nabla L(\mathbf s+\xi)\bigr]\Bigr\| \le \frac{c_d}{\sigma}\,\|L\|_{L^\infty}.
\end{equation}
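The smoothing identity is easy to verify in closed form for an assumed one-dimensional test loss $L(x)=\cos(x)$, since then $L_\sigma(s)=e^{-\sigma^2/2}\cos(s)$ exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, s, n = 0.7, 1.2, 400_000

# Monte-Carlo estimate of E[ L'(s + xi) ] with L(x) = cos(x), L'(x) = -sin(x)
xi = sigma * rng.standard_normal(n)
mc_grad = np.mean(-np.sin(s + xi))

# closed-form gradient of the smoothed loss: d/ds [ exp(-sigma^2/2) cos(s) ]
exact_grad = -np.exp(-sigma**2 / 2) * np.sin(s)
print(mc_grad, exact_grad)
```

The Monte-Carlo average of pointwise gradients matches the gradient of the smoothed loss, which is exactly the mechanism the corollary attributes to injected state noise.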
Proof. The Lipschitz constant is the maximum eigenvalue of the Hessian $H_S = 2C_T^\top C_T$, which equals $2\|C_T\|_2^2$. By definition, the spectral norm is the maximum gain over all possible inputs:
\begin{equation}
\|C_T\|_2 = \sup_{a \neq 0} \frac{\|C_T a\|_2}{\|a\|_2}.
\end{equation}
From the existence of a controllable non-contractive mode, we can construct a unit-norm $a_{\mathrm{test}} = (w; 0; \dots; 0)$ that evaluates to
\begin{equation}
C_T a_{\mathrm{test}} = A^{T-1}Bw,
\end{equation}
and such that the following holds when projecting onto the corresponding unstable eigenvector $v$:
\begin{align}
\|C_T a_{\mathrm{test}}\|_2 &\geq |\langle v, A^{T-1}Bw \rangle| \\
&= |\langle (A^\top)^{T-1}v, Bw \rangle| \\
&= |\langle \lambda^{T-1}v, Bw \rangle| \\
&= |\lambda|^{T-1}\, |\langle v, Bw \rangle| \\
&= \mu\, |\lambda|^{T-1}.
\end{align}
Squaring this bound gives the desired result.
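The exponential growth of $\|C_T\|_2$ (and hence of the Lipschitz constant $2\|C_T\|_2^2$) can be checked numerically for an assumed toy system with one unstable mode; the matrices below are illustrative choices, not from the paper:

```python
import numpy as np

# Unstable mode lambda = 1.5 with eigenvector v = e1 (A is diagonal, so left and right
# eigenvectors coincide); B = I gives mu = |<v, B w>| = 1 for the action direction w = e1.
A = np.diag([1.5, 0.8])
B = np.eye(2)
lam, mu = 1.5, 1.0

specs, bounds = [], []
for T in [5, 10, 15]:
    # C_T maps the stacked actions (a_0, ..., a_{T-1}) to s_T; block t is A^(T-1-t) B
    C_T = np.hstack([np.linalg.matrix_power(A, T - 1 - t) @ B for t in range(T)])
    specs.append(np.linalg.norm(C_T, 2))
    bounds.append(mu * abs(lam) ** (T - 1))
print(list(zip(specs, bounds)))
```

The spectral norm dominates the lower bound $\mu|\lambda|^{T-1}$ at every horizon, so the shooting loss's gradient Lipschitz constant blows up at least geometrically in $T$.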
Proof. The Hessian is $H_L = 2M^\top M$, so $\lambda_{\max}(H_L) = 2\|M\|_2^2$. It therefore suffices to upper bound $\|M\|_2$ by a constant that does not depend on $T$. The matrix $M$ is block-sparse: each block row corresponding to timestep $t$ contains at most three non-zero blocks, namely an identity block (selecting $s_{t+1}$), a dynamics block (multiplying $s_t$), and an input block (multiplying $a_t$). Equivalently, the corresponding residual has the form
\begin{equation}
r_t \;=\; As_t + Ba_t - s_{t+1},
\end{equation}
with $s_0$ and $s_T$ treated as fixed boundary values (so $r_t$ is affine in the free variables). Fix any optimization vector $z = (s_1; \dots; s_{T-1}; a_0; \dots; a_{T-1})$ (stacking the free states and controls), and let $Mz$ denote the stacked residuals $(r_0,\dots,r_{T-1})$ with constants removed. Using the inequality $\|x+y+z\|_2^2 \le 3(\|x\|_2^2+\|y\|_2^2+\|z\|_2^2)$ and the operator norm bounds $\|As\|_2 \le \|A\|_2\|s\|_2$ and $\|Ba\|_2 \le \|B\|_2\|a\|_2$, we obtain for each $t$:
\begin{align}
\|r_t\|_2^2 &= \|-s_{t+1} + As_t + Ba_t\|_2^2 \\
&\le 3\Bigl(\|s_{t+1}\|_2^2 + \|A\|_2^2\|s_t\|_2^2 + \|B\|_2^2\|a_t\|_2^2\Bigr).
\end{align}
Summing over $t=0,\dots,T-1$ gives
\begin{equation}
\|Mz\|_2^2 \;=\; \sum_{t=0}^{T-1}\|r_t\|_2^2 \;\le\; 3\sum_{t=0}^{T-1}\Bigl(\|s_{t+1}\|_2^2 + \|A\|_2^2\|s_t\|_2^2 + \|B\|_2^2\|a_t\|_2^2\Bigr).
\end{equation}
Crucially, due to the banded structure, each free intermediate state $s_1,\dots,s_{T-1}$ appears in at most two terms of the sum: once as $s_{t+1}$ and once as $s_t$. Hence the state contributions can be bounded without any accumulation in $T$:
\begin{align}
\|Mz\|_2^2 &\le 3\Bigl((1+\|A\|_2^2)\sum_{t=1}^{T-1}\|s_t\|_2^2 \;+\; \|B\|_2^2\sum_{t=0}^{T-1}\|a_t\|_2^2\Bigr) \\
&\le 3\bigl(1+\|A\|_2^2+\|B\|_2^2\bigr)\,\|z\|_2^2.
\end{align}
Therefore, $\|M\|_2^2 = \sup_{z\neq 0}\|Mz\|_2^2/\|z\|_2^2 \le 3(1+\|A\|_2^2+\|B\|_2^2)$, and thus
\begin{equation}
L_L \;=\; 2\|M\|_2^2 \;\le\; 6\bigl(1+\|A\|_2^2+\|B\|_2^2\bigr),
\end{equation}
which is independent of $T$.
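The $T$-independence of this bound can be verified by assembling the banded residual matrix $M$ explicitly for a random toy system (dimensions and matrices below are illustrative assumptions) and computing $2\|M\|_2^2$ at several horizons:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 2
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
bound = 6 * (1 + np.linalg.norm(A, 2)**2 + np.linalg.norm(B, 2)**2)

L_vals = []
for T in [5, 20, 80]:
    dim = n * (T - 1) + m * T                    # free states s_1..s_{T-1}, actions a_0..a_{T-1}
    M = np.zeros((n * T, dim))
    s_off = lambda t: n * (t - 1)                # column offset of free state s_t
    a_off = lambda t: n * (T - 1) + m * t        # column offset of action a_t
    for t in range(T):                           # residual r_t = A s_t + B a_t - s_{t+1}
        if t >= 1:                               # s_0 is a fixed boundary value
            M[n*t:n*(t+1), s_off(t):s_off(t)+n] = A
        if t + 1 <= T - 1:                       # s_T is a fixed boundary value
            M[n*t:n*(t+1), s_off(t+1):s_off(t+1)+n] = -np.eye(n)
        M[n*t:n*(t+1), a_off(t):a_off(t)+m] = B
    L_vals.append(2 * np.linalg.norm(M, 2)**2)   # gradient Lipschitz constant 2*||M||_2^2
print(L_vals, bound)
```

Unlike the shooting case above, the Lipschitz constants stay below the horizon-independent cap $6(1+\|A\|_2^2+\|B\|_2^2)$ for every $T$.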
Proof. An important property of convolution is that the gradient may be placed on either factor:
\begin{align}
\nabla_{s} L_\sigma &=\nabla_{s}(\phi_\sigma * L) \\
&=(\nabla \phi_\sigma) * L \\
&=\phi_\sigma * (\nabla_{s}L),
\end{align}
where $\nabla \phi_\sigma(\epsilon)=-(\epsilon/\sigma^2)\,\phi_\sigma(\epsilon)$. For part 1, apply Young's convolution inequality together with the fact that $\|\phi_\sigma\|_{L^1}=1$: for any $g\in L^p$,
\begin{equation}
\|g * \phi_\sigma\|_{L^p}\le \|g\|_{L^p}.
\end{equation}
Taking $g=\nabla_{s}L$ and using the commutation identity above gives
\begin{equation}
\|\nabla_{s} L_\sigma\|_{L^p}=\|\phi_\sigma*(\nabla_{s}L)\|_{L^p}\le \|\nabla_{s}L\|_{L^p}.
\end{equation}
When $p=\infty$, $\|\nabla_{s}L\|_{L^\infty}$ is the Lipschitz constant of $L$, so $\operatorname{Lip}(L_\sigma)\le \operatorname{Lip}(L)$. For part 2, again use the commutation identity together with Young's inequality in the form $\|h*k\|_{L^p}\le \|h\|_{L^1}\|k\|_{L^p}$. We obtain
\begin{equation}
\|\nabla_{s} L_\sigma\|_{L^p} =\|(\nabla\phi_\sigma)*L\|_{L^p} \le \|\nabla\phi_\sigma\|_{L^1}\,\|L\|_{L^p}.
\end{equation}
It remains to compute $\|\nabla\phi_\sigma\|_{L^1}$. Since $\|\nabla\phi_\sigma(\epsilon)\|_2=\|\epsilon\|_2\, \phi_\sigma(\epsilon)/\sigma^2$ and $X\sim\mathcal N(0,\sigma^2\mathbf I_d)$ has density $\phi_\sigma$, we get
\begin{align}
\|\nabla\phi_\sigma\|_{L^1} &=\int_{\mathbb R^d}\frac{\|\epsilon\|_2}{\sigma^2}\,\phi_\sigma(\epsilon)\,d\epsilon \\
&=\frac{1}{\sigma^2}\,\mathbb E\|X\|_2 \\
&=\frac{1}{\sigma}\,\mathbb E\|Z\|_2 \\
&=\frac{c_d}{\sigma},
\end{align}
where $X=\sigma Z$ with $Z\sim\mathcal N(0,\mathbf I_d)$. Substituting this into the previous display yields the claimed bound. The $p=\infty$ statement is the corresponding Lipschitz estimate.
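The constant $c_d = \mathbb E\|Z\|_2$ is the mean of a chi distribution with $d$ degrees of freedom, $c_d = \sqrt{2}\,\Gamma((d+1)/2)/\Gamma(d/2)$, which gives a direct numerical check of $\|\nabla\phi_\sigma\|_{L^1}=c_d/\sigma$ (the dimension and $\sigma$ below are arbitrary choices):

```python
import numpy as np
from math import gamma, sqrt

rng = np.random.default_rng(4)
d, sigma, n = 5, 0.8, 400_000

Z = rng.standard_normal((n, d))
c_d_mc = np.linalg.norm(Z, axis=1).mean()                # c_d = E ||Z||_2, Z ~ N(0, I_d)
c_d_exact = sqrt(2) * gamma((d + 1) / 2) / gamma(d / 2)  # chi-distribution mean

# ||grad phi_sigma||_{L1} = E ||X||_2 / sigma^2 with X = sigma * Z, i.e. c_d / sigma
X = sigma * Z
l1_mc = (np.linalg.norm(X, axis=1) / sigma**2).mean()
print(c_d_mc, c_d_exact, l1_mc, c_d_exact / sigma)
```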
Proof. We argue by contradiction. Fix any point $(s_1,a_0,a_1)$ and write $y_0=F(s_0,a_0)$ and $y_1=F(s_1,a_1)$. By the chain rule,
\begin{equation}
\nabla_{s_1}L_F(s_1,a_0,a_1) = \frac{\partial \Phi}{\partial s_1}(s_1,a_0,a_1,y_0,y_1) + \bigl(\nabla_s F(s_1,a_1)\bigr)^\top \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1).
\end{equation}
We claim that $\partial\Phi/\partial y_1$ must vanish identically. To see this, fix arbitrary arguments $(s_1,a_0,a_1,y_0,y_1)$ in the domain with $(s_0,a_0)\ne(s_1,a_1)$, and construct two $C^1$ models $F$ and $G$ such that $F(s_0,a_0) = G(s_0,a_0)=y_0$ and $F(s_1,a_1) = G(s_1,a_1)=y_1$, but whose state Jacobians at $(s_1,a_1)$ are prescribed arbitrarily and differ:
\begin{equation}
\nabla_sF(s_1,a_1)=J_F,\qquad \nabla_sG(s_1,a_1)=J_G.
\end{equation}
To construct these, choose a small ball $B$ around $(s_1,a_1)$ contained in $\mathcal S\times\mathcal A$, construct a smooth bump function $\psi$ that equals $1$ on a smaller concentric ball and $0$ outside $B$ (possible since $\mathcal S, \mathcal A$ are open), and define
\begin{equation}
F(s,a) = H(s,a)+\psi(s,a)\Bigl((y_1-H(s_1,a_1)) + (J_F-\nabla_s H(s_1,a_1))(s-s_1)\Bigr),
\end{equation}
for an arbitrary $C^1$ base map $H$. Then $F(s_1,a_1)=y_1$ and $\nabla_sF(s_1,a_1)=J_F$. Defining $G$ analogously with $J_G$ gives the desired pair; values at $(s_0,a_0)$ can be kept fixed by choosing disjoint supports or applying the same local surgery at $(s_0,a_0)$. By Jacobian-invariance, $\nabla_{s_1}L_F=\nabla_{s_1}L_G$ at $(s_1,a_0,a_1)$. Subtracting the two chain-rule expressions cancels $\partial\Phi/\partial s_1$ and yields
\begin{equation}
(J_F-J_G)^\top \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1)=0.
\end{equation}
Since $J_F-J_G$ can be any matrix in $\mathbb R^{n\times n}$, it follows that
\begin{equation}
\frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1)=0.
\end{equation}
Because the arguments were arbitrary, and since $\mathcal S$ is connected, we conclude $\partial\Phi/\partial y_1\equiv 0$, meaning the loss is independent of $y_1$.
Therefore there exists a $C^1$ function $\widetilde\Phi$ such that for every $F$,
\begin{equation}
L_F(s_1,a_0,a_1)=\widetilde\Phi(s_1,a_0,a_1,F(s_0,a_0)).
\end{equation}
In particular, if two models $F$ and $G$ satisfy $F(s_0,a)=G(s_0,a)$ for all $a\in\mathcal A$, then $L_F\equiv L_G$ as functions of $(s_1,a_0,a_1)$, and hence
\begin{equation}
\arg\min L_F=\arg\min L_G.
\end{equation}
We now construct such a pair $F,G$ with different feasible sets, contradicting the exact-minimizers assumption. Pick two distinct actions $u,v\in\mathcal A$, two distinct states $s_A,s_B\in\mathcal S$, and an action $a^\star\in\mathcal A$. Using bump-function surgery as above, construct $C^1$ models $F$ and $G$ such that
\begin{equation}
F(s_0,a)=G(s_0,a)\quad \text{for all } a\in\mathcal A,
\end{equation}
and
\begin{align}
F(s_0,u)&=G(s_0,u)=s_A, \\
F(s_0,v)&=G(s_0,v)=s_B,
\end{align}
but with swapped second-step goal reachability:
\begin{align}
F(s_A,a^\star) &= g, & F(s_B,a^\star) &\neq g, \\
G(s_A,a^\star) &\neq g, & G(s_B,a^\star) &= g.
\end{align}
Then $(s_A,u,a^\star)\in\mathcal M(F)$ but $(s_A,u,a^\star)\notin\mathcal M(G)$, and $(s_B,v,a^\star)\in\mathcal M(G)$ but $(s_B,v,a^\star)\notin\mathcal M(F)$, so $\mathcal M(F)\neq \mathcal M(G)$. On the other hand, since $F(s_0,\cdot)=G(s_0,\cdot)$ we have $\arg\min L_F=\arg\min L_G$. By assumption (i),
\begin{equation}
\arg\min L_F=\mathcal M(F)\quad \text{and}\quad \arg\min L_G=\mathcal M(G),
\end{equation}
implying $\mathcal M(F)=\mathcal M(G)$, a contradiction. Therefore, no such $\Phi$ can exist.
Proof. Write the objective (re-indexing the goal term to match the action index) as
\begin{equation}
L_{sg}(s,a) = \sum_{t=0}^{T-1}\|s_{t+1}-A\,sg(s_t)-Ba_t\|_2^2 \;+\; \sum_{t=0}^{T-1}\beta_t\|g-A\,sg(s_t)-Ba_t\|_2^2 .
\end{equation}
Define the residuals
\begin{equation}
r_t \coloneqq s_{t+1}-A s_t-Ba_t, \qquad e_t \coloneqq g-A s_t-Ba_t .
\end{equation}
Under the stop-gradient convention, $sg(s_t)$ is treated as constant during differentiation, so $s_t$ does not receive gradient contributions through the $A\,sg(s_t)$ terms. It follows that the only state-gradient at time $t$ comes from the appearance of $s_t$ as the next state in the previous residual, namely
\begin{equation}
\nabla_{s_t}L_{sg} = 2r_{t-1}, \qquad t=1,\dots,T,
\end{equation}
while $s_0$ is fixed and receives no update. Likewise, the action-gradient at time $t$ is
\begin{equation}
\nabla_{a_t}L_{sg} = -2B^\top r_t -2\beta_t B^\top e_t, \qquad t=0,\dots,T-1.
\end{equation}
Therefore, gradient descent with stepsize $\eta>0$ yields the explicit update rules
\begin{equation}
s_t^{k+1} = s_t^k - 2\eta\bigl(s_t^k-As_{t-1}^k-Ba_{t-1}^k\bigr), \qquad t=1,\dots,T,
\end{equation}
and
\begin{equation}
a_t^{k+1} = a_t^k + 2\eta B^\top\Bigl(\bigl(s_{t+1}^k-As_t^k-Ba_t^k\bigr)+\beta_t\bigl(g-As_t^k-Ba_t^k\bigr)\Bigr), \qquad t=0,\dots,T-1.
\end{equation}
Stack the variables in the time-ordered vector $z\coloneqq(s_1,a_0,s_2,a_1,\dots,s_T,a_{T-1})$. The updates above define an affine map $z^{k+1}=T(z^k)=Jz^k+c$ whose Jacobian $J$ is block lower-triangular with respect to this ordering: indeed, $(s_{t+1}^{k+1},a_t^{k+1})$ depends only on $(s_t^k,s_{t+1}^k,a_t^k)$ (and on the fixed constants $g$ and $s_0$) and is independent of any future variables $(s_{t+2}^k,a_{t+1}^k,\dots)$. Consequently, the eigenvalues of $J$ are exactly the union of the eigenvalues of its diagonal blocks. To characterize a diagonal block, fix $t\in\{0,\dots,T-1\}$ and consider the pair $y_t\coloneqq(s_{t+1},a_t)$.
Conditioned on $s_t$ (which appears only as a constant inside $sg(s_t)$ for differentiation), the update of $(s_{t+1},a_t)$ is precisely one gradient step on the quadratic function
\begin{equation}
\phi_t(s_{t+1},a_t; s_t) = \|s_{t+1}-As_t-Ba_t\|_2^2 + \beta_t\|g-As_t-Ba_t\|_2^2,
\end{equation}
so the corresponding diagonal block equals $I-\eta H_t$, where $H_t=\nabla_{y_t}^2\phi_t$ is the constant Hessian with respect to $(s_{t+1},a_t)$:
\begin{equation}
H_t = 2\begin{bmatrix} I & -B \\[0.2em] -B^\top & (1+\beta_t)B^\top B \end{bmatrix}.
\end{equation}
Assume $\beta_t\in[\beta_{\min},\beta_{\max}]$ with $\beta_{\min}>0$ and that $B$ has full column rank, so $B^\top B\succ 0$. The Schur complement of the $I$ block is
\begin{equation}
(1+\beta_t)B^\top B - B^\top I^{-1}B = \beta_t B^\top B \succ 0,
\end{equation}
hence $H_t\succ 0$ for every $t$, with eigenvalues uniformly bounded away from $0$ and $\infty$ as $t$ varies:
\begin{equation}
0<\mu I \preceq H_t \preceq L I < \infty,
\end{equation}
for constants $\mu,L$ depending only on $\beta_{\min},\beta_{\max},\|B\|_2$, and $\sigma_{\min}(B)$. Choosing any stepsize $\eta$ with $0<\eta<2/L$, all eigenvalues of $I-\eta H_t$ lie strictly inside the unit disk for every $t$, and in particular
\begin{equation}
\rho(I-\eta H_t)\le \max\{|1-\eta\mu|,\ |1-\eta L|\}=:q_0<1,
\end{equation}
where $q_0$ is independent of $t$ and $T$. Since $J$ is block lower-triangular and its diagonal blocks are exactly $(I-\eta H_t)$ (up to a fixed permutation corresponding to the stacking order), we conclude
\begin{equation}
\rho(J)=\max_{t}\rho(I-\eta H_t)\le q_0<1.
\end{equation}
Fix any $q$ with $q_0<q<1$. By Gelfand's formula, for any square matrix $M$ and any $q>\rho(M)$ there exists an induced norm $\|\cdot\|_\dagger$ such that $\|M\|_\dagger\le q$. Applying this with $M=J$ yields a norm $\|\cdot\|_\dagger$ satisfying
\begin{equation}
\|J\|_\dagger \le q < 1.
\end{equation}
Hence, for all $k\ge 0$,
\begin{equation}
\|\mathbf z^{k+1}-\mathbf z^\star\|_\dagger = \|J(\mathbf z^{k}-\mathbf z^\star)\|_\dagger \le \|J\|_\dagger\,\|\mathbf z^{k}-\mathbf z^\star\|_\dagger \le q\,\|\mathbf z^{k}-\mathbf z^\star\|_\dagger,
\end{equation}
and therefore
\begin{equation}
\|\mathbf z^{k}-\mathbf z^\star\|_\dagger \le q^k \|\mathbf z^{0}-\mathbf z^\star\|_\dagger .
\end{equation}
By norm equivalence in finite dimensions, there exists $C>0$ such that
\begin{equation}
\|\mathbf z^{k}-\mathbf z^\star\|_2 \le C q^k \|\mathbf z^{0}-\mathbf z^\star\|_2 .
\end{equation}
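The explicit update rules in this proof can be simulated directly; the sketch below uses an assumed toy linear system (note that convergence is independent of $A$, since $A$ never enters the diagonal blocks $H_t$) and measures the distance to a near-exact fixed point:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2
A = 0.5 * rng.standard_normal((n, n))   # arbitrary dynamics: does not affect convergence
B = np.eye(n)                           # full column rank, so B^T B > 0
g = np.array([1.0, -1.0])
T, beta, eta = 8, 1.0, 0.1              # eta < 2/L for this B and beta
s0 = np.zeros(n)

s = rng.standard_normal((T + 1, n)); s[0] = s0   # s_1..s_T free, s_0 fixed
a = rng.standard_normal((T, n))

def step(s, a):
    """One simultaneous gradient step of the stop-gradient objective."""
    s_new, a_new = s.copy(), a.copy()
    for t in range(1, T + 1):           # state update: grad_{s_t} = 2 r_{t-1}
        r_prev = s[t] - A @ s[t - 1] - B @ a[t - 1]
        s_new[t] = s[t] - 2 * eta * r_prev
    for t in range(T):                  # action update: -grad_{a_t}/2 = B^T (r_t + beta e_t)
        r = s[t + 1] - A @ s[t] - B @ a[t]
        e = g - A @ s[t] - B @ a[t]
        a_new[t] = a[t] + 2 * eta * B.T @ (r + beta * e)
    return s_new, a_new

s_ref, a_ref = s.copy(), a.copy()
for _ in range(3000):                   # near-exact fixed point z*
    s_ref, a_ref = step(s_ref, a_ref)

errs = []
for _ in range(400):
    s, a = step(s, a)
    errs.append(np.linalg.norm(s - s_ref) + np.linalg.norm(a - a_ref))
print(errs[0], errs[-1])
```

The error decays geometrically at a rate that does not depend on the horizon, matching $\rho(J)\le q_0<1$.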
Proof. Rewrite \eqref{eq:sg_state_update_theory} as an affine Gaussian recursion:
\[
s_{t+1}^{k+1} = (1-2\eta_s)s_{t+1}^k + 2\eta_s\mu_t^k + \sigma\xi_{t+1}^k.
\]
Taking the conditional expectation yields \eqref{eq:ou_mean_recursion}. If $\mu_t^k\equiv \mu_t$ is fixed, the centered process $u^k\coloneqq s_{t+1}^k-\mu_t$ satisfies $u^{k+1}=(1-2\eta_s)u^k+\sigma\xi^k$, i.e.\ an AR(1) process with contraction factor $|1-2\eta_s|<1$. Its unique stationary covariance $\Sigma_{\mathrm{tube}}$ solves the discrete Lyapunov equation $\Sigma_{\mathrm{tube}}=(1-2\eta_s)^2\Sigma_{\mathrm{tube}}+\sigma^2 I$, giving \eqref{eq:ou_stationary_tube}. The continuous-time statement is standard, given that $\mu_t$ is fixed.
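The stationary tube variance predicted by the Lyapunov equation, $\sigma^2/\bigl(1-(1-2\eta_s)^2\bigr)$, is easy to confirm empirically by simulating many independent AR(1) chains (constants below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
eta_s, sigma = 0.2, 0.5
contraction = 1 - 2 * eta_s                    # |1 - 2*eta_s| < 1
var_pred = sigma**2 / (1 - contraction**2)     # solves V = (1-2*eta_s)^2 V + sigma^2

u = np.zeros(200_000)                          # many independent tube chains
for _ in range(200):                           # long enough to reach stationarity
    u = contraction * u + sigma * rng.standard_normal(u.shape)
var_mc = u.var()
print(var_mc, var_pred)
```

The empirical variance saturates at the predicted level rather than growing with the number of optimization steps, which is exactly the "tube" behavior contrasted with noisy rollouts below.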
Proof. From \eqref{eq:sg_full_objective_theory}, only the terms at time $t$ depend on $a_t$, through $\mu_t$. Using $\nabla_{a_t}\mu_t = J_t$, the gradient is
\begin{equation}
\nabla_{a_t}\mathcal L = 2(J_t)^\top(\mu_t-s_{t+1}) + 2\gamma(J_t)^\top(\mu_t-g).
\end{equation}
Thus the action step is
\[
a_t^{k+1}-a_t^k = -\eta_a\nabla_{a_t}\mathcal L = -2\eta_a(J_t^k)^\top(\mu_t^k-s_{t+1}^k) -2\eta_a\gamma(J_t^k)^\top(\mu_t^k-g).
\]
Applying the linearization \eqref{eq:mu_linearization},
\begin{align*}
\mu_t^{k+1} &\approx \mu_t^k -2\eta_a J_t^k(J_t^k)^\top(\mu_t^k-s_{t+1}^k) -2\eta_a\gamma J_t^k(J_t^k)^\top(\mu_t^k-g) \\
&= \mu_t^k -2\eta_a P_t^k\bigl(\mu_t^k-(\mu_t^k+\varepsilon_{t+1}^k)\bigr) -2\eta_a\gamma P_t^k(\mu_t^k-g),
\end{align*}
which simplifies to \eqref{eq:tube_center_update} after substituting $\alpha=2\eta_a$. Taking the conditional expectation and using $\mathbb E[\varepsilon_{t+1}^k\mid \mu_t^k]=0$ yields \eqref{eq:tube_center_expected_drift}.
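The per-timestep gradient expression at the start of this proof can be checked against central finite differences for an assumed small nonlinear model (a one-layer tanh map; purely illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(7)
d_s, d_a, gamma = 3, 2, 0.5
W1 = rng.standard_normal((d_s, d_s))
W2 = rng.standard_normal((d_s, d_a))
s, a = rng.standard_normal(d_s), rng.standard_normal(d_a)
s_next, g = rng.standard_normal(d_s), rng.standard_normal(d_s)

F = lambda s, a: np.tanh(W1 @ s + W2 @ a)        # toy differentiable world model
def L(a):
    mu = F(s, a)
    return np.sum((mu - s_next)**2) + gamma * np.sum((mu - g)**2)

mu = F(s, a)
J = (1 - mu**2)[:, None] * W2                    # J_t = d mu / d a for the tanh map
grad_analytic = 2 * J.T @ (mu - s_next) + 2 * gamma * J.T @ (mu - g)

eps = 1e-6                                       # central finite differences
grad_fd = np.array([(L(a + eps * e) - L(a - eps * e)) / (2 * eps) for e in np.eye(d_a)])
print(grad_analytic, grad_fd)
```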
Proof. Taking the conditional expectation of \eqref{eq:rollout_noise} gives $\mathbb E[s_{t+1}\mid s_t]=F_\theta(s_t,a_t)$, and then the tower property implies \eqref{eq:rollout_mean_exact}. For affine $F_\theta$, expectation commutes with $F_\theta$, yielding \eqref{eq:rollout_mean_linear}. For nonlinear $F_\theta$, expand $F_\theta(s_t,a_t)$ around $m_t$ by Taylor's theorem; the first-order term vanishes in expectation and the second-order term produces \eqref{eq:rollout_mean_second_order}. The covariance recursion \eqref{eq:rollout_covariance} follows by linearizing $F_\theta(s_t,a_t)\approx F_\theta(m_t,a_t)+G_t(s_t-m_t)$ and computing the covariance, adding the independent noise variance $\sigma_{\mathrm{env}}^2 I$. The final ``non-tube'' claim follows because there is no optimization-time contraction that repeatedly pulls $s_{t+1}$ back toward a moving center (as in \eqref{eq:ou_mean_recursion}); instead, the forward propagation \eqref{eq:rollout_covariance} typically increases the spread, and the mean can deviate from the deterministic path by \eqref{eq:rollout_mean_second_order}.
\begin{algorithm}
\begin{algorithmic}[1]
\Require Initial observation $o_0$, goal $o_g$, world model $F_\theta$, horizon $T$, steps $K$, learning rates $\eta_a, \eta_s$.
\Ensure $\mathbf{a}^*$
\State $s_0 \gets \text{encode}(o_0)$, $s_T \gets \text{encode}(o_g)$
\State $\mathbf{a} \gets \mathbf{a}^0$;\quad $\mathbf{s} \gets \text{init\_states}(s_0,s_T)$
\For{$k=0$ \textbf{to} $K-1$}
\State Compute $\mathcal{L}$ as in \eqref{eq:loss_full}
\State \textbf{Joint step:} $(\mathbf{a},\mathbf{s}) \gets (\mathbf{a},\mathbf{s}) - (\eta_a\nabla_{\mathbf{a}}\mathcal{L},\ \eta_s\nabla_{\mathbf{s}}\mathcal{L})$
\State \textbf{Stochastic state:} $s_t \gets s_t + \sigma_{\text{state}}\xi_t$, $\xi_t\sim\mathcal{N}(0,I)$, for $t=1,\dots,T-1$
\State \textbf{Sync (periodic):} if $k \bmod K_{\mathrm{sync}}=0$, rollout from $s_0$: $s_{t+1}\gets F_\theta(s_t,a_t)$ for $t=0,\dots,T-1$,
\State \hspace{3.2em} then take a gradient-descent step $\mathbf{a}\gets \mathbf{a}-\eta_{\mathrm{sync}}\nabla_{\mathbf{a}}\|s_T-g\|_2^2$
\EndFor
\State \Return $\mathbf{a}^* \gets \mathbf{a}$
\end{algorithmic}
\end{algorithm}
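The loop above can be sketched end-to-end in NumPy for a known linear model, with analytic gradients substituted for autodiff. The dynamics, the per-step goal-shaping weight $\beta$, and all constants below are illustrative assumptions for a toy problem, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(9)
n, T, K, K_sync = 2, 10, 600, 25
A, B = np.eye(n), 0.5 * np.eye(n)              # assumed known linear model F(s,a) = As + Ba
s0, g = np.zeros(n), np.array([1.0, -1.0])     # start and (latent) goal
eta_a = eta_s = 0.1
beta, sigma_state, eta_sync = 0.5, 0.005, 0.05

F = lambda s, a: A @ s + B @ a

def rollout(a_seq):
    s = s0.copy()
    for a in a_seq:
        s = F(s, a)
    return s

a_seq = np.zeros((T, n))
s_seq = np.linspace(0.0, 1.0, T + 1)[:, None] * g[None, :]   # linear state init s_0 -> g

for k in range(K):
    # joint gradient step on dynamics + goal-shaping penalties (stop-gradient on s_t in F)
    r = s_seq[1:] - (s_seq[:-1] @ A.T + a_seq @ B.T)         # dynamics residuals r_t
    e = g - (s_seq[:-1] @ A.T + a_seq @ B.T)                 # goal-shaping residuals e_t
    a_seq += 2 * eta_a * (r + beta * e) @ B                  # -eta_a * grad_a
    s_seq[1:] -= 2 * eta_s * r                               # -eta_s * grad_s
    # stochastic exploration on the intermediate states
    s_seq[1:T] += sigma_state * rng.standard_normal((T - 1, n))
    # periodic sync: re-anchor states on a true rollout, one shooting GD step on the goal
    if k % K_sync == 0:
        s = s0.copy()
        for t in range(T):
            s_seq[t + 1] = s = F(s, a_seq[t])
        grad = 2 * (s_seq[T] - g)                            # d||s_T - g||^2 / d s_T
        for t in reversed(range(T)):
            a_seq[t] -= eta_sync * B.T @ grad
            grad = A.T @ grad                                # backprop through time

err0 = np.linalg.norm(rollout(np.zeros((T, n))) - g)
err = np.linalg.norm(rollout(a_seq) - g)
print(err0, err)
```

Rolling out the optimized actions through the true model reaches the goal to within the small noise floor, illustrating how the parallel joint step, the injected state noise, and the periodic sync interact.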

Figure 6 Example of converged plans for our planner on the Push-T environment at horizon 80, with goals depicted to the right of the dashed line. The parallel optimization is able to converge more consistently at longer horizons, and find the required non-greedy paths to the goal.

Figure 7 Example of a bad local minimum that arises when relaxing the dynamics constraint into a penalty function. The dynamics residual is nonzero for the middle transition, yet both the states and actions sit at a local minimum: there is no local direction in which either can move to further reduce the dynamics loss.
| Planner | H=40 Succ. (%) | H=40 Time | H=50 Succ. (%) | H=50 Time | H=60 Succ. (%) | H=60 Time | H=70 Succ. (%) | H=70 Time | H=80 Succ. (%) | H=80 Time |
|---|---|---|---|---|---|---|---|---|---|---|
| CEM | 61.4 | 35.3s | 30.2 | 96.2s | 7.2 | 83.1s | 7.8 | 156.1s | 2.8 | 132.2s |
| LatCo | 15.0 | 598.0s | 4.2 | 1114.7s | 2.0 | 231.5s | 0.0 | - | 0.0 | - |
| GD | 51.0 | 18.0s | 37.6 | 76.3s | 16.4 | 146.5s | 12.0 | 103.1s | 6.4 | 161.3s |
| GRASP (Ours) | 59.0 | 8.5s | 43.4 | 15.2s | 26.2 | 49.1s | 16.0 | 79.9s | 10.4 | 58.9s |
| Setting | Accuracy (%) | Time (s) |
|---|---|---|
| Sync Steps | | |
| No Sync | 1.6 | 4.5 |
| 10 | 48.0 | 11.2 |
| 25 (ours) | 59.0 | 8.5 |
| 50 | 58.0 | 9.8 |
| Noise Level | | |
| σ = 0.0 | 54.8 | 10.4 |
| σ = 0.5 (ours) | 59.0 | 8.5 |
| σ = 1.0 | 50.6 | 8.4 |
| Detach ∇_s F(s, a) | | |
| Stop (ours) | 59.0 | 8.5 |
| Flow | 46.6 | 10.3 |
| Planner | Push-T H=10 | Push-T H=20 | Push-T H=30 | PointMaze H=10 | PointMaze H=20 | PointMaze H=30 | WallSingle H=10 | WallSingle H=20 | WallSingle H=30 |
|---|---|---|---|---|---|---|---|---|---|
| CEM | 100.0 | 96.0 | 76.0 | 100.0 | 100.0 | 100.0 | 98.0 | 100.0 | 100.0 |
| LatCo | 67.4 | 55.2 | 35.4 | 100.0 | 99.8 | 99.0 | 91.4 | 81.6 | 62.8 |
| GD | 99.2 | 88.4 | 69.0 | 94.6 | 91.8 | 94.6 | 91.8 | 92.8 | 91.6 |
| GRASP (Ours) | 100.0 | 89.2 | 75.2 | 100.0 | 100.0 | 99.8 | 95.2 | 91.8 | 95.0 |
| Planner | Push-T H=10 | Push-T H=20 | Push-T H=30 | PointMaze H=10 | PointMaze H=20 | PointMaze H=30 | WallSingle H=10 | WallSingle H=20 | WallSingle H=30 |
|---|---|---|---|---|---|---|---|---|---|
| CEM | 1.3 | 7.5 | 23.6 | 0.7 | 2.0 | 6.3 | 0.9 | 1.5 | 2.1 |
| LatCo | 39.9 | 437.2 | 1800.0 | 1.6 | 13.7 | 36.6 | 12.1 | 23.8 | 111.1 |
| GD | 0.5 | 2.4 | 25.3 | 0.3 | 0.6 | 1.2 | 0.4 | 0.5 | 0.7 |
| GRASP (Ours) | 1.5 | 2.1 | 9.1 | 0.7 | 1.8 | 2.1 | 1.6 | 1.8 | 2.1 |
| Setting | Accuracy (%) | Time (s) |
|---|---|---|
| Default Mode (objective.mode=default) | | |
| σ_s = 0.0 | 53.0 ± 4.4 | 864 ± 77 |
| σ_s = 0.5 | 50.6 ± 4.4 | 914 ± 78 |
| σ_s = 1.0 | 37.2 ± 4.3 | 1149 ± 75 |
| σ_a = 0.0 | 47.6 ± 4.4 | 1056 ± 73 |
| σ_a = 0.5 | 53.0 ± 4.4 | 864 ± 77 |
| σ_a = 1.0 | 33.4 ± 4.2 | 1208 ± 73 |
| σ_s = 0.5, σ_a = 0.5 | 53.8 ± 4.4 | 854 ± 78 |
| Mode All (objective.mode=all) | | |
| σ_s = 0.0 | 54.0 ± 4.4 | 865 ± 77 |
| σ_s = 0.5 | 55.8 ± 4.4 | 838 ± 77 |
| σ_s = 1.0 | 41.0 ± 4.4 | 1111 ± 75 |
| σ_a = 0.0 | 46.6 ± 4.4 | 1067 ± 73 |
| σ_a = 0.5 | 52.8 ± 4.4 | 873 ± 77 |
| σ_a = 1.0 | 52.4 ± 4.4 | 878 ± 77 |
| σ_s = 0.5, σ_a = 0.5 | 55.8 ± 4.4 | 838 ± 77 |
References
[jerez2011condensed] Jerez, Juan L, Kerrigan, Eric C, Constantinides, George A. (2011). A condensed and sparse QP formulation for predictive control. 2011 50th IEEE Conference on Decision and Control and European Control Conference.
[ding2024understanding] Ding, Jingtao, Zhang, Yunke, Shang, Yu, Zhang, Yuheng, Zong, Zefang, Feng, Jie, Yuan, Yuan, Su, Hongyuan, Li, Nian, Sukiennik, Nicholas, others. (2024). Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys.
[bar2025navigation] Bar, Amir, Zhou, Gaoyue, Tran, Danny, Darrell, Trevor, LeCun, Yann. (2025). Navigation world models. Proceedings of the Computer Vision and Pattern Recognition Conference.
[hafner2025mastering] Hafner, Danijar, Pasukonis, Jurgis, Ba, Jimmy, Lillicrap, Timothy. (2025). Mastering diverse control tasks through world models. Nature.
[pinneri2021sample] Pinneri, Cristina, Sawant, Shambhuraj, Blaes, Sebastian, Achterhold, Jan, Stueckler, Joerg, Rolinek, Michal, Martius, Georg. (2021). Sample-efficient cross-entropy method for real-time planning. Conference on Robot Learning.
[lai2022parallelised] Lai, Tin, Zhi, Weiming, Hermans, Tucker, Ramos, Fabio. (2022). Parallelised diffeomorphic sampling-based motion planning. Conference on robot learning.
[rubinstein2004cross] Rubinstein, Reuven Y, Kroese, Dirk P. (2004). The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning.
[bharadhwaj2020model] Bharadhwaj, Homanga, Xie, Kevin, Shkurti, Florian. (2020). Model-predictive control via cross-entropy and gradient-based optimization. Learning for Dynamics and Control.
[huang2021cem] Huang, Kevin, Lale, Sahin, Rosolia, Ugo, Shi, Yuanyuan, Anandkumar, Anima. (2021). Cem-gd: Cross-entropy method with gradient descent planner for model-based reinforcement learning. arXiv preprint arXiv:2112.07746.
[jyothir2023gradient] Jyothir, SV, Jalagam, Siddhartha, LeCun, Yann, Sobal, Vlad. (2023). Gradient-based planning with world models. arXiv preprint arXiv:2312.17227.
[thrun1990planning] Thrun, Sebastian, Möller, Knut. (1990). Planning with an adaptive world model. Advances in neural information processing systems.
[guhathakurta2022fast] Guhathakurta, Dipanwita, Rastgar, Fatemeh, Sharma, M Aditya, Krishna, K Madhava, Singh, Arun Kumar. (2022). Fast joint multi-robot trajectory optimization by GPU accelerated batch solution of distributed sub-problems. Frontiers in Robotics and AI.
[tamimi2009nonlinear] Tamimi, Jasem, Li, Pu. (2009). Nonlinear model predictive control using multiple shooting combined with collocation on finite elements. IFAC Proceedings Volumes.
[diedam2018global] Diedam, H, Sager, Sebastian. (2018). Global optimal control with the direct multiple shooting method. Optimal Control Applications and Methods.
[bordalba2022direct] Bordalba, Ricard, Schoels, Tobias, Ros, Lluís. (2022). Direct collocation methods for trajectory optimization in constrained robotic systems. IEEE Transactions on Robotics.
[nie2025reliable] Nie, Yuanbo, Kerrigan, Eric C. (2025). Reliable Solution to Dynamic Optimization Problems using Integrated Residual Regularized Direct Collocation. arXiv preprint arXiv:2503.09123.
[boyd2011distributed] Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, Eckstein, Jonathan, others. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning.
[stellato2020osqp] Stellato, Bartolomeo, Banjac, Goran, Goulart, Paul, Bemporad, Alberto, Boyd, Stephen. (2020). OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation.
[chan2016plug] Chan, Stanley H, Wang, Xiran, Elgendy, Omar A. (2016). Plug-and-play ADMM for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging.
[rostami2017admm] Rostami, Ramin, Costantini, Giuliano, Gühmann, Clemens. (2017). ADMM-based distributed model predictive control: Primal and dual approaches. 2017 IEEE 56th Annual Conference on Decision and Control (CDC).
[tang2019distributed] Tang, Wentao, Daoutidis, Prodromos. (2019). Distributed nonlinear model predictive control through accelerated parallel ADMM. 2019 American Control Conference (ACC).
[foret2020sharpness] Foret, Pierre, Kleiner, Ariel, Mobahi, Hossein, Neyshabur, Behnam. (2020). Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412.
[robbins1951stochastic] Robbins, Herbert, Monro, Sutton. (1951). A stochastic approximation method. The annals of mathematical statistics.
[welling2011bayesian] Welling, Max, Teh, Yee W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th international conference on machine learning (ICML-11).
[xu2018global] Xu, Pan, Chen, Jinghui, Zou, Difan, Gu, Quanquan. (2018). Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems.
[bras2023langevin] Bras, Pierre. (2023). Langevin algorithms for very deep neural networks with application to image classification. Procedia Computer Science.
[baldassarre2025back] Baldassarre, Federico, Szafraniec, Marc, Terver, Basile, Khalidov, Vasil, Massa, Francisco, LeCun, Yann, Labatut, Patrick, Seitzer, Maximilian, Bojanowski, Piotr. (2025). Back to the Features: DINO as a Foundation for Video World Models. arXiv preprint arXiv:2507.19468.
[assran2025v] Assran, Mido, Bardes, Adrien, Fan, David, Garrido, Quentin, Howes, Russell, Muckley, Matthew, Rizvi, Ammar, Roberts, Claire, Sinha, Koustuv, Zholus, Artem, others. (2025). V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985.
[bock1984multiple] Bock, Hans Georg, Plitt, Karl-Josef. (1984). A multiple shooting algorithm for direct solution of optimal control problems. IFAC Proceedings Volumes.
[piovesan2009randomized] Piovesan, Jorge L, Tanner, Herbert G. (2009). Randomized model predictive control for robot navigation. 2009 IEEE International Conference on Robotics and Automation.
[bengio1994learning] Bengio, Yoshua, Simard, Patrice, Frasconi, Paolo. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks.
[werbos2002backpropagation] Werbos, Paul J. (2002). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE.
[cui2015convergence] Cui, Ying, Li, Xudong, Sun, Defeng, Toh, Kim-Chuan. (2015). On the convergence properties of a majorized ADMM for linearly constrained convex optimization problems with coupled objective functions. arXiv preprint arXiv:1502.00098.
[anderson2007optimal] Anderson, Brian DO, Moore, John B. (2007). Optimal control: linear quadratic methods.
[glowinski1975approximation] Glowinski, Roland, Marroco, Americo. (1975). Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Revue française d'automatique, informatique, recherche opérationnelle. Analyse numérique.
[gabay1976dual] Gabay, Daniel, Mercier, Bertrand. (1976). A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & mathematics with applications.
[williams2016aggressive] Williams, Grady, Drews, Paul, Goldfain, Brian, Rehg, James M, Theodorou, Evangelos A. (2016). Aggressive driving with model predictive path integral control. 2016 IEEE international conference on robotics and automation (ICRA).
[rybkin2021model] Rybkin, Oleh, Zhu, Chuning, Nagabandi, Anusha, Daniilidis, Kostas, Mordatch, Igor, Levine, Sergey. (2021). Model-based reinforcement learning via latent-space collocation. International Conference on Machine Learning.
[he2000alternating] He, Bing-Sheng, Yang, Hai, Wang, SL. (2000). Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and applications.
[parthasarathy2025closing] Parthasarathy, Arjun, Kalra, Nimit, Agrawal, Rohun, LeCun, Yann, Bounou, Oumayma, Izmailov, Pavel, Goldblum, Micah. (2025). Closing the Train-Test Gap in World Models for Gradient-Based Planning. arXiv preprint arXiv:2512.09929.
[zhou2024dino] Zhou, Gaoyue, Pan, Hengkai, LeCun, Yann, Pinto, Lerrel. (2024). Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983.
[szegedy2013intriguing] Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, Fergus, Rob. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
[shamir2021dimpled] Shamir, Adi, Melamed, Odelia, BenShmuel, Oriel. (2021). The dimpled manifold model of adversarial examples in machine learning. arXiv preprint arXiv:2106.10151.
[chaudhari2019entropy] Chaudhari, Pratik, Choromanska, Anna, Soatto, Stefano, LeCun, Yann, Baldassi, Carlo, Borgs, Christian, Chayes, Jennifer, Sagun, Levent, Zecchina, Riccardo. (2019). Entropy-sgd: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment.
[ascher1995numerical] Ascher, Uri M, Mattheij, Robert MM, Russell, Robert D. (1995). Numerical solution of boundary value problems for ordinary differential equations.
[gelfand1991recursive] Gelfand, Saul B, Mitter, Sanjoy K. (1991). Recursive stochastic algorithms for global optimization in R^d. SIAM Journal on Control and Optimization.
[fu2020d4rl] Fu, Justin, Kumar, Aviral, Nachum, Ofir, Tucker, George, Levine, Sergey. (2020). D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
[tassa2018deepmind] Tassa, Yuval, Doron, Yotam, Muldal, Alistair, Erez, Tom, Li, Yazhe, Casas, Diego de Las, Budden, David, Abdolmaleki, Abbas, Merel, Josh, Lefrancq, Andrew, others. (2018). Deepmind control suite. arXiv preprint arXiv:1801.00690.
[valevski2024diffusion] Valevski, Dani, Leviathan, Yaniv, Arar, Moab, Fruchter, Shlomi. (2024). Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837.
[genie3] Ball, Philip J., Bauer, Jakob, Belletti, Frank, Brownfield, Bethanie, Ephrat, Ariel, Fruchter, Shlomi, Gupta, Agrim, Holsheimer, Kristian, Holynski, Aleksander, Hron, Jiri, Kaplanis, Christos, Parker-Holder, Jack, others. (2025). Genie 3: A New Frontier for World Models.
[goswami2025world] Goswami, Raktim Gautam, Bar, Amir, Fan, David, Yang, Tsung-Yen, Zhou, Gaoyue, Krishnamurthy, Prashanth, Rabbat, Michael, Khorrami, Farshad, LeCun, Yann. (2025). World Models Can Leverage Human Videos for Dexterous Manipulation. arXiv preprint arXiv:2512.13644.
[von1993numerical] Von Stryk, Oskar. (1993). Numerical solution of optimal control problems by direct collocation. Optimal Control: Calculus of Variations, Optimal Control Theory and Numerical Methods.
[koju2025surgical] Koju, Saurabh, Bastola, Saurav, Shrestha, Prashant, Amgain, Sanskar, Shrestha, Yash Raj, Poudel, Rudra PK, Bhattarai, Binod. (2025). Surgical vision world model. MICCAI Workshop on Data Engineering in Medical Imaging.
[guo2025ctrl] Guo, Yanjiang, Shi, Lucy Xiaoyang, Chen, Jianyu, Finn, Chelsea. (2025). Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125.