Parallel Stochastic Gradient-Based Planning for World Models
Michael Psenka, Michael Rabbat, Aditi Krishnapriyan, Yann LeCun, Amir Bar
Introduction
Intelligent agents carry a small-scale model of external reality which allows them to simulate actions, reason about their consequences, and choose the ones that lead to the best outcome. Attempts to build such models date back to control theory, and in recent years researchers have made progress in building world models using deep neural networks trained directly from the raw sensory input (e.g., vision). For example, recent world models have shown success modeling computer games (Valevski et al., 2024), navigating real-world environments (Bar et al., 2025; Ball et al., 2025), and robot arm motion commands (Goswami et al., 2025). World models have numerous impactful applications from simulating complex medical procedures (Koju et al., 2025) to testing robots in visually realistic environments (Guo et al., 2025).
Current planning algorithms used with world models often rely on zeroth-order optimization methods such as the Cross-Entropy Method (CEM; Rubinstein and Kroese, 2004) or, more generally, shooting methods, which plan by iteratively rolling out trajectories and choosing actions from the optimal rollout (Bock and Plitt, 1984; Piovesan and Tanner, 2009). These approaches are simple and robust, but their performance degrades with longer planning horizons and higher action dimensionality (Bharadhwaj et al., 2020), motivating the use of gradient information when differentiable world models are available.
Gradient-based planners exploit the differentiability of learned world models to directly optimize action sequences, enabling more sample-efficient planning and finer-grained improvement than zero-order methods. However, there are two main challenges to this optimization approach: local minima (Jyothir et al., 2023) and instability due to differentiating through the full rollout, akin to backpropagation through time (Werbos, 2002). While prior methods have formulated the optimization for better conditioning (e.g., multiple shooting and direct collocation (Von Stryk, 1993)), these techniques are typically developed for known dynamical systems and do not scale as well to deep neural dynamics (often in latent spaces) with brittle or poorly calibrated Jacobians.

Figure 1 Difficulty of the planning problem. Subfigure (a) shows the distance to the goal in L₂ norm throughout a successful trajectory. This illustrates the difficulty of planning optimization away from a minimizer: successful trajectories often have to first move away from the goal to successfully plan towards it later, causing greedy strategies to fail. Subfigures (b)-(c) depict the loss landscape at convergence of standard rollout-based planners vs. our planner. The example given is in the Push-T environment at horizon length 50. The axes plotted over are two random, orthogonal, unit-norm directions in the full action space ℝ^{50×2}. Our planner's loss is taken as in Eq. (10), and for GD the loss is taken as in Eq. (3).
In this work, we introduce a novel gradient-based planning method for learned world models that decouples temporal dynamics into parallel optimized states (rather than serial rollouts) while remaining robust at long horizons and in high-dimensional state spaces. Rather than planning exclusively through a deep, sequential rollout of the dynamics model, our approach optimizes over lifted intermediate states that are treated as independent optimization variables. Our approach makes two fundamental additions that address the main obstacles to gradient-based planning in higher dimensions: gradient sensitivity and local minima.
Firstly, a fundamental difficulty arises in the setting of vision-based world models. In high-dimensional learned state spaces, gradients with respect to state inputs can be brittle or adversarial, allowing the optimizer to exploit sensitive Jacobian structure rather than discovering physically meaningful transitions. To mitigate this issue, our planner deliberately stops gradients through the state inputs of the world model, while retaining gradients with respect to actions, which we find behave more reasonably. On its own, this would bias trajectories toward the starting state; however, with a dense one-step goal loss over the full trajectory, the converged trajectories of these noisy iterations tend toward the goal.
Secondly, to address the remaining non-convexity of the lifted state approach, our planner incorporates Langevin-style stochastic updates on the lifted state variables, explicitly injecting noise during optimization to promote exploration of the state space and facilitate escape from unfavorable basins. This stochastic relaxation allows the planner to search over diverse intermediate trajectories while still favoring solutions that approximately satisfy the learned dynamics. Finally, we intermittently apply a small GD step to fine-tune stochastically optimized trajectories towards fully-optimized paths.
Together, these components yield a practical gradient-based planner for learned visual dynamics that remains stable at long horizons while avoiding the failure modes commonly encountered when backpropagating through deep world-model rollouts. We call our planner GRASP (Gradient RelAxed Stochastic Planner) to emphasize its primary components: using gradient information, relaxing the dynamics constraints, and stochastic optimization for exploration. In various settings, we achieve up to +10% success rate at less than half the compute time cost. We also provide a theoretical model for our planner to further illustrate its role. We demonstrate our planner on visual world models trained on problems in D4RL (Fu et al., 2020) and the DeepMind Control Suite (Tassa et al., 2018).
Problem formulation
Our main object of interest is a learned world model F θ : S × A → S that predicts the next state given the current state and action. For visual domains, states are typically represented in a learned latent space to handle high-dimensional observations. Here we assume A = R k is a continuous Euclidean action space.
We consider the problem of fixed-goal path planning: finding an action sequence a = ( a 0 , a 1 , . . . , a T -1 )

Figure 2 Graphical depiction of (a) a standard serial-based setup for optimization-based planning, where states are rolled out using the actions and the loss is evaluated on the goal state, (b) our setup, which parallelizes the world model evaluations by optimizing 'virtual states' directly and only supervising pairwise dynamics satisfaction. The crossed lines and skipped connections for our method's depiction (b) are detailed in Section 3.3, which keeps the full planning graph connected while not requiring state gradients of the dynamics F θ . For our planner, we find it helpful to alternate between (a) and (b) throughout the planning optimization.
that, with respect to the dynamics of the world model F_θ and a given initial state s_0 ∈ S, reaches a set goal state g ∈ S:
$$ \text{find } \mathbf{a}^* \ \text{ such that } \ s_T(\mathbf{a}^*, s_0) = g, \tag{1} $$
where the terminal state s_T is generated recursively through the update rule s_{t+1} = F_θ(s_t, a_t):
$$ s_T(\mathbf{a}, s_0) \;=\; F_\theta\big( F_\theta( \cdots F_\theta(s_0, a_0) \cdots,\; a_{T-2} ),\; a_{T-1} \big). \tag{2} $$
We can compute a* by solving the following optimization problem:
$$ \mathbf{a}^* \;=\; \operatorname*{arg\,min}_{\mathbf{a} \in \mathcal{A}^T} \; \big\| s_T(\mathbf{a}, s_0) - g \big\|_2^2. \tag{3} $$
Optimizing Eq. (3) directly is challenging due to two main problems. First, it requires T serial applications of F_θ (see Eq. (2)), which is computationally expensive and difficult to optimize due to the poor conditioning arising from repeated applications of F_θ (see Appendix A.1 for details). Second, it is susceptible to local minima and a jagged loss landscape; see Figure 1. For these reasons, existing planners are based on zeroth-order optimization algorithms like CEM and MPPI (Williams et al., 2016), which are highly stochastic and do not require gradient computation.
In what follows, we propose a gradient-based planner which alleviates these difficulties, while also using the differentiability of the model F θ . In Section 3, we lift the optimization problem by also optimizing over states, which leads to faster convergence and better conditioning. In Section 3.2, we introduce stochasticity, which helps escape local minima.
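To make the baseline concrete, here is a minimal sketch (not the paper's code) of rollout-based gradient planning on Eq. (3). We use a toy linear world model F(s, a) = As + Ba so that backpropagation through time reduces to an explicit adjoint pass; the matrices, horizon, and step size are illustrative choices.

```python
import numpy as np

# Hedged sketch: gradient descent on the serial rollout loss of Eq. (3)
# for a toy linear world model F(s, a) = A s + B a. All constants are
# illustrative, not values from the paper.
rng = np.random.default_rng(0)
n, T = 4, 20
A = 0.95 * np.eye(n)                      # stable toy dynamics
B = np.eye(n) + 0.1 * rng.normal(size=(n, n))
s0, g = np.zeros(n), np.ones(n)

def loss_and_grad(a):
    s = s0
    for t in range(T):                    # serial rollout: T sequential calls
        s = A @ s + B @ a[t]
    r = s - g                             # terminal residual of Eq. (3)
    lam, grad = 2 * r, np.zeros_like(a)
    for t in reversed(range(T)):          # adjoint (backward-in-time) pass
        grad[t] = B.T @ lam               # dL/da_t
        lam = A.T @ lam                   # propagate adjoint through A
    return r @ r, grad

a = np.zeros((T, n))
for _ in range(500):
    L, grad = loss_and_grad(a)
    a -= 0.02 * grad                      # plain gradient descent
```

Every update requires a full forward rollout and a backward pass of the same depth; the lifted formulation of Section 3 removes this serial dependency.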
Decoupling dynamics for gradient-based planning
We consider planning with a world model F θ and horizon T . Given an initial state s 0 ∈ S and goal state g ∈ S , we optimize an action sequence a = ( a 0 , . . . , a T -1 ) such that the rolled-out state s T ( a ) is as close to g as possible. A standard approach defines a trajectory by rolling out the model, but backpropagating through a deep composition of F θ can be unstable and ill-conditioned. Following prior work on lifted planning (Tamimi and Li, 2009; Rybkin et al., 2021), we introduce auxiliary states z = ( z 1 , . . . , z T ) and enforce dynamics consistency through a penalty function.
Parallelized planning
We first want to decouple the states from the explicit rollout outputs. Writing Eq. (3) in terms of each intermediate dynamics condition, we get the following:
$$ \min_{\mathbf{a},\; s_1, \dots, s_{T-1}} \; \sum_{t=0}^{T-1} \big\| F_\theta(s_t, a_t) - s_{t+1} \big\|_2^2, \qquad s_T := g. \tag{4} $$
The minimization in Eq. (4) is equivalent to Eq. (3) in that, whenever the goal is reachable under the dynamics, they share global minimizers.
Immediately, this gives a great benefit in that all world model evaluations are parallel; there is no need for the serial rollouts required by Eq. (3). There are, however, two main issues with optimizing this loss directly:
- Local minima . When optimizing with respect to states, the states might be stuck in an unphysical region; for example, Figure 7 shows a case where states go straight through a barrier. To address this, we propose to use Langevin state updates which promote exploration (see Section 3.2).
- World model sensitivity for high-dimensional states. When optimizing s directly over a higher-dimensional space (e.g., vision-based), we observe that the Jacobian J_s F_θ(s, a) does not necessarily have any nice low-dimensional or convex structure; in practice, the world model can be easily steered toward any desired output state, as depicted in Figure 3. We address this in Section 3.3 with a reshaping of the descent directions.
We now describe our approach to address these two fundamental problems with lifted-states approaches to planning.
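As a minimal sketch of the lifted formulation (toy linear model F(s, a) = As + Ba, illustrative constants, not the paper's code), Eq. (4) can be evaluated with one batched model call:

```python
import numpy as np

# Hedged sketch of the lifted objective of Eq. (4): the intermediate states
# are free variables, so all T world-model evaluations happen in a single
# parallel batch rather than a serial rollout. Shapes and constants are
# illustrative.
rng = np.random.default_rng(1)
n, T = 4, 20
A = 0.9 * np.eye(n)
B = rng.normal(size=(n, n)) / 2
s0, g = np.zeros(n), np.ones(n)

def lifted_loss(a, z):
    # z holds the free states s_1 .. s_{T-1}; s_0 and s_T = g are fixed
    s_in  = np.vstack([s0, z])            # inputs  s_0 .. s_{T-1}
    s_out = np.vstack([z, g])             # targets s_1 .. s_T
    pred = s_in @ A.T + a @ B.T           # one batched evaluation of F
    return float(np.sum((pred - s_out) ** 2))

a = np.zeros((T, n))
z = np.linspace(s0, g, T + 1)[1:-1]       # linear interpolation init
print(lifted_loss(a, z))                  # all T residuals scored at once
```

A dynamically feasible goal-reaching trajectory drives this loss to zero, matching the shared-global-minimizer claim above.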
Exploration via Langevin state updates
The lifted optimization in Eq. (4) is still non-convex and can get trapped in poor local minima. In practice, we frequently observe that deterministic joint updates in ( a , s ) converge to 'bad' stationary points where the intermediate variables settle into an unfavorable basin; for example, a linear route that ignores barriers or walls like in Figure 7. To circumvent this, we inject stochasticity directly into the state iterates, yielding a Langevin-style update that encourages exploration in the lifted state space.
Langevin dynamics on state iterates. Consider the optimization induced by Eq. (4). A standard way to escape spurious basins is to replace deterministic gradient descent on s with overdamped Langevin dynamics (Gelfand and Mitter, 1991), whose Euler

Figure 3 Sensitivity of state gradient structure . Examples of three states far away from the goal on the right (either in-distribution or out-of-distribution), such that taking a small step along the gradient s ′ = s -ϵ ∇ s L ( s ) , L ( s ) = ∥ F θ ( s, a = 0 ) -g ∥ 2 2 , leads to a nearby state s ′ that solves the planning problem in a single step: F θ ( s ′ , 0 ) = g . Thus, optimizing states directly through the world model F θ can be quite challenging.
discretization takes the following form:
$$ s_t^{k+1} \;=\; s_t^k \;-\; \eta_s\, \nabla_{s_t} L\big(\mathbf{a}^k, \mathbf{s}^k\big) \;+\; \sigma_{\text{state}}\, \xi_t^k, \tag{5} $$
$$ a_t^{k+1} \;=\; a_t^k \;-\; \eta_a\, \nabla_{a_t} L\big(\mathbf{a}^k, \mathbf{s}^k\big), \tag{6} $$
where ξ k t ∼ N (0 , I ) . That is, each optimization step performs a gradient descent update on the intermediate states, followed by an isotropic Gaussian perturbation. Intuitively, the noise allows the iterates to 'hop' between nearby basins of the lifted loss landscape.
Noise on states vs. actions. By only noising the states, we can still condition on more dynamically feasible trajectories, while still allowing exploration over a wider distribution. Intuitively, planning problems often have a single (or small number of) intermediate states to find for the solution, and being able to noise directly over states rather than actions allows us to find these intermediate states faster. See Appendix A.3 for a characterization of the sampled distribution.
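A minimal sketch of the Langevin-style state update of Eq. (5), applied here to a simple quadratic surrogate loss; the step size and noise scale are illustrative choices, not tuned values from the paper:

```python
import numpy as np

# Hedged sketch of Eq. (5): a gradient descent step on the states followed
# by an isotropic Gaussian perturbation. The quadratic loss and constants
# are illustrative stand-ins for the lifted loss.
rng = np.random.default_rng(2)
eta_s, sigma_state = 0.1, 0.05

def langevin_state_step(z, grad_z):
    noise = rng.standard_normal(z.shape)
    return z - eta_s * grad_z + sigma_state * noise   # descend, then 'hop'

z = np.zeros((9, 4))                                  # states s_1 .. s_{T-1}
target = np.ones((9, 4))
for _ in range(500):
    z = langevin_state_step(z, 2 * (z - target))      # grad of ||z - target||^2
print(np.abs(z - target).max())
```

The iterates hover in a noise ball around the minimizer rather than converging exactly, which is what allows hopping between nearby basins.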
Sensitivity to state gradients
A note on adversarial robustness of state gradients. In practice, F θ is learned and can have brittle local geometry. When optimizing Eq. (4) by gradient descent in both a and s , we observed empirically that gradients with respect to the state inputs, ∇ s F θ ( s, a ) , can be exploited: for any local goal-reaching objective of the following form:
$$ L(s, a) \;=\; \big\| F_\theta(s, a) - y \big\|_2^2, \tag{7} $$
instead of finding an on-manifold s such that applying the action a leads to the end state y, the optimizer can find a nearby ambient point s + δ, ‖δ‖₂ ≪ 1, such that the loss is practically minimized: y ≈ F_θ(s + δ, a), regardless of the starting state s. This is analogous to the adversarial robustness issue of image classifiers (Szegedy et al., 2013; Shamir et al., 2021): trained neural networks over high-dimensional input spaces can have large Lipschitz constants, hindering optimization performance.
Unfortunately, any loss function over s and a whose minimizers are feasible dynamics must depend on the state gradient ∇ s F in a meaningful way. We provide the informal theorem here, with formalization and proof in Appendix A.4.
Theorem 1 (informal) . A differentiable loss function over state/action trajectories L : S T ×A T → R given a world model F θ : S × A → S cannot satisfy both of the following at the same time:
- Minimizers of L correspond to dynamically feasible trajectories: F θ ( s t , a t ) = s t +1 ,
- L is insensitive to the world model state gradient ∇ s F θ .
To address this adversarial sensitivity, we detach gradients through the state inputs of the world model, while still differentiating with respect to the actions. We denote by ¯ s t a stop-gradient copy of s t (i.e., ¯ s t = s t in value, but treated as constant during differentiation).
Algorithm 1 GRASP planner: parallel stochastic gradient-based planning
Require: initial observation o_0, goal observation o_g, world model F_θ, horizon T, steps K, learning rates η_a, η_s.
Ensure: optimized action sequence a*
1: s_0 ← encode(o_0), s_T ← encode(o_g)
2: a ← a_0; s ← init_states(s_0, s_T)
3: for k = 0 to K − 1 do
4:  Compute L as in Eq. (10)
5:  Joint step: (a, s) ← (a, s) − (η_a ∇_a L, η_s ∇_s L)
6:  Stochastic state: s_t ← s_t + σ_state ξ_t, ξ_t ∼ N(0, I), for t = 1, …, T − 1
7:  Sync (periodic): if k mod K_sync = 0, roll out from s_0: s_{t+1} ← F_θ(s_t, a_t) for t = 0, …, T − 1,
8:   then take a GD step a ← a − η_sync ∇_a ‖s_T − g‖²₂
9: end for
10: return a* ← a
Grad-cut dynamics loss. We begin by applying a gradient stop to the state inputs in the dynamics loss:
$$ L_{\text{dyn}}(\mathbf{a}, \mathbf{s}) \;=\; \sum_{t=0}^{T-1} \big\| F_\theta(\bar s_t, a_t) - s_{t+1} \big\|_2^2. \tag{8} $$
This objective is differentiable with respect to a and the next states s t +1 , but does not backpropagate through s t via F θ .
Dense goal shaping on one-step predictions. While Eq. (8) improves robustness, it introduces a new degeneracy: paths gravitate towards the current rollout, regardless of proximity to the goal. To provide a task-aligned signal at every time step without state-input gradients, we add a goal loss on the one-step predictions:
$$ L_{\text{goal}}(\mathbf{a}, \mathbf{s}) \;=\; \sum_{t=0}^{T-1} \big\| F_\theta(\bar s_t, a_t) - g \big\|_2^2. \tag{9} $$
This encourages each predicted next state to move toward the goal, supplying gradient information to every action a t while maintaining the grad-cut on s t . This is depicted visually in Figure 2, and theoretically in Appendix A.4. Crucially, due to the stop-gradient ¯ s t , gradients through F θ (¯ s t , a t ) flow only with respect to a t (and not s t ), which prevents the optimizer from exploiting adversarial state-input directions. The final energy that is sampled from is then the following:
$$ L(\mathbf{a}, \mathbf{s}) \;=\; L_{\text{dyn}}(\mathbf{a}, \mathbf{s}) \;+\; \gamma\, L_{\text{goal}}(\mathbf{a}, \mathbf{s}), \tag{10} $$
where γ > 0 is fixed.
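As a hedged sketch (toy linear model, not the paper's code), the grad-cut update implied by Eqs. (8)-(10) can be implemented with a hand-coded stop-gradient; all constants below are illustrative:

```python
import numpy as np

# Hedged sketch of the grad-cut update of Eqs. (8)-(10) for a toy linear
# model F(s, a) = A s + B a. The stop-gradient is implemented by hand:
# s_t enters the one-step prediction only as a constant, so the state
# gradient at s_{t+1} comes solely from residual t.
rng = np.random.default_rng(3)
n, T, gamma, eta_a, eta_s = 4, 15, 0.5, 0.02, 0.1
A = 0.9 * np.eye(n)
B = np.eye(n) + 0.1 * rng.normal(size=(n, n))
s0, g = np.zeros(n), np.ones(n)

def grad_cut_step(a, z):
    s_in  = np.vstack([s0, z])            # sbar_0 .. sbar_{T-1} (detached)
    s_out = np.vstack([z, g])             # s_1 .. s_T, with s_T = g fixed
    mu = s_in @ A.T + a @ B.T             # one-step predictions F(sbar_t, a_t)
    r_dyn, r_goal = mu - s_out, mu - g
    grad_a = 2 * (r_dyn + gamma * r_goal) @ B   # no gradient through s_in
    grad_z = -2 * r_dyn[:-1]              # d/ds_{t+1} of ||mu_t - s_{t+1}||^2
    return a - eta_a * grad_a, z - eta_s * grad_z

a = np.zeros((T, n))
z = np.linspace(s0, g, T + 1)[1:-1]       # linear interpolation init
for _ in range(5000):
    a, z = grad_cut_step(a, z)
```

Without noise, these updates settle at a fixed point, consistent with the linear-case analysis in the appendix.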
Resulting noisy dynamics. The final resulting dynamics, after explicitly writing out ∇ a L , are as follows:
$$ a_t^{k+1} \;=\; a_t^k \;-\; 2\eta_a\, J_t^\top \Big[ \big( F_\theta(\bar s_t^k, a_t^k) - s_{t+1}^k \big) \;+\; \gamma\, \big( F_\theta(\bar s_t^k, a_t^k) - g \big) \Big], \qquad J_t = \nabla_{a_t} F_\theta\big(\bar s_t^k, a_t^k\big), $$
$$ s_{t+1}^{k+1} \;=\; s_{t+1}^k \;-\; 2\eta_s\, \big( s_{t+1}^k - F_\theta(\bar s_t^k, a_t^k) \big) \;+\; \sigma_{\text{state}}\, \xi_t^k, \tag{11} $$
where ξ^k_t ∼ N(0, I). Importantly, while the action dynamics still follow a gradient flow, the states do not follow a true gradient vector field, and thus the resulting dynamics are not Langevin. What results are still noisy dynamics that bias towards valid, goal-oriented trajectories, but whose efficiency requires an extra synchronization step, as described in the following section.
Grad-cut dynamics loss.
We now analyze a variant of the optimization method where we 'cut' the gradient flow through the state evolution in the dynamics. This is akin to removing the adjoint (backward) pass through time, transforming the global trajectory optimization into a sequence of locally regularized problems.
First, we prove that no loss function can both (i) have minimizers corresponding to dynamically feasible trajectories and (ii) have no dependence on the state gradients of the world model F_θ(s, a). We formalize this in the following theorem:
Theorem 5 (Nonexistence of exact dynamics-enforcing losses with Jacobian-free state gradients). Let S ⊆ ℝⁿ and A ⊆ ℝᵐ each be open and connected sets, and fix s_0 ∈ S and g ∈ S. Consider horizon T = 2 with decision variables (s_1, a_0, a_1) ∈ S × A × A and boundary s_2 = g fixed. Let L_F : S × A × A → ℝ be decomposable into the following form:
$$ L_F(s_1, a_0, a_1) \;=\; \Phi\big( s_1,\, a_0,\, a_1,\; F(s_0, a_0),\; F(s_1, a_1) \big), $$
where Φ : S × A × A × S × S → ℝ is C¹. Let
$$ \mathcal{M}(F) \;=\; \big\{ (s_1, a_0, a_1) \;:\; F(s_0, a_0) = s_1,\; F(s_1, a_1) = g \big\} $$
denote the set of trajectories that satisfy the dynamics exactly at both steps (with the fixed boundary conditions). There does not exist such a Φ for which the following two properties hold simultaneously for every C¹ model F: (i) exact minimizers, arg min L_F = M(F); and (ii) Jacobian-invariance, ∇_{s_1} L_F does not depend on the state Jacobian ∇_s F(s_1, a_1).
Proof. We argue by contradiction.
Fix any point (s_1, a_0, a_1) and write y_0 = F(s_0, a_0) and y_1 = F(s_1, a_1). By the chain rule,
$$ \nabla_{s_1} L_F \;=\; \frac{\partial \Phi}{\partial s_1} \;+\; \big( \nabla_s F(s_1, a_1) \big)^{\!\top} \frac{\partial \Phi}{\partial y_1}. $$
We claim that ∂Φ/∂y_1 must vanish identically. To see this, fix arbitrary arguments (s_1, a_0, a_1, y_0, y_1) with (s_0, a_0) ≠ (s_1, a_1) in the domain, and construct two C¹ models F and G such that F(s_0, a_0) = G(s_0, a_0) = y_0 and F(s_1, a_1) = G(s_1, a_1) = y_1, but whose state Jacobians at (s_1, a_1) are prescribed arbitrarily and differ:
$$ \nabla_s F(s_1, a_1) = J_F, \qquad \nabla_s G(s_1, a_1) = J_G, \qquad J_F \neq J_G. $$
¹Equivalently, one can view σ_state in Eq. (5) as setting an effective temperature: larger σ_state yields broader exploration, while smaller σ_state concentrates around local minima.
To construct, choose a small ball B around ( s 1 , a 1 ) contained in S × A , construct a smooth bump function ψ that equals 1 on a smaller concentric ball and 0 outside B (possible since S , A open), and define
$$ F(s, a) \;=\; \big( 1 - \psi(s, a) \big)\, H(s, a) \;+\; \psi(s, a)\, \big[ y_1 + J_F\, (s - s_1) \big] $$
for an arbitrary C 1 base map H . Then F ( s 1 , a 1 ) = y 1 and ∇ s F ( s 1 , a 1 ) = J F . Defining G analogously with J G gives the desired pair; values at ( s 0 , a 0 ) can be kept fixed by choosing disjoint supports or applying the same local surgery at ( s 0 , a 0 ) .
By Jacobian-invariance, ∇ s 1 L F = ∇ s 1 L G at ( s 1 , a 0 , a 1 ) . Subtracting the two chain-rule expressions cancels ∂ Φ /∂s 1 and yields
$$ \big( J_F - J_G \big)^{\!\top}\, \frac{\partial \Phi}{\partial y_1} \;=\; 0. $$
Since J F -J G can be any matrix in R n × n , it follows that
$$ \frac{\partial \Phi}{\partial y_1}\big( s_1, a_0, a_1, y_0, y_1 \big) \;=\; 0. $$
Because the arguments were arbitrary, and since S is connected, we conclude ∂ Φ /∂y 1 ≡ 0 , meaning the loss is independent of y 1 . Therefore there exists a C 1 function ˜ Φ such that for every F ,
$$ L_F(s_1, a_0, a_1) \;=\; \tilde{\Phi}\big( s_1,\, a_0,\, a_1,\; F(s_0, a_0) \big). $$
In particular, if two models F and G satisfy F ( s 0 , a ) = G ( s 0 , a ) for all a ∈ A , then L F ≡ L G as functions of ( s 1 , a 0 , a 1 ) , and hence
$$ \operatorname*{arg\,min} L_F \;=\; \operatorname*{arg\,min} L_G. $$
We now construct such a pair F, G but with different feasible sets, contradicting the exact-minimizers assumption. Pick two distinct actions u, v ∈ A , two distinct states s A , s B ∈ S , and an action a ⋆ ∈ A . Using bump-function surgery as above, construct C 1 models F and G such that
$$ F(s_0, \cdot) \;\equiv\; G(s_0, \cdot), \qquad F(s_0, u) = s_A, \qquad F(s_0, v) = s_B, $$
but with swapped second-step goal reachability:
$$ F(s_A, a^\star) = g, \qquad F(s_B, a^\star) \neq g, $$
$$ G(s_B, a^\star) = g, \qquad G(s_A, a^\star) \neq g. $$
Then (s_A, u, a⋆) ∈ M(F) but (s_A, u, a⋆) ∉ M(G), and (s_B, v, a⋆) ∈ M(G) but (s_B, v, a⋆) ∉ M(F), so M(F) ≠ M(G). On the other hand, since F(s_0, ·) = G(s_0, ·) we have arg min L_F = arg min L_G. By assumption (i),
$$ \mathcal{M}(F) \;=\; \operatorname*{arg\,min} L_F \;=\; \operatorname*{arg\,min} L_G \;=\; \mathcal{M}(G), $$
contradicting M(F) ≠ M(G). ∎
We introduce the stop-gradient operator sg ( · ) , where sg ( x ) = x during the forward evaluation, but ∇ sg ( x ) = 0 during the backward pass. The modified objective incorporating the stop-gradient mechanism is:
$$ L(\mathbf{a}, \mathbf{s}) \;=\; \sum_{t=0}^{T-1} \Big[ \big\| F_\theta(\mathrm{sg}(s_t), a_t) - s_{t+1} \big\|_2^2 \;+\; \beta_t\, \big\| F_\theta(\mathrm{sg}(s_t), a_t) - g \big\|_2^2 \Big], $$
where β t ≥ 0 are the goal loss coefficients. Note that the target g is effectively applied as a penalty on the state at each step to guide the local optimization.
Theorem 6 (Linear convergence to a unique fixed point) . Consider the gradient descent iteration
$$ \mathbf{z}^{k+1} \;=\; \mathbf{z}^k \;-\; \eta\, \nabla_{\mathbf{z}} L\big(\mathbf{z}^k\big), \qquad \mathbf{z} = (\mathbf{s}, \mathbf{a}), $$
where s 0 is fixed and gradients are computed with the stop-gradient convention (i.e., treating sg ( s t ) as constant during differentiation). Assume the linear dynamics setting and that
$$ F_\theta(s, a) \;=\; A s + B a, \qquad \beta_t \in [\beta_{\min}, \beta_{\max}], \quad \beta_{\min} > 0, $$
and that B has full column rank (equivalently, σ min ( B ) > 0 ). Then there exists a stepsize η ∈ (0 , ¯ η ) , where ¯ η depends only on β min , β max , ∥ B ∥ 2 , σ min ( B ) (and not on T ), such that the induced update operator T on z = ( s , a ) has a unique fixed point z ⋆ and the iterates converge linearly:
$$ \big\| \mathbf{z}^k - \mathbf{z}^\star \big\| \;\le\; C\, q^k\, \big\| \mathbf{z}^0 - \mathbf{z}^\star \big\| \qquad \text{for some } C > 0,\; q \in (0, 1). $$
Proof. Write the objective (re-indexing the goal term to match the action index) as
$$ L(\mathbf{s}, \mathbf{a}) \;=\; \sum_{t=0}^{T-1} \Big[ \big\| A\, \mathrm{sg}(s_t) + B a_t - s_{t+1} \big\|_2^2 \;+\; \beta_t\, \big\| A\, \mathrm{sg}(s_t) + B a_t - g \big\|_2^2 \Big]. $$
Define the residuals
$$ r_t \;=\; A\, \mathrm{sg}(s_t) + B a_t - s_{t+1}, \qquad r_t^g \;=\; A\, \mathrm{sg}(s_t) + B a_t - g. $$
Under the stop-gradient convention, sg( s t ) is treated as constant during differentiation, so s t does not receive gradient contributions through the A sg( s t ) terms. It follows that the only state-gradient at time t comes from the appearance of s t as the next state in the previous residual, namely
$$ \nabla_{s_t} L \;=\; -2\, r_{t-1}, $$
with the understanding that r -1 = 0 if s 0 is fixed. Likewise, the action-gradient at time t is
$$ \nabla_{a_t} L \;=\; 2\, B^\top \big( r_t + \beta_t\, r_t^g \big). $$
Therefore, gradient descent with stepsize η > 0 yields the explicit update rules
$$ s_t^{k+1} \;=\; s_t^k \;+\; 2\eta\, r_{t-1}^k, $$
$$ a_t^{k+1} \;=\; a_t^k \;-\; 2\eta\, B^\top \big( r_t^k + \beta_t\, r_t^{g,k} \big). $$
Stack the variables in the time-ordered vector z := ( s 1 , a 0 , s 2 , a 1 , . . . , s T , a T -1 ) . The updates above define an affine map z k +1 = T ( z k ) = J z k + c whose Jacobian J is block lower-triangular with respect to this ordering: indeed, ( s k +1 t +1 , a k +1 t ) depends only on ( s k t , s k t +1 , a k t ) (and on the fixed constants g and s 0 ) and is independent of any future variables ( s k t +2 , a k t +1 , . . . ) . Consequently, the eigenvalues of J are exactly the union of the eigenvalues of its diagonal blocks.
To characterize a diagonal block, fix t ∈ { 0 , . . . , T -1 } and consider the pair y t := ( s t +1 , a t ) . Conditioned on s t (which appears only as a constant inside sg( s t ) for differentiation), the update ( s k +1 t +1 , a k +1 t ) is precisely one gradient step on the quadratic function
$$ \phi_t(s_{t+1}, a_t) \;=\; \big\| A s_t + B a_t - s_{t+1} \big\|_2^2 \;+\; \beta_t\, \big\| A s_t + B a_t - g \big\|_2^2, $$
so the corresponding diagonal block equals I -ηH t where H t = ∇ 2 y t ϕ t is the constant Hessian with respect to ( s t +1 , a t ) :
$$ H_t \;=\; 2 \begin{pmatrix} I & -B \\ -B^\top & (1 + \beta_t)\, B^\top B \end{pmatrix}. $$
Assume β t ∈ [ β min , β max ] with β min > 0 and that B has full column rank, so B ⊤ B ≻ 0 . The Schur complement of the I block is
$$ (1 + \beta_t)\, B^\top B \;-\; B^\top B \;=\; \beta_t\, B^\top B \;\succ\; 0; $$
hence H t ≻ 0 for every t , with eigenvalues uniformly bounded away from 0 and ∞ as t varies:
$$ \mu\, I \;\preceq\; H_t \;\preceq\; L\, I $$
for constants µ, L depending only on β min , β max , ∥ B ∥ 2 , and σ min ( B ) . Choosing any stepsize η such that 0 < η < 2 /L , we obtain for every t that all eigenvalues of I -ηH t lie strictly inside the unit disk, and in particular:
$$ \rho\big( I - \eta H_t \big) \;\le\; \max\big( |1 - \eta\mu|,\; |1 - \eta L| \big) \;=:\; q_0 \;<\; 1, $$
where q_0 is independent of t and T. Since J is block lower-triangular and its diagonal blocks are exactly (I − ηH_t) (up to a fixed permutation corresponding to the stacking order), we conclude
$$ \rho(J) \;=\; \max_t\, \rho\big( I - \eta H_t \big) \;\le\; q_0 \;<\; 1. $$
Fix any q such that q_0 < q < 1. By Gelfand's formula, for any square matrix M and any q > ρ(M), there exists an induced norm ∥·∥† such that ∥M∥† ≤ q. Applying this with M = J yields a norm satisfying
$$ \| J \|_\dagger \;\le\; q. $$
Hence, for all k ≥ 0,
$$ \big\| \mathbf{z}^{k+1} - \mathbf{z}^\star \big\|_\dagger \;=\; \big\| J\, (\mathbf{z}^k - \mathbf{z}^\star) \big\|_\dagger \;\le\; q\, \big\| \mathbf{z}^k - \mathbf{z}^\star \big\|_\dagger, $$
and therefore
$$ \big\| \mathbf{z}^k - \mathbf{z}^\star \big\|_\dagger \;\le\; q^k\, \big\| \mathbf{z}^0 - \mathbf{z}^\star \big\|_\dagger. $$
By norm equivalence in finite dimensions, there exists C > 0 such that
$$ \big\| \mathbf{z}^k - \mathbf{z}^\star \big\| \;\le\; C\, q^k\, \big\| \mathbf{z}^0 - \mathbf{z}^\star \big\|. \qquad \square $$
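The contraction of Theorem 6 can be checked numerically; the sketch below (toy linear model, illustrative sizes and stepsize, not the paper's code) iterates the stop-gradient updates and verifies that a further step barely moves the iterate:

```python
import numpy as np

# Hedged numeric check of Theorem 6 (linear setting F(s,a) = A s + B a):
# the stop-gradient updates form an affine iteration whose Jacobian has
# spectral radius < 1, so (s, a) contracts to a unique fixed point.
rng = np.random.default_rng(4)
n, T, eta, beta = 3, 6, 0.05, 0.5
A = 0.9 * np.eye(n)
B = np.eye(n) + 0.1 * rng.normal(size=(n, n))   # full column rank
s0, g = np.zeros(n), np.ones(n)

def step(a, z):
    # z rows are the free states s_1 .. s_T; s_0 is fixed
    s_in = np.vstack([s0, z[:-1]])
    pred = s_in @ A.T + a @ B.T
    r, rg = pred - z, pred - g                   # residuals r_t and r_t^g
    z_new = z + 2 * eta * r                      # state step: grad is -2 r_t
    a_new = a - 2 * eta * (r + beta * rg) @ B    # action step
    return a_new, z_new

a, z = np.zeros((T, n)), rng.normal(size=(T, n))
for _ in range(10_000):
    a, z = step(a, z)
```

After enough iterations the update map is essentially at its fixed point, consistent with the linear convergence rate being independent of T.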
Notes on the stopgrad optimization. The optimization indeed converges to a fixed point, but one can show that in the linear convex case these stable points are merely greedy rollouts towards the goal. Two things make the optimization in our setting nontrivial: the nonconvexity of the world model F_θ, and the stochastic noise on the states s_t. We now present some characterization of the distribution of trajectories that our planner tends towards.
Let F θ : S × A → S be a differentiable world model and define the stop-gradient one-step prediction
$$ \mu_t \;=\; F_\theta\big( \mathrm{sg}(s_t),\, a_t \big). $$
Consider the stopgrad lifted objective (cf. Eq. (10))
$$ L(\mathbf{a}, \mathbf{s}) \;=\; \sum_{t=0}^{T-1} \Big[ \big\| \mu_t - s_{t+1} \big\|_2^2 \;+\; \gamma\, \big\| \mu_t - g \big\|_2^2 \Big], \tag{92} $$
and the (no-sync) optimization updates
$$ s_{t+1}^{k+1} \;=\; s_{t+1}^k \;-\; 2\eta_s\, \big( s_{t+1}^k - \mu_t^k \big) \;+\; \sigma\, \xi_t^k, \tag{93} $$
$$ a_t^{k+1} \;=\; a_t^k \;-\; \eta_a\, \nabla_{a_t} L\big(\mathbf{a}^k, \mathbf{s}^k\big). \tag{94} $$
Throughout, assume 0 < η s < 1 (for stability of the state contraction).
Theorem 7 (Gaussian tube around one-step predictions) . Fix { ¯ s k t } T -1 t =0 and { a k t } T -1 t =0 at iteration k , and let µ k t = F θ (¯ s k t , a k t ) . Then the state update Eq. (93) satisfies the conditional mean recursion
$$ \mathbb{E}\big[ s_{t+1}^{k+1} \,\big|\, \mathbf{s}^k, \mathbf{a}^k \big] \;=\; (1 - 2\eta_s)\, s_{t+1}^k \;+\; 2\eta_s\, \mu_t^k. \tag{95} $$
Moreover, if µ k t ≡ µ t is held fixed, then s k t +1 converges in distribution to a Gaussian 'tube' around µ t :
$$ s_{t+1}^k \;\xrightarrow{\;d\;}\; \mathcal{N}\Big( \mu_t,\; \tfrac{\sigma^2}{4\eta_s(1-\eta_s)}\, I \Big) \quad \text{as } k \to \infty. \tag{96} $$
Analogously, in continuous optimization-time τ, the limiting SDE
$$ \mathrm{d} s_{t+1} \;=\; -\lambda\, \big( s_{t+1} - \mu_t \big)\, \mathrm{d}\tau \;+\; \sigma\, \mathrm{d} W_\tau $$
has stationary law N(μ_t, (σ²/2λ) I).
Proof. Rewrite Eq. (93) as an affine Gaussian recursion:
$$ s_{t+1}^{k+1} \;=\; (1 - 2\eta_s)\, s_{t+1}^k \;+\; 2\eta_s\, \mu_t^k \;+\; \sigma\, \xi^k. $$
Taking conditional expectation yields Eq. (95). If µ k t ≡ µ t is fixed, the centered process u k := s k t +1 -µ t satisfies u k +1 = (1 -2 η s ) u k + σξ k , i.e. an AR(1) process with contraction factor | 1 -2 η s | < 1 . Its unique stationary covariance Σ tube solves the discrete Lyapunov equation Σ tube = (1 -2 η s ) 2 Σ tube + σ 2 I , giving Eq. (96). The continuous-time statement is standard, given that µ t is fixed.
Theorem 8 (Goal shaping induces goal-directed drift of tube center) . Define µ k t = F θ (¯ s k t , a k t ) and let
$$ J_t^k \;=\; \nabla_{a_t} F_\theta\big( \bar s_t^k, a_t^k \big), \qquad P_t^k \;=\; J_t^k \big( J_t^k \big)^{\!\top}. $$
Assume a first-order linearization in the action step:
$$ \mu_t^{k+1} \;\approx\; \mu_t^k \;+\; J_t^k\, \big( a_t^{k+1} - a_t^k \big). $$
Then the action update Eq. (94) induced by Eq. (92) yields the tube-center evolution
$$ \mu_t^{k+1} \;=\; \mu_t^k \;-\; \alpha\, P_t^k \Big[ \gamma\, \big( \mu_t^k - g \big) \;-\; \varepsilon_{t+1}^k \Big], \qquad \alpha = 2\eta_a, \tag{98} $$
where ε k t +1 denotes the tube residual s k t +1 = µ k t + ε k t +1 . In particular, if E [ ε k t +1 | µ k t ] = 0 , then
$$ \mathbb{E}\big[ \mu_t^{k+1} \,\big|\, \mu_t^k \big] \;=\; \big( I - \alpha\gamma\, P_t^k \big)\, \mu_t^k \;+\; \alpha\gamma\, P_t^k\, g, \tag{99} $$
so in controllable directions the mean prediction µ t moves toward g as an averaging step. If 0 < αγλ max ( P k t ) < 1 , this is a contractive averaging step toward g on Range( P k t ) .
Proof. From Eq. (92), only terms at time t depend on a t via µ t . Using ∇ a t µ t = J t , the gradient is
$$ \nabla_{a_t} L \;=\; 2\, \big( J_t^k \big)^{\!\top} \Big[ \big( \mu_t^k - s_{t+1}^k \big) + \gamma\, \big( \mu_t^k - g \big) \Big], $$
so the update Eq. (94) gives
$$ a_t^{k+1} \;=\; a_t^k \;-\; 2\eta_a\, \big( J_t^k \big)^{\!\top} \Big[ \big( \mu_t^k - s_{t+1}^k \big) + \gamma\, \big( \mu_t^k - g \big) \Big], $$
and the linearization yields
$$ \mu_t^{k+1} \;\approx\; \mu_t^k \;-\; 2\eta_a\, P_t^k \Big[ \big( \mu_t^k - s_{t+1}^k \big) + \gamma\, \big( \mu_t^k - g \big) \Big], $$
which simplifies to Eq. (98) after substituting α = 2 η a . Taking conditional expectation and using E [ ε k t +1 | µ k t ] = 0 yields Eq. (99).
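The goal-directed drift of Theorem 8 can also be simulated directly; the sketch below (illustrative J, α, γ, and noise scale, not the paper's code) iterates the tube-center recursion of Eq. (98) with zero-mean residuals:

```python
import numpy as np

# Hedged check of Theorem 8: with zero-mean tube residuals, the tube
# center mu is pulled toward the goal g by a contractive averaging step
# on the range of P = J J^T. All constants are illustrative.
rng = np.random.default_rng(6)
alpha, gamma = 0.1, 1.0
J = np.eye(3)                             # fully controllable toy Jacobian
P = J @ J.T
g = np.array([2.0, -1.0, 0.5])
mu = np.zeros(3)
for _ in range(300):
    eps = 0.05 * rng.standard_normal(3)   # zero-mean tube residual
    mu = mu - alpha * P @ (gamma * (mu - g) - eps)   # recursion of Eq. (98)
print(np.abs(mu - g).max())
```

After the transient, μ hovers in a small noise ball around g, as the conditional-expectation contraction predicts.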
Unlike the stopgrad lifted-state dynamics (Theorem 7), the rollout distribution is not contracted toward the current rollout, but instead noise accumulates throughout nonlinear iterations of the world model. The stochastic rollout distribution need not concentrate in a local tube around the current deterministic rollout trajectory; it can drift and spread away as the horizon T grows.
Theorem 9 (Mean evolution and non-tube behavior of noisy rollouts) . Consider a rollout-based stochastic trajectory generated in model-time:
$$ s_{t+1} \;=\; F_\theta(s_t, a_t) \;+\; \sigma_{\text{env}}\, \xi_t, \qquad \xi_t \sim \mathcal{N}(0, I), \tag{101} $$
and a dense goal objective along the rollout (e.g. ∑ T -1 t =0 ∥ s t +1 -g ∥ 2 ). Let m t := E [ s t ] and Σ t := Cov( s t ) . The rollout mean obeys the exact identity
$$ m_{t+1} \;=\; \mathbb{E}\big[ F_\theta(s_t, a_t) \big]. \tag{102} $$
In particular, if F θ is affine in s (i.e. F θ ( s, a ) = As + Ba + c ), then
$$ m_{t+1} \;=\; A\, m_t \;+\; B\, a_t \;+\; c. \tag{103} $$
For general nonlinear F θ , a second-order moment expansion yields
$$ m_{t+1} \;\approx\; F_\theta(m_t, a_t) \;+\; \tfrac{1}{2} \sum_{i,j} \frac{\partial^2 F_\theta}{\partial s_i\, \partial s_j}(m_t, a_t)\, \big( \Sigma_t \big)_{ij}, \tag{104} $$
so the mean generally does not follow the deterministic rollout F_θ(m_t, a_t). The covariance propagates forward approximately as
$$ \Sigma_{t+1} \;\approx\; G_t\, \Sigma_t\, G_t^\top \;+\; \sigma_{\text{env}}^2\, I, \qquad G_t = \nabla_s F_\theta(m_t, a_t). \tag{105} $$
Proof. Taking conditional expectation of Eq. (101) gives E[s_{t+1} | s_t] = F_θ(s_t, a_t), and then total expectation implies Eq. (102). For affine F_θ, expectation commutes with F_θ, yielding Eq. (103). For nonlinear F_θ, expand F_θ(s_t, a_t) around m_t by Taylor's theorem; the first-order term vanishes in expectation and the second-order term produces Eq. (104). The covariance recursion Eq. (105) follows by linearizing F_θ(s_t, a_t) ≈ F_θ(m_t, a_t) + G_t(s_t − m_t) and computing Cov(·), adding the independent noise variance σ²_env I. The final 'non-tube' claim follows because there is no optimization-time contraction that repeatedly pulls s_{t+1} back toward a moving center (as in Eq. (95)); instead the forward propagation Eq. (105) typically increases spread, and the mean can deviate from the deterministic path by Eq. (104).
Theorem 7 shows that noisy lifted state updates form a noisy 'tube' around the one-step predictions m_t = F_θ(¯s_t, a_t), keeping exploration local and dynamically consistent in optimization time. Theorem 8 then shows that dense one-step goal shaping moves the tube center toward the goal by a preconditioned averaging step, while the dynamics residual contributes approximately zero-mean stochastic forcing that enables exploration without horizon-coupled backpropagation. In contrast, Theorem 9 shows that noisy rollouts evolve by forward propagation of randomness: the mean follows m_{t+1} = E[F_θ(s_t, a_t)] (not generally the deterministic rollout), and the distribution can drift and spread rather than concentrate in a tube around the current plan.
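The contrast in Theorem 9 can be seen numerically. Below is a toy simulation (the dynamics map and all constants are illustrative, not a learned world model): under noisy rollouts, per-step noise is propagated forward through the nonlinearity, so the spread of states grows with the horizon and the sample mean can drift away from the deterministic rollout.

```python
import numpy as np

# Toy simulation of Theorem 9: under noisy rollouts s_{t+1} = F(s_t, a_t) + eps,
# noise accumulates through the nonlinear iterations, so the spread of states
# grows with the horizon and the sample mean deviates from the deterministic
# rollout. F below is an illustrative scalar map, not a learned model.
rng = np.random.default_rng(0)

def F(s, a):
    return 1.05 * np.sin(s) + a    # mildly expansive nonlinear dynamics

T, n_samples, sigma_env = 20, 5000, 0.1
a_t = 0.1                          # constant action, for simplicity

s = np.zeros(n_samples)            # stochastic rollouts, all from s_0 = 0
s_det = 0.0                        # deterministic rollout
spreads = []
for t in range(T):
    s = F(s, a_t) + sigma_env * rng.standard_normal(n_samples)
    s_det = F(s_det, a_t)
    spreads.append(s.std())

mean_gap = abs(s.mean() - s_det)   # mean drifts from the deterministic path
```

The final spread exceeds the one-step noise level, illustrating the absence of any optimization-time contraction back toward a tube center.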
Full-rollout synchronization
The no-state-gradient updates are designed to be robust to brittle state-input Jacobians of the learned world model. However, the stochastic optimization in Eq. (11) still needs a method of strict descent towards

Figure 4 Virtual states learned through planning. All examples are instantiations of our planner at horizon 50 in the Point-Maze, Wall-Single, and Push-T environments. Regardless of the dynamics constraint relaxation and state noising, directly optimized states find realistic, non-greedy paths towards the goal.
true minima. In practice, we found it beneficial to periodically 'sync' the plan by briefly running standard full-gradient planning on the original rollout objective.
Full-gradient rollout step. Every K sync iterations, we perform J sync steps of gradient descent on the original planning loss
$$
L_{\mathrm{roll}}(\mathbf{a}) = \left\| s_T(\mathbf{a}, s_0) - g \right\|_2^2,
$$
where s T ( a , s 0 ) is computed by sequentially rolling out the world model
$$
s_{t+1} = F_\theta(s_t, a_t), \qquad t = 0, \dots, T - 1.
$$
During this synchronization phase we update only the actions,
$$
\mathbf{a} \leftarrow \mathbf{a} - \eta_{\mathrm{sync}}\, \nabla_{\mathbf{a}} L_{\mathrm{roll}}(\mathbf{a}),
$$
using full backpropagation through the T -step rollout. By keeping these GD steps small relative to the stochastic dynamics of Eq. (11), we benefit from the smoothed loss landscape in Figure 1c for wider exploration, and the sharp but brittle landscape in Figure 1b for refinement.
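The synchronization step above can be sketched on toy linear dynamics (a stand-in for the learned F_θ; all constants are illustrative). The gradient of the terminal loss is obtained by an explicit adjoint sweep, i.e. backpropagation through the full T-step rollout:

```python
import numpy as np

# Minimal sketch of the full-gradient synchronization step on linear toy
# dynamics s_{t+1} = A s_t + B a_t (a stand-in for the learned F_theta).
# Every K_sync stochastic iterations, the planner takes J_sync small GD steps
# on the original rollout loss ||s_T(a, s_0) - g||^2, with gradients computed
# by backpropagation through the full T-step rollout (an adjoint sweep here).
A = 0.9 * np.eye(2)
B = np.eye(2)

def rollout(actions, s0):
    states = [s0]
    for a in actions:
        states.append(A @ states[-1] + B @ a)
    return states

def sync_step(actions, s0, goal, j_sync=50, lr=0.05):
    actions = actions.copy()
    for _ in range(j_sync):
        states = rollout(actions, s0)
        lam = 2.0 * (states[-1] - goal)             # adjoint at the final state
        grad = np.zeros_like(actions)
        for t in reversed(range(len(actions))):     # backprop through time
            grad[t] = B.T @ lam
            lam = A.T @ lam
        actions -= lr * grad
    return actions

s0, goal = np.zeros(2), np.ones(2)
a0 = np.zeros((10, 2))
a_synced = sync_step(a0, s0, goal)
final = rollout(a_synced, s0)[-1]
```

In the full planner these GD steps are kept small relative to the stochastic updates, as described above.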
Results
We evaluate our proposed planner GRASP across two complementary classes of environments designed to test (i) nonconvex long-horizon planning with obstacles and (ii) data-driven visual control under learned dynamics. Concretely, these experiments aim to answer three questions:
- Can the proposed planner overcome the greedy local minima that often trap shooting methods?
- Does the method remain robust as the planning horizon increases?
- Does the proposed planner converge to plans faster than rollout-based planners?
We provide self-ablations in Table 2, demonstrating the value of the planner's main components: the state-gradient detaching, the GD sync steps, and the noise level.
Figure 4 visualizes planning iterations in several navigation environments, illustrating how trajectories initialized far from dynamically consistent rollouts converge to feasible plans that satisfy the learned dynamics.
Baselines
We compare against three commonly used planners. CEM optimizes action sequences by iteratively sampling candidate trajectories, selecting elites, and refitting a sampling distribution. GD directly optimizes the action sequence by backpropagating through the dynamics model. LatCo (Rybkin et al., 2021) optimizes in a lifted latent/state-space by jointly adjusting intermediate latent variables and actions. This setting is different from the original LatCo method, which was applied in a model-based RL environment, but it still provides an important baseline for what is achievable by direct optimization of Eq. (4).
For all methods, we sweep over hyperparameters and report results using the best-performing setting for each environment and horizon. For our planner, we initialize the states {s_t}_{t=0}^{T} as noised samples around the linear interpolation between s_0 and g: s_t = (t/T) g + (1 − t/T) s_0 + z, z ∼ N(0, ϵI), and actions initialized
Table 1 Open-loop planning results on long range Push-T. Reported are success rate (%) and median success time (seconds; successful trials only) across planning horizons. 500 trials per setting. Each cell reports Success / Time .

Figure 5 Success rate over time at a fixed horizon. Success rate over fixed set of open-loop planning tasks for CEM, GD, LatCo (Rybkin et al., 2021), and our planner for a fixed horizon of 50. Curves summarize how quickly each planner makes progress under the learned world model setting when evaluated at a fixed planning horizon. Shaded regions are Wald 95% confidence intervals.
at zero: a_t = 0.
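The initialization just described can be sketched as follows (dimensions are illustrative):

```python
import numpy as np

# Sketch of the planner's initialization described above: virtual states are
# noised samples around the linear interpolation between s_0 and the goal g,
# and actions start at zero. Dimensions here are illustrative.
rng = np.random.default_rng(0)

def init_plan(s0, g, T, action_dim, eps=0.01):
    ts = np.arange(T + 1)[:, None] / T                   # t/T for t = 0..T
    states = ts * g + (1 - ts) * s0                      # linear interpolation
    states += rng.normal(scale=np.sqrt(eps), size=states.shape)  # z ~ N(0, eps I)
    actions = np.zeros((T, action_dim))                  # a_t = 0
    return states, actions

states, actions = init_plan(np.zeros(4), np.ones(4), T=50, action_dim=2)
```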
Environments and evaluation protocol
We evaluate planning on three visual control environments with learned dynamics: PointMaze , WallSingle , and Push-T . World models are trained using the DINO-wm framework (Zhou et al., 2024), following the original paper's setup, where the world model F_θ(s, a) takes 5 actions and predicts 5 steps ahead; that is, if dim(A) = 2, then F_θ takes actions as stacked vectors a ∈ R^10.
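For concreteness, the action stacking works as in the following sketch (values are illustrative):

```python
import numpy as np

# Illustrative sketch of the action stacking described above: with dim(A) = 2,
# 5 primitive actions are concatenated into a single model input a in R^10.
primitive_actions = np.arange(10.0).reshape(5, 2)   # 5 actions, 2-dim each
stacked = primitive_actions.reshape(-1)             # model input, shape (10,)
```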
All reported metrics measure task success under the learned world model. Success is defined as reaching the goal region within the planning horizon.
Long-term planning and horizon scaling
We evaluate planners in the long-horizon regime that our parallelized stochastic planner is designed for, where greedy local minima and optimization instability become the dominant challenges.
GRASP remains reliable as the planning horizon increases: it solves more tasks and finishes a majority of its successful trials faster, showing stronger robustness
Table 2 Ablation studies over the GD sync steps, the noise level for the Langevin dynamics, and whether we use our detached-gradient approach with the goal-reaching objective. GD sync happens every 100 stochastic steps. Ablations are done on the Push-T environment at horizon H = 40. Time reported is the median over successful trials; ablated variants only beat our method on time when their success rate is much lower.
to long horizons than the baselines, as shown in Table 1. Beyond the median completion times, Figure 5 further illustrates the solving speed of our planner, showing that most of its plans converge earlier. At the longer Push-T horizons, where exploring non-greedy optima matters most, GRASP succeeds more often than the baselines and finds the needed non-greedy trajectories, as visualized in Figure 6.
Short-term planning
We also evaluate short-horizon planning, to demonstrate that our planner can match performance on shorter, easier tasks. Table 3 reports success rates across environments for horizons ranging from H = 10 to H = 30 , while Table 4 reports median wall-clock planning times.
Table 3 Short Term Planning. Success rate (%) for Push-T, PointMaze, and WallSingle. 500 trials per setting. Our method has comparable success rates while maintaining consistently low completion times (Table 4).
Across all environments and short horizons, the proposed planner achieves success rates comparable to the baselines. Alongside similar success rates, our proposed planner exhibits consistently low planning times. As shown in Table 4, it is among the fastest methods across all environments and horizons, often significantly faster than sampling-based approaches and competitive with gradient-based optimization, trading a small amount of speed for a higher success rate. These results indicate that even in relatively short and easy planning regimes, our planner remains competitive with the baselines.
Overall, these results demonstrate that GRASP consistently matches strong baselines in short-term planning, while outperforming them in long-horizon settings by avoiding greedy failures and converging more quickly in wall-clock time.
Related work
World modeling has shown significant improvement in sample efficiency for model-based reinforcement learning (Hafner et al., 2025). By learning to predict future states given current states and actions, world models enable planning without access to an interactive environment (Ding et al., 2024). Recent work has focused on learning latent-space representations to handle high-dimensional observations (Assran et al., 2025), with models now demonstrating the ability to scale and generalize across diverse environments (Bar et al., 2025). In this paper, we develop an efficient planning algorithm for action-conditioned video models.
Sampling-based planning in world models traditionally relies on methods like the Cross-Entropy Method (CEM Rubinstein and Kroese (2004)) and random shooting. While these methods are robust and simple to implement, they suffer from serial evaluation bottlenecks and poor scaling with planning horizon length (Bharadhwaj et al., 2020). Recent work proposes performance improvements-such as faster CEM variants with action correlation and memory, parallelized sampling via diffeomorphic transforms, and massively parallel strategies-but fundamental limitations remain for very long horizons (Pinneri et al., 2021; Lai et al., 2022).
Gradient-based planning leverages the differentiability of neural world models to optimize action sequences directly (Jyothir et al., 2023). Early approaches applied backpropagation through time to optimize actions (Thrun et al., 1990), but face challenges with vanishing/exploding gradients and poor conditioning over long horizons. Hybrid strategies combining gradient descent with sampling-based methods-such as interleaving CEM with gradient updates-have shown promise. CEM-GD variants interleave backward passes through the learned model with population-based search for improved convergence and scalability (Bharadhwaj et al., 2020; Huang et al., 2021). Recent work improves gradient-based planners by training world models to be adversarially robust (Parthasarathy et al., 2025); our method instead aims to improve gradient-based planning for more general world models, without any pretraining modifications.
State optimization and multiple shooting in optimal control. The idea of treating states as optimization variables separate from dynamics constraints has a rich history in classical optimal control. Non-condensed QP formulations in MPC (Jerez et al., 2011) decouple state and input optimization for improved numerical properties. Multiple shooting methods (Tamimi and Li, 2009; Diedam and Sager, 2018) break long-horizon problems into shorter segments with continuity constraints. Direct collocation approaches (Bordalba et al., 2022; Nie and Kerrigan, 2025) optimize state and control trajectories simultaneously while enforcing dynamics through collocation constraints. These methods have primarily been applied to systems with known analytical dynamics. Trajectory optimization methods in robotics have developed parallel shooting techniques and GPU-accelerated planning algorithms (Guhathakurta et al., 2022), but most approaches still face fundamental limitations when applied to learned world models, particularly visual world models where dynamics are approximate and high-dimensional.
Noise and regularization in optimization. Stochastic optimization techniques and noise injection have long been recognized for their ability to improve optimization outcomes (Robbins and Monro, 1951), and can help regularize and explore complex loss landscapes (Welling and Teh, 2011; Xu et al., 2018; Bras, 2023; Foret et al., 2020). In the context of planning, noise is commonly used in sampling-based methods, but its systematic incorporation into gradient-based planning for learned models remains underexplored.
Limitations and future work
Although the proposed planner shows clear advantages in long-horizon settings, its benefits are more limited at short horizons. As demonstrated in our experiments, for small planning horizons the planner typically achieves success rates and completion times that are comparable to strong baselines such as CEM and gradient-based optimization, rather than strictly outperforming them. This suggests that the primary gains of the method arise in regimes where long-horizon reasoning and non-greedy planning are essential, rather than in short-horizon settings where simpler methods already perform well.
Hybrid planners (that combine iterations of a rollout-based planner like CEM and a gradient-based planner like GD) have been implemented to get the 'best of both worlds' from the two approaches (Huang et al., 2021), and there are many ways to 'hybridize' our planner as well. We leave exploration of such methods for future work.
Several components of the planner are designed to mitigate the unreliability of state gradients in learned world models. While effective, these modifications introduce additional structure and hyperparameters that would ideally be unnecessary. If state representations induced by the world model were smoother or more geometrically well-behaved in the state space, many of these stabilization mechanisms could be removed, potentially leading to further speed improvements. Promising directions toward this goal include improved representation learning through adversarial training, diffusion-based world models, or other techniques that explicitly regularize the geometry of the learned state space.
Conclusion
World models provide a powerful framework for planning in complex environments, but existing approaches struggle with long horizons, high-dimensional actions, and serial computation. We propose GRASP, a new gradient-based planning algorithm with two key contributions: (a) a lifted planner that optimizes actions together with 'virtual states' in a time-parallel manner, yielding more stable and scalable optimization while allowing direct control over exploration via stochastic state updates, and (b) an action-gradient-only planning variant for learned visual world models that avoids brittle state-input gradients while still exploiting differentiability with respect to actions. Experiments on visual world-model benchmarks show that our approach remains robust as horizons grow and finds non-greedy solutions at a faster rate than commonly used planners such as CEM or vanilla GD.
Theory
In this section, we provide a theoretical analysis of the convergence properties of our planning approach compared to traditional shooting methods. We consider a simplified linear dynamics setting to derive formal convergence guarantees. We reproduce proofs here for self-containedness; see e.g. Ascher et al. (1995) for other theoretical treatments of this problem.
Convergence of various methods in the convex setting
Consider a linear dynamical system with dynamics:
$$
s_{t+1} = A s_t + B a_t,
$$
where s t ∈ R n is the state at time t , a t ∈ R m is the control input, and A ∈ R n × n , B ∈ R n × m .
Let F ( s 0 , a ) : R n × R mT → R n denote the rollout of the dynamics for T timesteps. Under linear dynamics, F takes the compact form:
$$
F(s_0, \mathbf{a}) = A^T s_0 + \sum_{t=0}^{T-1} A^{T-1-t} B\, a_t.
$$
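The compact form can be checked numerically against the sequential rollout (dimensions and matrices below are arbitrary illustrative choices):

```python
import numpy as np

# Numerical check of the compact rollout form: under s_{t+1} = A s_t + B a_t,
# the final state is s_T = A^T s_0 + C_T a with C_T = [A^{T-1} B, ..., B].
# Dimensions and matrices are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n, m, T = 3, 2, 6
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, m))
s0 = rng.normal(size=n)
a = rng.normal(size=(T, m))

# sequential rollout
s = s0
for t in range(T):
    s = A @ s + B @ a[t]

# compact form with the controllability matrix
C_T = np.hstack([np.linalg.matrix_power(A, T - 1 - t) @ B for t in range(T)])
s_compact = np.linalg.matrix_power(A, T) @ s0 + C_T @ a.reshape(-1)
```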
We analyze the optimization landscape of two fundamental formulations for reaching a target state g from initial state s init .
The Shooting Method optimizes only the controls and penalizes the terminal error of the sequential rollout:
$$
J_S(\mathbf{a}) = \left\| F(s_{\mathrm{init}}, \mathbf{a}) - g \right\|_2^2.
$$
The Lifted States Method (or Multiple Shooting) treats both states and controls as optimization variables and minimizes the violation of the dynamics constraints (the physics defects):
$$
J_L(\mathbf{s}, \mathbf{a}) = \sum_{t=0}^{T-1} \left\| s_{t+1} - (A s_t + B a_t) \right\|_2^2,
$$
$$
s_0 = s_{\mathrm{init}}, \qquad s_T = g,
$$
where s = ( s 0 , . . . , s T ) .
Matrix Representation To analyze the convergence properties, we express both objectives in quadratic matrix form.
Shooting Method. Let a = ( a 0 ; . . . ; a T -1 ) ∈ R mT . The final state is linear in a :
$$
s_T = A^T s_0 + C_T\, \mathbf{a},
$$
where C_T = [A^{T−1}B, A^{T−2}B, …, B] ∈ R^{n×mT} is the controllability matrix. The objective is:
$$
J_S(\mathbf{a}) = \left\| A^T s_0 + C_T\, \mathbf{a} - g \right\|_2^2.
$$
The Hessian of this objective is H S = 2 C ⊤ T C T .
Lifted Method. We eliminate the fixed boundary variables s 0 and s T and optimize over the free variables z = ( s 1 ; . . . ; s T -1 ; a 0 ; . . . ; a T -1 ) . The dynamics residuals can be written as a linear system M z -b . The dynamics equations for t = 0 , . . . , T -1 correspond to rows in a large matrix M .
The objective is J L ( z ) = ∥ M z -b ∥ 2 2 , and the Hessian is H L = 2 M ⊤ M . The matrix M has a sparse, block-banded structure.
Smoothness Analysis We compare the smoothness of the two optimization problems by comparing the Lipschitz constants of their gradients, L = λ_max(H).
Theorem 2 (Shooting: Exploding Smoothness) . Let A ⊤ have a real eigenvalue λ with | λ | > 1 and unit eigenvector v (a left eigenvector of A ). Assume B aligns with this mode such that for some input direction w ( ∥ w ∥ 2 = 1 ), the projection |⟨ v, Bw ⟩| = µ > 0 .
Then, the Lipschitz constant of the Shooting method gradient grows exponentially with T :
$$
L_S = \lambda_{\max}(H_S) \ge 2\, \mu^2\, |\lambda|^{2(T-1)}.
$$
Proof. The Lipschitz constant is the maximum eigenvalue of the Hessian H S = 2 C ⊤ T C T , which equals 2 ∥C T ∥ 2 2 . By definition, the spectral norm is the maximum gain over all possible inputs:
$$
\| C_T \|_2 = \sup_{\| \mathbf{a} \|_2 = 1} \| C_T\, \mathbf{a} \|_2.
$$
From the existence of a controllable non-contractive mode, we can construct a unit-norm a test = ( w ; 0 ; . . . ; 0 ) that evaluates to the form:
$$
C_T\, \mathbf{a}_{\mathrm{test}} = A^{T-1} B\, w,
$$
and such that the following holds when projecting to the corresponding unstable eigenvector v :
$$
\| C_T\, \mathbf{a}_{\mathrm{test}} \|_2 \ge \left| \left\langle v, A^{T-1} B w \right\rangle \right| = \left| \left\langle (A^\top)^{T-1} v, B w \right\rangle \right| = |\lambda|^{T-1} \left| \left\langle v, B w \right\rangle \right| = \mu\, |\lambda|^{T-1}.
$$
Squaring this result gives the desired bound.
Theorem 3 (Lifted: Bounded Smoothness) . The Lipschitz constant of the Lifted method gradient is bounded independently of T :
$$
L_L = \lambda_{\max}(H_L) \le 6 \left( 1 + \| A \|_2^2 + \| B \|_2^2 \right).
$$
Proof. The Hessian is H L = 2 M ⊤ M , so λ max ( H L ) = 2 ∥ M ∥ 2 2 . It therefore suffices to upper bound ∥ M ∥ 2 by a constant that does not depend on T .
The matrix M is block-sparse: each block row corresponding to timestep t contains at most three non-zero blocks, namely an identity block (selecting s t +1 ), a dynamics block (multiplying s t ), and an input block (multiplying a t ). Equivalently, the corresponding residual has the form
$$
r_t = s_{t+1} - A s_t - B a_t,
$$
with s 0 and s T treated as fixed boundary values (so r t is affine in the free variables).
Fix any optimization vector z = ( s 1 ; . . . ; s T -1 ; a 0 ; . . . ; a T -1 ) (stacking the free states and controls), and let M z denote the stacked residuals ( r 0 , . . . , r T -1 ) with constants removed. Using the inequality ∥ x + y + z ∥ 2 2 ≤ 3( ∥ x ∥ 2 2 + ∥ y ∥ 2 2 + ∥ z ∥ 2 2 ) and the operator norm bounds ∥ As ∥ 2 ≤ ∥ A ∥ 2 ∥ s ∥ 2 , ∥ Ba ∥ 2 ≤ ∥ B ∥ 2 ∥ a ∥ 2 , we obtain for each t :
$$
\| r_t \|_2^2 = \| s_{t+1} - A s_t - B a_t \|_2^2 \le 3 \left( \| s_{t+1} \|_2^2 + \| A \|_2^2\, \| s_t \|_2^2 + \| B \|_2^2\, \| a_t \|_2^2 \right).
$$
Crucially, due to the banded structure, each free intermediate state s 1 , . . . , s T -1 appears in at most two terms in the sum: once as s t +1 and once as s t . Hence the state contributions can be bounded without any accumulation in T , yielding the following:
$$
\| M \mathbf{z} \|_2^2 = \sum_{t=0}^{T-1} \| r_t \|_2^2 \le 3 \left( \sum_{t} \| s_{t+1} \|_2^2 + \| A \|_2^2 \sum_{t} \| s_t \|_2^2 + \| B \|_2^2 \sum_{t} \| a_t \|_2^2 \right)
$$
$$
\le 3 \left( 1 + \| A \|_2^2 + \| B \|_2^2 \right) \| \mathbf{z} \|_2^2.
$$
Therefore, ∥M∥²_2 = sup_{z ≠ 0} ∥Mz∥²_2 / ∥z∥²_2 ≤ 3(1 + ∥A∥²_2 + ∥B∥²_2), and thus
$$
\lambda_{\max}(H_L) = 2 \| M \|_2^2 \le 6 \left( 1 + \| A \|_2^2 + \| B \|_2^2 \right),
$$
which is independent of T .
Interpretation. While the lower bound on the Shooting method's Hessian required a slightly more restrictive assumption, such a condition is not unreasonable: many realistic systems are not universally stable within the controllable subspace. In these settings, the shooting method forces the optimization to traverse a loss landscape whose curvature varies exponentially in T (requiring exponentially small steps to remain stable), while the lifted states method 'preconditions' the problem by creating variables for intermediate states, decoupling the long-term dependencies into local constraints. This results in a loss landscape with uniform O(1) smoothness with respect to the planning horizon length T.
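This interpretation can be verified numerically: for a system with one unstable mode, the shooting Hessian's top eigenvalue explodes with T, while the lifted Hessian's stays below the 6(1 + ∥A∥² + ∥B∥²) bound. A sketch (matrices are illustrative):

```python
import numpy as np

# Numerical illustration of Theorems 2-3: for an unstable A, the shooting
# Hessian's top eigenvalue grows exponentially with T, while the lifted
# (multiple-shooting) Hessian's stays bounded by 6(1 + ||A||^2 + ||B||^2).
n, m = 2, 2
A = np.diag([1.2, 0.5])          # one unstable mode, |lambda| = 1.2 > 1
B = np.eye(n)

def shooting_smoothness(T):
    C = np.hstack([np.linalg.matrix_power(A, T - 1 - t) @ B for t in range(T)])
    return 2 * np.linalg.norm(C, 2) ** 2          # lambda_max(2 C^T C)

def lifted_smoothness(T):
    # M stacks residuals r_t = s_{t+1} - A s_t - B a_t over free variables
    # z = (s_1, ..., s_{T-1}, a_0, ..., a_{T-1}); s_0 and s_T are fixed.
    n_free = n * (T - 1) + m * T
    M = np.zeros((n * T, n_free))
    for t in range(T):
        if t + 1 <= T - 1:                        # s_{t+1} is a free variable
            M[n*t:n*(t+1), n*t:n*(t+1)] = np.eye(n)
        if t >= 1:                                # s_t is a free variable
            M[n*t:n*(t+1), n*(t-1):n*t] = -A
        M[n*t:n*(t+1), n*(T-1) + m*t : n*(T-1) + m*(t+1)] = -B
    return 2 * np.linalg.norm(M, 2) ** 2          # lambda_max(2 M^T M)

L_shoot = [shooting_smoothness(T) for T in (5, 10, 20)]
L_lift = [lifted_smoothness(T) for T in (5, 10, 20)]
```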

Figure 1 Difficulty of the planning problem . Subfigure (a) shows the distance to the goal in L 2 norm throughout a successful trajectory. This illustrates the difficulty of planning optimization away from a minimizer: successful trajectories often have to first move away from the goal to successfully plan towards it later, resulting in greedy strategies failing. Subfigures (b)-(c) depict the loss landscape at convergence of standard rollout-based planners vs. our planner. The example given is in the Push-T environment at horizon length 50. The axes plotted over are with respect to two random, orthogonal, unit-norm directions in the full action space R 50 × 2 . Our planner loss is taken as in Eq. (10), and for GD the loss is taken as in Eq. (3).
In this work, we introduce a novel gradient-based planning method for learned world models that decouples temporal dynamics into parallel optimized states (rather than serial rollouts) while remaining robust at long horizons and in high-dimensional state spaces. Rather than planning exclusively through a deep, sequential rollout of the dynamics model, our approach optimizes over lifted intermediate states that are treated as independent optimization variables. Our approach makes two fundamental additions that help solve issues for gradient-based planning in higher dimensions: gradient sensitivity and local minima.
Firstly, a fundamental difficulty arises in the setting of vision-based world models. In high-dimensional learned state spaces, gradients with respect to state inputs can be brittle or adversarial, allowing the optimizer to exploit sensitive Jacobian structure rather than discovering physically meaningful transitions. To mitigate this issue, our planner deliberately stops gradients through the state inputs of the world model, while retaining gradients with respect to actions, which we find behave more reasonably. This alone would promote trajectories near the starting state, but with a dense one-step goal loss over the full trajectory, converged trajectories for these noisy iterations tend towards the goal.
Secondly, to address the remaining non-convexity of the lifted state approach, our planner incorporates Langevin-style stochastic updates on the lifted state variables, explicitly injecting noise during optimization to promote exploration of the state space and facilitate escape from unfavorable basins. This stochastic relaxation allows the planner to search over diverse intermediate trajectories while still favoring solutions that approximately satisfy the learned dynamics. Finally, we intermittently apply a small GD step to fine-tune stochastically optimized trajectories towards fully-optimized paths.
Together, these components yield a practical gradient-based planner for learned visual dynamics that remains stable at long horizons while avoiding the failure modes commonly encountered when backpropagating through deep world-model rollouts. We call our planner GRASP (Gradient RelAxed Stochastic Planner) to emphasize its primary components: using gradient information, relaxing the dynamics constraints, and stochastic optimization for exploration. In various settings, we achieve up to +10% success rate at less than half the compute time cost. We also provide a theoretical model for our planner to further illustrate its role. We demonstrate our planner on visual world models trained on problems in D4RL (Fu et al., 2020) and the DeepMind control suite (Tassa et al., 2018).
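As a schematic illustration of how the three components fit together, the following heavily simplified sketch runs the loop on linear toy dynamics (the real planner operates on a learned visual world model; all step sizes, schedules, and constants here are illustrative assumptions):

```python
import numpy as np

# Schematic sketch of the GRASP loop on linear toy dynamics:
# (i) lifted virtual states, updated without state-input gradients;
# (ii) Langevin-style decaying noise on the states for exploration;
# (iii) periodic full-gradient sync steps on the actions through the rollout.
rng = np.random.default_rng(0)
n, T = 2, 10
A_dyn, B_dyn = 0.95 * np.eye(n), np.eye(n)
F = lambda s, a: A_dyn @ s + B_dyn @ a
s0, g = np.zeros(n), np.ones(n)

# init: states noised around the s0 -> g interpolation, actions at zero
S = np.array([t / T * g + (1 - t / T) * s0 for t in range(T + 1)])
S += 0.05 * rng.standard_normal(S.shape)
a = np.zeros((T, n))

eta_s, eta_a, noise0, iters = 0.2, 0.2, 0.05, 300
for it in range(iters):
    noise_t = noise0 * (1 - it / iters)           # decaying noise schedule
    for t in range(T):
        pred = F(S[t], a[t])                      # states treated as constants
        # action-only gradient of the residual ||F(s_t, a_t) - s_{t+1}||^2
        a[t] -= eta_a * 2.0 * B_dyn.T @ (pred - S[t + 1])
        # state update: pull s_{t+1} toward the one-step prediction and
        # (dense goal shaping) toward the goal, plus exploration noise
        grad_s = 2.0 * (S[t + 1] - pred) + 0.2 * (S[t + 1] - g)
        S[t + 1] += -eta_s * grad_s + noise_t * rng.standard_normal(n)
    if it % 100 == 99:                            # periodic full-gradient sync
        for _ in range(20):
            states = [s0]
            for t in range(T):
                states.append(F(states[-1], a[t]))
            lam = 2.0 * (states[-1] - g)          # adjoint at the final state
            for t in reversed(range(T)):          # backprop through time
                a[t] -= 0.02 * B_dyn.T @ lam
                lam = A_dyn.T @ lam

# evaluate the final action plan with a clean sequential rollout
s = s0
for t in range(T):
    s = F(s, a[t])
```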
Gaussian noise regularization
The regularity from noisy gradient descent (or Langevin-based optimization) primarily stems from the smoothing of the Gaussian convolution:
Theorem 4 (Gaussian smoothing contracts gradients and yields scale control) . Let d ≥ 1 , σ > 0 , and let
$$
\phi_\sigma(\epsilon) = \frac{1}{(2 \pi \sigma^2)^{d/2}} \exp\!\left( - \frac{\| \epsilon \|_2^2}{2 \sigma^2} \right)
$$
be the density of N (0 , σ 2 I d ) . For L ∈ C 1 ( R d ) define
$$
L_\sigma(s) := (L * \phi_\sigma)(s) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I_d)} \left[ L(s + \epsilon) \right].
$$
The following statements hold for the resulting convolution:
- (Gradient contraction.) For any 1 ≤ p ≤ ∞, if ∇_s L ∈ L^p(R^d), then
$$
\| \nabla_s L_\sigma \|_{L^p} \le \| \nabla_s L \|_{L^p}.
$$
In particular, if L is Lipschitz, then Lip( L σ ) = ∥∇ s L σ ∥ L ∞ ≤ ∥∇ s L ∥ L ∞ = Lip( L ) .
- (Explicit regularity control by variance.) For any 1 ≤ p ≤ ∞ , if L ∈ L p ( R d ) , then
$$
\| \nabla_s L_\sigma \|_{L^p} \le \frac{\mathbb{E} \| Z \|_2}{\sigma} \, \| L \|_{L^p} \le \frac{\sqrt{d}}{\sigma} \, \| L \|_{L^p}, \qquad Z \sim \mathcal{N}(0, I_d).
$$
Proof. An important property of convolution is that the gradient may be placed on either factor:
$$
\nabla_s L_\sigma = \nabla_s (L * \phi_\sigma) = (\nabla_s L) * \phi_\sigma = L * \nabla \phi_\sigma,
$$
where ∇ ϕ σ ( ϵ ) = -( ϵ /σ 2 ) ϕ σ ( ϵ ) .
For part 1, apply Young's convolution inequality with the fact that ∥ ϕ σ ∥ L 1 = 1 . For any g ∈ L p ,
$$
\| g * \phi_\sigma \|_{L^p} \le \| g \|_{L^p} \, \| \phi_\sigma \|_{L^1} = \| g \|_{L^p}.
$$
Taking g = ∇_s L and using the commutation identity above gives
$$
\| \nabla_s L_\sigma \|_{L^p} = \| (\nabla_s L) * \phi_\sigma \|_{L^p} \le \| \nabla_s L \|_{L^p}.
$$
When p = ∞ , ∥∇ s L ∥ L ∞ is the Lipschitz constant of L , so Lip( L σ ) ≤ Lip( L ) .
For part 2, use the identity ∇_s L_σ = L ∗ ∇φ_σ together with Young's inequality:
$$
\| \nabla_s L_\sigma \|_{L^p} = \| L * \nabla \phi_\sigma \|_{L^p} \le \| L \|_{L^p} \, \| \nabla \phi_\sigma \|_{L^1}.
$$
It remains to compute ∥∇φ_σ∥_{L^1}. Since ∥∇φ_σ(ε)∥_2 = ∥ε∥_2 φ_σ(ε)/σ² and X ∼ N(0, σ² I_d) has density φ_σ, we get
$$
\| \nabla \phi_\sigma \|_{L^1} = \int_{\mathbb{R}^d} \frac{\| \epsilon \|_2}{\sigma^2} \, \phi_\sigma(\epsilon) \, d\epsilon = \frac{\mathbb{E} \| X \|_2}{\sigma^2} = \frac{\mathbb{E} \| Z \|_2}{\sigma} \le \frac{\sqrt{d}}{\sigma},
$$
where X = σZ with Z ∼ N (0 , I d ) . Substituting this into the previous display yields the claimed bound. The p = ∞ statement is the corresponding Lipschitz estimate.
We then get regularity in expectation by noting that, after adding noise to a gradient step, this noise is then fed as input into the next step.
$$
$$
Assume L ∈ C 1 ( R d ) and that E ξ ∥∇ L ( s + ξ ) ∥ < ∞ (so differentiation may be interchanged with expectation). Then
$$
\mathbb{E}_\xi \left[ \nabla L(s + \xi) \right] = \nabla L_\sigma(s).
$$
Moreover, if L ∈ L ∞ ( R d ) , then by Theorem 4 (Part 2),
$$
\| \nabla L_\sigma \|_{L^\infty} \le \frac{\sqrt{d}}{\sigma} \, \| L \|_{L^\infty}.
$$
This then provides motivation for the decaying noise schedule: as an annealing process from a smoother, less accurate gradient structure to a rigid but more accurate gradient structure. For more detailed theory on the regularity of noisy gradient descent, see e.g. Chaudhari et al. (2019) and the references therein.
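The annealing intuition can be illustrated on a one-dimensional double well (the loss, schedule, and constants below are all illustrative): plain gradient descent from a point in the shallow basin stays trapped, while noisy gradient steps with a decaying noise scale, which on average follow the smoothed L_σ, can reach the deeper basin.

```python
import numpy as np

# Toy illustration of the annealing view above: a noisy gradient step
# evaluates the gradient at a Gaussian-perturbed point, so on average it
# follows the smoothed surrogate L_sigma; decaying the noise anneals from
# the smooth surrogate back to the true loss. The double well is illustrative.
rng = np.random.default_rng(1)

# gradient of L(s) = s^4 - 2 s^2 + 0.5 s: global min near s = -1.06,
# shallower local min near s = +0.93
L_grad = lambda s: 4 * s**3 - 4 * s + 0.5

def noisy_gd(s, sigma0, steps=4000, lr=1e-3):
    for k in range(steps):
        sigma = sigma0 * (1 - k / steps)          # decaying noise schedule
        s -= lr * L_grad(s + sigma * rng.standard_normal())
    return s

s_plain = noisy_gd(0.5, sigma0=0.0)     # plain GD: trapped in the shallow basin
s_annealed = noisy_gd(0.5, sigma0=1.0)  # annealed noise: reaches the deep basin
```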
Connection to Boltzmann sampling, state-only noising
The continuous-time Langevin dynamics corresponding to Eq. (5), if we were to also noise the actions a t similarly, has a stationary distribution that concentrates on low-energy regions of L dyn ( s , a ) ; in particular, under mild regularity conditions it admits the Gibbs (Boltzmann) density
$$ p(\mathbf{s}, \mathbf{a}) \ \propto\ \exp\big(-\beta\,\mathcal{L}_{\mathrm{dyn}}(\mathbf{s}, \mathbf{a})\big), $$
for an inverse temperature β > 0 determined by the relative scaling between the drift and diffusion terms.¹ However, since we only noise the states, the converged distribution is not Boltzmann. Written loosely, it collapses in the action variables onto the solutions of the local dynamics problems:
$$ p(\mathbf{s}, \mathbf{a}) \ \propto\ \exp\big(-\beta\,\mathcal{L}_{\mathrm{dyn}}(\mathbf{s}, \mathbf{a}^\ast(\mathbf{s}))\big)\,\delta\big(\mathbf{a}-\mathbf{a}^\ast(\mathbf{s})\big), $$
where a∗(s) = arg min_a L_dyn(s, a).
State-gradient-free Dynamics
We now analyze a variant of the optimization method where we 'cut' the gradient flow through the state evolution in the dynamics. This is akin to removing the adjoint (backward) pass through time, transforming the global trajectory optimization into a sequence of locally regularized problems.
First, we prove that there is no loss function that (i) has minimizers corresponding exactly to dynamically feasible trajectories while (ii) having no dependency on the state gradients through the world model F_θ(s, a). We formalize this in the following theorem:
Theorem 5 (Nonexistence of exact dynamics-enforcing losses with Jacobian-free state gradients) . Let S ⊆ R n and A ⊆ R m each be open and connected sets, and fix s 0 ∈ S and g ∈ S . Consider horizon T = 2 with decision variables ( s 1 , a 0 , a 1 ) ∈ S × A × A and boundary s 2 = g fixed. Let L F : S × A × A → R be decomposable into the following form:
$$ \begin{aligned} L_F(s_1,a_0,a_1) &= \Phi(s_1,a_0,a_1,y_0,y_1), \\ y_0 &= F(s_0,a_0), \\ y_1 &= F(s_1,a_1), \end{aligned} $$
where Φ : S × A × A × S × S → R is C 1 . Let
$$ \mathcal{M}(F) \;=\; \{(s_1,a_0,a_1)\ :\ s_1=F(s_0,a_0)\ \text{and}\ g=F(s_1,a_1)\} $$
denote the set of trajectories that satisfy the dynamics exactly at both steps (with the fixed boundary conditions). There does not exist such a Φ for which the following two properties hold simultaneously for every C¹ model F:
- (i) Minimizers correspond to feasible dynamics: arg min_{(s_1, a_0, a_1)} L_F(s_1, a_0, a_1) = M(F).
- (ii) The loss is insensitive to the dynamics' state gradient: for every C¹ model G : S × A → S, if F(s_0, a_0) = G(s_0, a_0) and F(s_1, a_1) = G(s_1, a_1), then ∇_{s_1} L_F(s_1, a_0, a_1) = ∇_{s_1} L_G(s_1, a_0, a_1).
Proof. We argue by contradiction.
Fix any point ( s 1 , a 0 , a 1 ) and write y 0 = F ( s 0 , a 0 ) and y 1 = F ( s 1 , a 1 ) . By the chain rule,
$$ \nabla_{s_1}L_F(s_1,a_0,a_1) \;=\; \frac{\partial \Phi}{\partial s_1}(s_1,a_0,a_1,y_0,y_1) \;+\; \big(\nabla_s F(s_1,a_1)\big)^{\top} \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1), $$
since y_0 = F(s_0, a_0) does not depend on s_1.
We claim that ∂Φ/∂y_1 must vanish identically. To see this, fix arbitrary arguments (s_1, a_0, a_1, y_0, y_1) with (s_0, a_0) ≠ (s_1, a_1) in the domain, and construct two C¹ models F and G such that F(s_0, a_0) = G(s_0, a_0) = y_0 and F(s_1, a_1) = G(s_1, a_1) = y_1, but whose state Jacobians at (s_1, a_1) are prescribed arbitrarily and differ:
$$ \nabla_s F(s_1,a_1)=J_F, \qquad \nabla_s G(s_1,a_1)=J_G. $$
¹Equivalently, one can view σ_state in Eq. (5) as setting an effective temperature: larger σ_state yields broader exploration, while smaller σ_state concentrates around local minima.
To construct, choose a small ball B around ( s 1 , a 1 ) contained in S × A , construct a smooth bump function ψ that equals 1 on a smaller concentric ball and 0 outside B (possible since S , A open), and define
$$ F(s,a) \;:=\; \big(1-\psi(s,a)\big)\,H(s,a) \;+\; \psi(s,a)\,\big(y_1 + J_F\,(s-s_1)\big) $$
for an arbitrary C 1 base map H . Then F ( s 1 , a 1 ) = y 1 and ∇ s F ( s 1 , a 1 ) = J F . Defining G analogously with J G gives the desired pair; values at ( s 0 , a 0 ) can be kept fixed by choosing disjoint supports or applying the same local surgery at ( s 0 , a 0 ) .
By Jacobian-invariance, ∇ s 1 L F = ∇ s 1 L G at ( s 1 , a 0 , a 1 ) . Subtracting the two chain-rule expressions cancels ∂ Φ /∂s 1 and yields
$$ \big(J_F - J_G\big)^{\top}\,\frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1) \;=\; 0. $$
Since J F -J G can be any matrix in R n × n , it follows that
$$ \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1)=0. $$
Because the arguments were arbitrary, and since S is connected, we conclude ∂ Φ /∂y 1 ≡ 0 , meaning the loss is independent of y 1 . Therefore there exists a C 1 function ˜ Φ such that for every F ,
$$ L_F(s_1,a_0,a_1) \;=\; \tilde\Phi\big(s_1,a_0,a_1,\,F(s_0,a_0)\big). $$
In particular, if two models F and G satisfy F ( s 0 , a ) = G ( s 0 , a ) for all a ∈ A , then L F ≡ L G as functions of ( s 1 , a 0 , a 1 ) , and hence
$$ \arg\min L_F \;=\; \arg\min L_G. $$
We now construct such a pair F, G but with different feasible sets, contradicting the exact-minimizers assumption. Pick two distinct actions u, v ∈ A , two distinct states s A , s B ∈ S , and an action a ⋆ ∈ A . Using bump-function surgery as above, construct C 1 models F and G such that
$$ \begin{aligned} F(s_0,u) &= G(s_0,u) = s_A, \\ F(s_0,v) &= G(s_0,v) = s_B, \end{aligned} $$
but with swapped second-step goal reachability:
$$ \begin{aligned} F(s_A,a^\star) &= g, & F(s_B,a^\star) &\neq g, \\ G(s_A,a^\star) &\neq g, & G(s_B,a^\star) &= g. \end{aligned} $$
Then (s_A, u, a⋆) ∈ M(F) but (s_A, u, a⋆) ∉ M(G), and (s_B, v, a⋆) ∈ M(G) but (s_B, v, a⋆) ∉ M(F), so M(F) ≠ M(G). On the other hand, since F(s_0, ·) = G(s_0, ·) we have arg min L_F = arg min L_G. By assumption (i),
$$ \mathcal{M}(F) \;=\; \arg\min L_F \;=\; \arg\min L_G \;=\; \mathcal{M}(G), $$
contradicting M(F) ≠ M(G). This completes the proof.
We introduce the stop-gradient operator sg ( · ) , where sg ( x ) = x during the forward evaluation, but ∇ sg ( x ) = 0 during the backward pass. The modified objective incorporating the stop-gradient mechanism is:
$$ \mathcal{L}_{sg}(\mathbf{s},\mathbf{a}) \;=\; \sum_{t=0}^{T-1}\big\|s_{t+1}-F_\theta(\mathrm{sg}(s_t),a_t)\big\|_2^2 \;+\; \sum_{t=0}^{T-1}\beta_t\big\|g-F_\theta(\mathrm{sg}(s_t),a_t)\big\|_2^2, $$
where β t ≥ 0 are the goal loss coefficients. Note that the target g is effectively applied as a penalty on the state at each step to guide the local optimization.
Theorem 6 (Linear convergence to a unique fixed point) . Consider the gradient descent iteration
$$ (\mathbf{s}^{k+1},\mathbf{a}^{k+1}) \;=\; (\mathbf{s}^{k},\mathbf{a}^{k}) - \eta\,\nabla \mathcal{L}_{sg}(\mathbf{s}^{k},\mathbf{a}^{k}), $$
where s_0 is fixed and gradients are computed with the stop-gradient convention (i.e., treating sg(s_t) as constant during differentiation). Assume the linear dynamics setting (s_{t+1} = As_t + Ba_t) and that
$$ \beta_t \in [\beta_{\min},\beta_{\max}] \quad \text{for all } t, \qquad \beta_{\min}>0, $$
and that B has full column rank (equivalently, σ min ( B ) > 0 ). Then there exists a stepsize η ∈ (0 , ¯ η ) , where ¯ η depends only on β min , β max , ∥ B ∥ 2 , σ min ( B ) (and not on T ), such that the induced update operator T on z = ( s , a ) has a unique fixed point z ⋆ and the iterates converge linearly:
$$ \|\mathbf{z}^{k}-\mathbf{z}^\star\| \;\le\; Cq^k\,\|\mathbf{z}^{0}-\mathbf{z}^\star\| \qquad \text{for some } q\in(0,1) \text{ and } C > 0. $$
Proof. Write the objective (re-indexing the goal term to match the action index) as
$$ \mathcal{L}_{sg}(\mathbf{s},\mathbf{a}) \;=\; \sum_{t=0}^{T-1}\big\|s_{t+1}-A\,\mathrm{sg}(s_t)-Ba_t\big\|_2^2 \;+\; \sum_{t=0}^{T-1}\beta_t\big\|g-A\,\mathrm{sg}(s_t)-Ba_t\big\|_2^2. $$
Define the residuals
$$ r_t \;:=\; s_{t+1} - A s_t - B a_t, \qquad t=0,\dots,T-1. $$
Under the stop-gradient convention, sg( s t ) is treated as constant during differentiation, so s t does not receive gradient contributions through the A sg( s t ) terms. It follows that the only state-gradient at time t comes from the appearance of s t as the next state in the previous residual, namely
$$ \nabla_{s_t}\mathcal{L}_{sg} \;=\; 2r_{t-1}, \qquad t=1,\dots,T, $$
with the understanding that r_{-1} = 0 since s_0 is fixed. Likewise, the action-gradient at time t is
$$ \nabla_{a_t}\mathcal{L}_{sg} \;=\; -2B^\top\Big(r_t + \beta_t\big(g - A s_t - B a_t\big)\Big), \qquad t=0,\dots,T-1. $$
Therefore, gradient descent with stepsize η > 0 yields the explicit update rules
$$ s_t^{k+1} \;=\; s_t^k - 2\eta\,r_{t-1}^k, \qquad t=1,\dots,T, $$
$$ a_t^{k+1} \;=\; a_t^k + 2\eta\,B^\top\Big(\big(s_{t+1}^k-As_t^k-Ba_t^k\big)+\beta_t\big(g-As_t^k-Ba_t^k\big)\Big), \qquad t=0,\dots,T-1. $$
Stack the variables in the time-ordered vector z := (s_1, a_0, s_2, a_1, …, s_T, a_{T-1}). The updates above define an affine map z^{k+1} = T(z^k) = J z^k + c whose Jacobian J is block lower-triangular with respect to this ordering: indeed, (s_{t+1}^{k+1}, a_t^{k+1}) depends only on (s_t^k, s_{t+1}^k, a_t^k) (and on the fixed constants g and s_0) and is independent of any future variables (s_{t+2}^k, a_{t+1}^k, …). Consequently, the eigenvalues of J are exactly the union of the eigenvalues of its diagonal blocks.
To characterize a diagonal block, fix t ∈ {0, …, T−1} and consider the pair y_t := (s_{t+1}, a_t). Conditioned on s_t (which appears only as a constant inside sg(s_t) for differentiation), the update (s_{t+1}^{k+1}, a_t^{k+1}) is precisely one gradient step on the quadratic function
$$ \phi_t(s_{t+1},a_t) \;:=\; \big\|s_{t+1}-As_t-Ba_t\big\|_2^2 \;+\; \beta_t\big\|g-As_t-Ba_t\big\|_2^2, $$
so the corresponding diagonal block equals I -ηH t where H t = ∇ 2 y t ϕ t is the constant Hessian with respect to ( s t +1 , a t ) :
$$ H_t \;=\; 2\begin{bmatrix} I & -B \\ -B^\top & (1+\beta_t)B^\top B \end{bmatrix}. $$
Assume β t ∈ [ β min , β max ] with β min > 0 and that B has full column rank, so B ⊤ B ≻ 0 . The Schur complement of the I block is
$$ (1+\beta_t)B^\top B - B^\top I^{-1}B \;=\; \beta_t B^\top B \;\succ\; 0, $$
hence H t ≻ 0 for every t , with eigenvalues uniformly bounded away from 0 and ∞ as t varies:
$$ 0 \;<\; \mu I \;\preceq\; H_t \;\preceq\; L I \;<\; \infty, $$
for constants µ, L depending only on β min , β max , ∥ B ∥ 2 , and σ min ( B ) . Choosing any stepsize η such that 0 < η < 2 /L , we obtain for every t that all eigenvalues of I -ηH t lie strictly inside the unit disk, and in particular:
$$ \rho(I-\eta H_t) \;\le\; \max\{|1-\eta\mu|,\ |1-\eta L|\} \;=:\; q_0 \;<\; 1, $$
where q_0 is independent of t and T. Since J is block lower-triangular and its diagonal blocks are exactly (I − ηH_t) (up to a fixed permutation corresponding to the stacking order), we conclude
$$ \rho(J) \;=\; \max_{t}\rho(I-\eta H_t) \;\le\; q_0 \;<\; 1. $$
Fix any q such that q_0 < q < 1. By Gelfand's formula, for any square matrix M and any q > ρ(M), there exists an induced norm ∥·∥† such that ∥M∥† ≤ q. Applying this with M = J yields a norm ∥·∥† satisfying
$$ \|J\|_\dagger \;\le\; q \;<\; 1. $$
In particular, the affine map T is a contraction in ∥·∥†, so by the Banach fixed-point theorem it has a unique fixed point z⋆. Hence, for all k ≥ 0,
$$ \|\mathbf{z}^{k+1}-\mathbf{z}^\star\|_\dagger \;=\; \|J(\mathbf{z}^{k}-\mathbf{z}^\star)\|_\dagger \;\le\; \|J\|_\dagger\,\|\mathbf{z}^{k}-\mathbf{z}^\star\|_\dagger \;\le\; q\,\|\mathbf{z}^{k}-\mathbf{z}^\star\|_\dagger, $$
and therefore
$$ \|\mathbf{z}^{k}-\mathbf{z}^\star\|_\dagger \;\le\; q^k\,\|\mathbf{z}^{0}-\mathbf{z}^\star\|_\dagger. $$
By norm equivalence in finite dimensions, there exists C > 0 such that
$$ \|\mathbf{z}^{k}-\mathbf{z}^\star\|_2 \;\le\; C q^k\,\|\mathbf{z}^{0}-\mathbf{z}^\star\|_2. $$
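As a sanity check of Theorem 6, the following sketch runs the explicit stop-gradient updates from the proof on an illustrative linear system (the matrices, horizon, and stepsize are hand-picked for the example, not taken from the paper). Consistent with the note below on the stopgrad optimization, the iterates converge to the greedy rollout that places every reached state at the goal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 2, 20
A = np.array([[1.1, 0.2], [0.0, 0.9]])   # illustrative (mildly unstable) dynamics
B = np.eye(n)                             # full column rank, as Theorem 6 requires
beta, eta = 1.0, 0.15                     # eta < 2 / lambda_max(H_t) for this B, beta
s0, g = np.zeros(n), np.ones(n)

s = rng.normal(size=(T + 1, n)); s[0] = s0   # s[0] fixed, s[1..T] optimized
a = rng.normal(size=(T, n))

for k in range(3000):
    s_new, a_new = s.copy(), a.copy()
    for t in range(T):
        r_t = s[t + 1] - A @ s[t] - B @ a[t]    # dynamics residual
        s_new[t + 1] -= 2 * eta * r_t            # stop-grad: no gradient flows into s_t
        a_new[t] += 2 * eta * B.T @ (r_t + beta * (g - A @ s[t] - B @ a[t]))
    s, a = s_new, a_new

# At the unique fixed point all residuals vanish and (B being invertible here)
# every reached state equals g: the greedy rollout toward the goal.
print("max |s_t - g| over t >= 1:", np.abs(s[1:] - g).max())
```

Note that the number of iterations needed does not blow up with T, reflecting the horizon-independent contraction factor in the theorem.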
Notes on the stopgrad optimization. The optimization indeed converges to a fixed point, but one can show that these stable points in the linear convex case are merely greedy rollouts toward the goal. Two things make the optimization in our setting nontrivial: the nonconvexity of the world model F_θ, and the stochastic noise on the states s_t. We now present a characterization of the distribution of trajectories that our planner tends toward.
Let F θ : S × A → S be a differentiable world model and define the stop-gradient one-step prediction
$$ \mu_t \;:=\; F_\theta(\bar{s}_t, a_t), \qquad \bar{s}_t = \mathrm{sg}(s_t). $$
Consider the stopgrad lifted objective (cf. Eq. (10))
$$ \mathcal{L}(\mathbf{s},\mathbf{a}) \;=\; \sum_{t=0}^{T-1}\big\|s_{t+1}-\mu_t\big\|_2^2 \;+\; \gamma\sum_{t=0}^{T-1}\big\|\mu_t-g\big\|_2^2, $$
and the (no-sync) optimization updates
$$ s_{t+1}^{k+1} \;=\; s_{t+1}^{k} - 2\eta_s\big(s_{t+1}^k-\mu_t^k\big) + \sigma\,\xi_{t+1}^k, \qquad \xi_{t+1}^k\sim\mathcal{N}(0,I), $$
$$ a_t^{k+1} \;=\; a_t^k - \eta_a\,\nabla_{a_t}\mathcal{L}(\mathbf{s}^k,\mathbf{a}^k). $$
Throughout, assume 0 < η s < 1 (for stability of the state contraction).
Theorem 7 (Gaussian tube around one-step predictions). Fix {s̄_t^k}_{t=0}^{T-1} and {a_t^k}_{t=0}^{T-1} at iteration k, and let μ_t^k = F_θ(s̄_t^k, a_t^k). Then the state update Eq. (93) satisfies the conditional mean recursion
$$ \mathbb{E}\big[s_{t+1}^{k+1}\mid \mu_t^k\big] \;=\; (1-2\eta_s)\,\mathbb{E}\big[s_{t+1}^{k}\mid \mu_t^k\big] \;+\; 2\eta_s\,\mu_t^k. $$
Moreover, if μ_t^k ≡ μ_t is held fixed, then s_{t+1}^k converges in distribution to a Gaussian 'tube' around μ_t:
$$ s_{t+1}^\infty \ \sim\ \mathcal{N}\big(\mu_t,\ \Sigma_{\mathrm{tube}}\big), \qquad \Sigma_{\mathrm{tube}} \;=\; \frac{\sigma^2}{1-(1-2\eta_s)^2}\,I \;=\; \frac{\sigma^2}{4\eta_s(1-\eta_s)}\,I. $$
Analogously, in continuous optimization-time τ , the limiting SDE
$$ ds_{t+1}(\tau) \;=\; -\lambda\big(s_{t+1}(\tau)-\mu_t\big)\,d\tau \;+\; \sigma\,dW_{t+1}(\tau), \qquad \lambda>0, $$
has stationary law N(μ_t, (σ²/2λ) I).
Proof. Rewrite Eq. (93) as an affine Gaussian recursion:
$$ s_{t+1}^{k+1} \;=\; (1-2\eta_s)\,s_{t+1}^{k} \;+\; 2\eta_s\,\mu_t^k \;+\; \sigma\,\xi_{t+1}^k. $$
Taking conditional expectation yields Eq. (95). If µ k t ≡ µ t is fixed, the centered process u k := s k t +1 -µ t satisfies u k +1 = (1 -2 η s ) u k + σξ k , i.e. an AR(1) process with contraction factor | 1 -2 η s | < 1 . Its unique stationary covariance Σ tube solves the discrete Lyapunov equation Σ tube = (1 -2 η s ) 2 Σ tube + σ 2 I , giving Eq. (96). The continuous-time statement is standard, given that µ t is fixed.
Theorem 8 (Goal shaping induces goal-directed drift of tube center). Define μ_t^k = F_θ(s̄_t^k, a_t^k) and let
$$ J_t^k \;:=\; \nabla_{a_t}F_\theta(\bar{s}_t^k,a_t^k)\in\mathbb{R}^{d_s\times d_a}, \qquad P_t^k \;:=\; J_t^k(J_t^k)^\top \;\succeq\; 0. $$
Assume a first-order linearization in the action step:
$$ \mu_t^{k+1} \;\approx\; \mu_t^k \;+\; J_t^k\,\big(a_t^{k+1}-a_t^k\big). $$
Then the action update Eq. (94) induced by Eq. (92) yields the tube-center evolution
$$ \mu_t^{k+1} \;\approx\; \underbrace{\big(I-\alpha\gamma P_t^k\big)\mu_t^k + \alpha\gamma P_t^k\,g}_{\text{goal-directed averaging (drift)}} \;+\; \underbrace{\alpha\,P_t^k\,\varepsilon_{t+1}^k}_{\text{exploration forcing}}, \qquad \alpha := 2\eta_a, $$
where ε_{t+1}^k denotes the tube residual via s_{t+1}^k = μ_t^k + ε_{t+1}^k. In particular, if E[ε_{t+1}^k | μ_t^k] = 0, then
$$ \mathbb{E}\big[\mu_t^{k+1}\mid \mu_t^{k}, s_t^k, a_t^k\big] \;\approx\; \big(I-\alpha\gamma P_t^k\big)\mu_t^k \;+\; \alpha\gamma P_t^k\,g, $$
so in controllable directions the mean prediction μ_t moves toward g as an averaging step. If 0 < αγ λ_max(P_t^k) < 1, this is a contractive averaging step toward g on Range(P_t^k).
Proof. From Eq. (92), only terms at time t depend on a_t, through μ_t^k. Using ∇_{a_t} μ_t^k = J_t^k, the gradient is
$$ \nabla_{a_t}\mathcal{L} \;=\; 2(J_t^k)^\top\big(\mu_t^k-s_{t+1}^k\big) \;+\; 2\gamma\,(J_t^k)^\top\big(\mu_t^k-g\big). $$
Substituting into the linearization Eq. (97) gives
$$ \begin{aligned} \mu_t^{k+1} &\approx \mu_t^k - 2\eta_a\,P_t^k\big(\mu_t^k-s_{t+1}^k\big) - 2\eta_a\gamma\,P_t^k\big(\mu_t^k-g\big) \\ &= \mu_t^k - 2\eta_a\,P_t^k\big(\mu_t^k-(\mu_t^k+\varepsilon_{t+1}^k)\big) - 2\eta_a\gamma\,P_t^k\big(\mu_t^k-g\big) \\ &= \mu_t^k + 2\eta_a\,P_t^k\,\varepsilon_{t+1}^k - 2\eta_a\gamma\,P_t^k\big(\mu_t^k-g\big), \end{aligned} $$
which simplifies to Eq. (98) after substituting α = 2η_a. Taking conditional expectation and using E[ε_{t+1}^k | μ_t^k] = 0 yields Eq. (99).
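When the model is linear in the action, the linearization Eq. (97) is exact, so the drift term of Eq. (98) can be checked directly. The sketch below uses an illustrative model F_θ(s̄, a) = Ba with a hand-picked B (not from the paper), zero tube residual, and a goal g ∈ Range(B); the tube center contracts to g as predicted:

```python
import numpy as np

# Illustrative model, linear in the action: F(s_bar, a) = B a, so J_t = B exactly
# and the linearization Eq. (97) holds with equality. Constants are hand-picked.
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # J_t
g = np.array([1.0, 2.0, 3.0])                         # goal; g = B @ [1, 2], reachable
gamma, eta_a = 1.0, 0.05
alpha = 2 * eta_a                                     # as in Eq. (98)
P = B @ B.T                                           # P_t = J_t J_t^T, eigenvalues {0, 1, 3}

a = np.zeros(2)
mu = B @ a
for k in range(2000):
    # tube residual eps set to 0: only the goal-shaping term of the gradient remains
    a = a - eta_a * (2 * gamma * B.T @ (mu - g))
    mu = B @ a   # exactly mu <- (I - alpha*gamma*P) mu + alpha*gamma*P g
print("||mu - g|| after goal-directed drift:", np.linalg.norm(mu - g))
```

Here αγλ_max(P) = 0.3 < 1, so the averaging step is contractive on Range(P); a goal component outside Range(B) would persist, matching the "controllable directions" caveat.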
Unlike the stopgrad lifted-state dynamics (Theorem 7), the rollout distribution is not contracted toward the current rollout, but instead noise accumulates throughout nonlinear iterations of the world model. The stochastic rollout distribution need not concentrate in a local tube around the current deterministic rollout trajectory; it can drift and spread away as the horizon T grows.
Theorem 9 (Mean evolution and non-tube behavior of noisy rollouts) . Consider a rollout-based stochastic trajectory generated in model-time:
$$ s_{t+1} \;=\; F_\theta(s_t,a_t) \;+\; \sigma_{\mathrm{env}}\,\zeta_t, \qquad \zeta_t\sim\mathcal{N}(0,I), \qquad s_0 \ \text{fixed}, $$
and a dense goal objective along the rollout (e.g. ∑_{t=0}^{T-1} ∥s_{t+1} − g∥²). Let m_t := E[s_t] and Σ_t := Cov(s_t). The rollout mean obeys the exact identity
$$ m_{t+1} \;=\; \mathbb{E}\big[F_\theta(s_t,a_t)\big]. $$
In particular, if F θ is affine in s (i.e. F θ ( s, a ) = As + Ba + c ), then
$$ m_{t+1} \;=\; F_\theta(m_t,a_t). $$
For general nonlinear F θ , a second-order moment expansion yields
$$ m_{t+1} \;\approx\; F_\theta(m_t,a_t) \;+\; \tfrac12\,\big(\mathrm{Hess}_s F_\theta(m_t,a_t)\big):\Sigma_t, $$
so the mean generally does not follow the deterministic rollout F_θ(m_t, a_t). A first-order linearization gives the approximate covariance propagation
$$ \Sigma_{t+1} \;\approx\; G_t\,\Sigma_t\,G_t^\top \;+\; \sigma_{\mathrm{env}}^2 I, \qquad G_t := \nabla_s F_\theta(m_t,a_t). $$
Proof. Taking conditional expectation of Eq. (101) gives E[s_{t+1} | s_t] = F_θ(s_t, a_t), and then total expectation implies Eq. (102). For affine F_θ, expectation commutes with F_θ, yielding Eq. (103). For nonlinear F_θ, expand F_θ(s_t, a_t) around m_t by Taylor's theorem; the first-order term vanishes in expectation and the second-order term produces Eq. (104). The covariance recursion Eq. (105) follows by linearizing F_θ(s_t, a_t) ≈ F_θ(m_t, a_t) + G_t(s_t − m_t) and computing Cov(·), adding the independent noise variance σ_env² I. The final 'non-tube' claim follows because there is no optimization-time contraction that repeatedly pulls s_{t+1} back toward a moving center (as in Eq. (95)); instead, the forward propagation Eq. (105) typically increases spread, and the mean can deviate from the deterministic path by Eq. (104).
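A small Monte Carlo check of this non-tube behavior, using a toy scalar model F_θ(s, a) = tanh(s) + a with illustrative constants (not from the paper): because tanh has curvature along the rollout, the mean of noisy rollouts drifts away from the deterministic rollout, the gap predicted by the second-order term in Eq. (104):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_env, T, N = 0.5, 10, 100_000   # illustrative constants
a_seq = 0.3 * np.ones(T)             # fixed action sequence

def F(s, a):
    # toy scalar world model, nonlinear in s: curvature makes E[F(s,a)] != F(E[s],a)
    return np.tanh(s) + a

# deterministic rollout
s_det = 0.0
for t in range(T):
    s_det = F(s_det, a_seq[t])

# Monte Carlo mean of noisy rollouts, Eq. (101)
s = np.zeros(N)
for t in range(T):
    s = F(s, a_seq[t]) + sigma_env * rng.normal(size=N)

print(f"deterministic rollout s_T = {s_det:.3f}")
print(f"mean of noisy rollouts    = {s.mean():.3f}")
```

The noisy-rollout mean sits visibly below the deterministic value here (tanh is concave where the trajectory concentrates), whereas for an affine model the two would coincide, as in Eq. (103).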
Theorem 7 shows that noisy lifted state updates form a noisy 'tube' around the one-step predictions μ_t = F_θ(s̄_t, a_t), keeping exploration local and dynamically consistent in optimization-time. Theorem 8 then shows that dense one-step goal shaping moves the tube center toward the goal by a preconditioned averaging step, while the dynamics residual contributes approximately zero-mean stochastic forcing that enables exploration without horizon-coupled backpropagation. In contrast, Theorem 9 shows that noisy rollouts evolve by forward propagation of randomness: the mean follows m_{t+1} = E[F_θ(s_t, a_t)] (not generally the deterministic rollout), and the distribution can drift and spread rather than concentrate in a tube around the current plan.
Ablations
To rigorously evaluate the contribution of each individual component within our proposed framework, we conducted an ablation study on the Push-T environment with a horizon of 40 steps. In this experiment, we systematically removed specific modules or mechanisms from the full method while keeping all other hyperparameters constant. This analysis isolates the impact of the key components: the noised-states optimization, the GD sync steps, and whether our state-gradient-free approach is needed.
The results, summarized in Table 2, highlight the critical role of each design choice in achieving robust performance. We observe that the Full Method achieves the highest success rate, validating the synergy between the proposed components.
The stopgrad is an important component; without it, the state optimization feels 'sticky' and is difficult to tune. Noise also remains an important component, allowing the planner to explore around local minima.
We also ablate different, simpler options of a noise and gradient-based planner, building off a normal rollout-based GD planner. We ablate with two kinds of noise:
Table 5 Stochastic GD Baselines Ablation Study, showing alternatives from our method for a stochastic gradient-based planner. For rollouts of the world model F_θ(s, a), values of σ_a correspond to the variance of noise added to actions throughout optimization, and values of σ_s correspond to the variance of noise added to states directly through the rollout; that is, for rollouts of the world model, s_{t+1} = F_θ(s_t, a_t) + ξ, ξ ∼ N(0, σ_s² I). None beat our method's performance, shown in Table 2. Reported time is the average time over all experiments, with reported intervals as 95% CIs.
$$ a_t \ \leftarrow\ a_t + \xi, \qquad \xi \sim \mathcal{N}(0, \sigma_a^2 I), $$
$$ s_{t+1} \;=\; F_\theta(s_t, a_t) + z, \qquad z \sim \mathcal{N}(0, \sigma_s^2 I). $$
These ablations are provided in Table 5.
$$ \mc F_\theta^T(\mb a, s_0) = g, $$
$$ \label{eq:forward} \mc{F}\theta^T(\mathbf{a},s_0) \coloneqq \underbrace{\forward(\dots \forward(\forward(}{\text{$T$ times}}s_0, a_0), a_1), \dots a_{T-1}). $$ \tag{eq:forward}
$$ \begin{split} \min_{\mb{s}, \mathbf{a}}\ \mathcal{L}{\mathrm{dyn}}(\mathbf{s}, \mathbf{a}) \coloneqq \sum{t=0}^{T-1}\big| F_\theta(s_t,a_t)-s_{t+1}\big|_2^2, \ \text{with } s_0 \text{ fixed},\ s_T=g. \end{split} \label{eq:loss_dynamics} $$ \tag{eq:loss_dynamics}
$$ \argmin_s |y - F_\theta(s, a)|_2^2, $$
$$ \label{eq:sync_rollout_loss} \min_{\mathbf{a}} |s_T(\mathbf{a}, s_0) - g|_2^2, $$ \tag{eq:sync_rollout_loss}
$$ \label{eq:app_proof_lindynamics} s_{t+1} = As_t + Ba_t, $$ \tag{eq:app_proof_lindynamics}
$$ \mathcal{F}(s_0, \mathbf{a}) = A^T s_0 + \sum_{t=0}^{T-1} A^{T-1-t} B a_t. $$
$$ \min_{\mathbf{a} \in \mathbb{R}^{mT}} J_S(\mathbf{a}) = |\mathcal{F}(s_{init}, \mathbf{a}) - g|_2^2 \label{eq:shooting} $$ \tag{eq:shooting}
$$ L_S = \lambda_{\max}(H_S) \geq 2\mu^2 |\lambda|^{2(T-1)}. $$
$$ |\mathcal{C}_T|2 = \sup{\mathbf{a} \neq 0} \frac{|\mathcal{C}_T \mathbf{a}|_2}{|\mathbf{a}|_2}. $$
$$ \mathcal{C}T \mathbf{a}{test} = A^{T-1}Bw, $$
$$ |M\mathbf{z}|2^2 ;=; \sum{t=0}^{T-1}|r_t|2^2 ;\le; 3\sum{t=0}^{T-1}\Bigl(|s_{t+1}|_2^2 + |A|_2^2|s_t|_2^2 + |B|_2^2|a_t|_2^2\Bigr). $$
$$ \phi_\sigma(\boldsymbol{\epsilon})=(2\pi\sigma^2)^{-d/2}\exp!\big(-|\boldsymbol{\epsilon}|^2/(2\sigma^2)\big) $$
$$ L_\sigma(\mathbf s)=(\phi_\sigma*L)(\mathbf s)=\int_{\mathbb R^d}\phi_\sigma(\boldsymbol{\epsilon}),L(\mathbf s-\boldsymbol{\epsilon}),d\boldsymbol{\epsilon}. $$
$$ |\nabla_{s} L_\sigma|{L^p}\ \le\ |\nabla{s} L|_{L^p}. $$
$$ |\nabla_{a} L_\sigma|{L^p}\ \le\ \frac{c_d}{\sigma},|L|{L^p}, \qquad c_d:=\mathbb E|Z|=\sqrt{2},\frac{\Gamma\big(\frac{d+1}{2}\big)}{\Gamma\big(\frac{d}{2}\big)},\ \ Z\sim\mathcal N(0,\mathbf I_d). $$
$$ \mathbb E_{\boldsymbol{\xi}}\big[\nabla L(\mathbf s+\boldsymbol{\xi})\big] = \nabla_{!\mathbf s} L_\sigma(\mathbf s). $$
$$ \label{eq:boltzmann_states} p(\mathbf{s}, \mathbf{a}) \ \propto\ \exp \big(-\beta,\mathcal{L}_{\mathrm{dyn}}(\mathbf{s}, \mathbf{a})\big), $$ \tag{eq:boltzmann_states}
$$ \mathcal M(F)={(s_1,a_0,a_1): s_1=F(s_0,a_0)\ \text{and}\ g=F(s_1,a_1)} $$
$$ \nabla_{s_1}L_F(s_1,a_0,a_1)
\frac{\partial \Phi}{\partial s_1}(s_1,a_0,a_1,y_0,y_1) + \big(\nabla_s F(s_1,a_1)\big)^\top \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1). $$
$$ \nabla_sF(s_1,a_1)=J_F,\qquad \nabla_sG(s_1,a_1)=J_G. $$
$$ \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1)=0. $$
$$ \arg\min L_F=\arg\min L_G. $$
$$ F(s_0,a)=G(s_0,a)\quad \text{ for all } a\in\mathcal A, $$
$$ \mathcal{L}_{sg}(\mathbf{s},\mathbf{a})
\sum_{t=0}^{T-1}|s_{t+1}-A,\mathrm{sg}(s_t)-Ba_t|2^2 ;+; \sum{t=0}^{T-1}\beta_t|g-A,\mathrm{sg}(s_t)-Ba_t|_2^2 . $$
$$ \nabla_{s_t}\mathcal{L}{sg} = 2r{t-1}, \qquad t=1,\dots,T, $$
$$ a_t^{k+1}
a_t^k + 2\eta B^\top\Big(\big(s_{t+1}^k-As_t^k-Ba_t^k\big)+\beta_t\big(g-As_t^k-Ba_t^k\big)\Big), \qquad t=0,\dots,T-1. $$
$$ H_t
2\begin{bmatrix} I & -B\[0.2em] -B^\top & (1+\beta_t)B^\top B \end{bmatrix}. $$
$$ (1+\beta_t)B^\top B - B^\top I^{-1}B = \beta_t B^\top B \succ 0, $$
$$ 0<\mu I \preceq H_t \preceq L I < \infty, $$
$$ \rho(I-\eta H_t)\le \max{|1-\eta\mu|,\ |1-\eta L|}=:q<1, $$
$$ |J|_\dagger \le q < 1. $$
$$ |\mathbf z^{k+1}-\mathbf z^\star|\dagger = |J(\mathbf z^{k}-\mathbf z^\star)|\dagger \le |J|\dagger,|\mathbf z^{k}-\mathbf z^\star|\dagger \le q,|\mathbf z^{k}-\mathbf z^\star|_\dagger, $$
$$ \label{eq:ou_mean_recursion} \mathbb{E}!\left[s_{t+1}^{k+1}\mid \mu_t^k\right]
(1-2\eta_s),\mathbb{E}!\left[s_{t+1}^{k}\mid \mu_t^k\right]
- 2\eta_s,\mu_t^k. $$ \tag{eq:ou_mean_recursion}
$$ \label{eq:ou_stationary_tube} s_{t+1}^\infty \ \sim\ \mathcal N!\bigl(\mu_t,\ \Sigma_{\mathrm{tube}}\bigr), \qquad \Sigma_{\mathrm{tube}}
\frac{\sigma^2}{1-(1-2\eta_s)^2},I
\frac{\sigma^2}{4\eta_s(1-\eta_s)},I. $$ \tag{eq:ou_stationary_tube}
$$ \label{eq:mu_linearization} \mu_t^{k+1} \approx \mu_t^k + J_t^k,(a_t^{k+1}-a_t^k). $$ \tag{eq:mu_linearization}
$$ \label{eq:tube_center_update} \mu_t^{k+1} ;\approx; \underbrace{\bigl(I-\alpha\gamma P_t^k\bigr)\mu_t^k + \alpha\gamma P_t^k g}{\textbf{goal-directed averaging (drift)}} ;+; \underbrace{\alpha,P_t^k,\varepsilon{t+1}^k}_{\textbf{exploration forcing}}, \qquad \alpha \coloneqq 2\eta_a, $$ \tag{eq:tube_center_update}
$$ \label{eq:action_grad_decomp} \nabla_{a_t}\mathcal L
2(J_t)^\top(\mu_t-s_{t+1}) + 2\gamma(J_t)^\top(\mu_t-g). $$ \tag{eq:action_grad_decomp}
$$ \label{eq:rollout_noise} s_{t+1} = F_\theta(s_t,a_t) + \sigma_{\mathrm{env}}\zeta_t, \qquad \zeta_t\sim\mathcal N(0,I), \qquad s_0 \ \text{fixed}, $$ \tag{eq:rollout_noise}
$$ \label{eq:rollout_mean_second_order} m_{t+1} \approx F_\theta(m_t,a_t) + \frac12,\big(\mathrm{Hess}s F\theta(m_t,a_t)\big):\Sigma_t, $$ \tag{eq:rollout_mean_second_order}
$$ \label{eq:langevin_euler} s_t^{k+1} &\leftarrow s_t^{k} - \eta_s \nabla_{s_t}\mathcal{L}{\mathrm{dyn}}(\mathbf{s}^{k}, \mathbf{a}^{k}) + \sigma{\text{state}}\xi_t^{k},\ a_t^{k+1} &\leftarrow a_t^{k} - \eta_a \nabla_{a_t}\mathcal{L}_{\mathrm{dyn}}(\mathbf{s}^{k}, \mathbf{a}^{k}), $$ \tag{eq:langevin_euler}
$$ |\mathcal{C}T \mathbf{a}{test}|_2 &\geq |\langle v, A^{T-1}Bw \rangle|, \ &= |\langle (A^\top)^{T-1}v, Bw \rangle|, \ &= |\langle \lambda^{T-1}v, Bw \rangle|, \ &= |\lambda|^{T-1} |\langle v, Bw \rangle|, \ &= \mu |\lambda|^{T-1}. $$
$$ |M\mathbf{z}|_2^2 &\le 3\Bigl((1+|A|2^2)\sum{t=1}^{T-1}|s_t|_2^2 ;+; |B|2^2\sum{t=0}^{T-1}|a_t|_2^2\Bigr),\ &\le 3\bigl(1+|A|_2^2+|B|_2^2\bigr),|\mathbf{z}|_2^2. $$
$$ \nabla_{s} L_\sigma &=\nabla_{s}(\phi_\sigma*L), \ &=(\nabla \phi_\sigma)L, \ &=\phi_\sigma(\nabla_{s}L), $$
$$ |\nabla\phi_\sigma|{L^1} &=\int{\mathbb R^d}\frac{|\boldsymbol{\epsilon}|2}{\sigma^2},\phi\sigma(\boldsymbol{\epsilon}),d\boldsymbol{\epsilon} \ &=\frac{1}{\sigma^2},\mathbb E|X|_2 \ &=\frac{1}{\sigma},\mathbb E|Z|_2 \ &=\frac{c_d}{\sigma}, $$
$$ F(s_0,u)&=G(s_0,u)=s_A,\ F(s_0,v)&=G(s_0,v)=s_B, $$
$$ F(s_A,a^\star) &= g, & F(s_B,a^\star) &\neq g,\ G(s_A,a^\star) &\neq g, & G(s_B,a^\star) &= g. $$
$$ \label{eq:sg_state_update_theory} s_{t+1}^{k+1} &= s_{t+1}^{k} - 2\eta_s\bigl(s_{t+1}^k-\mu_t^k\bigr) + \sigma,\xi_{t+1}^k, \qquad \xi_{t+1}^k\sim\mathcal N(0,I),\ \label{eq:sg_action_update_theory} a_t^{k+1} &= a_t^k - \eta_a \nabla_{a_t}\mathcal{L}(\mathbf{s}^k,\mathbf{a}^k). $$ \tag{eq:sg_state_update_theory}
$$ \mu_t^{k+1} &\approx \mu_t^k -2\eta_a J_t^k(J_t^k)^\top(\mu_t^k-s_{t+1}^k) -2\eta_a\gamma J_t^k(J_t^k)^\top(\mu_t^k-g)\ &= \mu_t^k -2\eta_a P_t^k(\mu_t^k-(\mu_t^k+\varepsilon_{t+1}^k)) -2\eta_a\gamma P_t^k(\mu_t^k-g), $$
$$ s_{t+1} &= F(s_t, a_t) + z,\ z &\sim \mc{N}(0, \sigma_s I). $$
$$ \beta_t \in [\beta_{\min},\beta_{\max}] \quad \text{for all } t,\qquad \beta_{\min}>0, $$
$$ |\mathbf z^{k}-\mathbf z^\star| \le Cq^k|\mathbf z^{0}-\mathbf z^\star|,\qquad \text{for some } q\in(0,1) \text{ and } C > 0. $$
$$ 0.2em] -B^\top & (1+\beta_t)B^\top B \end{bmatrix}. \end{equation} Assume $\beta_t\in[\beta_{\min},\beta_{\max}]$ with $\beta_{\min}>0$ and that $B$ has full column rank, so $B^\top B\succ 0$. The Schur complement of the $I$ block is \begin{equation} (1+\beta_t)B^\top B - B^\top I^{-1}B = \beta_t B^\top B \succ 0, \end{equation} hence $H_t\succ 0$ for every $t$, with eigenvalues uniformly bounded away from $0$ and $\infty$ as $t$ varies: \begin{equation} 0<\mu I \preceq H_t \preceq L I < \infty, \end{equation} for constants $\mu,L$ depending only on $\beta_{\min},\beta_{\max},|B|2,$ and $\sigma{\min}(B)$. Choosing any stepsize $\eta$ such that $0<\eta<2/L$, we obtain for every $t$ that all eigenvalues of $I-\eta H_t$ lie strictly inside the unit disk, and in particular: \begin{equation} \rho(I-\eta H_t)\le \max{|1-\eta\mu|,\ |1-\eta L|}=:q<1, \end{equation} where $q$ is independent of $t$ and $T$. Since $J$ is block lower-triangular and its diagonal blocks are exactly $(I-\eta H_t)$ (up to a fixed permutation corresponding to the stacking order), we conclude \begin{equation} \rho(J)=\max_{t}\rho(I-\eta H_t)\le q_0<1. \end{equation} Fix any $q$ such that $q_0<q<1$. By applying Gelfand's formula, for any square matrix $M$ and any $q>\rho(M)$, there exists an induced norm $|\cdot|\dagger$ such that $|M|\dagger\le q$. Applying this with $M=J$ yields a norm $|\cdot|\dagger$ satisfying \begin{equation} |J|\dagger \le q < 1. \end{equation} Hence, for all $k\ge 0$, \begin{equation} |\mathbf z^{k+1}-\mathbf z^\star|\dagger = |J(\mathbf z^{k}-\mathbf z^\star)|\dagger \le |J|\dagger,|\mathbf z^{k}-\mathbf z^\star|\dagger \le q,|\mathbf z^{k}-\mathbf z^\star|\dagger, \end{equation} and therefore \begin{equation} |\mathbf z^{k}-\mathbf z^\star|\dagger \le q^k |\mathbf z^{0}-\mathbf z^\star|_\dagger . 
\end{equation} By norm equivalence in finite dimensions, there exists $C>0$ such that \begin{equation} |\mathbf z^{k}-\mathbf z^\star|_2 \le C q^k |\mathbf z^{0}-\mathbf z^\star|_2 . \end{equation} \end{proof}
\paragraph{Notes on the stopgrad optimization.} The optimization indeed converges to a fixed point, but one can show that these stable points in the linear convex case are merely the greedy rollouts towards the goal. Two things make the optimization in our setting nontrivial: the nonconvexity of the world model $F_\theta$, and the stochastic noise on the states $s_t$. We now present some characterization on the distribution of trajectories that our planner tends towards.
Let $F_\theta:\mathcal S\times\mathcal A\to\mathcal S$ be a differentiable world model and define the stop-gradient one-step prediction [ \mu_t ;\coloneqq; F_\theta(\bar s_t, a_t), \qquad \bar s_t = \mathrm{stopgrad}(s_t). $$
$$ ds_{t+1}(\tau) = -\lambda\bigl(s_{t+1}(\tau)-\mu_t\bigr)d\tau + \sigma,dW_{t+1}(\tau), \quad \lambda>0, $$
$$ J_t^k ;\coloneqq; \nabla_{a_t}F_\theta(\bar s_t^k,a_t^k)\in\mathbb R^{d_s\times d_a}, \qquad P_t^k ;\coloneqq; J_t^k(J_t^k)^\top \succeq 0. $$
Theorem. [informal] A differentiable loss function over state/action trajectories $\mc L: \mc S^T \times \mc A^T \to \R$ given a world model $F_\theta: \mc S \times \mc A \to \mc S$ cannot satisfy both of the following at the same time: enumerate \item Minimizers of $\mc L$ correspond to dynamically feasible trajectories: $F_\theta(s_t, a_t) = s_{t+1}$, \item $\mc L$ is insensitive to the world model state gradient $\nabla_s F_\theta$. enumerate
Theorem. [Shooting: Exploding Smoothness] Let $A^\top$ have a real eigenvalue $\lambda$ with $|\lambda| > 1$ and unit eigenvector $v$ (a left eigenvector of $A$). Assume $B$ aligns with this mode such that for some input direction $w$ ($|w|2 =1$), the projection $|\langle v, Bw \rangle| = \mu > 0$. Then, the Lipschitz constant of the Shooting method gradient grows exponentially with $T$: equation L_S = \lambda{\max}(H_S) \geq 2\mu^2 |\lambda|^{2(T-1)}. equation
Theorem. [Lifted: Stable Smoothness] The Lipschitz constant of the Lifted gradient is bounded by a constant independent of the horizon $T$. Specifically, one valid bound is: equation L_L = \lambda_{\max}(H_L) \le 6\bigl(1 + |A|_2^2 + |B|_2^2\bigr). equation
Theorem. [Gaussian smoothing contracts gradients and yields scale control] Let $d\ge 1$, $\sigma>0$, and let equation \phi_\sigma(\epsilon)=(2\pi\sigma^2)^{-d/2}\exp!\big(-|\epsilon|^2/(2\sigma^2)\big) equation be the density of $\mathcal N(0,\sigma^2 \mathbf I_d)$. For $L \in \mc C^1(\R^d)$ define equation L_\sigma(\mathbf s)=(\phi_\sigma*L)(\mathbf s)=\int_{\mathbb R^d}\phi_\sigma(\epsilon),L(\mathbf s-\epsilon),d\epsilon. equation The following statements hold of the resulting convolution: enumerate \item (Regularity never decreases.) For any $1\le p\le\infty$, if the distributional gradient $\nabla_{s}L\in L^p(\mathbb R^d)$, then equation |\nabla_{s} L_\sigma|{L^p}\ \le\ |\nabla{s} L|{L^p}. equation In particular, if $L$ is Lipschitz, then $Lip(L\sigma) = |\nabla_{s} L_\sigma|{L^\infty} \le |\nabla{s} L|{L^\infty} = Lip(L)$. \item (Explicit regularity control by variance.) For any $1\le p\le\infty$, if $L\in L^p(\mathbb R^d)$, then equation |\nabla{a} L_\sigma|{L^p}\ \le\ c_d{\sigma},|L|{L^p}, \qquad c_d:=\mathbb E|Z|=2,\Gamma\big(\frac{d+1{2}\big)}{\Gamma\big(d{2}\big)},\ \ Z\sim\mathcal N(0,\mathbf I_d). equation Applying $p=\infty$, we get $Lip(L_\sigma)\le c_d{\sigma},|L|_{L^\infty}$. enumerate
Theorem. [Nonexistence of exact dynamics-enforcing losses with Jacobian-free state gradients] Let $\mathcal S\subseteq\mathbb R^n$ and $\mathcal A\subseteq\mathbb R^m$ each be open and connected sets, and fix $s_0\in\mathcal S$ and $g\in\mathcal S$. Consider horizon $T=2$ with decision variables $(s_1,a_0,a_1)\in\mathcal S\times\mathcal A\times\mathcal A$ and boundary $s_2=g$ fixed. Let $ L_F : \mc S \times \mc A \times \mc A \to \R$ be decomposable into the following form: align L_F(s_1,a_0,a_1) &= \Phi(s_1,a_0,a_1,y_0,y_1),\ y_0 &= F(s_0,a_0),\ y_1 &= F(s_1,a_1), align where $\Phi:\mathcal S\times\mathcal A\times\mathcal A\times\mathcal S\times\mathcal S\to\mathbb R$ is $C^1$. Let equation \mathcal M(F)={(s_1,a_0,a_1): s_1=F(s_0,a_0)\ and\ g=F(s_1,a_1)} equation denote the set of trajectories that satisfy the dynamics exactly at both steps (with the fixed boundary conditions). There does not exist such a $\Phi$ for which the following two properties hold simultaneously for every $C^1$ model $F$: enumerate \item Minimizers correspond to feasible dynamics: $\argmin_{(s_1,a_0,a_1)} L_F(s_1,a_0,a_1) = \mathcal M(F),$ \item Independence of loss to the dynamics' state gradient: for all $G: \mc S \times \mc A \to \mc S, G\in C^1$, $ \Big(F(s_0,a_0)=G(s_0,a_0), F(s_1,a_1)=G(s_1,a_1)\Big) \Rightarrow \nabla_{s_1}L_F(s_1,a_0,a_1)=\nabla_{s_1}L_G(s_1,a_0,a_1)$. enumerate
Theorem. [Linear convergence to a unique fixed point] Consider the gradient descent iteration [ (\mathbf s^{k+1},\mathbf a^{k+1}) = (\mathbf s^{k},\mathbf a^{k})-\eta \nabla \mathcal L_{sg}(\mathbf s^{k},\mathbf a^{k}) ] where $s_0$ is fixed and gradients are computed with the stop-gradient convention (i.e., treating $sg(s_t)$ as constant during differentiation). Assume the linear dynamics setting and that [ \beta_t \in [\beta_{\min},\beta_{\max}] \quad for all t,\qquad \beta_{\min}>0, ] and that $B$ has full column rank (equivalently, $\sigma_{\min}(B)>0$). Then there exists a stepsize $\eta\in (0,\bar \eta)$, where $\bar\eta$ depends only on $\beta_{\min},\beta_{\max},|B|2,\sigma{\min}(B)$ (and not on $T$), such that the induced update operator $\mathcal T$ on $\mathbf z=(\mathbf s,\mathbf a)$ has a unique fixed point $\mathbf z^\star$ and the iterates converge linearly: [ |\mathbf z^{k}-\mathbf z^\star| \le Cq^k|\mathbf z^{0}-\mathbf z^\star|,\qquad for some q\in(0,1) and C > 0. ]
Theorem. [Gaussian tube around one-step predictions] Fix ${\bar s_t^k}{t=0}^{T-1}$ and ${a_t^k}{t=0}^{T-1}$ at iteration $k$, and let $\mu_t^k=F_\theta(\bar s_t^k,a_t^k)$. Then the state update eq:sg_state_update_theory satisfies the conditional mean recursion equation E!\left[s_{t+1}^{k+1}\mid \mu_t^k\right] = (1-2\eta_s),E!\left[s_{t+1}^{k}\mid \mu_t^k\right] + 2\eta_s,\mu_t^k. equation Moreover, if $\mu_t^k\equiv \mu_t$ is held fixed, then $s_{t+1}^k$ converges in distribution to a Gaussian ``tube'' around $\mu_t$: equation s_{t+1}^\infty \ \sim\ \mathcal N!\bigl(\mu_t,\ \Sigma_{tube}\bigr), \qquad \Sigma_{tube} = \sigma^2{1-(1-2\eta_s)^2},I = \sigma^2{4\eta_s(1-\eta_s)},I. equation Analogously, in continuous optimization-time $\tau$, the limiting SDE [ ds_{t+1}(\tau) = -\lambda\bigl(s_{t+1}(\tau)-\mu_t\bigr)d\tau + \sigma,dW_{t+1}(\tau), \quad \lambda>0, ] has stationary law $\mathcal N(\mu_t,\sigma^2{2\lambda}I)$.
Theorem. [Goal shaping induces goal-directed drift of tube center] Define $\mu_t^k = F_\theta(\bar s_t^k,a_t^k)$ and let [ J_t^k ;\coloneqq; \nabla_{a_t}F_\theta(\bar s_t^k,a_t^k)\in\mathbb R^{d_s\times d_a}, \qquad P_t^k ;\coloneqq; J_t^k(J_t^k)^\top \succeq 0. ] Assume a first-order linearization in the action step: equation \mu_t^{k+1} \approx \mu_t^k + J_t^k,(a_t^{k+1}-a_t^k). equation Then the action update eq:sg_action_update_theory induced by eq:sg_full_objective_theory yields the tube-center evolution equation \mu_t^{k+1} ;\approx; \bigl(I-\alpha\gamma P_t^k\bigr)\mu_t^k + \alpha\gamma P_t^k g_{goal-directed averaging (drift)} ;+; \alpha,P_t^k,\varepsilon_{t+1^k}{exploration forcing}, \qquad \alpha \coloneqq 2\eta_a, equation where $\varepsilon{t+1}^k$ denotes the tube residual $s_{t+1}^k=\mu_t^k+\varepsilon_{t+1}^k$. In particular, if $\mathbb E[\varepsilon_{t+1}^k\mid \mu_t^k]=0$, then equation \mathbb E[\mu_t^{k+1}\mid \mu_t^{k}, s_t^k, a_t^k] ;\approx; \bigl(I-\alpha\gamma P_t^k\bigr)\mu_t^k + \alpha\gamma P_t^k g, equation so in controllable directions the mean prediction $\mu_t$ moves toward $g$ as an averaging step. If $0 < \alpha\gamma\lambda_{\max}(P_t^k) < 1$, this is a contractive averaging step toward $g$ on $Range(P_t^k)$.
Theorem. [Mean evolution and non-tube behavior of noisy rollouts] Consider a rollout-based stochastic trajectory generated in model-time: equation s_{t+1} = F_\theta(s_t,a_t) + \sigma_{env}\zeta_t, \qquad \zeta_t\sim\mathcal N(0,I), \qquad s_0 \ fixed, equation and a dense goal objective along the rollout (e.g. $\sum_{t=0}^{T-1}|s_{t+1}-g|^2$). Let $m_t\coloneqq \mathbb E[s_t]$ and $\Sigma_t\coloneqq Cov(s_t)$. The rollout mean obeys the exact identity equation m_{t+1} = \mathbb E!\left[F_\theta(s_t,a_t)\right]. equation In particular, if $F_\theta$ is affine in $s$ (i.e. $F_\theta(s,a)=As+Ba+c$), then equation m_{t+1} = F_\theta(m_t,a_t). equation For general nonlinear $F_\theta$, a second-order moment expansion yields equation m_{t+1} \approx F_\theta(m_t,a_t) + \frac12,\big(Hess_s F_\theta(m_t,a_t)\big):\Sigma_t, equation so the mean generally does not follow the deterministic rollout $F_\theta(m_t,a_t)$. A first-order linearization gives the approximate covariance propagation equation \Sigma_{t+1} \approx G_t,\Sigma_t,G_t^\top + \sigma_{env}^2 I, \qquad G_t\coloneqq \nabla_s F_\theta(\mu_t,a_t). equation
Corollary. [Noise induces smoothed gradients in expectation] Let $\xi \sim \mathcal N(0,\sigma^2 \mathbf I_d)$ and define the Gaussian-smoothed loss
\begin{equation}
L_\sigma(\mathbf s) \coloneqq \mathbb E_{\xi}\bigl[L(\mathbf s+\xi)\bigr].
\end{equation}
Assume $L \in \mathcal C^1(\mathbb R^d)$ and that $\mathbb E_{\xi}\|\nabla L(\mathbf s+\xi)\|<\infty$ (so differentiation may be interchanged with expectation). Then
\begin{equation}
\mathbb E_{\xi}\bigl[\nabla L(\mathbf s+\xi)\bigr] = \nabla_{\mathbf s} L_\sigma(\mathbf s).
\end{equation}
Moreover, if $L\in L^\infty(\mathbb R^d)$, then by Theorem~\ref{thm:gaussian_regularity} (Part 2),
\begin{equation}
\|\nabla_{\mathbf s} L_\sigma\|_{L^\infty} \le \frac{c_d}{\sigma}\,\|L\|_{L^\infty},
\qquad\text{and hence}\qquad
\Bigl\|\mathbb E_{\xi}\bigl[\nabla L(\mathbf s+\xi)\bigr]\Bigr\| \le \frac{c_d}{\sigma}\,\|L\|_{L^\infty}.
\end{equation}
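The smoothing identity is easy to verify in closed form for an assumed one-dimensional test loss $L(x)=\cos(x)$, since then $L_\sigma(s)=e^{-\sigma^2/2}\cos(s)$ exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, s, n = 0.7, 1.2, 400_000

# Monte-Carlo estimate of E[ L'(s + xi) ] with L(x) = cos(x), L'(x) = -sin(x)
xi = sigma * rng.standard_normal(n)
mc_grad = np.mean(-np.sin(s + xi))

# closed-form gradient of the smoothed loss: d/ds [ exp(-sigma^2/2) cos(s) ]
exact_grad = -np.exp(-sigma**2 / 2) * np.sin(s)
print(mc_grad, exact_grad)
```

The Monte-Carlo average of pointwise gradients matches the gradient of the smoothed loss, which is exactly the mechanism the corollary attributes to injected state noise.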
Proof. The Lipschitz constant is the maximum eigenvalue of the Hessian $H_S = 2C_T^\top C_T$, which equals $2\|C_T\|_2^2$. By definition, the spectral norm is the maximum gain over all possible inputs:
\begin{equation}
\|C_T\|_2 = \sup_{a \neq 0} \frac{\|C_T a\|_2}{\|a\|_2}.
\end{equation}
From the existence of a controllable non-contractive mode, we can construct a unit-norm $a_{\mathrm{test}} = (w; 0; \dots; 0)$ that evaluates to
\begin{equation}
C_T a_{\mathrm{test}} = A^{T-1}Bw,
\end{equation}
and such that the following holds when projecting onto the corresponding unstable eigenvector $v$:
\begin{align}
\|C_T a_{\mathrm{test}}\|_2 &\geq |\langle v, A^{T-1}Bw \rangle| \\
&= |\langle (A^\top)^{T-1}v, Bw \rangle| \\
&= |\langle \lambda^{T-1}v, Bw \rangle| \\
&= |\lambda|^{T-1}\, |\langle v, Bw \rangle| \\
&= \mu\, |\lambda|^{T-1}.
\end{align}
Squaring this bound gives the desired result.
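The exponential growth of $\|C_T\|_2$ (and hence of the Lipschitz constant $2\|C_T\|_2^2$) can be checked numerically for an assumed toy system with one unstable mode; the matrices below are illustrative choices, not from the paper:

```python
import numpy as np

# Unstable mode lambda = 1.5 with eigenvector v = e1 (A is diagonal, so left and right
# eigenvectors coincide); B = I gives mu = |<v, B w>| = 1 for the action direction w = e1.
A = np.diag([1.5, 0.8])
B = np.eye(2)
lam, mu = 1.5, 1.0

specs, bounds = [], []
for T in [5, 10, 15]:
    # C_T maps the stacked actions (a_0, ..., a_{T-1}) to s_T; block t is A^(T-1-t) B
    C_T = np.hstack([np.linalg.matrix_power(A, T - 1 - t) @ B for t in range(T)])
    specs.append(np.linalg.norm(C_T, 2))
    bounds.append(mu * abs(lam) ** (T - 1))
print(list(zip(specs, bounds)))
```

The spectral norm dominates the lower bound $\mu|\lambda|^{T-1}$ at every horizon, so the shooting loss's gradient Lipschitz constant blows up at least geometrically in $T$.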
Proof. The Hessian is $H_L = 2M^\top M$, so $\lambda_{\max}(H_L) = 2\|M\|_2^2$. It therefore suffices to upper bound $\|M\|_2$ by a constant that does not depend on $T$. The matrix $M$ is block-sparse: each block row corresponding to timestep $t$ contains at most three non-zero blocks, namely an identity block (selecting $s_{t+1}$), a dynamics block (multiplying $s_t$), and an input block (multiplying $a_t$). Equivalently, the corresponding residual has the form
\begin{equation}
r_t \;=\; As_t + Ba_t - s_{t+1},
\end{equation}
with $s_0$ and $s_T$ treated as fixed boundary values (so $r_t$ is affine in the free variables). Fix any optimization vector $z = (s_1; \dots; s_{T-1}; a_0; \dots; a_{T-1})$ (stacking the free states and controls), and let $Mz$ denote the stacked residuals $(r_0,\dots,r_{T-1})$ with constants removed. Using the inequality $\|x+y+z\|_2^2 \le 3(\|x\|_2^2+\|y\|_2^2+\|z\|_2^2)$ and the operator norm bounds $\|As\|_2 \le \|A\|_2\|s\|_2$ and $\|Ba\|_2 \le \|B\|_2\|a\|_2$, we obtain for each $t$:
\begin{align}
\|r_t\|_2^2 &= \|-s_{t+1} + As_t + Ba_t\|_2^2 \\
&\le 3\Bigl(\|s_{t+1}\|_2^2 + \|A\|_2^2\|s_t\|_2^2 + \|B\|_2^2\|a_t\|_2^2\Bigr).
\end{align}
Summing over $t=0,\dots,T-1$ gives
\begin{equation}
\|Mz\|_2^2 \;=\; \sum_{t=0}^{T-1}\|r_t\|_2^2 \;\le\; 3\sum_{t=0}^{T-1}\Bigl(\|s_{t+1}\|_2^2 + \|A\|_2^2\|s_t\|_2^2 + \|B\|_2^2\|a_t\|_2^2\Bigr).
\end{equation}
Crucially, due to the banded structure, each free intermediate state $s_1,\dots,s_{T-1}$ appears in at most two terms of the sum: once as $s_{t+1}$ and once as $s_t$. Hence the state contributions can be bounded without any accumulation in $T$:
\begin{align}
\|Mz\|_2^2 &\le 3\Bigl((1+\|A\|_2^2)\sum_{t=1}^{T-1}\|s_t\|_2^2 \;+\; \|B\|_2^2\sum_{t=0}^{T-1}\|a_t\|_2^2\Bigr) \\
&\le 3\bigl(1+\|A\|_2^2+\|B\|_2^2\bigr)\,\|z\|_2^2.
\end{align}
Therefore, $\|M\|_2^2 = \sup_{z\neq 0}\|Mz\|_2^2/\|z\|_2^2 \le 3(1+\|A\|_2^2+\|B\|_2^2)$, and thus
\begin{equation}
L_L \;=\; 2\|M\|_2^2 \;\le\; 6\bigl(1+\|A\|_2^2+\|B\|_2^2\bigr),
\end{equation}
which is independent of $T$.
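The $T$-independence of this bound can be verified by assembling the banded residual matrix $M$ explicitly for a random toy system (dimensions and matrices below are illustrative assumptions) and computing $2\|M\|_2^2$ at several horizons:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 2
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
bound = 6 * (1 + np.linalg.norm(A, 2)**2 + np.linalg.norm(B, 2)**2)

L_vals = []
for T in [5, 20, 80]:
    dim = n * (T - 1) + m * T                    # free states s_1..s_{T-1}, actions a_0..a_{T-1}
    M = np.zeros((n * T, dim))
    s_off = lambda t: n * (t - 1)                # column offset of free state s_t
    a_off = lambda t: n * (T - 1) + m * t        # column offset of action a_t
    for t in range(T):                           # residual r_t = A s_t + B a_t - s_{t+1}
        if t >= 1:                               # s_0 is a fixed boundary value
            M[n*t:n*(t+1), s_off(t):s_off(t)+n] = A
        if t + 1 <= T - 1:                       # s_T is a fixed boundary value
            M[n*t:n*(t+1), s_off(t+1):s_off(t+1)+n] = -np.eye(n)
        M[n*t:n*(t+1), a_off(t):a_off(t)+m] = B
    L_vals.append(2 * np.linalg.norm(M, 2)**2)   # gradient Lipschitz constant 2*||M||_2^2
print(L_vals, bound)
```

Unlike the shooting case above, the Lipschitz constants stay below the horizon-independent cap $6(1+\|A\|_2^2+\|B\|_2^2)$ for every $T$.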
Proof. An important property of convolution is that the gradient may be placed on either factor:
\begin{align}
\nabla_{s} L_\sigma &=\nabla_{s}(\phi_\sigma * L) \\
&=(\nabla \phi_\sigma) * L \\
&=\phi_\sigma * (\nabla_{s}L),
\end{align}
where $\nabla \phi_\sigma(\epsilon)=-(\epsilon/\sigma^2)\,\phi_\sigma(\epsilon)$. For part 1, apply Young's convolution inequality together with the fact that $\|\phi_\sigma\|_{L^1}=1$: for any $g\in L^p$,
\begin{equation}
\|g * \phi_\sigma\|_{L^p}\le \|g\|_{L^p}.
\end{equation}
Taking $g=\nabla_{s}L$ and using the commutation identity above gives
\begin{equation}
\|\nabla_{s} L_\sigma\|_{L^p}=\|\phi_\sigma*(\nabla_{s}L)\|_{L^p}\le \|\nabla_{s}L\|_{L^p}.
\end{equation}
When $p=\infty$, $\|\nabla_{s}L\|_{L^\infty}$ is the Lipschitz constant of $L$, so $\operatorname{Lip}(L_\sigma)\le \operatorname{Lip}(L)$. For part 2, again use the commutation identity together with Young's inequality in the form $\|h*k\|_{L^p}\le \|h\|_{L^1}\|k\|_{L^p}$. We obtain
\begin{equation}
\|\nabla_{s} L_\sigma\|_{L^p} =\|(\nabla\phi_\sigma)*L\|_{L^p} \le \|\nabla\phi_\sigma\|_{L^1}\,\|L\|_{L^p}.
\end{equation}
It remains to compute $\|\nabla\phi_\sigma\|_{L^1}$. Since $\|\nabla\phi_\sigma(\epsilon)\|_2=\|\epsilon\|_2\, \phi_\sigma(\epsilon)/\sigma^2$ and $X\sim\mathcal N(0,\sigma^2\mathbf I_d)$ has density $\phi_\sigma$, we get
\begin{align}
\|\nabla\phi_\sigma\|_{L^1} &=\int_{\mathbb R^d}\frac{\|\epsilon\|_2}{\sigma^2}\,\phi_\sigma(\epsilon)\,d\epsilon \\
&=\frac{1}{\sigma^2}\,\mathbb E\|X\|_2 \\
&=\frac{1}{\sigma}\,\mathbb E\|Z\|_2 \\
&=\frac{c_d}{\sigma},
\end{align}
where $X=\sigma Z$ with $Z\sim\mathcal N(0,\mathbf I_d)$. Substituting this into the previous display yields the claimed bound. The $p=\infty$ statement is the corresponding Lipschitz estimate.
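The constant $c_d = \mathbb E\|Z\|_2$ is the mean of a chi distribution with $d$ degrees of freedom, $c_d = \sqrt{2}\,\Gamma((d+1)/2)/\Gamma(d/2)$, which gives a direct numerical check of $\|\nabla\phi_\sigma\|_{L^1}=c_d/\sigma$ (the dimension and $\sigma$ below are arbitrary choices):

```python
import numpy as np
from math import gamma, sqrt

rng = np.random.default_rng(4)
d, sigma, n = 5, 0.8, 400_000

Z = rng.standard_normal((n, d))
c_d_mc = np.linalg.norm(Z, axis=1).mean()                # c_d = E ||Z||_2, Z ~ N(0, I_d)
c_d_exact = sqrt(2) * gamma((d + 1) / 2) / gamma(d / 2)  # chi-distribution mean

# ||grad phi_sigma||_{L1} = E ||X||_2 / sigma^2 with X = sigma * Z, i.e. c_d / sigma
X = sigma * Z
l1_mc = (np.linalg.norm(X, axis=1) / sigma**2).mean()
print(c_d_mc, c_d_exact, l1_mc, c_d_exact / sigma)
```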
Proof. We argue by contradiction. Fix any point $(s_1,a_0,a_1)$ and write $y_0=F(s_0,a_0)$ and $y_1=F(s_1,a_1)$. By the chain rule,
\begin{equation}
\nabla_{s_1}L_F(s_1,a_0,a_1) = \frac{\partial \Phi}{\partial s_1}(s_1,a_0,a_1,y_0,y_1) + \bigl(\nabla_s F(s_1,a_1)\bigr)^\top \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1).
\end{equation}
We claim that $\partial\Phi/\partial y_1$ must vanish identically. To see this, fix arbitrary arguments $(s_1,a_0,a_1,y_0,y_1)$ in the domain with $(s_0,a_0)\ne(s_1,a_1)$, and construct two $C^1$ models $F$ and $G$ such that $F(s_0,a_0) = G(s_0,a_0)=y_0$ and $F(s_1,a_1) = G(s_1,a_1)=y_1$, but whose state Jacobians at $(s_1,a_1)$ are prescribed arbitrarily and differ:
\begin{equation}
\nabla_sF(s_1,a_1)=J_F,\qquad \nabla_sG(s_1,a_1)=J_G.
\end{equation}
To construct these, choose a small ball $B$ around $(s_1,a_1)$ contained in $\mathcal S\times\mathcal A$, construct a smooth bump function $\psi$ that equals $1$ on a smaller concentric ball and $0$ outside $B$ (possible since $\mathcal S, \mathcal A$ are open), and define
\begin{equation}
F(s,a) = H(s,a)+\psi(s,a)\Bigl((y_1-H(s_1,a_1)) + (J_F-\nabla_s H(s_1,a_1))(s-s_1)\Bigr),
\end{equation}
for an arbitrary $C^1$ base map $H$. Then $F(s_1,a_1)=y_1$ and $\nabla_sF(s_1,a_1)=J_F$. Defining $G$ analogously with $J_G$ gives the desired pair; values at $(s_0,a_0)$ can be kept fixed by choosing disjoint supports or applying the same local surgery at $(s_0,a_0)$. By Jacobian-invariance, $\nabla_{s_1}L_F=\nabla_{s_1}L_G$ at $(s_1,a_0,a_1)$. Subtracting the two chain-rule expressions cancels $\partial\Phi/\partial s_1$ and yields
\begin{equation}
(J_F-J_G)^\top \frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1)=0.
\end{equation}
Since $J_F-J_G$ can be any matrix in $\mathbb R^{n\times n}$, it follows that
\begin{equation}
\frac{\partial \Phi}{\partial y_1}(s_1,a_0,a_1,y_0,y_1)=0.
\end{equation}
Because the arguments were arbitrary, and since $\mathcal S$ is connected, we conclude $\partial\Phi/\partial y_1\equiv 0$, meaning the loss is independent of $y_1$.
Therefore there exists a $C^1$ function $\widetilde\Phi$ such that for every $F$,
\begin{equation}
L_F(s_1,a_0,a_1)=\widetilde\Phi(s_1,a_0,a_1,F(s_0,a_0)).
\end{equation}
In particular, if two models $F$ and $G$ satisfy $F(s_0,a)=G(s_0,a)$ for all $a\in\mathcal A$, then $L_F\equiv L_G$ as functions of $(s_1,a_0,a_1)$, and hence
\begin{equation}
\arg\min L_F=\arg\min L_G.
\end{equation}
We now construct such a pair $F,G$ with different feasible sets, contradicting the exact-minimizers assumption. Pick two distinct actions $u,v\in\mathcal A$, two distinct states $s_A,s_B\in\mathcal S$, and an action $a^\star\in\mathcal A$. Using bump-function surgery as above, construct $C^1$ models $F$ and $G$ such that
\begin{equation}
F(s_0,a)=G(s_0,a)\quad \text{for all } a\in\mathcal A,
\end{equation}
and
\begin{align}
F(s_0,u)&=G(s_0,u)=s_A, \\
F(s_0,v)&=G(s_0,v)=s_B,
\end{align}
but with swapped second-step goal reachability:
\begin{align}
F(s_A,a^\star) &= g, & F(s_B,a^\star) &\neq g, \\
G(s_A,a^\star) &\neq g, & G(s_B,a^\star) &= g.
\end{align}
Then $(s_A,u,a^\star)\in\mathcal M(F)$ but $(s_A,u,a^\star)\notin\mathcal M(G)$, and $(s_B,v,a^\star)\in\mathcal M(G)$ but $(s_B,v,a^\star)\notin\mathcal M(F)$, so $\mathcal M(F)\neq \mathcal M(G)$. On the other hand, since $F(s_0,\cdot)=G(s_0,\cdot)$ we have $\arg\min L_F=\arg\min L_G$. By assumption (i),
\begin{equation}
\arg\min L_F=\mathcal M(F)\quad \text{and}\quad \arg\min L_G=\mathcal M(G),
\end{equation}
implying $\mathcal M(F)=\mathcal M(G)$, a contradiction. Therefore, no such $\Phi$ can exist.
Proof. Write the objective (re-indexing the goal term to match the action index) as
\begin{equation}
L_{sg}(s,a) = \sum_{t=0}^{T-1}\|s_{t+1}-A\,sg(s_t)-Ba_t\|_2^2 \;+\; \sum_{t=0}^{T-1}\beta_t\|g-A\,sg(s_t)-Ba_t\|_2^2 .
\end{equation}
Define the residuals
\begin{equation}
r_t \coloneqq s_{t+1}-A s_t-Ba_t, \qquad e_t \coloneqq g-A s_t-Ba_t .
\end{equation}
Under the stop-gradient convention, $sg(s_t)$ is treated as constant during differentiation, so $s_t$ does not receive gradient contributions through the $A\,sg(s_t)$ terms. It follows that the only state-gradient at time $t$ comes from the appearance of $s_t$ as the next state in the previous residual, namely
\begin{equation}
\nabla_{s_t}L_{sg} = 2r_{t-1}, \qquad t=1,\dots,T,
\end{equation}
while $s_0$ is fixed and receives no update. Likewise, the action-gradient at time $t$ is
\begin{equation}
\nabla_{a_t}L_{sg} = -2B^\top r_t -2\beta_t B^\top e_t, \qquad t=0,\dots,T-1.
\end{equation}
Therefore, gradient descent with stepsize $\eta>0$ yields the explicit update rules
\begin{equation}
s_t^{k+1} = s_t^k - 2\eta\bigl(s_t^k-As_{t-1}^k-Ba_{t-1}^k\bigr), \qquad t=1,\dots,T,
\end{equation}
and
\begin{equation}
a_t^{k+1} = a_t^k + 2\eta B^\top\Bigl(\bigl(s_{t+1}^k-As_t^k-Ba_t^k\bigr)+\beta_t\bigl(g-As_t^k-Ba_t^k\bigr)\Bigr), \qquad t=0,\dots,T-1.
\end{equation}
Stack the variables in the time-ordered vector $z\coloneqq(s_1,a_0,s_2,a_1,\dots,s_T,a_{T-1})$. The updates above define an affine map $z^{k+1}=T(z^k)=Jz^k+c$ whose Jacobian $J$ is block lower-triangular with respect to this ordering: indeed, $(s_{t+1}^{k+1},a_t^{k+1})$ depends only on $(s_t^k,s_{t+1}^k,a_t^k)$ (and on the fixed constants $g$ and $s_0$) and is independent of any future variables $(s_{t+2}^k,a_{t+1}^k,\dots)$. Consequently, the eigenvalues of $J$ are exactly the union of the eigenvalues of its diagonal blocks. To characterize a diagonal block, fix $t\in\{0,\dots,T-1\}$ and consider the pair $y_t\coloneqq(s_{t+1},a_t)$.
Conditioned on $s_t$ (which appears only as a constant inside $sg(s_t)$ for differentiation), the update of $(s_{t+1},a_t)$ is precisely one gradient step on the quadratic function
\begin{equation}
\phi_t(s_{t+1},a_t; s_t) = \|s_{t+1}-As_t-Ba_t\|_2^2 + \beta_t\|g-As_t-Ba_t\|_2^2,
\end{equation}
so the corresponding diagonal block equals $I-\eta H_t$, where $H_t=\nabla_{y_t}^2\phi_t$ is the constant Hessian with respect to $(s_{t+1},a_t)$:
\begin{equation}
H_t = 2\begin{bmatrix} I & -B \\[0.2em] -B^\top & (1+\beta_t)B^\top B \end{bmatrix}.
\end{equation}
Assume $\beta_t\in[\beta_{\min},\beta_{\max}]$ with $\beta_{\min}>0$ and that $B$ has full column rank, so $B^\top B\succ 0$. The Schur complement of the $I$ block is
\begin{equation}
(1+\beta_t)B^\top B - B^\top I^{-1}B = \beta_t B^\top B \succ 0,
\end{equation}
hence $H_t\succ 0$ for every $t$, with eigenvalues uniformly bounded away from $0$ and $\infty$ as $t$ varies:
\begin{equation}
0<\mu I \preceq H_t \preceq L I < \infty,
\end{equation}
for constants $\mu,L$ depending only on $\beta_{\min},\beta_{\max},\|B\|_2$, and $\sigma_{\min}(B)$. Choosing any stepsize $\eta$ with $0<\eta<2/L$, all eigenvalues of $I-\eta H_t$ lie strictly inside the unit disk for every $t$, and in particular
\begin{equation}
\rho(I-\eta H_t)\le \max\{|1-\eta\mu|,\ |1-\eta L|\}=:q_0<1,
\end{equation}
where $q_0$ is independent of $t$ and $T$. Since $J$ is block lower-triangular and its diagonal blocks are exactly $(I-\eta H_t)$ (up to a fixed permutation corresponding to the stacking order), we conclude
\begin{equation}
\rho(J)=\max_{t}\rho(I-\eta H_t)\le q_0<1.
\end{equation}
Fix any $q$ with $q_0<q<1$. By Gelfand's formula, for any square matrix $M$ and any $q>\rho(M)$ there exists an induced norm $\|\cdot\|_\dagger$ such that $\|M\|_\dagger\le q$. Applying this with $M=J$ yields a norm $\|\cdot\|_\dagger$ satisfying
\begin{equation}
\|J\|_\dagger \le q < 1.
\end{equation}
Hence, for all $k\ge 0$,
\begin{equation}
\|\mathbf z^{k+1}-\mathbf z^\star\|_\dagger = \|J(\mathbf z^{k}-\mathbf z^\star)\|_\dagger \le \|J\|_\dagger\,\|\mathbf z^{k}-\mathbf z^\star\|_\dagger \le q\,\|\mathbf z^{k}-\mathbf z^\star\|_\dagger,
\end{equation}
and therefore
\begin{equation}
\|\mathbf z^{k}-\mathbf z^\star\|_\dagger \le q^k \|\mathbf z^{0}-\mathbf z^\star\|_\dagger .
\end{equation}
By norm equivalence in finite dimensions, there exists $C>0$ such that
\begin{equation}
\|\mathbf z^{k}-\mathbf z^\star\|_2 \le C q^k \|\mathbf z^{0}-\mathbf z^\star\|_2 .
\end{equation}
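The explicit update rules in this proof can be simulated directly; the sketch below uses an assumed toy linear system (note that convergence is independent of $A$, since $A$ never enters the diagonal blocks $H_t$) and measures the distance to a near-exact fixed point:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2
A = 0.5 * rng.standard_normal((n, n))   # arbitrary dynamics: does not affect convergence
B = np.eye(n)                           # full column rank, so B^T B > 0
g = np.array([1.0, -1.0])
T, beta, eta = 8, 1.0, 0.1              # eta < 2/L for this B and beta
s0 = np.zeros(n)

s = rng.standard_normal((T + 1, n)); s[0] = s0   # s_1..s_T free, s_0 fixed
a = rng.standard_normal((T, n))

def step(s, a):
    """One simultaneous gradient step of the stop-gradient objective."""
    s_new, a_new = s.copy(), a.copy()
    for t in range(1, T + 1):           # state update: grad_{s_t} = 2 r_{t-1}
        r_prev = s[t] - A @ s[t - 1] - B @ a[t - 1]
        s_new[t] = s[t] - 2 * eta * r_prev
    for t in range(T):                  # action update: -grad_{a_t}/2 = B^T (r_t + beta e_t)
        r = s[t + 1] - A @ s[t] - B @ a[t]
        e = g - A @ s[t] - B @ a[t]
        a_new[t] = a[t] + 2 * eta * B.T @ (r + beta * e)
    return s_new, a_new

s_ref, a_ref = s.copy(), a.copy()
for _ in range(3000):                   # near-exact fixed point z*
    s_ref, a_ref = step(s_ref, a_ref)

errs = []
for _ in range(400):
    s, a = step(s, a)
    errs.append(np.linalg.norm(s - s_ref) + np.linalg.norm(a - a_ref))
print(errs[0], errs[-1])
```

The error decays geometrically at a rate that does not depend on the horizon, matching $\rho(J)\le q_0<1$.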
Proof. Rewrite \eqref{eq:sg_state_update_theory} as an affine Gaussian recursion:
\[
s_{t+1}^{k+1} = (1-2\eta_s)s_{t+1}^k + 2\eta_s\mu_t^k + \sigma\xi_{t+1}^k.
\]
Taking the conditional expectation yields \eqref{eq:ou_mean_recursion}. If $\mu_t^k\equiv \mu_t$ is fixed, the centered process $u^k\coloneqq s_{t+1}^k-\mu_t$ satisfies $u^{k+1}=(1-2\eta_s)u^k+\sigma\xi^k$, i.e.\ an AR(1) process with contraction factor $|1-2\eta_s|<1$. Its unique stationary covariance $\Sigma_{\mathrm{tube}}$ solves the discrete Lyapunov equation $\Sigma_{\mathrm{tube}}=(1-2\eta_s)^2\Sigma_{\mathrm{tube}}+\sigma^2 I$, giving \eqref{eq:ou_stationary_tube}. The continuous-time statement is standard, given that $\mu_t$ is fixed.
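The stationary tube variance predicted by the Lyapunov equation, $\sigma^2/\bigl(1-(1-2\eta_s)^2\bigr)$, is easy to confirm empirically by simulating many independent AR(1) chains (constants below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
eta_s, sigma = 0.2, 0.5
contraction = 1 - 2 * eta_s                    # |1 - 2*eta_s| < 1
var_pred = sigma**2 / (1 - contraction**2)     # solves V = (1-2*eta_s)^2 V + sigma^2

u = np.zeros(200_000)                          # many independent tube chains
for _ in range(200):                           # long enough to reach stationarity
    u = contraction * u + sigma * rng.standard_normal(u.shape)
var_mc = u.var()
print(var_mc, var_pred)
```

The empirical variance saturates at the predicted level rather than growing with the number of optimization steps, which is exactly the "tube" behavior contrasted with noisy rollouts below.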
Proof. From \eqref{eq:sg_full_objective_theory}, only the terms at time $t$ depend on $a_t$, through $\mu_t$. Using $\nabla_{a_t}\mu_t = J_t$, the gradient is
\begin{equation}
\nabla_{a_t}\mathcal L = 2(J_t)^\top(\mu_t-s_{t+1}) + 2\gamma(J_t)^\top(\mu_t-g).
\end{equation}
Thus the action step is
\[
a_t^{k+1}-a_t^k = -\eta_a\nabla_{a_t}\mathcal L = -2\eta_a(J_t^k)^\top(\mu_t^k-s_{t+1}^k) -2\eta_a\gamma(J_t^k)^\top(\mu_t^k-g).
\]
Applying the linearization \eqref{eq:mu_linearization},
\begin{align*}
\mu_t^{k+1} &\approx \mu_t^k -2\eta_a J_t^k(J_t^k)^\top(\mu_t^k-s_{t+1}^k) -2\eta_a\gamma J_t^k(J_t^k)^\top(\mu_t^k-g) \\
&= \mu_t^k -2\eta_a P_t^k\bigl(\mu_t^k-(\mu_t^k+\varepsilon_{t+1}^k)\bigr) -2\eta_a\gamma P_t^k(\mu_t^k-g),
\end{align*}
which simplifies to \eqref{eq:tube_center_update} after substituting $\alpha=2\eta_a$. Taking the conditional expectation and using $\mathbb E[\varepsilon_{t+1}^k\mid \mu_t^k]=0$ yields \eqref{eq:tube_center_expected_drift}.
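The per-timestep gradient expression at the start of this proof can be checked against central finite differences for an assumed small nonlinear model (a one-layer tanh map; purely illustrative, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(7)
d_s, d_a, gamma = 3, 2, 0.5
W1 = rng.standard_normal((d_s, d_s))
W2 = rng.standard_normal((d_s, d_a))
s, a = rng.standard_normal(d_s), rng.standard_normal(d_a)
s_next, g = rng.standard_normal(d_s), rng.standard_normal(d_s)

F = lambda s, a: np.tanh(W1 @ s + W2 @ a)        # toy differentiable world model
def L(a):
    mu = F(s, a)
    return np.sum((mu - s_next)**2) + gamma * np.sum((mu - g)**2)

mu = F(s, a)
J = (1 - mu**2)[:, None] * W2                    # J_t = d mu / d a for the tanh map
grad_analytic = 2 * J.T @ (mu - s_next) + 2 * gamma * J.T @ (mu - g)

eps = 1e-6                                       # central finite differences
grad_fd = np.array([(L(a + eps * e) - L(a - eps * e)) / (2 * eps) for e in np.eye(d_a)])
print(grad_analytic, grad_fd)
```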
Proof. Taking the conditional expectation of \eqref{eq:rollout_noise} gives $\mathbb E[s_{t+1}\mid s_t]=F_\theta(s_t,a_t)$, and then the tower property implies \eqref{eq:rollout_mean_exact}. For affine $F_\theta$, expectation commutes with $F_\theta$, yielding \eqref{eq:rollout_mean_linear}. For nonlinear $F_\theta$, expand $F_\theta(s_t,a_t)$ around $m_t$ by Taylor's theorem; the first-order term vanishes in expectation and the second-order term produces \eqref{eq:rollout_mean_second_order}. The covariance recursion \eqref{eq:rollout_covariance} follows by linearizing $F_\theta(s_t,a_t)\approx F_\theta(m_t,a_t)+G_t(s_t-m_t)$ and computing the covariance, adding the independent noise variance $\sigma_{\mathrm{env}}^2 I$. The final ``non-tube'' claim follows because there is no optimization-time contraction that repeatedly pulls $s_{t+1}$ back toward a moving center (as in \eqref{eq:ou_mean_recursion}); instead, the forward propagation \eqref{eq:rollout_covariance} typically increases the spread, and the mean can deviate from the deterministic path by \eqref{eq:rollout_mean_second_order}.
\begin{algorithm}
\begin{algorithmic}[1]
\Require Initial observation $o_0$, goal $o_g$, world model $F_\theta$, horizon $T$, steps $K$, learning rates $\eta_a, \eta_s$.
\Ensure $\mathbf{a}^*$
\State $s_0 \gets \text{encode}(o_0)$, $s_T \gets \text{encode}(o_g)$
\State $\mathbf{a} \gets \mathbf{a}^0$;\quad $\mathbf{s} \gets \text{init\_states}(s_0,s_T)$
\For{$k=0$ \textbf{to} $K-1$}
\State Compute $\mathcal{L}$ as in \eqref{eq:loss_full}
\State \textbf{Joint step:} $(\mathbf{a},\mathbf{s}) \gets (\mathbf{a},\mathbf{s}) - (\eta_a\nabla_{\mathbf{a}}\mathcal{L},\ \eta_s\nabla_{\mathbf{s}}\mathcal{L})$
\State \textbf{Stochastic state:} $s_t \gets s_t + \sigma_{\text{state}}\xi_t$, $\xi_t\sim\mathcal{N}(0,I)$, for $t=1,\dots,T-1$
\State \textbf{Sync (periodic):} if $k \bmod K_{\mathrm{sync}}=0$, rollout from $s_0$: $s_{t+1}\gets F_\theta(s_t,a_t)$ for $t=0,\dots,T-1$,
\State \hspace{3.2em} then take a gradient-descent step $\mathbf{a}\gets \mathbf{a}-\eta_{\mathrm{sync}}\nabla_{\mathbf{a}}\|s_T-g\|_2^2$
\EndFor
\State \Return $\mathbf{a}^* \gets \mathbf{a}$
\end{algorithmic}
\end{algorithm}
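The loop above can be sketched end-to-end in NumPy for a known linear model, with analytic gradients substituted for autodiff. The dynamics, the per-step goal-shaping weight $\beta$, and all constants below are illustrative assumptions for a toy problem, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(9)
n, T, K, K_sync = 2, 10, 600, 25
A, B = np.eye(n), 0.5 * np.eye(n)              # assumed known linear model F(s,a) = As + Ba
s0, g = np.zeros(n), np.array([1.0, -1.0])     # start and (latent) goal
eta_a = eta_s = 0.1
beta, sigma_state, eta_sync = 0.5, 0.005, 0.05

F = lambda s, a: A @ s + B @ a

def rollout(a_seq):
    s = s0.copy()
    for a in a_seq:
        s = F(s, a)
    return s

a_seq = np.zeros((T, n))
s_seq = np.linspace(0.0, 1.0, T + 1)[:, None] * g[None, :]   # linear state init s_0 -> g

for k in range(K):
    # joint gradient step on dynamics + goal-shaping penalties (stop-gradient on s_t in F)
    r = s_seq[1:] - (s_seq[:-1] @ A.T + a_seq @ B.T)         # dynamics residuals r_t
    e = g - (s_seq[:-1] @ A.T + a_seq @ B.T)                 # goal-shaping residuals e_t
    a_seq += 2 * eta_a * (r + beta * e) @ B                  # -eta_a * grad_a
    s_seq[1:] -= 2 * eta_s * r                               # -eta_s * grad_s
    # stochastic exploration on the intermediate states
    s_seq[1:T] += sigma_state * rng.standard_normal((T - 1, n))
    # periodic sync: re-anchor states on a true rollout, one shooting GD step on the goal
    if k % K_sync == 0:
        s = s0.copy()
        for t in range(T):
            s_seq[t + 1] = s = F(s, a_seq[t])
        grad = 2 * (s_seq[T] - g)                            # d||s_T - g||^2 / d s_T
        for t in reversed(range(T)):
            a_seq[t] -= eta_sync * B.T @ grad
            grad = A.T @ grad                                # backprop through time

err0 = np.linalg.norm(rollout(np.zeros((T, n))) - g)
err = np.linalg.norm(rollout(a_seq) - g)
print(err0, err)
```

Rolling out the optimized actions through the true model reaches the goal to within the small noise floor, illustrating how the parallel joint step, the injected state noise, and the periodic sync interact.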

Figure 6 Example of converged plans for our planner on the Push-T environment at horizon 80, with goals depicted to the right of the dashed line. The parallel optimization is able to converge more consistently at longer horizons, and find the required non-greedy paths to the goal.

Figure 7 Example of a bad local minimum that arises when relaxing the dynamics constraint into a penalty function. The dynamics residual is nonzero for the middle transition, yet both the states and actions sit at a local minimum: there is no local direction in which either can move to further reduce the dynamics loss.
| Planner | H=40 Succ. (%) | H=40 Time | H=50 Succ. (%) | H=50 Time | H=60 Succ. (%) | H=60 Time | H=70 Succ. (%) | H=70 Time | H=80 Succ. (%) | H=80 Time |
|---|---|---|---|---|---|---|---|---|---|---|
| CEM | 61.4 | 35.3s | 30.2 | 96.2s | 7.2 | 83.1s | 7.8 | 156.1s | 2.8 | 132.2s |
| LatCo | 15.0 | 598.0s | 4.2 | 1114.7s | 2.0 | 231.5s | 0.0 | - | 0.0 | - |
| GD | 51.0 | 18.0s | 37.6 | 76.3s | 16.4 | 146.5s | 12.0 | 103.1s | 6.4 | 161.3s |
| GRASP (Ours) | 59.0 | 8.5s | 43.4 | 15.2s | 26.2 | 49.1s | 16.0 | 79.9s | 10.4 | 58.9s |
| Setting | Accuracy (%) | Time (s) |
|---|---|---|
| Sync Steps | | |
| No Sync | 1.6 | 4.5 |
| 10 | 48.0 | 11.2 |
| 25 (ours) | 59.0 | 8.5 |
| 50 | 58.0 | 9.8 |
| Noise Level | | |
| σ = 0.0 | 54.8 | 10.4 |
| σ = 0.5 (ours) | 59.0 | 8.5 |
| σ = 1.0 | 50.6 | 8.4 |
| Detach ∇_s F(s, a) | | |
| Stop (ours) | 59.0 | 8.5 |
| Flow | 46.6 | 10.3 |
| Planner | Push-T H=10 | Push-T H=20 | Push-T H=30 | PointMaze H=10 | PointMaze H=20 | PointMaze H=30 | WallSingle H=10 | WallSingle H=20 | WallSingle H=30 |
|---|---|---|---|---|---|---|---|---|---|
| CEM | 100.0 | 96.0 | 76.0 | 100.0 | 100.0 | 100.0 | 98.0 | 100.0 | 100.0 |
| LatCo | 67.4 | 55.2 | 35.4 | 100.0 | 99.8 | 99.0 | 91.4 | 81.6 | 62.8 |
| GD | 99.2 | 88.4 | 69.0 | 94.6 | 91.8 | 94.6 | 91.8 | 92.8 | 91.6 |
| GRASP (Ours) | 100.0 | 89.2 | 75.2 | 100.0 | 100.0 | 99.8 | 95.2 | 91.8 | 95.0 |
| Planner | Push-T H=10 | Push-T H=20 | Push-T H=30 | PointMaze H=10 | PointMaze H=20 | PointMaze H=30 | WallSingle H=10 | WallSingle H=20 | WallSingle H=30 |
|---|---|---|---|---|---|---|---|---|---|
| CEM | 1.3 | 7.5 | 23.6 | 0.7 | 2.0 | 6.3 | 0.9 | 1.5 | 2.1 |
| LatCo | 39.9 | 437.2 | 1800.0 | 1.6 | 13.7 | 36.6 | 12.1 | 23.8 | 111.1 |
| GD | 0.5 | 2.4 | 25.3 | 0.3 | 0.6 | 1.2 | 0.4 | 0.5 | 0.7 |
| GRASP (Ours) | 1.5 | 2.1 | 9.1 | 0.7 | 1.8 | 2.1 | 1.6 | 1.8 | 2.1 |
| Setting | Accuracy (%) | Time (s) |
|---|---|---|
| Default Mode (objective.mode=default) | | |
| σ_s = 0.0 | 53.0 ± 4.4 | 864 ± 77 |
| σ_s = 0.5 | 50.6 ± 4.4 | 914 ± 78 |
| σ_s = 1.0 | 37.2 ± 4.3 | 1149 ± 75 |
| σ_a = 0.0 | 47.6 ± 4.4 | 1056 ± 73 |
| σ_a = 0.5 | 53.0 ± 4.4 | 864 ± 77 |
| σ_a = 1.0 | 33.4 ± 4.2 | 1208 ± 73 |
| σ_s = 0.5, σ_a = 0.5 | 53.8 ± 4.4 | 854 ± 78 |
| Mode All (objective.mode=all) | | |
| σ_s = 0.0 | 54.0 ± 4.4 | 865 ± 77 |
| σ_s = 0.5 | 55.8 ± 4.4 | 838 ± 77 |
| σ_s = 1.0 | 41.0 ± 4.4 | 1111 ± 75 |
| σ_a = 0.0 | 46.6 ± 4.4 | 1067 ± 73 |
| σ_a = 0.5 | 52.8 ± 4.4 | 873 ± 77 |
| σ_a = 1.0 | 52.4 ± 4.4 | 878 ± 77 |
| σ_s = 0.5, σ_a = 0.5 | 55.8 ± 4.4 | 838 ± 77 |
References
[jerez2011condensed] Jerez, Juan L, Kerrigan, Eric C, Constantinides, George A. (2011). A condensed and sparse QP formulation for predictive control. 2011 50th IEEE Conference on Decision and Control and European Control Conference.
[ding2024understanding] Ding, Jingtao, Zhang, Yunke, Shang, Yu, Zhang, Yuheng, Zong, Zefang, Feng, Jie, Yuan, Yuan, Su, Hongyuan, Li, Nian, Sukiennik, Nicholas, others. (2024). Understanding world or predicting future? a comprehensive survey of world models. ACM Computing Surveys.
[bar2025navigation] Bar, Amir, Zhou, Gaoyue, Tran, Danny, Darrell, Trevor, LeCun, Yann. (2025). Navigation world models. Proceedings of the Computer Vision and Pattern Recognition Conference.
[hafner2025mastering] Hafner, Danijar, Pasukonis, Jurgis, Ba, Jimmy, Lillicrap, Timothy. (2025). Mastering diverse control tasks through world models. Nature.
[pinneri2021sample] Pinneri, Cristina, Sawant, Shambhuraj, Blaes, Sebastian, Achterhold, Jan, Stueckler, Joerg, Rolinek, Michal, Martius, Georg. (2021). Sample-efficient cross-entropy method for real-time planning. Conference on Robot Learning.
[lai2022parallelised] Lai, Tin, Zhi, Weiming, Hermans, Tucker, Ramos, Fabio. (2022). Parallelised diffeomorphic sampling-based motion planning. Conference on robot learning.
[rubinstein2004cross] Rubinstein, Reuven Y, Kroese, Dirk P. (2004). The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning.
[bharadhwaj2020model] Bharadhwaj, Homanga, Xie, Kevin, Shkurti, Florian. (2020). Model-predictive control via cross-entropy and gradient-based optimization. Learning for Dynamics and Control.
[huang2021cem] Huang, Kevin, Lale, Sahin, Rosolia, Ugo, Shi, Yuanyuan, Anandkumar, Anima. (2021). Cem-gd: Cross-entropy method with gradient descent planner for model-based reinforcement learning. arXiv preprint arXiv:2112.07746.
[jyothir2023gradient] Jyothir, SV, Jalagam, Siddhartha, LeCun, Yann, Sobal, Vlad. (2023). Gradient-based planning with world models. arXiv preprint arXiv:2312.17227.
[thrun1990planning] Thrun, Sebastian, Möller, Knut. (1990). Planning with an adaptive world model. Advances in neural information processing systems.
[guhathakurta2022fast] Guhathakurta, Dipanwita, Rastgar, Fatemeh, Sharma, M Aditya, Krishna, K Madhava, Singh, Arun Kumar. (2022). Fast joint multi-robot trajectory optimization by GPU accelerated batch solution of distributed sub-problems. Frontiers in Robotics and AI.
[tamimi2009nonlinear] Tamimi, Jasem, Li, Pu. (2009). Nonlinear model predictive control using multiple shooting combined with collocation on finite elements. IFAC Proceedings Volumes.
[diedam2018global] Diedam, H, Sager, Sebastian. (2018). Global optimal control with the direct multiple shooting method. Optimal Control Applications and Methods.
[bordalba2022direct] Bordalba, Ricard, Schoels, Tobias, Ros, Lluís. (2022). Direct collocation methods for trajectory optimization in constrained robotic systems. IEEE Transactions on Robotics.
[nie2025reliable] Nie, Yuanbo, Kerrigan, Eric C. (2025). Reliable Solution to Dynamic Optimization Problems using Integrated Residual Regularized Direct Collocation. arXiv preprint arXiv:2503.09123.
[boyd2011distributed] Boyd, Stephen, Parikh, Neal, Chu, Eric, Peleato, Borja, Eckstein, Jonathan, others. (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning.
[stellato2020osqp] Stellato, Bartolomeo, Banjac, Goran, Goulart, Paul, Bemporad, Alberto, Boyd, Stephen. (2020). OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation.
[chan2016plug] Chan, Stanley H, Wang, Xiran, Elgendy, Omar A. (2016). Plug-and-play ADMM for image restoration: Fixed-point convergence and applications. IEEE Transactions on Computational Imaging.
[rostami2017admm] Rostami, Ramin, Costantini, Giuliano, Gühmann, Clemens. (2017). ADMM-based distributed model predictive control: Primal and dual approaches. 2017 IEEE 56th Annual Conference on Decision and Control (CDC).
[tang2019distributed] Tang, Wentao, Daoutidis, Prodromos. (2019). Distributed nonlinear model predictive control through accelerated parallel ADMM. 2019 American Control Conference (ACC).
[foret2020sharpness] Foret, Pierre, Kleiner, Ariel, Mobahi, Hossein, Neyshabur, Behnam. (2020). Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412.
[robbins1951stochastic] Robbins, Herbert, Monro, Sutton. (1951). A stochastic approximation method. The annals of mathematical statistics.
[welling2011bayesian] Welling, Max, Teh, Yee W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th international conference on machine learning (ICML-11).
[xu2018global] Xu, Pan, Chen, Jinghui, Zou, Difan, Gu, Quanquan. (2018). Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems.
[bras2023langevin] Bras, Pierre. (2023). Langevin algorithms for very deep neural networks with application to image classification. Procedia Computer Science.
[baldassarre2025back] Baldassarre, Federico, Szafraniec, Marc, Terver, Basile, Khalidov, Vasil, Massa, Francisco, LeCun, Yann, Labatut, Patrick, Seitzer, Maximilian, Bojanowski, Piotr. (2025). Back to the Features: DINO as a Foundation for Video World Models. arXiv preprint arXiv:2507.19468.
[assran2025v] Assran, Mido, Bardes, Adrien, Fan, David, Garrido, Quentin, Howes, Russell, Muckley, Matthew, Rizvi, Ammar, Roberts, Claire, Sinha, Koustuv, Zholus, Artem, others. (2025). V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985.
[bock1984multiple] Bock, Hans Georg, Plitt, Karl-Josef. (1984). A multiple shooting algorithm for direct solution of optimal control problems. IFAC Proceedings Volumes.
[piovesan2009randomized] Piovesan, Jorge L, Tanner, Herbert G. (2009). Randomized model predictive control for robot navigation. 2009 IEEE International Conference on Robotics and Automation.
[bengio1994learning] Bengio, Yoshua, Simard, Patrice, Frasconi, Paolo. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks.
[werbos2002backpropagation] Werbos, Paul J. (2002). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE.
[cui2015convergence] Cui, Ying, Li, Xudong, Sun, Defeng, Toh, Kim-Chuan. (2015). On the convergence properties of a majorized ADMM for linearly constrained convex optimization problems with coupled objective functions. arXiv preprint arXiv:1502.00098.
[anderson2007optimal] Anderson, Brian DO, Moore, John B. (2007). Optimal control: linear quadratic methods.
[glowinski1975approximation] Glowinski, Roland, Marroco, Americo. (1975). Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Revue française d'automatique, informatique, recherche opérationnelle. Analyse numérique.
[gabay1976dual] Gabay, Daniel, Mercier, Bertrand. (1976). A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & mathematics with applications.
[williams2016aggressive] Williams, Grady, Drews, Paul, Goldfain, Brian, Rehg, James M, Theodorou, Evangelos A. (2016). Aggressive driving with model predictive path integral control. 2016 IEEE international conference on robotics and automation (ICRA).
[rybkin2021model] Rybkin, Oleh, Zhu, Chuning, Nagabandi, Anusha, Daniilidis, Kostas, Mordatch, Igor, Levine, Sergey. (2021). Model-based reinforcement learning via latent-space collocation. International Conference on Machine Learning.
[he2000alternating] He, Bing-Sheng, Yang, Hai, Wang, SL. (2000). Alternating direction method with self-adaptive penalty parameters for monotone variational inequalities. Journal of Optimization Theory and applications.
[parthasarathy2025closing] Parthasarathy, Arjun, Kalra, Nimit, Agrawal, Rohun, LeCun, Yann, Bounou, Oumayma, Izmailov, Pavel, Goldblum, Micah. (2025). Closing the Train-Test Gap in World Models for Gradient-Based Planning. arXiv preprint arXiv:2512.09929.
[zhou2024dino] Zhou, Gaoyue, Pan, Hengkai, LeCun, Yann, Pinto, Lerrel. (2024). Dino-wm: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983.
[szegedy2013intriguing] Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, Fergus, Rob. (2013). Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
[shamir2021dimpled] Shamir, Adi, Melamed, Odelia, BenShmuel, Oriel. (2021). The dimpled manifold model of adversarial examples in machine learning. arXiv preprint arXiv:2106.10151.
[chaudhari2019entropy] Chaudhari, Pratik, Choromanska, Anna, Soatto, Stefano, LeCun, Yann, Baldassi, Carlo, Borgs, Christian, Chayes, Jennifer, Sagun, Levent, Zecchina, Riccardo. (2019). Entropy-sgd: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment.
[ascher1995numerical] Ascher, Uri M, Mattheij, Robert MM, Russell, Robert D. (1995). Numerical solution of boundary value problems for ordinary differential equations.
[gelfand1991recursive] Gelfand, Saul B, Mitter, Sanjoy K. (1991). Recursive stochastic algorithms for global optimization in R^d. SIAM Journal on Control and Optimization.
[fu2020d4rl] Fu, Justin, Kumar, Aviral, Nachum, Ofir, Tucker, George, Levine, Sergey. (2020). D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219.
[tassa2018deepmind] Tassa, Yuval, Doron, Yotam, Muldal, Alistair, Erez, Tom, Li, Yazhe, Casas, Diego de Las, Budden, David, Abdolmaleki, Abbas, Merel, Josh, Lefrancq, Andrew, others. (2018). Deepmind control suite. arXiv preprint arXiv:1801.00690.
[valevski2024diffusion] Valevski, Dani, Leviathan, Yaniv, Arar, Moab, Fruchter, Shlomi. (2024). Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837.
[genie3] Ball, Philip J., Bauer, Jakob, Belletti, Frank, Brownfield, Bethanie, Ephrat, Ariel, Fruchter, Shlomi, Gupta, Agrim, Holsheimer, Kristian, Holynski, Aleksander, Hron, Jiri, Kaplanis, Christos, Parker-Holder, Jack, others. (2025). Genie 3: A New Frontier for World Models.
[goswami2025world] Goswami, Raktim Gautam, Bar, Amir, Fan, David, Yang, Tsung-Yen, Zhou, Gaoyue, Krishnamurthy, Prashanth, Rabbat, Michael, Khorrami, Farshad, LeCun, Yann. (2025). World Models Can Leverage Human Videos for Dexterous Manipulation. arXiv preprint arXiv:2512.13644.
[von1993numerical] Von Stryk, Oskar. (1993). Numerical solution of optimal control problems by direct collocation. Optimal Control: Calculus of Variations, Optimal Control Theory and Numerical Methods.
[koju2025surgical] Koju, Saurabh, Bastola, Saurav, Shrestha, Prashant, Amgain, Sanskar, Shrestha, Yash Raj, Poudel, Rudra PK, Bhattarai, Binod. (2025). Surgical vision world model. MICCAI Workshop on Data Engineering in Medical Imaging.
[guo2025ctrl] Guo, Yanjiang, Shi, Lucy Xiaoyang, Chen, Jianyu, Finn, Chelsea. (2025). Ctrl-world: A controllable generative world model for robot manipulation. arXiv preprint arXiv:2510.10125.