
Navigation World Models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, Yann LeCun

Abstract

Navigation is a fundamental skill of agents with visual-motor capabilities. We introduce a Navigation World Model (NWM), a controllable video generation model that predicts future visual observations based on past observations and navigation actions. To capture complex environment dynamics, NWM employs a Conditional Diffusion Transformer (CDiT), trained on a diverse collection of egocentric videos of both human and robotic agents, and scaled up to 1 billion parameters. In familiar environments, NWM can plan navigation trajectories by simulating them and evaluating whether they achieve the desired goal. Unlike supervised navigation policies with fixed behavior, NWM can dynamically incorporate constraints during planning. Experiments demonstrate its effectiveness in planning trajectories from scratch or by ranking trajectories sampled from an external policy. Furthermore, NWM leverages its learned visual priors to imagine trajectories in unfamiliar environments from a single input image, making it a flexible and powerful tool for next-generation navigation systems. Project page: https://amirbar.net/nwm

Amir Bar 1,2, Gaoyue Zhou 2, Danny Tran 3, Trevor Darrell 3, Yann LeCun 1,2

2 New York University, 3 Berkeley AI Research

Figure 1. We train a Navigation World Model (NWM) from video footage of robots and their associated navigation actions (a). After training, NWM can evaluate trajectories by synthesizing their videos and scoring the final frame's similarity with the goal (b). We use NWM to plan from scratch or to rank expert navigation trajectories, improving downstream visual navigation performance. In unknown environments, NWM can simulate imagined trajectories from a single image (c). In all examples above, the input to the model is the first image and actions; the model then auto-regressively synthesizes future observations.


Introduction

Navigation is a fundamental skill for any organism with vision, playing a crucial role in survival by allowing agents to locate food and shelter and to avoid predators. To successfully navigate environments, smart agents rely primarily on vision, which allows them to construct representations of their surroundings, assess distances, and capture landmarks, all useful for planning a navigation route.

When human agents plan, they often imagine their future trajectories, considering constraints and counterfactuals. In contrast, current state-of-the-art robotic navigation policies [53, 55] are 'hard-coded': after training, new constraints cannot be easily introduced (e.g., 'no left turns'). Another limitation of current supervised visual navigation models is that they cannot dynamically allocate more computational resources to hard problems. We aim to design a new model that mitigates these issues.

In this work, we propose a Navigation World Model (NWM), trained to predict the future representation of a video frame based on past frame representation(s) and action(s) (see Figure 1(a)). NWM is trained on video footage and navigation actions collected from various robotic agents. After training, NWM is used to plan novel navigation trajectories by simulating potential navigation plans and verifying whether they reach a target goal (see Figure 1(b)). To evaluate its navigation skills, we test NWM in known environments, assessing its ability to plan novel trajectories either independently or by ranking an external navigation policy. In the planning setup, we use NWM in a Model Predictive Control (MPC) framework, optimizing the action sequence that enables NWM to reach a target goal. In the ranking setup, we assume access to an existing navigation policy, such as NoMaD [55], which allows us to sample trajectories, simulate them using NWM, and select the best ones. NWM achieves state-of-the-art standalone performance and competitive results when combined with existing methods.

NWM is conceptually similar to recent diffusion-based world models for offline model-based reinforcement learning, such as DIAMOND [1] and GameNGen [66]. However, unlike these models, NWM is trained across a wide range of environments and embodiments, leveraging the diversity of navigation data from robotic and human agents. This allows us to train a large diffusion transformer model capable of scaling effectively with model size and data to adapt to multiple environments. Our approach also shares similarities with Novel View Synthesis (NVS) methods like NeRF [40], Zero-1-2-3 [38], and GDC [67], from which we draw inspiration. However, unlike NVS approaches, our goal is to train a single model for navigation across diverse environments and model temporal dynamics from natural videos, without relying on 3D priors.

To learn an NWM, we propose a novel Conditional Diffusion Transformer (CDiT), trained to predict the next image state given past image states and actions as context. Unlike a DiT [44], CDiT's computational complexity is linear in the number of context frames, and it scales favorably for models trained up to 1B parameters across diverse environments and embodiments, requiring 4× fewer FLOPs than a standard DiT while achieving better future prediction results.

In unknown environments, our results show that NWM benefits from training on unlabeled, action- and reward-free video data from Ego4D. Qualitatively, we observe improved video prediction and generation performance on single images (see Figure 1(c)). Quantitatively, with additional unlabeled data, NWM produces more accurate predictions when evaluated on the held-out Stanford Go [24] dataset.

Our contributions are as follows. We introduce a Navigation World Model (NWM) and propose a novel Conditional Diffusion Transformer (CDiT), which scales efficiently up to 1B parameters with significantly reduced computational requirements compared to a standard DiT. We train CDiT on video footage and navigation actions from diverse robotic agents, enabling planning by simulating navigation plans independently or alongside external navigation policies, achieving state-of-the-art visual navigation performance. Finally, by training NWM on action- and reward-free video data, such as Ego4D, we demonstrate improved video prediction and generation performance in unseen environments.

Related Work

Goal-conditioned visual navigation is an important task in robotics, requiring both perception and planning skills [8, 13, 15, 41, 43, 51, 55]. Given context image(s) and an image specifying the navigation goal, goal-conditioned visual navigation models [51, 55] aim to generate a viable path towards the goal if the environment is known, or to explore it otherwise. Recent visual navigation methods like NoMaD [55] train a diffusion policy via behavior cloning and a temporal distance objective to follow goals in the conditional setting or to explore new environments in the unconditional setting. Previous approaches like Active Neural SLAM [8] used neural SLAM together with analytical planners to plan trajectories in the 3D environment, while other approaches like [9] learn policies via reinforcement learning. Here we show that world models can use exploratory data to plan or to improve existing navigation policies.

Unlike a policy, the goal of a world model [19] is to simulate the environment, e.g., given the current state and an action, to predict the next state and an associated reward. Previous works have shown that jointly learning a policy and a world model can improve sample efficiency on Atari [1, 20, 21], in simulated robotics environments [50], and even on real-world robots [71]. More recently, [22] proposed a single world model shared across tasks by introducing action and task embeddings, while [37, 73] proposed describing actions in language, and [6] proposed learning latent actions. World models were also explored in the context of game simulation: DIAMOND [1] and GameNGen [66] use diffusion models to learn game engines of computer games like Atari and Doom. Our work is inspired by these works, and we aim to learn a single, general diffusion video transformer that can be shared across many environments and different embodiments for navigation.

In computer vision, video generation has been a long-standing challenge [3, 4, 17, 29, 32, 62, 74]. Most recently, there has been tremendous progress in text-to-video synthesis with methods like Sora [5] and MovieGen [45]. Past works proposed controlling video synthesis via structured action-object class categories [61] or Action Graphs [2]. Video generation models were previously used in reinforcement learning as rewards [10], as pretraining methods [59], for simulating and planning manipulation actions [11, 35], and for generating paths in indoor environments [26, 31]. Interestingly, diffusion models [28, 54] are useful for video tasks like generation [69] and prediction [36], but also for view synthesis [7, 46, 63]. Differently, we use a conditional diffusion transformer to simulate trajectories for planning, without explicit 3D representations or priors.

Formulation

Next, we describe our NWM formulation. Intuitively, an NWM is a model that receives the current state of the world (e.g., an image observation) and a navigation action describing where to move and how to rotate. The model then produces the next state of the world from the agent's point of view.

We are given an egocentric video dataset with agent navigation actions D = {(x_0, a_0, ..., x_T, a_T)}_{i=1}^{n}, where x_i ∈ ℝ^{H×W×3} is an image and a_i = (u, ϕ) is a navigation command given by a translation parameter u ∈ ℝ², which controls the change in forward/backward and right/left motion, and ϕ ∈ ℝ, which controls the change in yaw rotation angle.²

The navigation actions a_i can be fully observed (as in Habitat [49]), e.g., moving forward towards a wall will trigger a response from the environment based on physics, causing the agent to stay in place, whereas in other environments the navigation actions can be approximated based on the change in the agent's location.

² This can be naturally extended to three dimensions by having u ∈ ℝ³ and ϕ ∈ ℝ³ defining yaw, pitch, and roll. For simplicity, we assume navigation on a flat surface with fixed pitch and roll.

Our goal is to learn a world model F_θ, a stochastic mapping from previous latent observation(s) s_τ and action a_τ to the future latent state representation s_{τ+1}:

$$
s_{\tau+1} \sim F_\theta\left(s_{\tau+1} \mid \mathbf{s}_\tau, a_\tau\right) \tag{1}
$$

where \mathbf{s}_τ = (s_τ, ..., s_{τ−m}) are the past m visual observations encoded via a pretrained VAE [4]. Using a VAE has the benefit of working with compressed latents, while allowing us to decode predictions back to pixel space for visualization.

Due to the simplicity of this formulation, it can be naturally shared across environments and easily extended to more complex action spaces, such as controlling a robotic arm. Unlike [20], we aim to train a single world model across environments and embodiments, without using task or action embeddings as in [22].

The formulation in Equation 1 models actions but does not allow control over the temporal dynamics. We extend this formulation with a time shift input k ∈ [T_min, T_max], setting a_τ = (u, ϕ, k); a_τ now specifies the time change k, used to determine how many steps the model should move into the future (or past). Hence, given a current state s_τ, we can randomly choose a time shift k and use the corresponding time-shifted video frame as our next state s_{τ+1}. The navigation actions can then be approximated as a summation from time τ to m = τ + k − 1:

$$
u = \sum_{i=\tau}^{\tau+k-1} u_i, \qquad \phi = \sum_{i=\tau}^{\tau+k-1} \phi_i \tag{2}
$$

This formulation allows learning not only the navigation actions but also the environment's temporal dynamics. In practice, we allow time shifts of up to ±16 seconds.
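The time-shifted action aggregation above can be sketched as follows (a minimal illustration with hypothetical helper names; the paper does not publish this code): the per-step translations u_i and yaw changes ϕ_i inside a k-step window are summed into a single action tuple (u, ϕ, k).

```python
import numpy as np

def aggregate_action(translations, yaws, start, k):
    """Approximate the navigation action for a time shift of k steps
    by summing per-step translations u_i (2D) and yaw changes phi_i
    over frames [start, start + k - 1], as in Eq. 2."""
    u = np.sum(translations[start:start + k], axis=0)   # total forward/lateral motion
    phi = float(np.sum(yaws[start:start + k]))          # total yaw change
    return u, phi, k                                    # a_tau = (u, phi, k)

# Example: 3 unit steps forward, each turning slightly right.
trans = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
yaw = np.array([-0.1, -0.1, -0.1])
u, phi, k = aggregate_action(trans, yaw, start=0, k=3)
```
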

One challenge that may arise is the entanglement of actions and time. For example, if reaching a specific location always occurs at a particular time, the model may learn to rely solely on time and ignore the subsequent actions, or vice versa. In practice, the data may contain natural counterfactuals, such as reaching the same area at different times. To encourage these natural counterfactuals, we sample multiple goals for each state during training. We further explore this approach in Section 4.

Diffusion Transformer as World Model

As mentioned in the previous section, we design F θ as a stochastic mapping so it can simulate stochastic environments. This is achieved using a Conditional Diffusion Transformer (CDiT) model, described next.

Conditional Diffusion Transformer Architecture . The architecture is a temporally autoregressive transformer utilizing the efficient CDiT block (see Figure 2), applied N times over the input sequence of latents with input action conditioning.

CDiT enables time-efficient autoregressive modeling by constraining the attention in the first attention block to tokens from the target frame being denoised. To condition on tokens from past frames, we incorporate a cross-attention layer, so that every query token from the current target attends to tokens from past frames, which serve as keys and values. The cross-attention then contextualizes the representations through a skip connection.

To condition on the navigation action a ∈ ℝ³, we first map each scalar to ℝ^{d/3} by extracting sine-cosine features, applying a 2-layer MLP, and concatenating the results into a single vector ψ_a ∈ ℝ^d. We follow a similar process to map the time shift k ∈ ℝ to ψ_k ∈ ℝ^d and the diffusion timestep t ∈ ℝ to ψ_t ∈ ℝ^d. Finally, we sum all embeddings into a single conditioning vector:

$$
\xi = \psi_a + \psi_k + \psi_t \tag{3}
$$

ξ is then fed to an AdaLN [72] block to generate scale and shift coefficients that modulate the Layer Normalization [34] outputs, as well as the outputs of the attention layers. To train on unlabeled data, we simply omit explicit navigation actions when computing ξ (see Eq. 3).
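The conditioning pathway can be sketched as below. This is a simplified NumPy illustration under stated assumptions: the 2-layer MLPs are omitted, the sine-cosine featurization uses standard log-spaced frequencies (the paper does not specify its exact frequency schedule), and `d=12` is an illustrative toy dimension.

```python
import numpy as np

def sincos_features(x, dim):
    """Map a scalar to sine-cosine features of size `dim` (dim must be even),
    using log-spaced frequencies as in standard positional embeddings."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = x * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def condition_vector(action, timeshift, diffusion_t, d=12):
    """Build the conditioning vector xi = psi_a + psi_k + psi_t (Eq. 3).
    psi_a concatenates per-scalar features of (u_x, u_y, phi) into d dims;
    the 2-layer MLPs applied to each embedding are omitted for brevity."""
    u_x, u_y, phi = action
    psi_a = np.concatenate([sincos_features(s, d // 3) for s in (u_x, u_y, phi)])
    psi_k = sincos_features(timeshift, d)
    psi_t = sincos_features(diffusion_t, d)
    return psi_a + psi_k + psi_t

xi = condition_vector(action=(1.0, 0.0, 0.1), timeshift=2.0, diffusion_t=50.0)
```
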

An alternative approach is to simply use a DiT [44]; however, applying a DiT to the full input is computationally expensive. Denote by n the number of input tokens per frame, m the number of frames, and d the token dimension. The complexity of a scaled multi-head attention layer [68] is dominated by the attention term O(m²n²d), which is quadratic in the context length. In contrast, our CDiT block is dominated by the cross-attention layer's complexity O(mn²d), which is linear in the context, allowing us to use longer context sizes. We analyze these two design choices in Section 4. CDiT resembles the original Transformer block [68], without applying expensive self-attention over the context tokens.
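The asymptotic gap between the two designs can be made concrete with a back-of-the-envelope count of the dominant attention term (illustrative sizes only, not the paper's profiler numbers): full self-attention over all m·n tokens costs (mn)²·d, while attending from only the n target-frame tokens to the m·n context tokens costs m·n²·d, a factor-of-m saving.

```python
def attention_flops(m, n, d):
    """Dominant attention-score term for full self-attention over m frames
    of n tokens each (DiT-style): (m*n)^2 * d multiply-accumulates."""
    return (m * n) ** 2 * d

def cross_attention_flops(m, n, d):
    """Dominant term when only the n target-frame tokens attend to the
    m*n context tokens (CDiT-style cross-attention): m * n^2 * d."""
    return m * (n ** 2) * d

m, n, d = 4, 256, 1024  # 4 context frames, 256 tokens/frame (illustrative)
ratio = attention_flops(m, n, d) / cross_attention_flops(m, n, d)  # equals m
```
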

Diffusion Training . In the forward process, noise is added to the target state s_{τ+1} according to a randomly chosen timestep t ∈ {1, ..., T}. The noisy state is defined as s^{(t)}_{τ+1} = √(α_t) s_{τ+1} + √(1 − α_t) ε, where ε ∼ N(0, I) is Gaussian noise and {α_t} is a noise schedule controlling the variance. As t increases, s^{(t)}_{τ+1} converges to pure noise. The reverse process attempts to recover the original state representation s_{τ+1} from the noisy version s^{(t)}_{τ+1}, conditioned on the context \mathbf{s}_τ, the current action a_τ, and the diffusion timestep t. We define F_θ(s_{τ+1} | \mathbf{s}_τ, a_τ, t) as the denoising neural network parameterized by θ. We follow the same noise schedule and hyperparameters as DiT [44].
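The forward-process noising step is a one-liner; the sketch below mirrors the formula above (a generic DDPM-style noising, with the schedule value α_t passed in directly rather than the specific DiT schedule).

```python
import numpy as np

def add_noise(s_next, alpha_t, rng):
    """Forward-process noising of the target latent:
    s^(t) = sqrt(alpha_t) * s + sqrt(1 - alpha_t) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal(s_next.shape)
    s_noisy = np.sqrt(alpha_t) * s_next + np.sqrt(1.0 - alpha_t) * eps
    return s_noisy, eps

rng = np.random.default_rng(0)
s = np.ones((4, 4))                      # toy latent in place of s_{tau+1}
s_noisy, eps = add_noise(s, alpha_t=0.9, rng=rng)
```
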

Training Objective . The model is trained to minimize the mean-squared error between the clean and predicted targets, aiming to learn the denoising process:

$$
\mathcal{L}(\theta) = \mathbb{E}_{t,\,\epsilon}\left[\left\| F_\theta\!\left(s^{(t)}_{\tau+1} \mid \mathbf{s}_\tau, a_\tau, t\right) - s_{\tau+1} \right\|^2\right]
$$

Figure 2. Conditional Diffusion Transformer (CDiT) Block . The block's complexity is linear with the number of frames.


In this objective, the timestep t is sampled randomly to ensure that the model learns to denoise frames across varying levels of corruption. By minimizing this loss, the model learns to reconstruct s τ +1 from its noisy version s ( t ) τ +1 , conditioned on the context s τ and action a τ , thereby enabling the generation of realistic future frames. Following [44], we also predict the covariance matrix of the noise and supervise it with the variational lower bound loss L vlb [42].

Navigation Planning with World Models

We now describe how to use a trained NWM to plan navigation trajectories. Intuitively, if our world model is familiar with an environment, we can use it to simulate navigation trajectories and choose the ones that reach the goal. In unknown, out-of-distribution environments, long-term planning might rely on imagination.

Formally, given the latent encoding s_0 and a navigation target s*, we look for a sequence of actions (a_0, ..., a_{T−1}) that maximizes the likelihood of reaching s*. Let S(s_T, s*) denote the unnormalized score for reaching state s* with s_T, given the initial condition s_0, actions a = (a_0, ..., a_{T−1}), and states s = (s_1, ..., s_T) obtained by autoregressively rolling out the NWM: s ∼ F_θ(· | s_0, a).

We define the energy function E(s_0, a_0, ..., a_{T−1}, s_T) such that minimizing the energy corresponds to maximizing the unnormalized perceptual similarity score while following potential constraints on the states and actions:

$$
E\left(s_0, a_0, \ldots, a_{T-1}, s_T\right) = -S\left(s_T, s^*\right) + \sum_{\tau=0}^{T-1} \mathbb{I}\left[a_\tau \notin \mathcal{A}_{\text{valid}}\right] + \sum_{\tau=1}^{T} \mathbb{I}\left[s_\tau \notin \mathcal{S}_{\text{safe}}\right] \tag{4}
$$

The similarity is computed by decoding s* and s_T to pixels using a pretrained VAE decoder [4] and then measuring perceptual similarity [14, 75]. Constraints like 'never go left then right' can be encoded by constraining a_τ to lie in a valid action set 𝒜_valid, and 'never explore the edge of the cliff' by ensuring such states s_τ are in 𝒮_safe. 𝕀(·) denotes an indicator function that applies a large penalty if any action or state constraint is violated.
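The energy above can be sketched as plain Python (a toy illustration with hypothetical predicate names; the real score S decodes latents and computes perceptual similarity, which is abstracted here as a scalar `final_score`):

```python
def energy(final_score, actions, states, valid_action, safe_state, penalty=1e6):
    """Energy of a rollout: negated goal-similarity score, plus a large
    penalty for every action outside the valid set or state outside the
    safe set (the indicator-function terms)."""
    e = -final_score
    e += penalty * sum(not valid_action(a) for a in actions)
    e += penalty * sum(not safe_state(s) for s in states)
    return e

# Example: encode a "no left turns" constraint by forbidding negative yaw.
no_left = lambda a: a[2] >= 0.0          # a = (u_x, u_y, phi)
always_safe = lambda s: True
e_ok = energy(0.8, [(1.0, 0.0, 0.1)], ["s1"], no_left, always_safe)
e_bad = energy(0.9, [(1.0, 0.0, -0.1)], ["s1"], no_left, always_safe)
```

Despite its higher similarity score, the second rollout receives a large penalty for violating the constraint, so the constrained planner prefers the first.
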

The problem then reduces to finding the actions that minimize this energy function:

$$
a_0^*, \ldots, a_{T-1}^* = \arg\min_{a_0, \ldots, a_{T-1}} E\left(s_0, a_0, \ldots, a_{T-1}, s_T\right) \tag{5}
$$

This objective can be reformulated as a Model Predictive Control (MPC) problem, and we optimize it using the Cross-Entropy Method [48], a simple derivative-free, population-based optimization method recently used with world models for planning [77]. We include an overview of the Cross-Entropy Method and the full optimization details in Appendix 7.
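For intuition, a minimal Cross-Entropy Method loop looks like the sketch below, here minimizing a toy quadratic energy instead of NWM rollout energy (the population, elite, and iteration counts are illustrative, not the paper's settings):

```python
import numpy as np

def cross_entropy_method(energy_fn, dim, iters=20, pop=64, elite=8, seed=0):
    """Minimal Cross-Entropy Method: sample candidate action vectors from a
    Gaussian, keep the lowest-energy elites, and refit the Gaussian to them."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        samples = rng.normal(mu, sigma, size=(pop, dim))
        energies = np.array([energy_fn(a) for a in samples])
        elites = samples[np.argsort(energies)[:elite]]
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Toy energy: squared distance of the action vector from a target displacement.
target = np.array([2.0, -1.0])
best = cross_entropy_method(lambda a: np.sum((a - target) ** 2), dim=2)
```

In the actual planner, `energy_fn` would roll out the NWM over the candidate action sequence and evaluate Eq. 4 on the final state.
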

Ranking Navigation Trajectories . Assuming access to an existing navigation policy Π(a | s_0, s*), we can use an NWM to rank sampled trajectories. Here we use NoMaD [55], a state-of-the-art policy for robotic navigation. To rank trajectories, we draw multiple samples from Π and choose the one with the lowest energy, as in Eq. 5.

Experiments and Results

We describe the experimental setting, our design choices, and compare NWM to previous approaches. Additional results are included in the Supplementary Material.

Experimental Setting

Datasets. For all robotics datasets (SCAND [30], TartanDrive [60], RECON [52], and HuRoN [27]), we have access to the location and rotation of the robots, allowing us to infer relative actions with respect to the current location (see Eq. 2). To standardize the step size across agents, we divide the distance agents travel between frames by their average step size in meters, ensuring the action space is similar for different agents. We further filter out backward movements, following NoMaD [55]. Additionally, we train on unlabeled Ego4D [18] videos, where the only action we consider is the time shift. SCAND provides video footage of socially compliant navigation in diverse environments, TartanDrive focuses on off-road driving, RECON covers open-world navigation, and HuRoN captures social interactions. GO Stanford [24] serves as an unknown evaluation environment. For full details, see Appendix 8.1.
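The step-size standardization can be sketched as follows (a minimal interpretation of the sentence above, with a hypothetical helper name: per-frame displacement vectors are rescaled by the agent's average step length so different embodiments share a similar action scale):

```python
import numpy as np

def standardize_steps(displacements):
    """Normalize per-frame displacement vectors (meters) by the agent's
    average step length, so a fast robot and a slow one produce actions
    on a comparable scale."""
    lengths = np.linalg.norm(displacements, axis=1)
    avg = lengths.mean()
    return displacements / avg

# A robot taking 1 m steps: after normalization, the average step is 1 unit.
disp = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
norm_disp = standardize_steps(disp)
```
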

Evaluation Metrics. We evaluate predicted navigation trajectories using Absolute Trajectory Error (ATE) for accuracy and Relative Pose Error (RPE) for pose consistency [57]. To measure how semantically similar world model predictions are to ground-truth images, we apply LPIPS [76] and DreamSim [14], which measure perceptual similarity by comparing deep features, and PSNR for pixel-level quality. For image and video synthesis quality, we use FID [23] and FVD [64], which evaluate the generated data distribution. See Appendix 8.1 for more details.
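For reference, ATE is commonly computed as the root-mean-square Euclidean distance between corresponding predicted and ground-truth positions; the sketch below uses that common definition (trajectory alignment, which full ATE pipelines perform first, is omitted):

```python
import numpy as np

def absolute_trajectory_error(pred, gt):
    """ATE as the RMS Euclidean distance between predicted and
    ground-truth positions (alignment step omitted for brevity)."""
    return float(np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=1))))

gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.array([[0.0, 0.0], [1.0, 0.3], [2.0, 0.0]])  # drifts at step 2
ate = absolute_trajectory_error(pred, gt)
```
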

Baselines. We consider all the following baselines.

· DIAMOND [1] is a diffusion world model based on the UNet [47] architecture. We use DIAMOND in the offline reinforcement learning setting, following their public code. The diffusion model is trained to autoregressively predict at 56×56 resolution, alongside an upsampler to obtain 224×224 resolution predictions. To condition on continuous actions, we use a linear embedding layer.
· GNM [53] is a general goal-conditioned navigation policy trained on a mixture of robotic navigation datasets with a fully connected trajectory prediction network. GNM is trained on multiple datasets, including SCAND, TartanDrive, GO Stanford, and RECON.
· NoMaD [55] extends GNM using a diffusion policy to predict trajectories for robot exploration and visual navigation. NoMaD is trained on the same datasets used by GNM, plus HuRoN.

Implementation Details. In the default experimental setting, we use a CDiT-XL of 1B parameters with a context of 4 frames and a batch size of 1024; with 4 different navigation goals per sample, this leads to an effective batch size of 4096. We use the Stable Diffusion [4] VAE tokenizer, as in DiT [44], and the AdamW [39] optimizer with a learning rate of 8e-5. After training, we sample 5 times from each model to report mean and standard deviation. XL-sized models are trained on 8 H100 machines, each with 8 GPUs. Unless otherwise mentioned, we use the same settings as the DiT-*/2 models.

Ablations

Models are evaluated on single-step, 4-second future prediction on validation-set trajectories from the known environment RECON. We evaluate performance against the ground-truth frame by measuring LPIPS, DreamSim, and PSNR. We provide qualitative examples in Figure 3.

Model Size and CDiT . We compare CDiT (see Section 3.2) with a standard DiT in which all context tokens are fed as inputs. We hypothesize that for navigating known environments, model capacity matters most, and the results in Figure 5 indicate that CDiT indeed performs better with models of up to 1B parameters, while consuming less than 2× the FLOPs. Surprisingly, even with an equal number of parameters (e.g., CDiT-L compared to DiT-XL), CDiT is 4× faster and performs better.

Figure 3. Following trajectories in known environments. We include qualitative video generation comparisons of different models following ground-truth trajectories.

Table 1. Ablations of the number of predicted goals per sample, context size, and the use of action and time conditioning. We report prediction results 4 seconds into the future on RECON.

Figure 4. Comparing generation accuracy and quality of NWM and DIAMOND at 1 and 4 FPS as a function of time, up to 16 seconds of generated video, on the RECON dataset.

Number of Goals . We train models with a variable number of goal states given a fixed context, varying the number of goals from 1 to 4. Each goal is randomly chosen within a ±16-second window around the current state. The results in Table 1 indicate that using 4 goals significantly improves prediction performance on all metrics.

Context Size . We train models while varying the number of conditioning frames from 1 to 4 (see Table 1). Unsurprisingly, more context helps, and with a short context the model often 'loses track', leading to poor predictions.

Time and Action Conditioning . We train our model with both time and action conditioning and test how much each input contributes to prediction performance (results in Table 1). We find that running the model with time conditioning alone leads to poor performance, while omitting time conditioning also causes a small drop. This confirms that both inputs are beneficial to the model.

Figure 5. CDiT vs. DiT . Measuring how well models predict 4 seconds into the future on RECON. We report LPIPS as a function of TeraFLOPs; lower is better.

Video Prediction and Synthesis

We evaluate how well our model follows ground-truth actions and predicts future states. The model is conditioned on the first image and context frames, then autoregressively predicts the next state using ground-truth actions, feeding back each prediction. We compare predictions to ground-truth images at 1, 2, 4, 8, and 16 seconds, reporting FID and LPIPS on the RECON dataset. Figure 4 shows performance over time compared to DIAMOND at 4 FPS and 1 FPS: NWM predictions are significantly more accurate than DIAMOND's. Initially, the NWM 1 FPS variant performs better, but after 8 seconds its predictions degrade due to accumulated errors and loss of context, and the 4 FPS variant becomes superior. See qualitative examples in Figure 3.

Figure 7. Ranking an external policy's trajectories using NWM. To navigate from the observation image to the goal, we sample trajectories from NoMaD [55], simulate each of them using NWM, score them (see Equation 4), and rank them. With NWM we can accurately choose trajectories that are closer to the ground-truth trajectory.

Table 2. Goal-Conditioned Visual Navigation . ATE and RPE results on RECON, predicting 2-second trajectories. NWM improves on all metrics compared to the previous approaches NoMaD [55] and GNM [53].

Generation Quality. To evaluate video quality, we autoregressively predict videos at 4 FPS for 16 seconds, conditioned on ground-truth actions, and evaluate the generated videos using FVD, compared to DIAMOND [1]. The results in Figure 6 indicate that NWM outputs higher-quality videos.

Planning Using a Navigation World Model

Next, we describe experiments that measure how well we can navigate using an NWM. We include the full technical details of the experiments in Appendix 8.2.

Standalone Planning. We demonstrate that NWM can be used independently for goal-conditioned navigation. We condition it on past observations and a goal image, and use the Cross-Entropy Method to find a trajectory that minimizes the LPIPS distance between the last predicted image and the goal image (see Equation 5). To score an action sequence, we execute the NWM and measure LPIPS between the last state and the goal 3 times, averaging the results. We generate trajectories of length 8, with a temporal shift of k = 0.25. We evaluate performance in Table 2 and find that using an NWM for planning is competitive with state-of-the-art policies.

Planning with Constraints. World models allow planning under constraints, for example, requiring straight motion or a single turn. We show that NWM supports constraint-aware planning. In forward-first , the agent moves forward for 5 steps, then turns for 3. In left-right first , it turns for 3 steps before moving forward. In straight then forward , it moves straight for 3 steps, then forward. Constraints are enforced by zeroing out specific actions; e.g., in left-right first , forward motion is zeroed for the first 3 steps, and standalone planning optimizes the rest. We report the norm of the difference in final position and yaw relative to unconstrained planning. Results (Table 3) show NWM plans effectively under constraints, with only minor performance drops (see examples in Figure 9).

Table 3. Planning with Navigation Constraints. We present results for planning with NWM under three action constraints, reporting the differences in final position ( δu ) and yaw ( δϕ ) relative to the no-constraints baseline. All constraints are met, demonstrating that NWM can effectively adhere to them.

Using a Navigation World Model for Ranking . NWM can enhance existing navigation policies in goal-conditioned navigation. Conditioning NoMaD on past observations and a goal image, we sample n ∈ {16, 32} trajectories, each of length 8, and evaluate them by autoregressively following the actions using NWM. Finally, we rank each trajectory's final prediction by measuring LPIPS similarity with the goal image (see Figure 7). We report ATE and RPE on all in-domain datasets (Table 2) and find that NWM-based trajectory ranking improves navigation performance, with more samples yielding better results.

Generalization to Unknown Environments

Here we experiment with adding unlabeled data, and ask whether NWM can make predictions in new environments using imagination. In this experiment, we train a model on all in-domain datasets, as well as a susbet of unlabeled

Figure 8. Navigating Unknown Environments . NWM is conditioned on a single image, and autoregressively predicts the next states given the associated actions (marked in yellow). Click on the image to play the video clip in a browser .

Figure 8. Navigating Unknown Environments . NWM is conditioned on a single image, and autoregressively predicts the next states given the associated actions (marked in yellow). Click on the image to play the video clip in a browser .

Table 4. Training on additional unlabeled data improves performance on unseen environments. Reporting results on unknown environment (Go Stanford) and known one (RECON). Results reported by evaluating 4 seconds into the future.

Condition Image Goal Image

Figure 9. Planning with Constraints Using NWM. We visualize trajectories planned with NWM under the constraint of moving left or right first, followed by forward motion. The planning objective is to reach the same final position and orientation as the ground truth (GT) trajectory. Shown are the costs for proposed trajectories 0 , 1 , and 2 , with trajectory 0 (in green) achieving the lowest cost.


videos from Ego4D, where we only have access to the time-shift action. We train a CDiT-XL model and test it on the Go Stanford dataset as well as other random images. We report the results in Table 4, finding that training on unlabeled data leads to significantly better video predictions according to all metrics, including improved generation quality. We include qualitative examples in Figure 8. Compared to in-domain prediction (Figure 3), the model breaks down faster and, as expected, hallucinates paths as it generates traversals of imagined environments.

Limitations

We identify multiple limitations. First, when applied to out-of-distribution data, the model tends to slowly lose context and generate next states that resemble the training data, a phenomenon observed in image generation and known as mode collapse [56, 58]. We include such an example in Figure 10. Second, while the model can plan, it struggles to simulate temporal dynamics such as pedestrian motion (although in some cases it succeeds). Both limitations are likely to be mitigated with longer context and more

Figure 10. Limitations and Failure Cases. In unknown environments, a common failure case is mode collapse, where the model outputs slowly become more similar to data seen in training. Click on the image to play the video clip in a browser .


training data. Additionally, the model currently uses 3-DoF navigation actions; extending it to 6-DoF navigation, and potentially beyond (e.g., controlling the joints of a robotic arm), is possible as well, which we leave for future work.

Discussion

Our proposed Navigation World Model (NWM) offers a scalable, data-driven approach to learning world models for visual navigation. However, we do not yet know exactly which representations enable this, as NWM does not explicitly utilize a structured map of the environment. One possibility is that next-frame prediction from an egocentric point of view drives the emergence of allocentric representations [65]. Ultimately, our approach bridges learning from video, visual navigation, and model-based planning, and could open the door to self-supervised systems that not only perceive but can also plan to inform action.

Acknowledgments. We thank Noriaki Hirose for his help with the HuRoN dataset and for sharing his insights, and Manan Tomar, David Fan, Sonia Joseph, Angjoo Kanazawa, Ethan Weber, Nicolas Ballas, and the anonymous reviewers for their helpful discussions and feedback.


Experimental Setting

The structure of the Appendix is as follows: we start by describing how we plan navigation trajectories via Standalone Planning in Section 7, and then include more experiments and results in Section 8.

Standalone Planning Optimization

As described in Section 3.3, we use a pretrained NWM to standalone-plan goal-conditioned navigation trajectories by optimizing Eq. 5. Here, we provide additional details about the optimization using the Cross-Entropy Method [48] and the hyperparameters used. Full standalone navigation planning results are presented in Section 8.2.

We optimize trajectories using the Cross-Entropy Method, a gradient-free stochastic optimization technique for continuous problems that iteratively updates a probability distribution to increase the likelihood of generating better solutions. In the unconstrained standalone planning scenario, we assume the trajectory is a straight line and optimize only its endpoint, represented by three variables: a 2D translation u = (∆x, ∆y) and a yaw rotation ϕ. We then map this tuple into eight evenly spaced delta steps, applying the yaw rotation at the final step. The time interval between steps is fixed at k = 0.25 seconds. The optimization follows the standard Cross-Entropy Method loop: sample candidate endpoints from a Gaussian, simulate and score the resulting trajectories with NWM, and refit the Gaussian to the best-scoring candidates.

For simplicity, we run the optimization for a single iteration, which we found effective for short-horizon planning of two seconds, though further improvements are possible with more iterations. When navigation constraints are applied, parts of the trajectory are zeroed out to respect them. For instance, in the 'forward-first' scenario, the translation action is u = (∆x, 0) for the first five steps and u = (0, ∆y) for the last three steps.
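The endpoint parameterization and CEM loop described above can be sketched as follows. Only the (∆x, ∆y, ϕ) endpoint, the eight evenly spaced steps with yaw at the final step, and the Gaussian-sampling/elite-refit loop follow the text; `toy_cost`, the straight-line kinematics, and all numeric values here are illustrative assumptions standing in for NWM-based trajectory scoring.

```python
import numpy as np

def endpoint_to_steps(dx, dy, phi, n_steps=8):
    """Map an endpoint (translation dx, dy and yaw phi) to n_steps evenly
    spaced delta actions, applying the yaw rotation only at the final step."""
    steps = [(dx / n_steps, dy / n_steps, 0.0) for _ in range(n_steps)]
    steps[-1] = (dx / n_steps, dy / n_steps, phi)
    return steps

def cem_plan(cost_fn, mu, sigma2, n_samples=64, n_elite=8, n_iters=1, seed=0):
    """Cross-Entropy Method over endpoints: sample candidates from a Gaussian,
    score each simulated trajectory, refit the Gaussian to the elite set."""
    rng = np.random.default_rng(seed)
    mu, sigma2 = np.asarray(mu, float), np.asarray(sigma2, float)
    for _ in range(n_iters):
        cands = rng.normal(mu, np.sqrt(sigma2), size=(n_samples, 3))
        costs = np.array([cost_fn(endpoint_to_steps(*c)) for c in cands])
        elite = cands[np.argsort(costs)[:n_elite]]
        mu, sigma2 = elite.mean(0), elite.var(0)
    return mu

# Toy cost: distance of the simulated endpoint from a goal pose
# (the paper instead scores NWM rollouts against the goal image).
goal = np.array([1.0, 0.5, 0.0])
def toy_cost(steps):
    pose = np.sum(np.asarray(steps), axis=0)   # straight-line kinematics
    return float(np.linalg.norm(pose - goal))

best = cem_plan(toy_cost, mu=[0.0, 0.0, 0.0], sigma2=[0.5, 0.5, 0.1], n_iters=3)
```

Constraints can be imposed exactly as in the text, by zeroing out the relevant components of each sampled candidate before simulation.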

Experiments and Results

Experimental Study

We elaborate on the metrics and datasets used.

Evaluation Metrics. We describe the evaluation metrics used to assess predicted navigation trajectories and the quality of images generated by our NWM.

For visual navigation performance, Absolute Trajectory Error (ATE) measures the overall accuracy of trajectory estimation by computing the Euclidean distance between corresponding points in the estimated and groundtruth trajectories. Relative Pose Error (RPE) evaluates the consistency of consecutive poses by calculating the error in relative transformations between them [57].
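For concreteness, the two trajectory metrics can be sketched as below. This is a simplified 2D, translation-only version (full RPE is usually defined over relative SE(3) transforms); it is meant only to show what each metric measures.

```python
import numpy as np

def ate(pred_xy, gt_xy):
    """Absolute Trajectory Error: mean Euclidean distance between
    corresponding points of the estimated and ground-truth trajectories."""
    return float(np.mean(np.linalg.norm(pred_xy - gt_xy, axis=1)))

def rpe(pred_xy, gt_xy):
    """Relative Pose Error: mean error of the relative motion between
    consecutive poses (translation-only simplification)."""
    d_pred = np.diff(pred_xy, axis=0)
    d_gt = np.diff(gt_xy, axis=0)
    return float(np.mean(np.linalg.norm(d_pred - d_gt, axis=1)))

gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = gt + np.array([0.1, 0.0])   # constant offset from the ground truth
```

A constant offset inflates ATE but leaves the step-to-step motion intact, so RPE stays near zero; this is why the two metrics are reported together.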

To more rigorously assess the semantics in the world model outputs, we use Learned Perceptual Image Patch Similarity (LPIPS) and DreamSim [14], which evaluate perceptual similarity by comparing deep features from a neural network [75]. LPIPS, in particular, uses AlexNet [33] to focus on human perception of structural differences. Additionally, we use Peak Signal-to-Noise Ratio (PSNR) to quantify the pixel-level quality of generated images by measuring the ratio of maximum pixel value to error, with higher values indicating better quality.
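Of the image metrics above, PSNR is the only one with a closed form, sketched here for pixel values in [0, 1] (an assumption; images in other ranges need `max_val` adjusted):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in decibels: ratio of the maximum pixel
    value to the mean squared reconstruction error (higher is better)."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return float(10 * np.log10(max_val ** 2 / mse))

target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)   # uniform error of 0.1 per pixel
```

LPIPS and DreamSim, by contrast, require a pretrained network to extract the deep features being compared, so they are not reproduced here.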

To study image and video synthesis quality, we use Fréchet Inception Distance (FID) and Fréchet Video Distance (FVD), which compare the feature distributions of real and generated images or videos. Lower FID and FVD scores indicate higher visual quality [23, 64].

Datasets . For all robotics datasets, we have access to the robots' location and rotation, which we use to infer actions as deltas in location and rotation. Following NoMaD [55], we remove all backward movement, which can be jittery, thereby splitting the data into forward-walking segments for SCAND [30], TartanDrive [60], RECON [52], and HuRoN [27]. We also utilize unlabeled Ego4D videos, where we only use time shift as the action. Next, we describe each individual dataset.
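The action-inference and segment-splitting step above can be sketched as follows. The pose-delta computation follows the text; the backward-motion filter here (projecting each translation onto the current heading) is an illustrative heuristic, not necessarily the exact filter used by NoMaD or the authors.

```python
import numpy as np

def poses_to_actions(xy, yaw):
    """Infer actions as deltas in location and heading between consecutive
    poses; yaw deltas are wrapped to (-pi, pi]."""
    du = np.diff(xy, axis=0)
    dphi = (np.diff(yaw) + np.pi) % (2 * np.pi) - np.pi
    return du, dphi

def forward_segments(du, yaw, min_len=2):
    """Split a trajectory into forward-walking segments by dropping steps
    whose translation points against the current heading (backward motion)."""
    heading = np.stack([np.cos(yaw[:-1]), np.sin(yaw[:-1])], axis=1)
    fwd = np.sum(du * heading, axis=1) > 0   # positive projection = forward
    segments, cur = [], []
    for i, ok in enumerate(fwd):
        if ok:
            cur.append(i)
        elif len(cur) >= min_len:
            segments.append(cur)
            cur = []
        else:
            cur = []
    if len(cur) >= min_len:
        segments.append(cur)
    return segments

# A robot facing +x that steps forward twice, jitters backward once, then
# moves forward again yields two clean forward segments.
xy = np.array([[0, 0], [1, 0], [2, 0], [1.5, 0], [2.5, 0], [3.5, 0]], float)
yaw = np.zeros(6)
du, dphi = poses_to_actions(xy, yaw)
segments = forward_segments(du, yaw)
```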

Table 5. Training on additional unlabeled data improves performance on unseen environments. Reporting LPIPS on the unknown environment (Go Stanford) and the known environments (RECON, HuRoN, SCAND, TartanDrive), evaluated 4 seconds into the future.

training and 121 video segments for testing. Used for training and evaluation.

Visual Navigation Evaluation Set. Our main finding when constructing visual navigation evaluation sets is that forward motion is highly prevalent, and if not carefully accounted for, it can dominate the evaluation data. To create diverse evaluation sets, we rank potential evaluation trajectories based on how well they can be predicted by simply moving forward. For each dataset, we select the 100 examples that are least predictable by this heuristic and use them for evaluation.
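The selection procedure above can be sketched as follows. This assumes trajectories are expressed in an agent-centric frame where "forward" is the +x axis; the constant-speed forward baseline and the error measure are illustrative choices, not necessarily the authors' exact heuristic.

```python
import numpy as np

def least_forward_predictable(trajectories, k=100):
    """Score each candidate trajectory by how poorly a forward-only,
    constant-speed heuristic predicts it, and keep the k least
    predictable (highest-error) trajectories for evaluation."""
    errors = []
    for xy in trajectories:
        # Average step length sets the forward baseline's speed.
        step = np.mean(np.linalg.norm(np.diff(xy, axis=0), axis=1))
        n = len(xy)
        forward = np.stack([np.arange(n) * step, np.zeros(n)], axis=1)
        errors.append(np.mean(np.linalg.norm(xy - xy[0] - forward, axis=1)))
    order = np.argsort(errors)[::-1]   # most error (least predictable) first
    return order[:k]

straight = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])   # pure forward
turning = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])    # turns left
chosen = least_forward_predictable([straight, turning], k=1)
```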

Time Prediction Evaluation Set. Predicting the future frame after k seconds is more challenging than estimating a trajectory, as it requires both predicting the agent's trajectory and its orientation in pixel space. Therefore, we do not impose additional diversity constraints. For each dataset, we randomly select 500 test prediction examples.

Experiments and Results

Training on Additional Unlabeled Data. We include results for additional known environments in Table 5 and Figure 11. We find that in known environments, models trained exclusively with in-domain data tend to perform better, likely because they are better tailored to the in-domain distribution. The only exception is the SCAND dataset, where dynamic objects (e.g. humans walking) are present. In this case, adding unlabeled data may help improve performance by providing additional diverse examples.

Known Environments. We include additional visualization results of following trajectories using NWM in the known environments RECON (Figure 12), SCAND (Figure 13), HuRoN (Figure 14), and TartanDrive (Figure 15). Additionally, we include a full FVD comparison of DIAMOND and NWM in Table 6.

Table 6. Comparison of Video Synthesis Quality. 16-second videos generated at 4 FPS; reporting FVD (lower is better).

Planning (Ranking). Full goal-conditioned navigation results for all in-domain datasets are presented in Table 7. Compared to NoMaD, we observe consistent improvements when using NWM to select from a pool of 16 trajectories, with further gains when selecting from a larger pool of 32 .

Table 7. Goal Conditioned Visual Navigation . ATE and RPE results on all in-domain datasets, predicting trajectories of up to 2 seconds. NWM achieves improved results on all metrics compared to previous approaches NoMaD [55] and GNM [53].

For Tartan Drive, we note that the dataset is heavily dominated by forward motion, as reflected in the results compared to the 'Forward' baseline, a prediction model that always selects forward-only motion.

Standalone Planning. For standalone planning, we run the optimization procedure outlined in Section 7 for a single iteration and evaluate each trajectory 3 times. For all datasets, we initialize µ_∆y and µ_ϕ to 0, and σ²_∆y and σ²_ϕ to 0.1. We use a different (µ_∆x, σ²_∆x) for each dataset: (−0.1, 0.02) for RECON, (0.5, 0.07) for TartanDrive, (−0.25, 0.04) for SCAND, and (−0.33, 0.03) for HuRoN. We include the full standalone navigation planning results in Table 7, finding that planning in the standalone setting outperforms other approaches, in particular previous hard-coded policies.

Real-World Applicability . A key bottleneck in deploying NWM in real-world robotics is inference speed. We evaluate methods to improve NWM efficiency and measure their impact on runtime. We focus on using NWM with a generative policy (Section 3.3) to rank 32 four-second trajectories. Since trajectory evaluation is parallelizable, we analyze the runtime of simulating a single trajectory. We find that existing solutions can already enable real-time applications of NWM at 2–10 Hz (Table 8).

Table 8. Runtime (seconds) on an NVIDIA RTX 6000 Ada card.

Inference time can be accelerated by composing every adjacent pair of actions (via Eq. 2) and then simulating only 8 future states instead of 16 ('Time Skip'), which does not degrade navigation performance. Reducing the number of diffusion denoising steps from 250 to 6 via model distillation [70] further speeds up inference with minor visual quality loss.³ Taken together, these two ideas can enable NWM to run in real time. Quantization to 4 bits, which we have not explored, could yield a 4× speedup without a performance hit [12].

³ Using the distillation implementation for DiTs from https://github.com/hao-ai-lab/FastVideo
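The 'Time Skip' composition described above can be sketched as follows, using the composition rule of Eq. 2 (translations add; yaw rotations add modulo 2π). The (dx, dy, dphi) action layout is an assumption for illustration; note that summing translations ignores any frame rotation between the two steps, exactly as the equation does.

```python
import numpy as np

def compose_pair(a1, a2):
    """Compose two consecutive actions (dx, dy, dphi) into one, per Eq. 2:
    translations add, yaw rotations add modulo 2*pi."""
    dx = a1[0] + a2[0]
    dy = a1[1] + a2[1]
    dphi = (a1[2] + a2[2]) % (2 * np.pi)
    return (dx, dy, dphi)

def time_skip(actions):
    """Halve an action sequence by composing adjacent pairs, so the world
    model simulates half as many future states (e.g. 8 instead of 16)."""
    assert len(actions) % 2 == 0
    return [compose_pair(actions[i], actions[i + 1])
            for i in range(0, len(actions), 2)]

# 16 small steps become 8 double-length steps.
actions = [(0.1, 0.0, 0.05)] * 16
skipped = time_skip(actions)
```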

Table 9. Results in the unknown environment ('Go Stanford'). Reporting LPIPS for 4-second future prediction; lower is better.

Test-time adaptation . Test-time adaptation has been shown to improve visual navigation [13, 16]. What is the relation between planning with a world model and test-time adaptation? We hypothesize that the two ideas are orthogonal, and include test-time adaptation results. We consider a simplified adaptation approach, fine-tuning NWM for 2k steps on trajectories from an unknown environment. We show that this adaptation improves trajectory simulation in that environment (see 'ours + TTA' in Table 9), where we also include additional baselines and ablations.

Figure 11. Navigating Unknown Environments . NWM is conditioned on a single image, and autoregressively predicts the next states given the associated actions (marked in yellow) up to 4 seconds at 4 FPS. We plot the generated results after 1, 2, 3, and 4 seconds.


Figure 12. Video generation examples on RECON . NWM is conditioned on a single first image and a ground truth trajectory, and autoregressively predicts the next states up to 16 seconds at 4 FPS. We plot the generated results from 2 to 16 seconds, every 1 second.


Figure 13. Video generation examples on SCAND . NWM is conditioned on a single first image and a ground truth trajectory, and autoregressively predicts the next states up to 16 seconds at 4 FPS. We plot the generated results from 2 to 16 seconds, every 1 second.


Table 1 (ablations; 4 seconds into the future on RECON):

| ablation | setting | lpips ↓ | dreamsim ↓ | psnr ↑ |
|---|---|---|---|---|
| #goals | 1 | 0.312 ± 0.001 | 0.098 ± 0.001 | 15.044 ± 0.031 |
| #goals | 2 | 0.305 ± 0.000 | 0.096 ± 0.001 | 15.154 ± 0.017 |
| #goals | 4 | 0.296 ± 0.002 | 0.091 ± 0.001 | 15.331 ± 0.027 |
| #context | 1 | 0.304 ± 0.001 | 0.097 ± 0.001 | 15.223 ± 0.033 |
| #context | 2 | 0.302 ± 0.001 | 0.095 ± 0.000 | 15.274 ± 0.027 |
| #context | 4 | 0.296 ± 0.002 | 0.091 ± 0.001 | 15.331 ± 0.027 |
| conditioning | time only | 0.760 ± 0.001 | 0.783 ± 0.000 | 7.839 ± 0.017 |
| conditioning | action only | 0.318 ± 0.002 | 0.100 ± 0.000 | 14.858 ± 0.055 |
| conditioning | action + time | 0.295 ± 0.002 | 0.091 ± 0.001 | 15.343 ± 0.060 |

Video synthesis quality on RECON:

| model | FVD ↓ |
|---|---|
| DIAMOND | 762.734 ± 3.361 |
| NWM (ours) | 200.969 ± 5.629 |

Goal-conditioned visual navigation on RECON:

| model | ATE ↓ | RPE ↓ |
|---|---|---|
| GNM | 1.87 ± 0.00 | 0.73 ± 0.00 |
| NoMaD | 1.93 ± 0.04 | 0.52 ± 0.00 |
| NWM + NoMaD (×16) | 1.83 ± 0.03 | 0.50 ± 0.01 |
| NWM + NoMaD (×32) | 1.78 ± 0.03 | 0.48 ± 0.01 |
| NWM (planning) | 1.13 ± 0.02 | 0.35 ± 0.01 |

Planning with navigation constraints:

| constraint | Rel. δu ↓ | Rel. δϕ ↓ |
|---|---|---|
| forward first | +0.36 ± 0.01 | +0.61 ± 0.02 |
| left-right first | −0.03 ± 0.01 | +0.20 ± 0.01 |
| straight then forward | +0.08 ± 0.01 | +0.22 ± 0.01 |

Table 4 (training on additional unlabeled data; 4 seconds into the future):

| data | Go Stanford lpips ↓ | Go Stanford dreamsim ↓ | Go Stanford psnr ↑ | RECON lpips ↓ | RECON dreamsim ↓ | RECON psnr ↑ |
|---|---|---|---|---|---|---|
| in-domain data | 0.658 ± 0.002 | 0.478 ± 0.001 | 11.031 ± 0.036 | 0.295 ± 0.002 | 0.091 ± 0.001 | 15.343 ± 0.060 |
| + Ego4D (unlabeled) | 0.652 ± 0.003 | 0.464 ± 0.003 | 11.083 ± 0.064 | 0.368 ± 0.003 | 0.138 ± 0.002 | 14.072 ± 0.075 |

Table 5 (LPIPS ↓ across environments; Go Stanford is unknown, the rest are known):

| data | Go Stanford | RECON | HuRoN | SCAND | TartanDrive |
|---|---|---|---|---|---|
| in-domain data | 0.658 ± 0.002 | 0.295 ± 0.002 | 0.250 ± 0.003 | 0.403 ± 0.002 | 0.414 ± 0.001 |
| + Ego4D (unlabeled) | 0.652 ± 0.003 | 0.368 ± 0.003 | 0.377 ± 0.002 | 0.398 ± 0.001 | 0.430 ± 0.000 |

Table 6 (FVD ↓ per dataset):

| dataset | DIAMOND | NWM (ours) |
|---|---|---|
| RECON | 762.734 ± 3.361 | 200.969 ± 5.629 |
| HuRoN | 881.981 ± 11.601 | 276.932 ± 4.346 |
| TartanDrive | 2289.687 ± 6.991 | 494.247 ± 14.433 |
| SCAND | 1945.085 ± 8.449 | 401.699 ± 11.216 |

Table 7 (ATE ↓ / RPE ↓ per dataset):

| model | RECON ATE | RECON RPE | HuRoN ATE | HuRoN RPE | Tartan ATE | Tartan RPE | SCAND ATE | SCAND RPE |
|---|---|---|---|---|---|---|---|---|
| Forward | 1.92 ± 0.00 | 0.54 ± 0.00 | 4.14 ± 0.00 | 1.05 ± 0.00 | 5.75 ± 0.00 | 1.19 ± 0.00 | 2.97 ± 0.00 | 0.62 ± 0.00 |
| GNM | 1.87 ± 0.00 | 0.73 ± 0.00 | 3.71 ± 0.00 | 1.00 ± 0.00 | 6.65 ± 0.00 | 1.62 ± 0.00 | 2.12 ± 0.00 | 0.61 ± 0.00 |
| NoMaD | 1.95 ± 0.05 | 0.53 ± 0.01 | 3.73 ± 0.04 | 0.96 ± 0.01 | 6.32 ± 0.03 | 1.31 ± 0.01 | 2.24 ± 0.03 | 0.49 ± 0.01 |
| NWM + NoMaD (×16) | 1.88 ± 0.03 | 0.51 ± 0.01 | 3.73 ± 0.05 | 0.95 ± 0.01 | 6.26 ± 0.06 | 1.30 ± 0.01 | 2.18 ± 0.05 | 0.48 ± 0.01 |
| NWM + NoMaD (×32) | 1.79 ± 0.02 | 0.49 ± 0.00 | 3.68 ± 0.03 | 0.95 ± 0.01 | 6.25 ± 0.05 | 1.29 ± 0.01 | 2.19 ± 0.03 | 0.47 ± 0.01 |
| NWM (only) | 1.13 ± 0.02 | 0.35 ± 0.01 | 4.12 ± 0.03 | 0.96 ± 0.01 | 5.63 ± 0.06 | 1.18 ± 0.01 | 1.28 ± 0.02 | 0.33 ± 0.01 |

Table 8 (runtime in seconds):

| NWM | + Time Skip | + Distillation | + Quant. 4-bit |
|---|---|---|---|
| 30.3 ± 0.2 | 14.7 ± 0.1 | 0.4 ± 0.1 | 0.1 (est. [12]) |

Table 9 (LPIPS ↓ on Go Stanford):

| CDiT-L | context 2 | action only | goals 2 | ours | ours + TTA |
|---|---|---|---|---|---|
| 0.656 | 0.655 | 0.661 | 0.654 | 0.652 | 0.65 |




Figure: Following trajectories in known environments. We include qualitative video generation comparisons of different models following ground truth trajectories. Click on the image to play the video clip in a browser.

Table 1: Ablations of the number of predicted goals per sample, context size, and the use of action and time conditioning. We report prediction results 4 seconds into the future on RECON.

Figure: Ranking an external policy's trajectories using NWM. We use NoMaD (Sridhar et al., 2024) to sample multiple trajectory predictions to navigate from the observation image to the goal image. We then simulate each of these trajectories using NWM and rank them with LPIPS (Zhang et al., 2018b). With NWM we can accurately choose trajectories that are closer to the ground-truth trajectory. Click the image to play examples in a browser.

$$ \arg\min_{a_0, \dots, a_{T-1}} \mathbb{E}_{\mathbf{s}} \left[ \mathcal{E}(s_0, a_0, \dots, a_{T-1}, s_T) \right] \tag{eq:planning-loss} $$

$$ s_i = \text{enc}_{\theta}(x_{i}) \qquad s_{\tau+1} \sim F_{\theta}(s_{\tau+1} \mid \mathbf{s}_\tau, a_\tau) \tag{eq:basic} $$

$$ u_{\tau \rightarrow m} = \sum_{t=\tau}^{m} u_t \qquad \phi_{\tau \rightarrow m} = \Big( \sum_{t=\tau}^{m} \phi_{t} \Big) \bmod 2\pi \tag{eq:compose-actions} $$

$$ \xi = \psi_a + \psi_k + \psi_t \tag{eq:embedding} $$

$$ \mathcal{E}(s_0, a_0, \dots, a_{T-1}, s_T) = -\mathcal{S}(s_T, s^*) + \sum_{\tau=0}^{T-1} \mathbb{I}(a_\tau \notin \mathcal{A}_{\text{valid}}) + \sum_{\tau=0}^{T-1} \mathbb{I}(s_\tau \notin \mathcal{S}_{\text{safe}}) \tag{eq:score} $$


References

[ha2018world] Ha, David, Schmidhuber, Jürgen. (2018). World models. arXiv preprint arXiv:1803.10122.

[hansentd] Hansen, Nicklas, Su, Hao, Wang, Xiaolong. TD-MPC2: Scalable, Robust World Models for Continuous Control. The Twelfth International Conference on Learning Representations.

[alonso2024diffusionworldmodelingvisual] Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, François Fleuret. (2024). Diffusion for World Modeling: Visual Details Matter in Atari. Thirty-eighth Conference on Neural Information Processing Systems.

[seo2023masked] Seo, Younggyo, Hafner, Danijar, Liu, Hao, Liu, Fangchen, James, Stephen, Lee, Kimin, Abbeel, Pieter. (2023). Masked world models for visual control. Conference on Robot Learning.

[wu2023daydreamer] Wu, Philipp, Escontrela, Alejandro, Hafner, Danijar, Abbeel, Pieter, Goldberg, Ken. (2023). Daydreamer: World models for physical robot learning. Conference on robot learning.

[bear2023unifying] Bear, Daniel M, Feigelis, Kevin, Chen, Honglin, Lee, Wanhee, Venkatesh, Rahul, Kotar, Klemen, Durango, Alex, Yamins, Daniel LK. (2023). Unifying (machine) vision via counterfactual world modeling. arXiv preprint arXiv:2306.01828.

[valevski2024diffusion] Valevski, Dani, Leviathan, Yaniv, Arar, Moab, Fruchter, Shlomi. (2024). Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837.

[hafnermastering] Hafner, Danijar, Lillicrap, Timothy P, Norouzi, Mohammad, Ba, Jimmy. Mastering Atari with Discrete World Models. International Conference on Learning Representations.

[hafner2019learning] Hafner, Danijar, Lillicrap, Timothy, Fischer, Ian, Villegas, Ruben, Ha, David, Lee, Honglak, Davidson, James. (2019). Learning latent dynamics for planning from pixels. International conference on machine learning.

[lin2024learningmodelworldlanguage] Jessy Lin, Yuqing Du, Olivia Watkins, Danijar Hafner, Pieter Abbeel, Dan Klein, Anca Dragan. (2024). Learning to Model the World with Language.

[liu2024worldmodelmillionlengthvideo] Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel. (2024). World Model on Million-Length Video And Language With Blockwise RingAttention.

[escontrela2024video] Escontrela, Alejandro, Adeniji, Ademi, Yan, Wilson, Jain, Ajay, Peng, Xue Bin, Goldberg, Ken, Lee, Youngwoon, Hafner, Danijar, Abbeel, Pieter. (2024). Video prediction models as rewards for reinforcement learning. Advances in Neural Information Processing Systems.

[hafner2023mastering] Hafner, Danijar, Pasukonis, Jurgis, Ba, Jimmy, Lillicrap, Timothy. (2023). Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104.

[yan2023temporally] Yan, Wilson, Hafner, Danijar, James, Stephen, Abbeel, Pieter. (2023). Temporally consistent transformers for video generation. International Conference on Machine Learning.

[yan2022patch] Yan, Wilson, Okumura, Ryo, James, Stephen, Abbeel, Pieter. (2022). Patch-based Object-centric Transformers for Efficient Video Generation. arXiv preprint arXiv:2206.04003.

[yan2021videogpt] Yan, Wilson, Zhang, Yunzhi, Abbeel, Pieter, Srinivas, Aravind. (2021). Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157.

[mendonca2021discovering] Mendonca, Russell, Rybkin, Oleh, Daniilidis, Kostas, Hafner, Danijar, Pathak, Deepak. (2021). Discovering and achieving goals via world models. Advances in Neural Information Processing Systems.

[pmlr-v119-sekar20a] Sekar, Ramanan, Rybkin, Oleh, Daniilidis, Kostas, Abbeel, Pieter, Hafner, Danijar, Pathak, Deepak. (2020). Planning to Explore via Self-Supervised World Models. Proceedings of the 37th International Conference on Machine Learning.

[hafnerdream] Hafner, Danijar, Lillicrap, Timothy, Ba, Jimmy, Norouzi, Mohammad. Dream to Control: Learning Behaviors by Latent Imagination. International Conference on Learning Representations.

[kim2020learning] Kim, Seung Wook, Zhou, Yuhao, Philion, Jonah, Torralba, Antonio, Fidler, Sanja. (2020). Learning to simulate dynamic environments with gamegan. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[koh2021pathdreamer] Koh, Jing Yu, Lee, Honglak, Yang, Yinfei, Baldridge, Jason, Anderson, Peter. (2021). Pathdreamer: A world model for indoor navigation. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[Chan_2023_ICCV] Chan, Eric R., Nagano, Koki, Chan, Matthew A., Bergman, Alexander W., Park, Jeong Joon, Levy, Axel, Aittala, Miika, De Mello, Shalini, Karras, Tero, Wetzstein, Gordon. (2023). Generative Novel View Synthesis with 3D-Aware Diffusion Models. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

[ren2022look] Ren, Xuanchi, Wang, Xiaolong. (2022). Look outside the room: Synthesizing a consistent long-term 3d scene video from a single image. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[tseng2023consistent] Tseng, Hung-Yu, Li, Qinbo, Kim, Changil, Alsisan, Suhib, Huang, Jia-Bin, Kopf, Johannes. (2023). Consistent view synthesis with pose-guided diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[karnan2022socially] Karnan, Haresh, Nair, Anirudh, Xiao, Xuesu, Warnell, Garrett, Pirk, Sören, et al. (2022). Socially compliant navigation dataset (SCAND): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters.

[triest2022tartandrive] Triest, Samuel, Sivaprakasam, Matthew, Wang, Sean J, Wang, Wenshan, Johnson, Aaron M, Scherer, Sebastian. (2022). Tartandrive: A large-scale dataset for learning off-road dynamics models. 2022 International Conference on Robotics and Automation (ICRA).

[shah2021rapid] Shah, Dhruv, Eysenbach, Benjamin, Kahn, Gregory, Rhinehart, Nicholas, Levine, Sergey. (2021). Rapid exploration for open-world navigation with latent goal models. arXiv preprint arXiv:2104.05859.

[hirose2023sacson] Hirose, Noriaki, Shah, Dhruv, Sridhar, Ajay, Levine, Sergey. (2023). Sacson: Scalable autonomous control for social navigation. IEEE Robotics and Automation Letters.

[grauman2022ego4d] Grauman, Kristen, Westbury, Andrew, Byrne, Eugene, Chavis, Zachary, Furnari, Antonino, Girdhar, Rohit, Hamburger, Jackson, Jiang, Hao, Liu, Miao, Liu, Xingyu, others. (2022). Ego4d: Around the world in 3,000 hours of egocentric video. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[sridhar2024nomad] Sridhar, Ajay, Shah, Dhruv, Glossop, Catherine, Levine, Sergey. (2024). Nomad: Goal masked diffusion policies for navigation and exploration. 2024 IEEE International Conference on Robotics and Automation (ICRA).

[shahvint] Shah, Dhruv, Sridhar, Ajay, Dashora, Nitish, Stachowicz, Kyle, Black, Kevin, Hirose, Noriaki, Levine, Sergey. ViNT: A Foundation Model for Visual Navigation. 7th Annual Conference on Robot Learning.

[shah2023gnm] Shah, Dhruv, Sridhar, Ajay, Bhorkar, Arjun, Hirose, Noriaki, Levine, Sergey. (2023). Gnm: A general navigation model to drive any robot. 2023 IEEE International Conference on Robotics and Automation (ICRA).

[pathak2018zero] Pathak, Deepak, Mahmoudieh, Parsa, Luo, Guanghao, Agrawal, Pulkit, Chen, Dian, Shentu, Yide, Shelhamer, Evan, Malik, Jitendra, Efros, Alexei A, Darrell, Trevor. (2018). Zero-shot visual imitation. Proceedings of the IEEE conference on computer vision and pattern recognition workshops.

[mirowski2022learning] Mirowski, Piotr, Pascanu, Razvan, Viola, Fabio, Soyer, Hubert, Ballard, Andy, Banino, Andrea, Denil, Misha, Goroshin, Ross, Sifre, Laurent, Kavukcuoglu, Koray, others. (2022). Learning to Navigate in Complex Environments. International Conference on Learning Representations.

[chaplotlearning] Chaplot, Devendra Singh, Gandhi, Dhiraj, Gupta, Saurabh, Gupta, Abhinav, Salakhutdinov, Ruslan. Learning To Explore Using Active Neural SLAM. International Conference on Learning Representations.

[chenlearning] Chen, Tao, Gupta, Saurabh, Gupta, Abhinav. Learning Exploration Policies for Navigation. International Conference on Learning Representations.

[tulyakov2018mocogan] Tulyakov, Sergey, Liu, Ming-Yu, Yang, Xiaodong, Kautz, Jan. (2018). Mocogan: Decomposing motion and content for video generation. Proceedings of the IEEE conference on computer vision and pattern recognition.

[voleti2022mcvd] Voleti, Vikram, Jolicoeur-Martineau, Alexia, Pal, Chris. (2022). Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. Advances in neural information processing systems.

[polyak2024movie] Polyak, Adam, Zohar, Amit, Brown, Andrew, Tjandra, Andros, Sinha, Animesh, Lee, Ann, Vyas, Apoorv, Shi, Bowen, Ma, Chih-Yao, Chuang, Ching-Yao, others. (2024). Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720.

[brooks2024video] Brooks, Tim, Peebles, Bill, Holmes, Connor, DePue, Will, Guo, Yufei, Jing, Li, Schnurr, David, Taylor, Joe, Luhman, Troy, Luhman, Eric, others. (2024). Video generation models as world simulators. OpenAI. https://openai.com/research/video-generation-models-as-world-simulators.

[fu2022coupling] Fu, Zipeng, Kumar, Ashish, Agarwal, Ananye, Qi, Haozhi, Malik, Jitendra, Pathak, Deepak. (2022). Coupling vision and proprioception for navigation of legged robots. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[bar2021compositional] Bar, Amir, Herzig, Roei, Wang, Xiaolong, Rohrbach, Anna, Chechik, Gal, Darrell, Trevor, Globerson, Amir. (2021). Compositional Video Synthesis with Action Graphs. International Conference on Machine Learning.

[kondratyukvideopoet] Kondratyuk, Dan, Yu, Lijun, Gu, Xiuye, Lezama, Jose, Huang, Jonathan, Schindler, Grant, Hornung, Rachel, Birodkar, Vighnesh, Yan, Jimmy, Chiu, Ming-Chang, others. VideoPoet: A Large Language Model for Zero-Shot Video Generation. Forty-first International Conference on Machine Learning.

[pooledreamfusion] Poole, Ben, Jain, Ajay, Barron, Jonathan T, Mildenhall, Ben. DreamFusion: Text-to-3D using 2D Diffusion. The Eleventh International Conference on Learning Representations.

[blattmann2023stable] Blattmann, Andreas, Dockhorn, Tim, Kulal, Sumith, Mendelevitch, Daniel, Kilian, Maciej, Lorenz, Dominik, Levi, Yam, English, Zion, Voleti, Vikram, Letts, Adam, others. (2023). Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.

[girdhar2023emu] Girdhar, Rohit, Singh, Mannat, Brown, Andrew, Duval, Quentin, Azadi, Samaneh, Rambhatla, Sai Saketh, Shah, Akbar, Yin, Xi, Parikh, Devi, Misra, Ishan. (2023). Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709.

[ho2022imagen] Ho, Jonathan, Chan, William, Saharia, Chitwan, Whang, Jay, Gao, Ruiqi, Gritsenko, Alexey, Kingma, Diederik P, Poole, Ben, Norouzi, Mohammad, Fleet, David J, others. (2022). Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.

[bar2024lumiere] Bar-Tal, Omer, Chefer, Hila, Tov, Omer, Herrmann, Charles, Paiss, Roni, Zada, Shiran, Ephrat, Ariel, Hur, Junhwa, Liu, Guanghui, Raj, Amit, others. (2024). Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945.

[yu2023magvit] Yu, Lijun, Cheng, Yong, Sohn, Kihyuk, Lezama, José, Zhang, Han, Chang, Huiwen, Hauptmann, Alexander G, Yang, Ming-Hsuan, Hao, Yuan, Essa, Irfan, others. (2023). Magvit: Masked generative video transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[zhou2024dinowmworldmodelspretrained] Zhou, Gaoyue, Pan, Hengkai, LeCun, Yann, Pinto, Lerrel. (2024). DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning. arXiv preprint arXiv:2411.04983.

[zhang2018unreasonable] Zhang, Richard, Isola, Phillip, Efros, Alexei A, Shechtman, Eli, Wang, Oliver. (2018). The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE conference on computer vision and pattern recognition.

[heusel2017gans] Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, Hochreiter, Sepp. (2017). Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems.

[unterthiner2019fvd] Unterthiner, Thomas, van Steenkiste, Sjoerd, Kurach, Karol, Marinier, Raphaël, Michalski, Marcin, Gelly, Sylvain. (2019). FVD: A new metric for video generation.

[sturm2012evaluating] Sturm, Jürgen, Burgard, Wolfram, Cremers, Daniel. (2012). Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark. Proc. of the Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RJS International Conference on Intelligent Robot Systems (IROS).

[krizhevsky2012imagenet] Krizhevsky, Alex, Sutskever, Ilya, Hinton, Geoffrey E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems.

[simonyan2014very] Simonyan, Karen, Zisserman, Andrew. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[savva2019habitat] Savva, Manolis, Kadian, Abhishek, Maksymets, Oleksandr, Zhao, Yili, Wijmans, Erik, Jain, Bhavana, Straub, Julian, Liu, Jia, Koltun, Vladlen, Malik, Jitendra, others. (2019). Habitat: A platform for embodied ai research. Proceedings of the IEEE/CVF international conference on computer vision.

[oquab2024dinov2learningrobustvisual] Oquab, Maxime, Darcet, Timothée, Moutakanni, Théo, Vo, Huy, Szafraniec, Marc, Khalidov, Vasil, others. (2024). DINOv2: Learning Robust Visual Features without Supervision.

[Peebles_2023_ICCV] Peebles, William, Xie, Saining. (2023). Scalable Diffusion Models with Transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is all you need. Advances in Neural Information Processing Systems.

[xu2019understandingimprovinglayernormalization] Xu, Jingjing, Sun, Xu, Zhang, Zhiyuan, Zhao, Guangxiang, Lin, Junyang. (2019). Understanding and Improving Layer Normalization. Advances in Neural Information Processing Systems.

[lei2016layer] Lei Ba, Jimmy, Kiros, Jamie Ryan, Hinton, Geoffrey E. (2016). Layer normalization. ArXiv e-prints.

[rubinstein1997optimization] Rubinstein, Reuven Y. (1997). Optimization of computer simulation models with rare events. European Journal of Operational Research.

[zhang2018perceptual] Zhang, Richard, Isola, Phillip, Efros, Alexei A, Shechtman, Eli, Wang, Oliver. (2018). The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. CVPR.

[hirose2019vunet] Hirose, Noriaki, Sadeghian, Amir, Xia, Fei, Martín-Martín, Roberto, Savarese, Silvio. (2019). VUNet: Dynamic Scene View Synthesis for Traversability Estimation using an RGB Camera. IEEE Robotics and Automation Letters.

[hirose2018gonet] Hirose, Noriaki, Sadeghian, Amir, Vázquez, Marynel, Goebel, Patrick, Savarese, Silvio. (2018). Gonet: A semi-supervised deep learning approach for traversability estimation. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[fu2024dreamsim] Fu, Stephanie, Tamir, Netanel, Sundaram, Shobhita, Chai, Lucy, Zhang, Richard, Dekel, Tali, Isola, Phillip. (2024). DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data. Advances in Neural Information Processing Systems.

[ronneberger2015u] Ronneberger, Olaf, Fischer, Philipp, Brox, Thomas. (2015). U-net: Convolutional networks for biomedical image segmentation. Medical image computing and computer-assisted intervention--MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18.

[liu2023zero] Liu, Ruoshi, Wu, Rundi, Van Hoorick, Basile, Tokmakov, Pavel, Zakharov, Sergey, Vondrick, Carl. (2023). Zero-1-to-3: Zero-shot one image to 3d object. Proceedings of the IEEE/CVF international conference on computer vision.

[vanhoorick2024gcd] Van Hoorick, Basile, Wu, Rundi, Ozguroglu, Ege, Sargent, Kyle, Liu, Ruoshi, Tokmakov, Pavel, Dave, Achal, Zheng, Changxi, Vondrick, Carl. (2024). Generative camera dolly: Extreme monocular dynamic novel view synthesis.

[mildenhall2021nerf] Mildenhall, Ben, Srinivasan, Pratul P, Tancik, Matthew, Barron, Jonathan T, Ramamoorthi, Ravi, Ng, Ren. (2021). Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM.

[loshchilov2017decoupled] Loshchilov, Ilya, Hutter, Frank. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[thanh2020catastrophic] Thanh-Tung, Hoang, Tran, Truyen. (2020). Catastrophic forgetting and mode collapse in GANs. 2020 international joint conference on neural networks (ijcnn).

[srivastava2017veegan] Srivastava, Akash, Valkov, Lazar, Russell, Chris, Gutmann, Michael U, Sutton, Charles. (2017). Veegan: Reducing mode collapse in gans using implicit variational learning. Advances in neural information processing systems.

[liang2024dreamitate] Liang, Junbang, Liu, Ruoshi, Ozguroglu, Ege, Sudhakar, Sruthi, Dave, Achal, Tokmakov, Pavel, Song, Shuran, Vondrick, Carl. (2024). Dreamitate: Real-World Visuomotor Policy Learning via Video Generation.

[ho2020denoising] Ho, Jonathan, Jain, Ajay, Abbeel, Pieter. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems.

[sohl2015deep] Sohl-Dickstein, Jascha, Weiss, Eric, Maheswaranathan, Niru, Ganguli, Surya. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. International conference on machine learning.

[cho] Tung, Joseph, Chou, Gene, Cai, Ruojin, Yang, Guandao, Zhang, Kai, Wetzstein, Gordon, Hariharan, Bharath, Snavely, Noah. (2025). MegaScenes: Scene-Level View Synthesis at Scale. Computer Vision -- ECCV 2024.

[Tulyakov:2018:MoCoGAN] Tulyakov, Sergey, Liu, Ming-Yu, Yang, Xiaodong, Kautz, Jan. (2018). MoCoGAN: Decomposing motion and content for video generation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[lin2024veditlatentpredictionarchitecture] Lin, Han, Nagarajan, Tushar, Ballas, Nicolas, Assran, Mido, Komeili, Mojtaba, Bansal, Mohit, Sinha, Koustuv. (2024). VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning. arXiv preprint arXiv:2410.03478.

[tomar2024videooccupancymodels] Tomar, Manan, Hansen-Estruch, Philippe, Bachman, Philip, Lamb, Alex, Langford, John, Taylor, Matthew E, Levine, Sergey. (2024). Video Occupancy Models. arXiv preprint arXiv:2407.09533.

[yanglearning] Yang, Sherry, Du, Yilun, Ghasemipour, Seyed Kamyar Seyed, Tompson, Jonathan, Kaelbling, Leslie Pack, Schuurmans, Dale, Abbeel, Pieter. Learning Interactive Real-World Simulators. The Twelfth International Conference on Learning Representations.

[bruce2024genie] Bruce, Jake, Dennis, Michael D, Edwards, Ashley, Parker-Holder, Jack, Shi, Yuge, Hughes, Edward, Lai, Matthew, Mavalankar, Aditi, Steigerwald, Richie, Apps, Chris, others. (2024). Genie: Generative interactive environments. Forty-first International Conference on Machine Learning.

[finn2017deep] Finn, Chelsea, Levine, Sergey. (2017). Deep visual foresight for planning robot motion. 2017 IEEE International Conference on Robotics and Automation (ICRA).

[hirose2019deep] Hirose, Noriaki, Xia, Fei, Martín-Martín, Roberto, Sadeghian, Amir, Savarese, Silvio. (2019). Deep visual mpc-policy learning for navigation. IEEE Robotics and Automation Letters.

[pmlr-v139-nichol21a] Nichol, Alexander Quinn, Dhariwal, Prafulla. (2021). Improved Denoising Diffusion Probabilistic Models. Proceedings of the 38th International Conference on Machine Learning.

[wang2024phased] Wang, Fu-Yun, Huang, Zhaoyang, Bergman, Alexander, Shen, Dazhong, Gao, Peng, Lingelbach, Michael, Sun, Keqiang, Bian, Weikang, Song, Guanglu, Liu, Yu, others. (2024). Phased Consistency Models. Advances in Neural Information Processing Systems.

[frantar2022gptq] Frantar, Elias, Ashkboos, Saleh, Hoefler, Torsten, Alistarh, Dan. (2022). Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.

[frey2023fast] Frey, Jonas, Mattamala, Matías, Chebrolu, Nived, Cadena, César, Fallon, Maurice, Hutter, Marco. (2023). Fast traversability estimation for wild visual navigation. Robotics: Science and Systems Proceedings.

[pmlr-v235-gao24p] Gao, Junyu, Yao, Xuan, Xu, Changsheng. (2024). Fast-Slow Test-Time Adaptation for Online Vision-and-Language Navigation. Proceedings of the 41st International Conference on Machine Learning.
