
Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models

Vlad Sobal, Wancong Zhang, Kyunghyun Cho, Randall Balestriero, Tim G. J. Rudner, Yann LeCun

Abstract

A long-standing goal in AI is to build agents that can solve a variety of tasks across different environments, including previously unseen ones. Two dominant approaches tackle this challenge: (i) reinforcement learning (RL), which learns policies through trial and error, and (ii) optimal control, which plans actions using a learned or known dynamics model. However, their relative strengths and weaknesses remain underexplored in the setting where agents must learn from offline trajectories without reward annotations. In this work, we systematically analyze the performance of different RL and control-based methods under datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot approaches. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and use it for planning. We study how dataset properties—such as data diversity, trajectory quality, and environment variability—affect the performance of these approaches. Our results show that model-free RL excels when abundant, high-quality data is available, while model-based planning excels in generalization to novel environment layouts, trajectory stitching, and data-efficiency. Notably, planning with a latent dynamics model emerges as a promising approach for zero-shot generalization from suboptimal data.

Introduction

How can we build a system that performs well on unseen combinations of tasks and environments? One promising approach is to avoid relying on online interactions or expert demonstrations, and instead leverage large collections of existing suboptimal trajectories without reward annotations [16, 32, 53]. Broadly, two dominant fields offer promising solutions for learning from such data: reinforcement learning and optimal control.

While online reinforcement learning has enabled agents to master complex tasks, from Atari games [44] and Go [60] to controlling real robots [49], it demands massive quantities of environment interactions. For instance, OpenAI et al. [49] used the equivalent of 100 years of real-time hand-manipulation experience to train a robot to reliably handle a Rubik's cube. To address this inefficiency, offline RL methods [18, 33, 41] have been developed to learn behaviors from state-action trajectories with corresponding reward annotations. However, these methods typically train agents for a single task, limiting their reuse for other downstream tasks. To overcome this, recent work has explored learning behaviors from offline reward-free trajectories [32, 52, 53, 67]. This reward-free paradigm is particularly appealing, as it allows agents to learn from suboptimal data and use the learned policy to solve a variety of downstream tasks. For example, a system trained on low-quality robotic interactions with cloth can later generalize to tasks like folding laundry [9].

  • Equal contribution. Author ordering determined by coin flip.

39th Conference on Neural Information Processing Systems (NeurIPS 2025).

Figure 1: Overview of our analysis. We test six methods for learning from offline reward-free trajectories on 23 different datasets across several navigation environments. We evaluate for six generalization properties required to scale to large offline datasets of suboptimal trajectories. We find that planning with a latent dynamics model (PLDM) demonstrates the highest level of generalization. For a full comparison, see Table 1. Right: diagram of PLDM. Circles represent variables, rectangles - loss components, half-ovals - trained models.


Optimal control tackles this challenge differently: instead of learning a policy function via trial and error, it plans actions using a known dynamics model [7, 63, 64]. Since real-world dynamics are often hard to specify exactly, many approaches instead learn the model from data [20, 71, 77]. This model-based approach has shown generalization in manipulation tasks involving unseen objects [17]. Importantly, dynamics models can be trained directly from reward-free offline trajectories, making this a compelling route [16, 56].

Despite significant advances in RL and optimal control, the role of pre-training data quality in reward-free offline learning remains largely unexplored. Prior work has primarily focused on RL methods trained on data from expert or exploratory policies [22, 76], without isolating the specific aspects of data quality that influence performance. In this work, we address this gap by systematically evaluating the strengths and limitations of various approaches for learning from reward-free trajectories. We assess how different learning paradigms perform on offline datasets that vary in both quality and quantity. To ground our study, we focus on navigation tasks - an essential aspect of many real-world robotic systems - where spatial reasoning, generalization, and trajectory stitching play a critical role. While this choice excludes domains such as manipulation, it offers a controlled yet challenging testbed for our comparative analysis. Our main contributions are:

  1. We propose two new navigation environments with granular control over the data generation process, and generate a total of 23 datasets of varying quality;
  2. We evaluate methods for learning from offline, reward-free trajectories, drawing from both reinforcement learning and optimal control paradigms. Our analysis systematically assesses their ability to learn from random policy trajectories, stitch together short sequences, train effectively on limited data, and generalize to unseen environment layouts and tasks beyond goal-reaching;
  3. We demonstrate that learning a latent dynamics model and using it for planning is robust to suboptimal data quality and achieves the highest level of generalization to environment variations;
  4. We present a list of guidelines to help practitioners choose between methods depending on available data and generalization requirements.

To facilitate further research into methods for learning from offline trajectories without rewards, we release code, data, environment visualizations, and more at latent-planning.github.io .

Related Work

Reward-free offline RL refers to learning from offline data that does not contain rewards, in a task-agnostic way. The goal is to extract general behaviors from offline data to solve a variety of downstream tasks. One approach uses goal-conditioned RL, with goals sampled in a similar manner as in Hindsight Experience Replay [1]. Park et al. [52] show that this can be applied to learn a goal-conditioned policy using IQL, as well as to learn a hierarchical value function. Hatch et al. [29] propose using a small set of observations corresponding to the solved task to define the task and learn from reward-free data. Hu et al. [30] and Yu et al. [79] propose to use labeled data to train a reward function, then label the reward-free trajectories. Zero-shot methods go beyond goal-reaching from

Figure 2: Left: We train offline goal-conditioned agents on trajectories collected in a subset of maze layouts (left), and evaluate on held out layouts, observing trajectories shown on the right. Only PLDM solves the task (see Figure 8 for more). Right: Success rates of tested methods on held-out layouts, as a function of the number of training layouts. Rightmost plot shows success rates of models trained on data from five layouts, evaluated on held-out layouts ranging from those similar to training layouts to out-of-distribution ones. We use map layout edit distance from the training layouts as a measure of distribution shift. PLDM demonstrates the best generalization performance. Results are averaged over 3 seeds, shaded area denotes standard error. See Figure 1 for more details on PLDM.


offline data and aim to solve arbitrary tasks specified at test time. HILP [53] proposes learning a distance-preserving representation space such that the distance in that space is proportional to the number of steps between two states, similar to Laplacian representations [69, 70, 73]. Forward-Backward representations [67, 68] tackle this with an approach akin to successor features [5].

Optimal Control, similar to RL, tackles the problem of selecting actions in an environment to optimize a given objective (reward for RL, cost for control). Classical optimal control methods typically assume that the transition dynamics of the environment are known [7]. This paradigm has been used to control aircraft, rockets, missiles [11], and humanoid robots [35, 57]. When the transition dynamics cannot be defined precisely, they can often be learned [65, 71]. Many RL methods approximate dynamic programming in the context of unknown dynamics [6, 62]. In this work, we use the term RL to refer to methods that either implicitly or explicitly use reward information to train a policy function, and optimal control for methods that use a dynamics model and explicitly search for actions that optimize the objective.

The importance of offline data has been highlighted in works such as ExORL [76], which demonstrates that exploratory RL data enables off-policy algorithms to perform well in offline RL; however, it only compares exploratory vs. task-specific data, without analyzing which aspects of the data affect performance. Buckman et al. [12] investigate the importance of data for offline RL with rewards. The recently proposed OGBench [51] introduces multiple offline datasets for a variety of goal-conditioned tasks; in contrast, we conduct a more fine-grained analysis of how methods perform in top-down navigation under suboptimal data conditions and generalize to new tasks and layouts. Yang et al. [74] also study generalization of offline GCRL, but focus on reaching out-of-distribution goals. Ghugare et al. [24] study stitching generalization.

The Landscape of Available Methods

In this section, we formally introduce the setting of learning from state-action sequences without reward annotations and overview available approaches. We also introduce a method we call Planning with a Latent Dynamics Model (PLDM).

Problem Setting

We consider a Markov decision process (MDP) M = (S, A, µ, p, r), where S is the state space, A is the action space, µ ∈ P(S) denotes the initial state distribution, p : S × A → S denotes the transition dynamics (we only consider the deterministic case), and r : S → R denotes the reward function. We work in the offline setting, where we have access to a dataset D of state-action sequences (s_0, a_0, s_1, ..., a_{T-1}, s_T). We emphasize again that the offline dataset in our setting does not contain any reward information. The goal is, given D, to find a policy π : S × Z → A that maximizes the cumulative reward r_z, where Z is the space of possible task definitions. Our goal is to make the best use of the offline dataset D to enable the agent to

Table 1: Road-map of our generalization stress-testing experiments. We test 4 offline goal-conditioned methods - HIQL, GCIQL, CRL, GCBC; a zero-shot RL method HILP, and a learned latent dynamics planning method PLDM. ★★★ denotes good performance in the specified experiment, ★★✩ denotes average performance, and ★✩✩ denotes poor performance. We see that HILP and PLDM are the best-performing methods, with PLDM standing out as the only method that reaches competitive performance in all settings.


solve a variety of tasks in a given environment with potentially different layouts. During evaluation, unless otherwise specified, the agent is tasked to reach a goal state s_g, so the reward is defined as r_g(s) = I[s = s_g], and Z is equivalent to S.

Reward-free Offline Reinforcement Learning

In this work, we study methods that solve tasks purely from offline trajectories without reward annotations. Reward-free offline RL methods fall into two categories: goal-conditioned RL and zero-shot methods that treat the task as a latent variable. We evaluate state-of-the-art methods from both categories on goal-reaching, and test zero-shot methods on their ability to transfer to new tasks. The methods we investigate are:

  • GCIQL [52] - a goal-conditioned version of Implicit Q-Learning [33], a strong and widely used method for offline RL;
  • HIQL [52] - a hierarchical GCRL method that trains two policies: one to generate subgoals and another to reach them. Notably, both policies use the same value function;
  • HILP [53] - a method that learns state representations from the offline data such that distance in the learned representation space is proportional to the number of steps between two states. A direction-conditioned policy is then learned to move along any specified direction in the latent space;
  • CRL [19] - uses contrastive learning to learn compatibility between states and possibly reachable goals. The learned representation, which has been shown to be directly linked to a goal-conditioned Q-function, is then used to train a goal-conditioned policy;
  • GCBC [23, 43] - Goal-Conditioned Behavior Cloning, the simplest baseline for goal-reaching.

Planning with a Latent Dynamics Model

The methods in Section 3.2 are model-free: none explicitly models the environment dynamics. Since we do not assume known dynamics as in classical control, we instead learn a dynamics model from offline data, similar to [46, 55], which propose model-based methods for goal-reaching using an image reconstruction objective.

We propose a model-based method named Planning with a Latent Dynamics Model (PLDM), which learns latent dynamics using a reconstruction-free SSL objective and the JEPA architecture [39]. At test time, we plan in the learned latent space to reach goals. We opt for an SSL approach that predicts the latents as opposed to reconstructing the input observations [3, 21, 26, 81] motivated by findings that reconstruction yields suboptimal features [2, 42], while reconstruction-free representation learning works well for control [27, 59]. Appendix G provides empirical support: features trained with reconstruction-based methods such as DreamerV3 [26] underperform in test-time planning.

Given an agent trajectory (s_0, a_0, s_1, ..., a_{T-1}, s_T), we specify the PLDM world model as:

$$
z_t = h_\theta(s_t), \qquad \hat{z}^{k}_{t+1} = f^{k}_\theta\big(\hat{z}^{k}_{t}, a_t\big), \qquad \hat{z}^{k}_{0} = z_0, \qquad k = 1, \dots, K,
$$

where $\hat{z}^{k}_{t}$ is the latent state predicted by predictor k and $z_t$ is the encoder output at step t. When K > 1, we train an ensemble of predictors for uncertainty regularization at test time. The training objective involves minimizing the distance between predicted and encoded latents summed over

Figure 3: Left : The Two-Rooms environment. The agent starts at a random location and is tasked with reaching the goal at another randomly sampled location in the other room using 200 steps or less. Observations are 64 × 64 pixels images. Right: Examples of trajectories in the offline data. Red: each step's direction is sampled from Von Mises distribution. Blue: each step's direction is sampled uniformly.


Table 2: Performance of tested methods on good-quality data and on data with no trajectories passing through the door. Values are average success rates (± standard error) across 3 seeds.

all timesteps. Given target and predicted latents $Z, \hat{Z}^k \in \mathbb{R}^{H \times N \times D}$, where H ≤ T is the model prediction horizon, N is the batch dimension, and D the feature dimension, the similarity objective between predictions and encodings is:

$$
\mathcal{L}_{\mathrm{sim}}\big(Z, \hat{Z}\big) = \frac{1}{KHN} \sum_{k=1}^{K} \sum_{t=1}^{H} \sum_{n=1}^{N} \big\lVert \hat{Z}^{k}_{t,n} - Z_{t,n} \big\rVert_2^2.
$$

To prevent representation collapse, we use a VICReg-inspired [4] objective and inverse dynamics modeling [40]. We show a diagram of PLDM in Figure 1. See Appendix D.1.1 for details.
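As a heavily simplified illustration of this training setup, the sketch below rolls an ensemble of predictors forward in latent space and computes a prediction-similarity term plus a VICReg-style variance term. The linear `encode`/`predict` stand-ins, the variance threshold, and all dimensions are hypothetical placeholders; the paper's implementation uses neural networks and additional regularizers (Appendix D.1.1).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the trained modules (the paper uses neural networks):
# encoder h_theta: observation -> D-dim latent; predictor f_theta^k: (latent, action) -> latent.
D, A, K, H, N = 4, 2, 3, 5, 8                   # latent dim, action dim, ensemble, horizon, batch
W_enc = rng.normal(size=(D, 6))                 # hypothetical linear encoder weights
W_pred = rng.normal(size=(K, D, D + A)) * 0.1   # one weight matrix per ensemble member

def encode(obs):                                # h_theta
    return obs @ W_enc.T

def predict(k, z, a):                           # f_theta^k
    return np.concatenate([z, a], axis=-1) @ W_pred[k].T

def jepa_losses(obs_seq, act_seq):
    """obs_seq: (H+1, N, 6); act_seq: (H, N, A). Returns (similarity, variance) loss terms."""
    Z = encode(obs_seq)                         # target latents, shape (H+1, N, D)
    sim = 0.0
    for k in range(K):
        z_hat = Z[0]                            # roll predictor k forward from the encoded start
        for t in range(H):
            z_hat = predict(k, z_hat, act_seq[t])
            sim += np.mean((z_hat - Z[t + 1]) ** 2)
    sim /= K * H
    # VICReg-style variance term: push each latent dimension's batch std above a threshold
    std = Z.reshape(-1, D).std(axis=0)
    var_loss = np.mean(np.maximum(0.0, 1.0 - std))
    return sim, var_loss
```

In the real model, gradients of both terms flow into the encoder and predictors jointly; the variance term is what prevents the trivial solution of mapping every observation to the same latent.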

Goal-conditioned planning with PLDM. In this work, we mainly focus on the task of reaching specified goal states. While the methods outlined in Section 3.2 rely on trained policies to reach the goal, PLDM relies on planning. At test time, given the current observation s_0, goal observation s_g, pretrained encoder h_θ, predictor f_θ, and planning horizon H, our planning objective is:

$$
\min_{a_0, \dots, a_{H-1}} \; C_{\mathrm{goal}}\Big(\tfrac{1}{K}\textstyle\sum_{k} \hat{z}^{k}_{H},\, z_g\Big) \;+\; \sum_{t=1}^{H} \gamma^{t}\, C_{\mathrm{uncertainty}}\big(\hat{z}^{1}_{t}, \dots, \hat{z}^{K}_{t}\big),
$$

$$
\text{where} \quad z_g = h_\theta(s_g), \quad \hat{z}^{k}_{0} = h_\theta(s_0), \quad \hat{z}^{k}_{t+1} = f^{k}_\theta\big(\hat{z}^{k}_{t}, a_t\big).
$$

C_goal is the goal-reaching objective, and C_uncertainty penalizes the model for choosing state-action transitions that deviate from the training distribution, with γ ∈ [0, 1] as the temporal discount. This regularization resembles how GCIQL, HIQL, and HILP use expectile regression to learn policies that remain in-distribution with respect to the dataset [34]. See Appendix E for ablations on C_uncertainty.

Following the Model Predictive Control framework [45], PLDM re-plans every i interactions with the environment. By default, we use i = 1 for all experiments, making PLDM ∼4x slower than the model-free baselines. The replanning interval i can be increased to accelerate MPC with only a minor loss in performance (see Appendix F). We use MPPI [72] in all our experiments with planning. We note that PLDM does not use rewards, either explicitly or implicitly, and should be considered an optimal control method. We also note that to apply PLDM to a new task, we do not need to retrain the encoder h_θ and dynamics f_θ; we only need to change the planning cost. We test this flexibility in Section 4.6, where we invert the sign of the cost to make the agent avoid a given state.
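The planning loop can be sketched with a minimal MPPI routine over a learned latent model: sample noisy action sequences around a mean plan, score each by goal distance plus discounted ensemble disagreement, then exponentially reweight. The stand-in dynamics `W`, the uncertainty weight `lam_unc`, and all hyperparameters below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
D, A, K, H = 4, 2, 3, 10                          # latent dim, action dim, ensemble, horizon
W = rng.normal(size=(K, D, D + A)) * 0.1          # stand-in for the learned ensemble dynamics

def rollout_cost(z0, z_goal, actions, gamma=0.99, lam_unc=1.0):
    """C_goal + discounted C_uncertainty for one action sequence, using the ensemble."""
    zs = np.repeat(z0[None], K, axis=0)           # (K, D): every member starts at z0
    cost = 0.0
    for t, a in enumerate(actions):
        zs = np.stack([np.concatenate([zs[k], a]) @ W[k].T for k in range(K)])
        cost += gamma**t * lam_unc * zs.std(axis=0).mean()   # disagreement = uncertainty
    cost += np.linalg.norm(zs.mean(axis=0) - z_goal)         # goal cost at the horizon
    return cost

def mppi(z0, z_goal, n_samples=64, n_iters=3, sigma=0.5, temp=1.0):
    mean = np.zeros((H, A))
    for _ in range(n_iters):
        noise = rng.normal(scale=sigma, size=(n_samples, H, A))
        costs = np.array([rollout_cost(z0, z_goal, mean + n) for n in noise])
        w = np.exp(-(costs - costs.min()) / temp)            # softmin weighting over samples
        mean = mean + (w[:, None, None] * noise).sum(0) / w.sum()
    return mean[0]                                # MPC: execute only the first action
```

Replanning every i steps amounts to executing the first i actions of `mean` before calling `mppi` again, which is the speed/performance trade-off discussed above.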

Goal-conditioned planning with JEPA

To estimate how computationally expensive it is to run planning with a latent dynamics model, we evaluate PLDM, GCIQL, and HIQL on 25 episodes in the Two-Rooms environment. Each episode consists of 200 steps. We record the average time per episode and the standard deviation. We omit HILP, GCBC, and CRL because their policy architecture matches GCIQL's, making their evaluation time identical. HIQL takes more time due to its hierarchy of policies. When replanning every step, PLDM is slower than the policies. However, PLDM can match the latencies of the policies by replanning less frequently, with a negligible performance drop.

Figure 10: Comparing PLDM's performance under fixed inference time compute budget on two-rooms. We see that across two-rooms experiments, PLDM performs only slightly worse when replanning every 4 steps compared to replanning every step.


Table 5: Evaluation time for one episode in the Two-Rooms environment, averaged across 25 episodes. PLDM success rates are normalized against the setting that replans every step. PLDM can match the latencies of GCIQL and HIQL by replanning less frequently, with negligible cost to performance.

Every Method Can Excel but Few Generalize

In this section, we conduct thorough experiments testing the methods spanning RL and optimal control outlined in Section 3.2 and Section 3.3. We evaluate on navigation tasks where the agent is either a point mass (Section 4.1, Section 4.8) or a quadruped (Section 4.7). We generate datasets of varying size and quality and test how a specific data type affects a given method. See Table 1 for an overview.

Figure 4: Testing the selected methods' performance under different dataset constraints. Values and shaded regions are means and standard error over 3 seeds, respectively. Left : To test the importance of the dataset quality, we mix the random policy trajectories with good quality trajectories (see Figure 3). As the amount of good quality data goes to 0, methods begin to fail, with PLDM, GCIQL, and HILP being the most robust ones. Center : We measure methods' performance when trained with different sequence lengths. We find that many goal-conditioned methods fail when train trajectories are short, which causes far-away goals to become out-of-distribution for the resulting policy. Right : We measure methods' performance with datasets of varying sizes. We see that PLDM and GCIQL are the most sample efficient, and manage to get almost 80% success rate even with a few thousand transitions. See Appendix H for the analysis of statistical significance.


Two-Rooms Environment

We begin with a navigation task called Two-Rooms, featuring a point-mass agent. Each observation x_t ∈ R^{2×64×64} is a top-down view: the first channel encodes the agent, the second the walls (Figure 3). Actions a ∈ R^2 denote the displacement vector of the agent position from one time step to the next, with a norm limit of 2.45. The goal is to reach a randomly sampled state within 200 steps. See Appendix C.2 for more details. This environment allows for controlled data generation, making it ideal for efficient and thorough experimentation while remaining non-trivial.

Offline data. To generate offline data, we place the agent at a random location within the environment and execute a sequence of actions for T steps, where T denotes the episode length. The actions are generated by first picking a random direction, then sampling action directions from a von Mises distribution with concentration 5 around it. The step size is sampled uniformly from [0, 2.45]. Unless otherwise specified, the episode length is T = 91, and the total number of transitions in the data is 3 million.
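A minimal sketch of this data-generation procedure; clipping positions at the room bounds is our simplification of the environment's actual wall-collision handling, and the function name is ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_trajectory(T=91, size=64.0, max_step=2.45, kappa=5.0):
    """Random-walk episode as described: pick a random heading, then sample each step's
    direction from a von Mises distribution (concentration kappa) around that heading,
    with step sizes uniform in [0, max_step]."""
    pos = rng.uniform(0, size, size=2)          # random start location
    heading = rng.uniform(-np.pi, np.pi)        # episode-level preferred direction
    states, actions = [pos.copy()], []
    for _ in range(T):
        theta = rng.vonmises(heading, kappa)    # direction for this step
        r = rng.uniform(0, max_step)
        a = r * np.array([np.cos(theta), np.sin(theta)])
        # simplification: clip to the room; the real env resolves wall collisions,
        # so the recorded action may differ slightly from the realized displacement
        pos = np.clip(pos + a, 0, size)
        actions.append(a)
        states.append(pos.copy())
    return np.array(states), np.array(actions)
```

Sampling directions around a fixed per-episode heading is what makes these trajectories travel across the rooms rather than oscillate in place, which matters for the data-quality experiments below.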

What Methods Excel In-Distribution with a Large High-Quality Dataset?

To get the top-line performance of the methods under optimal dataset conditions, we test them in a setting with abundant data, good state coverage, and good-quality trajectories long enough to traverse the two rooms. With 3 million transitions, corresponding to around 30,000 trajectories, all methods reach good performance on the goal-reaching task in Two-Rooms (Table 2), with HIQL, GCIQL, HILP, and PLDM nearing a 100% success rate.

Takeaway : All methods can perform well when data is plentiful and high-quality.

What Method is the Most Sample-Efficient?

We investigate how different methods perform when the dataset size varies. While our ultimate goal is a method that can make use of a large amount of suboptimal offline data, this experiment serves to distinguish which methods glean the most information from the available data. We sweep dataset sizes all the way down to a few thousand transitions. In Figure 4, we see that the model-based PLDM and the model-free GCIQL outperform the other model-free methods when data is scarce. Notably, HILP is more data-hungry than the other model-free methods but achieves perfect performance with enough data.

Takeaway : PLDM and GCIQL are more sample-efficient than other methods.

What Methods Can Stitch Suboptimal Trajectories?

Can we learn from short trajectories? We vary the episode length T during data generation to test whether methods can stitch together short training trajectories to reach long-horizon goals. In real-world scenarios, collecting long episodes is often difficult, especially in open-ended environments, so the ability to learn generalizable policies from short trajectories is crucial. The hardest scenario of Two-Rooms requires the agent to navigate from the bottom left corner to the bottom right corner, taking ∼90 steps, meaning that with short episode lengths such as 16, the goal is never observed within a single training trajectory. To succeed, methods must stitch together multiple offline trajectories.

We create datasets with episode lengths of 91, 64, 32, and 16, keeping the total number of transitions at 3 million. Results in Figure 4 (center) show that, with the exception of GCIQL, goal-conditioned model-free methods struggle when trained on shorter episodes. We hypothesize that these methods are limited by dynamic programming over the short transitions, which can be sample-inefficient when many short trajectories must be stitched to reach far-away goals. In contrast, HILP performs well by learning to follow directions in latent space, even from short episodes. Similarly, PLDM can learn an accurate model from short trajectories and stitch together a plan at test time.

Can we learn from data with imperfect coverage? We artificially constrain trajectories to always stay within one room during an episode and never pass through the door. Without this constraint, around 35% of trajectories pass through the door. During evaluation, the agent still needs to go through the door to reach the goal state. This reflects possible constraints in real-life scenarios, as the ability to stitch offline trajectories together is essential to learning efficiently from offline data. The results are shown in Table 2. We see that HILP and GCIQL achieve perfect performance, while PLDM's performance drops but remains higher than that of the other offline GCRL methods. We hypothesize that HILP's latent space structure enables effective stitching, while PLDM retains some performance due to the learned dynamics. With the exception of GCIQL, the other model-free GCRL methods fail to learn to compose trajectories across rooms.

Takeaway : When solving the task requires 'stitching', HILP and GCIQL work great. The performance of PLDM drops, but is better than that of most offline model-free GCRL methods.

What Methods Can Learn From Trajectories of a Random Policy?

In this experiment, we evaluate how trajectory quality affects agent performance. In practice, random-policy data is easy to collect, while expert demonstrations are often unavailable, so algorithms that can generalize from noisy data are crucial. We create a dataset where actions are sampled uniformly at random, causing agents to oscillate near their starting point. In this setting, the average maximum distance between any two points in a trajectory is ∼10 (in a 64-by-64 environment), whereas with von Mises-sampled actions, as in the good-quality data, it is ∼28. Example trajectories from both types of action sampling are shown in Figure 3.
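The trajectory-spread statistic above can be reproduced in a toy, unbounded version of the setting; the absolute numbers differ from the bounded 64x64 room, but the gap between directed (von Mises) and uniform walks is the same qualitative effect. The helper names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pairwise_distance(traj):
    """Largest Euclidean distance between any two states visited in a trajectory."""
    diffs = traj[:, None, :] - traj[None, :, :]   # (T+1, T+1, 2) pairwise differences
    return np.sqrt((diffs ** 2).sum(-1)).max()

def random_walk(T, directed):
    """Unbounded 2D walk; directed=True samples directions from a von Mises
    distribution (kappa=5) around a fixed heading, else uniformly at random."""
    heading = rng.uniform(-np.pi, np.pi)
    thetas = rng.vonmises(heading, 5.0, T) if directed else rng.uniform(-np.pi, np.pi, T)
    steps = rng.uniform(0, 2.45, T)[:, None] * np.stack([np.cos(thetas), np.sin(thetas)], -1)
    return np.concatenate([[np.zeros(2)], np.cumsum(steps, 0)])

spreads = [(max_pairwise_distance(random_walk(91, True)),
            max_pairwise_distance(random_walk(91, False))) for _ in range(50)]
directed, uniform = np.array(spreads).mean(0)
# Directed walks travel much farther than uniform ones, mirroring the ~28 vs ~10 gap
# reported for the bounded environment.
```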

We see that HILP, GCIQL, and the model-based PLDM outperform the rest of the goal-conditioned RL methods in this setting (Figure 4). We hypothesize that because random trajectories on average do not travel far, the state and goal pairs sampled during training are close to each other. As a result, far-away goals become out-of-distribution, and GCRL methods struggle with long-horizon tasks when TD learning fails to bridge distant trajectories. In contrast, PLDM uses the data only to learn the dynamics model, and random-policy trajectories are still suitable for that purpose. HILP learns the latent space and how to traverse it along various dimensions, even from random data.

Takeaway : When the dataset quality is low, HILP, GCIQL and PLDM perform better than other offline GCRL methods.

What Methods Can Generalize to a New Task?

To build effective general-purpose systems that learn from offline data, we need an algorithm that can generalize across tasks. So far, we have evaluated on goal-reaching tasks. In this experiment, we test generalization to a different task in an environment with the same dynamics: avoiding a chasing agent. We compare PLDM and HILP, using models trained on the optimal data from Section 4.2 without any further training. In this task, the chaser follows an expert policy along the

Figure 5: Zero-shot generalization to the chasing task. (a) In the chase environment, the blue agent is tasked with avoiding the red chaser. The chaser follows the shortest path to the agent. The observations of the agent remain unchanged: we pass the chaser state as the goal state. The agent has to avoid the specified state instead of reaching it. (b) Left : Performance of the tested methods on the chasing task across different chaser speeds, with faster chaser making the task harder. Baselines include agents that take no action ('Zero') and random actions ('Random'). (b) Right : Average distance between the agent and chaser agent throughout the episode when chaser speed is 1 . 0 . (c) Visualization of ant-umaze environment. The 4-legged ant is tasked with reaching a randomly sampled goal within a u-shaped room.


shortest path to the agent, and we vary its speed to adjust difficulty. The goal of the controlled agent is to avoid being caught. Goal-conditioned methods are excluded from evaluation, as they cannot avoid specific states by design. At each step, the agent observes the chaser's state and selects actions to maintain distance. In PLDM, we invert the sign of the planning objective to maximize the latent-space distance to the goal state. In HILP, we invert the skill direction. We evaluate the success rate, defined as maintaining a distance ≥ 1.4 pixels over 100 steps. The results are shown in Figure 5b (left). We also plot the average distance between the agents over time in Figure 5b (right). We see that PLDM performs better than HILP and evades the chaser more effectively, maintaining greater separation by episode end.

Takeaway : Assuming the environment dynamics remain fixed, PLDM can generalize to tasks other than goal-reaching simply by changing the planning objective.
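The task switch above amounts to a one-line change in the planning cost. Below is a minimal sketch of the idea; the function and variable names are hypothetical illustrations, not the paper's actual planner code:

```python
import numpy as np

def latent_cost(z_pred, z_goal, avoid=False):
    """Planning cost: L2 distance between predicted latents and the
    encoded 'goal' latent. Flipping the sign turns goal-reaching into
    state avoidance, mirroring how PLDM handles the chase task.
    (Illustrative sketch; not the paper's exact cost definition.)"""
    dist = np.linalg.norm(z_pred - z_goal, axis=-1).sum()
    return -dist if avoid else dist
```

With `avoid=True`, the same pretrained encoder and dynamics model are reused unchanged; only the objective handed to the planner differs.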

Extending to a Higher-Dimensional Control Environment.

So far, we have focused on environments with simple control dynamics over a point-mass agent, where actions are 2D displacement or acceleration vectors. We now investigate whether the same trend holds in a setting with more complex control. We choose Ant-U-Maze, a standard state-based environment with a 29-dimensional state space and 8-dimensional action space (Figure 5c). Solving this task with PLDM requires learning non-trivial dynamics that better resemble real-world control.

We collect a dataset using a pretrained directional expert policy from Park et al. [51], generating 5M transitions by resampling a new direction every 10 steps and adding Gaussian noise with standard deviation 1.0 to each action. For evaluation, the quadruped is initialized at the bottom left or right corner, with the goal at the opposite diagonal. Each method is evaluated on 10 trials.
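The collection procedure described above can be sketched as follows. Here `env` and `expert_policy` are assumed interfaces standing in for the Ant-U-Maze simulator and the pretrained directional policy of Park et al. [51]; this is a hedged reconstruction from the text, not the released code:

```python
import numpy as np

def collect_noisy_transitions(env, expert_policy, n_transitions,
                              resample_every=10, noise_std=1.0, seed=0):
    """Collect transitions by following a directional expert whose
    target direction is resampled every `resample_every` steps, with
    Gaussian noise (std `noise_std`) added to each action.
    `env.step` is assumed to return (next_obs, done)."""
    rng = np.random.default_rng(seed)
    data, obs = [], env.reset()
    direction = rng.uniform(-np.pi, np.pi)
    for t in range(n_transitions):
        if t % resample_every == 0:
            # Pick a fresh random direction for the expert to follow.
            direction = rng.uniform(-np.pi, np.pi)
        action = expert_policy(obs, direction)
        action = action + rng.normal(0.0, noise_std, size=np.shape(action))
        next_obs, done = env.step(action)
        data.append((obs, action, next_obs))
        obs = env.reset() if done else next_obs
    return data
```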

As in Section 4.4, we test the methods' ability to stitch short training trajectories by using datasets with trajectory lengths of 25, 50, 100, 250, and 500 (during evaluation, the start and goal are approximately 200 steps apart).

Figure 6: Success rates in Ant-U-Maze for agents trained on trajectories of varying lengths.

PLDM, HIQL, and HILP outperform other GCRL baselines in trajectory stitching, achieving 100% success rates while other methods fail with shorter trajectories. In contrast to the Two-Rooms experiments, HIQL outperforms GCIQL in this setting. We hypothesize that HIQL's hierarchical structure may particularly benefit high-dimensional ant control: the low-level policy can manage fine-grained joint movements, while the high-level policy can govern overall navigation.

Takeaway : Planning with a latent dynamics model maintains good performance in a standard quadruped maze task, indicating promising generalization to higher control complexity.

What Methods Can Generalize to Unseen Environment Layouts?

In this experiment, we test whether methods can generalize to new obstacle configurations, a key requirement for general-purpose agents, since collecting data for every scenario is infeasible. We introduce a new navigation environment with slightly more complex dynamics and configurable layouts (Figure 2). Building on top of MuJoCo PointMaze [66], layouts are generated by randomly permuting wall locations. Data is collected by randomly sampling actions at each step. Observations include an RGB top-down view of the maze and the agent's velocity; actions are 2D accelerations. The goal is to reach a randomly sampled location. See Appendix C.2 for details.

To test generalization to new obstacle configurations, we vary the number of training maze layouts (5, 10, 20, 40) and evaluate on held-out unseen layouts. For models trained on 5 maps, we further analyze how test-time performance degrades as layouts diverge from the training distribution. Figure 8 shows that PLDM generalizes best, even when trained on just 5 maps, while other methods fail. As test layouts become more out-of-distribution, all methods except PLDM degrade in performance. We also evaluate all methods on a single fixed layout and observe 100% success rates across the board (Table 4). Figures 7 and 8 show PLDM's inferred plans and trajectories from different agents. We also investigate HILP's failure to generalize in Appendix I, and show that HILP's learned representation space successfully captures distances between states in mazes seen during training, but fails on unseen mazes.

Figure 7: Left : Plans generated by PLDM at test time. Right : Actual agent trajectories for the tested methods. PLDM is the only method that reliably succeeds on held-out mazes.

Takeaway : The model-based approach enables better generalization to unseen obstacle layouts than model-free methods.

Conclusion

In this work, we conducted a comprehensive study of existing methods for learning from offline data without rewards, spanning both reinforcement learning and optimal control, aiming to identify the most promising approaches for leveraging suboptimal trajectories. We focus on a set of navigation tasks that present unique challenges due to the need for spatial reasoning, generalization to new layouts, and trajectory stitching. Our findings highlight HILP and PLDM as the strongest candidates, with PLDM demonstrating the best generalization to new obstacle layouts and to a state-avoidance task. We aggregate our experimental results in Table 1. Overall, we draw three main conclusions:

  1. PLDM exhibits robustness to data quality, a high level of data efficiency, best-of-class generalization to new layouts, and excels at adapting to tasks beyond goal-reaching;
  2. Learning a well-structured latent-space (e.g. using HILP) enables trajectory stitching and robustness to data quality, although it is more data-hungry than other methods;
  3. Model-free GCRL methods are a great choice when data is plentiful and of good quality.

Future work. Our findings indicate that learning and planning with latent dynamics models is a promising direction for building general autonomous agents. There are many promising areas for exploration: 1) extending PLDM to more complex domains such as robotic manipulation and partially observable environments; 2) investigating improved dynamics learning methods to mitigate issues like accumulating prediction errors for tasks involving long-horizon reasoning [37]; and 3) improving test-time efficiency, either by backpropagating gradients through the forward model [8] or via amortized planning.

Related Work

Reward-free offline RL refers to learning from offline data that does not contain rewards in a task-agnostic way. The goal is to extract general behaviors from offline data to solve a variety of downstream tasks. One approach uses goal-conditioned RL, with goals sampled in a similar manner as in Hindsight Experience Replay [1]. Park et al. [52] show that this can be applied to learn a goal-conditioned policy using IQL, as well as to learn a hierarchical value function. Hatch et al. [29] propose using a small set of observations corresponding to the solved task to define the task and learn from reward-free data. Hu et al. [30] and Yu et al. [79] propose to use labeled data to train a reward function, then label the reward-free trajectories. Zero-shot methods go beyond goal-reaching from

Figure 2: Left: We train offline goal-conditioned agents on trajectories collected in a subset of maze layouts (left), and evaluate on held out layouts, observing trajectories shown on the right. Only PLDM solves the task (see Figure 8 for more). Right: Success rates of tested methods on held-out layouts, as a function of the number of training layouts. Rightmost plot shows success rates of models trained on data from five layouts, evaluated on held-out layouts ranging from those similar to training layouts to out-of-distribution ones. We use map layout edit distance from the training layouts as a measure of distribution shift. PLDM demonstrates the best generalization performance. Results are averaged over 3 seeds, shaded area denotes standard error. See Figure 1 for more details on PLDM.

offline data and aim to solve arbitrary tasks specified at test time. HILP [53] proposes learning a distance-preserving representation space such that the distance in that space is proportional to the number of steps between two states, similar to Laplacian representations [69, 70, 73]. Forward-Backward representations [67, 68] tackle this with an approach akin to successor features [5].

Optimal control, similar to RL, tackles the problem of selecting actions in an environment to optimize a given objective (reward for RL, cost for control). Classical optimal control methods typically assume that the transition dynamics of the environment are known [7]. This paradigm has been used to control aircraft, rockets, missiles [11], and humanoid robots [35, 57]. When the transition dynamics cannot be defined precisely, they can often be learned [65, 71]. Many RL methods approximate dynamic programming in the context of unknown dynamics [6, 62]. In this work, we use the term RL to refer to methods that either implicitly or explicitly use reward information to train a policy function, and optimal control for methods that use a dynamics model and explicitly search for actions that optimize the objective.

The importance of offline data has been highlighted in works such as ExORL [76], which demonstrates that exploratory RL data enables off-policy algorithms to perform well in offline RL; however, it only compares exploratory vs. task-specific data, without analyzing which data aspects affect performance. Buckman et al. [12] investigate the importance of data for offline RL with rewards. The recently proposed OGBench [51] introduces multiple offline datasets for a variety of goal-conditioned tasks; in contrast, we conduct a more fine-grained analysis of how methods perform in top-down navigation under suboptimal data conditions and generalize to new tasks and layouts. Yang et al. [74] also study generalization of offline GCRL, but focus on reaching out-of-distribution goals. Ghugare et al. [24] study stitching generalization.

Although online reinforcement learning enables learning to solve increasingly complex tasks—ranging from Atari games [Mni13] to Go [Sil+16] to controlling real robots [Ope+18]—it requires numerous environment interactions to do so. For example, [Ope+18] used the equivalent of 100 years of real-time hand manipulation experience to train a robot to reliably handle a Rubik’s cube. To address this sample complexity problem, offline RL methods [KNL21, Lev+20, EGW05] have been developed to learn behaviors from state–action trajectories with corresponding reward annotations. Unfortunately, conventional offline RL limits agents to one task, making it impossible to use a trained agent to solve another downstream task. To address this shortcoming, recently proposed methods learn desired behaviors from offline reward-free trajectories [Par+24a, TO21, KPL24, PKL24]. This reward-free paradigm is particularly appealing as it allows agents to learn from suboptimal data and use the learned policy to solve a variety of downstream tasks. For example, a system trained on a large dataset of low-quality robotic interactions with cloths can generalize to new tasks, such as folding laundry [Bla+24].

On the other hand, the field of optimal control approaches the challenge from a different angle, and instead of optimizing a policy function through trial and error, aims to use a known dynamics model [Ber19, TL05, TES07] to plan out actions. Of course, the dynamics model is often hard to define exactly, prompting a range of methods that learn the dynamics model instead [Wat+15, FL17, YCBI19]. This approach, when applied to manipulation, has also been shown to achieve generalization to unseen objects [Ebe+18]. Notably, dynamics model learning approaches can easily use offline trajectories without reward signal for training, making this a promising approach for training a general agent on offline data [Das+20, Ryb+18].

Despite significant advances in RL and optimal control, the impact of pre-training data quality on reward-free offline learning remains largely unexplored. Prior work primarily focuses on RL methods and trains on data collected either from expert policies or unsupervised RL [Fu+20, Yar+22], and does not test proposed methods on more than three different dataset types per task. In this work, we bridge this gap by systematically analyzing the strengths and weaknesses of various approaches for learning from reward-free trajectories. Through carefully designed experiments, we evaluate how different learning paradigms handle offline data across varying levels of quality and quantity.

Our contributions can be summarized as follows:

  1. We propose two navigation environments with granular control over the data generation process, and generate a total of 23 datasets of varying quality;
  2. We evaluate methods for learning from offline, reward-free trajectories, drawing from both reinforcement learning and optimal control paradigms. Our analysis systematically examines their performance across the proposed environments, assessing their ability to learn from random-policy trajectories, stitch together short sequences, train effectively on limited data, and generalize to unseen environments and tasks;
  3. We demonstrate that learning a latent dynamics model and using it for planning is robust to suboptimal data quality and achieves the highest level of generalization to environment variations;
  4. We present a list of guidelines to help practitioners choose between methods depending on available data and generalization requirements.

To facilitate further research into methods for learning from offline trajectories without rewards, we release code, data, environment visualizations, and more at latent-planning.github.io.

In this section, we formally introduce the setting of learning from state-action sequences without reward annotations and give an overview of available approaches. We also introduce a method we call Planning with a Latent Dynamics Model (PLDM).

We consider a Markov decision process (MDP) $\mathcal{M}=(\mathcal{S},\mathcal{A},\mu,p,r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mu\in\mathcal{P}(\mathcal{S})$ denotes the initial state distribution, $p:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}$ denotes the transition dynamics, and $r:\mathcal{S}\rightarrow\mathbb{R}$ denotes the reward function. We work in the offline setting, where we have access to a dataset $\mathcal{D}$ of state-action sequences consisting of transitions $(s_0,a_0,s_1,\ldots,a_{T-1},s_T)$. We emphasize again that the offline dataset in our setting does not contain any reward information. In our experiments, we also consider only deterministic transition dynamics. The goal is, given $\mathcal{D}$, to find a policy $\pi:\mathcal{S}\times\mathcal{Z}\rightarrow\mathcal{A}$ that maximizes the cumulative reward $r_z$, where $\mathcal{Z}$ is the space of possible task definitions. Our goal is to make the best use of the offline dataset $\mathcal{D}$ to enable the agent to solve a variety of tasks in a given environment with potentially different layouts. During evaluation, unless otherwise specified, the agent is tasked to reach a goal state $s_g$, so the reward is defined as $r_g(s)=\mathbb{I}[s=s_g]$, and $\mathcal{Z}$ is equivalent to $\mathcal{S}$.

In this work, we focus on methods that can learn to solve tasks purely from offline trajectories without reward annotations. We do not consider methods that augment a reward-labeled dataset with reward-free data, as we believe that the fully reward-free approach is more general. In offline RL, methods for learning without rewards fall into two broad categories: offline goal-conditioned RL and zero-shot RL methods that model the underlying task as a latent variable. In this work, we consider both categories, and select methods that we believe reflect the state of the art. We test all methods on goal-reaching, and test the zero-shot methods' transfer to new tasks. The methods we investigate are:

GCIQL [Par+24a] – a goal-conditioned version of Implicit Q-Learning [KNL21], a strong and widely used method for offline RL;

HIQL [Par+24a] – a hierarchical GCRL method which trains two policies: one to generate subgoals, and another one to reach the subgoals. Notably, both policies use the same value function;

HILP [PKL24] – a method that learns state representations from the offline data such that the distance in the learned representation space is proportional to the number of steps between two states. A direction-conditioned policy is then learned to be able to move along any specified direction in the latent space;

CRL [Eys+22] – uses contrastive learning to learn compatibility between states and possibly reachable goals. The learned representation, which has been shown to be directly linked to the goal-conditioned Q-function, is then used to train a goal-conditioned policy.

Although the methods outlined in Section 3.2 cover a wide range of paradigms, they all fall into the model-free RL category, and none of them use the model-based approach, which achieves impressive performance in other settings [DR11, Sil+17, Sil+16, Raf+21]. An easy way to use state-action sequences is to learn a dynamics model, making it a natural choice for our setting. For example, [NSF20, Per+20] propose model-based methods for goal-reaching, and use an image reconstruction objective. In this work, we choose to focus on just the dynamics learning objective, with added representation learning objectives to prevent collapse, therefore bypassing the need for image reconstruction. We introduce a model-based method named PLDM – Planning with a Latent Dynamics Model. We learn latent dynamics using a reconstruction-free self-supervised learning (SSL) objective, and utilize the joint-embedding predictive architecture (JEPA) [LeC22]. During evaluation, we use planning to optimize the goal-reaching objective. We opt for an SSL approach that involves predicting the latents as opposed to reconstructing the input observations [Haf+18, Haf+19, Haf+20, Haf+23, Wat+15, Fin+16, Zha+19, Ban+18], as recent work showed that reconstruction leads to suboptimal features [BL24, Lit+24], and that reconstruction-free representation learning can work well for control and RL [Shu+20, HWS22].

Given an agent trajectory sequence $(s_0, a_0, s_1, \ldots, a_{T-1}, s_T)$, we specify the PLDM world model as:

$$z_t = h_\theta(s_t), \qquad \hat{z}_0 = z_0, \qquad \hat{z}_{t+1} = f_\theta(\hat{z}_t, a_t),$$

where $\hat{z}_t$ is the predicted latent state and $z_t$ is the encoder output at step $t$. The training objective involves minimizing the distance between predicted and encoded latents, summed over all timesteps. Given latents $Z \in \mathbb{R}^{H\times N\times D}$, where $H \leq T$ is the model prediction horizon, $N$ is the batch dimension, and $D$ the feature dimension, the similarity objective between predictions and encodings is:

$$\mathcal{L}_{\mathrm{sim}}(Z, \hat{Z}) = \sum_{t=1}^{H}\sum_{n=1}^{N} \big\| \hat{z}_t^{(n)} - z_t^{(n)} \big\|_2^2.$$

To prevent representation collapse, we use a VICReg-inspired [BPL21] objective and inverse dynamics modeling [Les+18]. We show a diagram of PLDM in Figure 1. See Section C.1.1 for details.
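A rough numpy sketch of the resulting objective on a batch of latents combines the prediction (similarity) term with a VICReg-style variance hinge. The coefficient and exact regularizer below are illustrative assumptions, not the paper's values; the full objective also includes covariance and inverse-dynamics terms not shown here:

```python
import numpy as np

def pldm_loss(z_hat, z, var_coef=1.0, eps=1e-4):
    """Sketch of a PLDM-style objective on latents of shape (H, N, D):
    squared-error similarity between predicted and encoded latents,
    plus a VICReg-style variance hinge that pushes each latent
    dimension's batch std above 1 to prevent collapse."""
    sim = np.mean((z_hat - z) ** 2)            # prediction error
    std = np.sqrt(z.var(axis=1) + eps)         # per-dim std over the batch
    var = np.mean(np.maximum(0.0, 1.0 - std))  # hinge at std = 1
    return sim + var_coef * var
```

Note how a collapsed representation (all latents identical across the batch) incurs a large variance penalty even when prediction error is zero, which is exactly the failure mode the regularizer guards against.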

In this work, we mainly focus on the task of reaching specified goal states. While the methods outlined in Section 3.2 rely on trained policies to reach the goal, PLDM relies on planning. At test time, given the current observation $s_0$, goal observation $s_g$, pretrained encoder $h_\theta$, predictor $f_\theta$, and planning horizon $H$, our planning objective is:

$$\min_{a_0, \ldots, a_{H-1}} \big\| \hat{z}_H - h_\theta(s_g) \big\|_2, \quad \text{where } \hat{z}_0 = h_\theta(s_0), \;\; \hat{z}_{t+1} = f_\theta(\hat{z}_t, a_t). \tag{3.5}$$

Following the Model Predictive Control framework [ML99], our model re-plans at every $k$-th interaction with the environment. Unless stated otherwise, we use $k=1$. In all our experiments with PLDM, we use MPPI [WAT15] for planning. We note that PLDM does not use rewards, either explicitly or implicitly, and should be considered as falling under the optimal control category. We also note that in order to apply PLDM to another task, we do not need to retrain the encoder $h_\theta$ or the forward model $f_\theta$; we only need to change the definition of the cost in Equation 3.5. We demonstrate this flexibility in Section 4.6, where we invert the sign of the cost to make the agent avoid the specified state.
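A minimal MPPI loop over a latent dynamics model might look as follows. Here `encode` and `predict` stand in for the pretrained $h_\theta$ and $f_\theta$, and the terminal latent-distance cost is a simplified stand-in for the paper's planning cost; hyperparameters are illustrative:

```python
import numpy as np

def mppi_plan(encode, predict, s0, s_goal, horizon=10, n_samples=64,
              n_iters=3, action_dim=2, sigma=1.0, temperature=1.0, seed=0):
    """Sketch of MPPI planning in latent space: sample Gaussian action
    sequences around a running mean, roll them out with the learned
    predictor, weight them by exp(-cost / temperature), and update the
    mean. Returns the planned action sequence."""
    rng = np.random.default_rng(seed)
    z0, zg = encode(s0), encode(s_goal)
    mean = np.zeros((horizon, action_dim))
    for _ in range(n_iters):
        noise = rng.normal(0.0, sigma, size=(n_samples, horizon, action_dim))
        actions = mean[None] + noise
        costs = np.empty(n_samples)
        for i in range(n_samples):
            z = z0
            for t in range(horizon):
                z = predict(z, actions[i, t])
            costs[i] = np.linalg.norm(z - zg)  # terminal latent distance
        w = np.exp(-(costs - costs.min()) / temperature)
        w /= w.sum()
        mean = (w[:, None, None] * actions).sum(axis=0)
    return mean
```

In the MPC loop, only the first action of the returned plan is executed before re-planning (with $k=1$).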

In this section, we conduct thorough experiments testing methods spanning RL and optimal control outlined in Section 3.2 and Section 3.3. We test all methods on navigation tasks where the agent is a point mass. We present the task in Section 4.1. We generate datasets of varying size and quality and test how a specific data type affects a given method. We design our experiments to test the following properties of methods (see Table 1 for an experiment overview):

  1. Best-case performance with good data (Section 4.2);
  2. Ability to stitch together suboptimal trajectories (Section 4.4);
  3. Zero-shot generalization to a different task (Section 4.6).

We then draw conclusions from our experiments and outline possible next steps in Section 5.

All our experiments are done with top-down navigation tasks with a point-mass agent. First, we introduce a two-rooms navigation task. Each observation $x_t \in \mathbb{R}^{2\times 64\times 64}$ is a top-down view of the two-rooms environment, shown in Figure 5. The first channel in the image is the agent, the second channel is the walls. Actions $a \in \mathbb{R}^{2}$ denote the displacement vector of the agent's position from one time step to the next. The norm of the actions is restricted to be less than $2.45$. The goal is to reach another randomly sampled state within 200 environment steps. See Section A.2 for more details. This environment makes control of the data generation process very easy, enabling us to conduct our experiments efficiently and thoroughly, while not being so trivial that any method can solve it with even a little bit of data. Movement and navigation are a big part of virtually every real-world robotic environment, making this a useful testbed for development.

Offline data. To generate offline data, we place the agent in a random location within the environment and execute a sequence of actions for $T$ steps, where $T$ denotes the episode length. The actions are generated by first picking a random direction, then using a Von Mises distribution with concentration 5 to sample action directions. The step size is sampled uniformly from 0 to $2.45$. When sampling low-quality data, we do not bias the action directions using the Von Mises distribution, and instead sample the direction completely uniformly. Unless otherwise specified, the episode length is $T=91$, and the total number of transitions in the data is 3 million.
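The action-sampling scheme can be sketched in a few lines; this is a hedged reconstruction from the description above, not the released data-generation code:

```python
import numpy as np

def sample_trajectory_actions(T=91, concentration=5.0, max_step=2.45, seed=0):
    """Sample one episode's actions: pick a random base direction, draw
    each step's direction from a Von Mises around it (concentration 5
    for 'good' data; concentration 0, i.e. uniform directions, for
    low-quality data), and a step size uniform in [0, max_step]."""
    rng = np.random.default_rng(seed)
    base = rng.uniform(-np.pi, np.pi)
    dirs = rng.vonmises(base, concentration, size=T)
    steps = rng.uniform(0.0, max_step, size=T)
    return np.stack([steps * np.cos(dirs), steps * np.sin(dirs)], axis=1)
```

Setting `concentration=0.0` recovers the low-quality regime, since a Von Mises distribution with zero concentration is uniform over directions.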

To get the topline performance of the methods under optimal dataset conditions, we test them in a setting with a large amount of data, good state coverage, and good quality trajectories long enough to traverse the two rooms. With 3 million transitions, corresponding to around 30 thousand trajectories, all methods reach their best-case performance in this environment. We report the results in Figure 5. On the goal-reaching task in the two-rooms environment, all methods achieve impressive performance, with HIQL and HILP nearing perfect 100% success rate. PLDM fails to achieve perfect performance here. We hypothesize that because PLDM's training objective is not to learn a policy but to learn dynamics, PLDM does not fully benefit from the high-quality trajectories like other model-free methods.

We investigate how different methods perform when the dataset size varies. While our ultimate goal is to have a method that can make use of a large amount of suboptimal offline data, this experiment serves to distinguish which methods can glean the most information from available data. We tried ranges of dataset sizes all the way down to a few thousand transitions. In Figure 6 we see that the model-based method PLDM outperforms model-free methods when the data is scarce. In particular, HILP is more data-hungry than other model-free methods but achieves perfect performance with enough data.

Can we learn from short trajectories? In this experiment, we vary the episode length $T$ when generating the data. This experiment aims to test the methods' ability to stitch together shorter trajectories in order to get to the goal. In real-life scenarios, collecting long episodes may be much more challenging than having a large set of shorter trajectories, especially when we scale to more open-ended environments. Therefore, the ability to learn and generalize effectively from shorter trajectories, even when the evaluation trajectory may be much longer, is essential. In our environment, successfully navigating from the bottom left corner to the bottom right corner requires around 90 steps. This means that successful trajectories for the hardest start and goal pairings are never observed in a dataset with episodes of length 16. In order to successfully solve this task, the learning method has to be able to stitch together multiple offline trajectories. We generate several datasets, with episode lengths of 91, 64, 32, and 16. We adjust the number of episodes to keep the total number of transitions close to 3 million. The results are shown in Figure 6 (center). We see that when the episode length is short, goal-conditioned methods fail. We hypothesize that because goal-conditioned methods sample state and goal pairs from a trajectory to train their policies, far-away goals become out of distribution for the resulting policy. Although randomly sampling goals from other trajectories during training does not improve generalization, we hypothesize that data augmentation akin to the one outlined in [Ghu+24] can help. On the other hand, HILP performs well because instead of reaching goals, it learns to follow directions in the latent space, which can be learned even from short trajectories. Similarly, a model-based method such as PLDM can learn an accurate model from short trajectories and stitch together a plan during test time.

Can we learn from data with imperfect coverage? We artificially constrain trajectories to always stay within one room within the episode, and never pass through the door. Without the constraint, around 35% of trajectories pass through the door. During evaluation, the agent still needs to go through the door to reach the goal state. This also reflects possible constraints in real-life scenarios, as the ability to stitch offline trajectories together is essential to efficiently learn from offline data. The results are shown in Figure 5. We see that HILP achieves perfect performance, while PLDM performance drops, but is better than that of other methods. Similarly to the experiment with short trajectories, the GCRL methods fail. We hypothesize that the structure of the latent space allows HILP to stitch trajectories easily, while PLDM retains some performance due to the learned dynamics. Model-free GCRL methods fail because a goal in a different room from the current state is always out-of-distribution for a policy trained on trajectories staying in one room.

In this experiment, we evaluate how trajectory quality affects agent performance. In practice, collecting trajectories with a random policy is easy, while access to skilled demonstrations cannot always be assumed. Therefore, developing an algorithm that can learn from noisy or random-policy trajectories is critical for leveraging all available data. We generate a dataset of noisy trajectories, where at each step the direction is sampled completely at random. This effectively makes the agent move randomly, and throughout the episode it mostly stays close to where it started. In this dataset, the average maximum distance between any two points in a trajectory is $\sim 10$ (when the whole environment is 64 by 64), while when using the Von Mises distribution to sample actions, it is $\sim 28$. Example trajectories from both types of action sampling are shown in Figure 7.
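The trajectory-spread statistic quoted above can be computed as the maximum pairwise distance between visited points, averaged over trajectories; a small sketch (assumed to match the paper's statistic in spirit, not taken from its code):

```python
import numpy as np

def avg_max_spread(trajectories):
    """For each trajectory (an array of 2D positions), compute the
    maximum pairwise distance between any two visited points, then
    average over trajectories."""
    spreads = []
    for traj in trajectories:
        diffs = traj[:, None, :] - traj[None, :, :]  # all pairwise deltas
        spreads.append(np.linalg.norm(diffs, axis=-1).max())
    return float(np.mean(spreads))
```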

We see that HILP and the model-based PLDM perform better with very noisy data, while the goal-conditioned RL methods struggle (Figure 6). As in the experiment with shorter trajectories, we hypothesize that because trajectories on average do not travel far, the state and goal pairs sampled during training are close together, making faraway goals out of distribution. PLDM, on the other hand, uses the data only to learn the dynamics model, and random-policy trajectories are still suitable for that purpose. HILP uses the data to learn the latent space and how to traverse it in various directions, and can likewise use random-policy trajectories effectively.

In order to build a system that can learn from offline data effectively, we need a learning algorithm that can generalize across tasks. So far, we have compared all methods on goal-reaching tasks. In this experiment, we test whether the selected methods can generalize to a different task in the same environment: avoiding another agent that is ‘chasing’ the controlled agent. We evaluate models trained on optimal data from the experiment in Section 4.2, without any additional training. The chasing agent follows an expert policy that moves toward the controlled agent along the shortest path; to vary the difficulty of the task, we vary the speed of the chaser. We note that goal-conditioned methods can only reach specified goals and by definition cannot avoid a given state, so we only test PLDM and HILP. At each step, the agent is given the state of the chaser and has to choose actions to avoid it. To achieve that, in PLDM we simply invert the sign of the planning objective, making planning maximize the distance in representation space to the state to be avoided; in HILP, we invert the skill direction. To compare the two methods, we evaluate the success rate of the controlled agent: an episode is considered successful if the agent stays at least 1.4 pixels away from the chaser for the whole 100-step episode. The results are shown in Figure 8(b). To further analyze the results, we also plot the average distance between the agents throughout the episode (Figure 8(c)). We see that PLDM performs better than HILP: it evades the chaser more effectively, keeping a larger distance between the agents at the end of the episode.
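The sign flip can be illustrated with a generic latent-space planning cost. This is a hedged sketch: the function names and the choice of scoring only the final predicted latent are our simplifications, not the paper’s exact MPPI objective.

```python
import numpy as np

def goal_cost(pred_latents, goal_latent):
    """Goal-reaching cost: distance of the final predicted latent to the goal.

    pred_latents: (H, D) latents produced by rolling out the dynamics model.
    """
    return np.linalg.norm(pred_latents[-1] - goal_latent)

def avoid_cost(pred_latents, chaser_latent):
    """Avoidance cost: the negated goal cost, so a cost-minimizing planner
    now maximizes distance in representation space to the state to avoid."""
    return -goal_cost(pred_latents, chaser_latent)
```

A sampling-based planner such as MPPI only ranks candidate action sequences by their cost, so swapping `goal_cost` for `avoid_cost` changes the task without retraining anything.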

In this experiment, we test the methods’ ability to generalize to new environments. Generalization to new environment variations is a requirement for any truly general RL agent, as it is impossible to collect data for every scenario. To test this, we introduce another navigation environment featuring more complex dynamics and configurable layouts (see Figure 2 for an example). We use the Mujoco PointMaze environment [TET12] and generate varied maze layouts by randomly permuting wall locations. The data is collected by initializing the agent at a random location and sampling actions randomly at every step. The observation space contains a top-down RGB image of the maze and the velocity of the agent, while the action is a 2D acceleration vector. The goal is to reach a randomly sampled goal state in the environment. For more details about the environment, see Section A.2.

To study the generalization ability of our agents, we vary the number (5, 10, 20, 40) of pre-training maze layouts in the offline dataset and evaluate the trained agents on a held-out set of unseen layouts. Furthermore, for agents trained on 5 layouts, we analyze how their performance is affected by the degree to which the test layouts differ in distribution from the training layouts. We show the results in Figure 2, with more details in Figure 10. PLDM demonstrates the best performance, generalizing to unseen environments even when trained on as few as five maps, while other methods fail. In particular, as the test layouts move further out of distribution from the training layouts, all methods except PLDM suffer drops in performance. To make sure all methods are able to solve the task, we also evaluate them on a fixed layout and find that all reach a 100% success rate (Table 4). Figures 9 and 10 show the plans inferred by PLDM at test time, as well as the different agents’ trajectories.

In this work, we conducted a comprehensive study of existing methods for learning from reward-free offline data, spanning both RL and optimal control, aiming to identify the most promising approaches for leveraging large datasets of suboptimal trajectories. Our findings highlight HILP and PLDM as the strongest candidates, with PLDM demonstrating the best generalization to new environment layouts and tasks. We aggregate our experimental results in Table 1. Overall, we draw three main conclusions:

PLDM works well across different dataset settings, and is able to learn from poor data and generalize to novel environments and tasks. Therefore, we believe that learning latent dynamics models is a promising candidate for pre-training on large datasets of suboptimal trajectories. Dynamics learning can also be extended to data without actions by modeling them as a latent variable [Seo+22, Ye+22]. We believe that other non-generative objectives for latent representation learning [Oqu+23, Bar+24] can be used to further improve performance. Another promising direction of research is planning itself. Dynamics learning and planning bring their own set of issues, including accumulating prediction errors [LPC22] and increased computational complexity during inference. In our case, we used MPPI [WAT15] for planning with the learned dynamics model, which takes a considerable amount of time, making evaluation with PLDM about 100 times slower than model-free methods (see Appendix D for details). Further research into making planning more efficient, e.g. by backpropagating through the forward model [BXS20], is needed. In domains where inference speed is important, plans can also be used as targets to train a policy [Liu+22].

Limitations. All our experiments were conducted in simple navigation environments, and it is unclear if these findings will translate to more complex environments, e.g. physical robots. However, we argue that the conceptual understanding of the effects of data quality on the investigated methods will hold, as even in the relatively simple setting, we see many recent methods break down in surprising ways.

We build our own top-down navigation environment. It is implemented in PyTorch [Pas+19] and supports GPU acceleration. The environment does not model momentum, i.e., the agent has no velocity and is moved by the specified action vector at each step. When an action would take the agent through a wall, the agent is moved to the intersection point between the action vector and the wall. We generate the specified datasets and save them to disk for our experiments. Generating the datasets takes under 30 minutes.
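The wall-collision rule can be sketched for a single axis-aligned wall. This is a simplified illustration under our own assumptions (the actual environment also handles the door opening and the outer borders):

```python
def clip_to_wall(pos, action, wall_x, eps=1e-6):
    """Move `pos` by `action`, stopping at a vertical wall at x = wall_x.

    If the movement segment crosses the wall, the agent is placed at the
    intersection point, offset by eps so it does not end up inside the wall.
    pos, action: (x, y) tuples.
    """
    x0, y0 = pos
    dx, dy = action
    x1, y1 = x0 + dx, y0 + dy
    crosses = (x0 - wall_x) * (x1 - wall_x) < 0  # endpoints on opposite sides
    if not crosses:
        return (x1, y1)
    t = (wall_x - x0) / dx  # fraction of the step at which the wall is hit
    return (wall_x - eps * (1 if dx > 0 else -1), y0 + t * dy)
```

Since `crosses` requires a sign change in `x - wall_x`, the division by `dx` is only reached when `dx` is nonzero.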

Here, we build upon the Mujoco PointMaze environment [TET12], which contains a point-mass agent with a 4D state vector (global x, global y, v_x, v_y), where v is the agent velocity. To allow our models to perceive the different maze layouts, we use as model input a top-down view of the maze rendered as a (64, 64, 3) RGB image tensor instead of relying on (global x, global y) directly.

Mujoco PointMaze allows for customization of the maze layout via a grid structure, where each grid cell is either a wall or free space. We opt for a 4×4 grid (excluding the outer wall). Maze layouts are generated randomly, with only the following constraints enforced: 1) all the space cells are interconnected; 2) the percentage of space cells ranges from 50% to 75%.
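The two constraints suggest a simple rejection-sampling loop: draw a random grid, then keep it only if the space fraction is in range and all space cells are connected. A sketch under our own assumptions (the wall probability and function names are illustrative, not from the released code):

```python
import random
from collections import deque

def connected(grid):
    """True if all space cells (value 0) in the square grid are connected."""
    n = len(grid)
    space = [(i, j) for i in range(n) for j in range(n) if grid[i][j] == 0]
    if not space:
        return False
    seen, queue = {space[0]}, deque([space[0]])
    while queue:  # BFS flood fill from an arbitrary space cell
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n and grid[ni][nj] == 0 and (ni, nj) not in seen:
                seen.add((ni, nj))
                queue.append((ni, nj))
    return len(seen) == len(space)

def sample_layout(n=4, rng=None):
    """Rejection-sample a layout: 1 = wall, 0 = space; 50-75% space, connected."""
    rng = rng or random.Random(0)
    while True:
        grid = [[int(rng.random() < 0.35) for _ in range(n)] for _ in range(n)]
        frac = sum(row.count(0) for row in grid) / (n * n)
        if 0.5 <= frac <= 0.75 and connected(grid):
            return grid
```

Rejection sampling is cheap here because valid 4×4 layouts are common; larger grids may need a smarter generator.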

We set action repeat to 4 for our version of the environment.

We produce four training datasets with the following parameters:

Each episode is collected by placing the agent at a random (global x, global y) location in the maze and setting the agent velocity (v_x, v_y) by randomly sampling a 2D vector with ‖v‖ ≤ 5, since v_x and v_y are clipped to the range [-5, 5] in the environment.

All the test layouts used during evaluation are disjoint from the training layouts. For each layout, trials are created by randomly sampling a start and goal position guaranteed to be at least 3 cells apart in the maze. The same set of layouts and trials is used to evaluate all agents for a given experimental setting.

We evaluate agents in two scenarios: 1) how agents perform on test layouts when trained on varying numbers of training layouts; 2) given a constant number of training layouts, how agents perform on test maps with varying degrees of distribution shift from the training layouts.

For scenario 1), we evaluate the agents on 40 randomly generated test layouts, with 1 trial per layout.

For scenario 2), we randomly generate test layouts and partition them into groups of 5, where all the layouts in each group have the same degree of distribution shift from the training layouts, as measured by the metric D_min defined as follows:

Given training layouts {L_train^(1), L_train^(2), …, L_train^(N)} and a test layout L_test, let d(L_1, L_2) denote the edit distance between the binary grid representations of two layouts L_1 and L_2. We quantify the distribution shift of L_test as D_min = min_{i ∈ {1, 2, …, N}} d(L_test, L_train^(i)).
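Since layouts are same-shape binary grids, the edit distance between two layouts reduces to counting differing cells (a Hamming distance), and D_min is a minimum over the training set. A short sketch (function names are ours):

```python
def edit_distance(a, b):
    """Number of differing cells between two same-shape binary grid layouts."""
    return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def d_min(test_layout, train_layouts):
    """Distribution shift of a test layout: distance to the closest training layout."""
    return min(edit_distance(test_layout, t) for t in train_layouts)
```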

In this second scenario, we evaluate 5 trials per layout, for a total of 5 × 5 = 25 trials per group.

For CRL, GCBC, GCIQL, and HIQL, we use the implementations from the OGBench repository (https://github.com/seohongpark/ogbench) [Par+24]. Likewise, for HILP we use the official implementation from its authors (https://github.com/seohongpark/HILP).

For the Diverse PointMaze environment, to keep things consistent with our implementation of PLDM (C.1.3), instead of using frame stacking, we append the agent velocity directly to the encoder output.

To prevent collapse, we introduce a VICReg-based [BPL21] objective. We modify it to apply the variance objective across the time dimension, encouraging features to capture information that changes, as opposed to information that stays fixed [Sob+22]. The objective to prevent collapse is defined as follows:

We also apply a tunable objective to enforce the temporal smoothness of learned representations:

The combined objective is a weighted sum of the above terms.
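A NumPy sketch of the anti-collapse regularizers and the weighted sum (the actual implementation is in PyTorch; the coefficients shown are placeholders, and the full objective also includes the time-smoothness and IDM terms):

```python
import numpy as np

def variance_loss(z, gamma=1.0, eps=1e-4):
    """Hinge on the per-feature std across the TIME axis (dim 0).

    z: (T, B, D) latents. Penalizing low std over time pushes features to
    encode information that changes, not information that stays fixed.
    """
    std = np.sqrt(z.var(axis=0) + eps)            # (B, D)
    return np.maximum(0.0, gamma - std).mean()

def covariance_loss(z):
    """Penalize off-diagonal feature covariance, per time step (needs B >= 2)."""
    t, b, d = z.shape
    zc = z - z.mean(axis=1, keepdims=True)
    cov = np.einsum('tbi,tbj->tij', zc, zc) / (b - 1)   # (T, D, D)
    idx = np.arange(d)
    cov[:, idx, idx] = 0.0                               # drop the diagonal
    return (cov ** 2).sum(axis=(1, 2)).mean() / d

def partial_jepa_loss(z_hat, z, alpha=1.0, beta=1.0):
    """Prediction error plus the weighted anti-collapse terms (partial objective)."""
    sim = ((z_hat - z) ** 2).mean()
    return sim + alpha * variance_loss(z) + beta * covariance_loss(z)
```

Applying the variance hinge over the time axis (rather than the batch axis, as in standard VICReg) is the modification described above.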

We use the same Impala Small encoder as the other methods from OGBench [Par+24]. For the predictor, we use a 2-layer gated recurrent unit (GRU) [Cho14] with 512 hidden dimensions; the predictor input at timestep t is a 2D displacement vector representing the agent action at timestep t, while the initial hidden state is h_θ(s_0), i.e., the encoded state at timestep 0. A single layer-normalization layer is applied to the encoder and predictor outputs across all timesteps. Parameter counts are the following:

For the Diverse PointMaze environment, we use convolutional networks for both the encoder and the predictor. To fully capture the agent’s state at timestep t, we first encode the top-down view of the maze to get a spatial representation of the environment, h_θ: R^(3×64×64) → R^(16×26×26), z^env = h_θ(s^env). We incorporate the agent velocity by first transforming it into planes, Expander2D: R^2 → R^(2×26×26), s^vp = Expander2D(s^v), where each slice s^vp[i] is filled with s^v[i]. Then, we concatenate the expanded velocity tensor with the spatial representation along the channel dimension to get our overall representation: z = concat(s^vp, z^env, dim=0) ∈ R^(18×26×26).

For the predictor input, we concatenate the state s_t ∈ R^(18×26×26) with the expanded action Expander2D(a_t) ∈ R^(2×26×26) along the channel dimension. The predictor output has the same dimension as the representation: ẑ ∈ R^(18×26×26). Both the encodings and the predictions are flattened for computing the VICReg and IDM objectives.
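The Expander2D operation is just a broadcast of a vector into constant planes. A NumPy sketch of the expansion and the channel-wise concatenation (the real module is a PyTorch layer):

```python
import numpy as np

def expander2d(v, size=26):
    """Broadcast a length-k vector into k constant planes of shape (size, size).

    Mirrors Expander2D: R^k -> R^{k x size x size}, plane i filled with v[i].
    """
    v = np.asarray(v, dtype=np.float32)
    return np.broadcast_to(v[:, None, None], (v.shape[0], size, size)).copy()

# Concatenate expanded velocity with a stand-in spatial encoding along channels:
z_env = np.zeros((16, 26, 26), dtype=np.float32)   # placeholder for h_theta(s_env)
z = np.concatenate([expander2d([0.5, -1.0]), z_env], axis=0)
print(z.shape)  # (18, 26, 26)
```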



Unless listed below, all hyperparameters remain at their default values from the OGBench and HILP repositories.

For all methods, we used a learning rate of 3e-4 and kept the remaining hyperparameters at their defaults.

The best-case setting is: sequence length = 91, dataset size = 3M, non-random % = 100, wall-crossing % ≈ 35. For our experiments, we vary each of these parameters individually.

Offline RL. Offline RL aims to learn behaviors purely from offline data, without online interactions. A major challenge is preventing the learned policy from selecting trajectories that were not seen in the dataset. CQL [Kum+20] relies on model conservatism to prevent the learned policy from being overly optimistic about trajectories not observed in the data. IQL [KNL21] introduces an objective that avoids evaluating the Q-function on state-action pairs unseen in the data, preventing value overestimation. MOPO [Yu+20] is a model-based approach to learning from offline data that uses model disagreement to constrain the policy. See [Lev+20] for a more in-depth survey.

Foundation models in RL. Following the success of large sequence models in NLP, the RL community has recently invested considerable effort in training similar models, sparking dataset collection efforts such as Open-X-Embodiment [Col+23] and DROID [Kha+24]. These large datasets have enabled training models such as RT-2 [Bro+23] and Octo [Oct+24]. See [Yan+23a] for a more extensive survey of the topic.

Training representations for RL. Another way to use large amounts of data to improve RL agents is self-supervised learning (SSL). CURL [LSA20] introduces an SSL objective alongside the standard RL objectives. Later works also explore a separate pre-training stage [Sch+21, Zha+22, Nai+22]. [Zho+24] show that pre-trained visual representations from DINO [Car+21, Oqu+23] can be used to learn a world model for planning.

Table: S3.T1: Roadmap of our generalization stress-testing experiments. We test four offline goal-conditioned methods (HIQL, GCIQL, CRL, GCBC), a zero-shot RL method (HILP), and a learned latent dynamics planning method (PLDM). ★★★ denotes good performance in the specified experiment, ★★✩ denotes average performance, and ★✩✩ denotes poor performance. We see that HILP and PLDM are the best-performing methods, with PLDM standing out as the only method that reaches competitive performance in all settings.

Property (Experiment section) | HILP | HIQL | GCIQL | CRL | GCBC | PLDM
Transfer to new environments (4.7) | ★✩✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★★★
Transfer to a new task (4.6) | ★★✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★★★
Data efficiency (4.3) | ★✩✩ | ★★✩ | ★★✩ | ★★✩ | ★★✩ | ★★★
Best-case performance (4.2) | ★★★ | ★★★ | ★★★ | ★★★ | ★★✩ | ★★✩
Can learn from random policy trajectories (4.5) | ★★★ | ★✩✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★★✩
Can stitch suboptimal trajectories (4.4) | ★★★ | ★✩✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★★✩
Competitive performance in all settings | ✗ | ✗ | ✗ | ✗ | ✗ | ✓

Table: A1.T2: Details for Diverse PointMaze datasets

# Transitions | # layouts | # episodes per layout | Episode length
1,000,000 | 5 | 2000 | 100
1,000,000 | 10 | 1000 | 100
1,000,000 | 20 | 500 | 100
1,000,000 | 40 | 250 | 100

Table: A4.T3: Time of evaluation on one episode in the Two-Rooms environment. PLDM is about 100 times slower than model-free methods. Time is calculated by running on 25 episodes.

Method | Time per episode (seconds)
PLDM | 13.44 ± 0.11
GCIQL | 0.12 ± 0.03
HIQL | 0.16 ± 0.03

Table: A5.T4: Results averaged over 3 seeds ± std

Method | Success rate
PLDM | 0.990 ± 0.001
CRL | 0.980 ± 0.001
GCBC | 0.970 ± 0.024
GCIQL | 1.000 ± 0.000
HIQL | 1.000 ± 0.000
HILP | 1.000 ± 0.000

Table: A6.T5: HILP hyperparameters

Hyperparam | Value
Expectile | 0.7
Skill expectile | 0.7

Table: A6.T8: Dataset specific hyperparameters of CRL, GCBC, GCIQL, HIQL, HILP for the Diverse PointMaze environment. For HILP, we set the same value for expectile and skill expectile.

Dataset | CRL LR | GCBC LR | GCIQL LR | GCIQL Expectile | HIQL LR | HIQL Expectile | HILP LR | HILP Expectile
# map layouts = 5 | 0.0003 | 0.0003 | 0.0002 | 0.8 | 0.0001 | 0.7 | 0.0001 | 0.9
# map layouts = 10 | 0.0003 | 0.0001 | 0.0001 | 0.9 | 0.0001 | 0.7 | 0.0001 | 0.9
# map layouts = 20 | 0.0003 | 0.0001 | 0.0001 | 0.6 | 0.0003 | 0.7 | 0.0001 | 0.9
# map layouts = 40 | 0.0003 | 0.0001 | 0.0003 | 0.9 | 0.0001 | 0.9 | 0.0001 | 0.9

Overview of our analysis. We test six methods for learning from offline reward-free trajectories on 23 different datasets across two top-down navigation environments. We evaluate six generalization properties required to scale to large offline datasets of suboptimal trajectories. We find that planning with a latent dynamics model (PLDM) demonstrates the highest level of generalization. For a full comparison, see Table 1. Right: diagram of PLDM. Circles represent variables, rectangles represent loss components, and half-ovals represent trained models.


Testing the selected methods’ performance under different dataset constraints. Values and shaded regions are means and standard deviations over 3 seeds, respectively. Left: to test the importance of dataset quality, we mix random-policy trajectories with good-quality trajectories (see Figure 7). As the amount of good-quality data goes to 0, methods begin to fail, with PLDM and HILP being the most robust. Center: we measure the methods’ performance when trained with different sequence lengths. Many goal-conditioned methods fail when training trajectories are short, which causes faraway goals to become out-of-distribution for the resulting policy. Right: we measure the methods’ performance with datasets of varying sizes. PLDM is the most sample-efficient, reaching almost 50% success rate even with a few thousand transitions.


Left: plans generated by PLDM at test time. Right: actual agent trajectories for the tested methods. PLDM is the only method that reliably succeeds on held-out mazes.


$$ \mathcal{L}_{\mathrm{sim}}=\sum_{t=0}^{H}\frac{1}{N}\sum_{b=0}^{N}\left\|\hat{Z}_{t,b}-Z_{t,b}\right\|_{2}^{2} $$

$$ \hat{z}_{0}=z_{0}=h_{\theta}(s_{0}),\qquad\hat{z}_{t}=f_{\theta}(\hat{z}_{t-1},a_{t-1}) $$

$$ \mathcal{L}_{\mathrm{JEPA}}=\mathcal{L}_{\mathrm{sim}}+\alpha\mathcal{L}_{\mathrm{var}}+\beta\mathcal{L}_{\mathrm{cov}}+\lambda\mathcal{L}_{\mathrm{time\text{-}var}}+\delta\mathcal{L}_{\mathrm{time\text{-}sim}}+\omega\mathcal{L}_{\mathrm{IDM}} $$

Acknowledgments

This work was supported through the NYU IT High Performance Computing resources, services, and staff expertise, by the Institute of Information & Communications Technology Planning & Evaluation (IITP) with a grant funded by the Ministry of Science and ICT (MSIT) of the Republic of Korea in connection with the Global AI Frontier Lab International Collaborative Research, by the Samsung Advanced Institute of Technology (under the project Next Generation Deep Learning: From Pattern Recognition to AI) and the National Science Foundation (under NSF Award 1922658), and in part by AFOSR grant FA9550-23-1-0139 "World Models and Autonomous Machine Intelligence" and by ONR MURI N00014-22-1-2773 "Self-Learning Perception through Interaction with the Real World".

Appendix

Reproducibility

Code is available at https://github.com/vladisai/PLDM

An implementation of our method is provided in the GitHub repository above, which includes environments, data generation, training, and evaluation scripts. Detailed hyperparameters for our method and the baselines are provided in Appendix J.

Visualization of Plans and Trajectories for Diverse PointMaze

Figure 8: Top: the training layouts used in the 5-layout setting. Middle: trajectories of different agents navigating an unseen maze layout toward the goal at test time. As the layouts become increasingly out-of-distribution, only PLDM consistently succeeds. Layouts can be represented as a 4×4 array, with each value being either a wall or empty space. The distribution shift is quantified by the minimum edit distance between a given test layout and the closest training layout. The top row corresponds to an in-distribution layout with a minimum edit distance of 0, and with each subsequent row, the minimum edit distance increases by 1.


Environments and Datasets

Two-Rooms Environment

We begin with a navigation task called Two-Rooms, featuring a point-mass agent. Each observation x_t ∈ R^(2×64×64) is a top-down view: the first channel encodes the agent, the second the walls (Figure 3). Actions a ∈ R^2 denote the displacement of the agent position from one time step to the next, with a norm limit of 2.45. The goal is to reach a randomly sampled state within 200 steps. See Appendix C.2 for more details. This environment allows for controlled data generation, making it ideal for efficient and thorough experimentation while still not being trivial.

Offline data. To generate offline data, we place the agent at a random location in the environment and execute a sequence of actions for T steps, where T denotes the episode length. The actions are generated by first picking a random direction and then sampling action directions from a Von Mises distribution with concentration 5 around it. The step size is uniform between 0 and 2.45. Unless otherwise specified, the episode length is T = 91 and the total number of transitions in the data is 3 million.
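The data-generation loop can be sketched as follows. This is a simplified sketch under our own naming (`generate_episode` is not from the released code), and wall handling is omitted:

```python
import numpy as np

def generate_episode(start, T=91, kappa=5.0, max_step=2.45, rng=None):
    """Sample a Two-Rooms style episode: pick a random heading, then draw each
    action direction from a Von Mises around it and a uniform step size."""
    rng = rng if rng is not None else np.random.default_rng()
    heading = rng.uniform(0.0, 2.0 * np.pi)
    pos = np.array(start, dtype=np.float64)
    states = [pos.copy()]
    for _ in range(T):
        theta = rng.vonmises(heading, kappa)          # direction near the heading
        step = rng.uniform(0.0, max_step)             # step size in [0, 2.45)
        pos = pos + step * np.array([np.cos(theta), np.sin(theta)])
        states.append(pos.copy())                     # wall collisions omitted
    return np.stack(states)
```

A larger concentration `kappa` yields straighter, farther-reaching trajectories; setting it to 0 recovers the fully random policy used in the trajectory-quality experiment.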



Ant U-Maze

To investigate whether our findings generalize to environments with more complicated control dynamics, we test the methods on the Ant U-Maze environment [66] with 8-dimensional action space and 29-dimensional state space. Similar to our previous analysis on Two-Rooms, we showcase the trajectory stitching capabilities of different methods.


Models


PLDM


Ablations of Objective Components

We conduct a careful ablation study over each loss component by setting its coefficient to zero. Two-Room ablations are performed in the optimal setting with sequence length 90, dataset size 3M, and all expert data. Diverse Maze ablations are performed in the 5 training maps setting.

Model Details for Two-Rooms


total params: 2218672
encoder params: 1426096
predictor params: 793600

Model Details for Diverse PointMaze Environment


We set the planning-frequency (Section 3.3) in MPPI to k = 4 for this environment.

The full model architecture is summarized using PyTorch-like notations.

total params: 53666
encoder params: 33296
predictor params: 20370

PLDM(
  (backbone): MeNet6(
    (layers): Sequential(
      (0): Conv2d(3, 16, kernel_size=(5, 5), stride=(1, 1))
      (1): GroupNorm(4, 16, eps=1e-05, affine=True)
      (2): ReLU()
      (3): Conv2d(16, 32, kernel_size=(5, 5), stride=(2, 2))
      (4): GroupNorm(8, 32, eps=1e-05, affine=True)
      (5): ReLU()
      (6): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1))
      (7): GroupNorm(8, 32, eps=1e-05, affine=True)
      (8): ReLU()
      (9): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (10): GroupNorm(8, 32, eps=1e-05, affine=True)
      (11): ReLU()
      (12): Conv2d(32, 16, kernel_size=(1, 1), stride=(1, 1))
    )
    (propio_encoder): Expander2D()
  )
  (predictor): ConvPredictor(
    (layers): Sequential(
      (0): Conv2d(20, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): GroupNorm(4, 32, eps=1e-05, affine=True)
      (2): ReLU()
      (3): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): GroupNorm(4, 32, eps=1e-05, affine=True)
      (5): ReLU()
      (6): Conv2d(32, 18, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    )
    (action_encoder): Expander2D()
  )
)

Model Details for Ant-U-Maze

We encode the global (x, y) position using a 2-layer MLP into a 256-dimensional embedding, and concatenate it with the rest of the raw proprioceptive state to form our overall state representation. Our predictor is a 3-layer MLP with an ensemble size of 5. During training, variance and covariance regularization is applied only to the part of the representation corresponding to (x, y) (the first 256 dimensions), since the rest of the proprioceptive state is not encoded and therefore cannot collapse.
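A minimal NumPy sketch of this selective regularization, assuming VICReg-style variance and covariance penalties (exact coefficients and implementation details are omitted; names are ours):

```python
import numpy as np

def variance_penalty(z, eps=1e-4):
    # Hinge on the per-dimension standard deviation, as in VICReg:
    # penalize dimensions whose std falls below 1.
    std = np.sqrt(z.var(axis=0) + eps)
    return np.mean(np.maximum(0.0, 1.0 - std))

def covariance_penalty(z):
    # Sum of squared off-diagonal covariance entries, scaled by dim.
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (z.shape[0] - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return (off_diag ** 2).sum() / z.shape[1]

# Batch layout: [encoded (x, y) embedding | raw proprioceptive state].
batch = np.random.randn(64, 256 + 27)
z_xy = batch[:, :256]  # only this learned part can collapse
reg = variance_penalty(z_xy) + covariance_penalty(z_xy)
```

The raw proprioceptive slice (the last 27 dimensions here, so the total matches the 283-dimensional predictor output) is excluded from `reg`, mirroring the text above.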

total params: 1080615
encoder params: 9120
predictor params: 1072007
PLDM(
  (backbone): MLPEncoder(
    (globa_xy_encoder): Sequential(
      (0): Linear(in_features=2, out_features=32, bias=True)
      (1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (2): Mish(inplace=True)
      (3): Linear(in_features=32, out_features=256, bias=True)
      (4): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    )
    (proprio_encoder): Identity()
  )
  (predictor): MLPPredictor(
    (layers): Sequential(
      (0): Linear(in_features=291, out_features=256, bias=True)
      (1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (2): Mish(inplace=True)
      (3): Linear(in_features=256, out_features=256, bias=True)
      (4): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
      (5): Mish(inplace=True)
      (6): Linear(in_features=256, out_features=283, bias=True)
      (7): LayerNorm((283,), eps=1e-05, elementwise_affine=True)
    )
  )
)

Effects of Uncertainty Regularization via Ensembles

Figure 9: Top Row: Two-Rooms environment. Bottom Row: AntMaze environment.

Analyzing planning time of PLDM

To estimate how computationally expensive it is to run planning with a latent dynamics model, we evaluate PLDM, GCIQL, and HIQL on 25 episodes in the Two-Rooms environment. Each episode consists of 200 steps. We record the average time per episode and its standard deviation. We omit HILP, GCBC, and CRL because their resulting policy architecture is the same as GCIQL's, making their evaluation time identical. HIQL takes more time due to its hierarchy of policies. When replanning every step, PLDM is slower than the policies. However, PLDM can match the latency of the policies by replanning less frequently, with a negligible performance drop.
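The reduced replanning frequency corresponds to a simple control loop. Below is a runnable sketch with a toy planner and dynamics; `plan` and `env_step` are hypothetical placeholders, not our actual MPPI implementation:

```python
def run_episode(plan, env_step, s0, horizon=200, k=4):
    """Replan every k steps; execute the first k planned actions open-loop."""
    s, executed, t = s0, [], 0
    while t < horizon:
        actions = plan(s)      # expensive: e.g. MPPI over the latent model
        for a in actions[:k]:  # cheap: execute without replanning
            s = env_step(s, a)
            executed.append(a)
            t += 1
            if t >= horizon:
                break
    return s, executed

# Toy 1D example: the "planner" always proposes eight +1 actions.
final, acts = run_episode(plan=lambda s: [1] * 8,
                          env_step=lambda s, a: s + a,
                          s0=0)
```

With k = 4 the expensive `plan` call runs on only a quarter of the steps, which is the source of the latency savings reported in Table 5.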

Figure 10: Comparing PLDM's performance under a fixed inference-time compute budget on Two-Rooms. Across Two-Rooms experiments, PLDM performs only slightly worse when replanning every 4 steps compared to replanning every step.

Table 5: Time of evaluation for one episode in the Two-Rooms environment, averaged across 25 episodes. PLDM success rates are normalized against the setting that replans every step. PLDM can match the latencies of GCIQL and HIQL by replanning less frequently, with negligible cost to performance.

Evaluating PLDM Performance With Adjusted Inference Compute Budget

As Table 5 shows, replanning every 4 steps brings PLDM close to the other methods' inference speed. To further compare PLDM against the other methods under a fixed inference-time compute budget, we run additional evaluations and report them in Figure 10 and Figure 11.

Extended Baselines: Input Reconstruction Learning and TD-MPC2

In this section, we evaluate reconstruction-based objectives for training the encoder and dynamics model, as well as TD-MPC2 [28]. For reconstruction, we compare two approaches: one based on Dreamer [25], and one that replaces our VICReg objective with pixel reconstruction.

Dreamer We adapt DreamerV3 [26] to our setting by:

Table 6: Comparing reconstruction-based latent-dynamics learning and TD-MPC2 to other baselines. We test the methods on good-quality data in the Two-Rooms environment. Reconstruction-based methods perform significantly worse than PLDM and the other baselines. TD-MPC2 fails to learn altogether due to collapse, and performs poorly even with the added IDM objective.

This is far from the setting DreamerV3 was designed for, making the comparison imperfect. However, we believe it serves to highlight that, compared to DreamerV3, PLDM is designed for a very different setting in terms of available data.

Reconstruction As opposed to Dreamer, this baseline uses the same encoder and dynamics architecture as PLDM instead of an RSSM, and only replaces the VICReg objective with a reconstruction term. The decoder architecture mirrors that of the encoder.
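As an illustration, the swap amounts to replacing the representation-space objective with a pixel-space MSE (a sketch with dummy shapes; `decode` stands in for the mirrored decoder, which we do not specify here):

```python
import numpy as np

def reconstruction_loss(decode, z, s):
    # Replace the VICReg objective with pixel-space MSE: decode the
    # latent and compare with the original observation.
    return np.mean((decode(z) - s) ** 2)

rng = np.random.default_rng(0)
s = rng.normal(size=(3, 64, 64))          # RGB observation
z = rng.normal(size=(18, 26, 26))         # latent representation
decode = lambda z: np.zeros((3, 64, 64))  # hypothetical mirrored decoder
loss = reconstruction_loss(decode, z, s)
```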

TD-MPC2 Like Dreamer, TD-MPC2 [28] also relies on rewards to learn its representations. To adapt TD-MPC2 to our setting, we remove the reward-prediction component and keep only the objective enforcing consistency between the predictor and encoder outputs.
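The adapted objective can be sketched as follows (a NumPy illustration of the latent-consistency term, plus the IDM term discussed below; the actual TD-MPC2 implementation differs in architecture and optimization details):

```python
import numpy as np

def consistency_loss(z_pred, z_next):
    # Match the predictor output to the encoder embedding of the next state.
    return np.mean((z_pred - z_next) ** 2)

def idm_loss(a_pred, a_true):
    # Inverse dynamics: predict the action from consecutive embeddings;
    # this anchors the representation and discourages collapse.
    return np.mean((a_pred - a_true) ** 2)

# Hypothetical embeddings and actions for a batch of transitions.
rng = np.random.default_rng(0)
z_pred, z_next = rng.normal(size=(32, 64)), rng.normal(size=(32, 64))
a_pred, a_true = rng.normal(size=(32, 2)), rng.normal(size=(32, 2))
total = consistency_loss(z_pred, z_next) + idm_loss(a_pred, a_true)
```

Without the IDM term, the consistency loss alone admits a trivial constant solution, which is consistent with the collapse reported in Table 6.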

Results are shown in Table 6. We test all methods on the Two-Rooms environment described in Section 4.1. We use good-quality data, with long trajectories and good transition coverage. We find that pixel-observation reconstruction is not a good objective for learning representations, and using it results in poor planning performance. TD-MPC2 collapses altogether, achieving a 0% success rate. To prevent collapse, we add inverse dynamics modeling (IDM), which somewhat improves performance, although it remains far behind the other methods. We only tested one seed for TD-MPC2 + IDM.

Dreamer

Reward-free offline RL refers to learning from offline data that does not contain rewards in a task-agnostic way. The goal is to extract general behaviors from offline data to solve a variety of downstream tasks. One approach uses goal-conditioned RL, with goals sampled in a manner similar to Hindsight Experience Replay [1]. Park et al. [52] show that this can be applied to learn a goal-conditioned policy using IQL, as well as to learn a hierarchical value function. Hatch et al. [29] propose using a small set of observations corresponding to the solved task to define the task and learn from reward-free data. Hu et al. [30] and Yu et al. [79] propose to use labeled data to train a reward function, then use it to label the reward-free trajectories.

Figure 2: Left: We train offline goal-conditioned agents on trajectories collected in a subset of maze layouts (left), and evaluate on held-out layouts, observing the trajectories shown on the right. Only PLDM solves the task (see Figure 8 for more). Right: Success rates of the tested methods on held-out layouts, as a function of the number of training layouts. The rightmost plot shows success rates of models trained on data from five layouts, evaluated on held-out layouts ranging from those similar to the training layouts to out-of-distribution ones. We use map-layout edit distance from the training layouts as a measure of distribution shift. PLDM demonstrates the best generalization performance. Results are averaged over 3 seeds; the shaded area denotes standard error. See Figure 1 for more details on PLDM.

Zero-shot methods go beyond goal-reaching from offline data and aim to solve arbitrary tasks specified at test time. HILP [53] proposes learning a distance-preserving representation space such that the distance in that space is proportional to the number of steps between two states, similar to Laplacian representations [69, 70, 73]. Forward-Backward representations [67, 68] tackle this with an approach akin to successor features [5].

Optimal Control, similar to RL, tackles the problem of selecting actions in an environment to optimize a given objective (reward for RL, cost for control). Classical optimal control methods typically assume that the transition dynamics of the environment are known [7]. This paradigm has been used to control aircraft, rockets, and missiles [11], as well as humanoid robots [35, 57]. When the transition dynamics cannot be specified precisely, they can often be learned [65, 71]. Many RL methods approximate dynamic programming in the context of unknown dynamics [6, 62]. In this work, we use the term RL for methods that implicitly or explicitly use reward information to train a policy function, and optimal control for methods that use a dynamics model and explicitly search for actions that optimize the objective.

The importance of offline data has been highlighted in works such as ExORL [76], which demonstrates that exploratory RL data enables off-policy algorithms to perform well in offline RL; however, it only compares exploratory vs. task-specific data, without analyzing which aspects of the data affect performance. Buckman et al. [12] investigate the importance of data for offline RL with rewards. The recently proposed OGBench [51] introduces multiple offline datasets for a variety of goal-conditioned tasks; in contrast, we conduct a more fine-grained analysis of how methods perform in top-down navigation under suboptimal data conditions and generalize to new tasks and layouts. Yang et al. [74] also study generalization of offline GCRL, but focus on reaching out-of-distribution goals. Ghugare et al. [24] study stitching generalization.

Reconstruction


While online reinforcement learning has enabled agents to master complex tasks, from Atari games [44] and Go [60] to controlling real robots [49], it demands massive quantities of environment interactions. For instance, OpenAI et al. [49] used the equivalent of 100 years of real-time hand-manipulation experience to train a robot to reliably handle a Rubik's cube. To address this inefficiency, offline RL methods [18, 33, 41] have been developed to learn behaviors from state-action trajectories with corresponding reward annotations. However, these methods typically train agents for a single task, limiting their reuse for other downstream tasks. To overcome this, recent work has explored learning behaviors from offline reward-free trajectories [32, 52, 53, 67]. This reward-free paradigm is particularly appealing, as it allows agents to learn from suboptimal data and use the learned policy to solve a variety of downstream tasks. For example, a system trained on low-quality robotic interactions with cloth can later generalize to tasks like folding laundry [9].

  • Equal contribution. Author ordering determined by coin flip.

39th Conference on Neural Information Processing Systems (NeurIPS 2025).

Figure 1: Overview of our analysis. We test six methods for learning from offline reward-free trajectories on 23 different datasets across several navigation environments. We evaluate for six generalization properties required to scale to large offline datasets of suboptimal trajectories. We find that planning with a latent dynamics model (PLDM) demonstrates the highest level of generalization. For a full comparison, see Table 1. Right: diagram of PLDM. Circles represent variables, rectangles represent loss components, and half-ovals represent trained models.

Optimal control tackles the challenge differently: instead of learning a policy function via trial and error, it plans actions using a known dynamics model [7, 63, 64]. Since real-world dynamics are often hard to specify exactly, many approaches instead learn the model from data [20, 71, 77]. This model-based approach has shown generalization in manipulation tasks involving unseen objects [17]. Importantly, dynamics models can be trained directly from reward-free offline trajectories, making this a compelling route [16, 56].

Despite significant advances in RL and optimal control, the role of pre-training data quality in reward-free offline learning remains largely unexplored. Prior work has primarily focused on RL methods trained on data from expert or exploratory policies [22, 76], without isolating the specific aspects of data quality that influence performance. In this work, we address this gap by systematically evaluating the strengths and limitations of various approaches for learning from reward-free trajectories. We assess how different learning paradigms perform under offline datasets that vary in both quality and quantity. To ground our study, we focus on navigation tasks, an essential aspect of many real-world robotic systems, where spatial reasoning, generalization, and trajectory stitching play a critical role. While this choice excludes domains such as manipulation, it offers a controlled yet challenging testbed for our comparative analysis.

  1. We propose two new navigation environments with granular control over the data generation process, and generate a total of 23 datasets of varying quality;
  2. We evaluate methods for learning from offline, reward-free trajectories, drawing from both reinforcement learning and optimal control paradigms. Our analysis systematically assesses their ability to learn from random policy trajectories, stitch together short sequences, train effectively on limited data, and generalize to unseen environment layouts and tasks beyond goal-reaching;
  3. We demonstrate that learning a latent dynamics model and using it for planning is robust to suboptimal data quality and achieves the highest level of generalization to environment variations;
  4. We present a list of guidelines to help practitioners choose between methods depending on available data and generalization requirements.

To facilitate further research into methods for learning from offline trajectories without rewards, we release code, data, environment visualizations, and more at latent-planning.github.io .

TD-MPC2

Analyzing Statistical Significance of Results

To analyze whether the results in this paper are statistically significant, we perform Welch's t-test comparing the performance of PLDM to each other method. Because 5 seeds are not enough for statistical tests, we pool results across settings and show the outcome in Table 7. We also run additional seeds for selected settings, for a total of 10 seeds per method, and show the results of this statistical analysis in Table 8. Overall, the results are significant, except for certain settings when comparing to HILP and GCIQL, which aligns with our findings.
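For reference, Welch's t-test does not assume equal variances between the two samples. A minimal pure-Python version of the statistic and its Welch-Satterthwaite degrees of freedom (in practice one would use a standard library implementation; the example numbers are illustrative, not our measurements):

```python
import math

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)  # unbiased variances
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Toy example: five success rates per method.
t, df = welch_t([0.9, 0.8, 1.0, 0.7, 0.85], [0.5, 0.4, 0.6, 0.45, 0.55])
```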

Table 7: Statistical significance of results (pooled). ✓ means that Welch's t-test showed that PLDM is better than the corresponding method when pooling results across seeds and dataset parameters.

Analyzing HILP's Out-of-Distribution Generalization

To understand HILP's poor generalization to out-of-distribution (OOD) maze layouts, we visualize distances in HILP's learned latent representation space. HILP learns a representation ϕ(s) such that ∥ϕ(s) − ϕ(s_g)∥₂ equals the lowest number of transitions needed to traverse from s to s_g. We hypothesize that ϕ fails to generalize to out-of-distribution maze layouts, resulting in incorrect predicted distances and in the failure of the goal-conditioned policy. We visualize the distances on in-distribution and out-of-distribution layouts for an encoder ϕ trained on 5 different layouts in Appendix I. HILP's distances are meaningful only on in-distribution layouts, and are very noisy on out-of-distribution ones. This failure of the latent-space distance to generalize confirms our hypothesis and highlights the strength of PLDM.
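The visualized quantity is simply the latent-space distance (a toy sketch; `phi` here is a hypothetical stand-in for HILP's learned encoder, not the trained network):

```python
import numpy as np

def latent_distance(phi, s, s_goal):
    # HILP trains phi so that this approximates the minimal number of
    # transitions from s to s_goal (on in-distribution layouts).
    return np.linalg.norm(phi(s) - phi(s_goal))

# Toy phi: identity on 2D positions, so the distance is Euclidean.
phi = lambda s: np.asarray(s, dtype=float)
d = latent_distance(phi, [0.0, 0.0], [3.0, 4.0])  # 5.0 for this toy phi
```

The heatmaps in Figure 12 evaluate exactly this quantity from a fixed goal state to every other state in the maze.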


(a) HILP distances on in-distribution maze layouts

Figure 12: HILP learns representations of states such that the distance in representation space between a pair of states equals the number of steps needed to travel between them. These plots visualize the distance from the state denoted with a red dot to other states across the maze. (a) On an in-distribution maze, the distances increase smoothly and mostly reflect the number of steps needed to travel between states. (b) On an out-of-distribution maze, the representation-space distances no longer make sense.

Hyperparameters

CRL, GCBC, GCIQL, HIQL, HILP

Unless listed below, all hyperparameters remain at their default values from the OGBench and HILP repositories.

Two-Rooms

To tune hyperparameters, we ran one grid search on good-quality data and another for sequence length 17, then tested both configurations on all settings. In the grid searches, we searched over the learning rate, expectiles (for GCIQL, HILP, and HIQL), and the probability of sampling a random goal. For CRL and GCBC, we ended up using the default OGBench parameters. The remaining parameters are described below:

HILP. We used learning rate of 3e-4, expectile of 0.7, and skill expectile of 0.7 in all settings.

GCIQL. For settings with dataset size of 634 as well as for good quality data, we used expectile 0.7, learning rate 3e-4, BC coefficient of 0.3, and 0 probability of sampling random goal. For all other settings, we used learning rate 3e-4, expectile of 0.7, BC coefficient of 0.003, and probability of sampling a random goal 0.3.

HIQL. For sequence length 17 and 33, we used learning rate of 3e-5, expectile of 0.7, AWR temperature of 3.0 (both high and low levels), and probability of sampling a random goal of 0.6. For all other settings, we used learning rate of 3e-4, expectile of 0.7, AWR temperature of 3.0 (both high and low levels), and probability of sampling a random goal of 0.


Diverse PointMaze

Here, we build on the MuJoCo PointMaze environment [66], which contains a point-mass agent with a 4D state vector (global x, global y, v_x, v_y), where v is the agent's velocity. To allow our models to perceive different maze layouts, we use as model input a top-down view of the maze rendered as a (64, 64, 3) RGB image tensor, instead of relying on (global x, global y) directly.

MuJoCo PointMaze allows customization of the maze layout via a grid structure, where each grid cell is either a wall or space. We opt for a 4 × 4 grid (excluding the outer wall). Maze layouts are generated randomly, with only the following constraints enforced: 1) all space cells are interconnected; 2) the percentage of space cells ranges from 50% to 75%.
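These two constraints can be checked with a flood fill; below is a sketch of the rejection-sampling loop (function names are ours, not the released code's):

```python
import random

def random_layout(n=4, rng=random):
    # 1 = space cell, 0 = wall (the outer wall is implicit).
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]

def valid(layout, lo=0.5, hi=0.75):
    n = len(layout)
    space = [(i, j) for i in range(n) for j in range(n) if layout[i][j]]
    frac = len(space) / (n * n)
    if not (lo <= frac <= hi):
        return False  # constraint 2: space-cell percentage in [50%, 75%]
    # Constraint 1: all space cells connected (flood fill from one of them).
    seen, stack = set(), [space[0]]
    while stack:
        i, j = stack.pop()
        if (i, j) in seen:
            continue
        seen.add((i, j))
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (i + di, j + dj) in space:
                stack.append((i + di, j + dj))
    return len(seen) == len(space)

def sample_valid_layout():
    # Rejection sampling: draw layouts until both constraints hold.
    while True:
        layout = random_layout()
        if valid(layout):
            return layout
```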

We set action repeat to 4 for our version of the environment.
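Action repeat simply replays each agent action for several simulator steps; a generic wrapper sketch (not the exact environment wrapper we use):

```python
class ActionRepeat:
    """Wrap a step function so each action is applied `repeat` times."""
    def __init__(self, step_fn, repeat=4):
        self.step_fn, self.repeat = step_fn, repeat

    def step(self, state, action):
        # One agent-level action -> `repeat` simulator-level steps.
        for _ in range(self.repeat):
            state = self.step_fn(state, action)
        return state

# Toy dynamics: the position integrates the action once per inner step.
env = ActionRepeat(step_fn=lambda s, a: s + a, repeat=4)
state = env.step(0.0, 0.5)  # one agent action -> four simulator steps
```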

Ant U-Maze

To investigate whether our findings generalize to environments with more complicated control dynamics, we test the methods on the Ant U-Maze environment [66], which has an 8-dimensional action space and a 29-dimensional state space. As in our previous analysis on Two-Rooms, we showcase the trajectory-stitching capabilities of the different methods.

PLDM


Ant-U-Maze

Table 17: Dataset-agnostic hyperparameters for Ant-U-Maze

Offline RL. This field aims to learn behaviors purely from offline data, without online interactions. As opposed to imitation learning [80], offline RL is capable of learning policies that are better than the policy that collected the data. A key challenge, however, is preventing the policy from selecting actions that were not seen in the dataset. CQL [36] relies on conservatism to prevent the learned policy from being overly optimistic about trajectories not observed in the data. IQL [33] introduces an objective that avoids evaluating the Q-function on state-action pairs not seen in the data, preventing value overestimation. MOPO [78] is a model-based approach to learning from offline data that uses model disagreement to constrain the policy. See [41] for a more in-depth survey.

Table 18: Dataset-specific hyperparameters for Ant-U-Maze

Foundation models in RL. Following the success of NLP, the RL community has put substantial effort into training large sequence models, which sparked dataset-collection efforts such as Open X-Embodiment [15] and DROID [31]. These large datasets have enabled training models such as RT-2 [10] and Octo [48]. See [75] for a more extensive survey on the topic.

Training representations for RL. Another way to use large amounts of data to improve RL agents is self-supervised learning (SSL). CURL [38] introduces an SSL objective in addition to the standard RL objectives. Later works also explore a separate pre-training stage [47, 58, 82]. Zhou et al. [83] show that pre-trained visual representations from DINO [13, 50] can be used to learn a world model for planning.



Reward-free offline RL proposes to learn from the offline data that does not contain rewards in a task-agnostic way. The goal is to extract general behaviors from the offline data to solve a variety of downstream tasks. One approach to this is to use goal-conditioned RL, and sample goals using a technique proposed in Hindsight Experience Replay [And+17]. [Par+24a] show that this can be applied to learn a goal-conditioned policy using IQL, as well as to learn a hierarchical value function. [Hat+22] proposes using a small set of observations corresponding to the solved task to define the task and learn from reward-free data. [KPL24] study how to transition from offline to online RL, and uses HILP [PKL24] for unsupervised pre-training, then fine-tunes it on online data. [Yu+22, Hu+23] propose to use labeled data to train a reward function, than label the reward-free trajectories.

Zero-shot methods go beyond just goal-reaching from offline data, and aim to solve any possible task specified during test time. HILP [PKL24] propose learning a distance-preserving representation space such that the distance in that space is proportional to the number of steps between two states, similar to Laplacian representations [WTN18, Wan+21, Wan+22]. Forward-Backward representations [TO21, TRO22] tackle this with an approach akin to successor-features [Bar+17]. [Fra+24] propose to learn a transformer model to encode target task’s state action sequences. [Che+23] propose to learn basis Q-functions that implicitly model dynamics and enable generalization to tasks that can be represented as a linear combination of the learned basis functions.

Optimal Control, similar to RL, tackles the problem of selecting actions of an agent in an environment in order to optimize a given objective (reward for RL, cost for control). Unlike RL, optimal control commonly assumes that the transition dynamics of the environment are known [Ber19]. This paradigm has been used to great success long before the advent of deep learning, and has enabled applications ranging from controlling aircraft, rockets and missiles [Bry96] to controlling humanoid robots [Kui+16, SM09]. When the transition dynamics cannot be defined precisely, they can often be learned. For example, [Wat+15] learns the environment dynamics, and uses iLQG [TL05a] on the linear approximation of the dynamics. When dynamics are unknown and need to be approximated, the line between RL and optimal control becomes blurry, as a lot of RL methods can be interpreted as approximating dynamic programming approaches in control in the context of unknown dynamics [Ber12, Sut18]. In this work, we use the term RL to refer to methods that either implicitly or explicitly use rewards information to train a policy function, and the term optimal control to refer to methods that use a dynamics model and, during inference, explicitly search for the best actions that optimize a given objective function.

Investigating the importance of offline data. ExORL [Yar+22] shows the importance of data for offline RL, demonstrating that data collected with unsupervised RL enables off-policy RL algorithms to perform well in the offline setting; however, that study only compares data collected by unsupervised RL against data from task-specific agents, without a finer-grained analysis of how different aspects of the data affect performance. [BGB20] investigates the importance of data for offline RL with rewards. The recently proposed OGBench [Par+24] introduces multiple versions of offline data for a variety of goal-conditioned tasks; in contrast, we focus on top-down navigation environments and build 23 different datasets to study methods’ generalization in detail, including to new tasks and environment layouts, whereas OGBench provides at most three dataset versions per task and focuses on the single-layout, goal-conditioned setting. [Cob+18] investigate generalization in RL using environment variations akin to those in Section 4.7, although that study uses the online setting with rewards. [Yan+23] also study generalization of offline GCRL, but focus on reaching out-of-distribution goals. [Ghu+24] study stitching generalization.

In this section, we formally introduce the setting of learning from state-action sequences without reward annotations and overview available approaches. We also introduce a method we call Planning with a Latent Dynamics Model (PLDM).

We consider a Markov decision process (MDP) ℳ = (𝒮, 𝒜, μ, p, r), where 𝒮 is the state space, 𝒜 is the action space, μ ∈ 𝒫(𝒮) denotes the initial state distribution, p : 𝒮 × 𝒜 → 𝒮 denotes the transition dynamics, and r : 𝒮 → ℝ denotes the reward function. We work in the offline setting, where we have access to a dataset 𝒟 of state-action sequences (s_0, a_0, s_1, …, a_{T−1}, s_T). We emphasize again that the offline dataset in our setting does not contain any reward information. In our experiments, we also consider only deterministic transition dynamics. The goal is, given 𝒟, to find a policy π : 𝒮 × 𝒵 → 𝒜 that maximizes the cumulative reward r_z, where 𝒵 is the space of possible task definitions. Our goal is to make the best use of the offline dataset 𝒟 to enable the agent to solve a variety of tasks in a given environment with potentially different layouts. During evaluation, unless otherwise specified, the agent is tasked to reach a goal state s_g, so the reward is defined as r_g(s) = 𝟙[s = s_g], and 𝒵 is equivalent to 𝒮.
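As a minimal illustration, the sparse goal-reaching reward r_g above can be written in a few lines; the tolerance argument is our addition for continuous state spaces, not part of the formal definition:

```python
import numpy as np

def goal_reward(state, goal, tol=0.0):
    """Sparse goal-reaching reward r_g(s) = 1[s = s_g].

    With continuous states, exact equality is replaced by a distance
    threshold `tol` (a hypothetical tolerance, our addition)."""
    return float(np.linalg.norm(np.asarray(state) - np.asarray(goal)) <= tol)
```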

In this work, we focus on methods that can learn to solve tasks purely from offline trajectories without reward annotations. We do not consider methods that augment a reward-labeled dataset with reward-free data, as we believe the fully reward-free approach is more general. In offline RL, methods for learning without rewards fall into two broad categories: offline goal-conditioned RL, and zero-shot RL methods that model the underlying task as a latent variable. We consider both categories and select methods that we believe reflect the state of the art. We test all methods on goal-reaching, and test the zero-shot methods’ transfer to new tasks. The methods we investigate are:

GCIQL [Par+24a] – a goal-conditioned version of Implicit Q-Learning [KNL21], a strong and widely used method for offline RL;

HIQL [Par+24a] – a hierarchical GCRL method which trains two policies: one to generate subgoals, and another one to reach the subgoals. Notably, both policies use the same value function;

HILP [PKL24] – a method that learns state representations from the offline data such that the distance in the learned representation space is proportional to the number of steps between two states. A direction-conditioned policy is then learned to be able to move along any specified direction in the latent space;

CRL [Eys+22] – uses contrastive learning to learn compatibility between states and reachable goals. The learned representation, which has been shown to be directly linked to the goal-conditioned Q-function, is then used to train a goal-conditioned policy;

Although the methods outlined in Section 3.2 cover a wide range of paradigms, they all fall into the model-free RL category; none of them use the model-based approach, which achieves impressive performance in other settings [DR11, Sil+17, Sil+16, Raf+21]. Learning a dynamics model is a direct way to use state-action sequences, making the model-based approach a natural choice for our setting. For example, [NSF20, Per+20] propose model-based methods for goal-reaching that use an image reconstruction objective. In this work, we focus on the dynamics learning objective alone, with added representation learning objectives to prevent collapse, thereby bypassing the need for image reconstruction. We introduce a model-based method named PLDM – Planning with a Latent Dynamics Model. We learn latent dynamics using a reconstruction-free self-supervised learning (SSL) objective, utilizing the joint-embedding predictive architecture (JEPA) [LeC22]. During evaluation, we use planning to optimize the goal-reaching objective. We opt for an SSL approach that predicts latents instead of reconstructing input observations [Haf+18, Haf+19, Haf+20, Haf+23, Wat+15, Fin+16, Zha+19, Ban+18], as recent work has shown that reconstruction leads to suboptimal features [BL24, Lit+24] and that reconstruction-free representation learning can work well for control and RL [Shu+20, HWS22].

Given an agent trajectory sequence (s_0, a_0, s_1, …, a_{T−1}, s_T), we specify the PLDM world model as:

z_t = h_θ(s_t),   ẑ_0 = z_0,   ẑ_{t+1} = f_θ(ẑ_t, a_t),

where ẑ_t is the predicted latent state and z_t is the encoder output at step t, h_θ is the encoder, and f_θ is the predictor. The training objective minimizes the distance between predicted and encoded latents, summed over all timesteps. Given latents Z ∈ ℝ^{H×N×D}, where H ≤ T is the model prediction horizon, N is the batch dimension, and D the feature dimension, the similarity objective between predictions Ẑ and encodings Z is:

L_sim(Ẑ, Z) = Σ_{t=1}^{H} (1/N) Σ_{n=1}^{N} ‖Ẑ_{t,n} − Z_{t,n}‖₂²

To prevent representation collapse, we use a VICReg-inspired [BPL21] objective and inverse dynamics modeling [Les+18]. We show a diagram of PLDM in Figure 1. See Section C.1.1 for details.
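To make the training objective concrete, here is a small numpy sketch of the latent rollout and the prediction-matching loss; `predictor` stands in for the trained f_θ, and the exact normalization of the loss is our assumption:

```python
import numpy as np

def rollout_latents(z0, actions, predictor):
    """Unroll the latent dynamics: z_hat_{t+1} = f_theta(z_hat_t, a_t)."""
    z_hat, preds = z0, []
    for a in actions:
        z_hat = predictor(z_hat, a)
        preds.append(z_hat)
    return np.stack(preds)                      # shape (H, D)

def similarity_loss(z_hat, z):
    """Squared distance between predicted and encoded latents,
    averaged here for readability (the paper sums over timesteps)."""
    return float(np.mean(np.sum((z_hat - z) ** 2, axis=-1)))
```

With a toy linear predictor f(z, a) = z + a, rolling out from z_0 = 0 with unit actions produces latents (1, 1), (2, 2), (3, 3), and the loss is zero exactly when predictions match the encodings.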

In this work, we mainly focus on the task of reaching specified goal states. While the methods outlined in Section 3.2 rely on trained policies to reach the goal, PLDM relies on planning. At test time, given the current observation s_0, goal observation s_g, pretrained encoder h_θ, predictor f_θ, and planning horizon H, our planning objective is:

argmin_{a_0, …, a_{H−1}} ‖ẑ_H − h_θ(s_g)‖₂²,   where ẑ_0 = h_θ(s_0) and ẑ_{t+1} = f_θ(ẑ_t, a_t).

Following the model predictive control framework [ML99], our model re-plans at every k-th interaction with the environment. Unless stated otherwise, we use k = 1. In all our experiments with PLDM, we use MPPI [WAT15] for planning. We note that PLDM uses no rewards, either explicitly or implicitly, and should be considered as falling under the optimal control category. We also note that applying PLDM to another task does not require retraining the encoder h_θ or the forward model f_θ; we only need to change the definition of the cost in Equation 3.5. We demonstrate this flexibility in Section 4.6, where we invert the sign of the cost to make the agent avoid a specified state.
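A minimal numpy sketch of MPPI-style planning in latent space (sample counts, noise scale, and temperature are illustrative choices, not the paper’s hyperparameters; `predictor` stands in for the trained f_θ):

```python
import numpy as np

def mppi_plan(z0, z_goal, predictor, horizon=5, n_samples=64,
              n_iters=3, sigma=1.0, temperature=1.0, rng=None):
    """MPPI sketch: sample noisy action sequences around a nominal plan,
    roll them out with the latent dynamics, and re-weight by cost.

    Cost is the squared latent distance of the final predicted state to
    the goal; flipping its sign would make the planner avoid z_goal."""
    rng = rng or np.random.default_rng(0)
    act_dim = z0.shape[-1]                       # toy setup: actions match latent dim
    mean = np.zeros((horizon, act_dim))          # nominal action sequence
    for _ in range(n_iters):
        noise = rng.normal(scale=sigma, size=(n_samples, horizon, act_dim))
        seqs = mean[None] + noise
        costs = np.empty(n_samples)
        for k in range(n_samples):
            z = z0
            for t in range(horizon):
                z = predictor(z, seqs[k, t])
            costs[k] = np.sum((z - z_goal) ** 2)
        w = np.exp(-(costs - costs.min()) / temperature)
        w /= w.sum()
        mean = np.einsum('k,kha->ha', w, seqs)   # cost-weighted average
    return mean                                  # MPC executes mean[0], then re-plans
```

With toy additive dynamics f(z, a) = z + a, the optimized sequence moves the latent state toward the goal, and the MPC loop would execute only the first action before re-planning.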

In this section, we conduct thorough experiments testing the methods spanning RL and optimal control outlined in Section 3.2 and Section 3.3. We test all methods on navigation tasks where the agent is a point mass. We present the task in Section 4.1. We generate datasets of varying size and quality and test how a specific data type affects a given method. We design our experiments to test the following properties of the methods (see Table 1 for an experiment overview):

Best-case performance with good data (Section 4.2);

Ability to stitch together suboptimal trajectories (Section 4.4);

Zero-shot generalization to a different task (Section 4.6);

We then draw conclusions from our experiments and outline possible next steps in Section 5.

All our experiments are done with top-down navigation tasks with a point-mass agent. First, we introduce a two-rooms navigation task. Each observation x_t is a top-down view of the two-rooms environment, x_t ∈ ℝ^{2×64×64}, shown in Figure 5. The first channel in the image contains the agent, the second channel the walls. Actions a ∈ ℝ² denote the displacement vector of the agent position from one time step to the next. The norm of the actions is restricted to be less than 2.45. The goal is to reach another randomly sampled state within 200 environment steps. See Section A.2 for more details. This environment makes the data generation process easy to control, enabling us to conduct our experiments efficiently and thoroughly, while not being so trivial that any method can solve it with little data. Movement and navigation are a core part of virtually every real-world robotic environment, making this a useful testbed for development.

Offline data. To generate offline data, we place the agent at a random location in the environment and execute a sequence of actions for T steps, where T denotes the episode length. The actions are generated by first picking a random base direction, then sampling action directions from a von Mises distribution with concentration 5 around it. The step size is sampled uniformly from 0 to 2.45. When sampling low-quality data, we do not bias the action directions with the von Mises distribution and instead sample the direction completely uniformly. Unless otherwise specified, the episode length is T = 91, and the total number of transitions in the data is 3 million.
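The generation procedure above can be sketched as follows (wall handling is omitted, and the function and argument names are ours):

```python
import numpy as np

def sample_trajectory(start, T=91, kappa=5.0, max_step=2.45, rng=None):
    """Sketch of the offline data generation described above.

    A base direction is drawn once per episode; per-step directions are
    von Mises around it with concentration `kappa` (kappa=0 recovers the
    uniform "low-quality" sampling), and step sizes are uniform in
    [0, max_step]."""
    rng = rng or np.random.default_rng(0)
    base = rng.uniform(0.0, 2.0 * np.pi)
    pos, states, actions = np.asarray(start, float), [], []
    for _ in range(T):
        theta = rng.vonmises(base, kappa) if kappa > 0 else rng.uniform(0, 2 * np.pi)
        step = rng.uniform(0.0, max_step)
        a = step * np.array([np.cos(theta), np.sin(theta)])
        states.append(pos.copy())
        actions.append(a)
        pos = pos + a            # wall collisions omitted in this sketch
    return np.array(states), np.array(actions)
```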

To get the topline performance of the methods under optimal dataset conditions, we test them in a setting with a large amount of data, good state coverage, and good-quality trajectories long enough to traverse the two rooms. With 3 million transitions, corresponding to around 30 thousand trajectories, all methods reach their best-case performance in this environment. We report the results in Figure 5. On the goal-reaching task in the two-rooms environment, all methods achieve impressive performance, with HIQL and HILP nearing a perfect 100% success rate. PLDM falls short of perfect performance here. We hypothesize that because PLDM’s training objective is to learn dynamics rather than a policy, it does not benefit from high-quality trajectories the way the model-free methods do.

We investigate how different methods perform when the dataset size varies. While our ultimate goal is a method that can make use of a large amount of suboptimal offline data, this experiment serves to distinguish which methods can glean the most information from the available data. We try dataset sizes ranging all the way down to a few thousand transitions. In Figure 6 we see that the model-based method PLDM outperforms model-free methods when data is scarce. In particular, HILP is more data-hungry than the other model-free methods but achieves perfect performance with enough data.

Can we learn from short trajectories? In this experiment, we vary the episode length T when generating the data. This experiment tests the methods’ ability to stitch together shorter trajectories in order to reach the goal. In real-life scenarios, collecting long episodes may be much more challenging than collecting a large set of shorter trajectories, especially as we scale to more open-ended environments. Therefore, the ability to learn and generalize effectively from shorter trajectories, even when the evaluation trajectory may be much longer, is essential. In our environment, successfully navigating from the bottom-left corner to the bottom-right corner requires around 90 steps. This means that successful trajectories for the hardest start-goal pairings are never observed in a dataset with episodes of length 16. To solve this task, the learning method has to stitch together multiple offline trajectories. We generate several datasets with episode lengths of 91, 64, 32, and 16, adjusting the number of episodes to keep the total number of transitions close to 3 million. The results are shown in Figure 6 (center). We see that when the episode length is short, goal-conditioned methods fail. We hypothesize that because goal-conditioned methods sample state and goal pairs from a trajectory to train their policies, faraway goals become out of distribution for the resulting policy. Although randomly sampling goals from other trajectories during training does not improve generalization, we hypothesize that data augmentation akin to the one outlined in [Ghu+24] could help. On the other hand, HILP performs well because instead of reaching goals, it learns to follow directions in the latent space, which can be learned even from short trajectories. Similarly, a model-based method such as PLDM can learn an accurate model from short trajectories and stitch together a plan at test time.

Can we learn from data with imperfect coverage? We artificially constrain trajectories to always stay within one room during an episode and never pass through the door. Without the constraint, around 35% of trajectories pass through the door. During evaluation, the agent still needs to go through the door to reach the goal state. This also reflects possible constraints in real-life scenarios, as the ability to stitch offline trajectories together is essential to learn efficiently from offline data. The results are shown in Figure 5. We see that HILP achieves perfect performance, while PLDM’s performance drops but remains better than that of the other methods. As in the experiment with short trajectories, the GCRL methods fail. We hypothesize that the structure of the latent space allows HILP to stitch trajectories easily, while PLDM retains some performance due to the learned dynamics. Model-free GCRL methods fail because a goal in a different room from the current state is always out-of-distribution for a policy trained on trajectories staying in one room.

In this experiment, we evaluate how trajectory quality affects agent performance. In practice, collecting trajectories with a random policy is easy, while access to skilled demonstrations cannot always be assumed. Therefore, developing an algorithm that can learn from noisy or random-policy trajectories is critical for leveraging all available data. We generate a dataset of noisy trajectories, where at each step the direction is sampled completely at random. This effectively makes the agent move randomly, so throughout the episode it mostly stays close to where it started. In this dataset, the average maximum distance between any two points in a trajectory is ~10 (in a 64-by-64 environment), while with von Mises action sampling it is ~28. Example trajectories from both types of action sampling are shown in Figure 7.

We see that HILP and the model-based PLDM perform better with very noisy data, while the goal-conditioned RL methods struggle (Figure 6). As in the experiment with shorter trajectories, we hypothesize that because trajectories on average do not travel far, the state and goal pairs sampled during training are close to each other, making faraway goals out of distribution. PLDM, on the other hand, uses the data only to learn the dynamics model, and random-policy trajectories are still suitable for that purpose. HILP uses the data to learn the latent space and how to traverse it in various directions, and can also use random-policy trajectories effectively.

In order to build a system that learns effectively from offline data, we need a learning algorithm that can generalize to different tasks. So far, we have compared all methods on goal-reaching tasks. In this experiment, we test whether the selected methods can generalize to a different task in the same environment. We compare the performance of PLDM and HILP on the task of avoiding another agent that is ‘chasing’ the controlled agent. We evaluate models trained on the optimal data from the experiment in Section 4.2, without any additional training. In this task, the chasing agent follows an expert policy that moves toward the agent along the shortest path; to vary the difficulty, we vary the chasing agent’s speed. The goal of the controlled agent is to avoid the chaser. We note that the goal-conditioned methods can only reach specified goals and, by definition, are unable to avoid a given state; therefore, we only test PLDM and HILP. At each step, the agent is given the state of the chaser and has to choose actions to avoid it. To achieve this, in PLDM we simply invert the sign of the planning objective, making planning maximize the distance in representation space to the goal state; in HILP, we invert the skill direction. To compare the two methods, we evaluate the success rate of the controlled agent avoiding the chaser. An episode is considered successful if the agent stays at least 1.4 pixels away from the chaser for the whole episode, lasting 100 steps. The results are shown in Figure 8(b). To further analyze the results, we also plot the average distance between the agents throughout the episode, see Figure 8(c). We see that PLDM performs better than HILP and evades the chaser more efficiently, keeping a larger distance between the agents at the end of the episode.

In this experiment, we test the methods’ ability to generalize to new environments. Generalization to new environment variations is a requirement for any truly general RL agent, as it is impossible to collect data for every scenario. To test this, we introduce another navigation environment featuring more complex dynamics and configurable layouts; see Figure 2 for an example. We utilize the Mujoco PointMaze environment [TET12] and generate various maze layouts by randomly permuting wall locations. The data is collected by initializing the agent at a random location and sampling actions randomly at every step. The observation space contains the top-down view of the maze as an RGB image and the velocity of the agent, while the action is the 2D acceleration vector. The goal is to reach a randomly sampled goal state in the environment. For more details about the environment, see Section A.2.

To study the generalization ability of our agents, we vary the number (5, 10, 20, 40) of pre-training maze layouts in the offline dataset and evaluate the trained agents on a held-out set of unseen layouts. Furthermore, for agents trained on 5 layouts, we analyze how their performance is affected by the degree to which the test layouts differ in distribution from the training layouts. We show the results in Figure 2, with more details in Figure 10. PLDM demonstrates the best performance, generalizing to unseen environments even when trained on as few as five maps, while the other methods fail. In particular, as the test layouts move out of distribution relative to the training layouts, the performance of all methods except PLDM suffers. To confirm that all methods are able to solve the task, we also evaluate them on a fixed layout and find that all reach a 100% success rate, see Table 4. Figures 9 and 10 show the plans inferred by PLDM at test time, as well as the different agents’ trajectories.

In this work, we conducted a comprehensive study of existing methods for learning from offline data without rewards, spanning both RL and optimal control, aiming to identify the most promising approaches for leveraging large datasets of suboptimal trajectories. Our findings highlight HILP and PLDM as the strongest candidates, with PLDM demonstrating the best generalization to new environment layouts and tasks. We aggregate our experimental results in Table 1. Overall, we draw three main conclusions:

PLDM works well across different dataset settings and is able to learn from poor data and generalize to novel environments and tasks. We therefore believe that learning latent dynamics models is a promising candidate for pre-training on large datasets of suboptimal trajectories. Dynamics learning can also be extended to data without actions by modeling actions as a latent variable [Seo+22, Ye+22]. We believe that other non-generative objectives for latent representation learning [Oqu+23, Bar+24] could further improve performance. Another promising direction of research is planning. Dynamics learning and planning bring their own issues, including accumulating prediction errors [LPC22] and increased computational cost at inference time. In our case, we used MPPI [WAT15] for planning with the learned dynamics model, which takes a considerable amount of time, making evaluation with PLDM about 100 times slower than the model-free methods (see Appendix D for details). Further research into making planning more efficient, e.g. by backpropagating through the forward model [BXS20], is needed. In domains where inference speed is important, planning can also be used as the target to train a policy [Liu+22].

Limitations. All our experiments were conducted in simple navigation environments, and it is unclear if these findings will translate to more complex environments, e.g. physical robots. However, we argue that the conceptual understanding of the effects of data quality on the investigated methods will hold, as even in the relatively simple setting, we see many recent methods break down in surprising ways.

We build our own top-down navigation environment. It is implemented in PyTorch [Pas+19] and supports GPU acceleration. The environment does not model momentum, i.e., the agent has no velocity and is moved by the specified action vector at each step. When an action would take the agent through a wall, the agent is moved to the intersection point between the action vector and the wall. We generate the specified datasets and save them to disk for our experiments. Dataset generation takes under 30 minutes.

Here, we build upon the Mujoco PointMaze environment [TET12], which contains a point-mass agent with a 4D state vector (global x, global y, v_x, v_y), where v is the agent velocity. To allow our models to perceive the different maze layouts, we use as model input a top-down view of the maze rendered as a (64, 64, 3) RGB image tensor instead of relying on (global x, global y) directly.

Mujoco PointMaze allows customization of the maze layout via a grid structure, where each grid cell is either a wall or open space. We opt for a 4×4 grid (excluding the outer wall). Maze layouts are generated randomly, with only the following constraints enforced: 1) all the space cells are interconnected; 2) the percentage of space cells ranges from 50% to 75%.

We set action repeat to 4 for our version of the environment.

We produce four training datasets with the following parameters:

Each episode is collected by setting (global x, global y) to a random location in the maze, and the agent velocity (v_x, v_y) by randomly sampling a 2D vector with ‖v‖ ≤ 5, given that v_x and v_y are clipped to the range [−5, 5] in the environment.

All the test layouts used during evaluation are disjoint from the training layouts. For each layout, trials are created by randomly sampling a start and goal position guaranteed to be at least 3 cells apart in the maze. The same set of layouts and trials is used to evaluate all agents for a given experimental setting.

We evaluate agents in two scenarios: 1) how agents perform on test layouts when trained on a varying number of training layouts; 2) given a constant number of training layouts, how agents perform on test maps with varying degrees of distribution shift from the training layouts.

For scenario 1), we evaluate the agents on 40 randomly generated test layouts, 1 trial per layout.

For scenario 2), we randomly generate test layouts and partition them into groups of 5, where all the layouts in each group have the same degree of distribution shift from the training layouts, as measured by the metric D_min, defined as follows:

Given training layouts {L_train^1, L_train^2, …, L_train^N} and a test layout L_test, let d(L_1, L_2) denote the edit distance between the binary grid representations of two layouts L_1 and L_2. We quantify the distribution shift of L_test as D_min = min_{i ∈ {1, …, N}} d(L_test, L_train^(i)).
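Since all layouts here share the same grid size, the edit distance between binary grids reduces to counting differing cells; a sketch under that assumption:

```python
import numpy as np

def d_min(test_layout, train_layouts):
    """Distribution-shift metric D_min: distance from a test layout to the
    nearest training layout. For fixed-size binary grids, the edit distance
    reduces to a cell-wise Hamming distance (our assumption here)."""
    test = np.asarray(test_layout, dtype=bool)
    return min(int(np.sum(test != np.asarray(L, dtype=bool))) for L in train_layouts)
```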

In this second scenario, we evaluate 5 trials per layout, for a total of 5 × 5 = 25 trials per group.

For CRL, GCBC, GCIQL, and HIQL we use the implementations from the OGBench repository (https://github.com/seohongpark/ogbench) [Par+24]. Likewise, for HILP we use the official implementation from its authors (https://github.com/seohongpark/HILP).

For the Diverse PointMaze environment, to keep things consistent with our implementation of PLDM (C.1.3), instead of using frame stacking, we append the agent velocity directly to the encoder output.

To prevent collapse, we introduce a VICReg-based [BPL21] objective. We modify it to apply the variance objective across the time dimension, encouraging features to capture information that changes rather than information that stays fixed [Sob+22]. The objective to prevent collapse is defined as follows:
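A numpy sketch of such anti-collapse terms: a VICReg-style variance hinge applied along the time axis, plus the usual covariance penalty (the threshold `gamma` and the exact normalizations are our assumptions, not the paper’s values):

```python
import numpy as np

def variance_loss(z, gamma=1.0, eps=1e-4):
    """VICReg-style variance hinge applied along the TIME axis.

    z has shape (T, D); the loss pushes each feature's std over time
    above `gamma`, encouraging features that capture what changes."""
    std = np.sqrt(z.var(axis=0) + eps)      # per-feature std over time
    return float(np.mean(np.maximum(0.0, gamma - std)))

def covariance_loss(z):
    """Decorrelate features: penalize off-diagonal covariance entries."""
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (len(z) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return float(np.sum(off_diag ** 2) / z.shape[1])
```

Constant-over-time features are heavily penalized by the variance term, while features that vary across timesteps incur no hinge loss.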

We also apply a tunable objective to enforce the temporal smoothness of learned representations:

The combined objective is a weighted sum of above:

We use the same Impala Small encoder used by the other methods from OGBench [Par+24]. For the predictor, we use a 2-layer gated recurrent unit (GRU) [Cho14] with 512 hidden dimensions; the predictor input at timestep t is a 2D displacement vector representing the agent action at timestep t, while the initial hidden state is h_θ(s_0), the encoded state at timestep 0. A single layer-normalization layer is applied to the encoder and predictor outputs across all timesteps. Parameter counts are the following:

For the Diverse PointMaze environment, we use convolutional networks for both the encoder and predictor. To fully capture the agent’s state at timestep t, we first encode the top-down view of the maze to get a spatial representation of the environment: h_θ : ℝ^{3×64×64} → ℝ^{16×26×26}, z^env = h_θ(s^env). We incorporate the agent velocity by first transforming it into planes: Expander2D : ℝ² → ℝ^{2×26×26}, s^vp = Expander2D(s^v), where each slice s^vp[i] is filled with s^v[i]. Then, we concatenate the expanded velocity tensor with the spatial representation along the channel dimension to get our overall representation: z = concat(s^vp, z^env, dim=0) ∈ ℝ^{18×26×26}.

For the predictor input, we concatenate the state s_t ∈ ℝ^{18×26×26} with the expanded action Expander2D(a_t) ∈ ℝ^{2×26×26} along the channel dimension. The predictor output has the same dimension as the representation: ẑ ∈ ℝ^{18×26×26}. Both the encodings and predictions are flattened for computing the VICReg and IDM objectives.
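The Expander2D operation described above amounts to broadcasting a low-dimensional vector into constant spatial planes; a sketch with the shapes stated above:

```python
import numpy as np

def expander2d(v, spatial=(26, 26)):
    """Expander2D sketch: broadcast a low-dimensional vector (velocity or
    action) into constant planes so it can be concatenated with a
    convolutional feature map along the channel axis."""
    v = np.asarray(v, dtype=np.float32)
    planes = np.broadcast_to(v[:, None, None], (v.shape[0], *spatial))
    return planes.copy()

# Concatenating the expanded 2-channel velocity with the (16, 26, 26)
# environment features yields the (18, 26, 26) state representation.
```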

The full model architecture is summarized using PyTorch-like notations.

In order to estimate how computationally expensive it is to run planning with a latent dynamics model, we evaluate PLDM and GCIQL on 25 episodes in the two-rooms environment. Each episode is 200 steps. We record the average time per episode and the standard deviation. We do not run HILP, GCBC, or CRL because the resulting policy architecture is the same, making the evaluation time identical to that of GCIQL. HIQL takes more time due to the hierarchy of policies. The results are shown below:

Unless listed below, all hyperparameters remain consistent with the default values from the OGBench and HILP repositories.

For all methods, we used a learning rate of 3e-4. The remaining hyperparameters were kept at their defaults.

The best-case setting is sequence length = 91, dataset size = 3M, non-random % = 100, and wall crossing % ≈ 35. In our experiments we vary each of the above parameters individually.

Offline RL aims to learn behaviors purely from offline data without online interactions. A central challenge is preventing the policy from selecting actions that lead to trajectories not seen in the dataset. CQL [Kum+20] relies on model conservatism to prevent the learned policy from being overly optimistic about trajectories not observed in the data. IQL [KNL21] introduces an objective that avoids evaluating the Q-function on state-action pairs not seen in the data to prevent value overestimation. MOPO [Yu+20] is a model-based approach to learning from offline data that uses model disagreement to constrain the policy. See [Lev+20] for a more in-depth survey.

Table: S3.T1: Road-map of our generalization stress-testing experiments. We test 4 offline goal-conditioned methods - HIQL, GCIQL, CRL, GCBC; a zero-shot RL method HILP, and a learned latent dynamics planning method PLDM. ★★★ denotes good performance in the specified experiment, ★★✩ denotes average performance, and ★✩✩ denotes poor performance. We see that HILP and PLDM are the best-performing methods, with PLDM standing out as the only method that reaches competitive performance in all settings.

Property (Experiment section) | HILP | HIQL | GCIQL | CRL | GCBC | PLDM
Transfer to new environments (4.7) | ★✩✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★★★
Transfer to a new task (4.6) | ★★✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★★★
Data efficiency (4.3) | ★✩✩ | ★★✩ | ★★✩ | ★★✩ | ★★✩ | ★★★
Best-case performance (4.2) | ★★★ | ★★★ | ★★★ | ★★★ | ★★✩ | ★★✩
Can learn from random policy trajectories (4.5) | ★★★ | ★✩✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★★✩
Can stitch suboptimal trajectories (4.4) | ★★★ | ★✩✩ | ★✩✩ | ★✩✩ | ★✩✩ | ★★✩
Competitive performance in all settings

Table: A1.T2: Details for Diverse PointMaze datasets

# Transitions | # Layouts | # Episodes per layout | Episode length
1,000,000 | 5 | 2000 | 100
1,000,000 | 10 | 1000 | 100
1,000,000 | 20 | 500 | 100
1,000,000 | 40 | 250 | 100

Table: A4.T3: Time of evaluation on one episode in two-rooms environment. PLDM is about 100 times slower than model-free methods. Time is calculated by running on 25 episodes.

Method | Time per episode (seconds)
PLDM | 13.44 ± 0.11
GCIQL | 0.12 ± 0.03
HIQL | 0.16 ± 0.03

Table: A5.T4: Results averaged over 3 seeds ± std

Method | Success rate
PLDM | 0.990 ± 0.001
CRL | 0.980 ± 0.001
GCBC | 0.970 ± 0.024
GCIQL | 1.000 ± 0.000
HIQL | 1.000 ± 0.000
HILP | 1.000 ± 0.000

Table: A6.T5: HILP hyperparameters

Hyperparam | Value
Expectile | 0.7
Skill expectile | 0.7

Table: A6.T8: Dataset specific hyperparameters of CRL, GCBC, GCIQL, HIQL, HILP for the Diverse PointMaze environment. For HILP, we set the same value for expectile and skill expectile.

Dataset | CRL LR | GCBC LR | GCIQL LR | GCIQL Expectile | HIQL LR | HIQL Expectile | HILP LR | HILP Expectile
# map layouts = 5 | 0.0003 | 0.0003 | 0.0002 | 0.8 | 0.0001 | 0.7 | 0.0001 | 0.9
# map layouts = 10 | 0.0003 | 0.0001 | 0.0001 | 0.9 | 0.0001 | 0.7 | 0.0001 | 0.9
# map layouts = 20 | 0.0003 | 0.0001 | 0.0001 | 0.6 | 0.0003 | 0.7 | 0.0001 | 0.9
# map layouts = 40 | 0.0003 | 0.0001 | 0.0003 | 0.9 | 0.0001 | 0.9 | 0.0001 | 0.9

Figure: Overview of our analysis. We test six methods for learning from offline reward-free trajectories on 23 different datasets across two top-down navigation environments. We evaluate six generalization properties required to scale to large offline datasets of suboptimal trajectories. We find that planning with a latent dynamics model (PLDM) demonstrates the highest level of generalization; for a full comparison, see Table 1. Right: diagram of PLDM. Circles represent variables, rectangles represent loss components, and half-ovals represent trained models.

Figure: Left: We train offline goal-conditioned agents on trajectories collected in a subset of maze layouts and evaluate on held-out layouts, observing the trajectories shown on the right. Only PLDM solves the task (see Figure 10 for more). Right: Success rates of the tested methods on held-out layouts, as a function of the number of training layouts. The rightmost plot shows success rates of models trained on data from five layouts, evaluated on held-out layouts ranging from ones similar to the training layouts to out-of-distribution ones; we use the map-layout edit distance from the training layouts as a measure of distribution shift. PLDM demonstrates the best generalization performance. Results are averaged over 3 seeds; the shaded area denotes standard deviation. See Figure 1 for more details on PLDM.

Figure: Testing the selected methods' performance under different dataset constraints. Values and shaded regions are means and standard deviations over 3 seeds, respectively. Left: To test the importance of dataset quality, we mix random-policy trajectories with good-quality trajectories (see Figure 7). As the amount of good-quality data goes to 0, methods begin to fail, with PLDM and HILP being the most robust. Center: We measure performance when training with different sequence lengths. Many goal-conditioned methods fail when training trajectories are short, which causes far-away goals to become out-of-distribution for the resulting policy. Right: We measure performance with datasets of varying sizes. PLDM is the most sample-efficient, reaching almost 50% success rate even with only a few thousand transitions.

Figure: Left: Plans generated by PLDM at test time. Right: Actual agent trajectories for the tested methods. PLDM is the only method that reliably succeeds on held-out mazes.

Figure: Top: The training layouts used in the 5-layout setting. Middle: Trajectories of different agents navigating an unseen maze layout towards the goal at test time. As the layouts become increasingly out-of-distribution, only PLDM consistently succeeds. Layouts can be represented as a 4x4 array, with each cell being either a wall or empty space. The distribution shift is quantified by the minimum edit distance between a given test layout and the closest training layout: the top row corresponds to an in-distribution layout with a minimum edit distance of 0, and with each subsequent row the minimum edit distance increases by 1.
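The layout distribution-shift measure described above can be computed as a Hamming-style edit distance over the wall grid. A minimal sketch, assuming layouts are boolean wall masks (our reading of the caption; the function name is ours):

```python
import numpy as np

def layout_edit_distance(test_layout, train_layouts):
    """Minimum number of cell flips (wall <-> empty) between a test layout
    and the closest training layout, i.e. a minimum Hamming distance."""
    test_arr = np.asarray(test_layout, dtype=bool)
    return min(int((test_arr != np.asarray(t, dtype=bool)).sum())
               for t in train_layouts)

# Example: one training layout with no walls, test layout with one wall cell.
train = [np.zeros((4, 4), dtype=bool)]
test = np.zeros((4, 4), dtype=bool)
test[1, 2] = True                       # flip a single cell to a wall
print(layout_edit_distance(test, train))  # prints 1
```

An in-distribution layout has distance 0; each additional differing cell increases the distance by 1, matching the rows of the figure.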

$$ \mathrm{Encoder:}\quad \hat{z}_0 = z_0 = h_{\theta}(s_0), \qquad \mathrm{Predictor:}\quad \hat{z}_t = f_{\theta}(\hat{z}_{t-1}, a_{t-1}) $$

$$ \mathcal{L}_{\mathrm{sim}} = \sum_{t=0}^{H} \frac{1}{N} \sum_{b=1}^{N} \| \hat{Z}_{t,b} - Z_{t,b} \|_2^2 $$

$$ \displaystyle\mathcal{L_{\mathrm{JEPA}}}=\mathcal{L}{\mathrm{sim}}+\alpha\mathcal{L}{\mathrm{var}}+\beta\mathcal{L}{\mathrm{cov}}+\lambda\mathcal{L}{\mathrm{time-var}}+\delta\mathcal{L}{\mathrm{time-sim}}+\omega\mathcal{L}{\mathrm{IDM}} $$

Discussion of Computational Costs

All experiments require only a single GPU and take up to one day on an Nvidia V100. We estimate the total computational cost of the experiments included in the paper, together with the broader research process, at between 500 and 2000 GPU-days.

Limitations

$$ \mathcal{L}_{\mathrm{time\text{-}sim}} = \sum_{t=0}^{H-1} \frac{1}{N} \sum_{b=1}^{N} \| Z_{t,b} - Z_{t+1,b} \|_2^2 $$

$$ \mathrm{Encoder:}\quad \hat{z}_0 = z_0 = h_{\theta}(s_0), \qquad \mathrm{Predictors:}\quad \hat{z}_t^k = f_{\theta}^k(\hat{z}_{t-1}^k, a_{t-1}), \quad \forall k \in \{1, \ldots, K\} $$

$$ \forall k \in \{1, \ldots, K\}: \quad \hat{z}_0^k = z_0^k = h_{\theta}(s_0), \quad \hat{z}_t^k = f_{\theta}^k(\hat{z}_{t-1}^k, a_{t-1}) $$

$$ C_{\mathrm{goal}}(\mathbf{a}, s_0, s_g) = \frac{1}{K} \sum_{k=1}^{K} \sum_{t=0}^{H} \left\| h_{\theta}(s_g) - f_{\theta}^k(\hat{z}_t^k, a_t) \right\| $$

$$ C_{\mathrm{uncertainty}}(\mathbf{a}, s_0, s_g) = \sum_{t=0}^{H} \gamma^t \sum_{j=1}^{d} \mathrm{Var}\left( \{ f_{\theta}^k(\hat{z}_t^k, a_t)_j \}_{k=1}^{K} \right) $$

$$ \mathbf{a}^* = \arg\min_{\mathbf{a}} \left\{ C_{\mathrm{goal}}(\mathbf{a}, s_0, s_g) + \beta\, C_{\mathrm{uncertainty}}(\mathbf{a}, s_0, s_g) \right\} $$
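A simplified sketch of planning with the ensemble cost above: each ensemble member rolls the action sequence out in latent space, the goal cost averages distances to the goal embedding over members, and ensemble disagreement is penalized. A random-shooting optimizer stands in for the MPPI optimizer used in the paper, and the toy member dynamics and all constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, D = 5, 6, 2                        # ensemble size, horizon, latent dim

def f_k(k, z, a):
    """Toy k-th ensemble member: shared integrator with member-specific drift."""
    return z + a + 0.01 * k * a

def cost(actions, z0, z_goal, beta=0.1, gamma=0.9):
    zs = np.stack([z0.copy() for _ in range(K)])          # (K, D) member latents
    c_goal, c_unc = 0.0, 0.0
    for t, a in enumerate(actions):
        zs = np.stack([f_k(k, zs[k], a) for k in range(K)])
        c_goal += np.linalg.norm(z_goal - zs, axis=-1).mean()  # avg over ensemble
        c_unc += (gamma ** t) * zs.var(axis=0).sum()           # disagreement penalty
    return c_goal + beta * c_unc

# Random-shooting planner: sample action sequences, keep the cheapest one.
z0, z_goal = np.zeros(D), np.ones(D)
candidates = rng.normal(scale=0.5, size=(256, H, D))
best = min(candidates, key=lambda a: cost(a, z0, z_goal))
```

MPPI would instead iteratively reweight noisy action sequences by exponentiated cost; the cost function being optimized is the same.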

$$ \mathcal{L}_{\mathrm{var}} = \frac{1}{HD} \sum_{t=0}^{H} \sum_{j=1}^{D} \max\left(0, \gamma - \sqrt{\mathrm{Var}(Z_{t,:,j}) + \epsilon}\right) $$

$$ C(Z_t) = \frac{1}{N-1}(Z_t - \bar{Z}_t)^\top (Z_t - \bar{Z}_t), \qquad \bar{Z}_t = \frac{1}{N} \sum_{b=1}^{N} Z_{t,b} $$

$$ \mathcal{L}_{\mathrm{cov}} = \frac{1}{H} \sum_{t=0}^{H} \frac{1}{D} \sum_{i \neq j} [C(Z_t)]_{i,j}^2 $$

$$ \mathcal{L}_{\mathrm{IDM}} = \sum_{t=0}^{H-1} \frac{1}{N} \sum_{b=1}^{N} \| a_{t,b} - \mathrm{MLP}(Z_{t,b}, Z_{t+1,b}) \|_2^2 $$

$$ \mathcal{L}_{\mathrm{JEPA}} = \mathcal{L}_{\mathrm{sim}} + \alpha\,\mathcal{L}_{\mathrm{var}} + \beta\,\mathcal{L}_{\mathrm{cov}} + \delta\,\mathcal{L}_{\mathrm{time\text{-}sim}} + \omega\,\mathcal{L}_{\mathrm{IDM}} $$
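The variance and covariance regularizers can be sketched for a single time step as follows (a toy numpy version of the equations above, not the authors' implementation; batch shape and constants are assumptions):

```python
import numpy as np

def vicreg_reg(Z, gamma=1.0, eps=1e-4):
    """VICReg-style regularizers on a batch of embeddings Z of shape (N, D):
    a hinge on the per-dimension standard deviation (anti-collapse) and a
    penalty on squared off-diagonal covariance entries (decorrelation)."""
    N, D = Z.shape
    std = np.sqrt(Z.var(axis=0) + eps)
    l_var = np.maximum(0.0, gamma - std).mean()     # variance hinge term
    Zc = Z - Z.mean(axis=0)
    C = (Zc.T @ Zc) / (N - 1)                       # covariance matrix
    off_diag = C - np.diag(np.diag(C))
    l_cov = (off_diag ** 2).sum() / D               # off-diagonal penalty
    return l_var, l_cov

rng = np.random.default_rng(0)
Z = rng.normal(size=(64, 8))
l_var, l_cov = vicreg_reg(Z)
```

A fully collapsed batch (all embeddings identical) maximizes the variance hinge, which is exactly what the regularizer is there to prevent.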

All our experiments were conducted in navigation environments, excluding robot manipulation and partially observable settings. However, we argue that our conclusions about the effects of data quality on the investigated methods will carry over: even in this relatively simple setting, we see many recent methods break down in surprising ways. Another limitation is that PLDM is about 4 times slower at inference time, although in Appendix F we find that even when limited to the same inference compute budget as the model-free methods, PLDM retains most of its performance.

| Method | Good-quality data | No door-passing trajectories |
| --- | --- | --- |
| CRL | 89.3 ± 0.7 | 14.7 ± 4.1 |
| GCBC | 86.0 ± | 28.4 ± 1.2 |
| GCIQL | 98.0 ± 0.9 | 99.6 ± 0.4 |
| HILP | 100.0 ± 0 | 100.0 ± 0.0 |
| HIQL | 96.4 ± 1.3 | 26.3 ± 5.6 |
| PLDM | 97.8 ± 0.7 | 34.4 ± 2.7 |
A Reproducibility
B Visualization of Plans and Trajectories for Diverse PointMaze
C Environments and Datasets
C.1 Two-Rooms Environment
C.2 Diverse PointMaze
C.3 Ant U-Maze
D Models
D.1 PLDM
E Effects of Uncertainty Regularization via Ensembles
F Analyzing planning time of PLDM
F.1 Evaluating PLDM Performance With Adjusted Inference Compute Budget
G Extended Baselines: Input Reconstruction Learning and TD-MPC2
H Analyzing Statistical Significance of Results
I Analyzing HILP's Out-of-Distribution Generalization
J Hyperparameters
J.1 CRL, GCBC, GCIQL, HIQL, HILP
J.2 PLDM
J.3 Further related work
K Discussion of Computational Costs
L Limitations
| Ablation | Success rate (Two-Rooms) | Success rate (Diverse Maze) |
| --- | --- | --- |
| none | 98.0 ± 1.5 | 98.7 ± 2.8 |
| var coeff (α) | 13.4 ± 9.2 | 11.4 ± 6.5 |
| cov coeff (β) | 29.2 ± 4.4 | 7.8 ± 4.1 |
| time sim coeff (δ) | 71.0 ± 3.0 | 95.6 ± 3.2 |
| IDM coeff (ω) | 98.0 ± 1.5 | 75.5 ± 8.2 |
| Method | Replan every | Time per episode (seconds) | Normalized success rate |
| --- | --- | --- | --- |
| PLDM | 1 | 16.0 ± 0.13 | 1.00 |
| PLDM | 4 | 4.8 ± 0.09 | 0.95 |
| PLDM | 16 | 2.6 ± 0.07 | 0.90 |
| PLDM | 32 | 2.2 ± 0.07 | 0.62 |
| GCIQL | – | 3.6 ± 0.10 | – |
| HIQL | – | 4.0 ± 0.08 | – |
| Method | Success rate |
| --- | --- |
| CRL | 89.3 ± 1.2 |
| GCBC | 86.0 ± 4.5 |
| GCIQL | 98.0 ± 0.9 |
| HILP | 100.0 ± 0.0 |
| HIQL | 96.4 ± 3.0 |
| PLDM | 97.4 ± 1.3 |
| DreamerV3 | 24.0 ± 6.9 |
| Reconstruction | 26.2 ± 13.9 |
| TD-MPC2 | 0.0 ± 0.0 |
| TD-MPC2 + IDM | 35.0 ± 0.0 |
Sequence lengthFraction of non-random dataDataset sizeNumber of training mapsIn-distribution → Out-of-distribution
CRL✓ (p=1.68e-03)✓ (p=8.46e-07)✓ (p=1.13e-02)✓ (p=3.43e-10)✓ (p=4.02e-03)
GCBC✓ (p=4.25e-04)✓ (p=1.46e-08)✓ (p=7.26e-03)✓ (p=3.71e-20)✓ (p=1.94e-05)
GCIQL✓ (p=1.29e-12)✓ (p=1.22e-03)
HILP✓ (p=2.11e-02)✓ (p=6.54e-06)✓ (p=4.10e-05)
HIQL✓ (p=2.21e-02)✓ (p=9.26e-05)✓ (p=2.39e-02)✓ (p=2.84e-06)✓ (p=9.01e-03)
| Method | Sequence length 17 | Fraction of non-random data 0% | Dataset size 20312 |
| --- | --- | --- | --- |
| CRL | ✓ (p=3.82e-11) | ✓ (p=1.13e-07) | ✓ (p=6.76e-11) |
| GCBC | ✓ (p=1.54e-12) | ✓ (p=5.75e-10) | ✓ (p=1.88e-14) |
| GCIQL | ✗ | ✗ | ✗ |
| HILP | ✗ | ✗ | ✓ (p=2.49e-19) |
| HIQL | ✓ (p=8.89e-05) | ✓ (p=2.83e-07) | ✓ (p=2.65e-11) |
| Dataset | CRL LR | GCBC LR | HILP LR | HILP Expectile |
| --- | --- | --- | --- | --- |
| 5 layouts | 0.0003 | 0.0003 | 0.0001 | 0.9 |
| 10 layouts | 0.0003 | 0.0001 | 0.0001 | 0.9 |
| 20 layouts | 0.0003 | 0.0001 | 0.0001 | 0.9 |
| 40 layouts | 0.0003 | 0.0001 | 0.0001 | 0.9 |
| Dataset | GCIQL LR | GCIQL Expectile | GCIQL BC Coeff. | GCIQL Prob Rand Goal | HIQL LR | HIQL Expectile | HIQL AWR Temp. | HIQL Prob Rand Goal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 layouts | 0.0003 | 0.6 | 0.3 | 0.15 | 0.00003 | 0.6 | 1.0 | 0.6 |
| 10 layouts | 0.0003 | 0.6 | 0.3 | 0.15 | 0.0001 | 0.7 | 3.0 | 0.0 |
| 20 layouts | 0.0003 | 0.6 | 0.3 | 0.15 | 0.00003 | 0.6 | 1.0 | 0.6 |
| 40 layouts | 0.0003 | 0.6 | 0.3 | 0.15 | 0.0001 | 0.7 | 3.0 | 0.0 |
| Seq Len | CRL LR | CRL BC Coeff. | CRL Prob Rand Goal | GCBC LR | HILP LR | HILP Expectile (skill / actor) |
| --- | --- | --- | --- | --- | --- | --- |
| 25 | 3e-5 | 0.1 | 0.3 | 3e-3 | 3e-4 | 0.6 / 0.8 |
| 50 | 3e-5 | 0.1 | 0.3 | 3e-3 | 3e-4 | 0.6 / 0.8 |
| 100 | 3e-4 | 0.1 | 0.0 | 3e-3 | 3e-4 | 0.6 / 0.8 |
| 250 | 3e-4 | 0.1 | 0.0 | 3e-3 | 3e-4 | 0.6 / 0.8 |
| 500 | 3e-4 | 0.1 | 0.0 | 3e-3 | 3e-4 | 0.6 / 0.8 |
| Seq Len | GCIQL LR | GCIQL Expectile | GCIQL BC Coeff. | GCIQL Prob Rand Goal | HIQL LR | HIQL Expectile | HIQL AWR Temp. | HIQL Prob Rand Goal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 25 | 3e-5 | 0.9 | 0.15 | 0.15 | 3e-4 | 0.6 | 3.0 | 0.6 |
| 50 | 3e-5 | 0.9 | 0.15 | 0.15 | 3e-4 | 0.6 | 3.0 | 0.6 |
| 100 | 3e-5 | 0.9 | 0.15 | 0.15 | 3e-4 | 0.6 | 3.0 | 0.6 |
| 250 | 3e-4 | 0.9 | 0.3 | 0.0 | 3e-4 | 0.6 | 3.0 | 0.6 |
| 500 | 3e-4 | 0.9 | 0.3 | 0.0 | 3e-4 | 0.6 | 3.0 | 0.6 |
| Hyperparameter | Value |
| --- | --- |
| Batch Size | 64 |
| Predictor Horizon (H) | 16 |
| Optimizer | Adam |
| Scheduler | Cosine |
| Ensemble size K | 5 |
| ω | 0 |
| MPPI noise σ | 5 |
| MPPI # samples | 500 |
| MPPI λ | 0.005 |
| Planner C_uncertainty coeff β | 0.0001 |
| Planner C_uncertainty coeff γ | 0.9 |
| Dataset | LR | α | β | δ |
| --- | --- | --- | --- | --- |
| Sequence length = 91 | 0.0007 | 4 | 6.9 | 0.75 |
| Sequence length = 65 | 0.0003 | 5 | 6.9 | 0.75 |
| Sequence length = 33 | 0.0014 | 3.5 | 6.9 | 0.75 |
| Sequence length = 17 | 0.0028 | 3 | 6.9 | 0.75 |
| Dataset size = 634 | 0.003 | 2.2 | 13 | 0.5 |
| Dataset size = 1269 | 0.001 | 2.2 | 13 | 0.5 |
| Dataset size = 5078 | 0.0005 | 2.2 | 13 | 0.9 |
| Dataset size = 20312 | 0.003 | 2.2 | 13 | 0.5 |
| Dataset size = 81250 | 0.001 | 2.2 | 13 | 0.5 |
| Dataset size = 325k | 0.001 | 4 | 6.9 | 0.75 |
| Dataset size = 1500k | 0.001 | 4 | 6.9 | 0.75 |
| Non-random % = 0.001 | 0.0007 | 3.9 | 6.9 | 0.74 |
| Non-random % = 0.01 | 0.0007 | 3.9 | 6.5 | 0.19 |
| Non-random % = 0.02 | 0.0007 | 3.9 | 6.5 | 0.72 |
| Non-random % = 0.04 | 0.0007 | 3.9 | 6.5 | 0.65 |
| Non-random % = 0.08 | 0.0007 | 3.9 | 6.5 | 0.24 |
| Wall crossing % = 0 | 0.0007 | 4 | 6.9 | 0.75 |
| Hyperparameter | Value |
| --- | --- |
| Epochs | 5 |
| Batch Size | 128 |
| Predictor Horizon (H) | 16 |
| Optimizer | Adam |
| Scheduler | Cosine |
| Ensemble size K | 1 |
| MPPI noise σ | 5 |
| MPPI # samples | 500 |
| MPPI λ | 0.0025 |
| Dataset | LR | α | β | δ | ω |
| --- | --- | --- | --- | --- | --- |
| # map layouts = 5 | 0.043 | 5 | 120 | 0.1 | 5.4 |
| # map layouts = 10 | 0.043 | 5 | 120 | 0.1 | 5.4 |
| # map layouts = 20 | 0.055 | 4.5 | 15.5 | 0.1 | 5.2 |
| # map layouts = 40 | 0.055 | 4.5 | 15.5 | 0.1 | 5.2 |
| Hyperparameter | Value |
| --- | --- |
| Epochs | 5 |
| Batch Size | 64 |
| Predictor Horizon (H) | 16 |
| Optimizer | Adam |
| Scheduler | Cosine |
| Ensemble size K | 5 |
| α | 26.2 |
| β | 0.5 |
| δ | 8.1 |
| ω | 0.58 |
| MPPI noise σ | 5 |
| MPPI # samples | 500 |
| MPPI λ | 0.0025 |
| Planner C_uncertainty coeff β | 1 |
| Planner C_uncertainty coeff γ | 0.9 |
| Dataset | LR |
| --- | --- |
| Sequence length = 25 | 0.006 |
| Sequence length = 50 | 0.004 |
| Sequence length = 100 | 0.003 |
| Sequence length = 250 | 0.001 |
| Sequence length = 500 | 0.001 |


References

[andrychowicz2017hindsight] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba. Hindsight experience replay. Advances in Neural Information Processing Systems, 30, 2017.

[balestriero2024learning] R. Balestriero and Y. LeCun. Learning by reconstruction produces uninformative features for perception. arXiv preprint arXiv:2402.11337, 2024.

[Banijamali_Shu_Ghavamzadeh_Bui_Ghodsi_2018] E. Banijamali, R. Shu, M. Ghavamzadeh, H. Bui, and A. Ghodsi. Robust locally-linear controllable embedding. arXiv preprint arXiv:1710.05373, 2018.

[bardes2021vicreg] A. Bardes, J. Ponce, and Y. LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.

[barreto2017successor] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver. Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.

[bertsekas2012dynamic] D. Bertsekas. Dynamic Programming and Optimal Control: Volume I, volume 4. Athena Scientific, 2012.

[Bertsekas2019] D. P. Bertsekas. Reinforcement Learning and Optimal Control. Athena Scientific, Belmont, MA, 2019. ISBN 978-1-886529-46-6. URL https://www.mit.edu/~dimitrib/RLbook.html.

[bharadhwaj2020model] H. Bharadhwaj, K. Xie, and F. Shkurti. Model-predictive control via cross-entropy and gradient-based optimization. In Learning for Dynamics and Control, pages 277-286. PMLR, 2020.

[black2024pi_0] K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. $\pi_0$: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

[brohan2023rt] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.

[bryson1996optimal] A. E. Bryson. Optimal control: 1950 to 1985. IEEE Control Systems Magazine, 16(3):26-33, 1996.

[buckman2020importance] J. Buckman, C. Gelada, and M. G. Bellemare. The importance of pessimism in fixed-dataset policy optimization. arXiv preprint arXiv:2009.06799, 2020.

[caron2021emerging] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650-9660, 2021.

[cho2014properties] K. Cho. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[open_x_embodiment_rt_x_2023] Open X-Embodiment Collaboration, A. O'Neill, A. Rehman, A. Gupta, A. Maddukuri, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023.

[Dasari_Ebert_Tian_Nair_Bucher_Schmeckpeper_Singh_Levine_Finn_2020] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper, S. Singh, S. Levine, and C. Finn. RoboNet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2020.

[Ebert_Finn_Dasari_Xie_Lee_Levine_2018] F. Ebert, C. Finn, S. Dasari, A. Xie, A. Lee, and S. Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.

[ernst2005tree] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005.

[eysenbach2022contrastive] B. Eysenbach, T. Zhang, S. Levine, and R. R. Salakhutdinov. Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems, 35:35603-35620, 2022.

[Finn_Levine_2017] C. Finn and S. Levine. Deep visual foresight for planning robot motion. arXiv preprint arXiv:1610.00696, 2017.

[Finn_Tan_Duan_Darrell_Levine_Abbeel_2016] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. arXiv preprint arXiv:1509.06113, 2016.

[fu2020d4rl] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.

[ghosh2019learning] D. Ghosh, A. Gupta, A. Reddy, J. Fu, C. Devin, B. Eysenbach, and S. Levine. Learning to reach goals via iterated supervised learning. arXiv preprint arXiv:1912.06088, 2019.

[ghugare2024closing] R. Ghugare, M. Geist, G. Berseth, and B. Eysenbach. Closing the gap between TD learning and supervised learning: a generalisation point of view. arXiv preprint arXiv:2401.11237, 2024.

[hafner2018learning] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. arXiv preprint arXiv:1811.04551, 2018.

[hafner2023mastering] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023.

[hansen2022temporal] N. Hansen, X. Wang, and H. Su. Temporal difference learning for model predictive control. arXiv preprint arXiv:2203.04955, 2022.

[hansen2023td] N. Hansen, H. Su, and X. Wang. TD-MPC2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828, 2023.

[hatch2022example] K. Hatch, T. Yu, R. Rafailov, and C. Finn. Example-based offline reinforcement learning without rewards. Proceedings of Machine Learning Research, 144:1-17, 2022.

[hu2023provable] H. Hu, Y. Yang, Q. Zhao, and C. Zhang. The provable benefits of unsupervised data sharing for offline reinforcement learning. arXiv preprint arXiv:2302.13493, 2023.

[khazatsky2024droid] A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, et al. DROID: A large-scale in-the-wild robot manipulation dataset. 2024.

[kim2024unsupervised] J. Kim, S. Park, and S. Levine. Unsupervised-to-online reinforcement learning. arXiv preprint arXiv:2408.14785, 2024.

[kostrikov2021offline] I. Kostrikov, A. Nair, and S. Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021.

[kostrikov2022offline] I. Kostrikov, A. Nair, and S. Levine. Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations, 2022. URL https://arxiv.org/abs/2110.06169.

[kuindersma2016optimization] S. Kuindersma, R. Deits, M. Fallon, A. Valenzuela, H. Dai, F. Permenter, T. Koolen, P. Marion, and R. Tedrake. Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot. Autonomous Robots, 40:429-455, 2016.

[kumar2020conservative] A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pages 1179-1191, 2020. URL https://arxiv.org/abs/2006.04779.

[Lambert_Pister_Calandra_2022] N. Lambert, K. Pister, and R. Calandra. Investigating compounding prediction errors in learned dynamics models. arXiv preprint arXiv:2203.09637, 2022.

[laskin2020curl] M. Laskin, A. Srinivas, and P. Abbeel. CURL: Contrastive unsupervised representations for reinforcement learning. In International Conference on Machine Learning, pages 5639-5650. PMLR, 2020.

[lecun2022path] Y. LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review, 62(1):1-62, 2022.

[lesort2018state] T. Lesort, N. Díaz-Rodríguez, J.-F. Goudou, and D. Filliat. State representation learning for control: An overview. Neural Networks, 108:379-392, 2018.

[levine2020offline] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.

[littwin2024jepa] E. Littwin, O. Saremi, M. Advani, V. Thilak, P. Nakkiran, C. Huang, and J. Susskind. How JEPA avoids noisy features: The implicit bias of deep linear self-distillation networks. arXiv preprint arXiv:2407.03475, 2024.

[lynch2020learning] C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet. Learning latent plans from play. In Conference on Robot Learning, pages 1113-1132. PMLR, 2020.

[mnih2013playing] V. Mnih. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[morari1999model] M. Morari and J. H. Lee. Model predictive control: past, present and future. Computers & Chemical Engineering, 23(4-5):667-682, 1999.

[nair2020goal] S.~Nair, S.~Savarese, and C.~Finn. \newblock Goal-aware prediction: Learning to model what matters. \newblock In International Conference on Machine Learning, pages 7207--7219. PMLR, 2020.

[nair2022r3m] S.~Nair, A.~Rajeswaran, V.~Kumar, C.~Finn, and A.~Gupta. \newblock R3m: A universal visual representation for robot manipulation. \newblock arXiv preprint arXiv:2203.12601, 2022.

[octo_2023] {Octo Model Team}, D.~Ghosh, H.~Walke, K.~Pertsch, K.~Black, O.~Mees, S.~Dasari, J.~Hejna, C.~Xu, J.~Luo, T.~Kreiman, Y.~Tan, L.~Y. Chen, P.~Sanketi, Q.~Vuong, T.~Xiao, D.~Sadigh, C.~Finn, and S.~Levine. \newblock Octo: An open-source generalist robot policy. \newblock In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.

[openai2018learning] OpenAI, M.~Andrychowicz, B.~Baker, M.~Chociej, R.~J{\'o}zefowicz, B.~McGrew, J.~W. Pachocki, J.~Pachocki, A.~Petron, M.~Plappert, G.~Powell, et~al. \newblock Learning dexterous in-hand manipulation. \newblock arXiv preprint arXiv:1808.00177, 2018.

[oquab2023dinov2] M.~Oquab, T.~Darcet, T.~Moutakanni, H.~Vo, M.~Szafraniec, V.~Khalidov, P.~Fernandez, D.~Haziza, F.~Massa, A.~El-Nouby, et~al. \newblock DINOv2: Learning robust visual features without supervision. \newblock arXiv preprint arXiv:2304.07193, 2023.

[park2024ogbench] S.~Park, K.~Frans, B.~Eysenbach, and S.~Levine. \newblock OGBench: Benchmarking offline goal-conditioned RL. \newblock arXiv preprint arXiv:2410.20092, 2024{a}.

[park2024hiql] S.~Park, D.~Ghosh, B.~Eysenbach, and S.~Levine. \newblock Hiql: Offline goal-conditioned rl with latent states as actions. \newblock Advances in Neural Information Processing Systems, 36, 2024{b}.

[park2024foundation] S.~Park, T.~Kreiman, and S.~Levine. \newblock Foundation policies with hilbert representations. \newblock arXiv preprint arXiv:2402.15567, 2024{c}.

[paszke2019pytorch] A.~Paszke, S.~Gross, F.~Massa, A.~Lerer, J.~Bradbury, G.~Chanan, T.~Killeen, Z.~Lin, N.~Gimelshein, L.~Antiga, et~al. \newblock PyTorch: An imperative style, high-performance deep learning library. \newblock Advances in neural information processing systems, 32, 2019.

[pertsch2020long] K.~Pertsch, O.~Rybkin, F.~Ebert, S.~Zhou, D.~Jayaraman, C.~Finn, and S.~Levine. \newblock Long-horizon visual planning with goal-conditioned hierarchical predictors. \newblock Advances in Neural Information Processing Systems, 33:\penalty0 17321--17333, 2020.

[rybkin2018learning] O.~Rybkin, K.~Pertsch, K.~G. Derpanis, K.~Daniilidis, and A.~Jaegle. \newblock Learning what you can do before doing anything. \newblock arXiv preprint arXiv:1806.09655, 2018.

[schultz2009modeling] G.~Schultz and K.~Mombaur. \newblock Modeling and optimal control of human-like running. \newblock IEEE/ASME Transactions on mechatronics, 15\penalty0 (5):\penalty0 783--792, 2009.

[schwarzer2021pretraining] M.~Schwarzer, N.~Rajkumar, M.~Noukhovitch, A.~Anand, L.~Charlin, R.~D. Hjelm, P.~Bachman, and A.~C. Courville. \newblock Pretraining representations for data-efficient reinforcement learning. \newblock Advances in Neural Information Processing Systems, 34:\penalty0 12686--12699, 2021.

[Shu_Nguyen_Chow_Pham_Than_Ghavamzadeh_Ermon_Bui_2020] R.~Shu, T.~Nguyen, Y.~Chow, T.~Pham, K.~Than, M.~Ghavamzadeh, S.~Ermon, and H.~H. Bui. \newblock Predictive coding for locally-linear control. \newblock \penalty0 (arXiv:2003.01086), Mar. 2020. \newblock 10.48550/arXiv.2003.01086. \newblock URL http://arxiv.org/abs/2003.01086. \newblock arXiv:2003.01086 [cs].

[silver2016mastering] D.~Silver, A.~Huang, C.~J. Maddison, A.~Guez, L.~Sifre, G.~Van Den Driessche, J.~Schrittwieser, I.~Antonoglou, V.~Panneershelvam, M.~Lanctot, et~al. \newblock Mastering the game of Go with deep neural networks and tree search. \newblock Nature, 529\penalty0 (7587):\penalty0 484--489, 2016.

[sobal2022joint] V.~Sobal, J.~SV, S.~Jalagam, N.~Carion, K.~Cho, and Y.~LeCun. \newblock Joint embedding predictive architectures focus on slow features. \newblock arXiv preprint arXiv:2211.10831, 2022.

[sutton2018reinforcement] R.~S. Sutton. \newblock Reinforcement learning: An introduction. \newblock A Bradford Book, 2018.

[Tassa_Erez_Smart_2007] Y.~Tassa, T.~Erez, and W.~Smart. \newblock Receding horizon differential dynamic programming. \newblock In J.~Platt, D.~Koller, Y.~Singer, and S.~Roweis, editors, Advances in Neural Information Processing Systems, volume~20. Curran Associates, Inc., 2007. \newblock URL https://proceedings.neurips.cc/paper_files/paper/2007/file/c6bff625bdb0393992c9d4db0c6bbe45-Paper.pdf.

[ilqg] E.~Todorov and W.~Li. \newblock A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. \newblock In Proceedings of the 2005 American Control Conference, pages 300--306 vol.~1, 2005{a}. \newblock 10.1109/ACC.2005.1469949.

[todorov2005generalized] E.~Todorov and W.~Li. \newblock A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. \newblock In Proceedings of the 2005 American Control Conference, pages 300--306. IEEE, 2005{b}.

[conf/iros/TodorovET12] E.~Todorov, T.~Erez, and Y.~Tassa. \newblock MuJoCo: A physics engine for model-based control. \newblock In IROS, pages 5026--5033. IEEE, 2012. \newblock ISBN 978-1-4673-1737-5. \newblock URL http://dblp.uni-trier.de/db/conf/iros/iros2012.html#TodorovET12.

[touati2021learning] A.~Touati and Y.~Ollivier. \newblock Learning one representation to optimize all rewards. \newblock Advances in Neural Information Processing Systems, 34:\penalty0 13--23, 2021.

[touati2022does] A.~Touati, J.~Rapin, and Y.~Ollivier. \newblock Does zero-shot reinforcement learning exist? \newblock arXiv preprint arXiv:2209.14935, 2022.

[wang2021towards] K.~Wang, K.~Zhou, Q.~Zhang, J.~Shao, B.~Hooi, and J.~Feng. \newblock Towards better Laplacian representation in reinforcement learning with generalized graph drawing. \newblock In International Conference on Machine Learning, pages 11003--11012. PMLR, 2021.

[wang2022reachability] K.~Wang, K.~Zhou, J.~Feng, B.~Hooi, and X.~Wang. \newblock Reachability-aware Laplacian representation in reinforcement learning. \newblock arXiv preprint arXiv:2210.13153, 2022.

[Watter_Springenberg_Boedecker_Riedmiller_2015] M.~Watter, J.~T. Springenberg, J.~Boedecker, and M.~Riedmiller. \newblock Embed to control: A locally linear latent dynamics model for control from raw images. \newblock \penalty0 (arXiv:1506.07365), Nov. 2015. \newblock 10.48550/arXiv.1506.07365. \newblock URL http://arxiv.org/abs/1506.07365. \newblock arXiv:1506.07365 [cs].

[williams2015model] G.~Williams, A.~Aldrich, and E.~Theodorou. \newblock Model predictive path integral control using covariance variable importance sampling. \newblock arXiv preprint arXiv:1509.01149, 2015.

[wu2018laplacian] Y.~Wu, G.~Tucker, and O.~Nachum. \newblock The Laplacian in RL: Learning representations with efficient approximations. \newblock arXiv preprint arXiv:1810.04586, 2018.

[yang2023essential] R.~Yang, L.~Yong, X.~Ma, H.~Hu, C.~Zhang, and T.~Zhang. \newblock What is essential for unseen goal generalization of offline goal-conditioned rl? \newblock In International Conference on Machine Learning, pages 39543--39571. PMLR, 2023{a}.

[yang2023foundation] S.~Yang, O.~Nachum, Y.~Du, J.~Wei, P.~Abbeel, and D.~Schuurmans. \newblock Foundation models for decision making: Problems, methods, and opportunities. \newblock arXiv preprint arXiv:2303.04129, 2023{b}.

[yarats2022don] D.~Yarats, D.~Brandfonbrener, H.~Liu, M.~Laskin, P.~Abbeel, A.~Lazaric, and L.~Pinto. \newblock Don't change the algorithm, change the data: Exploratory data for offline reinforcement learning. \newblock arXiv preprint arXiv:2201.13425, 2022.

[Yen-Chen_Bauza_Isola_2019] L.~Yen-Chen, M.~Bauza, and P.~Isola. \newblock Experience-embedded visual foresight. \newblock \penalty0 (arXiv:1911.05071), Nov. 2019. \newblock 10.48550/arXiv.1911.05071. \newblock URL http://arxiv.org/abs/1911.05071. \newblock arXiv:1911.05071 [cs].

[yu2020mopo] T.~Yu, G.~Thomas, L.~Yu, T.~X. Ma, S.~Ermon, J.~Zou, and C.~Finn. \newblock Mopo: Model-based offline policy optimization. \newblock Advances in Neural Information Processing Systems, 33:\penalty0 14129--14142, 2020. \newblock URL https://arxiv.org/abs/2005.13239.

[yu2022leverage] T.~Yu, A.~Kumar, Y.~Chebotar, K.~Hausman, C.~Finn, and S.~Levine. \newblock How to leverage unlabeled data in offline reinforcement learning. \newblock In International Conference on Machine Learning, pages 25611--25635. PMLR, 2022.

[zare2024survey] M.~Zare, P.~M. Kebria, A.~Khosravi, and S.~Nahavandi. \newblock A survey of imitation learning: Algorithms, recent developments, and challenges. \newblock IEEE Transactions on Cybernetics, 2024.

[Zhang_Vikram_Smith_Abbeel_Johnson_Levine_2019] M.~Zhang, S.~Vikram, L.~Smith, P.~Abbeel, M.~J. Johnson, and S.~Levine. \newblock Solar: Deep structured representations for model-based reinforcement learning. \newblock \penalty0 (arXiv:1808.09105), June 2019. \newblock 10.48550/arXiv.1808.09105. \newblock URL http://arxiv.org/abs/1808.09105. \newblock arXiv:1808.09105 [cs].

[zhang2022light] W.~Zhang, A.~GX-Chen, V.~Sobal, Y.~LeCun, and N.~Carion. \newblock Light-weight probing of unsupervised representations for reinforcement learning. \newblock arXiv preprint arXiv:2208.12345, 2022.

[zhou2024dino] G.~Zhou, H.~Pan, Y.~LeCun, and L.~Pinto. \newblock DINO-WM: World models on pre-trained visual features enable zero-shot planning. \newblock arXiv preprint arXiv:2411.04983, 2024.
