
Joint Embedding Predictive Architectures Focus on Slow Features

Vlad Sobal 1, Jyothir S V 1, Siddhartha Jalagam 1, Nicolas Carion 2, Kyunghyun Cho 1,3,4, Yann LeCun 1,2

1 New York University, 2 Meta AI, 3 Prescient Design, Genentech, 4 CIFAR Fellow

Abstract

Many common methods for learning a world model for pixel-based environments use generative architectures trained with pixel-level reconstruction objectives. Recently proposed Joint Embedding Predictive Architectures (JEPA) [20] offer a reconstruction-free alternative. In this work, we analyze the performance of JEPA trained with VICReg and SimCLR objectives in a fully offline setting without access to rewards, and compare the results to the performance of a generative architecture. We test the methods in a simple environment with a moving dot and various background distractors, and probe the learned representations for the dot's location. We find that JEPA methods perform on par with or better than reconstruction when the distractor noise changes every time step, but fail when the noise is fixed. Furthermore, we provide a theoretical explanation for the poor performance of JEPA-based methods with fixed noise, highlighting an important limitation.

{us441, jyothir, scj9994, carion.nicolas, kyunghyun.cho}@nyu.edu yann@cs.nyu.edu


Introduction

Currently, the most common approach to learning world models is to use reconstruction objectives [11, 13, 12, 23]. However, reconstruction objectives suffer from object vanishing, as they by design do not distinguish important objects in the scene [24]. The framework of Joint Embedding Predictive Architectures (JEPA) presented in [20] may offer an alternative to reconstruction-based objectives, as shown in [30, 29, 24]. In this paper, we implement JEPA 1 for learning from image and action sequences with VICReg [3] and SimCLR [5] objectives (section 2 and code). We then test JEPA, reconstruction-based, and inverse dynamics modeling methods by training on image sequences of one moving dot and probing the learned representations for the dot's location in the presence of various noise distractors (section 3). We observe that JEPA methods can learn to ignore distractor noise that changes every time step, but fail when the distractor noise is static.

Method

We focus on learning a world model from a fixed set of state-action sequences. We consider a Markov Decision Process (MDP) M = (O, A, P, R). O is the set of possible observations (in our case, images), A is the set of possible actions, P = Pr(o_t | o_{t-1}, a_{t-1}) gives the transition probabilities, and R : O × A → ℝ is the reward function. In the offline setting we focus on, we do not have access to the MDP directly; we only have a pre-recorded set of sequences of observations o_t ∈ O and actions a_t ∈ A. Our goal is to use these offline sequences to learn an encoder g_φ : O → ℝ^D that converts observations o_t to D-dimensional representations s_t = g_φ(o_t), and the

1 Code is available at https://github.com/vladisai/JEPA_SSL_NeurIPS_2022


Figure 1: JEPA-based methods' training diagrams. (a) JEPA with VICReg loss. VC here denotes variance and covariance losses from [3]. (b) JEPA with InfoNCE loss.

forward model f_θ : ℝ^D × A → ℝ^D with f_θ(s_t, a_t) = s̃_{t+1}. During pre-training we do not have access to the evaluation task; we therefore aim to learn representations whose D-dimensional vectors capture as much information about the environment and its dynamics as possible.

During both training and evaluation, given the initial observation o_1 and a sequence of actions a_1 ... a_T of length T, we encode the first observation s̃_1 = g_φ(o_1), then auto-regressively apply the forward model s̃_t = f_θ(s̃_{t-1}, a_{t-1}), obtaining representations s̃_1 ... s̃_{T+1} for a sequence of length T+1. During testing, we probe the representations using a single linear layer q : ℝ^D → ℝ^Q that is trained, with frozen encoder and predictor, to recover known properties of the states from s̃_t, where Q is the dimension of the target value. This protocol is related to the one used in [1], but we probe not only the encoder output but also the predictor output. For more details on probing, see appendix A.4. We compare multiple ways of training the encoder and predictor. In all approaches, gradients are propagated through time, and no image augmentations are used. Encoder and predictor architectures are fixed (see appendix A.7). We test the following methods:
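This unrolling protocol can be sketched in a few lines (a minimal illustration; the toy `encoder` and `predictor` callables below are stand-ins for g_φ and f_θ, which in the paper are a CNN and a GRU):

```python
import numpy as np

def unroll(encoder, predictor, o1, actions):
    """Encode the first observation, then auto-regressively apply the
    forward model to obtain predicted states s~_1 ... s~_{T+1}."""
    states = [encoder(o1)]
    for a in actions:  # a_1 ... a_T
        states.append(predictor(states[-1], a))
    return states  # length T + 1

# Toy stand-ins for g_phi and f_theta, purely for illustration.
D = 4
encoder = lambda o: np.zeros(D)
predictor = lambda s, a: s + np.pad(a, (0, D - 2))  # actions are 2-D deltas

states = unroll(encoder, predictor, o1=None, actions=[np.array([0.1, 0.0])] * 16)
```

The probing layer q is then trained on `states` with the encoder and predictor frozen.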

VICReg (figure 1a) We take inspiration from [20] and adopt the VICReg [3] objective for training a Joint Embedding Predictive Architecture (JEPA). We apply the variance and covariance losses described in [3] to the representations at each step separately, and apply the prediction loss to make the forward-model output s̃_t close to the encoder output s_t. For a more detailed description, see appendix A.3.1.

SimCLR (figure 1b) In this case, we again train a JEPA, but with the SimCLR objective [5] instead of VICReg. We apply the InfoNCE loss [25], treating the forward-model output and the encoder output at the same time step as a positive pair. For a more detailed description, see appendix A.3.2.

Reconstruction The reconstruction approach introduces a decoder d_ξ(s̃_t) = õ_t and uses the reconstruction objective L = (1/T) Σ_{t=1}^{T} ‖o_t − õ_t‖²₂ to train the encoder and predictor. See appendix A.3.4 for more details.

Inverse Dynamics Modeling (IDM) We add a linear layer that, given the encoder's outputs at two consecutive steps g φ ( o t ) , g φ ( o t +1 ) , predicts a t . The forward model is trained by predicting the encoder's output at the next time step. For more details, see appendix A.3.3.

Supervised All components are trained end-to-end by propagating the error from the probing function to both the encoder and predictor. This should give us a lower bound on the probing error.

Random In this case, the probing is run with fixed random weights of the encoder and predictor.

Spurious correlation

VICReg and SimCLR JEPA methods may fall prey to a spurious-correlation issue: the loss can be minimized by attending only to noise that does not change with time, making the system ignore all other information. Intuitively, the objectives make the model focus on 'slow features' [37]. When the slowest features in the input belong to fixed background noise, the representation will contain only the noise. To demonstrate this, we exhibit a trivial but plausible solution. In the presence of normally distributed distractor noise that does not change with time, the model can extract features by directly copying the values of the noise from the input: g_φ(o_t) = s ∼ N(0, σ²I), s ∈ ℝ^D. Since


Figure 2: Dataset examples. (a) We introduce distractors to our moving dot dataset by adding either uniform or structured noise with different brightness coefficient α . (b) Temporally, the noise either changes every frame (top row), or remains fixed throughout the video (bottom row). In the fixed case, the noise is still re-sampled for each new sequence.

the noise is persistent through time, g_φ(o_t) = g_φ(o_{t+1}) = s. We assume that the forward model has converged to the identity: f_θ(s, a) = s. We denote a batch of encoded representations at step t by a matrix S_t ∈ ℝ^{N×D}, where N is the batch size, and batches of actions and observations by A_t and O_t respectively. Then the VICReg losses are:

$$
\tilde S_t = f_\theta(S_t, A_t) = S_t \qquad (1)
$$

$$
\ell(S_t, \tilde S_t) = \frac{1}{N} \sum_{i=1}^{N} \lVert S_{t,i} - \tilde S_{t,i} \rVert_2^2 = 0 \qquad (2)
$$

$$
v(S_t) = \frac{1}{D} \sum_{j=1}^{D} \max\!\left(0,\; \gamma - \sqrt{\operatorname{Var}(S_{t,\cdot,j}) + \epsilon}\right) = 0 \qquad (3)
$$

$$
c(S_t) = \frac{1}{D} \sum_{j \neq k} \left[C(S_t)\right]_{j,k}^2 \approx 0 \qquad (4)
$$

where C(S_t) is the empirical covariance matrix of the batch S_t, and γ is the target standard deviation.

Equation 3 holds for large enough σ, and equation 4 holds because the noise variables are independent across episodes. The total sum of the loss components is then 0 for the described trivial solution.
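The argument can be checked numerically. The sketch below (an illustration, not the paper's code) computes the three VICReg loss terms for a batch where the "encoder" outputs fixed per-episode Gaussian noise and the forward model is the identity; the prediction and variance terms vanish exactly, and the off-diagonal covariance term shrinks toward 0 as the batch grows:

```python
import numpy as np

# Trivial solution: the encoder copies fixed per-episode noise, the forward
# model is the identity, so predictions equal encodings exactly.
rng = np.random.default_rng(0)
N, D, sigma, gamma, eps = 4096, 32, 2.0, 1.0, 1e-4

S = sigma * rng.standard_normal((N, D))  # one fixed noise vector per episode
S_pred = S.copy()                         # identity forward model: f(s, a) = s

# Prediction loss: predictions match encodings exactly.
pred_loss = np.mean(np.sum((S - S_pred) ** 2, axis=1))

# Variance hinge loss: per-dimension std is ~sigma > gamma, so the hinge is 0.
std = np.sqrt(S.var(axis=0) + eps)
var_loss = np.mean(np.maximum(0.0, gamma - std))

# Covariance loss: dimensions are independent, so off-diagonal entries of the
# sample covariance are near 0 (exactly 0 in expectation as N grows).
C = np.cov(S, rowvar=False)
cov_loss = ((C ** 2).sum() - (np.diag(C) ** 2).sum()) / D
```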

In the SimCLR case, as shown by Wang and Isola [36] in their Theorem 1, the InfoNCE loss is minimized in the limit of infinitely many negative samples if the positive pairs are perfectly aligned and the encoder output is perfectly uniform on the unit sphere. Both conditions are satisfied for the trivial solution described above, since normalized isotropic Gaussian vectors are uniformly distributed on the unit sphere; therefore the SimCLR objective is also susceptible to the fixed-distractor-noise problem.

Experiments

In order to verify whether the proposed JEPA-based methods indeed focus on fixed background noise, we introduce a simple moving dot dataset. The sequences of images contain a single dot on a square with sides of length 1, and the action denotes the delta in the dot's coordinates from the previous time step. We fix the length of the episode to 17 time steps (16 actions). After pre-training, we probe the representation by training a linear layer to recover the dot's position for all time steps. For pre-training, we use 1 million sequences; for training the prober we use 300,000 sequences; for evaluation, we use 10,000 sequences. We introduce two types of distractor noise to the background: structured and uniform. We generate structured noise by overlaying CIFAR-10 [19] images. Temporally, the noise can be changing, i.e., each time step the background is resampled; or fixed, i.e., the background does not change with time, but still changes between sequences. For examples, see figure 2. The coefficient α controls noise brightness relative to the dot. We tune hyperparameters of all methods separately for each noise level and type. For more details about the dataset, see appendix A.2.

Figure 3: Performance of the compared methods with different types and levels of noise. We tune hyperparameters individually for each model, noise level, and type. We show results without tuning in figure 5 in the appendix. The dots represent the mean RMSE across 17 time steps. The shaded area represents the standard deviation calculated by running 3 random seeds for each experiment.


Results

We compare the approaches described in section 2 under the different types of noise described above. We also add a baseline called 'Center', which always predicts the dot's location to be at the center. The 'Center' and 'Supervised' baselines should be upper and lower bounds on the error, respectively. The results are shown in figure 3. All methods perform well when there are no distractors. Reconstruction performs well in all settings with α ≤ 1.5, while JEPA-based methods fail in the presence of fixed noise, both structured and uniform. We hypothesize that, as described in section 2, these methods focus on the background noise and ignore the dot. We observe a similar drop in performance when an extra static dot is introduced instead of distractor noise (see appendix A.6.2 for more details). All methods work well when the noise changes every frame. Additionally, we find that JEPA-based methods do not require hyperparameter tuning to adapt to higher levels of changing noise, while reconstruction performs much worse with untuned hyperparameters (see appendix A.5). Inverse dynamics modeling performs well in all cases, but it may be unsuitable for pre-training, as it only learns representations that capture the agent, and fails when there is additional useful information to be captured, as we demonstrate with experiments on 3 dots in appendix A.6.

Conclusion

We demonstrate that JEPA-based methods offer a possible way forward for reconstruction-free forward model learning and are capable of ignoring unpredictable noise well, even without additional hyperparameter tuning. However, these methods fail when slow distractor features are present, even with a large pre-training dataset and hyperparameter tuning. We only demonstrate this with a toy dataset, but we hypothesize that the same may happen in more complex problems. For example, when pre-training a forward model for self-driving with JEPA on dash-cam videos, the model may focus on cloud patterns that are easily predictable, rather than trying to learn the traffic participants' behavior. This drawback of JEPA may be addressed by using image differences or optical flow as input to the model, although these input modalities ignore potentially useful background and may still contain fixed noise. We believe that the way to learn representations that capture both fast and slow features is by adding hierarchy to the architecture (see H-JEPA in [20]) or by changing the objective to impose an additional constraint that prevents the representations from being constant across time.


Appendix

Joint Embedding methods In recent years, multiple new joint-embedding methods for pre-training image classification systems have been introduced [10, 38, 3, 5, 4, 15, 17, 7]. These methods rely heavily on well-designed image augmentations to prevent collapse [35, 18]. The closely related works [6, 28] also investigate the tendency of contrastive losses to focus on easy features, although they do not study the application to videos.

Representation learning from video Many of the ideas from self-supervised pre-training for classification have also been applied to learning from video: [27] applies the InfoNCE loss [25] to learn encodings of video segments, while [21] modify the CPC architecture [25] to include samples from the same video clip as negative examples. [14] also propose a modification of CPC [25] for training from video: they use predicted representations and encoder outputs as positive pairs, and construct negative examples from both different videos and different fragments of the same video. [9] also use videos as an alternative to image augmentations. [31] use a triplet loss with a single negative example instead of InfoNCE. [16] adopt the Deep InfoMax objective [17] for videos, while [26] use MoCo [15].

Forward model learning Many methods using reconstruction objectives as the main training signal have been proposed [23, 11, 13, 12]. [24] proposes a reconstruction-free version of Dreamer [11, 13] using contrastive learning. [39] proposes a method to learn representations using a bisimulation objective, pushing states that yield the same rewards along the same trajectories to have the same representations. In contrast, we focus on the setting where the reward is unknown during pre-training. Notably, [39] also explores the performance of reconstruction and CPC [25] objectives in the presence of distractor noise, but the noise is not fixed in time. [22] explore various aspects of model-based and model-free offline RL, and test the reconstruction-based RSSM [12] with various perturbations, concluding that it performs quite well even with noisy data, which our experiments confirm.

Self-supervised learning for reinforcement learning Objectives from self-supervised learning for images have been actively used in reinforcement learning from pixels to improve the sample complexity of training [29, 33], or to pre-train the encoder [30, 34, 1, 40]. These methods focus on pre-training the encoder and do not evaluate the trained forward model, even when one is present in the system [30].


Dataset

We consider the problem of capturing the location of an object in a video sequence. To this end, we introduce a simple environment with a single dot inside a square. The dot cannot leave the square and is always visible on the screen. We denote the dot's coordinates at time t as c_t = (c^x_t, c^y_t). We assume the square is 1 by 1, therefore c^x_t, c^y_t ∈ [0, 1]. At each step, the dot takes an action a_t = (a^x_t, a^y_t) ∈ ℝ². The norm of a_t is restricted to be at most the maximum step size D: ‖a_t‖₂ ≤ D. In our experiments, D = 0.14.

The dot moves continuously around the square by moving by a vector specified by the action a t with clipping to prevent the dot from going outside the square, i.e. c t +1 = max(0 , min( c t + a t , 1)) .

In order to generate one dataset example, we randomly generate a starting location c_1 and a sequence of actions a_1 ... a_T, where T is the sequence length. In our experiments, T = 16. The actions in a sequence are generated by first sampling directions from a von Mises distribution with a randomly chosen mean direction ω ∼ Uniform(0, 2π): u_t = (u^x_t, u^y_t) ∼ VonMises(ω, 1.3). This prevents the dot from staying in one place in expectation, as would be the case with uniform random sampling. We then multiply the direction vectors by uniformly sampled step sizes d_t ∼ Uniform(0, D): a_t = d_t u_t. We then generate locations by adding the action vectors to the initial location. The action sequence a_1 ... a_T has length T, while the generated location sequence has length T+1 (in our case, 17). A diagram of the process is shown in figure 4.
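The generation process can be sketched as follows (a minimal illustration; the helper name `generate_episode` is ours, and actions recorded near the walls may differ from the realized displacement because of clipping):

```python
import numpy as np

def generate_episode(T=16, D=0.14, kappa=1.3, rng=None):
    """Sample one dot trajectory: actions a_1..a_T and locations c_1..c_{T+1}."""
    if rng is None:
        rng = np.random.default_rng()
    c = rng.uniform(0.0, 1.0, size=2)       # initial location c_1
    omega = rng.uniform(0.0, 2.0 * np.pi)   # episode-level mean direction
    locations, actions = [c], []
    for _ in range(T):
        theta = rng.vonmises(omega, kappa)  # von Mises direction sample
        d = rng.uniform(0.0, D)             # uniformly sampled step size
        a = d * np.array([np.cos(theta), np.sin(theta)])
        c = np.clip(c + a, 0.0, 1.0)        # keep the dot inside the unit square
        actions.append(a)
        locations.append(c)
    return np.array(actions), np.array(locations)

actions, locations = generate_episode(rng=np.random.default_rng(0))
```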

Once the actions and locations are generated, we obtain images o_t by rendering the dot at each generated location c_t: we set the pixel value at c_t to 1 and apply a Gaussian blur with σ = 0.05. In our experiments, the resolution of o_t is 28 × 28.

Figure 4: Diagram of data generation process. Initial position c 1 and action sequence a 1 . . . a T are first generated. Then, the actions are sequentially added to the initial location. The resulting coordinates c 1 . . . c T +1 are then rendered.


We introduce distractors to the dataset by overlaying noise onto the dot images. We consider two noise types, random and structured, and two temporal settings, fixed and changing. Random noise images Z are images of the same dimension as o_t in which each pixel is sampled from a uniform distribution, while structured noise images are loaded from the CIFAR-10 dataset.

In all cases, we add noise Z with a coefficient α: ô_t = o_t + αZ_t. Both the noise image and the dot image have values between 0 and 1, so the coefficient represents how many times brighter the brightest pixel of the noise is compared to the brightest pixel of the dot image.
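A sketch of the overlay and of the two temporal settings (illustrative code, not the paper's implementation):

```python
import numpy as np

def add_distractor(o, Z, alpha):
    """Overlay a noise image Z onto the dot image o with brightness alpha."""
    return o + alpha * Z

rng = np.random.default_rng(0)
o = np.zeros((28, 28))
o[14, 14] = 1.0                                  # toy single-pixel "dot" frame

# Changing noise: a fresh sample for each of the 17 frames.
Z_changing = rng.uniform(size=(17, 28, 28))
# Fixed noise: one sample per episode, repeated across all frames.
Z_fixed = np.broadcast_to(rng.uniform(size=(28, 28)), (17, 28, 28))

frames = add_distractor(o, Z_fixed, alpha=0.5)   # fixed-noise episode
```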

Training methods details

We denote the batch of training video sequences as a tuple of observation and action sequences (O, A), O = (O_1 ... O_{T+1}), O_t ∈ ℝ^{N×H×W}; A = (A_1 ... A_T), A_t ∈ ℝ^{N×M}. Here, T is the episode length, H × W is the resolution of the observation image, N is the batch size, and M is the dimension of the action. For all algorithms we test, the observations are processed by the encoder to obtain representations for each time step, S_t = g_φ(O_t), S_t ∈ ℝ^{N×D}, and the forward model is unfolded from the first observation with the given actions: S̃_t = f_θ(S̃_{t-1}, A_{t-1}); S̃_1 = S_1. We use S, S̃ to denote encodings and predictions for all time steps: S = (S_1, ..., S_{T+1}), S̃ = (S̃_1, ..., S̃_{T+1}).

VICReg

The VICReg objective was originally used for image classification [3]. We follow the ideas described in [20] and adapt the objective to learning from video. We consider representations at each time step separately for calculating the variance and covariance losses, while the representation (invariance) loss becomes the prediction error between forward-model and encoder outputs. The total loss and its components are:

$$
\mathcal{L} = \sum_{t=1}^{T+1} \left[ \lambda\, \ell(S_t, \tilde S_t) + \mu\, v(S_t) + \nu\, c(S_t) \right]
$$

$$
\ell(S_t, \tilde S_t) = \frac{1}{N} \sum_{i=1}^{N} \lVert S_{t,i} - \tilde S_{t,i} \rVert_2^2
$$

$$
v(S_t) = \frac{1}{D} \sum_{j=1}^{D} \max\!\left(0,\; \gamma - \sqrt{\operatorname{Var}(S_{t,\cdot,j}) + \epsilon}\right)
$$

$$
C(S_t) = \frac{1}{N-1} \sum_{i=1}^{N} \left(S_{t,i} - \bar S_t\right)\left(S_{t,i} - \bar S_t\right)^{\top}, \qquad \bar S_t = \frac{1}{N} \sum_{i=1}^{N} S_{t,i}
$$

$$
c(S_t) = \frac{1}{D} \sum_{j \neq k} \left[C(S_t)\right]_{j,k}^2
$$

where λ, μ, and ν are weighting hyperparameters, and γ is the target standard deviation.

SimCLR

For the SimCLR adaptation of JEPA, we again use the same strategy as with VICReg and treat each time-step prediction separately. We apply SimCLR's InfoNCE loss [5] as follows:

$$
\mathcal{L} = \sum_{t=1}^{T+1} \frac{1}{N} \sum_{i=1}^{N} \ell_{t,i}
$$

$$
\ell_{t,i} = -\log \frac{\exp\!\left(\operatorname{sim}(S_{t,i}, \tilde S_{t,i}) / \tau\right)}{\sum_{k \neq i} \exp\!\left(\operatorname{sim}(S_{t,i}, \tilde S_{t,k}) / \tau\right)}
$$

$$
\operatorname{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert_2\, \lVert v \rVert_2}
$$

where τ is the temperature hyperparameter.
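A sketch of this per-time-step InfoNCE computation (illustrative; here the softmax denominator includes the positive term, as in the standard SimCLR formulation, and `tau` is a placeholder temperature, not the paper's value):

```python
import numpy as np

def infonce_loss(S, S_pred, tau=0.5):
    """InfoNCE over one time step: (s_i, s~_i) are positive pairs; the other
    batch elements serve as negatives. Cosine similarity, temperature tau."""
    Sn = S / np.linalg.norm(S, axis=1, keepdims=True)
    Pn = S_pred / np.linalg.norm(S_pred, axis=1, keepdims=True)
    logits = Sn @ Pn.T / tau                      # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))            # positives lie on the diagonal

rng = np.random.default_rng(0)
S = rng.standard_normal((8, 16))
loss_pos = infonce_loss(S, S + 0.01 * rng.standard_normal((8, 16)))  # aligned pairs
loss_rand = infonce_loss(S, rng.standard_normal((8, 16)))            # random pairs
```

Aligned forward-model outputs yield a lower loss than random ones, which is the signal that trains the predictor.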

Inverse Dynamics Modeling (IDM)

We add a linear layer that, given the encoder's outputs at two consecutive steps g_φ(o_t), g_φ(o_{t+1}), predicts the action a_t. This should make the encoder pay attention to the parts of the observation that are affected by the action. The predictor is trained by predicting the encoder output at the next time step. We denote the inverse dynamics model by h_ξ(s_t, s_{t+1}) = ã_t. The loss components are then:

$$
\tilde A_t = h_\xi(S_t, S_{t+1})
$$

$$
\mathcal{L}_{\text{IDM}} = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N} \lVert A_{t,i} - \tilde A_{t,i} \rVert_2^2
$$

$$
\mathcal{L}_{\text{pred}} = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N} \lVert S_{t+1,i} - \tilde S_{t+1,i} \rVert_2^2
$$
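The inverse-dynamics head is just a linear map from concatenated consecutive representations to the action. The sketch below fits such a map in closed form on toy linear dynamics, purely to illustrate the interface; in the paper the head is trained jointly with the encoder by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 1024, 32, 2
S_t = rng.standard_normal((N, D))
A_t = 0.1 * rng.standard_normal((N, M))
# Toy dynamics (an assumption for this demo): the action shifts the first
# two representation dimensions, so the action is linearly recoverable.
S_next = S_t + np.pad(A_t, ((0, 0), (0, D - M)))

X = np.concatenate([S_t, S_next], axis=1)   # (N, 2D) inverse-dynamics input
W, *_ = np.linalg.lstsq(X, A_t, rcond=None) # linear inverse-dynamics model
idm_loss = np.mean(np.sum((X @ W - A_t) ** 2, axis=1))
```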

Reconstruction

The reconstruction approach introduces a decoder d_ξ(s̃_t) = õ_t and trains the encoder and predictor with a pixel-level reconstruction objective. The decoder mimics the architecture of the encoder, using up-sampling followed by convolutional layers to match the strided convolutional layers of the encoder. The total loss is:

$$
\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N} \lVert O_{t,i} - d_\xi(\tilde S_{t,i}) \rVert_2^2 \qquad (17)
$$


Figure 5: Performance of the compared methods with different types and levels of noise. Hyperparameters for each model are chosen for the no-noise setting and, in contrast to figure 3, are not re-tuned for each type or level of noise. The dots represent the mean RMSE across 17 time steps. The shaded area shows the standard deviation calculated by running 3 random seeds for each experiment.

Probing details

In order to test whether representations contain the desired information, we train a probing function q ( s ) : R D → R Q which we represent with one linear layer. D is the representation size, Q is the target value dimension. In our case, the target value is the dot's location, so Q = 2 . When training the prober, we follow a similar protocol to the pre-training. Given the initial observation o 1 and a sequence of actions of length T , a 1 . . . a T , we encode the first observation ˜ s 1 = g φ ( o 1 ) , then auto-regressively apply the forward model ˜ s t = f θ (˜ s t -1 , a t -1 ) , obtaining representations for a sequence of length T +1 : ˜ s 1 . . . ˜ s T +1 . We denote the location of the dot in the i -th batch element at time step t as C t,i . Then, the loss function we apply to train q is:

$$
\mathcal{L}_{\text{probe}} = \frac{1}{T+1} \sum_{t=1}^{T+1} \frac{1}{N} \sum_{i=1}^{N} \lVert q(\tilde S_{t,i}) - C_{t,i} \rVert_2^2
$$
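A sketch of the probing step (illustrative; closed-form ridge regression stands in for the gradient-based training of q, and the linear toy target is an assumption made so the probe can succeed by construction):

```python
import numpy as np

# Linear probing: with encoder and predictor frozen, fit a single linear
# layer mapping D-dimensional representations to the Q=2 dot coordinates.
rng = np.random.default_rng(0)
N, D, Q = 2048, 512, 2
true_W = rng.standard_normal((D, Q)) / np.sqrt(D)
S_tilde = rng.standard_normal((N, D))   # frozen predicted representations
C = S_tilde @ true_W                    # toy targets: locations linear in s~

lam = 1e-3                              # small ridge term (assumed)
W = np.linalg.solve(S_tilde.T @ S_tilde + lam * np.eye(D), S_tilde.T @ C)
probe_rmse = np.sqrt(np.mean(np.sum((S_tilde @ W - C) ** 2, axis=1)))
```

A low probing error indicates that the representations linearly encode the dot's location; for the failed JEPA runs this error approaches that of the 'Center' baseline.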

Additional results

In figure 5 we show results for the same experiments as in figure 3, with the one difference that we do not tune hyperparameters for each model and noise setting. Instead, we pick the best hyperparameters for the noise-free setting and fix them for all levels of noise. We see that, compared to their tuned performance, the SimCLR and VICReg JEPA methods fail at lower levels of noise, while reconstruction fails with changing uniform noise without tuning.

3 dots dataset

Environment

In this dataset, there are three dots, one on each of the three channels of the image: an action-controlled dot, an uncontrollable dot, and a stationary dot. The first dot is action-controlled, exactly as in the single-dot environment without noise. The second dot moves like the first, but its actions are unknown, making its movements impossible to predict. The third dot is stationary across all frames of an episode, but is at a different position in each episode. To generate a sample of the three-dot dataset, we place the dots on separate channels of the image. The fixed channel assignment ensures that the model can distinguish the dots. We show an example sequence in figure 6. The goal is to learn a representation that captures the locations of the action-controlled and stationary dots while ignoring the randomly moving dot.

Figure 6: Example sequence of 3 dots dataset. Red, the first channel, corresponds to the action controlled dot. Green corresponds to the randomly moving dot with unknown actions. Blue corresponds to the stationary dot.


Table 1: Results for 3-dots dataset. All numbers denote RMSE across 17 steps. We run 3 seeds for each experiment to obtain standard deviations. Cells are colored according to the values, with higher values shown in red, and lower values shown in blue.


Model architectures

The encoder consists of 3 convolutional layers with ReLU and BatchNorm after each layer, and average pooling with kernel size of 2 by 2 and stride of 2 at the end. The first convolution layer has kernel size 5, stride 2, padding 2, and output dimension of 32. The second layer is the same as the first, except the output dimension is 64. The final layer has kernel size 3, stride 1, padding 1, and output dimension of 64. After average pooling, a linear layer is applied, with output dimension of 512.

The predictor is a single-layer GRU [8] with hidden size 512 and input size 2. The hidden state is initialized at the first prediction step with the encoder output g_φ(o_1). The inputs at each time step are the actions.

Reconstruction introduces a decoder represented by a model symmetric to the encoder, with convolutions in reverse order and upsampling. We do not use a latent variable in our implementation.
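Under the stated sizes, the encoder and predictor can be sketched in PyTorch as follows (the input channel count of 1 and the flattened dimension 64·3·3 for 28×28 inputs are inferred from the dataset description, not stated in the text):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Three conv layers, each followed by ReLU and BatchNorm, then 2x2
    average pooling and a linear layer to the 512-dim representation."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2, padding=2),
            nn.ReLU(), nn.BatchNorm2d(32),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(), nn.BatchNorm2d(64),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(), nn.BatchNorm2d(64),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )
        self.fc = nn.Linear(64 * 3 * 3, out_dim)  # 28x28 input -> 3x3 feature map

    def forward(self, o):                  # o: (N, 1, 28, 28)
        return self.fc(self.conv(o).flatten(1))

# Single-layer GRU predictor: hidden size 512, input size 2 (the actions).
predictor = nn.GRU(input_size=2, hidden_size=512)

enc = Encoder()
o1 = torch.randn(8, 1, 28, 28)
actions = torch.randn(16, 8, 2)            # (T, N, action_dim)
h0 = enc(o1).unsqueeze(0)                  # hidden state initialized with g_phi(o_1)
preds, _ = predictor(actions, h0)          # predicted states s~_2 ... s~_{T+1}
```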

Limitations

The main limitation of the current experiments is the use of a simple toy dataset; it is unclear whether the same conclusions hold for more complicated video datasets and bigger models. Another limitation is that we only test the VICReg and SimCLR losses for JEPA methods, while many more objectives exist, e.g., [10, 38].

Computational resources

All experiments were run on AMD MI50 or Nvidia RTX 8000 GPUs. For each noise type and level, 100 random hyperparameter configurations were run, and the best one was re-run with 3 seeds. Each individual experiment takes less than one hour of GPU time.

Code license

To implement the reconstruction-based approach, we used parts of the implementation of [2], distributed under the MIT license. For the InfoNCE loss, we used code from [32], also distributed under the MIT license.

Acknowledgements

Many common methods for learning a world model for pixel-based environments use generative architectures trained with pixel-level reconstruction objectives. Recently proposed Joint Embedding Predictive Architectures (JEPA) [20] offer a reconstruction-free alternative. In this work, we analyze performance of JEPA trained with VICReg and SimCLR objectives in the fully offline setting without access to rewards, and compare the results to the performance of the generative architecture. We test the methods in a simple environment with a moving dot with various background distractors, and probe learned representations for the dot’s location. We find that JEPA methods perform on par or better than reconstruction when distractor noise changes every time step, but fail when the noise is fixed. Furthermore, we provide a theoretical explanation for the poor performance of JEPA-based methods with fixed noise, highlighting an important limitation.

Currently, the most common approach to learning world models is to use reconstruction objectives [11, 13, 12, 23]. However, reconstruction objectives suffer from object-vanishing, as the objectives by design do not distinguish important objects in the scene [24]. The framework of Joint Embedding Predictive Architectures presented in [20] may offer a possible alternative to reconstruction-based objectives, as has been shown in [30, 29, 24]. In this paper, we implement111Code is available at https://github.com/vladisai/JEPA_SSL_NeurIPS_2022 JEPA for learning from image and action sequences with VICReg [3] and SimCLR [5] objectives (section 2 and code). We then test JEPA, reconstruction-based, and inverse dynamics modeling methods by training on image sequences of one moving dot and probing the learned representations for dot location in the presence of various noise distractors (section 3). We observe that JEPA methods can learn to ignore distractor noise that changes every time step, but fail when distractor noise is static.

We focus on learning a world model from a fixed set of state-action sequences. We consider a Markov Decision Process (MDP) $M = (O, A, P, R)$. $O$ is the set of possible observations (in our case, images), $A$ is the set of possible actions, $P = Pr(o_t \mid o_{t-1}, a_{t-1})$ gives the transition probabilities, and $R: O \times A \to \mathbb{R}$ is the reward function. In the offline setting we focus on, we do not have access to the MDP directly; we only have a pre-recorded set of sequences of observations $o_t \in O$ and actions $a_t \in A$. Our goal is to use these offline sequences to learn an encoder $g_\phi: O \to \mathbb{R}^D$ that maps observations to $D$-dimensional representations, $s_t = g_\phi(o_t)$, and a forward model $f_\theta: \mathbb{R}^D \times A \to \mathbb{R}^D$ with $\tilde{s}_{t+1} = f_\theta(s_t, a_t)$. During pre-training we do not have access to the evaluation task; we therefore aim to learn representations that capture as much information about the environment and its dynamics as possible in $D$-dimensional vectors.

During both training and evaluation, given the initial observation $o_1$ and a sequence of actions $a_1 \dots a_T$ of length $T$, we encode the first observation $\tilde{s}_1 = g_\phi(o_1)$, then auto-regressively apply the forward model $\tilde{s}_t = f_\theta(\tilde{s}_{t-1}, a_{t-1})$, obtaining representations $\tilde{s}_1 \dots \tilde{s}_{T+1}$ for a sequence of length $T+1$. During testing, we probe the representations with a single linear layer $q(\tilde{s}_t): \mathbb{R}^D \to \mathbb{R}^Q$, trained with frozen encoder and predictor to recover known properties of states, where $Q$ is the dimension of the target value. This protocol is related to the one used in [1], but we probe not only the encoder output but also the predictor output. For more details on probing, see appendix A.4. We compare multiple ways of training the encoder and predictor. In all approaches, gradients are propagated through time, and no image augmentations are used. The encoder and predictor architectures are fixed (see appendix A.7). We test the following methods:

VICReg (figure 1(a)) We take inspiration from [20] and adopt the VICReg [3] objective for training a Joint Embedding Predictive Architecture (JEPA). We apply the variance and covariance losses described in [3] to the representations separately at each step, and apply the prediction loss to bring the forward model output $\tilde{s}_t$ close to the encoder output $s_t$. For a more detailed description, see appendix A.3.1.

SimCLR (figure 1(b)) Here we again train a JEPA, but use the SimCLR objective [5] instead of VICReg. We apply the InfoNCE loss [25], with the forward model output and the encoder output for the same time step forming a positive pair. For a more detailed description, see appendix A.3.2.

Reconstruction The reconstruction approach introduces a decoder $d_\xi(\tilde{s}_t) = \tilde{o}_t$ and trains the encoder and predictor with the reconstruction objective $\mathcal{L} = \frac{1}{T}\sum_{t=1}^{T} \Vert o_t - \tilde{o}_t \Vert_2^2$. See appendix A.3.4 for more details.

Inverse Dynamics Modeling (IDM) We add a linear layer that predicts $a_t$ given the encoder's outputs at two consecutive steps, $g_\phi(o_t)$ and $g_\phi(o_{t+1})$. The forward model is trained to predict the encoder's output at the next time step. For more details, see appendix A.3.3.

Supervised All components are trained end-to-end by propagating the error from the probing function to both the encoder and predictor. This should give a lower bound on the probing error.

Random In this case, the probing is run with fixed random weights of the encoder and predictor.
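All of the methods above share the same unrolling protocol: encode the first observation, then roll the forward model through the action sequence. A minimal NumPy sketch (with hypothetical helper names, not the released code):

```python
import numpy as np

def unroll(encoder, forward_model, o1, actions):
    """Encode the first observation, then roll the forward model through
    the action sequence, collecting representations s~_1 ... s~_{T+1}."""
    s = encoder(o1)                      # s~_1 = g_phi(o_1)
    states = [s]
    for a in actions:                    # length-T action sequence
        s = forward_model(s, a)          # s~_t = f_theta(s~_{t-1}, a_{t-1})
        states.append(s)
    return np.stack(states)              # shape (T+1, D)
```

The probing methods below read off these $T+1$ representations with a frozen encoder and predictor.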

VICReg and SimCLR JEPA methods may fall prey to a spurious-correlation issue: the loss can be minimized by paying attention to noise that does not change with time, causing the system to ignore all other information. Intuitively, the objectives make the model focus on 'slow features' [37]. When the slowest features in the input belong to fixed background noise, the representation will contain only the noise. To demonstrate this, we exhibit a trivial but plausible solution. In the presence of normally distributed distractor noise that does not change with time, the model can extract features by directly copying the values of the noise from the input: $g_\phi(o_t) = s \sim \mathcal{N}(0, \sigma^2 I)$, $s \in \mathbb{R}^D$. Since the noise persists through time, $g_\phi(o_t) = g_\phi(o_{t+1}) = s$. We assume the forward model has converged to the identity: $f_\theta(s, a) = s$. We denote a batch of encoded representations at step $t$ by a matrix $S_t \in \mathbb{R}^{N \times D}$, where $N$ is the batch size, and batches of actions and observations by $A_t$ and $O_t$ respectively. Then the VICReg losses are:

Equation 3 holds for large enough $\sigma$; equation 4 holds because the noise variables are independent across episodes. The total sum of the loss components is therefore 0 for the described trivial solution.

In the SimCLR case, as shown by Wang and Isola [36] in their Theorem 1, the InfoNCE loss is minimized in the limit of infinitely many negative samples when the positive pairs are perfectly aligned and the encoder output is perfectly uniform on the unit sphere. Both conditions are satisfied by the trivial solution described above; therefore the SimCLR objective is also susceptible to the fixed distractor noise problem.
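The trivial solution can be checked numerically. The sketch below (with assumed batch size, dimensionality, and $\sigma$) copies per-episode fixed noise into the representation, assumes an identity forward model, and evaluates the three VICReg terms:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, D, gamma = 512, 5, 32, 1.0
sigma = 1.5                                     # assumed noise scale > gamma

# Trivial solution: the encoder copies the per-episode fixed noise, so the
# representation is constant across time within each episode.
noise = rng.normal(0.0, sigma, size=(N, D))
S = np.stack([noise] * (T + 1))                 # identical S_t for every t

# Prediction loss: identity forward model gives zero error.
pred = np.mean(np.sum((S[1:] - S[:-1]) ** 2, axis=-1))

# Variance hinge: per-dimension std across the batch is ~sigma > gamma.
std = S.std(axis=1, ddof=1)                     # (T+1, D)
var_loss = np.maximum(0.0, gamma - std).mean()

# Covariance: dimensions are independent, so off-diagonal covariances are ~0.
C = np.cov(S[0].T)                              # (D, D) covariance matrix
cov_loss = (np.triu(C, k=1) ** 2).sum() / D
print(pred, var_loss, cov_loss)
```

The prediction and variance terms are exactly zero, and the covariance term shrinks toward zero as the batch grows, so the trivial solution minimizes the objective while carrying no information about the dot.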

In order to verify whether the proposed JEPA-based methods indeed focus on fixed background noise, we introduce a simple moving-dot dataset. The sequences of images contain a single dot on a square with sides of length 1, and the action denotes the delta in the dot's coordinates from the previous time step. We fix the length of an episode to 17 time steps (16 actions). After pre-training, we probe the representations by training a linear layer to recover the dot's position at all time steps. For pre-training, we use 1 million sequences; for training the prober, 300,000 sequences; for evaluation, 10,000 sequences. We introduce two types of distractor noise to the background: structured and uniform. We generate structured noise by overlaying CIFAR-10 [19] images. Temporally, the noise can be changing, i.e., the background is resampled each time step; or fixed, i.e., the background does not change with time but still changes between sequences. For examples, see figure 2. The coefficient $\alpha$ controls noise brightness relative to the dot. We tune hyperparameters of all methods separately for each noise level and type. For more details about the dataset, see the appendix.

We compare the approaches described in section 2 with the different types of noise described above. We also add a baseline called 'Center', which corresponds to always predicting the dot's location to be in the center. The 'Center' and 'Supervised' baselines should be upper and lower bounds on the error. The results are shown in figure 3. All methods perform well when there are no distractors. Reconstruction performs well in all settings with $\alpha \leq 1.5$, while JEPA-based methods fail in the presence of fixed noise, both structured and uniform. We hypothesize that, as described in section 2, these methods focus on the background noise and ignore the dot. We observe a similar drop in performance when an extra static dot is introduced instead of distractor noise (see appendix A.6.2 for more details). All methods work well when the noise changes every frame. Additionally, we find that JEPA-based methods do not require hyperparameter tuning to adapt to higher levels of changing noise, while reconstruction performs much worse with untuned hyperparameters (see appendix A.5). Inverse dynamics modeling performs well in all cases, but this method may be unsuitable for pre-training, as it only learns representations that capture the agent and fails when there is additional useful information to be captured, as we demonstrate with additional experiments with 3 dots in appendix A.6.

We demonstrate that JEPA-based methods offer a possible way forward for reconstruction-free forward model learning and are capable of ignoring unpredictable noise well, even without additional hyperparameter tuning. However, these methods fail when the slowest features are distractors, even with a large pre-training dataset and hyperparameter tuning. We only demonstrate this with a toy dataset, but we hypothesize that the same may happen in more complex problems. For example, when pre-training a forward model for self-driving with JEPA on dash-cam videos, the model may focus on cloud patterns that are easily predictable rather than learning the traffic participants' behavior. This drawback of JEPA may be addressed by using image differences or optical flow as input to the model, although these input modalities discard potentially useful background information and may still contain fixed noise. We believe that the way to learn representations that capture both fast and slow features is to add hierarchy to the architecture (see H-JEPA in [20]) or to change the objective to impose an additional constraint that prevents the representations from being constant across time.

Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

Did you describe the limitations of your work? [Yes] See appendix A.8.

Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

If you are including theoretical results…

If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]

If you used crowdsourcing or conducted research with human subjects…

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

In recent years, multiple new joint-embedding methods for pre-training image classification systems have been introduced [10, 38, 3, 5, 4, 15, 17, 7]. These methods rely heavily on well-designed image augmentations to prevent collapse [35, 18]. The closely related works of [6, 28] also investigate the tendency of contrastive losses to focus on easy features, although they do not study the application to videos.

Many of the ideas from self-supervised pre-training for classification have also been applied to learning from video: [27] applies the InfoNCE loss [25] to learn encodings of video segments, while [21] modifies the CPC architecture [25] to include samples from the same video clip as negative examples. [14] also proposes a modification of CPC [25] for training from video, using predicted representations and encoder outputs as positive pairs and constructing negative examples from both different videos and different fragments of the same video. [9] likewise uses videos as an alternative to image augmentations. [31] uses a triplet loss with a single negative example instead of InfoNCE. [16] adopts the Deep InfoMax objective [17] for videos, while [26] uses MoCo [15].

Many methods using reconstruction objectives as the main training signal have been proposed [23, 11, 13, 12]. [24] proposes a reconstruction-free version of Dreamer [11, 13] using contrastive learning. [39] proposes a method to learn representations using a bisimulation objective, pushing states that yield the same rewards under the same action sequences to have the same representations. In contrast, we focus on the setting where the reward is unknown during pre-training. Notably, [39] also explores the performance of reconstruction and CPC [25] objectives in the presence of distractor noise, but the noise is not fixed in time. [22] explores various aspects of model-based and model-free offline RL, and tests the reconstruction-based RSSM [12] with various perturbations, concluding that it performs quite well even with noisy data, which our experiments confirm.

Objectives from self-supervised learning for images have been actively used in reinforcement learning from pixels to improve the sample complexity of training [29, 33], or to pre-train the encoder [30, 34, 1, 40]. These methods focus on pre-training the encoder and do not evaluate the trained forward model, even when one is present in the system [30].

We consider the problem of capturing the location of an object in a video sequence. To this end, we introduce a simple environment with just one dot inside a square. The dot cannot leave the square and is always visible on the screen. We denote the dot's coordinates at time $t$ as $c_t = (c_t^x, c_t^y)$. We assume the square is 1 by 1, therefore $c_t^x, c_t^y \in [0, 1]$. At each step, the dot takes an action $a_t = (a_t^x, a_t^y) \in \mathbb{R}^2$. The norm of $a_t$ is restricted to be at most a maximum step size $D$: $\Vert a_t \Vert_2 \leq D$. In our experiments, $D = 0.14$.

The dot moves continuously around the square by the vector specified by the action $a_t$, with clipping to prevent the dot from leaving the square: $c_{t+1} = \max(0, \min(c_t + a_t, 1))$.

In order to generate one dataset example, we randomly sample a starting location $c_1$ and a sequence of actions $a_{1 \dots T}$, where $T$ is the sequence length; in our experiments $T = 16$. The actions in a sequence are generated by first sampling directions from a von Mises distribution with a randomly chosen mean direction $\omega \sim \mathrm{Uniform}(0, 2\pi)$: $u_t = (u_t^x, u_t^y) \sim \mathrm{VonMises}(\omega, 1.3)$. This prevents the dot from staying in one place in expectation, as would be the case with uniform random sampling. We then multiply the direction vectors by uniformly sampled step sizes $d_t \sim \mathrm{Uniform}(0, D)$: $a_t = d_t u_t$. We then generate locations by sequentially adding the action vectors to the initial location. The action sequence $a_{1 \dots T}$ has length $T$, while the generated location sequence has length $T+1$ (17 in our case). A diagram of the process is shown in figure 4.

Once the actions and locations are generated, we obtain images $o_t$ by rendering the dot at each generated location $c_t$: we set the pixel value at $c_t$ to 1 and apply a Gaussian blur with $\sigma = 0.05$. In our experiments, the resolution of $o_t$ is $28 \times 28$.
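The generation process described above can be sketched as follows (a simplified re-implementation, not the released code; the Gaussian-blob rendering approximates the blur of a delta image):

```python
import numpy as np

def generate_episode(T=16, D=0.14, size=28, sigma=0.05, rng=None):
    """Sample a moving-dot episode: von Mises directions around a random
    mean, uniform step sizes, clipped positions, Gaussian-blob rendering."""
    if rng is None:
        rng = np.random.default_rng()
    omega = rng.uniform(0, 2 * np.pi)
    angles = rng.vonmises(omega, 1.3, size=T)
    steps = rng.uniform(0, D, size=T)
    actions = steps[:, None] * np.stack([np.cos(angles), np.sin(angles)], axis=1)

    c = rng.uniform(0, 1, size=2)
    coords = [c]
    for a in actions:
        c = np.clip(c + a, 0.0, 1.0)     # c_{t+1} = max(0, min(c_t + a_t, 1))
        coords.append(c)
    coords = np.stack(coords)            # (T+1, 2)

    # Render each position as a Gaussian blob on a size x size grid.
    grid = (np.arange(size) + 0.5) / size
    xx, yy = np.meshgrid(grid, grid)
    frames = np.stack([
        np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2))
        for cx, cy in coords
    ])
    return frames, actions, coords
```

Each episode yields 17 frames, 16 actions, and 17 ground-truth locations, matching the episode length used in the experiments.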

We introduce distractors to the dataset by overlaying noise onto the dot images. We consider two noise types: random and structured; and two temporal settings: fixed and changing. Random noise images $Z$ have the same dimensions as $o_t$, with each pixel sampled from a uniform distribution, while structured noise images are taken from the CIFAR-10 dataset.

In all cases, we add noise $Z$ with a coefficient $\alpha$: $\hat{o}_t = o_t + \alpha Z_t$. Both the noise image and the dot image have values between 0 and 1, so the coefficient expresses how many times brighter the brightest noise pixel is than the brightest pixel of the dot image.
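A minimal sketch of the overlay step for the uniform-noise case (`add_noise` is a hypothetical helper; the fixed setting shares one noise image across the clip, the changing setting resamples per frame):

```python
import numpy as np

def add_noise(frames, alpha, fixed=True, rng=None):
    """Overlay uniform distractor noise on a (T+1, H, W) clip: one shared
    noise image when `fixed`, a freshly sampled image per frame otherwise."""
    if rng is None:
        rng = np.random.default_rng()
    if fixed:
        z = rng.uniform(0.0, 1.0, size=frames.shape[1:])
        noise = np.broadcast_to(z, frames.shape)
    else:
        noise = rng.uniform(0.0, 1.0, size=frames.shape)
    return frames + alpha * noise
```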

We denote a batch of training video sequences as a tuple of observation and action sequences $(\mathcal{O}, \mathcal{A})$, with $\mathcal{O} = (O_1 \dots O_{T+1})$, $O_t \in \mathbb{R}^{N \times H \times W}$, and $\mathcal{A} = (A_1 \dots A_T)$, $A_t \in \mathbb{R}^{N \times M}$. Here $T$ is the episode length, $H \times W$ is the resolution of the observation image, $N$ is the batch size, and $M$ is the dimension of the action. For all algorithms we test, the observations are processed by an encoder to obtain representations for each time step, $S_t = g_\phi(O_t)$, $S_t \in \mathbb{R}^{N \times D}$, and the forward model is unrolled from the first observation with the given actions: $\tilde{S}_t = f_\theta(\tilde{S}_{t-1}, A_{t-1})$, $\tilde{S}_1 = S_1$. We use $\mathcal{S}$ and $\tilde{\mathcal{S}}$ to denote encodings and predictions for all time steps: $\mathcal{S} = (S_1, \dots, S_{T+1})$, $\tilde{\mathcal{S}} = (\tilde{S}_1, \dots, \tilde{S}_{T+1})$.

The VICReg objective was originally used for image classification [3]. We follow the ideas described in [20] and adapt the objective to learning from video. We compute the variance and covariance losses on the representations at each time step separately, while the representation (invariance) term becomes the prediction error between the forward model and encoder outputs. The total loss and its components are:

To adapt SimCLR to JEPA, we again use the same strategy as with VICReg and treat each time step's predictions separately. We apply SimCLR's InfoNCE loss [5] as follows:

In order to test whether the representations contain the desired information, we train a probing function $q(s): \mathbb{R}^D \to \mathbb{R}^Q$, represented by a single linear layer. $D$ is the representation size and $Q$ is the target value dimension; in our case the target is the dot's location, so $Q = 2$. When training the prober, we follow a protocol similar to pre-training: given the initial observation $o_1$ and a sequence of actions $a_1 \dots a_T$, we encode the first observation $\tilde{s}_1 = g_\phi(o_1)$, then auto-regressively apply the forward model $\tilde{s}_t = f_\theta(\tilde{s}_{t-1}, a_{t-1})$, obtaining representations $\tilde{s}_1 \dots \tilde{s}_{T+1}$. We denote the location of the dot in the $i$-th batch element at time step $t$ as $C_{t,i}$. The loss we apply to train $q$ is the mean squared error between the prober's outputs and the true locations:

$$\mathcal{L}_{\mathrm{probe}} = \frac{1}{(T+1)N}\sum_{t=1}^{T+1}\sum_{i=1}^{N}\Vert q(\tilde{S}_{t,i}) - C_{t,i}\Vert_2^2$$
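A sketch of the probing objective (hypothetical names; `W` and `b` are the parameters of the single linear layer):

```python
import numpy as np

def probe_loss(W, b, S_tilde, C):
    """Mean squared error between the linear prober's outputs
    q(s) = s W + b and the true dot locations C.
    S_tilde: (T+1, N, D) predicted representations; C: (T+1, N, 2)."""
    preds = S_tilde @ W + b
    return np.mean(np.sum((preds - C) ** 2, axis=-1))
```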

In figure 5 we show results for the same experiments as in figure 3, with the difference that we do not tune hyperparameters for each model and noise setting: we pick the best hyperparameters for the noise-free setting and fix them for all noise levels. Compared to their tuned performance, the SimCLR and VICReg JEPA methods fail at lower noise levels, while reconstruction fails with changing uniform noise when untuned.

In this dataset, there are three dots on the three channels of the image: an action-controlled dot, an uncontrollable dot, and a stationary dot. The first dot is action-controlled, as in the single-dot environment without noise. The second dot moves like the first, but its actions are unknown, making its movements impossible to predict. The third dot is stationary across all frames of an episode, but its position changes between episodes. To generate a sample of the three-dot dataset, we place the dots on separate image channels; the fixed channel assignment ensures that the model can distinguish between the dots. We show an example sequence in figure 6. The goal is to learn a representation that captures the locations of the action-controlled and stationary dots while ignoring the random dot.

We show the performance of the compared methods on the 3-dots dataset in table A.6.2. Hyperparameters were tuned for each model to obtain the best average RMSE. VICReg- and SimCLR-based methods focus only on the stationary dot and fail to capture the other two. Again, we hypothesize that JEPA methods capture 'slow features' [37]: with the stationary dot providing the slowest features, the other dots are ignored. IDM captures the action-controlled dot but ignores the stationary dot, as it is irrelevant to inverse dynamics. The reconstruction-based approach captures both the stationary and action-controlled dots.

The encoder consists of 3 convolutional layers with ReLU and BatchNorm after each layer, and average pooling with kernel size of 2 by 2 and stride of 2 at the end. The first convolution layer has kernel size 5, stride 2, padding 2, and output dimension of 32. The second layer is the same as the first, except the output dimension is 64. The final layer has kernel size 3, stride 1, padding 1, and output dimension of 64. After average pooling, a linear layer is applied, with output dimension of 512.

The predictor is a single-layer GRU [8] with hidden representation size 512 and input size 2. The hidden state is initialized at the first prediction step with the encoder output $g_\phi(o_1)$; the inputs at each time step are the actions.
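A PyTorch sketch of these two modules, as we read the description above (the BatchNorm/ReLU ordering and the flattened size of 576 for 28×28 inputs are our assumptions, not stated in the text):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Three conv layers with BatchNorm and ReLU, 2x2 average pooling,
    then a linear layer mapping to a 512-dimensional representation."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )
        # 28 -> 14 -> 7 -> 7 -> 3 spatially, so 64 * 3 * 3 = 576 features.
        self.fc = nn.Linear(576, out_dim)

    def forward(self, o):
        return self.fc(self.conv(o).flatten(1))

class Predictor(nn.Module):
    """Single-layer GRU: hidden state initialized with the encoder
    output, actions fed as inputs at each time step."""
    def __init__(self, dim=512, action_dim=2):
        super().__init__()
        self.gru = nn.GRU(action_dim, dim, batch_first=True)

    def forward(self, s1, actions):                   # actions: (N, T, 2)
        out, _ = self.gru(actions, s1.unsqueeze(0).contiguous())
        return out                                    # (N, T, dim)
```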

Reconstruction introduces a decoder represented by a model symmetric to the encoder, with convolutions in reverse order and upsampling. We do not use a latent variable in our implementation.

A limitation of the current experiments is the use of a simple toy dataset; it is unclear whether the same findings hold for more complicated video datasets and bigger models. Another limitation is that we only test the VICReg and SimCLR losses for JEPA methods, while many other objectives exist, e.g., [10, 38].

All experiments were run using AMD MI50 or Nvidia RTX 8000 GPUs. For each noise type and level, 100 random hyperparameters were run, and the best one was run for 3 seeds. Each individual experiment takes less than one hour of GPU time.

To implement reconstruction-based approach, we used parts of the implementation of [2] distributed under MIT license. Implementing InfoNCE loss, we used the code from [32], which is also distributed under MIT license.

Acknowledgements

This material is based upon work supported by the National Science Foundation under NSF Award 1922658.

Table (appendix A.6.2): Results for the 3-dots dataset. All numbers denote RMSE across 17 steps. We run 3 seeds for each experiment to obtain standard deviations.

Method          Average        Action         Random         Stationary
VICReg          0.229 ± 0.031  0.277 ± 0.041  0.273 ± 0.044  0.066 ± 0.026
SimCLR          0.158 ± 0.001  0.193 ± 0.001  0.193 ± 0.002  0.025 ± 0.001
IDM             0.234 ± 0.001  0.035 ± 0.000  0.298 ± 0.002  0.272 ± 0.000
Supervised      0.104 ± 0.000  0.010 ± 0.001  0.180 ± 0.001  0.005 ± 0.000
Reconstruction  0.107 ± 0.000  0.021 ± 0.001  0.182 ± 0.000  0.026 ± 0.002
Random          0.260 ± 0.001  0.235 ± 0.002  0.278 ± 0.002  0.265 ± 0.004
Center          0.299 ± 0.000  0.304 ± 0.000  0.304 ± 0.000  0.289 ± 0.000

Figure 1(a): VICReg-based architecture.

Figure 2: (a) Different noise levels. (b) Changing and fixed structured noise ($\alpha = 1$).

Figure 3: Performance of the compared methods with different types and levels of noise. We tune hyperparameters individually for each model, noise level, and type. We show results without tuning in figure 5 in the appendix. The dots represent the mean RMSE across 17 time steps. The shaded area represents the standard deviation calculated over 3 random seeds for each experiment.

Figure 4: Diagram of the data generation process. The initial position $c_1$ and action sequence $a_1 \dots a_T$ are generated first. Then the actions are sequentially added to the initial location, and the resulting coordinates $c_1 \dots c_{T+1}$ are rendered.

Figure 6: Example sequence from the 3-dots dataset. Red, the first channel, corresponds to the action-controlled dot. Green corresponds to the randomly moving dot with unknown actions. Blue corresponds to the stationary dot.


$$\begin{aligned}
&\mathcal{L} = \mathcal{L}_{\mathrm{prediction}} + \mathcal{L}_{\mathrm{IDM}} \\
&\mathcal{L}_{\mathrm{prediction}} = \frac{1}{TN}\sum_{t=1}^{T}\sum_{i=1}^{N}\Vert f_\theta(S_{t,i}, A_{t,i}) - g_\phi(O_{t+1,i})\Vert_2^2 \\
&\mathcal{L}_{\mathrm{IDM}} = \frac{1}{TN}\sum_{t=1}^{T}\sum_{i=1}^{N}\Vert h_\xi(S_{t,i}, S_{t+1,i}) - A_{t,i}\Vert_2^2
\end{aligned}$$




$$\mathcal{L}_{\mathrm{prediction}} = \frac{1}{TN}\sum_{t=1}^{T}\sum_{i=1}^{N}\Vert f_\theta(S_{t,i}, A_{t,i}) - g_\phi(O_{t+1,i})\Vert_2^2 = \frac{1}{TN}\sum_{t=1}^{T}\sum_{i=1}^{N}\Vert S_{t,i} - S_{t+1,i}\Vert_2^2 = 0,$$

since $S_{t,i} = S_{t+1,i}$ for the time-constant representations.

$$\mathrm{Var}(s) = \frac{1}{N-1}\sum_{i=1}^{N}(s_i - \bar{s})^2 = \sigma^2 \quad \text{since } s \sim \mathcal{N}(0, \sigma^2 I)$$

$$\mathcal{L}_{\mathrm{variance}} = \frac{1}{(T+1)D}\sum_{t=1}^{T+1}\sum_{j=1}^{D}\max\left(0, \gamma - \sqrt{\mathrm{Var}(S_{t,:,j}) + \epsilon}\right) = 0 \tag{3}$$

$$\mathcal{L}_{\mathrm{covariance}} = \frac{1}{(T+1)(N-1)}\sum_{t=1}^{T+1}\sum_{i=1}^{D}\sum_{j=i+1}^{D}(S_t^\top S_t)_{i,j}^2 = 0 \tag{4}$$

$$\begin{aligned}
&\mathcal{L}_{\mathrm{VICReg}} = \alpha \mathcal{L}_{\mathrm{prediction}} + \beta \mathcal{L}_{\mathrm{variance}} + \mathcal{L}_{\mathrm{covariance}} \\
&\mathcal{L}_{\mathrm{prediction}} = \frac{1}{NT}\sum_{t=1}^{T}\sum_{i=1}^{N}\Vert f_\theta(S_{t,i}, A_{t,i}) - g_\phi(O_{t+1,i})\Vert_2^2 = \frac{1}{NT}\sum_{t=2}^{T+1}\sum_{i=1}^{N}\Vert \tilde{S}_{t,i} - S_{t,i}\Vert_2^2 \\
&\mathrm{Var}(v) = \frac{1}{N-1}\sum_{i=1}^{N}(v_i - \bar{v})^2 \\
&\mathcal{L}_{\mathrm{variance}} = \frac{1}{T+1}\sum_{t=1}^{T+1}\frac{1}{D}\sum_{j=1}^{D}\max\left(0, \gamma - \sqrt{\mathrm{Var}(S_{t,:,j}) + \epsilon}\right) \\
&\mathcal{L}_{\mathrm{covariance}} = \frac{1}{T+1}\sum_{t=1}^{T+1}\frac{1}{N-1}\sum_{i=1}^{D}\sum_{j=i+1}^{D}(S_t^\top S_t)_{i,j}^2
\end{aligned}$$
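These loss terms can be transcribed directly; a NumPy sketch (with squared off-diagonal covariance entries, as in the original VICReg, and mean-centered representations when computing the covariance):

```python
import numpy as np

def vicreg_losses(S, S_pred, gamma=1.0, eps=1e-4):
    """VICReg terms for encoder outputs S and forward-model outputs
    S_pred, both of shape (T+1, N, D)."""
    Tp1, N, D = S.shape
    # Prediction loss: match predictions to encodings for t = 2..T+1.
    pred = np.mean(np.sum((S_pred[1:] - S[1:]) ** 2, axis=-1))
    # Variance hinge loss, per time step and dimension.
    std = np.sqrt(S.var(axis=1, ddof=1) + eps)            # (T+1, D)
    var = np.maximum(0.0, gamma - std).mean()
    # Covariance loss: squared off-diagonal covariance entries per step.
    Sc = S - S.mean(axis=1, keepdims=True)
    cov = np.mean([
        (np.triu(Sc[t].T @ Sc[t] / (N - 1), k=1) ** 2).sum()
        for t in range(Tp1)
    ])
    return pred, var, cov
```

For a representation that copies per-episode fixed noise (the trivial solution from section 2), the prediction and variance terms vanish and the covariance term is near zero.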

$$\begin{aligned}
&\mathrm{expsim}(u, v) = \exp\left(\frac{u^\top v}{\tau \Vert u \Vert \Vert v \Vert}\right) \\
&\mathrm{InfoNCE}(S_t, \tilde{S}_t) = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\mathrm{expsim}(S_{t,i}, \tilde{S}_{t,i})}{\sum_{k=1}^{N}\left[\mathrm{expsim}(S_{t,i}, \tilde{S}_{t,k}) + \mathds{1}_{k \neq i}\,\mathrm{expsim}(S_{t,i}, S_{t,k})\right]} \\
&\mathcal{L}_{\mathrm{SimCLR}} = \frac{1}{2T}\sum_{t=2}^{T+1}\left[\mathrm{InfoNCE}(S_t, \tilde{S}_t) + \mathrm{InfoNCE}(\tilde{S}_t, S_t)\right]
\end{aligned}$$
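The per-time-step InfoNCE term can be sketched in NumPy as follows (a simplified transcription, not the released code; the candidate set is all predictions plus the other encoder outputs, as in the equation):

```python
import numpy as np

def info_nce(S, S_pred, tau=0.5):
    """InfoNCE for one time step: the i-th encoder output S[i] and
    prediction S_pred[i] form a positive pair; all predictions plus the
    other encoder outputs form the denominator."""
    def unit(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    A, B = unit(S), unit(S_pred)
    sim_pred = A @ B.T / tau            # cosine similarities to predictions
    sim_enc = A @ A.T / tau             # similarities to other encodings
    np.fill_diagonal(sim_enc, -np.inf)  # drop the k = i encoder term
    denom = np.exp(sim_pred).sum(axis=1) + np.exp(sim_enc).sum(axis=1)
    return float(np.mean(np.log(denom) - np.diag(sim_pred)))
```

As expected, the loss is lower when predictions are aligned with the matching encodings than when they are shuffled across the batch.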

$$\mathcal{L}_{\mathrm{reconstruction}} = \frac{1}{T}\sum_{t=2}^{T+1}\Vert d_\xi(\tilde{s}_t) - o_t \Vert_2^2$$




References

[DBLP:journals/corr/abs-1803-10122] D. Ha and J. Schmidhuber. (2018). World Models. CoRR, abs/1803.10122.

[lecun2022path] Y. LeCun. (2022). A Path Towards Autonomous Machine Intelligence, Version 0.9.2, 2022-06-27.

[simclr] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. (2020). A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning.

[https://doi.org/10.48550/arxiv.1807.03748] A. van den Oord, Y. Li, and O. Vinyals. (2018). Representation Learning with Contrastive Predictive Coding. doi:10.48550/ARXIV.1807.03748.

[dreamer] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. (2019). Dream to Control: Learning Behaviors by Latent Imagination. arXiv:1912.01603.

[cifar10] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research).

[RSSM_code] Kai Arulkumaran. (2021). PlaNet. GitHub repository.

[SimCLR_code] Thalles Silva. (2021). PyTorch SimCLR: A Simple Framework for Contrastive Learning of Visual Representations. GitHub repository.

[bisimulation] A. Zhang, R. McAllister, R. Calandra, Y. Gal, and S. Levine. (2021). Learning Invariant Representations for Reinforcement Learning without Reconstruction. In International Conference on Learning Representations.

[bib1] A. Anand, E. Racah, S. Ozair, Y. Bengio, M.-A. Côté, and R. D. Hjelm. Unsupervised state representation learning in Atari. arXiv:1906.08226 [cs, stat], Nov 2020. URL http://arxiv.org/abs/1906.08226.

[bib2] K. Arulkumaran. PlaNet. https://github.com/Kaixhin/PlaNet/, 2021.

[bib3] A. Bardes, J. Ponce, and Y. LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv:2105.04906 [cs], May 2021. URL http://arxiv.org/abs/2105.04906.

[bib4] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv:2006.09882 [cs], Jan 2021. URL http://arxiv.org/abs/2006.09882.

[bib5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

[bib6] T. Chen, C. Luo, and L. Li. Intriguing properties of contrastive losses. arXiv:2011.02803 [cs, stat], Oct 2021. doi: 10.48550/arXiv.2011.02803. URL http://arxiv.org/abs/2011.02803.

[bib7] X. Chen and K. He. Exploring simple siamese representation learning. arXiv:2011.10566 [cs], Nov 2020. URL http://arxiv.org/abs/2011.10566.

[bib8] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv:1409.1259 [cs, stat], Oct 2014. URL http://arxiv.org/abs/1409.1259.

[bib9] D. Gordon, K. Ehsani, D. Fox, and A. Farhadi. Watching the world go by: Representation learning from unlabeled videos. arXiv:2003.07990 [cs], May 2020. URL http://arxiv.org/abs/2003.07990.

[bib10] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko. Bootstrap your own latent: A new approach to self-supervised learning. arXiv:2006.07733 [cs, stat], Sep 2020. URL http://arxiv.org/abs/2006.07733.

[bib11] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv:1912.01603, Dec 2019. URL http://arxiv.org/abs/1912.01603.

[bib12] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson. Learning latent dynamics for planning from pixels. arXiv:1811.04551 [cs, stat], Jun 2019. URL http://arxiv.org/abs/1811.04551.

[bib13] D. Hafner, T. Lillicrap, M. Norouzi, and J. Ba. Mastering Atari with discrete world models. 2020.

[bib14] Han et al. [2019] T. Han, W. Xie, and A. Zisserman. Video representation learning by dense predictive coding. (arXiv:1909.04656), Sep 2019. URL http://arxiv.org/abs/1909.04656. arXiv:1909.04656 [cs].

[bib15] He et al. [2020] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. arXiv:1911.05722 [cs], Mar 2020. URL http://arxiv.org/abs/1911.05722. arXiv: 1911.05722.

[bib16] R. D. Hjelm and P. Bachman. Representation learning with video deep infomax. (arXiv:2007.13278), Jul 2020. URL http://arxiv.org/abs/2007.13278. arXiv:2007.13278 [cs].

[bib17] Hjelm et al. [2019] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio. Learning deep representations by mutual information estimation and maximization. (arXiv:1808.06670), Feb 2019. URL http://arxiv.org/abs/1808.06670. arXiv:1808.06670 [cs, stat].

[bib18] Jing et al. [2021] L. Jing, P. Vincent, Y. LeCun, and Y. Tian. Understanding dimensional collapse in contrastive self-supervised learning. arXiv:2110.09348 [cs], Oct 2021. URL http://arxiv.org/abs/2110.09348. arXiv: 2110.09348.

[bib19] A. Krizhevsky, V. Nair, and G. Hinton. Cifar-10 (canadian institute for advanced research). URL http://www.cs.toronto.edu/~kriz/cifar.html.

[bib20] Y. LeCun. A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. 2022.

[bib21] LORRE et al. [2020] G. LORRE, J. RABARISOA, A. ORCESI, S. AINOUZ, and S. CANU. Temporal contrastive pretraining for video action recognition. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), page 651–659, Mar 2020. doi: 10.1109/WACV45572.2020.9093278.

[bib22] Lu et al. [2022] C. Lu, P. J. Ball, T. G. J. Rudner, J. Parker-Holder, M. A. Osborne, and Y. W. Teh. Challenges and opportunities in offline reinforcement learning from visual observations. (arXiv:2206.04779), Jun 2022. URL http://arxiv.org/abs/2206.04779. arXiv:2206.04779 [cs, stat].

[bib23] Oh et al. [2015] J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh. Action-conditional video prediction using deep networks in atari games. arXiv:1507.08750 [cs], Dec 2015. URL http://arxiv.org/abs/1507.08750. arXiv: 1507.08750.

[bib24] M. Okada and T. Taniguchi. Dreaming: Model-based Reinforcement Learning by Latent Imagination without Reconstruction.

[bib25] Oord et al. [2019] A. v. d. Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding. (arXiv:1807.03748), Jan 2019. URL http://arxiv.org/abs/1807.03748. arXiv:1807.03748 [cs, stat].

[bib26] Pan et al. [2021] T. Pan, Y. Song, T. Yang, W. Jiang, and W. Liu. Videomoco: Contrastive video representation learning with temporally adversarial examples. (arXiv:2103.05905), Mar 2021. URL http://arxiv.org/abs/2103.05905. arXiv:2103.05905 [cs].

[bib27] Qian et al. [2021] R. Qian, T. Meng, B. Gong, M.-H. Yang, H. Wang, S. Belongie, and Y. Cui. Spatiotemporal contrastive video representation learning. (arXiv:2008.03800), Apr 2021. URL http://arxiv.org/abs/2008.03800. arXiv:2008.03800 [cs].

[bib28] Robinson et al. [2021] J. Robinson, L. Sun, K. Yu, K. Batmanghelich, S. Jegelka, and S. Sra. Can contrastive learning avoid shortcut solutions? (arXiv:2106.11230), Dec 2021. URL http://arxiv.org/abs/2106.11230. arXiv:2106.11230 [cs].

[bib29] Schwarzer et al. [2021a] M. Schwarzer, A. Anand, R. Goel, R. D. Hjelm, A. Courville, and P. Bachman. Data-efficient reinforcement learning with self-predictive representations. arXiv:2007.05929 [cs, stat], May 2021a. URL http://arxiv.org/abs/2007.05929. arXiv: 2007.05929.

[bib30] Schwarzer et al. [2021b] M. Schwarzer, N. Rajkumar, M. Noukhovitch, A. Anand, L. Charlin, D. Hjelm, P. Bachman, and A. Courville. Pretraining representations for data-efficient reinforcement learning. arXiv:2106.04799 [cs], Jun 2021b. URL http://arxiv.org/abs/2106.04799. arXiv: 2106.04799.

[bib31] Sermanet et al. [2018] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time-contrastive networks: Self-supervised learning from video. (arXiv:1704.06888), Mar 2018. URL http://arxiv.org/abs/1704.06888. arXiv:1704.06888 [cs].

[bib32] T. Silva. Pytorch simclr: A simple framework for contrastive learning of visual representations. https://github.com/sthalles/SimCLR, 2021.

[bib33] Srinivas et al. [2020] A. Srinivas, M. Laskin, and P. Abbeel. Curl: Contrastive unsupervised representations for reinforcement learning. arXiv:2004.04136 [cs, stat], Sep 2020. URL http://arxiv.org/abs/2004.04136. arXiv: 2004.04136.

[bib34] Stooke et al. [2021] A. Stooke, K. Lee, P. Abbeel, and M. Laskin. Decoupling representation learning from reinforcement learning. arXiv:2009.08319 [cs, stat], May 2021. URL http://arxiv.org/abs/2009.08319. arXiv: 2009.08319.

[bib35] Tian et al. [2020] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola. What makes for good views for contrastive learning? (arXiv:2005.10243), Dec 2020. URL http://arxiv.org/abs/2005.10243. arXiv:2005.10243 [cs].

[bib36] T. Wang and P. Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. (arXiv:2005.10242), Aug 2022. URL http://arxiv.org/abs/2005.10242. arXiv:2005.10242 [cs, stat].

[bib37] L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, Apr 2002. ISSN 0899-7667, 1530-888X. doi: 10.1162/089976602317318938.

[bib38] Zbontar et al. [2021] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny. Barlow twins: Self-supervised learning via redundancy reduction. (arXiv:2103.03230), Jun 2021. URL http://arxiv.org/abs/2103.03230. arXiv:2103.03230 [cs, q-bio].

[bib39] Zhang et al. [2021] A. Zhang, R. McAllister, R. Calandra, Y. Gal, and S. Levine. Learning invariant representations for reinforcement learning without reconstruction. (arXiv:2006.10742), Apr 2021. URL http://arxiv.org/abs/2006.10742. arXiv:2006.10742 [cs, stat].

[bib40] Zhang et al. [2022] W. Zhang, A. GX-Chen, V. Sobal, Y. LeCun, and N. Carion. Light-weight probing of unsupervised representations for reinforcement learning. (arXiv:2208.12345), Aug 2022. URL http://arxiv.org/abs/2208.12345. arXiv:2208.12345 [cs].