Light-weight probing of unsupervised representations for Reinforcement Learning

% Wancong Zhang$^{1,2}$ \quad Anthony GX-Chen$^{2}$ \quad Vlad Sobal$^2$ \quad Yann LeCun$^{2,3}$ \quad Nicolas Carion$^{2,3}$, $^1$AssemblyAI \quad $^2$New York University \quad $^3$Facebook AI Research

Abstract

Unsupervised visual representation learning offers the opportunity to leverage large corpora of unlabeled trajectories to form useful visual representations, which can benefit the training of reinforcement learning (RL) algorithms. However, evaluating the fitness of such representations requires training RL algorithms, which is computationally intensive and yields high-variance outcomes. Inspired by the vision community, we study whether linear probing can be a proxy evaluation task for the quality of unsupervised RL representations. Specifically, we probe for the observed reward in a given state and the action of an expert in a given state, both of which are generally applicable to many RL domains. Through rigorous experimentation, we show that the probing tasks are strongly rank correlated with the downstream RL performance on the Atari100k Benchmark, while having lower variance and up to 600x lower computational cost. This provides a more efficient method for exploring the space of pretraining algorithms and identifying promising pretraining recipes without the need to run RL evaluations for every setting. Leveraging this framework, we further improve existing self-supervised learning (SSL) recipes for RL, highlighting the importance of the forward model, the size of the visual backbone, and the precise formulation of the unsupervised objective.

Introduction

Learning visual representations is a critical step towards solving many kinds of tasks, from supervised tasks such as image classification or object detection, to reinforcement learning. Ever since the early successes of deep reinforcement learning (Mnih et al., 2015), neural networks have been widely adopted to solve pixel-based reinforcement learning tasks such as arcade games (Bellemare et al., 2013), physical continuous control (Todorov et al., 2012; Tassa et al., 2018), and complex video games (Synnaeve et al., 2018; Oh et al., 2016). However, learning deep representations directly from interactions is a challenging endeavor. This is primarily due to the nature of rewards, which, despite being a critical source of supervision, tend to be noisy, sparse, and delayed.

With ongoing progress in unsupervised visual representation learning for vision tasks (Zbontar et al., 2021; Chen et al., 2020a;b; Grill et al., 2020; Caron et al., 2020; 2021; Assran et al., 2023; Oquab et al., 2023), there have been recent efforts to apply self-supervised techniques and ideas to improve representation learning for reinforcement learning applications. Some promising approaches in this direction either supplement the RL loss with self-supervised objectives (Laskin et al., 2020; Schwarzer et al., 2021a; D'Oro et al., 2022; Schwarzer et al., 2023), or first pre-train the representations on a corpus of trajectories (Schwarzer et al., 2021b; Stooke et al., 2021). However, the diversity in the settings considered, as well as in the self-supervised methods used, makes it difficult to identify the core principles of what makes a self-supervised method successful for RL. Moreover, estimating the performance of RL algorithms is notoriously challenging (Henderson et al., 2018; Agarwal et al., 2021): it often requires repeating the same experiment with different random seeds, and the high CPU-to-GPU ratio of the computational requirements of most online RL methods makes them inefficient to run on typical HPC clusters. This tends to prevent systematic exploration of the many design choices that characterize SSL methods.

Inspired by the vision community, we investigate whether linear probing (training a linear prediction head on top of frozen features) can serve as a proxy evaluation task for the quality of unsupervised visual representations in RL. In particular, we focus on two probing tasks that we deem widely applicable to RL: the first one consists of predicting the reward in a given state; the second one consists of predicting the action that would be taken by a fixed policy in a given state, for example that of an expert. We probe for reward because it is closely related to the value function (expected cumulative reward), which assesses the quality of a policy, while expert actions are the desired output of a good policy. We hypothesize that a representation which can be easily (i.e. linearly) transformed into the reward and expert action is a good representation for RL training. Nonetheless, we stress that these probing tasks are only used as a means of evaluation that requires very little supervised data, making them suitable for situations where obtaining expert trajectories or reward labels is expensive. Through thorough experimentation, we show that the performance of SSL algorithms in terms of their downstream RL outcomes is rank correlated with their performance on both of these probing tasks. This is particularly true for reward probing, for which we obtain a statistically significant Spearman's rank correlation coefficient of $r > 0.9$ ($p < 0.001$), suggesting its utility as an effective proxy for RL performance. Given the vastly reduced computational burden of such linear evaluations, we argue that they make experimentation with SSL design choices much easier and more straightforward, paving the way for a more systematic exploration of the design space.

Finally, we leverage this framework to make systematic assessments of some of the key attributes of SSL methods. We focus on a class of SSL algorithms with latent dynamics modelling as it has been the common choice behind a series of highly performant models in Atari100k (Schwarzer et al., 2021a;b; Tomar et al., 2021; Schwarzer et al., 2023; Ni et al., 2024). First off, we explore the utility and role of learning a forward model as part of the self-supervised objective. We investigate whether its expressiveness matters and in particular show that equipping it with the ability to model uncertainty through a random latent variable significantly improves the quality of the representations. Next, we identify several knobs in the self-supervised objective, allowing us to carefully tune the parameters in a principled way. Finally, we confirm the previous finding (Schwarzer et al., 2021b) that bigger architectures tend to perform better, when adequately pre-trained.

Our contributions can be summarized as follows:

• Design an efficient protocol that estimates the quality of unsupervised visual representations for RL by linearly probing for rewards and actions.
• Demonstrate significant rank correlation between probing tasks and downstream RL performance.
• Systematically explore design choices in existing SSL methods.

There has recently been a surge in interest and advances in the domain of self-supervised learning in computer vision. Some state-of-the-art techniques include contrastive learning methods SimCLR, MoCov2 (Chen et al., 2020a;b); clustering methods SwAV (Caron et al., 2020); distillation methods BYOL, SimSiam, OBoW, DINOv2, I-JEPA (Grill et al., 2020; Chen and He, 2021; Gidaris et al., 2020; Oquab et al., 2023; Assran et al., 2023); and information maximization methods Barlow Twins and VICReg (Zbontar et al., 2021; Bardes et al., 2022).

Simultaneously, significant progress has been made in representation learning for reinforcement learning. One line of work applies unsupervised losses as an auxiliary objective during RL training to improve data efficiency (Laskin et al., 2020; Zhu et al., 2020; Schwarzer et al., 2021a; Yu et al., 2022; Banino et al., 2022). Another line of work pretrains on offline data prior to online RL or imitation learning (Aytar et al., 2018; Pari et al., 2021; Stooke et al., 2021; Nair et al., 2022; Seo et al., 2022; Ma et al., 2022; Ghosh et al., 2023). In particular, SGI (Schwarzer et al., 2021b) is most similar to our setup in that it pretrains both an encoder and forward model on demonstrations while the encoder is recycled during RL for improved data efficiency; this model category involving latent dynamics has been applied in various state-of-the-art models within the Atari100k benchmark (Tomar et al., 2021; Schwarzer et al., 2023; Ni et al., 2024).

While different in spirit, many model-based methods also train an encoder and a dynamics model from a corpus of trajectories, either by explicit pixel reconstruction (Kaiser et al., 2020; Hafner et al., 2021; Micheli et al., 2022; Robine et al., 2023) or in embedding space (Ye et al., 2021; Schrittwieser et al., 2020; Hansen et al., 2022; Ye et al., 2023). Self-supervised representations have also been used for exploration (Burda et al., 2019a; Sekar et al., 2020; Yarats et al., 2021a; Du et al., 2021).

Some prior works (Racah and Pal, 2019; Guo et al., 2018; Anand et al., 2019) evaluate the quality of their pretrained representations by probing for ground truth state variables such as agent/object locations and game scores. Das et al. (2020) propose to probe representations with natural language question-answering. While these methods are efficient, they tend to be domain-specific and require meticulous crafting for each environment. Moreover, these approaches have not consistently demonstrated a correlation between the outcomes of probing and downstream RL performance, which complicates the use of these results to reliably inform model design.

On the other hand, the authors of ATC (Stooke et al., 2021) propose to evaluate representations by finetuning for RL tasks using the pretrained encoder with weights frozen. Similarly, Laskin et al. (2021) propose a unified benchmark for SSL methods in continuous control but still require full RL training. A part of our work aims to bridge these two approaches by making explicit the correlation between linear probing and RL performances, as well as designing probing tasks that are invariant across environments.

In recent developments, Garrido et al. (2022) introduced the use of effective feature ranking to predict downstream performance in image encoders trained via self-supervised learning (SSL). This methodology was further applied to reinforcement learning (RL) by Lee et al. (2023). Notably, Lee et al. (2023) compared this feature rank approach with the methods we outlined in an earlier version of our work (Zhang et al., 2022). They concluded that although neither method perfectly correlates with downstream RL performance, our approach - focusing on reward and action probing - provides a more accurate prediction of RL outcomes.

A framework for developing unsupervised representations for RL

In this section, we detail our proposed framework for training and evaluating unsupervised representations for reinforcement learning.

Unsupervised pre-training

The network is first pre-trained on a large corpus of trajectories. Formally, we define a trajectory $\mathcal{T}_i$ of length $T_i$ as a sequence of tuples $\mathcal{T}_i = [(o_t, a_t) \mid t \in [1, T_i]]$, where $o_t$ is the observation of the state at time $t$ in the environment and $a_t$ is the action taken in this state. This setting is closely related to Batch RL (Lange et al., 2012), with the crucial difference that the reward is not observed. In particular, it should be possible to use the learned representations to maximize any reward (Touati and Ollivier, 2021). The training corpus corresponds to a set of such trajectories: $\mathcal{D}_{\text{unsup}} = \{\mathcal{T}_1, \ldots, \mathcal{T}_n\}$. We note that the policy used to generate this data is left unspecified in this formulation, and is bound to be environment-specific. Since unsupervised methods usually necessitate a lot of data, this pre-training corpus is required to be substantial. In some domains, it might be straightforward to collect a large number of random trajectories to constitute $\mathcal{D}_{\text{unsup}}$. In other cases, like self-driving, where generating random trajectories is undesirable, expert trajectories from humans can be used instead.

The goal of the pre-training step is to learn the parameters $\theta$ of an encoder $\text{Enc}_\theta$ which maps any observation $o$ of the state $s$ (for example raw pixels) to a representation $e = \text{Enc}_\theta(o)$. This representation must be amenable to the downstream control task, for example learning a policy.

In general, the evaluation of RL algorithms is tricky due to the high variance in performance (Henderson et al., 2018). This requires evaluating many random seeds, which creates a computational burden. We side-step this issue by formulating an evaluation protocol which is light-weight and purely supervised. Specifically, we identify two proxy supervised tasks that are broadly applicable and relevant for control. We further show in the experiment section that they are sound, in the sense that models' performance on the proxy tasks strongly correlates with their performance in the downstream control task of interest. Similar to the evaluation protocol typically used for computer vision models, we rely on linear probing, meaning that we train only a linear layer on top of the representations, which are kept frozen.

Reward Probing Our first task consists in predicting the reward observed in a given state. For this task, we require a corpus of trajectories $\mathcal{D}_{\text{rew}} = \{\mathcal{T}'_1, \ldots, \mathcal{T}'_m\}$ for which the observed rewards are known, i.e. $\mathcal{T}'_i = [(o_t, a_t, r_t) \mid t \in [1, T_i]]$.

In the most general setting, it can be formulated as a regression problem, where the goal is to minimize the following loss:
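Assuming a squared-error criterion (an assumption: the description only fixes the regression setting, the frozen encoder $\text{Enc}_\theta$, and the linear layer $l_\psi$), this objective can be sketched as:

$$ \min_{\psi} \sum_{\mathcal{T}'_i \in \mathcal{D}_{\text{rew}}} \; \sum_{(o_t, a_t, r_t) \in \mathcal{T}'_i} \left( l_\psi\big(\text{Enc}_\theta(o_t)\big) - r_t \right)^2 $$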

Here, the only learnt parameters $\psi$ are those of the linear prediction layer $l_\psi$.

In practice, in many environments where rewards are sparse, the presence or absence of a reward is more important than its magnitude. To simplify the problem in those cases, we can cast it as a binary prediction problem instead (this could be extended to ternary classification if the sign of the reward is of interest):
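Reading "presence or absence" as the indicator of a non-zero reward, and assuming a standard binary cross-entropy (BCE) with logistic sigmoid $\sigma$ (both are assumptions made for illustration), a sketch of this objective is:

$$ \min_{\psi} \sum_{\mathcal{T}'_i \in \mathcal{D}_{\text{rew}}} \; \sum_{(o_t, a_t, r_t) \in \mathcal{T}'_i} \text{BCE}\Big( \sigma\big(l_\psi(\text{Enc}_\theta(o_t))\big), \; \mathbb{1}[r_t \neq 0] \Big) $$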

Reward prediction is closely related to value prediction, a central objective in RL that is essential for value-based control and the critic in actor-critic methods. The ability to predict instantaneous reward, akin to predicting value with a very small discount factor, can be viewed as a lower bound on the learned representation's ability to encode the value function, and has been demonstrably helpful for control, particularly in sparse reward tasks (Jaderberg et al., 2017). Thus, we hypothesize reward prediction accuracy to be a good probing proxy task for our setting as well.

Action prediction Our second task consists in predicting the action taken by an expert in a given state. For this task, we require a corpus of trajectories $\mathcal{D}_{\text{exp}} = \{\mathcal{T}_1, \ldots, \mathcal{T}_n\}$ generated by an expert policy. We stress that this dataset may be much smaller than the pretraining corpus since we only need to fit and evaluate a linear model. The corresponding objective is as follows:
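In its simplest form, assuming a plain cross-entropy (CE) over the discrete action set (an assumption; a reweighted variant such as a focal loss fits the same template), the action-probing objective can be sketched as:

$$ \min_{\psi} \sum_{\mathcal{T}_i \in \mathcal{D}_{\text{exp}}} \; \sum_{(o_t, a_t) \in \mathcal{T}_i} \text{CE}\Big( \text{softmax}\big(l_\psi(\text{Enc}_\theta(o_t))\big), \; a_t \Big) $$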

This task is closely related to imitation learning; however, we are not concerned with the performance of the policy that we learn as a by-product.

In our work, we focus on evaluating and improving a particular class of unsupervised pretraining algorithms that involves using a transition model to predict its own representations in the future (Schwarzer et al., 2021b; Guo et al., 2018; Gelada et al., 2019). This pretraining modality is especially well suited for RL, since the transition model can be conditioned on agent actions, and can be repurposed for model-based RL after pretraining. Our framework is depicted in Fig.2. In this section, we present the main design choices, and we investigate their performance in Section 5.

Transition models

Our baseline transition model is a 2D convolutional network applied directly to the spatial output of the convolutional encoder (Schwarzer et al., 2021b; Schrittwieser et al., 2020). The network consists of two 64-channel convolutional layers with 3x3 filters. The action is represented as a 2D one-hot vector and appended to the input to the first convolutional layer.
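As a rough illustration of how an action can be appended to a spatial feature map, the sketch below builds one constant plane per action and concatenates the planes along the channel axis. Reading "2D one-hot vector" as one constant plane per action is an assumption, and the helper names and sizes (64 x 7 x 7 features, 18 actions) are hypothetical.

```python
def action_planes(action, num_actions, height, width):
    """One constant plane per action: ones for the chosen action, zeros otherwise."""
    return [[[1.0 if a == action else 0.0 for _ in range(width)]
             for _ in range(height)]
            for a in range(num_actions)]


def append_action(features, action, num_actions):
    """Concatenate action planes to a C x H x W feature map along the channel axis."""
    height, width = len(features[0]), len(features[0][0])
    return features + action_planes(action, num_actions, height, width)


# Hypothetical sizes: a 64 x 7 x 7 feature map and 18 discrete actions.
features = [[[0.0] * 7 for _ in range(7)] for _ in range(64)]
conv_input = append_action(features, action=3, num_actions=18)
print(len(conv_input), len(conv_input[0]), len(conv_input[0][0]))  # 82 7 7
```

The first convolution of the transition model would then take 64 + 18 input channels under these assumed sizes.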

We believe a well-established sequence modeling architecture such as GRU can serve as a superior transition model. Its gating mechanisms should be better at retaining information from both the immediate and distant past, especially helpful for learning dynamics in a partially observable environment.

$$ \text{Encoder}: \quad \hat{e}_0 = e_0 = \text{Enc}_\theta(o_0) $$

$$ \text{Recurrent model}: \quad \hat{e}_t = f_\phi(\hat{e}_{t-1}, a_{t-1}) $$

In addition to the deterministic GRU model above, we also experiment with a GRU variant where we introduce stochastic states to allow our model to generalize better to stochastic environments, such as Atari with sticky actions (Machado et al., 2018). Our model is based on the RSSM from DreamerV2 (Hafner et al., 2021), with the main difference being that while pixel reconstruction is used as the SSL objective in the original work, we minimize the distance between predictions and targets purely in the latent space. Following DreamerV2, we optimize the latent variables using straight-through gradients (Bengio et al., 2013), and minimize the distance between the posterior ($z$) and prior ($\hat{z}$) distributions using a KL loss.

Figure 2: Model diagram. The observations consist of a stack of 4 frames, to which we apply data augmentation before passing them to a convolutional encoder. The predictor is a recurrent model outputting future state embeddings given the action. We supervise with an inverse modeling loss (cross entropy loss on the predicted transition action) and an SSL loss (distance between embeddings).

$$ h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}) $$

The objective of self-predictive representation learning is to minimize the distance between the predicted and the target representations, while ensuring that they do not collapse to a trivial solution. Our baseline prediction objective is BYOL (Grill et al., 2020), which is also used in SGI (Schwarzer et al., 2021b). The predicted representation $\hat{e}_{t+k}$ and the encoded target representation $\tilde{e}_{t+k}$ are first projected to lower dimensions to produce $\hat{y}_{t+k}$ and $\tilde{y}_{t+k}$. BYOL then maximizes the cosine similarity between the predicted and target projections, using a linear function $q$ to translate from $\hat{y}$ to $\tilde{y}$:
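Written out as a negative cosine similarity, and assuming the loss is summed over $K$ prediction steps (the horizon $K$ and the summation are assumptions), this corresponds to:

$$ \mathcal{L}_{\text{BYOL}} = - \sum_{k=1}^{K} \frac{q(\hat{y}_{t+k}) \cdot \tilde{y}_{t+k}}{\lVert q(\hat{y}_{t+k}) \rVert_2 \, \lVert \tilde{y}_{t+k} \rVert_2} $$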

In the case of BYOL, the target encoder and projection module are an exponential moving average of the online weights, and the gradients are blocked on the target branch.

As an alternative prediction objective, we experiment with Barlow Twins (Zbontar et al., 2021). Similar to BYOL, Barlow Twins minimizes the distance of the latent representations between the online and target branches; however, instead of using a predictor module and stop gradient on the target branch, Barlow Twins avoids collapse by pushing the cross-correlation matrix between the projection outputs on the two branches to be as close to the identity matrix as possible. To adapt Barlow Twins, we calculate the cross correlation across batch and time dimensions:

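$$ \mathcal{L}_{BT} = \sum_{i} (1 - C_{ii})^2 + \lambda \sum_{i} \sum_{j \neq i} C_{ij}^2, \qquad C_{ij} = \frac{\sum_{b,t} \hat{y}_{b,t,i} \, \tilde{y}_{b,t,j}}{\sqrt{\sum_{b,t} (\hat{y}_{b,t,i})^2} \, \sqrt{\sum_{b,t} (\tilde{y}_{b,t,j})^2}} $$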

where λ is a positive constant trading off the importance of the invariance and covariance terms of the loss, C is the cross-correlation matrix computed between the projection outputs of two branches along the batch and time dimensions, b indexes batch samples, t indexes time, and i, j index the vector dimension of the projection output.
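The cross-correlation over batch and time described above can be illustrated with a small pure-Python sketch. It assumes each feature dimension is standardized over the merged batch-time axis, as in the original Barlow Twins recipe; the function names and the default value of `lam` are hypothetical, not taken from this work.

```python
import math


def standardize(rows):
    """Center and scale each feature dimension over all rows (batch x time samples)."""
    n, d = len(rows), len(rows[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        mean = sum(r[j] for r in rows) / n
        std = math.sqrt(sum((r[j] - mean) ** 2 for r in rows) / n) or 1.0
        for i in range(n):
            out[i][j] = (rows[i][j] - mean) / std
    return out


def barlow_loss(pred, target, lam=0.005):
    """pred, target: nested lists of shape [batch][time][dim]."""
    # Merge the batch and time axes so that every (b, t) pair is one sample.
    p = standardize([v for seq in pred for v in seq])
    q = standardize([v for seq in target for v in seq])
    n, d = len(p), len(p[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(p[s][i] * q[s][j] for s in range(n)) / n
            # Diagonal entries are pulled towards 1, off-diagonal entries towards 0.
            loss += (1.0 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss
```

When both branches produce identical, decorrelated features, the cross-correlation matrix is the identity and the loss vanishes.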

By enabling gradients on both the prediction and target branches, the Barlow objective pushes the predictions towards the representations, while regularizing the representations toward the predictions. In practice, learning the transition model takes time and we want to avoid regularizing the representations towards poorly trained predictions. To address this, we apply a higher learning rate to the prediction branch. We call this technique Barlow Balancing (Algorithm 1).

Algorithm 1: PyTorch-style pseudocode for Barlow Balancing
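The balancing idea can be illustrated on a toy scalar problem. The sketch below is plain Python rather than the original listing. It assumes the learning-rate ratio is obtained by weighting two copies of the loss, one with the target treated as a constant and one with the prediction treated as a constant, and it uses a squared distance as a stand-in for the Barlow loss. The names `balanced_step`, `alpha`, and `lr` are hypothetical.

```python
def balanced_step(pred, target, alpha=0.7, lr=0.1):
    """One gradient step on
        alpha * L(pred, sg(target)) + (1 - alpha) * L(sg(pred), target),
    where L(p, t) = (p - t) ** 2 and sg() means "treat as a constant"."""
    grad_pred = 2.0 * (pred - target)      # dL/dpred with the target held fixed
    grad_target = -2.0 * (pred - target)   # dL/dtarget with the prediction held fixed
    new_pred = pred - lr * alpha * grad_pred
    new_target = target - lr * (1.0 - alpha) * grad_target
    return new_pred, new_target


pred, target = 0.0, 1.0
new_pred, new_target = balanced_step(pred, target)
# With alpha = 0.7 the prediction moves 7/3 times as far as the target.
print(new_pred - pred, target - new_target)
```

Setting `alpha=1.0` in this toy recovers the case where no gradient reaches the target branch, and `alpha=0.5` treats both branches equally.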

Other SSL objectives

SGI's authors (Schwarzer et al., 2021b) showed that in the absence of other SSL objectives, pretraining with the BYOL prediction objective alone results in representation collapse; the addition of an inverse dynamics modeling loss is necessary to prevent collapse, while the addition of a goal-oriented RL loss results in a minor improvement in downstream RL performance. In inverse dynamics modeling, the model is trained using cross-entropy to model $p(a_t \mid \hat{y}_{t+k}, \tilde{y}_{t+k+1})$, effectively predicting the transition action between two adjacent states. For details regarding the goal-oriented RL loss, please refer to the Appendix.

Results

Experimental details

We conduct experiments on the Arcade Learning Environment benchmark (Bellemare et al., 2013). Given the multitude of pretraining setups we investigate, we limit our experiments to 9 Atari games.$^1$

Pretraining We use the publicly available DQN replay dataset (Agarwal et al., 2020), which contains data from training a DQN agent for 50M steps with sticky actions (Machado et al., 2018). We select 1.5 million frames from the 3.5 to 5 millionth steps of the replay dataset, which constitute trajectories of a weak, partially trained agent. We largely follow the recipe of SGI (Schwarzer et al., 2021b), where we jointly optimize the self-prediction, goal-conditioned RL, and inverse dynamics modeling losses for 20 epochs; in some of our experiments we remove one or both of the last two objectives. We use the same data augmentations as SGI, namely the ones introduced by Yarats et al. (2021b). Each experiment runs on a single AMD MI50 GPU, and pretraining takes 2 to 8 days depending on the model.

Reward probing We focus on the simplified binary classification task of whether a reward occurs in a given state. We use 100k frames from the 1-1.1 millionth step of the replay dataset, with a 4:1 train/eval split. We train a logistic regression model on frozen features using the Cyanure (Mairal, 2019) library, with the MISO algorithm (Mairal, 2015) coupled with QNING acceleration (Lin et al., 2019) for a maximum of 300 steps. We do not use any data augmentation. We report the mean F1 averaged across all 9 games. On a MI50 AMD GPU, each probing run takes 10 minutes. From preliminary studies we found the variance across linear probing runs to be sufficiently low ( ≤ 1e-2 F1). Thus we omit standard error bars for all probing F1 scores.
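The probing procedure can be illustrated with a minimal stand-in: a logistic regression fitted on frozen features by full-batch gradient descent, scored with binary F1. This sketch does not use Cyanure, MISO, or QNING; the optimizer, learning rate, function names, and the four-point toy dataset are all assumptions made for illustration.

```python
import math


def binary_f1(y_true, y_pred):
    """F1 score of the positive class (reward present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def fit_linear_probe(features, labels, lr=0.5, steps=300):
    """Logistic regression on frozen features with full-batch gradient descent."""
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # gradient of the BCE w.r.t. the logit
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / n
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
    return w, b


def predict(w, b, features):
    """Threshold the logit at zero (probability 0.5)."""
    return [1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0 for x in features]


# Made-up 2-dimensional "frozen features" with reward-present labels.
feats = [[1.0, 0.2], [2.0, -0.1], [-1.0, 0.1], [-2.0, -0.2]]
labels = [1, 1, 0, 0]
w, b = fit_linear_probe(feats, labels)
print(binary_f1(labels, predict(w, b, feats)))
```

In a real run the score would be computed on the held-out part of the split rather than on the training points used here.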

Action probing We use the last 100k (4:1 train/eval split) frames of the DQN replay dataset, which correspond to a fully trained DQN agent. We train a linear layer on top of frozen, un-augmented features for 12 epochs with softmax focal loss (Lin et al., 2017) using SGD optimizer with learning rate 0.2, batch size 256, 1e-6 weight decay, stepwise scheduler with step size 10 and gamma 0.1. We report the Multiclass F1 (weighted average of F1 scores of each class) averaged across all games.
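The weighted multiclass F1 mentioned above can be sketched as follows, assuming the weights are each class's share of the true labels (the usual "weighted" average). The function name and the example labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = count - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += count * f1
    return total / len(y_true)


# Made-up expert actions and probe predictions over three action classes.
print(weighted_f1([0, 0, 1, 2], [0, 1, 1, 2]))  # 0.75
```

Weighting by support keeps frequent actions from being drowned out by rare ones, which matters when the expert's action distribution is skewed.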

RL evaluation We focus on the Atari 100k benchmark (Kaiser et al., 2020), where only 100k interactive steps are allowed by the agent. This is roughly equivalent to two hours of human play, providing an approximation for human level sample-efficiency. We follow the training protocol of Schwarzer et al. (2021b) using the Rainbow algorithm (Hessel et al., 2018) with the following differences: we freeze the pretrained encoder (thus only training the Q head), do not apply auxiliary SSL losses while fine-tuning, and disable noisy layers, relying instead on $\epsilon$-greedy exploration. These changes are made so that the RL results reflect as closely as possible the performance induced by the quality of the representations. On an AMD MI50 GPU, each run takes between 8 and 12 hours. We evaluate each setup using 10 seeds.$^2$

We evaluate the agent's performance using human-normalized score (HNS), defined as $(\text{score}_{\text{agent}} - \text{score}_{\text{random}}) / (\text{score}_{\text{human}} - \text{score}_{\text{random}})$. We calculate this per game, per seed by averaging scores over 100 evaluation trajectories at the end of training. For aggregate metrics across games and seeds, we report the median and interquartile mean (IQM). For median, we first average the HNS across seeds for each game, and report the median of the averaged HNS values. For IQM, we first take the interquartile mean across seeds for each game, and report the average of these quantities. While median is commonly reported for Atari100k, recent work has recommended IQM as a superior aggregate metric for the RL setting due to its smaller uncertainty (Agarwal et al., 2021); we also follow the cited work to report the 95% confidence intervals.
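The aggregation described above can be sketched in a few lines. The interquartile mean is taken here as the mean after trimming the lowest and highest 25% of values, with the number trimmed per side rounded down; that rounding rule, the function names, and the example scores are assumptions.

```python
from statistics import mean, median


def hns(agent, random, human):
    """Human-normalized score for one game and one seed."""
    return (agent - random) / (human - random)


def interquartile_mean(values):
    """Mean of the values left after trimming the lowest and highest 25%."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))  # assumed rounding: truncate towards zero
    return mean(ordered[cut:len(ordered) - cut])


def aggregate(hns_by_game):
    """hns_by_game maps a game name to its list of per-seed HNS values."""
    per_game_mean = [mean(seeds) for seeds in hns_by_game.values()]
    per_game_iqm = [interquartile_mean(seeds) for seeds in hns_by_game.values()]
    # Median: median over games of the seed-averaged HNS.
    # IQM: average over games of the per-game interquartile mean.
    return median(per_game_mean), mean(per_game_iqm)


# Made-up per-seed HNS values for three hypothetical games.
scores = {
    "game_a": [0.0, 1.0, 2.0, 7.0],
    "game_b": [1.0, 1.0, 1.0, 1.0],
    "game_c": [0.0, 0.0, 4.0, 4.0],
}
print(aggregate(scores))  # (2.0, 1.5)
```

In this toy example the outlier seed in `game_a` raises that game's mean to 2.5 but leaves its interquartile mean at 1.5, which is the robustness property that motivates IQM.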

1 Amidar, Assault, Asterix, Boxing, Demon Attack, Frostbite, Gopher, Krull, Seaquest. See Appendix G for selection strategy and per-game statistics.

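2 An RL run takes on average 10 GPU hours, while a probing run takes 10 CPU minutes. We run 10 RL seeds due to high variance, while probing runs have low variance and only require a single run. Evaluating one setup therefore costs about 10 × 10 = 100 hours (6,000 minutes) with RL versus 10 minutes with probing, so probing is ∼ 600 × faster.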

Unless specified otherwise, the experiments use the medium ResNet-M from Schwarzer et al. (2021b), and the inverse dynamics loss as an auxiliary loss. In BYOL experiments, the target network is an exponential moving average of the online network, while in Barlow Twins both networks are identical, following the original papers. For additional details regarding model architectures and hyperparameters used during pretraining and RL evaluation, please refer to the Appendix.

Impact of transition models and prediction objectives

Table 1: F1 scores on probing tasks for different transition models and prediction objectives.

In table 1, we report the mean probing F1 scores for the convolutional, deterministic GRU, and latent GRU transition models trained using either the BYOL or Barlow prediction objective. When using the BYOL objective, the relative probing strengths for the different transition models are somewhat ambiguous: while the convolutional model results in better reward probing F1, the GRU models are superior in terms of expert action probing.

Interestingly, we observe that after replacing BYOL with Barlow, the probing scores for the latent model improve, while those of the deterministic models deteriorate. Overall, the particular combination of pre-training using the GRU-latent transition model with the Barlow prediction objective results in representations with the best overall probing qualities. Since the deterministic model's predictions are likely to regress to the mean, allowing gradients to flow through the target branch in the case of Barlow objective can regularize the representations towards poor predictions, and can explain their inferior probing performance. Introducing latent variables can alleviate this issue through better predictions.

We stress that the transition models are not used during probing, only the encoder is. These experiments show that having a more expressive forward model during the pre-training has a direct impact on the quality of the representations learned by the encoder. In Fig.3, we qualitatively investigate the impact of the latent variable on the information contained in the representations, by training a decoder on frozen features.

In Table 2, we show the results from experimenting with different variants of the Barlow objective. We find that using a higher learning rate for the prediction branch (Barlow$_{0.7}$, with a 7:3 prediction-to-target learning rate ratio) results in better probing outcomes than using equal learning rates (Barlow$_{0.5}$) or not letting gradients flow in the target branch altogether (Barlow$_{1}$; here the target encoder is a copy of the online encoder). This suggests that while it is helpful to regularize the representations towards the predictions, there is a potential for them being regularized towards poorly trained ones. This can be addressed by applying a higher learning rate on the prediction branch.

Table 3: Representation probing and RL results for representative setups. Mean binary F1 for reward, mean multiclass F1 for next action. RL metrics are aggregated on 10 seeds of 9 games and reported with 95% confidence intervals.

We also demonstrate that using a frozen, random target network (Barlow$_{\text{rand}}$) results in good features, and in our experiments it achieves the best reward probing performance. This contradicts findings from the vision domain (Grill et al., 2020), but corroborates SSL results from other domains such as speech (Chiu et al., 2022). Random networks have also been shown to exhibit useful inductive biases for exploration (Burda et al., 2019b;a). One possible explanation is that random targets act as a regularizer that prevents partial collapse by forcing the model to encode a wide range of features.

SSL objective Although pretraining with multiple objectives can sometimes result in better downstream performance, it also makes hyperparameter tuning and debugging harder; it is therefore desirable to use the fewest objectives that yield comparable performance.

In table 4, we show the effects of inverse dynamics modeling (inv) and goal-conditioned RL (goal) objectives on probing performance. The BYOL model experiences partial collapse without the inverse dynamics modeling loss, while the addition of goal loss improves the probing performance slightly. This is in congruence with the relative RL performances reported by SGI (Schwarzer et al., 2021b) for the same ablations.

The Barlow-only model performs significantly better than the BYOL-only model in terms of probing scores, indicating that the Barlow objective is less prone to collapse in the predictive SSL setting. Similar to the BYOL model, the Barlow model can also be improved with inverse dynamics modeling, while the addition of goal loss has a slight negative impact.

Encoders SGI (Schwarzer et al., 2021b) showed that using bigger encoders during pretraining results in improved downstream RL performance. We revisit this topic to find out whether the pretrained representations from bigger networks also have better probing qualities. We experiment with the medium (ResNet-M) and large (ResNet-L) residual networks from SGI. In Table 5 we show that Barlow models pretrained using the larger ResNet have improved probing scores, which is consistent with SGI's findings.

To investigate the extent to which linear probing performance correlates with the actual downstream RL performance, we perform RL evaluations for 7 representative setups, and report their probing and aggregate RL metrics (with confidence intervals) in Table 3. We find that the rank correlations between reward probing F1 and the RL aggregate metrics are significant ($r = 0.933$, $p < 0.001$; Figure 1), while those for the expert action probing F1 are weaker, though still positive ($r = 0.66$, $p = 0.019$; Figure 5 for RL IQM).$^3$ In sum, our results suggest that while probing cannot completely replace RL evaluation due to the lack of perfect correspondence between rankings, the strong positive correlation still makes it a useful proxy for designing pretraining setups that deliver significant downstream RL performance improvements.

3 We have not investigated how the quality of the policy that generated the expert actions affects the strength of correlation for action probing. It is possible that probing for actions of a weaker policy will be less informative, as this certainly holds true in the extreme case of a random policy.
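For reference, the rank correlation used in this analysis can be computed as the Pearson correlation of the ranks. The sketch below is a minimal pure-Python version with average ranks for ties; it does not compute a p-value, and the function names and example numbers are made up for illustration rather than taken from the experiments.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical probing scores and RL scores for five setups.
probe_f1 = [0.31, 0.35, 0.40, 0.44, 0.52]
rl_score = [0.20, 0.15, 0.33, 0.30, 0.41]
print(spearman(probe_f1, rl_score))
```

Because only ranks enter the computation, the coefficient measures whether probing orders the setups the same way RL does, regardless of the scale of either metric.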

Conclusion

In an effort to alleviate the need to rely on costly RL evaluations to assess the quality of unsupervised representations, we investigated whether linear probing tasks can serve as a useful proxy in the context of RL. We found a significant rank correlation between performance on the probing tasks and downstream RL performance. Using this proxy to guide us, we have demonstrated the impact of a number of key design choices in the pre-training methodology. While linear probing cannot fully replace RL evaluation, we hope these promising results encourage the research community to make greater use of probing tasks, given their simplicity and efficiency, to systematically explore the design space and further improve the quality of SSL representations for RL.

Encoders

Models and Hyper-parameters

We use ResNet-M and ResNet-L from SGI (Schwarzer et al., 2021b). The ResNet-M encoder consists of inverted residual blocks with an expansion ratio of 2, with batch normalization applied after each convolutional layer; it uses 3 groups with 32, 64, and 64 channels, and has 3 residual blocks per group; it down-scales the input by a factor of 3 in the first group and by a factor of 2 in each of the latter 2 groups. This yields a representation of shape 64x7x7 when applied to 84x84 Atari frames. ResNet-L uses 3 groups with 48, 96, and 96 channels, and has 5 residual blocks per group; it uses a larger expansion ratio of 4, producing a representation of shape 96x7x7 from an 84x84 frame. This enlargement increases the number of parameters by approximately a factor of 5.
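The representation shapes quoted above follow from the per-group down-scaling factors. The short check below is only an arithmetic sketch: the helper name `output_shape` is hypothetical, and it assumes that ResNet-L uses the same down-scaling factors (3, 2, 2) as ResNet-M, which the text implies through the shared 7x7 spatial size but does not state explicitly.

```python
def output_shape(height, width, last_group_channels, downscale_factors):
    """Spatial size after dividing by each group's down-scaling factor."""
    for factor in downscale_factors:
        # Assumption: each group divides the spatial size exactly.
        assert height % factor == 0 and width % factor == 0
        height //= factor
        width //= factor
    return (last_group_channels, height, width)


# ResNet-M: last group has 64 channels; factors 3, 2, 2 (from the text).
resnet_m_shape = output_shape(84, 84, 64, (3, 2, 2))
# ResNet-L: last group has 96 channels; same factors assumed.
resnet_l_shape = output_shape(84, 84, 96, (3, 2, 2))

print(resnet_m_shape)  # (64, 7, 7)
print(resnet_l_shape)  # (96, 7, 7)
```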

We experimented with three transition models: a convolutional model, a deterministic GRU, and a latent GRU. Our convolutional model is based on SGI (Schwarzer et al., 2021b). The input to the convolutional transition model is the concatenation, along the channel dimension, of the action represented as a 2D one-hot vector and the representation $e_t$. The network itself consists of two 64-channel convolutional layers with 3x3 filters, separated by ReLU activation and batch normalization layers.

The deterministic GRU model has a hidden dimension of 600 and an input dimension of 250. The input $a_t$ is prepared by passing the one-hot action vector through a 250-dimensional embedding layer. The initial hidden state $\hat{e}_0$ is generated by projecting the representation at timestep 0 through a 600-dimensional linear layer with ELU activation and dropout. Layer normalization is applied to the hidden input at all timesteps.

The latent GRU model is based on DreamerV2's RSSM (Hafner et al., 2021), and consists of a recurrent model, a posterior model, a prior predictor, and a latent merger. The recurrent model has a hidden dimension of 600 and an input dimension of 600. The initial hidden state $h_0$ and input $z_0$ are zero vectors. The flattened stochastic variables $z_t$ and the one-hot action vector $a_t$ are first concatenated and then projected to 600 dimensions through a linear layer with ELU activation, before being passed into the recurrent model as input. Layer normalization is applied to the hidden input at all non-zero timesteps.

The posterior model is a two-layer MLP with a 600-dimensional bottleneck separated by ELU activation. It takes the concatenation of the representation $e_t$ and the recurrent hidden output $h_t$ as input, and outputs a 1024-dimensional vector representing the 32-dimensional logits for 32 latent categorical variables. $z_t$ is sampled from the posterior logits. The prior model is a two-layer MLP with a 600-dimensional bottleneck separated by ELU activation. Its output format is the same as that of the posterior model. $\hat{z}_t$ is sampled from the prior logits. The latent merger is a linear layer that projects the concatenation of $h_t$ and the flattened $z_t$ to the same dimension as the representation $e_t$.
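To make the output format concrete, the sketch below splits a 1024-dimensional logit vector into 32 groups of 32 logits and samples one one-hot vector per group. It covers only the forward sampling step; the function name `sample_latent` and the random logits are illustrative assumptions, and how gradients are passed through the sample during training is not shown.

```python
import math
import random

NUM_VARIABLES = 32  # latent categorical variables (from the text)
NUM_CLASSES = 32    # logits per variable (from the text)


def sample_latent(flat_logits, rng):
    """Sample one one-hot vector per categorical variable and flatten them."""
    assert len(flat_logits) == NUM_VARIABLES * NUM_CLASSES
    z = []
    for v in range(NUM_VARIABLES):
        logits = flat_logits[v * NUM_CLASSES:(v + 1) * NUM_CLASSES]
        max_logit = max(logits)
        # Unnormalized softmax weights; subtracting the max avoids overflow.
        weights = [math.exp(x - max_logit) for x in logits]
        index = rng.choices(range(NUM_CLASSES), weights=weights, k=1)[0]
        one_hot = [0.0] * NUM_CLASSES
        one_hot[index] = 1.0
        z.extend(one_hot)
    return z  # flattened sample of length 1024


rng = random.Random(0)
# Hypothetical logits standing in for the posterior (or prior) MLP output.
example_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_VARIABLES * NUM_CLASSES)]
z_t = sample_latent(example_logits, rng)
print(len(z_t), sum(z_t))
```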

In the case of the deterministic GRU, $\hat{e}$ is first projected to the same dimension as the representation through a linear layer. Henceforth we shall assume that $\hat{e}$ underwent this step for GRU-det.

The predicted representation $\hat{e}$ and the target representation $\tilde{e}$ are projected to 1024-dimensional vectors $\hat{y}$ and $\tilde{y}$ through a linear layer. The BYOL objective involves processing $\hat{y}$ with an additional linear layer $q$ with output dimension 1024. The Barlow objective involves applying batch normalization to $\hat{y}$ and $\tilde{y}$ prior to taking the covariance and variance losses.

The inverse dynamics model is a two-layer MLP with a 256-dimensional bottleneck separated by ReLU activation. It takes the concatenation of $\hat{y}_t$ and $\tilde{y}_{t+1}$ as input, and outputs logits with dimension equal to the number of actions.

We used a decoder architecture that mirrors the structure of the ResNet-M encoder. In decoding, instead of transposed convolutions we used nearest-neighbor upsampling followed by a regular convolution (Odena et al., 2016). We used the mean squared error between the reconstructed pixels and the target image as the training criterion. All the models were trained and evaluated on the same data as reward and action probing. Models were trained for 30 epochs using the Adam optimizer with a learning rate of 0.001.

Table 9: Optimal RL learning rate for different setups, identified by sweeping through 2.5e-5, 5e-5, 1e-4, and 2e-4, and evaluated on the games Frostbite, Assault, Gopher, and Demon Attack.

We use the same image augmentations as used in SGI (Schwarzer et al., 2021b), which itself used the augmentations in DrQ (Yarats et al., 2021b), in both pretraining and fine-tuning. We specifically apply random crops (4 pixel padding and 84x84 crops) and image intensity jittering.
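A minimal sketch of the random-crop augmentation follows. It assumes edge-replicating padding, which is a choice made for this illustration and is not specified in the text; the function name `random_shift` is hypothetical, and intensity jittering is omitted. Padding by 4 pixels and then taking an 84x84 crop is equivalent to shifting the frame by up to 4 pixels in each direction, so the code indexes the original frame with clamped coordinates instead of building the padded image.

```python
import random


def random_shift(frame, pad=4, rng=random):
    """Pad `frame` by `pad` pixels (edge replication), then crop back to its size."""
    height, width = len(frame), len(frame[0])
    # Top-left corner of the crop inside the padded image (inclusive bounds).
    dy = rng.randint(0, 2 * pad)
    dx = rng.randint(0, 2 * pad)

    def clamp(value, upper):
        return max(0, min(value, upper))

    return [
        [
            frame[clamp(y + dy - pad, height - 1)][clamp(x + dx - pad, width - 1)]
            for x in range(width)
        ]
        for y in range(height)
    ]


# Hypothetical 84x84 grayscale frame whose pixel value encodes its position.
frame = [[y * 84 + x for x in range(84)] for y in range(84)]
shifted = random_shift(frame, pad=4, rng=random.Random(0))
print(len(shifted), len(shifted[0]))
```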

Goal-oriented RL loss

Goal-oriented RL loss is taken directly from SGI (Schwarzer et al., 2021b). This objective trains a goal-conditional DQN, with rewards specified by proximity to sampled goals. First, a goal g is sampled to be the state encoding either of the near future in the current trajectory (up to 50 steps in the future), or, with probability of 20

Forward Model Probing

While our principal goal is to demonstrate the correlation between representation probing and offline RL performances, we also apply the reward probing technique to predictions in order to evaluate the qualities of transition models under different pretraining setups.

In table 10, we show the effects of using different transition models during pretraining on prediction probing performance. All models are trained with ResNet-M encoder and inverse loss. Goal loss is also applied to the BYOL models.

In the deterministic setting, the predictions of the GRU model are worse than those of the convolutional model. The introduction of stochasticity appears to fix the underlying issue for predictions, resulting in the latent GRU model having the best overall prediction probing performance.

One possible explanation for Conv-det having better predictions than GRU-det is that the spatial inductive bias in the convolutional kernels acts as a constraint and helps regularize the predictions from regressing to the mean. However, this is more effectively solved by the introduction of latent variables into GRU during training and inference.

Compared to the BYOL model, Barlow models generally have higher probing scores for predictions. We also note that for Barlow models, regularizing the representations towards the predictions (by setting Barlow Balance < 1) improves the quality of the predictions. This is likely because it makes the prediction task easier, which in turn makes the model more likely to learn a capable transition model.

This reasoning can also explain why the Barlow model with a frozen, random target network achieves superior probing results for representations (table 2) but worse results for predictions compared to the other Barlow versions. Predicting a random target representation is likely more difficult than predicting a learned representation, and this may in turn encourage the model to rely more on learning a powerful encoder and posterior model, and less on learning an accurate transition model.

Table 12: Full RL results for representative pretraining setups. Setup names take the form encoder-transition model-SSL losses. M and L refer to ResNet-M and ResNet-L; CD is the convolutional model, GD the deterministic GRU, and GL the latent GRU; By and Bt refer to BYOL and Barlow; G and I refer to the goal and inverse losses.

Statistical Hypothesis Testing of Rank Correlation

In Fig. 5, we show the correlation results for both the action and reward predictions. We estimate Spearman's rank correlation coefficient (Spearman's r) between the linear probing performance and the (interquartile) mean RL human-normalized score (HNS) over 9 Atari games. We use Spearman's r instead of the Pearson correlation coefficient because we are interested in whether the relative ranking of the models on the linear probing tasks is indicative of the relative ranking of the same models when RL is trained on top of them. As an example, this allows us to say that if model A out-ranks model B in the reward prediction task, an RL model trained on top of model A's representations will likely out-perform an RL model trained on top of model B's representations. However, it does not let us predict by how much model A will out-perform model B.

Let $d$ denote the difference in ranking between the linear probing performance and the RL performance. Spearman's r (denoted as $\rho$ below) is computed as

$$ \rho = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2 -1)}, $$

where $d_i$ is the difference in ranking for the i-th model, and $n$ is the total number of models.
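The rank-difference formula for Spearman's ρ can be sketched in a few lines. This is a minimal illustration that assumes there are no ties among the scores (the closed form is exact only in that case); the function names and the four score pairs are hypothetical and are not taken from the paper's results.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman's rho from squared rank differences d_i."""
    n = len(x)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Hypothetical probing F1 scores and RL scores for four models.
probe_scores = [0.52, 0.63, 0.67, 0.70]
rl_scores = [0.20, 0.35, 0.30, 0.45]

# Ranks are (1, 2, 3, 4) and (1, 3, 2, 4), so sum(d_i^2) = 2 and
# rho = 1 - 12 / 60 = 0.8.
rho = spearman_rho(probe_scores, rl_scores)
print(rho)
```

Because only ranks enter the computation, any monotone rescaling of either score leaves ρ unchanged, which is why the coefficient speaks to model ordering and not to the size of the gap between models.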

We perform statistical hypothesis testing on ρ with null hypothesis ρ = 0 (no correlation between linear probing performance and RL performance) and alternative hypothesis ρ > 0 (positive correlation). The null distribution is constructed nonparametrically using permutation testing: we sample random orderings of the observed linear probing performance and RL performance independently and compute ρ. This is repeated 50,000 times to generate the null distribution (which is centered at ρ = 0, as we do not expect randomly ordered values to be correlated). We then compare our observed ρ to this distribution and perform a one-tailed test, reporting as our p-value the proportion of samples larger than our observed ρ.
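The permutation test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names and the seven score pairs are hypothetical, ties are assumed absent, the example uses fewer permutations than the 50,000 in the text to keep it fast, and permuted values at least as large as the observed ρ are counted (a slightly conservative choice relative to "larger than").

```python
import random


def spearman_rho(x, y):
    """Spearman's rho via rank differences; assumes no ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    n = len(x)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


def permutation_p_value(x, y, num_permutations=50_000, seed=0):
    """One-tailed p-value for rho > 0 under a shuffled-order null."""
    rng = random.Random(seed)
    observed = spearman_rho(x, y)
    x_perm, y_perm = list(x), list(y)
    at_least_as_large = 0
    for _ in range(num_permutations):
        # Shuffle both lists independently, as described in the text.
        rng.shuffle(x_perm)
        rng.shuffle(y_perm)
        if spearman_rho(x_perm, y_perm) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / num_permutations


# Hypothetical scores for seven models with identical rankings.
probe_scores = [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80]
rl_scores = [0.10, 0.15, 0.22, 0.30, 0.33, 0.41, 0.47]
observed_rho, p_value = permutation_p_value(
    probe_scores, rl_scores, num_permutations=5_000
)
print(observed_rho, p_value)
```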

Rank Correlation on a different dataset

Figure 6: Reproduction of Fig.5, left, on a different probing dataset (expert trajectories instead of random ones). The exact values of the F1 scores are different, but the Spearman's r is the same, showing that the correlation is insensitive to the probing dataset

In Fig. 1, we explored the correlation between the RL performance and the reward probing task, where the dataset used for the reward probing was a set of quasi-random trajectories from the DQN dataset, coming from the very beginning of the training run of the DQN agent used to collect the data. It is natural to ask whether the correlation results we obtain are sensitive to the specific dataset used. To put this question to the test, we re-run the same reward probing task, this time on the "expert" dataset, i.e. the last trajectories of the DQN dataset, corresponding to a fully trained agent. The results are shown in Fig. 6. The Spearman's correlation coefficient that we obtain is exactly the same as the one for the random trajectory dataset (even though the reward statistics are different, see Table 14), showing that the correlation result is not sensitive to the probing dataset used.

Confidence Interval of RL performance as a Function of Independent Runs

We further show the confidence interval of the estimated mean RL performance as the number of independent runs increases. From our total of 10 independent runs for each game, we sample with replacement k ≤ 10 runs (k being the number of independent runs we 'pretend' to have instead of the full 10), independently for each game. We can compute the IQM over this sample to get an estimate for the IQM as if we only had k independent runs. We repeat this process 10,000 times to construct the 95% confidence interval of the empirical IQM for different values of k. Illustrative examples of how much this confidence interval shrinks for different pairs of models are shown in Fig. 7.
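The resampling procedure can be sketched as below. The sketch rests on several assumptions that the text does not fix: the IQM is taken as the mean of the middle 50% of all pooled scores (dropping the floor of a quarter of the values at each end), the interval is read off the 2.5th and 97.5th percentiles of the resampled estimates, and the per-game scores are made-up normalized values for two games only.

```python
import random


def iqm(values):
    """Interquartile mean: average of the middle 50% of the sorted values."""
    ordered = sorted(values)
    cut = len(ordered) // 4  # assumption: floor when n is not divisible by 4
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def bootstrap_iqm_ci(scores_per_game, k, num_resamples=10_000, seed=0):
    """95% percentile interval of the IQM when only k runs per game are used."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(num_resamples):
        pooled = []
        for runs in scores_per_game.values():
            # Sample k runs with replacement, independently for each game.
            pooled.extend(rng.choices(runs, k=k))
        estimates.append(iqm(pooled))
    estimates.sort()
    lower = estimates[int(0.025 * num_resamples)]
    upper = estimates[int(0.975 * num_resamples) - 1]
    return lower, upper


# Hypothetical human-normalized scores: 10 independent runs for two games.
scores_per_game = {
    "game_a": [0.21, 0.35, 0.18, 0.40, 0.27, 0.31, 0.22, 0.38, 0.25, 0.29],
    "game_b": [0.55, 0.72, 0.61, 0.48, 0.80, 0.66, 0.59, 0.70, 0.52, 0.63],
}
ci_k2 = bootstrap_iqm_ci(scores_per_game, k=2)
ci_k10 = bootstrap_iqm_ci(scores_per_game, k=10)
print("k=2 :", ci_k2)
print("k=10:", ci_k10)
```

Comparing the two printed intervals shows how the width of the interval depends on the number of runs per game.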

We observe in Fig. 7 that the mean RL performance estimates have CIs that eventually separate with many independent runs. This is an unbiased but high-variance and computationally intensive estimator of the true expected RL performance. On the other hand, the reward prediction F1 score is a computationally cheap, low-variance and accurate estimator of the relative model ranks in mean RL performance. This further corroborates our previous results of positive correlation between reward prediction F1 score and mean RL performance (Fig. 1).

Comparison with Domain Specific Probing Benchmarks

One of the key advantages of our probing method is that it is domain agnostic, unlike the previously proposed AtariARI benchmark (Anand et al., 2019) which acquires probing labels through the RAM state of the emulator, making their method impractical for image-based trajectories.

To better understand how our probing metrics compare with the domain specific ones in terms of correlations with RL performances, we perform the AtariARI probing benchmarks using our pretrained encoders on the 4 overlapping games (Boxing, Seaquest, Frostbite, DemonAttack) used in both works. For AtariARI, we first calculate the average probe F1 scores across categories, then average this quantity across the games. For reward probing, we apply our own protocol detailed in section 5.1. For RL performance we use the IQM. We report the correlation between the probing metrics and RL performances across different models.

Our results are summarized in Table 13. We find that the correlation between the average probing F1s and RL performances is stronger for our reward probing method. In particular, our probing method has a significant correlation with RL performances (p < 0.05), while the AtariARI probing method does not.

Figure 8: Average reward probing F1s for two SSL setups during different training epochs. Epoch 0 constitutes an untrained model.

Game Statistics

We selected a set of 9 representative Atari games due to limited compute. The 9 games were chosen randomly from a subset of Atari games that had at least 1

Figure 9: Left: Spearman's correlation coefficient between the RL performance on each individual game and the reward probing F1, plotted as a function of the percentage of rewards observed in this game. Right: p-values associated with each of the Spearman's coefficients

further sanity check that the reward percentage does not play a role on the reward probe's correlation coefficient or p-value in Section G.2.

Table 14: Percentages of positive rewards in checkpoints 1 and 50 of the DQN replay dataset for 9 games. Checkpoint 1 is used for reward probing and checkpoint 50 is used for expert action probing.

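In table 14, we report the percentage of states that have a non-zero reward in each of the 9 games, for two different subsets of data: checkpoint 1 of the DQN replay dataset, which is used for reward probing, and checkpoint 50, which is used for expert action probing.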

All the games have a fairly small percentage of positive reward states, and we generally observe a higher percentage of reward in checkpoint 50, which is expected since the agent is more capable by then.

In Fig. 9, we plot the Spearman's correlation coefficient between the RL performance on each individual game and the reward probing F1, as a function of the percentage of reward observed in each game (see Table 14). We do not observe any particular pattern with respect to the sparsity, suggesting that the probing task is not very sensitive to the sparsity level of each individual game. Note however that, as usual in the Atari benchmark, it is difficult to draw conclusions from any given individual game, and the statistical significance of our results only emerges when considering the set of games as a whole. Indeed, only 3 games achieve individual statistical significance at p < 0.01 (Boxing, Seaquest and Assault), while the others do not obtain statistically significant correlations.

One limitation of the current work is that for the presented probing methods to work one needs a subset of the data either with known rewards, where ideally rewards are not too sparse, or with expert actions. If neither is available, our method cannot be used. For the reward probing task, the usefulness of the method also depends on the hardness of the reward prediction itself. If the prediction task is too easy, for example because there are rewards at every step, or because the states with rewards are completely different from the ones without (such that even a randomly initialized model would yield features allowing linear separation between the two types of states), then the performance of all the models on this task is going to be extremely similar, with the only differences coming from random noise. In such a case, the performance on the prediction task cannot be used to accurately rank the quality of the features of each of the models. For future work we would also like to extend the findings of this paper to more settings, for example different environments.

Pred Obj	Transition	Reward	Action
BYOL	Conv-det	64.9	22.7
	GRU-det	62.2	26.8
	GRU-latent	63.4	23.2
Barlow 0.7	Conv-det	52.7	24.9
	GRU-latent	67.5	26.2

ResNet	Transition	Objectives	Reward	Action
L	GRU-lat	Barlow rand, inv	70.3	26.7
L	GRU-lat	Barlow 0.7, inv	69.0	27.7
M	GRU-lat	Barlow rand, inv	67.7	25.8
M	GRU-lat	Barlow 0.7, inv	67.4	26.2
M	GRU-lat	BYOL, goal, inv	63.4	23.2
M	GRU-det	BYOL, goal, inv	62.2	26.9
M	Conv-det	BYOL, goal, inv	64.9	22.7
M	GRU-lat	Barlow 0.7	56.2	24.4
M	Conv-det	Barlow 0.7, goal, inv	52.7	24.8

SSL Objs	Reward	Action
BYOL, inv, goal	63.4	23.2
BYOL, inv	57.3	22.6
BYOL	25.9	5.9
Barlow 0.7, inv, goal	66.5	26.2
Barlow 0.7, inv	67.5	26.2
Barlow 0.7	56.2	24.4

	Parameter	Setting
Pretrain & RL	Gray-scaling	True
	Observation down-sampling	84x84
	Frames stacked	4
	Action repetitions	4
	Sticky Action	True
	Reward clipping	[-1, 1]
	Terminal on loss of life	True
	Optimizer	Adam
	Optimizer: learning rate	0.0001
	Optimizer: β1	0.9
	Optimizer: β2	0.999
	Optimizer: ε	0.00015
	Minibatch Size	64
	Max gradient norm	10
Pretrain	Prediction Depth	10
	Epochs	20
	Goal loss weight	0 or 1
	Inverse loss weight	0 or 1
RL	Max frames per episode	108K
	Update	Distributional Q
	Dueling	True
	Support of Q-distribution	51
	Discount factor	0.99
	Priority exponent	0.5
	Priority correction	0.4 → 1
	Exploration	ε-greedy
	Training steps	100K
	Evaluation trajectories	100
	Min replay size for sampling	2000
	Replay period every	1 step
	Updates per step	2
	Multi-step return length	10
	Q network: channels	32,64,64
	Q network: filter size	8x8, 4x4, 3x3
	Q network: stride	4,2,1
	Q network: hidden units	256
	Non-linearity	ReLU
	Target network: update period	1

	Parameter	Setting
BYOL	loss weight	1
	τ	0.99
Barlow	loss weight	0.002
	λ	0.0051

Encoder	Transition	Objectives	Learning Rate
ResNet-M	Conv-det	BYOL, goal, inv	2e-4
ResNet-M	GRU-det	BYOL, goal, inv	2e-4
ResNet-M	GRU-latent	BYOL, goal, inv	2e-4
ResNet-M	GRU-latent	Barlow 0.7, inv	1e-4
ResNet-M	GRU-latent	Barlow rand, inv	5e-5
ResNet-L	GRU-latent	Barlow 0.7, inv	5e-5
ResNet-L	GRU-latent	Barlow rand, inv	1e-4

Pred Obj	Pred 5	Pred 10
BYOL	33.4	28.9
Barlow 0.5	40.2	30.2
Barlow 0.7	39.5	30.2
Barlow 1	37.4	29.7
Barlow rand	36.8	27.5

	Amidar	Assault	Asterix	Boxing	DemonAttack	Frostbite	Gopher	Seaquest	Krull
Random	5.8	222.4	210	0.1	152.1	65.2	257.6	68.4	1598
Human	1719.5	742	8503.3	12.1	1971	4334.7	2412.5	42054.7	2665.5
M-CD-ByGI	169.6	693.1	393.1	54.5	458.1	1058.9	1323.4	461.7	5541.4
M-CD-Bt 0.7 GI	206.1	545.6	500	21.5	357.7	518.5	880.5	482.4	4216
M-GD-ByGI	204.5	552.6	625.2	51	723.5	979.7	1299.2	597.1	5006.3
M-GL-ByGI	170.9	392	527.2	49.1	1842.9	541.9	1489.7	609.9	4753.9
M-GL-Bt 0.7	97.9	846	442.5	53.9	311.5	461.5	731	622.1	4176.4
M-GL-Bt 0.7 I	189.9	861.8	426.4	63.2	1048.8	2020.1	857.6	579	5111.4
M-GL-Bt randI	161.8	954.6	569.1	59.6	4373	1067.4	1068.8	734.5	5422.6
L-GL-Bt 0.7 I	173.5	1072.1	540	72.6	1143.9	1633.4	1274.1	578.7	5383.4
L-GL-Bt rand I	136.3	1273.7	506.5	64	4112.8	1163.7	1594.3	653.1	5453.6
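The raw scores in the table above are usually compared after human normalization. The sketch below applies the standard definition, HNS = (score - random) / (human - random), to the Random and Human rows and, as a worked example, to the M-CD-ByGI row; the variable and function names are illustrative only, and the aggregation into an IQM over runs is not shown.

```python
GAMES = ["Amidar", "Assault", "Asterix", "Boxing", "DemonAttack",
         "Frostbite", "Gopher", "Seaquest", "Krull"]

# Reference scores copied from the Random and Human rows of the table.
RANDOM_SCORES = dict(zip(GAMES, [5.8, 222.4, 210, 0.1, 152.1, 65.2, 257.6, 68.4, 1598]))
HUMAN_SCORES = dict(zip(GAMES, [1719.5, 742, 8503.3, 12.1, 1971, 4334.7, 2412.5, 42054.7, 2665.5]))

# Raw scores copied from the M-CD-ByGI row of the table.
M_CD_BYGI = dict(zip(GAMES, [169.6, 693.1, 393.1, 54.5, 458.1, 1058.9, 1323.4, 461.7, 5541.4]))


def human_normalized_score(game, raw_score):
    """0 corresponds to the random policy, 1 to the human reference."""
    random_score = RANDOM_SCORES[game]
    return (raw_score - random_score) / (HUMAN_SCORES[game] - random_score)


hns = {game: human_normalized_score(game, score) for game, score in M_CD_BYGI.items()}
for game in GAMES:
    print(f"{game:12s} {hns[game]:.3f}")
```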

	Spearman's r	p
AtariARI	0.527	0.058
Reward (ours)	0.782	0.003

Game	Ckpt 1 (%)	Ckpt 50 (%)
Amidar	2.7	5.2
Assault	3.6	6.8
Asterix	5	6
Boxing	3.5	9.3
DemonAttack	2.1	4.7
Frostbite	4.2	2.9
Gopher	2.8	8.5
Krull	13.2	41.7
Seaquest	1.5	7.5

$$ \begin{aligned} \mathrm{Encoder:} &\quad \hat{e}_0 = e_0 = \textsc{Enc}_{\theta}(o_0) \\ \mathrm{Recurrent\ Model:} &\quad \hat{e}_t = f_\phi(\hat{e}_{t-1}, a_{t-1}) \end{aligned} $$

$$ L^{BYOL}_{\theta}(\hat{y}_{t:t+k}, \tilde{y}_{t:t+k}) = - \sum_{k=1}^{K} \frac{q(\hat{y}_{t+k})\cdot \tilde{y}_{t+k}}{\lVert q(\hat{y}_{t+k}) \rVert_2 \cdot \lVert \tilde{y}_{t+k} \rVert_2} $$
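The BYOL objective above is a negative sum of cosine similarities over the K prediction steps. The following sketch computes it for plain Python lists; it assumes the predictor $q$ has already been applied to the predictions, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def byol_loss(q_y_hat, y_tilde):
    """Negative cosine similarity summed over prediction steps.

    q_y_hat[k] stands for q(y_hat_{t+k+1}) and y_tilde[k] for the matching
    target projection.
    """
    return -sum(cosine_similarity(p, t) for p, t in zip(q_y_hat, y_tilde))


# Two steps whose predictions point in the same directions as the targets:
# each cosine similarity is 1, so the loss reaches its minimum of -K = -2.
aligned_loss = byol_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
# One step with an orthogonal prediction: the similarity is 0.
orthogonal_loss = byol_loss([[1.0, 0.0]], [[0.0, 1.0]])
print(aligned_loss, orthogonal_loss)
```

Since each term depends only on direction, rescaling either vector leaves the loss unchanged.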

$$ \rho = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2 -1)}, $$

$$ \mathcal{L}(\psi)_{\text{reward-reg}} = \dfrac{1}{|\mathcal{D}_\mathrm{rew}|} \sum\limits_{\mathcal{T'}_i\in \mathcal{D}_\mathrm{rew}} \dfrac{1}{|\mathcal{T'}_i|} \sum\limits_{(o_t, a_t, r_t) \in \mathcal{T'}_i} \lVert l_\psi(\textsc{Enc}_\theta(o_t)) - r_t \rVert_2 $$

Algorithm: PyTorch-style pseudocode for Barlow Balancing

$$ \mathrm{Barlow\ Loss} = \mu * L^{BT}(\hat{y}, \tilde{y}.\mathrm{detach}()) + (1 - \mu) * L^{BT}(\hat{y}.\mathrm{detach}(), \tilde{y}) $$

References

[1] Rishabh Agarwal, Dale Schuurmans, Mohammad Norouzi. (2020). An Optimistic Perspective on Offline Reinforcement Learning. Proceedings of the 37th International Conference on Machine Learning, ICML.

[2] Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C. Courville, Marc G. Bellemare. (2021). Deep Reinforcement Learning at the Edge of the Statistical Precipice. Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual.

[3] Ankesh Anand, Evan Racah, Sherjil Ozair, Yoshua Bengio, Marc{-. (2019). Unsupervised State Representation Learning in Atari. Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada.

[4] Yusuf Aytar, Tobias Pfaff, David Budden, Tom Le Paine, Ziyu Wang, Nando de Freitas. (2018). Playing hard exploration games by watching YouTube. Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montr{'{e.

[5] Adrien Bardes, Jean Ponce, Yann LeCun. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. The Tenth International Conference on Learning Representations, ICLR.

[6] Bellemare, Marc G, Naddaf, Yavar, Veness, Joel, Bowling, Michael. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research.

[7] Bengio, Yoshua, L{'e. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. ArXiv preprint.

[8] Yuri Burda, Harrison Edwards, Amos J. Storkey, Oleg Klimov. (2019). Exploration by random network distillation. 7th International Conference on Learning Representations, ICLR.

[9] Yuri Burda, Harrison Edwards, Deepak Pathak, Amos J. Storkey, Trevor Darrell, Alexei A. Efros. (2019). Large-Scale Study of Curiosity-Driven Learning. 7th International Conference on Learning Representations, ICLR.

[10] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin. (2020). Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

[11] Mathilde Caron, Hugo Touvron, Ishan Misra, Herv{'{e. (2021). Emerging Properties in Self-Supervised Vision Transformers. 2021 {IEEE/CVF. doi:10.1109/ICCV48922.2021.00951.

[12] Chen, Xinlei, Fan, Haoqi, Girshick, Ross, He, Kaiming. (2020). Improved baselines with momentum contrastive learning. ArXiv preprint.

[13] Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey E. Hinton. (2020). A Simple Framework for Contrastive Learning of Visual Representations. Proceedings of the 37th International Conference on Machine Learning, ICML.

[14] Xinlei Chen, Kaiming He. (2021). Exploring Simple Siamese Representation Learning. IEEE. doi:10.1109/CVPR46437.2021.01549.

[15] Chung{-. (2022). Self-supervised learning with random-projection quantizer for speech recognition. International Conference on Machine Learning, {ICML.

[16] Abhishek Das, Federico Carnevale, Hamza Merzic, Laura Rimell, Rosalia Schneider, Josh Abramson, Alden Hung, Arun Ahuja, Stephen Clark, Greg Wayne, Felix Hill. (2020). Probing Emergent Semantics in Predictive Agents via Question Answering. Proceedings of the 37th International Conference on Machine Learning, ICML.

[17] Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, Marc G. Bellemare. (2019). DeepMDP: Learning Continuous Latent Space Models for Representation Learning. Proceedings of the 36th International Conference on Machine Learning, ICML.

[18] Gidaris, Spyros, Bursuc, Andrei, Puy, Gilles, Komodakis, Nikos, Cord, Matthieu, P{'e. (2020). Online bag-of-visual-words generation for unsupervised representation learning. ArXiv preprint.

[19] Jean{-. (2020). Bootstrap Your Own Latent - {A. Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.

[20] Guo, Zhaohan Daniel, Azar, Mohammad Gheshlaghi, Piot, Bilal, Pires, Bernardo A, Munos, R{'e. (2018). Neural predictive belief representations. ArXiv preprint.

[21] Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, Jimmy Ba. (2021). Mastering Atari with Discrete World Models. 9th International Conference on Learning Representations, ICLR.

[22] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger. (2018). Deep Reinforcement Learning That Matters. Proceedings of the Thirty-Second AAAI.

[23] Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Gheshlaghi Azar, David Silver. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. Proceedings of the Thirty-Second AAAI.

[24] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, Koray Kavukcuoglu. (2017). Reinforcement Learning with Unsupervised Auxiliary Tasks. 5th International Conference on Learning Representations, ICLR.

[25] Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Milos, Blazej Osinski, Roy H. Campbell, Konrad Czechowski, Dumitru Erhan, Chelsea Finn, Piotr Kozakowski, Sergey Levine, Afroz Mohiuddin, Ryan Sepassi, George Tucker, Henryk Michalewski. (2020). Model Based Reinforcement Learning for Atari. 8th International Conference on Learning Representations, ICLR.

[26] Lange, Sascha, Gabel, Thomas, Riedmiller, Martin. (2012). Batch reinforcement learning. Reinforcement learning.

[27] Laskin, Michael, Yarats, Denis, Liu, Hao, Lee, Kimin, Zhan, Albert, Lu, Kevin, Cang, Catherine, Pinto, Lerrel, Abbeel, Pieter. (2021). URLB: Unsupervised reinforcement learning benchmark. ArXiv preprint.

[28] Tsung{-. (2017). Focal Loss for Dense Object Detection. {IEEE. doi:10.1109/ICCV.2017.324.

[29] Lin, Hongzhou, Mairal, Julien, Harchaoui, Zaid. (2019). An inexact variable metric proximal point algorithm for generic quasi-Newton acceleration. SIAM Journal on Optimization.

[30] Machado, Marlos C, Bellemare, Marc G, Talvitie, Erik, Veness, Joel, Hausknecht, Matthew, Bowling, Michael. (2018). Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research.

[31] Mairal, Julien. (2015). Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization.

[32] Mairal, Julien. (2019). Cyanure: {{An Open-Source Toolbox. ArXiv preprint.

[33] Menore Tekeba Mengistu, Getachew Alemu, Pierre Chevaillier, Pierre De Loor. (2022). Unsupervised Learning of State Representation using Balanced View Spatial Deep InfoMax: Evaluation on Atari Games. ICAART.

[34] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G, Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, others. (2015). Human-level control through deep reinforcement learning. Nature.

[35] Junhyuk Oh, Valliappa Chockalingam, Satinder P. Singh, Honglak Lee. (2016). Control of Memory, Active Perception, and Action in Minecraft. Proceedings of the 33rd International Conference on Machine Learning, ICML.

[36] Pari, Jyothish, Muhammad, Nur, Arunachalam, Sridhar Pandian, Pinto, Lerrel, others. (2021). The Surprising Effectiveness of Representation Learning for Visual Imitation. ArXiv preprint.

[37] Racah, Evan, Pal, Christopher. (2019). Supervise Thyself: Examining Self-Supervised Representations in Interactive Environments. ArXiv preprint.

[38] Schrittwieser, Julian, Antonoglou, Ioannis, Hubert, Thomas, Simonyan, Karen, Sifre, Laurent, Schmitt, Simon, Guez, Arthur, Lockhart, Edward, Hassabis, Demis, Graepel, Thore, others. (2020). Mastering atari, go, chess and shogi by planning with a learned model. Nature.

[39] Max Schwarzer, Nitarshan Rajkumar, Michael Noukhovitch, Ankesh Anand, Laurent Charlin, R. Devon Hjelm, Philip Bachman, Aaron C. Courville. (2021). Pretraining Representations for Data-Efficient Reinforcement Learning. Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual.

[40] Max Schwarzer, Ankesh Anand, Rishab Goel, R. Devon Hjelm, Aaron C. Courville, Philip Bachman. (2021). Data-Efficient Reinforcement Learning with Self-Predictive Representations. 9th International Conference on Learning Representations, ICLR.

[41] D'Oro, Pierluca, Schwarzer, Max, Nikishin, Evgenii, Bacon, Pierre-Luc, Bellemare, Marc G, Courville, Aaron. (2022). Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier. The Eleventh International Conference on Learning Representations.

[42] Schwarzer, Max, Ceron, Johan Samir Obando, Courville, Aaron, Bellemare, Marc G, Agarwal, Rishabh, Castro, Pablo Samuel. (2023). Bigger, better, faster: Human-level atari with human-level efficiency. International Conference on Machine Learning.

[43] Sermanet, Pierre, Lynch, Corey, Chebotar, Yevgen, Hsu, Jasmine, Jang, Eric, Schaal, Stefan, Levine, Sergey, Brain, Google. (2018). Time-contrastive networks: Self-supervised learning from video. 2018 IEEE international conference on robotics and automation (ICRA).

[44] Michael Laskin, Aravind Srinivas, Pieter Abbeel. (2020). {CURL:. Proceedings of the 37th International Conference on Machine Learning, {ICML.

[45] Adam Stooke, Kimin Lee, Pieter Abbeel, Michael Laskin. (2021). Decoupling Representation Learning from Reinforcement Learning. Proceedings of the 38th International Conference on Machine Learning, ICML.

[46] Gabriel Synnaeve, Zeming Lin, Jonas Gehring, Daniel Gant, Vegard Mella, Vasil Khalidov, Nicolas Carion, Nicolas Usunier. (2018). Forward Modeling for Partial Observation Strategy Games - {A. Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montr{'{e.

[47] Tassa, Yuval, Doron, Yotam, Muldal, Alistair, Erez, Tom, Li, Yazhe, Casas, Diego de Las, Budden, David, Abdolmaleki, Abbas, Merel, Josh, Lefrancq, Andrew, others. (2018). Deepmind control suite. ArXiv preprint.

[48] Todorov, Emanuel, Erez, Tom, Tassa, Yuval. (2012). Mujoco: A physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[49] Ahmed Touati, Yann Ollivier. (2021). Learning One Representation to Optimize All Rewards. Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual.

[50] Denis Yarats, Ilya Kostrikov, Rob Fergus. (2021). Image Augmentation Is All You Need: Regularizing Deep Reinforcement Learning from Pixels. 9th International Conference on Learning Representations, ICLR.

[51] Denis Yarats, Rob Fergus, Alessandro Lazaric, Lerrel Pinto. (2021). Reinforcement Learning with Prototypical Representations. Proceedings of the 38th International Conference on Machine Learning, ICML.

[52] Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, Yang Gao. (2021). Mastering Atari Games with Limited Data. Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual.

[53] Tao Yu, Zhizheng Zhang, Cuiling Lan, Zhibo Chen, Yan Lu. (2022). Mask-based Latent Reconstruction for Reinforcement Learning. ArXiv preprint.

[54] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. Proceedings of the 38th International Conference on Machine Learning, ICML.

[55] Jinhua Zhu, Yingce Xia, Lijun Wu, Jiajun Deng, Wen-gang Zhou, Tao Qin, Houqiang Li. (2020). Masked Contrastive Representation Learning for Reinforcement Learning. ArXiv preprint.

[56] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron C. Courville. (2018). FiLM: Visual Reasoning with a General Conditioning Layer. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence.

[57] Odena, Augustus, Dumoulin, Vincent, Olah, Chris. (2016). Deconvolution and Checkerboard Artifacts. Distill. doi:10.23915/distill.00003.

[58] Yilun Du, Chuang Gan, Phillip Isola. (2021). Curious Representation Learning for Embodied Intelligence. 2021 IEEE/CVF International Conference on Computer Vision (ICCV). doi:10.1109/ICCV48922.2021.01024.

[59] Lee, Hojoon, Lee, Koanho, Hwang, Dongyoon, Lee, Hyunho, Lee, Byungkun, Choo, Jaegul. (2023). On the Importance of Feature Decorrelation for Unsupervised Representation Learning in Reinforcement Learning. ArXiv preprint.

[60] Garrido, Quentin, Balestriero, Randall, Najman, Laurent, LeCun, Yann. (2022). RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. ArXiv preprint.

[61] Oquab, Maxime, Darcet, Timothée, Moutakanni, Théo, Vo, Huy, Szafraniec, Marc, Khalidov, Vasil, Fernandez, Pierre, Haziza, Daniel, Massa, Francisco, El-Nouby, Alaaeldin, Assran, Mahmoud, Ballas, Nicolas, Galuba, Wojciech, Howes, Russell, Huang, Po-Yao, Li, Shang-Wen, Misra, Ishan, Rabbat, Michael, Sharma, Vasu, Synnaeve, Gabriel, Xu, Hu, Jegou, Hervé, Mairal, Julien, Labatut, Patrick, Joulin, Armand, Bojanowski, Piotr. (2023). DINOv2: Learning Robust Visual Features without Supervision. ArXiv preprint.

[62] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. ArXiv preprint.

[63] Ni, Tianwei, Eysenbach, Benjamin, Seyedsalehi, Erfan, Ma, Michel, Gehring, Clement, Mahajan, Aditya, Bacon, Pierre-Luc. (2024). Bridging State and History Representations: Understanding Self-Predictive RL. ArXiv preprint.

[64] Tomar, Manan, Mishra, Utkarsh A., Zhang, Amy, Taylor, Matthew E. (2021). Learning Representations for Pixel-based Control: What Matters and Why? ArXiv preprint.

[65] Nicklas Hansen, Hao Su, Xiaolong Wang. (2022). Temporal Difference Learning for Model Predictive Control. International Conference on Machine Learning, ICML.

[66] Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak. (2020). Planning to Explore via Self-Supervised World Models. Proceedings of the 37th International Conference on Machine Learning, ICML.

[67] Nair, Suraj, Rajeswaran, Aravind, Kumar, Vikash, Finn, Chelsea, Gupta, Abhinav. (2022). R3M: A universal visual representation for robot manipulation. ArXiv preprint.

[68] Andrea Banino, Adrià Puigdomènech Badia, others. (2022). CoBERL: Contrastive BERT for Reinforcement Learning. The Tenth International Conference on Learning Representations, ICLR.

[69] Radosavovic, Ilija, Xiao, Tete, James, Stephen, Abbeel, Pieter, Malik, Jitendra, Darrell, Trevor. (2023). Real-world robot learning with masked visual pre-training. Conference on Robot Learning.

[70] Seo, Younggyo, Hafner, Danijar, Liu, Hao, Liu, Fangchen, James, Stephen, Lee, Kimin, Abbeel, Pieter. (2023). Masked world models for visual control. Conference on Robot Learning.

[71] Robine, Jan, Höftmann, Marc, Uelwer, Tobias, Harmeling, Stefan. (2023). Transformer-based World Models Are Happy With 100k Interactions. ArXiv preprint.

[72] Micheli, Vincent, Alonso, Eloi, Fleuret, François. (2022). Transformers are sample efficient world models. ArXiv preprint.

[73] Ye, Weirui, Zhang, Yunsheng, Abbeel, Pieter, Gao, Yang. (2023). Become a Proficient Player with Limited Data through Watching Pure Videos.

[74] Younggyo Seo, Kimin Lee, Stephen L. James, Pieter Abbeel. (2022). Reinforcement Learning with Action-Free Pre-Training from Videos. International Conference on Machine Learning, ICML.

[75] Ma, Yecheng Jason, Sodhani, Shagun, Jayaraman, Dinesh, Bastani, Osbert, Kumar, Vikash, Zhang, Amy. (2022). VIP: Towards universal visual reward and representation via value-implicit pre-training. ArXiv preprint.

[76] Ghosh, Dibya, Bhateja, Chethan Anand, Levine, Sergey. (2023). Reinforcement learning from passive data via latent intentions. International Conference on Machine Learning.

[77] Zhang, Wancong, GX-Chen, Anthony, Sobal, Vlad, LeCun, Yann, Carion, Nicolas. (2022). Light-weight probing of unsupervised representations for reinforcement learning. ArXiv preprint.
