
Learning Latent Action World Models In The Wild


Quentin Garrido 1, Tushar Nagarajan 1, Basile Terver 1,2, Nicolas Ballas 1, Yann LeCun 1,3, Michael Rabbat 1
1 FAIR at Meta, 2 Inria, 3 NYU

Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates the learning of latent action models, which can learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should satisfy as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions in in-the-wild videos, something that the commonly used vector quantization does not. We find, for example, that changes in the environment caused by agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we mainly learn latent actions that are localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and to solve planning tasks with our world model with performance similar to action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.

Correspondence: Quentin Garrido at garridoq@meta.com

Introduction

To build intelligent systems that can reason and plan in the real world, we must build systems that can predict the future, and in particular the consequences of their actions (Friston, 2010; Clark, 2013; Bubic et al., 2010; LeCun, 2022; Sutton, 1991; Ha and Schmidhuber, 2018; Hafner et al., 2019; Nguyen and Widrow, 1990). As soon as agents are present in the scene, predicting the future becomes a stochastic endeavor that can be parametrized by possible actions. Modeling these possible futures is thus necessary to learn good models of the world, ones that can, for example, be used to solve planning problems. A significant body of literature on world models is available to us assuming that we possess action labels (Ha and Schmidhuber, 2018; Hafner et al., 2019, 2023; Hu et al., 2023; Bar et al., 2024; Agarwal et al., 2025; Assran et al., 2025). This access to actions is a critical bottleneck: the vast majority of video data available online is unlabeled (Zellers et al., 2022; Miech et al., 2019) and includes diverse embodiments.

This gap motivates the idea of learning a latent action model (LAM) (Edwards et al., 2019; Rybkin et al.,

Figure 1 Action diversity. Classically used navigation or manipulation data contains the most general actions, such as camera or hand movements. In-the-wild videos extend this to a much broader distribution of actions, with objects entering the scene or people dancing.


2019; Menapace et al., 2022; Schmidt and Jiang, 2024; Ye et al., 2025; Yang et al., 2025; Chen et al., 2024; Cui et al., 2024) that can discover the action space from videos alone, without action annotations or a known embodiment. The standard approach is

to learn two components jointly. First, an inverse dynamics model (IDM) that, given observations of the past and future, predicts a latent action that explains the difference between the two. Second, a forward model that predicts the future from the past and the obtained latent action. After such a model is trained, the IDM can be used as part of a VLA pipeline (Bu et al., 2025; Ye et al., 2025) or to train a world model using the frozen IDM (Gao et al., 2025).

The type of unlabeled videos that are used is critical to the learned action space, and often an understudied component. Most LAM studies rely on narrow, task-aligned domains: video games (Bruce et al., 2024), tabletop manipulation (Nikulin et al., 2025), or curated real manipulation (Bu et al., 2025; Gao et al., 2025). These can yield action spaces specialized to a single embodiment with limited transfer or generalization. While some works have used more 'natural' videos such as Ego4D (Grauman et al., 2022), these usually amount to a minority of the training data, e.g. 5% for Bu et al. (2025) and Gao et al. (2025), far from leveraging the richness of in-the-wild videos.

To learn a truly general and transferable latent action world model, we argue that we must go beyond these targeted data sources. Sources of natural in-the-wild videos such as HowTo100M (Miech et al., 2019) or YoutubeTemporal-1B (Zellers et al., 2022) provide a much richer and more general learning environment than usually studied, as illustrated in Figure 1. However, this introduces a new set of research challenges that we address in this work to demonstrate the viability of LAMs on large-scale in-the-wild natural videos 1 .

First and foremost, the meaning of an "action" in in-the-wild video is not as clearly defined as it is in environments with known action spaces. Metaphorically speaking, the first dimension, or principal component, of actions could be movements, something shared across video sources. From there, we can have a split between ego- and exo-centric actions, which separates actions of the camera wearer from those of other agents in the environment. In in-the-wild videos, we have a stronger presence of external agents performing diverse actions, on top of what the camera wearer does. Going deeper into the action distribution, in-the-wild videos contain unique actions such as cars entering the frame, people dancing, fingers forming chords on a fretboard, etc. This leads to an inherent richness of actions that we aim to model. In-the-wild videos provide a superset of actions compared to video games or manipulation videos, which means that one

1 While our work does not focus on video generation, LAM trained on in-the-wild videos could be used to remove the necessity of text-video pairs (Sun et al., 2024).

should still be able to solve more classical navigation or manipulation tasks. While data sources used in previous works mainly contain the metaphorical first principal components of actions, trying to model more diverse actions carries a risk of capturing more environmental noise (Nikulin et al., 2025), such as leaves oscillating on trees. Finally, agents in in-the-wild videos do not have a consistent embodiment that the model can latch onto, which poses challenges for transfer and downstream applicability of the learned latent actions.

The focus of our work thus lies in the study of latent action world models trained on large-scale in-the-wild video datasets: studying the inherent challenges and potential pitfalls of latent actions in such a setting, as well as demonstrating their viability. Our contributions are as follows:

· We conduct a study on how to regulate the information content of latent actions, focusing on in-the-wild natural videos. We find that while sparse or noisy latent actions can effectively model complex actions, discrete ones struggle to adapt.

· We show that the absence of a common embodiment across in-the-wild videos is not an issue when learning latent actions. Latent actions will encode more spatially-localized transformations.

· We demonstrate the generality of the learned action space by transferring complex actions between videos. We find that we can effectively transfer motion between objects, or actions such as someone entering the frame.

· We demonstrate how our learned latent action space can be used as a universal action space. By training a small controller to map known actions to latent ones, our world model trained only on natural videos can be controlled to solve robotic manipulation and navigation tasks, achieving planning performance close to models trained on domain-specific, action-labeled data.

Related work

World Models. World models (Nguyen and Widrow, 1990; Sutton, 1991; Ha and Schmidhuber, 2018) have become a very active area of research. While a significant body of work has been applied to game data (Alonso et al., 2024; Hafner et al., 2019, 2023),

Figure 2 Latent action world model. A classical world model is endowed with actions represented as latent variables. These latent actions are obtained thanks to an inverse dynamics model trained jointly with the world model. To limit their information content (and propensity to cheat), they are regularized using techniques such as noise addition, sparsification, or quantization.


applications to more complex environments, such as simulated robotics environments (Seo et al., 2023; Zhou et al., 2024) or the real world (Hu et al., 2023; Agarwal et al., 2025; Assran et al., 2025), have flourished recently. With a plethora of possible embodiments and action spaces, works such as NWM (Bar et al., 2024), which focuses on locomotion, PEVA (Bai et al., 2025), which targets whole-body control, or UniSim (Yang et al., 2023), which can handle a variety of embodiments through textual control, have appeared. The promise of such models is not solely to generate visually appealing videos (Brooks et al., 2024; Teng et al., 2025; Agarwal et al., 2025) but mainly lies in their use to solve visual planning tasks. Being able to predict the consequences of actions can enable us to solve problems in navigation (Shah et al., 2021), robotic manipulation in simulation (Nasiriany et al., 2024; Liu et al., 2023; Yu et al., 2020) or in the real world (Khazatsky et al., 2024), or even whole-body control (Ma et al., 2024). Such models can even be used to solve more classical vision tasks such as segmentation and depth forecasting (Baldassarre et al., 2025; Karypidis et al., 2024; Luc et al., 2017). A common issue in obtaining models that generalize across embodiments is how to define a common action space. A solution can, for example, be to use the maximal dimensionality across considered embodiments, with an embodiment token (Hansen et al., 2023), but this is not easily scalable. This is where latent action models (Edwards et al., 2019; Rybkin et al., 2019; Schmidt and Jiang, 2024; Bruce et al., 2024) come into play, as one of their promises is to learn an abstract, general latent action space.

Latent Action Models. Latent action models aim at learning actions from unlabeled videos. Latent actions can be inferred using a latent policy (Edwards et al., 2019), or by using an explicit inverse dynamics model (IDM) that predicts the latent action from the past and future frames (Rybkin et al., 2019; Menapace et al., 2021, 2022; Schmidt and Jiang, 2024). This is then combined with a forward model that predicts the future frame from the past and the latent action. The use of an IDM introduces a causal information leak, and a key challenge is to ensure that the latent actions do not capture too much information, e.g. the entire next frame. A commonly used approach is to discretize the latent actions. This is the approach of choice in methods such as ILPO (Edwards et al., 2019), LAPO (Schmidt and Jiang, 2024), Genie (Bruce et al., 2024), LAPA (Ye et al., 2025), or UniVLA (Bu et al., 2025). This can, for example, be motivated by prior knowledge of the desired action space (Bruce et al., 2024). Other methods such as CLASP (Rybkin et al., 2019), CoMo (Yang et al., 2025), or AdaWorld (Gao et al., 2025) instead opt for a continuous space, which is inherently more flexible. In this case, a regularization term can be added to reduce the information content of the latent actions. Other works instead rely on carefully designed forward model architectures (Menapace et al., 2022; Sun et al., 2024) to structure the latent action space. Furthermore, while numerous methods use off-the-shelf vision encoders to encode frames, latent actions are still often learned by predicting the future frame in pixel space (Chen et al., 2025; Yang et al., 2025; Ye et al., 2025). This makes latent actions more susceptible to distractors (Nikulin et al., 2025), where the latent actions learn to encode background noise rather than the actions we desire. While a solution is to use supervision (Nikulin et al., 2025; Liang et al., 2025), working in an abstract latent space and carefully designing latent actions can help avoid some of these issues, as we study throughout our work. In general, while learning latent actions has clear applicability to world models, methods tend to be developed with VLAs in mind (Bu et al., 2025; Ye et al., 2025). Even if the approaches are architecturally similar to world models, where the forward model/action decoder can be seen as a world model, it is often discarded. Even when a world model is trained, a two-stage approach is commonly used, where the world model is trained after the inverse dynamics model (Yang et al., 2025). Concurrently to our work, Wang et al. (2025) propose to treat the forward model as a world model, by using a pretrained video generation model.

Problem setting

Considering a video V where the state of the world at each timestep t is s_t, we are interested in modeling the evolution of the world, i.e. finding a function f such that s_{t+1} = f(s_{0:t}). However, the presence of agents as well as general stochasticity makes the prediction non-deterministic, and thus this formulation is insufficient. We can model the uncertainty of the prediction with a latent variable z_t containing the relevant information, such that s_{t+1} = f(s_{0:t}, z_t). Another way to model uncertainty is to not consider s_{t+1} directly, but instead output a distribution over possible futures p(s_{t+1} | s_{0:t}), as is commonly done in text (Radford et al., 2018) or with quantized representations (Hu et al., 2023; Agarwal et al., 2025).

Nonetheless, formalizing future prediction as s_{t+1} = f(s_{0:t}, z_t) is appealing, as we can interpret part of z_t as actions happening in the scene. This is for example the case when learning a world model for robotics, where in simple environments no stochasticity exists beyond the actions a_t of the agent. We thus have s_{t+1} = f(s_{0:t}, a_t). If an environment is stochastic, we have both noise from the environment and actions, which prompts a more complex formalism where we want s_{t+1} = f(s_{0:t}, a_t, z_t). This is reminiscent of diffusion-based world models (Alonso et al., 2024; Bar et al., 2024), for example.

Latent action models (Edwards et al., 2019; Rybkin et al., 2019; Schmidt and Jiang, 2024) aim at modeling the actions happening in a scene without capturing exogenous noise that may come from the environment. To do so, most methods introduce a causality leak by looking at the future to infer z_t. This is commonly done with an inverse dynamics model (IDM) 2 that takes as input the past and future frames and outputs the latent action z_t = g_ϕ(s_t, s_{t+1}). From this, we can then train a world model (also called forward model) p_ψ to estimate s_{t+1} using the following loss function:

$$
\mathcal{L}(\psi, \phi) = \left\| p_\psi\!\left(s_{0:t},\, g_\phi(s_t, s_{t+1})\right) - s_{t+1} \right\|_2^2
$$
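As a minimal sketch of this joint setup, the following toy example uses linear, randomly initialized stand-ins for the IDM g_ϕ and the forward model p_ψ (all names and dimensions here are illustrative, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 32, 8  # toy sizes; the paper uses 128-dimensional latent actions

# Linear stand-ins for the IDM g_phi and forward model p_psi (learned jointly in practice).
W_idm = rng.normal(scale=0.1, size=(2 * state_dim, action_dim))
W_fwd = rng.normal(scale=0.1, size=(state_dim + action_dim, state_dim))

def idm(s_t, s_next):
    """g_phi: infer the latent action explaining the transition s_t -> s_{t+1}."""
    return np.concatenate([s_t, s_next]) @ W_idm

def forward_model(s_t, z_t):
    """p_psi: predict the next state from the current state and latent action."""
    return np.concatenate([s_t, z_t]) @ W_fwd

s_t, s_next = rng.normal(size=state_dim), rng.normal(size=state_dim)
z_t = idm(s_t, s_next)  # note: the IDM peeks at the future (causality leak)
pred_loss = np.mean((forward_model(s_t, z_t) - s_next) ** 2)
```

In a real training loop, the gradient of `pred_loss` would flow into both models, which is what makes limiting the information content of z_t necessary.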

This works well in clean environments (Hoque et al., 2025; Yu et al., 2020), since the stochasticity comes mainly from actions performed by the well-defined agent. However, on in-the-wild videos (Zellers et al., 2022; Miech et al., 2019), there is a significant risk of capturing exogenous noise, such as leaves oscillating on trees. Limiting the information content of latent actions thus becomes paramount, balancing between capturing complex actions and capturing noise, or even worse, encoding the whole next state in the latent action.

In general, this information regularization aims at finding the minimal latent actions that can explain the prediction of the future. Throughout this work we focus on three distinct mechanisms, each with pros and cons.

Sparsity. The first one, and perhaps most complex to implement, is sparsity-based constraints (Drozdov et al., 2024). Here, we would like the latent actions to have an L1 norm as low as possible. Due to trivial solutions that would reduce the L2 norm of the vectors, concentrate the norm along a few dimensions, or focus too much around the mode of the latent distribution, a few additional regularizations are added. The regularization is then

$$
\mathcal{R}_{\text{sparse}}(z) = \lambda_1 \|z\|_1 + \lambda_{l2}\, \mathcal{L}_{l2}(z) + \lambda_V\, \mathcal{L}_V(z) + \lambda_C\, \mathcal{L}_C(z) + \lambda_M\, \mathcal{L}_M(z),
$$

with

$$
\mathcal{L}_V(z) = \frac{1}{d} \sum_{j=1}^{d} \max\left(0,\, 1 - \operatorname{Std}(z_{:,j})\right), \qquad \mathcal{L}_C(z) = \frac{1}{d} \sum_{i \neq j} \operatorname{Cov}(z)_{i,j}^2,
$$

and

$$
\mathcal{L}_M(z) = \left\| \frac{1}{n} \sum_{i=1}^{n} z_i \right\|_2^2, \qquad \mathcal{L}_{l2}(z) = \left( \|z\|_2 - c \right)^2 .
$$

2 We can see z t as the result of an optimization process minimizing the prediction error over it. Implementing it this way is impractical, but we can see the IDM as performing amortized inference (Amos et al., 2023). This lends itself well to gradient based optimization at inference time.

Figure 3 Sample predictions using the IDM. We illustrate the highest-quality unrollings obtained with different regularizations, using the inverse dynamics model. While sparse or noisy latent actions are able to capture a man entering the scene, discrete ones are not able to properly capture such an action, even if some motion remains captured.


This Variance-Covariance-Mean (VCM) regularization, inspired by VICReg (Bardes et al., 2021), ensures an adequate spread of information and forces the sparsity constraints to be properly used by the model. In practice we set the coefficients to λ_{l2} = 1, λ_V = 0.1, λ_C = 0.001, λ_M = 0.1, and vary λ_1 to regulate information content.
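A plausible implementation of such a VCM regularizer is sketched below; the coefficients match the values above, but the exact forms of the individual terms are our assumptions, following standard VICReg-style definitions:

```python
import numpy as np

def vcm_regularizer(z, lam1=0.1, lam_l2=1.0, lam_v=0.1, lam_c=0.001, lam_m=0.1):
    """Sketch of a VICReg-inspired Variance-Covariance-Mean regularizer for a
    batch of latent actions z with shape (batch, dim). The individual term
    definitions are assumptions, not the paper's exact formulation."""
    n, d = z.shape
    l1 = np.abs(z).mean()                                   # encourage sparse actions
    l2 = ((np.linalg.norm(z, axis=1) - 1.0) ** 2).mean()    # forbid trivial norm shrinkage
    std = z.std(axis=0)
    v = np.maximum(0.0, 1.0 - std).mean()                   # keep per-dimension variance alive
    zc = z - z.mean(axis=0)
    cov = (zc.T @ zc) / (n - 1)
    c = (cov ** 2).sum() / d - (np.diag(cov) ** 2).sum() / d  # decorrelate dimensions
    m = (z.mean(axis=0) ** 2).mean()                        # keep the batch mean near zero
    return lam1 * l1 + lam_l2 * l2 + lam_v * v + lam_c * c + lam_m * m
```

Varying `lam1` then trades off sparsity (and thus information content) against prediction quality, as in the experiments below.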

Noise addition. Another approach to limit the information content of the learned latent actions is to add noise to them, while making sure their norm does not increase to the point of making the noise negligible. This can be implemented in a similar way as a VAE (Kingma and Welling, 2014; Gao et al., 2025). The prior-matching term here acts as our regularizer, where the target standard deviation adds noise while the target mean reduces the norm of the latent actions.

$$
\mathcal{R}_{\text{noise}}(z) = \beta\, D_{\mathrm{KL}}\!\left( \mathcal{N}(\mu_z, \sigma_z^2) \,\middle\|\, \mathcal{N}(0, I) \right)
$$
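This VAE-style prior-matching term can be sketched as follows; the closed-form diagonal-Gaussian KL and the reparameterized sampling are standard, while the function names are hypothetical:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), averaged over the batch.
    As the prior-matching regularizer, it pulls the mean toward zero (shrinking
    latent norms) and the variance toward one (adding noise)."""
    kl = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return kl.sum(axis=-1).mean()

def sample_noisy_action(mu, log_var, rng):
    """Reparameterized sample of a noisy latent action."""
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
```

With a high KL weight β, the sampled latents approach pure standard-normal noise, which matches the "equivalent to no conditioning" regime discussed later.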

Discretization. A final approach is to discretize the latent actions. For this, the most common approach is vector quantization (Van Den Oord et al., 2017) or a variant of it. This serves as a baseline comparison to illustrate a commonly used regularization in previous works (Ye et al., 2025; Bu et al., 2025). In practice, we use the same quantization scheme as UniVLA (Bu et al., 2025), using classical vector quantization (Van Den Oord et al., 2017) as well as codebook reset for unused codes.
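A minimal sketch of vector quantization with a codebook reset for unused codes might look as follows; details of UniVLA's actual scheme (reset frequency, commitment losses, straight-through gradients) are omitted, and all names here are illustrative:

```python
import numpy as np

class VectorQuantizer:
    """Minimal vector quantization with dead-code reset (illustrative sketch)."""

    def __init__(self, num_codes=64, dim=8, seed=0):
        self.rng = np.random.default_rng(seed)
        self.codebook = self.rng.normal(size=(num_codes, dim))
        self.usage = np.zeros(num_codes, dtype=int)

    def quantize(self, z):
        # Nearest-neighbour assignment of each latent to a codebook entry.
        dists = ((z[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        np.add.at(self.usage, idx, 1)
        return self.codebook[idx], idx

    def reset_dead_codes(self, z):
        # Re-initialize codes unused since the last reset to random encoder outputs.
        dead = np.flatnonzero(self.usage == 0)
        if dead.size:
            self.codebook[dead] = z[self.rng.integers(0, len(z), size=dead.size)]
        self.usage[:] = 0
```

The reset step prevents codebook collapse, where most latents map to a handful of codes and effective capacity shrinks further.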

All of this can be performed in the latent space of a trained encoder, where s_t and s_{t+1} are now the representations obtained from video frames, which leads us to the complete architecture illustrated in Figure 2.

Experimental details

We now turn to a more practical implementation. A video V of length T is encoded through a frame-causal encoder f_θ (V-JEPA 2-L (Assran et al., 2025) in our experiments), producing representations s_{0:T-1}. This encoder is kept frozen during training. We then train the world model p_ψ(s_{0:t}, z_t) and inverse dynamics model g_ϕ jointly to predict s_{t+1} using the aforementioned prediction loss and latent action regularization.

To increase efficiency, we train the model using teacher forcing (Williams and Zipser, 1989; Vaswani et al., 2017). By default, p_ψ is implemented as a ViT-L (Dosovitskiy et al., 2021) using RoPE (Su et al., 2021; Assran et al., 2025) for positional embeddings. To condition p_ψ on z, we use AdaLN-zero (Peebles and Xie, 2023), which we adapt to condition the sequence frame-wise. Our latent actions z_t are 128-dimensional continuous vectors by default. Unless specified otherwise, all models are trained on YoutubeTemporal-1B (Zellers et al., 2022) with 16-frame clips at 4 fps, for 30,000 iterations at a batch size of 1024.

Figure 4 IDM performance. We report the one-step prediction error on in-the-wild videos. Adjusting the capacity of sparsity- and noise-based latent actions allows for varying performance, while quantized ones struggle to adapt to the complexity.

We use the Muon optimizer (Jordan et al., 2024) with a learning rate of 0.02 and AdamW (Loshchilov and Hutter, 2019) with a learning rate of 6.25 × 10⁻⁴, following a linear warmup over 10% of training followed by cosine annealing. We use a weight decay of 0.04.
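The learning-rate schedule described above (linear warmup over the first 10% of training, then cosine annealing) can be sketched as follows, using the AdamW peak rate; annealing to exactly zero is an assumption:

```python
import math

def lr_at(step, total_steps=30000, peak_lr=6.25e-4, warmup_frac=0.10):
    """Linear warmup over the first `warmup_frac` of training, then cosine
    annealing to zero (the zero floor is an assumption, not stated in the text)."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        # Ramp linearly from ~0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup
    # Cosine decay from the peak down to zero over the remaining steps.
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```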

For visualization purposes, we also train a frame-causal video decoder using a ViT-L trained with a combination of L1 and perceptual losses (Johnson et al., 2016; Zhang et al., 2018). While generation is not core to our work, this is a useful tool to compute perceptual metrics and inspect the model's predictions. Confer Supplementary Section A for detailed protocols.

Performance of information regularizations

As mentioned previously, we want to capture rich and complex actions that span a wide range of embodiments, as observed in the in-the-wild videos we consider. The first question we thus want to answer is: how do different information regularization techniques adapt to this complexity?

While we measure performance in various manners throughout the remainder of the manuscript, focusing on different aspects and properties, we first examine the prediction quality in an ideal setting. Here we measure the prediction error of models when unrolling a trajectory, using the inverse dynamics model (and thus the future frame) to infer the actions. This will be an upper bound on performance across all other experiments.

We will say that a regularization is "better" if it leads to a variety of achievable performance and does not saturate easily. Being able to explore a multitude of behaviors also enables us to measure the impact of latent capacity on downstream performance. As we show in a later section, achieving the lowest prediction error using the inverse dynamics model is not always desirable, as downstream tasks require a balance between complexity and identifiability of latent actions. As we can see in Figure 4, sparse and noisy latent actions are able to achieve a range of performance between unconstrained latent actions (using the whole continuous space) and a deterministic world model. Even at maximal sparsity, we still have d = 128 latent actions under sparsity constraints, whereas when the weight β of D_KL becomes high, noisy latent actions effectively become pure noise, equivalent to no conditioning. However, the vector quantization based approach struggles to scale its capacity and remains very close to the deterministic baseline.

In the rest of this work, we will refer to this "in-the-wild prediction error" as the capacity of the latent actions. Since everything else in the training is identical, the drop in prediction error is attributed to the capacity of the latent actions. A lower prediction error indicates higher-capacity latent actions, while a higher one indicates lower-capacity latent actions.

On a more qualitative note, in Figure 3 we look at a specific, relatively complex action that exists in natural videos: someone entering and moving in a scene. We find that sparse and noisy latent actions are able to capture this action accurately, while the quantization approach shows more of a blob entering the scene. Interestingly, the exact shirt color is not captured in the latent action, highlighting that it captures more abstract information than the exact pixels changing. Confer Supplementary Section F for additional visualizations.

Takeaway

A vector quantization based approach struggles to capture complex actions. Noisy or sparse latent actions are able to capture more complex actions when given the capacity.

What kind of actions do we learn?

While we showed an ideal setting where latent actions are inferred by the IDM, the model could simply cheat and encode the next frame in the latent action. Or we could learn latent actions that cannot be applied to another video, contrary to our goal of them being minimal explanations. We thus study these two problems with simple and intuitive metrics. See Figure 5 for an illustration of the protocols.

Figure 5 Raw latent evaluation. By artificially stitching videos, we can create abrupt scene changes. Measuring how the prediction error increases when such changes happen, compared to the original video, tells us how well the model can capture the whole next frame (a). To measure the transferability of latent actions, we measure whether their inference is cycle-consistent. We infer latent actions on video A, then apply them to another random video. From this prediction, we re-infer the latent actions and apply them on video A. If the latent actions transfer well, we should obtain a small error with video A (b). The combination of both metrics ensures that shortcuts are not the source of the transfer.


Future leakage. To measure how much information about the future state is leaked into the latent actions, we can artificially generate scene changes by swapping the ends of videos and measure how much the prediction error increases. If the model perfectly encodes the next frame in the latent, we should not see a prediction error spike, and thus this lack of spike is a necessary (but not sufficient 3 ) condition for a cheating model. Other metrics, such as the alignment between the latent actions from s_{t-1} to s_t and from s_{t+1} to s_t, have been proposed to measure the degree of leakage (Yang et al., 2025), but the exact value remains hard to interpret as long as we don't have perfect alignment, and thus copying of the frame.
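The stitched-video probe can be sketched as follows, with `err_fn` a hypothetical callable returning the model's one-step prediction error at a given frame of a clip:

```python
import numpy as np

def leakage_score(err_fn, video_a, video_b, t_cut):
    """Future-leakage probe (sketch): stitch the end of video_b onto video_a
    and compare the one-step prediction error at the artificial cut against
    the error on the unedited video."""
    stitched = np.concatenate([video_a[:t_cut], video_b[t_cut:]], axis=0)
    err_clean = err_fn(video_a, t_cut)
    err_cut = err_fn(stitched, t_cut)
    # A large ratio means the latent action did NOT simply copy the next frame.
    return err_cut / max(err_clean, 1e-8)
```

A model that cheats by encoding the next frame in the latent would keep this ratio near one even across the cut.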

As we can see in Table 1, no matter the capacity of the latent actions, we find that the prediction error more than doubles compared to its baseline level. This suggests that no studied model is capable of cheating by encoding the next frame. We hypothesize that the complexity of the used dataset makes it harder for the model to learn this solution.

Visual inspection in Figure 6 reveals that while some information about the next frame is captured in the latent actions, it is minor. However, as we study in transferability evaluations, this is not an issue in practice, and merely a consequence of having to encode objects appearing in and out of frames.

Do latent actions transfer well? The next experiment to see if we have learned meaningful latent actions is

3 In this scenario, the only solution is to encode the next frame. This does not mean that in regular conditions the models would always fall back to this behavior.

Figure 6 Future leakage. In the presence of a scene cut, the only solution is for the latent action to encode the next frame. As the capacity of the latent actions increases, more of the scene can be reconstructed, albeit with an extremely poor quality.


if we can apply latent actions inferred on video A to video B. Quantitatively, we evaluate the models on the cycle consistency of latent actions. From random videos A and B, we infer latent actions on video A, then apply them on video B. If the latent actions transfer well, we should be able to infer them again. We thus infer them again on video B and apply them on video A. By measuring the increase in prediction error on video A with the original and cyclically inferred latent

Figure 7 Transfer and cycle consistency of latent actions. We infer latent actions from a source video, here of a man moving to the left. We then apply these actions to a flying ball, which stops its motion and also starts moving left, demonstrating transferability of latent actions. We then re-infer the latent actions and apply them to the original video. We can see the man moving to the left again, indicating that the motion was re-inferred correctly. Human videos recorded by the authors, flying ball video from (Riochet et al., 2022).


Table 1 Prediction error increase under scene changes. On Kinetics (Kay et al., 2017), all models exhibit a significantly higher error when a scene change occurs. This shows that the latent actions cannot simply copy the next frame. We report LPIPS values for ease of interpretation.

actions, we can see how well latent actions transfer. While this transfer is not well defined on random natural videos, leading to absolute gaps that are hard to interpret, it still allows us to rank models and get an intuition about this transfer. We can see in Table 2 that on both Kinetics (Kay et al., 2017) (human activity videos) and RECON (Shah et al., 2021) (navigation), we only obtain a minor increase in prediction error over this latent inference cycle. While latent actions with higher capacity lead to a worse transfer, their performance remains higher after transfer than that of their more constrained counterparts. As shown by the previous lack of leakage of the future frame, this transfer does not stem from copying the next frame, which would be a way to obtain perfect performance.
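The cycle-consistency protocol can be sketched as follows, with `infer_actions`, `rollout`, and `err` hypothetical stand-ins for the trained IDM, the world model unrolling, and the prediction-error metric:

```python
import numpy as np

def cycle_consistency_error(infer_actions, rollout, video_a, video_b, err):
    """Cycle-consistency probe (sketch). Infer latent actions on video A,
    apply them to video B, re-infer from that rollout, re-apply to video A,
    and report the increase in prediction error on video A."""
    z_a = infer_actions(video_a)              # IDM on the source video
    rolled_b = rollout(video_b[0], z_a)       # apply the actions to another video
    z_cycle = infer_actions(rolled_b)         # re-infer from the transferred rollout
    rolled_a = rollout(video_a[0], z_cycle)   # apply them back on the source
    direct = err(rollout(video_a[0], z_a), video_a)
    cycled = err(rolled_a, video_a)
    return cycled - direct                    # small gap => transferable actions
```

In a toy world where the "action" is exactly the frame delta, the cycle closes perfectly and the gap is zero; real latent actions only approximate this.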

Table 2 Action cycle consistency. Actions are inferred on Video 1, then applied on Video 2. Actions are again inferred and applied again on Video 1. The small increase in prediction error indicates that actions can reliably be transferred and re-inferred. We report LPIPS values over 2s prediction for ease of interpretability.

The results are qualitatively investigated in Figure 7, where we can see the movement of a man transferred to a flying ball (demonstrating transfer) and then re-inferred and applied to the original video successfully. Confer Supplementary Section G for additional visualisations. However, such good performance even on data where we do not expect actions to transfer well, such as random natural videos, makes us wonder what type of actions we are learning. For this, we turn to a qualitative analysis in the next paragraph.
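The transfer-and-cycle protocol above can be sketched in a few lines. Here `idm` and `world_model` are hypothetical stand-ins operating directly on per-frame representations, and the error is measured in representation space rather than with LPIPS on decoded frames:

```python
import numpy as np

def cycle_consistency_error(idm, world_model, video_a, video_b):
    """Toy sketch of the action cycle-consistency protocol."""
    # 1. Infer latent actions from the source video with the IDM.
    z_a = [idm(video_a[t], video_a[t + 1]) for t in range(len(video_a) - 1)]
    # 2. Apply them to a different video (transfer).
    transferred = [video_b[0]]
    for z in z_a:
        transferred.append(world_model(transferred[-1], z))
    # 3. Re-infer latent actions from the transferred rollout.
    z_b = [idm(transferred[t], transferred[t + 1])
           for t in range(len(transferred) - 1)]
    # 4. Re-apply them to the source video and measure the prediction error.
    rollout = [video_a[0]]
    for z in z_b:
        rollout.append(world_model(rollout[-1], z))
    return float(np.mean([np.abs(p - t).mean()
                          for p, t in zip(rollout[1:], video_a[1:])]))
```

With an idealized linear dynamics (latent action equal to the representation difference), the cycle error is exactly zero; the gap reported in Table 2 measures how far real models deviate from this ideal.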

Which embodiment do the latent actions learn? Looking at Figure 8, we can see that motion is localized, i.e., the transferred action encodes both where movement occurs and what this movement is. Due to the lack of a common embodiment in natural videos, the model learns generic actions that are applied relative to the

Figure 8 Action locality (panels: animating the right person; animating the left person). We apply a localized locomotion action to a video with two individuals in it. We find that only the person closest to the walking man in the first video starts moving, indicating that the action has localized properties. We are making the individual at a given position move to the left. Videos recorded by the authors.

camera, the only thing common across videos.

This camera-relative embodiment can be a strength as we previously saw in Figure 7. This general abstraction allows us to transfer motion between entirely different objects, which would not be possible if motion only targeted semantically similar objects.

Takeaway

The absence of a clear embodiment in natural videos leads to latent actions capturing more spatially-localized, camera-relative, transformations.

Leveraging latent action world models for planning

One application of a latent action space is to use it as a generic interface for various embodiments. If we are able to learn a mapping from "real" actions to latent ones, we can control the world model in an interpretable way. This also allows us to solve planning tasks, as we study in this section.

Controller training. The first part is to train a module to go from real actions (and, optionally, representations) to latent actions. When using actions alone we use a simple MLP, and when using actions and past representations we use a cross-attention based adapter. Confer Supplementary Section A for detailed architectures and protocols. We then simply train this controller module to predict the latent action with an L2 loss. We illustrate this process in Figure 9. Because the learned latent actions are camera-relative, using actions alone can be insufficient, as the target latent actions vary not only with the action but also with the camera position. In practice, we find that without past representations the controller converges to a latent action that leads to no movement. Confer Supplementary

Figure 9 Controller training. We train a lightweight module to map known actions to latent actions. Representations of the past are used to help the prediction of the right latent actions.


Section H for visualizations.

Rollout quality. We train controllers on DROID (Khazatsky et al., 2024), a robotic manipulation dataset, as well as RECON (Shah et al., 2021), a navigation dataset. DROID allows us to evaluate the model on data where the camera is fixed but an agent moves inside the scene, while RECON has still scenes where the camera wearer is the one moving. As we can see qualitatively in Figure 10 and quantitatively in the left column of Figure 11, models achieve high-quality predictions when using the controller. The predictions obtained with the controller are very similar to those obtained with the IDM, with slightly more conservative actions.

We however find a lack of correlation between the prediction error on in-the-wild videos, i.e. the capacity of the latent actions, and the quality of the rollouts when using the controller. For both sparse and noisy latent actions, we find that using the most or least constrained setting is suboptimal, and that a more balanced regularization leads to the best predictions. This can intuitively be explained by over-constrained latent actions not containing enough information, and under-constrained ones containing too much information about the future. This is consistent with the trends observed previously, where more constrained latent actions transfer better, but freer ones can capture more fine-grained motion. Due to the simplicity of the action space here, we see that even discrete latent actions work well, supporting this choice in

Figure 10 Unrollings using the controller and IDM. On both DROID and RECON, the controller is able to approximate the latent action produced by the inverse dynamics model. Movements are applied correctly over the unrolling, however physical appearance degrades over time. To produce the unrollings, frames are duplicated to map one action to one latent, something not seen during training.


prior work (Bu et al., 2025; Schmidt and Jiang, 2024). Confer Supplementary Section C for detailed results.

Planning performance. We can now use our trained controllers and measure performance on goal-based planning tasks using existing protocols. Given an initial observation s t and goal observation s g , we seek an action sequence that minimizes the distance between the predicted and goal states.

For our DROID controller, we adopt the protocol of Terver et al. (2025) and use a set of videos recorded in the real world on a Franka Emika Panda. We consider trajectories where the goal is to move the arm to a specific goal position. We plan at a horizon of H = 3 steps using the Cross-Entropy Method (CEM) (Rubinstein, 1997) and compare to V-JEPA 2-AC, which is trained similarly to our model but using known actions, as well as to the best V-JEPA 2-based model from Terver et al. (2025) to upper-bound performance. To measure performance, we use the distance to the goal ( ∆ xyz ), which can be easily computed thanks to the compositionality of translations. Confer Supplementary Section A for the detailed protocol. While performance remains lower than that of specifically designed models, our models achieve performance similar to V-JEPA 2-AC, demonstrating that our learned latent actions can effectively be used as an interface for planning tasks. Here, the higher-capacity latent actions, even though they may produce worse rollouts, can lead to the best planning performance. Notably, noisy latent actions obtain the best planning performance despite having, relatively speaking, the worst unrollings. We explore the impact of adding domain-specific data in our pipeline in Supplementary Section D.

Nonetheless, we find that the quality of the unrolling is not perfectly correlated with planning performance. This is a common challenge in the world model literature (Zhang et al., 2025). Overall, we find that our models trained only on in-the-wild videos learn latent action spaces that can effectively be reused to solve simple planning problems, with noisy latent actions being the best.

On a navigation task, using our controller trained on RECON, we follow the protocol of NWM (Bar et al., 2024) and evaluate performance using CEM for planning. We rely on the Relative Pose Error (RPE) (Sturm et al., 2012) between planned and groundtruth trajectories as our main metric. We find similar conclusions here, with models achieving performance that, while not on par with NWM, beats policy-based baselines such as NoMaD (Sridhar et al., 2024). Egocentric navigation has the added difficulty of new content entering the frame at every prediction step, making it harder to produce clean unrollings and lowering performance. For more detailed planning results, confer Supplementary Section C.

Takeaway

Latent actions learned solely on natural videos can be leveraged to solve planning tasks with performance similar to models that have access to domain-specific data with labeled actions.

Figure 11 Controller and planning performance. On both DROID and RECON, we are able to successfully train a model to map real to latent actions (left). Using these actions with classical planning protocols, we are able to achieve performance similar to world model or policy baselines that are trained with actions from the start (right). Overall, the best performing models are the ones where the latent actions form a middle ground in terms of capacity.


Figure 12 Scaling trends. We investigate, for two sets of latent regularizations, the performance behaviour when scaling the model size (left), total training time (middle), and training data quantity (right). We find that for all axes of scaling, we are able to obtain an improved IDM on natural videos (top row). When measuring performance on planning tasks we obtain similar trends, with the clearest improvements obtained by training longer (bottom row). For data scaling, we note that our usual recipe sees every video twice on average, but we only see a total of 1% of the total number of frames. Only at this latter point do we start to see degraded performance due to a too-small training set. Stars indicate our default setup in the rest of the paper.


Scaling models and data.

In this section we investigate how the performance of the models scales as we increase data, model size, and training time. For this study we focus on sparse (with λ l1 = 0.01) and noisy latent actions (with β = 5 × 10^-5). Looking at both allows us to study scaling trends in diverse settings. We can see in Figure 12 that overall, as model size, training time, or training data increases, we obtain better predictions when using the IDM on natural videos. However, looking at the planning performance on DROID shows a more nuanced story, where training time significantly improves performance, model size mainly has an effect for the noisy latent actions, and training data does not show a significant trend. This nuanced story about model size is consistent with previous work (Ye et al., 2025), which also finds minor increases in performance when performing scaling analyses. These results suggest that while scaling can improve the quality of a latent action world model by improving the quality of the latent actions and/or forward model, this may not always be visible in downstream tasks that mainly evaluate simple actions, as are often used in the literature.

Limitations and future work

Variable latent information content. In our work, the information constraint placed on the latent actions is based on a static coefficient. However, different videos contain actions of varying complexity, and some are even deterministic. It would thus be interesting to adjust the constraint based on the complexity of the video. While this may come at a cost in the complexity of the latent action space, it would enable better-calibrated latent actions.

Sampling and planning in latent action space. While we studied the transfer of latent actions inferred on natural videos, as well as their use as a control interface, one may wonder whether the latent actions can be exploited directly. Using the latent actions as-is would allow us to measure their quality more accurately. This can be done by sampling latent actions and analyzing the predictions, or by performing planning in the latent action space (Rybkin et al., 2019). We provide some initial analysis on these aspects in Supplementary Section B, noting that most of the work remains ahead for high-dimensional structured latent actions.

Shaping representations with single-stage training. Currently, the world model is trained on top of frozen representations. This representation space was not designed with prediction in mind, which can hinder the inverse dynamics training, as well as the quality of the predictions in general. As we use data similar to the pretraining distribution of V-JEPA 2 in our work, the use of latent actions in a V-JEPA 2 pretraining could unlock single-stage encoder/world-model training. This is an exciting direction for future work.

Conclusion

This work demonstrates the feasibility of learning effective latent action world models (LAMs) directly from large-scale, in-the-wild natural video datasets. We successfully address the significant challenges posed by this data, including high action complexity, environmental noise, and the lack of a common embodiment. Our study of information regularizations highlights the benefit of continuous latent actions, which adapt more effectively to the complexity of actions present in natural videos. Vector quantization, although very common in practice, struggles to adapt at this scale. By studying the leakage of future frames into the latent actions, we found that this problem is not present in practical settings, which we hypothesize is due to a combination of conditioning choice and data complexity. We further found that while higher-capacity latent actions hurt transferability, latent actions could still be inferred and reapplied consistently. This led to the finding that on natural videos, learned latent actions are spatially localized relative to the camera, due to the lack of a common embodiment across videos. Qualitatively, the learned latent actions can capture complex actions, such as a person entering a scene, and can even transfer motion between different objects, such as from a human to a ball. Most critically, we demonstrated the practical utility of this approach. By training a simple controller to map states and known actions to the learned latent actions, our world model, trained exclusively on in-the-wild natural videos, can be controlled to solve robotic manipulation tasks. It achieves planning performance comparable to baselines trained on in-domain, action-labeled data. Overall, our analyses and experiments demonstrate the viability and potential of training latent action models on uncurated natural videos, offering a step towards more general world models.

Acknowledgments

We would like to thank Adrien Bardes for accepting to act in videos used for qualitative results, as well as for fruitful discussions. We also thank Amir Bar for discussions and advice on planning experiments.


Appendix

Training and evaluation protocols

Decoder training. Our decoder uses a ViT-L (Dosovitskiy et al., 2021) architecture with RoPE (Su et al., 2021; Assran et al., 2025) positional embeddings. It reuses the architecture of the V-JEPA 2 encoder (Assran et al., 2025), with an added linear layer to map from patches to pixels. The decoder processes the full video sequence with a frame-causal attention mask so that it only attends to past frames.

It is trained using a combination of L1 and perceptual losses (Johnson et al., 2016; Zhang et al., 2018). The decoder's weights are optimized using the Muon optimizer with a learning rate of 0.02, alongside AdamW with a learning rate of 3 × 10^-4, and a weight decay of 0.01. We train the model with a batch size of 512 for 90,000 iterations, using a linear learning rate warmup for 12,000 iterations, followed by cosine annealing.
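The decoder objective can be sketched as follows. Here `feats` stands in for a hypothetical frozen feature extractor used for the perceptual term, and the weighting `w_perc` is an assumption rather than the paper's value:

```python
import numpy as np

def decoder_loss(pred, target, feats, w_perc=1.0):
    """Sketch of the decoder objective: pixel-space L1 plus a perceptual
    term, computed as the squared distance between features produced by a
    (hypothetical) frozen feature extractor `feats`."""
    l1 = np.abs(pred - target).mean()
    perc = ((feats(pred) - feats(target)) ** 2).mean()
    return l1 + w_perc * perc
```

In practice the perceptual term would use a pretrained network as in Johnson et al. (2016); the identity function works as a degenerate stand-in for testing.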

Latent action training. By default, our world model p ψ uses a ViT-L (Dosovitskiy et al., 2021) architecture equipped with RoPE (Su et al., 2021; Assran et al., 2025) positional embeddings. We condition p ψ on latent actions z through an adapted AdaLN-zero (Peebles and Xie, 2023) mechanism that performs frame-wise conditioning, instead of the original sequence-wise conditioning. Each latent action z t is represented as a 128-dimensional continuous vector. We train the world model for next-frame prediction using teacher forcing (Williams and Zipser, 1989; Vaswani et al., 2017) for computational efficiency.
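A minimal sketch of the frame-wise conditioning, assuming a DiT-style shift/scale modulation regressed from each frame's latent action (the gating term of full AdaLN-zero is omitted for brevity); the zero-initialized projection makes conditioning start as a no-op:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def framewise_adaln_zero(tokens, latent_actions, w, b):
    """Sketch of frame-wise AdaLN-zero conditioning.

    tokens:         (T, N, D)  N patch tokens per frame
    latent_actions: (T, Z)     one latent action per frame
    w, b:           (Z, 2*D), (2*D,) projection to (shift, scale);
                    zero-initialized so conditioning starts as identity.
    """
    mod = latent_actions @ w + b                 # (T, 2*D)
    shift, scale = np.split(mod, 2, axis=-1)     # each (T, D)
    # Broadcast each frame's modulation over that frame's tokens,
    # instead of sharing one modulation across the whole sequence.
    return layer_norm(tokens) * (1.0 + scale[:, None, :]) + shift[:, None, :]
```

The difference from the original sequence-wise AdaLN is only the leading T axis on the modulation: each frame is modulated by its own latent action.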

We train on YoutubeTemporal-1B (Zellers et al., 2022) with batches of size 1024 for 30,000 iterations. For optimization, we rely on the Muon optimizer (Jordan et al., 2024) with a learning rate of 0.02, alongside AdamW (Loshchilov and Hutter, 2019) at a learning rate of 6.25 × 10^-4. The learning rate schedule begins with a linear warmup for the first 10% of training iterations, followed by cosine annealing. Weight decay is set to 0.04. Training takes approximately 12 hours on 64 H100 GPUs.

The training loss can be defined as

$$
\mathcal{L} = \sum_{t} \left\| p_\psi\left(s_{0:t},\, z_t\right) - s_{t+1} \right\|_2^2 + \mathcal{L}_z(z_t), \qquad z_t = g_\phi(s_t, s_{t+1}),
$$

with p ψ the world model, s 0:t the sequence of past representations (encoded frames), z t the latent action inferred by the inverse dynamics model g ϕ from consecutive representations s t and s t+1 , and L z the regularization applied to the latent action.

To determine the coefficients used for the latent action regularization terms, we perform a sweep by increasing and decreasing the coefficients regulating information content until the latent actions have the same effect as noise, until an increase in capacity does not yield a reduction in prediction error, or, for vector quantization, until the codebook is no longer fully utilized. This leads to the coefficients used throughout the paper.

Controller training. Our controllers consist of 2 self-attention blocks used to process the representation of the previous frame (we only look at the immediately preceding frame s t-1 , not the whole past s 0:t-1 ), followed by a cross-attention block between the embedded real actions and the processed representations. Actions are embedded with a 3-layer MLP to a target embedding dimension chosen to match the encoder (1024 by default). The single output token per timestep is then projected to the latent action dimension of 128 with a linear layer.

Since our latent action world models are trained with one latent action for two frames due to the video tokenization, we duplicate frames in the dataset to obtain a clear one-to-one mapping between real and latent actions.
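This duplication step is straightforward; a sketch, assuming frames are stored as a simple list:

```python
def duplicate_frames(frames):
    """Duplicate each frame so that one real action maps to exactly one
    latent action, matching the two-frames-per-latent video tokenization."""
    return [f for frame in frames for f in (frame, frame)]
```

For example, an 8-frame clip becomes 16 frames, so the tokenizer's one-latent-per-two-frames scheme yields one latent action per original frame transition.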

The controller is then trained for 3000 iterations using the AdamW optimizer (Loshchilov and Hutter, 2019), with a learning rate of 1 × 10^-3, a weight decay of 0.04, β1 = 0.9 and β2 = 0.999. The learning rate follows a

linear warmup for 300 iterations and then a cosine decay for the rest of the training. We use a batch size of 256 with 8-frame videos at 4 fps (which gives us 16 frames after duplication).

Planning protocol for DROID. Our model is used for planning using the protocol of Terver et al. (2025), which is as follows. Let s t = f θ ( V t ) denote the latent visual state obtained by encoding the frame V t through the encoder f θ . Given an initial observation s t and a goal observation s g , we seek an action sequence a t:t+H-1 := a t , . . . , a t+H-1 that leads from s t towards s g over a planning horizon H . In practice, we use H = 3.

We define the planning cost of an action sequence as

$$
C(a_{t:t+H-1}) = \left\| \hat{s}_{t+H} - s_g \right\|_1,
$$

where s g = f θ ( V g ) is the encoded goal state, and the predicted latent visual states ŝ are obtained by recursively unrolling the predictor:

$$
\hat{s}_{k+1} = p_\psi\left(\hat{s}_{0:k},\, c(a_k, \hat{s}_k)\right), \qquad \hat{s}_t = s_t, \quad k = t, \dots, t+H-1,
$$

with c denoting the controller that maps actions and latent visual states to latent actions.

We use the Cross-Entropy Method (CEM) (Rubinstein, 1997) to solve this optimization problem. CEM maintains a Gaussian distribution over action sequences, initialized with zero mean and unit variance. At each iteration, we sample N = 300 candidate action sequences from the current distribution, evaluate their costs using the world model, and refit the distribution to the top K = 10 elite samples. We perform I = 15 iterations of this procedure and select the first action of the best sequence for execution.
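The CEM loop above (N = 300 samples, K = 10 elites, I = 15 iterations) can be written compactly. In the paper the cost function unrolls the world model through the controller; here `cost_fn` is left abstract:

```python
import numpy as np

def cem_plan(cost_fn, horizon, action_dim, n_samples=300, n_elites=10,
             n_iters=15, seed=0):
    """Cross-Entropy Method over action sequences.

    `cost_fn` maps a batch of candidate sequences, shaped
    (n_samples, horizon, action_dim), to a vector of costs (n_samples,).
    """
    rng = np.random.default_rng(seed)
    # Gaussian over action sequences, initialized at zero mean, unit variance.
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current distribution.
        samples = mean + std * rng.standard_normal(
            (n_samples, horizon, action_dim))
        costs = cost_fn(samples)
        # Refit the distribution to the lowest-cost (elite) samples.
        elites = samples[np.argsort(costs)[:n_elites]]
        mean, std = elites.mean(0), elites.std(0) + 1e-6
    return mean  # best sequence estimate; its first action is executed
```

On a toy cost with a known minimizer, the loop converges to the target sequence within a few iterations.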

To evaluate planning performance, we run 64 independent episodes. For each episode, we randomly select one video from 16 validation videos and randomly sample a clip of H + 1 = 4 frames at 4 fps (matching training conditions). We then define our error as the distance to the goal: the L1 distance between the cumulative planned actions and the cumulative groundtruth actions from the dataset:

$$
\Delta_{xyz} = \left\| \sum_{i=0}^{H-1} a^{\text{plan}}_i - \sum_{i=0}^{H-1} a^{\text{gt}}_i \right\|_1,
$$

where a plan i denotes the planned action at timestep i and a gt i the corresponding groundtruth action leading from s t to s g . This metric measures the difference in total displacement between the planned and groundtruth trajectories, which is well suited for actions that are additive in time, since multiple (infinitely many) paths can lead to the target. We report the error averaged across all 64 episodes.
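Because translations compose additively, the metric reduces to comparing summed displacements; a sketch:

```python
import numpy as np

def delta_xyz(planned, groundtruth):
    """Distance-to-goal metric: L1 distance between the cumulative planned
    and cumulative groundtruth displacements. Since translations compose
    additively, different paths with the same total displacement score 0."""
    planned = np.asarray(planned, dtype=float)        # (H, 3) actions
    groundtruth = np.asarray(groundtruth, dtype=float)
    return float(np.abs(planned.sum(0) - groundtruth.sum(0)).sum())
```

Note that two trajectories visiting the intermediate waypoints in a different order but reaching the same endpoint are scored as equivalent, which is exactly the invariance motivated above.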

Planning protocol for RECON. We use a protocol similar to that of DROID, following the exact one used by NWM (Bar et al., 2024), which we recall for clarity. For additional details, confer Bar et al. (2024). Here, for the Cross-Entropy Method, we use N = 120 candidate actions and only a single iteration, which was found to be sufficient in NWM.

For efficiency, trajectories are assumed to be straight lines, which allows us to plan only a single action that can then be divided into the right number of time-steps. The planning horizon here is H = 8, which at 4 fps represents 2 seconds in the future.
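Under the straight-line assumption, a single planned displacement is spread uniformly over the horizon; a sketch:

```python
import numpy as np

def straight_line_plan(total_action, horizon=8):
    """Divide one planned displacement into `horizon` equal per-step actions,
    following the straight-line trajectory assumption."""
    step = np.asarray(total_action, dtype=float) / horizon
    return np.tile(step, (horizon, 1))   # (horizon, action_dim)
```

The per-step actions sum back to the planned total displacement, so planning a single action and dividing it is equivalent, under this assumption, to planning the full sequence.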

Once the trajectory is planned, we can compute the Absolute Trajectory Error (ATE) and Relative Pose Error (RPE) (Bar et al., 2024; Sturm et al., 2012) to measure the quality of the trajectory compared to the groundtruth ones. In practice we focus on RPE in the main body of our work, but ATE results are reported in Supplementary Section C.

Sampling latent actions

Throughout this work, latent actions have either been used as-is for transfer experiments, or as an interface to control the learned world model with interpretable actions. Performing planning directly in latent action space is, to the best of our knowledge, an open problem that can be made harder depending on the geometry of the latent action space.

Latent action sampling is the first process to elucidate, and it varies based on the choice of latent action regularization. For discrete latents , the task is straightforward: sample from the codebook, possibly restricted to used codes. For noisy, VAE-like latents , the prior distribution N (0 , 1) can be used. However, the strength of the regularization used during training will alter how closely this prior is matched, leading to suboptimal coverage of the latent action distribution. Sparse latents are perhaps the most challenging sampling-wise. Since the latent action space is defined through an energy function, we have to resort to MCMC sampling techniques for EBMs (LeCun et al., 2006). A common approach is to leverage our knowledge of the energy function's gradient and use a sampler based on Stochastic Gradient Langevin Dynamics (SGLD) (Grathwohl et al., 2020; Welling and Teh, 2011). The sampling can be defined as:

$$
z_{k+1} = z_k - \frac{\alpha}{2} \nabla_z E(z_k) + \sqrt{\alpha}\, \epsilon_k, \qquad \epsilon_k \sim \mathcal{N}(0, I), \quad z_0 \sim p,
$$

Here p can be, for example, a uniform distribution over the latent action space, or a Gaussian distribution. Similarly to using the prior distribution for noisy latents, when training a LAM we are not necessarily properly minimizing the energy function associated with our latents, which can lead to a misalignment between sampled latents and the ones inferred in practice.
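A minimal SGLD sampler matching this update, with `grad_energy` a hypothetical gradient of the latent energy function and `step_size` playing the role of α:

```python
import numpy as np

def sgld_sample(grad_energy, z0, n_steps=100, step_size=0.01, seed=0):
    """Stochastic Gradient Langevin Dynamics sampler for an energy-based
    latent action distribution, initialized from a draw z0 of the prior p."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(z.shape)
        # Gradient step on the energy plus injected Gaussian noise.
        z = z - 0.5 * step_size * grad_energy(z) + np.sqrt(step_size) * noise
    return z
```

As a sanity check, with the quadratic energy E(z) = ||z||²/2 (gradient z), the chain's stationary distribution is approximately standard normal, up to a discretization bias that shrinks with the step size.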

Figure S1 Sampling latent actions. For each class of latent actions and various capacities, we infer latent actions on natural videos and randomly sample the same amount. Looking at 2D visualizations obtained with UMAP (McInnes et al., 2018), we can see that high-capacity latents (i.e. less constrained ones) are harder to sample, as they are further away from the intended regularization or prior distribution. As the capacity gets lower, the visible overlap between sampled and true latents suggests that the sampling procedure works as intended.


As we can see in Figure S1, the aforementioned sampling strategies are able to sample latents similar to real ones when they have a low capacity. In that case, the models were trained with stronger constraints on the latent actions, which can explain why the sampling is adequate. However, when the latents are less constrained, and thus have a higher capacity, the true and sampled latents are easily separable, which suggests poor sampling.

While this analysis is purely qualitative, it effectively demonstrates how sampling approaches start to break down when handling continuous latents. An interesting approach to tackle this sampling problem could

be to use learning-based methods that make fewer assumptions about the latent action distribution, such as diffusion models (Sohl-Dickstein et al., 2015).

Detailed planning results

Table S1 Results on DROID. We first train a controller to map actions to latent actions and measure the quality of the unrollings compared to the IDM (left). We then select unseen videos and infer actions based on a goal image. We measure performance as the distance to the goal (right).


Robot manipulation vs in-the-wild videos

In this section, we investigate how pretraining on DROID (Khazatsky et al., 2024) affects results, both on qualitative examples and on planning performance.

Qualitative analysis. We start by comparing a model trained on YoutubeTemporal-1B with one trained solely on DROID using sparse latents with λ l 1 = 0 . 01 . Looking at qualitative results in Figure S2 on natural videos, we can see that a model trained exclusively on DROID struggles to model actions present in in-the-wild videos. This is even true in this scenario where we are using the inverse dynamics model, which thus represents an ideal upper bound of capabilities. Interestingly, when the action corresponds to a person entering the room, we find that the model trained on DROID makes a robotic arm appear, as it is the only moving object seen during training. While this model struggles to open and close a hand, it is however capable of animating objects that are not seen during training, such as a human walking in the scene. Looking closely we can see that the exact leg movement is not captured well, but the overall translation movement is.

These results suggest that pretraining on a more diverse dataset is beneficial for capturing more diverse actions, but that even when training on a more constrained dataset, actions that generalize can still be learned. This further supports the illustration in Figure 1.

Planning performance. While we have previously seen that we are able to achieve good planning performance by pretraining only on in-the-wild videos, one can wonder how much the addition of domain-specific data influences performance. For this, we pretrain models on a mix of DROID and YoutubeTemporal-1B data, varying the weight of DROID between 0 and 100%.

Table S3 Effect of varying DROID pretraining weight on planning. Adding in domain data helps both the quality of rollouts and planning performance. Even a minor amount of data can yield a strong boost in performance.

As we can see in Table S3, adding domain-specific data can drastically help performance, even with as little as 10% in some settings. What is also interesting for our latent action model setup is that by training a latent action model with domain-specific data, we can achieve very similar planning performance compared to a world model trained on the same data with access to action labels (0.06 vs 0.05 for the best model from Terver et al. (2025)). Beyond our work, these results suggest that training a latent action model on the widest range of data possible may be optimal for a diverse set of applications.


Figure S2 Sample predictions using the IDM across data sources. Top: a person entering the scene; middle: hand motion; bottom: object translation. The model trained on DROID struggles on human-centric actions outside its training distribution (entering, hand), while both models can handle simple object translation.

Qualitative impact of regularization strength

While we previously quantified the impact of latent action capacity (equivalently, regularization strength), we now turn to more qualitative analyses. Throughout this section we consider noisy latents, but similar conclusions hold across regularization families.

As we can see in Figure S3, when latent actions are overly constrained, the model is unable to make a human appear. As the constraint gets weaker, we start to see the person appearing, albeit with suboptimal appearance and motion. Continuing to weaken this regularization, we start to see a better outline of the person, and a higher fidelity in motion, especially for the leg movements.

In Figure S4 we study the impact of the regularization strength when transferring movements from a human to a ball. We can see that with too strong a regularization, the ball simply continues its trajectory: we essentially have a deterministic world model. As the regularization weakens, the ball slows down more until it perfectly follows the transferred motion. We then see it going perfectly left, in a straight line. This highlights the importance of adequate capacity to be able to identify interpretable actions.

While so far more capacity has been beneficial, we get a better understanding of what happens at lower constraints in Figure S5. Here we see that while capacity initially improves the cycle consistency of actions, in some cases at higher capacity the motion is not applied to the whole human when re-inferred. This suggests a greater spatial localization of actions at higher capacity: we obtain more "precise" actions, at the cost of generality. This mirrors what is observed in planning evaluations, where the optimal latent action spaces strike a balance between capacity and generality.

Figure S3 Quality of the IDM across regularizations. Overly constrained latents are not able to capture a person entering the room. As the capacity of the latent actions grows, both the quality of the person and the leg movement increase, but plateau after a certain point.

Figure S4 Cross object action transfer across regularizations. The quality of motion transfer increases with the capacity of the latents. More constrained latents either have no effect, or a weaker one.

Figure S5 Cycle consistency for different regularizations. As the latent action capacity increases we obtain improved transfer. After a certain point, the movement becomes more localized and only the upper body motion is captured back.

Additional IDM rollouts

In this section we look at more qualitative examples of rollouts performed with the inverse dynamics model. This establishes an upper bound on the performance attainable by a given model, with the caveat that models may use shortcut solutions. As in Figure 3, we consider the least constrained latents for all regularizations. We focus on videos from SSv2 (Goyal et al., 2017), a natural video dataset not seen during training.

As we can see in Figures S6 and S7, latent actions constrained via noise addition or sparsity are able to capture the actions happening in videos, while vector quantized ones struggle more. The latter still capture rough motion, but struggle with more precise motions, such as the rotation of the object at the top of Figure S7. Overall, these samples corroborate our previous findings and demonstrate the usefulness of continuous regularized latent actions.

Figure S6 Sample predictions using the IDM. We illustrate the highest quality rollouts obtained with different regularizations on SSv2, using the inverse dynamics model.

Additional human action transfer results

In this section, we look at more action transfer scenarios, considering different levels and families of regularization. We investigate four transfer scenarios: making someone appear and walk in a scene where someone is already present, transferring two people raising their arms to a single person, someone entering the scene while someone else remains static, and making a still person walk in a scene. Figure S8 considers noisy latents with low capacity, Figure S9 noisy latents with high capacity, and Figure S10 sparse latents with high capacity. This last configuration has the highest overall capacity, as previously measured by prediction error.

We find that the action of someone entering an empty room is adequately transferred, but with different behavior based on capacity. With low capacity, the newly introduced person and the one already present both start moving. At higher capacities, the already present person either moves with the new character once they overlap, or disappears. However, if the original video contains a person standing still (third pair of rows), then the person in the target video also remains still. This difference in behavior suggests that the model can distinguish humans from the background, and that the latent actions affect them differently, which is a desired behavior. This is consistent with Figure 6, where we see that the latent actions prioritize humans over the background.

When transferring the motion of two people raising their right arms to a single person, we see that both arms become raised. The arms also follow the same movement as in the original video, in spite of the ambiguity of this transfer task. The arms however do not extend horizontally as much as in the original video, which we hypothesize is due to the locality of the action. This appears consistent across capacities.

Finally, when making a still person walk to the left of the scene, all capacities create movement, but at higher capacity we can see the person turn and move, which is more natural than the translation observed at lower capacity. The person only starts moving once the transferred motion reaches their location in the frame, further reinforcing the previously discussed locality.

Another positive result from these qualitative examples is that there is no leakage from the background in any video, again suggesting that the models are not cheating by copying the future but are learning valid latent actions.

Overall, we see that actions can be adequately transferred across videos, and the difficulty of defining a clear embodiment in in-the-wild videos becomes a strength in ambiguous settings, such as going from two people to one.

Figure S8 Additional transfer results, noisy latents with β = 10⁻⁴. First pair of rows: making someone enter a frame with someone already in it. Second pair of rows: transferring movements from two people to one person. Third pair of rows: someone entering the frame with a still person present. Fourth pair of rows: animating someone already present in the room.

Figure S9 Additional transfer results, noisy latents with β = 10⁻⁶. First pair of rows: making someone enter a frame with someone already in it. Second pair of rows: transferring movements from two people to one person. Third pair of rows: someone entering the frame with a still person present. Fourth pair of rows: animating someone already present in the room.

Figure S10 Additional transfer results, sparse latents with λ_l1 = 0.01. First pair of rows: making someone enter a frame with someone already in it. Second pair of rows: transferring movements from two people to one person. Third pair of rows: someone entering the frame with a still person present. Fourth pair of rows: animating someone already present in the room.

Qualitative performance of the controllers

In this section, we look at rollouts produced by our learned controllers to help understand behaviors observed in practice.

We first look at random samples from the validation sets of RECON and DROID, using our model with the lowest LPIPS value. As we can see in Figure S11, the model accurately reproduces the camera wearer's movements, with a few caveats. In the first video, the tree is not accurately predicted once it enters the frame. This can be explained by the missing information at the beginning of the video: the model can only guess how the tree continues. In the second row, as the sun becomes occluded, the image gets darker; in our model's prediction, the brightness remains high and the sun stays in the corner of the frame, moving along with the camera. Nonetheless, we are able to accurately control the latent action world model using human-interpretable actions.

On DROID, in Figure S12, the model again performs movements similar to the ground truth, but it struggles with making the robotic arm enter the frame. On the last row, no matter the action, nothing happens, as the model did not see the arm in the video: a sensible failure mode. On the first row, we do see movement of the visible part of the arm (mainly the gripper), but the rest of the arm does not appear. This again stems from a lack of information, combined with unfamiliarity with the objects present in this video during training.

To further illustrate why the controller needs access to the representations, beyond previous intuition, we show rollouts performed using a representation-less controller in Figure S13. Due to the different possible cameras across videos, as well as our camera-relative latents, the model is not able to successfully control the robotic arm; instead, the arm remains static. This further demonstrates the importance of representations of the past for contextualizing latent actions.

Figure S11 Unrolling of the controller on RECON. The controller can adequately map real actions to latent actions, allowing precise control of the world model.

Figure S13 Unrolling of the controller without representations of the past on DROID. Due to the ambiguity of actions without knowing the position of the arm or camera, the model resorts to producing no movements.

Latents   Capacity  w/o change  w/ change
Sparse    Low       0.28        0.66 (×2.3)
Sparse    High      0.20        0.50 (×2.4)
Noisy     Low       0.33        0.69 (×2.1)
Noisy     High      0.21        0.54 (×2.5)
Discrete  Low       0.34        0.69 (×2.0)
Discrete  High      0.29        0.68 (×2.3)

                    Kinetics                RECON
Latents   Capacity  Original  Transfer      Original  Transfer
Sparse    Low       0.26      0.31 (×1.20)  0.24      0.29 (×1.21)
Sparse    High      0.19      0.24 (×1.30)  0.20      0.23 (×1.14)
Noisy     Low       0.30      0.34 (×1.13)  0.29      0.33 (×1.15)
Noisy     High      0.20      0.26 (×1.34)  0.20      0.24 (×1.22)
Discrete  Low       0.32      0.33 (×1.03)  0.32      0.33 (×1.03)
Discrete  High      0.27      0.29 (×1.07)  0.26      0.27 (×1.05)

Latents        Capacity  Δ xyz (m)
Sparse         Low       0.33
Sparse         Mid       0.18
Sparse         High      0.13
Noisy          Low       0.49
Noisy          Mid       0.11
Noisy          High      0.10
Discrete       Low       0.18
Discrete       High      0.14
V-JEPA 2-AC    N/A       0.15
V-JEPA 2 + WM  N/A       0.05

Latents   Capacity  ATE   RPE
Sparse    Low       1.68  0.48
Sparse    Mid       1.45  0.41
Sparse    High      1.43  0.42
Noisy     Low       2.06  0.55
Noisy     Mid       1.49  0.41
Noisy     High      1.40  0.40
Discrete  Low       1.81  0.51
Discrete  High      1.48  0.42
NoMaD     N/A       1.93  0.52
NWM       N/A       1.13  0.35

                          DROID weight
Model   Metric            0%    10%   25%   50%   75%   90%   100%
Sparse  Controller LPIPS  0.14  0.14  0.12  0.11  0.1   0.1   0.1
Sparse  Δ xyz             0.14  0.13  0.14  0.09  0.09  0.08  0.08
Noisy   Controller LPIPS  0.11  0.1   0.1   0.1   0.1   0.1   0.9
Noisy   Δ xyz             0.14  0.09  0.09  0.09  0.06  0.06  0.07

$$ \mathcal{L}_{t} = \| s_{t+1} - p_\psi(s_{0:t}, z_t) \|_1\,, \quad \text{with} \quad z_t = g_\phi(s_{t}, s_{t+1}). $$
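The teacher-forced objective above can be sketched end to end. The `idm` and `predictor` lambdas below are toy stand-ins for g_φ and p_ψ (additive dynamics, chosen so the loss is exactly zero), not the paper's actual transformer modules:

```python
import numpy as np

def idm_training_loss(states, predictor, idm):
    """Average teacher-forced L1 loss: the IDM infers a latent action z_t
    from (s_t, s_{t+1}); the predictor must then reconstruct s_{t+1}
    from the past states and z_t."""
    total = 0.0
    for t in range(len(states) - 1):
        z_t = idm(states[t], states[t + 1])          # z_t = g_phi(s_t, s_{t+1})
        pred = predictor(states[: t + 1], z_t)       # p_psi(s_{0:t}, z_t)
        total += np.abs(states[t + 1] - pred).sum()  # L1 distance
    return total / (len(states) - 1)

# Toy additive dynamics: the IDM recovers the displacement exactly,
# so a predictor that adds it back achieves zero loss.
idm = lambda s, s_next: s_next - s
predictor = lambda past, z: past[-1] + z
states = [np.zeros(4), np.ones(4), 3 * np.ones(4)]
print(idm_training_loss(states, predictor, idm))  # 0.0
```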

$$ \mathcal{L}(Z) = \mathrm{VCM}(Z) + \frac{1}{N}\sum_i E(Z_i), $$

$$ E(z) = \lambda_{l2}\max\left(\sqrt{D} - \|z\|_2^2,\,0\right) + \lambda_{l1}\|z\|_1 $$
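A minimal sketch of this per-sample energy follows; the λ_l2 default is an illustrative assumption, while λ_l1 = 0.01 matches the sparse configuration shown in Figure S10:

```python
import numpy as np

def sparsity_energy(z, lam_l2=1.0, lam_l1=0.01):
    """Per-sample energy E(z): a hinge that pushes ||z||_2^2 above sqrt(D),
    plus an l1 term encouraging sparse latent actions."""
    D = z.shape[-1]
    hinge = max(np.sqrt(D) - np.sum(z ** 2), 0.0)   # active when the norm is too small
    return lam_l2 * hinge + lam_l1 * np.sum(np.abs(z))

# All-zero latent: the hinge is fully active (sqrt(4) = 2), no l1 cost.
print(sparsity_energy(np.zeros(4)))              # 2.0
# Large enough norm: only the l1 term remains (0.01 * 2).
print(sparsity_energy(np.array([2.0, 0, 0, 0]))) # 0.02
```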

$$ \mathcal{L}(z_t) = - \beta \, D_{KL}\left(q(z_t \mid s_t, s_{t+1}) \,\|\, \mathcal{N}(0,1)\right) $$
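For a diagonal Gaussian posterior, the KL term constraining the noisy latents has a closed form; a sketch (parameterizing q by its mean and log-variance, a common but assumed convention):

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    Larger beta weights this term more strongly, i.e. lower latent capacity."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

print(kl_to_standard_normal(np.zeros(3), np.zeros(3)))  # 0.0 (posterior equals the prior)
print(kl_to_standard_normal(np.ones(1), np.zeros(1)))   # 0.5 (shifted mean costs 0.5 nat)
```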

$$ C(s_t, a_{t:t+H-1}, s_g) = \|s_g - \hat{s}_{t+H}\|_2, $$

$$ \hat{s}_{t} = f_\theta(V_t), \quad \hat{s}_{i+1} = p_\psi(\hat{s}_{i}, c(a_i, \hat{s}_i)), \quad i \in [t, t+H-1], $$
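The rollout and goal cost above can be sketched as a single scoring function for a candidate action sequence. The `predictor` and `controller` lambdas are illustrative stand-ins for p_ψ and the learned controller c, not the actual models:

```python
import numpy as np

def plan_cost(s0, actions, s_goal, predictor, controller):
    """Unroll the world model under a candidate action sequence, mapping each
    real action to a latent action via the controller, then score the plan by
    the l2 distance between the final predicted state and the goal state."""
    s = s0
    for a in actions:
        s = predictor(s, controller(a, s))
    return np.linalg.norm(s_goal - s)

# Toy additive dynamics: two unit steps reach the goal exactly.
predictor = lambda s, z: s + z
controller = lambda a, s: a
cost = plan_cost(np.zeros(2), [np.ones(2), np.ones(2)], 2 * np.ones(2),
                 predictor, controller)
print(cost)  # 0.0
```

In practice this cost would be minimized over action sequences by the planner (e.g. sampling-based optimization), which this sketch leaves out.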

$$ \Delta xyz = \left\| \sum_{i=t}^{t+H-1} a_i^{\text{plan}} - \sum_{i=t}^{t+H-1} a_i^{\text{gt}} \right\|_1, $$
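This metric compares only the summed displacements over the horizon, so per-step differences that cancel out are not penalized; a minimal sketch:

```python
import numpy as np

def delta_xyz(plan_actions, gt_actions):
    """l1 distance between the summed planned displacements and the summed
    ground-truth displacements over the horizon (per-step xyz offsets, meters)."""
    plan_actions = np.asarray(plan_actions, dtype=float)
    gt_actions = np.asarray(gt_actions, dtype=float)
    return float(np.abs(plan_actions.sum(axis=0) - gt_actions.sum(axis=0)).sum())

# Different per-step actions, identical total displacement: zero error.
print(delta_xyz([[1, 0, 0], [0, 1, 0]], [[1, 1, 0], [0, 0, 0]]))  # 0.0
print(delta_xyz([[1, 0, 0]], [[0, 0, 0]]))                        # 1.0
```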

$$ z_0 \sim p(z), \quad z_{i+1} = z_i - \frac{\alpha}{2} \frac{\partial E(z_i)}{\partial z_i} + \epsilon, \quad \text{with} \quad \epsilon \sim \mathcal{N}(0,\alpha). $$
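The update above is unadjusted Langevin dynamics; a sketch with an assumed step size and step count (both illustrative), shown on a quadratic energy whose gradient is z itself:

```python
import numpy as np

def langevin_sample(grad_E, z0, alpha=0.1, steps=200, seed=0):
    """Unadjusted Langevin dynamics over the latent-action energy E:
    half a gradient step on E plus Gaussian noise with variance alpha."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    for _ in range(steps):
        z = z - 0.5 * alpha * grad_E(z) + rng.normal(0.0, np.sqrt(alpha), size=z.shape)
    return z

# E(z) = ||z||^2 / 2, so grad_E(z) = z: samples concentrate near the origin,
# pulling a far-away initialization back toward small norms.
z = langevin_sample(lambda z: z, np.full(8, 5.0))
```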

$$ \mathrm{VCM}(Z) = \lambda_{V} \frac{1}{D} \sum_d \max\left(1-\sqrt{\mathrm{Var}(Z_{\cdot,d})},\,0\right) + \lambda_{C}\frac{1}{D(D-1)} \sum_{i\neq j} \mathrm{Cov}(Z)_{i,j}^2 + \lambda_{M}\frac{1}{N D} \sum_{i,j} Z_{i,j}. $$
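A sketch of this batch-level regularizer follows. The coefficient values are illustrative, and the variance/covariance estimators (population variance, sample covariance via `np.cov`) are assumptions that may differ from the exact implementation:

```python
import numpy as np

def vcm(Z, lam_v=1.0, lam_c=1.0, lam_m=1.0):
    """VCM regularizer on a batch of latent actions Z of shape (N, D):
    a hinge keeping each dimension's std above 1, a penalty on squared
    off-diagonal covariances, and a term on the mean of all entries."""
    N, D = Z.shape
    # Variance term: (1/D) sum_d max(1 - sqrt(Var(Z_.,d)), 0)
    var_term = lam_v * np.mean(np.maximum(1.0 - np.sqrt(Z.var(axis=0)), 0.0))
    # Covariance term: squared off-diagonal entries of Cov(Z)
    C = np.cov(Z, rowvar=False)
    off = C - np.diag(np.diag(C))
    cov_term = lam_c * (off ** 2).sum() / (D * (D - 1))
    # Mean term: (1/ND) sum_{i,j} Z_{i,j}
    mean_term = lam_m * Z.mean()
    return var_term + cov_term + mean_term

# Perfectly anti-correlated batch: zero hinge and zero mean,
# but the off-diagonal covariance term is penalized.
Z = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(vcm(Z))  # 4.0
```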


References

[2020RandAug] Cubuk, Ekin Dogus, Zoph, Barret, Shlens, Jon, Le, Quoc. (2020). RandAugment: Practical Automated Data Augmentation with a Reduced Search Space. Advances in Neural Information Processing Systems.

[3dwarehouse] {Trimble Inc. 3D Warehouse.

[adeli2022advanced] Adeli, Ehsan, Sun, Luning, Wang, Jianxun, Taflanidis, Alexandros A. (2022). An advanced spatio-temporal convolutional recurrent neural network for storm surge predictions. arXiv preprint arXiv:2204.09501.

[agrawal2015moving] Pulkit Agrawal, Joao Carreira, Jitendra Malik. (2015). Learning to see by moving. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[agrawal2022alphareq] Kumar Krishna Agrawal, Arnab Kumar Mondal, Arna Ghosh, Blake Aaron Richards. (2022). ${\textbackslash. Advances in Neural Information Processing Systems.

[agrawal_poke_latent_2016] Agrawal, Pulkit, Nair, Ashvin V, Abbeel, Pieter, Malik, Jitendra, Levine, Sergey. (2016). Learning to {Poke. Advances in {Neural.

[alcorn_strike_2019] Alcorn, Michael A., Li, Qi, Gong, Zhitao, Wang, Chengfei, Mai, Long, Ku, Wei-Shinn, Nguyen, Anh. (2019). Strike ({With. 2019 {IEEE. doi:10.1109/CVPR.2019.00498.

[alet2021noether] Alet, Ferran, Doblar, Dylan, Zhou, Allan, Tenenbaum, Josh, Kawaguchi, Kenji, Finn, Chelsea. (2021). Noether networks: meta-learning useful conserved quantities. Advances in Neural Information Processing Systems.

[anonymous] Author, N. N.. (2021). Suppressed for Anonymity.

[anonymous2023steerable] Anonymous. (2023). Steerable Equivariant Representation Learning. Submitted to The Eleventh International Conference on Learning Representations.

[arjovsky2016unitary] Arjovsky, Martin, Shah, Amar, Bengio, Yoshua. (2016). Unitary evolution recurrent neural networks. Proceedings of the International Conference on Machine Learning.

[asano2020labelling] Yuki Markus Asano, Christian Rupprecht, Andrea Vedaldi. (2020). Self-labelling via simultaneous clustering and representation learning. International Conference on Learning Representations.

[assran2023ijepa] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[bachman2019mutual] Bachman, Philip, Hjelm, R Devon, Buchwalter, William. (2019). Learning Representations by Maximizing Mutual Information Across Views. Advances in Neural Information Processing Systems.

[baevski2022data2vec] Baevski, Alexei, Hsu, Wei-Ning, Xu, Qiantong, Babu, Arun, Gu, Jiatao, Auli, Michael. (2022). Data2vec: A general framework for self-supervised learning in speech, vision and language. Proceedings of the International Conference on Machine Learning.

[baillargeon1985object] Baillargeon, Renee, Spelke, Elizabeth S, Wasserman, Stanley. (1985). Object permanence in five-month-old infants. Cognition.

[baillargeon1995collision] Baillargeon, Renee. (1995). Physical reasoning in infancy. The cognitive neurosciences.

[baillargeon_innate_2008] Baillargeon, Renée. (2008). Innate {Ideas. Perspectives on Psychological Science. doi:10.1111/j.1745-6916.2008.00056.x.

[baillargeon_permanence_1991] Baillargeon, Renee, DeVos, Julie. (1991). Object {Permanence. Child Development. doi:10.2307/1130803.

[baillargeon_support_1990] Baillargeon, Renée, Hanko-Summers, Stephanie. (1990). Is the top object adequately supported by the bottom object? young infants' understanding of support relations. Cognitive Development. doi:10.1016/0885-2014(90)90011-H.

[baillargeon_support_1992] Baillargeon, Renée, Needham, Amy, Devos, Julie. (1992). The development of young infants' intuitions about support. Early Development and Parenting. doi:10.1002/edp.2430010203.

[statlog_german_credit_data_144] Hofmann, Hans. (1994). {Statlog (German Credit Data).

[Oquab_2014_transfer] Oquab, Maxime, Bottou, Leon, Laptev, Ivan, Sivic, Josef. (2014). Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[zeiler2014visualizing] Zeiler, Matthew D, Fergus, Rob. (2014). Visualizing and understanding convolutional networks. Proceedings of the European Conference on Computer Vision.

[alexnet] Krizhevsky, Alex, Sutskever, Ilya, Hinton, Geoffrey E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems.

[german_report] Grömping, Ulrike. (2019). South German Credit Data: Correcting a Widely Used Data Set.

[baker2003matrix] Baker, Andrew. (2003). Matrix groups: An introduction to Lie group theory.

[balestriero2022spectral] Balestriero, Randall, LeCun, Yann. (2022). Contrastive and Non-Contrastive Self-Supervised Learning Recover Global and Local Spectral Embedding Methods. arXiv preprint arXiv:2205.11508.

[balestriero2023cookbook] Balestriero, Randall, Ibrahim, Mark, Sobal, Vlad, Morcos, Ari, Shekhar, Shashank, Goldstein, Tom, Bordes, Florian, Bardes, Adrien, Mialon, Gregoire, Tian, Yuandong, others. (2023). A Cookbook of Self-Supervised Learning. arXiv preprint arXiv:2304.12210.

[bansal2024videophy] Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, Aditya Grover. (2024). VideoPhy: Evaluating Physical Commonsense for Video Generation. arXiv preprint arXiv:2406.03520.

[bao2021beit] Bao, Hangbo, Dong, Li, Wei, Furu. (2021). Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254.

[bar2019learning] Bar-Sinai, Yohai, Hoyer, Stephan, Hickey, Jason, Brenner, Michael P.. (2019). Learning data-driven discretizations for partial differential equations. Proceedings of the National Academy of Sciences.

[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.

[bardes2022vicregl] Adrien Bardes, Jean Ponce, Yann LeCun. (2022). {VICR. Advances in Neural Information Processing Systems.

[bardes_self-supervised_2020] Bardes, Adrien. (2020). Self-supervised vision algorithms with regularized latent variables.

[bardes_vjepa_2024] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, Nicolas Ballas. (2024). Revisiting Feature Prediction for Learning Visual Representations from Video. Transactions on Machine Learning Research.

[battaglia_simulation_2013] Battaglia, Peter W., Hamrick, Jessica B., Tenenbaum, Joshua B.. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.1306572110.

[baumann2000symmetry] Baumann, Gerd. (2000). Symmetry analysis of differential equations with Mathematica{\textregistered.

[bautista2016cliquecnn] Miguel A. Bautista, Artsiom Sanakoyeu, Ekaterina Sutter, Björn Ommer. (2016). CliqueCNN: Deep Unsupervised Exemplar Learning. Advances in Neural Information Processing Systems.

[bear2021physion] Bear, Daniel M, Wang, Elias, Mrowca, Damian, Binder, Felix J, Tung, Hsiao-Yu Fish, Pramod, RT, Holdaway, Cameron, Tao, Sirui, Smith, Kevin, Sun, Fan-Yun, others. (2021). Physion: Evaluating physical prediction from vision in humans and machines. arXiv preprint arXiv:2106.08261.

[samuelson2005they] Samuelson, Larissa K, Smith, Linda B. (2005). They call it like they see it: Spontaneous naming and attention to shape. Developmental science.

[benchekroun2023worldsense] Benchekroun, Youssef, Dervishi, Megi, Ibrahim, Mark, Gaya, Jean-Baptiste, Martinet, Xavier, Mialon, Gr{'e. (2023). Worldsense: A synthetic benchmark for grounded reasoning in large language models. arXiv preprint arXiv:2311.15930.

[bird2009rooks] Bird, Christopher David, Emery, Nathan John. (2009). Rooks use stones to raise the water level to reach a floating worm. Current Biology.

[bisk2020piqa] Bisk, Yonatan, Zellers, Rowan, Gao, Jianfeng, Choi, Yejin, others. (2020). Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI conference on artificial intelligence.

[blender] {Blender Online Community. Blender - a 3D modelling and rendering package.

[bommasani2021opportunities] Bommasani, Rishi, Hudson, Drew A, Adeli, Ehsan, Altman, Russ, Arora, Simran, von Arx, Sydney, Bernstein, Michael S, Bohg, Jeannette, Bosselut, Antoine, Brunskill, Emma, others. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

[bordes2021high] Bordes, Florian, Balestriero, Randall, Vincent, Pascal. (2021). High Fidelity Visualization of What Your Self-Supervised Representation Knows About. arXiv preprint arXiv:2112.09164.

[bordes2022guillotine] Bordes, Florian, Balestriero, Randall, Garrido, Quentin, Bardes, Adrien, Vincent, Pascal. (2022). Guillotine regularization: Improving deep networks generalization by removing their head. arXiv preprint arXiv:2206.13378.

[bouchacourt_addressing_2021] Bouchacourt, Diane, Ibrahim, Mark, Deny, Stéphane. (2021). Addressing the {Topological. arXiv preprint arXiv:2102.05623 [cs].

[brandes2022proteinbert] Brandes, Nadav, Ofer, Dan, Peleg, Yam, Rappoport, Nadav, Linial, Michal. (2022). ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics.

[brandstetter2022lie] Brandstetter, Johannes, Welling, Max, Worrall, Daniel E. (2022). Lie Point Symmetry Data Augmentation for Neural PDE Solvers. arXiv preprint arXiv:2202.07643.

[bromley1993signature] Bromley, Jane, Guyon, Isabelle, LeCun, Yann, S{. (1993). Signature verification using a. Advances in neural information processing systems.

[bronstein2021geometric] Bronstein, Michael M, Bruna, Joan, Cohen, Taco, Veli{\v{c. (2021). Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478.

[brooks2024sora] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, Aditya Ramesh. (2024). Video generation models as world simulators.

[brown2020language] Brown, Tom B., Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, Agarwal, Sandhini, Herbert-Voss, Ariel, Krueger, Gretchen, Henighan, Tom, Child, Rewon, Ramesh, Aditya, Ziegler, Daniel M., Wu, Jeffrey, Winter, Clemens, Hesse, Christopher, Chen, Mark, Sigler, Eric, Litwin, Mateusz, Gray, Scott, Chess, Benjamin, Clark, Jack, Berner, Christopher, McCandlish, Sam, Radford, Alec, Sutskever, Ilya, Amodei, Dario. (2020). Language Models Are Few-Shot Learners. Proceedings of the 34th International Conference on Neural Information Processing Systems.

[brunton2016discovering] Brunton, Steven L., Proctor, Joshua L., Kutz, J. Nathan. (2016). Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the {N.

[cabannes2023ssl] Cabannes, Vivien, Kiani, Bobak T, Balestriero, Randall, LeCun, Yann, Bietti, Alberto. (2023). The SSL Interplay: Augmentations, Inductive Bias, and Generalization. arXiv preprint arXiv:2302.02774.

[cacchione2004recognizing] Cacchione, Trix, Krist, Horst. (2004). Recognizing impossible object relations: intuitions about support in chimpanzees (Pan troglodytes).. Journal of Comparative Psychology.

[canzi2012simple] Canzi, Claudia, Guerra, Graziano. (2012). A simple counterexample related to the Lie--Trotter product formula. Semigroup Forum.

[carey2000origin] Carey, Susan. (2000). The origin of concepts. Journal of Cognition and Development.

[caron2018clustering] Mathilde Caron, Piotr Bojanowski, Armand Joulin, Matthijs Douze. (2018). Deep clustering for unsupervised learning. Proceedings of the European Conference on Computer Vision.

[caron2019noncurated] Mathilde Caron, Piotr Bojanowski, Julien Mairal, Armand Joulin. (2019). Unsupervised Pre-Training of Image Features on Non-Curated Data. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[caron2020unsupervised] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems.

[caron2021dino] Mathilde Caron, Hugo Touvron, Ishan Misra, Herve Jegou, Julien Mairal Piotr Bojanowski Armand Joulin. (2021). Emerging Properties in Self-Supervised Vision Transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[cerri2019variational] Cerri, Olmo, Nguyen, Thong Q, Pierini, Maurizio, Spiropulu, Maria, Vlimant, Jean-Roch. (2019). Variational autoencoders for new physics mining at the large hadron collider. Journal of High Energy Physics.

[chang2015shapenet] Chang, Angel X, Funkhouser, Thomas, Guibas, Leonidas, Hanrahan, Pat, Huang, Qixing, Li, Zimo, Savarese, Silvio, Savva, Manolis, Song, Shuran, Su, Hao, others. (2015). Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012.

[chang_NPE_2017] Chang, Michael B., Ullman, Tomer, Torralba, Antonio, Tenenbaum, Joshua B.. (2017). A {Compositional.

[chavhan2023amortised] Ruchika Chavhan, Jan Stuehmer, Calum Heggan, Mehrdad Yaghoobi, Timothy Hospedales. (2023). Amortised Invariance Learning for Contrastive Self-Supervision. The Eleventh International Conference on Learning Representations.

[chavhan2023diversity] Chavhan, Ruchika, Gouk, Henry, Li, Da, Hospedales, Timothy. (2023). Quality Diversity for Visual Pre-Training. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (ICCV).

[chen2019symplectic] Chen, Zhengdao, Zhang, Jianyu, Arjovsky, Martin, Bottou, L{'e. (2019). Symplectic recurrent neural networks. arXiv preprint arXiv:1909.13334.

[chen2020mocov2] Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He. (2020). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.

[chen2020simclrv2] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, Geoffrey Hinton. (2020). Big Self-Supervised Models are Strong Semi-Supervised Learners. Advances in Neural Information Processing Systems.

[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. Proceedings of the International Conference on Machine Learning.

[chen2020simsiam] Xinlei Chen, Kaiming He. (2020). Exploring simple siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[chen2021intriguing] Chen, Ting, Luo, Calvin, Li, Lala. (2021). Intriguing properties of contrastive losses. Advances in Neural Information Processing Systems.

[chen2021mocov3] Xinlei Chen, Saining Xie, Kaiming He. (2021). An Empirical Study of Training Self-Supervised Vision Transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[chen2021neural] Chen, Yuhan, Matsubara, Takashi, Yaguchi, Takaharu. (2021). Neural symplectic form: learning Hamiltonian equations on general coordinate systems. Advances in Neural Information Processing Systems.

[chen2022intra] Chen, Yubei, Bardes, Adrien, Li, Zengyi, LeCun, Yann. (2022). Intra-instance vicreg: Bag of self-supervised image patch embedding. arXiv preprint arXiv:2206.08954.

[chen2023context] Chen, Xiaokang, Ding, Mingyu, Wang, Xiaodi, Xin, Ying, Mo, Shentong, Wang, Yunhao, Han, Shumin, Luo, Ping, Zeng, Gang, Wang, Jingdong. (2023). Context autoencoder for self-supervised representation learning. International Journal of Computer Vision.

[chen2024dae] Xinlei Chen, Zhuang Liu, Saining Xie, Kaiming He. (2024). Deconstructing Denoising Diffusion Models for Self-Supervised Learning.

[childs2021theory] Childs, Andrew M, Su, Yuan, Tran, Minh C, Wiebe, Nathan, Zhu, Shuchen. (2021). Theory of trotter error with commutator scaling. Physical Review X.

[chopra2005] Hadsell, Raia, Chopra, Sumit, LeCun, Yann. (2006). Dimensionality reduction by learning an invariant mapping. 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06).

[cifar] Krizhevsky, Alex, Hinton, Geoffrey, others. (2009). Learning multiple layers of features from tiny images.

[clark2013whatever_next] Clark, Andy. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and brain sciences.

[clark2023text] Clark, Kevin, Jaini, Priyank. (2023). Text-to-image diffusion models are zero-shot classifiers. arXiv preprint arXiv:2303.15233.

[cohen2016group] Cohen, Taco, Welling, Max. (2016). Group equivariant convolutional networks. Proceedings of the International Conference on Machine Learning.

[cohen2018spherical] Cohen, Taco S, Geiger, Mario, K{. (2018). Spherical cnns. arXiv preprint arXiv:1801.10130.

[cohen_group_2016] Cohen, Taco S., Welling, Max. (2016). Group {Equivariant. arXiv preprint arXiv:1602.07576 [cs, stat].

[cohen_spherical_2018] Cohen, Taco S., Geiger, Mario, Koehler, Jonas, Welling, Max. (2018). Spherical {CNNs. arXiv preprint arXiv:1801.10130 [cs, stat].

[cover1965geometrical] Cover, Thomas M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE transactions on electronic computers.

[dangovski_equivariant_2022] Dangovski, Rumen, Jing, Li, Loh, Charlotte, Han, Seungwook, Srivastava, Akash, Cheung, Brian, Agrawal, Pulkit, Solja{\v{c. (2021). Equivariant contrastive learning. arXiv preprint arXiv:2111.00899.

[deng2009imagenet] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei. (2009). ImageNet: A large-scale hierarchical image database. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[denninger2019blenderproc] Denninger, Maximilian, Sundermeyer, Martin, Winkelbauer, Dominik, Zidan, Youssef, Olefir, Dmitry, Elbadrawy, Mohamad, Lodhi, Ahsan, Katam, Harinandan. (2019). BlenderProc. arXiv preprint arXiv:1911.01911.

[descombes2010exact] Descombes, St{'e. (2010). *An exact local error representation of exponential operator splitting methods for evolutionary problems and applications to linear Schr{*. BIT Numerical Mathematics.

[devillers2022equimod] Devillers, Alexandre, Lefort, Mathieu. (2022). EquiMod: An Equivariance Module to Improve Self-Supervised Learning. arXiv preprint arXiv:2211.01244.

[jaderberg2015spatial] Jaderberg, Max, Simonyan, Karen, Zisserman, Andrew, others. (2015). Spatial transformer networks. Advances in neural information processing systems.

[noroozi2016unsupervised] Noroozi, Mehdi, Favaro, Paolo. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. European conference on computer vision.

[owens2018audio] Owens, Andrew, Efros, Alexei A. (2018). Audio-visual scene analysis with self-supervised multisensory features. Proceedings of the European Conference on Computer Vision.

[goyal2019scaling] Goyal, Priya, Mahajan, Dhruv, Gupta, Abhinav, Misra, Ishan. (2019). Scaling and benchmarking self-supervised visual representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[doersch2015unsupervised] Doersch, Carl, Gupta, Abhinav, Efros, Alexei A. (2015). Unsupervised visual representation learning by context prediction. Proceedings of the IEEE international conference on computer vision.

[dollard1979product] Dollard, John D, Friedman, Charles N, Masani, Pesi Rustom. (1979). Product integration with applications to differential equations.

[dosovitskiy2021vit] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. (2021). An Image Is Worth 16X16 Words: Transformers For Image Recognition At Scale. International Conference on Learning Representations.

[douillard2021dytox] Douillard, Arthur, Ramé, Alexandre, Couairon, Guillaume, Cord, Matthieu. (2022). DyTox: Transformers for Continual Learning with DYnamic TOken eXpansion. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[DudaHart2nd] R. O. Duda, P. E. Hart, D. G. Stork. (2000). Pattern Classification.

[dupont_equivariant_2020] Dupont, Emilien, Bautista, Miguel Angel, Colburn, Alex, Sankar, Aditya, Guestrin, Carlos, Susskind, Josh, Shan, Qi. (2020). Equivariant Neural Rendering. arXiv preprint arXiv:2006.07630.

[el2024aim] El-Nouby, Alaaeldin, Klein, Michal, Zhai, Shuangfei, Bautista, Miguel Angel, Toshev, Alexander, Shankar, Vaishaal, Susskind, Joshua M, Joulin, Armand. (2024). Scalable Pre-training of Large Autoregressive Image Models. arXiv preprint arXiv:2401.08541.

[engel2000one] Engel, Klaus-Jochen, Nagel, Rainer, Brendle, Simon. (2000). One-parameter semigroups for linear evolution equations.

[ermolov2021whitening] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe. (2021). Whitening for Self-Supervised Representation Learning. arXiv preprint arxiv:2007.06346.

[esteves_cross-domain_2019] Esteves, Carlos, Sud, Avneesh, Luo, Zhengyi, Daniilidis, Kostas, Makadia, Ameesh. (2019). Cross-Domain 3D Equivariant Image Embeddings. Proceedings of the International Conference on Machine Learning.

[esteves_learning_2020] Esteves, Carlos, Allen-Blanchette, Christine, Makadia, Ameesh, Daniilidis, Kostas. (2020). Learning SO(3) Equivariant Representations with Spherical CNNs. International Journal of Computer Vision. doi:10.1007/s11263-019-01220-1.

[everingham2010voc] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, Andrew Zisserman. (2010). The pascal visual object classes (voc) challenge. IJCV.

[faghri2018vse] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler. (2018). VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. BMVC.

[falorsi2018explorations] Falorsi, Luca, De Haan, Pim, Davidson, Tim R, De Cao, Nicola, Weiler, Maurice, Forré, Patrick, Cohen, Taco S. (2018). Explorations in homeomorphic variational auto-encoding. arXiv preprint arXiv:1807.04689.

[fan2008liblinear] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, Chih-Jen Lin. (2008). LIBLINEAR: A Library for Large Linear Classification. JMLR.

[feige_invariant-equivariant_2019] Feige, Ilya. (2019). Invariant-equivariant representation learning for multi-class data. arXiv preprint arXiv:1902.03251.

[fernandez2016review] Fernández-Godino, M Giselle, Park, Chanyoung, Kim, Nam-Ho, Haftka, Raphael T. (2016). Review of multi-fidelity models. arXiv preprint arXiv:1609.07196.

[fernandez2022sslwatermarking] Fernandez, Pierre, Sablayrolles, Alexandre, Furon, Teddy, Jégou, Hervé, Douze, Matthijs. (2022). Watermarking Images in Self-Supervised Latent Spaces. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[finn2016unsupervised] Finn, Chelsea, Goodfellow, Ian, Levine, Sergey. (2016). Unsupervised learning for physical interaction through video prediction. Advances in neural information processing systems.

[finzi2020simplifying] Finzi, Marc, Wang, Ke Alexander, Wilson, Andrew G. (2020). Simplifying hamiltonian and lagrangian neural networks via explicit constraints. Advances in neural information processing systems.

[food101] Bossard, Lukas, Guillaumin, Matthieu, Van Gool, Luc. (2014). Food-101 -- Mining Discriminative Components with Random Forests. European Conference on Computer Vision.

[forrester2007multi] Forrester, Alexander IJ, Sóbester, András, Keane, Andy J. (2007). Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[Freire2009OnTP] Igor Leite Freire. (2009). On the paper “Symmetry analysis of wave equation on sphere” by H. Azad and M.T. Mustafa. Journal of Mathematical Analysis and Applications.

[fries2022lasdi] Fries, William D, He, Xiaolong, Choi, Youngsoo. (2022). Lasdi: Parametric latent space dynamics identification. Computer Methods in Applied Mechanics and Engineering.

[ganea2019breaking] Ganea, Octavian, Gelly, Sylvain, Bécigneul, Gary, Severyn, Aliaksei. (2019). Breaking the softmax bottleneck via learnable monotonic pointwise non-linearities. Proceedings of the International Conference on Machine Learning.

[garrido2022duality] Garrido, Quentin, Chen, Yubei, Bardes, Adrien, Najman, Laurent, Lecun, Yann. (2022). On the duality between contrastive and non-contrastive self-supervised learning. arXiv preprint arXiv:2206.02574.

[garrido2023sie] Garrido, Quentin, Najman, Laurent, Lecun, Yann. (2023). Self-supervised learning of Split Invariant Equivariant representations. Proceedings of the 40th International Conference on Machine Learning.

[ghosh2022investigating] Ghosh, Arna, Mondal, Arnab Kumar, Agrawal, Kumar Krishna, Richards, Blake. (2022). Investigating power laws in deep representation learning. arXiv preprint arXiv:2202.05808.

[gidaris2018unsupervised] Spyros Gidaris, Praveer Singh, Nikos Komodakis. (2018). Unsupervised Representation Learning by Predicting Image Rotations. International Conference on Learning Representations.

[gidaris2020bags] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, Matthieu Cord. (2020). Learning Representations by Predicting Bags of Visual Words. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[gidaris2021obow] Spyros Gidaris, Andrei Bursuc, Gilles Puy, Nikos Komodakis, Matthieu Cord, Patrick Pérez. (2021). Online Bag-of-Visual-Words Generation for Unsupervised Representation Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[girish2022one] Girish, Sharath, Dey, Debadeepta, Joshi, Neel, Vineet, Vibhav, Shah, Shital, Mendes, Caio Cesar Teodoro, Shrivastava, Abhinav, Song, Yale. (2022). One Network Doesn't Rule Them All: Moving Beyond Handcrafted Architectures in Self-Supervised Learning. arXiv preprint arXiv:2203.08130.

[gondal_transfer_2019] Gondal, Muhammad Waleed, Wüthrich, Manuel, Miladinović, Đorđe, Locatello, Francesco, Breidt, Martin, Volchkov, Valentin, Akpo, Joel, Bachem, Olivier, Schölkopf, Bernhard, Bauer, Stefan. (2019). On the Transfer of Inductive Bias from Simulation to the Real World: a New Disentanglement Dataset. arXiv preprint arXiv:1906.03292.

[goodhart1975problems] Goodhart, Charles A. E.. (1975). Problems of Monetary Management: The U.K. Experience. Papers in Monetary Economics.

[gopnik1999scientist] Gopnik, Alison, Meltzoff, Andrew N, Kuhl, Patricia K. (1999). The scientist in the crib: Minds, brains, and how children learn..

[goyal2017lars] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He. (2017). Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677.

[goyal2017something] Goyal, Raghav, Ebrahimi Kahou, Samira, Michalski, Vincent, Materzynska, Joanna, Westphal, Susanne, Kim, Heuna, Haenel, Valentin, Fruend, Ingo, Yianilos, Peter, Mueller-Freitag, Moritz, others. (2017). The "something something" video database for learning and evaluating visual common sense. Proceedings of the IEEE international conference on computer vision.

[goyal2021vissl] Priya Goyal, Quentin Duval, Jeremy Reizenstein, Matthew Leavitt, Min Xu, Benjamin Lefaudeux, Mannat Singh, Vinicius Reis, Mathilde Caron, Piotr Bojanowski, Armand Joulin, Ishan Misra. (2021). VISSL.

[gretton2005measuring] Gretton, Arthur, Bousquet, Olivier, Smola, Alex, Schölkopf, Bernhard. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. International conference on algorithmic learning theory.

[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, others. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems.

[grill2020byol] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko. (2020). Bootstrap your own latent: A new approach to self-supervised Learning. Advances in Neural Information Processing Systems.

[gupta2022towards] Gupta, Jayesh K, Brandstetter, Johannes. (2022). Towards Multi-spatiotemporal-scale Generalized PDE Modeling. TMLR.

[gupta2023care] Sharut Gupta, Joshua Robinson, Derek Lim, Soledad Villar, Stefanie Jegelka. (2023). Structuring Representation Geometry with Rotationally Equivariant Contrastive Learning.

[guyon1991structural] Guyon, Isabelle, Vapnik, Vladimir, Boser, Bernhard, Bottou, Leon, Solla, Sara A. (1991). Structural risk minimization for character recognition. Advances in neural information processing systems.

[gwilliam2022beyond] Gwilliam, Matthew, Shrivastava, Abhinav. (2022). Beyond supervised vs. unsupervised: Representative benchmarking and analysis of image representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[ha2016hypernetworks] Ha, David, Dai, Andrew, Le, Quoc V. (2016). Hypernetworks. arXiv preprint arXiv:1609.09106.

[ha2018worldmodels] Ha, David, Schmidhuber, Jürgen. (2018). Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems 31.

[hadsell2006contrastive] Raia Hadsell, Sumit Chopra, Yann LeCun. (2006). Dimensionality reduction by learning an invariant mapping. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[hafner2019dreamer] Hafner, Danijar, Lillicrap, Timothy, Ba, Jimmy, Norouzi, Mohammad. (2019). Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603.

[hafner2023dreamerv3] Hafner, Danijar, Pasukonis, Jurgis, Ba, Jimmy, Lillicrap, Timothy. (2023). Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104.

[halvagal2022predictor] Halvagal, Manu Srinath, Laborieux, Axel, Zenke, Friedemann. (2022). Predictor networks and stop-grads provide implicit variance regularization in BYOL/SimSiam. arXiv preprint arXiv:2212.04858.

[hansen2022modem] Hansen, Nicklas, Lin, Yixin, Su, Hao, Wang, Xiaolong, Kumar, Vikash, Rajeswaran, Aravind. (2022). Modem: Accelerating visual model-based reinforcement learning with demonstrations. arXiv preprint arXiv:2212.05698.

[haochen2021provable] HaoChen, Jeff Z, Wei, Colin, Gaidon, Adrien, Ma, Tengyu. (2021). Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems.

[haochen2022beyond] HaoChen, Jeff Z, Wei, Colin, Kumar, Ananya, Ma, Tengyu. (2022). Beyond Separability: Analyzing the Linear Transferability of Contrastive Representations to Related Subpopulations. arXiv preprint arXiv:2204.02683.

[he2016resnet] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep Residual Learning for Image Recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[he2017maskrcnn] Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. (2017). Mask r-cnn. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[he2020momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[he2021mae] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.

[he2022exploring] He, Bobby, Ozay, Mete. (2022). Exploring the Gap between Collapsed & Whitened Features in Self-Supervised Learning. Proceedings of the International Conference on Machine Learning.

[he2022glasdi] He, Xiaolong, Choi, Youngsoo, Fries, William D, Belof, Jon, Chen, Jiun-Shyan. (2022). gLaSDI: Parametric Physics-informed Greedy Latent Space Dynamics Identification. arXiv preprint arXiv:2204.12005.

[helber2019eurosat] Helber, Patrick, Bischke, Benjamin, Dengel, Andreas, Borth, Damian. (2019). Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.

[henaff2019data] Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, Aäron van den Oord. (2019). Data-efficient image recognition with contrastive predictive coding. Proceedings of the International Conference on Machine Learning.

[henaff2021detcon] Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, João Carreira. (2021). Efficient Visual Pretraining with Contrastive Detection. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[henaff2022odin] Olivier J. Hénaff, Skanda Koppula, Evan Shelhamer, Daniel Zoran, Andrew Jaegle, Andrew Zisserman, João Carreira, Relja Arandjelović. (2022). Object discovery and representation networks. arXiv preprint arXiv:2103.10957.

[herman2010laboratory] Herman, Louis M. (2010). What laboratory research has told us about dolphin cognition. International Journal of Comparative Psychology.

[hespos_physics_2012] Hespos, Susan J., vanMarle, Kristy. (2012). Physics for infants: characterizing the origins of knowledge about objects, substances, and number. WIREs Cognitive Science. doi:10.1002/wcs.157.

[hinton2011transforming] Hinton, Geoffrey E, Krizhevsky, Alex, Wang, Sida D. (2011). Transforming auto-encoders. International conference on artificial neural networks.

[hinton2015distillation] Geoffrey Hinton, Oriol Vinyals, Jeffrey Dean. (2015). Distilling the Knowledge in a Neural Network. NIPS Deep Learning and Representation Learning Workshop.

[rumelhart1986learning] Rumelhart, David E, Hinton, Geoffrey E, Williams, Ronald J. (1986). Learning representations by back-propagating errors. nature.

[hjelm2019mutual] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Adam Trischler, Yoshua Bengio. (2019). Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations.

[hohwy2013predictivemind] Hohwy, Jakob. (2013). The predictive mind.

[houde_cambridge_2022] . The {Cambridge. (2022).

[https://doi.org/10.48550/arxiv.2112.04307] Baddoo, Peter J., Herrmann, Benjamin, McKeon, Beverley J., Kutz, J. Nathan, Brunton, Steven L.. (2021). Physics-informed dynamic mode decomposition (piDMD).

[https://doi.org/10.48550/arxiv.2302.03580] Equer, Léonard, Rusch, T. Konstantin, Mishra, Siddhartha. (2023). Multi-Scale Message Passing Neural PDE Solvers.

[hu2023gaia1] Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, Gianluca Corrado. (2023). GAIA-1: A Generative World Model for Autonomous Driving.

[hua2021feature] Hua, Tianyu, Wang, Wenxiao, Xue, Zihui, Ren, Sucheng, Wang, Yue, Zhao, Hang. (2021). On feature decorrelation in self-supervised learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[huang2019neighbourhood] Jiabo Huang, Qi Dong, Shaogang Gong, Xiatian Zhu. (2019). Unsupervised Deep Learning by Neighbourhood Discovery. Proceedings of the International Conference on Machine Learning.

[huang2021towards] Huang, Weiran, Yi, Mingyang, Zhao, Xuyang. (2021). Towards the generalization of contrastive self-supervised learning. arXiv preprint arXiv:2111.00743.

[hudson2023soda] Hudson, Drew A, Zoran, Daniel, Malinowski, Mateusz, Lampinen, Andrew K, Jaegle, Andrew, McClelland, James L, Matthey, Loic, Hill, Felix, Lerchner, Alexander. (2024). Soda: Bottleneck diffusion models for representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[hutchison_synergistic_2006] Osadchy, Margarita, Le Cun, Yann, Miller, Matthew L.. (2006). Synergistic Face Detection and Pose Estimation with Energy-Based Models. Toward Category-Level Object Recognition. doi:10.1007/11957959_10.

[ibragimov1995crc] Ibragimov, Nail H. (1995). CRC handbook of Lie group analysis of differential equations.

[ioffe2015bn] Sergey Ioffe, Christian Szegedy. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the International Conference on Machine Learning.

[isakov2006inverse] Isakov, Victor. (2006). Inverse problems for partial differential equations.

[hinton2006reducing] Hinton, Geoffrey E, Salakhutdinov, Ruslan R. (2006). Reducing the dimensionality of data with neural networks. science.

[lee2024video] Lee, Seon-Ho, Wang, Jue, Zhang, Zhikang, Fan, David, Li, Xinyu. (2024). Video token merging for long-form video understanding. arXiv preprint arXiv:2410.23782.

[li2024mvbench] Li, Kunchang, Wang, Yali, He, Yinan, Li, Yizhuo, Wang, Yi, Liu, Yi, Wang, Zun, Xu, Jilan, Chen, Guo, Luo, Ping, others. (2024). Mvbench: A comprehensive multi-modal video understanding benchmark. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[hansen2023td] Hansen, Nicklas, Su, Hao, Wang, Xiaolong. (2023). Td-mpc2: Scalable, robust world models for continuous control. arXiv preprint arXiv:2310.16828.

[yu2020meta] Yu, Tianhe, Quillen, Deirdre, He, Zhanpeng, Julian, Ryan, Hausman, Karol, Finn, Chelsea, Levine, Sergey. (2020). Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. Conference on robot learning.

[bar2024navigation] Bar, Amir, Zhou, Gaoyue, Tran, Danny, Darrell, Trevor, LeCun, Yann. (2024). Navigation world models. arXiv preprint arXiv:2412.03572.

[feng2025reflective] Feng, Yunhai, Han, Jiaming, Yang, Zhuoran, Yue, Xiangyu, Levine, Sergey, Luo, Jianlan. (2025). Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation. arXiv preprint arXiv:2502.16707.

[yang2025magma] Yang, Jianwei, Tan, Reuben, Wu, Qianhui, Zheng, Ruijie, Peng, Baolin, Liang, Yongyuan, Gu, Yu, Cai, Mu, Ye, Seonghyeon, Jang, Joel, others. (2025). Magma: A Foundation Model for Multimodal AI Agents. arXiv preprint arXiv:2502.13130.

[garrido2025intuitive] Garrido, Quentin, Ballas, Nicolas, Assran, Mahmoud, Bardes, Adrien, Najman, Laurent, Rabbat, Michael, Dupoux, Emmanuel, LeCun, Yann. (2025). Intuitive physics understanding emerges from self-supervised pretraining on natural videos. arXiv preprint arXiv:2502.11831.

[mialon2023pde] Mialon, Grégoire, Garrido, Quentin, Lawrence, Hannah, Rehman, Danyal, LeCun, Yann, Kiani, Bobak T. (2023). Self-supervised learning with lie symmetries for partial differential equations. Advances in Neural Information Processing Systems.

[garrido2024iwm] Garrido, Quentin, Assran, Mahmoud, Ballas, Nicolas, Bardes, Adrien, Najman, Laurent, LeCun, Yann. (2024). Learning and leveraging world models in visual representation learning. arXiv preprint arXiv:2403.00504.

[garrido2023rankme] Garrido, Quentin, Balestriero, Randall, Najman, Laurent, Lecun, Yann. (2023). Rankme: Assessing the downstream performance of pretrained self-supervised representations by their rank. Proceedings of the International Conference on Machine Learning.

[gu2021efficiently] Gu, Albert, Goel, Karan, Ré, Christopher. (2021). Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.

[gu2023mamba] Gu, Albert, Dao, Tri. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

[zellers2022merlot] Zellers, Rowan, Lu, Jiasen, Lu, Ximing, Yu, Youngjae, Zhao, Yanpeng, Salehi, Mohammadreza, Kusupati, Aditya, Hessel, Jack, Farhadi, Ali, Choi, Yejin. (2022). Merlot reserve: Neural script knowledge through vision and language and sound. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[assran2022hidden] Assran, Mahmoud, Balestriero, Randall, Duval, Quentin, Bordes, Florian, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, Ballas, Nicolas. (2022). The hidden uniform cluster prior in self-supervised learning. arXiv preprint arXiv:2210.07277.

[jassim_grasp_2024] Jassim, Serwan, Holubar, Mario, Richter, Annika, Wolff, Cornelius, Ohmer, Xenia, Bruni, Elia. (2024). GRASP: A Novel Benchmark for Evaluating Language GRounding and Situated Physics Understanding in Multimodal Language Models. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, {IJCAI-24. doi:10.24963/ijcai.2024/696.

[jenni2018artifacts] Simon Jenni, Paolo Favaro. (2018). Self-supervised feature learning by learning to spot artifacts. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[jenni_time-equivariant_2021] Jenni, Simon, Jin, Hailin. (2021). Time-Equivariant Contrastive Video Representation Learning. arXiv preprint arXiv:2112.03624.

[jing2022understanding] Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian. (2022). Understanding Dimensional Collapse in Contrastive Self-supervised Learning. International Conference on Learning Representations.

[johnson2017clevr] Johnson, Justin, Hariharan, Bharath, Van Der Maaten, Laurens, Fei-Fei, Li, Lawrence Zitnick, C, Girshick, Ross. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. Proceedings of the IEEE conference on computer vision and pattern recognition.

[jumper2021highly] Jumper, John, Evans, Richard, Pritzel, Alexander, Green, Tim, Figurnov, Michael, Ronneberger, Olaf, Tunyasuvunakool, Kathryn, Bates, Russ, Žídek, Augustin, others. (2021). Highly accurate protein structure prediction with AlphaFold. Nature.

[kang2024farvideogenerationworld] Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng. (2024). How Far is Video Generation from World Model: A Physical Law Perspective. arXiv preprint arXiv:2411.02385.

[kaplan2020scaling] Kaplan, Jared, McCandlish, Sam, Henighan, Tom, Brown, Tom B, Chess, Benjamin, Child, Rewon, Gray, Scott, Radford, Alec, Wu, Jeffrey, Amodei, Dario. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

[Karniadakis2021Nature] Karniadakis, George E., Kevrekidis, Ioannis G., Lu, Lu, Perdikaris, Paris, Wang, Sifan, Yang, Liu. (2021). Physics-informed machine learning. Nature Reviews Physics.

[kay2017kinetics] Kay, Will, Carreira, Joao, Simonyan, Karen, Zhang, Brian, Hillier, Chloe, Vijayanarasimhan, Sudheendra, Viola, Fabio, Green, Tim, Back, Trevor, Natsev, Paul, others. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

[kaya2019deep] Kaya, Mahmut, Bilge, Hasan Şakir. (2019). Deep metric learning: A survey. Symmetry.

[kearns89] M. J. Kearns. (1989). Computational Complexity of Machine Learning.

[kellman_perception_1983] Kellman, Philip J., Spelke, Elizabeth S.. (1983). Perception of partly occluded objects in infancy. Cognitive Psychology. doi:10.1016/0010-0285(83)90017-8.

[kiani2022projunn] Kiani, Bobak, Balestriero, Randall, Lecun, Yann, Lloyd, Seth. (2022). projUNN: efficient method for training deep networks with unitary matrices. arXiv preprint arXiv:2203.05483.

[kibble1945extension] Kibble, WF. (1945). An extension of a theorem of Mehler's on Hermite polynomials. Mathematical Proceedings of the Cambridge Philosophical Society.

[kim1992gravity] Kim, In Kyeong, Spelke, Elizabeth S. (1992). Infants' sensitivity to effects of gravity on visible object motion.. Journal of Experimental Psychology: Human Perception and Performance.

[kim2018jigsaw] Dahun Kim, Donghyeon Cho, Donggeun Yoo, In So Kweon. (2018). Learning Image Representations by Completing Damaged Jigsaw Puzzles. WACV.

[kim2019deep] Kim, Byungsoo, Azevedo, Vinicius C, Thuerey, Nils, Kim, Theodore, Gross, Markus, Solenthaler, Barbara. (2019). Deep fluids: A generative network for parameterized fluid simulations. Computer graphics forum.

[kingma2014adam] Kingma, Diederik P, Ba, Jimmy. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[kipf2019contrastive] Kipf, Thomas, Van der Pol, Elise, Welling, Max. (2019). Contrastive learning of structured world models. arXiv preprint arXiv:1911.12247.

[kirkpatrick2021pushing] Kirkpatrick, James, McMorrow, Brendan, Turban, David HP, Gaunt, Alexander L, Spencer, James S, Matthews, Alexander GDG, Obika, Annette, Thiry, Louis, Fortunato, Meire, Pfau, David, others. (2021). Pushing the frontiers of density functionals by solving the fractional electron problem. Science.

[kochkov2021machine] Kochkov, Dmitrii, Smith, Jamie A, Alieva, Ayya, Wang, Qing, Brenner, Michael P, Hoyer, Stephan. (2021). Machine learning--accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences.

[kondor2018generalization] Kondor, Risi, Trivedi, Shubhendu. (2018). On the generalization of equivariance and convolution in neural networks to the action of compact groups. Proceedings of the International Conference on Machine Learning.

[kornblith2019similarity] Kornblith, Simon, Norouzi, Mohammad, Lee, Honglak, Hinton, Geoffrey. (2019). Similarity of neural network representations revisited. Proceedings of the International Conference on Machine Learning.

[krause20133d] Krause, Jonathan, Stark, Michael, Deng, Jia, Fei-Fei, Li. (2013). 3d object representations for fine-grained categorization. Proceedings of the IEEE international conference on computer vision workshops.

[lake_building_2016] Lake, Brenden M., Ullman, Tomer D., Tenenbaum, Joshua B., Gershman, Samuel J.. (2016). Building Machines That Learn and Think Like People. arXiv preprint arXiv:1604.00289.

[langley00] P. Langley. (2000). Crafting Papers on Machine Learning. Proceedings of the 17th International Conference on Machine Learning (ICML 2000).

[larsson2016colorization] Gustav Larsson, Michael Maire, Gregory Shakhnarovich. (2016). Learning Representations for Automatic Colorization. Proceedings of the European Conference on Computer Vision.

[le2011ica] Le, Quoc, Karpenko, Alexandre, Ngiam, Jiquan, Ng, Andrew. (2011). ICA with reconstruction cost for efficient overcomplete feature learning. Advances in Neural Information Processing Systems.

[lecun2022AMI] LeCun, Yann. (2022). A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review.

[lee2021cbyol] Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama, John Canny, Ian Fischer. (2021). Compressive Visual Representations. Advances in Neural Information Processing Systems.

[guo2022byol] Guo, Zhaohan, Thakoor, Shantanu, Pîslar, Miruna, others. (2022). Byol-explore: Exploration by bootstrapped prediction. Advances in neural information processing systems.

[lee2021predicting] Lee, Jason D, Lei, Qi, Saunshi, Nikunj, Zhuo, Jiacheng. (2021). Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems.

[lee_improving_2021] Lee, Hankook, Lee, Kibok, Lee, Kimin, Lee, Honglak, Shin, Jinwoo. (2021). Improving Transferability of Representations via Augmentation-Aware Self-Supervision. Advances in Neural Information Processing Systems.

[lerer2016learning] Lerer, Adam, Gross, Sam, Fergus, Rob. (2016). Learning physical intuition of block towers by example. Proceedings of the International Conference on Machine Learning.

[lerer_learning_2016] Lerer, Adam, Gross, Sam, Fergus, Rob. (2016). Learning Physical Intuition of Block Towers by Example.

[li2011concise] Li, Shengqiao. (2011). Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics.

[li2020efficient] Li, Jun, Fuxin, Li, Todorovic, Sinisa. (2020). Efficient riemannian optimization on the stiefel manifold via the cayley transform. arXiv preprint arXiv:2002.01113.

[li2020reaction] Li, Angran, Chen, Ruijia, Farimani, Amir Barati, Zhang, Yongjie Jessica. (2020). Reaction diffusion system prediction based on convolutional neural network. Scientific reports.

[li2021fourier] Zongyi Li, Nikola Kovachki, Kamyar Azizzadenesheli, Burigede Liu, Kaushik Bhattacharya, Andrew Stuart, Anima Anandkumar. (2021). Fourier Neural Operator for Parametric Partial Differential Equations.

[li2021pcl] Junnan Li, Pan Zhou, Caiming Xiong, Steven C.H. Hoi. (2021). Prototypical Contrastive Learning of Unsupervised Representations. International Conference on Learning Representations.

[li2021prefixtuning] Xiang Lisa Li, Percy Liang. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation.

[li2022esvit] Chunyuan Li, Jianwei Yang, Pengchuan Zhang, Mei Gao, Bin Xiao, Xiyang Dai, Lu Yuan, Jianfeng Gao. (2022). Efficient Self-supervised Vision Transformers for Representation Learning. International Conference on Learning Representations.

[li2022neural] Li, Zengyi, Chen, Yubei, LeCun, Yann, Sommer, Friedrich T. (2022). Neural Manifold Clustering and Embedding. arXiv preprint arXiv:2201.10000.

[li2022understanding] Li, Alexander C, Efros, Alexei A, Pathak, Deepak. (2022). Understanding Collapse in Non-contrastive Siamese Representation Learning. European Conference on Computer Vision.

[li_leveraging_nodate] Li, Xiaolong, Weng, Yijia, Yi, Li, Guibas, Leonidas, Abbott, A Lynn, Song, Shuran, Wang, He. Leveraging SE(3) Equivariance for Self-supervised Category-Level Object Pose Estimation.

[lin2014coco] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, Piotr Dollár. (2014). Microsoft coco: Common objects in context. Proceedings of the European Conference on Computer Vision.

[liu2009lie] Liu, Hanze, Li, Jibin, Zhang, Quanxin. (2009). Lie symmetry analysis and exact explicit solutions for general Burgers’ equation. Journal of Computational and Applied Mathematics.

[liu2022convnet] Liu, Zhuang, Mao, Hanzi, Wu, Chao-Yuan, Feichtenhofer, Christoph, Darrell, Trevor, Xie, Saining. (2022). A convnet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[lloyd1981infinitesimal] Lloyd, SP. (1981). The infinitesimal group of the Navier-Stokes equations. Acta Mechanica.

[logothetis1994view] Logothetis, NK, Pauls, Jon, Bülthoff, Heinrich H, Poggio, Tomaso. (1994). View-dependent object recognition by monkeys. Current biology.

[long2018pde] Long, Zichao, Lu, Yiping, Ma, Xianzhong, Dong, Bin. (2018). Pde-net: Learning pdes from data. Proceedings of the International Conference on Machine Learning.

[long_babyview_2024] Long, Bria, Xiang, Violet, Stojanov, Stefan, Sparks, Robert Z., Yin, Zi, Keene, Grace E., Tan, Alvin W. M., Feng, Steven Y., Zhuang, Chengxu, Marchman, Virginia A., Yamins, Daniel L. K., Frank, Michael C.. (2024). The BabyView dataset: High-resolution egocentric videos of infants' and young children's everyday experiences. arXiv preprint arXiv:2406.10447.

[loshchilov2017sgdr] Ilya Loshchilov, Frank Hutter. (2017). SGDR: stochastic gradient descent with warm restarts. International Conference on Learning Representations.

[lusch2018deep] Lusch, Bethany, Kutz, J Nathan, Brunton, Steven L. (2018). Deep learning for universal linear embeddings of nonlinear dynamics. Nature communications.

[MachineLearningI] . Machine Learning: An Artificial Intelligence Approach, Vol. I. (1983).

[marchetti2022equivariant] Marchetti, Giovanni Luca, Tegnér, Gustaf, Varava, Anastasiia, Kragic, Danica. (2022). Equivariant representation learning via class-pose decomposition. arXiv preprint arXiv:2207.03116.

[margoni_voe_2024] Margoni, Francesco, Surian, Luca, Baillargeon, Renée. (2024). The violation-of-expectation paradigm: A conceptual overview. Psychological Review. doi:10.1037/rev0000450.

[masci2015geodesic] Masci, Jonathan, Boscaini, Davide, Bronstein, Michael, Vandergheynst, Pierre. (2015). Geodesic convolutional neural networks on riemannian manifolds. Proceedings of the IEEE international conference on computer vision workshops.

[mathieu_disentangling_2016] Mathieu, Michael, Zhao, Junbo, Sprechmann, Pablo, Ramesh, Aditya, LeCun, Yann. (2016). Disentangling factors of variation in deep representations using adversarial training. arXiv preprint arXiv:1611.03383 [cs, stat].

[mclachlan2002splitting] McLachlan, Robert I, Quispel, G Reinout W. (2002). Splitting methods. Acta Numerica.

[mehler1866ueber] Mehler, F Gustav. (1866). *Ueber die Entwicklung einer Function von beliebig vielen Variablen nach Laplaceschen Functionen höherer Ordnung*.

[meicler_5-month-olds_1980] Meicler, Muriel, Gratch, Gerald. (1980). Do 5-month-olds show object conception in piaget's sense?. Infant Behavior and Development.

[mendes2007raising] Mendes, Natacha, Hanus, Daniel, Call, Josep. (2007). Raising the level: orangutans use water as a tool. Biology letters.

[meng2021multi] Meng, Xuhui, Babaee, Hessam, Karniadakis, George Em. (2021). Multi-fidelity Bayesian neural networks: Algorithms and applications. Journal of Computational Physics.

[mialon2022variance] Mialon, Gr'egoire, Balestriero, Randall, Lecun, Yann. (2022). Variance-covariance regularization enforces pairwise independence in self-supervised representations. arXiv preprint arXiv:2209.14905.

[miech2019howto100m] Miech, Antoine, Zhukov, Dimitri, Alayrac, Jean-Baptiste, Tapaswi, Makarand, Laptev, Ivan, Sivic, Josef. (2019). Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[misra2016shuffle] Ishan Misra, C. Lawrence Zitnick, Martial Hebert. (2016). Shuffle and Learn: Unsupervised Learning using Temporal Order Verification. Proceedings of the European Conference on Computer Vision.

[misra2020pirl] Misra, Ishan, Maaten, Laurens van der. (2020). Self-Supervised Learning of Pretext-Invariant Representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[mitchell80] T. M. Mitchell. (1980). The Need for Biases in Learning Generalizations.

[mmseg2020] MMSegmentation Contributors. (2020). MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark.

[moore_object_permanence_1978] Moore, M.Keith, Borton, Richard, Darby, Betty Lee. (1978). Visual tracking in young infants: Evidence for object identity or object permanence?. Journal of Experimental Child Psychology. doi:10.1016/0022-0965(78)90076-0.

[moravec1988mind] Moravec, Hans. (1988). Mind Children: The Future of Robot and Human Intelligence. Harvard University Press.

[motamed2025physicsiq] Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, Robert Geirhos. (2025). Do generative video models learn physical principles from watching videos?. arXiv preprint arXiv:2501.09038.

[muller_visual_1978] Muller, Alexandra Avdzej, Aslin, Richard N.. (1978). Visual tracking as an index of the object concept. Infant Behavior and Development. doi:10.1016/S0163-6383(78)80041-1.

[nair2010relu] Vinod Nair, Geoffrey E. Hinton. (2010). Rectified linear units improve restricted boltzmann machines. Proceedings of the International Conference on Machine Learning.

[nee1998limit] Nee, Janpou, Duan, Jinqiao. (1998). Limit set of trajectories of the coupled viscous Burgers' equations. Applied mathematics letters.

[Newell81] A. Newell, P. S. Rosenbloom. (1981). Mechanisms of Skill Acquisition and the Law of Practice. Cognitive Skills and Their Acquisition.

[ng2012multifidelity] Ng, Leo Wai-Tsun, Eldred, Michael. (2012). Multifidelity uncertainty quantification using non-intrusive polynomial chaos and stochastic collocation. 53rd aiaa/asme/asce/ahs/asc structures, structural dynamics and materials conference 20th aiaa/asme/ahs adaptive structures conference 14th aiaa.

[nguyen2023climax] Nguyen, Tung, Brandstetter, Johannes, Kapoor, Ashish, Gupta, Jayesh K, Grover, Aditya. (2023). ClimaX: A foundation model for weather and climate. arXiv preprint arXiv:2301.10343.

[nichol2021improved] Nichol, Alexander Quinn, Dhariwal, Prafulla. (2021). Improved Denoising Diffusion Probabilistic Models. Proceedings of the 38th International Conference on Machine Learning.

[olver1979symmetry] Olver, Peter J. (1979). Symmetry groups and group invariant solutions of partial differential equations. Journal of Differential Geometry.

[omori2017infinite] Omori, Hideki. (2017). Infinite-dimensional Lie groups.

[openai_gpt4_2024] OpenAI. (2024). GPT-4 Technical Report.

[oquab2023dinov2] Oquab, Maxime, Darcet, Timothée, others. (2023). Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.

[oquab2024dinov2] Maxime Oquab, Timothée Darcet, others. (2024). DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research.

[ozics2017similarity] Öziş, T. (2017). Similarity solutions to Burgers' equation in terms of special functions of mathematical physics. Acta Physica Polonica B.

[park_learning_2022] Park, Jung Yeon, Biza, Ondrej, Zhao, Linfeng, van de Meent, Jan Willem, Walters, Robin. (2022). Learning Symmetric Embeddings for Equivariant World Models. Proceedings of the International Conference on Machine Learning.

[pathak2016context] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros. (2016). Context Encoders: Feature Learning by Inpainting. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[pathak2017objects] Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, Bharath Hariharan. (2017). Learning Features by Watching Objects Move. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[perdikaris2017nonlinear] Perdikaris, Paris, Raissi, Maziar, Damianou, Andreas, Lawrence, Neil D, Karniadakis, George Em. (2017). Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[piaget1954construction] Piaget, Jean. (1954). The Construction of Reality in the Child.

[piczak2015dataset] Piczak, Karol J.. (2015). ESC: Dataset for Environmental Sound Classification. Proceedings of the 23rd Annual ACM Conference on Multimedia.

[piloto_intuitive_2022] Piloto, Luis S., Weinstein, Ari, Battaglia, Peter, Botvinick, Matthew. (2022). Intuitive physics learning in a deep-learning model inspired by developmental psychology. Nature Human Behaviour. doi:10.1038/s41562-022-01394-8.

[pokle2022contrasting] Pokle, Ashwini, Tian, Jinjin, Li, Yuchen, Risteski, Andrej. (2022). Contrasting the landscape of contrastive and non-contrastive learning. arXiv preprint arXiv:2203.15702.

[press2007numerical] Press, William H, Teukolsky, Saul A, Vetterling, William T, Flannery, Brian P. (2007). Numerical recipes 3rd edition: The art of scientific computing.

[radford2021learning] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, others. (2021). Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.

[RAISSI2019686] Maziar Raissi, Paris Perdikaris, George E. Karniadakis. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics.

[raissi2020hidden] Raissi, Maziar, Yazdani, Alireza, Karniadakis, George Em. (2020). Hidden fluid mechanics: Learning velocity and pressure fields from flow visualizations. Science.

[ramezanizadeh2019review] Ramezanizadeh, Mahdi, Ahmadi, Mohammad Hossein, Nazari, Mohammad Alhuyi, Sadeghzadeh, Milad, Chen, Lingen. (2019). A review on the utilized machine learning approaches for modeling the dynamic viscosity of nanofluids. Renewable and Sustainable Energy Reviews.

[rao1999predictivecoding] Rao, Rajesh PN, Ballard, Dana H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience.

[rasmussen2010gaussian] Rasmussen, Carl Edward, Nickisch, Hannes. (2010). Gaussian processes for machine learning (GPML) toolbox. The Journal of Machine Learning Research.

[reed2021selfaugment] Reed, Colorado J, Metzger, Sean, Srinivas, Aravind, Darrell, Trevor, Keutzer, Kurt. (2021). Selfaugment: Automatic augmentation policies for self-supervised learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[reid2024gemini] Reid, Machel, Savinov, Nikolay, Teplyashin, Denis, Lepikhin, Dmitry, Lillicrap, Timothy, Alayrac, Jean-baptiste, Soricut, Radu, Lazaridou, Angeliki, Firat, Orhan, Schrittwieser, Julian, others. (2024). Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.

[ren2015fasterrcnn] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems.

[repo] FAIR. (2025). Code and data for this work.

[richemond2020byolworks] Pierre H. Richemond, Jean-Bastien Grill, Florent Altché, Corentin Tallec, Florian Strub, Andrew Brock, Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, Michal Valko. (2020). BYOL works even without batch statistics. arXiv preprint arXiv:2010.10241.

[richter2022neural] Richter-Powell, Jack, Lipman, Yaron, Chen, Ricky TQ. (2022). Neural conservation laws: A divergence-free perspective. arXiv preprint arXiv:2210.01741.

[riochet2020occlusion] Riochet, Ronan, Sivic, Josef, Laptev, Ivan, Dupoux, Emmanuel. (2020). Occlusion resistant learning of intuitive physics from videos. arXiv preprint arXiv:2005.00069.

[riochet_intphys_2022] Riochet, Ronan, Castro, Mario Ynocente, Bernard, Mathieu, Lerer, Adam, Fergus, Rob, Izard, Véronique, Dupoux, Emmanuel. (2022). IntPhys 2019: A Benchmark for Visual Intuitive Physics Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence. doi:10.1109/TPAMI.2021.3083839.

[roy2007effective] Roy, Olivier, Vetterli, Martin. (2007). The effective rank: A measure of effective dimensionality. 2007 15th European signal processing conference.

[rudy2017data] Rudy, Samuel H, Brunton, Steven L, Proctor, Joshua L, Kutz, J Nathan. (2017). Data-driven discovery of partial differential equations. Science advances.

[sajnani_condor_2022] Sajnani, Rahul, Poulenard, Adrien, Jain, Jivitesh, Dua, Radhika, Guibas, Leonidas J., Sridhar, Srinath. (2022). ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes. arXiv preprint arXiv:2201.07788 [cs].

[Samuel59] A. L. Samuel. (1959). Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development.

[santurkar2018does] Santurkar, Shibani, Tsipras, Dimitris, Ilyas, Andrew, Madry, Aleksander. (2018). How does batch normalization help optimization?. Advances in neural information processing systems.

[sariyildiz2023improving] Sariyildiz, Mert Bulent, Kalantidis, Yannis, Alahari, Karteek, Larlus, Diane. (2023). No Reason for No Supervision: Improved Generalization in Supervised Models. International Conference on Learning Representations.

[satorras2021n] Satorras, Víctor Garcia, Hoogeboom, Emiel, Welling, Max. (2021). E(n) equivariant graph neural networks. Proceedings of the International Conference on Machine Learning.

[saunshi2022understanding] Saunshi, Nikunj, Ash, Jordan, Goel, Surbhi, Misra, Dipendra, Zhang, Cyril, Arora, Sanjeev, Kakade, Sham, Krishnamurthy, Akshay. (2022). Understanding contrastive learning requires incorporating inductive biases. arXiv preprint arXiv:2202.14037.

[scherr2022selfsupervised] Franz Scherr, Qinghai Guo, Timoleon Moraitis. (2022). Self-Supervised Learning Through Efference Copies. Advances in Neural Information Processing Systems.

[schmarje2021survey] Schmarje, Lars, Santarossa, Monty, Schröder, Simon-Martin, Koch, Reinhard. (2021). A survey on semi-, self- and unsupervised learning for image classification. IEEE Access.

[schmid2010dynamic] Schmid, Peter J. (2010). Dynamic mode decomposition of numerical and experimental data. Journal of fluid mechanics.

[schott_visual_2021] Schott, Lukas, von Kügelgen, Julius, Träuble, Frederik, Gehler, Peter, Russell, Chris, Bethge, Matthias, Schölkopf, Bernhard, Locatello, Francesco, Brendel, Wieland. (2021). Visual Representation Learning Does Not Generalize Strongly Within the Same Domain. arXiv preprint arXiv:2107.08221 [cs].

[shakerinava2022structuring] Mehran Shakerinava, Arnab Kumar Mondal, Siamak Ravanbakhsh. (2022). Structuring Representations Using Group Invariants. Advances in Neural Information Processing Systems.

[shannon1948mathematical] Shannon, Claude Elwood. (1948). A mathematical theory of communication. The Bell system technical journal.

[shen2022connect] Shen, Kendrick, Jones, Robbie, Kumar, Ananya, Xie, Sang Michael, HaoChen, Jeff Z, Ma, Tengyu, Liang, Percy. (2022). Connect, Not Collapse: Explaining Contrastive Learning for Unsupervised Domain Adaptation. arXiv preprint arXiv:2204.00570.

[siarohin2019whitening] Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe. (2019). Whitening and coloring transform for GANs. International Conference on Learning Representations.

[singer2015object] Singer, Rebecca, Henderson, Elizabeth. (2015). Object permanence in marine mammals using the violation of expectation procedure. Behavioural Processes.

[smith_adept_2019] Smith, Kevin, Mei, Lingjie, Yao, Shunyu, Wu, Jiajun, Spelke, Elizabeth, Tenenbaum, Josh, Ullman, Tomer. (2019). Modeling expectation violation in intuitive physics with coarse probabilistic object representations. Advances in neural information processing systems.

[smith_uncertainty_intphys_2013] Smith, Kevin A., Vul, Edward. (2013). Sources of Uncertainty in Intuitive Physics. Topics in Cognitive Science. doi:10.1111/tops.12009.

[spelke1985voe_looking] Spelke, Elizabeth S. (1985). Preferential-looking methods as tools for the study of cognition in infancy.. Measurement of audition and vision in the first year of postnatal life: A methodological overview.

[spelke_core_knowledge_2007] Spelke, Elizabeth S., Kinzler, Katherine D.. (2007). Core knowledge. Developmental Science. doi:10.1111/j.1467-7687.2007.00569.x.

[spelke_core_2000] Spelke, Elizabeth S.. (2000). Core knowledge. American Psychologist. doi:10.1037/0003-066X.55.11.1233.

[spelke_origins_1992] Spelke, Elizabeth S., Breinlinger, Karen, Macomber, Janet, Jacobson, Kristen. (1992). Origins of knowledge.. Psychological Review. doi:10.1037/0033-295X.99.4.605.

[spelke_spatiotemporal_1995] Spelke, Elizabeth S., Kestenbaum, Roberta, Simons, Daniel J., Wein, Debra. (1995). Spatiotemporal continuity, smoothness of motion and object identity in infancy. British Journal of Developmental Psychology. doi:10.1111/j.2044-835X.1995.tb00669.x.

[stahl_observing_2015] Stahl, Aimee E., Feigenson, Lisa. (2015). Observing the unexpected enhances infants’ learning and exploration. Science. doi:10.1126/science.aaa3799.

[su2021rope] Su, Jianlin, Lu, Yu, Pan, Shengfeng, Wen, Bo, Liu, Yunfeng. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.

[sullivan2021saycam] Sullivan, Jessica, Mei, Michelle, Perfors, Andrew, Wojcik, Erica, Frank, Michael C. (2021). SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective. Open mind.

[sun_canonical_nodate] Sun, Weiwei, Tagliasacchi, Andrea, Deng, Boyang, Sabour, Sara, Yazdani, Soroosh, Hinton, Geoffrey. Canonical Capsules: Self-Supervised Capsules in Canonical Pose.

[suzuki1991general] Suzuki, Masuo. (1991). General theory of fractal path integrals with applications to many-body theories and statistical physics. Journal of Mathematical Physics.

[syah2021implementation] Syah, Rahmad, Ahmadian, Naeim, Elveny, Marischa, Alizadeh, SM, Hosseini, Meysam, Khan, Afrasyab. (2021). Implementation of artificial intelligence and support vector machine learning to estimate the drilling fluid density in high-pressure high-temperature wells. Energy Reports.

[takamoto2022pdebench] Takamoto, Makoto, Praditia, Timothy, Leiteritz, Raphael, MacKinlay, Daniel, Alesiani, Francesco, Pflüger, Dirk, Niepert, Mathias. (2022). PDEBench: An extensive benchmark for scientific machine learning. Advances in Neural Information Processing Systems.

[tan_devbench_2024] Tan, Alvin Wei Ming, Yu, Sunny, Long, Bria, Ma, Wanjing Anya, Murray, Tonya, Silverman, Rebecca D., Yeatman, Jason D., Frank, Michael C.. (2024). DevBench: A Multimodal Developmental Benchmark for Language Learning.

[tao2021unigrad] Tao, Chenxin, Wang, Honghui, Zhu, Xizhou, Dong, Jiahua, Song, Shiji, Huang, Gao, Dai, Jifeng. (2021). Exploring the Equivalence of Siamese Self-Supervised Learning via A Unified Gradient Framework. arXiv preprint arXiv:2112.05141.

[taylor2012new] Taylor, Alex H, Miller, Rachael, Gray, Russell D. (2012). New Caledonian crows reason about hidden causal agents. Proceedings of the National Academy of Sciences.

[thomas2018tensor] Thomas, Nathaniel, Smidt, Tess, Kearnes, Steven, Yang, Lusann, Li, Li, Kohlhoff, Kai, Riley, Patrick. (2018). Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds. arXiv preprint arXiv:1802.08219.

[thrush_winoground_2022] Thrush, Tristan, Jiang, Ryan, Bartolo, Max, Singh, Amanpreet, Williams, Adina, Kiela, Douwe, Ross, Candace. (2022). Winoground: Probing vision and language models for visio-linguistic compositionality. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[tian2019cmc] Yonglong Tian, Dilip Krishnan, Phillip Isola. (2019). Contrastive multiview coding. arXiv preprint arXiv:1906.05849.

[tian2020makes] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola. (2020). What makes for good views for contrastive learning. Advances in Neural Information Processing Systems.

[tian2021understanding] Yuandong Tian, Xinlei Chen, Surya Ganguli. (2021). Understanding self-supervised Learning Dynamics without Contrastive Pairs. arXiv preprint arXiv:2102.06810.

[tomasev2022relicv2] Nenad Tomasev, Ioana Bica, Brian McWilliams, Lars Buesing, Razvan Pascanu, Charles Blundell, Jovana Mitrovic. (2022). Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?. arXiv preprint arXiv:2201.05119.

[trotter1959product] Trotter, Hale F. (1959). On the product of semi-groups of operators. Proceedings of the American Mathematical Society.

[tsai2021note] Tsai, Yao-Hung Hubert, Bai, Shaojie, Morency, Louis-Philippe, Salakhutdinov, Ruslan. (2021). A note on connecting barlow twins with negative-sample-free contrastive learning. arXiv preprint arXiv:2104.13712.

[ullman2017mind] Ullman, Tomer D, Spelke, Elizabeth, Battaglia, Peter, Tenenbaum, Joshua B. (2017). Mind games: Game engines as an architecture for intuitive physics. Trends in cognitive sciences.

[ullman_mindgames_2017] Ullman, Tomer D., Spelke, Elizabeth, Battaglia, Peter, Tenenbaum, Joshua B.. (2017). Mind Games: Game Engines as an Architecture for Intuitive Physics. Trends in Cognitive Sciences. doi:10.1016/j.tics.2017.05.012.

[ulyanov2018deep] Ulyanov, Dmitry, Vedaldi, Andrea, Lempitsky, Victor. (2018). Deep image prior. Proceedings of the IEEE conference on computer vision and pattern recognition.

[valenza_perceptual_2006] Valenza, Eloisa, Leo, Irene, Gava, Lucia, Simion, Francesca. (2006). Perceptual Completion in Newborn Human Infants. Child Development.

[vallortigara2012core] Vallortigara, Giorgio. (2012). Core knowledge of object, number, and geometry: A comparative and neural approach. Cognitive neuropsychology.

[vanhorni2018naturalist] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie. (2018). The inaturalist species classification and detection dataset. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[vinuesa2022enhancing] Vinuesa, Ricardo, Brunton, Steven L. (2022). Enhancing computational fluid dynamics with machine learning. Nature Computational Science.

[voc07] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., Zisserman, A.. (2007). The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results.

[von_kugelgen_self-supervised_2021] von Kügelgen, Julius, Sharma, Yash, Gresele, Luigi, Brendel, Wieland, Schölkopf, Bernhard, Besserve, Michel, Locatello, Francesco. (2021). Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. arXiv preprint arXiv:2106.04619 [cs, stat].

[wang2015videos] Xiaolong Wang, Abhinav Gupta. (2015). Unsupervised Learning of Visual Representations using Videos. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[wang2017transitive] Xiaolong Wang, Kaiming He, Abhinav Gupta. (2017). Transitive Invariance for Self-supervised Visual Representation Learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[wang2020incorporating] Wang, Rui, Walters, Robin, Yu, Rose. (2020). Incorporating symmetry into deep dynamics models for improved generalization. arXiv preprint arXiv:2002.03061.

[wang2020understanding] Wang, Tongzhou, Isola, Phillip. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. Proceedings of the International Conference on Machine Learning.

[wang2021fast] Wang, Sifan, Bhouri, Mohamed Aziz, Perdikaris, Paris. (2021). Fast PDE-constrained optimization via self-supervised operator learning. arXiv preprint arXiv:2110.13297.

[wang2021twist] Feng Wang, Tao Kong, Rufeng Zhang, Huaping Liu, Hang Li. (2021). Self-Supervised Learning by Estimating Twin Class Distributions.

[wang2021unsupervised] Wang, Hanchen, Liu, Qi, Yue, Xiangyu, Lasenby, Joan, Kusner, Matt J. (2021). Unsupervised point cloud pre-training via occlusion completion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[wang2022cp2] Feng Wang, Huiyu Wang, Chen Wei, Alan Yuille, Wei Shen. (2022). CP2: Copy-Paste Contrastive Pretraining for Semantic Segmentation. arXiv preprint arXiv:2203.11709.

[wang_2024_qwen] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin. (2024). Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191.

[wang_self-supervised_2019] Wang, Yude, Zhang, Jie, Kan, Meina, Shan, Shiguang, Chen, Xilin. (2019). Self-supervised Scale Equivariant Network for Weakly Supervised Semantic Segmentation.

[wang_videomaev2_2023] Wang, Limin, Huang, Bingkun, Zhao, Zhiyu, Tong, Zhan, He, Yinan, Wang, Yi, Wang, Yali, Qiao, Yu. (2023). VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[watters_visual_interaction_networks_2017] Watters, Nicholas, Tacchetti, Andrea, Weber, Theophane, Pascanu, Razvan, Battaglia, Peter, Zoran, Daniel. (2017). Visual Interaction Networks. arXiv preprint arXiv:1706.01433.

[wei2022instruction] Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le. (2022). Finetuned Language Models Are Zero-Shot Learners.

[wei2023cot] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou. (2023). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint arXiv:2201.11903.

[Weihs_InfLevel_2022] Luca Weihs, Amanda Rose Yuile, Renée Baillargeon, Cynthia Fisher, Gary Marcus, Roozbeh Mottaghi, Aniruddha Kembhavi. (2022). Benchmarking Progress to Infant-Level Physical Reasoning in AI. TMLR.

[wen2021toward] Wen, Zixin, Li, Yuanzhi. (2021). Toward understanding the feature learning process of self-supervised contrastive learning. Proceedings of the International Conference on Machine Learning.

[wilcox1999constancy] Wilcox, Teresa. (1999). Object individuation: Infants’ use of shape, size, pattern, and color. Cognition.

[wilcox2004constancy] Wilcox, Teresa, Chapa, Catherine. (2004). Priming infants to attend to color and pattern information in an individuation task. Cognition.

[winter2022unsupervised] Robin Winter, Marco Bertolini, Tuan Le, Frank Noé, Djork-Arné Clevert. (2022). Unsupervised Learning of Group Invariant and Equivariant Representations. Advances in Neural Information Processing Systems.

[wood2013newborn] Wood, Justin N. (2013). Newborn chickens generate invariant object representations at the onset of visual object experience. Proceedings of the National Academy of Sciences.

[wu2018discrimination] Zhirong Wu, Yuanjun Xiong, Stella Yu, Dahua Lin. (2018). Unsupervised feature learning via non-parametric instance discrimination. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[wu2019detectron2] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, Ross Girshick. (2019). Detectron2.

[wu2022non] Wu, Pin, Qiu, Feng, Feng, Weibing, Fang, Fangxing, Pain, Christopher. (2022). A non-intrusive reduced order model with transformer neural network and its application. Physics of Fluids.

[xiao2010sun] Xiao, Jianxiong, Hays, James, Ehinger, Krista A, Oliva, Aude, Torralba, Antonio. (2010). Sun database: Large-scale scene recognition from abbey to zoo. 2010 IEEE computer society conference on computer vision and pattern recognition.

[xiao2018upernet] Xiao, Tete, Liu, Yingcheng, Zhou, Bolei, Jiang, Yuning, Sun, Jian. (2018). Unified Perceptual Parsing for Scene Understanding. Proceedings of the European Conference on Computer Vision.

[xiao_what_2021] Xiao, Tete, Wang, Xiaolong, Efros, Alexei A., Darrell, Trevor. (2021). What Should Not Be Contrastive in Contrastive Learning.

[xie2016clustering] Junyuan Xie, Ross Girshick, Ali Farhadi. (2016). Unsupervised Deep Embedding for Clustering Analysis. Proceedings of the International Conference on Machine Learning.

[xie2017resnext] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He. (2017). Aggregated Residual Transformations for Deep Neural Networks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[xie2021pixpro] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, Han Hu. (2021). Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[xie2022should] Xie, Yuyang, Wen, Jianhong, Lau, Kin Wai, Rehman, Yasar Abbas Ur, Shen, Jiajun. (2022). What Should Be Equivariant in Self-Supervised Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[xie2022simmim] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, Han Hu. (2022). SimMIM: A Simple Framework for Masked Image Modeling.

[xiong_self-supervised_2021] Xiong, Yuwen, Ren, Mengye, Zeng, Wenyuan, Urtasun, Raquel. (2021). Self-Supervised Representation Learning from Flow Equivariance. arXiv preprint arXiv:2101.06553 [cs].

[yan2020clusterfit] Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, Dhruv Mahajan. (2020). ClusterFit: Improving Generalization of Visual Representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[yang2016joint] Jianwei Yang, Devi Parikh, Dhruv Batra. (2016). Joint Unsupervised Learning of Deep Representations and Image Clusters. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[yang2020improved] Yang, Jianyi, Anishchenko, Ivan, Park, Hahnbeom, Peng, Zhenling, Ovchinnikov, Sergey, Baker, David. (2020). Improved protein structure prediction using predicted interresidue orientations. Proceedings of the National Academy of Sciences.

[yang2021insloc] Ceyuan Yang, Zhirong Wu, Bolei Zhou, Stephen Lin. (2021). Instance Localization for Self-supervised Detection Pretraining. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[yang2022inscon] Junwei Yang, Ke Zhang, Zhaolin Cui, Jinming Su, Junfeng Luo, Xiaolin Wei. (2022). InsCon: Instance Consistency Feature Representation via Self-Supervised Learning. arXiv preprint arXiv:2203.07688.

[yang2023unisim] Yang, Mengjiao, Du, Yilun, Ghasemipour, Kamyar, Tompson, Jonathan, Schuurmans, Dale, Abbeel, Pieter. (2023). Learning Interactive Real-World Simulators. arXiv preprint arXiv:2310.06114.

[ye2019spreading] Mang Ye, Xu Zhang, Pong C Yuen, Shih-Fu Chang. (2019). Unsupervised embedding learning via invariant and spreading instance feature. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[yeh2021decoupled] Yeh, Chun-Hsiao, Hong, Cheng-Yao, Hsu, Yen-Chi, Liu, Tyng-Luh, Chen, Yubei, LeCun, Yann. (2021). Decoupled Contrastive Learning. arXiv preprint arXiv:2110.06848.

[you2017lars] Yang You, Igor Gitman, Boris Ginsburg. (2017). Large Batch Training of Convolutional Networks. arXiv preprint arXiv:1708.03888.

[yun2019cutmix] Yun, Sangdoo, Han, Dongyoon, Oh, Seong Joon, Chun, Sanghyuk, Choe, Junsuk, Yoo, Youngjoon. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[zagoruyko2016resnetwide] Sergey Zagoruyko, Nikos Komodakis. (2016). Wide Residual Networks. arXiv preprint arXiv:1605.07146.

[zaheer2017deep] Zaheer, Manzil, Kottur, Satwik, Ravanbakhsh, Siamak, Poczos, Barnabas, Salakhutdinov, Russ R, Smola, Alexander J. (2017). Deep sets. Advances in neural information processing systems.

[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, Stéphane. (2021). Barlow twins: Self-supervised learning via redundancy reduction. Proceedings of the International Conference on Machine Learning.

[zeidler2012applied] Zeidler, Eberhard. (2012). Applied functional analysis: main principles and their applications.

[zhang2016colorful] Richard Zhang, Phillip Isola, Alexei A. Efros. (2016). Colorful image colorization. Proceedings of the European Conference on Computer Vision.

[zhang2017split] Richard Zhang, Phillip Isola, Alexei A. Efros. (2017). Split-brain autoencoders: Unsupervised learning by crosschannel prediction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[zhang2018mixup] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz. (2018). mixup: Beyond Empirical Risk Minimization. International Conference on Learning Representations.

[zhang2022dual] Zhang, Chaoning, Zhang, Kang, Pham, Trung X, Niu, Axi, Qiao, Zhinan, Yoo, Chang D, Kweon, In So. (2022). Dual Temperature Helps Contrastive Learning Without Many Negative Samples: Towards Understanding and Simplifying MoCo. arXiv preprint arXiv:2203.17248.

[zhang2023instruction] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, Guoyin Wang. (2023). Instruction Tuning for Large Language Models: A Survey.

[zhang_aet_2019] Zhang, Liheng, Qi, Guo-Jun, Wang, Liqiang, Luo, Jiebo. (2019). AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations Rather Than Data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. doi:10.1109/CVPR.2019.00265.

[zhou2014places] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, Aude Oliva. (2014). Learning Deep Features for Scene Recognition using Places Database. Advances in Neural Information Processing Systems.

[zhou2019ade20k] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, Antonio Torralba. (2019). Semantic understanding of scenes through the ADE20K dataset. IJCV.

[zhou2020graph] Zhou, Jie, Cui, Ganqu, Hu, Shengding, Zhang, Zhengyan, Yang, Cheng, Liu, Zhiyuan, Wang, Lifeng, Li, Changcheng, Sun, Maosong. (2020). Graph neural networks: A review of methods and applications. AI open.

[zhou2022ibot] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, Tao Kong. (2022). iBOT: Image BERT Pre-Training with Online Tokenizer. International Conference on Learning Representations.

[zhou2022mugs] Zhou, Pan, Zhou, Yichen, Si, Chenyang, Yu, Weihao, Ng, Teck Khim, Yan, Shuicheng. (2022). Mugs: A multi-granular self-supervised learning framework. arXiv preprint arXiv:2203.14415.

[zhuang2019local] Chengxu Zhuang, Alex Lin Zhai, Daniel Yamins. (2019). Local Aggregation for Unsupervised Learning of Visual Embeddings. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[zhuang_ventral_2021] Chengxu Zhuang, Siming Yan, Aran Nayebi, Martin Schrimpf, Michael C. Frank, James J. DiCarlo, Daniel L. K. Yamins. (2021). Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences. doi:10.1073/pnas.2014196118.

[zimmermann2021contrastive] Zimmermann, Roland S, Sharma, Yash, Schneider, Steffen, Bethge, Matthias, Brendel, Wieland. (2021). Contrastive learning inverts the data generating process. Proceedings of the International Conference on Machine Learning.

[he2020moco] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum Contrast for Unsupervised Visual Representation Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[caron2020swav] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in neural information processing systems.

[oord2018infonce] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

[he2022masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[garrido2023duality] Quentin Garrido, Yubei Chen, Adrien Bardes, Laurent Najman, Yann LeCun. (2023). On the duality between contrastive and non-contrastive self-supervised learning. The Eleventh International Conference on Learning Representations.

[loshchilov2018adamw] Ilya Loshchilov, Frank Hutter. (2019). Decoupled Weight Decay Regularization. International Conference on Learning Representations.

[arjovsky_wasserstein_2017] Arjovsky, Martin, Chintala, Soumith, Bottou, Léon. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.

[gulrajani_improved_2017] Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, Courville, Aaron. (2017). Improved Training of Wasserstein GANs. arXiv preprint arXiv:1704.00028.

[wang_solving_2021] Wang, Guangrun, Wang, Keze, Wang, Guangcong, Torr, Philip H. S., Lin, Liang. (2021). Solving Inefficiency of Self-supervised Representation Learning. arXiv preprint arXiv:2104.08760.

[caron_deep_2019] Caron, Mathilde, Bojanowski, Piotr, Joulin, Armand, Douze, Matthijs. (2019). Deep Clustering for Unsupervised Learning of Visual Features. arXiv preprint arXiv:1807.05520.

[shwartz-ziv_opening_2017] Shwartz-Ziv, Ravid, Tishby, Naftali. (2017). Opening the Black Box of Deep Neural Networks via Information. arXiv preprint arXiv:1703.00810.

[jing_masked_2022] Jing, Li, Zhu, Jiachen, LeCun, Yann. (2022). Masked Siamese ConvNets. doi:10.48550/arXiv.2206.07700.

[moutakanni_2024_augs] Moutakanni, Théo, others. (2024). You Don’t Need Domain-Specific Data Augmentations When Scaling Self-Supervised Learning. Advances in Neural Information Processing Systems.

[littwin_how_2024] Littwin, Etai, Saremi, Omid, Advani, Madhu, Thilak, Vimal, Nakkiran, Preetum, Huang, Chen, Susskind, Joshua. (2024). How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks. doi:10.48550/arXiv.2407.03475.

[balestriero_learning_2024] Balestriero, Randall, LeCun, Yann. (2024). Learning by Reconstruction Produces Uninformative Features For Perception. doi:10.48550/arXiv.2402.11337.

[sermanet_time-contrastive_2018] Sermanet, Pierre, Lynch, Corey, Chebotar, Yevgen, Hsu, Jasmine, Jang, Eric, Schaal, Stefan, Levine, Sergey. (2018). Time-Contrastive Networks: Self-Supervised Learning from Video. doi:10.48550/arXiv.1704.06888.

[vincent2010stacked] Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research.

[tong_videomae_2022] Tong, Zhan, Song, Yibing, Wang, Jue, Wang, Limin. (2022). VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training. Advances in Neural Information Processing Systems.

[yu_magvit_nodate] Yu, Lijun, Cheng, Yong, Sohn, Kihyuk, Lezama, Jose, Zhang, Han, Chang, Huiwen, Hauptmann, Alexander G, Yang, Ming-Hsuan, Hao, Yuan, Essa, Irfan, Jiang, Lu. MAGVIT: Masked Generative Video Transformer.

[yu2023magvit] Yu, Lijun, Cheng, Yong, Sohn, Kihyuk, Lezama, José, others. (2023). MAGVIT: Masked generative video transformer. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[feichtenhofer2022masked] Christoph Feichtenhofer, Haoqi Fan, Yanghao Li, Kaiming He. (2022). Masked Autoencoders As Spatiotemporal Learners. Advances in Neural Information Processing Systems.

[zhao2024videoprism] Zhao, Long, Gundavarapu, Nitesh Bharadwaj, Yuan, Liangzhe, Zhou, Hao, Yan, Shen, Sun, Jennifer J., Friedman, Luke, Qian, Rui, Weyand, Tobias, Zhao, Yue, Hornung, Rachel, Schroff, Florian, Yang, Ming-Hsuan, Ross, David A, Wang, Huisheng, Adam, Hartwig, Sirotenko, Mikhail, Liu, Ting, Gong, Boqing. (2024). VideoPrism: A Foundational Visual Encoder for Video Understanding. Proceedings of the 41st International Conference on Machine Learning.

[carreira_scaling_2024] Carreira, João, Gokay, Dilara, King, Michael, Zhang, Chuhan, Rocco, Ignacio, Mahendran, Aravindh, Keck, Thomas Albert, Heyward, Joseph, Koppula, Skanda, Pot, Etienne, Erdogan, Goker, Hasson, Yana, Yang, Yi, Greff, Klaus, Moing, Guillaume Le, Steenkiste, Sjoerd van, Zoran, Daniel, Hudson, Drew A., Vélez, Pedro, Polanía, Luisa, Friedman, Luke, Duvarney, Chris, Goroshin, Ross, Allen, Kelsey, Walker, Jacob, Kabra, Rishabh, Aboussouan, Eric, Sun, Jennifer, Kipf, Thomas, Doersch, Carl, Pătrăucean, Viorica, Damen, Dima, Luc, Pauline, Sajjadi, Mehdi S. M., Zisserman, Andrew. (2024). Scaling 4D Representations. doi:10.48550/arXiv.2412.15212.

[qian_spatiotemporal_2021] Qian, Rui, Meng, Tianjian, Gong, Boqing, Yang, Ming-Hsuan, Wang, Huisheng, Belongie, Serge, Cui, Yin. (2021). Spatiotemporal Contrastive Video Representation Learning. arXiv preprint arXiv:2008.03800.

[ranasinghe_self-supervised_2021] Ranasinghe, Kanchana, Naseer, Muzammal, Khan, Salman, Khan, Fahad Shahbaz, Ryoo, Michael. (2021). Self-supervised Video Transformer. arXiv preprint arXiv:2112.01514.

[recasens_broaden_2021] Recasens, Adrià, Luc, Pauline, Alayrac, Jean-Baptiste, Wang, Luyu, Strub, Florian, Tallec, Corentin, Malinowski, Mateusz, Patraucean, Viorica, Altché, Florent, Valko, Michal, Grill, Jean-Bastien, Oord, Aäron van den, Zisserman, Andrew. (2021). Broaden Your Views for Self-Supervised Video Learning. arXiv preprint arXiv:2103.16559.

[schneider_wav2vec_2019] Schneider, Steffen, Baevski, Alexei, Collobert, Ronan, Auli, Michael. (2019). wav2vec: Unsupervised Pre-training for Speech Recognition. doi:10.48550/arXiv.1904.05862.

[huang_masked_2023] Huang, Po-Yao, Xu, Hu, Li, Juncheng, Baevski, Alexei, Auli, Michael, Galuba, Wojciech, Metze, Florian, Feichtenhofer, Christoph. (2023). Masked Autoencoders that Listen. doi:10.48550/arXiv.2207.06405.

[baade_mae-ast_2022] Baade, Alan, Peng, Puyuan, Harwath, David. (2022). MAE-AST: Masked Autoencoding Audio Spectrogram Transformer. doi:10.48550/arXiv.2203.16691.

[moutakanni_advancing_2024] Moutakanni, Théo, Bojanowski, Piotr, Chassagnon, Guillaume, Hudelot, Céline, Joulin, Armand, LeCun, Yann, Muckley, Matthew, Oquab, Maxime, Revel, Marie-Pierre, Vakalopoulou, Maria. (2024). Advancing human-centric AI for robust X-ray analysis through holistic self-supervised learning. doi:10.48550/arXiv.2405.01469.

[parker_astroclip_2024] Parker, Liam, Lanusse, Francois, Golkar, Siavash, Sarra, Leopoldo, Cranmer, Miles, Bietti, Alberto, Eickenberg, Michael, Krawezik, Geraud, McCabe, Michael, Ohana, Ruben, Pettee, Mariel, Blancard, Bruno Regaldo-Saint, Tesileanu, Tiberiu, Cho, Kyunghyun, Ho, Shirley. (2024). AstroCLIP: a cross-modal foundation model for galaxies. Monthly Notices of the Royal Astronomical Society. doi:10.1093/mnras/stae1450.

[Pearson1901PCA] Karl Pearson. (1901). LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. doi:10.1080/14786440109462720.

[hotelling1933pca] Hotelling, Harold. (1933). Analysis of a complex of statistical variables into principal components.. Journal of educational psychology.

[van2008tsne] Van der Maaten, Laurens, Hinton, Geoffrey. (2008). Visualizing data using t-SNE.. Journal of machine learning research.

[moon2019visualizing] Moon, Kevin R, Van Dijk, David, Wang, Zheng, Gigante, Scott, Burkhardt, Daniel B, Chen, William S, Yim, Kristina, Elzen, Antonia van den, Hirn, Matthew J, Coifman, Ronald R, others. (2019). Visualizing structure and transitions in high-dimensional biological data. Nature biotechnology.

[hochgerner2018conserved] Hochgerner, Hannah, Zeisel, Amit, Lönnerberg, Peter, Linnarsson, Sten. (2018). Conserved properties of dentate gyrus neurogenesis across postnatal development revealed by single-cell RNA sequencing. Nature neuroscience.

[bastidas2019comprehensive] Bastidas-Ponce, Aimée, others. (2019). Comprehensive single cell mRNA profiling reveals a detailed roadmap for pancreatic endocrinogenesis. Development.

[cerletti2020fate] Cerletti, Dario, Sandu, Ioana, Gupta, Revant, Oxenius, Annette, Claassen, Manfred. (2020). Fate trajectories of CD8+ T cells in chronic LCMV infection. bioRxiv.

[mcinnes2018umap] McInnes, Leland, Healy, John, Melville, James. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.

[sainburg2021parametric] Sainburg, Tim, McInnes, Leland, Gentner, Timothy Q. (2021). Parametric UMAP embeddings for representation and semisupervised learning. Neural Computation.

[damrich_umaps_2021] Damrich, Sebastian, Hamprecht, Fred A. (2021). On UMAP's True Loss Function. arXiv preprint arXiv:2103.14608.

[damrich2023from] Sebastian Damrich, Jan Niklas Böhm, Fred A. Hamprecht, Dmitry Kobak. (2023). From $t$-SNE to UMAP with contrastive learning. The Eleventh International Conference on Learning Representations.

[williams1992reinforce] Williams, Ronald J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning.

[van2017vqvae] Van Den Oord, Aaron, Vinyals, Oriol, others. (2017). Neural discrete representation learning. Advances in neural information processing systems.

[gersho2012vector] Gersho, Allen, Gray, Robert M. (2012). Vector quantization and signal compression.

[bengio2013estimating] Bengio, Yoshua, Léonard, Nicholas, Courville, Aaron. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

[mentzer2023finite] Mentzer, Fabian, Minnen, David, Agustsson, Eirikur, Tschannen, Michael. (2023). Finite scalar quantization: VQ-VAE made simple. arXiv preprint arXiv:2309.15505.

[esser2021taming] Esser, Patrick, Rombach, Robin, Ommer, Bjorn. (2021). Taming transformers for high-resolution image synthesis. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[rombach2022high] Rombach, Robin, Blattmann, Andreas, Lorenz, Dominik, Esser, Patrick, Ommer, Björn. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[amodio2019exploring] Amodio, Matthew, Van Dijk, David, Srinivasan, Krishnan, Chen, William S, Mohsen, Hussein, Moon, Kevin R, Campbell, Allison, Zhao, Yujiao, Wang, Xiaomei, Venkataswamy, Manjunatha, others. (2019). Exploring single-cell data with deep multitasking neural networks. Nature methods.

[garrido2022visualizing] Garrido, Quentin, Damrich, Sebastian, Jäger, Alexander, others. (2022). Visualizing hierarchies in scRNA-seq data using a density tree-biased autoencoder. Bioinformatics.

[vincent2008extracting] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th international conference on Machine learning.

[devlin2019bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers).

[dwibedi2021little] Dwibedi, Debidatta, Aytar, Yusuf, Tompson, Jonathan, Sermanet, Pierre, Zisserman, Andrew. (2021). With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[dong2023peco] Dong, Xiaoyi, Bao, Jianmin, Zhang, Ting, Chen, Dongdong, Zhang, Weiming, Yuan, Lu, Chen, Dong, Wen, Fang, Yu, Nenghai, Guo, Baining. (2023). Peco: Perceptual codebook for bert pre-training of vision transformers. Proceedings of the AAAI Conference on Artificial Intelligence.

[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron. (2016). Deep learning.

[donahue2019large] Donahue, Jeff, Simonyan, Karen. (2019). Large scale adversarial representation learning. Advances in neural information processing systems.

[jang2016categorical] Jang, Eric, Gu, Shixiang, Poole, Ben. (2016). Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144.

[cortes1995support] Cortes, Corinna, Vapnik, Vladimir. (1995). Support-vector networks. Machine learning.

[nyudepth] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, Rob Fergus. (2012). Indoor Segmentation and Support Inference from RGBD Images. Proceedings of the European Conference on Computer Vision.

[zhu2017unpaired] Zhu, Jun-Yan, Park, Taesung, Isola, Phillip, Efros, Alexei A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE international conference on computer vision.

[venkataramanan2024is] Shashanka Venkataramanan, Mamshad Nayeem Rizve, Joao Carreira, Yuki M Asano, Yannis Avrithis. (2024). Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video. The Twelfth International Conference on Learning Representations.

[goodfellow2014generative] Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, Bengio, Yoshua. (2014). Generative adversarial nets. Advances in neural information processing systems.

[donahue2016adversarial] Donahue, Jeff, Krähenbühl, Philipp, Darrell, Trevor. (2016). Adversarial feature learning. arXiv preprint arXiv:1605.09782.

[dumoulin2016adversarially] Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Mastropietro, Olivier, Lamb, Alex, Arjovsky, Martin, Courville, Aaron. (2016). Adversarially learned inference. arXiv preprint arXiv:1606.00704.

[gulrajani2017improved] Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, Courville, Aaron C. (2017). Improved training of Wasserstein GANs. Advances in neural information processing systems.

[arjovsky2017wasserstein] Arjovsky, Martin, Chintala, Soumith, Bottou, Léon. (2017). Wasserstein generative adversarial networks. Proceedings of the International Conference on Machine Learning.

[salimans2016improved] Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, Chen, Xi. (2016). Improved techniques for training GANs. Advances in neural information processing systems.

[bordes2017learning] Bordes, Florian, Honari, Sina, Vincent, Pascal. (2017). Learning to generate samples from noise through infusion training. arXiv preprint arXiv:1703.06975.

[ho2020denoising] Ho, Jonathan, Jain, Ajay, Abbeel, Pieter. (2020). Denoising diffusion probabilistic models. Advances in neural information processing systems.

[song2020denoising] Song, Jiaming, Meng, Chenlin, Ermon, Stefano. (2020). Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.

[sohl2015deep] Sohl-Dickstein, Jascha, Weiss, Eric, Maheswaranathan, Niru, Ganguli, Surya. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. Proceedings of the International Conference on Machine Learning.

[bishop2006pattern] Bishop, Christopher M. (2006). Pattern recognition and machine learning.

[hyvarinen2005estimation] Hyvärinen, Aapo. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research.

[song2019generative] Song, Yang, Ermon, Stefano. (2019). Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems.

[vincent2011connection] Vincent, Pascal. (2011). A connection between score matching and denoising autoencoders. Neural computation.

[lipman2022flow] Lipman, Yaron, Chen, Ricky TQ, Ben-Hamu, Heli, Nickel, Maximilian, Le, Matt. (2022). Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

[wei2023diffmae] Wei, Chen, Mangalam, Karttikeya, Huang, Po-Yao, Li, Yanghao, Fan, Haoqi, Xu, Hu, Wang, Huiyu, Xie, Cihang, Yuille, Alan, Feichtenhofer, Christoph. (2023). Diffusion Models as Masked Autoencoders. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).

[bojanowski2017unsupervised] Bojanowski, Piotr, Joulin, Armand. (2017). Unsupervised learning by predicting noise. Proceedings of the International Conference on Machine Learning.

[balestriero2023diet] Balestriero, Randall. (2023). Unsupervised Learning on a DIET: Datum IndEx as Target Free of Self-Supervision, Reconstruction, Projector Head. arXiv preprint arXiv:2302.10260.

[cuturi2013sinkhorn] Cuturi, Marco. (2013). Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems.

[musgrave2020metric] Musgrave, Kevin, Belongie, Serge, Lim, Ser-Nam. (2020). A metric learning reality check. Proceedings of the European Conference on Computer Vision.

[oh2016deep] Oh Song, Hyun, Xiang, Yu, Jegelka, Stefanie, Savarese, Silvio. (2016). Deep metric learning via lifted structured feature embedding. Proceedings of the IEEE conference on computer vision and pattern recognition.

[schroff2015facenet] Schroff, Florian, Kalenichenko, Dmitry, Philbin, James. (2015). FaceNet: A unified embedding for face recognition and clustering. Proceedings of the IEEE conference on computer vision and pattern recognition.

[wang2014learning] Wang, Jiang, Song, Yang, Leung, Thomas, Rosenberg, Chuck, Wang, Jingbin, Philbin, James, Chen, Bo, Wu, Ying. (2014). Learning fine-grained image similarity with deep ranking. Proceedings of the IEEE conference on computer vision and pattern recognition.

[sansone2024failure] Sansone, Emanuele, Lebailly, Tim, Tuytelaars, Tinne. (2024). Failure-Proof Non-Contrastive Self-Supervised Learning. arXiv preprint arXiv:2410.04959.

[zhai2023sigmoid] Zhai, Xiaohua, Mustafa, Basil, Kolesnikov, Alexander, Beyer, Lucas. (2023). Sigmoid loss for language image pre-training. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[Chang_2017_ICCV] Chang, Jianlong, Wang, Lingfeng, Meng, Gaofeng, Xiang, Shiming, Pan, Chunhong. (2017). Deep Adaptive Image Clustering. Proceedings of the IEEE International Conference on Computer Vision (ICCV).

[pmlr-v48-xieb16] Xie, Junyuan, Girshick, Ross, Farhadi, Ali. (2016). Unsupervised Deep Embedding for Clustering Analysis. Proceedings of The 33rd International Conference on Machine Learning.

[kingma_auto-encoding_2014] Kingma, Diederik P, Welling, Max. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations.

[cui_dynamo_2024] Cui, Zichen Jeff, Pan, Hengkai, Iyer, Aadhithya, Haldar, Siddhant, Pinto, Lerrel. (2024). DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control. doi:10.48550/arXiv.2409.12192.

[bu_univla_2025] Bu, Qingwen, Yang, Yanting, Cai, Jisong, Gao, Shenyuan, Ren, Guanghui, Yao, Maoqing, Luo, Ping, Li, Hongyang. (2025). UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. doi:10.48550/arXiv.2505.06111.

[schmidt_actw-without-actions_2024] Schmidt, Dominik, Jiang, Minqi. (2024). Learning to Act without Actions. doi:10.48550/arXiv.2312.10812.

[ye_lapa_2025] Ye, Seonghyeon, Jang, Joel, Jeon, Byeongguk, Joo, Sejune, Yang, Jianwei, Peng, Baolin, Mandlekar, Ajay, Tan, Reuben, Chao, Yu-Wei, Lin, Bill Yuchen, Liden, Lars, Lee, Kimin, Gao, Jianfeng, Zettlemoyer, Luke, Fox, Dieter, Seo, Minjoon. (2025). Latent Action Pretraining from Videos. doi:10.48550/arXiv.2410.11758.

[villar-corrales_playslot_2025] Villar-Corrales, Angel, Behnke, Sven. (2025). PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning. doi:10.48550/arXiv.2502.07600.

[yang_dichotomy_2022] Yang, Mengjiao, Schuurmans, Dale, Abbeel, Pieter, Nachum, Ofir. (2022). Dichotomy of Control: Separating What You Can Control from What You Cannot. doi:10.48550/arXiv.2210.13435.

[babaeizadeh_stochastic_2018] Babaeizadeh, Mohammad, Finn, Chelsea, Erhan, Dumitru, Campbell, Roy H., Levine, Sergey. (2018). Stochastic Variational Video Prediction. doi:10.48550/arXiv.1710.11252.

[bae_devias_2024] Bae, Kyungho, Ahn, Geo, Kim, Youngrae, Choi, Jinwoo. (2024). DEVIAS: Learning Disentangled Video Representations of Action and Scene. doi:10.48550/arXiv.2312.00826.

[li_unified_2025] Li, Shuang, Gao, Yihuai, Sadigh, Dorsa, Song, Shuran. (2025). Unified Video Action Model. doi:10.48550/arXiv.2503.00200.

[chen_igor_2024] Chen, Xiaoyu, Guo, Junliang, He, Tianyu, Zhang, Chuheng, Zhang, Pushi, Yang, Derek Cathera, Zhao, Li, Bian, Jiang. (2024). IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI. doi:10.48550/arXiv.2411.00785.

[agibot_2025] AgiBot-World-Contributors, Bu, Qingwen, Cai, Jisong, Chen, Li, Cui, Xiuqi, Ding, Yan, Feng, Siyuan, Gao, Shenyuan, He, Xindong, Hu, Xuan, Huang, Xu, Jiang, Shu, Jiang, Yuxin, Jing, Cheng, Li, Hongyang, Li, Jialu, Liu, Chiming, Liu, Yi, Lu, Yuxiang, Luo, Jianlan, Luo, Ping, Mu, Yao, Niu, Yuehan, Pan, Yixuan, Pang, Jiangmiao, Qiao, Yu, Ren, Guanghui, Ruan, Cheng, Shan, Jiaqi, Shen, Yongjian, Shi, Chengshi, Shi, Mingkang, Shi, Modi, Sima, Chonghao, Song, Jianheng, Wang, Huijie, Wang, Wenhao, Wei, Dafeng, Xie, Chengen, Xu, Guo, Yan, Junchi, Yang, Cunbiao, Yang, Lei, Yang, Shukai, Yao, Maoqing, Zeng, Jia, Zhang, Chi, Zhang, Qinglin, Zhao, Bin, Zhao, Chengyue, Zhao, Jiaqi, Zhu, Jianchao. (2025). AgiBot World Colosseo. doi:10.48550/arXiv.2503.06669.

[yang2025como] Yang, Jiange, Shi, Yansong, Zhu, Haoyi, Liu, Mingyu, Ma, Kaijing, Wang, Yating, Wu, Gangshan, He, Tong, Wang, Limin. (2025). CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning. arXiv preprint arXiv:2505.17006.

[gao2025adaworld] Gao, Shenyuan, Zhou, Siyuan, Du, Yilun, Zhang, Jun, Gan, Chuang. (2025). AdaWorld: Learning adaptable world models with latent actions. arXiv preprint arXiv:2503.18938.

[bruce2024genie] Bruce, Jake, Dennis, Michael D, Edwards, Ashley, Parker-Holder, Jack, Shi, Yuge, Hughes, Edward, Lai, Matthew, Mavalankar, Aditi, Steigerwald, Richie, Apps, Chris, others. (2024). Genie: Generative interactive environments. Forty-first International Conference on Machine Learning.

[alonso2024diffusion] Alonso, Eloi, Jelley, Adam, Micheli, Vincent, Kanervisto, Anssi, Storkey, Amos J, Pearce, Tim, Fleuret, François. (2024). Diffusion for world modeling: Visual details matter in Atari. Advances in Neural Information Processing Systems.

[seo2023masked] Seo, Younggyo, Hafner, Danijar, Liu, Hao, Liu, Fangchen, James, Stephen, Lee, Kimin, Abbeel, Pieter. (2023). Masked world models for visual control. Conference on Robot Learning.

[agarwal2025cosmos] Agarwal, Niket, Ali, Arslan, Bala, Maciej, Balaji, Yogesh, Barker, Erik, Cai, Tiffany, Chattopadhyay, Prithvijit, Chen, Yongxin, Cui, Yin, Ding, Yifan, others. (2025). Cosmos World Foundation Model Platform for Physical AI. arXiv preprint arXiv:2501.03575.

[assran2025vjepa2] Assran, Mido, Bardes, Adrien, Fan, David, Garrido, Quentin, Howes, Russell, Muckley, Matthew, Rizvi, Ammar, Roberts, Claire, Sinha, Koustuv, Zholus, Artem, others. (2025). V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985.

[teng2025magi] Teng, Hansi, Jia, Hongyu, Sun, Lei, Li, Lingzhi, Li, Maolin, Tang, Mingqiu, Han, Shuai, Zhang, Tianning, Zhang, WQ, Luo, Weifeng, others. (2025). MAGI-1: Autoregressive Video Generation at Scale. arXiv preprint arXiv:2505.13211.

[bai2025peva] Bai, Yutong, Tran, Danny, Bar, Amir, LeCun, Yann, Darrell, Trevor, Malik, Jitendra. (2025). Whole-Body Conditioned Egocentric Video Prediction. arXiv preprint arXiv:2506.21552.

[ma2024nymeria] Ma, Lingni, Ye, Yuting, Hong, Fangzhou, Guzov, Vladimir, Jiang, Yifeng, Postyeni, Rowan, Pesqueira, Luis, Gamino, Alexander, Baiyya, Vijay, Kim, Hyo Jin, others. (2024). Nymeria: A massive collection of multimodal egocentric daily motion in the wild. European Conference on Computer Vision.

[khazatsky2024droid] Khazatsky, Alexander, Pertsch, Karl, Nair, Suraj, Balakrishna, Ashwin, Dasari, Sudeep, Karamcheti, Siddharth, Nasiriany, Soroush, Srirama, Mohan Kumar, Chen, Lawrence Yunliang, Ellis, Kirsty, others. (2024). DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945.

[nasiriany2024robocasa] Nasiriany, Soroush, Maddukuri, Abhiram, Zhang, Lance, Parikh, Adeet, Lo, Aaron, Joshi, Abhishek, Mandlekar, Ajay, Zhu, Yuke. (2024). RoboCasa: Large-scale simulation of everyday tasks for generalist robots. arXiv preprint arXiv:2406.02523.

[shah2021rapid] Dhruv Shah, Benjamin Eysenbach, Nicholas Rhinehart, Sergey Levine. (2021). Rapid Exploration for Open-World Navigation with Latent Goal Models. 5th Annual Conference on Robot Learning.

[liu2023libero] Liu, Bo, Zhu, Yifeng, Gao, Chongkai, Feng, Yihao, Liu, Qiang, Zhu, Yuke, Stone, Peter. (2023). Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems.

[zhou2024dino] Zhou, Gaoyue, Pan, Hengkai, LeCun, Yann, Pinto, Lerrel. (2024). DINO-WM: World models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983.

[baldassarre2025back] Baldassarre, Federico, Szafraniec, Marc, Terver, Basile, Khalidov, Vasil, Massa, Francisco, LeCun, Yann, Labatut, Patrick, Seitzer, Maximilian, Bojanowski, Piotr. (2025). Back to the features: DINO as a foundation for video world models. arXiv preprint arXiv:2507.19468.

[luc2017predicting] Luc, Pauline, Neverova, Natalia, Couprie, Camille, Verbeek, Jakob, LeCun, Yann. (2017). Predicting deeper into the future of semantic segmentation. Proceedings of the IEEE international conference on computer vision.

[karypidis2024dino] Karypidis, Efstathios, Kakogeorgiou, Ioannis, Gidaris, Spyros, Komodakis, Nikos. (2024). DINO-Foresight: Looking into the Future with DINO. arXiv preprint arXiv:2412.11673.

[bjorck2025gr00t] Bjorck, Johan, Castañeda, Fernando, others. (2025). GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734.

[chen2025moto] Chen, Yi, Ge, Yuying, Tang, Weiliang, Li, Yizhuo, Ge, Yixiao, Ding, Mingyu, Shan, Ying, Liu, Xihui. (2025). Moto: Latent motion token as the bridging language for learning robot manipulation from videos. Proceedings of the IEEE/CVF International Conference on Computer Vision.

[nikulin2025latent] Nikulin, Alexander, Zisman, Ilya, Tarasov, Denis, Lyubaykin, Nikita, Polubarov, Andrei, Kiselev, Igor, Kurenkov, Vladislav. (2025). Latent action learning requires supervision in the presence of distractors. arXiv preprint arXiv:2502.00379.

[liang2025clam] Liang, Anthony, Czempin, Pavel, Hong, Matthew, Zhou, Yutai, Biyik, Erdem, Tu, Stephen. (2025). CLAM: Continuous latent action models for robot learning from unlabeled demonstrations. arXiv preprint arXiv:2505.04999.

[grauman2022ego4d] Grauman, Kristen, Westbury, Andrew, Byrne, Eugene, Chavis, Zachary, Furnari, Antonino, Girdhar, Rohit, Hamburger, Jackson, Jiang, Hao, Liu, Miao, Liu, Xingyu, others. (2022). Ego4d: Around the world in 3,000 hours of egocentric video. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

[friston2010free] Friston, Karl. (2010). The free-energy principle: a unified brain theory?. Nature reviews neuroscience.

[clark2013whatever] Clark, Andy. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and brain sciences.

[bubic2010prediction] Bubic, Andreja, Von Cramon, D Yves, Schubotz, Ricarda I. (2010). Prediction, cognition and the brain. Frontiers in human neuroscience.

[sutton1991dyna] Sutton, Richard S. (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bulletin.

[radford2018gpt] Radford, Alec, Narasimhan, Karthik, Salimans, Tim, Sutskever, Ilya, others. (2018). Improving language understanding by generative pre-training.

[jordan2024muon] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, Jeremy Bernstein. (2024). Muon: An optimizer for hidden layers in neural networks.

[williams1989learning] Williams, Ronald J, Zipser, David. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural computation.

[sutskever2014sequence] Sutskever, Ilya, Vinyals, Oriol, Le, Quoc V. (2014). Sequence to sequence learning with neural networks. Advances in neural information processing systems.

[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is all you need. Advances in neural information processing systems.

[johnson2016perceptual] Johnson, Justin, Alahi, Alexandre, Fei-Fei, Li. (2016). Perceptual losses for real-time style transfer and super-resolution. European conference on computer vision.

[zhang2018unreasonable] Zhang, Richard, Isola, Phillip, Efros, Alexei A, Shechtman, Eli, Wang, Oliver. (2018). The unreasonable effectiveness of deep features as a perceptual metric. Proceedings of the IEEE conference on computer vision and pattern recognition.

[amos2023tutorial] Amos, Brandon, others. (2023). Tutorial on amortized optimization. Foundations and Trends in Machine Learning.

[peebles2023scalable] Peebles, William, Xie, Saining. (2023). Scalable diffusion models with transformers. Proceedings of the IEEE/CVF international conference on computer vision.

[zhang2025world] Zhang, Jiahan, Jiang, Muqing, Dai, Nanru, Lu, Taiming, Uzunoglu, Arda, Zhang, Shunchi, Wei, Yana, Wang, Jiahao, Patel, Vishal M, Liang, Paul Pu, others. (2025). World-in-World: World Models in a Closed-Loop World. arXiv preprint arXiv:2510.18135.

[sun2024video] Sun, Yihong, Zhou, Hao, Yuan, Liangzhe, Sun, Jennifer J, Li, Yandong, Jia, Xuhui, Adam, Hartwig, Hariharan, Bharath, Zhao, Long, Liu, Ting. (2024). Video creation by demonstration. arXiv preprint arXiv:2412.09551.

[hoque2025egodex] Hoque, Ryan, Huang, Peide, Yoon, David J, Sivapurapu, Mouli, Zhang, Jian. (2025). EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video. arXiv preprint arXiv:2505.11709.

[wang2025coevolvinglatentactionworld] Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang, Jiang Bian. (2025). Co-Evolving Latent Action World Models.

[lecun2006tutorial] LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, Marc'Aurelio, Huang, Fu-Jie. (2006). A Tutorial on Energy-Based Learning. Predicting Structured Data.

[grathwohl2020classifiersecretlyenergybased] Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, Kevin Swersky. (2020). Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One.

[welling2011sgld] Welling, Max, Teh, Yee Whye. (2011). Bayesian learning via stochastic gradient langevin dynamics. Proceedings of the 28th International Conference on International Conference on Machine Learning.

[rubinstein1997cem] Rubinstein, Reuven Y. (1997). Optimization of computer simulation models with rare events. European Journal of Operational Research.

[sturm2012evaluating] Sturm, Jürgen, others. (2012). Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark. Proc. of the Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RSJ International Conference on Intelligent Robot Systems (IROS).

[nguyen1990truck] Nguyen, Derrick, Widrow, Bernard. (1990). The truck backer-upper: An example of self-learning in neural networks. Advanced neural computers.

[terver2025drives] Terver, Basile, Yang, Tsung-Yen, Ponce, Jean, Bardes, Adrien, LeCun, Yann. (2025). What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?. arXiv preprint arXiv:2512.24497.

[sridhar2024nomad] Sridhar, Ajay, Shah, Dhruv, Glossop, Catherine, Levine, Sergey. (2024). Nomad: Goal masked diffusion policies for navigation and exploration. 2024 IEEE International Conference on Robotics and Automation (ICRA).

[drozdov2024video] Drozdov, Katrina, Shwartz-Ziv, Ravid, LeCun, Yann. (2024). Video representation learning with joint-embedding predictive architectures. arXiv preprint arXiv:2412.10925.

[rybkin2018learning] Oleh Rybkin, Karl Pertsch, Andrew Jaegle, Konstantinos G. Derpanis, Kostas Daniilidis. (2019). Learning what you can do before doing anything. International Conference on Learning Representations.

[edwards2019imitating] Edwards, Ashley, Sahni, Himanshu, Schroecker, Yannick, Isbell, Charles. (2019). Imitating latent policies from observation. International conference on machine learning.

[menapace2021playable] Menapace, Willi, Lathuiliere, Stephane, Tulyakov, Sergey, Siarohin, Aliaksandr, Ricci, Elisa. (2021). Playable video generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[menapace2022playable] Menapace, Willi, Lathuilière, Stéphane, others. (2022). Playable environments: Video manipulation in space and time. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.