Revisiting Feature Prediction for Learning Visual Representations from Video
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas
Abstract
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
Adrien Bardes 1,2,3, Quentin Garrido 1,4, Jean Ponce 3,5,6, Xinlei Chen 1, Michael Rabbat 1, Yann LeCun 1,5,6, Mahmoud Assran 1,†, Nicolas Ballas 1,†
1 FAIR at Meta, 2 Inria, 3 École normale supérieure, CNRS, PSL Research University, 4 Univ. Gustave Eiffel, CNRS, LIGM, 5 Courant Institute, New York University, 6 Center for Data Science, New York University. † Joint last author.
Date:
April 15, 2024
Correspondence:
{abardes, massran, ballasn}@meta.com
https://github.com/facebookresearch/jepa
$$ \label{eq:loss} \text{minimize}_{\theta,\phi}\quad \lVert P_\phi(E_\theta(x), \Delta_y) - \text{sg}(\overline{E}_\theta(y)) \rVert_1, $$ \tag{eq:loss}
$$ \label{eq:loss_detail} \text{Loss} = \frac{1}{M} \sum_{k \in (i_1, \ldots, i_M)} \lVert \hat s_{k} - s_k \rVert_1, $$ \tag{eq:loss_detail}
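As a concrete illustration, the averaged per-token L1 loss above can be sketched in a few lines of numpy. The function name and array layout here are ours, not the paper's; the stop-gradient on the target features is implicit, since numpy does not track gradients.

```python
import numpy as np

def vjepa_l1_loss(pred, target, masked_idx):
    # pred, target: (L, d) arrays of token features s_hat and s.
    # masked_idx: the M masked token indices (i_1, ..., i_M).
    # Returns (1/M) * sum_k ||s_hat_k - s_k||_1 over the masked tokens.
    diffs = np.abs(pred[masked_idx] - target[masked_idx])  # (M, d)
    return diffs.sum(axis=-1).mean()
```

For instance, with all-zero predictions and all-one targets of width 2, each masked token contributes an L1 norm of 2, so the loss is 2.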
$$ P^\star(E_\theta(x)) = \operatorname*{argmin}_{P} \lVert P(E_\theta(x)) - Y \rVert_1 = \text{median}(Y \mid E_\theta(x)). $$
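The fact that the L1-optimal predictor recovers the conditional median can be checked numerically for the simplest case of a constant predictor: the candidate minimizing the summed absolute error over a set of targets is their median. A small grid-search sketch (values are illustrative):

```python
import numpy as np

# argmin_c sum_i |c - y_i| is attained at the median of y.
y = np.array([1.0, 2.0, 10.0])
candidates = np.linspace(0.0, 12.0, 1201)  # grid with step 0.01
l1_cost = np.abs(candidates[:, None] - y[None, :]).sum(axis=1)
best = candidates[np.argmin(l1_cost)]      # coincides with np.median(y)
```

Note that the mean (the L2-optimal constant) would instead be pulled toward the outlier at 10, which is why an L1 loss is less sensitive to unpredictable pixel-level detail.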
\begin{table*}[t]
\centering
\caption{{\it Comparison with Pixel-Prediction Baselines.} Frozen evaluation with an attentive probe on video tasks (K400, SSv2, AVA) and image tasks (IN1K, Places205, iNat21), and end-to-end fine-tuning on K400 and SSv2, for models using a ViT-L/16 or Hiera-L encoder.}
\label{tb:pixel_comparison}
\begin{tabular}{llrr cccccc cc}
\toprule
& & & & \multicolumn{6}{c}{\it Frozen Evaluation} & \multicolumn{2}{c}{\it Fine-Tuning} \\
\cmidrule(l){5-10} \cmidrule(l){11-12}
\bf Method & \bf Arch. & \bf \#Samples & \bf Iter. & \bf K400 & \bf SSv2 & \bf AVA & \bf IN1K & \bf Places205 & \bf iNat21 & \bf K400 & \bf SSv2 \\
\midrule
OmniMAE & ViT-L/16 & 2400M & 1170K & 65.6 & 60.6 & 14.4 & \bf 75.1 & 59.8 & 66.1 & 84.0 & 74.2 \\
VideoMAE & ViT-L/16 & 410M & 400K & 77.8 & 65.5 & 21.6 & 71.1 & 59.3 & 64.6 & 85.4 & 74.3 \\
Hiera & Hiera-L & 770M & 1500K & 75.5 & 64.2 & 15.8 & 68.9 & 58.5 & 56.9 & \bf 87.3 & \bf 75.1 \\
\midrule
V-JEPA & ViT-L/16 & 270M & 90K & \cc\bf 80.8 & \cc\bf 69.5 & \cc\bf 25.6 & \cc 74.8 & \cc\bf 60.3 & \cc\bf 67.8 & \cc 85.6 & \cc\bf 75.1 \\
\bottomrule
\end{tabular}
\end{table*}
\begin{table*}[t]
\centering
\caption{{\it Comparison with State-of-the-Art Models.} We compare \putalg with state-of-the-art baselines in frozen evaluation with an attentive probe on downstream image tasks (IN1K, Places205, iNat21) and video tasks (K400, SSv2, AVA). All models are evaluated at resolution 224, except I-JEPA$_{512}$ and V-JEPA$_{384}$, which are evaluated at resolution $512$ and $384$, respectively. On K400 and SSv2 we follow the standard practice of reporting accuracy from several spatial and temporal views of the video. Compared to other video baselines, \putalg exhibits a consistent improvement across all downstream tasks. Compared to image models that excel under frozen evaluation, \putalg shows a significant performance improvement on tasks requiring motion understanding (+21 points on SSv2), and reduces the gap between video and image models on tasks requiring static appearance-based features.}
\label{tb:large_results}
\begin{tabular}{llrr ccc ccc}
\toprule
& & & & \multicolumn{3}{c}{\it Video Tasks} & \multicolumn{3}{c}{\it Image Tasks} \\
\cmidrule(l){5-7} \cmidrule(l){8-10}
\bf Method & \bf Arch. & \bf Params. & \bf Data & \bf K400 & \bf SSv2 & \bf AVA & \bf IN1K & \bf Places205 & \bf iNat21 \\
& & & & {\scriptsize(16$\times$8$\times$3)} & {\scriptsize(16$\times$2$\times$3)} & & & & \\
\midrule
\multicolumn{10}{l}{\bf\it Methods pretrained on Images} \\
I-JEPA & ViT-H/16$_{512}$ & 630M & IN22K & 79.7 & 50.0 & 19.8 & 84.4 & 66.5 & 85.7 \\
OpenCLIP & ViT-G/14 & 1800M & LAION & 81.8 & 34.8 & 23.2 & 85.3 & \bf 70.2 & 83.6 \\
DINOv2 & ViT-g/14 & 1100M & LVD-142M & \bf 83.4 & 50.6 & 24.3 & \bf 86.2 & 68.4 & \bf 88.8 \\
\midrule
\multicolumn{10}{l}{\bf\it Methods pretrained on Videos} \\
MVD & ViT-L/16 & 200M & IN1K+K400 & 79.4 & 66.5 & 19.7 & 73.3 & 59.4 & 65.7 \\
OmniMAE & ViT-H/16 & 630M & IN1K+SSv2 & 71.4 & 65.4 & 16.0 & 76.3 & 60.6 & 72.4 \\
VideoMAE & ViT-H/16 & 630M & K400 & 79.8 & 66.2 & 20.7 & 72.3 & 59.1 & 65.5 \\
VideoMAEv2 & ViT-g/14 & 1100M & Un.Hybrid & 71.2 & 61.2 & 12.9 & 71.4 & 60.6 & 68.3 \\
Hiera & Hiera-H & 670M & K400 & 77.0 & 64.7 & 17.5 & 71.4 & 59.5 & 61.7 \\
\midrule
\multirow{3}{*}{V-JEPA} & ViT-L/16 & 200M & \multirow{3}{*}{VideoMix2M} & \cc 80.8 & \cc 69.5 & \cc 25.6 & \cc 74.8 & \cc 60.3 & \cc 67.8 \\
& ViT-H/16 & 630M & & \cc\bf 82.0 & \cc 71.4 & \cc\bf 25.8 & \cc 75.9 & \cc 61.7 & \cc 67.9 \\
& ViT-H/16$_{384}$ & 630M & & \cc 81.9 & \cc\bf 72.2 & \cc 25.0 & \cc\bf 77.4 & \cc\bf 62.8 & \cc\bf 72.6 \\
\bottomrule
\end{tabular}
\end{table*}
\section{Comparison with Prior Work} In Section~\ref{subsec:pixel_comparison}, we investigate the impact of feature prediction by comparing \putalg with video approaches that rely on pixel prediction, while using a similar architecture for all baselines. Subsequently, in Section~\ref{subsec:sota_comparison}, we remove the architectural constraint and report the best performance across architectures for self-supervised video and image pretraining approaches. Finally, we explore the label-efficiency of \putalg relative to other self-supervised video pretraining approaches in Section~\ref{subsec:lowshot}. We further detail the evaluation setup in Appendix~\ref{app:evaluation}.
\subsection{Comparison with Pixel Prediction} \label{subsec:pixel_comparison}
To investigate the effectiveness of feature prediction pretraining, we first compare \putalg to video masked modeling models relying on a pixel prediction loss. We control for the possible confounding factor of model architecture by evaluating all models using either a ViT-L/16 encoder, or a Hiera-L encoder, which has a similar number of parameters. For the pixel prediction baselines we consider VideoMAE~\citep{tong2022videomae, wang2023videomae}, which trains vision transformer autoencoders exclusively on video, Hiera~\citep{ryali2023hiera}, which trains a hierarchical transformer autoencoder on video, and OmniMAE~\citep{girdhar2023omnimae}, which trains a vision transformer autoencoder on static images and video simultaneously.
Table~\ref{tb:pixel_comparison} examines both frozen evaluation with an attentive probe on downstream video and image tasks, as well as end-to-end fine-tuning. In frozen evaluation, \putalg outperforms the baselines on all downstream tasks, except ImageNet, where we achieve $74.8\%$ compared to $75.1\%$ of an OmniMAE model trained directly on ImageNet; hence, \putalg achieves comparable ImageNet performance despite only pretraining on video.
Under the fine-tuning protocol, \putalg also achieves the best performance of any model trained with a ViT-L/16, and matches the performance of the Hiera-L on SSv2, which benefits from a hierarchical prior~\citep{ryali2023hiera}. The \putalg models achieve this result while processing significantly fewer samples during pretraining (Figure~\ref{fig:ssv2_finetuning}), demonstrating the efficiency of feature prediction as a learning principle. \begin{figure}[t] \includegraphics[width=\linewidth]{assets/scatter-ssv2-finetuned-compute.pdf} \caption{{\it SSv2 fine-tuning performance vs.~Samples Seen.} We report SSv2 fine-tuning for \putalg and pixel-reconstruction baselines using a ViT-L/16 or Hiera-L architecture. \putalg outperforms all pixel-reconstruction methods using a ViT-L/16 and matches the Hiera-L performance while seeing significantly fewer samples during pretraining.} \label{fig:ssv2_finetuning} \end{figure}
\subsection{Comparison with State-of-the-Art} \label{subsec:sota_comparison}
Next, in Table~\ref{tb:large_results}, we inspect how the \putalg models pretrained on video stack up next to the largest state-of-the-art self-supervised image and video models when freezing the backbone encoder and training an attentive probe on top. Our image pretrained baselines include OpenCLIP~\citep{cherti2023reproducible}, DINOv2~\citep{oquab2023dinov2}, and I-JEPA~\citep{assran2023self}. The OpenCLIP model is trained with a contrastive image-text alignment objective, DINOv2 and I-JEPA are trained with self-supervision. These models are known to excel in their frozen-evaluation performance~\citep{oquab2023dinov2}; i.e., their ability to produce visual features that can be applied to many downstream tasks simultaneously, without end-to-end fine-tuning, and thus provide highly competitive baselines. Our video pretrained baselines include VideoMAE~\citep{tong2022videomae}, OmniMAE~\citep{girdhar2023omnimae}, Hiera~\citep{ryali2023hiera}, VideoMAEv2~\citep{wang2023videomae}, and MVD~\citep{wang2023masked}. The OpenCLIP, DINOv2 and VideoMAEv2 models are parameterized as Giant/Gigantic vision transformer architectures containing over 1B parameters trained on large-scale image or video datasets. \begin{figure}[t] \includegraphics[width=\linewidth]{assets/scatter-ssv2-frozen-compute-time.pdf} \caption{{\it SSv2 frozen-evaluation performance vs.~Pretraining Time.} Wallclock times for all methods are measured on a single GPU with a batch size of 10 clips, using the official codebases for VideoMAE and VideoMAEv2, and linearly extrapolated assuming a global batch size of 2400 samples. However, note that the SSv2 accuracies of video pixel prediction methods are actually obtained with small batch sizes and significantly longer training schedules. 
\putalg outperforms pixel-reconstruction methods while training significantly faster.} \label{fig:ssv2_frozen} \end{figure}
\begin{table*}[t]
\centering
\caption{{\it Low-Shot Frozen Evaluation.} Comparing \putalg to other video models in frozen evaluation on Kinetics-400 and Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive probe. We train the probes in several low-shot settings: using either 5\%, 10\%, or 50\% of the train set, and take 3 random splits in each setting to obtain more robust metrics, resulting in 9 evaluation experiments for each model. We report the mean performance and standard deviation on the K400 and SSv2 validation sets. \putalg is more label-efficient than other models; specifically, decreasing the available number of labeled examples per class increases the performance gap between \putalg and the baselines.}
\label{tb:lowshot}
\begin{tabular}{ll ccc ccc}
\toprule
& & \multicolumn{6}{c}{\it Frozen Evaluation} \\
& & \multicolumn{3}{c}{\bf K400 {\scriptsize(16$\times$8$\times$3)}} & \multicolumn{3}{c}{\bf SSv2 {\scriptsize(16$\times$2$\times$3)}} \\
\cmidrule(l){3-5} \cmidrule(l){6-8}
\bf Method & \bf Arch. & 5\% & 10\% & 50\% & 5\% & 10\% & 50\% \\
& & {\scriptsize($\sim$29 per class)} & {\scriptsize($\sim$58 per class)} & {\scriptsize($\sim$287 per class)} & {\scriptsize($\sim$48 per class)} & {\scriptsize($\sim$96 per class)} & {\scriptsize($\sim$440 per class)} \\
\midrule
MVD & ViT-L/16 & 62.6 $\pm$ 0.2 & 68.3 $\pm$ 0.2 & 77.2 $\pm$ 0.3 & 42.9 $\pm$ 0.8 & 49.5 $\pm$ 0.6 & 61.0 $\pm$ 0.2 \\
VideoMAE & ViT-H/16 & 62.3 $\pm$ 0.3 & 68.5 $\pm$ 0.2 & 78.2 $\pm$ 0.1 & 41.4 $\pm$ 0.8 & 48.1 $\pm$ 0.2 & 60.5 $\pm$ 0.4 \\
VideoMAEv2 & ViT-g/14 & 37.0 $\pm$ 0.3 & 48.8 $\pm$ 0.4 & 67.8 $\pm$ 0.1 & 28.0 $\pm$ 1.0 & 37.3 $\pm$ 0.3 & 54.0 $\pm$ 0.3 \\
\midrule
\multirow{2}{*}{V-JEPA} & ViT-H/16 & \cc 67.0 $\pm$ 0.2 & \cc 72.1 $\pm$ 0.1 & \cc 80.2 $\pm$ 0.2 & \cc 51.9 $\pm$ 0.3 & \cc 57.5 $\pm$ 0.4 & \cc 67.3 $\pm$ 0.2 \\
& ViT-H/16$_{384}$ & \cc\bf 68.2 $\pm$ 0.2 & \cc\bf 72.8 $\pm$ 0.2 & \cc\bf 80.6 $\pm$ 0.2 & \cc\bf 54.0 $\pm$ 0.2 & \cc\bf 59.3 $\pm$ 0.5 & \cc\bf 67.9 $\pm$ 0.2 \\
\bottomrule
\end{tabular}
\end{table*}
\paragraph{\bf Comparison with video models.} Compared to large-scale video baselines, the \putalg models outperform all previous models on every downstream video and image task by a notable margin (see Table~\ref{tb:large_results}). Our H/16 model outperforms the largest publicly available VideoMAE, VideoMAEv2, OmniMAE, MVD, and Hiera models by at least $+5$ points in motion understanding (Something-Something-v2), $+2$ points in action recognition (Kinetics-400), $+5$ points on action detection (AVA), $+1$ point on object recognition (ImageNet-1K), $+2$ points in scene recognition (Places205), and $+0.2$ points on fine-grained recognition (iNaturalist). Moreover, when comparing pretraining wallclock time in Figure~\ref{fig:ssv2_frozen}, we see that \putalg achieves this performance with a roughly $2\times$ speedup compared to the large pixel prediction models.
\paragraph{\bf Comparison with image models.} On tasks that require a fine-grained understanding of motion (Something-Something-v2), the \putalg models provide a major improvement (over $+21$ points) compared to large-scale image baselines, such as DINOv2, OpenCLIP, and I-JEPA. Self-supervised pretraining from videos makes it possible to model dynamic concepts that are not easily learned from static image datasets. Similarly, we observe that the \putalg models outperform image-based pretraining on action localization.
On Kinetics-400, we find image models to perform well; e.g., while DINOv2~\citep{oquab2023dinov2} previously reported $78.4\%$ on K400 with a linear probe, we improve the frozen evaluation of the g/14 model to $83.4\%$ by using an attentive probe. In this case, our H/16 model achieves $82.0\%$ top-1 accuracy. It is worth noting that the label for many Kinetics videos can be inferred using appearance-based cues, without requiring an understanding of motion~\citep{sevilla2021only}.
The \putalg models narrow the gap with image models on image classification tasks. In particular, \putalg achieves a score of $77.4\%$ on ImageNet using a one-layer attentive probe, which can be further improved to $\bf{77.9\%}$ using a two-layer attentive probe. More generally, we hypothesize that the datasets used to train \putalg and other video models are too constrained and lack the visual diversity of the internet-scale pretraining data used by the image models; as such, there is value in focusing future work on building diverse, publicly available video datasets.
\subsection{Label-efficiency} \label{subsec:lowshot} We examine the label-efficiency of \putalg compared to other self-supervised video models by measuring the ability of the pretrained backbones to adapt to downstream tasks with few labels. Specifically, we investigate the performance of the frozen models on Kinetics-400 and Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive probe. We train the probes in several low-shot settings: using either 5%, 10%, or 50% of the train set, and take 3 random splits in each setting to obtain more robust metrics, resulting in 9 evaluation experiments for each model. Table~\ref{tb:lowshot} reports the mean performance and standard deviation on the K400 and SSv2 validation sets.
We find \putalg to be more label-efficient than other self-supervised video models: decreasing the available number of labeled examples for training the attentive probe results in an increase in the performance gap between \putalg and the other models. In particular, the performance of the largest \putalg model on K400 drops by 12% to 68.2% top-1 when we reduce the number of labeled examples by a factor of $10\times$ (from roughly 287 examples per class to 29 examples per class). By contrast, VideoMAEv2 drops by 30% to 37.0% top-1, VideoMAE drops by 15.9% to 62.3% top-1, and MVD drops by 14.6% to 62.6% top-1.
Similar observations hold on SSv2. The performance of the largest \putalg model on SSv2 drops by 13.9% to 54.0% top-1 when we reduce the number of labeled examples by a factor of $10\times$ (from roughly 440 examples per class to 48 examples per class). By contrast, VideoMAEv2 drops by 26% to 28.0% top-1, VideoMAE drops by 19.1% to 41.4% top-1, and MVD drops by 18.1% to 42.9% top-1.
\section{Evaluating the Predictor} Next, we seek to qualitatively inspect the \putalg models. Recall that the predictor network in \putalg predicts the representations of a masked spatio-temporal region $y$ from a visible region $x$, given the positional information of the masked regions (see Section~\ref{sec:methodology}). To qualitatively investigate the grounding of the feature-space predictions, we freeze the pretrained encoder and predictor networks and train a conditional diffusion decoder to map the \putalg predictions to interpretable pixels. Notably, the decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video (see Figure~\ref{fig:decoder_method}). \begin{figure*}[t!] \centering \begin{subfigure}[b]{\textwidth} \centering \includegraphics[width=0.825\linewidth]{assets/decoder-color.pdf} \caption{ {\bf Visualization Methodology.} We train a conditional diffusion model to decode the \putalg feature-space predictions to interpretable pixels; the pretrained \putalg encoder and predictor networks are kept frozen in this process. The decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video. } \label{fig:decoder_method} \end{subfigure} \vskip 4mm \begin{subfigure}[b]{\textwidth} \centering \includegraphics[width=0.485\linewidth]{assets/samples-v1.png}\quad \includegraphics[width=0.485\linewidth]{assets/samples-v0.png} \caption{ {\bf Visualizations.} {\it First Row:} Masked videos used as input to the \putalg models (a pretrained ViT-H/16 encoder and its corresponding predictor network). {\it Other rows:} Bounding boxes contain various samples from the decoder overlayed on the original video. \putalg is not a generative model and the decoder does not have access to the context (first row), so we do not expect samples to exactly match the input. 
This experiment qualitatively illustrates what information is encoded and predicted by \putalg. In particular, characteristics that are common across samples represent information that is encoded in the \putalg predictions. \putalg generates predictions that are spatially and temporally coherent with the unmasked regions of the video. The predictions also capture consistent motion through time. } \label{fig:prediction-sample} \end{subfigure} \caption{{\it Qualitative Analysis.} Offline visualizations of the \putalg feature-space predictions.} \label{fig:prediction-visualization} \end{figure*}
Given a masked video, we use the \putalg pretrained models to predict the representations of the missing regions, and then use the decoder to project the representations to pixel space. Figure~\ref{fig:prediction-sample} shows decoder outputs for various random seeds. Qualities that are common across samples represent information that is contained in the predictor representation.
Figure~\ref{fig:prediction-sample} shows that the \putalg feature predictions are indeed grounded, and exhibit spatio-temporal consistency with the unmasked regions of the video. Specifically, the samples in Figure~\ref{fig:prediction-sample} show that the \putalg predictor correctly captures positional uncertainty and produces a variety of visual objects at various locations with consistent motion. Some of the samples also demonstrate an understanding of object-permanence, as the visual objects remain consistent after partial occlusion.
\section{Conclusion} In this work, we explored the effectiveness of feature prediction as a stand-alone objective for unsupervised learning from video and introduced \putalg, a collection of vision models trained solely using a self-supervised feature prediction objective. The \putalg models demonstrate the ability to solve various downstream image and video tasks without adaptation of the model parameters, and outperform previous video representation learning approaches in frozen evaluation on action recognition, spatio-temporal action detection, and image classification tasks. Additionally, we show that pretraining \putalg on videos is particularly effective for solving downstream tasks requiring fine-grained motion understanding, while large-scale image models trained on internet-scale datasets fall short on such tasks. Finally, we empirically observed that \putalg models are label-efficient learners, and exhibit good performance on downstream tasks even when only a few labeled examples are available.
\bibliographystyle{assets/plainnat} \bibliography{paper}
\clearpage \newpage
\onecolumn
\beginappendix
\section{Extended Related Works}
We first review approaches for learning visual perception from static images before discussing strategies for learning from video.
\subsection*{Weakly-Supervised Learning from Static Images} One family of approaches for learning visual perception from static images trains a visual encoder to predict the representations of text captions often found accompanying images from the Web, as in CLIP~\citep{radford2021learning} or CoCa~\citep{yu2022coca}. The largest open source CLIP model to date, numbering 2B parameters and trained on over 2B web-scraped images~\citep{cherti2023reproducible}, demonstrates impressive performance on a wide range of downstream image and video tasks. Notably, this is achieved using only the light-weight adaptation of task-specific heads, also referred to as frozen-evaluation, and does not require expensive end-to-end fine-tuning of the pretrained model.
\subsection*{Self-Supervised Learning from Static Images} Other approaches for learning from static images leverage unsupervised objectives. Initial works on self-supervised approaches are based on sparse coding or hand-crafted pretext tasks, such as colorization~\citep{larsson2016learning,larsson2017colorization}, rotation prediction~\citep{gidaris2020learning}, and jigsaws~\citep{noroozi2016unsupervised}. More recent approaches leverage invariance-based objectives by training a visual encoder to be invariant to hand-crafted image transformations~\citep{wu2018unsupervised,chen2020simple}.
Another family of methods learn representations using denoising autoencoders~\citep{denoising_vincent}; image inpainting is one popular instantiation of this idea~\citep{pathak2016context}. More recently, masked autoencoders~\citep{he2021masked} train an encoder-decoder transformer to predict missing pixels of a masked image. Follow-up work addresses the indeterminism of pixel reconstruction by exploring instantiations of masked image modeling in latent space~\citep{baevski2022data2vec,assran2023self,baevski2022efficient}. These approaches can be seen as applications of the predictive feature principle in the image modality.
There are also various methods that combine both masked image modeling and invariance criteria to learn visual representations from static images, such as iBOT~\citep{zhou2021ibotyes} and DINOv2~\citep{oquab2023dinov2}; the latter is currently the most competitive instantiation of self-supervised learning with static images, scaled to a model with over 1.1B parameters trained on a curated dataset of 142M images.
\subsection*{Weakly-Supervised Learning from Videos} One family of approaches for learning visual perception from videos relies on weakly-supervised guidance from closed captioning, often computed from an ASR transcription of audio data accompanying internet videos. For instance, VideoBERT~\citep{sun2019videobert} trains a video encoder to predict masked spans in the textual closed captions. Similarly, VideoCLIP~\citep{xu2021videoclip} trains a video encoder to predict the representation of video captions computed by a text encoder. Follow-up work such as MERLOT~\citep{zellers2022merlot}, VATT~\citep{akbari2021vatt}, and InternVideo~\citep{wang2022internvideo} extended VideoCLIP by incorporating additional unsupervised objectives.
\subsection*{Self-Supervised Learning from Videos} Similar to unsupervised learning from images, a family of unsupervised video representation learning approaches enforces a spatio-temporal representation of a video clip to be invariant to hand-crafted spatio-temporal data augmentations~\citep{parthasarathy2022self}. However, the temporal ordering of visual information in video can itself provide implicit supervision; indeed, this is the key insight leveraged by many works on unsupervised video learning. Towards leveraging temporal information as supervision, some approaches train a visual encoder by predicting the temporal ordering of frames~\citep{xu2019self, lee2017unsupervised}. Other approaches seek to predict low-level motion vectors computed from optical flow~\citep{pintea2014deja}, or to predict missing pixels in video frames, using either a frame-interpolation objective~\citep{kalluri2023flavr} or a denoising autoencoder~\citep{tong2022videomae, feichtenhofer2022masked, wang2023videomae}.
\section{Extended Description of V-JEPA} \label{appendix:vjepa_extended_description}
In this section, we provide an in-depth description of our approach \putalg that is illustrated in Figure~\ref{fig:vjepa-complex}.
\paragraph{\bf Input.} Unless stated otherwise, during pretraining we always randomly sample a clip of 16 frames from each input video with a temporal stride of 4 between sampled frames. An input video clip therefore covers 64 frames in total, or roughly 2 seconds of a given video running at 30 frames per second. We then resize the video's spatial dimensions to $224 \times 224$, resulting in an overall shape of $16 \times 224 \times 224 \times 3$ for the entire clip. Since ViT networks process a 1D sequence of tokens, we must convert an input video clip into a 1D token sequence. To do so, we apply a 3D convolution comprising $d$ filters of size $2 \times 16 \times 16$ with a temporal stride of $2$ and a spatial stride of $16$, resulting in a tensor of shape $8 \times 14 \times 14 \times d$. Next we add absolute 3D sin-cos positional embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape $1568 \times d$. This process is demonstrated in Figure~\ref{fig:patchitfy}. \begin{figure}[h] \centering \includegraphics[width=0.9\linewidth]{assets/patchify.pdf} \caption{\small{\bf \putalg} training operates on a video clip flattened into a sequence of tokens. To convert a video clip of size $16 \times 224 \times 224 \times 3$ into a 1D token sequence, we apply a 3D convolution comprising $d$ filters of size $2 \times 16 \times 16$ with a temporal stride of $2$ and a spatial stride of $16$, resulting in a tensor of shape $8 \times 14 \times 14 \times d$. Next we add absolute 3D sin-cos positional embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape $1568 \times d$.} \label{fig:patchitfy} \end{figure}
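The tokenization arithmetic above can be sketched with a small helper; the function name and the illustrative width $d=64$ are ours (the actual embedding dimension depends on the ViT variant):

```python
def patchify_shape(T=16, H=224, W=224, d=64, t_stride=2, p=16):
    # A 3D conv with d filters of size 2x16x16, temporal stride 2 and
    # spatial stride 16 maps a T x H x W x 3 clip to a feature map of
    # shape (T/2) x (H/16) x (W/16) x d, which is then flattened into
    # a 1D token sequence. (The 16 input frames are sampled with a
    # temporal stride of 4, covering 64 frames, ~2 s at 30 fps.)
    feat = (T // t_stride, H // p, W // p, d)
    n_tokens = feat[0] * feat[1] * feat[2]
    return feat, n_tokens
```

For the default clip size this yields a feature map of shape $8 \times 14 \times 14 \times d$ and $8 \cdot 14 \cdot 14 = 1568$ tokens, matching the shapes quoted above.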
\paragraph{\bf \putalg.} We sample both a video clip, and a video mask in each iteration. We denote a video clip represented as a 1D token sequence of length $L=1568$ by $x_{L} = (x_1, \ldots, x_L)$. Similarly, given a mask of $M < L$ patches, leaving $N=L-M$ patches unmasked, we denote the indices of masked patches by $(i_1, \ldots, i_M)$ and its complement (the indices of unmasked patches) by $(j_1, \ldots, j_{N})$.
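The index bookkeeping above can be sketched in numpy; the masking ratio below (1408 of 1568 tokens, i.e. roughly 90%) is illustrative, not the paper's exact schedule:

```python
import numpy as np

# Given L tokens and M masked indices, the unmasked indices are the
# complement, leaving N = L - M tokens visible to the x-encoder.
L = 1568
rng = np.random.default_rng(0)
masked_idx = rng.choice(L, size=1408, replace=False)   # (i_1, ..., i_M)
unmasked_idx = np.setdiff1d(np.arange(L), masked_idx)  # (j_1, ..., j_N)
```

Note that the two index sets partition the token sequence, so their sizes always sum to $L$.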
{\bf \it Computing the $x$-representations.} To compute the \putalg loss, we first produce the $x$-representations by masking the video clip and feeding it into the $x$-encoder; we denote the masked video by $x_N = (x_{j_1}, \ldots, x_{j_N})$. Applying the $x$-encoder $E_\theta(\cdot)$ to the masked clip gives a sequence of patch representations, denoted as $z_N = E_{\theta}(x_N) = (z_{j_1}, \ldots, z_{j_N})$.
$$ \sum^L_{i=1}{\frac{\exp(q^\top {\bf W_k} s_i)}{\sum_j \exp(q^\top {\bf W_k} s_j)} {\bf W_v} s_i }, $$
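A minimal numpy sketch of this cross-attention pooling, assuming $s$ is the $L \times d$ matrix of token features $s_i$, $q$ the learnable query, and ${\bf W_k}$, ${\bf W_v}$ the key/value projections (all placeholder values here, not trained weights):

```python
import numpy as np

def attentive_pool(s, q, W_k, W_v):
    # Computes sum_i softmax_i(q^T W_k s_i) W_v s_i for token features
    # s: (L, d), query q: (d,), projections W_k: (d, d), W_v: (d_v, d).
    logits = q @ (W_k @ s.T)            # (L,) attention logits
    logits -= logits.max()              # numerical stability
    attn = np.exp(logits) / np.exp(logits).sum()
    return (s @ W_v.T).T @ attn         # (d_v,) pooled vector
```

With zero logits the attention is uniform, so the pooled output reduces to the mean of the projected token features, a useful sanity check.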
Introduction
Humans possess the remarkable ability to map low-level signals originating from the retina into a semantic spatiotemporal understanding of the world; synthesizing notions such as objects and global motion (Spelke et al., 1995). A long-standing goal of the machine learning community is to identify the principles or objectives that may guide such unsupervised learning in humans (Field, 1994; Berkes and Wiskott, 2005; Hinton, 1989). One related hypothesis is based on the predictive feature principle (Rao and Ballard, 1999), which posits that representations of temporally adjacent sensory stimuli should be predictive of each other.
In this work, we revisit feature prediction as a standalone objective for unsupervised learning of visual representations from video. Numerous advances in the field, such as the standard use of transformer architectures in vision (Dosovitskiy et al., 2020), the maturing of masked autoencoding frameworks (Xie et al., 2021; Bao et al., 2021; He et al., 2021), query-based feature pooling (Chen et al., 2022), joint-embedding predictive architectures (JEPA) (LeCun, 2022; Assran et al., 2023; Baevski et al., 2022b), and larger datasets, together form a unique arsenal of tools, which we integrate in a modern and conceptually simple method: the video joint-embedding predictive architecture, or V-JEPA, which is based solely on feature prediction, without using pretrained image encoders, text, negative examples, human annotations, or pixel-level reconstruction.
Related Works
Slow Features. One way to encourage temporally adjacent representations to be predictive of each other is to ensure that they vary slowly over time. Early works targeting predictive features encouraged representations of individual video frames to be locally temporally invariant, while preventing representation collapse by using spectral methods, as in SFA (Wiskott and Sejnowski, 2002), SSA (Kayser et al., 2001), and Simulated Fixations (Zou et al., 2012). More recently, Goroshin et al. (2015); Wang et al. (2010) train a siamese convolutional network to map the representations of two subsequent frames to the same point, while encouraging distant frames to have diverse representations via a pairwise margin loss and a triplet loss, respectively. Other works (Oord et al., 2018; Surís et al., 2021; Feichtenhofer et al., 2021) implement temporal invariance using noise-contrastive estimation (Gutmann and Hyvärinen, 2012). Our exploration in this paper goes beyond temporal invariance and explores feature prediction using masked modeling.
Predictive Features. Going beyond local invariance, a family of works trains a predictor network to map the representation of a frame or clip at one time-step to a distinct representation at another time-step. Srivastava et al. (2015); Vondrick et al. (2016); Wang et al. (2023b) train such a video feature predictor network on top of a frozen pretrained image or video encoder. Unfreezing the target feature extractor, several methods train the video encoder and the predictor network simultaneously, while preventing collapse by using a supervised action forecasting loss (Girdhar and Grauman, 2021), or by using the representations of distant clips as negative samples in a contrastive loss (Han et al., 2019, 2020; Tan et al., 2023), often focusing on small convolutional encoders (Han et al., 2019, 2020). The idea of learning a representation by predicting missing information in feature space is also core to the joint-embedding predictive architecture (JEPA) (LeCun, 2022), which combines a siamese encoder with a predictor network. JEPAs have been successfully instantiated in several modalities, such as with audio data (Baevski et al., 2022b) and image data (Zhou et al., 2021; Oquab et al., 2023; Assran et al., 2023). In this work, we extend this paradigm to video data by leveraging recent advances in self-supervised learning.
Advances in Self-Supervised Learning. The use of vision transformers (Dosovitskiy et al., 2020; Li et al., 2022) has become standard practice in self-supervised learning with joint-embedding architectures (Chen et al., 2021; Caron et al., 2021; Oquab et al., 2023; Zhou et al., 2021; Assran et al., 2022), and unlocked masked image modeling in pixel space by parameterizing the pixel decoder as a transformer with learnable mask tokens (Dosovitskiy et al., 2020; Xie et al., 2021; He et al., 2021; Bao et al., 2021), demonstrating a step-change in the representation quality of autoencoding methods (Vincent et al., 2010). This line of generative methods was subsequently extended to video data using spatio-temporal masking (Tong et al., 2022; Feichtenhofer et al., 2022; Wang et al., 2023a; Kalluri et al., 2023; Gupta et al., 2023). It was also recently shown that the representations of masked image autoencoders could be significantly improved by using learnable pooling mechanisms based on cross-attention (Chen et al., 2022). Finally, through careful selection of design choices, the non-contrastive collapse prevention strategy in BYOL (Grill et al., 2020) was recently made to work with image feature prediction methods (Baevski et al., 2022b; Assran et al., 2023), which demonstrated the ability to learn representations that can be leveraged for various downstream tasks without relying on invariance to hand-crafted image transformations.
Similar to unsupervised learning from images, a family of unsupervised video representation learning approaches enforces a spatio-temporal representation of a video clip to be invariant to hand-crafted spatio-temporal data augmentations (Parthasarathy et al., 2022). However, the temporal ordering of visual information in video provides an additional source of implicit supervision, an insight leveraged by many works on unsupervised video learning. Towards leveraging temporal information as supervision, some approaches train a visual encoder by predicting the temporal ordering of frames (Xu et al., 2019; Lee et al., 2017). Other approaches seek to predict low-level motion vectors computed from optical flow (Pintea et al., 2014), or to predict missing pixels in video frames, using either a frame-interpolation objective (Kalluri et al., 2023) or a denoising autoencoder (Tong et al., 2022; Feichtenhofer et al., 2022; Wang et al., 2023a).
Feature Prediction versus Pixel Reconstruction.
Approaches that predict in pixel space must dedicate significant model capacity and compute to capture all the low-level detail in the visual input. By contrast, approaches that predict in latent space have the flexibility to eliminate irrelevant or unpredictable pixel-level details from the target representation (Vondrick et al., 2016). Predicting in representation space has been shown to lead to versatile representations that perform well across many downstream tasks through linear probing or lowshot adaptation (Assran et al., 2023; Oquab et al., 2023; Assran et al., 2022), while demonstrating an efficiency gain during pretraining compared to pixel level reconstruction (Assran et al., 2023; Baevski et al., 2022b,a). The works of Baevski et al. (2022a,b) additionally show that predicting in representation space results in competitive end-to-end fine-tuning performance in the image, audio and text domains. In this work, we extend these findings to the video modality.
Methodology: Video-JEPA

Figure 2 Joint-Embedding Predictive Architectures are trained to predict the representation of an input y from the representation of another input x . The additional variable z provides the predictor with information about the transformation that computes y from x .
Our goal is to explore the effectiveness of feature prediction as a stand-alone objective for learning visual representations from video. To that end, we use a joint-embedding predictive architecture (JEPA) (LeCun, 2022); see Figure 2. The main idea behind a JEPA is to learn by predicting the representation of an input y from the representation of another input x . The basic architecture is made up of an encoder, E θ ( · ) , which computes the representation of the inputs, and a predictor, P ϕ ( · ) , which predicts the representation of y from the representation of x , conditioned on a variable z indicating the transformation (or corruption) between x and y . Conditioning on z enables the generation of distinct predictions for various transformations of x .
Training Objective
We train our visual encoder E θ ( · ) to satisfy the constraint that representations computed from one part of the video, y , should be predictable from representations computed from another part of the video, x . The predictor network P ϕ ( · ) , which maps the representation of x to the representation of y , is trained simultaneously with the encoder, and is provided specification of the spatio-temporal positions of y through the conditioning variable z ← ∆ y .
Naively implementing the objective using the regression
$$
\underset{\theta,\,\phi}{\text{minimize}}\;\; \lVert P_\phi(E_\theta(x),\, \Delta_y) - E_\theta(y) \rVert_1,
$$
would admit a trivial solution, where the encoder outputs a constant representation, regardless of its input. In practice, we use the following modified objective to prevent representation collapse,
$$
\underset{\theta,\,\phi}{\text{minimize}}\;\; \lVert P_\phi(E_\theta(x),\, \Delta_y) - \mathrm{sg}(\bar{E}_\theta(y)) \rVert_1, \tag{1}
$$
where sg(·) denotes a stop-gradient operation, which does not backpropagate through its argument, and Ē_θ(·) is an exponential moving average of the network E_θ(·). The use of an exponential-moving-average feature extractor along with a stop-gradient and a predictor has been used as a collapse prevention strategy for image pretraining (Grill et al., 2020), and studied empirically (Xie et al., 2021) and theoretically (Tian et al., 2021). In fact, the objective in equation (1) is similar to the loss of Assran et al. (2023) used for image pretraining, but we modify it to use an ℓ1 regression, which we found to be more stable.
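The collapse-prevention recipe above (EMA target encoder, stop-gradient, ℓ1 regression) can be sketched in a few lines of PyTorch. This is a minimal sketch under simplifying assumptions, not the released implementation: the encoder and predictor are stand-in linear modules, the dimensions are arbitrary, and the conditioning variable is omitted.

```python
# Minimal sketch of the V-JEPA collapse-prevention recipe: an L1
# regression of predicted features onto targets produced by an EMA copy
# of the encoder, with a stop-gradient on the target branch.
# The modules are illustrative stand-ins, not the paper's ViTs.
import copy
import torch
import torch.nn as nn

encoder = nn.Linear(32, 16)              # stand-in for E_theta
predictor = nn.Linear(16, 16)            # stand-in for P_phi
target_encoder = copy.deepcopy(encoder)  # EMA copy, never receives gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

def loss_fn(x, y):
    # || P_phi(E_theta(x)) - sg(E_bar_theta(y)) ||_1
    pred = predictor(encoder(x))
    with torch.no_grad():                # stop-gradient on the target branch
        target = target_encoder(y)
    return torch.mean(torch.abs(pred - target))

@torch.no_grad()
def ema_update(momentum=0.998):
    # target parameters track the online encoder with momentum
    for p, p_t in zip(encoder.parameters(), target_encoder.parameters()):
        p_t.mul_(momentum).add_(p, alpha=1.0 - momentum)

x, y = torch.randn(8, 32), torch.randn(8, 32)
loss = loss_fn(x, y)
loss.backward()                          # gradients reach encoder and predictor only
ema_update()
```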
Theoretical motivation. A theoretical motivation for the effectiveness of this collapse prevention strategy was proposed in Grill et al. (2020) for the BYOL method. We provide a simple adaptation of their analysis for our ℓ1 loss. For ease of exposition, we will disregard the effect of the conditioning variable z and consider one-dimensional representations. Denote the representation E_θ(y) by a random variable Y. The optimal predictor under equation (1) is thus given by the following functional expression,
$$
P^\star(E_\theta(x)) = \mathrm{median}(Y \mid E_\theta(x)).
$$
Substituting this expression for the optimal predictor into the loss function and evaluating the expected gradient of the encoder gives
$$
\nabla_\theta\, \mathbb{E}\,\lvert P^\star(E_\theta(x)) - Y \rvert = \nabla_\theta\, \mathrm{MAD}(Y \mid E_\theta(x)),
$$
where MAD(· | E_θ(x)) is the median absolute deviation of a random variable conditioned on E_θ(x). Thus, in the case where the predictor is optimal, the encoder must learn to capture as much information about the video as possible to minimize the deviation of the target. The hypothesis is that incorporating an exponential moving average to compute the representation of y ensures that the predictor evolves faster than the encoder and remains close to optimal, thereby preventing collapse.
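The key fact in the argument above, that the optimal ℓ1 predictor is the (conditional) median, can be checked numerically. The following stand-alone snippet verifies that, among constant predictions, the empirical median minimizes the average absolute error; the sample values and grid are illustrative.

```python
# Numerical check: among constant predictions c, the empirical median
# minimizes the average L1 error |c - y|, which is why the optimal
# L1 predictor is the conditional median of Y.
import statistics

samples = [0.1, 0.4, 0.5, 2.0, 9.0]

def avg_l1(c):
    return sum(abs(c - y) for y in samples) / len(samples)

median = statistics.median(samples)              # 0.5 for this sample
candidates = [i / 100 for i in range(0, 1001)]   # grid over [0, 10]
best = min(candidates, key=avg_l1)               # grid search for the minimizer
assert best == median                            # the median wins the grid search
```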

Figure 3 V-JEPA. Training operates on a video clip of T frames with spatial resolution H × W, flattened into a sequence of L tokens. (Left to right): We first obtain the input of the x-encoder by dropping tokens from the video clip. The x-encoder then processes the masked video sequence, and outputs an embedding vector for each input token. Next, the outputs of the x-encoder are concatenated with a set of learnable mask tokens containing positional embeddings of the masked spatio-temporal patches. The predictor network processes the combined token sequence, and outputs an embedding vector for each mask token. The outputs of the predictor are then regressed to the prediction targets using an L1 loss. The prediction targets correspond to the output of the y-encoder.
Linear vs. Attentive probe. Table 12 shows that V-JEPA and VideoMAE benefit from using a non-linear attentive probe and multiple clips on the K400 and SSv2 downstream tasks. Additionally, Table 13 shows that attentive probing leads to better performance on average for DINOv2 and OpenCLIP models. Since attentive probing and multi-clip evaluation improve the performance of all models, we use them as our default protocol in frozen evaluation.
Table 11 Finetuning Evaluation hyper-parameters.
Table 12 Linear vs. Attentive Probe Evaluation for V-JEPA and VideoMAE. We evaluate the effect of linear (Lin.) and attentive (Att.) probing when adapting V-JEPA to the K400 (16×5×3) and SSv2 (16×2×2) tasks. V-JEPA and VideoMAE benefit from using a non-linear attentive probe.
Table 13 Linear vs. Attentive Probe Evaluation for DINOv2 and OpenCLIP. We evaluate the effect of linear (Lin.) and attentive probing (Att.) when adapting DINOv2 and OpenCLIP. Image-baselines benefit from using an attentive probing strategy. Results shown in gray are reported from the linear probe evaluation in Oquab et al. (2023).
One Clip vs. Multiple Clips. We examine the impact of changing the temporal coverage of a model during downstream evaluation on K400 action classification. In Table 14, we evaluate VideoMAE and V-JEPA models using an attentive probe with access to either the feature map of 1 clip randomly sampled from the video, or the concatenated feature maps of 8 clips randomly sampled from the video. To sample 8 clips from a video, we first divide the video into 8 equal-length temporal segments, and sample 1 clip at random from each segment. A single clip corresponds to ≈2 seconds of a video on average, while 8 clips correspond to ≈16 seconds. The video encoder processes each clip separately to produce clip-level feature maps, which are then concatenated at the input to the attentive probe.
Increasing the temporal coverage from 1 clip per video to 8 clips improves the performance of both V-JEPA and VideoMAE on K400 action classification. We therefore use the multi-clip attentive probing setup as our default evaluation pipeline.
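The 8-clip sampling procedure described above might be sketched as follows. This is an illustrative sketch, not the evaluation code: the total frame count, clip length, and frameskip are assumptions taken from the text, and the clamping at segment boundaries is our own simplification.

```python
# Sketch of multi-clip evaluation sampling: split a video into 8 equal
# temporal segments and sample one 16-frame clip (frameskip 4, i.e. a
# 64-frame window) at random from each segment.
import random

def sample_clips(num_frames, num_clips=8, clip_len=16, frameskip=4):
    window = clip_len * frameskip            # frames spanned by one clip
    seg_len = num_frames // num_clips
    clips = []
    for seg in range(num_clips):
        lo = seg * seg_len
        hi = max(lo, lo + seg_len - window)  # keep the window inside the segment
        start = random.randint(lo, hi)
        clips.append([start + i * frameskip for i in range(clip_len)])
    return clips

clips = sample_clips(num_frames=3840)        # e.g. ~2 minutes at 30 fps
assert len(clips) == 8 and all(len(c) == 16 for c in clips)
```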
Prediction Task: Predicting y from x
The feature prediction task is based on a masked modeling formulation (He et al., 2021; Tong et al., 2022); i.e., regions x and y from the video are sampled using masking. To sample y from a video, we sample several (possibly overlapping) spatially continuous blocks with various aspect ratios and repeat the spatial blocks across the entire temporal dimension of the video; x is taken to be the complement. Masking a large continuous block that covers the full temporal dimension limits information leakage due to the spatial and temporal redundancy of videos, and results in a harder prediction task (Tong et al., 2022).
We leverage two types of masks: short-range masks, where we take the union of 8 randomly sampled target blocks covering 15% of each frame, and long-range masks, where we take the union of 2 randomly sampled target blocks covering 70% of each frame. In both cases, the aspect ratio for all sampled blocks is randomly chosen in the range (0.75, 1.5). Given that both short-range and long-range masks are produced by sampling many blocks and taking their union, the result is an average masking ratio of ∼90%. We refer to our masking strategy as multi-block, and compare it to other possible masking strategies in Section 4.
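An illustrative sketch of multi-block mask sampling follows. The block counts, spatial scales, and aspect-ratio range come from the text; the 14×14 token grid and the exact rounding of block sizes are simplifying assumptions, not the official sampler.

```python
# Sketch of multi-block mask sampling: union several random spatial
# blocks (aspect ratio in (0.75, 1.5), each covering `scale` of the
# frame); the resulting spatial mask is repeated across every frame so
# the mask spans the full temporal dimension.
import math
import random

def sample_multiblock_mask(h=14, w=14, num_blocks=8, scale=0.15):
    mask = [[False] * w for _ in range(h)]       # True = masked (part of y)
    for _ in range(num_blocks):
        ar = random.uniform(0.75, 1.5)           # aspect ratio of the block
        area = scale * h * w
        bh = min(h, max(1, round(math.sqrt(area * ar))))
        bw = min(w, max(1, round(math.sqrt(area / ar))))
        top = random.randint(0, h - bh)
        left = random.randint(0, w - bw)
        for i in range(top, top + bh):
            for j in range(left, left + bw):
                mask[i][j] = True                # union of all sampled blocks
    return mask  # repeated over all frames -> full space-time tubes

short_range = sample_multiblock_mask(num_blocks=8, scale=0.15)
long_range = sample_multiblock_mask(num_blocks=2, scale=0.7)
ratio = sum(map(sum, long_range)) / (14 * 14)
assert 0.0 < ratio <= 1.0
```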
Network Parameterization
We use a Vision Transformer (ViT) (Dosovitskiy et al., 2020; Arnab et al., 2021) as our video backbone. To process a video with a transformer network, we split the video clip into a 3D grid of L spatio-temporal patches, where a patch consists of a 16 × 16 pixel block spanning 2 consecutive frames; we refer to these spatio-temporal patches as tokens. This sequence of tokens is then directly processed by the stack of transformer blocks. Since inputs x and y correspond to masked regions of a video, we apply the video masks by simply dropping a subset of the tokens. We apply masking at the input of the x-encoder, and at the output of the y-encoder to construct contextualized targets (Baevski et al., 2022b). The encoder is parameterized using standard ViT networks, while the predictor is a narrow transformer implemented using 12 blocks with an embedding dimension of 384. Taking inspiration from masked autoencoders (He et al., 2021), our predictor takes as input the sequence of embeddings produced by the x-encoder as well as a sequence of learnable mask tokens with positional embeddings indicating the spatio-temporal positions of the y tokens. The output of the predictor is an embedding vector for each mask token; see Figure 3 and refer to Appendix B for more details.
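The token bookkeeping described above can be made concrete with a short sketch. The patch size (16×16×2) and predictor width (384) follow the text; the tensors themselves are random stand-ins, the ~90% masking ratio is the average from the text, and the use of a single shared mask token plus additive positional embeddings is our reading, not the released code.

```python
# Token bookkeeping sketch: a 16-frame, 224x224 clip tokenized into
# 16x16x2 patches yields L = (16/2)*(224/16)*(224/16) = 1568 tokens.
# The predictor consumes the x-encoder outputs plus one mask token
# (with positional information) per masked patch.
import torch
import torch.nn as nn

T, H, W = 16, 224, 224
L = (T // 2) * (H // 16) * (W // 16)         # 1568 spatio-temporal tokens
num_masked = int(0.9 * L)                    # ~90% average masking ratio
num_ctx = L - num_masked                     # visible (context) tokens

dim = 384                                    # predictor embedding dimension
ctx_embeddings = torch.randn(1, num_ctx, dim)        # x-encoder outputs (stand-in)
mask_token = nn.Parameter(torch.zeros(1, 1, dim))    # shared learnable mask token
pos_embed = torch.randn(1, num_masked, dim)          # positions of the y patches

predictor_input = torch.cat(
    [ctx_embeddings, mask_token.expand(1, num_masked, dim) + pos_embed], dim=1
)
assert L == 1568 and predictor_input.shape == (1, L, dim)
```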
Pretraining Data and Evaluation Setup
Pretraining. We combine several public datasets to construct an unsupervised video pretraining dataset, which we refer to as VideoMix2M. Specifically, we combine the videos from HowTo100M (HT) (Miech et al., 2019), Kinetics-400/600/700 (K710) (Kay et al., 2017), and Something-Something-v2 (SSv2) (Goyal et al., 2017), and remove any overlap with the validation sets of Kinetics-400/600/700 and Something-Something-v2, resulting in approximately 2 million videos. We train a ViT-L/16, a ViT-H/16, and a ViT-H/16_384 transformer model on VideoMix2M. We use a batch size of 3072 for the ViT-L/16 and ViT-H/16 models, and a batch size of 2400 for the ViT-H/16_384 model. Each model takes as input a video clip of 16 frames sampled with a frameskip of 4, corresponding to roughly 3-second clips on average. The ViT-L/16 and ViT-H/16 process the video at a spatial resolution of 224, while the ViT-H/16_384 uses an input resolution of 384; cf. Appendix C.
Table 1 Pixels vs. Featurized Targets. We ablate the effect of computing the prediction loss in feature space vs pixel space. All models are trained on VideoMix2M for 90K iterations with a batch size of 3072 using the multi-block prediction task. We examine downstream performance using a frozen backbone with attentive probing, and report top-1 accuracy using a single center view. We also examine end-to-end fine-tuning performance of the models on K400. Predicting in feature space provides a consistent improvement over pixel space prediction.
Table 2 Pretraining Data Distribution. We pretrain all models for 90K iterations using a batch size of 3072, and evaluate downstream performance of the frozen backbones with an attentive probe using a single center view. Average performance across tasks increases with the pretraining dataset size.
Evaluations. Pretrained models are evaluated on downstream video and image tasks. On video tasks, we use a subset of the VideoGLUE benchmark (Yuan et al., 2023) to test for various capabilities; specifically, we investigate action recognition on Kinetics-400 (K400) (Kay et al., 2017), motion classification on Something-Something-v2 (SSv2) (Goyal et al., 2017), and action localization on AVA (Gu et al., 2018). Action classification on Kinetics evaluates the appearance-based understanding of the model, as many action classes in the dataset can be inferred from the presence of specific objects in the video (Sevilla-Lara et al., 2021). Motion classification on Something-Something-v2 evaluates the temporal understanding of the model, as action classes in the dataset are decoupled from the appearance/presence of specific objects in the video (Goyal et al., 2017). Finally, action localization on AVA evaluates the ability of the model to understand and localize motions in the video. We follow standard practice and report accuracy on K400 and SSv2 by sampling several spatial and temporal views. For static image tasks, we explore object recognition on ImageNet (Russakovsky et al., 2015), scene classification on Places205 (Zhou et al., 2014), and fine-grained recognition on iNaturalist 2021 (Van Horn et al., 2018).
In this section, we report V-JEPA pretraining details. Table 8 summarizes the main hyper-parameters used during pretraining.
Architectures. We use Vision Transformer (Dosovitskiy et al., 2020) (ViT) architectures for the x -encoder and y -encoder. We train three V-JEPA encoders: a ViT-L/16 224 , a ViT-H/16 224 and a ViT-H/16 384 . All three encoders take as input a short video clip of 16 frames with a temporal stride of 4 between consecutive frames. The subscripts, 224 and 384 , indicate the spatial resolution of the video clip. V-JEPA flattens the video clip into a sequence of non-overlapping spatio-temporal patches of size 16 × 16 × 2 (see Figure 7). For all three models, the predictor is designed as a narrow ViT architecture, consisting of 12 transformer blocks with an embedding dimension of 384. For simplicity, we keep the number of self-attention heads in the predictor equal to that of the backbone used for the context-encoder/target-encoder. V-JEPA is pretrained without using a [cls] token.
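For reference, the three encoder configurations and the shared predictor shape can be summarized as a small config sketch. The resolutions, frame counts, strides, and predictor dimensions come from the text; the dictionary keys and the token-count formula are illustrative.

```python
# The three V-JEPA encoder configurations from the text, plus the shared
# narrow-ViT predictor shape (12 blocks, embedding dimension 384).
encoders = {
    "ViT-L/16_224": {"resolution": 224, "frames": 16, "frame_stride": 4},
    "ViT-H/16_224": {"resolution": 224, "frames": 16, "frame_stride": 4},
    "ViT-H/16_384": {"resolution": 384, "frames": 16, "frame_stride": 4},
}
predictor = {"blocks": 12, "embed_dim": 384}

# Tokens per clip with 16x16x2 patches: (frames / 2) * (resolution / 16)^2
tokens = {
    name: (cfg["frames"] // 2) * (cfg["resolution"] // 16) ** 2
    for name, cfg in encoders.items()
}
assert tokens["ViT-L/16_224"] == 8 * 14 * 14   # 1568 tokens at 224px
assert tokens["ViT-H/16_384"] == 8 * 24 * 24   # 4608 tokens at 384px
```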
Optimization. We use AdamW (Loshchilov and Hutter, 2017) to optimize the x-encoder and predictor weights. The ViT-L/16_224 and ViT-H/16_224 models use a batch size of 3072, while the ViT-H/16_384 uses a batch size of 2400. Models are trained for a total of 90,000 iterations. The learning rate is linearly increased from 2×10⁻⁴ to 6.25×10⁻⁴ during the first 12,000 iterations of pretraining, and decayed to 10⁻⁶ following a cosine schedule.
Table 9 Frozen Evaluation hyper-parameters.
Weight-decay is also linearly increased from 0.04 to 0.4 throughout pretraining. The y-encoder weights are initialized identically to the x-encoder, and subsequently updated as an exponential moving average (EMA) (Tarvainen and Valpola, 2017) of the x-encoder weights using a momentum value which starts at 0.998 and is linearly increased to 1.0 during training (Caron et al., 2021; Assran et al., 2022). We scale all hyper-parameter schedules 25% beyond the actual training schedule: the learning rate, weight-decay, and EMA schedules are computed assuming a training length of 112,500 iterations, even though we only train our models for 90,000 iterations. We found that the last 25% of the default scheduler period updates hyper-parameters too aggressively, and simply truncating the schedules improved performance.
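The truncated-schedule trick can be sketched as follows: every schedule is computed over 112,500 iterations, but training stops at 90,000, so the aggressive tail is never reached. The endpoints and functional forms (linear ramps, warmup plus cosine) follow the text; per-step granularity is an assumption.

```python
# Sketch of the truncated hyper-parameter schedules: warmup + cosine
# learning rate, linear weight-decay and EMA-momentum ramps, all
# computed for 112,500 iterations while training uses only 90,000.
import math

TOTAL, USED, WARMUP = 112_500, 90_000, 12_000
LR_START, LR_PEAK, LR_END = 2e-4, 6.25e-4, 1e-6

def lr_at(step):
    if step < WARMUP:                        # linear warmup
        return LR_START + (LR_PEAK - LR_START) * step / WARMUP
    t = (step - WARMUP) / (TOTAL - WARMUP)   # cosine decay over the FULL schedule
    return LR_END + 0.5 * (LR_PEAK - LR_END) * (1 + math.cos(math.pi * t))

def wd_at(step):                             # weight decay: 0.04 -> 0.4
    return 0.04 + (0.4 - 0.04) * step / TOTAL

def ema_momentum_at(step):                   # EMA momentum: 0.998 -> 1.0
    return 0.998 + (1.0 - 0.998) * step / TOTAL

# Training stops at 90k, so the aggressive tail of each schedule is never reached.
assert lr_at(0) == LR_START
assert abs(lr_at(WARMUP) - LR_PEAK) < 1e-15
assert lr_at(USED - 1) > LR_END              # truncated before fully decaying
```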
Masking. As described in Section 3, we propose a 3D Multi-Block masking strategy. We use two types of masks: short-range masks, where we take the union of 8 randomly sampled target blocks with a spatial scale of 0.15, and long-range masks, where we take the union of 2 randomly sampled target blocks with a spatial scale of 0.7. In both cases, the aspect ratio for all sampled blocks is randomly chosen in the range (0.75, 1.5).

Figure 1 V-JEPA models pretrained on video learn versatile visual representations that perform well on motion-based tasks (Something-Something-v2) and appearance-based tasks (Kinetics-400) without adaptation of the model's parameters, i.e., using the same frozen backbone for both tasks.
We seek to answer the simple question:
How effective is feature prediction as a standalone objective for unsupervised learning from video with modern tools?
To that end, we pretrain a family of V-JEPA models on a dataset of 2 million videos collected from publicly available datasets by combining a masked modeling prediction task with a joint-embedding predictive architecture (see Figure 2). We measure performance on several downstream image and video tasks, using both frozen evaluation and end-to-end fine-tuning. Our findings suggest that feature prediction can indeed serve as an effective stand-alone objective for unsupervised learning from video, while using significantly shorter training schedules than pixel prediction methods. Specifically:
· Feature prediction leads to versatile visual representations that perform well across downstream image and video tasks without adaptation of the model's weights; i.e., using a frozen backbone. V-JEPA achieves the best performance among methods we consider (+6% accuracy) on the Something-Something-v2 task, which requires fine-grained temporal understanding. V-JEPA is also competitive on tasks like Kinetics-400, where appearance-based features are sufficient and hence state-of-the-art image models such as DINOv2 excel (Figure 1 and Table 6).
· Models trained with feature prediction are superior to pixel prediction approaches under a frozen evaluation protocol (attentive probing) and are competitive with pixel prediction under full fine-tuning, while using significantly shorter training schedules (Tables 5 and 6).
· Models trained with feature prediction are more label-efficient than pixel prediction approaches. Decreasing the available number of labeled examples increases the performance gap between V-JEPA and pixel-reconstruction models (Table 7).
What Matters for Learning Representations from Video?
In this section we isolate the contributions of several design choices, including: a) the use of a feature prediction versus pixel prediction objective, b) the construction of the pretraining data distribution, c) the feature pooling strategy for leveraging the model's representations in downstream tasks, and d) the masking strategy, towards identifying: what to predict from what?
Predicting Representations versus Pixels
We first ablate the effect of computing the prediction loss in representation space. We train a pair of ViT-L/16 models using either a V-JEPA feature prediction loss, or a mean-squared error loss with the normalized pixel values, as in masked autoencoders (He et al., 2021), and perform a sweep over the learning rate and weight decay schedules for both approaches. All models are pretrained on VideoMix2M for 90K iterations with a batch size of 3072 using multi-block masking. We examine performance on Kinetics-400 (K400), Something-Something-v2 (SSv2), and ImageNet-1K (IN1K), using a frozen backbone with an attentive probe, and report top-1 accuracy using a single center view. We also examine end-to-end fine-tuning performance of the models on Kinetics-400.
Results of this comparison are reported in Table 1 and indicate that predicting in feature space provides a consistent performance improvement over pixel space prediction in both frozen evaluation of the video backbone, as well as end-to-end fine-tuning.
Pretraining Data Distribution
Next we study the impact of the pretraining data distribution in Table 2. Leveraging large scale datasets
Table 3 Average Pooling vs. Adaptive Pooling. We pool the feature map output by the frozen V-JEPA encoder using an attentive probe, which is then fed into a linear classifier for downstream supervised tasks (K400 and SSv2). We evaluate two pooling strategies: 1) average pooling (Avg.), and 2) attentive pooling (Att.). Results are reported using a single center view. Using adaptive pooling with a cross-attention layer leads to improvements of +17.3 points on K400 and +16.1 points on SSv2.
has been critical for enabling the surge of advancements in other modalities, such as text and images (Kaplan et al., 2020; Cherti et al., 2023). We investigate whether a similar trend holds for video data. To control for the possible confounding variable of compute budget, we pretrain all models in Table 2 for 90K iterations using a batch-size of 3072. We report downstream results on K400, SSv2, and IN1K using a frozen backbone with an attentive probe, and report top-1 accuracy using a single center view.
Table 2 shows that average performance across tasks monotonically increases as we increase the size of the pretraining dataset, but the best task-specific performance is obtained by independently selecting the pretraining data for each specific downstream task. For instance, the L/16 obtains its best SSv2 performance when pretrained on K710+SSv2, its best K400 performance when pretrained only on K710, and its best IN1K performance when pretrained on K710+HT. The best average performance across all tasks is achieved by pretraining on VideoMix2M, which combines all the data sources. Similarly, the H/16 pretrained on K710+SSv2 achieves a greater K400 score than the H/16 pretrained on VideoMix2M; however, the top-performing H/16 on average is pretrained on VideoMix2M.
Evaluation: Attentive Probing
Next we explore the feature pooling strategy for applying the model's representations in downstream tasks. Since the prediction objective in equation (1) is unnormalized, there is no a priori reason for the encoder to yield a linearly separable subspace (Chen et al., 2020). Thus, rather than using a linear operation (averaging) to pool the features output by the frozen backbone, we explore a learnable non-linear pooling strategy. Specifically, when evaluating the frozen pretrained backbone on downstream tasks, we learn a cross-attention layer with a learnable query token. The output of the cross-attention layer is then added back to the query token (residual connection), and then fed into a two-layer MLP with a single GeLU activation, followed by a LayerNorm, and finally a linear classifier.

Table 4 Ablating Prediction Task. Models are ViT-L/16 networks pretrained on K710 and SSv2 and evaluated with an attentive probe using a single center view. The region x is sampled by masking spatio-temporal regions in the video; y is the mask complement. 1) random-tube[r]: x is obtained by masking a fraction r of tubes (spatial patches extended across the entire temporal duration) from the video, 2) causal multi-block[p]: x is restricted to the first p frames of the 16-frame video, which are then masked with a random set of spatio-temporal blocks, 3) multi-block: x is obtained by masking a random set of spatio-temporal blocks from the entire video. Best performance is obtained by using multi-block masking.
In Table 3 we see that using adaptive pooling with a learnable cross-attention layer leads to a significant improvement of +17.3 points on K400 and +16.1 points on SSv2. Using an attentive probe is also beneficial for other baseline models, as reported in Appendix E.
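Our reading of the attentive probe described above can be sketched as a small PyTorch module: cross-attention from a single learnable query over the frozen feature map, a residual connection back to the query, a two-layer MLP with GeLU, a LayerNorm, and a linear classifier. The embedding width, number of heads, and use of nn.MultiheadAttention are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the attentive probe used for frozen evaluation. Only the
# probe is trained; `features` would come from the frozen backbone.
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim=1024, num_classes=400, num_heads=16):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))   # learnable query token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)             # linear classifier

    def forward(self, features):             # features: (B, L, dim), frozen
        q = self.query.expand(features.size(0), -1, -1)
        pooled, _ = self.attn(q, features, features)        # cross-attention pooling
        pooled = pooled + q                  # residual back to the query token
        return self.head(self.norm(self.mlp(pooled))).squeeze(1)

probe = AttentiveProbe(dim=64, num_classes=10, num_heads=4)
logits = probe(torch.randn(2, 196, 64))
assert logits.shape == (2, 10)
```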
Comparison with Prior Work
In Section 5.1, we investigate the impact of feature prediction by comparing V-JEPA with video approaches that rely on pixel prediction, while using a similar architecture for all baselines. Subsequently, in Section 5.2, we remove the architectural constraint and report the best performance across architectures for self-supervised video and image pretraining approaches. Finally, we explore the label-efficiency of V-JEPA relative to other self-supervised video pretraining approaches in Section 5.3. We further detail the evaluation setup in Appendix D.
Comparison with Pixel Prediction
To investigate the effectiveness of feature prediction pretraining, we first compare V-JEPA to video masked modeling models relying on a pixel prediction loss. We control for the possible confounding factor of model architecture by evaluating all models using either a ViT-L/16 encoder or a Hiera-L encoder, which has a similar number of parameters. For the pixel prediction baselines, we consider VideoMAE (Tong et al., 2022; Wang et al., 2023a), which trains vision transformer autoencoders exclusively on video; Hiera (Ryali et al., 2023), which trains a hierarchical transformer autoencoder on video; and OmniMAE (Girdhar et al., 2023), which trains a vision transformer autoencoder on static images and video simultaneously.
Table 5 examines both frozen evaluation with an attentive probe on downstream video and image tasks, as well as end-to-end fine-tuning. In frozen evaluation, V-JEPA outperforms the baselines on all downstream tasks, except ImageNet, where we achieve 74.8% compared to 75.1% of an OmniMAE model trained directly on ImageNet; hence, V-JEPA achieves comparable ImageNet performance despite only pretraining on video.

Figure 4 SSv2 fine-tuning performance vs. Samples Seen. We report SSv2 fine-tuning for V-JEPA and pixel-reconstruction baselines using a ViT-L/16 or Hiera-L architecture. V-JEPA outperforms all pixel-reconstruction methods using a ViT-L/16 and matches the Hiera-L performance while seeing significantly fewer samples during pretraining.
Under the fine-tuning protocol, V-JEPA also achieves the best performance of any model trained with a ViT-L/16, and matches the performance of the Hiera-L on SSv2, which benefits from a hierarchical prior (Ryali et al., 2023). The V-JEPA models achieve this result while processing significantly fewer samples during pretraining (Figure 4), demonstrating the efficiency of feature prediction as a learning principle.
Comparison with State-of-the-Art
Next, in Table 6, we inspect how the V-JEPA models pretrained on video stack up next to the largest state-of-the-art self-supervised image and video models when freezing the backbone encoder and training an attentive probe on top. Our image pretrained baselines include OpenCLIP (Cherti et al., 2023), DINOv2 (Oquab et al., 2023), and I-JEPA (Assran et al., 2023). The OpenCLIP model is trained with a contrastive image-text alignment objective, while DINOv2 and I-JEPA are trained with self-supervision. These models are known to excel in their frozen-evaluation performance (Oquab et al., 2023); i.e., their ability to produce visual features that can be applied to many downstream tasks simultaneously, without end-to-end fine-tuning, and thus provide highly competitive baselines. Our video pretrained baselines include VideoMAE (Tong et al., 2022), OmniMAE (Girdhar et al., 2023), Hiera (Ryali et al., 2023), VideoMAEv2 (Wang et al., 2023a), and MVD (Wang et al., 2023b). The OpenCLIP, DINOv2 and VideoMAEv2 models are parameterized as Giant/Gigantic vision transformer architectures containing over 1B parameters trained on large-scale image or video datasets.
Comparison with video models. Compared to large-scale video baselines, the V-JEPA models outperform all previous models on every downstream video and image task by a notable margin (see Table 6). Our H/16 model outperforms the largest publicly available VideoMAE, VideoMAEv2, OmniMAE, MVD, and Hiera models by at least +5 points in motion understanding (Something-Something-v2), +2 points in action recognition (Kinetics-400), +5 points on action detection (AVA), +1 point on object recognition (ImageNet-1K), +2 points in scene recognition (Places205), and +0.2 points on fine-grained recognition (iNaturalist). Moreover, when comparing pretraining wallclock time in Figure 5, we see that V-JEPA achieves this performance with a roughly 2× speedup compared to the large pixel prediction models.

Figure 5 SSv2 frozen-evaluation performance vs. Pretraining Time. Wallclock times for all methods are measured on a single GPU with a batch size of 10 clips, using the official codebases for VideoMAE and VideoMAEv2, and linearly extrapolated assuming a global batch size of 2400 samples. Note, however, that the SSv2 accuracies of video pixel prediction methods are actually obtained with small batch sizes and significantly longer training schedules. V-JEPA outperforms pixel-reconstruction methods while training significantly faster.
Comparison with image models. On tasks that require a fine-grained understanding of motion (Something-Something-v2), the V-JEPA models provide a major improvement (over +21 points) compared to large-scale image baselines, such as DINOv2, OpenCLIP, and I-JEPA. Self-supervised pretraining from videos makes it possible to model dynamic concepts that are not easily learned from static image datasets. Similarly, we observe that the V-JEPA models outperform image-based pretraining on action localization.
On Kinetics-400, we find image models to perform well; e.g., while DINOv2 (Oquab et al., 2023) previously reported 78.4% on K400 with a linear probe, we improve the frozen evaluation of the g/14 model to 83.4% by using an attentive probe. In this case, our H/16 model achieves 82.0% top-1 accuracy. It is worth noting that the label for many Kinetics videos can be inferred using appearance-based cues, without requiring an understanding of motion (Sevilla-Lara et al., 2021).
The V-JEPA models narrow the gap with image models on image classification tasks. In particular, V-JEPA achieves a score of 77.4% on ImageNet using a one-layer attentive probe, which can be further improved to 77.9% using a two-layer attentive probe. More generally, we hypothesize that the datasets used to train V-JEPA and other video models are too constrained and lack the visual diversity of the internet-scale pretraining data used by the image models; as such, there is value in focusing future work on building diverse publicly available video datasets.

Table 7 Low-Shot Frozen Evaluation. Comparing V-JEPA to other video models in frozen evaluation on Kinetics-400 and Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive probe. We train the probes in several low-shot settings: using either 5%, 10%, or 50% of the train set, and take 3 random splits in each setting to obtain more robust metrics, resulting in 9 different evaluation experiments for each model. We report the mean performances and standard deviation using the K400 and SSv2 validation sets. V-JEPA is more label-efficient than other models; specifically, decreasing the number of available labeled examples from each class widens the performance gap between V-JEPA and the baselines.
Label-efficiency
We examine the label-efficiency of V-JEPA compared to other self-supervised video models by measuring the ability of the pretrained backbones to adapt to downstream tasks with few labels. Specifically, we investigate the performance of the frozen models on Kinetics-400 and Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive probe. We train the probes in several low-shot settings: using either 5% of the train set, 10%, or 50%, and take 3 random splits in each setting to obtain more robust metrics, resulting in 9 different evaluation experiments for each model. Table 7 reports the mean performances and standard deviation using the K400 and SSv2 validation sets.
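The split construction above can be sketched in a few lines of Python; `make_low_shot_splits` is a hypothetical helper for illustration, not part of any released code:

```python
import random

def make_low_shot_splits(num_train, fractions=(0.05, 0.10, 0.50), n_splits=3, seed=0):
    """Sample index subsets for training the attentive probe in low-shot settings.

    Returns a dict mapping each label fraction to n_splits random index subsets.
    """
    rng = random.Random(seed)
    indices = list(range(num_train))
    return {
        frac: [rng.sample(indices, int(frac * num_train)) for _ in range(n_splits)]
        for frac in fractions
    }

splits = make_low_shot_splits(num_train=1000)
# 3 fractions x 3 random splits = 9 evaluation experiments per model
n_experiments = sum(len(v) for v in splits.values())
```

Reporting the mean and standard deviation over the three splits of each fraction then yields the numbers in Table 7.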
We find V-JEPA to be more label-efficient than other self-supervised video models: decreasing the available number of labeled examples for training the attentive probe results in an increase in the performance gap between V-JEPA and the other models. In particular, the performance of the largest V-JEPA model on K400 drops by 12% to 68.2% top-1 when we reduce the number of labeled examples by a factor of 10× (from roughly 287 examples per class to 29 examples per class). By contrast, VideoMAEv2 drops by 30% to 37.0% top-1, VideoMAE drops by 15.9% to 62.3% top-1, and MVD drops by 14.6% to 62.6% top-1.
Similar observations hold on SSv2. The performance of the largest V-JEPA model on SSv2 drops by 13.9%
to 54.0% top-1 when we reduce the number of labeled examples by a factor of 10× (from roughly 440 examples per class to 48 examples per class). By contrast, VideoMAEv2 drops by 26% to 28.0% top-1, VideoMAE drops by 19.1% to 41.4% top-1, and MVD drops by 18.1% to 42.9% top-1.
Evaluating the Predictor
Next, we seek to qualitatively inspect the V-JEPA models. Recall that the predictor network in V-JEPA predicts the representations of a masked spatio-temporal region y from a visible region x , given the positional information of the masked regions (see Section 3). To qualitatively investigate the grounding of the feature-space predictions, we freeze the pretrained encoder and predictor networks and train a conditional diffusion decoder to map the V-JEPA predictions to interpretable pixels. Notably, the decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video (see Figure 6a).
Given a masked video, we use the V-JEPA pretrained models to predict the representations of the missing regions, and then use the decoder to project the representations to pixel space. Figure 6b shows decoder outputs for various random seeds. Qualities that are common across samples represent information that is contained in the predictor representation.
Figure 6b shows that the V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal consistency with the unmasked regions of the video. Specifically, the samples in Figure 6b show that the V-JEPA predictor correctly captures positional uncertainty and produces a variety of visual objects at various locations with consistent motion. Some of the samples also demonstrate an understanding of object-permanence, as the visual objects remain consistent after partial occlusion.
Conclusion
In this work, we explored the effectiveness of feature prediction as a stand-alone objective for unsupervised learning from video and introduced V-JEPA, a collection of vision models trained solely using a self-supervised feature prediction objective. The V-JEPA models demonstrate the ability to solve various downstream image and video tasks without adaptation of the model parameters, and outperform previous video representation learning approaches in frozen evaluation on action recognition, spatio-temporal action detection, and image classification tasks. Additionally, we show that pretraining V-JEPA on videos is particularly effective for solving downstream tasks requiring fine-grained motion understanding, while large-scale image models trained on internet-scale datasets fall short on such tasks. Finally, we empirically observed that V-JEPA models are label-efficient learners and exhibit good performance on downstream tasks even when only a few labeled examples are available.
References
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems , 34:24206-24221, 2021.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
Extended Related Works
We first review approaches for learning visual perception from static images before discussing strategies for learning from video.
Weakly-Supervised Learning from Static Images
One family of approaches for learning visual perception from static images trains a visual encoder to predict the representations of text captions often found accompanying images from the Web, as in CLIP (Radford et al., 2021) or CoCa (Yu et al., 2022). The largest open source CLIP model to date, numbering 2B parameters and trained on over 2B web-scraped images (Cherti et al., 2023), demonstrates impressive performance on a wide range of downstream image and video tasks. Notably, this is achieved using only the light-weight adaptation of task-specific heads, also referred to as frozen-evaluation, and does not require expensive end-to-end fine-tuning of the pretrained model.
Self-Supervised Learning from Static Images
Other approaches for learning from static images leverage unsupervised objectives. Initial works on self-supervised approaches are based on sparse coding or hand-crafted pretext tasks, such as colorization (Larsson et al., 2016, 2017), rotation prediction (Gidaris et al., 2020), and jigsaws (Noroozi and Favaro, 2016). More recent approaches leverage invariance-based objectives by training a visual encoder to be invariant to hand-crafted image transformations (Wu et al., 2018; Chen et al., 2020).
Another family of methods learn representations using denoising autoencoders (Vincent et al., 2008); image inpainting is one popular instantiation of this idea (Pathak et al., 2016). More recently, masked autoencoders (He et al., 2021) train an encoder-decoder transformer to predict missing pixels of a masked image. Follow-up work addresses the indeterminism of pixel reconstruction by exploring instantiations of masked image modeling in latent space (Baevski et al., 2022b; Assran et al., 2023; Baevski et al., 2022a). These approaches can be seen as applications of the predictive feature principle in the image modality.
There are also various methods that combine both masked image modeling and invariance criteria to learn visual representations from static images, such as iBOT (Zhou et al., 2021) and DINOv2 (Oquab et al., 2023); the latter is currently the most competitive instantiation of self-supervised learning with static images, scaled to a model with over 1.1B parameters trained on a curated dataset of 142M images.
Weakly-Supervised Learning from Videos
One family of approaches for learning visual perception from videos relies on weakly-supervised guidance from closed captioning, often computed from an ASR transcription of audio data accompanying internet videos. For instance, VideoBERT (Sun et al., 2019; Xu et al., 2021) trains a video encoder to predict masked spans in the textual closed captions. Similarly, VideoCLIP (Xu et al., 2021) trains a video encoder to predict the representation of video captions computed by a text encoder. Follow-up work such as MERLOT (Zellers et al., 2022), VATT (Akbari et al., 2021), and InternVideo (Wang et al., 2022) extended VideoCLIP by incorporating additional unsupervised objectives.
Self-Supervised Learning from Videos
Similar to unsupervised learning from images, a family of unsupervised video representation learning approaches enforces a spatio-temporal representation of a video clip to be invariant to hand-crafted spatio-temporal data augmentations (Parthasarathy et al., 2022). However, the temporal ordering of visual information in video can itself provide implicit supervision, and this is the key insight leveraged by many works on unsupervised video learning. Towards leveraging temporal information as supervision, some approaches train a visual encoder by predicting the temporal ordering of frames (Xu et al., 2019; Lee et al., 2017). Other approaches seek to predict low-level motion vectors computed from optical flow (Pintea et al., 2014), or to predict missing pixels in video frames, using either a frame-interpolation objective (Kalluri et al., 2023) or a denoising autoencoder (Tong et al., 2022; Feichtenhofer et al., 2022; Wang et al., 2023a).
Extended Description of V-JEPA
In this section, we provide an in-depth description of our approach, V-JEPA, which is illustrated in Figure 3.
Input. Unless stated otherwise, during pretraining we always randomly sample a clip of 16 frames from each input video with a temporal stride of 4 between sampled frames. An input video clip therefore covers 64 frames in total, or roughly 2 seconds of a given video running at 30 frames per second. We then resize the video's spatial dimensions to 224 × 224, resulting in an overall shape of 16 × 224 × 224 × 3 for the entire clip. Since ViT networks process a 1D sequence of tokens, we must convert an input video clip into a 1D token sequence. To do so, we apply a 3D convolution comprising d filters of size 2 × 16 × 16 with a temporal stride of 2 and a spatial stride of 16, resulting in a tensor of shape 8 × 14 × 14 × d. Next, we add absolute 3D sin-cos positional embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape 1568 × d. This process is demonstrated in Figure 7.
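The token-count arithmetic above can be verified with a short sketch; `token_sequence_length` is an illustrative helper:

```python
def token_sequence_length(frames=16, height=224, width=224, t_stride=2, patch=16):
    """Length of the 1D token sequence produced by the 2 x 16 x 16 patchify conv.

    The convolution downsamples time by t_stride and space by patch in each axis.
    """
    return (frames // t_stride) * (height // patch) * (width // patch)

# A 16-frame, 224 x 224 clip yields an 8 x 14 x 14 feature map, i.e. 1568 tokens.
L = token_sequence_length()
```

The same formula gives the longer sequences used at higher resolution, e.g. a 384 × 384 clip produces 8 × 24 × 24 tokens.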

Figure 7 V-JEPA training operates on a video clip flattened into a sequence of tokens. To convert a video clip of size 16 × 224 × 224 × 3 into a 1D token sequence, we apply a 3D convolution comprising d filters of size 2 × 16 × 16 with a temporal stride of 2 and a spatial stride of 16 , resulting in a tensor of shape 8 × 14 × 14 × d . Next we add absolute 3D sin-cos positional embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape 1568 × d .
V-JEPA. We sample both a video clip and a video mask in each iteration. We denote a video clip represented as a 1D token sequence of length L = 1568 by x_L = (x_1, ..., x_L). Similarly, given a mask of M < L patches, leaving N = L - M patches unmasked, we denote the indices of masked patches by (i_1, ..., i_M) and its complement (the indices of unmasked patches) by (j_1, ..., j_N).

Computing the x-representations. To compute the V-JEPA loss, we first produce the x-representations by masking the video clip and feeding it into the x-encoder; we denote the masked video by x_N = (x_{j_1}, ..., x_{j_N}). Applying the x-encoder E_θ(·) to the masked clip gives a sequence of patch representations, denoted as z_N = E_θ(x_N) = (z_{j_1}, ..., z_{j_N}).

Predicting the target. Next, the V-JEPA predictor network P_ϕ(·, ·) takes as input the tokens produced by the x-encoder and predicts the missing regions in the video clip, which are specified by a set of learnable mask tokens. Specifically, the mask tokens are parameterized as the sum of a shared learnable vector and an absolute 3D sin-cos positional embedding, denoted by m_M = (m_{i_1}, ..., m_{i_M}). The output of the predictor is thus given by ŝ_M = P_ϕ(z_N, m_M) = (ŝ_{i_1}, ..., ŝ_{i_M}), corresponding to a d-dimensional output for each of the M masked patches.

Computing the y-representations. Finally, to compute the prediction targets, the entire unmasked video clip is processed by the y-encoder to obtain a set of target representations, denoted by s_L = E_θ(x_L) = (s_1, ..., s_L). The V-JEPA loss is now computed as
$$
\frac{1}{M} \sum_{k=1}^{M} \left\lVert \hat{s}_{i_k} - s_{i_k} \right\rVert_1,
$$
which is simply the average L1 distance between the outputs of the predictor and the y-encoder at the masked locations. We then compute a gradient update with respect to the parameters of the x-encoder, θ, and the predictor, ϕ, and subsequently update the parameters of the y-encoder as an exponential moving average of the x-encoder weights (Polyak averaging).
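The loss computation can be sketched in plain Python; `pred` and `target` stand in for the predictor outputs and the y-encoder targets restricted to the M masked patches:

```python
def vjepa_loss(pred, target):
    """Average L1 distance between predicted and target feature vectors.

    pred, target: lists of d-dimensional feature vectors, one per masked patch.
    The per-patch L1 norm is summed over feature dimensions, then averaged
    over the M masked patches.
    """
    per_patch = [sum(abs(p - t) for p, t in zip(pv, tv)) for pv, tv in zip(pred, target)]
    return sum(per_patch) / len(per_patch)

# Predictions off by 1.0 in each of d=3 dimensions give a loss of 3.0 per patch.
loss = vjepa_loss([[1.0] * 3] * 4, [[0.0] * 3] * 4)
```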
Table 8 Pretraining hyper-parameters for V-JEPA.
Multi-Mask Prediction. To increase the efficiency of V-JEPA , we use a multi-masking strategy (Caron et al., 2020; Baevski et al., 2022a), which enables us to amortize the cost of the target computation. As mentioned in Section 3, for a given video clip, we sample 2 different masks, short-range and long-range. While we need to forward propagate the x -encoder and predictor separately for each mask, we only need to compute the y -representation once.
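The amortization above can be illustrated with a minimal sketch, using toy stand-in encoders (the real networks are transformers); the key point is the single y-encoder forward pass shared across masks:

```python
def multi_mask_step(clip, x_encoder, y_encoder, predictor, masks):
    """One V-JEPA step with multi-masking (illustrative sketch).

    The target representation is computed once and reused for every mask,
    while the x-encoder and predictor run once per mask.
    """
    targets = y_encoder(clip)                       # single y-encoder forward pass
    losses = []
    for mask in masks:                              # one x-encoder/predictor pass per mask
        visible = [tok for i, tok in enumerate(clip) if i not in mask]
        preds = predictor(x_encoder(visible), mask)
        losses.append(sum(abs(preds[i] - targets[i]) for i in mask) / len(mask))
    return sum(losses) / len(losses)

# Toy example: identity encoders and a predictor that always outputs 0.0.
clip = [1.0] * 8
loss = multi_mask_step(
    clip,
    x_encoder=lambda toks: toks,
    y_encoder=lambda toks: toks,
    predictor=lambda z, mask: {i: 0.0 for i in mask},
    masks=[{0, 1}, {2, 3, 4}],                      # e.g., short-range and long-range
)
```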
Pretraining details
In this section, we report V-JEPA pretraining details. Table 8 summarizes the main hyper-parameters used during pretraining.
Architectures. We use Vision Transformer (Dosovitskiy et al., 2020) (ViT) architectures for the x-encoder and y-encoder. We train three V-JEPA encoders: a ViT-L/16_224, a ViT-H/16_224, and a ViT-H/16_384. All three encoders take as input a short video clip of 16 frames with a temporal stride of 4 between consecutive frames. The subscripts, 224 and 384, indicate the spatial resolution of the video clip. V-JEPA flattens the video clip into a sequence of non-overlapping spatio-temporal patches of size 16 × 16 × 2 (see Figure 7). For all three models, the predictor is designed as a narrow ViT architecture, consisting of 12 transformer blocks with an embedding dimension of 384. For simplicity, we keep the number of self-attention heads in the predictor equal to that of the backbone used for the context-encoder/target-encoder. V-JEPA is pretrained without using a [cls] token.
Optimization. We use AdamW (Loshchilov and Hutter, 2017) to optimize the x-encoder and predictor weights. The ViT-L/16_224 and ViT-H/16_224 models use a batch size of 3072 while the ViT-H/16_384 uses a batch size of 2400. Models are trained for a total of 90,000 iterations. The learning rate is linearly increased from 2 × 10^-4 to 6.25 × 10^-4 during the first 12,000 iterations of pretraining, and decayed to 10^-6 following a cosine schedule.
Table 9 Frozen Evaluation hyper-parameters.
Weight-decay is also linearly increased from 0.04 to 0.4 throughout pretraining. The y-encoder weights are initialized identically to the x-encoder, and subsequently updated as an exponential moving average (EMA) (Tarvainen and Valpola, 2017) of the x-encoder weights using a momentum value which starts at 0.998 and is linearly increased to 1.0 during training (Caron et al., 2021; Assran et al., 2022). We scale all hyper-parameter schedules 25% beyond the actual training schedule. Specifically, the learning rate schedule, weight-decay schedule, and EMA schedule are computed assuming a training length of 112,500 iterations, even though we only train our model for 90,000 iterations. We found that the last 25% of the default scheduler period updates hyper-parameters too aggressively, and simply truncating the schedulers improved performance.
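The stretched-then-truncated schedules can be sketched as follows, using the constants quoted above; the helper names are illustrative:

```python
import math

HORIZON = 112_500   # schedules are computed over 125% of the 90,000 training iterations

def lr_schedule(step, warmup=12_000, base=2e-4, peak=6.25e-4, final=1e-6):
    """Linear warmup to the peak LR, then cosine decay over the stretched horizon."""
    if step < warmup:
        return base + (peak - base) * step / warmup
    t = (step - warmup) / (HORIZON - warmup)
    return final + 0.5 * (peak - final) * (1 + math.cos(math.pi * t))

def ema_momentum(step, start=0.998, end=1.0):
    """Linear EMA-momentum schedule over the same stretched horizon."""
    return start + (end - start) * step / HORIZON

# Training stops at iteration 90,000, so the schedules are truncated:
# the momentum never reaches 1.0 and the LR never fully decays to 1e-6.
m_last = ema_momentum(90_000)   # 0.998 + 0.002 * 0.8 = 0.9996
```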
Masking. As described in Section 3, we propose a 3D Multi-Block masking strategy. We use two types of masks: short-range masks, where we take the union of 8 randomly sampled target blocks with a spatial scale of 0.15, and long-range masks, where we take the union of 2 randomly sampled target blocks with a spatial scale of 0.7. In both cases, the aspect ratio for all sampled blocks is randomly chosen in the range (0.75, 1.5).
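A sketch of sampling one such mask on the 14 × 14 spatial grid of tokens (the mask is then extended across all temporal positions); the helper and its rounding details are illustrative assumptions, not the released implementation:

```python
import random

def sample_multiblock_mask(h, w, num_blocks, spatial_scale, seed=0):
    """Union of randomly placed blocks on an h x w grid (3D multi-block sketch).

    Each block covers roughly spatial_scale of the grid area, with an aspect
    ratio drawn from (0.75, 1.5); the resulting spatial mask is applied to
    every frame of the clip.
    """
    rng = random.Random(seed)
    mask = set()
    for _ in range(num_blocks):
        ar = rng.uniform(0.75, 1.5)
        bh = max(1, min(h, round((spatial_scale * h * w * ar) ** 0.5)))
        bw = max(1, min(w, round((spatial_scale * h * w / ar) ** 0.5)))
        top, left = rng.randrange(h - bh + 1), rng.randrange(w - bw + 1)
        mask |= {(i, j) for i in range(top, top + bh) for j in range(left, left + bw)}
    return mask

short_range = sample_multiblock_mask(14, 14, num_blocks=8, spatial_scale=0.15)
long_range = sample_multiblock_mask(14, 14, num_blocks=2, spatial_scale=0.7)
```

Because the blocks may overlap, the union typically covers a large fraction of the grid, matching the high effective masking ratios reported in the ablations.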
An important component of the V-JEPA pretraining strategy is the 3D clip masking strategy. In this section, we detail 26 ablation experiments exploring different masks. For all the experiments, we pretrain a ViT-B/16 on K400. Figure 8 presents a summary of those results.
Figure 8c shows the effect of changing the spatial and temporal masking ratio. Figure 8b ablates the number of sampled blocks used to construct the masks given a fixed effective masking ratio of 90%. Finally, in Figure 8a we examine our multi-masking strategy and find that sampling two masks for each clip (long-range and short-range) is more effective than sampling just a single mask for each clip.

Table 14 Temporal Coverage on Kinetics-400. We evaluate the effect of temporal coverage on K400. We train an attentive probe on K400 using either 1 clip (≈ 2 seconds of a video) or 8 clips (≈ 16 seconds of a video). To sample N clips, we first divide a video into N equal-length temporal segments and sample one clip at random per segment. The video encoder processes each clip in parallel and all the encoder output tokens are concatenated at the input of the attentive probe. Increasing the temporal coverage from 1 clip per video to 8 clips significantly improves the performance for both our VideoMAE baseline and V-JEPA.

Table 15 Finetuning results. We evaluate a V-JEPA model with the finetuning protocol on the K400 and SSv2 datasets using 16 frames per clip and multi-view fusion (5 × 3 or 2 × 3) for inference. The #Samples Seen entry corresponds to the number of video clips processed during pretraining, which is larger than the size of the pretraining dataset for multi-epoch training. We compare V-JEPA with different video self-supervised learning approaches. We report the VideoMAEv2 results without instruction-tuning for consistency with the other approaches. V-JEPA obtains competitive performance using the finetuning protocol.
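The N-clip sampling procedure used for the temporal-coverage evaluation (Table 14) can be sketched as follows; the frame-index arithmetic is an illustrative assumption:

```python
import random

def sample_clips(num_frames, n_clips, clip_len=16, stride=4, seed=0):
    """Divide a video into n_clips equal-length segments and sample one
    16-frame, stride-4 clip at random from each segment."""
    rng = random.Random(seed)
    span = (clip_len - 1) * stride + 1            # frames spanned by one clip
    seg = num_frames // n_clips
    clips = []
    for k in range(n_clips):
        lo = k * seg
        start = rng.randint(lo, max(lo, lo + seg - span))
        clips.append(list(range(start, start + span, stride)))
    return clips

# 8 clips from a 1,000-frame video; the encoder processes them in parallel
# and the attentive probe sees the concatenation of all output tokens.
clips = sample_clips(1_000, n_clips=8)
```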
In Figure 8c, we explore different average spatial and temporal masking ratios, i.e., the spatial/temporal fraction of the area that is covered by a mask on average for a clip. Recall that each mask is constructed by sampling several (possibly overlapping) blocks and taking their union. We change the average spatial or temporal masking ratio by changing the blocks' spatial or temporal size, as well as the overall number of blocks. We found that low spatial or temporal coverage results in a trivial prediction task, which degrades downstream performance. Based on those results, we sample masks that remove roughly 90% of the frame and extend along the entire temporal dimension of the clip by default.
In Figure 8b, we explore different block sizes given an effective spatial masking ratio of 90% and a temporal ratio of 100%. We keep the masking ratio approximately constant by changing the block size and the number of blocks at the same time. We find that sampling several blocks performs better than sampling a single large block. Figure 9 visually illustrates the effect of sampling several smaller blocks to construct a mask.
In Figure 8a, we explore the effect of sampling a varying number of masks per sample. We find that sampling two masks for each clip, with different spatial block sizes for each, is more effective than sampling just a single mask. We hypothesize that this masking strategy induces complementary prediction tasks. In our experiments, we use this as our default mask sampling strategy.

Table 16 Sample efficiency. We compare the sample efficiency of pretraining various state-of-the-art image and video models. The #Samples Seen entry corresponds to the number of samples (image or video clips) processed by the network during pretraining, which is larger than the size of the pretraining dataset for multi-epoch training. The V-JEPA results in this paper are obtained while processing an order of magnitude fewer samples than previous methods.

Figure 8 Masking Strategy Ablation. Evaluating a linear probe on a ViT-B/16 pretrained with V-JEPA on K400 under various 3D Multi-Block masking settings. We examine the impact of (a) sampling several masks per video, (b) varying the number of blocks in a mask, and (c) varying the average spatial and temporal masking ratio. A temporal masking ratio of 100% extends the spatial mask across all the frames in the clip. We find it important to maintain a high spatial and temporal masking ratio during pretraining.
Figure 9 Illustration of masks with varying numbers of blocks and block sizes. Each mask is constructed by sampling several (possibly overlapping) blocks and taking their union.
| | | Frozen Evaluation | Frozen Evaluation | Frozen Evaluation | Fine-Tuning |
|---|---|---|---|---|---|
| Target | Arch. | K400 (16 × 1 × 1) | SSv2 (16 × 1 × 1) | IN1K | K400-ft (16 × 5 × 3) |
| Pixels | ViT-L/16 | 68.6 | 66.0 | 73.3 | 85.4 |
| Features | ViT-L/16 | 73.7 | 66.2 | 74.8 | 85.6 |
| | | | Frozen Evaluation | Frozen Evaluation | Frozen Evaluation | |
|---|---|---|---|---|---|---|
| Arch. | Data | #Samples | K400 (16 × 1 × 1) | SSv2 (16 × 1 × 1) | IN1K | Avg. |
| ViT-L/16 | K710 | 700K | 75.8 | 63.2 | 73.7 | 70.9 |
| ViT-L/16 | K710+SSv2 | 900K | 72.9 | 67.4 | 72.8 | 71.0 |
| ViT-L/16 | K710+HT | 1900K | 74.5 | 64.2 | 74.8 | 71.1 |
| ViT-L/16 | VideoMix2M | 2000K | 73.7 | 66.2 | 74.8 | 71.5 |
| ViT-H/16 | K710+SSv2 | 900K | 75.7 | 66.8 | 73.7 | 72.0 |
| ViT-H/16 | VideoMix2M | 2000K | 74.0 | 68.5 | 75.9 | 72.8 |
| | | K400 (16 × 1 × 1) | K400 (16 × 1 × 1) | SSv2 (16 × 1 × 1) | SSv2 (16 × 1 × 1) |
|---|---|---|---|---|---|
| Method | Arch. | Avg. | Att. | Avg. | Att. |
| V-JEPA | ViT-L/16 | 56.7 | 73.7 | 50.1 | 66.2 |
| | Frozen Evaluation | Frozen Evaluation | Frozen Evaluation |
|---|---|---|---|
| Masking | K400 (16 × 1 × 1) | SSv2 (16 × 1 × 1) | IN1K |
| random-tube[0.9] | 51.5 | 46.4 | 55.6 |
| causal multi-block[6] | 61.3 | 49.8 | 66.9 |
| causal multi-block[12] | 71.9 | 63.6 | 72.2 |
| multi-block | 72.9 | 67.4 | 72.8 |
| | | | | Frozen Evaluation w/ Att. Pooling | Frozen Evaluation w/ Att. Pooling | Frozen Evaluation w/ Att. Pooling | Frozen Evaluation w/ Att. Pooling | Frozen Evaluation w/ Att. Pooling | Frozen Evaluation w/ Att. Pooling | Fine-Tuning | Fine-Tuning |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | Arch. | #Samples Seen | Iter. | K400 (16 × 8 × 3) | SSv2 (16 × 2 × 3) | AVA | IN1K | Places205 | iNat21 | K400-ft (16 × 5 × 3) | SSv2-ft (16 × 2 × 3) |
| Methods pretrained using pixel prediction | | | | | | | | | | | |
| OmniMAE | ViT-L/16 | 2400M | 1170K | 65.6 | 60.6 | 14.4 | 75.1 | 59.8 | 66.1 | 84.0 | 74.2 |
| VideoMAE | ViT-L/16 | 410M | 400K | 77.8 | 65.5 | 21.6 | 71.1 | 59.3 | 64.6 | 85.4 | 74.3 |
| Hiera | Hiera-L | 770M | 1500K | 75.5 | 64.2 | 15.8 | 68.9 | 58.5 | 56.9 | 87.3 | 75.1 |
| V-JEPA | ViT-L/16 | 270M | 90K | 80.8 | 69.5 | 25.6 | 74.8 | 60.3 | 67.8 | 85.6 | 75.1 |
Video tasks (K400, SSv2, AVA) and image tasks (IN1K, Places205, iNat21):

| Method | Arch. | Params. | Data | K400 (16 × 8 × 3) | SSv2 (16 × 2 × 3) | AVA | IN1K | Places205 | iNat21 |
|---|---|---|---|---|---|---|---|---|---|
| *Methods pretrained on images* | | | | | | | | | |
| I-JEPA | ViT-H/16 512 | 630M | IN22K | 79.7 | 50.0 | 19.8 | 84.4 | 66.5 | 85.7 |
| OpenCLIP | ViT-G/14 | 1800M | LAION | 81.8 | 34.8 | 23.2 | 85.3 | 70.2 | 83.6 |
| DINOv2 | ViT-g/14 | 1100M | LVD-142M | 83.4 | 50.6 | 24.3 | 86.2 | 68.4 | 88.8 |
| *Methods pretrained on videos* | | | | | | | | | |
| MVD | ViT-L/16 | 200M | IN1K+K400 | 79.4 | 66.5 | 19.7 | 73.3 | 59.4 | 65.7 |
| OmniMAE | ViT-H/16 | 630M | IN1K+SSv2 | 71.4 | 65.4 | 16.0 | 76.3 | 60.6 | 72.4 |
| VideoMAE | ViT-H/16 | 630M | K400 | 79.8 | 66.2 | 20.7 | 72.3 | 59.1 | 65.5 |
| VideoMAEv2 | ViT-g/14 | 1100M | Un.Hybrid | 71.2 | 61.2 | 12.9 | 71.4 | 60.6 | 68.3 |
| Hiera | Hiera-H | 670M | K400 | 77.0 | 64.7 | 17.5 | 71.4 | 59.5 | 61.7 |
| V-JEPA | ViT-L/16 | 200M | VideoMix2M | 80.8 | 69.5 | 25.6 | 74.8 | 60.3 | 67.8 |
| V-JEPA | ViT-H/16 | 630M | VideoMix2M | 82.0 | 71.4 | 25.8 | 75.9 | 61.7 | 67.9 |
| V-JEPA | ViT-H/16 384 | 630M | VideoMix2M | 81.9 | 72.2 | 25.0 | 77.4 | 62.8 | 72.6 |
Frozen evaluation; K400 (16 × 8 × 3), SSv2 (16 × 2 × 3):

| Method | Arch. | K400 5% (∼29/class) | K400 10% (∼58/class) | K400 50% (∼287/class) | SSv2 5% (∼48/class) | SSv2 10% (∼96/class) | SSv2 50% (∼440/class) |
|---|---|---|---|---|---|---|---|
| MVD | ViT-L/16 | 62.6 ± 0.2 | 68.3 ± 0.2 | 77.2 ± 0.3 | 42.9 ± 0.8 | 49.5 ± 0.6 | 61.0 ± 0.2 |
| VideoMAE | ViT-H/16 | 62.3 ± 0.3 | 68.5 ± 0.2 | 78.2 ± 0.1 | 41.4 ± 0.8 | 48.1 ± 0.2 | 60.5 ± 0.4 |
| VideoMAEv2 | ViT-g/14 | 37.0 ± 0.3 | 48.8 ± 0.4 | 67.8 ± 0.1 | 28.0 ± 1.0 | 37.3 ± 0.3 | 54.0 ± 0.3 |
| V-JEPA | ViT-H/16 | 67.0 ± 0.2 | 72.1 ± 0.1 | 80.2 ± 0.2 | 51.9 ± 0.3 | 57.5 ± 0.4 | 67.3 ± 0.2 |
| V-JEPA | ViT-H/16 384 | 68.2 ± 0.2 | 72.8 ± 0.2 | 80.6 ± 0.2 | 54.0 ± 0.2 | 59.3 ± 0.5 | 67.9 ± 0.2 |
| Hyper-parameter | ViT-L/16 224 | ViT-H/16 224 | ViT-H/16 384 |
|---|---|---|---|
| data | |||
| datasets | VideoMix2M | VideoMix2M | VideoMix2M |
| resolution | 224 | 224 | 384 |
| num_frames | 16 | 16 | 16 |
| temporal_stride | 4 | 4 | 4 |
| horizontal_flip | true | true | true |
| random_resize_scale | (0.3, 1.0) | (0.3, 1.0) | (0.3, 1.0) |
| random_resize_aspect_ratio | (0.75, 1.35) | (0.75, 1.35) | (0.75, 1.35) |
| masking | |||
| block_aspect_ratio | (0.75, 1.5) | (0.75, 1.5) | (0.75, 1.5) |
| shortrange_mask_num_blocks | 8 | 8 | 8 |
| shortrange_mask_spatial_scale | 0.15 | 0.15 | 0.15 |
| longrange_mask_num_blocks | 2 | 2 | 2 |
| longrange_mask_spatial_scale | 0.7 | 0.7 | 0.7 |
| optimization | |||
| batch_size | 3072 | 3072 | 2400 |
| total_number_of_iterations | 90000 | 90000 | 90000 |
| warmup_iterations | 12000 | 12000 | 12000 |
| lr | 6.25e-4 | 6.25e-4 | 6.25e-4 |
| start_lr | 2e-4 | 2e-4 | 2e-4 |
| final_lr | 1e-6 | 1e-6 | 1e-6 |
| start_momentum | 0.998 | 0.998 | 0.998 |
| final_momentum | 1.0 | 1.0 | 1.0 |
| start_weight_decay | 0.04 | 0.04 | 0.04 |
| final_weight_decay | 0.4 | 0.4 | 0.4 |
| scheduler_scale_factor | 1.25 | 1.25 | 1.25 |
| architecture | |||
| patch_size | 16 | 16 | 16 |
| tubelet_size | 2 | 2 | 2 |
| pred_depth | 12 | 12 | 12 |
| pred_embed_dim | 384 | 384 | 384 |
| hardware | |||
| dtype | bfloat16 | bfloat16 | bfloat16 |
| accelerator | A100 80G | A100 80G | A100 80G |
| Hyper-parameter | K400 | SSv2 | IN1K | Place205 | iNat21 |
|---|---|---|---|---|---|
| data | |||||
| num_clips | 8 | 2 | N.A. | N.A. | N.A. |
| num_frames | 16 | 16 | N.A. | N.A. | N.A. |
| temporal_stride | 4 | 4 | N.A. | N.A. | N.A. |
| horizontal_flip | true | true | true | true | true |
| random_resize_scale | (0.08, 1.0) | (0.08, 1.0) | (0.08, 1.0) | (0.08, 1.0) | (0.08, 1.0) |
| random_resize_aspect_ratio | (0.75, 1.33) | (0.75, 1.33) | (0.75, 1.33) | (0.75, 1.33) | (0.75, 1.33) |
| auto_augment | false | false | true | true | true |
| optimization | |||||
| batch_size | 256 | 256 | 1024 | 1024 | 1024 |
| epochs | 20 | 20 | 20 | 20 | 20 |
| lr | 1e-3 | 1e-3 | 1e-3 | 1e-3 | 1e-3 |
| final_lr | 0 | 0 | 0 | 0 | 0 |
| weight_decay | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| Hyper-parameter | ViT-L/16 | ViT-H/16 |
|---|---|---|
| out_layers | [18, 20, 22, 24] | [26, 28, 30, 32] |
| batch_size | 64 | 64 |
| epochs | 30 | 30 |
| opt | AdamW | AdamW |
| opt_eps | 1e-8 | 1e-8 |
| momentum | 0.9 | 0.9 |
| weight_decay | 0.05 | 0.05 |
| lr | 1e-4 | 1e-4 |
| warmup_lr | 1e-6 | 1e-6 |
| min_lr | 1e-6 | 1e-6 |
| warmup_epochs | 2 | 2 |
| warmup_steps | 1 | 1 |
| Hyper-parameter | K400 | K400 | SSv2 | SSv2 | |
|---|---|---|---|---|---|
| data | |||||
| num_segments | 1 | 1 | 1 | 1 | |
| num_frames | 16 | 16 | 16 | 16 | |
| sampling_rate | 4 | 4 | 4 | 4 | |
| resolution | 224 | 224 | 224 | 224 | |
| model | |||||
| model_name | ViT-L/16 | ViT-H/16 | ViT-L/16 | ViT-H/16 | |
| drop_path | 0.1 | 0.2 | 0.2 | 0.2 | |
| head_drop_rate | 0. | 0. | 0.5 | 0.5 | |
| optimization | |||||
| batch_size | 256 | 1024 | 256 | 256 | |
| epochs | 35 | 25 | 15 | 15 | |
| opt | adamw | ||||
| opt_eps | 0.00000001 | ||||
| momentum | 0.9 | | | | |
| weight_decay | 0.05 | | | | |
| lr | 0.002 | 0.0005 | 0.0005 | 0.0005 | |
| layer_decay | 0.75 | 0.75 | 0.75 | 0.75 | |
| warmup_lr | 1e-6 | 1e-8 | 1e-6 | 1e-6 | |
| min_lr | 1e-6 | 1e-5 | 1.5e-4 | 1.5e-3 | |
| warmup_epochs | 5 | ||||
| augmentations | | | | | |
| color_jitter | 0.4 | | | | |
| horizontal_flip | True | True | False | False | |
| num_sample | 2 | ||||
| aa | rand-m7-n4-mstd0.5-inc1 | ||||
| smoothing | 0.1 | ||||
| train_interpolation | bicubic | ||||
| test_num_segment | 5 | 5 | 2 | 2 | |
| test_num_crop | 3 | 3 | 3 | 3 | |
| erase | |||||
| prob | 0.25 | ||||
| mode | pixel | ||||
| count | 1 | ||||
| split | False | ||||
| mixup | |||||
| mixup | 0.8 | ||||
| cutmix | 1.0 | ||||
| mixup_prob | 1.0 | ||||
| mixup_switch_prob | 0.5 | ||||
| mixup_mode | batch |
| | | K400 | K400 | SSv2 | SSv2 |
|---|---|---|---|---|---|
| Method | Arch. | Lin. | Att. | Lin. | Att. |
| VideoMAE | ViT-L/16 | 52.5 | 77.8 | 41.3 | 61.2 |
| V-JEPA | ViT-L/16 | 56.7 | 80.8 | 50.1 | 69.5 |
| | | K400 | K400 | SSv2 | SSv2 | IN1K | IN1K | Place205 | Place205 | iNat21 | iNat21 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | Arch. | Lin. | Att. | Lin. | Att. | Lin. | Att. | Lin. | Att. | Lin. | Att. |
| DINOv2 | ViT-g/14 | 78.4 | 83.4 | 38.3 | 50.0 | 86.5 | 86.2 | 67.5 | 68.4 | 85.7 | 88.8 |
| OpenCLIP | ViT-G/14 | 78.3 | 81.8 | 35.8 | 34.8 | 86.2 | 85.3 | 69.8 | 70.2 | 76.0 | 83.6 |
| Method | Arch. | 1 Clip | 8 Clips |
|---|---|---|---|
| VideoMAE | ViT-L/16 | 69.4 | 77.8 |
| V-JEPA | ViT-L/16 | 73.7 | 80.9 |
| Method | Arch. | Pretraining Data | #Samples Seen | K400 (16 × 5 × 3) | SSv2 (16 × 2 × 3) |
|---|---|---|---|---|---|
| VideoMAEv1 | ViT-L/16 | K400 / SSv2 | 380M / 410M | 85.4 | 74.3 |
| VideoMAEv1 | ViT-H/16 | K400 / SSv2 | 380M / 410M | 86.6 | 74.8 |
| VideoMAEv2 | ViT-H/16 | Un.Hybrid | 1600M | 86.9 | 76.8 |
| MVD | ViT-L/16 | K400+IN1K | 2400M | 86.4 | 76.7 |
| MVD | ViT-H/16 | K400+IN1K | 2400M | 87.2 | 77.3 |
| V-JEPA | ViT-L/16 | VideoMix2M | 270M | 85.6 | 75.1 |
| V-JEPA | ViT-H/16 | VideoMix2M | 270M | 86.6 | 77.0 |
| Method | Arch. | Data | #Samples Seen |
|---|---|---|---|
| OpenCLIP | ViT-G/14 | LAION-2B | 39000M |
| DINOv2 | ViT-g/14 | LVD 142M | 1900M |
| VideoMAEv2 | ViT-g/14 | UnlabeledHybrid | 1600M |
| V-JEPA | ViT-H/16 384 | VideoMix2M | 210M |
Evaluation details
Frozen classification
Attentive Probing. Given an input video x, the V-JEPA target encoder E_θ(·) outputs a sequence of L tokens, E_θ(x) = (s_1, …, s_L), where s_i ∈ R^d. To pool this sequence of tokens into a single feature vector, we apply a lightweight non-linear cross-attention block, which replaces the self-attention operation of a transformer block with cross-attention. Specifically, the cross-attention performs the following computation:
$$
\sum_{i=1}^{L} \frac{\exp\left(q^\top W_k s_i\right)}{\sum_{j=1}^{L} \exp\left(q^\top W_k s_j\right)} \, W_v s_i,
$$
where W_k, W_v ∈ R^{d×d} are the key and value matrices, and q ∈ R^d is a learnable query token. The output of the cross-attention is then added back to the query token (residual connection), and then fed into a two-layer MLP with a single GeLU activation, followed by a LayerNorm, and finally a linear classifier. The parameters of the cross-attention block are jointly learned with those of the linear classifier for the downstream task, while the encoder parameters are kept frozen. Note that, in practice, we use an attentive probe with 12 heads, each of dimension 12. In Appendix E we show that baselines benefit from the attentive probing protocol.
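As an illustration, this pooling operation can be sketched in a few lines of numpy. This is a single-head sketch with illustrative dimensions; the MLP, LayerNorm, classifier head, and the 12-head split used in practice are omitted:

```python
import numpy as np

def attentive_pool(tokens, q, W_k, W_v):
    """Pool a sequence of L frozen tokens (L, d) into one vector (d,)
    using a single learnable query q (d,), as in an attentive probe."""
    keys = tokens @ W_k.T            # (L, d)
    values = tokens @ W_v.T          # (L, d)
    logits = keys @ q                # (L,) attention scores q^T W_k s_i
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()         # softmax over the L tokens
    pooled = weights @ values        # (d,) convex combination of values
    return q + pooled                # residual connection back to the query

rng = np.random.default_rng(0)
L, d = 8, 16
tokens = rng.standard_normal((L, d))
q = rng.standard_normal(d)
W_k = rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)
out = attentive_pool(tokens, q, W_k, W_v)
assert out.shape == (d,)
```

In training, q, W_k, and W_v would be optimized jointly with the downstream classifier while the encoder producing `tokens` stays frozen.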
Optimization. For all tasks, we use the AdamW optimizer with a cosine scheduler (no warmup) that decays the learning rate from 0.001 to 0. We use a fixed weight decay of 0.01 and apply simple data augmentations (random resized crops and horizontal flips) during training of the attentive probe, except on image tasks, where we apply AutoAugment (Dogus Cubuk et al., 2019). Table 9 reports the hyperparameters for each downstream evaluation.
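The probe's learning-rate schedule is a plain cosine decay with no warmup; a minimal sketch (the peak value 1e-3 and the zero floor follow the text, while the step counts here are arbitrary placeholders):

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-3, lr_final=0.0):
    """Cosine schedule without warmup: decay from lr_max to lr_final."""
    progress = step / max(1, total_steps - 1)
    return lr_final + 0.5 * (lr_max - lr_final) * (1 + math.cos(math.pi * progress))

# The schedule starts at lr_max and reaches lr_final on the last step.
assert abs(cosine_lr(0, 100) - 1e-3) < 1e-12
assert cosine_lr(99, 100) < 1e-12
```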
Extension to multiple clips. Unless stated otherwise, our attentive probe takes 8 clips of 16 frames as input on Kinetics, and 2 clips of 16 frames on Something-Something-v2, to increase the temporal coverage of the video.
Table 10 Frozen Detection hyper-parameters.
Specifically, we first divide a video into 8 (or 2) equal-length temporal segments and sample 1 clip at random per segment. The video encoder E_θ processes each clip separately and produces a clip-level feature map. The feature maps for all clips are then concatenated together and fed to the attentive probe. At test time, we average the predictions over 3 spatial views, following standard practice in video classification.
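The segment-wise clip sampling can be sketched as follows. This is index bookkeeping only: the temporal stride of 4 used when decoding frames is omitted, and the segment count and clip length are the Kinetics defaults from the text:

```python
import numpy as np

def sample_clips(num_video_frames, num_segments=8, frames_per_clip=16, seed=0):
    """Split a video into equal-length temporal segments and sample one
    clip (a run of consecutive frame indices) at random per segment."""
    rng = np.random.default_rng(seed)
    bounds = np.linspace(0, num_video_frames, num_segments + 1, dtype=int)
    clips = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        hi = max(start, end - frames_per_clip)  # latest valid clip start
        s = rng.integers(start, hi + 1)
        clips.append(np.arange(s, s + frames_per_clip))
    return np.stack(clips)  # (num_segments, frames_per_clip)

clips = sample_clips(512, num_segments=8, frames_per_clip=16)
assert clips.shape == (8, 16)
assert (clips >= 0).all() and (clips < 512).all()
```

Each row of `clips` would be encoded separately; the resulting feature maps are then concatenated before attentive pooling.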
Application of video models to images. To evaluate the video models on image tasks, we simply duplicate input images to generate still video clips of 16 frames. We perform this duplication purely for convenience in evaluating the video models; we find this step to be unnecessary in general. Given a video tokenizer implemented as a 3D conv with a temporal stride of 2, it is sufficient to duplicate the image into a 2-frame video clip. This results in the same number of input tokens as that produced by a static image model with a 2D-conv tokenizer.
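The duplication step is a one-liner; a minimal sketch:

```python
import numpy as np

def image_to_clip(image, num_frames=16):
    """Duplicate a still image (H, W, 3) along a new time axis to form
    a static video clip (num_frames, H, W, 3)."""
    return np.repeat(image[np.newaxis], num_frames, axis=0)

img = np.zeros((224, 224, 3), dtype=np.float32)
clip = image_to_clip(img, num_frames=16)
assert clip.shape == (16, 224, 224, 3)
# With a tubelet spanning 2 frames, duplicating to just num_frames=2
# already yields the same token count as a 2D patch tokenizer.
```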
Application of image models to videos. To evaluate image models such as DINOv2 and OpenCLIP on video tasks, we simply process each frame independently with the image encoder to produce a frame-level feature map. The feature maps for each frame are then concatenated and fed to the attentive probe, just as we do with the clip-level feature maps when evaluating video models.
We use a Vision Transformer (ViT) (Dosovitskiy et al., 2020; Arnab et al., 2021) as our video backbone. To process a video with a transformer network, we split the video clip into a 3D grid of L spatio-temporal patches, where a patch consists of a 16 × 16 pixel block spanning 2 consecutive frames; we refer to these spatio-temporal patches as tokens. This sequence of tokens is then directly processed by the stack of transformer blocks. Since the inputs x and y correspond to masked regions of a video, we apply the video masks by simply dropping a subset of the tokens. We apply masking at the input of the x-encoder, and at the output of the y-encoder to construct contextualized targets (Baevski et al., 2022b). The encoder is parameterized using standard ViT networks, while the predictor is a narrow transformer implemented using 12 blocks with an embedding dimension of 384. Taking inspiration from masked autoencoders (He et al., 2021), our predictor takes as input the sequence of embeddings produced by the x-encoder as well as a sequence of learnable mask tokens with positional embeddings indicating the spatio-temporal positions of the y tokens. The output of the predictor is an embedding vector for each mask token; see Figure 3 and refer to Appendix B for more details.
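The resulting token count follows directly from the patch and tubelet sizes; a small helper to compute it (defaults taken from the pretraining configuration in Table 8):

```python
def num_tokens(frames=16, height=224, width=224, patch=16, tubelet=2):
    """Number of spatio-temporal tokens for a ViT whose tokenizer uses
    patch x patch pixel blocks spanning `tubelet` consecutive frames."""
    return (frames // tubelet) * (height // patch) * (width // patch)

# A 16-frame 224x224 clip yields 8 x 14 x 14 = 1568 tokens.
assert num_tokens() == 1568
```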
Frozen detection
We evaluate our model on AVA (Gu et al., 2018), a dataset for spatio-temporal localization of human actions containing 211k training and 57k validation video segments. We follow the experimental protocol of Feichtenhofer et al. (2021) and use precomputed masks from a pretrained Faster-RCNN adapted to videos, which uses a ResNeXt-101-FPN backbone and is pretrained on ImageNet and COCO. We train a linear classifier on top of the frozen V-JEPA features to classify the extracted regions of interest, and report mean Average Precision (mAP) on the 60 most common classes. Hyper-parameters are provided in Table 10. Our frozen features are obtained by concatenating the last layer of the transformer encoder with three intermediate layers. We use a batch size of 64 and train for 30 epochs with AdamW, using a learning rate of 0.0001 with 2 epochs of warmup and a weight decay of 0.05.
Finetuning
Following Tong et al. (2022), we finetune a linear layer on top of our model, using a layer-wise learning-rate decay scheme and mixup as the data augmentation pipeline. We provide all hyper-parameters for both K400 and SSv2 in Table 11.
Extra Results
Frozen Evaluation.

Figure 1 V-JEPA models pretrained on video learn versatile visual representations. They perform well on motion-based tasks (Something-Something-v2) and appearance-based tasks (Kinetics-400) without adaptation of the model's parameters, i.e., using the same frozen backbone for both tasks.
Linear vs. Attentive probe
Next we explore the feature pooling strategy for applying the model's representations in downstream tasks. Since the prediction objective in equation (1) is unnormalized, there is no a priori reason for the encoder to yield a linearly separable subspace (Chen et al., 2020). Thus, rather than using a linear operation (averaging) to pool the features output by the frozen backbone, we explore a learnable non-linear pooling strategy. Specifically, when evaluating the frozen pretrained backbone on downstream tasks, we learn a cross-attention layer with a learnable query token. The output of the cross-attention layer is then added back to the query token (residual connection), and then fed into a two-layer MLP with a single GeLU activation, followed by a LayerNorm, and finally a linear classifier.

Table 4 Ablating Prediction Task. Models are ViT-L/16 networks pretrained on K710 and SSv2 and evaluated with an attentive probe using a single center view. The region x is sampled by masking spatio-temporal regions in the video; y is the mask complement. 1) random-tube[r]: x is obtained by masking a fraction r of tubes (spatial patches extended across the entire temporal duration) from the video, 2) causal multi-block[p]: x is restricted to the first p frames of the 16-frame video, which are then masked with a random set of spatio-temporal blocks, 3) multi-block: x is obtained by masking a random set of spatio-temporal blocks from the entire video. The best performance is obtained with multi-block masking.
In Table 3 we see that using adaptive pooling with a learnable cross-attention layer leads to a significant improvement of +17 points on K400 and +16.1 points on SSv2. Using an attentive probe is also beneficial for other baseline models, as reported in Appendix E.
One Clip vs Multiple clips.
Sample Efficiency of pretraining
We compare the sample efficiency of pretraining various state-of-the-art image and video models. Specifically, we look at the number of samples (images or video clips) processed by the network during pretraining, which is larger than the size of the pretraining dataset for multi-epoch training. Our results with V-JEPA are obtained while processing an order of magnitude fewer samples than previous methods, and two orders of magnitude fewer samples than OpenCLIP. We believe that further investment towards improving the video pretraining data distribution could lead to substantial gains in downstream image and video tasks.
Masking Strategy
Code: https://github.com/facebookresearch/jepa
Humans possess the remarkable ability to map low-level signals originating from the retina into a semantic spatio-temporal understanding of the world; synthesizing notions such as objects and global motion (Spelke et al., 1995). A long-standing goal of the machine learning community is to identify the principles or objectives that may guide such unsupervised learning in humans (Field, 1994; Berkes and Wiskott, 2005; Hinton, 1989). One related hypothesis is based on the predictive feature principle (Rao and Ballard, 1999), which posits that representations of temporally adjacent sensory stimuli should be predictive of each other.
In this work, we revisit feature prediction as a stand-alone objective for unsupervised learning of visual representations from video. Numerous advances in the field — such as the standard use of transformer architectures in vision (Dosovitskiy et al., 2020), the maturing of masked autoencoding frameworks (Xie et al., 2021; Bao et al., 2021; He et al., 2021), query-based feature pooling (Chen et al., 2022), joint-embedding predictive architectures (JEPA) (LeCun, 2022; Assran et al., 2023; Baevski et al., 2022b), and larger datasets — form a unique arsenal of tools, which we integrate in a modern and conceptually simple method, the video joint-embedding predictive architecture or V-JEPA, which is based solely on feature prediction, without using pretrained image encoders, text, negative examples, human annotations, or pixel-level reconstruction.
We seek to answer the simple question:
How effective is feature prediction as a stand-alone objective for unsupervised learning from video with modern tools?
To that end, we pretrain a family of V-JEPA models on a dataset of 2 million videos collected from publicly available datasets by combining a masked modeling prediction task with a joint-embedding predictive architecture (see Figure 2). We measure performance on several downstream image and video tasks, using both frozen evaluation and end-to-end fine-tuning. Our findings suggest that feature prediction can indeed serve as an effective stand-alone objective for unsupervised learning from video, while using significantly shorter training schedules than pixel prediction methods. Specifically:
Feature prediction leads to versatile visual representations that perform well across downstream image and video tasks without adaptation of the model's weights; i.e., using a frozen backbone. V-JEPA achieves the best performance among methods we consider (+6% accuracy) on the Something-Something-v2 task, which requires fine-grained temporal understanding. V-JEPA is also competitive on tasks like Kinetics-400, where appearance-based features are sufficient and hence state-of-the-art image models such as DINOv2 excel (Figure 1 and Table 6).
Models trained with feature prediction are superior to pixel prediction approaches under a frozen evaluation protocol (attentive probing) and are competitive with pixel prediction under full fine-tuning, while using significantly shorter training schedules (Tables 5 and 6).
Models trained with feature prediction are more label-efficient than pixel prediction approaches. Decreasing the available number of labeled examples results in an increase in the performance gap between V-JEPA and pixel-reconstruction models (Table 7).
One way to encourage temporally adjacent representations to be predictive of each other is to ensure that they vary slowly over time. Early works targeting predictive features encouraged representations of individual video frames to be locally temporally invariant, while preventing representation collapse by using spectral methods, as in SFA (Wiskott and Sejnowski, 2002), SSA (Kayser et al., 2001), and Simulated Fixations (Zou et al., 2012). More recently, Goroshin et al. (2015); Wang et al. (2010) train a siamese convolutional network to map the representations of two subsequent frames to the same point, while encouraging distant frames to have diverse representations via a pair-wise margin loss and a triplet loss, respectively. Other works (Oord et al., 2018; Surís et al., 2021; Feichtenhofer et al., 2021) implement temporal invariance using noise-contrastive estimation (Gutmann and Hyvärinen, 2012). Our exploration in this paper goes beyond temporal invariance and explores feature prediction using masked modeling.
Going beyond local invariance, a family of works trains a predictor network to map the representation of a frame or clip at one time-step to a distinct representation at another time-step. Srivastava et al. (2015); Vondrick et al. (2016); Wang et al. (2023b) train such a video feature predictor network on top of a frozen pretrained image or video encoder. Unfreezing the target feature extractor, several methods train the video encoder and the predictor network simultaneously, while preventing collapse by using a supervised action forecasting loss (Girdhar and Grauman, 2021), or by using the representations of distant clips as negative samples in a contrastive loss (Han et al., 2019, 2020; Tan et al., 2023), often focusing on small convolutional encoders (Han et al., 2019, 2020). The idea of learning a representation by predicting missing information in feature space is also core to the joint-embedding predictive architecture (JEPA) (LeCun, 2022), which combines a siamese encoder with a predictor network. JEPAs have been successfully instantiated in several modalities, such as with audio data (Baevski et al., 2022b) and image data (Zhou et al., 2021; Oquab et al., 2023; Assran et al., 2023). In this work, we extend this paradigm to video data by leveraging recent advances in self-supervised learning.
The use of vision transformers (Dosovitskiy et al., 2020; Li et al., 2022) has become standard practice in self-supervised learning with joint-embedding architectures (Chen et al., 2021; Caron et al., 2021; Oquab et al., 2023; Zhou et al., 2021; Assran et al., 2022), and unlocked masked image modeling in pixel space by parameterizing the pixel decoder as a transformer with learnable mask tokens (Dosovitskiy et al., 2020; Xie et al., 2021; He et al., 2021; Bao et al., 2021), demonstrating a step-change in the representation quality of autoencoding methods (Vincent et al., 2010). This line of generative methods was subsequently extended to video data using spatio-temporal masking (Tong et al., 2022; Feichtenhofer et al., 2022; Wang et al., 2023a; Kalluri et al., 2023; Gupta et al., 2023). It was also recently shown that the representations of masked image autoencoders could be significantly improved by using learnable pooling mechanisms based on cross-attention (Chen et al., 2022). Finally, through careful selection of design choices, the non-contrastive collapse prevention strategy in BYOL (Grill et al., 2020) was recently made to work with image feature prediction methods (Baevski et al., 2022b; Assran et al., 2023), which demonstrated the ability to learn representations that can be leveraged for various downstream tasks without relying on invariance to hand-crafted image transformations.
Approaches that predict in pixel space must dedicate significant model capacity and compute to capture all the low-level detail in the visual input. By contrast, approaches that predict in latent space have the flexibility to eliminate irrelevant or unpredictable pixel-level details from the target representation (Vondrick et al., 2016). Predicting in representation space has been shown to lead to versatile representations that perform well across many downstream tasks through linear probing or low-shot adaptation (Assran et al., 2023; Oquab et al., 2023; Assran et al., 2022), while demonstrating an efficiency gain during pretraining compared to pixel level reconstruction (Assran et al., 2023; Baevski et al., 2022b, a). The works of Baevski et al. (2022a, b) additionally show that predicting in representation space results in competitive end-to-end fine-tuning performance in the image, audio and text domains. In this work, we extend these findings to the video modality.
Our goal is to explore the effectiveness of feature prediction as a stand-alone objective for learning visual representations from video. To that end, we use a joint-embedding predictive architecture (JEPA) (LeCun, 2022); see Figure 2. The main idea behind a JEPA is to learn by predicting the representation of an input $y$ from the representation of another input $x$. The basic architecture is made up of an encoder, $E_\theta(\cdot)$, which computes the representation of the inputs, and a predictor, $P_\phi(\cdot)$, which predicts the representation of $y$ from the representation of $x$, conditioned on a variable $z$ indicating the transformation (or corruption) between $x$ and $y$. Conditioning on $z$ enables the generation of distinct predictions for various transformations of $x$.
We train our visual encoder $E_\theta(\cdot)$ to satisfy the constraint that representations computed from one part of the video, $y$, should be predictable from representations computed from another part of the video, $x$. The predictor network $P_\phi(\cdot)$, which maps the representation of $x$ to the representation of $y$, is trained simultaneously with the encoder, and is provided specification of the spatio-temporal positions of $y$ through the conditioning variable $z \leftarrow \Delta_y$.
Naively implementing the objective using the regression
$$
\min_{\theta, \phi} \left\lVert P_\phi\left(E_\theta(x), \Delta_y\right) - E_\theta(y) \right\rVert_1
$$
would admit a trivial solution, where the encoder outputs a constant representation, regardless of its input. In practice, we use the following modified objective to prevent representation collapse,
$$
\min_{\theta, \phi} \left\lVert P_\phi\left(E_\theta(x), \Delta_y\right) - \operatorname{sg}\left(\overline{E}_\theta(y)\right) \right\rVert_1, \qquad (1)
$$
where $\operatorname{sg}(\cdot)$ denotes a stop-gradient operation, which does not backpropagate through its argument, and $\overline{E}_\theta(\cdot)$ is an exponential moving average of the network $E_\theta(\cdot)$. The use of an exponential-moving-average feature extractor along with a stop-gradient and a predictor has been used as a collapse prevention strategy for image pretraining (Grill et al., 2020), and studied empirically (Xie et al., 2021) and theoretically (Tian et al., 2021). In fact, the objective in equation (1) is similar to the loss of Assran et al. (2023) used for image pretraining, but we modify it to use an $\ell_1$ regression, which we found to be more stable.
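A minimal numpy sketch of the two ingredients of this objective, the EMA update of the target encoder and the $\ell_1$ regression against stop-gradient targets. Parameters are flattened into vectors purely for illustration; the momentum value matches the start_momentum in Table 8:

```python
import numpy as np

def ema_update(theta_bar, theta, momentum=0.998):
    """Exponential-moving-average update of the target encoder's
    parameters. Gradients never flow into theta_bar (stop-gradient):
    it is only ever updated through this rule."""
    return momentum * theta_bar + (1.0 - momentum) * theta

def l1_jepa_loss(pred, target_features):
    """L1 regression between predictor outputs and (detached) targets."""
    return np.abs(pred - target_features).mean()

theta = np.ones(4)               # online encoder parameters (illustrative)
theta_bar = ema_update(np.zeros(4), theta)
assert np.allclose(theta_bar, 0.002)
assert l1_jepa_loss(np.array([1.0, 2.0]), np.zeros(2)) == 1.5
```

In the actual schedule, the momentum is annealed from 0.998 toward 1.0 over pretraining (Table 8), so the target encoder moves ever more slowly.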
A theoretical motivation for the effectiveness of this collapse prevention strategy was proposed in Grill et al. (2020) for the BYOL method. We provide a simple adaptation of their analysis for our ℓ1 loss. For ease of exposition, we disregard the effect of the conditioning variable z and consider one-dimensional representations. Denote the representation Ē_θ(y) by a random variable Y. The optimal predictor under equation (1) is thus given by the following functional expression,

    P*(E_θ(x)) = median(Y | E_θ(x)).

Substituting this expression for the optimal predictor into the loss function and evaluating the expected gradient of the encoder gives

    ∇_θ E‖P*(E_θ(x)) − Y‖_1 = ∇_θ MAD(Y | E_θ(x)),

where MAD(· | E_θ(x)) is the median absolute deviation of a random variable conditioned on E_θ(x). Thus, in the case where the predictor is optimal, the encoder must learn to capture as much information about the video as possible to minimize the deviation of the target. The hypothesis is that incorporating an exponential moving average to compute the representation of y ensures that the predictor evolves faster than the encoder and remains close to optimal, thereby preventing collapse.
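The role of the conditional median (and hence the MAD) can be checked numerically: under an ℓ1 loss, the best constant prediction for samples of Y is their median, not their mean. A small illustrative sketch:

```python
import statistics

# Samples of the target representation Y (toy 1-D example).
samples = [0.1, 0.4, 0.4, 0.7, 2.5]

def mean_abs_error(pred, ys):
    """Average L1 distance between a constant prediction and the targets."""
    return sum(abs(pred - y) for y in ys) / len(ys)

# Scan candidate constant predictions on a fine grid.
candidates = [i / 1000 for i in range(0, 3001)]
best = min(candidates, key=lambda p: mean_abs_error(p, samples))

# The minimizer coincides with the median of the samples.
print(best, statistics.median(samples))
```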
The feature prediction task is based on a masked modeling formulation (He et al., 2021; Tong et al., 2022); i.e., regions x and y from the video are sampled using masking. To sample y from a video, we sample several (possibly overlapping) spatially continuous blocks with various aspect ratios, and repeat the spatial blocks across the entire temporal dimension of the video; x is taken to be the complement. Masking a large continuous block that covers the full temporal dimension limits information leakage due to the spatial and temporal redundancy of videos, and results in a harder prediction task (Tong et al., 2022).
We leverage two types of masks: short-range masks, where we take the union of 8 randomly sampled target blocks covering 15% of each frame, and long-range masks, where we take the union of 2 randomly sampled target blocks covering 70% of each frame. In both cases, the aspect ratio for all sampled blocks is randomly chosen in the range (0.75, 1.5). Given that both short-range and long-range masks are produced by sampling many blocks and taking their union, the result is an average masking ratio of ~90%. We refer to our masking strategy as multi-block, and compare it to other possible masking strategies in Section 4.
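A minimal sketch of how such a multi-block mask could be sampled on the 14×14 grid of spatial patches (this block sampler is our own simplification for illustration; the exact implementation details may differ):

```python
import random

GRID = 14  # 14x14 spatial patches per frame at 224 resolution with 16x16 patches

def sample_block(scale, min_ar=0.75, max_ar=1.5):
    """Sample one rectangular block covering roughly `scale` of the frame area."""
    area = scale * GRID * GRID
    ar = random.uniform(min_ar, max_ar)
    h = max(1, min(GRID, round((area * ar) ** 0.5)))
    w = max(1, min(GRID, round((area / ar) ** 0.5)))
    top = random.randint(0, GRID - h)
    left = random.randint(0, GRID - w)
    return {(top + i, left + j) for i in range(h) for j in range(w)}

def multi_block_mask(num_blocks, scale):
    """Union of several (possibly overlapping) blocks; the same spatial mask
    is repeated across every frame of the clip."""
    mask = set()
    for _ in range(num_blocks):
        mask |= sample_block(scale)
    return mask

random.seed(0)
short_range = multi_block_mask(num_blocks=8, scale=0.15)  # short-range mask
long_range = multi_block_mask(num_blocks=2, scale=0.70)   # long-range mask
print(len(short_range) / GRID**2, len(long_range) / GRID**2)
```

Because the blocks may overlap, the union covers less than the sum of the individual block areas, which is why the two mask types average out to a ~90% masking ratio.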
We use a Vision Transformer (ViT) (Dosovitskiy et al., 2020; Arnab et al., 2021) as our video backbone. To process a video with a transformer network, we split the video clip into a 3D grid of L spatio-temporal patches, where a patch consists of a 16×16 pixel block spanning 2 consecutive frames; we refer to these spatio-temporal patches as tokens. This sequence of tokens is then directly processed by the stack of transformer blocks. Since inputs x and y correspond to masked regions of a video, we apply the video masks by simply dropping a subset of the tokens. We apply masking at the input of the x-encoder, and at the output of the y-encoder to construct contextualized targets (Baevski et al., 2022b). The encoder is parameterized using standard ViT networks, while the predictor is a narrow transformer implemented using 12 blocks with an embedding dimension of 384. Taking inspiration from masked autoencoders (He et al., 2021), our predictor takes as input the sequence of embeddings produced by the x-encoder as well as a sequence of learnable mask tokens with positional embeddings indicating the spatio-temporal positions of the y tokens. The output of the predictor is an embedding vector for each mask token; see Figure 3 and refer to Appendix 9 for more details.
We combine several public datasets to construct an unsupervised video pretraining dataset, which we refer to as VideoMix2M. Specifically, we combine the videos from HowTo100M (HT) (Miech et al., 2019), Kinetics-400/600/700 (K710) (Kay et al., 2017), and Something-Something-v2 (SSv2) (Goyal et al., 2017), and remove any overlap with the validation sets of Kinetics-400/600/700 and Something-Something-v2, resulting in approximately 2 million videos. We train a ViT-L/16, a ViT-H/16, and a ViT-H/16_384 transformer model on VideoMix2M. We use a batch size of 3072 for the ViT-L/16 and ViT-H/16 models, and a batch size of 2400 for the ViT-H/16_384 model. Each model takes as input a video clip of 16 frames sampled with a frame-skip of 4, corresponding to roughly 3-second clips on average. The ViT-L/16 and ViT-H/16 process the video at a spatial resolution of 224, while the ViT-H/16_384 uses an input resolution of 384; cf. Appendix 10.
Pretrained models are evaluated on downstream video and image tasks. On video tasks, we use a subset of the VideoGLUE benchmark (Yuan et al., 2023) to test for various capabilities; specifically, we investigate action recognition on Kinetics-400 (K400) (Kay et al., 2017), motion classification on Something-Something-v2 (SSv2) (Goyal et al., 2017), and action localization on AVA (Gu et al., 2018). Action classification on Kinetics evaluates the appearance-based understanding of the model, as many action classes in the dataset can be inferred from the presence of specific objects in the video (Sevilla-Lara et al., 2021). Motion classification on Something-Something-v2 evaluates the temporal understanding of the model, as action classes in the dataset are decoupled from the appearance/presence of specific objects in the video (Goyal et al., 2017). Finally, action localization on AVA evaluates the ability of the model to understand and localize motions in the video. We follow standard practice and report accuracy on K400 and SSv2 by sampling several spatial and temporal views. For static image tasks, we explore object recognition on ImageNet (Russakovsky et al., 2015), scene classification on Places205 (Zhou et al., 2014), and fine-grained recognition on iNaturalist 2021 (Van Horn et al., 2018).
In this section we isolate the contributions of several design choices, including: a) the use of a feature prediction versus pixel prediction objective, b) the construction of the pretraining data distribution, c) the feature pooling strategy for leveraging the model’s representations in downstream tasks, and d) the masking strategy, towards identifying: what to predict from what?
We first ablate the effect of computing the prediction loss in representation space. We train a pair of ViT-L/16 models using either a V-JEPA feature prediction loss, or a mean-squared error loss with the normalized pixel values, as in masked autoencoders (He et al., 2021), and perform a sweep over the learning rate and weight decay schedules for both approaches. All models are pretrained on VideoMix2M for 90K iterations with a batch size of 3072 using multi-block masking. We examine performance on Kinetics-400 (K400), Something-Something-v2 (SSv2), and ImageNet-1K (IN1K), using a frozen backbone with an attentive probe, and report top-1 accuracy using a single center view. We also examine end-to-end fine-tuning performance of the models on Kinetics-400.
Results of this comparison are reported in Table 2 and indicate that predicting in feature space provides a consistent performance improvement over pixel space prediction in both frozen evaluation of the video backbone, as well as end-to-end fine-tuning.
Next we study the impact of the pretraining data distribution in Table 2. Leveraging large scale datasets has been critical for enabling the surge of advancements in other modalities, such as text and images (Kaplan et al., 2020; Cherti et al., 2023). We investigate whether a similar trend holds for video data. To control for the possible confounding variable of compute budget, we pretrain all models in Table 2 for 90K iterations using a batch-size of 3072. We report downstream results on K400, SSv2, and IN1K using a frozen backbone with an attentive probe, and report top-1 accuracy using a single center view.
Table 2 shows that average performance across tasks monotonically increases as we increase the size of the pretraining dataset, but the best task-specific performance is obtained by independently selecting the pretraining data for each specific downstream task. For instance, the L/16 obtains its best SSv2 performance when pretrained on K710+SSv2, its best K400 performance when pretrained only on K710, and its best IN1K performance when pretrained only on K710+HT. The best average performance across all tasks is achieved by pretraining on VideoMix2M, which combines all the data sources. Similarly, the H/16 pretrained on K710+SSv2 achieves a greater K400 score than the H/16 pretrained on VideoMix2M; however, the top-performing H/16 on average is pretrained on VideoMix2M.
Next we explore the feature pooling strategy for applying the model's representations in downstream tasks. Since the prediction objective in equation (1) is unnormalized, there is no a priori reason for the encoder to yield a linearly separable subspace (Chen et al., 2020). Thus, rather than using a linear operation (averaging) to pool the features output by the frozen backbone, we explore a learnable non-linear pooling strategy. Specifically, when evaluating the frozen pretrained backbone on downstream tasks, we learn a cross-attention layer with a learnable query token. The output of the cross-attention layer is then added back to the query token (residual connection), and fed into a two-layer MLP with a single GeLU activation, followed by a LayerNorm, and finally a linear classifier.
In Table 3 we see that using adaptive pooling with a learnable cross-attention layer leads to a significant improvement of +17 points on K400 and +16.1 points on SSv2. Using an attentive probe is also beneficial for other baseline models, as reported in Appendix 12.
We conduct an ablation on the masking strategy used in V-JEPA pretraining. We examine the following masking strategies: random-tube[r], in which x is obtained by removing a random fraction r of tubes (spatial patches extended across the entire temporal duration) from the video; causal multi-block[p], in which x is restricted to the first p frames of the 16-frame video, which are then masked with a random set of spatio-temporal blocks; and multi-block, in which x is obtained by masking a random set of spatio-temporal blocks from the entire video. Spatio-temporal blocks are sampled using the parameters described in Section 3.2; an ablation on the size and quantity of masked spatio-temporal blocks is provided in Appendix 12.4.
Table 4 indicates that the best results are obtained by sampling x using a multi-block strategy, wherein the network is forced to make predictions after removing large continuous blocks in the video. When x is only sampled from the first few frames of the video, as in the causal multi-block strategy, we observe a decrease in downstream performance. Finally, the random-tube strategy, wherein 90% of the tubes in the video are randomly masked, leads to features of low semantic quality when combined with our feature prediction objective.
In Section 5.1, we investigate the impact of feature prediction by comparing V-JEPA with video approaches that rely on pixel prediction, while using a similar architecture for all baselines. Subsequently, in Section 5.2, we remove the architectural constraint and report the best performance across architectures for self-supervised video and image pretraining approaches. Finally, we explore the label-efficiency of V-JEPA relative to other self-supervised video pretraining approaches in Section 5.3. We further detail the evaluation setup in Appendix 11.
To investigate the effectiveness of feature prediction pretraining, we first compare V-JEPA to video masked modeling models relying on a pixel prediction loss. We control for the possible confounding factor of model architecture by evaluating all models using either a ViT-L/16 encoder, or a Hiera-L encoder, which has a similar number of parameters. For the pixel prediction baselines we consider VideoMAE (Tong et al., 2022; Wang et al., 2023a), which trains vision transformer autoencoders exclusively on video, Hiera (Ryali et al., 2023), which trains a hierarchical transformer autoencoder on video, and OmniMAE (Girdhar et al., 2023), which trains a vision transformer autoencoder on static images and video simultaneously.
Table 6 examines both frozen evaluation with an attentive probe on downstream video and image tasks, as well as end-to-end fine-tuning. In frozen evaluation, V-JEPA outperforms the baselines on all downstream tasks, except ImageNet, where we achieve 74.8% compared to 75.1% of an OmniMAE model trained directly on ImageNet; hence, V-JEPA achieves comparable ImageNet performance despite only pretraining on video.
Under the fine-tuning protocol, V-JEPA also achieves the best performance of any model trained with a ViT-L/16, and matches the performance of the Hiera-L on SSv2, which benefits from a hierarchical prior (Ryali et al., 2023). The V-JEPA models achieve this result while processing significantly fewer samples during pretraining (Figure 4), demonstrating the efficiency of feature prediction as a learning principle.
Next, in Table 6, we inspect how the V-JEPA models pretrained on video stack up next to the largest state-of-the-art self-supervised image and video models when freezing the backbone encoder and training an attentive probe on top. Our image pretrained baselines include OpenCLIP (Cherti et al., 2023), DINOv2 (Oquab et al., 2023), and I-JEPA (Assran et al., 2023). The OpenCLIP model is trained with a contrastive image-text alignment objective, while DINOv2 and I-JEPA are trained with self-supervision. These models are known to excel in their frozen-evaluation performance (Oquab et al., 2023); i.e., their ability to produce visual features that can be applied to many downstream tasks simultaneously, without end-to-end fine-tuning, and thus provide highly competitive baselines. Our video pretrained baselines include VideoMAE (Tong et al., 2022), OmniMAE (Girdhar et al., 2023), Hiera (Ryali et al., 2023), VideoMAEv2 (Wang et al., 2023a), and MVD (Wang et al., 2023b). The OpenCLIP, DINOv2, and VideoMAEv2 models are parameterized as Giant/Gigantic vision transformer architectures containing over 1B parameters trained on large-scale image or video datasets.
Compared to large-scale video baselines, the V-JEPA models outperform all previous models on every downstream video and image task by a notable margin (see Table 6). Our H/16 model outperforms the largest publicly available VideoMAE, VideoMAEv2, OmniMAE, MVD, and Hiera models by at least +5 points in motion understanding (Something-Something-v2), +2 points in action recognition (Kinetics-400), +5 points in action detection (AVA), +1 point in object recognition (ImageNet-1K), +2 points in scene recognition (Places205), and +0.2 points in fine-grained recognition (iNaturalist). Moreover, when comparing pretraining wallclock time in Figure 5, we see that V-JEPA achieves this performance with a roughly 2× speedup compared to the large pixel prediction models.
On tasks that require a fine-grained understanding of motion (Something-Something-v2), the V-JEPA models provide a major improvement (over +21 points) compared to large-scale image baselines such as DINOv2, OpenCLIP, and I-JEPA. Self-supervised pretraining from videos makes it possible to model dynamic concepts that are not easily learned from static image datasets. Similarly, we observe that the V-JEPA models outperform image-based pretraining on action localization.
On Kinetics-400, we find image models to perform well; e.g., while DINOv2 (Oquab et al., 2023) previously reported 78.4% on K400 with a linear probe, we improve the frozen evaluation of the g/14 model to 83.4% by using an attentive probe. In this case, our H/16 model achieves 82.0% top-1 accuracy. It is worth noting that the label for many Kinetics videos can be inferred using appearance-based cues, without requiring an understanding of motion (Sevilla-Lara et al., 2021).
The V-JEPA models narrow the gap with image models on image classification tasks. In particular, V-JEPA achieves a score of 77.4% on ImageNet using a one-layer attentive probe, which can be further improved to 77.9% using a two-layer attentive probe. More generally, we hypothesize that the datasets used to train V-JEPA and other video models are too constrained and lack the visual diversity of the internet-scale pretraining data used by the image models; as such, there is value in focusing future work on building diverse publicly available video datasets.
We examine the label-efficiency of V-JEPA compared to other self-supervised video models by measuring the ability of the pretrained backbones to adapt to downstream tasks with few labels. Specifically, we investigate the performance of the frozen models on Kinetics-400 and Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive probe. We train the probes in several low-shot settings: using either 5% of the train set, 10%, or 50%, and take 3 random splits in each setting to obtain more robust metrics, resulting in 9 different evaluation experiments for each model. Table 7 reports the mean performances and standard deviation using the K400 and SSv2 validation sets.
We find V-JEPA to be more label-efficient than other self-supervised video models: decreasing the available number of labeled examples for training the attentive probe results in an increase in the performance gap between V-JEPA and the other models. In particular, the performance of the largest V-JEPA model on K400 drops by 12% to 68.2% top-1 when we reduce the number of labeled examples by a factor of 10× (from roughly 287 examples per class to 29 examples per class). By contrast, VideoMAEv2 drops by 30% to 37.0% top-1, VideoMAE drops by 15.9% to 62.3% top-1, and MVD drops by 14.6% to 62.6% top-1.
Similar observations hold on SSv2. The performance of the largest V-JEPA model on SSv2 drops by 13.9% to 54.0% top-1 when we reduce the number of labeled examples by a factor of 10× (from roughly 440 examples per class to 48 examples per class). By contrast, VideoMAEv2 drops by 26% to 28.0% top-1, VideoMAE drops by 19.1% to 41.4% top-1, and MVD drops by 18.1% to 42.9% top-1.
Next, we seek to qualitatively inspect the V-JEPA models. Recall that the predictor network in V-JEPA predicts the representations of a masked spatio-temporal region y from a visible region x, given the positional information of the masked regions (see Section 3). To qualitatively investigate the grounding of the feature-space predictions, we freeze the pretrained encoder and predictor networks and train a conditional diffusion decoder to map the V-JEPA predictions to interpretable pixels. Notably, the decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video (see Figure 6(a)).
Given a masked video, we use the V-JEPA pretrained models to predict the representations of the missing regions, and then use the decoder to project the representations to pixel space. Figure 6(b) shows decoder outputs for various random seeds. Qualities that are common across samples represent information that is contained in the predictor representation.
Figure 6(b) shows that the V-JEPA feature predictions are indeed grounded, and exhibit spatio-temporal consistency with the unmasked regions of the video. Specifically, the samples in Figure 6(b) show that the V-JEPA predictor correctly captures positional uncertainty and produces a variety of visual objects at various locations with consistent motion. Some of the samples also demonstrate an understanding of object-permanence, as the visual objects remain consistent after partial occlusion.
In this work, we explored the effectiveness of feature prediction as a stand-alone objective for unsupervised learning from video and introduced V-JEPA, a collection of vision models trained solely using a self-supervised feature prediction objective. The V-JEPA models demonstrate the ability to solve various downstream image and video tasks without adaptation of the model parameters, and outperform previous video representation learning approaches in frozen evaluation on action recognition, spatio-temporal action detection, and image classification tasks. Additionally, we show that pretraining V-JEPA on videos is particularly effective for solving downstream tasks requiring fine-grained motion understanding, while large-scale image models trained on internet-scale datasets fall short on such tasks. Finally, we empirically observed that V-JEPA models are label-efficient learners, and exhibit good performance on downstream tasks even when only a few labeled examples are available.
We first review approaches for learning visual perception from static images before discussing strategies for learning from video.
One family of approaches for learning visual perception from static images trains a visual encoder to predict the representations of text captions often found accompanying images from the Web, as in CLIP (Radford et al., 2021) or CoCa (Yu et al., 2022). The largest open source CLIP model to date, numbering 2B parameters and trained on over 2B web-scraped images (Cherti et al., 2023), demonstrates impressive performance on a wide range of downstream image and video tasks. Notably, this is achieved using only the light-weight adaptation of task-specific heads, also referred to as frozen-evaluation, and does not require expensive end-to-end fine-tuning of the pretrained model.
Other approaches for learning from static images leverage unsupervised objectives. Initial works on self-supervised approaches are based on sparse coding or hand-crafted pretext tasks, such as colorization (Larsson et al., 2016, 2017), rotation prediction (Gidaris et al., 2020), and jigsaws (Noroozi and Favaro, 2016). More recent approaches leverage invariance-based objectives by training a visual encoder to be invariant to hand-crafted image transformations (Wu et al., 2018; Chen et al., 2020).
Another family of methods learn representations using denoising autoencoders (Vincent et al., 2008); image inpainting is one popular instantiation of this idea (Pathak et al., 2016). More recently, masked autoencoders (He et al., 2021) train an encoder-decoder transformer to predict missing pixels of a masked image. Follow-up work addresses the indeterminism of pixel reconstruction by exploring instantiations of masked image modeling in latent space (Baevski et al., 2022b; Assran et al., 2023; Baevski et al., 2022a). These approaches can be seen as applications of the predictive feature principle in the image modality.
There are also various methods that combine both masked image modeling and invariance criteria to learn visual representations from static images, such as iBOT (Zhou et al., 2021) and DINOv2 (Zhou et al., 2021; Oquab et al., 2023); the latter is currently the most competitive instantiation of self-supervised learning with static images, scaled to a model with over 1.1B parameters trained on a curated dataset of 142M images.
One family of approaches for learning visual perception from videos relies on weakly-supervised guidance from closed captioning, often computed from an ASR transcription of audio data accompanying internet videos. For instance, VideoBERT (Sun et al., 2019; Xu et al., 2021) trains a video encoder to predict masked spans in the textual closed captions. Similarly, VideoCLIP (Xu et al., 2021) trains a video encoder to predict the representation of video captions computed by a text encoder. Follow-up work such as MERLOT (Zellers et al., 2022), VATT (Akbari et al., 2021), and InternVideo (Wang et al., 2022) extended VideoCLIP by incorporating additional unsupervised objectives.
Similar to unsupervised learning from images, a family of unsupervised video representation learning approaches enforces a spatio-temporal representation of a video clip to be invariant to hand-crafted spatio-temporal data augmentations (Parthasarathy et al., 2022). However, the temporal ordering of visual information in video can itself provide implicit supervision; this is the key insight leveraged by many works on unsupervised video learning. Towards leveraging temporal information as supervision, some approaches train a visual encoder by predicting the temporal ordering of frames (Xu et al., 2019; Lee et al., 2017). Other approaches seek to predict low-level motion vectors computed from optical flow (Pintea et al., 2014), or to predict missing pixels in video frames, using either a frame-interpolation objective (Kalluri et al., 2023) or a denoising autoencoder (Tong et al., 2022; Feichtenhofer et al., 2022; Wang et al., 2023a).
In this section, we provide an in-depth description of our approach V-JEPA that is illustrated in Figure 3.
Unless stated otherwise, during pretraining we always randomly sample a clip of 16 frames from each input video, with a temporal stride of 4 between sampled frames. An input video clip therefore covers 64 frames in total, or roughly 2 seconds of a given video running at 30 frames per second. We then resize the video's spatial dimensions to 224×224, resulting in an overall shape of 16×224×224×3 for the entire clip. Since ViT networks process a 1D sequence of tokens, we must convert an input video clip into a 1D token sequence. To do so, we apply a 3D convolution comprising d filters of size 2×16×16 with a temporal stride of 2 and a spatial stride of 16, resulting in a tensor of shape 8×14×14×d. Next we add absolute 3D sin-cos positional embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape 1568×d. This process is illustrated in Figure 7.
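The shape arithmetic above can be verified directly; this sketch just reproduces the token-count computation for the stated clip and tokenizer sizes:

```python
def num_tokens(frames, height, width, t_stride=2, s_stride=16):
    """Number of tokens produced by the 3D-conv tokenizer (2x16x16 filters)."""
    t = frames // t_stride  # temporal positions
    h = height // s_stride  # vertical positions
    w = width // s_stride   # horizontal positions
    return t * h * w, (t, h, w)

total, grid = num_tokens(16, 224, 224)
print(total, grid)  # 1568 tokens from an 8x14x14 feature map
```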
We sample both a video clip and a video mask in each iteration. We denote a video clip represented as a 1D token sequence of length L = 1568 by x_L = (x_1, …, x_L). Similarly, given a mask of M < L patches, leaving N = L − M patches unmasked, we denote the indices of masked patches by (i_1, …, i_M) and its complement (the indices of unmasked patches) by (j_1, …, j_N).
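In code, this bookkeeping is just a split of the token indices into a masked set and its complement (a uniform-random mask here purely for illustration; V-JEPA's actual masks come from the multi-block sampler of Section 3):

```python
import random

random.seed(0)
L = 1568                   # tokens per clip
M = int(0.9 * L)           # ~90% masking ratio
masked = sorted(random.sample(range(L), k=M))            # indices (i_1, ..., i_M)
masked_set = set(masked)
unmasked = [j for j in range(L) if j not in masked_set]  # indices (j_1, ..., j_N)
print(len(masked), len(unmasked))
```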
Computing the x-representations. To compute the V-JEPA loss, we first produce the x-representations by masking the video clip and feeding it into the x-encoder; we denote the masked video by x_N = (x_{j_1}, …, x_{j_N}). Applying the x-encoder E_θ(·) to the masked clip gives a sequence of patch representations, denoted as z_N = E_θ(x_N) = (z_{j_1}, …, z_{j_N}).
Predicting the target. Next, the V-JEPA predictor network P_ϕ(·, ·) takes as input the tokens produced by the x-encoder and predicts the missing regions in the video clip, which are specified by a set of learnable mask tokens. Specifically, the mask tokens are parameterized as the sum of a shared learnable vector and an absolute 3D sin-cos positional embedding, denoted by m_M = (m_{i_1}, …, m_{i_M}). The output of the predictor is thus given by ŝ_M = P_ϕ(z_N, m_M) = (ŝ_{i_1}, …, ŝ_{i_M}), corresponding to a d-dimensional output for each of the M masked patches.
Computing the y-representations. Finally, to compute the prediction targets, the entire unmasked video clip is processed by the y-encoder to obtain a set of target representations, denoted by s_L = Ē_θ(x_L) = (s_1, …, s_L). The V-JEPA loss is now computed as

    Loss(θ, ϕ) = (1/M) Σ_{k ∈ {i_1, …, i_M}} ‖ŝ_k − s_k‖_1,
which is simply the average L1 distance between the output of the predictor and the y-encoder. We then compute a gradient update with respect to the parameters of the x-encoder, θ, and the predictor, ϕ, and subsequently update the parameters of the y-encoder as an exponential moving average of the context encoder weights (Polyak averaging).
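The exponential-moving-average update can be illustrated with a toy sketch (scalar "parameters"; the momentum value is the starting point of the schedule described in Appendix 10):

```python
def ema_update(target_params, online_params, momentum=0.998):
    """Polyak-average the y-encoder weights toward the x-encoder weights."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_params, online_params)]

target = [1.0, -2.0]  # y-encoder weights (EMA copy)
online = [1.5, -1.0]  # x-encoder weights after a gradient step
target = ema_update(target, online)
print(target)
```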
To increase the efficiency of V-JEPA, we use a multi-masking strategy (Caron et al., 2020; Baevski et al., 2022a), which enables us to amortize the cost of the target computation. As mentioned in Section 3, for a given video clip, we sample 2 different masks, short-range and long-range. While we need to forward propagate the x𝑥x-encoder and predictor separately for each mask, we only need to compute the y𝑦y-representation once.
In this section, we report V-JEPA pretraining details. Table 8 summarizes the main hyperparameters used during pretraining.
We use Vision Transformer (Dosovitskiy et al., 2020) (ViT) architectures for the x-encoder and y-encoder. We train three V-JEPA encoders: a ViT-L/16_224, a ViT-H/16_224, and a ViT-H/16_384. All three encoders take as input a short video clip of 16 frames with a temporal stride of 4 between consecutive frames. The subscripts, 224 and 384, indicate the spatial resolution of the video clip. V-JEPA flattens the video clip into a sequence of non-overlapping spatio-temporal patches of size 16×16×2 (see Figure 7). For all three models, the predictor is designed as a narrow ViT architecture, consisting of 12 transformer blocks with an embedding dimension of 384. For simplicity, we keep the number of self-attention heads in the predictor equal to that of the backbone used for the context-encoder/target-encoder. V-JEPA is pretrained without using a [cls] token.
We use AdamW (Loshchilov and Hutter, 2017) to optimize the x-encoder and predictor weights. The ViT-L/16_224 and ViT-H/16_224 models use a batch size of 3072, while the ViT-H/16_384 uses a batch size of 2400. Models are trained for a total of 90,000 iterations. The learning rate is linearly increased from 2×10⁻⁴ to 6.25×10⁻⁴ during the first 12,000 iterations of pretraining, and decayed to 10⁻⁶ following a cosine schedule. Weight-decay is also linearly increased from 0.04 to 0.4 throughout pretraining. The y-encoder weights are initialized identically to the x-encoder, and subsequently updated as an exponential moving average (EMA) (Tarvainen and Valpola, 2017) of the x-encoder weights, using a momentum value which starts at 0.998 and is linearly increased to 1.0 during training (Caron et al., 2021; Assran et al., 2022). We scale all hyper-parameter schedules 25% beyond the actual training schedule. Specifically, the learning rate schedule, weight-decay schedule, and EMA schedule are computed assuming a training length of 112,500 iterations, even though we only train our model for 90,000 iterations. We found the last 25% of the default scheduler period to update hyper-parameters too aggressively, and simply truncating the schedulers improved performance.
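The truncated-schedule trick can be sketched as follows (our own reimplementation of a warmup-plus-cosine learning-rate schedule using the stated hyperparameters; the actual training code may differ in details):

```python
import math

WARMUP, TRAIN, SCHED = 12_000, 90_000, 112_500  # schedule planned for 112.5k steps
LR_START, LR_PEAK, LR_FINAL = 2e-4, 6.25e-4, 1e-6

def lr_at(step):
    """Linear warmup, then cosine decay planned over SCHED steps but
    truncated at TRAIN steps: the last 25% of the cosine is never run."""
    if step < WARMUP:
        return LR_START + (LR_PEAK - LR_START) * step / WARMUP
    progress = (step - WARMUP) / (SCHED - WARMUP)
    return LR_FINAL + 0.5 * (LR_PEAK - LR_FINAL) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(WARMUP), lr_at(TRAIN - 1))
```

Because training stops at 90,000 iterations, the learning rate never actually reaches the final value of 10⁻⁶; the aggressive tail of the cosine is simply cut off.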
Given an input video, x_L, the V-JEPA target encoder Ē_θ(·) outputs a sequence of L tokens, Ē_θ(x_L) = (s_1, …, s_L), where s_i ∈ R^d. To pool this sequence of tokens into a single feature vector, we apply a lightweight non-linear cross-attention block, which replaces the self-attention operation of a transformer block with cross-attention. Specifically, the cross-attention performs the following computation:

    Σ_i [exp(q^⊤ W_k s_i) / Σ_j exp(q^⊤ W_k s_j)] W_v s_i,
where $\mathbf{W}_k, \mathbf{W}_v \in \mathbb{R}^{d\times d}$ are the key and value matrices, and $q \in \mathbb{R}^d$ is a learnable query token. The output of the cross-attention is then added back to the query token (residual connection), fed into a two-layer MLP with a single GeLU activation, followed by a LayerNorm, and finally a linear classifier. The parameters of the cross-attention block are learned jointly with those of the linear classifier for the downstream task, while the encoder parameters are kept frozen. Note that, in practice, we use an attentive probe with 12 heads, each of dimension 12. In Appendix 12 we show that baselines also benefit from the attentive probing protocol.
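A minimal NumPy sketch of this attentive pooling (single head with illustrative shapes; the actual probe uses 12 heads, and `attentive_pool` is a hypothetical helper):

```python
import numpy as np

def attentive_pool(s, q, Wk, Wv):
    """Pool L token features s (L, d) into one vector: attention weights
    softmax_i(q^T Wk s_i), values Wv s_i, plus a residual to the query q."""
    logits = s @ Wk.T @ q             # (L,) one score per token
    w = np.exp(logits - logits.max())
    w /= w.sum()                      # softmax over the L tokens
    return q + w @ (s @ Wv.T)         # pooled feature in R^d
```

The pooled vector would then pass through the two-layer MLP, LayerNorm, and linear classifier described above.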
For all tasks, we use the AdamW optimizer with a cosine scheduler (no warmup) that decays the learning rate from 0.001 to 0. We use a fixed weight-decay of 0.01 and apply simple data augmentations (random resized crops and horizontal flips) during training of the attentive probe, except on image tasks, where we apply AutoAugment (Dogus Cubuk et al., 2019). Table 9 reports the hyper-parameters for each downstream evaluation.
Unless stated otherwise, our attentive probe takes 8 clips of 16 frames as input on Kinetics, and 2 clips of 16 frames on Something-Something-v2, to increase the temporal coverage of the video. Specifically, we first divide a video into 8 (or 2) equal-length temporal segments, and sample 1 clip at random per segment. The video encoder $\overline{E}_\theta$ processes each clip separately and produces a clip-level feature map. The feature maps for each clip are then concatenated together and fed to the attentive probe. At test time, we average the predictions over 3 spatial views, following standard practice in video classification.
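The segment-based clip sampling can be sketched as follows (a hypothetical helper in terms of raw frame indices, ignoring the encoder's temporal stride):

```python
import random

def sample_clip_starts(num_video_frames, num_clips=8, frames_per_clip=16):
    """Divide the video into num_clips equal-length temporal segments and
    draw one clip start uniformly at random inside each segment."""
    seg_len = num_video_frames // num_clips
    starts = []
    for i in range(num_clips):
        lo = i * seg_len
        hi = max(lo, (i + 1) * seg_len - frames_per_clip)  # keep clip inside its segment
        starts.append(random.randint(lo, hi))
    return starts
```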
To evaluate the video models on image tasks, we simply duplicate input images to generate still video clips of 16 frames. We perform this duplication simply for convenience when evaluating the video models; however, we find this step to be unnecessary in general. Given a video tokenizer implemented as a 3D-conv with a temporal stride of 2, it is sufficient to duplicate the image into a 2-frame video clip. This results in the same number of input tokens as that produced by a static image model with a 2D-conv tokenizer.
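The image-to-clip duplication is a one-liner; a sketch (`image_to_clip` is a hypothetical helper name):

```python
import numpy as np

def image_to_clip(img, num_frames=16):
    """Repeat a still image (H, W, C) along a new leading time axis,
    yielding a (T, H, W, C) still-video clip for the video encoder."""
    return np.repeat(img[None, ...], num_frames, axis=0)
```

With a 3D-conv tokenizer of temporal stride 2, `num_frames=2` already yields the same token count as a 2D image tokenizer.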
To evaluate image models such as DINOv2 and OpenCLIP on video tasks, we simply process each frame independently with the image encoder to produce a frame-level feature map. The feature maps for each frame are then concatenated and fed to the attentive probe, just as we do with the clip-level feature maps when evaluating video models.
We evaluate our model on AVA (Gu et al., 2018), a dataset for spatio-temporal localization of human actions containing 211k training and 57k validation video segments. We follow the experimental protocol of Feichtenhofer et al. (2021) and use precomputed masks from a pretrained Faster-RCNN adapted to videos, which uses a ResNeXt-101-FPN backbone and is pretrained on ImageNet and COCO. We train a linear classifier on top of the frozen V-JEPA features to classify the extracted regions of interest, and report mean Average Precision (mAP) on the 60 most common classes. Hyper-parameters are provided in Table 10. Our frozen features are obtained by concatenating the last layer of the transformer encoder with three intermediate layers. We use a batch size of 64 and train for 30 epochs with AdamW, using a learning rate of 0.0001 with 2 epochs of warmup and a weight decay of 0.05.
Following Tong et al. (2022), we finetune a linear layer on top of our model, using a layer-decay scheme and mixup as the data augmentation pipeline. We provide all hyper-parameters for both K400 and SSv2 in Table 11.
Table 12 shows that V-JEPA and VideoMAE benefit from using a non-linear attentive probe and multiple clips on the K400 and SSv2 downstream tasks. Additionally, Table 13 shows that attentive probing leads to better performance on average for the DINOv2 and OpenCLIP models. Since attentive probing and multi-clip evaluation improve the performance of all models, we use them as our default protocol in frozen evaluation.
We examine the impact of changing the temporal coverage of a model during downstream evaluation on K400 action classification. In Table 14, we evaluate VideoMAE and V-JEPA models using an attentive probe with access to either the feature map of 1 clip randomly sampled from the video, or the concatenated feature maps of 8 clips randomly sampled from the video. To sample 8 clips, we first divide the video into 8 equal-length temporal segments and sample 1 clip at random from each segment. A single clip corresponds to ≈2 seconds of a video on average, while 8 clips correspond to ≈16 seconds. The video encoder processes each clip separately to produce a clip-level feature map; the clip-level feature maps are then concatenated at the input to the attentive probe.
Increasing the temporal coverage from 1 clip per video to 8 clips improves the performance of both V-JEPA and VideoMAE on K400 action classification. We therefore use the multiclip attentive probing setup as our default evaluation pipeline.
In Table 15, we evaluate V-JEPA using finetuning (separately) on K400 and SSv2. We compare V-JEPA with VideoMAEv2 (Wang et al., 2023a), VideoMAE (Tong et al., 2022), and MVD (Wang et al., 2023b) using a ViT-L/16 or a ViT-H/16 architecture. V-JEPA obtains competitive performance using a finetuning protocol. With a ViT-H/16 architecture, V-JEPA outperforms VideoMAE by 1.2% and VideoMAEv2 by 0.3% on the SSv2 dataset, while obtaining comparable performance on K400. V-JEPA also obtains performance similar to MVD on the SSv2 dataset. The MVD model achieves the best performance across models on the K400 dataset, but is trained using the image dataset ImageNet1K, in contrast to the other methods in the table, which only use video data. Additionally, MVD requires the processing of significantly more samples during pretraining due to the cost of training the teacher encoder networks in a pre-pre-training step.
An important component of the V-JEPA pretraining strategy is the 3D clip masking strategy. In this section, we detail 26 ablation experiments exploring different masks. For all experiments, we pretrain a ViT-B/16 on K400. Figure 8 presents a summary of those results.
Figure 8c shows the effect of changing the spatial and temporal masking ratio. Figure 8b ablates the number of sampled blocks used to construct the masks, given a fixed effective masking ratio of 90%. Finally, Figure 8a examines the effect of sampling several masks per video.
In Figure 8c, we explore different average spatial and temporal masking ratios, i.e., the spatial/temporal fraction of the clip that is covered by a mask on average. Recall that each mask is constructed by sampling several (possibly overlapping) blocks and taking their union. We change the average spatial or temporal masking ratio by changing a block's spatial or temporal size, as well as the overall number of blocks. We found that low spatial or temporal coverage results in a trivial prediction task, which degrades downstream performance. Based on those results, we sample masks that remove roughly 90% of the frame and extend along the entire temporal dimension of the clip by default.
In Figure 8b, we explore different block sizes given an effective spatial masking ratio of 90% and a temporal ratio of 100%. We keep the masking ratio approximately constant by changing the block size and the number of blocks at the same time. We find that sampling several blocks performs better than sampling a single large block. Figure 9 visually illustrates the effect of sampling several smaller blocks to construct a mask.
In Figure 8a, we explore the effect of sampling various numbers of masks per sample. We find that sampling two masks for each clip, with different spatial block sizes for each, is more effective than sampling just a single mask. We hypothesize that this masking strategy induces complementary prediction tasks. We use this two-mask sampling as our default.
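A sketch of the default mask construction on the 8×14×14 token grid: a union of (possibly overlapping) spatial blocks, each extended across the full temporal dimension. The block count and sizes below are illustrative, not the paper's exact pretraining settings.

```python
import numpy as np

def sample_3d_mask(T=8, H=14, W=14, num_blocks=8, block_h=5, block_w=5, rng=None):
    """Union of num_blocks spatial blocks, each spanning all T time steps.
    True marks a masked token (to be predicted from the visible remainder)."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((T, H, W), dtype=bool)
    for _ in range(num_blocks):
        top = int(rng.integers(0, H - block_h + 1))
        left = int(rng.integers(0, W - block_w + 1))
        mask[:, top:top + block_h, left:left + block_w] = True
    return mask
```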
Table: S3.T2: Pixels vs. Featurized Targets. We ablate the effect of computing the prediction loss in feature space vs. pixel space. All models are trained on VideoMix2M for 90K iterations with a batch size of 3072 using the multi-block prediction task. We examine downstream performance using a frozen backbone with attentive probing, and report top-1 accuracy using a single center view. We also examine end-to-end fine-tuning performance of the models on K400. Predicting in feature space provides a consistent improvement over pixel-space prediction.
| Target | Arch. | K400 (16×1×1) | SSv2 (16×1×1) | IN1K | K400-ft (16×5×3) |
|---|---|---|---|---|---|
| Pixels | ViT-L/16 | 68.6 | 66.0 | 73.3 | 85.4 |
| Features | ViT-L/16 | 73.7 | 66.2 | 74.8 | 85.6 |
Table: S4.T3: Average Pooling vs. Attentive Pooling. We pool the feature map output by the frozen V-JEPA encoder using an attentive probe, which is then fed into a linear classifier for downstream supervised tasks (K400 and SSv2). We evaluate two pooling strategies: 1) average pooling (Avg.) and 2) attentive pooling (Att.). Results are reported using a single center view. Using attentive pooling with a cross-attention layer leads to improvements of +17.3 points on K400 and +16.1 points on SSv2.
| Method | Arch. | K400 Avg. (16×1×1) | K400 Att. (16×1×1) | SSv2 Avg. (16×1×1) | SSv2 Att. (16×1×1) |
|---|---|---|---|---|---|
| V-JEPA | ViT-L/16 | 56.7 | 73.7 | 50.1 | 66.2 |
Table: S4.T4: Ablating Prediction Task. Models are ViT-L/16 networks pretrained on K710 and SSv2 and evaluated with an attentive probe using a single center view. The region $x$ is sampled by masking spatio-temporal regions in the video; $y$ is the mask complement. 1) random-tube[$r$]: $x$ is obtained by masking a fraction $r$ of tubes (spatial patches extended across the entire temporal duration) from the video; 2) causal multi-block[$p$]: $x$ is restricted to the first $p$ frames of the 16-frame video, which are then masked with a random set of spatio-temporal blocks; 3) multi-block: $x$ is obtained by masking a random set of spatio-temporal blocks from the entire video. The best performance is obtained with multi-block masking.
| Masking | K400 (16×1×1) | SSv2 (16×1×1) | IN1K |
|---|---|---|---|
| random-tube[0.9] | 51.5 | 46.4 | 55.6 |
| causal multi-block[6] | 61.3 | 49.8 | 66.9 |
| causal multi-block[12] | 71.9 | 63.6 | 72.2 |
| multi-block | 72.9 | 67.4 | 72.8 |
Table: S4.T6: Comparison with Pixel Prediction Methods. We compare V-JEPA with OmniMAE (Girdhar et al., 2023), VideoMAE (Tong et al., 2022), and Hiera (Ryali et al., 2023), which leverage a pixel-reconstruction loss. All models are trained using a ViT-L architecture or a comparable Hiera-L. We evaluate the approaches on downstream image tasks (IN1K, Places205, iNat21) and video tasks (K400, SSv2, AVA) in both frozen evaluation (with a frozen backbone) and end-to-end fine-tuning. All models are evaluated at resolution 224. On K400 and SSv2 we follow the standard practice of reporting accuracy from several spatial and temporal views of the video. In frozen evaluation, V-JEPA outperforms the baselines on all downstream tasks except ImageNet, where the model achieves 74.8% compared to 75.1% for an OmniMAE model trained directly on ImageNet. V-JEPA also achieves the best fine-tuning performance among all ViT-L models and matches Hiera-L on SSv2. The V-JEPA results are achieved while processing significantly fewer examples during pretraining.
| Method | Arch. | #Samples Seen | Iter. | K400 (16×8×3) | SSv2 (16×2×3) | AVA | IN1K | Places205 | iNat21 | K400-ft (16×5×3) | SSv2-ft (16×2×3) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *Methods pretrained using pixel prediction* | | | | | | | | | | | |
| OmniMAE | ViT-L/16 | 2400M | 1170K | 65.6 | 60.6 | 14.4 | 75.1 | 59.8 | 66.1 | 84.0 | 74.2 |
| VideoMAE | ViT-L/16 | 410M | 400K | 77.8 | 65.5 | 21.6 | 71.1 | 59.3 | 64.6 | 85.4 | 74.3 |
| Hiera | Hiera-L | 770M | 1500K | 75.5 | 64.2 | 15.8 | 68.9 | 58.5 | 56.9 | 87.3 | 75.1 |
| V-JEPA | ViT-L/16 | 270M | 90K | 80.8 | 69.5 | 25.6 | 74.8 | 60.3 | 67.8 | 85.6 | 75.1 |
Table: S5.T7: Low-Shot Frozen Evaluation. Comparing V-JEPA to other video models in frozen evaluation on Kinetics-400 and Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive probe. We train the probes in several low-shot settings: using either 5% of the train set, 10%, or 50%, and take 3 random splits in each setting to obtain more robust metrics, resulting in 9 different evaluation experiments for each model. We report the mean performances and standard deviation using the K400 and SSv2 validation sets. V-JEPA is more label-efficient than other models; specifically, decreasing the available number of labeled examples from each class increases the performance gap between V-JEPA and the baselines.
K400 is evaluated at (16×8×3) and SSv2 at (16×2×3).

| Method | Arch. | K400 5% (~29/class) | K400 10% (~58/class) | K400 50% (~287/class) | SSv2 5% (~48/class) | SSv2 10% (~96/class) | SSv2 50% (~440/class) |
|---|---|---|---|---|---|---|---|
| MVD | ViT-L/16 | 62.6 ± 0.2 | 68.3 ± 0.2 | 77.2 ± 0.3 | 42.9 ± 0.8 | 49.5 ± 0.6 | 61.0 ± 0.2 |
| VideoMAE | ViT-H/16 | 62.3 ± 0.3 | 68.5 ± 0.2 | 78.2 ± 0.1 | 41.4 ± 0.8 | 48.1 ± 0.2 | 60.5 ± 0.4 |
| VideoMAEv2 | ViT-g/14 | 37.0 ± 0.3 | 48.8 ± 0.4 | 67.8 ± 0.1 | 28.0 ± 1.0 | 37.3 ± 0.3 | 54.0 ± 0.3 |
| V-JEPA | ViT-H/16 | 67.0 ± 0.2 | 72.1 ± 0.1 | 80.2 ± 0.2 | 51.9 ± 0.3 | 57.5 ± 0.4 | 67.3 ± 0.2 |
| V-JEPA | ViT-H/16 384 | 68.2 ± 0.2 | 72.8 ± 0.2 | 80.6 ± 0.2 | 54.0 ± 0.2 | 59.3 ± 0.5 | 67.9 ± 0.2 |
Table: S10.T8: pretraining hyper-parameters for V-JEPA.
| Hyper-parameter | ViT-L/16 224 | ViT-H/16 224 | ViT-H/16 384 |
|---|---|---|---|
| data | |||
| datasets | VideoMix2M | VideoMix2M | VideoMix2M |
| resolution | 224 | 224 | 384 |
| num_frames | 16 | 16 | 16 |
| temporal_stride | 4 | 4 | 4 |
| horizontal_flip | true | true | true |
| random_resize_scale | (0.3, 1.0) | (0.3, 1.0) | (0.3, 1.0) |
| random_resize_aspect_ratio | (0.75, 1.35) | (0.75, 1.35) | (0.75, 1.35) |
| masking | |||
| block_aspect_ratio | (0.75, 1.5) | (0.75, 1.5) | (0.75, 1.5) |
| shortrange_mask_num_blocks | 8 | 8 | 8 |
| shortrange_mask_spatial_scale | 0.15 | 0.15 | 0.15 |
| longrange_mask_num_blocks | 2 | 2 | 2 |
| longrange_mask_spatial_scale | 0.7 | 0.7 | 0.7 |
| optimization | |||
| batch_size | 3072 | 3072 | 2400 |
| total_number_of_iterations | 90000 | 90000 | 90000 |
| warmup_iterations | 12000 | 12000 | 12000 |
| lr | 6.25e-4 | 6.25e-4 | 6.25e-4 |
| start_lr | 2e-4 | 2e-4 | 2e-4 |
| final_lr | 1e-6 | 1e-6 | 1e-6 |
| start_momentum | 0.998 | 0.998 | 0.998 |
| final_momentum | 1.0 | 1.0 | 1.0 |
| start_weight_decay | 0.04 | 0.04 | 0.04 |
| final_weight_decay | 0.4 | 0.4 | 0.4 |
| scheduler_scale_factor | 1.25 | 1.25 | 1.25 |
| architecture | |||
| patch_size | 16 | 16 | 16 |
| tubelet_size | 2 | 2 | 2 |
| pred_depth | 12 | 12 | 12 |
| pred_embed_dim | 384 | 384 | 384 |
| hardware | |||
| dtype | bfloat16 | bfloat16 | bfloat16 |
| accelerator | A100 80G | A100 80G | A100 80G |
Table: S12.T12: Linear vs. Attentive Probe Evaluation for V-JEPA and VideoMAE. We evaluate the effect of linear (Lin.) and attentive (Att.) probing when adapting V-JEPA to the K400 (16×5×3) and SSv2 (16×2×2) tasks. V-JEPA and VideoMAE benefit from using a non-linear attentive probe.
| Method | Arch. | K400 Lin. | K400 Att. | SSv2 Lin. | SSv2 Att. |
|---|---|---|---|---|---|
| VideoMAE | ViT-L/16 | 52.5 | 77.8 | 41.3 | 61.2 |
| V-JEPA | ViT-L/16 | 56.7 | 80.8 | 50.1 | 69.5 |
Table: S12.T13: Linear vs. Attentive Probe Evaluation for DINOv2 and OpenCLIP. We evaluate the effect of linear (Lin.) and attentive (Att.) probing when adapting DINOv2 and OpenCLIP. Image baselines benefit from using an attentive probing strategy. Results shown in gray are reported from the linear probe evaluation in Oquab et al. (2023).
| Method | Arch. | K400 Lin. | K400 Att. | SSv2 Lin. | SSv2 Att. | IN1K Lin. | IN1K Att. | Places205 Lin. | Places205 Att. | iNat21 Lin. | iNat21 Att. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | ViT-g/14 | 78.4 | 83.4 | 38.3 | 50.0 | 86.5 | 86.2 | 67.5 | 68.4 | 85.7 | 88.8 |
| OpenCLIP | ViT-G/14 | 78.3 | 81.8 | 35.8 | 34.8 | 86.2 | 85.3 | 69.8 | 70.2 | 76.0 | 83.6 |
Table: S12.T14: Temporal Coverage on Kinetics-400. We evaluate the effect of temporal coverage on K400. We train an attentive probe on K400 using either 1 clip (≈2 seconds of a video) or 8 clips (≈16 seconds of a video). To sample $N$ clips, we first divide a video into $N$ equal-length temporal segments and sample one clip at random per segment. The video encoder processes each clip in parallel, and all the encoder output tokens are concatenated at the input of the attentive probe. Increasing the temporal coverage from 1 clip per video to 8 clips significantly improves the performance for both our VideoMAE baseline and V-JEPA.
| Method | Arch. | 1 Clip | 8 Clips |
|---|---|---|---|
| VideoMAE | ViT-L/16 | 69.4 | 77.8 |
| V-JEPA | ViT-L/16 | 73.7 | 80.9 |
Table: S12.T15: Finetuning results. We evaluate a V-JEPA model with the finetuning protocol on the K400 and SSv2 datasets using 16 frames per clip and multi-view fusion (5×3 or 2×3) for inference. The #Samples Seen entry corresponds to the number of video clips processed during pretraining, which is larger than the size of the pretraining dataset for multi-epoch training. We compare V-JEPA with different video self-supervised learning approaches. We report the VideoMAEv2 results without instruction-tuning for consistency with the other approaches. V-JEPA obtains competitive performance using the finetuning protocol.
| Method | Arch. | Pretraining Data | #Samples Seen | K400 (16×5×3) | SSv2 (16×2×3) |
|---|---|---|---|---|---|
| VideoMAEv1 | ViT-L/16 | K400 / SSv2 | 380M / 410M | 85.4 | 74.3 |
| VideoMAEv1 | ViT-H/16 | K400 / SSv2 | 380M / 410M | 86.6 | 74.8 |
| VideoMAEv2 | ViT-H/16 | Un.Hybrid | 1600M | 86.9 | 76.8 |
| MVD | ViT-L/16 | K400+IN1K | 2400M | 86.4 | 76.7 |
| MVD | ViT-H/16 | K400+IN1K | 2400M | 87.2 | 77.3 |
| V-JEPA | ViT-L/16 | VideoMix2M | 270M | 85.6 | 75.1 |
| V-JEPA | ViT-H/16 | VideoMix2M | 270M | 86.6 | 77.0 |
Table: S12.T16: Sample efficiency. We compare the sample efficiency of pretraining various state-of-the-art image and video models. The #Samples Seen entry corresponds to the number of samples (image or video clips) processed by the network during pretraining, which is larger than the size of the pretraining dataset for multi-epoch training. The V-JEPA results in this paper are obtained while processing an order of magnitude fewer samples than previous methods.
| Method | Arch. | Data | #Samples Seen |
|---|---|---|---|
| OpenCLIP | ViT-G/14 | LAION-2B | 39000M |
| DINOv2 | ViT-g/14 | LVD-142M | 1900M |
| VideoMAEv2 | ViT-g/14 | UnlabeledHybrid | 1600M |
| V-JEPA | ViT-H/16384 | VideoMix2M | 210M |
V-JEPA models pretrained on video learn versatile visual representations that perform well on motion-based tasks (Something-Something-v2) and appearance-based tasks (Kinetics-400) without adaptation of the model's parameters, i.e., using the same frozen backbone for both tasks.
Joint-Embedding Predictive Architectures are trained to predict the representation of an input $y$ from the representation of another input $x$. The additional variable $z$ provides the predictor with information about the transformation that computes $y$ from $x$.
V-JEPA. Training operates on a video clip of $T$ frames with spatial resolution $H \times W$, flattened into a sequence of $L$ tokens. (Left to right): We first obtain the input of the x-encoder by dropping tokens from the video clip. The x-encoder then processes the masked video sequence and outputs an embedding vector for each input token. Next, the outputs of the x-encoder are concatenated with a set of learnable mask tokens containing positional embeddings of the masked spatio-temporal patches. The predictor network processes the combined token sequence and outputs an embedding vector for each mask token. The outputs of the predictor are then regressed to the prediction targets using an $L_1$ loss. The prediction targets correspond to the output of the y-encoder.
SSv2 fine-tuning performance vs. Samples Seen. We report SSv2 fine-tuning for V-JEPA and pixel-reconstruction baselines using a ViT-L/16 or Hiera-L architecture. V-JEPA outperforms all pixel-reconstruction methods using a ViT-L/16 and matches the Hiera-L performance while seeing significantly fewer samples during pretraining.
SSv2 frozen-evaluation performance vs. Pretraining Time. Wallclock times for all methods are measured on a single GPU with a batch size of 10 clips, using the official codebases for VideoMAE and VideoMAEv2, and linearly extrapolated assuming a global batch size of 2400 samples. However, note that the SSv2 accuracies of video pixel prediction methods are actually obtained with small batch sizes and significantly longer training schedules. V-JEPA outperforms pixel-reconstruction methods while training significantly faster.
(a) Visualization Methodology. We train a conditional diffusion model to decode the V-JEPA feature-space predictions to interpretable pixels; the pretrained V-JEPA encoder and predictor networks are kept frozen in this process. The decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video.
(b) Visualizations. First row: masked videos used as input to the V-JEPA models (a pretrained ViT-H/16 encoder and its corresponding predictor network). Other rows: bounding boxes contain various samples from the decoder overlaid on the original video. V-JEPA is not a generative model and the decoder does not have access to the context (first row), so we do not expect samples to exactly match the input. This experiment qualitatively illustrates what information is encoded and predicted by V-JEPA. In particular, characteristics that are common across samples represent information that is encoded in the V-JEPA predictions. V-JEPA generates predictions that are spatially and temporally coherent with the unmasked regions of the video. The predictions also capture consistent motion through time.
V-JEPA training operates on a video clip flattened into a sequence of tokens. To convert a video clip of size $16 \times 224 \times 224 \times 3$ into a 1D token sequence, we apply a 3D convolution comprising $d$ filters of size $2 \times 16 \times 16$ with a temporal stride of 2 and a spatial stride of 16, resulting in a tensor of shape $8 \times 14 \times 14 \times d$. Next we add absolute 3D sin-cos positional embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape $1568 \times d$.
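The resulting token count follows directly from the strides; a quick check (`num_tokens` is an illustrative helper):

```python
def num_tokens(T=16, H=224, W=224, t_stride=2, patch=16):
    """Tokens produced by the 3D-conv patchifier: temporal stride 2,
    16x16 spatial patches."""
    return (T // t_stride) * (H // patch) * (W // patch)

print(num_tokens())  # -> 1568, i.e. 8 * 14 * 14
```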
(a) Num. Blocks: 8, Spatial Block Size: 32×32
$$ \underset{\theta,\phi}{\text{minimize}} \quad \lVert P_\phi(E_\theta(x), \Delta_y) - \text{sg}(\overline{E}_\theta(y)) \rVert_1 $$

$$ \text{Loss} = \frac{1}{M} \sum_{k \in (i_1, \ldots, i_M)} \lVert \hat{s}_k - s_k \rVert_1 $$
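The loss, restricted to the $M$ masked token positions, can be sketched as (`vjepa_loss` is a hypothetical helper name):

```python
import numpy as np

def vjepa_loss(pred, target, masked_idx):
    """Mean L1 distance between predicted and target feature vectors,
    averaged over the M masked token positions only."""
    diff = pred[masked_idx] - target[masked_idx]   # (M, d)
    return np.abs(diff).sum(axis=-1).mean()        # (1/M) * sum_k ||s_hat_k - s_k||_1
```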
$$ \sum^L_{i=1}{\frac{\exp(q^\top {\bf W_k} s_i)}{\sum_j \exp(q^\top {\bf W_k} s_j)} {\bf W_v} s_i }, $$
$$ \displaystyle P^{\star}(E_{\theta}(x)) $$


Figure 8 Masking Strategy Ablation. Evaluating a linear probe on a ViT-B/16 pretrained with V-JEPA on K400 under various 3D Multi-Block masking settings. We examine the impact of (a) sampling several masks per video, (b) varying the number of blocks in a mask, and (c) varying the average spatial and temporal masking ratio. A temporal masking ratio of 100% extends the spatial mask across all the frames in the clip. We find it important to maintain a high spatial and temporal masking ratio during pretraining.
(b) Num. Blocks: 2, Spatial Block Size: 160×160
Figure 9 Illustration of masks with varying numbers of blocks and block sizes. Each mask is constructed by sampling several (possibly overlapping) blocks and taking their union.
| Arch. | Data | #Samples | K400 (16×1×1) | SSv2 (16×1×1) | IN1K | Avg. |
|---|---|---|---|---|---|---|
| ViT-L/16 | K710 | 700K | 75.8 | 63.2 | 73.7 | 70.9 |
| ViT-L/16 | K710+SSv2 | 900K | 72.9 | 67.4 | 72.8 | 71.0 |
| ViT-L/16 | K710+HT | 1900K | 74.5 | 64.2 | 74.8 | 71.1 |
| ViT-L/16 | VideoMix2M | 2000K | 73.7 | 66.2 | 74.8 | 71.5 |
| ViT-H/16 | K710+SSv2 | 900K | 75.7 | 66.8 | 73.7 | 72.0 |
| ViT-H/16 | VideoMix2M | 2000K | 74.0 | 68.5 | 75.9 | 72.8 |
| Video Tasks | Video Tasks | Video Tasks | Image Tasks | Image Tasks | Image Tasks | ||||
|---|---|---|---|---|---|---|---|---|---|
| Method | Arch. | Params. | Data | K400 (16 × 8 × 3) | SSv2 (16 × 2 × 3) | AVA | IN1K | Places205 | iNat21 |
| Methods pretrained on Images | Methods pretrained on Images | Methods pretrained on Images | Methods pretrained on Images | Methods pretrained on Images | Methods pretrained on Images | Methods pretrained on Images | Methods pretrained on Images | Methods pretrained on Images | Methods pretrained on Images |
| I-JEPA | ViT-H/16 512 | 630M | IN22K | 79.7 | 50.0 | 19.8 | 84.4 | 66.5 | 85.7 |
| OpenCLIP | ViT-G/14 | 1800M | LAION | 81.8 | 34.8 | 23.2 | 85.3 | 70.2 | 83.6 |
| DINOv2 | ViT-g/14 | 1100M | LVD-142M | 83.4 | 50.6 | 24.3 | 86.2 | 68.4 | 88.8 |
| Methods pretrained on Videos | Methods pretrained on Videos | Methods pretrained on Videos | Methods pretrained on Videos | Methods pretrained on Videos | Methods pretrained on Videos | Methods pretrained on Videos | Methods pretrained on Videos | Methods pretrained on Videos | Methods pretrained on Videos |
| MVD | ViT-L/16 | 200M | IN1K+K400 | 79.4 | 66.5 | 19.7 | 73.3 | 59.4 | 65.7 |
| OmniMAE | ViT-H/16 | 630M | IN1K+SSv2 | 71.4 | 65.4 | 16.0 | 76.3 | 60.6 | 72.4 |
| VideoMAE | ViT-H/16 | 630M | K400 | 79.8 | 66.2 | 20.7 | 72.3 | 59.1 | 65.5 |
| VideoMAEv2 | ViT-g/14 | 1100M | Un.Hybrid | 71.2 | 61.2 | 12.9 | 71.4 | 60.6 | 68.3 |
| Hiera | Hiera-H | 670M | K400 | 77.0 | 64.7 | 17.5 | 71.4 | 59.5 | 61.7 |
| V-JEPA | ViT-L/16 | 200M | VideoMix2M | 80.8 | 69.5 | 25.6 | 74.8 | 60.3 | 67.8 |
| V-JEPA | ViT-H/16 | 630M | VideoMix2M | 82.0 | 71.4 | 25.8 | 75.9 | 61.7 | 67.9 |
| V-JEPA | ViT-H/16 384 | 630M | VideoMix2M | 81.9 | 72.2 | 25.0 | 77.4 | 62.8 | 72.6 |
| Frozen Evaluation | Frozen Evaluation | Frozen Evaluation | Frozen Evaluation | Frozen Evaluation | |||
|---|---|---|---|---|---|---|---|
| K400 (16 × 8 × 3) | K400 (16 × 8 × 3) | K400 (16 × 8 × 3) | SSv2 (16 × 2 × 3) | SSv2 (16 × 2 × 3) | SSv2 (16 × 2 × 3) | ||
| Method | Arch. | 5% ( ∼ 29 samples per class) | 10% ( ∼ 58 samples per class) | 50% ( ∼ 287 samples per class) | 5% ( ∼ 48 samples per class) | 10% ( ∼ 96 samples per class) | 50% ( ∼ 440 samples per class) |
| MVD | ViT-L/16 | 62.6 ± 0.2 | 68.3 ± 0.2 | 77.2 ± 0.3 | 42.9 ± 0.8 | 49.5 ± 0.6 | 61.0 ± 0.2 |
| VideoMAE | ViT-H/16 | 62.3 ± 0.3 | 68.5 ± 0.2 | 78.2 ± 0.1 | 41.4 ± 0.8 | 48.1 ± 0.2 | 60.5 ± 0.4 |
| VideoMAEv2 | ViT-g/14 | 37.0 ± 0.3 | 48.8 ± 0.4 | 67.8 ± 0.1 | 28.0 ± 1.0 | 37.3 ± 0.3 | 54.0 ± 0.3 |
| V-JEPA | ViT-H/16 | 67.0 ± 0.2 | 72.1 ± 0.1 | 80.2 ± 0.2 | 51.9 ± 0.3 | 57.5 ± 0.4 | 67.3 ± 0.2 |
| V-JEPA | ViT-H/16 384 | 68.2 ± 0.2 | 72.8 ± 0.2 | 80.6 ± 0.2 | 54.0 ± 0.2 | 59.3 ± 0.5 | 67.9 ± 0.2 |
| Hyper-parameter | ViT-L/16 224 | ViT-H/16 224 | ViT-H/16 384 |
|---|---|---|---|
| data | |||
| datasets | VideoMix2M | VideoMix2M | VideoMix2M |
| resolution | 224 | 224 | 384 |
| num_frames | 16 | 16 | 16 |
| temporal_stride | 4 | 4 | 4 |
| horizontal_flip | true | true | true |
| random_resize_scale | (0.3, 1.0) | (0.3, 1.0) | (0.3, 1.0) |
| random_resize_aspect_ratio | (0.75, 1.35) | (0.75, 1.35) | (0.75, 1.35) |
| masking | |||
| block_aspect_ratio | (0.75, 1.5) | (0.75, 1.5) | (0.75, 1.5) |
| shortrange_mask_num_blocks | 8 | 8 | 8 |
| shortrange_mask_spatial_scale | 0.15 | 0.15 | 0.15 |
| longrange_mask_num_blocks | 2 | 2 | 2 |
| longrange_mask_spatial_scale | 0.7 | 0.7 | 0.7 |
| optimization | |||
| batch_size | 3072 | 3072 | 2400 |
| total_number_of_iterations | 90000 | 90000 | 90000 |
| warmup_iterations | 12000 | 12000 | 12000 |
| lr | 6.25e-4 | 6.25e-4 | 6.25e-4 |
| start_lr | 2e-4 | 2e-4 | 2e-4 |
| final_lr | 1e-6 | 1e-6 | 1e-6 |
| start_momentum | 0.998 | 0.998 | 0.998 |
| final_momentum | 1.0 | 1.0 | 1.0 |
| start_weight_decay | 0.04 | 0.04 | 0.04 |
| final_weight_decay | 0.4 | 0.4 | 0.4 |
| scheduler_scale_factor | 1.25 | 1.25 | 1.25 |
| architecture | |||
| patch_size | 16 | 16 | 16 |
| tubelet_size | 2 | 2 | 2 |
| pred_depth | 12 | 12 | 12 |
| pred_embed_dim | 384 | 384 | 384 |
| hardware | |||
| dtype | bfloat16 | bfloat16 | bfloat16 |
| accelerator | A100 80G | A100 80G | A100 80G |
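The learning-rate entries above (start_lr, lr, final_lr, together with a warmup phase) are consistent with a linear-warmup-then-cosine-decay schedule. The following is a minimal sketch under that assumption; the exact scheduler used for pretraining, and the role of scheduler_scale_factor, are not specified here:

```python
import math

def lr_at(step, warmup=12000, total=90000,
          start_lr=2e-4, peak_lr=6.25e-4, final_lr=1e-6):
    """Linear warmup from start_lr to peak_lr, then cosine decay to final_lr.

    Default values are taken from the pretraining hyper-parameter table
    (ViT-L/16 at resolution 224); the schedule shape itself is an assumption.
    """
    if step < warmup:
        return start_lr + (peak_lr - start_lr) * step / warmup
    t = (step - warmup) / (total - warmup)  # progress in [0, 1] after warmup
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * t))

# The schedule hits the tabulated values at the boundary steps.
assert abs(lr_at(0) - 2e-4) < 1e-12       # start_lr at step 0
assert abs(lr_at(12000) - 6.25e-4) < 1e-12  # peak lr after warmup
assert abs(lr_at(90000) - 1e-6) < 1e-12     # final_lr at the last iteration
```

The momentum and weight-decay rows in the table follow analogous start/final schedules over the same 90K iterations.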
| Hyper-parameter | K400 | SSv2 | IN1K | Place205 | iNat21 |
|---|---|---|---|---|---|
| data | |||||
| num_clips | 8 | 1 | N.A. | N.A. | N.A. |
| num_frames | 16 | 16 | N.A. | N.A. | N.A. |
| temporal_stride | 4 | 4 | N.A. | N.A. | N.A. |
| horizontal_flip | true | true | true | true | true |
| random_resize_scale | (0.08, 1.0) | (0.08, 1.0) | (0.08, 1.0) | (0.08, 1.0) | (0.08, 1.0) |
| random_resize_aspect_ratio | (0.75, 1.33) | (0.75, 1.33) | (0.75, 1.33) | (0.75, 1.33) | (0.75, 1.33) |
| auto_augment | false | false | true | true | true |
| optimization | |||||
| batch_size | 256 | 256 | 1024 | 1024 | 1024 |
| epochs | 20 | 20 | 20 | 20 | 20 |
| lr | 1e-3 | 1e-3 | 1e-3 | 1e-3 | 1e-3 |
| final_lr | 0 | 0 | 0 | 0 | 0 |
| weight_decay | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| Hyper-parameter | ViT-L/16 | ViT-H/16 |
|---|---|---|
| out_layers | [18, 20, 22, 24] | [26, 28, 30, 32] |
| batch_size | 64 | 64 |
| epochs | 30 | 30 |
| opt | AdamW | AdamW |
| opt_eps | 1e-8 | 1e-8 |
| momentum | 0.9 | 0.9 |
| weight_decay | 0.05 | 0.05 |
| lr | 1e-4 | 1e-4 |
| warmup_lr | 1e-6 | 1e-6 |
| min_lr | 1e-6 | 1e-6 |
| warmup_epochs | 2 | 2 |
| warmup_steps | 1 | 1 |
| Hyper-parameter | K400 | K400 | K400 | SSv2 | SSv2 |
|---|---|---|---|---|---|
| data | |||||
| num_segments | 1 | 1 | 1 | 1 | |
| num_frames | 16 | 16 | 16 | 16 | |
| sampling_rate | 4 | 4 | 4 | 4 | |
| resolution | 224 | 224 | 224 | 224 | |
| model | |||||
| model_name | ViT-L/16 | ViT-H/16 | ViT-L/16 | ViT-H/16 | |
| drop_path | 0.1 | 0.2 | 0.2 | 0.2 | |
| head_drop_rate | 0. | 0. | 0.5 | 0.5 | |
| optimization | |||||
| batch_size | 256 | 1024 | 256 | 256 | |
| epochs | 35 | 25 | 15 | 15 | |
| opt | adamw | adamw | adamw | adamw | |
| opt_eps | 1e-8 | 1e-8 | 1e-8 | 1e-8 | |
| momentum | 0.9 | 0.9 | 0.9 | 0.9 | |
| weight_decay | 0.05 | 0.05 | 0.05 | 0.05 | |
| lr | 0.002 | 0.0005 | 0.0005 | 0.0005 | |
| layer_decay | 0.75 | 0.75 | 0.75 | 0.75 | |
| warmup_lr | 1e-6 | 1e-8 | 1e-6 | 1e-6 | |
| min_lr | 1e-6 | 1e-5 | 1.5e-4 | 1.5e-3 | |
| warmup_epochs | 5 | 5 | 5 | 5 | |
| augmentations | | | | | |
| color_jitter | 0.4 | 0.4 | 0.4 | 0.4 | |
| horizontal_flip | True | True | False | False | |
| num_sample | 2 | ||||
| aa | rand-m7-n4-mstd0.5-inc1 | ||||
| smoothing | 0.1 | ||||
| train_interpolation | bicubic | ||||
| test_num_segment | 5 | 5 | 2 | 2 | |
| test_num_crop | 3 | 3 | 3 | 3 | |
| erase | |||||
| prob | 0.25 | ||||
| mode | pixel | ||||
| count | 1 | ||||
| split | False | ||||
| mixup | |||||
| mixup | 0.8 | ||||
| cutmix | 1.0 | ||||
| mixup_prob | 1.0 | ||||
| mixup_switch_prob | 0.5 | ||||
| mixup_mode | batch |
| K400 | K400 | SSv2 | SSv2 | ||
|---|---|---|---|---|---|
| Method | Arch. | Lin. | Att. | Lin. | Att. |
| VideoMAE | ViT-L/16 | 52.5 | 77.8 | 41.3 | 61.2 |
| V-JEPA | ViT-L/16 | 56.7 | 80.8 | 50.1 | 69.5 |
| K400 | K400 | SSv2 | SSv2 | IN1K | IN1K | Place205 | Place205 | iNat21 | iNat21 | ||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | Arch. | Lin. | Att. | Lin. | Att. | Lin. | Att. | Lin. | Att. | Lin. | Att. |
| DINOv2 | ViT-g/14 | 78.4 | 83.4 | 38.3 | 50.0 | 86.5 | 86.2 | 67.5 | 68.4 | 85.7 | 88.8 |
| OpenCLIP | ViT-G/14 | 78.3 | 81.8 | 35.8 | 34.8 | 86.2 | 85.3 | 69.8 | 70.2 | 76.0 | 83.6 |
| Method | Arch. | 1 Clip | 8 Clips |
|---|---|---|---|
| VideoMAE | ViT-L/16 | 69.4 | 77.8 |
| V-JEPA | ViT-L/16 | 73.7 | 80.9 |
| Method | Arch. | Pretraining Data | #Samples Seen | K400 (16 × 5 × 3) | SSv2 (16 × 2 × 3) |
|---|---|---|---|---|---|
| VideoMAEv1 | ViT-L/16 | K400 / SSv2 | 380M / 410M | 85.4 | 74.3 |
| VideoMAEv1 | ViT-H/16 | K400 / SSv2 | 380M / 410M | 86.6 | 74.8 |
| VideoMAEv2 | ViT-H/16 | Un.Hybrid | 1600M | 86.9 | 76.8 |
| MVD | ViT-L/16 | K400+IN1K | 2400M | 86.4 | 76.7 |
| MVD | ViT-H/16 | K400+IN1K | 2400M | 87.2 | 77.3 |
| V-JEPA | ViT-L/16 | VideoMix2M | 270M | 85.6 | 75.1 |
| V-JEPA | ViT-H/16 | VideoMix2M | 270M | 86.6 | 77.0 |
| Method | Arch. | Data | #Samples Seen |
|---|---|---|---|
| OpenCLIP | ViT-G/14 | LAION-2B | 39000M |
| DINOv2 | ViT-g/14 | LVD 142M | 1900M |
| VideoMAEv2 | ViT-g/14 | UnlabeledHybrid | 1600M |
| V-JEPA | ViT-H/16 384 | VideoMix2M | 210M |
$$
\begin{aligned}
P^\star(E_\theta(x)) &= \operatorname*{argmin}_P \, \lVert P(E_\theta(x)) - Y \rVert_1 \\
&= \operatorname{median}(Y \mid E_\theta(x)).
\end{aligned}
$$
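The second equality above, that the L1-optimal constant predictor is the (conditional) median, can be checked numerically. A small sketch with synthetic data (numpy, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=1000)  # samples standing in for the target Y

# Search over constant predictions p; the L1-optimal one is median(Y).
candidates = np.linspace(y.min(), y.max(), 2001)
l1 = np.abs(candidates[:, None] - y[None, :]).sum(axis=1)  # sum_i |p - y_i|
best = candidates[np.argmin(l1)]

# The grid minimizer lands on the empirical median (up to grid spacing).
assert abs(best - np.median(y)) < 1e-2
```

The same argument, applied per output coordinate, is why an L1 feature-prediction loss targets the conditional median of the representations rather than the conditional mean.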
\textbf{\textit{Comparison with State-of-the-Art Models}} (Table~\ref{tb:large_results}). We compare \putalg with state-of-the-art baselines in frozen evaluation with an attentive probe on downstream image tasks (IN1K, Places205, iNat21) and video tasks (K400, SSv2, AVA). All models are evaluated at resolution 224, except I-JEPA$_{512}$ and V-JEPA$_{384}$, which are evaluated at resolutions 512 and 384, respectively. On K400 and SSv2 we follow the standard practice of reporting accuracy from several spatial and temporal views of the video. Compared to other video baselines, \putalg exhibits a consistent improvement across all downstream tasks. Compared to image models that excel under frozen evaluation, \putalg shows a significant performance improvement on tasks requiring motion understanding (+21 points on SSv2), and reduces the gap between video and image models on tasks requiring static appearance-based features.
\section{Comparison with Prior Work} In Section~\ref{subsec:pixel_comparison}, we investigate the impact of feature prediction by comparing \putalg with video approaches that rely on pixel prediction, while using a similar architecture for all baselines. Subsequently, in Section~\ref{subsec:sota_comparison}, we remove the architectural constraint and report the best performance across architectures for self-supervised video and image pretraining approaches. Finally, we explore the label-efficiency of \putalg relative to other self-supervised video pretraining approaches in Section~\ref{subsec:lowshot}. We further detail the evaluation setup in Appendix~\ref{app:evaluation}.
\subsection{Comparison with Pixel Prediction} \label{subsec:pixel_comparison}
To investigate the effectiveness of feature prediction pretraining, we first compare \putalg to video masked modeling approaches that rely on a pixel prediction loss. We control for the possible confounding factor of model architecture by evaluating all models using either a ViT-L/16 encoder, or a Hiera-L encoder, which has a similar number of parameters. For the pixel prediction baselines we consider VideoMAE~\citep{tong2022videomae, wang2023videomae}, which trains vision transformer autoencoders exclusively on video, Hiera~\citep{ryali2023hiera}, which trains a hierarchical transformer autoencoder on video, and OmniMAE~\citep{girdhar2023omnimae}, which trains a vision transformer autoencoder on static images and video simultaneously.
Table~\ref{tb:pixel_comparison} examines both frozen evaluation with an attentive probe on downstream video and image tasks, as well as end-to-end fine-tuning. In frozen evaluation, \putalg outperforms the baselines on all downstream tasks, except ImageNet, where we achieve $74.8\%$ compared to $75.1\%$ for an OmniMAE model trained directly on ImageNet; hence, \putalg achieves comparable ImageNet performance despite only pretraining on video.
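The frozen evaluations above rely on an attentive probe trained on top of the frozen backbone. As a rough illustration, the following numpy sketch pools a frozen token sequence into one vector via cross-attention with a single learnable query; the actual probe architecture (heads, layers, initialization) is not specified here, so treat all names and shapes as assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_pool(tokens, q, Wk, Wv):
    """Pool a (L, d) sequence of frozen features into a single (d,) vector.

    tokens: frozen backbone outputs; q: learnable query (d,);
    Wk, Wv: learnable key/value projections (d, d). All hypothetical names.
    """
    keys, values = tokens @ Wk, tokens @ Wv
    attn = softmax(q @ keys.T / np.sqrt(q.shape[0]))  # (L,) attention weights
    return attn @ values                              # weighted sum of values

d, L = 64, 1568  # feature dim (illustrative) and video token-sequence length
rng = np.random.default_rng(0)
tokens = rng.normal(size=(L, d))
pooled = attentive_pool(tokens, rng.normal(size=d),
                        rng.normal(size=(d, d)) / np.sqrt(d),
                        rng.normal(size=(d, d)) / np.sqrt(d))
assert pooled.shape == (d,)
```

A linear classifier on `pooled` is then trained for each downstream task while the backbone stays frozen, which is what makes a single pretrained encoder reusable across tasks.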
Under the fine-tuning protocol, \putalg also achieves the best performance of any model trained with a ViT-L/16, and matches the performance of the Hiera-L on SSv2, which benefits from a hierarchical prior~\citep{ryali2023hiera}. The \putalg models achieve this result while processing significantly fewer samples during pretraining (Figure~\ref{fig:ssv2_finetuning}), demonstrating the efficiency of feature prediction as a learning principle. \begin{figure}[t] \includegraphics[width=\linewidth]{assets/scatter-ssv2-finetuned-compute.pdf} \caption{{\it SSv2 fine-tuning performance vs.~Samples Seen.} We report SSv2 fine-tuning for \putalg and pixel-reconstruction baselines using a ViT-L/16 or Hiera-L architecture. \putalg outperforms all pixel-reconstruction methods using a ViT-L/16 and matches the Hiera-L performance while seeing significantly fewer samples during pretraining.} \label{fig:ssv2_finetuning} \end{figure}
\subsection{Comparison with State-of-the-Art} \label{subsec:sota_comparison}
Next, in Table~\ref{tb:large_results}, we inspect how the \putalg models pretrained on video stack up next to the largest state-of-the-art self-supervised image and video models when freezing the backbone encoder and training an attentive probe on top. Our image pretrained baselines include OpenCLIP~\citep{cherti2023reproducible}, DINOv2~\citep{oquab2023dinov2}, and I-JEPA~\citep{assran2023self}. The OpenCLIP model is trained with a contrastive image-text alignment objective, DINOv2 and I-JEPA are trained with self-supervision. These models are known to excel in their frozen-evaluation performance~\citep{oquab2023dinov2}; i.e., their ability to produce visual features that can be applied to many downstream tasks simultaneously, without end-to-end fine-tuning, and thus provide highly competitive baselines. Our video pretrained baselines include VideoMAE~\citep{tong2022videomae}, OmniMAE~\citep{girdhar2023omnimae}, Hiera~\citep{ryali2023hiera}, VideoMAEv2~\citep{wang2023videomae}, and MVD~\citep{wang2023masked}. The OpenCLIP, DINOv2 and VideoMAEv2 models are parameterized as Giant/Gigantic vision transformer architectures containing over 1B parameters trained on large-scale image or video datasets. \begin{figure}[t] \includegraphics[width=\linewidth]{assets/scatter-ssv2-frozen-compute-time.pdf} \caption{{\it SSv2 frozen-evaluation performance vs.~Pretraining Time.} Wallclock times for all methods are measured on a single GPU with a batch size of 10 clips, using the official codebases for VideoMAE and VideoMAEv2, and linearly extrapolated assuming a global batch size of 2400 samples. However, note that the SSv2 accuracies of video pixel prediction methods are actually obtained with small batch sizes and significantly longer training schedules. 
\putalg outperforms pixel-reconstruction methods while training significantly faster.} \label{fig:ssv2_frozen} \end{figure} \begin{table*}[t] \centering {\fontfamily{ptm}\fontsize{7pt}{7pt}\selectfont \caption{{\it Low-Shot Frozen Evaluation.} Comparing \putalg to other video models in frozen evaluation on Kinetics-400 and Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive probe. We train the probes in several low-shot settings: using either 5\% of the train set, 10\%, or 50\%, and take 3 random splits in each setting to obtain more robust metrics, resulting in 9 different evaluation experiments for each model. We report the mean performance and standard deviation using the K400 and SSv2 validation sets. \putalg is more label-efficient than other models; specifically, decreasing the available number of labeled examples from each class increases the performance gap between \putalg and the baselines.} \label{tb:lowshot}} \end{table*}
\paragraph{\bf Comparison with video models.} Compared to large-scale video baselines, the \putalg models outperform all previous models on every downstream video and image task by a notable margin (see Table~\ref{tb:large_results}). Our H/16 model outperforms the largest publicly available VideoMAE, VideoMAEv2, OmniMAE, MVD, and Hiera models by at least $+5$ points in motion understanding (Something-Something-v2), $+2$ points in action recognition (Kinetics-400), $+5$ points on action detection (AVA), $+1$ point on object recognition (ImageNet-1K), $+2$ points in scene recognition (Places205), and $+0.2$ points on fine-grained recognition (iNaturalist). Moreover, when comparing pretraining wallclock time in Figure~\ref{fig:ssv2_frozen}, we see that \putalg achieves this performance with a roughly $2\times$ speedup compared to the large pixel prediction models.
\paragraph{\bf Comparison with image models.} On tasks that require a fine-grained understanding of motion (Something-Something-v2), the \putalg models provide a major improvement (over $+21$ points) compared to large-scale image baselines, such as DINOv2, OpenCLIP, and I-JEPA. Self-supervised pretraining on videos makes it possible to model dynamic concepts that are not easily learned from static image datasets. Similarly, we observe that the \putalg models outperform image-based pretraining on action localization.
On Kinetics-400, we find image models to perform well; e.g., while DINOv2~\citep{oquab2023dinov2} previously reported $78.4\%$ on K400 with a linear probe, we improve the frozen evaluation of the g/14 model to $83.4\%$ by using an attentive probe. In this case, our H/16 model achieves $82.0\%$ top-1 accuracy. It is worth noting that the label for many Kinetics videos can be inferred using appearance-based cues, without requiring an understanding of motion~\citep{sevilla2021only}.
The \putalg models narrow the gap with image models on image classification tasks. In particular, \putalg achieves a score of $77.4\%$ on ImageNet using a one-layer attentive probe, which can be further improved to $\bf{77.9\%}$ using a two-layer attentive probe. More generally, we hypothesize that the datasets used to train \putalg and other video models are too constrained and lack the visual diversity of the internet-scale pretraining data used by the image models; as such, there is value in focusing future work on building diverse publicly available video datasets.
\subsection{Label-efficiency} \label{subsec:lowshot} We examine the label-efficiency of \putalg compared to other self-supervised video models by measuring the ability of the pretrained backbones to adapt to downstream tasks with few labels. Specifically, we investigate the performance of the frozen models on Kinetics-400 and Something-Something-v2 as we vary the percentage of labeled examples from each dataset available for training the attentive probe. We train the probes in several low-shot settings: using either 5\% of the train set, 10\%, or 50\%, and take 3 random splits in each setting to obtain more robust metrics, resulting in 9 different evaluation experiments for each model. Table~\ref{tb:lowshot} reports the mean performance and standard deviation using the K400 and SSv2 validation sets.
We find \putalg to be more label-efficient than other self-supervised video models: decreasing the available number of labeled examples for training the attentive probe results in an increase in the performance gap between \putalg and the other models. In particular, the performance of the largest \putalg model on K400 drops by 12\% to 68.2\% top-1 when we reduce the number of labeled examples by a factor of $10\times$ (from roughly 287 examples per class to 29 examples per class). By contrast, VideoMAEv2 drops by 30\% to 37.0\% top-1, VideoMAE drops by 15.9\% to 62.3\% top-1, and MVD drops by 14.6\% to 62.6\% top-1.
Similar observations hold on SSv2. The performance of the largest \putalg model on SSv2 drops by 13.9\% to 54.0\% top-1 when we reduce the number of labeled examples by a factor of $10\times$ (from roughly 440 examples per class to 48 examples per class). By contrast, VideoMAEv2 drops by 26\% to 28.0\% top-1, VideoMAE drops by 19.1\% to 41.4\% top-1, and MVD drops by 18.1\% to 42.9\% top-1.
\section{Evaluating the Predictor} Next, we seek to qualitatively inspect the \putalg models. Recall that the predictor network in \putalg predicts the representations of a masked spatio-temporal region $y$ from a visible region $x$, given the positional information of the masked regions (see Section~\ref{sec:methodology}). To qualitatively investigate the grounding of the feature-space predictions, we freeze the pretrained encoder and predictor networks and train a conditional diffusion decoder to map the \putalg predictions to interpretable pixels. Notably, the decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video (see Figure~\ref{fig:decoder_method}). \begin{figure*}[t!] \centering \begin{subfigure}[b]{\textwidth} \centering \includegraphics[width=0.825\linewidth]{assets/decoder-color.pdf} \caption{ {\bf Visualization Methodology.} We train a conditional diffusion model to decode the \putalg feature-space predictions to interpretable pixels; the pretrained \putalg encoder and predictor networks are kept frozen in this process. The decoder is only fed the representations predicted for the missing regions of the video, and does not have access to the unmasked regions of the video. } \label{fig:decoder_method} \end{subfigure} \vskip 4mm \begin{subfigure}[b]{\textwidth} \centering \includegraphics[width=0.485\linewidth]{assets/samples-v1.png}\quad \includegraphics[width=0.485\linewidth]{assets/samples-v0.png} \caption{ {\bf Visualizations.} {\it First Row:} Masked videos used as input to the \putalg models (a pretrained ViT-H/16 encoder and its corresponding predictor network). {\it Other rows:} Bounding boxes contain various samples from the decoder overlayed on the original video. \putalg is not a generative model and the decoder does not have access to the context (first row), so we do not expect samples to exactly match the input. 
This experiment qualitatively illustrates what information is encoded and predicted by \putalg. In particular, characteristics that are common across samples represent information that is encoded in the \putalg predictions. \putalg generates predictions that are spatially and temporally coherent with the unmasked regions of the video. The predictions also capture consistent motion through time. } \label{fig:prediction-sample} \end{subfigure} \caption{{\it Qualitative Analysis.} Offline visualizations of the \putalg feature-space predictions.} \label{fig:prediction-visualization} \end{figure*}
Given a masked video, we use the \putalg pretrained models to predict the representations of the missing regions, and then use the decoder to project the representations to pixel space. Figure~\ref{fig:prediction-sample} shows decoder outputs for various random seeds. Qualities that are common across samples represent information that is contained in the predictor representation.
Figure~\ref{fig:prediction-sample} shows that the \putalg feature predictions are indeed grounded, and exhibit spatio-temporal consistency with the unmasked regions of the video. Specifically, the samples in Figure~\ref{fig:prediction-sample} show that the \putalg predictor correctly captures positional uncertainty and produces a variety of visual objects at various locations with consistent motion. Some of the samples also demonstrate an understanding of object-permanence, as the visual objects remain consistent after partial occlusion.
\section{Conclusion} In this work, we explored the effectiveness of feature prediction as a stand-alone objective for unsupervised learning from video and introduced \putalg, a collection of vision models trained solely using a self-supervised feature prediction objective. The \putalg models demonstrate the ability to solve various downstream image and video tasks without adaptation of the model parameters, and outperform previous video representation learning approaches in frozen evaluation on action recognition, spatio-temporal action detection, and image classification tasks. Additionally, we show that pretraining \putalg on videos is particularly effective for solving downstream tasks requiring fine-grained motion understanding, while large-scale image models trained on internet-scale datasets fall short on such tasks. Finally, we empirically observed that \putalg models are label-efficient learners, and exhibit good performance on downstream tasks even when only a few labeled examples are available.
\bibliographystyle{assets/plainnat} \bibliography{paper}
\clearpage \newpage
\onecolumn
\beginappendix
\section{Extended Related Works}
We first review approaches for learning visual perception from static images before discussing strategies for learning from video.
\subsection*{Weakly-Supervised Learning from Static Images} One family of approaches for learning visual perception from static images trains a visual encoder to predict the representations of text captions often found accompanying images from the Web, as in CLIP~\citep{radford2021learning} or CoCa~\citep{yu2022coca}. The largest open source CLIP model to date, numbering 2B parameters and trained on over 2B web-scraped images~\citep{cherti2023reproducible}, demonstrates impressive performance on a wide range of downstream image and video tasks. Notably, this is achieved using only the light-weight adaptation of task-specific heads, also referred to as frozen-evaluation, and does not require expensive end-to-end fine-tuning of the pretrained model.
\subsection*{Self-Supervised Learning from Static Images} Other approaches for learning from static images leverage unsupervised objectives. Initial works on self-supervised approaches are based on sparse coding or hand-crafted pretext tasks, such as colorization~\citep{larsson2016learning,larsson2017colorization}, rotation prediction~\citep{gidaris2020learning}, and jigsaws~\citep{noroozi2016unsupervised}. More recent approaches leverage invariance-based objectives by training a visual encoder to be invariant to hand-crafted image transformations~\citep{wu2018unsupervised,chen2020simple}.
Another family of methods learn representations using denoising autoencoders~\citep{denoising_vincent}; image inpainting is one popular instantiation of this idea~\citep{pathak2016context}. More recently, masked autoencoders~\citep{he2021masked} train an encoder-decoder transformer to predict missing pixels of a masked image. Follow-up work addresses the indeterminism of pixel reconstruction by exploring instantiations of masked image modeling in latent space~\citep{baevski2022data2vec,assran2023self,baevski2022efficient}. These approaches can be seen as applications of the predictive feature principle in the image modality.
There are also various methods that combine both masked image modeling and invariance criteria to learn visual representations from static images, such as iBOT~\citep{zhou2021ibotyes} and DINOv2~\citep{oquab2023dinov2}; the latter is currently the most competitive instantiation of self-supervised learning with static images, scaled to a model with over 1.1B parameters trained on a curated dataset of 142M images.
\subsection*{Weakly-Supervised Learning from Videos} One family of approaches for learning visual perception from videos relies on weakly-supervised guidance from closed captioning, often computed from an ASR transcription of audio data accompanying internet videos. For instance, VideoBERT~\citep{sun2019videobert} trains a video encoder to predict masked spans in the textual closed captions. Similarly, VideoCLIP~\citep{xu2021videoclip} trains a video encoder to predict the representation of video captions computed by a text encoder. Follow-up work such as MERLOT~\citep{zellers2022merlot}, VATT~\citep{akbari2021vatt}, and InternVideo~\citep{wang2022internvideo} extended VideoCLIP by incorporating additional unsupervised objectives.
\subsection*{Self-Supervised Learning from Videos} Similar to unsupervised learning from images, a family of unsupervised video representation learning approaches enforces a spatio-temporal representation of a video clip to be invariant to hand-crafted spatio-temporal data augmentations~\citep{parthasarathy2022self}. However, the temporal ordering of visual information in video provides implicit supervision, and this insight is leveraged by many works on unsupervised video learning. Towards leveraging temporal information as supervision, some approaches train a visual encoder by predicting the temporal ordering of frames~\citep{xu2019self, lee2017unsupervised}. Other approaches seek to predict low-level motion vectors computed from optical flow~\citep{pintea2014deja}, or to predict missing pixels in video frames, using either a frame-interpolation objective~\citep{kalluri2023flavr} or a denoising autoencoder~\citep{tong2022videomae, feichtenhofer2022masked, wang2023videomae}.
\section{Extended Description of V-JEPA} \label{appendix:vjepa_extended_description}
In this section, we provide an in-depth description of our approach \putalg that is illustrated in Figure~\ref{fig:vjepa-complex}.
\paragraph{\bf Input.} Unless stated otherwise, during pretraining we always randomly sample a clip of 16 frames from each input video with a temporal stride of 4 between sampled frames. An input video clip therefore covers 64 frames in total, or roughly 2 seconds of a given video running at 30 frames per second. We then resize the video's spatial dimensions to $224 \times 224$, resulting in an overall shape of $16 \times 224 \times 224 \times 3$ for the entire clip. Since ViT networks process a 1D sequence of tokens, we must convert an input video clip into a 1D token sequence. To do so, we apply a 3D convolution comprising $d$ filters of size $2 \times 16 \times 16$ with a temporal stride of $2$ and a spatial stride of $16$, resulting in a tensor of shape $8 \times 14 \times 14 \times d$. Next we add absolute 3D sin-cos positional embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape $1568 \times d$. This process is demonstrated in Figure~\ref{fig:patchitfy}. \begin{figure}[h] \centering \includegraphics[width=0.9\linewidth]{assets/patchify.pdf} \caption{\small{\bf \putalg} training operates on a video clip flattened into a sequence of tokens. To convert a video clip of size $16 \times 224 \times 224 \times 3$ into a 1D token sequence, we apply a 3D convolution comprising $d$ filters of size $2 \times 16 \times 16$ with a temporal stride of $2$ and a spatial stride of $16$, resulting in a tensor of shape $8 \times 14 \times 14 \times d$. Next we add absolute 3D sin-cos positional embeddings to the spatio-temporal feature map and flatten it, resulting in a 1D token sequence of shape $1568 \times d$.} \label{fig:patchitfy} \end{figure}
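The shape bookkeeping in the patchify step can be verified with a few lines of arithmetic (the model itself applies a learned 3D convolution; this only checks the resulting token counts):

```python
# Token-count arithmetic for the patchify step described above.
frames, height, width = 16, 224, 224
tubelet, patch = 2, 16          # temporal and spatial strides of the 3D conv

t_tokens = frames // tubelet    # 16 / 2  = 8 temporal positions
h_tokens = height // patch      # 224 / 16 = 14 vertical positions
w_tokens = width // patch       # 224 / 16 = 14 horizontal positions

assert (t_tokens, h_tokens, w_tokens) == (8, 14, 14)
# Flattening the 8 x 14 x 14 feature map yields the 1D token sequence length.
assert t_tokens * h_tokens * w_tokens == 1568
```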
\paragraph{\bf \putalg.} In each iteration, we sample both a video clip and a video mask. We denote a video clip represented as a 1D token sequence of length $L=1568$ by $x_{L} = (x_1, \ldots, x_L)$. Similarly, given a mask of $M < L$ patches, leaving $N=L-M$ patches unmasked, we denote the indices of masked patches by $(i_1, \ldots, i_M)$ and its complement (the indices of unmasked patches) by $(j_1, \ldots, j_{N})$.
{\bf \it Computing the $x$-representations.} To compute the \putalg loss, we first produce the $x$-representations by masking the video clip and feeding it into the $x$-encoder; we denote the masked video by $x_N = (x_{j_1}, \ldots, x_{j_N})$. Applying the $x$-encoder $E_\theta(\cdot)$ to the masked clip gives a sequence of patch representations, denoted as $z_N = E_{\theta}(x_N) = (z_{j_1}, \ldots, z_{j_N})$.
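The index bookkeeping above amounts to a simple partition of the token sequence; a minimal sketch, assuming an illustrative 90% mask ratio rather than the actual multi-block sampler:

```python
import random

L = 1568                        # tokens per clip (8 x 14 x 14)
M = int(0.9 * L)                # assumed mask ratio, for illustration only

indices = list(range(L))
random.shuffle(indices)
masked = sorted(indices[:M])    # (i_1, ..., i_M)
unmasked = sorted(indices[M:])  # (j_1, ..., j_N): the only tokens the x-encoder sees

# The two index sets partition the full sequence.
assert len(unmasked) == L - M
assert set(masked) | set(unmasked) == set(range(L))
```

Since the $x$-encoder operates only on the $N$ unmasked tokens, the cost of the forward pass shrinks with the mask ratio.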
| Target | Arch. | Frozen K400 (16 × 1 × 1) | Frozen SSv2 (16 × 1 × 1) | Frozen IN1K | Fine-tuned K400 (16 × 5 × 3) |
|---|---|---|---|---|---|
| Pixels | ViT-L/16 | 68.6 | 66.0 | 73.3 | 85.4 |
| Features | ViT-L/16 | 73.7 | 66.2 | 74.8 | 85.6 |
| Arch. | Data | #Samples | Frozen K400 (16 × 1 × 1) | Frozen SSv2 (16 × 1 × 1) | Frozen IN1K | Avg. |
|---|---|---|---|---|---|---|
| ViT-L/16 | K710 | 700K | 75.8 | 63.2 | 73.7 | 70.9 |
| ViT-L/16 | K710+SSv2 | 900K | 72.9 | 67.4 | 72.8 | 71.0 |
| ViT-L/16 | K710+HT | 1900K | 74.5 | 64.2 | 74.8 | 71.1 |
| ViT-L/16 | VideoMix2M | 2000K | 73.7 | 66.2 | 74.8 | 71.5 |
| ViT-H/16 | K710+SSv2 | 900K | 75.7 | 66.8 | 73.7 | 72.0 |
| ViT-H/16 | VideoMix2M | 2000K | 74.0 | 68.5 | 75.9 | 72.8 |
| Method | Arch. | K400 Avg. | K400 Att. | SSv2 Avg. | SSv2 Att. |
|---|---|---|---|---|---|
| V-JEPA | ViT-L/16 | 56.7 | 73.7 | 50.1 | 66.2 |
| Masking | Frozen K400 (16 × 1 × 1) | Frozen SSv2 (16 × 1 × 1) | Frozen IN1K |
|---|---|---|---|
| random-tube[0.9] | 51.5 | 46.4 | 55.6 |
| causal multi-block[6] | 61.3 | 49.8 | 66.9 |
| causal multi-block[12] | 71.9 | 63.6 | 72.2 |
| multi-block | 72.9 | 67.4 | 72.8 |
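A multi-block mask of the kind ablated above can be sampled as a union of rectangular spatial blocks repeated across time. The following is an illustrative sketch, assuming the short-range parameters from the pretraining hyper-parameter table (8 blocks, spatial scale 0.15, aspect ratio in (0.75, 1.5)); it is not the exact implementation:

```python
import math
import random

def sample_block_mask(h=14, w=14, num_blocks=8, spatial_scale=0.15,
                      aspect_ratio=(0.75, 1.5)):
    """Union of `num_blocks` rectangular blocks on the h x w spatial grid.
    The same spatial mask is applied at every timestep of the clip."""
    masked = set()
    for _ in range(num_blocks):
        ar = random.uniform(*aspect_ratio)
        area = spatial_scale * h * w
        bh = min(h, max(1, round(math.sqrt(area * ar))))   # block height
        bw = min(w, max(1, round(math.sqrt(area / ar))))   # block width
        top, left = random.randint(0, h - bh), random.randint(0, w - bw)
        masked |= {(r, c) for r in range(top, top + bh)
                          for c in range(left, left + bw)}
    return masked

m = sample_block_mask()
assert 0 < len(m) <= 14 * 14
```

Because the blocks overlap, the union typically covers less area than `num_blocks * spatial_scale` of the grid, which is why the effective mask ratio is a random variable rather than a fixed fraction.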
| Method | Arch. | #Samples Seen | Iter. | K400 (16 × 8 × 3) | SSv2 (16 × 2 × 3) | AVA | IN1K | Places205 | iNat21 | K400-ft (16 × 5 × 3) | SSv2-ft (16 × 2 × 3) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Methods pretrained using pixel prediction | | | | | | | | | | | |
| OmniMAE | ViT-L/16 | 2400M | 1170K | 65.6 | 60.6 | 14.4 | 75.1 | 59.8 | 66.1 | 84.0 | 74.2 |
| VideoMAE | ViT-L/16 | 410M | 400K | 77.8 | 65.5 | 21.6 | 71.1 | 59.3 | 64.6 | 85.4 | 74.3 |
| Hiera | Hiera-L | 770M | 1500K | 75.5 | 64.2 | 15.8 | 68.9 | 58.5 | 56.9 | 87.3 | 75.1 |
| V-JEPA | ViT-L/16 | 270M | 90K | 80.8 | 69.5 | 25.6 | 74.8 | 60.3 | 67.8 | 85.6 | 75.1 |
| Method | Arch. | Params. | Data | K400 (16 × 8 × 3) | SSv2 (16 × 2 × 3) | AVA | IN1K | Places205 | iNat21 |
|---|---|---|---|---|---|---|---|---|---|
| Methods pretrained on Images | | | | | | | | | |
| I-JEPA | ViT-H/16 512 | 630M | IN22K | 79.7 | 50.0 | 19.8 | 84.4 | 66.5 | 85.7 |
| OpenCLIP | ViT-G/14 | 1800M | LAION | 81.8 | 34.8 | 23.2 | 85.3 | 70.2 | 83.6 |
| DINOv2 | ViT-g/14 | 1100M | LVD-142M | 83.4 | 50.6 | 24.3 | 86.2 | 68.4 | 88.8 |
| Methods pretrained on Videos | | | | | | | | | |
| MVD | ViT-L/16 | 200M | IN1K+K400 | 79.4 | 66.5 | 19.7 | 73.3 | 59.4 | 65.7 |
| OmniMAE | ViT-H/16 | 630M | IN1K+SSv2 | 71.4 | 65.4 | 16.0 | 76.3 | 60.6 | 72.4 |
| VideoMAE | ViT-H/16 | 630M | K400 | 79.8 | 66.2 | 20.7 | 72.3 | 59.1 | 65.5 |
| VideoMAEv2 | ViT-g/14 | 1100M | Un.Hybrid | 71.2 | 61.2 | 12.9 | 71.4 | 60.6 | 68.3 |
| Hiera | Hiera-H | 670M | K400 | 77.0 | 64.7 | 17.5 | 71.4 | 59.5 | 61.7 |
| V-JEPA | ViT-L/16 | 200M | VideoMix2M | 80.8 | 69.5 | 25.6 | 74.8 | 60.3 | 67.8 |
| V-JEPA | ViT-H/16 | 630M | VideoMix2M | 82.0 | 71.4 | 25.8 | 75.9 | 61.7 | 67.9 |
| V-JEPA | ViT-H/16 384 | 630M | VideoMix2M | 81.9 | 72.2 | 25.0 | 77.4 | 62.8 | 72.6 |
Frozen evaluation; K400 at 16 × 8 × 3, SSv2 at 16 × 2 × 3.

| Method | Arch. | K400 5% (∼29 / class) | K400 10% (∼58 / class) | K400 50% (∼287 / class) | SSv2 5% (∼48 / class) | SSv2 10% (∼96 / class) | SSv2 50% (∼440 / class) |
|---|---|---|---|---|---|---|---|
| MVD | ViT-L/16 | 62.6 ± 0.2 | 68.3 ± 0.2 | 77.2 ± 0.3 | 42.9 ± 0.8 | 49.5 ± 0.6 | 61.0 ± 0.2 |
| VideoMAE | ViT-H/16 | 62.3 ± 0.3 | 68.5 ± 0.2 | 78.2 ± 0.1 | 41.4 ± 0.8 | 48.1 ± 0.2 | 60.5 ± 0.4 |
| VideoMAEv2 | ViT-g/14 | 37.0 ± 0.3 | 48.8 ± 0.4 | 67.8 ± 0.1 | 28.0 ± 1.0 | 37.3 ± 0.3 | 54.0 ± 0.3 |
| V-JEPA | ViT-H/16 | 67.0 ± 0.2 | 72.1 ± 0.1 | 80.2 ± 0.2 | 51.9 ± 0.3 | 57.5 ± 0.4 | 67.3 ± 0.2 |
| V-JEPA | ViT-H/16 384 | 68.2 ± 0.2 | 72.8 ± 0.2 | 80.6 ± 0.2 | 54.0 ± 0.2 | 59.3 ± 0.5 | 67.9 ± 0.2 |
| Hyper-parameter | ViT-L/16 224 | ViT-H/16 224 | ViT-H/16 384 |
|---|---|---|---|
| data | |||
| datasets | VideoMix2M | VideoMix2M | VideoMix2M |
| resolution | 224 | 224 | 384 |
| num_frames | 16 | 16 | 16 |
| temporal_stride | 4 | 4 | 4 |
| horizontal_flip | true | true | true |
| random_resize_scale | (0.3, 1.0) | (0.3, 1.0) | (0.3, 1.0) |
| random_resize_aspect_ratio | (0.75, 1.35) | (0.75, 1.35) | (0.75, 1.35) |
| masking | |||
| block_aspect_ratio | (0.75, 1.5) | (0.75, 1.5) | (0.75, 1.5) |
| shortrange_mask_num_blocks | 8 | 8 | 8 |
| shortrange_mask_spatial_scale | 0.15 | 0.15 | 0.15 |
| longrange_mask_num_blocks | 2 | 2 | 2 |
| longrange_mask_spatial_scale | 0.7 | 0.7 | 0.7 |
| optimization | |||
| batch_size | 3072 | 3072 | 2400 |
| total_number_of_iterations | 90000 | 90000 | 90000 |
| warmup_iterations | 12000 | 12000 | 12000 |
| lr | 6.25e-4 | 6.25e-4 | 6.25e-4 |
| start_lr | 2e-4 | 2e-4 | 2e-4 |
| final_lr | 1e-6 | 1e-6 | 1e-6 |
| start_momentum | 0.998 | 0.998 | 0.998 |
| final_momentum | 1.0 | 1.0 | 1.0 |
| start_weight_decay | 0.04 | 0.04 | 0.04 |
| final_weight_decay | 0.4 | 0.4 | 0.4 |
| scheduler_scale_factor | 1.25 | 1.25 | 1.25 |
| architecture | |||
| patch_size | 16 | 16 | 16 |
| tubelet_size | 2 | 2 | 2 |
| pred_depth | 12 | 12 | 12 |
| pred_embed_dim | 384 | 384 | 384 |
| hardware | |||
| dtype | bfloat16 | bfloat16 | bfloat16 |
| accelerator | A100 80G | A100 80G | A100 80G |
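One plausible reading of the optimization hyper-parameters above (linear warmup from `start_lr` to `lr`, then cosine decay toward `final_lr`, with `scheduler_scale_factor = 1.25` stretching the cosine so its endpoint lies beyond the final training iteration) is the following sketch; the exact scheduler implementation may differ:

```python
import math

def schedule(step, total=90_000, warmup=12_000, base_lr=6.25e-4,
             start_lr=2e-4, final_lr=1e-6, scale=1.25):
    """Linear warmup, then cosine decay stretched by `scale` so the
    schedule never fully reaches `final_lr` within `total` steps."""
    if step < warmup:
        # Linear warmup from start_lr to base_lr.
        return start_lr + (base_lr - start_lr) * step / warmup
    # Cosine decay over a horizon of scale * total steps.
    progress = (step - warmup) / (scale * total - warmup)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

Under this reading, training stops at `total` steps while the cosine is still descending, so the learning rate at the last iteration remains strictly above `final_lr`. The momentum and weight-decay rows would follow analogous interpolations between their start and final values.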
| Hyper-parameter | K400 | SSv2 | IN1K | Place205 | iNat21 |
|---|---|---|---|---|---|
| data | |||||
| num_clips | 8 | 1 | N.A. | N.A. | N.A. |
| num_frames | 16 | 16 | N.A. | N.A. | N.A. |
| temporal_stride | 4 | 4 | N.A. | N.A. | N.A. |
| horizontal_flip | true | true | true | true | true |
| random_resize_scale | (0.08, 1.0) | (0.08, 1.0) | (0.08, 1.0) | (0.08, 1.0) | (0.08, 1.0) |
| random_resize_aspect_ratio | (0.75, 1.33) | (0.75, 1.33) | (0.75, 1.33) | (0.75, 1.33) | (0.75, 1.33) |
| auto_augment | false | false | true | true | true |
| optimization | |||||
| batch_size | 256 | 256 | 1024 | 1024 | 1024 |
| epochs | 20 | 20 | 20 | 20 | 20 |
| lr | 1e-3 | 1e-3 | 1e-3 | 1e-3 | 1e-3 |
| final_lr | 0 | 0 | 0 | 0 | 0 |
| weight_decay | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| Hyper-parameter | ViT-L/16 | ViT-H/16 |
|---|---|---|
| out_layers | [18, 20, 22, 24] | [26, 28, 30, 32] |
| batch_size | 64 | 64 |
| epochs | 30 | 30 |
| opt | AdamW | AdamW |
| opt_eps | 1e-8 | 1e-8 |
| momentum | 0.9 | 0.9 |
| weight_decay | 0.05 | 0.05 |
| lr | 1e-4 | 1e-4 |
| warmup_lr | 1e-6 | 1e-6 |
| min_lr | 1e-6 | 1e-6 |
| warmup_epochs | 2 | 2 |
| warmup_steps | 1 | 1 |
| Hyper-parameter | K400 ViT-L/16 | K400 ViT-H/16 | SSv2 ViT-L/16 | SSv2 ViT-H/16 |
|---|---|---|---|---|
| data | | | | |
| num_segments | 1 | 1 | 1 | 1 |
| num_frames | 16 | 16 | 16 | 16 |
| sampling_rate | 4 | 4 | 4 | 4 |
| resolution | 224 | 224 | 224 | 224 |
| model | | | | |
| drop_path | 0.1 | 0.2 | 0.2 | 0.2 |
| head_drop_rate | 0.0 | 0.0 | 0.5 | 0.5 |
| optimization | | | | |
| batch_size | 256 | 1024 | 256 | 256 |
| epochs | 35 | 25 | 15 | 15 |
| opt | adamw | adamw | adamw | adamw |
| opt_eps | 1e-8 | 1e-8 | 1e-8 | 1e-8 |
| momentum | 0.9 | 0.9 | 0.9 | 0.9 |
| weight_decay | 0.05 | 0.05 | 0.05 | 0.05 |
| lr | 0.002 | 0.0005 | 0.0005 | 0.0005 |
| layer_decay | 0.75 | 0.75 | 0.75 | 0.75 |
| warmup_lr | 1e-6 | 1e-8 | 1e-6 | 1e-6 |
| min_lr | 1e-6 | 1e-5 | 1.5e-4 | 1.5e-3 |
| warmup_epochs | 5 | 5 | 5 | 5 |
| augmentations | | | | |
| color_jitter | 0.4 | 0.4 | 0.4 | 0.4 |
| horizontal_flip | True | True | False | False |
| num_sample | 2 | 2 | 2 | 2 |
| aa | rand-m7-n4-mstd0.5-inc1 | rand-m7-n4-mstd0.5-inc1 | rand-m7-n4-mstd0.5-inc1 | rand-m7-n4-mstd0.5-inc1 |
| smoothing | 0.1 | 0.1 | 0.1 | 0.1 |
| train_interpolation | bicubic | bicubic | bicubic | bicubic |
| test_num_segment | 5 | 5 | 2 | 2 |
| test_num_crop | 3 | 3 | 3 | 3 |
| erase | | | | |
| prob | 0.25 | 0.25 | 0.25 | 0.25 |
| mode | pixel | pixel | pixel | pixel |
| count | 1 | 1 | 1 | 1 |
| split | False | False | False | False |
| mixup | | | | |
| mixup | 0.8 | 0.8 | 0.8 | 0.8 |
| cutmix | 1.0 | 1.0 | 1.0 | 1.0 |
| mixup_prob | 1.0 | 1.0 | 1.0 | 1.0 |
| mixup_switch_prob | 0.5 | 0.5 | 0.5 | 0.5 |
| mixup_mode | batch | batch | batch | batch |
| Method | Arch. | K400 Lin. | K400 Att. | SSv2 Lin. | SSv2 Att. |
|---|---|---|---|---|---|
| VideoMAE | ViT-L/16 | 52.5 | 77.8 | 41.3 | 61.2 |
| V-JEPA | ViT-L/16 | 56.7 | 80.8 | 50.1 | 69.5 |
| Method | Arch. | K400 Lin. | K400 Att. | SSv2 Lin. | SSv2 Att. | IN1K Lin. | IN1K Att. | Place205 Lin. | Place205 Att. | iNat21 Lin. | iNat21 Att. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DINOv2 | ViT-g/14 | 78.4 | 83.4 | 38.3 | 50.0 | 86.5 | 86.2 | 67.5 | 68.4 | 85.7 | 88.8 |
| OpenCLIP | ViT-G/14 | 78.3 | 81.8 | 35.8 | 34.8 | 86.2 | 85.3 | 69.8 | 70.2 | 76.0 | 83.6 |
| Method | Arch. | 1 Clip | 8 Clips |
|---|---|---|---|
| VideoMAE | ViT-L/16 | 69.4 | 77.8 |
| V-JEPA | ViT-L/16 | 73.7 | 80.9 |
| Method | Arch. | Pretraining Data | #Samples Seen | K400 (16 × 5 × 3) | SSv2 (16 × 2 × 3) |
|---|---|---|---|---|---|
| VideoMAEv1 | ViT-L/16 | K400 / SSv2 | 380M / 410M | 85.4 | 74.3 |
| VideoMAEv1 | ViT-H/16 | K400 / SSv2 | 380M / 410M | 86.6 | 74.8 |
| VideoMAEv2 | ViT-H/16 | Un.Hybrid | 1600M | 86.9 | 76.8 |
| MVD | ViT-L/16 | K400+IN1K | 2400M | 86.4 | 76.7 |
| MVD | ViT-H/16 | K400+IN1K | 2400M | 87.2 | 77.3 |
| V-JEPA | ViT-L/16 | VideoMix2M | 270M | 85.6 | 75.1 |
| V-JEPA | ViT-H/16 | VideoMix2M | 270M | 86.6 | 77.0 |
| Method | Arch. | Data | #Samples Seen |
|---|---|---|---|
| OpenCLIP | ViT-G/14 | LAION-2B | 39000M |
| DINOv2 | ViT-g/14 | LVD 142M | 1900M |
| VideoMAEv2 | ViT-g/14 | UnlabeledHybrid | 1600M |
| V-JEPA | ViT-H/16 384 | VideoMix2M | 210M |


References
[todo] todo. (2023). todo.
[cubuk2019auto] Dogus Cubuk, Ekin, Zoph, Barret, Mane, Dandelion andVasudevan, Vijay, V. Le, Quoc. (2019). AutoAugment: Learning Augmentation Policies from Data. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[arnab2021vivit] Arnab, Anurag, Dehghani, Mostafa, Heigold, Georg, Sun, Chen, Lucic, Mario, Schmid, Cordelia. (2021). ViViT: A Video Vision Transformer. Proceedings of the IEEE international conference on computer vision.
[bardes2023mc] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2023). MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features. arXiv preprint arXiv:2307.12698.
[li2022exploring] Li, Yanghao, Mao, Hanzi, Girshick, Ross, He, Kaiming. (2022). Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527.
[bardes2022vicregl] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2022). VICRegL: Self-Supervised Learning of Local Visual Features. arXiv preprint arXiv:2210.01571.
[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron. (2016). Deep learning.
[arora2019theoretical] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.
[bridle1991unsupervised] Bridle, John, Heading, Anthony, MacKay, David. (1991). Unsupervised classifiers, mutual information and'phantom targets. Advances in neural information processing systems.
[zha2001spectral] Zha, Hongyuan, He, Xiaofeng, Ding, Chris, Gu, Ming, Simon, Horst D. (2001). Spectral relaxation for k-means clustering. NeurIPS.
[hornik2012spherical] Hornik, Kurt, Feinerer, Ingo, Kober, Martin, Buchta, Christian. (2012). Spherical k-means clustering. Journal of statistical software.
[park2009simple] Park, Hae-Sang, Jun, Chi-Hyuck. (2009). A simple and fast algorithm for K-medoids clustering. Expert systems with applications.
[van2008visualizing] Van der Maaten, Laurens, Hinton, Geoffrey. (2008). Visualizing data using t-SNE.. Journal of machine learning research.
[wang2010learning] Wang, Fei, Li, Ping, Konig, Arnd Christian. (2010). Learning a bi-stochastic data similarity matrix. 2010 IEEE International Conference on Data Mining.
[meilua2006uniqueness] Meil{\u{a. (2006). The uniqueness of a good optimum for k-means. Proceedings of the 23rd international conference on Machine learning.
[wu2009adapting] Wu, Junjie, Xiong, Hui, Chen, Jian. (2009). Adapting the right measures for k-means clustering. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining.
[liang2012k] Liang, Jiye, Bai, Liang, Dang, Chuangyin, Cao, Fuyuan. (2012). The $ K $-means-type algorithms versus imbalanced data distributions. IEEE Transactions on Fuzzy Systems.
[rujeerapaiboon2019size] Rujeerapaiboon, Napat, Schindler, Kilian, Kuhn, Daniel, Wiesemann, Wolfram. (2019). Size matters: Cardinality-constrained clustering and outlier detection via conic optimization. SIAM J. Optimization.
[bradley2000constrained] Bradley, Paul S, Bennett, Kristin P, Demiriz, Ayhan. (2000). Constrained k-means clustering. Microsoft Research, Redmond.
[kleindessner2019fair] Kleindessner, Matth{. (2019). Fair k-center clustering for data summarization. ICML.
[bordia2019identifying] Bordia, Shikha, Bowman, Samuel R. (2019). Identifying and reducing gender bias in word-level language models. arXiv preprint arXiv:1904.03035.
[buolamwini2018gender] Buolamwini, Joy, Gebru, Timnit. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Conference on Fairness, Accountability and Transparency.
[ma2022principles] Ma, Yi, Tsao, Doris, Shum, Heung-Yeung. (2022). On the principles of Parsimony and Self-consistency for the emergence of intelligence. Frontiers of Information Technology & Electronic Engineering.
[wiener2019cybernetics] Wiener, Norbert. (2019). Cybernetics or Control and Communication in the Animal and the Machine.
[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[krause2010discriminative] Krause, Andreas, Perona, Pietro, Gomes, Ryan. (2010). Discriminative clustering by regularized information maximization. Advances in neural information processing systems.
[paszke2019pytorch] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, others. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.
[henaff2020data] Henaff, Olivier. (2020). Data-efficient image recognition with contrastive predictive coding. International conference on machine learning.
[hu2017learning] Hu, Weihua, Miyato, Takeru, Tokui, Seiya, Matsumoto, Eiichi, Sugiyama, Masashi. (2017). Learning discrete representations via information maximizing self-augmented training. International conference on machine learning.
[linsker1988self] Linsker, Ralph. (1988). Self-organization in a perceptual network. Computer.
[tschannen2019mutual] Tschannen, Michael, Djolonga, Josip, Rubenstein, Paul K, Gelly, Sylvain, Lucic, Mario. (2019). On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625.
[lake2011one] Lake, Brenden, Salakhutdinov, Ruslan, Gross, Jason, Tenenbaum, Joshua. (2011). One shot learning of simple visual concepts. Proceedings of the annual meeting of the cognitive science society.
[salakhutdinov2007learning] Salakhutdinov, Ruslan, Hinton, Geoff. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. Artificial Intelligence and Statistics.
[boden1980jean] Boden, Margaret A. (1980). Jean Piaget.
[piaget1964cognitive] Piaget, Jean. (1964). Cognitive development in children: Piaget. Journal of research in science teaching.
[boden1978artificial] Boden, Margaret A. (1978). Artificial intelligence and Piagetian theory. Synthese.
[bruner1961individual] Bruner, Jerome S. (1961). Reply to Individual and collective problems in the study of thinking. Annals of the New York Academy of Sciences.
[piaget1971biology] Piaget, Jean. (1971). Biology and knowledge: An essay on the relations between organic regulations and cognitive processes..
[grandvalet2006entropy] Grandvalet, Yves, Bengio, Yoshua. (2006). Entropy regularization. Semi-supervised learning.
[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. preprint arXiv:2002.05709.
[chen2020big] Chen, Ting, Kornblith, Simon, Swersky, Kevin, Norouzi, Mohammad, Hinton, Geoffrey. (2020). Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029.
[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altch{'e. (2020). Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.
[caron2020unsupervised] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882.
[assran2020recovering] Assran, Mahmoud, Ballas, Nicolas, Castrejon, Lluis, Rabbat, Michael. (2020). Recovering Petaflops in Contrastive Semi-Supervised Learning of Visual Representations. arXiv preprint arXiv:2006.10803.
[vinyals2016matching] Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, Wierstra, Daan. (2016). Matching networks for one shot learning. arXiv preprint arXiv:1606.04080.
[snell2017prototypical] Snell, Jake, Swersky, Kevin, Zemel, Richard S. (2017). Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175.
[ravi2016optimization] Ravi, Sachin, Larochelle, Hugo. (2016). Optimization as a model for few-shot learning.
[lake2017building] Lake, Brenden M, Ullman, Tomer D, Tenenbaum, Joshua B, Gershman, Samuel J. (2017). Building machines that learn and think like people. Behavioral and brain sciences.
[russakovsky2015imagenet] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., Fei-Fei, Li. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision.
[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[you2017large] You, Yang, Gitman, Igor, Ginsburg, Boris. (2017). Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
[sutskever2013importance] Sutskever, Ilya, Martens, James, Dahl, George, Hinton, Geoffrey. (2013). On the importance of initialization and momentum in deep learning. International conference on machine learning.
[xie2019unsupervised] Xie, Qizhe, Dai, Zihang, Hovy, Eduard, Luong, Minh-Thang, Le, Quoc V. (2019). Unsupervised data augmentation. arXiv preprint arXiv:1904.12848.
[sohn2020fixmatch] Sohn, Kihyuk, Berthelot, David, Li, Chun-Liang, Zhang, Zizhao, Carlini, Nicholas, Cubuk, Ekin D, Kurakin, Alex, Zhang, Han, Raffel, Colin. (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.
[pham2020meta] Pham, Hieu, Xie, Qizhe, Dai, Zihang, Le, Quoc V. (2020). Meta pseudo labels. arXiv preprint arXiv:2003.10580.
[wu2018unsupervised] Wu, Zhirong, Xiong, Yuanjun, Yu, Stella X, Lin, Dahua. (2018). Unsupervised feature learning via non-parametric instance discrimination. Proceedings of the IEEE conference on computer vision and pattern recognition.
[misra2020self] Misra, Ishan, van der Maaten, Laurens. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[ren2018meta] Ren, Mengye, Triantafillou, Eleni, Ravi, Sachin, Snell, Jake, Swersky, Kevin, Tenenbaum, Joshua B, Larochelle, Hugo, Zemel, Richard S. (2018). Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676.
[he2019moco] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick. (2019). Momentum Contrast for Unsupervised Visual Representation Learning. arXiv preprint arXiv:1911.05722.
[chen2020mocov2] Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He. (2020). Improved Baselines with Momentum Contrastive Learning. arXiv preprint arXiv:2003.04297.
[hsu2018unsupervised] Hsu, Kyle, Levine, Sergey, Finn, Chelsea. (2018). Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334.
[chen2020exploring] Chen, Xinlei, He, Kaiming. (2020). Exploring Simple Siamese Representation Learning. arXiv preprint arXiv:2011.10566.
[loshchilov2016sgdr] Loshchilov, Ilya, Hutter, Frank. (2016). {SGDR. arXiv preprint arXiv:1608.03983.
[khosla2020supervised] Khosla, Prannay, Teterwak, Piotr, Wang, Chen, Sarna, Aaron, Tian, Yonglong, Isola, Phillip, Maschinot, Aaron, Liu, Ce, Krishnan, Dilip. (2020). Supervised Contrastive Learning. arXiv preprint arXiv:2004.11362.
[miyato2018virtual] Miyato, Takeru, Maeda, Shin-ichi, Koyama, Masanori, Ishii, Shin. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence.
[verma2019interpolation] Verma, Vikas, Kawaguchi, Kenji, Lamb, Alex, Kannala, Juho, Bengio, Yoshua, Lopez-Paz, David. (2019). Interpolation Consistency Training for Semi-Supervised Learning. arXiv preprint arXiv:1903.03825.
[zhai2019s4l] Zhai, Xiaohua, Oliver, Avital, Kolesnikov, Alexander, Beyer, Lucas. (2019). S4l: Self-supervised semi-supervised learning. Proceedings of the IEEE international conference on computer vision.
[lee2013pseudo] Lee, Dong-Hyun. (2013). Pseudo-Label: The simple and efficient semi-supervised learning method for deep neural networks. In International Conference on Machine Learning Workshop.
[scudder1965probability] Scudder, H.. (1965). Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory.
[riloff1996automatically] Riloff, Ellen. (1996). Automatically generating extraction patterns from untagged text. In Proceedings of the National Conference on Artificial Intelligence.
[berthelot2019mixmatch] Berthelot, David, Carlini, Nicholas, Goodfellow, Ian, Papernot, Nicolas, Oliver, Avital, Raffel, Colin A. (2019). Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems.
[berthelot2019remixmatch] Berthelot, David, Carlini, Nicholas, Cubuk, Ekin D, Kurakin, Alex, Sohn, Kihyuk, Zhang, Han, Raffel, Colin. (2019). ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring. arXiv preprint arXiv:1911.09785.
[yarowsky1995unsupervised] Yarowsky, David. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics.
[asano2019self] Asano, Yuki Markus, Rupprecht, Christian, Vedaldi, Andrea. (2019). Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371.
[zoph2020rethinking] Zoph, Barret, Ghiasi, Golnaz, Lin, Tsung-Yi, Cui, Yin, Liu, Hanxiao, Cubuk, Ekin D, Le, Quoc V. (2020). Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882.
[xie2020self] Xie, Qizhe, Luong, Minh-Thang, Hovy, Eduard, Le, Quoc V. (2020). Self-training with noisy student improves imagenet classification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[tarvainen2017mean] Tarvainen, Antti, Valpola, Harri. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780.
[el2021large] El-Nouby, Alaaeldin, Izacard, Gautier, Touvron, Hugo, Laptev, Ivan, Jegou, Herv{'e. (2021). Are Large-scale Datasets Necessary for Self-Supervised Pre-training?. arXiv preprint arXiv:2112.10740.
[mitrovic2020representation] Mitrovic, Jovana, McWilliams, Brian, Walker, Jacob, Buesing, Lars, Blundell, Charles. (2020). Representation learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922.
[assran2020supervision] Assran, Mahmoud, Ballas, Nicolas, Castrejon, Lluis, Rabbat, Michael. (2020). Supervision accelerates pre-training in contrastive semi-supervised learning of visual representations. arXiv preprint arXiv:2006.10803.
[joulin2012convex] Joulin, Armand, Bach, Francis. (2012). A convex relaxation for weakly supervised classifiers. arXiv preprint arXiv:1206.6413.
[laine2016temporal] Laine, Samuli, Aila, Timo. (2016). Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
[jackson2019semi] Jackson, Jacob, Schulman, John. (2019). Semi-supervised learning by label gradient alignment. arXiv preprint arXiv:1902.02336.
[wang2019enaet] Wang, Xiao, Kihara, Daisuke, Luo, Jiebo, Qi, Guo-Jun. (2019). Enaet: Self-trained ensemble autoencoding transformations for semi-supervised learning. arXiv preprint arXiv:1911.09265.
[krizhevsky2009learning] Krizhevsky, Alex, Hinton, Geoffrey, others. (2009). Learning multiple layers of features from tiny images.
[zagoruyko2016wide] Zagoruyko, Sergey, Komodakis, Nikos. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
[thomee2016yfcc100m] Thomee, Bart, Shamma, David A, Friedland, Gerald, Elizalde, Benjamin, Ni, Karl, Poland, Douglas, Borth, Damian, Li, Li-Jia. (2016). YFCC100M: The new data in multimedia research. Communications of the ACM.
[zhang2017mixup] Zhang, Hongyi, Cisse, Moustapha, Dauphin, Yann N, Lopez-Paz, David. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
[yun2019cutmix] Yun, Sangdoo, Han, Dongyoon, Oh, Seong Joon, Chun, Sanghyuk, Choe, Junsuk, Yoo, Youngjoon. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[cubuk2019autoaugment] Cubuk, Ekin D, Zoph, Barret, Mane, Dandelion, Vasudevan, Vijay, Le, Quoc V. (2019). Autoaugment: Learning augmentation strategies from data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[blum1998combining] Blum, Avrim, Mitchell, Tom. (1998). Combining labeled and unlabeled data with co-training. Proceedings of the eleventh annual conference on Computational learning theory.
[berman2019multigrain] Berman, Maxim, J{'e. (2019). Multigrain: a unified image embedding for classes and instances. arXiv preprint arXiv:1902.05509.
[dosovitskiy2020image] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, J{'e. (2021). Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294.
[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, {\L. (2017). Attention is all you need. Advances in neural information processing systems.
[bahdanau2014neural] Bahdanau, Dzmitry, Cho, Kyunghyun, Bengio, Yoshua. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[baevski2022data2vec] Baevski, Alexei, Hsu, Wei-Ning, Xu, Qiantong, Babu, Arun, Gu, Jiatao, Auli, Michael. (2022). Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555.
[bromley1993signature] Bromley, Jane, Bentz, James W, Bottou, L{'e. (1993). Signature verification using a “siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence.
[hjelm2018learning] Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil, Trischler, Adam, Bengio, Yoshua. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
[bachman2019learning] Bachman, Philip, Hjelm, R Devon, Buchwalter, William. (2019). Learning representations by maximizing mutual information across views. Advances in neural information processing systems.
[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, St{'e. (2021). Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230.
[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.
[assran2021semi] Assran, Mahmoud, Caron, Mathilde, Misra, Ishan, Bojanowski, Piotr, Joulin, Armand, Ballas, Nicolas, Rabbat, Michael. (2021). Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples. arXiv preprint arXiv:2104.13963.
[chen2020generative] Chen, Mark, Radford, Alec, Child, Rewon, Wu, Jeffrey, Jun, Heewoo, Luan, David, Sutskever, Ilya. (2020). Generative pretraining from pixels. International Conference on Machine Learning.
[he2021masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Doll{'a. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
[denoising_vincent] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. Proceedings of the 25th International Conference on Machine Learning.
[vincent2010stacked] Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, Manzagol, Pierre-Antoine, Bottou, L{'e. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.. Journal of machine learning research.
[xie2021simmim] Xie, Zhenda, Zhang, Zheng, Cao, Yue, Lin, Yutong, Bao, Jianmin, Yao, Zhuliang, Dai, Qi, Hu, Han. (2021). Simmim: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886.
[wei2021masked] Wei, Chen, Fan, Haoqi, Xie, Saining, Wu, Chao-Yuan, Yuille, Alan, Feichtenhofer, Christoph. (2021). Masked Feature Prediction for Self-Supervised Visual Pre-Training. arXiv preprint arXiv:2112.09133.
[bao2021beit] Bao, Hangbo, Dong, Li, Wei, Furu. (2021). BEiT: BERT Pre-Training of Image Transformers. arXiv preprint arXiv:2106.08254.
[zhou2021ibotyes] Zhou, Jinghao, Wei, Chen, Wang, Huiyu, Shen, Wei, Xie, Cihang, Yuille, Alan, Kong, Tao. (2021). Ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832.
[loshchilov2017decoupled] Loshchilov, Ilya, Hutter, Frank. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
[chen2021empirical] Chen, Xinlei, Xie, Saining, He, Kaiming. (2021). An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057.
[touvron2021training] Touvron, Hugo, Cord, Matthieu, Douze, Matthijs, Massa, Francisco, Sablayrolles, Alexandre, Jégou, Hervé. (2021). Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning.
[assran2022masked] Assran, Mahmoud, Caron, Mathilde, Misra, Ishan, Bojanowski, Piotr, Bordes, Florian, Vincent, Pascal, Joulin, Armand, Rabbat, Michael, Ballas, Nicolas. (2022). Masked siamese networks for label-efficient learning. arXiv preprint arXiv:2204.07141.
[goyal2022vision] Goyal, Priya, Duval, Quentin, Seessel, Isaac, Caron, Mathilde, Singh, Mannat, Misra, Ishan, Sagun, Levent, Joulin, Armand, Bojanowski, Piotr. (2022). Vision models are more robust and fair when pretrained on uncurated images without supervision. arXiv preprint arXiv:2202.08360.
[tian2021divide] Tian, Yonglong, Henaff, Olivier J, van den Oord, Aäron. (2021). Divide and contrast: Self-supervised learning from uncurated data. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[mahajan2018exploring] Mahajan, Dhruv, Girshick, Ross, Ramanathan, Vignesh, He, Kaiming, Paluri, Manohar, Li, Yixuan, Bharambe, Ashwin, Van Der Maaten, Laurens. (2018). Exploring the limits of weakly supervised pretraining. Proceedings of the European conference on computer vision (ECCV).
[newman2005power] Newman, Mark EJ. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary physics.
[van2018inaturalist] Van Horn, Grant, Mac Aodha, Oisin, Song, Yang, Cui, Yin, Sun, Chen, Shepard, Alex, Adam, Hartwig, Perona, Pietro, Belongie, Serge. (2018). The inaturalist species classification and detection dataset. Proceedings of the IEEE conference on computer vision and pattern recognition.
[places205] Zhou, Bolei, Lapedriza, Agata, Xiao, Jianxiong, Torralba, Antonio, Oliva, Aude. (2014). Learning Deep Features for Scene Recognition using Places Database. Advances in Neural Information Processing Systems.
[cifar10] Alex Krizhevsky. (2009). Learning multiple layers of features from tiny images.
[kitti] Andreas Geiger, Philip Lenz, Raquel Urtasun. (2012). Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. Conference on Computer Vision and Pattern Recognition (CVPR).
[clevr] Johnson, Justin, Hariharan, Bharath, van der Maaten, Laurens, Fei-Fei, Li, Zitnick, C Lawrence, Girshick, Ross. (2017). CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. CVPR.
[bordes2022high] Florian Bordes, Randall Balestriero, Pascal Vincent. (2022). High Fidelity Visualization of What Your Self-Supervised Representation Knows About. Transactions on Machine Learning Research.
[https://doi.org/10.48550/arxiv.1310.4546] Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg, Dean, Jeffrey. (2013). Distributed Representations of Words and Phrases and their Compositionality. doi:10.48550/ARXIV.1310.4546.
[zhou2014learning] Zhou, Bolei, Lapedriza, Agata, Xiao, Jianxiong, Torralba, Antonio, Oliva, Aude. (2014). Learning deep features for scene recognition using places database. Advances in neural information processing systems.
[johnson2017clevr] Johnson, Justin, Hariharan, Bharath, Van Der Maaten, Laurens, Fei-Fei, Li, Lawrence Zitnick, C, Girshick, Ross. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. Proceedings of the IEEE conference on computer vision and pattern recognition.
[geiger2013vision] Geiger, Andreas, Lenz, Philip, Stiller, Christoph, Urtasun, Raquel. (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research.
[tian2021understanding] Tian, Yuandong, Chen, Xinlei, Ganguli, Surya. (2021). Understanding self-supervised learning dynamics without contrastive pairs. International Conference on Machine Learning.
[balestriero2022contrastive] Balestriero, Randall, LeCun, Yann. (2022). Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. arXiv preprint arXiv:2205.11508.
[wang2020understanding] Wang, Tongzhou, Isola, Phillip. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. International Conference on Machine Learning.
[chen2021intriguing] Chen, Ting, Luo, Calvin, Li, Lala. (2021). Intriguing properties of contrastive losses. Advances in Neural Information Processing Systems.
[garrido2022duality] Garrido, Quentin, Chen, Yubei, Bardes, Adrien, Najman, Laurent, Lecun, Yann. (2022). On the duality between contrastive and non-contrastive self-supervised learning. arXiv preprint arXiv:2206.02574.
[goyal2021vissl] Priya Goyal, Quentin Duval, Jeremy Reizenstein, Matthew Leavitt, Min Xu, Benjamin Lefaudeux, Mannat Singh, Vinicius Reis, Mathilde Caron, Piotr Bojanowski, Armand Joulin, Ishan Misra. (2021). VISSL.
[https://doi.org/10.48550/arxiv.1502.03167] Ioffe, Sergey, Szegedy, Christian. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. doi:10.48550/ARXIV.1502.03167.
[lecun2022path] LeCun, Yann. (2022). A Path Towards Autonomous Machine Intelligence, Version 0.9.2, 2022-06-27.
[chen2022intra] Chen, Yubei, Bardes, Adrien, Li, Zengyi, LeCun, Yann. (2022). Intra-Instance VICReg: Bag of Self-Supervised Image Patch Embedding. arXiv preprint arXiv:2206.08954.
[gidaris2020learning] Gidaris, Spyros, Bursuc, Andrei, Komodakis, Nikos, Pérez, Patrick, Cord, Matthieu. (2020). Learning representations by predicting bags of visual words. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[bordes2022guillotine] Bordes, Florian, Balestriero, Randall, Garrido, Quentin, Bardes, Adrien, Vincent, Pascal. (2022). Guillotine Regularization: Improving Deep Networks Generalization by Removing their Head. arXiv preprint arXiv:2206.13378.
[rao1999predictive] Rao, Rajesh PN, Ballard, Dana H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience.
[pathak2016context] Pathak, Deepak, Krahenbuhl, Philipp, Donahue, Jeff, Darrell, Trevor, Efros, Alexei A. (2016). Context encoders: Feature learning by inpainting. Proceedings of the IEEE conference on computer vision and pattern recognition.
[elias1955] Elias, Peter. (1955). Predictive coding--I. IRE transactions on information theory. doi:10.1109/TIT.1955.1055126.
[devlin2018bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[ramesh2021zero] Ramesh, Aditya, Pavlov, Mikhail, Goh, Gabriel, Gray, Scott, Voss, Chelsea, Radford, Alec, Chen, Mark, Sutskever, Ilya. (2021). Zero-shot text-to-image generation. International Conference on Machine Learning.
[dalal2005histograms] Dalal, Navneet, Triggs, Bill. (2005). Histograms of oriented gradients for human detection. 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05).
[larsson2016learning] Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory. (2016). Learning representations for automatic colorization. European Conference on Computer Vision.
[zhang2016colorful] Zhang, Richard, Isola, Phillip, Efros, Alexei A. (2016). Colorful image colorization. European Conference on Computer Vision.
[larsson2017colorization] Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory. (2017). Colorization as a proxy task for visual understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[assran2022hidden] Assran, Mahmoud, Balestriero, Randall, Duval, Quentin, Bordes, Florian, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, Ballas, Nicolas. (2022). The Hidden Uniform Cluster Prior in Self-Supervised Learning. arXiv preprint arXiv:2210.07277.
[lecun2006tutorial] LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M, Huang, Fujie. (2006). A tutorial on energy-based learning. Predicting structured data.
[vtab] Zhai, Xiaohua, Puigcerver, Joan, Kolesnikov, Alexander, Ruyssen, Pierre, Riquelme, Carlos, Lucic, Mario, Djolonga, Josip, Pinto, Andre Susano, Neumann, Maxim, Dosovitskiy, Alexey, Beyer, Lucas, Bachem, Olivier, Tschannen, Michael, Michalski, Marcin, Bousquet, Olivier, Gelly, Sylvain, Houlsby, Neil. (2019). A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark. doi:10.48550/ARXIV.1910.04867.
[lars] You, Yang, Gitman, Igor, Ginsburg, Boris. (2017). Large Batch Training of Convolutional Networks. doi:10.48550/ARXIV.1708.03888.
[zhou2019semantic] Zhou, Bolei, Zhao, Hang, Puig, Xavier, Xiao, Tete, Fidler, Sanja, Barriuso, Adela, Torralba, Antonio. (2019). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision.
[everingham2015pascal] Everingham, Mark, Eslami, SM, Van Gool, Luc, Williams, Christopher KI, Winn, John, Zisserman, Andrew. (2015). The pascal visual object classes challenge: A retrospective. International journal of computer vision.
[cai2022semi] Cai, Zhaowei, Ravichandran, Avinash, Favaro, Paolo, Wang, Manchen, Modolo, Davide, Bhotika, Rahul, Tu, Zhuowen, Soatto, Stefano. (2022). Semi-supervised vision transformers at scale. arXiv preprint arXiv:2208.05688.
[baevski2022efficient] Baevski, Alexei, Babu, Arun, Hsu, Wei-Ning, Auli, Michael. (2022). Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language. arXiv preprint arXiv:2212.07525.
[chen2022context] Chen, Xiaokang, Ding, Mingyu, Wang, Xiaodi, Xin, Ying, Mo, Shentong, Wang, Yunhao, Han, Shumin, Luo, Ping, Zeng, Gang, Wang, Jingdong. (2022). Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026.
[oquab2023dinov2] Oquab, Maxime, Darcet, Timothée, others. (2023). Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
[mann2020language] Brown, Tom B, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
[touvron2023llama] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
[assran2023self] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-supervised learning from images with a joint-embedding predictive architecture. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[tong2022videomae] Tong, Zhan, Song, Yibing, Wang, Jue, Wang, Limin. (2022). Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems.
[feichtenhofer2021large] Feichtenhofer, Christoph, Fan, Haoqi, Xiong, Bo, Girshick, Ross, He, Kaiming. (2021). A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning. Proceedings of the IEEE conference on computer vision and pattern recognition.
[feichtenhofer2022masked] Feichtenhofer, Christoph, Li, Yanghao, He, Kaiming, others. (2022). Masked autoencoders as spatiotemporal learners. Advances in neural information processing systems.
[goyal2017something] Goyal, Raghav, Ebrahimi Kahou, Samira, Michalski, Vincent, Materzynska, Joanna, Westphal, Susanne, Kim, Heuna, Haenel, Valentin, Fruend, Ingo, Yianilos, Peter, Mueller-Freitag, Moritz, others. (2017). The "something something" video database for learning and evaluating visual common sense. Proceedings of the IEEE international conference on computer vision.
[gu2018ava] Gu, Chunhui, Sun, Chen, Ross, David A, Vondrick, Carl, Pantofaru, Caroline, Li, Yeqing, Vijayanarasimhan, Sudheendra, Toderici, George, Ricco, Susanna, Sukthankar, Rahul, others. (2018). Ava: A video dataset of spatio-temporally localized atomic visual actions. Proceedings of the IEEE conference on computer vision and pattern recognition.
[kay2017kinetics] Kay, Will, Carreira, Joao, Simonyan, Karen, Zhang, Brian, Hillier, Chloe, Vijayanarasimhan, Sudheendra, Viola, Fabio, Green, Tim, Back, Trevor, Natsev, Paul, others. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
[kaplan2020scaling] Kaplan, Jared, McCandlish, Sam, Henighan, Tom, Brown, Tom B, Chess, Benjamin, Child, Rewon, Gray, Scott, Radford, Alec, Wu, Jeffrey, Amodei, Dario. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[cherti2023reproducible] Cherti, Mehdi, Beaumont, Romain, Wightman, Ross, Wortsman, Mitchell, Ilharco, Gabriel, Gordon, Cade, Schuhmann, Christoph, Schmidt, Ludwig, Jitsev, Jenia. (2023). Reproducible scaling laws for contrastive language-image learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[radford2021learning] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, others. (2021). Learning transferable visual models from natural language supervision. International conference on machine learning.
[girdhar2023omnimae] Girdhar, Rohit, El-Nouby, Alaaeldin, Singh, Mannat, Alwala, Kalyan Vasudev, Joulin, Armand, Misra, Ishan. (2023). Omnimae: Single model masked pretraining on images and videos. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[wang2023videomae] Wang, Limin, Huang, Bingkun, Zhao, Zhiyu, Tong, Zhan, He, Yinan, Wang, Yi, Wang, Yali, Qiao, Yu. (2023). Videomae v2: Scaling video masked autoencoders with dual masking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[wang2022internvideo] Wang, Yi, Li, Kunchang, Li, Yizhuo, He, Yinan, Huang, Bingkun, Zhao, Zhiyu, Zhang, Hongjie, Xu, Jilan, Liu, Yi, Wang, Zun, others. (2022). Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191.
[akbari2021vatt] Akbari, Hassan, Yuan, Liangzhe, Qian, Rui, Chuang, Wei-Hong, Chang, Shih-Fu, Cui, Yin, Gong, Boqing. (2021). Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems.
[yuan2023videoglue] Yuan, Liangzhe, Gundavarapu, Nitesh Bharadwaj, Zhao, Long, Zhou, Hao, Cui, Yin, Jiang, Lu, Yang, Xuan, Jia, Menglin, Weyand, Tobias, Friedman, Luke, others. (2023). VideoGLUE: Video General Understanding Evaluation of Foundation Models. arXiv preprint arXiv:2307.03166.
[miech2019howto100m] Miech, Antoine, Zhukov, Dimitri, Alayrac, Jean-Baptiste, Tapaswi, Makarand, Laptev, Ivan, Sivic, Josef. (2019). Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. Proceedings of the IEEE/CVF international conference on computer vision.
[sevilla2021only] Sevilla-Lara, Laura, Zha, Shengxin, Yan, Zhicheng, Goswami, Vedanuj, Feiszli, Matt, Torresani, Lorenzo. (2021). Only time can tell: Discovering temporal data for temporal modeling. Proceedings of the IEEE/CVF winter conference on applications of computer vision.
[wang2018temporal] Wang, Limin, Xiong, Yuanjun, Wang, Zhe, Qiao, Yu, Lin, Dahua, Tang, Xiaoou, Van Gool, Luc. (2018). Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence.
[sun2019videobert] Sun, Chen, Myers, Austin, Vondrick, Carl, Murphy, Kevin, Schmid, Cordelia. (2019). Videobert: A joint model for video and language representation learning. Proceedings of the IEEE/CVF international conference on computer vision.
[xu2021videoclip] Xu, Hu, Ghosh, Gargi, Huang, Po-Yao, Okhonko, Dmytro, Aghajanyan, Armen, Metze, Florian, Zettlemoyer, Luke, Feichtenhofer, Christoph. (2021). Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084.
[zellers2022merlot] Zellers, Rowan, Lu, Jiasen, Lu, Ximing, Yu, Youngjae, Zhao, Yanpeng, Salehi, Mohammadreza, Kusupati, Aditya, Hessel, Jack, Farhadi, Ali, Choi, Yejin. (2022). Merlot reserve: Neural script knowledge through vision and language and sound. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[huang2022contrastive] Huang, Zhicheng, Jin, Xiaojie, Lu, Chengze, Hou, Qibin, Cheng, Ming-Ming, Fu, Dongmei, Shen, Xiaohui, Feng, Jiashi. (2022). Contrastive masked autoencoders are stronger vision learners. arXiv preprint arXiv:2207.13532.
[wiskott2002slow] Wiskott, Laurenz, Sejnowski, Terrence J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural computation.
[parthasarathy2022self] Parthasarathy, Nikhil, Eslami, SM Ali, Carreira, João, Hénaff, Olivier J. (2022). Self-supervised video pretraining yields strong image representations. arXiv preprint arXiv:2210.06433.
[sermanet2018time] Sermanet, Pierre, Lynch, Corey, Chebotar, Yevgen, Hsu, Jasmine, Jang, Eric, Schaal, Stefan, Levine, Sergey, Brain, Google. (2018). Time-contrastive networks: Self-supervised learning from video. 2018 IEEE international conference on robotics and automation (ICRA).
[ryali2023hiera] Ryali, Chaitanya, Hu, Yuan-Ting, Bolya, Daniel, Wei, Chen, Fan, Haoqi, Huang, Po-Yao, Aggarwal, Vaibhav, Chowdhury, Arkabandhu, Poursaeed, Omid, Hoffman, Judy, others. (2023). Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles. arXiv preprint arXiv:2306.00989.
[li2023unmasked] Li, Kunchang, Wang, Yali, Li, Yizhuo, Wang, Yi, He, Yinan, Wang, Limin, Qiao, Yu. (2023). Unmasked teacher: Towards training-efficient video foundation models. arXiv preprint arXiv:2303.16058.
[gupta2023siamese] Gupta, Agrim, Wu, Jiajun, Deng, Jia, Fei-Fei, Li. (2023). Siamese Masked Autoencoders. arXiv preprint arXiv:2305.14344.
[hinton1989connectionist] Hinton, Geoffrey E. (1989). Connectionist learning procedures. Machine learning.
[berkes2005slow] Berkes, Pietro, Wiskott, Laurenz. (2005). Slow feature analysis yields a rich repertoire of complex cell properties. Journal of vision.
[spelke1995object] Spelke, Elizabeth S, Vishton, Peter, Von Hofsten, Claes. (1995). Object perception, object-directed action, and physical knowledge in infancy.
[field1994goal] Field, David J. (1994). What is the goal of sensory coding?. Neural computation.
[barlow1961coding] Barlow, Horace B. (1961). The coding of sensory messages. Current problems in animal behavior.
[goroshin2015unsupervised] Goroshin, Ross, Bruna, Joan, Tompson, Jonathan, Eigen, David, LeCun, Yann. (2015). Unsupervised learning of spatiotemporally coherent metrics. Proceedings of the IEEE international conference on computer vision.
[srivastava2015unsupervised] Srivastava, Nitish, Mansimov, Elman, Salakhudinov, Ruslan. (2015). Unsupervised learning of video representations using lstms. International conference on machine learning.
[zou2012deep] Zou, Will, Zhu, Shenghuo, Yu, Kai, Ng, Andrew. (2012). Deep learning of invariant features via simulated fixations in video. Advances in neural information processing systems.
[wang2015unsupervised] Wang, Xiaolong, Gupta, Abhinav. (2015). Unsupervised learning of visual representations using videos. Proceedings of the IEEE international conference on computer vision.
[suris2021learning] Surís, Dídac, Liu, Ruoshi, Vondrick, Carl. (2021). Learning the predictability of the future. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[han2019video] Han, Tengda, Xie, Weidi, Zisserman, Andrew. (2019). Video representation learning by dense predictive coding. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
[vondrick2016anticipating] Vondrick, Carl, Pirsiavash, Hamed, Torralba, Antonio. (2016). Anticipating visual representations from unlabeled video. Proceedings of the IEEE conference on computer vision and pattern recognition.
[han2020memory] Han, Tengda, Xie, Weidi, Zisserman, Andrew. (2020). Memory-augmented dense predictive coding for video representation learning. European conference on computer vision.
[kalluri2023flavr] Kalluri, Tarun, Pathak, Deepak, Chandraker, Manmohan, Tran, Du. (2023). Flavr: Flow-agnostic video representations for fast frame interpolation. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
[lee2017unsupervised] Lee, Hsin-Ying, Huang, Jia-Bin, Singh, Maneesh, Yang, Ming-Hsuan. (2017). Unsupervised representation learning by sorting sequences. Proceedings of the IEEE international conference on computer vision.
[xu2019self] Xu, Dejing, Xiao, Jun, Zhao, Zhou, Shao, Jian, Xie, Di, Zhuang, Yueting. (2019). Self-supervised spatiotemporal learning via video clip order prediction. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[pintea2014deja] Pintea, Silvia L, van Gemert, Jan C, Smeulders, Arnold WM. (2014). Déjà vu: Motion prediction in static images. Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13.
[pathak2017learning] Pathak, Deepak, Girshick, Ross, Dollár, Piotr, Darrell, Trevor, Hariharan, Bharath. (2017). Learning features by watching objects move. Proceedings of the IEEE conference on computer vision and pattern recognition.
[tan2023multiscale] Tan, Reuben, De Lange, Matthias, Iuzzolino, Michael, Plummer, Bryan A, Saenko, Kate, Ridgeway, Karl, Torresani, Lorenzo. (2023). Multiscale Video Pretraining for Long-Term Activity Forecasting. arXiv preprint arXiv:2307.12854.
[girdhar2021anticipative] Girdhar, Rohit, Grauman, Kristen. (2021). Anticipative video transformer. Proceedings of the IEEE/CVF international conference on computer vision.
[hyvarinen2001independent] Hyvärinen, Aapo, Karhunen, Juha, Oja, Erkki. (2001). Independent component analysis. Adaptive and learning systems for signal processing, communications, and control. John Wiley & Sons, Inc.
[gutmann2012noise] Gutmann, Michael U, Hyvärinen, Aapo. (2012). Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of machine learning research.
[elias1955predictive] Elias, Peter. (1955). Predictive coding--I. IRE transactions on information theory.
[friston2005theory] Friston, Karl. (2005). A theory of cortical responses. Philosophical transactions of the Royal Society B: Biological sciences.
[kayser2001extracting] Kayser, Christoph, Einhäuser, Wolfgang, Dümmer, Olaf, König, Peter, Körding, Konrad. (2001). Extracting slow subspaces from natural videos leads to complex cells. Artificial Neural Networks—ICANN 2001: International Conference Vienna, Austria, August 21--25, 2001 Proceedings 11.
[yu2022coca] Yu, Jiahui, Wang, Zirui, Vasudevan, Vijay, Yeung, Legg, Seyedhosseini, Mojtaba, Wu, Yonghui. (2022). Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917.
[noroozi2016unsupervised] Noroozi, Mehdi, Favaro, Paolo. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. European conference on computer vision.
[wang2023masked] Wang, Rui, Chen, Dongdong, Wu, Zuxuan, Chen, Yinpeng, Dai, Xiyang, Liu, Mengchen, Yuan, Lu, Jiang, Yu-Gang. (2023). Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[li2022uniformer] Li, Kunchang, Wang, Yali, Gao, Peng, Song, Guanglu, Liu, Yu, Li, Hongsheng, Qiao, Yu. (2022). Uniformer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676.
[bib1] Akbari et al. (2021) Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34:24206–24221, 2021.
[bib2] Arnab et al. (2021) Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lucic, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE international conference on computer vision, 2021.
[bib3] Assran et al. (2022) Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. arXiv preprint arXiv:2204.07141, 2022.
[bib4] Assran et al. (2023) Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.
[bib5] Baevski et al. (2022a) Alexei Baevski, Arun Babu, Wei-Ning Hsu, and Michael Auli. Efficient self-supervised learning with contextualized target representations for vision, speech and language. arXiv preprint arXiv:2212.07525, 2022a.
[bib6] Baevski et al. (2022b) Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555, 2022b.
[bib7] Bao et al. (2021) Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
[bib8] Berkes and Wiskott (2005) Pietro Berkes and Laurenz Wiskott. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of vision, 5(6):9–9, 2005.
[bib9] Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
[bib10] Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
[bib11] Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.
[bib12] Chen et al. (2022) Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, and Jingdong Wang. Context autoencoder for self-supervised representation learning. arXiv preprint arXiv:2202.03026, 2022.
[bib13] Chen et al. (2021) Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021.
[bib14] Cherti et al. (2023) Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
[bib15] Cubuk et al. (2019) Ekin Dogus Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[bib16] Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[bib17] Feichtenhofer et al. (2021) Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, and Kaiming He. A large-scale study on unsupervised spatiotemporal representation learning. Proceedings of the IEEE conference on computer vision and pattern recognition, 2021.
[bib18] Feichtenhofer et al. (2022) Christoph Feichtenhofer, Yanghao Li, Kaiming He, et al. Masked autoencoders as spatiotemporal learners. Advances in neural information processing systems, 35:35946–35958, 2022.
[bib19] Field (1994) David J Field. What is the goal of sensory coding? Neural computation, 6(4):559–601, 1994.
[bib20] Gidaris et al. (2020) Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Learning representations by predicting bags of visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6928–6938, 2020.
[bib21] Girdhar and Grauman (2021) Rohit Girdhar and Kristen Grauman. Anticipative video transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13505–13515, 2021.
[bib22] Girdhar et al. (2023) Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. Omnimae: Single model masked pretraining on images and videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10406–10417, 2023.
[bib23] Goroshin et al. (2015) Ross Goroshin, Joan Bruna, Jonathan Tompson, David Eigen, and Yann LeCun. Unsupervised learning of spatiotemporally coherent metrics. In Proceedings of the IEEE international conference on computer vision, pages 4086–4093, 2015.
[bib24] Goyal et al. (2017) Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pages 5842–5850, 2017.
[bib25] Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
[bib26] Gu et al. (2018) Chunhui Gu, Chen Sun, David A Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6047–6056, 2018.
[bib27] Gupta et al. (2023) Agrim Gupta, Jiajun Wu, Jia Deng, and Li Fei-Fei. Siamese masked autoencoders. arXiv preprint arXiv:2305.14344, 2023.
[bib28] Gutmann and Hyvärinen (2012) Michael U Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of machine learning research, 13(2), 2012.
[bib29] Han et al. (2019) Tengda Han, Weidi Xie, and Andrew Zisserman. Video representation learning by dense predictive coding. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019.
[bib30] Han et al. (2020) Tengda Han, Weidi Xie, and Andrew Zisserman. Memory-augmented dense predictive coding for video representation learning. In European conference on computer vision, pages 312–329. Springer, 2020.
[bib31] He et al. (2021) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
[bib32] Hinton (1989) Geoffrey E Hinton. Connectionist learning procedures. In Machine learning, pages 555–610. Elsevier, 1989.
[bib33] Kalluri et al. (2023) Tarun Kalluri, Deepak Pathak, Manmohan Chandraker, and Du Tran. Flavr: Flow-agnostic video representations for fast frame interpolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2071–2082, 2023.
[bib34] Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
[bib35] Kay et al. (2017) Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[bib36] Kayser et al. (2001) Christoph Kayser, Wolfgang Einhäuser, Olaf Dümmer, Peter König, and Konrad Körding. Extracting slow subspaces from natural videos leads to complex cells. In Artificial Neural Networks—ICANN 2001: International Conference Vienna, Austria, August 21–25, 2001 Proceedings 11, pages 1075–1080. Springer, 2001.
[bib37] Larsson et al. (2016) Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Learning representations for automatic colorization. In European conference on computer vision. Springer, 2016.
[bib38] Larsson et al. (2017) Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Colorization as a proxy task for visual understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
[bib39] LeCun (2022) Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. 2022.
[bib40] Lee et al. (2017) Hsin-Ying Lee, Jia-Bin Huang, Maneesh Singh, and Ming-Hsuan Yang. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE international conference on computer vision, pages 667–676, 2017.
[bib41] Li et al. (2022) Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unified transformer for efficient spatiotemporal representation learning. arXiv preprint arXiv:2201.04676, 2022.
[bib42] Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[bib43] Miech et al. (2019) Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019.
[bib44] Noroozi and Favaro (2016) Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European conference on computer vision, pages 69–84. Springer, 2016.
[bib45] Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[bib46] Oquab et al. (2023) Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[bib47] Parthasarathy et al. (2022) Nikhil Parthasarathy, SM Eslami, João Carreira, and Olivier J Hénaff. Self-supervised video pretraining yields strong image representations. arXiv preprint arXiv:2210.06433, 2022.
[bib48] Pathak et al. (2016) Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
[bib49] Pintea et al. (2014) Silvia L Pintea, Jan C van Gemert, and Arnold WM Smeulders. Déja vu: Motion prediction in static images. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13, pages 172–187. Springer, 2014.
[bib50] Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[bib51] Rao and Ballard (1999) Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience, 2(1):79–87, 1999.
[bib52] Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[bib53] Ryali et al. (2023) Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. arXiv preprint arXiv:2306.00989, 2023.
[bib54] Sevilla-Lara et al. (2021) Laura Sevilla-Lara, Shengxin Zha, Zhicheng Yan, Vedanuj Goswami, Matt Feiszli, and Lorenzo Torresani. Only time can tell: Discovering temporal data for temporal modeling. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 535–544, 2021.
[bib55] Spelke et al. (1995) Elizabeth S Spelke, Peter Vishton, and Claes Von Hofsten. Object perception, object-directed action, and physical knowledge in infancy. 1995.
[bib56] Srivastava et al. (2015) Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852. PMLR, 2015.
[bib57] Sun et al. (2019) Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7464–7473, 2019.
[bib58] Surís et al. (2021) Dídac Surís, Ruoshi Liu, and Carl Vondrick. Learning the predictability of the future. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12607–12617, 2021.
[bib59] Tan et al. (2023) Reuben Tan, Matthias De Lange, Michael Iuzzolino, Bryan A Plummer, Kate Saenko, Karl Ridgeway, and Lorenzo Torresani. Multiscale video pretraining for long-term activity forecasting. arXiv preprint arXiv:2307.12854, 2023.
[bib60] Tarvainen and Valpola (2017) Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780, 2017.
[bib61] Tian et al. (2021) Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In International Conference on Machine Learning, pages 10268–10278. PMLR, 2021.
[bib62] Tong et al. (2022) Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems, 35:10078–10093, 2022.
[bib63] Van Horn et al. (2018) Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.
[bib64] Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, page 1096–1103, 2008.
[bib65] Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.
[bib66] Vondrick et al. (2016) Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Anticipating visual representations from unlabeled video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 98–106, 2016.
[bib67] Wang et al. (2010) Fei Wang, Ping Li, and Arnd Christian König. Learning a bi-stochastic data similarity matrix. In 2010 IEEE International Conference on Data Mining, pages 551–560. IEEE, 2010.
[bib68] Wang et al. (2023a) Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023a.
[bib69] Wang et al. (2023b) Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, and Yu-Gang Jiang. Masked video distillation: Rethinking masked feature modeling for self-supervised video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6312–6322, 2023b.
[bib70] Wang et al. (2022) Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.
[bib71] Wiskott and Sejnowski (2002) Laurenz Wiskott and Terrence J Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural computation, 14(4):715–770, 2002.
[bib72] Wu et al. (2018) Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3733–3742, 2018.
[bib73] Xie et al. (2021) Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886, 2021.
[bib74] Xu et al. (2019) Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10334–10343, 2019.
[bib75] Xu et al. (2021) Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021.
[bib76] Yu et al. (2022) Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
[bib77] Yuan et al. (2023) Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, et al. Videoglue: Video general understanding evaluation of foundation models. arXiv preprint arXiv:2307.03166, 2023.
[bib78] Zellers et al. (2022) Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022.
[bib79] Zhou et al. (2014) Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. https://proceedings.neurips.cc/paper/2014/file/3fe94a002317b5f9259f82690aeea4cd-Paper.pdf.
[bib80] Zhou et al. (2021) Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. Ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.
[bib81] Zou et al. (2012) Will Zou, Shenghuo Zhu, Kai Yu, and Andrew Ng. Deep learning of invariant features via simulated fixations in video. Advances in neural information processing systems, 25, 2012.