Predicting masked tokens in stochastic locations improves masked image modeling
Amir Bar, Florian Bordes, Assaf Shocher, Mahmoud Assran, Pascal Vincent, Nicolas Ballas, Trevor Darrell, Amir Globerson, Yann LeCun
Abstract
Self-supervised learning is a promising paradigm in deep learning that enables learning from unlabeled data by constructing pretext tasks that require learning useful representations. In natural language processing, the dominant pretext task has been masked language modeling (MLM), while in computer vision there exists an equivalent called Masked Image Modeling (MIM). However, MIM is challenging because it requires predicting semantic content in accurate locations. E.g., given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose FlexPredict, a stochastic model that addresses this challenge by incorporating location uncertainty into the model. Specifically, we condition the model on stochastic masked token positions to guide the model toward learning features that are more robust to location uncertainties. Our approach improves downstream performance on a range of tasks, e.g., compared to MIM baselines, FlexPredict boosts ImageNet linear probing by 1.6% with ViT-B and by 2.5% for semi-supervised video segmentation using ViT-L.
Amir Bar 1 2 3 Florian Bordes 3 Assaf Shocher 2 Mahmoud Assran 3 Pascal Vincent 3 Nicolas Ballas 3 Trevor Darrell 2 Amir Globerson 1 Yann LeCun 3 4
Masked Image Modeling (MIM) is a promising self-supervised learning approach that enables learning from unlabeled images. Despite its recent success, learning good representations through MIM remains challenging because it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog, we can guess that there is a tail, but we cannot determine its exact location. In this work, we propose to incorporate location uncertainty into MIM by using stochastic positional embeddings (StoP). Specifically, we condition the model on stochastic masked token positions drawn from a Gaussian distribution. StoP reduces overfitting to location features and guides the model toward learning features that are more robust to location uncertainties. Quantitatively, StoP improves downstream MIM performance on a variety of downstream tasks, including +1.7% on ImageNet linear probing using ViT-B, and +2.5% for ViT-H using 1% of the data.¹
Introduction
Masked Image Modeling (MIM) enables learning from unlabeled images by reconstructing masked parts of the image given the rest of the image as context. In recent years, new MIM methods have emerged (Xie et al., 2021; Bao et al., 2021; He et al., 2021; Assran et al., 2023). Masked AutoEncoders (MAE) (He et al., 2021) are trained to minimize a reconstruction error in pixel space, and I-JEPA (Assran et al., 2023) reconstructs image features. MIM is appealing compared to invariance-based self-supervised learning methods like DINO (Caron et al., 2021) and iBOT (Zhou et al., 2021) as MIM does not suffer from the same limitations, namely, it does not require heavy use of hand-crafted
1 Tel Aviv University 2 UC Berkeley 3 Meta AI (FAIR) 4 New York University. Correspondence to: Amir Bar <amir.bar@cs.tau.ac.il>.
1 See https://github.com/amirbar/StoP for code.


Figure 1. Given a partial image of a dog, can you precisely determine the location of its tail? Existing Masked Image Modeling (MIM) models like MAE (He et al., 2021) and I-JEPA (Assran et al., 2023) predict tokens deterministically and do not model location uncertainties (a). We propose to predict the target (masked tokens) in stochastic positions (StoP), which prevents overfitting to location features. StoP leads to improved MIM performance on downstream tasks, including linear probing on ImageNet (b).
augmentations (Xiao et al.; He et al., 2021), mini-batch statistics, or a uniform cluster prior (Assran et al., 2022).
Despite the recent success of MIM, we argue that learning good representations using MIM remains challenging due to location uncertainty: it requires predicting the right semantic content in accurate locations. For example, given an incomplete picture of a dog (see Figure 1a), we might guess there is a tail, but we cannot be sure exactly where it is, as it could realistically be in several different places. Without explicitly modeling this location uncertainty, existing MIM models like MAE and I-JEPA might overfit to semantic content in arbitrary locations (e.g., the tail location).
In this work, we propose to address location uncertainty in MIM by turning existing MIM models into stochastic ones. Instead of training the model to make predictions at exact locations, we use Stochastic Positional Embeddings (StoP) to add noise to the masked tokens' positions, implicitly forcing the model to make stochastic predictions. StoP guides the model toward learning features that are more resilient to location uncertainties, such as the fact that a tail exists in a general area rather than at a specific point, which improves downstream performance (Figure 1b).
Specifically, we model the position of every masked token as a random variable with a Gaussian distribution whose mean is the position of the patch and whose covariance matrix is learned. We find it crucial to design StoP carefully so that the model cannot collapse back to deterministic positional embeddings by scaling down the covariance matrix weights to overcome the noise.
To prevent collapse, we propose to tie the scale of the noise to the scale of the input context. With this constraint, scaling down the noise also scales down the input context, which makes the reconstruction task too hard. Conversely, scaling up the noise makes the masked token positions highly stochastic, which also makes reconstruction difficult. We provide a theoretical proof showing that our solution indeed prevents collapse.
Our contributions are as follows. First, we propose the idea of Stochastic Positional Embeddings (StoP) and apply it to MIM to address location uncertainty, namely that the location of semantic features is stochastic. Second, we demonstrate that adding StoP to I-JEPA, a recent MIM approach, improves performance on a variety of downstream tasks, highlighting its effectiveness. Lastly, implementing StoP for MIM requires only three extra lines of code and adds no runtime or memory overhead.
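To make the sampling step concrete, here is a minimal sketch; the function name, the noise placement, and the value of σ are assumptions of this sketch rather than the paper's reference implementation. In the full method, the same matrix A also projects the context, which ties the noise scale to the input scale and prevents collapse:

```python
import numpy as np

rng = np.random.default_rng(0)

def stop_positions(pos, A, sigma=0.25):
    # Hypothetical sketch of StoP: each masked token's deterministic
    # positional embedding `pos` is replaced by a sample from a Gaussian
    # centered at `pos` with learned covariance sigma * A @ A.T, via the
    # reparametrization  pos + sqrt(sigma) * (z @ A.T),  z ~ N(0, I).
    z = rng.standard_normal(pos.shape)
    return pos + np.sqrt(sigma) * (z @ A.T)

d = 8                                   # toy embedding dimension
pos = rng.standard_normal((4, d))       # 4 masked-token positions
A = 0.1 * rng.standard_normal((d, d))   # learned projection (random here)

draw1 = stop_positions(pos, A)
draw2 = stop_positions(pos, A)
```

Two draws for the same positions differ, so the predictor never sees the exact target location.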
Related Work
Masked image modeling (MIM). There is a significant body of research exploring visual representation learning by predicting corrupted sensory inputs. Denoising autoencoders (Vincent et al., 2010), for example, use random noise as input corruption, while context encoders (Pathak et al., 2016) regress an entire image region based on its surroundings. Masked image modeling (He et al., 2021; Xie et al., 2021; Bao et al., 2021) has emerged as a form of image denoising in which a Vision Transformer (Dosovitskiy et al., 2020) is used to reconstruct missing input patches. The Masked Autoencoders (MAE) architecture (He et al., 2021), for example, efficiently reconstructs missing patches in pixel space and achieves strong performance on large labeled datasets. Other approaches, such as BEiT (Bao et al., 2021), predict a latent code obtained using a pretrained tokenizer; however, pixel-level pre-training has been shown to outperform BEiT in fine-tuning. SimMIM (Xie et al., 2021) explores simple reconstruction targets like color clusters but shows no significant advantages over pixel-space reconstruction. Recently, Image-JEPA (I-JEPA) (Assran et al., 2023; LeCun, 2022) was proposed as a non-generative approach for self-supervised learning of semantic image representations. I-JEPA predicts the representations of various target blocks in an image from a single context block to guide it toward

producing semantic representations. Our approach builds on this line of work: we propose to handle location uncertainty using stochastic positional embeddings, which has not been explored before.

Figure 6. Feature visualization. We plot the similarity between the predicted features of a given patch (marked in white within the masked black area) and other features in the same image. StoP produces features that are less location-based than the I-JEPA baseline, whose predicted features correlate strongly with the target location.
Positional Embeddings in Transformers. One of the core components of the Transformer architecture (Vaswani et al., 2017) is the self-attention block, which is a permutation-invariant function, i.e., changing the order of the input tokens does not change the function output. Consequently, input tokens must be fed together with positional embeddings that describe their location. Absolute positional embeddings, like fixed 2D sinusoidal features (Bello et al., 2019) or learned location features, are the prevalent type of positional embedding for the Vision Transformer (Dosovitskiy et al., 2020). Relative positional embeddings have recently gained popularity in NLP due to their ability to address the gap between training and testing sequence lengths (Su et al., 2021; Chu et al., 2021; Press et al., 2021). For example, Press et al. (2021) proposed ALiBi to bias self-attention to assign higher confidence to neighboring locations, and SPE (Liutkus et al., 2021) proposed a stochastic approximation of relative positional embeddings in linear transformers. Differently, we propose StoP to tackle location uncertainty in MIM, and it can easily be applied on top of any existing deterministic variant.
Invariance-based methods. These methods incorporate a loss that encourages similarity between augmented views of the same image while avoiding a trivial solution. For example, contrastive learning prevents collapse by introducing negative examples (Hadsell et al., 2006; Dosovitskiy et al., 2014; Chen et al., 2020a; He et al., 2019; Chen et al., 2020b; Dwibedi et al., 2021). This can be achieved using a memory bank of previous instances (Wu et al., 2018; Oord et al., 2018; Tian et al., 2019; Misra & van der Maaten, 2020). However, non-contrastive solutions have also been proposed. Of particular interest, a momentum encoder has been shown to prevent collapse even without negative pairs (Grill et al., 2020; Caron et al., 2021; Salakhutdinov & Hinton, 2007). Other methods include stopping the gradient to one branch (Chen & He, 2021) or applying regularization using batch statistics (Zbontar et al., 2021; Bardes et al., 2021; 2022; Ermolov et al., 2020; Hua et al., 2021). MoCo v3 (Chen et al., 2021) and then DINO (Caron et al., 2021) extended these approaches to Vision Transformers, and iBOT (Zhou et al., 2021) proposed to add a MIM loss to DINO. These approaches perform extremely well on ImageNet linear probing, yet they rely on batch statistics, struggle under non-uniform distributions (Assran et al., 2022), and require hand-crafted image augmentations (Xiao et al.). Our approach is based on MIM, which requires fewer assumptions on batch statistics or hand-crafted invariances.
Optimal Predictor
Consider a random variable X (corresponding to the context in our case; for simplicity, assume X is just the positional embedding of the context) that is used to predict a variable Y (corresponding to the target). Instead of predicting from X directly, we observe only the noisy result R = g(X, Z), where Z is a noise variable independent of both X and Y, and g is a mixing function (in our case, g(x, z) = x + z). We next derive the optimal predictor f(R) in this setting. Formally, we want to minimize:
$$
\min_{f} \; \mathbb{E}\left[\left(Y - f(R)\right)^2\right].
$$
A classic result in estimation is that this is optimized by the conditional expectation f ( r ) = E [ Y | R = r ] .
We simplify this as follows:
$$
\begin{aligned}
f(r) &= \mathbb{E}\left[Y \mid R = r\right] = \int p(x \mid r)\, \mathbb{E}\left[Y \mid X = x, R = r\right] dx \\
&= \int p(x \mid r)\, \mathbb{E}\left[Y \mid X = x\right] dx,
\end{aligned}
$$
where in the second line we used the fact that:
$$
\mathbb{E}\left[Y \mid X = x, R = r\right] = \mathbb{E}\left[Y \mid X = x\right],
$$
which holds because, given X, the noisy observation R = g(X, Z) carries no additional information about Y, since Z is independent of both X and Y.
To further illustrate, consider the case where Z is Gaussian with zero mean and unit variance. Then p(x | r) is also Gaussian with expectation r (assuming a flat prior over X), and the expression above amounts to a convolution of the clean expected values with a Gaussian:
$$
f(r) = \int \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x - r)^2}{2}}\; \mathbb{E}\left[Y \mid X = x\right] dx.
$$
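A quick Monte Carlo sanity check of this derivation. Everything here is a toy construction: h is a hypothetical choice of E[Y | X = x], and unlike the flat-prior illustration above, the check keeps the prior over X explicit:

```python
import numpy as np

rng = np.random.default_rng(1)

h = np.tanh                      # stand-in for E[Y | X = x]
n = 2_000_000
x = rng.normal(0.0, 2.0, n)      # toy prior over X
z = rng.standard_normal(n)       # Z ~ N(0, 1), independent of X and Y
y = h(x)                         # target; noiseless for simplicity
r = x + z                        # the noisy observation R = g(X, Z)

# Monte Carlo estimate of the optimal predictor E[Y | R ~= r0]
r0 = 0.5
mc = y[np.abs(r - r0) < 0.05].mean()

# The derivation: f(r0) = integral of p(x | r0) * E[Y | X = x] dx,
# where p(x | r0) is proportional to p(x) * exp(-(r0 - x)^2 / 2).
xs = np.linspace(-10.0, 10.0, 4001)
w = np.exp(-(xs / 2.0) ** 2 / 2) * np.exp(-(r0 - xs) ** 2 / 2)
f_r0 = (w * h(xs)).sum() / w.sum()

print(mc, f_r0)  # the two estimates closely agree
```

The conditional average of Y near R = r0 matches the Gaussian-smoothed regression function, as the derivation predicts.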
Stochastic Position Embeddings.
The most common choices of positional embeddings for Vision Transformers are sine-cosine location features (also used in MAE and I-JEPA) and learned positional embeddings. We evaluate MIM downstream performance with each of these options and with StoP (see Table 6). The results indicate that StoP improves performance by +3.2% compared to sinusoidal and learned positional embeddings.
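For reference, the fixed sine-cosine features compared against above are typically constructed as follows; this is a common construction (the exact channel layout varies by implementation), not code from the paper:

```python
import numpy as np

def sincos_1d(positions, dim):
    # Fixed sinusoidal features at geometrically spaced frequencies;
    # `dim` must be even. Sin features are followed by cos features.
    omega = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    args = np.outer(positions, omega)              # (n, dim/2)
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

def sincos_2d(grid_size, dim):
    # 2D variant for a patch grid: half the channels encode the row
    # index, half the column index.
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                         indexing="ij")
    return np.concatenate([sincos_1d(ys.ravel(), dim // 2),
                           sincos_1d(xs.ravel(), dim // 2)], axis=1)

pe = sincos_2d(14, 64)   # 14x14 patch grid (ViT-B/16 at 224px), 64 dims
print(pe.shape)          # (196, 64)
```

Because these features are deterministic, a patch position always maps to the same embedding; StoP perturbs exactly this input for masked tokens.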
Learned vs. predefined covariance matrix. To confirm that learning the covariance matrix Σ = σAA^T (and specifically A) is beneficial compared to using a predefined covariance matrix, we compare against stochastic positional embeddings with a predefined covariance matrix Σ = σI, without any learning. We compare both options using different values of the σ hyperparameter. Figure 3 indicates that it is advantageous to learn Σ rather than use fixed parameters.
Table 4. Linear-probe transfer for various downstream tasks. Linear evaluation on downstream image classification, object counting, and depth ordering tasks. Using StoP instead of deterministic sinusoidal positions leads to improvements on all tasks, e.g., +3.3% on iNAT18 and +1.3% on Counting.
Table 5. Finetuning results on IN-1k with 1% labels. StoP significantly improves finetuning performance compared to sine-cosine positional embeddings. Using the ViT-L/16 architecture.

Figure 3. Learned vs. predefined stochastic positions. Using the learned covariance matrix as in StoP, i.e., Σ = σAA^T, leads to a +3.5% improvement compared to smaller gains with a fixed covariance matrix Σ = σI. Accuracy is reported based on linear probing evaluation using 1% of the data from IN-1k.
Application of StoP to different tokens. We apply StoP to context and/or masked tokens. The results in Table 7 confirm our design choice, showing that StoP is most beneficial when it is applied solely to masked tokens, compared to context tokens, or both masked and context tokens.
Experiments and Results
Next, we turn to discuss the main experiments presented in the paper. In Section 4.1, we describe the application of StoP to various downstream tasks including image recognition, dense prediction, and low-level vision tasks. In Section 4.2 we discuss the ablation study and design choices. The full implementation details are included in Appendix C.
Ablation Study
Our primary focus is to evaluate the effectiveness of StoP. To do so, we assess various design options using a ViT-B architecture for the encoder and predictor. We pre-train for 300 epochs on IN-1k based on the I-JEPA (Assran et al., 2023) MIM model, then assess linear probing performance on IN-1k using only 1% of the labels.
Downstream Tasks
We pre-train StoP on top of I-JEPA, a state-of-the-art MIM model. We train on IN-1k for 600 epochs using ViT-B/16 and ViT-L/16 architectures for the encoder and predictor, or for 300 epochs when using ViT-H/14. We then evaluate the model's performance on a variety of downstream tasks. Additional results and comparisons to invariance-based approaches are included in Appendix C.2.
Image recognition. For image classification, we perform a linear probing evaluation of StoP on multiple datasets, including ImageNet (IN-1k) (Russakovsky et al., 2015), Places 205 (Zhou et al., 2014a), iNaturalist 2018 (Van Horn et al., 2018), and CIFAR-100 (Krizhevsky, 2009). These datasets vary in their size, purpose, and the geographical environments from which the images were captured. For example, IN-1k contains over 1.2 million images compared to CIFAR-100, which contains only 60,000 images; and while IN-1k is focused on object recognition, iNaturalist and Places are focused on species and scene recognition.
In Table 1, we present linear probing image classification results on IN-1k under different linear evaluation protocols, using different amounts of data and aggregating features from different layers. E.g., '100%, last 4 layers' applies linear probing on the entire IN-1k data, and the representation of each image is a concatenation of four feature vectors, each of which summarizes its corresponding layer via average pooling. In Table 2, we compare linear probing results of common MIM methods on IN-1k, reporting previously published performance; all methods in Table 2 perform linear probing on the output of the last layer.
StoP improves the baseline performance on all architectures examined, e.g., +2.5% in linear probing with ViT-H using 1% of the labeled data, and +1.6% when using features from the last 4 layers with ViT-B on the full IN-1k data. Furthermore, StoP leads to improvements on downstream linear probing tasks (see Table 4), e.g., a 3.3% improvement on iNAT using ViT-H and 1.3% on counting. This confirms that the learned representations improve a large variety of image recognition tasks. On full finetuning using 1% of the labeled data, we observe similar gains (see Table 5), e.g., a +2.3% improvement in
Top-1 accuracy using a ViT-L model. We provide the full finetuning results in Table 16, Appendix C.2.

Table 1. StoP compared to deterministic sinusoidal positional embeddings on IN-1k. StoP leads to consistent linear probing improvements in all settings. When applying linear probing to a ViT-H model trained with StoP, using only 1% of the labeled data and average-pooled features from the last layer, StoP yields a +2.5% improvement. The I-JEPA baseline uses sinusoidal positional embeddings.

Table 2. Linear evaluation on IN-1k. Replacing sinusoidal positional embeddings with StoP in I-JEPA significantly improves linear probing results.
Counting and depth ordering. We assess downstream performance on tasks that require fine-grained object representations, like counting and depth ordering, using the CLEVR (Johnson et al., 2017) dataset. Table 4 provides evidence that StoP significantly improves counting (+1.3%) and slightly improves depth ordering (+0.1%).
Dense prediction. To evaluate how well StoP performs on dense prediction tasks, i.e., tasks that require fine-grained spatial representations, we use the learned models for semi-supervised video object segmentation on the DAVIS 2017 (Pont-Tuset et al., 2017) dataset. We follow previous works (e.g., Jabri et al. (2020); Caron et al. (2021)) and use the pretrained model to extract frame features, then use patch-level affinities between frames to track the first segmentation mask. We report semi-supervised video object segmentation results in Table 3. We find that StoP significantly improves over I-JEPA with deterministic sinusoidal location features; for example, we observe an improvement of +2.5% in J&F using ViT-L.
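The tracking protocol can be sketched as follows; this is a toy, hypothetical rendition of affinity-based label propagation (the top-k value, temperature, and shapes are assumptions), not the paper's evaluation code:

```python
import numpy as np

def propagate_labels(feat_prev, labels_prev, feat_next, topk=5, tau=0.1):
    # Toy sketch of mask tracking via patch affinities (in the spirit of
    # Jabri et al., 2020): each next-frame patch takes a softmax-weighted
    # vote over its top-k most similar previous-frame patches.
    # feat_* : (num_patches, dim), labels_prev : (num_patches, classes).
    a = feat_next / np.linalg.norm(feat_next, axis=1, keepdims=True)
    b = feat_prev / np.linalg.norm(feat_prev, axis=1, keepdims=True)
    sim = a @ b.T                                # cosine affinities
    idx = np.argsort(-sim, axis=1)[:, :topk]     # top-k neighbors
    rows = np.arange(sim.shape[0])[:, None]
    w = np.exp(sim[rows, idx] / tau)
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkc->nc", w, labels_prev[idx])  # weighted vote

# Toy usage: with identical frames the labels are simply copied over.
feats = np.eye(6)                  # 6 orthogonal "patch" features
labels = np.zeros((6, 2))
labels[:3, 0] = 1.0                # first 3 patches: class 0
labels[3:, 1] = 1.0                # last 3 patches: class 1
out = propagate_labels(feats, labels, feats)
print(out.argmax(axis=1))  # [0 0 0 1 1 1]
```

Sharper, less location-correlated features make these affinities more discriminative, which is one way to read the J&F gains.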
Table 3. Semi-supervised video object segmentation. MIM with StoP learns features with a finer level of granularity. Results are reported on the DAVIS 2017 dataset.
Conclusion
In this work, we proposed to use stochastic positional embedding (StoP) to tackle location uncertainty in MIM. By conditioning on stochastic masked token positions, our model learns features that are more robust to location uncertainty. The effectiveness of this approach is demonstrated on various datasets and downstream tasks, outperforming existing MIM methods and highlighting its potential for self-supervised learning. Based on our experiments and visualizations, modeling location uncertainties with StoP reduces overfitting to location features.
Ablations
Here we pretrain all models for 300 epochs on 4 V100 nodes, with a total batch size of 2048. In all ablation experiments, we follow the exact recipe of Assran et al. (2023); the full config is included in Table 9 for completeness.
To evaluate the pretrained models, we use linear probing on 1% of IN-1k (Russakovsky et al., 2015). To obtain the features of an image, we apply the target encoder to the image to obtain a sequence of tokens, then average the tokens into a single representative vector. The linear classifier is trained on this representation while the rest of the target encoder remains frozen.
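As an illustration of this probing protocol, a self-contained sketch with a random stand-in for the frozen target encoder; the encoder outputs, data sizes, labels, and the ridge-regression probe are all toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def image_representation(token_seq):
    # Average-pool the frozen target encoder's output tokens into a
    # single vector per image, as in the protocol described above.
    return token_seq.mean(axis=0)

# Stand-ins: 200 "images", each a sequence of 196 tokens of dim 32.
tokens = rng.standard_normal((200, 196, 32))
feats = np.stack([image_representation(t) for t in tokens])  # (200, 32)
labels = (feats[:, 0] > 0).astype(float)  # toy, linearly separable

# Linear probe: closed-form ridge regression on the frozen features.
X = np.hstack([feats, np.ones((200, 1))])                    # add bias
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(33), X.T @ labels)
acc = ((X @ w > 0.5) == (labels > 0.5)).mean()
print(acc)  # high, since the labels are linearly separable by design
```

Only the linear head is trained; the encoder stays fixed, so the probe measures how linearly decodable the representation already is.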
Analysis
To explain how StoP affects MIM, we analyze the learned model weights, visualize the stochastic positional embeddings, and visualize the predicted features.
Table 6. Different positional embeddings. Linear probing on IN-1k using only 1% of the labels. Stochastic positions (StoP) outperform other common deterministic variants by 3.3%.

Figure 4. Increasing σ induces regularization. Increasing the prior σ (where Σ = σAA^T) induces regularization on A and increases the norm of the masked token, which preserves the masked token's information relative to the added noise.
StoP induces regularization. The matrix A is used to project both noise tokens and context embedding tokens. We hypothesize that StoP implicitly regularizes A. To test this hypothesis, we train models with StoP, changing only the hyperparameter σ (see Figure 4). We find that increasing σ decreases the norm of A, which can be viewed as regularization. On the other hand, increasing σ increases the norm of the masked token bias m̃. We speculate that the masked token bias grows in scale to avoid losing its information relative to the noise.
To further analyze this phenomenon, we train additional models that apply ℓ1 or ℓ2 regularization to A while keeping the positional embeddings of masked tokens deterministic. We find that StoP leads to a +2% improvement
over ℓ1 and a +2.1% improvement over ℓ2 regularization. We therefore conclude that StoP is superior to simple regularization.

Table 7. Applying noise to different tokens. Applying learned noise to the (sine-cosine) positional embeddings of context and/or masked tokens. We report linear evaluation accuracy using 1% of IN-1k.

Table 8. Low-resolution prediction. Performance of StoP compared to models that predict features at lower scales via max pooling or bilinear resizing. We report linear evaluation accuracy using 1% of IN-1k. StoP performs better than low-resolution prediction.
Stochastic positional embedding visualization. To visualize how StoP affects the similarity between different positions, we plot the similarity matrix between a stochastic positional embedding query and the predefined sine-cosine deterministic positions (Figure 5). With StoP, query locations are similar to a wider range of neighboring locations. Building on this observation, we train models to investigate whether directly predicting lower-scale features is beneficial. We trained models to predict features at both the original scale and a version downscaled by a factor of 2, using bilinear resizing and max pooling for downscaling. However, we found that predicting lower-scale features does not improve performance (see Table 8).
Prediction visualization. We include heatmaps to visualize the similarity of a predicted token to all other tokens within the same image (see Figure 6). For a given image, mask, and masked patch of interest, we compute the cosine similarity between the predicted patch and all other token representations within the same image, followed by a softmax. For I-JEPA with sine-cosine positional embeddings, the visualization indicates that adjacent tokens tend to share similar features, implying a correlation between features and spatial location. In contrast, StoP produces predictions correlated with non-neighboring small areas. We speculate that using StoP leads to learning features that are more semantic and prevents overfitting to location features.
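As an illustrative sketch of this visualization step (variable names and shapes are ours, not from the released code), the heatmap for a single predicted token is its cosine similarity against all target-encoder tokens, followed by a softmax:

```python
import numpy as np

def prediction_heatmap(predicted_token, target_tokens, temperature=1.0):
    """Similarity heatmap of one predicted token against all image tokens.

    predicted_token: (d,) feature vector produced by the predictor.
    target_tokens:   (K, d) per-patch features from the target encoder.
    Returns a (K,) probability map (cosine similarity + softmax).
    """
    p = predicted_token / np.linalg.norm(predicted_token)
    t = target_tokens / np.linalg.norm(target_tokens, axis=1, keepdims=True)
    logits = (t @ p) / temperature    # cosine similarity per token
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: 196 tokens (a 14x14 grid) with 64-d features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))
heat = prediction_heatmap(tokens[5], tokens)  # reshape to (14, 14) to plot
```

Reshaping `heat` to the patch grid and overlaying it on the image gives the heatmaps shown in Figure 6.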

Figure 5. Similarity matrices of deterministic and stochastic positional embedding (StoP) to a query position . Each row represents the similarity given a different query position. StoP leads to a spatially smooth similarity matrix, thereby making it hard to distinguish the exact location of a given patch.
Limitations
We applied StoP to I-JEPA, which performs image reconstruction in feature space. However, our attempts to apply StoP to MIM methods that use pixel-based reconstruction, mainly MAE, were not successful. We speculate that adding StoP to MAE might make pixel reconstruction too difficult. Additionally, StoP tackles location uncertainty but not appearance uncertainty, which we believe is implicitly modeled by reconstructing tokens in feature space. Also, when modeling stochastic positions, it might be possible to condition the noise on the input image, namely the context tokens. We leave this extension for future work. Lastly, while combining StoP with MIM shows significant improvements, invariance-based approaches (e.g., iBOT, DINO) still perform slightly better than MIM approaches.
Related Work
Optimal Predictor
Consider a random variable X (corresponding to the context in our case; for simplicity, assume X is just the positional embedding of the context) that is used to predict a variable Y (corresponding to the target in our case). Instead of predicting from X, we use a noise variable Z that is independent of both X and Y, and provide the predictor with only the noisy result R = g(X, Z). Here g is some mixing function (in our case g(x, z) = x + z). We next derive the optimal predictor f(R) in this case. Formally, we want to minimize:
$$
\min_{f} \; \mathbb{E}\left[ \left( f(R) - Y \right)^2 \right]
$$
A classic result in estimation is that this is optimized by the conditional expectation $f(r) = \mathbb{E}[Y \mid R = r]$.
We simplify this as follows:
$$
f(r) = \mathbb{E}[Y \mid R = r] = \int p(x \mid r)\, \mathbb{E}[Y \mid R = r, X = x]\, dx = \int p(x \mid r)\, \mathbb{E}[Y \mid X = x]\, dx
$$
where in the second line we used the fact that:
$$
\mathbb{E}[Y \mid R = r, X = x] = \mathbb{E}[Y \mid X = x]
$$
To further illustrate, consider the case where Z is Gaussian with zero mean and unit variance. Then p(x | r) is also Gaussian with expectation r, and the expression above amounts to a convolution of the clean expected values with a Gaussian:
$$
f(r) = \int \mathcal{N}(x;\, r, I)\, \mathbb{E}[Y \mid X = x]\, dx
$$
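This Gaussian-smoothing claim can be checked numerically. The following toy simulation (ours, not part of the paper) uses Y = sin(X) with a near-flat prior on X, for which the Gaussian convolution has the closed form exp(-1/2)·sin(r):

```python
import numpy as np

# Toy check: with R = X + Z, Z ~ N(0, 1), the optimal predictor
# E[Y | R = r] equals the clean target E[Y | X = x] smoothed by a
# Gaussian. For Y = sin(X) and an (approximately) flat prior on X,
# the convolution has the closed form exp(-1/2) * sin(r).
rng = np.random.default_rng(0)
x = rng.uniform(-10.0, 10.0, size=1_000_000)  # near-flat prior on X
z = rng.normal(0.0, 1.0, size=x.shape)        # independent noise Z
r = x + z                                     # noisy input R
y = np.sin(x)                                 # clean target

r0 = 1.0
in_bin = np.abs(r - r0) < 0.1                 # Monte-Carlo estimate of
mc_estimate = y[in_bin].mean()                # E[Y | R ~ r0]
closed_form = np.exp(-0.5) * np.sin(r0)       # Gaussian-smoothed sin
```

The two quantities agree up to Monte-Carlo error, matching the derivation above.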
Conclusion
In this work, we proposed to use stochastic positional embedding (StoP) to tackle location uncertainty in MIM. By conditioning on stochastic masked token positions, our model learns features that are more robust to location uncertainty. The effectiveness of this approach is demonstrated on various datasets and downstream tasks, outperforming existing MIM methods and highlighting its potential for self-supervised learning. Based on our experiments and visualizations, modeling location uncertainties with StoP reduces overfitting to location features.
Acknowledgments:
References
Appendix
Ablations
Here we pretrain all models for 300 epochs using 4 V100 nodes, with a total batch size of 2048. In all ablation study experiments, we follow the exact recipe of Assran et al. (2023). We include the full config in Table 9 for completeness.
To evaluate the pretrained models, we use linear probing evaluation using 1% of IN-1k (Russakovsky et al., 2015). To obtain the features of an image, we apply the target encoder over the image to obtain a sequence of tokens corresponding to the image. We then average the tokens to obtain a single representative vector. The linear classifier is trained over this representation, maintaining the rest of the target encoder layers fixed.
Downstream Tasks
Self-supervised learning (SSL) has emerged as a promising paradigm in deep learning. By constructing pretext training tasks, it’s possible to leverage unlabeled data to learn representations that can be transferred across a wide range of downstream tasks. This approach has shown remarkable progress in various domains, including natural language processing [16, 8, 15], speech recognition [4, 2, 44], and computer vision [50, 35, 10, 24].
In NLP, masked language modeling (MLM) has emerged as a prominent pre-training task. MLM’s primary goal is to predict masked parts of a text based on the rest of the text. This task is an essential component of the training process for popular models such as BERT [16], GPT [8], and similar models. Likewise, in computer vision, there exists a natural counterpart to MLM, known as Masked Image Modeling (MIM). In MIM, part of an image is masked, and the pretext task is to complete it. While this approach has been considered for quite some time [35] and is a form of denoising auto-encoder [42], the dominant approach to self-supervised learning (SSL) in computer vision relies on learning representations that are invariant to handcrafted image augmentations [39, 22, 6]. Although these approaches produce highly semantic representations, they necessitate prior knowledge of task-specific invariances [46].
More recently, new MIM methods have emerged. Masked Auto-Encoders (MAE) [24], which are trained to minimize a reconstruction error in pixel space, have demonstrated competitive fine-tuning performance relative to SSL methods that rely on handcrafted image augmentations. Some follow-up works have removed the pixel-space decoder to allow reconstruction directly in the latent space [3, 53, 1]. The most recent is I-JEPA [1], which stressed the importance of masking large blocks and of predicting latent representations rather than pixel values. These works have narrowed the gap between MIM methods and invariance-based methods. However, the latter still outperform the former on tasks such as ImageNet linear probing.
Here we argue that MIM suffers from an inherent difficulty that makes it challenging to learn representations. For instance, let’s take a partial image of a dog, as depicted in Figure 1. We know that the image contains the tail of the dog, but we cannot predict its precise location. Yet, current MIM methods do not model this uncertainty and attempt to provide an accurately localized prediction.
In this work, we propose a solution to address this challenge by introducing a stochastic MIM model. There are various approaches to achieve this, and we suggest a simple yet effective one. Instead of training the model to make predictions in exact locations, we introduce noise to masked tokens positions, thereby forcing the model to make stochastic predictions. This approach guides the model towards features that are more resilient to location uncertainties, such as the fact that a tail exists somewhere in a broad region of the image. However, it is crucial to design the noise injection method carefully, so that the model does not merely scale down weights to “overcome” the noise. We demonstrate how to tackle this issue in our proposed method.
Our contributions are twofold. First, we propose a novel approach for MIM that addresses the uncertainty in the MIM pretext task (e.g, the location of semantic features in the image is stochastic). Second, we demonstrate that our approach outperforms existing methods across a variety of downstream tasks, highlighting its effectiveness.
Invariance-based methods. Invariance-based methods involve training an encoder to ensure similar augmentations of the same image have similar representations while avoiding a trivial solution. For example, contrastive learning is used to prevent collapse to trivial solution by introducing negative examples [23, 18, 10, 25, 12, 19]. This can be achieved using a memory bank of previous instances [45, 34, 39, 33]. However, there are also non-contrastive solutions that have been proposed. Of particular interest, a momentum encoder has been shown to prevent collapse even without the use of negative pairs [22, 9, 38]. Other methods include stopping the gradient to one branch [13] or applying regularization using batch statistics [48, 6, 7, 20, 26]. Our approach is based on MIM, which doesn’t require assumptions on batch statistics or handcrafted invariances.
Masked image modeling (MIM). There is a significant body of research exploring visual representation learning by predicting corrupted sensory inputs. Denoising autoencoders [43], for example, use random noise as input corruption, while context encoders [35] regress an entire image region based on its surrounding. The idea behind masked image modeling [24, 47, 5] has emerged as a way to address image denoising. In this approach, a Vision Transformer [17] is used to reconstruct missing input patches. The Masked Autoencoders (MAE) architecture [24], for example, efficiently reconstructs missing patches in pixel space and achieves strong performance on large labeled datasets. Other approaches, such as BEiT [5], predict a latent code obtained using a pretrained tokenizer. However, pixel-level pre-training has been shown to outperform BEiT in fine-tuning. SimMiM [47] explores simple reconstruction targets like color clusters but shows no significant advantages over pixel space reconstruction.
Joint embedding predictive architecture (JEPA). The recently proposed JEPA [32] framework generalizes both the invariance-based and MIM approaches under the same umbrella. iBOT [53] is a state-of-the-art representation learning method that combines both global invariance loss and a MIM based loss, using an online tokenizer. Recently, Image-JEPA (I-JEPA) [1] was proposed as a non-generative approach for self-supervised learning of semantic image representations. I-JEPA predicts the representations of various target blocks in an image from a single context block to guide it toward producing semantic representations. We propose FlexPredict, a model that focuses on the prediction of coarse and more semantic features.
Our work leverages the I-JEPA framework [1], which we introduce by outlining its key concept. Specifically, I-JEPA is designed to predict the features of target blocks, based on contextual blocks from the same image. We proceed to elaborate on this in more detail.
Given an image, the standard tokenization process presented in [17] is applied. Specifically, given an input image $I_x \in \mathbb{R}^{H \times W \times 3}$, it is first patchified into a sequence of non-overlapping image patches $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_K)$, where $\hat{p}_i \in \mathbb{R}^{H' \times W' \times 3}$ and $K = \frac{HW}{H'W'}$ is the number of patches. Then, each patch is projected to $\mathbb{R}^{d_e}$ through a linear fully connected layer. Next, the positional embedding features of the $i$-th token are added to every patch $\hat{p}_i$, resulting in the patchified set $p = \{p_1, \ldots, p_K\}$.
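The tokenization above can be sketched as follows (shapes follow the text; the projection matrix and positional embeddings are random stand-ins for learned/precomputed ones):

```python
import numpy as np

def patchify(image, patch, w_proj, pos_embed):
    """Split an image into non-overlapping patches and embed them.

    image:     (H, W, 3) array.
    patch:     patch side length (H' = W' = patch).
    w_proj:    (patch*patch*3, d_e) linear projection.
    pos_embed: (K, d_e) positional embeddings, K = H*W / patch**2.
    Returns the token sequence p of shape (K, d_e).
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    return x @ w_proj + pos_embed       # project, then add positions

rng = np.random.default_rng(0)
patch, d_e = 16, 64
img = rng.normal(size=(224, 224, 3))
k = (224 // patch) ** 2                 # 196 tokens for a 14x14 grid
tokens = patchify(img, patch, rng.normal(size=(patch * patch * 3, d_e)),
                  rng.normal(size=(k, d_e)))
```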
Let $x = \{p_i \mid i \in B_x\}$ be the set of context patches, where $B_x$ denotes the set of context indices. The set of context tokens is randomly chosen as in [1]. First, the context tokens are processed via an encoder model $f_\theta$ to obtain deep representations:
$$
s_x = f_\theta(x)
$$
where $s_{x_i} \in \mathbb{R}^{d_e}$ is the $i$-th context token representation.
First, a target block of patches is randomly chosen (e.g., the tokens annotated in yellow in Figure 2). We denote its corresponding patch indices by $B_y$. Next, we define $m = \{\psi_j + \tilde{m}\}_{j \in B_y}$ to be the set of masked tokens, where for each $j \in B_y$, token $m_j$ is the sum of a learned masked token $\tilde{m}$, shared across all tokens, and a positional embedding $\psi_j$. The predictor $g$ is then used to map from the context tokens and masked tokens to the predicted tokens: $\hat{s}_y = g(s_x, m)$.
To supervise the prediction, $s_y = \{s_{y_i}\}_{i \in B_y}$ is obtained by feeding the patchified image tokens $p$ into a target encoder $f_{\bar{\theta}}$, then selecting the tokens corresponding to $B_y$. Finally, the loss is the mean squared error between $s_y$ and the predicted tokens $\hat{s}_y$:
$$
\mathcal{L} = \frac{1}{|B_y|} \sum_{i \in B_y} \left\| s_{y_i} - \hat{s}_{y_i} \right\|_2^2 \qquad (1)
$$
Here $s_y$ is taken as constant, and the parameters of the target encoder $f_{\bar{\theta}}$ are updated via an exponential moving average of the context encoder $f_\theta$, which has been shown to prevent collapse [9, 22].
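The loss and the target-encoder update can be sketched as follows (a simplified numpy version; real implementations update framework parameter tensors in place, and the momentum value below is illustrative):

```python
import numpy as np

def mim_loss(pred_tokens, target_tokens):
    """Mean squared error between predicted and target features (Eq. 1)."""
    return np.mean((pred_tokens - target_tokens) ** 2)

def ema_update(target_params, context_params, momentum=0.996):
    """Update the target encoder as an exponential moving average of
    the context encoder; the target encoder receives no gradients."""
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params, context_params)]

rng = np.random.default_rng(0)
s_hat = rng.normal(size=(4, 64))      # predictor output for B_y
s_y = rng.normal(size=(4, 64))        # target-encoder features (constant)
loss = mim_loss(s_hat, s_y)

theta = [rng.normal(size=(8, 8))]     # context encoder weights
theta_bar = [np.zeros((8, 8))]        # target encoder weights
theta_bar = ema_update(theta_bar, theta)
```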
The I-JEPA method and other MIM-like approaches condition the predictor model on the locations of the target patches, given by the masked tokens positional embeddings, and train the model to predict their content (either in pixel or latent space). This approach does not take into account that the exact location of objects is highly stochastic.
Instead, we force our model to be more flexible in representing locations by conditioning our model on stochastic positions, such that it is impossible to provide a location-accurate prediction. Hence, we refer to our approach as FlexPredict. A high-level schematic view of the model is included in Figure 2.
In what follows, we will explore the process of replacing the positional embeddings of the masked tokens with a stochastic alternative. This involves a few crucial steps, including defining the distribution of the stochastic positions, parameterizing it appropriately, and implementing measures to prevent the model from reducing the impact of the noise to the point where it becomes negligible.
In most Vision Transformer implementations, the position of a patch $i$ is encoded via an embedding vector $\psi_i$. A common choice is to map the position to sine and cosine features at different frequencies [41, 17]. Here we wish to replace this fixed, deterministic mapping with a stochastic map, contrary to past works that use a deterministic mapping to determine the positional embedding of a token [1, 24].
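For reference, the deterministic sine-cosine embedding $\psi$ can be sketched as follows (a 1-D index version for clarity; ViT-style models typically use a 2-D grid variant, but the construction is analogous):

```python
import numpy as np

def sincos_pos_embed(num_positions, dim):
    """Deterministic sine-cosine positional embeddings (one per patch).

    Follows the standard transformer recipe: each position is mapped to
    sine and cosine features at geometrically spaced frequencies.
    Returns an array of shape (num_positions, dim); dim must be even.
    """
    pos = np.arange(num_positions)[:, None]                  # (K, 1)
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = pos * freqs[None, :]                            # (K, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

psi = sincos_pos_embed(196, 64)   # embeddings for a 14x14 patch grid
```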
Given a position $i$, we denote by $\hat{\psi}_i$ the random variable providing the position embedding. We assume:
$$
\hat{\psi}_i \sim \mathcal{N}(\psi_i, \Sigma) \qquad (2)
$$
Namely, $\hat{\psi}_i$ is distributed as a Gaussian whose mean is the fixed embedding $\psi_i$, with covariance matrix $\Sigma \in \mathbb{R}^{d_p \times d_p}$.
Naturally, we want to learn an optimal $\Sigma$. However, this is challenging for two reasons. First, the optimization process might set the values of $\Sigma$ to zero, eliminating the randomness. We refer to this case as a “shortcut solution”. Second, the sampling process of $\hat{\psi}$ is non-differentiable, and therefore we cannot derive gradients to directly optimize it with SGD.
To solve these issues, we first parameterize $\Sigma$, then describe how to avoid the “shortcut solution”, and finally use the reparameterization trick to derive a differentiable algorithm. We parameterize $\Sigma$ using a general formulation of a low-rank covariance matrix:
$$
\Sigma = \sigma A A^{\top}
$$
where $A \in \mathbb{R}^{d_p \times d_e}$ is a learned matrix and $\sigma \in \mathbb{R}^{+}$ is a positive predefined scalar (a hyperparameter). By learning the matrix $A$, this formulation is flexible enough to, e.g., assign small variance to low-frequency location features while assigning higher variance to higher-frequency features, and also to capture correlations between location features.
Without posing any constraints on $A$, it is easy for the model to scale down the noise by setting $A = 0$, making the prediction problem deterministic again, and thereby easier. This would collapse back to the standard I-JEPA model and lose the advantage of noisy spatial predictions. To avoid this shortcut, we use the following simple trick: we use the matrix $A$ to linearly project every context token $s_{x_i}$ as well, $\hat{c}_i = A s_{x_i} + b$, where $b$ is a learned bias. With this trick, setting $A$ to zero would also zero out the context tokens, making the prediction task too difficult for the network and thereby avoiding the shortcut. This can also be viewed as a regularization of $A$, which we discuss further in Section 7.
Since $\hat{\psi}$ is sampled from a parameterized distribution, it is not immediately clear how to optimize over the learned parameter $A$, because the sampling operation is non-differentiable in $A$. However, a standard trick in these cases is to reparameterize the distribution so that sampling is from a fixed distribution that does not depend on $A$ (e.g., see [29]). Specifically, we generate samples from $\hat{\psi}$ by first sampling a vector $n_i \in \mathbb{R}^{d_e}$ from a Gaussian distribution, $n_i \sim \mathcal{N}(0, \sigma I)$, and then setting:
$$
\hat{\psi}_i = \psi_i + A n_i
$$
The resulting distribution of $\hat{\psi}$ is equal to that in Equation 2; however, we can now differentiate directly through $A$.
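Putting the pieces together, here is a sketch of the reparameterized sampling and the anti-shortcut projection (dimensions and the 0.1 scale on $A$ are illustrative, not taken from the paper's configs):

```python
import numpy as np

def stochastic_positions(psi, a_mat, sigma, rng):
    """Reparameterized sample of stochastic positional embeddings:
    psi_hat = psi + A n with n ~ N(0, sigma * I), so Cov = sigma * A A^T.
    In an autodiff framework this is differentiable in A, since the
    randomness enters only through n."""
    k, d_e = psi.shape[0], a_mat.shape[1]
    n = rng.normal(0.0, np.sqrt(sigma), size=(k, d_e))  # n ~ N(0, sigma I)
    return psi + n @ a_mat.T                            # psi_i + A n_i

def project_context(s_x, a_mat, bias):
    """Anti-shortcut trick: reuse A on the context tokens, so driving
    A -> 0 would also zero out the context and make prediction harder."""
    return s_x @ a_mat.T + bias                         # c_i = A s_x_i + b

rng = np.random.default_rng(0)
d_p, d_e, k = 64, 32, 196
a_mat = rng.normal(size=(d_p, d_e)) * 0.1
psi = rng.normal(size=(k, d_p))                         # deterministic psi
psi_hat = stochastic_positions(psi, a_mat, sigma=0.25, rng=rng)
c = project_context(rng.normal(size=(k, d_e)), a_mat, np.zeros(d_p))
```

Note that with `a_mat` set to zero the sampled positions collapse back to the deterministic `psi`, which is exactly the shortcut the context projection is designed to penalize.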
Finally, for every $i \in B_x$ and $j \in B_y$, we define the sets of context and masked tokens to be:
$$
c_i = A s_{x_i} + b, \qquad m_j = \psi_j + A n_j + \tilde{m}
$$
Note that here the masked token $m_j$ has a stochastic position, and $\tilde{m}$ is a learned bias shared across all positions. We can then apply $g$ to predict the target features, $\hat{s}_y = g(c, m)$, and use the same loss as in Equation 1.
Our approach relies on using stochastic positional embeddings. Here we provide further analysis of this prediction setting and show that the optimal prediction is indeed to perform spatial smoothing.
Consider a random variable $X$ (corresponding to the context in our case; for simplicity, assume $X$ is just the positional embedding of the context) that is used to predict a variable $Y$ (corresponding to the target in our case). Instead of predicting from $X$, we use a noise variable $Z$ that is independent of both $X$ and $Y$, and provide the predictor with only the noisy result $R = g(X, Z)$. Here $g$ is some mixing function (in our case $g(x, z) = x + z$). We next derive the optimal predictor $f(R)$ in this case. Formally, we want to minimize:
$$
\min_{f} \; \mathbb{E}\left[ \left( f(R) - Y \right)^2 \right]
$$
A classic result in estimation is that this is optimized by the conditional expectation $f(r) = \mathbb{E}[Y \mid R = r]$. We simplify this as follows:
$$
f(r) = \mathbb{E}[Y \mid R = r] = \int p(x \mid r)\, \mathbb{E}[Y \mid R = r, X = x]\, dx = \int p(x \mid r)\, \mathbb{E}[Y \mid X = x]\, dx
$$
where in the second line we used the fact that:
$$
\mathbb{E}[Y \mid R = r, X = x] = \mathbb{E}[Y \mid X = x]
$$
To further illustrate, consider the case where $z$ is Gaussian with zero mean and unit variance. Then $p(x \mid r)$ is also Gaussian with expectation $r$, and the expression above amounts to a convolution of the clean expected values with a Gaussian:
$$
f(r) = \int \mathcal{N}(x;\, r, I)\, \mathbb{E}[Y \mid X = x]\, dx
$$
Next, we turn to discuss the main experiments presented in the paper. We start by discussing the ablation study and design choices in Section 5.1. Then in Section 5.2, we describe the application of FlexPredict to various downstream tasks including image recognition, dense prediction, and low-level vision tasks.
Our primary focus was to evaluate the effectiveness of adding noise. For this purpose, we experimented with learning $A$ under different values of the hyperparameter $\sigma$. We also investigated the impact of adding noise from fixed Gaussian distributions, namely $\Sigma = \sigma I$, without learning. Lastly, we evaluate the effect of applying FlexPredict to the positions of context and/or masked tokens.
We evaluated various design options for the FlexPredict model. For each setting, we implemented the encoder and predictor using the ViT-B architecture and pre-trained them for 300 epochs on IN-1k. We then assessed linear probing performance on IN-1k using only 1% of the available labels.
We pre-trained the FlexPredict model on IN-1k for 600 epochs, utilizing either ViT-B or ViT-L architectures for the encoder and predictor. Subsequently, we evaluated the model’s performance on a variety of downstream tasks. We include the full implementation details in the Supplementary Material.
Following past works, we focus on evaluating the (target) encoder representations [24, 1], and use the standard VISSL [21] evaluation protocol like in [1].
Image recognition. For image classification, we perform a linear probing evaluation of StoP on multiple datasets, including ImageNet (IN-1k) (Russakovsky et al., 2015), Places 205 (Zhou et al., 2014a), iNaturalist 2018 (Van Horn et al., 2018), and CIFAR 100 (Krizhevsky, 2009). These datasets vary in their size, their purpose, and the geographical environments from which the images were captured. For example, IN-1k contains over 1.2 million images compared to CIFAR-100, which contains only 60,000 images; and while IN-1k is focused on object recognition, iNaturalist and Places are focused on species and scene recognition, respectively.
Dense prediction. To evaluate how well StoP performs on dense prediction tasks, i.e., tasks that require fine-grained spatial representations, we utilized the learned models for semi-supervised video object segmentation on the DAVIS 2017 (Pont-Tuset et al., 2017) dataset. We follow previous works (e.g., Jabri et al. (2020); Caron et al. (2021)) and use the pretrained model to extract frame features, using patch-level affinities between frames to track the first segmentation mask. We include semi-supervised video object segmentation results in Table 3. We find that StoP significantly improves over I-JEPA with deterministic sinusoidal location features; for example, we observe an improvement of +2.5% in J&F using ViT-L.
We assessed the linear probing performance of our model on downstream tasks related to low-level vision. These tasks included object counting and object ordering by depth, which were evaluated using the CLEVR [28] dataset. In order to accurately perform these tasks, the model needed to not only recognize objects but also capture their location features.
We report the ablation study results in Section 6.1, then discuss results on various downstream tasks in Section 6.2.
We present results comparing different noise types and the impact of changing the hyperparameter $\sigma$. Figure 3 indicates that it is optimal to learn the parameters of the distribution as in FlexPredict, rather than use fixed parameters. Our findings demonstrate that setting $\sigma = 0.25$ leads to an improvement of 3.5% points compared to I-JEPA. Additionally, Table 1 reveals that FlexPredict is most beneficial when applied solely to the positional embeddings of masked tokens, not to the context.
In Table 2, we present the linear probing image classification results on IN-1k. Our approach, FlexPredict, achieves a performance improvement of 1.6% and 0.9% when using ViT-B and ViT-L, respectively, compared to previous MIM methods. Additionally, FlexPredict narrows the relative performance gap from iBOT [53] by 25%. Furthermore, our approach outperforms existing methods on downstream linear probing tasks; for example, FlexPredict leads to over 10% improvement on CIFAR-100 using ViT-B and 1% using ViT-L. This confirms that the learned representations lead to improvements on a large variety of image recognition tasks.
We include semi-supervised video object segmentation results in Table 4. We find that FlexPredict significantly improves over I-JEPA [1], e.g., an improvement of 2.5% on J&F using ViT-L. Notably, while scaling the model does not lead to improvements with I-JEPA here, scaling from ViT-B to ViT-L with FlexPredict yields a 1.4% improvement.
Table 5 provides evidence that the learned representations of FlexPredict perform at least on par with I-JEPA models on low-level tasks such as counting and depth ordering on the CLEVR dataset.
We perform a thorough analysis of FlexPredict. Specifically, we examine the stochastic effect of FlexPredict and attempt to interpret the properties of the learned model.
We train FlexPredict models, changing only the hyperparameter $\sigma$. We find that increasing the value of $\sigma$ leads to a decrease in the norm of $A$, which can be viewed as regularization. On the other hand, increasing $\sigma$ leads to an increase in the norm of the masked token $\tilde{m}$. The masked token scale increases to prevent losing its information relative to the noise. We show this dynamic in Figure 5.
Based on the observations above, we train additional models to check whether FlexPredict can be explained by regularization. Specifically, we train I-JEPA models while applying $\ell_1$ regularization on the predictor’s linear projection layer weights. We evaluate linear probing performance using 1% of the labels and find this leads to a 1.5% improvement over I-JEPA, compared to a 3.5% improvement using FlexPredict.
In order to visualize stochastic positional embeddings, we sample stochastic positions and generate a similarity matrix of each sample with the predefined deterministic positions. Figure 4 provides examples. Our findings show that when noise is added to a positional embedding, the resulting similarity matrix changes, making it similar to a wider range of neighboring locations.
We build on the observations above and train additional I-JEPA models to investigate whether FlexPredict’s performance could be achieved by predicting lower-scale features. We trained models to predict features at both the original scale and a version downscaled by a factor of 2, using bilinear resizing and max pooling for downscaling. However, we found that these methods did not significantly improve performance, as reported in Table 6.
We include heatmaps to visualize the similarity of a predicted token to all other tokens within the same image (see Figure 6). For a given image, mask, and masked patch of interest, we compute the cosine similarity between the predicted patch and all other token representations within the same image (given by the target encoder), followed by a softmax. For I-JEPA, the visualization indicates that adjacent tokens tend to share similar features, implying a correlation between features and spatial location. In contrast, FlexPredict produces predictions correlated with non-neighboring small areas. We speculate that training with stochastic positions prevents spatial adjacency bias.
In this work, we proposed FlexPredict, a stochastic model that tackles location uncertainties in the MIM task. By conditioning on stochastic masked tokens positions, our model learns features that are more robust to location uncertainties. The effectiveness of this approach is demonstrated on various datasets and downstream tasks, outperforming existing MIM methods and highlighting its potential for self-supervised learning. We speculate, based on our experiments and visualizations, that by modeling location uncertainties, FlexPredict suffers less from spatial adjacency bias. Other sources of uncertainty, like uncertainty in appearance, require further investigation in future work.
AG’s group has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant ERC HOLI 819080). TD’s group was funded by DoD, including DARPA LwLL, and the Berkeley AI Research (BAIR) Commons. This work was completed in partial fulfillment of the Ph.D. degree of the first author.
We include the full implementation details, pretraining configs and evaluation protocols for the Ablations (see Appendix A.1) and Downstream Tasks (Appendix A.2).
Here we pretrain all models for 300 epochs using 4 V100 nodes, with a total batch size of 2048. In all ablation study experiments, we follow the recipe of [1]. We include the full config in Table 7.
To evaluate the pretrained models, we use linear probing evaluation using 1% of IN-1k [37]. To obtain the features of an image, we apply the target encoder over the image to obtain a sequence of tokens corresponding to the image. We then average the tokens to obtain a single representative vector. The linear classifier is trained over this representation, maintaining the rest of the target encoder layers fixed.
Here we pretrain FlexPredict for 600 epochs using 4 V100 nodes, with a total batch size of 2048, using ViT-B (see config in Table 8) and ViT-L (see config in Table 9). We follow a similar config to [1], except that we use a lower learning rate; intuitively, since FlexPredict is stochastic, it is more sensitive to high learning rates.
For evaluation on downstream tasks, we use the features learned by the target encoder and follow the protocol of VISSL [21] that was utilized by I-JEPA [1]. Specifically, we report the best linear evaluation number among the average-pooled patch representation of the last layer and the concatenation of the last 4 layers of the average-pooled patch representations.
For baselines that use Vision Transformers [17] with a [cls] token (e.g, iBOT [53], DINO [9] or MAE [24]), we use the default configurations of VISSL [21] to evaluate the publicly available checkpoints on iNaturalist18 [40], CIFAR100 [31], Clevr/Count [28, 49], Clevr/Dist [28, 49], and Places205 [52]. Following the evaluation protocol of VISSL [21], we freeze the encoder and return the best number among the [cls] token representation of the last layer and the concatenation of the last 444 layers of the [cls] token.
For semi-supervised video object segmentation, we propagate the first labeled frame through the video using the similarity between adjacent frames' features. To label the video using the frozen features, we follow the code and hyperparameters of [9]. To evaluate the segmented videos, we use the evaluation code of DAVIS 2017 [36].
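In outline, one propagation step works as follows (a simplified numpy sketch of nearest-neighbor label propagation; the top-k soft weighting is an assumption loosely following [9], not the exact hyperparameters):

```python
import numpy as np

def propagate_labels(feat_prev, labels_prev, feat_next, k=5):
    """Propagate per-patch labels to the next frame via feature similarity.
    feat_*: (T, D) L2-normalized patch features; labels_prev: (T, C) soft labels."""
    sim = feat_next @ feat_prev.T                  # (T, T) cosine similarities
    idx = np.argsort(sim, axis=1)[:, -k:]          # top-k neighbors per patch
    out = np.zeros((feat_next.shape[0], labels_prev.shape[1]))
    for i in range(feat_next.shape[0]):
        w = np.exp(sim[i, idx[i]])                 # softmax weights over neighbors
        out[i] = (w / w.sum()) @ labels_prev[idx[i]]
    return out
```

Applied frame by frame, this carries the first frame's segmentation mask through the whole clip.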
Table: S6.T1: Applying noise to different tokens. Applying learned noise to context and/or masked tokens' positional embeddings. Accuracy is based on linear probing using 1% of the data from IN-1k.
| Method | Top-1 |
|---|---|
| No Noise (I-JEPA [1]) | 54.3 |
| Context tokens only | 55.1 |
| Masked tokens only | 57.8 |
| Masked + context tokens | 56.8 |
Table: S6.T2: Linear-evaluation on IN-1k. FlexPredict improves linear probing performance compared to other methods that do not rely on hand-crafted view data-augmentations during pretraining.
| Method | Arch. | Epochs | Top-1 |
|---|---|---|---|
| MIM methods, without view data augmentations | | | |
| data2vec [3] | ViT-L/16 | 1600 | 53.5 |
| MAE [24] | ViT-B/16 | 1600 | 68.0 |
| | ViT-L/16 | 1600 | 76.0 |
| I-JEPA [1] | ViT-B/16 | 600 | 72.9 |
| | ViT-L/16 | 600 | 77.5 |
| FlexPredict | ViT-B/16 | 600 | 74.5 |
| | ViT-L/16 | 600 | 78.4 |
| Invariance-based methods, using extra view data augmentations | | | |
| SimCLR v2 [11] | RN152 (2×) | 800 | 79.1 |
| DINO [9] | ViT-B/16 | 400 | 78.1 |
| MoCo v3 [14] | ViT-B/16 | 300 | 76.7 |
| iBOT [53] | ViT-B/16 | 250 | 79.8 |
| | ViT-L/16 | 250 | 81.0 |
Table: S6.T3: Linear-probe transfer for image classification. Linear-evaluation on downstream image classification tasks. FlexPredict significantly outperforms previous methods that also do not use augmentations (MAE and data2vec), and decreases the gap with the best view-invariance-based methods that leverage hand-crafted data augmentations during pretraining.
| Method | Arch. | CIFAR100 | Places205 | iNat18 |
|---|---|---|---|---|
| MIM methods, without view data augmentations | | | | |
| data2vec [3] | ViT-L/16 | 59.6 | 36.9 | 10.9 |
| MAE [24] | ViT-B/16 | 68.1 | 49.2 | 26.8 |
| | ViT-L/16 | 77.4 | 54.4 | 33.0 |
| I-JEPA [1] | ViT-B/16 | 69.2 | 53.4 | 43.4 |
| | ViT-L/16 | 83.6 | 56.5 | 48.4 |
| FlexPredict | ViT-B/16 | 81.2 | 54.3 | 44.7 |
| | ViT-L/16 | 84.7 | 57.2 | 49.2 |
| Invariance-based methods, using extra view data augmentations | | | | |
| DINO [9] | ViT-B/16 | 84.8 | 55.2 | 50.1 |
| iBOT [53] | ViT-B/16 | 85.5 | 56.7 | 50.0 |
| | ViT-L/16 | 88.3 | 60.4 | 57.3 |
Table: S6.T4: Video objects semi-supervised segmentation. The results demonstrate that, compared to MIM baselines, FlexPredict learns features at a finer level of granularity. Results are reported on the DAVIS 2017 [36] dataset.
| Method | Arch. | J-Mean | F-Mean | J&F Mean |
|---|---|---|---|---|
| MIM methods, without view data augmentations | | | | |
| MAE [24] | ViT-B/16 | 49.4 | 52.6 | 50.9 |
| | ViT-L/16 | 52.5 | 54.3 | 53.4 |
| I-JEPA [1] | ViT-B/16 | 56.1 | 56.2 | 56.1 |
| | ViT-L/16 | 56.1 | 55.7 | 55.9 |
| FlexPredict | ViT-B/16 | 56.6 | 57.3 | 57.0 |
| | ViT-L/16 | 58.1 | 58.7 | 58.4 |
| Invariance-based methods, using extra view data augmentations | | | | |
| DINO [9] | ViT-B/16 | 60.7 | 63.9 | 62.3 |
| iBOT [53] | ViT-B/16 | 60.9 | 63.3 | 62.1 |
| | ViT-L/16 | 61.7 | 63.9 | 62.8 |
Table: S6.T5: Linear-probing on low-level vision downstream tasks like object counting (Clevr/Count) and depth prediction (Clevr/Dist). FlexPredict effectively captures low-level location features and is on par with or better than I-JEPA.
| Method | Arch. | Clevr/Count | Clevr/Dist |
|---|---|---|---|
| MIM methods, without view data augmentations | | | |
| data2vec [3] | ViT-L/16 | 72.7 | 53.0 |
| MAE [24] | ViT-B/16 | 86.6 | 70.8 |
| | ViT-L/16 | 92.1 | 73.0 |
| I-JEPA [1] | ViT-B/16 | 82.2 | 70.7 |
| | ViT-L/16 | 85.6 | 71.2 |
| FlexPredict | ViT-B/16 | 83.7 | 71.3 |
| | ViT-L/16 | 85.7 | 70.2 |
| Invariance-based methods, using extra view data augmentations | | | |
| DINO [9] | ViT-B/16 | 83.2 | 62.5 |
| iBOT [53] | ViT-B/16 | 85.1 | 64.4 |
| | ViT-L/16 | 85.7 | 62.8 |
Table: S7.T6: Low-resolution prediction. We evaluate FlexPredict against models that predict features at the original scale and at a 2× downscaled version using either max pooling or bilinear resizing. We report linear evaluation results on IN-1k using only 1% of the labels.
| Method | Top-1 |
|---|---|
| I-JEPA [1] | 54.3 |
| Low res pred (bilinear resize) | 52.1 |
| Low res (max pooling) | 54.1 |
| FlexPredict | 57.8 |
Table: A1.T7: Pretraining setting for ablations. Using a ViT-B encoder, trained for 300 epochs; config strictly follows [1].
| config | value |
|---|---|
| optimizer | AdamW |
| epochs | 300 |
| learning rate | 1e-3 |
| weight decay | (0.04, 0.4) |
| batch size | 2048 |
| learning rate schedule | cosine decay |
| warmup epochs | 15 |
| encoder arch. | ViT-B |
| predicted targets | 4 |
| predictor depth | 6 |
| predictor attention heads | 12 |
| predictor embedding dim. | 384 |
| σ (noise hyperparam) | 0.25 |
FlexPredict architecture. The predictor $g_\psi$ predicts a target block given masked tokens with stochastic positions and the context representation (obtained via $f_\theta$). The objective is to minimize the error between the predicted features and the target features obtained via the target encoder $f_{\bar{\theta}}$.
Using stochastic positional embeddings. Sampling from a distribution with a learned covariance matrix as in FlexPredict, i.e., $\Sigma = \sigma AA^{T}$, leads to a +3.5% improvement, while using a fixed covariance matrix $\Sigma = \sigma I$ leads to a smaller +1.9% improvement. Accuracy is based on linear probing using 1% of the data from IN-1k.
Similarity matrices between the positional-embedding matrix and deterministic ($\psi\psi_{i}$) or stochastic FlexPredict ($\psi\hat{\psi}_{i}$) positions. Each row represents a different target position $i$. Positional embeddings are based on sine and cosine features.
Increasing $\sigma$ induces regularization. The effect of changing the hyperparameter $\sigma$ on the norm of the learned parameter $A$ and the masked token $m$. As we increase $\sigma$, the norm of $A$ decreases, indicating regularization. However, the norm of the masked token increases, likely to preserve its information relative to the added noise.
Predicted features visualization. We show a similarity heatmap between the predicted features of a given patch (marked in white within the masked area) and the other tokens encoded by the target encoder in the same image. For I-JEPA, adjacent tokens tend to share similar features, implying a correlation between the features and spatial location. In contrast, FlexPredict produces predictions correlated with non-neighboring small areas. We speculate that FlexPredict reduces spatial adjacency bias.
$$ s_{x}=f_{\theta}(x) $$ \tag{S3.Ex1}
$$ \frac{1}{\lvert B_{y}\rvert}\sum_{i\in B_{y}}\lvert s_{y_{i}}-\hat{s}_{y_{i}}\rvert $$ \tag{S3.E1}
$$ \hat{\psi}_{i}\sim N(\psi_{i},\Sigma) $$ \tag{S4.E2}
$$ \Sigma=\sigma AA^{T} $$ \tag{S4.Ex2}
$$ E_{R,Y}[(f(R)-Y)^{2}] $$ \tag{S4.E3}
$$ p(y,x|r)=p(y|x,r)p(x|r)=p(y|x)p(x|r) $$ \tag{S4.E4}
$$ \int_{x}E[Y|X=x]\frac{1}{\sqrt{2\pi}}e^{-0.5(x-r)^{2}}dx $$ \tag{S4.E5}
$$ \hat{\psi}_{i}=An_{i}+\psi_{i} $$
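The reparameterized draw above can be written in a few lines (a numpy sketch; the token count, embedding dimension, and toy values of $A$ are illustrative assumptions, while the noise scale follows $n_i \sim N(0, \sigma I)$ so that $\hat{\psi}_i \sim N(\psi_i, \sigma AA^T)$):

```python
import numpy as np

def stochastic_positions(psi, A, sigma, rng):
    """psi_hat_i = A n_i + psi_i with n_i ~ N(0, sigma * I),
    so that psi_hat_i ~ N(psi_i, sigma * A A^T)."""
    n = rng.normal(scale=np.sqrt(sigma), size=psi.shape)
    return psi + n @ A.T

rng = np.random.default_rng(0)
psi = rng.normal(size=(4, 8))       # 4 masked-token positions, dim 8
A = 0.1 * rng.normal(size=(8, 8))   # learned projection (toy values)
psi_hat = stochastic_positions(psi, A, sigma=0.25, rng=rng)
```

Setting `A` to zero recovers the deterministic positional embeddings exactly, which is why the noise level can be learned away if it is not useful.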
We pretrain StoP on top of I-JEPA, a state-of-the-art MIM model. We train on IN-1k for 600 epochs using ViT-B/16 and ViT-L/16 architectures for the encoder and predictor, or for 300 epochs when using ViT-H/14. We then evaluate the model's performance on a variety of downstream tasks. Additional results and comparisons to invariance-based approaches are included in Appendix C.2.
In Table 1, we present linear probing image classification results on IN-1k under different evaluation protocols, using different amounts of data and aggregating features from different layers. E.g., '100%, last 4 layers' applies linear probing on the entire IN-1k dataset, where the representation of each image comprises a concatenation of four feature vectors, each summarizing its corresponding layer via average pooling. In Table 2, we compare linear probing results of common MIM methods on IN-1k, reporting previously published numbers; in Table 2, all methods perform linear probing on the output of the last layer.
StoP improves the baseline performance with all architectures examined, e.g., +2.5% linear probing performance with ViT-H using 1% of the labeled data, and +1.6% when using features from the last 4 layers with ViT-B on the full IN-1k data. Furthermore, StoP leads to improvements on downstream linear probing tasks (see Table 4); for example, +3.3% on iNat18 using ViT-H and +1.3% on counting. This confirms that the learned representations improve a large variety of image recognition tasks. With full finetuning using 1% of the labeled data, we observe similar performance improvements (see Table 5), e.g., a +2.3% improvement in
Table 1. StoP compared to deterministic sinusoidal positional embeddings on IN-1k. StoP leads to consistent linear probing improvements in all settings. When applying linear probing to a ViT-H trained with StoP, using only 1% of the labeled data and average-pooled features from the last layer, StoP yields a +2.5% improvement. The baseline I-JEPA uses sinusoidal positional embeddings.
Table 2. Linear evaluation on IN-1k. Replacing sinusoidal positional embeddings with StoP in I-JEPA significantly improves linear probing results.
top-1 accuracy using a ViT-L model. We provide the full finetuning results in Table 16, Appendix C.2.
Counting and depth ordering. We assess downstream performance on tasks that require fine-grained object representations, like counting and depth ordering, using the CLEVR (Johnson et al., 2017) dataset. Table 4 shows that StoP significantly improves counting (+1.3%) and slightly improves depth ordering (+0.1%).
Table 3. Video object semi-supervised segmentation. MIM with StoP learns features with a finer level of granularity. Results are reported on the DAVIS 2017 dataset.
| Arch | Method | 1%, last layer | 100%, last layer | 100%, last 4 layers |
|---|---|---|---|---|
| ViT-B/16 | I-JEPA | 57.1 | 70.9 | 72.9 |
| | +StoP | 60.3 (+3.2%) | 72.6 (+1.7%) | 74.5 (+1.6%) |
| ViT-L/16 | I-JEPA | 64.2 | 76.1 | 77.5 |
| | +StoP | 65.1 (+0.9%) | 77.1 (+1.0%) | 78.5 (+1.0%) |
| ViT-H/14 | I-JEPA | 62.9 | 78.2 | 79.3 |
| | +StoP | 65.4 (+2.5%) | 79.0 (+0.8%) | 79.6 (+0.3%) |
| Method | Arch. | Epochs | Top-1 |
|---|---|---|---|
| data2vec | ViT-L/16 | 1600 | 77.3 |
| MAE | ViT-B/16 | 1600 | 68.0 |
| | ViT-L/16 | 1600 | 75.8 |
| I-JEPA | ViT-B/16 | 600 | 70.9 |
| | ViT-L/16 | 600 | 76.1 |
| | ViT-H/14 | 300 | 78.2 |
| +StoP (ours) | ViT-B/16 | 600 | 72.6 |
| | ViT-L/16 | 600 | 77.1 |
| | ViT-H/14 | 300 | 79.0 |
| Method | Arch. | J-Mean | F-Mean | J&F Mean |
|---|---|---|---|---|
| MAE | ViT-B/16 | 49.4 | 52.6 | 50.9 |
| | ViT-L/16 | 52.5 | 54.3 | 53.4 |
| I-JEPA | ViT-B/16 | 56.1 | 56.2 | 56.1 |
| | ViT-L/16 | 56.1 | 55.7 | 55.9 |
| | ViT-H/14 | 58.5 | 60.9 | 59.7 |
| +StoP | ViT-B/16 | 56.6 | 57.3 | 57.0 |
| | ViT-L/16 | 58.1 | 58.7 | 58.4 |
| | ViT-H/14 | 58.9 | 61.2 | 60.1 |
| Method | Arch. | CIFAR100 | Places205 | iNat18 | CLEVR/Count | CLEVR/Dist |
|---|---|---|---|---|---|---|
| data2vec | ViT-L/16 | 81.6 | 54.6 | 28.1 | 85.3 | 71.3 |
| MAE | ViT-B/16 | 68.1 | 49.2 | 26.8 | 86.6 | 70.8 |
| | ViT-L/16 | 77.4 | 54.4 | 33.0 | 92.1 | 73.0 |
| | ViT-H/14 | 77.3 | 55.0 | 32.9 | 90.5 | 72.4 |
| I-JEPA | ViT-B/16 | 69.2 | 53.4 | 43.4 | 82.2 | 70.7 |
| | ViT-L/16 | 83.6 | 56.5 | 48.4 | 85.6 | 71.2 |
| | ViT-H/14 | 87.5 | 58.4 | 47.6 | 86.7 | 72.4 |
| +StoP | ViT-B/16 | 81.2 | 54.3 | 44.7 | 83.7 | 71.3 |
| | ViT-L/16 | 84.7 | 57.2 | 49.2 | 85.7 | 70.2 |
| | ViT-H/14 | 87.7 | 58.4 | 50.9 | 88.0 | 72.5 |
| Method | Epochs | Top-1 |
|---|---|---|
| Sine Cosine | 600 | 69.4 |
| StoP (ours) | 600 | 71.7 |
| Method | Top-1 |
|---|---|
| Sine Cosine | 54.3 |
| Learned Pos. Embedding | 54.4 |
| Stochastic Positions (StoP) | 57.8 |
| Method | Top-1 |
|---|---|
| No Noise (Sine Cosine) | 54.3 |
| Context tokens only | 55.1 |
| Masked + context tokens | 56.8 |
| Masked tokens only | 57.8 |
| Method | Top-1 |
|---|---|
| Sine Cosine | 54.3 |
| x2 Low res (bilinear resize) | 52.1 |
| x2 Low res (max pooling) | 54.1 |
| Stochastic Positions (StoP) | 57.8 |
| config | value |
|---|---|
| optimizer | AdamW |
| epochs | 300 |
| learning rate | 1e-3 |
| weight decay | (0.04, 0.4) |
| batch size | 2048 |
| learning rate schedule | cosine decay |
| warmup epochs | 15 |
| encoder arch. | ViT-B |
| predicted targets | 4 |
| predictor depth | 6 |
| predictor attention heads | 12 |
| predictor embedding dim. | 384 |
| σ (noise hyperparam) | 0.25 |
| config | value |
|---|---|
| optimizer | AdamW |
| epochs | 600 |
| learning rate | 8e-4 |
| weight decay | (0.04, 0.4) |
| batch size | 2048 |
| learning rate schedule | cosine decay |
| warmup epochs | 15 |
| encoder arch. | ViT-L |
| predicted targets | 4 |
| predictor depth | 12 |
| predictor attention heads | 16 |
| predictor embedding dim. | 384 |
| σ (noise hyperparam) | 0.25 |
| config | value |
|---|---|
| optimizer | AdamW |
| epochs | 600 |
| learning rate | 8e-4 |
| weight decay | (0.04, 0.4) |
| batch size | 2048 |
| learning rate schedule | cosine decay |
| warmup epochs | 15 |
| encoder arch. | ViT-B |
| predicted targets | 4 |
| predictor depth | 6 |
| predictor attention heads | 12 |
| predictor embedding dim. | 384 |
| σ (noise hyperparam) | 0.25 |
| config | value |
|---|---|
| optimizer | AdamW |
| epochs | 600 |
| learning rate | 1e-3 |
| weight decay | (0.04, 0.4) |
| batch size | 2048 |
| learning rate schedule | cosine decay |
| warmup epochs | 40 |
| encoder arch. | ViT-H |
| predicted targets | 4 |
| predictor depth | 12 |
| predictor attention heads | 16 |
| predictor embedding dim. | 384 |
| σ (noise hyperparam) | 0.2 |
| Method | Arch. | CIFAR100 | Places205 | iNat18 | CLEVR/Count | CLEVR/Dist |
|---|---|---|---|---|---|---|
| Invariance-based methods (use extra image augmentations) | | | | | | |
| DINO | ViT-B/16 | 84.8 | 55.2 | 50.1 | 83.2 | 53.4 |
| iBOT | ViT-B/16 | 85.5 | 56.7 | 50.0 | 62.1 | 64.6 |
| ViT-L/16 | 88.3 | 60.4 | 57.3 | 85.7 | 62.8 | |
| Masked Image Modeling methods | | | | | | |
| data2vec | ViT-L/16 | 81.6 | 54.6 | 28.1 | 85.3 | 71.3 |
| MAE | ViT-B/16 | 68.1 | 49.2 | 26.8 | 86.6 | 70.8 |
| | ViT-L/16 | 77.4 | 54.4 | 33.0 | 92.1 | 73.0 |
| | ViT-H/14 | 77.3 | 55.0 | 32.9 | 90.5 | 72.4 |
| I-JEPA | ViT-B/16 | 69.2 | 53.4 | 43.4 | 82.2 | 70.7 |
| | ViT-L/16 | 83.6 | 56.5 | 48.4 | 85.6 | 71.2 |
| | ViT-H/14 | 87.5 | 58.4 | 47.6 | 86.7 | 72.4 |
| +StoP | ViT-B/16 | 81.2 | 54.3 | 44.7 | 83.7 | 71.3 |
| | ViT-L/16 | 84.7 | 57.2 | 49.2 | 85.7 | 70.2 |
| | ViT-H/14 | 87.7 | 58.4 | 50.9 | 88.0 | 72.5 |
| Method | Arch. | Epochs | Top-1 |
|---|---|---|---|
| Invariance-based methods (use extra image augmentations) | | | |
| SimCLR v2 | RN152 (2×) | 800 | 79.1 |
| BYOL | RN200 (2×) | 800 | 79.6 |
| DINO | ViT-B/16 | 400 | 78.1 |
| DINO | ViT-B/8 | 300 | 80.1 |
| MoCo v3 | ViT-B/16 | 300 | 76.7 |
| MoCo v3 | ViT-BN-L/7 | 300 | 81.0 |
| MSN | ViT-L/7 | 200 | 80.7 |
| iBOT | ViT-B/16 | 250 | 79.8 |
| ViT-L/16 | 250 | 81.0 | |
| Masked Image Modeling methods | | | |
| data2vec | ViT-L/16 | 1600 | 77.3 |
| MAE | ViT-B/16 | 1600 | 68.0 |
| MAE | ViT-L/16 | 1600 | 75.8 |
| MAE | ViT-H/14 | 1600 | 77.2 |
| I-JEPA | ViT-B/16 | 600 | 72.9 |
| I-JEPA | ViT-L/16 | 600 | 77.5 |
| I-JEPA | ViT-H/14 | 300 | 79.3 |
| +StoP (ours) | ViT-B/16 | 600 | 74.5 |
| +StoP (ours) | ViT-L/16 | 600 | 78.5 |
| +StoP (ours) | ViT-H/14 | 300 | 79.6 |
| Method | Arch. | J-Mean | F-Mean | J&F Mean |
|---|---|---|---|---|
| Invariance-based methods (use extra image augmentations) | | | | |
| DINO | ViT-B/16 | 60.7 | 63.9 | 62.3 |
| iBOT | ViT-B/16 | 60.9 | 63.3 | 62.1 |
| | ViT-L/16 | 61.7 | 63.9 | 62.8 |
| Masked Image Modeling methods | | | | |
| MAE | ViT-B/16 | 49.4 | 52.6 | 50.9 |
| | ViT-L/16 | 52.5 | 54.3 | 53.4 |
| I-JEPA | ViT-B/16 | 56.1 | 56.2 | 56.1 |
| | ViT-L/16 | 56.1 | 55.7 | 55.9 |
| +StoP | ViT-B/16 | 56.6 | 57.3 | 57.0 |
| | ViT-L/16 | 58.1 | 58.7 | 58.4 |
| | ViT-H/14 | 58.9 | 61.2 | 60.1 |
| Method | Arch. | Epochs | Top-1 |
|---|---|---|---|
| Invariance-based methods (use extra image augmentations) | | | |
| DINO | ViT-B/8 | 300 | 70.0 |
| iBOT | ViT-B/16 | 400 | 69.7 |
| Masked Image Modeling methods | | | |
| MAE | ViT-L/16 | 1600 | 67.0 |
| I-JEPA | ViT-L/16 | 600 | 69.4 |
| +StoP (ours) | ViT-L/16 | 600 | 71.7 |

$$ c_i = \psi_i + Bs_{x_i} $$ \tag{eq:mim_context_tokens}
$$ \hat{\psi}_j = An_j + \psi_j $$ \tag{eqn:reparam}
$$ J_{tied}(A) = \sum_{i,j} \mathbb{E}_{n_j}\big[(F(An_j + \psi_j + \tilde{m},\, Ax_i) - y_j)^2\big] $$
$$ \frac{\partial J}{\partial A} = \sum_{i,j} \mathbb{E}_{n_j}\left[\frac{\partial}{\partial A} \lvert F(An_j + \psi_j + \tilde{m}, Bx_i) - y_j \rvert^2\right] = \sum_{i,j} \mathbb{E}_{n_j}\left[2\big(F(An_j+\psi_j + \tilde{m}, Bx_i) - y_j\big)\frac{\partial F(An_j+\psi_j + \tilde{m}, Bx_i)}{\partial (An_j + \psi_j + \tilde{m})}\, n^T_j\right] $$
Proposition. If the weights of $A$ and $B$ are tied (namely $A=B$), then $\frac{dJ_{tied}}{dA}\big|_{A=0} = 0$ iff $\frac{dJ_{det}}{dB}\big|_{B=0} = 0$.
Proposition. If $Z$ is Gaussian with zero mean and unit variance, the optimal predictor that minimizes Equation (S4.E3) is: $$ f(r) = \int_x E[Y|X=x]\frac{1}{\sqrt{2\pi}}e^{-0.5(x-r)^2}dx $$
Proof.
$$ \frac{\partial J}{\partial A} = \sum_{i,j} E_{n_j}\left[\frac{\partial}{\partial A} \lvert F(An_j + \psi_j + \tilde{m}, Bx_i) - y_j \rvert^2\right] = \sum_{i,j} E_{n_j}\left[2\big(F(An_j+\psi_j + \tilde{m}, Bx_i) - y_j\big)\frac{\partial F(An_j+\psi_j + \tilde{m}, Bx_i)}{\partial (An_j + \psi_j + \tilde{m})}\, n^T_j\right] $$
Setting $A=0$, the derivative becomes:
$$ \frac{\partial J}{\partial A}\Big|_{A=0} = 2\sum_{i,j} \big(F(\psi_j + \tilde{m}, Bx_i) - y_j\big)\frac{\partial F(\psi_j + \tilde{m}, Bx_i)}{\partial (\psi_j + \tilde{m})}\, E_{n_j}[n^T_j] = 0, $$
since at $A=0$ the remaining terms no longer depend on $n_j$ and $E_{n_j}[n_j] = 0$.
Proof. Next, we show that $A=0$ is a critical point of $J_{tied}$ iff $B=0$ is a critical point of $J_{det}$:
$$ \frac{\partial J_{tied}}{\partial A}\Big|_{A=0} = \sum_{i,j} \big(F(\psi_j + \tilde{m}, 0) - y_j\big)\nabla F(\psi_j, 0)\, x_i^T $$
$$ \frac{\partial J_{det}}{\partial B}\Big|_{B=0} = \sum_{i,j} \big(F(\psi_j + \tilde{m}, 0) - y_j\big)\nabla F(\psi_j, 0)\, x_i^T $$
The two expressions are identical, and therefore $\frac{\partial J_{tied}}{\partial A}\big|_{A=0} = 0$ iff $\frac{\partial J_{det}}{\partial B}\big|_{B=0} = 0$.
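As a quick numerical sanity check of the optimal-predictor proposition, one can verify by Monte Carlo that the Gaussian-smoothed regression function is recovered (the quadratic choice of $E[Y|X=x]$ is an illustrative assumption; with $E[Y|X=x]=x^2$ and unit-variance location noise, the smoothed predictor is $f(r)=r^2+1$ in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1_000_000)   # Z ~ N(0, 1): location noise
r = 1.5                          # observed (noisy) position

# f(r) = E_Z[ E[Y | X = r + Z] ] with the toy choice E[Y|X=x] = x^2
f_mc = np.mean((r + z) ** 2)     # Monte-Carlo estimate; closed form: r^2 + 1
```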
Algorithm: MIM w/ StoP pseudo-code. StoP requires only a minor implementation change, highlighted in light gray.
\begin{algorithmic}[1]
\State \textbf{Input:} num iterations $K$, image dist $S$, hyperparam $\sigma$, positional embeddings $\psi$
\State \textbf{Params}: $\{A,\tilde{m}\}$, encoder $f_\theta$, predictor $g_\phi$
\For {$itr=1,2,...,K$}
\State $I_x \sim S$
\State $p \leftarrow \text{patchify}(I_x)$
\State $(x, B_x),(y, B_y) \leftarrow \text{mask}(p)$
\State $s_x \leftarrow f_{\theta}(x)$
\State \codecomment{\# apply StoP on a sequence of tokens}
\State \texttt{\colorbox{lightgrey}{$n_j \sim \mathcal{N}(0, \sigma I)$}}
\State \codecomment{\# $\psi_{B_x}$, $\psi_{B_y}$ - masked/context positional embeddings}
\State $m = $ \texttt{\colorbox{lightgrey}{$An$}} $+ \psi_{B_y} + \tilde{m}$
\State $c = As_x + \psi_{B_x}$
\State \codecomment{\# predict targets}
\State $\hat{s}_y \leftarrow g_\phi(c, m)$
\State $ s_y \leftarrow \text{get\_target}(y)$
\State $\text{loss} \leftarrow L(\hat{s}_y, s_y)$
\State $\text{sgd\_step}(\text{loss}; \{\theta,\phi, A, \tilde{m}\})$
\EndFor
\end{algorithmic}
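Translated out of LaTeX pseudocode, one training step looks roughly like this (a numpy sketch with toy linear stand-ins for the encoder and predictor; real ViT modules, the EMA target encoder, and the optimizer step are omitted, and all shapes and weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, T_ctx, T_tgt, sigma = 16, 8, 4, 0.25

W_enc = 0.1 * rng.normal(size=(D, D))      # toy stand-in for f_theta
W_pred = 0.1 * rng.normal(size=(D, D))     # toy stand-in for g_phi
A = 0.1 * rng.normal(size=(D, D))          # learned noise projection
m_tilde = 0.1 * rng.normal(size=D)         # learned masked-token embedding
psi_ctx = rng.normal(size=(T_ctx, D))      # context positional embeddings
psi_tgt = rng.normal(size=(T_tgt, D))      # masked-token positional embeddings

x = rng.normal(size=(T_ctx, D))            # context patches
s_x = x @ W_enc.T                          # context representation

# StoP: perturb only the masked tokens' positions
n = rng.normal(scale=np.sqrt(sigma), size=(T_tgt, D))
m = n @ A.T + psi_tgt + m_tilde            # stochastic masked tokens
c = s_x + psi_ctx                          # context tokens

s_hat = np.concatenate([c, m]) @ W_pred.T  # toy "predictor" pass
pred = s_hat[-T_tgt:]                      # predictions at masked positions
s_y = rng.normal(size=(T_tgt, D))          # stand-in target features
loss = np.abs(pred - s_y).mean()           # L1 objective, as in Eq. (S3.E1)
```

In a real run, `loss` would be backpropagated through the predictor, encoder, `A`, and `m_tilde`, while the target features come from an EMA copy of the encoder.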
References
[li2022exploring] Li, Yanghao, Mao, Hanzi, Girshick, Ross, He, Kaiming. (2022). Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527.
[bardes2022vicregl] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2022). VICRegL: Self-Supervised Learning of Local Visual Features. arXiv preprint arXiv:2210.01571.
[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron. (2016). Deep learning.
[arora2019theoretical] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.
[bridle1991unsupervised] Bridle, John, Heading, Anthony, MacKay, David. (1991). Unsupervised classifiers, mutual information and 'phantom targets'. Advances in neural information processing systems.
[zha2001spectral] Zha, Hongyuan, He, Xiaofeng, Ding, Chris, Gu, Ming, Simon, Horst D. (2001). Spectral relaxation for k-means clustering. NeurIPS.
[hornik2012spherical] Hornik, Kurt, Feinerer, Ingo, Kober, Martin, Buchta, Christian. (2012). Spherical k-means clustering. Journal of statistical software.
[park2009simple] Park, Hae-Sang, Jun, Chi-Hyuck. (2009). A simple and fast algorithm for K-medoids clustering. Expert systems with applications.
[van2008visualizing] Van der Maaten, Laurens, Hinton, Geoffrey. (2008). Visualizing data using t-SNE.. Journal of machine learning research.
[wang2010learning] Wang, Fei, Li, Ping, Konig, Arnd Christian. (2010). Learning a bi-stochastic data similarity matrix. 2010 IEEE International Conference on Data Mining.
[meilua2006uniqueness] Meilă, Marina. (2006). The uniqueness of a good optimum for k-means. Proceedings of the 23rd international conference on Machine learning.
[wu2009adapting] Wu, Junjie, Xiong, Hui, Chen, Jian. (2009). Adapting the right measures for k-means clustering. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining.
[liang2012k] Liang, Jiye, Bai, Liang, Dang, Chuangyin, Cao, Fuyuan. (2012). The $ K $-means-type algorithms versus imbalanced data distributions. IEEE Transactions on Fuzzy Systems.
[rujeerapaiboon2019size] Rujeerapaiboon, Napat, Schindler, Kilian, Kuhn, Daniel, Wiesemann, Wolfram. (2019). Size matters: Cardinality-constrained clustering and outlier detection via conic optimization. SIAM J. Optimization.
[bradley2000constrained] Bradley, Paul S, Bennett, Kristin P, Demiriz, Ayhan. (2000). Constrained k-means clustering. Microsoft Research, Redmond.
[kleindessner2019fair] Kleindessner, Matthäus, et al. (2019). Fair k-center clustering for data summarization. ICML.
[bordia2019identifying] Bordia, Shikha, Bowman, Samuel R. (2019). Identifying and reducing gender bias in word-level language models. arXiv preprint arXiv:1904.03035.
[buolamwini2018gender] Buolamwini, Joy, Gebru, Timnit. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Conference on Fairness, Accountability and Transparency.
[ma2022principles] Ma, Yi, Tsao, Doris, Shum, Heung-Yeung. (2022). On the principles of Parsimony and Self-consistency for the emergence of intelligence. Frontiers of Information Technology & Electronic Engineering.
[wiener2019cybernetics] Wiener, Norbert. (2019). Cybernetics or Control and Communication in the Animal and the Machine.
[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[krause2010discriminative] Krause, Andreas, Perona, Pietro, Gomes, Ryan. (2010). Discriminative clustering by regularized information maximization. Advances in neural information processing systems.
[paszke2019pytorch] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, others. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems.
[henaff2020data] Henaff, Olivier. (2020). Data-efficient image recognition with contrastive predictive coding. International conference on machine learning.
[hu2017learning] Hu, Weihua, Miyato, Takeru, Tokui, Seiya, Matsumoto, Eiichi, Sugiyama, Masashi. (2017). Learning discrete representations via information maximizing self-augmented training. International conference on machine learning.
[linsker1988self] Linsker, Ralph. (1988). Self-organization in a perceptual network. Computer.
[tschannen2019mutual] Tschannen, Michael, Djolonga, Josip, Rubenstein, Paul K, Gelly, Sylvain, Lucic, Mario. (2019). On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625.
[lake2011one] Lake, Brenden, Salakhutdinov, Ruslan, Gross, Jason, Tenenbaum, Joshua. (2011). One shot learning of simple visual concepts. Proceedings of the annual meeting of the cognitive science society.
[salakhutdinov2007learning] Salakhutdinov, Ruslan, Hinton, Geoff. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. Artificial Intelligence and Statistics.
[boden1980jean] Boden, Margaret A. (1980). Jean Piaget.
[piaget1964cognitive] Piaget, Jean. (1964). Cognitive development in children: Piaget. Journal of research in science teaching.
[boden1978artificial] Boden, Margaret A. (1978). Artificial intelligence and Piagetian theory. Synthese.
[bruner1961individual] Bruner, Jerome S. (1961). Reply to Individual and collective problems in the study of thinking. Annals of the New York Academy of Sciences.
[piaget1971biology] Piaget, Jean. (1971). Biology and knowledge: An essay on the relations between organic regulations and cognitive processes..
[grandvalet2006entropy] Grandvalet, Yves, Bengio, Yoshua. (2006). Entropy regularization. Semi-supervised learning.
[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. preprint arXiv:2002.05709.
[chen2020big] Chen, Ting, Kornblith, Simon, Swersky, Kevin, Norouzi, Mohammad, Hinton, Geoffrey. (2020). Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029.
[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, others. (2020). Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.
[caron2020unsupervised] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882.
[assran2020recovering] Assran, Mahmoud, Ballas, Nicolas, Castrejon, Lluis, Rabbat, Michael. (2020). Recovering Petaflops in Contrastive Semi-Supervised Learning of Visual Representations. arXiv preprint arXiv:2006.10803.
[vinyals2016matching] Vinyals, Oriol, Blundell, Charles, Lillicrap, Timothy, Kavukcuoglu, Koray, Wierstra, Daan. (2016). Matching networks for one shot learning. arXiv preprint arXiv:1606.04080.
[snell2017prototypical] Snell, Jake, Swersky, Kevin, Zemel, Richard S. (2017). Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175.
[ravi2016optimization] Ravi, Sachin, Larochelle, Hugo. (2016). Optimization as a model for few-shot learning.
[lake2017building] Lake, Brenden M, Ullman, Tomer D, Tenenbaum, Joshua B, Gershman, Samuel J. (2017). Building machines that learn and think like people. Behavioral and brain sciences.
[russakovsky2015imagenet] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., Fei-Fei, Li. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision.
[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[you2017large] You, Yang, Gitman, Igor, Ginsburg, Boris. (2017). Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
[sutskever2013importance] Sutskever, Ilya, Martens, James, Dahl, George, Hinton, Geoffrey. (2013). On the importance of initialization and momentum in deep learning. International conference on machine learning.
[xie2019unsupervised] Xie, Qizhe, Dai, Zihang, Hovy, Eduard, Luong, Minh-Thang, Le, Quoc V. (2019). Unsupervised data augmentation. arXiv preprint arXiv:1904.12848.
[sohn2020fixmatch] Sohn, Kihyuk, Berthelot, David, Li, Chun-Liang, Zhang, Zizhao, Carlini, Nicholas, Cubuk, Ekin D, Kurakin, Alex, Zhang, Han, Raffel, Colin. (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.
[pham2020meta] Pham, Hieu, Xie, Qizhe, Dai, Zihang, Le, Quoc V. (2020). Meta pseudo labels. arXiv preprint arXiv:2003.10580.
[wu2018unsupervised] Wu, Zhirong, Xiong, Yuanjun, Yu, Stella X, Lin, Dahua. (2018). Unsupervised feature learning via non-parametric instance discrimination. Proceedings of the IEEE conference on computer vision and pattern recognition.
[misra2020self] Misra, Ishan, van der Maaten, Laurens. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[ren2018meta] Ren, Mengye, Triantafillou, Eleni, Ravi, Sachin, Snell, Jake, Swersky, Kevin, Tenenbaum, Joshua B, Larochelle, Hugo, Zemel, Richard S. (2018). Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676.
[he2019moco] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick. (2019). Momentum Contrast for Unsupervised Visual Representation Learning. arXiv preprint arXiv:1911.05722.
[chen2020mocov2] Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He. (2020). Improved Baselines with Momentum Contrastive Learning. arXiv preprint arXiv:2003.04297.
[hsu2018unsupervised] Hsu, Kyle, Levine, Sergey, Finn, Chelsea. (2018). Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334.
[chen2020exploring] Chen, Xinlei, He, Kaiming. (2020). Exploring Simple Siamese Representation Learning. arXiv preprint arXiv:2011.10566.
[loshchilov2016sgdr] Loshchilov, Ilya, Hutter, Frank. (2016). SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv preprint arXiv:1608.03983.
[khosla2020supervised] Khosla, Prannay, Teterwak, Piotr, Wang, Chen, Sarna, Aaron, Tian, Yonglong, Isola, Phillip, Maschinot, Aaron, Liu, Ce, Krishnan, Dilip. (2020). Supervised Contrastive Learning. arXiv preprint arXiv:2004.11362.
[miyato2018virtual] Miyato, Takeru, Maeda, Shin-ichi, Koyama, Masanori, Ishii, Shin. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence.
[verma2019interpolation] Verma, Vikas, Kawaguchi, Kenji, Lamb, Alex, Kannala, Juho, Bengio, Yoshua, Lopez-Paz, David. (2019). Interpolation Consistency Training for Semi-Supervised Learning. arXiv preprint arXiv:1903.03825.
[zhai2019s4l] Zhai, Xiaohua, Oliver, Avital, Kolesnikov, Alexander, Beyer, Lucas. (2019). S4l: Self-supervised semi-supervised learning. Proceedings of the IEEE international conference on computer vision.
[lee2013pseudo] Lee, Dong-Hyun. (2013). Pseudo-Label: The simple and efficient semi-supervised learning method for deep neural networks. In International Conference on Machine Learning Workshop.
[scudder1965probability] Scudder, H. (1965). Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory.
[riloff1996automatically] Riloff, Ellen. (1996). Automatically generating extraction patterns from untagged text. In Proceedings of the National Conference on Artificial Intelligence.
[berthelot2019mixmatch] Berthelot, David, Carlini, Nicholas, Goodfellow, Ian, Papernot, Nicolas, Oliver, Avital, Raffel, Colin A. (2019). Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems.
[berthelot2019remixmatch] Berthelot, David, Carlini, Nicholas, Cubuk, Ekin D, Kurakin, Alex, Sohn, Kihyuk, Zhang, Han, Raffel, Colin. (2019). ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring. arXiv preprint arXiv:1911.09785.
[yarowsky1995unsupervised] Yarowsky, David. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics.
[asano2019self] Asano, Yuki Markus, Rupprecht, Christian, Vedaldi, Andrea. (2019). Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371.
[zoph2020rethinking] Zoph, Barret, Ghiasi, Golnaz, Lin, Tsung-Yi, Cui, Yin, Liu, Hanxiao, Cubuk, Ekin D, Le, Quoc V. (2020). Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882.
[xie2020self] Xie, Qizhe, Luong, Minh-Thang, Hovy, Eduard, Le, Quoc V. (2020). Self-training with noisy student improves imagenet classification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[tarvainen2017mean] Tarvainen, Antti, Valpola, Harri. (2017). Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780.
[el2021large] El-Nouby, Alaaeldin, Izacard, Gautier, Touvron, Hugo, Laptev, Ivan, Jégou, Hervé, Grave, Edouard. (2021). Are Large-scale Datasets Necessary for Self-Supervised Pre-training? arXiv preprint arXiv:2112.10740.
[mitrovic2020representation] Mitrovic, Jovana, McWilliams, Brian, Walker, Jacob, Buesing, Lars, Blundell, Charles. (2020). Representation learning via invariant causal mechanisms. arXiv preprint arXiv:2010.07922.
[assran2020supervision] Assran, Mahmoud, Ballas, Nicolas, Castrejon, Lluis, Rabbat, Michael. (2020). Supervision accelerates pre-training in contrastive semi-supervised learning of visual representations. arXiv preprint arXiv:2006.10803.
[joulin2012convex] Joulin, Armand, Bach, Francis. (2012). A convex relaxation for weakly supervised classifiers. arXiv preprint arXiv:1206.6413.
[laine2016temporal] Laine, Samuli, Aila, Timo. (2016). Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
[jackson2019semi] Jackson, Jacob, Schulman, John. (2019). Semi-supervised learning by label gradient alignment. arXiv preprint arXiv:1902.02336.
[wang2019enaet] Wang, Xiao, Kihara, Daisuke, Luo, Jiebo, Qi, Guo-Jun. (2019). Enaet: Self-trained ensemble autoencoding transformations for semi-supervised learning. arXiv preprint arXiv:1911.09265.
[krizhevsky2009learning] Krizhevsky, Alex, Hinton, Geoffrey, et al. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
[zagoruyko2016wide] Zagoruyko, Sergey, Komodakis, Nikos. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
[thomee2016yfcc100m] Thomee, Bart, Shamma, David A, Friedland, Gerald, Elizalde, Benjamin, Ni, Karl, Poland, Douglas, Borth, Damian, Li, Li-Jia. (2016). YFCC100M: The new data in multimedia research. Communications of the ACM.
[zhang2017mixup] Zhang, Hongyi, Cisse, Moustapha, Dauphin, Yann N, Lopez-Paz, David. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
[yun2019cutmix] Yun, Sangdoo, Han, Dongyoon, Oh, Seong Joon, Chun, Sanghyuk, Choe, Junsuk, Yoo, Youngjoon. (2019). Cutmix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[cubuk2019autoaugment] Cubuk, Ekin D, Zoph, Barret, Mane, Dandelion, Vasudevan, Vijay, Le, Quoc V. (2019). Autoaugment: Learning augmentation strategies from data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[blum1998combining] Blum, Avrim, Mitchell, Tom. (1998). Combining labeled and unlabeled data with co-training. Proceedings of the eleventh annual conference on Computational learning theory.
[berman2019multigrain] Berman, Maxim, Jégou, Hervé, Vedaldi, Andrea, Kokkinos, Iasonas, Douze, Matthijs. (2019). Multigrain: a unified image embedding for classes and instances. arXiv preprint arXiv:1902.05509.
[dosovitskiy2020image] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, Jégou, Hervé, Mairal, Julien, Bojanowski, Piotr, Joulin, Armand. (2021). Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294.
[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is all you need. Advances in neural information processing systems.
[bahdanau2014neural] Bahdanau, Dzmitry, Cho, Kyunghyun, Bengio, Yoshua. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[baevski2022data2vec] Baevski, Alexei, Hsu, Wei-Ning, Xu, Qiantong, Babu, Arun, Gu, Jiatao, Auli, Michael. (2022). Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555.
[bromley1993signature] Bromley, Jane, Bentz, James W, Bottou, Léon, et al. (1993). Signature verification using a “siamese” time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence.
[hjelm2018learning] Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil, Trischler, Adam, Bengio, Yoshua. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
[bachman2019learning] Bachman, Philip, Hjelm, R Devon, Buchwalter, William. (2019). Learning representations by maximizing mutual information across views. Advances in neural information processing systems.
[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, Stéphane. (2021). Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230.
[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.
[assran2021semi] Assran, Mahmoud, Caron, Mathilde, Misra, Ishan, Bojanowski, Piotr, Joulin, Armand, Ballas, Nicolas, Rabbat, Michael. (2021). Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples. arXiv preprint arXiv:2104.13963.
[chen2020generative] Chen, Mark, Radford, Alec, Child, Rewon, Wu, Jeffrey, Jun, Heewoo, Luan, David, Sutskever, Ilya. (2020). Generative pretraining from pixels. International Conference on Machine Learning.
[he2021masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2021). Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377.
[denoising_vincent] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2008). Extracting and Composing Robust Features with Denoising Autoencoders. Proceedings of the 25th International Conference on Machine Learning.
[vincent2010stacked] Vincent, Pascal, Larochelle, Hugo, Lajoie, Isabelle, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research.
[xie2021simmim] Xie, Zhenda, Zhang, Zheng, Cao, Yue, Lin, Yutong, Bao, Jianmin, Yao, Zhuliang, Dai, Qi, Hu, Han. (2021). Simmim: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886.
[wei2021masked] Wei, Chen, Fan, Haoqi, Xie, Saining, Wu, Chao-Yuan, Yuille, Alan, Feichtenhofer, Christoph. (2021). Masked Feature Prediction for Self-Supervised Visual Pre-Training. arXiv preprint arXiv:2112.09133.
[bao2021beit] Bao, Hangbo, Dong, Li, Wei, Furu. (2021). BEiT: BERT Pre-Training of Image Transformers. arXiv preprint arXiv:2106.08254.
[zhou2021ibotyes] Zhou, Jinghao, Wei, Chen, Wang, Huiyu, Shen, Wei, Xie, Cihang, Yuille, Alan, Kong, Tao. (2021). Ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832.
[loshchilov2017decoupled] Loshchilov, Ilya, Hutter, Frank. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
[chen2021empirical] Chen, Xinlei, Xie, Saining, He, Kaiming. (2021). An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057.
[touvron2021training] Touvron, Hugo, Cord, Matthieu, Douze, Matthijs, Massa, Francisco, Sablayrolles, Alexandre, Jégou, Hervé. (2021). Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning.
[assran2022masked] Assran, Mahmoud, Caron, Mathilde, Misra, Ishan, Bojanowski, Piotr, Bordes, Florian, Vincent, Pascal, Joulin, Armand, Rabbat, Michael, Ballas, Nicolas. (2022). Masked siamese networks for label-efficient learning. arXiv preprint arXiv:2204.07141.
[goyal2022vision] Goyal, Priya, Duval, Quentin, Seessel, Isaac, Caron, Mathilde, Singh, Mannat, Misra, Ishan, Sagun, Levent, Joulin, Armand, Bojanowski, Piotr. (2022). Vision models are more robust and fair when pretrained on uncurated images without supervision. arXiv preprint arXiv:2202.08360.
[tian2021divide] Tian, Yonglong, Henaff, Olivier J, van den Oord, Aäron. (2021). Divide and contrast: Self-supervised learning from uncurated data. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[mahajan2018exploring] Mahajan, Dhruv, Girshick, Ross, Ramanathan, Vignesh, He, Kaiming, Paluri, Manohar, Li, Yixuan, Bharambe, Ashwin, Van Der Maaten, Laurens. (2018). Exploring the limits of weakly supervised pretraining. Proceedings of the European conference on computer vision (ECCV).
[newman2005power] Newman, Mark EJ. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary physics.
[van2018inaturalist] Van Horn, Grant, Mac Aodha, Oisin, Song, Yang, Cui, Yin, Sun, Chen, Shepard, Alex, Adam, Hartwig, Perona, Pietro, Belongie, Serge. (2018). The inaturalist species classification and detection dataset. Proceedings of the IEEE conference on computer vision and pattern recognition.
[places205] Zhou, Bolei, Lapedriza, Agata, Xiao, Jianxiong, Torralba, Antonio, Oliva, Aude. (2014). Learning Deep Features for Scene Recognition using Places Database. Advances in Neural Information Processing Systems.
[cifar10] Alex Krizhevsky. (2009). Learning multiple layers of features from tiny images. Technical report.
[kitti] Andreas Geiger, Philip Lenz, Raquel Urtasun. (2012). Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. Conference on Computer Vision and Pattern Recognition (CVPR).
[clevr] Johnson, Justin, Hariharan, Bharath, van der Maaten, Laurens, Fei-Fei, Li, Zitnick, C Lawrence, Girshick, Ross. (2017). CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. CVPR.
[bordes2022high] Florian Bordes, Randall Balestriero, Pascal Vincent. (2022). High Fidelity Visualization of What Your Self-Supervised Representation Knows About. Transactions on Machine Learning Research.
[https://doi.org/10.48550/arxiv.1310.4546] Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg, Dean, Jeffrey. (2013). Distributed Representations of Words and Phrases and their Compositionality. arXiv preprint arXiv:1310.4546.
[zhou2014learning] Zhou, Bolei, Lapedriza, Agata, Xiao, Jianxiong, Torralba, Antonio, Oliva, Aude. (2014). Learning deep features for scene recognition using places database. Advances in neural information processing systems.
[johnson2017clevr] Johnson, Justin, Hariharan, Bharath, Van Der Maaten, Laurens, Fei-Fei, Li, Lawrence Zitnick, C, Girshick, Ross. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. Proceedings of the IEEE conference on computer vision and pattern recognition.
[geiger2013vision] Geiger, Andreas, Lenz, Philip, Stiller, Christoph, Urtasun, Raquel. (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research.
[tian2021understanding] Tian, Yuandong, Chen, Xinlei, Ganguli, Surya. (2021). Understanding self-supervised learning dynamics without contrastive pairs. International Conference on Machine Learning.
[balestriero2022contrastive] Balestriero, Randall, LeCun, Yann. (2022). Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. arXiv preprint arXiv:2205.11508.
[wang2020understanding] Wang, Tongzhou, Isola, Phillip. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. International Conference on Machine Learning.
[chen2021intriguing] Chen, Ting, Luo, Calvin, Li, Lala. (2021). Intriguing properties of contrastive losses. Advances in Neural Information Processing Systems.
[garrido2022duality] Garrido, Quentin, Chen, Yubei, Bardes, Adrien, Najman, Laurent, Lecun, Yann. (2022). On the duality between contrastive and non-contrastive self-supervised learning. arXiv preprint arXiv:2206.02574.
[goyal2021vissl] Priya Goyal, Quentin Duval, Jeremy Reizenstein, Matthew Leavitt, Min Xu, Benjamin Lefaudeux, Mannat Singh, Vinicius Reis, Mathilde Caron, Piotr Bojanowski, Armand Joulin, Ishan Misra. (2021). VISSL. https://github.com/facebookresearch/vissl.
[https://doi.org/10.48550/arxiv.1502.03167] Ioffe, Sergey, Szegedy, Christian. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167.
[lecun2022path] LeCun, Yann. (2022). A Path Towards Autonomous Machine Intelligence, Version 0.9.2, 2022-06-27.
[chen2022intra] Chen, Yubei, Bardes, Adrien, Li, Zengyi, LeCun, Yann. (2022). Intra-Instance VICReg: Bag of Self-Supervised Image Patch Embedding. arXiv preprint arXiv:2206.08954.
[gidaris2020learning] Gidaris, Spyros, Bursuc, Andrei, Komodakis, Nikos, Pérez, Patrick, Cord, Matthieu. (2020). Learning representations by predicting bags of visual words. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[bordes2022guillotine] Bordes, Florian, Balestriero, Randall, Garrido, Quentin, Bardes, Adrien, Vincent, Pascal. (2022). Guillotine Regularization: Improving Deep Networks Generalization by Removing their Head. arXiv preprint arXiv:2206.13378.
[rao1999predictive] Rao, Rajesh PN, Ballard, Dana H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature neuroscience.
[pathak2016context] Pathak, Deepak, Krahenbuhl, Philipp, Donahue, Jeff, Darrell, Trevor, Efros, Alexei A. (2016). Context encoders: Feature learning by inpainting. Proceedings of the IEEE conference on computer vision and pattern recognition.
[elias1955] Friston, Karl. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences.
[devlin2018bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[ramesh2021zero] Ramesh, Aditya, Pavlov, Mikhail, Goh, Gabriel, Gray, Scott, Voss, Chelsea, Radford, Alec, Chen, Mark, Sutskever, Ilya. (2021). Zero-shot text-to-image generation. International Conference on Machine Learning.
[dalal2005histograms] Dalal, Navneet, Triggs, Bill. (2005). Histograms of oriented gradients for human detection. 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05).
[larsson2016learning] Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory. (2016). Learning representations for automatic colorization. European Conference on Computer Vision.
[zhang2016colorful] Zhang, Richard, Isola, Phillip, Efros, Alexei A. (2016). Colorful image colorization. European Conference on Computer Vision.
[larsson2017colorization] Larsson, Gustav, Maire, Michael, Shakhnarovich, Gregory. (2017). Colorization as a proxy task for visual understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[assran2022hidden] Assran, Mahmoud, Balestriero, Randall, Duval, Quentin, Bordes, Florian, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, Ballas, Nicolas. (2022). The Hidden Uniform Cluster Prior in Self-Supervised Learning. arXiv preprint arXiv:2210.07277.
[lecun2006tutorial] LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M, Huang, Fujie. (2006). A tutorial on energy-based learning. Predicting structured data.
[vtab] Zhai, Xiaohua, Puigcerver, Joan, Kolesnikov, Alexander, Ruyssen, Pierre, Riquelme, Carlos, Lucic, Mario, Djolonga, Josip, Pinto, Andre Susano, Neumann, Maxim, Dosovitskiy, Alexey, Beyer, Lucas, Bachem, Olivier, Tschannen, Michael, Michalski, Marcin, Bousquet, Olivier, Gelly, Sylvain, Houlsby, Neil. (2019). A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark. arXiv preprint arXiv:1910.04867.
[lars] You, Yang, Gitman, Igor, Ginsburg, Boris. (2017). Large Batch Training of Convolutional Networks. arXiv preprint arXiv:1708.03888.
[zhou2019semantic] Zhou, Bolei, Zhao, Hang, Puig, Xavier, Xiao, Tete, Fidler, Sanja, Barriuso, Adela, Torralba, Antonio. (2019). Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision.
[everingham2015pascal] Everingham, Mark, Eslami, SM, Van Gool, Luc, Williams, Christopher KI, Winn, John, Zisserman, Andrew. (2015). The pascal visual object classes challenge: A retrospective. International journal of computer vision.
[cai2022semi] Cai, Zhaowei, Ravichandran, Avinash, Favaro, Paolo, Wang, Manchen, Modolo, Davide, Bhotika, Rahul, Tu, Zhuowen, Soatto, Stefano. (2022). Semi-supervised vision transformers at scale. arXiv preprint arXiv:2208.05688.
[baevski2022efficient] Baevski, Alexei, Babu, Arun, Hsu, Wei-Ning, Auli, Michael. (2022). Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language. arXiv preprint arXiv:2212.07525.
[assran2023self] Assran, Mahmoud, Duval, Quentin, Misra, Ishan, Bojanowski, Piotr, Vincent, Pascal, Rabbat, Michael, LeCun, Yann, Ballas, Nicolas. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. arXiv preprint arXiv:2301.08243.
[Ermolov2020WhiteningFS] Aleksandr Ermolov, Aliaksandr Siarohin, E. Sangineto, N. Sebe. (2020). Whitening for Self-Supervised Representation Learning. International Conference on Machine Learning.
[kingma2013auto] Kingma, Diederik P, Welling, Max. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[pont20172017] Pont-Tuset, Jordi, Perazzi, Federico, Caelles, Sergi, Arbeláez, Pablo, Sorkine-Hornung, Alex, Van Gool, Luc. (2017). The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675.
[jabri2020space] Jabri, Allan, Owens, Andrew, Efros, Alexei. (2020). Space-time correspondence as a contrastive random walk. Advances in neural information processing systems.
[Hadsell2006DimensionalityRB] Raia Hadsell, Sumit Chopra, Yann LeCun. (2006). Dimensionality Reduction by Learning an Invariant Mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).
[Dosovitskiy2014DiscriminativeUF] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin A. Riedmiller, Thomas Brox. (2014). Discriminative Unsupervised Feature Learning with Convolutional Neural Networks. NIPS.
[Tian2019ContrastiveMC] Yonglong Tian, Dilip Krishnan, Phillip Isola. (2019). Contrastive Multiview Coding. European Conference on Computer Vision.
[Misra2019SelfSupervisedLO] Ishan Misra, Laurens van der Maaten. (2019). Self-Supervised Learning of Pretext-Invariant Representations. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[clark2020electra] Clark, Kevin, Luong, Minh-Thang, Le, Quoc V, Manning, Christopher D. (2020). Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. Advances in neural information processing systems.
[baevski2020wav2vec] Baevski, Alexei, Zhou, Yuhao, Mohamed, Abdelrahman, Auli, Michael. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems.
[baevski2021unsupervised] Baevski, Alexei, Hsu, Wei-Ning, Conneau, Alexis, Auli, Michael. (2021). Unsupervised speech recognition. Advances in Neural Information Processing Systems.
[wang2020unsupervised] Wang, Weiran, Tang, Qingming, Livescu, Karen. (2020). Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[vincent2008extracting] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th international conference on Machine learning.
[xiaoshould] Xiao, Tete, Wang, Xiaolong, Efros, Alexei A, Darrell, Trevor. What Should Not Be Contrastive in Contrastive Learning. International Conference on Learning Representations.
[chen2021exploring] Chen, Xinlei, He, Kaiming. (2021). Exploring simple siamese representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
[Hua_2021_ICCV] Hua, Tianyu, Wang, Wenxiao, Xue, Zihui, Ren, Sucheng, Wang, Yue, Zhao, Hang. (2021). On Feature Decorrelation in Self-Supervised Learning. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
[Bar_2022_CVPR] Bar, Amir, Wang, Xin, Kantorov, Vadim, Reed, Colorado J., Herzig, Roei, Chechik, Gal, Rohrbach, Anna, Darrell, Trevor, Globerson, Amir. (2022). DETReg: Unsupervised Pretraining With Region Priors for Object Detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[dwibedi2021little] Dwibedi, Debidatta, Aytar, Yusuf, Tompson, Jonathan, Sermanet, Pierre, Zisserman, Andrew. (2021). With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[press2021train] Press, Ofir, Smith, Noah A, Lewis, Mike. (2021). Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.
[chu2021conditional] Chu, Xiangxiang, Tian, Zhi, Zhang, Bo, Wang, Xinlong, Wei, Xiaolin, Xia, Huaxia, Shen, Chunhua. (2021). Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882.
[bello2019attention] Bello, Irwan, Zoph, Barret, Vaswani, Ashish, Shlens, Jonathon, Le, Quoc V. (2019). Attention augmented convolutional networks. Proceedings of the IEEE/CVF international conference on computer vision.
[su2021roformer] Su, Jianlin, Lu, Yu, Pan, Shengfeng, Murtadha, Ahmed, Wen, Bo, Liu, Yunfeng. (2021). Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
[pmlr-v139-liutkus21a] Liutkus, Antoine, Cífka, Ondřej, et al. (2021). Relative Positional Encoding for Transformers with Linear Complexity. Proceedings of the 38th International Conference on Machine Learning.
[lin2014microsoft] Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, Zitnick, C Lawrence. (2014). Microsoft COCO: Common objects in context. Computer Vision - ECCV 2014: 13th European Conference, Proceedings, Part V.
[cai2019cascade] Cai, Zhaowei, Vasconcelos, Nuno. (2019). Cascade R-CNN: High quality object detection and instance segmentation. IEEE transactions on pattern analysis and machine intelligence.
[bib1] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. arXiv preprint arXiv:2301.08243, 2023.
[bib2] Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. Unsupervised speech recognition. Advances in Neural Information Processing Systems, 34:27826–27839, 2021.
[bib3] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555, 2022.
[bib4] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
[bib5] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
[bib6] Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
[bib7] Adrien Bardes, Jean Ponce, and Yann LeCun. Vicregl: Self-supervised learning of local visual features. arXiv preprint arXiv:2210.01571, 2022.
[bib8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[bib9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. arXiv preprint arXiv:2104.14294, 2021.
[bib10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. preprint arXiv:2002.05709, 2020.
[bib11] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020.
[bib12] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[bib13] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15750–15758, 2021.
[bib14] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. arXiv preprint arXiv:2104.02057, 2021.
[bib15] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.
[bib16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[bib17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[bib18] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin A. Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014.
[bib19] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9588–9597, 2021.
[bib20] Aleksandr Ermolov, Aliaksandr Siarohin, E. Sangineto, and N. Sebe. Whitening for self-supervised representation learning. In International Conference on Machine Learning, 2020.
[bib21] Priya Goyal, Quentin Duval, Jeremy Reizenstein, Matthew Leavitt, Min Xu, Benjamin Lefaudeux, Mannat Singh, Vinicius Reis, Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Ishan Misra. Vissl. https://github.com/facebookresearch/vissl, 2021.
[bib22] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
[bib23] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), 2:1735–1742, 2006.
[bib24] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
[bib25] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
[bib26] Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. On feature decorrelation in self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9598–9608, October 2021.
[bib27] Allan Jabri, Andrew Owens, and Alexei Efros. Space-time correspondence as a contrastive random walk. Advances in neural information processing systems, 33:19545–19560, 2020.
[bib28] Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017.
[bib29] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[bib30] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[bib31] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[bib32] Yann LeCun. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. 2022.
[bib33] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020.
[bib34] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[bib35] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
[bib36] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
[bib37] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[bib38] Ruslan Salakhutdinov and Geoff Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Artificial Intelligence and Statistics, pages 412–419. PMLR, 2007.
[bib39] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In European Conference on Computer Vision, 2020.
[bib40] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.
[bib41] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
[bib42] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008.
[bib43] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.
[bib44] Weiran Wang, Qingming Tang, and Karen Livescu. Unsupervised pre-training of bidirectional speech encoders via masked reconstruction. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6889–6893. IEEE, 2020.
[bib45] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3733–3742, 2018.
[bib46] Tete Xiao, Xiaolong Wang, Alexei A Efros, and Trevor Darrell. What should not be contrastive in contrastive learning. In International Conference on Learning Representations, 2021.
[bib47] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling. arXiv preprint arXiv:2111.09886, 2021.
[bib48] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.
[bib49] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly, and Neil Houlsby. A large-scale study of representation learning with the visual task adaptation benchmark, 2019.
[bib50] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In European Conference on Computer Vision, 2016.
[bib51] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
[bib53] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021.