Variance-Covariance Regularization Improves Representation Learning
Jiachen Zhu, Katrina Evtimova, Yubei Chen, Ravid Shwartz-Ziv, Yann LeCun
Abstract
Transfer learning has emerged as a key approach in the machine learning domain, enabling the application of knowledge derived from one domain to improve performance on subsequent tasks. Given the often limited information about these subsequent tasks, a strong transfer learning approach calls for the model to capture a diverse range of features during the initial pretraining stage. However, recent research suggests that, without sufficient regularization, the network tends to concentrate on features that primarily reduce the pretraining loss function. This tendency can result in inadequate feature learning and impaired generalization capability for target tasks. To address this issue, we propose Variance-Covariance Regularization (VCR), a regularization technique aimed at fostering diversity in the learned network features. Drawing inspiration from recent advancements in the self-supervised learning approach, our approach promotes learned representations that exhibit high variance and minimal covariance, thus preventing the network from focusing solely on loss-reducing features. We empirically validate the efficacy of our method through comprehensive experiments coupled with in-depth analytical studies on the learned representations. In addition, we develop an efficient implementation strategy that assures minimal computational overhead associated with our method. Our results indicate that VCR is a powerful and efficient method for enhancing transfer learning performance for both supervised learning and self-supervised learning, opening new possibilities for future research in this domain.
Transfer learning plays a key role in advancing machine learning models, yet conventional supervised pretraining often undermines feature transferability by prioritizing features that minimize the pretraining loss. In this work, we adapt a self-supervised learning regularization technique from the VICReg method to supervised learning contexts, introducing Variance-Covariance Regularization (VCReg). This adaptation encourages the network to learn high-variance, low-covariance representations, promoting the learning of more diverse features. We outline best practices for an efficient implementation of our framework, including applying it to the intermediate representations. Through extensive empirical evaluation, we demonstrate that our method significantly enhances transfer learning for images and videos, achieving state-of-the-art performance across numerous tasks and datasets. VCReg also improves performance in scenarios such as long-tail learning and hierarchical classification. Additionally, we show that its effectiveness may stem from its success in addressing challenges such as gradient starvation and neural collapse. In summary, VCReg offers a universally applicable regularization framework that significantly advances transfer learning and highlights the connection between gradient starvation, neural collapse, and feature transferability.
Introduction
Transfer learning enables models to apply knowledge from one domain to enhance performance in another, particularly when data are scarce or costly to obtain (Pan & Yang, 2010; Weiss et al., 2016; Zhuang et al., 2020; Bommasani et al., 2021). One of the key challenges arises during the supervised pretraining phase. In this phase, models often lack
1 New York University, Computer Science Department; 2 New York University, Center for Data Science; 3 UC Davis, Electrical and Computer Engineering Department; 4 Meta AI, FAIR. Correspondence to: Jiachen Zhu <jiachen.zhu@nyu.edu>.

Figure 1. VCReg regularizes the network by encouraging the intermediate representations to have high variance and low covariance. VCReg is applied to the output of each network block to make all the intermediate representations capture diverse features.
detailed information about the downstream tasks to which they will be applied. Nevertheless, they must aim to capture a broad spectrum of features beneficial across various applications (Bengio, 2012; Caruana, 1997; Yosinski et al., 2014). Without proper regularization techniques, these supervised pretrained models tend to overly focus on features that minimize supervised loss, resulting in limited generalization capabilities and issues such as gradient starvation and neural collapse (Zhang et al., 2016; Neyshabur et al., 2017; Zhang et al., 2021; Pezeshki et al., 2021; Papyan et al., 2020; Shwartz-Ziv, 2022).
To tackle these challenges, we adapt the regularization techniques of the self-supervised VICReg method (Bardes et al., 2021) for the supervised learning paradigm. Our method, termed Variance-Covariance Regularization (VCReg), aims to encourage the learning of representations with high variance and low covariance, thus avoiding the overemphasis on features that merely minimize supervised loss. Instead of simply applying VCReg to the final representation of the network, we explore the most effective ways to incorporate it throughout the intermediate representations.
The structure of the paper is as follows: we begin with an introduction of our method, including an outline of a fast implementation strategy designed to minimize computational overhead. Following this, we present a series of experiments aimed at validating the method's efficacy across a wide range of tasks, datasets, and architectures.
Subsequently, we conduct analyses on the learned representations to demonstrate VCReg's effectiveness in mitigating common issues in transfer learning, such as neural collapse and gradient starvation.
Our paper makes the following contributions:
- We introduce a robust strategy for applying VCReg to neural networks, including integrating it into the intermediate layers.
- We propose a computationally efficient implementation of VCReg. This implementation is optimized to ensure minimal additional computational overhead, allowing for seamless integration into existing workflows.
- Through extensive experiments on benchmark datasets for both images and videos, we demonstrate that VCReg surpasses prior state-of-the-art transfer learning results across various network architectures, including ResNet (He et al., 2016), ConvNeXt (Liu et al., 2022), and ViT (Dosovitskiy et al., 2020). Moreover, we show that VCReg improves performance in diverse scenarios such as long-tail learning and hierarchical classification.
- We investigate the representations learned by VCReg, revealing its effectiveness in combating challenges such as gradient starvation (Pezeshki et al., 2021), neural collapse (Papyan et al., 2020), information compression (Shwartz-Ziv, 2022), and sensitivity to noise.
Before delving into the details of VCReg in the following sections, it is worth noting how it diverges from VICReg: VCReg omits the invariance loss and retains only the variance and covariance losses, broadening its applicability, especially to transfer learning. This design tackles challenges such as gradient starvation and neural collapse, advancing neural network training across various architectures. Our work further distinguishes itself by exploring where and how the regularization is best applied, moving beyond generic application to significantly enhance its effectiveness.
Related Work
Variance-Invariance-Covariance Regularization (VICReg)
VICReg (Bardes et al., 2021) is a novel SSL method that encourages the learned representation to be invariant to data augmentation. However, focusing solely on this invariance criterion can result in the network producing a constant representation, making it invariant to both data augmentation and the input data itself.
VICReg primarily regularizes the network by combining a variance loss and a covariance loss. The variance loss encourages high variance in the learned representations, thereby promoting the learning of diverse features. The covariance loss, on the other hand, aims to minimize redundancy in the learned features by reducing the overlap in information captured by different dimensions of the representation. This dual-objective optimization framework effectively promotes diverse feature learning for SSL (Shwartz-Ziv et al., 2022). To improve the performance of supervised network training, we adapt the SSL feature collapse prevention mechanism from VICReg and propose a variance-covariance regularization method.
To calculate the loss function of VICReg with a batch of data $\{x_1, \dots, x_N\}$, we first form a pair of inputs $(x'_i, x''_i)$ such that $x'_i$ and $x''_i$ are two augmented versions of the original input $x_i$. Given the neural network $f_\theta(\cdot)$ and the final representations $z'_i = f_\theta(x'_i)$ and $z''_i = f_\theta(x''_i)$ with $z'_i, z''_i \in \mathbb{R}^D$, VICReg minimizes the following loss:

$$
\mathcal{L} = \frac{\lambda}{N} \sum_{i=1}^{N} \| z'_i - z''_i \|_2^2 + \mu \left[ \ell_{\mathrm{var}}(Z') + \ell_{\mathrm{var}}(Z'') \right] + \nu \left[ \ell_{\mathrm{cov}}(Z') + \ell_{\mathrm{cov}}(Z'') \right]
$$

The variance and covariance loss functions are defined as:

$$
\ell_{\mathrm{var}}(Z) = \frac{1}{D} \sum_{d=1}^{D} \max\!\left(0,\; \gamma - \sqrt{C_{dd} + \epsilon}\right)
$$

$$
\ell_{\mathrm{cov}}(Z) = \frac{1}{D} \sum_{d=1}^{D} \sum_{d' \neq d} C_{dd'}^2
$$

where $C = \frac{1}{N-1} \sum_{i=1}^{N} (z_i - \bar{z})(z_i - \bar{z})^T$ denotes the covariance matrix, and $\bar{z} = \frac{1}{N} \sum_{i=1}^{N} z_i$ is the mean vector.
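For concreteness, the variance and covariance terms above can be sketched in PyTorch as follows. This is a minimal sketch, not the official VICReg implementation; the values $\gamma = 1$ and $\epsilon = 10^{-4}$ are the commonly used defaults.

```python
import torch

def variance_loss(z: torch.Tensor, gamma: float = 1.0, eps: float = 1e-4) -> torch.Tensor:
    # Hinge loss that pushes the standard deviation of every embedding
    # dimension above the threshold gamma.
    std = torch.sqrt(z.var(dim=0) + eps)              # shape (D,)
    return torch.relu(gamma - std).mean()

def covariance_loss(z: torch.Tensor) -> torch.Tensor:
    # Penalize the squared off-diagonal entries of the D x D covariance
    # matrix, decorrelating the embedding dimensions.
    n, d = z.shape
    zc = z - z.mean(dim=0)
    cov = (zc.T @ zc) / (n - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    return off_diag.pow(2).sum() / d

# A constant batch has zero variance, so the variance loss saturates near
# gamma, while a large random batch is nearly decorrelated and incurs
# little covariance penalty.
```

In practice these two terms are computed once per batch on the representation matrix and added to the main objective with separate weights.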
Building on insights from prior studies (Shwartz-Ziv, 2022; Shwartz-Ziv et al., 2023), it is understood that the invariance term does not play a pivotal role in diversifying features. Consequently, in adapting to the supervised regime, we exclude the invariance term from the regularization.
Representation Whitening and Feature Diversity Regularizers
Representation whitening is a technique for processing inputs before they enter a network layer. It transforms the input so that its components are uncorrelated with unit variance (Kessy et al., 2018). This transformation achieves enhanced model optimization and generalization. It uses a whitening matrix derived from the data's covariance matrix and results in an identity covariance matrix, thereby aiding gradient flow during training and acting as a lightweight regularizer to reduce overfitting and encourage robust data representations (LeCun et al., 2002).
In addition to whitening as a processing step, additional regularization terms can be introduced to enforce decorrelation in the representations. Various prior works have explored these feature diversity regularization techniques to enhance neural network training (Cogswell et al., 2015; Ayinde et al., 2019; Laakom et al., 2023). These methods encourage diverse features in the representation by adding a regularization term. Recent methods like WLD-Reg (Laakom et al., 2023) and DeCov (Cogswell et al., 2015) also employ covariance-matrix-based regularization to promote feature diversity, similarly to our approach.
However, the studies above mainly focus on the benefits of optimization and generalization for the source task, often neglecting their implications for supervised transfer learning. VCReg distinguishes itself by explicitly targeting enhancements in transfer learning performance. Our results indicate that such regularization techniques yield only modest performance improvements in in-domain evaluations.
Gradient Starvation and Neural Collapse
Gradient starvation and neural collapse are two recently recognized phenomena that can significantly affect the quality of learned representations and a network's generalization ability (Pezeshki et al., 2021; Papyan et al., 2020; Ben-Shaul et al., 2023). Gradient starvation occurs when certain parameters in a deep learning model receive very small gradients during the training process, thereby leading to slower or non-existent learning for these parameters (Pezeshki et al., 2021). Neural collapse, on the other hand, is a phenomenon observed during the late stages of training when the internal representations of the network tend to collapse towards each other, resulting in a loss of feature diversity (Papyan et al., 2020). Both phenomena are particularly relevant in the context of transfer learning, where models are initially trained on a source task before being fine-tuned for a target task. Our work, through the use of VCReg, seeks to mitigate these issues, offering a pathway to more effective transfer learning.
Variance-Covariance Regularization
Vanilla VCReg
Consider a labeled dataset comprising $N$ samples, denoted $\{(x_1, y_1), \dots, (x_N, y_N)\}$, and a neural network $f_\theta(\cdot)$ that takes these inputs $x_i$ and produces final predictions $\tilde{y}_i = f_\theta(x_i)$. In standard supervised learning, the loss is defined as $\mathcal{L}_{\mathrm{sup}} = \frac{1}{N} \sum_{i=1}^{N} \ell_{\mathrm{sup}}(\tilde{y}_i, y_i)$.
The core objective of vanilla VCReg is to ensure that the $D$-dimensional input representations $\{h_i\}_{i=1}^{N}$ to the last layer of the network exhibit both high variance and low covariance. To achieve this, we apply the same variance and covariance losses defined above to these representations:

$$
\ell_{\mathrm{var}}(H) = \frac{1}{D} \sum_{d=1}^{D} \max\!\left(0,\; \gamma - \sqrt{C_{dd} + \epsilon}\right), \qquad \ell_{\mathrm{cov}}(H) = \frac{1}{D} \sum_{d=1}^{D} \sum_{d' \neq d} C_{dd'}^2
$$

where $C$ is now the covariance matrix of $\{h_i\}_{i=1}^{N}$.
Intuitively speaking, the covariance matrix captures the interdependencies among the dimensions of the feature vectors $h_i$. Minimizing $\ell_{\mathrm{var}}$ keeps the variance of every feature dimension high, so that each dimension carries an informative signal, while minimizing $\ell_{\mathrm{cov}}$ reduces the correlation between different dimensions, thus promoting non-redundant, independent features. The overall training loss, which also includes the supervised loss, then becomes:
$$
\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + \alpha\, \ell_{\mathrm{var}}(H) + \beta\, \ell_{\mathrm{cov}}(H)
$$
Here, α and β serve as hyperparameters to control the strength of each regularization term.
Extending VCReg to Intermediate Representations
While regularizing the final layer in a neural network offers certain benefits, extending this approach to intermediate layers via VCReg provides additional advantages (for empirical evidence supporting this claim, please refer to Appendix A). Regularizing intermediate layers enables the model to capture more complex, higher-level abstractions. This strategy minimizes internal covariate shifts across layers, which in turn improves both the stability of training and the model's generalization capabilities. Furthermore, it fosters the development of feature hierarchies and enriches the latent space, leading to enhanced model interpretability and improved transfer learning performance.
To implement this extension, VCReg is applied at $M$ strategically chosen layers throughout the neural network. For each intermediate layer $j$, we denote the feature representation for an input $x_i$ as $h_i^{(j)} \in \mathbb{R}^{D_j}$. This culminates in a composite loss function, expressed as follows:

$$
\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + \sum_{j=1}^{M} \left[ \alpha\, \ell_{\mathrm{var}}(H^{(j)}) + \beta\, \ell_{\mathrm{cov}}(H^{(j)}) \right]
$$
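One way to realize this composite loss in PyTorch is to collect intermediate representations with forward hooks. The wrapper below is a hypothetical sketch: the coefficient values, the `vcreg_penalty` helper, and the choice of hooked layers are all illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

def vcreg_penalty(h, alpha=0.1, beta=0.01, gamma=1.0, eps=1e-4):
    # Variance-covariance penalty on an (N, D) batch of representations.
    n, d = h.shape
    std = torch.sqrt(h.var(dim=0) + eps)
    var_loss = torch.relu(gamma - std).mean()
    hc = h - h.mean(dim=0)
    cov = (hc.T @ hc) / (n - 1)
    cov_loss = (cov - torch.diag(torch.diag(cov))).pow(2).sum() / d
    return alpha * var_loss + beta * cov_loss

class VCRegWrapper(nn.Module):
    # Records the outputs of the chosen layers on every forward pass and
    # returns the summed VCReg penalty alongside the model output.
    def __init__(self, model, layers):
        super().__init__()
        self.model, self._feats = model, []
        for layer in layers:
            layer.register_forward_hook(lambda m, inp, out: self._feats.append(out))

    def forward(self, x):
        self._feats.clear()
        y = self.model(x)
        reg = sum(vcreg_penalty(h.flatten(1)) for h in self._feats)
        return y, reg

# Toy usage: regularize both hidden activations of a small MLP.
mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
                    nn.Linear(32, 32), nn.ReLU(),
                    nn.Linear(32, 10))
wrapped = VCRegWrapper(mlp, layers=[mlp[1], mlp[3]])
logits, reg = wrapped(torch.randn(64, 16))
total_loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (64,))) + reg
```

The hook-based design leaves the wrapped model untouched, so the same network can be trained with or without the regularizer.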
Spatial Dimensions Applying VCReg to the intermediate layers of real-world neural networks presents a challenge: these intermediate representations have spatial dimensions. Naively reshaping them into long vectors would lead to unmanageably large covariance matrices, thereby increasing computational costs and
risking numerical instability. To address this issue, we adapt VCReg to accommodate networks with spatial dimensions. Each vector at a different spatial location is treated as an individual sample when calculating the covariance matrix. Both the variance loss and the covariance loss are then calculated based on this modified covariance matrix.
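A sketch of this spatial adaptation, assuming channels-first (N, C, H, W) feature maps:

```python
import torch

def vcreg_spatial(h: torch.Tensor, gamma: float = 1.0, eps: float = 1e-4):
    # Treat every spatial position as its own sample: an (N, C, H, W) map
    # becomes (N*H*W, C), so the covariance matrix stays C x C instead of
    # the unmanageable (C*H*W) x (C*H*W).
    n, c, height, width = h.shape
    z = h.permute(0, 2, 3, 1).reshape(-1, c)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(gamma - std).mean()
    zc = z - z.mean(dim=0)
    cov = (zc.T @ zc) / (z.shape[0] - 1)          # (C, C) covariance
    cov_loss = (cov - torch.diag(torch.diag(cov))).pow(2).sum() / c
    return var_loss, cov_loss
```

Since the number of pseudo-samples grows with spatial resolution, the covariance estimate also becomes more stable at earlier, higher-resolution layers.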
In terms of practical implementation, VCReg is typically applied after each block within the neural network architecture, often following the residual connection. This placement allows for seamless incorporation into existing network architectures and training paradigms.
Addressing Outliers with Smooth L1 Loss Once spatial locations are treated as independent samples for the covariance computation, the resulting samples are no longer statistically independent. This can lead to outliers in the covariance matrix and unstable gradient updates. To address this, we introduce a smooth L1 penalty into the covariance loss term, replacing the squared covariance values $C_{ij}^2$ in $\ell_{\mathrm{cov}}$ with a smooth L1 function:

$$
\mathrm{smooth}_{L1}(C_{ij}) = \begin{cases} \frac{1}{2} C_{ij}^2 & \text{if } |C_{ij}| \leq 1 \\ |C_{ij}| - \frac{1}{2} & \text{otherwise} \end{cases}
$$
By implementing this modification, we ensure that the loss function increases in a more controlled manner with respect to large covariance values. Empirically, this minimizes the impact of outliers, thereby enhancing the stability of the training process.
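A minimal sketch of the modified covariance term, using the standard smooth L1 (Huber-style) form; the threshold `delta = 1.0` is an assumed default, not a value taken from the paper.

```python
import torch

def smooth_l1_cov_loss(cov: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    # Quadratic near zero, linear beyond |x| > delta, so an outlying
    # covariance entry contributes a bounded gradient instead of the
    # unbounded gradient of a squared penalty.
    d = cov.shape[0]
    off = cov - torch.diag(torch.diag(cov))       # zero out the diagonal
    a = off.abs()
    penalty = torch.where(a <= delta, 0.5 * off.pow(2) / delta, a - 0.5 * delta)
    return penalty.sum() / d
```

With this penalty, an off-diagonal entry of magnitude 4 contributes 3.5 rather than the 16 a squared penalty would give, which is what keeps outliers from dominating the update.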
Fast Implementation
To optimize the speed of VCReg, we exploit the fact that VCReg affects only the loss function and not the forward pass. This allows us to modify the backward function directly. Specifically, we sidestep the usual process of computing the VCReg loss and backpropagating through it; instead, we directly adjust the computed gradients, which is feasible because the VCReg loss depends solely on the current representation. Further details of this speed-optimized technique are outlined in Appendix B. Our optimized VCReg implementation exhibits latency similar to that of batch normalization layers and is more than 5 times faster than the naive VCReg implementation. The results are presented in Table 8.
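One way to sketch this gradient-injection idea in PyTorch is a custom autograd `Function` that is the identity in the forward pass and adds the locally computed VCReg gradient in the backward pass. Here the local gradient is obtained with a nested autograd call for brevity; the paper's implementation may instead use a hand-derived closed form.

```python
import torch

class FastVCReg(torch.autograd.Function):
    # Identity in the forward pass; the backward pass adds the gradient of
    # the VCReg penalty w.r.t. the representation to the incoming gradient,
    # so no separate loss node or extra graph traversal is needed.
    @staticmethod
    def forward(ctx, h, alpha, beta, gamma, eps):
        ctx.save_for_backward(h)
        ctx.hparams = (alpha, beta, gamma, eps)
        return h.view_as(h)

    @staticmethod
    def backward(ctx, grad_output):
        (h,) = ctx.saved_tensors
        alpha, beta, gamma, eps = ctx.hparams
        with torch.enable_grad():
            hh = h.detach().requires_grad_(True)
            n, d = hh.shape
            std = torch.sqrt(hh.var(dim=0) + eps)
            var_loss = torch.relu(gamma - std).mean()
            hc = hh - hh.mean(dim=0)
            cov = (hc.T @ hc) / (n - 1)
            cov_loss = (cov - torch.diag(torch.diag(cov))).pow(2).sum() / d
            (g,) = torch.autograd.grad(alpha * var_loss + beta * cov_loss, hh)
        return grad_output + g, None, None, None, None

# Usage: the forward pass is a no-op, but gradients flowing through this
# node pick up the VCReg correction.
h = torch.randn(32, 8, requires_grad=True)
FastVCReg.apply(h, 0.1, 0.01, 1.0, 1e-4).sum().backward()
```

Because the forward pass is untouched, this node can be dropped into a network without changing its outputs or inference cost.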
Experiments
In this section, we first outline the experimental framework and findings highlighting the effectiveness of our proposed regularization approach, VCReg, within the realm of transfer learning that utilizes supervised pretraining for both images and videos. Subsequently, we extend our experiments to three specialized learning scenarios: 1) class imbalance via long-tail learning, 2) synergizing with self-supervised learning frameworks, and 3) hierarchical classification problems. The objective is to assess the adaptability of VCReg across various data distributions and learning paradigms, thereby evaluating its broader utility in machine learning applications. For details on reproducing our experiments, please consult Appendix C.
Transfer Learning for Images
In this section, we adhere to evaluation protocols established by seminal works such as (Chen et al., 2020; Kornblith et al., 2021; Misra & Maaten, 2020) for our transfer learning experiments.
Initially, we pretrain models using three different architectures: ResNet-50 (He et al., 2016), ConvNeXt-Tiny (Liu et al., 2022), and ViT-Base-32 (Dosovitskiy et al., 2020), on the full ImageNet dataset. We follow the standard PyTorch recipes (Paszke et al., 2019) for all networks and modify no hyperparameters other than those related to VCReg, ensuring a fair baseline comparison. Subsequently, we perform a linear probing evaluation across 9 different benchmark datasets to evaluate transfer learning performance.
For ResNet-50, we include two other feature diversity regularizer methods for comparison: DeCov (Cogswell et al., 2015) and WLD-Reg (Laakom et al., 2023). We conduct experiments solely with ResNet-50 because it is the principal architecture used in the WLD-Reg paper. To ensure a fair comparison, we source hyperparameters from Laakom et al. (2023) for both DeCov and WLD-Reg.
The results in Table 1 demonstrate that VCReg significantly enhances performance in transfer learning for images across almost all downstream datasets, achieving the highest performance for 9 out of 10 datasets, and for all three architectures. Clearly, VCReg acts as a versatile plug-in, effectively boosting transfer learning outcomes. Its effectiveness spans ConvNet and Transformer architectures, confirming its wide-ranging applicability.
Transfer Learning for Videos
To extend our evaluation of VCReg's efficacy, we conduct additional experiments using networks pretrained on video datasets. Specifically, we utilized models pretrained on Kinetics-400 (Kay et al., 2017) and Kinetics-710 (Li et al., 2022), subsequently finetuning them for action recognition tasks on HMDB51 (Kuehne et al., 2011). This set of experiments encompassed models trained with self-supervised learning objectives, including VideoMAE (Tong et al., 2022) and VideoMAEv2 (Wang et al., 2023), as well as models trained with conventional supervised learning objectives, such as ViViT (Arnab et al., 2021).
We follow the finetuning protocols detailed by Tong et al. (2022) and the conventional evaluation method used in the field, where the final performance is measured by the mean classification accuracy across the three provided splits (Simonyan & Zisserman, 2014). To pinpoint the optimal VCReg coefficients, we conduct a grid search based on validation set accuracy. For simplicity, in this setup, VCReg regularization is applied exclusively to the final output of each network during finetuning, just before the classification head.

Table 1. Transfer Learning Performance with ImageNet Supervised Pretraining: The table shows performance metrics for different architectures. Each model is pretrained on the full ImageNet dataset and then evaluated on downstream datasets using linear probing. Applying VCReg consistently improves performance and beats the other feature diversity regularizers. Averages are calculated excluding ImageNet results.
Table 2 illustrates that incorporating VCReg as a plug-in regularizer enhances video classification performance across various methods (VideoMAE, VideoMAEv2, and ViViT-B) and backbone architectures (ViT-B and ViT-S). The performance gains are evident across all models in the table. This consistent enhancement across a spectrum of models solidifies VCReg's status as a practical and versatile regularizer, capable of substantially improving the performance of pretrained networks in transfer learning scenarios.
Class Imbalance with Long-Tail Learning
Class imbalance is a pervasive issue in many real-world datasets and poses a considerable challenge to standard neural network training algorithms. We conduct experiments to assess how well VCReg addresses this issue through long-tail learning. We evaluate VCReg using the CIFAR10-LT and CIFAR100-LT (Krizhevsky et al., 2009) datasets, both engineered to have an imbalance ratio of 100, with a ResNet-32 backbone architecture. The per-class sample sizes range from 5,000 down to 50 for CIFAR10-LT and from 500 down to 5 for CIFAR100-LT.
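These per-class counts follow the standard exponential long-tail construction; the sketch below reproduces them (the exact rounding scheme is an assumption).

```python
def long_tail_counts(n_max: int, num_classes: int, imbalance_ratio: float) -> list:
    # Class c keeps n_max * ratio^(-c / (num_classes - 1)) samples, decaying
    # exponentially from n_max down to n_max / imbalance_ratio.
    return [round(n_max * imbalance_ratio ** (-c / (num_classes - 1)))
            for c in range(num_classes)]

cifar10_lt = long_tail_counts(5000, 10, 100)    # 5000 ... 50
cifar100_lt = long_tail_counts(500, 100, 100)   # 500 ... 5
```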
Table 3 shows that models augmented with VCReg consistently outperform the standard ResNet-32 models on imbalanced datasets. These results are noteworthy because they demonstrate that VCReg effectively enhances the model's ability to discriminate between classes in imbalanced settings. This establishes VCReg as a valuable tool for real-world applications where class imbalance is often a concern.

Table 2. Transfer Learning Performance with Kinetics-400 and Kinetics-710 pretrained models: The table shows the finetuning performance of Kinetics-pretrained models on HMDB51. VideoMAE-S, VideoMAE-B, and ViViT-B are pretrained on Kinetics-400, while VideoMAEv2-S and VideoMAEv2-B are pretrained on Kinetics-710. We apply VCReg only to the networks' output preceding the classification head. The results show that VCReg can boost transfer learning classification performance for networks pretrained on video data.
Self-Supervised Learning with VCReg
Our subsequent investigation examines the synergy between VCReg and existing self-supervised learning paradigms. As in the previous sections, we apply VCReg not only to the final representation but also to the intermediate ones. In all of the following self-supervised experiments, the original SSL loss is applied to the output of the network, while the VCReg loss is applied to all intermediate representations.
We employ a ResNet-50 architecture, training it on ImageNet for 100 epochs under four different configurations: SimCLR loss or VICReg loss, each with and without VCReg. For evaluation, we conduct linear probing tests on multiple downstream task datasets, following the protocols prescribed by (Misra & Maaten, 2020; Zbontar et al., 2021).

Table 3. Performance Comparison on Class-Imbalanced Datasets Using VCReg: This table shows the accuracy of standard ResNet-32 with and without VCReg when trained on the class-imbalanced CIFAR10-LT and CIFAR100-LT datasets. The VCReg-enhanced models show improved performance, demonstrating the method's effectiveness in addressing class imbalance.
As indicated in Table 4, integrating VCReg into self-supervised learning paradigms such as SimCLR and VICReg results in consistent performance improvements for transfer learning. Specifically, the linear probing accuracies are enhanced across nearly all the evaluated datasets. These gains underscore the broad applicability and versatility of VCReg, demonstrating its potential to enhance various machine learning methodologies.
Hierarchical Classification
To evaluate the efficacy of the learned representations across multiple levels of class granularity, we conduct experiments on the CIFAR100 dataset as well as five distinct subsets of ImageNet (Engstrom et al., 2019). In each dataset, every data sample is tagged with both superclass and subclass labels, denoted $(x_i, y_i^{\mathrm{sup}}, y_i^{\mathrm{sub}})$. Note that while samples sharing the same subclass label also share the same superclass label, the reverse does not necessarily hold. Initially, the model is trained using only the superclass labels, i.e., the $(x_i, y_i^{\mathrm{sup}})$ pairs. Subsequently, linear probing is employed with the subclass labels $(x_i, y_i^{\mathrm{sub}})$ to assess the quality of the abstract features learned at the superclass level.
Table 5 presents the key performance metrics, highlighting the substantial improvements VCReg brings to subclass classification. The improvements are consistent across all datasets, with CIFAR100 showing the most significant gain: an increase in accuracy from 60.7% to 72.9%. These results underscore VCReg's capability to assist neural networks in generating feature representations that are not only discriminative at the superclass level but also well-suited for subclass distinctions. This attribute is particularly advantageous in real-world applications where class categorizations often exist within a hierarchical framework.

Figure 2. Comparative evaluation between training with and without VCReg on a 'Two-Moon' Synthetic Dataset. Decision boundaries are averaged over ten distinct runs with random data point sampling and model initialization. A single run's data points are displayed for visual clarity. The contrast between VCReg and 'No regularization' underscores the latter's limitations in forming intricate decision boundaries, while highlighting VCReg's effectiveness in generating meaningful ones.
Exploring the Benefits of VCReg
This section aims to thoroughly unpack the multi-faceted benefits of VCReg in the context of supervised neural network training. Specifically, we discuss its capability to address challenges such as gradient starvation (Pezeshki et al., 2021), neural collapse (Papyan et al., 2020), noisy data, and the preservation of information richness during model training (Shwartz-Ziv, 2022).
Mitigating Gradient Starvation
In line with the original study on gradient starvation (Pezeshki et al., 2021), we observe that most traditional regularization techniques fail to capture the vital features in the 'two-moon' dataset experiment. To assess the effectiveness of VCReg, we replicate this setting with a three-layer network and apply our method during training. The visualized results in Figure 2 show that VCReg has a marked advantage over traditional regularization techniques, particularly in its separation margins. It is therefore reasonable to conclude that VCReg helps mitigate gradient starvation. See Appendix E for details of the 'two-moon' experiments.
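The setup can be reproduced in miniature. The sketch below generates a two-moon dataset with NumPy, trains a small three-layer network, and adds a VCReg penalty on the penultimate representation; the coefficients, network width, and noise level are illustrative choices, not the paper's.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Two interleaving half-circles ("two moons") with Gaussian noise.
t = rng.uniform(0.0, np.pi, 256)
outer = np.stack([np.cos(t), np.sin(t)], axis=1)
inner = np.stack([1.0 - np.cos(t), 0.5 - np.sin(t)], axis=1)
X = torch.tensor(np.concatenate([outer, inner]) + rng.normal(0, 0.1, (512, 2)),
                 dtype=torch.float32)
y = torch.tensor([0] * 256 + [1] * 256)

net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, 2)
opt = torch.optim.Adam(list(net.parameters()) + list(head.parameters()), lr=1e-2)

for _ in range(200):
    h = net(X)
    loss = nn.functional.cross_entropy(head(h), y)
    # VCReg penalty on the penultimate representation.
    std = torch.sqrt(h.var(dim=0) + 1e-4)
    hc = h - h.mean(dim=0)
    cov = (hc.T @ hc) / (h.shape[0] - 1)
    loss = loss + 0.1 * torch.relu(1.0 - std).mean() \
                + 0.01 * (cov - torch.diag(torch.diag(cov))).pow(2).sum() / h.shape[1]
    opt.zero_grad()
    loss.backward()
    opt.step()

accuracy = (head(net(X)).argmax(dim=1) == y).float().mean().item()
```

Plotting the decision boundary of the trained `net`/`head` pair on a grid over the input plane reproduces the qualitative comparison shown in Figure 2.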
Preventing Neural Collapse and Information Compression
To deepen our understanding of VCReg and its training dynamics, we closely examine its learned representations. A recent study (Papyan et al., 2020) observed a peculiar trend in deep networks trained for classification tasks: the top-layer feature embeddings of training samples from the same class tend to cluster around their respective class means, which are as distant from each other as possible. This phenomenon can result in a loss of diversity among the learned features (Papyan et al., 2020), curtailing the network's capacity to grasp the complexity of the data and leading to suboptimal performance for transfer learning (Li et al., 2018).

Table 4. Impact of VCReg on Self-Supervised Learning Methods: This table presents a comparative analysis of ResNet-50 models pretrained with SimCLR and VICReg losses on ImageNet, both with and without VCReg applied. The models are evaluated using linear probing on various downstream task datasets. The VCReg models consistently outperform the non-VCReg models, showcasing the method's broad utility in transfer learning for self-supervised learning scenarios. Averages are calculated excluding ImageNet results.

Table 5. Impact of VCReg on Hierarchical Classification in ConvNeXt Models: This table summarizes the classification accuracies obtained with ConvNeXt models, both with and without VCReg regularization, across multiple datasets featuring hierarchical class structures. The models were initially trained using superclass labels and subsequently probed using subclass labels. VCReg consistently boosts performance in subclass classification tasks.
Our neural collapse investigation includes two key metrics:
Class-Distance Normalized Variance (CDNV) For a feature map $f : \mathbb{R}^d \to \mathbb{R}^p$ and two unlabeled sets of samples $S_1, S_2 \subset \mathbb{R}^d$, the CDNV is defined as

$$
V_f(S_1, S_2) = \frac{\sigma_f^2(S_1) + \sigma_f^2(S_2)}{2\, \| \mu_f(S_1) - \mu_f(S_2) \|^2}
$$
where $\mu_f(S)$ and $\sigma_f^2(S)$ signify the mean and variance of the set $\{f(x) \mid x \in S\}$. This metric measures the degree of clustering of the features extracted from $S_1$ and $S_2$, in relation to the distance between their respective features. A value approaching zero indicates perfect clustering.
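Under the definition above, CDNV can be computed directly from two sets of features; a minimal sketch:

```python
import torch

def cdnv(s1: torch.Tensor, s2: torch.Tensor) -> float:
    # Class-Distance Normalized Variance between two feature sets of shape
    # (n_i, p): within-set variance divided by twice the squared distance
    # between the set means. Values near zero indicate collapse.
    mu1, mu2 = s1.mean(dim=0), s2.mean(dim=0)
    var1 = (s1 - mu1).pow(2).sum(dim=1).mean()
    var2 = (s2 - mu2).pow(2).sum(dim=1).mean()
    return ((var1 + var2) / (2.0 * (mu1 - mu2).pow(2).sum())).item()
```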
Nearest Class-Center Classifier (NCC) This classifier is defined as
$$
\hat{h}(x) = \arg\min_{c} \| f(x) - \mu_f(S_c) \|
$$

where $S_c$ denotes the training samples of class $c$.
According to this measure, as training proceeds, the collapsed feature embeddings in the penultimate layer become separable, and the classifier learned on top of them converges to the nearest class-center classifier.
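A sketch of the nearest class-center classifier used in this measurement:

```python
import torch

def ncc_predict(f_train: torch.Tensor, y_train: torch.Tensor,
                f_test: torch.Tensor) -> torch.Tensor:
    # Assign each test feature to the class whose mean training feature
    # (class center) is nearest in Euclidean distance.
    classes = y_train.unique()
    centers = torch.stack([f_train[y_train == c].mean(dim=0) for c in classes])
    dists = torch.cdist(f_test, centers)          # (n_test, n_classes)
    return classes[dists.argmin(dim=1)]
```

Comparing these predictions with the trained network's own predictions quantifies how far training has converged toward the nearest class-center behavior.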
Preventing Information Compression Although effective compression often yields superior representations, overly aggressive compression can discard crucial information about the target task (Shwartz-Ziv et al., 2018; Shwartz-Ziv & Alemi, 2020; Shwartz-Ziv & LeCun, 2023). To investigate compression during learning, we use mutual information neural estimation (MINE) (Belghazi et al., 2018), a method specifically designed to estimate the mutual information between the input and its corresponding embedded representation. This metric effectively gauges the complexity of the representation, essentially indicating how much information (in bits) it encodes.
We evaluate the learned representations of two ConvNeXt models (Liu et al., 2022), which are trained on ImageNet with supervised loss. One model is trained with VCReg, while the other is trained without VCReg. As demonstrated in Table 6, both types of collapse, measured by CDNV and NCC, and the mutual information estimation reveal that VCReg representations have significantly more diverse features (lower neural collapse) and contain more information compared to regular training. This suggests that not only does VCReg achieve superior results, but it also yields representations which contain more information.
In summary, the VCReg method mitigates the neural collapse phenomenon and prevents excessive information compression, two crucial factors that often limit the effectiveness of deep learning models in transfer learning tasks. Our findings highlight the potential of VCReg as a valuable addition to the deep learning toolbox, significantly increasing the
Table 6. VCReg learns richer representations and prevents neural collapse and information compression: Metrics include Class-Distance Normalized Variance (CDNV), Nearest Class-Center Classifier (NCC), and Mutual Information (MI). Higher values in each metric for the VCReg model indicate reduced neural collapse and richer feature representations.
generalizability of learned representations.
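To make the collapse metrics concrete, the following is a minimal pure-Python sketch of a pairwise CDNV score computed from frozen features. The helper `cdnv` and its per-pair formulation (within-class variances normalized by the squared distance between class means) are our own illustration, not the paper's evaluation code.

```python
def cdnv(feats_a, feats_b):
    # Pairwise CDNV: (Var_a + Var_b) / (2 * ||mu_a - mu_b||^2).
    # CDNV approaches 0 as within-class variance collapses;
    # higher values indicate less neural collapse for this class pair.
    def mean(vs):
        d = len(vs[0])
        return [sum(v[k] for v in vs) / len(vs) for k in range(d)]

    def var(vs, mu):
        # average squared Euclidean distance to the class mean
        return sum(sum((x - m) ** 2 for x, m in zip(v, mu)) for v in vs) / len(vs)

    mu_a, mu_b = mean(feats_a), mean(feats_b)
    dist_sq = sum((x - y) ** 2 for x, y in zip(mu_a, mu_b))
    return (var(feats_a, mu_a) + var(feats_b, mu_b)) / (2.0 * dist_sq)
```

In practice, such a score would be averaged over all class pairs of the dataset's embedded features.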
Providing Robustness to Noise
In real-world scenarios, encountering noise is a common challenge, making robustness against noise a crucial feature for any effective transfer learning algorithm. Recognizing the ubiquity of noise in practical applications, we aim to evaluate the capability of VCReg to bolster transfer learning performance in noisy environments.
For this purpose, we use video networks pretrained on Kinetics-400 and Kinetics-710, as described in Section 4.2. We then finetune these networks on the HMDB51 dataset, deliberately corrupted with varying levels of Gaussian noise. The findings in Table 3 reveal a clear advantage: incorporating VCReg notably improves the resilience of the VideoMAE-S and VideoMAEv2-S models to noisy data, a robustness not observed in models without VCReg. This increased resilience to noise is also consistently seen in larger models, such as VideoMAE-B and VideoMAEv2-B. Appendix D provides a more granular analysis of these results, complete with detailed figures.
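As an illustration of the corruption process, a simple sketch of per-pixel additive Gaussian noise is shown below. This is a simplification of the actual video pipeline; `add_gaussian_noise` and the flat-list frame representation are our own illustrative choices.

```python
import random

def add_gaussian_noise(frame, sigma, rng):
    # frame: flat list of normalized pixel values; returns a noisy copy
    return [px + rng.gauss(0.0, sigma) for px in frame]

rng = random.Random(0)
clean = [0.5] * 10_000
# one corrupted copy per noise level used in the experiments
corrupted = {s: add_gaussian_noise(clean, s, rng) for s in (1.0, 1.5, 2.0)}
```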
This investigation highlights the importance of maintaining strong performance in non-ideal settings. Robustness and reliability under challenges commonly encountered in real-world settings, such as noise, significantly boost a model's practical value.
Conclusion
In this work, we addressed prevalent challenges in supervised pretraining for transfer learning by introducing Variance-Covariance Regularization (VCReg). Building on the regularization technique of the self-supervised VICReg method, VCReg is designed to cultivate robust and generalizable features. Unlike conventional methods that attach regularization only to the final layer, we efficiently incorporate VCReg across intermediate layers to optimize its efficacy.

Figure 3. Impact of VCReg amidst noisy data: This figure shows the top-1 accuracy of VideoMAE-S and VideoMAEv2-S when fine-tuned for action recognition on HMDB51 corrupted with synthetic noise. We corrupt the data with Gaussian noise with standard deviation σ ∈ {1, 1.5, 2}. Models with VCReg outperform their non-regularized counterparts in this setting.
Our key contributions are threefold:
- We present a computationally efficient VCReg implementation that can be adapted to various network architectures.
- We provide empirical evidence through comprehensive evaluations on multiple benchmarks, demonstrating that VCReg yields significant improvements in transfer learning performance across various network architectures and learning paradigms, including video and image classification, long-tail learning, and hierarchical classification.
- Our in-depth analyses confirm VCReg's effectiveness in overcoming typical transfer learning hurdles such as neural collapse, gradient starvation, and noise.
To conclude, VCReg stands out as a potent and adaptable regularization technique that elevates the quality and applicability of learned representations. It enhances both the performance and reliability of models in transfer learning settings and paves the way for further research to achieve highly optimized and generalizable machine learning models.
Acknowledgements
This material is partially based upon work supported by the National Science Foundation under NSF Award 1922658.
Experiments
Experimental Investigation on Effective Application of VCReg to Standard Networks
To determine the optimal manner of integrating VCReg into a standard network, we conducted several experiments using the ConvNeXt-Atto architecture, trained on ImageNet following the torchvision (Paszke et al., 2019) training recipe. To reduce training time, we limited training to 90 epochs with a batch size of 4096. The complete configuration comprised 90 epochs, a batch size of 4096, and two learning rates of {0.016, 0.008} with a 5-epoch linear warmup followed by cosine annealing decay. The weight decay was set to 0.05, with norm layers excluded from weight decay. We experimented with α ∈ {1.28, 0.64, 0.32, 0.16} and β ∈ {0.16, 0.08, 0.04, 0.02, 0.01}.
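The learning rate schedule above (linear warmup followed by cosine annealing) can be sketched as follows. The helper name `lr_at` and the per-epoch granularity are our own simplification; the actual torchvision recipe may step per iteration.

```python
import math

def lr_at(epoch, base_lr, warmup_epochs=5, total_epochs=90):
    # linear warmup for the first `warmup_epochs`, then cosine annealing to zero
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# per-epoch schedule for the base learning rate 0.016
schedule = [lr_at(e, 0.016) for e in range(90)]
```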
We experimented with incorporating the VCReg layers in four different locations:
The VCReg layer was implemented as detailed in Algorithm 1, with the addition of a mean removal layer along the batch dimension preceding the VCReg layer, ensuring that the VCReg input has zero mean.
The results in Table 7 indicate superior performance when the VCReg layer is applied to the output of each block (second setup) or to the output of both blocks and downsample layers (fourth setup), compared to the other setups. Since architectures like ViT lack downsample layers, for consistency across architectures we use the second configuration in further experiments.
The Fast Implementation of the VCReg
VCReg does not affect the forward pass in any way, which allows us to substantially speed up the implementation by modifying the backward function directly. Instead of computing the VCReg loss and backpropagating it, we directly alter the calculated gradient. This is possible because the VCReg loss calculation only requires the current representation. The specifics of this speed-optimized implementation are outlined in Algorithm 1.
Table 7. Transfer Learning Experiments with Different VCReg Configurations
Algorithm 1 The Fast Implementation of VCReg
# alpha, beta and epsilon: hyperparameters
# mm: matrix-matrix multiplication
class VarianceCovarianceRegularizationFunction(Function):
    # forward pass
    # We assume the input has zero mean per channel.
    # In practice, we apply a batch demean operation before calling the function.
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input

    # backward pass
    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # reshape the input to have (n, d) shape
        flattened_input = input.flatten(start_dim=0, end_dim=-2)
        n, d = flattened_input.shape
        # calculate the covariance matrix
        covariance_matrix = mm(flattened_input.t(), flattened_input) / (n - 1)
        # calculate the gradient
        diagonal = F.threshold(rsqrt(covariance_matrix.diagonal() + epsilon), 1.0, 0.0)
        std_grad_input = diagonal * flattened_input
        cov_grad_input = mm(flattened_input, covariance_matrix.fill_diagonal_(0))
        grad_input = grad_output \
            - alpha / (d * (n - 1)) * std_grad_input.view_as(grad_output) \
            + 4 * beta / (d * (d - 1)) * cov_grad_input.view_as(grad_output)
        return grad_input
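As a sanity check on this style of implementation, an analytic gradient can be verified against finite differences of an explicitly computed loss. The sketch below is pure Python (no PyTorch) and uses our own formulation of the variance and covariance losses; constant factors may differ from the paper's implementation, where they can be absorbed into α and β.

```python
import math
import random

EPS = 1e-4

def covariance(H):
    # C = H^T H / (n - 1); demeaning, if needed, is done by the caller
    n, d = len(H), len(H[0])
    return [[sum(H[k][i] * H[k][j] for k in range(n)) / (n - 1)
             for j in range(d)] for i in range(d)]

def vcreg_loss(H, alpha, beta):
    n, d = len(H), len(H[0])
    C = covariance(H)
    # hinge-style variance loss plus squared off-diagonal covariance loss
    l_var = sum(max(0.0, 1.0 - math.sqrt(C[j][j] + EPS)) for j in range(d)) / d
    l_cov = sum(C[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / (d * (d - 1))
    return alpha * l_var + beta * l_cov

def vcreg_grad(H, alpha, beta):
    # analytic gradient of vcreg_loss with respect to each entry of H
    n, d = len(H), len(H[0])
    C = covariance(H)
    diag = [1.0 / math.sqrt(C[j][j] + EPS) if math.sqrt(C[j][j] + EPS) < 1.0 else 0.0
            for j in range(d)]
    return [[-alpha / (d * (n - 1)) * diag[b] * H[a][b]
             + 4 * beta / (d * (d - 1) * (n - 1)) * sum(
                 H[a][j] * C[j][b] for j in range(d) if j != b)
             for b in range(d)] for a in range(n)]

rng = random.Random(0)
H = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(8)]
means = [sum(H[k][j] for k in range(8)) / 8 for j in range(4)]
H = [[H[k][j] - means[j] for j in range(4)] for k in range(8)]  # demean columns

G = vcreg_grad(H, 1.0, 1.0)
h = 1e-6
for a, b in [(0, 0), (3, 2), (7, 3)]:
    Hp = [row[:] for row in H]; Hp[a][b] += h
    Hm = [row[:] for row in H]; Hm[a][b] -= h
    fd = (vcreg_loss(Hp, 1.0, 1.0) - vcreg_loss(Hm, 1.0, 1.0)) / (2 * h)
    assert abs(fd - G[a][b]) < 1e-5  # analytic gradient matches finite differences
```

Such a check confirms the closed-form gradient is term-by-term consistent with the loss, which is what makes replacing backpropagation with a hand-written backward safe.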
We quantify the computational overhead by measuring the average time required for one NVIDIA A100 GPU to execute both the forward and backward passes on the entire network for a batch size of 128 using the ImageNet dataset. These results are summarized in Table 8. For the sake of comparison, we also include the latencies associated with adding Batch Normalization (BN) layers, revealing that our optimized VCReg implementation exhibits similar latencies to BN layers and is almost 5 times faster than the naive implementation.
Implementation Details
Transfer Learning Experiments with ImageNet Pretraining
In conducting the transfer learning experiments, we adhered primarily to the training recipe specified by PyTorch (Paszke et al., 2019) for each respective architecture during the supervised pretraining phase. We did not pretrain any of the baseline models ourselves, instead directly downloading the weights from PyTorch's own repository. The only modifications were to the parameters associated with the VCReg loss; we experimented with α ∈ {1.28, 0.64, 0.32, 0.16} and β ∈ {0.16, 0.08, 0.04, 0.02, 0.01}.
For iNaturalist 18 (Van Horn et al., 2018) and Place205 (Zhou et al., 2014), we relied on the experimental settings detailed in (Zbontar et al., 2021) for the linear probe evaluation.
Regarding Food-101 (Bossard et al., 2014), Stanford Cars (Krause et al., 2013), FGVC Aircraft (Maji et al., 2013), Oxford-IIIT Pets (Parkhi et al., 2012), Oxford 102 Flowers (Nilsback & Zisserman, 2008), and the Describable Textures Dataset (DTD) (Cimpoi et al., 2014), we complied with the evaluation protocol provided by (Chen et al., 2020; Kornblith et al., 2021). An L2-regularized multinomial logistic regression classifier was trained on features extracted from the frozen pretrained network. Optimization of the softmax cross-entropy objective was conducted using L-BFGS, without data augmentation. All images were resized to 224 pixels along the shorter side using bicubic resampling, followed by a 224×224 center crop. The L2-regularization parameter was selected from a range of 45 logarithmically spaced values between 0.00001 and 100000.
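The regularization grid can be generated as below; `log_spaced` is an illustrative helper (equivalent in spirit to `numpy.logspace(-5, 5, 45)`).

```python
import math

def log_spaced(lo, hi, num):
    # `num` values evenly spaced on a log10 scale between lo and hi, inclusive
    a, b = math.log10(lo), math.log10(hi)
    return [10.0 ** (a + (b - a) * i / (num - 1)) for i in range(num)]

# 45 candidate L2-regularization strengths between 1e-5 and 1e5
grid = log_spaced(1e-5, 1e5, 45)
```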
All experiments were run three times, with the average results presented in Table 1.
Table 8. Average Time Required for One Forward and Backward Pass with Various Layers Inserted. Comparison of computational latencies across different configurations of ViT and ConvNeXt networks. The table demonstrates the efficacy of the optimized VCReg layer in terms of computational time, compared to both naive VCReg and Batch Normalization (BN) layers.
Transfer Learning Experiments with Kinetics pre-trained Models
In conducting experiments with video-pretrained models, we utilize the publicly available code bases and model checkpoints provided for VideoMAE and VideoMAEv2 (https://github.com/MCG-NJU/VideoMAE and https://github.com/OpenGVLab/VideoMAEv2). For both VideoMAE and VideoMAEv2 we use the ViT-Small and ViT-Base checkpoints. VideoMAE models are pretrained on Kinetics-400, while VideoMAEv2 models are pretrained on Kinetics-710. We use the ViViT-B (ViT-Base backbone) checkpoint pretrained on Kinetics-400 from HuggingFace. For evaluation, we adopt the inference protocol of 10 clips × 3 crops. For the VCReg hyperparameters, we experimented with α ∈ {1, 3, 5} and β ∈ {0.1, 0.3, 0.5}. For the remaining finetuning hyperparameters, as well as the data preprocessing and evaluation protocol, we use the configuration for HMDB51 available in VideoMAE (Tong et al., 2022) and its corresponding code base (linked above).
Subclass Linear Probing Result with Network Pretrained on Superclass Label
For our subclass linear probing experiments, we employed a ConvNeXt-Atto network. Each model was pretrained for 200 epochs using the superclasses, adhering to the same procedure detailed in Appendix A. After this pretraining phase, we performed linear probing using the subclass labels. The linear classifier was trained for 100 epochs with a base learning rate of 0.016 and a cosine learning rate schedule. We used the AdamW optimizer to minimize the cross-entropy loss with a weight decay of 0.05, processing the training data in batches of 256.
Long-Tail Learning Result
For our long-tail learning experiments, we use ResNet-32 as the backbone for experiments on the CIFAR10-LT and CIFAR100-LT datasets. We trained for 100 epochs with batch size 256 using the Adam optimizer, with two learning rates of {0.016, 0.008} and a 10-epoch linear warmup followed by cosine annealing decay. The weight decay was set to 0.05 and norm layers were excluded from weight decay. We experimented with α ∈ {1.28, 0.64, 0.32, 0.16} and β ∈ {0.16, 0.08, 0.04, 0.02, 0.01}.
VCReg with Self-Supervised Learning Methods
We train a ResNet-50 model in four different setups, using either the SimCLR loss or the VICReg loss with the ImageNet dataset. The application of the VCReg is the same as described in Appendix A.
We closely follow the original setting in (Chen et al., 2020) for SimCLR pretraining and (Bardes et al., 2021) for VICReg pretraining.
Augmentation For both methods, we use the same augmentation pipeline. Each augmented view is generated from a random set of augmentations of the same input image. We apply a series of standard augmentations for each view, including random cropping, resizing to 224×224, random horizontal flipping, random color-jittering, random conversion to grayscale, and a random Gaussian blur. These augmentations are applied symmetrically on the two branches (Geiping et al., 2022).
Architecture For SimCLR, the encoder is a ResNet-50 network without the final classification layer, followed by a projector. The projector is a two-layer MLP with input dimension 2048, hidden dimension 2048, and output dimension 256, with ReLU between the two layers and batch normalization after every layer. This 256-dimensional embedding is fed to the InfoNCE loss.
Optimization We follow the training protocol in (Zbontar et al., 2021). For SimCLR experiments, we used a LARS optimizer and a base learning rate 0.3 with cosine learning rate decay schedule. We pretrain the model for 100 epochs with 5 epochs warm-up with batch size 4096.
Evaluation We follow the standard evaluation protocol as prescribed by (Misra & Maaten, 2020; Zbontar et al., 2021), performing linear probing evaluations, on iNaturalist 18 (Van Horn et al., 2018) and Place205 (Zhou et al., 2014) datasets.
Robustness to Noise
This section provides additional results on measuring VCReg's ability to enhance transfer learning performance in the presence of noise. In these experiments we start with VideoMAE-B and VideoMAEv2-B networks (from section 4.2) pre-trained on Kinetics-400 and Kinetics-710, respectively, then fine-tune them on HMDB51 corrupted with varying levels of Gaussian noise. Figure 4 shows that VCReg models outperform their non-regularized counterparts in this setting.
Two-Moon Dataset
In alignment with the original gradient starvation study (Pezeshki et al., 2021), we observe that most standard regularization techniques do not sufficiently capture the necessary features in the 'two-moon' dataset experiment. To evaluate our approach, we mirrored this setting and applied VCReg during training.
The synthetic 'two-moon' dataset comprises two classes of points, each forming a moon-like shape. The gradient starvation study highlighted an issue where, if the gap between the two moons is wide enough for a straight line to separate the two classes, the network stops learning additional features and focuses solely on a single feature. We replicated this situation using a three-layer network and applied all the methods initially tested in the original study. The resulting decision boundary after training on the 'two-moon' dataset is visualized in Figure 5.
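A pure-Python sketch of how such a dataset can be generated is shown below. The exact radii, offsets, and noise level are illustrative choices of ours, with `gap` controlling the vertical separation that makes the two classes linearly separable.

```python
import math
import random

def make_two_moons(n_per_class, gap, noise, rng):
    # Upper moon: half-circle of radius 1; lower moon: reflected and shifted.
    # `gap` controls the vertical separation between the two classes.
    pts, labels = [], []
    for i in range(n_per_class):
        t = math.pi * i / (n_per_class - 1)
        pts.append((math.cos(t) + rng.gauss(0.0, noise),
                    math.sin(t) + gap / 2 + rng.gauss(0.0, noise)))
        labels.append(0)
        pts.append((1.0 - math.cos(t) + rng.gauss(0.0, noise),
                    -math.sin(t) - gap / 2 + rng.gauss(0.0, noise)))
        labels.append(1)
    return pts, labels

points, labels = make_two_moons(100, gap=0.5, noise=0.05, rng=random.Random(0))
```

With a positive `gap`, a horizontal line already separates the classes, which is exactly the regime in which gradient starvation discourages the network from learning any feature beyond the vertical coordinate.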
From the visualization, it is apparent that VCReg not only outperforms other conventional regularization techniques in separation margin, but also shows superior performance compared to spectral decoupling, a method specifically designed for this task. VCReg is effective in maximizing variance while minimizing covariance in the feature space, something not achieved by other techniques such as L2 regularization, dropout (Hinton et al., 2012), and batch normalization (Ioffe & Szegedy, 2015). Consequently, these other techniques yield features that are less discriminative and informative.
Miscellaneous
Compute Resources
The majority of our experiments were run using AMD MI50 GPUs. The longest pretraining, for ConvNeXt-Tiny, takes about 48 hours on 2 nodes, where each node has 8 MI50 GPUs attached. We estimate the total compute used for all experiments at roughly 60 (days) × 24 (hours per day) × 8 (nodes) × 8 (GPUs per node) = 92,160 GPU hours.
We are aware of the potential environmental impact of the large amount of compute needed for this work, such as atmospheric CO2 emissions from the electricity used by the servers. However, we also believe that advancements in representation learning and transfer learning can help mitigate these effects by reducing the need for data and compute resources in the future.
Limitations
Due to a lack of compute resources, we were unable to conduct a large number of experiments with the goal of tuning hyperparameters and searching for the best configurations. Therefore, the majority of hyperparameters and network configurations used in this work are the same as provided by PyTorch (Paszke et al., 2019). The only hyperparameters that were tuned were α and β , the coefficients for VCR. All the other hyperparameters may not be optimal.
In addition, all models were pretrained on the ImageNet (Deng et al., 2009) and CIFAR (Krizhevsky et al., 2009) datasets, so their performance might differ if pretrained on other datasets with different data distributions or different types of images (e.g., X-rays). We encourage further exploration in this direction for current and future self-supervised learning frameworks.
Transfer learning has emerged as a cornerstone within the field of machine learning. It allows models to take advantage of knowledge gleaned from one domain to improve performance in another [30, 43, 49, 6]. This paradigm is particularly beneficial in situations where data is scarce or costly to collect, as it lets models leverage extensive pretraining using large, publicly accessible datasets before fine-tuning for a specific task [44].
However, a key challenge encountered in transfer learning is that during the pretraining phase, we usually do not have detailed information about the downstream tasks [5]. Without such information, an effective strategy is to ensure that the model captures a broad and diverse range of features that might benefit a variety of tasks. Consequently, the model's ability to extract a rich set of features becomes a critical factor in its overall transfer learning performance [9]. Yet studies suggest that, without sufficient regularization, models tend to learn only the features that most reduce the pretraining loss, undermining their ability to generalize to target tasks [46, 28, 47].
Several recent studies have focused on this issue, observing phenomena such as gradient starvation [34] and neural collapse [31]. The essence of this problem lies in the tendency of the model to overprioritize certain features, thereby neglecting others and leading to a less diverse representation of the data [23]. To address this problem, we propose Variance-Covariance Regularization (VCR), a technique designed to encourage the learning of a more diverse set of features by effectively decorrelating the learned representation.
Our approach is inspired by recent advances in joint embedding self-supervised learning techniques, particularly Variance-Invariance-Covariance Regularization (VICReg) [2], which encourages high variance and low covariance in the learned representations. This strategy aims to prevent the model from focusing solely on features that decrease the loss function by minimizing the information redundancy in the learned representation.
In this paper, following the presentation of our technique, we propose an efficient implementation strategy aimed at minimizing computational overhead. We then conduct a series of experimental evaluations to verify the effectiveness of our method across various tasks, datasets, and architectures. Our findings suggest that VCR markedly improves transfer learning performance, paving the way for further exploration in this field.
Our paper makes the following contributions:
We introduce a regularization technique for network training that promotes high variance and low covariance in the learned representations, enhancing the performance of transfer learning.
We show that integrating VCR into the network's intermediate layers enhances overall performance. Additionally, we propose a fast implementation of this regularization technique, ensuring that the overall training process incurs negligible additional time overhead.
We conducted extensive experiments on various benchmark datasets demonstrating notable improvements in transfer learning performance across various network architectures (ResNet [17], ConvNeXt [25] and ViT [13]) and various pretraining methods (supervised learning and self-supervised learning).
We investigate the dynamics of VCR, revealing that it effectively prevents neural collapse [31] and information compression [35], leading to the creation of diverse, information-rich features that result in superior performance over standard training.
Transfer learning has emerged as a crucial paradigm within machine learning due to its ability to use knowledge extracted from one domain (source) to enhance learning in a different, typically related domain (target) [30, 43, 49, 38, 6]. This approach proves particularly beneficial when the target task has limited data available, a common challenge in various machine learning applications. By leveraging a model pretrained on a large, more data-rich source task, transfer learning provides a means to bypass the need for extensive data collection and annotation in the target task, leading to more efficient and effective learning processes [44].
The effectiveness of transfer learning hinges on the idea that certain features or patterns are common across various tasks and domains. Consequently, a model trained on a source task can learn these shared features, which can be a useful starting point when the model is fine-tuned on a target task. This process often results in improved performance and faster convergence of the target task compared to training a model from scratch.
Transfer learning techniques generally fall into two primary categories: (i) feature-representation transfer and (ii) parameter-transfer methods. The former focuses on extracting and utilizing transferable feature representations from the source task, which can be used as input for the target task model. In contrast, parameter-transfer methods transfer learned parameters or model architectures from the source to the target task and further fine-tune them.
Our work aligns with the feature-representation transfer category, emphasizing the development of robust and diverse feature representations that can contribute significantly across various tasks. The main challenge associated with this approach is to ensure that the learned features are not overly specific to the source task and can generalize effectively to a variety of downstream tasks [5, 9]. The proposed VCR method addresses this challenge by promoting the learning of diverse feature representations, thus enhancing the model’s adaptability and performance in transfer learning scenarios.
VICReg [2] is a novel self-supervised learning method. Standing for Variance, Invariance, and Covariance Regularization, VICReg encourages learned features to possess large variance, invariance to data augmentation, and small covariance between different features. In the context of self-supervised learning, the regularization strategy offered by VICReg fosters the formation of robust and generalizable features that can be used for a multitude of downstream tasks, thus enhancing the versatility and performance of neural networks. Its simple, yet effective methodology has made it an important tool in advancing representation learning techniques.
VICReg is primarily implemented using a variance-covariance loss. The variance loss encourages high variance in the learned representations, thereby promoting the learning of a wide range of features. The covariance loss, on the other hand, aims to minimize redundancy in the learned features by reducing the overlap in information captured by different dimensions of the representation. This dual-objective optimization framework has been found to be effective in promoting diverse feature learning [2, 37]. In this work, we borrow the feature collapse prevention mechanism from VICReg and propose a variance-covariance regularization method for standard network training to improve transfer learning performance.
Gradient starvation and neural collapse are two recently recognized phenomena in the training of deep neural networks that can significantly impact the quality of learned representations and the generalization ability of the network [34, 31, 4].
Gradient starvation is a phenomenon in which certain parameters in a deep learning model receive very little gradient during training, leading to slower learning or no learning at all for these parameters [34]. This occurs when the model overemphasizes some features at the expense of others, causing the gradients for the underutilized features to become vanishingly small. This imbalance can lead the model to fail to capture the full richness of the data, thus limiting its performance in downstream tasks.
Neural collapse, on the other hand, refers to a phenomenon observed in the later stages of training deep neural networks, where the internal representations of the network tend to collapse towards each other, resulting in a loss of diversity among the learned features [31]. Similar to gradient starvation, this can limit the network’s capacity to fully comprehend the complexity of the data, leading to suboptimal performance [23].
These phenomena have been particularly noted in the context of transfer learning, where the model is trained on a source task before being fine-tuned on a target task. During the pretraining phase, the model tends to focus on learning features that are most relevant for the source task, which can lead to gradient starvation or neural collapse if these features do not fully capture the complexity of the source task’s data. This, in turn, can negatively impact the model’s performance during the fine-tuning phase, as the learned features may not generalize well to the target task.
In this section, we present our regularization approach, designed to encourage networks to capture a diverse array of features by enforcing large variance and small covariance in the intermediate representations, with the objective of bolstering transfer learning performance. We begin by presenting the underlying principles that guide our strategy. We then conduct experiments on a toy dataset to validate the efficacy of our approach and our intuition. Finally, we provide a detailed description of the regularization technique and elaborate on its seamless integration into real-world network architectures and datasets [17, 25, 13].
In transfer learning, information on relevant features for downstream tasks is often elusive [30, 43, 49]. Ideally, the model should capture a wide range of potential features. However, without suitable regularization, networks lean towards features that could minimize the training loss function [34], which can pose issues for downstream tasks.
Several recently proposed self-supervised joint embedding learning methods [16, 10, 8, 2, 45, 24] offer useful insights. They aim to capture diverse features without prior task knowledge. Two mechanisms prevail: first, ensuring invariance to specific input augmentations; second, diversifying features through contrasting or redundancy minimization.
The first mechanism cannot easily adapt to other learning paradigms without significant modifications, and imposing a specific type of invariance could potentially affect transfer learning. Therefore, our focus is on adapting the second mechanism to a wide range of pretraining scenarios. However, most self-supervised learning techniques intertwine these two mechanisms, thus necessitating a method that differentiates them. This requirement brings us to the VICReg technique [2], which distinctly separates these two mechanisms. Inspired by this approach, we incorporate the feature collapse prevention method from VICReg as a regularizing factor in standard network training, aiming to enhance the efficiency of transfer learning.
VICReg fosters the learning of diverse features by enforcing the learned features to exhibit considerable variance and minimal covariance [2]. Given a series of representations $(h_1, h_2, \ldots, h_n)$ where $h_i \in \mathbb{R}^d$, VICReg minimizes the variance loss:
$$\mathcal{L}_{\mathrm{var}} = \frac{1}{d} \sum_{j=1}^{d} \max\left(0,\; 1 - \sqrt{C_{jj} + \epsilon}\right)$$
Here, $C = \frac{1}{n-1} \sum_{i=1}^{n} (h_i - \bar{h})(h_i - \bar{h})^T$ is the covariance matrix, and $\bar{h} = \frac{1}{n} \sum_{i=1}^{n} h_i$ is the mean vector.
Minimizing the variance loss pushes the model to capture a broad spectrum of information. However, minimizing the variance loss alone might not encourage feature diversity, since high correlation between channels can still lead to redundancy [41]. To further encourage diversity and reduce redundancy, we also minimize the covariance loss, which penalizes the off-diagonal entries of $C$:
$$\mathcal{L}_{\mathrm{cov}} = \frac{1}{d(d-1)} \sum_{i \neq j} C_{ij}^2$$
This loss aims to minimize the overlap or redundancy in the information captured by different channels of the representation and effectively decorrelate the learned representation.
The final regularization strategy therefore optimizes both losses in a balanced manner:
$$\mathcal{L}_{\mathrm{VCR}} = \alpha\, \mathcal{L}_{\mathrm{var}} + \beta\, \mathcal{L}_{\mathrm{cov}}$$
Here, $\alpha$ and $\beta$ are hyperparameters that determine the weight assigned to each part of the loss in the overall regularization process.
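To make the variance and covariance terms concrete, the following pure-Python sketch (our own minimal formulation; normalization constants may differ from the actual implementation) computes both losses for a batch and shows that duplicated channels drive the covariance term up:

```python
import math
import random

EPS = 1e-4

def vc_losses(H):
    # Variance and covariance losses for a batch H of n d-dimensional vectors.
    n, d = len(H), len(H[0])
    mean = [sum(row[j] for row in H) / n for j in range(d)]
    Hc = [[row[j] - mean[j] for j in range(d)] for row in H]
    C = [[sum(Hc[k][i] * Hc[k][j] for k in range(n)) / (n - 1)
          for j in range(d)] for i in range(d)]
    l_var = sum(max(0.0, 1.0 - math.sqrt(C[j][j] + EPS)) for j in range(d)) / d
    l_cov = sum(C[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / (d * (d - 1))
    return l_var, l_cov

rng = random.Random(0)
# two independent channels vs. two perfectly correlated (redundant) channels
diverse = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(256)]
x = [rng.gauss(0.0, 1.0) for _ in range(256)]
redundant = [[v, v] for v in x]

_, cov_diverse = vc_losses(diverse)
_, cov_redundant = vc_losses(redundant)
```

The redundant batch incurs a much larger covariance penalty than the decorrelated one, which is exactly the pressure toward diverse features that the regularizer exerts.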
Before applying regularization techniques to practical scenarios, it is advantageous to first assess their efficacy on a simple toy dataset, allowing for a deeper understanding of each method's behavior. In alignment with the original gradient starvation study [34], we observe that most standard regularization techniques do not sufficiently capture the necessary features in the 'two-moon' dataset experiment. To evaluate our approach, we mirrored this setting and applied VCR during training.
The synthetic 'two-moon' dataset comprises two classes of points, each forming a moon-like shape. The gradient starvation study highlighted an issue where, if the gap between the two moons is wide enough for a straight line to separate the two classes, the network stops learning additional features and focuses solely on a single feature. We replicated this situation using a three-layer network and applied all the methods initially tested in the original study. The resulting decision boundary after training on the two-moon dataset is visualized in Figure 2.
The visualization makes apparent that VCR not only outperforms other conventional regularization techniques in separation margin, but also compares favorably with spectral decoupling, a method specifically designed for this task. VCR maximizes the variance while minimizing the covariance in the feature space, which techniques such as L2 regularization, dropout [18], and batch normalization [19] fail to do. Consequently, these other techniques yield features that are less discriminative and informative.
Having validated the effectiveness of our regularization method on the two-moon dataset, we now aim to incorporate this approach into real-world network architectures and datasets. We conduct carefully designed experiments to elucidate the best practices for implementing this regularization in practical scenarios. Comprehensive details of these experiments are included in the Appendix A, while here we distill our primary observations.
Our investigation yielded several insights regarding the deployment of VCR in practical network settings:
When dealing with any representation with spatial dimensions, each spatial location should be regarded as independent samples for regularization. This implies that our objective is to decorrelate the representations not just across different samples, but also within the same sample across varied spatial locations.
For deep networks, it is generally beneficial to apply regularization at multiple levels within the network, rather than limiting it to the final representation layer. This holds true even when only the final representation is used for the transfer learning task.
Consecutive regularization layers do not necessarily need to be closely positioned. An effective regularization seems to be applying a single regularization layer to each block in ConvNets or transformers.
Optimal performance is achieved when regularization is applied at the end of a block. Positioning the regularization layer, before or after the residual connection, does not significantly affect performance.
Training can become unstable when we use a high learning rate or a large batch size. By replacing the $C_{ij}^{2}$ term in the covariance loss with a smooth L1 penalty, $2\delta|C_{ij}|-\delta^{2}$ for $|C_{ij}|>\delta$, we reduce the gradient for outliers with larger covariance and make training more stable.
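The smooth L1 replacement from the last point can be written directly. A small sketch (the $\delta^{2}$ offset keeps the penalty continuous at $|x|=\delta$ while the gradient becomes constant beyond it):

```python
import numpy as np

def smooth_l1(x, delta=1.0):
    """Smooth L1 penalty used in place of C_ij^2 for large covariances:
    quadratic near zero, linear (bounded gradient) beyond delta."""
    ax = np.abs(x)
    return np.where(ax <= delta, x ** 2, 2 * delta * ax - delta ** 2)
```

Inside the quadratic region the penalty and gradient match the original $C_{ij}^2$ term, so only outlier covariances see the reduced gradient.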
Guided by these insights, we outline our final regularization algorithm as follows:
For each sample $x_{i}$, execute a forward pass and collect the final output of each convolutional or attention block, denoted as $(h_{i}^{(1)},\ldots,h_{i}^{(m)})$.
Once all hidden representations of the $j$-th block for the current batch are collected, we calculate the VCR loss for each block: $L^{(j)}=\ell_{\mathrm{VCR}}(h_{1}^{(j)},h_{2}^{(j)},\ldots,h_{n}^{(j)})$.
The final loss is the aggregation of the original training loss and all intermediate VCR losses: $\ell_{\mathrm{final}}(x_{1},\ldots,x_{n})=\ell(x_{1},\ldots,x_{n})+\sum_{j=1}^{m}L^{(j)}$.
Fast Implementation The VCR does not affect the forward pass in any way, allowing us to substantially speed up the implementation by modifying the backward function directly. Instead of computing the VCR loss and backpropagating it, we can directly alter the calculated gradient. This is possible since the VCR loss calculation only requires the current representation. The specifics of this speed-optimized implementation are outlined in Algorithm 1.
In this section, we elaborate on the experimental procedures and outcomes, demonstrating how our proposed regularization strategy enhances performance in transfer learning contexts. Detailed information on replicating the experiments is provided in Appendix B. All code (in PyTorch) is available at https://github.com/jiachenzhu/VCR.
We begin our discussion with results derived from a standard transfer learning benchmark setup. The process involves pretraining our model on ImageNet [12], followed by linear probing [1] on a range of downstream datasets to evaluate the generalizability of the model. After presenting the results of conventional transfer learning scenarios, we examine the quality of the learned features, with a focus on their utility in fine-grained classification tasks. Our methodology consists of a series of experiments that first train the network using superclass labels, followed by linear probing to distinguish subclass labels. This experimental design allows us to assess the breadth and flexibility of the features learned by our model. We further extend our experimental exploration to long-tail learning scenarios to address real-world data concerns; here, we assess VCR's effectiveness under imbalanced class distributions. Finally, we apply VCR to self-supervised pretraining, demonstrating that it improves learned features not only for supervised learning but also for self-supervised learning.
Experimental Setup The experimental procedures detailed in this section adhere to the evaluation protocols outlined in seminal work such as [10, 20, 27]. The first stage involves pretraining our network on the ImageNet dataset with the ResNet-50 [17], ConvNeXt-Tiny [25], and ViT-Base-32 [13] architectures. Our pretraining methodology adheres to the standard recipe provided by PyTorch [33] without any hyperparameter tuning for parameters used by the torchvision library, such as the learning rate and weight decay, ensuring a fair comparison between the models we trained and those provided by PyTorch. For each pretraining setup, we only adjusted the hyperparameters relevant to VCR, ensuring the unique aspects of our methodology remain the primary focus.
Subsequently, we apply a linear probing evaluation across a variety of datasets to compare the final performance. These datasets include iNaturalist 18 [42], Place205 [48], Food-101 [7], Stanford Cars [21], FGVC Aircraft [26], Oxford-IIIT Pets [32], Oxford 102 Flowers [29], and the Describable Textures Dataset (DTD) [11].
Results The results presented in Table 1 show a significant improvement in transfer learning performance on all downstream datasets when VCR is applied to the three architectures: ResNet-50, ConvNeXt-Tiny, and ViT-Base-32. This is strong evidence that VCR boosts overall transfer learning performance for supervised pretraining, and that it works for both ConvNet and Transformer architectures.
One thing to note is that we were unable to exactly reproduce the ResNet-50 results of [10, 20]. (Our evaluation code is available at https://github.com/jiachenzhu/VCR.) However, by focusing on the relative performance improvement offered by VCR, we ensure a fair comparison between models trained with and without VCR. Further studies could examine these discrepancies to better understand the influence of different training regimes and regularization methods.
To further evaluate the quality of the representations learned through VCR, we conducted an additional set of experiments using datasets containing both superclass and subclass labels.
Experimental Setup We utilized the CIFAR100 dataset [22], which contains 20 superclasses and 100 subclasses, and five different subsets of ImageNet introduced by [14] that are also labeled by superclass and subclass categories. We initially pretrained the network using superclass labels, then applied linear probing to the final representation using subclass labels. This allowed us to examine the quality and specificity of the features learned from the superclass labels.
Results The results presented in Table 2 clearly indicate that the integration of VCR improves the performance of subclass classification. Across all datasets and categories, ConvNeXt with VCR outperforms the baseline model, with CIFAR100 demonstrating the most significant improvement from 60.7% to 72.9%.
These results suggest that VCR enables the network to generate more discriminative and high-quality features at the superclass level, which can generalize well to the subclass level. This provides evidence that VCR not only enhances performance across various datasets and architectures, but also improves the granularity and quality of learned representations. Such an advantage is particularly beneficial in complex real-world scenarios where hierarchical class structures are common.
Real-world datasets commonly exhibit a degree of imbalance, which can significantly affect the neural network learning process. To ascertain the impact of our regularization method on such scenarios, we carried out an additional series of experiments focused on long-tail learning.
Experimental Design The experiments were performed using a ResNet-32 backbone on the CIFAR10-LT and CIFAR100-LT datasets [22]. These datasets possess an imbalance ratio of 100, leading to per-class sample counts ranging from 5,000 to 50 for CIFAR10-LT, and 500 to 5 for CIFAR100-LT respectively.
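The text states only the endpoint counts; the exponential per-class profile below is the standard CIFAR-LT construction and is our assumption of how the counts interpolate between those endpoints:

```python
import numpy as np

def longtail_counts(n_max, n_classes, imbalance_ratio=100):
    """Exponential class-size profile commonly used for CIFAR-LT:
    class i receives n_max * ratio^(-i / (C - 1)) samples, so the
    counts decay from n_max down to n_max / imbalance_ratio."""
    i = np.arange(n_classes)
    counts = n_max * imbalance_ratio ** (-i / (n_classes - 1))
    return np.round(counts).astype(int)
```

With this construction, CIFAR10-LT runs from 5,000 to 50 samples per class and CIFAR100-LT from 500 to 5, matching the imbalance ratio of 100 quoted above.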
Results and Analysis Table 3 details the results obtained from our experiments. It is apparent that incorporating VCR into the learning process offers notable improvements in model performance on long-tail datasets. This enhancement underscores VCR’s effectiveness in preserving and leveraging the diversity of features, even when faced with significant class imbalance. This makes it a valuable tool for tasks involving real-world data, where imbalance is a common challenge.
Our final set of experiments investigates the application of VCR within self-supervised learning paradigms to assess its versatility and effectiveness across diverse learning methodologies.
Experimental Setup We trained a ResNet-50 model for 100 epochs under four different setups, using the SimCLR or VICReg loss on the ImageNet dataset. Subsequently, we followed the standard evaluation protocol prescribed by [27, 45], conducting linear probing evaluations on downstream task datasets. Additionally, we performed fine-tuning evaluations using only 1% of ImageNet labels.
Results As demonstrated in Table 4, the application of VCR consistently improves performance in different self-supervised learning environments. For example, when integrated with VICReg, VCR led to an increase in linear probing accuracy across all evaluated datasets except iNaturalist18. A similar trend is also observed when VCR is used with SimCLR.
These results confirm the hypothesis that VCR can augment self-supervised learning methods, potentially enabling more efficient and accurate learning from unlabeled or sparsely labeled data. The improvements observed across a variety of datasets and under different loss functions indicate the flexibility and broad applicability of VCR, reinforcing its potential as a valuable tool in diverse machine learning scenarios.
To deepen our understanding of VCR and its training dynamics, we closely examine its learned representations. A recent study [31] observed a peculiar trend in deep networks trained for classification tasks: The top-layer feature embeddings of training samples from the same class tend to cluster around their respective class means, which are as distant from each other as possible. However, this phenomenon could potentially result in a loss of diversity among the learned features [31], thus curtailing the network’s capacity to grasp the complexity of the data and leading to suboptimal performance [23] for transfer learning.
Class-Distance Normalized Variance (CDNV): For a feature map $f:\mathbb{R}^{d}\to\mathbb{R}^{p}$ and two unlabeled sets of samples $S_{1},S_{2}\subset\mathbb{R}^{d}$, the CDNV is defined as
where $\mu_{f}(S)$ and $\mathrm{Var}_{f}(S)$ denote the mean and variance of the set $\{f(x)\mid x\in S\}$. This metric measures the degree of clustering of the features extracted from $S_{1}$ and $S_{2}$, relative to the distance between their respective class means. A value approaching zero indicates perfect clustering.
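A direct NumPy translation of the metric (a sketch; following the definition above, we take $\mathrm{Var}_{f}(S)$ to be the mean squared Euclidean distance of the features to their mean):

```python
import numpy as np

def cdnv(f1, f2):
    """Class-Distance Normalized Variance between two sets of embedded
    features f1, f2 of shape (n_i, p). Values near zero indicate
    tightly clustered (collapsed) classes."""
    mu1, mu2 = f1.mean(axis=0), f2.mean(axis=0)
    var1 = ((f1 - mu1) ** 2).sum(axis=1).mean()
    var2 = ((f2 - mu2) ** 2).sum(axis=1).mean()
    return (var1 + var2) / (2.0 * np.sum((mu1 - mu2) ** 2))
```

Two perfectly collapsed classes give a CDNV of exactly zero, while within-class spread relative to the class-mean distance drives the value up.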
Nearest Class-Center Classifier (NCC): This classifier is defined as
According to this measure, during training, collapsed feature embeddings in the penultimate layer become separable, and the classifier converges to the ’nearest class-center classifier’.
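The decision rule itself can be sketched in a couple of lines, with `class_means` standing in for the per-class feature means $\mu_{f}(S_{c})$ computed on the training set:

```python
import numpy as np

def ncc_predict(x, class_means):
    """Nearest class-center classifier: assign x (shape (p,)) to the
    class whose training-feature mean (rows of class_means, shape
    (C, p)) is closest in Euclidean distance."""
    dists = np.linalg.norm(class_means - x, axis=1)
    return int(np.argmin(dists))
```

When embeddings collapse onto their class means, the trained network's predictions agree with this simple rule, which is why NCC agreement is used as a collapse measure.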
Preventing Information Compression. We next address the prevention of information compression during the learning process. Although effective compression often yields superior representations, overly aggressive compression might cause the loss of crucial information about the target task [40, 36, 39].
To investigate this, we use the mutual information neural estimation (MINE) [3], a method specifically designed to estimate the mutual information between the input and its corresponding embedded representation. This metric effectively gauges the complexity level of the representation, essentially indicating how much information (in terms of number of bits) it encodes.
We evaluate the learned representations of two ConvNeXt models [25] trained on ImageNet with supervised learning, one with VCR and one without. As demonstrated in Table 5, both the collapse metrics (CDNV and NCC) and the mutual information estimate reveal that VCR representations have significantly more diverse features (less neural collapse) and contain more information than those obtained with regular training. This suggests that VCR not only achieves superior results, but also that its underlying representation contains more information.
In summary, the VCR method not only improves the performance of models in transfer learning scenarios, but also ensures a more diverse and information-rich representation of learning. It mitigates the neural collapse phenomenon and prevents excessive information compression, two crucial factors that often limit the effectiveness of deep learning models in transfer learning tasks. Our findings highlight the potential of VCR as a valuable addition to the deep learning toolbox, significantly increasing the generalizability of learned representations.
This study presented a variance-covariance regularization (VCR) method to enhance transfer learning performance. We found that our proposed method effectively encouraged the learning of more diverse and informative features by enforcing large variance and small covariance in the intermediate representations, which are crucial aspects for successful transfer learning scenarios.
Through rigorous experimentation, we demonstrated the ability of the VCR method to outperform traditional regularization techniques in complex classification tasks and to enhance the learning of representations more effectively, leading to improved performance in downstream tasks.
Nonetheless, several questions remain open for future research. For instance, how might the variance-covariance regularization method interact with other regularization techniques? Could there be synergies when combining VCR with other methods that might boost the model’s performance even further?
In conclusion, this work introduces a novel and efficient regularization method that shows promising results in improving the quality of learned representations and, consequently, the performance on transfer learning tasks. We believe that our findings provide a valuable contribution to the ongoing efforts to enhance the efficiency and effectiveness of transfer learning, serving as a stepping stone for future research in this area.
To determine the optimal manner of integrating VCR into a standard network, we conducted several experiments utilizing the ConvNeXt-Atto architecture, trained on ImageNet following the torchvision [33] training recipe. To reduce training time, we limited network training to 90 epochs with a batch size of 4096. The complete configuration comprised 90 epochs, a batch size of 4096, and two learning rates of $\{0.016, 0.008\}$ with a 5-epoch linear warmup followed by cosine annealing decay. The weight decay was set at 0.05, and the norm layers were excluded from weight decay. We experimented with $\alpha\in\{1.28, 0.64, 0.32, 0.16\}$ and $\beta\in\{0.16, 0.08, 0.04, 0.02, 0.01\}$.
We experimented with incorporating the VCR layers in four different locations:
Applying the VCR exclusively to the second-to-last representation (the input of the classification layer).
The VCR layer was implemented as detailed in Algorithm 1, with the addition of a mean removal layer along the batch dimension preceding the VCR layer to ensure that the VCR input has zero mean.
The results in Table 6 indicate superior performance when the VCR layer is applied to the output of each block (second setup) or to the output of blocks and downsample layers (fourth setup) compared to the other setups. Considering that architectures like ViT lack downsample layers, for consistency across architectures we adopted the second setup for further experiments.
In conducting the transfer learning experiments, we adhered primarily to the training recipe specified by PyTorch [33] for each respective architecture during the supervised pretraining phase. We abstained from pretraining any of the baseline models, instead opting to directly download the weights from PyTorch's own repository. The only modifications applied were to the parameters associated with the VCR loss; we experimented with $\alpha\in\{1.28, 0.64, 0.32, 0.16\}$ and $\beta\in\{0.16, 0.08, 0.04, 0.02, 0.01\}$.
For iNaturalist 18 [42] and Place205 [48], we relied on the experimental settings detailed in [45] for the linear probe evaluation.
Regarding Food-101 [7], Stanford Cars [21], FGVC Aircraft [26], Oxford-IIIT Pets [32], Oxford 102 Flowers [29], and the Describable Textures Dataset (DTD) [11], we complied with the evaluation protocol provided by [10, 20]. An $L2$-regularized multinomial logistic regression classifier was trained on features extracted from the frozen pretrained network. Optimization of the softmax cross-entropy objective was conducted using L-BFGS, without data augmentation. All images were resized to 224 pixels along the shorter side through bicubic resampling, followed by a 224 x 224 center crop. The $L2$-regularization parameter was selected from a range of 45 logarithmically spaced values between 0.00001 and 100000.
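As a small sketch of the stated search space, the 45 log-spaced candidates can be generated directly; the `best_l2` helper is hypothetical, standing in for selection by validation accuracy:

```python
import numpy as np

# 45 logarithmically spaced L2-regularization candidates between 1e-5 and 1e5
l2_grid = np.logspace(-5, 5, num=45)

def best_l2(evaluate):
    """Pick the grid value that maximizes a validation-accuracy
    callback evaluate(l2). Hypothetical helper, not from the paper."""
    return max(l2_grid, key=evaluate)
```

Each candidate would be passed as the regularization strength of the logistic regression probe, and the value with the best validation accuracy kept.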
All experiments were run three times, with the average results presented in Table 1.
For our subclass linear probing experiments, we employed a ConvNeXt-Atto network. Each model was pretrained for 200 epochs using the superclasses, adhering to the same procedure detailed in the Appendix A. Subsequent to this pretraining phase, we initiated a linear probing process using the subclass labels. This linear classifier was trained for 100 epochs, using a base learning rate of 0.0160.0160.016 in conjunction with a cosine learning rate schedule. The optimizer used was AdamW, which worked to minimize cross-entropy loss with a weight decay set at 0.050.050.05. We processed our training data in batches of 256.
For our long-tail learning experiments, we use ResNet-32 as the backbone for experiments on the CIFAR10-LT and CIFAR100-LT datasets. We trained for 100 epochs with batch size 256, using the Adam optimizer with two learning rates of $\{0.016, 0.008\}$ and a 10-epoch linear warm-up followed by cosine annealing decay. The weight decay was set at 0.05, and the norm layers were excluded from weight decay. We experimented with $\alpha\in\{1.28, 0.64, 0.32, 0.16\}$ and $\beta\in\{0.16, 0.08, 0.04, 0.02, 0.01\}$.
We closely follow the original setting in [10] for SimCLR pretraining and [2] for VICReg pretraining.
Augmentation - For both methods, we use the same augmentation methods. Each augmented view is generated from a random set of augmentations of the same input image. We apply a series of standard augmentations for each view, including random cropping, resizing to 224x224, random horizontal flipping, random color-jittering, randomly converting to grayscale, and a random Gaussian blur. These augmentations are applied symmetrically on the two branches [15].
Architecture - For SimCLR, the encoder is a ResNet-50 network without the final classification layer followed by a projector. The projector is a two-layer MLP with input dimension 2048, hidden dimension 2048, and output dimension 256. The projector has ReLU between the two layers and batch normalization after every layer. This 256-dimensional embedding is fed to the infoNCE loss.
Optimization - We follow the training protocol in [45]. For SimCLR experiments, we used a LARS optimizer and a base learning rate 0.3 with cosine learning rate decay schedule. We pretrain the model for 100 epochs with 5 epochs warm-up with batch size 4096.
The majority of our experiments were run using AMD MI50 GPUs. The longest pretraining, for ConvNeXt-Tiny, takes about 48 hours on 2 nodes, each with 8 MI50 GPUs attached. We estimate the total compute used for all experiments at roughly $60\text{ (days)}\times 24\text{ (hours per day)}\times 8\text{ (nodes)}\times 8\text{ (GPUs per node)}=92{,}160$ GPU hours.
We are aware of the potential environmental impact of the substantial compute resources consumed by this work, such as atmospheric $\text{CO}_2$ emissions due to the electricity used by the servers. However, we also believe that advancements in representation learning and transfer learning can help mitigate these effects by reducing the need for data and compute resources in the future.
Table: S4.T1: ImageNet Transfer Learning Experiments with Different Architectures
| Architecture | iNat18 | Places | Food | Cars | Aircraft | Pets | Flowers | DTD |
|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 42.8% | 50.6% | 69.0% | 43.6% | 54.8% | 91.9% | 78.5% | 68.7% |
| ResNet-50 (VCR) | 45.3% | 51.2% | 71.7% | 54.1% | 70.5% | 92.1% | 88.0% | 70.8% |
| ConvNeXt-T | 51.6% | 53.8% | 78.4% | 62.9% | 74.7% | 93.9% | 91.3% | 72.9% |
| ConvNeXt-T (VCR) | 52.3% | 54.7% | 79.6% | 64.2% | 76.3% | 94.1% | 92.7% | 73.3% |
| ViT-Base-32 | 39.1% | 47.9% | 70.6% | 51.2% | 63.8% | 90.3% | 84.6% | 66.1% |
| ViT-Base-32 (VCR) | 40.6% | 48.1% | 70.9% | 52.0% | 65.8% | 91.0% | 86.6% | 66.5% |
Table: S4.T2: Subclass Linear Probing Results with Networks Pretrained on Superclass Labels (living_9 through big_12 are subsets of ImageNet [14])
| | CIFAR100 | living_9 | mixed_10 | mixed_13 | geirhos_16 | big_12 |
|---|---|---|---|---|---|---|
| Number of Superclasses | 20 | 9 | 10 | 13 | 16 | 12 |
| Number of Subclasses | 100 | 72 | 60 | 78 | 32 | 240 |
| ConvNeXt | 60.7% | 53.4% | 60.3% | 61.1% | 60.5% | 51.8% |
| ConvNeXt (VCR) | 72.9% | 62.2% | 67.7% | 66.0% | 70.1% | 61.5% |
Table: S4.T3: Long-Tail Data Experiments
| Training Methods | CIFAR10-LT | CIFAR100-LT |
|---|---|---|
| ResNet-32 | 69.6% | 37.4% |
| ResNet-32 (VCR) | 71.2% | 40.4% |
Table: S4.T4: VCR with Self-Supervised Learning Methods
| Pretraining Methods | IN 1% | iNat18 | Places | Food | Cars | Aircraft | Pets | Flowers | DTD |
|---|---|---|---|---|---|---|---|---|---|
| SimCLR | 40.3% | 37.2% | 52.1% | 66.4% | 35.7% | 62.3% | 76.3% | 82.6% | 68.1% |
| SimCLR (VCR) | 41.0% | 41.3% | 52.3% | 67.7% | 40.6% | 61.9% | 76.6% | 83.6% | 69.0% |
| VICReg | 40.7% | 41.7% | 48.2% | 61.0% | 27.3% | 51.2% | 79.1% | 74.3% | 65.4% |
| VICReg (VCR) | 41.3% | 41.4% | 49.6% | 61.6% | 29.3% | 54.2% | 79.7% | 74.5% | 66.5% |
Table: S5.T5: VCR learns richer representation and prevents neural collapse and information compression
| Network | CDNV | NCC | MI |
|---|---|---|---|
| ConvNeXt | 0.28 | 0.99 | 2.8 |
| ConvNeXt (VCR) | 0.56 | 0.81 | 4.6 |
The VCR regularizes the network by encouraging the intermediate representations to have high variance and low covariance. The VCR is applied to the output of each network block so that all the intermediate representations capture diverse features.
The effect of conventional regularization methods and the VCR on a simple task of two-moon classification. Shown decision boundaries are the average over 10 runs in which data points and the model initialization parameters are sampled randomly. Here, only the data points of one particular seed are plotted for visual clarity. It can be seen that conventional regularizations of deep learning seem not to help with learning a curved decision boundary.
$$ \displaystyle\ell_{\mathrm{var}}(h_{1},h_{2},...,h_{n})=\frac{1}{d}\sum_{i=1}^{d}\max(0,1-\sqrt{C_{ii}}) $$
$$ \displaystyle V_{f}(S_{1},S_{2})=\frac{\mathrm{Var}_{f}(S_{1})+\mathrm{Var}_{f}(S_{2})}{2\|\mu_{f}(S_{1})-\mu_{f}(S_{2})\|^{2}}, $$
$$ \displaystyle\hat{h}(x)=\operatorname*{arg\,min}_{c\in[C]}\|f(x)-\mu_{f}(S_{c})\| $$
$$ \displaystyle\mathrm{SmoothL1}(x)=\begin{cases}x^{2},&\text{if }|x|\leq\delta\\ 2\delta|x|-\delta^{2},&\text{otherwise}\end{cases} $$
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

Figure 4. Impact of VCReg amidst noisy data : This figure shows the top-1 accuracy of VideoMAE-B and VideoMAEv2-B when fine-tuned for action recognition using HMDB51 with synthetic noise. We corrupt the data with Gaussian noise with standard deviation σ ∈ { 1 , 1 . 5 , 2 } . Models with VCReg outperform their non-regularized counterparts in this setting.

| Architecture | iNat18 | Places | Food | Cars | Aircraft | Pets | Flowers | DTD | Average |
|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 42.8% | 50.6% | 69.1% | 43.6% | 54.8% | 91.9% | 77.1% | 68.7% | 62.33% |
| ResNet-50 (DeCov) | 43.1% | 50.4% | 69.0% | 45.7% | 55.5% | 90.6% | 79.2% | 69.1% | 62.83% |
| ResNet-50 (WLD-Reg) | 43.9% | 51.2% | 70.2% | 43.9% | 58.7% | 91.4% | 80.7% | 69.0% | 63.63% |
| ResNet-50 (VCReg) | 45.3% | 51.2% | 71.7% | 54.1% | 70.5% | 92.1% | 88.0% | 70.8% | 67.96% |
| ConvNeXt-T | 51.6% | 53.8% | 78.4% | 62.9% | 74.7% | 93.9% | 91.3% | 72.9% | 72.44% |
| ConvNeXt-T (VCReg) | 52.3% | 54.7% | 79.6% | 64.2% | 76.3% | 94.1% | 92.7% | 73.3% | 73.40% |
| ViT-Base-32 | 39.1% | 47.9% | 70.6% | 51.2% | 63.8% | 90.3% | 84.6% | 66.1% | 64.20% |
| ViT-Base-32 (VCReg) | 40.6% | 48.1% | 70.9% | 52.0% | 65.8% | 91.0% | 86.6% | 66.5% | 65.19% |
| Method | Backbone | HMDB51 |
|---|---|---|
| VideoMAE-S | ViT-S | 79.9% |
| VideoMAE-S (VCReg) | ViT-S | 80.6% |
| VideoMAE-B | ViT-B | 82.2% |
| VideoMAE-B (VCReg) | ViT-B | 83.0% |
| VideoMAEv2-S | ViT-S | 83.6% |
| VideoMAEv2-S (VCReg) | ViT-S | 83.9% |
| VideoMAEv2-B | ViT-B | 86.5% |
| VideoMAEv2-B (VCReg) | ViT-B | 86.9% |
| ViViT-B | ViT-B | 70.9% |
| ViViT-B (VCReg) | ViT-B | 71.6% |
| Pretraining Methods | ImageNet | iNat18 | Places | Food | Cars | Aircraft | Pets | Flowers | DTD | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| SimCLR | 67.2% | 37.2% | 52.1% | 66.4% | 35.7% | 62.3% | 76.3% | 82.6% | 68.1% | 60.09% |
| SimCLR (VCReg) | 67.1% | 41.3% | 52.3% | 67.7% | 40.6% | 61.9% | 76.6% | 83.6% | 69.0% | 61.63% |
| VICReg | 65.2% | 41.7% | 48.2% | 61.0% | 27.3% | 51.2% | 79.1% | 74.3% | 65.4% | 56.03% |
| VICReg (VCReg) | 66.3% | 41.4% | 49.6% | 61.6% | 29.3% | 54.2% | 79.7% | 74.5% | 66.5% | 57.10% |
| Architecture | Food | Cars | Aircraft | Pets | Flowers | DTD |
|---|---|---|---|---|---|---|
| ConvNeXt-Atto (VCReg1) | 63.2% | 39.6% | 55.9% | 89.1% | 85.3% | 65.1% |
| ConvNeXt-Atto (VCReg2) | 66.8 % | 48.1% | 60.4 % | 91.1 % | 86.4 % | 66.4 % |
| ConvNeXt-Atto (VCReg3) | 64.0% | 40.9% | 56.5% | 89.4% | 85.9% | 65.1% |
| ConvNeXt-Atto (VCReg4) | 66.7% | 48.3 % | 59.6% | 90.6% | 85.6% | 66.1% |
| Network | Number of Inserted Layers | Identity | VCReg (Naive) | VCReg (Fast) | BN |
|---|---|---|---|---|---|
| ViT-Base-32 | 12 | 0.223s | 1.427s | 0.245s | 0.247s |
| ConvNeXt-T | 18 | 0.442s | 2.951s | 0.471s | 0.468s |
$$ \ell_{\mathrm{VICReg}}(z'_1, \ldots, z'_n, z''_1, \ldots, z''_n) = \alpha\,\ell_{\mathrm{var}}(z'_1, \ldots, z'_n) + \alpha\,\ell_{\mathrm{var}}(z''_1, \ldots, z''_n) + \beta\,\ell_{\mathrm{cov}}(z'_1, \ldots, z'_n) + \beta\,\ell_{\mathrm{cov}}(z''_1, \ldots, z''_n) + \sum_{i=1}^{n} \ell_{\mathrm{inv}}(z'_i, z''_i) $$
$$ \ell_{\mathrm{var}} = \frac{1}{D} \sum_{i=1}^{D} \max(0, 1 - \sqrt{C_{ii}}), \qquad \ell_{\mathrm{cov}} = \frac{1}{D(D-1)} \sum_{i \neq j} C_{ij}^2 $$
$$ \mathrm{SmoothL1}(x) = \begin{cases} x^2, & \text{if } |x| \leq \delta \\ 2\delta|x| - \delta^2, & \text{otherwise} \end{cases} $$
$$ V_f(S_1,S_2) = \frac{\sigma^2_f(S_1) + \sigma^2_f(S_2)}{2\|\mu_f(S_1)-\mu_f(S_2)\|^2}, $$
$$ \hat{h}(x) = \operatorname*{arg\,min}_{c\in [C]} \|f(x) - \mu_{f}(S_c)\| $$
Algorithm 1: PyTorch-style pseudocode for the fast VCReg implementation.

```python
# alpha, beta, and eps: hyperparameters (assumed in scope)
# torch.mm: matrix-matrix multiplication
class VarianceCovarianceRegularizationFunction(Function):
    # Forward pass: the identity function.
    # We assume the input has zero mean per channel; in practice we apply
    # a batch demean operation before calling this function.
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input

    # Backward pass: inject the VCReg gradient into the incoming gradient.
    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # Reshape the input to have (n, d) shape
        flattened_input = input.flatten(start_dim=0, end_dim=-2)
        n, d = flattened_input.shape
        # Calculate the covariance matrix
        covariance_matrix = torch.mm(flattened_input.t(), flattened_input) / (n - 1)
        # Variance gradient: only dimensions whose std is below 1 contribute
        diagonal = F.threshold(torch.rsqrt(covariance_matrix.diagonal() + eps), 1.0, 0.0)
        std_grad_input = diagonal * flattened_input
        # Covariance gradient: off-diagonal entries only
        cov_grad_input = torch.mm(flattened_input, covariance_matrix.fill_diagonal_(0))
        grad_input = grad_output \
            - alpha / (d * (n - 1)) * std_grad_input.view_as(grad_output) \
            + 4 * beta / (d * (d - 1)) * cov_grad_input.view_as(grad_output)
        return grad_input
```
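In practice the function is wrapped in a module that first demeans the batch, as noted in the pseudocode's comments. A self-contained sketch of such a wrapper is below; the module name `VCRegLayer`, the hyperparameter defaults, and passing `alpha`/`beta`/`eps` explicitly through `apply` are our own additions (the published pseudocode reads them as globals):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Function

class _VCRegFunction(Function):
    """Identity in the forward pass; injects the VCReg gradient in backward."""

    @staticmethod
    def forward(ctx, input, alpha, beta, eps):
        ctx.save_for_backward(input)
        ctx.alpha, ctx.beta, ctx.eps = alpha, beta, eps
        return input

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        x = input.flatten(start_dim=0, end_dim=-2)  # (n, d)
        n, d = x.shape
        cov = x.t().mm(x) / (n - 1)
        # Variance term: only dimensions with std below 1 contribute
        diag = F.threshold(torch.rsqrt(cov.diagonal() + ctx.eps), 1.0, 0.0)
        std_grad = (diag * x).view_as(grad_output)
        # Covariance term: off-diagonal entries only
        cov_grad = x.mm(cov.fill_diagonal_(0)).view_as(grad_output)
        grad_input = (grad_output
                      - ctx.alpha / (d * (n - 1)) * std_grad
                      + 4 * ctx.beta / (d * (d - 1)) * cov_grad)
        return grad_input, None, None, None

class VCRegLayer(nn.Module):
    """Demeans the batch per channel, then applies the regularizing identity."""

    def __init__(self, alpha: float = 0.1, beta: float = 0.01, eps: float = 1e-5):
        super().__init__()
        self.alpha, self.beta, self.eps = alpha, beta, eps

    def forward(self, x):
        x = x - x.mean(dim=0, keepdim=True)  # per-channel batch demean
        return _VCRegFunction.apply(x, self.alpha, self.beta, self.eps)
```

Because the forward pass is an identity (after demeaning), such a layer can be inserted after intermediate blocks with negligible forward-time cost, consistent with the timing comparison reported below.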
| Architecture | iNat18 | Places | Food | Cars | Aircraft | Pets | Flowers | DTD | Average |
|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 42.8% | 50.6% | 69.1% | 43.6% | 54.8% | 91.9% | 77.1% | 68.7% | 62.33% |
| ResNet-50 (DeCov) | 43.1% | 50.4% | 69.0% | 45.7% | 55.5% | 90.6% | 79.2% | 69.1% | 62.83% |
| ResNet-50 (WLD-Reg) | 43.9% | 51.2% | 70.2% | 43.9% | 58.7% | 91.4% | 80.7% | 69.0% | 63.63% |
| ResNet-50 (VCReg) | 45.3% | 51.2% | 71.7% | 54.1% | 70.5% | 92.1% | 88.0% | 70.8% | 67.96% |
| ConvNeXt-T | 51.6% | 53.8% | 78.4% | 62.9% | 74.7% | 93.9% | 91.3% | 72.9% | 72.44% |
| ConvNeXt-T (VCReg) | 52.3% | 54.7% | 79.6% | 64.2% | 76.3% | 94.1% | 92.7% | 73.3% | 73.40% |
| ViT-Base-32 | 39.1% | 47.9% | 70.6% | 51.2% | 63.8% | 90.3% | 84.6% | 66.1% | 64.20% |
| ViT-Base-32 (VCReg) | 40.6% | 48.1% | 70.9% | 52.0% | 65.8% | 91.0% | 86.6% | 66.5% | 65.19% |
| Method | Backbone | HMDB51 |
|---|---|---|
| VideoMAE-S | ViT-S | 79.9% |
| VideoMAE-S (VCReg) | ViT-S | 80.6% |
| VideoMAE-B | ViT-B | 82.2% |
| VideoMAE-B (VCReg) | ViT-B | 83.0% |
| VideoMAEv2-S | ViT-S | 83.6% |
| VideoMAEv2-S (VCReg) | ViT-S | 83.9% |
| VideoMAEv2-B | ViT-B | 86.5% |
| VideoMAEv2-B (VCReg) | ViT-B | 86.9% |
| ViViT-B | ViT-B | 70.9% |
| ViViT-B (VCReg) | ViT-B | 71.6% |
| Training Methods | CIFAR10-LT | CIFAR100-LT |
|---|---|---|
| ResNet-32 | 69.6% | 37.4% |
| ResNet-32 (VCReg) | 71.2% | 40.4% |
| Pretraining Methods | ImageNet | iNat18 | Places | Food | Cars | Aircraft | Pets | Flowers | DTD | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| SimCLR | 67.2% | 37.2% | 52.1% | 66.4% | 35.7% | 62.3% | 76.3% | 82.6% | 68.1% | 60.09% |
| SimCLR (VCReg) | 67.1% | 41.3% | 52.3% | 67.7% | 40.6% | 61.9% | 76.6% | 83.6% | 69.0% | 61.63% |
| VICReg | 65.2% | 41.7% | 48.2% | 61.0% | 27.3% | 51.2% | 79.1% | 74.3% | 65.4% | 56.03% |
| VICReg (VCReg) | 66.3% | 41.4% | 49.6% | 61.6% | 29.3% | 54.2% | 79.7% | 74.5% | 66.5% | 57.10% |
| Dataset | CIFAR100 | living 9 | mixed 10 | mixed 13 | geirhos 16 | big 12 |
|---|---|---|---|---|---|---|
| Superclass Count | 20 | 9 | 10 | 13 | 16 | 12 |
| Subclass Count | 100 | 72 | 60 | 78 | 32 | 240 |
| ConvNeXt | 60.7% | 53.4% | 60.3% | 61.1% | 60.5% | 51.8% |
| ConvNeXt (VCReg) | 72.9% | 62.2% | 67.7% | 66.0% | 70.1% | 61.5% |

(living 9, mixed 10, mixed 13, geirhos 16, and big 12 are superclass subsets of ImageNet.)
| Network | CDNV | NCC | MI |
|---|---|---|---|
| ConvNeXt | 0.28 | 0.99 | 2.8 |
| ConvNeXt (VCReg) | 0.56 | 0.81 | 4.6 |
| Architecture | Food | Cars | Aircraft | Pets | Flowers | DTD |
|---|---|---|---|---|---|---|
| ConvNeXt-Atto (VCReg1) | 63.2% | 39.6% | 55.9% | 89.1% | 85.3% | 65.1% |
| ConvNeXt-Atto (VCReg2) | 66.8% | 48.1% | 60.4% | 91.1% | 86.4% | 66.4% |
| ConvNeXt-Atto (VCReg3) | 64.0% | 40.9% | 56.5% | 89.4% | 85.9% | 65.1% |
| ConvNeXt-Atto (VCReg4) | 66.7% | 48.3% | 59.6% | 90.6% | 85.6% | 66.1% |
| Network | Number of Inserted Layers | Identity | VCReg (Naive) | VCReg (Fast) | BN |
|---|---|---|---|---|---|
| ViT-Base-32 | 12 | 0.223s | 1.427s | 0.245s | 0.247s |
| ConvNeXt-T | 18 | 0.442s | 2.951s | 0.471s | 0.468s |
$$ \ell_{\mathrm{vcreg}}(h_1, \ldots, h_N) = \alpha\, \ell_{\mathrm{var}}(h_1, \ldots, h_N) + \beta\, \ell_{\mathrm{cov}}(h_1, \ldots, h_N) $$
References
[he2020momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, Stéphane. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. arXiv preprint arXiv:2103.03230.
[ben2023reverse] Ben-Shaul, Ido, Shwartz-Ziv, Ravid, Galanti, Tomer, Dekel, Shai, LeCun, Yann. (2023). Reverse Engineering Self-Supervised Learning. arXiv preprint arXiv:2305.15614.
[2023arXiv230409355S] Shwartz-Ziv, Ravid, LeCun, Yann. (2023). To Compress or Not to Compress--Self-Supervised Learning and Information Theory: A Review. arXiv preprint arXiv:2304.09355.
[geiping2022much] Geiping, Jonas, Goldblum, Micah, Somepalli, Gowthami, Shwartz-Ziv, Ravid, Goldstein, Tom, Wilson, Andrew Gordon. (2022). How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization. arXiv preprint arXiv:2210.06441.
[shwartz2022information] Shwartz-Ziv, Ravid. (2022). Information flow in deep neural networks. arXiv preprint arXiv:2202.06749.
[shwartz2020information] Shwartz-Ziv, Ravid, Alemi, Alexander A. (2020). Information in infinite ensembles of infinitely-wide neural networks. Symposium on Advances in Approximate Bayesian Inference.
[shwartz2018representation] Shwartz-Ziv, Ravid, Painsky, Amichai, Tishby, Naftali. (2018). Representation compression and generalization in deep neural networks.
[shwartz2022we] Shwartz-Ziv, Ravid, Balestriero, Randall, LeCun, Yann. (2022). What Do We Maximize in Self-Supervised Learning?. arXiv preprint arXiv:2207.10081.
[belghazi2018mine] Belghazi, Mohamed Ishmael, Baratin, Aristide, Rajeswar, Sai, Ozair, Sherjil, Bengio, Yoshua, Courville, Aaron, Hjelm, R Devon. (2018). Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062.
[shwartz2022pre] Shwartz-Ziv, Ravid, Goldblum, Micah, Souri, Hossein, Kapoor, Sanyam, Zhu, Chen, LeCun, Yann, Wilson, Andrew G. (2022). Pre-train your loss: Easy bayesian transfer learning with informative priors. Advances in Neural Information Processing Systems.
[kahana2022contrastive] Kahana, Jonathan, Hoshen, Yedid. (2022). A contrastive objective for learning disentangled representations. Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXVI.
[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. International conference on machine learning.
[caron2020unsupervised] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882.
[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, others. (2020). Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733.
[goyal2019scaling] Goyal, Priya, Mahajan, Dhruv, Gupta, Abhinav, Misra, Ishan. (2019). Scaling and benchmarking self-supervised visual representation learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[deng2009imagenet] Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. (2009). Imagenet: A large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition.
[zhai2019s4l] Zhai, Xiaohua, Oliver, Avital, Kolesnikov, Alexander, Beyer, Lucas. (2019). S4l: Self-supervised semi-supervised learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[zhou2014learning] Zhou, Bolei, Lapedriza, Agata, Xiao, Jianxiong, Torralba, Antonio, Oliva, Aude. (2014). Learning deep features for scene recognition using places database. Advances in neural information processing systems.
[everingham2010pascal] Everingham, Mark, Van Gool, Luc, Williams, Christopher KI, Winn, John, Zisserman, Andrew. (2010). The pascal visual object classes (voc) challenge. International journal of computer vision.
[van2018inaturalist] Van Horn, Grant, Mac Aodha, Oisin, Song, Yang, Cui, Yin, Sun, Chen, Shepard, Alex, Adam, Hartwig, Perona, Pietro, Belongie, Serge. (2018). The inaturalist species classification and detection dataset. Proceedings of the IEEE conference on computer vision and pattern recognition.
[lin2014microsoft] Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, Zitnick, C Lawrence. (2014). Microsoft COCO: Common objects in context. European conference on computer vision.
[ren2016faster] Ren, Shaoqing, He, Kaiming, Girshick, Ross, Sun, Jian. (2016). Faster R-CNN: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence.
[li2022neural] Li, Zengyi, Chen, Yubei, LeCun, Yann, Sommer, Friedrich T. (2022). Neural manifold clustering and embedding. arXiv preprint arXiv:2201.10000.
[he2017mask] He, Kaiming, Gkioxari, Georgia, Dollár, Piotr, Girshick, Ross. (2017). Mask R-CNN. Proceedings of the IEEE international conference on computer vision.
[wu2019detectron2] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, Ross Girshick. (2019). Detectron2.
[lecun2006tutorial] LeCun, Yann, Chopra, Sumit, Hadsell, Raia, Ranzato, M, Huang, F. (2006). A tutorial on energy-based learning. Predicting structured data.
[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. arXiv preprint arXiv:2105.04906.
[vincent2008extracting] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th international conference on Machine learning.
[pathak2016context] Pathak, Deepak, Krahenbuhl, Philipp, Donahue, Jeff, Darrell, Trevor, Efros, Alexei A. (2016). Context encoders: Feature learning by inpainting. Proceedings of the IEEE conference on computer vision and pattern recognition.
[zhang2016colorful] Zhang, Richard, Isola, Phillip, Efros, Alexei A. (2016). Colorful image colorization. European conference on computer vision.
[zhang2017split] Zhang, Richard, Isola, Phillip, Efros, Alexei A. (2017). Split-brain autoencoders: Unsupervised learning by cross-channel prediction. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[dosovitskiy2014discriminative] Dosovitskiy, Alexey, Springenberg, Jost Tobias, Riedmiller, Martin, Brox, Thomas. (2014). Discriminative unsupervised feature learning with convolutional neural networks. Advances in neural information processing systems.
[doersch2015unsupervised] Doersch, Carl, Gupta, Abhinav, Efros, Alexei A. (2015). Unsupervised visual representation learning by context prediction. Proceedings of the IEEE international conference on computer vision.
[noroozi2016unsupervised] Noroozi, Mehdi, Favaro, Paolo. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. European conference on computer vision.
[wang2015unsupervised] Wang, Xiaolong, Gupta, Abhinav. (2015). Unsupervised learning of visual representations using videos. Proceedings of the IEEE international conference on computer vision.
[pathak2017learning] Pathak, Deepak, Girshick, Ross, Dollár, Piotr, others. (2017). Learning features by watching objects move. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[caron2018deep] Caron, Mathilde, Bojanowski, Piotr, Joulin, Armand, Douze, Matthijs. (2018). Deep clustering for unsupervised learning of visual features. Proceedings of the European Conference on Computer Vision (ECCV).
[chen2020exploring] Chen, Xinlei, He, Kaiming. (2020). Exploring Simple Siamese Representation Learning. arXiv preprint arXiv:2011.10566.
[barlow1961possible] Barlow, Horace B, others. (1961). Possible principles underlying the transformation of sensory messages. Sensory communication.
[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition.
[you2017large] You, Yang, Gitman, Igor, Ginsburg, Boris. (2017). Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
[tsai2021self] Tsai, Yao-Hung Hubert, Ma, Martin Q, Yang, Muqiao, Zhao, Han, Morency, Louis-Philippe, Salakhutdinov, Ruslan. (2021). Self-supervised Representation Learning with Relative Predictive Coding. arXiv preprint arXiv:2103.11275.
[ozair2019wasserstein] Ozair, Sherjil, Lynch, Corey, Bengio, Yoshua, Oord, Aaron van den, Levine, Sergey, Sermanet, Pierre. (2019). Wasserstein dependency measure for representation learning. arXiv preprint arXiv:1903.11780.
[poole2019variational] Poole, Ben, Ozair, Sherjil, Van Den Oord, Aaron, Alemi, Alex, Tucker, George. (2019). On variational bounds of mutual information. International Conference on Machine Learning.
[chopra2005learning] Chopra, Sumit, Hadsell, Raia, LeCun, Yann. (2005). Learning a similarity metric discriminatively, with application to face verification. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).
[cole2021does] Cole, Elijah, Yang, Xuan, Wilber, Kimberly, Mac Aodha, Oisin, Belongie, Serge. (2021). When Does Contrastive Visual Representation Learning Work?. arXiv preprint arXiv:2105.05837.
[purushwalkam2020demystifying] Purushwalkam Shiva Prakash, Senthil, Gupta, Abhinav. (2020). Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases. Advances in Neural Information Processing Systems.
[goyal2021vissl] Priya Goyal, Quentin Duval, Jeremy Reizenstein, Matthew Leavitt, Min Xu, Benjamin Lefaudeux, Mannat Singh, Vinicius Reis, Mathilde Caron, Piotr Bojanowski, Armand Joulin, Ishan Misra. (2021). VISSL.
[goyal2017accurate] Goyal, Priya, Dollár, Piotr, Girshick, Ross, others. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
[qiao2019micro] Qiao, Siyuan, Wang, Huiyu, Liu, Chenxi, Shen, Wei, Yuille, Alan. (2019). Micro-Batch Training with Batch-Channel Normalization and Weight Standardization. arXiv preprint arXiv:1903.10520.
[dwibedi2021little] Dwibedi, Debidatta, Aytar, Yusuf, Tompson, Jonathan, Sermanet, Pierre, Zisserman, Andrew. (2021). With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. arXiv preprint arXiv:2104.14548.
[pan2010survey] Pan, Sinno Jialin, Yang, Qiang. (2010). A survey on transfer learning. IEEE Transactions on knowledge and data engineering.
[Weiss2016ASO] Karl R. Weiss, Taghi M. Khoshgoftaar, Dingding Wang. (2016). A survey of transfer learning. Journal of Big Data.
[zhuang2020comprehensive] Zhuang, Fuzhen, Qi, Zhiyuan, Duan, Keyu, Xi, Dongbo, Zhu, Yongchun, Zhu, Hengshu, Xiong, Hui, He, Qing. (2020). A comprehensive survey on transfer learning. Proceedings of the IEEE.
[yosinski2014transferable] Yosinski, Jason, Clune, Jeff, Bengio, Yoshua, Lipson, Hod. (2014). How transferable are features in deep neural networks?. Advances in neural information processing systems.
[bengio2012deep] Bengio, Yoshua. (2012). Deep learning of representations for unsupervised and transfer learning. Proceedings of ICML workshop on unsupervised and transfer learning.
[caruana1997multitask] Caruana, Rich. (1997). Multitask learning. Machine learning.
[zhang2021understanding] Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, Vinyals, Oriol. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM.
[zhang2016understanding] Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, Vinyals, Oriol. (2016). Understanding deep learning requires rethinking generalization. CoRR abs/1611.03530 (2016). arXiv preprint arxiv:1611.03530.
[neyshabur2017exploring] Neyshabur, Behnam, Bhojanapalli, Srinadh, McAllester, David, Srebro, Nati. (2017). Exploring generalization in deep learning. Advances in neural information processing systems.
[papyan2020prevalence] Papyan, Vardan, Han, XY, Donoho, David L. (2020). Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences.
[pezeshki2021gradient] Pezeshki, Mohammad, Kaba, Oumar, Bengio, Yoshua, Courville, Aaron C, Precup, Doina, Lajoie, Guillaume. (2021). Gradient starvation: A learning proclivity in neural networks. Advances in Neural Information Processing Systems.
[li2018measuring] Li, Chunyuan, Farkhoor, Heerad, Liu, Rosanne, Yosinski, Jason. (2018). Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838.
[arpit2017closer] Arpit, Devansh, Jastrzębski, Stanisław, others. (2017). A closer look at memorization in deep networks. International conference on machine learning.
[alain2016understanding] Alain, Guillaume, Bengio, Yoshua. (2016). Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
[hjelm2018learning] Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil, Trischler, Adam, Bengio, Yoshua. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
[kornblith2021better] Kornblith, Simon, Chen, Ting, Lee, Honglak, Norouzi, Mohammad. (2021). Why do better loss functions lead to less transferable features?. Advances in Neural Information Processing Systems.
[misra2020self] Misra, Ishan, Maaten, Laurens van der. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
[bossard2014food] Bossard, Lukas, Guillaumin, Matthieu, Van Gool, Luc. (2014). Food-101--mining discriminative components with random forests. Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13.
[krause20133d] Krause, Jonathan, Stark, Michael, Deng, Jia, Fei-Fei, Li. (2013). 3d object representations for fine-grained categorization. Proceedings of the IEEE international conference on computer vision workshops.
[maji13fine-grained] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, A. Vedaldi. (2013). Fine-Grained Visual Classification of Aircraft.
[parkhi2012cats] Parkhi, Omkar M, Vedaldi, Andrea, Zisserman, Andrew, Jawahar, CV. (2012). Cats and dogs. 2012 IEEE conference on computer vision and pattern recognition.
[nilsback2008automated] Nilsback, Maria-Elena, Zisserman, Andrew. (2008). Automated flower classification over a large number of classes. 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.
[cimpoi2014describing] Cimpoi, Mircea, Maji, Subhransu, Kokkinos, Iasonas, Mohamed, Sammy, Vedaldi, Andrea. (2014). Describing textures in the wild. Proceedings of the IEEE conference on computer vision and pattern recognition.
[NEURIPS2019_9015] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, Desmaison, Alban, Kopf, Andreas, Yang, Edward, DeVito, Zachary, Raison, Martin, Tejani, Alykhan, Chilamkurthy, Sasank, Steiner, Benoit, Fang, Lu, Bai, Junjie, Chintala, Soumith. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32.
[liu2022convnet] Liu, Zhuang, Mao, Hanzi, Wu, Chao-Yuan, Feichtenhofer, Christoph, Darrell, Trevor, Xie, Saining. (2022). A convnet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[dosovitskiy2020image] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[krizhevsky2009learning] Krizhevsky, Alex, Hinton, Geoffrey, others. (2009). Learning multiple layers of features from tiny images.
[robustness] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras. (2019). Robustness (Python Library).
[tishby2015deep] Tishby, Naftali, Zaslavsky, Noga. (2015). Deep learning and the information bottleneck principle. 2015 ieee information theory workshop (itw).
[wang2020understanding] Wang, Tongzhou, Isola, Phillip. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. International Conference on Machine Learning.
[hinton2012improving] Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, Salakhutdinov, Ruslan R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
[ioffe2015batch] Ioffe, Sergey, Szegedy, Christian. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. International conference on machine learning.
[cogswell2015reducing] Cogswell, Michael, Ahmed, Faruk, Girshick, Ross, Zitnick, Larry, Batra, Dhruv. (2015). Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068.
[laakom2023wld] Laakom, Firas, Raitoharju, Jenni, Iosifidis, Alexandros, Gabbouj, Moncef. (2023). WLD-Reg: A Data-dependent Within-layer Diversity Regularizer. arXiv preprint arXiv:2301.01352.
[ayinde2019regularizing] Ayinde, Babajide O, Inanc, Tamer, Zurada, Jacek M. (2019). Regularizing deep neural networks by enhancing diversity in feature extraction. IEEE transactions on neural networks and learning systems.
[bansal2018can] Bansal, Nitin, Chen, Xiaohan, Wang, Zhangyang. (2018). Can we gain more from orthogonality regularizations in training deep networks?. Advances in Neural Information Processing Systems.
[bommasani2021opportunities] Bommasani, Rishi, Hudson, Drew A, Adeli, Ehsan, Altman, Russ, Arora, Simran, von Arx, Sydney, Bernstein, Michael S, Bohg, Jeannette, Bosselut, Antoine, Brunskill, Emma, others. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
[shwartz2023information] Shwartz-Ziv, Ravid, Balestriero, Randall, Kawaguchi, Kenji, Rudner, Tim GJ, LeCun, Yann. (2023). An Information-Theoretic Perspective on Variance-Invariance-Covariance Regularization. arXiv preprint arXiv:2303.00633.
[kessy2018optimal] Kessy, Agnan, Lewin, Alex, Strimmer, Korbinian. (2018). Optimal whitening and decorrelation. The American Statistician.
[lecun2002efficient] LeCun, Yann, Bottou, Léon, Orr, Genevieve B, Müller, Klaus-Robert. (2002). Efficient backprop. Neural networks: Tricks of the trade.
[huang2018orthogonal] Huang, Lei, Liu, Xianglong, Lang, Bo, Yu, Adams, Wang, Yongliang, Li, Bo. (2018). Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. Proceedings of the AAAI Conference on Artificial Intelligence.
[laakom2023learning] Laakom, Firas, Raitoharju, Jenni, Iosifidis, Alexandros, Gabbouj, Moncef. (2023). Learning distinct features helps, provably. Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
[soomro2012ucf101] Soomro, Khurram, Zamir, Amir Roshan, Shah, Mubarak. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
[kay2017kinetics] Kay, Will, Carreira, Joao, Simonyan, Karen, Zhang, Brian, Hillier, Chloe, Vijayanarasimhan, Sudheendra, Viola, Fabio, Green, Tim, Back, Trevor, Natsev, Paul, others. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
[tong2022videomae] Tong, Zhan, Song, Yibing, Wang, Jue, Wang, Limin. (2022). Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems.
[arnab2021vivit] Arnab, Anurag, Dehghani, Mostafa, Heigold, Georg, Sun, Chen, Lučić, Mario, Schmid, Cordelia. (2021). ViViT: A video vision transformer. Proceedings of the IEEE/CVF international conference on computer vision.
[wang2023videomae] Wang, Limin, Huang, Bingkun, Zhao, Zhiyu, Tong, Zhan, He, Yinan, Wang, Yi, Wang, Yali, Qiao, Yu. (2023). Videomae v2: Scaling video masked autoencoders with dual masking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[kuehne2011hmdb] Kuehne, Hildegard, Jhuang, Hueihan, Garrote, Estíbaliz, Poggio, Tomaso, Serre, Thomas. (2011). HMDB: a large video database for human motion recognition. 2011 International conference on computer vision.
[li2022uniformerv2] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, Yu Qiao. (2022). UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer.
[simonyan2014two] Simonyan, Karen, Zisserman, Andrew. (2014). Two-stream convolutional networks for action recognition in videos. Advances in neural information processing systems.
[bib1] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
[bib2] Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
[bib3] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.
[bib4] Ido Ben-Shaul, Ravid Shwartz-Ziv, Tomer Galanti, Shai Dekel, and Yann LeCun. Reverse engineering self-supervised learning. arXiv preprint arXiv:2305.15614, 2023.
[bib5] Yoshua Bengio. Deep learning of representations for unsupervised and transfer learning. In Proceedings of ICML workshop on unsupervised and transfer learning, pages 17–36. JMLR Workshop and Conference Proceedings, 2012.
[bib6] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[bib7] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 446–461. Springer, 2014.
[bib8] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882, 2020.
[bib9] Rich Caruana. Multitask learning. Machine learning, 28:41–75, 1997.
[bib10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
[bib11] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014.
[bib12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
[bib13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[bib14] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, and Dimitris Tsipras. Robustness (python library), 2019.
[bib15] Jonas Geiping, Micah Goldblum, Gowthami Somepalli, Ravid Shwartz-Ziv, Tom Goldstein, and Andrew Gordon Wilson. How much data are augmentations worth? an investigation into scaling laws, invariance, and implicit regularization. arXiv preprint arXiv:2210.06441, 2022.
[bib16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[bib17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[bib18] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[bib19] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. pmlr, 2015.
[bib20] Simon Kornblith, Ting Chen, Honglak Lee, and Mohammad Norouzi. Why do better loss functions lead to less transferable features? Advances in Neural Information Processing Systems, 34:28648–28662, 2021.
[bib21] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on computer vision workshops, pages 554–561, 2013.
[bib22] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[bib23] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. arXiv preprint arXiv:1804.08838, 2018.
[bib24] Zengyi Li, Yubei Chen, Yann LeCun, and Friedrich T Sommer. Neural manifold clustering and embedding. arXiv preprint arXiv:2201.10000, 2022.
[bib25] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022.
[bib26] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
[bib27] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6707–6717, 2020.
[bib28] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. Advances in neural information processing systems, 30, 2017.
[bib29] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
[bib30] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
[bib31] Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
[bib32] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012.
[bib33] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[bib34] Mohammad Pezeshki, Oumar Kaba, Yoshua Bengio, Aaron C Courville, Doina Precup, and Guillaume Lajoie. Gradient starvation: A learning proclivity in neural networks. Advances in Neural Information Processing Systems, 34:1256–1272, 2021.
[bib35] Ravid Shwartz-Ziv. Information flow in deep neural networks. arXiv preprint arXiv:2202.06749, 2022.
[bib36] Ravid Shwartz-Ziv and Alexander A Alemi. Information in infinite ensembles of infinitely-wide neural networks. In Symposium on Advances in Approximate Bayesian Inference, pages 1–17. PMLR, 2020.
[bib37] Ravid Shwartz-Ziv, Randall Balestriero, and Yann LeCun. What do we maximize in self-supervised learning? arXiv preprint arXiv:2207.10081, 2022.
[bib38] Ravid Shwartz-Ziv, Micah Goldblum, Hossein Souri, Sanyam Kapoor, Chen Zhu, Yann LeCun, and Andrew G Wilson. Pre-train your loss: Easy bayesian transfer learning with informative priors. Advances in Neural Information Processing Systems, 35:27706–27715, 2022.
[bib39] Ravid Shwartz-Ziv and Yann LeCun. To compress or not to compress: Self-supervised learning and information theory, a review. arXiv preprint arXiv:2304.09355, 2023.
[bib40] Ravid Shwartz-Ziv, Amichai Painsky, and Naftali Tishby. Representation compression and generalization in deep neural networks, 2018.
[bib41] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
[bib42] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.
[bib43] Karl R. Weiss, Taghi M. Khoshgoftaar, and Dingding Wang. A survey of transfer learning. Journal of Big Data, 3, 2016.
[bib44] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? Advances in neural information processing systems, 27, 2014.
[bib45] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230, 2021.
[bib46] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
[bib47] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
[bib48] Bolei Zhou, Agata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. Advances in neural information processing systems, 27, 2014.
[bib49] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.