To Compress or Not to Compress - Self-Supervised Learning and Information Theory: A Review
Abstract
Deep neural networks excel in supervised learning tasks but are constrained by the need for extensive labeled data. Self-supervised learning emerges as a promising alternative, allowing models to learn without explicit labels. Information theory, and notably the information bottleneck principle, has been pivotal in shaping deep neural networks. This principle focuses on optimizing the trade-off between compression and preserving relevant information, providing a foundation for efficient network design in supervised contexts. However, its precise role and adaptation in self-supervised learning remain unclear. In this work, we scrutinize various self-supervised learning approaches from an information-theoretic perspective, introducing a unified framework that encapsulates the self-supervised information-theoretic learning problem. We weave together existing research into a cohesive narrative, delve into contemporary self-supervised methodologies, and spotlight potential research avenues and inherent challenges. Additionally, we discuss the empirical evaluation of information-theoretic quantities and their estimation methods. Overall, this paper furnishes an exhaustive review of the intersection of information theory, self-supervised learning, and deep neural networks.
Implicit Compression in Self-Supervised Learning Methods
Ravid Shwartz-Ziv New York University Yann LeCun New York University & Meta AI - FAIR
Keywords: Self-Supervised Learning, Information Theory, Representation Learning
Introduction
Deep neural networks (DNNs) have revolutionized fields such as computer vision, natural language processing, and speech recognition due to their remarkable performance in supervised learning tasks (Alam et al., 2020; He et al., 2015; LeCun et al., 2015). However, the success of DNNs is often limited by the need for vast amounts of labeled data, which can be both time-consuming and expensive to acquire. Self-supervised learning (SSL) emerges as a promising alternative, enabling models to learn from data without explicit labels by leveraging the underlying structure and relationships within the data itself.
Recent advances in SSL have been driven by joint embedding architectures, such as Siamese Nets (Bromley et al., 1993), DrLIM (Chopra et al., 2005; Hadsell et al., 2006), and SimCLR (Chen et al., 2020a). These approaches define a loss function that encourages representations of different versions of the same image to be similar while pushing representations of distinct images apart. After optimizing the surrogate objective, the pre-trained model can be employed as a feature extractor, with the learned features serving as inputs for downstream supervised tasks like image classification, object detection, instance segmentation, or pose estimation (Caron et al., 2021; Chen et al., 2020a; Misra and van der Maaten, 2020; Shwartz-Ziv et al., 2022b). Although SSL methods have shown promising results in practice, the theoretical underpinnings behind their effectiveness remain an open question (Arora et al., 2019; Lee et al., 2021a).
Information theory has played a crucial role in understanding and optimizing deep neural networks, from practical applications like the variational information bottleneck (Alemi et al., 2016) to theoretical investigations of generalization bounds induced by mutual information (Steinke and Zakynthinou, 2020; Xu and Raginsky, 2017). Building upon these foundations, several researchers have attempted to enhance self-supervised and semi-supervised learning algorithms using information-theoretic principles, such as the Mutual Information Neural Estimator (MINE) (Belghazi et al., 2018b) combined with the information maximization (InfoMax) principle (Linsker, 1988). However, the plethora of objective functions, contradictory assumptions, and varied estimation techniques in the literature makes it challenging to grasp the underlying principles and their implications.
In this paper, we aim to achieve two objectives. First, we propose a unified framework that synthesizes existing research on self-supervised and semi-supervised learning from an information-theoretic standpoint. This framework allows us to present and compare current methods, analyze their assumptions and difficulties, and discuss the optimal representation for neural networks in general and self-supervised networks in particular. Second, we explore different methods and estimators for optimizing information-theoretic quantities in deep neural networks and investigate how recent models optimize various information-theoretic terms.
By reviewing the literature on various aspects of information-theoretic learning, we provide a comprehensive understanding of the interplay between information theory, self-supervised learning, and deep neural networks. We discuss the application of the information bottleneck principle (Tishby et al., 1999a), connections between information theory and generalization, and recent information-theoretic learning algorithms. Furthermore, we examine how the information-theoretic perspective can offer insights into the design of better self-supervised learning algorithms and the potential benefits of using information theory in SSL across a wide range of applications.
In addition to the main structure of the paper, we dedicate a section to the challenges and opportunities in extending the information-theoretic perspective to other learning paradigms, such as energy-based models. We highlight the potential advantages of incorporating these extensions into self-supervised learning algorithms and discuss the technical and conceptual challenges that must be addressed.
The structure of the paper is as follows. Section 2 introduces the key concepts in supervised, semi-supervised, self-supervised learning, information theory, and representation learning. Section 3 presents a unified framework for multiview learning based on information theory. We first discuss what an optimal representation is and why compression is beneficial for learning. Next, we explore optimal representation in single-view supervised learning models and how they can be extended to unsupervised, semi-supervised, and multiview contexts. The focus then shifts to self-supervised learning, where the optimal representation remains an open question. Using the unified framework, we compare recent self-supervised algorithms and discuss their differences. We analyze the assumptions behind these models, their effects on the learned representation, and their varying perspectives on important information within the network.
Section 5 addresses several technical challenges, discussing both theoretical and practical issues in estimating information-theoretic terms. We present recent methods for estimating these quantities, including variational bounds and estimators. Section 6 concludes the paper by offering insights into potential future research directions at the intersection of information theory, self-supervised learning, and deep neural networks. Our aim is to inspire further research that leverages information theory to advance our understanding of self-supervised learning and to develop more efficient and effective models for a broad range of applications.
Background and Fundamental Concepts
Multiview Representation Learning
Multiview learning has gained increasing attention and great practical success by using complementary information from multiple features or modalities. The multiview learning paradigm divides the input variable into multiple views from which the target variable should be predicted (Zhao et al., 2017b). Using this paradigm, one can eliminate hypotheses that contradict predictions from other views and provide a natural semi-supervised and self-supervised learning setting. A multiview dataset consists of data captured from multiple sources, modalities, and forms but with similar high-level semantics (Yan et al., 2021). This mechanism was initially used for natural-world data, combining image, text, audio, and video measurements. For example, photos of objects are taken from various angles, and our supervised task is to identify the objects. Another example is identifying a person by analyzing the video stream as one view and the audio stream as the other.
Although these views often provide different and complementary information about the same data, directly integrating them does not produce satisfactory results due to biases between multiple views (Yan et al., 2021). Thus, multiview representation learning involves identifying the underlying data structure and integrating the different views into a common feature space, resulting in high performance. In recent decades, multiview learning has been used for many machine learning tasks and influenced many algorithms, such as co-training mechanisms (Kumar and Daumé, 2011), subspace learning methods (Xue et al., 2019), and multiple kernel learning (MKL) (Bach and Jordan, 2002). Li et al. (2018) proposed two categories for multiview representation learning: (i) multiview representation fusion, which combines different features from multiple views into a single compact representation, and (ii) alignment of multiview representation, which attempts to capture the relationships among multiple different views through feature alignment. In the latter case, a learned mapping function embeds the data of each view, and the representations are regularized to form a multiview-aligned space. An early method in this research direction is Canonical Correlation Analysis (CCA) (Hotelling, 1936) and its kernel extensions (Bach and Jordan, 2003; Hardoon et al., 2004; Sun, 2013). In addition to CCA, multiview representation learning has penetrated a variety of learning methods, such as dimensionality reduction (Sun et al., 2010), clustering analysis (Yan et al., 2015), multiview sparse coding (Cao et al., 2013; Jia et al., 2010; Liu et al., 2014), and multimodal topic learning (Pu et al., 2020). However, despite their promising results, these methods use handcrafted features and linear embedding functions, which cannot capture the nonlinear properties of multiview data.
The emergence of deep learning has provided a powerful way to learn complex, nonlinear, and hierarchical representations of data. By incorporating multiple hierarchical layers, deep learning algorithms can learn complex, subtle, and abstract representations of target data. The success of deep learning in various application domains has led to a growing interest in deep multiview methods, which have shown promising results. Examples of these methods include deep multiview canonical correlation analysis (Andrew et al., 2013) as an extension of CCA, multiview clustering via deep matrix factorization (Zhao et al., 2017a), and the deep multiview spectral network (Huang et al., 2019). Moreover, deep architectures have been employed to generate effective representations in methods such as multiview convolutional neural networks (Liu et al., 2021a), multimodal deep Boltzmann machines (Srivastava and Salakhutdinov, 2014), multimodal deep autoencoders (Ngiam et al., 2011; Wang et al., 2015), and multimodal recurrent neural networks (Donahue et al., 2015; Karpathy and Fei-Fei, 2015; Mao et al., 2014).
Self-Supervised Learning
Self-supervised learning (SSL) is a powerful technique that leverages unlabeled data to learn useful representations. In contrast to supervised learning, which relies on labeled data, SSL employs self-defined signals to establish a proxy objective between the input and the signal. The model is initially trained using this proxy objective and subsequently fine-tuned on the target task. Self-supervised signals, derived from the inherent co-occurrence relationships in the data, serve as self-supervision. Various such signals have been used to learn representations, including generative and joint embedding architectures (Bachman et al., 2019; Bar et al., 2022; Chen et al., 2020a,b).
Two main categories of SSL architectures exist: (1) generative architectures based on reconstruction or prediction and (2) joint embedding architectures (Liu et al., 2021b). Both architecture classes can be trained using either contrastive or non-contrastive methods.
We begin by discussing these two main types of architectures:
- Generative Architecture: Generative architectures employ an objective function that measures the divergence between input data and predicted reconstructions, such as squared error. The architecture reconstructs data from a latent variable or a corrupted version, potentially with a latent variable's assistance. Notable examples of generative architectures include auto-encoders, sparse coding, sparse auto-encoders, and variational auto-encoders (Kingma and Welling, 2013; Lee et al., 2006; Ng et al., 2011). As the reconstruction task lacks a single correct answer, most generative architectures utilize a latent variable, which, when varied, generates multiple reconstructions. The latent variable's information content requires regularization to ensure the system reconstructs regions of high data density while avoiding a collapse by reconstructing the entire space. PCA regularizes the latent variable by limiting its dimensions, while sparse coding and sparse auto-encoders restrict the number of non-zero components. Variational auto-encoders regularize the latent variable by rendering it stochastic and maximizing the entropy of the distribution relative to a prior. Vector quantized variational auto-encoders (VQ-VAE) employ discrete latent variables to achieve similar results (Van Den Oord et al., 2017).
- Joint Embedding Architectures (JEA): These architectures process multiple views of an input signal through encoders, producing representations of the views. The system is trained to ensure that these representations are both informative and mutually predictable. Examples include Siamese networks, where two identical encoders share weights (Chen et al., 2020a; Chen and He, 2021; Grill et al., 2020; He et al., 2020), and methods permitting encoders to differ (Bardes et al., 2021). A primary challenge with JEA is preventing informational collapse, in which the representations contain minimal information about the inputs, thereby facilitating their mutual prediction. JEA's advantage lies in the encoders' ability to eliminate noisy, unpredictable, or irrelevant information from the input within the representation space.
To effectively train these architectures, it is essential to ensure that the representations of different signals are distinct. This can be achieved through either contrastive or non-contrastive methods:
- Contrastive Methods: Contrastive methods utilize data points from the training set as positive samples and generate points outside the region of high data density as contrastive samples. The energy (e.g., reconstruction error for generative architectures or representation predictive error for JEA) should be low for positive samples and higher for contrastive samples. Various loss functions involving the energies of pairs or sets of samples can be minimized to achieve this objective.
- Non-Contrastive Methods: Non-contrastive methods prevent the energy landscape's collapse by limiting the volume of space that can take low energy, either through architectural constraints or through a regularizer in the energy or training objective. In latent-variable generative architectures, preventing collapse is achieved by limiting or minimizing the information content of the latent variable. In JEA, collapse is prevented by maximizing the information content of the representations.
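To make the contrastive requirement concrete, the following is a minimal sketch of one possible pairwise loss on energies. It assumes a squared-distance energy between two representations and a hinge (margin) loss; both choices, the function names, and the margin value are illustrative assumptions, not the specific loss of any method cited here.

```python
def energy(z_a, z_b):
    """Illustrative energy: squared distance between two representations."""
    return sum((a - b) ** 2 for a, b in zip(z_a, z_b))


def margin_loss(e_positive, e_contrastive, margin=1.0):
    """Hinge loss: zero once the contrastive energy exceeds the positive
    energy by at least `margin`, positive otherwise."""
    return max(0.0, margin + e_positive - e_contrastive)


anchor = [1.0, 0.0]

# Positive pair is close and the contrastive sample is far: no penalty.
separated = margin_loss(energy(anchor, [1.0, 0.1]), energy(anchor, [-1.0, 0.0]))

# Positive pair is far and the contrastive sample is close: penalized.
violated = margin_loss(energy(anchor, [0.0, 1.0]), energy(anchor, [1.0, 0.5]))
```

Minimizing such a loss pushes positive energies down and contrastive energies up only until the margin is satisfied, which is one way of shaping the energy landscape without driving all energies to zero.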
We now present a few concrete examples of popular models that employ various combinations of generative architectures, joint embedding architectures, contrastive training, and non-contrastive training:
The Denoising Autoencoder approach in generative architectures (Devlin et al., 2018; He et al., 2022; Vincent et al., 2008) uses a triplet loss that involves a positive sample, which is a vector from the training set that should be reconstructed perfectly, and a contrastive sample consisting of a pair of data vectors, one from the training set and the other a corrupted version of it.
In SSL, the combination of JEA models with contrastive learning has proven highly effective. In contrastive learning, the objective is to attract different augmented views of the same image (positive points) while repelling dissimilar augmented views (negative points). Recent self-supervised visual representation learning examples include MoCo (He et al., 2020) and SimCLR (Chen et al., 2020a). The InfoNCE loss is a commonly used objective function in many contrastive learning methods:
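$$
\mathbb{E}_{x, x^{+}, x^{k}} \left[ -\log \left( \frac{e^{f(x)^{\top} f(x^{+})}}{\sum_{k=1}^{K} e^{f(x)^{\top} f(x^{k})}} \right) \right]
$$
where $x^{+}$ is a sample similar to $x$, $x^{k}$ are all the samples in the batch, and $f$ is an encoder.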
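As a minimal sketch of an InfoNCE-style objective for a single anchor, the code below assumes dot-product similarity between embeddings and that the positive embedding is included among the batch candidates. The function names and the toy vectors are illustrative assumptions; real implementations usually add a temperature and operate on whole batches.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, candidates):
    """InfoNCE loss for one anchor embedding f(x).

    anchor     : f(x)
    positive   : f(x+), the embedding of a sample similar to x
    candidates : embeddings f(x_k) of all samples in the batch
                 (assumed here to include the positive)
    """
    logits = [dot(anchor, c) for c in candidates]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(dot(anchor, positive) - log_denominator)


anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[-1.0, 0.0], [0.0, 1.0]]

# Positive aligned with the anchor, negatives pointing elsewhere.
aligned_loss = info_nce(anchor, positive, [positive] + negatives)

# Collapsed embeddings: every candidate is indistinguishable from the positive.
collapsed = [0.0, 0.0]
uniform_loss = info_nce(anchor, collapsed, [collapsed] * 3)
```

With three indistinguishable candidates the loss equals log 3, the value of a uniform guess; a well-aligned positive drives the loss below that value.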
However, contrastive methods heavily depend on all other samples in the batch and require a large batch size. Additionally, recent studies (Jing et al., 2021) have shown that contrastive learning can lead to dimensional collapse, where the embedding vectors span a lower-dimensional subspace instead of the entire embedding space. Although positive and negative pairs should repel each other to prevent dimensional collapse, augmentation along feature dimensions and implicit regularization cause the embedding vectors to fall into a lower-dimensional subspace, resulting in low-rank solutions.
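To address these problems, recent works have introduced JEA models with non-contrastive methods. Unlike contrastive methods, these methods employ regularization to prevent the collapse of the representation and do not explicitly rely on negative samples. For example, several papers use stop-gradients and extra predictors to avoid collapse (Chen and He, 2021; Grill et al., 2020), while Caron et al. (2020) employed an additional clustering step. VICReg (Bardes et al., 2021) is another non-contrastive method that regularizes the covariance matrix of the representation. Consider two embedding batches $Z = [f(x_1), \dots, f(x_N)]$ and $Z' = [f(x'_1), \dots, f(x'_N)]$, each of size $(N \times K)$. Denote by $C$ the $(K \times K)$ covariance matrix obtained from $[Z, Z']$. The VICReg triplet loss is defined by:
$$
\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \left( \alpha \max\left(0, \gamma - \sqrt{C_{k,k} + \epsilon}\right) + \beta \sum_{k' \neq k} \left(C_{k,k'}\right)^{2} \right) + \frac{\lambda}{N} \left\| Z - Z' \right\|_{F}^{2}
$$
where $\alpha$, $\beta$, and $\lambda$ weight the variance, covariance, and invariance terms, $\gamma$ is the target standard deviation, and $\epsilon$ is a small constant for numerical stability.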
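The three ingredients of a VICReg-style loss (a variance hinge on the diagonal of $C$, a penalty on its off-diagonal entries, and an invariance term between the two batches) can be sketched as follows. This is an illustration under stated assumptions: the weights `alpha`, `beta`, `lam`, the target standard deviation `gamma`, and `eps` are placeholder values, and the covariance is computed on the two batches stacked row-wise.

```python
import math


def column_covariance(rows):
    """Unbiased (K x K) covariance of the columns of a list of row vectors."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
            for a in range(k)]


def vicreg_loss(z, z_prime, alpha=25.0, beta=1.0, lam=25.0, gamma=1.0, eps=1e-4):
    """Variance + covariance + invariance terms for two (N x K) batches."""
    n, k = len(z), len(z[0])
    cov = column_covariance(z + z_prime)  # C from the stacked batches [Z, Z']
    variance = sum(max(0.0, gamma - math.sqrt(cov[a][a] + eps)) for a in range(k))
    covariance = sum(cov[a][b] ** 2 for a in range(k) for b in range(k) if a != b)
    invariance = sum((z[i][j] - z_prime[i][j]) ** 2
                     for i in range(n) for j in range(k)) / n
    return (alpha * variance + beta * covariance) / k + lam * invariance


# Collapsed representation: every sample maps to the same point.
collapsed = [[0.5, 0.5]] * 4
collapsed_loss = vicreg_loss(collapsed, collapsed)

# Spread-out, decorrelated representation with identical views.
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
spread_loss = vicreg_loss(spread, spread)
```

The collapsed batch is penalized only through the variance hinge, which is exactly the term that rules out the trivial constant solution, while the spread batch incurs no penalty.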
Semi-Supervised Learning
Semi-supervised learning employs both labeled and unlabeled data to enhance model performance (Chapelle et al., 2009). Consistency regularization-based approaches (Laine and Aila, 2016; Miyato et al., 2018; Sohn et al., 2020) ensure that predictions remain stable under perturbations in input data and model parameters. Certain techniques, such as those proposed by Grandvalet and Bengio (2006) and Miyato et al. (2018), involve training a model by incorporating a regularization term into a supervised cross-entropy loss. In contrast, Xie et al. (2020) utilize suitably weighted unsupervised regularization terms, while Zhai et al. (2019) adopt a combination of self-supervised pretext loss terms. Moreover, pseudo-labeling can generate synthetic labels based on network uncertainty to further aid model training (Lee et al., 2013).
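One way to picture "a regularization term added to a supervised cross-entropy loss" is an entropy penalty on the predictions for unlabeled points. The sketch below is a simplified illustration in that spirit; the function names, the weight, and the toy predictions are assumptions and do not reproduce any cited method exactly.

```python
import math


def cross_entropy(probs, label):
    """Supervised loss for one labeled example."""
    return -math.log(probs[label])


def entropy(probs):
    """Prediction entropy, used here as an unsupervised regularizer."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def semi_supervised_loss(labeled, unlabeled, weight=0.5):
    """Mean cross-entropy on labeled data plus weighted mean entropy
    of the predictions on unlabeled data."""
    supervised = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)
    unsupervised = sum(entropy(p) for p in unlabeled) / len(unlabeled)
    return supervised + weight * unsupervised


labeled = [([0.8, 0.2], 0), ([0.3, 0.7], 1)]
confident_unlabeled = [[0.99, 0.01], [0.02, 0.98]]
uncertain_unlabeled = [[0.5, 0.5], [0.5, 0.5]]

loss_confident = semi_supervised_loss(labeled, confident_unlabeled)
loss_uncertain = semi_supervised_loss(labeled, uncertain_unlabeled)
```

The regularizer favors decision boundaries that leave unlabeled points with confident predictions, so the loss is lower for the confident batch than for the uncertain one.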
Representation Learning
Representation learning is an essential aspect of various computer vision, natural language processing, and machine learning tasks, as it uncovers the underlying structures in data (Bengio et al., 2013). By extracting relevant information for classification and prediction tasks from the data, we can improve performance and reduce computational complexity (Goodfellow et al., 2016). However, defining an effective representation remains a challenging task. In probabilistic models, a useful representation often captures the posterior distribution of explanatory factors beneath the observed input (LeCun et al., 2015). Bengio and LeCun (2007) introduced the idea of learning highly structured yet complex dependencies for AI tasks, which require transforming high-dimensional input structures into low-dimensional output structures or learning low-level representations. As a result, identifying relevant input features becomes challenging, as most input entropy is unrelated to the output (Shwartz-Ziv and Tishby, 2017). Ben-Shaul et al. (2023) demonstrated that self-supervised learning inherently promotes the clustering of samples based on semantic labels. Intriguingly, this clustering is driven by the objective's regularization term and aligns with semantic classes across multiple hierarchical levels.
Minimal Sufficient Statistic
A possible definition of an effective representation is based on minimal sufficient statistics.
Definition 1 Given $(X, Y) \sim P(X, Y)$, let $T := t(X)$, where $t$ is a deterministic function. We define $T$ as a sufficient statistic of $X$ for $Y$ if $Y - T - X$ forms a Markov chain.
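A sufficient statistic captures all the information about $Y$ in $X$. Cover (1999) proved this property:
Theorem 2 Let $T$ be a function of $X$. Then $T$ is a sufficient statistic for $Y$ if and only if $I(T(X); Y) = I(X; Y)$.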
However, the sufficiency definition also encompasses trivial identity statistics that only 'copy' rather than 'extract' essential information. To prevent statistics from inefficiently utilizing observations, the concept of minimal sufficient statistics was introduced:
Definition 3 (Minimal sufficient statistic (MSS)) A sufficient statistic T is minimal if, for any other sufficient statistic S , there exists a function f such that T = f ( S ) almost surely (a.s.).
In essence, MSS are the simplest sufficient statistics, inducing the coarsest sufficient partition on $X$. In MSS, the values of $X$ are grouped into as few partitions as possible without sacrificing information. MSS are statistics that carry the maximum information about $Y$ while retaining as little information about $X$ as possible (Koopman, 1936).
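A small discrete example can make sufficiency tangible. The sketch below assumes a toy joint distribution in which $P(Y \mid X)$ depends on $X$ only through `x // 2`, so that statistic keeps all the information about $Y$, whereas `x % 2` discards it. The distribution and helper names are illustrative assumptions.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        p_a[a] += p
        p_b[b] += p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)


def push_forward(joint_xy, statistic):
    """Joint distribution of (t(X), Y) induced by a deterministic statistic t."""
    out = defaultdict(float)
    for (x, y), p in joint_xy.items():
        out[(statistic(x), y)] += p
    return dict(out)


# Toy joint: X uniform on {0, 1, 2, 3}; P(Y=1 | x) depends only on x // 2.
p_y1_given_x = {0: 0.9, 1: 0.9, 2: 0.2, 3: 0.2}
joint_xy = {}
for x, q in p_y1_given_x.items():
    joint_xy[(x, 1)] = 0.25 * q
    joint_xy[(x, 0)] = 0.25 * (1.0 - q)

i_xy = mutual_information(joint_xy)
i_sufficient = mutual_information(push_forward(joint_xy, lambda x: x // 2))
i_lossy = mutual_information(push_forward(joint_xy, lambda x: x % 2))
```

Here `x // 2` takes only two values yet preserves $I(X; Y)$, which is the sense in which a coarser partition can remain sufficient; the identity map is also sufficient but retains two bits about $X$ instead of one.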
The Information Bottleneck
The majority of distributions lack exact minimal sufficient statistics, leading Tishby et al. (1999b) to relax the optimization problem in two ways: (i) allowing the map to be stochastic, defined as an encoder $P(T \mid X)$, and (ii) permitting the capture of only part of $I(X; Y)$ rather than all of it. The information bottleneck (IB) was introduced as a principled method to extract relevant information from observed signals related to a target. Given a random variable $x \in X$ jointly distributed with a target $y \in Y$, this framework finds the optimal trade-off between the accuracy and the complexity of the representation. The IB has been employed in various fields such as neuroscience (Buesing and Maass, 2010; Palmer et al., 2015), slow feature analysis (Turner and Sahani, 2007), speech recognition (Hecht et al., 2009), and deep learning (Alemi et al., 2016; Shwartz-Ziv and Tishby, 2017).
Let $X$ be an input random variable, $Y$ a target variable, and $P(X, Y)$ their joint distribution. A representation $T$ is a stochastic function of $X$ defined by a mapping $P(T \mid X)$. This mapping transforms $X \sim P(X)$ into a representation $T \sim P(T) := \int P_{T \mid X}(\cdot \mid x) \, dP_X(x)$. The triple $Y - X - T$ forms a Markov chain in that order with respect to the joint probability measure $P_{X,Y,T} = P_{X,Y} P_{T \mid X}$, which determines the mutual information terms $I(X; T)$ and $I(Y; T)$.
Within the IB framework, our goal is to find a representation P ( T | X ) that extracts as much information as possible about Y (high performance) while compressing X maximally (keeping I ( X ; T ) small). This can also be interpreted as extracting only the relevant information that X contains about Y .
The data processing inequality (DPI) implies that $I(Y; T) \leq I(X; Y)$, so the compressed representation $T$ cannot convey more information than the original signal. Consequently, there is a trade-off between compressing the representation and preserving relevant information about $Y$. The construction of an efficient representation variable is characterized by its encoder and decoder distributions, $P(T \mid X)$ and $P(Y \mid T)$, respectively. The efficient representation of $X$ involves minimizing the complexity of the representation $I(T; X)$ while maximizing $I(T; Y)$. Formally, the IB optimization involves minimizing the following objective function:
$$
\mathcal{L} = \min_{P(T \mid X)} \; I(X; T) - \beta I(T; Y)
$$
where β is the trade-off parameter controlling the complexity of T and the amount of relevant information it preserves. Intuitively, we pass the information that X contains about Y through a 'bottleneck' via the representation T . It has been shown that:
$$
$$
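To illustrate how the trade-off parameter selects between compression and prediction, the sketch below evaluates the standard IB Lagrangian $I(X; T) - \beta I(T; Y)$ for three hand-picked deterministic encoders on a toy discrete distribution. The distribution, the encoders, and the function names are assumptions made for illustration; this is an exhaustive comparison over three candidates, not an IB solver.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    p_a, p_b = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        p_a[a] += p
        p_b[b] += p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)


def ib_terms(p_xy, encoder):
    """Return (I(X;T), I(T;Y)) for an encoder given as {x: {t: P(t|x)}}."""
    p_xt, p_ty = defaultdict(float), defaultdict(float)
    for (x, y), p in p_xy.items():
        for t, q in encoder[x].items():
            p_xt[(x, t)] += p * q
            p_ty[(t, y)] += p * q  # valid because Y - X - T is a Markov chain
    return mutual_information(p_xt), mutual_information(p_ty)


# Toy joint: X uniform on {0, 1, 2, 3}; P(Y=1 | x) depends only on x // 2.
p_y1_given_x = {0: 0.9, 1: 0.9, 2: 0.2, 3: 0.2}
p_xy = {}
for x, q in p_y1_given_x.items():
    p_xy[(x, 1)] = 0.25 * q
    p_xy[(x, 0)] = 0.25 * (1.0 - q)

encoders = {
    "identity": {x: {x: 1.0} for x in range(4)},       # no compression
    "merged": {x: {x // 2: 1.0} for x in range(4)},    # keeps what matters for Y
    "constant": {x: {0: 1.0} for x in range(4)},       # full compression
}


def best_encoder(beta):
    """Name of the candidate encoder with the lowest IB objective."""
    scores = {}
    for name, encoder in encoders.items():
        i_xt, i_ty = ib_terms(p_xy, encoder)
        scores[name] = i_xt - beta * i_ty
    return min(scores, key=scores.get)


choice_high_beta = best_encoder(5.0)
choice_low_beta = best_encoder(1.0)
```

In this toy setting a large $\beta$ selects the encoder that merges inputs with identical $P(Y \mid X)$, while a small $\beta$ makes the constant encoder preferable because the cost of retaining a bit about $X$ outweighs the gain in relevant information.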
Representation Learning and the Information Bottleneck
Information theory traditionally assumes that underlying probabilities are known and do not require learning. For instance, the optimality of the initial IB work (Tishby et al., 1999b) relied on the assumption that the joint distribution of input and labels is known. However, a significant challenge in machine learning algorithms is inferring an accurate predictor for the unknown target variable from observed realizations. This discrepancy raises questions about the practical optimality of the IB and its relevance in modern learning algorithms. The following section delves into the relationship between the IB framework and learning, inference, and generalization.
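Let an input $X \in \mathcal{X}$ and a target variable $Y \in \mathcal{Y}$ be random variables with an unknown joint distribution $P(X, Y)$. For a given class of predictors $f: \mathcal{X} \to \hat{\mathcal{Y}}$ and a loss function $\ell: \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}$ measuring discrepancies between true values and model predictions, our objective is to find the predictor $f$ that minimizes the expected population risk:
$$
\mathcal{L}_P(f, \ell) := \mathbb{E}_{P}\left[ \ell\left(Y, f(X)\right) \right]
$$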
Several issues arise with the population risk. Firstly, it remains unclear which loss function is optimal. A popular choice is the logarithmic loss (or error's entropy), which has been numerically demonstrated to yield better results (Erdogmus, 2002). This loss has been employed in various algorithms, including the InfoMax principle (Linsker, 1988), tree-based algorithms (Quinlan, 2014), deep neural networks (Zhang and Sabuncu, 2018), and Bayesian modeling (Wenzel et al., 2020). Painsky and Wornell (2018) provided a rigorous justification for using the logarithmic loss and showed that, for binary classification problems, it upper-bounds any loss function that is smooth, proper, and convex.
In most cases, the joint distribution $P(X, Y)$ is unknown, and we have access to only $n$ samples from it, denoted by $D_n := \{(x_i, y_i) \mid i = 1, \dots, n\}$. Consequently, the population risk cannot be computed directly. Instead, we typically choose the predictor that minimizes the empirical risk on a training dataset:
$$
\hat{\mathcal{L}}_P(f, \ell, D_n) := \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, f(x_i)\right)
$$
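The generalization gap, defined as the difference between empirical and population risks, is given by:
$$
\mathrm{Gen}_P(f, \ell, D_n) := \mathcal{L}_P(f, \ell) - \hat{\mathcal{L}}_P(f, \ell, D_n)
$$
where $\mathcal{L}_P$ denotes the population risk and $\hat{\mathcal{L}}_P$ the empirical risk on $D_n$.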
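The relation between the two risks can be checked numerically on a toy problem where the joint distribution is known. The sketch below assumes a binary input and label, the logarithmic loss, a predictor that outputs the true conditional probability, and a small hand-picked sample; the gap is taken as population risk minus empirical risk. All names and numbers are illustrative assumptions.

```python
import math

# Assumed toy distribution: P(X=0) = P(X=1) = 0.5 and P(Y=1 | X=x) below.
p_x = {0: 0.5, 1: 0.5}
p_y1_given_x = {0: 0.2, 1: 0.9}


def predictor(x):
    """Hypothetical predictor f: the predicted probability that Y = 1."""
    return p_y1_given_x[x]


def log_loss(y, q):
    """Logarithmic loss of predicting P(Y=1) = q when the label is y."""
    return -math.log(q if y == 1 else 1.0 - q)


def population_risk():
    """Exact expectation of the loss under the known joint distribution."""
    risk = 0.0
    for x, px in p_x.items():
        q = p_y1_given_x[x]
        risk += px * (q * log_loss(1, predictor(x))
                      + (1.0 - q) * log_loss(0, predictor(x)))
    return risk


def empirical_risk(dataset):
    """Average loss over the n observed samples."""
    return sum(log_loss(y, predictor(x)) for x, y in dataset) / len(dataset)


# A small sample that happens to contain only the most likely labels.
d_n = [(0, 0), (0, 0), (1, 1), (1, 1)]

generalization_gap = population_risk() - empirical_risk(d_n)
```

Because this small sample contains no unlikely labels, the empirical risk underestimates the population risk and the gap is positive; with more samples the two would be expected to move closer.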
Interestingly, the relationship between the true loss and the empirical loss can be bounded using the information bottleneck term. Shamir et al. (2010) developed several finite sample bounds for the generalization gap. According to their study, the IB framework exhibited good generalizability even with small sample sizes. In particular, they developed non-uniform bounds adaptive to the model's complexity. They demonstrated that for the discrete case, the error in estimating mutual information from finite samples is bounded by $O\left(\frac{|\mathcal{X}| \log n}{\sqrt{n}}\right)$, where $|\mathcal{X}|$ is the cardinality of $X$ (the number of possible values that the random variable $X$ can take). The results support the intuition that simpler models generalize better, which motivates compressing the representation. Therefore, optimizing eq. (1) presents a trade-off between two opposing forces. On the one hand, we want to increase our prediction accuracy on the training data (high $\beta$). On the other hand, we would like to decrease $\beta$ to narrow the generalization gap. Vera et al. (2018) extended this work and showed that the generalization gap is bounded by the square root of the mutual information between training input and model representation times $\frac{\log n}{n}$. Furthermore, Russo and Zou (2019) and Xu and Raginsky (2017) demonstrated that the square root of the mutual information between the training input and the parameters inferred from the training algorithm provides a concise bound on the generalization gap. However, these bounds critically depend on the Markov operator that maps the training set to the network parameters, whose characterization is not trivial.
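To get a feel for the scaling of the finite-sample rate for the discrete case, the sketch below evaluates $|\mathcal{X}| \log n / \sqrt{n}$ for a few values. It deliberately ignores the constants hidden by the $O(\cdot)$ notation, so the numbers indicate trends only; the function name is an illustrative assumption.

```python
import math


def estimation_rate(cardinality, n):
    """Rate |X| * log(n) / sqrt(n), with constants from the bound omitted."""
    return cardinality * math.log(n) / math.sqrt(n)


small_sample = estimation_rate(10, 10 ** 4)
large_sample = estimation_rate(10, 10 ** 6)
larger_alphabet = estimation_rate(20, 10 ** 6)
```

The rate shrinks as the sample grows and scales linearly with the number of values $X$ can take, which is the quantitative sense in which a smaller effective alphabet (a more compressed representation) is easier to estimate from finite data.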
Achille and Soatto (2018) explored how applying the IB objective to the network's parameters may reduce overfitting while maintaining invariant representations. Their work showed that flat minima, which have better generalization properties, bound the information in the weights, and the information in the weights bounds the information in the activations. Chelombiev et al. (2019) found that the generalization precision is positively correlated with the degree of compression of the last layer in the network. Shwartz-Ziv et al. (2018) showed that the generalization error depends exponentially on the mutual information between the model and the input once it is smaller than $\log 2n$, the query sample complexity. Moreover, they demonstrated that $M$ bits of compression of $X$ are equivalent to an exponential factor of $2^M$ training examples. Piran et al. (2020) extended the original IB to the dual form, which offers several advantages in terms of compression.
These studies illustrate that the IB leads to a trade-off between prediction and complexity, even for the empirical distribution. With the IB objective, we can design estimators to find optimal solutions for different regimes with varying performance, complexity, and generalization.
Information-Theoretic Objectives
Before delving into the details, we provide an overview of the information-theoretic objectives in various learning scenarios, including supervised, unsupervised, and self-supervised settings. We will also introduce a general framework to better understand the process of learning optimal representations and explore recent methods working towards this goal.
Developing a novel algorithm entails numerous aspects, such as architecture, initialization parameters, learning algorithms, and pre-processing techniques. A crucial element, however, is the objective function. As demonstrated in Section 2.4.2, the IB approach, originally introduced by Tishby et al. (1999b), defines the optimal representation in supervised scenarios, enabling us to identify which terms to compress during learning. However, determining the optimal representation and deriving information-based objective functions in self-supervised settings are more challenging. In this section, we introduce a general framework to understand the process of learning optimal representations and explore recent methods striving to achieve this goal.
Setup and Methodology
Using a two-channel input allows us to model complex multiview learning problems. In many real-world situations, data can be observed from multiple perspectives or modalities, making it essential to develop learning algorithms capable of handling such multiview data.
Consider a two-channel input, X 1 and X 2 , and a single-channel label Y for a downstream task, all possessing a joint distribution P ( X 1 , X 2 , Y ). We assume the availability of n labeled examples $S = \{(x_1^i, x_2^i, y^i)\}_{i=1}^{n}$ and t unlabeled examples $U = \{(x_1^i, x_2^i)\}_{i=n+1}^{n+t}$, both independently and identically distributed. Our objective is to predict Y using a loss function.
In our model, we use a learned encoder with a prior P ( Z ) to generate a conditional representation (which may be deterministic or stochastic) Z i | X i = P θ i ( Z i | X i ), where i = 1 , 2
represents the two views. Subsequently, we utilize various decoders to 'decode' distinct aspects of the representation:
For the supervised scenario, we have a joint embedding of the label classifiers from both views, ˆ Y 1 , 2 = Q ρ ( Y | Z 1 , Z 2 ), and two decoders predicting the labels of the downstream task based on each individual view, ˆ Y i = Q ρ i ( Y | Z i ) for i = 1 , 2.
For self-supervised learning, we utilize two cross-decoders attempting to predict one representation based on the other, ˜ Z 1 | Z 2 = q η 1 ( Z 1 | Z 2 ) and ˜ Z 2 | Z 1 = q η 2 ( Z 2 | Z 1 ). Figure 1 illustrates this structure.
The information-theoretic perspective of self-supervised networks has led to confusion regarding the information being optimized in recent work. In supervised and unsupervised learning, only one 'information path' exists when optimizing information-theoretic terms: the input is encoded through the network, and then the representation is decoded and compared to the targets. As a result, the representation and corresponding information always stem from a single encoder and decoder.
However, in the self-supervised multiview scenario, we can construct our representation using various encoders and decoders. For instance, to define the information involved in I ( X 1 ; Z 1 ), we need to specify the associated random variable. This variable could be based either on the encoder of X 1 , P θ 1 ( Z 1 | X 1 ), or on the encoder of X 2 , P θ 2 ( Z 2 | X 2 ), whose output is subsequently passed to the cross-decoder Q η 1 ( Z 1 | Z 2 ) and then to the direct decoder Q ψ 1 ( X 1 | Z 1 ).
To fully understand the information terms we aim to optimize, and to distinguish between the various 'information paths,' we mark each information path differently. For example, $I_{P(X_1),\,P(Z_1 \mid X_1),\,P(Z_2 \mid Z_1)}(X_1; Z_2)$ is based on the path P ( X 1 ) → P ( Z 1 | X 1 ) → P ( Z 2 | Z 1 ). In the following section, we will 'translate' previous work into our present framework and examine the loss function.
Optimization with Labels
After establishing our framework, we can now incorporate various learning algorithms. We begin by examining classical single-view supervised information bottleneck algorithms for deep networks that utilize labeled data during training and extend them to the multiview scenario. Next, we broaden our perspective to include unsupervised learning, where input reconstruction replaces labels, and semi-supervised learning, where information-based regularization is applied to improve predictions.
Single-View Supervised Learning
Figure 1: Multiview information bottleneck diagram for self-supervised, unsupervised, and supervised learning
In classical single-view supervised learning, the task of representation learning involves finding a distribution p ( z | x ) that maps data observations x ∈ X to a representation z ∈ Z , capturing only the relevant features of the input (Shwartz-Ziv, 2022). The goal is to predict a label y ∈ Y using the learned representation. Achille and Soatto (2018) defined the sufficiency of Z for Y as the amount of label information retained after passing data through the encoder:
Definition 4 Sufficiency : A representation Z of X is sufficient for Y if and only if I ( X ; Y | Z ) = 0 .
Federici et al. (2020) showed that Z is sufficient for Y if and only if the amount of information regarding the task remains unchanged by the encoding procedure. A sufficient representation can predict Y as accurately as the original data X . In Section 2.4, we saw a trade-off between prediction and generalization when there is a finite amount of data. To reduce the generalization gap, we aim to compress X while retaining as much predictive information on the labels as possible. Thus, we relax the sufficiency definition and minimize the following objective:
$$
\mathcal{L} = I(X; Z) - \beta I(Z; Y)
$$
The mutual information I ( Y ; Z ) determines how much label information is accessible and reflects the model's ability to predict performance on the target task. I ( X ; Z ) represents the information that Z carries about the input, which we aim to compress. However, I ( X ; Z ) contains both relevant and irrelevant information about Y . Therefore, using the chain rule of information, Federici et al. (2020) proposed splitting I ( X ; Z ) into two terms:
$$
I(X; Z) = I(Y; Z) + I(X; Z \mid Y)
$$
The conditional information I ( X ; Z | Y ) represents information in Z that is not predictive of Y , i.e., superfluous information. The decomposition of input information enables us to compress only irrelevant information while preserving the relevant information for predicting Y . Several methods are available for evaluating and estimating these information-theoretic terms in the supervised case (see Section 5 for details).
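As a minimal numerical sketch of this decomposition, the snippet below builds a small discrete Markov chain Y → X → Z and checks that I(X;Z) = I(Y;Z) + I(X;Z|Y). The alphabet sizes, the random conditional tables, and all function names are illustrative assumptions, not part of any cited method.

```python
import numpy as np

def mutual_information(p_ab):
    """I(A;B) in nats for a joint probability table p_ab[a, b]."""
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    mask = p_ab > 0
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

def conditional_mutual_information(p_abc):
    """I(A;B|C) in nats for a joint table p_abc[a, b, c]."""
    total = 0.0
    for c in range(p_abc.shape[2]):
        p_c = p_abc[:, :, c].sum()
        if p_c > 0:
            total += p_c * mutual_information(p_abc[:, :, c] / p_c)
    return total

# Hypothetical toy problem in which Y -> X -> Z is a Markov chain.
rng = np.random.default_rng(0)
p_y = np.array([0.4, 0.6])
p_x_given_y = rng.dirichlet(np.ones(4), size=2)  # shape (|Y|, |X|)
p_z_given_x = rng.dirichlet(np.ones(3), size=4)  # shape (|X|, |Z|), the encoder

# Joint p(x, z, y) = p(y) p(x|y) p(z|x), indexed [x, z, y].
p_xzy = np.einsum("y,yx,xz->xzy", p_y, p_x_given_y, p_z_given_x)

i_xz = mutual_information(p_xzy.sum(axis=2))    # I(X;Z)
i_yz = mutual_information(p_xzy.sum(axis=0).T)  # I(Y;Z), table indexed [y, z]
i_xz_given_y = conditional_mutual_information(p_xzy)  # superfluous information

print(i_xz, i_yz + i_xz_given_y)
```

The two printed values agree up to floating-point error because Z depends on Y only through X, so I(Y;Z|X) = 0.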
The Information Bottleneck Theory of Deep Learning
The IB hypothesis for deep learning proposes two distinct phases of training neural networks (Shwartz-Ziv and Tishby, 2017): the fitting and compression phases. The fitting phase involves extracting information from the input and converting it into learned representations, characterized by increased mutual information between inputs and hidden representations. Conversely, the compression phase, which is much longer, concentrates on discarding unnecessary information for target prediction, decreasing mutual information between learned representations and inputs. In contrast, the mutual information between representations and targets increases. For more information, see Geiger (2020). Despite the elegance and plausibility of the IB hypothesis, empirically investigating it remains challenging (Amjad and Geiger, 2018).
The study of representation compression in Deep Neural Networks (DNNs) for supervised learning has shown inconsistent results. For instance, Chelombiev et al. (2019) discovered a positive correlation between generalization accuracy and the compression level of the network's final layer. Shwartz-Ziv et al. (2018) also examined the relationship between generalization and compression, demonstrating that generalization error exponentially depends on mutual information, I ( X ; Z ). Furthermore, Achille et al. (2017) established that flat minima, known for their improved generalization properties, constrain the mutual information. However, Saxe et al. (2019) showed that compression was not necessary for generalization in deep linear networks. Basirat et al. (2021) revealed that the decrease in mutual information is essentially equivalent to geometrical compression. Other studies have found that the mutual information between training inputs and inferred parameters provides a concise bound on the generalization gap (Pensia et al., 2018; Xu and Raginsky, 2017). Lastly, Achille and Soatto (2018) explored using an information bottleneck objective on network parameters to prevent overfitting and promote invariant representations.
Multiview IB Learning
The IB principle offers a rigorous method for learning encoders and decoders in supervised single-view problems. However, it is not directly applicable to multiview learning problems, as it assumes only one information source as the input. A common solution is to concatenate multiple views, though this neglects the unique characteristics of each view. To address this issue, Xu et al. (2014) introduced the large-margin multiview IB (LMIB) as an extension of the original IB problem. LMIB employs a communication system where multiple senders represent various views of examples. The system extracts specific components from different senders by compressing examples through a 'bottleneck,' and the linear projectors for each view are combined to create a shared representation. The large-margin principle replaces the maximization of mutual information in prediction, emphasizing the separation of samples from different classes. Limiting Rademacher complexity improves the solution's accuracy and generalization error bounds. Moreover, the algorithm's robustness is enhanced when accurate views counterbalance noisy views.
However, the LMIB method has a significant limitation: it utilizes linear projections for each view, which can restrict the combined representation when the relationship between different views is complex. To overcome this limitation, Wang et al. (2019) proposed using
deep neural networks to replace linear projectors. Their model first extracts concise latent representations from each view using deep networks and then learns the joint representation of all views using neural networks. They minimize the objective:
$$
\mathcal{L} = \alpha I(Z_1; X_1) + \beta I(Z_2; X_2) - I(Z_{1,2}; Y)
$$
Here, α and β are trade-off parameters, Z 1 and Z 2 are the two neural networks' representations, and Z 1 , 2 is the joint embedding of Z 1 and Z 2 . The first two terms decrease the mutual information between a view's latent representation and its original data representation, resulting in a simpler and more generalizable model. The final term forces the joint representation to maximize the discrimination ability for the downstream task.
Semi-Supervised IB Learning: Leveraging Unlabeled Data
Obtaining labeled data can be challenging or expensive in many practical scenarios, while many unlabeled samples may be readily available. Semi-supervised learning addresses this issue by leveraging the vast amount of unlabeled data during training in conjunction with a small set of labeled samples. Common strategies to achieve this involve adding regularization terms or adopting mechanisms that promote better generalization. Berthelot et al. (2019) grouped regularization methods into three primary categories: entropy minimization, consistency regularization, and generic regularization.
Voloshynovskiy et al. (2020) introduced an information-theoretic framework for semi-supervised learning based on the IB principle. In this context, the semi-supervised classification problem involves encoding input X into the latent space Z while preserving only class-relevant information. A supervised classifier can achieve this if there is sufficient labeled data. However, when the number of labeled examples is limited, the standard label classifier p ( y | z ) becomes unreliable and requires regularization.
To tackle this issue, the authors assumed a prior on the class label distribution p ( y ). They introduced a term to minimize the KL divergence ( D KL ) between the assumed marginal prior and the empirical marginal prior, effectively regularizing the conditional label classifier with the labels' marginal distribution. This approach reduces the classifier's sensitivity to the scarcity of labeled examples. They proposed two variational IB semi-supervised extensions for the priors:
Handcrafted Priors : These priors are predefined for regularization and can be based on domain knowledge or statistical properties of the data. Alternatively, they can be learned using other networks. Handcrafted priors in this context are similar to priors used in the Variational Information Bottleneck (VIB) formalism (Alemi et al., 2016; Wang et al., 2019).
Learnable Priors : Voloshynovskiy et al. (2020) also suggest using learnable priors as an alternative to handcrafted regularization priors on the latent representation. This method involves regularizing Z through another IB-based regularization with two components: (i) latent space regularization and (ii) observation space regularization. In this case, an additional hidden variable M is introduced after the representation to regulate the information flow between Z and Y . An auto-encoder q ( m | z ) is employed, and the optimization process aims to compress the information flowing from Z to M while retaining only label-relevant
information. The IB objective is defined as:
$$
$$
Here, β and β y are hyperparameters that balance the trade-off between the relevance of M to the labels and the compression of Z into M .
Furthermore, Voloshynovskiy et al. (2020) demonstrated that various popular semi-supervised methods can be considered special cases of the optimization problem described above. Notably, the semi-supervised AAE (Makhzani et al., 2015), CatGAN (Springenberg, 2015), SeGMA (Smieja et al., 2019), and VAE (Kingma et al., 2014) can all be viewed as specific instantiations of this framework.
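The label-prior regularizer described in this subsection can be sketched in a few lines. The function below computes the KL divergence between the batch-averaged predicted label distribution and an assumed class prior. The direction of the divergence, the function name, and the toy probabilities are assumptions made for illustration; they are not taken from Voloshynovskiy et al. (2020).

```python
import numpy as np

def label_prior_kl(class_probs, prior, eps=1e-12):
    """KL(empirical marginal of predicted labels || assumed prior), in nats.

    class_probs: array of shape (batch, classes) holding p(y | z) per example.
    prior:       array of shape (classes,) holding the assumed p(y).
    """
    marginal = class_probs.mean(axis=0)
    return float(np.sum(marginal * (np.log(marginal + eps) - np.log(prior + eps))))

prior = np.array([0.5, 0.5])                      # assumed uniform class prior
balanced = np.array([[0.9, 0.1], [0.1, 0.9]])     # predictions cover both classes
skewed = np.array([[0.9, 0.1], [0.8, 0.2]])       # predictions favor class 0

print(label_prior_kl(balanced, prior))  # close to zero
print(label_prior_kl(skewed, prior))    # clearly positive
```

A penalty of this kind can be evaluated on unlabeled batches, which is what makes it usable when labeled examples are scarce.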
Unsupervised IB learning
In the unsupervised setting, data samples are not directly labeled by classes. Voloshynovskiy et al. (2020) defined unsupervised IB as a 'compressed' parameterized mapping of X to Z , which preserves some information in Z about X through the reverse decoder ¯ X = Q ( X | Z ). Therefore, the Lagrangian of unsupervised IB can be defined as follows:
$$
\mathcal{L} = I(X; Z) - \beta I(Z; \bar{X})
$$
where I ( X ; Z ) is the information determined by the encoder q ( z | x ) and I ( Z ; ¯ X ) is the information determined by the decoder q ( x | z ), i.e., the reconstruction error. In other words, unsupervised IB is a special case of supervised IB, where labels are replaced with the reconstruction performance of the training input. Alemi et al. (2016) showed that the Variational Autoencoder (VAE) (Kingma and Welling, 2019) and β -VAE (Higgins et al., 2017) are special cases of unsupervised variational IB. Voloshynovskiy et al. (2020) extended their results and showed that many models, including adversarial autoencoders (Makhzani et al., 2015), InfoVAEs (Zhao et al., 2017c), and VAE/GANs (Larsen et al., 2016), could be viewed as special cases of unsupervised IB. The main difference between them lies in the bounds used for the different mutual information terms of the IB. Furthermore, unsupervised IB was used by Uğur et al. (2020) to derive lower bounds for their unsupervised generative clustering framework, while Roy et al. (2018) used it to study vector-quantized autoencoders.
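To make the connection to β-VAE concrete, the following is a minimal sketch of a β-VAE-style loss: a reconstruction term standing in for the decoder information and a KL term that upper-bounds the encoder information I(X;Z). It assumes a diagonal-Gaussian encoder, a standard normal prior, and a unit-variance Gaussian decoder; these choices and all names are illustrative. Note that β-VAE places its weight on the rate term rather than on the reconstruction term.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) per example, in nats."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)

def beta_vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Batch mean of reconstruction error plus beta times the rate term."""
    # Negative log-likelihood of a unit-variance Gaussian decoder, up to a constant.
    recon = 0.5 * np.sum((x - x_recon) ** 2, axis=1)
    # Variational upper bound on I(X;Z) under the standard normal prior.
    rate = gaussian_kl_to_standard_normal(mu, log_var)
    return float(np.mean(recon + beta * rate))

x = np.ones((3, 5))
mu = np.ones((3, 2))
log_var = np.zeros((3, 2))
print(beta_vae_loss(x, x, mu, log_var, beta=4.0))  # perfect reconstruction, rate of 1 nat
```

Raising `beta` penalizes the rate term more strongly and therefore pushes toward a more compressed latent code at the cost of reconstruction quality.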
Voloshynovskiy et al. (2020) pointed out that for the classification task in supervised IB, the latent space Z should be a sufficient statistic for Y , whose entropy is much lower than that of X . This results in a highly compressed representation where sequences close in the input space might be close in the latent space, and the less significant features will be compressed. In contrast, in the unsupervised setup, the IB suggests compressing the input to the encoded representation so that each input sequence can be decoded uniquely. In this case, the latent space's entropy should correspond to the input space's entropy, and compression is much more difficult.
Self-Supervised Multiview Information Bottleneck Learning
How can we learn without labels and still achieve good predictive power? Is compression necessary to obtain an optimal representation? This section analyzes and discusses how to achieve optimal representation for self-supervised learning when labels are not available during training. We review recent methods for self-supervised learning and show how they can be integrated into a single framework. We compare their objective functions, implicit assumptions, and theoretical challenges. Finally, we consider the information-theoretic properties of these representations, their optimality, and different ways of learning them.
One approach to enhance deep learning methods is to apply the InfoMax principle in a multiview setting (Linsker, 1988; Wiskott and Sejnowski, 2002). As one of the earliest approaches, Linsker (1988) proposed maximizing information transfer from input data to its latent representation, showing its equivalence to maximizing the determinant of the output covariance under the Gaussian distribution assumption. Becker and Hinton (1992) introduced a representation learning approach based on maximizing an approximation of the mutual information between alternative latent vectors obtained from the same image. The most well-known application is the Independent Component Analysis (ICA) Infomax algorithm (Bell and Sejnowski, 1995), designed to separate independent sources from their linear combinations. The ICA-Infomax algorithm aims to maximize the mutual information between mixtures and source estimates while imposing statistical independence among outputs. The Deep Infomax approach (Hjelm et al., 2018) extends this idea to unsupervised feature learning by maximizing the mutual information between input and output while matching a prior distribution for the representations. Recent work has applied this principle to a self-supervised multiview setting (Bachman et al., 2019; Henaff, 2020; Hjelm et al., 2018; Tian et al., 2020a), wherein these works maximize the mutual information between the views Z 1 and Z 2 using the classifier q ( z 1 | z 2 ), which attempts to predict one representation from the other.
However, Tschannen et al. (2019) demonstrated that the effectiveness of InfoMax models is more attributable to the inductive biases introduced by the architecture and estimators than to the training objectives themselves, as the InfoMax objectives can be trivially maximized using invertible encoders. Moreover, a fundamental issue with the InfoMax principle is that it retains information that is irrelevant to the labels, contradicting the core concept of the IB principle, which advocates compressing the representation to enhance generalizability.
To resolve this problem, Sridharan and Kakade (2008) proposed the multiview IB framework. According to this framework, in the multiview setting without labels, the IB principle of preserving relevant data while compressing irrelevant data requires assumptions regarding the relationship between views and labels. They presented the MultiView assumption, which asserts that either view (approximately) would be sufficient for downstream tasks. By this assumption, they define the relevant information as the shared information between the views. Therefore, augmentations (such as changing the image style) should not affect the labels.
Additionally, the views will provide most of the information in the input regarding downstream tasks. We improve generalization without affecting performance by compressing the information not shared between the two views. Their formulation is as follows:
Assumption 1 The MultiView Assumption: There exists a ϵ info (which is assumed to be small) such that
$$
I(Y; X_1 \mid X_2) \leq \epsilon_{info}
$$
$$
I(Y; X_2 \mid X_1) \leq \epsilon_{info}
$$
As a result, when the information sharing parameter, ϵ info , is small, the information shared between views includes task-relevant details. For instance, in self-supervised contrastive learning for visual data (Hjelm et al., 2018), views represent various augmentations of the same image. In this scenario, the MultiView assumption is considered mild if the downstream task remains unaffected by the augmentation (Geiping et al., 2022). Image augmentations can be perceived as altering an image's style without changing its content. Thus, Tsai et al. (2020) contend that the information required for downstream tasks should be preserved in the content rather than the style. This assumption allows us to separate the information into relevant (shared information) and irrelevant (not shared) components and to compress only the unimportant details that do not contain information about downstream tasks. Based on this assumption, we aim to maximize the relevant information I ( X 2 ; Z 1 ) and minimize I ( X 1 ; Z 1 | X 2 ), the exclusive information that Z 1 contains about X 1 , which cannot be predicted by observing X 2 . This irrelevant information is unnecessary for the prediction task and can be discarded. In the extreme case, where X 1 and X 2 share only label information, this approach recovers the supervised IB method without labels. Conversely, if X 1 and X 2 are identical, this method collapses into the InfoMax principle, as no information can be accurately discarded.
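A small discrete example may help to see what the MultiView assumption measures. The sketch below computes the smallest ϵ info for which both conditional mutual informations, I(Y; X1 | X2) and I(Y; X2 | X1), stay below it. The toy joint distribution (a fair binary label, a clean first view, and a second view that flips the label with some probability) and all function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability array (any shape, flattened)."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def conditional_mi(p_abc):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C) for a table p_abc[a, b, c]."""
    return (entropy(p_abc.sum(axis=1).ravel()) + entropy(p_abc.sum(axis=0).ravel())
            - entropy(p_abc.ravel()) - entropy(p_abc.sum(axis=(0, 1))))

def epsilon_info(p_y_x1_x2):
    """Smallest eps with I(Y;X1|X2) <= eps and I(Y;X2|X1) <= eps."""
    i_y_x1_given_x2 = conditional_mi(p_y_x1_x2)                     # axes (Y, X1, X2)
    i_y_x2_given_x1 = conditional_mi(p_y_x1_x2.transpose(0, 2, 1))  # axes (Y, X2, X1)
    return max(i_y_x1_given_x2, i_y_x2_given_x1)

def two_view_joint(flip):
    """Toy joint p(y, x1, x2): Y is a fair bit, X1 = Y, X2 = Y flipped w.p. `flip`."""
    p = np.zeros((2, 2, 2))
    for y in (0, 1):
        p[y, y, y] = 0.5 * (1 - flip)
        p[y, y, 1 - y] = 0.5 * flip
    return p

print(epsilon_info(two_view_joint(0.0)))  # views agree: assumption holds exactly
print(epsilon_info(two_view_joint(0.1)))  # noisy second view: eps equals the binary entropy of 0.1
```

In the noisy case the first view carries label information that the second view cannot supply, so compressing everything that is not shared would discard task-relevant information.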
Federici et al. (2020) used the relaxed Lagrangian objective to obtain the minimal sufficient representation Z 1 for X 2 as:
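$$
\mathcal{L}_1 = I(Z_1; X_1 \mid X_2) - \beta_1 I(X_2; Z_1)
$$
$$
\mathcal{L}_2 = I(Z_2; X_2 \mid X_1) - \beta_2 I(X_1; Z_2)
$$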
where β 1 and β 2 are the Lagrangian multipliers introduced by the constrained optimization. By defining Z 1 and Z 2 on the same domain and re-parameterizing the Lagrangian multipliers, the average of the two loss functions can be upper bounded as:
$$
\mathcal{L} = -I(Z_1; Z_2) + \beta \, D_{SKL}\left[ p(z_1 \mid x_1) \,\|\, p(z_2 \mid x_2) \right]
$$
where D SKL represents the symmetrized KL divergence obtained by averaging the expected value of D KL ( p ( z 1 | x 1 ) || p ( z 2 | x 2 )) and D KL ( p ( z 2 | x 2 ) || p ( z 1 | x 1 )). Note that when the mapping from X 1 to Z 1 is deterministic, I ( Z 1 ; X 1 | X 2 ) minimization and H ( Z 1 | X 2 ) minimization are interchangeable and the algorithms of Federici et al. (2020) and Tsai et al.
(2020) minimize the same objective. Another implementation of the same idea is based on the Conditional Entropy Bottleneck (CEB) algorithm (Fischer, 2020) and proposed by Lee et al. (2021b). This algorithm adds the residual information as a compression term to the InfoMax objective using the reverse decoders q ( z 1 | x 2 ) and q ( z 2 | x 1 ).
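The symmetrized KL term has a closed form when both encoders output diagonal Gaussians. The sketch below assumes that parameterization, which is a common choice but an assumption here rather than a detail stated in the text; function names and the toy inputs are illustrative.

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, diag(var_p)) || N(mu_q, diag(var_q))) per example, in nats."""
    return 0.5 * np.sum(
        np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0, axis=-1
    )

def symmetrized_kl(mu1, var1, mu2, var2):
    """Average of the two KL directions, averaged over the batch."""
    forward = kl_diag_gaussians(mu1, var1, mu2, var2)
    backward = kl_diag_gaussians(mu2, var2, mu1, var1)
    return float(np.mean(0.5 * (forward + backward)))

# Posterior parameters for one pair of views (hypothetical values).
mu1, var1 = np.array([[0.0, 0.0]]), np.ones((1, 2))
mu2, var2 = np.array([[1.0, 0.0]]), np.ones((1, 2))
print(symmetrized_kl(mu1, var1, mu2, var2))
```

The term vanishes only when the two posteriors coincide, which is how it discourages each representation from encoding information that the other view cannot predict.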
In conclusion, all the algorithms mentioned above are based on the Multiview assumption. Utilizing this assumption, they can distinguish relevant information from irrelevant information. As a result, all these algorithms aim to maximize the information (or the predictive ability) of one representation with respect to the other view while compressing the information between each representation and its corresponding view. The key differences between these algorithms lie in the decomposition and implementation of these information terms.
Dubois et al. (2021) offer another theoretical analysis of the IB for self-supervised learning. Their work addresses the question of the minimum bit rate required to store the input but still achieve high performance on a family of downstream tasks Y ∈ Y . It is a rate-distortion problem, where the goal is to find a compressed representation that will give us a good prediction for every task. We require that the distortion measure is bounded:
$$
$$
Accessing the downstream task is necessary to find the solution during the learning process. As a result, Dubois et al. (2021) considered only tasks invariant to some equivalence relation, which divides the input into disjoint equivalence classes. An example would be an image with labels that remain unchanged after augmentation. This is similar to the Multiview assumption where ϵ info → 0. By applying Shannon's rate-distortion theory, they concluded that the minimum achievable bit rate is the rate-distortion function with the above invariance distortion. Thus, the optimal rate can be determined by minimizing the following Lagrangian:
$$
$$
Using this objective, the maximization of information with labels is replaced by maximizing the prediction ability of one view from the original input, regularized by direct information from the input. Similarly to the above results, we would like to find a representation Z 1 that compresses the input X 1 so that Z 1 has the maximum information about X 2 .
Implicit Compression in Self-Supervised Learning Methods
While the optimal IB representation is based on the Multiview assumption, most self-supervised learning models only use the InfoMax principle and maximize the mutual information I ( Z 1 ; Z 2 ) without an explicit regularization term. However, recent studies have shown that contrastive learning creates compressed representations that include only relevant information (Tian et al., 2020b; Wang et al., 2022). The question is, why is the learned representation compressed? The maximization of I ( Z 1 ; Z 2 ) could theoretically be sufficient
to retain all the information from both X 1 and X 2 by making the representations invertible. In this section, we attempt to explain this phenomenon.
We begin with the InfoMax principle (Linsker, 1988), which maximizes the mutual information between the representations Z 1 and Z 2 of the two views. We can lower-bound it using:
$$
I(Z_1; Z_2) = H(Z_1) - H(Z_1 \mid Z_2) \geq H(Z_1) + \mathbb{E}\left[\log q(z_1 \mid z_2)\right] \tag{7}
$$
The bound is tight when q ( z 1 | z 2 ) = p ( z 1 | z 2 ), in which case the expectation term equals the negative conditional entropy − H ( Z 1 | Z 2 ). This second term of eq. (7) can be considered a negative reconstruction error or distortion between Z 1 and Z 2 .
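The tightness condition can be checked in closed form for a pair of unit-variance Gaussian representations with correlation ρ, where I(Z1;Z2) = −½ log(1 − ρ²). The sketch below evaluates the bound H(Z1) + E[log q(Z1|Z2)] for a Gaussian predictor q(z1|z2) = N(a·z2, s²). The Gaussian setting and the parameter names `a` and `s2` are assumptions chosen so that every term is analytic.

```python
import numpy as np

def true_mi(rho):
    """Mutual information (nats) of a bivariate Gaussian with correlation rho."""
    return -0.5 * np.log(1.0 - rho**2)

def variational_bound(rho, a, s2):
    """H(Z1) + E[log q(Z1|Z2)] with q(z1|z2) = N(a * z2, s2)."""
    h_z1 = 0.5 * np.log(2.0 * np.pi * np.e)        # entropy of a unit-variance Gaussian
    expected_sq_err = 1.0 - 2.0 * a * rho + a**2   # E[(Z1 - a Z2)^2]
    expected_log_q = -0.5 * np.log(2.0 * np.pi * s2) - expected_sq_err / (2.0 * s2)
    return h_z1 + expected_log_q

rho = 0.8
print(true_mi(rho))
print(variational_bound(rho, a=rho, s2=1.0 - rho**2))  # q equals the true conditional
print(variational_bound(rho, a=0.5, s2=0.5))           # mismatched q gives a looser bound
```

With `a = rho` and `s2 = 1 - rho**2` the predictor is the true conditional p(z1|z2) and the bound matches the mutual information; any other choice falls below it.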
In the supervised case, where Z is a learned stochastic representation of the input and Y is the label, we aim to optimize
$$
I(Y; Z) = H(Y) - H(Y \mid Z) \geq H(Y) + \mathbb{E}\left[\log q(Y \mid Z)\right] \tag{8}
$$
Since H ( Y ) is constant, optimizing the information I ( Z ; Y ) requires only maximizing the prediction term E [log q ( Y | Z )], that is, minimizing the log loss, by making Z more informative about Y . This loss is the cross-entropy loss for classification or the square loss for regression. Thus, we can minimize the log loss without any other regularization on the representation.
In contrast, for the self-supervised case, we have a more straightforward option to minimize H ( Z 1 | Z 2 ): Making Z 1 easier to predict by Z 2 , which can be achieved by reducing its variance along specific dimensions. If we do not regularize H ( Z 1 ), it will decrease to zero, and we will observe a collapse. This is why, in contrastive methods, the variance of the representation (large entropy) is significant only in the directions with a high variance in the data, which is enforced by data augmentation (Jing et al., 2021). According to this analysis, the network benefits from making the representations 'simple' (easier to predict). Hence, even though our representation does not have explicit information-theoretical constraints, the learning process will compress the representation.
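The shortcut described above can be illustrated numerically: shrinking both representations makes one trivially easier to predict from the other while destroying their entropy. The sketch below uses synthetic embeddings and a Gaussian fit to estimate differential entropy (which, unlike discrete entropy, can become negative). The data, the scale factors, and the function names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
shared = rng.normal(size=(1000, 8))                 # content common to both views
z1 = shared + 0.1 * rng.normal(size=shared.shape)   # representation of view 1
z2 = shared + 0.1 * rng.normal(size=shared.shape)   # representation of view 2

def prediction_error(a, b):
    """Mean squared distance between paired representations."""
    return float(np.mean(np.sum((a - b) ** 2, axis=1)))

def gaussian_entropy(z):
    """Differential entropy (nats) of a Gaussian fitted to the rows of z."""
    d = z.shape[1]
    cov = np.cov(z, rowvar=False)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

for scale in (1.0, 0.1, 0.01):
    print(scale, prediction_error(scale * z1, scale * z2), gaussian_entropy(scale * z1))
```

Each tenfold reduction in scale divides the prediction error by 100 and lowers the fitted entropy by 8·log(10) nats, so an objective that only rewards predictability is minimized by collapsing the representation unless the entropy term is kept in play.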
Beyond the Multiview Assumption
According to the Multiview IB analysis presented in Section 4, the optimal way to create a useful representation is to maximize the mutual information between the representations of different views while compressing irrelevant information in each representation. In fact, as discussed in Section 4.1, we can achieve this optimal compressed representation even without explicit regularization. However, this optimality is based on the Multiview assumption, which states that the relevant information for downstream tasks comes from the information shared between views. Therefore, Tian et al. (2020b) concluded that when a minimal sufficient representation has been obtained, the optimal views for self-supervised learning are determined by downstream tasks.
However, the Multiview assumption is highly constrained, as all relevant information must be shared between all views. In cases where this assumption is incorrect, such as with
aggressive data augmentation or multiple downstream tasks or modalities, sharing all the necessary information can be challenging. For example, if one view is a video stream while the other is an audio stream, the shared information may be sufficient for object recognition but not for tracking. Furthermore, relevant information for downstream tasks may not be contained within the shared information between views, meaning that removing non-shared information can negatively impact performance.
Kahana and Hoshen (2022) identified a series of tasks that violate the Multiview assumption. To accomplish these tasks, such as bias removal and cross-domain retrieval, the learned representation must also be invariant to unwanted attributes. In such cases, only some attributes have labels, and the objective is to learn a representation that is invariant to the attribute for which labels are provided while remaining informative about all other attributes without labels. For example, for face images, only the identity labels may be provided, and the goal is to learn a representation that captures the unlabeled pose attribute but contains no information about the identity attribute. The task can also be applied to fair decisions, cross-domain matching, model anonymization, and image translation.
Wang et al. (2022) formalized another case where the Multiview assumption does not hold: when non-shared task-relevant information cannot be ignored. In such cases, the minimal sufficient representation contains less task-relevant information than other sufficient representations, resulting in inferior performance. Furthermore, their analysis shows that in such cases, the representation learned by contrastive learning is insufficient for downstream tasks and may overfit to the shared information.
As a result of their analysis, Wang et al. (2022) and Kahana and Hoshen (2022) proposed explicitly increasing mutual information between the representation and input to preserve task-relevant information and prevent the compression of unshared information between views. In this case, the two regularization terms of the two views are incorporated into the original InfoMax objective, and the following objective is optimized:
$$
\max \; I(Z_1; Z_2) + \sum_{i=1}^{2} \beta_i I(X_i; Z_i) \tag{9}
$$
Wang et al. (2022) demonstrated the effectiveness of their method for SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), and Barlow Twins (Zbontar et al., 2021) across classification, detection, and segmentation tasks.
To Compress or Not to Compress?
As seen in Eq. 9, when the Multiview assumption is violated, the objective for obtaining an optimal representation is to maximize the mutual information between each input and its representation. This contrasts with the situation in which the Multiview assumption holds, or the supervised case, where the objective is to minimize the mutual information between the representation and the input. In both supervised and unsupervised cases, we have direct access to the relevant information, which we can use to separate and compress irrelevant information. However, in the self-supervised case, we depend heavily on the Multiview assumption. If this assumption is violated due to unshared information between views that
is relevant for the downstream task, we cannot separate relevant and irrelevant information. Furthermore, the learning algorithm's nature requires that this information be protected by explicitly maximizing it.
As datasets continue to expand in size and models are anticipated to serve as base models for various downstream tasks, the Multiview assumption becomes less pertinent. Consequently, compressing irrelevant information when the Multiview assumption does not hold presents one of the most significant challenges in self-supervised learning. Identifying new methods to separate relevant from irrelevant information based on alternative assumptions is a promising avenue for research. It is also essential to recognize that empirical measurement of information-theoretic quantities and their estimators plays a crucial role in developing and evaluating such methods.
Optimizing Information in Deep Neural Networks: Challenges and Approaches
Recent years have seen information-theoretic analyses employed to explain and optimize deep learning techniques (Shwartz-Ziv and Tishby, 2017). Despite their elegance and plausibility, empirically measuring and analyzing information in deep networks presents challenges. Two critical problems are (1) information in deterministic networks and (2) estimating information in high-dimensional spaces.
Information in Deterministic Networks
Information-theoretic methods have significantly impacted deep learning (Alemi et al., 2016; Shwartz-Ziv and Tishby, 2017; Steinke and Zakynthinou, 2020). However, a key challenge is addressing the source of randomness in deterministic DNNs.
In deterministic networks with continuous inputs, the mutual information between the input and the representation is infinite, leading to ill-posed optimization problems or piecewise constant outcomes (Amjad and Geiger, 2019; Goldfeld et al., 2018). To tackle this issue, researchers have proposed various solutions. One common approach is to discretize the input distribution and real-valued hidden representations by binning, which facilitates non-trivial measurements and prevents the mutual information from always taking the maximum value of the log of the dataset size, thus avoiding ill-posed optimization problems (Shwartz-Ziv and Tishby, 2017).
However, binning and discretization are essentially equivalent to geometrical compression and serve as clustering measures (Goldfeld et al., 2018). Moreover, this discretization depends on the chosen bin size and does not track the mutual information across varying bin sizes (Goldfeld et al., 2018; Ross, 2014). To address these limitations, researchers have proposed alternative approaches such as interpreting binned information as a weight decay penalty (Elad et al., 2019b), estimating mutual information based on lower bounds assuming a continuous input distribution without making assumptions about the network's output distribution properties (Shwartz-Ziv et al., 2022a; Wang and Isola, 2020; Zimmermann et al., 2021), injecting additive noise, and considering data augmentation as the source of noise (Dubois et al., 2021; Goldfeld et al., 2018; Lee et al., 2021b; Shwartz-Ziv and Tishby, 2017).
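The bin-size dependence described above can be seen in a small experiment. The following is a minimal sketch, not the procedure of any cited paper: the function name `binned_mutual_information`, the toy `tanh` layer, and the chosen bin counts are all illustrative assumptions. It applies a plug-in estimator to a deterministic map and shows that the estimate is driven by the bin count and is capped by the log of the dataset size.

```python
import math
import random
from collections import Counter


def binned_mutual_information(xs, ts, n_bins):
    """Plug-in MI (in nats) between two scalar samples after equal-width binning."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0  # guard against a constant sample
        return [min(int((v - lo) / width), n_bins - 1) for v in values]

    bx, bt = to_bins(xs), to_bins(ts)
    n = len(xs)
    joint, px, pt = Counter(zip(bx, bt)), Counter(bx), Counter(bt)
    # sum over occupied cells of p(i,j) * log(p(i,j) / (p(i) p(j)))
    return sum(
        (c / n) * math.log(c * n / (px[i] * pt[j]))
        for (i, j), c in joint.items()
    )


random.seed(0)
xs = [random.random() for _ in range(256)]
# Hypothetical deterministic, invertible "layer": T = tanh(3X - 1.5).
ts = [math.tanh(3.0 * x - 1.5) for x in xs]

bin_counts = (2, 8, 64, 10**7)
mi_by_bins = {b: binned_mutual_information(xs, ts, b) for b in bin_counts}
upper_bound = math.log(len(xs))  # the estimate can never exceed log(dataset size)

for b in bin_counts:
    print(f"bins={b:>8}  MI={mi_by_bins[b]:.3f} nats  (cap {upper_bound:.3f})")
```

Because the map is deterministic, the measured value reflects the discretization rather than a property of the network: with very fine bins almost every sample occupies its own cell and the estimate approaches the cap.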
Measuring Information in High-Dimensional Spaces
Estimating mutual information in high-dimensional spaces presents a significant challenge when applying information-theoretic measures to real-world data. This problem has been extensively studied (Gao et al., 2015; Paninski, 2003), revealing the inefficiency of solutions for large dimensions and the limited scalability of known approximations with respect to sample size and dimension. Despite these difficulties, various entropy and mutual information estimation approaches have been developed, including classic methods like k-nearest neighbors (KNN) (Kozachenko and Leonenko, 1987) and kernel density estimation techniques (Hang et al., 2018), as well as more recent efficient methods.
Chelombiev et al. (2019) developed adaptive mutual information estimators based on equal-entropy bins and a scaled-noise kernel density estimator. Generative decoder networks, such as PixelCNN++ (Van den Oord et al., 2016), have been employed to estimate a lower bound on mutual information (Darlow and Storkey, 2020; Nash et al., 2018; Shwartz-Ziv et al., 2023). Another strategy is the ensemble dependency graph estimator (EDGE), an adaptive mutual information estimation method that merges randomized locality-sensitive hashing (LSH), dependency graphs, and ensemble bias reduction techniques (Noshad and Hero III, 2018). The Mutual Information Neural Estimator (MINE) (Belghazi et al., 2018a) maximizes a lower bound on the KL divergence obtained from the dual representation of Donsker and Varadhan (1975) and has been employed for direct mutual information estimation (Elad et al., 2019a). Shwartz-Ziv and Alemi (2020) developed a controlled framework that uses neural tangent kernels (Jacot et al., 2018) to obtain tractable information measures.
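For reference, the Donsker-Varadhan dual representation on which MINE builds can be written as follows. The critic $T_{\theta}$ (a neural network with parameters $\theta$) and the choice of $X$ and $Z$ as the two variables are our notation, introduced here only for illustration:

$$
I(X; Z) = D_{KL}\left(P_{XZ} \,\|\, P_X \otimes P_Z\right) \geq \sup_{\theta} \; \mathbb{E}_{P_{XZ}}\left[T_{\theta}(x, z)\right] - \log \mathbb{E}_{P_X \otimes P_Z}\left[e^{T_{\theta}(x, z)}\right]
$$

The first expectation is taken over paired samples and the second over samples drawn independently from the two marginals, so the bound can be estimated from mini-batches by shuffling one of the two variables.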
Mutual information estimates can be improved by using larger batch sizes, although this may hurt generalization performance and increases memory requirements. Alternatively, researchers have suggested employing surrogate measures for mutual information, such as log-determinant mutual information (LDMI), which is based on second-order statistics and reflects linear dependence (Erdogan, 2022; Ozsoy et al., 2022). Goldfeld and Greenewald (2021) proposed the Sliced Mutual Information (SMI), defined as an average of MI terms between one-dimensional projections of high-dimensional variables. SMI inherits many properties of its classic counterpart. It can be estimated with optimal parametric error rates in all dimensions by combining an MI estimator between scalar variables with a Monte Carlo (MC) integrator (Goldfeld and Greenewald, 2021). The $k$-SMI, introduced by Goldfeld et al. (2022), extends the SMI by projecting to $k$-dimensional subspaces, which relaxes the smoothness assumptions, improves scalability, and enhances performance.
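The SMI recipe above (a scalar MI estimator combined with a Monte Carlo average over random projection directions) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the estimator analyzed by Goldfeld and Greenewald (2021): the scalar estimator is a simple histogram plug-in, and the names `histogram_mi`, `sliced_mi`, `n_bins`, and `n_slices` are hypothetical.

```python
import math
import random
from collections import Counter


def histogram_mi(a, b, n_bins=8):
    """Plug-in MI (nats) between two scalar samples using equal-width bins."""
    def to_bins(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0
        return [min(int((v - lo) / width), n_bins - 1) for v in values]

    ba, bb = to_bins(a), to_bins(b)
    n = len(a)
    joint, pa, pb = Counter(zip(ba, bb)), Counter(ba), Counter(bb)
    return sum(
        (c / n) * math.log(c * n / (pa[i] * pb[j]))
        for (i, j), c in joint.items()
    )


def random_unit_vector(dim, rng):
    """Uniform direction on the sphere via a normalized Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def project(points, direction):
    """One-dimensional projection of each point onto `direction`."""
    return [sum(p * d for p, d in zip(point, direction)) for point in points]


def sliced_mi(xs, ys, n_slices=100, rng=None):
    """Monte Carlo average of scalar MI over random pairs of directions."""
    rng = rng or random.Random(0)
    dx, dy = len(xs[0]), len(ys[0])
    total = 0.0
    for _ in range(n_slices):
        theta = random_unit_vector(dx, rng)
        phi = random_unit_vector(dy, rng)
        total += histogram_mi(project(xs, theta), project(ys, phi))
    return total / n_slices


rng = random.Random(1)
xs = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(1000)]
ys_dependent = [[x_i + 0.1 * rng.gauss(0, 1) for x_i in x] for x in xs]
ys_independent = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(1000)]

smi_dependent = sliced_mi(xs, ys_dependent)
smi_independent = sliced_mi(xs, ys_independent)
print(f"dependent: {smi_dependent:.3f} nats, independent: {smi_independent:.3f} nats")
```

The histogram estimator has a small positive bias, so the independent case is close to zero rather than exactly zero. Every term only ever involves two scalar variables, which is what lets the estimate avoid the high-dimensional density estimation problem.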
In conclusion, estimating and optimizing information in deep neural networks presents significant challenges, particularly in deterministic networks and high-dimensional spaces. Researchers have proposed various approaches to address these issues, including discretization, alternative estimators, and surrogate measures. As the field continues to evolve, it is expected that more advanced techniques will emerge to overcome these challenges and facilitate the understanding and optimization of deep learning models.
Future Research Directions
Despite the solid foundation established by existing self-supervised learning methods from an information theory perspective, several potential research directions warrant exploration:
Self-supervised learning with non-shared information. As discussed in Section 4, the separation of relevant (preserved) and irrelevant (compressed) information relies on the MultiView assumption. This assumption, which states that only shared information is essential for downstream tasks, is rather restrictive. For example, situations may arise where each view contains distinct information relevant to a downstream task, or where multiple tasks necessitate different features. Some methods have been proposed to tackle this problem, but they mainly focus on maximizing the network's information without explicit constraints. Formalizing this scenario and exploring how to differentiate between relevant and irrelevant information when the relevant information is not shared represents an intriguing research direction.
Self-supervised learning for tabular data. At present, the internal compression of self-supervised learning methods may compress relevant information due to improper augmentation (Section 4.1). Consequently, we must rely heavily on how the two views are generated, since they must accurately represent the information related to the downstream process. Custom augmentations must be developed for each domain, taking into account extensive prior knowledge on data augmentation. While some papers have attempted to extend self-supervised learning to tabular data (Arik and Pfister, 2021; Ucar et al., 2021), further work is necessary from both theoretical and practical standpoints to achieve high performance with self-supervised learning for tabular data (Shwartz-Ziv and Armon, 2022). The augmentation process is crucial for the performance of current vision and text models. In the case of tabular data, employing information-theoretic loss functions that do not require information compression may help harness the benefits of self-supervised learning.
Integrating other learning methods into the information-theoretic framework.
Prior works have investigated various supervised, unsupervised, semi-supervised, and self-supervised learning methods, demonstrating that they optimize information-theoretic quantities. However, state-of-the-art methods employ additional changes and engineering practices that may be related to information theory, such as the stop gradient operation utilized by many self-supervised learning methods today (Chen and He, 2021; Grill et al., 2020). The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be employed to explain this operation when one path is the E-step and the other is the M-step. Additionally, Elidan and Friedman (2012) proposed an IB-inspired version of EM, which could help develop information-theoretic objectives that use the stop gradient operation.
Expanding the analysis to usable information. While information theory offers a rigorous conceptual framework for describing information, it neglects essential aspects of computation. (Conditional) entropy, for example, is directly related to the predictability of a random variable in a betting game where agents are rewarded for accurate guesses. However, the standard definition assumes that agents have no computational bounds and can employ arbitrarily complex prediction schemes (Cover, 1999). In the context of deep learning, predictive information $H(Y \mid Z)$ measures the amount of information that can be extracted from $Z$ about $Y$ given access to all decoders $p(y \mid z)$ in the world. Recently, Xu et al. (2020) introduced predictive $\mathcal{V}$-information as an alternative formulation based on realistic computational constraints.
Extending self-supervised learning's information-based perspective to energy-based model optimization. Until now, research combining self-supervised learning with information theory has focused on probabilistic models with tractable likelihoods. These models enable direct optimization of model parameters with respect to the tractable log-likelihood (Dinh et al., 2016; Germain et al., 2015; Graves, 2013; Rezende and Mohamed, 2015) or a tractable lower bound of the likelihood (Alemi et al., 2016; Kingma and Welling, 2019). Although models with tractable likelihoods offer certain benefits, their scope is limited and they require a particular format. Energy-based models (EBMs) present a more flexible, unified framework. Rather than specifying a normalized probability, EBMs define inference as minimizing an unnormalized energy function and learning as minimizing a loss function. The energy function does not require integration and can be parameterized with any nonlinear regression function. Inference typically involves finding a low-energy configuration or sampling from all possible configurations such that the probability of selecting a specific configuration follows a Gibbs distribution (Huembeli et al., 2022; Song and Kingma, 2021).
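The two inference modes mentioned above (finding a low-energy configuration, or sampling configurations according to a Gibbs distribution) can be illustrated on a finite set of configurations. This is a toy sketch: the double-well `energy` function, the grid of configurations, and the `temperature` parameter are assumptions chosen for illustration, and real EBMs operate over spaces where the normalizing sum below is intractable.

```python
import math
import random


def gibbs_probabilities(energies, temperature=1.0):
    """Convert unnormalized energies into probabilities p(x) proportional to exp(-E(x)/T)."""
    lowest = min(energies)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - lowest) / temperature) for e in energies]
    partition = sum(weights)
    return [w / partition for w in weights]


def energy(x):
    # Hypothetical energy: a double well with minima at x = -1 and x = +1.
    return (x * x - 1.0) ** 2


configurations = [i / 10.0 for i in range(-20, 21)]
energies = [energy(x) for x in configurations]

# Inference as energy minimization: pick a lowest-energy configuration.
map_configuration = min(configurations, key=energy)

# Inference as sampling: draw configurations from the Gibbs distribution.
probs = gibbs_probabilities(energies, temperature=0.1)
samples = random.Random(0).choices(configurations, weights=probs, k=5)

print("lowest-energy configuration:", map_configuration)
print("Gibbs samples:", samples)
```

Only energy differences matter here: adding a constant to every energy leaves both the minimizer and the sampled distribution unchanged, which is why the energy function never needs to be normalized.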
Investigating energy-based models for self-supervised learning from both theoretical and practical perspectives can open up numerous promising research directions. For instance, we could directly apply tools developed for energy-based models and statistical machines to optimize the model, such as Maximum Likelihood Training with MCMC (Younes, 1999), score matching (Hyvärinen, 2006), denoising score matching (Song et al., 2020; Vincent, 2011), and score-based generative models (Song and Ermon, 2019).
Self-supervised learning with non-shared information.
Self-supervised learning (SSL) is a powerful technique that leverages unlabeled data to learn useful representations. In contrast to supervised learning, which relies on labeled data, SSL employs self-defined signals to establish a proxy objective between the input and the signal. The model is initially trained using this proxy objective and subsequently fine-tuned on the target task. Self-supervised signals, derived from the inherent co-occurrence relationships in the data, serve as self-supervision. Various such signals have been used to learn representations, including generative and joint embedding architectures (Bachman et al., 2019; Bar et al., 2022; Chen et al., 2020a,b).
Two main categories of SSL architectures exist: (1) generative architectures based on reconstruction or prediction and (2) joint embedding architectures (Liu et al., 2021b). Both architecture classes can be trained using either contrastive or non-contrastive methods.
We begin by discussing these two main types of architectures:
- Generative Architecture: Generative architectures employ an objective function that measures the divergence between input data and predicted reconstructions, such as squared error. The architecture reconstructs data from a latent variable or a corrupted version, potentially with a latent variable's assistance. Notable examples of generative architectures include auto-encoders, sparse coding, sparse auto-encoders, and variational auto-encoders (Kingma and Welling, 2013; Lee et al., 2006; Ng et al., 2011). As the reconstruction task lacks a single correct answer, most generative architectures utilize a latent variable, which, when varied, generates multiple reconstructions. The latent variable's information content requires regularization to ensure the system reconstructs regions of high data density while avoiding a collapse by reconstructing the entire space. PCA regularizes the latent variable by limiting its dimensions, while sparse coding and sparse auto-encoders restrict the number of non-zero components. Variational auto-encoders regularize the latent variable by rendering it stochastic and maximizing the entropy of the distribution relative to a prior. Vector quantized variational auto-encoders (VQ-VAE) employ binary stochastic variables to achieve similar results (Van Den Oord et al., 2017).
- Joint Embedding Architectures (JEA): These architectures process multiple views of an input signal through encoders, producing representations of the views. The system is trained to ensure that these representations are both informative and mutually predictable. Examples include Siamese networks, where two identical encoders share weights (Chen et al., 2020a; Chen and He, 2021; Grill et al., 2020; He et al., 2020), and methods permitting encoders to differ (Bardes et al., 2021). A primary challenge with JEA is preventing informational collapse, in which the representations contain minimal information about the inputs, thereby facilitating their mutual prediction. JEA's advantage lies in the encoders' ability to eliminate noisy, unpredictable, or irrelevant information from the input within the representation space.
To effectively train these architectures, it is essential to ensure that the representations of different signals are distinct. This can be achieved through either contrastive or non-contrastive methods:
- Contrastive Methods: Contrastive methods utilize data points from the training set as positive samples and generate points outside the region of high data density as contrastive samples. The energy (e.g., reconstruction error for generative architectures or representation predictive error for JEA) should be low for positive samples and higher for contrastive samples. Various loss functions involving the energies of pairs or sets of samples can be minimized to achieve this objective.
- Non-Contrastive Methods: Non-contrastive methods prevent the energy landscape's collapse by limiting the volume of space that can take low energy, either through architectural constraints or through a regularizer in the energy or training objective. In latent-variable generative architectures, preventing collapse is achieved by limiting or minimizing the information content of the latent variable. In JEA, collapse is prevented by maximizing the information content of the representations.
We now present a few concrete examples of popular models that employ various combinations of generative architectures, joint embedding architectures, contrastive training, and non-contrastive training:
The Denoising Autoencoder approach in generative architectures (Devlin et al., 2018; He et al., 2022; Vincent et al., 2008) uses a triplet loss that combines a positive sample, a vector from the training set that should be reconstructed perfectly, with a contrastive sample consisting of a pair of data vectors, one from the training set and the other a corrupted version of it. In SSL, the combination of JEA models with contrastive learning has proven highly effective. In contrastive learning, the objective is to attract different augmented views of the same image (positive points) while repelling dissimilar augmented views (negative points). Recent self-supervised visual representation learning examples include MoCo (He et al., 2020) and SimCLR (Chen et al., 2020a). The InfoNCE loss is a commonly used objective function in many contrastive learning methods:
$$
\mathbb{E}_{x, x^{+}, x^{k}}\left[-\log\left(\frac{e^{f(x)^{T} f(x^{+})}}{\sum_{k} e^{f(x)^{T} f(x^{k})}}\right)\right]
$$
where $x^{+}$ is a sample similar to $x$, $x^{k}$ are all the samples in the batch, and $f$ is an encoder.
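The InfoNCE objective described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used by MoCo or SimCLR: the embeddings are assumed to be already computed by the encoder, and the function name `info_nce` and the `temperature` parameter are assumptions that do not appear in the text.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchors, positives, temperature=1.0):
    """Average InfoNCE loss over a batch of embeddings.

    positives[i] is the positive for anchors[i]; every other row of
    `positives` acts as a negative for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        m = max(logits)  # log-sum-exp with the max subtracted for stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings f(x): three well-separated unit vectors.
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Each anchor paired with its own embedding as the positive: low loss.
loss_aligned = info_nce(embeddings, embeddings, temperature=0.1)

# Positives rotated by one position, so every anchor is mismatched: high loss.
loss_shuffled = info_nce(embeddings, embeddings[1:] + embeddings[:1], temperature=0.1)

print(f"aligned: {loss_aligned:.5f}, shuffled: {loss_shuffled:.3f}")
```

The denominator sums over every sample in the batch, which is why the quality of the objective depends on the batch size.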
However, contrastive methods heavily depend on all other samples in the batch and require a large batch size. Additionally, recent studies (Jing et al., 2021) have shown that contrastive learning can lead to dimensional collapse, where the embedding vectors span a lower-dimensional subspace instead of the entire embedding space. Although positive and negative pairs should repel each other to prevent dimensional collapse, augmentation along feature dimensions and implicit regularization cause the embedding vectors to fall into a lower-dimensional subspace, resulting in low-rank solutions.
To address these problems, recent works have introduced JEA models with non-contrastive methods. Unlike contrastive methods, these methods employ regularization to prevent the collapse of the representation and do not explicitly rely on negative samples. For example, several papers use stop-gradients and extra predictors to avoid collapse (Chen and He, 2021; Grill et al., 2020), while Caron et al. (2020) employed an additional clustering step. VICReg (Bardes et al., 2021) is another non-contrastive method that regularizes the covariance matrix of the representation. Consider two embedding batches $Z = [f(x_1), \ldots, f(x_N)]$ and $Z' = [f(x'_1), \ldots, f(x'_N)]$, each of size $(N \times K)$. Denote by $C$ the $(K \times K)$ covariance matrix obtained from $[Z, Z']$. The VICReg triplet loss is defined by:
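$$
\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \left( \alpha \max\left(0, \gamma - \sqrt{C_{k,k} + \epsilon}\right) + \beta \sum_{k' \neq k} \left(C_{k,k'}\right)^{2} \right) + \gamma \left\| Z - Z' \right\|_{F}^{2} / N
$$
The three terms penalize low per-dimension variance, non-zero off-diagonal covariance, and disagreement between the embeddings of the two views, respectively.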
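A VICReg-style objective has three ingredients: a hinge on the per-dimension standard deviation, a penalty on off-diagonal covariance, and an invariance term between the two batches. The sketch below illustrates them in plain Python. It is not the reference implementation of Bardes et al. (2021): following the text, the covariance is computed on the concatenation of the two batches, and the names `vicreg_loss`, `alpha`, `beta`, `lam`, `target_std`, and `eps`, as well as their default values, are illustrative assumptions.

```python
import math


def covariance_matrix(rows):
    """Unbiased (K x K) covariance of the columns of an (M x K) list of rows."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    centered = [[r[j] - means[j] for j in range(k)] for r in rows]
    return [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(k)]
        for a in range(k)
    ]


def vicreg_loss(z, z_prime, alpha=25.0, beta=1.0, lam=25.0, target_std=1.0, eps=1e-4):
    """Variance + covariance + invariance terms for two (N x K) embedding batches."""
    n, k = len(z), len(z[0])
    cov = covariance_matrix(z + z_prime)  # covariance of the stacked batches

    # Variance: hinge pushing each dimension's standard deviation above target_std.
    variance = sum(
        max(0.0, target_std - math.sqrt(cov[j][j] + eps)) for j in range(k)
    ) / k
    # Covariance: squared off-diagonal entries, pushing dimensions to decorrelate.
    covariance = sum(
        cov[a][b] ** 2 for a in range(k) for b in range(k) if a != b
    ) / k
    # Invariance: squared distance between the embeddings of the two views.
    invariance = sum(
        (u - v) ** 2 for r, s in zip(z, z_prime) for u, v in zip(r, s)
    ) / n
    return alpha * variance + beta * covariance + lam * invariance


# Collapsed representation: every input maps to the same point.
collapsed = [[0.0, 0.0] for _ in range(4)]
loss_collapsed = vicreg_loss(collapsed, collapsed)

# Spread-out, decorrelated representation that is identical across views.
spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
loss_spread = vicreg_loss(spread, spread)

print(f"collapsed: {loss_collapsed:.2f}, spread: {loss_spread:.2f}")
```

The collapsed batch satisfies invariance perfectly but is penalized by the variance term alone, which shows how collapse is avoided without negative samples.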
Integrating other learning methods into the information-theoretic framework.
Information theory traditionally assumes that underlying probabilities are known and do not require learning. For instance, the optimality of the initial IB work (Tishby et al., 1999b) relied on the assumption that the joint distribution of input and labels is known. However, a significant challenge in machine learning algorithms is inferring an accurate predictor for the unknown target variable from observed realizations. This discrepancy raises questions about the practical optimality of the IB and its relevance in modern learning algorithms. The following section delves into the relationship between the IB framework and learning, inference, and generalization.
Let $X \in \mathcal{X}$ be an input variable and $Y \in \mathcal{Y}$ a target variable with an unknown joint distribution $P(X, Y)$. For a given class of predictors $f: \mathcal{X} \to \hat{\mathcal{Y}}$ and a loss function $\ell: \mathcal{Y} \times \hat{\mathcal{Y}} \to \mathbb{R}$ measuring discrepancies between true values and model predictions, our objective is to find the predictor $f$ that minimizes the expected population risk:
$$
\mathcal{L}_{P}(f, \ell) := \mathbb{E}_{P(X, Y)}\left[\ell\left(Y, f(X)\right)\right]
$$
Several issues arise with the population risk. Firstly, it remains unclear which loss function is optimal. A popular choice is the logarithmic loss (or error's entropy), which has been numerically demonstrated to yield better results (Erdogmus, 2002). This loss has been employed in various algorithms, including the InfoMax principle (Linsker, 1988), tree-based algorithms (Quinlan, 2014), deep neural networks (Zhang and Sabuncu, 2018), and Bayesian modeling (Wenzel et al., 2020). Painsky and Wornell (2018) provided a rigorous justification for using the logarithmic loss and showed that it upper-bounds any loss function that is smooth, proper, and convex for binary classification problems.
In most cases, the joint distribution $P(X, Y)$ is unknown, and we have access to only $n$ samples from it, denoted by $\mathcal{D}_n := \{(x_i, y_i) \mid i = 1, \ldots, n\}$. Consequently, the population risk cannot be computed directly. Instead, we typically choose the predictor that minimizes the empirical risk on a training dataset:
$$
\hat{\mathcal{L}}_{P}(f, \ell, \mathcal{D}_n) := \frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, f(x_i)\right)
$$
The generalization gap, defined as the difference between the population and empirical risks, is given by:
$$
\mathrm{Gen}_{P}(f, \ell, \mathcal{D}_n) := \mathcal{L}_{P}(f, \ell) - \hat{\mathcal{L}}_{P}(f, \ell, \mathcal{D}_n)
$$
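The relationship between population risk, empirical risk, and the generalization gap can be checked numerically on a toy problem where the population risk is known in closed form. Everything in this sketch is an assumption chosen for illustration: the data-generating process (a uniform input bit whose label is flipped with probability 0.1), the fixed predictor $f(x) = x$, the 0-1 loss, and the sign convention (population risk minus empirical risk).

```python
import random

FLIP_PROBABILITY = 0.1  # P(Y != X) in the toy data-generating process


def zero_one_loss(y_true, y_pred):
    return 0.0 if y_true == y_pred else 1.0


def predictor(x):
    # Fixed predictor f(x) = x; its population 0-1 risk equals FLIP_PROBABILITY.
    return x


def sample_dataset(n, rng):
    """Draw n i.i.d. pairs (x, y): x is a uniform bit, y is x flipped w.p. 0.1."""
    data = []
    for _ in range(n):
        x = rng.randint(0, 1)
        y = 1 - x if rng.random() < FLIP_PROBABILITY else x
        data.append((x, y))
    return data


def empirical_risk(f, loss, dataset):
    """Average loss of f over the observed samples."""
    return sum(loss(y, f(x)) for x, y in dataset) / len(dataset)


rng = random.Random(0)
population_risk = FLIP_PROBABILITY
gaps = {}
for n in (10, 100, 10000):
    risk_hat = empirical_risk(predictor, zero_one_loss, sample_dataset(n, rng))
    gaps[n] = population_risk - risk_hat
    print(f"n={n:>6}  empirical risk={risk_hat:.4f}  gap={gaps[n]:+.4f}")
```

Here the predictor is fixed in advance, so the gap shrinks purely through concentration as $n$ grows. When the predictor is instead selected by minimizing the empirical risk, the gap also depends on the complexity of the predictor class, which is where the information-theoretic bounds enter.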
Interestingly, the relationship between the true loss and the empirical loss can be bounded using the information bottleneck term. Shamir et al. (2010) developed several finite sample bounds for the generalization gap. According to their study, the IB framework exhibits good generalization even with small sample sizes. In particular, they developed non-uniform bounds that adapt to the model's complexity. They demonstrated that, in the discrete case, the error in estimating mutual information from finite samples is bounded by $O\left(\frac{|X| \log n}{\sqrt{n}}\right)$, where $|X|$ is the cardinality of $X$ (the number of possible values that the random variable $X$ can take). The results support the intuition that simpler models generalize better, which motivates compressing the model. Therefore, optimizing eq. (1) presents a trade-off between two opposing forces. On the one hand, we want to increase our prediction accuracy on the training data (high $\beta$). On the other hand, we would like to decrease $\beta$ to narrow the generalization gap.
Vera et al. (2018) extended this work and showed that the generalization gap is bounded by the square root of the mutual information between the training input and the model representation times $\frac{\log n}{n}$. Furthermore, Russo and Zou (2019) and Xu and Raginsky (2017) demonstrated that the square root of the mutual information between the training input and the parameters inferred from the training algorithm provides a concise bound on the generalization gap. However, these bounds critically depend on the Markov operator that maps the training set to the network parameters, whose characterization is not trivial.
Achille and Soatto (2018) explored how applying the IB objective to the network's parameters may reduce overfitting while maintaining invariant representations. Their work showed that flat minima, which have better generalization properties, bound the information in the weights, and that the information in the weights bounds the information in the activations. Chelombiev et al. (2019) found that the generalization precision is positively correlated with the degree of compression of the last layer in the network. Shwartz-Ziv et al. (2018) showed that the generalization error depends exponentially on the mutual information between the model and the input once it is smaller than $\log 2n$, the query sample complexity. Moreover, they demonstrated that $M$ bits of compression of $X$ are equivalent to an exponential factor of $2^{M}$ training examples. Piran et al. (2020) extended the original IB to the dual form, which offers several advantages in terms of compression.
These studies illustrate that the IB leads to a trade-off between prediction and complexity, even for the empirical distribution. With the IB objective, we can design estimators to find optimal solutions for different regimes with varying performance, complexity, and generalization.
Expanding the analysis to usable information.
The multiview self-supervised IB framework can be extended to cases involving more than two views $(X_1, \ldots, X_n)$ and multiple downstream tasks $(Y_1, \ldots, Y_K)$. A simple extension of the multiview IB framework can be achieved by setting the objective function to maximize the joint mutual information of all views' representations $I(Z_1; \ldots; Z_n)$ and compressing the individual information for each view, $I(X_i; Z_i)$ for $1 \leq i \leq n$. However, to ensure the optimality of this objective, we must expand the multiview assumption to include more than two views. In this scenario, we need to assume that relevant information is shared among all different views and tasks, which might be overly restrictive. As a result, defining and analyzing a more refined version of this naive solution is essential. One potential approach involves utilizing the Multi-feature Information Bottleneck (MfIB) (Lou et al., 2013), which extends the original IB. The MfIB processes multiple feature types simultaneously and analyzes data from various sources. This framework establishes a joint distribution between the multivariate data and the model. Rather than solely preserving the information of one feature variable maximally, the MfIB concurrently maintains multiple feature variables' information while compressing them. The MfIB characterizes the relationships between different sources and outputs by employing the multivariate Information Bottleneck (Friedman et al., 2013) and specifying Bayesian networks.
Extending self-supervised learning's information-based perspective to energy-based model optimization.
How can we learn without labels and still achieve good predictive power? Is compression necessary to obtain an optimal representation? This section analyzes and discusses how to achieve optimal representation for self-supervised learning when labels are not available during training. We review recent methods for self-supervised learning and show how they can be integrated into a single framework. We compare their objective functions, implicit assumptions, and theoretical challenges. Finally, we consider the information-theoretic properties of these representations, their optimality, and different ways of learning them.
One approach to enhance deep learning methods is to apply the InfoMax principle in a multiview setting (Linsker, 1988; Wiskott and Sejnowski, 2002). As one of the earliest approaches, Linsker (1988) proposed maximizing information transfer from input data to its latent representation, showing its equivalence to maximizing the determinant of the output covariance under the Gaussian distribution assumption. Becker and Hinton (1992) introduced a representation learning approach based on maximizing an approximation of the mutual information between alternative latent vectors obtained from the same image. The most well-known application is the Independent Component Analysis (ICA) Infomax algorithm (Bell and Sejnowski, 1995), designed to separate independent sources from their linear combinations. The ICA-Infomax algorithm aims to maximize the mutual information between mixtures and source estimates while imposing statistical independence among outputs. The Deep Infomax approach (Hjelm et al., 2018) extends this idea to unsupervised feature learning by maximizing the mutual information between input and output while matching a prior distribution for the representations. Recent work has applied this principle to a self-supervised multiview setting (Bachman et al., 2019; Henaff, 2020; Hjelm et al., 2018; Tian et al., 2020a), wherein these works maximize the mutual information between the views $Z_1$ and $Z_2$ using the classifier $q(z_1 \mid z_2)$, which attempts to predict one representation from the other.
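The Gaussian case attributed to Linsker (1988) above can be made explicit with a short derivation. As an assumption for illustration, suppose the $K$-dimensional representation is $Z = g(X) + \eta$, where $\eta$ is Gaussian noise independent of $X$ with fixed covariance, and $Z$ itself is Gaussian with covariance $\Sigma_Z$ (the symbols $g$, $\eta$, and $\Sigma_Z$ are our notation):

$$
H(Z) = \frac{1}{2} \log\left((2 \pi e)^{K} \det \Sigma_Z\right), \qquad I(X; Z) = H(Z) - H(Z \mid X) = \frac{1}{2} \log \det \Sigma_Z + \text{const}
$$

Since $H(Z \mid X) = H(\eta)$ does not depend on the encoder, maximizing the information transferred to $Z$ amounts to maximizing the determinant of the output covariance.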
However, Tschannen et al. (2019) demonstrated that the effectiveness of InfoMax models is more attributable to the inductive biases introduced by the architecture and estimators than to the training objectives themselves, as the InfoMax objectives can be trivially maximized using invertible encoders. Moreover, a fundamental issue with the InfoMax principle is that it retains information that is irrelevant to the labels, contradicting the core concept of the IB principle, which advocates compressing the representation to enhance generalizability.
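The point about invertible encoders can be made concrete on a toy discrete source. In the sketch below (the eight-symbol source and both encoders are hypothetical choices for illustration), an arbitrary bijection attains the maximal value $I(X;Z)=H(X)$ even though it merely relabels the input, which suggests that the InfoMax value alone says little about how useful a representation is.

```python
import math
from collections import Counter

def mutual_information(pairs):
    # Plug-in estimate of I(A; B) in bits from a list of equally likely (a, b) pairs.
    n = len(pairs)
    p_ab = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2(c * n / (p_a[a] * p_b[b])) for (a, b), c in p_ab.items())

inputs = list(range(8))                          # X uniform over 8 symbols, so H(X) = 3 bits
scramble = {x: (5 * x + 3) % 8 for x in inputs}  # an arbitrary bijection on the symbols
parity = {x: x % 2 for x in inputs}              # a lossy encoder that keeps a single bit

mi_scramble = mutual_information([(x, scramble[x]) for x in inputs])
mi_parity = mutual_information([(x, parity[x]) for x in inputs])
print(mi_scramble, mi_parity)
```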
To resolve this problem, Sridharan and Kakade (2008) proposed the multiview IB framework. According to this framework, in the multiview setting without labels, the IB principle of preserving relevant data while compressing irrelevant data requires assumptions regarding the relationship between views and labels. They presented the MultiView assumption, which asserts that either view alone would be (approximately) sufficient for downstream tasks. Under this assumption, they define the relevant information as the information shared between the views. Therefore, augmentations (such as changing the image style) should not affect the labels. Additionally, each view should provide most of the information in the input regarding downstream tasks. Compressing the information that is not shared between the two views can then improve generalization without hurting performance. Their formulation is as follows:
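Assumption 1 (The MultiView Assumption): There exists an $\epsilon_{\text{info}}$ (which is assumed to be small) such that
$$
\begin{aligned} I(Y; X_2 \mid X_1) &\leq \epsilon_{\text{info}}, \\ I(Y; X_1 \mid X_2) &\leq \epsilon_{\text{info}}. \end{aligned}
$$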
As a result, when the information sharing parameter, $\epsilon_{\text{info}}$, is small, the information shared between views includes task-relevant details. For instance, in self-supervised contrastive learning for visual data (Hjelm et al., 2018), views represent various augmentations of the same image. In this scenario, the MultiView assumption is considered mild if the downstream task remains unaffected by the augmentation (Geiping et al., 2022). Image augmentations can be perceived as altering an image's style without changing its content. Thus, Tsai et al. (2020) contend that the information required for downstream tasks should be preserved in the content rather than the style. This assumption allows us to separate the information into relevant (shared) and irrelevant (not shared) components and to compress only the unimportant details that do not contain information about downstream tasks. Based on this assumption, we aim to maximize the relevant information $I(X_2; Z_1)$ and minimize $I(X_1; Z_1 \mid X_2)$, the exclusive information that $Z_1$ contains about $X_1$ and that cannot be predicted by observing $X_2$. This irrelevant information is unnecessary for the prediction task and can be discarded. In the extreme case, where $X_1$ and $X_2$ share only label information, this approach recovers the supervised IB method without labels. Conversely, if $X_1$ and $X_2$ are identical, this method collapses into the InfoMax principle, as no information can be safely discarded.
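The split into shared and exclusive information can be checked on a toy content/style model. The sketch below assumes each view is a pair (content bit, independent style bit), which is our own illustrative construction. It evaluates $I(X_2; Z_1)$ and $I(X_1; Z_1 \mid X_2)$ for an encoder that keeps only the content and for one that copies its input, using the chain rule $I(Z_1; X_1, X_2) = I(Z_1; X_2) + I(Z_1; X_1 \mid X_2)$.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(pairs):
    # Plug-in estimate of I(A; B) in bits from a list of equally likely (a, b) pairs.
    n = len(pairs)
    p_ab = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_b = Counter(b for _, b in pairs)
    return sum((c / n) * math.log2(c * n / (p_a[a] * p_b[b])) for (a, b), c in p_ab.items())

def view_terms(encoder):
    # Toy joint: content c and styles s1, s2 are independent fair bits (an assumption).
    samples = [((c, s1), (c, s2)) for c, s1, s2 in product((0, 1), repeat=3)]
    shared = mutual_information([(x2, encoder(x1)) for x1, x2 in samples])
    joint = mutual_information([((x1, x2), encoder(x1)) for x1, x2 in samples])
    # Chain rule: I(Z1; X1, X2) = I(Z1; X2) + I(Z1; X1 | X2).
    return shared, joint - shared

content_only = view_terms(lambda x1: x1[0])   # Z1 keeps only the shared content bit
copy_input = view_terms(lambda x1: x1)        # Z1 copies X1, style included
print("content-only (relevant, exclusive):", content_only)
print("copy-input   (relevant, exclusive):", copy_input)
```

Both encoders retain the same one bit of relevant information, but only the copying encoder carries an extra bit of exclusive (style) information that the multiview objective would compress away.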
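Federici et al. (2020) used the relaxed Lagrangian objective to obtain the minimal sufficient representation $Z_1$ for $X_2$ and, symmetrically, $Z_2$ for $X_1$:
$$
\mathcal{L}_1 = I(Z_1; X_1 \mid X_2) - \beta_1 I(X_2; Z_1),
$$
$$
\mathcal{L}_2 = I(Z_2; X_2 \mid X_1) - \beta_2 I(X_1; Z_2),
$$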
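where $\beta_1$ and $\beta_2$ are the Lagrangian multipliers introduced by the constrained optimization. By defining $Z_1$ and $Z_2$ on the same domain and re-parameterizing the Lagrangian multipliers, the average of the two loss functions can be upper bounded as:
$$
\mathcal{L} = -I_{P(Z_1|X_1),Q(Z_2|Z_1)}(Z_1;Z_2) + \beta D_{\text{SKL}}\left[p(z_1 \mid x_1)\,\|\,p(z_2 \mid x_2)\right]
$$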
where $D_{\text{SKL}}$ represents the symmetrized KL divergence obtained by averaging the expected value of $D_{KL}(p(z_1 \mid x_1)\,\|\,p(z_2 \mid x_2))$ and $D_{KL}(p(z_2 \mid x_2)\,\|\,p(z_1 \mid x_1))$. Note that when the mapping from $X_1$ to $Z_1$ is deterministic, $I(Z_1; X_1 \mid X_2)$ minimization and $H(Z_1 \mid X_2)$ minimization are interchangeable and the algorithms of Federici et al. (2020) and Tsai et al. (2020) minimize the same objective. Another implementation of the same idea is based on the Conditional Entropy Bottleneck (CEB) algorithm (Fischer, 2020) and proposed by Lee et al. (2021b). This algorithm adds the residual information as a compression term to the InfoMax objective using the reverse decoders $q(z_1 \mid x_2)$ and $q(z_2 \mid x_1)$.
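As a rough illustration of the symmetrized KL term, the sketch below assumes that both encoders output diagonal Gaussians, so the divergence has a closed form. The function names, the example means and standard deviations, and the externally supplied mutual-information estimate are all hypothetical; in practice the divergence would be averaged over pairs of views and the mutual information replaced by a sample-based lower bound.

```python
import math

def kl_diag_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    # Closed-form KL( N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2)) ) in nats.
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q)
    )

def symmetrized_kl(mu_1, sigma_1, mu_2, sigma_2):
    # Average of the two KL directions for one pair of views (x1, x2).
    forward = kl_diag_gaussians(mu_1, sigma_1, mu_2, sigma_2)
    backward = kl_diag_gaussians(mu_2, sigma_2, mu_1, sigma_1)
    return 0.5 * (forward + backward)

def multiview_ib_loss(mi_estimate, skl, beta):
    # L = -I(Z1; Z2) + beta * D_SKL; mi_estimate is supplied by the caller (assumption).
    return -mi_estimate + beta * skl

skl = symmetrized_kl([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0])
loss = multiview_ib_loss(mi_estimate=1.2, skl=skl, beta=0.1)
print(skl, loss)
```

The penalty vanishes only when the two encoders place identical distributions on both views, which is how the term discourages view-specific information.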
In conclusion, all the algorithms mentioned above are based on the Multiview assumption. Utilizing this assumption, they can distinguish relevant information from irrelevant information. As a result, all these algorithms aim to maximize the information (or the predictive ability) of one representation with respect to the other view while compressing the information between each representation and its corresponding view. The key differences between these algorithms lie in the decomposition and implementation of these information terms.
Dubois et al. (2021) offer another theoretical analysis of the IB for self-supervised learning. Their work addresses the question of the minimum bit rate required to store the input while still achieving high performance on a family of downstream tasks $Y \in \mathcal{Y}$. It is a rate-distortion problem, where the goal is to find a compressed representation that gives a good prediction for every task. We require that the distortion measure is bounded:
$$
$$
Finding the solution would require access to the downstream tasks during learning, which is not available in the self-supervised setting. Dubois et al. (2021) therefore considered only tasks invariant to some equivalence relation, which divides the input into disjoint equivalence classes. An example would be an image with labels that remain unchanged after augmentation. This is similar to the Multiview assumption where $\epsilon_{\text{info}} \to 0$. By applying Shannon's rate-distortion theory, they concluded that the minimum achievable bit rate is the rate-distortion function with the above invariance distortion. Thus, the optimal rate can be determined by minimizing the following Lagrangian:
$$
$$
Using this objective, the maximization of information with labels is replaced by maximizing the prediction ability of one view from the original input, regularized by direct information from the input. Similarly to the above results, we would like to find a representation $Z_1$ that compresses the input $X_1$ so that $Z_1$ has the maximum information about $X_2$.
Expanding the multiview framework to accommodate more views and tasks.
The multiview self-supervised IB framework can be extended to cases involving more than two views ($X_1, \cdots, X_n$) and multiple downstream tasks ($Y_1, \cdots, Y_K$). A simple extension of the multiview IB framework can be achieved by setting the objective function to maximize the joint mutual information of all views' representations $I(Z_1; \cdots; Z_n)$ while compressing the individual information for each view $I(X_i; Z_i)$, $1 \leq i \leq n$. However, to ensure the optimality of this objective, we must expand the multiview assumption to include more than two views. In this scenario, we need to assume that relevant information is shared among all different views and tasks, which might be overly restrictive. As a result, defining and analyzing a more refined version of this naive solution is essential. One potential approach involves utilizing the Multi-feature Information Bottleneck (MfIB) (Lou et al., 2013), which extends the original IB. The MfIB processes multiple feature types simultaneously and analyzes data from various sources. This framework establishes a joint distribution between the multivariate data and the model. Rather than solely preserving the information of one feature variable maximally, the MfIB concurrently maintains multiple feature variables' information while compressing them. The MfIB characterizes the relationships between different sources and outputs by employing the multivariate Information Bottleneck (Friedman et al., 2013) and setting Bayesian networks.
Conclusion
In this study, we delved deeply into the concept of optimal representation in self-supervised learning through the lens of information theory. We synthesized various approaches, highlighting their foundational assumptions and constraints, and integrated them into a unified framework. Additionally, we explored the key information-theoretic terms that influence these optimal representations and the methods for estimating them.
While supervised and unsupervised learning offer more direct access to relevant information, self-supervised learning depends heavily on assumptions about the relationship between data and downstream tasks. This reliance makes distinguishing between relevant and irrelevant information considerably more challenging, necessitating further assumptions.
Despite these challenges, information theory stands out as a robust and versatile framework for analysis and algorithmic development. This adaptable framework caters to a range of learning paradigms and elucidates the inherent assumptions underpinning data and model optimization.
With the rapid growth of datasets and the increasing expectations placed on models to handle multiple downstream tasks, the traditional Multi-view assumption might become less reliable. One significant challenge in self-supervised learning is the precise compression of irrelevant information, especially when these assumptions are compromised.
Future research avenues might involve expanding the Multi-view framework to include more views and tasks and deepening our understanding of information theory's impact on facets of deep learning, such as reinforcement learning and generative models.
In summary, information theory is a crucial tool in our quest to better understand and optimize self-supervised learning models. By harnessing its principles, we can more adeptly navigate the intricacies of deep neural network development, paving the way for creating more effective models.
$$ \begin{aligned} \mathcal{L} &= D_{KL}(q(m|z) \,\|\, p(m|z)) - \beta D_{KL}(q(x|m) \,\|\, p(x|m)) - \beta_y D_{KL}(p(y|z) \,\|\, p(y)) \\ &\Leftrightarrow I(M; Z) - \beta I(M; X) - \beta_y I(Y; Z) \end{aligned} $$
$$ I(Y; Z) \geq H(Y) + \mathbb{E}\left[\log q(Y\mid Z)\right] $$
$$ \mathbb{E}_{x,x^+, x^-}\left[-\log \left(\frac{e^{f(x)^Tf(x^+)}}{\sum_{k=1}^K{e^{f(x)^Tf(x^k)}}}\right)\right] $$
$$ \mathcal{L}=\min_{P(t \mid x);\, p(y \mid t)} I(X;T) - \beta I(Y;T)~, $$
$$ I(T;Y)=I(X;Y)-\mathbb{E}_{x\sim P(X),\, t\sim P(T|x)}\left[D\left[P(Y|x)\,\|\,P(Y|t)\right]\right] $$
$$ \mathcal{L}_{P(X,Y)}\left(f, \ell\right) = \mathbb{E}_{P(X,Y)}\left[\ell(Y, f(X))\right] $$
$$ I(X;Z) = \underbrace{I(X;Z|Y)}_{\text{superfluous information}} + \underbrace{I(Z;Y)}_{\text{predictive information}} $$
$$ \mathcal{L} = \alpha I_{P(X_1),P(Z_1|X_1)}(X_1;Z_1) + \beta I_{P(X_2),P(Z_2|X_2)}(X_2;Z_2) - I_{P(Z_2|X_2),P(Z_2|X_1)}(Z_{1,2};Y) $$
$$ \begin{aligned} I(Y;X_2|X_1) &\leq \epsilon_{\text{info}}, \\ I(Y;X_1|X_2) &\leq \epsilon_{\text{info}}. \end{aligned} $$
$$ \mathcal{L} = -I_{P(Z_1|X_1),Q(Z_2|Z_1)}(Z_1;Z_2) + \beta D_{\text{SKL}}\left[p(z_1 \mid x_1)\,\|\,p(z_2 \mid x_2)\right] $$
$$ v(Z)=\frac{1}{d}\sum_{j=1}^d\max \left(0, \gamma- \sqrt{\mathrm{Var}(z^j) +\epsilon}\right) $$
$$ s(Z,Z^\prime) = \frac{1}{n}\sum_i \|z_i - z^\prime_i\|^2_2 $$
$$ \ell(Z, Z^\prime) = \lambda s(Z, Z^\prime) + \mu \left[v(Z) + v(Z^\prime)\right] + \nu \left[c(Z) +c(Z^\prime)\right] $$
$$ \mathcal{L}=\frac{1}{K}\sum_{k=1}^K\left(\alpha\max \left(0, \gamma- \sqrt{\boldsymbol{C}_{k,k} +\epsilon}\right)+\beta \sum_{k'\neq k}\left(\boldsymbol{C}_{k,k'}\right)^2\right)+\gamma\| \boldsymbol{Z}-\boldsymbol{Z}'\|_F^2/N. $$
Keywords: Self-Supervised Learning, Information Theory, Representation Learning
Deep neural networks (DNNs) have revolutionized fields such as computer vision, natural language processing, and speech recognition due to their remarkable performance in supervised learning tasks (Alam et al., 2020; LeCun et al., 2015; He et al., 2015). However, the success of DNNs is often limited by the need for vast amounts of labeled data, which can be both time-consuming and expensive to acquire. Self-supervised learning (SSL) emerges as a promising alternative, enabling models to learn from data without explicit labels by leveraging the underlying structure and relationships within the data itself.
Recent advances in SSL have been driven by joint embedding architectures, such as Siamese Nets (Bromley et al., 1993), DrLIM (Chopra et al., 2005; Hadsell et al., 2006), and SimCLR (Chen et al., 2020a). These approaches define a loss function that encourages representations of different versions of the same image to be similar while pushing representations of distinct images apart. After optimizing the surrogate objective, the pre-trained model can be employed as a feature extractor, with the learned features serving as inputs for downstream supervised tasks like image classification, object detection, instance segmentation, or pose estimation (Caron et al., 2021; Chen et al., 2020a; Misra and van der Maaten, 2020; Shwartz-Ziv et al., 2022b). Although SSL methods have shown promising results in practice, the theoretical underpinnings behind their effectiveness remain an open question (Arora et al., 2019; Lee et al., 2021a).
Information theory has played a crucial role in understanding and optimizing deep neural networks, from practical applications like the variational information bottleneck (Alemi et al., 2016) to theoretical investigations of generalization bounds induced by mutual information (Xu and Raginsky, 2017; Steinke and Zakynthinou, 2020). Building upon these foundations, several researchers have attempted to enhance self-supervised and semi-supervised learning algorithms using information-theoretic principles, such as the Mutual Information Neural Estimator (MINE) (Belghazi et al., 2018b) combined with the information maximization (InfoMax) principle (Linsker, 1988). However, the plethora of objective functions, contradicting assumptions, and various estimation techniques in the literature can make it challenging to grasp the underlying principles and their implications.
In this paper, we aim to achieve two objectives. First, we propose a unified framework that synthesizes existing research on self-supervised and semi-supervised learning from an information-theoretic standpoint. This framework allows us to present and compare current methods, analyze their assumptions and difficulties, and discuss the optimal representation for neural networks in general and self-supervised networks in particular. Second, we explore different methods and estimators for optimizing information-theoretic quantities in deep neural networks and investigate how recent models optimize various information-theoretic terms.
By reviewing the literature on various aspects of information-theoretic learning, we provide a comprehensive understanding of the interplay between information theory, self-supervised learning, and deep neural networks. We discuss the application of the information bottleneck principle (Tishby et al., 1999a), connections between information theory and generalization, and recent information-theoretic learning algorithms. Furthermore, we examine how the information-theoretic perspective can offer insights into the design of better self-supervised learning algorithms and the potential benefits of using information theory in SSL across a wide range of applications.
In addition to the main structure of the paper, we dedicate a section to the challenges and opportunities in extending the information-theoretic perspective to other learning paradigms, such as energy-based models. We highlight the potential advantages of incorporating these extensions into self-supervised learning algorithms and discuss the technical and conceptual challenges that must be addressed.
The structure of the paper is as follows. Section 2 introduces the key concepts in supervised, semi-supervised, self-supervised learning, information theory, and representation learning. Section 3 presents a unified framework for multiview learning based on information theory. We first discuss what an optimal representation is and why compression is beneficial for learning. Next, we explore optimal representation in single-view supervised learning models and how they can be extended to unsupervised, semi-supervised, and multiview contexts. The focus then shifts to self-supervised learning, where the optimal representation remains an open question. Using the unified framework, we compare recent self-supervised algorithms and discuss their differences. We analyze the assumptions behind these models, their effects on the learned representation, and their varying perspectives on important information within the network.
Section 5 addresses several technical challenges, discussing both theoretical and practical issues in estimating information-theoretic terms. We present recent methods for estimating these quantities, including variational bounds and estimators. Section 6 concludes the paper by offering insights into potential future research directions at the intersection of information theory, self-supervised learning, and deep neural networks. Our aim is to inspire further research that leverages information theory to advance our understanding of self-supervised learning and to develop more efficient and effective models for a broad range of applications.
Multiview learning has gained increasing attention and great practical success by using complementary information from multiple features or modalities. The multiview learning paradigm divides the input variable into multiple views from which the target variable should be predicted (Zhao et al., 2017b). Using this paradigm, one can eliminate hypotheses that contradict predictions from other views and provide a natural semi-supervised and self-supervised learning setting. A multiview dataset consists of data captured from multiple sources, modalities, and forms but with similar high-level semantics (Yan et al., 2021). This mechanism was initially used for natural-world data, combining image, text, audio, and video measurements. For example, photos of objects are taken from various angles, and our supervised task is to identify the objects. Another example is identifying a person by analyzing the video stream as one view and the audio stream as the other.
Although these views often provide different and complementary information about the same data, directly integrating them does not produce satisfactory results due to biases between multiple views (Yan et al., 2021). Thus, multiview representation learning involves identifying the underlying data structure and integrating the different views into a common feature space, resulting in high performance. In recent decades, multiview learning has been used for many machine learning tasks and influenced many algorithms, such as co-training mechanisms (Kumar and Daumé, 2011), subspace learning methods (Xue et al., 2019), and multiple kernel learning (MKL) (Bach and Jordan, 2002). Li et al. (2018) proposed two categories for multiview representation learning: (i) multiview representation fusion, which combines different features from multiple views into a single compact representation, and (ii) alignment of multiview representation, which attempts to capture the relationships among multiple different views through feature alignment. In this case, a learned mapping function embeds the data of each view, and the representations are regularized to form a multiview-aligned space. In this research direction, an early study is the Canonical Correlation Analysis (CCA) (Hotelling, 1936) and its kernel extensions (Bach and Jordan, 2003; Hardoon et al., 2004; Sun, 2013). In addition to CCA, multiview representation learning has penetrated a variety of learning methods, such as dimensionality reduction (Sun et al., 2010), clustering analysis (Yan et al., 2015), multiview sparse coding (Jia et al., 2010; Cao et al., 2013; Liu et al., 2014), and multimodal topic learning (Pu et al., 2020). However, despite their promising results, these methods use handcrafted features and linear embedding functions, which cannot capture the nonlinear properties of multiview data.
The emergence of deep learning has provided a powerful way to learn complex, nonlinear, and hierarchical representations of data. By incorporating multiple hierarchical layers, deep learning algorithms can learn complex, subtle, and abstract representations of target data. The success of deep learning in various application domains has led to a growing interest in deep multiview methods, which have shown promising results. Examples of these methods include deep multiview canonical correlation analysis (Andrew et al., 2013) as an extension of CCA, multiview clustering via deep matrix factorization (Zhao et al., 2017a), and the deep multiview spectral network (Huang et al., 2019). Moreover, deep architectures have been employed to generate effective representations in methods such as multiview convolutional neural networks (Liu et al., 2021a), multimodal deep Boltzmann machines (Srivastava and Salakhutdinov, 2014), multimodal deep autoencoders (Ngiam et al., 2011; Wang et al., 2015), and multimodal recurrent neural networks (Karpathy and Fei-Fei, 2015; Mao et al., 2014; Donahue et al., 2015).
Self-supervised learning (SSL) is a powerful technique that leverages unlabeled data to learn useful representations. In contrast to supervised learning, which relies on labeled data, SSL employs self-defined signals to establish a proxy objective between the input and the signal. The model is initially trained using this proxy objective and subsequently fine-tuned on the target task. Self-supervised signals, derived from the inherent co-occurrence relationships in the data, serve as self-supervision. Various such signals have been used to learn representations, including generative and joint embedding architectures (Chen et al., 2020a, b; Bachman et al., 2019; Bar et al., 2022).
Two main categories of SSL architectures exist: (1) generative architectures based on reconstruction or prediction and (2) joint embedding architectures (Liu et al., 2021b). Both architecture classes can be trained using either contrastive or non-contrastive methods.
We begin by discussing these two main types of architectures:
Generative Architecture: Generative architectures employ an objective function that measures the divergence between input data and predicted reconstructions, such as squared error. The architecture reconstructs data from a latent variable or a corrupted version, potentially with a latent variable’s assistance. Notable examples of generative architectures include auto-encoders, sparse coding, sparse auto-encoders, and variational auto-encoders (Kingma and Welling, 2013; Lee et al., 2006; Ng et al., 2011). As the reconstruction task lacks a single correct answer, most generative architectures utilize a latent variable, which, when varied, generates multiple reconstructions. The latent variable’s information content requires regularization to ensure the system reconstructs regions of high data density while avoiding a collapse by reconstructing the entire space. PCA regularizes the latent variable by limiting its dimensions, while sparse coding and sparse auto-encoders restrict the number of non-zero components. Variational auto-encoders regularize the latent variable by rendering it stochastic and maximizing the entropy of the distribution relative to a prior. Vector quantized variational auto-encoders (VQ-VAE) employ binary stochastic variables to achieve similar results (Van Den Oord et al., 2017).
Joint Embedding Architectures (JEA): These architectures process multiple views of an input signal through encoders, producing representations of the views. The system is trained to ensure that these representations are both informative and mutually predictable. Examples include Siamese networks, where two identical encoders share weights (Chen and He, 2021; Chen et al., 2020a; He et al., 2020; Grill et al., 2020), and methods permitting encoders to differ (Bardes et al., 2021). A primary challenge with JEA is preventing informational collapse, in which the representations contain minimal information about the inputs, thereby facilitating their mutual prediction. JEA’s advantage lies in the encoders’ ability to eliminate noisy, unpredictable, or irrelevant information from the input within the representation space.
To effectively train these architectures, it is essential to ensure that the representations of different signals are distinct. This can be achieved through either contrastive or non-contrastive methods:
Contrastive Methods: Contrastive methods utilize data points from the training set as positive samples and generate points outside the region of high data density as contrastive samples. The energy (e.g., reconstruction error for generative architectures or representation predictive error for JEA) should be low for positive samples and higher for contrastive samples. Various loss functions involving the energies of pairs or sets of samples can be minimized to achieve this objective.
Non-Contrastive Methods: Non-contrastive methods prevent the energy landscape’s collapse by limiting the volume of space that can take low energy, either through architectural constraints or through a regularizer in the energy or training objective. In latent-variable generative architectures, preventing collapse is achieved by limiting or minimizing the information content of the latent variable. In JEA, collapse is prevented by maximizing the information content of the representations.
We now present a few concrete examples of popular models that employ various combinations of generative architectures, joint embedding architectures, contrastive training, and non-contrastive training:
The Denoising Autoencoder approach in generative architectures (Vincent et al., 2008; Devlin et al., 2018; He et al., 2022) uses a triplet loss with a positive sample, which is a vector from the training set that should be reconstructed perfectly, and a contrastive sample consisting of two data vectors, one from the training set and the other a corrupted version of it. In SSL, the combination of JEA models with contrastive learning has proven highly effective. In contrastive learning, the objective is to attract different augmented views of the same image (positive points) while repelling dissimilar augmented views (negative points). Recent self-supervised visual representation learning examples include MoCo (He et al., 2020) and SimCLR (Chen et al., 2020a). The InfoNCE loss is a commonly used objective function in many contrastive learning methods:
$$
\mathbb{E}_{x,x^+,x^-}\left[-\log \left(\frac{e^{f(x)^T f(x^+)}}{\sum_{k=1}^K e^{f(x)^T f(x^k)}}\right)\right]
$$
where $x^+$ is a sample similar to $x$, $x^k$ are all the samples in the batch, and $f$ is an encoder.
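The InfoNCE loss can be sketched in a few lines of pure Python. This is a minimal sketch under stated assumptions: embeddings are plain lists of floats, the candidates $x^k$ for each anchor are the positives of every anchor in the batch, and no temperature is used, matching the plain dot-product form of the loss described here. The function and variable names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(anchors, positives):
    # anchors[i] = f(x_i), positives[i] = f(x_i^+); every row of `positives`
    # serves as a candidate x^k in the denominator for every anchor.
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, candidate) for candidate in positives]
        top = max(logits)  # subtract the max before exponentiating, for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Well-separated embeddings: each anchor matches only its own positive.
aligned = info_nce([[4.0, 0.0], [0.0, 4.0]], [[4.0, 0.0], [0.0, 4.0]])
# Collapsed embeddings: all samples share one vector, so positives are indistinguishable.
collapsed = info_nce([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
print(aligned, collapsed)
```

In the collapsed case the loss equals the logarithm of the batch size, the value obtained when the encoder cannot tell the positive apart from the other samples.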
However, contrastive methods heavily depend on all other samples in the batch and require a large batch size. Additionally, recent studies (Jing et al., 2021) have shown that contrastive learning can lead to dimensional collapse, where the embedding vectors span a lower-dimensional subspace instead of the entire embedding space. Although positive and negative pairs should repel each other to prevent dimensional collapse, augmentation along feature dimensions and implicit regularization cause the embedding vectors to fall into a lower-dimensional subspace, resulting in low-rank solutions.
To address these problems, recent works have introduced JEA models with non-contrastive methods. Unlike contrastive methods, these methods employ regularization to prevent the collapse of the representation and do not explicitly rely on negative samples. For example, several papers use stop-gradients and extra predictors to avoid collapse (Chen and He, 2021; Grill et al., 2020), while Caron et al. (2020) employed an additional clustering step. VICReg (Bardes et al., 2021) is another non-contrastive method that regularizes the covariance matrix of the representation. Consider two embedding batches $\boldsymbol{Z}=\left[f(\boldsymbol{x}_1),\dots,f(\boldsymbol{x}_N)\right]$ and $\boldsymbol{Z}'=\left[f(\boldsymbol{x}'_1),\dots,f(\boldsymbol{x}'_N)\right]$, each of size $(N\times K)$. Denote by $\boldsymbol{C}$ the $(K\times K)$ covariance matrix obtained from $[\boldsymbol{Z},\boldsymbol{Z}']$. The VICReg triplet loss is defined by:
$$
\mathcal{L}=\frac{1}{K}\sum_{k=1}^K\left(\alpha\max \left(0, \gamma- \sqrt{\boldsymbol{C}_{k,k} +\epsilon}\right)+\beta \sum_{k'\neq k}\left(\boldsymbol{C}_{k,k'}\right)^2\right)+\gamma\| \boldsymbol{Z}-\boldsymbol{Z}'\|_F^2/N.
$$
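The three ingredients of VICReg (variance, invariance, covariance) can be sketched in pure Python. The version below uses the per-branch form $\ell(Z, Z') = \lambda s(Z, Z') + \mu[v(Z) + v(Z')] + \nu[c(Z) + c(Z')]$. The covariance term (mean squared off-diagonal covariance), the unbiased covariance estimate, and the default weights and thresholds are assumptions based on our reading of Bardes et al. (2021); the example batches are made up.

```python
import math

def covariance(z):
    # Unbiased sample covariance matrix of a batch z given as n rows of d floats.
    n, d = len(z), len(z[0])
    mean = [sum(row[j] for row in z) / n for j in range(d)]
    return [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in z) / (n - 1)
             for j in range(d)] for i in range(d)]

def variance_term(z, gamma=1.0, eps=1e-4):
    # Hinge on the per-dimension standard deviation: penalizes collapsed dimensions.
    cov = covariance(z)
    return sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps)) for j in range(len(cov))) / len(cov)

def covariance_term(z):
    # Squared off-diagonal covariances: penalizes redundant (correlated) dimensions.
    cov = covariance(z)
    d = len(cov)
    return sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d

def invariance_term(z, z_prime):
    # Mean squared distance between the two views' embeddings.
    return sum(sum((a - b) ** 2 for a, b in zip(r, rp)) for r, rp in zip(z, z_prime)) / len(z)

def vicreg_loss(z, z_prime, lam=25.0, mu=25.0, nu=1.0):
    return (lam * invariance_term(z, z_prime)
            + mu * (variance_term(z) + variance_term(z_prime))
            + nu * (covariance_term(z) + covariance_term(z_prime)))

collapsed = [[0.5, 0.5]] * 4                                       # every sample maps to one point
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]        # high variance, decorrelated
print(vicreg_loss(collapsed, collapsed), vicreg_loss(spread, spread))
```

A collapsed batch satisfies the invariance term perfectly yet is heavily penalized by the variance term, which is how collapse is avoided without negative samples.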
Semi-supervised learning employs both labeled and unlabeled data to enhance the model performance (Chapelle et al., 2009). Consistency regularization-based approaches (Laine and Aila, 2016; Miyato et al., 2018; Sohn et al., 2020) ensure that predictions remain stable under perturbations in input data and model parameters. Certain techniques, such as those proposed by Grandvalet and Bengio (2006) and Miyato et al. (2018), involve training a model by incorporating a regularization term into a supervised cross-entropy loss. In contrast, Xie et al. (2020) utilizes suitably weighted unsupervised regularization terms, while Zhai et al. (2019) adopts a combination of self-supervised pretext loss terms. Moreover, pseudo-labeling can generate synthetic labels based on network uncertainty to further aid model training (Lee et al., 2013).
Representation learning is an essential aspect of various computer vision, natural language processing, and machine learning tasks, as it uncovers the underlying structures in data (Bengio et al., 2013). By extracting relevant information for classification and prediction tasks from the data, we can improve performance and reduce computational complexity (Goodfellow et al., 2016). However, defining an effective representation remains a challenging task. In probabilistic models, a useful representation often captures the posterior distribution of explanatory factors beneath the observed input (LeCun et al., 2015). Bengio and LeCun (2007) introduced the idea of learning highly structured yet complex dependencies for AI tasks, which require transforming high-dimensional input structures into low-dimensional output structures or learning low-level representations. As a result, identifying relevant input features becomes challenging, as most input entropy is unrelated to the output (Shwartz-Ziv and Tishby, 2017). Ben-Shaul et al. (2023) demonstrated that self-supervised learning inherently promotes the clustering of samples based on semantic labels. Intriguingly, this clustering is driven by the objective’s regularization term and aligns with semantic classes across multiple hierarchical levels.
A possible definition of an effective representation is based on minimal sufficient statistics.
Given $(X,Y)\sim P(X,Y)$, let $T:=t(X)$, where $t$ is a deterministic function. We define $T$ as a sufficient statistic of $X$ for $Y$ if $Y-T-X$ forms a Markov chain.
A sufficient statistic captures all the information about $Y$ in $X$. Cover (1999) proved this property:
Let $T$ be a probabilistic function of $X$. Then, $T$ is a sufficient statistic for $Y$ if and only if $I(T(X);Y)=I(X;Y)$.
However, the sufficiency definition also encompasses trivial identity statistics that only “copy” rather than “extract” essential information. To prevent statistics from inefficiently utilizing observations, the concept of minimal sufficient statistics was introduced:
(Minimal sufficient statistic (MSS)) A sufficient statistic $T$ is minimal if, for any other sufficient statistic $S$, there exists a function $f$ such that $T=f(S)$ almost surely (a.s.).
In essence, MSS are the simplest sufficient statistics, inducing the coarsest sufficient partition on $X$. In MSS, the values of $X$ are grouped into as few partitions as possible without sacrificing information. MSS are statistics with the maximum information about $Y$ while retaining as little information about $X$ as possible (Koopman, 1936).
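The difference between a sufficient statistic and a minimal one can be checked numerically on a small discrete example. The sketch below is illustrative only: the toy joint distribution and the three statistics (`identity`, `grouped`, `parity`) are assumptions introduced here, not taken from the cited works. It verifies that grouping inputs with identical $P(Y\mid X=x)$ preserves $I(X;Y)$ while using fewer bits than the identity map, whereas a partition that mixes different conditionals loses the relevant information.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def push_forward(joint_xy, t):
    """Joint distribution of (t(X), Y) for a deterministic statistic t."""
    out = defaultdict(float)
    for (x, y), p in joint_xy.items():
        out[(t(x), y)] += p
    return dict(out)

def statistic_entropy(joint_xy, t):
    """H(t(X)) in bits; for a deterministic t this equals I(X; t(X))."""
    pt = defaultdict(float)
    for (x, _y), p in joint_xy.items():
        pt[t(x)] += p
    return -sum(p * math.log2(p) for p in pt.values() if p > 0)

# Hypothetical toy joint: X is uniform on {0, 1, 2, 3} and
# P(Y=1 | X=x) is 0.9 for x in {0, 1} and 0.2 for x in {2, 3}.
p_y1_given_x = {0: 0.9, 1: 0.9, 2: 0.2, 3: 0.2}
joint_xy = {}
for x, q in p_y1_given_x.items():
    joint_xy[(x, 1)] = 0.25 * q
    joint_xy[(x, 0)] = 0.25 * (1.0 - q)

def identity(x):
    return x          # sufficient, but only "copies" X

def grouped(x):
    return x // 2     # merges inputs that share the same P(Y | X=x)

def parity(x):
    return x % 2      # merges inputs with different P(Y | X=x)

i_xy = mutual_information(joint_xy)
i_identity = mutual_information(push_forward(joint_xy, identity))
i_grouped = mutual_information(push_forward(joint_xy, grouped))
i_parity = mutual_information(push_forward(joint_xy, parity))
h_identity = statistic_entropy(joint_xy, identity)
h_grouped = statistic_entropy(joint_xy, grouped)
```

In this toy case both `identity` and `grouped` satisfy the sufficiency condition, but `grouped` is a function of `identity` and induces the coarser partition, which is the sense in which it is minimal.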
The majority of distributions lack exact minimal sufficient statistics, leading Tishby et al. (1999b) to relax the optimization problem in two ways: (i) allowing the map to be stochastic, defined as an encoder $P(T\mid X)$, and (ii) permitting the capture of only part of $I(X;Y)$. The information bottleneck (IB) was introduced as a principled method to extract relevant information from observed signals related to a target. This framework finds the optimal trade-off between accuracy and complexity when predicting a random variable $y\in\mathcal{Y}$ from a random variable $x\in\mathcal{X}$ that shares a joint distribution with it. The IB has been employed in various fields such as neuroscience (Buesing and Maass, 2010; Palmer et al., 2015), slow feature analysis (Turner and Sahani, 2007), speech recognition (Hecht et al., 2009), and deep learning (Shwartz-Ziv and Tishby, 2017; Alemi et al., 2016).
Let $X$ be an input random variable, $Y$ a target variable, and $P(X,Y)$ their joint distribution. A representation $T$ is a stochastic function of $X$ defined by a mapping $P(T\mid X)$. This mapping transforms $X\sim P(X)$ into a representation $T\sim P(T):=\int P_{T\mid X}(\cdot\mid x)\,dP_{X}(x)$. The triple $Y-X-T$ forms a Markov chain in that order with respect to the joint probability measure $P_{X,Y,T}=P_{X,Y}P_{T\mid X}$, which also determines the mutual information terms $I(X;T)$ and $I(Y;T)$.
Within the IB framework, our goal is to find a representation $P(T\mid X)$ that extracts as much information as possible about $Y$ (high performance) while compressing $X$ maximally (keeping $I(X;T)$ small). This can also be interpreted as extracting only the relevant information that $X$ contains about $Y$.
The data processing inequality (DPI) implies that $I(Y;T)\leq I(X;Y)$, so the compressed representation $T$ cannot convey more information than the original signal. Consequently, there is a trade-off between compressing the representation and preserving relevant information about $Y$. The construction of an efficient representation variable is characterized by its encoder and decoder distributions, $P(T\mid X)$ and $P(Y\mid T)$, respectively. The efficient representation of $X$ involves minimizing the complexity of the representation $I(T;X)$ while maximizing $I(T;Y)$. Formally, the IB optimization involves minimizing the following objective function:
$$\mathcal{L}\left[P(T\mid X)\right]=I(X;T)-\beta\, I(Y;T) \tag{1}$$
where $\beta$ is the trade-off parameter controlling the complexity of $T$ and the amount of relevant information it preserves. Intuitively, we pass the information that $X$ contains about $Y$ through a “bottleneck” via the representation $T$. It has been shown that:
Information theory traditionally assumes that underlying probabilities are known and do not require learning. For instance, the optimality of the initial IB work (Tishby et al., 1999b) relied on the assumption that the joint distribution of input and labels is known. However, a significant challenge in machine learning algorithms is inferring an accurate predictor for the unknown target variable from observed realizations. This discrepancy raises questions about the practical optimality of the IB and its relevance in modern learning algorithms. The following section delves into the relationship between the IB framework and learning, inference, and generalization.
Let $X\in\mathcal{X}$ and a target variable $Y\in\mathcal{Y}$ be random variables with an unknown joint distribution $P(X,Y)$. For a given class of predictors $f:\mathcal{X}\to\hat{\mathcal{Y}}$ and a loss function $\ell:\mathcal{Y}\times\hat{\mathcal{Y}}\to\mathbb{R}$ measuring discrepancies between true values and model predictions, our objective is to find the predictor $f$ that minimizes the expected population risk:
$$\mathcal{L}_{P(X,Y)}(f,\ell)=\mathbb{E}_{P(X,Y)}\left[\ell\left(Y,f(X)\right)\right].$$
Several issues arise with the population risk. Firstly, it remains unclear which loss function is optimal. A popular choice is the logarithmic loss (or error’s entropy), which has been numerically demonstrated to yield better results (Erdogmus, 2002). This loss has been employed in various algorithms, including the InfoMax principle (Linsker, 1988), tree-based algorithms (Quinlan, 2014), deep neural networks (Zhang and Sabuncu, 2018), and Bayesian modeling (Wenzel et al., 2020). Painsky and Wornell (2018) provided a rigorous justification for using the logarithmic loss and showed that, for binary classification problems, it upper-bounds any loss function that is smooth, proper, and convex.
In most cases, the joint distribution $P(X,Y)$ is unknown, and we have access to only $n$ samples from it, denoted by $\mathcal{D}_{n}:=\{(x_{i},y_{i})\mid i=1,\ldots,n\}$. Consequently, the population risk cannot be computed directly. Instead, we typically choose the predictor that minimizes the empirical risk on a training dataset:
$$\hat{\mathcal{L}}(f,\ell,\mathcal{D}_{n})=\frac{1}{n}\sum_{i=1}^{n}\ell\left(y_{i},f(x_{i})\right).$$
The generalization gap, defined as the difference between the population and empirical risks, is given by:
$$\mathrm{Gen}(\mathcal{D}_{n}):=\mathbb{E}_{P(X,Y)}\left[\ell\left(Y,f(X)\right)\right]-\hat{\mathcal{L}}(f,\ell,\mathcal{D}_{n}).$$
Interestingly, the relationship between the true loss and the empirical loss can be bounded using the information bottleneck term. Shamir et al. (2010) developed several finite sample bounds for the generalization gap. According to their study, the IB framework exhibited good generalizability even with small sample sizes. In particular, they developed non-uniform bounds adaptive to the model’s complexity. They demonstrated that for the discrete case, the error in estimating mutual information from finite samples is bounded by $O\left(\frac{|X|\log n}{\sqrt{n}}\right)$, where $|X|$ is the cardinality of $X$ (the number of possible values that the random variable $X$ can take). The results support the intuition that simpler models generalize better, which motivates compressing the model. Therefore, optimizing eq. 1 presents a trade-off between two opposing forces. On the one hand, we want to increase our prediction accuracy on the training data (high $\beta$).
On the other hand, we would like to decrease $\beta$ to narrow the generalization gap. Vera et al. (2018) extended this work and showed that the generalization gap is bounded by the square root of the mutual information between training input and model representation times $\frac{\log n}{n}$. Furthermore, Russo and Zou (2019) and Xu and Raginsky (2017) demonstrated that the square root of the mutual information between the training input and the parameters inferred from the training algorithm provides a concise bound on the generalization gap. However, these bounds critically depend on the Markov operator that maps the training set to the network parameters, whose characterization is not trivial.
Achille and Soatto (2018) explored how applying the IB objective to the network’s parameters may reduce overfitting while maintaining invariant representations. Their work showed that flat minima, which have better generalization properties, bound the information in the weights, and that the information in the weights in turn bounds the information in the activations. Chelombiev et al. (2019) found that generalization accuracy is positively correlated with the degree of compression of the last layer in the network. Shwartz-Ziv et al. (2018) showed that the generalization error depends exponentially on the mutual information between the model and the input once it is smaller than $\log 2n$, the query sample complexity. Moreover, they demonstrated that $M$ bits of compression of $X$ are equivalent to an exponential factor of $2^{M}$ training examples. Piran et al. (2020) extended the original IB to the dual form, which offers several advantages in terms of compression.
These studies illustrate that the IB leads to a trade-off between prediction and complexity, even for the empirical distribution. With the IB objective, we can design estimators to find optimal solutions for different regimes with varying performance, complexity, and generalization.
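To make the trade-off governed by $\beta$ tangible, the following sketch evaluates an objective of the form $I(X;T)-\beta I(T;Y)$ for a few fixed discrete encoders. It is a toy illustration under stated assumptions: the joint distribution `p_xy` and the three candidate encoders (`identity`, `grouped`, `constant`) are hypothetical, and no optimization over encoders is performed, unlike in the iterative IB algorithm.

```python
import math

def mi(joint):
    """Mutual information in bits of a joint distribution stored as joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for a, row in enumerate(joint)
               for b, p in enumerate(row) if p > 0)

def ib_lagrangian(p_xy, enc, beta):
    """Return (I(X;T) - beta * I(T;Y), I(X;T), I(T;Y)) for enc[x][t] = P(T=t | X=x)."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(enc[0])
    p_x = [sum(row) for row in p_xy]
    p_xt = [[p_x[x] * enc[x][t] for t in range(n_t)] for x in range(n_x)]
    # P(t, y) = sum_x P(x, y) P(t | x): T depends on Y only through X.
    p_ty = [[sum(p_xy[x][y] * enc[x][t] for x in range(n_x)) for y in range(n_y)]
            for t in range(n_t)]
    i_xt, i_ty = mi(p_xt), mi(p_ty)
    return i_xt - beta * i_ty, i_xt, i_ty

# Hypothetical toy joint: X uniform on four values, P(Y=1 | X) in (0.9, 0.9, 0.2, 0.2).
p_xy = [[0.25 * (1.0 - q), 0.25 * q] for q in (0.9, 0.9, 0.2, 0.2)]

# Three hand-picked deterministic encoders, written as stochastic matrices.
identity = [[1.0 if t == x else 0.0 for t in range(4)] for x in range(4)]
grouped = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
constant = [[1.0] for _ in range(4)]
candidates = {"identity": identity, "grouped": grouped, "constant": constant}

def best_encoder(beta):
    """Name of the candidate encoder with the lowest objective value for this beta."""
    return min(candidates, key=lambda name: ib_lagrangian(p_xy, candidates[name], beta)[0])
```

With these assumed inputs, a small $\beta$ favors the fully compressed `constant` encoder and a large $\beta$ favors `grouped`, which keeps all of $I(X;Y)$ at half the rate of `identity`. The third returned value never exceeds $I(X;Y)$, as the data processing inequality requires.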
Before delving into the details, this section provides an overview of the information-theoretic objectives in various learning scenarios, including supervised, unsupervised, and self-supervised settings. We also introduce a general framework to better understand the process of learning optimal representations and explore recent methods working towards this goal.
Developing a novel algorithm entails numerous aspects, such as architecture, initialization parameters, learning algorithms, and pre-processing techniques. A crucial element, however, is the objective function. As demonstrated in Section 2.4.2, the IB approach, originally introduced by Tishby et al. (1999b), defines the optimal representation in supervised scenarios, enabling us to identify which terms to compress during learning. However, determining the optimal representation and deriving information-based objective functions in self-supervised settings are more challenging. In this section, we introduce a general framework to understand the process of learning optimal representations and explore recent methods striving to achieve this goal.
Using a two-channel input allows us to model complex multiview learning problems. In many real-world situations, data can be observed from multiple perspectives or modalities, making it essential to develop learning algorithms capable of handling such multiview data.
Consider a two-channel input, $X_{1}$ and $X_{2}$, and a single-channel label $Y$ for a downstream task, all possessing a joint distribution $P(X_{1},X_{2},Y)$. We assume the availability of $n$ labeled examples $S=\{(x^{i}_{1},x^{i}_{2},y^{i})\}^{n}_{i=1}$ and $t$ unlabeled examples $U=\{(x^{i}_{1},x^{i}_{2})\}^{n+t}_{i=n+1}$, both independently and identically distributed. Our objective is to predict $Y$ using a loss function.
In our model, we use a learned encoder with a prior $P(Z)$ to generate a conditional representation (which may be deterministic or stochastic) $Z_{i}\mid X_{i}=P_{\theta_{i}}(Z_{i}\mid X_{i})$, where $i=1,2$ represents the two views. Subsequently, we utilize various decoders to “decode” distinct aspects of the representation:
For the supervised scenario, we have a joint embedding of the label classifiers from both views, $\hat{Y}_{1,2}=Q_{\rho}(Y\mid Z_{1},Z_{2})$, and two decoders predicting the labels of the downstream task based on each individual view, $\hat{Y}_{i}=Q_{\rho_{i}}(Y\mid Z_{i})$ for $i=1,2$.
For the unsupervised case, we have direct decoders for input reconstruction from the representation, $\bar{X}_{i}=Q_{\psi_{i}}(X_{i}\mid Z_{i})$ for $i=1,2$.
For self-supervised learning, we utilize two cross-decoders attempting to predict one representation based on the other, $\tilde{Z}_{1}\mid Z_{2}=q_{\eta_{1}}(Z_{1}\mid Z_{2})$ and $\tilde{Z}_{2}\mid Z_{1}=q_{\eta_{2}}(Z_{2}\mid Z_{1})$. Figure 1 illustrates this structure.
The information-theoretic perspective of self-supervised networks has led to confusion regarding the information being optimized in recent work. In supervised and unsupervised learning, only one “information path” exists when optimizing information-theoretic terms: the input is encoded through the network, and then the representation is decoded and compared to the targets. As a result, the representation and corresponding information always stem from a single encoder and decoder.
However, in the self-supervised multiview scenario, we can construct our representation using various encoders and decoders. For instance, to define the information involved in $I(X_{1};Z_{1})$, we need to specify the associated random variable. This variable could be based either on the encoder of $X_{1}$, $P_{\theta_{1}}(Z_{1}\mid X_{1})$, or on the encoder of $X_{2}$, $P_{\theta_{2}}(Z_{2}\mid X_{2})$, whose output is subsequently passed to the cross-decoder $Q_{\eta_{1}}(Z_{1}\mid Z_{2})$ and then to the direct decoder $Q_{\psi_{1}}(X_{1}\mid Z_{1})$.
To fully understand the information terms we aim to optimize, and to distinguish between the various “information paths,” we mark each information path differently. For example, $I_{P(X_{1}),P(Z_{1}\mid X_{1}),P(Z_{2}\mid Z_{1})}\left(X_{1};Z_{2}\right)$ is based on the path $P(X_{1})\to P(Z_{1}\mid X_{1})\to P(Z_{2}\mid Z_{1})$. In the following section, we will “translate” previous work into our present framework and examine the loss function.
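The notion of an information path can be illustrated for discrete variables by composing the conditional distributions along the path and computing the mutual information between its two endpoints. The sketch below is a minimal illustration, assuming hypothetical binary variables and hand-picked channel matrices; `path_information` and `compose` are names introduced here for the example.

```python
import math

def compose(first, second):
    """Chain two channels: result[a][c] = sum_b first[a][b] * second[b][c]."""
    return [[sum(row[b] * second[b][c] for b in range(len(second)))
             for c in range(len(second[0]))] for row in first]

def path_information(p_source, *channels):
    """Mutual information (bits) between the source and the end of a chain of channels."""
    channel = channels[0]
    for nxt in channels[1:]:
        channel = compose(channel, nxt)
    n_out = len(channel[0])
    p_out = [sum(p_source[a] * channel[a][c] for a in range(len(p_source)))
             for c in range(n_out)]
    return sum(p_source[a] * channel[a][c] * math.log2(channel[a][c] / p_out[c])
               for a in range(len(p_source)) for c in range(n_out)
               if p_source[a] * channel[a][c] > 0)

# Hypothetical binary example (all numbers are assumptions for illustration).
p_x1 = [0.5, 0.5]                          # P(X1)
encoder_1 = [[0.9, 0.1], [0.1, 0.9]]       # P(Z1 | X1)
cross_decoder = [[0.8, 0.2], [0.2, 0.8]]   # P(Z2 | Z1)

# Path P(X1) -> P(Z1 | X1): information between X1 and Z1.
i_x1_z1 = path_information(p_x1, encoder_1)
# Path P(X1) -> P(Z1 | X1) -> P(Z2 | Z1): information between X1 and Z2.
i_x1_z2 = path_information(p_x1, encoder_1, cross_decoder)
```

Because each additional stage can only lose information about the source, the longer path yields a smaller value than the direct one, which is why the same symbol $I(X_{1};Z_{2})$ can denote different quantities depending on the path used to define it.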
After establishing our framework, we can now incorporate various learning algorithms. We begin by examining classical single-view supervised information bottleneck algorithms for deep networks that utilize labeled data during training and extend them to the multiview scenario. Next, we broaden our perspective to include unsupervised learning, where input reconstruction replaces labels, and semi-supervised learning, where information-based regularization is applied to improve predictions.
In classical single-view supervised learning, the task of representation learning involves finding a distribution $p(z\mid x)$ that maps data observations $x\in\mathcal{X}$ to a representation $z\in\mathcal{Z}$, capturing only the relevant features of the input (Shwartz-Ziv, 2022). The goal is to predict a label $y\in\mathcal{Y}$ using the learned representation. Achille and Soatto (2018) defined the sufficiency of $Z$ for $Y$ as the amount of label information retained after passing data through the encoder:
Federici et al. (2020) showed that $Z$ is sufficient for $Y$ if and only if the amount of information regarding the task remains unchanged by the encoding procedure. A sufficient representation can predict $Y$ as accurately as the original data $X$. In Section 2.4, we saw a trade-off between prediction and generalization when there is a finite amount of data. To reduce the generalization gap, we aim to compress $X$ while retaining as much predictive information about the labels as possible. Thus, we relax the sufficiency definition and minimize the following objective:
The mutual information $I(Y;Z)$ determines how much label information is accessible and thus reflects the model’s predictive performance on the target task. $I(X;Z)$ represents the information that $Z$ carries about the input, which we aim to compress. However, $I(X;Z)$ contains both relevant and irrelevant information about $Y$. Therefore, using the chain rule of information, Federici et al. (2020) proposed splitting $I(X;Z)$ into two terms:
The conditional information $I(X;Z\mid Y)$ represents information in $Z$ that is not predictive of $Y$, i.e., superfluous information. The decomposition of input information enables us to compress only irrelevant information while preserving the relevant information for predicting $Y$. Several methods are available for evaluating and estimating these information-theoretic terms in the supervised case (see Section 5 for details).
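The split of the input information into a superfluous and a predictive part follows from applying the chain rule of mutual information in two orders. The derivation below is a sketch that assumes the representation $Z$ is computed from $X$ alone, so that $Y-X-Z$ forms a Markov chain:

```latex
\begin{align}
I(Z; X, Y) &= I(Z; X) + I(Z; Y \mid X) \\
           &= I(Z; Y) + I(Z; X \mid Y), \\
I(Z; Y \mid X) &= 0
  \quad \text{(since } Y - X - Z \text{ is a Markov chain)}, \\
\Rightarrow \quad I(X; Z) &= \underbrace{I(X; Z \mid Y)}_{\text{superfluous}}
                           + \underbrace{I(Y; Z)}_{\text{predictive}} .
\end{align}
```

Under this assumption, reducing $I(X;Z\mid Y)$ while holding $I(Y;Z)$ fixed lowers $I(X;Z)$ without sacrificing label information.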
The IB hypothesis for deep learning proposes two distinct phases of training neural networks (Shwartz-Ziv and Tishby, 2017): the fitting and compression phases. The fitting phase involves extracting information from the input and converting it into learned representations, characterized by increased mutual information between inputs and hidden representations. Conversely, the compression phase, which is much longer, concentrates on discarding information that is unnecessary for target prediction, decreasing the mutual information between learned representations and inputs. Meanwhile, the mutual information between representations and targets increases. For more information, see Geiger (2020). Despite the elegance and plausibility of the IB hypothesis, empirically investigating it remains challenging (Amjad and Geiger, 2018).
The study of representation compression in Deep Neural Networks (DNNs) for supervised learning has shown inconsistent results. For instance, Chelombiev et al. (2019) discovered a positive correlation between generalization accuracy and the compression level of the network’s final layer. Shwartz-Ziv et al. (2018) also examined the relationship between generalization and compression, demonstrating that generalization error exponentially depends on mutual information, $I(X;Z)$. Furthermore, Achille et al. (2017) established that flat minima, known for their improved generalization properties, constrain the mutual information. However, Saxe et al. (2019) showed that compression was not necessary for generalization in deep linear networks. Basirat et al. (2021) revealed that the decrease in mutual information is essentially equivalent to geometrical compression. Other studies have found that the mutual information between training inputs and inferred parameters provides a concise bound on the generalization gap (Xu and Raginsky, 2017; Pensia et al., 2018). Lastly, Achille and Soatto (2018) explored using an information bottleneck objective on network parameters to prevent overfitting and promote invariant representations.
The IB principle offers a rigorous method for learning encoders and decoders in supervised single-view problems. However, it is not directly applicable to multiview learning problems, as it assumes only one information source as the input. A common solution is to concatenate multiple views, though this neglects the unique characteristics of each view. To address this issue, Xu et al. (2014) introduced the large-margin multiview IB (LMIB) as an extension of the original IB problem. LMIB employs a communication system where multiple senders represent various views of examples. The system extracts specific components from different senders by compressing examples through a “bottleneck,” and the linear projectors for each view are combined to create a shared representation. The large-margin principle replaces the maximization of mutual information in prediction, emphasizing the separation of samples from different classes. Limiting Rademacher complexity improves the solution’s accuracy and generalization error bounds. Moreover, the algorithm’s robustness is enhanced when accurate views counterbalance noisy views.
However, the LMIB method has a significant limitation: it utilizes linear projections for each view, which can restrict the combined representation when the relationship between different views is complex. To overcome this limitation, Wang et al. (2019) proposed using deep neural networks to replace linear projectors. Their model first extracts concise latent representations from each view using deep networks and then learns the joint representation of all views using neural networks. They minimize the objective:
Here, $\alpha$ and $\beta$ are trade-off parameters, $Z_{1}$ and $Z_{2}$ are the two neural networks’ representations, and $Z_{1,2}$ is the joint embedding of $Z_{1}$ and $Z_{2}$. The first two terms decrease the mutual information between a view’s latent representation and its original data representation, resulting in a simpler and more generalizable model. The final term forces the joint representation to maximize the discrimination ability for the downstream task.
Obtaining labeled data can be challenging or expensive in many practical scenarios, while many unlabeled samples may be readily available. Semi-supervised learning addresses this issue by leveraging the vast amount of unlabeled data during training in conjunction with a small set of labeled samples. Common strategies to achieve this involve adding regularization terms or adopting mechanisms that promote better generalization. Berthelot et al. (2019) grouped regularization methods into three primary categories: entropy minimization, consistency regularization, and generic regularization.
Voloshynovskiy et al. (2020) introduced an information-theoretic framework for semi-supervised learning based on the IB principle. In this context, the semi-supervised classification problem involves encoding input $X$ into the latent space $Z$ while preserving only class-relevant information. A supervised classifier can achieve this if there is sufficient labeled data. However, when the number of labeled examples is limited, the standard label classifier $p(y\mid z)$ becomes unreliable and requires regularization.
To tackle this issue, the authors assumed a prior on the class label distribution $p(y)$. They introduced a term to minimize the $D_{KL}$ divergence between the assumed marginal prior and the empirical marginal prior, effectively regularizing the conditional label classifier with the labels’ marginal distribution. This approach reduces the classifier’s sensitivity to the scarcity of labeled examples. They proposed two variational IB semi-supervised extensions for the priors:
Handcrafted Priors: These priors are predefined for regularization and can be based on domain knowledge or statistical properties of the data. Alternatively, they can be learned using other networks. Handcrafted priors in this context are similar to priors used in the Variational Information Bottleneck (VIB) formalism (Alemi et al., 2016; Wang et al., 2019).
Learnable Priors: Voloshynovskiy et al. (2020) also suggest using learnable priors as an alternative to handcrafted regularization priors on the latent representation. This method involves regularizing $Z$ through another IB-based regularization with two components: (i) latent space regularization and (ii) observation space regularization. In this case, an additional hidden variable $M$ is introduced after the representation to regulate the information flow between $Z$ and $Y$. An auto-encoder $q(m\mid z)$ is employed, and the optimization process aims to compress the information flowing from $Z$ to $M$ while retaining only label-relevant information. The IB objective is defined as:
Here, $\beta$ and $\beta_{y}$ are hyperparameters that balance the trade-off between the relevance of $M$ to the labels and the compression of $Z$ into $M$.
Furthermore, Voloshynovskiy et al. (2020) demonstrated that various popular semi-supervised methods can be considered special cases of the optimization problem described above. Notably, the semi-supervised AAE (Makhzani et al., 2015), CatGAN (Springenberg, 2015), SeGMA (Smieja et al., 2019), and VAE (Kingma et al., 2014) can all be viewed as specific instantiations of this framework.
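The class-prior regularizer described earlier, a KL term between an assumed label marginal and the empirical one, can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the empirical marginal is estimated here by averaging classifier outputs over a batch, and both that estimator and the direction of the divergence (assumed prior first) are assumptions made for the example.

```python
import math

def empirical_label_marginal(predictions):
    """Average the classifier outputs p(y | z_i) over a batch to estimate p(y)."""
    n = len(predictions)
    n_classes = len(predictions[0])
    return [sum(p[k] for p in predictions) / n for k in range(n_classes)]

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats; q is clipped at eps (a numerical safeguard, assumed here)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def prior_regularizer(assumed_prior, predictions):
    """KL between the assumed class prior and the batch-averaged predictions."""
    return kl_divergence(assumed_prior, empirical_label_marginal(predictions))

# Hypothetical two-class batches of classifier outputs.
assumed_prior = [0.5, 0.5]
balanced_batch = [[0.9, 0.1], [0.1, 0.9]]    # predictions spread over both classes
collapsed_batch = [[0.9, 0.1], [0.8, 0.2]]   # predictions concentrated on class 0

reg_balanced = prior_regularizer(assumed_prior, balanced_batch)
reg_collapsed = prior_regularizer(assumed_prior, collapsed_batch)
```

The penalty vanishes when the averaged predictions match the assumed prior and grows when the classifier collapses onto one class, which is the regularizing effect the prior is meant to provide when labels are scarce.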
In the unsupervised setting, data samples are not directly labeled by classes. Voloshynovskiy et al. (2020) defined unsupervised IB as a “compressed” parameterized mapping of $X$ to $Z$, which preserves some information in $Z$ about $X$ through the reverse decoder $\bar{X}=Q(X\mid Z)$. Therefore, the Lagrangian of unsupervised IB can be defined as follows:
where $I(X;Z)$ is the information determined by the encoder $q(z\mid x)$ and $I(Z;\bar{X})$ is the information determined by the decoder $q(x\mid z)$, i.e., the reconstruction error. In other words, unsupervised IB is a special case of supervised IB, where labels are replaced with the reconstruction performance of the training input. Alemi et al. (2016) showed that Variational Autoencoder (VAE) (Kingma and Welling, 2019) and $\beta$-VAE (Higgins et al., 2017) are special cases of unsupervised variational IB. Voloshynovskiy et al. (2020) extended their results and showed that many models, including adversarial autoencoders (Makhzani et al., 2015), InfoVAEs (Zhao et al., 2017c), and VAE/GANs (Larsen et al., 2016), could be viewed as special cases of unsupervised IB. These models differ mainly in the bounds they use for the mutual information terms of the IB. Furthermore, unsupervised IB was used by Uğur et al. (2020) to derive lower bounds for their unsupervised generative clustering framework, while Roy et al. (2018) used it to study vector-quantized autoencoders.
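The connection to $\beta$-VAE can be illustrated with a per-example loss in which a reconstruction term stands in for the decoder information and a KL term upper-bounds the encoder information. The sketch below rests on several assumptions not fixed by the text: a diagonal Gaussian encoder, a standard normal prior, squared error as the reconstruction term, and $\beta$ multiplying the compression term as in the $\beta$-VAE convention. The function names are hypothetical.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def squared_error(x, x_hat):
    """Reconstruction term; matches a Gaussian decoder log-likelihood up to scale and constants."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def unsupervised_vib_loss(x, x_hat, mu, log_var, beta):
    """Reconstruction error plus beta times the rate (KL) term, for one example."""
    return squared_error(x, x_hat) + beta * gaussian_kl_to_standard_normal(mu, log_var)

# Hypothetical example: a 2-d input, its reconstruction, and a 1-d Gaussian code.
x, x_hat = [1.0, 0.0], [0.5, 0.0]
mu, log_var = [1.0], [0.0]
loss_beta_1 = unsupervised_vib_loss(x, x_hat, mu, log_var, beta=1.0)
loss_beta_4 = unsupervised_vib_loss(x, x_hat, mu, log_var, beta=4.0)
```

Setting `beta=1.0` gives a VAE-style loss under these assumptions, while larger values put more weight on compressing the code at the expense of reconstruction quality.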
Voloshynovskiy et al. (2020) pointed out that for the classification task in supervised IB, the latent space $Z$ should be a sufficient statistic for $Y$, whose entropy is much lower than that of $X$. This results in a highly compressed representation where sequences close in the input space might be close in the latent space, and the less significant features will be compressed. In contrast, in the unsupervised setup, the IB suggests compressing the input to the encoded representation so that each input sequence can be decoded uniquely. In this case, the latent space’s entropy should correspond to the input space’s entropy, and compression is much more difficult.
How can we learn without labels and still achieve good predictive power? Is compression necessary to obtain an optimal representation? This section analyzes and discusses how to achieve optimal representation for self-supervised learning when labels are not available during training. We review recent methods for self-supervised learning and show how they can be integrated into a single framework. We compare their objective functions, implicit assumptions, and theoretical challenges. Finally, we consider the information-theoretic properties of these representations, their optimality, and different ways of learning them.
One approach to enhance deep learning methods is to apply the InfoMax principle in a multiview setting (Linsker, 1988; Wiskott and Sejnowski, 2002). As one of the earliest approaches, Linsker (1988) proposed maximizing information transfer from input data to its latent representation, showing its equivalence to maximizing the determinant of the output covariance under the Gaussian distribution assumption. Becker and Hinton (1992) introduced a representation learning approach based on maximizing an approximation of the mutual information between alternative latent vectors obtained from the same image. The most well-known application is the Independent Component Analysis (ICA) Infomax algorithm (Bell and Sejnowski, 1995), designed to separate independent sources from their linear combinations. The ICA-Infomax algorithm aims to maximize the mutual information between mixtures and source estimates while imposing statistical independence among outputs. The Deep Infomax approach (Hjelm et al., 2018) extends this idea to unsupervised feature learning by maximizing the mutual information between input and output while matching a prior distribution for the representations. Recent work has applied this principle to a self-supervised multiview setting (Hjelm et al., 2018; Henaff, 2020; Bachman et al., 2019; Tian et al., 2020a), wherein these works maximize the mutual information between the views $Z_{1}$ and $Z_{2}$ using the classifier $q(z_{1}\mid z_{2})$, which attempts to predict one representation from the other.
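One common way to realize a classifier that predicts one representation from the other is a contrastive, InfoNCE-style loss, in which each representation must identify its matching partner among the other items in a batch. The sketch below uses that technique as a stand-in and is not claimed to be the exact estimator of any cited work. The dot-product critic, the `temperature` parameter, and the toy vectors are assumptions. In expectation, $\log N$ minus this loss lower-bounds the mutual information between the two representations, in nats.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(z1, z2, temperature=1.0):
    """Mean cross-entropy of picking the matching z2[i] for each z1[i] among all z2[j]."""
    n = len(z1)
    loss = 0.0
    for i in range(n):
        logits = [dot(z1[i], z2[j]) / temperature for j in range(n)]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_norm - logits[i]
    return loss / n

def mi_lower_bound(z1, z2, temperature=1.0):
    """log N - InfoNCE loss: a batch estimate of the contrastive bound, at most log N."""
    return math.log(len(z1)) - info_nce(z1, z2, temperature)

# Hypothetical paired representations of two views of the same two samples.
view_1 = [[3.0, 0.0], [0.0, 3.0]]
view_2_aligned = [[3.0, 0.0], [0.0, 3.0]]    # matching pairs agree
view_2_shuffled = [[0.0, 3.0], [3.0, 0.0]]   # matching pairs disagree

loss_aligned = info_nce(view_1, view_2_aligned)
loss_shuffled = info_nce(view_1, view_2_shuffled)
bound_aligned = mi_lower_bound(view_1, view_2_aligned)
```

Since the loss is non-negative, the estimate cannot exceed $\log N$, so the batch size caps how much shared information this kind of objective can certify.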
However, Tschannen et al. (2019) demonstrated that the effectiveness of InfoMax models is more attributable to the inductive biases introduced by the architecture and estimators than to the training objectives themselves, as the InfoMax objectives can be trivially maximized using invertible encoders. Moreover, a fundamental issue with the InfoMax principle is that it retains information that is irrelevant to the labels, contradicting the core concept of the IB principle, which advocates compressing the representation to enhance generalizability.
To resolve this problem, Sridharan and Kakade (2008) proposed the multiview IB framework. According to this framework, in the multiview setting without labels, the IB principle of preserving relevant data while compressing irrelevant data requires assumptions regarding the relationship between views and labels. They presented the MultiView assumption, which asserts that either view would be (approximately) sufficient for downstream tasks. Under this assumption, they define the relevant information as the shared information between the views. Therefore, augmentations (such as changing the image style) should not affect the labels.
Additionally, the views will provide most of the information in the input regarding downstream tasks. We improve generalization without affecting performance by compressing the information not shared between the two views. Their formulation is as follows:
The MultiView Assumption: There exists an $\epsilon_{\text{info}}$ (which is assumed to be small) such that
$$I(Y;X_{1}\mid X_{2})\leq\epsilon_{\text{info}},\qquad I(Y;X_{2}\mid X_{1})\leq\epsilon_{\text{info}}.$$
As a result, when the information sharing parameter, $\epsilon_{\text{info}}$, is small, the information shared between views includes task-relevant details. For instance, in self-supervised contrastive learning for visual data (Hjelm et al., 2018), views represent various augmentations of the same image. In this scenario, the MultiView assumption is considered mild if the downstream task remains unaffected by the augmentation (Geiping et al., 2022). Image augmentations can be perceived as altering an image’s style without changing its content. Thus, Tsai et al. (2020) contend that the information required for downstream tasks should be preserved in the content rather than the style. This assumption allows us to separate the information into relevant (shared information) and irrelevant (not shared) components and to compress only the unimportant details that do not contain information about downstream tasks. Based on this assumption, we aim to maximize the relevant information $I(X_{2};Z_{1})$ and minimize $I(X_{1};Z_{1}\mid X_{2})$, the exclusive information that $Z_{1}$ contains about $X_{1}$, which cannot be predicted by observing $X_{2}$. This irrelevant information is unnecessary for the prediction task and can be discarded. In the extreme case, where $X_{1}$ and $X_{2}$ share only label information, this approach recovers the supervised IB method without labels. Conversely, if $X_{1}$ and $X_{2}$ are identical, this method collapses into the InfoMax principle, as no information can be safely discarded.
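The separation into shared and exclusive information can be made explicit with the chain rule of mutual information. The derivation below is a sketch that assumes $Z_{1}$ is computed from $X_{1}$ alone, so that $Z_{1}-X_{1}-X_{2}$ forms a Markov chain:

```latex
\begin{align}
I(Z_1; X_1, X_2) &= I(Z_1; X_1) + I(Z_1; X_2 \mid X_1) \\
                 &= I(Z_1; X_2) + I(Z_1; X_1 \mid X_2), \\
I(Z_1; X_2 \mid X_1) &= 0
  \quad \text{(since } Z_1 - X_1 - X_2 \text{ is a Markov chain)}, \\
\Rightarrow \quad I(X_1; Z_1) &= \underbrace{I(X_2; Z_1)}_{\text{shared, relevant}}
                               + \underbrace{I(X_1; Z_1 \mid X_2)}_{\text{exclusive, irrelevant}} .
\end{align}
```

Under this assumption, compressing only the second term reduces $I(X_{1};Z_{1})$ without touching the information that $Z_{1}$ carries about the other view.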
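Federici et al. (2020) used the relaxed Lagrangian objective to obtain the minimal sufficient representation $Z_{1}$ for $X_{2}$ as:
$$ \mathcal{L}_{1}=I(Z_{1};X_{1}\mid X_{2})-\beta_{1}I(X_{2};Z_{1}), $$
and the symmetric loss $\mathcal{L}_{2}=I(Z_{2};X_{2}\mid X_{1})-\beta_{2}I(X_{1};Z_{2})$ to obtain the minimal sufficient representation $Z_{2}$ for $X_{1}$,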
where $\beta_{1}$ and $\beta_{2}$ are the Lagrangian multipliers introduced by the constrained optimization. By defining $Z_{1}$ and $Z_{2}$ on the same domain and re-parameterizing the Lagrangian multipliers, the average of the two loss functions can be upper bounded as:
$$ \mathcal{L}=-I_{P(Z_{1}\mid X_{1}),Q(Z_{2}\mid Z_{1})}(Z_{1};Z_{2})+\beta D_{\text{SKL}}\left[p(z_{1}\mid x_{1})\,\|\,p(z_{2}\mid x_{2})\right] $$
where $D_{\text{SKL}}$ represents the symmetrized KL divergence obtained by averaging the expected value of $D_{\text{KL}}(p(z_{1}\mid x_{1})\,\|\,p(z_{2}\mid x_{2}))$ and $D_{\text{KL}}(p(z_{2}\mid x_{2})\,\|\,p(z_{1}\mid x_{1}))$. Note that when the mapping from $X_{1}$ to $Z_{1}$ is deterministic, minimizing $I(Z_{1};X_{1}\mid X_{2})$ and minimizing $H(Z_{1}\mid X_{2})$ are interchangeable, and the algorithms of Federici et al. (2020) and Tsai et al. (2020) minimize the same objective. Another implementation of the same idea, proposed by Lee et al. (2021b), is based on the Conditional Entropy Bottleneck (CEB) algorithm (Fischer, 2020). This algorithm adds the residual information as a compression term to the InfoMax objective using the reverse decoders $q(z_{1}\mid x_{2})$ and $q(z_{2}\mid x_{1})$.
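For a single pair of views, the symmetrized KL term has a closed form when both encoders output diagonal Gaussians. That parameterization is our assumption for this sketch and is not stated in the text above. The helper names `kl_diag_gauss` and `skl_diag_gauss` are hypothetical, and in practice the quantity would be averaged over a batch of view pairs.

```python
import math

def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) in nats for diagonal Gaussians given as lists of means and variances."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total

def skl_diag_gauss(mu1, var1, mu2, var2):
    """Symmetrized KL: the average of KL(p1 || p2) and KL(p2 || p1)."""
    return 0.5 * (kl_diag_gauss(mu1, var1, mu2, var2)
                  + kl_diag_gauss(mu2, var2, mu1, var1))

# Illustrative encoder outputs for the two views of one sample.
mu_z1, var_z1 = [0.0, 0.0], [1.0, 1.0]
mu_z2, var_z2 = [1.0, 0.0], [1.0, 1.0]

print("SKL between the two posteriors:", skl_diag_gauss(mu_z1, var_z1, mu_z2, var_z2))
print("SKL of a posterior with itself:", skl_diag_gauss(mu_z1, var_z1, mu_z1, var_z1))
```

The term vanishes only when the two posteriors coincide, so minimizing it pushes the two encoders to agree on everything that is not needed to distinguish the views.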
In conclusion, all the algorithms mentioned above are based on the Multiview assumption. Utilizing this assumption, they can distinguish relevant information from irrelevant information. As a result, all these algorithms aim to maximize the information (or the predictive ability) of one representation with respect to the other view while compressing the information between each representation and its corresponding view. The key differences between these algorithms lie in the decomposition and implementation of these information terms.
Dubois et al. (2021) offer another theoretical analysis of the IB for self-supervised learning. Their work addresses the question of the minimum bit rate required to store the input while still achieving high performance on a family of downstream tasks $Y\in\mathcal{Y}$. It is a rate-distortion problem, where the goal is to find a compressed representation that gives a good prediction for every task. We require that the distortion measure is bounded:
$$ D_{\mathcal{T}}(X,Z)=\sup_{Y\in\mathcal{Y}}H(Y\mid Z_{1})-H(Y\mid X_{1})\leq\delta. $$
Solving this problem directly would require access to the downstream tasks during learning. Dubois et al. (2021) therefore considered only tasks invariant to some equivalence relation, which divides the input into disjoint equivalence classes. An example would be an image with labels that remain unchanged after augmentation. This is similar to the Multiview assumption where $\epsilon_{\text{info}}\to 0$. By applying Shannon's rate-distortion theory, they concluded that the minimum achievable bit rate is the rate-distortion function with the above invariance distortion. Thus, the optimal rate can be determined by minimizing the following Lagrangian:
Using this objective, the maximization of information with labels is replaced by maximizing the prediction ability of one view from the original input, regularized by direct information from the input. Similarly to the above results, we would like to find a representation $Z_{1}$ that compresses the input $X_{1}$ so that $Z_{1}$ has the maximum information about $X_{2}$.
While the optimal IB representation is based on the Multiview assumption, most self-supervised learning models only use the InfoMax principle and maximize the mutual information $I(Z_{1};Z_{2})$ without an explicit regularization term. However, recent studies have shown that contrastive learning creates compressed representations that include only relevant information (Wang et al., 2022; Tian et al., 2020b). The question is, why is the learned representation compressed? The maximization of $I(Z_{1};Z_{2})$ could theoretically be sufficient to retain all the information from both $X_{1}$ and $X_{2}$ by making the representations invertible. In this section, we attempt to explain this phenomenon.
We begin with the InfoMax principle (Linsker, 1988), which maximizes the mutual information between the representations $Z_{1}$ and $Z_{2}$ of the two views. We can lower-bound it using:
$$ I(Z_{1};Z_{2})=H(Z_{1})-H(Z_{1}\mid Z_{2})\geq H(Z_{1})+\mathbb{E}\left[\log q(z_{1}\mid z_{2})\right] $$
The bound is tight when $q(z_{1}\mid z_{2})=p(z_{1}\mid z_{2})$, in which case the expected log-likelihood term equals the negative conditional entropy $-H(Z_{1}\mid Z_{2})$. This second term of eq. 7 can be considered a negative reconstruction error or distortion between $Z_{1}$ and $Z_{2}$.
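The variational lower bound on $I(Z_{1};Z_{2})$ can be checked numerically on a small discrete example. The sketch below assumes the bound has the standard form $H(Z_{1})+\mathbb{E}[\log q(z_{1}\mid z_{2})]$; the joint table `P` and the decoder `crude_decoder` are made-up values for illustration.

```python
import math

# Hypothetical joint p(z1, z2) over two binary representations (rows: z1, columns: z2).
P = [[0.4, 0.1],
     [0.1, 0.4]]

def entropy(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_z1 = [sum(row) for row in P]
p_z2 = [P[0][j] + P[1][j] for j in range(2)]
mutual_info = entropy(p_z1) + entropy(p_z2) - entropy([p for row in P for p in row])

def lower_bound(q):
    """H(Z1) + E_p[log2 q(z1 | z2)], where q[z2][z1] is a conditional distribution."""
    expected_log_q = sum(P[z1][z2] * math.log2(q[z2][z1])
                         for z1 in range(2) for z2 in range(2))
    return entropy(p_z1) + expected_log_q

# The exact posterior p(z1 | z2) and a deliberately imperfect decoder.
true_posterior = [[P[z1][z2] / p_z2[z2] for z1 in range(2)] for z2 in range(2)]
crude_decoder = [[0.6, 0.4], [0.4, 0.6]]

print("I(Z1;Z2)                 :", mutual_info)
print("bound with true posterior:", lower_bound(true_posterior))
print("bound with crude decoder :", lower_bound(crude_decoder))
```

Any valid decoder gives a value no larger than the true mutual information, and the gap closes only when the decoder matches the true posterior.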
In the supervised case, where $Z$ is a learned stochastic representation of the input and $Y$ is the label, we aim to optimize
$$ I(Y;Z)\geq H(Y)+\mathbb{E}\left[\log q(Y\mid Z)\right]. $$
Since $H(Y)$ is constant, optimizing the information $I(Z;Y)$ requires only maximizing the prediction term $\mathbb{E}\left[\log q(Y\mid Z)\right]$, that is, minimizing the log loss, by making $Z$ more informative about $Y$. This loss is the cross-entropy loss for classification or the square loss for regression. Thus, we can minimize the log loss without any other regularization on the representation.
In contrast, for the self-supervised case, we have a more straightforward option to minimize $H(Z_{1}\mid Z_{2})$: making $Z_{1}$ easier to predict from $Z_{2}$, which can be achieved by reducing its variance along specific dimensions. If we do not regularize $H(Z_{1})$, it will decrease to zero, and we will observe a collapse. This is why, in contrastive methods, the variance of the representation (large entropy) is significant only in the directions with a high variance in the data, which is enforced by data augmentation (Jing et al., 2021). According to this analysis, the network benefits from making the representations "simple" (easier to predict). Hence, even though our representation does not have explicit information-theoretic constraints, the learning process will compress the representation.
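The collapse argument can be illustrated with jointly Gaussian scalars, an assumption we make here only to obtain closed forms. In this sketch, rescaling $Z_{1}$ by a small factor lowers the differential conditional entropy $H(Z_{1}\mid Z_{2})$ while leaving $I(Z_{1};Z_{2})$ unchanged. The function names are hypothetical.

```python
import math

def gaussian_entropy(var):
    """Differential entropy (nats) of a scalar Gaussian with variance var."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def conditional_entropy(var1, rho):
    """H(Z1 | Z2) for jointly Gaussian scalars with correlation rho."""
    return gaussian_entropy(var1 * (1 - rho ** 2))

def mutual_information(rho):
    """I(Z1; Z2) for jointly Gaussian scalars; it depends only on the correlation."""
    return -0.5 * math.log(1 - rho ** 2)

rho = 0.8
for scale in [1.0, 0.1, 0.01]:
    var1 = scale ** 2  # variance of Z1 after multiplying it by `scale`
    print(f"scale={scale:<5} H(Z1)={gaussian_entropy(var1):7.3f} "
          f"H(Z1|Z2)={conditional_entropy(var1, rho):7.3f} "
          f"I(Z1;Z2)={mutual_information(rho):.3f}")
```

The prediction term alone therefore rewards shrinking the variance of the representation even though no additional information is shared, which is why the entropy term $H(Z_{1})$ has to be kept from collapsing.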
According to the Multiview IB analysis presented in Section 4, the optimal way to create a useful representation is to maximize the mutual information between the representations of different views while compressing irrelevant information in each representation. In fact, as discussed in Section 4.1, we can achieve this optimal compressed representation even without explicit regularization. However, this optimality is based on the Multiview assumption, which states that the relevant information for downstream tasks comes from the information shared between views. Therefore, Tian et al. (2020b) concluded that when a minimal sufficient representation has been obtained, the optimal views for self-supervised learning are determined by downstream tasks.
However, the Multiview assumption is highly constrained, as all relevant information must be shared between all views. In cases where this assumption is incorrect, such as with aggressive data augmentation or multiple downstream tasks or modalities, sharing all the necessary information can be challenging. For example, if one view is a video stream while the other is an audio stream, the shared information may be sufficient for object recognition but not for tracking. Furthermore, relevant information for downstream tasks may not be contained within the shared information between views, meaning that removing non-shared information can negatively impact performance.
Kahana and Hoshen (2022) identified a series of tasks that violate the Multiview assumption. To accomplish these tasks, the learned representation must also be invariant to unwanted attributes, such as bias removal and cross-domain retrieval. In such cases, only some attributes have labels, and the objective is to learn an invariant representation for the domain for which labels are provided while also being informative for all other attributes without labels. For example, for face images, only the identity labels may be provided, and the goal is to learn a representation that captures the unlabeled pose attribute but contains no information about the identity attribute. The task can also be applied to fair decisions, cross-domain matching, model anonymization, and image translation.
Wang et al. (2022) formalized another case where the Multiview assumption does not hold: when non-shared task-relevant information cannot be ignored. In such cases, the minimal sufficient representation contains less task-relevant information than other sufficient representations, resulting in inferior performance. Furthermore, their analysis shows that in such cases, the learned representation in contrastive learning is insufficient for downstream tasks and may overfit to the shared information.
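A toy discrete example makes this failure mode concrete. The construction below is our own illustration, not the setup of Wang et al. (2022): each view holds a shared bit and a private bit, and the hypothetical task label depends on both the shared bit and the private bit of the first view, so the Multiview assumption is violated by one full bit.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(joint, idx):
    """Shannon entropy (bits) of the marginal over the coordinates in idx."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def mi(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(joint, a) + entropy(joint, b) - entropy(joint, a + b)

def cmi(joint, a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return (entropy(joint, a + c) + entropy(joint, b + c)
            - entropy(joint, a + b + c) - entropy(joint, c))

def build_joint(encoder):
    """Joint over (x1, x2, y, z1); the label y = (c, s1) uses the private bit s1."""
    joint = {}
    for c, s1, s2 in product([0, 1], repeat=3):  # three independent fair bits
        x1, x2 = (c, s1), (c, s2)
        joint[(x1, x2, (c, s1), encoder(x1))] = 1 / 8
    return joint

X1, X2, Y, Z1 = [0], [1], [2], [3]
minimal = build_joint(lambda x1: x1[0])  # minimal sufficient for X2: shared bit only
full = build_joint(lambda x1: x1)        # also sufficient for X2, but not minimal

print("I(Y;X1|X2) =", cmi(minimal, Y, X1, X2))  # size of the assumption violation
for name, joint in [("minimal", minimal), ("full", full)]:
    print(name, "I(Z1;X2) =", mi(joint, Z1, X2), "I(Z1;Y) =", mi(joint, Z1, Y))
```

Both representations are equally informative about the second view, yet the minimal one loses half of the task-relevant information.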
As a result of their analysis, Wang et al. (2022) and Kahana and Hoshen (2022) proposed explicitly increasing mutual information between the representation and input to preserve task-relevant information and prevent the compression of unshared information between views. In this case, the two regularization terms of the two views are incorporated into the original InfoMax objective, and the following objective is optimized:
Wang et al. (2022) demonstrated the effectiveness of their method for SimCLR (Chen et al., 2020a), BYOL (Grill et al., 2020), and Barlow Twins (Zbontar et al., 2021) across classification, detection, and segmentation tasks.
As seen in Eq. 9, when the Multiview assumption is violated, the objective for obtaining an optimal representation is to maximize the mutual information between each input and its representation. This contrasts with the situation in which the Multiview assumption holds, or the supervised case, where the objective is to minimize the mutual information between the representation and the input. In both supervised and unsupervised cases, we have direct access to the relevant information, which we can use to separate and compress irrelevant information. However, in the self-supervised case, we depend heavily on the Multiview assumption. If this assumption is violated due to unshared information between views that is relevant for the downstream task, we cannot separate relevant and irrelevant information. Furthermore, the learning algorithm’s nature requires that this information be protected by explicitly maximizing it.
As datasets continue to expand in size and models are anticipated to serve as base models for various downstream tasks, the Multiview assumption becomes less pertinent. Consequently, compressing irrelevant information when the Multiview assumption does not hold presents one of the most significant challenges in self-supervised learning. Identifying new methods to separate relevant from irrelevant information based on alternative assumptions is a promising avenue for research. It is also essential to recognize that empirical measurement of information-theoretic quantities and their estimators plays a crucial role in developing and evaluating such methods.
Recent years have seen information-theoretic analyses employed to explain and optimize deep learning techniques (Shwartz-Ziv and Tishby, 2017). Despite their elegance and plausibility, empirically measuring and analyzing information in deep networks presents challenges. Two critical problems are (1) information in deterministic networks and (2) estimating information in high-dimensional spaces.
Information-theoretic methods have significantly impacted deep learning (Alemi et al., 2016; Steinke and Zakynthinou, 2020; Shwartz-Ziv and Tishby, 2017). However, a key challenge is addressing the source of randomness in deterministic DNNs.
In a deterministic network with a continuous input distribution, the mutual information between the input and the representation is infinite, leading to ill-posed optimization problems or piecewise constant outcomes (Amjad and Geiger, 2019; Goldfeld et al., 2018). To tackle this issue, researchers have proposed various solutions. One common approach is to discretize the input distribution and real-valued hidden representations by binning, which facilitates non-trivial measurements and prevents the mutual information from always taking its maximum value, the log of the dataset size, thus avoiding ill-posed optimization problems (Shwartz-Ziv and Tishby, 2017).
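The dependence of binned estimates on the bin size can be seen in a minimal sketch. We assume a hypothetical one-unit deterministic "network" `tanh(3 * (x - 0.5))` and treat the dataset as a uniform distribution over its samples, so that $I(X;T)$ for the binned activation $T$ reduces to the entropy of the bin assignments. The function name `binned_entropy` is ours.

```python
import math
from collections import Counter

def binned_entropy(values, num_bins, low=-1.0, high=1.0):
    """Entropy (bits) of equal-width bin assignments; equals I(X;T) when T is a
    deterministic function of X and X is uniform over the dataset."""
    width = (high - low) / num_bins
    bins = [min(int((v - low) / width), num_bins - 1) for v in values]
    counts = Counter(bins)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

N = 64
inputs = [(i + 0.5) / N for i in range(N)]                    # evenly spaced samples
activations = [math.tanh(3.0 * (x - 0.5)) for x in inputs]    # deterministic unit

for num_bins in [1, 4, 32, 4096]:
    print(f"{num_bins:>5} bins -> I(X;T) estimate = "
          f"{binned_entropy(activations, num_bins):.3f} bits")
print("log2(dataset size) =", math.log2(N))
```

With a single bin the estimate is zero, and with sufficiently fine bins every sample lands in its own bin and the estimate saturates at the log of the dataset size, so the reported value reflects the chosen resolution as much as the network.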
However, binning and discretization are essentially equivalent to geometrical compression and serve as clustering measures (Goldfeld et al., 2018). Moreover, this discretization depends on the chosen bin size and does not track the mutual information across varying bin sizes (Goldfeld et al., 2018; Ross, 2014). To address these limitations, researchers have proposed alternative approaches such as interpreting binned information as a weight decay penalty (Elad et al., 2019b), estimating mutual information based on lower bounds assuming a continuous input distribution without making assumptions about the network's output distribution properties (Wang and Isola, 2020; Zimmermann et al., 2021; Shwartz-Ziv et al., 2022a), injecting additive noise, and considering data augmentation as the source of noise (Lee et al., 2021b; Shwartz-Ziv and Tishby, 2017; Goldfeld et al., 2018; Dubois et al., 2021).
Estimating mutual information in high-dimensional spaces presents a significant challenge when applying information-theoretic measures to real-world data. This problem has been extensively studied (Paninski, 2003; Gao et al., 2015), revealing the inefficiency of solutions for large dimensions and the limited scalability of known approximations with respect to sample size and dimension. Despite these difficulties, various entropy and mutual information estimation approaches have been developed, including classic methods like k-nearest neighbors (KNN) (Kozachenko and Leonenko, 1987) and kernel density estimation techniques (Hang et al., 2018), as well as more recent efficient methods.
Chelombiev et al. (2019) developed adaptive mutual information estimators based on entropy-equal bins and a scaled-noise kernel density estimator. Generative decoder networks, such as PixelCNN++ (Van den Oord et al., 2016), have been employed to estimate a lower bound on mutual information (Darlow and Storkey, 2020; Nash et al., 2018; Shwartz-Ziv et al., 2023). Another strategy is the ensemble dependency graph estimator (EDGE), an adaptive mutual information estimation method that merges randomized locality-sensitive hashing (LSH), dependency graphs, and ensemble bias-reduction techniques (Noshad and Hero III, 2018). The Mutual Information Neural Estimator (MINE) (Belghazi et al., 2018a) maximizes a lower bound on the KL divergence using the dual representation of Donsker and Varadhan (1975) and has been employed for direct mutual information estimation (Elad et al., 2019a). Shwartz-Ziv and Alemi (2020) developed a controlled framework that utilizes neural tangent kernels (Jacot et al., 2018) to obtain tractable information measures.
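The Donsker-Varadhan representation behind MINE states that $D_{\text{KL}}(P\,\|\,Q)\geq\mathbb{E}_{P}[T]-\log\mathbb{E}_{Q}[e^{T}]$ for any critic $T$, with equality at $T=\log(p/q)$ up to a constant. The sketch below checks this on a small discrete example with made-up distributions `P` and `Q`; MINE itself parameterizes the critic with a neural network and takes $P$ to be the joint and $Q$ the product of marginals, which is not reproduced here.

```python
import math

# Hypothetical discrete distributions over three outcomes.
P = [0.7, 0.2, 0.1]
Q = [0.2, 0.3, 0.5]

def kl(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dv_bound(critic):
    """Donsker-Varadhan value E_P[T] - log E_Q[exp(T)] for a list of scores T(x)."""
    e_p = sum(pi * t for pi, t in zip(P, critic))
    log_e_q = math.log(sum(qi * math.exp(t) for qi, t in zip(Q, critic)))
    return e_p - log_e_q

optimal_critic = [math.log(pi / qi) for pi, qi in zip(P, Q)]  # log density ratio
arbitrary_critic = [1.0, 0.0, -1.0]                            # any other critic

print("KL(P||Q)                 :", kl(P, Q))
print("DV with optimal critic   :", dv_bound(optimal_critic))
print("DV with arbitrary critic :", dv_bound(arbitrary_critic))
```

Maximizing the bound over a family of critics therefore yields a lower-bound estimate that is only as tight as the family is expressive.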
Mutual information estimation can be improved by using larger batch sizes, although this may negatively impact generalization performance and memory requirements. Alternatively, researchers have suggested employing surrogate measures for mutual information, such as log-determinant mutual information (LDMI), based on second-order statistics (Ozsoy et al., 2022; Erdogan, 2022), which reflects linear dependence. Goldfeld and Greenewald (2021) proposed the Sliced Mutual Information (SMI), defined as an average of MI terms between one-dimensional projections of high-dimensional variables. SMI inherits many properties of its classic counterpart. It can be estimated with optimal parametric error rates in all dimensions by combining an MI estimator between scalar variables with an MC integrator (Goldfeld and Greenewald, 2021). The $k$-SMI, introduced by Goldfeld et al. (2022), extends the SMI by projecting onto $k$-dimensional subspaces, which relaxes the smoothness assumptions, improves scalability, and enhances performance.
In conclusion, estimating and optimizing information in deep neural networks presents significant challenges, particularly in deterministic networks and high-dimensional spaces. Researchers have proposed various approaches to address these issues, including discretization, alternative estimators, and surrogate measures. As the field continues to evolve, it is expected that more advanced techniques will emerge to overcome these challenges and facilitate the understanding and optimization of deep learning models.
As discussed in Section 4, the separation of relevant (preserved) and irrelevant (compressed) information relies on the Multiview Assumption. This assumption, which states that only shared information is essential for downstream tasks, is rather restrictive. For example, situations may arise where each view contains distinct information relevant to a downstream task or multiple tasks necessitate different features. Some methods have been proposed to tackle this problem, but they mainly focus on maximizing the network’s information without explicit constraints. Formalizing this scenario and exploring differentiating between relevant and irrelevant data based on non-shared information represents an intriguing research direction.
At present, the internal compression of self-supervised learning methods may compress relevant information due to improper augmentation (Section 4.1). Consequently, we must heavily rely on generating the two views, which must accurately represent information related to the downstream task. Custom augmentation must be developed for each domain, taking into account extensive prior knowledge on data augmentation. While some papers have attempted to extend self-supervised learning to tabular data (Ucar et al., 2021; Arik and Pfister, 2021), further work is necessary from both theoretical and practical standpoints to achieve high performance with self-supervised learning for tabular data (Shwartz-Ziv and Armon, 2022). The augmentation process is crucial for the performance of current vision and text models. In the case of tabular data, employing information-theoretic loss functions that do not require information compression may help harness the benefits of self-supervised learning.
Prior works have investigated various supervised, unsupervised, semi-supervised, and self-supervised learning methods, demonstrating that they optimize information-theoretic quantities. However, state-of-the-art methods employ additional changes and engineering practices that may be related to information theory, such as the stop gradient operation utilized by many self-supervised learning methods today (Grill et al., 2020; Chen and He, 2021). The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) can be employed to explain this operation when one path is the E-step and the other is the M-step. Additionally, Elidan and Friedman (2012) proposed an IB-inspired version of the EM, which could help develop information-theoretic-based objectives using the stop gradient operation.
While information theory offers a rigorous conceptual framework for describing information, it neglects essential aspects of computation. (Conditional) entropy, for example, is directly related to the predictability of a random variable in a betting game where agents are rewarded for accurate guesses. However, the standard definition assumes that agents have no computational bounds and can employ arbitrarily complex prediction schemes (Cover, 1999). In the context of deep learning, predictive information $H(Y\mid Z)$ measures the amount of information that can be extracted from $Z$ about $Y$ given access to all decoders $p(y\mid z)$ in the world. Recently, Xu et al. (2020) introduced predictive V-information as an alternative formulation based on realistic computational constraints.
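The gap between Shannon information and information usable by a restricted decoder can be illustrated with a toy XOR example. This is a sketch in the spirit of V-information, under our own assumptions: the restricted family consists of predictors that may read at most one coordinate of $Z$ (or ignore $Z$ entirely), for which the best achievable log loss is the smallest single-coordinate conditional entropy. The names `cond_entropy`, `shannon_info`, and `v_info` are hypothetical.

```python
import math
from itertools import product

# Z = (z_a, z_b) are two fair bits and the label is Y = z_a XOR z_b.
samples = [((a, b), a ^ b) for a, b in product([0, 1], repeat=2)]

def cond_entropy(feature):
    """H(Y | feature(Z)) in bits under the uniform distribution over samples."""
    groups = {}
    for z, y in samples:
        groups.setdefault(feature(z), []).append(y)
    total = 0.0
    for ys in groups.values():
        weight = len(ys) / len(samples)
        for label in set(ys):
            p = ys.count(label) / len(ys)
            total -= weight * p * math.log2(p)
    return total

h_y = cond_entropy(lambda z: None)                 # no side information: H(Y)
shannon_info = h_y - cond_entropy(lambda z: z)     # unrestricted decoders

# Restricted family (assumption of this sketch): read at most one coordinate of Z.
h_restricted = min(h_y,
                   cond_entropy(lambda z: z[0]),
                   cond_entropy(lambda z: z[1]))
v_info = h_y - h_restricted

print("Shannon I(Z;Y)         :", shannon_info)
print("restricted-family info :", v_info)
```

An unrestricted decoder recovers the full bit, whereas no single-coordinate predictor does better than guessing, so the usable information under this restricted family is zero.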
Until now, research combining self-supervised learning with information theory has focused on probabilistic models with tractable likelihoods. These models enable specific optimization of model parameters concerning the tractable log-likelihood (Graves, 2013; Germain et al., 2015; Dinh et al., 2016; Rezende and Mohamed, 2015) or a tractable lower bound of the likelihood (Kingma and Welling, 2019; Alemi et al., 2016). Although models with tractable likelihoods offer certain benefits, their scope is limited and necessitates a particular format. Energy-based models (EBMs) present a more flexible, unified framework. Rather than specifying a normalized probability, EBMs define inference as minimizing an unnormalized energy function and learning as minimizing a loss function. The energy function does not require integration and can be parameterized with any nonlinear regression function. Inference typically involves finding a low-energy configuration or sampling from all possible configurations such that the probability of selecting a specific configuration follows a Gibbs distribution (Huembeli et al., 2022; Song and Kingma, 2021).
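For a finite set of configurations, the relation between an unnormalized energy function and the Gibbs distribution can be written out directly. The sketch below uses made-up energies and a hypothetical helper `gibbs_distribution`; for continuous or high-dimensional configurations the normalizing sum is intractable, which is why EBMs avoid computing it.

```python
import math

def gibbs_distribution(energies, temperature=1.0):
    """Probabilities proportional to exp(-E / temperature) over a finite set."""
    min_e = min(energies)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - min_e) / temperature) for e in energies]
    partition = sum(weights)
    return [w / partition for w in weights]

# Hypothetical energies assigned to four candidate configurations.
energies = [0.5, 2.0, 1.0, 3.0]
probs = gibbs_distribution(energies)

# Inference as energy minimization: pick the lowest-energy configuration.
best_config = min(range(len(energies)), key=energies.__getitem__)

print("Gibbs probabilities:", [round(p, 3) for p in probs])
print("lowest-energy configuration:", best_config)
```

Finding the minimizer needs only energy comparisons, while sampling needs the normalized probabilities, which mirrors the two modes of inference described above.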
Investigating energy-based models for self-supervised learning from both theoretical and practical perspectives can open up numerous promising research directions. For instance, we could directly apply tools developed for energy-based models and statistical machines to optimize the model, such as Maximum Likelihood Training with MCMC (Younes, 1999), score matching (Hyvärinen, 2006), denoising score matching (Song et al., 2020; Vincent, 2011), and score-based generation models (Song and Ermon, 2019).
The multiview self-supervised IB framework can be extended to cases involving more than two views $(X_{1},\cdots,X_{n})$ and multiple downstream tasks $(Y_{1},\cdots,Y_{K})$. A simple extension of the multiview IB framework can be achieved by setting the objective function to maximize the joint mutual information of all views' representations $I(Z_{1};\cdots;Z_{n})$ and compressing the individual information for each view, $I(X_{i};Z_{i})$ for $1\leq i\leq n$. However, to ensure the optimality of this objective, we must expand the multiview assumption to include more than two views. In this scenario, we need to assume that relevant information is shared among all different views and tasks, which might be overly restrictive. As a result, defining and analyzing a more refined version of this naive solution is essential. One potential approach involves utilizing the Multi-feature Information Bottleneck (MfIB) (Lou et al., 2013), which extends the original IB. The MfIB processes multiple feature types simultaneously and analyzes data from various sources. This framework establishes a joint distribution between the multivariate data and the model. Rather than solely preserving the information of one feature variable maximally, the MfIB concurrently maintains multiple feature variables' information while compressing them. The MfIB characterizes the relationships between different sources and outputs by employing the multivariate Information Bottleneck (Friedman et al., 2013) and specifying Bayesian networks.
In this study, we delved deeply into the concept of optimal representation in self-supervised learning through the lens of information theory. We synthesized various approaches, highlighting their foundational assumptions and constraints, and integrated them into a unified framework. Additionally, we explored the key information-theoretic terms that influence these optimal representations and the methods for estimating them.
While supervised and unsupervised learning offer more direct access to relevant information, self-supervised learning depends heavily on assumptions about the relationship between data and downstream tasks. This reliance makes distinguishing between relevant and irrelevant information considerably more challenging, necessitating further assumptions.
Despite these challenges, information theory stands out as a robust and versatile framework for analysis and algorithmic development. This adaptable framework caters to a range of learning paradigms and elucidates the inherent assumptions underpinning data and model optimization.
With the rapid growth of datasets and the increasing expectations placed on models to handle multiple downstream tasks, the traditional Multi-view assumption might become less reliable. One significant challenge in self-supervised learning is the precise compression of irrelevant information, especially when these assumptions are compromised.
Future research avenues might involve expanding the Multi-view framework to include more views and tasks and deepening our understanding of information theory’s impact on facets of deep learning, such as reinforcement learning and generative models.
In summary, information theory is a crucial tool in our quest to understand better and optimize self-supervised learning models. By harnessing its principles, we can more adeptly navigate the intricacies of deep neural network development, paving the way for creating more effective models.
Multiview information bottleneck diagram for self-supervised, unsupervised, and supervised learning
$$ D_{\mathcal{T}}(X,Z)=\sup_{Y\in\mathcal{Y}}H(Y\mid Z_{1})-H(Y\mid X_{1})\leq\delta. $$
$$ I(Y;Z)\geq H(Y)+\mathbb{E}\left[\log q(Y\mid Z)\right] $$
$$ \mathbb{E}_{x,x^{+},x^{-}}\left[-\log\left(\frac{e^{f(x)^{T}f(x^{+})}}{\sum_{k=1}^{K}e^{f(x)^{T}f(x^{k})}}\right)\right] $$
$$ \mathcal{L}=\frac{1}{K}\sum_{k=1}^{K}\left(\alpha\max\left(0,\gamma-\sqrt{\bm{C}_{k,k}+\epsilon}\right)+\beta\sum_{k^{\prime}\neq k}\left(\bm{C}_{k,k^{\prime}}\right)^{2}\right)+\gamma\|\bm{Z}-\bm{Z}^{\prime}\|_{F}^{2}/N. $$
$$ \mathcal{L}=\min_{P(t\mid x);\,p(y\mid t)}I(X;T)-\beta I(Y;T) $$
$$ I(T;Y)=I(X;Y)-\mathbb{E}_{x\sim P(X),t\sim P(T\mid x)}\left[D\left[P(Y\mid x)\,\|\,P(Y\mid t)\right]\right] $$
$$ Gen_{P(X,Y)}\left(f,\ell,\mathcal{D}_{n}\right):=\mathcal{L}_{P(X,Y)}\left(f,\ell\right)-\hat{\mathcal{L}}_{P(X,Y)}\left(f,\ell,\mathcal{D}_{n}\right) $$
$$ I(X;Z)=\underbrace{I(X;Z\mid Y)}_{\text{superfluous information}}+\underbrace{I(Z;Y)}_{\text{predictive information}} $$
$$ \mathcal{L}=\alpha I_{P(X_{1}),P(Z_{1}\mid X_{1})}(X_{1};Z_{1})+\beta I_{P(X_{2}),P(Z_{2}\mid X_{2})}(X_{2};Z_{2})-I_{P(Z_{2}\mid X_{2}),P(Z_{2}\mid X_{1})}(Z_{1,2};Y) $$
$$ \mathcal{L}=-I_{P(Z_{1}\mid X_{1}),Q(Z_{2}\mid Z_{1})}(Z_{1};Z_{2})+\beta D_{\text{SKL}}\left[p(z_{1}\mid x_{1})\,\|\,p(z_{2}\mid x_{2})\right] $$
Definition 1. Given $(X,Y)\sim P(X,Y)$, let $T:=t(X)$, where $t$ is a deterministic function. We define $T$ as a sufficient statistic of $X$ for $Y$ if $Y-T-X$ forms a Markov chain.
Definition 3 (Minimal sufficient statistic (MSS)). A sufficient statistic $T$ is minimal if, for any other sufficient statistic $S$, there exists a function $f$ such that $T=f(S)$ almost surely (a.s.).
Definition 4 (Sufficiency). A representation $Z$ of $X$ is sufficient for $Y$ if and only if $I(X;Y\mid Z)=0$.
Assumption 1 (The MultiView Assumption). There exists an $\epsilon_{\text{info}}$ (which is assumed to be small) such that $I(Y;X_{2}\mid X_{1})\leq\epsilon_{\text{info}}$ and $I(Y;X_{1}\mid X_{2})\leq\epsilon_{\text{info}}$.
References
[Reference1] Achille, Alessandro, Rovere, Matteo, Soatto, Stefano. (2017). Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856.
[jacot2018neural] Jacot, Arthur, Gabriel, Franck, Hongler, Cl{'e. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems.
[shwartz2022information] Shwartz-Ziv, Ravid. (2022). Information flow in deep neural networks. arXiv preprint arXiv:2202.06749.
[piran2020dual] Piran, Zoe, Shwartz-Ziv, Ravid, Tishby, Naftali. (2020). The dual information bottleneck. arXiv preprint arXiv:2006.04641.
[shwartz2020information] Shwartz-Ziv, Ravid, Alemi, Alexander A. (2020). Information in infinite ensembles of infinitely-wide neural networks. Symposium on Advances in Approximate Bayesian Inference.
[geiping2022much] Geiping, Jonas, Goldblum, Micah, Somepalli, Gowthami, Shwartz-Ziv, Ravid, Goldstein, Tom, Wilson, Andrew Gordon. (2022). How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization. arXiv preprint arXiv:2210.06441.
[ZHAO201743] Zhao, Jing, Xie, Xijiong, Xu, Xin, Sun, Shiliang. (2017). Multi-view learning overview: Recent progress and new challenges. Information Fusion. doi:10.1016/j.inffus.2017.02.007.
[erdogan2022information] Erdogan, Alper T. (2022). An information maximization based blind source separation approach for dependent and independent sources. ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[vincent2008extracting] Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, Manzagol, Pierre-Antoine. (2008). Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th international conference on Machine learning.
[elad2019the] Adar Elad, Doron Haviv, Yochai Blau, Tomer Michaeli. (2019). The effectiveness of layer-by-layer training using the information bottleneck principle.
[caron2021emerging] Caron, Mathilde, Touvron, Hugo, Misra, Ishan, J{'e. (2021). Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[shwartz2022we] Shwartz-Ziv, Ravid, Balestriero, Randall, LeCun, Yann. (2022). What Do We Maximize in Self-Supervised Learning?. arXiv preprint arXiv:2207.10081.
[bromley1993signature] Bromley, Jane, Guyon, Isabelle, LeCun, Yann, S{. (1993). Signature verification using a. Advances in neural information processing systems.
[shwartz2023information] Shwartz-Ziv, Ravid, Balestriero, Randall, Kawaguchi, Kenji, Rudner, Tim GJ, LeCun, Yann. (2023). An Information-Theoretic Perspective on Variance-Invariance-Covariance Regularization. arXiv preprint arXiv:2303.00633.
[chelombiev2018adaptive] Ivan Chelombiev, Conor Houghton, Cian O'Donnell. (2019). Adaptive Estimators Show Information Compression in Deep Neural Networks. International Conference on Learning Representations.
[bardes2021vicreg] Bardes, Adrien, Ponce, Jean, LeCun, Yann. (2021). VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906.
[largemarginib] Tsai, Yao-Hung Hubert, Wu, Yue, Salakhutdinov, Ruslan, Morency, Louis-Philippe. (2020). Self-supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576. doi:10.1109/TPAMI.2013.2296528.
[dubois2021lossy] Dubois, Yann, Bloem-Reddy, Benjamin, Ullrich, Karen, Maddison, Chris J. (2021). Lossy compression for lossless prediction. Advances in Neural Information Processing Systems.
[goldfeld2021sliced] Goldfeld, Ziv, Greenewald, Kristjan. (2021). Sliced mutual information: A scalable measure of statistical dependence. Advances in Neural Information Processing Systems.
[Basirat2021] Basirat, Mina, Geiger, Bernhard C., Roth, Peter M.. (2021). A Geometric Perspective on Information Plane Analysis. Entropy.
[misra2020self] Misra, Ishan, van der Maaten, Laurens. (2020). Self-supervised learning of pretext-invariant representations. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[wu2018unsupervised] Wu, Zhirong, Xiong, Yuanjun, Yu, Stella, Lin, Dahua. (2018). Unsupervised feature learning via non-parametric instance-level discrimination. arXiv preprint arXiv:1805.01978.
[he2020momentum] He, Kaiming, Fan, Haoqi, Wu, Yuxin, Xie, Saining, Girshick, Ross. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
[he2022masked] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[zimmermann2021contrastive] Zimmermann, Roland S, Sharma, Yash, Schneider, Steffen, Bethge, Matthias, Brendel, Wieland. (2021). Contrastive learning inverts the data generating process. International Conference on Machine Learning.
[wang2020understanding] Wang, Tongzhou, Isola, Phillip. (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. International Conference on Machine Learning.
[makhzani2017pixelgan] Makhzani, Alireza, Frey, Brendan J. (2017). Pixelgan autoencoders. Advances in Neural Information Processing Systems.
[berthelot2019mixmatch] Berthelot, David, Carlini, Nicholas, Goodfellow, Ian, Papernot, Nicolas, Oliver, Avital, Raffel, Colin A. (2019). Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems.
[larsen2016autoencoding] Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae, Larochelle, Hugo, Winther, Ole. (2016). Autoencoding beyond pixels using a learned similarity metric. International conference on machine learning.
[doersch2015unsupervised] Doersch, Carl, Gupta, Abhinav, Efros, Alexei A. (2015). Unsupervised visual representation learning by context prediction. Proceedings of the IEEE international conference on computer vision.
[vera2018collaborative] Vera, Matias, Vega, Leonardo Rey, Piantanida, Pablo. (2018). Collaborative information bottleneck. IEEE Transactions on Information Theory.
[lou2013multi] Lou, Zhengzheng, Ye, Yangdong, Yan, Xiaoqiang. (2013). The multi-feature information bottleneck with application to unsupervised image categorization. Twenty-Third International Joint Conference on Artificial Intelligence.
[friedman2013multivariate] Friedman, Nir, Mosenzon, Ori, Slonim, Noam, Tishby, Naftali. (2013). Multivariate information bottleneck. arXiv preprint arXiv:1301.2270.
[goodfellow2014generative] Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, Bengio, Yoshua. (2014). Generative adversarial nets. Advances in neural information processing systems.
[noroozi2016unsupervised] Noroozi, Mehdi, Favaro, Paolo. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. European conference on computer vision.
[zhao2017infovae] Zhao, Shengjia, Song, Jiaming, Ermon, Stefano. (2017). Infovae: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262.
[smieja1906segma] Smieja, M, Wolczyk, M, Tabor, J, Geiger, B. SeGMA: Semi-Supervised Gaussian Mixture Auto-Encoder. arXiv preprint arXiv:1906.09333.
[kingma2014semi] Kingma, Durk P, Mohamed, Shakir, Jimenez Rezende, Danilo, Welling, Max. (2014). Semi-supervised learning with deep generative models. Advances in neural information processing systems.
[springenberg2015unsupervised] Springenberg, Jost Tobias. (2015). Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390.
[makhzani2015adversarial] Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, Goodfellow, Ian, Frey, Brendan. (2015). Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
[dinh2016density] Dinh, Laurent, Sohl-Dickstein, Jascha, Bengio, Samy. (2016). Density estimation using real nvp. arXiv preprint arXiv:1605.08803.
[higgins2016beta] Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, Lerchner, Alexander. (2017). beta-vae: Learning basic visual concepts with a constrained variational framework. ICLR.
[kingma2019introduction] Kingma, Diederik P, Welling, Max. (2019). An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691.
[rezende2015variational] Rezende, Danilo, Mohamed, Shakir. (2015). Variational inference with normalizing flows. International conference on machine learning.
[sohn2020fixmatch] Sohn, Kihyuk, Berthelot, David, Carlini, Nicholas, Zhang, Zizhao, Zhang, Han, Raffel, Colin A, Cubuk, Ekin Dogus, Kurakin, Alexey, Li, Chun-Liang. (2020). Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems.
[germain2015made] Germain, Mathieu, Gregor, Karol, Murray, Iain, Larochelle, Hugo. (2015). Made: Masked autoencoder for distribution estimation. International Conference on Machine Learning.
[graves2013generating] Graves, Alex. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
[laine2016temporal] Laine, Samuli, Aila, Timo. (2016). Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
[Lee2013PseudoLabelT] Lee, Dong-Hyun, others. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. Workshop on challenges in representation learning, ICML.
[kahana2022contrastive] Kahana, Jonathan, Hoshen, Yedid. (2022). A Contrastive Objective for Learning Disentangled Representations. arXiv preprint arXiv:2203.11284.
[zhai2019s4l] Zhai, Xiaohua, Oliver, Avital, Kolesnikov, Alexander, Beyer, Lucas. (2019). S4l: Self-supervised semi-supervised learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[xie2020unsupervised] Xie, Qizhe, Dai, Zihang, Hovy, Eduard, Luong, Thang, Le, Quoc. (2020). Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems.
[miyato2018virtual] Miyato, Takeru, Maeda, Shin-ichi, Koyama, Masanori, Ishii, Shin. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence.
[grandvalet2006entropy] Grandvalet, Yves, Bengio, Yoshua. (2006). Entropy Regularization.
[chapelle2009semi] Chapelle, Olivier, Scholkopf, Bernhard, Zien, Alexander. (2009). Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews]. IEEE Transactions on Neural Networks.
[dubois2020learning] Dubois, Yann, Kiela, Douwe, Schwab, David J, Vedantam, Ramakrishna. (2020). Learning optimal representations with the decodable information bottleneck. Advances in Neural Information Processing Systems.
[xu2020theory] Xu, Yilun, Zhao, Shengjia, Song, Jiaming, Stewart, Russell, Ermon, Stefano. (2020). A theory of usable information under computational constraints. arXiv preprint arXiv:2002.10689.
[elidan2012information] Elidan, Gal, Friedman, Nir. (2012). The information bottleneck EM algorithm. arXiv preprint arXiv:1212.2460.
[dempster1977maximum] Dempster, Arthur P, Laird, Nan M, Rubin, Donald B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological).
[shwartz2022pre] Shwartz-Ziv, Ravid, Goldblum, Micah, Souri, Hossein, Kapoor, Sanyam, Zhu, Chen, LeCun, Yann, Wilson, Andrew G. (2022). Pre-train your loss: Easy bayesian transfer learning with informative priors. Advances in Neural Information Processing Systems.
[shwartz2022tabular] Shwartz-Ziv, Ravid, Armon, Amitai. (2022). Tabular data: Deep learning is not all you need. Information Fusion.
[wang2022rethinking] Wang, Haoqing, Guo, Xun, Deng, Zhi-Hong, Lu, Yan. (2022). Rethinking Minimal Sufficient Representation in Contrastive Learning. arXiv preprint arXiv:2203.07004.
[tian2020makes] Tian, Yonglong, Sun, Chen, Poole, Ben, Krishnan, Dilip, Schmid, Cordelia, Isola, Phillip. (2020). What makes for good views for contrastive learning?. Advances in Neural Information Processing Systems.
[fischer2020conditional] Fischer, Ian. (2020). The conditional entropy bottleneck. Entropy.
[lee2021predicting] Lee, Jason D, Lei, Qi, Saunshi, Nikunj, Zhuo, Jiacheng. (2021). Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems.
[arora2019theoretical] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.
[chen2020improved] Chen, Xinlei, Fan, Haoqi, Girshick, Ross, He, Kaiming. (2020). Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
[chen2020simple] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations. International conference on machine learning.
[Reference3] Li, Yingming, Yang, Ming, Zhang, Zhongfei. (2018). A survey of multi-view representation learning. IEEE transactions on knowledge and data engineering.
[donahue2015long] Donahue, Jeffrey, Anne Hendricks, Lisa, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, Darrell, Trevor. (2015). Long-term recurrent convolutional networks for visual recognition and description. Proceedings of the IEEE conference on computer vision and pattern recognition.
[mao2014deep] Mao, Junhua, Xu, Wei, Yang, Yi, Wang, Jiang, Huang, Zhiheng, Yuille, Alan. (2014). Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632.
[bach2002kernel] Bach, Francis R, Jordan, Michael I. (2002). Kernel independent component analysis. Journal of machine learning research.
[chelombiev2019adaptive] Chelombiev, Ivan, Houghton, Conor, O'Donnell, Cian. (2019). Adaptive estimators show information compression in deep neural networks. arXiv preprint arXiv:1902.09037.
[pmlr-v9-gutmann10a] Gutmann, Michael, Hyvärinen, Aapo. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.
[perozzi2014deepwalk] Perozzi, Bryan, Al-Rfou, Rami, Skiena, Steven. (2014). Deepwalk: Online learning of social representations. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining.
[liu2021deep] Liu, Shiming, Xia, Yifan, Shi, Zhusheng, Yu, Hui, Li, Zhiqiang, Lin, Jianguo. (2021). Deep learning in sheet metal bending with a novel theory-guided deep neural network. IEEE/CAA Journal of Automatica Sinica.
[huang2019multi] Huang, Zhenyu, Zhou, Joey Tianyi, Peng, Xi, Zhang, Changqing, Zhu, Hongyuan, Lv, Jiancheng. (2019). Multi-view Spectral Clustering Network.. IJCAI.
[zhao2017multi] Zhao, Handong, Ding, Zhengming, Fu, Yun. (2017). Multi-view clustering via deep matrix factorization. Thirty-first AAAI conference on artificial intelligence.
[andrew2013deep] Andrew, Galen, Arora, Raman, Bilmes, Jeff, Livescu, Karen. (2013). Deep canonical correlation analysis. International conference on machine learning.
[yan2015unsupervised] Yan, Xiaoqiang, Ye, Yangdong, Lou, Zhengzheng. (2015). Unsupervised video categorization based on multivariate information bottleneck method. Knowledge-Based Systems.
[sun2010scalable] Sun, Liang, Ceran, Betul, Ye, Jieping. (2010). A scalable two-stage approach for a class of dimensionality reduction techniques. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining.
[dinh2014nice] Dinh, Laurent, Krueger, David, Bengio, Yoshua. (2014). Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
[linsker1988self] Linsker, Ralph. (1988). Self-organization in a perceptual network. Computer.
[becker1992self] Becker, Suzanna, Hinton, Geoffrey E. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature.
[goldfeld2022k] Goldfeld, Ziv, Greenewald, Kristjan, Nuradha, Theshani, Reeves, Galen. (2022). k-sliced mutual information: A quantitative study of scalability with dimension. arXiv preprint arXiv:2206.08526.
[bell1995information] Bell, Anthony J, Sejnowski, Terrence J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural computation.
[ozsoy2022self] Ozsoy, Serdar, Hamdan, Shadi, Arik, Sercan, Yuret, Deniz, Erdogan, Alper. (2022). Self-supervised learning with an information maximization criterion. Advances in Neural Information Processing Systems.
[XUE2019210] Zhe Xue, Junping Du, Dawei Du, Siwei Lyu. (2019). Deep low-rank subspace ensemble for multi-view clustering. Information Sciences. doi:https://doi.org/10.1016/j.ins.2019.01.018.
[kumar2011co] Kumar, Abhishek, Daumé III, Hal. (2011). A co-training approach for multi-view spectral clustering. Proceedings of the 28th international conference on machine learning (ICML-11).
[YAN2021106] Xiaoqiang Yan, Shizhe Hu, Yiqiao Mao, Yangdong Ye, Hui Yu. (2021). Deep multi-view learning methods: A review. Neurocomputing. doi:https://doi.org/10.1016/j.neucom.2021.03.090.
[van2016pixel] Van Oord, Aaron, Kalchbrenner, Nal, Kavukcuoglu, Koray. (2016). Pixel recurrent neural networks. International conference on machine learning.
[van2016conditional] Van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, others. (2016). Conditional image generation with pixelcnn decoders. Advances in neural information processing systems.
[liu2021self] Liu, Xiao, Zhang, Fanjin, Hou, Zhenyu, Mian, Li, Wang, Zhaoyu, Zhang, Jing, Tang, Jie. (2021). Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering.
[bachman2019learning] Bachman, Philip, Hjelm, R Devon, Buchwalter, William. (2019). Learning representations by maximizing mutual information across views. Advances in neural information processing systems.
[federici2020learning] Federici, Marco, Dutta, Anjan, Forré, Patrick, Kushman, Nate, Akata, Zeynep. (2020). Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017.
[tian2020contrastive] Tian, Yonglong, Krishnan, Dilip, Isola, Phillip. (2020). Contrastive multiview coding. European conference on computer vision.
[tschannen2019mutual] Tschannen, Michael, Djolonga, Josip, Rubenstein, Paul K, Gelly, Sylvain, Lucic, Mario. (2019). On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625.
[darlow2020information] Darlow, Luke Nicholas, Storkey, Amos. (2020). What Information Does a ResNet Compress?. arXiv preprint arXiv:2003.06254.
[deepmultiview2019] Qi Wang, Claire Boudreau, Qixing Luo, Pang-Ning Tan, Jiayu Zhou. (2019). Deep Multi-view Information Bottleneck. Proceedings of the 2019 SIAM International Conference on Data Mining (SDM). doi:10.1137/1.9781611975673.5.
[hang2018kernel] Hang, Hanyuan, Steinwart, Ingo, Feng, Yunlong, Suykens, Johan AK. (2018). Kernel density estimation for dynamical systems. The Journal of Machine Learning Research.
[kozachenko1987sample] Kozachenko, Lyudmyla F, Leonenko, Nikolai N. (1987). Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii.
[paninski2003estimation] Paninski, Liam. (2003). Estimation of entropy and mutual information. Neural computation.
[laurenz2002] Henaff, Olivier. (2020). Data-efficient image recognition with contrastive predictive coding. International Conference on Machine Learning.
[hjelm2018learning] Hjelm, R Devon, Fedorov, Alex, Lavoie-Marchildon, Samuel, Grewal, Karan, Bachman, Phil, Trischler, Adam, Bengio, Yoshua. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
[karpathy2015deep] Karpathy, Andrej, Fei-Fei, Li. (2015). Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE conference on computer vision and pattern recognition.
[deepmultiview2015] Wang, Weiran, Arora, Raman, Livescu, Karen, Bilmes, Jeff. (2015). On Deep Multi-View Representation Learning. Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37.
[multimodel2011] Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, Ng, Andrew Y.. (2011). Multimodal Deep Learning. Proceedings of the 28th International Conference on International Conference on Machine Learning.
[srivastava14b] Nitish Srivastava, Ruslan Salakhutdinov. (2014). Multimodal Learning with Deep Boltzmann Machines. Journal of Machine Learning Research.
[chen2010] Chen, Ning, Zhu, Jun, Xing, Eric. (2010). Predictive Subspace Learning for Multi-view Data: a Large Margin Approach. Advances in Neural Information Processing Systems.
[Xu2014LargeMarginMB] Chang Xu, Dacheng Tao, Chao Xu. (2014). Large-Margin Multi-View Information Bottleneck. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[xing2012mining] Xing, Eric P, Yan, Rong, Hauptmann, Alexander G. (2012). Mining associated text and images with dual-wing harmoniums. arXiv preprint arXiv:1207.1423.
[multi2014] Weifeng Liu, Dacheng Tao, Jun Cheng, Yuanyan Tang. (2014). Multiview Hessian discriminative sparse coding for image annotation. Computer Vision and Image Understanding. doi:https://doi.org/10.1016/j.cviu.2013.03.007.
[article2008] Sridharan, Karthik, Kakade, Sham. (2008). An Information Theoretic Framework for Multi-View Learning.
[Tian2013] Cao, Tian, Jojic, Vladimir, Modla, Shannon, Powell, Debbie, Czymmek, Kirk, Niethammer, Marc. (2013). Robust Multimodal Dictionary Learning. Medical Image Computing and Computer-Assisted Intervention -- MICCAI 2013.
[factorized2010] Jia, Yangqing, Salzmann, Mathieu, Darrell, Trevor. (2010). Factorized Latent Spaces with Structured Sparsity. Advances in Neural Information Processing Systems.
[matching2003] Barnard, Kobus, Duygulu, Pinar, Forsyth, David, de Freitas, Nando, Blei, David M., Jordan, Michael I.. (2003). Matching Words and Pictures. J. Mach. Learn. Res..
[miss2000] Cohn, David, Hofmann, Thomas. (2000). The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity. Advances in Neural Information Processing Systems.
[Sun2013ASO] Shiliang Sun. (2013). A survey of multi-view machine learning. Neural Computing and Applications.
[hardoon2004] Bach, Francis R., Jordan, Michael I.. (2003). Kernel Independent Component Analysis. J. Mach. Learn. Res.. doi:10.1162/153244303768966085.
[cca1396] Harold Hotelling. (1936). Relations Between Two Sets of Variates. Biometrika.
[Darbellay99] Vapnik, Vladimir N, Chervonenkis, A Ya. (2015). On the uniform convergence of relative frequencies of events to their probabilities. CoRR. doi:10.1108/03684921011046735.
[cover1999elements] Cover, Thomas M. (1999). Elements of information theory.
[tishby2000information] Tishby, Naftali, Pereira, Fernando C, Bialek, William. (2000). The information bottleneck method. arXiv preprint physics/0004057.
[koopman1936distributions] Koopman, Bernard Osgood. (1936). On distributions admitting a sufficient statistic. Transactions of the American Mathematical society.
[gilad2003information] Gilad-Bachrach, Ran, Navot, Amir, Tishby, Naftali. (2003). An information theoretic tradeoff between complexity and accuracy. Learning Theory and Kernel Machines.
[kinney2014equitability] Kinney, Justin B, Atwal, Gurinder S. (2014). Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences.
[rosenblatt1958perceptron] Rosenblatt, Frank. (1958). The perceptron: a probabilistic model for information storage and organization in the brain.. Psychological review.
[krizhevsky2012imagenet] Krizhevsky, Alex, Sutskever, Ilya, Hinton, Geoffrey E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems.
[hinton2006fast] Hinton, Geoffrey E, Osindero, Simon, Teh, Yee-Whye. (2006). A fast learning algorithm for deep belief nets. Neural computation.
[bengio2007greedy] Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, Larochelle, Hugo. (2007). Greedy layer-wise training of deep networks. Advances in neural information processing systems.
[simonyan2014very] Simonyan, Karen, Zisserman, Andrew. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[ren2015faster] Ren, Shaoqing, He, Kaiming, Girshick, Ross, Sun, Jian. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems.
[belghazi2018mine] Belghazi, Mohamed Ishmael, Baratin, Aristide, Rajeswar, Sai, Ozair, Sherjil, Bengio, Yoshua, Courville, Aaron, Hjelm, R Devon. (2018). Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062.
[steinke2020reasoning] Steinke, Thomas, Zakynthinou, Lydia. (2020). Reasoning about generalization via conditional mutual information. Conference on Learning Theory.
[lee2019wide] Lee, Jaehoon, Xiao, Lechao, Schoenholz, Samuel, Bahri, Yasaman, Novak, Roman, Sohl-Dickstein, Jascha, Pennington, Jeffrey. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems.
[kingma2013auto] Kingma, Diederik P, Welling, Max. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[strouse2017deterministic] Strouse, DJ, Schwab, David J. (2017). The deterministic information bottleneck. Neural computation.
[elad2019direct] Elad, Adar, Haviv, Doron, Blau, Yochai, Michaeli, Tomer. (2019). Direct validation of the information bottleneck principle for deep nets. Proceedings of the IEEE International Conference on Computer Vision Workshops.
[fischer2020ceb] Fischer, Ian, Alemi, Alexander A. (2020). CEB Improves Model Robustness. arXiv preprint arXiv:2002.05380.
[shannon1948mathematical] Shannon, Claude E. (1948). A mathematical theory of communication. The Bell system technical journal.
[SHAMIR20102696] Ohad Shamir, Sivan Sabato, Naftali Tishby. (2010). Learning and generalization with the information bottleneck. Theoretical Computer Science. doi:https://doi.org/10.1016/j.tcs.2010.04.006.
[painsky2018bregman] Painsky, Amichai, Wornell, Gregory W. (2018). Bregman Divergence Bounds and the Universality of the Logarithmic Loss. arXiv preprint arXiv:1810.07014.
[painsky2018information] Painsky, Amichai, Feder, Meir, Tishby, Naftali. (2018). An Information-Theoretic Framework for Non-linear Canonical Correlation Analysis. arXiv preprint arXiv:1810.13259.
[DBLP:journals/corr/abs-1801-02254] Chiyuan Zhang, Qianli Liao, Alexander Rakhlin, Brando Miranda, Noah Golowich, Tomaso A. Poggio. (2018). Theory of Deep Learning IIb: Optimization Properties of SGD. CoRR.
[entropy2019] Cheng, H., Lian, D., Gao, S., Geng, Y. (2019). Utilizing Information Bottleneck to Evaluate the Capability of Deep Neural Networks for Image Classification. Entropy.
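[gabrie2018entropy] Gabrié, Marylou, Manoel, Andre, Luneau, Clément, Barbier, Jean, Macris, Nicolas, Krzakala, Florent, Zdeborová, Lenka. (2018). Entropy and mutual information in models of deep neural networks. arXiv preprint arXiv:1805.09785.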
[DBLP:journals/corr/abs-1710-11029] Pratik Chaudhari, Stefano Soatto. (2017). Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. CoRR.
[2016arXiv161101353A] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[chechik2005information] Ross, Brian C. (2014). Mutual Information between Discrete and Continuous Data Sets. PLoS ONE. doi:10.1371/journal.pone.0087357.
[harremoes2007information] Painsky, Amichai, Rosset, Saharon, Feder, Meir. (2016). Generalized independent component analysis over finite alphabets. IEEE Transactions on Information Theory.
[rissanen1978modeling] Rissanen, Jorma. (1978). Modeling by shortest data description. Automatica.
[vapnik1968uniform] Vapnik, Vladimir N, Chervonenkis, Aleksei Yakovlevich. (1968). The uniform convergence of frequencies of the appearance of events to their probabilities. Doklady Akademii Nauk.
[sauer1972density] Sauer, Norbert. (1972). On the density of families of sets. Journal of Combinatorial Theory, Series A.
[shelah1972combinatorial] Shelah, Saharon. (1972). A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics.
[hoeffding1963probability] Hoeffding, Wassily. (1963). Probability inequalities for sums of bounded random variables. Journal of the American statistical association.
[chigirev2004optimal] Chigirev, Denis V, Bialek, William. (2004). Optimal manifold representation of data: an information theoretic approach. Advances in Neural Information Processing Systems.
[deco2012information] Deco, Gustavo, Obradovic, Dragan. (2012). An information-theoretic approach to neural computing.
[salakhutdinov2013learning] Jin, Chi, Ge, Rong, Netrapalli, Praneeth, Kakade, Sham M, Jordan, Michael I. (2017). How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887.
[achille2018emergence] Achille, Alessandro, Soatto, Stefano. (2018). Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research.
[saxe2019information] Saxe, Andrew M, Bansal, Yamini, Dapello, Joel, Advani, Madhu, Kolchinsky, Artemy, Tracey, Brendan D, Cox, David D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment.
[yu2020understanding] Yu, Shujian, Wickstrøm, Kristoffer, Jenssen, Robert, Príncipe, José C. (2020). Understanding convolutional neural networks with information theory: An initial exploration. IEEE Transactions on Neural Networks and Learning Systems.
[cheng2018evaluating] Cheng, Hao, Lian, Dongze, Gao, Shenghua, Geng, Yanlin. (2018). Evaluating capability of deep neural networks for image classification via information plane. Proceedings of the European Conference on Computer Vision (ECCV).
[goldfeld2018estimating] Goldfeld, Ziv, Berg, Ewout van den, Greenewald, Kristjan, Melnyk, Igor, Nguyen, Nam, Kingsbury, Brian, Polyanskiy, Yury. (2018). Estimating information flow in deep neural networks. arXiv preprint arXiv:1810.05728.
[wickstrom2019information] Wickstrøm, Kristoffer, Løkse, Sigurd, Kampffmeyer, Michael, Yu, Shujian, Principe, Jose, Jenssen, Robert. (2019). Information Plane Analysis of Deep Neural Networks via Matrix-Based Renyi's Entropy and Tensor Kernels. arXiv preprint arXiv:1909.11396.
[amjad2019learning] Amjad, Rana Ali, Geiger, Bernhard Claus. (2019). Learning representations for neural network-based classification using the information bottleneck principle. IEEE transactions on pattern analysis and machine intelligence.
[goldfeld2020convergence] Goldfeld, Ziv, Greenewald, Kristjan, Niles-Weed, Jonathan, Polyanskiy, Yury. (2020). Convergence of smoothed empirical measures with applications to entropy estimation. IEEE Transactions on Information Theory.
[cvitkovic2019minimal] Cvitkovic, Milan, Koliander, Günther. (2019). Minimal achievable sufficient statistic learning. arXiv preprint arXiv:1905.07822.
[geiger2020information] Geiger, Bernhard C. (2020). On Information Plane Analyses of Neural Network Classifiers--A Review. arXiv preprint arXiv:2003.09671.
[van2020survey] Van Engelen, Jesper E, Hoos, Holger H. (2020). A survey on semi-supervised learning. Machine Learning.
[pogodin2020kernelized] Pogodin, Roman, Latham, Peter E. (2020). Kernelized information bottleneck leads to biologically plausible 3-factor Hebbian learning in deep networks. arXiv preprint arXiv:2006.07123.
[vera2018role] Vera, Matías, Piantanida, Pablo, Vega, Leonardo Rey. (2018). The role of information complexity and randomization in representation learning. arXiv preprint arXiv:1802.05355.
[song2021train] Song, Yang, Kingma, Diederik P. (2021). How to train your energy-based models. arXiv preprint arXiv:2101.03288.
[huembeli2022physics] Huembeli, Patrick, Arrazola, Juan Miguel, Killoran, Nathan, Mohseni, Masoud, Wittek, Peter. (2022). The physics of energy-based models. Quantum Machine Intelligence.
[noshad2018scalable] Noshad, Morteza, Hero III, Alfred O. (2018). Scalable Mutual Information Estimation using Dependence Graphs. arXiv preprint arXiv:1801.09125.
[achille2018information] Achille, Alessandro, Soatto, Stefano. (2018). Information dropout: Learning optimal representations through noisy computation. IEEE transactions on pattern analysis and machine intelligence.
[kirsch2020unpacking] Kirsch, Andreas, Lyle, Clare, Gal, Yarin. (2020). Unpacking Information Bottlenecks: Unifying Information-Theoretic Objectives in Deep Learning. arXiv preprint arXiv:2003.12537.
[pensia2018generalization] Pensia, Ankit, Jog, Varun, Loh, Po-Ling. (2018). Generalization error bounds for noisy, iterative algorithms. 2018 IEEE International Symposium on Information Theory (ISIT).
[NIPS2019_9282] Negrea, Jeffrey, Haghifam, Mahdi, Dziugaite, Gintare Karolina, Khisti, Ashish, Roy, Daniel M. (2019). Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates. Advances in Neural Information Processing Systems 32.
[NIPS2018_7954] Asadi, Amir, Abbe, Emmanuel, Verdu, Sergio. (2018). Chaining Mutual Information and Tightening Generalization Bounds. Advances in Neural Information Processing Systems 31.
[russo2016controlling] Russo, Daniel, Zou, James. (2016). Controlling bias in adaptive data analysis using information theory. Artificial Intelligence and Statistics.
[erdogmus2002information] Erdogmus, Deniz. (2002). Information theoretic learning: Renyi's entropy and its applications to adaptive system training.
[boucheron2005theory] Boucheron, Stéphane, Bousquet, Olivier, Lugosi, Gábor. (2005). Theory of classification: A survey of some recent advances. ESAIM: probability and statistics.
[neyshabur2014search] Neyshabur, Behnam, Tomioka, Ryota, Srebro, Nathan. (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614.
[neyshabur2015norm] Neyshabur, Behnam, Tomioka, Ryota, Srebro, Nathan. (2015). Norm-based capacity control in neural networks. Conference on Learning Theory.
[zhang2016understanding] Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, Vinyals, Oriol. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
[bartlett2002rademacher] Bartlett, Peter L, Mendelson, Shahar. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research.
[bousquet2002stability] Bousquet, Olivier, Elisseeff, Andre. (2002). Stability and generalization. Journal of machine learning research.
[stavac] Achille, Alessandro, Paolini, Giovanni, Soatto, Stefano. (2019). Where is the information in a deep neural network?. arXiv preprint arXiv:1905.12213.
[nash2018inverting] Nash, Charlie, Kushman, Nate, Williams, Christopher KI. (2018). Inverting Supervised Representations with Autoregressive Neural Density Models. arXiv preprint arXiv:1806.00400.
[csiszar1987conditional] Csiszár, Imre, Cover, Thomas M, Choi, Byoung-Seon. (1987). Conditional limit theorems under Markov conditioning. IEEE Transactions on Information Theory.
[berglund2013measuring] Kraskov, Alexander, Stögbauer, Harald, Grassberger, Peter. (2004). Estimating mutual information. Phys. Rev. E. doi:10.1103/PhysRevE.69.066138.
[2014arXiv1412.6615S] Ucar, Talip, Hajiramezanali, Ehsan, Edwards, Lindsay. (2021). Subtab: Subsetting features of tabular data for self-supervised representation learning. Advances in Neural Information Processing Systems.
[roy2018theory] Roy, Aurko, Vaswani, Ashish, Neelakantan, Arvind, Parmar, Niki. (2018). Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063.
[NEURIPS2019_3001ef25] Song, Yang, Ermon, Stefano. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. Advances in Neural Information Processing Systems.
[Hyvrinen06someextensions] Aapo Hyvärinen. (2006). Some extensions of score matching.
[song2020score] Song, Yang, Sohl-Dickstein, Jascha, Kingma, Diederik P, Kumar, Abhishek, Ermon, Stefano, Poole, Ben. (2020). Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
[Vincent2011] Laurent Younes. (1999). On The Convergence Of Markovian Stochastic Algorithms With Rapidly Decreasing Ergodicity Rates. Neural Computation. doi:10.1162/NECO_a_00142.
[chen2021exploring] Chen, Xinlei, He, Kaiming. (2021). Exploring simple siamese representation learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[arik2021tabnet] Arik, Sercan Ö, Pfister, Tomas. (2021). Tabnet: Attentive interpretable tabular learning. Proceedings of the AAAI Conference on Artificial Intelligence.
[GauusianIB2005] Uğur, Yiğit, Arvanitakis, George, Zaidi, Abdellatif. (2020). Variational information bottleneck for unsupervised clustering: Deep gaussian mixture embedding. Entropy.
[turner2007maximum] Turner, Richard, Sahani, Maneesh. (2007). A maximum-likelihood interpretation for slow feature analysis. Neural computation.
[hecht2009speaker] Hecht, Ron M, Noor, Elad, Tishby, Naftali. (2009). Speaker recognition by Gaussian information bottleneck. Tenth Annual Conference of the International Speech Communication Association.
[palmer2015predictive] Palmer, Stephanie E, Marre, Olivier, Berry, Michael J, Bialek, William. (2015). Predictive information in a sensory population. Proceedings of the National Academy of Sciences.
[buesing2010spiking] Buesing, Lars, Maass, Wolfgang. (2010). A spiking neuron as information bottleneck. Neural computation.
[shwartz2018representation] Shwartz-Ziv, Ravid, Painsky, Amichai, Tishby, Naftali. (2018). Representation compression and generalization in deep neural networks.
[HeZRS15] Alam, Mahbubul, Samad, Manar D, Vidyaratne, Lasitha, Glandon, Alexander, Iftekharuddin, Khan M. (2020). Survey on deep neural networks in speech and vision systems. Neurocomputing.
[saxe2018information] Saxe, Andrew M, Bansal, Yamini, Dapello, Joel, Advani, Madhu, Kolchinsky, Artemy, Tracey, Brendan D, Cox, David D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment.
[amjad2018not] Amjad, Rana Ali, Geiger, Bernhard C. (2018). How (Not) To Train Your Neural Network Using the Information Bottleneck Principle. arXiv preprint arXiv:1802.09766.
[elad2018effectiveness] Elad, Adar, Haviv, Doron, Blau, Yochai, Michaeli, Tomer. (2018). The effectiveness of layer-by-layer training using the information bottleneck principle.
[xu2017information] Xu, Aolin, Raginsky, Maxim. (2017). Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems.
[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
[8437679] Russo, Daniel, Zou, James. (2019). How much does your data exploration overfit? Controlling bias via information usage. IEEE Transactions on Information Theory. doi:10.1109/ISIT.2018.8437679.
[jing2021understanding] Jing, Li, Vincent, Pascal, LeCun, Yann, Tian, Yuandong. (2021). Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348.
[zbontar2021barlow] Zbontar, Jure, Jing, Li, Misra, Ishan, LeCun, Yann, Deny, Stéphane. (2021). Barlow twins: Self-supervised learning via redundancy reduction. International Conference on Machine Learning.
[tian2021understanding] Tian, Yuandong, Chen, Xinlei, Ganguli, Surya. (2021). Understanding self-supervised learning dynamics without contrastive pairs. International Conference on Machine Learning.
[hua2021feature] Hua, Tianyu, Wang, Wenxiao, Xue, Zihui, Ren, Sucheng, Wang, Yue, Zhao, Hang. (2021). On feature decorrelation in self-supervised learning. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[grill2020bootstrap] Grill, Jean-Bastien, Strub, Florian, Altché, Florent, et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems.
[caron2020unsupervised] Caron, Mathilde, Misra, Ishan, Mairal, Julien, Goyal, Priya, Bojanowski, Piotr, Joulin, Armand. (2020). Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems.
[zhang2022how] Chaoning Zhang, Kang Zhang, Chenshuang Zhang, Trung X. Pham, Chang D. Yoo, In So Kweon. (2022). How Does SimSiam Avoid Collapse Without Negative Samples? A Unified Understanding with Self-supervised Contrastive Learning. International Conference on Learning Representations.
[Arora2019theory] Arora, Sanjeev, Khandeparkar, Hrishikesh, Khodak, Mikhail, Plevrakis, Orestis, Saunshi, Nikunj. (2019). A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229.
[kolchinsky2017estimating] Kolchinsky, Artemy, Tracey, Brendan D. (2017). Estimating mixture entropy with pairwise distances. Entropy.
[pu2020multimodal] Pu, Shi, He, Yijiang, Li, Zheng, Zheng, Mao. (2020). Multimodal Topic Learning for Video Recommendation. arXiv preprint arXiv:2010.13373.
[voloshynovskiy2019information] Voloshynovskiy, Slava, Taran, Olga, Kondah, Mouad, Holotyak, Taras, Rezende, Danilo. (2020). Variational Information Bottleneck for Semi-Supervised Classification. Entropy. doi:10.3390/e22090943.
[gao2015efficient] Gao, Shuyang, Ver Steeg, Greg, Galstyan, Aram. (2015). Efficient estimation of mutual information for strongly dependent variables. Artificial Intelligence and Statistics.
[Belghazi2018MutualIN] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, R. Devon Hjelm, Aaron C. Courville. (2018). Mutual Information Neural Estimation. ICML.
[donsker1975asymptotic] Donsker, Monroe D, Varadhan, SR Srinivasa. (1975). Asymptotic evaluation of certain Markov process expectations for large time, I. Communications on Pure and Applied Mathematics.
[2018Estimating] Goldfeld. Estimating Information Flow in Neural Networks. ArXiv e-prints.
[jacobsen2018irevnet] Jörn-Henrik Jacobsen, Arnold W.M. Smeulders, Edouard Oyallon. (2018). i-RevNet: Deep Invertible Networks. International Conference on Learning Representations.
[bertsekas2011incremental] Bertsekas, Dimitri P. (2011). Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning.
[li2017convergence] Li, Yuanzhi, Yuan, Yang. (2017). Convergence analysis of two-layer neural networks with relu activation. Advances in Neural Information Processing Systems.
[dieuleveut2017bridging] Dieuleveut, Aymeric, Durmus, Alain, Bach, Francis. (2017). Bridging the gap between constant step size stochastic gradient descent and markov chains. arXiv preprint arXiv:1707.06386.
[rumelhart1986learning] Rumelhart, David E, Hinton, Geoffrey E, Williams, Ronald J. (1986). Learning representations by back-propagating errors. Nature.
[oord2016wavenet] Oord, Aaron van den, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alex, Kalchbrenner, Nal, Senior, Andrew, Kavukcuoglu, Koray. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
[matias2018role] Alemi, Alexander A, Fischer, Ian, Dillon, Joshua V, Murphy, Kevin. (2016). Deep Variational Information Bottleneck. arXiv:1612.00410.
[skincat] Alemi, Alexander A, Poole, Ben, Dillon, Joshua V, Saurous, Rif A, Murphy, Kevin. (2017). An Information-Theoretic Analysis of Deep Latent-Variable Models. arXiv:1711.00464.
[brokenelbo] Alemi, Alexander A, Poole, Ben, Dillon, Joshua V, Saurous, Rif A, Murphy, Kevin. (2018). Fixing a Broken ELBO. ICML 2018.
[infoautoencoding] Anonymous. (2018). The Information-Autoencoding Family: A Lagrangian Perspective on Latent Variable Generative Modeling. International Conference on Learning Representations.
[rationalignorance] Mattingly, Henry H, Transtrum, Mark K, Abbott, Michael C, Machta, Benjamin B. (2017). Rational ignorance: simpler models learn more from finite data. arXiv:1705.01166.
[infoscaling] Abbott, Michael C, Machta, Benjamin B. (2018). An Information Scaling Law: $\zeta = 3/4$. arXiv:1710.09351.
[thermoinfo] Parrondo, Juan MR, Horowitz, Jordan M, Sagawa, Takahiro. (2015). Thermodynamics of information. Nature physics.
[costbenefitdata] Still. Thermodynamic cost and benefit of data representations. arXiv:1705.00612.
[marginalent] Crooks. Marginal and Conditional Second Laws of Thermodynamics. arXiv:1611.04628.
[ben2023reverse] Ben-Shaul, Ido, Shwartz-Ziv, Ravid, Galanti, Tomer, Dekel, Shai, LeCun, Yann. (2023). Reverse Engineering Self-Supervised Learning. arXiv preprint arXiv:2305.15614.
[bar2022detreg] Bar, Amir, Wang, Xin, Kantorov, Vadim, Reed, Colorado J, Herzig, Roei, Chechik, Gal, Rohrbach, Anna, Darrell, Trevor, Globerson, Amir. (2022). Detreg: Unsupervised pretraining with region priors for object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[thermoprediction] Still. Thermodynamics of Prediction. Physical Review Letters. doi:10.1103/PhysRevLett.109.120604.
[interactive] Still. Information-theoretic approach to interactive learning. EPL (Europhysics Letters). doi:10.1209/0295-5075/85/28005.
[optimalcausal] Still. Optimal Causal Inference: Estimating Stored Information and Approximating Causal Architecture. arXiv:0708.1580.
[structurenoise] Still. Structure or Noise?. arXiv:0708.0654.
[clusters] Still. How many clusters? An information theoretic perspective. ArXiv Physics e-prints.
[jaynes] Jaynes, Edwin T. (1957). Information theory and statistical mechanics. Physical review.
[sethna] Sethna, James. (2006). Statistical mechanics: entropy, order parameters, and complexity.
[coverthomas] Cover, Thomas M, Thomas, Joy A. (2012). Elements of information theory.
[reversible] Maclaurin, Dougal, Duvenaud, David, Adams, Ryan P. (2015). Gradient-based Hyperparameter Optimization Through Reversible Learning. Proceedings of the 32nd International Conference on Machine Learning - Volume 37.
[mib] Friedman, Nir, Mosenzon, Ori, Slonim, Noam, Tishby, Naftali. (2001). Multivariate information bottleneck. Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence.
[predictive] Bialek, William, Nemenman, Ilya, Tishby, Naftali. (2001). Predictability, complexity, and learning. Neural computation.
[wenzel2020good] Wenzel, Florian, Roth, Kevin, Veeling, Bastiaan S, Świątkowski, Jakub, et al. (2020). How good is the bayes posterior in deep neural networks really?. arXiv preprint arXiv:2002.02405.
[vae] Kingma, Diederik P, Welling, Max. Auto-encoding variational Bayes.
[zhang2018generalized] Zhang, Zhilu, Sabuncu, Mert. (2018). Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems.
[quinlan2014c4] Quinlan, J Ross. (2014). C4.5: Programs for machine learning.
[betavae] Higgins, Irina, Matthey, Loic, Pal, Arka, Burgess, Christopher, Glorot, Xavier, Botvinick, Matthew, Mohamed, Shakir, Lerchner, Alexander. $\beta$-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework.
[emergence] Achille. Emergence of Invariance and Disentangling in Deep Representations. Proceedings of the ICML Workshop on Principled Approaches to Deep Learning.
[ng2011sparse] Ng, Andrew, others. (2011). Sparse autoencoder. CS294A Lecture notes.
[NIPS2006_2d71b2ae] Lee, Honglak, Battle, Alexis, Raina, Rajat, Ng, Andrew. (2006). Efficient sparse coding algorithms. Advances in Neural Information Processing Systems.
[ib] N. Tishby, F.C. Pereira, W. Bialek. The Information Bottleneck method. The 37th annual Allerton Conf. on Communication, Control, and Computing.
[bbb] Blundell. Weight Uncertainty in Neural Networks. arXiv:1505.05424.
[semi] Kingma. Semi-Supervised Learning with Deep Generative Models. arXiv:1406.5298.
[sgdasbayes] Mandt. Stochastic Gradient Descent as Approximate Bayesian Inference. arXiv:1704.04289.
[sgr] Ma. A Complete Recipe for Stochastic Gradient MCMC. arXiv:1506.04696.
[sgld] Welling, Max, Teh, Yee W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th international conference on machine learning (ICML-11).
[bayessgd] Smith. A Bayesian Perspective on Generalization and Stochastic Gradient Descent. arXiv:1710.06451.
[sghmc] Chen. Stochastic Gradient Hamiltonian Monte Carlo. arXiv:1402.4102.
[snapshot] Huang. Snapshot Ensembles: Train 1, get M for free. arXiv:1704.00109.
[poppar] Machta. Monte Carlo Methods for Rough Free Energy Landscapes: Population Annealing and Parallel Tempering. Journal of Statistical Physics. doi:10.1007/s10955-011-0249-0.
[finn] Finn, Colin BP. (1993). Thermal physics.
[energyentropy] Zhang. Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning. arXiv:1803.01927.
[pacbayes] McAllester. A PAC-Bayesian Tutorial with A Dropout Bound. arXiv:1307.2118.
[pacbayesbayes] Germain, Pascal, Bach, Francis, Lacoste, Alexandre, Lacoste-Julien, Simon. (2016). PAC-Bayesian Theory Meets Bayesian Inference. Advances in Neural Information Processing Systems 29.
[marsh] Marsh, Charles. (2013). Introduction to continuous entropy.
[box] Box, George EP, Draper, Norman R. (1987). Empirical model-building and response surfaces.
[infoprojection] Csiszár, Imre, Matúš, František. (2003). Information projections revisited. IEEE Transactions on Information Theory.
[lecturenotes] Ariel Caticha. (2008). Lectures on Probability, Entropy, and Statistical Physics.
[correspondence] Colin H. LaMont, Paul A. Wiggins. (2017). A correspondence between thermodynamics and inference.
[watanabegrey] Watanabe, Sumio. (2009). Algebraic geometry and statistical learning theory.
[watanabegreen] Watanabe, Sumio. (2018). Mathematical theory of Bayesian statistics.
[whereinfo] Alessandro Achille, Stefano Soatto. (2019). Where is the Information in a Deep Neural Network?.
[ffjord] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, David Duvenaud. (2018). FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models.
[widelinear] Lee. (2019). Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. arXiv e-prints.
[fisherRao] Liang, Tengyuan, Poggio, Tomaso, Rakhlin, Alexander, Stokes, James. (2017). Fisher-rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530.
[AIC] Akaike, Hirotugu. (1974). A new look at the statistical model identification. Selected Papers of Hirotugu Akaike.
[TIC] Thomas. (2019). Information matrices and generalization. arXiv e-prints.
[generalization_dnn] Neyshabur, Behnam, Bhojanapalli, Srinadh, McAllester, David, Srebro, Nati. (2017). Exploring generalization in deep learning. Advances in Neural Information Processing Systems.
[vmibounds] Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A. Alemi, George Tucker. (2019). On Variational Bounds of Mutual Information. CoRR.
[gaussib] Chechik, Gal, Globerson, Amir, Tishby, Naftali, Weiss, Yair. (2005). Information bottleneck for Gaussian variables. Journal of machine learning research.
[halko] Halko, Nathan, Martinsson, Per-Gunnar, Tropp, Joel A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review.
[tishbydeep] Tishby, Naftali, Zaslavsky, Noga. (2015). Deep learning and the information bottleneck principle. 2015 IEEE Information Theory Workshop (ITW).
[saxe] Saxe, Andrew M, Bansal, Yamini, Dapello, Joel, Advani, Madhu, Kolchinsky, Artemy, Tracey, Brendan D, Cox, David D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment.
[hownot] Amjad, Rana Ali, Geiger, Bernhard C. (2018). How (not) to train your neural network using the information bottleneck principle. arXiv preprint arXiv:1802.09766.
[brendan] Kolchinsky, Artemy, Tracey, Brendan D, Van Kuyk, Steven. (2018). Caveats for information bottleneck in deterministic scenarios. arXiv preprint arXiv:1808.07593.
[mnist] LeCun, Yann, Cortes, Corinna, Burges, CJ. (2010). MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist.
[ntk] Jacot, Arthur, Gabriel, Franck, Hongler, Clément. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems.
[neuraltangents] Novak, Roman, Xiao, Lechao, Hron, Jiri, Lee, Jaehoon, Alemi, Alexander A, Sohl-Dickstein, Jascha, Schoenholz, Samuel S. (2019). Neural tangents: Fast and easy infinite neural networks in python. arXiv preprint arXiv:1912.02803.
[fisher] Frederik Kunstner, Lukas Balles, Philipp Hennig. (2019). Limitations of the Empirical Fisher Approximation.
[littlebits] Raef Bassily, Shay Moran, Ido Nachum, Jonathan Shafer, Amir Yehudayoff. (2017). Learners that Use Little Information.
[neuralode] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud. (2018). Neural Ordinary Differential Equations.
[bayesianbounds] Banerjee, Arindam. (2006). On bayesian bounds. Proceedings of the 23rd international conference on Machine learning.
[invertible] Anonymous. (2020). On the Invertibility of Invertible Neural Networks. Submitted to International Conference on Learning Representations.
[cando] Sanjeev Arora, Simon S. Du, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang, Dingli Yu. (2019). Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks.
[liang2019fisher] Liang, Tengyuan, Poggio, Tomaso, Rakhlin, Alexander, Stokes, James. (2019). Fisher-rao metric, geometry, and complexity of neural networks. The 22nd International Conference on Artificial Intelligence and Statistics.
[neyshabur2017exploring] Neyshabur, Behnam, Bhojanapalli, Srinadh, McAllester, David, Srebro, Nati. (2017). Exploring generalization in deep learning. Advances in neural information processing systems.
[hardt2016train] Hardt, Moritz, Recht, Ben, Singer, Yoram. (2016). Train faster, generalize better: Stability of stochastic gradient descent. International Conference on Machine Learning.
[watanabe2010asymptotic] Watanabe, Sumio, Opper, Manfred. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of machine learning research.
[russo2019much] Russo, Daniel, Zou, James. (2019). How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory.
[slonim2002information] Slonim, Noam. (2002). The information bottleneck: Theory and applications.
[Tishby1999] Steinbach, Michael, Ertöz, Levent, Kumar, Vipin. (2004). The challenges of clustering high dimensional data. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing.
[Gilad-bachrach] Ran Gilad-bachrach, Amir Navot, Naftali Tishby. (2003). An information theoretic tradeoff between complexity and accuracy. In Proceedings of the COLT.
[CriticalSlowingDown:2004] Tredicce, Jorge R, Lippi, Gian Luca, Mandel, Paul, Charasse, Basile, Chevalier, Aude, Picqué, B. (2004). Critical slowing down at a bifurcation. American Journal of Physics.
[lee2021compressive] Lee, Kuang-Huei, Arnab, Anurag, Guadarrama, Sergio, Canny, John, Fischer, Ian. (2021). Compressive visual representations. Advances in Neural Information Processing Systems.
[tishby99information] Tishby, Naftali, Pereira, Fernando C., Bialek, William. The information bottleneck method. Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing.
[Csiszar] Csiszár, Imre, Shields, Paul C. (2004). Information Theory and Statistics: A Tutorial. Commun. Inf. Theory. doi:10.1561/0100000004.
[Cover:2006:EIT:1146355] Cover, Thomas M., Thomas, Joy A.. (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing).
[DBLP:conf/alt/ShamirST08] Ohad Shamir, Sivan Sabato, Naftali Tishby. (2010). Learning and generalization with the information bottleneck. Theor. Comput. Sci..
[DBLP:conf/alt/2008] Algorithmic Learning Theory, 19th International Conference, ALT. (2008).
[Exp_forms] Lawrence D. Brown. (1986). Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory. Lecture Notes-Monograph Series.
[Painsky2019] Painsky. (2018). Bregman Divergence Bounds and the Universality of the Logarithmic Loss. arXiv e-prints.
[Csiszar:2004:ITS:1166379.1166380] Csiszár, Imre, Shields, Paul C. (2004). Information Theory and Statistics: A Tutorial. Commun. Inf. Theory. doi:10.1561/0100000004.
[CIS-58533] Tusnady, G., Csiszar, I.. (1984). Information geometry and alternating minimization procedures. Statistics & Decisions: Supplement Issues.
[slonim_MIB] Slonim, Noam, Friedman, Nir, Tishby, Naftali. (2006). Multivariate Information Bottleneck. Neural Computation. doi:10.1162/neco.2006.18.8.1739.
[Ay2019] Domenico Felice, Nihat Ay. (2019). Divergence Functions in Information Geometry. Geometric Science of Information - 4th International Conference, GSI. doi:10.1007/978-3-030-26980-7_45.
[DBLP:conf/gsi/2019] Geometric Science of Information - 4th International Conference, GSI. (2019).
[parker] Albert E. Parker, Tomáš Gedeon, Alexander G. Dimitrov. (2003). Annealing and the Rate Distortion Problem. Advances in Neural Information Processing Systems 15.
[Jaynes58] Jaynes, E. T.. (1957). Information Theory and Statistical Mechanics. Phys. Rev.. doi:10.1103/PhysRev.106.620.
[ZaslavskyTishby:2019] Zaslavsky, Noga, Tishby, Naftali. (2019). Deterministic Annealing and the Evolution of Optimal Information Bottleneck Representations. Preprint.
[Kullback58] S. Kullback. (1959). Information Theory and Statistics.
[GaussianIB] Chechik, Gal, Globerson, Amir, Tishby, Naftali, Weiss, Yair. (2005). Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res..
[globerson2003sufficient] Globerson, Amir, Tishby, Naftali. (2003). Sufficient dimensionality reduction. Journal of Machine Learning Research.
[ma2019unpaired] Ma, Shuang, McDuff, Daniel, Song, Yale. (2019). Unpaired Image-to-Speech Synthesis with Multimodal Information Bottleneck. Proceedings of the IEEE International Conference on Computer Vision.
[schneidman2001analyzing] Schneidman, Elad, Slonim, Noam, Tishby, Naftali, van Steveninck, R deRuyter, Bialek, William. (2001). Analyzing neural codes using the information bottleneck method. Advances in Neural Information Processing Systems, NIPS.
[Parbhoo2018CausalDI] Sonali Parbhoo, Mario Wieser, Volker Roth. (2018). Causal Deep Information Bottleneck. ArXiv.
[westover2008asymptotic] Westover, M Brandon. (2008). Asymptotic geometry of multiple hypothesis testing. IEEE transactions on information theory.
[nielsen2011chernoff] Nielsen, Frank. (2011). Chernoff information of exponential families. arXiv preprint arXiv:1102.2684.
[wieczorek2020difference] Wieczorek, Aleksander, Roth, Volker. (2020). On the Difference between the Information Bottleneck and the Deep Information Bottleneck. Entropy.
[wu2020phase] Wu, Tailin, Fischer, Ian. (2020). Phase Transitions for the Information Bottleneck in Representation Learning. arXiv preprint arXiv:2001.01878.
[fischer2018conditional] Fischer, Ian. (2018). The conditional entropy bottleneck. URL openreview. net/forum.
[lecun-mnisthandwrittendigit-2010] LeCun, Yann, Cortes, Corinna. MNIST.
[raman2017illum] Raman, Ravi Kiran, Yu, Haizi, Varshney, Lav R. (2017). Illum information. 2017 Information Theory and Applications Workshop (ITA).
[palomar2008lautum] Palomar, Daniel P, Verdú, Sergio. (2008). Lautum information. IEEE transactions on information theory.
[poole2019variational] Poole, Ben, Ozair, Sherjil, Oord, Aaron van den, Alemi, Alexander A, Tucker, George. (2019). On variational bounds of mutual information. arXiv preprint arXiv:1905.06922.
[hsu2018generalizing] Hsu, Hsiang, Asoodeh, Shahab, Salamatian, Salman, Calmon, Flavio P. (2018). Generalizing bottleneck problems. 2018 IEEE International Symposium on Information Theory (ISIT).
[dusenberry2020efficient] Dusenberry, Michael W, Jerfel, Ghassen, Wen, Yeming, Ma, Yi-an, Snoek, Jasper, Heller, Katherine, Lakshminarayanan, Balaji, Tran, Dustin. (2020). Efficient and Scalable Bayesian Neural Nets with Rank-1 Factors. arXiv preprint arXiv:2005.07186.
[zagoruyko2016wide] Zagoruyko, Sergey, Komodakis, Nikos. (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
[muller2019does] Müller, Rafael, Kornblith, Simon, Hinton, Geoffrey E. (2019). When does label smoothing help?. Advances in Neural Information Processing Systems.
[shalev2014understanding] Shalev-Shwartz, Shai, Ben-David, Shai. (2014). Understanding machine learning: From theory to algorithms.
[kingma2014adam] Kingma, Diederik P, Ba, Jimmy. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[zagoruyko2017diracnets] Zagoruyko, Sergey, Komodakis, Nikos. (2017). Diracnets: Training very deep neural networks without skip-connections. arXiv preprint arXiv:1706.00388.
[shamir2008learning] Shamir, Ohad, Sabato, Sivan, Tishby, Naftali. (2008). Learning and generalization with the information bottleneck. International Conference on Algorithmic Learning Theory.
[li-eisner-2019] Li, Xiang Lisa, Eisner, Jason. (2019). Specializing Word Embeddings (for Parsing) by Information Bottleneck. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
[chopra-cvpr-05] Chopra, Sumit, Hadsell, Raia, LeCun, Yann. (2005). Learning a similarity metric discriminatively, with application to face verification. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).
[hadsell-cvpr-06] Hadsell, Raia, Chopra, Sumit, LeCun, Yann. (2006). Dimensionality reduction by learning an invariant mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).
[oord2017neural] Van Den Oord, Aaron, Vinyals, Oriol, others. (2017). Neural discrete representation learning. Advances in neural information processing systems.
[bib1] Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
[bib2] Achille et al. (2017) Alessandro Achille, Matteo Rovere, and Stefano Soatto. Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856, 2017.
[bib3] Alam et al. (2020) Mahbubul Alam, Manar D Samad, Lasitha Vidyaratne, Alexander Glandon, and Khan M Iftekharuddin. Survey on deep neural networks in speech and vision systems. Neurocomputing, 417:302–321, 2020.
[bib4] Alemi et al. (2016) Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv:1612.00410, 2016. URL http://arxiv.org/abs/1612.00410.
[bib5] Rana Ali Amjad and Bernhard C Geiger. How (not) to train your neural network using the information bottleneck principle. arXiv preprint arXiv:1802.09766, 2018.
[bib6] Rana Ali Amjad and Bernhard Claus Geiger. Learning representations for neural network-based classification using the information bottleneck principle. IEEE transactions on pattern analysis and machine intelligence, 2019.
[bib7] Andrew et al. (2013) Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International conference on machine learning, pages 1247–1255. PMLR, 2013.
[bib8] Sercan Ö Arik and Tomas Pfister. Tabnet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6679–6687, 2021.
[bib9] Arora et al. (2019) Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019.
[bib10] Francis R Bach and Michael I Jordan. Kernel independent component analysis. Journal of machine learning research, 3(Jul):1–48, 2002.
[bib11] Francis R. Bach and Michael I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3(null):1–48, mar 2003. ISSN 1532-4435. doi: 10.1162/153244303768966085. URL https://doi.org/10.1162/153244303768966085.
[bib12] Bachman et al. (2019) Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. Advances in neural information processing systems, 32, 2019.
[bib13] Bar et al. (2022) Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik, Anna Rohrbach, Trevor Darrell, and Amir Globerson. Detreg: Unsupervised pretraining with region priors for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14605–14615, 2022.
[bib14] Bardes et al. (2021) Adrien Bardes, Jean Ponce, and Yann LeCun. Vicreg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021.
[bib15] Basirat et al. (2021) Mina Basirat, Bernhard C. Geiger, and Peter M. Roth. A geometric perspective on information plane analysis. Entropy, 23(6), 2021. ISSN 1099-4300. URL https://www.mdpi.com/1099-4300/23/6/711.
[bib16] Suzanna Becker and Geoffrey E Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356):161–163, 1992.
[bib17] Belghazi et al. (2018a) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, R. Devon Hjelm, and Aaron C. Courville. Mutual information neural estimation. In ICML, 2018a.
[bib18] Belghazi et al. (2018b) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018b.
[bib19] Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
[bib20] Ben-Shaul et al. (2023) Ido Ben-Shaul, Ravid Shwartz-Ziv, Tomer Galanti, Shai Dekel, and Yann LeCun. Reverse engineering self-supervised learning. arXiv preprint arXiv:2305.15614, 2023.
[bib21] Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, 2007.
[bib22] Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
[bib23] Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems, 32, 2019.
[bib24] Bromley et al. (1993) Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. Advances in neural information processing systems, 6, 1993.
[bib25] Lars Buesing and Wolfgang Maass. A spiking neuron as information bottleneck. Neural computation, 22(8):1961–1992, 2010.
[bib26] Cao et al. (2013) Tian Cao, Vladimir Jojic, Shannon Modla, Debbie Powell, Kirk Czymmek, and Marc Niethammer. Robust multimodal dictionary learning. In Kensaku Mori, Ichiro Sakuma, Yoshinobu Sato, Christian Barillot, and Nassir Navab, editors, Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013, pages 259–266, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg. ISBN 978-3-642-40811-3.
[bib27] Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.
[bib28] Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
[bib29] Chapelle et al. (2009) Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.
[bib30] Chelombiev et al. (2019) Ivan Chelombiev, Conor Houghton, and Cian O’Donnell. Adaptive estimators show information compression in deep neural networks. arXiv preprint arXiv:1902.09037, 2019.
[bib31] Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020a.
[bib32] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
[bib33] Chen et al. (2020b) Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b.
[bib34] Chopra et al. (2005) Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 539–546. IEEE, 2005.
[bib35] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
[bib36] Luke Nicholas Darlow and Amos Storkey. What information does a resnet compress? arXiv preprint arXiv:2003.06254, 2020.
[bib37] Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
[bib38] Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[bib39] Dinh et al. (2016) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
[bib40] Donahue et al. (2015) Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
[bib41] Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain markov process expectations for large time, i. Communications on Pure and Applied Mathematics, 28(1):1–47, 1975.
[bib42] Dubois et al. (2021) Yann Dubois, Benjamin Bloem-Reddy, Karen Ullrich, and Chris J Maddison. Lossy compression for lossless prediction. Advances in Neural Information Processing Systems, 34, 2021.
[bib43] Elad et al. (2019a) Adar Elad, Doron Haviv, Yochai Blau, and Tomer Michaeli. Direct validation of the information bottleneck principle for deep nets. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019a.
[bib44] Elad et al. (2019b) Adar Elad, Doron Haviv, Yochai Blau, and Tomer Michaeli. The effectiveness of layer-by-layer training using the information bottleneck principle, 2019b. URL https://openreview.net/forum?id=r1Nb5i05tX.
[bib45] Gal Elidan and Nir Friedman. The information bottleneck em algorithm. arXiv preprint arXiv:1212.2460, 2012.
[bib46] Alper T Erdogan. An information maximization based blind source separation approach for dependent and independent sources. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4378–4382. IEEE, 2022.
[bib47] Deniz Erdogmus. Information theoretic learning: Renyi’s entropy and its applications to adaptive system training. University of Florida, 2002.
[bib48] Federici et al. (2020) Marco Federici, Anjan Dutta, Patrick Forré, Nate Kushman, and Zeynep Akata. Learning robust representations via multi-view information bottleneck. arXiv preprint arXiv:2002.07017, 2020.
[bib49] Ian Fischer. The conditional entropy bottleneck. Entropy, 22(9):999, 2020.
[bib50] Friedman et al. (2013) Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. Multivariate information bottleneck. arXiv preprint arXiv:1301.2270, 2013.
[bib51] Gao et al. (2015) Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286, 2015.
[bib52] Bernhard C Geiger. On information plane analyses of neural network classifiers–a review. arXiv preprint arXiv:2003.09671, 2020.
[bib53] Geiping et al. (2022) Jonas Geiping, Micah Goldblum, Gowthami Somepalli, Ravid Shwartz-Ziv, Tom Goldstein, and Andrew Gordon Wilson. How much data are augmentations worth? an investigation into scaling laws, invariance, and implicit regularization. arXiv preprint arXiv:2210.06441, 2022.
[bib54] Germain et al. (2015) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. Made: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889. PMLR, 2015.
[bib55] Goldfeld et al. (2018) Z. Goldfeld, E. van den Berg, K. Greenewald, I. Melnyk, N. Nguyen, B. Kingsbury, and Y. Polyanskiy. Estimating Information Flow in Neural Networks. ArXiv e-prints, 2018.
[bib56] Ziv Goldfeld and Kristjan Greenewald. Sliced mutual information: A scalable measure of statistical dependence. Advances in Neural Information Processing Systems, 34:17567–17578, 2021.
[bib57] Goldfeld et al. (2022) Ziv Goldfeld, Kristjan Greenewald, Theshani Nuradha, and Galen Reeves. k-sliced mutual information: A quantitative study of scalability with dimension. arXiv preprint arXiv:2206.08526, 2022.
[bib58] Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2016. URL http://www.deeplearningbook.org.
[bib59] Yves Grandvalet and Yoshua Bengio. Entropy regularization. 2006.
[bib60] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[bib61] Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.
[bib62] Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
[bib63] Hang et al. (2018) Hanyuan Hang, Ingo Steinwart, Yunlong Feng, and Johan AK Suykens. Kernel density estimation for dynamical systems. The Journal of Machine Learning Research, 19(1):1260–1308, 2018.
[bib64] Hardoon et al. (2004) David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004. doi: 10.1162/0899766042321814.
[bib65] He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
[bib66] He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
[bib67] He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
[bib68] Hecht et al. (2009) Ron M Hecht, Elad Noor, and Naftali Tishby. Speaker recognition by gaussian information bottleneck. In Tenth Annual Conference of the International Speech Communication Association, 2009.
[bib69] Olivier Henaff. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pages 4182–4192. PMLR, 2020.
[bib70] Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[bib71] Hjelm et al. (2018) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018.
[bib72] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936. ISSN 00063444. URL http://www.jstor.org/stable/2333955.
[bib73] Huang et al. (2019) Zhenyu Huang, Joey Tianyi Zhou, Xi Peng, Changqing Zhang, Hongyuan Zhu, and Jiancheng Lv. Multi-view spectral clustering network. In IJCAI, pages 2563–2569, 2019.
[bib74] Huembeli et al. (2022) Patrick Huembeli, Juan Miguel Arrazola, Nathan Killoran, Masoud Mohseni, and Peter Wittek. The physics of energy-based models. Quantum Machine Intelligence, 4(1):1–13, 2022.
[bib75] Aapo Hyvärinen. Some extensions of score matching, 2006.
[bib76] Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
[bib77] Jia et al. (2010) Yangqing Jia, Mathieu Salzmann, and Trevor Darrell. Factorized latent spaces with structured sparsity. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper/2010/file/a49e9411d64ff53eccfdd09ad10a15b3-Paper.pdf.
[bib78] Jing et al. (2021) Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. arXiv preprint arXiv:2110.09348, 2021.
[bib79] Jonathan Kahana and Yedid Hoshen. A contrastive objective for learning disentangled representations. arXiv preprint arXiv:2203.11284, 2022.
[bib80] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137, 2015.
[bib81] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[bib82] Diederik P Kingma and Max Welling. An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691, 2019.
[bib83] Kingma et al. (2014) Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. Advances in neural information processing systems, 27, 2014.
[bib84] Bernard Osgood Koopman. On distributions admitting a sufficient statistic. Transactions of the American Mathematical society, 39(3):399–409, 1936.
[bib85] Lyudmyla F Kozachenko and Nikolai N Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
[bib86] Abhishek Kumar and Hal Daumé. A co-training approach for multi-view spectral clustering. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 393–400. Citeseer, 2011.
[bib87] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
[bib88] Larsen et al. (2016) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning, pages 1558–1566. PMLR, 2016.
[bib89] LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 2015.
[bib90] Lee et al. (2013) Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896, 2013.
[bib91] Lee et al. (2006) Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Ng. Efficient sparse coding algorithms. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2006. URL https://proceedings.neurips.cc/paper_files/paper/2006/file/2d71b2ae158c7c5912cc0bbde2bb9d95-Paper.pdf.
[bib92] Lee et al. (2021a) Jason D Lee, Qi Lei, Nikunj Saunshi, and Jiacheng Zhuo. Predicting what you already know helps: Provable self-supervised learning. Advances in Neural Information Processing Systems, 34, 2021a.
[bib93] Lee et al. (2021b) Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama, John Canny, and Ian Fischer. Compressive visual representations. Advances in Neural Information Processing Systems, 34, 2021b.
[bib94] Li et al. (2018) Yingming Li, Ming Yang, and Zhongfei Zhang. A survey of multi-view representation learning. IEEE transactions on knowledge and data engineering, 31(10):1863–1883, 2018.
[bib95] Ralph Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.
[bib96] Liu et al. (2021a) Shiming Liu, Yifan Xia, Zhusheng Shi, Hui Yu, Zhiqiang Li, and Jianguo Lin. Deep learning in sheet metal bending with a novel theory-guided deep neural network. IEEE/CAA Journal of Automatica Sinica, 8(3):565–581, 2021a.
[bib97] Liu et al. (2014) Weifeng Liu, Dacheng Tao, Jun Cheng, and Yuanyan Tang. Multiview hessian discriminative sparse coding for image annotation. Computer Vision and Image Understanding, 118:50–60, 2014. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2013.03.007. URL https://www.sciencedirect.com/science/article/pii/S1077314213001550.
[bib98] Liu et al. (2021b) Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 2021b.
[bib99] Lou et al. (2013) Zhengzheng Lou, Yangdong Ye, and Xiaoqiang Yan. The multi-feature information bottleneck with application to unsupervised image categorization. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
[bib100] Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
[bib101] Mao et al. (2014) Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632, 2014.
[bib102] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020.
[bib103] Miyato et al. (2018) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
[bib104] Nash et al. (2018) Charlie Nash, Nate Kushman, and Christopher KI Williams. Inverting supervised representations with autoregressive neural density models. arXiv preprint arXiv:1806.00400, 2018.
[bib105] Ng et al. (2011) Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.
[bib106] Ngiam et al. (2011) Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning, ICML’11, page 689–696, Madison, WI, USA, 2011. Omnipress. ISBN 9781450306195.
[bib107] Noshad and Hero III (2018) Morteza Noshad and Alfred O Hero III. Scalable mutual information estimation using dependence graphs. arXiv preprint arXiv:1801.09125, 2018.
[bib108] Ozsoy et al. (2022) Serdar Ozsoy, Shadi Hamdan, Sercan Arik, Deniz Yuret, and Alper Erdogan. Self-supervised learning with an information maximization criterion. Advances in Neural Information Processing Systems, 35:35240–35253, 2022.
[bib109] Amichai Painsky and Gregory W Wornell. On the universality of the logistic loss function. arXiv preprint arXiv:1805.03804, 2018.
[bib110] Palmer et al. (2015) Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive information in a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908–6913, 2015.
[bib111] Liam Paninski. Estimation of entropy and mutual information. Neural Comput., 15(6):1191–1253, 2003. ISSN 0899-7667. doi: 10.1162/089976603321780272.
[bib112] Pensia et al. (2018) Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550. IEEE, 2018.
[bib113] Piran et al. (2020) Zoe Piran, Ravid Shwartz-Ziv, and Naftali Tishby. The dual information bottleneck. arXiv preprint arXiv:2006.04641, 2020.
[bib114] Pu et al. (2020) Shi Pu, Yijiang He, Zheng Li, and Mao Zheng. Multimodal topic learning for video recommendation. arXiv preprint arXiv:2010.13373, 2020.
[bib115] J Ross Quinlan. C4.5: Programs for machine learning. Elsevier, 2014.
[bib116] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015.
[bib117] Brian C Ross. Mutual information between discrete and continuous data sets. PLoS ONE, 9(2):e87357, 2014. doi: 10.1371/journal.pone.0087357. URL https://doi.org/10.1371/journal.pone.0087357.
[bib118] Roy et al. (2018) Aurko Roy, Ashish Vaswani, Arvind Neelakantan, and Niki Parmar. Theory and experiments on vector quantized autoencoders. arXiv preprint arXiv:1805.11063, 2018.
[bib119] Daniel Russo and James Zou. How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
[bib120] Saxe et al. (2019) Andrew M Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D Tracey, and David D Cox. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124020, 2019.
[bib121] Shamir et al. (2010) Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29):2696 – 2711, 2010. ISSN 0304-3975. doi: https://doi.org/10.1016/j.tcs.2010.04.006. URL http://www.sciencedirect.com/science/article/pii/S030439751000201X. Algorithmic Learning Theory (ALT 2008).
[bib122] Ravid Shwartz-Ziv. Information flow in deep neural networks. arXiv preprint arXiv:2202.06749, 2022.
[bib123] Ravid Shwartz-Ziv and Alexander A Alemi. Information in infinite ensembles of infinitely-wide neural networks. In Symposium on Advances in Approximate Bayesian Inference, pages 1–17. PMLR, 2020.
[bib124] Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022.
[bib125] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
[bib126] Shwartz-Ziv et al. (2018) Ravid Shwartz-Ziv, Amichai Painsky, and Naftali Tishby. Representation compression and generalization in deep neural networks, 2018.
[bib127] Shwartz-Ziv et al. (2022a) Ravid Shwartz-Ziv, Randall Balestriero, and Yann LeCun. What do we maximize in self-supervised learning? arXiv preprint arXiv:2207.10081, 2022a.
[bib128] Shwartz-Ziv et al. (2022b) Ravid Shwartz-Ziv, Micah Goldblum, Hossein Souri, Sanyam Kapoor, Chen Zhu, Yann LeCun, and Andrew G Wilson. Pre-train your loss: Easy bayesian transfer learning with informative priors. Advances in Neural Information Processing Systems, 35:27706–27715, 2022b.
[bib129] Shwartz-Ziv et al. (2023) Ravid Shwartz-Ziv, Randall Balestriero, Kenji Kawaguchi, Tim GJ Rudner, and Yann LeCun. An information-theoretic perspective on variance-invariance-covariance regularization. arXiv preprint arXiv:2303.00633, 2023.
[bib130] Smieja et al. (2019) M Smieja, M Wolczyk, J Tabor, and B Geiger. Segma: Semi-supervised gaussian mixture auto-encoder. arXiv preprint arXiv:1906.09333, 2019.
[bib131] Sohn et al. (2020) Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33:596–608, 2020.
[bib132] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf.
[bib133] Yang Song and Diederik P Kingma. How to train your energy-based models. arXiv preprint arXiv:2101.03288, 2021.
[bib134] Song et al. (2020) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
[bib135] Jost Tobias Springenberg. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
[bib136] Karthik Sridharan and Sham Kakade. An information theoretic framework for multi-view learning. 2008.
[bib137] Nitish Srivastava and Ruslan Salakhutdinov. Multimodal learning with deep boltzmann machines. Journal of Machine Learning Research, 15(84):2949–2980, 2014. URL http://jmlr.org/papers/v15/srivastava14b.html.
[bib138] Thomas Steinke and Lydia Zakynthinou. Reasoning about generalization via conditional mutual information. In Conference on Learning Theory, pages 3437–3452. PMLR, 2020.
[bib139] Sun et al. (2010) Liang Sun, Betul Ceran, and Jieping Ye. A scalable two-stage approach for a class of dimensionality reduction techniques. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 313–322, 2010.
[bib140] Shiliang Sun. A survey of multi-view machine learning. Neural Computing and Applications, 23:2031–2038, 2013.
[bib141] Tian et al. (2020a) Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In European conference on computer vision, pages 776–794. Springer, 2020a.
[bib142] Tian et al. (2020b) Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? Advances in Neural Information Processing Systems, 33:6827–6839, 2020b.
[bib143] Tishby et al. (1999a) N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In The 37th annual Allerton Conf. on Communication, Control, and Computing, pages 368–377, 1999a. URL https://arxiv.org/abs/physics/0004057.
[bib144] Tishby et al. (1999b) Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing, 1999b.
[bib145] Tsai et al. (2020) Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning from a multi-view perspective. arXiv preprint arXiv:2006.05576, 2020.
[bib146] Tschannen et al. (2019) Michael Tschannen, Josip Djolonga, Paul K Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019.
[bib147] Richard Turner and Maneesh Sahani. A maximum-likelihood interpretation for slow feature analysis. Neural computation, 19(4):1022–1038, 2007.
[bib148] Ucar et al. (2021) Talip Ucar, Ehsan Hajiramezanali, and Lindsay Edwards. Subtab: Subsetting features of tabular data for self-supervised representation learning. Advances in Neural Information Processing Systems, 34:18853–18865, 2021.
[bib149] Uğur et al. (2020) Yiğit Uğur, George Arvanitakis, and Abdellatif Zaidi. Variational information bottleneck for unsupervised clustering: Deep gaussian mixture embedding. Entropy, 22(2):213, 2020.
[bib150] Van den Oord et al. (2016) Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29, 2016.
[bib151] Van Den Oord et al. (2017) Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
[bib152] Vera et al. (2018) Matías Vera, Pablo Piantanida, and Leonardo Rey Vega. The role of information complexity and randomization in representation learning. arXiv preprint arXiv:1802.05355, 2018.
[bib153] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011. doi: 10.1162/NECO_a_00142.
[bib154] Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008.
[bib155] Voloshynovskiy et al. (2020) Slava Voloshynovskiy, Olga Taran, Mouad Kondah, Taras Holotyak, and Danilo Rezende. Variational information bottleneck for semi-supervised classification. Entropy, 22(9), 2020. ISSN 1099-4300. doi: 10.3390/e22090943. URL https://www.mdpi.com/1099-4300/22/9/943.
[bib156] Wang et al. (2022) Haoqing Wang, Xun Guo, Zhi-Hong Deng, and Yan Lu. Rethinking minimal sufficient representation in contrastive learning. arXiv preprint arXiv:2203.07004, 2022.
[bib157] Wang et al. (2019) Qi Wang, Claire Boudreau, Qixing Luo, Pang-Ning Tan, and Jiayu Zhou. Deep Multi-view Information Bottleneck, pages 37–45. SIAM, 2019. doi: 10.1137/1.9781611975673.5. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611975673.5.
[bib158] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR, 2020.
[bib159] Wang et al. (2015) Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, page 1083–1092. JMLR.org, 2015.
[bib160] Wenzel et al. (2020) Florian Wenzel, Kevin Roth, Bastiaan S Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the bayes posterior in deep neural networks really? arXiv preprint arXiv:2002.02405, 2020.
[bib161] Laurenz Wiskott and Terrence J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002. doi: 10.1162/089976602317318938.
[bib162] Xie et al. (2020) Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33:6256–6268, 2020.
[bib163] Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. Advances in Neural Information Processing Systems, 30, 2017.
[bib164] Xu et al. (2014) Chang Xu, Dacheng Tao, and Chao Xu. Large-margin multi-view information bottleneck. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36:1559–1572, 2014.
[bib165] Xu et al. (2020) Yilun Xu, Shengjia Zhao, Jiaming Song, Russell Stewart, and Stefano Ermon. A theory of usable information under computational constraints. arXiv preprint arXiv:2002.10689, 2020.
[bib166] Xue et al. (2019) Zhe Xue, Junping Du, Dawei Du, and Siwei Lyu. Deep low-rank subspace ensemble for multi-view clustering. Information Sciences, 482:210–227, 2019. ISSN 0020-0255. doi: https://doi.org/10.1016/j.ins.2019.01.018. URL https://www.sciencedirect.com/science/article/pii/S0020025519300271.
[bib167] Yan et al. (2015) Xiaoqiang Yan, Yangdong Ye, and Zhengzheng Lou. Unsupervised video categorization based on multivariate information bottleneck method. Knowledge-Based Systems, 84:34–45, 2015.
[bib168] Yan et al. (2021) Xiaoqiang Yan, Shizhe Hu, Yiqiao Mao, Yangdong Ye, and Hui Yu. Deep multi-view learning methods: A review. Neurocomputing, 448:106–129, 2021. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2021.03.090. URL https://www.sciencedirect.com/science/article/pii/S0925231221004768.
[bib169] Laurent Younes. On the convergence of markovian stochastic algorithms with rapidly decreasing ergodicity rates. In STOCHASTICS AND STOCHASTICS MODELS, pages 177–228, 1999.
[bib170] Zbontar et al. (2021) Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310–12320. PMLR, 2021.
[bib171] Zhai et al. (2019) Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4l: Self-supervised semi-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1476–1485, 2019.
[bib172] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. Advances in neural information processing systems, 31, 2018.
[bib173] Zhao et al. (2017a) Handong Zhao, Zhengming Ding, and Yun Fu. Multi-view clustering via deep matrix factorization. In Thirty-first AAAI conference on artificial intelligence, 2017a.
[bib174] Zhao et al. (2017b) Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017b. ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2017.02.007. URL https://www.sciencedirect.com/science/article/pii/S1566253516302032.
[bib175] Zhao et al. (2017c) Shengjia Zhao, Jiaming Song, and Stefano Ermon. Infovae: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017c.
[bib176] Zimmermann et al. (2021) Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pages 12979–12990. PMLR, 2021.