Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations
Yilun Kuang, Yash Dagade, Tim G. J. Rudner, Randall Balestriero, Yann LeCun
Abstract
Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected $\ell_0$ norm through rectification, while preserving maximum-entropy up to rescaling under expected $\ell_p$ norm constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity–performance trade-offs and competitive downstream performance on image classification benchmarks, demonstrating that RDMReg effectively enforces sparsity while preserving task-relevant information.
Sparse and Maximum-Entropy Distributions
Yilun Kuang 1 Yash Dagade 2 Tim G. J. Rudner 3 Randall Balestriero 4 Yann LeCun 1
/github GitHub

Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected ℓ 0 norm through rectification, while preserving maximum-entropy up to rescaling under expected ℓ p norm constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity-performance trade-offs and competitive downstream performance on image classification benchmarks, demonstrating that RDMReg effectively enforces sparsity while preserving task-relevant information.
=45

(a)
Rectified LpJEPA Diagram.
Non-Closure under Projections.
Figure 1. Rectified LpJEPA. (a) Two views ( x, x ′ ) of the same underlying data are embedded and rectified to obtain ReLU( z ) and ReLU( z ′ ) ∈ R d . Rectified LpJEPA minimizes the ℓ 2 distance between rectified features while regularizing the d -dimensional rectified feature distribution towards a product of i.i.d. Rectified Gaussian distributions ReLU( N ( µ, σ 2 )) using RDMReg. As a result, each coordinate of the learned representation aligns towards a Rectified Gaussian distribution (CDF shown above), a special case of the Rectified Generalized Gaussian family RGN p ( µ, σ ) when p = 2 . In the absence of rectification on both the features and the target distribution, Rectified LpJEPA reduces to isotropic Gaussian regularization as in LeJEPA (Balestriero & LeCun, 2025). (b) Samples from 2 -dimensional Gaussian N ( 0 , I ) and Rectified Gaussian ReLU( N ( 0 , I )) are drawn and projected along a certain direction c . As opposed to Gaussian which is closed under linear combinations, the projected marginals of the Rectified Gaussian distribution no longer fall in the same family, motivating the necessity of using two-sample distribution-matching losses.
1 New York University. 2 Duke University. 3 University of Toronto. 4 Brown University. Correspondence to: Yilun Kuang < yilun. kuang@nyu.edu > , Yann LeCun < yann.lecun@nyu.edu > .
Introduction
Self-supervised representation learning has emerged as a promising paradigm for advancing machine intelligence without explicit supervision (Radford et al., 2018; Chen et al., 2020). A prominent class of methods-JointEmbedding Predictive Architectures (JEPAs)-learn representations by enforcing consistency across multiple views of the same data in the latent space, while avoiding explicit reconstructions or density estimations in the observation space (LeCun, 2022; Assran et al., 2023).
By decoupling learning from observation-level constraints, JEPAs operate at a higher level of abstraction, enabling flexibility in encoding task-relevant information. However, invariance alone admits degenerate solutions, including complete or dimensional collapse, where representations concentrate in trivial or low-rank subspaces (Jing et al., 2022).
Recent efforts culminate towards a distribution-matching approach for collapse prevention. In particular, LeJEPA (Balestriero & LeCun, 2025) introduces the SIGReg loss, which aligns one-dimensional projected feature marginals towards a univariate Gaussian across random projections, thereby regularizing the full representation towards isotropic Gaussian with convergence guaranteed by the Cram´ er-Wold theorem. By decomposing high-dimensional distribution matching into parallel one-dimensional projection-based optimizations, SIGReg mitigates the curse of dimensionality and enables scalable representation learning. The resulting features are maximum-entropy under fixed ℓ 2 norm constraints and strictly generalize prior second-order regularization methods such as VICReg (Bardes et al., 2022) by suppressing all higher-order statistical dependencies.
However, restricting feature distributions to isotropic Gaussian severely limits the range of representational structures that can be expressed. In particular, isotropic Gaussian features alone do not capture a key property of effective representations: sparsity . Across neuroscience, signal processing, and deep learning, sparse and non-negative codes repeatedly emerge as efficient, interpretable, and robust representations (Olshausen & Field, 1996; Donoho, 2006; Lee &Seung, 1999; Glorot et al., 2011).
To this end, we propose Rectified Distribution Matching Regularization (RDMReg), a two-sample sliced distributionmatching regularizer that aligns JEPA representations to the Rectified Generalized Gaussian (RGG) distribution, a novel family of probability distributions with controllable expected ℓ p norms and induced ℓ 0 regularizations from explicit rectifications. Notably, RGG also preserves maximumentropy up to rescaling under sparsity constraints, thereby preventing collapses even in highly sparse regimes.
The resulting method, Rectified LpJEPA, strictly generalizes LeJEPA, which arises as a special case corresponding to the dense regime of the Generalized Gaussian family. By introducing a principled inductive bias toward sparsity and non-negativity, Rectified LpJEPA jointly enforces invariance, preserves task-relevant information, and enables controllable sparsity.
We summarize our contributions as follows:
- Rectified Generalized Gaussian Distributions . We introduce the Rectified Generalized Gaussian (RGG) distribution and show that it enjoys maximum entropy properties under expected ℓ p norms constraints with induced ℓ 0 norm regularizations.
- Rectified LpJEPA with RDMReg . We propose Rectified LpJEPA, a novel JEPA architecture equipped with Rectified Distribution Matching Regularization (RDMReg), enabling controllable sparsity and non-negativity in learned representations.
- Empirical Validation . We empirically demonstrate that Rectified LpJEPA achieves controllable sparsity, favorable sparsity-performance trade-offs, improved statistical independence, and competitive downstream accuracy across image classification benchmarks.
Background
In this section, we review key notions of sparsity. Additional background can be found in Appendix A.
Sparsity . Beyond its role in robust recovery and compressed sensing (Mallat, 1999; Donoho, 2006), sparsity has long been argued to be a fundamental organizing principle of efficient information processing in human and animal intelligence (Barlow et al., 1961). In sensory neuroscience, extensive empirical evidences suggest that neural systems encode dense and high-dimensional sensory inputs into nonnegative, sparse activations under strict metabolic and signaling constraints (Olshausen & Field, 1996; Attwell & Laughlin, 2001).
In signal processing, sparse coding seeks to reconstruct signals using a minimal number of active components, typically enforced through ℓ 1 regularization (Chen et al., 2001). Complementarily, non-negative matrix factorization enforces non-negativity by restricting representations to the positive orthant, inducing a conic geometry that yields parts-based and interpretable decompositions (Lee & Seung, 1999).
In deep learning, rectifying nonlinearities such as ReLU enforce non-negativity by zeroing negative responses, inducing support sparsity akin to ℓ 0 constraints and underpinning the success of modern deep networks (Nair & Hinton, 2010; Glorot et al., 2011).
Metrics of Sparsity. To quantify sparsity, we consider a vector x ∈ R d . The ℓ 0 (pseudo-)norm ∥ x ∥ 0 :=

Figure 2. Rectified Laplace ( p = 1 ) and Rectified Gaussian ( p = 2 ) as special cases of Rectified Generalized Gaussian distributions. Assume µ = 0 and σ = 1 . For any p > 0 , the Truncated Generalized Gaussian ∏ d i =1 T GN p over the support (0 , ∞ ) d is the maximum differential entropy distribution under a fixed expected ℓ p -norm constraint. For p ∈ { 1 , 2 } , ∏ d i =1 T GN p further admits a radial-angular decomposition x = r · u with r ⊥ ⊥ u , where u is uniformly distributed with respect to the surface measure on the unit ℓ p -sphere confined to the positive orthant and r p follows a Gamma distribution. Rectified Laplace and Rectified Gaussian arise via coordinate-wise mixing of the corresponding truncated distributions with a Dirac measure at zero, yielding a distribution with expected ℓ 0 -norm guarantees, where Φ L and Φ N denote the cumulative distribution functions of the standard Laplace and standard Gaussian distributions respectively.
∑ d i =1 ✶ R { 0 } ( x i ) counts the number of nonzero elements in x , where ✶ S ( x ) is the indicator function that evaluates to 1 if x ∈ S and 0 otherwise. Direct minimization of ℓ 0 norm is however an NP-hard problem (Natarajan, 1995). A standard relaxation is the ℓ 1 norm, ∥ x ∥ 1 := ∑ d i =1 | x i | , which is the tightest convex envelope of ℓ 0 on bounded domains and enables tractable optimization (Tibshirani, 1996).
More generally, ℓ p quasi-norms ∥ x ∥ p p := ∑ d i =1 | x i | p with 0 < p < 1 provide a closer, nonconvex approximation to ℓ 0 : their singular behavior near zero strongly favors exact sparsity while exerting weaker penalties on large-magnitude components. Although nonconvexity complicates optimization, such penalties have been shown to yield sparser and less biased solutions than ℓ 1 under suitable conditions (Chartrand, 2007; Chartrand & Yin, 2008).
Sparse and Maximum-Entropy Distributions
In the following section, we show that the proposed Rectified Generalized Gaussian distribution is the direct mathematical consequence of maximizing entropy under ℓ p constraints with an induced ℓ 0 regularization, yielding representations that are simultaneously informative and sparse. We first introduce the Generalized Gaussian distribution (Section 3.1) and its truncated variant (Section 3.2), and show that they are the maximum-entropy distributions under an expected ℓ p norm constraint (Section 3.3). We then show that incorporating rectification yields the Rectified Generalized Gaussian distribution (Section 3.4), which preserves maximal-entropy guarantees-rescaled by the R´ enyi information dimension-while explicitly inducing ℓ 0 sparsity (Section 3.5).
Generalized Gaussian Distributions
In Definition 3.1, we present the standard form of the Generalized Gaussian Distribution (Subbotin, 1923; Goodman &Kotz, 1973; Nadarajah, 2005).
$$
$$
where Γ( s ) := ∫ ∞ 0 t s -1 e -t dt is the gamma function.
We observe that GN p ( µ, σ ) reduces to the Laplace distribution when p = 1 and the Gaussian distribution for p = 2 .
Truncated Generalized Gaussian Distributions
If we restrict the support, we obtain the Truncated Generalized Gaussian Distributions in Definition 3.2.
Definition 3.2 (Truncated Generalized Gaussian Distribution) . Let S ⊆ R be a subset of R with positive Lebesgue measure. The Truncated Generalized Gaussian distribution T GN p ( µ, σ, S ) is the restriction of the Generalized
Gaussian distribution GN p ( µ, σ ) to the support S . The probability density function of T GN p ( µ, σ, S ) is given by
$$
$$
$$
$$
where ✶ S ( x ) is the indicator function that evaluates to 1 if x ∈ S and 0 otherwise. The partition function is
When S = R , T GN p ( µ, σ ) is equivalent to GN p ( µ, σ ) .
Maximum Entropy under $ ell_p$ Constraints
We consider the multivariate generalization (Goodman & Kotz, 1973) as the joint distribution resulting from the product measure of independent and identically distributed (i.i.d.) Truncated Generalized Gaussian random variables, i.e. x ∼ ∏ d i =1 T GN p ( µ, σ, S ) where x = ( x 1 , . . . , x d ) for each x i ∼ T GN p ( µ, σ, S ) . For our purposes, we only need S = (0 , ∞ ) and thus the joint support is (0 , ∞ ) d .
In Proposition 3.3, we show that the zero-mean Multivariate Truncated Generalized Gaussian Distribution is in fact the maximum differential entropy distribution under the expected ℓ p norm constraints.
Proposition 3.3 (Maximum Entropy Characterizations of Multivariate Truncated Generalized Gaussian Distributions) . The maximum entropy distribution over S ⊆ R d , where S is a subset of R d with positive Lebesgue measure, under the constraints
$$
$$
$$
$$
where λ 1 = -1 /pσ p and Z S ( λ 1 ) is the partition function.
$$
$$
In fact, if S = R d , we show in Corollary E.2 that E [ ∥ x ∥ p p ] = dσ p . An immediate consequence of Proposition 3.3 is the well-known fact that Truncated Laplace and Truncated Gaussian over the same support set S are maximal entropy under the expected ℓ 1 and ℓ 2 norm constraints respectively. For any 0 <p< 1 , this proposition still holds true and thus we obtain a continuous spectrum of sparse distributions.
Rectified Generalized Gaussian Distributions
In Definition 3.4, we introduce the Rectified Generalized Gaussian (RGG) distribution.
Definition 3.4 (Rectified Generalized Gaussian) . The Rectified Generalized Gaussian distribution RGN p ( µ, σ ) is a mixture between a discrete Dirac measure δ 0 ( x ) (Definition B.4) and a Truncated Generalized Gaussian distribution T GN p ( µ, σ, (0 , ∞ )) with probability density function
$$
$$
$$
$$
̸
In Appendices B and C, we present additional technical details of the Rectified Generalized Gaussian distribution. We also visualizes the connections between Truncated Generalized Gaussian and Rectified Generalized Gaussian distributions in Figure 2. For p = 2 , we recover the Rectified Gaussian distribution (Socci et al., 1997; Anderson et al., 1997). To the best of our knowledge, our extension and application of the Generalized Gaussian distribution to its rectified variant is novel for p = 2 .
Nardon & Pianca (2009) proposed the simulation technique for Generalized Gaussian random variables. In Algorithm 1, we show how to sample from the Rectified Generalized Gaussian distribution RGN p ( µ, σ ) . Essentially, we only need to first sample from the Generalized Gaussian distribution, and then rectify. In other words, x ∼ ReLU( GN p ( µ, σ )) is equivalent to x ∼ RGN p ( µ, σ ) .
Sparsity and Entropy
Proposition 3.5 (Sparsity) . Let x ∼ ∏ d i =1 RGN p ( µ, σ ) in d dimension. Then
$$
$$
$$
$$
where sgn( · ) is the sign function and P ( · , · ) is the lower regularized gamma function.
$$
$$
Due to explicit rectifications, the RGG family is absolutely continuous with respect to the mixture between the Dirac
![Figure 3. Rectified LpJEPA achieves controllable sparsity and favorable sparsity-performance tradeoffs under proper parameterizations. (a) We report CIFAR-100 validation accuracy and the ℓ 0 sparsity metric 1 -(1 /d ) · E [ ∥ x ∥ 0 ] for four settings where we match non-rectified features z or rectified features z + := ReLU( z ) to either Rectified Generalized Gaussian RGN p or conventional Generalized Gaussian GN p . Rectified LpJEPA ( RGN p | z + ) achieves the best sparsity-performance tradeoffs compared to other settings. (b) We compare the normalized ℓ 0 norm of pretrained Rectified LpJEPA features against the theoretical predictions of Proposition 3.5 as µ varies. Empirical sparsity closely follows the predicted behavior across different values of µ and p . (c) We plot the Pareto frontier of sparsity versus accuracy across varying values of µ and p . Performance drops sharply only when more than ∼ 95% of entries are zero.](2602.01456-figure_003.png)
Figure 3. Rectified LpJEPA achieves controllable sparsity and favorable sparsity-performance tradeoffs under proper parameterizations. (a) We report CIFAR-100 validation accuracy and the ℓ 0 sparsity metric 1 -(1 /d ) · E [ ∥ x ∥ 0 ] for four settings where we match non-rectified features z or rectified features z + := ReLU( z ) to either Rectified Generalized Gaussian RGN p or conventional Generalized Gaussian GN p . Rectified LpJEPA ( RGN p | z + ) achieves the best sparsity-performance tradeoffs compared to other settings. (b) We compare the normalized ℓ 0 norm of pretrained Rectified LpJEPA features against the theoretical predictions of Proposition 3.5 as µ varies. Empirical sparsity closely follows the predicted behavior across different values of µ and p . (c) We plot the Pareto frontier of sparsity versus accuracy across varying values of µ and p . Performance drops sharply only when more than ∼ 95% of entries are zero.
and Lebesgue measure (Lemma B.6), rendering differential entropy ill-defined. Thus we resort to the concept of d ( ξ ) -dimensional entropy by (R´ enyi, 1959), which measures the Shannon entropy of quantized random vector under successive grid refinement. In Theorem 3.6, we provide a d ( ξ ) -dimensional entropy characterization of Rectified Generalized Gaussian, where d ( ξ ) is the R´ enyi information dimension. We defer additional details on the R´ enyi information dimension to Appendix F.
Theorem 3.6 (R´ enyi Information Dimension Characterizations of Multivariate Rectified Generalized Gaussian Distributions) . Let ξ ∼ ∏ D i =1 RGN p ( µ, σ ) be a Rectified Generalized Gaussian random vector. The R´ enyi information dimension of ξ is d ( ξ ) = D · Φ GN p (0 , 1) ( µ/σ ) , and the d ( ξ ) -dimensional entropy of ξ is given by
$$
$$
$$
$$
where H 0 ( · ) is the discrete Shannon entropy, H 1 ( · ) denotes the differential entropy, and ✶ (0 , ∞ ) ( ξ i ) is a Bernoulli random variable that equals 1 with probability Φ GN p (0 , 1) ( µ/σ ) and 0 with probability 1 -Φ GN p (0 , 1) ( µ/σ ) .
$$
$$
Thus we have shown that rectifications still preserve the maximal entropy property of the original distribution up to rescaling by the R´ enyi information dimension and constant offsets. In Lemma F.6, we further shows that the d ( ξ ) -dimensional entropy coincides with differential entropy under change of measure, enabling the interpretation of entropy under a Dirac and Lebesgue mixed measure.
Rectified LpJEPA
In the following section, we present a distributional regularization method based on the Cramer-Wold device (Section 4.1) for matching feature distributions towards Rectified Generalized Gaussian, resulting in the Rectified LpJEPA with Rectified Distribution Matching Regularization (RDMReg) (Section 4.2). Contrary to isotropic Gaussian, the Rectified Generalized Gaussian is not closed under linear combinations, leading to the necessity of two-sample sliced distribution matching (Section 4.3). We further demonstrate that RDMReg recovers a form of Non-Negative VCReg, which we defined in Section 4.4. Finally, we discuss various design choices of the target distribution with the parameter sets ( µ, σ, p ) which balance between sparsity and maximumentropy (Section 4.5).
Cramer-Wold Based Distribution Matching
The Cram´ er-Wold device states that two random vectors x , y ∈ R d are equal in distribution, i.e. x d = y , if and only if all their one-dimensional linear projections are equal in distribution (Cram´ er, 1936; Wold, 1938)
$$
$$
This result enables us to decompose a high-dimensional distribution matching problem into parallelized one-dimension optimizations, which significantly reduces the sample complexity in each of the one-dimensional problems.
Rectified LpJPEA with RDMReg
Let ( x , x ′ ) ∼ P x , x ′ denote a pair of random vectors jointly distributed according to a view-generating distribution P x , x ′ , where x and x represent two stochastic views (e.g., random augmentations) of the same underlying input data. Let f θ be a neural network. We write z = ReLU( f θ ( x )) and z ′ = ReLU( f θ ( x ′ )) , where z , z ′ ∈ R D are the output feature random vectors. We further sample y ∼ ∏ d i =1 RGN p ( µ, σ ) and the random projection vectors c from the uniform distribution on the ℓ 2 sphere, i.e. c ∼ Unif( S d -1 ℓ 2 ) . We denote the induced distribution under projections as P c ⊤ z and P c ⊤ y .
Our self-supervised learning objective consists of (i) an invariance term enforcing consistency across views, and (ii) a two-sample sliced distribution-matching loss which we called the Rectified Distribution Matching Regularization (RDMReg). The resulting loss takes the form
$$
$$
$$
$$
where L ( P ∥ Q ) is any loss function that minimizes the distance between two univariate distributions P and Q .
The Necessity of Two-Sample Hypothesis Testing
Contrary to the isotropic Gaussian, which is closed under linear combinations, the Rectified Generalized Gaussian (RGG) family is not preserved under linear projections: the one-dimensional projected marginals generally fall outside the RGG family. In fact, closure under linear combinations characterizes the class of multivariate stable distributions (Nolan, 1993), which is disjoint from our RGG family. As illustrated in Figure 1b, while any linear projection of a Gaussian remains Gaussian, projecting a Rectified Gaussian along different directions yields distinctly different marginals that no longer belong to the Rectified Gaussian family.
Consequently, the distribution matching loss L ( ·∥· ) must rely on sample-based, nonparametric two-sample hypothesis tests on projected marginals (Lehmann & Romano, 1951; Gretton et al., 2012). Among many possible choices, we instantiate this objective using the sliced 2 -Wasserstein distance (Bonneel et al., 2015; Kolouri et al., 2018) as it works well empirically. Let Z , Y ∈ R B × D be empirical neural network feature matrix and the samples from RGG where B is batch size and D is dimension. We denote a single random projection vector as c i ∈ R D out of N total projections.
The RDMReg loss function is given by
$$
$$
where ( · ) ↑ denotes sorting in ascending order. We additionally show in Figure 13c that a small, dimension-independent N suffices to achieve optimal performances.
Non-Negative VCReg Recovery
In Appendix I, we further show that minimizing the RDMReg loss recovers a form of Non-Negative VCReg (Appendix H) which minimizes second-order dependencies using only linear number of projections in dimensions. We further show in Figure 14 that using eigenvectors of the empirical feature covariance matrices as projection vectors c i accelerates second-order dependencies removal and hence leads to faster convergence towards optimal performances.
Hyperparameters of the Target Distributions
Proposition 3.5 shows that the hyperparameter set { µ, σ, p } collectively determines the ℓ 0 sparsity. σ is a special parameter since we always want σ > ϵ , where ϵ is some pre-specified threshold value, to prevent collapse.
We denote σ GN = Γ(1 /p ) 1 / 2 / ( p 1 /p · Γ(3 /p ) 1 / 2 ) as the choice to ensure that the variance of the random variable before rectification is 1 since the closed form variance is readily available for the Generalized Gaussian distribution.
It's also possible to find σ RGN such that the variance after rectification is 1 . In Proposition B.9, we derive the closed form expectation and variance of the Rectified Generalized Gaussian distribution. The choice of σ RGN can be determined by running a bisection search algorithm (see Algorithm 2) over the closed form variance formula. We defer additional comparisons between σ RGN and σ GN to Appendix D. Unless otherwise specified, we use σ GN as the default option.
Empirical Results
In the following sections, we introduce the basic settings and evaluations (Section 5.1). We establish our Rectified LpJEPA designs as the correct parameterizations to learn informative and sparse features compared to other possible alternatives (Sections 5.2 and 5.3). Rectified LpJEPA achieves controllable sparsity (Section 5.4) and favorable sparsity and performance tradeoffs (Section 5.5) with added benefits of learning more statistically independent (Section 5.6), highentropy (Section 5.7) features, and performs competitively in pretraining and transfer evaluations (Section 5.8).
Table 1. Linear Probe Results on ImageNet-100. Acc1 (%) is higher-is-better ( ↑ ); sparsity is lower-is-better ( ↓ ). Bold denotes best and underline denotes second-best in each column (ties allowed).
Experimental Settings
Baselines . We compare Rectified LpJEPA with dense baselines including SimCLR (denoted CL) (Chen et al., 2020) and VICReg (Bardes et al., 2022), as well as their sparse counterparts NCL (Wang et al., 2024) and Non-Negative VICReg (denoted NVICReg). We additionally compare against LpJEPA, which matches non-rectified features to Generalized Gaussian targets. Additional details for all baselines are provided in Appendix H.
Sparsity Metrics . We define the ℓ 1 sparsity metric for a D -dimensional random vector m ℓ 1 ( x ) = (1 /D ) · E [ ∥ x ∥ 2 1 / ∥ x ∥ 2 2 ] , which attains its minimum value 1 /D for extremely sparse vectors and its maximum value 1 for dense, uniformly distributed features. We additionally report the ℓ 0 sparsity metric m ℓ 0 ( x ) = (1 /D ) · E [ ∥ x ∥ 0 ] , which measures the fraction of nonzero entries, with m ℓ 0 = 0 indicating all-zero vectors and m ℓ 0 = 1 indicating fully dense representations. In Figure 13b, we empirically observe strong correlations between m ℓ 1 and m ℓ 0 metrics. Sometimes we report 1 -m ℓ 1 ( x ) or 1 -m ℓ 0 ( x ) for visualization purposes.
Backbones . Following conventional practices in selfsupervised learning (Balestriero et al., 2023), we adopt the encoder-projector design z = ReLU( f θ 2 ( f θ 1 ( x )) where f θ 1 is a encoder like ResNet (He et al., 2016) or ViT (Dosovitskiy, 2020) and f θ 2 is an additional multilayer perceptron. The Rectified LpJEPA loss is applied over z and linear probe evaluations are carried out on both z and f θ 1 ( x ) . We note that we add ReLU( · ) at the end based on our design. The overall architecture is visualized in Figure 1a.
Necessity of Rectifications
In Figure 3a, we report CIFAR-100 validation accuracy against the ℓ 0 sparsity metric 1 -(1 /D ) · E [ ∥ x ∥ 0 ] under ab- lations that independently control rectification of the target distribution and the learned features. Corresponding results using ℓ 1 sparsity are provided in Figure 11c. Without rectification, models achieve competitive accuracy but produce dense representations with no zero entries. When features are rectified, Rectified LpJEPA attains the best accuracy and sparsity tradeoff, whereas imposing an isotropic Gaussian distribution for rectified features leads to substantial performance drops.
Anti-Collapse via Continuous Mapping Theorems
By the continuous mapping theorem, convergence of x ∈ R d to a Generalized Gaussian implies that ReLU( x ) follows a Rectified Generalized Gaussian. In Figure 11b, we compare linear probe evaluations of Rectified LpJEPA versus LpJEPA features, where the linear probe is trained on pretrained LpJEPA after an additional rectification. We observe that performance drops sharply for the latter case, indicating that it's necessary to directly match to the Rectified Generalized Gaussian distribution.
Controllable Sparsity
Under the correct parameterizations of both the target distributions and the neural network features, we proceed to validate if we observe controllable sparsity in practice. Proposition 3.5 shows that the expected ℓ 0 norm is collectively determined by the set of parameters { µ, σ, p } . In Figure 3b, we show both the empirical ℓ 0 norms measured over different pretrained backbones (ResNet (He et al., 2016), ViT (Dosovitskiy, 2020), ConvNext (Liu et al., 2022)) and the theoretical ℓ 0 norm computed using Equation (9) as a function of varying µ and p and the choice of σ GN mentioned in Section 4.5. We observe that across different mean shift values µ on the x-axis, the empirical ℓ 0 closely tracks the
![Figure 4. Rectified LpJEPA empirically achieves higher-entropy, more independent features with dataset-adaptive sparsity. (a) The averaged univariate d ( ξ ) -dimensional entropy of the Rectified LpJEPA features are computed against the ℓ 1 sparsity metric 1 -(1 /D ) · E [ ∥ z ∥ 2 1 / ∥ z ∥ 2 2 ] across varying µ and p . Overall, we observe the expected behavior of sparsity-entropy tradeoff (b) We evaluate the normalized Hilbert-Schmidt independence Criterion (nHSIC) for LpJEPA, Rectified LpJEPA, and other baselines. Rectified LpJEPA achieves smaller nHSIC values compared to VICReg or NVICReg that only penalizes second-order statistics. (c) The relative mean absolute deviations (MAD) away from the median of the ℓ 1 and ℓ 0 sparsity metrics are computed over different methods. Rectified LpJEPA exhibits the highest variations of sparsity for different downstream dataset. Additional visualizations can be found in Figure 12.](2602.01456-figure_004.png)
Figure 4. Rectified LpJEPA empirically achieves higher-entropy, more independent features with dataset-adaptive sparsity. (a) The averaged univariate d ( ξ ) -dimensional entropy of the Rectified LpJEPA features are computed against the ℓ 1 sparsity metric 1 -(1 /D ) · E [ ∥ z ∥ 2 1 / ∥ z ∥ 2 2 ] across varying µ and p . Overall, we observe the expected behavior of sparsity-entropy tradeoff (b) We evaluate the normalized Hilbert-Schmidt independence Criterion (nHSIC) for LpJEPA, Rectified LpJEPA, and other baselines. Rectified LpJEPA achieves smaller nHSIC values compared to VICReg or NVICReg that only penalizes second-order statistics. (c) The relative mean absolute deviations (MAD) away from the median of the ℓ 1 and ℓ 0 sparsity metrics are computed over different methods. Rectified LpJEPA exhibits the highest variations of sparsity for different downstream dataset. Additional visualizations can be found in Figure 12.
theoretical predictions, and the theoretical ordering between p in the expected ℓ 0 norms is also preserved in the empirical results. We defer additional comparisons between σ GN and σ RGN and more choices of p to Figure 9.
Sparsity and Performance Tradeoff
With controllable sparsity at hand, we are interested in to what extent we can sparsify our features without performance drops. In Figure 3c, we plot the Pareto frontier of validation accuracy against the ℓ 0 sparsity metrics 1 -(1 /D ) · E [ ∥ x ∥ 0 ] across varying µ and p with the choice of σ GN. We observe smooth and slow decay of performance as number of zeros in the feature representations increase, and the cliff-like drop in performance only occurs when roughly 95% of the entries are zero, indicating significant exploitable sparsity in our learned image representations. Additional visualizations are deferred to the Figure 13a.
Pair-wise Independence via HSIC
Beyond sparsity, we evaluate whether the learned representations form approximately independent, factorial encodings of the input data. A principled measure of dependence is the total correlation , defined as the KL divergence between the joint distribution and the product of its marginals. However, estimating total correlation is intractable in highdimensional space (McAllester & Stratos, 2020). We therefore resort to the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005) as a practical surrogate for detecting statistical dependence beyond second-order correlations captured by the covariance matrix.
In Figure 4b, we report the normalized HSIC values (see Appendix G for details) of Rectified LpJEPA and several dense and sparse baselines. Compared to methods such as VICReg and NVICReg, which explicitly regularize secondorder statistics but do not constrain higher-order dependencies, Rectified LpJEPA consistently achieves lower nHSIC values, indicating representations that are closer to being statistically independent. Contrastive methods such as CL and NCL also attain low nHSIC scores; however, contrastive objectives are known to suffer from high sample complexity in high-dimensional representation spaces (Chen et al., 2020). Overall, these results suggest that RDMReg objectives encourage not only sparsity but also reduced higher-order dependence.
Rényi Information Dimension and Entropy
We would like to quantify whether the learned representations exhibit high entropy. However, due to rectification, the resulting feature distributions are not absolutely continuous with respect to the Lebesgue measure, rendering standard differential entropy ill-defined and obscuring whether the usual decomposition of total correlation into marginal and joint entropies remains valid. In Appendix F.5, we show that this decomposition continues to hold when entropy is defined in terms of the d ( ξ ) -dimensional entropy.
In Figure 4a, we report the sum of marginal d ( ξ ) -dimensional entropies as an upper bound on the joint entropy across a range of dense and sparse representations. The results reveal a clear Pareto frontier between entropy and sparsity. Moreover, since Rectified LpJEPA consistently attains
lower nHSIC values than VICReg-style baselines, indicating reduced statistical dependence, the marginal entropy estimates for Rectified LpJEPA are expected to provide a tighter and more faithful approximation of the joint entropy.
Pretraining and Transfer Evaluations
In Table 1, we report linear probe results for Rectified LpJEPA pretrained on ImageNet100, compared against a range of dense and sparse baselines. Rectified LpJEPA consistently achieves a favorable trade-off between downstream accuracy and representation sparsity.
We further evaluate transfer performance under both fewshot and full-shot settings (see Tables 3 to 8). Across all configurations, Rectified LpJEPA achieves competitive accuracy, demonstrating strong transferability. In Figure 4c, we additionally observe that pretrained Rectified LpJEPA representations exhibit distinct sparsity patterns across multiple out-of-distribution datasets, suggesting that sparsity statistics can serve as a useful proxy for distinguishing indistribution training data from OOD inputs. Additional results can be seen in Appendix J. We also present additional nearest-neighbors retrieval and visual attribution maps in Appendix K.
Conclusion
We introduced Rectified LpJEPA, a JEPA model equipped with Rectified Distribution Matching Regularization (RDMReg) that induces sparse representations through distribution matching to the Rectified Generalized Gaussian distributions. By showing that sparsity can be achieved via target distribution design while preserving task-relevant information, our work opens new avenues for fundamental research on JEPA regularizers.
Acknowledgements
We thank Deep Chakraborty and Nadav Timor for helpful discussions. This work was supported in part by AFOSR under grant FA95502310139, NSF Award 1922658, and Kevin Buehler's gift. This work was also supported through the NYU IT High Performance Computing resources, services, and staff expertise.
Experimental Settings
Alonso-Gutierrez, D., Prochno, J., and Thaele, C. Gaussian fluctuations for high-dimensional random projections of ℓ n p -balls, 2018. URL https://arxiv.org/abs/ 1710.10130 .
Anderson, J., Barlow, H. B., Gregory, R. L., Hinton, G. E., and Ghahramani, Z. Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B: Biological Sciences , 352 (1358):1177-1190, 08 1997. ISSN 0962-8436. doi: 10.1098/rstb.1997.0101. URL https://doi.org/ 10.1098/rstb.1997.0101 .
Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., and Ballas, N. Self-supervised learning from images with a joint-embedding predictive architecture, 2023. URL https://arxiv.org/ abs/2301.08243 .
Attwell, D. and Laughlin, S. B. An energy budget for signaling in the grey matter of the brain. Journal of Cerebral Blood Flow & Metabolism , 21(10):1133-1145, 2001.
Balestriero, R., Ibrahim, M., Sobal, V ., Morcos, A., Shekhar, S., Goldstein, T., Bordes, F., Bardes, A., Mialon, G., Tian, Y., Schwarzschild, A., Wilson, A. G., Geiping, J., Garrido, Q., Fernandez, P., Bar, A., Pirsiavash, H., LeCun, Y., and Goldblum, M. A cookbook of self-supervised learning, 2023. URL https://arxiv.org/abs/ 2304.12210 .
Barlow, H. B. et al. Possible principles underlying the transformation of sensory messages. Sensory communication , 1(01):217-233, 1961.
Barthe, F., Gu´ edon, O., Mendelson, S., and Naor, A. A probabilistic approach to the geometry of the ℓ pn-ball. The Annals of Probability , 33(2), March 2005. ISSN 0091-1798. doi: 10.1214/009117904000000874. URL http://dx. doi.org/10.1214/009117904000000874 .
Bonneel, N., Rabin, J., Peyr´ e, G., and Pfister, H. Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision , 51(1):22-45, 2015.
Chakraborty, D., LeCun, Y., Rudner, T. G. J., and LearnedMiller, E. Improving pre-trained self-supervised embeddings through effective entropy maximization, 2025. URL https://arxiv.org/abs/2411.15931 .
LARGE Appendix
Additional Backgrounds
Self-Supervised Learning . Common self-supervised learning can be categorized into 1) contrastive methods (Chen et al., 2020; He et al., 2020), 2) non-contrastive methods (Zbontar et al., 2021; Bardes et al., 2022; Ermolov et al., 2021), 3) self-distillation methods (Grill et al., 2020; Caron et al., 2021; Chen & He, 2020) based on Balestriero et al. (2023). Along the line of statistical redundancy reductions, MCR 2 (Yu et al., 2020) regularizes the log determinant of the scaled empirical covariance matrix shifted by the identity matrix while MMCR (Yerxa et al., 2023) penalizes the nuclear norm of the centroid feature matrix. E2MC (Chakraborty et al., 2025) minimizes the sum of marginal entropies of the feature distribution on top of minimizing the VCReg loss (Bardes et al., 2022). Radial-VCReg (Kuang et al., 2025) and LeJEPA (Balestriero & LeCun, 2025) go beyond second-order dependencies by learning isotropic Gaussian features. Our Rectified LpJEPA also reduce higher-order dependencies by design, while enforcing sparsity over learned representations.
Prior work like Non-Negative Contrastive learning (NCL) (Wang et al., 2024) also aims to learn sparse features by optimizing contrastive losses over rectified features. Contrastive Sparse Representation (CSR) (Wen et al., 2025) develops a post-training sparsity adaptation method by learning a sparse auto-encoder (SAE) (Gao et al., 2024) over pretrained dense features using NCL loss, reconstruction loss, and a couple of SAE-specific auxiliary losses.
Cramer-Wold Based Distribution Matching Losses . Exemplars include sliced Wasserstein distances and their generative extensions (Bonneel et al., 2015; Kolouri et al., 2018), sliced kernel discrepancies (Nadjahi et al., 2020), projection-averaged multivariate tests (Kim et al., 2019), and more recently LeJEPA with SIGReg loss (Balestriero & LeCun, 2025), which also show that it suffices to sample c ∈ S d -1 ℓ 2 := { c ∈ R d | ∥ c ∥ 2 = 1 } .
Properties of Univariate Generalized Gaussian, Truncated Generalized Gaussian, and Rectified Generalized Gaussian Distributions
In the following section, we present additional details on the Generalized Gaussian (Appendix B.1), Truncated Generalized Gaussian (Appendix B.2), and the Rectified Generalized Gaussian distributions (Appendix B.3). We also present the expectation and variance (Appendix B.4) and the sampling method (Appendix B.5) for the Rectified Generalized Gaussian distribution.
Univariate Case - Generalized Gaussian
The Generalized Gaussian distribution GN p ( µ, σ ) (Subbotin, 1923; Goodman & Kotz, 1973; Nadarajah, 2005) has the probability density function given in Definition 3.1 with expectation and variance as
$$
$$
$$
$$
The cumulative distribution function of GN p ( µ, σ ) is given by
$$
$$
where γ ( · , · ) is the lower incomplete gamma function. We note that the probability density function of GN p ( µ, σ ) has other parameterizations (Remark B.1) and there are well-known special cases when p = 1 or p = 2 (Remark B.2). Remark B.1 . The probability density function of the Generalized Gaussian distribution can also be written as
$$
$$
where α := p 1 /p σ . We choose the particular presentation in Definition 3.1 for its connection to the family of L p -norm spherical distributions (Gupta & Song, 1997).
$$
$$
$$
$$
For measure-theoretical characterizations of the Rectified Generalized Gaussian distribution in Appendix B.3, we denote the probability measure for X ∼ GN p ( µ, σ ) as P GN p ( µ,σ ) .

Figure 5. The Probability Density Function of Generalized Gaussian GN p , Truncated Generalized Gaussian T GN p , and Rectified Generalized Gaussian RGN p across varying p with fixed µ = -0 . 5 and σ = 1 . Φ GN p (0 , 1) is the CDF of the Generalized Gaussian GN p (0 , 1) . (a) The case when p = 0 . 5 . (b) When p = 1 , we obtain Laplace, Truncated Laplace, and Rectified Laplace. (c) For p = 2 , we have Gaussian, Truncated Gaussian, and Rectified Gaussian.
Univariate Case - Truncated Generalized Gaussian
The Truncated Generalized Gaussian distribution is defined in Definition 3.2 in terms of the probability density function. In Definition B.3, we present the definition of the Truncated Generalized Gaussian probability measure.
Definition B.3 (Truncated Generalized Gaussian Probability Measure) . Let X ∼ P GN p ( µ,σ ) be a Generalized Gaussian random variable with ℓ p parameter p > 0 , location µ ∈ R , and scale σ > 0 . The Truncated Generalized Gaussian probability measure P T GN p ( µ,σ ) on the measurable space ( R , B ( R )) is defined as the conditional distribution of X given X > 0 , i.e.,
$$
$$
for any A ∈ B ( R ) , where Φ GN p (0 , 1) denotes the cumulative distribution function of the standardized Generalized Gaussian distribution.
Univariate Case - Rectified Generalized Gaussian
In Definition 3.4, we provide a probability density function (PDF) characterization of the Rectified Generalized Gaussian distribution. However, we note that the PDF presented in Definition 3.4 is not the Radon-Nikodym derivative of the Rectified Generalized Gaussian probability measure with respect to the standard Lebesgue measure over R , which we denote as λ . In Definition B.5, we provide a measure-theoretical treatment of the Rectified Generalized Gaussian distribution. We start by introducing the Dirac measure in Definition B.4.
Definition B.4 (Dirac Measure) . The Dirac measure δ x over a measurable space ( X, Σ) for a given x ∈ X is defined as for any measurable set A ⊆ X .
$$
$$
In Definition B.5, we formally introduce the Rectified Generalized Gaussian probability measure and its probability density function. Definition B.5 (Measure-Theoretical Definition of the Rectified Generalized Gaussian) . Fix parameters p > 0 , µ ∈ R , and σ > 0 . We denote ( R , B ( R )) as the real line equipped with Borel σ -algebra. Let λ be the Lebesgue measure on B ( R ) and let δ 0 be the Dirac measure at 0 presented in Definition B.4. The probability measure P X of the Rectified Generalized Gaussian random variable X is given by the mixture
$$
$$
where P T GN p ( µ,σ ) is the Truncated Generalized Gaussian probability measure in Definition B.3 and Φ GN p (0 , 1) is the CDF of the standard Generalized Gaussian GN p (0 , 1) . Define the mixed measure ν := λ + δ 0 . By Lemma B.7, the Radon-Nikodym derivative of P X with respect to ν exists and is given by
$$
$$
Proof. According to Folland (1999), if P X is a signed measure and ν is a positive measure on the same measurable space ( R , B ( R )) , then P X ≪ ν if ν ( A ) = 0 for every A ∈ B ( R ) implies P X ( A ) = 0 .
Let's consider the case of ν ( A ) = 0 . By definition, ν ( A ) = δ 0 ( A ) + λ ( A ) = 0 . Since both δ 0 and λ are non-negative measures, δ 0 ( A ) = λ ( A ) = 0 . We observe that δ 0 ( A ) = 0 implies 0 / ∈ A by the definition of the Dirac measure. Thus
$$
$$
where the first term vanishes because 0 / ∈ A . It's trivial that P T GN p ( µ,σ ) is absolutely continuous with respect to the Lebesgue measure. Since ν ( A ) = 0 = ⇒ λ ( A ) = 0 , we have λ ( A ) = 0 = ⇒ P T GN p ( µ,σ ) ( A ) = 0 . Thus
$$
$$
and we have proven the absolutely continuity result P X ≪ ν .
$$
$$
Proof. By Lemma B.6, P X ≪ ν so the Radon-Nikodym derivative d P X /dν exists and it suffices to show that for any A ⊆ B ( R ) we have
$$
$$
We start by expanding the integral with respect to a sum of measures
$$
$$
By the property of the Dirac measure, we have
$$
$$
We observe that ✶ { 0 } ( x ) = 1 and ✶ (0 , ∞ ) (0) = 0 . So we have
$$
$$
Now the second term can be expanded as
$$
$$
where the term
$$
$$
$$
$$
simply vanishes. Thus we are left with
$$
$$
By Definition B.3, the Truncated Generalized Gaussian probability measure is given by
$$
$$
(B.27)
$$
$$
Putting everything together, we arrive at
$$
$$
Thus we have proven the form of the Radon-Nikodym Derivative.
$$
$$
$$
$$
At first glance, the second term is the probability density function of the Generalized Gaussian distribution instead of its truncated version. In Corollary B.8, we provide an alternative presentation of the Rectified Generalized Gaussian distribution with explicit components of the probability density function of the Truncated Generalized Gaussian distribution.
Corollary B.8 (Equivalent Definition of Rectified Generalized Gaussian) . The probability density function of the Rectified Generalized Gaussian distribution RGN p ( µ, σ ) can also be written as
$$
$$
Proof. We can simplify the expression as
$$
$$
So we have
$$
$$
where the extra terms cancel out due to symmetry around 0 . Thus we have recovered the forms in Definition 3.4.
Expectation and Variance of the Rectified Generalized Gaussian Distribution
Proposition B.9. Let X ∼ RGN p ( µ, σ ) and sgn( µ ) ∈ {-1 , 0 , +1 } be the sign function. Let γ ( s, t ) be the lower incomplete gamma function, Γ( s, t ) be the upper incomplete gamma function, Γ( s ) be the gamma function, and P ( s, t ) = γ ( s, t ) / Γ( s ) be the lower regularized gamma function. Then
$$
$$
$$
$$
Proof. Let Z ∼ GN p ( µ, σ ) with density
Let's define the three auxiliary integrals
$$
$$
If X = ReLU( Z ) , then we know X ∼ RGN p ( µ, σ ) . Thus for any k ∈ { 1 , 2 } , we have
To simplify notations, let's denote C := p 1 -(1 /p ) / (2 σ Γ(1 /p )) , a := 1 / ( pσ p ) , and t 0 := a | µ | p = | µ | p / ( pσ p ) . Then
$$
$$
$$
$$
Define the change of variables t = z -µ . Thus we have z = t + µ and z ≥ 0 ⇐⇒ t ≥ -µ . Rewrite the integral as
$$
$$
$$
$$
$$
$$
$$
$$
Then we can rewrite (B.45) for k = 1 , 2 as
Thus we have proven the expression.
Definition B.10 (Gamma Functions) . If u ≥ 0 and b > -1 , then
$$
$$
where γ ( · , · ) and Γ( · , · ) are the lower and upper incomplete gamma functions. By definition, we also have
$$
$$
Lemma B.11 ( I 0 Integral) . The I 0 integral in Proposition B.9 is given by
$$
$$
$$
$$
Now we just need to compute I 0 , I 1 , and I 2 . By Lemma B.11, Lemma B.12, and Lemma B.13, we have
$$
$$
$$
$$
$$
$$
Similarly, the second moment is given by
$$
$$
(B.66)
Proof. If µ ≥ 0 , then -µ ≤ 0 . So we can split the integral at 0 and get:
$$
$$
Applying (B.67) with b = 0 to the first term and (B.68) with u = 0 to the second term gives us
$$
$$
$$
$$
Combining both cases, we arrive at
On [ -µ, 0] , we can substitute s = -t to get
$$
$$
$$
$$
$$
$$
$$
$$
$$
$$
$$
$$
So we have
$$
$$
$$
$$
$$
$$
Expectation and Variance of the Rectified Generalized Gaussian Distribution
Input: ℓ p parameter p > 0 , location µ ∈ R , scale σ > 0 Output: sample Y ∼ RGN p ( µ, σ ) Sample S ∼ Unif {-1 , +1 } Sample G ∼ Gamma ( shape = 1 p , rate = 1 ) Set X ← µ + σ S · ( pG ) 1 /p Set Y ← max(0 , X )
return Y
Expectation and Variance of the Rectified Generalized Gaussian Distribution
Input: ℓ p parameter p > 0 , location µ ∈ R , tolerance ε > 0
Output: scale σ ⋆ > 0 such that Var( RGN p ( µ, σ ⋆ )) ≈ 1 { Var( RGN p ( µ, σ ⋆ )) is defined in Proposition B.9. }
$$
$$
$$
$$
Choose initial bounds σ L > 0 and σ U > σ L such that f ( σ L ) < 0 , f ( σ U ) > 0
Rectified LpJEPA
$$
$$
$$
$$
Proof. If µ ≥ 0 , then -µ ≤ 0 . So we can split the integral at 0 and get:
$$
$$
Apply (B.67) with b = 2 to the first term and the full gamma integral to the second term, we have:
$$
$$
$$
$$
Combining both cases, we arrive at
$$
$$
$$
$$
Simulation Techniques for Rectified Generalized Gaussian
In Algorithm 1, we show how to sample from the Rectified Generalized Gaussian distribution.
Properties of Multivariate Generalized Gaussian, Truncated Generalized Gaussian, and Rectified Generalized Gaussian Distributions
In the following section, we present additional definitions and properties of the Multivariate Generalized Gaussian (Appendix C.1), Truncated Generalized Gaussian (Appendix C.2), and Rectified Generalized Gaussian distributions (Appendix C.3). We further derive the expected ℓ 0 norm for a Multivariate Rectified Generalized Gaussian distribution in Appendix C.4.
Multivariate Case - Multivariate Generalized Gaussian
We consider the multivariate generalization (Goodman & Kotz, 1973) as the joint distribution resulting from the product measure of independent and identically distributed (i.i.d.) Generalized Gaussian random variables, i.e. x ∼ ∏ d i =1 GN p ( µ, σ ) where x = ( x 1 , . . . , x d ) for each x i ∼ GN p ( µ, σ ) . The probability density function is given by
$$
$$
Assume that µ = 0 . Barthe et al. (2005) show that r p := ∥ x ∥ p p ∼ Γ( d/p, pσ p ) up to different notations. Also, u := x / ∥ x ∥ p follows the cone measure on the ℓ p sphere S d -1 ℓ p := { x ∈ R d |∥ x ∥ p = 1 } . It's shown that x = r · u and r ⊥ u (Barthe et al., 2005). In fact, the cone measure is identical to the ( d -1) -dimensional Hausdorff measure H d -1 (also called surface measure) when p ∈ { 1 , 2 , ∞} (Alonso-Gutierrez et al., 2018). So if A ⊆ S d -1 ℓ p , then p ( u ∈ A ) = H d -1 ( A ) / H d -1 ( S d -1 ℓ p ) .
Thus for zero-mean Laplace ( p = 1 ) and zero-mean Gaussian ( p = 2 ), the distributions of u are the uniform distributions on the simplex ∆ d -1 (or S d -1 ℓ 1 ) and the standard Euclidean unit sphere S d -1 ℓ 2 respectively.
More generally, the Multivariate Generalized Gaussian distribution (Goodman & Kotz, 1973) is a special case of the family of p -symmetric distributions (Fang et al., 1990) or L p -norm spherical distributions (Gupta & Song, 1997). The L p -norm spherical distributions has density functions of the form g ( ∥ x ∥ p p ) for g : [0 , ∞ ) → [0 , ∞ ) . If x follows the L p -norm spherical distribution, then ∥ x ∥ p and x / ∥ x ∥ p are also independent with each other.
There exist many other L p -norm spherical distributions induced by the choice of the density generator function g ( · ) like p -generalized Weibull distribution, L p -norm Pearson Type II distribution, L p -norm Pearson Type VII distribution, L p -norm multivariate t-distribution, L p -norm multivariate Cauchy distributions etc (Gupta & Song, 1997). We particularly choose the Generalized Gaussian distribution with the density generator function g ( · ) = exp( · ) for the inevitable consequences of the exponential function as the maximum entropy solutions. We show this in Lemma E.1.
Multivariate Case - Multivariate Truncated Generalized Gaussian
Let x = ( x 1 , . . . , x d ) ∼ ∏ d i =1 T GN p ( µ, σ, S ) be a Multivariate Truncated Generalized Gaussian random vector where each x i ∼ T GN p ( µ, σ, S ) . For our purposes, we only need S = [0 , ∞ ) and thus the joint support is [0 , ∞ ) d .
We observe that the angular distribution x / ∥ x ∥ p is still uniform over the support after truncation to the positive orthant [0 , ∞ ) d for p ∈ { 1 , 2 } . This is because truncation only rescales the density, which is already constant over the support. Due to the independence between ∥ x ∥ p and x / ∥ x ∥ p , the radial distribution is unchanged. Thus if x ∼ ∏ d i =1 T GN 2 . 0 (0 , σ, [0 , ∞ )) , then ∥ x ∥ 2 2 ∼ Γ( d/ 2 , 2 σ 2 ) and x / ∥ x ∥ 2 ∼ Unif( S d -1 ℓ + 2 ) where S d -1 ℓ + p := { x ∈ R d ∩ [0 , ∞ ) d |∥ x ∥ p = 1 } is the ℓ p sphere confined to the positive orthant and Unif( · ) is uniform distribution over the ℓ p sphere confined to the positive orthant.
When p = 1 . 0 , the multivariate truncated Laplace distribution ∏ d i =1 T GN 1 . 0 (0 , σ, [0 , ∞ )) reduces to the product of i.i.d exponential distribution. Thus ∥ x ∥ 1 ∼ Γ( d/ 1 , σ ) and x / ∥ x ∥ 1 is the Dirichlet distribution with all concentration parameters being 1 on the simplex ∆ d -1 , which we also denote as S d -1 ℓ + 1 (Devroye, 2006).
Multivariate Case - Multivariate Rectified Generalized Gaussian
We denote x = ( x 1 , . . . , x d ) ∼ ∏ d i =1 RGN p ( µ, σ ) as a Multivariate Rectified Generalized Gaussian random vector where each x i ∼ RGN p ( µ, σ ) . Contrary to the family of Truncated Generalized Gaussian distribution with smooth isotropic ℓ p geometry, rectification collapses most of the samples in the interior of the positive orthant into an exponentially large family of lower-dimensional faces, inducing polyhedral conic geometry. In fact, the probability of the random vector being in the interior of the positive orthant [0 , ∞ ) d is (1 -Φ GN p (0 , 1) ( -µ/σ )) d , which decays to 0 exponentially fast as d →∞ . Thus in high dimensions, most of the rectified samples concentrates on the boundary of the positive orthant cone.
Proof of Proposition~ ref{proposition:expectedl0normforrecgengauss
Proof. Let z ∼ ∏ d i =1 GN p ( µ, σ ) be a Generalized Gaussian random vector in d dimensions and x = ReLU ( z ) , or equivalently, x ∼ ∏ d i =1 RGN p ( µ, σ ) . By construction, we have independence between dimensions. Thus
$$
$$
So we have the expectation given by
$$
$$
where the CDF defined in Definition 3.1 evaluates to
$$
$$
$$
$$
Choice of $ sigma$ for Rectified Generalized Gaussian
In the following section, we investigate how we should pick the scale parameter σ for the Rectified Generalized Gaussian distribution RGN p ( µ, σ ) . We show that different choices of σ leads to different per-dimension variance (Appendix D.1), sparsity as measured by ℓ 0 metrics (Appendix D.2), and also the sparsity-performance tradeoffs (Appendix D.3). We also provide our final recommendation of σ at the end of this section.
How does $ sigma$ affect the variance?
Equation (B.2) and Equation (B.41) are the closed form expression for the variance of the Generalized Gaussian GN p ( µ, σ ) and Rectified Genealized Gaussian distributions RGN p ( µ, σ ) respectively.
To prevent feature collapses along each feature dimension, we always want non-zero variance and hence the target distribution should have non-zero variance as well. We consider two strategies of picking σ .
First, we can set σ = σ GN = Γ(1 /p ) 1 / 2 / ( p 1 /p · Γ(3 /p ) 1 / 2 ) . In this case, the variance of the Generalized Gaussian distribution is fixed to be 1 , i.e. Var( GN p ( µ, σ GN )) = 1 for any µ and p . However, the variance of the Rectified Generalized Gaussian distribution under the choice of σ GN is no longer fixed. In Figure 6, we plotted the variance of the Generalized Gaussian and the Rectified Generalized Gaussian distributions under the choice of σ GN with varying µ and p . We observe that the variance for the Generalized Gaussian distribution is indeed 1 , but the variance for the Rectified Generalized Gaussian distribution decreases as we increase p and decrease µ . In the worst case, the variance of the Rectified Gaussian distribution RGN 2 ( -3 , σ GN ) is around 0 . 0002 .
Second, we can also pick σ = σ RGN such that the variance of the Rectified Generalized Gaussian distribution is 1 , i.e. Var( RGN p ( µ, σ GN )) = 1 for any µ and p . Since the closed form expression in Equation (B.41) is complicated, we resort to using a bisection search algorithm (see Algorithm 2) to estimate σ RGN. In Figure 11a, we observe that it only takes around 30 iterations, invariant to choices of µ and p , to estimate σ RGN with bisection error below 10 -10 . We also only need to estimate σ RGN once for any µ and p .
In Figure 7, we reported the variance of the Generalized Gaussian and the Rectified Generalized Gaussian distributions if we choose σ = σ RGN. Both theoretically and empirically, the variance of the Rectified Generalized Gaussian distribution is 1 . Under the choice of σ RGN, we observe that the variance of the Generalized Gaussian distribution increases as we increase p and decreases µ . In the extreme case, the variance of the Gaussian distribution GN 2 ( -3 , σ RGN ) is around 11 . 56 .
In Figure 8, we also visualize values of σ GN and σ RGN under varying µ and p . We observe that none of the values of σ 's are extreme, and thus our sampling method (Algorithm 1) for Rectified Generalized Gaussian also won't be subjected to extreme value multiplications.
How does $ sigma$ affect the sparsity?
Intuitively, it seems desirable to pick σ RGN over σ GN because the choice of σ RGN encourages the per-dimensional variance of the features to be 1 , which is desirable as we know from the results in VICReg (Bardes et al., 2022). However, we observe that there is no simple free lunch here. Rectifications in general reduce variance by squashing negative values to be zeros, and enforcing the variance after rectifications being 1 will reduce sparsity.
In Figure 9, we report the theoretical ℓ 0 norms evaluated based on Proposition 3.5 and the empirical ℓ 0 norms computed over pretrained Rectified LpJEPA features. The choice of σ RGN leads to reduced sparsity measured by increased normalized ℓ 0 norms (1 /D ) · E [ ∥ x ∥ 0 ] both theoretically and empirically. Interestingly, we note that for the choice of σ RGN, the primary way to increase sparsity is to reduce p . If we choose σ = σ GN, then sparsity is more easily induced by decreasing µ , whereas varying p only induces small gaps in the amount of sparsity both theoretically and empirically.
Thus

Univariate Case - Generalized Gaussian


Univariate Case - Generalized Gaussian

(d)
Theoretical Variance of Rectified Generalized Gaussian.
Figure 6. Variance of Generalized Gaussian Distribution and Rectified Generalized Gaussian distributions under the choice of σ = σ GN . Top row: Variance of x ∼ GN p ( µ, σ GN ) . (a) The empirical variance of GN p ( µ, σ GN ) . (b) The theoretical variance of GN p ( µ, σ GN ) by evaluating Equation (B.2). Bottom row: Variance of x ∼ RGN p ( µ, σ GN ) . (c) The empirical variance of RGN p ( µ, σ GN ) . (d) The theoretical variance of RGN p ( µ, σ GN ) by evaluating Equation (B.41). The empirical variance in (a) and (c) are computed by i.i.d sampling 100000 samples from 32 dimensions from either GN p ( µ, σ GN ) or RGN p ( µ, σ GN ) . The per-dimension variance is estimated and we report the average variance across dimensions as a function of the mean shift value µ and the parameter p .
How does $ sigma$ affect performance?
We have already observed in Figure 9 that for the same value of µ (more specifically, µ < 0 as we're interested in sparse representations) and p , choosing σ GN always lead to sparser representations. However, we're rather more interested in whether the pareto frontier of sparsity-performance tradeoff induced by the choices of { µ, σ GN , p } can be significantly different from that of { µ, σ GN , p } . In other words, we would like to know if the choices of σ GN or σ RGN can lead to systematically better sparsity-performance tradeoffs as we vary µ and p .
In Figure 10, we show that in fact there is again no free lunch here. We report CIFAR-100 validation accuracy of pretrained Rectified LpJEPA projector representations against different mean shift values µ (Figure 10a), ℓ 1 sparsity metrics (1 /D ) · E [ ∥ z ∥ 2 1 / ∥ z ∥ 2 2 ] (Figure 10b), and ℓ 0 metrics (1 /D ) · E [ ∥ z ∥ 0 ] (Figure 10c) across varying p . In general, { µ, σ GN , p } has different sparsity patterns compared to { µ, σ RGN , p } , but the overall sparsity-performance tradeoffs are largely overlapped. Under ℓ 0 metric, we actually observe that Rectified Laplace RGN 1 ( µ, σ GN ) stands out as the setting that attains the best sparsity and accuracy tradeoff. Thus even if the choice of σ GN can lead to small variance as we show in Figure 6, we still choose σ GN as the default scale parameter for our target Rectified Generalized Gaussian distribution.

Univariate Case - Generalized Gaussian


(b) Theoretical Variance of Generalized Gaussian.
Figure 7. Variance of Generalized Gaussian Distribution and Rectified Generalized Gaussian distributions under the choice of σ = σ RGN . Top row: Variance of x ∼ GN p ( µ, σ RGN ) . (a) The empirical variance of GN p ( µ, σ RGN ) . (b) The theoretical variance of GN p ( µ, σ RGN ) by evaluating Equation (B.2). Bottom row: Variance of x ∼ RGN p ( µ, σ RGN ) . (c) The empirical variance of RGN p ( µ, σ RGN ) . (d) The theoretical variance of RGN p ( µ, σ RGN ) by evaluating Equation (B.41). The empirical variance in (a) and (c) are computed by i.i.d sampling 100000 samples from 32 dimensions from either GN p ( µ, σ RGN ) or RGN p ( µ, σ RGN ) . The per-dimension variance is estimated and we report the average variance across dimensions as a function of the mean shift value µ and the parameter p .
Maximum Differential Entropy Distributions
In the following section, we present a well-known statement for the form of the maximum-entropy probability distributions (Appendix E.1) and use the result to prove that the Multivariate Truncated Generalized Gaussian distribution is the maximum-entropy distribution under the expected ℓ p norm constraints given a fixed support (Appendix E.2). We further show that the constraint is E [ ∥ z ∥ p p ] = dσ p without truncation (Appendix E.3). In Appendix E.4 and Appendix E.5, we present the well-known corollary of product Laplace and isotropic Gaussian being the maximum-entropy distributions under expected ℓ 1 and ℓ 2 norm constraints respectively.
Derivation of Maximum Entropy Continuous Multivariate Probability Distributions under Support Constraints
Cover & Thomas (2006) provided a characterization of maximum entropy continuous univariate probability distributions. In Lemma E.1, we provide a multivariate extension of the maximum entropy probability distribution under the support set with positive Lebesgue measure.

(a)
Values of
σ
GN
across varying
µ
and
p
.


Figure 8. Values of σ GN and σ RGN under Different Choices of µ and p . (a) The values of σ GN are invariant to the mean shift value µ . (b) σ RGN changes as a function of both µ and p .
Figure 9. The theoretical and empirical normalized ℓ 0 norms under Different Choices of σ ∗ . (a) We report the theoretical ℓ 0 norms based on Proposition 3.5 for σ ∗ ∈ { σ GN , σ RGN } for varying µ and p . (b) The empirical ℓ 0 norms of pretrained Rectified LpJEPA features are measured against the theoretical ℓ 0 norms of the target Rectified Generalized Gaussian distribution RGN p ( µ, σ GN ) for varying µ and p . (c) We plot the empirical ℓ 0 norms of pretrained Rectified LpJEPA features against the theoretical ℓ 0 norms of the target Rectified Generalized Gaussian distribution RGN p ( µ, σ RGN ) for varying µ and p .
Lebesgue measure. We define r 1 , · · · , r m : S → R as measurable functions and let α 1 , · · · , α m ∈ R . Consider the optimization problem
$$
$$
$$
$$
$$
$$
$$
$$
$$
$$
We denote the set of functions that satisfies the given constraints as
$$
$$
$$
$$

Figure 10. The Sparsity-Performance Tradeoffs under Different Chocies of σ ∗ ∈ { σ GN , σ RGN } . (a) We report CIFAR-100 validation accuracy for pretrained Rectified LpJEPA projector representations under varying { µ, σ, p } . Under the same mean shift value µ , choosing σ RGN leads to better performance compared to σ GN if µ is more negative. (b) Projector accuracy is plotted against the ℓ 1 sparsity metrics measured over the pretrained Rectified LpJEPA projector representations. The gaps between σ GN and σ RGN are negligible. (c) Switching from ℓ 1 to ℓ 0 sparsity metrics, we observe the same behavior. In fact, σ GN attains minor advantages in the sparsity-performance tradeoffs, especially when p = 1 or p = 0 . 5 .
![Figure 11. Additional Results on the Choices of σ , the Location of ReLU( · ) , and the Ablations of ReLU( · ) for Rectified LpJEPA . (a) We report the bisection convergence error as a function of optimization iterations for finding the optimal σ RGN (see Appendix D). (b) We compared Rectified LpJEPA versus a version of distribution matching towards the Rectified Generalizd Gaussian distribution via the continuous mapping theorem (see Section 5.3). Rectified LpJEPA is the better design. (c) We show that Rectified LpJEPA attains the best sparsity-performance tradeoffs across various ablations of ReLU( · ) under the ℓ 1 sparsity metrics (1 /D ) · E [ ∥ z ∥ 2 1 / ∥ z ∥ 2 2 ] . See Section 5.2 for details.](2602.01456-figure_018.png)
Figure 11. Additional Results on the Choices of σ , the Location of ReLU( · ) , and the Ablations of ReLU( · ) for Rectified LpJEPA . (a) We report the bisection convergence error as a function of optimization iterations for finding the optimal σ RGN (see Appendix D). (b) We compared Rectified LpJEPA versus a version of distribution matching towards the Rectified Generalizd Gaussian distribution via the continuous mapping theorem (see Section 5.3). Rectified LpJEPA is the better design. (c) We show that Rectified LpJEPA attains the best sparsity-performance tradeoffs across various ablations of ReLU( · ) under the ℓ 1 sparsity metrics (1 /D ) · E [ ∥ z ∥ 2 1 / ∥ z ∥ 2 2 ] . See Section 5.2 for details.
Assume the set P is nonempty and that there exists λ = ( λ 1 , . . . , λ m ) ∈ R m such that
$$
$$
Then any maximizer p ⋆ of the optimization problem has the form
$$
$$

ℓ
0
Sparsity Across Transfer Tasks

1
Figure 12. Pretrained dense and sparse representations exhibits varying level of sparsity across different downstream tasks . We compare the ℓ 0 and ℓ 1 sparsity metrics for Rectified LpJEPA versus other baselines (see Appendix H) pretrained over ImageNet-100 across a variety of downstream tasks. (a) Rectified LpJEPA has varying ℓ 0 sparsity (1 /D ) · E [ ∥ z ∥ 0 ] over different datasets as we vary the mean shift value µ ∈ { 0 , -1 , -2 , -3 } . CL and VICReg always have all entries being non-zero due to the lack of explicit rectifications. (b) Under ℓ 1 sparsity metric (1 /D ) · E [ ∥ z ∥ 2 1 / ∥ z ∥ 2 2 ] , we observe varying sparsity for Rectified LpJEPA over different datasets for mean shift values ℓ 0 sparsity (1 /D ) · E [ ∥ z ∥ 0 ] . We observe that NCL in fact achieves the lowest ℓ 1 metric, but as we show in Figure 4c the most amount of variations is attained by Rectified LpJEPA.
$$
$$
where λ 0 , λ 1 , . . . , λ m ∈ R are Lagrange multipliers. Let p be a maximizer that is strictly positive almost everywhere (a.e.) over S . We denote δp as an arbitrary integrable perturbation supported on S such that p + ϵδp ≥ 0 for sufficiently small | ϵ | . Thus the Gateaux derivative of J in the direction of δp is given by
$$
$$
Thus the functional derivative is
$$
$$
Since this expression must vanish for all admissible perturbations δp , we get δ J δp = 0 almost everywhere on S . Solving for p yields
$$
$$
Absorbing the constant terms into Z S ( λ ) , we end up with
$$
$$
Maximum Entropy Distribution Under the $ ell_1$ Norm Constraint
Proof. By Lemma E.1, the target distribution has the form of
$$
$$
where we choose λ 1 = -1 pσ p which satisfies the constraint λ 1 < 0 for integration. Thus we have recovered the zero-mean Generalized Gaussian distribution with scale parameter σ . Now notice that
$$
$$
Thus we also obtain the constraint as E [ ∥ x ∥ p p ] = d dλ 1 log Z S ( λ 1 ) .
Maximum Entropy Distribution Under the $ ell_p$ Norm with Full Support
Corollary E.2. If S = R d in Proposition 3.3, then the constraint
$$
$$
and we recover the Generalized Gaussian distribution with zero mean and scale parameter σ .
$$
$$
$$
$$
$$
$$
According to Dytso et al. (2018), we know that E [ | x i | p ] = σ p . Thus
$$
$$
Maximum Entropy Distribution Under the $ ell_1$ Norm Constraint
In Corollary E.3, we show the well-known result that the maximum-entropy continuous multivariate distribution under the ℓ 1 norm constraint is the product Laplace distribution.
$$
$$
is the product of independent univariate symmetric Laplace distributions with zero mean and scale parameter b
$$
$$
Proof. By Lemma E.1, the target distribution has the form of
$$
$$
with the constraint λ 1 < 0 for integration. After normalization, we obtain
$$
$$
$$
$$
We note that the product Laplace distribution is different from the multivariate elliptical Laplace distribution presented in Kotz et al. (2012). For our purposes of identifying the maximum-entropy distribution under the expected ℓ 1 norm constraints, we should use the product Laplace distribution as the multivariate generalization of the univariate symmetric Laplace distribution.
Maximum Entropy Distribution under the $ ell_2$ Norm Constraint
In Corollary E.4, we present the well-known result that the maximum-entropy continuous multivariate distribution under the ℓ 2 norm constraint is the isotropic Gaussian distribution.
$$
$$
is the multivariate Gaussian distribution with mean µ and covariance Σ
$$
$$
When µ = 0 and Σ = I , the density function takes the form of
$$
$$
Proof. Notice that the vector-valued mean constraint and matrix-valued covariance constraint can be factorized as a collection of scalar-valued constraints
$$
$$
$$
$$
By a change of variable b = -1 /λ 1 , we arrive at
$$
$$
Rényi Information Dimension and Entropy
In the following section, we provide a self-contained review of the R´ enyi information dimension and the d ( ξ ) -dimensional entropy. The contents are based on the original paper by R´ enyi (1959). After introducing the basic concepts in Appendix F.1, we go on to derive and prove the corresponding quantities for our Rectified Generalized Distribution in Appendix F.2. In Appendix F.3, we provide an empirical estimator for the d ( ξ ) -dimensional entropy. We further show, perhaps somewhat trivially, that the d ( ξ ) -dimensional entropy is in fact equivalent to another notion of entropy where we replace the Lebesgue measure λ in standard differential entropy with the mixed measure ν := λ + δ 0 (Appendix F.4) for δ 0 being the Dirac measure in Definition B.4. In Appendix F.5, we discuss how the total correlation can be decomposed using different notions of entropy.
Definition of the $d( xi)$-dimensional Entropy
Conventionally, the differential entropy for a random variable X with distribution function P X is defined as
$$
$$
where λ is the Lebesgue measure. For a Rectified Generalized Gaussian random variable X ∼ RGN p ( µ, σ ) , the Radon-Nikodym derivative of P X with respect to λ does not exist as shown in Lemma B.6. As a result, differential entropy is ill-defined for the Rectified Generalized Gaussian distribution.
In the following section, we consider an alternative formulation known as the d ( ξ ) -dimensional entropy, where d ( ξ ) is the R´ enyi information dimension of the random variable ξ (R´ enyi, 1959). In Definition F.1, we review the basic definition of information dimension.
Definition F.1 (Information Dimension (R´ enyi, 1959)) . Consider a real-valued random variable ξ ∈ R and the discretization ξ n = (1 /n ) · [ nξ ] , where [ x ] preserves only the integral part of x . For example, [3 . 42] = 3 . Under suitable conditions, the information dimension d ( ξ ) exists and is given by
$$
$$
$$
$$
Intuitively, ξ n represents the quantization of the real-valued random variable ξ at the grid resolution of 1 /n . Thus the information dimension d ( ξ ) measures how fast the Shannon entropy grows as a result of finer and finer grind discretizations. In Definition F.2, we present the definition of the d ( ξ ) -dimensional entropy first introduced in R´ enyi (1959).
Definition F.2 ( d ( ξ ) -dimensional Entropy (R´ enyi, 1959)) . If the information dimension d ( ξ ) exists, the d ( ξ ) -dimensional entropy is defined as
$$
$$
Effectively, the d ( ξ ) -dimensional entropy measures the amount of uncertainty distributed along the d ( ξ ) continuous degrees of freedom. For discrete random variable x , the information dimension d ( x ) = 0 since it's invariant to finer discretization (R´ enyi, 1959). Thus the discrete Shannon entropy is the 0 -dimensional entropy H 0 . The continuous random variable x ′ has information dimension d ( x ′ ) = 1 and so the differential entropy is simply the 1 -dimensional entropy H 1 (R´ enyi, 1959).
In Definition F.3, we review the special case of H d ( ξ ) when the random variable ξ follows has a mixture probability measure.
$$
$$
where
$$
$$
where λ is the Lebesgue measure.
Expectation and Variance of the Rectified Generalized Gaussian Distribution
In the following section, we prove the d ( ξ ) -dimensional entropy characterization of the Multivariate Rectified Generalized Gaussian distribution presented in Theorem 3.6.
Proof. Since ξ ∼ ∏ D i =1 RGN p ( µ, σ ) where each ξ i ∼ RGN p ( µ, σ ) are i.i.d, it's trivial that
$$
$$
$$
$$
for all i by independence. In Appendix F.5, we also present an alternative interpretation of the d ( ξ ) -dimensional entropy that enables the decomposition of the joint entropy H d ( ξ ) ( ξ ) into the sums of the marginals H d ( ξ i ) ( ξ i ) under the independence assumption. Thus it suffices to prove in the univariate case. By Definition B.5 and Definition F.3, we know that the information dimension is given by
$$
$$
We observe that P 0 in Definition F.3 correspond to the Dirac measure δ 0 in Definition B.5. Thus
$$
$$
Now we can define a Bernoulli gating random variable
$$
$$
Thus by Definition F.3, the d ( ξ i ) -dimensional entropy is
$$
$$
So we have proven the expression in Theorem 3.6.
$$
$$
Empirical Estimators of the $d( xi)$-dimensional entropy
LemmaF.4 (Probability Measure Under Rectification) . Let X ∼ P X be a real-valued random variable where P X is absolutely continuous with respect to the Lebesgue measure λ , i.e. P X ≪ λ . Then the probability measure of Z := max(0 , X ) over ([0 , ∞ ) , B ([0 , ∞ ))) is
$$
$$
where δ 0 is the Dirac measure and 1 -d := P ( Z = 0) = P ( X ≤ 0) and
$$
$$
for any Borel A ⊂ (0 , ∞ ) .
Proof. Let φ : R → [0 , ∞ ) be the rectification map φ ( x ) := max(0 , x ) . Then P Z is the pushforward of P X by φ , i.e. for any Borel set B ∈ B ([0 , ∞ )) ,
$$
$$
$$
$$
For x ∈ ( -∞ , 0] , φ ( x ) = 0 . So we have
Combining these together, we arrive at
P
Z
(
B
) =
$$
$$
$$
$$
X
φ
1
∩
,
∞
0])
·
where δ 0 ( B ) is the Dirac measure in Definition B.4 that evaluates to 1 if 0 ∈ B and 0 otherwise.
Let d := P ( X > 0) = P X ((0 , ∞ )) . So trivially, 1 -d = P ( X ≤ 0) = P X (( -∞ , 0]) . By the definition of the conditional measure, we have that for any A ∈ B ( R ) ,
$$
$$
δ
(F.154)
We can write φ -1 ( B ) as
Notice that and
$$
$$
Thus we have proven the expression of the probability measure. Now if A ⊂ (0 , ∞ ) is Borel, then A ∩ (0 , ∞ ) = A and we have P X | (0 , ∞ ) ( A ) = P X ( A ) /d .
$$
$$
Proof. Following the same arguments in Lemma B.6, we know that P Z is absolutely continuous with respect to ν , i.e. P Z ≪ ν . Again, following the same arguments in Lemma B.7, we observe that for any Borel A ⊂ [0 , ∞ )
$$
$$
$$
$$
$$
$$
$$
$$
Thus we have shown that the Radon-Nikodym derivative is correct.
Consider a real-valued random variable X from some unknown distribution P X that's absolutely continuous with respect to the Lebesgue measure. Let Z = max( X, 0) be the rectified random variable. Then by Lemma F.4, the probability measure of Z can be written as
$$
$$
$$
$$
for any Borel A ⊂ (0 , ∞ ) . This is a probabilistic model that is suitable for characterizing the neural network output feature marginal distributions after rectifications. Notice that Equation (F.165) is in the form presented in Definition F.3. Thus it's valid to compute the d ( Z ) -dimensional entropy for any distribution that follows the decomposition in Equation (F.165). We also observe that the Rectified Generalized Gaussian probability measure is just a special case of Equation (F.165).
By Definition F.3, the d ( Z ) -dimensional entropy is given by
$$
$$
✶ In practice, we will have samples { z i } B i =1 from the random variable Z . We can estimate the information dimension by
$$
$$
Now we consider a subset { z i } B ′ i =1 where each z i > 0 . The differential entropy over { z i } B ′ i =1 can be computed using the m -spacing estimator (Vasicek, 1976; Learned-Miller et al., 2003)
$$
$$
where m is a spacing hyperparameter, and { z ( i ) | z (1) ≤ z (2) ≤ · · · ≤ z ( B ′ ) } B ′ i =1 are sorted samples of the original set { z i } B ′ i =1 . Putting these estimators together, the empirical d ( Z ) -dimensional entropy can be computed as
$$
$$
If we consider the multivariate case for the random vector z = ReLU( x ) , where x ∈ R D follows some unknown distribution P x and ReLU( · ) applies coordinate-wise, then in general we cannot compute the d ( z ) -dimensional entropy of the joint distribution P z both due to lack of estimators and intractable complexity.
However, we can compute the upper bound of the joint entropy by computing the sums of the marginal entropies
$$
$$
where ' ≤ ' reduces to the equality sign ' = ' if we have independence between all dimensions.
Alternative Interpretation of the $d( xi)$-dimensional Entropy
Let's denote the standard differential entropy as H λ ( X )
$$
$$
where λ is the Lebesgue measure, X ∼ P X is a real-valued random variable, and P X ≪ λ . We know from Appendix F.3 that the probability measure of Z := ReLU( X ) is defined as
$$
$$
$$
$$
In Lemma F.6, we show that this coincides with the d ( Z ) -dimensional entropy in Definition F.3.
$$
$$
Proof. We start by expanding the integral
$$
$$
By the property of the Dirac measure, we have
$$
$$
Lemma F.5 tells us that
$$
$$
$$
$$
$$
$$
By Definition F.3, the information dimension d ( Z ) = d . Notice how H 0 ( δ 0 ) = 0 . So we have
$$
$$
Generalization of the Entropy Decomposition of Total Correlation
The standard definition of total correlation for the the random vector x = ( x 1 , . . . , x D ) ∼ P x in D dimensions is
$$
$$
which only involves the joint probability measure and the product of marginal probability measures. However, when it comes to the entropy decomposition of the total correlation, we know that
$$
$$
where λ is the Lebesgue measure over R and λ ⊗ D is the Lebesgue measure over R d . For our purposes, we adopt the decomposition
$$
$$
where ν := δ 0 + λ and H ν is defined in Lemma F.6 and is equivalent to the d -dimensional entropy. So we have
$$
$$
Hilbert-Schmidt Independence Criterion
For two random variables X,Y with empirical samples x , y ∈ R B × 1 , the empirical Hilbert-Schmidt Independence Criterion (HSIC) is given by
$$
$$
where K ij = k ( x i , x j ) , L ij = l ( x i , x j ) for some kernels k, l , H := I -(1 /B ) · 11 ⊤ is the centering matrix, and Tr( · ) is the trace operator (Gretton et al., 2005). We denote the normalized HSIC as
$$
$$
Both HSIC and nHSIC capture nonlinear dependencies beyond second-order statistics and thus serve as a proxy for measuring statistical independence beyond inspecting the covariance matrix.
For a feature random vector z ∈ R d , we can obtain the nHSIC matrix by computing all pair-wise nHSIC( z i , z j ) . In our experiments, we report the average of the off-diagonals of the nHSIC matrix in Figure 4b. Following Mialon et al. (2022), we pick the Gaussian kernel where the bandwidth parameter σ is the median of pairwise ℓ 2 distances between samples. Due to the presence of rectifications, we set σ to be the standard deviation of the positive activations.
Baseline Designs
CL and NCL . We denote SimCLR (Chen et al., 2020) as CL. Non-Negative Contrastive Learning (NCL) (Wang et al., 2024) essentially applies the SimCLR loss over rectified features and thus is a sparse variant of contrastive learning.
VICReg and NVICReg . VICReg (Bardes et al., 2022) minimizes the ℓ 2 distance between the features of semantically related views while regularizing the empirical feature covariance matrix towards scalar times identity γ · I . We design a sparse version of VICReg, which we call Non-Negative VICReg (NVICReg), that applies the same VICReg loss over rectified features.
ReLU and RepReLU . Let z ∈ R . NCL (Wang et al., 2024) adopts a reparameterization of the standard rectified non-linearity as
$$
$$
where detach() blocks gradient flow. The RepReLU( · ) is equivalent to ReLU( · ) in the forward pass but allows gradient for negative entries. For NCL and NVICReg, we use NCL-ReLU and NVICReg-ReLU to denote usage of ReLU( · ) and NCL-RepReLU and NVICReg-RepReLU when using RepReLU( · ) .
For our Rectified LpJEPA, we note that we just use ReLU( · ) to avoid extra hyperparameter tuning. We defer detailed investigations on the activation functions to future work.
LpJEPA and LeJEPA . Rectified LpJEPA regularizes rectified features towards Rectified Generalized Gaussian distributions. We also design LpJEPA, which regularize non-rectified features towards the Generalized Gaussian distributions using the same projections-based distribution matching loss. When p = 2 , LpJEPA reduces to LeJEPA (Balestriero & LeCun, 2025), since GN 2 ( µ, σ ) = N ( µ, σ 2 ) . For 0 < p ≤ 1 , LpJEPA is still penalizing the ∥ · ∥ p p norms of the features and thus serves as another set of sparse baselines even though all entries are non-zero.
Non-Negative VCReg Recovery
The Cramer-Wold device provides asymptotic guarantees for feature distribution to match the Rectified Generalized Gaussian distribution ∏ d i =1 RGN p ( µ, σ ) with i.i.d coordinates and thus no higher-order dependencies across dimensions. Prior work such as VCReg (Bardes et al., 2022) demonstrates that explicitly removing second-order dependencies via covariance regularization is already sufficient to prevent representational collapse in practice. This motivates us to investigate whether RDMReg likewise controls second-order dependencies more explicitly.
In Proposition I.1, we show that matching feature distributions to the Rectified Generalized Gaussian distribution recovers a form of Non-Negative VCReg with only a linear number of random projections in the feature dimension.
Proposition I.1 (Implicit Regularization of Second-Order Statistics) . Let z ∈ R d be a neural network feature random vector with covariance matrix Cov[ z ] = Σ . We denote the eigendecomposition as Σ = UΛU ⊤ with the set of eigenvectors being { u i } d i =1 . Let y ∼ ∏ d i =1 RGN p ( µ, σ ) be the Rectified Generalized Gaussian random vector and define γ := Var[ RGN p ( µ, σ )] ∈ (0 , ∞ ) . If u ⊤ i z d = u ⊤ i y for all i ∈ { 1 , . . . , d } , then Σ = γ · I d .
Proof. Let Σ = Cov[ z ] and let Σ = UΛU ⊤ be its eigendecomposition, where U = [ u 1 , . . . , u d ] is orthonormal and Λ = diag( λ 1 , . . . , λ d ) . Since y ∼ ∏ d i =1 RGN p ( µ, σ ) has i.i.d. coordinates with variance γ := Var[ RGN p ( µ, σ )] , its covariance satisfies Cov[ y ] = γ I d . Hence, for any vector u i such that ∥ u i ∥ 2 = 1 ,
$$
$$
By the assumption u ⊤ i z d = u ⊤ i y for all i , the variances of the one-dimensional projected marginals are equal, i.e.
$$
$$
$$
$$
where λ i is the i -th eigenvalue of Σ . Therefore λ i = γ for all i , so Λ = γ I d .
Substituting back into the eigendecomposition yields
$$
$$
which is a scalar multiple of the identity. Hence all off-diagonal entries of Σ vanish and the covariance matrix is isotropic.
Thus we have shown that by sampling eigenvectors { u i } d i =1 , we can explicitly control the covariance matrix of neural network features. In practice, we always have B ≪ D . Thus truncated SVD has O ( B 2 D ) complexity using dense methods (Golub & Van Loan, 2013), or O ( BDk ) when computing only the topk eigenvectors via Lanczos or randomized methods (Parlett, 1998; Halko et al., 2011).
Since our feature matrix Z ∈ R B × D is obtained after rectifications, RDMReg thus recovers a form of Non-Negative VCReg, where we regularize non-negative neural network features to have isotropic covariance matrix. In fact, we can also view this as a non-negative matrix factorization (NMF) (Lee & Seung, 2000):
$$
$$
where ˜ Z := (1 / √ B -1) · Z is always non-negative. Non-Negative Contrastive Learning (NCL) (Wang et al., 2024) shows that applying SimCLR loss over rectified features recovers a form of NMF over the rescaled variant of the Gram matrix. Based on the Gram-Covariance matrix duality between contrastive and non-contrastive learning (Garrido et al., 2022), we observe a similar duality in NMF and defer the detailed investigations to future work.
Additional Experimental Results
In the following section, we include additional experimental results for evaluating our Rectified LpJEPA methods.
On the other hand, for each eigenvector u i ,
Table 2. Linear Probe Results on CIFAR-100. Acc1 (%) is higher-is-better ( ↑ ); sparsity is lower-is-better ( ↓ ). Bold denotes best and underline denotes second-best in each column (ties allowed).
Linear Probe over CIFAR-100
In Table 2, we report linear probe performances of Rectified LpJEPA and other dense and sparse baselines over CIFAR-100. Rectified LpJEPA achieves competitive sparsity-performance tradeoffs.
Ablations on Projector Dimensions
In Table 9, we compare VICReg, LeJEPA, and Rectified LpJEPA with varying projector dimensions. We observe that Rectified LpJEPA consistently attains competitive or better performances.
Rectified LpJEPA with ViT Backbones
We evaluate whether the strong performance of Rectified LpJEPA with a ResNet backbone shown in Table 1 generalizes across encoder architectures. As shown in Table 10, Rectified LpJEPA remains competitive when instantiated with a ViT encoder.
Additional Results on Eigenvectors
We would like to know whether incoporating the eigenvectors of the empirical feature covariance matrices into the projection directions for RDMReg can lead to faster convergence and directly removes second-order dependencies. To this end, we pretrain Rectified LpJEPA with RDMReg and log the variance and covariance loss defined in VICReg (Bardes et al., 2022).
The variance loss computes the ℓ 2 distance between the diagonal of the empirical feature covariance matrix and the theoretical variance of the Rectified Generalized Gaussian distribution as we derived in Proposition B.9. The covariance loss is simply the sum of the off-diagonal entries of the empirical feature covariance matrix scaled by 1 /D , where D is the feature dimension. We emphasize that we don't incorporate the variance and covariance losses into optimizations but only use them as evaluation metrics.
In Figure 14, we show that incorporating eigenvectors indeed leads to faster convergence and better performance and also signficant reductions in the variance and covariance loss. This further validates our observations on the Non-Negative VCReg recovery (Proposition I.1) of the RDMReg loss as we observe significant reductions in second-order dependencies.
Additional Results on Transfer Sparsity
In Figure 12, we plot the ℓ 0 and ℓ 1 sparsity metrics for baselines and Rectified LpJEPA across different downstream transfer tasks. We observe that Rectified LpJEPA exhibits larger variations in the sparsity values across datasets, indicating that sparsity can be used as a crude proxy for whether the task at hand is within the training distribution.
In Figure 19, we further probe whether the sparsity metric can be used as a signal for whether groups of inputs are correctly or incorrectly classified by the model. We observe that this is partially true when the inputs are from the pretraining dataset. The distribution of the ℓ 1 sparsity metrics is distinct between correctly and incorrectly classified examples. The divergence is less prominent for downstream transfer tasks, and we defer further investigations to future work.

Figure 13. Additional results on the sparsity-performance tradeoffs, the correlation between different sparsity metrics, and the effect of numbers of random projections on performance . (a) We present another version of Figure 3c where the sparsity metric is switched from ℓ 0 to ℓ 1 . Again, we observe the same Pareto frontier with a sharp drop in performance under extreme sparsity. (b) Across different backbones, Rectified LpJEPAs with the target distribution being the Rectified Laplace distributions learn sparser representations as we decrease the mean shift value µ . Specifically, we observe that the ℓ 0 and ℓ 1 metrics are quite correlated. Thus both metrics are effective in measuring sparsity. (c) We test Rectified LpJEPA models with batch size B = 128 and varying feature dimension D as long as numbers of projections. As we increase the dimensions D , we observe that the number of random projections required for good performance do not grow and are small relative to D . Hence our Rectified LpJEPA is quite robust to the growth in feature dimensions in terms of sampling efficiency.
Qualitative Analysis of Rectified LpJEPA
In the following section, we present qualitative analyses of Rectified LpJEPA models and baseline models pretrained over ImageNet-100. We use the target distribution RGN p ( µ, σ GN ) to denote Rectified LpJEPA with hyperparameters { µ, σ GN , p } . In Appendix K.1, we visualize nearest-neighbor retrieval of selected exemplar images in representation space. We present additional visual attribution maps in Appendix K.2.
$k$-Nearest Neighbors Visualizations
For a selected exemplar image, we retrieve its topk nearest neighbors from the ImageNet-100 validation set using cosine similarity over frozen projector features. Retrieved neighbors are outlined in green if their labels match the exemplar's class and in red otherwise.
In Figure 15, we visualize the top7 nearest neighbors of a pirate ship exemplar for Rectified LpJEPA with p = 1 under varying mean-shift values µ , alongside dense and sparse baseline models. Despite substantial variation in feature sparsity induced by µ , Rectified LpJEPA consistently retrieves semantically coherent neighbors: across all settings, the retrieved images belong exclusively to the pirate ship class. Combined with the competitive linear-probe performance reported in Table 1, these qualitative results suggest that Rectified LpJEPA preserves semantic structure even in highly sparse regimes.
We next consider a more challenging exemplar depicting a tabby cat in the foreground against a laptop background. As shown in Figure 16, dense baselines such as SimCLR (denoted CL for brevity) retrieve a mixture of cat and laptop images, indicating label-agnostic encodings that capture multiple objects present in the scene. In contrast, both sparse baselines and Rectified LpJEPA predominantly retrieve images of cats, except in the extreme sparsity setting of Rectified LpJEPA with µ = -3 . In this regime, the retrieved neighbors consist almost exclusively of laptop images. This behavior suggests that under extreme sparsity, either information about the cat is lost or the background laptop features dominate over the cat features.
To distinguish between these possibilities, we further probe Rectified LpJEPA with µ = -3 by cropping the exemplar image to retain only the cat foreground. As shown in Figure 17, once the background is removed, Rectified LpJEPA with µ = -3 consistently retrieves images of cats. This shows that even under extreme sparsity, Rectified LpJEPA still preserves information instead of being a lossy compression of the input, and we hypothesize that retrieval of solely laptop images in Figure 16 is due to competitions between features in the scene rather than information loss.
Visual Attribution Maps
To further support the above observations, we visualize attribution maps for the tabby cat exemplar and its cropped variant using Grad-CAM-style heatmaps (Selvaraju et al., 2019). Specifically, we backpropagate a scalar score derived from the representation-the squared ℓ 2 norm of the projector feature-to a late backbone layer, and compute a weighted combination of activations that is overlaid on
Table 3. 1-shot linear probe accuracy (%) using encoder features. All results are in the 1-shot setting. Avg. denotes the mean across datasets.
the input image.
As shown in Figure 18, when the background laptop is removed, all models concentrate their attributions on the cat, consistent with the cat-only retrieval behavior observed in Figure 17. For the full image containing both the cat and the laptop, attributions are more spatially spread-out. Notably, Rectified LpJEPA with µ = -3 places a large fraction of its attribution mass on the background, aligning with its tendency to retrieve laptop images in Figure 16.
Taken together, these results demonstrate that even under extreme sparsity, Rectified LpJEPA performs task-agnostic encoding without discarding information. This behavior aligns with our objective of learning sparse yet maximum-entropy representations, since maximizing entropy encourages preserving as much information about the input as possible while remaining agnostic to downstream tasks.
Implementation Details
Pretraining data and setup
We conduct self-supervised pretraining on ImageNet-100 using a ResNet-50 encoder. Unless otherwise specified, all methods are trained with identical data, architecture, optimizer, and augmentation pipelines to ensure fair comparison.
Architecture
The encoder is followed by a 3-layer MLP projector with hidden and output dimension 2048. For Rectified LpJEPA variants, denoted RGN p ( µ, σ GN ) , we append a final ReLU( · ) to the projector output to enforce explicit rectifications in the representation space. When p = 1 , our target distribution is Rectified Laplace. For p = 2 , the RDMReg loss is matching to the Rectified Gaussian distribution.
Data augmentation
Following the standard protocol in da Costa et al. (2022), we generate two stochastic views per image using: random resized crop (scale in [0 . 2 , 1 . 0] , output resolution 224 × 224 ), random horizontal flip ( p = 0 . 5 ), color jitter ( p = 0 . 8 ; brightness 0 . 4 , contrast 0 . 4 , saturation 0 . 2 , hue 0 . 1 ), random grayscale ( p = 0 . 2 ), Gaussian blur ( p = 0 . 5 ), and solarization ( p = 0 . 1 ).
Optimization
We pretrain for 1000 epochs using LARS optimizer (You et al., 2017) with a warmup+cosine learning rate schedule (10 warmup epochs). Unless otherwise specified, we use batch size 128, learning rate 0 . 0825 for the encoder, learning rate 0 . 0275 for the linear classifier, momentum 0 . 9 , and weight decay 10 -4 . All ImageNet-100 experiments are run on a single node with a single NVIDIA L40S GPU.
Table 4. 1-shot linear probe accuracy (%) using projector features. All results are in the 1-shot setting. Avg. denotes the mean across datasets.
Distribution matching objective
For Rectified LpJEPA, we set the invariance weight to λ sim = 25 . 0 and the RDMReg loss weight to λ dist = 125 . 0 . We perform distribution-matching using the sliced 2-Wasserstein distance (SWD) with 8192 random projections per iteration.
Compute and runtime
A full 1000-epoch ImageNet-100 pretraining run takes approximately 2d 7h wall-clock time on a single NVIDIA L40S GPU. To speed-up training, we pre-load the entire ImageNet-100 dataset to CPU memory to avoid additional I/O costs. We find that this significantly improves GPU utilization and also minimizes the communication time. However, we're not able to do this for larger-scale datasets.
Transfer evaluation protocol
We evaluate transfer performance with frozen-feature linear probing on six downstream datasets: DTD, CIFAR-10, CIFAR-100, Flowers102, Food-101, and Oxford-IIIT Pets. For each pretrained checkpoint, we freeze the encoder (and projector when applicable) and train a single linear classifier on top of both encoder features and projector features.
We report three label regimes: 1% (1-shot), 10% (10-shot), and 100% (All-shot) of the labeled training data. The linear probe is trained for 100 epochs using Adam with learning rate 10 -2 , batch size 512, and no weight decay. Evaluation inputs are resized to 256 on the shorter side, then center-cropped to 224 × 224 and normalized with ImageNet statistics; we apply no data augmentation during probing.
Reproducibility.
All ImageNet-100 pretraining results are reported from a single run (seed 5 ), trained with mixed-precision ( 16 -mixed) on a single NVIDIA L40S GPU (one node, one GPU; no distributed training).
Continuous mapping theorem evaluation (post-hoc ReLU probes).
For the continuous mapping theorem ablation (Figure 11b), we evaluate pretrained checkpoints by extracting frozen encoder/projector features and optionally applying a post-hoc rectification ReLU( · ) at evaluation time. We report (i) sparsity statistics computed on the validation features before and after rectification, and (ii) linear probe accuracy when training on dense (pre-ReLU) features versus rectified (post-ReLU) features.
Linear probes are trained for 100 epochs using SGD (momentum 0 . 9 ) with a cosine learning-rate schedule, learning rate 10 -2 , batch size 512, and weight decay 10 -6 . We use the same deterministic evaluation preprocessing described in Appendix L.7.
Table 5. 10-shot linear probe accuracy (%) using encoder features. All results are in the 10-shot setting. Avg. denotes the mean across datasets.
Vision Transformer (ViT) experiments
For the ViT results in Table 10, we use a ViT-Small backbone ( vit small ). The encoder is followed by a 3-layer MLP projector with hidden and output dimension 2048 (i.e., a 2048 -2048 -2048 MLP), and we apply a final ReLU( · ) to the projector output. We pretrain on ImageNet-100 for 1000 epochs using AdamW (batch size 128, learning rate 5 × 10 -4 , weight decay 10 -4 ) under the same augmentation pipeline described above. Other hypers are the same as used for ResNet-50 Experiments. A full 1000-epoch ImageNet-100 pretraining run for ViT takes approximately 2d 6h wall-clock time. Unless otherwise specified, we use mixed-precision (16-mixed) training on a single NVIDIA L40S GPU.
Table 6. 10-shot linear probe accuracy (%) using projector features. All results are in the 10-shot setting. Avg. denotes the mean across datasets.

Figure 14. Incorporating eigenvectors into random projections accelerates implicit VCReg loss minimization and speed-up convergence . We pretrain Rectified LpJEPA over CIFAR-100 with target distributions RGN 1 ( µ, σ GN ) where the mean shift value µ ∈ {-1 , 0 , 1 } . We consider three settings of selecting the projection vectors { c i } N i =1 where we set N = 8192 . We denote the setting as 'Rand' if all c i are uniformly sampled from the unit ℓ 2 sphere. Since in our setting the batch size is always smaller than the feature dimension, i.e. B < D , we call the setting 'Rand + Full Eig' if we select the topB eigenvectors and mix them with N -B random projection vectors. We also consider 'Rand + Bottom Eig' by sampling the bottom half B/ 2 eigenvectors and mixing with N -B/ 2 random projections. (a) (d) (g) We evaluate the variance loss between the three settings across varying µ . Incorporating all eigenvectors puts overly strong constraints, whereas penalizing the bottom half eigenvectors gives us good performances. (b) (e) (h) the covariance losses are evaluated during training and plotted for different projection vector settings across µ . Regularizing all topB eigenvector projections leads to significant implicit covariance loss minimization, where the covariance loss is the average of the off-diagonal of the empirical covariance matrix. (c) (f) (i) We report the projector accuracy against epoch for different projection settings. Using eigenvectors leads to both faster convergence and ultimately better performance.
Table 9. Projector Dimension Ablation on ImageNet-100. For each method and projector dimension, we report the best linear probe accuracy (%) over learning rates { 0 . 3 , 0 . 03 } . Encoder and projector accuracies are measured using frozen features. Bold indicates the best value within each projector-dimension block.
Table 10. RGN 1 . 0 ( µ, σ GN ) Mean-Shift Sweep (ViT) with Baselines. Wereport encoder Acc1 (val acc1), projector Acc1 (val proj acc1), and sparsity metrics. Bold indicates best in a column; underline indicates second best. For sparsity columns, lower is better (more sparse).

Figure 15. Nearest neighbors in feature space (ImageNet synset; unambiguous class). Topk cosine nearest neighbors in the projector space for a query labeled as pirate ship. Both dense and sparse methods retrieve pirate ships consistently, illustrating that even at high sparsity our models can preserve semantic consistency when the query is unambiguous.

Figure 16. Nearest neighbors in feature space (ImageNet synset; full scene). Topk cosine nearest neighbors in the projector space for a query labeled as tabby cat (n02123045) that contains both the cat and a salient laptop/desk context. Dense methods (e.g., SimCLR) can return a mixture of cat and laptop/desk neighbors. In contrast, highly sparse RGN 1 . 0 ( µ, σ GN ) variants tend to commit to a single factor: at MSV = -2 neighbors are predominantly tabby cats, while at MSV = -3 neighbors flip to predominantly laptop/desk images.

Figure 17. Nearest neighbors in feature space (probe crop). Topk cosine nearest neighbors in the projector space for a zoomed-in query that isolates the cat from Figure 16 by removing the competing laptop/desk cues. In this less ambiguous setting, neighbors remain tabby cats across both dense and highly sparse methods, including the most sparse RGN 1 . 0 ( µ, σ GN ) variants.

Figure 18. Representation-focused attribution across methods. Grad-CAM-style attribution maps computed on the projector representation for two views of the same scene (a tabby cat lying on a laptop). Rows compare dense baselines (SimCLR, VICReg, LeJEPA), sparse baselines (RepReLU variants), and our RGN p ( µ, σ GN ) family (where p = 1 . 0 corresponds to Laplace and µ is the mean-shift value, MSV) at increasing mean-shift values, which induce increasing sparsity.

Figure 19. Distribution of representation sparsity in the transfer setting. Violin plots showing the distribution of output-feature ℓ 1 sparsity for correctly (green) and incorrectly (red) classified samples across datasets and methods. All models are pretrained on ImageNet-100 and evaluated using a full-shot linear-probe transfer setup, where a linear classifier is trained on frozen representations using 100% of each downstream dataset. Sparsity is computed per sample from the frozen output features.
| Encoder Acc1 ↑ | Projector Acc1 ↑ | L1 Sparsity ↓ | L0 Sparsity ↓ | ||
|---|---|---|---|---|---|
| Rectified LpJEPA | RGN 1 . 0 (0 ,σ GN ) | 84.72 | 80.4 | 0.2726 | 0.694 |
| RGN 2 . 0 (0 ,σ GN ) | 85.08 | 80 | 0.3412 | 0.7298 | |
| RGN 1 . 0 (0 . 25 ,σ GN ) | 84.98 | 80.76 | 0.3745 | 0.7437 | |
| RGN 2 . 0 (1 . 0 ,σ GN ) | 85.08 | 80.54 | 0.6278 | 0.8668 | |
| RGN 2 . 0 ( - 2 . 5 ,σ GN ) | 82.02 | 67.82 | 0.0137 | 0.0224 | |
| RGN 1 . 0 ( - 3 . 0 ,σ GN ) | 82.72 | 71.88 | 0.0058 | 0.0098 | |
| Sparse Baselines | NVICReg-ReLU | 84.48 | 77.74 | 0.5207 | 0.7117 |
| Sparse Baselines | NCL-ReLU | 82.58 | 76.88 | 0.0037 | 0.0085 |
| Sparse Baselines | NVICReg-RepReLU | 84.2 | 78.18 | 0.4965 | 0.7549 |
| Sparse Baselines | NCL-RepReLU | 82.76 | 76.7 | 0.0024 | 0.0048 |
| Dense Baselines | VICReg | 84.18 | 78.88 | 0.7954 | 1 |
| Dense Baselines | SimCLR | 83.44 | 77.9 | 0.6338 | 1 |
| Dense Baselines | LeJEPA | 84.8 | 79.52 | 0.6365 | 1 |
| Encoder Acc1 ↑ | Projector Acc1 ↑ | L1 Sparsity ↓ | L0 Sparsity ↓ | ||
|---|---|---|---|---|---|
| Rectified LpJEPA | RGN 2 . 0 (0 ,σ GN ) | 66 . 29 | 62 . 15 | 0 . 3773 | 0 . 7357 |
| Rectified LpJEPA | RGN 1 . 0 (0 ,σ GN ) | 65 . 97 | 62 . 22 | 0 . 3019 | 0 . 6474 |
| Rectified LpJEPA | RGN 0 . 75 (0 ,σ GN ) | 65 . 78 | 62 . 80 | 0 . 2583 | 0 . 6099 |
| Rectified LpJEPA | RGN 0 . 50 (0 ,σ GN ) | 66 . 10 | 62 . 74 | 0 . 1996 | 0 . 5727 |
| Rectified LpJEPA | RGN 1 . 0 ( - 2 ,σ GN ) | 64 . 75 | 59 . 08 | 0 . 0236 | 0 . 0489 |
| Sparse Baselines | NCL-ReLU | 66 . 23 | 61 . 88 | 0 . 0228 | 0 . 0503 |
| Sparse Baselines | NVICReg-ReLU | 63 . 76 | 58 . 82 | 0 . 7415 | 0 . 8935 |
| Sparse Baselines | NCL-RepReLU | 66.32 | 61 . 40 | 0.0202 | 0.0426 |
| Sparse Baselines | NVICReg-RepReLU | 63 . 83 | 58 . 53 | 0 . 1551 | 0 . 2657 |
| Dense Baselines | SimCLR | 66 . 00 | 61 . 95 | 0 . 6364 | 1 . 0000 |
| Dense Baselines | VICReg | 63 . 78 | 58 . 82 | 0 . 8660 | 1 . 0000 |
| Dense Baselines | LeJEPA | 65 . 65 | 62 . 69 | 0 . 6379 | 1 . 0000 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 11.86 | 65.62 | 24.95 | 5.09 | 24.82 | 29.74 | 27.01 |
| Non-negative SimCLR | 11.17 | 67.67 | 24.23 | 5.59 | 24.44 | 19.71 | 25.47 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 9.89 | 68.31 | 26.27 | 4.21 | 25.98 | 17.66 | 25.39 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 9.47 | 67.19 | 23.58 | 4.85 | 24.93 | 16.60 | 24.44 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 12.66 | 66.36 | 23.91 | 10.18 | 24.88 | 25.18 | 27.20 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 9.31 | 50.39 | 8.35 | 6.13 | 10.63 | 21.72 | 17.76 |
| RGN 2 . 0 (0 ,σ GN ) | 14.04 | 69.47 | 24.09 | 4.86 | 25.68 | 24.34 | 27.08 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 12.82 | 65.97 | 24.12 | 4.91 | 24.44 | 20.20 | 25.41 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 13.88 | 59.62 | 15.88 | 7.84 | 16.21 | 24.97 | 23.07 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 11.06 | 55.71 | 11.42 | 8.18 | 12.92 | 12.13 | 18.57 |
| Dense baselines | |||||||
| VICReg | 10.85 | 63.98 | 21.52 | 6.23 | 25.29 | 23.33 | 25.20 |
| SimCLR | 12.82 | 66.90 | 23.93 | 10.60 | 24.99 | 24.12 | 27.23 |
| LeJEPA | 15.85 | 68.07 | 24.57 | 5.79 | 23.72 | 24.34 | 27.06 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 8.62 | 33.47 | 7.51 | 3.25 | 8.78 | 17.28 | 13.15 |
| Non-negative SimCLR | 6.70 | 41.71 | 9.24 | 2.91 | 9.17 | 15.37 | 14.18 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 6.81 | 48.96 | 11.46 | 2.03 | 12.09 | 16.38 | 16.29 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 7.82 | 47.14 | 11.20 | 2.75 | 11.43 | 14.94 | 15.88 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 10.37 | 38.36 | 9.13 | 5.56 | 9.53 | 20.93 | 15.65 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 5.69 | 32.25 | 5.16 | 2.70 | 5.76 | 19.02 | 11.76 |
| RGN 2 . 0 (0 ,σ GN ) | 10.59 | 47.74 | 10.61 | 2.70 | 11.46 | 20.50 | 17.27 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 9.84 | 46.97 | 12.33 | 3.07 | 11.70 | 19.38 | 17.21 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 9.57 | 35.08 | 6.59 | 4.49 | 7.41 | 20.69 | 13.97 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 9.20 | 18.36 | 2.64 | 2.33 | 3.64 | 7.14 | 7.22 |
| Dense baselines | |||||||
| VICReg | 8.09 | 39.35 | 6.74 | 3.04 | 9.34 | 19.49 | 14.34 |
| SimCLR | 12.39 | 48.05 | 11.52 | 7.24 | 14.16 | 21.34 | 19.12 |
| LeJEPA | 8.30 | 48.17 | 11.27 | 4.47 | 11.84 | 18.32 | 17.06 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 37.23 | 77.88 | 47.32 | 31.16 | 48.33 | 55.49 | 49.57 |
| Non-negative SimCLR | 40.11 | 79.11 | 50.35 | 23.96 | 50.60 | 50.37 | 49.08 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 41.91 | 79.58 | 49.67 | 24.93 | 49.97 | 55.63 | 50.28 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 38.19 | 78.98 | 48.76 | 32.87 | 48.51 | 55.41 | 50.45 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 40.32 | 79.25 | 45.86 | 22.72 | 46.51 | 56.91 | 48.60 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 19.63 | 66.93 | 24.45 | 8.25 | 26.12 | 32.76 | 29.69 |
| RGN 2 . 0 (0 ,σ GN ) | 40.90 | 80.15 | 50.69 | 29.76 | 50.34 | 53.07 | 50.82 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 39.63 | 79.00 | 47.51 | 26.85 | 47.39 | 55.60 | 49.33 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 30.80 | 73.88 | 34.61 | 12.03 | 35.56 | 44.92 | 38.63 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 15.27 | 68.30 | 24.84 | 7.64 | 29.79 | 26.63 | 28.74 |
| Dense baselines | |||||||
| VICReg | 38.35 | 76.25 | 45.63 | 26.52 | 48.30 | 52.19 | 47.88 |
| SimCLR | 41.70 | 77.88 | 46.87 | 31.66 | 49.27 | 49.74 | 49.52 |
| LeJEPA | 38.67 | 79.06 | 49.02 | 30.18 | 49.34 | 53.88 | 50.03 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 22.50 | 43.24 | 10.56 | 12.86 | 12.65 | 32.92 | 22.46 |
| Non-negative SimCLR | 23.09 | 48.38 | 17.73 | 8.72 | 14.97 | 32.87 | 24.29 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 24.47 | 58.06 | 22.40 | 9.97 | 19.73 | 40.17 | 29.13 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 25.90 | 57.61 | 22.39 | 11.40 | 21.24 | 38.18 | 29.45 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 23.83 | 44.69 | 15.32 | 9.19 | 14.65 | 33.31 | 23.50 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 18.19 | 36.39 | 9.03 | 7.46 | 8.67 | 26.17 | 17.65 |
| RGN 2 . 0 (0 ,σ GN ) | 22.82 | 57.93 | 21.26 | 12.07 | 19.03 | 36.90 | 28.33 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 27.39 | 57.32 | 24.07 | 13.06 | 21.54 | 40.01 | 30.57 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 22.23 | 40.83 | 12.58 | 8.93 | 12.01 | 29.38 | 20.99 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 9.89 | 23.81 | 5.03 | 6.02 | 6.15 | 10.08 | 10.16 |
| Dense baselines | |||||||
| VICReg | 23.09 | 41.20 | 10.24 | 10.60 | 13.55 | 32.49 | 21.86 |
| SimCLR | 33.09 | 59.56 | 25.91 | 15.68 | 28.61 | 42.00 | 34.14 |
| LeJEPA | 24.95 | 55.76 | 19.55 | 10.72 | 17.17 | 38.98 | 27.85 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 63.56 | 82.68 | 60.40 | 82.55 | 60.39 | 78.63 | 71.37 |
| Non-negative SimCLR | 64.68 | 84.41 | 62.95 | 84.97 | 63.32 | 76.56 | 72.82 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 65.05 | 84.50 | 62.73 | 83.75 | 62.65 | 78.00 | 72.78 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 64.68 | 83.97 | 60.75 | 81.77 | 60.25 | 77.68 | 71.52 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 63.67 | 85.31 | 58.11 | 81.62 | 58.39 | 77.38 | 70.75 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 49.04 | 79.11 | 43.94 | 44.72 | 46.22 | 61.30 | 54.06 |
| RGN 2 . 0 (0 ,σ GN ) | 64.52 | 85.06 | 64.51 | 84.09 | 62.35 | 78.25 | 73.13 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 64.26 | 84.21 | 59.90 | 81.59 | 59.47 | 77.65 | 71.18 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 57.87 | 82.19 | 49.81 | 69.38 | 51.90 | 68.93 | 63.35 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 53.35 | 78.95 | 45.32 | 57.47 | 46.32 | 52.22 | 55.61 |
| Dense baselines | |||||||
| VICReg | 62.77 | 81.47 | 59.23 | 80.96 | 60.44 | 77.11 | 70.33 |
| SimCLR | 64.73 | 82.49 | 60.10 | 80.47 | 61.18 | 71.44 | 70.07 |
| LeJEPA | 63.19 | 83.54 | 62.38 | 83.07 | 61.14 | 78.30 | 71.94 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 35.80 | 47.70 | 14.90 | 32.04 | 15.74 | 44.84 | 31.83 |
| Non-negative SimCLR | 36.86 | 50.24 | 21.92 | 25.57 | 17.70 | 47.02 | 33.22 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 41.49 | 63.14 | 29.46 | 39.96 | 26.70 | 52.55 | 42.22 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 42.82 | 63.05 | 32.96 | 39.68 | 28.98 | 55.08 | 43.76 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 39.84 | 48.69 | 19.47 | 29.65 | 18.21 | 44.51 | 33.39 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 28.72 | 37.47 | 11.13 | 12.65 | 9.61 | 36.09 | 22.61 |
| RGN 2 . 0 (0 ,σ GN ) | 41.81 | 62.82 | 29.31 | 37.32 | 24.80 | 53.18 | 41.54 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 45.00 | 63.85 | 34.13 | 42.72 | 30.30 | 56.09 | 45.35 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 34.52 | 43.39 | 15.19 | 23.34 | 14.44 | 40.39 | 28.55 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 21.38 | 24.92 | 6.58 | 7.94 | 7.20 | 16.63 | 14.11 |
| Dense baselines | |||||||
| VICReg | 37.55 | 44.84 | 13.37 | 34.23 | 17.03 | 43.45 | 31.75 |
| SimCLR | 51.65 | 63.71 | 35.10 | 45.73 | 36.06 | 57.78 | 48.34 |
| LeJEPA | 40.05 | 56.81 | 23.42 | 43.31 | 20.70 | 51.19 | 39.25 |
| Method | Encoder Acc1 ↑ | Projector Acc1 ↑ |
|---|---|---|
| Projector Dim. = 512 | Projector Dim. = 512 | Projector Dim. = 512 |
| VICReg | 63.72 | 57.80 |
| LeJEPA | 65.53 | 59.18 |
| RGN 1 . 0 (0 ,σ GN ) | 67.56 | 61.34 |
| RGN 2 . 0 (0 ,σ GN ) | 68.31 | 61.74 |
| Projector Dim. = 2048 | Projector Dim. = 2048 | Projector Dim. = 2048 |
| VICReg | 68.73 | 61.81 |
| LeJEPA | 67.18 | 60.12 |
| RGN 1 . 0 (0 ,σ GN ) | 69.33 | 64.90 |
| RGN 2 . 0 (0 ,σ GN ) | 69.54 | 64.85 |
| Method | Enc Acc1 ↑ | Proj Acc1 ↑ | ℓ 1 Sparsity ↓ | ℓ 0 Sparsity ↓ |
|---|---|---|---|---|
| Ours: RGN 1 . 0 ( µ,σ GN ) (Mean Shift Value, MSV) | Ours: RGN 1 . 0 ( µ,σ GN ) (Mean Shift Value, MSV) | Ours: RGN 1 . 0 ( µ,σ GN ) (Mean Shift Value, MSV) | Ours: RGN 1 . 0 ( µ,σ GN ) (Mean Shift Value, MSV) | Ours: RGN 1 . 0 ( µ,σ GN ) (Mean Shift Value, MSV) |
| RGN 1 . 0 (1 . 0 ,σ GN ) | 74.34 | 65.60 | 0.6459 | 0.9359 |
| RGN 1 . 0 (0 . 5 ,σ GN ) | 74.58 | 66.42 | 0.4768 | 0.8825 |
| RGN 1 . 0 (0 . 0 ,σ GN ) | 75.44 | 67.16 | 0.2730 | 0.7721 |
| RGN 1 . 0 ( - 0 . 5 ,σ GN ) | 74.80 | 66.86 | 0.1407 | 0.5526 |
| RGN 1 . 0 ( - 1 . 0 ,σ GN ) | 74.18 | 65.14 | 0.0737 | 0.3067 |
| RGN 1 . 0 ( - 1 . 5 ,σ GN ) | 74.88 | 63.70 | 0.0390 | 0.1227 |
| RGN 1 . 0 ( - 2 . 0 ,σ GN ) | 73.54 | 60.70 | 0.0238 | 0.0523 |
| RGN 1 . 0 ( - 2 . 5 ,σ GN ) | 72.06 | 57.96 | 0.0188 | 0.0357 |
| RGN 1 . 0 ( - 3 . 0 ,σ GN ) | 71.64 | 57.46 | 0.0132 | 0.0220 |
| Baselines (Dense) | Baselines (Dense) | Baselines (Dense) | Baselines (Dense) | Baselines (Dense) |
| LeJEPA | 65.36 | 59.12 | 0.6369 | 1.0000 |
| VICReg | 72.06 | 63.56 | 0.7877 | 1.0000 |
| SimCLR | 74.18 | 66.86 | 0.6663 | 1.0000 |
| Baselines (Sparse / NonNeg) | Baselines (Sparse / NonNeg) | Baselines (Sparse / NonNeg) | Baselines (Sparse / NonNeg) | Baselines (Sparse / NonNeg) |
| NonNeg-VICReg | 71.64 | 65.42 | 0.5075 | 0.7066 |
| NonNeg-SimCLR | 74.48 | 63.76 | 0.0016 | 0.0023 |
$$ f_{\mathcal{GN}_{p}(\mu,\sigma)}(x)=\frac{p^{1-1/p}}{2\sigma\Gamma(1/p)}\exp\bigg(-\frac{|x-\mu|^p}{p\sigma^p}\bigg) $$
$$ \int_{S} p(\mathbf{x}) d\mathbf{x} &= 1, \quad \mathbb{E}[|\mathbf{x}|p^p] = \frac{d}{d\lambda_1}\log Z{S}(\lambda_1) $$
$$ \mathbb{E}[|\mathbf{x}|0]&=d\cdot\Phi{\mathcal{GN}_p(0,1)}\bigg(\frac{\mu}{\sigma}\bigg)\ &=\frac{d}{2}\bigg(1+\operatorname{sgn}\bigg(\frac{\mu}{\sigma}\bigg)P\bigg(\frac{1}{p},\frac{|\mu/\sigma|^p}{p}\bigg)\bigg)\label{eq:theolzeroeq} $$ \tag{eq:theolzeroeq}
$$ \mathbb{H}{d(\boldsymbol{\xi}i)}(\boldsymbol{\xi}i)&=\Phi{\mathcal{GN}p(0,1)}\bigg(\frac{\mu}{\sigma}\bigg)\cdot\mathbb{H}{1}(\mathcal{TGN}p(\mu,\sigma))\ &+\mathbb{H}{0}(\mathbbm{1}{(0,\infty)}(\boldsymbol{\xi}i))\ \mathbb{H}{d(\boldsymbol{\xi})}(\boldsymbol{\xi})&=\sum{i=1}^{D}\mathbb{H}_{d(\boldsymbol{\xi}_i)}(\boldsymbol{\xi}i)=D\cdot \mathbb{H}{d(\boldsymbol{\xi}_i)}(\boldsymbol{\xi}_i) $$
$$ \mathbf{x}\stackrel{\operatorname{d}}{=}\mathbf{y}\iff \mathbf{c}^\top\mathbf{x}\stackrel{\operatorname{d}}{=}\mathbf{c}^\top\mathbf{y}\text{ for all }\mathbf{c}\in\mathbb{R}^d $$
$$ \min_{\boldsymbol{\theta}},&\mathbb{E}{\mathbf{z},\mathbf{z}'}[ |\mathbf{z} - \mathbf{z}'|2^2]\ &+\mathbb{E}{\mathbf{c}}[\mathcal{L}(\mathbb{P}{\mathbf{c}^\top\mathbf{z}}|\mathbb{P}{\mathbf{c}^\top\mathbf{y}})]+\mathbb{E}{\mathbf{c}}[\mathcal{L}(\mathbb{P}{\mathbf{c}^\top\mathbf{z}'}|\mathbb{P}{\mathbf{c}^\top\mathbf{y}})] $$
$$ \label{eq:iterdistloss} \mathcal{L}(\mathbb{P}_{\mathbf{c}i^\top\mathbf{z}}|\mathbb{P}{\mathbf{c}_i^\top\mathbf{y}}):=\frac{1}{B}|(\mathbf{Z}\mathbf{c}_i)^{\uparrow}-(\mathbf{Y}\mathbf{c}_i)^{\uparrow}|_2^2 $$ \tag{eq:iterdistloss}
$$ \mathbb{E}[x]&=\mu\ \operatorname{Var}[x]&=\sigma^2p^{2/p}\frac{\Gamma(3/p)}{\Gamma(1/p)}\label{eq:varofgengausseq} $$ \tag{eq:varofgengausseq}
$$ \mathbb{P}_{\mathcal{TGN}p(\mu,\sigma)}(A) &:= \mathbb{P}!\left(X \in A \mid X>0\right) \nonumber=\frac{\mathbb{P}{\mathcal{GN}p(\mu,\sigma)}(A\cap(0,\infty))}{\mathbb{P}{\mathcal{GN}p(\mu,\sigma)}((0,\infty))}=\frac{\mathbb{P}{\mathcal{GN}p(\mu,\sigma)}(A\cap(0,\infty))}{1-\Phi{\mathcal{GN}_p(0,1)}(-\mu/\sigma)} $$
$$ \delta_x(A)=\mathbbm{1}_{A}(x)=\begin{cases} 0,x\notin A\ 1, x\in A \end{cases} $$
$$ \mathbb{P}{X}(A)=\int_A\frac{d\mathbb{P}{X}}{d\nu}d\nu $$
$$ \int_A\frac{d\mathbb{P}{X}}{d\nu}d\nu=\int_A\frac{d\mathbb{P}{X}}{d\nu}d\delta_0+\int_A\frac{d\mathbb{P}_{X}}{d\nu}d\lambda $$
$$ \int_A\frac{d\mathbb{P}{X}}{d\nu}d\lambda&=\int_A\Phi{\mathcal{GN}p(0,1)}\bigg(-\frac{\mu}{\sigma}\bigg)\cdot\mathbbm{1}{{0}}(x)d\lambda(x)+\int_A\frac{p^{1-1/p}}{2\sigma\Gamma(1/p)}\exp\bigg(-\frac{|x-\mu|^p}{p\sigma^p}\bigg)\cdot\mathbbm{1}{(0,\infty)}(x)d\lambda(x)\ &=\Phi{\mathcal{GN}p(0,1)}\bigg(-\frac{\mu}{\sigma}\bigg)\cdot\int_A\mathbbm{1}{{0}}(x)d\lambda(x)+\int_A\frac{p^{1-1/p}}{2\sigma\Gamma(1/p)}\exp\bigg(-\frac{|x-\mu|^p}{p\sigma^p}\bigg)\cdot\mathbbm{1}_{(0,\infty)}(x)d\lambda(x) $$
$$ \int_A\mathbbm{1}_{{0}}(x)d\lambda(x)=\lambda(A\cap{0})=0 $$
$$ \int_A\frac{d\mathbb{P}{X}}{d\nu}d\lambda&=\int_A\frac{p^{1-1/p}}{2\sigma\Gamma(1/p)}\exp\bigg(-\frac{|x-\mu|^p}{p\sigma^p}\bigg)\cdot\mathbbm{1}{(0,\infty)}(x)d\lambda(x)\ &=\int_{A\cap(0,\infty)}\frac{p^{1-1/p}}{2\sigma\Gamma(1/p)}\exp\bigg(-\frac{|x-\mu|^p}{p\sigma^p}\bigg)d\lambda(x)\ &=\int_{A\cap(0,\infty)}\frac{d\mathbb{P}_{\mathcal{GN}p(\mu,\sigma)}}{d\lambda}(x)d\lambda(x)\ &=\mathbb{P}{\mathcal{GN}_p(\mu,\sigma)}(A\cap(0,\infty)) $$
$$ \mathbb{P}_{\mathcal{RGN}p(\mu,\sigma)}(\mathbb{R})&=\Phi{\mathcal{GN}p(0,1)}\bigg(-\frac{\mu}{\sigma}\bigg)\cdot\delta_0(\mathbb{R})+\bigg(1-\Phi{\mathcal{GN}p(0,1)}\bigg(-\frac{\mu}{\sigma}\bigg)\bigg)\cdot\mathbb{P}{\mathcal{TGN}p(\mu,\sigma)}(\mathbb{R})\&=\Phi{\mathcal{GN}p(0,1)}\bigg(-\frac{\mu}{\sigma}\bigg)+\bigg(1-\Phi{\mathcal{GN}_p(0,1)}\bigg(-\frac{\mu}{\sigma}\bigg)\bigg)=1 $$
$$ \bigg(1-\Phi_{\mathcal{GN}p(0,1)}\bigg(-\frac{\mu}{\sigma}\bigg)\bigg)\frac{1}{Z{(0,\infty)}(\mu,\sigma,p)}=\frac{\int_{-\infty}^0\frac{p^{1-1/p}}{2\sigma\Gamma(1/p)}\exp\bigg(-\frac{|x-\mu|^p}{p\sigma^p}\bigg)dx}{\int_{0}^{\infty}\exp\bigg(-\frac{|x-\mu|^p}{p\sigma^p}\bigg)dx}=\frac{p^{1-1/p}}{2\sigma\Gamma(1/p)} $$
$$ \mathbb{E}[X]&=\frac{1}{2}\bigg[\mu\bigg(1+\operatorname{sgn}(\mu)P\bigg(\frac{1}{p},\frac{|\mu|^p}{p\sigma^p}\bigg)\bigg)+p^{1/p}\sigma\frac{\Gamma(2/p, |\mu|^p/(p\sigma^p))}{\Gamma(1/p)}\bigg]\ \mathbb{E}[X^2]&=\frac{1}{2}\bigg[\mu^2\bigg(1+\operatorname{sgn}(\mu),P\bigg(\frac{1}{p},\frac{|\mu|^p}{p\sigma^p}\bigg)\bigg)+ 2\mu p^{1/p}\sigma\frac{\Gamma(2/p, |\mu|^p/(p\sigma^p))}{\Gamma(1/p)}\&+p^{2/p}\sigma^2\bigg(1+\operatorname{sgn}(\mu),P \bigg(\frac{3}{p},\frac{|\mu|^p}{p\sigma^p}\bigg)\bigg) \bigg],\ \operatorname{Var}(X)&=\mathbb{E}[X^2]-\big(\mathbb{E}[X]\big)^2.\label{eq:varofrecgengausseq} $$ \tag{eq:varofrecgengausseq}
$$ \mathbb{E}[X^k]=\mathbb{E}[Z^k\mathbbm{1}{(0,\infty)}(Z)]=\int{0}^{\infty} z^k f_Z(z),dz . $$
$$ \label{eq:shift} \mathbb{E}[X^k]=C\int_{-\mu}^{\infty} (t+\mu)^k \exp!\big(-a|t|^p\big),dt. $$ \tag{eq:shift}
$$ I_0&:=\int_{-\mu}^{\infty} e^{-a|t|^p},dt\ I_1&:=\int_{-\mu}^{\infty} t,e^{-a|t|^p},dt\ I_2&:=\int_{-\mu}^{\infty} t^2,e^{-a|t|^p},dt. $$
$$ \mathbb{E}[X] &= C(\mu I_0 + I_1),\label{eq:EX-I}\ \mathbb{E}[X^2] &= C(\mu^2 I_0 + 2\mu I_1 + I_2).\label{eq:EX2-I} $$ \tag{eq:EX-I}
$$ I_0&=\frac{1}{p}a^{-1/p}\Gamma!\Big(\frac{1}{p}\Big)\Big(1+\operatorname{sgn}(\mu),P!\Big(\frac{1}{p},t_0\Big)\Big) \ I_1&=\frac{1}{p}a^{-2/p}\Gamma!\Big(\frac{2}{p},t_0\Big) \ I_2&=\frac{1}{p}a^{-3/p}\Gamma!\Big(\frac{3}{p}\Big)\Big(1+\operatorname{sgn}(\mu),P!\Big(\frac{3}{p},t_0\Big)\Big). $$
$$ \mathbb{E}[X]&= C(\mu I_0 + I_1)\ &= C\mu\cdot \frac{1}{p}a^{-1/p}\Gamma!\Big(\frac{1}{p}\Big)\Big(1+\operatorname{sgn}(\mu)P(1/p,t_0)\Big)+ C\cdot \frac{1}{p}a^{-2/p}\Gamma!\Big(\frac{2}{p},t_0\Big) \ &=\frac{1}{2}\mu\Big(1+\operatorname{sgn}(\mu)P!\Big(\frac{1}{p},t_0\Big)\Big)+\frac{1}{2}p^{1/p}\sigma,\frac{\Gamma(2/p,t_0)}{\Gamma(1/p)}\ &=\frac{1}{2}\bigg[\mu\bigg(1+\operatorname{sgn}(\mu)P\bigg(\frac{1}{p},\frac{|\mu|^p}{p\sigma^p}\bigg)\bigg)+p^{1/p}\sigma\frac{\Gamma(2/p, |\mu|^p/(p\sigma^p))}{\Gamma(1/p)}\bigg] $$
$$ \int_{0}^{u} t^{b} e^{-a t^p},dt &=\frac{1}{p},a^{-(b+1)/p},\gamma!\Big(\frac{b+1}{p},,a u^p\Big),\label{eq:lower-gamma-id}\ \int_{u}^{\infty} t^{b} e^{-a t^p},dt &=\frac{1}{p},a^{-(b+1)/p},\Gamma!\Big(\frac{b+1}{p},,a u^p\Big),\label{eq:upper-gamma-id} $$ \tag{eq:lower-gamma-id}
$$ P(s,t):=\frac{\gamma(s,t)}{\Gamma(s)},\qquad \Gamma(s,t)=\Gamma(s)-\gamma(s,t)=\Gamma(s)\big(1-P(s,t)\big). $$
$$ I_0=\frac{1}{p}a^{-1/p}\gamma!\Big(\frac{1}{p},t_0\Big)+\frac{1}{p}a^{-1/p}\Gamma!\Big(\frac{1}{p}\Big) =\frac{1}{p}a^{-1/p}\Gamma!\Big(\frac{1}{p}\Big)\Big(1+P!\Big(\frac{1}{p},t_0\Big)\Big). $$
$$ \label{eq:I1} I_1=\frac{1}{p}a^{-2/p}\Gamma!\Big(\frac{2}{p},t_0\Big). $$ \tag{eq:I1}
$$ |\mathbf{x}|0=\sum{i=1}^{d}\mathbbm{1}{\mathbf{x}i>0}=\sum{i=1}^{d}\mathbbm{1}{\mathbf{z}_i>0} $$
$$ \max_{p}\quad -& \int_S p(\mathbf{x})\ln p(\mathbf{x})d\mathbf{x} \ \text{s.t.}\quad & \int_S p(\mathbf{x})d\mathbf{x} = 1, \ & \int_S r_i(\mathbf{x})p(\mathbf{x})d\mathbf{x} = \alpha_i, \quad i=1,\cdots, m, \ & p(\mathbf{x}) \ge 0 \quad \text{a.e. on } S. $$
$$ \mathcal{P}:=\bigg{p:\to[0,\infty)\bigg{|} \int_{S} p(\mathbf{x}),d\mathbf{x}=1, \int_{S}r_{i}(\mathbf{x})p(\mathbf{x}),d\mathbf{x}=\alpha_i,\forall i \bigg} $$
$$ Z_{S}(\boldsymbol{\lambda})&=\int_S \exp\bigg(\sum_{i=1}^m \lambda_i r_i(\mathbf{x})\bigg),d\mathbf{x} ;<;\infty. $$
$$ \left.\frac{d}{d\epsilon}\mathcal{J}[p+\epsilon\delta p]\right|{\epsilon=0}&=-\int{S}\left.\frac{d}{d\epsilon}\bigg[(p(\mathbf{x})+\epsilon\delta p(\mathbf{x}))\ln (p(\mathbf{x})+\epsilon\delta p(\mathbf{x}))\bigg]d\mathbf{x}\right|{\epsilon=0}+\lambda_0\bigg(\int{S}\left.\frac{d}{d\epsilon} \bigg[p(\mathbf{x})+\epsilon\delta p(\mathbf{x})\bigg] d\mathbf{x}\right|{\epsilon=0}\bigg)\ &+\sum{i=1}^{m}\lambda_i\bigg(\int_{S}\left.\frac{d}{d\epsilon}\bigg[r_i(\mathbf{x}) (p(\mathbf{x})+\epsilon\delta p(\mathbf{x})))d\mathbf{x}\bigg]\right|{\epsilon=0}\bigg)\ &=-\int{S}\left.\bigg[\delta p(\mathbf{x})\ln (p(\mathbf{x})+\epsilon\delta p(\mathbf{x}))+\delta p(\mathbf{x})\bigg]d\mathbf{x}\right|{\epsilon=0}+\lambda_0\bigg(\int{S} 1\cdot\delta p(\mathbf{x})d\mathbf{x}\bigg)\ &+\sum_{i=1}^{m}\lambda_i\bigg(\int_{S} r_i(\mathbf{x})\delta p(\mathbf{x})d\mathbf{x}\bigg)\ &=-\int_{S}\bigg[\ln p(\mathbf{x})+1\bigg]\delta p(\mathbf{x})d\mathbf{x}+\bigg(\int_{S} \lambda_0\cdot\delta p(\mathbf{x})d\mathbf{x}\bigg)+\bigg(\int_{S}\sum_{i=1}^{m}\lambda_i r_i(\mathbf{x})\delta p(\mathbf{x})d\mathbf{x}\bigg)\ $$
$$ \frac{d}{d\lambda_1}\log Z_S(\lambda_1)&=\frac{1}{Z_S(\lambda_1)}\frac{d}{d\lambda_1}Z_S(\lambda_1)\&=\frac{1}{Z_S(\lambda_1)}\int_{S}\frac{d}{d\lambda_1}\exp(\lambda_1|\mathbf{x}|p^p)d\mathbf{x}\&=\frac{1}{Z_S(\lambda_1)}\int{S}|\mathbf{x}|_p^p\exp(\lambda_1|\mathbf{x}|p^p)d\mathbf{x}\ &=\int{S}|\mathbf{x}|_p^p\frac{1}{Z_S(\lambda_1)}\exp\bigg(-\frac{|\mathbf{x}|_p^p}{p\sigma^p}\bigg)\cdot\mathbbm{1}_S(\mathbf{x})d\mathbf{x}\ &=\mathbb{E}[|\mathbf{x}|_p^p] $$
$$ \frac{1}{Z_S(\lambda_1)}=\frac{1}{Z_{\mathbb{R}^d}(\lambda_1)}=\frac{p^{d-d/p}}{(2\sigma)^d\Gamma(1/p)^d} $$
$$ p(\mathbf{x})\propto \exp(\lambda_1|\mathbf{x}|_1) $$
$$ \mathbb{E}[\mathbf{x}_i]=\boldsymbol{\mu}_i,\quad\mathbb{E}[\mathbf{x}_i \mathbf{x}j]=\mathbf{\Sigma}{ij}+\boldsymbol{\mu}_i\boldsymbol{\mu}_j, \quad\forall\space i,j\in{1,\cdots, d} $$
$$ p(\mathbf{x})&\propto\exp\bigg(\sum_{i=1}^{d}\boldsymbol{\lambda}i \mathbf{x}i+\sum{i=1}^{d}\sum{j=1}^{d}\mathbf{\Lambda}_{ij}\mathbf{x}_i\mathbf{x}_j\bigg)\ &=\exp\bigg(\boldsymbol{\lambda}^\top\mathbf{x}+\mathbf{x}^\top\mathbf{\Lambda}\mathbf{x}\bigg)\ &\propto\exp\bigg(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\mathbf{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\bigg) $$
$$ \mathbb{H}{0}(\eta):=\sum{k=1}^{\infty}q_k\log\frac{1}{q_k} $$
$$ \mathbb{H}{d(\xi)}=\lim{n\to\infty}(\mathbb{H}_{0}(\xi_n)-d(\xi)\log n) $$
$$ \mathbb{P}_{\xi}=(1-d)\cdot\mathbb{P}_0+d\cdot\mathbb{P}_1 $$
$$ \mathbb{H}{d(\xi)}(\xi)=(1-d)\cdot\sum{k=1}^{\infty}p_k\log\frac{1}{p_k}-d\cdot\int_{\mathbb{R}}\frac{d\mathbb{P}_1}{d\lambda}(x)\log\bigg(\frac{d\mathbb{P}_1}{d\lambda}(x)\bigg)d\lambda(x)+d\log\frac{1}{d}+(1-d)\log\frac{1}{1-d} $$
$$ \mathbbm{1}_{(0,\infty)}(\boldsymbol{\xi}_i)=\begin{cases} 1, \text{if } \boldsymbol{\xi}_i\in (0,\infty)\implies \text{with probability }d(\boldsymbol{\xi}_i)\ 0, \text{if } \boldsymbol{\xi}_i\notin (0,\infty)\implies \text{with probability }1-d(\boldsymbol{\xi}_i) \end{cases} $$
$$ \mathbb{H}0(\mathbbm{1}{(0,\infty)}(\boldsymbol{\xi}_i))&=d(\boldsymbol{\xi}_i)\log\frac{1}{d(\boldsymbol{\xi}_i)}+(1-d(\boldsymbol{\xi}_i))\log\frac{1}{1-d(\boldsymbol{\xi}_i)} $$
$$ \mathbb{H}_{d(\boldsymbol{\xi}_i)}(\boldsymbol{\xi}i)&=0-d(\boldsymbol{\xi}i)\cdot\int{\mathbb{R}}\frac{d\mathbb{P}{\mathcal{TGN}p(\mu,\sigma)}}{d\lambda}(x)\log\bigg(\frac{d\mathbb{P}{\mathcal{TGN}p(\mu,\sigma)}}{d\lambda}(x)\bigg)d\lambda(x)+\mathbb{H}0(\mathbbm{1}{(0,\infty)}(\boldsymbol{\xi}i))\ &=\Phi{\mathcal{GN}p(0,1)}\bigg(\frac{\mu}{\sigma}\bigg)\cdot\mathbb{H}{1}(\mathcal{TGN}p(\mu,\sigma))+\mathbb{H}{0}(\mathbbm{1}{(0,\infty)}(\boldsymbol{\xi}_i)) $$
$$ \mathbb{P}_{X\mid (0,\infty)}(A):=\frac{\mathbb{P}_X(A\cap(0,\infty))}{\mathbb{P}_X((0,\infty))}=\frac{\mathbb{P}_X(A)}{d} $$
$$ \mathbb{P}_Z(B)=\mathbb{P}_X(\varphi^{-1}(B)). $$
$$ \varphi^{-1}(B)=\big(\varphi^{-1}(B)\cap(-\infty,0]\big);\cup;\big(\varphi^{-1}(B)\cap(0,\infty)\big), $$
$$ \varphi^{-1}(B)\cap(-\infty,0]= \begin{cases} (-\infty,0], & 0\in B,\ \emptyset, & 0\notin B. \end{cases} $$
$$ \frac{d\mathbb{P}{Z}}{d\nu}(x)=(1-d)\cdot\mathbbm{1}{{0}}(x)+d\cdot\frac{d\mathbb{P}{X|(0,\infty)}}{d\lambda}(x)\cdot\mathbbm{1}{(0,\infty)}(x) $$
$$ \int_A\frac{d\mathbb{P}{Z}}{d\nu}d\lambda&=\int_A(1-d)\cdot\mathbbm{1}{{0}}(x)d\lambda(x)+\int_Ad\cdot\frac{d\mathbb{P}{X|(0,\infty)}}{d\lambda}(x)\cdot\mathbbm{1}{(0,\infty)}(x)d\lambda(x)\ &=(1-d)\cdot\int_A\mathbbm{1}{{0}}(x)d\lambda(x)+\int{A\cap (0,\infty)}d\cdot\frac{d\mathbb{P}{X|(0,\infty)}}{d\lambda}(x)\cdot d\lambda(x)\ &=0+d\cdot\int{A\cap (0,\infty)} d\mathbb{P}{X|(0,\infty)}(x) \ &=d\cdot\mathbb{P}{X|(0,\infty)}(A) $$
$$ \mathbb{P}Z=(1-d)\cdot\delta_0+d\cdot\mathbb{P}{X\mid (0,\infty)}\label{eq:decompositionofgeneralrectifiedvar} $$ \tag{eq:decompositionofgeneralrectifiedvar}
$$ \hat{\mathbb{H}}1(\mathbb{P}{X\mid (0,\infty)})=\frac{1}{B'-m}\sum_{i=1}^{B'-m}\log\bigg(\frac{B'+1}{m}\bigg(z_{(i+m)}-z_{(i)}\bigg)\bigg) $$
$$ \mathbb{H}{d(\mathbf{z})}(\mathbf{z})\leq\sum{i=1}^{D}\mathbb{H}_{d(\mathbf{z}_i)}(\mathbf{z}_i) $$
$$ \mathbb{H}{\nu}(Z)=H{d(Z)}(Z) $$
$$ \mathbb{H}{\nu}(Z)&=-\int \frac{d\mathbb{P}{Z}}{d\nu}\log\bigg(\frac{d\mathbb{P}{Z}}{d\nu}\bigg)d\nu\ &=-\int \frac{d\mathbb{P}{Z}}{d\nu}\log\bigg(\frac{d\mathbb{P}{Z}}{d\nu}\bigg)d\delta_0-\int \frac{d\mathbb{P}{Z}}{d\nu}\log\bigg(\frac{d\mathbb{P}_{Z}}{d\nu}\bigg)d\lambda $$
$$ -\int \frac{d\mathbb{P}{Z}}{d\nu}\log\bigg(\frac{d\mathbb{P}{Z}}{d\nu}\bigg)d\lambda &=-\int d\cdot\frac{d\mathbb{P}{X|(0,\infty)}}{d\lambda}(x)\cdot\mathbbm{1}{(0,\infty)}(x)\log(d\cdot\frac{d\mathbb{P}{X|(0,\infty)}}{d\lambda}(x)\cdot\mathbbm{1}{(0,\infty)}(x))d\lambda(x)\ &=-\int d\cdot\frac{d\mathbb{P}{X|(0,\infty)}}{d\lambda}(x)\cdot\mathbbm{1}{(0,\infty)}(x)\log(\frac{d\mathbb{P}{X|(0,\infty)}}{d\lambda}(x)\cdot\mathbbm{1}{(0,\infty)}(x))d\lambda(x)\&-\int d\cdot\frac{d\mathbb{P}{X|(0,\infty)}}{d\lambda}(x)\cdot\mathbbm{1}{(0,\infty)}(x)\log(d)d\lambda(x)\ &=d\cdot\mathbb{H}1(\mathbb{P}{X|(0,\infty)})-d\log(d) $$
$$ \operatorname{TC}(\mathbf{x})=D_{\operatorname{KL}}\bigg(\mathbb{P}{\mathbf{x}}\bigg|\prod{i=1}^{D}\mathbb{P}{\mathbf{x}i}\bigg)=\int\log\bigg(\frac{d\mathbb{P}{\mathbf{x}}}{\prod{i=1}^{D}\mathbb{P}_{\mathbf{x}i}}\bigg)d\mathbb{P}{\mathbf{x}} $$
$$ \operatorname{HSIC}(X,Y)=\frac{1}{(B-1)^2}\operatorname{Tr}(\mathbf{KHLH}) $$
$$ \operatorname{RepReLU}(z):=\operatorname{ReLU}(z).\operatorname{detach}() + \operatorname{GeLU}(z) - \operatorname{GeLU}(z).\operatorname{detach}() $$
$$ \operatorname{Var}(\mathbf{u}_i^\top\mathbf{y}) = \mathbf{u}_i^\top (\gamma\mathbf{I}_d)\mathbf{u}_i = \gamma\cdot|\mathbf{u}_i|_2^2 = \gamma. $$
$$ |\gamma\cdot\mathbf{I}_d - \tilde{\mathbf{Z}}^\top\tilde{\mathbf{Z}}|_F^2 $$
Theorem. [Rényi Information Dimension Characterizations of Multivariate Rectified Generalized Gaussian Distributions] Let $\xi\sim\prod_{i=1}^{D}RGN_p(\mu,\sigma)$ be a Rectified Generalized Gaussian random vector. The Rényi information dimension of $\xi$ is $d(\xi)=D\cdot\Phi_{GN_p(0,1)}(\mu/\sigma)$, and the $d(\xi)$-dimensional entropy of $\xi$ is given by align H_{d(\xi_i)}(\xi_i)&=\Phi_{GN_p(0,1)}\bigg(\mu{\sigma}\bigg)\cdotH_{1}(TGN_p(\mu,\sigma))\ &+H_{0}(1_{(0,\infty)}(\xi_i))\ H_{d(\xi)}(\xi)&=\sum_{i=1}^{D}H_{d(\xi_i)}(\xi_i)=D\cdot H_{d(\xi_i)}(\xi_i) align where $H_0(\cdot)$ is the discrete Shannon entropy, $H_1(\cdot)$ denotes the differential entropy, and $1_{(0,\infty)}(\xi_i)$ is a Bernoulli random variable that equals $1$ with probability $\Phi_{GN_p(0,1)}(\mu/\sigma)$ and $0$ with probability $1-\Phi_{GN_p(0,1)}(\mu/\sigma)$. proof See proof:proofofrenyiinfoentropyofrecgengauss. proof
Lemma. [Absolute Continuity] The Rectified Generalized Gaussian probability measure $P_X$ in Definition~def:measuretheoreticaldefofrgg is absolutely continuous with respect to the mixed measure $\nu:=\delta_0+\lambda$, i.e. $P_X \ll\nu$. proof According to folland1999real, if $P_X$ is a signed measure and $\nu$ is a positive measure on the same measurable space $(R, B(R))$, then $P_X \ll\nu$ if $\nu(A)=0$ for every $A\inB(R)$ implies $P_{X}(A)=0$. Let's consider the case of $\nu(A)=0$. By definition, $\nu(A)=\delta_0(A)+\lambda(A)=0$. Since both $\delta_0$ and $\lambda$ are non-negative measures, $\delta_0(A)=\lambda(A)=0$. We observe that $\delta_0(A)=0$ implies $0\notin A$ by the definition of the Dirac measure. Thus align P_{X}(A)&=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot\delta_0(A)+\bigg(1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\bigg)\cdotP_{TGN_p(\mu,\sigma)}(A)\ &=\bigg(1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\bigg)\cdotP_{TGN_p(\mu,\sigma)}(A) align where the first term vanishes because $0\notin A$. It's trivial that $P_{TGN_p(\mu,\sigma)}$ is absolutely continuous with respect to the Lebesgue measure. Since $\nu(A)=0\implies \lambda(A)=0$, we have $\lambda(A)=0\implies P_{TGN_p(\mu,\sigma)}(A)=0$. Thus align P_{X}(A)&=\bigg(1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\bigg)\cdotP_{TGN_p(\mu,\sigma)}(A)=0 align and we have proven the absolutely continuity result $P_X \ll\nu$. proof
Lemma. [Radon–Nikodym Derivative] The Radon–Nikodym derivative of the Rectified Generalized Gaussian probability measure $P_X$ with respect to the mixed measure $\nu:=\delta_0+\lambda$ exists and is given by align d\mathbb{P_{X}}{d\nu}(x)=f_{RGN_p(\mu,\sigma)}(x)&=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot1_{{0}}(x)+p^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)\cdot1_{(0,\infty)}(x) align proof By Lemmalemma:absolutecontofrggwithrespecttomixed, $P_X \ll\nu$ so the Radon–Nikodym derivative $dP_{X}/d\nu$ exists and it suffices to show that for any $A\subseteqB(R)$ we have align P_{X}(A)=\int_Ad\mathbb{P_{X}}{d\nu}d\nu align We start by expanding the integral with respect to a sum of measures align \int_Ad\mathbb{P_{X}}{d\nu}d\nu=\int_Ad\mathbb{P_{X}}{d\nu}d\delta_0+\int_Ad\mathbb{P_{X}}{d\nu}d\lambda align By the property of the Dirac measure, we have align \int_Ad\mathbb{P_{X}}{d\nu}d\delta_0=d\mathbb{P_{X}}{d\nu}(0)\delta_0(A)=f_{RGN_p(\mu,\sigma)}(0)\delta_0(A) align We observe that $1_{{0}}(x)=1$ and $1_{(0,\infty)}(0)=0$. So we have align f_{RGN_p(\mu,\sigma)}(0)\delta_0(A)=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot\delta_0(A) align Now the second term can be expanded as align \int_Ad\mathbb{P_{X}}{d\nu}d\lambda&=\int_A\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot1_{{0}}(x)d\lambda(x)+\int_Ap^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)\cdot1_{(0,\infty)}(x)d\lambda(x)\ &=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot\int_A1_{{0}}(x)d\lambda(x)+\int_Ap^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)\cdot1_{(0,\infty)}(x)d\lambda(x) align where the term align \int_A1_{{0}}(x)d\lambda(x)=\lambda(A\cap{0})=0 align simply vanishes. Thus we are left with align \int_Ad\mathbb{P_{X}}{d\nu}d\lambda&=\int_Ap^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)\cdot1_{(0,\infty)}(x)d\lambda(x)\ &=\int_{A\cap(0,\infty)}p^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)d\lambda(x)\ &=\int_{A\cap(0,\infty)}d\mathbb{P_{GN_p(\mu,\sigma)}}{d\lambda}(x)d\lambda(x)\ &=P_{GN_p(\mu,\sigma)}(A\cap(0,\infty)) align By Definitiondef:trungengaussmeasuredef, the Truncated Generalized Gaussian probability measure is given by align P_{TGN_p(\mu,\sigma)}(A)&=\mathbb{P_{GN_p(\mu,\sigma)}(A\cap(0,\infty))}{P_{GN_p(\mu,\sigma)}((0,\infty))}\ &=\mathbb{P_{GN_p(\mu,\sigma)}(A\cap(0,\infty))}{1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)}\ align Thus we have the identity align P_{GN_p(\mu,\sigma)}(A\cap(0,\infty))&=1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdotP_{TGN_p(\mu,\sigma)}(A) align Putting everything together, we arrive at align \int_Ad\mathbb{P_{X}}{d\nu}d\nu&=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot\delta_0(A)+\bigg(1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\bigg)\cdotP_{TGN_p(\mu,\sigma)}(A)\ &=P_{X}(A) align Thus we have proven the form of the Radon–Nikodym Derivative. proof
Lemma. [$I_0$ Integral] The $I_0$ integral in Proposition~proposition:meanandvarofrecgengauss is given by align I_0=1{p}a^{-1/p}\Gamma!\Big(1{p}\Big)\Big(1+sgn(\mu),P!\Big(1{p},t_0\Big)\Big). align
Lemma. [Maximum Entropy Continuous Multivaraite Probability Distributions] Let $S\subseteqR^d$ be a measurable set with positive Lebesgue measure. We define $r_1,\cdots,r_m:S\toR$ as measurable functions and let $\alpha_1,\cdots,\alpha_m\inR$. Consider the optimization problem align \max_{p}\quad -& \int_S p(x)\ln p(x)dx \ s.t.\quad & \int_S p(x)dx = 1, \ & \int_S r_i(x)p(x)dx = \alpha_i, \quad i=1,\cdots, m, \ & p(x) \ge 0 \quad a.e. on S. align We denote the set of functions that satisfies the given constraints as align P:=\bigg{p:\to[0,\infty)| \int_{S} p(x),dx=1, \int_{S}r_{i}(x)p(x),dx=\alpha_i,\forall i \bigg} align Assume the set $P$ is nonempty and that there exists $\lambda=(\lambda_1,\dots,\lambda_m)\inR^m$ such that align Z_{S}(\lambda)&=\int_S \exp\bigg(\sum_{i=1}^m \lambda_i r_i(x)\bigg),dx ;<;\infty. align Then any maximizer $p^\star$ of the optimization problem has the form align p^\star(x)&=1{Z_{S}(\lambda)} \exp\bigg(\sum_{i=1}^m \lambda_i r_i(x)\bigg)\cdot1_S(x), align where ${\lambda_i}{i=1}^{m}$ are chosen to satisfy the constraints ${\alpha_i}{i=1}^{m}$. proof We can form the Lagrangian functional of the constrained optimization problem as align J[p]=-\int_{S} p(x)\ln p(x) dx+\lambda_0\bigg(\int_{S} p(x) dx-1\bigg)+\sum_{i=1}^{m}\lambda_i\bigg(\int_{S}r_i(x) p(x)dx-\alpha_i\bigg) align where $\lambda_0,\lambda_1,\dots,\lambda_m\inR$ are Lagrange multipliers. Let $p$ be a maximizer that is strictly positive almost everywhere (a.e.) over $S$. We denote $\delta p$ as an arbitrary integrable perturbation supported on $S$ such that $p+\epsilon\delta p\geq0$ for sufficiently small $|\epsilon|$. Thus the Gateaux derivative of $J$ in the direction of $\delta p$ is given by align \left.d{d\epsilon}J[p+\epsilon\delta p]\right|{\epsilon=0}&=-\int{S}\left.d{d\epsilon}\bigg[(p(x)+\epsilon\delta p(x))\ln (p(x)+\epsilon\delta p(x))\bigg]dx\right|{\epsilon=0}+\lambda_0\bigg(\int{S}\left.d{d\epsilon} \bigg[p(x)+\epsilon\delta p(x)\bigg] dx\right|{\epsilon=0}\bigg)\ &+\sum{i=1}^{m}\lambda_i\bigg(\int_{S}\left.d{d\epsilon}\bigg[r_i(x) (p(x)+\epsilon\delta p(x)))dx\bigg]\right|{\epsilon=0}\bigg)\ &=-\int{S}\left.\bigg[\delta p(x)\ln (p(x)+\epsilon\delta p(x))+\delta p(x)\bigg]dx\right|{\epsilon=0}+\lambda_0\bigg(\int{S} 1\cdot\delta p(x)dx\bigg)\ &+\sum_{i=1}^{m}\lambda_i\bigg(\int_{S} r_i(x)\delta p(x)dx\bigg)\ &=-\int_{S}\bigg[\ln p(x)+1\bigg]\delta p(x)dx+\bigg(\int_{S} \lambda_0\cdot\delta p(x)dx\bigg)+\bigg(\int_{S}\sum_{i=1}^{m}\lambda_i r_i(x)\delta p(x)dx\bigg)\ align Thus the functional derivative is align \deltaJ{\delta p}&=-\ln p(x)-1+\lambda_0+\sum_{i=1}^{m}\lambda_i r_i(x) align Since this expression must vanish for all admissible perturbations $\delta p$, we get $\deltaJ{\delta p}=0$ almost everywhere on $S$. Solving for $p$ yields align p(x)&=\exp\bigg(\lambda_0-1+\sum_{i=1}^m \lambda_i r_i(x)\bigg) align Absorbing the constant terms into $Z_{S}(\lambda)$, we end up with align p(x)&=1{Z_{S}(\lambda)} \exp\bigg(\sum_{i=1}^m \lambda_i r_i(x)\bigg)\cdot1_S(x) align proof
Lemma. [Probability Measure Under Rectification] Let $X\simP_{X}$ be a real-valued random variable where $P_X$ is absolutely continuous with respect to the Lebesgue measure $\lambda$, i.e. $P_X\ll\lambda$. Then the probability measure of $Z:=\max(0, X)$ over $([0,\infty),B([0,\infty)))$ is align P_Z=(1-d)\cdot\delta_0+d\cdotP_{X\mid (0,\infty)} align where $\delta_0$ is the Dirac measure and $1-d:=P(Z=0)=P(X\leq0)$ and align P_{X\mid (0,\infty)}(A):=\mathbb{P_X(A\cap(0,\infty))}{P_X((0,\infty))}=\mathbb{P_X(A)}{d} align for any Borel $A\subset (0,\infty)$.
Lemma. [Radon–Nikodym Derivative Under Rectifications] Let $X\simP_{X}$ be a real-valued random variable where $P_X$ is absolutely continuous with respect to the Lebesgue measure $\lambda$, i.e. $P_X\ll\lambda$. Consider $Z:=\max(0, X)\simP_{Z}$ and let $\delta_0$ be the Dirac measure in Definition~def:diracmeasuredef. Then the Radon–Nikodym derivative of $P_{Z}$ with respect to $\nu=\delta_0+\lambda$ is given by align d\mathbb{P_{Z}}{d\nu}(x)=(1-d)\cdot1_{{0}}(x)+d\cdotd\mathbb{P_{X|(0,\infty)}}{d\lambda}(x)\cdot1_{(0,\infty)}(x) align
Lemma. The entropy $H_{\nu}(Z)$ is equivalent to the $d(Z)$-dimensional entropy align H_{\nu}(Z)=H_{d(Z)}(Z) align
Proposition. [Maximum Entropy Characterizations of Multivariate Truncated Generalized Gaussian Distributions] The maximum entropy distribution over $S\subseteqR^d$, where $S$ is a subset of $R^d$ with positive Lebesgue measure, under the constraints align \int_{S} p(x) dx &= 1, \quad E[|x|p^p] = d{d\lambda_1}\log Z{S}(\lambda_1) align is the Multivariate Truncated Generalized Gaussian distribution $\prod_{i=1}^{d}TGN_{p}(0,\sigma,S)$ with probability density function align p(x)=1{Z_S(\lambda_1)}\exp\bigg(-|x|_p^p{p\sigma^p}\bigg) \cdot1_S(x) align where $\lambda_1=-1/p\sigma^p$ and $Z_S(\lambda_1)$ is the partition function.
Proposition. [Sparsity] Let $x\sim\prod_{i=1}^{d}RGN_p(\mu,\sigma)$ in $d$ dimension. Then align E[|x|0]&=d\cdot\Phi{GN_p(0,1)}\bigg(\mu{\sigma}\bigg)\ &=d{2}\bigg(1+sgn\bigg(\mu{\sigma}\bigg)P\bigg(1{p},|\mu/\sigma|^p{p}\bigg)\bigg) align where $sgn(\cdot)$ is the sign function and $P(\cdot,\cdot)$ is the lower regularized gamma function.
Proposition. Let $X\simRGN_{p}(\mu,\sigma)$ and $sgn(\mu)\in{-1,0,+1}$ be the sign function. Let $\gamma(s,t)$ be the lower incomplete gamma function, $\Gamma(s,t)$ be the upper incomplete gamma function, $\Gamma(s)$ be the gamma function, and $P(s,t)=\gamma(s,t)/\Gamma(s)$ be the lower regularized gamma function. Then align E[X]&=1{2}\bigg[\mu\bigg(1+sgn(\mu)P\bigg(1{p},|\mu|^p{p\sigma^p}\bigg)\bigg)+p^{1/p}\sigma\Gamma(2/p, |\mu|^p/(p\sigma^p)){\Gamma(1/p)}\bigg]\ E[X^2]&=1{2}\bigg[\mu^2\bigg(1+sgn(\mu),P\bigg(1{p},|\mu|^p{p\sigma^p}\bigg)\bigg)+ 2\mu p^{1/p}\sigma\Gamma(2/p, |\mu|^p/(p\sigma^p)){\Gamma(1/p)}\&+p^{2/p}\sigma^2\bigg(1+sgn(\mu),P \bigg(3{p},|\mu|^p{p\sigma^p}\bigg)\bigg) \bigg],\ Var(X)&=E[X^2]-\big(E[X]\big)^2. align
Proposition. [Implicit Regularization of Second-Order Statistics] Let $z\inR^{d}$ be a neural network feature random vector with covariance matrix $Cov[z]=\Sigma$. We denote the eigendecomposition as $\Sigma=U\LambdaU^\top$ with the set of eigenvectors being ${u_i}{i=1}^{d}$. Let $y\sim\prod{i=1}^{d}RGN_{p}(\mu,\sigma)$ be the Rectified Generalized Gaussian random vector and define $\gamma:=Var[RGN_{p}(\mu,\sigma)]\in(0,\infty)$. If $u_i^\topzd{=}u_i^\topy$ for all $i\in{1,\dots,d}$, then $\Sigma=\gamma\cdotI_d$.
Corollary. [Equivalent Definition of Rectified Generalized Gaussian] The probability density function of the Rectified Generalized Gaussian distribution $RGN_{p}(\mu,\sigma)$ can also be written as align f_{RGN_p(\mu,\sigma)}(x)&=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot1_{{0}}(x)\&+\bigg(1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\bigg)1{Z_{(0,\infty)}(\mu,\sigma,p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)\cdot1_{(0,\infty)}(x) align where $\Phi_{GN_p(0,1)}$ is the cumulative distribution function for the standard Generalized Gaussian distribution $GN_p(0, 1)$. proof We can simplify the expression as align 1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)=1-\Phi_{GN_p(\mu,\sigma)}(0)=\Phi_{GN_p(\mu,\sigma)}(0)=\int_{-\infty}^0p^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)dx align So we have align \bigg(1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\bigg)1{Z_{(0,\infty)}(\mu,\sigma,p)}=\int_{-\infty^0p^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)dx}{\int_{0}^{\infty}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)dx}=p^{1-1/p}{2\sigma\Gamma(1/p)} align where the extra terms cancel out due to symmetry around $0$. Thus we have recovered the forms in Definition~def:rectifiedgeneralizedgaussiandist. proof
Corollary. If $S=R^d$ in Proposition~proposition:truncatedgeneralizedgaussmaxent, then the constraint align E[|x|p^p] = d{d\lambda_1}\log Z{S}(\lambda_1)=d\sigma^p align and we recover the Generalized Gaussian distribution with zero mean and scale parameter $\sigma$. align p(x)=p^{d-d/p}{(2\sigma)^d\Gamma(1/p)^d}\exp\bigg(-|x|_p^p{p\sigma^p}\bigg) align
Corollary. The maximum entropy distribution over $R^d$ under the constraints align \int_{R^d} p(x) dx &= 1, \quad E[|x|_1] = db align is the product of independent univariate symmetric Laplace distributions with zero mean and scale parameter $b$ align p(x)=\bigg(1{2b}\bigg)^d\exp\bigg(-|x|_1{b}\bigg) align
Corollary. The maximum entropy distribution over $R^d$ under the constraints align \int_{R^d} p(x) dx &= 1, \quad E[x] = \mu, \quad E[(x-\mu)(x-\mu)^\top]=\Sigma align is the multivariate Gaussian distribution with mean $\mu$ and covariance $\Sigma$ align p(x)\propto\exp\bigg(-1{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\bigg) align When $\mu=0$ and $\Sigma=I$, the density function takes the form of align p(x)\propto\exp\bigg(-1{2}|x|2^2\bigg) align proof Notice that the vector-valued mean constraint and matrix-valued covariance constraint can be factorized as a collection of scalar-valued constraints align E[x_i]=\mu_i,\quadE[x_i x_j]=\Sigma{ij}+\mu_i\mu_j, \quad\forall\space i,j\in{1,\cdots, d} align By Lemmalemma:universalmaxentcontmultivariatelemma, the maximum entropy distribution has the form align p(x)&\propto\exp\bigg(\sum_{i=1}^{d}\lambda_i x_i+\sum_{i=1}^{d}\sum_{j=1}^{d}\Lambda_{ij}x_ix_j\bigg)\ &=\exp\bigg(\lambda^\topx+x^\top\Lambdax\bigg)\ &\propto\exp\bigg(-1{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\bigg) align for $\lambda=\Sigma^{-1}\mu$ and $\Lambda=-1{2}\Sigma^{-1}$, which is the multivariate Gaussian distribution up to normalizations. When $\mu=0$ and $\Sigma=I$, the density function trivially evaluates to align p(x)&\propto\exp\bigg(-1{2}|x|_2^2\bigg) align which is the maximum-entropy distribution under the expected $\ell_2$ norm constraints based on Lemmalemma:universalmaxentcontmultivariatelemma. proof
Definition. [Generalized Gaussian Distribution ] The Generalized Gaussian distribution $GN_{p}(\mu,\sigma)$ over the support $(-\infty,\infty)$ has the probability density function align f_{GN_{p}(\mu,\sigma)}(x)=p^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg) align where $\Gamma(s):=\int^{\infty}_0 t^{s-1}e^{-t}dt$ is the gamma function.
Definition. [Truncated Generalized Gaussian Distribution] Let $S\subseteqR$ be a subset of $R$ with positive Lebesgue measure. The Truncated Generalized Gaussian distribution $TGN_{p}(\mu,\sigma, S)$ is the restriction of the Generalized Gaussian distribution $GN_p(\mu,\sigma)$ to the support $S$. The probability density function of $TGN_{p}(\mu,\sigma, S)$ is given by align f_{TGN_{p}(\mu,\sigma, S)}(x)=\mathbbm{1_{S}(x)}{Z_S(\mu,\sigma,p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg) %!\cdot!1_{S}(x) align where $1_{S}(x)$ is the indicator function that evaluates to $1$ if $x\in S$ and $0$ otherwise. The partition function is align Z_S(\mu,\sigma,p)=\int_{S}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)dx align
Definition. [Rectified Generalized Gaussian] The Rectified Generalized Gaussian distribution $RGN_{p}(\mu,\sigma)$ is a mixture between a discrete Dirac measure $\delta_0(x)$ (Definition~def:diracmeasuredef) and a Truncated Generalized Gaussian distribution $TGN_{p}(\mu,\sigma,(0,\infty))$ with probability density function align &f_{RGN_p(\mu,\sigma)}(x)=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot1_{{0}}(x)\&+p^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)\cdot1_{(0,\infty)}(x) align where $\Phi_{GN_p(0,1)}$ is the cumulative distribution function for the standard Generalized Gaussian distribution $GN_p(0, 1)$.
Definition. [Truncated Generalized Gaussian Probability Measure] Let $X \sim P_{GN_p(\mu,\sigma)}$ be a Generalized Gaussian random variable with $\ell_p$ parameter $p>0$, location $\mu \in R$, and scale $\sigma>0$. The Truncated Generalized Gaussian probability measure $P_{TGN_p(\mu,\sigma)}$ on the measurable space $(R,B(R))$ is defined as the conditional distribution of $X$ given $X>0$, i.e., align P_{TGN_p(\mu,\sigma)}(A) &:= P!\left(X \in A \mid X>0\right) \nonumber=\mathbb{P_{GN_p(\mu,\sigma)}(A\cap(0,\infty))}{P_{GN_p(\mu,\sigma)}((0,\infty))}=\mathbb{P_{GN_p(\mu,\sigma)}(A\cap(0,\infty))}{1-\Phi_{GN_p(0,1)}(-\mu/\sigma)} align for any $A \in B(R)$, where $\Phi_{GN_p(0,1)}$ denotes the cumulative distribution function of the standardized Generalized Gaussian distribution.
Definition. [Dirac Measure] The Dirac measure $\delta_x$ over a measurable space $(X,\Sigma)$ for a given $x\in X$ is defined as align \delta_x(A)=1_{A}(x)=cases 0,x\notin A\ 1, x\in A cases align for any measurable set $A\subseteq X$.
Definition. [Measure-Theoretical Definition of the Rectified Generalized Gaussian] Fix parameters $p>0$, $\mu\inR$, and $\sigma>0$. We denote $(R, B(R))$ as the real line equipped with Borel $\sigma$-algebra. Let $\lambda$ be the Lebesgue measure on $B(R)$ and let $\delta_0$ be the Dirac measure at $0$ presented in Definitiondef:diracmeasuredef. The probability measure $P_{X}$ of the Rectified Generalized Gaussian random variable $X$ is given by the mixture align P_{X}=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot\delta_0+\bigg(1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\bigg)\cdotP_{TGN_p(\mu,\sigma)} align where $P_{TGN_p(\mu,\sigma)}$ is the Truncated Generalized Gaussian probability measure in Definitiondef:trungengaussmeasuredef and $\Phi_{GN_p(0,1)}$ is the CDF of the standard Generalized Gaussian $GN_{p}(0,1)$. Define the mixed measure $\nu:=\lambda+\delta_0$. By Lemma~lemma:radonnikodymderivativeofrgg, the Radon-Nikodym derivative of $P_X$ with respect to $\nu$ exists and is given by align d\mathbb{P_{X}}{d\nu}(x)=f_{RGN_p(\mu,\sigma)}(x)&=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot1_{{0}}(x)+p^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)\cdot1_{(0,\infty)}(x) align
Definition. [Gamma Functions] If $u\ge 0$ and $b>-1$, then align \int_{0}^{u} t^{b} e^{-a t^p},dt &=1{p},a^{-(b+1)/p},\gamma!\Big(b+1{p},,a u^p\Big),\ \int_{u}^{\infty} t^{b} e^{-a t^p},dt &=1{p},a^{-(b+1)/p},\Gamma!\Big(b+1{p},,a u^p\Big), align where $\gamma(\cdot,\cdot)$ and $\Gamma(\cdot,\cdot)$ are the lower and upper incomplete gamma functions. By definition, we also have align P(s,t):=\gamma(s,t){\Gamma(s)},\qquad \Gamma(s,t)=\Gamma(s)-\gamma(s,t)=\Gamma(s)\big(1-P(s,t)\big). align
Definition. [Information Dimension renyi1959dimension] Consider a real-valued random variable $\xi\inR$ and the discretization $\xi_n=(1/n)\cdot [n\xi]$, where $[x]$ preserves only the integral part of $x$. For example, $[3.42]=3$. Under suitable conditions, the information dimension $d(\xi)$ exists and is given by align d(\xi)=\lim_{n\to\infty}\mathbb{H_{0}(\xi_n)}{\log n} align where align H_{0}(\eta):=\sum_{k=1}^{\infty}q_k\log1{q_k} align is the Shannon entropy for a discrete random variable $\eta$ with probabilities $q_k$ for $k=1,2,\dots$.
Definition. [$d(\xi)$-dimensional Entropy renyi1959dimension] If the information dimension $d(\xi)$ exists, the $d(\xi)$-dimensional entropy is defined as align H_{d(\xi)}=\lim_{n\to\infty}(H_{0}(\xi_n)-d(\xi)\log n) align
Definition. [$d(\xi)$-dimensional Entropy for Mixed Measures renyi1959dimension] Let $\xi$ be a random variable with probability measure align P_{\xi}=(1-d)\cdotP_0+d\cdotP_1 align where $P_0$ is discrete and $P_1$ is absolutely continuous with respect to the Lebesgue measure. Then the information dimension $d(\xi)=d$, and the $d(\xi)$-dimensional entropy is given by align H_{d(\xi)}(\xi)=(1-d)\cdot\sum_{k=1}^{\infty}p_k\log1{p_k}-d\cdot\int_{R}d\mathbb{P_1}{d\lambda}(x)\log\bigg(d\mathbb{P_1}{d\lambda}(x)\bigg)d\lambda(x)+d\log1{d}+(1-d)\log1{1-d} align where $\lambda$ is the Lebesgue measure.
Remark. The probability density function of the Generalized Gaussian distribution can also be written as align f_{GN_{p}(\mu,\sigma)}(x)=p{2\alpha\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{\alpha^p}\bigg) align where $\alpha:=p^{1/p}\sigma$. We choose the particular presentation in Definition~def:generalizedgaussiandist for its connection to the family of $L_p$-norm spherical distributions GUPTA1997241.
Remark. When $p=1$, the Generalized Gaussian distribution reduces to the Laplace distribution $L(\mu,\sigma)$ with probability density function align f_{GN_{1}(\mu,\sigma)}(x)=f_{L(\mu,\sigma)}(x)=1{2\sigma}\exp\bigg(-|x-\mu|{\sigma}\bigg) align If $p=2$, we recover the Gaussian distribution $N(\mu,\sigma^2)$ align f_{GN_{2}(\mu,\sigma)}(x)=f_{N(\mu,\sigma^2)}(x)=1{\sigma2\pi}\exp\bigg(-|x-\mu|^2{2\sigma^2}\bigg) align
Proof. See proof:proofoftruncatedgeneralizedgaussmaxent.
Proof. By Lemmalemma:absolutecontofrggwithrespecttomixed, $P_X \ll\nu$ so the Radon–Nikodym derivative $dP_{X}/d\nu$ exists and it suffices to show that for any $A\subseteqB(R)$ we have align P_{X}(A)=\int_Ad\mathbb{P_{X}}{d\nu}d\nu align We start by expanding the integral with respect to a sum of measures align \int_Ad\mathbb{P_{X}}{d\nu}d\nu=\int_Ad\mathbb{P_{X}}{d\nu}d\delta_0+\int_Ad\mathbb{P_{X}}{d\nu}d\lambda align By the property of the Dirac measure, we have align \int_Ad\mathbb{P_{X}}{d\nu}d\delta_0=d\mathbb{P_{X}}{d\nu}(0)\delta_0(A)=f_{RGN_p(\mu,\sigma)}(0)\delta_0(A) align We observe that $1_{{0}}(x)=1$ and $1_{(0,\infty)}(0)=0$. So we have align f_{RGN_p(\mu,\sigma)}(0)\delta_0(A)=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot\delta_0(A) align Now the second term can be expanded as align \int_Ad\mathbb{P_{X}}{d\nu}d\lambda&=\int_A\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot1_{{0}}(x)d\lambda(x)+\int_Ap^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)\cdot1_{(0,\infty)}(x)d\lambda(x)\ &=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot\int_A1_{{0}}(x)d\lambda(x)+\int_Ap^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)\cdot1_{(0,\infty)}(x)d\lambda(x) align where the term align \int_A1_{{0}}(x)d\lambda(x)=\lambda(A\cap{0})=0 align simply vanishes. Thus we are left with align \int_Ad\mathbb{P_{X}}{d\nu}d\lambda&=\int_Ap^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)\cdot1_{(0,\infty)}(x)d\lambda(x)\ &=\int_{A\cap(0,\infty)}p^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)d\lambda(x)\ &=\int_{A\cap(0,\infty)}d\mathbb{P_{GN_p(\mu,\sigma)}}{d\lambda}(x)d\lambda(x)\ &=P_{GN_p(\mu,\sigma)}(A\cap(0,\infty)) align By Definitiondef:trungengaussmeasuredef, the Truncated Generalized Gaussian probability measure is given by align P_{TGN_p(\mu,\sigma)}(A)&=\mathbb{P_{GN_p(\mu,\sigma)}(A\cap(0,\infty))}{P_{GN_p(\mu,\sigma)}((0,\infty))}\ &=\mathbb{P_{GN_p(\mu,\sigma)}(A\cap(0,\infty))}{1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)}\ align Thus we have the identity align P_{GN_p(\mu,\sigma)}(A\cap(0,\infty))&=1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdotP_{TGN_p(\mu,\sigma)}(A) align Putting everything together, we arrive at align \int_Ad\mathbb{P_{X}}{d\nu}d\nu&=\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\cdot\delta_0(A)+\bigg(1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\bigg)\cdotP_{TGN_p(\mu,\sigma)}(A)\ &=P_{X}(A) align Thus we have proven the form of the Radon–Nikodym Derivative.
Proof. We can simplify the expression as align 1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)=1-\Phi_{GN_p(\mu,\sigma)}(0)=\Phi_{GN_p(\mu,\sigma)}(0)=\int_{-\infty}^0p^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)dx align So we have align \bigg(1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)\bigg)1{Z_{(0,\infty)}(\mu,\sigma,p)}=\int_{-\infty^0p^{1-1/p}{2\sigma\Gamma(1/p)}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)dx}{\int_{0}^{\infty}\exp\bigg(-|x-\mu|^p{p\sigma^p}\bigg)dx}=p^{1-1/p}{2\sigma\Gamma(1/p)} align where the extra terms cancel out due to symmetry around $0$. Thus we have recovered the forms in Definition~def:rectifiedgeneralizedgaussiandist.
Proof. Let $Z\sim GN_{p}(\mu,\sigma)$ with density align f_Z(z)=p^{1-\frac1p}{2\sigma\Gamma(1/p)} \exp!\Big(-|z-\mu|^p{p\sigma^p}\Big). align If $X=ReLU(Z)$, then we know $X\simRGN_{p}(\mu,\sigma)$. Thus for any $k\in{1,2}$, we have align E[X^k]=E[Z^k1_{(0,\infty)}(Z)]=\int_{0}^{\infty} z^k f_Z(z),dz . align To simplify notations, let's denote $C:=p^{1-(1/p)}/(2\sigma\Gamma(1/p))$, $a:=1/(p\sigma^p)$, and $t_0:=a|\mu|^p=|\mu|^p/(p\sigma^p)$. Then align E[X^k]=C\int_{0}^{\infty} z^k \exp!\big(-a|z-\mu|^p\big),dz. align Define the change of variables $t=z-\mu$. Thus we have $z=t+\mu$ and $z\ge 0!\iff!t\ge -\mu$. Rewrite the integral as align E[X^k]=C\int_{-\mu}^{\infty} (t+\mu)^k \exp!\big(-a|t|^p\big),dt. align Let's define the three auxiliary integrals align I_0&:=\int_{-\mu}^{\infty} e^{-a|t|^p},dt\ I_1&:=\int_{-\mu}^{\infty} t,e^{-a|t|^p},dt\ I_2&:=\int_{-\mu}^{\infty} t^2,e^{-a|t|^p},dt. align Then we can rewrite eq:shift for $k=1,2$ as align E[X] &= C(\mu I_0 + I_1),\ E[X^2] &= C(\mu^2 I_0 + 2\mu I_1 + I_2). align Now we just need to compute $I_0$, $I_1$, and $I_2$. By Lemmalemma:I0integraleq, Lemmalemma:I1integraleq, and Lemma~lemma:I2integraleq, we have align I_0&=1{p}a^{-1/p}\Gamma!\Big(1{p}\Big)\Big(1+sgn(\mu),P!\Big(1{p},t_0\Big)\Big) \ I_1&=1{p}a^{-2/p}\Gamma!\Big(2{p},t_0\Big) \ I_2&=1{p}a^{-3/p}\Gamma!\Big(3{p}\Big)\Big(1+sgn(\mu),P!\Big(3{p},t_0\Big)\Big). align So we can substitute and get the expression for $E[X]$ as align E[X]&= C(\mu I_0 + I_1)\ &= C\mu\cdot 1{p}a^{-1/p}\Gamma!\Big(1{p}\Big)\Big(1+sgn(\mu)P(1/p,t_0)\Big)+ C\cdot 1{p}a^{-2/p}\Gamma!\Big(2{p},t_0\Big) \ &=1{2}\mu\Big(1+sgn(\mu)P!\Big(1{p},t_0\Big)\Big)+1{2}p^{1/p}\sigma,\Gamma(2/p,t_0){\Gamma(1/p)}\ &=1{2}\bigg[\mu\bigg(1+sgn(\mu)P\bigg(1{p},|\mu|^p{p\sigma^p}\bigg)\bigg)+p^{1/p}\sigma\Gamma(2/p, |\mu|^p/(p\sigma^p)){\Gamma(1/p)}\bigg] align Similarly, the second moment is given by align E[X^2]&= C(\mu^2 I_0 + 2\mu I_1 + I_2)\ &=C\mu^2 I_0 + 2C\mu I_1 + C I_2\ &= C\mu^2\cdot 1{p}a^{-1/p}\Gamma!\Big(1{p}\Big)\Big(1+sgn(\mu)P(1/p,t_0)\Big) + 2C\mu\cdot 1{p}a^{-2/p}\Gamma!\Big(2{p},t_0\Big)\ &+ C\cdot 1{p}a^{-3/p}\Gamma!\Big(3{p}\Big)\Big(1+sgn(\mu)P(3/p,t_0)\Big)\ &=1{2}\mu^2\Big(1+sgn(\mu)P!\Big(1{p},t_0\Big)\Big)+1{2}\Big(2\mu,p^{1/p}\sigma,\Gamma(2/p,t_0){\Gamma(1/p)}\Big)\ &+1{2}p^{2/p}\sigma^2\Big(1+sgn(\mu)P!\Big(3{p},t_0\Big)\Big)\ &=1{2}\bigg[\mu^2\bigg(1+sgn(\mu),P\bigg(1{p},|\mu|^p{p\sigma^p}\bigg)\bigg)+2\mu p^{1/p}\sigma\Gamma(2/p, |\mu|^p/(p\sigma^p)){\Gamma(1/p)}\ &+p^{2/p}\sigma^2\bigg(1+sgn(\mu),P \bigg(3{p},|\mu|^p{p\sigma^p}\bigg)\bigg) \bigg],\ align Thus we have proven the expression.
Proof. If $\mu\ge 0$, then $-\mu\le 0$. So we can split the integral at $0$ and get: align I_0=\int_{-\mu}^{0} e^{-a|t|^p},dt+\int_{0}^{\infty} e^{-a t^p},dt =\int_{0}^{\mu} e^{-a s^p},ds+\int_{0}^{\infty} e^{-a t^p},dt. align Applying eq:lower-gamma-id with $b=0$ to the first term and eq:upper-gamma-id with $u=0$ to the second term gives us align I_0=1{p}a^{-1/p}\gamma!\Big(1{p},t_0\Big)+1{p}a^{-1/p}\Gamma!\Big(1{p}\Big) =1{p}a^{-1/p}\Gamma!\Big(1{p}\Big)\Big(1+P!\Big(1{p},t_0\Big)\Big). align Now if $\mu<0$, then $-\mu=|\mu|>0$ and we have align I_0=\int_{|\mu|}^{\infty} e^{-a t^p},dt =1{p}a^{-1/p}\Gamma!\Big(1{p},t_0\Big) =1{p}a^{-1/p}\Gamma!\Big(1{p}\Big)\Big(1-P!\Big(1{p},t_0\Big)\Big). align Combining both cases, we arrive at align I_0=1{p}a^{-1/p}\Gamma!\Big(1{p}\Big)\Big(1+sgn(\mu),P!\Big(1{p},t_0\Big)\Big). align
Proof. Let $z\sim\prod_{i=1}^{d}GN_{p}(\mu,\sigma)$ be a Generalized Gaussian random vector in $d$ dimensions and $x=ReLU(z)$, or equivalently, $x\sim\prod_{i=1}^{d}RGN_{p}(\mu,\sigma)$. By construction, we have independence between dimensions. Thus align |x|0=\sum{i=1}^{d}1_{x_i>0}=\sum_{i=1}^{d}1_{z_i>0} align So we have the expectation given by align E[|x|0]=\sum{i=1}^{d}E[1_{z_i>0}]=\sum_{i=1}^{d}P(z_i>0)=\sum_{i=1}^{d}\Phi_{GN_p(0,1)}\bigg(\mu{\sigma}\bigg)=d\cdot \Phi_{GN_p(0,1)}\bigg(\mu{\sigma}\bigg) align where the CDF defined in Definition~def:generalizedgaussiandist evaluates to align \Phi_{GN_p(0,1)}\bigg(\mu{\sigma}\bigg)=1{2}\bigg(1+sgn\bigg(\mu{\sigma}\bigg)P\bigg(1{p},|\mu/\sigma|^p{p}\bigg)\bigg) align Thus align E[|x|_0]=d{2}\bigg(1+sgn\bigg(\mu{\sigma}\bigg)P\bigg(1{p},|\mu/\sigma|^p{p}\bigg)\bigg) align
Proof. We can form the Lagrangian functional of the constrained optimization problem as align J[p]=-\int_{S} p(x)\ln p(x) dx+\lambda_0\bigg(\int_{S} p(x) dx-1\bigg)+\sum_{i=1}^{m}\lambda_i\bigg(\int_{S}r_i(x) p(x)dx-\alpha_i\bigg) align where $\lambda_0,\lambda_1,\dots,\lambda_m\inR$ are Lagrange multipliers. Let $p$ be a maximizer that is strictly positive almost everywhere (a.e.) over $S$. We denote $\delta p$ as an arbitrary integrable perturbation supported on $S$ such that $p+\epsilon\delta p\geq0$ for sufficiently small $|\epsilon|$. Thus the Gateaux derivative of $J$ in the direction of $\delta p$ is given by align \left.d{d\epsilon}J[p+\epsilon\delta p]\right|{\epsilon=0}&=-\int{S}\left.d{d\epsilon}\bigg[(p(x)+\epsilon\delta p(x))\ln (p(x)+\epsilon\delta p(x))\bigg]dx\right|{\epsilon=0}+\lambda_0\bigg(\int{S}\left.d{d\epsilon} \bigg[p(x)+\epsilon\delta p(x)\bigg] dx\right|{\epsilon=0}\bigg)\ &+\sum{i=1}^{m}\lambda_i\bigg(\int_{S}\left.d{d\epsilon}\bigg[r_i(x) (p(x)+\epsilon\delta p(x)))dx\bigg]\right|{\epsilon=0}\bigg)\ &=-\int{S}\left.\bigg[\delta p(x)\ln (p(x)+\epsilon\delta p(x))+\delta p(x)\bigg]dx\right|{\epsilon=0}+\lambda_0\bigg(\int{S} 1\cdot\delta p(x)dx\bigg)\ &+\sum_{i=1}^{m}\lambda_i\bigg(\int_{S} r_i(x)\delta p(x)dx\bigg)\ &=-\int_{S}\bigg[\ln p(x)+1\bigg]\delta p(x)dx+\bigg(\int_{S} \lambda_0\cdot\delta p(x)dx\bigg)+\bigg(\int_{S}\sum_{i=1}^{m}\lambda_i r_i(x)\delta p(x)dx\bigg)\ align Thus the functional derivative is align \deltaJ{\delta p}&=-\ln p(x)-1+\lambda_0+\sum_{i=1}^{m}\lambda_i r_i(x) align Since this expression must vanish for all admissible perturbations $\delta p$, we get $\deltaJ{\delta p}=0$ almost everywhere on $S$. Solving for $p$ yields align p(x)&=\exp\bigg(\lambda_0-1+\sum_{i=1}^m \lambda_i r_i(x)\bigg) align Absorbing the constant terms into $Z_{S}(\lambda)$, we end up with align p(x)&=1{Z_{S}(\lambda)} \exp\bigg(\sum_{i=1}^m \lambda_i r_i(x)\bigg)\cdot1_S(x) align
Proof. By Lemma~lemma:universalmaxentcontmultivariatelemma, the target distribution has the form of align p(x)&=1{Z_S(\lambda_1)}\exp(\lambda_1|x|_p^p)\cdot1_S(x)\ &=1{Z_S(\lambda_1)}\exp\bigg(-|x|p^p{p\sigma^p}\bigg)\cdot1_S(x) align where we choose $\lambda_1=-1{p\sigma^p}$ which satisfies the constraint $\lambda_1<0$ for integration. Thus we have recovered the zero-mean Generalized Gaussian distribution with scale parameter $\sigma$. Now notice that align d{d\lambda_1}\log Z_S(\lambda_1)&=1{Z_S(\lambda_1)}d{d\lambda_1}Z_S(\lambda_1)\&=1{Z_S(\lambda_1)}\int{S}d{d\lambda_1}\exp(\lambda_1|x|p^p)dx\&=1{Z_S(\lambda_1)}\int{S}|x|_p^p\exp(\lambda_1|x|p^p)dx\ &=\int{S}|x|_p^p1{Z_S(\lambda_1)}\exp\bigg(-|x|_p^p{p\sigma^p}\bigg)\cdot1_S(x)dx\ &=E[|x|_p^p] align Thus we also obtain the constraint as $E[|x|p^p] = d{d\lambda_1}\log Z{S}(\lambda_1)$.
Proof. By Proposition~proposition:truncatedgeneralizedgaussmaxent, the target distribution has the form of align p(x)=1{Z_S(\lambda_1)}\exp\bigg(-|x|p^p{p\sigma^p}\bigg) \cdot1_S(x) align If $S=R^d$, then the normalization constant becomes align 1{Z_S(\lambda_1)}=1{Z{R^d}(\lambda_1)}=p^{d-d/p}{(2\sigma)^d\Gamma(1/p)^d} align According to Dytso2018, we know that $E[|x_i|^p]=\sigma^p$. Thus align E[|x|p^p] = d{d\lambda_1}\log Z{S}(\lambda_1)=d\sigma^p align
Proof. By Lemma~lemma:universalmaxentcontmultivariatelemma, the target distribution has the form of align p(x)\propto \exp(\lambda_1|x|_1) align with the constraint $\lambda_1<0$ for integration. After normalization, we obtain align p(x)=\bigg(-\lambda_1{2}\bigg)^d\exp(\lambda_1|x|_1) align By a change of variable $b=-1/\lambda_1$, we arrive at align p(x)=\bigg(1{2b}\bigg)^d\exp\bigg(-|x|_1{b}\bigg) align Thus we have recovered the zero-mean product Laplace distribution with scale parameter $b$.
Proof. Notice that the vector-valued mean constraint and matrix-valued covariance constraint can be factorized as a collection of scalar-valued constraints align E[x_i]=\mu_i,\quadE[x_i x_j]=\Sigma_{ij}+\mu_i\mu_j, \quad\forall\space i,j\in{1,\cdots, d} align By Lemmalemma:universalmaxentcontmultivariatelemma, the maximum entropy distribution has the form align p(x)&\propto\exp\bigg(\sum_{i=1}^{d}\lambda_i x_i+\sum_{i=1}^{d}\sum_{j=1}^{d}\Lambda_{ij}x_ix_j\bigg)\ &=\exp\bigg(\lambda^\topx+x^\top\Lambdax\bigg)\ &\propto\exp\bigg(-1{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\bigg) align for $\lambda=\Sigma^{-1}\mu$ and $\Lambda=-1{2}\Sigma^{-1}$, which is the multivariate Gaussian distribution up to normalizations. When $\mu=0$ and $\Sigma=I$, the density function trivially evaluates to align p(x)&\propto\exp\bigg(-1{2}|x|_2^2\bigg) align which is the maximum-entropy distribution under the expected $\ell_2$ norm constraints based on Lemmalemma:universalmaxentcontmultivariatelemma.
Proof. Since $\xi\sim\prod_{i=1}^{D}RGN_p(\mu,\sigma)$ where each $\xi_i\simRGN_p(\mu,\sigma)$ are i.i.d, it's trivial that align d(\xi)&=D\cdot d(\xi_i)\ H_{d(\xi)}(\xi)&=\sum_{i=1}^{D}H_{d(\xi_i)}(\xi_i)=D\cdot H_{d(\xi_i)}(\xi_i) align for all $i$ by independence. In sec:genoftotalcorrelationundernuent, we also present an alternative interpretation of the $d(\xi)$-dimensional entropy that enables the decomposition of the joint entropy $H_{d(\xi)}(\xi)$ into the sums of the marginals $H_{d(\xi_i)}(\xi_i)$ under the independence assumption. Thus it suffices to prove in the univariate case. By Definitiondef:measuretheoreticaldefofrgg and Definitiondef:ddimentformixedmeasuresdef, we know that the information dimension is given by align d(\xi_i)=1-\Phi_{GN_p(0,1)}\bigg(-\mu{\sigma}\bigg)=\Phi_{GN_p(0,1)}\bigg(\mu{\sigma}\bigg) align We observe that $P_0$ in Definitiondef:ddimentformixedmeasuresdef correspond to the Dirac measure $\delta_0$ in Definitiondef:measuretheoreticaldefofrgg. Thus align (1-d(\xi_i))\cdot\sum_{k=1}^{\infty}p_k\log1{p_k}=(1-d(\xi_i))\cdot(1\log 1)=0 align Now we can define a Bernoulli gating random variable align 1_{(0,\infty)}(\xi_i)=cases 1, if \xi_i\in (0,\infty)\implies with probability d(\xi_i)\ 0, if \xi_i\notin (0,\infty)\implies with probability 1-d(\xi_i) cases align It's well known that the Shannon entropy for a Bernoulli random variable is align H_0(1_{(0,\infty)}(\xi_i))&=d(\xi_i)\log1{d(\xi_i)}+(1-d(\xi_i))\log1{1-d(\xi_i)} align Thus by Definitiondef:ddimentformixedmeasuresdef, the $d(\xi_i)$-dimensional entropy is align H_{d(\xi_i)}(\xi_i)&=0-d(\xi_i)\cdot\int_{R}d\mathbb{P_{TGN_p(\mu,\sigma)}}{d\lambda}(x)\log\bigg(d\mathbb{P_{TGN_p(\mu,\sigma)}}{d\lambda}(x)\bigg)d\lambda(x)+H_0(1_{(0,\infty)}(\xi_i))\ &=\Phi_{GN_p(0,1)}\bigg(\mu{\sigma}\bigg)\cdotH_{1}(TGN_p(\mu,\sigma))+H_{0}(1_{(0,\infty)}(\xi_i)) align So we have proven the expression in Theoremtheorem:renyiinfoentropyofrecgengauss.
Proof. Let $\varphi:R\to[0,\infty)$ be the rectification map $\varphi(x):=\max(0,x).$ Then $P_Z$ is the pushforward of $P_X$ by $\varphi$, i.e. for any Borel set $B\inB([0,\infty))$, align P_Z(B)=P_X(\varphi^{-1}(B)). align We can write $\varphi^{-1}(B)$ as align \varphi^{-1}(B)=\big(\varphi^{-1}(B)\cap(-\infty,0]\big);\cup;\big(\varphi^{-1}(B)\cap(0,\infty)\big), align For $x\in(-\infty,0]$, $\varphi(x)=0$. So we have align \varphi^{-1}(B)\cap(-\infty,0]= cases (-\infty,0], & 0\in B,\ \emptyset, & 0\notin B. cases align Now for $(0,\infty)$, $\varphi(x)=x$. So we have align \varphi^{-1}(B)\cap(0,\infty)=B\cap(0,\infty). align Combining these together, we arrive at align P_Z(B)=P_X(\varphi^{-1}(B))=P_X(B\cap(0,\infty))+P_X((-\infty,0])\cdot\delta_0(B) align where $\delta_0(B)$ is the Dirac measure in Definition~def:diracmeasuredef that evaluates to $1$ if $0\in B$ and $0$ otherwise. Let $d:=P(X>0)=P_X((0,\infty))$. So trivially, $1-d=P(X\le 0)=P_X((-\infty,0])$. By the definition of the conditional measure, we have that for any $A\inB(R)$, align P_{X\mid(0,\infty)}(A):=\mathbb{P_X(A\cap(0,\infty))}{P_X((0,\infty))}=\mathbb{P_X(A\cap(0,\infty))}{d}, \qquad A\inB(R). align Then for every $B\inB([0,\infty))$, align P_Z(B)=d,P_{X\mid(0,\infty)}(B)+(1-d),\delta_0(B) align Thus we have proven the expression of the probability measure. Now if $A\subset(0,\infty)$ is Borel, then $A\cap(0,\infty)=A$ and we have $P_{X\mid(0,\infty)}(A)=P_X(A)/d$.
Proof. Following the same arguments in Lemmalemma:absolutecontofrggwithrespecttomixed, we know that $P_Z$ is absolutely continuous with respect to $\nu$, i.e. $P_Z\ll \nu$. Again, following the same arguments in Lemmalemma:radonnikodymderivativeofrgg, we observe that for any Borel $A\subset [0,\infty)$ align \int_Ad\mathbb{P_{Z}}{d\nu}d\nu&=\int_Ad\mathbb{P_{Z}}{d\nu}d\delta_0+\int_Ad\mathbb{P_{Z}}{d\nu}d\lambda align Notice that align \int_Ad\mathbb{P_{Z}}{d\nu}d\delta_0=d\mathbb{P_{Z}}{d\nu}(0)\delta_0(A)=(1-d)\cdot\delta_0(A) align and align \int_Ad\mathbb{P_{Z}}{d\nu}d\lambda&=\int_A(1-d)\cdot1_{{0}}(x)d\lambda(x)+\int_Ad\cdotd\mathbb{P_{X|(0,\infty)}}{d\lambda}(x)\cdot1_{(0,\infty)}(x)d\lambda(x)\ &=(1-d)\cdot\int_A1_{{0}}(x)d\lambda(x)+\int_{A\cap (0,\infty)}d\cdotd\mathbb{P_{X|(0,\infty)}}{d\lambda}(x)\cdot d\lambda(x)\ &=0+d\cdot\int_{A\cap (0,\infty)} dP_{X|(0,\infty)}(x) \ &=d\cdotP_{X|(0,\infty)}(A) align Putting everything together, we have align \int_Ad\mathbb{P_{Z}}{d\nu}d\nu&=(1-d)\cdot\delta_0(A)+d\cdotP_{X|(0,\infty)}(A)=P_{Z}(A) align Thus we have shown that the Radon–Nikodym derivative is correct.
Proof. We start by expanding the integral align H_{\nu}(Z)&=-\int d\mathbb{P_{Z}}{d\nu}\log\bigg(d\mathbb{P_{Z}}{d\nu}\bigg)d\nu\ &=-\int d\mathbb{P_{Z}}{d\nu}\log\bigg(d\mathbb{P_{Z}}{d\nu}\bigg)d\delta_0-\int d\mathbb{P_{Z}}{d\nu}\log\bigg(d\mathbb{P_{Z}}{d\nu}\bigg)d\lambda align By the property of the Dirac measure, we have align -\int d\mathbb{P_{Z}}{d\nu}(x)\log\bigg(d\mathbb{P_{Z}}{d\nu}(x)\bigg)d\delta_0(x)=-d\mathbb{P_{Z}}{d\nu}(0)\log\bigg(d\mathbb{P_{Z}}{d\nu}(0)\bigg)=-(1-d)\log(1-d) align Lemmalemma:generalradonnikodymunderrectifications tells us that align d\mathbb{P_{Z}}{d\nu}(x)=(1-d)\cdot1_{{0}}(x)+d\cdotd\mathbb{P_{X|(0,\infty)}}{d\lambda}(x)\cdot1_{(0,\infty)}(x) align Due to the term $1_{{0}}(x)$, any Lebesgue measure evaluates to $0$ since ${0}$ is a Lebesgue measure zero set. So effectively, we can write align -\int d\mathbb{P_{Z}}{d\nu}\log\bigg(d\mathbb{P_{Z}}{d\nu}\bigg)d\lambda &=-\int d\cdotd\mathbb{P_{X|(0,\infty)}}{d\lambda}(x)\cdot1_{(0,\infty)}(x)\log(d\cdotd\mathbb{P_{X|(0,\infty)}}{d\lambda}(x)\cdot1_{(0,\infty)}(x))d\lambda(x)\ &=-\int d\cdotd\mathbb{P_{X|(0,\infty)}}{d\lambda}(x)\cdot1_{(0,\infty)}(x)\log(d\mathbb{P_{X|(0,\infty)}}{d\lambda}(x)\cdot1_{(0,\infty)}(x))d\lambda(x)\&-\int d\cdotd\mathbb{P_{X|(0,\infty)}}{d\lambda}(x)\cdot1_{(0,\infty)}(x)\log(d)d\lambda(x)\ &=d\cdotH_1(P_{X|(0,\infty)})-d\log(d) align Combing the terms together, we have align H_{\nu}(Z)&=d\cdotH_1(P_{X|(0,\infty)})-d\log(d)-(1-d)\log(1-d)\ &=d\cdotH_1(P_{X|(0,\infty)})+d\log\bigg(1{d}\bigg)+(1-d)\log\bigg(1{d-1}\bigg) align By Definitiondef:ddimentformixedmeasuresdef, the information dimension $d(Z)=d$. Notice how $H_0(\delta_0)=0$. So we have align H_{\nu}(Z)&=d(Z)\cdotH_1(P_{X|(0,\infty)})+d(Z)\log\bigg(1{d(Z)}\bigg)+(1-d(Z))\log\bigg(1{d(Z)-1}\bigg)=H_{d(Z)}(Z). align
Proof. Let $\Sigma=Cov[z]$ and let $\Sigma=U\LambdaU^\top$ be its eigendecomposition, where $U=[u_1,\dots,u_d]$ is orthonormal and $\Lambda=diag(\lambda_1,\dots,\lambda_d)$. Since $y\sim\prod_{i=1}^{d}RGN_{p}(\mu,\sigma)$ has i.i.d. coordinates with variance $\gamma:=Var[RGN_{p}(\mu,\sigma)]$, its covariance satisfies $Cov[y]=\gammaI_d$. Hence, for any vector $u_i$ such that $|u_i|_2=1$, align Var(u_i^\topy) = u_i^\top (\gammaI_d)u_i = \gamma\cdot|u_i|_2^2 = \gamma. align By the assumption $u_i^\top zd{=}u_i^\top y$ for all $i$, the variances of the one-dimensional projected marginals are equal, i.e. align Var(u_i^\top z) = Var(u_i^\top y) = \gamma \qquad \forall i. align On the other hand, for each eigenvector $u_i$, align Var(u_i^\top z) = u_i^\top \Sigmau_i= \lambda_i\cdot |u_i|_2^2 = \lambda_i, align where $\lambda_i$ is the $i$-th eigenvalue of $\Sigma$. Therefore $\lambda_i=\gamma$ for all $i$, so $\Lambda=\gamma I_d$. Substituting back into the eigendecomposition yields align \Sigma = U\LambdaU^\top = U(\gamma I_d)U^\top = \gamma I_d, align which is a scalar multiple of the identity. Hence all off-diagonal entries of $\Sigma$ vanish and the covariance matrix is isotropic.
Algorithm: algorithm
[tb]
\caption{Simulation of the Rectified Generalized Gaussian Random Variables $\mathcal{RGN}_p(\mu,\sigma)$}
\label{alg:samplingrectifiedgeneralizedgaussian}
\begin{algorithmic}
\STATE {\bfseries Input:} $\ell_p$ parameter $p>0$, location $\mu\in\mathbb{R}$, scale $\sigma>0$
\STATE {\bfseries Output:} sample $Y \sim \mathcal{RGN}_p(\mu,\sigma)$
\STATE Sample $S \sim \mathrm{Unif}\{-1,+1\}$
\STATE Sample $G \sim \mathrm{Gamma}\!\left(\text{shape}=\frac{1}{p},\,\text{rate}=1\right)$
\STATE Set $X \gets \mu + \sigma \, S \cdot (pG)^{1/p}$
\STATE Set $Y \gets \max(0, X)$
\STATE \textbf{return} $Y$
\end{algorithmic}
Algorithm: algorithm
[tb]
\caption{Bisection Search for the Scale Parameter $\sigma$ for Rectified Generalized Gaussian with Unit Variance}
\label{alg:bisection_sigma}
\begin{algorithmic}
\STATE {\bfseries Input:} $\ell_p$ parameter $p>0$, location $\mu\in\mathbb{R}$, tolerance $\varepsilon>0$
\STATE {\bfseries Output:} scale $\sigma^\star>0$ such that $\operatorname{Var}(\mathcal{RGN}_p(\mu,\sigma^\star))\approx 1$ \COMMENT{$\operatorname{Var}(\mathcal{RGN}_p(\mu,\sigma^\star))$ is defined in Proposition~\ref{proposition:meanandvarofrecgengauss}.}
\STATE Define $V(\sigma)\coloneqq \operatorname{Var}(\mathcal{RGN}_p(\mu,\sigma))$
\STATE Define $f(\sigma)\coloneqq V(\sigma)-1$
\STATE Choose initial bounds $\sigma_L>0$ and $\sigma_U>\sigma_L$ such that $f(\sigma_L)<0, f(\sigma_U)>0$
\REPEAT
\STATE $\sigma_M \gets (\sigma_L+\sigma_U)/2$
\IF{$f(\sigma_M)>0$}
\STATE $\sigma_U \gets \sigma_M$
\ELSE
\STATE $\sigma_L \gets \sigma_M$
\ENDIF
\UNTIL{$|\sigma_U-\sigma_L|\le \varepsilon$}
\STATE $\sigma^\star \gets (\sigma_L+\sigma_U)/2$
\STATE \textbf{return} $\sigma^\star$
\end{algorithmic}
| Encoder Acc1 ↑ | Projector Acc1 ↑ | L1 Sparsity ↓ | L0 Sparsity ↓ | ||
|---|---|---|---|---|---|
| Rectified LpJEPA | RGN 1 . 0 (0 ,σ GN ) | 84.72 | 80.4 | 0.2726 | 0.694 |
| RGN 2 . 0 (0 ,σ GN ) | 85.08 | 80 | 0.3412 | 0.7298 | |
| RGN 1 . 0 (0 . 25 ,σ GN ) | 84.98 | 80.76 | 0.3745 | 0.7437 | |
| RGN 2 . 0 (1 . 0 ,σ GN ) | 85.08 | 80.54 | 0.6278 | 0.8668 | |
| RGN 2 . 0 ( - 2 . 5 ,σ GN ) | 82.02 | 67.82 | 0.0137 | 0.0224 | |
| RGN 1 . 0 ( - 3 . 0 ,σ GN ) | 82.72 | 71.88 | 0.0058 | 0.0098 | |
| Sparse Baselines | NVICReg-ReLU | 84.48 | 77.74 | 0.5207 | 0.7117 |
| Sparse Baselines | NCL-ReLU | 82.58 | 76.88 | 0.0037 | 0.0085 |
| Sparse Baselines | NVICReg-RepReLU | 84.2 | 78.18 | 0.4965 | 0.7549 |
| Sparse Baselines | NCL-RepReLU | 82.76 | 76.7 | 0.0024 | 0.0048 |
| Dense Baselines | VICReg | 84.18 | 78.88 | 0.7954 | 1 |
| Dense Baselines | SimCLR | 83.44 | 77.9 | 0.6338 | 1 |
| Dense Baselines | LeJEPA | 84.8 | 79.52 | 0.6365 | 1 |
| Encoder Acc1 ↑ | Projector Acc1 ↑ | L1 Sparsity ↓ | L0 Sparsity ↓ | ||
|---|---|---|---|---|---|
| Rectified LpJEPA | RGN 2 . 0 (0 ,σ GN ) | 66 . 29 | 62 . 15 | 0 . 3773 | 0 . 7357 |
| Rectified LpJEPA | RGN 1 . 0 (0 ,σ GN ) | 65 . 97 | 62 . 22 | 0 . 3019 | 0 . 6474 |
| Rectified LpJEPA | RGN 0 . 75 (0 ,σ GN ) | 65 . 78 | 62 . 80 | 0 . 2583 | 0 . 6099 |
| Rectified LpJEPA | RGN 0 . 50 (0 ,σ GN ) | 66 . 10 | 62 . 74 | 0 . 1996 | 0 . 5727 |
| Rectified LpJEPA | RGN 1 . 0 ( - 2 ,σ GN ) | 64 . 75 | 59 . 08 | 0 . 0236 | 0 . 0489 |
| Sparse Baselines | NCL-ReLU | 66 . 23 | 61 . 88 | 0 . 0228 | 0 . 0503 |
| Sparse Baselines | NVICReg-ReLU | 63 . 76 | 58 . 82 | 0 . 7415 | 0 . 8935 |
| Sparse Baselines | NCL-RepReLU | 66.32 | 61 . 40 | 0.0202 | 0.0426 |
| Sparse Baselines | NVICReg-RepReLU | 63 . 83 | 58 . 53 | 0 . 1551 | 0 . 2657 |
| Dense Baselines | SimCLR | 66 . 00 | 61 . 95 | 0 . 6364 | 1 . 0000 |
| Dense Baselines | VICReg | 63 . 78 | 58 . 82 | 0 . 8660 | 1 . 0000 |
| Dense Baselines | LeJEPA | 65 . 65 | 62 . 69 | 0 . 6379 | 1 . 0000 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 11.86 | 65.62 | 24.95 | 5.09 | 24.82 | 29.74 | 27.01 |
| Non-negative SimCLR | 11.17 | 67.67 | 24.23 | 5.59 | 24.44 | 19.71 | 25.47 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 9.89 | 68.31 | 26.27 | 4.21 | 25.98 | 17.66 | 25.39 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 9.47 | 67.19 | 23.58 | 4.85 | 24.93 | 16.60 | 24.44 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 12.66 | 66.36 | 23.91 | 10.18 | 24.88 | 25.18 | 27.20 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 9.31 | 50.39 | 8.35 | 6.13 | 10.63 | 21.72 | 17.76 |
| RGN 2 . 0 (0 ,σ GN ) | 14.04 | 69.47 | 24.09 | 4.86 | 25.68 | 24.34 | 27.08 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 12.82 | 65.97 | 24.12 | 4.91 | 24.44 | 20.20 | 25.41 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 13.88 | 59.62 | 15.88 | 7.84 | 16.21 | 24.97 | 23.07 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 11.06 | 55.71 | 11.42 | 8.18 | 12.92 | 12.13 | 18.57 |
| Dense baselines | |||||||
| VICReg | 10.85 | 63.98 | 21.52 | 6.23 | 25.29 | 23.33 | 25.20 |
| SimCLR | 12.82 | 66.90 | 23.93 | 10.60 | 24.99 | 24.12 | 27.23 |
| LeJEPA | 15.85 | 68.07 | 24.57 | 5.79 | 23.72 | 24.34 | 27.06 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 8.62 | 33.47 | 7.51 | 3.25 | 8.78 | 17.28 | 13.15 |
| Non-negative SimCLR | 6.70 | 41.71 | 9.24 | 2.91 | 9.17 | 15.37 | 14.18 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 6.81 | 48.96 | 11.46 | 2.03 | 12.09 | 16.38 | 16.29 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 7.82 | 47.14 | 11.20 | 2.75 | 11.43 | 14.94 | 15.88 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 10.37 | 38.36 | 9.13 | 5.56 | 9.53 | 20.93 | 15.65 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 5.69 | 32.25 | 5.16 | 2.70 | 5.76 | 19.02 | 11.76 |
| RGN 2 . 0 (0 ,σ GN ) | 10.59 | 47.74 | 10.61 | 2.70 | 11.46 | 20.50 | 17.27 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 9.84 | 46.97 | 12.33 | 3.07 | 11.70 | 19.38 | 17.21 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 9.57 | 35.08 | 6.59 | 4.49 | 7.41 | 20.69 | 13.97 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 9.20 | 18.36 | 2.64 | 2.33 | 3.64 | 7.14 | 7.22 |
| Dense baselines | |||||||
| VICReg | 8.09 | 39.35 | 6.74 | 3.04 | 9.34 | 19.49 | 14.34 |
| SimCLR | 12.39 | 48.05 | 11.52 | 7.24 | 14.16 | 21.34 | 19.12 |
| LeJEPA | 8.30 | 48.17 | 11.27 | 4.47 | 11.84 | 18.32 | 17.06 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 37.23 | 77.88 | 47.32 | 31.16 | 48.33 | 55.49 | 49.57 |
| Non-negative SimCLR | 40.11 | 79.11 | 50.35 | 23.96 | 50.60 | 50.37 | 49.08 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 41.91 | 79.58 | 49.67 | 24.93 | 49.97 | 55.63 | 50.28 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 38.19 | 78.98 | 48.76 | 32.87 | 48.51 | 55.41 | 50.45 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 40.32 | 79.25 | 45.86 | 22.72 | 46.51 | 56.91 | 48.60 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 19.63 | 66.93 | 24.45 | 8.25 | 26.12 | 32.76 | 29.69 |
| RGN 2 . 0 (0 ,σ GN ) | 40.90 | 80.15 | 50.69 | 29.76 | 50.34 | 53.07 | 50.82 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 39.63 | 79.00 | 47.51 | 26.85 | 47.39 | 55.60 | 49.33 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 30.80 | 73.88 | 34.61 | 12.03 | 35.56 | 44.92 | 38.63 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 15.27 | 68.30 | 24.84 | 7.64 | 29.79 | 26.63 | 28.74 |
| Dense baselines | |||||||
| VICReg | 38.35 | 76.25 | 45.63 | 26.52 | 48.30 | 52.19 | 47.88 |
| SimCLR | 41.70 | 77.88 | 46.87 | 31.66 | 49.27 | 49.74 | 49.52 |
| LeJEPA | 38.67 | 79.06 | 49.02 | 30.18 | 49.34 | 53.88 | 50.03 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 22.50 | 43.24 | 10.56 | 12.86 | 12.65 | 32.92 | 22.46 |
| Non-negative SimCLR | 23.09 | 48.38 | 17.73 | 8.72 | 14.97 | 32.87 | 24.29 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 24.47 | 58.06 | 22.40 | 9.97 | 19.73 | 40.17 | 29.13 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 25.90 | 57.61 | 22.39 | 11.40 | 21.24 | 38.18 | 29.45 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 23.83 | 44.69 | 15.32 | 9.19 | 14.65 | 33.31 | 23.50 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 18.19 | 36.39 | 9.03 | 7.46 | 8.67 | 26.17 | 17.65 |
| RGN 2 . 0 (0 ,σ GN ) | 22.82 | 57.93 | 21.26 | 12.07 | 19.03 | 36.90 | 28.33 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 27.39 | 57.32 | 24.07 | 13.06 | 21.54 | 40.01 | 30.57 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 22.23 | 40.83 | 12.58 | 8.93 | 12.01 | 29.38 | 20.99 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 9.89 | 23.81 | 5.03 | 6.02 | 6.15 | 10.08 | 10.16 |
| Dense baselines | |||||||
| VICReg | 23.09 | 41.20 | 10.24 | 10.60 | 13.55 | 32.49 | 21.86 |
| SimCLR | 33.09 | 59.56 | 25.91 | 15.68 | 28.61 | 42.00 | 34.14 |
| LeJEPA | 24.95 | 55.76 | 19.55 | 10.72 | 17.17 | 38.98 | 27.85 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 63.56 | 82.68 | 60.40 | 82.55 | 60.39 | 78.63 | 71.37 |
| Non-negative SimCLR | 64.68 | 84.41 | 62.95 | 84.97 | 63.32 | 76.56 | 72.82 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 65.05 | 84.50 | 62.73 | 83.75 | 62.65 | 78.00 | 72.78 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 64.68 | 83.97 | 60.75 | 81.77 | 60.25 | 77.68 | 71.52 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 63.67 | 85.31 | 58.11 | 81.62 | 58.39 | 77.38 | 70.75 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 49.04 | 79.11 | 43.94 | 44.72 | 46.22 | 61.30 | 54.06 |
| RGN 2 . 0 (0 ,σ GN ) | 64.52 | 85.06 | 64.51 | 84.09 | 62.35 | 78.25 | 73.13 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 64.26 | 84.21 | 59.90 | 81.59 | 59.47 | 77.65 | 71.18 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 57.87 | 82.19 | 49.81 | 69.38 | 51.90 | 68.93 | 63.35 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 53.35 | 78.95 | 45.32 | 57.47 | 46.32 | 52.22 | 55.61 |
| Dense baselines | |||||||
| VICReg | 62.77 | 81.47 | 59.23 | 80.96 | 60.44 | 77.11 | 70.33 |
| SimCLR | 64.73 | 82.49 | 60.10 | 80.47 | 61.18 | 71.44 | 70.07 |
| LeJEPA | 63.19 | 83.54 | 62.38 | 83.07 | 61.14 | 78.30 | 71.94 |
| model | DTD | cifar10 | cifar100 | flowers | food | pets | avg. |
|---|---|---|---|---|---|---|---|
| Non-negative VICReg | 35.80 | 47.70 | 14.90 | 32.04 | 15.74 | 44.84 | 31.83 |
| Non-negative SimCLR | 36.86 | 50.24 | 21.92 | 25.57 | 17.70 | 47.02 | 33.22 |
| Our methods | |||||||
| RGN 1 . 0 (0 ,σ GN ) | 41.49 | 63.14 | 29.46 | 39.96 | 26.70 | 52.55 | 42.22 |
| RGN 1 . 0 ( - 1 ,σ GN ) | 42.82 | 63.05 | 32.96 | 39.68 | 28.98 | 55.08 | 43.76 |
| RGN 1 . 0 ( - 2 ,σ GN ) | 39.84 | 48.69 | 19.47 | 29.65 | 18.21 | 44.51 | 33.39 |
| RGN 1 . 0 ( - 3 ,σ GN ) | 28.72 | 37.47 | 11.13 | 12.65 | 9.61 | 36.09 | 22.61 |
| RGN 2 . 0 (0 ,σ GN ) | 41.81 | 62.82 | 29.31 | 37.32 | 24.80 | 53.18 | 41.54 |
| RGN 2 . 0 ( - 1 ,σ GN ) | 45.00 | 63.85 | 34.13 | 42.72 | 30.30 | 56.09 | 45.35 |
| RGN 2 . 0 ( - 2 ,σ GN ) | 34.52 | 43.39 | 15.19 | 23.34 | 14.44 | 40.39 | 28.55 |
| RGN 2 . 0 ( - 3 ,σ GN ) | 21.38 | 24.92 | 6.58 | 7.94 | 7.20 | 16.63 | 14.11 |
| Dense baselines | |||||||
| VICReg | 37.55 | 44.84 | 13.37 | 34.23 | 17.03 | 43.45 | 31.75 |
| SimCLR | 51.65 | 63.71 | 35.10 | 45.73 | 36.06 | 57.78 | 48.34 |
| LeJEPA | 40.05 | 56.81 | 23.42 | 43.31 | 20.70 | 51.19 | 39.25 |
| Method | Encoder Acc1 ↑ | Projector Acc1 ↑ |
|---|---|---|
| Projector Dim. = 512 | Projector Dim. = 512 | Projector Dim. = 512 |
| VICReg | 63.72 | 57.80 |
| LeJEPA | 65.53 | 59.18 |
| RGN 1 . 0 (0 ,σ GN ) | 67.56 | 61.34 |
| RGN 2 . 0 (0 ,σ GN ) | 68.31 | 61.74 |
| Projector Dim. = 2048 | Projector Dim. = 2048 | Projector Dim. = 2048 |
| VICReg | 68.73 | 61.81 |
| LeJEPA | 67.18 | 60.12 |
| RGN 1 . 0 (0 ,σ GN ) | 69.33 | 64.90 |
| RGN 2 . 0 (0 ,σ GN ) | 69.54 | 64.85 |
| Method | Enc Acc1 ↑ | Proj Acc1 ↑ | ℓ 1 Sparsity ↓ | ℓ 0 Sparsity ↓ |
|---|---|---|---|---|
| Ours: RGN 1 . 0 ( µ,σ GN ) (Mean Shift Value, MSV) | Ours: RGN 1 . 0 ( µ,σ GN ) (Mean Shift Value, MSV) | Ours: RGN 1 . 0 ( µ,σ GN ) (Mean Shift Value, MSV) | Ours: RGN 1 . 0 ( µ,σ GN ) (Mean Shift Value, MSV) | Ours: RGN 1 . 0 ( µ,σ GN ) (Mean Shift Value, MSV) |
| RGN 1 . 0 (1 . 0 ,σ GN ) | 74.34 | 65.60 | 0.6459 | 0.9359 |
| RGN 1 . 0 (0 . 5 ,σ GN ) | 74.58 | 66.42 | 0.4768 | 0.8825 |
| RGN 1 . 0 (0 . 0 ,σ GN ) | 75.44 | 67.16 | 0.2730 | 0.7721 |
| RGN 1 . 0 ( - 0 . 5 ,σ GN ) | 74.80 | 66.86 | 0.1407 | 0.5526 |
| RGN 1 . 0 ( - 1 . 0 ,σ GN ) | 74.18 | 65.14 | 0.0737 | 0.3067 |
| RGN 1 . 0 ( - 1 . 5 ,σ GN ) | 74.88 | 63.70 | 0.0390 | 0.1227 |
| RGN 1 . 0 ( - 2 . 0 ,σ GN ) | 73.54 | 60.70 | 0.0238 | 0.0523 |
| RGN 1 . 0 ( - 2 . 5 ,σ GN ) | 72.06 | 57.96 | 0.0188 | 0.0357 |
| RGN 1 . 0 ( - 3 . 0 ,σ GN ) | 71.64 | 57.46 | 0.0132 | 0.0220 |
| Baselines (Dense) | Baselines (Dense) | Baselines (Dense) | Baselines (Dense) | Baselines (Dense) |
| LeJEPA | 65.36 | 59.12 | 0.6369 | 1.0000 |
| VICReg | 72.06 | 63.56 | 0.7877 | 1.0000 |
| SimCLR | 74.18 | 66.86 | 0.6663 | 1.0000 |
| Baselines (Sparse / NonNeg) | Baselines (Sparse / NonNeg) | Baselines (Sparse / NonNeg) | Baselines (Sparse / NonNeg) | Baselines (Sparse / NonNeg) |
| NonNeg-VICReg | 71.64 | 65.42 | 0.5075 | 0.7066 |
| NonNeg-SimCLR | 74.48 | 63.76 | 0.0016 | 0.0023 |












$$ \mathbb{E}[X^2]&= C(\mu^2 I_0 + 2\mu I_1 + I_2)\ &=C\mu^2 I_0 + 2C\mu I_1 + C I_2\ &= C\mu^2\cdot \frac{1}{p}a^{-1/p}\Gamma!\Big(\frac{1}{p}\Big)\Big(1+\operatorname{sgn}(\mu)P(1/p,t_0)\Big) + 2C\mu\cdot \frac{1}{p}a^{-2/p}\Gamma!\Big(\frac{2}{p},t_0\Big)\ &+ C\cdot \frac{1}{p}a^{-3/p}\Gamma!\Big(\frac{3}{p}\Big)\Big(1+\operatorname{sgn}(\mu)P(3/p,t_0)\Big)\ &=\frac{1}{2}\mu^2\Big(1+\operatorname{sgn}(\mu)P!\Big(\frac{1}{p},t_0\Big)\Big)+\frac{1}{2}\Big(2\mu,p^{1/p}\sigma,\frac{\Gamma(2/p,t_0)}{\Gamma(1/p)}\Big)\ &+\frac{1}{2}p^{2/p}\sigma^2\Big(1+\operatorname{sgn}(\mu)P!\Big(\frac{3}{p},t_0\Big)\Big)\ &=\frac{1}{2}\bigg[\mu^2\bigg(1+\operatorname{sgn}(\mu),P\bigg(\frac{1}{p},\frac{|\mu|^p}{p\sigma^p}\bigg)\bigg)+2\mu p^{1/p}\sigma\frac{\Gamma(2/p, |\mu|^p/(p\sigma^p))}{\Gamma(1/p)}\ &+p^{2/p}\sigma^2\bigg(1+\operatorname{sgn}(\mu),P \bigg(\frac{3}{p},\frac{|\mu|^p}{p\sigma^p}\bigg)\bigg) \bigg],\ $$
$$ d(\boldsymbol{\xi}i)=1-\Phi{\mathcal{GN}p(0,1)}\bigg(-\frac{\mu}{\sigma}\bigg)=\Phi{\mathcal{GN}_p(0,1)}\bigg(\frac{\mu}{\sigma}\bigg) $$
$$ \mathbb{P}Z(B)=d,\mathbb{P}{X\mid(0,\infty)}(B)+(1-d),\delta_0(B) $$
$$ \mathbb{H}_{\nu}(Z)&=d\cdot\mathbb{H}1(\mathbb{P}{X|(0,\infty)})-d\log(d)-(1-d)\log(1-d)\ &=d\cdot\mathbb{H}1(\mathbb{P}{X|(0,\infty)})+d\log\bigg(\frac{1}{d}\bigg)+(1-d)\log\bigg(\frac{1}{d-1}\bigg) $$
References
[bardes2022vicregvarianceinvariancecovarianceregularizationselfsupervised] Adrien Bardes, Jean Ponce, Yann LeCun. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning.
[Lyu08c] S Lyu, E P Simoncelli. (2009). Nonlinear extraction of 'Independent Components' of natural images using radial {Gaussianization. Neural Computation. doi:10.1162/neco.2009.04-08-773.
[lyu2008reducing] Lyu, Siwei, Simoncelli, Eero. (2008). Reducing statistical dependencies in natural signals using radial Gaussianization. Advances in neural information processing systems.
[cardoso2003dependence] Cardoso, Jean-Fran{\c{c. (2003). Dependence, correlation and gaussianity in independent component analysis. Journal of Machine Learning Research.
[ermolov2021whiteningselfsupervisedrepresentationlearning] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, Nicu Sebe. (2021). Whitening for Self-Supervised Representation Learning.
[NASH1976156] David Nash, Murray S. Klamkin. (1976). A spherical characterization of the normal distribution. Journal of Mathematical Analysis and Applications. doi:https://doi.org/10.1016/0022-247X(76)90285-7.
[shrinkage18] Dominique Fourdrinier, William E. Strawderman, Martin T. Wells. (2018). Shrinkage Estimation. Springer Series in Statistics.
[infogeobook] Shun-ichi Amari. (2016). Information Geometry and Its Applications. Springer Series in Applied Mathematical Sciences.
[NIPS2000_3c947bc2] Chen, Scott, Gopinath, Ramesh. (2000). Gaussianization. Advances in Neural Information Processing Systems.
[Studeny1998] Studen{'y. (1998). The Multiinformation Function as a Tool for Measuring Stochastic Dependence. Learning in Graphical Models. doi:10.1007/978-94-011-5014-9_10.
[chakraborty2025improvingpretrainedselfsupervisedembeddings] Deep Chakraborty, Yann LeCun, Tim G. J. Rudner, Erik Learned-Miller. (2025). Improving Pre-trained Self-Supervised Embeddings Through Effective Entropy Maximization.
[kotz2012laplace] Kotz, Samuel, Kozubowski, Tomasz, Podgorski, Krzystof. (2012). The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance.
[NEURIPS2019_ddf35421] Bachman, Philip, Hjelm, R Devon, Buchwalter, William. (2019). Learning Representations by Maximizing Mutual Information Across Views. Advances in Neural Information Processing Systems.
[fang1990symmetric] Fang, Kai-Tai, Kotz, Samuel, Ng, Kai Wang. (1990). Symmetric Multivariate and Related Distributions. doi:10.1201/9781351077040.
[linskerinfomax] Yifei Wang, Qi Zhang, Yaoyu Guo, Yisen Wang. (2024). Non-negative Contrastive Learning. Computer. doi:10.1109/2.36.
[NEURIPS2021_27debb43] HaoChen, Jeff Z., Wei, Colin, Gaidon, Adrien, Ma, Tengyu. (2021). Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss. Advances in Neural Information Processing Systems.
[NIPS2000_f9d11525] Lee, Daniel, Seung, H. Sebastian. (2000). Algorithms for Non-negative Matrix Factorization. Advances in Neural Information Processing Systems.
[listapaperlecun] Gregor, Karol, LeCun, Yann. (2010). Learning fast approximations of sparse coding. Proceedings of the 27th International Conference on International Conference on Machine Learning.
[yang2025gateddeltanetworksimproving] Songlin Yang, Jan Kautz, Ali Hatamizadeh. (2025). Gated Delta Networks: Improving Mamba2 with Delta Rule.
[zhou2025dinowmworldmodelspretrained] Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto. (2025). DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning.
[bellSejnowskiinfomax] Bell, Anthony J., Sejnowski, Terrence J.. (1999). An information-maximization approach to blind separation and blind deconvolution. Unsupervised Learning.
[NIPS2009_5751ec3e] Cho, Youngmin, Saul, Lawrence. (2009). Kernel Methods for Deep Learning. Advances in Neural Information Processing Systems.
[anonymous2025sparse] Anonymous. (2025). Sparse World Models: Visual World Modeling with Sparse Representations. Submitted to The Fourteenth International Conference on Learning Representations.
[wen2025matryoshkarevisitingsparsecoding] Tiansheng Wen, Yifei Wang, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You. (2025). Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation.
[gao2024scalingevaluatingsparseautoencoders] Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, Jeffrey Wu. (2024). Scaling and evaluating sparse autoencoders.
[oquab2024dinov2learningrobustvisual] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski. (2024). DINOv2: Learning Robust Visual Features without Supervision.
[Beauchamp2018numerical] Beauchamp, Maxime. (2018). On numerical computation for the distribution of the convolution of N independent rectified Gaussian variables. Journal de la société française de statistique.
[coverthomaselementsofinfo] Cover, Thomas M., Thomas, Joy A.. (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing).
[NIPS1999_955a1584] Downs, Oliver, MacKay, David, Lee, Daniel. (1999). The Nonnegative Boltzmann Machine. Advances in Neural Information Processing Systems.
[SongGuptaLpNorm1997] Song, D., Gupta, A. K.. (1997). {$L_p$. Proc. Amer. Math. Soc.. doi:10.1090/S0002-9939-97-03900-2.
[Sub23] Subbotin, M. T.. (1923). On the Law of Frequency of Error. Mat. Sb..
[Nardon01112009] Martina Nardon, Paolo Pianca. (2009). Simulation techniques for generalized Gaussian densities. Journal of Statistical Computation and Simulation. doi:10.1080/00949650802290912.
[Dytso2018] Dytso, Alex, Bustin, Ronit, Poor, H. Vincent, Shamai, Shlomo. (2018). Analytical properties of generalized Gaussian distributions. Journal of Statistical Distributions and Applications. doi:10.1186/s40488-018-0088-5.
[bengio2014representationlearningreviewnew] Yoshua Bengio, Aaron Courville, Pascal Vincent. (2014). Representation Learning: A Review and New Perspectives.
[Barthe_2005] David Alonso-Gutierrez, Joscha Prochno, Christoph Thaele. (2018). Gaussian fluctuations for high-dimensional random projections of $\ell_p^n$-balls. The Annals of Probability. doi:10.1214/009117904000000874.
[Nadarajah01092005] Saralees Nadarajah. (2005). A generalized normal distribution. Journal of Applied Statistics. doi:10.1080/02664760500079464.
[GOODMAN1973204] Goodman, Irwin R, Kotz, Samuel. (1973). Multivariate $\theta$-generalized normal distributions. Journal of Multivariate Analysis.
[GUPTA1997241] A.K. Gupta, D. Song. (1997). Lp-norm spherical distribution. Journal of Statistical Planning and Inference. doi:https://doi.org/10.1016/S0378-3758(96)00129-2.
[devroye2006nonuniform] Devroye, Luc. (2006). Nonuniform random variate generation. Handbooks in operations research and management science.
[chen2020simpleframeworkcontrastivelearning] Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton. (2020). A Simple Framework for Contrastive Learning of Visual Representations.
[he2020momentumcontrastunsupervisedvisual] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick. (2020). Momentum Contrast for Unsupervised Visual Representation Learning.
[zbontar2021barlowtwinsselfsupervisedlearning] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, Stéphane Deny. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction.
[grill2020bootstraplatentnewapproach] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko. (2020). Bootstrap your own latent: A new approach to self-supervised Learning.
[caron2021emergingpropertiesselfsupervisedvision] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin. (2021). Emerging Properties in Self-Supervised Vision Transformers.
[radford2018improving] Radford, Alec, Narasimhan, Karthik, Salimans, Tim, Sutskever, Ilya, others. (2018). Improving language understanding by generative pre-training.
[lecun2022path] LeCun, Yann. (2022). A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review.
[jing2022understandingdimensionalcollapsecontrastive] Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian. (2022). Understanding Dimensional Collapse in Contrastive Self-supervised Learning.
[assran2023selfsupervisedlearningimagesjointembedding] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture.
[kuang2025radialvcreg] Yilun Kuang, Yash Dagade, Deep Chakraborty, Erik Learned-Miller, Randall Balestriero, Tim G. J. Rudner, Yann LeCun. (2025). Radial-{VCR. UniReps: 3rd Edition of the Workshop on Unifying Representations in Neural Models.
[balestriero2025lejepaprovablescalableselfsupervised] Randall Balestriero, Yann LeCun. (2025). LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics.
[olshausen1996emergence] Olshausen, Bruno A, Field, David J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature.
[donoho2006compressed] Donoho, David L. (2006). Compressed sensing. IEEE Transactions on information theory.
[lee1999learning] Lee, Daniel D, Seung, H Sebastian. (1999). Learning the parts of objects by non-negative matrix factorization. nature.
[glorot2011deep] Glorot, Xavier, Bordes, Antoine, Bengio, Yoshua. (2011). Deep sparse rectifier neural networks. Proceedings of the fourteenth international conference on artificial intelligence and statistics.
[mallat1999wavelet] Mallat, St{'e. (1999). A wavelet tour of signal processing.
[barlow1961possible] Barlow, Horace B, others. (1961). Possible principles underlying the transformation of sensory messages. Sensory communication.
[attwell2001energy] Attwell, David, Laughlin, Simon B. (2001). An energy budget for signaling in the grey matter of the brain. Journal of Cerebral Blood Flow & Metabolism.
[nair2010rectified] Nair, Vinod, Hinton, Geoffrey E. (2010). Rectified linear units improve restricted boltzmann machines. Proceedings of the 27th international conference on machine learning (ICML-10).
[chen2001atomic] Chen, Scott Shaobing, Donoho, David L, Saunders, Michael A. (2001). Atomic decomposition by basis pursuit. SIAM review.
[natarajan1995sparse] Natarajan, Balas Kausik. (1995). Sparse approximate solutions to linear systems. SIAM journal on computing.
[tibshirani1996regression] Tibshirani, Robert. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology.
[chartrand2007exact] Chartrand, Rick. (2007). Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Processing Letters.
[chartrand2008iteratively] Chartrand, Rick, Yin, Wotao. (2008). Iteratively reweighted algorithms for compressive sensing. 2008 IEEE international conference on acoustics, speech and signal processing.
[NIPS1997_28fc2782] Socci, Nicholas, Lee, Daniel, Seung, H. Sebastian. (1997). The Rectified Gaussian Distribution. Advances in Neural Information Processing Systems.
[AndersonRecGauss] Anderson, J., Barlow, H. B., Gregory, R. L., Hinton, Geoffrey E., Ghahramani, Zoubin. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B: Biological Sciences. doi:10.1098/rstb.1997.0101.
[cramer1936] Cram{'e. (1936). Sur un nouveau théorème-limite de la théorie des probabilités.
[wold1938] Wold, Herman. (1938). A Study in the Analysis of Stationary Time Series.
[bonneel2015sliced] Bonneel, Nicolas, Rabin, Julien, Peyr{'e. (2015). Sliced and Radon Wasserstein Barycenters of Measures. Journal of Mathematical Imaging and Vision.
[kolouri2018swae] Kolouri, Soheil, Rohde, Gustavo K., Hoffmann, Heike. (2018). Sliced-Wasserstein Autoencoder. International Conference on Learning Representations (Workshop).
[nadjahi2020spd] Nadjahi, Kimia, De Bortoli, Valentin, Delon, Julie, Genevay, Aude. (2020). Statistical and Topological Properties of Sliced Probability Divergences. Advances in Neural Information Processing Systems.
[kim2019projection] Kim, Inkyung, Balakrishnan, Sivaraman, Wasserman, Larry. (2019). Robust Multivariate Nonparametric Tests via Projection Averaging. The Annals of Statistics.
[nolan1993multivariate] Nolan, John P. (1993). Multivariate Stable Distributions. COMPUTING SCIENCE AND STATISTICS.
[lehmann1951consistency] Lehmann, E. L., Romano, J. P.. (1951). Testing Statistical Hypotheses.
[gretton2012kernel] Gretton, A., Borgwardt, K. M., Rasch, M. J., Sch{. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research.
[golub2013matrix] Golub, Gene H., Van Loan, Charles F.. (2013). Matrix Computations.
[parlett1998symmetric] Parlett, Beresford N.. (1998). The Symmetric Eigenvalue Problem.
[halko2011finding] Halko, Nathan, Martinsson, Per-Gunnar, Tropp, Joel A.. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review.
[Garridoduality] Garrido, Quentin, Chen, Yubei, Bardes, Adrien, Najman, Laurent, Lecun, Yann. (2022). On the duality between contrastive and non-contrastive self-supervised learning. doi:10.48550/ARXIV.2206.02574.
[folland1999real] Folland, Gerald B. (1999). Real analysis: modern techniques and their applications.
[renyi1959dimension] R{'e. (1959). On the dimension and entropy of probability distributions. Acta Mathematica Academiae Scientiarum Hungarica.
[vasicek1976test] Vasicek, Oldrich. (1976). A test for normality based on sample entropy. Journal of the Royal Statistical Society Series B: Statistical Methodology.
[learned2003ica] Learned-Miller, Erik G, others. (2003). ICA using spacings estimates of entropy. Journal of machine learning research.
[he2016deep] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition.
[dosovitskiy2020image] Dosovitskiy, Alexey. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[liu2022convnet] Liu, Zhuang, Mao, Hanzi, Wu, Chao-Yuan, Feichtenhofer, Christoph, Darrell, Trevor, Xie, Saining. (2022). A convnet for the 2020s. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
[mcallester2020formallimitationsmeasurementmutual] David McAllester, Karl Stratos. (2020). Formal Limitations on the Measurement of Mutual Information.
[gretton2005measuring] Gretton, Arthur, Bousquet, Olivier, Smola, Alex, Sch{. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. International conference on algorithmic learning theory.
[ermolov2021whitening] Ermolov, Aleksandr, Siarohin, Aliaksandr, Sangineto, Enver, Sebe, Nicu. (2021). Whitening for self-supervised representation learning. International conference on machine learning.
[hua2021feature] Hua, Tianyu, Wang, Wenxiao, Xue, Zihui, Ren, Sucheng, Wang, Yue, Zhao, Hang. (2021). On feature decorrelation in self-supervised learning. Proceedings of the IEEE/CVF international conference on computer vision.
[Selvaraju_2019] Mialon, Gr{'e. (2022). Variance covariance regularization enforces pairwise independence in self-supervised representations. arXiv preprint arXiv:2209.14905. doi:10.1007/s11263-019-01228-7.
[balestriero2023cookbookselfsupervisedlearning] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, Avi Schwarzschild, Andrew Gordon Wilson, Jonas Geiping, Quentin Garrido, Pierre Fernandez, Amir Bar, Hamed Pirsiavash, Yann LeCun, Micah Goldblum. (2023). A Cookbook of Self-Supervised Learning.
[chen2020exploringsimplesiameserepresentation] Xinlei Chen, Kaiming He. (2020). Exploring Simple Siamese Representation Learning.
[yu2020learningdiversediscriminativerepresentations] Yaodong Yu, Kwan Ho Ryan Chan, Chong You, Chaobing Song, Yi Ma. (2020). Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction.
[yerxa2023learningefficientcodingnatural] Thomas Yerxa, Yilun Kuang, Eero Simoncelli, SueYeon Chung. (2023). Learning Efficient Coding of Natural Images with Maximum Manifold Capacity Representations.
[JMLR:v23:21-1155] Victor Guilherme Turrisi da Costa, Enrico Fini, Moin Nabi, Nicu Sebe, Elisa Ricci. (2022). solo-learn: A Library of Self-supervised Methods for Visual Representation Learning. Journal of Machine Learning Research.
[you2017largebatchtrainingconvolutional] Yang You, Igor Gitman, Boris Ginsburg. (2017). Large Batch Training of Convolutional Networks.