
Layer by Layer: Uncovering Hidden Representations in Language Models

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv

Abstract

From extracting features to generating text, the outputs of large language models (LLMs) typically rely on their final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that intermediate layers can encode even richer representations, often improving performance on a wide range of downstream tasks. To explain and quantify these hidden-layer properties, we propose a unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations. Our framework highlights how each model layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer’s performance. Through extensive experiments on 32 text-embedding tasks and comparisons across model architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features. These findings challenge the standard focus on final-layer embeddings and open new directions for model analysis and optimization, including strategic use of mid-layer representations for more robust and accurate AI systems.

Introduction

Large Language Models (LLMs) have driven remarkable progress in natural language processing (NLP), achieving state-of-the-art results on many tasks (Brown et al., 2020; Devlin et al., 2019; Li et al., 2022). At the heart of most applications lies a common assumption: final-layer representations are the most useful for downstream tasks. Yet a fundamental question remains: does the final layer always yield the best representation?

1 University of Kentucky 2 Mila-Quebec AI Institute 3 University of Montreal 4 New York University 5 University of California, Los Angeles 6 Independent 7 Meta FAIR 8 Wand.AI. Correspondence to: Oscar Skean oscar.skean@uky.edu.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

Figure 1: Intermediate layers consistently outperform final layers on downstream tasks. Average score across 32 MTEB tasks, using the outputs of every model layer as embeddings, for three different model architectures. The x-axis is the depth percentage of the layer rather than the layer number, which varies across models.


In this paper, we conduct a layer-wise analysis of LLMs across diverse architectures, including transformer-based models (Vaswani et al., 2017), state-space models (SSMs) (Gu & Dao, 2024), and encoder-based models like BERT (Devlin et al., 2019), spanning parameter scales from tens of millions to billions. Through systematic evaluation on 32 embedding tasks from the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022), we find that intermediate layers often surpass the final layer by up to 16% in downstream accuracy. Figure 1 illustrates this phenomenon: mid-depth layers provide particularly strong representations, while the very last layer can become overly specialized to the pretraining objective.
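As an illustration of the layer-selection setup behind Figure 1, the sketch below picks the hidden state at a given depth percentage and mean-pools it into a sentence embedding. The arrays stand in for per-layer hidden states, and the function name and shapes are our own illustration, not the paper's released code.

```python
import numpy as np

def embed_at_depth(hidden_states, depth_pct):
    """Pick the layer at a relative depth and mean-pool its tokens.

    hidden_states: list of (seq_len, dim) arrays, one per layer
                   (index 0 = first layer, -1 = final layer).
    depth_pct: relative depth in [0, 1]; 0.5 selects the middle layer.
    """
    layer_idx = round(depth_pct * (len(hidden_states) - 1))
    return hidden_states[layer_idx].mean(axis=0)  # (dim,) sentence embedding

# Toy stand-in for a 25-layer model's hidden states over a 10-token prompt
states = [np.random.randn(10, 64) for _ in range(25)]
mid_embedding = embed_at_depth(states, 0.5)   # layer 12 of 0..24
```

Normalizing by depth percentage, rather than absolute layer index, is what makes curves from models of different depths comparable on a shared x-axis.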

A unified framework. To better understand intermediate layers' effectiveness, we combine three complementary perspectives (Section 3):

· Information-theoretic: How much do layers compress or preserve semantic information (Shwartz-Ziv & Tishby, 2019; Shwartz-Ziv, 2022)?
· Geometric: How do token embeddings unfold in high-dimensional space (Hosseini & Fedorenko, 2023)?
· Invariance: Are embeddings robust to input perturbations, as measured by, e.g., InfoNCE (Oord et al., 2018), LiDAR (Thilak et al., 2024), and DiME (Skean et al., 2023)?

We show that these perspectives can be viewed under a single lens, which clarifies how intermediate layers strike a balance between retaining features and discarding noise.

Key findings and contributions. Our investigation leads to several important insights:

· Intermediate layers consistently outperform final layers. This pattern is evident in both transformers and SSMs, suggesting a broad architecture-agnostic effect.
· Autoregressive vs. masked-language training. Autoregressive models exhibit a pronounced mid-layer 'compression valley,' whereas masked or bidirectional models show milder intermediate changes.
· Domain-general effect. We extend these results to vision models and find that autoregressive image transformers display the same mid-depth bottleneck, indicating that the training objective, rather than the data modality, is the key driver.
· CoT finetuning. Analyzing chain-of-thought (CoT) models reveals that finetuning can reshape mid-layer entropy, preserving latent context for multi-step reasoning.

Overall, our results challenge the default reliance on final-layer embeddings and highlight intermediate layers as potentially underutilized sources of meaningful features. In this paper, we detail our unified framework (Section 3), present extensive experiments in both language and vision (Sections 4, 5, and 6), and conclude with a discussion of our findings, their implications, and future directions. 1

Understanding Neural Representations. A long line of research has aimed to understand how deep neural networks encode and organize information. Early studies employed linear probes for intermediate layers (Alain & Bengio, 2017), while subsequent efforts introduced more sophisticated techniques such as SVCCA (Raghu et al., 2017) to compare learned features across architectures and training regimes. Although these approaches have shed light on representation dynamics, most focus on vision models or shallow networks. In contrast, our work contributes to a growing body of literature extending layer-wise analysis to large-scale language models, emphasizing specific behaviors of intermediate layers across diverse architectures. Complementing our empirical findings, Saponati et al. (2025) present a theoretical analysis of how different pretext tasks, such as next-token prediction and masked language modeling, influence the structure of learned representations.

1 We make our code available at https://github.com/OFSkean/information_flow

Layer-wise Analysis in Language Models. Recent work has increasingly focused on identifying which transformer layers encode different types of information. For example, linguistic features such as part-of-speech tags or semantic roles are best encoded by the middle layers of BERT (Liu et al., 2019; Tenney et al., 2019; Voita et al., 2019). More recent work has shown that mid-depth layers sometimes hold surprisingly robust features, challenging the usual emphasis on final-layer representations (Jin et al., 2024; Gurnee & Tegmark, 2023; Fan et al., 2024). A related line of work investigates the attention sink phenomenon (Xiao et al., 2024; Brunner et al., 2020; Gu et al., 2025), in which attention disproportionately concentrates on a single token. Notably, intermediate decoder layers have been shown not to exhibit these extreme attention sinks (Barbero et al., 2025), suggesting they engage in more distributed and meaningful information processing than the shallow or deep layers.

Compression and Generalization. Multiple lines of research link compression and generalization performance (Deletang et al., 2024). For instance, Bordes et al. (2023) demonstrated that discarding certain layers in self-supervised encoders can even improve downstream accuracy, while Park et al. (2024a) found that LLM embeddings often lie in low-dimensional manifolds. Our empirical study reinforces these ideas by demonstrating that many networks, especially autoregressive transformers, naturally develop a mid-layer bottleneck that appears crucial for balancing 'signal' versus 'noise.' We show how intermediate layers can achieve optimal trade-offs between preserving task-relevant information and discarding superfluous detail.

Representation Quality Metrics. A variety of metrics have been proposed to quantify the 'quality' of learned representations. We group them into three main categories:

· Information-theoretic measures capture how much a model's internal representations compress or preserve relevant information. For example, the Information Bottleneck (Shwartz-Ziv & Tishby, 2019; Shwartz-Ziv, 2022) analyzes whether intermediate layers discard noise while retaining essential features. Intrinsic dimensionality, which describes the minimum number of features needed to represent data, has also been used to analyze intermediate layers in LLMs (Cheng et al., 2025; Valeriani et al., 2023; Razzhigaev et al., 2024). This line of work has shown that semantic abstractions useful for downstream tasks are better encoded in middle layers than in the last layers of large transformer models. While we do not study intrinsic dimensionality in our subsequent analysis, it would make a promising direction for future work.

· Geometric measures focus on the structure of embeddings in high-dimensional space. Classical approaches include analyzing singular values and the effective rank of the representation matrix (Garrido et al., 2023). The anisotropy metric of Razzhigaev et al. (2024) has been used to study compression in intermediate model layers, and we compare our results with their findings in Section 4.2. Anisotropy fits well with our proposed framework, though we leave a formal integration to future work. Recent work explores curvature (Hosseini & Fedorenko, 2023) to quantify how smoothly tokens are mapped across consecutive positions or time steps.
· Task-based or invariance metrics evaluate how well representations support downstream goals. For instance, augmentation-based approaches such as InfoNCE (Oord et al., 2018) and LiDAR (Thilak et al., 2024) estimate invariance to perturbations, while methods like NESum or Self-Cluster (Agrawal et al., 2022) link closely to entropy. In computer vision, these scores often correlate strongly with downstream accuracy, highlighting how robust the embeddings are.

Although these representation quality metric categories may appear distinct, we show (Section 3) that many can be unified under a single lens. This unification illuminates why certain intermediate layers balance compression, geometry, and invariance so effectively, leading to better representations for downstream tasks.

Overall, our work bridges these overlapping threads by evaluating a range of architectures and training paradigms via a unified set of metrics. Beyond merely confirming that intermediate layers can be effective, we elucidate why this happens, tying it to fundamental properties such as entropy, invariance, and geometry. This novel perspective provides an avenue for both finer-grained diagnostics of large language models and more deliberate design of mid-layer representations for downstream tasks.



Architectural Comparisons.

In this section, we elaborate on the specific architectures of transformers and State Space Models (SSMs). We outline the mathematical foundations, including the weight matrices, attention mechanisms for transformers, and the state transition matrices for SSMs. Detailed equations and parameter configurations are provided to facilitate replication and deeper understanding.
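To make the SSM side concrete, here is a minimal sketch of a discretized linear state-space recurrence of the kind Mamba-style models build on; the matrices and sizes are illustrative placeholders, not any model's actual parameters.

```python
import numpy as np

def ssm_forward(x, A, B, C):
    """Run a discretized linear state-space model over a 1-D input sequence.

    x: (T,) input sequence; A: (H, H) state transition;
    B: (H,) input projection; C: (H,) output projection.
    Recurrence: h_t = A @ h_{t-1} + B * x_t,  y_t = C @ h_t.
    """
    H = A.shape[0]
    h = np.zeros(H)
    ys = []
    for x_t in x:
        h = A @ h + B * x_t   # state update mixes history with new input
        ys.append(C @ h)      # readout of the hidden state
    return np.array(ys)

rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)                  # stable, decaying state transition
B, C = rng.standard_normal((2, 4))
y = ssm_forward(rng.standard_normal(16), A, B, C)  # (16,) outputs
```

The recurrence is linear in the input, which is what lets SSMs trade this sequential form for a parallel convolutional one during training.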


We investigated the representation quality of intermediate layers in LLMs and their role in downstream task performance. We introduced a unified framework of evaluation metrics, established theoretical connections among them, and applied these metrics to analyze transformer-based architectures, SSMs, and vision models. A key phenomenon unveiled by prompt entropy was an information bottleneck in the middle layers of autoregressive transformers in both the vision and language domains. Furthermore, we showed that intermediate layers often surpass final layers in representation quality, with implications for feature relevance and extraction. DiME, curvature, and InfoNCE correlate well with downstream performance, suggesting a fundamental connection between representation and generalization.

In conclusion, our work studies the internal representation dynamics of LLMs, offering theoretical and empirical insights as well as practical implications for optimizing model design and training strategies. Future work should further investigate the underlying causes of intermediate-layer compression and explore explicit finetuning to control it.


A Unified Framework for Neural Representations

Key Takeaway: Matrix-based entropy unifies seemingly disparate metrics of representation quality, providing a single theoretical lens for analyzing compression, geometry, and invariance.

A central challenge in analyzing internal representations is determining how to assess their quality. Although existing work draws on numerous ideas-from mutual information to geometric manifold analysis to invariance under augmentations-these threads can seem disparate. In this section, we consolidate them into a unified theoretical framework that shows how these seemingly different metrics connect and why they collectively measure 'representation quality.'

Notation and Motivation

Consider a neural network that maps inputs x (e.g., tokens in a sequence) to internal hidden states Z . We denote Z ∈ R N × D as a matrix of N data samples (or tokens) in D dimensions. Some key questions arise:

  1. How compressed are these representations?
  2. How robust are they to perturbations or augmentations?
  3. How do they geometrically organize different inputs?

Answers to these questions can illuminate which layers strike the right balance between preserving relevant features and discarding noise.

Matrix-Based Entropy: A Common Theoretical Thread

We focus on a key quantity known as matrix-based entropy (Giraldo et al., 2014; Skean et al., 2023), which applies directly to the Gram matrix K = ZZ ⊤ . Let { λ i ( K ) } be the (nonnegative) eigenvalues of K . For any order α > 0 , define:

$$
S_\alpha(Z) \;=\; \frac{1}{1-\alpha}\,\log\!\left[\sum_{i=1}^{r}\left(\frac{\lambda_i(K)}{\operatorname{tr}(K)}\right)^{\!\alpha}\right] \tag{1}
$$

where r = rank(K) ≤ min(N, D). Intuitively, if only a few eigenvalues dominate, S α (Z) is small, indicating a highly compressed representation. Conversely, if Z is spread out across many principal directions, S α (Z) is large. By varying α, one smoothly transitions between notions like collision entropy (α = 2) and von Neumann entropy (α → 1). We will typically use α = 1 for simplicity.
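As a concrete sketch, matrix-based entropy takes only a few lines to compute; the function below follows the definition in Eq. (1) (linear kernel, eigenvalues normalized by the trace) and is our own illustrative implementation, not the authors' released code.

```python
import numpy as np

def matrix_entropy(Z, alpha=1.0, eps=1e-12):
    """Alpha-order matrix-based entropy of representations Z (N x D), Eq. (1).

    Uses the linear-kernel Gram matrix K = Z Z^T and its trace-normalized
    eigenvalues; alpha -> 1 recovers Shannon / von Neumann entropy.
    """
    K = Z @ Z.T
    lam = np.linalg.eigvalsh(K)
    p = np.clip(lam, 0.0, None) / np.trace(K)   # eigenvalues as a distribution
    p = p[p > eps]
    if abs(alpha - 1.0) < 1e-8:                 # Shannon limit of Eq. (1)
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

# Orthogonal rows -> uniform spectrum -> maximal entropy log(N)
print(matrix_entropy(np.eye(4)))        # ~1.386 = log 4
# Identical rows -> rank-1 Gram matrix -> entropy ~0
print(matrix_entropy(np.ones((4, 8))))  # ~0.0
```

The two extremes printed above are exactly the "compressed vs. spread out" regimes described in the text.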

Bridging geometry, invariance, and feature locality. A key benefit of matrix-based entropy is that it unifies multiple representational perspectives:

· Compression or information content: A handful of large eigenvalues in K = ZZ ⊤ indicates that Z is low-rank, i.e. the model has collapsed much of the input variation into fewer dimensions. In contrast, a more uniform eigenvalue spectrum implies higher-entropy, more diverse features.
· Geometric smoothness: If tokens within a prompt follow a trajectory in embedding space with sharp turns, that curvature can manifest as skewed eigenvalue spectra (Hosseini & Fedorenko, 2023). Curvature also differentiates local transitions (token-to-token) from global structural patterns across longer segments or entire prompts.
· Invariance under augmentations: Metrics like InfoNCE (Oord et al., 2018) and LiDAR (Thilak et al., 2024) effectively measure whether augmentations of the same sample (e.g. character swaps) map to similar embeddings. Strong invariance corresponds to stable clustering in ZZ ⊤, which again depends on the distribution of eigenvalues and how local vs. global features are retained or discarded.

Thus, evaluating S α ( Z ) provides a single lens for assessing 'representation quality' across compression, geometric structure, and invariance-and highlights how both local details and global patterns are organized.



Representation Evaluation Metrics

Key Takeaway: Information-theoretic, geometric, and invariance-based metrics offer complementary perspectives on representation quality that can all be understood through matrix-based entropy.

We now introduce the seven representation evaluation metrics used in our experiments, grouped into three broad categories: (1) information-theoretic , (2) geometric , and (3) augmentation-invariance . All relate back to the Gram matrix K and hence to Eq. (1).

Information-Theoretic Metrics

Prompt Entropy. Following Wei et al. (2024), we apply matrix-based entropy (Eq. 1) to the token embeddings within a single prompt . This prompt entropy quantifies how widely tokens are spread in the embedding space. Higher entropy indicates more diverse, less redundant token-level features; lower entropy implies stronger compression.

Dataset Entropy. We can also aggregate embeddings across N prompts by taking the mean token embedding of each prompt to form Z ∈ R N × D . Applying entropy to Z yields a dataset -level measure of global diversity-revealing how distinctly the model separates different inputs.
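Concretely, the dataset-level matrix can be built by mean-pooling each prompt's token embeddings; the sketch below is an illustrative construction of ours (names and shapes are not from the paper's code).

```python
import numpy as np

def dataset_matrix(prompt_embeddings):
    """Stack mean-pooled prompt embeddings into Z in R^{N x D}.

    prompt_embeddings: list of (L_i, D) arrays, one per prompt; prompts may
    have different lengths L_i. Each prompt contributes one pooled row.
    """
    return np.stack([E.mean(axis=0) for E in prompt_embeddings])

prompts = [np.random.randn(L, 32) for L in (5, 9, 7)]  # 3 prompts, D = 32
Z = dataset_matrix(prompts)   # (3, 32); apply Eq. (1) to this Z
```

Applying Eq. (1) to this pooled Z yields dataset entropy, while applying it to a single (L_i, D) block yields prompt entropy.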

Effective Rank. The effective rank (Roy & Vetterli, 2007) can be shown to be a lower bound to exp(S 1 (Z)), highlighting how dimensionality effectively shrinks if the representation is strongly compressed. We prove this connection in Theorem 1. This has implications for popular representation evaluation metrics such as RankMe (Garrido et al., 2023) and LiDAR (Thilak et al., 2024), both of which are inspired by effective rank.
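The effective rank itself is straightforward to compute; the sketch below follows Roy & Vetterli's definition (exponential of the Shannon entropy of the trace-normalized singular-value distribution) and is an illustration of ours, not the paper's implementation.

```python
import numpy as np

def effective_rank(Z, eps=1e-12):
    """Effective rank of Z (Roy & Vetterli, 2007): exp of the Shannon
    entropy of the normalized singular-value distribution."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum()                 # singular values as a distribution
    p = p[p > eps]
    return float(np.exp(-np.sum(p * np.log(p))))

print(effective_rank(np.eye(4)))        # 4.0: four equally strong directions
print(effective_rank(np.ones((4, 4))))  # 1.0: a single direction
```

It interpolates smoothly between 1 (fully collapsed) and min(N, D) (perfectly isotropic), which is what makes it a useful compression diagnostic per layer.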

Prompt Entropy.

The first measure of token embedding diversity is what we call prompt entropy. It is measured on the intermediate token representations of a single prompt and captures how diverse those representations are.

We follow the work of Wei et al. (2024) and use α-order matrix-based entropy (Giraldo et al., 2014; Skean et al., 2023; 2024), which serves as a tractable surrogate for traditional Rényi α-order entropy (Rényi, 1961). The quantity is calculated using a similarity kernel κ on a batch of samples drawn from a distribution, without making explicit assumptions about the true distribution. The choice of kernel κ is flexible and can be any infinitely divisible kernel, such as the Gaussian kernel, linear kernel, or Laplacian kernel, among others. For this work, we restrict ourselves to the linear kernel κ(a, b) = ab T . This choice is motivated by the linear representation hypothesis (Park et al., 2024b), which finds that large language model representations encode high-level concepts such as truth (Burns et al., 2023), honesty (Mallen & Belrose, 2024), and part-of-speech (Mamou et al., 2020) in linearly separable manifolds.

The equation for matrix-based entropy was previously defined in Eq. 1. One way to interpret Eq. 1 is as the α-order Rényi entropy of the Gram matrix eigenvalues 2 . Notice how each eigenvalue is divided by tr(K Z ) before being raised to the α power. This ensures that the eigenvalues of K Z sum to one (because tr(·) = ∑ n i=1 λ i (·)), which is a necessary condition to treat the eigenvalues as a probability distribution. Furthermore, each eigenvalue of K Z signifies the variance of samples in a particular principal component direction (Scholkopf & Smola, 2018). If entropy is low, then the eigenvalues form a heavy-tailed distribution, which implies that a few components dominate the variance of samples in Z. On the other hand, at maximum entropy, the eigenvalues form a uniform distribution and samples are spread equally in all directions. Matrix-based entropy is reminiscent of the LogDet entropy, which uses the determinant of K Z to capture how much "volume" a dataset occupies (Shwartz-Ziv et al., 2023; Zhouyin & Liu, 2021). The LogDet entropy is given by S LogDet (Z) = log det(K Z ) - log 2. One can use Jensen's inequality to show that the LogDet entropy is a lower bound of Eq. 1 as α → 1 (Appendix J.4 of (Shwartz-Ziv et al., 2023)).

2 The non-zero eigenvalues of the Gram matrix ZZ T are equivalent to those of the covariance matrix Z T Z. Using the covariance matrix instead of the Gram matrix in Eq. 1 makes no difference and is more computationally efficient if D < N.

Figure 7: The behavior of Eq. 1 for varying values of α on Gram matrices with eigenvalues distributed with a β-power law such that λ i = i^-β .

Depending on the choice of α, several special cases of matrix-based entropy can be recovered. In particular, in the limit α → 1 it equals Shannon entropy (also referred to as von Neumann entropy in quantum information theory (Bach, 2022; Boes et al., 2019)), and when α = 2 it equals collision entropy. Interestingly, the case of α = 2 can be calculated without explicit eigendecomposition (Skean et al., 2024). We show in Figure 7 (Appendix) how varying values of α affect the matrix-based entropy of Gram matrices with eigenvalues distributed according to a β-power law such that λ i = i^-β . It is shown that for larger values of α, smaller eigenvalues contribute more to the entropy.
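The Gram-covariance equivalence mentioned in the footnote is easy to verify numerically; the snippet below is a quick illustrative check of ours, not part of the paper's tooling.

```python
import numpy as np

# Non-zero eigenvalues of the Gram matrix Z Z^T match those of the
# covariance-style matrix Z^T Z, so either can feed Eq. (1); the smaller
# one (D x D when D < N) is cheaper to eigendecompose.
Z = np.random.randn(50, 8)                      # N = 50 tokens, D = 8 dims
gram_eigs = np.linalg.eigvalsh(Z @ Z.T)[-8:]    # top D of the N eigenvalues
cov_eigs = np.linalg.eigvalsh(Z.T @ Z)          # all D eigenvalues
print(np.allclose(np.sort(gram_eigs), np.sort(cov_eigs)))  # True
```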


Geometric Metrics

Curvature. Proposed by Hosseini & Fedorenko (2023), curvature captures how sharply the token embeddings turn when viewed as a sequence in R D . For a prompt of length L , let v k = z k +1 -z k be the difference between consecutive tokens. The average curvature is:

$$
\bar{\kappa} \;=\; \frac{1}{L-2}\sum_{k=1}^{L-2} \arccos\!\left(\frac{v_{k+1}^{\top} v_k}{\lVert v_{k+1}\rVert\,\lVert v_k\rVert}\right)
$$

Higher curvature means consecutive tokens shift direction abruptly, reflecting more local, token-level features; lower curvature suggests a smoother trajectory that reflects global, prompt-level structure.
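A minimal implementation of this average curvature (our own sketch of the formula above, with illustrative names) is:

```python
import numpy as np

def average_curvature(Z):
    """Mean turning angle along a token trajectory Z of shape (L, D).

    v_k = z_{k+1} - z_k; curvature is the average angle (radians)
    between consecutive difference vectors.
    """
    V = np.diff(Z, axis=0)                       # (L-1, D) difference vectors
    cos = np.sum(V[1:] * V[:-1], axis=1) / (
        np.linalg.norm(V[1:], axis=1) * np.linalg.norm(V[:-1], axis=1))
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))

line = np.outer(np.arange(6), np.ones(4))        # colinear trajectory
print(average_curvature(line))                   # ~0.0: no turning
```

A perfectly colinear trajectory gives zero curvature, while a trajectory that turns 90 degrees at every step gives π/2.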


Augmentation Invariance Metrics

Lastly, we assess how stable the model's representations are to small perturbations of the same input (e.g., random character swaps, keyboard-level changes; see Appendix). Suppose a prompt p i is augmented into p ( a ) i and p ( b ) i . After embedding these, we compare the row vectors in Z 1 , Z 2 ∈ R N × D under different scoring criteria:

InfoNCE. This self-supervised objective (Oord et al., 2018) encourages matched samples to lie close in embedding space while pushing unmatched samples away. A lower InfoNCE loss indicates stronger invariance to augmentation.
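As a sketch, the InfoNCE loss over N augmented pairs can be written as a cross-entropy over cosine-similarity logits; this is a generic illustrative implementation (the temperature and similarity choices are ours), not the paper's exact configuration.

```python
import numpy as np

def info_nce(Z1, Z2, temperature=0.1):
    """InfoNCE loss for N matched pairs of embeddings (rows of Z1, Z2).

    Row i of Z1 should match row i of Z2; all other rows act as negatives.
    Lower loss means augmentations map to more similar embeddings.
    """
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    logits = (Z1 @ Z2.T) / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # matched pairs on diagonal

rng = np.random.default_rng(0)
Z = rng.standard_normal((16, 32))
loss_invariant = info_nce(Z, Z)                       # identical "views"
loss_random = info_nce(Z, rng.standard_normal((16, 32)))
```

Identical views score far lower than unrelated ones, which is exactly the invariance signal used here.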

LiDAR. LiDAR (Thilak et al., 2024) uses a linear discriminant approach that measures within-class versus betweenclass scatter. Treating each prompt as its own class, LiDAR checks how well augmentations form tight clusters.

DiME. Similarly, DiME (Skean et al., 2023) is grounded in matrix-based entropy. It compares real paired samples against random pairings to estimate how uniquely aligned correct augmentations are.


Core Theoretical Results

Key Takeaway: Our theoretical framework establishes concrete connections between representation entropy and downstream performance through properties like effective rank and invariance.

Here, we summarize key statements that justify why these metrics meaningfully measure representation quality. We refer to Appendix G for details and proofs. Beyond serving as a unifying view, matrix-based entropy also connects to foundational concepts like majorization, Schur concavity, and mutual information. Furthermore, we can directly relate the eigenvalue entropy to the matrix entropy, most naturally via the Effective Rank (Roy & Vetterli, 2007). The following theorem makes this connection explicit.

Theorem 1 (Lower Bound via Effective Rank) . For Shannon-based entropy ( α → 1 ),

$$
\operatorname{erank}(Z) \;\le\; \exp\!\big(S_1(Z)\big),
$$

meaning a large effective rank implies a high entropy.

Under appropriate conditions on the data distribution and model, we can show connections between prompt entropy and dataset entropy via the following scaling behaviors:

Theorem 2 (Informal) .

  1. If prompt entropy remains near its maximum for all prompts, then the dataset entropy S 2 ( ZZ ⊤ ) grows on the order of log ( L 2 N ) .
  2. If prompt entropy instead stays near its minimum for all prompts, then dataset entropy grows more slowly, on the order of log ( L 2 N 3 ) .

In short, high token-level (prompt) diversity encourages broader global diversity in the dataset-level embeddings, whereas over-compressing token representations can limit how effectively different prompts separate. Our subsequent analysis connects these ideas to self-supervised objectives like InfoNCE, which also tie higher entropy to stronger robustness and discriminability in the learned representations.

Theorem 3 (Dataset Entropy Bounds InfoNCE) . For data X and representation Z ( X ) , the InfoNCE loss on N samples satisfies:

$$
\mathcal{L}_{\mathrm{InfoNCE}} \;\ge\; \log N \;-\; H(Z),
$$

where H ( Z ) is interpretable as matrix-based entropy at the dataset level. Hence, reducing InfoNCE implies learning a higher-entropy (and thus often more robust) representation.

Practical outlook. Overall, our theoretical analysis shows that compression (entropy), geometry (curvature, rank), and invariance (e.g. InfoNCE) are all facets of how the Gram matrix ZZ ⊤ distributes variance. Examining these metrics across different layers reveals exactly where a network 'prunes' redundancy (low entropy) versus preserving essential distinctions (high entropy). This unified perspective also facilitates cross-architecture comparisons (e.g. transformers vs. SSMs) by highlighting how each architecture organizes information internally. Beyond offering a theoretical foundation, it provides a practical blueprint for diagnosing, tuning, and improving hidden-layer representations.
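Putting the pieces together, a layer-wise diagnostic sweep might look like the following; the per-layer states are mock data and the function names are ours, illustrating the workflow rather than reproducing the paper's pipeline.

```python
import numpy as np

def entropy_per_layer(layer_states):
    """Shannon matrix-based entropy of each layer's token matrix (L, D)."""
    scores = []
    for Z in layer_states:
        K = Z @ Z.T
        p = np.clip(np.linalg.eigvalsh(K), 0.0, None) / np.trace(K)
        p = p[p > 1e-12]
        scores.append(float(-np.sum(p * np.log(p))))
    return scores

# Mock a 12-layer model whose middle layers compress to (near) rank 1
rng = np.random.default_rng(1)
states = []
for i in range(12):
    if i in (5, 6):   # mid-layer bottleneck: low-rank hidden states
        states.append(rng.standard_normal((20, 1)) @ rng.standard_normal((1, 16)))
    else:
        states.append(rng.standard_normal((20, 16)))

scores = entropy_per_layer(states)   # one entropy value per layer
```

Plotting such scores against depth percentage is the kind of sweep that surfaces the mid-layer entropy valley discussed above.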


In this section, we empirically test our theoretical framework through extensive experiments across architectures, scales, and training regimes. We focus on three key questions:

· Do intermediate layers consistently outperform final layers across diverse downstream tasks?
· How do these intermediate representations differ across architectures, training stages, and scales?
· How do post-training methods (e.g., fine-tuning and chain-of-thought) reshape representations?


Representation Metrics

Key Takeaway: Information-theoretic, geometric, and invariance-based metrics offer complementary perspectives on representation quality that can all be understood through matrix-based entropy.

We now introduce the seven representation evaluation metrics used in our experiments, grouped into three broad categories: (1) information-theoretic , (2) geometric , and (3) augmentation-invariance . All relate back to the Gram matrix K and hence to Eq. (1).

Token-level Diversity Metrics

Prompt Entropy. Following Wei et al. (2024), we apply matrix-based entropy (Eq. 1) to the token embeddings within a single prompt . This prompt entropy quantifies how widely tokens are spread in the embedding space. Higher entropy indicates more diverse, less redundant token-level features; lower entropy implies stronger compression.

Dataset Entropy. We can also aggregate embeddings across N prompts by taking the mean token embedding of each prompt to form Z ∈ R N × D . Applying entropy to Z yields a dataset -level measure of global diversity, revealing how distinctly the model separates different inputs.

Effective Rank. The effective rank (Roy & Vetterli, 2007) can be shown to be a lower bound on exp( S 1 ( Z )) , highlighting how dimensionality effectively shrinks if the representation is strongly compressed. We prove this connection in Theorem 1. This has implications for popular representation evaluation metrics such as RankMe (Garrido et al., 2023) and LiDAR (Thilak et al., 2024), which are both inspired by effective rank.
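As a small illustration (assuming NumPy; the helper name is ours), the effective rank of Roy & Vetterli is the exponential of the Shannon entropy of the normalized singular values:

```python
import numpy as np

def effective_rank(Z):
    """Effective rank (Roy & Vetterli, 2007): the exponential of the
    Shannon entropy of the trace-normalized singular values of Z."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                            # drop exact zeros before taking logs
    return float(np.exp(-(p * np.log(p)).sum()))

# Orthonormal rows spread variance evenly, so the effective rank equals
# the matrix dimension; a rank-one matrix collapses it to ~1.
print(effective_rank(np.eye(4)))            # ~4.0
print(effective_rank(np.ones((4, 4))))      # ~1.0
```

A strongly compressed representation (heavy-tailed singular values) therefore registers a small effective rank even when the nominal dimension D is large.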

Prompt Entropy.

The first measure of token-embedding diversity is what we call prompt entropy. It is computed on the intermediate token representations and captures how diverse they are.

We follow the work of Wei et al. (2024) and use α -order matrix-based entropy (Giraldo et al., 2014; Skean et al., 2023; 2024), which serves as a tractable surrogate for traditional Rényi α -order entropy (Rényi, 1961). The quantity is calculated using a similarity kernel κ on a batch of samples drawn from a distribution, without making explicit assumptions about the true distribution. The choice of kernel κ is flexible: it can be any infinitely divisible kernel, such as the Gaussian, linear, or Laplacian kernels, among others. For this work, we restrict ourselves to the linear kernel κ ( a, b ) = ab ⊤ . This choice is motivated by the linear representation hypothesis (Park et al., 2024b), which finds that large language model representations encode high-level concepts such as truth (Burns et al., 2023), honesty (Mallen & Belrose, 2024), and part-of-speech (Mamou et al., 2020) in linearly separable manifolds.

The equation for matrix-based entropy was previously defined in Eq. 1. One way to interpret Eq. 1 is as the α -order Rényi entropy of the Gram matrix eigenvalues 2 . Notice how each eigenvalue is divided by tr ( K Z ) before being raised to the α power. This is so that the eigenvalues of K Z sum to one (because tr ( · ) = ∑ n i =1 λ i ( · ) ), which is a necessary condition to treat the eigenvalues as a probability distribution. Furthermore, each eigenvalue of K Z signifies the variance of samples in a particular principal component direction (Scholkopf & Smola, 2018). If entropy is low, then

2 The non-zero eigenvalues of the Gram matrix ZZ T are equivalent to those of the covariance matrix Z T Z . Using the covariance matrix instead of the Gram matrix in Eq. 1 makes no difference and is more computationally efficient if D < N .

Figure 7: The behavior of Eq. 1 for varying values of α on Gram matrices with eigenvalues distributed with a β -power law such that λ i = i ^{ −β } .


the eigenvalues form a heavy-tailed distribution, which implies that a few components dominate the variance of samples in Z . On the other hand, at maximum entropy, the eigenvalues form a uniform distribution and samples are spread equally in all directions. Matrix-based entropy is reminiscent of the LogDet entropy, which uses the determinant of K Z to capture how much "volume" a dataset occupies (Shwartz-Ziv et al., 2023; Zhouyin & Liu, 2021). The LogDet entropy is given by S LogDet ( Z ) = log det( K Z ) -log 2 . One can use Jensen's inequality to show that the LogDet entropy is a lower bound of Eq. 1 in the limit α → 1 (Appendix J.4 of (Shwartz-Ziv et al., 2023)).

Depending on the choice of α , several special cases of matrix-based entropy can be recovered. In particular, in the limit α → 1 it equals Shannon entropy (also referred to as von Neumann entropy in quantum information theory (Bach, 2022; Boes et al., 2019)), and when α = 2 it equals collision entropy. Interestingly, the case of α = 2 can be calculated without explicit eigendecomposition (Skean et al., 2024). Appendix Figure 7 shows how varying values of α affect the matrix-based entropy of Gram matrices with eigenvalues distributed with a β -power law such that λ i = i ^{ −β } : for larger values of α , smaller eigenvalues contribute more to the entropy.
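As a minimal sketch (assuming NumPy; the helper names are ours, not the paper's), Eq. 1 with the linear kernel can be computed from the trace-normalized Gram eigenvalues, and the α = 2 case via the Frobenius-norm shortcut:

```python
import numpy as np

def matrix_based_entropy(Z, alpha=1.0):
    """Alpha-order matrix-based entropy (Eq. 1) of embeddings Z (N x D)
    with the linear kernel K_Z = Z Z^T. Eigenvalues are divided by
    tr(K_Z) so they sum to one, then Renyi's formula is applied."""
    eig = np.clip(np.linalg.eigvalsh(Z @ Z.T), 0.0, None)  # clip numerical noise
    p = eig / eig.sum()
    p = p[p > 0]
    if np.isclose(alpha, 1.0):   # limit alpha -> 1: Shannon / von Neumann entropy
        return float(-(p * np.log(p)).sum())
    return float(np.log((p ** alpha).sum()) / (1.0 - alpha))

def collision_entropy(Z):
    """alpha = 2 without eigendecomposition: S_2 = -log(||K||_F^2 / tr(K)^2)."""
    K = Z @ Z.T
    return float(-np.log((K * K).sum() / np.trace(K) ** 2))
```

Per the footnote above, substituting the covariance Z ⊤ Z leaves the non-zero eigenvalues, and hence the entropy, unchanged; applying the same function to mean-pooled prompt embeddings yields the dataset-level entropy.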


Augmentation Invariance Metrics

Lastly, we assess how stable the model's representations are to small perturbations of the same input (e.g., random character swaps, keyboard-level changes; see Appendix). Suppose a prompt p i is augmented into p ( a ) i and p ( b ) i . After embedding these, we compare the row vectors in Z 1 , Z 2 ∈ R N × D under different scoring criteria:

InfoNCE. This self-supervised objective (Oord et al., 2018) encourages matched samples to lie close in embedding space while pushing unmatched samples away. A lower InfoNCE loss indicates stronger invariance to augmentation.

LiDAR. LiDAR (Thilak et al., 2024) uses a linear discriminant approach that measures within-class versus between-class scatter. Treating each prompt as its own class, LiDAR checks how well augmentations form tight clusters.

DiME. Similarly, DiME (Skean et al., 2023) is grounded in matrix-based entropy. It compares real paired samples against random pairings to estimate how uniquely aligned correct augmentations are.
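The InfoNCE criterion above can be sketched as follows (assuming NumPy, cosine similarity, and an illustrative temperature of our choosing; the paper's exact implementation may differ):

```python
import numpy as np

def infonce_loss(Z1, Z2, temperature=0.1):
    """InfoNCE (Oord et al., 2018) between two row-aligned views:
    row i of Z1 and row i of Z2 embed two augmentations of prompt i.
    A lower loss means matched rows are closer than mismatched ones,
    i.e. stronger invariance to the augmentation."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    logits = (Z1 @ Z2.T) / temperature              # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))    # positives sit on the diagonal
```

Well-separated, correctly matched views drive the loss toward zero, while mismatched views push it toward log N, matching the bound in Theorem 3.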



Theory

Definition 1. (Majorization) Let p, q ∈ R n be nonnegative vectors such that ∑ n i =1 p i = ∑ n i =1 q i . We say that q majorizes p, denoted by p ≼ q , if their ordered sequences p [1] ≥ · · · ≥ p [ n ] and q [1] ≥ · · · ≥ q [ n ] satisfy:

$$
\sum_{i=1}^{k} p_{[i]} \;\le\; \sum_{i=1}^{k} q_{[i]} \qquad \text{for all } k = 1, \ldots, n .
$$

Definition 2. (Schur-Convexity) A real-valued function f on R n is called Schur-convex if p ≼ q = ⇒ f ( p ) ≤ f ( q ) , and Schur-concave if p ≼ q = ⇒ f ( q ) ≤ f ( p ) .

Lemma 1. The matrix-based entropy, as given in Equation 1, is a Schur-concave function for α > 0 . This result is well-known and, for instance, was recently given by Lemma 4.1 in (Giraldo et al., 2014).
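A quick numerical illustration (ours; the two distributions are arbitrary) of the majorization order and the Schur-concavity of entropy:

```python
import numpy as np

def shannon(p):
    """Shannon entropy of a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# q majorizes p: every prefix sum of the sorted q dominates that of p.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.7, 0.2, 0.1])
assert np.all(np.cumsum(q) >= np.cumsum(p) - 1e-12)

# Schur-concavity of entropy (Lemma 1) then predicts H(p) >= H(q):
# the more concentrated vector q has the lower entropy.
print(shannon(p), shannon(q))
```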

Theorem 4. Suppose we have a matrix of embeddings Z ∈ R N × D and its covariance Z ⊤ Z . Then the effective rank of Z is a lower bound of exp( S 1 ( Z )) , where S 1 denotes the matrix-based entropy with α = 1 .

Proof. Denote the ordered singular values of Z as σ 1 ≥ · · · ≥ σ min( N,D ) ≥ 0 and the ordered eigenvalues of Z ⊤ Z as λ 1 ≥ · · · ≥ λ min( N,D ) ≥ 0 . Without loss of generality, assume that ∑ i σ i = ∑ i λ i = 1 ; if this is not the case, normalize by setting σ i := σ i / ∑ j σ j and λ i := λ i / ∑ j λ j .

It is straightforward to show that σ i ^2 = λ i . Because σ i ≤ 1 for all i , we have σ i ≥ λ i , which implies λ ≼ σ . Therefore, S 1 ( σ ) ≤ S 1 ( λ ) , and hence effective rank ( Z ) ≤ exp( S 1 ( Z )) .


Proposition 1. (Random Unit Vectors are Nearly Orthogonal) Suppose we have m unit vectors x 1 , . . . , x m in R D , distributed according to the uniform distribution on the hypersphere. Then with probability at least 1 − m^2 √(2π) e^( −Dϵ^2 / 2 ) , we have that for any pair i, j with i ≠ j ,

$$
\left| \langle x_i, x_j \rangle \right| \;\le\; \epsilon .
$$

Proof. We begin by defining the central ϵ -band around a slice of the hypersphere S D−1 as

$$
T_\epsilon \;=\; \left\{ x \in S^{D-1} \;:\; \left| \langle x, e_1 \rangle \right| \le \epsilon \right\} ,
$$

where e 1 denotes the first basis vector. The probability of a uniformly distributed vector on the unit sphere not landing in T ϵ ⊂ S D−1 can be bounded as

$$
\Pr\left[ x \notin T_\epsilon \right] \;\le\; \sqrt{2\pi}\, e^{-\frac{D\epsilon^2}{2}} .
$$

By rotational symmetry, for any fixed x j the inner product ⟨ x i , x j ⟩ is distributed as ⟨ x, e 1 ⟩ , so applying the union bound over each pair i ≠ j gives

$$
\Pr\left[ \exists\, i \ne j \;:\; \left| \langle x_i, x_j \rangle \right| > \epsilon \right] \;\le\; m^2 \sqrt{2\pi}\, e^{-\frac{D\epsilon^2}{2}} ,
$$

which completes the proof.
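The concentration the proposition formalizes is easy to see empirically (an illustrative simulation of ours; the sizes D and m are arbitrary):

```python
import numpy as np

# Empirical illustration of Proposition 1: normalizing i.i.d. Gaussian
# vectors yields uniform points on the sphere, and their pairwise inner
# products concentrate near zero at scale ~ 1/sqrt(D).
rng = np.random.default_rng(0)
D, m = 1024, 50
X = rng.standard_normal((m, D))
X /= np.linalg.norm(X, axis=1, keepdims=True)
G = X @ X.T                                   # pairwise inner products
off_diag = np.abs(G[~np.eye(m, dtype=bool)])  # drop the diagonal of ones
print(off_diag.max())                         # small: nearly orthogonal
```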

Theorem 5. ( Maximum Prompt Entropy Implies Large Dataset Entropy.) Suppose we have an orthogonally equivariant representation model Z such that for all sequences Z i = Z ( X i ) the prompt entropy is maximal and the rows are unit vectors. Suppose also that the data distribution Data is an isotropic unit Gaussian, and that we draw sequences of length L = D from it. Then with probability 1 − N^2 √(2π) e^( −Dϵ^2 / (2 N^2 ) ) over the draw of { x i } N i =1 ∼ Data , we have that,

$$

$$

Proof. First note that, since the prompt entropy is maximal for each sample i , where we denote Z i = Z ( X i ) , the matrix K Z = Z i Z ⊤ i is full rank. Since each row of Z i is a unit vector, we know that ∥ Z i ∥ F ^2 = L = ∑ L k =1 σ k ^2 . We also know that σ l = σ j for all pairs l, j by the assumption that the prompt entropy is maximized, so every singular value equals 1 . It follows that Z i Z ⊤ i is the identity, and the rows of Z i form an orthonormal set. We can then write, for some rotation matrix O i , that,

$$

$$

We will denote the average over sequences of length L , across all N samples, by the dataset matrix Z̄ = ( q 1 , q 2 , . . . , q N ) ⊤ . Since by assumption our model Z ( · ) is orthogonally equivariant and the Data distribution is radially symmetric, it follows that the { q i } N i =1 are random points on the hypersphere of radius 1 / √ L . This means that the matrix √ D Z̄ consists of rows that are uniform points on the hypersphere of radius 1 . Now notice that,

$$

$$

Since √ Lq i is a unit vector this will simplify to,

$$

$$

Now notice that, by Proposition 1, with probability at least 1 − N^2 √(2π) e^( −Dϵ^2 / (2 N^2 ) ) ,

$$

$$

$$

$$

$$

$$

So then since,

$$

$$

we have that e^{ −S 2 ( Z̄ Z̄ ⊤ ) } = ∥ Z̄ Z̄ ⊤ ∥ F ^2 . In particular,

$$

$$

This completes the proof.

Theorem 6. ( Minimal Prompt Entropy Implies Smaller Dataset Entropy.) Under the assumptions of Theorem 5, suppose instead that the prompt entropy is minimal for every prompt. Then, with high probability,

$$

$$

Proof. Since the prompt entropy is minimal for each sample, each Z ( X i ) is a rank-one matrix, so we can write it as an outer product Z ( X i ) = v i u ⊤ i . Moreover, since the rows of Z ( X i ) have unit length, all the rows are identical, so we may write, without loss of generality, Z ( X i ) = v i 1 ⊤ . It then follows that,

$$

$$

We will write the dataset average matrix as before, Z̄ = ( q 1 , q 2 , . . . , q N ) ⊤ . In particular, the matrix D N Z̄ has rows that are all unit vectors, and these are distributed uniformly on the hypersphere. Now notice that,

$$

$$

$$

$$

$$

$$

$$

$$

$$

$$

In particular,

$$

$$

Theorem 7. (Dataset Entropy Bounds InfoNCE) Let X ∼ Data be a discrete random variable distributed according to the data distribution. Let X → Z be the Markovian relation between X and the representation Z . Then, the InfoNCE loss on N samples from Data satisfies,

$$
\mathcal{L}_{\mathrm{InfoNCE}} \;\ge\; \log N - I(X;Z) \;\ge\; \log N - H(Z) .
$$

The entropy H ( Z ) is analogous to the Dataset Entropy.

Proof. The first inequality is a standard result from (Oord et al., 2018). For the second, use that,

$$
I(X;Z) \;=\; H(Z) - H(Z \mid X) \;\le\; H(Z) .
$$



Empirical Results

In this section, we empirically test our theoretical framework through extensive experiments across architectures, scales, and training regimes. We focus on three key questions:

· Do intermediate layers consistently outperform final layers across diverse downstream tasks?
· How do these intermediate representations differ across architectures, training stages, and scales?
· How do post-training methods (e.g., fine-tuning and chain-of-thought) reshape representations?

Downstream Task Performance

Key Takeaway: Intermediate layers of language models consistently outperform final layers across all architectures and tasks, challenging the conventional wisdom of using final-layer representations.

In this section, we use intermediate layers for downstream embedding tasks and employ our unified framework from Section 3, measuring all the embeddings across all layers.

Experimental Setup

Models We evaluate several distinct architectural families: Pythia and Llama3 (decoder-only transformers) (Biderman et al., 2023; Dubey et al., 2024), Mamba (a state-space model) (Gu & Dao, 2024), BERT (an encoder-only transformer) (Devlin et al., 2019), and LLM2Vec models (bidirectional attention) (Behnam Ghader et al., 2024).

Tasks We test each layer's embeddings on 32 tasks from the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022), spanning classification, clustering, and reranking for a comprehensive evaluation. We refer to the Appendix for details.


Intermediate Layers Often Outperform Final Layers

Are final-layer embeddings indeed optimal for downstream tasks? In Figure 1, we compare average performance on MTEB tasks across all layers of the three models.

Key observation. In nearly every task, some intermediate layer outperforms the final layer. The absolute improvement ranges from 2% to as high as 16% on average, and the best layer often resides around the mid-depth of the network. This phenomenon is consistent across all the different architectures. It confirms emerging observations in recent work for generation tasks (Bordes et al., 2023; El-Nouby et al., 2024; Chen et al., 2020; Fan et al., 2024) and extends them to a wider range of benchmarks and tasks.

Why do these layers matter? From our theoretical perspective, intermediate layers appear to strike a balance between retaining sufficient information (avoiding overcompression) and discarding low-level noise. Later in Section 4.2, we show that these sweet spots are not random but tied to how intermediate layers are processing information.

Figure 2: Pythia and Mamba's intermediate layers show pronounced changes in representation quality metrics, while BERT's remain more stable. Three representation evaluation metrics calculated on the wikitext dataset for every layer in Pythia-410M, Mamba 370M, and BERT-base architectures. The x-axis denotes layer depth as a percentage, allowing fair comparison between models with different layer counts.



Layer-Wise Metrics Correlate with Downstream Performance

To validate our framework, we analyze how each evaluation metric correlates with downstream performance. Figures 3 and 8 show distance correlations between metrics and task scores for Pythia-410M. We find that all metrics exhibit strong relationships with downstream performance. Among them, curvature, DiME, and InfoNCE stand out with particularly high correlations. These associations remain robust across different correlation measures, including Spearman and Kendall, reinforcing the reliability of our findings.

Our results suggest that our metrics capture some aspects of intermediate representations that contribute to downstream utility. In Appendix E, we leverage these strong correlations to select high-performing layers in an unsupervised manner, following (Agrawal et al., 2022; Garrido et al., 2023; Thilak et al., 2024). In short, we can identify an intermediate layer that surpasses the final layer in downstream performance, without using any task-specific labels. For instance, DiME-based layer selection leads to a 3% average improvement in MTEB scores for the Pythia-410M model.
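Concretely, unsupervised layer selection amounts to scoring every layer's embeddings with a label-free metric and taking the argmax. A minimal sketch (the function names and the toy entropy score, standing in for DiME, are ours):

```python
import numpy as np

def select_layer(layer_embeddings, score_fn):
    """Unsupervised layer selection: score each layer's (N, D) embedding
    matrix with a label-free metric and return the best layer's index."""
    scores = [score_fn(Z) for Z in layer_embeddings]
    return int(np.argmax(scores)), scores

# Toy label-free score: Shannon entropy of the trace-normalized Gram
# eigenvalues (Eq. 1 with alpha -> 1).
def entropy_score(Z):
    p = np.clip(np.linalg.eigvalsh(Z @ Z.T), 0.0, None)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

layers = [np.ones((6, 4)),                                   # collapsed layer
          np.random.default_rng(0).standard_normal((6, 4))]  # diverse layer
best, scores = select_layer(layers, entropy_score)
print(best)   # -> 1: the diverse layer scores higher
```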

Downstream Performance and Entropy Are Negatively Correlated


A Unified Framework for Evaluating Representation Quality

Key Takeaway: Matrix-based entropy unifies seemingly disparate metrics of representation quality, providing a single theoretical lens for analyzing compression, geometry, and invariance.

A central challenge in analyzing internal representations is determining how to assess their quality. Although existing work draws on numerous ideas-from mutual information to geometric manifold analysis to invariance under augmentations-these threads can seem disparate. In this section, we consolidate them into a unified theoretical framework that shows how these seemingly different metrics connect and why they collectively measure 'representation quality.'


Architectural and Scale Differences

Key Takeaway: Different architectures exhibit distinct patterns of information compression. Autoregressive models show mid-layer bottlenecks while bidirectional models maintain more uniform trends.

Aside from strong correlations with downstream performance, we can use our evaluation framework to assess the internal behaviors of LLMs. In both this section and Section 4.3, we use WikiText-103 (Merity et al., 2017) for analyzing our representation metrics on standard textual data. To investigate how architecture and model size influence representation quality, we compare three fundamentally different LLM variants, BERT (encoder-only), Pythia (decoder-only), and Mamba (state-space model), and then scale up Pythia to observe emerging trends.

Encoder vs. Decoder vs. SSM. Figure 2 shows how prompt entropy, curvature, and augmentation metrics evolve across each model's layers. BERT, which encodes the entire input bidirectionally, generally maintains high entropy across layers, suggesting minimal compression: the model can see all tokens at once and need not discard as much information. By contrast, the decoder-only Pythia exhibits a strong mid-layer entropy dip, reflecting its autoregressive objective's tendency to filter or prune non-local details in the middle of the network. As a result, Pythia's 'sweet spot' for downstream tasks often lies around mid-depth, where it balances essential context and compression. Mamba, meanwhile, processes sequences through a state-space approach that yields flatter, more uniform curves across depth: it neither retains as much information as BERT nor compresses as aggressively as Pythia's mid-layers. These conclusions align with Razzhigaev et al. (2024), who showed flat layer-wise anisotropy for encoder models and a spike in intermediate-layer anisotropy for decoder models.

Figure 3: Relationship between representation metrics and task performance averaged across layers for Pythia 410M. Using distance correlation (dCor), we see strong associative relationships across the board, with DiME exhibiting the strongest relationship with downstream performance. We use dCor due to its robustness and its ability to measure both linear and non-linear relationships (dCor ∈ [0, 1], with 0 indicating statistical independence and 1 indicating strong dependency). We defer additional results to the Appendix.
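The distance correlation used in Figure 3 is straightforward to compute from pairwise distances. A minimal sketch (the `dcor` function name and the toy data are mine, not from the paper):

```python
import numpy as np

def dcor(x, y):
    """Distance correlation between two 1-D samples (Szekely et al., 2007).

    Returns a value in [0, 1]; 0 indicates statistical independence and
    larger values indicate stronger (possibly non-linear) dependence.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def centered_dist(v):
        d = np.abs(v[:, None] - v[None, :])                    # pairwise distances
        return d - d.mean(0) - d.mean(1)[:, None] + d.mean()   # double centering

    A, B = centered_dist(x), centered_dist(y)
    dcov2_xy = (A * B).mean()
    dcov2_xx = (A * A).mean()
    dcov2_yy = (B * B).mean()
    denom = np.sqrt(dcov2_xx * dcov2_yy)
    return 0.0 if denom == 0 else float(np.sqrt(dcov2_xy / denom))

# A purely non-linear relationship: Pearson correlation is ~0, dCor is not.
x = np.linspace(-1, 1, 200)
y = x ** 2
print(dcor(x, y))
```

This illustrates why dCor suits metric-vs-performance comparisons: it detects dependence even when the relationship is non-monotonic.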

Scaling Size Effects. In Figure 12, we analyze Pythia models ranging from 14M to 1B parameters. Larger models display more pronounced intermediate compression (entropy dips), indicating a heightened ability to distill relevant features. We also observe smoother token trajectories (lower curvature) and stronger invariance (higher LiDAR), consistent with findings that bigger models more effectively filter noise and capture long-range dependencies. These trends reinforce why performance peaks in the middle of the network: larger models hold more capacity to compress intermediate representations, yet still preserve crucial semantic details.
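The curvature metric referenced above can be made concrete. One standard definition, assumed here for illustration (the paper's exact variant may differ), is the mean turning angle between consecutive token-step vectors along a layer's token trajectory:

```python
import numpy as np

def average_curvature(hidden_states):
    """Mean turning angle (radians) along a token trajectory.

    hidden_states: array of shape (num_tokens, hidden_dim), a layer's token
    embeddings in sequence order. A straight-line trajectory gives 0; a
    trajectory that reverses direction at every step approaches pi.
    """
    v = np.diff(hidden_states, axis=0)                # steps between consecutive tokens
    v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit step directions
    cosines = np.clip(np.sum(v[:-1] * v[1:], axis=1), -1.0, 1.0)
    return float(np.arccos(cosines).mean())

# A straight line has zero curvature; random-walk steps turn sharply on average.
line = np.outer(np.arange(10), np.ones(16))
rng = np.random.default_rng(0)
walk = np.cumsum(rng.standard_normal((10, 16)), axis=0)
print(average_curvature(line), average_curvature(walk))
```

Lower values on real hidden states correspond to the "smoother token trajectories" described for larger models.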

Finetuning Effects. In Figure 13, we study how finetuning affects the internal representations of Llama3 (Dubey et al., 2024). We compare the baseline Llama3-8B to two finetuned LLM2Vec models (BehnamGhader et al., 2024). The LLM2Vec-mntp-unsup-simcse model enables bidirectional attention in Llama3 and undergoes two unsupervised training phases to improve Llama3's performance on embedding tasks; LLM2Vec-mntp-supervised adds a further supervised finetuning phase. Both finetuned models show improved augmentation invariance. Furthermore, the unsupervised model has higher prompt entropy than Llama3, while the supervised model has lower.

We take regular prompts from the WikiText dataset, tokenize them, and then replace each token with probability p. Replacement tokens are drawn by sampling a random token from within the same prompt. We show examples below for varying levels of p.
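This corruption procedure can be sketched in a few lines (a hedged illustration: the tokenizer is abstracted away, and `corrupt_tokens` is a name of my choosing):

```python
import random

def corrupt_tokens(tokens, p, rng=None):
    """Replace each token independently with probability p by a token
    sampled uniformly from elsewhere in the same prompt."""
    rng = rng or random.Random()
    return [rng.choice(tokens) if rng.random() < p else tok for tok in tokens]

tokens = "the cat sat on the mat".split()
print(corrupt_tokens(tokens, p=0.5, rng=random.Random(0)))
```

Because replacements are sampled from within the prompt, the corrupted sequence keeps the original token vocabulary while scrambling local structure.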

Layer-Level Analysis of Transformer Sub-Components.

While our experiments treat each transformer layer as a single unit, transformer blocks are composed of multiple sub-layers (pre-attention normalization, self-attention, residuals, MLPs). By measuring entropy after each sub-layer, we find in Figure 15 that residual connections drive the mid-network compression observed in Section 4.2. Specifically:

  - Sub-layers before residuals (e.g., pre-attention, attention scores, or MLP pre-residual outputs) often show only mild compression.
  - Residual sub-layers exhibit a pronounced drop in entropy, indicating significant filtering of information.

A concurrent study (Csordás et al., 2025) observed a decrease in the residual-stream norm in the second half of decoder models, reinforcing our findings.

The strong entropy 'valley' at intermediate layers is tied to how residual paths merge new signals with the existing hidden state. This aligns with prior work indicating that residuals act as a regularizer (Marion et al., 2024), smoothing out spurious components in hidden representations.

Impact of Training Progression

Takeaway: The most significant changes during training occur in intermediate layers, while early layers stabilize quickly, supporting the detokenization hypothesis.

We measure Pythia's metrics at multiple checkpoints to understand how layer-wise representations evolve throughout training (Figures 4 and 11). Two main observations emerge:

Intermediate Layers Undergo the Most Change. The largest shifts in representation quality occur in mid-depth layers. Specifically, prompt entropy steadily decreases there as training progresses, implying that intermediate layers increasingly compress and abstract the input. Meanwhile, LiDAR scores are minimal in these same layers. Likewise, curvature becomes smoother in the middle of the network, suggesting the model refines its internal structure to capture longer-range or more nuanced patterns in language.

Early Layers Stabilize Quickly. In contrast to intermediate layers, the earliest layers change very little after the initial phase of training. This observation aligns with the 'detokenization' hypothesis of Lad et al. (2024), which posits that the main functional role of early layers is to convert raw tokens into a basic embedding space. This idea is closely related to the 'shared task' layers of Zhao et al. (2024), introduced in the context of instruction tuning on diverse tasks. In particular, they show that the first nine layers of Llama 2 7B (Touvron et al., 2023) perform general task-agnostic operations. As a result, the most substantial changes to representations, such as enhanced compression, are driven primarily by the intermediate layers, reinforcing their importance for learning robust, high-level features.

Are final-layer embeddings indeed optimal for downstream tasks? In Figure 1, we compare average performance on MTEB tasks across all layers of the three models.

Key observation. In nearly every task, some intermediate layer outperforms the final layer. The absolute improvement ranges from 2% to as high as 16% on average, and the best layer often resides around the mid-depth of the network. This phenomenon is consistent across all the different architectures. This confirms emerging observations in recent work for generation tasks (Bordes et al., 2023; El-Nouby et al., 2024; Chen et al., 2020; Fan et al., 2024) and extends them to a wider range of benchmarks and tasks.

Why do these layers matter? From our theoretical perspective, intermediate layers appear to strike a balance between retaining sufficient information (avoiding overcompression) and discarding low-level noise. Later, in Section 4.2, we show that these sweet spots are not random but are tied to how intermediate layers process information.
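The layer sweep behind this comparison can be sketched generically: given per-layer features for a labeled task, fit a cheap probe on each layer and select the best. Everything below is synthetic, and the nearest-centroid probe is a stand-in for the paper's actual evaluation protocol:

```python
import numpy as np

def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit class centroids on the train split, report test accuracy."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(0) for c in classes])
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None], axis=-1)
    preds = classes[np.argmin(dists, axis=1)]
    return float((preds == test_y).mean())

rng = np.random.default_rng(0)
n, d, n_layers = 200, 32, 6
labels = rng.integers(0, 2, size=n)
# Synthetic stand-in for per-layer hidden states: by construction, the
# class signal is strongest at a middle "layer".
signal = np.array([0.1, 0.3, 1.5, 1.0, 0.3, 0.1])
layers = [rng.standard_normal((n, d)) + s * labels[:, None] for s in signal]

accs = [nearest_centroid_accuracy(x[:150], labels[:150], x[150:], labels[150:])
        for x in layers]
best = int(np.argmax(accs))
print(best, accs)
```

With real models, `layers` would come from the stacked hidden states of each layer; the sweep itself is unchanged.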

Figure 2: Pythia and Mamba's intermediate layers show pronounced changes in representation quality metrics, while BERT's remain more stable. Three representation evaluation metrics calculated on the wikitext dataset for every layer in Pythia-410M, Mamba 370M, and BERT-base architectures. The x-axis denotes layer depth as a percentage, allowing fair comparison between models with different layer counts.

Impact of Chain-of-Thought Finetuning

Key Takeaway: CoT finetuning enables models to maintain richer context throughout their layers.

Recent work has highlighted Chain-of-Thought (CoT) finetuning as a powerful strategy for improving reasoning capabilities (Arefin et al., 2025; DeepSeek-AI, 2025). To examine its effects on representations, in Figure 5 we compare Qwen 2.5 and Qwen 2.5-Math (Yang et al., 2024), where the latter underwent additional math pretraining and CoT finetuning. Measuring token-level prompt entropy across sequence length reveals that the finetuned model maintains higher entropy with lower variance across examples.

Figure 4: Strong trends in intermediate behavior emerge during training. Representation evaluation metrics across layers at various Pythia-410M training checkpoints, ranging from step 1 to the final step at 143k. The x-axis is the model layer, showing how training affects different layers, while the colors denote different checkpoints during training.

Figure 5: Token-level prompt entropy across sequence lengths for Qwen 2.5 and Qwen 2.5-Math. The base model (Qwen 2.5) exhibits greater prompt compression, while the finetuned (Qwen 2.5-Math) has higher entropy, indicating more information retention.

These findings suggest that CoT finetuning encourages models to preserve more context throughout their hidden layers, enabling better multi-step reasoning. Our framework provides a quantitative lens into how CoT fine-tuning pushes models to maintain richer internal representations across sequences, explaining its effectiveness in multi-step tasks. While CoT traces can be inspected directly in these models, our approach is particularly valuable for analyzing models that reason in continuous latent space (Hao et al., 2024).

Extreme Input Conditions

To better probe the underlying factors affecting representation quality, we inspect each layer's responsiveness to different input types. We run Pythia-410M on three types of extreme prompts and measure prompt entropy across layers (Figure 6). We provide examples of these prompts in Appendix F. Overall, we find that:

  1. Token repetition compresses intermediate layers. As p increases (i.e., more repeated tokens), prompt entropy decreases sharply in mid-depth layers, suggesting that the model recognizes/encodes repetitive patterns and discards redundancy in its internal representation.
  2. Random tokens inflate early-layer entropy. Adding token-level randomness increases entropy significantly in early layers, revealing their sensitivity to noise. In contrast, deeper layers are more robust.

Overall, these results confirm that intermediate layers play a major role in handling complex or unusual inputs, selectively compressing or filtering out repetitive patterns while retaining crucial distinctions. Early layers are more sensitive to noise and the incremental benefit of adding more tokens diminishes with prompt length. This behavior highlights the diverse ways in which different layers balance the trade-off between preserving and discarding information, underscoring the significance of intermediate representations.

Comparison to Vision Transformers

Do our findings extend to other domains like computer vision? Vision models employ diverse architectures and training objectives, from fully supervised to self-supervised methods, and from bidirectional to autoregressive encoders. This diversity provides an ideal testbed to examine how well our findings generalize and how different training objectives shape internal representations.

We examine several representative vision approaches: ViT (Dosovitskiy et al., 2021), a supervised transformer trained on labeled data; CLIP (Radford et al., 2021), a weakly supervised image encoder; BEiT (Bao et al., 2022), a self-supervised encoder that reconstructs masked patches; DINOv2 (Oquab et al., 2024), a self-supervised approach leveraging augmentations and exponential-moving-average teachers; MAE (He et al., 2022), a self-supervised approach that reconstructs images from masked patches; AIM (El-Nouby et al., 2024), an autoregressive transformer that predicts the next patch in an image sequence (GPT-style next-token prediction); and AIMv2 (Fini et al., 2025), which extends AIM with a multimodal next-token prediction task. In Figure 14, we evaluate every model layer on ImageNet-1k with attention probing and our suite of metrics.

Figure 6: Prompt entropy across layers of Pythia 410M under various extreme input conditions. (a) Increasing token repetition leads to decreased entropy in intermediate layers. (b) Increasing token randomness results in higher entropy, especially in initial layers. (c) Unnormalized prompt entropy increases with prompt length due to the larger number of tokens. These results demonstrate how the model's internal representations adapt to different types of input perturbations.

AIM exhibits behavior similar to language models. AIM, which predicts image patches sequentially, exhibits the same entropy "valley" and accuracy peak at intermediate layers that we observed in language models like Pythia. This pattern suggests that autoregressive training, whether over text tokens or image patches, consistently creates a mid-depth information bottleneck. The sequential prediction constraint forces models to compress non-local contextual information early in processing, then selectively re-expand the most relevant features for accurate prediction. AIM's strong intermediate performance was first noted by El-Nouby et al. (2024). Interestingly, while the AIMv2 model does not show improved intermediate accuracy, it still produces an entropy valley. We hypothesize this difference is due to the multimodal text-vision pretext task, which may alter information compression dynamics.

Vision transformers behave differently from language models. All models except AIM exhibit strictly increasing downstream accuracy toward final layers. Similar trends have been shown for ResNets (Sorscher et al., 2022), where few-shot classification error decreases strictly across layers. Most non-autoregressive vision models show steadily increasing dataset entropy. The notable exception is BEiT, which exhibits a substantial intermediate dip. Taken together, the results suggest that without an autoregressive objective, vision transformers have less need for drastic transformations at mid-depth.

Autoregression as the driving factor. The strong mid-layer compression observed in LLMs appears to be not purely a property of 'sequential token data' versus 'image patch data,' but rather a byproduct of the pretraining objective. While various self-supervised (or fully supervised) objectives in vision foster more uniform feature building across layers, autoregressive vision models develop mid-layer bottlenecks similar to those we see in language. Thus the objective design, namely whether or not a model is autoregressive, appears crucial in shaping layer-wise representation quality, regardless of domain.

Discussion and Conclusion

We investigated the representation quality of intermediate layers in LLMs and their role in downstream task performance. We introduced a unified framework of evaluation metrics, established theoretical connections among them, and applied these metrics to analyze transformer-based architectures, SSMs, and vision models. A key phenomenon unveiled by prompt entropy is an information bottleneck in the middle layers of autoregressive transformers in both the vision and language domains. Furthermore, we showed that intermediate layers often surpass final layers in representation quality, with implications for feature relevance and extraction. DiME, curvature, and InfoNCE correlate well with downstream performance, suggesting a fundamental connection between representation and generalizability.

In conclusion, our work studies the internal representation dynamics of LLMs, offering theoretical and empirical insights as well as practical implications for optimizing model design and training strategies. Future work should further investigate the underlying causes of intermediate-layer compression and explore explicit finetuning to control compression.

Impact Statement

Our paper studies the inner workings of large language models, with findings that challenge typical assumptions about which layers produce the most useful representations. Our findings suggest that representations from intermediate layers can yield better performance on a variety of downstream tasks, which has implications for model interpretability, robustness, and efficiency.

From an ethical standpoint, the ability to leverage intermediate-layer representations could impact fairness and bias considerations in evaluating model performance or in model deployment. By helping better identify latent features and representations, our approach may amplify latent biases. We welcome and encourage future work to explore methods that can ensure that intermediate-layer representations do not disproportionately reinforce biases or lead to unintended disparities in real-world applications.

Acknowledgements

We thank the anonymous reviewers for their valuable feedback, which helped improve the clarity and presentation of our results. We are also grateful to Diego Doimo, Artemii Novoselov, Jhoan Keider Hoyos Osorio, Luis Sanchez, and Matteo Saponati (listed alphabetically) for fruitful discussions and helpful pointers to related literature. Oscar Skean is supported by the Office of the Under Secretary of Defense for Research and Engineering under award number FA9550-21-1-0227.

Detailed Definitions

Key Takeaway: Information-theoretic, geometric, and invariance-based metrics offer complementary perspectives on representation quality that can all be understood through matrix-based entropy.

We now introduce the seven representation evaluation metrics used in our experiments, grouped into three broad categories: (1) information-theoretic, (2) geometric, and (3) augmentation-invariance. All relate back to the Gram matrix K and hence to Eq. (1).

The first measure of token embedding diversity is what we call prompt entropy. This entropy is measured on the intermediate token representations of a prompt and captures how diverse those representations are.

We follow the work of Wei et al. (2024) and use α-order matrix-based entropy (Giraldo et al., 2014; Skean et al., 2023; 2024), which serves as a tractable surrogate for traditional Rényi α-order entropy (Rényi, 1961). The quantity is calculated using a similarity kernel κ on a batch of samples drawn from a distribution, without making explicit assumptions about the true distribution. The choice of kernel κ is flexible and can be any infinitely divisible kernel, such as the Gaussian, linear, or Laplacian kernel, among others. For this work, we restrict ourselves to the linear kernel κ(a, b) = ab⊤. This choice is motivated by the linear representation hypothesis (Park et al., 2024b), which finds that large language model representations encode high-level concepts such as truth (Burns et al., 2023), honesty (Mallen & Belrose, 2024), and part-of-speech (Mamou et al., 2020) in linearly separable manifolds.

The equation for matrix-based entropy was previously defined in Eq. 1. One way to interpret Eq. 1 is as the α-order Rényi entropy of the Gram matrix eigenvalues². Notice how each eigenvalue is divided by tr(K_Z) before being raised to the α power. This ensures the eigenvalues of K_Z sum to one (because tr(K) = Σ_i λ_i(K)), which is a necessary condition for treating them as a probability distribution. Furthermore, each eigenvalue of K_Z signifies the variance of samples in a particular principal-component direction (Scholkopf & Smola, 2018). If entropy is low, then the eigenvalues form a heavy-tailed distribution, which implies that a few components dominate the variance of samples in Z. At maximum entropy, on the other hand, the eigenvalues form a uniform distribution and samples are spread equally in all directions. Matrix-based entropy is reminiscent of the LogDet entropy, which uses the determinant of K_Z to capture how much 'volume' a dataset occupies (Shwartz-Ziv et al., 2023; Zhouyin & Liu, 2021). The LogDet entropy is given by S_LogDet(Z) = log det(K_Z) − log 2. One can use Jensen's inequality to show that the LogDet entropy is a lower bound of Eq. 1 in the limit α → 1 (Appendix J.4 of Shwartz-Ziv et al., 2023).

² The non-zero eigenvalues of the Gram matrix ZZ⊤ are equivalent to those of the covariance matrix Z⊤Z. Using the covariance matrix instead of the Gram matrix in Eq. 1 makes no difference and is more computationally efficient when D < N.

Figure 7: The behavior of Eq. 1 for varying values of α on Gram matrices with eigenvalues following a β-power law, λ_i = i^(−β).
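The footnote's computational shortcut, that ZZ⊤ and Z⊤Z share their non-zero eigenvalues, is easy to verify numerically (a quick sanity check, not paper code):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 8                               # more samples than dimensions, so D < N
Z = rng.standard_normal((N, D))

gram_eigs = np.linalg.eigvalsh(Z @ Z.T)    # N x N Gram matrix
cov_eigs = np.linalg.eigvalsh(Z.T @ Z)     # D x D covariance-style matrix

# The Gram matrix has rank at most D; its top-D eigenvalues match those of
# Z^T Z, so entropy can be computed on whichever matrix is smaller.
print(np.allclose(np.sort(gram_eigs)[-D:], np.sort(cov_eigs)))
```

In practice this means the entropy of very long batches can be computed from a D x D eigendecomposition instead of an N x N one.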

Depending on the choice of α, several special cases of matrix-based entropy can be recovered. In particular, in the limit α → 1 it equals Shannon entropy (also referred to as von Neumann entropy in quantum information theory (Bach, 2022; Boes et al., 2019)), and when α = 2 it equals collision entropy. Interestingly, the case of α = 2 can be calculated without explicit eigendecomposition (Skean et al., 2024). We show in Figure 7 how varying values of α affect the matrix-based entropy of Gram matrices with eigenvalues distributed with a β-power law such that λ_i = i^{-β}. The figure shows that for larger values of α, smaller eigenvalues contribute more to the entropy.
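To make the behavior in Figure 7 concrete, the quantity in Eq. 1 can be evaluated on synthetic power-law spectra in a few lines. The following is a minimal NumPy sketch (the helper name is ours, not from any released code):

```python
import numpy as np

def matrix_entropy_from_eigvals(eigvals, alpha):
    """Matrix-based entropy (Eq. 1) from the nonnegative eigenvalues of a Gram matrix."""
    p = eigvals / eigvals.sum()      # divide by tr(K) so the spectrum sums to one
    p = p[p > 0]
    if np.isclose(alpha, 1.0):       # the alpha -> 1 limit is Shannon / von Neumann entropy
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

# Power-law spectra lambda_i = i^{-beta}: heavier tails (larger beta) mean lower entropy
i = np.arange(1, 101, dtype=float)
for beta in (0.5, 1.0, 2.0):
    s1 = matrix_entropy_from_eigvals(i ** -beta, alpha=1.0)
    s2 = matrix_entropy_from_eigvals(i ** -beta, alpha=2.0)  # collision entropy
    print(f"beta={beta}: S_1={s1:.3f}, S_2={s2:.3f}")
```

A uniform spectrum recovers the maximum entropy log(r), while steeper power laws compress the distribution toward a few dominant components.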

Manual Examination of Prompts
Training Set Overlap

Code and Results

Key Takeaway: Our theoretical framework establishes concrete connections between representation entropy and downstream performance through properties like effective rank and invariance.

Here, we summarize key statements that justify why these metrics meaningfully measure representation quality. We refer to Appendix G for details and proofs. Beyond serving as a unifying view, matrix-based entropy also connects to foundational concepts like majorization, Schur concavity, and mutual information. Furthermore, we can directly relate the eigenvalue entropy to the matrix entropy, most naturally via the Effective Rank (Roy & Vetterli, 2007). The following theorem makes this connection explicit.

Theorem 1 (Lower Bound via Effective Rank) . For Shannon-based entropy ( α → 1 ),

$$
\mathrm{EffRank}(\mathbf{Z}) \;\le\; \exp\bigl(S_1(\mathbf{Z})\bigr),
$$

meaning a large effective rank implies a high entropy.
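Both quantities in this bound are cheap to compute from the singular values of Z, and they shrink together when a representation collapses. The following is a minimal NumPy sketch (helper names are ours, not from the paper's code) contrasting a near-isotropic embedding matrix with a rank-collapsed one:

```python
import numpy as np

def effective_rank(Z):
    """exp of the Shannon entropy of the normalized singular values (Roy & Vetterli, 2007)."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

def exp_entropy(Z):
    """exp(S_1(Z)), with S_1 taken over the Gram-matrix eigenvalues (squared singular values)."""
    lam = np.linalg.svd(Z, compute_uv=False) ** 2
    p = lam / lam.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
white = rng.standard_normal((256, 32))                                    # near-isotropic embeddings
collapsed = rng.standard_normal((256, 2)) @ rng.standard_normal((2, 32))  # rank-2 collapse
print(effective_rank(white), exp_entropy(white))          # both near the ambient dimension
print(effective_rank(collapsed), exp_entropy(collapsed))  # both bounded by the true rank of 2
```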

Under appropriate conditions on the data distribution and model, we can show connections between prompt entropy and dataset entropy via the following scaling behaviors:

Theorem 2 (Informal) .

  1. If prompt entropy remains near its maximum for all prompts, then the dataset entropy S_2(ZZ^⊤) grows on the order of log(L²/N).
  2. If prompt entropy instead stays near its minimum for all prompts, then dataset entropy grows more slowly, on the order of log(L²/N³).

In short, high token-level (prompt) diversity encourages broader global diversity in the dataset-level embeddings, whereas over-compressing token representations can limit how effectively different prompts separate. Our subsequent analysis connects these ideas to self-supervised objectives like InfoNCE, which also tie higher entropy to stronger robustness and discriminability in the learned representations.

Theorem 3 (Dataset Entropy Bounds InfoNCE) . For data X and representation Z ( X ) , the InfoNCE loss on N samples satisfies:

$$
\log(N) \;-\; \mathrm{InfoNCE} \;\le\; I(X;Z) \;\le\; H(Z),
$$

where H ( Z ) is interpretable as matrix-based entropy at the dataset level. Hence, reducing InfoNCE implies learning a higher-entropy (and thus often more robust) representation.
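The InfoNCE quantity in this bound can be estimated directly from two augmented views of a batch of embeddings. The following is a minimal NumPy sketch of the standard loss (our own implementation, with the positive pairs on the diagonal):

```python
import numpy as np

def info_nce(za, zb, temperature=0.1):
    """Standard InfoNCE on two views; row i of za is paired with row i of zb."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature                      # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))                 # positives sit on the diagonal

rng = np.random.default_rng(0)
N, D = 128, 32
base = rng.standard_normal((N, D))
za = base + 0.05 * rng.standard_normal((N, D))   # two lightly perturbed "views"
zb = base + 0.05 * rng.standard_normal((N, D))
loss = info_nce(za, zb)
mi_lower_bound = np.log(N) - loss   # lower-bounds I(X; Z), hence also H(Z)
```

With well-aligned views the loss is small, so log(N) - InfoNCE approaches log(N), forcing the representation entropy H(Z) to be large.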

Practical outlook. Overall, our theoretical analysis shows that compression (entropy), geometry (curvature, rank), and invariance (e.g. InfoNCE) are all facets of how the Gram matrix ZZ^⊤ distributes variance. Examining these metrics across different layers reveals exactly where a network 'prunes' redundancy (low entropy) versus where it preserves essential distinctions (high entropy). This unified perspective also facilitates cross-architecture comparisons (e.g. transformers vs. SSMs) by highlighting how each architecture organizes information internally. Beyond offering a theoretical foundation, it provides a practical blueprint for diagnosing, tuning, and improving hidden-layer representations.

Architectural Details

In this section, we elaborate on the specific architectures of transformers and State Space Models (SSMs). We outline the mathematical foundations, including the weight matrices, attention mechanisms for transformers, and the state transition matrices for SSMs. Detailed equations and parameter configurations are provided to facilitate replication and deeper understanding.

Transformer

The transformer architecture (Vaswani et al., 2017) utilizes self-attention mechanisms. Given an input x , the key ( K ), query ( Q ), and value ( V ) matrices are computed as:

$$
\mathbf{Q} = \mathbf{x}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{x}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{x}\mathbf{W}_V,
$$

where W_Q, W_K ∈ R^{d×d_k} and W_V ∈ R^{d×d_v} are learned weight matrices.

The attention weights are calculated using:

$$
\mathbf{A} = \operatorname{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} + \mathbf{M}\right),
$$

where M is a mask to enforce causality in autoregressive tasks.

The output is then:

$$
\mathbf{y} = \mathbf{A}\mathbf{V}.
$$
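The three equations above can be combined into a single-head causal attention step. The following is a minimal NumPy sketch (toy dimensions, unbatched, random weights standing in for learned ones):

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head attention: A = softmax(Q K^T / sqrt(d_k) + M), y = A V."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    dk = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(dk)
    T = x.shape[0]
    M = np.triu(np.full((T, T), -np.inf), k=1)   # -inf above the diagonal: causal mask
    scores = scores + M
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = w / w.sum(axis=-1, keepdims=True)        # row-wise softmax
    return A @ V, A

rng = np.random.default_rng(0)
T, d, dk = 5, 8, 4
x = rng.standard_normal((T, d))
y, A = causal_self_attention(
    x,
    rng.standard_normal((d, dk)),
    rng.standard_normal((d, dk)),
    rng.standard_normal((d, dk)),
)
```

The additive mask M zeroes out (after the softmax) every weight that would let position t attend to positions after t.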

State Space Models

SSMs (Gu & Dao, 2024) model sequences using recurrent dynamics. The hidden state h t and output y t at time t are updated as:

$$
\mathbf{h}_t = \mathbf{A}\mathbf{h}_{t-1} + \mathbf{B}\mathbf{x}_t,
$$

$$
\mathbf{y}_t = \mathbf{C}\mathbf{h}_t + \mathbf{D}\mathbf{x}_t,
$$

where A ∈ R^{n×n}, B ∈ R^{n×d}, C ∈ R^{d×n}, and D ∈ R^{d×d} are learned parameters.
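The recurrence can be unrolled directly. The following is a minimal NumPy sketch of the linear scan (toy dimensions; modern SSMs such as Mamba use structured, input-dependent parameters, which we omit here):

```python
import numpy as np

def ssm_scan(x, A, B, C, D):
    """Unrolled linear SSM: h_t = A h_{t-1} + B x_t,  y_t = C h_t + D x_t."""
    T, d = x.shape
    h = np.zeros(A.shape[0])
    y = np.empty((T, d))
    for t in range(T):
        h = A @ h + B @ x[t]
        y[t] = C @ h + D @ x[t]
    return y

rng = np.random.default_rng(0)
T, d, n = 10, 4, 8
x = rng.standard_normal((T, d))
A = 0.9 * np.eye(n)          # stable toy state-transition matrix
B = rng.standard_normal((n, d))
C = rng.standard_normal((d, n))
D = rng.standard_normal((d, d))
y = ssm_scan(x, A, B, C, D)
```

Because the state h_t only accumulates past inputs, the scan is causal by construction, with no explicit mask needed.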

Discussion on Prompt Entropy

We call the first measure of token-embedding diversity prompt entropy. It is measured on the intermediate token representations within a single prompt and captures how diverse those representations are.



Behavior of Matrix-Based Entropy for Different Choices of α

We focus on a key quantity known as matrix-based entropy (Giraldo et al., 2014; Skean et al., 2023), which applies directly to the Gram matrix K = ZZ ⊤ . Let { λ i ( K ) } be the (nonnegative) eigenvalues of K . For any order α > 0 , define:

$$
S_\alpha(\mathbf{Z}) \;=\; \frac{1}{1-\alpha}\,\log\!\Biggl(\sum_{i=1}^{r}\Bigl(\frac{\lambda_i(\mathbf{K})}{\operatorname{tr}(\mathbf{K})}\Bigr)^{\alpha}\Biggr),
$$

where r = rank(K) ≤ min(N, D). Intuitively, if only a few eigenvalues dominate, S_α(Z) is small, indicating a highly compressed representation. Conversely, if Z is spread out across many principal directions, S_α(Z) is large. By varying α, one smoothly transitions between notions like collision entropy (α = 2) and von Neumann entropy (α → 1). We will typically use α = 1 for simplicity.
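Since the nonzero eigenvalues of ZZ^⊤ and Z^⊤Z coincide, S_α can be computed from whichever matrix is smaller. The following is a minimal NumPy sketch verifying this numerically (the helper name is ours):

```python
import numpy as np

def s_alpha(K, alpha=1.0):
    """Matrix-based entropy of a PSD matrix K, per the definition above."""
    lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    p = lam / lam.sum()
    p = p[p > 1e-12]                 # drop numerically zero eigenvalues
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

rng = np.random.default_rng(0)
N, D = 200, 16
Z = rng.standard_normal((N, D))
# ZZ^T (N x N) and Z^T Z (D x D) share their nonzero eigenvalues, so either
# works; diagonalizing the smaller matrix is cheaper when D < N.
gram_entropy = s_alpha(Z @ Z.T)
cov_entropy = s_alpha(Z.T @ Z)
```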

Bridging geometry, invariance, and feature locality. A key benefit of matrix-based entropy is that it unifies multiple representational perspectives:

· Compression or information content: A handful of large eigenvalues in K = ZZ^⊤ indicates that Z is low-rank, i.e. the model has collapsed much of the input variation into fewer dimensions. In contrast, a more uniform eigenvalue spectrum implies higher-entropy, more diverse features.

· Geometric smoothness: If tokens within a prompt follow a trajectory in embedding space with sharp turns, that curvature can manifest as skewed eigenvalue spectra (Hosseini & Fedorenko, 2023). Curvature also differentiates local transitions (token-to-token) from global structural patterns across longer segments or entire prompts.

· Invariance under augmentations: Metrics like InfoNCE (Oord et al., 2018) and LiDAR (Thilak et al., 2024) effectively measure whether augmentations of the same sample (e.g. character swaps) map to similar embeddings. Strong invariance corresponds to stable clustering in ZZ^⊤, which again depends on the distribution of eigenvalues and how local vs. global features are retained or discarded.

Thus, evaluating S_α(Z) provides a single lens for assessing 'representation quality' across compression, geometric structure, and invariance, and highlights how both local details and global patterns are organized.

Dataset Details

Wikitext Dataset

We used the wikitext dataset (Merity et al., 2017) for the majority of our experiments in Sections 4.2 and 5. It was downloaded from Salesforce/wikitext on HuggingFace. The dataset consists of 100 million tokens scraped from Featured articles on Wikipedia. We filtered out prompts that were shorter than 30 tokens or were Wikipedia section headings.

MTEB

The 32 tasks we used from the Massive Text Embedding Benchmark (MTEB) are detailed in Table 1. They are English-language tasks covering clustering, classification, reranking, and sentence-to-sentence similarity.

AI-Medical-Chatbot Dataset


Prompt Augmentations

For the augmentation-invariance metrics such as InfoNCE, LiDAR, and DiME, we use the NLPAug library (Ma, 2019) to augment our prompts. We use three types of augmentations:

· SplitAug, which splits words into two parts;
· RandomCharAug, which substitutes, swaps, or deletes characters;
· KeyboardAug, which substitutes characters with neighboring keyboard keys to simulate typos.

We use the pseudocode below to perform our augmentations, using the default library settings for each type. When computing augmentation-invariance metrics like InfoNCE or DiME, we use the two augmented prompts rather than one augmented prompt alongside the original prompt. Note that these augmentations may change the token length T of a prompt.

aug = naf.Sequential([
    naw.SplitAug(p=0.3),
    nac.RandomCharAug(p=0.3),
    nac.KeyboardAug(p=0.3),
])
(aug_A, aug_B) = aug.augment(prompt, num_augmentations=2)

prompt -> "The quick brown fox jumps over the lazy dog."
aug_A  -> "The quDUk b rown fox wEmps o ver the l azy dog."
aug_B  -> "The qTuXi bro wn fox uVm)s ob3r the la_k dog."

Using Evaluation Metrics as a Performance Proxy

We previously demonstrated strong correlations between our unsupervised evaluation metrics and downstream performance. These correlations can be exploited to select high-performing layers for a given task entirely without supervision, as suggested by prior work (Agrawal et al., 2022; Garrido et al., 2023; Thilak et al., 2024).

In Figure 2, we apply this unsupervised layer-selection approach to Pythia-410M and LLM2Vec-8B using the 32-task MTEB benchmark introduced in Section 4.1. Rather than computing task accuracies for every layer, we compute DiME, InfoNCE, and dataset entropy for each task across all layers in a single forward pass. For each task, we then select the layer that minimizes one of these metrics, leveraging their negative correlation with downstream performance.

This straightforward yet effective method yields substantial performance improvements with no supervision. For example, DiME-based layer selection boosts the average MTEB score of Pythia-410M by 3%.
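The selection rule itself is a one-liner. The following is a schematic sketch with hypothetical per-layer metric values (the scores below are illustrative, not measured):

```python
import numpy as np

def select_layer(metric_per_layer):
    """Pick the layer minimizing an unsupervised metric (no labels required),
    exploiting its negative correlation with downstream accuracy."""
    return int(np.argmin(metric_per_layer))

# Hypothetical per-layer DiME values for one task (illustrative numbers only)
dime_by_layer = np.array([0.92, 0.85, 0.61, 0.48, 0.55, 0.70, 0.88])
best_layer = select_layer(dime_by_layer)
print(best_layer)   # the layer with the smallest DiME value
```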

Extreme Prompts

Increasing Repetition

We take regular prompts from the wikitext dataset, tokenize them, and then replace each token with probability p. We draw replacement tokens by sampling a random token from within the prompt. We show examples below for varying levels of p.

Increasing Randomness

We take regular prompts from the wikitext dataset, tokenize them, and then replace each token with probability p. We draw replacements uniformly from the tokenizer distribution. We show examples below for varying levels of p. Unlike the character-level random noise discussed in Appendix D, which might change the number of tokens T of a prompt, the token-level random noise used here does not.
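Both corruption schemes can be sketched as a single token-replacement routine. The following is a hypothetical helper (not the paper's code; the token ids below are illustrative):

```python
import numpy as np

def perturb_tokens(tokens, p, mode, vocab_size=None, rng=None):
    """Replace each token independently with probability p.
    mode='repeat': replacements are sampled from within the prompt itself.
    mode='random': replacements are drawn uniformly from the vocabulary.
    The number of tokens T is preserved in both modes."""
    rng = rng if rng is not None else np.random.default_rng()
    tokens = list(tokens)
    out = []
    for tok in tokens:
        if rng.random() < p:
            if mode == "repeat":
                out.append(tokens[rng.integers(len(tokens))])
            else:
                out.append(int(rng.integers(vocab_size)))
        else:
            out.append(tok)
    return out

token_ids = [101, 7592, 2088, 999, 102]   # illustrative token ids
noisy = perturb_tokens(token_ids, p=0.5, mode="repeat",
                       rng=np.random.default_rng(0))
```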

Random Prompts with Certain Length


Fractal Metrics

Curvature. Proposed by Hosseini & Fedorenko (2023), curvature captures how sharply the token embeddings turn when viewed as a sequence in R D . For a prompt of length L , let v k = z k +1 -z k be the difference between consecutive tokens. The average curvature is:

$$
\bar{C} \;=\; \frac{1}{L-2} \sum_{k=1}^{L-2} \arccos\!\Bigl(\frac{\mathbf{v}_{k+1}^\top \mathbf{v}_k}{\lVert\mathbf{v}_{k+1}\rVert\,\lVert\mathbf{v}_k\rVert}\Bigr).
$$

Higher curvature means consecutive tokens shift direction abruptly, reflecting more local, token-level features; lower curvature suggests a smoother trajectory and more global features.
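The curvature formula translates directly into code. The following is a minimal NumPy sketch (the helper name is ours); a straight-line trajectory gives zero curvature, while a noisy one does not:

```python
import numpy as np

def average_curvature(z):
    """Mean turning angle between consecutive differences v_k = z_{k+1} - z_k."""
    v = np.diff(z, axis=0)                               # (L-1, D) difference vectors
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    cosines = np.clip(np.sum(v[1:] * v[:-1], axis=1), -1.0, 1.0)
    return float(np.mean(np.arccos(cosines)))            # averages the L-2 angles

rng = np.random.default_rng(0)
line = np.outer(np.arange(10.0), np.ones(4))     # token trajectory on a straight line
wiggly = line + 0.5 * rng.standard_normal(line.shape)
print(average_curvature(line), average_curvature(wiggly))
```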

Results

In this section, we empirically test our theoretical framework through extensive experiments across architectures, scales, and training regimes. We focus on three key questions:

· Do intermediate layers consistently outperform final layers across diverse downstream tasks?

· How do these intermediate representations differ across architectures, training stages, and scales?

· How do post-training methods (e.g., fine-tuning and chain-of-thought) reshape representations?

Theorems

Definition 1. (Majorization) Let p, q ∈ R^n be nonnegative vectors such that ∑_{i=1}^n p_i = ∑_{i=1}^n q_i. We say that q majorizes p, denoted by p ≼ q, if their ordered sequences p_[1] ≥ ··· ≥ p_[n] and q_[1] ≥ ··· ≥ q_[n] satisfy:

$$
\sum_{i=1}^k p_{[i]} \leq \sum_{i=1}^k q_{[i]} \quad \text{for} \quad k = 1, \cdots, n.
$$

Definition 2. (Schur-Convexity) A real-valued function f on R n is called Schur-convex if p ≼ q = ⇒ f ( p ) ≤ f ( q ) , and Schur-concave if p ≼ q = ⇒ f ( q ) ≤ f ( p ) .

Lemma 1. The matrix-based entropy, as given in Equation 1, is a Schur-concave function for α > 0 . This result is well-known and, for instance, was recently given by Lemma 4.1 in (Giraldo et al., 2014).

Theorem 4. Suppose we have a matrix of embeddings Z ∈ R^{N×D} and its covariance Z^⊤Z. Then the effective rank of Z is a lower bound of exp(S_1(Z)), where S_1 denotes the matrix-based entropy with α = 1.

Proof. Denote the ordered singular values of Z as σ_1 ≥ ··· ≥ σ_min(N,D) ≥ 0 and the ordered eigenvalues of Z^⊤Z as λ_1 ≥ ··· ≥ λ_min(N,D) ≥ 0. Without loss of generality, assume that ∑_i σ_i = ∑_i λ_i = 1. If this is not the case, then set σ_i := σ_i / ∑_j σ_j and λ_i := λ_i / ∑_j λ_j.

It is straightforward to show that σ_i² = λ_i. Because σ_i ≤ 1 for all i, we have that σ_i ≥ λ_i. This implies that λ ≼ σ. Therefore S_1(σ) ≤ S_1(λ), and since the effective rank equals exp(S_1(σ)), it follows that effective rank(Z) ≤ exp(S_1(Z)).


Proposition 1. (Random Unit Vectors are Nearly Orthogonal) Suppose we have m unit vectors in R^D that are distributed according to the uniform distribution on the hypersphere. Then with probability at least 1 - m²√(2π) e^(-Dε²/2), we have that for any pair i, j with i ≠ j,

$$
|\langle \mathbf{v}_i, \mathbf{v}_j \rangle| \leq \epsilon.
$$

Proof. We can begin by defining the central ε-band around a slice of the hypersphere S^{D-1} as

$$
T_\epsilon = \{\, z \in \mathbb{S}^{D-1} : |\langle z, e_1 \rangle| \leq \epsilon / 2 \,\},
$$

where e_1 denotes the first basis vector. The probability of a uniformly distributed vector on the unit sphere landing in T_ε ⊂ S^{D-1} can be bounded as

$$
\mathbb{P}(T_\epsilon) \geq 1 - \sqrt{2\pi}\, e^{\frac{-D\epsilon^2}{2}}.
$$

Now, treating v_i as e_1 without loss of generality, we have that, when v_i and v_j are uniformly distributed on the hypersphere,

$$
\mathbb{P}(|\langle \mathbf{v}_i, \mathbf{v}_j \rangle| > \epsilon) \leq \sqrt{2\pi}\, e^{\frac{-D\epsilon^2}{2}}.
$$

Now, by the union bound over each pair i ≠ j, we get that

$$
\mathbb{P}(\exists\, i,j : |\langle \mathbf{v}_i, \mathbf{v}_j \rangle| > \epsilon) \leq \sum_{i \neq j} \mathbb{P}(|\langle \mathbf{v}_i, \mathbf{v}_j \rangle| > \epsilon) \leq m^2 \sqrt{2\pi}\, e^{\frac{-D\epsilon^2}{2}}.
$$

So then, with probability at least 1 - m²√(2π) e^(-Dε²/2), we have that for any pair i, j,

$$
|\langle \mathbf{v}_i, \mathbf{v}_j \rangle| \leq \epsilon.
$$
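The concentration this proposition describes is easy to observe empirically: pairwise inner products of random unit vectors in high dimension are small, at scale roughly 1/√D. The following is a small NumPy simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, D = 50, 1024
V = rng.standard_normal((m, D))
V = V / np.linalg.norm(V, axis=1, keepdims=True)   # m uniform points on the hypersphere
G = V @ V.T
off_diag = np.abs(G[~np.eye(m, dtype=bool)])
# Inner products concentrate near zero at scale ~ 1/sqrt(D), about 0.03 here
print(off_diag.mean(), off_diag.max())
```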

Theorem 5. (Maximum Prompt Entropy implies Large Dataset Entropy.) Suppose we have an orthogonally equivariant representation model Z such that for all sequences Z_i = Z(X_i) the prompt entropy is maximal and the rows are unit vectors. Suppose also that the data distribution Data is an isotropic unit Gaussian, and that we draw sequences of length L = D from it. Then with probability 1 - N²√(2π) e^(-Dε²/(2N²)) over the draw of {x_i}_{i=1}^N ∼ Data, we have that

$$
\Bigl|\, e^{-S_2(\bar{Z}\bar{Z}^\top)} - \frac{N}{L^2} \,\Bigr| \leq \epsilon.
$$

Proof. First note that, since the prompt entropy is maximal for each sample i, which we denote Z_i = Z(X_i), the matrix K_Z = Z_i Z_i^⊤ is full rank. Since by assumption each row of Z_i has unit norm, we know that ∥Z_i∥_F² = L = ∑_{k=1}^L σ_k². We also know that σ_l = σ_j for all pairs l, j by the assumption that the prompt entropy is maximized. It follows that Z_i is an orthogonal matrix (recall L = D) and the rows of Z_i form an orthonormal set. We can then write, for some rotation matrix O_i,

$$
\mathbf{q}_i = \frac{1}{L} \sum_{k=1}^{L} \mathbf{z}_k = \frac{1}{L}\, O_i \mathbf{1}.
$$

We will denote the average over sequences of length L, across all N samples, by the dataset matrix Z̄ = (q_1, q_2, …, q_N)^⊤. Since by assumption our model Z(·) is orthogonally equivariant and the Data distribution is radially symmetric, it follows that the {q_i}_{i=1}^N are random points on the hypersphere of radius 1/√L. This means that the matrix √D Z̄ consists of rows that are uniform points on the hypersphere of radius 1. Now notice that

$$
\lVert \bar{Z}\bar{Z}^\top \rVert_F^2 = \frac{1}{L^2} \Bigl( \sum_{i=1}^N \lVert \sqrt{L}\, \mathbf{q}_i \rVert^2 + \sum_{i \neq j} \langle \sqrt{L}\, \mathbf{q}_i, \sqrt{L}\, \mathbf{q}_j \rangle \Bigr).
$$

Since √L q_i is a unit vector, this simplifies to

$$
\lVert \bar{Z}\bar{Z}^\top \rVert_F^2 = \frac{1}{L^2} \Bigl( N + \sum_{i \neq j} \langle \sqrt{L}\, \mathbf{q}_i, \sqrt{L}\, \mathbf{q}_j \rangle \Bigr).
$$

Now notice that, by Proposition 1, with probability at least 1 - N²√(2π) e^(-Dε²/(2N²)),

$$
\forall\, i \neq j : \; |\langle \mathbf{v}_i, \mathbf{v}_j \rangle| \leq \frac{\epsilon}{N}.
$$

The union bound then tells us that

$$
\mathbb{P}\Bigl(\forall\, i \neq j : |\langle \sqrt{D}\,\mathbf{q}_i, \sqrt{D}\,\mathbf{q}_j \rangle| \leq \frac{\epsilon}{N^2}\Bigr) \geq 1 - N^2 \sqrt{2\pi}\, e^{\frac{-D\epsilon^2}{2N^2}}.
$$

So then, with the same probability over the draw of the data points, we have that

$$
\Bigl|\, \lVert \bar{Z}\bar{Z}^\top \rVert_F^2 - \frac{N}{L^2} \,\Bigr| \leq \epsilon.
$$

So then, since

$$
S_2(\bar{Z}\bar{Z}^\top) = \log\Bigl(\frac{1}{\lVert \bar{Z}\bar{Z}^\top \rVert_F^2}\Bigr),
$$

we have that e^(-S_2(Z̄Z̄^⊤)) = ∥Z̄Z̄^⊤∥_F². In particular,

$$
\Bigl|\, e^{-S_2(\bar{Z}\bar{Z}^\top)} - \frac{N}{L^2} \,\Bigr| \leq \epsilon,
$$

which completes the proof.
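The proof shows the q_i behave as uniform points on a sphere of radius 1/√L, so the predicted scaling ∥Z̄Z̄^⊤∥_F² ≈ N/L² can be checked by sampling them directly. The following is a small NumPy simulation under the theorem's assumptions (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 20, 256    # the theorem takes sequence length L equal to embedding dim D
# Per the proof, each q_i is a uniform point on the sphere of radius 1/sqrt(L)
q = rng.standard_normal((N, L))
q = q / np.linalg.norm(q, axis=1, keepdims=True) / np.sqrt(L)
frob2 = np.linalg.norm(q @ q.T) ** 2    # ||Zbar Zbar^T||_F^2 (Frobenius norm)
prediction = N / L**2
print(frob2, prediction)                # the two agree up to the epsilon slack
```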

Theorem 6. (Minimum Prompt Entropy implies Smaller Dataset Entropy.) Under the assumptions of Theorem 5, suppose instead that the prompt entropy is minimal for every sequence. Then with high probability over the draw of {x_i}_{i=1}^N ∼ Data, we have that

$$
\Bigl|\, e^{-S_2(\bar{Z}\bar{Z}^\top)} - \frac{N^3}{L^2} \,\Bigr| \leq \epsilon.
$$

Proof. Since the prompt entropy is minimal for each sample, each Z(X_i) is a rank-one matrix, so we can write it as an outer product Z(X_i) = v_i u_i^⊤. However, since the rows of Z(X_i) are of unit length, all the rows are identical, so we may write, without loss of generality, Z(X_i) = v_i 1^⊤. Then it follows that

$$
\mathbf{q}_i = \frac{1}{L} \sum_{j=1}^{L} \mathbf{z}_j^{(i)} = \frac{N}{L}\, \mathbf{v}_i.
$$

We will write the dataset average matrix as before as ¯ Z = ( q 1 , q 2 , . . . q N ) ⊤ . In particular the matrix D N ¯ Z has rows that are all unit vectors, and these are randomly distributed uniformly on the hyper-sphere. Now notice that,

$$
\lVert \bar{Z}\bar{Z}^\top \rVert_F^2 = \sum_{i=1}^N \lVert \mathbf{q}_i \rVert^2 + \sum_{i \neq j} \langle \mathbf{q}_i, \mathbf{q}_j \rangle = \sum_{i=1}^N \frac{N^2}{L^2} \lVert \mathbf{v}_i \rVert^2 + \sum_{i \neq j} \frac{N^2}{L^2} \langle \mathbf{v}_i, \mathbf{v}_j \rangle = \frac{N^3}{L^2} + \sum_{i \neq j} \frac{N^2}{L^2} \langle \mathbf{v}_i, \mathbf{v}_j \rangle.
$$

Now, by Proposition 1, with probability at least 1 - N²√(2π) e^(-D³ε²/(2N⁸)), we know that, for all i ≠ j,

$$
|\langle \mathbf{v}_i, \mathbf{v}_j \rangle| \leq \frac{\epsilon L^2}{N^4}.
$$

So then we have that

$$
\Bigl|\, \lVert \bar{Z}\bar{Z}^\top \rVert_F^2 - \frac{N^3}{L^2} \,\Bigr| \leq \sum_{i \neq j} \frac{N^2}{L^2} |\langle \mathbf{v}_i, \mathbf{v}_j \rangle| \leq \epsilon.
$$

In particular,

$$
\Bigl|\, e^{-S_2(\bar{Z}\bar{Z}^\top)} - \frac{N^3}{L^2} \,\Bigr| \leq \epsilon.
$$

Theorem 7. (Dataset Entropy Bounds InfoNCE) Let X ∼ Data be a discrete random variable distributed according to the data distribution. Let X → Z be the Markovian relation between X and the representation Z . Then, the InfoNCE loss on N samples from Data satisfies,

$$
\log(N) - \mathrm{InfoNCE} \leq I(X; Z) \leq H(Z).
$$

The entropy H ( Z ) is analogous to the Dataset Entropy.

Proof. The first inequality follows as a simple result from (Oord et al., 2018). Then, use that,

$$
I(X; Z) = H(Z) - H(Z|X) \leq H(Z).
$$

Additional Plots & Visualizations

$$ \label{eq:matrix-based-entropy} S_\alpha(\mathbf{Z}) ;=; \frac{1}{1-\alpha} ,\log !\biggl(,\sum_{i=1}^{r}!\Bigl(\tfrac{\lambda_i(\mathbf{K})}{\mathrm{tr}(\mathbf{K})}\Bigr)^\alpha\biggr), $$ \tag{eq:matrix-based-entropy}

$$ \label{eq:alpha1} S_1(\mathbf{Z}) ;=-\sum_{i=1}^{r} \lambda_i(\mathbf{K}) \log \lambda_i(\mathbf{K}) $$ \tag{eq:alpha1}

$$ \bar{C} ;=; \frac{1}{L-2} \sum_{k=1}^{L-2} \arccos \ !\Bigl( \frac{\mathbf{v}_{k+1}^\top \mathbf{v}k}{|\mathbf{v}{k+1}||\mathbf{v}_k|} \Bigr). $$

$$ \mathbf{Q} = \mathbf{x}\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{x}\mathbf{W}_K, \quad \mathbf{V} = \mathbf{x}\mathbf{W}_V, $$

$$ \mathbf{A} = \operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} + \mathbf{M}\right), $$

$$ \mathbf{y} = \mathbf{A}\mathbf{V}. $$

$$ \label{eqn:attention} $$ \tag{eqn:attention}

$$ \sum_{i=1}^k p_{[i]} \leq \sum_{i=1}^k q_{[i]} \textrm{\quad for \quad} k = 1, \cdots, n $$

$$ \mathbf{h}t &= \mathbf{A}\mathbf{h}{t-1} + \mathbf{B}\mathbf{x}_t, \ \mathbf{y}_t &= \mathbf{C}\mathbf{h}_t + \mathbf{D}\mathbf{x}_t, $$

$$ \mathbb{P}(\exists i,j : \langle |\mathbf{v_i}, \mathbf{v_j} \rangle|>\epsilon) &\leq \sum_{i\not= j}\mathbb{P}(|\langle \mathbf{v_i}, \mathbf{v_j} \rangle|>\epsilon)\ &\leq m^2 \sqrt{2\pi} e^{\frac{-D\epsilon^2}{2}}. $$

$$ |\bar Z\bar Z^\top |F^2 &= \frac1{L^2}| L \bar Z\bar Z^\top |F^2\ &= \frac{1}{L^2} (\sum{i=1}^N |\sqrt L q_i|^2 + \sum{i\not= j}\langle \sqrt L q_i, \sqrt L q_j \rangle ). $$

$$ |\bar Z\bar Z^\top|F^2 &= \sum{i=1}^N | \mathbf q_i|^2 + \sum_{i\not= j}\langle \mathbf q_i,\mathbf q_j \rangle \ &= \sum_{i=1}^N \frac{N^2}{L^2}|\mathbf v_i|^2 + \sum_{i\not= j}\frac{N^2}{L^2}\langle\mathbf v_i, \mathbf v_j \rangle \ &= \frac{N^3}{L^2} + \sum_{i\not= j}\frac{N^2}{L^2}\langle \mathbf v_i, \mathbf v_j \rangle. $$

$$ \mathrm{EffRank}(\mathbf{Z}) ;\le; \exp\bigl(S_1(\mathbf{Z})\bigr), $$

$$ \log(N) ;-; \mathrm{InfoNCE} ;;\le;; I(X;Z) ;;\le;; H(Z), $$

$$ |\langle \mathbf{v_i}, \mathbf{v_j} \rangle|\leq \epsilon. $$

$$ \mathbb P(T_\epsilon) \geq 1- \sqrt{2\pi} e^{\frac{-D\epsilon^2}{2}}. $$

$$ \mathbf{q_i} = \frac1L \sum_{i=1}^L\mathbf{z_i} = \frac1L O_i \mathbf1. $$

$$ \ | |\bar Z\bar Z^\top |_F^2 \ - \ \frac{N}{L^2} | \leq \epsilon. $$

$$ I(X; Z) = H(Z) - H(Z|X) \leq H(Z). $$

Theorem. [Matrix-Based Entropy is Schur-concave] % % For (\alpha > 0), (S_\alpha(Z)) in Eq.~eq:matrix-based-entropy is Schur-concave with respect to the ordered eigenvalues of (K=ZZ^\top). %

Theorem. [Lower Bound via Effective Rank] For Shannon-based entropy ($\alpha\to1$), [ EffRank(Z) ;\le; \exp\bigl(S_1(Z)\bigr), ] meaning a large effective rank implies a high entropy.

Theorem. [Lower Bound via Effective Rank] Let (Z\in R^{N\times D}) and (Z^\top Z) have singular values (\sigma_1\ge \dots \ge \sigma_D). Denote Shannon-based matrix entropy by (S_1(Z)). Then [ EffRank(Z) ;\le; \exp\bigl(S_1(Z)\bigr), ] meaning high (\alpha=1) entropy implies a large effective rank.

Theorem. [Informal] enumerate[itemsep=1pt, topsep=0pt] \item If prompt entropy remains near its maximum for all prompts, then the dataset entropy $S_2!\bigl(Z ,Z^{\top}\bigr)$ grows on the order of $ \log !\bigl(L^2{N}\bigr). $ \item If prompt entropy instead stays near its minimum for all prompts, then dataset entropy grows more slowly, on the order of $ \log !\bigl(L^2{N^3}\bigr). $ enumerate

Theorem. [Dataset Entropy Bounds InfoNCE] For data (X \sim Data) and representation (Z(X)), the InfoNCE loss oord2018representation on (N) samples satisfies: [ \log(N) ;-; InfoNCE ;;\le;; I(X;Z) ;;\le;; H(Z), ] where (H(Z)) can be interpreted as a (dataset-level) matrix-based entropy. Hence, lowering InfoNCE is consistent with learning a representation (Z) of higher overall entropy, underscoring the alignment between invariance metrics and the geometry of (Z).

Theorem. [Dataset Entropy Bounds InfoNCE] For data $X$ and representation $Z(X)$, the InfoNCE loss on $N$ samples satisfies: [ \log(N) - InfoNCE ;;\le;; I(X; Z) ;;\le;; H(Z), ] where $H(Z)$ is interpretable as matrix-based entropy at the dataset level. Hence, reducing InfoNCE implies learning a higher-entropy (and thus often more robust) representation.

Theorem. Suppose we have a matrix of embeddings $Z \in R^{N \times D}$ and its covariance $Z^T Z$. Then the effective rank of $Z$ is an lower bound of $\exp(S_1(Z))$, where $S_1$ denotes the matrix-based entropy of $\alpha=1$.

Theorem. (Maximum Prompt Entropy implies Large Dataset Entropy.) Suppose we have a orthogonally equivarient representation model $Z$ such that for all sequences $Z_i = Z(X_i)$ the prompt entropy is maximal and the rows are unit. Suppose also that the data distribution $Data$ is a isotropic unit Gaussian. Suppose we draw sequences of length $L = D$ from the data distribution. Then with probability $1-N^2 2\pi e^{-D\epsilon^2{2N^2}}$ over draw of ${x_i}_{i=1}^N \sim Data$, we have that, [ |e^{-S_2(QQ^\top)} - \frac N{L^2} | \leq \epsilon ]

Theorem. (Dataset Entropy Bounds InfoNCE) Let $X\sim Data$ be a discrete random variable distributed according to the data distribution. Let $X \to Z$ be the Markovian relation between $X$ and the representation $Z$. Then, the InfoNCE loss on $N$ samples from $Data$ satisfies, [ \log(N) - InfoNCE \leq I(X; Z) \leq H(Z). ] The entropy $H(Z)$ is analogous to the Dataset Entropy.

Lemma. The matrix-based entropy, as given in Equation~eq:matrix-based-entropy, is a Schur-concave function for $\alpha>0$. This result is well-known and, for instance, was recently given by Lemma 4.1 in giraldo2014measures.

Proposition. (Random Unit Vectors are Nearly Orthogonal) Suppose we have $m$ unit vectors in $\R^D$, that are distributed according to the uniform distribution on the hyper-sphere. Then with probability at least $1-m^2 2\pi e^{-D\epsilon^2{2}}$, we have that for any pair $i,j$, $i\not=j$, [ |\langle v_i, v_j \rangle|\leq \epsilon. ]

Definition. {(Majorization)} Let $p,q \in {R}^n$ be nonnegative vectors such that $\sum_{i=1}^N p_i = \sum_{i=1}^N q_i$. We say that q majorizes p, denoted by $p \preccurlyeq q$, if their ordered sequences $p_{[1]} \geq \cdots \geq p_{[n]}$ and $q_{[1]} \geq \cdots \geq q_{[n]}$ satisfy: equation \sum_{i=1}^k p_{[i]} \leq \sum_{i=1}^k q_{[i]} \quad for \quad k = 1, \cdots, n equation

Definition. {(Schur-Convexity)} A real-valued function $f$ on $R^n$ is called Schur-convex if $p \preccurlyeq q \implies f(p) \leq f(q)$, and Schur-concave if $p \preccurlyeq q \implies f(q) \leq f(p)$.

Proof. Denote the ordered singular values of $Z$ as $\sigma_1 \geq \cdots \geq \sigma_{(N,D)} \geq 0$ and the ordered eigenvalues of $Z^T Z$ as $\lambda_1 \geq \cdots \geq \lambda_{(N,D)} \geq 0$. Without loss of generality, assume that $\sum_{i=1}^N \sigma_i = \sum_{i=1}^N \lambda_i = 1$. If this is not the case, then set $\sigma_i \coloneq \sigma_i{\sum_{i=1}^N \sigma_i}$ and $\lambda_i \coloneq \lambda_i{\sum_{i=1}^N \lambda_i}$. It is straightforward to show that $\sigma_i^2 = \lambda_i$. Because $\forall i \quad \sigma_i \leq 1$, we have that $\sigma_i \geq \lambda_i$. This implies that $\lambda \preccurlyeq \sigma$. Therefore, $S_1(\sigma) \leq S_1{(\lambda)} \implies effective rank(Z) \leq S_1{(Z)}$.

Proof. We can begin by defining the central $\epsilon$-band around a slice of the hypersphere $\mathbb S_{D-1}$ as, [ T_\epsilon = { z \in \mathbb S_{D-1} : |\langle z,e_1\rangle| \leq \epsilon / 2}, ] where $e_1$ denotes the first basis vector. The probability of a uniformly distributed vector on the unit sphere not landing in $T_\epsilon \subset \mathbb S_{D-1}$ can be bounded as, [ \mathbb P(T_\epsilon) \geq 1- 2\pi e^{-D\epsilon^2{2}}. ] Now, treating $v_i$ as $e_1$, the basis vector, without loss of generality, we have that, when $v_i, v_j$ are uniformly distributed on the hyper-sphere, [ \mathbb P(|\langle v_i , v_j \rangle | < \epsilon) \leq 2\pi e^{-D\epsilon^2{2}} ] Now, by the union bound on each $i\not=j$, we get that, align* P(\exists i,j : \langle |v_i, v_j \rangle|>\epsilon) &\leq \sum_{i\not= j}P(|\langle v_i, v_j \rangle|>\epsilon)\ &\leq m^2 2\pi e^{-D\epsilon^2{2}}. align* So then with probability at least $1-m^2 2\pi e^{-D\epsilon^2{2}}$, we have that, for any pair $i,j$, [ | \langle v_i, v_j \rangle | \leq \epsilon. ]

Proof. First note that, since the prompt entropy is maximal for each sample $ i $, which we denote $Z_i = Z(X_i)$, then the matrix $K_Z = Z_iZ_i^\top$ is full rank. Since by assumption each row of $Z_i$ has unit rows, then we know that $|Z_i|F^2 = L = \sum{k=1}^L \sigma_k^2$. In particular we also know that $\sigma_l = \sigma_j$ for all pairs $l,j$ by the assumption that the prompt entropy is maximized. In particular we then know that $Z_iZ_i^\top$ is a orthogonal matrix, and the rows of $Z_i$ form an orthonormal set. We can then write, for some $O_i$ a rotation matrix, that, [ q_i = \frac1L \sum_{i=1}^Lz_i = \frac1L O_i \mathbf1. ] We will denote the average over sequences of length $L$, across all $N$ samples, by the dataset matrix $\bar Z = (\mathbf q_1, \mathbf q_2, \ldots \mathbf q_N)^\top$. Since by assumption our model $Z(\cdot)$ is orthogonally equivariant, and the $Data$ distribution is radially symmetric, it follows that these ${ q_i }_{i=1}^N$ are random points on the hypersphere of radius $1{L}$. This means that the matrix $D\bar Z$ consists of rows that are uniform points on hypersphere of radius $1$. Now notice that, align* |\bar Z\bar Z^\top |F^2 &= \frac1{L^2}| L \bar Z\bar Z^\top |F^2\ &= 1{L^2} (\sum{i=1}^N |\sqrt L q_i|^2 + \sum{i\not= j}\langle \sqrt L q_i, \sqrt L q_j \rangle ). align* Since $\sqrt L q_i$ is a unit vector this will simplify to, [ |\bar Z\bar Z^\top |F^2 = 1{L^2} (N + \sum{i\not= j}\langle \sqrt L q_i, \sqrt L q_j \rangle ). ] Now notice that by proposition, we have that with probability at least $1-N^2 2\pi e^{-D\epsilon^2{2N^2}}$, [ \forall i\not=j : \langle v_i, v_j \rangle \leq \frac \epsilon N. ] The union bound then tells us that, [ \mathbb P(\forall i\not= j : |\langle \sqrt D q_i, \sqrt D q_j \rangle| \leq \epsilon{N^2}) \geq 1-N^2 2\pi e^{-D\epsilon^2{2N^2}}. 
] So then with probability at least $1-N^2 2\pi e^{-D\epsilon^2{2N^2}}$ over the draw of the data points, we have that, [ \ | |\bar Z\bar Z^\top |_F^2 \ - \ N{L^2} | \leq \epsilon. ] So then since, [ S_2(\bar Z\bar Z^\top) = \log\left(\frac1{|\bar Z\bar Z^\top|_F^2}\right), ] we have that, $e^{-S_2(\bar Z\bar Z^\top)} = |\bar Z\bar Z^\top|_F^2$. In particular, [ |e^{-S_2(\bar Z\bar Z^\top)} - \frac N{L^2} | \leq\epsilon. ] Which completes the proof.

Proof. Since the prompt entropy is minimal for each sample, we know that each $Z(X_i)$ will be a rank one matrix, so we can write it as the outer product. In particular, we can write $Z(X_i) = v_i{u_i}^\top $. However, since the rows of $Z(X_i)$ are of unit length, we know that all the rows are identical, so we may write without loss of generality, $Z(X_i) = v_i1^\top$. Then, it follows that, [ q_i = \frac1L \sum_{i=j}^L z_j^i = \frac NL v_i. ] We will write the dataset average matrix as before as $\bar Z = (\mathbf q_1, \mathbf q_2, \ldots \mathbf q_N)^\top$. In particular the matrix $\frac DN \bar Z$ has rows that are all unit vectors, and these are randomly distributed uniformly on the hyper-sphere. Now notice that, align* |\bar Z\bar Z^\top|F^2 &= \sum{i=1}^N | \mathbf q_i|^2 + \sum_{i\not= j}\langle \mathbf q_i,\mathbf q_j \rangle \ &= \sum_{i=1}^N N^2{L^2}|\mathbf v_i|^2 + \sum_{i\not= j}N^2{L^2}\langle\mathbf v_i, \mathbf v_j \rangle \ &= N^3{L^2} + \sum_{i\not= j}N^2{L^2}\langle \mathbf v_i, \mathbf v_j \rangle. align* Now by the prior proposition, with probability at least $1-N^2 2\pi e^{-D^3\epsilon^2{2N^8}}$, we know that, for all $i \not = j$, [ |\langle v_i, v_j \rangle| \leq \epsilon L^2{N^4}. ] So then we have that, [ ||\bar Z\bar Z^\top|F^2 - N^3{L^2}| \leq \sum{i\not= j}N^2{L^2}\langle |\mathbf v_i, \mathbf v_j \rangle| \leq \frac1{L^2}\sum_{i\not= j}\epsilon \leq \epsilon. ] In particular, [ |e^{-S_2(\bar Z\bar Z^\top)} - N^3{L^2} | \leq \epsilon. ]

Proof. The first inequality follows as a simple result from Oord et al. (2018). Then, use that
\[ I(X; Z) = H(Z) - H(Z|X) \leq H(Z). \]



Figure 8: Relationships between representation metrics and task performance averaged across layers for Pythia 410M. Using a variety of linear and non-linear measures (Spearman's ρ, Kendall's τ, and distance correlation, dCor), we see strong inverse associations, with the exception of InfoNCE, which shows a positive but still strong association. Ranges are ρ, τ ∈ [−1, 1] and dCor ∈ [0, 1], with 0 indicating independence and 1 indicating strong dependency.


Figure 9: Relationship between representation metrics and task performance averaged across layers for BERT. Using distance correlation (dCor), we see strong associative relationships across the board, with LiDAR and dataset entropy exhibiting the strongest relationship with downstream performance. We use dCor due to its robustness and ability to measure both linear and non-linear relationships (dCor ∈ [0, 1], with 0 indicating statistical independence and 1 indicating strong dependency). Other correlative measures also indicate moderate to strong relationships.

Table 1: MTEB Tasks used in experiments covering a wide range of different use-cases and domains.

Figure 10: Pythia's intermediate layers show pronounced changes in representation quality metrics, while Mamba's remain more stable. Representation evaluation metrics across layers in Pythia 410M and Mamba 370M architectures. The x-axis denotes model depth as a percentage, allowing fair comparison between models with different layer counts.



Figure 11: Representation evaluation metrics across layers at various training checkpoints, ranging from step 1 to the final step at 143k. The x-axis represents the depth percentage of the model, showing how training affects different layers, particularly in the intermediate stages.

Figure 12: Pythia and Mamba's intermediate layers show pronounced changes in representation quality metrics, while BERT's remain more stable. Three representation evaluation metrics calculated on the wikitext dataset for every layer in Pythia-410M, Mamba 370M, and BERT-base architectures. The x-axis denotes layer depth as a percentage, allowing fair comparison between models with different layer counts.


Figure 13: Finetuning affects the internal behavior of LLMs. Representation evaluation metrics across layers for Llama3 and two finetuned versions of Llama3.


Figure 14: Comparison of vision models trained on different pretext tasks. The dataset is ImageNet-100 (Tian et al., 2020) and all models use the same 24-layer ViT-L architecture. The validation accuracy is calculated using attention probing on tokens from a frozen backbone layer, following the work of El-Nouby et al. (2024).



Figure 15: Behavior of effective rank at different stages within a transformer block.

Task Domain | Tasks | # Tasks (32 Total)
Pair Classification | SprintDuplicateQuestions, TwitterSemEval2015, TwitterURLCorpus | 3
Classification | AmazonCounterfactualClassification, AmazonReviewsClassification, Banking77Classification, EmotionClassification, MTOPDomainClassification, MTOPIntentClassification, MassiveIntentClassification, MassiveScenarioClassification, ToxicConversationsClassification, TweetSentimentExtractionClassification | 10
Clustering | ArxivClusteringS2S, BiorxivClusteringS2S, MedrxivClusteringS2S, RedditClustering, StackExchangeClustering, TwentyNewsgroupsClustering | 6
Reranking | AskUbuntuDupQuestions, MindSmallReranking, SciDocsRR, StackOverflowDupQuestions | 4
Sentence to Sentence | BIOSSES, SICK-R, STS12, STS13, STS14, STS15, STS16, STS17, STSBenchmark | 9
Model | Supervised (Best) | Naive (Last) | min-DiME | min-InfoNCE | min-Dataset Entropy
Pythia-410M | 52.0 | 45.5 | 48.5 | 46.2 | 48.1
LLM2Vec-8B | 66.3 | 63.9 | 60.0 | 64.3 | 50.4
The four rightmost columns are unsupervised layer-selection strategies; Naive (Last) uses the final layer.



In this paper, we conduct a layer-wise analysis of LLMs across diverse architectures—including Transformers (Vaswani et al., 2017), state-space models (SSMs) (Gu & Dao, 2024), and encoder-based models like BERT (Devlin, 2018)—spanning parameter scales from tens of millions to billions. Through systematic evaluation on 32 embedding tasks from the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022), we find that intermediate layers often surpass the final layer by up to 16% in downstream accuracy. Figure 1 illustrates this phenomenon, where mid-depth layers provide particularly strong representations while the very last layer can become overly specialized to the pretraining objective.

A unified framework. To understand intermediate layers’ effectiveness, we integrate three complementary perspectives (Section 3):

Information-theoretic: How much do layers compress or preserve semantic information (Shwartz-Ziv & Tishby, 2019; Shwartz-Ziv, 2022)?

Geometric: How do token embeddings unfold in high-dimensional space (Hosseini & Fedorenko, 2023)?

Invariance: Are embeddings robust to input perturbations (e.g., InfoNCE (Oord et al., 2018), LiDAR (Thilak et al., 2024) and DiME (Skean et al., 2023))?

We show that these perspectives can be viewed under a single lens, which clarifies how intermediate layers strike a balance between retaining features and discarding noise.

Key findings and contributions. Our investigation leads to several important insights:

Intermediate layers consistently outperform final layers. This pattern is evident not only in Transformers but also in SSMs, suggesting a broad, architecture-agnostic effect.

Autoregressive vs. masked-language training. Autoregressive models exhibit a pronounced mid-layer “compression valley,” whereas masked or bidirectional models show milder intermediate changes.

Domain-general effect. We extend these results to vision models and find that autoregressive image transformers display the same mid-depth bottleneck, indicating that the training objective, rather than the data modality, is the key driver.

CoT finetuning. Analyzing chain-of-thought (CoT) reveals that finetuning can reshape mid-layer entropy, preserving latent context for multi-step reasoning.

Overall, our results challenge the default reliance on final-layer embeddings and highlight intermediate layers as potentially underutilized sources of meaningful features. In the rest of this paper, we detail our unified framework (Section 3), present an extensive set of experiments in both language and vision (Section 4, 6), and conclude with a discussion of implications for model design, training practices, and future directions.

A long line of research has aimed to understand how deep neural networks encode and organize information. Early studies employed linear probes to interpret intermediate layers (Alain & Bengio, 2017), while subsequent efforts introduced more sophisticated techniques such as SVCCA (Raghu et al., 2017) to compare learned features across architectures and training regimes. Although these approaches shed light on representation dynamics, most focus on vision backbones or relatively shallow models. In contrast, our work extends layer-wise analysis to large-scale language models, highlighting specific behaviors of intermediate layers in autoregressive Transformers, state-space models (SSMs), and beyond.

Transformer-based LLMs have sparked significant interest in which layers capture linguistic properties such as syntax and semantics (Liu et al., 2019; Tenney et al., 2019; Voita et al., 2019). More recent work (Jin et al., 2024; Gurnee & Tegmark, 2023; Fan et al., 2024) has shown that mid-depth layers sometimes hold surprisingly robust features, challenging the typical focus on final layers. Our contribution unifies and expands these observations through a large-scale, theoretical–empirical framework that quantifies the quality of every layer’s representation via information theory, geometry, and invariance metrics.

Transformers remain the dominant architecture for NLP (Vaswani et al., 2017), but they come in multiple variants. Encoder-only models (e.g., BERT (Devlin, 2018)) typically use bidirectional attention and masked-language objectives, while decoder-only architectures (e.g., GPT (Brown et al., 2020)) follow an autoregressive paradigm. Meanwhile, newer state-space models (SSMs) such as Mamba (Gu & Dao, 2024) use recurrent-style dynamics for efficient long-sequence processing. Although these designs differ significantly in attention mechanisms and sequence modeling strategies, there has been little direct comparison of hidden-layer representations across them. In our work, we analyze Transformers (both encoder- and decoder-only) and SSMs under a common set of metrics, highlighting contrasts in how intermediate layers compress or preserve information and showing that intermediate-layer representations can excel across multiple architectures.

A variety of metrics have been proposed to quantify the “quality” of learned representations. We group them into three main categories:

Information-theoretic measures capture how much a model’s internal representations compress or preserve relevant information. For example, the Information Bottleneck (Shwartz-Ziv & Tishby, 2019; Shwartz-Ziv, 2022) analyzes whether intermediate layers discard noise while retaining essential features.

Geometric measures focus on the structure of embeddings in high-dimensional space. Classical approaches include analyzing singular values or effective rank of the representation matrix (Garrido et al., 2023), while more recent work explores curvature (Hosseini & Fedorenko, 2023) to quantify how smoothly tokens are mapped across consecutive positions or time steps.

Task-based or invariance metrics evaluate how well representations support downstream goals. For instance, augmentations-based approaches such as InfoNCE (Oord et al., 2018) and LiDAR (Thilak et al., 2024) estimate invariance to perturbations, while methods like NESum or Self-Cluster (Agrawal et al., 2022) link closely to entropy. In computer vision, these scores often correlate strongly with downstream accuracy, highlighting how robust the embeddings are.

Although these categories may appear distinct, we will show (Section 3) that many can be unified under a single lens. This unification illuminates why certain intermediate layers balance compression, geometry, and invariance so effectively, leading to better representations for downstream tasks.

Multiple lines of research connect compression and generalization performance (Deletang et al., 2024). For instance, Bordes et al. (2023) demonstrated that discarding certain layers in self-supervised encoders can even improve downstream accuracy, while Park et al. (2024a) found that LLM embeddings often lie in low-dimensional manifolds. Our empirical study reinforces these ideas by demonstrating that many networks—especially autoregressive Transformers—naturally develop a mid-layer bottleneck that appears crucial for balancing “signal” versus “noise.” We show how intermediate layers can achieve optimal trade-offs between preserving task-relevant information and discarding superfluous detail.

Overall, our work bridges these overlapping threads by evaluating a range of architectures and training paradigms via a unified set of metrics. Beyond merely confirming that intermediate layers can be effective, we elucidate why this happens, tying it to fundamental properties such as entropy, invariance, and geometry. This novel perspective provides an avenue for both finer-grained diagnostics of large language models and more deliberate design of mid-layer representations for downstream tasks.

A central challenge in analyzing internal representations is determining how to assess their quality. Although existing work draws on numerous ideas—from mutual information to geometric manifold analysis to invariance under augmentations—these threads can seem disparate. In this section, we consolidate them into a unified theoretical framework that shows how these seemingly different metrics connect and why they collectively measure “representation quality.”

Consider a neural network that maps inputs $\mathbf{x}$ (e.g., tokens in a sequence) to internal hidden states $\mathbf{Z}$. We denote $\mathbf{Z}\in\mathbb{R}^{N\times D}$ as a matrix of $N$ data samples (or tokens) in $D$ dimensions. Some key questions arise:

How compressed are these representations?

How do they geometrically organize different inputs?

We focus on a key quantity known as matrix-based entropy (Giraldo et al., 2014; Skean et al., 2023), which applies directly to the Gram matrix $\mathbf{K}=\mathbf{Z}\mathbf{Z}^{\top}$. Let $\{\lambda_{i}(\mathbf{K})\}$ be the (nonnegative) eigenvalues of $\mathbf{K}$. For any order $\alpha>0$, define:
\[ S_{\alpha}(\mathbf{Z}) \;=\; \frac{1}{1-\alpha}\,\log\!\left(\sum_{i=1}^{r}\left(\frac{\lambda_{i}(\mathbf{K})}{\sum_{j=1}^{r}\lambda_{j}(\mathbf{K})}\right)^{\alpha}\right), \tag{1} \]

where $r=\mathrm{rank}(\mathbf{K})\leq\min(N,D)$. Intuitively, if only a few eigenvalues dominate, $S_{\alpha}(\mathbf{Z})$ is small, indicating a highly compressed representation. Conversely, if $\mathbf{Z}$ is spread out across many principal directions, $S_{\alpha}(\mathbf{Z})$ is large. By varying $\alpha$, one smoothly transitions between notions like collision entropy ($\alpha=2$) and von Neumann entropy ($\alpha\to 1$). We will typically use $\alpha=1$ for simplicity.
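For concreteness, this quantity can be computed directly from the Gram spectrum; the following numpy sketch (function name ours, not from the paper's code) implements $S_\alpha$ with trace-normalized eigenvalues:

```python
import numpy as np

def matrix_entropy(Z, alpha=1.0, eps=1e-12):
    """Matrix-based alpha-entropy of the Gram spectrum of Z (rows = samples/tokens)."""
    K = Z @ Z.T
    lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)  # nonnegative eigenvalues
    p = lam / lam.sum()                              # trace-normalized spectrum
    p = p[p > eps]                                   # drop numerical zeros
    if np.isclose(alpha, 1.0):
        return float(-(p * np.log(p)).sum())         # von Neumann limit (alpha -> 1)
    return float(np.log((p ** alpha).sum()) / (1.0 - alpha))
```

A $\mathbf Z$ with orthonormal rows attains the maximal value $\log N$, while a rank-one $\mathbf Z$ gives entropy 0.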

Compression or information content: A handful of large eigenvalues in $\mathbf{K}=\mathbf{Z}\mathbf{Z}^{\top}$ indicates that $\mathbf{Z}$ is low-rank, i.e., the model has collapsed much of the input variation into fewer dimensions. In contrast, a more uniform eigenvalue spectrum implies higher-entropy, more diverse features.

Geometric smoothness: If tokens within a prompt follow a trajectory in embedding space with sharp turns, that curvature can manifest as skewed eigenvalue spectra (Hosseini & Fedorenko, 2023). Curvature also differentiates local transitions (token-to-token) from global structural patterns across longer segments or entire prompts.

Invariance under augmentations: Metrics like InfoNCE (Oord et al., 2018) and LiDAR (Thilak et al., 2024) effectively measure whether augmentations of the same sample (e.g., character swaps) map to similar embeddings. Strong invariance corresponds to stable clustering in $\mathbf{Z}\mathbf{Z}^{\top}$, which again depends on the distribution of eigenvalues and how local vs. global features are retained or discarded.

Thus, evaluating $S_{\alpha}(\mathbf{Z})$ provides a single lens for assessing "representation quality" across compression, geometric structure, and invariance, and highlights how both local details and global patterns are organized.

We now introduce the seven representation evaluation metrics used in our experiments, grouped into three broad categories: (1) information-theoretic, (2) geometric, and (3) augmentation-invariance. All relate back to the Gram matrix $\mathbf{K}$ and hence to Eq. (1).

Following Wei et al. (2024), we apply matrix-based entropy (Eq. 1) to the token embeddings within a single prompt. This prompt entropy quantifies how widely tokens are spread in the embedding space. Higher entropy indicates more diverse, less redundant token-level features; lower entropy implies stronger compression.

We can also aggregate embeddings across $N$ prompts by taking the mean token embedding of each prompt to form $\overline{\mathbf{Z}}\in\mathbb{R}^{N\times D}$. Applying entropy to $\overline{\mathbf{Z}}$ yields a dataset-level measure of global diversity, revealing how distinctly the model separates different inputs.
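To make the two levels concrete, here is a small numpy sketch (helper name and toy dimensions ours) that computes prompt-level entropies and then the dataset-level entropy of the mean-pooled matrix $\overline{\mathbf Z}$:

```python
import numpy as np

def entropy_of(M):
    """Shannon (alpha -> 1) matrix-based entropy of the Gram spectrum of M."""
    lam = np.clip(np.linalg.eigvalsh(M @ M.T), 0.0, None)
    p = lam / lam.sum()
    p = p[p > 1e-12]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
# Three synthetic "prompts" of different lengths, embedding dimension 8.
prompts = [rng.standard_normal((L, 8)) for L in (5, 9, 7)]
prompt_entropies = [entropy_of(Z) for Z in prompts]   # token-level diversity per prompt
Zbar = np.stack([Z.mean(axis=0) for Z in prompts])    # N x D mean-pooled matrix
dataset_entropy = entropy_of(Zbar)                    # global diversity across prompts
```

The same spectral quantity is applied twice: once per prompt on token embeddings, once globally on pooled prompt embeddings.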

The effective rank, defined as the exponential of the Shannon entropy of the normalized singular values of $\mathbf{Z}$ (Roy & Vetterli, 2007b), can be shown to be a lower bound to $\exp(S_{1}(\mathbf{Z}))$, highlighting how dimensionality effectively shrinks if the representation is strongly compressed. We prove this connection later in Theorem 1. This has implications for popular representation evaluation metrics such as RankMe (Garrido et al., 2023) and LiDAR (Thilak et al., 2024), which are both inspired by effective rank.

Proposed by Hosseini & Fedorenko (2023), curvature captures how sharply the token embeddings turn when viewed as a sequence in $\mathbb{R}^{D}$. For a prompt of length $L$, let $\mathbf{v}_{k}=\mathbf{z}_{k+1}-\mathbf{z}_{k}$ be the difference between consecutive tokens. The average curvature is:
\[ \bar{\kappa} \;=\; \frac{1}{L-2}\sum_{k=1}^{L-2}\arccos\!\left(\frac{\mathbf{v}_{k}^{\top}\mathbf{v}_{k+1}}{\|\mathbf{v}_{k}\|\,\|\mathbf{v}_{k+1}\|}\right).
\]

Higher curvature means consecutive tokens shift direction abruptly, reflecting more local-level features; lower curvature suggests a smoother trajectory and more global-level features.
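A minimal numpy sketch of this quantity, assuming curvature is the mean turning angle between consecutive difference vectors (helper name ours):

```python
import numpy as np

def average_curvature(Z):
    """Mean turning angle between consecutive token-difference vectors v_k."""
    V = np.diff(Z, axis=0)                                   # v_k = z_{k+1} - z_k
    V = V / np.linalg.norm(V, axis=1, keepdims=True)         # unit directions
    cosines = np.clip(np.sum(V[:-1] * V[1:], axis=1), -1.0, 1.0)
    return float(np.arccos(cosines).mean())
```

A straight-line trajectory gives curvature near 0, while a back-and-forth zig-zag approaches π.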

Lastly, we assess how stable the model's representations are to small perturbations of the same input (e.g., random character swaps, keyboard-level changes; see Appendix). Suppose $p_{i}$ is augmented into $p_{i}^{(a)}$ and $p_{i}^{(b)}$. After embedding these, we compare the row vectors in $\mathbf{Z}_{1},\mathbf{Z}_{2}\in\mathbb{R}^{N\times D}$ under different scoring criteria:

This self-supervised objective (Oord et al., 2018) encourages matched samples to lie close in embedding space while pushing unmatched samples away. A lower InfoNCE loss indicates stronger invariance to augmentation.
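A compact numpy sketch of this objective (the temperature value and cosine-similarity choice are illustrative assumptions of ours, not the paper's exact configuration):

```python
import numpy as np

def info_nce(Z1, Z2, temperature=0.1):
    """InfoNCE loss treating row i of Z1 and row i of Z2 as a positive pair."""
    Z1 = Z1 / np.linalg.norm(Z1, axis=1, keepdims=True)
    Z2 = Z2 / np.linalg.norm(Z2, axis=1, keepdims=True)
    logits = (Z1 @ Z2.T) / temperature            # N x N similarity matrix
    m = logits.max(axis=1, keepdims=True)         # stabilized log-sum-exp
    log_denom = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(log_denom - np.diag(logits)))
```

When matched pairs are far more similar than mismatched ones, the loss approaches 0; shuffling the pairing drives it up.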

LiDAR (Thilak et al., 2024) uses a linear discriminant approach that measures within-class versus between-class scatter. Treating each prompt as its own class, LiDAR checks how well augmentations form tight clusters.

Similarly, DiME (Skean et al., 2023) is grounded in matrix-based entropy. It compares real paired samples against random pairings to estimate how uniquely aligned correct augmentations are.

Here, we summarize key statements that justify why these metrics meaningfully measure representation quality; we refer to Appendix F for details and proofs. Beyond serving as a unifying view, matrix-based entropy also connects to foundational concepts like majorization, Schur concavity, and mutual information. Furthermore, we can directly relate the eigenvalue entropy to the matrix entropy, most naturally via the effective rank (Roy & Vetterli, 2007b). The following theorem makes this connection explicit.

For Shannon-based entropy ($\alpha\to 1$),
\[ \mathrm{EffRank}(\mathbf{Z}) \;\leq\; \exp\bigl(S_{1}(\mathbf{Z})\bigr), \]
meaning a large effective rank implies a high entropy.

Under appropriate conditions on the data distribution and model, we can show connections between prompt entropy and dataset entropy via the following scaling behaviors:

If prompt entropy remains near its maximum for all prompts, then the dataset entropy $S_{2}\bigl(\overline{\mathbf{Z}}\,\overline{\mathbf{Z}}^{\top}\bigr)$ grows on the order of $\log\bigl(\tfrac{D^{2}}{N}\bigr)$.

In short, high token-level (prompt) diversity encourages broader global diversity in the dataset-level embeddings, whereas over-compressing token representations can limit how effectively different prompts separate. Our subsequent analysis connects these ideas to self-supervised objectives like InfoNCE, which also tie higher entropy to stronger robustness and discriminability in the learned representations.

For data $X$ and representation $Z(X)$, the InfoNCE loss on $N$ samples satisfies:
\[ \mathcal{L}_{\mathrm{InfoNCE}} \;\geq\; \log N - I(X;Z) \;\geq\; \log N - H(Z), \]
where $H(Z)$ is interpretable as matrix-based entropy at the dataset level. Hence, reducing InfoNCE implies learning a higher-entropy (and thus often more robust) representation.

Overall, our theoretical analysis shows that compression (entropy), geometry (curvature, rank), and invariance (e.g., InfoNCE) are all facets of how the Gram matrix $\mathbf{Z}\mathbf{Z}^{\top}$ distributes variance. Examining these metrics across different layers reveals exactly where a network "prunes" redundancy (low entropy) versus preserving essential distinctions (high entropy). This unified perspective also facilitates cross-architecture comparisons (e.g., Transformers vs. SSMs) by highlighting how each architecture organizes information internally. Beyond offering a theoretical foundation, it provides a practical blueprint for diagnosing, tuning, and improving hidden-layer representations.

In this section, we empirically validate our theoretical framework through extensive experiments across architectures, scales, and training regimes. Our investigation centers on three key questions:

How do post-training methods (e.g., fine-tuning and chain-of-thought) reshape representations?

In this section, we use intermediate layers for downstream embedding tasks and employ our unified framework from Section 3, measuring all the embeddings across all layers.

We evaluate several distinct architectural families: decoder-only transformers (Pythia and Llama3) (Biderman et al., 2023; Dubey et al., 2024), a state-space model (Mamba) (Gu & Dao, 2024), an encoder-only transformer (BERT) (Devlin, 2018), and bidirectional-attention LLM2Vec models (BehnamGhader et al., 2024).

We test each layer’s embeddings on 32 tasks from the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022), spanning classification, clustering, and reranking. This comprehensive evaluation provides insight into how different layers capture task-relevant features. For a full list of the tasks, refer to the Appendix.

A key question is whether final-layer embeddings are indeed optimal for downstream tasks. In Figure 1, we compare average performance on MTEB tasks across all layers of the three models.

In nearly every task, some intermediate layer outperforms the final layer. The absolute improvement ranges from 2% to as high as 16% on average, and the best layer often resides around the mid-depth of the network. This phenomenon is consistent across all the different architectures. It confirms emerging observations in recent work on generation tasks (Bordes et al., 2023; El-Nouby et al., 2024; Chen et al., 2020; Fan et al., 2024) and extends them to a wider range of benchmarks and tasks.

From our theoretical perspective, intermediate layers appear to strike a balance between retaining sufficient information (avoiding over-compression) and discarding low-level noise. Later in Section 4.2, we show that these sweet spots are not random but tied to how intermediate layers are processing information.

To validate our framework’s relevance, we analyze how each metric (entropy, InfoNCE, etc.) correlates with downstream performance. Figure 3 and Figure 8 show distance correlations between metrics and task scores for Pythia-410M. We make several key observations:

All metrics show strong relationships with performance

DiME, curvature, and InfoNCE exhibit particularly strong correlations

Associations remain robust across different correlation measures (Spearman, Kendall)

These relationships suggest that our metrics effectively capture what makes intermediate representations powerful for downstream tasks.

Aside from strong correlations with downstream performance, we can use our evaluation framework to assess the internal behaviors of LLMs. In both this section and Section 4.3, we use WikiText-103 (Merity et al., 2017) for analyzing our representation metrics on standard textual data. To investigate how architecture and model size influence representation quality, we compare three fundamentally different LLM variants—BERT (encoder-only), Pythia (decoder-only), and Mamba (state-space model)—and then scale up Pythia to observe emerging trends.

Figure 2 shows how prompt entropy, curvature, and augmentation metrics evolve across each model’s layers. BERT, which encodes the entire input bidirectionally, generally maintains high entropy across layers, suggesting minimal compression: the model can see all tokens at once and need not discard as much information. By contrast, the decoder-only Pythia exhibits a strong mid-layer entropy dip, reflecting its autoregressive objective’s tendency to filter or prune non-local details in the middle of the network. As a result, Pythia’s “sweet spot” for downstream tasks often lies around mid-depth, where it balances essential context and compression. Mamba, meanwhile, processes sequences through a state-space approach that yields flatter, more uniform curves across depth: it neither retains as much information as BERT nor compresses as aggressively as Pythia’s mid-layers.

In Figure 12, we analyze Pythia models ranging from 14M to 1B parameters. Larger Pythia models display more pronounced intermediate compression (entropy dips), indicating a heightened ability to distill relevant features. We also observe smoother token trajectories (lower curvature) and stronger invariance (e.g., higher LiDAR), consistent with findings that bigger models more effectively filter noise and capture long-range dependencies. Notably, these trends further reinforce why performance often peaks in the middle of the network: larger models gain more capacity to compress intermediate representations, yet still preserve crucial semantic details.

In Figure 13, we study how finetuning affects the internal representations of Llama3 (Dubey et al., 2024). We compare the baseline Llama3-8B to two finetuned LLM2Vec models (BehnamGhader et al., 2024). The LLM2Vec-mntp-unsup-simcse model enables bidirectional attention in Llama3 and goes through two unsupervised training phases to improve Llama3's performance on embedding tasks. The LLM2Vec-mntp-supervised model adds an additional supervised finetuning phase. Both finetuned models show improved augmentation invariance. Furthermore, the unsupervised model has higher prompt entropy than Llama3, while the supervised model has lower.

While our experiments treat each Transformer layer as a single unit, Transformer blocks comprise multiple sub-layers (pre-attention normalization, self-attention, residuals, MLPs). By measuring entropy after each sub-layer, we find in Figure 15 that residual connections drive the mid-network compression observed in Section 4.2. Specifically:

Sub-layers before residuals (e.g. pre-attention, raw attention, or MLP pre-residual outputs) often show only mild compression; their representations still carry much of the original variability

Residual sub-layers exhibit a marked entropy drop, reflecting significant information filtering

The strong “valley” in entropy at intermediate layers is thus tied to how residual paths merge computed signals with the existing hidden state. This aligns with prior work indicating that residuals act as a regularizer or "noise filter" (Marion et al., 2024), smoothing out spurious components in hidden representations.

The largest shifts in representation quality occur in mid-depth layers. Specifically, prompt entropy steadily decreases there as training progresses, implying that intermediate layers increasingly compress and abstract the input. Meanwhile, LiDAR scores are minimal in these same layers. Likewise, curvature becomes smoother in the middle of the network, suggesting the model refines its internal structure to capture longer-range or more nuanced patterns in language.

In contrast to the intermediate layers, the earliest layers change very little after the initial phase of training. This observation aligns with the “detokenization” hypothesis (Lad et al., 2024), which posits that early layers mainly convert raw tokens into a basic embedding space and then remain relatively fixed. As a result, the most substantial improvements in representation quality—such as enhanced compression—are driven primarily by the intermediate layers, reinforcing their importance for learning robust, high-level features.

Recent work has highlighted Chain-of-Thought (CoT) finetuning as a powerful strategy for improving reasoning capabilities (Arefin et al., 2024; DeepSeek-AI, 2025). To examine its effects, in Figure 5 we compare Qwen 2.5 and Qwen 2.5-Math (Yang et al., 2024a, b), where the latter underwent additional math pretraining and CoT finetuning. Measuring token-level prompt entropy across sequence length reveals that the finetuned model maintains higher entropy with lower variance across examples.

These findings suggest that CoT finetuning encourages models to preserve more context throughout their hidden layers, enabling better multi-step reasoning. Our framework provides a quantitative lens into how CoT fine-tuning pushes models to maintain richer internal representations across sequences, explaining its effectiveness in multi-step tasks. While CoT traces can be inspected directly in these models, our approach is particularly valuable for analyzing models that reason in continuous latent space (Hao et al., 2024).

To gain a deeper understanding of the underlying factors that affect representation quality, we examine how each layer responds to different types of inputs. We apply Pythia 410M to three types of extreme prompts and measure prompt entropy across layers (Figure 6). We find that:

Token repetition compresses intermediate layers. As $p$ increases (i.e., more repeated tokens), prompt entropy decreases sharply in mid-depth layers. This indicates that the model effectively recognizes and encodes these repetitive patterns, discarding redundancy in its internal representation.

Random tokens inflate early-layer entropy. When we introduce token-level randomness, entropy increases significantly in the first few layers, revealing that these initial layers are especially sensitive to noise. By contrast, deeper layers appear more robust to such perturbations.

Prompt length raises raw entropy but grows sublinearly once normalized. Longer inputs naturally boost the unnormalized entropy because more tokens create more variation. However, normalized entropy expands at a slower rate, suggesting each additional token contributes less unique information.

Overall, these results confirm that intermediate layers play a major role in handling complex or unusual inputs, selectively compressing or filtering out repetitive patterns while retaining crucial distinctions. At the same time, early layers respond more sensitively to noise, and the incremental benefit of adding more tokens diminishes with prompt length. This behavior highlights the diverse ways in which different layers balance the trade-off between preserving and discarding information, further underscoring the unique strengths of intermediate representations.

Although our focus has mainly been on language models, similar questions arise in computer vision. Vision architectures and training regimes differ widely, ranging from fully supervised methods to self-supervised approaches, and from bidirectional encoders to autoregressive transformers.

To investigate whether our findings generalize to vision models, we examine five representative vision approaches: ViT (Dosovitskiy et al., 2021), a supervised Transformer trained on labeled data; BEiT (Bao et al., 2022), a self-supervised encoder that reconstructs masked patches, analogous to masked token prediction in language; DINOv2 (Oquab et al., 2024), a self-supervised approach leveraging augmentations and exponential moving average teachers; MAE (He et al., 2022), a self-supervised framework that masks patches and reconstructs them, akin to masked autoencoders in language; and AIM (El-Nouby et al., 2024), an autoregressive Transformer that predicts the next patch in an image sequence (GPT-style next-token prediction). We evaluate each model on ImageNet-1k via layer-wise probing and our framework’s metrics.
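The layer-wise probing protocol can be sketched as follows. The nearest-centroid probe below is a simplified stand-in for the linear probes used in practice, and all function names are illustrative:

```python
import numpy as np

def nearest_centroid_probe(train_X, train_y, test_X, test_y):
    """A minimal stand-in for a linear probe: classify by the nearest class centroid."""
    classes = np.unique(train_y)
    centroids = np.stack([train_X[train_y == c].mean(axis=0) for c in classes])
    dists = ((test_X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == test_y).mean())

def layerwise_probe(layer_features, labels, train_frac=0.8):
    """Probe every layer's features separately and return one score per layer."""
    cut = int(train_frac * len(labels))
    return [
        nearest_centroid_probe(F[:cut], labels[:cut], F[cut:], labels[cut:])
        for F in layer_features
    ]
```

Comparing the resulting per-layer scores is exactly how the accuracy curves in the figures below are produced, with stronger probes and real model activations.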

Figure 14 shows that ViT, BEiT, DINOv2, and MAE exhibit strictly increasing downstream accuracy toward final layers, unlike language models. These models also show steadily increasing invariance metrics with depth, suggesting that without an autoregressive objective, vision Transformers have less need for drastic transformations at mid-depth.

In contrast, AIM—which is explicitly autoregressive over image patches—shows an entropy “valley” and corresponding peak in downstream accuracy at its intermediate layers (El-Nouby et al., 2024). This mimics the patterns we observe in LLMs like Pythia, suggesting that autoregressive training induces an information bottleneck mid-depth. As in language modeling, forcing a strictly left-to-right (or patch-to-patch) prediction can drive the model to compress non-local details earlier, then re-expand relevant features.

Taken together, these results indicate that the strong mid-layer compression observed in LLMs is not purely a property of “sequential token data” vs. “image patch data,” but rather a byproduct of autoregressive training. While various self-supervised (or fully supervised) objectives in vision often foster more uniform feature building across layers, autoregressive vision models develop the same mid-layer bottlenecks and sweet spots that we see in language. Thus, the architectural and objective design—especially whether or not a model is autoregressive—appears crucial in shaping layer-wise representation quality, regardless of domain.

In this work, we investigate the representation quality of intermediate layers in LLMs, shedding light on their critical role in downstream task performance. We introduce a unified framework of evaluation metrics, establish theoretical connections among them, and apply these metrics to analyze Transformer-based architectures, SSMs, and vision models. One key phenomenon revealed by prompt entropy is an information bottleneck in the middle layers of autoregressive transformers in both the vision and language domains. Furthermore, our results reveal that intermediate layers often surpass final layers in representation quality, emphasizing their importance for feature extraction. DiME, curvature, and infoNCE correlate strongly with downstream performance, suggesting a fundamental connection between representation quality and generalizability.

In conclusion, our study deepens the understanding of internal representation dynamics in LLMs. These insights not only enrich the theoretical foundations of model representations but also offer practical implications for optimizing model design, training strategies, and real-world applications. Future research could investigate the underlying causes of intermediate layer compression and develop specialized metrics tailored to LLMs, enabling more precise and effective representation evaluation.

Our paper studies the inner workings of large language models, with findings that may challenge typical assumptions about the importance of intermediate layers and the representations they learn. Our findings suggest that representations from these layers can yield better performance on a variety of downstream tasks, which can have implications for model interpretability, robustness, and efficiency.

From an ethical standpoint, the ability to leverage intermediate-layer representations could impact fairness and bias considerations in evaluating model performance or in model deployment. By helping better identify latent features and representations, our approach may amplify latent biases. We welcome and encourage future work to explore methods that can ensure that intermediate-layer representations do not disproportionately reinforce biases or lead to unintended disparities in real-world applications.

Oscar Skean is supported by the Office of the Under Secretary of Defense for Research and Engineering under award number FA9550-21-1-0227.

In this section, we elaborate on the specific architectures of Transformers and State Space Models (SSMs). We outline the mathematical foundations, including the weight matrices, attention mechanisms for Transformers, and the state transition matrices for SSMs. Detailed equations and parameter configurations are provided to facilitate replication and deeper understanding.

The Transformer architecture (Vaswani et al., 2017) utilizes self-attention mechanisms. Given an input $\mathbf{x}$, the key ($\mathbf{K}$), query ($\mathbf{Q}$), and value ($\mathbf{V}$) matrices are computed as:

$$\mathbf{Q}=\mathbf{x}\mathbf{W}_{Q},\qquad\mathbf{K}=\mathbf{x}\mathbf{W}_{K},\qquad\mathbf{V}=\mathbf{x}\mathbf{W}_{V},$$

where $\mathbf{W}_{Q},\mathbf{W}_{K}\in\mathbb{R}^{d\times d_{k}}$ and $\mathbf{W}_{V}\in\mathbb{R}^{d\times d_{v}}$ are learned weights.

The attention weights are calculated using:

$$\mathbf{A}=\mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_{k}}}+\mathbf{M}\right),$$

where $\mathbf{M}$ is a mask to enforce causality in autoregressive tasks.

The output is then:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathbf{A}\mathbf{V}.$$
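As a concrete sketch of the computation above (single head, no learned biases, shapes illustrative):

```python
import numpy as np

def causal_self_attention(x, W_Q, W_K, W_V):
    """Single-head causal self-attention following the equations above.

    x: (T, d) token embeddings; W_Q, W_K: (d, d_k); W_V: (d, d_v).
    """
    T = x.shape[0]
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Causal mask M: -inf above the diagonal so position t attends only to <= t.
    scores = scores + np.triu(np.full((T, T), -np.inf), k=1)
    # Row-wise softmax gives the attention weights A.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # A V
```

Because of the causal mask, the first position attends only to itself, so its output is simply its own value vector.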

SSMs (Gu & Dao, 2024) model sequences using recurrent dynamics. The hidden state $\mathbf{h}_{t}$ and output $\mathbf{y}_{t}$ at time $t$ are updated as:

$$\mathbf{h}_{t}=\mathbf{A}\mathbf{h}_{t-1}+\mathbf{B}\mathbf{x}_{t},\qquad\mathbf{y}_{t}=\mathbf{C}\mathbf{h}_{t}+\mathbf{D}\mathbf{x}_{t},$$

where $\mathbf{A}\in\mathbb{R}^{n\times n}$, $\mathbf{B}\in\mathbb{R}^{n\times d}$, $\mathbf{C}\in\mathbb{R}^{d\times n}$, and $\mathbf{D}\in\mathbb{R}^{d\times d}$ are learned parameters.
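A minimal sketch of this recurrence (discrete-time, dimensions illustrative, and without the input-dependent gating of Mamba's selective variant):

```python
import numpy as np

def ssm_forward(x, A, B, C, D):
    """Unroll h_t = A h_{t-1} + B x_t, y_t = C h_t + D x_t over a sequence.

    x: (T, d) inputs; A: (n, n); B: (n, d); C: (d, n); D: (d, d).
    """
    h = np.zeros(A.shape[0])  # h_0 = 0
    ys = []
    for x_t in x:
        h = A @ h + B @ x_t           # state update
        ys.append(C @ h + D @ x_t)    # readout
    return np.stack(ys)
```

Since the state is initialized to zero, the first output depends only on the first input, mirroring the causal structure of the Transformer above.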

We call our first measure of token embedding diversity prompt entropy. It is computed on a prompt's intermediate token representations and captures how diverse those representations are.

We follow the work of (Wei et al., 2024) and use $\alpha$-order matrix-based entropy (Giraldo et al., 2014; Skean et al., 2023, 2024), which serves as a tractable surrogate for traditional Rényi's $\alpha$-order entropy (Rényi, 1961). The quantity is calculated using a similarity kernel $\kappa$ on a batch of samples drawn from a distribution, without making explicit assumptions on what the true distribution is. The choice of kernel $\kappa$ is flexible and can be any infinitely divisible kernel such as the Gaussian kernel, linear kernel, or Laplacian kernel, among others. For this work, we restrict ourselves to the linear kernel $\kappa(a,b)=ab^{\top}$. This choice is motivated by the linear representation hypothesis (Park et al., 2024b), which finds that large language model representations encode high-level concepts such as truth (Burns et al., 2022), honesty (Mallen & Belrose, 2024), and part-of-speech (Mamou et al., 2020) in linearly separable manifolds.

The equation for matrix-based entropy was previously defined in Eq. 1. One way to interpret Eq. 1 is as the $\alpha$-order Rényi entropy of the Gram matrix eigenvalues. (The non-zero eigenvalues of the Gram matrix $ZZ^{\top}$ are equivalent to those of the covariance matrix $Z^{\top}Z$; using the covariance matrix instead in Eq. 1 makes no difference and is more computationally efficient when $D<N$.) Notice how each eigenvalue is divided by $\mathrm{tr}(\mathbf{K}_{\mathbf{Z}})$ before being raised to the $\alpha$ power. Because $\mathrm{tr}(\cdot)=\sum_{i=1}^{n}\lambda_{i}(\cdot)$, this normalization makes the eigenvalues of $\mathbf{K}_{\mathbf{Z}}$ sum to one, a necessary condition to treat them as a probability distribution. Furthermore, each eigenvalue of $\mathbf{K}_{\mathbf{Z}}$ signifies the variance of samples along a particular principal component direction (Scholkopf & Smola, 2018). If entropy is low, the eigenvalues form a heavy-tailed distribution, implying that a few components dominate the variance of samples in $Z$. On the other hand, at maximum entropy, the eigenvalues form a uniform distribution and samples are spread equally in all directions. Matrix-based entropy is reminiscent of the LogDet entropy, which uses the determinant of $\mathbf{K}_{\mathbf{Z}}$ to capture how much "volume" a dataset occupies (Shwartz-Ziv et al., 2023; Zhouyin & Liu, 2021). The LogDet entropy is given by $S_{\mathrm{LogDet}}(Z)=\log\det(\mathbf{K}_{\mathbf{Z}})-\log 2$. One can use Jensen's inequality to show that the LogDet entropy is a lower bound of Eq. 1 in the limit $\alpha\rightarrow 1$ (Appendix J.4 of Shwartz-Ziv et al., 2023).

Depending on the choice of $\alpha$, several special cases of matrix-based entropy can be recovered. In particular, in the limit $\alpha\rightarrow 1$ it equals Shannon entropy (also referred to as von Neumann entropy in quantum information theory (Bach, 2022; Boes et al., 2019)), and when $\alpha=2$ it equals collision entropy. Interestingly, the case $\alpha=2$ can be calculated without explicit eigendecomposition (Skean et al., 2024). We show in Appendix Figure 7 how varying values of $\alpha$ affect the matrix-based entropy of Gram matrices whose eigenvalues follow a $\beta$-power law, $\lambda_{i}=i^{-\beta}$. It is shown that for larger values of $\alpha$, smaller eigenvalues contribute more to the entropy.
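A minimal sketch of the computation with the linear kernel (the function name is ours; the $\alpha\rightarrow 1$ branch recovers the Shannon/von Neumann case and $\alpha=2$ the collision case):

```python
import numpy as np

def matrix_based_entropy(Z, alpha=1.0):
    """alpha-order matrix-based entropy of embeddings Z (N, D) with a linear kernel."""
    N, D = Z.shape
    # The Gram matrix Z Z^T and covariance Z^T Z share nonzero eigenvalues;
    # use whichever is smaller.
    K = Z @ Z.T if N <= D else Z.T @ Z
    lam = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    lam = lam / lam.sum()        # trace-normalize so eigenvalues sum to one
    lam = lam[lam > 1e-12]
    if np.isclose(alpha, 1.0):   # Shannon / von Neumann limit
        return float(-np.sum(lam * np.log(lam)))
    return float(np.log(np.sum(lam ** alpha)) / (1.0 - alpha))
```

Orthonormal rows give the maximum value $\log N$, while identical rows (a rank-one Gram matrix) give zero, matching the interpretation above.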

We used the WikiText dataset (Merity et al., 2017) for the majority of our experiments in Sections 4.2 and 5, downloaded from Salesforce/wikitext on Hugging Face. The dataset consists of 100 million tokens scraped from Featured articles on Wikipedia. We filtered out prompts that were shorter than 30 tokens or were Wikipedia section headings.

The 32 tasks we used from the Massive Text Embedding Benchmark (MTEB) are detailed in Table 1. They are English-language tasks covering pair classification, classification, clustering, reranking, and sentence-to-sentence similarity.

For the augmentation-invariance metrics such as infoNCE, LiDAR, and DiME, we use the NLPAug library (Ma, 2019) to augment our prompts. We use three types of augmentations.

The SplitAug augmentation randomly splits words into two parts by adding a space.

The Keyboard augmentation randomly substitutes characters with other characters that are at a distance of one as measured on a QWERTY keyboard. For instance, the character "k" may be replaced with "i", "l", "m", or "j".

The pseudocode below applies our three augmentation types, using the default library settings for each. When computing augmentation-invariance metrics such as infoNCE or DiME, we compare two augmented prompts rather than one augmented prompt alongside the original prompt. Note that these augmentations may change the token length $T$ of a prompt.
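A self-contained sketch of two of the augmentation types is below. The actual experiments use the NLPAug implementations with default settings; the neighbor map here is an illustrative subset, not the library's full keyboard model:

```python
import random

# Illustrative subset of QWERTY neighbors; NLPAug ships a complete keyboard model.
QWERTY_NEIGHBORS = {"k": "iljm", "a": "qwsz", "e": "wrds"}

def keyboard_augment(text, p=0.1, rng=None):
    """Substitute mapped characters with a keyboard neighbor with probability p."""
    rng = rng or random.Random(0)
    return "".join(
        rng.choice(QWERTY_NEIGHBORS[c]) if c in QWERTY_NEIGHBORS and rng.random() < p else c
        for c in text
    )

def split_augment(text, p=0.3, rng=None):
    """Randomly split words into two parts by inserting a space (SplitAug-style)."""
    rng = rng or random.Random(0)
    words = []
    for w in text.split():
        if len(w) > 3 and rng.random() < p:
            cut = rng.randint(1, len(w) - 1)
            w = w[:cut] + " " + w[cut:]
        words.append(w)
    return " ".join(words)
```

Keyboard substitution preserves character length, while word splitting lengthens the string, which is one reason the token length $T$ can change.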

We take regular prompts from the wikitext dataset, tokenize them, and then for each token we randomly replace it with probability $p$. We draw replacement tokens by sampling a random token from within the prompt. We show examples below for varying levels of $p$.

($p=0$) Mint records indicate the first gold dollars were produced on May 7…

($p=1.0$) Mint Mint Mint Mint Mint Mint Mint Mint Mint Mint Mint Mint Mint…

($p=1.0$) arf emulsion minorensteinorianmega_TOStack potsRecip Installifykeeping…
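The replacement procedure above can be sketched as follows; this is a simplification of our setup, and the exact sampling details are illustrative:

```python
import random

def replace_tokens(tokens, p, rng=None):
    """With probability p, replace each token with one drawn from the prompt itself.

    A simplified sketch of the repetition setup above; replacements are sampled
    uniformly from the current token list, so no new vocabulary is introduced.
    """
    rng = rng or random.Random(0)
    out = list(tokens)
    for i in range(len(out)):
        if rng.random() < p:
            out[i] = rng.choice(out)  # sample a replacement from within the prompt
    return out
```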

(Majorization) Let $p,q\in\mathbb{R}^{n}$ be nonnegative vectors such that $\sum_{i=1}^{n}p_{i}=\sum_{i=1}^{n}q_{i}$. We say that $q$ majorizes $p$, denoted by $p\preccurlyeq q$, if their ordered sequences $p_{[1]}\geq\cdots\geq p_{[n]}$ and $q_{[1]}\geq\cdots\geq q_{[n]}$ satisfy:

$$\sum_{i=1}^{k}p_{[i]}\leq\sum_{i=1}^{k}q_{[i]}\quad\text{for all }k\in\{1,\dots,n\}.$$

(Schur-Convexity) A real-valued function $f$ on $\mathbb{R}^{n}$ is called Schur-convex if $p\preccurlyeq q\implies f(p)\leq f(q)$, and Schur-concave if $p\preccurlyeq q\implies f(q)\leq f(p)$.
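These definitions are easy to check numerically; a small sketch with our own helper names:

```python
import numpy as np

def majorizes(q, p, tol=1e-12):
    """True if q majorizes p (written p ≼ q) for nonnegative vectors of equal sum."""
    p_sorted, q_sorted = np.sort(p)[::-1], np.sort(q)[::-1]
    assert np.isclose(p_sorted.sum(), q_sorted.sum()), "equal totals required"
    return bool(np.all(np.cumsum(q_sorted) >= np.cumsum(p_sorted) - tol))

def shannon_entropy(p):
    """Shannon entropy, a canonical Schur-concave function."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```

For example, the uniform distribution is majorized by every distribution with the same total, and Schur-concavity means the uniform case attains the largest entropy.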

The matrix-based entropy, as given in Equation 1, is a Schur-concave function for $\alpha>0$. This result is well known and is given, for instance, by Lemma 4.1 of (Giraldo et al., 2014).

Suppose we have a matrix of embeddings $Z\in\mathbb{R}^{N\times D}$ and its covariance $Z^{\top}Z$. Then $\exp(S_{1}(Z))$ is a lower bound of the effective rank of $Z$, where $S_{1}$ denotes the matrix-based entropy with $\alpha=1$.

Denote the ordered singular values of $Z$ as $\sigma_{1}\geq\cdots\geq\sigma_{\min(N,D)}\geq 0$ and the ordered eigenvalues of $Z^{\top}Z$ as $\lambda_{1}\geq\cdots\geq\lambda_{\min(N,D)}\geq 0$, so that $\lambda_{i}=\sigma_{i}^{2}$. Normalize each sequence to sum to one by setting $\sigma_{i}\coloneqq\frac{\sigma_{i}}{\sum_{j}\sigma_{j}}$ and $\lambda_{i}\coloneqq\frac{\lambda_{i}}{\sum_{j}\lambda_{j}}$.

Squaring and renormalizing concentrates mass on the largest entries, so the normalized eigenvalue sequence majorizes the normalized singular value sequence, i.e., $\sigma\preccurlyeq\lambda$. Since $S_{1}$ is Schur-concave, $S_{1}(\lambda)\leq S_{1}(\sigma)$. The effective rank of $Z$ equals $\exp(S_{1}(\sigma))$, so $\exp(S_{1}(Z))=\exp(S_{1}(\lambda))\leq\textrm{effective rank}(Z)$.
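Both quantities can be computed directly from the singular values, and the relationship between them can be checked numerically; a sketch with our own helper names:

```python
import numpy as np

def effective_rank(Z):
    """exp of the Shannon entropy of the trace-normalized singular values of Z."""
    s = np.linalg.svd(Z, compute_uv=False)
    s = s / s.sum()
    s = s[s > 1e-12]
    return float(np.exp(-np.sum(s * np.log(s))))

def exp_S1(Z):
    """exp of the alpha=1 matrix-based entropy: the same formula applied to the
    squared singular values (the eigenvalues of Z^T Z), trace-normalized."""
    lam = np.linalg.svd(Z, compute_uv=False) ** 2
    lam = lam / lam.sum()
    lam = lam[lam > 1e-12]
    return float(np.exp(-np.sum(lam * np.log(lam))))
```

The two quantities coincide for matrices with equal singular values (e.g., the identity, where both equal the full rank).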

(Random Unit Vectors are Nearly Orthogonal) Suppose we have $m$ unit vectors in $\mathbb{R}^{n}$ distributed according to the uniform distribution on the hypersphere. Then with probability at least $1-m^{2}\sqrt{2\pi}e^{\frac{-n\epsilon^{2}}{2}}$, we have for any pair $i,j$, $i\neq j$,

$$\langle\mathbf{v}_{i},\mathbf{v}_{j}\rangle\leq\epsilon.$$

Notice that the probability of not landing in the $\epsilon$-band $T_{\epsilon}\subset\mathbb{S}^{n-1}$ around the equator of the $n$-dimensional hypersphere can be bounded as $\mathbb{P}(\mathbf{v}\notin T_{\epsilon})\leq\sqrt{2\pi}e^{\frac{-n\epsilon^{2}}{2}}$,

which is a result from (Wainwright, 2019). Notice that for sufficiently small $\epsilon$, this means that the dot product of any two randomly chosen vectors can be bounded with high probability. Indeed, notice that:

and that if $\mathbf{v}_{j}\in T_{\epsilon}$, then, treating $\mathbf{v}_{i}$ as the basis vector $e_{1}$ without loss of generality, we have that $\langle\mathbf{v}_{i},\mathbf{v}_{j}\rangle=(\mathbf{v}_{j})_{1}\leq\epsilon$.

Now, by the union bound over each pair $i\neq j$, we get that all pairs simultaneously satisfy $\langle\mathbf{v}_{i},\mathbf{v}_{j}\rangle\leq\epsilon$ with probability at least $1-m^{2}\sqrt{2\pi}e^{\frac{-n\epsilon^{2}}{2}}$.
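This concentration effect is easy to observe empirically; a small sketch (the dimension and sample count are illustrative):

```python
import numpy as np

def max_pairwise_dot(m, n, seed=0):
    """Sample m uniform unit vectors in R^n and return the largest |<v_i, v_j>|, i != j."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(m, n))   # Gaussian directions are uniform on the sphere
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    G = np.abs(V @ V.T)
    np.fill_diagonal(G, 0.0)      # ignore the i = j entries
    return float(G.max())
```

In high dimension the largest pairwise dot product is small, matching the proposition.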

(Maximum Prompt Entropy implies Large Dataset Entropy.) Suppose we have an orthogonally equivariant representation model $Z$ such that for all sequences $Z_{i}=Z(X_{i})$ the prompt entropy is maximal and the rows are unit norm. Suppose also that the data distribution $\mathbf{Data}$ is an isotropic unit Gaussian, and that we draw sequences of length $L=D$ from it. Then with probability $1-N^{2}\sqrt{2\pi}e^{\frac{-D\epsilon^{2}}{2N^{2}}}$ over the draw of $\{\mathbf{x}_{i}\}_{i=1}^{N}\sim\mathbf{Data}$, we have that:

First note that, since the prompt entropy is maximal for each sample $Z(X_{i})$, the matrix $K_{Z}=ZZ^{\top}$ is full rank. Since by assumption each row of $Z$ is unit norm, we know that $\|Z\|_{F}^{2}=L=\sum_{k=1}^{L}\sigma_{k}^{2}$. We also know that $\sigma_{i}=\sigma_{j}$ for all pairs $i,j$ by the assumption that the prompt entropy is maximized. In particular, $ZZ^{\top}$ is then an orthogonal matrix, and the rows of $Z$ form an orthonormal set. We can then write, for some rotation matrix $O_{i}$, that:

Since by assumption our model $Z(\cdot)$ is orthogonally equivariant and the data distribution is radially symmetric, it follows that the $\{\mathbf{q}_{i}\}_{i=1}^{N}$ are random points on the hypersphere of radius $\frac{1}{\sqrt{D}}$. This means that the matrix $\sqrt{D}Q$ consists of rows that are uniform points on the hypersphere of radius $1$. Now notice that:

Since $\sqrt{D}q_{i}$ is a unit vector, this simplifies to:

Now notice that, by the proposition, with probability at least $1-N^{2}\sqrt{2\pi}e^{\frac{-D\epsilon^{2}}{2N^{2}}}$,

So then since,

we have that $e^{-S_{2}(QQ^{\top})}=\|QQ^{\top}\|_{F}^{2}$. In particular,

Since the prompt entropy is minimal for each sample, each $Z(X_{i})$ is a rank-one matrix, so we can write it as an outer product $Z(X_{i})=\mathbf{v}_{i}\mathbf{u}_{i}^{\top}$. However, since the rows of $Z(X_{i})$ have unit length, all the rows are identical, so we may write $Z(X_{i})=\mathbf{v}_{i}\mathbf{1}^{\top}$. Then, it follows that:

In particular, the matrix $\frac{D}{N}Q$ has rows that are all unit vectors, distributed uniformly on the hypersphere. Now notice that:

In particular,

(Dataset Entropy Bounds InfoNCE) Let $X\sim\textbf{Data}$ be a discrete random variable distributed according to the data distribution. Let $X\to Z$ be the Markovian relation between $X$ and the representation $Z$. Then the InfoNCE loss on $N$ samples from $\textbf{Data}$ satisfies:

$$\log(N)-\text{InfoNCE}\leq I(X;Z)\leq H(Z).$$

The entropy H​(Z)H(Z) is analogous to the Dataset Entropy.

The first inequality follows as a simple result from (Oord et al., 2018). The second follows from the fact that $I(X;Z)=H(Z)-H(Z|X)\leq H(Z)$.

Table A3.T1: MTEB tasks used in experiments, covering a wide range of use-cases and domains.

Task Domain | Tasks | # Tasks (32 Total)
Pair Classification | SprintDuplicateQuestions, TwitterSemEval2015, TwitterURLCorpus | 3
Classification | AmazonCounterfactualClassification, AmazonReviewsClassification, Banking77Classification, EmotionClassification, MTOPDomainClassification, MTOPIntentClassification, MassiveIntentClassification, MassiveScenarioClassification, ToxicConversationsClassification, TweetSentimentExtractionClassification | 10
Clustering | ArxivClusteringS2S, BiorxivClusteringS2S, MedrxivClusteringS2S, RedditClustering, StackExchangeClustering, TwentyNewsgroupsClustering | 6
Reranking | AskUbuntuDupQuestions, MindSmallReranking, SciDocsRR, StackOverflowDupQuestions | 4
Sentence to Sentence | BIOSSES, SICK-R, STS12, STS13, STS14, STS15, STS16, STS17, STSBenchmark | 9

Figure: Intermediate layers consistently outperform final layers on downstream tasks. The average score of 32 MTEB tasks using the outputs of every model layer as embeddings for three different model architectures. The x-axis is the depth percentage of the layer, rather than the layer number, which varies across models.

Figure panels: (a) Prompt Entropy, (b) Curvature, (c) LiDAR.

Figure: Relationship between representation metrics and task performance averaged across layers for Pythia 410M. Using distance correlation (dCor), we see strong associative relationships across the board, with DiME exhibiting the strongest relationship with downstream performance. We use dCor due to its robustness and ability to measure both linear and non-linear relationships (dCor $\in[0,1]$, with 0 indicating statistical independence and 1 indicating strong dependency). We defer additional results to the Appendix.
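The distance correlation used in this comparison can be computed in a few lines; a self-contained sketch of the sample (V-statistic) form from Székely et al. (2008):

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D samples of equal length."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)

    def doubly_centered(a):
        d = np.abs(a - a.T)  # pairwise distance matrix
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()

    A, B = doubly_centered(x), doubly_centered(y)
    dcov2 = (A * B).mean()                              # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())    # geometric mean of variances
    return float(np.sqrt(max(dcov2, 0.0) / denom)) if denom > 0 else 0.0
```

Any affine relationship yields a value of 1, while values near 0 indicate (approximate) independence, including non-linear dependence that Pearson correlation would miss.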

Figure: Token-level prompt entropy across sequence length for Qwen 2.5 and Qwen 2.5-Math models. The base model (Qwen 2.5) exhibits greater prompt compression, while the finetuned model (Qwen 2.5-Math) maintains higher entropy, indicating more information retention.

Figure panel: (a) Repetition.

Figure: The behavior of Eq. 1 for varying values of $\alpha$ on Gram matrices with eigenvalues following a $\beta$-power law, $\lambda_{i}=i^{-\beta}$.

Figure: Comparison of vision transformers trained with different pretext tasks.


Model | Supervised (Best) | Unsupervised: Naive (Last) | Unsupervised: min-DiME | Unsupervised: min-infoNCE | Unsupervised: min-Dataset Entropy
Pythia-410M | 52.0 | 45.5 | 48.5 | 46.2 | 48.1
LLM2Vec-8B | 66.3 | 63.9 | 60.0 | 64.3 | 50.4


$$ S_{2}(\bar{Z}\bar{Z}^{\top})=\log\left(\frac{1}{\|\bar{Z}\bar{Z}^{\top}\|_{F}^{2}}\right) $$

Theorem 2 (Informal). 1. If prompt entropy remains near its maximum for all prompts, then the dataset entropy $S_{2}(\overline{\mathbf{Z}}\,\overline{\mathbf{Z}}^{\top})$ grows on the order of $\log\!\left(\frac{D^{2}}{N}\right)$. 2. If prompt entropy instead stays near its minimum for all prompts, then dataset entropy grows more slowly, on the order of $\log\!\left(\frac{D^{2}}{N^{3}}\right)$.

Proposition 1 (Random Unit Vectors are Nearly Orthogonal). Suppose we have $m$ unit vectors in $\mathbb{R}^{n}$ distributed according to the uniform distribution on the hypersphere. Then with probability at least $1-m^{2}\sqrt{2\pi}e^{\frac{-n\epsilon^{2}}{2}}$, we have for any pair $i,j$, $i\neq j$, $\langle\mathbf{v}_{i},\mathbf{v}_{j}\rangle\leq\epsilon$.

Theorem 7 (Dataset Entropy Bounds InfoNCE). Let $X\sim\textbf{Data}$ be a discrete random variable distributed according to the data distribution. Let $X\to Z$ be the Markovian relation between $X$ and the representation $Z$. Then the InfoNCE loss on $N$ samples from $\textbf{Data}$ satisfies $\log(N)-\text{InfoNCE}\leq I(X;Z)\leq H(Z)$. The entropy $H(Z)$ is analogous to the dataset entropy.

References

[dcor2007] Gábor J. Székely, Maria L. Rizzo, Nail K. Bakirov. (2008). Measuring and testing dependence by correlation of distances. Annals of Statistics.

[hao2024training] Hao, Shibo, Sukhbaatar, Sainbayar, Su, DiJia, Li, Xian, Hu, Zhiting, Weston, Jason, Tian, Yuandong. (2024). Training Large Language Models to Reason in a Continuous Latent Space.

[qwen2.5] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu. (2024). Qwen2.5 Technical Report.

[deepseek] DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

[wainwright2019high] Wainwright, Martin J. (2019). High-dimensional statistics: A non-asymptotic viewpoint.

[vaswani2017attention] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, \L ukasz, Polosukhin, Illia. (2017). Attention is All you Need.

[ruslanmv2024] Ruslan Magana Vsevolodovna. (2024). AI Medical Chatbot dataset.

[biderman2023pythia] Biderman, Stella, Schoelkopf, Hailey, Anthony, Quentin Gregory, Bradley, Herbie, O’Brien, Kyle, Hallahan, Eric, Khan, Mohammad Aflah, Purohit, Shivanshu, Prashanth, USVSN Sai, Raff, Edward, others. (2023). Pythia: A suite for analyzing large language models across training and scaling.

[devlin2018bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

[behnamghader2024llm2vec] BehnamGhader, Parishad, Adlakha, Vaibhav, Mosbach, Marius, Bahdanau, Dzmitry, Chapados, Nicolas, Reddy, Siva. (2024). LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders.

[park2024geometry] Park, Kiho, Choe, Yo Joong, Jiang, Yibo, Veitch, Victor. (2024). The Geometry of Categorical and Hierarchical Concepts in Large Language Models. ICML 2024 Workshop on Mechanistic Interpretability.

[mamba] Gu, Albert, Dao, Tri. (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.

[chen2020mocov2] Chen, Xinlei, Fan, Haoqi, Girshick, Ross, He, Kaiming. (2020). Improved baselines with momentum contrastive learning.

[hendrycks2020mmlu] Hendrycks, Dan, Burns, Collin, Basart, Steven, Zou, Andy, Mazeika, Mantas, Song, Dawn, Steinhardt, Jacob. (2021). Measuring massive multitask language understanding.

[bordes2022guillotine] Bordes, Florian, Balestriero, Randall, Garrido, Quentin, Bardes, Adrien, Vincent, Pascal. (2023). Guillotine regularization: Why removing layers is needed to improve generalization in self-supervised learning.

[fan2024notalllayers] Fan, Siqi, Jiang, Xin, Li, Xiang, Meng, Xuying, Han, Peng, Shang, Shuo, Sun, Aixin, Wang, Yequan, Wang, Zhongyuan. (2024). Not All Layers of LLMs Are Necessary During Inference.

[muennighoff2022mteb] Muennighoff, Niklas, Tazi, Nouamane, Magne, Loïc, Reimers, Nils. (2022). MTEB: Massive Text Embedding Benchmark.

[chen2020simclr] Chen, Ting, Kornblith, Simon, Norouzi, Mohammad, Hinton, Geoffrey. (2020). A simple framework for contrastive learning of visual representations.

[liu2019linguistic] Liu, Nelson F, Gardner, Matt, Belinkov, Yonatan, Peters, Matthew E, Smith, Noah A. (2019). Linguistic knowledge and transferability of contextual representations.

[gurnee2023language] Gurnee, Wes, Tegmark, Max. (2023). Language models represent space and time.

[alain2016understanding] Guillaume Alain, Yoshua Bengio. (2017). Understanding intermediate layers using linear classifier probes.

[oord2018representation] Oord, Aaron van den, Li, Yazhe, Vinyals, Oriol. (2018). Representation Learning with Contrastive Predictive Coding.

[raghu2017svcca] Raghu, Maithra, Gilmer, Justin, Yosinski, Jason, Sohl-Dickstein, Jascha. (2017). SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability.

[shwartz2017opening] Shwartz-Ziv, Ravid, Tishby, Naftali. (2019). Opening the black box of deep neural networks via information.

[chen2023sudden] Chen, Angelica, Shwartz-Ziv, Ravid, Cho, Kyunghyun, Leavitt, Matthew L, Saphra, Naomi. (2024). Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in MLMs.

[NEURIPS2023_b63ad8c2] Ben-Shaul, Ido, Shwartz-Ziv, Ravid, Galanti, Tomer, Dekel, Shai, LeCun, Yann. (2023). Reverse Engineering Self-Supervised Learning.

[shwartz2024compress] Shwartz Ziv, Ravid, LeCun, Yann. (2024). To compress or not to compress—self-supervised learning and information theory: A review. Entropy.

[shwartz2022information] Shwartz-Ziv, Ravid. (2022). Information flow in deep neural networks.

[bm25s] Xing Han Lù. (2024). BM25S: Orders of magnitude faster lexical search via eager sparse scoring.

[merity2016pointer] Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. (2017). Pointer Sentinel Mixture Models.

[gao2020pile] Gao, Leo, Biderman, Stella, Black, Sid, Golding, Laurence, Hoppe, Travis, Foster, Charles, Phang, Jason, He, Horace, Thite, Anish, Nabeshima, Noa, others. (2020). The pile: An 800gb dataset of diverse text for language modeling.

[hartigan1985dip] Hartigan, John A, Hartigan, Pamela M. (1985). The dip test of unimodality. The annals of Statistics.

[burns2022dl] Burns, Collin, Ye, Haotian, Klein, Dan, Steinhardt, Jacob. (2023). Discovering Latent Knowledge in Language Models Without Supervision.

[scholkopf2018learning] Scholkopf, Bernhard, Smola, Alexander J. (2018). Learning with kernels: support vector machines, regularization, optimization, and beyond.

[parklinear2024] Park, Kiho, Choe, Yo Joong, Veitch, Victor. (2024). The Linear Representation Hypothesis and the Geometry of Large Language Models.

[boes2019neumann] Boes, Paul, Eisert, Jens, Gallego, Rodrigo, Müller, Markus P., Wilming, Henrik. (2019). Von Neumann entropy from unitarity. Physical Review Letters.

[zhouyin2021understanding] Zhouyin, Zhanghao, Liu, Ding. (2021). Understanding neural networks with logarithm determinant entropy estimator.

[shwartz2023information] Shwartz-Ziv, Ravid, Balestriero, Randall, Kawaguchi, Kenji, Rudner, Tim GJ, LeCun, Yann. (2023). An information theory perspective on variance-invariance-covariance regularization.

[bach2022information] Bach, Francis. (2022). Information theory with kernel methods. IEEE Transactions on Information Theory.

[mallen2024eliciting] Alex Troy Mallen, Nora Belrose. (2024). Eliciting Latent Knowledge from Quirky Language Models. ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models.

[mamou2020emergence] Mamou, Jonathan, Le, Hang, Del Rio, Miguel A, Stephenson, Cory, Tang, Hanlin, Kim, Yoon, Chung, SueYeon. (2020). Emergence of separable manifolds in deep language representations.

[skean2024frossl] Skean, Oscar, Dhakal, Aayush, Jacobs, Nathan, Giraldo, Luis Gonzalo Sanchez. (2024). FroSSL: Frobenius Norm Minimization for Efficient Multiview Self-Supervised Learning.

[skean2023dime] Skean, Oscar, Osorio, Jhoan Keider Hoyos, Brockmeier, Austin J, Giraldo, Luis Gonzalo Sanchez. (2023). DiME: Maximizing Mutual Information by a Difference of Matrix-Based Entropies.

[thilak2023lidar] Thilak, Vimal, Huang, Chen, Saremi, Omid, Dinh, Laurent, Goh, Hanlin, Nakkiran, Preetum, Susskind, Joshua M, Littwin, Etai. (2024). LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures.

[renyi1961measures] Rényi, Alfréd. (1961). On measures of entropy and information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability.

[giraldo2014measures] Giraldo, Luis Gonzalo Sanchez, Rao, Murali, Principe, Jose C. (2014). Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory.

[deletanglanguage] Deletang, Gregoire, Ruoss, Anian, Duquenne, Paul-Ambroise, Catt, Elliot, Genewein, Tim, Mattern, Christopher, Grau-Moya, Jordi, Wenliang, Li Kevin, Aitchison, Matthew, Orseau, Laurent, others. Language Modeling Is Compression.

[hosseini2024curvature] Hosseini, Eghbal, Fedorenko, Evelina. (2023). Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language..

[henaff2019perceptual] Hénaff, Olivier J, Goris, Robbe LT, Simoncelli, Eero P. (2019). Perceptual straightening of natural videos. Nature Neuroscience.

[ma2019nlpaug] Edward Ma. (2019). NLP Augmentation. https://github.com/makcedward/nlpaug.

[wei2024large] Wei, Lai, Tan, Zhiquan, Li, Chenghai, Wang, Jindong, Huang, Weiran. (2024). Diff-eRank: A novel rank-based metric for evaluating large language models.

[jin2024conceptdepth] Jin, Mingyu, Yu, Qinkai, Huang, Jingyuan, Zeng, Qingcheng, Wang, Zhenting, Hua, Wenyue, Zhao, Haiyan, Mei, Kai, Meng, Yanda, Ding, Kaize, others. (2024). Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers?

[lad2024remarkable] Lad, Vedang, Gurnee, Wes, Tegmark, Max. (2024). The Remarkable Robustness of LLMs: Stages of Inference?

[garrido2023rankme] Garrido, Quentin, Balestriero, Randall, Najman, Laurent, Lecun, Yann. (2023). RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. ICML.

[agrawal2022alphareq] Agrawal, Kumar K, Mondal, Arnab Kumar, Ghosh, Arna, Richards, Blake. (2022). α-ReQ: Assessing representation quality in self-supervised learning by measuring eigenspectrum decay. NeurIPS.

[seqvcr] Arefin, Md Rifat, Subbaraj, Gopeshh, Gontier, Nicolas, LeCun, Yann, Rish, Irina, Shwartz-Ziv, Ravid, Pal, Christopher. (2025). Seq-VCR: Preventing collapse in intermediate transformer representations for enhanced reasoning.

[llms-know-what-they-know] Kadavath, Saurav, Conerly, Tom, Askell, Amanda, Henighan, Tom, Drain, Dawn, Perez, Ethan, Schiefer, Nicholas, Hatfield-Dodds, Zac, DasSarma, Nova, Tran-Johnson, Eli, others. (2022). Language models (mostly) know what they know.

[palm2] Anil, Rohan, Dai, Andrew M, Firat, Orhan, Johnson, Melvin, Lepikhin, Dmitry, Passos, Alexandre, Shakeri, Siamak, Taropa, Emanuel, Bailey, Paige, Chen, Zhifeng, others. (2023). PaLM 2 technical report.

[hurst-exponent] Hurst, Harold Edwin. (1951). Long-term storage capacity of reservoirs. Transactions of the American Society of Civil Engineers.

[pythia] Biderman, Stella, Schoelkopf, Hailey, Anthony, Quentin Gregory, Bradley, Herbie, O’Brien, Kyle, Hallahan, Eric, Khan, Mohammad Aflah, Purohit, Shivanshu, Prashanth, USVSN Sai, Raff, Edward, others. (2023). Pythia: A suite for analyzing large language models across training and scaling.

[fractal-review] Gneiting, Tilmann, Schlather, Martin. (2004). Stochastic models that separate fractal dimension and the Hurst effect. SIAM review.

[fractal-generalization-bound] Dupuis, Benjamin, Deligiannidis, George, Simsekli, Umut. (2023). Generalization bounds using data-dependent fractal dimensions.

[gsm8k] Cobbe, Karl, Kosaraju, Vineet, Bavarian, Mohammad, Chen, Mark, Jun, Heewoo, Kaiser, Lukasz, Plappert, Matthias, Tworek, Jerry, Hilton, Jacob, Nakano, Reiichiro, others. (2021). Training verifiers to solve math word problems.

[fractal-next-token] Alabdulmohsin, Ibrahim, Tran, Vinh Q, Dehghani, Mostafa. (2024). Fractal Patterns May Illuminate the Success of Next-Token Prediction.

[fractal-limitations] Tan, Charlie B, García-Redondo, Inés, Wang, Qiquan, Bronstein, Michael M, Monod, Anthea. (2024). On the limitations of fractal dimension as a measure of generalization.

[fractal-ethernet-traffic] Park, Kihong, Willinger, Walter. (2000). Self-similar network traffic and performance evaluation.

[mandelbrot-british-coast] Mandelbrot, Benoit. (1967). How long is the coast of Britain? Statistical self-similarity and fractional dimension. Science.

[effective-rank] Roy, Olivier, Vetterli, Martin. (2007). The effective rank: A measure of effective dimensionality. European signal processing conference.

[llama3] Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, Mathur, Akhil, Schelten, Alan, Yang, Amy, Fan, Angela, others. (2024). The Llama 3 herd of models.

[imagenet] Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, Fei-Fei, Li. (2009). ImageNet: A large-scale hierarchical image database. CVPR.

[aim] El-Nouby, Alaaeldin, Klein, Michal, Zhai, Shuangfei, Bautista, Miguel Angel, Toshev, Alexander, Shankar, Vaishaal, Susskind, Joshua M, Joulin, Armand. (2024). Scalable pre-training of large autoregressive image models.

[pixelgpt] Chen, Mark, Radford, Alec, Child, Rewon, Wu, Jeffrey, Jun, Heewoo, Luan, David, Sutskever, Ilya. (2020). Generative pretraining from pixels.

[vit] Dosovitskiy, Alexey, Beyer, Lucas, Kolesnikov, Alexander, Weissenborn, Dirk, Zhai, Xiaohua, Unterthiner, Thomas, Dehghani, Mostafa, Minderer, Matthias, Heigold, Georg, Gelly, Sylvain, others. (2021). An image is worth 16x16 words: Transformers for image recognition at scale.

[beit] Bao, Hangbo, Dong, Li, Piao, Songhao, Wei, Furu. (2022). BEiT: BERT pre-training of image transformers. ICLR.

[dinov2] Oquab, Maxime, Darcet, Timothée, Moutakanni, Théo, Vo, Huy, Szafraniec, Marc, Khalidov, Vasil, others. (2024). DINOv2: Learning robust visual features without supervision. TMLR.

[mae] He, Kaiming, Chen, Xinlei, Xie, Saining, Li, Yanghao, Dollár, Piotr, Girshick, Ross. (2022). Masked autoencoders are scalable vision learners. CVPR.

[llama2] Touvron, Hugo, Martin, Louis, Stone, Kevin, Albert, Peter, Almahairi, Amjad, Babaei, Yasmine, Bashlykov, Nikolay, Batra, Soumya, Bhargava, Prajjwal, Bhosale, Shruti, others. (2023). Llama 2: Open foundation and fine-tuned chat models.

[other-layer-by-layer] Zhao, Zheng, Ziser, Yftah, Cohen, Shay B. (2024). Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models.

[saponati2025underlying] Saponati, Matteo, Sager, Pascal, Aceituno, Pau Vilimelis, Stadelmann, Thilo, Grewe, Benjamin. (2025). The underlying structures of self-attention: symmetry, directionality, and emergent dynamics in Transformer training.

[gpt3] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, Agarwal, Sandhini, Herbert-Voss, Ariel, Krueger, Gretchen, Henighan, Tom, Child, Rewon, Ramesh, Aditya, Ziegler, Daniel, Wu, Jeffrey, Winter, Clemens, Hesse, Chris, Chen, Mark, Sigler, Eric, Litwin, Mateusz, Gray, Scott, Chess, Benjamin, Clark, Jack, Berner, Christopher, McCandlish, Sam, Radford, Alec, Sutskever, Ilya, Amodei, Dario. (2020). Language Models are Few-Shot Learners.

[radford2021learning] Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, others. (2021). Learning transferable visual models from natural language supervision.

[alphacode] Li, Yujia, Choi, David, Chung, Junyoung, Kushman, Nate, Schrittwieser, Julian, Leblond, Rémi, others. (2022). Competition-level code generation with AlphaCode. Science.

[tenney2019bert] Tenney, Ian, Das, Dipanjan, Pavlick, Ellie. (2019). BERT rediscovers the classical NLP pipeline. ACL.

[voita2019bottom] Voita, Elena, Sennrich, Rico, Titov, Ivan. (2019). The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives.

[marion2024implicit] Marion, Pierre, Wu, Yu-Han, Sander, Michael E, Biau, Gérard. (2024). Implicit regularization of deep residual networks towards neural ODEs. ICLR.

[van2017neural] Van Den Oord, Aaron, Vinyals, Oriol, others. (2017). Neural discrete representation learning.

[transformer-is-secretly-linear] Razzhigaev, Anton, Mikhalchuk, Matvey, Goncharova, Elizaveta, Gerasimenko, Nikolai, Oseledets, Ivan, Dimitrov, Denis, Kuznetsov, Andrey. (2024). Your transformer is secretly linear.

[anisotropy] Razzhigaev, Anton, Mikhalchuk, Matvey, Goncharova, Elizaveta, Oseledets, Ivan, Dimitrov, Denis, Kuznetsov, Andrey. (2024). The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models.

[llm-depth-residuals] Csordás, Róbert, Manning, Christopher D, Potts, Christopher. (2025). Do Language Models Use Their Depth Efficiently?

[gu2024attention] Gu, Xiangming, Pang, Tianyu, Du, Chao, Liu, Qian, Zhang, Fengzhuo, Du, Cunxiao, Wang, Ye, Lin, Min. (2025). When Attention Sink Emerges in Language Models: An Empirical View. ICLR.

[identifiability] Brunner, Gino, Liu, Yang, Pascual, Damian, Richter, Oliver, Ciaramita, Massimiliano, Wattenhofer, Roger. (2020). On identifiability in transformers.

[first-token-attending] Barbero, Federico, Arroyo, Alvaro, Gu, Xiangming, Perivolaropoulos, Christos, Bronstein, Michael, Veličković, Petar, Pascanu, Razvan. (2025). Why do LLMs attend to the first token?

[doimo-hidden-representations] Valeriani, Lucrezia, Doimo, Diego, Cuturello, Francesca, Laio, Alessandro, Ansuini, Alessio, Cazzaniga, Alberto. (2023). The geometry of hidden representations of large transformer models.

[doimo-abstraction-phase] Cheng, Emily, Doimo, Diego, Kervadec, Corentin, Macocco, Iuri, Yu, Jade, Laio, Alessandro, Baroni, Marco. (2025). Emergence of a high-dimensional abstraction phase in language transformers.

[attention-sinks] Xiao, Guangxuan, Tian, Yuandong, Chen, Beidi, Han, Song, Lewis, Mike. (2024). Efficient streaming language models with attention sinks.

[bigger-and-deeper] Chen, Nuo, Wu, Ning, Liang, Shining, Gong, Ming, Shou, Linjun, Zhang, Dongmei, Li, Jia. (2023). Is Bigger and Deeper Always Better? Probing LLaMA Across Scales and Layers.

[aimv2] Fini, Enrico, Shukor, Mustafa, Li, Xiujun, Dufter, Philipp, Klein, Michal, Haldimann, David, Aitharaju, Sai, da Costa, Victor Guilherme Turrisi, Béthune, Louis, others. (2025). Multimodal autoregressive pre-training of large vision encoders.

[haim-neural-representational-geometry] Schrage, Linden, Irie, Kazuki, Sompolinsky, Haim. (2024). Neural Representational Geometry of Concepts in Large Language Models. NeurIPS 2024 Workshop on Symmetry and Geometry in Neural Representations.

[haim-object-manifold-separability] Cohen, Uri, Chung, SueYeon, Lee, Daniel D, Sompolinsky, Haim. (2020). Separability and geometry of object manifolds in deep neural networks.

[haim-neural-geometry-fewshot-learning] Sorscher, Ben, Ganguli, Surya, Sompolinsky, Haim. (2022). Neural representational geometry underlies few-shot concept learning. Proceedings of the National Academy of Sciences.

[knowledge-flow] Liu, Iou Jen, Peng, Jian, Schwing, Alexander G. (2019). Knowledge flow: Improve upon your teachers.

[mergenet] Li, Kunxi, Zhan, Tianyu, Fu, Kairui, Zhang, Shengyu, Kuang, Kun, Li, Jiwei, Zhao, Zhou, Wu, Fan, Wu, Fei. (2025). MergeNet: Knowledge Migration across Heterogeneous Models, Tasks, and Modalities.

[layer-as-painter] Sun, Qi, Pickett, Marc, Nain, Aakash Kumar, Jones, Llion. (2025). Transformer layers as painters.

[cmc-imagenet100] Tian, Yonglong, Krishnan, Dilip, Isola, Phillip. (2020). Contrastive Multiview Coding.

[bib1] Agrawal et al. (2022) Agrawal, K. K., Mondal, A. K., Ghosh, A., and Richards, B. α-ReQ: Assessing representation quality in self-supervised learning by measuring eigenspectrum decay. NeurIPS, 2022.

[bib2] Alain, G. and Bengio, Y. Understanding intermediate layers using linear classifier probes. ICLR, 2017.

[bib3] Arefin et al. (2024) Arefin, M. R., Subbaraj, G., Gontier, N., LeCun, Y., Rish, I., Shwartz-Ziv, R., and Pal, C. Seq-VCR: Preventing collapse in intermediate transformer representations for enhanced reasoning. arXiv, 2024.

[bib4] Bach, F. Information theory with kernel methods. IEEE Transactions on Information Theory, 2022.

[bib5] Bao et al. (2022) Bao, H., Dong, L., Piao, S., and Wei, F. BEiT: BERT pre-training of image transformers. ICLR, 2022.

[bib6] BehnamGhader et al. (2024) BehnamGhader, P., Adlakha, V., Mosbach, M., Bahdanau, D., Chapados, N., and Reddy, S. LLM2Vec: Large language models are secretly powerful text encoders. COLM, 2024.

[bib7] Biderman et al. (2023) Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. Pythia: A suite for analyzing large language models across training and scaling. In ICML, 2023.

[bib8] Boes et al. (2019) Boes, P., Eisert, J., Gallego, R., Müller, M. P., and Wilming, H. Von Neumann entropy from unitarity. Physical Review Letters, 2019.

[bib9] Bordes et al. (2023) Bordes, F., Balestriero, R., Garrido, Q., Bardes, A., and Vincent, P. Guillotine regularization: Why removing layers is needed to improve generalization in self-supervised learning. TMLR, 2023.

[bib10] Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In NeurIPS, 2020.

[bib11] Burns et al. (2022) Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. arXiv, 2022.

[bib12] Chen et al. (2020) Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. Generative pretraining from pixels. In ICML. PMLR, 2020.

[bib13] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.

[bib14] Deletang et al. (2024) Deletang, G., Ruoss, A., Duquenne, P.-A., Catt, E., Genewein, T., Mattern, C., Grau-Moya, J., Wenliang, L. K., Aitchison, M., Orseau, L., et al. Language modeling is compression. In ICLR, 2024.

[bib15] Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[bib16] Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.

[bib17] Dubey et al. (2024) Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 herd of models. arXiv, 2024.

[bib18] El-Nouby et al. (2024) El-Nouby, A., Klein, M., Zhai, S., Bautista, M. A., Toshev, A., Shankar, V., Susskind, J. M., and Joulin, A. Scalable pre-training of large autoregressive image models. ICML, 2024.

[bib19] Fan et al. (2024) Fan, S., Jiang, X., Li, X., Meng, X., Han, P., Shang, S., Sun, A., Wang, Y., and Wang, Z. Not all layers of LLMs are necessary during inference. arXiv, 2024.

[bib20] Garrido et al. (2023) Garrido, Q., Balestriero, R., Najman, L., and Lecun, Y. RankMe: Assessing the downstream performance of pretrained self-supervised representations by their rank. In ICML, 2023.

[bib21] Giraldo et al. (2014) Giraldo, L. G. S., Rao, M., and Principe, J. C. Measures of entropy from data using infinitely divisible kernels. IEEE Transactions on Information Theory, 2014.

[bib22] Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. COLM, 2024.

[bib23] Gurnee, W. and Tegmark, M. Language models represent space and time. arXiv, 2023.

[bib24] Hao et al. (2024) Hao, S., Sukhbaatar, S., Su, D., Li, X., Hu, Z., Weston, J., and Tian, Y. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024.

[bib25] He et al. (2022) He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. In CVPR, 2022.

[bib26] Hosseini, E. and Fedorenko, E. Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language. NeurIPS, 2023.

[bib27] Jin et al. (2024) Jin, M., Yu, Q., Huang, J., Zeng, Q., Wang, Z., Hua, W., Zhao, H., Mei, K., Meng, Y., Ding, K., et al. Exploring concept depth: How large language models acquire knowledge at different layers? arXiv, 2024.

[bib28] Lad et al. (2024) Lad, V., Gurnee, W., and Tegmark, M. The remarkable robustness of LLMs: Stages of inference? arXiv, 2024.

[bib29] Li et al. (2022) Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with alphacode. Science, 2022.

[bib30] Liu et al. (2019) Liu, N. F., Gardner, M., Belinkov, Y., Peters, M. E., and Smith, N. A. Linguistic knowledge and transferability of contextual representations. North American Chapter of the Association for Computational Linguistics, 2019.

[bib31] Ma, E. NLP augmentation. https://github.com/makcedward/nlpaug, 2019.

[bib32] Mallen, A. T. and Belrose, N. Eliciting latent knowledge from quirky language models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024.

[bib33] Mamou et al. (2020) Mamou, J., Le, H., Del Rio, M. A., Stephenson, C., Tang, H., Kim, Y., and Chung, S. Emergence of separable manifolds in deep language representations. In ICML, 2020.

[bib34] Marion et al. (2024) Marion, P., Wu, Y.-H., Sander, M. E., and Biau, G. Implicit regularization of deep residual networks towards neural ODEs. ICLR, 2024.

[bib35] Merity et al. (2017) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. ICLR, 2017.

[bib36] Muennighoff et al. (2022) Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. MTEB: Massive text embedding benchmark. arXiv, 2022.

[bib37] Oord et al. (2018) Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. In ICLR, 2018.

[bib38] Oquab et al. (2024) Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.

[bib39] Park et al. (2024a) Park, K., Choe, Y. J., Jiang, Y., and Veitch, V. The geometry of categorical and hierarchical concepts in large language models. arXiv preprint arXiv:2406.01506, 2024a.

[bib40] Park et al. (2024b) Park, K., Choe, Y. J., and Veitch, V. The linear representation hypothesis and the geometry of large language models. In ICML, 2024b.

[bib41] Raghu et al. (2017) Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. NeurIPS, 2017.

[bib42] Rényi, A. On measures of entropy and information. In Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, 1961.

[bib43] Roy, O. and Vetterli, M. The effective rank: A measure of effective dimensionality. In European Signal Processing Conference, pp. 606–610, 2007.

[bib45] Scholkopf, B. and Smola, A. J. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2018.

[bib46] Shwartz-Ziv, R. Information flow in deep neural networks. PhD thesis, Hebrew University, 2022.

[bib47] Shwartz-Ziv, R. and Tishby, N. Opening the black box of deep neural networks via information. Entropy, 2019.

[bib48] Shwartz-Ziv et al. (2023) Shwartz-Ziv, R., Balestriero, R., Kawaguchi, K., Rudner, T. G., and LeCun, Y. An information theory perspective on variance-invariance-covariance regularization. NeurIPS, 2023.

[bib49] Skean et al. (2023) Skean, O., Osorio, J. K. H., Brockmeier, A. J., and Giraldo, L. G. S. DiME: Maximizing mutual information by a difference of matrix-based entropies. arXiv, 2023.

[bib50] Skean et al. (2024) Skean, O., Dhakal, A., Jacobs, N., and Giraldo, L. G. S. FroSSL: Frobenius norm minimization for self-supervised learning. In ECCV, 2024.

[bib51] Tenney et al. (2019) Tenney, I., Das, D., and Pavlick, E. BERT rediscovers the classical NLP pipeline. ACL, 2019.

[bib52] Thilak et al. (2024) Thilak, V., Huang, C., Saremi, O., Dinh, L., Goh, H., Nakkiran, P., Susskind, J. M., and Littwin, E. LiDAR: Sensing linear probing performance in joint embedding ssl architectures. ICLR, 2024.

[bib53] Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. NeurIPS, 2017.

[bib54] Voita et al. (2019) Voita, E., Sennrich, R., and Titov, I. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. EMNLP-IJCNLP, 2019.

[bib55] Wainwright, M. J. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019.

[bib56] Wei et al. (2024) Wei, L., Tan, Z., Li, C., Wang, J., and Huang, W. Large language model evaluation via matrix entropy. arXiv, 2024.

[bib57] Yang et al. (2024a) Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024a.

[bib58] Yang et al. (2024b) Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., Lu, K., Xue, M., Lin, R., Liu, T., Ren, X., and Zhang, Z. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement, 2024b. URL https://arxiv.org/abs/2409.12122.

[bib59] Zhouyin, Z. and Liu, D. Understanding neural networks with logarithm determinant entropy estimator. arXiv, 2021.