{% spaceskip=1 fontdimen2 font plus .10 fontdimen2 font minus .90 fontdimen2 font xspaceskip= spaceskip Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin }%

Enrique Queipo-de-Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, Ravid Shwartz-Ziv

Abstract

Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M--120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation studies validate our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.

Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin

Enrique Queipo-de-Llano ∗ , 1 , ´ Alvaro Arroyo ∗ , 1 , Federico Barbero 1 , Xiaowen Dong 1 , Michael Bronstein 1 , 2 , Yann LeCun 3 , 4 , Ravid Shwartz-Ziv 3 1 University of Oxford 2 AITHYRA 3 New York University 4 Meta FAIR

Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M-120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation studies validate our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.

Introduction

Large Language Models (LLMs) have become remarkably capable, yet how they process information through their layers remains poorly understood. Two phenomena have particularly puzzled researchers: attention sinks , where attention heads mysteriously collapse their focus onto semantically uninformative tokens (Xiao et al., 2024), and compression valleys , where intermediate representations show unexpectedly low entropy despite the model's high-dimensional space (Skean et al., 2025).

These phenomena appear paradoxical: why would powerful models waste attention on meaningless tokens, and why would representations compress in the middle of processing? Previous work has explained attention sinks through positional biases (Gu et al., 2025) and over-mixing prevention (Barbero et al., 2025a), while compression valleys have been explained through an information bottleneck theory (Skean et al., 2025). However, the precise reasons why they emerge remain unclear and no formal link has been established between them.

We reveal that attention sinks and compression valleys are two manifestations of a single mechanism: massive activations in the residual stream. These extremely large-magnitude features emerge on specific tokens (typically the beginning-of-sequence token, BOS), create both effects simultaneously, and act as input-agnostic biases. While prior work linked massive activations to attention sinks (Sun et al., 2024; Cancedda, 2024), we prove they also drive compression: when a single token's norm dominates others, it necessarily creates a dominant singular value in the representation matrix, explaining the compression. Experiments across several models (410M-120B parameters) confirm this unified mechanism, connecting both phenomena via the massive activations.

∗ Denotes equal first authorship. Correspondence to alvaro.arroyo@eng.ox.ac.uk and enrique.queipodellanoburgos@reuben.ox.ac.uk

This unified mechanism reveals how transformers organize computation across depth through the Mix-Compress-Refine framework: massive activations control three distinct processing phases. Early layers (0-20% depth) mix information broadly via diffuse attention. Middle layers (20-85%) compress representations while halting mixing through attention sinks, both triggered by massive activation emergence. Late layers (85-100%) selectively refine through localized attention as norms equalize. This phase structure explains task-specific optimal depths: embedding tasks peak during compression, while generation requires full refinement.

Our contributions are:

· We empirically demonstrate that attention sinks and compression valleys emerge simultaneously in middle layers across several models (410M-120B parameters). · We prove that massive activations mathematically require compression, providing tight bounds on entropy reduction and singular value dominance (Theorem 1). · We validate causality through targeted ablations: removing massive activations eliminates both compression and reduces attention sinks. · Wepropose the Mix-Compress-Refine theory of information flow, explaining how transformers organize computation into three distinct phases. · Weshow this framework helps resolve the puzzle of task-dependent optimal depths: embedding tasks peak during compression while generation requires full refinement.

In this paper, we study decoder-only transformers with L layers, hidden dimension d , and H attention heads per layer. For a sequence of T tokens, let x ( ℓ ) i ∈ R d denote token i 's representation at layer ℓ , and X ( ℓ ) ∈ R T × d the full representation matrix. Attention weights α ( ℓ,h ) ij from token i to j in head h satisfy causal masking ( α ij = 0 for j > i ). Full architecture provided in Appendix A.1.

Key Metrics. For a representation matrix X with singular values σ 1 ≥ σ 2 ≥ . . . ≥ σ r , we measure compression via the matrix-based entropy :

Low entropy indicates compression into few dominant directions. The anisotropy p 1 = σ 2 1 / ∥ X ∥ 2 F measures directional bias (Razzhigaev et al., 2023) (near 1 = extreme bias, near 1 /r = isotropy). For token position k , the attention sink score and sink rate (Gu et al., 2025) are:

with threshold τ = 0 . 3 (unless otherwise stated), and I denotes the indicator function. We focus on the BOS token, the primary sink across models.

Attention Sinks. Attention heads mysteriously focus on semantically uninformative tokens (e.g., BOS) across diverse models and scales (Xiao et al., 2024). While Barbero et al. (2025a) argues they prevent over-mixing, Cancedda (2024) relates them to spectral subspaces, and Gu et al. (2025) traces emergence to pretraining, no work has yet examined their depth-wise organization.

Compression Valleys. Transformer representations compress dramatically in middle layers, where the matrix-based entropy drops significantly before recovering (Skean et al., 2025). This universal pattern coincides with increased anisotropy ( p 1 > 0 . 9 ) (Razzhigaev et al., 2023) and nearlinear layer transitions (Razzhigaev et al., 2024). Paradoxically, these compressed representations excel at embedding tasks despite their reduced dimensionality. The mechanism remained unknown, with only information bottleneck hypotheses lacking causal evidence (Skean et al., 2025).

Massive Activations. Sun et al. (2024) identified extremely large-magnitude features in transformer residual streams, with individual neurons exceeding typical activations by factors of 10 3 -10 6 . These 'massive activations' consistently appear on delimiter and special tokens (particularly

Figure 1: Attention sinks and compression valleys emerge simultaneously when BOS tokens develop massive activations. Normalized entropy ( left ), BOS sink rate ( middle ), and BOS token norm ( right ) across layers for six models evaluated on GSM8K. All three phenomena align precisely: when BOS norms spike by factors of 10 3 -10 4 (right panel), entropy drops below 0.5 bits (left) and sink rates surge to near 1.0 (middle), confirming our unified mechanism hypothesis.

the BOS token), acting as input-agnostic biases. Furthermore, Sun et al. (2024) found a link between the emergence of massive activations and attention sinks, which was reinforced in Barbero et al. (2025b); Yona et al. (2025). However, none of these works link this phenomenon to representational structure or a unified theory of information flow in LLMs .

Computation Across Depth in Transformers. Several works have sought to understand the evolution of representations in Transformer-based models from a theoretical perspective. Dong et al. (2021) proved that the repeated application of self-attention leads to rank collapse in simplified settings without residual connections. Geshkovski et al. (2023) analyze self-attention dynamics and show tokens cluster over depth. Wu et al. (2024) studied how layer normalization and attention masks affect information propagation, finding that normalization can prevent rank decay. Other empirical work examines intermediate layer outputs directly. We highlight the LogitLens (Nostalgebraist, 2020) and TunedLens (Belrose et al., 2023), which decode hidden states using the model unembedding matrix and an affine probe per layer, respectively. Furthermore, Lad et al. (2024) measures the sensitivity to delete and swap interventions across layers and argues at different stages of depth-wise inference. Most recently, Csord´ as et al. (2025) argue that deeper LLMs underutilize their additional layers, with later layers mainly refining probability distributions rather than composing novel computations. While these studies illuminate layerwise behavior, none provides a unified mechanism that explains why stages form in depth or predicts when they should appear.

The Gap This Work Addresses. Thus far, attention sinks have been tied to massive activations while compression has remained a separate observation without a causal mechanism. In this work, we document the synchronized dynamics of these phenomena, showing that the same massive activations that create sinks are also the main driver of compression. Building on their co-emergence, we propose a three-stage theory in which residual-stream norms simultaneously regulate mixing in attention heads and compression in representations. Finally, we connect the mechanism to downstream behavior, distinguishing between embedding-style and generation tasks.

Key Metrics.

Attention Sinks.

Compression Valleys.

The middle phase begins abruptly with the emergence of massive activations, typically on the BOS token. As established in Section 3.2, these massive activations necessarily induce representational compression, as well as attention sink formation (Gu et al., 2025). This phase serves as a computational shut-off switch for mixing. The attention sinks act as approximate 'no-ops' (Bondarenko et al., 2023). By attending to BOS tokens with near-zero value norms, heads effectively skip their contribution while preserving the residual stream.

Figure 5: Middle-layer sinks adapt to input complexity while early mixing remains constant. (Left) BOS sink scores for a high-sink prompt in Pythia 410M, showing strong attention to BOS in layers 5-20. (Middle) Difference in BOS sink scores between high-sink and low-sink prompts, revealing input-dependent variation concentrated in middle layers. (Right) Difference in mixing scores between the same prompts, showing near-zero variation in early layers. This demonstrates that Phase 1 performs fixed mixing regardless of input, while Phase 2 compression dynamically adjusts sink strength based on prompt complexity.

In these middle layers, the model refined information through the compressed residual stream, where a few dominant directions preserve high-level context while discarding redundancies. This aligns with the depth-efficiency perspective of Csord´ as et al. (2025), who show that mid-layers contribute less to shaping future tokens and more to stabilizing current representations. In Section 5, we show that performance on generation tasks tends to improve mostly in the latter half of this second phase. We hypothesize this lag reflects the time mid-layer MLPs need to process and consolidate the compressed signal before yielding token-level refinements. We do not treat this as a separate phase, however, since it is not cleanly demarcated by the emergence or dissipation of massive activations.

Sink behavior in middle to late layers adapts to input complexity. Mid to late layer sinks are input-dependent ('dormant heads'), which are often inert but active on specific prompts (Guo et al., 2024). Reproducing Sandoval-Segura et al. (2025) on 20K FineWeb-Edu prompts (Lozhkov et al., 2023), we compare the 'top' prompts (with the strongest sink scores) and the 'bottom' prompts (with the weakest). As shown in Figure 5, sink strength diverges in the middle and last layers. This provides evidence that while middle layers default to sink-like behavior that limits mixing, sink strength in this phase varies depending on the prompt.

Massive Activations.

We now prove that massive activations necessarily induce the observed compression. Consider the representation matrix X ∈ R T × d with rows { x i } T -1 i =0 , where x 0 denotes the BOS token.

Theorem 1 (Massive Activations Induce Spectral Dominance) . Let M = ∥ x 0 ∥ 2 , R = ∑ i =0 ∥ x i ∥ 2 , and θ i be the angle between x 0 and x i . Define the alignment term α = 1 R ∑ i =0 ∥ x i ∥ 2 cos 2 θ i ∈ [0 , 1] . Then:

where σ 1 is the largest singular value of X .

Proof Sketch. By the characterization of singular values, σ 2 1 = max ∥ v ∥ =1 ∥ Xv ∥ 2 . Choosing v = x 0 / ∥ x 0 ∥ and expanding ∥ Xv ∥ 2 yields the bound. Full proofs and discussions in Appendix A.2.

This theorem has immediate consequences for compression metrics:

Corollary 2 (Compression Bounds) . Let c = M/R be the norm ratio and p = ( c + α ) / ( c + 1) . Then:

Dominance: σ 2 1 / ∑ j ≥ 2 σ 2 j ≥ ( c + α ) / (1 -α )
Anisotropy: p 1 ≥ p
Entropy: H ( X ) ≤ -p log p -(1 -p ) log(1 -p ) + (1 -p ) log( r -1)

Meaning of the results. Theorem 1 shows two factors control the rise of σ 2 1 : (i) the magnitudes of the activations M , and (ii) the alignment α of other rows with x 0 . Full alignment makes X rank one (with σ 2 1 ( X ) = M + R = || X || 2 F ), while even with small α with large M suffices to grow σ 2 1 . The corollaries give lower bounds on dominance and anisotropy in terms of ( c, α ) , so increasing c (stronger gap between activations) or increasing α (stronger alignment) provably widens the spectral gap. Consequently, the singular-value entropy is tightly upper-bounded by the mass in the top component, so c ≫ 1 or α → 1 also results in H ( X ) dropping towards zero.

Tightness of the bounds in practice. When massive activations create growing norm ratios c , these bounds become tight. Figure 3 compares our theoretical bounds against empirical measurements for Pythia 410M across all layers. In early layers where massive activations are absent, the bounds

Figure 3: Our theoretical bounds become exact when massive activations emerge, proving they drive compression. Left: When BOS norm dominates (layers 5-15), the first singular value σ 2 1 approximately equals both ∥ x BOS ∥ 2 and ∥ X ∥ 2 F , confirming near rank-one structure. Right: Our entropy upper bound (Corollary 2) tightly matches empirical values in compressed layers, validating that massive activations mathematically necessitate compression. Average across 100 RedPajama (Weber et al., 2024) examples for Pythia 410M.

are loose as expected, because the theory only constrains the intermediate layers. However, in middle layers where massive activations emerge, the bounds become nearly exact, with predicted and observed values overlapping within measurement error. This tightness reveals that massive activations are the dominant mechanism shaping representation geometry: when ∥ x 0 ∥ becomes massive, the representation matrix effectively becomes rank-one plus small perturbations, exactly as our theory predicts.

Computation Across Depth in Transformers.

Key Insight: Different tasks achieve peak performance at different phases of the MixCompress-Refine organization. Embedding tasks peak during Phase 2's compression, benefiting from lower-dimensional spaces. Generation tasks improve monotonically through all phases, requiring Phase 3's refinement for accurate next token prediction. Multiple-choice reasoning shows flat performance until mid-depth, suggesting it needs both compression and subsequent refinement. This explains why studies reach different conclusions about 'optimal' layers: they're measuring fundamentally different computational objectives.

Prior work (Skean et al., 2025) found that mid-layer representations perform strongly, particularly on embedding benchmarks, and linked this effect to the mid-depth compression valley. In this section, we broaden the picture by evaluating both embedding and generation tasks, relating their depthwise performance to the three-stage framework introduced above.

The Gap This Work Addresses.

Massive Activations Drive Both Attention Sinks and Compression

Key Insight: Attention sinks and compression valleys are not separate phenomena but two consequences of massive activations in the residual stream. We prove theoretically that when BOS token norms exceed others, they necessarily create a dominant singular value, causing compression and coinciding with attention sinks. This unification reveals that a single mechanism controls both representation structure and attention in middle layers.

Empirical Correlation: Synchronized Emergence Across Models

We first empirically document that attention sinks and compression valleys emerge simultaneously across model families and scales. Figure 1 shows the layer-wise evolution of three metrics across six models (Pythia 410M/6.9B, LLaMA3 8B, Qwen2 7B, Gemma 7B, Bloom 1.7B): (1) matrix-based entropy H ( X ( ℓ ) ) , (2) sink-rate ( ℓ ) 0 , and (3) BOS token norm ∥ x ( ℓ ) 0 ∥ . We compute these metrics for all 7.5K training examples in GSM8K (Cobbe et al., 2021), plotting the mean and standard deviation at each layer.

Figure 2: The coupled emergence of massive activations, compression, and sinks develops early in training and persists. Evolution of normalized entropy (left) , sink rate (middle) , and BOS norm (right) across training checkpoints (1-143k steps) for Pythia 410M. All three phenomena emerge together around step 1k and remain synchronized throughout training, indicating this organization is learned early.

We observe that all three patterns align precisely . When the BOS norm spikes to factors of 10 3 -10 4 (typically layers 0-5 depending on model depth), entropy simultaneously drops and sink rates surge. We compute the Pearson correlation between the change in BOS norm and entropy, obtaining r = -0 . 9 ± 0 . 18 across models, while BOS norm and sink rate correlate at r = 0 . 58 ± 0 . 25 .

We highlight that this synchronization is remarkably consistent. While sink rates vary with prompt content, the layer index where these phenomena emerge is fixed for each model, and the massive activation is deterministic. For instance, in Pythia 410M, the transition consistently occurs at layer 5 regardless of input, suggesting an architectural rather than input-dependent mechanism. We show how these transitions emerge during training in Figure 2. We point the reader to Appendix B.1 for details on experiments, larger models, and a note on GPT OSS (Agarwal et al., 2025).

Theoretical Framework: Massive Activations Imply Compression

We now prove that massive activations necessarily induce the observed compression. Consider the representation matrix X ∈ R T × d with rows { x i } T -1 i =0 , where x 0 denotes the BOS token.

where σ 1 is the largest singular value of X .

This theorem has immediate consequences for compression metrics:

Corollary 2 (Compression Bounds) . Let c = M/R be the norm ratio and p = ( c + α ) / ( c + 1) . Then:

Dominance: σ 2 1 / ∑ j ≥ 2 σ 2 j ≥ ( c + α ) / (1 -α )
Anisotropy: p 1 ≥ p
Entropy: H ( X ) ≤ -p log p -(1 -p ) log(1 -p ) + (1 -p ) log( r -1)

Massive Activations.

Proofs

Compression Valleys.

Evidence from Targeted Ablations

To isolate the exact role of massive activation empirically, we perform targeted ablations: zeroing the MLP's contribution to the BOS token at layers where massive activations emerge. Specifically, we set x ( ℓ +1) BOS ← x ( ℓ ) BOS + Attn ( ℓ ) ( x BOS ) , removing only MLP ( ℓ ) ( x BOS ) .

We find that ablating massive activations can eliminate both phenomena , confirming our theoretical conjecture. In LLaMA3 8B, removing the MLP's contribution at the first layer prevents entropy drop (remains at 0.4-0.5 bits vs. dropping to 0.02 bits), eliminates sink formation (sink rate drops from 0.85-1.0 to 0.0) and maintains BOS norm within 2 × of other tokens (vs. 10 2 × normally) (Fig. 4). Similar results hold across models, as seen in Fig.14 in Appendix B.1. Some models (Pythia 410M, Qwen2 7B) develop massive activations in more than one stage. Ablating any single stage partially reduces compression; ablating all stages eliminates it entirely, suggesting a cumulative contribution. However, for Pythia 410M, ablations remove compression but do not remove attention sinks, which suggests that the formation of sinks might have model-dependent causes.

Why Middle Layers? We hypothesize that the mid-depth concentration of sinks and compression reflects how Transformers allocate computation across depth: early layers perform broad contextual mixing, while later layers become increasingly aligned with next-token prediction (Lad et al., 2024). Freed from these pressures, middle layers can develop extreme features that regulate tokento-token sharing and induce compression, consistent with a mid-network shift toward refinement (Csord´ as et al., 2025). In the next section, we show how massive activations predict, and precisely characterize, three stages of information flow.

Mix-Compress-Refine: A Theory of Information Flow

Key Insight: Transformers organize computation into three distinct phases demarcated by massive activations. Early layers mix information broadly to build context. Middle layers compress representations and halt mixing when massive activations emerge, preventing over-smoothing while maintaining essential information. Late layers equalize norms and switch to local positional attention, refining representations for task-specific outputs.

Building on our mechanistic understanding of massive activations, we now present a broader theory of how transformers organize computation across depth. We propose that information processing occurs in three distinct phases, demarcated by the emergence and dissipation of massive activations.

Figure 4: Removing massive activations eliminates both compression and attention sinks, confirming causality. Ablating the MLP contribution to the BOS token at layer 0 in LLaMA3 8B has three effects: (Left) Entropy remains at ∼ 0.5 bits instead of dropping to 0.02, showing decompression. (Middle) Sink rate stays at 0 throughout depth, confirming no attention sink formation. (Right) BOS norm (orange) remains comparable to the rest of tokens (grey) instead of spiking by 10 3 × . This causal intervention validates that massive activations drive both phenomena.

Phase 1: Information Mixing (Layers 0--20 %)

In the early layers of the model, we observe diffuse attention patterns, which are enabled by the lack of massive activations. This allows the model to perform mixing of high-dimensional token representations for a few layers, allowing the model to build contextual representations through broad information integration. An example of such an attention head in the early layers can be seen in Figure 6.

To quantify mixing in attention heads, we define the Mixing Score as the average row entropy of attention matrices: 1 T ∑ T i =1 H ( A ( ℓ,h ) i, : ) . Across models, we find that early layers consistently maintain mixing scores above 0 . 7 , confirming active token mixing, before dropping sharply when massive activations emerge. Notably, this mixing phase varies in extent: From just the first layer in some models to approximately 20% of network depth in others. However, its qualitative characteristics remain consistent. We plot this metric Figure 18 in Appendix B.2 across models and layers.

We believe that this initial mixing stage is deliberately limited to prevent over-mixing and representational collapse that would occur with extended uniform attention (Barbero et al., 2024; 2025a), analogous to over-smoothing in graph neural networks (Arroyo et al., 2025). This controlled, brief mixing phase establishes the semantic foundation that subsequent phases refine. The model captures both local token dependencies and global context, creating rich representations that can be selectively compressed and refined in later phases.

Phase 2: Compression and Halted Mixing (Layers 20--85 %)

In the final phase, the model reverses the compression bottleneck through norm equalization and a fundamental shift in attention patterns.

Norm equalization drives decompression. In this phase, we find the BOS norm plateaus or decreases while the average norm of the rest of tokens rises sharply, driving them toward similar magnitudes (right panel in Figure 4 and Figure 15 in Appendix B.1). This equalization begins earlier than the full phase transition, where the average norm starts rising around 40-60% depth, preparing for the eventual shift. The massive activation ratio drops from > 10 3 to < 10 , removing the mathematical basis for compression and allowing representations to re-expand.

Attention shifts to positional patterns. As massive activations dissipate, we observe heads transition from sink-dominated to position-based patterns. In particular, we observe the emergence of identity heads ( i → i ), previous-token heads ( i → i -1 ), and other sharp attention patterns, where by sharp we mean highly localized attention patterns. Figure 6 shows an example of such a pattern in the Pythia 410M model. We find that sharp positional patterns are especially common in RoPE-based models, consistent with recent work (Barbero et al., 2025b) showing that RoPE induces frequencyselective structure that favors the emergence of such heads. We provide empirical evidence of this in Appendix B.2. In particular, when measuring the mixing rate of attention patterns, only models without RoPE revert to higher mixing in later layers, whereas RoPE-based models consistently transition toward sharp positional attention.

This phase serves three computational purposes. Webelieve that this phase serves three purposes. First, norm equalization reduces BOS dominance so content tokens can meaningfully influence the residual pathway and receive token-specific refinements. Second, attention shifts to sharp (often positional) heads that perform selective mixing , focusing on a few task-relevant tokens and writing their features into the residual. As late layers re-expand capacity, these signals can be represented

Figure 6: Attention patterns transform from diffuse mixing to sinks to positional focus across depth. Evolution of attention patterns in Pythia 410M showing representative heads at layers 0, 16, and 23. Early layers exhibit diffuse attention enabling broad information mixing. Middle layers show sink patterns that halt mixing. Late layers display sharp positional patterns for selective refinement.

distinctly rather than squeezed by the mid-layer bottleneck. In parallel, identity/near-diagonal heads curb mixing without defaulting to sinks, where their non-zero value writes act as local signal boosters, in contrast to BOS sinks that effectively zero out updates. Third, bringing token norms to smaller, comparable scales likely improves numerical stability for the unembedding. Notably, models tend to equalize by boosting content-token norms rather than fully collapsing the BOS norm, preserving a modest global bias while enabling precise, content-driven refinements.

Implications for Downstream Performance

The Distinct Performance Patterns Across Tasks

Generation improves monotonically through all phases. We begin by evaluating intermediate layers across multiple model families and sizes using LogitLens (Nostalgebraist, 2020) on WikiText-2 (Merity et al., 2016). We observe a steady perplexity decline with depth from > 10 4 in early layers to 10-25 at full depth, as shown in Figure 7 (left). We notice little gain in the very early layers (consistent with an embedding-formation stage) and continued refinement through middepth. Across several models, the sharpest improvements occur in Phase 3, where norm equalization and positional/identity heads enable token-specific refinements, which appear to be crucial for nexttoken prediction tasks.

We further test the same set of models on multiple-choice question-answering tasks. We evaluate on ARC Easy, ARC Challenge, HellaSwag, and WinoGrande (Clark et al., 2018; Zellers et al., 2019; Sakaguchi et al., 2021) via LogitLens and the LM Evaluation Harness (Gao et al., 2024) with zeroshot learning. Figure 7 (middle) shows the results for ARC Easy and results for the rest of the datasets can be found in Figure 24, Appendix B.3. For sufficiently large models, accuracies remain largely flat until roughly 40-60% depth and then rise sharply. This suggests that, for generationaligned tasks, compression alone (Phase 2) is not sufficient; gains emerge only toward the end of Phase 2 (after sufficient residual refinement) and continue into Phase 3, where norm equalization and positional/identity heads enable token-specific updates.

In short, in generation settings, the late Phase 2 to Phase 3 transition is pivotal, aligning with the observation in Csord´ as et al. (2025) of a mid-network phase change from future-token to current-token computation, providing independent validation that Phase 2 and Phase 3 serve distinct computational roles.

Figure 7: Embedding tasks peak during compression while generation requires full refinement, revealing distinct computational objectives. (Left, Middle) Perplexity on Wikitext-2 and multiple-choice QA accuracy in ARC Easy via LogitLens generally do not improve significantly until ∼ 50% depth, then decreases/rises steadily through Phase 3. (Right) Linear probe test accuracy on the same task peaks at 25-75% depth (Phase 2) and declines thereafter. This divergence demonstrates that embedding-relevant features concentrate in compressed middle layers, while generation tasks require full-depth for token-specific predictions.

Appendix B.3). These results align with evidence that next-token pretraining does not uniformly benefit perception-style classification (Balestriero & Huang, 2024). Together, they underscore task dependence : classification-relevant linear features concentrate in intermediate layers, whereas late layers are repurposed for token-specific generative refinement. Furthermore, the pattern suggests that massive activations in the residual pathway not only curb over-mixing via sink formation but also act as a mechanism the model uses to compress information.

Why Do Different Tasks Need Different Phases?

We believe these findings clarify which tasks actually benefit from compression. In particular, embedding-style objectives (such as clustering, retrieval, classification, bitext mining, etc.) gain from Phase 2's compression because they target low-dimensional structure while discarding irrelevant information, echoing classic arguments on the benefits of information bottlenecks and compressed representations (Shwartz-Ziv et al., 2018; Kawaguchi et al., 2023). This picture aligns with evidence that LLMs produce surprisingly linear (and often linearly separable) embeddings (Razzhigaev et al., 2024; Marks & Tegmark, 2023). In particular, when features concentrate in a low-dimensional subspace, linear probing, semantic retrieval, and related embedding tasks become easier. Moreover, such linear structure has been linked to the emergence of high-level semantic concepts (Park et al., 2023), reinforcing our hypothesis on why mid-layer compressed states tend to work well for non-generative evaluations.

By contrast, generation and reasoning require capacity that compressed states alone cannot provide. Performance improves the most once Phase 3 norm equalization restores higher entropy, and positional heads/MLPs can refine token-specific details, which is also when we observe the models being most confident about their predictions (see Fig. 23 in Appendix B.3). In this way, the model makes use of the compressed and refined representation from Phase 2, which has captured high-level ideas and semantic concepts, and expands this into higher-dimensional space to perform token-level refinements in Phase 3.

This reconciles two views: compressed mid-layers suit embedding benchmarks, whereas next-tokenprediction-aligned tasks benefit from full-depth processing. Practically, 'optimal layer' selection should match phase to objective, suggesting phase-aware early exiting (Schuster et al., 2022) as a potentially promising design choice.

Conclusion

In this work, we revisited two puzzling phenomena in decoder-only Transformers, attention sinks and compression valleys. We began with the observation that attention sinks, compression valleys, and massive activations all emerge at the same time in language models. We then proved that a single high-norm token necessarily induces a dominant singular value, yielding low matrix-entropy and high anisotropy, and we bounded these effects quantitatively.

Building on this, we proposed a Mix-Compress-Refine theory of depth-wise computation in LLMs. In particular, we show that early layers mix broadly through diffuse attention, middle layers compress and curb mixing via attention sinks, late layers re-equalize norms and apply sharp positional heads for selective refinement. The boundaries between these phases are marked by the appearance and later disappearance of massive activations in depth. We use this organization to clarify downstream task behavior. While embedding-style tasks peak in compressed mid-layers, generation improves through late refinement and benefits from full depth.

We see this framework as a step toward a more mechanistic account of how LLMs allocate computation across depth. We hope these insights help connect head-level mechanisms with representation geometry, ultimately guiding more efficient and controllable LLM designs.

Experimental details.

Proofs

Architecture

In this work, we study decoder-only Transformers (Radford et al., 2018), which employ causal masking in attention and constitute the dominant architecture in today's large language models (Gemma Team et al., 2024; Dubey et al., 2024). We follow the notation of Barbero et al. (2024), but we importantly also consider a model with H ≥ 1 attention heads:

where we will also denote by the matrices A ( ℓ,h ) the attention heads given by A ( ℓ,h ) ij = α ( ℓ,h ) ij . The causal masking translates into A ( ℓ,h ) being lower-triangular and the row-wise softmax implies row-stochasticity.

Theoretical results

This section includes the proofs of the statements of Section 3.2, where we show massive activations imply the dominance of a singular value. One can obtain a weaker version of the bound focused only on the massive activation (no alignment terms) that entails weaker bounds for the spectral metrics. The following lemma serves as a proof for the fact that σ 2 1 ( X ) = max || v || =1 || Xv || 2 .

Lemma 3. Let A be a real symmetric n × n matrix with (real) eigenvalues λ max = λ 1 ≥ λ 2 ≥ . . . ≥ λ n = λ min and let

denote the Rayleigh Quotient for A and a real non-zero vector x ∈ R n . Then R ( A , x ) ∈ [ λ min , λ max ] , achieving each bound at the corresponding eigenvectors v min , v max .

Proof. Let A = Q ⊤ ΛQ be the diagonalization of A in the eigenbasis given by the v i and let y = Qx , such that x = Q ⊤ y for Q is orthogonal (i.e. Q ⊤ = Q -1 ) . Then,

Since the weights w i = y 2 i / ∑ y 2 j satisfy w i ≥ 0 and ∑ i w i = 1 , R ( A , x ) is a convex linear combination of the eigenvalues and therefore λ min ≤ R ( A , x ) ≤ λ max , with equalities when x = v max , v min .

we should expect σ 2 1 to dominate whenever (i) the norm ratio c = || x 0 || 2 / ∑ i =0 || x i || 2 is large or (ii) the remaining rows { x i } i =0 are measurably aligned with x 0 . The next result formalizes this intuition.

Theorem 4. Let M = ∥ x 0 ∥ 2 , R = ∑ i =0 ∥ x i ∥ 2 , and θ i be the angle between x 0 and x i . Define the alignment term α = 1 R ∑ i =0 ∥ x i ∥ 2 cos 2 θ i ∈ [0 , 1] . Then:

Proof. By definition of the singular value (also see Lemma 3),

Using ⟨ x i , x 0 ⟩ 2 = || x 0 || 2 || x i || 2 cos 2 θ i , we obtain

Since αR = ∑ i =0 ∥ x i ∥ 2 cos 2 θ i , we get σ 2 1 ≥ M + αR , which is the desired result.

As mentioned in the main text, Theorem 4 makes precise how two independent factors govern the rise of σ 2 1 : (i) the magnitudes of the activations M , and (ii) the alignment α of the remaining rows with x 0 . If representations were totally aligned, then X would indeed be rank one and would have one singular value given by σ 2 1 ( X ) = M + R = || X || 2 F . Conversely, even with small α (say, when token representations are not aligned or even orthogonal), a large norm M suffices to grow σ 2 1 . Empirically, we observe the term || x 0 || 2 making the most impact in our analysis, as we know it will be orders of magnitude larger than the rest of norms, however keeping the alignment term is also important for the following results.

We move onto proving Corollary 2, which we split in three in this section.

Proof. From Theorem 4, σ 2 1 ≥ M + αR . Moreover, ∑ j ≥ 2 σ 2 j = || X || 2 F -σ 2 1 ≤ || X || 2 F -( M + αR ) = R -αR = (1 -α ) R . Therefore one gets:

Corollary 6 (Anisotropy) . Let p 1 = σ 2 1 / || X || 2 F denote the anisotropy. In the setting of 4,

As mentioned in the main text, Corollaries 5, 6 lower-bound the dominance ratio and anisotropy using only ( c, α ) . Thus, either increasing c (stronger massive activation) or increasing α (stronger alignment) provably inflates the spectral gap. In both cases, having perfect alignment with x 0 or having || x 0 || 2 grow with respect to the rest, forces extreme values. If α → 1 , then c + α 1 -α → ∞

and c + α c +1 → 1 , intuitively because only one direction becomes relevant in the data. Moreover, as the massive activation grows c → ∞ , the same result holds. Notice that c is the ratio between the massive activation and the rest of them, therefore c increases by letting x 0 be larger in norm, but also letting the rest of representations have low norm.

Corollary 7 (Shannon matrix-based entropy) . Let p j := σ 2 j / || X || 2 F denote the normalized distribution of singular values of X . Let H ( X ) := -∑ r j =1 p j log p j be the Shannon entropy of such distribution. Let p := c + α c +1 . Then, we have the following bound

Proof. Let

so we need to bound the second term, which is the entropy of r -1 terms adding up to 1 -p 1 ≤ 1 -p . This term would be maximised if the mass was equally distributed, that is, p j = 1 -p 1 r -1 ≤ 1 -p r -1 . Therefore, one gets

The result is obtained combining these two bounds.

For fixed top mass p 1 ≥ p , entropy is maximized when the remaining mass 1 -p is spread uniformly over the other r -1 singular values; the bound above is exactly that maximum. Consequently, any additional structure in the tail (e.g., a second spike) will lower the true entropy beneath this upper bound. Notice for c →∞ or α → 1 , p → 1 and the upper bound approaches 0 .

Limitations of this analysis. In the theoretical analysis conducted above, we only considered one massive activation placed on the BOS token. In practice, models may exhibit more than one massive activation (Sun et al., 2024). In this case, our c term would make the bounds more permissive. We believe this poses no problem to our overall message and that this analysis can be extended. One can suppose the first n tokens to be the massive activations and decompose X = ∑ n -1 i =0 e i x ⊤ i + Y such that the first summand can be of rank at most n and Y a small perturbation in comparison, leading to small entropy (effective rank ≤ n ), also holding for longer context lengths.

.

Proofs

.

Proofs

(Singular value dominance).

Proofs

A note on GPT OSS.

Proofs

(Shannon matrix-based entropy).

Proofs

Limitations of this analysis.

Additional results

Experimental details. All experiments were implemented in PyTorch using NVIDIA A100 GPUs with 40GB memory or NVIDIA H100 GPUs with 80GB when the memory requirements were stronger. We examined pretrained models of varying depths, using HuggingFace repositories with Transformers and Transformer-Lens (Nanda & Bloom, 2022). When large datasets were run to collect metrics such as sink rates and norms, prompts were truncated to a maximum length of 4096 tokens for the FineWeb-Edu experiment (Fig. 5) and 1024 for the GSM8K experiment (Fig. 1), as the latter required singular value decompositions to compute the entropy. LogitLens experiments for multiple-choice-questions tasks were done with LM-Evaluation-Harness (Gao et al., 2024), implementing our own model wrapper to output hidden states at each layer instead of final ones.

Pearson Correlations. To assess the dynamical relationship between BOS norm, matrix-based entropy, and BOS sink rate across layers, we computed correlations on their layerwise changes. For each model and metric, the trajectory across layers was first z -scored, and then we defined the delta at layer ℓ as the difference with respect to the preceding layer,

This procedure emphasizes abrupt layerwise changes rather than absolute values, which is crucial because BOS norm often exhibits sharp spikes that coincide with collapses in entropy and the subsequent emergence of attention sinks. We then measured Pearson correlation coefficients between ∆˜ b ℓ and ∆˜ e ℓ (BOS norm vs entropy, same layer) and between ∆˜ b ℓ and ∆˜ s ℓ +1 (BOS norm vs sink rate, lagged by one layer). Correlations were computed separately per model and summarized across models by Fisher z -transform averaging, reporting the mean correlation and the standard deviation across models.

Limitations. We outline some limitations of our work. Our analysis focuses on decoder-only Transformers and primarily attributes both sinks and compression to BOS-centered massive activations; models with alternative positional schemes, attention sparsity patterns, or special-token conventions (e.g., no explicit BOS token, sinks in different positions or ALiBi encodings) may exhibit different dynamics. Our causal claims use targeted MLP ablations on selected layers and model families, however, we observe model-dependent exceptions (e.g., sinks persisting despite decompression). Lastly, the theory assumes a single massive row, whereas real models may feature multiple interacting massive activations. However, as discussed in Appendix A.2, we believe this poses no harm to the overall message: a few massive activations would push the representations to a lower-dimensional subspace, but not necessarily of dimension 1.

Experimental details.

Pearson Correlations.

Key Metrics. For a representation matrix X with singular values σ 1 ≥ σ 2 ≥ . . . ≥ σ r , we measure compression via the matrix-based entropy :

with threshold τ = 0 . 3 (unless otherwise stated), and I denotes the indicator function. We focus on the BOS token, the primary sink across models.

Limitations.

∗ Denotes equal first authorship. Correspondence to alvaro.arroyo@eng.ox.ac.uk and enrique.queipodellanoburgos@reuben.ox.ac.uk

Our contributions are:

Broader Analysis of Models

In this section, we provide broader validation of our three-phase theory across model families and model sizes. Moreover, we expand on the empirical measurements of metrics from our theoretical analysis, on the ablation of MLPs and provide two notes on the specifics of the GPT OSS model and Gemma 7B.

Figure 8: Entropy, sink rate and BOS norm for the Pythia family of models.

Validation on more model families and large models. To further validate our Mix-CompressRefine theory we observe the emergence of compression, attention sinks and massive activations in the Pythia model family (Fig. 8) and in very large models (70B-120B), specifically LLaMA3 70B, Qwen2 72B and GPT OSS 120B (Fig. 9). The prompt is a single GSM8K example. GPT OSS' particular sink patterns are explained later in this section. We believe this showcases our observed correlations are a universal phenomena in LLMs.

Training Checkpoints. We evaluated the training dynamics of the Pythia 410M/6.9B/12B models across multiple checkpoints (steps 1, 1k, 2k, 4k, 8k, 10k, 20k, 30k, and 143k). At each checkpoint, after every layer we recorded the entropy, BOS sink rate (threshold τ = 0 . 3 ) and the norm of the BOS token representation. The prompt was a single GSM8K prompt ' Janet's ducks lay 16 eggs... ' Figures 10 and 11 illustrate the results for the Pythia 6.9B and 12B models, respectively.

Visualizations of theoretical results. We provide plots with the bounds from the theoretical discussion in Section 3.2. Figures 13 and 12 show these values for LLaMA3 8B and Pythia 410M. We show (1) the terms M = || x BOS || 2 , αR and M + R = || X || 2 F from Theorem 1, (2) the top 3 singular values σ 2 i and the sum ∑ i ≥ 1 σ 2 i and (3,4,5) the dominance, anisotropy and entropy bounds from Corollary 2. In all cases, we observe the bounds being tight in the middle layers. In this regime, the first singular value σ 2 1 follows the trajectory of || x BOS || 2 closely and dominates the rest of the singular values. The dominance decreases steadily, specially towards the second half of the network, indicating the preparation for next token prediction of Phase 3.

MLP ablations. We further run the targeted MLP ablations on more models to erase the appearance of the massive activation. For LLaMA3 8B, we ablate layer 0; for Qwen2 7B, we ablate layers

Figure 13: Theoretical bounds for LLaMA3 8B.

3 and 4 and for Pythia 410M, we ablate layers 5-7. The results are shown in figure 14. Interestingly, removing massive activation always decompresses the representations, however, in Pythia 410M it does not remove the attention sinks, which might be explained by the many architectural differences between these models.

Norm equalizations. Across models, we find that the average norm of the rest of the tokens (excluding BOS) grows monotonically with depth, while the BOS norm grows abruptly with the massive activation, remains constant in the middle layers and drops at the last layers. Figure 15 illustrates this process for three models. As the rest of tokens become closer to the BOS norm, the dominance of the first singular value is weaker, allowing for representations to decompress.

Figure 15: LLMs equalize token norms towards at the end of the network.

Anote on GPT OSS. In the GPT OSS (Agarwal et al., 2025) family of models, each attention head is equipped with a learnable sink logit that allows it to divert probability mass away from real tokens, effectively providing a 'skip' option. However, unlike the explicit ( k ′ , v ′ ) bias formulation studied in Sun et al. (2024); Gu et al. (2025), GPT OSS does not include a learnable value sink token. This means the model cannot encode bias information directly through the sink, and we hypothesize that it instead continues to rely on massive activations at the BOS token to implement bias-like behavior and generate compression. This explains why BOS sink patterns are still observed, particularly in the middle layers (see Figure 16). The alternating spikes across layers may be a consequence of GPT OSS' alternating dense and locally banded sparse attention pattern: in layers with local attention windows, heads are less able to access BOS, while in subsequent dense layers BOS becomes globally visible again, producing the observed oscillatory sinkness.

The particular case of Gemma 7B. Even though Gemma 7B follows the same dynamics we have discussed in the chapter, how it achieves them is different from the rest. Token norms in Gemma 7B start very high; instead of increasing the BOS norm to create a massive activation, Gemma 7B decreases the norms of the remaining tokens to create the disparity needed for compression, then

re-equalizes by increasing their norms in late layers. We attribute the initially high norms to the embedding layer, as there are no other components that can account for it. We believe this is also why attention patterns in Gemma 7B look a bit different from the rest, with identity heads emerging both at the early and later layers. Figure 17 illustrates this. Pre- means before each layer, while postmeans after each layer.

Validation on more model families and large models.

Training Checkpoints.

Visualizations of theoretical results.