
Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin

Enrique Queipo-de-Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, Ravid Shwartz-Ziv

Abstract

Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M--120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation studies validate our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.


Enrique Queipo-de-Llano*,1, Álvaro Arroyo*,1, Federico Barbero1, Xiaowen Dong1, Michael Bronstein1,2, Yann LeCun3,4, Ravid Shwartz-Ziv3
1 University of Oxford, 2 AITHYRA, 3 New York University, 4 Meta FAIR


Introduction

Large Language Models (LLMs) have become remarkably capable, yet how they process information through their layers remains poorly understood. Two phenomena have particularly puzzled researchers: attention sinks , where attention heads mysteriously collapse their focus onto semantically uninformative tokens (Xiao et al., 2024), and compression valleys , where intermediate representations show unexpectedly low entropy despite the model's high-dimensional space (Skean et al., 2025).

These phenomena appear paradoxical: why would powerful models waste attention on meaningless tokens, and why would representations compress in the middle of processing? Previous work has explained attention sinks through positional biases (Gu et al., 2025) and over-mixing prevention (Barbero et al., 2025a), while compression valleys have been explained through an information bottleneck theory (Skean et al., 2025). However, the precise reasons why they emerge remain unclear and no formal link has been established between them.

We reveal that attention sinks and compression valleys are two manifestations of a single mechanism: massive activations in the residual stream. These extremely large-magnitude features emerge on specific tokens (typically the beginning-of-sequence token, BOS), create both effects simultaneously, and act as input-agnostic biases. While prior work linked massive activations to attention sinks (Sun et al., 2024; Cancedda, 2024), we prove that they also drive compression: when a single token's norm dominates the others, it necessarily creates a dominant singular value in the representation matrix, explaining the compression valley. Experiments across several models (410M-120B parameters) confirm this unified mechanism, connecting both phenomena through massive activations.

∗ Denotes equal first authorship. Correspondence to alvaro.arroyo@eng.ox.ac.uk and enrique.queipodellanoburgos@reuben.ox.ac.uk

This unified mechanism reveals how transformers organize computation across depth through the Mix-Compress-Refine framework: massive activations control three distinct processing phases. Early layers (0-20% depth) mix information broadly via diffuse attention. Middle layers (20-85%) compress representations while halting mixing through attention sinks, both triggered by massive activation emergence. Late layers (85-100%) selectively refine through localized attention as norms equalize. This phase structure explains task-specific optimal depths: embedding tasks peak during compression, while generation requires full refinement.

Our contributions are:

· We empirically demonstrate that attention sinks and compression valleys emerge simultaneously in middle layers across several models (410M-120B parameters).
· We prove that massive activations mathematically require compression, providing tight bounds on entropy reduction and singular value dominance (Theorem 1).
· We validate causality through targeted ablations: removing massive activations eliminates compression and reduces attention sinks.
· We propose the Mix-Compress-Refine theory of information flow, explaining how transformers organize computation into three distinct phases.
· We show this framework helps resolve the puzzle of task-dependent optimal depths: embedding tasks peak during compression while generation requires full refinement.

In this paper, we study decoder-only transformers with $L$ layers, hidden dimension $d$, and $H$ attention heads per layer. For a sequence of $T$ tokens, let $x_i^{(\ell)} \in \mathbb{R}^d$ denote token $i$'s representation at layer $\ell$, and $X^{(\ell)} \in \mathbb{R}^{T \times d}$ the full representation matrix. Attention weights $\alpha_{ij}^{(\ell,h)}$ from token $i$ to token $j$ in head $h$ satisfy causal masking ($\alpha_{ij} = 0$ for $j > i$). Full architecture provided in Appendix A.1.

Key Metrics. For a representation matrix $X$ with singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r$, we measure compression via the matrix-based entropy:

$$
H(X) = -\sum_{j=1}^{r} p_j \log p_j , \qquad p_j = \frac{\sigma_j^2}{\sum_{k=1}^{r} \sigma_k^2} = \frac{\sigma_j^2}{\|X\|_F^2},
$$

Low entropy indicates compression into few dominant directions. The anisotropy $p_1 = \sigma_1^2 / \|X\|_F^2$ measures directional bias (Razzhigaev et al., 2023): values near $1$ indicate extreme bias, values near $1/r$ indicate isotropy. For token position $k$, the attention sink score and sink rate (Gu et al., 2025) are:

$$
\mathrm{Sink}_k^{(\ell,h)} = \frac{1}{T-k}\sum_{i=k}^{T-1} \alpha_{ik}^{(\ell,h)} ,
\qquad
\mathrm{SinkRate}_k^{(\ell)} = \frac{1}{H}\sum_{h=1}^{H} \mathbb{I}\!\left[\mathrm{Sink}_k^{(\ell,h)} > \tau\right],
$$

with threshold $\tau = 0.3$ (unless otherwise stated), and $\mathbb{I}$ denotes the indicator function. We focus on the BOS token, the primary sink across models.
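These metrics are straightforward to compute from a representation matrix and per-head attention maps. The NumPy sketch below is our own illustration (function names and the averaging convention for the sink score are assumptions, not the paper's code), using natural-log entropy:

```python
import numpy as np

def matrix_entropy(X):
    """Matrix-based entropy: Shannon entropy (in nats) of the normalized
    squared singular values p_j = sigma_j^2 / ||X||_F^2."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s**2 / np.sum(s**2)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def anisotropy(X):
    """p_1 = sigma_1^2 / ||X||_F^2: near 1 means one dominant direction,
    near 1/r means isotropy."""
    s = np.linalg.svd(X, compute_uv=False)
    return s[0]**2 / np.sum(s**2)

def sink_score(A, k=0):
    """Average attention mass that queries place on token k, for one
    (T x T) row-stochastic attention matrix A."""
    return A[:, k].mean()

def sink_rate(heads, k=0, tau=0.3):
    """Fraction of heads whose sink score on token k exceeds tau."""
    return np.mean([sink_score(A, k) > tau for A in heads])

# A matrix dominated by one massive row compresses: low entropy, high anisotropy.
X = np.random.randn(16, 32)
X[0] *= 100.0
assert matrix_entropy(X) < 0.5 and anisotropy(X) > 0.9
```

The final assertion previews the paper's main claim: scaling up a single row is enough to collapse the spectrum.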

Attention Sinks. Attention heads mysteriously focus on semantically uninformative tokens (e.g., BOS) across diverse models and scales (Xiao et al., 2024). While Barbero et al. (2025a) argue that sinks prevent over-mixing, Cancedda (2024) relates them to spectral subspaces, and Gu et al. (2025) traces their emergence to pretraining, no work has yet examined their depth-wise organization.

Compression Valleys. Transformer representations compress dramatically in middle layers, where the matrix-based entropy drops significantly before recovering (Skean et al., 2025). This universal pattern coincides with increased anisotropy ($p_1 > 0.9$) (Razzhigaev et al., 2023) and near-linear layer transitions (Razzhigaev et al., 2024). Paradoxically, these compressed representations excel at embedding tasks despite their reduced dimensionality. The underlying mechanism has remained unknown, with only information-bottleneck hypotheses that lack causal evidence (Skean et al., 2025).

Massive Activations. Sun et al. (2024) identified extremely large-magnitude features in transformer residual streams, with individual neurons exceeding typical activations by factors of $10^3$-$10^6$. These 'massive activations' consistently appear on delimiter and special tokens (particularly the BOS token), acting as input-agnostic biases. Furthermore, Sun et al. (2024) found a link between the emergence of massive activations and attention sinks, a finding reinforced by Barbero et al. (2025b) and Yona et al. (2025). However, none of these works links the phenomenon to representational structure or to a unified theory of information flow in LLMs.

Figure 1: Attention sinks and compression valleys emerge simultaneously when BOS tokens develop massive activations. Normalized entropy (left), BOS sink rate (middle), and BOS token norm (right) across layers for six models evaluated on GSM8K. All three phenomena align precisely: when BOS norms spike by factors of $10^3$-$10^4$ (right panel), entropy drops below 0.5 bits (left) and sink rates surge to near 1.0 (middle), confirming our unified mechanism hypothesis.

Computation Across Depth in Transformers. Several works have sought to understand the evolution of representations in Transformer-based models from a theoretical perspective. Dong et al. (2021) proved that the repeated application of self-attention leads to rank collapse in simplified settings without residual connections. Geshkovski et al. (2023) analyze self-attention dynamics and show that tokens cluster over depth. Wu et al. (2024) studied how layer normalization and attention masks affect information propagation, finding that normalization can prevent rank decay. Other empirical work examines intermediate-layer outputs directly. We highlight the LogitLens (Nostalgebraist, 2020) and TunedLens (Belrose et al., 2023), which decode hidden states using the model's unembedding matrix and an affine probe per layer, respectively. Furthermore, Lad et al. (2024) measure sensitivity to deletion and swap interventions across layers and argue for distinct stages of depth-wise inference. Most recently, Csordás et al. (2025) argue that deeper LLMs underutilize their additional layers, with later layers mainly refining probability distributions rather than composing novel computations. While these studies illuminate layer-wise behavior, none provides a unified mechanism that explains why stages form in depth or predicts when they should appear.

The Gap This Work Addresses. Thus far, attention sinks have been tied to massive activations while compression has remained a separate observation without a causal mechanism. In this work, we document the synchronized dynamics of these phenomena, showing that the same massive activations that create sinks are also the main driver of compression. Building on their co-emergence, we propose a three-stage theory in which residual-stream norms simultaneously regulate mixing in attention heads and compression in representations. Finally, we connect the mechanism to downstream behavior, distinguishing between embedding-style and generation tasks.


In these middle layers, the model refines information through the compressed residual stream, where a few dominant directions preserve high-level context while discarding redundancies. This aligns with the depth-efficiency perspective of Csordás et al. (2025), who show that mid-layers contribute less to shaping future tokens and more to stabilizing current representations. In Section 5, we show that performance on generation tasks tends to improve mostly in the latter half of this second phase. We hypothesize that this lag reflects the time mid-layer MLPs need to process and consolidate the compressed signal before yielding token-level refinements. We do not treat this as a separate phase, however, since it is not cleanly demarcated by the emergence or dissipation of massive activations.

Sink behavior in middle to late layers adapts to input complexity. Mid- to late-layer sinks are input-dependent ('dormant' heads), often inert but activating on specific prompts (Guo et al., 2024). Reproducing Sandoval-Segura et al. (2025) on 20K FineWeb-Edu prompts (Lozhkov et al., 2023), we compare the 'top' prompts (with the strongest sink scores) and the 'bottom' prompts (with the weakest). As shown in Figure 5, sink strength diverges in the middle and last layers. This provides evidence that while middle layers default to sink-like behavior that limits mixing, sink strength in this phase varies depending on the prompt.



Key Insight: Different tasks achieve peak performance at different phases of the Mix-Compress-Refine organization. Embedding tasks peak during Phase 2's compression, benefiting from lower-dimensional spaces. Generation tasks improve monotonically through all phases, requiring Phase 3's refinement for accurate next-token prediction. Multiple-choice reasoning shows flat performance until mid-depth, suggesting it needs both compression and subsequent refinement. This explains why studies reach different conclusions about 'optimal' layers: they are measuring fundamentally different computational objectives.

Prior work (Skean et al., 2025) found that mid-layer representations perform strongly, particularly on embedding benchmarks, and linked this effect to the mid-depth compression valley. In this section, we broaden the picture by evaluating both embedding and generation tasks, relating their depth-wise performance to the three-stage framework introduced above.


Massive Activations Drive Both Attention Sinks and Compression

Key Insight: Attention sinks and compression valleys are not separate phenomena but two consequences of massive activations in the residual stream. We prove theoretically that when BOS token norms exceed others, they necessarily create a dominant singular value, causing compression and coinciding with attention sinks. This unification reveals that a single mechanism controls both representation structure and attention in middle layers.

Empirical Correlation: Synchronized Emergence Across Models

We first empirically document that attention sinks and compression valleys emerge simultaneously across model families and scales. Figure 1 shows the layer-wise evolution of three metrics across six models (Pythia 410M/6.9B, LLaMA3 8B, Qwen2 7B, Gemma 7B, Bloom 1.7B): (1) the matrix-based entropy $H(X^{(\ell)})$, (2) the sink rate $\mathrm{SinkRate}_0^{(\ell)}$, and (3) the BOS token norm $\|x_0^{(\ell)}\|$. We compute these metrics for all 7.5K training examples in GSM8K (Cobbe et al., 2021), plotting the mean and standard deviation at each layer.

Figure 2: The coupled emergence of massive activations, compression, and sinks develops early in training and persists. Evolution of normalized entropy (left) , sink rate (middle) , and BOS norm (right) across training checkpoints (1-143k steps) for Pythia 410M. All three phenomena emerge together around step 1k and remain synchronized throughout training, indicating this organization is learned early.


We observe that all three patterns align precisely. When the BOS norm spikes to factors of $10^3$-$10^4$ (typically layers 0-5, depending on model depth), entropy simultaneously drops and sink rates surge. We compute the Pearson correlation between the change in BOS norm and entropy, obtaining $r = -0.9 \pm 0.18$ across models, while BOS norm and sink rate correlate at $r = 0.58 \pm 0.25$.
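As a minimal illustration of this layer-wise correlation analysis, the snippet below correlates the layer-to-layer change in BOS norm with the change in entropy on synthetic traces shaped like Figure 1; the numbers are invented for the toy and are not the paper's measurements:

```python
import numpy as np

layers = np.arange(24)
# Invented traces with the qualitative shape of Figure 1: the BOS norm
# spikes in middle layers while entropy drops there (toy numbers only).
middle = (layers >= 5) & (layers <= 20)
bos_norm = 1.0 + 1e3 * middle
entropy = 1.0 - 0.95 * middle

# Correlate the layer-to-layer *changes* of the two traces.
d_norm, d_ent = np.diff(bos_norm), np.diff(entropy)
r = np.corrcoef(d_norm, d_ent)[0, 1]
print(round(r, 2))  # → -1.0 (perfectly anti-correlated in this toy)
```

On real models the relationship is noisier, hence the reported $r = -0.9 \pm 0.18$ rather than a perfect correlation.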

We highlight that this synchronization is remarkably consistent. While sink rates vary with prompt content, the layer index where these phenomena emerge is fixed for each model, and the massive activation is deterministic. For instance, in Pythia 410M, the transition consistently occurs at layer 5 regardless of input, suggesting an architectural rather than input-dependent mechanism. We show how these transitions emerge during training in Figure 2. We point the reader to Appendix B.1 for details on experiments, larger models, and a note on GPT OSS (Agarwal et al., 2025).

Theoretical Framework: Massive Activations Imply Compression

We now prove that massive activations necessarily induce the observed compression. Consider the representation matrix $X \in \mathbb{R}^{T \times d}$ with rows $\{x_i\}_{i=0}^{T-1}$, where $x_0$ denotes the BOS token.

Theorem 1 (Massive Activations Induce Spectral Dominance). Let $M = \|x_0\|^2$, $R = \sum_{i \neq 0} \|x_i\|^2$, and let $\theta_i$ be the angle between $x_0$ and $x_i$. Define the alignment term $\alpha = \frac{1}{R} \sum_{i \neq 0} \|x_i\|^2 \cos^2 \theta_i \in [0, 1]$. Then:

$$
\sigma_1^2 \;\geq\; M + \alpha R,
$$

where $\sigma_1$ is the largest singular value of $X$.

Proof Sketch. By the variational characterization of singular values, $\sigma_1^2 = \max_{\|v\|=1} \|Xv\|^2$. Choosing $v = x_0/\|x_0\|$ and expanding $\|Xv\|^2$ yields the bound. Full proofs and discussion in Appendix A.2.
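Spelling out the expansion, using the definitions of $M$, $R$, and $\alpha$ above and the fact that the rows of $X$ are the $x_i$:

$$
\sigma_1^2 = \max_{\|v\|=1} \|Xv\|^2 \;\geq\; \left\| X \frac{x_0}{\|x_0\|} \right\|^2
= \sum_{i=0}^{T-1} \frac{\langle x_i, x_0 \rangle^2}{\|x_0\|^2}
= \|x_0\|^2 + \sum_{i \neq 0} \|x_i\|^2 \cos^2 \theta_i
= M + \alpha R .
$$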

This theorem has immediate consequences for compression metrics:

Corollary 2 (Compression Bounds). Let $c = M/R$ be the norm ratio and $p = (c + \alpha)/(c + 1)$. Then:

  1. Dominance: $\sigma_1^2 / \sum_{j \geq 2} \sigma_j^2 \geq (c + \alpha)/(1 - \alpha)$
  2. Anisotropy: $p_1 \geq p$
  3. Entropy: $H(X) \leq -p \log p - (1 - p) \log(1 - p) + (1 - p) \log(r - 1)$

Meaning of the results. Theorem 1 shows that two factors control the rise of $\sigma_1^2$: (i) the magnitude $M$ of the massive activation, and (ii) the alignment $\alpha$ of the other rows with $x_0$. Full alignment makes $X$ rank one (with $\sigma_1^2(X) = M + R = \|X\|_F^2$), while even a small $\alpha$ suffices to grow $\sigma_1^2$ when $M$ is large. The corollaries give lower bounds on dominance and anisotropy in terms of $(c, \alpha)$, so increasing $c$ (a stronger gap between activations) or increasing $\alpha$ (stronger alignment) provably widens the spectral gap. Consequently, the singular-value entropy is tightly upper-bounded by the mass in the top component, so $c \gg 1$ or $\alpha \to 1$ drives $H(X)$ towards zero.
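The bounds can be sanity-checked numerically on a synthetic matrix with a single dominant row, a stand-in for a massive BOS activation (sizes and scales here are arbitrary choices, and entropy is in nats):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 32, 64
X = rng.standard_normal((T, d))
X[0] *= 50.0                                  # synthetic "massive" BOS row

x0, rest = X[0], X[1:]
M = x0 @ x0                                   # M = ||x_0||^2
R = np.sum(rest**2)                           # R = sum of other squared norms
c = M / R                                     # norm ratio
alpha = np.sum((rest @ x0)**2) / (M * R)      # alignment term, in [0, 1]

s = np.linalg.svd(X, compute_uv=False)
q = s**2 / np.sum(s**2)
p1 = q[0]                                     # anisotropy
H = -np.sum(q * np.log(q))                    # singular-value entropy

p = (c + alpha) / (c + 1)
r = min(T, d)
H_bound = -p*np.log(p) - (1-p)*np.log(1-p) + (1-p)*np.log(r - 1)

assert s[0]**2 >= (M + alpha * R) * (1 - 1e-9)   # Theorem 1
assert p1 >= p * (1 - 1e-9)                       # Corollary 2 (anisotropy)
assert H <= H_bound + 1e-9                        # Corollary 2 (entropy)
```

With the dominant row scaled by 50, the norm ratio $c$ is large and all three inequalities hold with the entropy bound nearly tight, mirroring Figure 3.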

Tightness of the bounds in practice. When massive activations create growing norm ratios $c$, these bounds become tight. Figure 3 compares our theoretical bounds against empirical measurements for Pythia 410M across all layers. In early layers where massive activations are absent, the bounds are loose, as expected, because the theory only constrains the intermediate layers. However, in middle layers where massive activations emerge, the bounds become nearly exact, with predicted and observed values overlapping within measurement error. This tightness reveals that massive activations are the dominant mechanism shaping representation geometry: when $\|x_0\|$ becomes massive, the representation matrix effectively becomes rank-one plus small perturbations, exactly as our theory predicts.

Figure 3: Our theoretical bounds become exact when massive activations emerge, proving they drive compression. Left: When the BOS norm dominates (layers 5-15), the first singular value $\sigma_1^2$ approximately equals both $\|x_{\mathrm{BOS}}\|^2$ and $\|X\|_F^2$, confirming near rank-one structure. Right: Our entropy upper bound (Corollary 2) tightly matches empirical values in compressed layers, validating that massive activations mathematically necessitate compression. Average across 100 RedPajama (Weber et al., 2024) examples for Pythia 410M.


Evidence from Targeted Ablations

To empirically isolate the exact role of massive activations, we perform targeted ablations: zeroing the MLP's contribution to the BOS token at the layers where massive activations emerge. Specifically, we set $x_{\mathrm{BOS}}^{(\ell+1)} \leftarrow x_{\mathrm{BOS}}^{(\ell)} + \mathrm{Attn}^{(\ell)}(x_{\mathrm{BOS}})$, removing only the $\mathrm{MLP}^{(\ell)}(x_{\mathrm{BOS}})$ term.
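A minimal sketch of this intervention on a toy residual block (single head, no layer norm; an illustrative stand-in under our own simplifying assumptions, not the actual model code), where `ablate_bos_mlp` zeroes only the MLP's contribution to position 0:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16

def attn(X):
    """Toy single-head causal self-attention (illustrative stand-in)."""
    scores = X @ X.T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ X

W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1

def layer(X, ablate_bos_mlp=False):
    X = X + attn(X)                      # attention sublayer with residual
    m = np.maximum(X @ W1, 0.0) @ W2     # MLP sublayer output
    if ablate_bos_mlp:
        m[0] = 0.0                       # drop only MLP(x_BOS); keep x_BOS + Attn(x_BOS)
    return X + m

X = rng.standard_normal((T, d))
out_full = layer(X)
out_ablated = layer(X, ablate_bos_mlp=True)

# The intervention touches only the BOS row; every other token is unchanged.
assert np.allclose(out_full[1:], out_ablated[1:])
```

In a real model the same effect would be achieved with a forward hook on the MLP output at the chosen layers, masking position 0 before the residual addition.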

We find that ablating massive activations can eliminate both phenomena, confirming our theoretical conjecture. In LLaMA3 8B, removing the MLP's contribution at the first layer prevents the entropy drop (entropy remains at 0.4-0.5 bits vs. dropping to 0.02 bits), eliminates sink formation (sink rate drops from 0.85-1.0 to 0.0), and keeps the BOS norm within $2\times$ of other tokens (vs. $10^2\times$ normally) (Fig. 4). Similar results hold across models, as seen in Fig. 14 in Appendix B.1. Some models (Pythia 410M, Qwen2 7B) develop massive activations in more than one stage. Ablating any single stage partially reduces compression; ablating all stages eliminates it entirely, suggesting a cumulative contribution. However, for Pythia 410M, ablations remove compression but do not remove attention sinks, which suggests that sink formation may have model-dependent causes.

Why Middle Layers? We hypothesize that the mid-depth concentration of sinks and compression reflects how Transformers allocate computation across depth: early layers perform broad contextual mixing, while later layers become increasingly aligned with next-token prediction (Lad et al., 2024). Freed from these pressures, middle layers can develop extreme features that regulate token-to-token sharing and induce compression, consistent with a mid-network shift toward refinement (Csordás et al., 2025). In the next section, we show how massive activations predict, and precisely characterize, three stages of information flow.

Mix-Compress-Refine: A Theory of Information Flow

Key Insight: Transformers organize computation into three distinct phases demarcated by massive activations. Early layers mix information broadly to build context. Middle layers compress representations and halt mixing when massive activations emerge, preventing over-smoothing while maintaining essential information. Late layers equalize norms and switch to local positional attention, refining representations for task-specific outputs.

Building on our mechanistic understanding of massive activations, we now present a broader theory of how transformers organize computation across depth. We propose that information processing occurs in three distinct phases, demarcated by the emergence and dissipation of massive activations.

Figure 4: Removing massive activations eliminates both compression and attention sinks, confirming causality. Ablating the MLP contribution to the BOS token at layer 0 in LLaMA3 8B has three effects: (Left) Entropy remains at ~0.5 bits instead of dropping to 0.02, showing decompression. (Middle) Sink rate stays at 0 throughout depth, confirming no attention sink formation. (Right) BOS norm (orange) remains comparable to the rest of the tokens (grey) instead of spiking by $10^3\times$. This causal intervention validates that massive activations drive both phenomena.


Phase 1: Information Mixing (Layers 0-20%)

In the early layers of the model, we observe diffuse attention patterns, enabled by the absence of massive activations. This allows the model to mix high-dimensional token representations for a few layers, building contextual representations through broad information integration. An example of such an attention head in the early layers is shown in Figure 6.

To quantify mixing in attention heads, we define the Mixing Score as the average row entropy of the attention matrices: $\frac{1}{T} \sum_{i=1}^{T} H(A_{i,:}^{(\ell,h)})$. Across models, we find that early layers consistently maintain mixing scores above 0.7, confirming active token mixing, before dropping sharply when massive activations emerge. Notably, the extent of this mixing phase varies: from just the first layer in some models to approximately 20% of network depth in others. However, its qualitative characteristics remain consistent. We plot this metric across models and layers in Figure 18 in Appendix B.2.
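The Mixing Score can be sketched on toy attention matrices as follows; we normalize by $\log T$ so scores lie in $[0, 1]$ (an assumed convention) and ignore causal masking for simplicity:

```python
import numpy as np

def row_entropy(p, eps=1e-12):
    """Shannon entropy of one attention row (a distribution over keys)."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def mixing_score(A):
    """Average row entropy of a (T x T) attention matrix, normalized to
    [0, 1] by the maximum possible entropy log T."""
    T = A.shape[0]
    return np.mean([row_entropy(A[i]) for i in range(T)]) / np.log(T)

T = 8
uniform = np.full((T, T), 1.0 / T)           # diffuse attention: heavy mixing
sink = np.zeros((T, T)); sink[:, 0] = 1.0    # all mass on BOS: no mixing

print(round(mixing_score(uniform), 2))  # → 1.0
print(round(mixing_score(sink), 2))     # → 0.0
```

Diffuse early-layer heads sit near the top of this range, while sink-dominated middle-layer heads collapse toward zero.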

We believe that this initial mixing stage is deliberately limited to prevent over-mixing and representational collapse that would occur with extended uniform attention (Barbero et al., 2024; 2025a), analogous to over-smoothing in graph neural networks (Arroyo et al., 2025). This controlled, brief mixing phase establishes the semantic foundation that subsequent phases refine. The model captures both local token dependencies and global context, creating rich representations that can be selectively compressed and refined in later phases.

Phase 2: Compression and Halted Mixing (Layers 20--85%)

The middle phase begins abruptly with the emergence of massive activations, typically on the BOS token. As established in Section 3.2, these massive activations necessarily induce representational compression, as well as attention sink formation (Gu et al., 2025). This phase serves as a computational shut-off switch for mixing: the attention sinks act as approximate 'no-ops' (Bondarenko et al., 2023). By attending to BOS tokens with near-zero value norms, heads effectively skip their contribution while preserving the residual stream.

Figure 5: Middle-layer sinks adapt to input complexity while early mixing remains constant. (Left) BOS sink scores for a high-sink prompt in Pythia 410M, showing strong attention to BOS in layers 5-20. (Middle) Difference in BOS sink scores between high-sink and low-sink prompts, revealing input-dependent variation concentrated in middle layers. (Right) Difference in mixing scores between the same prompts, showing near-zero variation in early layers. This demonstrates that Phase 1 performs fixed mixing regardless of input, while Phase 2 compression dynamically adjusts sink strength based on prompt complexity.

In these middle layers, the model refines information through the compressed residual stream, where a few dominant directions preserve high-level context while discarding redundancies. This aligns with the depth-efficiency perspective of Csordás et al. (2025), who show that mid-layers contribute less to shaping future tokens and more to stabilizing current representations. In Section 5, we show that performance on generation tasks tends to improve mostly in the latter half of this second phase. We hypothesize this lag reflects the time mid-layer MLPs need to process and consolidate the compressed signal before yielding token-level refinements. We do not treat this as a separate phase, however, since it is not cleanly demarcated by the emergence or dissipation of massive activations.

Sink behavior in middle to late layers adapts to input complexity. Mid- to late-layer sinks are input-dependent 'dormant heads', often inert but active on specific prompts (Guo et al., 2024). Reproducing Sandoval-Segura et al. (2025) on 20K FineWeb-Edu prompts (Lozhkov et al., 2023), we compare the 'top' prompts (with the strongest sink scores) and the 'bottom' prompts (with the weakest). As shown in Figure 5, sink strength diverges in the middle and last layers. This provides evidence that while middle layers default to sink-like behavior that limits mixing, sink strength in this phase varies depending on the prompt.

Phase 3: Selective Refinement (Layers 85--100%)

In the final phase, the model reverses the compression bottleneck through norm equalization and a fundamental shift in attention patterns.

Norm equalization drives decompression. In this phase, we find that the BOS norm plateaus or decreases while the average norm of the remaining tokens rises sharply, driving them toward similar magnitudes (right panel in Figure 4 and Figure 15 in Appendix B.1). This equalization begins before the full phase transition: the average norm starts rising around 40-60% depth, preparing for the eventual shift. The massive activation ratio drops from >10³ to <10, removing the mathematical basis for compression and allowing representations to re-expand.

Attention shifts to positional patterns. As massive activations dissipate, we observe heads transition from sink-dominated to position-based patterns. In particular, we observe the emergence of identity heads (i → i), previous-token heads (i → i−1), and other sharp attention patterns, where by sharp we mean highly localized attention. Figure 6 shows an example of such a pattern in the Pythia 410M model. We find that sharp positional patterns are especially common in RoPE-based models, consistent with recent work (Barbero et al., 2025b) showing that RoPE induces frequency-selective structure that favors the emergence of such heads. We provide empirical evidence of this in Appendix B.2. In particular, when measuring the mixing rate of attention patterns, only models without RoPE revert to higher mixing in later layers, whereas RoPE-based models consistently transition toward sharp positional attention.
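The identity- and previous-token patterns described above can be flagged with a simple diagnostic (a sketch; the mass thresholds one would apply to call a head "sharp" are our choice, not taken from the paper):

```python
import numpy as np

def positional_profile(A):
    """Average attention mass on the diagonal (i -> i) and first subdiagonal
    (i -> i-1) of a causal attention matrix A (T x T). High diagonal mass
    suggests an identity head; high subdiagonal mass a previous-token head."""
    T = A.shape[0]
    diag = float(np.trace(A)) / T
    prev = float(np.diag(A, k=-1).sum()) / max(T - 1, 1)
    return diag, prev

# A perfect previous-token head (token 0 can only attend to itself).
T = 8
prev_head = np.zeros((T, T))
prev_head[0, 0] = 1.0
for i in range(1, T):
    prev_head[i, i - 1] = 1.0

diag_mass, prev_mass = positional_profile(prev_head)
```

For this head, `prev_mass` is 1.0 while `diag_mass` stays small, so thresholding these two quantities separates previous-token heads from identity heads.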

We believe this phase serves three computational purposes. First, norm equalization reduces BOS dominance so content tokens can meaningfully influence the residual pathway and receive token-specific refinements. Second, attention shifts to sharp (often positional) heads that perform selective mixing, focusing on a few task-relevant tokens and writing their features into the residual. As late layers re-expand capacity, these signals can be represented

Figure 6: Attention patterns transform from diffuse mixing to sinks to positional focus across depth. Evolution of attention patterns in Pythia 410M showing representative heads at layers 0, 16, and 23. Early layers exhibit diffuse attention enabling broad information mixing. Middle layers show sink patterns that halt mixing. Late layers display sharp positional patterns for selective refinement.

distinctly rather than squeezed by the mid-layer bottleneck. In parallel, identity/near-diagonal heads curb mixing without defaulting to sinks, where their non-zero value writes act as local signal boosters, in contrast to BOS sinks that effectively zero out updates. Third, bringing token norms to smaller, comparable scales likely improves numerical stability for the unembedding. Notably, models tend to equalize by boosting content-token norms rather than fully collapsing the BOS norm, preserving a modest global bias while enabling precise, content-driven refinements.

Implications for Downstream Performance

Key Insight: Different tasks achieve peak performance at different phases of the Mix-Compress-Refine organization. Embedding tasks peak during Phase 2's compression, benefiting from lower-dimensional spaces. Generation tasks improve monotonically through all phases, requiring Phase 3's refinement for accurate next-token prediction. Multiple-choice reasoning shows flat performance until mid-depth, suggesting it needs both compression and subsequent refinement. This explains why studies reach different conclusions about 'optimal' layers: they are measuring fundamentally different computational objectives.

Prior work (Skean et al., 2025) found that mid-layer representations perform strongly, particularly on embedding benchmarks, and linked this effect to the mid-depth compression valley. In this section, we broaden the picture by evaluating both embedding and generation tasks, relating their depthwise performance to the three-stage framework introduced above.

The Distinct Performance Patterns Across Tasks

Generation improves monotonically through all phases. We begin by evaluating intermediate layers across multiple model families and sizes using LogitLens (Nostalgebraist, 2020) on WikiText-2 (Merity et al., 2016). We observe a steady perplexity decline with depth, from >10⁴ in early layers to 10-25 at full depth, as shown in Figure 7 (left). We notice little gain in the very early layers (consistent with an embedding-formation stage) and continued refinement through mid-depth. Across several models, the sharpest improvements occur in Phase 3, where norm equalization and positional/identity heads enable token-specific refinements, which appear to be crucial for next-token prediction tasks.

We further test the same set of models on multiple-choice question-answering tasks. We evaluate on ARC Easy, ARC Challenge, HellaSwag, and WinoGrande (Clark et al., 2018; Zellers et al., 2019; Sakaguchi et al., 2021) via LogitLens and the LM Evaluation Harness (Gao et al., 2024) with zero-shot learning. Figure 7 (middle) shows the results for ARC Easy, and results for the rest of the datasets can be found in Figure 24, Appendix B.3. For sufficiently large models, accuracies remain largely flat until roughly 40-60% depth and then rise sharply. This suggests that, for generation-aligned tasks, compression alone (Phase 2) is not sufficient; gains emerge only toward the end of Phase 2 (after sufficient residual refinement) and continue into Phase 3, where norm equalization and positional/identity heads enable token-specific updates.

In short, in generation settings, the late Phase 2 to Phase 3 transition is pivotal, aligning with the observation in Csordás et al. (2025) of a mid-network phase change from future-token to current-token computation, providing independent validation that Phase 2 and Phase 3 serve distinct computational roles.

Embedding tasks peak in middle layers. To highlight the difference between generation and embedding tasks, we implement the following linear probing experiment. For each dataset, we encode the examples with the frozen backbone and extract hidden states at every layer. We then train a linear classifier at each layer on the training split and evaluate it on the test split. We probe the same multiple-choice QA benchmarks as before and, additionally, a sentence classification dataset (SST-2; Socher et al. 2013), assessing how much task-relevant information is linearly accessible at different depths. Figure 7 (right) shows the results for ARC Easy, highlighting the difference in optimal depths with the corresponding generation task, while full results for this experiment are shown in Fig. 26, Appendix B.3. Moreover, we reproduce Skean et al. (2025) on 32 MTEB tasks (Muennighoff et al., 2022) across a broader set of larger, decoder-only models. Across the board, we find that performance peaks consistently at 25-75% relative depth, outperforming early/late layers by 10-20% and precisely aligning with Phase 2, where compression is strongest (see Fig. 27 in

Figure 7: Embedding tasks peak during compression while generation requires full refinement, revealing distinct computational objectives. (Left, Middle) Perplexity on Wikitext-2 and multiple-choice QA accuracy in ARC Easy via LogitLens generally do not improve significantly until ∼ 50% depth, then decreases/rises steadily through Phase 3. (Right) Linear probe test accuracy on the same task peaks at 25-75% depth (Phase 2) and declines thereafter. This divergence demonstrates that embedding-relevant features concentrate in compressed middle layers, while generation tasks require full-depth for token-specific predictions.

Appendix B.3). These results align with evidence that next-token pretraining does not uniformly benefit perception-style classification (Balestriero & Huang, 2024). Together, they underscore task dependence: classification-relevant linear features concentrate in intermediate layers, whereas late layers are repurposed for token-specific generative refinement. Furthermore, the pattern suggests that massive activations in the residual pathway not only curb over-mixing via sink formation but also act as a mechanism the model uses to compress information.
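The layerwise probing protocol can be sketched as follows on toy data, with a closed-form ridge probe standing in for the linear classifier (whose exact form and training details we do not assume), and synthetic "hidden states" standing in for real model activations:

```python
import numpy as np

def probe_accuracy(H_train, y_train, H_test, y_test, lam=1e-2):
    """Train a ridge-regression linear probe on frozen hidden states from one
    layer and return test accuracy. Labels y_* are binary in {0, 1}."""
    X = np.hstack([H_train, np.ones((len(H_train), 1))])   # bias column
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]),
                        X.T @ (2 * y_train - 1))           # targets in {-1, +1}
    Xt = np.hstack([H_test, np.ones((len(H_test), 1))])
    return float(((Xt @ w > 0).astype(int) == y_test).mean())

# Toy layerwise sweep: synthetic states from a hypothetical 3-layer model
# in which the middle layer carries the most linearly separable signal.
rng = np.random.default_rng(0)
n, d = 200, 16
y = rng.integers(0, 2, size=n)
signal = (2 * y - 1)[:, None].astype(float)
layers = [rng.normal(size=(n, d)),                  # layer 0: pure noise
          rng.normal(size=(n, d)) + 3.0 * signal,   # layer 1: strong signal
          rng.normal(size=(n, d)) + 0.2 * signal]   # layer 2: weak signal
accs = [probe_accuracy(H[:150], y[:150], H[150:], y[150:]) for H in layers]
```

On this toy sweep the middle "layer" probes best, mirroring the mid-depth peak reported for real models.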

Why Do Different Tasks Need Different Phases?

We believe these findings clarify which tasks actually benefit from compression. In particular, embedding-style objectives (such as clustering, retrieval, classification, bitext mining, etc.) gain from Phase 2's compression because they target low-dimensional structure while discarding irrelevant information, echoing classic arguments on the benefits of information bottlenecks and compressed representations (Shwartz-Ziv et al., 2018; Kawaguchi et al., 2023). This picture aligns with evidence that LLMs produce surprisingly linear (and often linearly separable) embeddings (Razzhigaev et al., 2024; Marks & Tegmark, 2023). In particular, when features concentrate in a low-dimensional subspace, linear probing, semantic retrieval, and related embedding tasks become easier. Moreover, such linear structure has been linked to the emergence of high-level semantic concepts (Park et al., 2023), reinforcing our hypothesis on why mid-layer compressed states tend to work well for non-generative evaluations.

By contrast, generation and reasoning require capacity that compressed states alone cannot provide. Performance improves the most once Phase 3 norm equalization restores higher entropy, and positional heads/MLPs can refine token-specific details, which is also when we observe the models being most confident about their predictions (see Fig. 23 in Appendix B.3). In this way, the model makes use of the compressed and refined representation from Phase 2, which has captured high-level ideas and semantic concepts, and expands this into higher-dimensional space to perform token-level refinements in Phase 3.

This reconciles two views: compressed mid-layers suit embedding benchmarks, whereas next-token-prediction-aligned tasks benefit from full-depth processing. Practically, 'optimal layer' selection should match phase to objective, suggesting phase-aware early exiting (Schuster et al., 2022) as a potentially promising design choice.

Conclusion

In this work, we revisited two puzzling phenomena in decoder-only Transformers, attention sinks and compression valleys. We began with the observation that attention sinks, compression valleys, and massive activations all emerge at the same time in language models. We then proved that a single high-norm token necessarily induces a dominant singular value, yielding low matrix-entropy and high anisotropy, and we bounded these effects quantitatively.

Building on this, we proposed a Mix-Compress-Refine theory of depth-wise computation in LLMs. In particular, we show that early layers mix broadly through diffuse attention, middle layers compress and curb mixing via attention sinks, and late layers re-equalize norms and apply sharp positional heads for selective refinement. The boundaries between these phases are marked by the appearance and later disappearance of massive activations in depth. We use this organization to clarify downstream task behavior: while embedding-style tasks peak in compressed mid-layers, generation improves through late refinement and benefits from full depth.

We see this framework as a step toward a more mechanistic account of how LLMs allocate computation across depth. We hope these insights help connect head-level mechanisms with representation geometry, ultimately guiding more efficient and controllable LLM designs.

Proofs

Architecture

In this work, we study decoder-only Transformers (Radford et al., 2018), which employ causal masking in attention and constitute the dominant architecture in today's large language models (Gemma Team et al., 2024; Dubey et al., 2024). We follow the notation of Barbero et al. (2024), but we importantly also consider a model with H ≥ 1 attention heads:

$$
x_i^{(\ell+1)} = z_i^{(\ell)} + \mathrm{MLP}^{(\ell)}\big(z_i^{(\ell)}\big),
\qquad
z_i^{(\ell)} = x_i^{(\ell)} + \sum_{h=1}^{H} W_O^{(\ell,h)} \sum_{j \le i} \alpha_{ij}^{(\ell,h)}\, W_V^{(\ell,h)} x_j^{(\ell)},
$$

$$
\alpha_{ij}^{(\ell,h)} = \operatorname*{softmax}_{j \le i}\!\left( \frac{\big\langle W_Q^{(\ell,h)} x_i^{(\ell)},\; W_K^{(\ell,h)} x_j^{(\ell)} \big\rangle}{\sqrt{d/H}} \right),
$$

(normalization layers omitted for brevity), where we also denote by A^{(ℓ,h)} the attention matrices given by A^{(ℓ,h)}_{ij} = α^{(ℓ,h)}_{ij}. The causal masking translates into A^{(ℓ,h)} being lower-triangular, and the row-wise softmax implies row-stochasticity.

Theoretical results

This section includes the proofs of the statements of Section 3.2, where we show that massive activations imply the dominance of a singular value. One can obtain a weaker version of the bound focused only on the massive activation (no alignment terms), which entails weaker bounds for the spectral metrics. The following lemma serves as a proof of the fact that σ_1^2(X) = max_{‖v‖=1} ‖Xv‖^2.

Lemma 3. Let A be a real symmetric n × n matrix with (real) eigenvalues λ_max = λ_1 ≥ λ_2 ≥ … ≥ λ_n = λ_min, and let

$$
R(A, x) = \frac{x^\top A x}{x^\top x}
$$

denote the Rayleigh quotient for A and a real non-zero vector x ∈ R^n. Then R(A, x) ∈ [λ_min, λ_max], achieving each bound at the corresponding eigenvectors v_min, v_max.

Proof. Let A = Q^⊤ Λ Q be the diagonalization of A in the eigenbasis given by the v_i, and let y = Qx, so that x = Q^⊤ y since Q is orthogonal (i.e., Q^⊤ = Q^{-1}). Then,

$$
R(A, x) = \frac{x^\top A x}{x^\top x} = \frac{y^\top \Lambda y}{y^\top y} = \frac{\sum_i \lambda_i y_i^2}{\sum_j y_j^2} = \sum_i w_i \lambda_i.
$$

Since the weights w_i = y_i^2 / ∑_j y_j^2 satisfy w_i ≥ 0 and ∑_i w_i = 1, R(A, x) is a convex combination of the eigenvalues, and therefore λ_min ≤ R(A, x) ≤ λ_max, with equality when x = v_max or v_min.
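Lemma 3 is easy to verify numerically (a sanity sketch on random symmetric matrices, not part of the paper):

```python
import numpy as np

# The Rayleigh quotient of a symmetric matrix lies in [lambda_min, lambda_max]
# for every nonzero x, and attains the bounds at the eigenvectors.
rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
A = (B + B.T) / 2                 # symmetrize
lams, V = np.linalg.eigh(A)       # ascending eigenvalues, orthonormal eigenvectors

def rayleigh(A, x):
    return float(x @ A @ x / (x @ x))

samples = [rayleigh(A, rng.normal(size=6)) for _ in range(1000)]
```

Every sampled quotient falls inside the eigenvalue interval, and plugging in the extreme eigenvectors recovers the endpoints.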

We now prove that the emergence of massive activations in some layers directly implies that the first singular value dominates the distribution, which translates into extreme values for anisotropy and matrix-based entropy. The intuition is that entropy and anisotropy are representation-only properties: they depend solely on the singular-value spectrum of the representation matrix X, whose rows are token-wise representations {x_i} and whose columns are features. A massive activation means that one row, say x_0, carries a disproportionately large norm M = ‖x_0‖^2 compared to the rest of the token representations, sometimes orders of magnitude larger. Let v = x_0/‖x_0‖ be the direction of x_0, and notice we can always write X = e_1 x_0^⊤ + Y = ‖x_0‖ e_1 v^⊤ + Y, where Y contains the rest of the representations. If M is large compared to ‖Y‖_F^2, then X is effectively a rank-one matrix plus a small perturbation, and we would expect σ_1^2(X) ≈ M and v to be close to the first right singular vector. This is exactly the mechanism exploited by PCA (Maćkiewicz & Ratajczak, 1993): the first principal component points in the direction that explains the largest variance, and a massive activation creates such a dominant variance direction by construction. Therefore, even before formal bounds, we should expect σ_1^2 to dominate whenever (i) the norm ratio c = ‖x_0‖^2 / ∑_{i≠0} ‖x_i‖^2 is large or (ii) the remaining rows {x_i}_{i≠0} are measurably aligned with x_0. The next result formalizes this intuition.

Theorem 4. Let M = ‖x_0‖^2, R = ∑_{i≠0} ‖x_i‖^2, and let θ_i be the angle between x_0 and x_i. Define the alignment term α = (1/R) ∑_{i≠0} ‖x_i‖^2 cos^2 θ_i ∈ [0, 1]. Then:

$$
\sigma_1^2(X) \ge M + \alpha R.
$$

Proof. By the variational characterization of the largest singular value (see also Lemma 3),

$$
\sigma_1^2(X) = \max_{\|v\|=1} \|Xv\|^2 \ge \left\| X \frac{x_0}{\|x_0\|} \right\|^2 = \sum_i \frac{\langle x_i, x_0 \rangle^2}{\|x_0\|^2} = \|x_0\|^2 + \sum_{i \ne 0} \frac{\langle x_i, x_0 \rangle^2}{\|x_0\|^2}.
$$

Using ⟨x_i, x_0⟩^2 = ‖x_0‖^2 ‖x_i‖^2 cos^2 θ_i, we obtain

$$
\sigma_1^2(X) \ge \|x_0\|^2 + \sum_{i \ne 0} \|x_i\|^2 \cos^2 \theta_i = M + \sum_{i \ne 0} \|x_i\|^2 \cos^2 \theta_i.
$$

Since αR = ∑_{i≠0} ‖x_i‖^2 cos^2 θ_i, we get σ_1^2 ≥ M + αR, which is the desired result.

As mentioned in the main text, Theorem 4 makes precise how two independent factors govern the rise of σ_1^2: (i) the magnitude M of the massive activation, and (ii) the alignment α of the remaining rows with x_0. If the representations were totally aligned, then X would indeed be rank one and would have a single nonzero singular value given by σ_1^2(X) = M + R = ‖X‖_F^2. Conversely, even with small α (say, when token representations are unaligned or even orthogonal), a large norm M suffices to grow σ_1^2. Empirically, we observe the term ‖x_0‖^2 making the most impact in our analysis, as it is orders of magnitude larger than the remaining norms; however, keeping the alignment term is also important for the following results.
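Theorem 4 and the roles of M, R, and α can be checked numerically by planting a massive row in a random matrix (a sketch; the scale factor 100 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 32, 64
X = rng.normal(size=(T, d))
X[0] *= 100.0  # plant a massive activation on the first row (BOS analogue)

M = float(X[0] @ X[0])                      # squared norm of the massive row
rest = X[1:]
norms_sq = np.sum(rest ** 2, axis=1)
R = float(norms_sq.sum())
cos2 = (rest @ X[0]) ** 2 / (norms_sq * M)  # squared cosines with x_0
alpha = float(np.sum(norms_sq * cos2) / R)  # alignment term, in [0, 1]
c = M / R

sigma1_sq = float(np.linalg.svd(X, compute_uv=False)[0] ** 2)
p1 = sigma1_sq / float(np.sum(X ** 2))      # anisotropy
```

The computed top singular value satisfies the Theorem 4 lower bound, and the anisotropy satisfies the Corollary 6 bound (c + α)/(c + 1).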

We move on to proving Corollary 2, which we split into three corollaries in this section.

Corollary 5 (Singular value dominance). In the setting of Theorem 4, let c = M/R. Then:

$$
\frac{\sigma_1^2}{\sum_{j \ge 2} \sigma_j^2} \ge \frac{c + \alpha}{1 - \alpha}.
$$

Proof. From Theorem 4, σ_1^2 ≥ M + αR. Moreover, ∑_{j≥2} σ_j^2 = ‖X‖_F^2 − σ_1^2 ≤ ‖X‖_F^2 − (M + αR) = R − αR = (1 − α)R. Therefore one gets:

$$
\frac{\sigma_1^2}{\sum_{j \ge 2} \sigma_j^2} \ge \frac{M + \alpha R}{(1 - \alpha) R} = \frac{c + \alpha}{1 - \alpha}.
$$

Corollary 6 (Anisotropy). Let p_1 = σ_1^2 / ‖X‖_F^2 denote the anisotropy. In the setting of Theorem 4,

$$
p_1 \ge \frac{c + \alpha}{c + 1}.
$$

Proof. By Theorem 4 and ‖X‖_F^2 = M + R,

$$
p_1 = \frac{\sigma_1^2}{\|X\|_F^2} \ge \frac{M + \alpha R}{M + R} = \frac{c + \alpha}{c + 1}.
$$

As mentioned in the main text, Corollaries 5 and 6 lower-bound the dominance ratio and anisotropy using only (c, α). Thus, either increasing c (stronger massive activation) or increasing α (stronger alignment) provably inflates the spectral gap. In both cases, having perfect alignment with x_0 or having ‖x_0‖^2 grow relative to the rest forces extreme values. If α → 1, then (c + α)/(1 − α) → ∞ and (c + α)/(c + 1) → 1, intuitively because only one direction remains relevant in the data. The same result holds as the massive activation grows, c → ∞. Notice that c is the ratio between the massive activation and the rest; c therefore increases when x_0 grows in norm, but also when the remaining representations have low norm.

Corollary 7 (Shannon matrix-based entropy). Let p_j := σ_j^2 / ‖X‖_F^2 denote the normalized distribution of singular values of X. Let H(X) := −∑_{j=1}^r p_j log p_j be the Shannon entropy of this distribution, and let p := (c + α)/(c + 1). Then, we have the following bound:

$$
H(X) \le -p \log p + (1 - p) \log \frac{r - 1}{1 - p}.
$$

Proof. Let

$$
H(X) = -p_1 \log p_1 - \sum_{j=2}^{r} p_j \log p_j,
$$

so we need to bound the second term, which is the entropy of r − 1 terms adding up to 1 − p_1 ≤ 1 − p. This term would be maximized if the mass were equally distributed, that is, p_j = (1 − p_1)/(r − 1) ≤ (1 − p)/(r − 1). Therefore, one gets

$$
-\sum_{j=2}^{r} p_j \log p_j \le (1 - p_1) \log \frac{r - 1}{1 - p_1} \le (1 - p) \log \frac{r - 1}{1 - p}.
$$

For the first term, since −x log x is decreasing on [1/e, 1] and p_1 ≥ p (with p ≥ 1/e in the regime of interest),

$$
-p_1 \log p_1 \le -p \log p.
$$

The result is obtained by combining these two bounds.

For fixed top mass p_1 ≥ p, entropy is maximized when the remaining mass 1 − p is spread uniformly over the other r − 1 singular values; the bound above is exactly that maximum. Consequently, any additional structure in the tail (e.g., a second spike) will lower the true entropy beneath this upper bound. Notice that for c → ∞ or α → 1, we have p → 1 and the upper bound approaches 0.
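As a numerical sanity check, planting a large row drives the matrix-based entropy toward 0 and keeps it below an upper bound of the form −p log p + (1 − p) log((r − 1)/(1 − p)) with p = (c + α)/(c + 1) (a sketch; the bound's exact form follows our reading of Corollary 7, and the row scale is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_and_bound(scale):
    """Matrix-based entropy of a random X with a massive first row,
    together with the claimed upper bound in terms of c and alpha."""
    X = rng.normal(size=(16, 32))
    X[0] *= scale
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    p_j = s2 / s2.sum()
    H = float(-np.sum(p_j * np.log(p_j)))
    norms_sq = np.sum(X[1:] ** 2, axis=1)
    M, R = float(X[0] @ X[0]), float(norms_sq.sum())
    cos2 = (X[1:] @ X[0]) ** 2 / (norms_sq * M)
    alpha = float(np.sum(norms_sq * cos2) / R)
    p = (M / R + alpha) / (M / R + 1)
    r = len(s2)
    bound = float(-p * np.log(p) + (1 - p) * np.log((r - 1) / (1 - p)))
    return H, bound

H_big, bound_big = entropy_and_bound(100.0)
```

With a 100× massive row, the measured entropy is tiny and sits below the bound, as the corollary predicts.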

Limitations of this analysis. In the theoretical analysis conducted above, we only considered one massive activation, placed on the BOS token. In practice, models may exhibit more than one massive activation (Sun et al., 2024). In this case, the c term shrinks and the bounds become more permissive. We believe this poses no problem to our overall message and that this analysis can be extended: one can take the first n tokens to be the massive activations and decompose X = ∑_{i=0}^{n−1} e_i x_i^⊤ + Y, such that the first summand has rank at most n and Y is a small perturbation in comparison, leading to small entropy (effective rank ≤ n); this also holds for longer context lengths.


Additional results

Experimental details. All experiments were implemented in PyTorch on NVIDIA A100 GPUs with 40GB of memory, or NVIDIA H100 GPUs with 80GB when memory requirements were higher. We examined pretrained models of varying depths from HuggingFace repositories, using Transformers and Transformer-Lens (Nanda & Bloom, 2022). When running large datasets to collect metrics such as sink rates and norms, prompts were truncated to a maximum length of 4096 tokens for the FineWeb-Edu experiment (Fig. 5) and 1024 for the GSM8K experiment (Fig. 1), as the latter required singular value decompositions to compute the entropy. LogitLens experiments on multiple-choice question tasks were run with LM-Evaluation-Harness (Gao et al., 2024), implementing our own model wrapper to output hidden states at each layer instead of only the final ones.

Pearson Correlations. To assess the dynamical relationship between BOS norm, matrix-based entropy, and BOS sink rate across layers, we computed correlations on their layerwise changes. For each model and metric, the trajectory across layers was first z-scored, and we then defined the delta at layer ℓ as the difference with respect to the preceding layer,

$$
\Delta \tilde{m}_\ell = \tilde{m}_\ell - \tilde{m}_{\ell-1},
$$

where m̃_ℓ denotes the z-scored value at layer ℓ of a metric m ∈ {b, e, s} (BOS norm, entropy, sink rate).

This procedure emphasizes abrupt layerwise changes rather than absolute values, which is crucial because the BOS norm often exhibits sharp spikes that coincide with collapses in entropy and the subsequent emergence of attention sinks. We then measured Pearson correlation coefficients between Δb̃_ℓ and Δẽ_ℓ (BOS norm vs. entropy, same layer) and between Δb̃_ℓ and Δs̃_{ℓ+1} (BOS norm vs. sink rate, lagged by one layer). Correlations were computed separately per model and summarized across models by Fisher z-transform averaging, reporting the mean correlation and the standard deviation across models.
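On synthetic trajectories, the procedure looks as follows (toy, single-model values standing in for real measurements; the Fisher z-transform averaging across models is omitted):

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def delta(x):
    """First differences of the z-scored trajectory across layers."""
    return np.diff(zscore(x))

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

# Toy trajectories with a transition at layer 5.
L = 24
bos_norm = np.ones(L); bos_norm[5:] = 1e3   # BOS norm spikes
entropy = np.ones(L); entropy[5:] = 0.02    # entropy collapses at the same layer
sink = np.zeros(L); sink[6:] = 1.0          # sink rate follows one layer later

r_entropy = pearson(delta(bos_norm), delta(entropy))        # same-layer
r_sink = pearson(delta(bos_norm)[:-1], delta(sink)[1:])     # one-layer lag
```

Here the BOS-norm deltas correlate strongly negatively with the entropy deltas and strongly positively with the lagged sink-rate deltas, the signature pattern the analysis is designed to detect.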

Limitations. We outline some limitations of our work. Our analysis focuses on decoder-only Transformers and primarily attributes both sinks and compression to BOS-centered massive activations; models with alternative positional schemes, attention sparsity patterns, or special-token conventions (e.g., no explicit BOS token, sinks at different positions, or ALiBi encodings) may exhibit different dynamics. Our causal claims rest on targeted MLP ablations on selected layers and model families; however, we observe model-dependent exceptions (e.g., sinks persisting despite decompression). Lastly, the theory assumes a single massive row, whereas real models may feature multiple interacting massive activations. However, as discussed in Appendix A.2, we believe this poses no harm to the overall message: a few massive activations would push the representations to a lower-dimensional subspace, but not necessarily of dimension 1.


In this paper, we study decoder-only transformers with L layers, hidden dimension d, and H attention heads per layer. For a sequence of T tokens, let x_i^{(ℓ)} ∈ R^d denote token i's representation at layer ℓ, and X^{(ℓ)} ∈ R^{T×d} the full representation matrix. Attention weights α^{(ℓ,h)}_{ij} from token i to j in head h satisfy causal masking (α_{ij} = 0 for j > i). The full architecture is provided in Appendix A.1.

Key Metrics. For a representation matrix X with singular values σ_1 ≥ σ_2 ≥ … ≥ σ_r, we measure compression via the matrix-based entropy:

$$
H(X) = -\sum_{j=1}^{r} p_j \log p_j, \qquad p_j = \frac{\sigma_j^2}{\|X\|_F^2}.
$$

Low entropy indicates compression into few dominant directions. The anisotropy p_1 = σ_1^2 / ‖X‖_F^2 measures directional bias (Razzhigaev et al., 2023): values near 1 indicate extreme bias, and values near 1/r indicate isotropy. For token position k, the attention sink score and sink rate (Gu et al., 2025) are:

$$
\mathrm{Sink}^{(\ell,h)}(k) = \frac{1}{T - k + 1} \sum_{i=k}^{T} \alpha_{ik}^{(\ell,h)},
\qquad
\mathrm{SinkRate}^{(\ell)}(k) = \frac{1}{H} \sum_{h=1}^{H} \mathbb{I}\!\left[ \mathrm{Sink}^{(\ell,h)}(k) > \tau \right],
$$

with threshold τ = 0.3 (unless otherwise stated), where I denotes the indicator function. We focus on the BOS token, the primary sink across models.
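These metrics can be computed directly from hidden states and attention maps; a minimal sketch of our reading of them (the query-averaging convention in the sink score is our assumption):

```python
import numpy as np

def matrix_entropy(X):
    """Shannon entropy of the normalized squared singular values of X."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    p = s2 / s2.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def anisotropy(X):
    """p_1 = sigma_1^2 / ||X||_F^2."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    return float(s2[0] / s2.sum())

def sink_rate(A, k=0, tau=0.3):
    """Fraction of heads whose mean attention to token k exceeds tau.
    A: (H, T, T) stack of causal attention matrices for one layer."""
    scores = A[:, k:, k].mean(axis=1)   # per-head mean attention to token k
    return float((scores > tau).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 64))
X[0] *= 1e3  # massive activation on the BOS row -> compression

T, H = 16, 4
diffuse = np.tril(np.ones((T, T)))
diffuse /= diffuse.sum(axis=1, keepdims=True)
A_diffuse = np.stack([diffuse] * H)     # uniform causal attention
A_sink = np.zeros((H, T, T))
A_sink[:, :, 0] = 1.0                   # every query attends to BOS
```

A massive row drives the entropy toward 0 and the anisotropy toward 1, while the sink rate cleanly separates sink-dominated from diffuse attention at τ = 0.3.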

Attention Sinks. Attention heads mysteriously focus on semantically uninformative tokens (e.g., BOS) across diverse models and scales (Xiao et al., 2024). While Barbero et al. (2025a) argues they prevent over-mixing, Cancedda (2024) relates them to spectral subspaces, and Gu et al. (2025) traces emergence to pretraining, no work has yet examined their depth-wise organization.

Compression Valleys. Transformer representations compress dramatically in middle layers, where the matrix-based entropy drops significantly before recovering (Skean et al., 2025). This universal pattern coincides with increased anisotropy (p_1 > 0.9) (Razzhigaev et al., 2023) and near-linear layer transitions (Razzhigaev et al., 2024). Paradoxically, these compressed representations excel at embedding tasks despite their reduced dimensionality. The mechanism has remained unknown, with only information-bottleneck hypotheses lacking causal evidence (Skean et al., 2025).

Massive Activations. Sun et al. (2024) identified extremely large-magnitude features in transformer residual streams, with individual neurons exceeding typical activations by factors of 10³-10⁶. These 'massive activations' consistently appear on delimiter and special tokens (particularly

Figure 1: Attention sinks and compression valleys emerge simultaneously when BOS tokens develop massive activations. Normalized entropy (left), BOS sink rate (middle), and BOS token norm (right) across layers for six models evaluated on GSM8K. All three phenomena align precisely: when BOS norms spike by factors of 10³-10⁴ (right panel), entropy drops below 0.5 bits (left) and sink rates surge to near 1.0 (middle), confirming our unified mechanism hypothesis.

the BOS token), acting as input-agnostic biases. Furthermore, Sun et al. (2024) found a link between the emergence of massive activations and attention sinks, which was reinforced in Barbero et al. (2025b) and Yona et al. (2025). However, none of these works links this phenomenon to representational structure or a unified theory of information flow in LLMs.

Computation Across Depth in Transformers. Several works have sought to understand the evolution of representations in Transformer-based models from a theoretical perspective. Dong et al. (2021) proved that the repeated application of self-attention leads to rank collapse in simplified settings without residual connections. Geshkovski et al. (2023) analyze self-attention dynamics and show that tokens cluster over depth. Wu et al. (2024) studied how layer normalization and attention masks affect information propagation, finding that normalization can prevent rank decay. Other empirical work examines intermediate layer outputs directly. We highlight LogitLens (Nostalgebraist, 2020) and TunedLens (Belrose et al., 2023), which decode hidden states using the model's unembedding matrix and an affine probe per layer, respectively. Furthermore, Lad et al. (2024) measure sensitivity to deletion and swap interventions across layers and argue for distinct stages of depth-wise inference. Most recently, Csordás et al. (2025) argue that deeper LLMs underutilize their additional layers, with later layers mainly refining probability distributions rather than composing novel computations. While these studies illuminate layerwise behavior, none provides a unified mechanism that explains why stages form in depth or predicts when they should appear.

The Gap This Work Addresses. Thus far, attention sinks have been tied to massive activations while compression has remained a separate observation without a causal mechanism. In this work, we document the synchronized dynamics of these phenomena, showing that the same massive activations that create sinks are also the main driver of compression. Building on their co-emergence, we propose a three-stage theory in which residual-stream norms simultaneously regulate mixing in attention heads and compression in representations. Finally, we connect the mechanism to downstream behavior, distinguishing between embedding-style and generation tasks.


Large Language Models (LLMs) have become remarkably capable, yet how they process information through their layers remains poorly understood. Two phenomena have particularly puzzled researchers: attention sinks, where attention heads mysteriously collapse their focus onto semantically uninformative tokens (Xiao et al., 2024), and compression valleys, where intermediate representations show unexpectedly low entropy despite the model's high-dimensional space (Skean et al., 2025).

These phenomena appear paradoxical: why would powerful models waste attention on meaningless tokens, and why would representations compress in the middle of processing? Previous work has explained attention sinks through positional biases (Gu et al., 2025) and over-mixing prevention (Barbero et al., 2025a), while compression valleys have been explained through an information bottleneck theory (Skean et al., 2025). However, the precise reasons why they emerge remain unclear and no formal link has been established between them.

We reveal that attention sinks and compression valleys are two manifestations of a single mechanism: massive activations in the residual stream. These extremely large-magnitude features emerge on specific tokens (typically the beginning-of-sequence token, BOS), create both effects simultaneously, and act as input-agnostic biases. While prior work linked massive activations to attention sinks (Sun et al., 2024; Cancedda, 2024), we prove they also drive compression: when a single token's norm dominates others, it necessarily creates a dominant singular value in the representation matrix, explaining the compression. Experiments across several models (410M-120B parameters) confirm this unified mechanism, connecting both phenomena via the massive activations.

∗ Denotes equal first authorship. Correspondence to alvaro.arroyo@eng.ox.ac.uk and enrique.queipodellanoburgos@reuben.ox.ac.uk

This unified mechanism reveals how transformers organize computation across depth through the Mix-Compress-Refine framework: massive activations control three distinct processing phases. Early layers (0-20% depth) mix information broadly via diffuse attention. Middle layers (20-85%) compress representations while halting mixing through attention sinks, both triggered by massive activation emergence. Late layers (85-100%) selectively refine through localized attention as norms equalize. This phase structure explains task-specific optimal depths: embedding tasks peak during compression, while generation requires full refinement.

Our contributions are:

· We empirically demonstrate that attention sinks and compression valleys emerge simultaneously in middle layers across several models (410M-120B parameters).
· We prove that massive activations mathematically require compression, providing tight bounds on entropy reduction and singular value dominance (Theorem 1).
· We validate causality through targeted ablations: removing massive activations eliminates compression and reduces attention sinks.
· We propose the Mix-Compress-Refine theory of information flow, explaining how transformers organize computation into three distinct phases.
· We show this framework helps resolve the puzzle of task-dependent optimal depths: embedding tasks peak during compression while generation requires full refinement.

Broader Analysis of Models

In this section, we provide broader validation of our three-phase theory across model families and model sizes. Moreover, we expand on the empirical measurements of metrics from our theoretical analysis and on the ablation of MLPs, and provide two notes on the specifics of the GPT OSS model and Gemma 7B.

Figure 8: Entropy, sink rate and BOS norm for the Pythia family of models.


Validation on more model families and large models. To further validate our Mix-Compress-Refine theory, we observe the emergence of compression, attention sinks and massive activations in the Pythia model family (Fig. 8) and in very large models (70B-120B), specifically LLaMA3 70B, Qwen2 72B and GPT OSS 120B (Fig. 9). The prompt is a single GSM8K example. GPT OSS' particular sink patterns are explained later in this section. We believe this showcases that the observed correlations are a universal phenomenon in LLMs.

Training Checkpoints. We evaluated the training dynamics of the Pythia 410M/6.9B/12B models across multiple checkpoints (steps 1, 1k, 2k, 4k, 8k, 10k, 20k, 30k, and 143k). At each checkpoint, after every layer we recorded the entropy, the BOS sink rate (threshold τ = 0.3) and the norm of the BOS token representation. The prompt was a single GSM8K prompt: 'Janet's ducks lay 16 eggs...' Figures 10 and 11 illustrate the results for the Pythia 6.9B and 12B models, respectively.

Visualizations of theoretical results. We provide plots with the bounds from the theoretical discussion in Section 3.2. Figures 13 and 12 show these values for LLaMA3 8B and Pythia 410M. We show (1) the terms M = ‖x_BOS‖², αR and M + R = ‖X‖²_F from Theorem 1, (2) the top 3 singular values σᵢ² and the sum ∑_{i≥1} σᵢ², and (3,4,5) the dominance, anisotropy and entropy bounds from Corollary 2. In all cases, we observe the bounds being tight in the middle layers. In this regime, the first singular value σ₁² closely follows the trajectory of ‖x_BOS‖² and dominates the rest of the singular values. The dominance decreases steadily, especially towards the second half of the network, indicating preparation for the next-token prediction of Phase 3.

MLP ablations. We further run the targeted MLP ablations on more models to erase the appearance of the massive activation. For LLaMA3 8B, we ablate layer 0; for Qwen2 7B, we ablate layers

Figure 13: Theoretical bounds for LLaMA3 8B.


3 and 4, and for Pythia 410M, we ablate layers 5-7. The results are shown in Figure 14. Interestingly, removing the massive activation always decompresses the representations; however, in Pythia 410M it does not remove the attention sinks, which might be explained by the many architectural differences between these models.

Norm equalizations. Across models, we find that the average norm of the remaining tokens (excluding BOS) grows monotonically with depth, while the BOS norm grows abruptly with the massive activation, remains constant in the middle layers and drops at the last layers. Figure 15 illustrates this process for three models. As the norms of the remaining tokens approach the BOS norm, the dominance of the first singular value weakens, allowing representations to decompress.

Figure 15: LLMs equalize token norms towards the end of the network.


A note on GPT OSS. In the GPT OSS (Agarwal et al., 2025) family of models, each attention head is equipped with a learnable sink logit that allows it to divert probability mass away from real tokens, effectively providing a 'skip' option. However, unlike the explicit (k′, v′) bias formulation studied in Sun et al. (2024); Gu et al. (2025), GPT OSS does not include a learnable value sink token. This means the model cannot encode bias information directly through the sink, and we hypothesize that it instead continues to rely on massive activations at the BOS token to implement bias-like behavior and generate compression. This explains why BOS sink patterns are still observed, particularly in the middle layers (see Figure 16). The alternating spikes across layers may be a consequence of GPT OSS' alternating dense and locally banded sparse attention pattern: in layers with local attention windows, heads are less able to access BOS, while in subsequent dense layers BOS becomes globally visible again, producing the observed oscillatory sinkness.

The particular case of Gemma 7B. Even though Gemma 7B follows the same dynamics we have discussed in this work, how it achieves them differs from the rest. Token norms in Gemma 7B start very high; instead of increasing the BOS norm to create a massive activation, Gemma 7B decreases the norms of the remaining tokens to create the disparity needed for compression, then re-equalizes by increasing their norms in late layers. We attribute the initially high norms to the embedding layer, as there are no other components that can account for them. We believe this is also why attention patterns in Gemma 7B look somewhat different from the rest, with identity heads emerging both at the early and later layers. Figure 17 illustrates this. Pre- means before each layer, while post- means after each layer.


This section includes the proofs of the statements of Section 3.2, where we show that massive activations imply the dominance of a singular value. One can obtain a weaker version of the bound focused only on the massive activation (no alignment terms), which entails weaker bounds for the spectral metrics. The following lemma serves as a proof of the fact that σ₁²(X) = max_{‖v‖=1} ‖Xv‖².

Lemma 3. Let A be a real symmetric n × n matrix with (real) eigenvalues λ_max = λ₁ ≥ λ₂ ≥ … ≥ λ_n = λ_min and let

$$
R(A, x) = \frac{x^\top A x}{x^\top x}
$$

denote the Rayleigh Quotient for A and a real non-zero vector x ∈ ℝⁿ. Then R(A, x) ∈ [λ_min, λ_max], achieving each bound at the corresponding eigenvectors v_min, v_max.

Proof. Let A = Q⊤ΛQ be the diagonalization of A in the eigenbasis given by the vᵢ, and let y = Qx, so that x = Q⊤y since Q is orthogonal (i.e. Q⊤ = Q⁻¹). Then,

$$
R(A, x) = \frac{x^\top A x}{x^\top x} = \frac{y^\top \Lambda y}{y^\top y} = \frac{\sum_i \lambda_i y_i^2}{\sum_j y_j^2} = \sum_i w_i \lambda_i.
$$

Since the weights wᵢ = yᵢ² / ∑ⱼ yⱼ² satisfy wᵢ ≥ 0 and ∑ᵢ wᵢ = 1, R(A, x) is a convex linear combination of the eigenvalues and therefore λ_min ≤ R(A, x) ≤ λ_max, with equality at x = v_max and x = v_min respectively.
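As a quick numerical sanity check of Lemma 3, the sketch below (a toy symmetric matrix of our own choosing; not code from the paper) verifies that the Rayleigh quotient stays within the spectrum and attains its bounds at the extreme eigenvectors:

```python
import numpy as np

def rayleigh(A, x):
    """Rayleigh quotient R(A, x) = x^T A x / x^T x."""
    return float(x @ A @ x) / float(x @ x)

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2  # a random real symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1]

# R(A, x) lies in [lambda_min, lambda_max] for arbitrary x...
for _ in range(100):
    x = rng.standard_normal(5)
    r = rayleigh(A, x)
    assert lam_min - 1e-9 <= r <= lam_max + 1e-9

# ...and attains the bounds at the extreme eigenvectors.
assert np.isclose(rayleigh(A, eigvecs[:, 0]), lam_min)
assert np.isclose(rayleigh(A, eigvecs[:, -1]), lam_max)
```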

We now prove that the emergence of massive activations in some layers directly implies that the first singular value dominates the distribution, which translates into extreme values for anisotropy and matrix-based entropy. The intuition is that entropy and anisotropy are representation-only properties: they depend solely on the singular-value spectrum of the representation matrix X, whose rows are token-wise representations {xᵢ} and whose columns are features. A massive activation means that one row, say x₀, carries a disproportionately large norm M = ‖x₀‖² compared to the rest of the token representations, sometimes orders of magnitude larger. Let v = x₀/‖x₀‖ be the direction of x₀; notice we can always write X = e₁x₀⊤ + Y = ‖x₀‖ e₁v⊤ + Y, where Y contains the rest of the representations. If M is large compared to ‖Y‖²_F, then X is effectively a rank-one matrix plus a small perturbation, and we would expect σ₁²(X) ≈ M and v to be close to the first right singular vector. This is exactly the mechanism exploited by PCA (Maćkiewicz & Ratajczak, 1993): the first principal component points in the direction that explains the largest variance, and a massive activation creates such a dominant variance direction by construction. Therefore, even before formal bounds,


we should expect σ₁² to dominate whenever (i) the norm ratio c = ‖x₀‖² / ∑_{i≠0} ‖xᵢ‖² is large or (ii) the remaining rows {xᵢ}_{i≠0} are measurably aligned with x₀. The next result formalizes this intuition.
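This rank-one-plus-perturbation intuition is easy to verify numerically. The sketch below (toy dimensions and a planted high-norm row standing in for a massive activation; not actual model activations) shows the top squared singular value tracking M and dominating the tail:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 64, 128          # tokens x features (toy sizes)

# Ordinary token representations: unit-scale random rows.
Y = rng.standard_normal((T, d))
Y[0] = 0.0              # row 0 reserved for the "BOS" token

# Plant a massive activation: one row with a far larger norm.
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
X = Y.copy()
X[0] = 100.0 * v        # ||x_0||^2 = 10^4, orders above a typical row

M = np.linalg.norm(X[0]) ** 2
s = np.linalg.svd(X, compute_uv=False)

# The top squared singular value tracks the massive norm...
assert s[0] ** 2 >= M
assert s[0] ** 2 / M < 1.2
# ...and dominates the rest of the spectrum combined.
assert s[0] ** 2 / np.sum(s[1:] ** 2) > 1.0
```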

Theorem 4. Let M = ‖x₀‖², R = ∑_{i≠0} ‖xᵢ‖², and let θᵢ be the angle between x₀ and xᵢ. Define the alignment term α = (1/R) ∑_{i≠0} ‖xᵢ‖² cos²θᵢ ∈ [0, 1]. Then:

$$
\sigma_1^2(X) \;\ge\; M + \alpha R.
$$

Proof. By definition of the singular value (also see Lemma 3),

$$
\sigma_1^2(X) = \max_{\|v\|=1} \|Xv\|^2 \;\ge\; \left\| X \frac{x_0}{\|x_0\|} \right\|^2 = \frac{1}{\|x_0\|^2} \sum_i \langle x_i, x_0 \rangle^2.
$$

Using ⟨xᵢ, x₀⟩² = ‖x₀‖² ‖xᵢ‖² cos²θᵢ, we obtain

$$
\sigma_1^2(X) \;\ge\; \frac{1}{\|x_0\|^2} \Big( \|x_0\|^4 + \sum_{i \neq 0} \|x_0\|^2 \|x_i\|^2 \cos^2\theta_i \Big) = \|x_0\|^2 + \sum_{i \neq 0} \|x_i\|^2 \cos^2\theta_i.
$$

Since αR = ∑_{i≠0} ‖xᵢ‖² cos²θᵢ, we get σ₁² ≥ M + αR, which is the desired result.

As mentioned in the main text, Theorem 4 makes precise how two independent factors govern the rise of σ₁²: (i) the magnitude M of the activations, and (ii) the alignment α of the remaining rows with x₀. If representations were totally aligned, then X would indeed be rank one and would have a single nonzero singular value given by σ₁²(X) = M + R = ‖X‖²_F. Conversely, even with small α (say, when token representations are not aligned or even orthogonal), a large norm M suffices to grow σ₁². Empirically, we observe the term ‖x₀‖² making the most impact in our analysis, as we know it will be orders of magnitude larger than the remaining norms; however, keeping the alignment term is also important for the following results.
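The bound of Theorem 4 can be checked directly on synthetic data. In the sketch below, `theorem4_lower_bound` is a helper name of our own (not from the paper's code), and a planted high-norm row plays the role of the BOS massive activation:

```python
import numpy as np

def theorem4_lower_bound(X):
    """Return (sigma_1^2, M + alpha * R) from Theorem 4, treating row 0
    of X as the massive-activation token x_0."""
    x0, rest = X[0], X[1:]
    M = float(x0 @ x0)                 # M = ||x_0||^2
    norms_sq = np.sum(rest ** 2, axis=1)
    R = float(norms_sq.sum())          # R = sum over i != 0 of ||x_i||^2
    # alpha * R = sum_i ||x_i||^2 cos^2(theta_i) = sum_i <x_i, x_0>^2 / M
    alpha = float(np.sum((rest @ x0) ** 2) / (M * R))
    sigma1_sq = np.linalg.svd(X, compute_uv=False)[0] ** 2
    return sigma1_sq, M + alpha * R

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 64))
X[0] *= 50.0  # plant a massive activation on token 0

s1_sq, bound = theorem4_lower_bound(X)
assert s1_sq >= bound           # the bound of Theorem 4 holds...
assert s1_sq <= 1.05 * bound    # ...and is tight when one row dominates
```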

We move on to proving Corollary 2, which we split into three results in this section.

Corollary 5 (Dominance). In the setting of Theorem 4, let c = M/R. Then
$$
\frac{\sigma_1^2}{\sum_{j \ge 2} \sigma_j^2} \;\ge\; \frac{M + \alpha R}{(1 - \alpha) R} = \frac{c + \alpha}{1 - \alpha}.
$$

Proof. From Theorem 4, σ₁² ≥ M + αR. Moreover, ∑_{j≥2} σⱼ² = ‖X‖²_F − σ₁² ≤ ‖X‖²_F − (M + αR) = R − αR = (1 − α)R. Therefore one gets:

$$
\frac{\sigma_1^2}{\sum_{j \ge 2} \sigma_j^2} \;\ge\; \frac{M + \alpha R}{(1 - \alpha) R} = \frac{c + \alpha}{1 - \alpha}.
$$

Corollary 6 (Anisotropy). Let p₁ = σ₁²/‖X‖²_F denote the anisotropy. In the setting of Theorem 4,
$$
p_1 \;\ge\; \frac{M + \alpha R}{M + R} = \frac{c + \alpha}{c + 1}.
$$

Proof. By Theorem 4 and ‖X‖²_F = M + R,
$$
p_1 = \frac{\sigma_1^2}{\|X\|_F^2} \;\ge\; \frac{M + \alpha R}{M + R} = \frac{c + \alpha}{c + 1}.
$$

As mentioned in the main text, Corollaries 5 and 6 lower-bound the dominance ratio and anisotropy using only (c, α). Thus, either increasing c (stronger massive activation) or increasing α (stronger alignment) provably inflates the spectral gap. In both cases, having perfect alignment with x₀ or letting ‖x₀‖² grow relative to the rest forces extreme values. If α → 1, then (c + α)/(1 − α) → ∞ and (c + α)/(c + 1) → 1, intuitively because only one direction becomes relevant in the data. Moreover, as the massive activation grows, c → ∞ and the same result holds. Notice that c is the ratio between the massive activation and the rest of the norms; therefore c increases by letting x₀ grow in norm, but also by letting the rest of the representations have low norm.

Corollary 7 (Shannon matrix-based entropy). Let pⱼ := σⱼ²/‖X‖²_F denote the normalized distribution of squared singular values of X. Let H(X) := −∑_{j=1}^r pⱼ log pⱼ be the Shannon entropy of this distribution. Let p := (c + α)/(c + 1). Then, we have the following bound:

$$
H(X) \;\le\; -p \log p + (1 - p) \log \frac{r - 1}{1 - p}.
$$

Proof. Let
$$
H(X) = -p_1 \log p_1 \;-\; \sum_{j=2}^{r} p_j \log p_j,
$$

so we need to bound the second term, which is the entropy of r − 1 terms adding up to 1 − p₁ ≤ 1 − p. This term would be maximised if the mass were equally distributed, that is, pⱼ = (1 − p₁)/(r − 1) ≤ (1 − p)/(r − 1). Therefore, one gets

$$
-\sum_{j=2}^{r} p_j \log p_j \;\le\; -(1 - p_1) \log \frac{1 - p_1}{r - 1} \;\le\; (1 - p) \log \frac{r - 1}{1 - p}.
$$

Moreover, since x ↦ −x log x is decreasing on [1/e, 1] and p₁ ≥ p (with p ≥ 1/e in the massive-activation regime of interest),
$$
-p_1 \log p_1 \;\le\; -p \log p.
$$

The result is obtained by combining these two bounds.

For fixed top mass p₁ ≥ p, entropy is maximized when the remaining mass 1 − p is spread uniformly over the other r − 1 singular values; the bound above is exactly that maximum. Consequently, any additional structure in the tail (e.g., a second spike) will lower the true entropy beneath this upper bound. Notice that for c → ∞ or α → 1, we have p → 1 and the upper bound approaches 0.
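A minimal numerical check of Corollary 7, with helper names (`spectral_entropy`, `entropy_upper_bound`) of our own choosing and a planted massive row standing in for the BOS token:

```python
import numpy as np

def spectral_entropy(X):
    """Shannon entropy of the normalized squared singular values of X."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    p = s2 / s2.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def entropy_upper_bound(X):
    """Corollary 7 bound: H(X) <= -p log p + (1 - p) log((r - 1)/(1 - p)),
    with p = (c + alpha)/(c + 1), treating row 0 as the massive token."""
    x0, rest = X[0], X[1:]
    M = float(x0 @ x0)
    norms_sq = np.sum(rest ** 2, axis=1)
    R = float(norms_sq.sum())
    alpha = float(np.sum((rest @ x0) ** 2) / (M * R))
    c = M / R
    p = (c + alpha) / (c + 1.0)
    r = min(X.shape)  # rank of X is at most min(T, d)
    return float(-p * np.log(p) + (1 - p) * np.log((r - 1) / (1 - p)))

rng = np.random.default_rng(2)
X = rng.standard_normal((32, 64))
X[0] *= 50.0  # massive activation => low spectral entropy

assert spectral_entropy(X) <= entropy_upper_bound(X)
```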

Limitations of this analysis. In the theoretical analysis conducted above, we only considered one massive activation, placed on the BOS token. In practice, models may exhibit more than one massive activation (Sun et al., 2024); in this case, our c term would make the bounds more permissive. We believe this poses no problem for our overall message and that the analysis can be extended: one can suppose the first n tokens carry the massive activations and decompose X = ∑_{i=0}^{n−1} eᵢ xᵢ⊤ + Y, such that the first summand has rank at most n and Y is a small perturbation in comparison, leading to small entropy (effective rank ≤ n); this also holds for longer context lengths.

MLP ablations.

To isolate the exact role of massive activations empirically, we perform targeted ablations: zeroing the MLP's contribution to the BOS token at layers where massive activations emerge. Specifically, we set x_BOS^(ℓ+1) ← x_BOS^(ℓ) + Attn^(ℓ)(x_BOS), removing only the MLP^(ℓ)(x_BOS) term.
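As a toy illustration of this intervention (the attention and MLP functions below are stand-ins of our own, not real transformer sublayers), the BOS-only MLP ablation can be sketched as:

```python
import numpy as np

def layer_forward(x, attn_fn, mlp_fn, ablate_bos_mlp=False):
    """One residual update with an optional ablation that zeroes the
    MLP's contribution to the BOS token (row 0) only."""
    h = x + attn_fn(x)          # x + Attn(x)
    mlp_out = mlp_fn(h)
    if ablate_bos_mlp:
        mlp_out = mlp_out.copy()
        mlp_out[0] = 0.0        # drop MLP(x_BOS); other tokens keep theirs
    return h + mlp_out

# Toy stand-ins: the "MLP" writes a huge spike onto the BOS row,
# mimicking how a massive activation emerges.
def toy_attn(x):
    return 0.01 * x.mean(axis=0, keepdims=True) * np.ones_like(x)

def toy_mlp(h):
    out = 0.01 * h
    out[0] += 1000.0  # massive activation injected at BOS
    return out

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 16))

norm_bos = np.linalg.norm(layer_forward(x, toy_attn, toy_mlp)[0])
norm_bos_ablated = np.linalg.norm(
    layer_forward(x, toy_attn, toy_mlp, ablate_bos_mlp=True)[0])

assert norm_bos > 100 * norm_bos_ablated  # ablation suppresses the spike
```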

We find that ablating massive activations can eliminate both phenomena, confirming our theoretical conjecture. In LLaMA3 8B, removing the MLP's contribution at the first layer prevents the entropy drop (entropy remains at 0.4-0.5 bits vs. dropping to 0.02 bits), eliminates sink formation (the sink rate drops from 0.85-1.0 to 0.0) and keeps the BOS norm within 2× of other tokens (vs. 10²× normally) (Fig. 4). Similar results hold across models, as seen in Fig. 14 in Appendix B.1. Some models (Pythia 410M, Qwen2 7B) develop massive activations in more than one stage. Ablating any single stage partially reduces compression; ablating all stages eliminates it entirely, suggesting a cumulative contribution. However, for Pythia 410M, ablations remove compression but do not remove attention sinks, which suggests that the formation of sinks might have model-dependent causes.

Why Middle Layers? We hypothesize that the mid-depth concentration of sinks and compression reflects how Transformers allocate computation across depth: early layers perform broad contextual mixing, while later layers become increasingly aligned with next-token prediction (Lad et al., 2024). Freed from these pressures, middle layers can develop extreme features that regulate token-to-token sharing and induce compression, consistent with a mid-network shift toward refinement (Csordás et al., 2025). In the next section, we show how massive activations predict, and precisely characterize, three stages of information flow.



Mixing and Sink Detection Metrics

In this section we propose and study new metrics for quantifying mixing and 'sinkiness' in attention heads, and provide further validation on the FineWeb-Edu experiment from Section 4.2.

Mixing Score. Let A be a lower triangular, row-stochastic attention matrix. We define the Mixing Score as the average Shannon entropy of its rows, H_row = (1/T) ∑_{i=1}^T H(A_{i,:}) = −(1/T) ∑_{i=1}^T ∑_{j=0}^{i} A_{ij} log A_{ij}. Since each row of A is the output of a softmax, it is a probability distribution, so the score is well-defined. This captures how broadly each token attends to its preceding tokens. High values indicate the rows are close to the uniform distribution, suggesting broad mixing across tokens. Low values imply the rows are one-hot vectors,

Figure 17: Token norms in Gemma 7B. The BOS norm starts high since the beginning.


suggesting very localized mixing (sinks, identity or positional heads). Figure 18 (right) shows the Mixing Score across depth for a variety of models, showing how mixing abruptly decreases from 0.7-0.75 to 0.3-0.4 after the first few layers. Bloom 1.7B resumes mixing in the last phase because it is not capable of producing positional patterns, being the only one of these models without rotary positional embeddings (Barbero et al., 2025b).
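For concreteness, a minimal implementation of the Mixing Score (`mixing_score` is a helper name of our own, checked on idealized toy heads rather than real attention maps):

```python
import numpy as np

def mixing_score(A, eps=1e-12):
    """Average Shannon entropy of the rows of a (lower-triangular,
    row-stochastic) attention matrix A."""
    P = np.clip(A, eps, 1.0)
    # Mask zero entries so 0 * log(0) contributes nothing.
    row_entropies = -np.sum(np.where(A > 0, P * np.log(P), 0.0), axis=1)
    return float(row_entropies.mean())

T = 8
# Causal uniform attention: row i spreads mass evenly over tokens 0..i.
uniform = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]
# Pure BOS sink: every row is one-hot on column 0.
sink = np.zeros((T, T))
sink[:, 0] = 1.0

assert mixing_score(sink) == 0.0                   # one-hot rows: no mixing
assert mixing_score(uniform) > mixing_score(sink)  # diffuse rows mix broadly
```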

Figure 18: Left: ColSum Rate (τ = 0.3) across depth for different models. Right: Mixing Score across depth for different models, averaging across heads per layer. The ColSum Rate increases with the massive activations, similar to the BOS sink rate, while the Mixing Score abruptly decreases after the first few layers.


ColSum Concentration. Similar to the Mixing Score, the column sums c′ⱼ = ∑ᵢ A_{ij} capture how much attention is received by token j. We obtain a probability distribution by normalizing to cⱼ = c′ⱼ/∑ₖ c′ₖ = c′ⱼ/T, since ∑ₖ c′ₖ = ∑ᵢ ∑ⱼ A_{ij} = T because A is row-stochastic. Denote by H_col = −(log T)⁻¹ ∑ⱼ cⱼ log cⱼ ∈ [0, 1] the normalized entropy of this distribution. For consistency, we define the ColSum Concentration as C = 1 − H_col ∈ [0, 1]. High C means a few columns receive most of the mass (sink-like); low C means diffuse reception. When the sink is the BOS token, the ColSum Concentration is tightly coupled to the BOS sink score: for a single BOS-dominated head, C increases monotonically with the BOS score c₀ = (1/T) ∑ᵢ A_{i0} and is lower-bounded by the case where the remaining mass is spread uniformly across the other T − 1 columns. In that case, C_min(c₀) = 1 + (log T)⁻¹ (c₀ log c₀ + (1 − c₀) log((1 − c₀)/(T − 1))), and any additional concentration on non-BOS columns pushes C above this curve. Similar to the sink rate, we can define a ColSum Rate as the percentage of heads with ColSum Concentration above a certain threshold. Figure 18 (left) shows the ColSum Rate (τ = 0.3) for different models across depth, mirroring the Sink Rate's behavior. Moreover, the scatter plots in Figure 19 show the ColSum Concentration as a function of the BOS score for all heads in different models. As given by the bound, a high BOS score implies high C; these are pure BOS sinks. Points with high C but low c₀ reveal heads that sink to non-BOS tokens. In Pythia 410M, we observe such an outlier head, indicating a sink token different from BOS.
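A small sketch of the ColSum Concentration (`colsum_concentration` is a helper name of our own), checked on the two extreme head types:

```python
import numpy as np

def colsum_concentration(A, eps=1e-12):
    """C = 1 - H_col: one minus the normalized entropy of the column sums."""
    T = A.shape[0]
    c = A.sum(axis=0) / T  # column sums of a row-stochastic A sum to T
    h = -np.sum(np.where(c > 0, c * np.log(np.clip(c, eps, 1.0)), 0.0))
    return float(1.0 - h / np.log(T))

T = 8
sink = np.zeros((T, T))
sink[:, 0] = 1.0                                      # all mass on the BOS column
uniform = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]

assert np.isclose(colsum_concentration(sink), 1.0)    # maximal concentration
assert colsum_concentration(uniform) < colsum_concentration(sink)
```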

Figure 19: BOS score versus ColSum Concentration. The relationship with the BOS sink score indicates ColSum concentration is a good, token-agnostic alternative for sink detection.


Limitations of Mixing Score and ColSum Concentration. Each of our diagnostics highlights one axis of behavior while missing others. The ColSum Concentration C = 1 − H_col is effective at flagging sinks, where one column dominates, but it assigns zero score to identity heads and very low score to perfectly uniform heads. Conversely, the average row entropy H_row measures the sparsity of rows, distinguishing diffuse mixing from one-hot attention, but it cannot differentiate which sharp pattern occurs: sinks, identities, and previous-token heads all have similarly low row entropy. Thus neither metric alone fully separates the regimes of interest. In principle, one could combine them into a scalar Mix2D(α) = αC + (1 − α)H_row, where, for a suitable choice of α, sinks would map near 1, perfectly uniform heads near 0, and identities near 0.5. This would give a single axis interpolating between mixing, sinkness, and identity. In practice, however, we did not find this construction very informative and thus did not include it.

Sink-Versus-Identity Index. In Phase 3, we observe attention patterns changing to more localized, sharp ones. Some of these patterns include identity-like heads, previous-token heads and hybrid sink-identity heads. We quantify this transition using the sink-versus-identity index, defined as SVI = B/(B + D), where B = (1/T) ∑ᵢ A_{i0} is the average attention to BOS and D = (1/T) ∑ᵢ A_{ii} is the average diagonal attention, so that B + D = (1/T) ∑_{i=1}^T (A_{i0} + A_{ii}) ∈ [0, 1]. Figure 20 plots each head as a 2D point (SVI, B + D), with color corresponding to its layer. Early heads tend to have low B + D, indicating no attention is allocated to the BOS token nor to the identity. As depth progresses, heads move toward high B + D and high SVI, indicating strong sink presence. Moreover, the middle to late layers also tend to show identity patterns or sink-identity hybrid patterns.
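A minimal sketch of the index (`svi` is a helper name of our own) on idealized sink and identity heads:

```python
import numpy as np

def svi(A):
    """Sink-versus-identity index SVI = B / (B + D): B is the average
    attention to the BOS column, D the average diagonal attention."""
    T = A.shape[0]
    B = A[:, 0].mean()      # average attention received by BOS
    D = np.trace(A) / T     # average self-attention
    return float(B / (B + D))

T = 8
sink = np.zeros((T, T))
sink[:, 0] = 1.0            # pure BOS-sink head
identity = np.eye(T)        # pure identity head

assert svi(sink) > 0.8      # sink heads score near 1
assert svi(identity) < 0.2  # identity heads score near 0
```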

Figure 21: Left. BOS sink scores (top prompt, Bloom 1.7B). Middle. Top-bottom prompt difference in BOS sink score. Right. Top-bottom prompt difference in mixing score.


Further validation on FineWeb-Edu. Figures 21 and 22 show the FineWeb-Edu experiment for the Bloom 1.7B and Qwen2 7B models. The trend is clear: regardless of the input, the models do not allocate attention to the BOS token until the massive activation emerges. The number of sinks present in the middle layers is input-dependent; however, the amount of mixing performed in the early layers is not.


In this work, we revisited two puzzling phenomena in decoder-only Transformers: attention sinks and compression valleys. We began with the observation that attention sinks, compression valleys, and massive activations all emerge at the same time in language models. We then proved that a single high-norm token necessarily induces a dominant singular value, yielding low matrix-entropy and high anisotropy, and we bounded these effects quantitatively.

Building on this, we proposed a Mix-Compress-Refine theory of depth-wise computation in LLMs. In particular, we show that early layers mix broadly through diffuse attention, middle layers compress and curb mixing via attention sinks, and late layers re-equalize norms and apply sharp positional heads for selective refinement. The boundaries between these phases are marked by the appearance and later disappearance of massive activations in depth. We use this organization to clarify downstream task behavior: while embedding-style tasks peak in compressed mid-layers, generation improves through late refinement and benefits from full depth.

We see this framework as a step toward a more mechanistic account of how LLMs allocate computation across depth. We hope these insights help connect head-level mechanisms with representation geometry, ultimately guiding more efficient and controllable LLM designs.


Key Insight: Attention sinks and compression valleys are not separate phenomena but two consequences of massive activations in the residual stream. We prove theoretically that when BOS token norms exceed others, they necessarily create a dominant singular value, causing compression and coinciding with attention sinks. This unification reveals that a single mechanism controls both representation structure and attention in middle layers.


Additional results on downstream tasks

In this section, we provide more details on the experiments and results presented in Section 5 of the main text.

Figure 23: Language modeling requires full depth. Entropy of the output distribution at each layer.


Generation Tasks. We applied the LogitLens to WikiText-2 by passing each batch of tokenized blocks through a frozen backbone and, for every layer, projecting that layer's hidden states to vocabulary logits using the model's tied unembedding head. For each layer ℓ, we computed the next-token cross-entropy loss and perplexity (as shown in Figure 7 of the main text), as well as the mean token entropy of the softmaxed logits (Ali et al., 2025), as shown in Figure 23. We take this entropy as a proxy for the model's confidence over the next token, and we also observe that it decreases more rapidly towards Phase 3. In addition to next-token prediction, we extended the LogitLens evaluation to multiple-choice QA benchmarks (ARC Easy, ARC Challenge, HellaSwag, WinoGrande), where the model must select among a small set of candidate answers. For each layer, we applied the final layer norm and projected the embeddings with the tied unembedding head. We used the LM Evaluation Harness to score, recording the accuracies. This allows us to compare how representations at different depths support generation-style (next-token) and selection-style (multiple-choice) reasoning. Figure 24 shows that MCQ performance remains relatively flat through the compression valley of Phase 2 and begins improving at ∼50% of the network, underscoring that reasoning tasks require both compression and late-layer specialization. For completeness, we also ran the experiments with five-shot learning for each dataset. However, this only boosted the final accuracies and did not influence the overall behavior observed.
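The layerwise LogitLens metrics can be sketched as follows. This is a simplified stand-in on random tensors (all names are our own, the final layer-norm step is omitted, and the targets are assumed already shifted), not the evaluation code used in the paper:

```python
import numpy as np

def logit_lens_metrics(hidden_states, W_U, targets):
    """Per-layer LogitLens: project each layer's hidden states through the
    (tied) unembedding W_U, then compute next-token cross-entropy and the
    mean entropy of the softmaxed logits."""
    results = []
    for H in hidden_states:               # H: (T, d) hidden states at one layer
        logits = H @ W_U                  # (T, V) vocabulary logits
        z = logits - logits.max(axis=-1, keepdims=True)  # stable softmax
        probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
        # Cross-entropy against the target token at each position.
        ce = -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))
        # Mean token entropy: a proxy for the model's confidence.
        ent = -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=-1))
        results.append((ce, ent))
    return results

rng = np.random.default_rng(4)
T, d, V, L = 6, 16, 50, 3
hidden = [rng.standard_normal((T, d)) for _ in range(L)]  # toy per-layer states
W_U = rng.standard_normal((d, V))                         # toy unembedding
targets = rng.integers(0, V, size=T)                      # toy next-token ids

for ce, ent in logit_lens_metrics(hidden, W_U, targets):
    assert ce > 0.0 and 0.0 <= ent <= np.log(V) + 1e-9
```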

Figure 24: LogitLens accuracies on multiple choice question datasets.


TunedLens. TunedLens (Belrose et al., 2023) is a refinement of the LogitLens technique that involves training a small affine transformation onto the vocabulary for each layer instead of using the model's own unembedding layer. To further validate our LogitLens experiment on the MCQ datasets, we also used the LM Evaluation Harness to run the TunedLens for Pythia 410M, Pythia 6.9B and LLaMA3 8B with the pretrained lenses available from Belrose et al. (2023). We include the results in Figure 25 for completeness; however, we do not observe meaningful differences in the layerwise behavior with respect to the LogitLens.

Embedding Tasks. To further validate the results of Section 5 and those proposed by Skean et al. (2025), we run a standard linear probing experiment. Probes are trained independently per layer, with backbone parameters fixed, using a learning rate of 5 × 10⁻⁴, one epoch, a maximum length of 1024 and batch sizes of 16-32. We train probes for the backbones Pythia 410M, Pythia 6.9B, LLaMA3 8B, Qwen2 7B and Gemma 7B. Figure 26 shows the results. As discussed, across models and datasets, accuracy peaks in the middle layers. These results suggest that the linear features relevant for classification emerge transiently in the compressed middle representations, while the late layers are repurposed for generative refinement. Moreover, we run 32 MTEB tasks for the same models and report the average main score across tasks in Figure 27.


Figure 28: Gemma 7B Heads for example prompt.




Our contributions are:

We empirically demonstrate that attention sinks and compression valleys emerge simultaneously in middle layers across several models (410M–120B parameters).

We prove that massive activations mathematically require compression, providing tight bounds on entropy reduction and singular value dominance (Theorem 1).

We validate causality through targeted ablations: removing massive activations eliminates both compression and reduces attention sinks.

We propose the Mix-Compress-Refine theory of information flow, explaining how transformers organize computation into three distinct phases.

We show this framework helps resolve the puzzle of task-dependent optimal depths: embedding tasks peak during compression while generation requires full refinement.

In this paper, we study decoder-only transformers with $L$ layers, hidden dimension $d$, and $H$ attention heads per layer. For a sequence of $T$ tokens, let $\mathbf{x}_i^{(\ell)}\in\mathbb{R}^{d}$ denote token $i$'s representation at layer $\ell$, and $\mathbf{X}^{(\ell)}\in\mathbb{R}^{T\times d}$ the full representation matrix. Attention weights $\alpha_{ij}^{(\ell,h)}$ from token $i$ to token $j$ in head $h$ satisfy causal masking ($\alpha_{ij}=0$ for $j>i$). The full architecture is provided in Appendix A.1.

For a representation matrix $\mathbf{X}$ with singular values $\sigma_1\geq\sigma_2\geq\ldots\geq\sigma_r$, we measure compression via the matrix-based entropy of the normalized spectrum $p_j=\sigma_j^2/\|\mathbf{X}\|_F^2$:

$$H(\mathbf{X})=-\sum_{j=1}^{r}p_j\log p_j.$$

Low entropy indicates compression into few dominant directions. The anisotropy $p_1=\sigma_1^2/\|\mathbf{X}\|_F^2$ measures directional bias (Razzhigaev et al., 2023): near 1 means extreme bias, near $1/r$ isotropy. For token position $k$, the attention sink score and sink rate (Gu et al., 2025) are:

$$\text{sink}_k^{(\ell,h)}=\frac{1}{T}\sum_{i=1}^{T}\alpha_{ik}^{(\ell,h)},\qquad\text{sink-rate}_k^{(\ell)}=\frac{1}{H}\sum_{h=1}^{H}\mathbb{I}\left[\text{sink}_k^{(\ell,h)}\geq\tau\right],$$

with threshold $\tau=0.3$ (unless otherwise stated), and $\mathbb{I}$ denotes the indicator function. We focus on the bos token, the primary sink across models.
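These quantities can be computed directly from a layer's representation matrix and attention maps. The following minimal NumPy sketch (illustrative, not our experimental code) implements the three diagnostics under the definitions above:

```python
import numpy as np

def matrix_entropy(X):
    """Entropy of the normalized squared singular values of X (rows = tokens)."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    p = s2 / s2.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def anisotropy(X):
    """p_1 = sigma_1^2 / ||X||_F^2: near 1 means one dominant direction."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    return float(s2[0] / s2.sum())

def sink_rate(A, k=0, tau=0.3):
    """Fraction of heads whose average attention mass on token k exceeds tau.
    A has shape (H, T, T): row-stochastic causal attention per head."""
    scores = A[:, :, k].mean(axis=1)
    return float((scores >= tau).mean())
```

Here `A` stacks the per-head attention matrices of one layer, so `sink_rate(A, k=0)` yields the bos sink rate with $\tau=0.3$.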

Attention heads mysteriously focus on semantically uninformative tokens (e.g., bos) across diverse models and scales (Xiao et al., 2024). While Barbero et al. (2025a) argues they prevent over-mixing, Cancedda (2024) relates them to spectral subspaces, and Gu et al. (2025) traces emergence to pretraining, no work has yet examined their depth-wise organization.

Transformer representations compress dramatically in middle layers, where the matrix-based entropy drops significantly before recovering (Skean et al., 2025). This universal pattern coincides with increased anisotropy ($p_1>0.9$) (Razzhigaev et al., 2023) and near-linear layer transitions (Razzhigaev et al., 2024). Paradoxically, these compressed representations excel at embedding tasks despite their reduced dimensionality. The underlying mechanism has remained unknown, with only information-bottleneck hypotheses proposed, which lack causal evidence (Skean et al., 2025).

Sun et al. (2024) identified extremely large-magnitude features in transformer residual streams, with individual neurons exceeding typical activations by factors of $10^{3}$–$10^{6}$. These “massive activations” consistently appear on delimiter and special tokens (particularly the bos token), acting as input-agnostic biases. Furthermore, Sun et al. (2024) found a link between the emergence of massive activations and attention sinks, which was reinforced in Barbero et al. (2025b) and Yona et al. (2025). However, none of these works link this phenomenon to representational structure or a unified theory of information flow in LLMs.

Several works have sought to understand the evolution of representations in Transformer-based models from a theoretical perspective. Dong et al. (2021) proved that the repeated application of self-attention leads to rank collapse in simplified settings without residual connections. Geshkovski et al. (2023) analyze self-attention dynamics and show that tokens cluster over depth. Wu et al. (2024) studied how layer normalization and attention masks affect information propagation, finding that normalization can prevent rank decay. Other empirical work examines intermediate layer outputs directly. We highlight the LogitLens (Nostalgebraist, 2020) and TunedLens (Belrose et al., 2023), which decode hidden states using the model's unembedding matrix and an affine probe per layer, respectively. Furthermore, Lad et al. (2024) measure sensitivity to deletion and swap interventions across layers and argue for distinct stages of depth-wise inference. Most recently, Csordás et al. (2025) argue that deeper LLMs underutilize their additional layers, with later layers mainly refining probability distributions rather than composing novel computations. While these studies illuminate layerwise behavior, none provides a unified mechanism that explains why stages form in depth or predicts when they should appear.

Thus far, attention sinks have been tied to massive activations while compression has remained a separate observation without a causal mechanism. In this work, we document the synchronized dynamics of these phenomena, showing that the same massive activations that create sinks are also the main driver of compression. Building on their co-emergence, we propose a three-stage theory in which residual-stream norms simultaneously regulate mixing in attention heads and compression in representations. Finally, we connect the mechanism to downstream behavior, distinguishing between embedding-style and generation tasks.

We first empirically document that attention sinks and compression valleys emerge simultaneously across model families and scales. Figure 1 shows the layer-wise evolution of three metrics across six models (Pythia 410M/6.9B, LLaMA3 8B, Qwen2 7B, Gemma 7B, Bloom 1.7B): (1) the matrix-based entropy $H(\mathbf{X}^{(\ell)})$, (2) the sink rate $\text{sink-rate}_0^{(\ell)}$, and (3) the bos token norm $\|\mathbf{x}_0^{(\ell)}\|$. We compute these metrics for all 7.5K training examples in GSM8K (Cobbe et al., 2021), plotting the mean and standard deviation at each layer.

We observe that all three patterns align precisely. When the bos norm spikes to factors of $10^{3}$–$10^{4}$ (typically layers 0–5, depending on model depth), entropy simultaneously drops and sink rates surge. We compute the Pearson correlation between the change in bos norm and entropy, obtaining $r=-0.9\pm 0.18$ across models, while bos norm and sink rate correlate at $r=0.58\pm 0.25$.

We highlight that this synchronization is remarkably consistent. While sink rates vary with prompt content, the layer index where these phenomena emerge is fixed for each model, and the massive activation is deterministic. For instance, in Pythia 410M, the transition consistently occurs at layer 5 regardless of input, suggesting an architectural rather than input-dependent mechanism. We show how these transitions emerge during training in Figure 2. We point the reader to Appendix B.1 for details on experiments, larger models, and a note on GPT OSS (Agarwal et al., 2025).

We now prove that massive activations necessarily induce the observed compression. Consider the representation matrix $\mathbf{X}\in\mathbb{R}^{T\times d}$ with rows $\{\mathbf{x}_i\}_{i=0}^{T-1}$, where $\mathbf{x}_0$ denotes the bos token.

Let $M=\|\mathbf{x}_0\|^2$, $R=\sum_{i\neq 0}\|\mathbf{x}_i\|^2$, and let $\theta_i$ be the angle between $\mathbf{x}_0$ and $\mathbf{x}_i$. Define the alignment term $\alpha=\frac{1}{R}\sum_{i\neq 0}\|\mathbf{x}_i\|^2\cos^2\theta_i\in[0,1]$. Then:

$$\sigma_1^2\geq M+\alpha R,$$

where $\sigma_1$ is the largest singular value of $\mathbf{X}$.

By the variational characterization of singular values, $\sigma_1^2=\max_{\|\mathbf{v}\|=1}\|\mathbf{X}\mathbf{v}\|^2$. Choosing $\mathbf{v}=\mathbf{x}_0/\|\mathbf{x}_0\|$ and expanding $\|\mathbf{X}\mathbf{v}\|^2$ yields the bound. Full proofs and discussion in Appendix A.2. ∎

This theorem has immediate consequences for compression metrics:

Let $c=M/R$ be the norm ratio and $p=(c+\alpha)/(c+1)$. Then:

Dominance: $\sigma_1^2/\sum_{j\geq 2}\sigma_j^2\geq(c+\alpha)/(1-\alpha)$

Anisotropy: $p_1\geq p$

Entropy: $H(\mathbf{X})\leq -p\log p-(1-p)\log(1-p)+(1-p)\log(r-1)$

Meaning of the results. Theorem 1 shows that two factors control the rise of $\sigma_1^2$: (i) the magnitude $M$ of the massive activation, and (ii) the alignment $\alpha$ of the other rows with $\mathbf{x}_0$. Full alignment makes $\mathbf{X}$ rank one (with $\sigma_1^2(\mathbf{X})=M+R=\|\mathbf{X}\|_F^2$), while even a small $\alpha$ combined with a large $M$ suffices to grow $\sigma_1^2$. The corollaries give lower bounds on dominance and anisotropy in terms of $(c,\alpha)$, so increasing $c$ (a stronger gap between activations) or increasing $\alpha$ (stronger alignment) provably widens the spectral gap. Consequently, the singular-value entropy is tightly upper-bounded by the mass in the top component, so $c\gg 1$ or $\alpha\rightarrow 1$ drives $H(\mathbf{X})$ towards zero.
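The bounds can be sanity-checked numerically on a synthetic representation matrix with a single high-norm row; the sketch below (toy sizes $T=64$, $d=128$ and the $100\times$ scaling are assumptions, not values from our experiments) verifies Theorem 1 and the corollaries:

```python
import numpy as np

# Synthetic representation matrix with a "massive activation" on row 0 (bos).
rng = np.random.default_rng(0)
T, d = 64, 128
X = rng.standard_normal((T, d))
X[0] *= 100.0

M = np.sum(X[0] ** 2)                           # ||x_0||^2
norms2 = np.sum(X[1:] ** 2, axis=1)
R = norms2.sum()                                # sum of remaining squared norms
cos2 = (X[1:] @ X[0]) ** 2 / (norms2 * M)       # cos^2 of angles to x_0
alpha = np.sum(norms2 * cos2) / R               # alignment term in [0, 1]
c = M / R
p = (c + alpha) / (c + 1)

s2 = np.linalg.svd(X, compute_uv=False) ** 2    # squared singular values
p_dist = s2 / s2.sum()
H = -(p_dist * np.log(p_dist)).sum()            # matrix-based entropy
r = len(s2)

assert s2[0] >= (M + alpha * R) * (1 - 1e-9)    # Theorem: sigma_1^2 >= M + alpha R
assert s2[0] / s2.sum() >= p * (1 - 1e-9)       # anisotropy bound p_1 >= p
assert H <= -p * np.log(p) - (1 - p) * np.log(1 - p) + (1 - p) * np.log(r - 1)
```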

Tightness of the bounds in practice. When massive activations create growing norm ratios $c$, these bounds become tight. Figure 3 compares our theoretical bounds against empirical measurements for Pythia 410M across all layers. In early layers, where massive activations are absent, the bounds are loose, as expected, because the theory only constrains layers where a dominant row exists. However, in middle layers, where massive activations emerge, the bounds become nearly exact, with predicted and observed values overlapping within measurement error. This tightness reveals that massive activations are the dominant mechanism shaping representation geometry: when $\|\mathbf{x}_0\|$ becomes massive, the representation matrix effectively becomes rank one plus small perturbations, exactly as our theory predicts.

To isolate the exact role of massive activations empirically, we perform targeted ablations: zeroing the MLP's contribution to the bos token at the layers where massive activations emerge. Specifically, we set $\mathbf{x}_{\text{bos}}^{(\ell+1)}\leftarrow\mathbf{x}_{\text{bos}}^{(\ell)}+\mathrm{Attn}^{(\ell)}(\mathbf{x}_{\text{bos}})$, removing only $\mathrm{MLP}^{(\ell)}(\mathbf{x}_{\text{bos}})$.
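The mechanics of this intervention can be sketched with a toy residual update (a simulation under assumed magnitudes, not our experimental pipeline): zeroing the MLP's write to the bos position keeps the massive activation out of the residual stream while leaving all other tokens untouched.

```python
import numpy as np

def layer_update(X, attn_out, mlp_out, ablate_bos_mlp=False):
    """One residual update x^{l+1} = x^l + Attn + MLP; optionally drop the
    MLP's write to the bos position (index 0) only."""
    mlp_out = mlp_out.copy()
    if ablate_bos_mlp:
        mlp_out[0] = 0.0
    return X + attn_out + mlp_out

rng = np.random.default_rng(0)
T, d = 16, 32
X = rng.standard_normal((T, d))
attn_out = 0.1 * rng.standard_normal((T, d))
mlp_out = 0.1 * rng.standard_normal((T, d))
mlp_out[0] += 1000.0                 # the MLP writes a massive activation to bos

X_normal = layer_update(X, attn_out, mlp_out)
X_ablate = layer_update(X, attn_out, mlp_out, ablate_bos_mlp=True)

ratio_normal = np.linalg.norm(X_normal[0]) / np.linalg.norm(X_normal[1:], axis=1).mean()
ratio_ablate = np.linalg.norm(X_ablate[0]) / np.linalg.norm(X_ablate[1:], axis=1).mean()
```

Without the ablation the bos norm dominates by orders of magnitude; with it, the bos token stays on the same scale as the other tokens.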

We find that ablating massive activations can eliminate both phenomena, confirming our theoretical predictions. In LLaMA3 8B, removing the MLP's contribution at the first layer prevents the entropy drop (it remains at 0.4–0.5 bits vs. dropping to 0.02 bits), eliminates sink formation (the sink rate drops from 0.85–1.0 to 0.0), and keeps the bos norm within $2\times$ of other tokens (vs. $10^{2}\times$ normally) (Fig. 4). Similar results hold across models, as seen in Fig. 14 in Appendix B.1. Some models (Pythia 410M, Qwen2 7B) develop massive activations in more than one stage. Ablating any single stage partially reduces compression; ablating all stages eliminates it entirely, suggesting a cumulative contribution. However, for Pythia 410M, ablations remove compression but do not remove attention sinks, which suggests that the formation of sinks may have model-dependent causes.

Why Middle Layers? We hypothesize that the mid-depth concentration of sinks and compression reflects how Transformers allocate computation across depth: early layers perform broad contextual mixing, while later layers become increasingly aligned with next-token prediction (Lad et al., 2024). Freed from these pressures, middle layers can develop extreme features that regulate token-to-token sharing and induce compression, consistent with a mid-network shift toward refinement (Csordás et al., 2025). In the next section, we show how massive activations predict, and precisely characterize, three stages of information flow.

Building on our mechanistic understanding of massive activations, we now present a broader theory of how transformers organize computation across depth. We propose that information processing occurs in three distinct phases, demarcated by the emergence and dissipation of massive activations.

In the early layers of the model, we observe diffuse attention patterns, enabled by the absence of massive activations. For a few layers, this lets the model mix high-dimensional token representations, building contextual representations through broad information integration. An example of such an attention head in the early layers is shown in Figure 6.

To quantify mixing in attention heads, we define the Mixing Score as the average row entropy of the attention matrices: $\frac{1}{T}\sum_{i=1}^{T}H(\mathbf{A}_{i,:}^{(\ell,h)})$. Across models, we find that early layers consistently maintain mixing scores above $0.7$, confirming active token mixing, before dropping sharply when massive activations emerge. Notably, this mixing phase varies in extent, from just the first layer in some models to approximately 20% of network depth in others, but its qualitative characteristics remain consistent. We plot this metric across models and layers in Figure 18 in Appendix B.2.
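A minimal sketch of this metric follows; as an illustrative choice (an assumption, not necessarily our exact normalization), we normalize each row's entropy by the log-size of its causal support so that uniform attention scores 1 and fully localized attention scores 0:

```python
import numpy as np

def mixing_score(A, eps=1e-12):
    """Average normalized row entropy of a causal attention matrix A (T x T).
    Row i can only attend to positions 0..i, so its entropy is divided by
    log(i + 1), placing the score in [0, 1]."""
    T = A.shape[0]
    scores = []
    for i in range(1, T):                     # row 0 has a single valid entry
        row = A[i, : i + 1]
        H_row = -(row * np.log(row + eps)).sum()
        scores.append(H_row / np.log(i + 1))
    return float(np.mean(scores))
```

Uniform causal attention yields a score near 1 (maximal mixing), while an identity attention pattern yields a score near 0 (no mixing).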

We believe that this initial mixing stage is deliberately limited to prevent over-mixing and representational collapse that would occur with extended uniform attention (Barbero et al., 2024; 2025a), analogous to over-smoothing in graph neural networks (Arroyo et al., 2025). This controlled, brief mixing phase establishes the semantic foundation that subsequent phases refine. The model captures both local token dependencies and global context, creating rich representations that can be selectively compressed and refined in later phases.

The middle phase begins abruptly with the emergence of massive activations, typically on the bos token. As established in Section 3.2, these massive activations necessarily induce representational compression, as well as attention sink formation (Gu et al., 2025). This phase serves as a computational shut-off switch for mixing. The attention sinks act as approximate “no-ops” (Bondarenko et al., 2023). By attending to bos tokens with near-zero value norms, heads effectively skip their contribution while preserving the residual stream.

In these middle layers, the model refines information through the compressed residual stream, where a few dominant directions preserve high-level context while discarding redundancies. This aligns with the depth-efficiency perspective of Csordás et al. (2025), who show that mid-layers contribute less to shaping future tokens and more to stabilizing current representations. In Section 5, we show that performance on generation tasks tends to improve mostly in the latter half of this second phase. We hypothesize this lag reflects the time mid-layer MLPs need to process and consolidate the compressed signal before yielding token-level refinements. We do not treat this as a separate phase, however, since it is not cleanly demarcated by the emergence or dissipation of massive activations.

Sink behavior in middle-to-late layers adapts to input complexity. Mid-to-late-layer sinks are input-dependent (“dormant” heads), often inert but activated by specific prompts (Guo et al., 2024). Reproducing Sandoval-Segura et al. (2025) on 20K FineWeb-Edu prompts (Lozhkov et al., 2023), we compare the “top” prompts (with the strongest sink scores) and the “bottom” prompts (with the weakest). As shown in Figure 5, sink strength diverges in the middle and last layers. This provides evidence that while middle layers default to sink-like behavior that limits mixing, sink strength in this phase varies depending on the prompt.

In the final phase, the model reverses the compression bottleneck through norm equalization and a fundamental shift in attention patterns.

Norm equalization drives decompression. In this phase, we find that the bos norm plateaus or decreases while the average norm of the remaining tokens rises sharply, driving them toward similar magnitudes (right panel of Figure 4, and Figure 15 in Appendix B.1). This equalization begins earlier than the full phase transition: the average norm starts rising around 40–60% depth, preparing for the eventual shift. The massive activation ratio drops from $>10^{3}$ to $<10$, removing the mathematical basis for compression and allowing representations to re-expand.

Attention shifts to positional patterns. As massive activations dissipate, we observe heads transition from sink-dominated to position-based patterns. In particular, we observe the emergence of identity heads ($i\rightarrow i$), previous-token heads ($i\rightarrow i-1$), and other sharp attention patterns, where by sharp we mean highly localized. Figure 6 shows an example of such a pattern in the Pythia 410M model. We find that sharp positional patterns are especially common in RoPE-based models, consistent with recent work (Barbero et al., 2025b) showing that RoPE induces frequency-selective structure that favors the emergence of such heads. We provide empirical evidence of this in Appendix B.2. In particular, when measuring the mixing rate of attention patterns, only models without RoPE revert to higher mixing in later layers, whereas RoPE-based models consistently transition toward sharp positional attention.

We believe this phase serves three computational purposes. First, norm equalization reduces bos dominance so content tokens can meaningfully influence the residual pathway and receive token-specific refinements. Second, attention shifts to sharp (often positional) heads that perform selective mixing, focusing on a few task-relevant tokens and writing their features into the residual stream. As late layers re-expand capacity, these signals can be represented distinctly rather than squeezed by the mid-layer bottleneck. In parallel, identity/near-diagonal heads curb mixing without defaulting to sinks: their non-zero value writes act as local signal boosters, in contrast to bos sinks, which effectively zero out updates. Third, bringing token norms to smaller, comparable scales likely improves numerical stability for the unembedding. Notably, models tend to equalize by boosting content-token norms rather than fully collapsing the bos norm, preserving a modest global bias while enabling precise, content-driven refinements.

Prior work (Skean et al., 2025) found that mid-layer representations perform strongly, particularly on embedding benchmarks, and linked this effect to the mid-depth compression valley. In this section, we broaden the picture by evaluating both embedding and generation tasks, relating their depthwise performance to the three-stage framework introduced above.

Generation improves monotonically through all phases. We begin by evaluating intermediate layers across multiple model families and sizes using LogitLens (Nostalgebraist, 2020) on WikiText-2 (Merity et al., 2016). We observe a steady perplexity decline with depth, from $>10^{4}$ in early layers to 10–25 at full depth, as shown in Figure 7 (left). We notice little gain in the very early layers (consistent with an embedding-formation stage) and continued refinement through mid-depth. Across several models, the sharpest improvements occur in Phase 3, where norm equalization and positional/identity heads enable token-specific refinements, which appear to be crucial for next-token prediction.
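The LogitLens operation itself is simple: project an intermediate hidden state through the model's final normalization and unembedding. The sketch below uses toy dimensions and an RMSNorm-style normalization as assumptions (the exact final norm varies by model):

```python
import numpy as np

def logit_lens(hidden, final_norm, W_U):
    """Decode an intermediate hidden state by applying the model's final
    normalization and the unembedding matrix W_U (d x vocab)."""
    return final_norm(hidden) @ W_U

# Toy dimensions; RMSNorm-style normalization is an assumed stand-in.
d, V = 16, 50
rng = np.random.default_rng(0)
W_U = rng.standard_normal((d, V))
rmsnorm = lambda h: h / np.sqrt(np.mean(h ** 2, axis=-1, keepdims=True) + 1e-6)

h_mid = rng.standard_normal(d)               # hidden state from a middle layer
logits = logit_lens(h_mid, rmsnorm, W_U)
log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
# Perplexity over a corpus exponentiates the average negative log-probability
# of the observed next tokens under these intermediate-layer distributions.
```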

We further test the same set of models on multiple-choice question-answering tasks. We evaluate on ARC Easy, ARC Challenge, HellaSwag, and WinoGrande (Clark et al., 2018; Zellers et al., 2019; Sakaguchi et al., 2021) via LogitLens and the LM Evaluation Harness (Gao et al., 2024) in a zero-shot setting. Figure 7 (middle) shows the results for ARC Easy; results for the remaining datasets can be found in Figure 24, Appendix B.3. For sufficiently large models, accuracies remain largely flat until roughly 40–60% depth and then rise sharply. This suggests that, for generation-aligned tasks, compression alone (Phase 2) is not sufficient; gains emerge only toward the end of Phase 2 (after sufficient residual refinement) and continue into Phase 3, where norm equalization and positional/identity heads enable token-specific updates.

In short, in generation settings, the late Phase 2 to Phase 3 transition is pivotal, aligning with the observation in Csordás et al. (2025) of a mid-network phase change from future-token to current-token computation, providing independent validation that Phase 2 and Phase 3 serve distinct computational roles.

Embedding tasks peak in middle layers. To highlight the difference between generation and embedding tasks, we implement the following linear probing experiment. For each dataset, we encode the examples with the frozen backbone and extract hidden states at every layer. We then train a linear classifier at each layer on the training split and evaluate it on the test split. We probe the same multiple-choice QA benchmarks as before and, additionally, a sentence classification dataset (SST-2; Socher et al. 2013), assessing how much task-relevant information is linearly accessible at different depths. Figure 7 (right) shows the results for ARC Easy, highlighting the difference in optimal depths with the corresponding generation task, while full results for this experiment are shown in Fig. 26, Appendix B.3. Moreover, we reproduce Skean et al. (2025) on 32 MTEB tasks (Muennighoff et al., 2022) across a broader set of larger, decoder-only models. Across the board, we find that performance peaks consistently at 25–75% relative depth, outperforming early/late layers by 10–20% and precisely aligning with Phase 2, where compression is strongest (see Fig. 27 in Appendix B.3). These results align with evidence that next-token pretraining does not uniformly benefit perception-style classification (Balestriero & Huang, 2024). Together, they underscore task dependence: classification-relevant linear features concentrate in intermediate layers, whereas late layers are repurposed for token-specific generative refinement. Furthermore, the pattern suggests that massive activations in the residual pathway not only curb over-mixing via sink formation but also act as a mechanism the model uses to compress information.
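The per-layer probing procedure can be sketched as follows; for brevity, a closed-form ridge probe on synthetic features stands in for the gradient-trained linear classifiers used in our experiments:

```python
import numpy as np

def probe_accuracy(H_train, y_train, H_test, y_test, lam=1e-2):
    """Fit a linear probe on frozen layer-l hidden states and report test
    accuracy. Ridge regression to one-hot targets gives a closed-form probe."""
    n_classes = int(y_train.max()) + 1
    Y = np.eye(n_classes)[y_train]                    # one-hot targets
    d = H_train.shape[1]
    W = np.linalg.solve(H_train.T @ H_train + lam * np.eye(d), H_train.T @ Y)
    return float((np.argmax(H_test @ W, axis=1) == y_test).mean())

# Toy check: features that linearly encode the label probe to high accuracy.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
H = rng.standard_normal((200, 32))
H[:, 0] += 3.0 * (2 * y - 1)          # embed the label linearly, wide margin
acc = probe_accuracy(H[:150], y[:150], H[150:], y[150:])
```

Repeating this at every layer traces out the depth-accuracy curves reported above.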

We believe these findings clarify which tasks actually benefit from compression. In particular, embedding-style objectives (such as clustering, retrieval, classification, bitext mining, etc.) gain from Phase 2’s compression because they target low-dimensional structure while discarding irrelevant information, echoing classic arguments on the benefits of information bottlenecks and compressed representations (Shwartz-Ziv et al., 2018; Kawaguchi et al., 2023). This picture aligns with evidence that LLMs produce surprisingly linear (and often linearly separable) embeddings (Razzhigaev et al., 2024; Marks & Tegmark, 2023). In particular, when features concentrate in a low-dimensional subspace, linear probing, semantic retrieval, and related embedding tasks become easier. Moreover, such linear structure has been linked to the emergence of high-level semantic concepts (Park et al., 2023), reinforcing our hypothesis on why mid-layer compressed states tend to work well for non-generative evaluations.

By contrast, generation and reasoning require capacity that compressed states alone cannot provide. Performance improves the most once Phase 3 norm equalization restores higher entropy, and positional heads/MLPs can refine token-specific details, which is also when we observe the models being most confident about their predictions (see Fig. 23 in Appendix B.3). In this way, the model makes use of the compressed and refined representation from Phase 2, which has captured high-level ideas and semantic concepts, and expands this into higher-dimensional space to perform token-level refinements in Phase 3.

This reconciles two views: compressed mid-layers suit embedding benchmarks, whereas next-token-prediction–aligned tasks benefit from full-depth processing. Practically, “optimal layer” selection should match phase to objective, suggesting phase-aware early exiting (Schuster et al., 2022) as a potentially promising design choice.

In this work, we revisited two puzzling phenomena in decoder-only Transformers, attention sinks and compression valleys. We began with the observation that attention sinks, compression valleys, and massive activations all emerge at the same time in language models. We then proved that a single high-norm token necessarily induces a dominant singular value, yielding low matrix-entropy and high anisotropy, and we bounded these effects quantitatively.

Building on this, we proposed the Mix–Compress–Refine theory of depth-wise computation in LLMs. In particular, we show that early layers mix broadly through diffuse attention, middle layers compress and curb mixing via attention sinks, and late layers re-equalize norms and apply sharp positional heads for selective refinement. The boundaries between these phases are marked by the appearance and later disappearance of massive activations in depth. We use this organization to clarify downstream task behavior: embedding-style tasks peak in compressed mid-layers, whereas generation improves through late refinement and benefits from full depth.

We see this framework as a step toward a more mechanistic account of how LLMs allocate computation across depth. We hope these insights help connect head-level mechanisms with representation geometry, ultimately guiding more efficient and controllable LLM designs.

In this work, we study decoder-only Transformers (Radford et al., 2018), which employ causal masking in attention and constitute the dominant architecture in today's large language models (Gemma Team et al., 2024; Dubey et al., 2024). We follow the notation of Barbero et al. (2024), but we importantly also consider a model with $H\geq 1$ attention heads:

$$\mathbf{x}_i^{(\ell+1)}=\mathbf{z}_i^{(\ell)}+\mathrm{MLP}^{(\ell)}\big(\mathrm{norm}(\mathbf{z}_i^{(\ell)})\big),\qquad\mathbf{z}_i^{(\ell)}=\mathbf{x}_i^{(\ell)}+\sum_{h=1}^{H}\mathbf{W}_O^{(\ell,h)}\sum_{j\leq i}\alpha_{ij}^{(\ell,h)}\,\mathbf{W}_V^{(\ell,h)}\,\mathrm{norm}(\mathbf{x}_j^{(\ell)}),$$

where we also denote by $\mathbf{A}^{(\ell,h)}$ the attention matrices given by $\mathbf{A}^{(\ell,h)}_{ij}=\alpha^{(\ell,h)}_{ij}$. Causal masking translates into $\mathbf{A}^{(\ell,h)}$ being lower-triangular, and the row-wise softmax implies row-stochasticity.

This section includes the proofs of the statements of Section 3.2, where we show that massive activations imply the dominance of a single singular value. One can obtain a weaker version of the bound focused only on the massive activation (no alignment terms), which entails weaker bounds for the spectral metrics. The following lemma proves that $\sigma_1^2(\mathbf{X})=\max_{\|\mathbf{v}\|=1}\|\mathbf{X}\mathbf{v}\|^2$.

Let $R(\mathbf{A},\mathbf{x})=\frac{\mathbf{x}^{\top}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\top}\mathbf{x}}$ denote the Rayleigh quotient for a symmetric matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ and a real non-zero vector $\mathbf{x}\in\mathbb{R}^{n}$. Then $R(\mathbf{A},\mathbf{x})\in[\lambda_{\min},\lambda_{\max}]$, achieving each bound at the corresponding eigenvectors $\mathbf{v}_{\min},\mathbf{v}_{\max}$.

Let $\mathbf{A}=\mathbf{Q}^{\top}\bm{\Lambda}\mathbf{Q}$ be the diagonalization of $\mathbf{A}$ in the eigenbasis given by the $\mathbf{v}_i$, and let $\mathbf{y}=\mathbf{Q}\mathbf{x}$, so that $\mathbf{x}=\mathbf{Q}^{\top}\mathbf{y}$ since $\mathbf{Q}$ is orthogonal (i.e., $\mathbf{Q}^{\top}=\mathbf{Q}^{-1}$). Then,

$$R(\mathbf{A},\mathbf{x})=\frac{\mathbf{y}^{\top}\bm{\Lambda}\mathbf{y}}{\mathbf{y}^{\top}\mathbf{y}}=\frac{\sum_i\lambda_i y_i^2}{\sum_j y_j^2}.$$

Since the weights $w_i=y_i^2/\sum_j y_j^2$ satisfy $w_i\geq 0$ and $\sum_i w_i=1$, $R(\mathbf{A},\mathbf{x})$ is a convex combination of the eigenvalues, and therefore $\lambda_{\min}\leq R(\mathbf{A},\mathbf{x})\leq\lambda_{\max}$, with equality when $\mathbf{x}=\mathbf{v}_{\max}$ or $\mathbf{x}=\mathbf{v}_{\min}$, respectively. ∎

We now prove that the emergence of massive activations in some layers directly implies that the first singular value dominates the spectrum, which translates into extreme values of anisotropy and matrix-based entropy. The intuition is that entropy and anisotropy are representation-only properties: they depend solely on the singular-value spectrum of the representation matrix $\mathbf{X}$, whose rows are token-wise representations $\{\mathbf{x}_i\}$ and whose columns are features. A massive activation means that one row, say $\mathbf{x}_0$, carries a disproportionately large norm $M=\|\mathbf{x}_0\|^2$ compared to the rest of the token representations, sometimes orders of magnitude larger. Let $\mathbf{v}=\mathbf{x}_0/\|\mathbf{x}_0\|$ be the direction of $\mathbf{x}_0$; notice we can always write $\mathbf{X}=\mathbf{e}_1\mathbf{x}_0^{\top}+\mathbf{Y}=\sqrt{M}\,\mathbf{e}_1\mathbf{v}^{\top}+\mathbf{Y}$, where $\mathbf{Y}$ contains the rest of the representations. If $M$ is large compared to $\|\mathbf{Y}\|_F^2$, then $\mathbf{X}$ is effectively a rank-one matrix plus a small perturbation, and we expect $\sigma_1^2(\mathbf{X})\approx M$ and $\mathbf{v}$ to be close to the first right singular vector. This is exactly the mechanism exploited by PCA (Maćkiewicz & Ratajczak, 1993): the first principal component points in the direction that explains the largest variance; a massive activation creates such a dominant variance direction by construction. Therefore, even before formal bounds, we should expect $\sigma_1^2$ to dominate whenever (i) the norm ratio $c=\|\mathbf{x}_0\|^2/\sum_{i\neq 0}\|\mathbf{x}_i\|^2$ is large or (ii) the remaining rows $\{\mathbf{x}_i\}_{i\neq 0}$ are measurably aligned with $\mathbf{x}_0$. The next result formalizes this intuition.

By definition of the singular value (also see Lemma 3),

Using $\langle\mathbf{x}_i,\mathbf{x}_0\rangle^2=\|\mathbf{x}_0\|^2\|\mathbf{x}_i\|^2\cos^2\theta_i$, we obtain

Since $\alpha R=\sum_{i\neq 0}\|\mathbf{x}_i\|^2\cos^2\theta_i$, we get $\sigma_1^2\geq M+\alpha R$, which is the desired result. ∎

As mentioned in the main text, Theorem 4 makes precise how two independent factors govern the rise of $\sigma_1^2$: (i) the magnitude $M$ of the massive activation, and (ii) the alignment $\alpha$ of the remaining rows with $\mathbf{x}_0$. If the representations were perfectly aligned, then $\mathbf{X}$ would indeed be rank one, with a single nonzero singular value $\sigma_1^2(\mathbf{X})=M+R=\|\mathbf{X}\|_F^2$. Conversely, even with small $\alpha$ (say, when token representations are weakly aligned or orthogonal), a large norm $M$ suffices to grow $\sigma_1^2$. Empirically, the term $\|\mathbf{x}_0\|^2$ has the largest impact in our analysis, as it is orders of magnitude larger than the remaining norms; keeping the alignment term, however, is also important for the results that follow.

We move on to proving Corollary 2, which we split into three parts in this section.

In the setting of Theorem 4, let $c=\|\mathbf{x}_0\|^2/\sum_{i\neq 0}\|\mathbf{x}_i\|^2$. Then
$$ \sigma_1^2 \geq \left(\frac{c+\alpha}{1-\alpha}\right)\sum_{j\geq 2}\sigma_j^2. $$

From Theorem 4, $\sigma_1^2\geq M+\alpha R$. Moreover, $\sum_{j\geq 2}\sigma_j^2=\|\mathbf{X}\|_F^2-\sigma_1^2\leq\|\mathbf{X}\|_F^2-(M+\alpha R)=R-\alpha R=(1-\alpha)R$. Therefore one gets:

Let $p_1=\sigma_1^2/\|\mathbf{X}\|_F^2$ denote the anisotropy. In the setting of Theorem 4,
$$ p_1 \geq \frac{M+\alpha R}{M+R} = \frac{c+\alpha}{c+1}, $$
which follows directly from Theorem 4 since $\|\mathbf{X}\|_F^2=M+R$.

As mentioned in the main text, Corollaries 5 and 6 lower-bound the dominance ratio and anisotropy using only $(c,\alpha)$. Thus, either increasing $c$ (a stronger massive activation) or increasing $\alpha$ (stronger alignment) provably inflates the spectral gap. In both cases, perfect alignment with $\mathbf{x}_0$ or growth of $\|\mathbf{x}_0\|^2$ relative to the rest forces extreme values. If $\alpha\to 1$, then $\frac{c+\alpha}{1-\alpha}\to\infty$ and $\frac{c+\alpha}{c+1}\to 1$, intuitively because only one direction remains relevant in the data. Moreover, as the massive activation grows, $c\to\infty$, and the same result holds. Notice that $c$ is the ratio between the massive activation and the rest; therefore $c$ increases when $\mathbf{x}_0$ grows in norm, but also when the remaining representations have low norm.

Let $p_j:=\sigma_j^2/\|\mathbf{X}\|_F^2$ denote the normalized distribution of singular values of $\mathbf{X}$, and let $H(\mathbf{X}):=-\sum_{j=1}^{r}p_j\log p_j$ be the Shannon entropy of this distribution. Let $p:=\frac{c+\alpha}{c+1}$. Then we have the following bound

so we need to bound the second term, which is the entropy of $r-1$ terms summing to $1-p_1\leq 1-p$. This term is maximised when the mass is equally distributed, that is, $p_j=\frac{1-p_1}{r-1}\leq\frac{1-p}{r-1}$. Therefore, one gets

The result is obtained by combining these two bounds. ∎

For fixed top mass $p_1\geq p$, entropy is maximized when the remaining mass $1-p$ is spread uniformly over the other $r-1$ singular values; the bound above is exactly that maximum. Consequently, any additional structure in the tail (e.g., a second spike) will lower the true entropy beneath this upper bound. Notice that for $c\to\infty$ or $\alpha\to 1$, we have $p\to 1$ and the upper bound approaches 0.

In the theoretical analysis above, we only considered one massive activation, placed on the bos token. In practice, models may exhibit more than one massive activation (Sun et al., 2024), in which case the $c$ term would make the bounds more permissive. We believe this poses no problem to our overall message and that the analysis can be extended: one can suppose the first $n$ tokens carry the massive activations and decompose $\mathbf{X}=\sum_{i=0}^{n-1}\mathbf{e}_i\mathbf{x}_i^{\top}+\mathbf{Y}$, so that the first summand has rank at most $n$ and $\mathbf{Y}$ is a comparatively small perturbation, again leading to small entropy (effective rank $\leq n$); this also holds for longer context lengths.
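The bounds above can be checked numerically. The following minimal numpy sketch builds a synthetic representation matrix with a single massive row (the sequence length, width, and the $10^3$ norm factor are illustrative assumptions, not values taken from any model) and verifies the spectral-dominance, anisotropy, and entropy bounds:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 64, 128

# Synthetic token representations: one "massive" row x_0, the rest small.
X = rng.standard_normal((T, d))
X[0] *= 1e3                                  # massive activation on the first row

M = np.sum(X[0] ** 2)                        # ||x_0||^2
row_sq = np.sum(X[1:] ** 2, axis=1)          # ||x_i||^2 for i != 0
R = row_sq.sum()
cos2 = (X[1:] @ X[0]) ** 2 / (row_sq * M)    # cos^2(theta_i)
alpha = np.sum(row_sq * cos2) / R            # alignment term
c = M / R
p = (c + alpha) / (c + 1)

s = np.linalg.svd(X, compute_uv=False) ** 2  # squared singular values (descending)
pj = s / s.sum()
H = -np.sum(pj * np.log(pj))                 # matrix-based (Shannon) entropy
r = len(pj)

assert s[0] >= M + alpha * R                 # Theorem: sigma_1^2 >= M + alpha R
assert pj[0] >= p                            # anisotropy bound
H_bound = -p * np.log(p) - (1 - p) * np.log(1 - p) + (1 - p) * np.log(r - 1)
assert H <= H_bound                          # entropy upper bound
```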

All experiments were implemented in PyTorch and run on NVIDIA A100 GPUs with 40GB of memory, or NVIDIA H100 GPUs with 80GB when memory requirements were higher. We examined pretrained models of varying depths, using HuggingFace repositories with Transformers and Transformer-Lens (Nanda & Bloom, 2022). When collecting metrics such as sink rates and norms over large datasets, prompts were truncated to a maximum length of 4096 tokens for the FineWeb-Edu experiment (Fig. 5) and 1024 for the GSM8K experiment (Fig. 1), as the latter required singular value decompositions to compute the entropy. LogitLens experiments for multiple-choice-question tasks were done with LM-Evaluation-Harness (Gao et al., 2024), implementing our own model wrapper to output hidden states at each layer instead of only the final ones.

To assess the dynamical relationship between bos norm, matrix-based entropy, and bos sink rate across layers, we computed correlations on their layerwise changes. For each model and metric, the trajectory across layers was first $z$-scored, and we then defined the delta at layer $\ell$ as the difference with respect to the preceding layer,

This procedure emphasizes abrupt layerwise changes rather than absolute values, which is crucial because the bos norm often exhibits sharp spikes that coincide with collapses in entropy and the subsequent emergence of attention sinks. We then measured Pearson correlation coefficients between $\Delta\tilde{b}_\ell$ and $\Delta\tilde{e}_\ell$ (bos norm vs. entropy, same layer) and between $\Delta\tilde{b}_\ell$ and $\Delta\tilde{s}_{\ell+1}$ (bos norm vs. sink rate, lagged by one layer). Correlations were computed separately per model and summarized across models by Fisher $z$-transform averaging, reporting the mean correlation and the standard deviation across models.
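The procedure above can be sketched with numpy alone; the layerwise trajectories below are synthetic stand-ins for the measured bos norm, entropy, and sink rate (a spike at layer 5 followed by a sink-rate jump at layer 6):

```python
import numpy as np

def zscore(x):
    return (x - x.mean()) / x.std()

def deltas(x):
    """Layerwise changes of a z-scored trajectory: Delta_l = x_l - x_{l-1}."""
    return np.diff(zscore(np.asarray(x, dtype=float)))

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def fisher_average(rs):
    """Average correlations across models via the Fisher z-transform."""
    return float(np.tanh(np.arctanh(np.asarray(rs)).mean()))

# Synthetic example: a bos-norm spike at layer 5 coinciding with an entropy
# collapse, and a sink-rate jump one layer later.
L = 24
bos = np.ones(L); bos[5:] = 1e3
ent = np.ones(L); ent[5:] = 0.02
sink = np.zeros(L); sink[6:] = 0.9

r_same = pearson(deltas(bos), deltas(ent))           # bos norm vs entropy, same layer
r_lag = pearson(deltas(bos)[:-1], deltas(sink)[1:])  # bos norm vs sink rate, lag 1
r_avg = fisher_average([0.80, 0.90])                 # summarizing across two "models"
```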

We outline some limitations of our work. Our analysis focuses on decoder-only Transformers and primarily attributes both sinks and compression to bos-centered massive activations; models with alternative positional schemes, attention sparsity patterns, or special-token conventions (e.g., no explicit bos token, sinks at different positions, or ALiBi encodings) may exhibit different dynamics. Our causal claims rely on targeted MLP ablations on selected layers and model families; however, we observe model-dependent exceptions (e.g., sinks persisting despite decompression). Lastly, the theory assumes a single massive row, whereas real models may feature multiple interacting massive activations. As discussed in Appendix A.2, we believe this poses no harm to the overall message: a few massive activations would push the representations to a lower-dimensional subspace, but not necessarily of dimension 1.

In this section, we provide broader validation of our three-phase theory across model families and model sizes. Moreover, we expand on the empirical measurement of the metrics from our theoretical analysis and on the MLP ablations, and provide two notes on the specifics of the GPT OSS model and Gemma 7B.

To further validate our Mix-Compress-Refine theory, we observe the emergence of compression, attention sinks and massive activations in the Pythia model family (Fig. 8) and in very large models (70B–120B), specifically LLaMA3 70B, Qwen2 72B and GPT OSS 120B (Fig. 9). The prompt is a single GSM8K example. GPT OSS' particular sink patterns are explained later in this section. We believe this showcases that the correlations we observe are a universal phenomenon in LLMs.

We evaluated the training dynamics of the Pythia 410M/6.9B/12B models across multiple checkpoints (steps 1, 1k, 2k, 4k, 8k, 10k, 20k, 30k, and 143k). At each checkpoint, after every layer we recorded the entropy, bos sink rate (threshold τ=0.3\tau=0.3) and the norm of the bos token representation. The prompt was a single GSM8K prompt “Janet’s ducks lay 16 eggs...” Figures 10 and 11 illustrate the results for the Pythia 6.9B and 12B models, respectively.

We provide plots of the bounds from the theoretical discussion in Section 3.2. Figures 13 and 12 show these values for LLaMA3 8B and Pythia 410M. We show (1) the terms $M=\|\mathbf{x}_{\text{BOS}}\|^2$, $\alpha R$ and $M+R=\|\mathbf{X}\|_F^2$ from Theorem 1, (2) the top three singular values $\sigma_i^2$ and the sum $\sum_{i\geq 1}\sigma_i^2$, and (3, 4, 5) the dominance, anisotropy and entropy bounds from Corollary 2. In all cases, the bounds are tight in the middle layers. In this regime, the first singular value $\sigma_1^2$ closely follows the trajectory of $\|\mathbf{x}_{\text{BOS}}\|^2$ and dominates the rest of the singular values. The dominance then decreases steadily, especially towards the second half of the network, indicating the preparation for next-token prediction in Phase 3.

We further run the targeted MLP ablations on more models to erase the appearance of the massive activation. For LLaMA3 8B, we ablate layer 0; for Qwen2 7B, layers 3 and 4; and for Pythia 410M, layers 5–7. The results are shown in Figure 14. Interestingly, removing the massive activation always decompresses the representations; in Pythia 410M, however, it does not remove the attention sinks, which might be explained by the many architectural differences between these models.

Across models, we find that the average norm of the remaining tokens (excluding bos) grows monotonically with depth, while the bos norm grows abruptly with the massive activation, remains constant in the middle layers, and drops at the last layers. Figure 15 illustrates this process for three models. As the norms of the remaining tokens approach the bos norm, the dominance of the first singular value weakens, allowing the representations to decompress.

In the GPT OSS (Agarwal et al., 2025) family of models, each attention head is equipped with a learnable sink logit that allows it to divert probability mass away from real tokens, effectively providing a “skip” option. However, unlike the explicit (k′,v′)(k^{\prime},v^{\prime}) bias formulation studied in Sun et al. (2024); Gu et al. (2025), GPT OSS does not include a learnable value sink token. This means the model cannot encode bias information directly through the sink, and we hypothesize that it instead continues to rely on massive activations at the bos token to implement bias-like behavior and generate compression. This explains why bos sink patterns are still observed, particularly in the middle layers (see Figure 16). The alternating spikes across layers may be a consequence of GPT OSS’ alternating dense and locally banded sparse attention pattern: in layers with local attention windows, heads are less able to access bos, while in subsequent dense layers bos becomes globally visible again, producing the observed oscillatory sinkness.

Even though Gemma 7B follows the same dynamics discussed throughout this work, how it achieves them differs from the rest. Token norms in Gemma 7B start very high; instead of increasing the bos norm to create a massive activation, Gemma 7B decreases the norms of the remaining tokens to create the disparity needed for compression, then re-equalizes by increasing their norms in the late layers. We attribute the initially high norms to the embedding layer, as no other component can account for them. We believe this is also why attention patterns in Gemma 7B look somewhat different from the rest, with identity heads emerging both at the early and later layers. Figure 17 illustrates this; pre- means before each layer, while post- means after each layer.

In this section we propose and study new metrics for quantifying mixing and “sinkiness” in attention heads, and provide further validation on the FineWeb-Edu experiment from Section 4.2.

Let $\mathbf{A}$ be a lower-triangular, row-stochastic attention matrix. We define the Mixing Score as the average Shannon entropy of its rows, $H_{\text{row}}=\frac{1}{T}\sum_{i}H(\mathbf{A}_{i,:})=-\frac{1}{T}\sum_{i}\sum_{j\leq i}\alpha_{ij}\log\alpha_{ij}$. Since each row of $\mathbf{A}$ is the output of a $\mathrm{softmax}$, it is a probability distribution, so the score is well-defined. This captures how broadly each token attends to its preceding tokens. High values indicate rows close to the uniform distribution, suggesting broad mixing across tokens; low values imply rows close to one-hot vectors, suggesting very localized mixing (sinks, identity or positional heads). Figure 18 (right) shows the Mixing Score across depth for a variety of models: mixing abruptly decreases from 0.7–0.75 to 0.3–0.4 after the first few layers. Bloom 1.7B resumes mixing in the last phase because it is not capable of producing positional patterns, as it is the only model without rotary positional embeddings (Barbero et al., 2025b).
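A direct numpy transcription of the Mixing Score (the epsilon clipping is an implementation detail we add to avoid taking $\log 0$):

```python
import numpy as np

def mixing_score(A, eps=1e-12):
    """Average Shannon entropy of the rows of a causal attention matrix A.

    A is lower triangular and row-stochastic; row i is a distribution over
    positions 0..i. Higher values mean broader mixing across tokens.
    """
    P = np.clip(A, eps, None)
    H_rows = -np.sum(A * np.log(P), axis=1)   # entropy of each row
    return float(H_rows.mean())

T = 8
# Uniform causal attention: row i is uniform over its i+1 visible positions.
uniform = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]
# Pure sink head: all attention on column 0.
sink = np.zeros((T, T)); sink[:, 0] = 1.0

assert mixing_score(uniform) > mixing_score(sink)
assert abs(mixing_score(sink)) < 1e-9         # one-hot rows have zero entropy
```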

Analogously to the Mixing Score, the column sums $c'_j=\sum_i\mathbf{A}_{ij}$ capture how much attention token $j$ receives. We obtain a probability distribution by normalizing, $c_j=c'_j/\sum_i c'_i=c'_j/T$, since $\sum_i c'_i=\sum_i\sum_j\mathbf{A}_{ij}=T$ because $\mathbf{A}$ is row-stochastic. Denote by $H_{\text{col}}=-(\log T)^{-1}\sum_j c_j\log c_j\in[0,1]$ the normalized entropy of this distribution. For consistency, we define the ColSum Concentration as $C=1-H_{\text{col}}\in[0,1]$. High $C$ means a few columns receive most of the mass (sink-like); low $C$ means diffuse reception. When the sink is the bos token, the ColSum Concentration is tightly coupled to the bos sink score: for a single bos-dominated head, $C$ increases monotonically with the bos score $c_0=\tfrac{1}{T}\sum_i A_{i0}$ and is lower-bounded by the case where the remaining mass is spread uniformly across the other $T-1$ columns. In that case, $C_{\min}(c_0)=1-\left[-(\log T)^{-1}\left(c_0\log c_0+(1-c_0)\log\left(\tfrac{1-c_0}{T-1}\right)\right)\right]$, and any additional concentration on non-bos columns pushes $C$ above this curve. Analogously to the sink rate, we can define a ColSum Rate as the percentage of heads with ColSum Concentration above a given threshold. Figure 18 (left) shows the ColSum Rate ($\tau=0.3$) for different models across depth, mirroring the Sink Rate's behavior. Moreover, the scatter plots in Figure 19 show the ColSum Concentration as a function of the bos score for all heads in different models. As the bound dictates, a high bos score implies a high $C$; these are pure bos sinks. Points with high ColSum but low $c_0$ reveal heads that sink to non-bos tokens. In Pythia 410M, we observe one such outlier head, indicating a sink token different from bos.
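The ColSum Concentration admits an equally short numpy sketch (again on synthetic attention maps):

```python
import numpy as np

def colsum_concentration(A, eps=1e-12):
    """ColSum Concentration C = 1 - H_col for a row-stochastic T x T matrix A."""
    T = A.shape[0]
    c = A.sum(axis=0) / T                     # normalized column sums (sum to 1)
    H_col = -np.sum(c * np.log(np.clip(c, eps, None))) / np.log(T)
    return 1.0 - H_col

T = 16
# Pure bos sink: every query attends only to column 0 -> maximal concentration.
sink = np.zeros((T, T)); sink[:, 0] = 1.0
# Uniform causal head: diffuse reception across columns.
uniform = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]

assert abs(colsum_concentration(sink) - 1.0) < 1e-9
assert colsum_concentration(sink) > colsum_concentration(uniform)
```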

Each of our diagnostics highlights one axis of behavior while missing others. The ColSum Concentration $C=1-H_{\text{col}}$ is effective at flagging sinks, where one column dominates, but it assigns zero score to identity heads and very low scores to perfectly uniform heads. Conversely, the Average Row Entropy $H_{\text{row}}$ measures the sparsity of rows, distinguishing diffuse mixing from one-hot attention, but it cannot differentiate which sharp pattern occurs: sinks, identities, and previous-token heads all have similarly low row entropy. Thus neither metric alone fully separates the regimes of interest. In principle, one could combine them into a scalar $\mathrm{Mix2D}(\alpha)=\alpha C+(1-\alpha)H_{\text{row}}$, where, for a suitable choice of $\alpha$, sinks would map near 1, perfectly uniform heads near 0, and identities near 0.5. This would give a single axis interpolating between mixing, sinkness, and identity. In practice, however, we did not find this construction very informative and thus did not include it.

In Phase 3, we observe attention patterns changing to more localized, sharp ones, including identity-like heads, previous-token heads and hybrid sink-identity heads. We quantify this transition using the sink-versus-identity index, defined as $\mathrm{SVI}=B/(B+D)$, where $B$ is the average attention to the bos token and $D=\frac{1}{T}\sum_i\mathbf{A}_{ii}$ is the average diagonal attention, so that $B+D=\frac{1}{T}\sum_{i\geq 1}(\mathbf{A}_{i0}+\mathbf{A}_{ii})\in[0,1]$. Figure 20 plots each head as a 2D point $(\mathrm{SVI},B+D)$, with color corresponding to its layer. Early heads tend to have low $B+D$, indicating that no attention is allocated to the bos token or the diagonal. As depth progresses, heads move toward high $B+D$ and high $\mathrm{SVI}$, indicating a strong sink presence. Moreover, the middle-to-late layers also tend to show identity patterns or hybrid sink-identity patterns.
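A minimal numpy sketch of the SVI computation; here we exclude the first query row when accumulating $B$ and $D$, an implementation choice we assume so that bos and diagonal mass never overlap and $B+D$ stays in $[0,1]$:

```python
import numpy as np

def svi(A):
    """Sink-versus-identity index SVI = B / (B + D) for a T x T attention map.

    Row 0 is excluded from both terms: for the first query, bos attention and
    diagonal attention are the same entry, so including it would double count.
    """
    T = A.shape[0]
    B = A[1:, 0].sum() / T          # attention mass on the bos column
    D = np.diag(A)[1:].sum() / T    # attention mass on the diagonal
    return B / (B + D), B + D

T = 8
identity = np.eye(T)                              # identity head
sink = np.zeros((T, T)); sink[:, 0] = 1.0         # pure bos sink head

svi_sink, mass_sink = svi(sink)       # SVI near 1: all mass is bos attention
svi_id, mass_id = svi(identity)       # SVI near 0: all mass is diagonal
```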

Figures 21 and 22 show the FineWeb-Edu experiment for the Bloom 1.7B and Qwen2 7B models. The trend is clear: regardless of the input, the models do not allocate attention to the bos token until the massive activation emerges. The number of sinks present in the middle layers is input-dependent; the amount of mixing performed in the early layers is not.

In this section we provide more details on the experiments and results presented in Section 5 of the main text.

We applied the LogitLens to WikiText-2 by passing each batch of tokenized blocks through a frozen backbone and, for every layer, projecting that layer's hidden states to vocabulary logits using the model's tied unembedding head. For each layer $\ell$, we computed the next-token cross-entropy loss and perplexity (as shown in Figure 7 of the main text), as well as the mean token entropy of the softmaxed logits (Ali et al., 2025), as shown in Figure 23. We take this entropy as a proxy for the model's confidence over the next token, and observe that it also decreases more rapidly towards Phase 3. In addition to next-token prediction, we extended the LogitLens evaluation to multiple-choice QA benchmarks (ARC Easy, ARC Challenge, HellaSwag, WinoGrande), where the model must select among a small set of candidate answers. For each layer, we applied the final layer norm and projected the embeddings with the tied unembedding head, using LM Evaluation Harness to score and record the accuracies. This allows us to compare how representations at different depths support generation-style (next-token) and selection-style (multiple-choice) reasoning. Figure 24 shows that MCQ performance remains relatively flat through the compression valley of Phase 2 and begins improving at roughly 50% of the network depth, underscoring that reasoning tasks require both compression and late-layer specialization. For completeness, we also ran the experiments with five-shot prompting for each dataset; this boosted the final accuracies but did not influence the overall behavior observed.
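Stripped of model specifics, the core of the LogitLens evaluation is a per-layer projection of hidden states through the tied unembedding matrix, followed by a next-token cross entropy. A minimal numpy sketch with synthetic hidden states (all shapes and names here are illustrative assumptions, not taken from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, d, V = 4, 10, 32, 50          # layers, tokens, hidden size, vocab size

hidden = rng.standard_normal((L, T, d))   # hidden states at every layer
W_U = rng.standard_normal((d, V))         # tied unembedding matrix
targets = rng.integers(0, V, size=T - 1)  # next-token targets

def layer_perplexity(h, W_U, targets):
    """Project one layer's hidden states to logits and score next-token prediction."""
    T = h.shape[0]
    logits = h @ W_U                                    # (T, V)
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Positions 0..T-2 predict tokens 1..T-1.
    nll = -logp[np.arange(T - 1), targets].mean()
    return float(np.exp(nll))

ppl_per_layer = [layer_perplexity(hidden[l], W_U, targets) for l in range(L)]
```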

TunedLens (Belrose et al., 2023) is a refinement of the LogitLens technique that trains a small affine transformation per layer to map hidden states onto the vocabulary, instead of using the model's own unembedding layer. To further validate our LogitLens experiment on the MCQ datasets, we also used the LM Evaluation Harness to run the TunedLens for Pythia 410M, Pythia 6.9B and LLaMA3 8B with the pretrained lenses made available by Belrose et al. (2023). We include the results in Figure 25 for completeness; however, we do not observe meaningful differences in the layerwise behavior with respect to the LogitLens.

To further validate the results of Section 5 and those proposed by Skean et al. (2025), we run a standard linear-probing experiment. Probes are trained independently per layer, with backbone parameters fixed, using a learning rate of $5\times 10^{-4}$, one epoch, a maximum length of 1024 and batch sizes of 16–32. We train probes for the backbones Pythia 410M, Pythia 6.9B, LLaMA3 8B, Qwen2 7B and Gemma 7B. Figure 26 shows the results. As discussed, across models and datasets, accuracy peaks in the middle layers. These results suggest that the linear features relevant for classification emerge transiently in the compressed middle representations, while the late layers are repurposed for generative refinement. Moreover, we run 32 MTEB tasks for the same models and report the average main score across tasks in Figure 27.
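In its simplest form, a linear probe fits a linear classifier on frozen features. The sketch below uses a closed-form ridge regression on one-hot labels over synthetic features, rather than the SGD-trained probes of the actual experiment, purely to illustrate the setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 32, 4                       # samples, feature dim, classes

# Synthetic "frozen layer features": class-dependent means plus unit noise.
y = rng.integers(0, k, size=n)
means = 3.0 * rng.standard_normal((k, d))
X = means[y] + rng.standard_normal((n, d))

# Closed-form ridge probe on one-hot targets; the backbone (here, the feature
# generator) stays fixed, and only the linear map W is fit.
Y = np.eye(k)[y]
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

pred = (X @ W).argmax(axis=1)
train_acc = float((pred == y).mean())      # should be high for separable classes
```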

Attention sinks and compression valleys emerge simultaneously when bos tokens develop massive activations. Normalized entropy (left), bos sink rate (middle), and bos token norm (right) across layers for six models evaluated on GSM8K. All three phenomena align precisely: when bos norms spike by factors of $10^3$–$10^4$ (right panel), entropy drops below 0.5 bits (left) and sink rates surge to near 1.0 (middle), confirming our unified mechanism hypothesis.

The coupled emergence of massive activations, compression, and sinks develops early in training and persists. Evolution of normalized entropy (left), sink rate (middle), and bos norm (right) across training checkpoints (1–143k steps) for Pythia 410M. All three phenomena emerge together around step 1k and remain synchronized throughout training, indicating this organization is learned early.


Removing massive activations eliminates both compression and attention sinks, confirming causality. Ablating the MLP contribution to the bos token at layer 0 in LLaMA3 8B has three effects: (Left) Entropy remains at ~0.5 bits instead of dropping to 0.02, showing decompression. (Middle) Sink rate stays at 0 throughout depth, confirming no attention sink formation. (Right) bos norm (orange) remains comparable to the rest of the tokens (grey) instead of spiking by $10^3\times$. This causal intervention validates that massive activations drive both phenomena.

Attention patterns transform from diffuse mixing to sinks to positional focus across depth. Evolution of attention patterns in Pythia 410M showing representative heads at layers 0, 16, and 23. Early layers exhibit diffuse attention enabling broad information mixing. Middle layers show sink patterns that halt mixing. Late layers display sharp positional patterns for selective refinement.

Embedding tasks peak during compression while generation requires full refinement, revealing distinct computational objectives. (Left, Middle) Perplexity on WikiText-2 and multiple-choice QA accuracy on ARC Easy via LogitLens generally do not improve significantly until ~50% depth, then improve steadily (perplexity falls, accuracy rises) through Phase 3. (Right) Linear probe test accuracy on the same task peaks at 25–75% depth (Phase 2) and declines thereafter. This divergence demonstrates that embedding-relevant features concentrate in compressed middle layers, while generation tasks require full depth for token-specific predictions.

Entropy, sink rate and bos norm for the Pythia family of models.

Theoretical bounds for Pythia 410M.

Token norms in Gemma 7B. The bos norm is high from the very first layer.

Left: ColSum Rate ($\tau=0.3$) across depth for different models. Right: Mixing Score across depth for different models, averaged across heads per layer. The ColSum Rate increases with the massive activations, similar to the bos sink rate, while the Mixing Score drops abruptly after the first few layers.

LogitLens accuracies on multiple-choice question datasets.

Linear probing validation accuracies.

Average main score across 32 MTEB tasks.

Gemma 7B heads for an example prompt.

$$ H(\mathbf{X}) = -\sum_{j=1}^{r} p_j \log p_j, \quad \text{where } p_j = \sigma_j^2/\|\mathbf{X}\|_F^2 $$

$$ \text{sink-score}_k^{(\ell,h)} = \frac{1}{T}\sum_{t=0}^{T-1}\alpha_{tk}^{(\ell,h)}, \qquad \text{sink-rate}_k^{(\ell)} = \frac{1}{H}\sum_{h=1}^{H}\mathbb{I}\left(\text{sink-score}_k^{(\ell,h)}\geq \tau\right) $$
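These definitions translate directly into code. A minimal numpy sketch on synthetic attention maps, with $\tau=0.3$ as in our experiments:

```python
import numpy as np

def sink_score(A, k=0):
    """Average attention that all queries allocate to key position k."""
    return float(A[:, k].mean())

def sink_rate(attn_heads, k=0, tau=0.3):
    """Fraction of heads in a layer whose sink score at position k exceeds tau."""
    scores = np.array([sink_score(A, k) for A in attn_heads])
    return float((scores >= tau).mean())

T = 16
# Pure bos sink head: every query puts all mass on position 0.
sink_head = np.zeros((T, T)); sink_head[:, 0] = 1.0
# Diffuse causal head: row i uniform over its i+1 visible positions.
diffuse_head = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]

# One sink head out of three -> sink rate of 1/3 at threshold 0.3.
rate = sink_rate([sink_head, diffuse_head, diffuse_head])
```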

$$ \sigma_1^2 \geq M + \alpha R, $$

$$ R(\mathbf{A},\mathbf{x}) = \frac{\mathbf{x}^{\top}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\top}\mathbf{x}} $$

$$ R(\mathbf{A},\mathbf{x})=\frac{(\mathbf{x}^{\top}\mathbf{Q}^{\top})\boldsymbol{\Lambda}(\mathbf{Q}\mathbf{x})}{\mathbf{x}^{\top}\mathbf{x}}=\frac{\mathbf{y}^{\top}\boldsymbol{\Lambda}\mathbf{y}}{\mathbf{y}^{\top}(\mathbf{Q}\mathbf{Q}^{\top})\mathbf{y}}=\frac{\mathbf{y}^{\top}\boldsymbol{\Lambda}\mathbf{y}}{\mathbf{y}^{\top}\mathbf{y}}=\frac{\sum_{i=1}^{n}\lambda_{i}y_{i}^{2}}{\sum_{i=1}^{n}y_{i}^{2}}=\sum_{i=1}^{n}\lambda_{i}\left(\frac{y_{i}^{2}}{\sum_{j}y_{j}^{2}}\right). $$

$$ \sigma_1^2 \geq \frac{\mathbf{x}_0^{\top}\mathbf{X}^{\top}\mathbf{X}\mathbf{x}_0}{\mathbf{x}_0^{\top}\mathbf{x}_0} = \frac{\|\mathbf{X}\mathbf{x}_0\|^2}{\|\mathbf{x}_0\|^2} = \frac{1}{\|\mathbf{x}_0\|^2}\sum_{i\geq 0} \langle \mathbf{x}_i,\mathbf{x}_0 \rangle^2 = \|\mathbf{x}_0\|^2 + \sum_{i\neq 0}\frac{\langle \mathbf{x}_i,\mathbf{x}_0 \rangle^2}{\|\mathbf{x}_0\|^2}. $$

$$ \sigma_1^2 \geq \|\mathbf{x}_0\|^2 + \sum_{i\neq 0}\|\mathbf{x}_i\|^2 \cos^2\theta_i = M + \alpha R. $$

$$ \frac{\sigma_1^2}{\sum_{j\geq 2}\sigma_j^2}\geq \frac{M + \alpha R}{\|\mathbf{X}\|_F^2 - (M + \alpha R)}= \frac{M+\alpha R}{(1-\alpha)R} = \frac{\frac{M}{R} + \alpha}{1 - \alpha} = \frac{c + \alpha}{1 - \alpha}. $$

$$ H(\mathbf{X}) \leq -p\log p - (1-p)\log(1-p) + (1-p)\log(r-1), $$

$$ H(\mathbf{X}) = -p_1\log p_1 - \sum_{j=2}^{r} p_j\log p_j \leq -p\log p - \sum_{j=2}^{r} p_j\log p_j, $$

$$ \Delta \tilde{b}_\ell = \tilde{b}_\ell - \tilde{b}_{\ell-1}, \quad \Delta \tilde{e}_\ell = \tilde{e}_\ell - \tilde{e}_{\ell-1}, \quad \Delta \tilde{s}_\ell = \tilde{s}_\ell - \tilde{s}_{\ell-1}. $$


Theorem 1 (Massive Activations Induce Spectral Dominance). Let $M=\|\mathbf{x}_0\|^2$, $R=\sum_{i\neq 0}\|\mathbf{x}_i\|^2$, and let $\theta_i$ be the angle between $\mathbf{x}_0$ and $\mathbf{x}_i$. Define the alignment term $\alpha=\frac{1}{R}\sum_{i\neq 0}\|\mathbf{x}_i\|^2\cos^2\theta_i\in[0,1]$. Then
$$ \sigma_1^2\geq M+\alpha R, \tag{3} $$
where $\sigma_1$ is the largest singular value of $\mathbf{X}$.

Corollary 2 (Compression Bounds). Let $c=M/R$ be the norm ratio and $p=(c+\alpha)/(c+1)$. Then:
1. Dominance: $\sigma_1^2/\sum_{j\geq 2}\sigma_j^2\geq(c+\alpha)/(1-\alpha)$
2. Anisotropy: $p_1\geq p$
3. Entropy: $H(\mathbf{X})\leq -p\log p-(1-p)\log(1-p)+(1-p)\log(r-1)$

Lemma 3 (Rayleigh quotient). Let $\mathbf{A}$ be a real symmetric $n\times n$ matrix with (real) eigenvalues $\lambda_{\max}=\lambda_1\geq\lambda_2\geq\ldots\geq\lambda_n=\lambda_{\min}$ and let
$$ R(\mathbf{A},\mathbf{x})=\frac{\mathbf{x}^{\top}\mathbf{A}\mathbf{x}}{\mathbf{x}^{\top}\mathbf{x}} $$
denote the Rayleigh quotient for $\mathbf{A}$ and a real non-zero vector $\mathbf{x}\in\mathbb{R}^n$. Then $R(\mathbf{A},\mathbf{x})\in[\lambda_{\min},\lambda_{\max}]$, achieving each bound at the corresponding eigenvectors $\mathbf{v}_{\min},\mathbf{v}_{\max}$.


$$ \mathbf{z}^{(\ell,h)}_i = \sum_{j\leq i}\alpha_{ij}^{(\ell,h)}\mathbf{W}^{(\ell,h)}\mathbf{x}^{(\ell)}_j, \quad \text{with }\ \alpha_{ij}^{(\ell,h)} = \frac{\exp\left(k\left(\mathbf{q}_i^{(\ell,h)},\mathbf{k}_j^{(\ell,h)},\mathbf{p}_{ij}\right)\right)}{\sum_{w\leq i}\exp\left(k\left(\mathbf{q}_i^{(\ell,h)},\mathbf{k}_w^{(\ell,h)},\mathbf{p}_{iw}\right)\right)} $$
$$ \mathbf{z}^{(\ell)}_i = \mathbf{W}^{(\ell)}\bigoplus_{h\in H}\mathbf{z}_i^{(\ell,h)} + \mathbf{x}^{(\ell)}_i, \qquad \mathbf{x}_i^{(\ell+1)} = \psi^{(\ell)}\left(\mathbf{z}_i^{(\ell)}\right) + \mathbf{z}_i^{(\ell)} $$


$$ -\sum_{j=2}^{r} p_j\log p_j \leq -\sum_{j=2}^{r} \frac{1-p}{r-1}\log\left(\frac{1-p}{r-1}\right) = -(1-p)\log\left(\frac{1-p}{r-1}\right). $$


Proof (sketch). By the variational characterization of singular values, $\sigma_1^2=\max_{\|\mathbf{v}\|=1}\|\mathbf{X}\mathbf{v}\|^2$. Choosing $\mathbf{v}=\mathbf{x}_0/\|\mathbf{x}_0\|$ and expanding $\|\mathbf{X}\mathbf{v}\|^2$ yields the bound. Full proofs and discussion are given in the Appendix.

Proof. Let $\rmA=\rmQ^\top\Lambda \rmQ$ be the diagonalization of $\rmA$ in the eigenbasis given by the $\rvv_i$ and let $\rvy = \rmQ\rvx$, such that $\rvx=\rmQ^\top\rvy$ for $\rmQ$ is orthogonal (i.e. $\rmQ^\top=\rmQ^{-1})$. Then, [ R(\rmA,\rvx) = (\rvx^\top\rmQ^\top)\boldsymbol{\Lambda (\rmQ\rvx)}{\rvx^\top\rvx} = \rvy^\top\boldsymbol{\Lambda \rvy}{\rvy^\top(\rmQ\rmQ^\top)\rvy} = \rvy^\top\boldsymbol{\Lambda \rvy}{\rvy^\top\rvy} = \sum_{i=1^n \lambda_iy_i^2}{\sum_{i=1}^ny_i^2} = \sum_{i=1}^n \lambda_i \left(y_i^2{\sum_j y_j^2}\right). ] Since the weights $w_i = y_i^2/\sum y_j^2$ satisfy $w_i\geq 0$ and $\sum_i w_i = 1$, $R(\rmA,\rvx)$ is a convex linear combination of the eigenvalues and therefore $\lambda_{\min}\leq R(\rmA,\rvx)\leq \lambda_{\max}$, with equalities when $\rvx=\rvv_{\max},\rvv_{\min}$.

Proof. By definition of the singular value (also see Lemma lemma:rayleigh), [ \sigma_1^2 \geq \rvx_0^\top \rmX^\top \rmX\rvx_0{\rvx_0^\top \rvx_0} = ||\rmX\rvx_0||^2{||\rvx_0||^2} = 1{||\rvx_0||^2}\sum_{i=0} \langle \rvx_i,\rvx_0 \rangle^2 = ||\rvx_0||^2 + \sum_{i\neq 0}\langle \rvx_i,\rvx_0 \rangle^2{||\rvx_0||^2}. ] Using $\langle \rvx_i,\rvx_0\rangle^2 = ||\rvx_0||^2||\rvx_i||^2\cos^2\theta_i$, we obtain [ \sigma_1^2 \ge |\rvx_0|^2 + \sum_{i\ne 0}|\rvx_i|^2\cos^2\theta_i ] Since (\alpha R=\sum_{i\ne 0}|\rvx_i|^2\cos^2\theta_i), we get $\sigma_1^2\ge M+\alpha R$, which is the desired result. \qedhere

Proof. From Theorem thm:rayleigh-alpha, $\sigma_1^2\geq M+\alpha R$. Moreover, using $\|\rmX\|_F^2 = M + R$,
\[ \sum_{j\geq 2} \sigma_j^2 = \|\rmX\|_F^2 - \sigma_1^2 \leq \|\rmX\|_F^2 - (M+\alpha R) = R - \alpha R = (1-\alpha)R. \]
Therefore,
\[ \frac{\sigma_1^2}{\sum_{j\geq 2}\sigma_j^2} \geq \frac{M + \alpha R}{\|\rmX\|_F^2 - (M + \alpha R)} = \frac{M+\alpha R}{(1-\alpha)R} = \frac{M/R + \alpha}{1 - \alpha} = \frac{c + \alpha}{1 - \alpha}. \]

Proof. Write
\[ H(\rmX) = -p_1\log p_1 - \sum_{j=2}^r p_j\log p_j \leq -p\log p - \sum_{j=2}^r p_j\log p_j, \]
so it remains to bound the second term, which is the entropy of $r-1$ masses summing to $1 - p_1 \leq 1 - p$. This term is maximized when the mass is distributed uniformly, that is, $p_j = \frac{1 - p_1}{r-1} \leq \frac{1-p}{r-1}$. Therefore,
\[ - \sum_{j=2}^r p_j\log p_j \leq -\sum_{j=2}^r \frac{1-p}{r-1}\log\left(\frac{1-p}{r-1}\right) = -(1-p)\log \left(\frac{1-p}{r-1}\right) = -(1-p)\log(1-p) + (1-p)\log(r-1). \]
The result follows by combining the two bounds.
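The key step, that for a fixed total tail mass the uniform distribution maximizes the tail entropy, can be illustrated with a small numeric comparison (the specific values of $p_1$, $r$, and the skewed tail are an arbitrary example, not from the paper):

```python
import math

# Fix p1 and the tail mass 1 - p1 spread over r - 1 entries.
p1, r = 0.9, 5
tail_mass = 1.0 - p1

def tail_entropy(ps):
    # Entropy contribution of the tail masses (0 log 0 := 0).
    return -sum(p * math.log(p) for p in ps if p > 0)

uniform = [tail_mass / (r - 1)] * (r - 1)
skewed = [0.07, 0.02, 0.005, 0.005]   # a different tail with the same total mass
assert abs(sum(skewed) - tail_mass) < 1e-12

# Uniform tail dominates any other tail with the same mass.
assert tail_entropy(skewed) <= tail_entropy(uniform) + 1e-12
```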


References

[confNEU = "Advances in Neural Information Processing Systems (NeurIPS)"} @string{confNIPS = "Advances in Neural Information Processing Systems (NeurIPS)"} @string{confICLR = "International Conference on Learning Representations (ICLR)"} @string{confICML = "International Conference on Machine Learning (ICML)"} @string{confEMNLP = "Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)"}

@incollection{Bengio+chapter2007] Bengio, Yoshua, LeCun, Yann. (2007). Scaling Learning Algorithms Towards {AI. Large Scale Kernel Machines.

[Hinton06] Hinton, Geoffrey E., Osindero, Simon, Teh, Yee Whye. (2006). A Fast Learning Algorithm for Deep Belief Nets. Neural Computation.

[goodfellow2016deep] Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, Bengio, Yoshua. (2016). Deep learning.

[razzhigaev2024transformersecretlylinear] Razzhigaev, Anton, Mikhalchuk, Matvey, Goncharova, Elizaveta, Gerasimenko, Nikolai, Oseledets, Ivan, Dimitrov, Denis, Kuznetsov, Andrey. (2024). Your transformer is secretly linear. arXiv preprint arXiv:2405.12250.

[layerbylayer] Skean, Oscar, Arefin, Md Rifat, Zhao, Dan, Patel, Niket, Naghiyev, Jalal, LeCun, Yann, Shwartz-Ziv, Ravid. (2025). Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013.

[razzhigaev2024shapelearninganisotropyintrinsic] Razzhigaev, Anton, Mikhalchuk, Matvey, Goncharova, Elizaveta, Oseledets, Ivan, Dimitrov, Denis, Kuznetsov, Andrey. (2023). The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. arXiv preprint arXiv:2311.05928.

[fodor1988connectionism] Fodor, Jerry A, Pylyshyn, Zenon W, others. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition.

[marcus1998rethinking] Marcus, Gary F. (1998). Rethinking eliminative connectionism. Cognitive psychology.

[newman2020eos] Newman, Benjamin, Hewitt, John, Liang, Percy, Manning, Christopher D. (2020). The {EOS.

[hupkes2019compositionality] Hupkes, Dieuwke, Dankers, Verna, Mul, Mathijs, Bruni, Elia. (2020). Compositionality Decomposed: How do Neural Networks Generalise?. Journal of Artificial Intelligence Research.

[lake2017generalization] Daniel Keysers, Nathanael Sch{. (2020). Measuring Compositional Generalization: A Comprehensive Method on Realistic Data.

[kim2020cogs] Kim, Najoung, Linzen, Tal. (2020). {COGS.

[lake2019compositional] Lake, Brenden M. (2019). Compositional generalization through meta sequence-to-sequence learning.

[li2019compositional] Li, Yuanpeng, Zhao, Liang, Wang, Jianyu, Hestness, Joel. (2019). Compositional Generalization for Primitive Substitutions. Proc. Conf. on Empirical Methods in Natural Language Processing and Int.Joint Conf. on Natural Language Processing (EMNLP-IJCNLP).

[korrel2019transcoding] Korrel, Kris, Hupkes, Dieuwke, Dankers, Verna, Bruni, Elia. (2019). Transcoding Compositionally: Using Attention to Find More Generalizable Solutions. Proc. BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, ACL.

[russin2019compositional] Russin, Jake, Jo, Jason, O'Reilly, Randall C, Bengio, Yoshua. (2019). Compositional generalization in a deep seq2seq model by separating syntax and semantics. Preprint arXiv:1904.09708.

[chen2020compositional] Chen, Xinyun, Liang, Chen, Yu, Adams Wei, Song, Dawn, Zhou, Denny. (2020). Compositional Generalization via Neural-Symbolic Stack Machines.

[liu2020compositional] Liu, Qian, An, Shengnan, Lou, Jian-Guang, Chen, Bei, Lin, Zeqi, Gao, Yan, Zhou, Bin, Zheng, Nanning, Zhang, Dongmei. (2020). Compositional Generalization by Learning Analytical Expressions.

[furrer2020compositional] Furrer, Daniel, van Zee, Marc, Scales, Nathan, Sch{. (2020). Compositional generalization in semantic parsing: Pre-training vs. specialized architectures. Preprint arXiv:2007.08970.

[vaswani2017attention] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. (2017). Attention is All you Need.

[dehghani2019universal] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Lukasz Kaiser. (2019). Universal {T.

[bahdanau2015neural] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. (2015). Neural Machine Translation by Jointly Learning to Align and Translate.

[hochreiter1997long] Hochreiter, Sepp, Schmidhuber, J{. (1997). Long short-term memory. Neural computation.

[dai2019transformer] Dai, Zihang, Yang, Zhilin, Yang, Yiming, Carbonell, Jaime G, Le, Quoc, Salakhutdinov, Ruslan. (2019). Transformer-{XL.

[nakkiran2019deep] Nakkiran, Preetum, Kaplun, Gal, Bansal, Yamini, Yang, Tristan, Barak, Boaz, Sutskever, Ilya. (2019). Deep Double Descent: Where Bigger Models and More Data Hurt.

[glorot2010understanding] Glorot, Xavier, Bengio, Yoshua. (2010). Understanding the difficulty of training deep feedforward neural networks.

[he2015delving] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, Sun, Jian. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.

[csordas2021are] R{'o. (2021). Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks.

[saxton2018analysing] David Saxton, Edward Grefenstette, Felix Hill, Pushmeet Kohli. (2019). Analysing Mathematical Reasoning Abilities of Neural Models.

[graves2016adaptive] Graves, Alex. (2016). Adaptive Computation Time for Recurrent Neural Networks.

[andreas2020good] Andreas, Jacob. (2020). Good-Enough Compositional Data Augmentation.

[gordon2020permutation] Jonathan Gordon, David Lopez-Paz, Marco Baroni, Diane Bouchacourt. (2020). Permutation Equivariant Models for Compositional Generalization in Language.

[dessi2019cnns] Dess{`\i. (2019). {CNN.

[nye2020learning] Nye, Maxwell I, Solar-Lezama, Armando, Tenenbaum, Joshua B, Lake, Brenden M. (2020). Learning compositional rules via neural program synthesis. Preprint arXiv:2003.05562.

[herzig2020span] Herzig, Jonathan, Berant, Jonathan. (2020). Span-based semantic parsing for compositional generalization. Preprint arXiv:2009.06040.

[guo2020hierarchical] Guo, Yinuo, Lin, Zeqi, Lou, Jian-Guang, Zhang, Dongmei. (2020). Hierarchical Poset Decoding for Compositional Generalization in Language.

[huang2020improving] Huang, Xiao Shi, Perez, Felipe, Ba, Jimmy, Volkovs, Maksims. (2020). Improving {Transformer.

[schlag2019enhancing] Schlag, Imanol, Smolensky, Paul, Fernandez, Roland, Jojic, Nebojsa, Schmidhuber, J{. (2019). Enhancing the {Transformer. Preprint arXiv:1910.06611.

[loula2018rearranging] Loula, Jo{~a. (2018). Rearranging the Familiar: Testing Compositional Generalization in Recurrent Networks. BlackboxNLP@ EMNLP.

[shaw2018selfattention] Peter Shaw, Jakob Uszkoreit, Ashish Vaswani. (2018). Self-Attention with Relative Position Representations.

[shaw2021compositional] Shaw, Peter, Chang, Ming-Wei, Pasupat, Panupong, Toutanova, Kristina. (2021). Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both?.

[bahdanau2019closure] Bahdanau, Dzmitry, de Vries, Harm, O'Donnell, Timothy J, Murty, Shikhar, Beaudoin, Philippe, Bengio, Yoshua, Courville, Aaron. (2019). {CLOSURE. ViGIL workshop, NeurIPS.

[greff2020binding] Greff, Klaus, van Steenkiste, Sjoerd, Schmidhuber, J{. (2020). On the Binding Problem in Artificial Neural Networks. Preprint arXiv:2012.05208.

[NIPS2014_5346] Sutskever, Ilya, Vinyals, Oriol, Le, Quoc V. (2014). Sequence to Sequence Learning with Neural Networks.

[graves2012sequence] Graves, Alex. (2012). Sequence transduction with recurrent neural networks. Workshop on Representation Learning, ICML.

[zhang2019improving] Zhang, Biao, Titov, Ivan, Sennrich, Rico. (2019). Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention. Proc. Conf. on Empirical Methods in Natural Language Processing and Int.Joint Conf. on Natural Language Processing (EMNLP-IJCNLP).

[zhu2021gradinit] Zhu, Chen, Ni, Renkun, Xu, Zheng, Kong, Kezhi, Huang, W Ronny, Goldstein, Tom. (2021). GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training. Preprint arXiv:2102.08098.

[roelofs2019measuring] Roelofs, Rebecca. (2019). Measuring Generalization and overfitting in Machine learning.

[zhang2017understanding] Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, Vinyals, Oriol. (2017). Understanding deep learning requires rethinking generalization.

[jacot2018neural] Jacot, Arthur, Hongler, Cl{'e. (2018). Neural Tangent Kernel: Convergence and Generalization in Neural Networks.

[paszke2019pytorch] Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, Desmaison, Alban, Kopf, Andreas, Yang, Edward, DeVito, Zachary, Raison, Martin, Tejani, Alykhan, Chilamkurthy, Sasank, Steiner, Benoit, Fang, Lu, Bai, Junjie, Chintala, Soumith. (2019). {PyTorch.

[charton2021learning] Francois Charton, Amaury Hayat, Guillaume Lample. (2021). Learning advanced mathematical computations from examples.

[trask2018neural] Trask, Andrew, Hill, Felix, Reed, Scott E, Rae, Jack, Dyer, Chris, Blunsom, Phil. (2018). Neural Arithmetic Logic Units.

[kaiser2015neural] Lukasz Kaiser, Ilya Sutskever. (2016). Neural {GPU.

[liska2018memorize] Adam Liska, Germ{'{a. (2018). Memorize or generalize? {S. AEGAP Workshop ICML.

[ontanon2021making] Santiago Onta{~{n. (2022). Making Transformers Solve Compositional Tasks.

[fodor1990connectionism] Fodor, Jerry, McLaughlin, Brian P. (1990). Connectionism and the problem of systematicity: Why {S. Cognition.

[herzig2021unlocking] Herzig, Jonathan, Shaw, Peter, Chang, Ming-Wei, Guu, Kelvin, Pasupat, Panupong, Zhang, Yuan. (2021). Unlocking Compositional Generalization in Pre-trained Models Using Intermediate Representations. Preprint arXiv:2104.07478.

[csordas2019improving] R{'{o. (2019). Improving {D.

[graves2016hybrid] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska{-. (2016). Hybrid computing using a neural network with dynamic external memory. Nature.

[freivalds2019neural] Karlis Freivalds, Emils Ozolins, Agris Sostaks. (2019). Neural Shuffle-Exchange Networks - Sequence Processing in {O.

[irie2021going] Kazuki Irie, Imanol Schlag, R{'{o. (2021). Going Beyond Linear {T. Preprint arXiv:2106.06295.

[schmidhuber1992learning] Imanol Schlag, Kazuki Irie, J{. (2021). Linear Transformers Are Secretly Fast Weight Programmers. Neural Computation.

[csordas2021devil] R'obert Csord'as, Kazuki Irie, J. (2021). The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of {T. Proc. Conf. on Empirical Methods in Natural Language Processing (EMNLP).

[conklin2021meta] Henry Conklin, Bailin Wang, Kenny Smith, Ivan Titov. (2021). Meta-Learning to Compositionally Generalize.

[dubois2020location] Yann Dubois, Gautier Dagan, Dieuwke Hupkes, Elia Bruni. (2020). Location Attention for Extrapolation to Longer Sequences.

[andreas2016neural] Andreas, Jacob, Rohrbach, Marcus, Darrell, Trevor, Klein, Dan. (2016). Neural Module Networks.

[kirsch2018modular] Kirsch, Louis, Kunze, Julius, Barber, David. (2018). Modular Networks: Learning to Decompose Neural Computation.

[chang2018automatically] Michael Chang, Abhishek Gupta, Sergey Levine, Thomas L. Griffiths. (2019). Automatically Composing Representation Transformations as a Means for Generalization.

[hudson2018compositional] Drew A. Hudson, Christopher D. Manning. (2018). Compositional Attention Networks for Machine Reasoning.

[weiss21] Gail Weiss, Yoav Goldberg, Eran Yahav. (2021). Thinking Like {T.

[parisotto2020stabilizing] Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, {\c{C. (2020). Stabilizing {T.

[chaabouni2021can] Chaabouni, Rahma, Dess{`\i. (2021). Can Transformers Jump Around Right in Natural Language? {A. Preprint arXiv:2107.01366.

[dauphin2017lm] Dauphin, Yann N, Fan, Angela, Auli, Michael, Grangier, David. (2017). Language Modeling with Gated Convolutional Networks.

[banino2021pondernet] Andrea Banino, Jan Balaguer, Charles Blundell. (2021). Ponder{N. Preprint arXiv:2107.05407.

[hupkes2018learning] Hupkes, Dieuwke, Singh, Anand, Korrel, Kris, Kruszewski, German, Bruni, Elia. (2019). Learning compositionally through attentive guidance. Proc. Int. Conf. on Computational Linguistics and Intelligent Text Processing.

[loshchilov2019decoupled] Cheng{-. (2019). Music Transformer: Generating Music with Long-Term Structure.

[gehring2017convolutional] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin. (2017). Convolutional Sequence to Sequence Learning.

[srivastava2015training] Srivastava, Rupesh K, Greff, Klaus, Schmidhuber, J{. (2015). Training very deep networks.

[hanson1990stochastic] Hanson, Stephen Jos{'e. (1990). A stochastic version of the delta rule. Physica D: Nonlinear Phenomena.

[srivastava2014dropout] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, Salakhutdinov, Ruslan. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research.

[nangia2018listops] Nikita Nangia, Samuel R. Bowman. (2018). List{O.

[havrylov2019cooperative] Serhii Havrylov, Germ{'{a. (2019). Cooperative Learning of Disjoint Syntax and Semantics.

[tay2021long] Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, Donald Metzler. (2021). Long {R.

[ShenTHLSC19] Yikang Shen, Shawn Tan, Seyed Arian Hosseini, Zhouhan Lin, Alessandro Sordoni, Aaron C. Courville. (2019). Ordered Memory.

[DessiB19] Roberto Dess{`{\i. (2019). {CNN.

[BrooksRLS21] Ethan A. Brooks, Janarthanan Rajendran, Richard L. Lewis, Satinder Singh. (2021). Reinforcement Learning of Implicit and Explicit Control Flow Instructions.

[gpt3] Brown, Tom B, others. (2020). Language models are few-shot learners.

[devlin2019bert] Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers).

[dosovitskiy2021an] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

[irie19trafolm] Irie, Kazuki, Zeyer, Albert, Schl. (2019). Language Modeling with Deep {T. Proc. Interspeech.

[irie2020phd] Irie, Kazuki. (2020). Advancing Neural Language Modeling in Automatic Speech Recognition.

[shazeer2020glu] Shazeer, Noam. (2020). {GLU. Preprint arXiv:2002.05202.

[kimcharacter] Kim, Yoon, Rush, Yacine Jernite David Sontag Alexander. (2016). Character-Aware Neural Language Models. Proc. {AAAI.

[schmidhuber92ncchunker] Yunyang Xiong, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, Vikas Singh. (2021). *Nystr{*. Neural Computation.

[ChowdhuryC21] Jishnu Ray Chowdhury, Cornelia Caragea. (2021). Modeling Hierarchical Structures with Continuous Recursive Neural Networks.

[schuster1997bidirectional] Schuster, Mike, Paliwal, Kuldip K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing.

[juergen2015deeplearning] J{. (2015). Deep learning in neural networks: An overview. Neural Networks.

[ciresan2012multicolumn] Dan C. Ciresan, Ueli Meier, J{. (2012). Multi-column deep neural networks for image classification.

[krizhevsky2012imagenet] Krizhevsky, Alex, Sutskever, Ilya, Hinton, Geoffrey E. (2012). ImageNet Classification with Deep Convolutional Neural Networks.

[mnih2013playing] Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Graves, Alex, Antonoglou, Ioannis, Wierstra, Daan, Riedmiller, Martin. (2013). Playing atari with deep reinforcement learning. NIPS Deep Learning Workshop.

[silver2016mastering] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis. (2016). Mastering the game of Go with deep neural networks and tree search. Nature.

[siegelmann1992computational] Hava T. Siegelmann, Eduardo D. Sontag. (1992). On the Computational Power of Neural Nets. Proceedings of the Conference on Computational Learning Theory, {COLT.

[pagin2010compositionality] Pagin, Peter, Westerst{\aa. (2010). Compositionality {I. Philosophy Compass.

[dankers2021paradox] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po{-. (2022). Scaling Language Models: Methods, Analysis {&. Preprint arXiv:2112.11446.

[fernando2017pathnet] Fernando, Chrisantha, Banarse, Dylan, Blundell, Charles, Zwols, Yori, Ha, David, Rusu, Andrei A, Pritzel, Alexander, Wierstra, Daan. (2017). Pathnet: Evolution channels gradient descent in super neural networks. Preprint arXiv:1701.08734.

[mallya2018packnet] Arun Mallya, Svetlana Lazebnik. Pack{N.

[bengio2009curriculum] Bengio, Yoshua, Louradour, J'{e. (2009). Curriculum Learning.

[kirkpatrick2017curriculum] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, Raia Hadsell. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences.

[french1999catastrophic] French, Robert M. (1999). Catastrophic forgetting in connectionist networks. Trends in cognitive sciences.

[mccloskey1989catastrophic] McCloskey, Michael, Cohen, Neal J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of learning and motivation.

[bendavid2010theory] Shai Ben{-. (2010). A theory of learning from different domains. Machine Learning.

[blitzer2006domain] John Blitzer, Ryan T. McDonald, Fernando Pereira. (2006). Domain Adaptation with Structural Correspondence Learning.

[koh2021wilds] Koh, Pang Wei, Sagawa, Shiori, Marklund, Henrik, Xie, Sang Michael, Zhang, Marvin, Balsubramani, Akshay, Hu, Weihua, Yasunaga, Michihiro, Phillips, Richard Lanas, Gao, Irena, Lee, Tony, David, Etienne, Stavness, Ian, Guo, Wei, Earnshaw, Berton, Haque, Imran, Beery, Sara M, Leskovec, Jure, Kundaje, Anshul, Pierson, Emma, Levine, Sergey, Finn, Chelsea, Liang, Percy. (2021). {WILDS.

[vapnik1998statistical] Vladimir Vapnik. (1998). Statistical learning theory.

[pearl2009causality] Pearl, Judea. (2009). Causality.

[pearl2018book] Pearl, Judea, Mackenzie, Dana. (2018). The Book of Why: The New Science of Cause and Effect.

[scholkopf2019causality] Bernhard Sch{. (2019). Causality for Machine Learning. Preprint arXiv:1911.10500.

[goyal2021recurrent] Anirudh Goyal, Alex Lamb, Jordan Hoffmann, Shagun Sodhani, Sergey Levine, Yoshua Bengio, Bernhard Sch{. (2021). Recurrent Independent Mechanisms.

[mitrovic2021representation] Jovana Mitrovic, Brian McWilliams, Jacob C. Walker, Lars Holger Buesing, Charles Blundell. (2021). Representation Learning via Invariant Causal Mechanisms.

[ellis2021dreamcoder] Kevin Ellis, Catherine Wong, Maxwell I. Nye, Mathias Sabl{'{e. (2021). {DreamCoder. International Conference on Programming Language Design and Implementation ({PLDI.

[chaudhuri2021neurosymbolic] Swarat Chaudhuri, Kevin Ellis, Oleksandr Polozov, Rishabh Singh, Armando Solar{-. (2021). Neurosymbolic Programming. Foundations and Trends in Programing Languages.

[silver2016alphago] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy P. Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis. (2016). Mastering the game of Go with deep neural networks and tree search. Nature.

[li2022comptetition] Yujia Li, David H. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R{'{e. (2022). Competition-Level Code Generation with {A. Preprint arXiv:2203.07814.

[deng2020metalearning] Xiang Deng, Zhongfei (Mark) Zhang. (2020). Is the Meta-Learning Idea Able to Improve the Generalization of Deep Neural Networks on the Standard Supervised Learning?. International Conference on Pattern Recognition, {ICPR.

[conklin2021metalearning] Henry Conklin, Bailin Wang, Kenny Smith, Ivan Titov. (2021). Meta-Learning to Compositionally Generalize.

[elman1990rnn] Jeffrey L. Elman. (1990). Finding Structure in Time. Cognitive Science.

[graves2014neural] Graves, Alex, Wayne, Greg, Danihelka, Ivo. (2014). Neural Turing Machines. Preprint arXiv:1410.5401.

[whitehead1928symbolism] Whitehead, Alfred North. (1928). Symbolism: Its meaning and effect. Journal of Philosophical Studies.

[fodor1975language] Fodor, Jerry Alan. (1975). The language of thought.

[haugeland1985artificial] Haugeland, John. (1985). Artificial intelligence: the very idea.

[newell1959report] Newell, Allen, Shaw, John C, Simon, Herbert A. (1959). Report on a general problem solving program. IFIP congress.

[shortliffe1975model] Shortliffe, Edward H, Buchanan, Bruce G. (1975). A model of inexact reasoning in medicine. Mathematical biosciences.

[weizenbaum1966eliza] Weizenbaum, Joseph. (1966). {ELIZA. Communications of the ACM.

[barlow1989finding] Barlow, Horace B, Kaushal, Tej P, Mitchison, Graeme J. (1989). Finding minimum entropy codes. Neural Computation.

[schmidhuber1992learning_factorial] Schmidhuber, J{. (1992). Learning factorial codes by predictability minimization. Neural computation.

[higgins2018towards] Higgins, Irina, Amos, David, Pfau, David, Racaniere, Sebastien, Matthey, Loic, Rezende, Danilo, Lerchner, Alexander. (2018). Towards a definition of disentangled representations. Preprint arXiv:1812.02230.

[hinton1984distributed] Hinton, Geoffrey E. (1984). Distributed representations. Technical report.

[bengio2013representation] Bengio, Yoshua, Courville, Aaron, Vincent, Pascal. (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence.

[clune2013evolutionary] Clune, Jeff, Mouret, Jean-Baptiste, Lipson, Hod. (2013). The evolutionary origins of modularity. Proceedings of the Royal Society B: Biological Sciences.

[ruis2022improving] Ruis, Laura, Lake, Brenden. (2022). Improving Systematic Generalization Through Modularity and Augmentation. Preprint arXiv:2202.10745.

[schmidhuber1991fastweights] J. (1991). Learning to Control Fast-Weight Memories: An Alternative to Recurrent Nets.

[schmidhuber1992learning_to_control2] Schmidhuber, J{. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation.

[schlag2018learning] Schlag, Imanol, Schmidhuber, J{. (2018). Learning to reason with third order tensor products. Advances in neural information processing systems.

[velickovic2020neural] Petar Velickovic, Rex Ying, Matilde Padovano, Raia Hadsell, Charles Blundell. (2020). Neural Execution of Graph Algorithms.

[velickovic2021neural] Petar Velickovic, Charles Blundell. (2021). Neural algorithmic reasoning. Patterns.

[velickovic2022graph] Andrew Dudzik, Petar Velickovic. (2022). Graph Neural Networks are Dynamic Programmers. Preprint arXiv:2203.15544.

[bronstein2017geometric] Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, Pierre Vandergheynst. (2017). Geometric Deep Learning: Going beyond Euclidean data. {IEEE.

[bronstein2021geometric] Michael M. Bronstein, Joan Bruna, Taco Cohen, Petar Velickovic. (2021). Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. Preprint arXiv:2104.13478.

[chowdhery2022palm] Chowdhery, Aakanksha, Narang, Sharan, Devlin, Jacob, Bosma, Maarten, Mishra, Gaurav, Roberts, Adam, Barham, Paul, Chung, Hyung Won, Sutton, Charles, Gehrmann, Sebastian, others. (2022). Pa{LM. Preprint arXiv:2204.02311.

[weston2016babi] Jason Weston, Antoine Bordes, Sumit Chopra, Tom{'{a. (2016). Towards {AI.

[csordas2021neural] R'obert Csord'as, Sjoerd van Steenkiste, J. (2021). Are Neural Nets Modular? Inspecting Functional Modularity Through Differentiable Weight Masks. Int. Conf. on Learning Representations (ICLR).

[bahdanau2018systematic] Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries, Aaron Courville. (2019). Systematic Generalization: What Is Required and Can It Be Learned?.

[watanabe2019interpreting] Watanabe, Chihiro. (2019). Interpreting Layered Neural Networks via Hierarchical Modular Representation. International Conference on Neural Information Processing.

[filan2020neural] Filan, Daniel, Hod, Shlomi, Wild, Cody, Critch, Andrew, Russell, Stuart. (2020). Neural Networks are Surprisingly Modular. Preprint arXiv:2003.04881.

[jang2017categorical] Jang, Eric, Gu, Shixiang, Poole, Ben. (2017). Categorical Reparametrization with Gumbel-Softmax.

[maddison2016concrete] Maddison, Chris J, Mnih, Andriy, Teh, Yee Whye. (2017). The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables.

[bengio2013estimating] Yoshua Bengio, Nicholas L{'{e. (2013). Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. CoRR.

[hinton12neural] Hinton, Geoffrey. (2012). Neural networks for machine learning.. Coursera, video lectures..

[lecun2010mnist] LeCun, Yann, Cortes, Corinna, Burges, CJ. (2010). {MNIST. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist.

[kirkpatrick2017overcoming] Kirkpatrick, James, Pascanu, Razvan, Rabinowitz, Neil, Veness, Joel, Desjardins, Guillaume, Rusu, Andrei A, Milan, Kieran, Quan, John, Ramalho, Tiago, Grabska-Barwinska, Agnieszka, others. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences.

[golkar2019continual] Golkar, Siavash, Kagan, Michael, Cho, Kyunghyun. (2019). Continual learning via neural pruning. NeurIPS 2019 Workshop Neuro AI.

[kolouri2019attention] Kolouri, Soheil, Ketz, Nicholas, Zou, Xinyun, Krichmar, Jeffrey, Pilly, Praveen. (2019). Attention-Based Structural-Plasticity. Preprint arXiv:1903.06070.

[ShawUV18] Peter Shaw, Jakob Uszkoreit, Ashish Vaswani. (2018). Self-Attention with Relative Position Representations.

[csordas2021ndr] R'obert Csord'as, Kazuki Irie, J. (2022). The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization.

[bogin2022unobserved] Ben Bogin, Shivanshu Gupta, Jonathan Berant. (2022). Unobserved Local Structures Make Compositional Generalization Hard. Preprint arXiv:2201.05899.

[kaplan2020scaling] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei. (2020). Scaling Laws for Neural Language Models. Preprint arXiv:2001.08361.

[clark2022unified] Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake A. Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Katie Millican, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Jack W. Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan. (2022). Unified Scaling Laws for Routed Language Models. Preprint arXiv:2202.01169.

[fedus2021switch] Fedus, William, Zoph, Barret, Shazeer, Noam. (2021). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Preprint arXiv:2101.03961.

[shazeer2017outrageously] Shazeer, Noam, Mirhoseini, Azalia, Maziarz, Krzysztof, Davis, Andy, Le, Quoc, Hinton, Geoffrey, Dean, Jeff. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.

[lake2017building] Lake, Brenden M., Ullman, Tomer D., Tenenbaum, Joshua B., Gershman, Samuel J.. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences.

[szegedy2014intriguing] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, Rob Fergus. (2014). Intriguing properties of neural networks.

[schmidhuber87metametameta] Aky{. (2022). Compositionality as Lexical Symmetry. Preprint arXiv:2201.12926.

[kirk2021survey] Kirk, Robert, Zhang, Amy, Grefenstette, Edward, Rockt{. (2021). A survey of generalisation in deep reinforcement learning. Preprint arXiv:2111.09794.

[marcus2003algebraic] Marcus, Gary F. (2003). The algebraic mind: Integrating connectionism and cognitive science.

[johnson2017clevr] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei{-. (2017). {CLEVR:.

[lapata2021structured] Wang, Bailin, Lapata, Mirella, Titov, Ivan. (2021). Structured reordering for modeling latent alignments in sequence transduction.

[liu2021learning] Chenyao Liu, Shengnan An, Zeqi Lin, Qian Liu, Bei Chen, Jian{-. (2021). Learning Algebraic Recombination for Compositional Generalization.

[weissenhorn2022compositional] Wei{\ss. (2022). Compositional Generalization Requires Compositional Parsers. Preprint arXiv:2202.11937.

[kim2021sequence] Kim, Yoon. (2021). Sequence-to-sequence learning with latent neural grammars.

[sartran2022transformer] Sartran, Laurent, Barrett, Samuel, Kuncoro, Adhiguna, Stanojevi{'c. (2022). Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale. Preprint arXiv:2203.00633.

[zheng2022disentangled] Hao Zheng, Mirella Lapata. (2022). Disentangled Sequence to Sequence Learning for Compositional Generalization.

[mittal2020brims] Sarthak Mittal, Alex Lamb, Anirudh Goyal, Vikram Voleti, Murray Shanahan, Guillaume Lajoie, Michael Mozer, Yoshua Bengio. (2020). Learning to Combine Top-Down and Bottom-Up Signals in Recurrent Neural Networks with Attention over Modules.

[liu2022adaptive] Liu, Dianbo, Lamb, Alex, Ji, Xu, Notsawo, Pascal, Mozer, Mike, Bengio, Yoshua, Kawaguchi, Kenji. (2022). Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization. Preprint arXiv:2202.01334.

[yan2020neural] Yujun Yan, Kevin Swersky, Danai Koutra, Parthasarathy Ranganathan, Milad Hashemi. (2020). Neural Execution Engines: Learning to Execute Subroutines.

[nye2022show] Maxwell Nye, Anders Johan Andreassen, Guy Gur{-. (2022). Show Your Work: Scratchpads for Intermediate Computation with Language Models. ICLR 2022 DL4C Workshop.

[wei2022chain] Wei, Jason, Wang, Xuezhi, Schuurmans, Dale, Bosma, Maarten, Chi, Ed, Le, Quoc, Zhou, Denny. (2022). Chain of thought prompting elicits reasoning in large language models.

[mittal2021compositional] Mittal, Sarthak, Raparthy, Sharath Chandra, Rish, Irina, Bengio, Yoshua, Lajoie, Guillaume. (2021). Compositional Attention: Disentangling Search and Retrieval. Preprint arXiv:2110.09419.

[oren2020improving] Inbar Oren, Jonathan Herzig, Nitish Gupta, Matt Gardner, Jonathan Berant. (2020). Improving Compositional Generalization in Semantic Parsing.

[ruiz2021iterative] Ruiz, Luana, Ainslie, Joshua, Ontañón, Santiago. (2021). Iterative decoding for compositional generalization in transformers. Preprint arXiv:2110.04169.

[klinger2020study] Klinger, Tim, Adjodah, Dhaval, Marois, Vincent, Joseph, Josh, Riemer, Matthew, Pentland, Alex 'Sandy', Campbell, Murray. (2020). A Study of Compositional Generalization in Neural Models. Preprint arXiv:2006.09437.

[zhang2021subnetwork] Dinghuai Zhang, Kartik Ahuja, Yilun Xu, Yisen Wang, Aaron C. Courville. (2021). Can Subnetwork Structure Be the Key to Out-of-Distribution Generalization?.

[schwarzschild2021can] Schwarzschild, Avi, Borgnia, Eitan, Gupta, Arjun, Huang, Furong, Vishkin, Uzi, Goldblum, Micah, Goldstein, Tom. (2021). Can You Learn an Algorithm? Generalizing from Easy to Hard Problems with Recurrent Networks.

[bau2021neural] Bau, Anthony, Andreas, Jacob. (2021). How Do Neural Sequence Models Generalize? Local and Global Context Cues for Out-of-Distribution Prediction. Preprint arXiv:2111.03108.

[zhang2020identity] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, Yoram Singer. (2020). Identity Crisis: Memorization and Generalization Under Extreme Overparameterization.

[qiu2021improving] Qiu, Linlu, Shaw, Peter, Pasupat, Panupong, Nowak, Paweł Krzysztof, Linzen, Tal, Sha, Fei, Toutanova, Kristina. (2021). Improving Compositional Generalization with Latent Structure and Data Augmentation. Preprint arXiv:2112.07610.

[nogueira2021investigating] Nogueira, Rodrigo, Jiang, Zhiying, Lin, Jimmy. (2021). Investigating the limitations of transformers with simple arithmetic tasks. ICLR 2021 Mathematical Reasoning in General Artificial Intelligence Workshop.

[vani2021iterated] Ankit Vani, Max Schwarzer, Yuchen Lu, Eeshan Dhekane, Aaron C. Courville. (2021). Iterated learning for emergent systematicity in VQA.

[akyurek2021lexicon] Ekin Akyürek, Jacob Andreas. (2021). Lexicon Learning for Few-Shot Neural Sequence Modeling.

[kharitonov2021doubt] Eugene Kharitonov, Rahma Chaabouni. (2021). What they do when in doubt: a study of inductive biases in seq2seq learners.

[zaremba2015learning] Zaremba, Wojciech, Sutskever, Ilya. (2015). Learning to execute.

[jumper2021highly] Jumper, John, Evans, Richard, Pritzel, Alexander, Green, Tim, Figurnov, Michael, Ronneberger, Olaf, Tunyasuvunakool, Kathryn, Bates, Russ, Žídek, Augustin, others. (2021). Highly accurate protein structure prediction with AlphaFold. Nature.

[degrave2022magnetic] Degrave, Jonas, Felici, Federico, Buchli, Jonas, Neunert, Michael, Tracey, Brendan, Carpanese, Francesco, Ewalds, Timo, Hafner, Roland, Abdolmaleki, Abbas, de Las Casas, Diego, others. (2022). Magnetic control of tokamak plasmas through deep reinforcement learning. Nature.

[solomonoff1964a] Ray Solomonoff. (1964). A formal theory of inductive inference. Part I. Information and Control.

[solomonoff1964b] Ray Solomonoff. (1964). A formal theory of inductive inference. Part II. Information and Control.

[hutter2000aixi] Hutter, Marcus. (2000). A theory of universal artificial intelligence based on algorithmic complexity. Preprint arXiv:cs/0004001.

[ivakhnenko1965] Ivakhnenko, Aleksey Grigorievitch, Lapa, Valentin Grigorevich. (1965). Cybernetic Predicting Devices. Information and Control.

[ivakhnenko1968] Ivakhnenko, Aleksey Grigorievitch. (1968). The group method of data handling -- a rival of the method of stochastic approximation. Soviet Automatic Control.

[ivakhnenko1971] Ivakhnenko, Aleksey Grigorievitch. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics.

[ring1991incremental] Mark Ring. (1991). Incremental Development of Complex Behaviors through Automatic Construction of Sensory-motor Hierarchies. Machine Learning Proceedings.

[schlimmer1986case] Jeffrey C. Schlimmer, Douglas H. Fisher. (1986). A Case Study of Incremental Concept Induction. Proceedings of the 5th National Conference on Artificial Intelligence.

[schmidhuber2004oops] Shun-Ichi Amari. (1967). A Theory of Adaptive Pattern Classifiers. IEEE Transactions on Electronic Computers.

[amari1968information] Shun-Ichi Amari. (1968). Information Theory — Geometric Theory of Information.

[sperduti93encoding] Alessandro Sperduti. (1993). Encoding Labeled Graphs by Labeling RAAM.

[sperduti97supervised] Alessandro Sperduti, Antonina Starita. (1997). Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks.

[goller96learning] Christoph Goller, Andreas Küchler. (1996). Learning task-dependent distributed representations by backpropagation through structure. Proceedings of International Conference on Neural Networks (ICNN).

[kuchler96inductive] Andreas Küchler, Christoph Goller. (1996). Inductive Learning in Symbolic Domains Using Structure-Driven Recurrent Neural Networks. Advances in Artificial Intelligence.

[baldi1996hybrid] Pierre Baldi, Yves Chauvin. (1996). Hybrid Modeling, HMM/NN Architectures, and Protein Applications. Neural Computation.

[juergen1993decreasing] Jeffrey L. Elman. (1993). Learning and development in neural networks: the importance of starting small. Cognition.

[towell1990refinement] Towell, Geoffrey G, Shavlik, Jude W, Noordewier, Michiel O, others. (1990). Refinement of approximate domain theories by knowledge-based neural networks.

[amari72learning] Shun-Ichi Amari. (1972). Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements. IEEE Transactions on Computers.

[kohonen72correlation] Teuvo Kohonen. (1972). Correlation Matrix Memories. IEEE Transactions on Computers.

[sun1994computational] Sun, Ron, Bookman, Lawrence A. (1994). Computational architectures integrating neural and symbolic processes: A perspective on the state of the art.

[roli1995image] Fabio Roli, Sebastiano B. Serpico, Gianni Vernazza. (1995). Image Recognition by Integration of Connectionist and Symbolic Approaches.

[pollack1987connectionist] Pollack, Jordan Bruce. (1987). On connectionist models of natural language processing. PhD dissertation. University of Illinois.

[franke18robust] Jörg Franke, Jan Niehues, Alex Waibel. (2018). Robust and Scalable Differentiable Neural Computer for Question Answering. Workshop on Machine Reading for Question Answering (MRQA), ACL.

[santoro17simple] Adam Santoro, David Raposo, David G. T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter W. Battaglia, Tim Lillicrap. (2017). A simple neural network module for relational reasoning.

[henaff17tracking] Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, Yann LeCun. (2017). Tracking the World State with Recurrent Entity Networks.

[rosenbaum2019routing] Rosenbaum, Clemens, Cases, Ignacio, Riemer, Matthew, Klinger, Tim. (2019). Routing networks and the challenges of modular and compositional computation. Preprint arXiv:1904.12774.

[krizhevsky2009learning] Krizhevsky, Alex, Hinton, Geoffrey, others. (2009). Learning multiple layers of features from tiny images.

[he2016deep] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. (2016). Deep Residual Learning for Image Recognition.

[watanabe2018modular] Watanabe, Chihiro, Hiramatsu, Kaoru, Kashino, Kunio. (2018). Modular representation of layered neural networks. Neural Networks.

[garnelo2019reconciling] Garnelo, Marta, Shanahan, Murray. (2019). Reconciling deep learning with symbolic artificial intelligence: representing objects and relations. Current Opinion in Behavioral Sciences.

[davis2020network] Davis, Brian, Bhatt, Umang, Bhardwaj, Kartikeya, Marculescu, Radu, Moura, José M. F. (2020). On network science and mutual information for explaining deep neural networks. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[bengio2015conditional] Bengio, Emmanuel, Bacon, Pierre-Luc, Pineau, Joelle, Precup, Doina. (2015). Conditional computation in neural networks for faster models. Preprint arXiv:1511.06297.

[santoro2018measuring] Adam Santoro, Felix Hill, David G. T. Barrett, Ari S. Morcos, Timothy P. Lillicrap. (2018). Measuring abstract reasoning in neural networks.

[andreas2018measuring] Jacob Andreas. (2019). Measuring Compositionality in Representation Learning.

[yang2020multi] Ruihan Yang, Huazhe Xu, Yi Wu, Xiaolong Wang. (2020). Multi-Task Reinforcement Learning with Soft Modularization.

[lecun1990optimal] Yann LeCun, John S. Denker, Sara A. Solla. (1989). Optimal Brain Damage.

[hassibi1993second] Babak Hassibi, David G. Stork. (1992). Second Order Derivatives for Network Pruning: Optimal Brain Surgeon.

[li2017pruning] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf. (2017). Pruning Filters for Efficient ConvNets.

[frankle2018the] Jonathan Frankle, Michael Carbin. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.

[gaier2019weight] Adam Gaier, David Ha. (2019). Weight Agnostic Neural Networks.

[simonyan2013deep] Simonyan, Karen, Vedaldi, Andrea, Zisserman, Andrew. (2013). Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. Preprint arXiv:1312.6034.

[springenberg2015striving] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, Martin A. Riedmiller. (2015). Striving for Simplicity: The All Convolutional Net.

[sundararajan2017axiomatic] Mukund Sundararajan, Ankur Taly, Qiqi Yan. (2017). Axiomatic Attribution for Deep Networks.

[shrikumar2017learning] Avanti Shrikumar, Peyton Greenside, Anshul Kundaje. (2017). Learning Important Features Through Propagating Activation Differences.

[mallya2018piggyback] Arun Mallya, Dillon Davis, Svetlana Lazebnik. (2018). Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights.

[clune2013summary] Jeff Clune, Jean-Baptiste Mouret, Hod Lipson. (2013). The Evolutionary Origins of Modularity. Proceedings of the Royal Society B.

[hill2020environmental] Felix Hill, Andrew K. Lampinen, Rosalia Schneider, Stephen Clark, Matthew M. Botvinick, James L. McClelland, Adam Santoro. (2020). Environmental drivers of systematicity and generalization in a situated agent.

[purushwalkam2019taskdriven] Senthil Purushwalkam, Maximilian Nickel, Abhinav Gupta, Marc'Aurelio Ranzato. (2019). Task-Driven Modular Networks for Zero-Shot Compositional Learning.

[zhou2019deconstructing] Hattie Zhou, Janice Lan, Rosanne Liu, Jason Yosinski. (2019). Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask.

[goyal2021factorizing] Anirudh Goyal, Alex Lamb, Phanideep Gampa, Philippe Beaudoin, Charles Blundell, Sergey Levine, Yoshua Bengio, Michael Curtis Mozer. (2021). Factorizing Declarative and Procedural Knowledge in Structured, Dynamical Environments.

[nair2010rectified] Vinod Nair, Geoffrey E. Hinton. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines.

[kingma2014adam] Diederik P. Kingma, Jimmy Ba. (2015). Adam: A Method for Stochastic Optimization.

[malsburg1973self] von der Malsburg, Chr.. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik.

[sparsednc] Jack W. Rae, Jonathan J. Hunt, Ivo Danihelka, Timothy Harley, Andrew W. Senior, Gregory Wayne, Alex Graves, Tim Lillicrap. (2016). Scaling Memory-Augmented Neural Networks with Sparse Reads and Writes.

[diffallocdnc] Itamar Ben-Ari, Alan Joseph Bekker. (2017). Differentiable Memory Allocation Mechanism For Neural Computing. MLSLP.

[schmidhuber2012self] Schmidhuber, Jürgen. (2012). Self-delimiting neural networks. Preprint arXiv:1210.0118.

[shaw2020compositional] Peter Shaw, Ming-Wei Chang, Panupong Pasupat, Kristina Toutanova. (2021). Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both?.

[rae2021scaling] Rae, Jack W, Borgeaud, Sebastian, Cai, Trevor, Millican, Katie, Hoffmann, Jordan, Song, Francis, Aslanides, John, Henderson, Sarah, Ring, Roman, Young, Susannah, others. (2021). Scaling language models: Methods, analysis & insights from training gopher. Preprint arXiv:2112.11446.

[schmidhuber90composition] Jürgen Schmidhuber. (1990). Towards Compositional Learning in Dynamic Networks.

[hendrycks2016gaussian] Hendrycks, Dan, Gimpel, Kevin. (2016). Gaussian error linear units (GELUs). Preprint arXiv:1606.08415.

[ba2016layer] Ba, Jimmy Lei, Kiros, Jamie Ryan, Hinton, Geoffrey E. (2016). Layer Normalization. Preprint arXiv:1607.06450.

[touvron2023llama] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. (2023). LLaMA: Open and Efficient Foundation Language Models. Preprint arXiv:2302.13971.

[alpaca] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto. (2023). Stanford Alpaca: An Instruction-following LLaMA Model. GitHub repository.

[vicuna2023] Chiang, Wei-Lin, Li, Zhuohan, Lin, Zi, Sheng, Ying, Wu, Zhanghao, Zhang, Hao, Zheng, Lianmin, Zhuang, Siyuan, Zhuang, Yonghao, Gonzalez, Joseph E., Stoica, Ion, Xing, Eric P.. (2023). Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.

[fedus2022switch] Fedus, William, Zoph, Barret, Shazeer, Noam. (2022). Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.

[lepikhin2021shard] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen. (2021). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding.

[chi2022representation] Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, Furu Wei. (2022). On the Representation Collapse of Sparse Mixture of Experts.

[ParikhT0U16] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit. (2016). A Decomposable Attention Model for Natural Language Inference.

[cheng16] Jianpeng Cheng, Li Dong, Mirella Lapata. (2016). Long Short-Term Memory-Networks for Machine Reading.

[katharopoulos2020transformers] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret. (2020). Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.

[choromanski2021rethinking] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller. (2021). Rethinking Attention with Performers.

[dao2022flash] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.

[geva2021transformer] Geva, Mor, Schuster, Roei, Berant, Jonathan, Levy, Omer. (2021). Transformer Feed-Forward Layers Are Key-Value Memories.

[li2023the] Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar. (2023). The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers.

[lampe2019large] Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. (2019). Large Memory Layers with Product Keys.

[jacobs1991adaptive] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, Geoffrey E. Hinton. (1991). Adaptive Mixtures of Local Experts. Neural Computation.

[sinkhorn1964relationship] Sinkhorn, Richard. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics.

[sinkhorn1967concerning] Sinkhorn, Richard, Knopp, Paul. (1967). Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics.

[kool2021unbiased] Kool, Wouter, Maddison, Chris J, Mnih, Andriy. (2021). Unbiased gradient estimation with balanced assignments for mixtures of experts. I (Still) Can't Believe It's Not Better Workshop, NeurIPS.

[gpt2] Radford, Alec, Wu, Jeff, Child, Rewon, Luan, David, Amodei, Dario, Sutskever, Ilya. (2019). Language Models are Unsupervised Multitask Learners.

[rae2021gopher] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, others. (2021). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Preprint arXiv:2112.11446.

[demetters2022llm.int8] Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.

[zafrir2019q8bert] Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat. (2019). Q8BERT: Quantized 8Bit BERT. Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS.

[lewis2021base] Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer. (2021). BASE Layers: Simplifying Training of Large, Sparse Models.

[shen2023study] Kai Shen, Junliang Guo, Xu Tan, Siliang Tang, Rui Wang, Jiang Bian. (2023). A Study on ReLU and Softmax in Transformer. Preprint arXiv:2302.06461.

[ivakhnenko1965book] Ivakhnenko, Aleksei. (1965). Cybernetic Predicting Devices.

[sennrich16bpe] Rico Sennrich, Barry Haddow, Alexandra Birch. (2016). Neural Machine Translation of Rare Words with Subword Units.

[SchusterN12] Mike Schuster, Kaisuke Nakajima. (2012). Japanese and Korean voice search.

[KudoR18] Taku Kudo, John Richardson. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.

[irie18radmm] Irie, Kazuki, Kumar, Shankar, Nirschl, Michael, Liao, Hank. (2018). RADMM: Recurrent Adaptive Mixture Model with Applications to Domain Robust Language Modeling. Proc. IEEE ICASSP.

[li2022branch] Li, Margaret, Gururangan, Suchin, Dettmers, Tim, Lewis, Mike, Althoff, Tim, Smith, Noah A, Zettlemoyer, Luke. (2022). Branch-train-merge: Embarrassingly parallel training of expert language models. Preprint arXiv:2208.03306.

[dziri2023faith] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaïd Harchaoui, Yejin Choi. (2023). Faith and Fate: Limits of Transformers on Compositionality. Preprint arXiv:2305.18654.

[hopcroft1979introduction] John E. Hopcroft, Jeffrey D. Ullman. (1979). Introduction to Automata Theory, Languages and Computation.

[irie2022dualform] Kazuki Irie, Róbert Csordás, Jürgen Schmidhuber. (2022). The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention.

[joulin2015inferring] Armand Joulin, Tomáš Mikolov. (2015). Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets.

[suzgun2019memory] Mirac Suzgun, Sebastian Gehrmann, Yonatan Belinkov, Stuart M. Shieber. (2019). Memory-Augmented Recurrent Neural Networks Can Learn Generalized Dyck Languages. Preprint arXiv:1911.03329.

[openai2023gpt4] OpenAI. (2023). GPT-4 Technical Report. Preprint arXiv:2303.08774.

[lewkowycz2022solving] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay V. Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra. (2022). Solving Quantitative Reasoning Problems with Language Models.

[williams1989learning] Ronald J. Williams, David Zipser. (1989). A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation.

[anil2022exploring] Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay V. Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur. (2022). Exploring Length Generalization in Large Language Models.

[openai2022chatgpt] OpenAI. (2022). ChatGPT: Optimizing Language Models for Dialogue.

[ouyang2022training] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe. (2022). Training language models to follow instructions with human feedback.

[bubeck2023sparsk] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang. (2023). Sparks of Artificial General Intelligence: Early experiments with GPT-4. Preprint arXiv:2303.12712.

[tian2023ischatgpt] Haoye Tian, Weiqi Lu, Tsz On Li, Xunzhu Tang, Shing-Chi Cheung, Jacques Klein, Tegawendé F. Bissyandé. (2023). Is ChatGPT the Ultimate Programming Assistant -- How far is it?. Preprint arXiv:2304.11938.

[wu2023reasoning] Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, Yoon Kim. (2023). Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. Preprint arXiv:2307.02477.

[liu2023evaluating] Hanmeng Liu, Ruoxi Ning, Zhiyang Teng, Jian Liu, Qiji Zhou, Yue Zhang. (2023). Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4. Preprint arXiv:2304.03439.

[gao2021pile] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy. (2021). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. Preprint arXiv:2101.00027.

[raffel2020exploring] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.

[he2023tweet] Horace He. (2023). Tweet.

[enryu2023gpt4] Enryu. (2023). GPT4 and Coding Problems.

[weiss2022tweet] Gail Weiss. (2022). Tweet.

[mclaughlin2009systematicity] Brian P. McLaughlin. (2009). Systematicity redux. Synthese.

[johnson1972reasoning] Johnson-Laird, Philip N, Legrenzi, Paolo, Legrenzi, Maria Sonino. (1972). Reasoning and a sense of reality. British journal of Psychology.

[wason1968reasoning] Wason, Peter C. (1968). Reasoning about a rule. Quarterly journal of experimental psychology.

[pylyshyn1984computation] Pylyshyn, Zenon Walter. (1984). Computation and cognition.

[memnn] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus. (2015). Weakly Supervised Memory Networks. Preprint arXiv:1503.08895.

[keyvalnet] Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, Jason Weston. (2016). Key-Value Memory Networks for Directly Reading Documents.

[xiong2020layer] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu. (2020). On Layer Normalization in the Transformer Architecture.

[csordas2022ctlpp] Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber. (2022). CTL++: Evaluating Generalization on Never-Seen Compositional Patterns of Known Functions, and Compatibility of Neural Representations.

[hutchins2020block] DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur. (2022). Block-Recurrent Transformers.

[chomsky1962explanatory] Chomsky, N.. (1962). Explanatory Models in Linguistics.

[siegelmann1991turing] Siegelmann, Hava T, Sontag, Eduardo D. (1991). Turing computability with neural nets. Applied Mathematics Letters.

[siegelmann1995computational] Hava T. Siegelmann, Eduardo D. Sontag. (1995). On the Computational Power of Neural Nets. Journal of Computer and System Sciences.

[chung2021turing] Stephen Chung, Hava T. Siegelmann. (2021). Turing Completeness of Bounded-Precision Recurrent Neural Networks.

[hochreiter1991untersuchungen] Hochreiter, Sepp. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München.

[nostalgebraist2022chinchilla] Nostalgebraist. (2022). Chinchilla's Wild Implications.

[hoffmann2022training] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre. (2022). Training Compute-Optimal Large Language Models. Preprint arXiv:2203.15556.

[malsburg1981correlation] Lenz, Wilhelm. (1920). Beitrag zum Verständnis der magnetischen Erscheinungen in festen Körpern. Z. Phys..

[ising1924beitrag] Ising, E. (1924). Beitrag zur Theorie des Ferro- und Paramagnetismus.

[schmidhuber2022annotated] Jürgen Schmidhuber. (2022). Annotated History of Modern AI and Deep Learning. Preprint arXiv:2212.11279.

[graves2015bidirectional] Alex Graves, Santiago Fernández, Jürgen Schmidhuber. (2005). Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition.

[srivastava2015highway] Rupesh Kumar Srivastava, Klaus Greff, Jürgen Schmidhuber. (2015). Highway Networks. ICML Deep Learning Workshop.

[fukushima1969relu] Kunihiko Fukushima. (1969). Visual Feature Extraction by a Multilayered Network of Analog Threshold Elements. IEEE Transactions on Systems Science and Cybernetics.

[linnainmaa1970representation] Linnainmaa, Seppo. (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, University of Helsinki.

[werbos1982applications] Werbos, Paul. (1982). Applications of advances in nonlinear sensitivity analysis. System Modeling and Optimization.

[kelley1960gradient] Kelley, Henry J. (1960). Gradient theory of optimal flight paths. Ars Journal.

[weiss2018practical] Gail Weiss, Yoav Goldberg, Eran Yahav. (2018). On the Practical Computational Power of Finite Precision RNNs for Language Recognition.

[cho2014learning] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation.

[alain2017undestanding] Guillaume Alain, Yoshua Bengio. (2017). Understanding intermediate layers using linear classifier probes. ICLR Workshop.

[lakretz2019emergence] Yair Lakretz, Germán Kruszewski, Théo Desbordes, Dieuwke Hupkes, Stanislas Dehaene, Marco Baroni. (2019). The emergence of number and syntax units in LSTM language models. NAACL.

[hupkes2018visualisation] Hupkes, Dieuwke, Veldhoen, Sara, Zuidema, Willem. (2018). Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research.

[levin1973universal] Levin, Leonid Anatolevich. (1973). Universal sequential search problems. Problemy peredachi informatsii.

[deville1994logic] Deville, Yves, Lau, Kung-Kiu. (1994). Logic program synthesis. The Journal of Logic Programming.

[sussman1973computational] Sussman, Gerald J. (1973). A computational model of skill acquisition.

[newell1961gps] Newell, Allen, Simon, Herbert Alexander. (1961). GPS, a program that simulates human thought.

[plotkin1972automatic] Plotkin, Gordon. (1972). Automatic methods of inductive inference.

[saphiro1981] Ehud Y. Shapiro. (1981). The Model Inference System. International Joint Conference on Artificial Intelligence, IJCAI.

[kolmogorov1965three] Kolmogorov, Andrei N. (1965). Three approaches to the quantitative definition of information. Problems of Information Transmission.

[rissanen1978modeling] Rissanen, Jorma. (1978). Modeling by shortest data description. Automatica.

[dosovitskiy2021vit] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

[radford2023whisper] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever. (2023). Robust Speech Recognition via Large-Scale Weak Supervision.

[chen2021decision] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch. (2021). Decision Transformer: Reinforcement Learning via Sequence Modeling.

[csordas2023approximating] Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber. (2023). Approximating Two-Layer Feedforward Networks for Efficient Transformers. Findings of the Association for Computational Linguistics: EMNLP.

[zhang2022moa] Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong. (2022). Mixture of Attention Heads: Selecting Attention Heads Per Token.

[hutter2006human] Hutter, Marcus. (2006). The Human knowledge compression prize.

[peS2o] Luca Soldaini, Kyle Lo. (2023). peS2o (Pretraining Efficiently on S2ORC) Dataset.

[nguyen2022improving] Tan Nguyen, Tam Nguyen, Hai Do, Khai Nguyen, Vishwanath Saragadam, Minh Pham, Duy Khuong Nguyen, Nhat Ho, Stanley J. Osher. (2022). Improving Transformer with an Admixture of Attention Heads.

[peng2020mixture] Hao Peng, Roy Schwartz, Dianqi Li, Noah A. Smith. (2020). A Mixture of h - 1 Heads is Better than h Heads.

[gerganov2023llamacpp] Georgi Gerganov. (2023). llama.cpp.

[mistral7b] MistralAI. (2023). Mistral 7B.

[stanic2023languini] Aleksandar Stanić, Dylan R. Ashley, Oleg Serikov, Louis Kirsch, Francesco Faccio, Jürgen Schmidhuber, Thomas Hofmann, Imanol Schlag. (2023). The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute. Preprint arXiv:2309.11197.

[su2021roformer] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, Yunfeng Liu. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. Preprint arXiv:2104.09864.

[olsson2022context] Olsson, Catherine, Elhage, Nelson, Nanda, Neel, Joseph, Nicholas, DasSarma, Nova, Henighan, Tom, Mann, Ben, Askell, Amanda, Bai, Yuntao, Chen, Anna, Conerly, Tom, Drain, Dawn, Ganguli, Deep, Hatfield-Dodds, Zac, Hernandez, Danny, Johnston, Scott, Jones, Andy, Kernion, Jackson, Lovitt, Liane, Ndousse, Kamal, Amodei, Dario, Brown, Tom, Clark, Jack, Kaplan, Jared, McCandlish, Sam, Olah, Chris. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.

[Schmidhuber:92ncfastweights] Peng, Hao, Pappas, Nikolaos, Yogatama, Dani, Schwartz, Roy, Smith, Noah A, Kong, Lingpeng. (2021). Random Feature Attention.

[dallee] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. (2022). Hierarchical Text-Conditional Image Generation with CLIP Latents. Preprint arXiv:2204.06125.

[ho2020denoising] Jonathan Ho, Ajay Jain, Pieter Abbeel. (2020). Denoising Diffusion Probabilistic Models.

[dosovitskiy2021image] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

[tay2023scaling] Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, Donald Metzler. (2023). Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?. Findings of the Association for Computational Linguistics: EMNLP.

[csordas2023switchead] Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber. (2023). SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention. Preprint arXiv:2312.07987.

[tan2023sparse] Shawn Tan, Yikang Shen, Zhenfang Chen, Aaron C. Courville, Chuang Gan. (2023). Sparse Universal Transformer.

[paperno2016lambada] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, Raquel Fernández. (2016). The LAMBADA dataset: Word prediction requiring a broad discourse context.

[hill2015goldilocks] Felix Hill, Antoine Bordes, Sumit Chopra, Jason Weston. (2016). The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations.

[warstadt2020blimp] Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, Samuel R. Bowman. (2020). BLiMP: The Benchmark of Linguistic Minimal Pairs for English.

[sakaguchi2020winogrande] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. (2020). WinoGrande: An Adversarial Winograd Schema Challenge at Scale.

[lan2020albert] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. (2020). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.

[brody2023expressivity] Shaked Brody, Uri Alon, Eran Yahav. (2023). On the Expressivity Role of LayerNorm in Transformers' Attention.

[xie2023residual] Shufang Xie, Huishuai Zhang, Junliang Guo, Xu Tan, Jiang Bian, Hany Hassan Awadalla, Arul Menezes, Tao Qin, Rui Yan. (2023). ResiDual: Transformer with Dual Residual Connections. Preprint arXiv:2304.14802.

[jaegle2021perceiver] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, João Carreira. (2021). Perceiver: General Perception with Iterative Attention.

[bolatov2022recurrent] Aydar Bulatov, Yuri Kuratov, Mikhail Burtsev. (2022). Recurrent Memory Transformer.

[krajewski2024scaling] Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, others. (2024). Scaling Laws for Fine-Grained Mixture of Experts. Preprint arXiv:2402.07871.

[dai2024deepseekmoe] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, Wenfeng Liang. (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. Preprint arXiv:2401.06066.

[HeZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. (2016). Identity Mappings in Deep Residual Networks. Proc. European Conf. on Computer Vision (ECCV).

[elhage2022superposition] Elhage, Nelson, Hume, Tristan, Olsson, Catherine, Schiefer, Nicholas, Henighan, Tom, Kravec, Shauna, Hatfield-Dodds, Zac, Lasenby, Robert, Drain, Dawn, Chen, Carol, Grosse, Roger, McCandlish, Sam, Kaplan, Jared, Amodei, Dario, Wattenberg, Martin, Olah, Christopher. (2022). Toy Models of Superposition. Transformer Circuits Thread.

[thorpe1989local] Simon J. Thorpe. (1989). Local vs. Distributed Coding. Intellectica.

[cerebras2023slimpajama] Soboleva, Daria, Al-Khateeb, Faisal, Myers, Robert, Steeves, Jacob R, Hestness, Joel, Dey, Nolan. (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.

[kocetkov2022thestack] Kocetkov, Denis, Li, Raymond, Ben Allal, Loubna, Li, Jia, Mou, Chenghao, Muñoz Ferrandis, Carlos, Jernite, Yacine, Mitchell, Margaret, Hughes, Sean, Wolf, Thomas, Bahdanau, Dzmitry, von Werra, Leandro, de Vries, Harm. (2022). The Stack: 3 TB of permissively licensed source code. Preprint arXiv:2211.15533.

[takase2023lessons] Sho Takase, Shun Kiyono. (2023). Lessons on Parameter Sharing across Layers in Transformers. SustaiNLP Workshop.

[kim2023solar] Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, Changbae Ahn, Seonghoon Yang, Sukyung Lee, Hyunbyung Park, Gyoungjin Gim, Mikyoung Cha, Hwalsuk Lee, Sunghun Kim. (2023). SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling. Preprint arXiv:2312.15166.

[HampshireW90] John B. Hampshire II, Alexander H. Waibel. (1990). The Meta-Pi network: connectionist rapid adaptation for high-performance multi-speaker phoneme recognition.

[jordan1986] Jordan, Michael I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. Proc. Conf. of the Cognitive Science Society.

[BergenOB21] Leon Bergen, Timothy J. O'Donnell, Dzmitry Bahdanau. (2021). Systematic Generalization with Edge Transformers.

[zellers2019hellaswag] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi. (2019). HellaSwag: Can a Machine Really Finish Your Sentence?.

[bisk2020piqa] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi. (2020). PIQA: Reasoning about Physical Commonsense in Natural Language. Proc. AAAI Conference on Artificial Intelligence.

[clark2018think] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord. (2018). Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. Preprint arXiv:1803.05457.

[wang2024grokked] Boshi Wang, Xiang Yue, Yu Su, Huan Sun. (2024). Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization.

[brian2024hopping] Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, Amir Globerson. (2024). Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries.

[csordas2024moeut] Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, et al. (2024). MoEUT: Mixture-of-Experts Universal Transformers.

[sun2025painters] Qi Sun, Marc Pickett, Aakash Kumar Nain, Llion Jones. (2025). Transformer Layers as Painters.

[petty2023impact] Jackson Petty, Sjoerd van Steenkiste, Ishita Dasgupta, Fei Sha, Dan Garrette, Tal Linzen. (2023). The Impact of Depth and Width on Transformer Language Model Generalization. Preprint arXiv:2310.19956.

[lad2024remarkable] Vedang Lad, Jin Hwa Lee, Wes Gurnee, Max Tegmark. (2024). The Remarkable Robustness of LLMs: Stages of Inference?. arXiv preprint arXiv:2406.19384.

[ameisen2025circuit] Ameisen, Emmanuel, Lindsey, Jack, Pearce, Adam, Gurnee, Wes, Turner, Nicholas L., Chen, Brian, Citro, Craig, Abrahams, David, Carter, Shan, Hosmer, Basil, Marcus, Jonathan, Sklar, Michael, Templeton, Adly, Bricken, Trenton, McDougall, Callum, Cunningham, Hoagy, Henighan, Thomas, Jermyn, Adam, Jones, Andy, Persic, Andrew, Qi, Zhenyi, Ben Thompson, T., Zimmerman, Sam, Rivoire, Kelley, Conerly, Thomas, Olah, Chris, Batson, Joshua. (2025). Circuit Tracing: Revealing Computational Graphs in Language Models. Transformer Circuits Thread.

[lindsey2025biology] Lindsey, Jack, Gurnee, Wes, Ameisen, Emmanuel, Chen, Brian, Pearce, Adam, Turner, Nicholas L., Citro, Craig, Abrahams, David, Carter, Shan, Hosmer, Basil, Marcus, Jonathan, Sklar, Michael, Templeton, Adly, Bricken, Trenton, McDougall, Callum, Cunningham, Hoagy, Henighan, Thomas, Jermyn, Adam, Jones, Andy, Persic, Andrew, Qi, Zhenyi, Thompson, T. Ben, Zimmerman, Sam, Rivoire, Kelley, Conerly, Thomas, Olah, Chris, Batson, Joshua. (2025). On the Biology of a Large Language Model. Transformer Circuits Thread.

[nostalgebraist2020interpreting] Nostalgebraist. (2020). Interpreting GPT: The Logit Lens.

[zhang2019rmsnorm] Biao Zhang, Rico Sennrich. (2019). Root Mean Square Layer Normalization.

[he2024understanding] Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann. (2024). Understanding and Minimising Outlier Features in Transformer Training.

[dobey2024llama3] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, et al. (2024). The Llama 3 Herd of Models. Preprint arXiv:2407.21783.

[cobbe2021training] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman. (2021). Training Verifiers to Solve Math Word Problems. Preprint arXiv:2110.14168.

[huang-etal-2024-ravel] Tom Henighan, Shan Carter, Tristan Hume, Nelson Elhage, Robert Lasenby, Stanislav Fort, Nicholas Schiefer, Christopher Olah. (2023). Superposition, Memorization, and Double Descent. Transformer Circuits Thread.

[bricken2023monosemanticity] Bricken, Trenton, Templeton, Adly, Batson, Joshua, Chen, Brian, Jermyn, Adam, Conerly, Tom, Turner, Nick, Anil, Cem, Denison, Carson, Askell, Amanda, Lasenby, Robert, Wu, Yifan, Kravec, Shauna, Schiefer, Nicholas, Maxwell, Tim, Joseph, Nicholas, Hatfield-Dodds, Zac, Tamkin, Alex, Nguyen, Karina, McLean, Brayden, Burke, Josiah E, Hume, Tristan, Carter, Shan, Henighan, Tom, Olah, Christopher. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.

[zhong2023mquake] Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, Danqi Chen. (2023). MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions.

[hendrycks2021measuring] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt. (2021). Measuring Mathematical Problem Solving With the MATH Dataset.

[gromov2025unreasonable] Gromov, Andrey, Tirumala, Kushal, Shapourian, Hassan, Glorioso, Paolo, Roberts, Dan. (2025). The Unreasonable Ineffectiveness of the Deeper Layers.

[fiotto-kaufman2025nnsight] Rhys Gould, Euan Ong, George Ogden, Arthur Conmy. (2024). Successor Heads: Recurring, Interpretable Attention Heads In The Wild.

[mcdougall2023copy] Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda. (2023). Copy Suppression: Comprehensively Understanding an Attention Head. Preprint arXiv:2310.04625.

[veit2016residual] Andreas Veit, Michael J. Wilber, Serge J. Belongie. (2016). Residual Networks Behave Like Ensembles of Relatively Shallow Networks.

[gurnee2024language] Wes Gurnee, Max Tegmark. (2024). Language Models Represent Space and Time. The Twelfth International Conference on Learning Representations.

[liu2023dejavu] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré. (2023). Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time.

[voita2024neurons] Elena Voita, Javier Ferrando, Christoforos Nalmpantis. (2024). Neurons in Large Language Models: Dead, N-gram, Positional.

[vig2020investigating] Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, Stuart M. Shieber. (2020). Investigating Gender Bias in Language Models Using Causal Mediation Analysis.

[geiger2020blackbox] Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, Noah D. Goodman. (2024). Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations. Causal Learning and Reasoning.

[Ravfogel:2020] Shauli Ravfogel, Yanai Elazar, Hila Gonen, Michael Twiton, Yoav Goldberg. (2020). Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection.

[qwen25] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zihan Qiu. (2024). Qwen2.5 Technical Report. Preprint arXiv:2412.15115.

[geiping2025scaling] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein. (2025). Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. Preprint arXiv:2502.05171.

[hao2024training] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, Yuandong Tian. (2024). Training Large Language Models to Reason in a Continuous Latent Space. Preprint arXiv:2412.06769.

[open-llm-leaderboard-v2] Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, Thomas Wolf. (2024). Open LLM Leaderboard v2.

[skean2025layer] Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv. (2025). Layer by Layer: Uncovering Hidden Representations in Language Models. Preprint arXiv:2502.02013.

[ethayarajh2019anisotropy] Kawin Ethayarajh. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings.

[Vaswani+2017] Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia. (2017). Attention is All you Need. Advances in Neural Information Processing Systems.

[vig2019analyzing] Vig, Jesse, Belinkov, Yonatan. (2019). Analyzing the structure of attention in a transformer language model. arXiv preprint arXiv:1906.04284.

[vig2019multiscale] Vig, Jesse. (2019). A multiscale visualization of attention in the transformer model. arXiv preprint arXiv:1906.05714.

[brunner2019identifiability] Brunner, Gino, Liu, Yang, Pascual, Damian, Richter, Oliver, Ciaramita, Massimiliano, Wattenhofer, Roger. (2019). On identifiability in transformers. arXiv preprint arXiv:1908.04211.

[clark-etal-2019-bert] Clark, Kevin, Khandelwal, Urvashi, Levy, Omer, Manning, Christopher D.. (2019). What Does BERT Look at? An Analysis of BERT's Attention. Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. doi:10.18653/v1/W19-4828.

[liu-etal-2024-intactkv] Liu, Ruikang, Bai, Haoli, Lin, Haokun, Li, Yuening, Gao, Han, Xu, Zhengzhuo, Hou, Lu, Yao, Jun, Yuan, Chun. (2024). IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact. Findings of the Association for Computational Linguistics: ACL 2024. doi:10.18653/v1/2024.findings-acl.460.

[geshkovski2023mathematical] Geshkovski, Borjan, Letrouit, Cyril, Polyanskiy, Yury, Rigollet, Philippe. (2023). A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794.

[ge2024model] Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao. (2024). Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. The Twelfth International Conference on Learning Representations.

[yona2025interpreting] Yona, Itay, Shumailov, Ilia, Hayes, Jamie, Barbero, Federico, Gandelsman, Yossi. (2025). Interpreting the Repeated Token Phenomenon in Large Language Models. arXiv preprint arXiv:2503.08908.

[wu2024role] Wu, Xinyi, Ajorlou, Amir, Wang, Yifei, Jegelka, Stefanie, Jadbabaie, Ali. (2024). On the Role of Attention Masks and LayerNorm in Transformers. arXiv preprint arXiv:2405.18781.

[dubey2024llama] Dubey, Abhimanyu, Jauhri, Abhinav, Pandey, Abhinav, Kadian, Abhishek, Al-Dahle, Ahmad, Letman, Aiesha, Mathur, Akhil, Schelten, Alan, Yang, Amy, Fan, Angela, others. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[team2024gemma] Gemma Team. (2024). Gemma: Open Models Based on Gemini Research and Technology. arXiv preprint arXiv:2403.08295.

[xiao2024efficient] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis. (2024). Efficient Streaming Language Models with Attention Sinks. The Twelfth International Conference on Learning Representations.

[guo2024active] Guo, Tianyu, Pai, Druv, Bai, Yu, Jiao, Jiantao, Jordan, Michael I, Mei, Song. (2024). Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms. arXiv preprint arXiv:2410.13835.

[sun2024massive] Mingjie Sun, Xinlei Chen, J Zico Kolter, Zhuang Liu. (2024). Massive Activations in Large Language Models. First Conference on Language Modeling.

[cancedda2024spectral] Cancedda, Nicola. (2024). Spectral filters, dark signals, and attention sinks. arXiv preprint arXiv:2402.09221.

[gu2025when] Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin. (2025). When Attention Sink Emerges in Language Models: An Empirical View. The Thirteenth International Conference on Learning Representations.

[barbero2025round] Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Veličković. (2025). Round and Round We Go! What makes Rotary Positional Encodings useful?. The Thirteenth International Conference on Learning Representations.

[dong2021attention] Dong, Yihe, Cordonnier, Jean-Baptiste, Loukas, Andreas. (2021). Attention is not all you need: Pure attention loses rank doubly exponentially with depth. International conference on machine learning.

[di2022understanding] Francesco Di Giovanni, James Rowbottom, Benjamin Paul Chamberlain, Thomas Markovich, Michael M. Bronstein. (2023). Understanding convolution on graphs via energies. Transactions on Machine Learning Research.

[keriven2022not] Keriven, Nicolas. (2022). Not too little, not too much: a theoretical analysis of graph (over) smoothing. Advances in Neural Information Processing Systems.

[velivckovic2024softmax] Petar Veličković, et al. (2024). softmax is not enough (for sharp out-of-distribution). arXiv preprint arXiv:2410.01104.

[vitvitskyi2025makes] Alex Vitvitskyi, João G. M. Araújo, et al. (2025). What makes a good feedforward computational graph?. arXiv preprint arXiv:2502.06751.

[noci2022signal] Noci, Lorenzo, Anagnostidis, Sotiris, Biggio, Luca, Orvieto, Antonio, Singh, Sidak Pal, Lucchi, Aurelien. (2022). Signal propagation in transformers: Theoretical perspectives and the role of rank collapse. Advances in Neural Information Processing Systems.

[arroyo2025vanishing] Álvaro Arroyo, Alessio Gravina, Benjamin Gutteridge, Federico Barbero, Claudio Gallicchio, Xiaowen Dong, Michael Bronstein, Pierre Vandergheynst. (2025). On Vanishing Gradients, Over-Smoothing, and Over-Squashing in GNNs: Bridging Recurrent and Graph Learning. arXiv preprint arXiv:2502.10818.

[radford2018improving] Radford, Alec, Narasimhan, Karthik, Salimans, Tim, Sutskever, Ilya, others. (2018). Improving language understanding by generative pre-training.

[barbero2024transformers] Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João Madeira Araújo, Oleksandr Vitvitskyi, Razvan Pascanu, Petar Veličković. (2024). Transformers need glasses! Information over-squashing in language tasks. Advances in Neural Information Processing Systems.

[naderi2024mind] Naderi, Alireza, Saada, Thiziri Nait, Tanner, Jared. (2024). Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers. arXiv preprint arXiv:2410.07799.

[raposo2024mixture] Raposo, David, Ritter, Sam, Richards, Blake, Lillicrap, Timothy, Humphreys, Peter Conway, Santoro, Adam. (2024). Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258.

[brown2020language] Brown, Tom, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared D, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, others. (2020). Language models are few-shot learners. Advances in neural information processing systems.

[zhao2024analysing] Yu Zhao, Yuanbin Qu, Konrad Staniszewski, Szymon Tworkowski, Wei Liu, Piotr Miłoś, et al. (2024). Analysing the Impact of Sequence Composition on Language Model Pre-Training. arXiv preprint arXiv:2402.13991.

[csordas2025dollms] Róbert Csordás, Christopher D. Manning, Christopher Potts. (2025). Do Language Models Use Their Depth Efficiently?. arXiv preprint arXiv:2505.13898.

[barbero2025llmsattendtoken] Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, Razvan Pascanu. (2025). Why do LLMs attend to the first token?. arXiv preprint arXiv:2504.02732.

[effective-rank] Roy, Olivier, Vetterli, Martin. (2007). The effective rank: A measure of effective dimensionality. 2007 15th European Signal Processing Conference.

[biderman2023pythiasuiteanalyzinglarge] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal. (2023). Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling.

[openai2025gptoss120bgptoss20bmodel] Agarwal, Sandhini, Ahmad, Lama, Ai, Jason, Altman, Sam, Applebaum, Andy, Arbus, Edwin, Arora, Rahul K, Bai, Yu, Baker, Bowen, Bao, Haiming, others. (2025). gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925.

[sandovalsegura2025usingattentionsinksidentify] Sandoval-Segura, Pedro, Wang, Xijun, Panda, Ashwinee, Goldblum, Micah, Basri, Ronen, Goldstein, Tom, Jacobs, David. (2025). Using Attention Sinks to Identify and Evaluate Dormant Heads in Pretrained LLMs. arXiv preprint arXiv:2504.03889.

[cobbe2021gsm8k] Cobbe, Karl, Kosaraju, Vineet, Bavarian, Mohammad, Chen, Mark, Jun, Heewoo, Kaiser, Lukasz, Plappert, Matthias, Tworek, Jerry, Hilton, Jacob, Nakano, Reiichiro, Hesse, Christopher, Schulman, John. (2021). Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

[lozhkov2024fineweb-edu] Lozhkov, Anton, Ben Allal, Loubna, von Werra, Leandro, Wolf, Thomas. FineWeb-Edu: the Finest Collection of Educational Content. doi:10.57967/hf/2497.

[zellers2019hellaswagmachinereallyfinish] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi. (2019). HellaSwag: Can a Machine Really Finish Your Sentence?.

[sakaguchi2019winograndeadversarialwinogradschema] Sakaguchi, Keisuke, Bras, Ronan Le, Bhagavatula, Chandra, Choi, Yejin. (2021). Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM.

[allenai:arc] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord. (2018). Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1.

[muennighoff2023mtebmassivetextembedding] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers. (2022). MTEB: Massive Text Embedding Benchmark. arXiv preprint arXiv:2210.07316.

[wikitext] Merity, Stephen, Xiong, Caiming, Bradbury, James, Socher, Richard. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

[eval-harness] Gao, Leo, Tow, Jonathan, Abbasi, Baber, Biderman, Stella, Black, Sid, DiPofi, Anthony, Foster, Charles, Golding, Laurence, Hsu, Jeffrey, Le Noac'h, Alain, Li, Haonan, McDonell, Kyle, Muennighoff, Niklas, Ociepa, Chris, Phang, Jason, Reynolds, Laria, Schoelkopf, Hailey, Skowron, Aviya, Sutawika, Lintang, Tang, Eric, Thite, Anish, Wang, Ben, Wang, Kevin, Zou, Andy. The Language Model Evaluation Harness. doi:10.5281/zenodo.12608602.

[weller2025theoreticallimitationsembeddingbasedretrieval] Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee. (2025). On the Theoretical Limitations of Embedding-Based Retrieval.

[yang2024qwen2technicalreport] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, Zhihao Fan. (2024). Qwen2 Technical Report.

[workshop2023bloom176bparameteropenaccessmultilingual] BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. (2023). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model.

[MACKIEWICZ1993303] Andrzej Maćkiewicz, Waldemar Ratajczak. (1993). Principal Components Analysis (PCA). Computers & Geosciences.

[bondarenko2023quantizabletransformersremovingoutliers] Bondarenko, Yelysei, Nagel, Markus, Blankevoort, Tijmen. (2023). Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing. Advances in Neural Information Processing Systems.

[socher-etal-2013-recursive] Socher, Richard, Perelygin, Alex, Wu, Jean, Chuang, Jason, Manning, Christopher D., Ng, Andrew, Potts, Christopher. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.

[frasca2020signscalableinceptiongraph] Fabrizio Frasca, Emanuele Rossi, Davide Eynard, Ben Chamberlain, Michael Bronstein, Federico Monti. (2020). SIGN: Scalable Inception Graph Neural Networks.

[press2022alibi] Ofir Press, Noah A. Smith, Mike Lewis. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.

[balestriero2024for] Randall Balestriero, Hai Huang. (2024). For Perception Tasks: The Cost of LLM Pretraining by Next-Token Prediction Outweigh its Benefits. NeurIPS 2024 Workshop: Self-Supervised Learning - Theory and Practice.

[nanda2022transformerlens] Neel Nanda, Joseph Bloom. (2022). TransformerLens.

[belrose2023eliciting] Belrose, Nora, Furman, Zach, Smith, Logan, Halawi, Danny, Ostrovsky, Igor, McKinney, Lev, Biderman, Stella, Steinhardt, Jacob. (2023). Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112.

[lozhkovfineweb] Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, Thomas Wolf. (2024). FineWeb-Edu: the Finest Collection of Educational Content. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

[kawaguchi2023does] Kawaguchi, Kenji, Deng, Zhun, Ji, Xu, Huang, Jiaoyang. (2023). How does information bottleneck help deep learning?. International conference on machine learning.

[shwartz2018representation] Shwartz-Ziv, Ravid, Painsky, Amichai, Tishby, Naftali. (2018). Representation compression and generalization in deep neural networks.

[marks2024geometrytruthemergentlinear] Marks, Samuel, Tegmark, Max. (2023). The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.

[park2024linearrepresentationhypothesisgeometry] Park, Kiho, Choe, Yo Joong, Veitch, Victor. (2023). The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658.

[jiang2024originslinearrepresentationslarge] Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam, Victor Veitch. (2024). On the Origins of Linear Representations in Large Language Models.

[schuster2022confident] Schuster, Tal, Fisch, Adam, Gupta, Jai, Dehghani, Mostafa, Bahri, Dara, Tran, Vinh, Tay, Yi, Metzler, Donald. (2022). Confident adaptive language modeling. Advances in Neural Information Processing Systems.

[weber2024redpajamaopendatasettraining] Maurice Weber, Daniel Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher Ré, Irina Rish, Ce Zhang. (2024). RedPajama: an Open Dataset for Training Large Language Models.

[ali2025entropy] Riccardo Ali, Francesco Caso, Christopher Irwin, Pietro Liò. (2025). Entropy-Lens: The Information Signature of Transformer Computations. arXiv preprint arXiv:2502.16570.

[bib1] Agarwal et al. (2025) Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.

[bib2] Ali et al. (2025) Riccardo Ali, Francesco Caso, Christopher Irwin, and Pietro Liò. Entropy-lens: The information signature of transformer computations. arXiv preprint arXiv:2502.16570, 2025.

[bib3] Arroyo et al. (2025) Álvaro Arroyo, Alessio Gravina, Benjamin Gutteridge, Federico Barbero, Claudio Gallicchio, Xiaowen Dong, Michael Bronstein, and Pierre Vandergheynst. On vanishing gradients, over-smoothing, and over-squashing in gnns: Bridging recurrent and graph learning. arXiv preprint arXiv:2502.10818, 2025.

[bib4] Randall Balestriero and Hai Huang. For perception tasks: The cost of LLM pretraining by next-token prediction outweigh its benefits. In NeurIPS 2024 Workshop: Self-Supervised Learning - Theory and Practice, 2024.

[bib5] Barbero et al. (2024) Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João Madeira Araújo, Oleksandr Vitvitskyi, Razvan Pascanu, and Petar Veličković. Transformers need glasses! information over-squashing in language tasks. Advances in Neural Information Processing Systems, 37:98111–98142, 2024.

[bib6] Barbero et al. (2025a) Federico Barbero, Alvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do llms attend to the first token? arXiv preprint arXiv:2504.02732, 2025a.

[bib7] Barbero et al. (2025b) Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, and Petar Veličković. Round and round we go! what makes rotary positional encodings useful? In The Thirteenth International Conference on Learning Representations, 2025b.

[bib8] Belrose et al. (2023) Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.

[bib9] Bondarenko et al. (2023) Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 75067–75096, 2023.

[bib10] Nicola Cancedda. Spectral filters, dark signals, and attention sinks. arXiv preprint arXiv:2402.09221, 2024.

[bib11] Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018.

[bib12] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[bib13] Róbert Csordás, Christopher D Manning, and Christopher Potts. Do language models use their depth efficiently? arXiv preprint arXiv:2505.13898, 2025.

[bib14] Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In International Conference on Machine Learning, pp. 2793–2803. PMLR, 2021.

[bib15] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[bib16] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The language model evaluation harness, July 2024. URL https://zenodo.org/records/12608602.

[bib17] Gemma Team: Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.

[bib18] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. A mathematical perspective on transformers. arXiv preprint arXiv:2312.10794, 2023.

[bib19] Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. In The Thirteenth International Conference on Learning Representations, 2025.

[bib20] Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I Jordan, and Song Mei. Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in LLMs. arXiv preprint arXiv:2410.13835, 2024.

[bib21] Kenji Kawaguchi, Zhun Deng, Xu Ji, and Jiaoyang Huang. How does information bottleneck help deep learning? In International Conference on Machine Learning, pp. 16049–16096. PMLR, 2023.

[bib22] Vedang Lad, Jin Hwa Lee, Wes Gurnee, and Max Tegmark. The remarkable robustness of LLMs: Stages of inference? arXiv preprint arXiv:2406.19384, 2024.

[bib23] Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. FineWeb-Edu: the finest collection of educational content, 2024. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu.

[bib24] Andrzej Maćkiewicz and Waldemar Ratajczak. Principal components analysis (PCA). Computers & Geosciences, 19(3):303–342, 1993.

[bib25] Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.

[bib26] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.

[bib27] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022.

[bib28] Neel Nanda and Joseph Bloom. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens, 2022.

[bib29] Nostalgebraist. Interpreting GPT: The logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020.

[bib30] Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023.

[bib31] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018.

[bib32] Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. The shape of learning: Anisotropy and intrinsic dimensions in transformer-based models. arXiv preprint arXiv:2311.05928, 2023.

[bib33] Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov. Your transformer is secretly linear. arXiv preprint arXiv:2405.12250, 2024.

[bib34] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

[bib35] Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, and David Jacobs. Using attention sinks to identify and evaluate dormant heads in pretrained LLMs. arXiv preprint arXiv:2504.03889, 2025.

[bib36] Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. Advances in Neural Information Processing Systems, 35:17456–17472, 2022.

[bib37] Ravid Shwartz-Ziv, Amichai Painsky, and Naftali Tishby. Representation compression and generalization in deep neural networks, 2018. URL https://openreview.net/forum?id=SkeL6sCqK7.

[bib38] Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013, 2025.

[bib39] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/D13-1170.

[bib40] Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=F7aAhfitX6.

[bib41] Maurice Weber, Daniel Fu, Quentin Anthony, Yonatan Oren, Shane Adams, Anton Alexandrov, Xiaozhong Lyu, Huu Nguyen, Xiaozhe Yao, Virginia Adams, Ben Athiwaratkun, Rahul Chalamala, Kezhen Chen, Max Ryabinin, Tri Dao, Percy Liang, Christopher Ré, Irina Rish, and Ce Zhang. RedPajama: an open dataset for training large language models, 2024. URL https://arxiv.org/abs/2411.12372.

[bib42] Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, and Ali Jadbabaie. On the role of attention masks and LayerNorm in transformers. arXiv preprint arXiv:2405.18781, 2024.

[bib43] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF.

[bib44] Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, and Yossi Gandelsman. Interpreting the repeated token phenomenon in large language models. arXiv preprint arXiv:2503.08908, 2025.

[bib45] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019. URL https://arxiv.org/abs/1905.07830.