
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

Abstract

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B, and 70B) into DMC Transformers, achieving up to 7× throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. DMC preserves the original downstream performance with up to 4× cache compression, outperforming up-trained grouped-query attention (GQA) and key-value eviction policies (H2O, TOVA). GQA and DMC can even be combined to obtain compounded gains. Hence, DMC can serve as a drop-in replacement for KV caching in existing LLMs to fit longer contexts and larger batches within any given memory budget.


Piotr Nawrot* (NVIDIA, University of Edinburgh), Adrian Łańcucki* (NVIDIA, University of Wrocław), Marcin Chochowski (NVIDIA), David Tarjan (NVIDIA), Edoardo M. Ponti (University of Edinburgh)


Introduction

Transformer Large Language Models (LLMs) are the state of the art in generative and conversational AI (Touvron et al., 2023; Jiang et al., 2023). Their deployment, however, is curtailed in part by their inefficiency. This is not only due to the quadratic complexity of attention layers (Bahdanau et al., 2015; Vaswani et al., 2017): during generation, Transformers store the keys and values of past tokens in memory

*Equal contribution. Correspondence to: Piotr Nawrot <piotr.nawrot@ed.ac.uk>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Figure 1: Key-value cache update mechanisms. (a) Regular key-value cache with items kv_i depicted as boxes; new items are always appended. (b) Dynamic Memory Compression (DMC) chooses whether to accumulate or append current items, resulting in a smaller key-value cache.

to avoid recomputing them multiple times. Since this key-value (KV) cache grows linearly with the sequence length and batch size, generation with Transformers quickly becomes prohibitive due to the excessive memory load. This issue emerges even more clearly with long-context generation (e.g., in dialogues and stories) or when serving large numbers of user queries.

A widespread solution to increase the memory efficiency of Transformers during inference is Grouped Query Attention (GQA; Ainslie et al., 2023; Shazeer, 2019), where the number of key and value heads is reduced by sharing each of them among multiple query heads. Alternatively, the length of the KV cache can be shortened through token merging (Zhang et al., 2018; Liu et al., 2018; Bolya et al., 2023) or cache eviction policies (Zhang et al., 2023; Oren et al., 2024). Nevertheless, these methods often pay the price of a severe degradation in downstream performance. On the other hand, hardware/IO-aware (Dao et al., 2022; Kwon et al., 2023) and sub-quadratic algorithms for attention (Beltagy et al., 2020; Choromanski et al., 2021) do not alleviate the memory load of the KV cache.

In our work, we aim to adaptively compress the KV cache of LLMs, retaining their performance while reducing their

memory load. To this end, we propose Dynamic Memory Compression (DMC). As shown in Figure 1, during every time step, DMC decides whether to append the current key and value representations to the cache or to perform a weighted average of them with the top-most item in the cache. Note that, for an appropriate prior over compression decisions, the memory can grow sub-linearly in DMC, which therefore falls halfway between vanilla Transformers and state space language models (Fu et al., 2023; Gu & Dao, 2023), where memory size is constant.

In our experiments, we equip pre-existing LLMs, such as Llama 2 (Touvron et al., 2023) 7B, 13B, and 70B, with DMC by retrofitting them on a negligible percentage of the original pre-training data (~2% for 2× compression, and ~8% for 8× compression) and without adding any extra parameters to the original LLM. We evaluate our DMC models on a series of downstream tasks such as MMLU (Hendrycks et al., 2021) for factuality, QA datasets for common-sense reasoning, and HumanEval (Chen et al., 2021) for code. We find that DMC LLMs retain a downstream performance similar to the original LLM, whereas baselines such as GQA, H2O, and TOVA incur significant degradation at high compression ratios. Finally, we show that DMC can be hybridized with GQA such that their compression ratios are compounded: for Llama 2 70B, which is pre-trained with GQA 8×, DMC 2× achieves a total compression of 16×.

We verify that KV cache compression translates into more efficient generation in practice. We measure that DMC 4× increases inference throughput by 350% to 390% for Llama 2 7B and 13B on NVIDIA H100 or A100 GPUs without loss in performance. For higher compression, such as DMC 8×, we observe inference throughput gains of up to 700% with a ~5% relative MMLU performance drop for Llama 2 7B and 13B. In effect, DMC allows us to fit larger batches and longer sequences into a given memory budget. Finally, the compression schemata learned by DMC provide insights into the internal structure of LLMs, revealing a preference for compressing heads in higher layers.

Background

Multi-Head Self-Attention

Let X = (x_1, ..., x_n) ∈ R^{n×d} denote the input sequence of hidden states of a Transformer layer, where n stands for the number of tokens in a sequence and d for the hidden state dimension. A Multi-Head Self-Attention (MHSA) layer splits the embeddings across n_h different heads, and self-attention is then applied to each head separately. This enables the model to focus on different parts of the input, capturing various types of relationships in the data. For each head h, separate weight matrices W^h_q, W^h_k, W^h_v ∈ R^{d × d/n_h} are used to project the input sequence into queries Q^h, keys K^h, and values V^h:

$$
Q^h = X W^h_q, \qquad K^h = X W^h_k, \qquad V^h = X W^h_v
$$

The attention scores and outputs for the i-th token are then computed as

$$
a^h_{ij} = \operatorname{softmax}_j\!\left(\frac{q^h_i \cdot k^h_j}{\sqrt{d/n_h}}\right), \qquad o^h_i = \sum_{j} a^h_{ij}\, v^h_j
$$

Finally, the outputs from all heads are concatenated and linearly transformed to produce the final output O ∈ R n × d :

$$
O = \operatorname{Concat}\!\left(O^1, \ldots, O^{n_h}\right) W_o
$$

where W_o ∈ R^{d × d}.

KV Caching During Inference

In a Transformer decoder, the generation of sequences is auto-regressive: each new token is predicted based on the previously generated ones. To avoid redundant computation, it is common practice to store the keys and values of previously computed tokens in the KV cache. At each time step i, only the keys and values for the current token x_i are computed, whereas those for x_{<i} are retrieved from the cache. Thus, for each head h:

$$
k^h_i = x_i W^h_k, \qquad v^h_i = x_i W^h_v
$$

$$
K^h \leftarrow \left[K^h;\, k^h_i\right], \qquad V^h \leftarrow \left[V^h;\, v^h_i\right]
$$

Note that this process is not necessary for queries as only the query for the current token x i is needed at each inference time step.

As a consequence, the KV cache grows linearly with each new token. This progressive expansion leads to substantial memory consumption and increased latency, especially for long input sequences. This issue is further exacerbated when serving multiple requests concurrently, as each inference process requires its own KV cache, significantly straining the system's resources.
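To make the memory load concrete, the following back-of-the-envelope sketch estimates the KV cache size; the configuration is loosely Llama-2-7B-like (32 layers, 32 KV heads, head dimension 128), and the helper name and default dtype are our illustrative assumptions, not values from this paper.

```python
# Rough KV cache size estimate; configuration and function name are
# illustrative assumptions, not values from the paper.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Bytes needed to cache keys AND values for every past token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes  # K and V
    return per_token * seq_len * batch

# The cache grows linearly with sequence length and batch size:
one_seq = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1)
print(one_seq / 2**30)  # → 2.0 (GiB of fp16 cache for one 4096-token sequence)
```

At batch size 8, the same context already consumes 16 GiB, which is why long contexts and large batches quickly exhaust GPU memory.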

Method: Dynamic Memory Compression

In this work, we tackle the problem of reducing the size of the KV cache, which brings two immediate benefits (Zhang et al., 2023): 1) it lowers the latency of auto-regressive generation, and 2) improves GPU utilization by allowing for larger batch sizes and sequence lengths, leading to increased throughput. In this section, we introduce Dynamic Memory Compression (DMC), a simple and inexpensive method for online compression of the KV cache at inference time. In what follows, we first describe the inference-time operation of DMC, which constitutes our end goal. Next, we illustrate how to teach a pre-trained LLM such behavior through short, continued pre-training.

Inference

Consider the forward pass of an attention layer during auto-regressive inference (Section 2.1). In a vanilla Transformer, at every time step t, both k_t and v_t are appended to the KV cache (Section 2.2). In DMC, on the other hand, the KV cache update procedure is different, as detailed in Algorithm 1. First, a decision variable α_t ∈ {0, 1} and an importance variable ω_t ∈ [0, 1] are predicted. In order to avoid adding new parameters, we reuse the first neuron of k_t and q_t, respectively, to extract the two scores: 1

$$
\alpha_t = \operatorname{sigmoid}\!\left(k_t[0]\right), \qquad \omega_t = \operatorname{sigmoid}\!\left(q_t[0]\right)
$$

where the sigmoid normalizes both scores into the range [0, 1]. At inference time, the decision score α_t is made binary by rounding it to the closest integer.

Based on α t , a decision is made whether KV representations k t and v t are appended to the cache or accumulated with its last element (Figure 1). In particular, for accumulation, we perform a weighted average based on the predicted importance score ω t for the current token and the running sum

1 Note that this choice is arbitrary as we could use any two neurons from { q t , k t , v t } .

Algorithm 1 Single-head KV cache update with Dynamic Memory Compression (DMC)

Require: K, V, q_t, k_t, v_t (cache of length l; running importance sum z_{t-1})

1: α_t ← round(sigmoid(k_t[0]))
2: ω_t ← sigmoid(q_t[0])
3: k_t[0], q_t[0] ← 0, 0
4: if α_t = 1 then    ▷ ACCUMULATE
5:     z_t ← z_{t-1} + ω_t
6:     K[l] ← (z_{t-1} K[l] + ω_t k_t) / z_t
7:     V[l] ← (z_{t-1} V[l] + ω_t v_t) / z_t
8: else    ▷ APPEND
9:     z_t ← ω_t
10:    K ← [K_{1:l}, k_t]
11:    V ← [V_{1:l}, v_t]
12: end if
13: return K, V, q_t, k_t

of importance scores z t for all the tokens since the last time step α = 0 was predicted.

In effect, the α variable segments the input sequence: each decision determines whether the current segment should continue (α = 1) or a new segment should be opened (α = 0). As a result, after the update, the cache length for DMC is l = Σ_{t=1}^{n} (1 − α_t) ≤ n, whereas in vanilla Transformers it is always l = n. In what follows, we refer to the ratio n/l between the length n of an uncompressed cache and the compressed length l as the Compression Ratio (CR).

Finally, multi-head self-attention is calculated similarly to vanilla Transformers but using the compressed KV cache sequences. Algorithm 1 is applied to every MHSA layer and head independently. Hence, the corresponding KV sequences might have different lengths. Note that Algorithm 1 can be implemented efficiently without the if-then-else statement conditioned on α t , instead multiplying the previous k i , v i and z i by α t like in Equation (10).
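For illustration, the single-head update of Algorithm 1 can be sketched in a few lines of plain Python; this is our toy reimplementation with lists and scalar math, not the paper's code, and it keeps the if-then-else form for readability rather than the branchless multiplicative form mentioned above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dmc_update(K, V, z, q_t, k_t, v_t):
    """Append (k_t, v_t) to the cache or accumulate it into the last slot.
    z is the running sum of importance weights for the open segment."""
    alpha = round(sigmoid(k_t[0]))  # binary append/accumulate decision
    omega = sigmoid(q_t[0])         # importance of the current token
    # Zero out the neurons re-purposed for the two scores.
    k_t, q_t = [0.0] + k_t[1:], [0.0] + q_t[1:]
    if alpha == 1 and K:            # ACCUMULATE: weighted average in-place
        z_new = z + omega
        K[-1] = [(z * a + omega * b) / z_new for a, b in zip(K[-1], k_t)]
        V[-1] = [(z * a + omega * b) / z_new for a, b in zip(V[-1], v_t)]
    else:                           # APPEND: open a new segment
        z_new = omega
        K.append(k_t)
        V.append(v_t)
    return K, V, z_new, q_t
```

Calling this once per time step keeps the cache length equal to the number of segments opened so far, rather than the number of tokens.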

Training

The DMC inference algorithm switches between accumulating and appending tokens to the KV cache. In order to endow LLMs with DMC, we continue pre-training them on a negligible amount of their pre-training data, gradually increasing the compression ratio towards a target. However, this poses serious challenges. First, we opt for end-to-end learning via gradient descent, which requires a continuous relaxation of the decision variables. As a result, we have to define an operation for KV cache updating when 0 < α < 1, resulting in partly appended, partly accumulated key and value states. Second, to avoid a training-inference mismatch, we must simulate the inference-time DMC behavior while parallelizing training across a sequence of tokens. As a consequence, the length of K and V is not reduced through compression during training; rather, all intermediate states of keys and values are explicitly kept in the sequence, and an auxiliary, gradually discretized mask regulates interactions between queries and keys. We elaborate on our solutions to these challenges below.

Gradient Estimation for Discrete Decisions The decision whether to accumulate or append at inference time is a discrete one; however, rounding α t = sigmoid( k t [0]) to the nearest integer for training would result in an operation with zero or undefined gradients. Hence, we resort to stochastic reparametrization of the decision variable during training

$$
\alpha_t \sim \operatorname{Gumbel\text{-}sigmoid}\!\left(k_t[0] - c,\; \tau\right)
$$

where τ is the temperature 2 and c is a constant subtracted so that every α ≈ 0 at training step 0 . This ensures that DMC initially performs no compression and starts the training behaving just like the vanilla Transformer.
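A minimal sketch of such a stochastic relaxation follows; the logistic (difference-of-Gumbels) noise term and the default values below are our assumptions about one reasonable instantiation, not the paper's exact recipe.

```python
import math, random

def gumbel_sigmoid(logit, tau=0.1, c=5.0):
    """Relaxed append/accumulate decision alpha in (0, 1).
    Subtracting c keeps alpha near 0 early in training (no compression)."""
    u1, u2 = random.random(), random.random()
    # The difference of two Gumbel samples is logistic noise.
    noise = math.log(math.log(u2) / math.log(u1))
    x = (logit - c + noise) / tau
    x = max(min(x, 60.0), -60.0)  # clamp for numerical safety
    return 1.0 / (1.0 + math.exp(-x))
```

With a low temperature such as τ = 0.1, samples land very close to 0 or 1, mimicking the discrete inference-time decision while keeping gradients usable.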

Partial accumulations As we relax our discrete decisions, we must now define a mechanism to update the KV cache that generalizes Algorithm 1 to continuous α . Hence, we define partially accumulated states for α ∈ [0 , 1] as:

$$
z_t = \alpha_t z_{t-1} + \omega_t, \qquad
\bar{k}_t = \frac{\alpha_t z_{t-1} \bar{k}_{t-1} + \omega_t k_t}{z_t}, \qquad
\bar{v}_t = \frac{\alpha_t z_{t-1} \bar{v}_{t-1} + \omega_t v_t}{z_t}
$$

Note that when α ∈ { 0 , 1 } , Equation (10) defaults to Algorithm 1.
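Equation (10) can be computed with a simple sequential scan; in the sketch below, scalars stand in for the key vectors (the value states follow the same recurrence), and the function name is ours.

```python
def partial_accumulate(alphas, omegas, ks):
    """Return the unrolled, partially accumulated states k_bar_1..k_bar_n."""
    k_bars, z_prev, k_prev = [], 0.0, 0.0
    for a, w, k in zip(alphas, omegas, ks):
        z = a * z_prev + w                         # running importance mass
        k_bar = (a * z_prev * k_prev + w * k) / z  # weighted running average
        k_bars.append(k_bar)
        z_prev, k_prev = z, k_bar
    return k_bars

# With discrete decisions the scan reduces to Algorithm 1:
# alpha = 0 opens a new segment, alpha = 1 averages by importance.
print(partial_accumulate([0, 1, 0], [0.5, 0.5, 1.0], [2.0, 4.0, 6.0]))
# → [2.0, 3.0, 6.0]
```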

Intermediate compression steps Aside from key and value computations shown in Equation (10), the rest of the forward pass can be performed in parallel for all tokens in the sequence. Nonetheless, this creates a mismatch between training and evaluation, since during training all intermediate states of keys and values are accessible in self-attention.

To illustrate this issue, consider the example of a KV cache during DMC inference shown in Figure 2 for the sequence of decision scores α_{1:5} = (1, 1, 0, 1, 0) (importance scores ω have been omitted for clarity). Because the last element of the KV cache changes in-place at every time step, future elements should not have access to its old contents (Figure 2, right). In order to properly simulate this inference-time evolution of the KV cache during training, we keep all unrolled intermediate KV cache items. However, in lieu of an auto-regressive 'causal' mask, we use an additive mask based on the sequence of α values to modify the attention scores a^h_{ij} from Equation (4), as shown in Figure 3.

2 Low temperatures sharpen α t into almost-discrete values which accurately mimics inference behavior.

Figure 2: An example of KV cache growth in DMC during inference (left). During training (right), we retain all intermediate states seen during inference and gradually block access to some of them (gray arrows).


$$
m_{ij} =
\begin{cases}
0 & \text{if } j = i \\
\log\!\left(1 - \alpha_{j+1}\right) & \text{if } j < i \\
-\infty & \text{if } j > i
\end{cases}
$$

Figure 3: Additive mask applied during training to the normalized attention scores in order to block queries from attending intermediate KV states (as well as future states).

As α values are increasingly pushed to almost-discrete states by the Gumbel estimator and the low-temperature setting, this strengthens the interactions of queries with the last state of every key-value segment (i.e., the only one accessible during inference), and weakens those with the previous states, which are discarded during inference. 3 In fact, when α ∈ { 0 , 1 } , the matrix is filled with either 0 or -∞ values, and exactly corresponds to the inference-time query-to-key attendance pattern.

Training objective The model is incentivized to compress the KV cache to a certain CR, and thus to increase the predicted α values. Instead of matching the desired rate for each append-or-accumulate decision α, we calculate a global one-sided loss as the difference between the sum of all decisions and the expected sum of KV tokens across all layers l, heads h, and time steps t under the desired Compression Ratio (CR), normalized by (n_l · n_h · n):

$$
\ell_{\mathrm{CR}} = \frac{1}{n_l\, n_h\, n} \max\!\left(0,\; \sum_{l=1}^{n_l} \sum_{h=1}^{n_h} \sum_{t=1}^{n} \left(1 - \alpha_{l,h,t}\right) - \frac{n_l\, n_h\, n}{\mathrm{CR}}\right)
$$

3 See Appendix G for details on the mask implementation.

The ℓ_CR loss is added to the language modeling loss term ℓ_LM = −Σ_{t=1}^{n} log p_θ(x_t | x_{<t}), with the final training objective being:

$$
\ell = \ell_{\mathrm{LM}} + \ell_{\mathrm{CR}}
$$

Importantly, the training procedure is designed for slowly ramping up the target CR and saving ready-to-use DMC checkpoints along the way. This is possible because all hyperparameters, like Gumbel-sigmoid sampling temperature and learning rate, are not decayed and remain constant throughout training. A practical use case of this DMC property is to produce a series of DMC checkpoints with different CRs within a single run, and then choose the one with the desired efficiency-performance trade-off.
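Based on the prose description of the training objective, here is a minimal sketch of one plausible form of the one-sided compression loss; the flattened input layout, function name, and exact normalization are our assumptions.

```python
def cr_loss(alphas, target_cr):
    """One-sided compression loss (a sketch, not the paper's exact code).
    alphas: one list of per-token decisions in [0, 1] per (layer, head) pair."""
    n_lh = len(alphas)          # n_l * n_h (layer, head) pairs
    n = len(alphas[0])          # tokens per sequence
    # Relaxed cache size: each alpha = 1 merges a token into its segment.
    cache_size = sum(1 - a for head in alphas for a in head)
    budget = n_lh * n / target_cr   # expected number of KV tokens at target CR
    return max(0.0, cache_size - budget) / (n_lh * n)
```

Only exceeding the budget is penalized, so the model is free to compress more than the target without incurring any loss.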

Mask Implementation

We mask the unnormalized attention score for the pair ( i, j ) as follows:

$$
m_{ij} =
\begin{cases}
0 & \text{if } j = i \\
\log\!\left(1 - \alpha_{j+1}\right) & \text{if } j < i \\
-\infty & \text{if } j > i
\end{cases}
$$

We rely on the memory-efficient implementation of MHSA from PyTorch, which allows adding arbitrary masks to the attention scores before the softmax. Notwithstanding this, at inference time DMC remains compatible with efficient libraries for attention such as FlashAttention (Dao et al., 2022). For better numerical precision, the log(1 − α_{j+1}) term is computed as log-sigmoid of the negated logit of α_{j+1}.
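A sketch of building such an additive mask from the decision scores: 0 on the diagonal, log(1 − α_{j+1}) below it, and −∞ above it. This pattern is our reading of the description in the text, and the function name is ours.

```python
import math

def dmc_mask(alphas):
    """Additive attention mask over n unrolled intermediate KV states."""
    n = len(alphas)
    neg_inf = float("-inf")
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j > i:
                m[i][j] = neg_inf  # causal masking of future states
            elif j < i:
                # State j stays visible only if token j+1 did NOT accumulate
                # into it; with discrete alphas this is 0 or -inf.
                a_next = alphas[j + 1]
                m[i][j] = math.log(1.0 - a_next) if a_next < 1 else neg_inf
    return m
```

For α = (0, 1, 0), the query at position 2 can attend to state 1 (the updated accumulator) and to itself, but not to the overwritten state 0, matching the inference-time pattern.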

Figure 15: Validation perplexity and test MMLU accuracy vs compression ratio for different types of compression priors, in-layer relaxations, and importance scores. All models follow the SHORT training regime and are trained up to 2 × CR.



Intermediate compression steps

In order to alleviate the problem of efficiently storing and retrieving the KV cache of variable-length sequences, we also study a Constrained variant of our algorithm (DMC-C) in which we force all heads in a given layer to maintain a similar compression ratio. A similar compression ratio allows us to store key and value sequences naïvely as padded tensors with minimal padding. To this end, we add an extra loss term to Equation (11)

$$

$$

Table 3 compares DMC with its Constrained variant DMC-C. In general, while remaining superior to GQA, DMC-C

Figure 9: Average value of the decision α for positions (0, 4096) averaged over 128 samples, heads, and layers.



Figure 10: Compression schema found by Llama 2 13B DMC 4 × in layer 0, head 14. Tokens that are merged in the KV cache are marked with the same color.

displays a significant degradation in several configurations, most notably 7B 4×, where it records a drop of 6.4 MMLU points compared to the ceiling. DMC, on the other hand, recovers all of this performance loss. When combined with custom attention implementations that do not require excessive padding, standard DMC should therefore be vastly preferred, as it retains the original LLM performance while fully reaping the advantages in memory efficiency.

Compression Schemata learned by DMC-C Studying the compression schema in Figure 12 learned by DMC-C, we find a very different pattern compared to DMC, due to the auxiliary loss forcing the model to compress similarly across heads in the same layer. Nevertheless, we observe a similar global preference for compressing deeper layers.

Figure 11: Compression schema found by Llama 2 13B DMC 4 × for layer 24, head 2. Tokens that are merged in the KV cache are marked with the same color.


Training Steps per Increase in CR Another advantage of DMC is its high flexibility. In fact, based on the availability of resources, different regimes can be chosen when annealing the CR to the target during up-training. In Figure 13, we compare a SHORT and a LONG regime for the constrained variant of DMC (DMC-C), which continuously increases the CR by 1 every 3K and 6K steps (12B and 24B tokens), respectively. Evidently, there exists a trade-off between training steps (hence, time) and performance. Additionally, Figure 13 showcases another aspect of the higher flexibility DMC affords: it is compatible with arbitrary real-valued CRs, as opposed to integer CRs divisible by 2 as in GQA.

Schedules of Target CR Additionally, we explore how different schedules for the target CR impact the model performance. In the standard setup, this is linearly annealed from 1 to the target CR throughout the duration of training. Here, we compare it with a setup where the CR used in the auxiliary loss for compression is set to the target from the start (DMC-immediate). We show the results in Figure 14. As expected, DMC-immediate has a perplexity spike at the beginning when the model quickly increases the CR due to the auxiliary loss. While perplexity is recovered during training, even to a lower point than DMC with annealing, downstream accuracy on the MMLU benchmark is degraded across the board. This showcases why avoiding perplexity spikes is fundamental to successfully retrofitting an LLM.

Practical Considerations

Storing a variable-length KV cache in memory DMC allows every head to learn a custom compression, which results in KV cache sequences with variable lengths across heads. This poses difficulties in storing these sequences efficiently in an n -dimensional tensor, considering that they will be extended during auto-regressive generation by uneven amounts of tokens. However, such sequences can be easily stored in memory with little overhead using PagedAttention (Kwon et al., 2023), where new pages are allocated on-demand for every head separately. In Section 5.2 we present latency and throughput measured with an implementation based on FlashAttention (Dao et al., 2022) and PagedAttention.

Window grouping approximation The calculation of partial accumulations during training of DMC models (Equation (10)) for a sequence of n tokens requires O(n) sequential steps, considerably slowing down training. To improve the time complexity, we calculate Equation (10) at every time step t independently over a short window of the last w elements up to time t (e.g., w = 12). This reduces the span of computation to O(w), provided that at least n threads can execute the computation in parallel, one for each of the n positions. During inference, this speed-up also applies to the prompt phase. However, the sliding window comes at a disadvantage, as it needs to be cached during inference for heads that tend to accumulate more than w tokens.
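The window approximation can be sketched as an independent recomputation per position: each position redoes the recurrence over at most w inputs, which is embarrassingly parallel across positions. Scalar states and names below are illustrative.

```python
def windowed_accumulate(alphas, omegas, ks, w=12):
    """Approximate the partial-accumulation scan using only the last w inputs."""
    out = []
    for t in range(len(ks)):              # each iteration is independent,
        lo = max(0, t - w + 1)            # so positions can run in parallel
        z, k_bar = 0.0, 0.0
        for a, wt, k in zip(alphas[lo:t + 1], omegas[lo:t + 1], ks[lo:t + 1]):
            z_new = a * z + wt
            k_bar = (a * z * k_bar + wt * k) / z_new
            z = z_new
        out.append(k_bar)
    return out
```

The result matches the full O(n) scan exactly whenever no segment grows beyond w tokens; heads that accumulate longer segments see a truncated history, which is the disadvantage noted above.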




Experimental Setup

Baselines In our experiments, we evaluate strategies to retrofit a state-of-the-art Large Language Model (LLM), Llama 2 (Touvron et al., 2023) 4 , into a more efficient model across various sizes: 7B, 13B, and 70B. In addition to comparing the downstream performance of DMC with the original model, we also use Grouped Query Attention (GQA) as a main baseline, as it constitutes the most widespread strategy to ensure KV cache efficiency (Jiang et al., 2023).

4 Obtained from https://huggingface.co/meta-llama.

We also compare DMC with two KV cache eviction policies that do not require model retrofitting: Token Omission Via Attention (TOVA; Oren et al., 2024) and Heavy-Hitter Oracle (H2O; Zhang et al., 2023). H2O keeps in memory a fixed window of the most recent tokens, as well as additional heavy-hitter (H2) tokens. H2 tokens are chosen dynamically: specifically, they correspond to the tokens with the highest attention scores aggregated throughout the sequence. TOVA, on the other hand, retains the top-k tokens based on the attention weights of the last token only: for a given attention layer, at each decoding step, the token with the lowest attention score is evicted.

Checkpoint adaptation To equip the original Llama 2 LLM with GQA, we perform the standard checkpoint conversion proposed by Ainslie et al. (2023): the key and value projection matrices are split by head. Then the resulting sub-matrices corresponding to heads in the same group are merged via averaging. Note that this results in a fixed compression during training.

As for DMC, we avoid introducing new parameters by re-purposing the first dimension of both the q_t and k_t representations to predict segmentation decisions α_t and importance scores ω_t. Setting q_t[0] and k_t[0] to zero triggers a significant increase in language modeling loss by disrupting the attention scores. To counteract this spike, we first pre-train the model to disregard the first dimension of q_t and k_t in the attention calculations. Specifically, we load the pre-trained weights and up-train the model for 1 billion tokens (250 steps), annealing the values of q_t[0] and k_t[0] to 0 according to the following formula:

$$

$$

where t is the current step and n_t = 250. After this initial phase, which teaches the model to ignore the first dimension of queries and keys in attention calculations, we start the main retrofitting phase, where the model learns to compress the KV representations.
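As an illustration, the annealing multiplier could look as follows; the linear form is our assumption, since only the endpoints (full value at step 0, zero after n_t = 250 steps) are stated in the text.

```python
def anneal_factor(step, n_steps=250):
    """Assumed linear multiplier applied to q_t[0] and k_t[0]; decays to 0."""
    return max(0.0, 1.0 - step / n_steps)
```

During this 250-step phase, q_t[0] and k_t[0] are scaled by this factor, so by the end the attention computation no longer depends on the first dimension, freeing it up for the DMC scores.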

Training hyperparameters We strictly follow the training hyperparameters outlined by Touvron et al. (2023). We employ the AdamW optimizer with β_1 = 0.9, β_2 = 0.95, and ε = 1e-5, in conjunction with a weight decay of 0.1 and gradient clipping of 1.0. The batch size is 1024 with a sequence length of 4096. We apply a constant learning rate identical to the final rate from the original Llama 2 pre-training phase: 3e-5 for the 7B and 13B models, and 1.5e-5 for the 70B model. We set the constant from Equation (9) as c = 5, which in practice results in α_t ≈ 0.0067. Empirically, c = 5 is high enough that we do not experience a spike in language modeling loss at the start, yet low enough to be easily changed by learning q_t[0] and k_t[0] through gradient optimization. Finally, we set the window size (Section 3.3) to 12, and keep the Gumbel-sigmoid temperature τ constant at 0.1 for the entire training. We did not perform exhaustive searches for these values: we believe the DMC retrofitting procedure is robust to a wide range of sensible hyperparameter choices.

Training schedule The volume of data for continued pre-training of DMC is contingent on the targeted KV cache compression ratio; a larger CR necessitates more data. We use a linear training schedule with 24B, 72B, 120B, and 168B tokens for training to 2×, 4×, 6×, and 8× compression, respectively. In Appendix E we include an ablation that uses a schedule with half as much data.

We discovered that the right retrofitting strategy is crucial for DMC: starting the training without compression helps to preserve the original perplexity. Any significant increase in perplexity, even if recovered during continued pre-training, prevents the model from regaining its performance on downstream tasks (see the ablations in Appendix E). The target CR is linearly increased from 1 × to 8 × for the 7B and 13B models, and from 1 × to 2 × for the 70B model. 5

Upon achieving the target compression ratio, we initiate a final solidifying phase wherein we: 1) continue up-training for an additional 8B tokens, 2) maintain a fixed compression ratio, and 3) implement a cosine learning rate schedule, annealing it down to 10% of the initial value. This phase aims at stabilizing the model with a specific, fixed compression ratio. Resource-wise, all retrofitting stages require roughly 8k, 16k, and 28k GPU hours of NVIDIA H100 for Llama 2 7B, 13B, and 70B respectively.

Evaluation Following Touvron et al. (2023), we evaluate models on a series of downstream tasks, including MMLU (Hendrycks et al., 2021) for factuality, HumanEval (Chen et al., 2021) for Python code generation, and several question-answering datasets for common-sense reasoning: PIQA (Bisk et al., 2020), BoolQ (Clark et al., 2019), Arc-C and Arc-E (Clark et al., 2018), HellaSwag (Zellers et al., 2019), and WinoGrande (Sakaguchi et al., 2020). We report the 5-shot performance on MMLU, average pass@1 scores for HumanEval, and average 0-shot performance on common-sense benchmarks (CS-QA). For pass@1 scores, we use a temperature of 0.1 and nucleus sampling (Holtzman et al., 2020) with top-p = 0.95.

5 For Llama 2 70B, we do not up-train to 4× because this LLM is already pre-trained with GQA 8×.

We adapted TOVA and H2O to our codebase based on publicly released code. For a given CR, the total budget for both policies is calculated as (1/CR) × n for MMLU and CS-QA tasks, where n is the input length. For HumanEval, which instead involves generating more than one token, the initial budget is also (1/CR) × n but then increases dynamically as we generate the answer. For H2O, the budget is split equally between the local window and heavy-hitter tokens.
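The budget computation above can be sketched as follows (function names are ours, for illustration):

```python
def eviction_budget(n: int, cr: float) -> int:
    """Initial KV cache budget for the TOVA/H2O baselines at compression
    ratio `cr`, given a prompt of n tokens. During generation the budget
    then grows dynamically (not modeled in this sketch)."""
    return int(n / cr)


def h2o_split(budget: int):
    """H2O splits its total budget equally between the local window of
    recent tokens and the heavy-hitter tokens."""
    local = budget // 2
    return local, budget - local
```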


Training Steps per Increase in CR Another advantage of DMC is its high flexibility. In fact, based on the availability of resources, different regimes can be chosen when annealing the CR to the target during up-training. In Figure 13, we compare a SHORT and a LONG regime for the constrained variant of DMC (DMC-C), which continuously increases the CR by 1 every 3K and 6K steps (12B and 24B tokens), respectively. Evidently, there exists a trade-off between training steps (hence, time) and performance. Additionally, Figure 13 showcases another aspect of the higher flexibility DMC affords: it is compatible with arbitrary real-valued CRs, as opposed to integer CRs divisible by 2 as in GQA.

Schedules of Target CR Additionally, we explore how different schedules for the target CR impact the model performance. In the standard setup, this is linearly annealed from 1 to the target CR throughout the duration of training. Here, we compare it with a setup where the CR used in the auxiliary loss for compression is set to the target from the start (DMC-immediate). We show the results in Figure 14. As expected, DMC-immediate has a perplexity spike at the beginning when the model quickly increases the CR due to the auxiliary loss. While perplexity is recovered during training, even to a lower point than DMC with annealing, downstream accuracy on the MMLU benchmark is degraded across the board. This showcases why avoiding perplexity spikes is fundamental to successfully retrofitting an LLM.


Training

The DMC inference algorithm switches between accumulating and appending tokens to the KV cache. In order to endow LLMs with DMC, we continue pre-training them on a negligible amount of their pre-training data, gradually increasing the compression ratio towards a target. However, this poses serious challenges. First, we opt for end-to-end learning via gradient descent, which requires a continuous relaxation of the decision variables. As a result, we have to define an operation for KV cache updating when 0 < α < 1, resulting in partly aggregated, partly accumulated key and value states. Second, to avoid training-inference mismatch, we must simulate the DMC behavior at inference time while parallelizing training across a sequence of tokens. As a consequence, the length of K and V is not reduced through compression during training; rather, all intermediate states of keys and values are explicitly kept in the sequence and an auxiliary, gradually discretized mask regulates interactions between queries and keys. We elaborate on our solutions to these challenges below.
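For intuition, the discrete inference-time append-or-accumulate rule (Algorithm 1) for a single head can be sketched as follows, with scalar states and a running importance mass z; this is an illustrative simplification, not our actual implementation:

```python
def dmc_cache_step(keys, values, weights, k_t, v_t, alpha_t, omega_t):
    """One inference step of a simplified DMC append-or-accumulate rule
    for a single head. `keys`/`values` hold the cached states and
    `weights` the accumulated importance mass z of each cache slot."""
    if alpha_t == 1 and keys:  # ACCUMULATE: merge into the last cache slot
        z = weights[-1]
        keys[-1] = (z * keys[-1] + omega_t * k_t) / (z + omega_t)
        values[-1] = (z * values[-1] + omega_t * v_t) / (z + omega_t)
        weights[-1] = z + omega_t
    else:                      # APPEND: open a new cache slot
        keys.append(k_t)
        values.append(v_t)
        weights.append(omega_t)
    return keys, values, weights
```

Note that the cache length grows only on APPEND; ACCUMULATE updates the last slot in-place, which is exactly what makes the training-time relaxation non-trivial.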

Gradient Estimation for Discrete Decisions The decision whether to accumulate or append at inference time is a discrete one; however, rounding α t = sigmoid( k t [0]) to the nearest integer for training would result in an operation with zero or undefined gradients. Hence, we resort to stochastic reparametrization of the decision variable during training

$$
\alpha_t \sim \text{Gumbel-sigmoid}\left(k_t[0] - c,\ \tau\right)
$$

where τ is the temperature 2 and c is a constant subtracted so that every α ≈ 0 at training step 0 . This ensures that DMC initially performs no compression and starts the training behaving just like the vanilla Transformer.
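A minimal sketch of this reparametrization, using two independent Gumbel noise samples and a numerically stable sigmoid (the default τ and c values below are illustrative, not the ones used in training):

```python
import math
import random


def gumbel_sigmoid(logit: float, tau: float = 0.1, c: float = 4.0) -> float:
    """Stochastic relaxation of the binary append-or-accumulate decision.
    Two Gumbel noise samples turn the sigmoid into a sample from a relaxed
    Bernoulli; subtracting `c` pushes alpha toward 0 at initialization so
    that training starts without compression."""
    g1 = -math.log(-math.log(random.random()))  # Gumbel(0, 1) sample
    g2 = -math.log(-math.log(random.random()))
    x = (logit - c + g1 - g2) / tau
    # Numerically stable sigmoid
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)
```

With a low temperature τ, samples concentrate near 0 or 1, mimicking the discrete inference-time decision.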

Partial accumulations As we relax our discrete decisions, we must now define a mechanism to update the KV cache that generalizes Algorithm 1 to continuous α . Hence, we define partially accumulated states for α ∈ [0 , 1] as:

$$
z_t = \alpha_t z_{t-1} + \omega_t, \qquad
k_t \leftarrow \frac{\alpha_t z_{t-1} k_{t-1} + \omega_t k_t}{z_t}, \qquad
v_t \leftarrow \frac{\alpha_t z_{t-1} v_{t-1} + \omega_t v_t}{z_t} \quad (10)
$$

Note that when α ∈ { 0 , 1 } , Equation (10) defaults to Algorithm 1.
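In code, the partial accumulation of Equation (10) for scalar states could read as follows (an illustrative sketch, with z denoting the accumulated importance mass of the last cache slot):

```python
def partial_accumulate(k_prev, v_prev, z_prev, k_t, v_t, alpha_t, omega_t):
    """Continuous relaxation of append-or-accumulate: alpha_t in [0, 1]
    interpolates between opening a fresh slot (alpha_t = 0) and merging
    into the previous slot (alpha_t = 1)."""
    z_t = alpha_t * z_prev + omega_t
    k_new = (alpha_t * z_prev * k_prev + omega_t * k_t) / z_t
    v_new = (alpha_t * z_prev * v_prev + omega_t * v_t) / z_t
    return k_new, v_new, z_t
```

At alpha_t = 0 the new state is simply (k_t, v_t) with mass omega_t (APPEND); at alpha_t = 1 it is the importance-weighted average with the previous slot (ACCUMULATE).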

Intermediate compression steps Aside from key and value computations shown in Equation (10), the rest of the forward pass can be performed in parallel for all tokens in the sequence. Nonetheless, this creates a mismatch between training and evaluation, since during training all intermediate states of keys and values are accessible in self-attention.

To illustrate this issue, consider the example of a KV cache during DMC inference shown in Figure 2 for the sequence of decision scores α_{1:5} = (1, 1, 0, 1, 0) (importance scores ω have been omitted for clarity). Because the last element of the KV cache changes in-place at every time step, the future elements should not have access to its old contents (Figure 2, right). In order to properly simulate this inference-time evolution of the KV cache during training, we keep all unrolled intermediate KV cache items. However, in lieu of an auto-regressive 'causal' mask, we use an additive mask based on the sequence of α values to modify the attention scores a^h_{ij} from Equation (4), which is shown in Figure 3.

2 Low temperatures sharpen α_t into almost-discrete values, which accurately mimics inference behavior.

Figure 2: An example of KV cache growth in DMC during inference (left). During training (right), we retain all intermediate states seen during inference and gradually block access to some of them (gray arrows).

$$
M_{ij} =
\begin{cases}
0 & \text{if } j = i, \\
\log\left(1 - \alpha_{j+1}\right) & \text{if } j < i, \\
-\infty & \text{if } j > i.
\end{cases}
$$

Figure 3: Additive mask applied during training to the normalized attention scores in order to block queries from attending to intermediate KV states (as well as future states).

As α values are increasingly pushed to almost-discrete states by the Gumbel estimator and the low-temperature setting, this strengthens the interactions of queries with the last state of every key-value segment (i.e., the only one accessible during inference), and weakens those with the previous states, which are discarded during inference. 3 In fact, when α ∈ { 0 , 1 } , the matrix is filled with either 0 or -∞ values, and exactly corresponds to the inference-time query-to-key attendance pattern.
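A sketch of how such an additive mask could be built from the α sequence (illustrative; it assumes key state j is superseded exactly when the next decision α_{j+1} approaches 1, i.e., when token j+1 merges into state j's slot in-place):

```python
import math


def dmc_attention_mask(alphas):
    """Additive attention mask over unrolled intermediate KV states.
    Entry [i][j] is added to the score of query i attending key state j:
    -inf for future states (j > i), log(1 - alpha_{j+1}) for past states
    (blocking them as alpha_{j+1} -> 1), and 0 on the diagonal."""
    n = len(alphas)
    neg_inf = float("-inf")
    mask = [[neg_inf] * n for _ in range(n)]
    for i in range(n):
        mask[i][i] = 0.0
        for j in range(i):
            a_next = alphas[j + 1]
            mask[i][j] = math.log(1.0 - a_next) if a_next < 1.0 else neg_inf
    return mask
```

For the discrete example α_{1:5} = (1, 1, 0, 1, 0) from Figure 2, the mask blocks exactly the intermediate states that are overwritten in-place at inference time.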

Training objective The model is incentivized to compress the KV cache to a certain CR, and thus increase the predicted α values. Instead of matching the desired rate for each append-or-accumulate decision α, we calculate a global one-sided loss as the difference between the sum of all decisions and the expected sum of KV tokens across all layers l, heads h, and time steps t under the desired Compression Ratio (CR), normalized by (n_l n_h n):

$$
\ell_{\mathrm{CR}} = \frac{1}{n_l n_h n} \max\left(0,\ \sum_{l=1}^{n_l} \sum_{h=1}^{n_h} \sum_{t=1}^{n} \left(1 - \alpha^{l,h}_t\right) - \frac{n_l n_h n}{\mathrm{CR}}\right)
$$

3 See Appendix G for details on the mask implementation.
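The one-sided loss can be sketched as follows, assuming the retained KV tokens are counted as the sum of (1 − α) over a flattened list of decisions (an illustrative simplification):

```python
def cr_loss(alphas, n_layers, n_heads, n_tokens, target_cr):
    """One-sided auxiliary compression loss: penalize the model only when
    the expected number of retained KV tokens, sum(1 - alpha), exceeds
    the budget n_layers * n_heads * n_tokens / target_cr."""
    retained = sum(1.0 - a for a in alphas)  # flattened over layers/heads/steps
    budget = n_layers * n_heads * n_tokens / target_cr
    return max(0.0, retained - budget) / (n_layers * n_heads * n_tokens)
```

Because the loss is one-sided, heads that over-compress are not penalized, leaving the model free to distribute compression unevenly across layers and heads as long as the global budget is met.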

The ℓ_CR loss is added to the language modeling loss term ℓ_LM = −∑_{t=1}^{n} log p_θ(x_t | x_{<t}), with the final objective of the training being:

$$
\ell = \ell_{\mathrm{LM}} + \lambda\, \ell_{\mathrm{CR}}
$$

where λ weights the auxiliary compression objective.

Importantly, the training procedure is designed for slowly ramping up the target CR and saving ready-to-use DMC checkpoints along the way. This is possible because all hyperparameters, like Gumbel-sigmoid sampling temperature and learning rate, are not decayed and remain constant throughout training. A practical use case of this DMC property is to produce a series of DMC checkpoints with different CRs within a single run, and then choose the one with the desired efficiency-performance trade-off.

Ablations

Fixed vs Learned Memory Compression We assessed the importance of dynamically learning compression decisions in DMC by comparing it with Fixed Memory Pooling, which reduces the overall number of tokens in memory by deterministically averaging every n tokens, where n in this case is identical to the compression ratio. The results, shown in Figure 14, demonstrate that the dynamic component of DMC is crucial to achieving lower perplexity as well as higher downstream accuracy.
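For reference, this non-learned baseline amounts to the following sketch (pooling scalar states for simplicity):

```python
def fixed_pool(tokens, n):
    """Fixed Memory Pooling baseline: deterministically average every
    n consecutive token states, regardless of content. The trailing
    group may be shorter than n."""
    return [sum(tokens[i:i + n]) / len(tokens[i:i + n])
            for i in range(0, len(tokens), n)]
```

Unlike DMC, the group boundaries here are fixed in advance, so the model cannot preserve tokens that carry disproportionate information.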

Table 3: MMLU accuracy, Commonsense Question Answering (CS-QA) accuracy averaged across 6 tasks, and HumanEval pass@1 for several scales (7B, 13B, and 70B) and compression ratios (CRs; 1×, 2×, and 4×) of Llama 2. Here, we include an extra DMC variant, DMC-C, which does not require a custom implementation. (*) The 70B model was trained with GQA, which compresses the KV cache 8×.

Global vs Local (Layer-wise) Compression Prior We compare two approaches to compression: a Local Prior, which enforces a pre-specified compression ratio (CR) in each layer independently, so that every layer compresses approximately the same amount; and a Global Prior, used by default in DMC, which applies a pre-specified CR across all layers, giving the model the freedom to choose a different CR for each layer, provided that their average equals the global CR. Figure 15 clearly indicates that the Global Prior (DMC in Figure 15) improves MMLU performance compared to the Local Prior.

In-layer Relaxation We then compare three strategies to determine how similar compression schemata for heads within each layer should be (assuming a global prior): the default DMC, DMC-C, and DMC-HardC, which differ in how strongly they align CRs among all heads within the layer.

Figure 12: Heatmaps of average compression ratios across layers (x-axis) and heads (y-axis) for DMC-C. Heads are arranged from the highest compression to the lowest top-down for clarity.

As per Figure 15, the default DMC strategy shows consistent MMLU performance across varying CRs, while both DMC-C and DMC-HardC exhibit a sharp drop in MMLU as the compression reaches 1.9×. Moreover, in Table 3 we report a more thorough comparison between DMC and DMC-C. In general, while remaining superior to GQA, DMC-C displays a significant degradation in several configurations when compared to regular DMC.

Importance Scores Finally, we assess the impact of predicting importance scores for accumulation as opposed to uniformly weighting each token in a group. Figure 15 shows that DMC-C with Uniform Weighting is worse than DMC-C with learned weighting.

Figure 13: Different up-training regimes for DMC-C: Short (red) increases CR by 1 every 3K steps, Long (blue) every 6K steps. Horizontal lines correspond to the performance of the original Llama 2.

Figure 14: Validation perplexity and MMLU accuracy vs. training steps for Fixed Memory Pooling and a variant of DMC where the auxiliary loss CR is immediately set to the target at the onset of training. All models follow the regular training schedule and are trained up to 2× CR.

Results

Main Results

We report the performance of the original LLM (equivalent to 1× CR) and efficient variants (DMC, GQA, TOVA, and H2O) in Table 1. The results for the original LLM are those reproduced in our codebase as described in Appendix A.

DMC vs Original First, comparing DMC with the original LLM, we note that it even increases the performance in MMLU and CS-QA at 2× CR for 7B and 13B, and in HumanEval for all model scales. We speculate that this is due to the additional up-training steps, which expose LLMs to new examples. For the other combinations of downstream tasks and scales at 4× CR, DMC incurs negligible degradation: -0.7 in MMLU and -0.3 in CS-QA for 7B, and -0.3 in MMLU and CS-QA for 13B. At higher compression ratios of 6× and 8×, DMC shows slight degradation but generally stays close to the original performance. For 7B, DMC exhibits -1.7 in MMLU at 6× CR and -2.8 at 8× CR. For 13B, performance drops are smaller than for 7B: DMC incurs -1.2 in MMLU and -0.1 in CS-QA at 6× CR, and -2.4 in MMLU and -0.2 in CS-QA at 8× CR. This encouraging finding suggests that DMC is robust across model scales even at 8× CR. Overall, the fact that DMC generally stays close or superior to the original performance makes it suitable as a drop-in replacement for vanilla KV caching to achieve higher inference efficiency.

DMC vs GQA Moreover, Table 1 allows for comparing DMC with GQA at equivalent CRs (2× and 4×). Overall, DMC surpasses GQA applied post-hoc through up-training, at both CRs and at both scales (7B and 13B) where we compare them. For MMLU, the gap widens as the CR increases, both at the smaller 7B scale (from +5.4 for 2× CR to +9.2 for 4× CR) and at the larger 13B scale (from +4.6 for 2× CR to +5.6 for 4× CR). For CS-QA and HumanEval, on the other hand, we observe comparable gains over GQA across CRs and model scales. Notably, DMC with 8× compression outperforms GQA with 2× compression at both the 7B and 13B model scales. These findings illustrate that DMC should be preferred to GQA for retrofitting LLMs into variants that are more efficient at inference time.

Table 1: MMLU accuracy, Commonsense Question Answering (CS-QA) accuracy averaged across 6 tasks, and HumanEval pass@1 for several scales (7B, 13B, and 70B) and compression ratios (CRs; 1×, 2×, 4×, 6×, 8×) of Llama 2. (*) Unlike the 7B and 13B models, the 70B model was trained with GQA, which compresses the KV cache 8×. We apply additional 2× DMC compression during retrofitting to obtain a total compression of 16×.

DMC vs Cache Eviction Policies Both H2O and TOVA evict tokens from memory according to the post-softmax attention scores, which makes them incompatible with efficient attention implementations like FlashAttention. In fact, this requires materializing the n²-sized tensor of attention scores. As a result, while reducing the KV cache size, these policies may slow down inference. Moreover, comparing the performance of both cache eviction policies with DMC in Table 1, they are almost comparable at 2× CR in MMLU and CS-QA; however, they drop significantly in accuracy at the higher CRs of 4×, 6×, and 8×. The gap is even more dramatic for HumanEval pass@1 at all CRs and scales, with the sole exception of 70B TOVA. At 6× CR, H2O and TOVA show significant drops in MMLU and CS-QA, with TOVA outperforming H2O. At 8× CR, performance drops further for both policies, especially in HumanEval, where DMC consistently shows superior performance.

Figure 4: Sample efficiency of DMC and GQA. Horizontal lines correspond to the performance of the original Llama 2. Every DMC model was trained first with increasing CR, then with constant CR for the last 2K steps (marked with ⋆).

70B: DMC and GQA Many widely adopted LLMs were pre-trained with GQA, which leads to the question of whether DMC and GQA can be used together to reap compounded benefits. To investigate this, we retrofit Llama 2 70B, which has been pre-trained with 8× GQA. We further compress its KV cache with DMC to 2× CR: this is equivalent to a cache 16× smaller than a vanilla LLM with neither GQA nor DMC. We observe that the performance remains unchanged, and conclude that DMC and GQA can be easily and successfully combined.

Sample efficiency To shed light on the sample efficiency of DMC and GQA, we report their performance on MMLU, CS-QA, and HumanEval across retrofitting steps in Figure 4. First, for a target CR of 2×, GQA cannot achieve the performance that DMC obtains after fine-tuning (at 8K steps) even after more than double the amount of fine-tuning (at 17K steps). This applies to both 7B and 13B scales. Figure 4 also reveals the importance of the fine-tuning phase in DMC: a limited number of extra steps with a fixed CR recovers a significant amount of the original performance (especially for higher target CRs such as 4×).

Throughput and Latency Measurements

To verify whether increased CRs result in concrete efficiency gains, we present the performance properties of DMC in Figure 5, estimated within the NVIDIA Megatron-LM framework (Narayanan et al., 2021). Specifically, we run measurements on a single GPU (NVIDIA A100 80GB SXM or H100 SXM) in bfloat16 precision for Llama 2 7B and 13B. For Llama 2 70B, we run the same measurements on two GPUs of the same type with tensor parallelism. We feed the model with 2K tokens of English text and additionally generate 2K tokens in an auto-regressive manner to ensure that the model properly compresses its own generations. We limit the sequence to 4K tokens to avoid issues with context length extrapolation, as this is the maximum length observed by Llama 2 during pre-training. The reported throughput consists of the average over the last 1K tokens. To maximize the GPU utilization, we increase the batch size to the maximum that fits into memory (see Section 2.3 for details). Our implementation uses PagedAttention (Kwon et al., 2023) with a page size of 32 tokens.

The compression of the KV cache with DMC allows for substantial increases in batch size and thus significant throughput gains. As shown in Figure 5a, DMC allows for nearly linear scaling of the batch size with CR, which translates into effective increases in tokens per second compared to the original LLM. Specifically, DMC enables up to a 2.0× throughput increase for DMC 2×, and up to a 3.9× increase for DMC 4×. However, the scaling of throughput with CR is not linear. After reaching certain batch sizes, the models start to become compute-bound (Figure 5b), and the latency of inferring the next token begins to increase. Indeed, for DMC 6× and DMC 8×, throughput starts increasing sublinearly with CR, reaching up to 5.5× and 7×, respectively.
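As a back-of-envelope illustration of why the maximum batch size scales nearly linearly with CR, the KV-cache-limited batch size can be estimated as follows (the shapes and byte counts below are illustrative assumptions, not measured values):

```python
def max_batch_size(mem_bytes, n_layers, n_kv_heads, head_dim,
                   seq_len, cr=1.0, bytes_per_elem=2):
    """Rough upper bound on batch size from the KV cache footprint alone:
    2 (K and V) * layers * kv_heads * head_dim * seq_len bytes per
    sequence, shrunk by the DMC compression ratio `cr`. Ignores weights,
    activations, and allocator overhead."""
    per_seq = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / cr
    return int(mem_bytes // per_seq)
```

For a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128) at 4K context in bfloat16, each sequence's cache occupies about 2 GiB, so quadrupling CR roughly quadruples the feasible batch, until compute becomes the bottleneck.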

In addition, Figure 5b shows that increasing the batch size with DMC might trade off latency for throughput, and the Pareto front dominates the vanilla model. The different slopes of the curves on the right-hand side of Figure 5b stem mostly from the different hidden sizes of these models. We note that the extra memory saved with DMC could also be used to cache longer contexts.

Figure 5: Efficiency measurements with the Megatron-LM framework on NVIDIA A100 80GB and H100 GPUs. (a) Inference throughput averaged over the generation of the last 1K tokens of a 4K token sequence. On the x-axis, we show the maximum batch size that fits in memory on a single GPU (7B and 13B) or two GPUs with tensor parallelism (70B) for the vanilla LLM and the DMC 2×, 4×, 6×, and 8× models. (b) Inference throughput-latency Pareto fronts, charted by varying the batch size with DMC 8×. Up to a certain batch size, the models are memory-bound and can process additional examples with the same latency; beyond this point, the models become increasingly compute-bound. Measurements were taken over the generation of the last 1K tokens of a 4K token sequence on an NVIDIA H100. (c) Latency of next-token generation. Solid lines denote measurements with the maximum batch size that fits on a single GPU; dotted lines denote DMC 4× with the same batch size as the vanilla LLM.

Finally, we plot how the latency changes as new tokens are generated with a fixed batch size (Figure 5c). When we use the largest possible batch size for either model, after generating approximately 2200 tokens, the inference time begins to scale linearly with the context size due to the increasingly dominant cost of reading the KV cache from HBM. On the other hand, if we choose to maintain the same batch size for DMC 4 × as for the original LLM, the memory footprint of the KV cache is reduced and latency for longer contexts significantly improves.

While we acknowledge that the behavior of LLMs at inference time depends on a multitude of factors and implementation details, our measurements in Figure 5 offer evidence that DMC can increase throughput and reduce the latency of auto-regressive generation with LLMs. We speculate that in the future, DMC might be used to grow the KV cache sublinearly, which would provide an alternative between vanilla Transformers and State Space Models, where memory is constant (Fu et al., 2023; Gu & Dao, 2023).

Per-Head Learned Compression Ratios

Since the training loss does not enforce any compression schema a priori, as it just requires matching a certain global CR, we can investigate what schema the model discovers in practice. In Figure 6, we report the CR for each layer and head for different scales (7B, 13B, and 70B) and CRs (2×, 4×, 6×, and 8×). Across all schemata, compressing deeper layers (>16 for 7B, >22 for 13B, >44 for 70B) is universally the most popular strategy. However, the very final layers are compressed to a somewhat lesser degree. Fascinatingly, DMC above 4× also achieves extremely high CRs for several heads in the first few layers. This is arguably counter-productive, as token representations are not yet contextualized, which could make decisions (whether to append or accumulate) sub-optimal.

This same pattern of relative preference for compressing certain ranges of layers also emerges during training as we push the model towards the target CR with the auxiliary loss ℓ CR. Figure 7 in Appendix C illustrates how the effective CR increases first in deeper layers, then in some non-contiguous intermediate ranges.

Efficient inference in Transformer models is a subject of extensive research, with detailed overviews provided by several surveys (Pope et al., 2022; Treviso et al., 2022). This section narrows its focus to advancements in Transformer inference efficiency through KV cache size reduction.

Figure 6: Heatmaps of average compression ratios across layers (x-axis) and heads (y-axis). Heads are arranged from the highest compression to the lowest top-down for clarity.

Grouped Query Attention (GQA; Ainslie et al., 2023) represents the most widespread strategy, evolving from Multi-Query Attention (MQA; Shazeer, 2019). GQA reduces the number of KV heads by allocating shared KV representations across subsets of query heads. Multi-Head Latent Attention (DeepSeek-AI, 2024) compresses the size of the token KV states through a low-rank projection shared across all query heads. Prior efforts in token merging (Zhang et al., 2018) condensed the entire past context into a single token, while Liu et al. (2018) and Rae et al. (2020) employed strided convolution and mean pooling kernels to reduce the number of KV tokens. Sliding window attention techniques (Beltagy et al., 2020; Child et al., 2019) restrict attention to a maximum of w preceding tokens. Quantization-based approaches (Sheng et al., 2023; Liu et al., 2024; 2023b; Hooper et al., 2024) reduce the KV precision to a smaller number of bits. Though effective in limiting the KV cache, such methods perform fixed-size compression, unlike the dynamic DMC presented here, which adapts the compression schema based on the input sequence. This adaptation yields superior results, as we prove in an ablation study in Appendix F.

Previous learnable compression methods (Anagnostidis et al., 2023, inter alia) decide which tokens to discard from the KV cache. DMC takes a different approach; instead of dropping tokens, it merges them, thus preserving cached information more faithfully. Moreover, the DMC compression mechanism has constant complexity relative to the context length, while the method proposed by Anagnostidis et al. (2023) has linear complexity. Mu et al. (2023) instead compress prompts through a costly generation process, which limits their inference benefits. Moreover, their method is applicable only to compressing the model input, while DMC compresses the entire sequence on the fly, including both the model input and the generated output.

Non-learnable cache eviction strategies (Zhang et al., 2023; Sheng et al., 2023; Liu et al., 2023a; Wang et al., 2020; Ge et al., 2023; Oren et al., 2024) utilize attention scores or token properties to filter tokens in the KV cache. These approaches, while bypassing additional training, rely on heuristics and lack the ability to learn the compression mechanisms. In contrast, DMC integrates compression into its learning objective in an end-to-end manner, where compression is synergistic to language generation.

Finally, DMC draws inspiration from Dynamic Token Pooling (DTP; Nawrot et al., 2023), which introduces a learnable boundary predictor to merge the representations of groups of tokens in intermediate layers. DMC improves upon this idea by applying it to the KV cache and introducing a continuous relaxation of the pooling decision during training. Moreover, it enables retrofitting pre-trained LLMs with minimal extra steps rather than training language models from scratch.

Conclusions

We proposed Dynamic Memory Compression, a method to reduce the length of the KV cache in Transformers, which enhances the efficiency of LLMs at inference time. For every new token, DMC learns end-to-end whether to append its key-value representations to the cache or merge them with the top element in the cache. We showed how to retrofit LLMs such as Llama 2 at different scales (7B, 13B, and 70B) into efficient DMC versions with a negligible amount of extra data and without extra parameters. DMC LLMs with up to 4× compression ratios (CRs) retain (or even improve upon) the performance of the original LLM. Higher CRs (6× and 8×) instead result in limited performance degradation. For comparable CRs, DMC achieves significantly higher downstream performance and sample efficiency than Grouped Query Attention (GQA), a widespread method for KV cache size reduction, as well as key-value eviction policies such as H2O and TOVA. In practice, we find that DMC translates into up to a 7× increase in throughput on an H100 GPU.

Impact Statement

Dynamic Memory Compression in Large Language Models (LLMs) like Llama 2 results in better computational efficiency, reducing both operational costs and environmental impact (Patterson et al., 2021). By enabling higher throughput and lower latency, DMC democratizes access to advanced AI, making state-of-the-art models suitable for a broader range of hardware. This may not only accelerate innovation across diverse sectors but also promote AI development and applications in an environmentally conscious manner.

References

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://doi.org/10.18653/v1/2023.emnlp-main.298.

Anagnostidis, S., Pavllo, D., Biggio, L., Noci, L., Lucchi, A., and Hofmann, T. Dynamic context pruning for efficient and interpretable autoregressive transformers. In Advances in Neural Information Processing Systems 36 , 2023.

Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, 2015. URL https://arxiv.org/abs/1409.0473.

Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. In Advances in Neural Information Processing Systems 36, 2023a.

Liu, Z., Oğuz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. LLM-QAT: Data-free quantization aware training for large language models. ArXiv, abs/2305.17888, 2023b.

Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V. A., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., and Zaharia, M. A. Efficient large-scale language model training on GPU clusters using Megatron-LM. SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021. URL https://doi.org/10.1145/3458817.3476209.

Nawrot, P., Chorowski, J., Łańcucki, A., and Ponti, E. M. Efficient transformers with dynamic token pooling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023. URL https://aclanthology.org/2023.acl-long.353.

Oren, M., Hassid, M., Adi, Y., and Schwartz, R. Transformers are multi-state RNNs. ArXiv , abs/2401.06104, 2024.

Patterson, D. A., Gonzalez, J., Le, Q. V., Liang, C., Munguía, L.-M., Rothchild, D., So, D. R., Texier, M., and Dean, J. Carbon emissions and large neural network training. ArXiv, abs/2104.10350, 2021.

Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. ArXiv, abs/2211.05102, 2022. URL https://doi.org/10.48550/arXiv.2211.05102.

Rae, J. W., Potapenko, A., Jayakumar, S. M., Hillier, C., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. In 8th International Conference on Learning Representations , 2020. URL https://openreview.net/forum?id=SylKikSYDH .

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020. URL https://doi.org/10.1609/aaai.v34i05.6399.

Shazeer, N. M. Fast transformer decoding: One write-head is all you need. ArXiv , abs/1911.02150, 2019. URL http://arxiv.org/abs/1911.02150 .

Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Fu, D. Y., Xie, Z., Chen, B., Barrett, C. W., Gonzalez, J., Liang, P., Ré, C., Stoica, I., and Zhang, C. Highthroughput generative inference of large language models with a single gpu. In International Conference on Machine Learning , 2023. URL https://proceedings. mlr.press/v202/sheng23a.html .

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y ., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y ., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and finetuned chat models. ArXiv , abs/2307.09288, 2023. URL https://arxiv.org/abs/2307.09288 .

Treviso, M. V., Ji, T., Lee, J.-U., van Aken, B., Cao, Q., Ciosici, M. R., Hassid, M., Heafield, K., Hooker, S., Martins, P. H., Martins, A. F. T., Milder, P., Raffel, C., Simpson, E., Slonim, N., Balasubramanian, N., Derczynski, L., and Schwartz, R. Efficient methods for natural language processing: A survey. Transactions of the Association for Computational Linguistics , 11, 2022.

Wang, H., Zhang, Z., and Han, S. Spatten: Efficient sparse attention architecture with cascade token and head pruning. 2021 IEEE International Symposium on HighPerformance Computer Architecture (HPCA) , 2020. URL https://doi.org/10.1109/HPCA51647.2021.00018 .

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.

Appendix

Memory-Bound and Compute-Bound Operations

Every operation performed on a GPU accelerator, such as a General Matrix Multiply (GEMM), is either memory-bound or compute-bound. In the former case, the overall runtime is dominated by high-bandwidth memory (HBM) accesses, while in the latter it is dominated by the actual computations. Auto-regressive generation with Transformer LLMs, where the sequence length for every forward pass is n = 1, tends to be memory-bound rather than compute-bound. The vast majority of a forward pass is spent either processing linear layers (in MHSA, feed-forward, and output vocabulary projection) or calculating attention scores and outputs from Equation (4). For linear layers, the ratio of FLOPs to memory accesses improves as the batch size increases, since more FLOPs are performed with the same set of layer weights retrieved from HBM. Eventually, with a large enough batch size, linear layers become compute-bound. On the other hand, for the calculation of Equation (4) inside MHSA layers during auto-regressive inference, the ratio of FLOPs to input size remains constant, and MHSA layers are memory-bound regardless of the batch size. It follows that, for those layers, latency scales linearly with the size of the KV cache.
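To make the distinction concrete, the following sketch (not from the paper; the dimensions and the fp16 byte size are illustrative assumptions, and the byte counts ignore intermediate buffers) estimates arithmetic intensity, i.e., the ratio of FLOPs to bytes moved, for a linear layer versus the attention of a single decoded token:

```python
def linear_arithmetic_intensity(batch: int, d_in: int, d_out: int,
                                bytes_per_el: int = 2) -> float:
    """FLOPs per byte moved for a (batch, d_in) @ (d_in, d_out) GEMM in fp16."""
    flops = 2 * batch * d_in * d_out
    # Bytes: read activations and weights, write outputs.
    bytes_moved = bytes_per_el * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / bytes_moved


def attention_arithmetic_intensity(batch: int, seq_len: int, d_head: int,
                                   bytes_per_el: int = 2) -> float:
    """FLOPs per byte for q @ K^T and scores @ V with a single query token.

    Each batch element reads its own KV cache, so there is no weight reuse.
    """
    flops = 2 * batch * seq_len * d_head * 2  # q @ K^T plus scores @ V
    bytes_moved = bytes_per_el * batch * (2 * seq_len * d_head + 2 * seq_len + 2 * d_head)
    return flops / bytes_moved
```

Increasing the batch size raises the intensity of the GEMM, because the same weights are reused across the whole batch, whereas every batch element of attention reads its own keys and values, so its intensity stays flat.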

Replicating the Original Results

To make sure that our implementation is correct, for each downstream task, we compare the performance reported in the original Llama 2 paper (Touvron et al., 2023) with those obtained from the Hugging Face Hub checkpoints. Furthermore, we evaluate the impact of using our internal data mixture for up-training, acknowledging that variations in data proportions and preprocessing methodologies can influence model behavior. In particular, we up-train the vanilla pre-trained Llama 2 checkpoint for 200 training steps, amounting to 1B tokens, in accordance with the original Llama 2 training schedule. We compute the average and standard deviation of checkpoints after 50, 100, 150, and 200 steps. In our experiments, we replicate the results reported by the Llama 2 paper almost exactly, as shown in Table 2. Furthermore, we observe that retrofitting the Llama 2 checkpoint on our data mixture has little effect on the model's performance on the downstream benchmarks.

Analysis of the Compression Schema Learned by DMC

Evolution throughout the training In Figure 7, we illustrate how the CR changes for each layer of the Llama 2 7B model throughout training, from a 1× up to a 4× global CR. Each subplot corresponds to a different global CR, which occurs at a different stage of training, going from the smallest (1.17) at the top to the highest (4.16) at the bottom. There is a clear trend: for a smaller global CR (i.e., at the beginning of training), the model emphasizes compression in the later layers. As the global CR increases, the model keeps compressing the final layers but also starts to compress the earlier ones. We hypothesize that the token representations in the initial layers do not contain sufficient information to perform any meaningful grouping. Conversely, token representations in subsequent layers are more defined and, possibly, after several attention layers, already contain redundant or shared information.

Figure 7: Compression distribution across layers at different stages of retrofitting a Llama 2 7B model. We adhere to the convention where, for a given subplot, a larger space above a given layer indicates greater compression at that layer.

Sequence Length versus Compression Ratio Do DMC models compress sequences with a uniform CR, independent of their total length? We find that this is not the case. As shown in Figure 8, the CR increases logarithmically with the total sequence length. This holds true across all global CRs (both 2× and 4×).

Absolute Position versus Compression Decision Do DMC models learn a fixed compression schema, or do they exhibit position biases? In Figure 9, we plot the average value of the decision variable α across positions in the sequence (0 to 4096). We observe that the average value of α is independent of a token's position in the input sequence, which demonstrates that the model does not follow a fixed positional pattern. This holds across both 2× and 4× compression ratios (CRs).

Interpretability A natural question arises: is the compression schema that the model learns aligned with human intuitions about text segmentation? We analyzed the outputs of Llama 2 13B DMC with CR 4× and noticed that some heads compress according to the boundaries of linguistic units, such as words or syntactic phrases. Figure 10 shows the compression schema learned by head 14 in layer 0. In this case, the model merges the subwords back into words, reverting the tokenizer. Interestingly, some groupings of tokens correspond to semantic units, e.g., "19th century", "50 percent", or "a week back later". Yet, we also stress that many heads and layers are not interpretable, as their behavior does not overlap with linguistic units.

Table 2: Replicating the original up-training results.

Figure 8: CR achieved by Llama 2 7B for particular sequence lengths across various global CRs.

More generally, higher layers merge longer token sequences, in line with Figure 6. For instance, Figure 11 shows the decisions of layer 24, head 2. We leave a more in-depth analysis of the compression schemata learned by DMC to future work.

Since the training loss does not enforce any compression schema a priori, as it only requires matching a certain global CR, we can investigate what schema the model discovers in practice. In Figure 6, we report the CR for each layer and head for different scales (7B, 13B, and 70B) and CRs (2×, 4×, 6×, and 8×). Across all schemata, compressing deeper layers (>16 for 7B, >22 for 13B, >44 for 70B) emerges as the most popular strategy by far. However, the very final layers are compressed to a somewhat reduced degree. Intriguingly, at CRs above 4×, DMC also achieves extremely high CRs for several heads in the first few layers. This is arguably counter-productive, as token representations are not contextualized yet, which could make the decisions (whether to append or accumulate) sub-optimal.

This same pattern of relative preference for compressing certain ranges of layers also emerges during training as we push the model towards the target CR with the auxiliary loss ℓ CR. Figure 7 in Appendix C illustrates how the effective CR increases first in deeper layers, then in some non-contiguous intermediate ranges.

In order to alleviate the problem of efficiently storing and retrieving the KV cache of variable-length sequences, we also study a Constrained variant of our algorithm (DMC-C), in which we force all heads in a given layer to maintain a similar compression ratio. This allows us to store key and value sequences naïvely as padded tensors with minimal padding. To this end, we add an extra loss term to Equation (11)

$$

$$
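As a rough illustration of why similar per-head CRs make storage easy, the following toy sketch (our own code, not the paper's implementation; scalar cache entries stand in for key/value vectors) packs variable-length per-head sequences of one layer into a single padded buffer and measures the space wasted on padding:

```python
def pad_heads(per_head_kv, pad_value=0.0):
    """Pack variable-length per-head key (or value) sequences of one layer
    into a single padded matrix plus the list of true lengths."""
    lengths = [len(seq) for seq in per_head_kv]
    max_len = max(lengths)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in per_head_kv]
    return padded, lengths


def padding_overhead(lengths):
    """Fraction of storage wasted on padding; small when lengths are similar."""
    max_len = max(lengths)
    total = max_len * len(lengths)
    return (total - sum(lengths)) / total
```

When all heads in a layer keep a similar CR, the lengths are close, the overhead is near zero, and the padded tensor is nearly as compact as a ragged layout; with wildly different per-head CRs, the longest head dictates the storage for all of them.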

Figure 9: Average value of the decision α for positions (0, 4096) averaged over 128 samples, heads, and layers.

Figure 10: Compression schema found by Llama 2 13B DMC 4× in layer 0, head 14. Tokens that are merged in the KV cache are marked with the same color.

Table 3 compares DMC with its Constrained variant DMC-C. In general, while remaining superior to GQA, DMC-C displays a significant degradation in several configurations, most notably 7B 4×, where it records a drop of 6.4 MMLU points compared to the ceiling. DMC, on the other hand, recovers all of the performance lost by DMC-C. When combined with custom attention implementations that do not require excessive padding, standard DMC should therefore be vastly preferred, as it retains the original LLM performance while fully reaping the advantages in memory efficiency.

Compression Schemata learned by DMC-C Studying the compression schema learned by DMC-C (Figure 12), we find a very different pattern compared to DMC, due to the auxiliary loss forcing the model to compress similarly across heads in the same layer. Nevertheless, we observe a similar global preference for compressing deeper layers.

Figure 11: Compression schema found by Llama 2 13B DMC 4 × for layer 24, head 2. Tokens that are merged in the KV cache are marked with the same color.

Training Ablations

Training Steps per Increase in CR Another advantage of DMC is its high flexibility. In fact, based on the availability of resources, different regimes can be chosen when annealing the CR to the target during up-training. In Figure 13, we compare a SHORT and a LONG regime for the constrained variant of DMC (DMC-C), which continuously increase the CR by 1 every 3K and 6K steps (12B and 24B tokens), respectively. Evidently, there exists a trade-off between training steps (hence, time) and performance. Additionally, Figure 13 showcases another aspect of the flexibility DMC affords: it is compatible with arbitrary real-valued CRs, as opposed to the integer CRs divisible by 2 required by GQA.

Schedules of Target CR Additionally, we explore how different schedules for the target CR impact the model performance. In the standard setup, this is linearly annealed from 1 to the target CR throughout the duration of training. Here, we compare it with a setup where the CR used in the auxiliary loss for compression is set to the target from the start (DMC-immediate). We show the results in Figure 14. As expected, DMC-immediate has a perplexity spike at the beginning when the model quickly increases the CR due to the auxiliary loss. While perplexity is recovered during training, even to a lower point than DMC with annealing, downstream accuracy on the MMLU benchmark is degraded across the board. This showcases why avoiding perplexity spikes is fundamental to successfully retrofitting an LLM.
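The two schedules can be summarized in a small helper (a sketch under our reading of the setup; the function name and signature are our own):

```python
def target_cr(step: int, total_steps: int, final_cr: float,
              schedule: str = "anneal") -> float:
    """Target compression ratio fed to the auxiliary loss at a given step.

    'anneal'   : linear ramp from 1x to final_cr over training (default DMC).
    'immediate': final_cr from the very first step (the DMC-immediate ablation,
                 which causes an early perplexity spike).
    """
    if schedule == "immediate":
        return final_cr
    frac = min(step / total_steps, 1.0)  # clamp after training ends
    return 1.0 + frac * (final_cr - 1.0)
```

Because the annealed target starts at 1× (no compression), the model is never pushed far from its current behavior at any single step, which is what avoids the perplexity spike observed with the immediate schedule.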

DMC Ablations

Fixed vs Learned Memory Compression We assessed the importance of dynamically learning compression decisions in DMC by comparing it with Fixed Memory Pooling, which reduces the overall number of tokens in memory by deterministically averaging every n tokens, where n in this case is identical to the compression ratio. The results, shown in Figure 14, demonstrate that the dynamic component of DMC is crucial to achieving lower perplexity as well as higher downstream accuracy.
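For reference, the fixed baseline can be sketched as follows (toy code of our own, with scalar tokens standing in for key/value vectors):

```python
def fixed_memory_pooling(tokens, n):
    """Deterministically average every n consecutive entries, mimicking the
    Fixed Memory Pooling baseline with compression ratio n; a trailing
    incomplete group is averaged over its actual size."""
    pooled = []
    for start in range(0, len(tokens), n):
        group = tokens[start:start + n]
        pooled.append(sum(group) / len(group))
    return pooled
```

Unlike DMC, this baseline merges tokens at fixed positions regardless of content, which is precisely what the ablation shows to be harmful.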

Table 3: MMLU accuracy, Commonsense Question Answering (CS-QA) accuracy averaged across 6 tasks, and HumanEval Pass@1 for several scales (7B, 13B, and 70B) and compression ratios (CRs; 1×, 2×, and 4×) of Llama 2. Here, we include an extra DMC variant, DMC-C, which does not require a custom implementation. (*) The 70B model was trained with GQA, which compresses the KV cache 8×.

Global vs Local (Layer-wise) Compression Prior We compare two approaches to compression: a Local Prior, which enforces a pre-specified CR in each layer independently, requiring every layer to compress by approximately the same amount, and a Global Prior, used by default in DMC, which applies a pre-specified CR across all layers, giving the model the freedom to apply a different CR in each layer, provided that their average equals the global CR. Figure 15 clearly indicates that the Global Prior improves MMLU performance compared to the Local Prior.

In-layer Relaxation We then compare three strategies to determine how similar compression schemata for heads within each layer should be (assuming a global prior): unconstrained DMC, where each head compresses independently; DMC-C, where an auxiliary loss encourages similar CRs within a layer; and DMC-HardC, which hard-constrains the CRs among all heads within the layer.

Figure 12: Heatmaps of average compression ratios across layers (x-axis) and heads (y-axis) for DMC-C. Heads are arranged from the highest compression to the lowest top-down for clarity.

As per Figure 15, the default DMC strategy shows consistent MMLU performance across varying CRs, while both DMC-C and DMC-HardC exhibit a sharp drop in MMLU as the compression reaches 1.9×. Moreover, in Table 3 we report a more thorough comparison between DMC and DMC-C. In general, while remaining superior to GQA, DMC-C displays a significant degradation in several configurations when compared to regular DMC.

Importance Scores Finally, we assess the impact of predicting importance scores for accumulation, as opposed to uniformly weighting each token in a group. Figure 15 shows that DMC-C with uniform weighting is worse than DMC-C with learned weighting.

Figure 13: Different up-training regimes for DMC-C: SHORT (red) increases CR by 1 every 3K steps, LONG (blue) every 6K steps. Horizontal lines correspond to the performance of the original Llama 2.

Figure 14: Validation perplexity and MMLU accuracy vs. training steps for Fixed Memory Pooling and a variant of DMC where the auxiliary-loss CR is immediately set to the target at the onset of training. All models follow the regular training schedule and are trained up to 2× CR.

Masking Implementation Details

We mask the unnormalized attention score for the pair (i, j) as follows:

$$
\hat{a}_{i,j} = a_{i,j} + \log\left(1 - \alpha_{j+1}\right)
$$

We rely on the memory-efficient implementation of MHSA from PyTorch, which allows adding arbitrary masks to the attention scores before softmax. Notwithstanding this, at inference time DMC remains compatible with efficient libraries for attention such as Flash Attention (Dao et al., 2022). The log(1 − α_{j+1}) term is calculated as log-sigmoid(−α_{j+1}) for better numerical precision.
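The numerical trick can be illustrated in isolation (a standalone sketch of our own, independent of the paper's implementation): writing log(1 − sigmoid(x)) as log-sigmoid(−x) avoids subtracting from 1 a value that rounds to exactly 1 in floating point.

```python
import math


def log_one_minus_sigmoid_naive(x: float) -> float:
    """log(1 - sigmoid(x)) computed directly; underflows to log(0) for large x."""
    return math.log(1.0 - 1.0 / (1.0 + math.exp(-x)))


def log_one_minus_sigmoid_stable(x: float) -> float:
    """Same quantity via the identity log(1 - sigmoid(x)) = log-sigmoid(-x).

    Uses the standard stable form: log sigmoid(y) = min(y, 0) - log1p(e^{-|y|}),
    evaluated at y = -x.
    """
    return -max(x, 0.0) - math.log1p(math.exp(-abs(x)))
```

At x = 40, sigmoid(x) rounds to 1.0 in double precision, so the naive form attempts log(0), while the stable form returns approximately −40 without any loss of accuracy.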

Figure 15: Validation perplexity and test MMLU accuracy vs compression ratio for different types of compression priors, in-layer relaxations, and importance scores. All models follow the SHORT training regime and are trained up to 2 × CR.

Transformer Large Language Models (LLMs) are the state of the art in generative and conversational AI (Touvron et al., 2023; Jiang et al., 2023). Their deployment, however, is curtailed in part by their inefficiency. This is not only due to the quadratic complexity of attention layers (Bahdanau et al., 2014; Vaswani et al., 2017): during generation, Transformers store the keys and values of past tokens in memory to avoid recomputing them multiple times. Since this key–value (KV) cache grows linearly with the sequence length and batch size, generation with Transformers quickly becomes prohibitive due to the excessive memory load. This issue emerges even more clearly with long-context generation (e.g., in dialogues and stories) or when serving large numbers of user queries.

A widespread solution to increase the memory efficiency of Transformers during inference is Grouped Query Attention (GQA; Ainslie et al., 2023; Shazeer, 2019), which uses fewer key and value heads than query heads through parameter sharing. As an alternative, the overall number of tokens in memory can be reduced through token merging (Zhang et al., 2018; Liu et al., 2018; Bolya et al., 2022) or token pruning (Anagnostidis et al., 2023; Kim & Cho, 2020). Nevertheless, these methods often pay the price of a severe degradation in downstream performance. On the other hand, hardware/IO-aware (Dao et al., 2022; Kwon et al., 2023) and sub-quadratic algorithms for attention (Beltagy et al., 2020; Choromanski et al., 2020) do not alleviate the memory load of the KV cache.

In our work, we aim to achieve a lossless compression of the KV cache of LLMs, thus retaining their performance while reducing their memory load. To this end, we propose Dynamic Memory Compression (DMC). As shown in Figure 1, during every time step, DMC decides whether to append the current key and value representations to the cache or to perform a weighted average of them with the top item on the cache. Note that memory grows sub-linearly in DMC, which therefore falls halfway between vanilla Transformers and state space language models (Fu et al., 2023; Gu & Dao, 2023), where memory is constant.

In our experiments, we equip pre-existing LLMs, such as Llama 2 (Touvron et al., 2023) 7B, 13B, and 70B, with DMC by retrofitting them on a negligible percentage of the original pre-training data (~2% for 2× compression and ~4% for 4× compression) and without adding any extra parameters to the original LLM. We evaluate our DMC models on a series of downstream tasks such as MMLU (Hendrycks et al., 2021) for factuality, QA datasets for common-sense reasoning, and HumanEval (Chen et al., 2021) for code. We find that DMC LLMs retain a downstream performance similar to the original LLM, whereas GQA incurs significant degradation. Finally, we show that DMC can be hybridized with GQA such that their compression rates are compounded. For Llama 2 70B, which is pre-trained with GQA 8×, DMC 2× achieves a total compression of 16×.

We verify that KV cache compression translates into more efficient generation in practice. We measure that DMC 4× increases the inference throughput between 340% and 370% for Llama 2 7B and 13B on NVIDIA H100 or A100 GPUs without loss in performance. In fact, it allows us to fit larger batches and longer sequences into a given memory budget. Finally, compression schemata learned by DMC provide insights into the internal structure of the LLMs, revealing a preference for compressing heads in higher layers.

Let $X=(\mathbf{x}_1,\ldots,\mathbf{x}_n)\in\mathbb{R}^{n\times d}$ denote the input sequence of hidden states of a Transformer layer, where $n$ stands for the number of tokens in a sequence and $d$ for the hidden state dimension. A Multi-head Self-Attention (MHSA) layer divides the embeddings into $n_h$ different heads, after which self-attention is applied to each head separately. This enables the model to focus on different parts of the input, capturing various types of relationships in the data. For each head $h$, different weight matrices $W_q^h, W_k^h, W_v^h \in \mathbb{R}^{d/n_h \times d/n_h}$ are used to project the input sequence into queries $Q^h$, keys $K^h$, and values $V^h$:

$$
Q^h = X^h W_q^h, \qquad K^h = X^h W_k^h, \qquad V^h = X^h W_v^h,
$$

where $X^h \in \mathbb{R}^{n \times d/n_h}$ denotes the slice of $X$ assigned to head $h$.

The attention scores and outputs for the $i$-th token are then computed as

$$
a_{i,j} = \operatorname{softmax}_{j \le i}\!\left(\frac{\mathbf{q}_i^\top \mathbf{k}_j}{\sqrt{d/n_h}}\right), \qquad
\mathbf{o}_i = \sum_{j \le i} a_{i,j}\, \mathbf{v}_j.
$$

Finally, the outputs from all heads are concatenated and linearly transformed to produce the final output $O\in\mathbb{R}^{n\times d}$:

$$O = \mathrm{Concat}(O^1,\ldots,O^{n_h})\,W_o,$$
where $W_o\in\mathbb{R}^{d\times d}$.
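The MHSA computation described above can be sketched as follows (a minimal NumPy sketch with a causal mask; dimensions and weight layouts are illustrative, not the paper's implementation):

```python
import numpy as np

def mhsa(X, Wq, Wk, Wv, Wo):
    """Minimal multi-head self-attention forward pass.

    X:          (n, d) input hidden states.
    Wq, Wk, Wv: (n_h, d, d_h) per-head projection weights, d_h = d / n_h.
    Wo:         (d, d) output projection.
    """
    n_h, d, d_h = Wq.shape
    heads = []
    for h in range(n_h):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]    # (n, d_h) each
        scores = Q @ K.T / np.sqrt(d_h)              # (n, n)
        # causal mask: token i attends only to tokens j <= i
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -np.inf)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)     # row-wise softmax
        heads.append(attn @ V)                       # (n, d_h)
    # concatenate head outputs and apply the output projection
    return np.concatenate(heads, axis=-1) @ Wo       # (n, d)

rng = np.random.default_rng(0)
n, d, n_h = 5, 8, 2
X = rng.normal(size=(n, d))
O = mhsa(X,
         rng.normal(size=(n_h, d, d // n_h)),
         rng.normal(size=(n_h, d, d // n_h)),
         rng.normal(size=(n_h, d, d // n_h)),
         rng.normal(size=(d, d)))
```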

In a Transformer decoder, generation is auto-regressive: each new token is predicted based on the previously generated ones. To avoid redundant computation, it is common to store the keys and values of previously processed tokens in the KV cache. At each time step $i$, only the keys and values for the current token $\mathbf{x}_i$ are computed, whereas those for $\mathbf{x}_{<i}$ are retrieved from the cache. Thus, for each head $h$:

$$K^h \leftarrow \bigl[K^h;\,\mathbf{x}_i W_k^h\bigr],\qquad V^h \leftarrow \bigl[V^h;\,\mathbf{x}_i W_v^h\bigr].$$
Note that this is not necessary for queries, as only the query for the current token $\mathbf{x}_i$ is needed at each inference time step.

As a consequence, the KV cache grows linearly with each new token. This progressive expansion leads to substantial memory consumption and increased latency, especially for long input sequences. This issue is further exacerbated when serving multiple requests concurrently, as each inference process requires its own KV cache, significantly straining the system’s resources.
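The append-only caching loop can be sketched as follows (single head, NumPy; the helper names are ours). The cache grows by one row per generated token, which is the linear growth discussed above:

```python
import numpy as np

def decode_step(x_t, Wk, Wv, K_cache, V_cache):
    """One auto-regressive decoding step with an append-only KV cache.

    Only the current token's key/value are computed; all past
    keys/values are reused from the cache.
    """
    k_t = x_t @ Wk                              # (d_h,)
    v_t = x_t @ Wv
    K_cache = np.vstack([K_cache, k_t[None, :]])
    V_cache = np.vstack([V_cache, v_t[None, :]])
    return K_cache, V_cache

rng = np.random.default_rng(0)
d, d_h = 8, 4
Wk, Wv = rng.normal(size=(d, d_h)), rng.normal(size=(d, d_h))
K, V = np.empty((0, d_h)), np.empty((0, d_h))
for t in range(16):                             # process 16 tokens
    K, V = decode_step(rng.normal(size=d), Wk, Wv, K, V)
# cache length equals the number of processed tokens: linear growth
```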

Inference with LLMs tends to be memory bound rather than compute bound, as we explain in detail in Appendix A. This means that latency scales linearly with the size of the KV cache. We tackle the problem of reducing the size of the KV cache, which brings two immediate benefits (Zhang et al., 2023): 1) it lowers the latency of auto-regressive generation, and 2) it increases throughput by saving GPU memory and allowing for larger batch sizes or sequence lengths.

In this section, we introduce Dynamic Memory Compression (DMC), a simple and inexpensive method for online compression of the KV cache at inference time. In what follows, we first describe the inference-time operation of DMC, which constitutes our end goal. Next, we illustrate how to teach a pre-trained LLM this behavior through short continued pre-training.

Consider the forward pass of an attention layer during auto-regressive inference (Section 2.1). In a vanilla Transformer, at every time step $t$, both $\mathbf{k}_t$ and $\mathbf{v}_t$ are appended to the KV cache (Section 2.2). In DMC, on the other hand, the KV cache update procedure is different, as detailed in Algorithm 1. First, a decision variable $\alpha_t\in\{0,1\}$ and an importance variable $\omega_t\in[0,1]$ are predicted. In order to avoid adding new parameters, we reuse the first neuron of $\mathbf{k}_t$ and $\mathbf{q}_t$, respectively, to extract the two scores (note that this choice is arbitrary, as we could use any two neurons from $\{\mathbf{q}_t,\mathbf{k}_t,\mathbf{v}_t\}$).

Based on $\alpha_t$, a decision is made whether the KV representations $\mathbf{k}_t$ and $\mathbf{v}_t$ are appended to the cache or accumulated with its last element (Figure 1). In particular, for accumulation we perform a weighted average based on the predicted importance score $\omega_t$ for the current token and the running sum $z_t$ of importance scores of all the tokens since the last time step where $\alpha=0$ was predicted.

In fact, the $\alpha$ variable effectively segments the input sequence: each decision determines whether the current segment should continue ($\alpha=1$) or a new segment should be opened ($\alpha=0$). As a result, after the update, the cache length for DMC is $l=\sum_{i=1}^{t}(1-\alpha_i)\leq t$, whereas in vanilla Transformers it is always $l=t$. In what follows, we refer to the ratio $t/l$ between the length $t$ of an uncompressed cache and the compressed length $l$ as the Compression Ratio (CR).
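The append-or-accumulate update can be sketched as follows (per head, NumPy; a paraphrase of the behavior described above, not the paper's Algorithm 1 verbatim — variable names and the guard for the first token are our assumptions):

```python
import numpy as np

def dmc_update(K, V, z, k_t, v_t, alpha_t, omega_t):
    """DMC cache update for one token.

    alpha_t == 1: accumulate k_t/v_t into the last cache slot via an
    importance-weighted running average, where z is the running sum of
    importance scores of the current segment.
    alpha_t == 0: open a new segment by appending.
    """
    if alpha_t == 1 and K:            # accumulate with the last element
        K[-1] = (z * K[-1] + omega_t * k_t) / (z + omega_t)
        V[-1] = (z * V[-1] + omega_t * v_t) / (z + omega_t)
        z = z + omega_t
    else:                             # append: start a new segment
        K.append(k_t)
        V.append(v_t)
        z = omega_t
    return K, V, z

rng = np.random.default_rng(0)
K, V, z = [], [], 0.0
alphas = [0, 1, 1, 0, 1]              # 5 tokens -> 2 cache entries
for a in alphas:
    K, V, z = dmc_update(K, V, z,
                         rng.normal(size=4), rng.normal(size=4),
                         a, float(rng.uniform(0.1, 1.0)))
# cache length l = sum(1 - alpha_i)
assert len(K) == sum(1 - a for a in alphas)
```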

Finally, multi-head self-attention is calculated as in the vanilla Transformer using the KV cache sequences, with the exception that KV sequences for different heads might have different lengths. Algorithm 1 is applied to every MHSA layer and head independently.

The DMC inference algorithm switches between accumulating and appending tokens to the KV cache. In order to endow LLMs with DMC, we continue pre-training them on a negligible amount of their pre-training data, gradually increasing the compression ratio towards a target. However, this poses serious challenges. First, since we opt for end-to-end learning via gradient descent, we use a continuous relaxation of the decision variables. As a result, we have to define a KV cache update operation for $0<\alpha<1$, resulting in partially appended, partially accumulated key and value states. Second, to avoid a training–inference mismatch, we must simulate the inference-time DMC behavior while parallelizing training across a sequence of tokens: as a consequence, the length of $K$ and $V$ is not reduced through compression during training; rather, all intermediate states of keys and values are explicitly kept in memory, and an auxiliary (gradually discretized) mask regulates query interactions. We elaborate on our solutions to these challenges below.

The decision whether to accumulate or append at inference time is a discrete one; however, rounding $\mathrm{sigmoid}(\mathbf{k}_t[0])$ to the nearest integer during training would result in a non-differentiable operation with zero gradients. Hence, we resort to a stochastic reparametrization of the decision variable during training

where $\tau$ is the temperature (low temperatures sharpen $\alpha_t$ into almost-discrete values, which accurately mimics inference behavior) and $c$ is a constant offset chosen so that every $\alpha\approx 0$ at training step 0. This ensures that DMC initially performs no compression and starts training behaving just like the vanilla Transformer.
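The stochastic relaxation can be sketched as follows (one common Gumbel-sigmoid formulation with logistic noise; the exact noise parameterization and the additive use of the offset $c$ are our assumptions):

```python
import numpy as np

def gumbel_sigmoid(logit, tau=0.1, rng=None):
    """Differentiable relaxation of a Bernoulli decision.

    Adds logistic (difference-of-two-Gumbels) noise to the logit and
    applies a low-temperature sigmoid, pushing outputs toward
    almost-discrete values in {0, 1}.
    """
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-9, 1 - 1e-9)
    noise = np.log(u) - np.log(1 - u)        # Logistic(0, 1) sample
    return 1.0 / (1.0 + np.exp(-(logit + noise) / tau))

rng = np.random.default_rng(0)
c = -5.0
# at training step 0 the decision logit k_t[0] is ~0, so the shifted
# logit is strongly negative and alpha ~ 0: no compression initially
alphas = [gumbel_sigmoid(0.0 + c, tau=0.1, rng=rng) for _ in range(100)]
```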

As we relax our discrete decisions, we must now define a mechanism to update the KV cache that generalizes Algorithm 1 to continuous $\alpha$. Hence, we define partially accumulated states for $\alpha\in[0,1]$ as:

Note that when $\alpha\in\{0,1\}$, Equation 9 defaults to Algorithm 1.

Aside from the key and value computations shown in Equation 9, the rest of the forward pass can be performed in parallel for all tokens in the sequence. Nonetheless, this creates a mismatch between training and evaluation, since all intermediate states of keys and values are accessible during self-attention.

To illustrate this issue, consider the example of a KV cache during DMC inference shown in Figure 2 for the sequence of decision scores $\alpha_{1:5}=(1,1,0,1,0)$ (importance scores $\omega$ have been omitted for clarity). The last element of the KV cache changes at every time step. In order to properly simulate the inference-time evolution of the KV cache during training, we keep all unrolled intermediate KV cache items. In lieu of an auto-regressive 'causal' mask, however, we use an additive mask based on the sequence of $\alpha$ values to modify the attention scores $a^h_{ij}$ from Equation 4, as shown in Figure 3. During training, the $\alpha$ values 1) naturally converge towards 0 or 1 as the model strives to satisfy the language modeling criterion and reduce uncertainty; and 2) are intentionally pushed to almost-discrete states by the Gumbel noise and the low temperature setting. Such binarization of the $\alpha$ values significantly impacts the attention scores: it strengthens the queries' interactions with the last element of every key–value segment, and weakens those with the intermediate elements, which are discarded during inference (see Appendix F for details on the mask implementation). In fact, when $\alpha\in\{0,1\}$, the matrix is filled with either $0$ or $-\infty$ values, and exactly corresponds to the inference-time query-to-key attendance pattern.

$$
\begin{array}{c|ccccc}
 & \mathbf{k}_0 & \mathbf{k}_1 & \mathbf{k}_2 & \dots & \mathbf{k}_n \\ \hline
\mathbf{q}_0 & 0 & -\infty & -\infty & \dots & -\infty \\
\mathbf{q}_1 & \log(1-\alpha_1) & 0 & -\infty & \dots & -\infty \\
\mathbf{q}_2 & \log(1-\alpha_1) & \log(1-\alpha_2) & 0 & \dots & -\infty \\
\vdots & \vdots & & \ddots & & \vdots \\
\mathbf{q}_n & \log(1-\alpha_1) & \log(1-\alpha_2) & \log(1-\alpha_3) & \dots & 0 \\
\end{array}
$$
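Building this additive mask from a sequence of $\alpha$ values can be sketched as follows (our construction: entry $(i,j)$ is $0$ on the diagonal, $-\infty$ above it for causality, and $\log(1-\alpha_{j+1})$ below it):

```python
import numpy as np

def dmc_mask(alphas):
    """Additive attention mask for DMC training.

    alphas: length-n sequence of decisions (alpha_1 .. alpha_n).
    Returns an (n+1, n+1) matrix M. As the alphas binarize,
    intermediate keys (alpha -> 1) get -inf and are masked out for
    later queries, while segment-final keys (alpha -> 0) get 0 and
    stay fully attendable, mimicking the inference-time cache.
    """
    n = len(alphas)
    M = np.full((n + 1, n + 1), -np.inf)
    np.fill_diagonal(M, 0.0)
    with np.errstate(divide='ignore'):        # log(0) -> -inf is intended
        for i in range(n + 1):
            for j in range(i):
                M[i, j] = np.log(1.0 - alphas[j])
    return M

# the running example from the text: alpha_{1:5} = (1, 1, 0, 1, 0)
M = dmc_mask([1.0, 1.0, 0.0, 1.0, 0.0])
```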

The model is incentivized to compress the KV cache to a certain CR, and thus to increase the predicted $\alpha$ values. Instead of matching the desired ratio for each append-or-accumulate decision $\alpha$, we calculate a global one-sided loss as the difference between the expected number of KV tokens across all layers $l$, heads $h$, and time steps $t$ under the desired Compression Ratio CR and the sum of all decisions, normalized by $n_l n_h n$:

It is added to the language modeling loss term $\ell_{\text{LM}}=-\sum_{t=1}^{n}\log p_{\boldsymbol{\theta}}(\mathbf{x}_t\mid\mathbf{x}_{<t})$, the final training objective being:
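One plausible reading of the global one-sided loss described above can be sketched as follows (the hinge form and variable names are our assumptions, not the paper's Equation 10 verbatim):

```python
import numpy as np

def cr_loss(alphas, target_cr):
    """One-sided (hinge) compression loss.

    alphas: array of shape (n_layers, n_heads, n) with decision values.
    Penalizes the model only while the expected cache size
    sum(1 - alpha) exceeds the budget n_l * n_h * n / CR; the result
    is normalized by n_l * n_h * n.
    """
    n_total = alphas.size                      # n_l * n_h * n
    cache_size = np.sum(1.0 - alphas)          # expected # of KV slots
    budget = n_total / target_cr
    return max(0.0, cache_size - budget) / n_total

alphas = np.zeros((2, 4, 16))                  # no compression yet
loss_start = cr_loss(alphas, target_cr=2.0)    # positive: pushes alphas up
loss_done = cr_loss(np.full((2, 4, 16), 0.5), target_cr=2.0)  # at target
```

Once the average decision value reaches $1 - 1/\text{CR}$, the loss vanishes, so the model is never rewarded for over-compressing.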

Importantly, the training procedure is designed to slowly ramp up the desired compression ratio and can be stopped at will. This enables producing a series of models with different compression ratios within a single run. It follows that all hyperparameters, such as the Gumbel-sigmoid sampling temperature and the learning rate, are not decayed and remain constant throughout training.

DMC allows every head to learn a custom compression, which results in KV cache sequences with variable lengths across heads. One could implement this cache naïvely with padded tensors. Foreshadowing our results in Section 5.2, however, the largest efficiency gains during inference stem from the reduced memory footprint, which allows us to increase the batch size and dramatically improve throughput. In order to get these benefits, the memory cannot be wasted on storing KV cache as padded tensors.

Thus, we provide a simple custom implementation of attention in PyTorch, building on FlashAttention (Dao et al., 2022) and PagedAttention (Kwon et al., 2023), in which heads can learn wildly different compression ratios without requiring padding. As an ablation, we also study a constrained variant (DMC-C) in which we force the heads in a given layer to maintain similar compression ratios, which minimizes the padding necessary in naïve attention implementations; however, it significantly constrains the learned compression schema and leads to worse model quality.

The calculation of partial accumulations during the training of DMC models (Equation 9) for a sequence of $n$ tokens requires $O(n)$ sequential steps, considerably slowing down training. In order to improve the time complexity, we calculate Equation 9 at every time step $t$ independently over a short window of the last $w$ elements up to time $t$ (e.g., $w=12$). This reduces the span of computation to $O(w)$, provided that at least $n$ threads can execute the computation in parallel, one for each of the $n$ positions.
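The windowed computation can be sketched as follows (an illustrative serial version with hard $\alpha$; in training all $n$ positions would run in parallel, and the window truncation is the approximation described above):

```python
import numpy as np

def accumulate_window(keys, alphas, omegas, t, w):
    """Recompute the partially accumulated key at step t using only
    the last w inputs.

    Exact whenever the current segment (the run of positions since the
    last alpha == 0) is shorter than w; otherwise the segment is
    truncated to the window, which is the approximation that makes
    each position independent of the full prefix.
    """
    start = max(0, t - w + 1)
    # walk back to the token that opened the current segment (alpha == 0)
    for s in range(t, start - 1, -1):
        if alphas[s] == 0:
            start = s
            break
    seg_k = keys[start:t + 1]
    seg_w = omegas[start:t + 1]
    return np.average(seg_k, axis=0, weights=seg_w)  # importance-weighted mean

rng = np.random.default_rng(0)
n, w = 20, 12
keys = rng.normal(size=(n, 4))
omegas = rng.uniform(0.1, 1.0, size=n)
alphas = (rng.uniform(size=n) < 0.7).astype(int)
alphas[0] = 0                                    # first token opens a segment
k_acc = [accumulate_window(keys, alphas, omegas, t, w) for t in range(n)]
```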

In our experiments, we evaluate strategies to retrofit a state-of-the-art LLM, Llama 2 (Touvron et al., 2023; obtained from https://huggingface.co/meta-llama), into a more efficient model across various sizes: 7B, 13B, and 70B. In addition to comparing the downstream performance of DMC with the original model, we also use Grouped Query Attention (GQA) as our main baseline, as it constitutes the most widespread strategy to ensure KV cache efficiency (Jiang et al., 2023).

To equip the original Llama 2 with GQA, we perform the standard checkpoint conversion proposed by Ainslie et al. (2023): the key and value projection matrices are split by head. Then the resulting sub-matrices corresponding to heads in the same group are merged. Note that this results in a fixed CR during training.
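The conversion can be sketched as follows (mean-pooling the per-head key/value projections within each group, following the conversion of Ainslie et al., 2023; the tensor layout and contiguous grouping are our assumptions):

```python
import numpy as np

def mha_to_gqa(W, n_groups):
    """Convert per-head key (or value) projections to grouped ones.

    W: (n_heads, d, d_h) per-head projection matrices.
    Heads are partitioned into contiguous groups, and the matrices in
    each group are merged by mean-pooling, yielding one shared KV
    projection per group: a fixed n_heads / n_groups compression.
    """
    n_heads, d, d_h = W.shape
    assert n_heads % n_groups == 0
    grouped = W.reshape(n_groups, n_heads // n_groups, d, d_h)
    return grouped.mean(axis=1)                 # (n_groups, d, d_h)

rng = np.random.default_rng(0)
Wk = rng.normal(size=(8, 64, 8))                # 8 KV heads
Wk_gqa = mha_to_gqa(Wk, n_groups=2)             # 4x fewer KV heads
```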

As for DMC, we avoid introducing new parameters by re-purposing the first dimension of both the $\mathbf{k}_t$ and $\mathbf{q}_t$ representations to predict segmentation decisions and importance scores. However, zeroing out the first dimension of $\mathbf{q}_t$ and $\mathbf{k}_t$ triggers a significant increase in language modeling loss by disrupting the attention scores. To counteract this spike in loss, we pre-train the model to disregard the first dimension of $\mathbf{k}_t$ and $\mathbf{q}_t$ in the attention calculations. Specifically, we load the raw pre-trained weights and up-train the model for 1 billion tokens (250 steps), annealing the first dimension of $\mathbf{k}_t$ and $\mathbf{q}_t$ to 0 according to the following formula:

where $t$ is the current step and $n_t=250$. After this initial phase, which allows the model to ignore the first dimension of keys and queries in attention calculations, we start the main retrofitting phase. We set the constant from Equation 8 to $c=-5$ in order to perform no compression at the start.

We follow the training hyperparameters outlined by Touvron et al. (2023). We employ the AdamW optimizer with $\beta_1=0.9$, $\beta_2=0.95$, and $\epsilon=10^{-5}$, in conjunction with a weight decay of 0.1 and gradient clipping at 1.0. The batch size is 1024 with a context length of 4096. We apply a constant learning rate identical to the final rate from the original Llama 2 pre-training phase: $3\times10^{-4}$ for the 7B and 13B models, and $1.5\times10^{-4}$ for the 70B model. Finally, we set the window size (Section 3.3) to 12, and keep the Gumbel-sigmoid temperature constant at 0.1 throughout the entire training.

The volume of data for continued pre-training with DMC is contingent on the targeted KV cache compression ratio: a larger CR necessitates more data. We use a training schedule with 24B, 48B, and 72B tokens for training to 2$\times$, 3$\times$, and 4$\times$ compression, respectively. In Appendix D, we include an ablation using a schedule with half as much data.

For DMC, we discovered that the annealing strategy is crucial: starting training without compression and increasing it in a measured way helps preserve the original perplexity. Through extensive ablations (see Appendix D), we found that any significant increase in perplexity, even if recovered during continued pre-training, prevents the model from regaining its performance on downstream tasks. The target CR is linearly increased from 1$\times$ to 4$\times$ for the 7B and 13B models, and from 1$\times$ to 2$\times$ for the 70B model (for Llama 2 70B, we do not up-train to 4$\times$ because this LLM is already pre-trained with GQA 8$\times$).

Upon achieving the target compression ratio with the DMC model, we initiate a final solidifying phase wherein we: 1) continue up-training for an additional 8B tokens, 2) maintain a fixed compression ratio, and 3) implement a cosine learning rate schedule, annealing it down to 10% of the initial value. This phase aims at stabilizing the model with a specific, fixed compression ratio.

Following Touvron et al. (2023), we evaluate models on a series of downstream tasks, including MMLU (Hendrycks et al., 2021) for factuality, HumanEval (Chen et al., 2021) for Python code generation, and several question-answering datasets for common-sense reasoning: PIQA (Bisk et al., 2020), BoolQ (Clark et al., 2019), Arc-C and Arc-E (Clark et al., 2018), HellaSwag (Zellers et al., 2019), and WinoGrande (Sakaguchi et al., 2021). We report the 5-shot performance on MMLU, average pass@1 scores for HumanEval, and average 0-shot performance on the common-sense benchmarks (CS-QA). For pass@1 scores, we use a temperature of 0.1 and nucleus sampling (Holtzman et al., 2019) with top-$p=0.95$.

We report the performance of the original LLM (equivalent to 1$\times$ CR) and of the efficient variants (DMC and GQA) in Table 1. For the original LLM, we use results reproduced in our codebase, as described in Appendix B.

DMC vs Original First, comparing DMC with the original LLM, we note that DMC even increases the performance on MMLU and CS-QA at 2$\times$ CR for 7B and 13B, and on HumanEval at all model scales. We speculate that this is due to the additional up-training steps, which expose the LLMs to new examples. For the other combinations of downstream tasks and scales at 4$\times$ CR, DMC incurs negligible degradation: $-0.7$ on MMLU and $-0.3$ on CS-QA for 7B, and $-0.3$ on MMLU and CS-QA for 13B. This encouraging finding suggests that DMC is robust across model scales even at 4$\times$ CR. Overall, the fact that DMC generally remains close or superior to the original performance makes it suitable as a drop-in replacement for KV caching to achieve higher inference efficiency.

DMC vs GQA Moreover, Table 1 allows for comparing DMC with GQA at equivalent CRs (2$\times$ and 4$\times$). Overall, DMC surpasses GQA applied through up-training at both CRs and at both scales (7B and 13B). For MMLU, the gap widens as the CR increases. This holds true both at the smaller 7B scale (from +5.4 at 2$\times$ CR to +9.2 at 4$\times$ CR) and at the larger 13B scale (from +4.6 at 2$\times$ CR to +5.6 at 4$\times$ CR). For CS-QA and HumanEval, on the other hand, we observe comparable gains over GQA across CRs and model scales. These findings illustrate that DMC should be preferred over GQA for retrofitting LLMs into variants that are more efficient at inference time.

70B: DMC and GQA However, many widely adopted LLMs were pre-trained with GQA, which raises the question of whether DMC and GQA can be used together to reap compounded benefits. To investigate this, we retrofit Llama 2 70B, which was pre-trained with 8$\times$ GQA. We further compress its KV cache with DMC to 2$\times$ CR: this is equivalent to a cache 16$\times$ smaller than that of a vanilla LLM with neither GQA nor DMC. We observe that the performance remains unchanged, and conclude that DMC and GQA can be easily and successfully combined.

DMC vs DMC-C Finally, in Table 3 in Appendix E, we compare DMC with its constrained variant DMC-C. While remaining superior to GQA, DMC-C displays a significant degradation in several configurations, most notably 7B 4$\times$, where it records a drop of $-6.4$ on MMLU compared to the ceiling. DMC, on the other hand, recovers the performance lost by DMC-C. When used in combination with custom attention implementations that do not require excessive padding, standard DMC should therefore be vastly preferred, as it retains the original LLM performance while fully reaping the advantages in memory efficiency.

Sample Efficiency To shed light on the sample efficiency of DMC and GQA, we report their performance on MMLU, CS-QA, and Codex-Eval across retrofitting steps in Figure 4. First, for a target CR of 2$\times$, it emerges that GQA cannot achieve the performance that DMC obtains after fine-tuning (at 8K steps) even after more than double the amount of fine-tuning (at 17K steps). This applies to both the 7B and 13B scales. Figure 4 also reveals the importance of the fine-tuning phase in DMC: a limited number of extra steps with a fixed CR recovers a significant amount of the original performance (especially for higher target CRs such as 4$\times$).

To verify whether increased CRs result in concrete efficiency gains, in Figure 5 we present the performance characteristics of DMC, measured within the NVIDIA Megatron-LM framework (Narayanan et al., 2021). Specifically, we run measurements on a single GPU (NVIDIA A100 80GB SXM or H100 SXM) in bfloat16 precision for Llama 2 7B and 13B. For Llama 2 70B, we run the same measurements on two GPUs of the same type with tensor parallelism. We feed the model 2K tokens of English text and generate an additional 2K tokens auto-regressively, to ensure that the model properly compresses its own generations. The reported throughput is the average over the last 1K tokens. We limit the sequence to 4K tokens, the maximum length observed by Llama 2 during pre-training, to avoid issues with context-length extrapolation.

In order to maximize GPU utilization, we increase the batch size to the maximum that fits into memory (see Appendix A for details). The compression of the KV cache with DMC allows for substantial increases in batch size and thus significant throughput gains. As shown in Figure 5(a), DMC 2$\times$ translates into an effective increase in tokens per second of $>1.8\times$ over the original LLM for 7B, 13B, and 70B on both A100 and H100 GPUs. Similarly, DMC 4$\times$ yields between $3.4\times$ and $3.7\times$. This means that the efficiency boost observed in practice is very close to the theoretical limit. In addition, the extra memory saved with DMC could also be used to cache longer contexts.

Finally, we compare the latency of the original LLM with DMC 4$\times$ in Figure 5(b). When we fit the largest possible batch size for either model, after generating approximately 2200 tokens the inference time begins to scale linearly with the context size due to the increasingly dominant cost of reading the KV cache from high-bandwidth memory (HBM). On the other hand, if we maintain for DMC 4$\times$ the same batch size as for the original LLM, the memory footprint of the KV cache is reduced and latency for longer contexts improves significantly.

While we acknowledge that the behavior of LLMs at inference time depends on a multitude of factors and implementation details, our measurements in Figure 5 offer evidence that DMC increases throughput and reduces the latency of autoregressive generation with LLMs. We speculate that in the future, DMC might be used to grow the KV cache sub-linearly, which would provide an alternative between vanilla Transformers and State Space Models, where memory is constant (Fu et al., 2023; Gu & Dao, 2023).

Since the training loss does not enforce any compression schema a priori, as it only requires matching a certain global CR, we can investigate what schema the model discovers in practice. In Figure 6, we report the CR for each layer and head at different scales (7B, 13B, and 70B) and CRs (2$\times$ and 4$\times$). From all schemata, it emerges that compressing deeper layers ($>$16 for 7B, $>$22 for 13B, $>$44 for 70B) is universally the most popular strategy. However, the very final layers are compressed to a somewhat lesser degree. Fascinatingly, DMC 4$\times$ also achieves extremely high CRs for several heads in the few layers right after the very first one. This is counter-intuitive, as token representations there are not yet contextualized, which makes taking the correct decision (whether to append or accumulate) hard.

This same pattern (in terms of relative preference for compressing certain ranges of layers) also corresponds to how the distribution of CRs across layers changes throughout training, as we keep annealing the auxiliary loss $\ell_{\text{CR}}$ towards the target CR. Figure 7 in Appendix C illustrates how the CR increases first in deeper layers, then in some non-contiguous intermediate ranges.

Efficient inference in Transformer models is a subject of extensive research, with detailed overviews provided by several surveys (Pope et al., 2022; Treviso et al., 2022). This section narrows its focus to advancements in Transformer inference efficiency through KV cache size reduction.

Grouped Query Attention (GQA; Ainslie et al., 2023) represents the most widespread strategy, evolving from Multi-Query Attention (MQA; Shazeer, 2019). GQA reduces the number of KV heads by allocating shared KV representations across subsets of query heads. Prior efforts in token merging (Zhang et al., 2018) condensed the entire past context into a single token, while Liu et al. (2018) and Rae et al. (2019) employed strided convolution and mean-pooling kernels to reduce the number of KV tokens. Sliding-window attention techniques (Beltagy et al., 2020; Child et al., 2019) restrict attention to a maximum of $w$ preceding tokens. Though effective in limiting the KV cache, such methods perform fixed-size compression, unlike the dynamic DMC presented here, which adapts the compression schema to the input sequence. This yields superior results, as we show in an ablation in Appendix E.

Previous learnable compression methods (Anagnostidis et al., 2023, inter alia) decide which tokens to drop from the KV cache. DMC takes a different approach: instead of dropping tokens, it merges them. Hence, it preserves cached information more faithfully. Moreover, the DMC compression mechanism has constant complexity with respect to the context length, while the one proposed by Anagnostidis et al. (2023) is linear. Mu et al. (2023) instead compress prompts through costly generation, which limits the inference benefits. Moreover, this method is only applicable to compressing the model input, while DMC compresses the entire sequence on the fly, including both the model input and the generated output.

Non-learnable cache eviction strategies (Zhang et al., 2023; Sheng et al., 2023; Liu et al., 2023; Wang et al., 2020; Ge et al., 2023) utilize attention scores or token properties to filter tokens in the KV cache. These approaches, while bypassing additional training, rely on heuristics and lack the ability to learn the compression mechanisms. In contrast, DMC integrates compression into its learning objective in an end-to-end manner, where compression is synergistic to language generation.

Finally, DMC draws inspiration from Dynamic Token Pooling (Nawrot et al., 2022), which introduces a learnable boundary predictor to merge the representations of groups of tokens in intermediate layers. DMC improves upon this idea by applying it to the KV cache and by introducing a continuous relaxation of the pooling decision during training. Moreover, it enables retrofitting pre-trained LLMs with minimal extra steps rather than training language models from scratch.

We proposed Dynamic Memory Compression, a method to reduce the length of the KV cache in Transformers, which enhances the memory efficiency and speed of LLMs at inference time. For every new token, DMC learns end-to-end whether to append its KV representations to the cache or merge them with the top element of the cache. We show how to retrofit LLMs such as Llama 2 at different scales (7B, 13B, and 70B) into efficient DMC versions with a negligible amount of extra data and without extra parameters. DMC LLMs with 2$\times$ and 4$\times$ compression ratios (CRs) retain (or even improve upon) the performance of the original LLM. In addition, we demonstrate that, at comparable CRs, DMC has significantly better continued pre-training performance and sample efficiency than Grouped Query Attention (GQA), a widespread method for KV cache size reduction.

Dynamic Memory Compression in Large Language Models (LLMs) like Llama 2 results in better computational efficiency, reducing both operational costs and environmental impact (Patterson et al., 2021). By enabling higher throughput and lower latency, DMC democratizes access to advanced AI, making state-of-the-art models suitable for a broader range of hardware. This may not only accelerate innovation across diverse sectors but also promote AI development and applications in an environmentally conscious manner.

During auto-regressive generation with a KV cache, the sequence length during every forward pass is $n=1$. The vast majority of time is spent on calculations in linear layers and multi-head self-attention. For linear layers, the ratio of FLOPs to input bytes improves as the batch size increases; for small batch sizes, the operations are memory bound on reading the weight matrices from HBM. For MHSA layers, on the other hand, the ratio of FLOPs to input bytes does not change during inference, and MHSA layers are always memory bound on current GPUs. The impact of KV cache compression is thus two-fold: it 1) decreases the latency of MHSA layers, and 2) allows us to fit larger batch sizes in HBM, resulting in better throughput and better GPU utilization during the calculations of linear layers.

To make sure that our implementation is correct, for each downstream task we compare the performance reported in the original Llama 2 paper (Touvron et al., 2023) with that obtained from the Hugging Face Hub checkpoints. Furthermore, we evaluate the impact of using our internal data mixture for up-training, acknowledging that variations in data proportions and preprocessing methodologies can influence model behavior. In particular, we up-train the vanilla pre-trained Llama 2 checkpoint for 200 training steps, amounting to 1B tokens, in accordance with the original Llama 2 training schedule. We compute the average and standard deviation over checkpoints after 50, 100, 150, and 200 steps. In our experiments, we replicate the results reported in the Llama 2 paper almost exactly, as shown in Table 2.

In Figure 7, we illustrate how the CR changes for each layer of the Llama 2 7B model throughout the training from 1×\times up to 4×\times global CR. Each subplot corresponds to a different global CR which occurs at different stages of the training, going from the smallest (1.17) at the top, to the highest (4.16) at the bottom. There is a clear trend such that, for a smaller global Compression Ratio (i.e. at the beginning of the training), the model emphasizes compression in the later layers. As the global Compression Ratio increases, the model keeps on compressing in final layers but also starts to compress the earlier layers. We hypothesize that the token representations in the initial layers do not contain sufficient information to perform any meaningful grouping. Conversely, token representations in the subsequent layers are more defined and, possibly, after several attention layers, already contain redundant/shared information.

Do DMC models compress sequences with a uniform CR independent of their total length? We find that this is not the case. As shown in Figure 8, the CR increases logarithmically as the total sequence length grows. This holds true across all global CRs (including both 2$\times$ and 4$\times$).

Do DMC models learn a fixed compression schema, or do they exhibit position biases? In Figure 9, we plot the average value of the decision variable $\alpha$ across positions in the sequence (0 to 4096). We observe that the average value of $\alpha$ is independent of a token's position in the input sequence, which demonstrates that the model does not follow a fixed pattern. This persists across both 2$\times$ and 4$\times$ compression ratios.

Studying the compression schema in Figure 10 learned by DMC-C, we find a very different pattern compared to DMC, due to the auxiliary loss forcing the model to compress similarly across heads in the same layer. Nevertheless, we observe a similar global preference for compressing deeper layers.

A natural question is whether the compression schema that the model learns is aligned with human intuitions about text segmentation. We analyzed the outputs of Llama 2 13B DMC with CR 4 and noticed that some heads compress according to the boundaries of linguistic units, such as words or syntactic phrases. Figure 11 shows the compression schema learned by head 14 in layer 0. In this case, the model merges subwords back into words, reverting the tokenizer. Interestingly, some groupings of tokens correspond to semantic units, e.g., “1 9 th century”, “5 0 percent”, or “a week back later”. Yet, we also stress that many heads and layers are not interpretable, as their behavior does not overlap with linguistic units.

More generally, higher layers merge longer token sequences, in line with Figure 6. For instance, Figure 12 shows the decisions of layer 24, head 2. We leave a more in-depth analysis of compression schemata learned by DMC to future work.

Another advantage of DMC is its high flexibility. Based on the availability of resources, different regimes can be chosen when annealing the CR to the target during up-training. In Figure 13, we compare a Short and a Long regime for the constrained variant of DMC (DMC-C), which increase the CR by 1 every 3K and 6K steps (12B and 24B tokens), respectively. It is evident that there is a trade-off between training steps (hence, time) and performance. Additionally, Figure 13 showcases another aspect of the flexibility DMC affords: it is compatible with arbitrary real-valued CRs, as opposed to integer CRs divisible by 2 as in GQA.

Additionally, we explore how different schedules for the target CR impact model performance. In the standard setup, the target is annealed from 1 to its final value throughout the duration of training. Here, we compare it with a setup where the CR used in the auxiliary compression loss is set to the target from the start (DMC-immediate). We show the results in Figure 14. As expected, DMC-immediate suffers a perplexity spike at the beginning of training, when the model quickly increases the CR due to the auxiliary loss. While perplexity recovers during training, even to a lower point than DMC with annealing, downstream accuracy on the MMLU benchmark is degraded across the board. This showcases why avoiding perplexity spikes is fundamental to successfully retrofitting an LLM.
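
The annealed schedule can be sketched as a simple linear ramp. The function below is our own illustration (the name `target_cr` and the `constant_tail` parameter are not from the paper), assuming the target CR grows linearly from 1 to its final value and is then held constant for the last steps of up-training:

```python
def target_cr(step: int, total_steps: int, final_cr: float,
              constant_tail: int = 2000) -> float:
    """Linearly anneal the target CR from 1 to final_cr over training,
    then hold it constant for the last `constant_tail` steps.
    DMC-immediate corresponds to returning final_cr from step 0."""
    ramp_steps = max(total_steps - constant_tail, 1)
    frac = min(step / ramp_steps, 1.0)
    return 1.0 + frac * (final_cr - 1.0)
```

Under such a schedule the auxiliary loss never demands a large jump in CR at any single step, which is what avoids the perplexity spike observed for DMC-immediate.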

We assessed the importance of dynamically learning compression decisions in DMC by comparing it with Fixed Memory Pooling, which reduces the overall number of tokens in memory by deterministically averaging every n tokens, where n is identical to the compression rate. The results, shown in Figure 14, demonstrate that the dynamic component of DMC is crucial to achieve lower perplexity as well as higher downstream accuracy.
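
A minimal sketch of such a deterministic baseline (our own illustration in NumPy, with `fixed_memory_pooling` as a hypothetical name) averages every n consecutive cached items, ignoring content entirely:

```python
import numpy as np

def fixed_memory_pooling(kv: np.ndarray, n: int) -> np.ndarray:
    """Deterministically average every n consecutive cached items.
    kv: (seq_len, dim). A trailing partial group is averaged as-is.
    Unlike DMC, the grouping is fixed and content-agnostic."""
    groups = [kv[i:i + n].mean(axis=0) for i in range(0, len(kv), n)]
    return np.stack(groups)
```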

We compare two approaches to compression: a Local Prior, which enforces a pre-specified compression ratio (CR) in each layer independently, requiring every layer to compress approximately the same amount, and a Global Prior, used by default in DMC, which applies a pre-specified CR across all layers, giving the model the freedom to apply different CRs in each layer, provided that their average equals the global CR. Figure 15 clearly indicates that the Global Prior (DMC in Figure 15) improves MMLU performance compared to the Local Prior.

We then compare three strategies to determine how similar compression schemata for heads within each layer should be (assuming a global prior):

DMC: There are no constraints on the decision and importance scores, except for the global loss nudging the model towards a pre-defined CR.

DMC-C: An auxiliary loss encourages heads within the same layer to compress at similar rates (the constrained variant described above), without strictly tying their decisions.

DMC-HardC: Decision scores α_t and importance scores ω_t are shared across heads, leading to the same shortening schema within each layer across heads.

As per Figure 15, the default DMC strategy shows consistent MMLU performance across varying CRs, while both DMC-C and DMC-HardC exhibit a sharp drop in MMLU as the compression reaches 1.9×. Moreover, in Table 3 we report a more thorough comparison between DMC and DMC-C. In general, while remaining superior to GQA, DMC-C displays a significant degradation in several configurations compared to regular DMC.

Finally, we assess the impact of predicting importance scores for accumulation, as opposed to uniformly weighting each token in a group. Figure 15 shows that DMC-C with Uniform Weighting performs worse than DMC-C with learned weighting.

We mask the unnormalized attention score for the pair (i, j) as follows:

We rely on the memory-efficient implementation of MHSA from PyTorch, which allows adding arbitrary masks to the attention scores before the softmax. Notwithstanding this, at inference time DMC remains compatible with efficient attention libraries such as Flash Attention (Dao et al., 2022). The log(1 − α_{j+1}) term is calculated as the log-sigmoid of the negated decision logit for better numerical precision.
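
The identity behind this trick is log(1 − sigmoid(x)) = log-sigmoid(−x). A stable plain-Python version (the helper names below are ours) can be written as:

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)) = -log(1 + exp(-x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def log_one_minus_alpha(logit: float) -> float:
    """log(1 - sigmoid(logit)), computed as log-sigmoid(-logit)
    to avoid catastrophic cancellation when sigmoid(logit) is near 1."""
    return log_sigmoid(-logit)
```

For a large logit such as 40, the naive form log(1 − sigmoid(40)) underflows to log(0), whereas the stable form returns approximately −40.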

This paper focuses on retrofitting existing LLMs into DMC variants. In our preliminary experiments with pre-training LLMs with DMC from scratch, we obtained negative results compared to the training curve of GQA. We speculate that this is due to the mutual dependency of modeling and segmenting data: when token representations are not of sufficient quality, boundary decisions are unreliable. Vice versa, incorrect boundary decisions may lead to poor token representations. This creates a vicious cycle that may be broken by techniques that facilitate convergence, such as an Expectation-Maximization-style alternation between modeling and segmenting.

Table: S5.T1: MMLU accuracy (Acc.), Commonsense Question Answering (CS-QA), Exact Match (EM), and Human-Eval Pass@1 for several scales (7B, 13B, and 70B) and compression rates (CRs; 1×, 2×, and 4×) of Llama 2. (*) Unlike the 7B and 13B models, the 70B model was trained with GQA, which compresses the KV cache 8×. We apply an additional 2× DMC compression during up-training to obtain a total compression of 16×.

| Scale | Method | CR | MMLU | CS-QA | Human Eval |
|-------|--------|-----|------|-------|------------|
| 7B | – | 1× | 44.6 | 70.5 | 14.0 |
| 7B | GQA | 2× | 39.8 | 68.9 | 12.8 |
| 7B | DMC | 2× | 45.2 | 70.8 | 15.2 |
| 7B | GQA | 4× | 34.7 | 68.3 | 14.0 |
| 7B | DMC | 4× | 43.9 | 70.2 | 16.5 |
| 13B | – | 1× | 54.5 | 73.5 | 17.5 |
| 13B | GQA | 2× | 50.2 | 72.7 | 15.9 |
| 13B | DMC | 2× | 54.8 | 74.2 | 20.7 |
| 13B | GQA | 4× | 48.6 | 72.2 | 16.5 |
| 13B | DMC | 4× | 54.2 | 73.2 | 22.0 |
| 70B* | – | 8×* | 68.8 | 78.0 | 29.6 |
| 70B* | DMC | 16×* | 68.8 | 77.9 | 29.9 |

Table: A2.T2: Replicating the original up-training results.

| | CS-QA (7B) | CS-QA (13B) | CS-QA (70B) | MMLU (7B) | MMLU (13B) | MMLU (70B) | Human-Eval (7B) | Human-Eval (13B) | Human-Eval (70B) |
|---|---|---|---|---|---|---|---|---|---|
| Paper | 70.6 | 73.7 | 78.5 | 45.3 | 54.8 | 68.9 | 12.8 | 18.3 | 29.9 |
| Checkpoint | 70.6 | 73.7 | 78.5 | 45.7 | 55.1 | 69.1 | 13.4 | 17.7 | 30.5 |
| Up-trained | 70.5 ± 0.2 | 73.5 ± 0.1 | 78.0 ± 0.1 | 44.6 ± 0.6 | 54.5 ± 0.3 | 68.8 ± 0.3 | 14.0 ± 1.2 | 17.5 ± 1.5 | 29.6 ± 1.6 |

Figure: (a) Regular key–value cache with items kv_i depicted as boxes. New items are always appended.

Figure: (b) Dynamic Memory Compression (DMC) chooses whether to accumulate or append current items, resulting in a smaller key–value cache.

Figure: An example of KV cache growth in DMC during inference (left). During training (right), we retain all intermediate states seen during inference and gradually block access to some of them (gray arrows).


Figure: Sample efficiency of DMC and GQA. Horizontal lines correspond to the performance of the original Llama 2. Every DMC model was trained first with increasing CR, then with constant CR for the last 2K steps (marked with ⋆).

Figure: (a) Inference throughput averaged over the generation of 1K tokens with 3K tokens of context (up to 4K in total). On the x-axis, we show the maximum batch size that fits in memory on a single GPU (7B and 13B) or two GPUs with tensor parallelism (70B) for the Vanilla, DMC 2×, and DMC 4× models.

Figure: (b) Latency of next token generation. Solid lines denote measurements with the maximum batch size that fits on a single GPU. Dotted lines denote DMC 4× with the same batch size as the vanilla LLM.

Figure: Heatmaps of average compression rates across layers (x-axis) and heads (y-axis). Heads are arranged from the highest compression to the lowest, top to bottom, for clarity.

Figure: Compression distribution across layers at different stages of retrofitting a Llama 2 7B model. We adhere to the convention where, for a given subplot, a larger space above a given layer indicates greater compression at that layer.

Figure: CR achieved by Llama 2 7B for particular sequence lengths across various global CRs.

Figure: Average value of the decision α for positions (0, 4096), averaged over 128 samples, heads, and layers.

Figure: Compression schema found by Llama 2 13B DMC 4× in layer 0, head 14. Tokens that are merged in the KV cache are marked with the same color.

Figure: Different up-training regimes for DMC-C: Short (red) increases CR by 1 every 3K steps, Long (blue) increases CR by 1 every 6K steps. Horizontal lines correspond to the performance of the original Llama 2.

Figure: Validation perplexity and MMLU accuracy vs. training steps for Fixed Memory Pooling and a variant of DMC where the auxiliary loss CR is immediately set to the target at the onset of training. All models follow the regular training schedule and are trained up to 2× CR.

Figure: Validation perplexity and test MMLU accuracy vs. compression rate for different types of compression priors, in-layer relaxations, and importance scores. All models follow the short training regime and are trained up to 2× CR.

$$ a_{ij}^h = \frac{\exp(\qq_i^{h^\top} \kk_j^h / \sqrt{d_h})}{\sum_{t=1}^i \exp(\qq_i^{h^\top} \kk_t^h / \sqrt{d_h})}, \quad \oo_i^h = \sum_{j=1}^i a_{ij}^h\vv_j^h. \label{eq:attn-softmax} $$ \tag{eq:attn-softmax}

$$ O = (W_o[\oo_1^1, \ldots, \oo_1^{n_h}], \ldots, W_o[\oo_n^1, \ldots, \oo_n^{n_h}]) $$
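
For reference, the single-head attention equation above can be sketched in a few lines of NumPy (a didactic version with explicit causal masking, not the memory-efficient kernel used in practice):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head causal softmax attention: o_i = sum_{j<=i} a_ij v_j,
    with a_ij = softmax_j(q_i . k_j / sqrt(d_h)). Q, K, V: (n, d_h)."""
    n, d_h = Q.shape
    scores = Q @ K.T / np.sqrt(d_h)
    scores[np.triu_indices(n, k=1)] = -np.inf  # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```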

$$ \label{eq:reuse} \alpha_t \myeq \lfloor \text{sigmoid}(\kk_t[0]) \rceil,\ \omega_t \myeq \text{sigmoid}(\qq_t[0]). $$ \tag{eq:reuse}

$$ \operatorname*{arg\,min}_{\theta} \; \ell_{\text{LM}} + \ell_{\text{CR}}. $$

$$ \hat{a}_{(i,j)} = \underbrace{\frac{\mathbf{q}_{i}[1:d_{h}]^{\top}\mathbf{k}_{j}[1:d_{h}]}{\sqrt{d_{h}}}}_{\text{attention score}} + \underbrace{\log(1-\alpha_{j+1})\vphantom{\frac{1}{\sqrt{d_{h}}}}}_{\text{attention mask}}. $$ \tag{A6.Ex1}


$$ \label{eq:cr_loss} \ell_{\text{CR}} = \frac{1}{n_l n_h n} \text{max}\left(0, \sum_{l=1}^{n_l} \sum_{h=1}^{n_h} \sum_{t=1}^{n} (1 - \alpha_{lht}) - \frac{n_l n_h n}{\text{CR}}\right). $$ \tag{eq:cr_loss}
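
A direct NumPy transcription of this one-sided loss (our own sketch) makes its behavior clear: it is zero whenever the expected memory use stays within the budget n_l n_h n / CR, and grows linearly past it:

```python
import numpy as np

def cr_loss(alpha: np.ndarray, target_cr: float) -> float:
    """One-sided CR loss. alpha: (n_layers, n_heads, n_tokens), where
    alpha[l, h, t] is the (relaxed) probability of merging token t in
    head h of layer l. sum(1 - alpha) estimates the number of KV slots
    kept; only the part exceeding the budget total / target_cr is
    penalized."""
    total = alpha.size  # n_l * n_h * n
    overshoot = (1.0 - alpha).sum() - total / target_cr
    return max(0.0, overshoot) / total
```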

| Scale | Method | CR | MMLU | CS-QA | Human Eval |
|-------|--------|-----|------|-------|------------|
| 7B | – | 1× | 44.6 | 70.5 | 14.0 |
| 7B | GQA | 2× | 39.8 | 68.9 | 12.8 |
| 7B | H₂O | 2× | 45.2 | 67.5 | 9.8 |
| 7B | TOVA | 2× | 44.9 | 70.0 | 6.1 |
| 7B | DMC | 2× | 45.2 | 70.8 | 15.2 |
| 7B | GQA | 4× | 34.7 | 68.3 | 14.0 |
| 7B | H₂O | 4× | 41.1 | 56.8 | 4.9 |
| 7B | TOVA | 4× | 43.4 | 64.2 | 1.8 |
| 7B | DMC | 4× | 43.9 | 70.2 | 16.5 |
| 7B | H₂O | 6× | 36.3 | 51.9 | 0.0 |
| 7B | TOVA | 6× | 41.1 | 56.1 | 0.0 |
| 7B | DMC | 6× | 42.9 | 70.4 | 15.9 |
| 7B | H₂O | 8× | 32.5 | 48.3 | 0.6 |
| 7B | TOVA | 8× | 38.7 | 51.0 | 0.0 |
| 7B | DMC | 8× | 41.8 | 70.1 | 16.5 |
| 13B | – | 1× | 54.5 | 73.5 | 17.5 |
| 13B | GQA | 2× | 50.2 | 72.7 | 15.9 |
| 13B | H₂O | 2× | 54.1 | 70.3 | 16.5 |
| 13B | TOVA | 2× | 54.4 | 72.8 | 12.2 |
| 13B | DMC | 2× | 54.8 | 74.2 | 20.7 |
| 13B | GQA | 4× | 48.6 | 72.2 | 16.5 |
| 13B | H₂O | 4× | 50.7 | 60.0 | 8.5 |
| 13B | TOVA | 4× | 53.3 | 67.8 | 2.4 |
| 13B | DMC | 4× | 54.2 | 73.2 | 22.0 |
| 13B | H₂O | 6× | 45.9 | 54.9 | 2.4 |
| 13B | TOVA | 6× | 51.7 | 60.0 | 0.0 |
| 13B | DMC | 6× | 53.3 | 73.4 | 21.3 |
| 13B | H₂O | 8× | 41.8 | 50.9 | 1.8 |
| 13B | TOVA | 8× | 50.3 | 53.4 | 0.0 |
| 13B | DMC | 8× | 52.1 | 73.3 | 21.3 |
| 70B* | – | 8×* | 68.8 | 78.0 | 29.6 |
| 70B* | H₂O | 16×* | 68.7 | 74.1 | 18.3 |
| 70B* | TOVA | 16×* | 68.1 | 77.6 | 29.9 |
| 70B* | DMC | 16×* | 68.8 | 77.9 | 29.9 |
| Scale | Method | CR | MMLU | CS-QA | Human Eval |
|-------|--------|-----|------|-------|------------|
| 7B | – | 1× | 44.6 | 70.5 | 14.0 |
| 7B | GQA | 2× | 39.8 | 68.9 | 12.8 |
| 7B | DMC | 2× | 45.2 | 70.8 | 15.2 |
| 7B | DMC-C | 2× | 45.5 | 70.6 | 14.6 |
| 7B | GQA | 4× | 34.7 | 68.3 | 14.0 |
| 7B | DMC | 4× | 43.9 | 70.2 | 16.5 |
| 7B | DMC-C | 4× | 38.2 | 69.6 | 14.6 |
| 13B | – | 1× | 54.5 | 73.5 | 17.5 |
| 13B | GQA | 2× | 50.2 | 72.7 | 15.9 |
| 13B | DMC | 2× | 54.8 | 74.2 | 20.7 |
| 13B | DMC-C | 2× | 54.8 | 73.9 | 18.3 |
| 13B | GQA | 4× | 48.6 | 72.2 | 16.5 |
| 13B | DMC | 4× | 54.2 | 73.2 | 22.0 |
| 13B | DMC-C | 4× | 52.4 | 72.9 | 18.3 |
| 70B* | – | 8×* | 68.8 | 78.0 | 29.6 |
| 70B* | DMC | 16×* | 68.8 | 77.9 | 29.9 |
| 70B* | DMC-C | 16×* | 67.4 | 78.2 | 31.1 |

$$ \label{eq:partial-accumulation} \begin{aligned} z_0 &\gets \omega_0, & z_i &\gets z_{i-1}\alpha_i + \omega_i, \\ \kk_0 &\gets \kk_0, & \kk_i &\gets \frac{\alpha_i\kk_{i-1}z_{i-1} + \kk_i\omega_i}{z_i}, \\ \vv_0 &\gets \vv_0, & \vv_i &\gets \frac{\alpha_i\vv_{i-1}z_{i-1} + \vv_i\omega_i}{z_i}. \end{aligned} $$ \tag{eq:partial-accumulation}
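
The recurrence can be checked numerically: with all α_i = 1 (every token merged into a single slot), the final key reduces to the ω-weighted average of all keys. A small plain-Python check (our own illustration, with scalars standing in for key vectors):

```python
def accumulate(keys, omegas):
    """Run the key recurrence above with alpha_i = 1 for all i
    (every token is merged into a single slot)."""
    z, k = omegas[0], keys[0]                 # z_0 <- omega_0, k_0 <- k_0
    for k_i, w_i in zip(keys[1:], omegas[1:]):
        z_new = z + w_i                       # z_i <- z_{i-1} * 1 + omega_i
        k = (k * z + k_i * w_i) / z_new       # running weighted average
        z = z_new
    return k

keys, omegas = [1.0, 2.0, 4.0], [0.5, 0.25, 0.25]
merged = accumulate(keys, omegas)  # equals (0.5*1 + 0.25*2 + 0.25*4) / 1.0
```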

$$ \begin{aligned} \qq_{t}[0] &\gets \qq_{t}[0] \times (1 - (t / n_t)) \\ \kk_{t}[0] &\gets \kk_{t}[0] \times (1 - (t / n_t)) \end{aligned} $$

$$ \begin{aligned} Q^h &= (W_q^h\xx_1, \ldots, W_q^h\xx_n), \\ K^h &= (W_k^h\xx_1, \ldots, W_k^h\xx_n), \\ V^h &= (W_v^h\xx_1, \ldots, W_v^h\xx_n). \end{aligned} $$

Algorithm: algorithm
[ht]
\caption{Single-head KV cache update with Dynamic Memory Compression (DMC)}\label{alg:dmc-update}
\begin{algorithmic}[1]
\REQUIRE $K, V, \qq_t, \kk_t, \vv_t$
\STATE $\alpha_t \gets \text{round}(\text{sigmoid}(\kk_t[0]) )$
\STATE $\omega_t \gets \text{sigmoid}(\qq_t[0])$
\STATE $\kk_t[0], \qq_t[0] \gets 0, 0$
\IF[\hlcyan{\textsc{accumulate}}]{$\alpha_t = 1$}
\STATE $z_t \gets z_{t-1} + \omega_t$
\STATE $K \gets [K_{1:l-1}, (K_l z_{t-1} + \kk_t \omega_t)/z_t]$
\STATE $V \gets [V_{1:l-1}, (V_l z_{t-1} + \vv_t \omega_t)/z_t]$
\ELSE[\hlred{\textsc{append}}]
\STATE $z_t \gets \omega_t$
\STATE $K \gets [K_{1:l}, \kk_t]$
\STATE $V \gets [V_{1:l}, \vv_t]$
\ENDIF
\STATE\RETURN $K, V, \qq_t, \kk_t$
\end{algorithmic}
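
A plain-Python sketch of this update (our own illustration; lists of floats stand in for key/value vectors, the accumulator weight z is passed explicitly, and the fused GPU implementation is not shown):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dmc_update(K, V, z, q_t, k_t, v_t):
    """Single-head DMC cache update, following the algorithm above.
    K, V: lists of cached key/value vectors; z: accumulator weight of
    the last slot. The first channel of k_t / q_t carries the decision
    and importance scores."""
    alpha = round(sigmoid(k_t[0]))          # accumulate (1) or append (0)
    omega = sigmoid(q_t[0])                 # importance weight
    k_t, q_t = [0.0] + k_t[1:], [0.0] + q_t[1:]   # zero out score channels
    if alpha == 1 and K:                    # ACCUMULATE into the last slot
        z_new = z + omega
        K[-1] = [(a * z + b * omega) / z_new for a, b in zip(K[-1], k_t)]
        V[-1] = [(a * z + b * omega) / z_new for a, b in zip(V[-1], v_t)]
        z = z_new
    else:                                   # APPEND a new slot
        K.append(k_t)
        V.append(v_t)
        z = omega
    return K, V, z, q_t
```

Note that z must persist across decoding steps alongside the cache, so that a later merge weights the existing slot by the total importance accumulated so far.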


References

[Ainslie2023GQATG] Ainslie, J., Lee-Thorp, J., de~Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. \newblock {GQA:} training generalized multi-query transformer models from multi-head checkpoints. \newblock In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. \newblock URL https://doi.org/10.18653/v1/2023.emnlp-main.298.

[Anagnostidis2023DynamicCP] Anagnostidis, S., Pavllo, D., Biggio, L., Noci, L., Lucchi, A., and Hofmann, T. \newblock Dynamic context pruning for efficient and interpretable autoregressive transformers. \newblock In Advances in Neural Information Processing Systems 36, 2023.

[Bahdanau2014NeuralMT] Bahdanau, D., Cho, K., and Bengio, Y. \newblock Neural machine translation by jointly learning to align and translate. \newblock In 3rd International Conference on Learning Representations, 2015. \newblock URL http://arxiv.org/abs/1409.0473.

[Beltagy2020Longformer] Beltagy, I., Peters, M.~E., and Cohan, A. \newblock Longformer: The long-document transformer. \newblock ArXiv, abs/2004.05150, 2020. \newblock URL https://arxiv.org/abs/2004.05150.

[bras_Gao_Choi_2020] Bisk, Y., Zellers, R., Bras, R.~L., Gao, J., and Choi, Y. \newblock {PIQA:} reasoning about physical commonsense in natural language. \newblock In The Thirty-Fourth {AAAI Conference on Artificial Intelligence}, 2020. \newblock URL https://doi.org/10.1609/aaai.v34i05.6239.

[Bolya2022TokenMY] Bolya, D., Fu, C., Dai, X., Zhang, P., Feichtenhofer, C., and Hoffman, J. \newblock Token merging: Your vit but faster. \newblock In The Eleventh International Conference on Learning Representations, 2023. \newblock URL https://openreview.net/pdf?id=JroZRaRw7Eu.

[Chen2021EvaluatingLL] Chen, M., Tworek, J., Jun, H., Yuan, Q., Ponde, H., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F.~P., Cummings, D.~W., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W.~H., Nichol, A., Babuschkin, I., Balaji, S., Jain, S., Carr, A., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M.~M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. \newblock Evaluating large language models trained on code. \newblock ArXiv, abs/2107.03374, 2021.

[child2019generating] Child, R., Gray, S., Radford, A., and Sutskever, I. \newblock Generating long sequences with sparse transformers. \newblock ArXiv, abs/1904.10509, 2019.

[Choromanski2020RethinkingAW] Choromanski, K.~M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlós, T., Hawkins, P., Davis, J.~Q., Mohiuddin, A., Kaiser, L., Belanger, D.~B., Colwell, L.~J., and Weller, A. \newblock Rethinking attention with performers. \newblock In 9th International Conference on Learning Representations, 2021. \newblock URL https://openreview.net/forum?id=Ua6zuk0WRH.

[clark-etal-2019-boolq] Clark, C., Lee, K., Chang, M., Kwiatkowski, T., Collins, M., and Toutanova, K. \newblock Boolq: Exploring the surprising difficulty of natural yes/no questions. \newblock In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, 2019. \newblock URL https://doi.org/10.18653/v1/n19-1300.

[clark2018think] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. \newblock Think you have solved question answering? try {ARC}, the {AI2} reasoning challenge. \newblock ArXiv, abs/1803.05457, 2018.

[dao2022flash] Dao, T., Fu, D., Ermon, S., Rudra, A., and Ré, C. \newblock {FlashAttention}: Fast and memory-efficient exact attention with {IO}-awareness. \newblock In Advances in Neural Information Processing Systems, volume~35, 2022.

[deepseekai2024deepseekv2] DeepSeek-AI. \newblock {DeepSeek-V2}: A strong, economical, and efficient mixture-of-experts language model, 2024.

[fu2023hungry] Fu, D.~Y., Dao, T., Saab, K.~K., Thomas, A.~W., Rudra, A., and Re, C. \newblock {Hungry Hungry Hippos}: Towards language modeling with state space models. \newblock In The International Conference on Learning Representations, 2023. \newblock URL https://openreview.net/forum?id=COZDy0WYGg.

[Ge2023ModelTY] Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. \newblock Model tells you what to discard: Adaptive kv cache compression for llms. \newblock ArXiv, abs/2310.01801, 2023.

[gu2023mamba] Gu, A. and Dao, T. \newblock Mamba: Linear-time sequence modeling with selective state spaces. \newblock ArXiv, abs/2312.00752, 2023.

[hendrycks2021measuring] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. \newblock Measuring massive multitask language understanding. \newblock In International Conference on Learning Representations, 2021. \newblock URL https://openreview.net/forum?id=d7KBjmI3GmQ.

[Holtzman2019TheCC] Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. \newblock The curious case of neural text degeneration. \newblock In 8th International Conference on Learning Representations, 2020. \newblock URL https://openreview.net/forum?id=rygGQyrFvH.

[Hooper2024KVQuantT1] Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M.~W., Shao, Y.~S., Keutzer, K., and Gholami, A. \newblock {KVQuant}: Towards 10 million context length llm inference with kv cache quantization. \newblock ArXiv, abs/2401.18079, 2024.

[jiang2023mistral] Jiang, A.~Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., delas Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L.~R., Lachaux, M.-A., Stock, P., Scao, T.~L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W.~E. \newblock Mistral {7B}. \newblock ArXiv, abs/2310.06825, 2023.

[Kwon2023EfficientMM] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.~H., Gonzalez, J., Zhang, H., and Stoica, I. \newblock Efficient memory management for large language model serving with pagedattention. \newblock In Proceedings of the 29th Symposium on Operating Systems Principles, 2023. \newblock URL https://doi.org/10.1145/3600006.3613165.

[Liu2018GeneratingWB] Liu, P.~J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. \newblock Generating wikipedia by summarizing long sequences. \newblock In 6th International Conference on Learning Representations, 2018. \newblock URL https://openreview.net/forum?id=Hyg0vbWC-.

[Liu2023ScissorhandsET] Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. \newblock Scissorhands: Exploiting the persistence of importance hypothesis for {LLM} {KV} cache compression at test time. \newblock In Advances in Neural Information Processing Systems 36, 2023{a}.

[Liu2023LLMQATDQ] Liu, Z., Oğuz, B., Zhao, C., Chang, E., Stock, P., Mehdad, Y., Shi, Y., Krishnamoorthi, R., and Chandra, V. \newblock {LLM-QAT}: Data-free quantization aware training for large language models. \newblock ArXiv, abs/2305.17888, 2023{b}.

[Liu2024KIVIAT] Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. \newblock {KIVI}: A tuning-free asymmetric 2bit quantization for {KV} cache. \newblock ArXiv, abs/2402.02750, 2024.

[Mu2023LearningTC] Mu, J., Li, X., and Goodman, N.~D. \newblock Learning to compress prompts with gist tokens. \newblock In Advances in Neural Information Processing Systems 36, 2023.

[Narayanan2021EfficientLL] Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V.~A., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., and Zaharia, M.~A. \newblock Efficient large-scale language model training on gpu clusters using megatron-lm. \newblock SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021. \newblock URL https://doi.org/10.1145/3458817.3476209.

[Nawrot2022EfficientTW] Nawrot, P., Chorowski, J., Łańcucki, A., and Ponti, E.~M. \newblock Efficient transformers with dynamic token pooling. \newblock In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, 2023. \newblock URL https://aclanthology.org/2023.acl-long.353.

[Oren2024TransformersAM] Oren, M., Hassid, M., Adi, Y., and Schwartz, R. \newblock Transformers are multi-state {RNN}s. \newblock ArXiv, abs/2401.06104, 2024.

[Patterson2021CarbonEA] Patterson, D.~A., Gonzalez, J., Le, Q.~V., Liang, C., Munguía, L.-M., Rothchild, D., So, D.~R., Texier, M., and Dean, J. \newblock Carbon emissions and large neural network training. \newblock ArXiv, abs/2104.10350, 2021.

[Pope2022EfficientlyST] Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. \newblock Efficiently scaling transformer inference. \newblock ArXiv, abs/2211.05102, 2022. \newblock URL https://doi.org/10.48550/arXiv.2211.05102.

[Rae2019CompressiveTF] Rae, J.~W., Potapenko, A., Jayakumar, S.~M., Hillier, C., and Lillicrap, T.~P. \newblock Compressive transformers for long-range sequence modelling. \newblock In 8th International Conference on Learning Representations, 2020. \newblock URL https://openreview.net/forum?id=SylKikSYDH.

[Sakaguchi2019AnAW] Sakaguchi, K., Bras, R.~L., Bhagavatula, C., and Choi, Y. \newblock Winogrande: An adversarial winograd schema challenge at scale. \newblock In The Thirty-Fourth {AAAI Conference on Artificial Intelligence}, 2020. \newblock URL https://doi.org/10.1609/aaai.v34i05.6399.

[Shazeer2019FastTD] Shazeer, N.~M. \newblock Fast transformer decoding: One write-head is all you need. \newblock ArXiv, abs/1911.02150, 2019. \newblock URL http://arxiv.org/abs/1911.02150.

[Sheng2023HighthroughputGI] Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Fu, D.~Y., Xie, Z., Chen, B., Barrett, C.~W., Gonzalez, J., Liang, P., Ré, C., Stoica, I., and Zhang, C. \newblock High-throughput generative inference of large language models with a single gpu. \newblock In International Conference on Machine Learning, 2023. \newblock URL https://proceedings.mlr.press/v202/sheng23a.html.

[touvron2023llama] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C.~C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P.~S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E.~M., Subramanian, R., Tan, X.~E., Tang, B., Taylor, R., Williams, A., Kuan, J.~X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. \newblock {Llama 2}: Open foundation and fine-tuned chat models. \newblock ArXiv, abs/2307.09288, 2023. \newblock URL https://arxiv.org/abs/2307.09288.

[Treviso2022EfficientMF] Treviso, M.~V., Ji, T., Lee, J.-U., van Aken, B., Cao, Q., Ciosici, M.~R., Hassid, M., Heafield, K., Hooker, S., Martins, P.~H., Martins, A. F.~T., Milder, P., Raffel, C., Simpson, E., Slonim, N., Balasubramanian, N., Derczynski, L., and Schwartz, R. \newblock Efficient methods for natural language processing: A survey. \newblock Transactions of the Association for Computational Linguistics, 11, 2022.

[Vaswani2017AttentionIA] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.~N., Kaiser, {\L}., and Polosukhin, I. \newblock Attention is all you need. \newblock Advances in neural information processing systems, 30, 2017.

[Wang2020SpAttenES] Wang, H., Zhang, Z., and Han, S. \newblock Spatten: Efficient sparse attention architecture with cascade token and head pruning. \newblock 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2020. \newblock URL https://doi.org/10.1109/HPCA51647.2021.00018.

[zellers-etal-2019-hellaswag] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. \newblock {H}ella{S}wag: Can a machine really finish your sentence? \newblock In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. \newblock URL https://aclanthology.org/P19-1472.

[Zhang2018AcceleratingNT] Zhang, B., Xiong, D., and Su, J. \newblock Accelerating neural transformer via an average attention network. \newblock In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018. \newblock URL https://aclanthology.org/P18-1166.

[Zhang2023H2OHO] Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z.~A., and Chen, B. \newblock H2o: Heavy-hitter oracle for efficient generative inference of large language models. \newblock In Advances in Neural Information Processing Systems 36, 2023.

[bib1] Ainslie et al. (2023) Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. K. GQA: Training generalized multi-query transformer models from multi-head checkpoints. ArXiv, abs/2305.13245, 2023.

[bib2] Anagnostidis et al. (2023) Anagnostidis, S., Pavllo, D., Biggio, L., Noci, L., Lucchi, A., and Hofmann, T. Dynamic context pruning for efficient and interpretable autoregressive transformers. ArXiv, abs/2305.15805, 2023.

[bib3] Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. ArXiv, abs/1409.0473, 2014.

[bib4] Beltagy et al. (2020) Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. ArXiv, abs/2004.05150, 2020.

[bib5] Bisk et al. (2020) Bisk, Y., Zellers, R., Le bras, R., Gao, J., and Choi, Y. PIQA: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), Apr. 2020.

[bib6] Bolya et al. (2022) Bolya, D., Fu, C.-Y., Dai, X., Zhang, P., Feichtenhofer, C., and Hoffman, J. Token merging: Your ViT but faster. ArXiv, abs/2210.09461, 2022.

[bib7] Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Ponde, H., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D. W., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Babuschkin, I., Balaji, S., Jain, S., Carr, A., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M. M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code. ArXiv, abs/2107.03374, 2021.

[bib8] Child et al. (2019) Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers. ArXiv, abs/1904.10509, 2019.

[bib9] Choromanski et al. (2020) Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlós, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L. J., and Weller, A. Rethinking attention with Performers. ArXiv, abs/2009.14794, 2020.

[bib10] Clark et al. (2019) Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[bib11] Clark et al. (2018) Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try ARC, the AI2 reasoning challenge. ArXiv, abs/1803.05457, 2018.

[bib12] Dao et al. (2022) Dao, T., Fu, D., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35. Curran Associates, Inc., 2022.

[bib13] Fu et al. (2023) Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Re, C. Hungry hungry hippos: Towards language modeling with state space models. In The Eleventh International Conference on Learning Representations, 2023.

[bib14] Ge et al. (2023) Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. Model tells you what to discard: Adaptive kv cache compression for llms. ArXiv, abs/2310.01801, 2023.

[bib15] Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. ArXiv, abs/2312.00752, 2023.

[bib16] Hendrycks et al. (2021) Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021.

[bib17] Holtzman et al. (2019) Holtzman, A., Buys, J., Du, L., Forbes, M., and Choi, Y. The curious case of neural text degeneration. ArXiv, abs/1904.09751, 2019.

[bib18] Jiang et al. (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. Mistral 7B. ArXiv, abs/2310.06825, 2023.

[bib19] Kim, G. and Cho, K. Length-adaptive Transformer: Train once with length drop, use anytime with search. In Annual Meeting of the Association for Computational Linguistics, 2020.

[bib20] Kwon et al. (2023) Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. Proceedings of the 29th Symposium on Operating Systems Principles, 2023.

[bib21] Liu et al. (2018) Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. M. Generating wikipedia by summarizing long sequences. ArXiv, abs/1801.10198, 2018.

[bib22] Liu et al. (2023) Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time. ArXiv, abs/2305.17118, 2023.

[bib23] Mu et al. (2023) Mu, J., Li, X. L., and Goodman, N. D. Learning to compress prompts with gist tokens. ArXiv, abs/2304.08467, 2023.

[bib24] Narayanan et al. (2021) Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V. A., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., and Zaharia, M. A. Efficient large-scale language model training on GPU clusters using Megatron-LM. In SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.

[bib25] Nawrot et al. (2022) Nawrot, P., Chorowski, J., Łańcucki, A., and Ponti, E. Efficient transformers with dynamic token pooling. In Annual Meeting of the Association for Computational Linguistics, 2022.

[bib26] Patterson et al. (2021) Patterson, D. A., Gonzalez, J., Le, Q. V., Liang, C., Munguía, L.-M., Rothchild, D., So, D. R., Texier, M., and Dean, J. Carbon emissions and large neural network training. ArXiv, abs/2104.10350, 2021.

[bib27] Pope et al. (2022) Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Levskaya, A., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently scaling transformer inference. ArXiv, abs/2211.05102, 2022.

[bib28] Rae et al. (2019) Rae, J. W., Potapenko, A., Jayakumar, S. M., and Lillicrap, T. P. Compressive transformers for long-range sequence modelling. ArXiv, abs/1911.05507, 2019.

[bib29] Sakaguchi et al. (2021) Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial winograd schema challenge at scale. Commun. ACM, 64(9), 2021.

[bib30] Shazeer (2019) Shazeer, N. M. Fast transformer decoding: One write-head is all you need. ArXiv, abs/1911.02150, 2019.

[bib31] Sheng et al. (2023) Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Fu, D. Y., Xie, Z., Chen, B., Barrett, C. W., Gonzalez, J., Liang, P., Ré, C., Stoica, I., and Zhang, C. High-throughput generative inference of large language models with a single GPU. In International Conference on Machine Learning, 2023.

[bib32] Touvron et al. (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023.

[bib33] Treviso et al. (2022) Treviso, M. V., Ji, T., Lee, J.-U., van Aken, B., Cao, Q., Ciosici, M. R., Hassid, M., Heafield, K., Hooker, S., Martins, P. H., Martins, A. F. T., Milder, P., Raffel, C., Simpson, E., Slonim, N., Balasubramanian, N., Derczynski, L., and Schwartz, R. Efficient methods for natural language processing: A survey. Transactions of the Association for Computational Linguistics, 11, 2022.

[bib34] Vaswani et al. (2017) Vaswani, A., Shazeer, N. M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Neural Information Processing Systems, 2017.

[bib35] Wang et al. (2021) Wang, H., Zhang, Z., and Han, S. SpAtten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021.

[bib36] Zellers et al. (2019) Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Korhonen, A., Traum, D., and Màrquez, L. (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 2019. Association for Computational Linguistics.

[bib37] Zhang et al. (2018) Zhang, B., Xiong, D., and Su, J. Accelerating neural transformer via an average attention network. ArXiv, abs/1805.00631, 2018.

[bib38] Zhang et al. (2023) Zhang, Z. A., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C. W., Wang, Z., and Chen, B. H2O: Heavy-hitter oracle for efficient generative inference of large language models. ArXiv, abs/2306.14048, 2023.