Memory Bank Compression for Continual Adaptation of Large Language Models

Thomas Katraouras, Dimitrios Rafailidis

Abstract

Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve, their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine-tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank: an external memory module that stores information for future use. However, these methods face a critical limitation: in real-world scenarios with large-scale data streams, the memory bank grows without bound. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key-Value Low-Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question–answering datasets demonstrate that MBC reduces the memory bank size to 0.3% of that of the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.

Introduction

Large Language Models (LLMs) [20, 33] have shown strong performance on a wide range of natural language processing tasks, including machine translation [38], summarization [35], question answering [31], and advanced reasoning [44]. They are now widely employed in many applications such as search engines [47] and personal assistants [4]. However, a major limitation of these models is that they are static [10]. Once trained, their parameters reflect only the data seen during training, and they cannot easily incorporate new knowledge. This leads to the problem of knowledge cutoff, where the model's internal knowledge becomes outdated as new information appears [20, 33].

To address this limitation, Retrieval-Augmented Generation (RAG) strategies have been introduced [14, 40]. A frozen LLM uses a retriever to fetch relevant passages from an external corpus at inference time, providing the model with access to up-to-date information without retraining [14]. However, RAG methods face several challenges. They depend on nearest-neighbor search, which adds computational overhead and latency [40]. The quality of the retrieval affects the LLM performance, and errors in retrieval propagate directly to the generator [46]. Retrievers also often require domain-specific tuning and may struggle to generalize across domains [19]. Furthermore, the retrieved passages are typically concatenated with the query in the input context window, limiting the model's ability to fully utilize the information and creating issues when the combined length exceeds the model's capacity [14]. These issues limit the scalability of RAG for long-term adaptation in streaming environments, which mirror the real-world scenario.

To solve the problem of LLMs' long-term adaptation, continual learning methods have been proposed [30, 43]. In this setting, models are updated as new data arrive. The simplest approach, full fine-tuning, also known as uniform fine-tuning [9] since all tokens are weighted equally, updates all parameters on the new data. While effective for small models, this is computationally expensive for large LLMs and is prone to catastrophic forgetting [18], where performance on previously learned knowledge degrades as the model is optimized on new information. Parameter-efficient fine-tuning (PEFT) methods [41] such as adapters [7, 22], prefix-tuning [15], and LoRA [8] address the computational cost by introducing small trainable modules while keeping most parameters frozen. These approaches reduce the training overhead; however, they still require gradient-based updates at deployment time, which is impractical in streaming scenarios. Additional strategies have been proposed to make updates selective and stable, for example by restricting updates to predefined salient spans [5], by meta-learning token importance weights [9], or by interleaving past and new examples through replay [2, 28]. Despite these refinements, the fundamental limitations persist, namely the need for repeated optimization, substantial computational and latency overhead, and continued vulnerability to catastrophic forgetting [30].

A promising direction for continual learning of LLMs is memory augmentation [6, 21, 39]. Instead of retrieving raw text from an external corpus, information is stored directly in a structured memory module, which can also be updated dynamically. At inference time, the model can draw on these stored representations to adapt its behavior and incorporate new information. This avoids repeated gradient updates and provides a direct connection between the stored knowledge and the model's computations [6]. However, memory augmentation introduces its own challenge: as more documents are processed, the memory bank constantly grows. This increases storage costs and slows down inference, since the model must attend to an ever-growing set of contexts [45]. Expanding memory without disrupting the model's original behavior remains difficult, and as a consequence, memory augmentation strategies require retraining or fine-tuning to remain effective.

More recently, to overcome this shortcoming, memory-augmented frameworks have been proposed that store learned modulation parameters for each document in an external memory bank [32]. These approaches keep the base model frozen, condition the model on the entire memory bank rather than a single retrieved document, and avoid further fine-tuning during adaptation, while mitigating catastrophic forgetting. Nevertheless, in the real-world scenario where the document stream reaches hundreds of thousands or millions of entries, the memory bank grows very large and becomes difficult to manage. This highlights scalability as an open problem in memory-augmented systems, alongside the need to balance adaptation, efficiency, and stability.

In this paper, we propose MBC, a model that compresses the memory bank while maintaining high performance on downstream question-and-answer (QA) tasks. Specifically, we make the following contributions:

· We propose a memory bank compression method based on a codebook optimization strategy, which stores indices into this codebook instead of full document representations. In addition, we introduce an online resetting strategy to prevent codebook collapse and ensure balanced code utilization and stable training.

· We employ Key-Value Low-Rank Adaptation targeted only at the attention layers of the model. In doing so, we improve the proposed model's ability to adapt when new data arrive without requiring full fine-tuning.

We conduct experiments on benchmark QA datasets, comparing our MBC model with baseline methods. Our results demonstrate that MBC significantly reduces the memory bank size to 0.3% of the most competitive baseline's, while improving QA accuracy. Furthermore, our model maintains high retention accuracy when evaluated for catastrophic forgetting in the challenging online adaptation scenario.

The remainder of the paper is structured as follows: Section 2 formulates the problem of online learning in LLMs. Section 3 outlines the MBC model and Section 4 provides the experimental evaluation. Finally, Section 5 concludes the paper, summarizing key findings and discussing potential future directions.

Online Adaptation of LLMs

Let 𝑓 𝜃 𝑏 be a pretrained and outdated language model with parameters 𝜃 𝑏 . During online adaptation, 𝑓 𝜃 𝑏 is continuously updated using a stream of new documents, denoted as 𝐷 𝑡𝑒𝑠𝑡 : = { 𝑑 𝑖 } . The adaptation process yields an updated model 𝑓 ˜ 𝜃 𝑏 [9]. This adapted model is then evaluated on a set of queries 𝑄 𝑡𝑒𝑠𝑡 : = { 𝑞 𝑖 } paired with labels 𝑌 𝑡𝑒𝑠𝑡 : = { 𝑦 𝑖 } . Each query-label pair ( 𝑞 𝑖 , 𝑦 𝑖 ) is assumed to be sampled from a distribution conditioned on the corresponding document 𝑑 𝑖 , i.e., ( 𝑞 𝑖 , 𝑦 𝑖 ) ∼ 𝑝 ( 𝑞,𝑦 | 𝑑 𝑖 ) . For example, in a QA setting, 𝑞 𝑖 may represent a question about information contained in 𝑑 𝑖 , with 𝑦 𝑖 being the correct answer [32]. During adaptation with 𝐷 𝑡𝑒𝑠𝑡 , the related queries 𝑄 𝑡𝑒𝑠𝑡 remain inaccessible to the model. Therefore, the update procedure must be query-agnostic. To this end, we assume access to an auxiliary training set 𝐷 𝑡𝑟𝑎𝑖𝑛 with associated queries 𝑄 𝑡𝑟𝑎𝑖𝑛 and labels 𝑌 𝑡𝑟𝑎𝑖𝑛 , defined in the same way as ( 𝑄 𝑡𝑒𝑠𝑡 , 𝑌 𝑡𝑒𝑠𝑡 ) . This auxiliary set provides examples of query-document relationships and guides the model in updating its parameters while retaining past knowledge and improving on future queries [9]. The training step involving ( 𝐷 𝑡𝑟𝑎𝑖𝑛 , 𝑄 𝑡𝑟𝑎𝑖𝑛 , 𝑌 𝑡𝑟𝑎𝑖𝑛 ) is the learning phase. The subsequent process of updating 𝑓 𝜃 𝑏 using the test stream 𝐷 𝑡𝑒𝑠𝑡 , without access to queries, is the online adaptation phase [32].
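The two-phase protocol above can be sketched as follows. The function and dictionary names are our own illustrative stand-ins, and model updates are abstracted to simple bookkeeping; the key point is that the learning phase sees query-document-label triples, while the online adaptation phase is query-agnostic and sees only documents.

```python
# Hypothetical sketch of the two-phase protocol (not the authors' API).
def learning_phase(model, d_train, q_train, y_train):
    # Learning phase: (document, query, label) triples are available and the
    # trainable components are optimized (abstracted to bookkeeping here).
    for triple in zip(d_train, q_train, y_train):
        model["seen_triples"].append(triple)
    return model

def online_adaptation(model, d_test):
    # Online adaptation phase: query-agnostic -- only documents arrive.
    for d in d_test:
        model["memory"].append(d)
    return model

model = {"seen_triples": [], "memory": []}
model = learning_phase(model, ["doc_a"], ["q_a"], ["y_a"])
model = online_adaptation(model, ["doc_b", "doc_c"])
```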

Proposed Model

Memory of Amortized Contexts

The proposed model has three core components: (i) an amortization network that encodes documents, (ii) a memory bank that stores the encoded information, and (iii) an aggregation network that synthesizes the stored information to answer a given query.

Amortization Network. The amortization network is responsible for mapping each document into a compact latent representation that can be efficiently stored and retrieved [32]. Formally, this network is denoted 𝑔 𝜃 𝑎𝑚𝑜𝑟𝑡 , implemented with a T5 encoder-decoder model [24]. Given a document 𝑑 𝑖 , the amortization network produces a continuous latent representation 𝜙 𝑖 : = 𝑔 𝜃 amort ( 𝑑 𝑖 ) ∈ R 𝑇 × 𝐷 , where 𝑇 is the number of tokens in the representation and 𝐷 is the hidden dimension of the base model.

Memory Bank. The context vectors 𝜙 𝑖 generated from the document stream are stored in an external memory bank M : = { 𝜙 𝑖 | 𝑑 𝑖 ∈ 𝐷 𝑡𝑒𝑠𝑡 } [32]. This bank serves as a growing knowledge base for the base LLM.

Aggregation Network. When a query 𝑞 𝑖 is presented at test time, the model must retrieve relevant information from the memory bank. An aggregation network, denoted ℎ 𝜓 , is trained to perform this function dynamically. It takes the entire memory bank M and an encoded representation of the current query 𝑔 𝜃 𝑖𝑛𝑝𝑢𝑡 ( 𝑞 𝑖 ) as input, where 𝜃 𝑖𝑛𝑝𝑢𝑡 uses the same architecture as 𝜃 𝑎𝑚𝑜𝑟𝑡 . The network is permutation-invariant with respect to the ordering of M . Using a cross-attention mechanism [11, 36], it synthesizes the stored context vectors into a single, query-specific modulation 𝜙 ∗ 𝑖 : = ℎ 𝜓 ( 𝑔 𝜃 𝑖𝑛𝑝𝑢𝑡 ( 𝑞 𝑖 ) , M) . This modulation 𝜙 ∗ 𝑖 acts as a set of soft prompts, injected as learnable prefixes into the key-value matrices of each self-attention layer of the base LLM via P-tuning v2 [17]. Formally, the modulated base LLM can be expressed as 𝑓 𝜙 ∗ 𝑖 𝜃 𝑏 ( 𝑞 𝑖 ) : = 𝑓 𝜃 𝑏 ( 𝑞 𝑖 ; { 𝐾 ′ ℓ , 𝑉 ′ ℓ } 𝐿 ℓ = 1 ) , with 𝐾 ′ ℓ , 𝑉 ′ ℓ denoting the modified key and value matrices in layer ℓ after prefixing with 𝜙 ∗ 𝑖 . To efficiently handle large memory banks at inference time, a hierarchical modulation aggregation strategy is used. In particular, the context set is first partitioned into smaller subgroups, each aggregated individually, and the resulting representations are recursively combined until a single final modulation is obtained. This divide-and-conquer procedure reduces the memory complexity to O( 𝑀𝑇 ) , where 𝑀 is a hyperparameter and 𝑇 the number of tokens, ensuring scalability even as the number of stored documents increases [32].
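The hierarchical aggregation can be sketched as follows. As an assumption for illustration, a plain mean stands in for the cross-attention aggregator ℎ 𝜓 (which also attends with the encoded query), and all sizes are arbitrary:

```python
import numpy as np

def aggregate(group):
    # Stand-in for the cross-attention aggregator h_psi: a mean over the
    # subgroup; the real network additionally conditions on the query.
    return np.mean(group, axis=0)

def hierarchical_aggregate(contexts, M=4):
    # Divide-and-conquer aggregation: partition the context set into
    # subgroups of at most M entries, aggregate each subgroup, and recurse
    # until a single (T, D) modulation remains.
    while len(contexts) > 1:
        contexts = [aggregate(contexts[i:i + M])
                    for i in range(0, len(contexts), M)]
    return contexts[0]

T, D = 24, 8                                     # illustrative sizes
bank = [np.ones((T, D)) * k for k in range(37)]  # 37 stored contexts
phi_star = hierarchical_aggregate(bank, M=4)     # shape (T, D)
```

Because each level only holds groups of at most M contexts with T tokens each, the working set stays O(MT) regardless of how many documents have been stored.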


Codebook Optimization for Memory Bank Compression

A critical limitation of storing continuous context vectors { 𝜙 𝑖 } is that the memory bank M can grow very large as the document stream increases. To address this, a Vector Quantised-Variational AutoEncoder (VQ-VAE)-style [34] quantization module is introduced to compress the memory. Instead of storing the high-dimensional continuous vector 𝜙 𝑖 , each context vector is mapped to its nearest entry in a learned finite codebook 𝐸 ∈ R 𝑁 𝑐 × 𝐷 , where 𝑁 𝑐 denotes the number of code vectors: 𝑐 𝑖 : = argmin 𝑗 ∥ 𝜙 𝑖 - 𝐸 𝑗 ∥ 2 2 . The selected codebook vector is denoted by ˆ 𝜙 hard 𝑖 : = 𝐸 𝑐 𝑖 , and only the integer index 𝑐 𝑖 is stored, resulting in a compressed memory bank representation M VQ : = { 𝑐 𝑖 } . As ˆ 𝜙 hard 𝑖 is discrete and non-differentiable, during the forward pass a straight-through estimator (STE) [1] is used to define ˆ 𝜙 𝑖 : = 𝜙 𝑖 + sg [ ˆ 𝜙 hard 𝑖 - 𝜙 𝑖 ] , where sg [·] denotes the stop-gradient operator. Thus, ˆ 𝜙 𝑖 takes the value of ˆ 𝜙 hard 𝑖 , while still allowing gradients to flow back into 𝜙 𝑖 . In subsequent notations, ˆ 𝜙 𝑖 denotes the differentiable forward-pass representation and ˆ 𝜙 hard 𝑖 is used only in the vector quantization loss. The effective size of the memory bank is therefore reduced to the set of indices together with the codebook 𝐸 . The codebook itself is optimized during end-to-end training. During inference, the stored indices are used to retrieve their corresponding quantized vectors { ˆ 𝜙 𝑖 } , which are then aggregated by ℎ 𝜓 together with the query representation to produce the modulation ˆ 𝜙 ∗ 𝑖 .
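A minimal NumPy sketch of the quantization step with a straight-through estimator follows. The codebook values and shapes are illustrative, and since NumPy has no autograd, the stop-gradient wrapping is only indicated in comments:

```python
import numpy as np

rng = np.random.default_rng(0)
N_c, T, D = 512, 24, 8                 # illustrative codebook/context sizes
E = rng.standard_normal((N_c, D))      # learned codebook

def quantize(phi):
    # Nearest-neighbour assignment per token: c = argmin_j ||phi_t - E_j||^2
    d2 = ((phi[:, None, :] - E[None, :, :]) ** 2).sum(axis=-1)   # (T, N_c)
    codes = d2.argmin(axis=1)          # integer indices -- all that is stored
    phi_hard = E[codes]                # decoded quantized vectors
    # Straight-through estimator: the forward value equals phi_hard; in an
    # autograd framework the (phi_hard - phi) term is wrapped in
    # stop-gradient so gradients flow back into phi.
    phi_st = phi + (phi_hard - phi)
    return codes, phi_st

phi = rng.standard_normal((T, D))      # one continuous context vector
codes, phi_hat = quantize(phi)
```

Storing T integers per document instead of T×D floats is the source of the compression; the codebook E is shared across all documents.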

Online Codebook Resetting

To prevent underutilization and codebook collapse, where only a small subset of codes is repeatedly used, the codebook is updated during training using an exponential moving average (EMA) of code usage. For a mini-batch of size 𝐾 , let

$$
u_j \leftarrow \gamma\, u_j + (1 - \gamma)\, \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\!\left[\, c_k = j \,\right],
$$

where 𝑢 𝑗 tracks smoothed usage and 𝛾 ∈ ( 0 , 1 ) is a decay rate hyperparameter. Codes with usage below a hyperparameter threshold 𝜖 are marked as inactive: 𝐼 dead : = { 𝑗 | 𝑢 𝑗 < 𝜖 } . When 𝐼 dead ≠ ∅ , up to | 𝐼 dead | distinct encoder outputs are sampled from the current batch, { 𝜙 𝑠 ( 𝑗 ) } , to reinitialize the corresponding codebook vectors:

$$
E_j \leftarrow \phi_{s(j)}, \qquad u_j \leftarrow \bar{u}, \qquad \forall\, j \in I_{\text{dead}},
$$

where 𝑠 ( 𝑗 ) denotes indices sampled uniformly at random without replacement from the batch, and ¯ 𝑢 denotes the mean usage across all codes, ensuring that reinitialized entries retain a nonnegligible prior usage estimate. This procedure, applied only during training (without gradients), maintains codebook diversity and prevents collapse, as we will experimentally show in Section 4.4.4.
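The EMA usage tracking and dead-code resetting can be sketched as follows, with tiny illustrative sizes and NumPy standing in for the training loop:

```python
import numpy as np

rng = np.random.default_rng(1)
N_c, D, gamma, eps = 8, 4, 0.99, 1e-4   # tiny illustrative codebook
E = rng.standard_normal((N_c, D))

def ema_usage_reset(E, u, batch_phi, batch_codes):
    # EMA usage update over a mini-batch of K assignments:
    # u_j <- gamma * u_j + (1 - gamma) * (batch frequency of code j)
    K = len(batch_codes)
    freq = np.bincount(batch_codes, minlength=len(u)) / K
    u = gamma * u + (1.0 - gamma) * freq
    # Reset "dead" codes (usage below eps) to sampled encoder outputs and
    # give them the mean usage so they are not immediately reset again.
    dead = np.flatnonzero(u < eps)
    if dead.size:
        picks = rng.choice(len(batch_phi), size=dead.size, replace=False)
        E[dead] = batch_phi[picks]
        u[dead] = u.mean()
    return E, u

u = np.full(N_c, 0.1)
u[3] = 0.0                                # code 3 has fallen out of use
batch_phi = rng.standard_normal((16, D))  # encoder outputs in the batch
batch_codes = rng.integers(0, N_c, size=16)
batch_codes[batch_codes == 3] = 0         # the batch never selects code 3
E, u = ema_usage_reset(E, u, batch_phi, batch_codes)
```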

Key-Value Low-Rank Adaptation for Modulation Adaptation

The modulation ˆ 𝜙 ∗ 𝑖 is injected into the key-value pairs of each self-attention layer of the base LLM. Instead of keeping the base LLM entirely frozen, as in [32], we introduce lightweight Low-Rank Adaptation (LoRA) [8] modules specifically into the key and value projections (KV-LoRA). Here, 𝜃 𝑏 denotes the frozen parameters of the pretrained model 𝑓 𝜃 𝑏 . A small set of trainable parameters 𝜃 KV-LoRA is added, parameterizing low-rank updates to the modulated key and value matrices:

$$
\tilde{K}_\ell := K'_\ell + A_{K,\ell}\, B_{K,\ell}, \qquad \tilde{V}_\ell := V'_\ell + A_{V,\ell}\, B_{V,\ell},
$$

where 𝐴 𝐾,ℓ , 𝐴 𝑉,ℓ ∈ R 𝐷 × 𝑟 and 𝐵 𝐾,ℓ , 𝐵 𝑉,ℓ ∈ R 𝑟 × 𝐷 are low-rank factors with rank 𝑟 ≪ 𝐷 . In practice, the updates are scaled by a factor 𝛼 / 𝑟 and regularized with dropout. KV-LoRA is applied only to the final 𝑛 𝑙𝑜𝑟𝑎 transformer layers, balancing computational efficiency with adaptation capacity. 𝑟 , 𝛼 , 𝑛 𝑙𝑜𝑟𝑎 , and the dropout probability 𝜌 are hyperparameters. The adapted model thus has parameters ˜ 𝜃 𝑏 : = 𝜃 𝑏 ∪ 𝜃 KV-LoRA , which preserves pretrained knowledge while allowing the attention mechanism to more effectively exploit the modulation ˆ 𝜙 ∗ 𝑖 . Formally, the modulated base LLM is expressed as:

$$
f_{\tilde{\theta}_b}^{\hat{\phi}^{*}_i}(q_i) := f_{\tilde{\theta}_b}\!\left( q_i ;\; \{ \tilde{K}_\ell, \tilde{V}_\ell \}_{\ell=1}^{L} \right).
$$
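A sketch of the KV-LoRA update under the stated shapes follows. The additive residual form below is our reading of the update (an assumption), with the α/r scaling the text mentions, and B initialized to zero so the adapted matrices start exactly at the frozen ones:

```python
import numpy as np

rng = np.random.default_rng(2)
D, r, alpha = 16, 4, 8                  # hidden dim, rank, scaling (illustrative)

# Frozen, modulation-prefixed key/value matrices of one attention layer
K_prime = rng.standard_normal((D, D))
V_prime = rng.standard_normal((D, D))

# Trainable low-rank factors: A in R^{D x r}, B in R^{r x D};
# zero-initializing B makes the initial update a no-op, as in standard LoRA.
A_K, B_K = rng.standard_normal((D, r)), np.zeros((r, D))
A_V, B_V = rng.standard_normal((D, r)), np.zeros((r, D))

def kv_lora(K_p, V_p):
    # Low-rank residual updates scaled by alpha / r
    K_new = K_p + (alpha / r) * (A_K @ B_K)
    V_new = V_p + (alpha / r) * (A_V @ B_V)
    return K_new, V_new

K_new, V_new = kv_lora(K_prime, V_prime)
# Trainable parameters per layer: 4*D*r = 256 here, versus 2*D*D = 512 if
# the full K and V matrices were fine-tuned; the gap widens as D grows.
```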

End-to-End Training Objective

The entire architecture is trained end-to-end, including the amortization and aggregation networks, the VQ codebook, and the KV-LoRA modules. The primary objective is a question-answering (QA) loss L QA, defined as the negative log-likelihood of predicting the target sequence 𝑦 𝑖 conditioned on the query 𝑞 𝑖 , the corresponding document 𝑑 𝑖 , and its modulation ˆ 𝜙 ∗ 𝑖 . Answer generation is carried out by the base LLM 𝑓 ˆ 𝜙 ∗ 𝑖 ˜ 𝜃 𝑏 ( 𝑞 𝑖 ) : L QA : = -log 𝑝 ˆ 𝜙 ∗ 𝑖 ˜ 𝜃 𝑏 ( 𝑦 𝑖 | 𝑞 𝑖 , 𝑑 𝑖 ) . To enforce a compact and discrete memory representation, a vector quantization loss L VQ is used. Given a continuous context vector 𝜙 𝑖 ∈ R 𝑇 × 𝐷 , its nearest codebook entry is denoted by ˆ 𝜙 hard 𝑖 : = 𝐸 𝑐 𝑖 . The quantization loss is defined using ˆ 𝜙 hard 𝑖 to ensure proper updates of both the codebook and encoder:

$$
\mathcal{L}_{\mathrm{VQ}} := \left\| \operatorname{sg}[\phi_i] - \hat{\phi}^{\mathrm{hard}}_i \right\|_2^2 + \beta \left\| \phi_i - \operatorname{sg}\!\left[\hat{\phi}^{\mathrm{hard}}_i\right] \right\|_2^2,
$$

where 𝛽 > 0 is the commitment cost hyperparameter. The first term updates the selected codebook entries 𝐸 𝑐 𝑖 , while the second one encourages encoder outputs 𝜙 𝑖 to remain close to their quantized assignments. The final objective is a weighted combination:

L total = L QA + 𝜆 VQ L VQ, where 𝜆 VQ is a hyperparameter that balances the influence of the quantization loss. During training, the parameters of the amortization network ( 𝜃 amort ), input encoder ( 𝜃 input ), aggregation network ( 𝜓 ), KV-LoRA modules ( 𝜃 KV-LoRA ), and the codebook ( 𝐸 ) are optimized end-to-end, while the base model parameters 𝜃 𝑏 remain frozen:

$$
\min_{\theta_{\mathrm{amort}},\, \theta_{\mathrm{input}},\, \psi,\, \theta_{\text{KV-LoRA}},\, E} \;\; \frac{1}{K} \sum_{k=1}^{K} \left( \mathcal{L}_{\mathrm{QA}}^{(k)} + \lambda_{\mathrm{VQ}}\, \mathcal{L}_{\mathrm{VQ}}^{(k)} \right),
$$

where K is the batch size. Optimization is performed using the Adam [12] optimizer. An overview of the MBC end-to-end optimization algorithm is presented in Algorithm 1.
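The combined objective can be sketched numerically as follows. Since NumPy has no autograd, the stop-gradient operator sg[·] is a no-op here and the two VQ terms coincide in value; in a real framework they differ only in where gradients flow (codebook vs. encoder):

```python
import numpy as np

def vq_loss(phi, phi_hard, beta=0.25):
    # L_VQ = ||sg[phi] - phi_hard||^2 + beta * ||phi - sg[phi_hard]||^2.
    # Numerically both terms are the same squared distance; sg[.] only
    # changes gradient routing, which plain numpy cannot express.
    sq = float(np.sum((phi - phi_hard) ** 2))
    return sq + beta * sq

def total_loss(l_qa, phi, phi_hard, beta=0.25, lam_vq=1.0):
    # L_total = L_QA + lambda_VQ * L_VQ
    return l_qa + lam_vq * vq_loss(phi, phi_hard, beta)

phi = np.ones((2, 2))
phi_hard = np.zeros((2, 2))
loss = total_loss(2.0, phi, phi_hard)   # 2.0 + (4.0 + 0.25 * 4.0) = 7.0
```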

Online Adaptation of MBC

After training is completed, the online adaptation phase follows. This phase requires no gradient-based updates and operates entirely through forward passes. The procedure consists of two components:

Memorization. For each new document 𝑑 𝑖 arriving in the test stream 𝐷 test , the amortization network 𝑔 𝜃 amort encodes 𝑑 𝑖 into a context vector 𝜙 𝑖 , which is subsequently quantized to the nearest codebook entry in 𝐸 . The resulting discrete code 𝑐 𝑖 is stored in the compressed memory bank M VQ.

Inference. When a query 𝑞 𝑖 is received, the model uses the stored codes { 𝑐 𝑗 } from the memory bank and retrieves the corresponding quantized context vectors { ˆ 𝜙 𝑗 } from the codebook 𝐸 . These vectors are aggregated by ℎ 𝜓 together with the query representation 𝑔 𝜃 input ( 𝑞 𝑖 ) to produce a query-specific modulation ˆ 𝜙 ∗ 𝑖 . This modulation conditions the KV-LoRA-augmented base LLM 𝑓 ˜ 𝜃 𝑏 ( 𝑞 𝑖 ) , which then generates the final answer. The online adaptation and evaluation procedure is presented in Algorithm 2.
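The memorization and inference loop can be sketched as follows. The mean-based fusion and the vector "documents" are illustrative stand-ins for the aggregation network, the T5 encoder, and the generation step; what is faithful to the method is that adaptation stores only integer codes and uses forward passes only:

```python
import numpy as np

rng = np.random.default_rng(3)
N_c, D = 512, 8
E = rng.standard_normal((N_c, D))    # frozen codebook learned offline
memory_vq = []                        # compressed bank: one int per document

def memorize(phi):
    # Store only the index of the nearest codebook entry (no gradients)
    code = int(((phi[None, :] - E) ** 2).sum(axis=-1).argmin())
    memory_vq.append(code)

def answer(query_vec):
    # Decode the stored codes and fuse them with the query; the mean and
    # the addition below are stand-ins for the cross-attention aggregator
    # and the KV-LoRA-augmented generation step.
    contexts = E[memory_vq]           # (num_docs, D)
    modulation = contexts.mean(axis=0)
    return modulation + query_vec     # placeholder for decoding an answer

for _ in range(100):                  # stream of 100 "documents"
    memorize(rng.standard_normal(D))  # stand-in for the amortization network
out = answer(np.zeros(D))
```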

| Model (# params) | Method | StreamingQA EM (↑) | StreamingQA F1 (↑) | SQuAD EM (↑) | SQuAD F1 (↑) | ArchivalQA EM (↑) | ArchivalQA F1 (↑) |
|---|---|---|---|---|---|---|---|
| DistilGPT2 (82M) | Uniform | 1.62 | 2.97 | 1.34 | 2.78 | 4.01 | 3.69 |
| DistilGPT2 (82M) | Salient Spans | 1.62 | 4.33 | 1.31 | 2.50 | 4.08 | 3.98 |
| DistilGPT2 (82M) | CaMeLS | 1.86 | 4.38 | 1.43 | 3.06 | 4.11 | 5.99 |
| DistilGPT2 (82M) | MAC | 3.48 | 8.11 | 1.90 | 5.00 | 5.99 | 8.87 |
| DistilGPT2 (82M) | MBC (Ours) | 3.96 (↑13.8%) | 8.76 (↑8%) | 2.10 (↑10.5%) | 5.36 (↑7.2%) | 6.61 (↑10.4%) | 9.27 (↑4.5%) |
| GPT2-Large (774M) | Uniform | 4.14 | 8.08 | 3.37 | 5.62 | 8.03 | 6.63 |
| GPT2-Large (774M) | Salient Spans | 4.26 | 8.53 | 4.38 | 6.79 | 9.75 | 7.23 |
| GPT2-Large (774M) | CaMeLS | 5.48 | 10.31 | 4.45 | 7.60 | 9.28 | 9.18 |
| GPT2-Large (774M) | MAC | 6.12 | 11.44 | 6.14 | 9.75 | 10.95 | 12.15 |
| GPT2-Large (774M) | MBC (Ours) | 7.43 (↑21.4%) | 12.77 (↑11.6%) | 6.99 (↑13.8%) | 10.88 (↑11.6%) | 12.03 (↑9.9%) | 13.68 (↑12.6%) |
| GPT2-XL (1.5B) | Uniform | 5.16 | 9.14 | 5.87 | 7.87 | 9.89 | 10.46 |
| GPT2-XL (1.5B) | Salient Spans | 5.46 | 11.32 | 5.66 | 8.69 | 10.44 | 13.68 |
| GPT2-XL (1.5B) | CaMeLS | 6.98 | 11.23 | 6.17 | 9.93 | 11.48 | 14.01 |
| GPT2-XL (1.5B) | MAC | 7.14 | 12.01 | 6.89 | 10.12 | 11.48 | 15.52 |
| GPT2-XL (1.5B) | MBC (Ours) | 7.49 (↑4.9%) | 12.77 (↑6.3%) | 7.40 (↑7.4%) | 11.96 (↑18.2%) | 12.34 (↑7.5%) | 15.93 (↑2.6%) |
| LLaMA-2 (7B) | Uniform | 11.76 | 12.53 | 12.78 | 16.62 | 17.89 | 20.01 |
| LLaMA-2 (7B) | Salient Spans | 12.12 | 18.65 | 13.32 | 18.09 | 18.45 | 22.21 |
| LLaMA-2 (7B) | CaMeLS* | N/A | N/A | N/A | N/A | N/A | N/A |
| LLaMA-2 (7B) | MAC | 14.01 | 20.44 | 13.33 | 18.17 | 19.58 | 23.89 |
| LLaMA-2 (7B) | MBC (Ours) | 16.04 (↑14.5%) | 25.33 (↑23.9%) | 14.93 (↑12%) | 22.15 (↑21.9%) | 22.71 (↑16%) | 28.66 (↑19.9%) |
| Method | DistilGPT2 | GPT2-Large | GPT2-XL | LLaMA-2-7B |
|---|---|---|---|---|
| MAC | 197M | 927M | 1.72B | 2.36B |
| MBC | 197.6M (+0.31%) | 929.6M (+0.28%) | 1.723B (+0.19%) | 2.371B (+0.45%) |
Memory bank footprint (MB) / retention accuracy (%) as documents are added during online adaptation.

DistilGPT2:

| # of Docs | StreamingQA MAC | StreamingQA MBC | SQuAD MAC | SQuAD MBC | ArchivalQA MAC | ArchivalQA MBC |
|---|---|---|---|---|---|---|
| 200 | 8.21/100 | 1.52/100 | 38.1/100 | 1.60/100 | 10.91/100 | 1.53/100 |
| 400 | 16.42/99.5 | 1.54/99 | 76.19/99.5 | 1.70/99 | 21.82/99.5 | 1.56/99 |
| 600 | 24.63/99 | 1.56/99 | 114.29/99 | 1.80/99 | 32.72/99.5 | 1.59/99.5 |
| 800 | 32.84/99 | 1.59/99 | 152.39/99 | 1.90/98 | 43.63/99 | 1.62/99 |
| 1000 | 41.04/98 | 1.61/98 | 190.48/99 | 1.99/97.5 | 54.54/99 | 1.64/98.5 |
| 1200 | 49.25/98 | 1.63/98 | 228.58/98.5 | 2.09/98 | 65.45/98.5 | 1.67/98 |
| 1400 | 57.46/98 | 1.65/98.5 | 266.67/98 | 2.19/98.5 | 76.35/98.5 | 1.70/98.5 |
| 1600 | 65.67/98 | 1.67/98 | 304.77/98 | 2.29/98.5 | 87.26/98 | 1.73/98 |

GPT2-Large:

| # of Docs | StreamingQA MAC | StreamingQA MBC | SQuAD MAC | SQuAD MBC | ArchivalQA MAC | ArchivalQA MBC |
|---|---|---|---|---|---|---|
| 200 | 27.36/100 | 2.54/100 | 126.99/100 | 2.7/100 | 36.36/100 | 2.56/100 |
| 400 | 54.73/99 | 2.59/99 | 253.98/99 | 2.9/99.5 | 72.72/99 | 2.61/99 |
| 600 | 82.09/99 | 2.63/99 | 380.96/98.5 | 3.1/99.5 | 109.07/99 | 2.67/99 |
| 800 | 109.46/98 | 2.67/99 | 507.95/98 | 3.3/99 | 145.43/98.8 | 2.73/99 |
| 1000 | 136.82/97.5 | 2.71/98.5 | 634.94/97.5 | 3.49/98.5 | 181.79/97.5 | 2.78/98.8 |
| 1200 | 164.18/97 | 2.76/98.5 | 761.93/97 | 3.69/98 | 218.15/97.5 | 2.84/98.3 |
| 1400 | 191.55/96.5 | 2.8/97 | 888.91/96.5 | 3.89/96.5 | 254.5/96.5 | 2.89/97.6 |
| 1600 | 218.91/97 | 2.84/97 | 1015.9/96 | 4.09/97 | 290.86/97 | 2.95/97.2 |

Experimental Evaluation

Datasets

Following [9, 32], we evaluate the examined models on three QA datasets:

StreamingQA [16]. It contains questions created by annotators or generated with language models. Questions are based on timestamped English WMT news articles (2007-2020), which are also included in the dataset. Following prior setups, we use 21K training, 1.7K validation, and 5K test questions, along with the same number of documents. For QA pre-training baselines, 40K training and 4K validation questions are used.

SQuAD [25]. The Stanford Question Answering Dataset (SQuAD) includes crowdsourced questions on Wikipedia, where answers are spans within the article. Following prior setups, we use 39.9K training, 5.6K validation, and 10.6K test questions, with 8.6K training, 1.2K validation, and 2.1K test documents, respectively. For QA pre-training baselines, 40K training and 2.1K validation questions are used.

ArchivalQA [37]. It is built from New York Times Annotated Corpus articles [26] with questions generated using language models. Answers are text spans within the articles.


Evaluation Protocol

We follow the training configuration of prior works [9, 32] for fair comparison across baselines. For each dataset, the model is adapted using 1,665 documents sampled from the test stream 𝐷 test , after which its performance is evaluated on QA pairs drawn from the same documents. We report Exact Match (EM) and token-level F1 scores as evaluation metrics.

The EM score measures the fraction of predictions that exactly match the ground-truth answer after normalization (lowercasing, punctuation and article removal, collapsing multiple spaces into one):

$$
\mathrm{EM} := \frac{1}{I} \sum_{i=1}^{I} \mathbb{1}\!\left[\, \operatorname{norm}(\hat{y}_i) = \operatorname{norm}(y_i) \,\right],
$$

where 𝐼 is the number of QA pairs, ˆ 𝑦 𝑖 is the predicted answer, and 𝑦 𝑖 is the ground truth.
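A sketch of the EM computation in the spirit of the standard SQuAD normalization follows; the helper names are our own:

```python
import re
import string

def normalize(s):
    # Lowercase, drop punctuation, remove articles, collapse whitespace
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(preds, golds):
    # Fraction of predictions equal to the gold answer after normalization
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)

em = exact_match(["The Eiffel Tower!", "1998"], ["eiffel tower", "1997"])
# one of the two predictions matches after normalization -> em == 0.5
```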

The token-level F1 score measures the harmonic mean of precision and recall at the token level:

$$
\mathrm{P}_i := \frac{\left|\operatorname{tok}(\hat{y}_i) \cap \operatorname{tok}(y_i)\right|}{\left|\operatorname{tok}(\hat{y}_i)\right|},
$$

$$
\mathrm{R}_i := \frac{\left|\operatorname{tok}(\hat{y}_i) \cap \operatorname{tok}(y_i)\right|}{\left|\operatorname{tok}(y_i)\right|},
$$

$$
\mathrm{F1} := \frac{1}{I} \sum_{i=1}^{I} \frac{2\, \mathrm{P}_i\, \mathrm{R}_i}{\mathrm{P}_i + \mathrm{R}_i},
$$

where tok (·) denotes the tokenized representation of the answer.
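A matching sketch of the token-level F1, counting overlap with multiplicity as in SQuAD-style evaluation; answers are assumed to be normalized beforehand:

```python
from collections import Counter

def token_f1(pred, gold):
    # Token-level overlap counted with multiplicity via multiset intersection
    p_toks, g_toks = pred.split(), gold.split()
    overlap = sum((Counter(p_toks) & Counter(g_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p_toks)
    recall = overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

score = token_f1("the eiffel tower", "eiffel tower")  # P=2/3, R=1 -> 0.8
```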

Experimental Setup

4.3.1 Implementation Details. We evaluate MBC using four backbone LLMs: the GPT-2 family (DistilGPT2 [27], GPT2-Large [23], GPT2-XL [23]) and LLaMA-2-7B [33], with 82M, 774M, 1.5B, and 7B parameters, respectively. The amortization network 𝑔 𝜃 amort is based on T5 [24], using T5-Small for DistilGPT2, T5-Base for GPT2-Large, and T5-Large for GPT2-XL and LLaMA-2-7B. The input encoder 𝑔 𝜃 input uses T5-Small for DistilGPT2 and T5-Base for the rest. The amortization network outputs 𝑇 = 12 tokens for DistilGPT2 and 24 for the rest. The aggregation network ℎ 𝜓 consists of four cross-attention blocks [11, 36], where 𝑔 𝜃 input ( 𝑞 𝑖 ) provides the initial query, the memory bank M VQ provides keys and values, and subsequent blocks take the previous output as input, producing ˆ 𝜙 ∗ 𝑖 . Training runs for 50 epochs with the Adam [12] optimizer. The learning rate is linearly warmed up for the first 1% of total steps and then kept constant at 10 -5 . Validation is performed after each epoch. We use a batch size of 64 for DistilGPT2 and 32 for the rest, with gradient accumulation. For models above 1B parameters, dropout with probability 𝜌 back = 0 . 75 is applied during backpropagation; that is, gradients are computed only for a random subset of documents per batch, while the rest use stop-gradient [32]. LLaMA-2-7B is trained with 4-bit quantization [3] for both the model and the amortization network. The codebook size is fixed to 𝑁 𝑐 = 512, with VQ commitment cost 𝛽 commit = 0 . 25 and weight 𝜆 VQ = 1 . 0. Codebook resetting uses EMA decay rate 𝛾 = 0 . 99 and reset threshold 𝜖 = 10 -4 . For KV-LoRA, DistilGPT2 uses 𝑟 = 16, 𝛼 = 32, 𝜌 = 0 . 05, applied to the last 𝑛 lora = 6 layers. GPT2-Large, GPT2-XL and LLaMA-2-7B use 𝑟 = 32, 𝛼 = 64, 𝜌 = 0 . 05, applied to the last 𝑛 lora = 16 layers. For the GPT-2 family, we share the LoRA down-projection matrix across 𝐾 and 𝑉 , i.e., 𝐴 𝐾,ℓ = 𝐴 𝑉,ℓ . All experiments are conducted on a single NVIDIA A100 80GB GPU.
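For reference, the hyperparameters reported in this subsection can be collected into a single configuration; the dictionary layout and key names below are our own, for illustration only:

```python
# Reported MBC hyperparameters gathered into one config (illustrative layout).
MBC_CONFIG = {
    "codebook": {"N_c": 512, "beta_commit": 0.25, "lambda_vq": 1.0},
    "reset": {"gamma": 0.99, "epsilon": 1e-4},
    "optim": {"optimizer": "Adam", "epochs": 50, "lr": 1e-5,
              "warmup_frac": 0.01},
    "kv_lora": {
        "distilgpt2": {"r": 16, "alpha": 32, "dropout": 0.05, "n_lora": 6},
        # GPT2-Large, GPT2-XL, and LLaMA-2-7B share these settings
        "larger_backbones": {"r": 32, "alpha": 64, "dropout": 0.05,
                             "n_lora": 16},
    },
}
```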


Examined Models

· Uniform Fine-Tuning 1 : A baseline approach where all tokens in the new documents are treated equally during model updates.

· Salient Spans 1 [5]: A heuristic-based method that fine-tunes only on tokens within pre-identified salient spans, ignoring the rest.

· CaMeLS 1 [9]: Context-aware Meta-learned Loss Scaling, which uses a meta-trained network to assign importance weights to tokens during fine-tuning, focusing learning on the most informative content.

· MAC 2 [32]: Memory of Amortized Contexts, an online adaptation framework that freezes the base model and uses a meta-learned network to encode documents into compact modulations stored in a memory bank. An aggregation module retrieves and combines the modulations with the query, without requiring gradient updates during inference.

· MBC 3 : The proposed model.

For fair comparison, we retrained all baselines. For Uniform Fine-Tuning, Salient Spans and CaMeLS, we followed the configuration of [9]. Each pretrained LLM is first fine-tuned on QA pairs to obtain a task-adapted base model. During this pretraining, an inner batch of 6 document-query-label triples is used, and outer-loop gradients are accumulated over 24 examples, split into 4 batches of 6. Subsequently, the base model undergoes online adaptation, where it is updated on a stream of documents. The learning rate for each base-strategy combination is selected via a hyperparameter sweep on the validation set over { 10 -4 , 2 . 5 × 10 -5 , 6 . 25 × 10 -6 , 1 . 625 × 10 -6 } . The best learning rate for Uniform and Salient Spans is mostly 1 . 625 × 10 -6 , while for CaMeLS it is 2 . 5 × 10 -5 . Adam is used in most cases, and Adafactor [29] is used for large models. For MAC, we followed the configuration of [32]. Specifically, the amortization network uses T5-Small (12 output tokens) for DistilGPT2, T5-Base (24 output tokens) for GPT2-Large, and T5-Large (24 output tokens) for GPT2-XL and LLaMA-2-7B, while the input encoder uses T5-Small for DistilGPT2 and T5-Base for the rest. The aggregation network consists of four cross-attention blocks. Training is performed for 50 epochs with Adam and a constant learning rate of 10 -5 after a one-epoch warm-up, using a batch size of 64 for DistilGPT2 and 32 for the rest with gradient accumulation. For backbones exceeding 1B parameters, backpropagation dropout with probability 𝜌 back = 0 . 75 is applied, and LLaMA-2-7B is trained with 4-bit quantization [3].

Experimental Results

4.4.1 QA Performance Evaluation. Table 1 reports the QA performance. We compare MBC against the baseline methodologies, with MAC being the most competitive. Across all datasets and base LLMs, MBC consistently improves both EM and F1. On average,

1 https://github.com/nathanhu0/CaMeLS

Table 1: Exact Match (EM) and F1 scores on StreamingQA, SQuAD, and ArchivalQA across different backbone models and baselines. High values indicate high QA performance.

* CaMeLS results are not reported for LLaMA-2 (7B) because the model exceeds the memory capacity of a single NVIDIA A100 80GB GPU. Even with a batch size of 1, it was infeasible to replicate this baseline under our hardware constraints.

Figure 1: Memory bank footprint (logMB) of MAC and MBC across StreamingQA, SQuAD, and ArchivalQA.


Table 2: Trainable parameters of MAC and MBC (offline).

MBC gains 11.84% in EM and 12.99% in F1 compared to MAC. The performance gains result from two main design choices. First, the introduction of KV-LoRA allows the attention mechanism to make effective use of the modulation $\hat{\phi}_i^*$, leading to more accurate answers. Second, an efficiently learned codebook preserves the quality of the stored documents, ensuring that the compression mechanism does not degrade performance.

4.4.2 Memory Bank Size. Figure 1 compares the memory bank footprint of the two examined memory-augmented methods, MBC and MAC. For MBC, the footprint includes the codebook and the indices stored in the bank, while for MAC it corresponds to the full memory bank. Across all three datasets, MBC achieves substantial memory savings. For DistilGPT2, the memory bank size is reduced by an average of 98.27% compared to MAC. For GPT2-Large and GPT2-XL, the reduction averages 99.1%, and for LLaMA-2-7B it averages 99.2%. These results show that memory compression is consistently effective across all model scales.
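The arithmetic behind this gap can be sketched as follows; the modulation size and codebook size below are illustrative assumptions, not the paper's exact dimensions.

```python
MOD_DIM = 6144           # flattened modulation floats per document (assumed)
CODEBOOK_SIZE = 64       # number of codes N_c (assumed)
BYTES_F32 = BYTES_I32 = 4

def mac_bank_bytes(num_docs: int) -> int:
    # A MAC-style bank stores one full float32 modulation per adapted document.
    return num_docs * MOD_DIM * BYTES_F32

def mbc_bank_bytes(num_docs: int) -> int:
    # An MBC-style bank stores a fixed shared codebook plus one int32 index
    # per document, so the per-document cost is a single integer.
    return CODEBOOK_SIZE * MOD_DIM * BYTES_F32 + num_docs * BYTES_I32
```

Under these assumptions the MAC-style bank grows by `MOD_DIM * 4` bytes per document while the MBC-style bank grows by only 4 bytes, which is why the gap widens as the document stream lengthens.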

To examine the overhead of the codebook and KV-LoRA in MBC, we compare the trainable parameters of the two memory-augmented methods, MAC and MBC. These numbers are reported in the offline setting, that is, without counting the

Table 3: Memory bank size (MB) / F1 retention rate (%) on StreamingQA, SQuAD, and ArchivalQA for different base LLMs. Each entry is reported as memory bank footprint in MB followed by the corresponding retention rate.


documents stored during online adaptation. As Table 2 shows, the additional parameters introduced by the codebook and KV-LoRA of the proposed MBC model account for less than 0.5% across all base LLMs. This overhead is negligible compared to the improvements in the QA accuracy and memory compression.

4.4.3 Knowledge Retention During Online Adaptation. In this experiment, we evaluate how well models retain knowledge from previously adapted documents while continuing to adapt to new documents. Following the evaluation protocol in online adaptation from [32], we measure the F1 score retention rate, defined as the fraction of the F1 score on the first 200 adapted documents that is preserved after further adaptation on up to 1,400 additional documents, in steps of 200 documents. A high retention rate indicates that the model preserves knowledge from earlier documents even as new information is incorporated, showing reduced susceptibility to catastrophic forgetting. Table 3 reports the retention rates alongside the corresponding memory bank sizes. MBC achieves high retention, comparable to MAC, across all base models and datasets, demonstrating that compression does not harm the model's ability to preserve earlier knowledge during online adaptation. At the same time, MBC consistently requires far less memory. For the same number of adapted documents, its memory bank footprint is reduced on average by 97.3% compared to MAC. Interestingly, the small model DistilGPT2 appears to show high retention. However, this effect comes from its limited capacity to utilize the memory bank effectively: its absolute performance is already low, so retention appears artificially high. Meanwhile, the large models GPT2-XL and LLaMA-2-7B achieve strong adaptation performance and high retention, confirming that MBC scales effectively to large base LLMs.
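The retention metric reduces to a simple ratio; a minimal sketch, with hypothetical F1 values for illustration:

```python
def f1_retention_rate(f1_initial: float, f1_later: float) -> float:
    """Percentage of the F1 score on the first 200 adapted documents that is
    preserved after further online adaptation on additional documents."""
    return 100.0 * f1_later / f1_initial
```

For example, if the F1 score on the first 200 documents drops from 12.0 to 11.76 after adapting on 1,400 more documents (hypothetical values), the retention rate is 98.0%.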

4.4.4 Effectiveness of the Codebook Resetting Mechanism. We further evaluate the role of the EMA-based codebook resetting mechanism introduced in Section 3.3 by comparing training runs with


Figure 2: Perplexity over train epochs on StreamingQA with and without codebook resetting in MBC, across all base LLMs.


and without resetting in MBC. Code usage is measured via perplexity, defined as $\mathrm{PPL} = \exp\big(-\sum_k \bar{p}_k \log \bar{p}_k\big)$, where $\bar{p}_k = u_k / \sum_j u_j$ and $u_j$ is the EMA-smoothed usage defined in Eq. 1. Low values that remain flat indicate codebook collapse, that is, only a small subset of codes is repeatedly used. Figure 2 shows the perplexity curves on StreamingQA across the four base LLMs. With resetting, effective code usage remains stable and diverse throughout training. For DistilGPT2, perplexity stays between 57 and 65 during the first 10 epochs, while without resetting it collapses to around 12. For GPT2-Large, resetting maintains perplexity between 61 and 66, whereas without resetting it quickly drops to 24. For GPT2-XL, resetting keeps perplexity steadily above 90, whereas it collapses to 14 without resetting. Similarly, for LLaMA-2-7B, resetting maintains perplexity above 100, while without it the codebook again collapses to 24. These results confirm that the codebook resetting mechanism is important for preventing collapse and ensuring

balanced code usage, which supports stable training and effective adaptation for the proposed MBC method.
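The usage-perplexity diagnostic can be sketched in a few lines; it is roughly the effective number of codes in active use:

```python
import math

def usage_perplexity(usage):
    """Perplexity of the (EMA-smoothed) code-usage distribution.

    High values mean many codes are in active use; low, flat values
    indicate codebook collapse onto a few codes.
    """
    total = sum(usage)
    probs = [u / total for u in usage if u > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

Uniform usage over 64 codes gives a perplexity of 64, while a single dominant code drives it toward 1.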



For these setups, we use 21.7K training, 5.3K validation, and 8.7K test questions, with 12.8K training, 3.0K validation, and 5.0K test documents, respectively. For the QA pre-training of the baselines, 12.4K training and 3K validation questions are used.

Effectiveness of the Codebook Resetting Mechanism

To prevent underutilization and codebook collapse, where only a small subset of codes is repeatedly used, the codebook is updated during training using an exponential moving average (EMA) of code usage. For a mini-batch of size $K$, let

$$
n_j := \sum_{i=1}^K \mathbf{1}[c_i = j], \qquad u_j := \gamma\, u_j + (1-\gamma)\, n_j \quad \forall j \in \{1,\dots,N_c\} \tag{1}
$$

where $u_j$ tracks smoothed usage and $\gamma \in (0,1)$ is a decay rate hyperparameter. Codes with usage below a hyperparameter threshold $\epsilon$ are marked as inactive: $I_{\mathrm{dead}} := \{ j \mid u_j < \epsilon \}$. When $I_{\mathrm{dead}} \neq \emptyset$, up to $|I_{\mathrm{dead}}|$ distinct encoder outputs are sampled from the current batch, $\{\phi_{s(j)}\}$, to reinitialize the corresponding codebook vectors:

$$
E_j := \phi_{s(j)} \quad \forall j \in I_{\mathrm{dead}}, \qquad u_j := \bar{u} := \frac{1}{N_c}\sum_{\ell=1}^{N_c} u_\ell
$$

where $s(j)$ denotes indices sampled uniformly at random without replacement from the batch, and $\bar{u}$ denotes the mean usage across all codes, ensuring that reinitialized entries retain a non-negligible prior usage estimate. This procedure, applied only during training (without gradients), maintains codebook diversity and prevents collapse, as we will experimentally show in Section 4.4.4.
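The EMA update and dead-code reset can be sketched as a single maintenance pass per training step; the default values of `gamma` and `eps` below are illustrative, not the paper's tuned settings.

```python
import numpy as np

def ema_update_and_reset(E, u, phi, codes, gamma=0.99, eps=1e-3, rng=None):
    """One codebook-maintenance pass per training step (no gradients).

    E:     (N_c, D) codebook, u: (N_c,) EMA usage,
    phi:   (K, D) encoder outputs of the current batch,
    codes: (K,) assigned code indices.
    """
    rng = rng or np.random.default_rng()
    N_c = E.shape[0]
    # u_j <- gamma * u_j + (1 - gamma) * n_j, with n_j the batch counts
    n = np.bincount(codes, minlength=N_c)
    u = gamma * u + (1.0 - gamma) * n
    # Reinitialize dead codes from distinct random batch samples
    dead = np.flatnonzero(u < eps)
    if dead.size > 0:
        take = min(dead.size, phi.shape[0])
        picks = rng.choice(phi.shape[0], size=take, replace=False)
        E[dead[:take]] = phi[picks]
        u[dead[:take]] = u.mean()  # non-negligible prior usage estimate
    return E, u
```

Resetting usage to the mean (rather than zero) keeps a reinitialized code from being flagged as dead again on the very next step.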

Conclusion


$$ K''_\ell := K'_\ell + A_{K,\ell} B_{K,\ell}, \qquad V''_\ell := V'_\ell + A_{V,\ell} B_{V,\ell} $$

$$ f_{\tilde{\theta}_b}^{\hat{\phi}_i^*}(q_i) := f_{\tilde{\theta}_b}\big(q_i;\, \{K''_\ell, V''_\ell\}_{\ell=1}^L\big) $$

$$ \mathcal{L}_{\mathrm{VQ}} := \big\| \mathrm{sg}[\phi_i] - \hat{\phi}_i^{\mathrm{hard}} \big\|_2^2 + \beta\, \big\| \phi_i - \mathrm{sg}[\hat{\phi}_i^{\mathrm{hard}}] \big\|_2^2 $$

$$ \min_{\theta_{\mathrm{amort}},\, \theta_{\mathrm{input}},\, \psi,\, \theta_{\mathrm{KV\text{-}LoRA}},\, E}\; \frac{1}{K} \sum_{i=1}^K \Big[ \mathcal{L}_{\mathrm{QA}}(q_i, d_i, y_i) + \lambda_{\mathrm{VQ}}\, \mathcal{L}_{\mathrm{VQ}}(\phi_i, \hat{\phi}_i^{\mathrm{hard}}) \Big] $$

$$ \mathrm{EM} = \frac{1}{I}\sum_{i=1}^I \mathbf{1}\big[\mathrm{norm}(\hat{y}_i) = \mathrm{norm}(y_i)\big] $$

$$ \mathrm{Precision}_i = \frac{|\mathrm{tok}(\hat{y}_i) \cap \mathrm{tok}(y_i)|}{|\mathrm{tok}(\hat{y}_i)|}, \qquad \mathrm{Recall}_i = \frac{|\mathrm{tok}(\hat{y}_i) \cap \mathrm{tok}(y_i)|}{|\mathrm{tok}(y_i)|} $$

$$ \mathrm{F1}_i = \frac{2 \cdot \mathrm{Precision}_i \cdot \mathrm{Recall}_i}{\mathrm{Precision}_i + \mathrm{Recall}_i}, \qquad \mathrm{F1} = \frac{1}{I}\sum_{i=1}^I \mathrm{F1}_i $$
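The norm() and token-level F1 in these formulas follow the standard SQuAD-style evaluation; a minimal sketch:

```python
import re
import string
from collections import Counter

def norm(s: str) -> str:
    """Lowercase, strip punctuation, remove articles, collapse whitespace."""
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def token_f1(gold: str, pred: str) -> float:
    """Token-level F1 between a gold answer and a predicted answer."""
    g, p = norm(gold).split(), norm(pred).split()
    if not g or not p:
        return float(g == p)
    overlap = sum((Counter(g) & Counter(p)).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Exact Match then reduces to `norm(pred) == norm(gold)` averaged over the test set.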

\footnotesize
\caption{MBC End-to-End Optimization}
\label{alg:training_detailed}
\LinesNumbered
\KwIn{Amortization params $\theta_{\mathrm{amort}}$, input encoder params $\theta_{\mathrm{input}}$, base LLM params $\theta_b$, aggregation params $\psi$, KV-LoRA params $\theta_{\mathrm{KV\text{-}LoRA}}$, hidden dimension $D$, training corpus $D^{\mathrm{train}}$, learning rate $\eta$, epochs $m$, batch size $K$, VQ commitment cost $\beta_{\text{commit}}$, VQ weight $\lambda_{\mathrm{VQ}}$, codebook size $N_c$, reset threshold $\epsilon$, EMA decay rate $\gamma$}
\KwOut{$\theta_{\mathrm{amort}}, \theta_{\mathrm{input}}, \psi, \theta_{\mathrm{KV\text{-}LoRA}}, E$}

\tcp{Initialize codebook and usage EMA}
$E_{j} \sim \mathcal{U}\!\big(-\tfrac{1}{N_c}, \tfrac{1}{N_c}\big)\ \ \forall j \in \{1,\dots,N_c\}$,\quad $u_j \gets 0\ \ \forall j$

\For{epoch $= 1 \to m$}{
Sample documents $\{d_1,\dots,d_K\} \subset D^{\mathrm{train}}$

Sample QA pairs $(q_i, y_i) \sim p(q,y \mid d_i)$ for $i=1,\dots,K$

\For{$i = 1$ $\to$ $K$}{
$\phi_i \gets g_{\theta_{\mathrm{amort}}}(d_i)$ \tcp*{Encode document}

\tcp{Vector quantization (nearest code)}
$c_i \gets \arg\min_{j \in \{1,\dots,N_c\}} \|\phi_i - E_j\|_2^2$, \quad $\hat{\phi}_i^{\mathrm{hard}} \gets E_{c_i}$

$\hat{\phi}_i \gets \phi_i + \mathrm{sg}[\hat{\phi}_i^{\mathrm{hard}} - \phi_i]$ \tcp*{Straight-through estimator}
}

\tcp{Update code usage EMA and reset dead codes}
$n_j \gets \sum_{i=1}^{K} \mathbf{1}[c_i = j]\ \ \forall j$,\quad
$u_j \gets \gamma\,u_j + (1-\gamma)\,n_j\ \ \forall j$

$I_{\mathrm{dead}} \gets \{\, j \in \{1,\dots,N_c\} \mid u_j < \epsilon \,\}$

\If{$|I_{\mathrm{dead}}| > 0$}{
\tcp{Replace dead codes with random batch samples}
$S \sim \mathrm{Unif}\!\Big(\tbinom{\{1,\dots,K\}}{|S|}\Big),
\quad |S|=\min(|I_{\mathrm{dead}}|,K)$

$E_j \gets \phi_{s(j)}
\quad \forall j \in I_{\mathrm{dead}},\ \text{up to } |S| \text{ codes}$

$u_j \gets \tfrac{1}{N_c}\sum_{\ell=1}^{N_c} u_\ell
\quad \forall j \in I_{\mathrm{dead}}$ \tcp*{Reset usage}
}

\tcp{Aggregate quantized contexts with the query}
$\hat{\phi}_i^{*} \gets h_{\psi}\!\big(g_{\theta_{\mathrm{input}}}(q_i), \{\hat{\phi}_k\}_{k=1}^{K}\big)$

\tcp{QA loss via modulated base LLM}
$\mathcal{L}_{\mathrm{QA}} \gets \frac{1}{K}\sum_{i=1}^{K} \mathrm{CrossEntropy}\!\big(f_{\tilde{\theta}_b}^{\hat{\phi}_i^{*}}(q_i),\, y_i\big)$

\tcp{VQ loss (codebook and commitment terms)}
$\mathcal{L}_{\text{codebook}} \gets \frac{1}{K}\sum_{i=1}^{K} \big\|\mathrm{sg}[\phi_i] - \hat{\phi}_i^{\mathrm{hard}}\big\|_2^2$

$\mathcal{L}_{\text{commit}} \gets \frac{1}{K}\sum_{i=1}^{K} \big\|\phi_i - \mathrm{sg}[\hat{\phi}_i^{\mathrm{hard}}]\big\|_2^2$

$\mathcal{L}_{\mathrm{VQ}} \gets \mathcal{L}_{\text{codebook}} + \beta_{\text{commit}}\cdot \mathcal{L}_{\text{commit}}$

$\mathcal{L}_{\mathrm{total}} \gets \mathcal{L}_{\mathrm{QA}} + \lambda_{\mathrm{VQ}}\, \mathcal{L}_{\mathrm{VQ}}$ \tcp*{Total objective}

\tcp{Gradient updates (base $\theta_b$ frozen)}
$\theta_{\mathrm{amort}} \gets \theta_{\mathrm{amort}} - \eta \nabla_{\theta_{\mathrm{amort}}} \mathcal{L}_{\mathrm{total}}$

$\theta_{\mathrm{input}} \gets \theta_{\mathrm{input}} - \eta \nabla_{\theta_{\mathrm{input}}} \mathcal{L}_{\mathrm{total}}$

$\psi \gets \psi - \eta \nabla_{\psi} \mathcal{L}_{\mathrm{total}}$

$\theta_{\mathrm{KV\text{-}LoRA}} \gets \theta_{\mathrm{KV\text{-}LoRA}} - \eta \nabla_{\theta_{\mathrm{KV\text{-}LoRA}}} \mathcal{L}_{\mathrm{total}}$

$E \gets E - \eta \nabla_{E} \mathcal{L}_{\mathrm{total}}$
}
\footnotesize
\caption{Online Adaptation of MBC}
\label{alg:online_adaptation}
\LinesNumbered
\KwIn{Test document stream $D^{\mathrm{test}}$, test QA set $\{(q_i, y_i)\}_{i=1}^I$, amortization params $\theta_{\mathrm{amort}}$, input encoder params $\theta_{\mathrm{input}}$, base LLM params with KV-LoRA $\tilde\theta_b$, aggregation params $\psi$, learned codebook $E$}
\KwOut{EM and F1 over $\{(q_i,y_i)\}_{i=1}^I$}

$\mathcal{M}_{\mathrm{VQ}} \gets \emptyset$ \tcp*{Initialize compressed memory bank}

\For{$d_k \in D^{\mathrm{test}}$}{
$\phi_k \gets g_{\theta_{\mathrm{amort}}}(d_k)$ \tcp*{Encode document}

$c_k \gets \arg\min_{j} \|\phi_k - E_j\|_2^2$ \tcp*{Quantize}

$\mathcal{M}_{\mathrm{VQ}} \gets \mathcal{M}_{\mathrm{VQ}} \cup \{c_k\}$ \tcp*{Save document to bank}
}

\For{$i = 1 \to I$}{
$\hat{\phi}_i^* \gets h_{\psi}\!\big(g_{\theta_{\mathrm{input}}}(q_i), \{E_{c_j}\}_{c_j \in \mathcal{M}_{\mathrm{VQ}}}\big)$ \tcp*{Aggregate memory with query}

$\hat{y}_i \gets f_{\tilde{\theta}_b}^{\hat{\phi}_i^*}(q_i)$ \tcp*{Predict answer}
}

\tcp{Final evaluation; norm(): lowercase, strip punctuation, remove articles, collapse whitespace}
$\text{EM} \gets \frac{1}{I} \sum_{i=1}^{I} \mathbf{1}\!\big[\mathrm{norm}(y_i) = \mathrm{norm}(\hat{y}_i)\big]$ \tcp*{Exact Match}

$\text{F1} \gets \frac{1}{I} \sum_{i=1}^{I} \mathrm{F1}_{\text{token}}(y_i,\hat{y}_i)$ \tcp*{Token-Level F1}

\Return $(\text{EM}, \text{F1})$
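The quantize-and-store loop above amounts to a nearest-neighbour lookup against the learned codebook; a minimal numpy sketch (shapes are illustrative):

```python
import numpy as np

def quantize_stream(encoded_docs, E):
    """Compress a stream of encoded documents into code indices.

    encoded_docs: iterable of (D,) vectors phi_k; E: (N_c, D) codebook.
    Returns the compressed memory bank M_VQ as a list of integer indices.
    """
    bank = []
    for phi in encoded_docs:
        dists = ((E - phi) ** 2).sum(axis=1)  # squared L2 to every code
        bank.append(int(dists.argmin()))      # store only the index
    return bank
```

At query time, the stored vectors are recovered as `E[c]` for each index `c` in the bank before aggregation with the encoded query.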

In this work, we addressed the scalability challenges of memory-augmented LLMs, where the memory bank grows constantly as new documents are processed. We proposed MBC, a model that compresses the memory bank, enabling efficient continual adaptation of LLMs in streaming settings. By combining codebook-based compression with an online resetting mechanism, MBC prevents codebook collapse and ensures balanced code utilization. At the same time, lightweight KV-LoRA modules provide targeted adaptation within the attention mechanism, allowing the model to efficiently exploit the query-memory modulations without full fine-tuning. This design enables MBC to achieve scalability in terms of memory efficiency while improving QA accuracy. Experiments with QA datasets demonstrate that MBC improves EM and F1 scores while reducing the memory bank footprint to 0.3% of that of the most competitive baseline. MBC also maintains high F1 retention during online adaptation, thus reducing catastrophic forgetting. An interesting future direction is to extend MBC by incorporating reinforcement signals to guide memory usage adaptively [13] or by exploring distributed memory banks that enable federated continual learning [42].

| Model (# params) | Method | StreamingQA EM (↑) | StreamingQA F1 (↑) | SQuAD EM (↑) | SQuAD F1 (↑) | ArchivalQA EM (↑) | ArchivalQA F1 (↑) |
|---|---|---|---|---|---|---|---|
| DistilGPT2 (82M) | Uniform | 1.62 | 2.97 | 1.34 | 2.78 | 4.01 | 3.69 |
| DistilGPT2 (82M) | Salient Spans | 1.62 | 4.33 | 1.31 | 2.50 | 4.08 | 3.98 |
| DistilGPT2 (82M) | CaMeLS | 1.86 | 4.38 | 1.43 | 3.06 | 4.11 | 5.99 |
| DistilGPT2 (82M) | MAC | 3.48 | 8.11 | 1.90 | 5.00 | 5.99 | 8.87 |
| DistilGPT2 (82M) | MBC (Ours) | 3.96 (↑ 13.8%) | 8.76 (↑ 8%) | 2.10 (↑ 10.5%) | 5.36 (↑ 7.2%) | 6.61 (↑ 10.4%) | 9.27 (↑ 4.5%) |
| GPT2-Large (774M) | Uniform | 4.14 | 8.08 | 3.37 | 5.62 | 8.03 | 6.63 |
| GPT2-Large (774M) | Salient Spans | 4.26 | 8.53 | 4.38 | 6.79 | 9.75 | 7.23 |
| GPT2-Large (774M) | CaMeLS | 5.48 | 10.31 | 4.45 | 7.60 | 9.28 | 9.18 |
| GPT2-Large (774M) | MAC | 6.12 | 11.44 | 6.14 | 9.75 | 10.95 | 12.15 |
| GPT2-Large (774M) | MBC (Ours) | 7.43 (↑ 21.4%) | 12.77 (↑ 11.6%) | 6.99 (↑ 13.8%) | 10.88 (↑ 11.6%) | 12.03 (↑ 9.9%) | 13.68 (↑ 12.6%) |
| GPT2-XL (1.5B) | Uniform | 5.16 | 9.14 | 5.87 | 7.87 | 9.89 | 10.46 |
| GPT2-XL (1.5B) | Salient Spans | 5.46 | 11.32 | 5.66 | 8.69 | 10.44 | 13.68 |
| GPT2-XL (1.5B) | CaMeLS | 6.98 | 11.23 | 6.17 | 9.93 | 11.48 | 14.01 |
| GPT2-XL (1.5B) | MAC | 7.14 | 12.01 | 6.89 | 10.12 | 11.48 | 15.52 |
| GPT2-XL (1.5B) | MBC (Ours) | 7.49 (↑ 4.9%) | 12.77 (↑ 6.3%) | 7.40 (↑ 7.4%) | 11.96 (↑ 18.2%) | 12.34 (↑ 7.5%) | 15.93 (↑ 2.6%) |
| LLaMA-2 (7B) | Uniform | 11.76 | 12.53 | 12.78 | 16.62 | 17.89 | 20.01 |
| LLaMA-2 (7B) | Salient Spans | 12.12 | 18.65 | 13.32 | 18.09 | 18.45 | 22.21 |
| LLaMA-2 (7B) | CaMeLS* | N/A | N/A | N/A | N/A | N/A | N/A |
| LLaMA-2 (7B) | MAC | 14.01 | 20.44 | 13.33 | 18.17 | 19.58 | 23.89 |
| LLaMA-2 (7B) | MBC (Ours) | 16.04 (↑ 14.5%) | 25.33 (↑ 23.9%) | 14.93 (↑ 12%) | 22.15 (↑ 21.9%) | 22.71 (↑ 16%) | 28.66 (↑ 19.9%) |
| Method | DistilGPT2 | GPT2-Large | GPT2-XL | LLaMA-2-7B |
|---|---|---|---|---|
| MAC | 197M | 927M | 1.72B | 2.36B |
| MBC | 197.6M (+0.31%) | 929.6M (+0.28%) | 1.723B (+0.19%) | 2.371B (+0.45%) |
DistilGPT2:

| # of Doc | StreamingQA (MAC) | StreamingQA (MBC) | SQuAD (MAC) | SQuAD (MBC) | ArchivalQA (MAC) | ArchivalQA (MBC) |
|---|---|---|---|---|---|---|
| 200 | 8.21/100 | 1.52/100 | 38.1/100 | 1.60/100 | 10.91/100 | 1.53/100 |
| 400 | 16.42/99.5 | 1.54/99 | 76.19/99.5 | 1.70/99 | 21.82/99.5 | 1.56/99 |
| 600 | 24.63/99 | 1.56/99 | 114.29/99 | 1.80/99 | 32.72/99.5 | 1.59/99.5 |
| 800 | 32.84/99 | 1.59/99 | 152.39/99 | 1.90/98 | 43.63/99 | 1.62/99 |
| 1000 | 41.04/98 | 1.61/98 | 190.48/99 | 1.99/97.5 | 54.54/99 | 1.64/98.5 |
| 1200 | 49.25/98 | 1.63/98 | 228.58/98.5 | 2.09/98 | 65.45/98.5 | 1.67/98 |
| 1400 | 57.46/98 | 1.65/98.5 | 266.67/98 | 2.19/98.5 | 76.35/98.5 | 1.70/98.5 |
| 1600 | 65.67/98 | 1.67/98 | 304.77/98 | 2.29/98.5 | 87.26/98 | 1.73/98 |

GPT2-Large:

| # of Doc | StreamingQA (MAC) | StreamingQA (MBC) | SQuAD (MAC) | SQuAD (MBC) | ArchivalQA (MAC) | ArchivalQA (MBC) |
|---|---|---|---|---|---|---|
| 200 | 27.36/100 | 2.54/100 | 126.99/100 | 2.7/100 | 36.36/100 | 2.56/100 |
| 400 | 54.73/99 | 2.59/99 | 253.98/99 | 2.9/99.5 | 72.72/99 | 2.61/99 |
| 600 | 82.09/99 | 2.63/99 | 380.96/98.5 | 3.1/99.5 | 109.07/99 | 2.67/99 |
| 800 | 109.46/98 | 2.67/99 | 507.95/98 | 3.3/99 | 145.43/98.8 | 2.73/99 |
| 1000 | 136.82/97.5 | 2.71/98.5 | 634.94/97.5 | 3.49/98.5 | 181.79/97.5 | 2.78/98.8 |
| 1200 | 164.18/97 | 2.76/98.5 | 761.93/97 | 3.69/98 | 218.15/97.5 | 2.84/98.3 |
| 1400 | 191.55/96.5 | 2.8/97 | 888.91/96.5 | 3.89/96.5 | 254.5/96.5 | 2.89/97.6 |
| 1600 | 218.91/97 | 2.84/97 | 1015.9/96 | 4.09/97 | 290.86/97 | 2.95/97.2 |


References

[tack2024mac] Tack, J., Kim, J., Mitchell, E., Shin, J., Teh, Y. W., Schwarz, J. R. (2024). Online Adaptation of Language Models with a Memory of Amortized Contexts. NeurIPS.

[huCamels] Hu, N., Mitchell, E., Manning, C. D., Finn, C. (2023). Meta-Learning Online Adaptation of Language Models.

[salientspansbaseline] Guu, K., Lee, K., Tung, Z., Pasupat, P., Chang, M. (2020). Retrieval Augmented Language Model Pre-Training. ICML.

[touvronLlama2Open2023a] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models.

[brownLanguageModelsAre2020] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., et al. (2020). Language Models are Few-Shot Learners. NeurIPS.

[openaiGPT4TechnicalReport2024] OpenAI; Achiam, J., Adler, S., Agarwal, S., et al. (2024). GPT-4 Technical Report.

[wangmachintranslation2024] Wang, Y., Zhang, J., Shi, T., Deng, D., Tian, Y., Matsumoto, T. (2024). Recent Advances. IEEE Access.

[vanveensummarization2024] Van Veen, D., Van Uden, C., Blankemeier, L., Delbrouck, J-B., Aali, A., Bluethgen, C., Pareek, A., Polacin, M., Reis, E. P., Seehofnerová, A., et al. (2024). Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization. Nature Medicine.

[yaoreasoning2023] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS.

[zhusearchengines2024] Zhu, Y., Yuan, H., Wang, S., Liu, J., Liu, W., Deng, C., Chen, H., Liu, Z., Dou, Z., Wen, J-R. (2024). Large Language Models.

[gunterAppleIntelligenceFoundation2024] Gunter, T., Wang, Z., Wang, C., Pang, R., Narayanan, A., Zhang, A., et al. (2024). Apple Intelligence Foundation Language Models.

[jangstaticmodels2022] Jang, J., Ye, S., Yang, S., Shin, J., Han, J., Kim, G., Choi, S. J., Seo, M. (2022). Towards Continual Knowledge Learning of Language Models.

[lewisRAG2020] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.

[wuRAGsurvey2024] Wu, S., Xiong, Y., Cui, Y., Wu, H., Chen, C., Yuan, Y., Huang, L., Liu, X., Kuo, T-W., Guan, N., Xue, C. J. (2024). Retrieval-Augmented Generation.

[zhaoRAGquality2025] Zhao, S., Shao, Y., Huang, Y., Song, J., Wang, Z., Wan, C., Ma, L. (2025). Understanding the Design Decisions.

[StreamingQA_dataset] Liska, A., Kocisky, T., Gribovskaya, E., Terzi, T., Sezener, E., Agrawal, D., de Masson d'Autume, C., Scholtes, T., Zaheer, M., Young, S., et al. (2022). StreamingQA: A Benchmark for Adaptation to New Knowledge over Time in Question Answering Models. ICML.

[Squad_dataset] Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P. (2016). SQuAD: 100,000+ Questions for Machine Comprehension of Text.

[ArchivalQA_dataset] Wang, J., Jatowt, A., Yoshikawa, M. (2022). ArchivalQA: A Large-scale Benchmark Dataset for Open-Domain Question Answering over Historical News Collections. SIGIR.

[nytcorpus] Sandhaus, E. (2008). The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia.

[gpt2_paper] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners.

[distilgpt2_paper] Sanh, V., Debut, L., Chaumond, J., Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. NeurIPS EMC² Workshop.

[misrahiragoutofdomain2025] Misrahi, A., Chirkova, N., Louis, M., Nikoulina, V. (2025). Adapting Large Language Models.

[continual_1] Yang, Y., Zhou, J., Ding, X., Huai, T., Liu, S., Chen, Q., Xie, Y., He, L. (2025). Recent Advances. ACM Computing Surveys.

[continual_2] Shi, H., Xu, Z., Wang, H., Qin, W., Wang, W., Wang, Y., Wang, Z., Ebrahimi, S., Wang, H. (2025). Continual Learning. ACM Computing Surveys.

[catastrophic_forgetting] McCloskey, M., Cohen, N. J. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation.

[xupeftMethods2023] Xu, L., Xie, H., Qin, S-Z. J., Tao, X., Wang, F. L. (2023). Parameter-Efficient Fine-Tuning Methods.

[prefix_tuning] Li, X. L., Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation.

[houlsbyadapters12019] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., de Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S. (2019). Parameter-Efficient Transfer Learning for NLP. ICML.

[pfeifferadapters22021] Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., Gurevych, I. (2021). AdapterFusion: Non-Destructive Task Composition for Transfer Learning.

[huLoRALowRankAdaptation2021] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models.

[chaudhryreplay12019] Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H. S., Ranzato, M'A. (2019). On Tiny Episodic Memories in Continual Learning.

[schwarzreplay22018] Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., et al. (2018). Progress & Compress: A Scalable Framework for Continual Learning. ICML.

[memory_1] He, Z., Karlinsky, L., Kim, D., McAuley, J., Krotov, D., Feris, R. (2024). CAMELoT.

[memory_2] Park, S., Bak, J. (2024). Memoria: Resolving Fateful Forgetting Problem.

[memory_3] Wang, Y., Gao, Y., Chen, X., Jiang, H., Li, S., Yang, J., Yin, Q., Li, Z., Li, X., Yin, B., Shang, J., McAuley, J. (2024). MEMORYLLM: Towards Self-Updatable Large Language Models.

[memory_4] Zhang, Z., Dai, Q., Bo, X., Ma, C., Li, R., Chen, X., Zhu, J., Dong, Z., Wen, J-R. (2025). A Survey. ACM Transactions on Information Systems.

[t5] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR.

[ptuningv2] Liu, X., Ji, K., Fu, Y., Tam, W. L., Du, Z., Yang, Z., Tang, J. (2022). P-Tuning.

[vqvae] van den Oord, A., Vinyals, O., Kavukcuoglu, K. (2017). Neural Discrete Representation Learning. NeurIPS.

[straightthroughestimator] Bengio, Y., Léonard, N., Courville, A. (2013). Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation.

[adam] Kingma, D. P., Ba, J. (2017). Adam: A Method for Stochastic Optimization.

[attentionisallyouneed] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I. (2017). Attention Is All You Need. NeurIPS.

[4bitquant] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS.

[adafactor] Shazeer, N., Stern, M. (2018). Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. ICML.

[crossattn] Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals, O., Teh, Y. W. (2019). Attentive Neural Processes.

[singhalquestionanswering2025] Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Amin, M., Hou, L., Clark, K., Pfohl, S. R., Cole-Lewis, H., et al. (2025). Toward Expert-Level Medical Question Answering with Large Language Models. Nature Medicine.

[kulkarniReinforcementLearningOptimizing2024] Kulkarni, M., Tangarajan, P., Kim, K., Trivedi, A. (2024). Reinforcement Learning.

[yang2024federated] Yang, X., Yu, H., Gao, X., Wang, H., Zhang, J., Li, T. (2024). Federated Continual Learning via Knowledge Fusion: A Survey. IEEE Transactions on Knowledge and Data Engineering.