
Memory-Augmented Architecture for Long-Term Context Handling in Large Language Models

Haseeb Ullah Khan Shinwari, Muhammad Usama

Abstract

Large Language Models face significant challenges in maintaining coherent interactions over extended dialogues due to their limited contextual memory. This limitation often leads to fragmented exchanges and reduced relevance in responses, diminishing user experience. To address these issues, we propose a memory-augmented architecture that dynamically retrieves, updates, and prunes relevant information from past interactions, ensuring effective long-term context handling. Experimental results demonstrate that our solution significantly improves contextual coherence, reduces memory overhead, and enhances response quality, showcasing its potential for real-time applications in interactive systems.


Haseeb Ullah Khan Shinwari and Muhammad Usama ∗

Impact Statement: Maintaining long-term context is a persistent challenge for large language models (LLMs) in extended dialogues, often resulting in fragmented and incoherent responses. Our proposed memory-augmented architecture introduces a modular, relevance-based pruning mechanism that efficiently manages memory while preserving essential contextual information. This method enhances dialogue coherence, reduces memory overhead, and operates without significant architectural modifications, making it adaptable to various LLM frameworks.

The proposed solution significantly improves real-time applications such as virtual assistants, chatbots, and customer support systems by enabling sustained, contextually aware interactions. Furthermore, it reduces computational demands, ensuring scalability for deployment in resource-constrained environments. This advancement not only raises the standard for dialogue systems but also opens new avenues for AI applications requiring robust long-term memory management, bridging the gap between human-like conversational fluency and technological feasibility.

Index Terms: Contextual Understanding, Large Language Models, Long-Term Context, Machine Learning, Memory Augmented Models, Relevance-Based Pruning

Introduction

Large Language Models (LLMs) have made significant strides in natural language processing (NLP) applications, powering chatbots, virtual assistants, and customer service agents. Despite their success, a critical challenge remains: maintaining long-term context across multiple interactions. Standard LLMs process each query independently, often overlooking the continuity of prior exchanges. This limitation can result in responses that lack coherence, contextual relevance, and even lead to hallucinations in extended dialogues or complex tasks [1], [2].

Haseeb Ullah Khan Shinwari is with Newton AI Lab (email: mr.haseebe@gmail.com)

Muhammad Usama is with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea (email: usama@kaist.ac.kr).

Both authors contributed equally to this work.

∗ Corresponding author

The fixed input size of transformer-based architectures further exacerbates the issue by limiting the ability to capture long-term dependencies. As conversations grow, important contextual information is lost, leading to fragmented exchanges and a diminished user experience. While existing memory-augmented approaches attempt to store past interactions, they often face memory bloat and increased inference latency, making them impractical for real-time applications. Therefore, there is a pressing need for an efficient architecture that retrieves only the most relevant information while dynamically managing memory to maintain performance.

To address this challenge, various memory-augmented approaches have been proposed. These methods typically involve incorporating additional memory modules into the LLM architecture to store and retrieve relevant past information. For instance, LongMem [3] utilizes a decoupled memory architecture with a retrieval network to select relevant memories. CAMELoT [4] introduces an associative memory module to improve context handling within LLMs. However, these approaches often rely on LRU eviction for memory management, which may not be optimal for retaining highly relevant information.

Larimar [5] is another recent approach that focuses on fast and efficient learning of contextual information. While Larimar shares the goal of improving context-awareness, it differs from our work in its focus on memory updates and adaptation rather than long-term retrieval.

In this paper, we propose a memory-augmented architecture, visualized in Figure 1, that focuses on efficient long-term context handling. Unlike existing methods, we introduce a relevance-based pruning strategy to selectively retain the most pertinent information from past interactions. This approach complements LRU eviction by ensuring that crucial context is preserved, even if it's not the most recently accessed.

Furthermore, our method is designed to be modular and easily integrated with existing LLM architectures. By leveraging a retrieval network to select relevant memories, we avoid the need for significant modifications to the LLM itself. This flexibility allows for seamless integration with various LLM models and facilitates experimentation with different memory management strategies.

The main contributions of this work are:

  1. Relevance-Based Pruning: We introduce a memory management strategy that prioritizes the retention of highly relevant information, improving the quality and coherence of LLM responses.
  2. Modular Architecture: Our approach is designed to be easily integrated with existing LLM architectures, requiring minimal modifications to the base model.

Fig. 1. Context-Grounded Response Generation Framework for LLMs. This figure illustrates the proposed memory-augmented architecture designed for effective long-term context handling in LLMs. User queries Q_t are mapped to embeddings q_t via an embedding network, which are then matched against the past interaction embeddings m_i in the memory store based on cosine similarity α(q_t, m_i). The most relevant memory m_retrieved is selected to condition the LLM response R_t, grounding it in prior context. After generating the response, a new memory entry m_new is added. To keep memory scalable, either Least Recently Used (LRU) eviction or relevance-based pruning is applied, ensuring the retention of only the most pertinent past interactions.


  3. Effective Long-Term Context Handling: Our method demonstrates improved performance in maintaining long-term context, leading to more coherent and relevant responses in dialogue systems.

The proposed memory-augmented architecture differs from Retrieval-Augmented Generation (RAG) [6] by focusing on adaptive, real-time memory management tailored for ongoing dialogues, as opposed to RAG's static knowledge retrieval from fixed databases. Unlike RAG, which supplements isolated responses with external data, this approach dynamically updates, retrieves, and prunes memory entries based on contextual relevance, preserving interaction-specific information critical to long-term coherence. By employing relevance-based pruning instead of LRU-based eviction, the proposed method retains essential context efficiently, avoids memory bloat, and ensures scalability, making it well-suited for applications requiring sustained context adaptation over extended interactions.

Context-Grounded Response Generation in LLMs

This section presents a memory-augmented framework designed for context-grounded response generation in Large Language Models (LLMs). The key challenge in maintaining coherent responses across long-term interactions lies in effectively managing the context accumulated from prior exchanges. Our approach addresses this challenge by introducing an adaptive memory mechanism that dynamically retrieves, updates, and manages relevant past interactions to inform future responses.

Given a sequence of user inputs {Q_1, Q_2, ..., Q_t}, the goal is to generate a response R_t at time t that reflects both the current query Q_t and the pertinent context from prior interactions. To achieve this, each query is mapped to a high-dimensional embedding space using an encoder network f_enc, parameterized by θ_enc:

$$
\mathbf{q}_t = f_{\text{enc}}(Q_t; \theta_{\text{enc}}).
$$

These query embeddings are then used to retrieve relevant memory entries from the memory store, ensuring that the responses generated by the LLM remain contextually grounded and coherent throughout the interaction.

The memory store at time t, denoted by M_t = {m_1, m_2, ..., m_N}, is a set of memory slots, each encoding information from previous interactions. To maintain relevance, a query-driven retrieval mechanism computes the similarity between the current query embedding q_t and each memory entry m_i. The similarity score is defined as the cosine similarity:

$$
\alpha(\mathbf{q}_t, \mathbf{m}_i) = \frac{\mathbf{q}_t \cdot \mathbf{m}_i}{\|\mathbf{q}_t\| \, \|\mathbf{m}_i\|}.
$$

The memory entry with the highest similarity score is retrieved and used to condition the LLM's response generation process. The generated response R_t is expressed as:

$$
R_t = g(\mathbf{q}_t, \mathbf{m}_{\text{retrieved}}; \theta_{\text{dec}}),
$$

where g is the decoder network of the LLM parameterized by θ dec .

After generating the response, the memory store is updated with the new interaction. The new memory entry is constructed by encoding the concatenated query and response:

$$
\mathbf{m}_{\text{new}} = f_{\text{enc}}([Q_t \Vert R_t]; \theta_{\text{enc}}).
$$

The updated memory store becomes:

$$
M_{t+1} = M_t \cup \{\mathbf{m}_{\text{new}}\}.
$$
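As an illustrative sketch, the retrieve/update/prune cycle above can be implemented in a few dozen lines of NumPy. The class below is a simplification, not the authors' implementation: the encoder f_enc is assumed to run upstream (only embeddings enter the store), and the windowed relevance score κ over the last T queries is approximated by a running maximum.

```python
import numpy as np

class MemoryStore:
    """Sketch of the memory store M_t: embedding slots with cosine
    retrieval, update, and relevance-based pruning (kappa approximated
    by a running max rather than a sliding window over T queries)."""

    def __init__(self, max_size):
        self.max_size = max_size   # N, the maximum number of slots
        self.entries = []          # memory embeddings m_1, ..., m_N
        self.scores = []           # running relevance kappa(m_i)

    @staticmethod
    def cosine(a, b):
        # alpha(q_t, m_i) = (q_t . m_i) / (||q_t|| ||m_i||)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(self, q):
        """Return the entry maximizing alpha(q, m_i), updating kappa."""
        if not self.entries:
            return None
        sims = [self.cosine(q, m) for m in self.entries]
        # fold the latest query similarities into each entry's relevance
        self.scores = [max(k, s) for k, s in zip(self.scores, sims)]
        return self.entries[int(np.argmax(sims))]

    def add(self, m_new):
        """Insert m_new; if capacity N is exceeded, prune argmin kappa."""
        self.entries.append(m_new)
        self.scores.append(0.0)
        if len(self.entries) > self.max_size:
            worst = int(np.argmin(self.scores[:-1]))  # never prune m_new
            del self.entries[worst]
            del self.scores[worst]
```

A query embedding passed to `retrieve` both selects the conditioning memory and refreshes the relevance bookkeeping that later drives pruning, mirroring the loop in Algorithm 1.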

Experiments and Results

This section provides a comprehensive analysis of our memory-augmented model's performance on various tasks.

(Algorithm 1, Long-Term Context Handling with Memory Management, appears here; the full listing covers query embedding, memory retrieval, response generation, memory update, and LRU or relevance-based memory management.)

Experimental Setup

We evaluate our proposed memory-augmented framework on three tasks known for challenging a model's memory and contextual capabilities:

· 20 Questions Game: This turn-based guessing game requires models to infer an unknown entity within 20 questions based on hints provided during each turn. Given the cumulative nature of the clues, efficient memory handling and relevance-driven filtering are critical.

· Persona-Chat: This dataset contains dialogues where each agent follows a predefined persona. It emphasizes the importance of memory consistency across turns, as the model must maintain persona coherence while responding contextually and appropriately.

· DailyDialog: DailyDialog comprises over 13,000 dialogues across various topics, providing a natural conversational setting that tests a model's ability to retain and leverage long-term context for more realistic conversational flows.

To ensure fair comparisons, both baseline and memory-augmented variants of LLAMA 3 8B and Gemma 2 9B are tested under identical hardware setups: an Intel Xeon 3.4 GHz processor, 64 GB RAM, and an NVIDIA RTX 3090 GPU, with PyTorch [7] used for model training and testing. We use GTE-large [8] as the embedding network in our experiments.

Tasks and Datasets

We select a combination of synthetic and real-world datasets that pose unique memory-related challenges:

  1. 20 Questions Game Dataset: We created a custom dataset comprising 100 entities across ten categories, ranging from animals to objects. The model's objective is to deduce the correct entity within 20 turns, based on cumulative user hints.
  2. Persona-Chat Dataset: This dataset contains dialogues between two agents, each equipped with a unique persona, requiring the model to recall persona-relevant details and respond with consistency over multiple turns.
  3. DailyDialog Dataset: Comprising a wide variety of everyday topics, DailyDialog allows us to test the model's general conversation abilities and long-term memory retention in a natural conversational setting.

20 Questions Game Environment

The 20 Questions game environment simulates an interactive guessing game where a player, designated as the guesser, attempts to identify a secret keyword through a series of yes-or-no questions directed at an answerer. The environment is structured to foster a dialogue between two agents, which can either be LLM-based or random guessers and answerers, allowing for a controlled assessment of the LLM's capabilities in handling long-term context and maintaining coherence in dialogue.

At the start of each game, a keyword is randomly selected from a predefined list of categories and associated keywords. This keyword represents a real or fictional entity that the guesser aims to identify. The environment employs a set of structured prompts to facilitate communication between the guesser and answerer agents. When the guesser asks a question, the answerer provides a response based on the current state of the game, which may include 'yes,' 'no,' or 'maybe,' depending on the nature of the question posed.

TABLE I PERFORMANCE ON THE 20 QUESTIONS GAME

The game progresses through alternating turns between the guesser and answerer, with the state of the game captured in observations that include the questions asked, the answers provided, and any guesses made. Each agent maintains a record of their interactions, which is crucial for both short-term decision-making and long-term contextual awareness. This design emphasizes the necessity for the agents to recall past exchanges to make informed guesses and responses, directly testing the memory capabilities of the LLM.

To manage the flow of the game, a turn-based mechanism is implemented where agents alternate roles. If the guesser successfully identifies the keyword within the allowed number of questions, the game concludes positively for that agent. Conversely, if the guesser fails to guess the keyword within the stipulated turns, the game ends in failure. The environment also incorporates mechanisms to handle erroneous responses and timeouts, ensuring that the game can adapt to various scenarios and maintain its integrity.

The algorithm showcasing the working of the 20-questions game environment is given in Algorithm 2 and the code implementation is given at [9].
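The turn-based flow described above can be condensed into a short loop. The sketch below is illustrative only: the `guesser` and `answerer` callables stand in for the LLM or random agents, and their interfaces are assumptions rather than the released implementation [9] (error handling and timeouts are omitted).

```python
def play_20_questions(guesser, answerer, keyword, max_turns=20):
    """Run one game: the guesser asks questions (and may guess), the
    answerer replies based on the secret keyword. Returns the outcome."""
    history = []  # shared record of (question, answer) pairs
    for turn in range(max_turns):
        question, guess = guesser(history)        # agent may also guess
        if guess is not None and guess.lower() == keyword.lower():
            return {"solved": True, "turns": turn + 1}
        answer = answerer(keyword, question)      # 'yes', 'no', or 'maybe'
        history.append((question, answer))
    return {"solved": False, "turns": max_turns}  # ran out of questions
```

Because `history` is passed back to the guesser every turn, the agent's ability to exploit accumulated clues, and hence its memory handling, directly determines whether the keyword is found within the allotted turns.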


Evaluation Metrics

We employ the following metrics to evaluate the performance and memory efficiency of the proposed framework:

  1. Accuracy (20 Questions Game): Measures the proportion of correct guesses made by the model, reflecting its ability to maintain and leverage context.
  2. Contextual Coherence Score (CCS): CCS measures the cosine similarity between each model response and the aggregated context vector:

$$
\text{CCS} = \frac{1}{N} \sum_{i=1}^{N} \text{sim}(r_i, C_i),
$$

where N is the number of turns, ri the model's response, Ci the context vector, and sim ( ri , Ci ) the cosine similarity between the response and the context.

  3. Memory Overhead (MB): Reflects the memory consumption throughout the interaction, providing insight into the model's scalability.
  4. Latency (ms): Records average response time per query, relevant for real-time applications.
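The CCS defined above is straightforward to compute once the embeddings exist. The helper below is a sketch: `responses` and `contexts` are assumed to be lists of already-embedded vectors (e.g. produced upstream by GTE-large), with sim taken as cosine similarity.

```python
import numpy as np

def contextual_coherence_score(responses, contexts):
    """CCS = (1/N) * sum_i sim(r_i, C_i), with sim = cosine similarity.
    Each element of `responses` (r_i) and `contexts` (C_i) is an
    embedding vector of the same dimension."""
    sims = []
    for r, c in zip(responses, contexts):
        r, c = np.asarray(r, dtype=float), np.asarray(c, dtype=float)
        sims.append(np.dot(r, c) / (np.linalg.norm(r) * np.linalg.norm(c)))
    return float(np.mean(sims))
```

Averaging over turns makes the score comparable across dialogues of different lengths, which is why CCS is reported per dataset rather than per conversation.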


Results and Analysis

  1. 20 Questions Game: Table I presents the accuracy, latency, memory overhead, and PTR for both models on the 20 Questions game. The memory-augmented versions of LLAMA 3 8B and Gemma 2 9B show substantial accuracy gains over the baseline, with LLAMA 3 8B improving from 62.3% to 80.4% and Gemma 2 9B from 64.8% to 82.1%. This accuracy increase underscores the proposed memory mechanism's effectiveness in maintaining and retrieving relevant contextual information for accurate guesswork.

  2. Persona-Chat and DailyDialog: Table II presents the CCS and PTR results for the Persona-Chat and DailyDialog tasks. The memory-augmented versions of both LLAMA 3 8B and Gemma 2 9B demonstrate higher CCS and PTR scores, indicating improved contextual coherence and relevance. Notably, Gemma 2 9B achieved the highest CCS (0.83) on Persona-Chat and PTR scores on both datasets, suggesting that larger models may benefit more significantly from memory augmentation in context-rich scenarios.

TABLE II PERFORMANCE ON DIALOGUE DATASETS


Memory Management Impact

We evaluated various memory management strategies, specifically relevance-based pruning and LRU eviction, on both the LLAMA 3 8B and Gemma 2 9B models. The assessment focused on key performance metrics such as accuracy, memory usage, and latency, as summarized in Table III, using the 20 Questions game task.

For the LLAMA 3 8B model, relevance-based pruning achieved the highest accuracy at 80.4%, representing a notable increase from 78.1% observed under the no pruning condition. This accuracy improvement is accompanied by a reduction in memory overhead (1022.7 MB compared to 1250.0 MB) and a slight decrease in latency (1287.8 ms versus 1400.3 ms). In contrast, the LRU eviction method resulted in the lowest accuracy at 76.5%, highlighting its inadequacy in preserving relevant contextual information.

Similarly, the Gemma 2 9B model exhibited comparable trends. The no-pruning strategy provided the highest accuracy at 82.6%, while relevance-based pruning closely followed with an accuracy of 82.1%. This approach not only maintained efficient memory usage (1173.5 MB) but also resulted in significantly lower latency (1137.3 ms). Again, the LRU eviction strategy fell short in terms of accuracy (78.2%), reinforcing the conclusion that relevance-based pruning is superior for effective contextual retention.
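The behavioral difference between the two strategies compared in Table III can be sketched in a few lines. The entry names and κ values below are illustrative placeholders, not the experimental data; the point is that LRU discards by access recency while relevance-based pruning discards by score.

```python
from collections import OrderedDict

def lru_evict(memory):
    """LRU baseline: drop the least recently accessed entry (the front
    of an access-ordered OrderedDict), regardless of its relevance."""
    memory.popitem(last=False)

def relevance_prune(memory, kappa):
    """Relevance-based pruning: drop the entry with the lowest kappa
    score, even if it was touched recently."""
    worst = min(kappa, key=kappa.get)
    memory.pop(worst)
    kappa.pop(worst)
```

Under LRU, a highly relevant but rarely revisited memory is the first to go; relevance-based pruning keeps it as long as its κ stays high, which is consistent with the accuracy gap observed between the two strategies.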

Impact of Embedding Models

The embedding network plays a crucial role in our architecture as it maps queries and responses to a high-dimensional space where similarity computations guide memory retrieval. We evaluated three widely used embedding models: GTE-large [8], Sentence Transformers MiniLM-L6-v2 [10], and Universal Sentence Encoder (USE) [11]. We tested these embedding models with both LLAMA 3 8B and Gemma 2 9B on the 20 Questions game task, maintaining all other architectural components constant.

Table IV presents the results of this analysis. The results demonstrate that the choice of embedding model significantly impacts the system's performance. GTE-large consistently outperforms other embedding models across both LLAMA 3 8B and Gemma 2 9B, achieving the highest accuracy scores (80.4% and 82.1% respectively). This superior performance can be attributed to GTE-large's enhanced ability to capture semantic relationships and nuanced contextual information critical for the 20 Questions game.

MiniLM-L6-v2, while more computationally efficient as evidenced by lower latency times, shows a moderate decrease in accuracy (75.8% for LLAMA 3 8B and 77.3% for Gemma 2 9B). The Universal Sentence Encoder, despite its general-purpose design, achieves the lowest accuracy scores among the three models (73.2% and 74.9% respectively).

Discussion

Our evaluation of the memory-augmented context-handling framework provides valuable insights into model performance, contextual retention, and memory efficiency across diverse tasks. In the 20 Questions game, both the LLAMA 3 8B and Gemma 2 9B models showed substantial improvements in accuracy with memory augmentation, underscoring the framework's ability to effectively maintain and leverage context. Specifically, LLAMA 3 8B improved from 62.3% to 80.4%, and Gemma 2 9B advanced from 64.8% to 82.1%, illustrating the significant role of memory in improving the accuracy of sequential task performance.

Further analysis on the Persona-Chat and DailyDialog datasets revealed that memory augmentation also enhances contextual coherence and relevance, as measured by the Contextual Coherence Score (CCS) and Positive Transferability Ratio (PTR). Gemma 2 9B achieved the highest CCS of 0.83 on Persona-Chat, with consistently higher PTR scores across both datasets. These results suggest that larger models, like Gemma 2 9B, may benefit even more from memory mechanisms, especially in tasks with rich contextual elements, where maintaining coherent dialogue over multiple turns is critical.

In examining memory management strategies, relevance-based pruning emerged as the more effective approach compared to LRU eviction. For LLAMA 3 8B, relevance-based pruning achieved an accuracy of 80.4%, while reducing memory overhead to 1022.7 MB and latency to 1287.8 ms, compared to no pruning. Gemma 2 9B followed a similar trend, with relevance-based pruning achieving 82.1% accuracy, while efficiently reducing memory usage to 1173.5 MB and lowering latency to 1137.3 ms. By contrast, LRU eviction led to the lowest accuracies of 76.5% for LLAMA 3 8B and 78.2% for Gemma 2 9B, highlighting its limitations in retaining essential contextual information. These findings reinforce the advantage of relevance-based pruning, particularly for tasks requiring precise and coherent contextual retention.

Our ablation study demonstrated that the GTE-large embedding consistently provided the highest accuracy across both model variants, achieving 80.4% for LLAMA 3 8B and 82.1% for Gemma 2 9B. This superior performance can be attributed to GTE-large's enhanced capacity for capturing semantic relationships and subtle contextual nuances essential for complex dialogue tasks. In comparison, MiniLM-L6-v2 showed computational efficiency benefits with lower latency (1156.4 ms for LLAMA 3 8B and 1089.5 ms for Gemma 2 9B) but experienced a moderate decline in accuracy (75.8% and 77.3%, respectively). The Universal Sentence Encoder, while useful for general-purpose applications, yielded the lowest accuracy scores (73.2% for LLAMA 3 8B and 74.9% for Gemma 2 9B), indicating that specialized embedding models like GTE-large are better suited for tasks requiring deep contextual understanding.

TABLE III IMPACT OF MEMORY MANAGEMENT STRATEGIES ON LLAMA 3 8B AND GEMMA 2 9B MODELS

TABLE IV IMPACT OF DIFFERENT EMBEDDING MODELS ON 20 QUESTIONS GAME PERFORMANCE

Conclusion

In this paper, we propose a memory-augmented architecture to address the limitations of large language models in handling long-term context. Our approach incorporates a retrieval network to selectively access relevant past interactions and a relevance-based pruning strategy to optimize memory usage. Experimental results on various tasks, including the 20 Questions Game, Persona-Chat, and DailyDialog, demonstrate significant improvements in accuracy, contextual coherence, and response relevance. Our findings suggest that memory augmentation is a promising technique for enhancing the capabilities of LLMs in real-world dialogue systems.


\begin{algorithm}[t]
\caption{Long-Term Context Handling with Memory Management}
\label{alg:ltc_handling}
\begin{algorithmic}[1]
\State \textbf{Initialize:} Memory $M_0 = \emptyset$; Time $t = 1$; Max size $N$
\While{New query $Q_t$ arrives}
\State \textbf{Embed Query:} $\mathbf{q}_t = f_{\text{enc}}(Q_t; \theta_{\text{enc}})$
\State \textbf{Retrieve Memory:}
\[
\mathbf{m}_{\text{retrieved}} = \arg\max_{\mathbf{m}_i \in M_t} \alpha(\mathbf{q}_t, \mathbf{m}_i),
\quad \text{where} \quad
\alpha(\mathbf{q}_t, \mathbf{m}_i) = \frac{\mathbf{q}_t \cdot \mathbf{m}_i}{\|\mathbf{q}_t\| \|\mathbf{m}_i\|}
\]
\State \textbf{Generate Response:} $R_t = g(\mathbf{q}_t, \mathbf{m}_{\text{retrieved}}; \theta_{\text{dec}})$
\State \textbf{Update Memory:} $\mathbf{m}_{\text{new}} = f_{\text{enc}}([Q_t \Vert R_t]; \theta_{\text{enc}})$
\State $M_{t+1} \gets M_t \cup \{\mathbf{m}_{\text{new}}\}$
\If{Memory size exceeds $N$}
\State \textbf{Manage Memory:}
\State \textbf{Option 1: LRU Eviction}
\[
\mathbf{m}_{\text{evict}} = \arg\min_{\mathbf{m}_i \in M_t} \zeta(\mathbf{m}_i)
\]
\State $M_{t+1} \gets M_t \setminus \{\mathbf{m}_{\text{evict}}\} \cup \{\mathbf{m}_{\text{new}}\}$
\State \textbf{Option 2: Relevance-Based Pruning}
\State Compute $\kappa(\mathbf{m}_i) = \max_{t - T \leq j \leq t} \alpha(\mathbf{q}_j, \mathbf{m}_i)$
\State $\mathbf{m}_{\text{prune}} = \arg\min_{\mathbf{m}_i \in M_t} \kappa(\mathbf{m}_i)$
\State $M_{t+1} \gets M_t \setminus \{\mathbf{m}_{\text{prune}}\} \cup \{\mathbf{m}_{\text{new}}\}$
\EndIf
\State $t \gets t + 1$
\EndWhile
\end{algorithmic}
\end{algorithm}
\begin{algorithm}[ht]
\caption{20 Questions Game Environment}
\label{alg:20_questions}
\begin{algorithmic}[1]
\State Initialize the game environment
\State Select a random keyword and category
\State Set turn = 1
\State Set maximum turns = 20
\While{turn $\leq$ maximum turns}
\If{guesser's turn}
\State prompt the guesser for a question
\State record the question
\State send question to answerer
\State receive answer (yes/no/maybe)
\If{guesser makes a guess}
\If{guess is correct}
\State End game with success
\Else
\State record the guess
\EndIf
\EndIf
\Else
\State prompt the answerer for a response
\State provide the answer to the guesser
\EndIf
\State Increment turn
\EndWhile
\If{maximum turns reached}
\State End game with failure
\EndIf
\end{algorithmic}
\end{algorithm}



Index Terms: Contextual Understanding, Large Language Models, Long-Term Context, Machine Learning, Memory Augmented Models, Relevance-Based Pruning

Large Language Models (LLMs) have made significant strides in natural language processing (NLP) applications, powering chatbots, virtual assistants, and customer service agents. Despite their success, a critical challenge remains: maintaining long-term context across multiple interactions. Standard LLMs process each query independently, often overlooking the continuity of prior exchanges. This limitation can result in responses that lack coherence and contextual relevance, and can even lead to hallucinations in extended dialogues or complex tasks [1, 2].

The fixed input size of transformer-based architectures further exacerbates the issue by limiting the ability to capture long-term dependencies. As conversations grow, important contextual information is lost, leading to fragmented exchanges and a diminished user experience. While existing memory-augmented approaches attempt to store past interactions, they often face memory bloat and increased inference latency, making them impractical for real-time applications. Therefore, there is a pressing need for an efficient architecture that retrieves only the most relevant information while dynamically managing memory to maintain performance.

To address this challenge, various memory-augmented approaches have been proposed. These methods typically involve incorporating additional memory modules into the LLM architecture to store and retrieve relevant past information. For instance, LongMem [3] utilizes a decoupled memory architecture with a retrieval network to select relevant memories. CAMELoT [4] introduces an associative memory module to improve context handling within LLMs. However, these approaches often rely on LRU eviction for memory management, which may not be optimal for retaining highly relevant information.

Larimar [5] is another recent approach that focuses on fast and efficient learning of contextual information. While Larimar shares the goal of improving context-awareness, it differs from our work in its focus on memory updates and adaptation rather than long-term retrieval.

In this paper, we propose a memory-augmented architecture, visualized in Figure 1, that focuses on efficient long-term context handling. Unlike existing methods, we introduce a relevance-based pruning strategy to selectively retain the most pertinent information from past interactions. This approach complements LRU eviction by ensuring that crucial context is preserved, even if it’s not the most recently accessed.

Furthermore, our method is designed to be modular and easily integrated with existing LLM architectures. By leveraging a retrieval network to select relevant memories, we avoid the need for significant modifications to the LLM itself. This flexibility allows for seamless integration with various LLM models and facilitates experimentation with different memory management strategies.

Our contributions can be summarized as follows:

Relevance-Based Pruning: We introduce a memory management strategy that prioritizes the retention of highly relevant information, improving the quality and coherence of LLM responses.

Modular Architecture: Our approach is designed to be easily integrated with existing LLM architectures, requiring minimal modifications to the base model.

The proposed memory-augmented architecture differs from Retrieval-Augmented Generation (RAG) [6] by focusing on adaptive, real-time memory management tailored for ongoing dialogues, as opposed to RAG’s static knowledge retrieval from fixed databases. Unlike RAG, which supplements isolated responses with external data, this approach dynamically updates, retrieves, and prunes memory entries based on contextual relevance, preserving interaction-specific information critical to long-term coherence. By employing relevance-based pruning instead of LRU-based eviction, the proposed method retains essential context efficiently, avoids memory bloat, and ensures scalability, making it well-suited for applications requiring sustained context adaptation over extended interactions.

This section presents a memory-augmented framework designed for context-grounded response generation in Large Language Models (LLMs). The key challenge in maintaining coherent responses across long-term interactions lies in effectively managing the context accumulated from prior exchanges. Our approach addresses this challenge by introducing an adaptive memory mechanism that dynamically retrieves, updates, and manages relevant past interactions to inform future responses.

Given a sequence of user inputs $\{Q_1, Q_2, \dots, Q_t\}$, the goal is to generate a response $R_t$ at time $t$ that reflects both the current query $Q_t$ and the pertinent context from prior interactions. To achieve this, each query is mapped to a high-dimensional embedding space using an encoder network $f_{\text{enc}}$, parameterized by $\theta_{\text{enc}}$:

$$ \mathbf{q}_t = f_{\text{enc}}(Q_t; \theta_{\text{enc}}). $$

These query embeddings are then used to retrieve relevant memory entries from the memory store, ensuring that the responses generated by the LLM remain contextually grounded and coherent throughout the interaction.

The memory store at time $t$, denoted by $M_t = \{\mathbf{m}_1, \mathbf{m}_2, \dots, \mathbf{m}_N\}$, is a set of memory slots, each encoding information from previous interactions. To maintain relevance, a query-driven retrieval mechanism computes the similarity between the current query embedding $\mathbf{q}_t$ and each memory entry $\mathbf{m}_i$. The similarity score is defined as the cosine similarity:

$$ \alpha(\mathbf{q}_t, \mathbf{m}_i) = \frac{\mathbf{q}_t \cdot \mathbf{m}_i}{\|\mathbf{q}_t\| \|\mathbf{m}_i\|}. $$
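The retrieval step can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: `encode` is a deterministic stub standing in for the encoder $f_{\text{enc}}$ (a real system would call a sentence-embedding model such as GTE-large), and `alpha` is the cosine similarity defined above.

```python
import numpy as np

def encode(text: str, dim: int = 8) -> np.ndarray:
    """Stub for the encoder f_enc: deterministic within one run.
    A real system would use a sentence-embedding model such as GTE-large."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def alpha(q: np.ndarray, m: np.ndarray) -> float:
    """Cosine similarity alpha(q_t, m_i) = (q_t . m_i) / (||q_t|| ||m_i||)."""
    return float(np.dot(q, m) / (np.linalg.norm(q) * np.linalg.norm(m)))

def retrieve(q: np.ndarray, memory: list) -> int:
    """Index of the memory slot with the highest similarity score."""
    return int(np.argmax([alpha(q, m) for m in memory]))

memory = [encode("user asked about cats"), encode("user asked about planes")]
best = retrieve(encode("user asked about cats"), memory)  # -> 0
```

Because the stub encoder is deterministic within a run, an identical query maps back to its own slot; with real embeddings, semantically similar queries would score highest instead.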

The memory entry with the highest similarity score is retrieved and used to condition the LLM’s response generation process. The generated response $R_t$ is expressed as:

$$ R_t = g(\mathbf{q}_t, \mathbf{m}_{\text{retrieved}}; \theta_{\text{dec}}), $$

where $g$ is the decoder network of the LLM parameterized by $\theta_{\text{dec}}$.

After generating the response, the memory store is updated with the new interaction. The new memory entry is constructed by encoding the concatenated query and response:

$$ \mathbf{m}_{\text{new}} = f_{\text{enc}}([Q_t \Vert R_t]; \theta_{\text{enc}}). $$

The updated memory store becomes:

$$ M_{t+1} = M_t \cup \{\mathbf{m}_{\text{new}}\}. $$

Since the memory size grows with each interaction, efficient memory management is crucial to ensure the system remains scalable. We introduce two memory management strategies: Least Recently Used (LRU) eviction and relevance-based pruning. In the LRU strategy, the memory entry with the smallest access time is evicted to make room for new entries:

$$ \mathbf{m}_{\text{evict}} = \arg\min_{\mathbf{m}_i \in M_t} \zeta(\mathbf{m}_i), $$

where $\zeta(\mathbf{m}_i)$ is the timestamp of the last access of memory $\mathbf{m}_i$. In relevance-based pruning, a relevance score $\kappa(\mathbf{m}_i)$ is assigned to each memory slot based on the maximum similarity with recent queries:

$$ \kappa(\mathbf{m}_i) = \max_{t - T \leq j \leq t} \alpha(\mathbf{q}_j, \mathbf{m}_i), $$

where $T$ is the window size of recent queries. The least relevant memory entry is removed:

$$ \mathbf{m}_{\text{prune}} = \arg\min_{\mathbf{m}_i \in M_t} \kappa(\mathbf{m}_i). $$

These strategies prevent the memory store from growing indefinitely, ensuring that only the most relevant interactions are retained.
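The two eviction policies can be contrasted in a small sketch. The class below is illustrative only (names and interface are assumptions, not the authors' code): `last_access` plays the role of $\zeta(\mathbf{m}_i)$ for LRU eviction, and the relevance score $\kappa(\mathbf{m}_i)$ is computed against a window of the $T$ most recent query embeddings.

```python
from collections import deque
import numpy as np

class MemoryStore:
    """Sketch of a bounded memory store supporting both eviction policies
    described above. The class and method names are illustrative."""

    def __init__(self, max_slots: int, window: int = 5):
        self.slots = []                              # memory embeddings m_i
        self.last_access = []                        # zeta(m_i): last access step
        self.recent_queries = deque(maxlen=window)   # last T query embeddings
        self.max_slots = max_slots
        self.step = 0

    @staticmethod
    def _cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(self, q):
        """Return the most similar memory and record the access."""
        self.recent_queries.append(q)
        idx = int(np.argmax([self._cos(q, m) for m in self.slots]))
        self.last_access[idx] = self.step
        return self.slots[idx]

    def add(self, m_new, policy="relevance"):
        """Insert a new entry, evicting one slot first if the store is full."""
        self.step += 1
        if len(self.slots) >= self.max_slots:
            if policy == "lru":
                idx = int(np.argmin(self.last_access))   # oldest zeta(m_i)
            else:
                # kappa(m_i): max similarity to the recent query window
                kappa = [max((self._cos(q, m) for q in self.recent_queries),
                             default=0.0)
                         for m in self.slots]
                idx = int(np.argmin(kappa))              # least relevant slot
            del self.slots[idx]
            del self.last_access[idx]
        self.slots.append(m_new)
        self.last_access.append(self.step)
```

With a full store, `add(..., policy="relevance")` drops the slot least similar to recent queries, whereas `policy="lru"` drops the least recently retrieved slot regardless of its relevance, which is exactly the failure mode the experiments later attribute to LRU eviction.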

The complete algorithm for long-term context handling with memory management is presented in Algorithm 1.

This section provides a comprehensive analysis of our memory-augmented model’s performance on various tasks.

We evaluate our proposed memory-augmented framework on three tasks known for challenging a model’s memory and contextual capabilities:

20 Questions Game: This turn-based guessing game requires models to infer an unknown entity within 20 questions based on hints provided during each turn. Given the cumulative nature of the clues, efficient memory handling and relevance-driven filtering are critical.

Persona-Chat: This dataset contains dialogues where each agent follows a predefined persona. It emphasizes the importance of memory consistency across turns, as the model must maintain persona coherence while responding contextually and appropriately.

DailyDialog: DailyDialog comprises over 13,000 dialogues across various topics, providing a natural conversational setting that tests a model’s ability to retain and leverage long-term context for more realistic conversational flows.

To ensure fair comparisons, both baseline and memory-augmented variants of LLAMA 3 8B and Gemma 2 9B are tested under identical hardware setups: an Intel Xeon 3.4 GHz processor, 64 GB RAM, and an NVIDIA RTX 3090 GPU, with PyTorch [7] used for model training and testing. We use GTE-large [8] as the embedding network in our experiments.

We select a combination of synthetic and real-world datasets that pose unique memory-related challenges:

20 Questions Game: We created a custom dataset comprising 100 entities across ten categories, ranging from animals to objects. The model’s objective is to deduce the correct entity within 20 turns, based on cumulative user hints.

Persona-Chat: This dataset contains dialogues between two agents, each equipped with a unique persona, requiring the model to recall persona-relevant details and respond with consistency over multiple turns.

DailyDialog: Comprising a wide variety of everyday topics, DailyDialog allows us to test the model’s general conversation abilities and long-term memory retention in a natural conversational setting.

The 20 Questions game environment simulates an interactive guessing game where a player, designated as the guesser, attempts to identify a secret keyword through a series of yes-or-no questions directed at an answerer. The environment is structured to foster a dialogue between two agents, which can either be LLM-based or random guessers and answerers, allowing for a controlled assessment of the LLM’s capabilities in handling long-term context and maintaining coherence in dialogue.

At the start of each game, a keyword is randomly selected from a predefined list of categories and associated keywords. This keyword represents a real or fictional entity that the guesser aims to identify. The environment employs a set of structured prompts to facilitate communication between the guesser and answerer agents. When the guesser asks a question, the answerer provides a response based on the current state of the game, which may include “yes,” “no,” or “maybe,” depending on the nature of the question posed.

The game progresses through alternating turns between the guesser and answerer, with the state of the game captured in observations that include the questions asked, the answers provided, and any guesses made. Each agent maintains a record of their interactions, which is crucial for both short-term decision-making and long-term contextual awareness. This design emphasizes the necessity for the agents to recall past exchanges to make informed guesses and responses, directly testing the memory capabilities of the LLM.

To manage the flow of the game, a turn-based mechanism is implemented where agents alternate roles. If the guesser successfully identifies the keyword within the allowed number of questions, the game concludes positively for that agent. Conversely, if the guesser fails to guess the keyword within the stipulated turns, the game ends in failure. The environment also incorporates mechanisms to handle erroneous responses and timeouts, ensuring that the game can adapt to various scenarios and maintain its integrity.

The algorithm describing the working of the 20 Questions game environment is given in Algorithm 2, and the code implementation is available at [9].
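The turn loop of the environment can be illustrated with a toy sketch. The scripted agents below are stand-ins for LLM-based players and are purely illustrative (the actual environment is the Kaggle implementation cited above); the loop mirrors the alternating question/answer structure with termination on a correct guess or turn exhaustion.

```python
def play_20_questions(keyword, guesser, answerer, max_turns=20):
    """Toy turn loop: the guesser poses guesses, the answerer replies, and
    the game ends on a correct guess or when the turn budget is exhausted."""
    history = []
    for turn in range(1, max_turns + 1):
        question = guesser(history)
        answer = answerer(keyword, question)
        history.append((question, answer))
        if question.startswith("Is it ") and question[6:-1] == keyword:
            return {"won": True, "turns": turn}
    return {"won": False, "turns": max_turns}

def scripted_guesser(history):
    """Stand-in for an LLM guesser: works through a fixed candidate list,
    using the recorded history to avoid repeating guesses."""
    candidates = ["cat", "car", "tree"]
    guessed = {q[6:-1] for q, _ in history}
    remaining = [c for c in candidates if c not in guessed]
    return f"Is it {remaining[0]}?"

def scripted_answerer(keyword, question):
    """Stand-in for an LLM answerer: yes/no for direct guesses."""
    if question.startswith("Is it "):
        return "yes" if question[6:-1] == keyword else "no"
    return "maybe"

result = play_20_questions("tree", scripted_guesser, scripted_answerer)
# -> {"won": True, "turns": 3}
```

The `history` list passed to the guesser each turn is what makes this environment a memory test: an agent that cannot recall earlier answers will repeat or contradict its own questions.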

We employ the following metrics to evaluate the performance and memory efficiency of the proposed framework:

Accuracy: Measures the proportion of correct guesses made by the model, reflecting its ability to maintain and leverage context.

Contextual Coherence Score (CCS): Defined as

$$ \text{CCS} = \frac{1}{N} \sum_{i=1}^{N} \text{sim}(r_i, C_i), $$

where $N$ is the number of turns, $r_i$ the model’s response, $C_i$ the context vector, and $\text{sim}(r_i, C_i)$ the cosine similarity between the response and the context.

Memory Overhead: Reflects the memory consumption throughout the interaction, providing insight into the model’s scalability.

Latency: Records the average response time per query, relevant for real-time applications.
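The coherence metric is straightforward to compute once responses and contexts are embedded. A minimal sketch, assuming the embeddings are already available as vectors (how $r_i$ and $C_i$ are embedded is left abstract here):

```python
import numpy as np

def ccs(response_embs, context_embs):
    """Contextual Coherence Score: mean cosine similarity between each
    response embedding r_i and its context vector C_i,
    CCS = (1/N) * sum_i sim(r_i, C_i)."""
    sims = [
        float(np.dot(r, c) / (np.linalg.norm(r) * np.linalg.norm(c)))
        for r, c in zip(response_embs, context_embs)
    ]
    return sum(sims) / len(sims)

responses = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
contexts  = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
score = ccs(responses, contexts)  # (1.0 + 0.0) / 2 = 0.5
```

A perfectly on-context dialogue scores 1.0; orthogonal responses pull the mean toward 0, which is why the reported CCS values (e.g. 0.83 for Gemma 2 9B on Persona-Chat) sit between those extremes.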

Table 1 presents the accuracy, latency, memory overhead, and Positive Transferability Ratio (PTR) for both models on the 20 Questions game. The memory-augmented versions of LLAMA 3 8B and Gemma 2 9B show substantial accuracy gains over the baseline, with LLAMA 3 8B improving from 62.3% to 80.4% and Gemma 2 9B from 64.8% to 82.1%. This accuracy increase underscores the proposed memory mechanism’s effectiveness in maintaining and retrieving relevant contextual information for accurate guesswork.

Table 2 presents the CCS and PTR results for the Persona-Chat and DailyDialog tasks. The memory-augmented versions of both LLAMA 3 8B and Gemma 2 9B demonstrate higher CCS and PTR scores, indicating improved contextual coherence and relevance. Notably, Gemma 2 9B achieved the highest CCS (0.83) on Persona-Chat and the highest PTR scores on both datasets, suggesting that larger models may benefit more significantly from memory augmentation in context-rich scenarios.

We evaluated various memory management strategies, specifically relevance-based pruning and LRU eviction, on both the LLAMA 3 8B and Gemma 2 9B models. The assessment focused on key performance metrics such as accuracy, memory usage, and latency, as summarized in Table 3, using the 20 Questions game task.

For the LLAMA 3 8B model, relevance-based pruning achieved the highest accuracy at 80.4%, representing a notable increase from 78.1% observed under the no pruning condition. This accuracy improvement is accompanied by a reduction in memory overhead (1022.7 MB compared to 1250.0 MB) and a slight decrease in latency (1287.8 ms versus 1400.3 ms). In contrast, the LRU eviction method resulted in the lowest accuracy at 76.5%, highlighting its inadequacy in preserving relevant contextual information.

Similarly, the Gemma 2 9B model exhibited comparable trends. The no-pruning strategy provided the highest accuracy at 82.6%, while relevance-based pruning closely followed with an accuracy of 82.1%. This approach not only maintained efficient memory usage (1173.5 MB) but also resulted in significantly lower latency (1137.3 ms). Again, the LRU eviction strategy fell short in terms of accuracy (78.2%), reinforcing the conclusion that relevance-based pruning is superior for effective contextual retention.

The embedding network plays a crucial role in our architecture as it maps queries and responses to a high-dimensional space where similarity computations guide memory retrieval. We evaluated three widely used embedding models: GTE-large [8], Sentence Transformers MiniLM-L6-v2 [10] and Universal Sentence Encoder (USE) [11]. We tested these embedding models with both LLAMA 3 8B and Gemma 2 9B on the 20 Questions game task, maintaining all other architectural components constant.

Table 4 presents the results of this analysis. The results demonstrate that the choice of embedding model significantly impacts the system’s performance. GTE-large consistently outperforms other embedding models across both LLAMA 3 8B and Gemma 2 9B, achieving the highest accuracy scores (80.4% and 82.1% respectively). This superior performance can be attributed to GTE-large’s enhanced ability to capture semantic relationships and nuanced contextual information critical for the 20 Questions game.

MiniLM-L6-v2, while more computationally efficient as evidenced by lower latency times, shows a moderate decrease in accuracy (75.8% for LLAMA 3 8B and 77.3% for Gemma 2 9B). The Universal Sentence Encoder, despite its general-purpose design, achieves the lowest accuracy scores among the three models (73.2% and 74.9% respectively).

Our evaluation of the memory-augmented context-handling framework provides valuable insights into model performance, contextual retention, and memory efficiency across diverse tasks. In the 20 Questions game, both the LLAMA 3 8B and Gemma 2 9B models showed substantial improvements in accuracy with memory augmentation, underscoring the framework’s ability to effectively maintain and leverage context. Specifically, LLAMA 3 8B improved from 62.3% to 80.4%, and Gemma 2 9B advanced from 64.8% to 82.1%, illustrating the significant role of memory in improving the accuracy of sequential task performance.

Further analysis on the Persona-Chat and DailyDialog datasets revealed that memory augmentation also enhances contextual coherence and relevance, as measured by the Contextual Coherence Score (CCS) and Positive Transferability Ratio (PTR). Gemma 2 9B achieved the highest CCS of 0.83 on Persona-Chat, with consistently higher PTR scores across both datasets. These results suggest that larger models, like Gemma 2 9B, may benefit even more from memory mechanisms, especially in tasks with rich contextual elements, where maintaining coherent dialogue over multiple turns is critical.

In examining memory management strategies, relevance-based pruning emerged as the more effective approach compared to LRU eviction. For LLAMA 3 8B, relevance-based pruning achieved an accuracy of 80.4%, while reducing memory overhead to 1022.7 MB and latency to 1287.8 ms, compared to no pruning. Gemma 2 9B followed a similar trend, with relevance-based pruning achieving 82.1% accuracy, while efficiently reducing memory usage to 1173.5 MB and lowering latency to 1137.3 ms. By contrast, LRU eviction led to the lowest accuracies of 76.5% for LLAMA 3 8B and 78.2% for Gemma 2 9B, highlighting its limitations in retaining essential contextual information. These findings reinforce the advantage of relevance-based pruning, particularly for tasks requiring precise and coherent contextual retention.

Our ablation study demonstrated that the GTE-large embedding consistently provided the highest accuracy across both model variants, achieving 80.4% for LLAMA 3 8B and 82.1% for Gemma 2 9B. This superior performance can be attributed to GTE-large’s enhanced capacity for capturing semantic relationships and subtle contextual nuances essential for complex dialogue tasks. In comparison, MiniLM-L6-v2 showed computational efficiency benefits with lower latency (1156.4 ms for LLAMA 3 8B and 1089.5 ms for Gemma 2 9B) but experienced a moderate decline in accuracy (75.8% and 77.3%, respectively). The Universal Sentence Encoder, while useful for general-purpose applications, yielded the lowest accuracy scores (73.2% for LLAMA 3 8B and 74.9% for Gemma 2 9B), indicating that specialized embedding models like GTE-large are better suited for tasks requiring deep contextual understanding.

Conclusion

In this paper, we propose a memory-augmented architecture to address the limitations of large language models in handling long-term context. Our approach incorporates a retrieval network to selectively access relevant past interactions and a relevance-based pruning strategy to optimize memory usage. Experimental results on various tasks, including the 20 Questions Game, Persona-Chat, and DailyDialog, demonstrate significant improvements in accuracy, contextual coherence, and response relevance. Our findings suggest that memory augmentation is a promising technique for enhancing the capabilities of LLMs in real-world dialogue systems.

Table 1: Performance on the 20 Questions Game

| Model | Method | Accuracy (%) | Latency (ms) | Memory Overhead (MB) | PTR (%) |
|---|---|---|---|---|---|
| LLAMA 3 8B | Baseline | 62.3 | - | - | 0.0 |
| LLAMA 3 8B | Proposed | 80.4 | 1287.8 | 1022.7 | 35.2 |
| Gemma 2 9B | Baseline | 64.8 | - | - | 0.0 |
| Gemma 2 9B | Proposed | 82.1 | 1137.3 | 1173.5 | 36.8 |

Table 3: Impact of Memory Management Strategies on LLAMA 3 8B and Gemma 2 9B Models

| Model | Strategy | Accuracy (%) | Memory Overhead (MB) | Latency (ms) |
|---|---|---|---|---|
| LLAMA 3 8B | No Pruning | 78.1 | 1250.0 | 1400.3 |
| LLAMA 3 8B | Relevance-Based Pruning | 80.4 | 1022.7 | 1287.8 |
| LLAMA 3 8B | LRU Eviction | 76.5 | 1030.4 | 1295.6 |
| Gemma 2 9B | No Pruning | 82.6 | 1320.5 | 1375.2 |
| Gemma 2 9B | Relevance-Based Pruning | 82.1 | 1173.5 | 1137.3 |
| Gemma 2 9B | LRU Eviction | 78.2 | 1103.4 | 1261.4 |

Figure 1: Context-Grounded Response Generation Framework for LLMs. This figure illustrates the proposed memory-augmented architecture designed for effective long-term context handling in LLMs. User queries $Q_t$ are mapped to embeddings $\mathbf{q}_t$ via an embedding network, which are then matched with past interaction embeddings $\mathbf{m}_i, \forall i$ in the memory store based on cosine similarity $\alpha(\mathbf{q}_t, \mathbf{m}_i)$. The most relevant memory $\mathbf{m}_{\text{retrieved}}$ is selected to condition the LLM response $R_t$, grounding it in prior context. After generating the response, a new memory entry $\mathbf{m}_{\text{new}}$ is added. To keep memory scalable, either Least Recently Used (LRU) eviction or relevance-based pruning is applied, ensuring the retention of only the most pertinent past interactions.

$$ \mathbf{q}_{t} = f_{\text{enc}}(Q_{t}; \theta_{\text{enc}}). $$ \tag{S2.Ex1}

$$ \alpha(\mathbf{q}_{t}, \mathbf{m}_{i}) = \frac{\mathbf{q}_{t} \cdot \mathbf{m}_{i}}{\|\mathbf{q}_{t}\| \|\mathbf{m}_{i}\|}. $$ \tag{S2.Ex2}

$$ M_{t+1} = M_{t} \cup \{\mathbf{m}_{\text{new}}\}. $$ \tag{S2.Ex5}

$$ \mathbf{m}_{\text{evict}} = \arg\min_{\mathbf{m}_{i} \in M_{t}} \zeta(\mathbf{m}_{i}), $$ \tag{S2.Ex6}

$$ \kappa(\mathbf{m}_{i}) = \max_{t-T \leq j \leq t} \alpha(\mathbf{q}_{j}, \mathbf{m}_{i}), $$ \tag{S2.Ex7}

$$ \text{CCS} = \frac{1}{N} \sum_{i=1}^{N} \text{sim}(r_{i}, C_{i}) $$ \tag{S3.Ex12}

Table 2: CCS and PTR Results on Persona-Chat and DailyDialog

| Dataset | Model | Method | CCS | PTR (%) |
|---|---|---|---|---|
| Persona-Chat | LLAMA 3 8B | Baseline | 0.65 | 25.0 |
| Persona-Chat | LLAMA 3 8B | Proposed | 0.74 | 28.1 |
| Persona-Chat | Gemma 2 9B | Baseline | 0.72 | 27.5 |
| Persona-Chat | Gemma 2 9B | Proposed | 0.83 | 30.5 |
| DailyDialog | LLAMA 3 8B | Baseline | 0.60 | 29.0 |
| DailyDialog | LLAMA 3 8B | Proposed | 0.69 | 31.4 |
| DailyDialog | Gemma 2 9B | Baseline | 0.75 | 30.0 |
| DailyDialog | Gemma 2 9B | Proposed | 0.81 | 33.2 |

Table 4: Impact of Different Embedding Models on 20 Questions Game Performance

| Model | Embedding Model | Accuracy (%) | Memory Overhead (MB) | Latency (ms) |
|---|---|---|---|---|
| LLAMA 3 8B | GTE-large | 80.4 | 1022.7 | 1287.8 |
| LLAMA 3 8B | MiniLM-L6-v2 | 75.8 | 985.3 | 1156.4 |
| LLAMA 3 8B | USE | 73.2 | 1012.5 | 1245.6 |
| Gemma 2 9B | GTE-large | 82.1 | 1173.5 | 1137.3 |
| Gemma 2 9B | MiniLM-L6-v2 | 77.3 | 1125.8 | 1089.5 |
| Gemma 2 9B | USE | 74.9 | 1158.2 | 1112.7 |

References


[1] S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal, “Detecting hallucinations in large language models using semantic entropy,” Nature, vol. 630, no. 8017, pp. 625–630, 2024.

[2] K. Verspoor, “‘Fighting fire with fire’: Using LLMs to combat LLM hallucinations,” 2024.

[3] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei, “Augmenting language models with long-term memory,” 2023. [Online]. Available: https://arxiv.org/abs/2306.07174

[4] Z. He, L. Karlinsky, D. Kim, J. McAuley, D. Krotov, and R. Feris, “CAMELoT: Towards large language models with training-free consolidated associative memory,” in First Workshop on Long-Context Foundation Models @ ICML 2024, 2024. [Online]. Available: https://openreview.net/forum?id=VLDTzg1a4Y

[5] P. Das, S. Chaudhury, E. Nelson, I. Melnyk, S. Swaminathan, S. Dai, A. Lozano, G. Kollias, V. Chenthamarakshan, J. Navrátil, S. Dan, and P.-Y. Chen, “Larimar: Large language models with episodic memory control,” 2024. [Online]. Available: https://arxiv.org/abs/2403.11901

[6] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 9459–9474. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2020/file/6b493230205f780e1bc26945df7481e5-Paper.pdf

[7] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” 2017.

[8] Z. Li, X. Zhang, Y. Zhang, D. Long, P. Xie, and M. Zhang, “Towards general text embeddings with multi-stage contrastive learning,” 2023. [Online]. Available: https://arxiv.org/abs/2308.03281

[9] Kaggle, “20 Questions environment,” https://github.com/Kaggle/kaggle-environments/blob/master/kaggle_environments/envs/llm_20_questions/llm_20_questions.py, 2024, accessed: 2024-11-07.

[10] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Conference on Empirical Methods in Natural Language Processing, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:201646309

[11] D. M. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y.-H. Sung, B. Strope, and R. Kurzweil, “Universal Sentence Encoder,” ArXiv, vol. abs/1803.11175, 2018. [Online]. Available: https://api.semanticscholar.org/CorpusID:4494896