
A Multi-Memory Segment System for Generating High-Quality Long-Term Memory Content in Agents

Gaoke Zhang, Bo Wang, Yunlong Ma, Dongming Zhao, Zifei Yu

Abstract

Agents powered by large language models have achieved impressive results, but effectively handling the vast amounts of historical data generated during interactions remains a challenge. The current approach is to design a memory module for the agent to process these data. However, existing methods, such as MemoryBank and A-MEM, store memory content of poor quality, which affects recall performance and response quality. To better construct high-quality long-term memory content, we have designed a multi-memory segment system (MMS) inspired by cognitive psychology theory. The system processes short-term memory into multiple long-term memory segments, and constructs retrieval memory units and contextual memory units based on these segments, with a one-to-one correspondence between the two. During the retrieval phase, MMS matches the most relevant retrieval memory units based on the user's query. The corresponding contextual memory units are then used as context for the response stage to enhance knowledge, thereby effectively utilizing historical data. Experiments on the LoCoMo dataset compared our method with three others, proving its effectiveness. Ablation studies confirmed the rationality of our memory units. We also analyzed robustness with respect to the number of selected memory segments and the storage overhead, demonstrating the method's practical value.


Gaoke Zhang 1 , Bo Wang 1 * , Yunlong Ma 1 , Dongming Zhao 2 , Zifei Yu 3

1 College of Intelligence and Computing, Tianjin University

2 AI Lab, China Mobile Communication Group Tianjin Co., Ltd

3 Huizhi Xingyuan Information Technology Co., Ltd

{zhanggaoke, bo_wang}@tju.edu.cn

* Bo Wang is the corresponding author.

In the current field of agent memory, extensive explorations have been conducted in memory retrieval, yet few studies have focused on the memory content itself. Most research simply stores summarized versions of historical dialogues, as exemplified by methods like A-MEM and MemoryBank. However, when humans form long-term memories, the process involves multi-dimensional and multi-component generation, rather than merely creating simple summaries. The low-quality memory content generated by existing methods can adversely affect recall performance and response quality. To better construct high-quality long-term memory content, we have designed a multi-memory segment system (MMS) inspired by cognitive psychology theory. The system processes short-term memory into multiple long-term memory segments, and constructs retrieval memory units and contextual memory units based on these segments, with a one-to-one correspondence between the two. During the retrieval phase, MMS matches the most relevant retrieval memory units based on the user's query. The corresponding contextual memory units are then used as context for the response stage to enhance knowledge, thereby effectively utilizing historical data. We conducted experiments on the LoCoMo dataset and further performed ablation experiments, robustness experiments on the number of input memories, and overhead experiments, which demonstrate the effectiveness and practical value of our method.

Introduction

Agents driven by large language models (LLMs) have emerged as an effective approach to addressing dialogue system challenges (Xi et al., 2025). The memory capabilities of agents are gradually becoming a pivotal factor in propelling them towards higher-level cognition and autonomous behavior (Packer et al., 2023; Zhang et al., 2024; Zhou et al., 2025). Traditional short-term memory mechanisms are inadequate for meeting the demands of complex tasks, which require contextual understanding, long-term reasoning, and personalized responses (Wang et al., 2024). Consequently, constructing a memory system with durability, structure, and adaptability has become one of the core challenges in realizing truly "intelligent" agents.

To address this issue, researchers have explored both parametric and non-parametric approaches to optimize the performance of LLMs in long-form tasks. Parametrically, by augmenting LLMs with additional parameters, knowledge can be retained for future use (Wang et al., 2023). The Elastic Weight Consolidation algorithm (Huszár, 2018), developed through collaboration between DeepMind and Imperial College London, enables neural networks to retain knowledge from previous tasks while learning new ones, mitigating the "catastrophic forgetting" problem and marking a significant step towards continuous learning in AI. However, parametric memory is prone to generating distorted and non-factual outputs and lacks interpretability (Ji et al., 2023). Non-parametrically, external components are employed to enhance the long-term memory capabilities of LLMs. The external memory module of an agent can store historical dialogue content, often using the Retrieval-Augmented Generation (RAG) approach to store long-term memories as vectors in a database. When a new query arises, relevant vectors are matched from the vector database to serve as memory content (Gao et al., 2023; Alonso et al., 2024). However, simply storing historical dialogues in a database does not adequately simulate the human memory formation process, nor does it effectively match information from historical dialogues.

Various memory architectures have been proposed, such as A-MEM (Xu et al., 2025), which extracts keywords, constructs summaries, and creates tags from dialogues as memory units, employing the Zettelkasten method to dynamically index and link knowledge networks, and MemoryBank (Zhong et al., 2024), which summarizes dialogue content and analyzes user personality and mood to serve as memory units. Yet these approaches often fall short in practical scenarios. Users pose questions from diverse perspectives, but the aforementioned methods simply extract keywords and summaries, saving them as memory content. This leads to a gap between users' queries and the content in memory units, with the memory content often lacking in quality, thereby affecting retrieval recall effectiveness. To enhance retrieval performance, we integrate Tulving's memory system classification (Tulving, 1985) and the encoding specificity principle (Tulving and Thomson, 1973), extracting episodic memory, semantic memory, cognitive perspectives, and keywords as the content of our memory units. This approach provides higher-quality memory content, thereby improving retrieval recall effectiveness.

Currently, AI agents' memory systems encompass various types, including Short-Term Memory (STM), Long-Term Memory (LTM), Episodic Memory, Semantic Memory, and Procedural Memory (Sumers et al., 2023). STM is utilized for processing immediate contexts, such as the context window in dialogue systems. LTM, on the other hand, enables cross-session information storage and retrieval through databases, knowledge graphs, or vector embeddings (Gutiérrez et al., 2024; Xi et al., 2025; Hu et al., 2023). Episodic Memory allows agents to recall specific events, supporting case-based reasoning (Tulving et al., 1972). Semantic Memory stores structured factual knowledge, facilitating logical reasoning and knowledge retrieval (Tulving, 1986). We have designed a multi-memory segment system based on cognitive psychology theory to generate high-quality long-term memory.

Our contributions are as follows:

(1) Current research rarely considers the quality of memory content, leaving it poor. To address the issue of low-quality memory content in memory systems, we draw on multiple memory systems theory, the encoding specificity principle, and levels of processing theory to propose a multi-memory segment system that enhances the long-term memory capabilities of agents. The system extracts keywords, multiple cognitive perspectives, episodic memory, and semantic memory as memory segments through the analysis and processing of short-term memory, constructing high-quality long-term memory content.

(2) To meet the differing requirements of the retrieval stage and the generation stage, we build retrieval memory units for relevance matching with queries, and contextual memory units that serve as context for knowledge enhancement during generation.

(3) We conducted experiments on the LoCoMo dataset to evaluate retrieval and generation capabilities, and also performed an ablation study. These experimental results demonstrate the effectiveness of our method.

Cognitive Psychology Acting on Memory

The theory of multiple memory systems (Tulving, 1985) posits that memory is not processed by a single system, but rather consists of multiple subsystems that are functionally independent and have different structural foundations. These systems are responsible for different types of information processing and storage, with specific neural bases and behavioral characteristics. Endel Tulving proposed a three-category classification model in 1985 (Tulving, 1985), including: Procedural Memory, involving the learning of skills and habits, such as riding a bicycle; Semantic Memory, storing factual and conceptual knowledge, such as the name of a capital city; and Episodic Memory, recording personal experiences and events, such as the last birthday party. The Levels of Processing Theory (Craik and Lockhart, 1972) suggests that the formation of memory does not depend on independent storage systems such as short-term and long-term memory, but rather on the depth and manner of information processing. The Encoding Specificity Principle (Tulving and Thomson, 1973) states that the memory effect of information depends on its processing method and context during encoding. Specifically, when information is encoded, the environment, emotional state, sensory stimulation, and so on all become part of the memory trace. Therefore, retrieval is most effective when the conditions during retrieval match those during encoding.

Inspired by the levels of processing theory, we process short-term memory content via a multi-level approach to form long-term memory representations. Considering the multiple memory systems theory, we categorize semantic and episodic memories as long-term memory components. Recognizing that users' diverse questions reflect varied perspectives on the original memory content, we apply the encoding specificity principle to enhance recall by aligning the encoded content closely with the question's context. We integrate keywords from short-term memory and diverse cognitive angles derived from it into long-term memory representations to boost matching efficacy.

Figure 1: Schematic of the multi-memory segment system process: after acquiring short-term memory, MMS processes it into memory segments and constructs retrieval memory units and contextual memory units. During retrieval, the k most relevant retrieval units are matched to the query, and their corresponding contextual units are then used as context input for the agent's response.

Memory Systems for Diverse Content

In recent years, research on the long-term memory mechanisms of agents has exhibited a diversifying trend. MemoryBank (Zhong et al., 2024) achieves long-term storage and retrieval of knowledge in multi-turn dialogues by introducing an external memory bank, integrating three modules (a writer, a retriever, and a reader) and leveraging the Ebbinghaus memory curve theory. It boasts strong scalability and modularity, yet it suffers from coarse-grained memory content and suboptimal memory selection. MemoChat (Lu et al., 2023), on the other hand, injects user-related information into the model in the form of static summaries using concise, manually constructed "memos," significantly enhancing consistency and efficiency in open-domain dialogues. However, it relies on high-quality memo construction and lacks the capability for dynamic learning and memory adjustment. Think-in-Memory (Liu et al., 2023) proposes a "pre-recollection + post-reflection" framework that mimics human-like cognition, explicitly separating retrieval and integrated reasoning into distinct stages and introducing a self-reflection mechanism to enhance reasoning capabilities, making it suitable for complex tasks. Nevertheless, it incurs high reasoning costs, is sensitive to retrieval quality, and demands significant prompt engineering effort. Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents (Hou et al., 2024) divides memory into short-term and long-term categories, dynamically updating and managing memories based on factors such as usage frequency and temporal proximity. ChatDB (Hu et al., 2023) employs a database as the symbolic memory for LLMs: it structures textual data and stores it in a database, enabling LLMs to swiftly retrieve accurate knowledge through database queries when relevant information is needed. Human memory is characterized by features such as forgetting, consolidation, and association. Mem0 (Chhikara et al., 2025) forms memories by summarizing the content of conversations, dynamically organizes these memories, and retrieves information from them. Memory-R1 (Yan et al., 2025) constructs two agents that learn how to manage and utilize memory through reinforcement learning.

These methods lack detailed design for the most critical aspect of the memory module: the quality of the memory content. Instead, they only capture basic information, such as simple keywords and summaries, as memory content. We believe that high-quality memory content is the key to improving recall and response abilities. We therefore combine cognitive psychology theories and focus on designing high-quality long-term memory segments to improve these two abilities.

Method

Overview

The multi-memory segment system processes short-term memory content into different memory segments and stores them as long-term memory. It takes the content of a round of dialogue $C$ as short-term memory $M_{short}$, extracts key information $M_{key}$ from the dialogue, and further processes and analyzes the original dialogue to construct different cognitive perspectives $M_{cog}$, episodic memory $M_{epi}$, and semantic memory $M_{sem}$. These memory segments are then used to construct retrieval memory units $MU_{ret}$ and contextual memory units $MU_{cont}$, which are used for memory retrieval and memory use, respectively.

Our multi-memory segment system involves three processes: the construction of long-term memory units, the retrieval of long-term memory units, and the use of contextual memory units. The process by which the system turns short-term memory into long-term memory is shown in Figure 1. We employ prompt engineering to extract the individual memory segments, with the specific prompts detailed in Table 5 of Appendix A.

Construction of Long-term Memory Units

The process of converting short-term memory into long-term memory involves two parts. First, the information in short-term memory is processed to generate multiple types of memory segments. Then, these memory segments are used to construct retrieval memory units for retrieval and matching, and contextual memory units that serve as context for the LLM.

Short-term Memory Processing The memory system processes $M_{short}$ into $M_{longterm}$: it analyzes $M_{short}$ with an LLM and constructs $M_{key}$, $M_{cog}$, $M_{epi}$, and $M_{sem}$. First, keywords are extracted from $M_{short}$ as salient textual identifiers of the short-term memory. The short-term memory is then analyzed cognitively from different perspectives to build a multi-dimensional view of it and enhance matching, yielding $M_{cog}$. Event information, such as plot developments, is extracted as episodic memory $M_{epi}$. Finally, knowledge points are taken as factual knowledge and constructed as semantic memory $M_{sem}$. The memory segments and the long-term memory are defined as follows:

$$M_{key},\ M_{cog},\ M_{epi},\ M_{sem} = \mathrm{LLM}(M_{short})$$

$$M_{longterm} = \{M_{key},\ M_{short},\ M_{cog},\ M_{epi},\ M_{sem}\}$$
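To make this construction step concrete, here is a minimal Python sketch of how short-term memory might be processed into the four segments. The `MemorySegments` container, the `llm_complete` helper, and the one-line prompts are illustrative assumptions; the actual prompts used by MMS are given in Table 5 of Appendix A.

```python
from dataclasses import dataclass

@dataclass
class MemorySegments:
    """Long-term memory segments derived from one round of dialogue (M_short)."""
    m_short: str  # original short-term memory (the dialogue content)
    m_key: str    # extracted keywords
    m_cog: str    # multiple cognitive perspectives
    m_epi: str    # episodic memory (event descriptions)
    m_sem: str    # semantic memory (knowledge points)

def build_segments(m_short: str, llm_complete) -> MemorySegments:
    """Process short-term memory into long-term memory segments.

    `llm_complete(prompt: str) -> str` is a placeholder for any chat-model
    call; the prompts below are simplified stand-ins for those in Appendix A.
    """
    m_key = llm_complete(f"Extract the keywords of this dialogue:\n{m_short}")
    m_cog = llm_complete(f"Analyze this dialogue from multiple cognitive perspectives:\n{m_short}")
    m_epi = llm_complete(f"Describe the events (episodic memory) in this dialogue:\n{m_short}")
    m_sem = llm_complete(f"List the factual knowledge points (semantic memory) in this dialogue:\n{m_short}")
    return MemorySegments(m_short, m_key, m_cog, m_epi, m_sem)
```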

Long-term Memory Segment Storing After completing information processing, we have five parts that constitute the long-term memory $M_{longterm}$: $M_{key}$, $M_{short}$, $M_{cog}$, $M_{epi}$, and $M_{sem}$. We store long-term memory in two forms: one for retrieval and matching, and the other for context-based knowledge enhancement. The keywords, the original short-term memory content, the various cognitive perspectives, and the sentence-level characteristics of episodic memory are close to the semantic form of the user's query. In contrast, the semantic memory is listed as knowledge points, a higher-level extraction of the short-term memory content. It may differ from the language form of the user's query, making it suitable for knowledge enhancement but not for retrieval matching. Therefore, considering the principle of encoding specificity, we use $M_{key}$, $M_{short}$, $M_{cog}$, and $M_{epi}$ as the retrieval memory units $MU_{ret}$, and encode them into vectors $V_{memory}$ for use in the retrieval matching stage.

Episodic memory describes event information whose semantics are close to the short-term memory itself, and LLMs can understand this content from the short-term memory alone. Therefore, we believe episodic memory does not need to be input as context to the model; only keywords, short-term memory, multiple cognitive perspectives, and semantic memory are required. The retrieval memory units and contextual memory units are thus composed as follows:

$$MU_{ret} = \{M_{key},\ M_{short},\ M_{cog},\ M_{epi}\}$$

$$MU_{cont} = \{M_{key},\ M_{short},\ M_{cog},\ M_{sem}\}$$
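The pairing of the two unit types can be sketched as follows, reusing `MemorySegments` from the previous snippet. The embedding model all-MiniLM-L6-v2 is the one named in our implementation details; the `build_memory_units` helper and the dictionary layout are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# Embedding model used in our implementation details.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def build_memory_units(seg: MemorySegments) -> dict:
    """Build one paired retrieval / contextual memory unit from the segments."""
    # Retrieval unit: segments whose surface form is close to user queries.
    mu_ret = "\n".join([seg.m_key, seg.m_short, seg.m_cog, seg.m_epi])
    # Contextual unit: segments used as knowledge-enhancing context at answer time.
    mu_cont = "\n".join([seg.m_key, seg.m_short, seg.m_cog, seg.m_sem])
    v_memory = embedder.encode(mu_ret)  # vector stored for retrieval matching
    return {"v_memory": v_memory, "mu_ret": mu_ret, "mu_cont": mu_cont}
```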

Retrieval of Long-term Memory Units

The retrieval stage converts the user query $Q$ into vector form and uses cosine similarity to score the query vector $V_{query}$ against each stored vector $V_{memory}$, selecting the top-$k$ vectors $V_k = \{V_1, V_2, \ldots, V_K\}$. The cosine similarity is computed as follows:

$$\mathrm{sim}(q, v) = \frac{q \cdot v}{\|q\|\,\|v\|}$$

where $q$ and $v$ denote the query vector and a vector stored in the memory system, respectively.
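A minimal sketch of this retrieval stage under the same assumptions, reusing `embedder` and the unit dictionaries from the previous snippet; `memory_store` is a hypothetical list of stored units.

```python
import numpy as np

def cosine_sim(q: np.ndarray, v: np.ndarray) -> float:
    """sim(q, v) = (q . v) / (||q|| ||v||)"""
    return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))

def retrieve_top_k(query: str, memory_store: list, k: int = 5) -> list:
    """Rank stored memory units by cosine similarity to the query vector."""
    v_query = embedder.encode(query)
    ranked = sorted(memory_store,
                    key=lambda m: cosine_sim(v_query, m["v_memory"]),
                    reverse=True)
    return ranked[:k]
```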

Utilization of Long-term Memory Units

Given the selected top-$k$ vectors $V_k = \{V_1, V_2, \ldots, V_K\}$, the corresponding retrieval memory units $MU_{ret}$ are mapped one-to-one to their contextual memory units $MU_{cont}$. The segments $M_{key}$, $M_{short}$, $M_{cog}$, and $M_{sem}$ in these units are used as the contextual input $C_m$ to the LLM, which responds to the user's query $Q$ and produces a response $R$:

$$R = \mathrm{LLM}(Q, C_m)$$
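Putting the two stages together, a hypothetical end-to-end answering step might look like the following, with $k=5$ as in our implementation details; the `answer_query` helper and its prompt format are illustrative, not the system's exact prompts.

```python
def answer_query(query: str, memory_store: list, llm_complete, k: int = 5) -> str:
    """Retrieve the top-k units and answer with their contextual counterparts.

    Each retrieved unit maps one-to-one to its contextual memory unit, whose
    concatenation serves as the context C_m for the response stage.
    """
    top_units = retrieve_top_k(query, memory_store, k)
    c_m = "\n\n".join(u["mu_cont"] for u in top_units)
    return llm_complete(f"Context:\n{c_m}\n\nQuestion: {query}\nAnswer:")
```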

Experimental Setup

Datasets

The LoCoMo dataset (Maharana et al., 2024) is designed to evaluate the long-term dialogue memory capability of large language models. Its primary task is to determine whether these models can accurately recall information from earlier dialogue after multiple rounds. The dataset contains 10 extended sessions, with each session averaging around 600 conversations and 26,000 tokens. Each conversation includes an average of 200 questions along with their corresponding correct answers. This dataset supports multiple evaluation scenarios and is currently a widely adopted authoritative dataset.

There are five problem types: (1) Single-hop questions : Test the memory system's basic retention by requiring accurate extraction of a single fact from long dialogue history. (2) Multi-hop questions : Assess the system's ability to integrate information across multiple dialogue rounds for reasoning. (3) Temporal reasoning questions : Evaluate if the system can understand event sequences and time evolution to construct a clear timeline. (4) Open-domain knowledge questions : Examine the system's recall and retrieval ability by combining dialogue context with external knowledge. (5) Adversarial questions : Challenge the system's resistance to forgetting and misleading by inserting interference information.

Evaluation Metrics

We evaluate the performance of the memory system using several standard metrics, including Recall@N (R@1, R@3, R@5), F1 score, and BLEU-1.

Recall@N (R@N) measures the average proportion of ground-truth answers retrieved within the top-N results. When the number of ground-truth items is less than N, the denominator is adjusted to $\min(N, |\mathrm{Gold}_i|)$ to ensure fair evaluation. Formally:

$$R@N = \frac{1}{|Q|} \sum_{i \in Q} \frac{|\mathrm{Gold}_i \cap \mathrm{TopN}_i|}{\min(N, |\mathrm{Gold}_i|)}$$

where $Q$ is the set of all evaluation queries, $i$ indexes a specific query, $\mathrm{Gold}_i$ denotes the set of ground-truth answers for the $i$-th query, $|\mathrm{Gold}_i|$ is the number of ground-truth answers for that query, and $\mathrm{TopN}_i$ is the set of top-N results retrieved by the system for that query.
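A direct implementation of this definition, assuming gold answers and retrieved results are given as parallel lists (the function name and signature are illustrative):

```python
def recall_at_n(gold_sets: list, retrieved_lists: list, n: int) -> float:
    """Average R@N with denominator min(N, |Gold_i|), as defined above.

    Assumes each gold set is non-empty and the two lists are aligned by query.
    """
    scores = []
    for gold, retrieved in zip(gold_sets, retrieved_lists):
        top_n = set(retrieved[:n])               # TopN_i
        hits = len(set(gold) & top_n)            # |Gold_i ∩ TopN_i|
        scores.append(hits / min(n, len(gold)))  # adjusted denominator
    return sum(scores) / len(scores)
```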

F1 Score is the harmonic mean of precision and recall. Let $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

F1 is widely used in classification and span-level extraction tasks, such as named entity recognition or QA.
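For reference, a token-overlap implementation consistent with the formulas above; whitespace tokenization and lowercasing are simplifying assumptions, as QA evaluations often add further normalization.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts shared tokens (true positives).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```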

BLEU-1 evaluates unigram precision between the generated output and the reference. BLEU-1 is computed as:

$$\mathrm{BLEU}\text{-}1 = BP \cdot \frac{\sum_{w} \min\big(\mathrm{Count}_{\mathrm{gen}}(w),\ \mathrm{Count}_{\mathrm{ref}}(w)\big)}{\sum_{w} \mathrm{Count}_{\mathrm{gen}}(w)}$$

A brevity penalty is typically applied to discourage overly short outputs. BLEU-1 is appropriate for tasks like machine translation and text generation where word-level overlap is informative.
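A minimal sketch of BLEU-1 with the standard brevity penalty (the exact penalty form used in our evaluation scripts is not spelled out above, so this follows the usual BLEU definition):

```python
import math
from collections import Counter

def bleu_1(prediction: str, reference: str) -> float:
    """Unigram precision with a brevity penalty for overly short outputs."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred:
        return 0.0
    # Clipped unigram matches between output and reference.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    precision = overlap / len(pred)
    # Standard brevity penalty: exp(1 - |ref|/|gen|) when the output is shorter.
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * precision
```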

Baselines

Our goal is to explore high-quality memory content, so we adopt three representative memory-content methods for comparison: Naive RAG, MemoryBank, and A-MEM.

Naive RAG Only the character's dialogue content is vectorized as memory. A question is vectorized and compared with all memory vectors for similarity, and the top-k original dialogue snippets are chosen as context for answering.

MemoryBank (Zhong et al., 2024) The memory content involved in this method includes daily chat records, time-based summaries, user personality traits, and emotional assessments.

A-MEM (Xu et al., 2025) The method has four steps: note construction (creating memory notes with conversation content, timestamps, keywords, tags, context, and links), link generation (identifying relevant historical memories and analyzing their ties for memory evolution), memory evolution (deciding on memory updates based on new notes), and memory retrieval (using cosine similarity between question and note vectors to fetch the K most relevant memories for answering).

We primarily focus on the impact of memory content on memory effectiveness, and the aforementioned three approaches cover the distinctive characteristics of current cutting-edge memory content.

Implementation Details

We employed GPT-4o (Hurst et al., 2024), Qwen2.5-14B (Yang et al., 2024), and Gemini-2.5-pro-preview (Comanici et al., 2025) as base models, setting the temperature to 0.5 for memory generation and 0.7 for question answering. For the latter, we input the top 5 most relevant contextual memory units (selected via their retrieval memory units) as context for the agent. We employ an embedding-based retrieval approach, utilizing all-MiniLM-L6-v2 (Reimers and Gurevych, 2021) as the embedding model.

Results and Discussion

We conducted an experimental comparative analysis of memory retrieval and memory use. For memory retrieval, the recall rates (R@1, R@3, and R@5) were compared and analyzed. For memory use, F1, which measures the accuracy of responses, and BLEU-1, which measures the quality of responses, were compared and analyzed.

Recall Results

As the experimental results in Table 1 show, MMS is superior to other methods in most cases, with a significant overall improvement. The improvement is most obvious in the Multi Hop scenario, where MMS achieves the largest gains across models and tasks on R@1, R@3, and R@5; GPT-4o and Gemini, in particular, improved recall by 8-11 points on the Multi Hop task compared to A-MEM. Multi Hop tasks rely on deep associative reasoning across multiple documents, and the multi-memory segment strategy of MMS is well suited to this complex retrieval scenario, enhancing cross-document integration and reasoning. Significant improvements are also observed in the Open Domain and Temporal scenarios, indicating that MMS is robust in contexts with ambiguous and uncertain information structures. In the Single Hop scenario, MMS outperforms other methods in most cases but slightly underperforms in a few; in a simple scenario like Single Hop, the additional memory segments may introduce some redundancy.

Overall, we think that incorporating content from multiple cognitive perspectives as long-term memory segments can indeed enhance recall. This aligns with our initial intuition: different users may pose questions from various angles, and by preserving different cognitive perspectives we improve the degree of matching when questions are asked from different angles.

Answer Results

MMS achieves leading performance on multiple mainstream large language models, significantly outperforming existing methods such as NaiveRAG, MemoryBank, and A-MEM. On both F1 and BLEU-1, MMS achieves comprehensive improvements across the five task types: single-hop, multi-hop, temporal, open-domain, and adversarial question answering. Its advantages are particularly pronounced in multi-hop reasoning and open-domain tasks, demonstrating strong information integration and reasoning depth. The strong performance of MMS may be due to its multi-level memory structure, which has clear advantages in improving question-answering accuracy, handling complex logical relationships, and resisting input interference. The experimental results are shown in Table 2.

Table 1: The recall performance comparison of each method in five task scenarios. The ground truth corresponding to a single query may consist of multiple documents. For R@N, we use the minimum of N and the number of ground-truth documents as the denominator. Therefore, the denominators for R@1, R@3, and R@5 may differ, and it is reasonable to observe scenarios where, for example, R@3 is lower than R@1.

| Model | Method | Single Hop (R@1/R@3/R@5) | Multi Hop (R@1/R@3/R@5) | Temporal (R@1/R@3/R@5) | Open Domain (R@1/R@3/R@5) | Adversarial (R@1/R@3/R@5) | Average (R@1/R@3/R@5) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | NaiveRAG | 14.89 / 15.07 / 20.90 | 26.79 / 37.23 / 44.11 | 7.61 / 14.49 / 19.38 | 20.45 / 34.52 / 42.15 | 9.64 / 20.18 / 28.81 | 15.88 / 24.30 / 31.07 |
| GPT-4o | MemoryBank | 15.60 / 12.59 / 14.99 | 23.05 / 30.52 / 36.47 | 9.78 / 13.41 / 15.80 | 13.67 / 20.87 / 26.38 | 8.52 / 17.15 / 23.99 | 14.12 / 18.93 / 23.53 |
| GPT-4o | A-MEM | 24.82 / 23.76 / 29.73 | 33.02 / 49.79 / 58.96 | 16.30 / 22.28 / 30.02 | 29.01 / 44.81 / 53.90 | 10.54 / 20.96 / 25.11 | 22.74 / 32.32 / 39.54 |
| GPT-4o | MMS | 28.53 / 30.18 / 34.06 | 44.18 / 59.87 / 67.05 | 23.73 / 26.63 / 32.23 | 34.98 / 53.01 / 62.04 | 15.31 / 31.46 / 37.65 | 29.35 / 40.23 / 46.61 |
| Qwen2.5-14B | MemoryBank | 21.63 / 22.75 / 26.51 | 32.71 / 47.35 / 55.78 | 13.04 / 20.83 / 25.09 | 26.04 / 39.36 / 47.23 | 9.64 / 18.27 / 24.89 | 20.62 / 36.84 / 35.90 |
| Qwen2.5-14B | A-MEM | 25.21 / 25.71 / 27.45 | 39.19 / 52.49 / 59.01 | 21.73 / 23.19 / 32.46 | 31.27 / 46.63 / 54.82 | 14.57 / 27.58 / 32.85 | 26.39 / 35.12 / 41.32 |
| Qwen2.5-14B | MMS | 23.76 / 27.93 / 27.83 | 39.25 / 53.58 / 59.13 | 23.39 / 25.72 / 32.59 | 30.67 / 48.61 / 57.57 | 16.14 / 27.46 / 34.64 | 26.64 / 36.66 / 42.35 |
| Gemini-2.5-pro-preview | MemoryBank | 14.89 / 13.23 / 14.10 | 28.03 / 35.10 / 40.58 | 8.70 / 11.05 / 12.63 | 14.98 / 25.02 / 30.42 | 11.88 / 20.52 / 28.04 | 15.70 / 21.07 / 25.15 |
| Gemini-2.5-pro-preview | A-MEM | 17.73 / 17.43 / 22.42 | 25.55 / 38.32 / 46.73 | 10.87 / 15.04 / 19.93 | 22.47 / 37.06 / 43.32 | 13.46 / 23.99 / 29.93 | 18.02 / 26.37 / 32.47 |
| Gemini-2.5-pro-preview | MMS | 26.60 / 23.46 / 31.27 | 34.58 / 49.84 / 55.56 | 16.30 / 23.55 / 25.85 | 31.15 / 48.19 / 54.30 | 12.78 / 27.01 / 35.20 | 23.68 / 34.41 / 40.44 |

Table 2: Comparison of the generation performance of the four methods on the LoCoMo dataset.

| Model | Method | Single Hop (F1/BLEU-1) | Multi Hop (F1/BLEU-1) | Temporal (F1/BLEU-1) | Open Domain (F1/BLEU-1) | Adversarial (F1/BLEU-1) | Average (F1/BLEU-1) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | NaiveRAG | 18.04 / 12.24 | 33.03 / 28.28 | 14.03 / 12.15 | 31.74 / 26.78 | 7.60 / 6.90 | 20.89 / 17.27 |
| GPT-4o | MemoryBank | 15.39 / 11.18 | 28.44 / 24.18 | 11.91 / 11.23 | 22.04 / 18.86 | 8.65 / 7.93 | 17.29 / 14.68 |
| GPT-4o | A-MEM | 22.98 / 15.98 | 34.35 / 29.57 | 14.66 / 13.21 | 35.64 / 30.19 | 9.24 / 8.58 | 23.37 / 19.65 |
| GPT-4o | MMS | 28.67 / 20.96 | 47.37 / 39.98 | 20.81 / 19.07 | 42.98 / 36.89 | 12.87 / 11.67 | 30.54 / 25.74 |
| Qwen2.5-14B | NaiveRAG | 18.56 / 12.31 | 29.26 / 25.44 | 13.90 / 12.28 | 31.60 / 27.40 | 10.77 / 9.46 | 20.82 / 17.38 |
| Qwen2.5-14B | MemoryBank | 21.37 / 14.79 | 28.41 / 24.26 | 13.97 / 12.13 | 32.99 / 28.28 | 11.03 / 10.04 | 21.55 / 17.90 |
| Qwen2.5-14B | A-MEM | 30.71 / 23.44 | 32.46 / 28.55 | 14.08 / 13.97 | 43.02 / 37.69 | 11.38 / 10.11 | 26.33 / 22.75 |
| Qwen2.5-14B | MMS | 28.91 / 22.40 | 34.00 / 29.40 | 14.52 / 14.01 | 44.69 / 38.76 | 13.28 / 11.98 | 27.08 / 23.31 |
| Gemini-2.5-pro-preview | NaiveRAG | 23.80 / 17.27 | 34.96 / 26.92 | 14.24 / 12.50 | 30.64 / 26.43 | 5.76 / 5.15 | 21.88 / 17.65 |
| Gemini-2.5-pro-preview | MemoryBank | 14.12 / 9.30 | 34.15 / 29.36 | 8.55 / 6.48 | 25.82 / 22.17 | 4.62 / 3.92 | 17.45 / 14.25 |
| Gemini-2.5-pro-preview | A-MEM | 22.69 / 15.83 | 40.97 / 35.63 | 12.42 / 11.11 | 36.23 / 31.42 | 7.20 / 7.02 | 23.90 / 20.20 |
| Gemini-2.5-pro-preview | MMS | 25.96 / 18.96 | 41.49 / 36.99 | 12.91 / 11.32 | 42.00 / 36.59 | 7.09 / 6.81 | 25.89 / 22.13 |

Ablation Study

We conducted ablation experiments on each module of MMS. When the keyword part is removed, the R@1 metric on the Single Hop task and the R@5 metric on the Open Domain task perform best. When multiple cognitive perspectives are removed, the R@3 metric on the Multi Hop task, the R@5 metric on the Temporal task, and the R@1 metric on the Adversarial task perform best. We believe that different memory segments can be used to construct retrieval units for different scenarios; in a broader general scenario, retrieval units composed of keywords, short-term memory, multiple cognitive perspectives, and episodic memory achieve the best performance on most tasks. The experimental results are shown in Table 3.

Similarly, we observe from Table 4 that keywords yield greater benefits for inferential questions (single-hop, multi-hop, and adversarial), whereas for temporal questions, multiple cognitive perspectives and semantic memory bring more benefit. For the broader range of open-domain questions, the memory content encompassed by our MMS better enhances the quality of generated content.

Analysis of Long-term Memory Units

We used GPT-4o as the base model to compare the effects of adding the remaining memory segments to the MMS retrieval units and contextual memory units. For retrieval, the retrieval memory units include keywords, short-term memory, multiple cognitive perspectives, and episodic memory. We added semantic memory for comparison and found that the recall rate decreases. This is because generated semantic memory is a high-level overview of the factual knowledge in short-term memory, creating a gap with the semantic form of the user's question. For generation, the contextual memory units of MMS consist of keywords, short-term memory, multiple cognitive perspectives, and semantic memory. We added episodic memory for comparison and found that generation quality decreases. As episodic memory describes event-related content already present in the original short-term memory, adding it leads to redundancy and lower generation quality. Thus, the composition of the retrieval and contextual memory units we use is reasonable. The experimental results are shown in Figure 2. In general, during the retrieval process, emphasis should be placed on keywords, multi-perspective, and event-oriented information to ensure that the "net" of the search is sufficiently broad and precise. During the generation process, focus should be on facts, keywords, and high-level semantics to ensure that the model, while understanding the general meaning, is not confused by repetitive event descriptions. Therefore, our MMS achieves high-quality results.

Table 3: The ablation experiment on recall performance using GPT-4o as the base model of the MMS method. The symbol "w/o" indicates an experiment in which a specific module was removed. Key represents the keywords of short-term memory, Cog represents multiple cognitive perspectives on short-term memory, and Epi represents episodic memory of short-term memory.

| Method | Single Hop (R@1/R@3/R@5) | Multi Hop (R@1/R@3/R@5) | Temporal (R@1/R@3/R@5) | Open Domain (R@1/R@3/R@5) | Adversarial (R@1/R@3/R@5) | Average (R@1/R@3/R@5) |
| --- | --- | --- | --- | --- | --- | --- |
| w/o Key | 30.85 / 29.72 / 34.02 | 42.68 / 59.29 / 66.17 | 23.21 / 23.55 / 31.86 | 34.60 / 52.55 / 62.18 | 11.88 / 27.36 / 33.40 | 28.64 / 38.49 / 45.53 |
| w/o Cog | 26.24 / 26.06 / 32.73 | 43.61 / 59.96 / 66.33 | 20.65 / 27.53 / 31.83 | 31.98 / 49.36 / 59.53 | 15.47 / 30.49 / 37.57 | 27.59 / 38.68 / 45.60 |
| w/o Epi | 27.66 / 27.30 / 33.17 | 42.37 / 59.03 / 68.38 | 18.47 / 24.99 / 31.68 | 31.51 / 47.98 / 57.51 | 15.02 / 29.48 / 37.00 | 27.01 / 37.76 / 45.55 |
| w/o Cog & Epi | 21.99 / 20.57 / 24.21 | 37.69 / 55.40 / 62.12 | 13.04 / 21.20 / 23.82 | 26.75 / 41.52 / 48.79 | 13.00 / 26.01 / 35.99 | 22.49 / 32.94 / 38.99 |
| MMS | 28.53 / 30.18 / 34.06 | 44.18 / 59.87 / 67.05 | 23.73 / 26.63 / 32.23 | 34.98 / 53.01 / 62.04 | 15.31 / 31.46 / 37.65 | 29.35 / 40.23 / 46.61 |

Table 4: The ablation experiment on generation performance using GPT-4o as the base model. Sem represents semantic memory of short-term memory.

| Method | Single Hop (F1/BLEU-1) | Multi Hop (F1/BLEU-1) | Temporal (F1/BLEU-1) | Open Domain (F1/BLEU-1) | Adversarial (F1/BLEU-1) | Average (F1/BLEU-1) |
| --- | --- | --- | --- | --- | --- | --- |
| w/o Key | 27.51 / 18.53 | 44.24 / 36.52 | 19.28 / 16.12 | 41.27 / 35.53 | 10.88 / 10.12 | 28.64 / 23.36 |
| w/o Cog | 27.77 / 19.77 | 45.68 / 38.46 | 17.36 / 15.47 | 43.01 / 36.87 | 12.27 / 11.65 | 29.22 / 24.44 |
| w/o Sem | 27.79 / 19.16 | 45.31 / 37.94 | 19.51 / 18.50 | 42.69 / 36.60 | 12.45 / 11.44 | 29.55 / 24.73 |
| w/o Cog & Sem | 28.24 / 20.24 | 45.05 / 38.34 | 17.92 / 15.44 | 42.71 / 37.42 | 12.83 / 11.53 | 29.35 / 24.59 |
| MMS | 28.67 / 20.96 | 47.37 / 39.98 | 20.81 / 19.07 | 42.98 / 36.89 | 12.87 / 11.67 | 30.54 / 25.74 |

Robustness of the Number of Memories

We varied the number of selected memory segments $n$ to evaluate the impact on performance. As $n$ increased, performance improved despite the introduction of noise, because the high-quality memory content helps distinguish pivotal information from noisy data. This demonstrates the robustness of our method with respect to the number of memory segments. The results can be viewed in Table 6 of Appendix C.

Analysis of Token and Latency Overhead

We experimentally analyzed the token and latency overhead of different methods, specifically the average overhead needed to generate the memory content for each query. Compared to A-MEM, our approach is faster and more resource-efficient. Although latency and token overhead rise slightly compared to MemoryBank, its long-term memory is overly simplistic and of low quality, resulting in poor performance. Conversely, our method generates a larger volume of high-quality memory content at this increased overhead, with a latency increase small enough to barely affect user experience. Our method therefore holds practical value. The results can be viewed in Table 7 of Appendix C.

Conclusion

This paper presents a multi-memory segment system that draws on cognitive psychology to build effective long-term memory, boosting recall and generation quality. Multiple memory systems theory suggests that human memory comes in many forms; current methods ignore this variety and rely on simple summarization, leading to low-quality content. Since people understand problems from different angles, we treat the various cognitive perspectives on short-term memory as long-term memory segments. Levels-of-processing theory states that deeper encoding leads to better retention. MMS turns short-term memory into long-term segments such as keywords, different cognitive views, and episodic and semantic memories. It constructs retrieval units from these segments for recall and builds contextual units to enhance knowledge. Experiments indicate that MMS enhances recall and generation at modest cost, proving its practical value. Our analysis under varying memory segment counts shows that high-quality content ensures sustained gains and noise resilience. Our approach effectively integrates cognitive psychology into research on agent memory and offers a basis for future studies.

Limitations

Our current experiments have demonstrated the effectiveness of our method on both open-source and closed-source models. In the future, we are considering constructing a dedicated memory model specifically for generating high-quality memories, employing Supervised Fine-Tuning and Reinforcement Learning to further enhance the model's capability to generate high-quality long-term memories.

Ethics Statement

We guarantee that the datasets we use are public and do not involve data leakage or other issues. Our method is a universal way of processing knowledge and does not involve issues such as racial discrimination or moral hazard.


We manipulated the quantities of memory fragments to evaluate the impact on performance. As the value of n increased, performance exhibited enhancement notwithstanding the introduction of noise, owing to the high-caliber memory contents facilitating the differentiation between pivotal and noisy data. This demonstrates the resilience of our method with respect to the quantities of memory fragments.

We analyzed token and latency overhead for different methods experimentally, specifically calculating the average overhead needed to generate the memory content for each query. Compared to A-MEM, our approach proves to be faster and more resource-efficient. Although there’s a slight rise in latency and overhead compared to the memory repository, its long-term memory is overly simplistic and of low quality, resulting in poor performance. Conversely, our method generates a larger volume of high-quality memory content despite the increased overhead, with only a minimal latency increase that barely affects user experience. Our method holds practical value.

This paper creates a multi-memory system by integrating with cognitive psychology to build effective long-term memory, boosting recall and generation quality. Multiple memory theory suggests human memory comes in many forms. Current methods don’t consider the variety of human memory fragments and just use summarization, leading to low-quality content. Since people understand problems from different angles, we view short-term memory’s various cognitive aspects as long-term memory fragments. Levels-of-processing theory states that deeper encoding leads to better memory retention. MMS turns short-term memory into long-term fragments like keywords, different cognitive views, episodic and semantic memories. It constructs retrieval units from these for recall and builds contextual units to enhance knowledge. Experiments indicate MMS enhances recall and generation at lower cost, proving practical value. Our analysis under varying memory segment counts reveals high-quality content ensures sustained gains and noise resilience. Due to space constraints, our approach centers on memory content, with future work to include more memory operation designs. Our approach effectively integrates cognitive psychology into the research on agent memory within artificial intelligence for future studies.

Table: Sx5.T5: Performance comparison for different values of n

Metricsn=1n=3n=5n=7n=9
Avg Fl20.7425.0730.5434.8136.13
Avg BLUE-117.4421.3925.7428.9531.28

Table: Sx5.T6: Comparison of Latency Overhead and Token Overhead

MetricsMMSA-MEMMemory Bank
Avg Latency1.3093.9310.949
Avg Tokens7441429238

Refer to caption Schematic of multi-memory system process: After acquiring short-term memory, MMS processes it into memory fragments and constructs retrieval units and contextual memory units. During retrieval, the k most relevant retrieval units are matched to the query, and their corresponding contextual units are then used as context input for the agent’s response.

Refer to caption Compare the impact on performance after adding other segments. In terms of recall metrics, MMS and MMS+Sem were compared. In terms of generation, MMS and MMS+Epi were compared. Sem refers to semantic memory, and Epi refers to episodic memory.

$$ M_{key}, M_{cog}, M_{epi}, M_{sem} = LLM(M_{short}) $$

$$ \text{cos_sim}(\mathbf{q}, \mathbf{v}) = \frac{\mathbf{q} \cdot \mathbf{v}}{|\mathbf{q}| , |\mathbf{v}|} $$

$$ R = LLM(MU_{longterm}, Q) $$

$$ \text{Recall@}N = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{|\text{Top-}N_i \cap \text{Gold}_i|}{\min(N, |\text{Gold}_i|)} $$

$$ \text{BLEU-1} = \frac{# \text{ matching unigrams}}{# \text{ unigrams in hypothesis}} $$

$$ \displaystyle=\frac{TP}{TP+FP} $$

$$ M_{longterm} = (M_{key}, M_{short}, M_{cog}, M_{epi}, M_{sem}) $$

$$ \text{Precision} &= \frac{TP}{TP + FP} \ \text{Recall} &= \frac{TP}{TP + FN} \ \text{F1} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

Prompts

We construct keywords, multiple cognitive perspectives, episodic memory, and semantic memory from the content of short-term memory. Episodic memory mainly describes event information that occurred at a specific time and place, so we label it "situations" in the prompt; semantic memory mainly contains factual world knowledge, so we label it "knowledge points". The specific content is shown in Table 5.

Prompt: You are a linguistics expert. Please generate the following required content based on the context and dialogue below.
Dialogue time: {time}
Context: {context}
Dialogue: {content}
1. Analyze the key words in the above dialogue.
2. Analyze the dialogue content from different cognitive perspectives based on the context and the content of the dialogue.
3. Analyze the situations that can serve as episodic memory from the dialogue.
4. Analyze the knowledge points in the dialogue.
5. You must output in the required format. Format the response as a JSON object:
{{
"keywords": [
// The key information should be able to represent the dialogue, such as: characters, intentions, topics, time, place, events and other important information.
// Only analyze the key words in the dialogue, not those in the context.
],
"cognitive_perspectives": [
// Context information is provided to help you understand the entire chat process and assist you in analysis.
// Only analyze the content in the dialogue, not the content in the context.
],
"situations": [
// An event that occurs at a specific time and place.
// Extract the situations that can be part of episodic memory from the dialogue.
// Only analyze the key words in the dialogue, not those in the context.
],
"knowledge_points": [
// Only extract the knowledge points from the dialog that can be part of semantic memory. The context merely helps you understand the background.
// Only analyze the key words in the dialogue, not those in the context.
]
}}

Table 5: Prompt for generating long-term memory content. {time} is the dialogue time, {context} is the context summary provided in the dialogue dataset, and {content} is the content of this round of dialogue.
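To make the pipeline concrete, a minimal sketch of driving this prompt and parsing its JSON output might look as follows. This is not the authors' released code: `call_llm` is an assumed stand-in for any chat-completion client, and `PROMPT_TEMPLATE` would hold the Table 5 text with its {time}/{context}/{content} placeholders.

```python
import json

# Hypothetical placeholder for the full Table 5 prompt text.
PROMPT_TEMPLATE = "You are a linguistics expert. ... Dialogue time: {time} Context: {context} Dialogue: {content} ..."

def generate_fragments(call_llm, time, context, content):
    """Ask the LLM for the four long-term memory fragments of one dialogue round."""
    prompt = PROMPT_TEMPLATE.format(time=time, context=context, content=content)
    raw = call_llm(prompt, temperature=0.5)  # the paper uses 0.5 for memory generation
    data = json.loads(raw)                   # the prompt demands a JSON object
    return {
        "key": data["keywords"],
        "cog": data["cognitive_perspectives"],
        "epi": data["situations"],           # labelled "situations" = episodic memory
        "sem": data["knowledge_points"],     # labelled "knowledge points" = semantic memory
    }
```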



On the path towards Artificial General Intelligence (AGI), the development of large language models (LLMs) has achieved remarkable milestones, endowing them with robust language understanding and generation capabilities. Existing LLMs, such as ChatGPT (Brown et al. 2020), Gemini (Team et al. 2023), DeepSeek-r1 (Guo et al. 2025), Qwen (Yang et al. 2024), and Llama (Grattafiori et al. 2024), have driven breakthroughs in this field. LLM technologies are profoundly influencing the transformation of problem-solving paradigms across multiple AI domains.

Agents driven by LLMs have emerged as an effective approach to addressing dialogue system challenges (Xi et al. 2025). The memory capabilities of agents are gradually becoming a pivotal factor in propelling them towards higher-level cognition and autonomous behavior (Zhang et al. 2024). Traditional short-term memory mechanisms are inadequate for complex tasks, which require contextual understanding, long-term reasoning, and personalized responses (Wang et al. 2024). Consequently, constructing a memory system with durability, structure, and adaptability has become one of the core challenges in realizing truly "intelligent" agents.

To address this issue, researchers have explored both parametric and non-parametric approaches to optimize the performance of LLMs in long-form tasks. Parametrically, by augmenting LLMs with additional parameters, knowledge can be retained for future use (Wang et al. 2023). The Elastic Weight Consolidation algorithm (Huszár 2018), developed through collaboration between DeepMind and Imperial College London, enables neural networks to retain knowledge from previous tasks while learning new ones, mitigating the "catastrophic forgetting" problem and marking a significant step towards continuous learning in AI. However, parametric memory is prone to generating distorted and non-factual outputs and lacks interpretability (Ji et al. 2023). Non-parametrically, external components are employed to enhance the long-term memory capabilities of LLMs. The external memory module of an agent can store historical dialogue content, often using the Retrieval-Augmented Generation (RAG) approach to store long-term memories as vectors in a database. When a new query arises, relevant vectors are matched from the vector database to serve as memory content (Gao et al. 2023). However, simply storing historical dialogues in a database does not adequately simulate the human memory formation process, nor does it effectively match information from historical dialogues.

Various memory architectures have been proposed. A-MEM (Xu et al. 2025) extracts keywords, constructs summaries, and creates tags from dialogues as memory units, employing the Zettelkasten method to dynamically index and link knowledge networks; MemoryBank (Zhong et al. 2024) summarizes dialogue content and analyzes user personality and mood to serve as memory units. These approaches often fall short in practical scenarios. Users pose questions from diverse perspectives, yet the aforementioned methods simply extract keywords and summaries and save them as memory content. This leads to a gap between users' queries and the content in memory units, and the memory content often lacks quality, which degrades retrieval recall. To enhance retrieval performance, we integrate Tulving's memory system classification (Tulving 1985) and the encoding specificity principle (Tulving and Thomson 1973), extracting episodic memory, semantic memory, cognitive perspectives, and keywords as the content of our memory units. This provides higher-quality memory content and thereby improves retrieval recall.

Currently, AI agents’ memory systems encompass various types, including Short-Term Memory (STM), Long-Term Memory (LTM), Episodic Memory, Semantic Memory, and Procedural Memory (Sumers et al. 2023). STM is utilized for processing immediate contexts, such as the context window in dialogue systems. LTM, on the other hand, enables cross-session information storage and retrieval through databases, knowledge graphs, or vector embeddings (Gutiérrez et al. 2024; Xi et al. 2025; Hu et al. 2023). Episodic Memory allows agents to recall specific events, supporting case-based reasoning (Tulving et al. 1972). Semantic Memory stores structured factual knowledge, facilitating logical reasoning and knowledge retrieval (Tulving 1986). We have designed a multiple memory system based on cognitive psychology theory to generate high-quality long-term memory.

Our contributions are as follows:

(1) Current research rarely considers the quality of memory content, which consequently remains poor. To address this issue, we draw on multiple memory systems theory, the encoding specificity principle, and levels of processing theory to propose a multiple memory system that enhances the long-term memory capabilities of agents. The system extracts keywords, multiple cognitive perspectives, episodic memory, and semantic memory as memory fragments through the analysis and processing of short-term memory, constructing high-quality long-term memory content.

(2) In order to meet the requirements of different tasks in the retrieval stage and generation stage, we have built retrieval memory units for relevance matching with queries and contextual memory units as context for knowledge enhancement during generation.

(3) We conducted experiments on the LoCoMo dataset to evaluate retrieval and generation capabilities, and also performed ablation studies. These experimental results demonstrate the effectiveness of our method.

The theory of multiple memory systems (Tulving 1985) posits that memory is not processed by a single system, but rather consists of multiple functionally independent subsystems with different structural foundations. These systems are responsible for different types of information processing and storage, with specific neural bases and behavioral characteristics. Endel Tulving, who earlier distinguished episodic from semantic memory (Tulving et al. 1972), proposed a three-category classification model in 1985: procedural memory, involving the learning of skills and habits, such as riding a bicycle; semantic memory, storing factual and conceptual knowledge, such as the name of a capital city; and episodic memory, recording personal experiences and events, such as the last birthday party. The Levels of Processing theory (Craik and Lockhart 1972) suggests that the formation of memory does not depend on independent storage systems such as short-term memory and long-term memory, but rather on the depth and manner of information processing. The Encoding Specificity Principle (Tulving and Thomson 1973) states that the memory effect of information depends on its processing method and context during encoding: the environment, emotional state, and sensory stimulation at encoding all become part of the memory trace, so retrieval is most effective when the conditions at retrieval match those at encoding.

Inspired by the levels of processing theory, we process short-term memory content in a multi-level manner to form long-term memory representations. Following multiple memory systems theory, we categorize semantic and episodic memories as long-term memory components. Recognizing that users' diverse questions reflect varied perspectives on the original memory content, we apply the encoding specificity principle to enhance recall by aligning the encoded content closely with the likely form of questions: we integrate keywords and diverse cognitive angles derived from short-term memory into the long-term memory representations to boost matching efficacy.

In recent years, research on the long-term memory mechanisms of agents has exhibited a diversifying trend. MemoryBank (Zhong et al. 2024) achieves long-term storage and retrieval of knowledge in multi-turn dialogues by introducing an external memory bank, integrating three modules (a writer, a retriever, and a reader) and leveraging the Ebbinghaus memory curve theory. It boasts strong scalability and modularity, yet suffers from coarse-grained memory content and suboptimal memory selection. MemoChat (Lu et al. 2023), on the other hand, injects user-related information into the model in the form of static summaries using concise, manually constructed "memos", significantly enhancing consistency and efficiency in open-domain dialogues; however, it relies on high-quality memo construction and lacks the capability for dynamic learning and memory adjustment. Think-in-Memory (Liu et al. 2023) proposes a "pre-recollection + post-reflection" framework that mimics human-like cognition, explicitly separating retrieval and integrated reasoning into distinct stages and introducing a self-reflection mechanism to enhance reasoning, making it suitable for complex tasks. Nevertheless, it incurs high reasoning costs, is sensitive to retrieval, and demands significant prompt engineering. Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents (Hou, Tamoto, and Miyashita 2024) divides memory into short-term and long-term categories, dynamically updating and managing memories based on factors such as usage frequency and temporal proximity. ChatDB (Hu et al. 2023) employs a database as the symbolic memory for LLMs: textual data is structured and stored in a database, enabling LLMs to swiftly retrieve accurate knowledge through database queries when relevant information is needed. Human memory itself is characterized by features such as forgetting, consolidation, and association.

These methods lack detailed design for the most critical part of the memory module: the quality of the memory content. They only obtain basic information such as simple keywords and summaries as memory content. We believe that high-quality memory content is the key to improving recall and response abilities, so we combine cognitive psychology theories and focus on designing high-quality long-term memory segments to improve both.

Short-term memory temporarily stores and processes a limited amount of information for current tasks, while long-term memory stores information durably and with high capacity, housing our knowledge, experiences, and skills. The brain extracts key information from short-term memory and processes it into long-term storage. The Levels of Processing theory suggests that the manner of information processing affects memory. MMS therefore comprehensively processes and analyzes short-term memory to create high-quality long-term memory fragments.

The multi-memory fragment system in this paper mimics this process: it uses the content of a round of dialogue $C$ as short-term memory $M_{short}$, extracts key information $M_{key}$ from the dialogue content, and further processes and analyzes the original dialogue content to construct different cognitive perspectives $M_{cog}$, episodic memory $M_{epi}$, and semantic memory $M_{sem}$. These memory fragments are then used to construct retrieval memory units $MU_{ret}$ and contextual memory units $MU_{cont}$, which are used for memory retrieval and memory use, respectively.
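A minimal sketch of this notation as a data structure might look as follows; the class and field names are illustrative assumptions, not the authors' code, but the composition of the two units follows the definitions given later in this section.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LongTermMemory:
    """One dialogue round C and the fragments MMS derives from it."""
    m_short: str                                      # original dialogue round C
    m_key: List[str] = field(default_factory=list)    # keywords
    m_cog: List[str] = field(default_factory=list)    # cognitive perspectives
    m_epi: List[str] = field(default_factory=list)    # episodic memory (situations)
    m_sem: List[str] = field(default_factory=list)    # semantic memory (knowledge points)

    def retrieval_unit(self) -> str:
        # MU_ret: the text that is embedded and matched against queries.
        return " ".join(self.m_key + [self.m_short] + self.m_cog + self.m_epi)

    def contextual_unit(self) -> str:
        # MU_cont: the text fed to the LLM as context during response.
        return " ".join(self.m_key + [self.m_short] + self.m_cog + self.m_sem)
```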

Our multi-memory fragment system involves three processes: the construction of long-term memory units, the retrieval of long-term memory units, and the use of contextual memory units. The process of the system processing short-term memory into long-term memory is shown in Figure 1.

The process of converting short-term memory into long-term memory involves two parts. First, the information in short-term memory is processed to generate multiple types of memory fragments. Then, these memory fragments are used to construct retrieval memory units for retrieval and matching, and contextual memory units that serve as context for the LLM.

Short-term Memory Processing The memory system processes information into $M_{longterm}$ by analyzing $M_{short}$ with an LLM and constructing $M_{key}$, $M_{cog}$, $M_{epi}$, and $M_{sem}$. First, keywords are extracted from $M_{short}$ as important textual identifiers of the short-term memory. The short-term memory is then analyzed cognitively from different perspectives to build a multi-dimensional understanding that enhances matching. Event information, such as situations occurring at a specific time and place, is constructed as $M_{epi}$, while the factual knowledge points in short-term memory are analyzed and constructed as $M_{sem}$. The construction of the memory fragments is formalized as follows:

$$ M_{key}, M_{cog}, M_{epi}, M_{sem} = LLM(M_{short}) $$

Long-term Memory Fragment Storing After completing information processing, we have five parts that constitute the long-term memory $M_{longterm}$:

$$ M_{longterm} = (M_{key}, M_{short}, M_{cog}, M_{epi}, M_{sem}) $$

We store long-term memory in two forms: one for retrieval and matching, and the other for context-based knowledge enhancement. The keywords, the original short-term memory content, the various cognitive perspectives on the short-term memory, and the sentence-like form of episodic memory are close to the semantic form of the user's query. In contrast, the semantic memory of short-term memory is listed as knowledge points, a higher-level extraction of the short-term memory content; it may differ from the user's query language form, making it suitable for knowledge enhancement but not for retrieval matching. Therefore, following the principle of encoding specificity, we use $M_{key}$, $M_{short}$, $M_{cog}$, and $M_{epi}$ as the retrieval memory units $MU_{ret}$, and encode them into vectors $V_{memory}$ for use in the retrieval matching stage.
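Continuing the sketch above, the retrieval memory units could be embedded as follows. The reference list cites the all-MiniLM-L6-v2 sentence transformer; using it as the encoder here is our assumption about a reasonable instantiation, not a confirmed detail of the paper's setup.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Assumed encoder; any sentence-embedding model would serve the same role.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_memory_index(memories):
    """Encode every retrieval memory unit MU_ret into the matrix V_memory."""
    texts = [m.retrieval_unit() for m in memories]      # memories: list of LongTermMemory
    v_memory = encoder.encode(texts, normalize_embeddings=True)
    return np.asarray(v_memory)                         # shape: (num_units, dim)
```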

Episodic memory describes event information whose semantics are similar to the short-term memory itself, and LLMs can understand this content from the short-term memory alone. Therefore, episodic memory does not need to be input as context to the LLM; only keywords, short-term memory, multiple cognitive perspectives, and semantic memory are required. The composition of the retrieval memory units and the contextual memory units is thus:

$$ MU_{ret} = (M_{key}, M_{short}, M_{cog}, M_{epi}), \qquad MU_{cont} = (M_{key}, M_{short}, M_{cog}, M_{sem}) $$

The retrieval stage converts the user query $Q$ into vector form, then uses cosine similarity to score the query vector $V_{query}$ against each stored vector $V_{memory}$, selecting the top-k vectors $V_k = \{V_1, V_2, \dots, V_K\}$. The cosine similarity is:

$$ \text{cos\_sim}(\mathbf{q}, \mathbf{v}) = \frac{\mathbf{q} \cdot \mathbf{v}}{\|\mathbf{q}\| \, \|\mathbf{v}\|} $$

where $\mathbf{q}$ and $\mathbf{v}$ are, respectively, the query vector and a vector stored in the memory system.
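A sketch of this matching step, continuing the assumed encoder and index above; with L2-normalized embeddings the dot product equals the cosine similarity.

```python
import numpy as np

def retrieve_top_k(query, encoder, v_memory, k=5):
    """Score the query against every stored retrieval unit and return
    the indices of the k most similar units."""
    v_query = encoder.encode([query], normalize_embeddings=True)[0]
    sims = v_memory @ v_query          # one cos_sim(q, v) score per unit
    return np.argsort(-sims)[:k]       # indices of the top-k retrieval units
```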

Based on the selected top-k vectors $V_k = \{V_1, V_2, \dots, V_K\}$, the corresponding retrieval memory units $MU_{ret}$ are mapped to their contextual memory units $MU_{cont}$. The $M_{key}$, $M_{short}$, $M_{cog}$, and $M_{sem}$ they contain are used as the context $C_m$ input to the LLM, which responds to the user's query $Q$ and produces a response $R$:

$$ R = LLM(MU_{cont}, Q) $$
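Putting the retrieval and response stages together, a minimal end-to-end sketch might read as follows; `call_llm` is again an assumed chat-completion wrapper, and the prompt layout is illustrative.

```python
def answer_query(call_llm, query, memories, encoder, v_memory, k=5):
    """Map the top-k retrieval units one-to-one to their contextual units,
    concatenate them as the context C_m, and generate the response R."""
    idx = retrieve_top_k(query, encoder, v_memory, k=k)
    c_m = "\n\n".join(memories[i].contextual_unit() for i in idx)
    prompt = f"Context:\n{c_m}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt, temperature=0.7)  # the paper uses 0.7 for answering
```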

The LoCoMo dataset (Maharana et al. 2024) is designed to evaluate the long-term dialogue memory capability of large language models. Its primary task is to determine whether these models can accurately recall information from earlier dialogue after many rounds. The dataset contains 10 extended conversations, each averaging around 600 dialogue turns and 26,000 tokens, with an average of 200 questions and their corresponding correct answers per conversation. The dataset supports multiple evaluation scenarios and is a widely adopted benchmark.

There are five problem types: (1) Single-hop questions: Test the memory system’s basic retention by requiring accurate extraction of a single fact from long dialogue history. (2) Multi-hop questions: Assess the system’s ability to integrate information across multiple dialogue rounds for reasoning. (3) Temporal reasoning questions: Evaluate if the system can understand event sequences and time evolution to construct a clear timeline. (4) Open-domain knowledge questions: Examine the system’s recall and retrieval ability by combining dialogue context with external knowledge. (5) Adversarial questions: Challenge the system’s resistance to forgetting and misleading by inserting interference information.

We evaluate the performance of the memory system using several standard metrics, including Recall@N (R@1, R@3, R@5), F1 score, and BLEU-1.

Recall@N (R@N) measures the average proportion of ground-truth answers retrieved within the top-N results. When the number of ground-truth items is less than N, the denominator is adjusted to $\min(N, |\text{Gold}_i|)$ to ensure fair evaluation. Formally:

$$ \text{Recall@}N = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{|\text{Top-}N_i \cap \text{Gold}_i|}{\min(N, |\text{Gold}_i|)} $$

where $Q$ is the set of all evaluation queries, $i$ indexes a specific query, $\text{Gold}_i$ denotes the set of ground-truth answers for the $i$-th query, $|\text{Gold}_i|$ is the number of ground-truth answers for that query, and $\text{Top-}N_i$ is the set of top-N results retrieved by the system for that query.
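The metric is simple to compute directly from the formula; a short sketch:

```python
def recall_at_n(retrieved, gold, n):
    """Recall@N for a single query, following the formula above."""
    hits = len(set(retrieved[:n]) & set(gold))
    return hits / min(n, len(gold))

def mean_recall_at_n(runs, n):
    """Average over all queries; runs is a list of (retrieved_ids, gold_ids)
    pairs, one per query, which reproduces the reported R@N numbers."""
    return sum(recall_at_n(r, g, n) for r, g in runs) / len(runs)
```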

F1 Score is the harmonic mean of precision and recall. Let $TP$, $FP$, and $FN$ denote the number of true positives, false positives, and false negatives, respectively:

$$ \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

F1 is widely used in classification and span-level extraction tasks, such as named entity recognition or QA.
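For QA-style answers, F1 is usually computed at the token level, with the token overlap playing the role of $TP$; a sketch of that standard formulation:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a gold answer."""
    pred, gold = prediction.split(), reference.split()
    if not pred or not gold:
        return 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())  # acts as TP
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```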

BLEU-1 evaluates unigram precision between the generated output and the reference. BLEU-1 is computed as:

$$ \text{BLEU-1} = \frac{\#\,\text{matching unigrams}}{\#\,\text{unigrams in hypothesis}} $$

A brevity penalty is typically applied to discourage overly short outputs. BLEU-1 is appropriate for tasks like machine translation and text generation where word-level overlap is informative.
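A sketch of BLEU-1 with the standard brevity penalty, using clipped unigram counts:

```python
import math
from collections import Counter

def bleu_1(hypothesis, reference):
    """Clipped unigram precision with a brevity penalty (BLEU-1)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    clipped = sum((Counter(hyp) & Counter(ref)).values())
    precision = clipped / len(hyp)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * precision
```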

Table 1: Recall comparison on the LoCoMo dataset; each cell shows R@1 / R@3 / R@5.

| Model | Method | Single Hop | Multi Hop | Temporal | Open Domain | Adversarial | Average |
|---|---|---|---|---|---|---|---|
| GPT-4o | NaiveRAG | 14.89 / 15.07 / 20.90 | 26.79 / 37.23 / 44.11 | 7.61 / 14.49 / 19.38 | 20.45 / 34.52 / 42.15 | 9.64 / 20.18 / 28.81 | 15.88 / 24.30 / 31.07 |
| | MemoryBank | 15.60 / 12.59 / 14.99 | 23.05 / 30.52 / 36.47 | 9.78 / 13.41 / 15.80 | 13.67 / 20.87 / 26.38 | 8.52 / 17.15 / 23.99 | 14.12 / 18.93 / 23.53 |
| | A-MEM | 24.82 / 23.76 / 29.73 | 33.02 / 49.79 / 58.96 | 16.30 / 22.28 / 30.02 | 29.01 / 44.81 / 53.90 | 10.54 / 20.96 / 25.11 | 22.74 / 32.32 / 39.54 |
| | MMS | 28.53 / 30.18 / 34.06 | 44.18 / 59.87 / 67.05 | 23.73 / 26.63 / 32.23 | 34.98 / 53.01 / 62.04 | 15.31 / 31.46 / 37.65 | 29.35 / 40.23 / 46.61 |
| Qwen2.5-14B | MemoryBank | 21.63 / 22.75 / 26.51 | 32.71 / 47.35 / 55.78 | 13.04 / 20.83 / 25.09 | 26.04 / 39.36 / 47.23 | 9.64 / 18.27 / 24.89 | 20.62 / 29.71 / 35.90 |
| | A-MEM | 25.21 / 25.71 / 27.45 | 39.19 / 52.49 / 59.01 | 21.73 / 23.19 / 32.46 | 31.27 / 46.63 / 54.82 | 14.57 / 27.58 / 32.85 | 26.39 / 35.12 / 41.32 |
| | MMS | 23.76 / 27.93 / 27.83 | 39.25 / 53.58 / 59.13 | 23.39 / 25.72 / 32.59 | 30.67 / 48.61 / 57.57 | 16.14 / 27.46 / 34.64 | 26.64 / 36.66 / 42.35 |
| Gemini-2.5-pro-preview | MemoryBank | 14.89 / 13.23 / 14.10 | 28.03 / 35.10 / 40.58 | 8.70 / 11.05 / 12.63 | 14.98 / 25.02 / 30.42 | 11.88 / 20.52 / 28.04 | 15.70 / 21.07 / 25.15 |
| | A-MEM | 17.73 / 17.43 / 22.42 | 25.55 / 38.32 / 46.73 | 10.87 / 15.04 / 19.93 | 22.47 / 37.06 / 43.32 | 13.46 / 23.99 / 29.93 | 18.02 / 26.37 / 32.47 |
| | MMS | 26.60 / 23.46 / 31.27 | 34.58 / 49.84 / 55.56 | 16.30 / 23.55 / 25.85 | 31.15 / 48.19 / 54.30 | 12.78 / 27.01 / 35.20 | 23.68 / 34.41 / 40.44 |

Table 2: Response quality comparison on the LoCoMo dataset; each cell shows F1 / BLEU-1.

| Model | Method | Single Hop | Multi Hop | Temporal | Open Domain | Adversarial | Average |
|---|---|---|---|---|---|---|---|
| GPT-4o | NaiveRAG | 18.04 / 12.24 | 33.03 / 28.28 | 14.03 / 12.15 | 31.74 / 26.78 | 7.60 / 6.90 | 20.89 / 17.27 |
| | MemoryBank | 15.39 / 11.18 | 28.44 / 24.18 | 11.91 / 11.23 | 22.04 / 18.86 | 8.65 / 7.93 | 17.29 / 14.68 |
| | A-MEM | 22.98 / 15.98 | 34.35 / 29.57 | 14.66 / 13.21 | 35.64 / 30.19 | 9.24 / 8.58 | 23.37 / 19.65 |
| | MMS | 28.67 / 20.96 | 47.37 / 39.98 | 20.81 / 19.07 | 42.98 / 36.89 | 12.87 / 11.67 | 30.54 / 25.74 |
| Qwen2.5-14B | NaiveRAG | 18.56 / 12.31 | 29.26 / 25.44 | 13.90 / 12.28 | 31.60 / 27.40 | 10.77 / 9.46 | 20.82 / 17.38 |
| | MemoryBank | 21.37 / 14.79 | 28.41 / 24.26 | 13.97 / 12.13 | 32.99 / 28.28 | 11.03 / 10.04 | 21.55 / 17.90 |
| | A-MEM | 30.71 / 23.44 | 32.46 / 28.55 | 14.08 / 13.97 | 43.02 / 37.69 | 11.38 / 10.11 | 26.33 / 22.75 |
| | MMS | 28.91 / 22.40 | 34.00 / 29.40 | 14.52 / 14.01 | 44.69 / 38.76 | 13.28 / 11.98 | 27.08 / 23.31 |
| Gemini-2.5-pro-preview | NaiveRAG | 23.80 / 17.27 | 34.96 / 26.92 | 14.24 / 12.50 | 30.64 / 26.43 | 5.76 / 5.15 | 21.88 / 17.65 |
| | MemoryBank | 14.12 / 9.30 | 34.15 / 29.36 | 8.55 / 6.48 | 25.82 / 22.17 | 4.62 / 3.92 | 17.45 / 14.25 |
| | A-MEM | 22.69 / 15.83 | 40.97 / 35.63 | 12.42 / 11.11 | 36.23 / 31.42 | 7.20 / 7.02 | 23.90 / 20.20 |
| | MMS | 25.96 / 18.96 | 41.49 / 36.99 | 12.91 / 11.32 | 42.00 / 36.59 | 7.09 / 6.81 | 25.89 / 22.13 |

We compare our method with three baselines: Naive RAG, MemoryBank, and A-MEM.

Naive RAG Only the characters' dialogue content is vectorized as memory. A question is vectorized and compared against all memory vectors for similarity, and the original dialogue corresponding to the top-k vectors is used as context for answering.

MemoryBank (Zhong et al. 2024) The memory system comprises three main components: storage, which saves daily chats, event summaries, and user personality and mood assessments; retrieval, which encodes dialogues and summaries into vectors for subsequent recall; and memory intensity update, which utilizes an exponential decay model to mimic the Ebbinghaus forgetting curve.

A-MEM (Xu et al. 2025) The method has four steps: note construction (creating memory notes with conversation content, timestamps, keywords, tags, context, and links), link generation (identifying relevant historical memories and using the LLM to analyze their connections), memory evolution (deciding on memory updates based on new notes), and memory retrieval (using cosine similarity between question and note vectors to fetch the K most relevant memories for answering).

We employed GPT-4o, Qwen2.5-14B, and Gemini-2.5-pro-preview as base models, setting the temperature to 0.5 for memory generation and 0.7 for question answering. For question answering, we input the top 5 most relevant contextual memory units (matched via their retrieval memory units) as context for the agent.
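For concreteness, this setup can be summarized as a configuration sketch; the key names are illustrative assumptions, not from the authors' code.

```python
# Experimental setup as described above.
EXPERIMENT_CONFIG = {
    "base_models": ["GPT-4o", "Qwen2.5-14B", "Gemini-2.5-pro-preview"],
    "temperature": {"memory_generation": 0.5, "question_answering": 0.7},
    "top_k": 5,  # contextual memory units passed as context per query
}
```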

We conducted a comparative experimental analysis of memory retrieval and memory usage ability. For memory retrieval, recall rates (R@1, R@3, and R@5) were compared. For memory usage, F1, which measures the accuracy of responses, and BLEU-1, which measures the quality of responses, were compared.

The experimental results show that MMS is superior to the other methods in most cases, with a significant overall improvement. The improvement is most pronounced in the Multi Hop scenario, where MMS achieves the largest gains across models on R@1, R@3, and R@5; GPT-4o and Gemini in particular improve recall by 8-11 points over A-MEM on this task. Multi Hop tasks rely on deep associative reasoning across multiple documents, and the multi-memory fragment strategy of MMS suits this complex retrieval scenario, strengthening cross-document integration and reasoning. Clear improvements also appear in the Open Domain and Temporal scenarios, indicating that MMS is robust in contexts with ambiguous and uncertain information structures. In the Single Hop scenario, MMS outperforms the other methods in most cases but slightly underperforms in a few: for simple single-hop queries, the multiple memory fragments may introduce some redundancy. The recall results are shown in Table 1.

MMS achieves leading performance across multiple mainstream large language models, significantly outperforming existing methods such as NaiveRAG, MemoryBank, and A-MEM. On both F1 and BLEU-1, MMS improves over the baselines in nearly all settings across the five task types: single-hop, multi-hop, temporal, open-domain, and adversarial question answering. Its advantages are most pronounced in multi-hop reasoning and open-domain tasks, demonstrating strong information integration and reasoning depth. This strength likely stems from its multi-level memory structure, which helps improve question-answering accuracy, handle complex logical relationships, and resist input interference. The experimental results are shown in Table 2.

Table 3: Ablation of retrieval memory unit components (GPT-4o base); each cell shows R@1 / R@3 / R@5.

| Method | Single Hop | Multi Hop | Temporal | Open Domain | Adversarial | Avg |
|---|---|---|---|---|---|---|
| w/o Key | 30.85 / 29.72 / 34.02 | 42.68 / 59.29 / 66.17 | 23.21 / 23.55 / 31.86 | 34.60 / 52.55 / 62.18 | 11.88 / 27.36 / 33.40 | 28.64 / 38.49 / 45.53 |
| w/o Cog | 26.24 / 26.06 / 32.73 | 43.61 / 59.96 / 66.33 | 20.65 / 27.53 / 31.83 | 31.98 / 49.36 / 59.53 | 15.47 / 30.49 / 37.57 | 27.59 / 38.68 / 45.60 |
| w/o Epi | 27.66 / 27.30 / 33.17 | 42.37 / 59.03 / 68.38 | 18.47 / 24.99 / 31.68 | 31.51 / 47.98 / 57.51 | 15.02 / 29.48 / 37.00 | 27.01 / 37.76 / 45.55 |
| w/o Cog & Epi | 21.99 / 20.57 / 24.21 | 37.69 / 55.40 / 62.12 | 13.04 / 21.20 / 23.82 | 26.75 / 41.52 / 48.79 | 13.00 / 26.01 / 35.99 | 22.49 / 32.94 / 38.99 |
| MMS | 28.53 / 30.18 / 34.06 | 44.18 / 59.87 / 67.05 | 23.73 / 26.63 / 32.23 | 34.98 / 53.01 / 62.04 | 15.31 / 31.46 / 37.65 | 29.35 / 40.23 / 46.61 |

Table 4: Ablation of contextual memory unit components (GPT-4o base); each cell shows F1 / BLEU-1.

| Method | Single Hop | Multi Hop | Temporal | Open Domain | Adversarial | Average |
|---|---|---|---|---|---|---|
| w/o Key | 27.51 / 18.53 | 44.24 / 36.52 | 19.28 / 16.12 | 41.27 / 35.53 | 10.88 / 10.12 | 28.64 / 23.36 |
| w/o Cog | 27.77 / 19.77 | 45.68 / 38.46 | 17.36 / 15.47 | 43.01 / 36.87 | 12.27 / 11.65 | 29.22 / 24.44 |
| w/o Sem | 27.79 / 19.16 | 45.31 / 37.94 | 19.51 / 18.50 | 42.69 / 36.60 | 12.45 / 11.44 | 29.55 / 24.73 |
| w/o Cog & Sem | 28.24 / 20.24 | 45.05 / 38.34 | 17.92 / 15.44 | 42.71 / 37.42 | 12.83 / 11.53 | 29.35 / 24.59 |
| MMS | 28.67 / 20.96 | 47.37 / 39.98 | 20.81 / 19.07 | 42.98 / 36.89 | 12.87 / 11.67 | 30.54 / 25.74 |

We conducted ablation experiments on each module of MMS. Without the keyword part, the R@1 metric on the Single Hop task and the R@5 metric on the Open Domain task are the best. Without multiple cognitive perspectives, the R@3 metric on the Multi Hop task, the R@5 metric on the Temporal task, and the R@1 metric on the Adversarial task are the best. This suggests that retrieval units could be composed from different memory fragments depending on the scenario; in a broader general scenario, retrieval units composed of keywords, short-term memory, multiple cognitive perspectives, and episodic memory achieve the best performance on most tasks. The experimental results are shown in Tables 3 and 4.

We used GPT-4o as the base model to compare the effects of adding the remaining memory fragment to the MMS retrieval units and contextual memory units. For retrieval, the retrieval memory units consist of keywords, short-term memory, multiple cognitive perspectives, and episodic memory. Adding semantic memory for comparison, we find that the recall rate decreases: generated semantic memory is a high-level overview of the factual knowledge in short-term memory, creating a gap with the semantic form of the user's question. For generation, the contextual memory units of MMS consist of keywords, short-term memory, multiple cognitive perspectives, and semantic memory. Adding episodic memory for comparison, we find that generation quality decreases: episodic memory describes event-related content already present in the original short-term memory, so adding it introduces redundancy and lowers generation quality. Thus, the composition of our retrieval and contextual memory units is reasonable. The experimental results are shown in Figure 2.

We varied the number of memory units n provided as context to evaluate the impact on performance. As n increased, performance kept improving despite the additional noise, because the high-quality memory content helps the model distinguish pivotal information from noise. This demonstrates the robustness of our method with respect to the number of memory segments. The results are shown in Table 6.

We experimentally analyzed the token and latency overhead of the different methods, measuring the average overhead required to generate the memory content for each query. Compared to A-MEM, our approach is faster and more token-efficient. Latency and token overhead rise slightly compared to MemoryBank, but MemoryBank's long-term memory is overly simplistic and of low quality, resulting in poor performance. In exchange for this modest extra overhead, our method generates a larger volume of high-quality memory content, with a latency increase small enough to barely affect user experience; it therefore holds practical value. The results are shown in Table 7.
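A minimal sketch of how such per-query overhead could be measured, assuming a `generate_memory` wrapper around the LLM call and a `count_tokens` tokenizer function (both hypothetical helpers, not the authors' API):

```python
import time

def measure_memory_overhead(generate_memory, count_tokens, dialogue_rounds):
    """Average latency and token count of memory generation per dialogue
    round, mirroring the comparison reported in Table 7."""
    latencies, token_counts = [], []
    for round_text in dialogue_rounds:
        start = time.perf_counter()
        memory_text = generate_memory(round_text)  # LLM call producing fragments
        latencies.append(time.perf_counter() - start)
        token_counts.append(count_tokens(memory_text))
    return sum(latencies) / len(latencies), sum(token_counts) / len(token_counts)
```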

This paper builds a multi-memory segment system grounded in cognitive psychology to construct effective long-term memory, improving both recall and generation quality. Multiple memory systems theory holds that human memory takes many forms, yet current methods ignore this variety and rely on simple summarization, yielding low-quality content. Since people understand problems from different angles, we treat the various cognitive aspects of short-term memory as long-term memory fragments. Following levels-of-processing theory, which states that deeper encoding leads to better retention, MMS processes short-term memory into long-term fragments such as keywords, different cognitive views, and episodic and semantic memories. From these it constructs retrieval units for recall and contextual units for knowledge enhancement. Experiments indicate that MMS improves recall and generation at modest cost, demonstrating practical value, and our analysis under varying memory segment counts shows that high-quality content yields sustained gains and noise resilience. Due to space constraints, this work centers on memory content; future work will add more memory operation designs. Our approach offers a way to integrate cognitive psychology into research on agent memory in artificial intelligence for future studies.

Table 6: Performance comparison for different values of n.

| Metrics | n=1 | n=3 | n=5 | n=7 | n=9 |
|---|---|---|---|---|---|
| Avg F1 | 20.74 | 25.07 | 30.54 | 34.81 | 36.13 |
| Avg BLEU-1 | 17.44 | 21.39 | 25.74 | 28.95 | 31.28 |

Table 7: Comparison of latency overhead and token overhead.

| Metrics | MMS | A-MEM | MemoryBank |
|---|---|---|---|
| Avg Latency | 1.309 | 3.931 | 0.949 |
| Avg Tokens | 744 | 1429 | 238 |

Figure 1: Schematic of the multi-memory system process. After acquiring short-term memory, MMS processes it into memory fragments and constructs retrieval units and contextual memory units. During retrieval, the k most relevant retrieval units are matched to the query, and their corresponding contextual units are then used as context input for the agent's response.

Figure 2: Comparison of the impact on performance after adding other segments. For recall metrics, MMS and MMS+Sem are compared; for generation, MMS and MMS+Epi. Sem refers to semantic memory, and Epi refers to episodic memory.



References

[reimers2021allminilm] Reimers, Nils, Gurevych, Iryna. (2021). all-MiniLM-L6-v2 Sentence Transformer.

[hurst2024gpt] Hurst, Aaron, Lerer, Adam, Goucher, Adam P, Perelman, Adam, Ramesh, Aditya, Clark, Aidan, Ostrow, AJ, Welihinda, Akila, Hayes, Alan, Radford, Alec, others. (2024). Gpt-4o system card. arXiv preprint arXiv:2410.21276.

[comanici2025gemini] Comanici, Gheorghe, Bieber, Eric, Schaekermann, Mike, Pasupat, Ice, Sachdeva, Noveen, Dhillon, Inderjit, Blistein, Marcel, Ram, Ori, Zhang, Dan, Rosen, Evan, others. (2025). Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

[yang2024qwen2] Yang, An, Yang, Baosong, Zhang, Beichen, Hui, Binyuan, Zheng, Bo, Yu, Bowen, Li, Chengyuan, Liu, Dayiheng, Huang, Fei, Wei, Haoran, others. (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

[packer2023memgpt] Packer, Charles, Fang, Vivian, Patil, Shishir G, Lin, Kevin, Wooders, Sarah, Gonzalez, Joseph E. (2023). MemGPT: Towards LLMs as Operating Systems.

[alonso2024toward] Alonso, Nick, Figliolia, Tomás, others. (2024). Toward conversational agents with context and time sensitive long-term memory. arXiv preprint arXiv:2406.00057.

[zhou2025memento] Zhou, Huichi, Chen, Yihang, Guo, Siyuan, Yan, Xue, Lee, Kin Hei, Wang, Zihan, Lee, Ka Yiu, Zhang, Guchun, Shao, Kun, Yang, Linyi, others. (2025). Memento: Fine-tuning llm agents without fine-tuning llms. arXiv preprint arXiv:2508.16153.

[chhikara2025mem0] Chhikara, Prateek, Khant, Dev, Aryan, Saket, Singh, Taranjeet, Yadav, Deshraj. (2025). Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.

[yan2025memory] Yan, Sikuan, Yang, Xiufeng, Huang, Zuchao, Nie, Ercong, Ding, Zifeng, Li, Zonggen, Ma, Xiaowen, Kersting, Kristian, Pan, Jeff Z, Schütze, Hinrich, others. (2025). Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning. arXiv preprint arXiv:2508.19828.

[hou2024my] Hou, Yuki, Tamoto, Haruki, Miyashita, Homei. (2024). "My agent understands me better": Integrating dynamic human-like memory recall and consolidation in LLM-based agents. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems.

[hu2023chatdb] Hu, Chenxu, Fu, Jie, Du, Chenzhuang, Luo, Simian, Zhao, Junbo, Zhao, Hang. (2023). Chatdb: Augmenting llms with databases as their symbolic memory. arXiv preprint arXiv:2306.03901.

[zhong2024memorybank] Zhong, Wanjun, Guo, Lianghong, Gao, Qiqi, Ye, He, Wang, Yanlin. (2024). Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence.

[lu2023memochat] Lu, Junru, An, Siyu, Lin, Mingbao, Pergola, Gabriele, He, Yulan, Yin, Di, Sun, Xing, Wu, Yunsheng. (2023). Memochat: Tuning llms to use memos for consistent long-range open-domain conversation. arXiv preprint arXiv:2308.08239.

[liu2023think] Liu, Lei, Yang, Xiaoyan, Shen, Yue, Hu, Binbin, Zhang, Zhiqiang, Gu, Jinjie, Zhang, Guannan. (2023). Think-in-memory: Recalling and post-thinking enable llms with long-term memory. arXiv preprint arXiv:2311.08719.

[xi2025rise] Xi, Zhiheng, Chen, Wenxiang, Guo, Xin, He, Wei, Ding, Yiwen, Hong, Boyang, Zhang, Ming, Wang, Junzhe, Jin, Senjie, Zhou, Enyu, others. (2025). The rise and potential of large language model based agents: A survey. Science China Information Sciences.

[zhang2024survey] Zhang, Zeyu, Bo, Xiaohe, Ma, Chen, Li, Rui, Chen, Xu, Dai, Quanyu, Zhu, Jieming, Dong, Zhenhua, Wen, Ji-Rong. (2024). A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501.

[sumers2023cognitive] Sumers, Theodore, Yao, Shunyu, Narasimhan, Karthik, Griffiths, Thomas. (2023). Cognitive architectures for language agents. Transactions on Machine Learning Research.

[wang2024survey] Wang, Lei, Ma, Chen, Feng, Xueyang, Zhang, Zeyu, Yang, Hao, Zhang, Jingsen, Chen, Zhiyuan, Tang, Jiakai, Chen, Xu, Lin, Yankai, others. (2024). A survey on large language model based autonomous agents. Frontiers of Computer Science.

[wang2023augmenting] Wang, Weizhi, Dong, Li, Cheng, Hao, Liu, Xiaodong, Yan, Xifeng, Gao, Jianfeng, Wei, Furu. (2023). Augmenting language models with long-term memory. Advances in Neural Information Processing Systems.

[huszar2018note] Huszár, Ferenc. (2018). Note on the quadratic penalties in elastic weight consolidation. Proceedings of the National Academy of Sciences.

[ji2023survey] Ji, Ziwei, Lee, Nayeon, Frieske, Rita, Yu, Tiezheng, Su, Dan, Xu, Yan, Ishii, Etsuko, Bang, Ye Jin, Madotto, Andrea, Fung, Pascale. (2023). Survey of hallucination in natural language generation. ACM computing surveys.

[gao2023retrieval] Gao, Yunfan, Xiong, Yun, Gao, Xinyu, Jia, Kangxiang, Pan, Jinliu, Bi, Yuxi, Dai, Yixin, Sun, Jiawei, Wang, Haofen, Wang, Haofen. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

[gutierrez2024hipporag] Gutiérrez, Bernal Jiménez, Shu, Yiheng, Gu, Yu, Yasunaga, Michihiro, Su, Yu. (2024). HippoRAG: Neurobiologically inspired long-term memory for large language models. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[tulving1972episodic] Tulving, Endel, others. (1972). Episodic and semantic memory. Organization of memory.

[tulving1986episodic] Tulving, Endel. (1986). Episodic and semantic memory: Where should we go from here?. Behavioral and Brain Sciences.

[xu2025amem] Xu, Wujiang, Liang, Zujie, Mei, Kai, Gao, Hang, Tan, Juntao, Zhang, Yongfeng. (2025). A-mem: Agentic memory for llm agents. Advances in Neural Information Processing Systems.

[tulving1985many] Tulving, Endel. (1985). How many memory systems are there?. American psychologist.

[craik1972levels] Craik, Fergus IM, Lockhart, Robert S. (1972). Levels of processing: A framework for memory research. Journal of verbal learning and verbal behavior.

[tulving1973encoding] Tulving, Endel, Thomson, Donald M. (1973). Encoding specificity and retrieval processes in episodic memory.. Psychological review.

[maharana2024evaluating] Maharana, Adyasha, Lee, Dong-Ho, Tulyakov, Sergey, Bansal, Mohit, Barbieri, Francesco, Fang, Yuwei. (2024). Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753.

[bib1] Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.

[bib2] Craik, F. I.; and Lockhart, R. S. 1972. Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior, 11(6): 671–684.

[bib3] Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, H.; and Wang, H. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

[bib4] Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[bib5] Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

[bib6] Gutiérrez, B. J.; Shu, Y.; Gu, Y.; Yasunaga, M.; and Su, Y. 2024. HippoRAG: Neurobiologically inspired long-term memory for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.

[bib7] Hou, Y.; Tamoto, H.; and Miyashita, H. 2024. "My agent understands me better": Integrating dynamic human-like memory recall and consolidation in LLM-based agents. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 1–7.

[bib8] Hu, C.; Fu, J.; Du, C.; Luo, S.; Zhao, J.; and Zhao, H. 2023. ChatDB: Augmenting LLMs with databases as their symbolic memory. arXiv preprint arXiv:2306.03901.

[bib9] Huszár, F. 2018. Note on the quadratic penalties in elastic weight consolidation. Proceedings of the National Academy of Sciences, 115(11): E2496–E2497.

[bib10] Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y. J.; Madotto, A.; and Fung, P. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12): 1–38.

[bib11] Liu, L.; Yang, X.; Shen, Y.; Hu, B.; Zhang, Z.; Gu, J.; and Zhang, G. 2023. Think-in-Memory: Recalling and post-thinking enable LLMs with long-term memory. arXiv preprint arXiv:2311.08719.

[bib12] Lu, J.; An, S.; Lin, M.; Pergola, G.; He, Y.; Yin, D.; Sun, X.; and Wu, Y. 2023. MemoChat: Tuning LLMs to use memos for consistent long-range open-domain conversation. arXiv preprint arXiv:2308.08239.

[bib13] Maharana, A.; Lee, D.-H.; Tulyakov, S.; Bansal, M.; Barbieri, F.; and Fang, Y. 2024. Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753.

[bib14] Sumers, T.; Yao, S.; Narasimhan, K.; and Griffiths, T. 2023. Cognitive architectures for language agents. Transactions on Machine Learning Research.

[bib15] Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.-B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A. M.; Hauth, A.; Millican, K.; et al. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

[bib16] Tulving, E. 1985. How many memory systems are there? American Psychologist, 40(4): 385.

[bib17] Tulving, E. 1986. Episodic and semantic memory: Where should we go from here? Behavioral and Brain Sciences, 9(3): 573–577.

[bib18] Tulving, E.; and Thomson, D. M. 1973. Encoding specificity and retrieval processes in episodic memory. Psychological Review, 80(5): 352.

[bib19] Tulving, E.; et al. 1972. Episodic and semantic memory. In Organization of Memory, 381–403.

[bib20] Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. 2024. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6): 186345.

[bib21] Wang, W.; Dong, L.; Cheng, H.; Liu, X.; Yan, X.; Gao, J.; and Wei, F. 2023. Augmenting language models with long-term memory. Advances in Neural Information Processing Systems, 36: 74530–74543.

[bib22] Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. 2025. The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2): 121101.

[bib23] Xu, W.; Liang, Z.; Mei, K.; Gao, H.; Tan, J.; and Zhang, Y. 2025. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110.

[bib24] Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

[bib25] Zhang, Z.; Bo, X.; Ma, C.; Li, R.; Chen, X.; Dai, Q.; Zhu, J.; Dong, Z.; and Wen, J.-R. 2024. A survey on the memory mechanism of large language model based agents. arXiv preprint arXiv:2404.13501.

[bib26] Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; and Wang, Y. 2024. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 19724–19731.