
A-Mem: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang

Abstract

While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution: as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show significant improvements over existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/AgenticMemory, and the source code of the agentic memory system is available at https://github.com/agiresearch/A-mem.
Code for Benchmark Evaluation: https://github.com/WujiangXu/AgenticMemory
Code for Production-ready Agentic Memory: https://github.com/WujiangXu/A-mem-sys


Wujiang Xu 1, Zujie Liang 2, Kai Mei 1, Hang Gao 1, Juntao Tan 1, Yongfeng Zhang 1,3

1 Rutgers University  2 Independent Researcher  3 AIOS Foundation

wujiang.xu@rutgers.edu


Introduction

Large Language Model (LLM) agents have demonstrated remarkable capabilities in various tasks, with recent advances enabling them to interact with environments, execute tasks, and make decisions autonomously [23, 33, 7]. They integrate LLMs with external tools and carefully designed workflows to improve reasoning and planning abilities. Although LLM agents have strong reasoning performance, they still need a memory system to support long-term interaction with the external environment [35].

Existing memory systems [25, 39, 28, 21] for LLM agents provide basic memory storage functionality. These systems require agent developers to predefine memory storage structures, specify storage points within the workflow, and establish retrieval timing. Meanwhile, to improve structured memory organization, Mem0 [8], following the principles of RAG [9, 18, 30], incorporates graph databases for storage and retrieval processes. While graph databases provide structured organization for memory systems, their reliance on predefined schemas and relationships fundamentally limits their adaptability. This limitation manifests clearly in practical scenarios - when an agent learns a novel mathematical solution, current systems can only categorize and link this information within their preset framework,


  • (a) Traditional memory system.
  • (b) Our proposed agentic memory.

Figure 1: Traditional memory systems require predefined memory access patterns specified in the workflow, limiting their adaptability to diverse scenarios. In contrast, our A-MEM enhances the flexibility of LLM agents by enabling dynamic memory operations.

unable to forge innovative connections or develop new organizational patterns as knowledge evolves. Such rigid structures, coupled with fixed agent workflows, severely restrict these systems' ability to generalize across new environments and maintain effectiveness in long-term interactions. The challenge becomes increasingly critical as LLM agents tackle more complex, open-ended tasks, where flexible knowledge organization and continuous adaptation are essential. Therefore, how to design a flexible and universal memory system that supports LLM agents' long-term interactions remains a crucial challenge.

In this paper, we introduce a novel agentic memory system, named A-MEM, for LLM agents that enables dynamic memory structuring without relying on static, predetermined memory operations. Our approach draws inspiration from the Zettelkasten method [15, 1], a sophisticated knowledge management system that creates interconnected information networks through atomic notes and flexible linking mechanisms. For each new memory, we construct a comprehensive note that integrates multiple representations: structured textual attributes (contextual descriptions, keywords, and tags) and an embedding vector for similarity matching. A-MEM then analyzes the historical memory repository to establish meaningful connections based on semantic similarities and shared attributes. This integration process not only creates new links but also enables dynamic evolution: when new memories are incorporated, they can trigger updates to the contextual representations of existing memories, allowing the entire memory network to continuously refine and deepen its understanding over time. The contributions are summarized as:

  • We present A-MEM, an agentic memory system for LLM agents that enables autonomous generation of contextual descriptions, dynamic establishment of memory connections, and intelligent evolution of existing memories based on new experiences. This system equips LLM agents with long-term interaction capabilities without requiring predetermined memory operations.
  • We design an agentic memory update mechanism where new memories automatically trigger two key operations: link generation and memory evolution. Link generation automatically establishes connections between memories by identifying shared attributes and similar contextual descriptions. Memory evolution enables existing memories to dynamically adapt as new experiences are analyzed, leading to the emergence of higher-order patterns and attributes.
  • We conduct comprehensive evaluations of our system on a long-term conversational dataset, comparing performance across six foundation models using six distinct evaluation metrics and demonstrating significant improvements. Moreover, we provide t-SNE visualizations to illustrate the structured organization of our agentic memory system.

Memory for LLM Agents

Prior works on LLM agent memory systems have explored various mechanisms for memory management and utilization [23, 21, 8, 39]. Some approaches adopt complete interaction storage, maintaining comprehensive historical records through dense retrieval models [39] or read-write memory structures [24]. Moreover, MemGPT [25] leverages cache-like architectures to prioritize recent information. Similarly, SCM [32] proposes a Self-Controlled Memory framework that enhances LLMs' capability to maintain long-term memory through a memory stream and controller mechanism. However, these approaches face significant limitations in handling diverse real-world tasks. While they can provide basic memory functionality, their operations are typically constrained by predefined structures and fixed workflows. These constraints stem from their reliance on rigid operational

Figure 2: Our A-MEM architecture comprises three integral parts in memory storage. During note construction, the system processes new interaction memories and stores them as notes with multiple attributes. The link generation process first retrieves the most relevant historical memories and then employs an LLM to determine whether connections should be established between them. The concept of a 'box' describes that related memories become interconnected through their similar contextual descriptions, analogous to the Zettelkasten method. However, our approach allows individual memories to exist simultaneously within multiple different boxes. During the memory retrieval stage, we extract query embeddings using a text encoding model and search the memory database for relevant matches. When related memory is retrieved, similar memories that are linked within the same box are also automatically accessed.


patterns, particularly in memory writing and retrieval processes. Such inflexibility leads to poor generalization in new environments and limited effectiveness in long-term interactions. Therefore, designing a flexible and universal memory system that supports agents' long-term interactions remains a crucial challenge.

Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to enhance LLMs by incorporating external knowledge sources [18, 6, 10]. The standard RAG [37, 34] process involves indexing documents into chunks, retrieving relevant chunks based on semantic similarity, and augmenting the LLM's prompt with this retrieved context for generation. Advanced RAG systems [20, 12] have evolved to include sophisticated pre-retrieval and post-retrieval optimizations. Building upon these foundations, recent research has introduced agentic RAG systems that demonstrate more autonomous and adaptive behaviors in the retrieval process. These systems can dynamically determine when and what to retrieve [4, 14], generate hypothetical responses to guide retrieval, and iteratively refine their search strategies based on intermediate results [31, 29].
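To make the standard index-retrieve-augment pipeline concrete, a minimal sketch might look like the following; the `chunk`, `retrieve`, and `augment` helpers and the `score` callable are illustrative assumptions, not the API of any specific RAG library:

```python
def chunk(document, size=200):
    """Split a document into fixed-size chunks (naive whitespace split)."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query, chunks, score, k=3):
    """Rank chunks by a semantic-similarity score and keep the top-k."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

def augment(query, context_chunks):
    """Prepend retrieved context to the user query for the LLM prompt."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

In a real system `score` would be an embedding-based similarity rather than the lexical overlap used for testing here.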

However, while agentic RAG approaches demonstrate agency in the retrieval phase by autonomously deciding when and what to retrieve [4, 14, 38], our agentic memory system exhibits agency at a more fundamental level through the autonomous evolution of its memory structure. Inspired by the Zettelkasten method, our system allows memories to actively generate their own contextual descriptions, form meaningful connections with related memories, and evolve both their content and relationships as new experiences emerge. This fundamental difference, agency in storage and evolution rather than in retrieval alone, distinguishes our approach from agentic RAG systems, which maintain static knowledge bases despite their sophisticated retrieval mechanisms.

Methodology

Our proposed agentic memory system draws inspiration from the Zettelkasten method, implementing a dynamic and self-evolving memory system that enables LLM agents to maintain long-term memory without predetermined operations. The system's design emphasizes atomic note-taking, flexible linking mechanisms, and continuous evolution of knowledge structures.

Note Construction

Building upon the Zettelkasten method's principles of atomic note-taking and flexible organization, we introduce an LLM-driven approach to memory note construction. When an agent interacts with its environment, we construct structured memory notes that capture both explicit information and LLM-generated contextual understanding. Each memory note m_i in our collection M = {m_1, m_2, ..., m_N} is represented as:

$$
m_i = \{ c_i, t_i, K_i, G_i, X_i, L_i \}
$$

where c_i represents the original interaction content, t_i is the timestamp of the interaction, K_i denotes LLM-generated keywords that capture key concepts, G_i contains LLM-generated tags for categorization, X_i represents the LLM-generated contextual description that provides rich semantic understanding, and L_i maintains the set of linked memories that share semantic relationships. To enrich each memory note with meaningful context beyond its basic content and timestamp, we leverage an LLM to analyze the interaction and generate these semantic components. The note construction process involves prompting the LLM with carefully designed templates P^{s_1}:

$$
K_i, G_i, X_i = \mathrm{LLM}\left( P^{s_1} \,\|\, c_i \,\|\, t_i \right)
$$

Following the Zettelkasten principle of atomicity, each note captures a single, self-contained unit of knowledge. To enable efficient retrieval and linking, we compute a dense vector representation via a text encoder [27] that encapsulates all textual components of the note:

$$
e_i = f_{\mathrm{enc}}\left( c_i \,\|\, K_i \,\|\, G_i \,\|\, X_i \right)
$$

By using LLMs to generate enriched components, we enable autonomous extraction of implicit knowledge from raw interactions. The multi-faceted note structure (K_i, G_i, X_i) creates rich representations that capture different aspects of the memory, facilitating nuanced organization and retrieval. Additionally, the combination of LLM-generated semantic components with dense vector representations provides both context and computationally efficient similarity matching.
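To make the note structure concrete, a minimal sketch might look like the following; `MemoryNote`, `construct_note`, and the `llm`/`embed` callables are illustrative assumptions standing in for the prompt-template call P^{s_1} and the text encoder, not the authors' implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNote:
    content: str                                    # c_i: original interaction
    timestamp: str                                  # t_i
    keywords: list = field(default_factory=list)    # K_i: LLM-generated keywords
    tags: list = field(default_factory=list)        # G_i: LLM-generated tags
    context: str = ""                               # X_i: contextual description
    links: list = field(default_factory=list)       # L_i: linked memories
    embedding: list = field(default_factory=list)   # e_i: dense vector

def construct_note(content, timestamp, llm, embed):
    """Build a note by prompting the LLM for its semantic attributes."""
    attrs = llm(f"Extract keywords, tags and a one-sentence context "
                f"for this interaction:\n{content}")
    note = MemoryNote(content, timestamp,
                      keywords=attrs["keywords"],
                      tags=attrs["tags"],
                      context=attrs["context"])
    # Encode all textual components into one vector for similarity matching.
    note.embedding = embed(" ".join([content, *note.keywords,
                                     *note.tags, note.context]))
    return note
```

Here `llm` is assumed to return structured output (keywords, tags, context), matching the structured-output generation described in the implementation details.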

Link Generation

Our system implements an autonomous link generation mechanism that enables new memory notes to form meaningful connections without predefined rules. When a newly constructed memory note m_n is added to the system, we first leverage its semantic embedding for similarity-based retrieval. For each existing memory note m_j ∈ M, we compute a similarity score:

$$
s^{n}_{j} = \cos(e_n, e_j) = \frac{e_n \cdot e_j}{\|e_n\| \, \|e_j\|}
$$

The system then identifies the top-k most relevant memories:

$$
M^{n}_{\mathrm{near}} = \left\{ m_j \;\middle|\; s^{n}_{j} \in \mathrm{top}\text{-}k\left( \{ s^{n}_{1}, \dots, s^{n}_{|M|} \} \right) \right\}
$$

Based on these candidate nearest memories, we prompt the LLM to analyze potential connections based on their common attributes. Formally, the link set of memory m_n is updated as:

$$
L_n = \mathrm{LLM}\left( P^{s_2} \,\|\, m_n \,\|\, M^{n}_{\mathrm{near}} \right)
$$

Each generated link set is structured as L_n = {m_i, ..., m_k}. By using embedding-based retrieval as an initial filter, we enable efficient scalability while maintaining semantic relevance. A-MEM can quickly identify potential connections even in large memory collections without exhaustive comparison. More importantly, the LLM-driven analysis allows for nuanced understanding of relationships that goes beyond simple similarity metrics. The language model can identify subtle patterns, causal relationships, and conceptual connections that might not be apparent from embedding similarity alone. This implements the Zettelkasten principle of flexible linking while leveraging modern language models. The resulting network emerges organically from memory content and context, enabling natural knowledge organization.
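A minimal sketch of this two-stage process, an embedding-based top-k filter followed by an LLM link decision, might look like this; notes are represented as plain dicts and `llm_decide` is an assumed callable, not the authors' API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def generate_links(new_note, memories, llm_decide, k=10):
    """Filter candidates by embedding similarity, then let the LLM decide
    which of the top-k actually warrant a link. Notes are dicts with
    'embedding' and 'links' keys; llm_decide returns the subset to link."""
    nearest = sorted(memories,
                     key=lambda m: cosine(new_note["embedding"], m["embedding"]),
                     reverse=True)[:k]
    new_note["links"].extend(llm_decide(new_note, nearest))
    return nearest  # the M^n_near set, reused later for memory evolution
```

The cheap vector filter bounds the number of LLM calls per insertion, which is what keeps link generation scalable as the memory collection grows.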

Memory Evolution

After creating links for the new memory, A-MEM evolves the retrieved memories based on their textual information and relationships with the new memory. For each memory m_j in the nearest-neighbor set M^{n}_{near}, the system determines whether to update its context, keywords, and tags. This evolution process can be formally expressed as:

$$
m^{*}_{j} = \mathrm{LLM}\left( P^{s_3} \,\|\, m_j \,\|\, m_n \,\|\, M^{n}_{\mathrm{near}} \right)
$$

The evolved memory m*_j then replaces the original memory m_j in the memory set M. This evolutionary approach enables continuous updates and new connections, mimicking human learning processes. As the system processes more memories over time, it develops increasingly sophisticated knowledge structures, discovering higher-order patterns and concepts across multiple memories. This creates a foundation for autonomous memory learning where knowledge organization becomes progressively richer through the ongoing interaction between new experiences and existing memories.
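The evolution step might be sketched as follows; `llm_evolve` is an assumed callable that returns an updated copy of the neighbor (m*_j) or None when no change is warranted:

```python
def evolve_memories(new_note, neighbors, memories, llm_evolve):
    """For each nearest neighbor m_j, ask the LLM whether its context,
    keywords, or tags should be updated in light of the new note, and
    replace m_j with the evolved m*_j inside the memory set M.
    Notes are plain dicts; memories is the full memory list."""
    for mem in neighbors:
        evolved = llm_evolve(mem, new_note)
        if evolved is not None:
            memories[memories.index(mem)] = evolved
    return memories
```

Because only the k retrieved neighbors are considered, each insertion touches a bounded number of existing memories rather than rewriting the whole store.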

Retrieving Relevant Memories

In each interaction, our A-MEM performs context-aware memory retrieval to provide the agent with relevant historical information. Given a query text q from the current interaction, we first compute its dense vector representation using the same text encoder used for memory notes:

$$
e_q = f_{\mathrm{enc}}(q)
$$

The system then computes similarity scores between the query embedding and all existing memory notes in M using cosine similarity:

$$
s_{q,i} = \cos(e_q, e_i) = \frac{e_q \cdot e_i}{\|e_q\| \, \|e_i\|}
$$

Then we retrieve the k most relevant memories from the historical memory storage to construct a contextually appropriate prompt:

$$
M_{\mathrm{retrieved}} = \left\{ m_i \;\middle|\; s_{q,i} \in \mathrm{top}\text{-}k\left( \{ s_{q,1}, \dots, s_{q,N} \} \right) \right\}
$$

These retrieved memories provide relevant historical context that helps the agent better understand and respond to the current interaction. The retrieved context enriches the agent's reasoning process by connecting the current interaction with related past experiences stored in the memory system.
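Putting the retrieval stage together, a minimal sketch might be the following, with `embed` as an assumed encoder callable (the same one used for notes) and memories as plain dicts:

```python
import math

def retrieve_memories(query, memories, embed, k=10):
    """Encode the query, rank all memory notes by cosine similarity to
    the query embedding, and return the top-k for prompt construction."""
    q = embed(query)

    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    return sorted(memories, key=lambda m: cos(q, m["embedding"]),
                  reverse=True)[:k]
```

A production system would replace the linear scan with an approximate nearest-neighbor index, but the interface is the same.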

Experiment

Dataset and Evaluation

To evaluate the effectiveness of our memory system in long-term conversations, we utilize the LoCoMo dataset [22], which contains significantly longer dialogues than existing conversational datasets [36, 13]. While previous datasets contain dialogues with around 1K tokens over 4-5 sessions, LoCoMo features much longer conversations averaging 9K tokens spanning up to 35 sessions, making it particularly suitable for evaluating models' ability to handle long-range dependencies and maintain consistency over extended conversations. The LoCoMo dataset comprises diverse question types designed to comprehensively evaluate different aspects of model understanding: (1) single-hop questions answerable from a single session; (2) multi-hop questions requiring information synthesis across sessions; (3) temporal reasoning questions testing understanding of time-related information; (4) open-domain knowledge questions requiring integration of conversation context with external knowledge; and (5) adversarial questions assessing models' ability to identify unanswerable queries. In total, LoCoMo contains 7,512 question-answer pairs across these categories. In addition, we use DialSim [16], a question-answering dataset derived from long-term multi-party dialogues, to further evaluate our memory system. The dataset is built from popular TV shows (Friends, The Big Bang Theory, and The Office), covering 1,300 sessions spanning five years, containing approximately 350,000 tokens, and including more than 1,000 questions per session, refined from fan quiz websites or generated from temporal knowledge graphs.

For comparison, we use LoCoMo [22], ReadAgent [17], MemoryBank [39], and MemGPT [25] as baselines. A detailed introduction of the baselines can be found in Appendix A.1. For evaluation, we employ two primary metrics: the F1 score to assess answer accuracy by balancing precision and recall, and BLEU-1 [26] to evaluate generated response quality by measuring word overlap

Table 1: Experimental results on the LoCoMo dataset of QA tasks across five categories (Multi Hop, Temporal, Open Domain, Single Hop, and Adversarial) using different methods. Results are reported in F1 and BLEU-1 (%) scores. The best performance is marked in bold, and our proposed method A-MEM (highlighted in gray) demonstrates competitive performance across six foundation language models.

with ground truth responses. We also report the average token length for answering one question. In addition to reporting experimental results with four further metrics (ROUGE-L, ROUGE-2, METEOR, and SBERT Similarity), we present outcomes using different foundation models, including DeepSeek-R1-32B [11], Claude 3.0 Haiku [2], and Claude 3.5 Haiku [3], in Appendix A.3.

Implementation Details

For all baselines and our proposed method, we maintain consistency by employing identical system prompts as detailed in Appendix B. The deployment of Qwen-1.5B/3B and Llama 3.2 1B/3B models is accomplished through local instantiation using Ollama 1 , with LiteLLM 2 managing structured output generation. For GPT models, we utilize the official structured output API. In our memory retrieval process, we primarily employ k = 10 for top-k memory selection to maintain computational efficiency, while adjusting this parameter for specific categories to optimize performance. The detailed configurations of k can be found in Appendix A.5. For text embedding, we use the all-MiniLM-L6-v2 model across all experiments.

Empirical Results

Performance Analysis. In our empirical evaluation, we compared A-MEM with four competitive baselines, LoCoMo [22], ReadAgent [17], MemoryBank [39], and MemGPT [25], on the LoCoMo dataset. For non-GPT foundation models, our A-MEM consistently outperforms all baselines across different categories, demonstrating the effectiveness of our agentic memory approach. For GPT-based models, while LoCoMo and MemGPT show strong performance in certain categories such as Open Domain and Adversarial tasks due to their robust pre-trained knowledge in simple fact retrieval, our A-MEM demonstrates superior performance in Multi-Hop tasks, which require complex reasoning chains, achieving at least twice the performance of the baselines. In addition to experiments on the LoCoMo dataset, we also compare our method on the DialSim dataset against LoCoMo and MemGPT. A-MEM consistently outperforms all baselines across evaluation metrics, achieving an F1

1 https://github.com/ollama/ollama

Table 2: Comparison of different memory mechanisms across multiple evaluation metrics on DialSim [16]. Higher scores indicate better performance, with A-MEM showing superior results across all metrics.

Table 3: An ablation study was conducted to evaluate our proposed method against the GPT-4o-mini base model. The notation 'w/o' indicates experiments where specific modules were removed. The abbreviations LG and ME denote the link generation module and memory evolution module, respectively.

score of 3.45 (a 35% improvement over LoCoMo's 2.55 and 192% higher than MemGPT's 1.18). The effectiveness of A-MEM stems from its novel agentic memory architecture that enables dynamic and structured memory management. Unlike traditional approaches that use static memory operations, our system creates interconnected memory networks through atomic notes with rich contextual descriptions, enabling more effective multi-hop reasoning. The system's ability to dynamically establish connections between memories based on shared attributes and continuously update existing memory descriptions with new contextual information allows it to better capture and utilize the relationships between different pieces of information.

Cost-Efficiency Analysis. A-MEM demonstrates significant computational and cost efficiency alongside strong performance. The system requires approximately 1,200 tokens per memory operation, achieving an 85-93% reduction in token usage compared to baseline methods (LoCoMo and MemGPT with 16,900 tokens) through our selective top-k retrieval mechanism. This substantial token reduction directly translates to lower operational costs, with each memory operation costing less than $0.0003 when using commercial API services, making large-scale deployments economically viable. Processing times average 5.4 seconds using GPT-4o-mini and only 1.1 seconds with locally hosted Llama 3.2 1B on a single GPU. Despite requiring multiple LLM calls during memory processing, A-MEM maintains this cost-effective resource utilization while consistently outperforming baseline approaches across all foundation models tested, in particular doubling performance on complex multi-hop reasoning tasks. This balance of low computational cost and superior reasoning capability highlights A-MEM's practical advantage for real-world deployment.
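The headline token reduction can be checked directly from the quoted per-operation figures:

```python
baseline_tokens = 16_900   # LoCoMo / MemGPT tokens per operation (as reported)
amem_tokens = 1_200        # A-MEM tokens per memory operation (as reported)

# Relative reduction in token usage from selective top-k retrieval.
reduction = 1 - amem_tokens / baseline_tokens
print(f"{reduction:.1%}")  # ≈ 92.9%, consistent with the reported 85-93% range
```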

Ablation Study

To evaluate the effectiveness of the Link Generation (LG) and Memory Evolution (ME) modules, we conduct the ablation study by systematically removing key components of our model. When both LG and ME modules are removed, the system exhibits substantial performance degradation, particularly in Multi Hop reasoning and Open Domain tasks. The system with only LG active (w/o ME) shows intermediate performance levels, maintaining significantly better results than the version without both modules, which demonstrates the fundamental importance of link generation in establishing memory connections. Our full model, A-MEM, consistently achieves the best performance across all evaluation categories, with particularly strong results in complex reasoning tasks. These results reveal that while the link generation module serves as a critical foundation for memory organization, the memory evolution module provides essential refinements to the memory structure. The ablation study validates our architectural design choices and highlights the complementary nature of these two modules in creating an effective memory system.

Hyperparameter Analysis

We conducted extensive experiments to analyze the impact of the memory retrieval parameter k, which controls the number of relevant memories retrieved for each interaction. As shown in Figure 3, we evaluated performance across different k values (10, 20, 30, 40, 50) on five categories of tasks using GPT-4o-mini as our base model. The results reveal an interesting pattern: while increasing k generally leads to improved performance, this improvement gradually plateaus and sometimes slightly decreases at higher values. This trend is particularly evident in Multi Hop and Open Domain

Figure 3: Impact of memory retrieval parameter k across different task categories with GPT-4o-mini as the base model. While larger k values generally improve performance by providing richer historical context, the gains diminish beyond certain thresholds, suggesting a trade-off between context richness and effective information processing. This pattern is consistent across all evaluation categories, indicating the importance of balanced context retrieval for optimal performance.


Table 4: Comparison of memory usage and retrieval time across different memory methods and scales.

tasks. The observation suggests a delicate balance in memory retrieval - while larger k values provide richer historical context for reasoning, they may also introduce noise and challenge the model's capacity to process longer sequences effectively. Our analysis indicates that moderate k values strike an optimal balance between context richness and information processing efficiency.

Scaling Analysis

To evaluate storage costs with accumulating memory, we examined the relationship between storage size and retrieval time across our A-MEM system and two baseline approaches: MemoryBank [39] and ReadAgent [17]. We evaluated these three memory systems with identical memory content across four scale points, increasing the number of entries by a factor of 10 at each step (from 1,000 to 10,000, 100,000, and finally 1,000,000 entries). The experimental results reveal key insights about our A-MEM system's scaling properties. In terms of space complexity, all three systems exhibit identical linear memory usage scaling (O(N)), as expected for vector-based retrieval systems. This confirms that A-MEM introduces no additional storage overhead compared to baseline approaches. For retrieval time, A-MEM demonstrates excellent efficiency with minimal increases as memory size grows. Even when scaling to 1 million memories, A-MEM's retrieval time increases only from 0.31 µs to 3.70 µs, representing exceptional performance. While MemoryBank shows slightly faster retrieval times, A-MEM maintains comparable performance while providing richer memory representations and functionality. Based on our space complexity and retrieval time analysis, we conclude that A-MEM's retrieval mechanisms maintain excellent efficiency even at large scales. The minimal growth in retrieval time across memory sizes addresses concerns about efficiency in large-scale memory systems, demonstrating that A-MEM provides a highly scalable solution for long-term conversation management. This combination of efficiency, scalability, and enhanced memory capabilities positions A-MEM as a significant advancement in building powerful long-term memory mechanisms for LLM agents.

Figure 4: T-SNE Visualization of Memory Embeddings Showing More Organized Distribution with A-MEM (blue) Compared to Base Memory (red) Across Different Dialogues. Base Memory represents A-MEM without link generation and memory evolution.


Memory Analysis

We present the t-SNE visualization of memory embeddings in Figure 4 to demonstrate the structural advantages of our agentic memory system. Analyzing two dialogues sampled from long-term conversations in LoCoMo [22], we observe that A-MEM (shown in blue) consistently exhibits more coherent clustering patterns compared to the baseline system (shown in red). This structural organization is particularly evident in Dialogue 2, where well-defined clusters emerge in the central region, providing empirical evidence for the effectiveness of our memory evolution mechanism and contextual description generation. In contrast, the baseline memory embeddings display a more dispersed distribution, demonstrating that memories lack structural organization without our link generation and memory evolution components. These visualization results validate that A-MEM can autonomously maintain meaningful memory structures through dynamic evolution and linking mechanisms. More results can be seen in Appendix A.4.

Conclusions

In this work, we introduced A-MEM, a novel agentic memory system that enables LLM agents to dynamically organize and evolve their memories without relying on predefined structures. Drawing inspiration from the Zettelkasten method, our system creates an interconnected knowledge network through dynamic indexing and linking mechanisms that adapt to diverse real-world tasks. The system's core architecture features autonomous generation of contextual descriptions for new memories and intelligent establishment of connections with existing memories based on shared attributes. Furthermore, our approach enables continuous evolution of historical memories by incorporating new experiences and developing higher-order attributes through ongoing interactions. Through extensive empirical evaluation across six foundation models, we demonstrated that A-MEM achieves superior performance compared to existing state-of-the-art baselines in long-term conversational tasks. Visualization analysis further validates the effectiveness of our memory organization approach. These results suggest that agentic memory systems can significantly enhance LLM agents' ability to utilize long-term knowledge in complex environments.

Limitations

While our agentic memory system achieves promising results, we acknowledge several areas for potential future exploration. First, although our system dynamically organizes memories, the quality of these organizations may still be influenced by the inherent capabilities of the underlying language models. Different LLMs might generate slightly different contextual descriptions or establish varying connections between memories. Additionally, while our current implementation focuses on text-based interactions, future work could explore extending the system to handle multimodal information, such as images or audio, which could provide richer contextual representations.

APPENDIX

Experiment

Detailed Baselines Introduction

LoCoMo [22] takes a direct approach by leveraging foundation models without memory mechanisms for question answering tasks. For each query, it incorporates the complete preceding conversation and questions into the prompt, evaluating the model's reasoning capabilities.

ReadAgent [17] tackles long-context document processing through a sophisticated three-step methodology: it begins with episode pagination to segment content into manageable chunks, followed by memory gisting to distill each page into concise memory representations, and concludes with interactive look-up to retrieve pertinent information as needed.

MemoryBank [39] introduces an innovative memory management system that maintains and efficiently retrieves historical interactions. The system features a dynamic memory updating mechanism based on the Ebbinghaus Forgetting Curve theory, which intelligently adjusts memory strength according to time and significance. Additionally, it incorporates a user portrait building system that progressively refines its understanding of user personality through continuous interaction analysis.

MemGPT [25] presents a novel virtual context management system drawing inspiration from traditional operating systems' memory hierarchies. The architecture implements a dual-tier structure: a main context (analogous to RAM) that provides immediate access during LLM inference, and an external context (analogous to disk storage) that maintains information beyond the fixed context window.

Evaluation Metric

The F1 score represents the harmonic mean of precision and recall, offering a balanced metric that combines both measures into a single value. This metric is particularly valuable when we need to balance between complete and accurate responses:

$$\text{Precision} = \frac{|\text{pred} \cap \text{ref}|}{|\text{pred}|}, \qquad \text{Recall} = \frac{|\text{pred} \cap \text{ref}|}{|\text{ref}|}$$

$$F_{1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where pred and ref denote the token multisets of the predicted and reference answers.

In question-answering systems, the F1 score serves a crucial role in evaluating exact matches between predicted and reference answers. This is especially important for span-based QA tasks, where systems must identify precise text segments while maintaining comprehensive coverage of the answer.
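A minimal token-level implementation of this F1 computation, as commonly used in span-based QA evaluation:

```python
from collections import Counter

def qa_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over
    the multiset of tokens shared by prediction and reference."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(qa_f1("the cat sat", "the black cat"))  # ≈ 0.667 (2 shared tokens)
```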

BLEU-1 [26] provides a method for evaluating the precision of unigram matches between system outputs and reference texts:

$$p_1 = \frac{\sum_{k}\sum_{i} \min(h_{ik},\, m_{ik})}{\sum_{k}\sum_{i} h_{ik}}$$

$$\text{BP} = \min\!\left(1,\; e^{\,1 - r/c}\right)$$

$$\text{BLEU-1} = \text{BP} \cdot p_1$$

where $c$ is the candidate length, $r$ is the reference length, $h_{ik}$ is the count of n-gram $i$ in candidate $k$, and $m_{ik}$ is the maximum count of n-gram $i$ in any reference. In QA, BLEU-1 evaluates the lexical precision of generated answers, which is particularly useful for generative QA systems where exact matching might be too strict.
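A sketch of BLEU-1 built directly from these definitions, with clipped unigram counts and the brevity penalty; the tie-breaking rule used to pick the reference length r is an illustrative assumption:

```python
import math
from collections import Counter

def bleu1(candidate: str, references: list[str]) -> float:
    """BLEU-1: clipped unigram precision times the brevity penalty."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    cand_counts = Counter(cand)
    # m_i: maximum count of each unigram in any single reference
    max_ref = Counter()
    for r in refs:
        for tok, n in Counter(r).items():
            max_ref[tok] = max(max_ref[tok], n)
    clipped = sum(min(n, max_ref[tok]) for tok, n in cand_counts.items())
    p1 = clipped / len(cand) if cand else 0.0
    # brevity penalty: r is the reference length closest to the candidate length
    c = len(cand)
    r = min((len(rf) for rf in refs), key=lambda L: (abs(L - c), L))
    bp = 1.0 if c > r else math.exp(1 - r / c) if c > 0 else 0.0
    return bp * p1
```

Clipping (`min(h_ik, m_ik)`) prevents a candidate from being rewarded for repeating a word more often than any reference contains it.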

ROUGE-L measures the longest common subsequence (LCS) between candidate and reference texts:

$$R_{\text{lcs}} = \frac{\text{LCS}(X, Y)}{|X|}, \qquad P_{\text{lcs}} = \frac{\text{LCS}(X, Y)}{|Y|}$$

$$\text{ROUGE-L} = \frac{(1 + \beta^{2})\, R_{\text{lcs}}\, P_{\text{lcs}}}{R_{\text{lcs}} + \beta^{2}\, P_{\text{lcs}}}$$

where $X$ is the reference text, $Y$ is the candidate text, and $\text{LCS}(X, Y)$ is the length of their Longest Common Subsequence. ROUGE-2 analogously measures bigram overlap:

$$\text{ROUGE-2} = \frac{\sum_{g \in \text{bigrams}(X)} \min\big(C_{X}(g),\, C_{Y}(g)\big)}{\sum_{g \in \text{bigrams}(X)} C_{X}(g)}$$

where $C_{X}(g)$ and $C_{Y}(g)$ count bigram $g$ in the reference and candidate, respectively.

Both ROUGE-L and ROUGE-2 are particularly useful for evaluating the fluency and coherence of generated answers, with ROUGE-L focusing on sequence matching and ROUGE-2 on local word order.
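A minimal ROUGE-L sketch from the LCS formulation; the weight β = 1.2 below is an illustrative choice, since β is a configurable parameter of the metric:

```python
def lcs_len(x: list[str], y: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xt in enumerate(x, 1):
        for j, yt in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if xt == yt else max(dp[i-1][j], dp[i][j-1])
    return dp[len(x)][len(y)]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score from LCS-based recall and precision."""
    X, Y = reference.split(), candidate.split()
    lcs = lcs_len(X, Y)
    if lcs == 0:
        return 0.0
    r, p = lcs / len(X), lcs / len(Y)
    return (1 + beta**2) * r * p / (r + beta**2 * p)
```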

METEOR [5] computes a score based on aligned unigrams between the candidate and reference texts, considering synonyms and paraphrases.

$$F_{\text{mean}} = \frac{10\,P\,R}{R + 9P}$$

$$\text{Penalty} = 0.5 \left(\frac{ch}{m}\right)^{3}$$

$$\text{METEOR} = F_{\text{mean}} \times (1 - \text{Penalty})$$

where $P$ is precision, $R$ is recall, $ch$ is the number of chunks (contiguous runs of matched unigrams), and $m$ is the number of matched unigrams. METEOR is valuable for QA evaluation as it considers semantic similarity beyond exact matching, making it suitable for evaluating paraphrased answers.
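A simplified sketch of METEOR restricted to exact unigram matches; the full metric also aligns stems, synonyms, and paraphrases, which this toy version omits:

```python
def meteor_exact(candidate: str, reference: str) -> float:
    """Simplified METEOR: exact unigram matching only."""
    cand, ref = candidate.split(), reference.split()
    # greedy one-to-one alignment of exact matches, in candidate order
    used = [False] * len(ref)
    matches = []  # (candidate index, reference index) pairs
    for i, tok in enumerate(cand):
        for j, rt in enumerate(ref):
            if not used[j] and rt == tok:
                used[j] = True
                matches.append((i, j))
                break
    m = len(matches)
    if m == 0:
        return 0.0
    p, r = m / len(cand), m / len(ref)
    f_mean = 10 * p * r / (r + 9 * p)
    # chunks: maximal runs of matches contiguous in both strings
    ch = 1
    for (i1, j1), (i2, j2) in zip(matches, matches[1:]):
        if i2 != i1 + 1 or j2 != j1 + 1:
            ch += 1
    penalty = 0.5 * (ch / m) ** 3
    return f_mean * (1 - penalty)
```

The fragmentation penalty rewards answers whose matched words appear in the same order and grouping as the reference.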

SBERT Similarity [27] measures the semantic similarity between two texts using sentence embeddings.

$$\text{SBERT-Sim}(x, y) = \frac{\text{SBERT}(x) \cdot \text{SBERT}(y)}{\lVert \text{SBERT}(x) \rVert \, \lVert \text{SBERT}(y) \rVert}$$

where $\text{SBERT}(x)$ represents the sentence embedding of text $x$. SBERT Similarity is particularly useful for evaluating semantic understanding in QA systems, as it can capture meaning similarities even when the lexical overlap is low.
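SBERT similarity reduces to cosine similarity over sentence embeddings. In the sketch below the embeddings are small illustrative vectors; in practice they would come from a sentence encoder such as all-minilm-l6-v2:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sentence embeddings."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In practice the vectors would come from a sentence encoder, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   a, b = model.encode(["text one", "text two"])
a = np.array([0.2, 0.7, 0.1])
b = np.array([0.2, 0.6, 0.3])
print(cosine_sim(a, b))
```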

Comparison Results

Our comprehensive evaluation using ROUGE-2, ROUGE-L, METEOR, and SBERT metrics demonstrates that A-MEM achieves superior performance while maintaining remarkable computational efficiency. Through extensive empirical testing across model sizes and task categories, we have established A-MEM as a more effective approach than existing baselines, supported by several findings. In our analysis of non-GPT models, specifically Qwen2.5 and Llama 3.2, A-MEM consistently outperforms all baseline approaches across all metrics. The Multi-Hop category shows particularly striking results, where Qwen2.5-1.5B with A-MEM achieves a ROUGE-L score of 27.23, far surpassing LoCoMo's 4.68 and ReadAgent's 2.81, a nearly six-fold improvement. This pattern of superiority extends consistently across METEOR and SBERT

Table 5: Experimental results on LoCoMo dataset of QA tasks across five categories (Multi Hop, Temporal, Open Domain, Single Hop, and Adversarial) using different methods. Results are reported in ROUGE-2 and ROUGE-L scores, abbreviated to RGE-2 and RGE-L. The best performance is marked in bold, and our proposed method A-MEM (highlighted in gray) demonstrates competitive performance across six foundation language models.


scores. When examining GPT-based models, our results reveal an interesting pattern. While LoCoMo and MemGPT demonstrate strong capabilities in Open Domain and Adversarial tasks, A-MEM shows remarkable superiority in Multi-Hop reasoning tasks. Using GPT-4o-mini, A-MEM achieves a ROUGE-L score of 44.27 in Multi-Hop tasks, more than doubling LoCoMo's 18.09. This significant advantage remains consistent across other metrics, with METEOR scores of 23.43 versus 7.61 and SBERT scores of 70.49 versus 52.30. The significance of these results is amplified by A-MEM's exceptional computational efficiency. Our approach requires only 1,200-2,500 tokens, compared to the substantial 16,900 tokens needed by LoCoMo and MemGPT. This efficiency stems from two key architectural innovations: First, our novel agentic memory architecture creates interconnected memory networks through atomic notes with rich contextual descriptions, enabling more effective capture and utilization of information relationships. Second, our selective top-k retrieval mechanism facilitates dynamic memory evolution and structured organization. The effectiveness of these innovations is particularly evident in complex reasoning tasks, as demonstrated by the consistently strong Multi-Hop performance across all evaluation metrics. In addition, we show the experimental results with different foundation models including DeepSeek-R1-32B [11], Claude 3.0 Haiku [2] and Claude 3.5 Haiku [3].

Memory Analysis

In addition to the memory visualizations of the first two dialogues shown in the main text, we present additional visualizations in Fig. 5 that demonstrate the structural advantages of our agentic memory system. Through analysis of two dialogues sampled from long-term conversations in LoCoMo [22], we observe that A-MEM (shown in blue) consistently produces more coherent clustering patterns compared to the baseline system (shown in red). This structural organization is particularly evident in Dialogue 2, where distinct clusters emerge in the central region, providing empirical support for the effectiveness of our memory evolution mechanism and contextual description generation. In contrast, the baseline memory embeddings exhibit a more scattered distribution, indicating that memories lack structural organization without our link generation and memory evolution components.

Table 6: Experimental results on LoCoMo dataset of QA tasks across five categories (Multi Hop, Temporal, Open Domain, Single Hop, and Adversarial) using different methods. Results are reported in METEOR and SBERT Similarity scores, abbreviated to ME and SBERT. The best performance is marked in bold, and our proposed method A-MEM (highlighted in gray) demonstrates competitive performance across six foundation language models.

These visualizations validate that A-MEM can autonomously maintain meaningful memory structures through its dynamic evolution and linking mechanisms.

Hyperparameter Settings

The selected values of the retrieval hyperparameter k are presented in Table 8. For models that already achieve state-of-the-art (SOTA) performance with k=10, we keep this value without further tuning.

Figure 5: T-SNE Visualization of Memory Embeddings Showing More Organized Distribution with A-MEM (blue) Compared to Base Memory (red) Across Different Dialogues. Base Memory represents A-MEM without link generation and memory evolution.


Table 8: Selection of k values in retriever across specific categories and model choices.

Prompt Templates and Examples

Prompt Template of Note Construction

The prompt template $P_{s1}$ used in note construction:

Generate a structured analysis of the following content by:
1. Identifying the most salient keywords (focus on nouns, verbs, and key concepts)
2. Extracting core themes and contextual elements
3. Creating relevant categorical tags
Format the response as a JSON object:
{
  "keywords": [
    // several specific, distinct keywords that capture key concepts and terminology
    // Order from most to least important
    // Don't include keywords that are the name of the speaker or time
    // At least three keywords, but don't be too redundant.
  ],
  "context":
    // one sentence summarizing:
    // - Main topic/domain
    // - Key arguments/points
    // - Intended audience/purpose
  ,
  "tags": [
    // several broad categories/themes for classification
    // Include domain, format, and type tags
    // At least three tags, but don't be too redundant.
  ]
}
Content for analysis:
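A minimal sketch of how such a template might drive note construction. Here `call_llm` is a hypothetical stand-in for a structured-output LLM call (e.g. through LiteLLM), and the prompt text is abridged from the template above:

```python
import json

NOTE_PROMPT = """Generate a structured analysis of the following content by:
1. Identifying the most salient keywords
2. Extracting core themes and contextual elements
3. Creating relevant categorical tags
Format the response as a JSON object with "keywords", "context", and "tags".
Content for analysis: {content}"""

def call_llm(prompt: str) -> str:
    """Stand-in for a structured-output LLM call; returns canned JSON."""
    return json.dumps({
        "keywords": ["memory", "agent", "retrieval"],
        "context": "Discussion of memory retrieval strategies for LLM agents.",
        "tags": ["AI", "memory-systems", "dialogue"],
    })

def construct_note(content: str) -> dict:
    """Build a structured memory note from raw interaction content."""
    attrs = json.loads(call_llm(NOTE_PROMPT.format(content=content)))
    return {"content": content, **attrs}

note = construct_note("We talked about how agents retrieve memories.")
print(note["keywords"])
```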

The prompt template $P_{s2}$ used to analyze a new note against its neighboring memories:

You are an AI memory evolution agent responsible for managing and evolving a knowledge base. Analyze the new memory note according to its keywords and context, together with its several nearest-neighbor memories.
The new memory context: {context} content: {content} keywords: {keywords}
The nearest neighbors memories: {nearest_neighbors_memories}
Based on this information, determine: Should this memory be evolved? Consider its relationships with other memories.

Prompt Template of Memory Evolution


The prompt template $P_{s3}$ used in memory evolution:

You are an AI memory evolution agent responsible for managing and evolving a knowledge base. Analyze the new memory note according to its keywords and context, together with its several nearest-neighbor memories. Make decisions about its evolution.
The new memory context: {context} content: {content} keywords: {keywords}
The nearest neighbors memories: {nearest_neighbors_memories}
Based on this information, determine:
1. What specific actions should be taken (strengthen, update_neighbor)?
1.1 If you choose to strengthen the connection, which memory should it be connected to? Can you give the updated tags of this memory?
1.2 If you choose to update a neighbor, you can update the context and tags of these memories based on your understanding of them. Tags should be determined by the content and characteristics of these memories, so that they can be used to retrieve and categorize them later.
All the above information should be returned in a list format according to the sequence: [[new_memory], [neighbor_memory_1], ..., [neighbor_memory_n]]
These actions can be combined.
Return your decision in JSON format with the following structure:
{{
  "should_evolve": true/false,
  "actions": ["strengthen", "merge", "prune"],
  "suggested_connections": ["neighbor_memory_ids"],
  "tags_to_update": ["tag_1", ..., "tag_n"],
  "new_context_neighborhood": ["new context", ..., "new context"],
  "new_tags_neighborhood": [["tag_1", ..., "tag_n"], ..., ["tag_1", ..., "tag_n"]],
}}
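A sketch of how an agent might apply the decision JSON returned by this template. The field names follow the schema above, while the handler itself, including how it merges links and maps neighborhood updates, is an illustrative assumption rather than the exact A-MEM implementation:

```python
def apply_evolution(new_mem: dict, neighbors: list[dict], decision: dict) -> None:
    """Apply a parsed LLM evolution decision to the memory store in place."""
    if not decision.get("should_evolve"):
        return
    if "strengthen" in decision.get("actions", []):
        # add suggested connections to the new note's link set, update its tags
        new_mem["links"] = sorted(set(new_mem.get("links", []))
                                  | set(decision.get("suggested_connections", [])))
        new_mem["tags"] = decision.get("tags_to_update", new_mem.get("tags", []))
    if "update_neighbor" in decision.get("actions", []):
        # evolved neighbor m_j* replaces the original m_j
        for mem, ctx, tags in zip(neighbors,
                                  decision.get("new_context_neighborhood", []),
                                  decision.get("new_tags_neighborhood", [])):
            mem["context"], mem["tags"] = ctx, tags
```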

Examples of Q/A with ours

Wujiang Xu1, Zujie Liang2, Kai Mei1, Hang Gao1, Juntao Tan1, Yongfeng Zhang1 (Corresponding Email: yongfeng.zhang@rutgers.edu). 1Rutgers University, 2Ant Group

Large Language Model (LLM) agents have demonstrated remarkable capabilities in various tasks, with recent advances enabling them to interact with environments, execute tasks, and make decisions autonomously Mei et al. (2024); Wang et al. (2024); Deng et al. (2023). They integrate LLMs with external tools and delicate workflows to improve reasoning and planning abilities. Though LLM agents have strong reasoning performance, they still need a memory system to provide long-term interaction ability with the external environment Weng (2023).

Existing memory systems Packer et al. (2023); Zhong et al. (2024); Roucher et al. (2025); Liu et al. (2024) for LLM agents provide basic memory storage functionality. These systems require agent developers to predefine memory storage structures, specify storage points within the workflow, and establish retrieval timing. Meanwhile, to improve structured memory organization, Mem0 Dev and Taranjeet (2024), following the principles of RAG Edge et al. (2024); Lewis et al. (2020); Shi et al. (2024), incorporates graph databases for storage and retrieval processes. While graph databases provide structured organization for memory systems, their reliance on predefined schemas and relationships fundamentally limits their adaptability. This limitation manifests clearly in practical scenarios - when an agent learns a novel mathematical solution, current systems can only categorize and link this information within their preset framework, unable to forge innovative connections or develop new organizational patterns as knowledge evolves. Such rigid structures, coupled with fixed agent workflows, severely restrict these systems’ ability to generalize across new environments and maintain effectiveness in long-term interactions. The challenge becomes increasingly critical as LLM agents tackle more complex, open-ended tasks, where flexible knowledge organization and continuous adaptation are essential. Therefore, how to design a flexible and universal memory system that supports LLM agents’ long-term interactions remains a crucial challenge.

In this paper, we introduce a novel agentic memory system, named A-Mem, for LLM agents that enables dynamic memory structuring without relying on static, predetermined memory operations. Our approach draws inspiration from the Zettelkasten method Kadavy (2021); Ahrens (2017), a sophisticated knowledge management system that creates interconnected information networks through atomic notes and flexible linking mechanisms. Our system introduces an agentic memory architecture that enables autonomous and flexible memory management for LLM agents. For each new memory, we construct a comprehensive note that integrates multiple representations: structured textual attributes (contextual descriptions, keywords, and tags) and an embedding vector for similarity matching. Then A-Mem analyzes the historical memory repository to establish meaningful connections based on semantic similarities and shared attributes. This integration process not only creates new links but also enables dynamic evolution: when new memories are incorporated, they can trigger updates to the contextual representations of existing memories, allowing the entire memory network to continuously refine and deepen its understanding over time. Our contributions are summarized as follows:

• We present A-Mem, an agentic memory system for LLM agents that enables autonomous generation of contextual descriptions, dynamic establishment of memory connections, and intelligent evolution of existing memories based on new experiences. This system equips LLM agents with long-term interaction capabilities without requiring predetermined memory operations.

• We design an agentic memory update mechanism where new memories automatically trigger two key operations: (1) Link Generation - automatically establishing connections between memories by identifying shared attributes and similar contextual descriptions, and (2) Memory Evolution - enabling existing memories to dynamically evolve as new experiences are analyzed, leading to the emergence of higher-order patterns and attributes.

• We conduct comprehensive evaluations of our system using a long-term conversational dataset, comparing performance across six foundation models using six distinct evaluation metrics, demonstrating significant improvements. Moreover, we provide T-SNE visualizations to illustrate the structured organization of our agentic memory system.

Prior works on LLM agent memory systems have explored various mechanisms for memory management and utilization Mei et al. (2024); Liu et al. (2024); Dev and Taranjeet (2024); Zhong et al. (2024). Some approaches adopt complete interaction storage, maintaining comprehensive historical records through dense retrieval models Zhong et al. (2024) or read-write memory structures Modarressi et al. (2023). Moreover, MemGPT Packer et al. (2023) leverages cache-like architectures to prioritize recent information. Similarly, SCM Wang et al. (2023a) proposes a Self-Controlled Memory framework that enhances LLMs' capability to maintain long-term memory through a memory stream and controller mechanism. However, these approaches face significant limitations in handling diverse real-world tasks. While they can provide basic memory functionality, their operations are typically constrained by predefined structures and fixed workflows. These constraints stem from their reliance on rigid operational patterns, particularly in memory writing and retrieval processes. Such inflexibility leads to poor generalization in new environments and limited effectiveness in long-term interactions. Therefore, designing a flexible and universal memory system that supports agents' long-term interactions remains a crucial challenge.

Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to enhance LLMs by incorporating external knowledge sources Lewis et al. (2020); Borgeaud et al. (2022); Gao et al. (2023). The standard RAG Yu et al. (2023a); Wang et al. (2023c) process involves indexing documents into chunks, retrieving relevant chunks based on semantic similarity, and augmenting the LLM's prompt with this retrieved context for generation. Advanced RAG systems Lin et al. (2023); Ilin (2023) have evolved to include sophisticated pre-retrieval and post-retrieval optimizations. Building upon these foundations, recent research has introduced agentic RAG systems that demonstrate more autonomous and adaptive behaviors in the retrieval process. These systems can dynamically determine when and what to retrieve Asai et al. (2023); Jiang et al. (2023), generate hypothetical responses to guide retrieval, and iteratively refine their search strategies based on intermediate results Trivedi et al. (2022); Shao et al. (2023).

However, while agentic RAG approaches demonstrate agency in the retrieval phase by autonomously deciding when and what to retrieve Asai et al. (2023); Jiang et al. (2023); Yu et al. (2023b), our agentic memory system exhibits agency at a more fundamental level through the autonomous evolution of its memory structure. Inspired by the Zettelkasten method, our system allows memories to actively generate their own contextual descriptions, form meaningful connections with related memories, and evolve both their content and relationships as new experiences emerge. This fundamental distinction in agency between retrieval versus storage and evolution distinguishes our approach from agentic RAG systems, which maintain static knowledge bases despite their sophisticated retrieval mechanisms.

Our proposed agentic memory system draws inspiration from the Zettelkasten method, implementing a dynamic and self-evolving memory system that enables LLM agents to maintain long-term memory without predetermined operations. The system’s design emphasizes atomic note-taking, flexible linking mechanisms, and continuous evolution of knowledge structures.

Building upon the Zettelkasten method's principles of atomic note-taking and flexible organization, we introduce an LLM-driven approach to memory note construction. When an agent interacts with its environment, we construct structured memory notes that capture both explicit information and LLM-generated contextual understanding. Each memory note $m_i$ in our collection $\mathcal{M} = \{m_1, m_2, \ldots, m_N\}$ is represented as:

$$m_i = \{c_i, t_i, K_i, G_i, X_i, L_i\}$$

where $c_i$ represents the original interaction content, $t_i$ is the timestamp of the interaction, $K_i$ denotes LLM-generated keywords that capture key concepts, $G_i$ contains LLM-generated tags for categorization, $X_i$ represents the LLM-generated contextual description that provides rich semantic understanding, and $L_i$ maintains the set of linked memories that share semantic relationships. To enrich each memory note with meaningful context beyond its basic content and timestamp, we leverage an LLM to analyze the interaction and generate these semantic components. The note construction process involves prompting the LLM with carefully designed templates $P_{s1}$:

Following the Zettelkasten principle of atomicity, each note captures a single, self-contained unit of knowledge. To enable efficient retrieval and linking, we compute a dense vector representation via a text encoder Reimers and Gurevych (2019) that encapsulates all textual components of the note:

$$e_i = \text{enc}(c_i \oplus K_i \oplus G_i \oplus X_i)$$

where $\oplus$ denotes text concatenation.

By using LLMs to generate enriched components, we enable autonomous extraction of implicit knowledge from raw interactions. The multi-faceted note structure ($K_i$, $G_i$, $X_i$) creates rich representations that capture different aspects of the memory, facilitating nuanced organization and retrieval. Additionally, the combination of LLM-generated semantic components with dense vector representations provides both human-interpretable context and computationally efficient similarity matching.
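The multi-faceted note structure can be sketched as a simple data container; the field names and types below are illustrative, not the exact A-MEM schema:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNote:
    """One atomic memory note m_i = {c_i, t_i, K_i, G_i, X_i, L_i}."""
    content: str                 # c_i: original interaction content
    timestamp: str               # t_i: time of the interaction
    keywords: list               # K_i: LLM-generated keywords
    tags: list                   # G_i: LLM-generated category tags
    context: str                 # X_i: LLM-generated contextual description
    links: list = field(default_factory=list)      # L_i: ids of linked notes
    embedding: list = field(default_factory=list)  # dense vector over all text fields
```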

Our system implements an autonomous link generation mechanism that enables new memory notes to form meaningful connections without predefined rules. When a constructed memory note $m_n$ is added to the system, we first leverage its semantic embedding for similarity-based retrieval. For each existing memory note $m_j \in \mathcal{M}$, we compute a similarity score:

$$s_{n,j} = \frac{e_n \cdot e_j}{\lVert e_n \rVert \, \lVert e_j \rVert}$$

The system then identifies the top-$k$ most relevant memories:

$$\mathcal{M}_{\text{near}}^{n} = \{\, m_j \in \mathcal{M} : s_{n,j} \text{ is among the } k \text{ largest} \,\}$$

Based on these candidate nearest memories, we prompt the LLM to analyze potential connections based on their common attributes. Formally, the link set of memory $m_n$ is updated as:

$$L_n \leftarrow \text{LLM}\big(m_n, \mathcal{M}_{\text{near}}^{n}\big)$$

Each link set $L_i$ is structured as $L_i = \{m_i, \ldots, m_k\}$. By using embedding-based retrieval as an initial filter, we enable efficient scalability while maintaining semantic relevance. A-Mem can quickly identify potential connections even in large memory collections without exhaustive comparison. More importantly, the LLM-driven analysis allows for a nuanced understanding of relationships that goes beyond simple similarity metrics. The language model can identify subtle patterns, causal relationships, and conceptual connections that might not be apparent from embedding similarity alone. This implements the Zettelkasten principle of flexible linking while leveraging modern language models. The resulting network emerges organically from memory content and context, enabling natural knowledge organization.
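The embedding-based filtering stage can be sketched as follows; the cosine top-k selection follows the similarity score described above, while the subsequent LLM link analysis is omitted:

```python
import numpy as np

def top_k_neighbors(new_emb: np.ndarray, store: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k existing memories most similar to a new note,
    by cosine similarity - the initial filter before LLM link analysis."""
    store_n = store / np.linalg.norm(store, axis=1, keepdims=True)
    new_n = new_emb / np.linalg.norm(new_emb)
    scores = store_n @ new_n          # cosine similarity with every stored note
    return np.argsort(scores)[-k:][::-1]
```

Only these k candidates are then passed to the LLM, which keeps the expensive link analysis independent of the total memory count.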

After creating links for the new memory, A-Mem evolves the retrieved memories based on their textual information and relationships with the new memory. For each memory $m_j$ in the nearest-neighbor set $\mathcal{M}_{\text{near}}^{n}$, the system determines whether to update its context, keywords, and tags. This evolution process can be formally expressed as:

$$m_j^{*} = \text{LLM}\big(m_j, m_n, \mathcal{M}_{\text{near}}^{n}\big)$$

The evolved memory $m_j^{*}$ then replaces the original memory $m_j$ in the memory set $\mathcal{M}$. This evolutionary approach enables continuous updates and new connections, mimicking human learning processes. As the system processes more memories over time, it develops increasingly sophisticated knowledge structures, discovering higher-order patterns and concepts across multiple memories. This creates a foundation for autonomous memory learning where knowledge organization becomes progressively richer through the ongoing interaction between new experiences and existing memories.

In each interaction, our A-Mem performs context-aware memory retrieval to provide the agent with relevant historical information. Given a query text $q$ from the current interaction, we first compute its dense vector representation using the same text encoder used for memory notes:

$$e_q = \text{enc}(q)$$

The system then computes similarity scores between the query embedding and all existing memory notes in $\mathcal{M}$ using cosine similarity:

$$s_{q,i} = \frac{e_q \cdot e_i}{\lVert e_q \rVert \, \lVert e_i \rVert}$$

Then we retrieve the k most relevant memories from the historical memory storage to construct a contextually appropriate prompt.

These retrieved memories provide relevant historical context that helps the agent better understand and respond to the current interaction. The retrieved context enriches the agent’s reasoning process by connecting the current interaction with related past experiences and knowledge stored in the memory system.

To evaluate the effectiveness of instruction-aware recommendation in long-term conversations, we utilize the LoCoMo dataset (Maharana et al., 2024), which contains significantly longer dialogues compared to existing conversational datasets (Xu, 2021; Jang et al., 2023). While previous datasets contain dialogues with around 1K tokens over 4-5 sessions, LoCoMo features much longer conversations averaging 9K tokens spanning up to 35 sessions, making it particularly suitable for evaluating models’ ability to handle long-range dependencies and maintain consistency over extended conversations. The LoCoMo dataset comprises diverse question types designed to comprehensively evaluate different aspects of model understanding: (1) single-hop questions answerable from a single session; (2) multi-hop questions requiring information synthesis across sessions; (3) temporal reasoning questions testing understanding of time-related information; (4) open-domain knowledge questions requiring integration of conversation context with external knowledge; and (5) adversarial questions assessing models’ ability to identify unanswerable queries. In total, LoCoMo contains 7,512 question-answer pairs across these categories.

For evaluation, we employ two primary metrics: the F1 score to assess answer accuracy by balancing precision and recall, and BLEU-1 (Papineni et al., 2002) to evaluate generated response quality by measuring word overlap with ground truth responses. Also, we report the average token length for answering one question. Besides, we report the experiment results with four extra metrics including ROUGE-L, ROUGE-2, METEOR and SBERT Similarity in the Appendix B.2.

For all baselines and our proposed method, we maintain consistency by employing identical system prompts as detailed in Appendix C. The Qwen-1.5B/3B and Llama 3.2 1B/3B models are deployed locally using Ollama (https://github.com/ollama/ollama), with LiteLLM (https://github.com/BerriAI/litellm) managing structured output generation. For GPT models, we utilize the official structured output API. In our memory retrieval process, we primarily employ $k=10$ for top-$k$ memory selection to maintain computational efficiency, while adjusting this parameter for specific categories to optimize performance. The detailed configurations of $k$ can be found in Appendix B.4. For text embedding, we use the all-minilm-l6-v2 model across all experiments.

LoCoMo Maharana et al. (2024) takes a direct approach by leveraging foundation models without memory mechanisms for question answering tasks. For each query, it incorporates the complete preceding conversation and questions into the prompt, evaluating the model’s reasoning capabilities.

ReadAgent Lee et al. (2024) tackles long-context document processing through a sophisticated three-step methodology: it begins with episode pagination to segment content into manageable chunks, followed by memory gisting to distill each page into concise memory representations, and concludes with interactive look-up to retrieve pertinent information as needed.

MemoryBank Zhong et al. (2024) introduces an innovative memory management system that maintains and efficiently retrieves historical interactions. The system features a dynamic memory updating mechanism based on the Ebbinghaus Forgetting Curve theory, which intelligently adjusts memory strength according to time and significance. Additionally, it incorporates a user portrait building system that progressively refines its understanding of user personality through continuous interaction analysis.
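MemoryBank's exact update rule is not reproduced in this paper; the sketch below only illustrates the Ebbinghaus-style idea it builds on, that retention decays exponentially with elapsed time and that recall strengthens a memory (the formula and constants here are our assumptions, not MemoryBank's implementation):

```python
import math

def retention(elapsed_hours: float, strength: float) -> float:
    """Ebbinghaus-style retention R = exp(-t/S): higher strength decays slower."""
    return math.exp(-elapsed_hours / strength)

def reinforce(strength: float, boost: float = 2.0) -> float:
    """Recalling a memory increases its strength, flattening future decay."""
    return strength * boost

s = 24.0                         # initial strength, in hours (assumed)
r_before = retention(24, s)      # retention after one day
s = reinforce(s)                 # the memory is retrieved and reinforced
r_after = retention(24, s)       # the same delay now forgets less
assert r_after > r_before
```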

MemGPT Packer et al. (2023) presents a novel virtual context management system drawing inspiration from traditional operating systems’ memory hierarchies. The architecture implements a dual-tier structure: a main context (analogous to RAM) that provides immediate access during LLM inference, and an external context (analogous to disk storage) that maintains information beyond the fixed context window.

In our empirical evaluation, we compare A-Mem with four competitive baselines (LoCoMo, ReadAgent, MemoryBank, and MemGPT) on the LoCoMo dataset. For non-GPT foundation models, A-Mem consistently outperforms all baselines across all categories, demonstrating the effectiveness of our agentic memory approach. For GPT-based models, LoCoMo and MemGPT show strong performance in categories such as Open Domain and Adversarial tasks, owing to their robust pre-trained knowledge for simple fact retrieval; however, A-Mem achieves at least twice their performance on Multi-Hop tasks, which require complex reasoning chains. The effectiveness of A-Mem stems from its novel agentic memory architecture, which enables dynamic and structured memory management. Unlike traditional approaches that use static memory operations, our system creates interconnected memory networks through atomic notes with rich contextual descriptions, enabling more effective multi-hop reasoning. By dynamically establishing connections between memories based on shared attributes and continuously updating existing memory descriptions with new contextual information, the system better captures and utilizes the relationships between different pieces of information. Notably, A-Mem achieves these improvements while requiring far fewer tokens than LoCoMo and MemGPT (around 1,200-2,500 tokens versus roughly 16,900) through our selective top-k retrieval mechanism. In conclusion, our empirical results demonstrate that A-Mem successfully combines structured memory organization with dynamic memory evolution, leading to superior performance on complex reasoning tasks while maintaining computational efficiency.

To evaluate the effectiveness of the Link Generation (LG) and Memory Evolution (ME) modules, we conduct the ablation study by systematically removing key components of our model. When both LG and ME modules are removed, the system exhibits substantial performance degradation, particularly in Multi Hop reasoning and Open Domain tasks. The system with only LG active (w/o ME) shows intermediate performance levels, maintaining significantly better results than the version without both modules, which demonstrates the fundamental importance of link generation in establishing memory connections. Our full model, A-MEM, consistently achieves the best performance across all evaluation categories, with particularly strong results in complex reasoning tasks. These results reveal that while the link generation module serves as a critical foundation for memory organization, the memory evolution module provides essential refinements to the memory structure. The ablation study validates our architectural design choices and highlights the complementary nature of these two modules in creating an effective memory system.

We conducted extensive experiments to analyze the impact of the memory retrieval parameter k, which controls the number of relevant memories retrieved for each interaction. As shown in Figure 3, we evaluated performance across different k values (10, 20, 30, 40, 50) on five categories of tasks using GPT-4o-mini as our base model. The results reveal an interesting pattern: while increasing k generally improves performance, the improvement gradually plateaus and sometimes slightly decreases at higher values. This trend is particularly evident in Multi Hop and Open Domain tasks. The observation suggests a delicate balance in memory retrieval: while larger k values provide richer historical context for reasoning, they may also introduce noise and strain the model's capacity to process longer sequences effectively. Our analysis indicates that moderate k values strike an optimal balance between context richness and information-processing efficiency.

We present in Figure 4 the t-SNE visualization of memory embeddings to demonstrate the structural advantages of our agentic memory system. Analyzing two dialogues sampled from long-term conversations in LoCoMo Maharana et al. (2024), we observe that A-Mem (shown in blue) consistently exhibits more coherent clustering patterns compared to the baseline system (shown in red). This structural organization is particularly evident in Dialogue 2, where well-defined clusters emerge in the central region, providing empirical evidence for the effectiveness of our memory evolution mechanism and contextual description generation. In contrast, the baseline memory embeddings display a more dispersed distribution, demonstrating that memories lack structural organization without our link generation and memory evolution components. These visualization results validate that A-Mem can autonomously maintain meaningful memory structures through dynamic evolution and linking mechanisms. More results can be seen in Appendix B.3.

In this work, we introduced A-Mem, a novel agentic memory system that enables LLM agents to dynamically organize and evolve their memories without relying on predefined structures. Drawing inspiration from the Zettelkasten method, our system creates an interconnected knowledge network through dynamic indexing and linking mechanisms that adapt to diverse real-world tasks. The system’s core architecture features autonomous generation of contextual descriptions for new memories and intelligent establishment of connections with existing memories based on shared attributes. Furthermore, our approach enables continuous evolution of historical memories by incorporating new experiences and developing higher-order attributes through ongoing interactions. Through extensive empirical evaluation across six foundation models, we demonstrated that A-Mem achieves superior performance compared to existing state-of-the-art baselines in long-term conversational tasks. Visualization analysis further validates the effectiveness of our memory organization approach. These results suggest that agentic memory systems can significantly enhance LLM agents’ ability to utilize long-term knowledge in complex environments.

While our agentic memory system achieves promising results, we acknowledge several areas for potential future exploration. First, although our system dynamically organizes memories, the quality of these organizations may still be influenced by the inherent capabilities of the underlying language models. Different LLMs might generate slightly different contextual descriptions or establish varying connections between memories. Additionally, while our current implementation focuses on text-based interactions, future work could explore extending the system to handle multimodal information, such as images or audio, which could provide richer contextual representations.

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, including natural language processing, code generation, and recommender systems Wang et al. (2023b); Zhang et al. (2024a); Xu et al. (2024a, b, 2023). LLM-based agents further extend these capabilities by enabling interactive decision-making and executing complex workflows through structured interaction patterns Jin et al. (2024); Zhang et al. (2024b, c). Prior works on LLM agent memory systems have explored various mechanisms for memory management and utilization Mei et al. (2024); Liu et al. (2024); Dev and Taranjeet (2024); Zhong et al. (2024). Some approaches maintain complete interaction storage, preserving comprehensive historical records through dense retrieval models Zhong et al. (2024) or read-write memory structures Modarressi et al. (2023). Moreover, MemGPT Packer et al. (2023) leverages cache-like architectures to prioritize recent information. Similarly, SCM Wang et al. (2023a) proposes a Self-Controlled Memory framework that enhances LLMs' capability to maintain long-term memory through a memory stream and controller mechanism. However, these approaches face significant limitations in handling diverse real-world tasks. While they can provide basic memory functionality, their operations are typically constrained by predefined structures and fixed workflows. These constraints stem from their reliance on rigid operational patterns, particularly in memory writing and retrieval processes. Such inflexibility leads to poor generalization in new environments and limited effectiveness in long-term interactions. Therefore, designing a flexible and universal memory system that supports agents' long-term interactions remains a crucial challenge.

The F1 score represents the harmonic mean of precision and recall, offering a balanced metric that combines both measures into a single value. This metric is particularly valuable when we need to balance between complete and accurate responses:

In question-answering systems, the F1 score serves a crucial role in evaluating exact matches between predicted and reference answers. This is especially important for span-based QA tasks, where systems must identify precise text segments while maintaining comprehensive coverage of the answer.
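A token-level implementation of this exact-match F1 can be sketched directly from the definition above (whitespace tokenization is our simplifying assumption):

```python
from collections import Counter

def qa_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer (SQuAD-style)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# The appendix example: prediction "photography" vs. reference "photography talk":
# precision = 1/1, recall = 1/2, so F1 = 2 * (1 * 0.5) / 1.5 = 2/3.
assert abs(qa_f1("photography", "photography talk") - 2 / 3) < 1e-9
```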

BLEU-1 Papineni et al. (2002) provides a method for evaluating the precision of unigram matches between system outputs and reference texts:

Here, c is the candidate length, r is the reference length, h_{ik} is the count of n-gram i in candidate k, and m_{ik} is its maximum count in any reference. In QA, BLEU-1 evaluates the lexical precision of generated answers, which is particularly useful for generative QA systems where exact matching would be too strict.
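Plugging these definitions in, BLEU-1 reduces to clipped unigram precision times the brevity penalty; a minimal sketch (established implementations such as NLTK's are preferable for research use):

```python
from collections import Counter
import math

def bleu1(candidate: str, reference: str) -> float:
    """BLEU-1: clipped unigram precision times the brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    clipped = Counter(cand) & Counter(ref)       # min(h_ik, m_ik) per unigram
    p1 = sum(clipped.values()) / len(cand)
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # penalize short candidates
    return bp * p1

# Candidate matches its single unigram but is shorter than the 2-token reference:
# p1 = 1.0, BP = exp(1 - 2/1) = exp(-1).
assert abs(bleu1("photography", "photography talk") - math.exp(-1)) < 1e-9
```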

where X is the reference text, Y is the candidate text, and LCS is the Longest Common Subsequence.

Both ROUGE-L and ROUGE-2 are particularly useful for evaluating the fluency and coherence of generated answers, with ROUGE-L focusing on sequence matching and ROUGE-2 on local word order.
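The LCS-based ROUGE-L above can be computed directly with dynamic programming; a compact sketch (the β = 1.2 default is our assumption, as the document does not specify it):

```python
def lcs_len(x: list, y: list) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    """ROUGE-L F-score: (1 + b^2) R P / (R + b^2 P) with R, P from the LCS."""
    x, y = reference.split(), candidate.split()
    lcs = lcs_len(x, y)
    if lcs == 0:
        return 0.0
    r_l, p_l = lcs / len(x), lcs / len(y)
    return (1 + beta**2) * r_l * p_l / (r_l + beta**2 * p_l)

assert lcs_len(list("ABCBDAB"), list("BDCABA")) == 4  # classic LCS example
```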

where P is precision, R is recall, ch is the number of chunks, and m is the number of matched unigrams. METEOR is valuable for QA evaluation because it considers semantic similarity beyond exact matching, making it suitable for evaluating paraphrased answers.
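Combining the two formulas, a simplified METEOR can be sketched as follows; the recall-weighted F_mean = 10PR/(R + 9P) is the standard METEOR choice but is not spelled out in this document, so treat it as an assumption:

```python
def meteor_simplified(precision: float, recall: float, chunks: int, matches: int) -> float:
    """METEOR = F_mean * (1 - Penalty), with Penalty = 0.5 * (ch / m)^3.
    F_mean is the recall-weighted harmonic mean 10PR/(R + 9P) (assumed)."""
    if matches == 0:
        return 0.0
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# A perfect match of 4 unigrams in a single chunk: penalty = 0.5 * (1/4)^3,
# so the score stays just below 1.
score = meteor_simplified(1.0, 1.0, chunks=1, matches=4)
assert 0.99 < score < 1.0
```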

SBERT Similarity Reimers and Gurevych (2019) measures the semantic similarity between two texts using sentence embeddings.

SBERT(x) denotes the sentence embedding of text x. SBERT Similarity is particularly useful for evaluating semantic understanding in QA systems, as it captures meaning similarity even when lexical overlap is low.

Our comprehensive evaluation using ROUGE-2, ROUGE-L, METEOR, and SBERT metrics demonstrates that A-Mem achieves superior performance while maintaining remarkable computational efficiency. Through extensive empirical testing across various model sizes and task categories, we have established A-Mem as a more effective approach than existing baselines, supported by several compelling findings. In our analysis of non-GPT models, specifically Qwen2.5 and Llama 3.2, A-Mem consistently outperforms all baseline approaches across all metrics. The Multi-Hop category shows particularly striking results, where Qwen2.5-1.5b with A-Mem achieves a ROUGE-L score of 27.23, dramatically surpassing LoCoMo's 4.68 and ReadAgent's 2.81, a nearly six-fold improvement. This pattern of superiority extends consistently across METEOR and SBERT scores. When examining GPT-based models, our results reveal an interesting pattern. While LoCoMo and MemGPT demonstrate strong capabilities in Open Domain and Adversarial tasks, A-Mem shows remarkable superiority in Multi-Hop reasoning tasks. Using GPT-4o-mini, A-Mem achieves a ROUGE-L score of 44.27 in Multi-Hop tasks, more than doubling LoCoMo's 18.09. This advantage is consistent across other metrics, with METEOR scores of 23.43 versus 7.61 and SBERT scores of 70.49 versus 52.30. The significance of these results is amplified by A-Mem's exceptional computational efficiency. Our approach requires only 1,200-2,500 tokens, compared to the roughly 16,900 tokens needed by LoCoMo and MemGPT. This efficiency stems from two key architectural innovations: first, our agentic memory architecture creates interconnected memory networks through atomic notes with rich contextual descriptions, enabling more effective capture and utilization of information relationships; second, our selective top-k retrieval mechanism facilitates dynamic memory evolution and structured organization.
The effectiveness of these innovations is particularly evident in complex reasoning tasks, as demonstrated by the consistently strong Multi-Hop performance across all evaluation metrics.

In addition to the memory visualizations of the first two dialogues shown in the main text, we present additional visualizations in Fig. 5 that demonstrate the structural advantages of our agentic memory system. Through analysis of two dialogues sampled from long-term conversations in LoCoMo Maharana et al. (2024), we observe that A-Mem (shown in blue) consistently produces more coherent clustering patterns compared to the baseline system (shown in red). This structural organization is particularly evident in Dialogue 2, where distinct clusters emerge in the central region, providing empirical support for the effectiveness of our memory evolution mechanism and contextual description generation. In contrast, the baseline memory embeddings exhibit a more scattered distribution, indicating that memories lack structural organization without our link generation and memory evolution components. These visualizations validate that A-Mem can autonomously maintain meaningful memory structures through its dynamic evolution and linking mechanisms.

All hyperparameter k values are presented in Table 5. For models that have already achieved state-of-the-art (SOTA) performance with k=10, we maintain this value without further tuning.

Table: S4.T1: Experimental results on the LoCoMo dataset for QA tasks across five categories (Single Hop, Multi Hop, Temporal, Open Domain, and Adversarial) using different methods. Results are reported as F1 and BLEU-1 (%) scores. The best performance is marked in bold, and our proposed method A-Mem (highlighted in gray) demonstrates competitive performance across six foundation language models.

| Model | Method | Single Hop (F1 / BLEU-1) | Multi Hop (F1 / BLEU-1) | Temporal (F1 / BLEU-1) | Open Domain (F1 / BLEU-1) | Adversarial (F1 / BLEU-1) | Avg. Rank (F1 / BLEU-1) | Token Length |
|---|---|---|---|---|---|---|---|---|
| GPT-4o-mini | LoCoMo | 25.02 / 19.75 | 18.41 / 14.77 | 12.04 / 11.16 | 40.36 / 29.05 | 69.23 / 68.75 | 2.4 / 2.4 | 16,910 |
| GPT-4o-mini | ReadAgent | 9.15 / 6.48 | 12.60 / 8.87 | 5.31 / 5.12 | 9.67 / 7.66 | 9.81 / 9.02 | 4.2 / 4.2 | 643 |
| GPT-4o-mini | MemoryBank | 5.00 / 4.77 | 9.68 / 6.99 | 5.56 / 5.94 | 6.61 / 5.16 | 7.36 / 6.48 | 4.8 / 4.8 | 432 |
| GPT-4o-mini | MemGPT | 26.65 / 17.72 | 25.52 / 19.44 | 9.15 / 7.44 | 41.04 / 34.34 | 43.29 / 42.73 | 2.4 / 2.4 | 16,977 |
| GPT-4o-mini | A-Mem | 27.02 / 20.09 | 45.85 / 36.67 | 12.14 / 12.00 | 44.65 / 37.06 | 50.03 / 49.47 | 1.2 / 1.2 | 2,520 |
| GPT-4o | LoCoMo | 28.00 / 18.47 | 9.09 / 5.78 | 16.47 / 14.80 | 61.56 / 54.19 | 52.61 / 51.13 | 2.0 / 2.0 | 16,910 |
| GPT-4o | ReadAgent | 14.61 / 9.95 | 4.16 / 3.19 | 8.84 / 8.37 | 12.46 / 10.29 | 6.81 / 6.13 | 4.0 / 4.0 | 805 |
| GPT-4o | MemoryBank | 6.49 / 4.69 | 2.47 / 2.43 | 6.43 / 5.30 | 8.28 / 7.10 | 4.42 / 3.67 | 5.0 / 5.0 | 569 |
| GPT-4o | MemGPT | 30.36 / 22.83 | 17.29 / 13.18 | 12.24 / 11.87 | 60.16 / 53.35 | 34.96 / 34.25 | 2.4 / 2.4 | 16,987 |
| GPT-4o | A-Mem | 32.86 / 23.76 | 39.41 / 31.23 | 17.10 / 15.84 | 48.43 / 42.97 | 36.35 / 35.53 | 1.6 / 1.6 | 1,216 |
| Qwen2.5-1.5b | LoCoMo | 9.05 / 6.55 | 4.25 / 4.04 | 9.91 / 8.50 | 11.15 / 8.67 | 40.38 / 40.23 | 3.4 / 3.4 | 16,910 |
| Qwen2.5-1.5b | ReadAgent | 6.61 / 4.93 | 2.55 / 2.51 | 5.31 / 12.24 | 10.13 / 7.54 | 5.42 / 27.32 | 4.6 / 4.6 | 752 |
| Qwen2.5-1.5b | MemoryBank | 11.14 / 8.25 | 4.46 / 2.87 | 8.05 / 6.21 | 13.42 / 11.01 | 36.76 / 34.00 | 2.6 / 2.6 | 284 |
| Qwen2.5-1.5b | MemGPT | 10.44 / 7.61 | 4.21 / 3.89 | 13.42 / 11.64 | 9.56 / 7.34 | 31.51 / 28.90 | 3.4 / 3.4 | 16,953 |
| Qwen2.5-1.5b | A-Mem | 18.23 / 11.94 | 24.32 / 19.74 | 16.48 / 14.31 | 23.63 / 19.23 | 46.00 / 43.26 | 1.0 / 1.0 | 1,300 |
| Qwen2.5-3b | LoCoMo | 4.61 / 4.29 | 3.11 / 2.71 | 4.55 / 5.97 | 7.03 / 5.69 | 16.95 / 14.81 | 3.2 / 3.2 | 16,910 |
| Qwen2.5-3b | ReadAgent | 2.47 / 1.78 | 3.01 / 3.01 | 5.57 / 5.22 | 3.25 / 2.51 | 15.78 / 14.01 | 4.2 / 4.2 | 776 |
| Qwen2.5-3b | MemoryBank | 3.60 / 3.39 | 1.72 / 1.97 | 6.63 / 6.58 | 4.11 / 3.32 | 13.07 / 10.30 | 4.2 / 4.2 | 298 |
| Qwen2.5-3b | MemGPT | 5.07 / 4.31 | 2.94 / 2.95 | 7.04 / 7.10 | 7.26 / 5.52 | 14.47 / 12.39 | 2.4 / 2.4 | 16,961 |
| Qwen2.5-3b | A-Mem | 12.57 / 9.01 | 27.59 / 25.07 | 7.12 / 7.28 | 17.23 / 13.12 | 27.91 / 25.15 | 1.0 / 1.0 | 1,137 |
| Llama 3.2-1b | LoCoMo | 11.25 / 9.18 | 7.38 / 6.82 | 11.90 / 10.38 | 12.86 / 10.50 | 51.89 / 48.27 | 3.4 / 3.4 | 16,910 |
| Llama 3.2-1b | ReadAgent | 5.96 / 5.12 | 1.93 / 2.30 | 12.46 / 11.17 | 7.75 / 6.03 | 44.64 / 40.15 | 4.6 / 4.6 | 665 |
| Llama 3.2-1b | MemoryBank | 13.18 / 10.03 | 7.61 / 6.27 | 15.78 / 12.94 | 17.30 / 14.03 | 52.61 / 47.53 | 2.0 / 2.0 | 274 |
| Llama 3.2-1b | MemGPT | 9.19 / 6.96 | 4.02 / 4.79 | 11.14 / 8.24 | 10.16 / 7.68 | 49.75 / 45.11 | 4.0 / 4.0 | 16,950 |
| Llama 3.2-1b | A-Mem | 19.06 / 11.71 | 17.80 / 10.28 | 17.55 / 14.67 | 28.51 / 24.13 | 58.81 / 54.28 | 1.0 / 1.0 | 1,376 |
| Llama 3.2-3b | LoCoMo | 6.88 / 5.77 | 4.37 / 4.40 | 10.65 / 9.29 | 8.37 / 6.93 | 30.25 / 28.46 | 2.8 / 2.8 | 16,910 |
| Llama 3.2-3b | ReadAgent | 2.47 / 1.78 | 3.01 / 3.01 | 5.57 / 5.22 | 3.25 / 2.51 | 15.78 / 14.01 | 4.2 / 4.2 | 461 |
| Llama 3.2-3b | MemoryBank | 6.19 / 4.47 | 3.49 / 3.13 | 4.07 / 4.57 | 7.61 / 6.03 | 18.65 / 17.05 | 3.2 / 3.2 | 263 |
| Llama 3.2-3b | MemGPT | 5.32 / 3.99 | 2.68 / 2.72 | 5.64 / 5.54 | 4.32 / 3.51 | 21.45 / 19.37 | 3.8 / 3.8 | 16,956 |
| Llama 3.2-3b | A-Mem | 17.44 / 11.74 | 26.38 / 19.50 | 12.53 / 11.83 | 28.14 / 23.87 | 42.04 / 40.60 | 1.0 / 1.0 | 1,126 |

Table: S4.T2: Ablation study evaluating our proposed method on the GPT-4o-mini base model. The notation 'w/o' indicates experiments where specific modules were removed. The abbreviations LG and ME denote the link generation module and the memory evolution module, respectively.

| Method | Single Hop (F1 / BLEU-1) | Multi Hop (F1 / BLEU-1) | Temporal (F1 / BLEU-1) | Open Domain (F1 / BLEU-1) | Adversarial (F1 / BLEU-1) |
|---|---|---|---|---|---|
| w/o LG & ME | 9.65 / 7.09 | 24.55 / 19.48 | 7.77 / 6.70 | 13.28 / 10.30 | 15.32 / 18.02 |
| w/o ME | 21.35 / 15.13 | 31.24 / 27.31 | 10.13 / 10.85 | 39.17 / 34.70 | 44.16 / 45.33 |
| A-Mem | 27.02 / 20.09 | 45.85 / 36.67 | 12.14 / 12.00 | 44.65 / 37.06 | 50.03 / 49.47 |

Table: A2.T5: Selection of k values in retriever across specific categories and model choices.

| Model | Single Hop | Multi Hop | Temporal | Open Domain | Adversarial |
|---|---|---|---|---|---|
| GPT-4o-mini | 40 | 40 | 50 | 50 | 40 |
| GPT-4o | 40 | 40 | 50 | 50 | 40 |
| Qwen2.5-1.5b | 10 | 10 | 10 | 10 | 10 |
| Qwen2.5-3b | 10 | 10 | 50 | 10 | 10 |
| Llama3.2-1b | 10 | 10 | 10 | 10 | 10 |
| Llama3.2-3b | 10 | 20 | 10 | 10 | 10 |

(a) Traditional memory system.

(b) Our A-Mem architecture comprises three integral parts of memory storage. During note construction, the system processes new interaction memories and stores them as notes with multiple attributes. The link generation process first retrieves the most relevant historical memories and then decides whether to establish connections between them. The concept of a 'box' describes how related memories become interconnected through their similar contextual descriptions, analogous to the Zettelkasten method. However, our approach allows individual memories to exist simultaneously within multiple different boxes. In the memory retrieval stage, the system decomposes queries into constituent keywords and uses these keywords to search the memory network.

Figure panels: (a) Single Hop, (c) Temporal, (d) Open Domain.

$$ m_{i}=\{c_{i},t_{i},K_{i},G_{i},X_{i},e_{i},L_{i}\} $$ \tag{S3.E1}
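As an illustration of the note structure in Eq. (S3.E1), each memory note bundles the interaction content with its LLM-generated attributes; the sketch below uses field names of our own choosing, not necessarily those of the released implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNote:
    """One memory note m_i: content (c_i) and timestamp (t_i) come from the
    interaction; keywords (K_i), tags (G_i), and context (X_i) are generated by
    the LLM; the embedding (e_i) is computed by the text encoder; links (L_i)
    hold identifiers of connected notes."""
    content: str                                    # c_i: original interaction text
    timestamp: str                                  # t_i: when the interaction happened
    keywords: list = field(default_factory=list)    # K_i
    tags: list = field(default_factory=list)        # G_i
    context: str = ""                               # X_i: contextual description
    embedding: list = field(default_factory=list)   # e_i
    links: list = field(default_factory=list)       # L_i

note = MemoryNote(content="I've taken up photography.",
                  timestamp="10:54 am on 17 November, 2023",
                  keywords=["photography", "hobby"])
assert note.links == []  # a new note starts unlinked until link generation runs
```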

$$ s_{n,j}=\frac{e_{n}\cdot e_{j}}{\|e_{n}\|\,\|e_{j}\|} $$ \tag{S3.E4}

$$ \mathcal{M}_{\text{near}}^{n}=\{m_{j}\mid\text{rank}(s_{n,j})\leq k,\,m_{j}\in\mathcal{M}\} $$ \tag{S3.E5}

$$ L_{i}\leftarrow\text{LLM}(m_{n}\,\|\,\mathcal{M}_{\text{near}}^{n}\,\|\,P_{s2}) $$ \tag{S3.E6}

$$ e_{q}=f_{\text{enc}}(q) $$ \tag{S3.E8}

$$ F1=2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} $$ \tag{A2.E11}

$$ \text{precision}=\frac{\text{true positives}}{\text{true positives}+\text{false positives}} $$ \tag{A2.E12}

$$ \text{BLEU-1}=BP\cdot\exp(\sum_{n=1}^{1}w_{n}\log p_{n}) $$ \tag{A2.E14}

$$ BP=\begin{cases}1&\text{if }c>r\\ e^{1-r/c}&\text{if }c\leq r\end{cases} $$ \tag{A2.E15}

$$ p_{n}=\frac{\sum_{i}\sum_{k}\min(h_{ik},m_{ik})}{\sum_{i}\sum_{k}h_{ik}} $$ \tag{A2.E16}

$$ \text{ROUGE-L}=\frac{(1+\beta^{2})R_{l}P_{l}}{R_{l}+\beta^{2}P_{l}} $$ \tag{A2.E17}

$$ R_{l}=\frac{\text{LCS}(X,Y)}{|X|} $$ \tag{A2.E18}

$$ \text{ROUGE-2}=\frac{\sum_{\text{bigram}\in\text{ref}}\min(\text{Count}_{\text{ref}}(\text{bigram}),\text{Count}_{\text{cand}}(\text{bigram}))}{\sum_{\text{bigram}\in\text{ref}}\text{Count}_{\text{ref}}(\text{bigram})} $$ \tag{A2.E20}

$$ \text{METEOR}=F_{\text{mean}}\cdot(1-\text{Penalty}) $$ \tag{A2.E21}

$$ \text{Penalty}=0.5\cdot(\frac{\text{ch}}{m})^{3} $$ \tag{A2.E23}

$$ \text{SBERT Similarity}=\cos(\text{SBERT}(x),\text{SBERT}(y)) $$ \tag{A2.E24}

Example:

Question 686: Which hobby did Dave pick up in October 2023?
Prediction: photography
Reference: photography talk

start time: 10:54 am on 17 November, 2023

memory content: Speaker Dave says: Hey Calvin, long time no talk! A lot has happened. I've taken up photography and it's been great - been taking pics of the scenery around here which is really cool.

memory context: The main topic is the speaker's new hobby of photography, highlighting their enjoyment of capturing local scenery, aimed at engaging a friend in conversation about personal experiences.

memory keywords: ['photography', 'scenery', 'conversation', 'experience', 'hobby']

memory content: Speaker Calvin says: Thanks, Dave! It feels great having my own space to work in. I've been experimenting with different genres lately, pushing myself out of my comfort zone. Adding electronic elements to my songs gives them a fresh vibe. It's been an exciting process of self-discovery and growth!

memory context: The speaker discusses their creative process in music, highlighting experimentation with genres and the incorporation of electronic elements for personal growth and artistic evolution.

memory keywords: ['space', 'experimentation', 'genres', 'electronic', 'self-discovery', 'growth']

memory tags: ['music', 'creativity', 'self-improvement', 'artistic expression']

NeurIPS Paper Checklist



$$ K_{i},G_{i},X_{i}\leftarrow\text{LLM}(c_{i}\,\|\,t_{i}\,\|\,P_{s1}) $$ \tag{S3.E2}



1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]


3. Theory assumptions and proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [NA]

Justification: N/A


4. Empirical results

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: Both code and datasets are available.


5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We provide the code link in the abstract.


6. Implementation details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We cover all the details in the paper.


7. Experiment statistical significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [No]

Justification: The experiments utilize the API of Large Language Models. Multiple calls will significantly increase costs.


8. Compute resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: It could be found in the experimental part.


9. Code of ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines ?

Answer: [NA]

Justification: N/A


10. Broader impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [No]

Justification: We don't discuss this aspect because we provide only the memory system for LLM agents. Different LLM agents may create varying societal impacts, which are beyond the scope of our work.


11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [NA]

Justification: N/A


12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: Their contribution has already been properly acknowledged and credited.


13. New assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [NA]

Justification: N/A


14. Crowdsourcing and research with human subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [NA]

Justification: N/A


15. Institutional review board (IRB) approvals or equivalent for research with human subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [NA]

Justification: N/A


16. Declaration of LLM usage

Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does not impact the core methodology, scientific rigorousness, or originality of the research, declaration is not required.

Answer: [NA]

Justification: N/A


| Method | F1 | BLEU-1 | ROUGE-L | ROUGE-2 | METEOR | SBERT Similarity |
|---|---|---|---|---|---|---|
| LoCoMo | 2.55 | 3.13 | 2.75 | 0.90 | 1.64 | 15.76 |
| MemGPT | 1.18 | 1.07 | 0.96 | 0.42 | 0.95 | 8.54 |
| A-MEM | 3.45 | 3.37 | 3.54 | 3.60 | 2.05 | 19.51 |
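For reference, the word-overlap scores reported in these tables can be sketched as below. These are generic implementations of SQuAD-style token F1 and unigram BLEU (clipped precision with a brevity penalty), not necessarily the exact tokenization or normalization used in the paper's evaluation scripts:

```python
import math
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer."""
    pred, gold = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())  # clipped common tokens
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def bleu1(prediction: str, reference: str) -> float:
    """Unigram BLEU: clipped unigram precision times a brevity penalty."""
    pred, gold = prediction.lower().split(), reference.lower().split()
    if not pred:
        return 0.0
    clipped = sum((Counter(pred) & Counter(gold)).values())
    brevity = 1.0 if len(pred) >= len(gold) else math.exp(1 - len(gold) / len(pred))
    return brevity * clipped / len(pred)

print(round(token_f1("the cat sat", "the cat"), 3))  # → 0.8
```

ROUGE, METEOR, and SBERT similarity follow the cited metric papers [rouge, meteor, sentence-bert] and are usually computed with their reference implementations rather than re-derived by hand.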
Ablation results. Each category cell reports F1 / BLEU-1; LG and ME denote the link generation and memory evolution components, respectively.

| Method | Multi Hop | Temporal | Open Domain | Single Hop | Adversarial |
|---|---|---|---|---|---|
| w/o LG & ME | 9.65 / 7.09 | 24.55 / 19.48 | 7.77 / 6.70 | 13.28 / 10.30 | 15.32 / 18.02 |
| w/o ME | 21.35 / 15.13 | 31.24 / 27.31 | 10.13 / 10.85 | 39.17 / 34.70 | 44.16 / 45.33 |
| A-MEM | 27.02 / 20.09 | 45.85 / 36.67 | 12.14 / 12.00 | 44.65 / 37.06 | 50.03 / 49.47 |
| Memory Size | Method | Memory Usage (MB) | Retrieval Time (µs) |
|---|---|---|---|
| 1,000 | A-MEM | 1.46 | 0.31 ± 0.30 |
| 1,000 | MemoryBank [39] | 1.46 | 0.24 ± 0.20 |
| 1,000 | ReadAgent [17] | 1.46 | 43.62 ± 8.47 |
| 10,000 | A-MEM | 14.65 | 0.38 ± 0.25 |
| 10,000 | MemoryBank [39] | 14.65 | 0.26 ± 0.13 |
| 10,000 | ReadAgent [17] | 14.65 | 484.45 ± 93.86 |
| 100,000 | A-MEM | 146.48 | 1.40 ± 0.49 |
| 100,000 | MemoryBank [39] | 146.48 | 0.78 ± 0.26 |
| 100,000 | ReadAgent [17] | 146.48 | 6,682.22 ± 111.63 |
| 1,000,000 | A-MEM | 1,464.84 | 3.70 ± 0.74 |
| 1,000,000 | MemoryBank [39] | 1,464.84 | 1.91 ± 0.31 |
| 1,000,000 | ReadAgent [17] | 1,464.84 | 120,069.68 ± 1,673.39 |
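The footprint above grows linearly at 1.46 MB per 1,000 memories, which is consistent with storing one 384-dimensional float32 embedding per memory note (384 × 4 bytes × 1,000 ≈ 1.46 MB). Retrieval over such a store can be sketched as a cosine-similarity top-k scan; the function name and shapes below are illustrative, not the paper's implementation:

```python
import numpy as np

def top_k_memories(query_emb: np.ndarray, memory_embs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k stored embeddings most cosine-similar to the query.

    query_emb: (d,) vector; memory_embs: (n, d) matrix, one row per memory note.
    """
    q = query_emb / np.linalg.norm(query_emb)
    m = memory_embs / np.linalg.norm(memory_embs, axis=1, keepdims=True)
    sims = m @ q                   # cosine similarity against every stored memory
    return np.argsort(-sims)[:k]   # best k, highest similarity first

# Toy usage: three 2-d "memories"; the query is closest to memory 0, then memory 2.
memories = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(top_k_memories(np.array([1.0, 0.0]), memories, k=2))  # → [0 2]
```

A brute-force scan like this stays in the microsecond range at the table's scales, matching the reported A-MEM and MemoryBank retrieval times.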
Contents

1 Introduction
2 Related Work
  2.1 Memory for LLM Agents
  2.2 Retrieval-Augmented Generation
3 Methodology
  3.1 Note Construction
  3.2 Link Generation
  3.3 Memory Evolution
  3.4 Retrieve Relative Memory
4 Experiment
  4.1 Dataset and Evaluation
  4.2 Implementation Details
  4.3 Empirical Results
  4.4 Ablation Study
  4.5 Hyperparameter Analysis
  4.6 Scaling Analysis
  4.7 Memory Analysis
5 Conclusions
A Experiment
  A.1 Detailed Baselines Introduction
  A.2 Evaluation Metric
  A.3 Comparison Results
  A.4 Memory Analysis
  A.5 Hyperparameters Setting
B Prompt Templates and Examples
  B.1 Prompt Template of Note Construction
  B.2 Prompt Template of Link Generation
  B.3 Prompt Template of Memory Evolution
  B.4 Examples of Q/A with A-MEM
Comparison under ROUGE. Each category cell reports ROUGE-2 / ROUGE-L.

| Model | Method | Multi Hop | Temporal | Open Domain | Single Hop | Adversarial |
|---|---|---|---|---|---|---|
| GPT-4o-mini | LOCOMO | 9.64 / 23.92 | 2.01 / 18.09 | 3.40 / 11.58 | 26.48 / 40.20 | 60.46 / 69.59 |
| | READAGENT | 2.47 / 9.45 | 0.95 / 13.12 | 0.55 / 5.76 | 2.99 / 9.92 | 6.66 / 9.79 |
| | MEMORYBANK | 1.18 / 5.43 | 0.52 / 9.64 | 0.97 / 5.77 | 1.64 / 6.63 | 4.55 / 7.35 |
| | MEMGPT | 10.58 / 25.60 | 4.76 / 25.22 | 0.76 / 9.14 | 28.44 / 42.24 | 36.62 / 43.75 |
| | A-MEM | 10.61 / 25.86 | 21.39 / 44.27 | 3.42 / 12.09 | 29.50 / 45.18 | 42.62 / 50.04 |
| GPT-4o | LOCOMO | 11.53 / 30.65 | 1.68 / 8.17 | 3.21 / 16.33 | 45.42 / 63.86 | 45.13 / 52.67 |
| | READAGENT | 3.91 / 14.36 | 0.43 / 3.96 | 0.52 / 8.58 | 4.75 / 13.41 | 4.24 / 6.81 |
| | MEMORYBANK | 1.84 / 7.36 | 0.36 / 2.29 | 2.13 / 6.85 | 3.02 / 9.35 | 1.22 / 4.41 |
| | MEMGPT | 11.55 / 30.18 | 4.66 / 15.83 | 3.27 / 14.02 | 43.27 / 62.75 | 28.72 / 35.08 |
| | A-MEM | 12.76 / 31.71 | 9.82 / 25.04 | 6.09 / 16.63 | 33.67 / 50.31 | 30.31 / 36.34 |
| Qwen2.5-1.5b | LOCOMO | 1.39 / 9.24 | 0.00 / 4.68 | 3.42 / 10.59 | 3.25 / 11.15 | 35.10 / 43.61 |
| | READAGENT | 0.74 / 7.14 | 0.10 / 2.81 | 3.05 / 12.63 | 1.47 / 7.88 | 20.73 / 27.82 |
| | MEMORYBANK | 1.51 / 11.18 | 0.14 / 5.39 | 1.80 / 8.44 | 5.07 / 13.72 | 29.24 / 36.95 |
| | MEMGPT | 1.16 / 11.35 | 0.00 / 7.88 | 2.87 / 14.62 | 2.18 / 9.82 | 23.96 / 31.69 |
| | A-MEM | 4.88 / 17.94 | 5.88 / 27.23 | 3.44 / 16.87 | 12.32 / 24.38 | 36.32 / 46.60 |
| Qwen2.5-3b | LOCOMO | 0.49 / 4.83 | 0.14 / 3.20 | 1.31 / 5.38 | 1.97 / 6.98 | 12.66 / 17.10 |
| | READAGENT | 0.08 / 4.08 | 0.00 / 1.96 | 1.26 / 6.19 | 0.73 / 4.34 | 7.35 / 10.64 |
| | MEMORYBANK | 0.43 / 3.76 | 0.05 / 1.61 | 0.24 / 6.32 | 1.03 / 4.22 | 9.55 / 13.41 |
| | MEMGPT | 0.69 / 5.55 | 0.05 / 3.17 | 1.90 / 7.90 | 2.05 / 7.32 | 10.46 / 14.39 |
| | A-MEM | 2.91 / 12.42 | 8.11 / 27.74 | 1.51 / 7.51 | 8.80 / 17.57 | 21.39 / 27.98 |
| Llama3.2-1b | LOCOMO | 2.51 / 11.48 | 0.44 / 8.25 | 1.69 / 13.06 | 2.94 / 13.00 | 39.85 / 52.74 |
| | READAGENT | 0.53 / 6.49 | 0.00 / 4.62 | 5.47 / 14.29 | 1.19 / 8.03 | 34.52 / 45.55 |
| | MEMORYBANK | 2.96 / 13.57 | 0.23 / 10.53 | 4.01 / 18.38 | 6.41 / 17.66 | 41.15 / 53.31 |
| | MEMGPT | 1.82 / 9.91 | 0.06 / 6.56 | 2.13 / 11.36 | 2.00 / 10.37 | 38.59 / 50.31 |
| | A-MEM | 4.82 / 19.31 | 1.84 / 20.47 | 5.99 / 18.49 | 14.82 / 29.78 | 46.76 / 60.23 |
| Llama3.2-3b | LOCOMO | 0.98 / 7.22 | 0.03 / 4.45 | 2.36 / 11.39 | 2.85 / 8.45 | 25.47 / 30.26 |
| | READAGENT | 2.47 / 1.78 | 3.01 / 3.01 | 5.07 / 5.22 | 3.25 / 2.51 | 15.78 / 14.01 |
| | MEMORYBANK | 1.83 / 6.96 | 0.25 / 3.41 | 0.43 / 4.43 | 2.73 / 7.83 | 14.64 / 18.59 |
| | MEMGPT | 0.72 / 5.39 | 0.11 / 2.85 | 0.61 / 5.74 | 1.45 / 4.42 | 16.62 / 21.47 |
| | A-MEM | 6.02 / 17.62 | 7.93 / 27.97 | 5.38 / 13.00 | 16.89 / 28.55 | 35.48 / 42.25 |
Comparison under METEOR (ME) and SBERT similarity. Each category cell reports ME / SBERT.

| Model | Method | Multi Hop | Temporal | Open Domain | Single Hop | Adversarial |
|---|---|---|---|---|---|---|
| GPT-4o-mini | LOCOMO | 15.81 / 47.97 | 7.61 / 52.30 | 8.16 / 35.00 | 40.42 / 57.78 | 63.28 / 71.93 |
| | READAGENT | 5.46 / 28.67 | 4.76 / 45.07 | 3.69 / 26.72 | 8.01 / 26.78 | 8.38 / 15.20 |
| | MEMORYBANK | 3.42 / 21.71 | 4.07 / 37.58 | 4.21 / 23.71 | 5.81 / 20.76 | 6.24 / 13.00 |
| | MEMGPT | 15.79 / 49.33 | 13.25 / 61.53 | 4.59 / 32.77 | 41.40 / 58.19 | 39.16 / 47.24 |
| | A-MEM | 16.36 / 49.46 | 23.43 / 70.49 | 8.36 / 38.48 | 42.32 / 59.38 | 45.64 / 53.26 |
| GPT-4o | LOCOMO | 16.34 / 53.82 | 7.21 / 32.15 | 8.98 / 43.72 | 53.39 / 73.40 | 47.72 / 56.09 |
| | READAGENT | 7.86 / 37.41 | 3.76 / 26.22 | 4.42 / 30.75 | 9.36 / 31.37 | 5.47 / 12.34 |
| | MEMORYBANK | 3.22 / 26.23 | 2.29 / 23.49 | 4.18 / 24.89 | 6.64 / 23.90 | 2.93 / 10.01 |
| | MEMGPT | 16.64 / 55.12 | 12.68 / 35.93 | 7.78 / 37.91 | 52.14 / 72.83 | 31.15 / 39.08 |
| | A-MEM | 17.53 / 55.96 | 13.10 / 45.40 | 10.62 / 38.87 | 41.93 / 62.47 | 32.34 / 40.11 |
| Qwen2.5-1.5b | LOCOMO | 4.99 / 32.23 | 2.86 / 34.03 | 5.89 / 35.61 | 8.57 / 29.47 | 40.53 / 50.49 |
| | READAGENT | 3.67 / 28.20 | 1.88 / 27.27 | 8.97 / 35.13 | 5.52 / 26.33 | 24.04 / 34.12 |
| | MEMORYBANK | 5.57 / 35.40 | 2.80 / 32.47 | 4.27 / 33.85 | 10.59 / 32.16 | 32.93 / 42.83 |
| | MEMGPT | 5.40 / 35.64 | 2.35 / 39.04 | 7.68 / 40.36 | 7.07 / 30.16 | 27.24 / 40.63 |
| | A-MEM | 9.49 / 43.49 | 11.92 / 61.65 | 9.11 / 42.58 | 19.69 / 41.93 | 40.64 / 52.44 |
| Qwen2.5-3b | LOCOMO | 2.00 / 24.37 | 1.92 / 25.24 | 3.45 / 25.38 | 6.00 / 21.28 | 16.67 / 23.14 |
| | READAGENT | 1.78 / 21.10 | 1.69 / 20.78 | 4.43 / 25.15 | 3.37 / 18.20 | 10.46 / 17.39 |
| | MEMORYBANK | 2.37 / 17.81 | 2.22 / 21.93 | 3.86 / 20.65 | 3.99 / 16.26 | 15.49 / 20.77 |
| | MEMGPT | 3.74 / 24.31 | 2.25 / 27.67 | 6.44 / 29.59 | 6.24 / 22.40 | 13.19 / 20.83 |
| | A-MEM | 6.25 / 33.72 | 14.04 / 62.54 | 6.56 / 30.60 | 15.98 / 33.98 | 27.36 / 33.72 |
| Llama3.2-1b | LOCOMO | 5.77 / 38.02 | 3.38 / 45.44 | 6.20 / 42.69 | 9.33 / 34.19 | 46.79 / 60.74 |
| | READAGENT | 2.97 / 29.26 | 1.31 / 26.45 | 7.13 / 39.19 | 5.36 / 26.44 | 42.39 / 54.35 |
| | MEMORYBANK | 6.77 / 39.33 | 4.43 / 45.63 | 7.76 / 42.81 | 13.01 / 37.32 | 50.43 / 60.81 |
| | MEMGPT | 5.10 / 32.99 | 2.54 / 41.81 | 3.26 / 35.99 | 6.62 / 30.68 | 45.00 / 61.33 |
| | A-MEM | 9.01 / 45.16 | 7.50 / 54.79 | 8.30 / 43.42 | 22.46 / 47.07 | 53.72 / 68.00 |
| Llama3.2-3b | LOCOMO | 3.69 / 27.94 | 2.96 / 20.40 | 6.46 / 32.17 | 6.58 / 22.92 | 29.02 / 35.74 |
| | READAGENT | 1.21 / 17.40 | 2.33 / 12.02 | 3.39 / 19.63 | 2.46 / 14.63 | 14.37 / 21.25 |
| | MEMORYBANK | 3.84 / 25.06 | 2.73 / 13.65 | 3.05 / 21.08 | 6.35 / 22.02 | 17.14 / 24.39 |
| | MEMGPT | 2.78 / 22.06 | 2.21 / 14.97 | 3.63 / 23.18 | 3.47 / 26.87 | 17.81 / 20.50 |
| | A-MEM | 9.74 / 39.32 | 13.19 / 59.70 | 8.09 / 32.27 | 24.30 / 42.86 | 39.74 / 46.76 |
Comparison on additional foundation models. Each category cell reports F1 / BLEU-1.

| Model | Method | Multi Hop | Temporal | Open Domain | Single Hop | Adversarial |
|---|---|---|---|---|---|---|
| DeepSeek-R1-32B | LOCOMO | 8.58 / 6.48 | 4.79 / 4.35 | 12.96 / 12.52 | 10.72 / 8.20 | 21.40 / 20.23 |
| | MEMGPT | 8.28 / 6.25 | 5.45 / 4.97 | 10.97 / 9.09 | 11.34 / 9.03 | 30.77 / 29.23 |
| | A-MEM | 15.02 / 10.64 | 14.64 / 11.01 | 14.81 / 12.82 | 15.37 / 12.30 | 27.92 / 27.19 |
| Claude 3.0 Haiku | LOCOMO | 4.56 / 3.33 | 0.82 / 0.59 | 2.86 / 3.22 | 3.56 / 3.24 | 3.46 / 3.42 |
| | MEMGPT | 7.65 / 6.36 | 1.65 / 1.26 | 7.41 / 6.64 | 8.60 / 7.29 | 7.66 / 7.37 |
| | A-MEM | 19.28 / 14.69 | 16.65 / 12.23 | 11.85 / 9.61 | 34.72 / 30.05 | 35.99 / 34.87 |
| Claude 3.5 Haiku | LOCOMO | 11.34 / 8.21 | 3.29 / 2.69 | 3.79 / 3.58 | 14.01 / 12.57 | 7.37 / 7.12 |
| | MEMGPT | 8.27 / 6.55 | 3.99 / 2.76 | 4.71 / 4.48 | 16.52 / 14.89 | 5.64 / 5.45 |
| | A-MEM | 29.70 / 23.19 | 31.54 / 27.53 | 11.42 / 9.47 | 42.60 / 37.41 | 13.65 / 12.71 |
Hyperparameter settings per model and question category:

| Model | Multi Hop | Temporal | Open Domain | Single Hop | Adversarial |
|---|---|---|---|---|---|
| GPT-4o-mini | 40 | 40 | 50 | 50 | 40 |
| GPT-4o | 40 | 40 | 50 | 50 | 40 |
| Qwen2.5-1.5b | 10 | 10 | 10 | 10 | 10 |
| Qwen2.5-3b | 10 | 10 | 50 | 10 | 10 |
| Llama3.2-1b | 10 | 10 | 10 | 10 | 10 |
| Llama3.2-3b | 10 | 20 | 10 | 10 | 10 |


References

[locomo] Maharana, Adyasha, Lee, Dong-Ho, Tulyakov, Sergey, Bansal, Mohit, Barbieri, Francesco, Fang, Yuwei. (2024). Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753.

[memgpt] Packer, Charles, Wooders, Sarah, Lin, Kevin, Fang, Vivian, Patil, Shishir G, Stoica, Ion, Gonzalez, Joseph E. (2023). Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560.

[readagent] Lee, Kuang-Huei, Chen, Xinyun, Furuta, Hiroki, Canny, John, Fischer, Ian. (2024). A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727.

[memorybank] Zhong, Wanjun, Guo, Lianghong, Gao, Qiqi, Ye, He, Wang, Yanlin. (2024). Memorybank: Enhancing large language models with long-term memory. Proceedings of the AAAI Conference on Artificial Intelligence.

[aios] Mei, Kai, Li, Zelong, Xu, Shuyuan, Ye, Ruosong, Ge, Yingqiang, Zhang, Yongfeng. (2024). AIOS: LLM agent operating system. arXiv e-prints, pp. arXiv--2403.

[openhands] Wang, Xingyao, Li, Boxuan, Song, Yufan, Xu, Frank F, Tang, Xiangru, Zhuge, Mingchen, Pan, Jiayi, Song, Yueqi, Li, Bowen, Singh, Jaskirat, others. (2024). Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741.

[mind2web] Deng, Xiang, Gu, Yu, Zheng, Boyuan, Chen, Shijie, Stevens, Sam, Wang, Boshi, Sun, Huan, Su, Yu. (2023). Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems.

[weng2023agent] Weng, Lilian. (2023). LLM-powered Autonomous Agents. lilianweng.github.io.

[smolagents] Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, Erik Kaunismäki. (2025). smolagents: a smol library to build great agentic systems.

[agentlite] Liu, Zhiwei, Yao, Weiran, Zhang, Jianguo, Yang, Liangwei, Liu, Zuxin, Tan, Juntao, Choubey, Prafulla K, Lan, Tian, Wu, Jason, Wang, Huan, others. (2024). AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System. arXiv preprint arXiv:2402.15538.

[mem0] Dev, Khant, Taranjeet, Singh. (2024). mem0: The Memory layer for AI Agents.

[graphrag] Edge, Darren, Trinh, Ha, Cheng, Newman, Bradley, Joshua, Chao, Alex, Mody, Apurva, Truitt, Steven, Larson, Jonathan. (2024). From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130.

[zettel1] Kadavy, David. (2021). Digital Zettelkasten: Principles, Methods, & Examples.

[zettel2] Ahrens, Sönke. (2017). How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking.

[rag1] Lewis, Patrick, Perez, Ethan, Piktus, Aleksandra, Petroni, Fabio, Karpukhin, Vladimir, Goyal, Naman, Küttler, Heinrich, others. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems.

[aiosrag] Shi, Zeru, Mei, Kai, Jin, Mingyu, Su, Yongye, Zuo, Chaoji, Hua, Wenyue, Xu, Wujiang, Ren, Yujie, Liu, Zirui, Du, Mengnan, others. (2024). From Commands to Prompts: LLM-based Semantic File System for AIOS. arXiv preprint arXiv:2410.11843.

[sentence-bert] Reimers, Nils, Gurevych, Iryna. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.

[borgeaud2022improving] Borgeaud, Sebastian, Mensch, Arthur, Hoffmann, Jordan, Cai, Trevor, Rutherford, Eliza, Millican, Katie, Van Den Driessche, George Bm, Lespiau, Jean-Baptiste, Damoc, Bogdan, Clark, Aidan, others. (2022). Improving language models by retrieving from trillions of tokens. International conference on machine learning.

[gao2023retrieval] Gao, Yunfan, Xiong, Yun, Gao, Xinyu, Jia, Kangxiang, Pan, Jinliu, Bi, Yuxi, Dai, Yi, Sun, Jiawei, Wang, Haofen. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

[yu2023chain] Yu, Wenhao, Zhang, Hongming, Pan, Xiaoman, Ma, Kaixin, Wang, Hongwei, Yu, Dong. (2023). Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210.

[wang2023learning] Wang, Zhiruo, Araki, Jun, Jiang, Zhengbao, Parvez, Md Rizwan, Neubig, Graham. (2023). Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377.

[lin2023ra] Lin, Xi Victoria, Chen, Xilun, Chen, Mingda, Shi, Weijia, Lomeli, Maria, James, Rich, Rodriguez, Pedro, Kahn, Jacob, Szilvasy, Gergely, Lewis, Mike, others. (2023). Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352.

[ilin2023advanced] Ilin, I.. (2023). Advanced RAG Techniques: An Illustrated Overview.

[asai2023self] Asai, Akari, Wu, Zeqiu, Wang, Yizhong, Sil, Avirup, Hajishirzi, Hannaneh. (2023). Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511.

[jiang2023active] Jiang, Zhengbao, Xu, Frank F, Gao, Luyu, Sun, Zhiqing, Liu, Qian, Dwivedi-Yu, Jane, Yang, Yiming, Callan, Jamie, Neubig, Graham. (2023). Active retrieval augmented generation. arXiv preprint arXiv:2305.06983.

[trivedi2022interleaving] Trivedi, Harsh, Balasubramanian, Niranjan, Khot, Tushar, Sabharwal, Ashish. (2022). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509.

[shao2023enhancing] Shao, Zhihong, Gong, Yeyun, Shen, Yelong, Huang, Minlie, Duan, Nan, Chen, Weizhu. (2023). Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294.

[yu2023augmentation] Yu, Zichun, Xiong, Chenyan, Yu, Shi, Liu, Zhiyuan. (2023). Augmentation-adapted retriever improves generalization of language models as generic plug-in. arXiv preprint arXiv:2305.17331.

[modarressi2023ret] Modarressi, Ali, Imani, Ayyoob, Fayyaz, Mohsen, Schütze, Hinrich. (2023). Ret-llm: Towards a general read-write memory for large language models. arXiv preprint arXiv:2305.14322.

[wang2023enhancing] Wang, Bing, Liang, Xinnian, Yang, Jian, Huang, Hui, Wu, Shuangzhi, Wu, Peihao, Lu, Lu, Ma, Zejun, Li, Zhoujun. (2023). Enhancing large language model with self-controlled memory framework. arXiv preprint arXiv:2304.13343.

[xu2021beyond] Xu, J. (2021). Beyond goldfish memory: Long-term open-domain conversation. arXiv preprint arXiv:2107.07567.

[jang2023conversation] Jang, Jihyoung, Boo, Minseong, Kim, Hyounghun. (2023). Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations. arXiv preprint arXiv:2310.13420.

[papineni2002bleu] Papineni, Kishore, Roukos, Salim, Ward, Todd, Zhu, Wei-Jing. (2002). Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting of the Association for Computational Linguistics.

[rouge] Lin, Chin-Yew. (2004). Rouge: A package for automatic evaluation of summaries. Text summarization branches out.

[meteor] Banerjee, Satanjeev, Lavie, Alon. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization.

[jin-etal-2024-impact] Jin, Mingyu, Yu, Qinkai, Shu, Dong, Zhao, Haiyan, Hua, Wenyue, Meng, Yanda, Zhang, Yongfeng, Du, Mengnan. (2024). The Impact of Reasoning Step Length on Large Language Models. Findings of the Association for Computational Linguistics ACL 2024.

[xu2024slmrec] Xu, Wujiang, Wu, Qitian, Liang, Zujie, Han, Jiaojiao, Ning, Xuying, Shi, Yunxiao, Lin, Wenfang, Zhang, Yongfeng. (2024). Slmrec: empowering small language models for sequential recommendation. arXiv preprint arXiv:2405.17890.

[xu2024rethinking] Xu, Wujiang, Wu, Qitian, Wang, Runzhong, Ha, Mingming, Ma, Qiongxu, Chen, Linxun, Han, Bing, Yan, Junchi. (2024). Rethinking cross-domain sequential recommendation under open-world assumptions. Proceedings of the ACM on Web Conference 2024.

[xu2023neural] Xu, Wujiang, Li, Shaoshuai, Ha, Mingming, Guo, Xiaobo, Ma, Qiongxu, Liu, Xiaolei, Chen, Linxun, Zhu, Zhenfeng. (2023). Neural node matching for multi-target cross domain recommendation. 2023 IEEE 39th International Conference on Data Engineering (ICDE).

[wang2023brave] Wang, Kun, Liang, Yuxuan, Li, Xinglin, Li, Guohao, Ghanem, Bernard, Zimmermann, Roger, Zhou, Zhengyang, Yi, Huahui, Zhang, Yudong, Wang, Yang. (2023). Brave the wind and the waves: Discovering robust and generalizable graph lottery tickets. IEEE Transactions on Pattern Analysis and Machine Intelligence.

[guo2025deepseek] Guo, Daya, Yang, Dejian, Zhang, Haowei, Song, Junxiao, Zhang, Ruoyu, Xu, Runxin, Zhu, Qihao, Ma, Shirong, Wang, Peiyi, Bi, Xiao, others. (2025). Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[anthropic2024claude3] Anthropic. (2024). The Claude 3 Model Family: Opus, Sonnet, Haiku.

[anthropic2025claude35] Anthropic. (2025). Claude 3.5 Sonnet Model Card Addendum.

[kim2024dialsim] Kim, Jiho, Chay, Woosog, Hwang, Hyeonji, Kyung, Daeun, Chung, Hyunseung, Cho, Eunbyeol, Jo, Yohan, Choi, Edward. (2024). DialSim: A Real-Time Simulator for Evaluating Long-Term Multi-Party Dialogue Understanding of Conversational Agents. arXiv preprint arXiv:2406.13144.

[bib1] Sönke Ahrens. 2017. How to Take Smart Notes: One Simple Technique to Boost Writing, Learning and Thinking. Amazon. Second Edition.

[bib2] Asai et al. (2023) Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511.

[bib3] Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72.

[bib4] Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR.

[bib5] Deng et al. (2023) Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114.

[bib6] Khant Dev and Singh Taranjeet. 2024. mem0: The memory layer for ai agents. https://github.com/mem0ai/mem0.

[bib7] Edge et al. (2024) Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. 2024. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130.

[bib8] Gao et al. (2023) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

[bib9] I. Ilin. 2023. Advanced rag techniques: An illustrated overview.

[bib10] Jang et al. (2023) Jihyoung Jang, Minseong Boo, and Hyounghun Kim. 2023. Conversation chronicles: Towards diverse temporal and relational dynamics in multi-session conversations. arXiv preprint arXiv:2310.13420.

[bib11] Jiang et al. (2023) Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983.

[bib12] Jin et al. (2024) Mingyu Jin, Qinkai Yu, Dong Shu, Haiyan Zhao, Wenyue Hua, Yanda Meng, Yongfeng Zhang, and Mengnan Du. 2024. The impact of reasoning step length on large language models. In Findings of the Association for Computational Linguistics ACL 2024, pages 1830–1842, Bangkok, Thailand and virtual meeting.

[bib13] David Kadavy. 2021. Digital Zettelkasten: Principles, Methods, & Examples. Google Books.

[bib14] Lee et al. (2024) Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. 2024. A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727.

[bib15] Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.

[bib16] Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.

[bib17] Lin et al. (2023) Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, et al. 2023. Ra-dit: Retrieval-augmented dual instruction tuning. arXiv preprint arXiv:2310.01352.

[bib18] Liu et al. (2024) Zhiwei Liu, Weiran Yao, Jianguo Zhang, Liangwei Yang, Zuxin Liu, Juntao Tan, Prafulla K Choubey, Tian Lan, Jason Wu, Huan Wang, et al. 2024. Agentlite: A lightweight library for building and advancing task-oriented llm agent system. arXiv preprint arXiv:2402.15538.

[bib19] Maharana et al. (2024) Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating very long-term conversational memory of llm agents. arXiv preprint arXiv:2402.17753.

[bib20] Mei et al. (2024) Kai Mei, Zelong Li, Shuyuan Xu, Ruosong Ye, Yingqiang Ge, and Yongfeng Zhang. 2024. Aios: Llm agent operating system. arXiv e-prints, pp. arXiv–2403.

[bib21] Modarressi et al. (2023) Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, and Hinrich Schütze. 2023. Ret-llm: Towards a general read-write memory for large language models. arXiv preprint arXiv:2305.14322.

[bib22] Packer et al. (2023) Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G Patil, Ion Stoica, and Joseph E Gonzalez. 2023. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560.

[bib23] Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.

[bib24] Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

[bib25] Aymeric Roucher, Albert Villanova del Moral, Thomas Wolf, Leandro von Werra, and Erik Kaunismäki. 2025. smolagents: A smol library to build great agentic systems. https://github.com/huggingface/smolagents.

[bib26] Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294.

[bib27] Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, et al. 2024. From commands to prompts: Llm-based semantic file system for aios. arXiv preprint arXiv:2410.11843.

[bib28] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509.

[bib29] Bing Wang, Xinnian Liang, Jian Yang, Hui Huang, Shuangzhi Wu, Peihao Wu, Lu Lu, Zejun Ma, and Zhoujun Li. 2023a. Enhancing large language model with self-controlled memory framework. arXiv preprint arXiv:2304.13343.

[bib30] Kun Wang, Yuxuan Liang, Xinglin Li, Guohao Li, Bernard Ghanem, Roger Zimmermann, Zhengyang Zhou, Huahui Yi, Yudong Zhang, and Yang Wang. 2023b. Brave the wind and the waves: Discovering robust and generalizable graph lottery tickets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(5):3388–3405.

[bib31] Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. 2024. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741.

[bib32] Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. 2023c. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377.

[bib33] Lilian Weng. 2023. Llm-powered autonomous agents. lilianweng.github.io.

[bib34] Jing Xu, Arthur Szlam, and Jason Weston. 2021. Beyond goldfish memory: Long-term open-domain conversation. arXiv preprint arXiv:2107.07567.

[bib35] Wujiang Xu, Shaoshuai Li, Mingming Ha, Xiaobo Guo, Qiongxu Ma, Xiaolei Liu, Linxun Chen, and Zhenfeng Zhu. 2023. Neural node matching for multi-target cross domain recommendation. In 2023 IEEE 39th International Conference on Data Engineering (ICDE), pages 2154–2166. IEEE.

[bib36] Wujiang Xu, Qitian Wu, Zujie Liang, Jiaojiao Han, Xuying Ning, Yunxiao Shi, Wenfang Lin, and Yongfeng Zhang. 2024a. Slmrec: Empowering small language models for sequential recommendation. arXiv preprint arXiv:2405.17890.

[bib37] Wujiang Xu, Qitian Wu, Runzhong Wang, Mingming Ha, Qiongxu Ma, Linxun Chen, Bing Han, and Junchi Yan. 2024b. Rethinking cross-domain sequential recommendation under open-world assumptions. In Proceedings of the ACM on Web Conference 2024, pages 3173–3184.

[bib38] Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. 2023a. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210.

[bib39] Zichun Yu, Chenyan Xiong, Shi Yu, and Zhiyuan Liu. 2023b. Augmentation-adapted retriever improves generalization of language models as generic plug-in. arXiv preprint arXiv:2305.17331.

[bib40] Guibin Zhang, Haonan Dong, Yuchen Zhang, Zhixun Li, Dingshuo Chen, Kai Wang, Tianlong Chen, Yuxuan Liang, Dawei Cheng, and Kun Wang. 2024a. Gder: Safeguarding efficiency, balancing, and robustness via prototypical graph pruning. arXiv preprint arXiv:2410.13761.

[bib41] Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Xu Yu, and Tianlong Chen. 2024b. Cut the crap: An economical communication pipeline for llm-based multi-agent systems. arXiv preprint arXiv:2410.02506.

[bib42] Guibin Zhang, Yanwei Yue, Xiangguo Sun, Guancheng Wan, Miao Yu, Junfeng Fang, Kun Wang, and Dawei Cheng. 2024c. G-designer: Architecting multi-agent communication topologies via graph neural networks. arXiv preprint arXiv:2410.11782.

[bib43] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731.